The Association of System Performance Professionals

The Computer Measurement Group, commonly called CMG, is a not-for-profit, worldwide organization of data processing professionals committed to the measurement and management of computer systems. CMG members are primarily concerned with performance evaluation of existing systems to maximize performance (e.g. response time, throughput, etc.) and with capacity management, where planned enhancements to existing systems or the design of new systems are evaluated to find the necessary resources required to provide adequate performance at a reasonable cost.

This paper was originally published in the Proceedings of the Computer Measurement Group’s 2006 International Conference.

For more information on CMG please visit http://www.cmg.org

Copyright 2006 by The Computer Measurement Group, Inc. All Rights Reserved. Published by The Computer Measurement Group, Inc., a non-profit Illinois membership corporation. Permission to reprint in whole or in any part may be granted for educational and scientific purposes upon written application to the Editor, CMG Headquarters, 151 Fries Mill Road, Suite 104, Turnersville, NJ 08012. Permission is hereby granted to CMG members to reproduce this publication in whole or in part solely for internal distribution within the member’s organization provided the copyright notice above is set forth in full text on the title page of each item reproduced. The ideas and concepts set forth in this publication are solely those of the respective authors, and not of CMG, and CMG does not endorse, guarantee or otherwise certify any such ideas or concepts in any application or usage. Printed in the United States of America.

Utilization is Virtually Useless as a Metric!

Adrian Cockcroft – eBay Research Labs

We have all been conditioned over the years to use the utilization or %busy as the primary metric for capacity planning. Unfortunately, with increasing use of CPU virtualization and sophisticated CPU optimization techniques such as hyper-threading and power management, the measurements we get from the systems are “virtually useless”. This paper will explain many of the ways in which the data we depend upon is distorted, and proposes that we turn to direct measurement of the fundamental alternatives, and express capacity in terms of headroom, in units of throughput within a response time limit.

1. Background
The main new theme at the CMG 2005 conference seemed to be CPU virtualization. Many people using VMware, Xen, Solaris Zones, Hyper-Threading and other virtualization facilities are finding that their measurements of CPU utilization don't make sense any more. The primary performance management vendors were discussing how they are building some support for virtualization concepts into their tools. While the mainframe world has been dealing with virtualization for many years, the Unix/Linux and Windows world has now adopted a bewildering variety of virtualization techniques, with no standard metrics, and often without any indication that the CPU has been virtualized!

2. The Good Old Days
Let's start by examining the way we have traditionally collected and manipulated utilization, and highlight the underlying assumptions we have made in the past. These assumptions are not always stated, and in many cases they no longer hold.

Utilization is properly defined as busy time as a proportion of elapsed time. Utilization can also be obtained by multiplying the average throughput of the service by the average service time. If an operation takes 10ms to complete and the service is performed 50 times a second then the utilization will be 50 * 0.010 = 0.5, usually written as 50%. When applied to a simple queue, with work arriving at random, we also know that the response time and the queue length increase at high utilization.

The most fundamental assumption here is that the average service time is constant, and does not depend upon the load level. This is one of the assumptions that has been broken by virtualization, Hyper-Threading and variable speed power-saving CPUs. We will discuss these technologies later in this paper.

The simple queue has a single processing element. In reality, today's computer systems do not form simple queues and do not have a single processing element. Operating systems expose an abstraction that looks like a simple set of processing elements, and provide metrics to measure that abstraction, such as overall CPU utilization.

Common capacity planning techniques either assume that there is a single processing element, or a fixed number of identical processing elements, with characteristics that only change on relatively infrequent “upgrade” events. In fact none of these assumptions can be relied upon for the most common systems in use today. This is not an obscure feature of a specialized system design; it's baked into the low cost mainstream products that everyone uses!

Response time for a simple queue with a large number of users can be modeled as the service time divided by the unused capacity. For an idle system, utilization is near zero, so unused capacity (one minus the utilization) is near one, and the response time is similar to the service time. For a busy system, utilization is near one, and unused capacity is much less than one, so dividing it into the service time gives a large response time.

For complex systems the utilization metric is not a reliable indicator of capacity. In some cases the reported utilization metric will reach 100% well before the system is out of capacity, so response time stays low until the load reaches a much higher throughput level. In other cases the utilization metric is not linearly related to throughput, so for example a 10% increase in throughput could cause the reported utilization metric to increase from 40% to 80%.
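
In symbols, with average throughput X, average service time S, utilization U and response time R, the definitions above and the simple-queue approximation can be written as:

    U = X \cdot S, \qquad R \approx \frac{S}{1 - U}

For the worked example, U = 50 * 0.010 = 0.5 (50%). Evaluating the second expression, the response time is about 2S = 20ms at 50% utilization and about 10S = 100ms at 90% utilization, which is the growth at high utilization described above; the key hidden assumption is that S itself stays constant as the load changes.
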
3. I/O Utilization
This paper is mostly concerned with CPU utilization, but there are lessons to be learned from I/O utilization.

In days of old, when tools like iostat were first written for Unix systems, a disk was a simple thing. It could seek, read and write, and while it was doing those things it was a busy disk and you had to wait for it to finish. Thus iostat reported disk utilization, and when the disk got busy the response time got bad. The advent of intelligent disk controllers changed all that. For example the SCSI protocol allows multiple commands to be sent to the disk, with completion in any order. Each disk consists of a pair of queues in tandem, one of requests that have not been issued, and one of requests that are currently inside the disk itself.

Storage virtualization has always been an issue. From the beginning, disks have been partitioned, and the utilization metrics of each partition have a complex relationship to the utilization of the disk as a whole. It is also very difficult to see the utilization of an individual file. Partitions can also be combined using a volume manager into stripes, concatenations, mirrors and RAID volumes.

Some versions of Unix still have a simplistic iostat that reports on a single simple logical queue; the more enlightened iostat commands have more useful options. For example on Solaris, you can see the iostat data on a per partition, per disk, per controller and per volume basis, and it does separate the wait queue metrics from the active queue metrics. To get per-volume stats you have to use the bundled Solaris volume manager; add-on products such as Veritas Volume Manager don't work with iostat. Try iostat -xpnCez (think “expenses” to remember this combination) to get lots of detailed information.

Modern disk subsystems are built using storage area networks (SANs) and cached RAID disk controllers. The logical units (LUNs) that appear to the operating system to be single disks are actually complex combinations of RAM and large numbers of disks.

In a high availability SAN configuration each LUN can be accessed via four paths. Dynamic multi-pathing spreads the load over all four paths and transparently hides failures. The same LUN appears as four separate LUNs to iostat, which has no way to know the actual configuration.

Multiple host systems share access to storage over a SAN. In many cases the same physical disk will be shared by several hosts. There is no way for each host to know how much I/O capacity is being used by other hosts, and poor or unpredictable response times can sometimes be traced to this effect.

When iostat thinks that a LUN is 100% utilized it makes a false assumption that a single service element is processing a single queue of requests. In fact multiple service elements are processing multiple queues of requests. A large number of concurrent requests can usually be processed, but there is no way for the iostat command to find out what the internal capability of a LUN is, so it cannot reliably report utilization.

Sophisticated I/O trace based capture and modeling tools such as Ortera Atlas (http://www.ortera.com) can determine what is happening at the filesystem and volume layers, and calculate a new metric called the “capability utilization” of the underlying LUN.

This problem is directly analogous to the situation we now find in the CPU realm.
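
To make the point concrete, here is a small Python sketch with hypothetical numbers: a LUN backed by several service elements can still accept much more work at the point where a single-queue, single-server view of it reports 100% busy. The spindle count, service time and throughput below are assumptions chosen for illustration, not measurements from any real device.

# A sketch with hypothetical numbers: why "100% busy" from iostat can
# overstate how close a multi-spindle LUN is to its real capability.
# The spindle count, service time and throughput are assumptions; iostat
# cannot see inside the LUN at all.

service_time = 0.007   # seconds per I/O at one back-end service element (assumed)
spindles = 8           # service elements hidden inside the LUN (assumed)
throughput = 300.0     # observed I/Os per second (assumed)

# Single-server view: utilization = throughput * service time, capped at 100%.
# This is the assumption iostat makes, and it saturates early.
naive_busy = min(1.0, throughput * service_time)

# Multi-element view: the LUN can sustain roughly spindles / service_time
# I/Os per second, so its capability utilization is much lower.
capability_ceiling = spindles / service_time            # about 1143 IOPS
capability_utilization = throughput / capability_ceiling

print(f"naive %busy:            {naive_busy:.0%}")              # 100%
print(f"capability utilization: {capability_utilization:.0%}")  # about 26%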

4. Intel Hyper-Threading
Hyper-Threading is used by most recent Intel server processors. In summary, when a CPU executes code, there are times when the CPU is waiting for a long operation such as a memory read to complete. Since the pipeline of functional units inside the CPU is stalled for a very short period this is often called a “pipeline bubble”. By adding a small amount of extra complexity to the CPU, it can be made to keep track of two separate execution states at the same time, and two completely isolated code sequences can share the same pipeline of execution units. In this way when one stalls, the other thread takes over and the pipeline bubbles are eliminated.

Intel describes this technology at http://www.intel.com/technology/hyperthread/ and Yaniv Pessach of Microsoft describes it at http://msdn.microsoft.com/msdnmag/issues/05/06/HyperThreading/default.aspx. I tried searching for information on the performance impact and performance monitoring impact and found several studies that describe improvements in terms of throughput that ranged from negative impact (a slowdown) to speedups of 15-20% per CPU. I didn't see much discussion of the effects on observability or of effects on response time. The Intel CPU pipeline is already quite efficient at keeping pipeline bubbles to a minimum, which is why the speedup isn't huge. An improvement of this magnitude does however represent a useful increase in capacity.

To summarize the information that I could find, the early measurements of performance included most of the negative impact cases. Software and hardware improvements mean that the latest operating systems on the most recent Intel Pentium 4 and Xeon chipsets have a larger benefit and minimize any negative effects.

From a software point of view, Intel advises that Hyper-Threading should be disabled for anything older than Windows XP, Windows Server 2003 and recent releases of RedHat and SuSe Linux. Older operating systems are naive about Hyper-Threading and while they run, they don't schedule jobs optimally. Older Solaris releases are also unaware of Hyper-Threading, while Solaris 9 and 10 for x86 include optimized support. I have also been told that the Linux 2.6 kernel includes optimizations for Hyper-Threading that have been back-ported to recent patches for the Linux 2.4 kernel.

From a hardware point of view, the benefit of Hyper-Threading increases as CPU clock rates and pipeline lengths increase. Longer waits for pipeline stalls due to memory references and branch mispredictions create larger “pipeline bubbles” for the second Hyper-Thread to run in. It is possible that there may also be improvements in the implementation of Hyper-Threading at the silicon level as Intel learns from early experiences and improves its designs.

The fundamental problem with monitoring a Hyper-Threaded system is that it invalidates one of the most basic and common assumptions made by capacity planners. For many years the CPU utilization of a normal system has been used as a primary capacity metric, since it is assumed to be linearly related to the capacity and throughput of a normal system that has not saturated. In other words, you expect that the throughput at 60% busy is about twice the throughput at 30% busy for a constant workload.

Hyper-Threaded CPUs are non-linear; in other words they do not "scale" properly. A typical two CPU Hyper-Threaded system behaves a bit like two normal fast CPUs plus two very slow CPUs which are only active when the system is busy. The OS will always report it as a four CPU system.

Whereas the service time (CPU used) for an average transaction remains constant on a normal CPU as the load increases, the service time for a Hyper-Threaded CPU increases as the load increases.

If all you are looking at is the CPU %busy reported by the OS, you will see a Hyper-Threaded system perform reasonably well up to 50% busy, then as a small amount of additional load is added the CPU usage shoots up and the machine saturates.

To be clear, the use of Hyper-Threads does normally increase the peak throughput of the system, and it often provides a performance benefit, but it can make it difficult to manage the capacity of a system.
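
The following Python sketch is an illustrative model of that behaviour, treating a two-socket Hyper-Threaded system as two fast CPUs plus two slow ones, as described above. The 20% extra capacity per hyper-thread is an assumption taken from the 15-20% speedup range mentioned earlier, not a measured value.

# An illustrative model only: a two-socket Hyper-Threaded system treated as
# "two fast CPUs plus two very slow CPUs" that the OS reports as four CPUs.
# The 20% extra capacity per hyper-thread is an assumption, not a measurement.

FAST_CPUS = 2      # physical cores, each completing 1.0 unit of work per second
HT_EXTRA = 0.2     # extra work per second from each core's second hyper-thread

def reported_busy(throughput):
    """Map offered work (units per second) to the %busy an OS would report
    when it averages over four supposedly identical logical CPUs."""
    fast_busy = min(throughput, FAST_CPUS)
    overflow = max(0.0, throughput - FAST_CPUS)
    # Work spilling onto the hyper-threads accrues busy time 1/HT_EXTRA times
    # faster than the work it completes, so %busy shoots up near saturation.
    slow_busy = min(overflow / HT_EXTRA, FAST_CPUS)
    return (fast_busy + slow_busy) / (FAST_CPUS * 2)

for load in (1.2, 2.0, 2.2, 2.4):
    print(f"throughput {load:0.1f} -> reported {reported_busy(load):.0%} busy")

# Output: 1.2 -> 30%, 2.0 -> 50%, 2.2 -> 75%, 2.4 -> 100%.  Doubling the
# reported utilization from 30% to 60% only raises throughput from 1.2 to
# about 2.1, so the linear %busy-to-throughput assumption breaks down.
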
5. Sun CoolThreads “Niagara”
The latest member of the Sun SPARC family is the UltraSPARC-T1, also commonly known by its code-name “Niagara”. In some ways it is similar to Intel Hyper-Threading, but it takes the concept to an extreme level. Sun provides some details of its features at http://www.sun.com/processors/UltraSPARC-T1/details.xml and there are many more links to detailed discussions from Richard McDougall's blog entry at http://blogs.sun.com/roller/page/rmc?entry=welcome_to_the_cmt_era. The design goal is maximum throughput per watt for commercial workloads such as Java, database and web serving. This is a widely used low cost platform from Sun, so it is important to understand how to manage it from a capacity planning perspective.

The T1 contains eight CPU cores with four threads each that process integer calculations only, and a single shared floating point unit, all on the same chip. The CPU pipeline design is deliberately kept simple, and if only one thread is executing a large number of pipeline bubbles occur. The single thread performance is much slower than a typical Intel or AMD CPU, but since there are 32 threads on the T1, its overall throughput is comparable to several conventional CPUs for integer work. For floating point calculations, the single shared floating point unit gives poor performance that does not scale as more threads are added.

The Solaris operating system reports standard system monitoring metrics that show a T1 as a 32 processor system. One single CPU bound thread cannot make it more than about 3% busy. Unlike the Intel Hyper-Threaded systems, the first thread on each core does not come close to fully utilizing that core, so additional threads get quite good scalability. However, a core running four busy threads is unlikely to provide four times the throughput of a single thread, and the actual scalability ratio is going to be very application dependent.

The service time per transaction is going to increase at higher load levels, and the utilization characteristic is non-linear. The presence of any floating point code in your transaction is also going to have an impact on service times in a non-linear manner, as the T1 behaves like a single CPU system rather than a 32 CPU system when it is running floating point calculations.
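
As a rough illustration, the sketch below converts the roughly 3% busy reported for one CPU-bound thread into an estimate of the chip capability it actually consumes. The per-core scaling factor is an assumed value, since, as noted above, the real ratio is very application dependent.

# A rough sketch: what one CPU-bound thread on a T1 might really consume.
# Solaris counts 8 cores x 4 threads = 32 processors, so one busy thread is
# reported as about 1/32 = 3% utilization.  The per-core scaling factor is an
# assumption, so the "capability used" figure is an estimate, not a T1 spec.

CORES = 8
THREADS_PER_CORE = 4
CORE_SCALING = 2.5   # assumed: four busy threads do ~2.5x the work of one

reported_busy = 1 / (CORES * THREADS_PER_CORE)     # about 3.1%

# One busy thread delivers about 1/CORE_SCALING of what its core could do
# when fully loaded, and that core is one eighth of the chip.
capability_used = (1 / CORE_SCALING) / CORES        # about 5.0%

print(f"reported utilization:      {reported_busy:.1%}")
print(f"estimated capability used: {capability_used:.1%}")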

6. CPU Power Management
AMD PowerNow! for the Opteron series of server CPUs dynamically manages the CPU clock speed based on utilization in order to reduce the overall power consumption of the system. The speed takes a few milliseconds to change, and it is not clear exactly what speeds are supported, but one report stated that the normal speed of 2.6GHz would reduce to as low as 1.2GHz under a light load. This report also shows the detailed CPU configuration: http://www.gamepc.com/labs/view_content.asp?id=opteron285&page=3

The problem with this for capacity management is that there is no indication of the average clock rate in the standard system metrics collected by capacity planning tools. PowerNow! is described by AMD at http://www.amd.com/us-en/0,,3715_12353,00.html and drivers for Linux and Windows are available from http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_871_9033,00.html. In the future, operating systems may be able to take the current speed into account, and estimate the capability utilization, but the service time is higher at low clock rates, so we will always see some confusing metrics.

The current PowerNow! implementation works on a per-chip basis, and Opterons have two complete CPU cores per chip that share a common clock rate. In a multiprocessor system made up of several chips, each pair of processing cores could be running at a different speed, and their speed can change several times a second.

Our basic assumption, that mean service time is a constant quantity, is invalidated in a very non-linear manner.
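
A minimal sketch of the kind of correction a tool could apply if it did know the average clock rate: scale the reported utilization by the ratio of average to maximum clock rate to estimate capability utilization. This assumes completed work scales linearly with clock rate, which ignores memory-bound effects, so it is only a first-order estimate.

# A first-order sketch of "capability utilization" when the clock rate varies.
# It assumes work scales linearly with clock rate, so treat the result as an
# estimate rather than a measurement.

MAX_CLOCK_GHZ = 2.6   # full-speed clock from the report cited above

def capability_utilization(reported_busy, avg_clock_ghz):
    """Estimate the fraction of full-clock-rate capacity actually consumed."""
    return reported_busy * (avg_clock_ghz / MAX_CLOCK_GHZ)

# A chip that PowerNow! has dropped to 1.2GHz and that reports 80% busy is
# only consuming about 37% of what it could deliver at 2.6GHz.
print(f"{capability_utilization(0.80, 1.2):.0%}")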

7. Virtualization as an Element of Architecture
The Enterprise Grid Alliance (http://www.gridalliance.org) has published a reference architecture that defines layers of virtualization. A diagram taken from the EGA Reference Architecture is shown below.

For each physical layer in the areas of storage, compute and network, there is a corresponding virtualized layer. Most current metrics and management systems are built to address the physical layers only, and this represents a challenge both for tools vendors and for capacity planning professionals.

8. Virtualized Physical CPU Resources
As shown in the EGA reference architecture diagram, virtual machine monitors (VMMs), Hypervisors and Hardware Partitions provide a raw CPU resource that can host a complete operating system. IBM's VM on mainframes is an early example of a VMM, and Sun's Dynamic System Domains is an early example of hardware partitioning. They have been in use for many years on a small number of high end systems.

The major change in this space is that low cost commodity hardware now has the same features. VMware (http://www.vmware.com) and Xen (http://www.xensource.com) have become popular ways to host many instances of different operating systems on a single machine, and low cost hardware such as Sun's UltraSPARC-T1 and AMD's Opteron (for example see http://www.pantasys.com) have support for flexible hardware partitioning.

For hardware partitioning, the capacity management problem is that the number of CPUs in the machine can change, and most performance management tools do not collect and report this as a time varying metric. It is usually reported as a system constant. In some cases a reboot is needed to effect a change, but the trend is to allow a dynamic change to be made to a running system. Such changes tend to happen relatively infrequently, perhaps twice a day at most.

For VMMs, the additional problem for performance tools is that the CPU allocation is non-integral: a fraction of a CPU can be provided, and that fraction may be allowed to vary instantaneously. If an operating system running in a VMM is configured to provide all its idle time to other co-habiting operating systems, then it is by definition 100% busy all the time. However its ability to do more work is completely unrelated to the naive utilization metric. If all the hosted operating systems in a VMM get busy at the same time then response times will increase without any indication from the traditional system metrics.

In older tools, the number of CPUs reported by an operating system is not only static, it's an integer. For VMMs the average provisioned CPU must be reported in every time interval along with all the other metrics. Luckily, the wide use of VMware has already generated good awareness of these issues, and there is quite wide support for VMware-specific metrics in performance management tools from vendors such as BMC, Teamquest and Hyperic; however these vendors do not explicitly support Xen as far as I could tell.
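
A sketch of the per-interval reporting suggested above: if the average provisioned CPU is captured alongside the guest's %busy for each interval, the two can be combined into physical CPUs of work. The field names and numbers below are illustrative only, not taken from any particular tool.

# A sketch of per-interval reporting for a VMM guest.  Recording the average
# provisioned CPU alongside %busy lets us express capacity used in physical
# CPU terms.  The numbers are illustrative.

intervals = [
    # (guest %busy, average physical CPUs provisioned to the guest)
    (1.00, 0.5),   # guest looks saturated, but it only holds half a CPU
    (0.40, 2.0),
]

for busy, provisioned in intervals:
    physical_cpus_used = busy * provisioned
    print(f"busy {busy:.0%} of {provisioned} provisioned CPUs "
          f"-> {physical_cpus_used:.2f} physical CPUs of work")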

9. Virtualized Operating System Resources

In some cases, the overhead of managing a completely separate operating system instance is not needed. If many copies of the exact same version of the operating system are required then a resource container may be used to virtualize operating system resources. The traditional IBM Mainframe Logical Partition (LPAR) is an example. Solaris 10 provides zones (N1 Resource Containers) which can have their own root password, IP address and a defined share of the overall CPU resources. Solaris extended system accounting (http://perfcap.blogspot.com/2005/02/extended-accounting-in-solaris-8-10.html) collects data on a per zone basis, but few tools currently make use of this information. There is no current way to get CPU utilization on a per zone basis other than by using extended system accounting's interval accounting facility for all the processes in each zone.

An application that is subject to limitations on the share of the CPU that it can use will have a response time characteristic that is dependent upon its share allocation. For this reason, it is important to collect and report the average CPU share allocation in an interval along with all other metrics.

10. Focus on Headroom
So if utilization is no longer useful, what should we replace it with in all those nice graphs we use for capacity planning? I propose that we focus on defining a metric called “headroom”.

For simple systems, headroom reduces to “one minus margin minus utilization”. So if your simple system is truly 65% busy and you never want it to be more than 80% busy, you are targeting a 20% margin and have 15% headroom. For simple systems where the amount of CPU power varies we can use a plot format that is described in a CMG03 paper on “Capacity Planning for the Virtualized Datacenter”. As shown below, the capacity and margin change twice a day, to track the expected daytime utilization peak.

If the CPU clock rate is changing to save power, then the average CPU clock rate can be used to obtain an estimate of current capacity, and headroom can be calculated based upon the capacity at maximum clock rate. However, unlike a simple change in the number of CPUs, there is also an impact on service time, since the service time with reduced clock rate will be higher.

For complex systems, where utilization is not linearly related to capacity, the focus should shift from utilization and headroom based on busy time, to headroom based on throughput and response time.

If we know that a system can sustain a peak throughput of 5000 transactions per second with four Opteron CPUs at full clock rate, then if it is running at 4000 transactions per second it has 20% remaining headroom. We can replace our utilization graphs with graphs that have throughput on the Y-axis, and which show the peak throughput level as a separate line, which varies according to the average capacity of the system.

We may not know the peak throughput for a particular workload, but we can often measure response time for our transactions, either directly at the application level, or by inference from low level CPU statistics (such as microstate accounting in Solaris). Since the load varies over time, we can also make a plot of response time versus throughput, and fit a curve to it that we can use to determine the throughput at which response time will become unacceptable. An example plot, based on microstate measures of CPU usage, is shown below.
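
Below is a minimal Python sketch of that curve-fitting approach, assuming the simple-queue shape R = S/(1 - X/Xmax) from section 2 and using made-up sample measurements; a real implementation would take its data from application-level or microstate measurements and might fit a more sophisticated model.

# A minimal sketch: fit R = S / (1 - X/Xmax) to measured (throughput,
# response time) points, find the throughput at which response time would
# exceed a limit, and report headroom against that level.  The sample data
# and the 100ms limit are made up for illustration.

samples = [            # (throughput X in tx/s, response time R in seconds)
    (1000, 0.012), (2000, 0.015), (3000, 0.020), (4000, 0.030),
]
R_LIMIT = 0.100        # largest acceptable response time (assumed)
current_tps = 4000

# R = S/(1 - X/Xmax) implies 1/R = 1/S - X/(S*Xmax), which is linear in X,
# so an ordinary least-squares line through (X, 1/R) recovers S and Xmax.
n = len(samples)
sx = sum(x for x, _ in samples)
sy = sum(1.0 / r for _, r in samples)
sxx = sum(x * x for x, _ in samples)
sxy = sum(x / r for x, r in samples)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # equals -1/(S*Xmax)
intercept = (sy - slope * sx) / n                   # equals 1/S

service_time = 1.0 / intercept
x_max = -intercept / slope                          # saturation throughput
x_limit = x_max * (1.0 - service_time / R_LIMIT)    # throughput at R_LIMIT
headroom = 1.0 - current_tps / x_limit

print(f"fitted service time {service_time * 1000:.1f}ms, Xmax {x_max:.0f} tx/s")
print(f"throughput at the {R_LIMIT * 1000:.0f}ms limit: {x_limit:.0f} tx/s")
print(f"headroom at {current_tps} tx/s: {headroom:.0%}")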

11. Some Tools That Handle Virtualization
I have used Dave Fisk's Ortera Atlas storage analysis tool (http://www.ortera.com) to model virtualized I/O subsystems accurately. It uses custom kernel instrumentation on Solaris and Linux systems to obtain detailed trace data.

Hyperformix IPS (http://www.hyperformix.com) includes the ability to model Hyper-Threaded CPUs. I tried it out for a relatively simple case and the model made quite accurate predictions of throughput and response time.

BMC® Performance Manager for Virtual Servers (http://www.bmc.com) can manage large scale VMware installations. Teamquest (http://www.teamquest.com) has support for VMware, as does Hyperic (http://www.hyperic.com/products/managed/vmware-management.htm).

12. Summary
Utilization is useless as a metric and should be abandoned. It has been useless in virtualized disk subsystems for some time, and is now useless for CPU measurement as well. There used to be a clear relationship between response time and utilization, but systems are now so complex that those relationships no longer hold. Instead, you need to directly measure raw available capacity and response time and relate them to throughput.

Utilization is properly defined as busy time as a proportion of elapsed time. The replacement for utilization is headroom, which is defined as the unused proportion of the maximum possible throughput.