
The Role of Performance
Ming-Hwa Wang, Ph.D.
COEN 210 Computer Architecture
Department of Computer Engineering
Santa Clara University

Introduction
performance is often key to the effectiveness of an entire system of hardware and software
accurately measuring and comparing different machines is critical to purchasers, and therefore to designers
for different types of applications, different performance metrics may be appropriate, and different aspects of a computer system may be the most significant in determining overall performance
an individual computer user is interested in reducing response/execution time (the time between the start and completion of a task); computer center managers are often interested in increasing throughput (the total amount of work done in a given time)
for a machine X, performance_X = 1 / execution time_X
balance performance and cost:
  high-performance design (e.g., supercomputer) – performance is the primary goal and cost is secondary
  low-cost design (e.g., PC clones, embedded computers) – cost takes precedence over performance
  cost/performance design (e.g., workstation) – balances cost against performance

Measuring Performance
time is the measure of computer performance
execution/response/wall-clock/elapsed time is the total time to complete a task, measured in seconds
CPU time is the time the CPU spends computing for a task; it does not include time spent waiting for I/O or running other programs (system performance refers to elapsed time on an unloaded system)
  user CPU time or CPU performance: the CPU time spent in the program
  system CPU time: the CPU time spent in the OS
  the Unix time command reports elapsed, user CPU, and system CPU time
clock rate is the inverse of the clock cycle (also called tick, clock tick, clock period, clock, or cycle), usually published as part of the documentation for a machine

Relating the Metrics
CPU execution time for a program = CPU clock cycles for a program * clock cycle time = CPU clock cycles for a program / clock rate, measured by running the program
CPU clock cycles = instructions for a program * average clock cycles per instruction (CPI)
the instruction count (which depends on the architecture but not on the exact implementation) can be measured by using hardware counters, by using software tools that profile the execution, or by using a simulator of the architecture
CPI = CPU clock cycles / instruction count; it provides one way of comparing two different implementations of the same instruction set architecture, and it varies by application as well as among implementations with the same instruction set
CPU time = instruction count * CPI * clock cycle time = instruction count * CPI / clock rate
CPU execution time = (instructions / program) * (clock cycles / instruction) * (seconds / clock cycle)
CPU clock cycles = sum over i = 1..n of (CPI_i * C_i), where C_i is the count of the number of instructions of class i executed, CPI_i is the average number of cycles per instruction for that instruction class, and n is the number of instruction classes

Choosing Programs to Evaluate Performance
performance measurements should be reproducible: list everything another experimenter would need to duplicate the results
workload – the set of programs run
benchmarks – programs specifically chosen to measure performance; the benchmarks form a workload that the user hopes will predict the performance of the actual workload
the best type of programs to use for benchmarks are real applications that the user employs regularly, or simply applications that are typical
using real applications as benchmarks makes it much more difficult to find trivial ways to speed up the execution of the program; and when techniques are found to improve performance, such techniques are much more likely to help other programs in addition to the benchmark
the use of benchmarks whose performance depends on very small code segments encourages optimizations in either the architecture or the compiler that target those segments: a designer might try to make some sequence of instructions run especially fast because the sequence occurs in the benchmarks, or provide a specific compiler option for special-purpose optimizations
sometimes, in the quest to produce highly optimized code for benchmarks, engineers introduce erroneous optimizations
small benchmarks are attractive when beginning a design and are more easily standardized
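The CPU performance equation above (CPU time = instruction count * CPI / clock rate, with total cycles summed over instruction classes) can be sketched as a small calculation. The instruction classes, counts, and CPI values below are made-up illustrative numbers, not measurements from any real machine:

```python
# CPU performance equation:
#   CPU clock cycles = sum_i (CPI_i * C_i)
#   CPU time = CPU clock cycles / clock rate
# The classes, counts, and CPIs below are hypothetical example values.

def cpu_time(classes, clock_rate_hz):
    """classes: list of (instruction_count, cycles_per_instruction) pairs."""
    total_cycles = sum(count * cpi for count, cpi in classes)
    total_instructions = sum(count for count, _ in classes)
    avg_cpi = total_cycles / total_instructions
    return total_cycles / clock_rate_hz, avg_cpi

# e.g. ALU ops, loads/stores, branches on a hypothetical 200 MHz machine
classes = [(5_000_000, 1.0), (2_000_000, 2.0), (1_000_000, 3.0)]
seconds, avg_cpi = cpu_time(classes, clock_rate_hz=200e6)
print(f"average CPI = {avg_cpi:.2f}, CPU time = {seconds * 1e3:.2f} ms")
# -> average CPI = 1.50, CPU time = 60.00 ms
```

Note that the average CPI (1.5 here) is the cycle-weighted mix of the per-class CPIs, which is why CPI varies with the application's instruction mix even on a single implementation.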

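The distinction drawn above between elapsed (wall-clock) time and CPU time can be observed from inside a program. This sketch uses Python's standard timers — time.perf_counter for elapsed time and time.process_time for user + system CPU time — which report the same quantities the Unix time command does:

```python
import time

def busy(n):
    # CPU-bound work: counts toward both elapsed time and CPU time
    s = 0
    for i in range(n):
        s += i * i
    return s

start_wall = time.perf_counter()
start_cpu = time.process_time()

busy(1_000_000)
time.sleep(0.2)  # waiting: counts toward elapsed time, but not CPU time

elapsed = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"elapsed = {elapsed:.3f} s, CPU = {cpu:.3f} s")
```

The elapsed figure includes the 0.2 s of sleeping; the CPU figure does not, which is exactly the gap that I/O waits and other programs open up between the two metrics.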
CPU benchmarks – the SPEC (System Performance Evaluation Cooperative) suite
  the first release in 1989 consists of 4 integer and 6 floating-point benchmarks; matrix300 was intended to exercise the computer's memory system, but optimization by blocking transformations substantially lowered the number of memory accesses required and transformed the inner loops from having a high miss rate to having an almost negligible miss rate – in effect reorganizing the program to minimize memory usage – and so it was eliminated from later releases
  the 1992 release (SPEC92) separates integer and floating-point programs (SPECint and SPECfp), and the SPECbase rules disallow program-specific optimization flags
  the 1995 release (the SPEC95 suite) consists of 8 integer (SPECint95) and 10 floating-point (SPECfp95) programs
  the SPEC ratio: normalize the execution time by dividing the execution time on a Sun SPARCstation 10/40 by the execution time on the measured machine
  the SDM (System Development Multitasking) benchmark
  the SFS (System-level File Server) benchmark
  the 1996 release adds SPEChpc96 for high-end scientific workloads

Comparing and Summarizing Performance
the simplest approach to summarizing relative performance is to use total execution time
the average of the execution times that is directly proportional to total execution time is the arithmetic mean (AM) = (sum over i = 1..n of Time_i) / n
weighted arithmetic mean: assign a weighting factor w_i to each program to indicate the frequency of the program in that workload

Performance of Recent Processors
for a given instruction set architecture, increases in CPU performance can come from 3 sources:
  increases in clock rate
  improvements in organization that lower the CPI
  compiler enhancements that lower the instruction count or generate instructions with a lower average CPI
the memory system has a significant effect on performance – when the clock rate is increased by a certain factor, the processor performance increases by a lower factor because of the performance loss in the memory system; since the speed of main memory is not increased, increasing the processor speed exacerbates the bottleneck at the memory system, especially on the floating-point benchmarks (Amdahl's law)
Amdahl's law – speedup is the measure of how a machine performs after some enhancement relative to how it performed previously
  the earlier version of Amdahl's law: execution time after improvement = execution time affected by improvement / amount of improvement + execution time unaffected
  speedup = performance after improvement / performance before improvement = execution time before improvement / execution time after improvement

Fallacies and Pitfalls
pitfall: expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement
  execution time after improvement = execution time affected by improvement / amount of improvement + execution time unaffected
  a corollary of Amdahl's law: make the common case fast – making the common case fast will tend to enhance performance better than optimizing the rare case, and the common case is often simpler and easier to enhance than the rare case
fallacy: hardware-independent metrics predict performance (e.g., using code size as a measure of speed)
  the size of the compiled program is important when memory space is at a premium; today, the fastest machines tend to have instruction sets that lead to larger programs but can be executed faster with less hardware
pitfall: using MIPS as a performance metric (native MIPS = instruction count / (execution time * 10^6))
  MIPS specifies the instruction execution rate but does not take into account the capabilities of the instructions
  MIPS varies between programs on the same computer
  MIPS can vary inversely with performance
fallacy: synthetic benchmarks predict performance
  a synthetic benchmark is a single benchmark program in which the execution frequency of statements matches the statement frequency in a large set of benchmarks
  Whetstone – for scientific and engineering environments, quoted in Whetstones per second (the number of executions of one iteration of the Whetstone benchmark)
  Dhrystone – for systems programming environments
  no user would ever run a synthetic benchmark as an application
  synthetic benchmarks usually do not reflect program behavior
  special-purpose optimizations can inflate the performance
pitfall: using the arithmetic mean of normalized execution times to predict performance
  geometric mean = (product over i = 1..n of execution time ratio_i)^(1/n)
  geometric mean(X_i) / geometric mean(Y_i) = geometric mean(X_i / Y_i)
fallacy: the geometric mean of execution time ratios is proportional to total execution time
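Amdahl's law as stated above (execution time after improvement = affected time / amount of improvement + unaffected time, with speedup the ratio of times before and after) can be sketched directly; the 80 s / 20 s split below is a hypothetical example, not a measured workload:

```python
def amdahl_speedup(time_affected, time_unaffected, improvement):
    """Earlier form of Amdahl's law:
    time_after = time_affected / improvement + time_unaffected
    speedup    = time_before / time_after
    """
    time_before = time_affected + time_unaffected
    time_after = time_affected / improvement + time_unaffected
    return time_before / time_after

# Hypothetical: 80 s of a 100 s program benefits from a 4x-faster unit.
print(amdahl_speedup(80.0, 20.0, 4.0))   # -> 100 / (20 + 20) = 2.5

# However large the improvement, speedup is capped by the unaffected part:
# the limit here is time_before / time_unaffected = 100 / 20 = 5.
print(amdahl_speedup(80.0, 20.0, 1e9))
```

This is also why the pitfall above holds: a 4x improvement to one aspect yields only 2.5x overall, not 4x, because the unaffected 20 s does not shrink.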

geometric means do not track total execution time and thus cannot be used to predict relative execution time for a workload

Historical Perspective
instruction mix – measures the relative frequency of instructions in a computer across many programs
average instruction execution time – multiply the time for each instruction by its weight in the mix
peak MIPS – choose an instruction mix that minimizes the CPI, even if the instruction mix is totally impractical
MFLOPS/MOPS (millions of floating-point/integer operations per second), or megaFLOPS/OPS
  MFLOPS = number of floating-point operations in a program / (execution time * 10^6)
  floating-point operations are heavily used in scientific calculations, but a compiler, as an extreme example, has a MFLOPS rating near 0
  MFLOPS is based on operations in the program rather than on instructions, so it has a stronger claim than MIPS to being a fair comparison between different machines; unfortunately, MFLOPS is not dependable because the set of floating-point operations is not consistent across machines and the number of actual floating-point operations performed may vary; the MFLOPS rating changes not only with the mixture of integer and floating-point operations but also with the mixture of fast and slow floating-point operations
  normalized MFLOPS – weight the operations, giving more complex operations larger weights
relative MIPS = (time_reference / time_unrated) * MIPS_reference
  the 1-MIPS reference machine – the VAX-11/780 (actually about 0.5 MIPS)
kernel benchmarks – primarily used for benchmarking high-end machines
  the Livermore Loops consist of a series of 21 small loop fragments
  Linpack consists of a portion of a linear algebra subroutine package
toy benchmarks – easy to compile and run on almost any computer
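The MFLOPS and relative-MIPS definitions above translate directly into code; the operation counts and times below are hypothetical illustrative values:

```python
def mflops(fp_operations, exec_time_s):
    # MFLOPS = floating-point operations / (execution time * 10^6)
    return fp_operations / (exec_time_s * 1e6)

def relative_mips(time_reference_s, time_unrated_s, mips_reference=1.0):
    # relative MIPS = (time_reference / time_unrated) * MIPS_reference
    # The traditional 1-MIPS reference machine is the VAX-11/780.
    return (time_reference_s / time_unrated_s) * mips_reference

# Hypothetical: 50 million FP operations executed in 2 s
print(mflops(50e6, 2.0))          # -> 25.0 MFLOPS

# Hypothetical: the reference takes 10 s, the unrated machine 0.5 s
print(relative_mips(10.0, 0.5))   # -> 20.0 relative MIPS
```

Note how both metrics inherit the weaknesses listed above: mflops() counts whatever the program's notion of a floating-point operation is, and relative_mips() depends entirely on the chosen reference machine and program.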