
The Role of Performance
Ming-Hwa Wang, Ph.D.
COEN 210 Computer Architecture
Department of Computer Engineering
Santa Clara University

Introduction
performance is often key to the effectiveness of an entire system of hardware and software
accurately measuring and comparing different machines is critical to purchasers, and therefore to designers
for different types of applications, different performance metrics may be appropriate, and different aspects of a computer system may be the most significant in determining overall performance
an individual computer user is interested in reducing response/execution time (the time between the start and completion of a task); computer center managers are often interested in increasing throughput (the total amount of work done in a given time)
for a machine X, performance_X = 1 / execution time_X
balance performance and cost:
  high-performance design (e.g., supercomputer) – performance is the primary goal and cost is secondary
  low-cost design (e.g., PC clones, embedded computers) – cost takes precedence over performance
  cost/performance design (e.g., workstation) – balances cost against performance

Measuring Performance
time is the measure of computer performance
execution/response/wall-clock/elapsed time is the total time to complete a task, measured in seconds
CPU time is the time the CPU spends computing for a task; it does not include time spent waiting for I/O or running other programs (system performance refers to elapsed time on an unloaded system)
  user CPU time or CPU performance: the CPU time spent in the program
  system CPU time: the CPU time spent in the OS
  the Unix time command reports elapsed, user CPU, and system CPU time
clock rate is the inverse of the clock cycle (also called tick, clock tick, clock period, clock, or cycle), usually published as part of the documentation for a machine

Relating the Metrics
CPU execution time for a program = CPU clock cycles for a program * clock cycle time = CPU clock cycles for a program / clock rate, measured by running the program
CPU clock cycles = instructions for a program * average clock cycles per instruction (CPI)
the instruction count (which depends on the architecture but not on the exact implementation) can be measured by using hardware counters, by using software tools that profile the execution, or by using a simulator of the architecture
CPI = CPU clock cycles / instruction count; it provides one way of comparing two different implementations of the same instruction set architecture, and it varies by application as well as among implementations with the same instruction set
CPU time = instruction count * CPI * clock cycle time = instruction count * CPI / clock rate
CPU execution time = (instructions / program) * (clock cycles / instruction) * (seconds / clock cycle)
CPU clock cycles = sum over i = 1..n of (CPI_i * C_i), where C_i is the count of the number of instructions of class i executed, CPI_i is the average number of cycles per instruction for that instruction class, and n is the number of instruction classes

Choosing Programs to Evaluate Performance
performance measurements should be reproducible: list everything another experimenter would need to duplicate the results
workload – the set of programs run
benchmarks – programs specifically chosen to measure performance; the benchmarks form a workload that the user hopes will predict the performance of the actual workload
the best type of programs to use for benchmarks are real applications that the user employs regularly, or simply applications that are typical
using real applications as benchmarks makes it much more difficult to find trivial ways to speed up the execution of the program; and when techniques are found to improve performance, such techniques are much more likely to help other programs in addition to the benchmark
the use of benchmarks whose performance depends on very small code segments encourages optimizations in either the architecture or the compiler that target those segments: a designer might try to make some sequence of instructions run especially fast because the sequence occurs in the benchmarks, or provide a specific compiler option for special-purpose optimizations
sometimes, in the quest to produce highly optimized code for benchmarks, engineers introduce erroneous optimizations
small benchmarks are attractive when beginning a design and are more easily standardized
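The CPU performance equation above (CPU time = instruction count * CPI / clock rate, with total cycles summed over instruction classes) can be sketched as a small calculation. The instruction classes, counts, and CPI values below are made-up illustrative numbers, not measurements from any real machine:

```python
# CPU performance equation:
#   CPU clock cycles = sum_i (CPI_i * C_i)
#   CPU time = CPU clock cycles / clock rate
# The classes, counts, and CPIs below are hypothetical example values.

def cpu_time(classes, clock_rate_hz):
    """classes: list of (instruction_count, cycles_per_instruction) pairs."""
    total_cycles = sum(count * cpi for count, cpi in classes)
    total_instructions = sum(count for count, _ in classes)
    avg_cpi = total_cycles / total_instructions
    return total_cycles / clock_rate_hz, avg_cpi

# e.g. ALU ops, loads/stores, branches on a hypothetical 200 MHz machine
classes = [(5_000_000, 1.0), (2_000_000, 2.0), (1_000_000, 3.0)]
seconds, avg_cpi = cpu_time(classes, clock_rate_hz=200e6)
print(f"average CPI = {avg_cpi:.2f}, CPU time = {seconds * 1e3:.2f} ms")
# -> average CPI = 1.50, CPU time = 60.00 ms
```

Note that the average CPI (1.5 here) is the cycle-weighted mix of the per-class CPIs, which is why CPI varies with the application's instruction mix even on a single implementation.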

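The distinction drawn above between elapsed (wall-clock) time and CPU time can be observed from inside a program. This sketch uses Python's standard timers — time.perf_counter for elapsed time and time.process_time for user + system CPU time — which report the same quantities the Unix time command does:

```python
import time

def busy(n):
    # CPU-bound work: counts toward both elapsed time and CPU time
    s = 0
    for i in range(n):
        s += i * i
    return s

start_wall = time.perf_counter()
start_cpu = time.process_time()

busy(1_000_000)
time.sleep(0.2)  # waiting: counts toward elapsed time, but not CPU time

elapsed = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"elapsed = {elapsed:.3f} s, CPU = {cpu:.3f} s")
```

The elapsed figure includes the 0.2 s of sleeping; the CPU figure does not, which is exactly the gap that I/O waits and other programs open up between the two metrics.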
CPU benchmarks – the SPEC (System Performance Evaluation Cooperative) suite
  the first release in 1989 consists of 4 integer and 6 floating-point benchmarks; matrix300 was intended to exercise the computer's memory system, but optimization by blocking transformations substantially lowered the number of memory accesses required and transformed the inner loops from having a high miss rate to having an almost negligible miss rate – in effect reorganizing the program to minimize memory usage – and so it was eliminated from later releases
  the 1992 release (SPEC92) separates integer and floating-point programs (SPECint and SPECfp), and the SPECbase rules disallow program-specific optimization flags
  the 1995 release (the SPEC95 suite) consists of 8 integer (SPECint95) and 10 floating-point (SPECfp95) programs
  the SPEC ratio: normalize the execution time by dividing the execution time on a Sun SPARCstation 10/40 by the execution time on the measured machine
  the SDM (System Development Multitasking) benchmark
  the SFS (System-level File Server) benchmark
  the 1996 release adds SPEChpc96 for high-end scientific workloads

Comparing and Summarizing Performance
the simplest approach to summarizing relative performance is to use total execution time
the average of the execution times that is directly proportional to total execution time is the arithmetic mean (AM) = (sum over i = 1..n of Time_i) / n
weighted arithmetic mean: assign a weighting factor w_i to each program to indicate the frequency of the program in that workload

Performance of Recent Processors
for a given instruction set architecture, increases in CPU performance can come from 3 sources:
  increases in clock rate
  improvements in organization that lower the CPI
  compiler enhancements that lower the instruction count or generate instructions with a lower average CPI
the memory system has a significant effect on performance – when the clock rate is increased by a certain factor, the processor performance increases by a lower factor because of the performance loss in the memory system; since the speed of main memory is not increased, increasing the processor speed exacerbates the bottleneck at the memory system, especially on the floating-point benchmarks (Amdahl's law)
Amdahl's law – speedup is the measure of how a machine performs after some enhancement relative to how it performed previously
  the earlier version of Amdahl's law: execution time after improvement = execution time affected by improvement / amount of improvement + execution time unaffected
  speedup = performance after improvement / performance before improvement = execution time before improvement / execution time after improvement

Fallacies and Pitfalls
pitfall: expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement
  execution time after improvement = execution time affected by improvement / amount of improvement + execution time unaffected
  a corollary of Amdahl's law: make the common case fast – making the common case fast will tend to enhance performance better than optimizing the rare case, and the common case is often simpler and easier to enhance than the rare case
fallacy: hardware-independent metrics predict performance (e.g., using code size as a measure of speed)
  the size of the compiled program is important when memory space is at a premium; today, the fastest machines tend to have instruction sets that lead to larger programs but can be executed faster with less hardware
pitfall: using MIPS as a performance metric (native MIPS = instruction count / (execution time * 10^6))
  MIPS specifies the instruction execution rate but does not take into account the capabilities of the instructions
  MIPS varies between programs on the same computer
  MIPS can vary inversely with performance
fallacy: synthetic benchmarks predict performance
  a synthetic benchmark is a single benchmark program in which the execution frequency of statements matches the statement frequency in a large set of benchmarks
  Whetstone – for scientific and engineering environments, quoted in Whetstones per second (the number of executions of one iteration of the Whetstone benchmark)
  Dhrystone – for systems programming environments
  no user would ever run a synthetic benchmark as an application
  synthetic benchmarks usually do not reflect program behavior
  special-purpose optimizations can inflate the performance
pitfall: using the arithmetic mean of normalized execution times to predict performance
  geometric mean = (product over i = 1..n of execution time ratio_i)^(1/n)
  geometric mean(X_i) / geometric mean(Y_i) = geometric mean(X_i / Y_i)
fallacy: the geometric mean of execution time ratios is proportional to total execution time
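Amdahl's law as stated above (execution time after improvement = affected time / amount of improvement + unaffected time, with speedup the ratio of times before and after) can be sketched directly; the 80 s / 20 s split below is a hypothetical example, not a measured workload:

```python
def amdahl_speedup(time_affected, time_unaffected, improvement):
    """Earlier form of Amdahl's law:
    time_after = time_affected / improvement + time_unaffected
    speedup    = time_before / time_after
    """
    time_before = time_affected + time_unaffected
    time_after = time_affected / improvement + time_unaffected
    return time_before / time_after

# Hypothetical: 80 s of a 100 s program benefits from a 4x-faster unit.
print(amdahl_speedup(80.0, 20.0, 4.0))   # -> 100 / (20 + 20) = 2.5

# However large the improvement, speedup is capped by the unaffected part:
# the limit here is time_before / time_unaffected = 100 / 20 = 5.
print(amdahl_speedup(80.0, 20.0, 1e9))
```

This is also why the pitfall above holds: a 4x improvement to one aspect yields only 2.5x overall, not 4x, because the unaffected 20 s does not shrink.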

geometric means do not track total execution time and thus cannot be used to predict relative execution time for a workload

Historical Perspective
instruction mix – measures the relative frequency of instructions in a computer across many programs
average instruction execution time – multiply the time for each instruction by its weight in the mix
peak MIPS – choose an instruction mix that minimizes the CPI, even if the instruction mix is totally impractical
MFLOPS/MOPS (millions of floating-point/integer operations per second), or megaFLOPS/OPS
  MFLOPS = number of floating-point operations in a program / (execution time * 10^6)
  floating-point operations are heavily used in scientific calculations, but a compiler, as an extreme example, has a MFLOPS rating near 0
  MFLOPS is based on operations in the program rather than on instructions, so it has a stronger claim than MIPS to being a fair comparison between different machines; unfortunately, MFLOPS is not dependable because the set of floating-point operations is not consistent across machines and the number of actual floating-point operations performed may vary; the MFLOPS rating changes not only with the mixture of integer and floating-point operations but also with the mixture of fast and slow floating-point operations
  normalized MFLOPS – weight the operations, giving more complex operations larger weights
relative MIPS = (time_reference / time_unrated) * MIPS_reference
  the 1-MIPS reference machine – the VAX-11/780 (actually about 0.5 MIPS)
kernel benchmarks – primarily used for benchmarking high-end machines
  the Livermore Loops consist of a series of 21 small loop fragments
  Linpack consists of a portion of a linear algebra subroutine package
toy benchmarks – easy to compile and run on almost any computer
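The MFLOPS and relative-MIPS definitions above translate directly into code; the operation counts and times below are hypothetical illustrative values:

```python
def mflops(fp_operations, exec_time_s):
    # MFLOPS = floating-point operations / (execution time * 10^6)
    return fp_operations / (exec_time_s * 1e6)

def relative_mips(time_reference_s, time_unrated_s, mips_reference=1.0):
    # relative MIPS = (time_reference / time_unrated) * MIPS_reference
    # The traditional 1-MIPS reference machine is the VAX-11/780.
    return (time_reference_s / time_unrated_s) * mips_reference

# Hypothetical: 50 million FP operations executed in 2 s
print(mflops(50e6, 2.0))          # -> 25.0 MFLOPS

# Hypothetical: the reference takes 10 s, the unrated machine 0.5 s
print(relative_mips(10.0, 0.5))   # -> 20.0 relative MIPS
```

Note how both metrics inherit the weaknesses listed above: mflops() counts whatever the program's notion of a floating-point operation is, and relative_mips() depends entirely on the chosen reference machine and program.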