inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Performance Lecture #38 Performance •The better a computer 2006-12-01 performs, the better it is! Titleless TA Aaron Staley •But what is better? inst.eecs./~cs61c-tc

Ancient Greeks used Computers ⇒ Mechanical device had 37 gear wheels which could be used to add, subtract, multiply, and divide http://www.nytimes.com/2006/11/30/science/30compute.html CS61C L3 Performance (1) Staley, Fall 2005 © UCB CS61C L3 Performance (2) Staley, Fall 2005 © UCB

Why Performance? Two Notions of “Performance” • Purchasing Perspective: given a DC to Top Passen- Throughput Plane collection of machines (or upgrade Paris Speed gers (pmph) options), which has the Boeing 6.5 610 - best performance ? 470 286,700 - least cost ? 747 hours mph - best performance / cost ? BAD/Sud 3 1350 132 178,200 • Computer Designer Perspective: faced Concorde hours mph with design options, which has the •Which has higher performance? - best performance improvement ? •Interested in time to deliver 100 passengers? - least cost ? •Interested in delivering as many passengers per day as possible? - best performance / cost ? •In a computer, time for one task called • All require basis for comparison and Response Time or Execution Time metric for evaluation! •In a computer, tasks per unit time called Throughput or Bandwidth •Solid metrics lead to solid progress! CS61C L3 Performance (3) Staley, Fall 2005 © UCB CS61C L3 Performance (4) Staley, Fall 2005 © UCB

Definitions Example of Response Time v. Throughput • Performance is in units of things per sec • Time of Concorde vs. Boeing 747? • bigger is better • Concord is 6.5 hours / 3 hours = 2.2 times faster • If we are primarily concerned with • Throughput of Boeing vs. Concorde? response time • Boeing 747: 286,700 pmph / 178,200 pmph • performance(x) = 1 = 1.6 times faster execution_time(x) • Boeing is 1.6 times (“60%”) faster in " F(ast) is n times faster than S(low) " means… terms of throughput performance(F) execution_time(S) • Concord is 2.2 times (“120%”) faster in terms of flying time (response time) n = = We will focus primarily on response performance(S) execution_time(F) time. CS61C L3 Performance (5) Staley, Fall 2005 © UCB CS61C L3 Performance (6) Staley, Fall 2005 © UCB Words, Words, Words… What is Time? • Straightforward definition of time: • Will (try to) stick to “n times faster”; • Total time to complete a task, including disk its less confusing than “m % faster” accesses, memory accesses, I/O activities, operating system overhead, ... • As faster means both decreased • “real time”, “response time” or execution time and increased “elapsed time” performance, to reduce confusion we • Alternative: just time processor (CPU) will (and you should) use is working only on your program (since “improve execution time” or multiple processes running at same time) “improve performance” • “CPU execution time” or “CPU time” • Often divided into system CPU time (in OS) and user CPU time (in user program)

CS61C L3 Performance (7) Staley, Fall 2005 © UCB CS61C L3 Performance (8) Staley, Fall 2005 © UCB

How to Measure Time? Measuring Time using Clock Cycles (1/2) • Real Time ⇒ Actual time elapsed • CPU execution time for a program • CPU Time: Computers constructed using a clock that runs at a constant = Clock Cycles for a program rate and determines when events take x Clock Period place in the hardware • or • These discrete time intervals called clock cycles (or informally clocks or = Clock Cycles for a program cycles) • Length of clock period: clock cycle time (e.g., 2 nanoseconds or 2 ns) and clock rate (e.g., 500 megahertz, or 500 MHz), which is the inverse of the clock period; use these!

CS61C L3 Performance (9) Staley, Fall 2005 © UCB CS61C L3 Performance (10) Staley, Fall 2005 © UCB

Measuring Time using Clock Cycles (2/2) Performance Calculation (1/2)

• One way to define clock cycles: • CPU execution time for program = Clock Cycles for program Clock Cycles for program x Clock Cycle Time = Instructions for a program • Substituting for clock cycles: (called “Instruction Count”) CPU execution time for program x Average Clock = (Instruction Count x CPI) (abbreviated “CPI”) x Clock Cycle Time • CPI one way to compare two machines = Instruction Count x CPI x Clock Cycle Time with same instruction set, since Instruction Count would be the same

CS61C L3 Performance (11) Staley, Fall 2005 © UCB CS61C L3 Performance (12) Staley, Fall 2005 © UCB Performance Calculation (2/2) How Calculate the 3 Components?

CPU time = Instructions x Cycles x Seconds • Clock Cycle Time: in specification of computer (Clock Rate in advertisements) Program Instruction Cycle • Instruction Count: CPU time = Instructions x Cycles x Seconds • Count instructions in loop of small program Program Instruction Cycle • Use simulator to count instructions CPU time = Instructions x Cycles x Seconds • Hardware counter in spec. register Program Instruction Cycle - (Pentium II,III,4) CPU time = Seconds • CPI: Program • Calculate: Execution Time / Clock cycle time • Product of all 3 terms: if missing a term, can’t Instruction Count predict time, the real measure of performance • Hardware counter in special register (PII,III,4)

CS61C L3 Performance (13) Staley, Fall 2005 © UCB CS61C L3 Performance (14) Staley, Fall 2005 © UCB

Calculating CPI Another Way Example (RISC processor) Op Freq CPI Prod (% Time) • First calculate CPI for each individual i i instruction (add, sub, and, etc.) ALU 50% 1 .5 (23%) Load 20% 5 1.0 (45%) • Next calculate frequency of each individual instruction Store 10% 3 .3 (14%) Branch 20% 2 .4 (18%) • Finally multiply these two for each 2.2 instruction and add them up to get Instruction Mix (Where time spent) final CPI (the weighted sum) • What if Branch instructions twice as fast?

CS61C L3 Performance (15) Staley, Fall 2005 © UCB CS61C L3 Performance (16) Staley, Fall 2005 © UCB

What Programs Measure for Comparison? Benchmarks • Ideally run typical programs with typical input before purchase, • Obviously, apparent speed of or before even build machine processor depends on code used to test it • Called a “workload”; For example: • Engineer uses compiler, spreadsheet • Need industry standards so that • Author uses word processor, drawing different processors can be fairly program, compression software compared • In some situations its hard to do • Companies exist that create these • Don’t have access to machine to benchmarks: “typical” code used to “” before purchase evaluate systems • Don’t know workload in future • Need to be changed every 2 or 3 years since designers could (and do!) target • Next: benchmarks & PC-Mac showdown! for these standard benchmarks

CS61C L3 Performance (17) Staley, Fall 2005 © UCB CS61C L3 Performance (18) Staley, Fall 2005 © UCB Example Standardized Benchmarks (1/2) Example Standardized Benchmarks (2/2) • SPEC • Standard Performance Evaluation Corporation (SPEC) SPEC CPU2000 • Benchmarks distributed in source code • Members of consortium select workload • CINT2000 12 integer (gzip, gcc, crafty, perl, ...) - 30+ companies, 40+ universities • CFP2000 14 floating-point (swim, mesa, art, ...) • Compiler, machine designers target • All relative to base machine benchmarks, so try to change every 3 years Sun 300MHz 256Mb-RAM Ultra5_10, which gets score of 100 • An Example of SPEC 2000:

• They measure CFP2000 CINT2000 gzip C Compression wupwise Fortran77 Physics / Quantum Chromodynamics swim Fortran77 Shallow Water Modeling vpr C FPGA Circuit Placement and - System speed (SPECint2000) Routing mgrid Fortran77 Multi-grid Solver: 3D Potential Field gcc C C Programming Language Compiler applu Fortran77 Parabolic / Elliptic Partial Differential Equations mcf C Combinatorial Optimization - System throughput (SPECint_rate2000) crafty C Game Playing: Chess mesa C 3-D Graphics Library parser C Word Processing galgel Fortran90 Computational Fluid Dynamics eon C++ Computer Visualization art C Image Recognition / Neural Networks perlbmk C PERL Programming Language equake C Seismic Wave Propagation Simulation facerec Fortran90 Image Processing: Face Recognition •Newer one: www.spec.org/osg/cpu2006/ gap C Group Theory, Interpreter vortex C Object-oriented Database ammp C Computational Chemistry bzip2 C Compression lucas Fortran90 Number Theory / Primality Testing fma3d Fortran90 Finite-element Crash Simulation twolf C Place and Route Simulator sixtrack Fortran77 High Energy Nuclear Physics Accelerator Design CS61C L3 Performance (19) Staley, Fall 2005 © UCB CS61C L3 Performance (20) apsi Fortran77 Meteorology: Pollutant DistrSitabluetyi,o Fnall 2005 © UCB

Another Benchmark Performance Evaluation: An Aside Demo • PCs: Ziff-Davis Benchmark Suite If we’re talking about performance, let’s discuss the ways shady salespeople • “Business Winstone is a system-level, have fooled consumers application-based benchmark that measures (so that you don’t get taken!) a PC's overall performance when running today's top-selling Windows-based 32-bit 5. Never let the user touch it applications… it doesn't mimic what these 4. Only run the demo through a script packages do; it runs real applications through a series of scripted activities and 3. Run it on a stock machine in which uses the time a PC takes to complete those “no expense was spared” activities to produce its performance scores. 2. Preprocess all available data • Also tests for CDs, Content-creation, Audio, 3D graphics, battery life 1. Play a movie http://www.etestinglabs.com/benchmarks/

CS61C L3 Performance (21) Staley, Fall 2005 © UCB CS61C L3 Performance (22) Staley, Fall 2005 © UCB

Megahertz Myth Peer Instruction • Processor manufactures compete with clock-rate, usually in the popular press. • But remember: CPI=Instruction Count x CPI x Clock Cycle Time A. Rarely does a company selling a product give • THE POINT: A CPU with a faster ABC clock rate can actually run slower! unbiased performance data. 1: FFF 2: FFT B. The Sieve of Eratosthenes and Quicksort were3: FTF early effective benchmarks. 4: FTT 5: TFF C. A program runs in 100 sec. on a machine, mult6: TFT accounts for 80 sec. of that. If we want to make7: TTF the program run 6 times faster, we need to up 8: TTT CS61C L3 Performance (23) Staley, Fall 2005 © UCB tCSh61eC L 3s Pperfoermeandce (2o4)f mults by AT LEAST 6. Staley, Fall 2005 © UCB Writing Fast Code - Hardware Dependence Complexity holds true for instruction count

• Consider two sorting algorithms: • QUICKSORT: 800 Quick (Instr/key) Radix (Instr/key) Basically selects “pivot” in an array and 700 rotates elements about the pivot 600

Average Complexity: O(n*log(n)) 500

RADIX SORT: 400 Advanced bucket sort 300 Basically “hashes” individual items. 200 Complexity: O(n) 100 0 1000 10000 100000 100000 1E+07 0

CS61C L3 Performance (26) Staley, Fall 2005 © UCB CS61C L3 Performance (27) Staley, Fall 2005 © UCB

Yet CPU time suggests otherwise… Never forget Cache effects!

800 Quick (Instr/key) 5 Radix (Instr/key) Quick(miss/key) 700 Quick (Clocks/key) Radix(miss/key) 4 600 Radix (clocks/key)

500 3 400 2 300

200 1 100 0 0 1000 10000 100000 1000000 1000000 1000 10000 100000 100000 1E+07 0 0

CS61C L3 Performance (28) Staley, Fall 2005 © UCB CS61C L3 Performance (29) Staley, Fall 2005 © UCB

“And in conclusion…” • Latency v. Throughput • Performance doesn’t depend on any single factor: need Instruction Count, Clocks Per Instruction (CPI) and Clock Rate to get valid estimations • User Time: time user waits for program to execute: depends heavily on how OS switches between tasks • CPU Time: time spent executing a single program: depends solely on processor design (datapath, pipelining effectiveness, caches, etc.) • Megahertz Myth. • The clock rate is not enough to figure out performance; it is just one factor. • Writing fast code. • Hardware can make a fast algorithm slow or vide-versa. • Generally need to run tests to figure out what will happen

CPU time = Instructions x Cycles x Seconds Program Instruction Cycle

CS61C L3 Performance (30) Staley, Fall 2005 © UCB