Chapter 1: Quantitative Analysis (Modern Processor Design: Fundamentals of Superscalar Processors)

Mark Heinrich

School of Computer Science, University of Central Florida

Define and quantify power (1/2)

• For CMOS chips, the dominant energy consumption has traditionally been in switching transistors, called dynamic power

  Power_dynamic = 1/2 × CapacitiveLoad × Voltage^2 × FrequencySwitched

• For mobile devices, energy is a better metric:

  Energy_dynamic = CapacitiveLoad × Voltage^2

• For a fixed task, slowing the clock (frequency switched) reduces power, but not energy
• Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors
• Dropping voltage helps both, so supply voltages went from 5V to 1V
• To save energy and dynamic power, most CPUs now turn off the clock of inactive modules (e.g. the floating-point unit)

2 Example of quantifying power

• Suppose a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic power?

  Power_dynamic = 1/2 × CapacitiveLoad × Voltage^2 × FrequencySwitched

  Power_dynamic,new = 1/2 × CapacitiveLoad × (0.85 × Voltage)^2 × (0.85 × FrequencySwitched)
                    = (0.85)^3 × OldPower_dynamic
                    ≈ 0.61 × OldPower_dynamic

• Because voltage and performance scale linearly but voltage and power scale cubically, be careful with statements like "I saved x% in power with ONLY a y% performance decrease!" Use better metrics.
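A quick back-of-the-envelope check of the scaling arithmetic above, written as a Python sketch (illustrative only; the capacitive load is normalized to 1):

    # Relative dynamic power: P = 1/2 x C x V^2 x f (C normalized to 1)
    def dynamic_power(c_load, voltage, freq):
        return 0.5 * c_load * voltage**2 * freq

    baseline = dynamic_power(1.0, 1.0, 1.0)
    scaled   = dynamic_power(1.0, 0.85, 0.85)   # 15% lower voltage and 15% lower frequency

    print(scaled / baseline)   # 0.85**3 ~ 0.61: ~39% power saving for a 15% performance loss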

3 Define and quantify power (2 / 2)

• Because leakage current flows even when a transistor is off, static power is now important too

  Power_static = Current_static × Voltage

• Leakage current increases in processors with smaller transistor sizes and is a function of the threshold voltage V_T
• Increasing the number of transistors increases power even if they are turned off
• In 2006, the goal for leakage was 25% of total power consumption; high-performance designs were at 40%
• Very low power systems even gate the voltage to inactive modules to control the loss due to leakage (voltage gating)

4 Define and quantify dependability (1/3)

• How do we decide when a system is operating properly?
• Infrastructure providers now offer Service Level Agreements (SLAs) to guarantee that their networking or power service will be dependable
• Systems alternate between two states of service with respect to an SLA:
  1. Service accomplishment, where the service is delivered as specified in the SLA
  2. Service interruption, where the delivered service is different from the SLA
• Failure = transition from state 1 to state 2
• Restoration = transition from state 2 to state 1

5 Define and quantify dependability (2/3)

• Module reliability = a measure of continuous service accomplishment (or time to failure). Two metrics:
  1. Mean Time To Failure (MTTF) measures reliability
  2. Failures In Time (FIT) = the failure rate, 1/MTTF, traditionally reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures service interruption
  – Mean Time Between Failures (MTBF) = MTTF + MTTR
• Module availability measures service as it alternates between the two states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
• Module availability = MTTF / (MTTF + MTTR)

6 Example calculating reliability

• If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules
• Calculate FIT and MTTF for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF):

  FailureRate = 10 × (1/1,000,000) + 1/500,000 + 1/200,000
              = (10 + 2 + 5) / 1,000,000
              = 17 / 1,000,000 per hour
              = 17,000 FIT
  MTTF = 1,000,000,000 / 17,000 ≈ 59,000 hours
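The same arithmetic as a small Python sketch (illustrative only; the component MTTFs are those from the example above):

    # Failure rates add for exponentially distributed lifetimes
    mttf_hours = [1_000_000] * 10 + [500_000] + [200_000]   # 10 disks, 1 controller, 1 power supply

    failure_rate = sum(1.0 / m for m in mttf_hours)   # failures per hour
    fit  = failure_rate * 1e9                         # failures per billion hours
    mttf = 1e9 / fit                                  # system MTTF in hours

    print(fit)    # 17000.0 FIT
    print(mttf)   # ~58,824 hours, i.e. about 59,000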

7 Outline

• Time and performance
• Iron Law
• MIPS and MFLOPS
• Benchmarks
• How to average
• How to evaluate speedups

8 Defining Performance

• What is important to whom?
• Computer system user
  – Minimize elapsed time for a program = time_end – time_start
  – Called response time
• Computer center manager
  – Maximize completion rate = #jobs/second
  – Called throughput

9 Response Time vs. Throughput

• Is throughput = 1 / average response time?
  – Only if there is NO overlap
  – Otherwise, throughput > 1 / average response time
  – E.g., two processes, each with response time N seconds, running on a single processor:
    – If there is no overlap, throughput = 1/N jobs per second
    – If the processor can switch to the other process during exception handling such as a page fault, throughput > 1/N jobs per second
  – Or two processes, each with response time N seconds, running on two processors:
    – Throughput = 2/N jobs per second

10 What is Performance for us?

• For computer architects
  – CPU time = time spent running a program
• Intuitively, bigger should be faster, so:
  – Performance = 1 / X time, where X is response, CPU execution, etc.
• Elapsed time = CPU time + I/O wait
• We mostly concentrate on CPU time

11 Performance Comparison

• Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n
• Machine A is x% faster than machine B iff
  – perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
• E.g. time(A) = 10s, time(B) = 15s
  – 15/10 = 1.5 => A is 1.5 times faster than B
  – 15/10 = 1.5 => A is 50% faster than B
• Tip: don't use % when the difference is > 100%; people get dopey
• Another tip: know when to properly use "fewer" versus "less". The rule in English is simple and unwavering: if you can count them, use FEWER.
  – Machine A does NOT have less misses than B, but it may have fewer…

12 Breaking Down Performance

• A program is broken into instructions
  – H/W is aware of instructions, not programs
• At a lower level, H/W breaks instructions into cycles
  – Lower-level state machines change state every cycle
• For example:
  – 1 GHz clock rate == 1 ns clock cycle time
  – A 500 MHz P-III runs 500M cycles/sec, 1 cycle = 2 ns
  – A 2 GHz P-IV runs 2G cycles/sec, 1 cycle = 0.5 ns

13 Iron Law

  Processor Performance = 1 / (Time / Program)

  Time/Program = Instructions/Program × Cycles/Instruction × Time/Cycle
               = IC (code size) × CPI × CT (cycle time)

Architecture --> Implementation --> Realization

Compiler Designer / Processor Designer / Chip Designer

14 Iron Law

• IC
  – Instructions executed, not static code size
  – Determined by algorithm, compiler, ISA
• CPI
  – Determined by ISA and CPU organization
  – Overlap among instructions reduces this term
  – Typically must be measured
  – Today, we talk about IPC, not CPI!
• CT
  – Determined by technology, organization, clever circuit design

15 Our Goal

• Minimize time, which is the product, NOT the isolated terms
• A common error is to miss terms while devising optimizations
  – E.g. an ISA change to decrease instruction count
  – BUT it leads to a CPU organization which makes the clock slower
• Bottom line: the terms are inter-related

16 Iron Law Example

• Machine A: clock 1 ns, CPI 2.0, for program x
• Machine B: clock 2 ns, CPI 1.2, for program x
• Which is faster, and by how much?

  Time/Program = instr/program × cycles/instr × sec/cycle
  Time(A) = N × 2.0 × 1 = 2N
  Time(B) = N × 1.2 × 2 = 2.4N
  Compare: Time(B)/Time(A) = 2.4N/2N = 1.2

• So Machine A is 20% faster than Machine B for this program
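A minimal Python sketch of the same Iron Law comparison (the instruction count N cancels, so it is set to 1 here):

    def exec_time(ic, cpi, cycle_time_ns):
        # Iron Law: time = instructions x cycles/instruction x time/cycle
        return ic * cpi * cycle_time_ns

    t_a = exec_time(1, 2.0, 1)   # Machine A
    t_b = exec_time(1, 1.2, 2)   # Machine B

    print(t_b / t_a)   # 1.2 -> Machine A is 20% faster on this program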

17 Iron Law Example

• Keep clock(A) @ 1 ns and clock(B) @ 2 ns
• For equal performance, if CPI(B) = 1.2, what is CPI(A)?

  Time(B)/Time(A) = 1 = (N × 2 × 1.2) / (N × 1 × CPI(A))
  CPI(A) = 2.4

18 Iron Law Example

• Keep CPI(A) = 2.0 and CPI(B) = 1.2
• For equal performance, if clock(B) = 2 ns, what is clock(A)?

  Time(B)/Time(A) = 1 = (N × 2.0 × clock(A)) / (N × 1.2 × 2)
  clock(A) = 1.2 ns

19 Another Example

  OP      Freq   Cycles
  ALU     43%    1
  Load    21%    1
  Store   12%    2
  Branch  24%    2

• Assume stores can execute in 1 cycle by slowing the clock by 15%
• Should this change be implemented?

20 Another Example

  OP      Freq   Cycles
  ALU     43%    1
  Load    21%    1
  Store   12%    2
  Branch  24%    2

• Old CPI = 0.43 + 0.21 + 0.12 × 2 + 0.24 × 2 = 1.36
• New CPI = 0.43 + 0.21 + 0.12 + 0.24 × 2 = 1.24

• Speedup = old time / new time
  – = {P × old CPI × T} / {P × new CPI × 1.15 T}
  – = 1.36 / (1.24 × 1.15) ≈ 0.95
• Answer: don't make the change
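The same trade-off worked in Python (a sketch only; the instruction mix and the 15% clock penalty are taken from the example above):

    mix = {"ALU": (0.43, 1), "Load": (0.21, 1), "Store": (0.12, 2), "Branch": (0.24, 2)}

    old_cpi = sum(freq * cycles for freq, cycles in mix.values())   # 1.36
    new_cpi = old_cpi - 0.12                                        # stores drop to 1 cycle -> 1.24

    speedup = old_cpi / (new_cpi * 1.15)                            # new clock is 15% slower
    print(old_cpi, new_cpi, round(speedup, 2))                      # 1.36 1.24 0.95 -> don't do it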

21 Yet Another Example

• The CPI of an ideal machine is 1.0 – make this assumption unless otherwise specified
• May need to renormalize the instruction mix if IC changes

  CPU A: 20% compares, 20% branches, CPI_branch = 2, all other instructions CPI 1
  CPU B: fused compare&branch instruction, CPI_c&b = 2, 25% slower clock, CPI_other = 1

• What is the frequency of c&b in CPU B?

  IC_B = 0.8 × IC_A (the separate compares disappear); 0.2 / 0.8 = 25% c&b

• Which CPU is faster?

  CPI_A = 0.2 × 2 + 0.8 × 1 = 1.2
  CPI_B = 0.25 × 2 + 0.75 × 1 = 1.25

  T_A = IC_A × 1.2 × T_clk
  T_B = 0.8 × IC_A × 1.25 × 1.25 × T_clk = 1.25 × IC_A × T_clk

  (A is ~4% faster)
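A Python sketch of the renormalization and the comparison above (assumptions as in the example: base CPI of 1, CPI of 2 for branches/c&b, CPU B's clock 25% slower):

    ic_a, tclk = 1.0, 1.0                  # normalized instruction count and clock period for CPU A

    cpi_a  = 0.2 * 2 + 0.8 * 1             # branches take 2 cycles on A -> 1.2
    time_a = ic_a * cpi_a * tclk

    ic_b   = 0.8 * ic_a                    # separate compares fused away
    cpi_b  = 0.25 * 2 + 0.75 * 1           # c&b is 0.2/0.8 = 25% of B's mix -> 1.25
    time_b = ic_b * cpi_b * (1.25 * tclk)  # 25% slower clock

    print(time_b / time_a)                 # ~1.04 -> CPU A is about 4% faster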

22 Other Metrics

• MIPS and MFLOPS
• MIPS = instruction count / (execution time × 10^6)
       = clock rate / (CPI × 10^6)
• But MIPS has serious shortcomings

23 Problems with MIPS

• E.g. without FP hardware, an FP op may take 50 single-cycle instructions
• With FP hardware, only one 2-cycle instruction
• Thus, adding FP hardware:
  – CPI increases (why?): 50/50 => 2/1
  – Instructions/program decreases (why?): 50 => 1
  – Total execution time decreases: 50 cycles => 2 cycles
• BUT, MIPS gets worse! 50 MIPS => 25 MIPS (assuming a 50 MHz clock)
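A small Python illustration of why MIPS misleads here, using the FP-hardware numbers from this slide and a 50 MHz clock:

    clock_hz = 50e6

    def mips(instructions, cycles):
        time = cycles / clock_hz
        return instructions / (time * 1e6)

    # Without FP hardware: 50 single-cycle instructions per FP op
    # With FP hardware:    1 instruction taking 2 cycles
    print(mips(50, 50))   # 50.0 MIPS, but 50 cycles of work
    print(mips(1, 2))     # 25.0 MIPS, yet 25x shorter execution time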

24 Problems with MIPS

• Ignores the program
• Usually used to quote peak performance
  – Ideal conditions => guaranteed not to exceed!
• When is MIPS ok?
  – Same compiler, same ISA
  – E.g. the same binary running on a Pentium-III and a Pentium-IV
  – Why? Instr/program (IC) is constant and can be ignored

25 Rules

• Use ONLY time
• Beware when reading, especially if details are omitted
• Beware of peak numbers
  – Guaranteed not to exceed

26 New Metric

• Emphasize both CPU time and energy consumption

• Ex.: Intel's NetBurst wins on CPU time but fails on performance per watt
• What is performance per watt? Instructions per watt?
• Insn per Watt (IPW) = IC / Avg. Power Consumption
• Avg. Power Consumption = Overall Energy Consumption / CPU time
• Therefore, IPW = IC × CPU time / Overall Energy Consumption

• Implication?

27 More on Insn per Watt

• Example: for the following two insns
    A = A + 1; B = B + 1;
  – Design #1: scalar non-pipelined, each insn takes 5 cycles (10 cycles overall)
  – Design #2: scalar pipelined, the two insns take 6 cycles overall
  – Assume the same overall energy (E) for the two operations
• For design #1:
  – IPW = 2 × 10 / E = 20 / E
• For design #2:
  – IPW = 2 × 6 / E = 12 / E
• Design 1 is the better design according to the IPW criterion
  – Plausible, as pipelining incurs higher power consumption

28 More on Insn per Watt

• Design #3: scalar non-pipelined, insert 10 stall cycles between the two operations
  – Overall execution = 20 cycles
  – Overall energy: E, if static energy is ignored
  – IPW = 2 × 20 / E = 40 / E
  – Implication: if you want a higher IPW, simply insert stalls (even if static energy is included, you can still play the same trick)

• Overall, be careful with this metric
  – Insn per Watt = IC × CPU time / Overall Energy
  – Suggestion: use it to compare designs with similar performance
  – Better metric: performance per watt = (1/CPU time) / (overall energy / CPU time) = 1 / overall energy
    • Energy rules; no performance impact
  – Even better metric: energy-delay product (smaller is better), or 1 / (CPU time × overall energy)
    • Similar metric: MIPS^2 per Watt
      – ≈ (1/CPU time)^2 / (overall energy / CPU time)
      – = 1 / (overall energy × CPU time)
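A Python sketch contrasting the three metrics on the two pipeline designs above (time is measured in cycles and the shared energy is normalized to E = 1; both normalizations are assumptions made for illustration):

    E = 1.0                                                  # same overall energy for both designs
    designs = {"#1 non-pipelined": 10, "#2 pipelined": 6}    # cycles to finish the two instructions

    for name, cycles in designs.items():
        ipw           = 2 * cycles / E    # insn per watt: rewards being slow
        perf_per_watt = 1.0 / E           # performance per watt: blind to time here
        edp           = cycles * E        # energy-delay product: rewards fast AND frugal
        print(name, ipw, perf_per_watt, edp)

    # IPW prefers design #1, performance/watt cannot tell them apart, EDP correctly prefers design #2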

29 Energy Per Instruction

• In mobile environments you are often trying to design for the highest throughput within a fixed energy budget
• In that case a processor must achieve low EPI (energy per instruction)

  Product                  Performance   Power   EPI in nJ
  i486                     1.0           1.0     10
  Pentium                  2.0           2.7     14
  Pentium Pro              3.6           9       24
  Pentium 4 (Willamette)   6.0           23      38
  Pentium 4 (Cedarmill)    7.9           38      48
  Pentium M                5.4           7       15
  Core Duo (Yonah)         7.7           8       11

• A Li-Ion battery stores ~460 kJ/kg, so a P4 could run 958 GI on a 0.1 kg battery = 240 s (4 minutes)
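The battery arithmetic as a quick Python check (illustrative; uses the ~460 kJ/kg figure and the 48 nJ EPI from the table):

    battery_j = 460e3 * 0.1    # 0.1 kg of Li-Ion at ~460 kJ/kg -> 46 kJ
    epi_j     = 48e-9          # Pentium 4 (Cedarmill) EPI in joules

    instructions = battery_j / epi_j   # ~9.6e11 -> about 958 GI
    print(instructions / 1e9, "GI")    # which the slide notes lasts ~240 s (4 minutes)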

30 Asynchronous Processors

• No clock!
  – The ultimate in clock gating
  – Token-based message passing at the circuit level

• Use E × t^2 as the metric (voltage independent)

• Voltage as a performance knob
  – The same processor can be 24 MIPS at 0.024 EPI or 120 MIPS at 0.1 EPI
  – Achieve average-case performance, not worst-case!
  – Dip it in liquid nitrogen, it runs faster!
  – Run it off a potato!
  – Cryogenically frozen processors?

31 Which Programs

• Execution time of what program?
• Best case: you always run the same set of programs
  – Port them and time the whole workload
• In reality, use benchmarks
  – Programs chosen to measure performance
  – Predict performance of the actual workload
  – Saves effort and money
  – Representative? Honest? Benchmarketing…

32 Types of Benchmarks

• Real programs
  – Representative of a real workload
  – The only accurate way to characterize performance
  – Requires considerable work
• Kernels or microbenchmarks
  – "Representative" program fragments
  – Good for focusing on individual features, not the big picture
• Instruction mixes
  – Instruction frequency of occurrence; used to calculate CPI

33 Benchmarks: SPEC2006

• System Performance Evaluation Cooperative (www.spec.org)
  – Formed in the 80s to combat benchmarketing
  – SPEC89, SPEC92, SPEC95, SPEC2000, now SPEC2006
• 12 integer and 18 floating-point programs
  – Compared to a Sun UltraSPARC II system at 296 MHz (the reference machine)
  – Report the GM of ratios to the reference machine

34 Benchmarks: SPEC CINT2006

  Benchmark        Description
  400.perlbench    PERL script execution
  401.bzip2        Compression
  403.gcc          C compiler
  429.mcf          Combinatorial optimization
  445.gobmk        AI – game playing (Go)
  456.hmmer        Search a gene sequence database
  458.sjeng        AI – game tree search and pattern recognition (chess)
  462.libquantum   Library for simulating a quantum computer
  464.h264ref      Video compression
  471.omnetpp      Discrete event simulation
  473.astar        AI – games, path finding
  483.xalancbmk    XSLT for transforming XML documents into HTML, text, or other XML document types

35 Benchmarks: SPEC CFP2006

  Benchmark        Description
  410.bwaves       Computational fluid dynamics
  416.gamess       Quantum chemical computations
  433.milc         Quantum chromodynamics
  434.zeusmp       Magnetohydrodynamics
  435.gromacs      Molecular dynamics
  436.cactusADM    General relativity
  437.leslie3d     Computational fluid dynamics
  444.namd         Structural biology, classical molecular dynamics simulation
  447.dealII       Solving partial differential equations using adaptive finite elements
  450.soplex       Simplex linear program solver
  453.povray       Ray tracer
  454.calculix     Structural mechanics
  459.GemsFDTD     Computational electromagnetics
  465.tonto        Quantum crystallography
  470.lbm          Computational fluid dynamics
  481.wrf          Weather forecasting
  482.sphinx3      Speech recognition
  999.specrand     Random number generator

36 Benchmarks: Commercial Workloads

• TPC: Transaction Processing Performance Council
  – TPC-A/B: simple ATM workload
  – TPC-C: warehouse order entry
  – TPC-D/H/R: decision support; database queries
  – TPC-W: online bookstore (browse/shop/order)
• SPEC
  – SPECJBB: multithreaded Java business logic
  – SPECjAppServer: Java web application logic
  – SPECweb2005: dynamic web serving
• Common attributes
  – Synthetic, but deemed usable
  – Stress the entire system, including I/O
  – Must include O/S effects to model these

37 Benchmark Pitfalls

• Benchmark not representative
  – If your workload is I/O bound, SPECint is useless
• Benchmark is too old
  – Benchmarks age poorly; benchmarketing pressure causes vendors to optimize compilers/hardware/software to the benchmarks
  – Need to be periodically refreshed

38 Benchmark Pitfalls

• Choosing a benchmark from the wrong application space
  – E.g., in a realtime environment, choosing gcc
• Choosing benchmarks from no application space
  – E.g., synthetic workloads, especially unvalidated ones
• Using toy benchmarks (dhrystone, whetstone)
  – E.g., used to prove the value of RISC in the early 80s
• Mismatch of benchmark properties with the scale of features studied
  – E.g., using SPECint for large cache studies

39 How to Average

              Machine A   Machine B
  Program 1   1           10
  Program 2   1000        100
  Total       1001        110

• One answer: for total execution time, how much faster is B? 9.1x

40 How to Average

• Another answer: arithmetic mean (same result)
• Arithmetic mean of times:

  AM = (1/n) × Σ_{i=1..n} time(i)

  – AM(A) = 1001/2 = 500.5
  – AM(B) = 110/2 = 55
  – 500.5/55 = 9.1x
• Valid only if the programs run equally often, so use the weighted arithmetic mean:

  Weighted AM = Σ_{i=1..n} weight(i) × time(i), where the weights sum to 1

41 Other Averages

• E.g., drive 30 mph for the first 10 miles, then 90 mph for the next 10 miles; what is the average speed?
  – Average speed = (30 + 90) / 2    WRONG
  – Average speed = total distance / total time
                  = 20 / (10/30 + 10/90)
                  = 45 mph
• When dealing with rates (mph), do not use the arithmetic mean

42 Harmonic Mean

• Harmonic mean of rates:

  HM = n / ( Σ_{i=1..n} 1/rate(i) )

• Use the HM if forced to start and end with rates (e.g. when reporting IPC or miss rates or branch misprediction rates)
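The driving example in Python, showing that the harmonic mean of the two rates matches total distance over total time (a sketch; two equal 10-mile segments as in the example):

    rates = [30, 90]                                    # mph over two equal 10-mile segments

    arithmetic = sum(rates) / len(rates)                # 60 mph -> wrong
    harmonic   = len(rates) / sum(1/r for r in rates)   # 45 mph
    true_avg   = 20 / (10/30 + 10/90)                   # total distance / total time = 45 mph

    print(arithmetic, harmonic, true_avg)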

43 Dealing with Ratios

              Machine A   Machine B
  Program 1   1           10
  Program 2   1000        100
  Total       1001        110

• If we take ratios with respect to machine A

              Machine A   Machine B
  Program 1   1           10
  Program 2   1           0.1

44 Dealing with Ratios

• The average for machine A is 1; the average for machine B is 5.05
• If we take ratios with respect to machine B:

              Machine A   Machine B
  Program 1   0.1         1
  Program 2   10          1
  Average     5.05        1

• They can't both be true!!!
• Don't use the arithmetic mean on normalized ratios!

45 Geometric Mean

• Use the geometric mean for ratios:

  GM = ( Π_{i=1..n} ratio(i) )^(1/n)

• Independent of the reference machine
• In the example, the GM with respect to machine A is 1, and with respect to machine B it is also 1
  – Normalized with respect to either machine, the result is the same

46 But…

• The GM of ratios is not proportional to total time
• The AM in the example says machine B is 9.1 times faster
• The GM says they are equal
• If we took total execution time, A and B are equal only if
  – Program 1 is run 100 times more often than program 2
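A short Python illustration of the normalization trap and of the GM's independence from the reference machine, using the two-program example above:

    from math import prod

    time_a = {"P1": 1, "P2": 1000}
    time_b = {"P1": 10, "P2": 100}

    def am(xs): return sum(xs) / len(xs)
    def gm(xs): return prod(xs) ** (1 / len(xs))

    ratios_wrt_a = [time_b[p] / time_a[p] for p in time_a]   # [10, 0.1]
    ratios_wrt_b = [time_a[p] / time_b[p] for p in time_a]   # [0.1, 10]

    print(am(ratios_wrt_a), am(ratios_wrt_b))            # 5.05 and 5.05 -> each machine "wins" by AM
    print(gm(ratios_wrt_a), gm(ratios_wrt_b))            # 1.0 and 1.0   -> GM is consistent
    print(sum(time_a.values()) / sum(time_b.values()))   # 9.1 -> but total time says B is 9.1x faster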

47 Averaging Summary

• Use the AM for times
• Use the HM if forced to use rates
• Use the GM if forced to use ratios

• Best of all, use unnormalized numbers to compute time

48 Amdahl’s law

• Performance Improvement (“speedup”) is limited by the part you cannot improve

• Speedups

  Speedup = (Performance of task with gizmo) / (Performance of task without gizmo)

  or

  Speedup = (Execution time of task without gizmo) / (Execution time of task with gizmo)

49 Amdahl’s law

• Speedup_enhanced (s) = best-case speedup from the gizmo alone ("in a perfect world")

• Fraction_enhanced (f) = fraction of the task that the gizmo can enhance ("how perfect the world is")

  Speedup_overall = Execution time of task without gizmo / Execution time of task with gizmo

  Execution time_new = Execution time_old × (1 − Fraction_enhanced) + Execution time_old × Fraction_enhanced / Speedup_enhanced

  Speedup_overall = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
                  = 1 / ( (1 − f) + f/s )

50 Amdahl’s law example

• You do simulations of jet plane wings
  – 1 run takes 1 week on your fastest computer
• You get this ad in your mailbox:
  – The Acme Hyperbole is the largest computer ever built; it has 100,000 processors (great!)
  – It costs $1 bazillion (not so great)
• Now, 1 week is about 600,000 sec, so
  – You could run a simulation in 6 seconds, right?
  – Well, not all of a program can be done at the same time
• Data dependencies:
  – x = (…), followed by (…) = x * y
• Control dependencies:
  – if xxx then yyy else zzz
• Say 80% of your program is parallelizable (pretty good)

51 Amdahl’s law example

  Speedup_enhanced = 100,000
  Fraction_enhanced = 0.8

  Speedup_overall = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
                  = 1 / ( (1 − 0.8) + 0.8/100,000 )
                  ≈ 1 / 0.2 = 5

• So approximately 5 times faster, or about 33 hours
  – Not quite as great as one would hope
  – Worth a bazillion dollars? (TRY 100 PROCS => 4.8 !!!)
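A quick Python check of the Amdahl's Law numbers above (f = 0.8 and one week of serial runtime; purely illustrative):

    def amdahl_speedup(f, s):
        # f = fraction enhanced, s = speedup of that fraction
        return 1.0 / ((1 - f) + f / s)

    week_s = 7 * 24 * 3600   # ~604,800 s

    for procs in (100, 100_000):
        sp = amdahl_speedup(0.8, procs)
        print(procs, round(sp, 2), round(week_s / sp / 3600, 1), "hours")
    # 100 procs     -> ~4.81x
    # 100,000 procs -> ~5.0x, about 33.6 hours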

52 Amdahl’s law (cont.)

• Another interpretation
  – Recall: speedup is limited by the part you cannot improve
  – Also: the common case matters most

  Ex 1: f = 0.95, s = 1.10:  Speedup_overall = 1 / ( (1 − 0.95) + 0.95/1.10 ) = 1.094 ≈ 1.10
  Ex 2: f = 0.05, s = 10:    Speedup_overall = 1 / ( (1 − 0.05) + 0.05/10 )   = 1.047
  Ex 3: f = 0.05, s → ∞:     Speedup_overall = 1 / (1 − 0.05)                 = 1.053
