Chapter 1: Quantitative Analysis (Modern Processor Design: Fundamentals of Superscalar Processors)

Mark Heinrich

School of Computer Science, University of Central Florida

Define and quantify power (1/2)

• For CMOS chips, the dominant energy consumption has traditionally been in switching transistors, called dynamic power

  Power_dynamic = 1/2 × CapacitiveLoad × Voltage^2 × FrequencySwitched

• For mobile devices, energy is a better metric:

  Energy_dynamic = CapacitiveLoad × Voltage^2

• For a fixed task, slowing the clock (frequency switched) reduces power, but not energy
• Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors
• Dropping voltage helps both, so supply voltages went from 5V to 1V
• To save energy and dynamic power, most CPUs now turn off the clock of inactive modules (e.g. the floating-point unit)

2 Example of quantifying power

• Suppose a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic power?

  Power_dynamic = 1/2 × CapacitiveLoad × Voltage^2 × FrequencySwitched

  Power_dynamic,new = 1/2 × CapacitiveLoad × (0.85 × Voltage)^2 × (0.85 × FrequencySwitched)
                    = (0.85)^3 × OldPower_dynamic
                    ≈ 0.61 × OldPower_dynamic

• Because voltage and performance scale linearly but voltage and power scale cubically, be careful with statements like "I saved x% in power with ONLY a y% performance decrease!" Use better metrics.
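A quick back-of-the-envelope check of the scaling arithmetic above, written as a Python sketch (illustrative only; the capacitive load is normalized to 1):

    # Relative dynamic power: P = 1/2 x C x V^2 x f (C normalized to 1)
    def dynamic_power(c_load, voltage, freq):
        return 0.5 * c_load * voltage**2 * freq

    baseline = dynamic_power(1.0, 1.0, 1.0)
    scaled   = dynamic_power(1.0, 0.85, 0.85)   # 15% lower voltage and 15% lower frequency

    print(scaled / baseline)   # 0.85**3 ~ 0.61: ~39% power saving for a 15% performance loss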

3 Define and quantify power (2 / 2)

• Because leakage current flows even when a transistor is off, static power is now important too

  Power_static = Current_static × Voltage

• Leakage current increases in processors with smaller transistor sizes and is a function of the threshold voltage V_T
• Increasing the number of transistors increases power even if they are turned off
• In 2006, the goal for leakage was 25% of total power consumption; high-performance designs were at 40%
• Very low power systems even gate the voltage to inactive modules to control the loss due to leakage (voltage gating)

4 Define and quantify dependability (1/3)

• How do we decide when a system is operating properly?
• Infrastructure providers now offer Service Level Agreements (SLAs) to guarantee that their networking or power service will be dependable
• Systems alternate between two states of service with respect to an SLA:
  1. Service accomplishment, where the service is delivered as specified in the SLA
  2. Service interruption, where the delivered service is different from the SLA
• Failure = transition from state 1 to state 2
• Restoration = transition from state 2 to state 1

5 Define and quantify dependability (2/3)

• Module reliability = a measure of continuous service accomplishment (or time to failure). Two metrics:
  1. Mean Time To Failure (MTTF) measures reliability
  2. Failures In Time (FIT) = the failure rate, 1/MTTF, traditionally reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures service interruption
  – Mean Time Between Failures (MTBF) = MTTF + MTTR
• Module availability measures service as it alternates between the two states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
• Module availability = MTTF / (MTTF + MTTR)

6 Example calculating reliability

• If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules
• Calculate FIT and MTTF for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF):

  FailureRate = 10 × (1/1,000,000) + 1/500,000 + 1/200,000
              = (10 + 2 + 5) / 1,000,000
              = 17 / 1,000,000 per hour
              = 17,000 FIT
  MTTF = 1,000,000,000 / 17,000 ≈ 59,000 hours
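The same arithmetic as a small Python sketch (illustrative only; the component MTTFs are those from the example above):

    # Failure rates add for exponentially distributed lifetimes
    mttf_hours = [1_000_000] * 10 + [500_000] + [200_000]   # 10 disks, 1 controller, 1 power supply

    failure_rate = sum(1.0 / m for m in mttf_hours)   # failures per hour
    fit  = failure_rate * 1e9                         # failures per billion hours
    mttf = 1e9 / fit                                  # system MTTF in hours

    print(fit)    # 17000.0 FIT
    print(mttf)   # ~58,824 hours, i.e. about 59,000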

7 Outline

• Time and performance
• Iron Law
• MIPS and MFLOPS
• Benchmarks
• How to average
• How to evaluate speedups

8 Defining Performance

• What is important to whom?
• Computer system user
  – Minimize elapsed time for a program = time_end – time_start
  – Called response time
• Computer center manager
  – Maximize completion rate = #jobs/second
  – Called throughput

9 Response Time vs. Throughput

• Is throughput = 1 / average response time?
  – Only if there is NO overlap
  – Otherwise, throughput > 1 / average response time
  – E.g., two processes, each with response time N seconds, running on a single processor:
    – If there is no overlap, throughput = 1/N jobs per second
    – If the processor can switch to the other process during exception handling such as a page fault, throughput > 1/N jobs per second
  – Or two processes, each with response time N seconds, running on two processors:
    – Throughput = 2/N jobs per second

10 What is Performance for us?

• For computer architects
  – CPU time = time spent running a program
• Intuitively, bigger should be faster, so:
  – Performance = 1 / X time, where X is response, CPU execution, etc.
• Elapsed time = CPU time + I/O wait
• We mostly concentrate on CPU time

11 Performance Comparison

• Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n
• Machine A is x% faster than machine B iff
  – perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
• E.g. time(A) = 10s, time(B) = 15s
  – 15/10 = 1.5 => A is 1.5 times faster than B
  – 15/10 = 1.5 => A is 50% faster than B
• Tip: don't use % when the difference is > 100%; people get dopey
• Another tip: know when to properly use "fewer" versus "less". The rule in English is simple and unwavering: if you can count them, use FEWER.
  – Machine A does NOT have less misses than B, but it may have fewer…

12 Breaking Down Performance

• A program is broken into instructions
  – H/W is aware of instructions, not programs
• At a lower level, H/W breaks instructions into cycles
  – Lower-level state machines change state every cycle
• For example:
  – 1 GHz clock rate == 1 ns clock cycle time
  – A 500 MHz P-III runs 500M cycles/sec, 1 cycle = 2 ns
  – A 2 GHz P-IV runs 2G cycles/sec, 1 cycle = 0.5 ns

13 Iron Law

  Processor Performance = 1 / (Time / Program)

  Time/Program = Instructions/Program × Cycles/Instruction × Time/Cycle
               = IC (code size) × CPI × CT (cycle time)

Architecture --> Implementation --> Realization

Compiler Designer / Processor Designer / Chip Designer

14 Iron Law

• IC
  – Instructions executed, not static code size
  – Determined by algorithm, compiler, ISA
• CPI
  – Determined by ISA and CPU organization
  – Overlap among instructions reduces this term
  – Typically must be measured
  – Today, we talk about IPC, not CPI!
• CT
  – Determined by technology, organization, clever circuit design

15 Our Goal

• Minimize time, which is the product, NOT the isolated terms
• A common error is to miss terms while devising optimizations
  – E.g. an ISA change to decrease instruction count
  – BUT it leads to a CPU organization which makes the clock slower
• Bottom line: the terms are inter-related

16 Iron Law Example

• Machine A: clock 1 ns, CPI 2.0, for program x
• Machine B: clock 2 ns, CPI 1.2, for program x
• Which is faster, and by how much?

  Time/Program = instr/program × cycles/instr × sec/cycle
  Time(A) = N × 2.0 × 1 = 2N
  Time(B) = N × 1.2 × 2 = 2.4N
  Compare: Time(B)/Time(A) = 2.4N/2N = 1.2

• So Machine A is 20% faster than Machine B for this program
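A minimal Python sketch of the same Iron Law comparison (the instruction count N cancels, so it is set to 1 here):

    def exec_time(ic, cpi, cycle_time_ns):
        # Iron Law: time = instructions x cycles/instruction x time/cycle
        return ic * cpi * cycle_time_ns

    t_a = exec_time(1, 2.0, 1)   # Machine A
    t_b = exec_time(1, 1.2, 2)   # Machine B

    print(t_b / t_a)   # 1.2 -> Machine A is 20% faster on this program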

17 Iron Law Example

• Keep clock(A) @ 1 ns and clock(B) @ 2 ns
• For equal performance, if CPI(B) = 1.2, what is CPI(A)?

  Time(B)/Time(A) = 1 = (N × 2 × 1.2) / (N × 1 × CPI(A))
  CPI(A) = 2.4

18 Iron Law Example

• Keep CPI(A) = 2.0 and CPI(B) = 1.2
• For equal performance, if clock(B) = 2 ns, what is clock(A)?

  Time(B)/Time(A) = 1 = (N × 2.0 × clock(A)) / (N × 1.2 × 2)
  clock(A) = 1.2 ns

19 Another Example

  OP      Freq   Cycles
  ALU     43%    1
  Load    21%    1
  Store   12%    2
  Branch  24%    2

• Assume stores can execute in 1 cycle by slowing the clock by 15%
• Should this change be implemented?

20 Another Example

  OP      Freq   Cycles
  ALU     43%    1
  Load    21%    1
  Store   12%    2
  Branch  24%    2

• Old CPI = 0.43 + 0.21 + 0.12 × 2 + 0.24 × 2 = 1.36
• New CPI = 0.43 + 0.21 + 0.12 + 0.24 × 2 = 1.24

• Speedup = old time / new time
  – = {P × old CPI × T} / {P × new CPI × 1.15 T}
  – = 1.36 / (1.24 × 1.15) ≈ 0.95
• Answer: don't make the change
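The same trade-off worked in Python (a sketch only; the instruction mix and the 15% clock penalty are taken from the example above):

    mix = {"ALU": (0.43, 1), "Load": (0.21, 1), "Store": (0.12, 2), "Branch": (0.24, 2)}

    old_cpi = sum(freq * cycles for freq, cycles in mix.values())   # 1.36
    new_cpi = old_cpi - 0.12                                        # stores drop to 1 cycle -> 1.24

    speedup = old_cpi / (new_cpi * 1.15)                            # new clock is 15% slower
    print(old_cpi, new_cpi, round(speedup, 2))                      # 1.36 1.24 0.95 -> don't do it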

21 Yet Another Example

• The CPI of an ideal machine is 1.0 – make this assumption unless otherwise specified
• May need to renormalize the instruction mix if IC changes

  CPU A: 20% compares, 20% branches, CPI_branch = 2, all other instructions CPI 1
  CPU B: fused compare&branch instruction, CPI_c&b = 2, 25% slower clock, CPI_other = 1

• What is the frequency of c&b in CPU B?

  IC_B = 0.8 × IC_A (the separate compares disappear); 0.2 / 0.8 = 25% c&b

• Which CPU is faster?

  CPI_A = 0.2 × 2 + 0.8 × 1 = 1.2
  CPI_B = 0.25 × 2 + 0.75 × 1 = 1.25

  T_A = IC_A × 1.2 × T_clk
  T_B = 0.8 × IC_A × 1.25 × 1.25 × T_clk = 1.25 × IC_A × T_clk

  (A is ~4% faster)
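A Python sketch of the renormalization and the comparison above (assumptions as in the example: base CPI of 1, CPI of 2 for branches/c&b, CPU B's clock 25% slower):

    ic_a, tclk = 1.0, 1.0                  # normalized instruction count and clock period for CPU A

    cpi_a  = 0.2 * 2 + 0.8 * 1             # branches take 2 cycles on A -> 1.2
    time_a = ic_a * cpi_a * tclk

    ic_b   = 0.8 * ic_a                    # separate compares fused away
    cpi_b  = 0.25 * 2 + 0.75 * 1           # c&b is 0.2/0.8 = 25% of B's mix -> 1.25
    time_b = ic_b * cpi_b * (1.25 * tclk)  # 25% slower clock

    print(time_b / time_a)                 # ~1.04 -> CPU A is about 4% faster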

22 Other Metrics

• MIPS and MFLOPS
• MIPS = instruction count / (execution time × 10^6)
       = clock rate / (CPI × 10^6)
• But MIPS has serious shortcomings

23 Problems with MIPS

• E.g. without FP hardware, an FP op may take 50 single-cycle instructions
• With FP hardware, only one 2-cycle instruction
• Thus, adding FP hardware:
  – CPI increases (why?): 50/50 => 2/1
  – Instructions/program decreases (why?): 50 => 1
  – Total execution time decreases: 50 cycles => 2 cycles
• BUT, MIPS gets worse! 50 MIPS => 25 MIPS (assuming a 50 MHz clock)
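A small Python illustration of why MIPS misleads here, using the FP-hardware numbers from this slide and a 50 MHz clock:

    clock_hz = 50e6

    def mips(instructions, cycles):
        time = cycles / clock_hz
        return instructions / (time * 1e6)

    # Without FP hardware: 50 single-cycle instructions per FP op
    # With FP hardware:    1 instruction taking 2 cycles
    print(mips(50, 50))   # 50.0 MIPS, but 50 cycles of work
    print(mips(1, 2))     # 25.0 MIPS, yet 25x shorter execution time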

24 Problems with MIPS

• Ignores the program
• Usually used to quote peak performance
  – Ideal conditions => guaranteed not to exceed!
• When is MIPS ok?
  – Same compiler, same ISA
  – E.g. the same binary running on a Pentium-III and a Pentium-IV
  – Why? Instr/program (IC) is constant and can be ignored

25 Rules

• Use ONLY time
• Beware when reading, especially if details are omitted
• Beware of peak numbers
  – Guaranteed not to exceed

26 New Metric

• Emphasize both CPU time and energy consumption

• Ex.: Intel's NetBurst wins on CPU time but fails on performance per watt
• What is performance per watt? Instructions per watt?
• Insn per Watt (IPW) = IC / Avg. Power Consumption
• Avg. Power Consumption = Overall Energy Consumption / CPU time
• Therefore, IPW = IC × CPU time / Overall Energy Consumption

• Implication?

27 More on Insn per Watt

• Example: for the following two insns
    A = A + 1; B = B + 1;
  – Design #1: scalar non-pipelined, each insn takes 5 cycles (10 cycles overall)
  – Design #2: scalar pipelined, the two insns take 6 cycles overall
  – Assume the same overall energy (E) for the two operations
• For design #1:
  – IPW = 2 × 10 / E = 20 / E
• For design #2:
  – IPW = 2 × 6 / E = 12 / E
• Design 1 is the better design according to the IPW criterion
  – Plausible, as pipelining incurs higher power consumption

28 More on Insn per Watt

• Design #3: scalar non-pipelined, insert 10 stall cycles between the two operations
  – Overall execution = 20 cycles
  – Overall energy: E, if static energy is ignored
  – IPW = 2 × 20 / E = 40 / E
  – Implication: if you want a higher IPW, simply insert stalls (even if static energy is included, you can still play the same trick)

• Overall, be careful with this metric
  – Insn per Watt = IC × CPU time / Overall Energy
  – Suggestion: use it to compare designs with similar performance
  – Better metric: performance per watt = (1/CPU time) / (overall energy / CPU time) = 1 / overall energy
    • Energy rules; no performance impact
  – Even better metric: energy-delay product (smaller is better), or 1 / (CPU time × overall energy)
    • Similar metric: MIPS^2 per Watt
      – ≈ (1/CPU time)^2 / (overall energy / CPU time)
      – = 1 / (overall energy × CPU time)
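A Python sketch contrasting the three metrics on the two pipeline designs above (time is measured in cycles and the shared energy is normalized to E = 1; both normalizations are assumptions made for illustration):

    E = 1.0                                                  # same overall energy for both designs
    designs = {"#1 non-pipelined": 10, "#2 pipelined": 6}    # cycles to finish the two instructions

    for name, cycles in designs.items():
        ipw           = 2 * cycles / E    # insn per watt: rewards being slow
        perf_per_watt = 1.0 / E           # performance per watt: blind to time here
        edp           = cycles * E        # energy-delay product: rewards fast AND frugal
        print(name, ipw, perf_per_watt, edp)

    # IPW prefers design #1, performance/watt cannot tell them apart, EDP correctly prefers design #2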

29 Energy Per Instruction

• In mobile environments you are often trying to design for the highest throughput within a fixed energy budget
• In that case a processor must achieve low EPI (energy per instruction)

  Product                  Performance   Power   EPI in nJ
  i486                     1.0           1.0     10
  Pentium                  2.0           2.7     14
  Pentium Pro              3.6           9       24
  Pentium 4 (Willamette)   6.0           23      38
  Pentium 4 (Cedarmill)    7.9           38      48
  Pentium M                5.4           7       15
  Core Duo (Yonah)         7.7           8       11

• A Li-Ion battery stores ~460 kJ/kg, so a P4 could run 958 GI on a 0.1 kg battery = 240 s (4 minutes)
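The battery arithmetic as a quick Python check (illustrative; uses the ~460 kJ/kg figure and the 48 nJ EPI from the table):

    battery_j = 460e3 * 0.1    # 0.1 kg of Li-Ion at ~460 kJ/kg -> 46 kJ
    epi_j     = 48e-9          # Pentium 4 (Cedarmill) EPI in joules

    instructions = battery_j / epi_j   # ~9.6e11 -> about 958 GI
    print(instructions / 1e9, "GI")    # which the slide notes lasts ~240 s (4 minutes)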

30 Asynchronous Processors

• No clock!
  – The ultimate in clock gating
  – Token-based message passing at the circuit level

• Use E × t^2 as the metric (voltage independent)

• Voltage as a performance knob
  – The same processor can be 24 MIPS at 0.024 EPI or 120 MIPS at 0.1 EPI
  – Achieve average-case performance, not worst-case!
  – Dip it in liquid nitrogen, it runs faster!
  – Run it off a potato!
  – Cryogenically frozen processors?

31 Which Programs

• Execution time of what program?
• Best case: you always run the same set of programs
  – Port them and time the whole workload
• In reality, use benchmarks
  – Programs chosen to measure performance
  – Predict performance of the actual workload
  – Saves effort and money
  – Representative? Honest? Benchmarketing…

32 Types of Benchmarks

• Real programs
  – Representative of a real workload
  – The only accurate way to characterize performance
  – Requires considerable work
• Kernels or microbenchmarks
  – "Representative" program fragments
  – Good for focusing on individual features, not the big picture
• Instruction mixes
  – Instruction frequency of occurrence; used to calculate CPI

33 Benchmarks: SPEC2006

• System Performance Evaluation Cooperative (www.spec.org)
  – Formed in the 80s to combat benchmarketing
  – SPEC89, SPEC92, SPEC95, SPEC2000, now SPEC2006
• 12 integer and 18 floating-point programs
  – Compared to a Sun UltraSPARC II system at 296 MHz (the reference machine)
  – Report the GM of ratios to the reference machine

34 Benchmarks: SPEC CINT2006

  Benchmark        Description
  400.perlbench    PERL script execution
  401.bzip2        Compression
  403.gcc          C compiler
  429.mcf          Combinatorial optimization
  445.gobmk        AI – game playing (Go)
  456.hmmer        Search a gene sequence database
  458.sjeng        AI – game tree search and pattern recognition (chess)
  462.libquantum   Library for simulating a quantum computer
  464.h264ref      Video compression
  471.omnetpp      Discrete event simulation
  473.astar        AI – games, path finding
  483.xalancbmk    XSLT for transforming XML documents into HTML, text, or other XML document types

35 Benchmarks: SPEC CFP2006

  Benchmark        Description
  410.bwaves       Computational fluid dynamics
  416.gamess       Quantum chemical computations
  433.milc         Quantum chromodynamics
  434.zeusmp       Magnetohydrodynamics
  435.gromacs      Molecular dynamics
  436.cactusADM    General relativity
  437.leslie3d     Computational fluid dynamics
  444.namd         Structural biology, classical molecular dynamics simulation
  447.dealII       Solving partial differential equations using adaptive finite elements
  450.soplex       Simplex linear program solver
  453.povray       Ray tracer
  454.calculix     Structural mechanics
  459.GemsFDTD     Computational electromagnetics
  465.tonto        Quantum crystallography
  470.lbm          Computational fluid dynamics
  481.wrf          Weather forecasting
  482.sphinx3      Speech recognition
  999.specrand     Random number generator

36 Benchmarks: Commercial Workloads

• TPC: Transaction Processing Performance Council
  – TPC-A/B: simple ATM workload
  – TPC-C: warehouse order entry
  – TPC-D/H/R: decision support; database queries
  – TPC-W: online bookstore (browse/shop/order)
• SPEC
  – SPECJBB: multithreaded Java business logic
  – SPECjAppServer: Java web application logic
  – SPECweb2005: dynamic web serving
• Common attributes
  – Synthetic, but deemed usable
  – Stress the entire system, including I/O
  – Must include O/S effects to model these

37 Benchmark Pitfalls

• Benchmark not representative
  – If your workload is I/O bound, SPECint is useless
• Benchmark is too old
  – Benchmarks age poorly; benchmarketing pressure causes vendors to optimize compilers/hardware/software to the benchmarks
  – Need to be periodically refreshed

38 Benchmark Pitfalls

• Choosing a benchmark from the wrong application space
  – E.g., in a realtime environment, choosing gcc
• Choosing benchmarks from no application space
  – E.g., synthetic workloads, especially unvalidated ones
• Using toy benchmarks (dhrystone, whetstone)
  – E.g., used to prove the value of RISC in the early 80s
• Mismatch of benchmark properties with the scale of features studied
  – E.g., using SPECint for large cache studies

39 How to Average

              Machine A   Machine B
  Program 1   1           10
  Program 2   1000        100
  Total       1001        110

• One answer: for total execution time, how much faster is B? 9.1x

40 How to Average

• Another answer: arithmetic mean (same result)
• Arithmetic mean of times:

  AM = (1/n) × Σ_{i=1..n} time(i)

  – AM(A) = 1001/2 = 500.5
  – AM(B) = 110/2 = 55
  – 500.5/55 = 9.1x
• Valid only if the programs run equally often, so use the weighted arithmetic mean:

  Weighted AM = Σ_{i=1..n} weight(i) × time(i), where the weights sum to 1

41 Other Averages

• E.g., drive 30 mph for the first 10 miles, then 90 mph for the next 10 miles; what is the average speed?
  – Average speed = (30 + 90) / 2    WRONG
  – Average speed = total distance / total time
                  = 20 / (10/30 + 10/90)
                  = 45 mph
• When dealing with rates (mph), do not use the arithmetic mean

42 Harmonic Mean

• Harmonic mean of rates:

  HM = n / ( Σ_{i=1..n} 1/rate(i) )

• Use the HM if forced to start and end with rates (e.g. when reporting IPC or miss rates or branch misprediction rates)
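The driving example in Python, showing that the harmonic mean of the two rates matches total distance over total time (a sketch; two equal 10-mile segments as in the example):

    rates = [30, 90]                                    # mph over two equal 10-mile segments

    arithmetic = sum(rates) / len(rates)                # 60 mph -> wrong
    harmonic   = len(rates) / sum(1/r for r in rates)   # 45 mph
    true_avg   = 20 / (10/30 + 10/90)                   # total distance / total time = 45 mph

    print(arithmetic, harmonic, true_avg)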

43 Dealing with Ratios

              Machine A   Machine B
  Program 1   1           10
  Program 2   1000        100
  Total       1001        110

• If we take ratios with respect to machine A

              Machine A   Machine B
  Program 1   1           10
  Program 2   1           0.1

44 Dealing with Ratios

• The average for machine A is 1; the average for machine B is 5.05
• If we take ratios with respect to machine B:

              Machine A   Machine B
  Program 1   0.1         1
  Program 2   10          1
  Average     5.05        1

• They can't both be true!!!
• Don't use the arithmetic mean on normalized ratios!

45 Geometric Mean

• Use the geometric mean for ratios:

  GM = ( Π_{i=1..n} ratio(i) )^(1/n)

• Independent of the reference machine
• In the example, the GM with respect to machine A is 1, and with respect to machine B it is also 1
  – Normalized with respect to either machine, the result is the same

46 But…

• The GM of ratios is not proportional to total time
• The AM in the example says machine B is 9.1 times faster
• The GM says they are equal
• If we took total execution time, A and B are equal only if
  – Program 1 is run 100 times more often than program 2
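A short Python illustration of the normalization trap and of the GM's independence from the reference machine, using the two-program example above:

    from math import prod

    time_a = {"P1": 1, "P2": 1000}
    time_b = {"P1": 10, "P2": 100}

    def am(xs): return sum(xs) / len(xs)
    def gm(xs): return prod(xs) ** (1 / len(xs))

    ratios_wrt_a = [time_b[p] / time_a[p] for p in time_a]   # [10, 0.1]
    ratios_wrt_b = [time_a[p] / time_b[p] for p in time_a]   # [0.1, 10]

    print(am(ratios_wrt_a), am(ratios_wrt_b))            # 5.05 and 5.05 -> each machine "wins" by AM
    print(gm(ratios_wrt_a), gm(ratios_wrt_b))            # 1.0 and 1.0   -> GM is consistent
    print(sum(time_a.values()) / sum(time_b.values()))   # 9.1 -> but total time says B is 9.1x faster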

47 Averaging Summary

• Use the AM for times
• Use the HM if forced to use rates
• Use the GM if forced to use ratios

• Best of all, use unnormalized numbers to compute time

48 Amdahl’s law

• Performance Improvement (“speedup”) is limited by the part you cannot improve

• Speedups

  Speedup = (Performance of task with gizmo) / (Performance of task without gizmo)

  or

  Speedup = (Execution time of task without gizmo) / (Execution time of task with gizmo)

49 Amdahl’s law

• Speedup_enhanced (s) = best-case speedup from the gizmo alone ("in a perfect world")

• Fraction_enhanced (f) = fraction of the task that the gizmo can enhance ("how perfect the world is")

  Speedup_overall = Execution time of task without gizmo / Execution time of task with gizmo

  Execution time_new = Execution time_old × (1 − Fraction_enhanced) + Execution time_old × Fraction_enhanced / Speedup_enhanced

  Speedup_overall = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
                  = 1 / ( (1 − f) + f/s )

50 Amdahl’s law example

• You do simulations of jet plane wings
  – 1 run takes 1 week on your fastest computer
• You get this ad in your mailbox:
  – The Acme Hyperbole is the largest computer ever built; it has 100,000 processors (great!)
  – It costs $1 bazillion (not so great)
• Now, 1 week is about 600,000 sec, so
  – You could run a simulation in 6 seconds, right?
  – Well, not all of a program can be done at the same time
• Data dependencies:
  – x = (…), followed by (…) = x * y
• Control dependencies:
  – if xxx then yyy else zzz
• Say 80% of your program is parallelizable (pretty good)

51 Amdahl’s law example

  Speedup_enhanced = 100,000
  Fraction_enhanced = 0.8

  Speedup_overall = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
                  = 1 / ( (1 − 0.8) + 0.8/100,000 )
                  ≈ 1 / 0.2 = 5

• So approximately 5 times faster, or about 33 hours
  – Not quite as great as one would hope
  – Worth a bazillion dollars? (TRY 100 PROCS => 4.8 !!!)
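A quick Python check of the Amdahl's Law numbers above (f = 0.8 and one week of serial runtime; purely illustrative):

    def amdahl_speedup(f, s):
        # f = fraction enhanced, s = speedup of that fraction
        return 1.0 / ((1 - f) + f / s)

    week_s = 7 * 24 * 3600   # ~604,800 s

    for procs in (100, 100_000):
        sp = amdahl_speedup(0.8, procs)
        print(procs, round(sp, 2), round(week_s / sp / 3600, 1), "hours")
    # 100 procs     -> ~4.81x
    # 100,000 procs -> ~5.0x, about 33.6 hours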

52 Amdahl’s law (cont.)

• Another interpretation
  – Recall: speedup is limited by the part you cannot improve
  – Also: the common case matters most

  Ex 1: f = 0.95, s = 1.10:  Speedup_overall = 1 / ( (1 − 0.95) + 0.95/1.10 ) = 1.094 ≈ 1.10
  Ex 2: f = 0.05, s = 10:    Speedup_overall = 1 / ( (1 − 0.05) + 0.05/10 )   = 1.047
  Ex 3: f = 0.05, s → ∞:     Speedup_overall = 1 / (1 − 0.05)                 = 1.053
