Lecture 02: Technology Trends and Quantitative Design and Analysis for Performance
CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan [email protected] www.secs.oakland.edu/~yan
Contents
• Computers and computer components
• Computer architectures and great ideas in history and now
• Trends, cost and performance
Understanding Performance
• Algorithm
  – Determines number of operations executed
• Programming language, compiler, architecture
  – Determine number of machine instructions executed per operation
• Processor and memory system
  – Determine how fast instructions are executed
• I/O system (including OS)
  – Determines how fast I/O operations are executed
Below Your Program
• Application software
  – Written in high-level language
• System software
  – Compiler: translates HLL code to machine code
  – Operating System: service code
    • Handling input/output
    • Managing memory and storage
    • Scheduling tasks & sharing resources
• Hardware
  – Processor, memory, I/O controllers
Levels of Program Code
• High-level language
  – Level of abstraction closer to the problem domain
  – Provides for productivity and portability
• Assembly language
  – Textual representation of instructions
• Hardware representation
  – Binary digits (bits)
  – Encoded instructions and data
Trends in Technology
• Integrated circuit technology
  – Transistor density: 35%/year
  – Die size: 10-20%/year
  – Integration overall: 40-55%/year
• DRAM capacity: 25-40%/year (slowing)
• Flash capacity: 50-60%/year
  – 15-20X cheaper/bit than DRAM
• Magnetic disk technology: 40%/year
  – 15-25X cheaper/bit than Flash
  – 300-500X cheaper/bit than DRAM
Bandwidth and Latency
• Bandwidth or throughput
  – Total work done in a given time
  – 10,000-25,000X improvement for processors
  – 300-1200X improvement for memory and disks
• Latency or response time
  – Time between the start and completion of an event
  – 30-80X improvement for processors
  – 6-8X improvement for memory and disks
End of Moore’s Law?
Cost per transistor is rising as transistor size continues to shrink
Power and Energy
• Problem:
  – Getting power in and distributing it around the chip
  – Getting power out: dissipating heat
• Three primary concerns:
  – Maximum power requirement for a processor
  – Thermal Design Power (TDP)
    • Characterizes sustained power consumption
    • Used as the target for the power supply and cooling system
    • Lower than peak power, higher than average power consumption
  – Energy and energy efficiency
• Clock rate can be reduced dynamically to limit power consumption
Energy and Energy Efficiency
• Power: energy per unit time
  – 1 watt = 1 joule per second
  – Energy per task is often a better measurement
• Example: Processor A has 20% higher average power consumption than processor B, but A executes the task in only 70% of the time needed by B.
  – So the energy consumption of A will be 1.2 × 0.7 = 0.84 of B’s
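The energy comparison above can be checked in a few lines of Python (a minimal sketch; the two ratios are the numbers given on the slide):

```python
# Check of the example above: energy = average power x execution time,
# so only the two ratios given on the slide matter.
power_ratio = 1.2   # A draws 20% more average power than B
time_ratio = 0.7    # A needs only 70% of B's execution time
energy_ratio = power_ratio * time_ratio
print(energy_ratio)  # ~0.84 -> A consumes about 16% less energy than B
```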
Dynamic Energy and Power
• Dynamic energy
  – Consumed when a transistor switches from 0 -> 1 or 1 -> 0
  – Energy_dynamic ∝ Capacitive load × Voltage^2
• Dynamic power
  – Power_dynamic ∝ Capacitive load × Voltage^2 × Frequency
• Reducing clock rate reduces power, not energy
• The capacitive load:
  – a function of the number of transistors connected to an output and of the technology, which determines the capacitance of the wires and the transistors

An Example from Textbook (page #21)
• Suppose a new CPU has:
  – 85% of the capacitive load of the old CPU
  – 15% voltage reduction and 15% frequency reduction
P_new / P_old = [C_old × 0.85 × (V_old × 0.85)^2 × F_old × 0.85] / [C_old × V_old^2 × F_old] = 0.85^4 ≈ 0.52
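As a quick numeric check of this result (a sketch: dynamic power scales as C × V^2 × f, and each of C, V and f is scaled by 0.85 here):

```python
# Quick check of the textbook example: power scales as C * V^2 * f,
# and C, V and f are each scaled by 0.85, giving 0.85**4.
c_scale = 0.85
v_scale = 0.85
f_scale = 0.85
power_ratio = c_scale * v_scale**2 * f_scale
print(power_ratio)  # ~0.52
```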
Power Trends
• In CMOS IC technology:

  Power = Capacitive load × Voltage^2 × Frequency

• Clock rates grew about ×1000 while power grew only about ×30, largely because supply voltage dropped from 5V to 1V
Power
• The Intel 80386 consumed ~2 W
• A 3.3 GHz Intel Core i7 consumes 130 W
• Heat must be dissipated from a 1.5 x 1.5 cm chip
• This is the limit of what can be cooled by air
The Power Wall
• We can’t reduce voltage further
• We can’t remove more heat
• Techniques for reducing power:
  – Do nothing well
    • Turn off the clock of inactive modules
  – Dynamic Voltage-Frequency Scaling (DVFS)
  – Low-power states for DRAM, disks
  – Overclocking, turning off cores
Static Power
• Static power is consumed because leakage current flows even when a transistor is off
• Scales with the number of transistors
• Leakage can be as high as 50% of total power
  – In part because of large SRAM caches
• To reduce it: power gating
  – Turn off the power of inactive modules
Measuring Performance
• Typical performance metrics:
  – Response time
  – Throughput
• Speedup of X relative to Y:
  – Execution time_Y / Execution time_X
• Execution time
  – Wall clock time: includes all system overheads
  – CPU time: only computation time
• Benchmarks
  – Kernels (e.g. matrix multiply)
  – Toy programs (e.g. sorting)
  – Synthetic benchmarks (e.g. Dhrystone)
  – Benchmark suites (e.g. SPEC06fp, TPC-C)

Response Time and Throughput
• Response time
  – How long it takes to do a task
• Throughput
  – Total work done per unit time
    • e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
  – Replacing the processor with a faster version?
  – Adding more processors?
• We’ll focus on response time for now…
Relative Performance: Speedup
• Define Performance = 1 / Execution Time
• “X is n times faster than Y” means:

  Performance_X / Performance_Y = Execution time_Y / Execution time_X = n

• Example: time taken to run a program
  – 10s on A, 15s on B
  – Execution Time_B / Execution Time_A = 15s / 10s = 1.5
  – So A is 1.5 times faster than B
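The speedup definition above can be captured as a tiny helper (a sketch; any consistent time unit works, since only the ratio matters):

```python
# Speedup of X relative to Y, as defined on this slide:
# how many times faster X is than Y = time_Y / time_X.
def speedup(time_x, time_y):
    """Return the speedup of X relative to Y."""
    return time_y / time_x

print(speedup(10, 15))  # 1.5 -> A (10s) is 1.5 times faster than B (15s)
```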
Measuring Execution Time
• Elapsed time
  – Total response time, including all aspects
    • Processing, I/O, OS overhead, idle time
  – Determines system performance
• CPU time
  – Time spent processing a given job
    • Discounts I/O time and other jobs’ shares
  – Comprises user CPU time and system CPU time
  – Different programs are affected differently by CPU and system performance
  – Reported by the “time” command in Linux
CPU Clocking
• Operation of digital hardware is governed by a constant-rate clock

[Figure: clock waveform showing the clock period; within each cycle, data transfer and computation are followed by a state update]

• Clock period: duration of a clock cycle
  – e.g., 250ps = 0.25ns = 250 × 10^-12 s
• Clock frequency (rate): cycles per second
  – e.g., 4.0GHz = 4000MHz = 4.0 × 10^9 Hz
  – Corresponding clock period: 1/(4.0 × 10^9) s = 0.25ns

CPU Time
CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate

• Performance improved by
  – Reducing the number of clock cycles
  – Increasing the clock rate
  – Hardware designers must often trade off clock rate against cycle count
CPU Time Example
• Computer A: 2GHz clock, 10s CPU time
• Designing Computer B
  – Aim for 6s CPU time
  – Can use a faster clock, but that requires 1.2 × the clock cycles of A
• How fast must Computer B’s clock be?

  Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s
  Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20 × 10^9
  Clock Rate_B = (1.2 × 20 × 10^9) / 6s = (24 × 10^9) / 6s = 4GHz
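The same calculation, checked numerically (a sketch of the worked example above; times in seconds, rates in Hz):

```python
# Numeric check of the CPU time example: find the clock rate B needs.
time_a, rate_a = 10, 2e9      # Computer A: 10 s of CPU time at 2 GHz
cycles_a = time_a * rate_a    # 20e9 clock cycles on A
cycles_b = 1.2 * cycles_a     # B needs 1.2x the cycles of A
rate_b = cycles_b / 6         # target CPU time for B: 6 s
print(rate_b / 1e9)           # ~4.0 (GHz)
```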
Instruction Count and CPI
Clock Cycles = Instruction Count × Cycles per Instruction (CPI)
CPU Time = Instruction Count × CPI × Clock Cycle Time = (Instruction Count × CPI) / Clock Rate

• Instruction count for a program
  – Determined by program, ISA and compiler
• Average cycles per instruction (CPI)
  – Determined by CPU hardware
  – If different instructions have different CPIs, the average CPI is affected by the instruction mix
CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

  CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250ps = I × 500ps
  CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500ps = I × 600ps
  CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2

  A is faster than B by a factor of 1.2.
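A quick check of this comparison (a sketch; the instruction count I cancels out, so it is set to 1):

```python
# CPU time = IC x CPI x cycle time; compare the two machines above.
time_a = 1 * 2.0 * 250   # ps of CPU time per instruction on A
time_b = 1 * 1.2 * 500   # ps of CPU time per instruction on B
print(time_b / time_a)   # ~1.2 -> A is 1.2x faster than B
```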
CPI in More Detail
• If different instruction classes take different numbers of cycles:

  Clock Cycles = Σ (i = 1 to n) CPI_i × Instruction Count_i

• Weighted average CPI:

  CPI = Clock Cycles / Instruction Count = Σ (i = 1 to n) CPI_i × (Instruction Count_i / Instruction Count)

  where Instruction Count_i / Instruction Count is the relative frequency of class i.
CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C:

  Class              A   B   C
  CPI for class      1   2   3
  IC in sequence #1  2   1   2
  IC in sequence #2  4   1   1

• Sequence #1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence #2: IC = 6
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5
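The weighted-average CPI computation above can be sketched directly from the table (the dictionaries mirror the table rows):

```python
# Weighted-average CPI for the two code sequences in the table above.
cpi = {"A": 1, "B": 2, "C": 3}    # cycles per instruction, by class
seq1 = {"A": 2, "B": 1, "C": 2}   # instruction counts, sequence #1
seq2 = {"A": 4, "B": 1, "C": 1}   # instruction counts, sequence #2

def avg_cpi(counts):
    """Total clock cycles divided by total instruction count."""
    cycles = sum(cpi[cls] * n for cls, n in counts.items())
    return cycles / sum(counts.values())

print(avg_cpi(seq1))  # 2.0  (10 cycles / 5 instructions)
print(avg_cpi(seq2))  # 1.5  (9 cycles / 6 instructions)
```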
Performance Summary
The BIG Picture
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
• Performance depends on
  – Algorithm: affects IC, possibly CPI
  – Programming language: affects IC, CPI
  – Compiler: affects IC, CPI
  – Instruction set architecture: affects IC, CPI, Tc
SPEC CPU Benchmark
• Programs used to measure performance
  – Supposedly typical of actual workload
• Standard Performance Evaluation Corporation (SPEC)
  – Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
  – Elapsed time to execute a selection of programs
    • Negligible I/O, so it focuses on CPU performance
  – Normalized relative to a reference machine
  – Summarized as the geometric mean of performance ratios
    • CINT2006 (integer) and CFP2006 (floating-point)
  Geometric mean = (Π (i = 1 to n) Execution time ratio_i)^(1/n)
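The geometric-mean summary can be sketched in a few lines (the ratios below are made-up numbers, just to show the computation):

```python
# Geometric mean of execution-time ratios, as SPEC uses to summarize:
# the n-th root of the product of n ratios.
import math

def geometric_mean(ratios):
    """Return the geometric mean of a list of positive ratios."""
    return math.prod(ratios) ** (1 / len(ratios))

print(geometric_mean([2.0, 8.0]))  # 4.0
```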
Principles of Computer Design

• The Processor Performance Equation
Principles of Computer Design

• Different instruction types having different CPIs
Metrics of Performance
• Application: answers per day/month
• Programming language / compiler: millions of instructions per second (MIPS)
• ISA: millions of (FP) operations per second (MFLOP/s)
• Datapath / control: megabytes per second
• Function units / transistors, wires, pins: cycles per second (clock rate)
Impacts by Components on Inst Count, CPI and Clock Rate

                Inst Count   CPI   Clock Rate
  Program           X
  Compiler          X         (X)
  Inst. Set         X          X
  Architecture                 X        X
  Technology                            X
Principles of Computer Design
• Take Advantage of Parallelism
  – e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
• Principle of Locality
  – Reuse of data and instructions
• Focus on the Common Case
  – Amdahl’s Law
Amdahl’s Law
ExTime_new = ExTime_old × [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 - Fraction_enhanced)
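Amdahl’s Law translates directly into a small function (a sketch; the example numbers are illustrative, not from the slide):

```python
# Amdahl's Law: overall speedup when a fraction of execution time
# is enhanced by a given factor.
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup per Amdahl's Law."""
    return 1 / ((1 - fraction_enhanced)
                + fraction_enhanced / speedup_enhanced)

# Enhancing 80% of the execution time by 4x:
print(amdahl_speedup(0.8, 4))     # ~2.5
# As the enhancement grows without bound, the limit is 1/(1-0.8) = 5:
print(amdahl_speedup(0.8, 1e12))  # ~5.0
```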
Using Amdahl’s Law
Amdahl’s Law for Parallelism
• The enhanced fraction F is obtained through parallelism; assume perfect parallelism with linear speedup
  – The speedup for F is N on N processors
• Overall speedup:

  Speedup = 1 / [(1 - F) + F / N]

• Speedup upper bound (as N → ∞):

  Speedup_maximum = 1 / (1 - F)

  – 1 - F is the sequential portion of the program
Amdahl’s Law for Parallelism
Pitfall: Amdahl’s Law
• Improving one aspect of a computer and expecting a proportional improvement in overall performance

  T_improved = T_affected / Improvement factor + T_unaffected

• Example: multiply accounts for 80s out of a 100s total
  – How much improvement in multiply performance would give 5× overall?
    A 5× overall speedup means a total time of 20s, but 20 = 80/n + 20 has no solution
  – Can’t be done!
• Corollary: make the common case fast
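The pitfall can be seen numerically (a sketch of the example above): even an infinitely fast multiplier leaves the remaining 20 s untouched.

```python
# Multiply takes 80 s of a 100 s program; speed up only the multiply.
def total_time(improvement):
    """Total time after improving multiply by the given factor."""
    return 80 / improvement + 20

for n in (2, 10, 100, 1_000_000):
    print(n, total_time(n))
# Total time approaches 20 s but never reaches it, so a 5x overall
# speedup (total time of exactly 20 s) is unattainable.
```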
Exercise #1: Amdahl’s Law
Exercise #1: Amdahl’s Law solution
• Textbook page #47
Exercise #2: CPU time and speedup
Exercise #2: solution, textbook page 51