
Lecture 02: Technology Trends and Quantitative Design and Analysis for Performance

CSE 564 Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan [email protected] www.secs.oakland.edu/~yan

1 Contents

• Computers and computer components
• Computer architectures and great ideas in history and now
• Trends, Cost and Performance

2 Understanding Performance

• Algorithm
 – Determines number of operations executed
• Programming language, compiler, architecture
 – Determine number of machine instructions executed per operation
• Processor and memory system
 – Determine how fast instructions are executed
• I/O system (including OS)
 – Determines how fast I/O operations are executed

3 Below Your Program

• Application software
 – Written in high-level language
• System software
 – Compiler: translates HLL code to machine code
 – Operating System: service code
  • Handling input/output
  • Managing memory and storage
  • Scheduling tasks & sharing resources
• Hardware
 – Processor, memory, I/O controllers

4 Levels of Program Code

• High-level language
 – Level of abstraction closer to problem domain
 – Provides for productivity and portability
• Assembly language
 – Textual representation of instructions
• Hardware representation
 – Binary digits (bits)
 – Encoded instructions and data

5 Trends in Technology

• Integrated circuit technology
 – Transistor density: 35%/year
 – Die size: 10-20%/year
 – Integration overall: 40-55%/year

• DRAM capacity: 25-40%/year (slowing)

• Flash capacity: 50-60%/year
 – 15-20X cheaper/bit than DRAM

• Magnetic disk technology: 40%/year
 – 15-25X cheaper/bit than Flash
 – 300-500X cheaper/bit than DRAM
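These annual rates compound quickly. A small illustrative sketch (the helper name is mine, not from the lecture) of how an annual improvement rate translates into a multi-year factor:

```python
# Illustrative compounding of the per-year improvement rates quoted above.
def growth_over_years(rate_per_year, years):
    """Total improvement factor after compounding an annual rate."""
    return (1.0 + rate_per_year) ** years

# 35%/year transistor density sustained for a decade:
print(round(growth_over_years(0.35, 10), 1))  # ~20.1x
```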

6 Bandwidth and Latency

• Bandwidth or throughput
 – Total work done in a given time
 – 10,000-25,000X improvement for processors
 – 300-1200X improvement for memory and disks

• Latency or response time
 – Time between start and completion of an event
 – 30-80X improvement for processors
 – 6-8X improvement for memory and disks

7 End of Moore’s Law?

Cost per transistor is rising as transistor size continues to shrink

8 Power and Energy

• Problem:
 – Getting power in and distributing it around the chip
 – Getting power out: dissipating heat

• Three primary concerns:
 – Max power requirement for a processor
 – Thermal Design Power (TDP)
  • Characterizes sustained power consumption
  • Used as target for power supply and cooling system
  • Lower than peak power, higher than average power consumption
 – Energy and energy efficiency

• Clock rate can be reduced dynamically to limit power consumption

9 Energy and Energy Efficiency

• Power: energy per unit time
 – 1 watt = 1 joule per second
 – Energy per task is often a better measurement

• Processor A has 20% higher average power consumption than processor B, but A executes the task in only 70% of the time needed by B.
 – So the energy consumption of A will be 1.2 × 0.7 = 0.84 of B's
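A quick numeric check of the slide's example, using energy = average power × execution time (the function name is my own):

```python
# Energy = average power x execution time, so the energy ratio of A to B
# is just the product of the power ratio and the time ratio.
def relative_energy(power_ratio, time_ratio):
    """Energy of A relative to B given A/B power and time ratios."""
    return power_ratio * time_ratio

# A: 20% more power (1.2x), but only 70% of B's execution time (0.7x):
print(round(relative_energy(1.2, 0.7), 2))  # 0.84 -> A uses less energy
```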

10 Dynamic Energy and Power

• Dynamic energy
 – Consumed when a transistor switches 0 -> 1 or 1 -> 0
 – Energy ∝ Capacitive load × Voltage^2 (for a full 0->1->0 pulse; a single transition takes half)

• Dynamic power
 – Power ∝ 1/2 × Capacitive load × Voltage^2 × Frequency switched

• Reducing clock rate reduces power, not energy
• The capacitive load:
 – a function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors

11 An Example from Textbook page #21

12 An Example from Textbook

• Suppose a new CPU has
 – 85% of capacitive load of old CPU
 – 15% voltage reduction and 15% frequency reduction

Pnew / Pold = [(Cold × 0.85) × (Vold × 0.85)^2 × (Fold × 0.85)] / (Cold × Vold^2 × Fold) = 0.85^4 ≈ 0.52
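Since dynamic power scales as C × V² × f, scaling each factor by 0.85 multiplies power by 0.85⁴ (voltage is squared). A sketch of that arithmetic (helper name is mine):

```python
# Dynamic power scales as C * V^2 * f, so the new/old power ratio is the
# product of the individual scale factors, with voltage squared.
def dynamic_power_ratio(c_scale, v_scale, f_scale):
    return c_scale * (v_scale ** 2) * f_scale

print(round(dynamic_power_ratio(0.85, 0.85, 0.85), 2))  # 0.52
```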

13 Power Trends

• In CMOS IC technology

Power = Capacitive load × Voltage^2 × Frequency

[Figure: over ~30 years, clock rates grew ×1000 while power grew only ×30, as supply voltage dropped from 5V to 1V]

14 Power

• 80386 consumed ~2 W
• 3.3 GHz i7 consumes 130 W
• Heat must be dissipated from a 1.5 cm × 1.5 cm chip
• This is the limit of what can be cooled by air

15 The Power Wall

• We can't reduce voltage further
• We can't remove more heat

• Techniques for reducing power:
 – Do nothing well
  • Turn off clock of inactive modules
 – Dynamic Voltage-Frequency Scaling
 – Low power state for DRAM, disks
 – Overclocking, turning off cores

16 Static Power

• Leakage current flows even when a transistor is off

• Scales with number of transistors

• Leakage can be as high as 50% of total power
 – In part because of large SRAM caches

• To reduce: power gating
 – Turn off power to inactive modules

17 Measuring Performance

• Typical performance metrics:
 – Response time
 – Throughput

• Speedup of X relative to Y
 – Execution time_Y / Execution time_X

• Execution time
 – Wall clock time: includes all system overheads
 – CPU time: only computation time

• Benchmarks
 – Kernels (e.g. matrix multiply)
 – Toy programs (e.g. sorting)
 – Synthetic benchmarks

 – Benchmark suites (e.g. SPEC06fp, TPC-C)

18 Response Time and Throughput

• Response time
 – How long it takes to do a task
• Throughput
 – Total work done per unit time
  • e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
 – Replacing the processor with a faster version?
 – Adding more processors?
• We'll focus on response time for now…

19 Relative Performance: Speedup

• Define Performance = 1/Execution Time
• "X is n times faster than Y":

 Performance_X / Performance_Y = Execution time_Y / Execution time_X = n

• Example: time taken to run a program
 – 10s on A, 15s on B
 – Execution Time_B / Execution Time_A = 15s / 10s = 1.5
 – So A is 1.5 times faster than B
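The speedup definition above can be transcribed directly (the function name is mine):

```python
# Speedup of X over Y is the ratio of Y's execution time to X's.
def speedup(exec_time_y, exec_time_x):
    """How many times faster X is than Y."""
    return exec_time_y / exec_time_x

print(speedup(15.0, 10.0))  # 1.5 -> A is 1.5 times faster than B
```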

20 Measuring Execution Time

• Elapsed time
 – Total response time, including all aspects
  • Processing, I/O, OS overhead, idle time
 – Determines system performance

• CPU time
 – Time spent processing a given job
  • Discounts I/O time and other jobs' shares
 – Comprises user CPU time and system CPU time
 – Different programs are affected differently by CPU and system performance
 – "time" command in Linux

21 CPU Clocking

• Operation of digital hardware is governed by a constant-rate clock: in each cycle, data transfer and computation are followed by a state update

[Figure: clock waveform showing the clock period across cycles]

• Clock period: duration of a clock cycle
 – e.g., 250ps = 0.25ns = 250×10^-12 s
• Clock frequency (rate): cycles per second
 – e.g., 4.0GHz = 4000MHz = 4.0×10^9 Hz
 – Clock period: 1/(4.0×10^9) s = 0.25ns

22 CPU Time

CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate

• Performance improved by
 – Reducing number of clock cycles
 – Increasing clock rate
 – Hardware designer must often trade off clock rate against cycle count

23 CPU Time Example

• Computer A: 2GHz clock, 10s CPU time
• Designing Computer B
 – Aim for 6s CPU time
 – Can do faster clock, but causes 1.2 × clock cycles of A
• How fast must Computer B's clock be?

Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10^9

Clock Rate_B = (1.2 × 20×10^9) / 6s = 24×10^9 / 6s = 4GHz
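The same arithmetic can be checked numerically:

```python
# Reproducing the slide's example: A runs the program in 10 s at 2 GHz;
# B must finish in 6 s but needs 1.2x as many clock cycles as A.
cycles_a = 10 * 2e9        # CPU time_A x clock rate_A = 20e9 cycles
cycles_b = 1.2 * cycles_a  # B's cycle count
rate_b = cycles_b / 6      # required clock rate for a 6 s CPU time, in Hz
print(round(rate_b / 1e9, 3))  # 4.0 (GHz)
```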

24 Instruction Count and CPI

Clock Cycles = Instruction Count × CPI
CPU Time = Instruction Count × CPI × Clock Cycle Time = Instruction Count × CPI / Clock Rate

• Instruction Count for a program
 – Determined by program, ISA and compiler
• Average cycles per instruction (CPI)
 – Determined by CPU hardware
 – If different instructions have different CPI
  • Average CPI affected by instruction mix

25 CPI Example

• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

 CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250ps = I × 500ps   (A is faster…)
 CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500ps = I × 600ps
 CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2   (…by this much)
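Because both machines execute the same instruction count I, the comparison reduces to time per instruction; a quick check:

```python
# Per-instruction time for each machine; the instruction count I cancels.
cpi_a, cycle_a = 2.0, 250e-12  # Computer A
cpi_b, cycle_b = 1.2, 500e-12  # Computer B
t_a = cpi_a * cycle_a          # 500 ps per instruction
t_b = cpi_b * cycle_b          # 600 ps per instruction
print(round(t_b / t_a, 2))     # 1.2 -> A is 1.2x faster
```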

26 CPI in More Detail

• If different instruction classes take different numbers of cycles:

 Clock Cycles = Σ(i=1..n) (CPI_i × Instruction Count_i)

• Weighted average CPI:

 CPI = Clock Cycles / Instruction Count = Σ(i=1..n) (CPI_i × Instruction Count_i / Instruction Count)

 where Instruction Count_i / Instruction Count is the relative frequency of class i

27 CPI Example

• Alternative compiled code sequences using instructions in classes A, B, C

 Class              A  B  C
 CPI for class      1  2  3
 IC in sequence #1  2  1  2
 IC in sequence #2  4  1  1

• Sequence #1: IC = 5
 – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
 – Avg. CPI = 10/5 = 2.0
• Sequence #2: IC = 6
 – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
 – Avg. CPI = 9/6 = 1.5
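Both sequences can be recomputed with the weighted-CPI formula (the helper name is my own):

```python
# Total cycles and average CPI from per-class instruction counts and CPIs.
def total_cycles_and_cpi(inst_counts, class_cpis):
    cycles = sum(ic * cpi for ic, cpi in zip(inst_counts, class_cpis))
    return cycles, cycles / sum(inst_counts)

class_cpis = [1, 2, 3]                              # classes A, B, C
print(total_cycles_and_cpi([2, 1, 2], class_cpis))  # (10, 2.0) for seq #1
print(total_cycles_and_cpi([4, 1, 1], class_cpis))  # (9, 1.5)  for seq #2
```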

28 Performance Summary

The BIG Picture

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

• Performance depends on
 – Algorithm: affects IC, possibly CPI
 – Programming language: affects IC, CPI
 – Compiler: affects IC, CPI
 – Instruction set architecture: affects IC, CPI, Tc

29 SPEC CPU Benchmark

• Programs used to measure performance
 – Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
 – Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
 – Elapsed time to execute a selection of programs
  • Negligible I/O, so focuses on CPU performance
 – Normalize relative to reference machine
 – Summarize as geometric mean of performance ratios
  • CINT2006 (integer) and CFP2006 (floating-point)

Geometric mean = (Π(i=1..n) Execution time ratio_i)^(1/n)
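The geometric mean of normalized ratios can be sketched as follows; the input ratios here are made-up illustrative numbers, not real SPEC results:

```python
# Geometric mean of normalized execution-time ratios, as SPEC uses to
# summarize a benchmark suite into a single figure.
import math

def geometric_mean(ratios):
    return math.prod(ratios) ** (1.0 / len(ratios))

print(geometric_mean([2.0, 8.0]))  # 4.0
```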

30 Principles of Computer Design

• The Processor Performance Equation:

 CPU time = Instruction count × Cycles per instruction × Clock cycle time

31 Principles of Computer Design

• Different instruction types having different CPIs:

 CPU clock cycles = Σ(i=1..n) (IC_i × CPI_i)

32 Metrics of Performance

[Figure: levels of the machine and the metric natural to each]
 – Application: Answers per day/month
 – Programming Language / Compiler: (millions of) Instructions per second: MIPS
 – ISA: (millions of) (FP) operations per second: MFLOP/s
 – Control: Megabytes per second
 – Function Units / Transistors / Wires / Pins: Cycles per second (clock rate)

33 Impacts by Components

               Inst Count   CPI   Clock Rate
 Program           X
 Compiler          X         (X)
 Inst. Set         X          X
 Architecture                 X        X
 Technology                            X

34 Principles of Computer Design

• Take Advantage of Parallelism
 – e.g. multiple processors, disks, memory banks, pipelining, multiple functional units

• Principle of Locality
 – Reuse of data and instructions

• Focus on the Common Case – Amdahl’s Law

35 Amdahl's Law

ExTime_new = ExTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 − Fraction_enhanced)
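The overall-speedup formula transcribes directly into code (function name is mine):

```python
# Amdahl's Law: overall speedup when a fraction of execution time is
# enhanced by a given factor.
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Enhancing 50% of execution time by 10x gives well under 2x overall:
print(round(amdahl_speedup(0.5, 10.0), 3))  # 1.818
```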

36 Using Amdahl’s Law

37 Amdahl’s Law for Parallelism

• The enhanced fraction F is sped up through parallelism; assume perfect parallelism with linear speedup
 – The speedup for F is N on N processors
• Overall speedup:

 Speedup = 1 / [(1 − F) + F/N]

• Speedup upper bound (when N → ∞): 1/(1 − F)
 – 1 − F: the sequential portion of a program
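A sketch of the parallel form, showing how speedup saturates as processors are added (function name is mine):

```python
# Amdahl's Law for parallelism: fraction F runs perfectly on N processors.
def parallel_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# With F = 0.9 the speedup saturates near the 1/(1-F) = 10x upper bound:
for n in (10, 100, 10**6):
    print(round(parallel_speedup(0.9, n), 2))  # 5.26, 9.17, 10.0
```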

38 Amdahl’s Law for Parallelism

39 Pitfall: Amdahl's Law

• Improving an aspect of a computer and expecting a proportional improvement in overall performance

T_improved = T_affected / improvement factor + T_unaffected

• Example: multiply accounts for 80s of a 100s program
• How much improvement in multiply performance to get 5× overall?
 – 5× overall means a 100s/5 = 20s total, so 20 = 80/n + 20
 – That requires 80/n = 0: can't be done!
• Corollary: make the common case fast
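The impossibility is easy to see numerically; even an enormous multiply speedup leaves the untouched 20 s (function name is mine):

```python
# Pitfall check: multiply takes 80 s of a 100 s run; the other 20 s is
# untouched no matter how fast multiply gets.
def improved_time(t_affected, t_unaffected, factor):
    return t_affected / factor + t_unaffected

for factor in (10, 100, 10**6):
    print(improved_time(80, 20, factor))  # approaches 20 s, never below it
```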

40 Exercise #1: Amdahl’s Law

41 Exercise #1: Amdahl's Law solution

• Textbook page #47

42 Exercise #2: CPU time and speedup

43 Exercise #2: solution, textbook page 51
