
Lecture 02: Technology Trends and Quantitative Design and Analysis for Performance

CSE 564 Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan [email protected] www.secs.oakland.edu/~yan

1 Contents

• Computers and computer components
• Computer architectures and great ideas in history and now
• Trends, Cost and Performance

2 Understanding Performance

• Algorithm
 – Determines number of operations executed
• Programming language, compiler, architecture
 – Determine number of machine instructions executed per operation
• Processor and memory system
 – Determine how fast instructions are executed
• I/O system (including OS)
 – Determines how fast I/O operations are executed

3 Below Your Program

• Application software
 – Written in high-level language
• System software
 – Compiler: translates HLL code to machine code
 – Operating System: service code
  • Handling input/output
  • Managing memory and storage
  • Scheduling tasks & sharing resources
• Hardware
 – Processor, memory, I/O controllers

4 Levels of Program Code

• High-level language
 – Level of abstraction closer to problem domain
 – Provides for productivity and portability
• Assembly language
 – Textual representation of instructions
• Hardware representation
 – Binary digits (bits)
 – Encoded instructions and data

5 Trends in Technology

• Integrated circuit technology
 – Transistor density: 35%/year
 – Die size: 10-20%/year
 – Integration overall: 40-55%/year

• DRAM capacity: 25-40%/year (slowing)

• Flash capacity: 50-60%/year
 – 15-20X cheaper/bit than DRAM

• Magnetic disk technology: 40%/year
 – 15-25X cheaper/bit than Flash
 – 300-500X cheaper/bit than DRAM
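These annual rates compound quickly. A small illustrative sketch (the helper name is mine, not from the lecture) of how an annual improvement rate translates into a multi-year factor:

```python
# Illustrative compounding of the per-year improvement rates quoted above.
def growth_over_years(rate_per_year, years):
    """Total improvement factor after compounding an annual rate."""
    return (1.0 + rate_per_year) ** years

# 35%/year transistor density sustained for a decade:
print(round(growth_over_years(0.35, 10), 1))  # ~20.1x
```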

6 Bandwidth and Latency

• Bandwidth or throughput
 – Total work done in a given time
 – 10,000-25,000X improvement for processors
 – 300-1200X improvement for memory and disks

• Latency or response time
 – Time between start and completion of an event
 – 30-80X improvement for processors
 – 6-8X improvement for memory and disks

7 End of Moore’s Law?

Cost per transistor is rising as transistor size continues to shrink

8 Power and Energy

• Problem:
 – Getting power in and distributing it around the chip
 – Getting power out: dissipating heat

• Three primary concerns:
 – Max power requirement for a processor
 – Thermal Design Power (TDP)
  • Characterizes sustained power consumption
  • Used as target for power supply and cooling system
  • Lower than peak power, higher than average power consumption
 – Energy and energy efficiency

• Clock rate can be reduced dynamically to limit power consumption

9 Energy and Energy Efficiency

• Power: energy per unit time
 – 1 watt = 1 joule per second
 – Energy per task is often a better measurement

• Processor A has 20% higher average power consumption than processor B, but A executes the task in only 70% of the time needed by B.
 – So the energy consumption of A will be 1.2 × 0.7 = 0.84 of B's
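A quick numeric check of the slide's example, using energy = average power × execution time (the function name is my own):

```python
# Energy = average power x execution time, so the energy ratio of A to B
# is just the product of the power ratio and the time ratio.
def relative_energy(power_ratio, time_ratio):
    """Energy of A relative to B given A/B power and time ratios."""
    return power_ratio * time_ratio

# A: 20% more power (1.2x), but only 70% of B's execution time (0.7x):
print(round(relative_energy(1.2, 0.7), 2))  # 0.84 -> A uses less energy
```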

10 Dynamic Energy and Power

• Dynamic energy
 – Consumed when a transistor switches 0 -> 1 or 1 -> 0
 – Energy ∝ Capacitive load × Voltage^2 (for a full 0->1->0 pulse; a single transition takes half)

• Dynamic power
 – Power ∝ 1/2 × Capacitive load × Voltage^2 × Frequency switched

• Reducing clock rate reduces power, not energy
• The capacitive load:
 – a function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors

11 An Example from Textbook page #21

12 An Example from Textbook

• Suppose a new CPU has
 – 85% of capacitive load of old CPU
 – 15% voltage reduction and 15% frequency reduction

Pnew / Pold = [(Cold × 0.85) × (Vold × 0.85)^2 × (Fold × 0.85)] / (Cold × Vold^2 × Fold) = 0.85^4 ≈ 0.52
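Since dynamic power scales as C × V² × f, scaling each factor by 0.85 multiplies power by 0.85⁴ (voltage is squared). A sketch of that arithmetic (helper name is mine):

```python
# Dynamic power scales as C * V^2 * f, so the new/old power ratio is the
# product of the individual scale factors, with voltage squared.
def dynamic_power_ratio(c_scale, v_scale, f_scale):
    return c_scale * (v_scale ** 2) * f_scale

print(round(dynamic_power_ratio(0.85, 0.85, 0.85), 2))  # 0.52
```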

13 Power Trends

• In CMOS IC technology

Power = Capacitive load × Voltage^2 × Frequency

[Figure: over ~30 years, clock rates grew ×1000 while power grew only ×30, as supply voltage dropped from 5V to 1V]

14 Power

• 80386 consumed ~2 W
• 3.3 GHz i7 consumes 130 W
• Heat must be dissipated from a 1.5 cm × 1.5 cm chip
• This is the limit of what can be cooled by air

15 The Power Wall

• We can't reduce voltage further
• We can't remove more heat

• Techniques for reducing power:
 – Do nothing well
  • Turn off clock of inactive modules
 – Dynamic Voltage-Frequency Scaling
 – Low power state for DRAM, disks
 – Overclocking, turning off cores

16 Static Power

• Leakage current flows even when a transistor is off

• Scales with number of transistors

• Leakage can be as high as 50% of total power
 – In part because of large SRAM caches

• To reduce: power gating
 – Turn off power to inactive modules

17 Measuring Performance

• Typical performance metrics:
 – Response time
 – Throughput

• Speedup of X relative to Y
 – Execution time_Y / Execution time_X

• Execution time
 – Wall clock time: includes all system overheads
 – CPU time: only computation time

• Benchmarks
 – Kernels (e.g. matrix multiply)
 – Toy programs (e.g. sorting)
 – Synthetic benchmarks

 – Benchmark suites (e.g. SPEC06fp, TPC-C)

18 Response Time and Throughput

• Response time
 – How long it takes to do a task
• Throughput
 – Total work done per unit time
  • e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
 – Replacing the processor with a faster version?
 – Adding more processors?
• We'll focus on response time for now…

19 Relative Performance: Speedup

• Define Performance = 1/Execution Time
• "X is n times faster than Y":

 Performance_X / Performance_Y = Execution time_Y / Execution time_X = n

• Example: time taken to run a program
 – 10s on A, 15s on B
 – Execution Time_B / Execution Time_A = 15s / 10s = 1.5
 – So A is 1.5 times faster than B
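The speedup definition above can be transcribed directly (the function name is mine):

```python
# Speedup of X over Y is the ratio of Y's execution time to X's.
def speedup(exec_time_y, exec_time_x):
    """How many times faster X is than Y."""
    return exec_time_y / exec_time_x

print(speedup(15.0, 10.0))  # 1.5 -> A is 1.5 times faster than B
```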

20 Measuring Execution Time

• Elapsed time
 – Total response time, including all aspects
  • Processing, I/O, OS overhead, idle time
 – Determines system performance

• CPU time
 – Time spent processing a given job
  • Discounts I/O time and other jobs' shares
 – Comprises user CPU time and system CPU time
 – Different programs are affected differently by CPU and system performance
 – "time" command in Linux

21 CPU Clocking

• Operation of digital hardware is governed by a constant-rate clock: in each cycle, data transfer and computation are followed by a state update

[Figure: clock waveform showing the clock period across cycles]

• Clock period: duration of a clock cycle
 – e.g., 250ps = 0.25ns = 250×10^-12 s
• Clock frequency (rate): cycles per second
 – e.g., 4.0GHz = 4000MHz = 4.0×10^9 Hz
 – Clock period: 1/(4.0×10^9) s = 0.25ns

22 CPU Time

CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate

• Performance improved by
 – Reducing number of clock cycles
 – Increasing clock rate
 – Hardware designer must often trade off clock rate against cycle count

23 CPU Time Example

• Computer A: 2GHz clock, 10s CPU time
• Designing Computer B
 – Aim for 6s CPU time
 – Can do faster clock, but causes 1.2 × clock cycles of A
• How fast must Computer B's clock be?

Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10^9

Clock Rate_B = (1.2 × 20×10^9) / 6s = 24×10^9 / 6s = 4GHz
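The same arithmetic can be checked numerically:

```python
# Reproducing the slide's example: A runs the program in 10 s at 2 GHz;
# B must finish in 6 s but needs 1.2x as many clock cycles as A.
cycles_a = 10 * 2e9        # CPU time_A x clock rate_A = 20e9 cycles
cycles_b = 1.2 * cycles_a  # B's cycle count
rate_b = cycles_b / 6      # required clock rate for a 6 s CPU time, in Hz
print(round(rate_b / 1e9, 3))  # 4.0 (GHz)
```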

24 Instruction Count and CPI

Clock Cycles = Instruction Count × CPI
CPU Time = Instruction Count × CPI × Clock Cycle Time = Instruction Count × CPI / Clock Rate

• Instruction Count for a program
 – Determined by program, ISA and compiler
• Average cycles per instruction (CPI)
 – Determined by CPU hardware
 – If different instructions have different CPI
  • Average CPI affected by instruction mix

25 CPI Example

• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

 CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250ps = I × 500ps   (A is faster…)
 CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500ps = I × 600ps
 CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2   (…by this much)
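Because both machines execute the same instruction count I, the comparison reduces to time per instruction; a quick check:

```python
# Per-instruction time for each machine; the instruction count I cancels.
cpi_a, cycle_a = 2.0, 250e-12  # Computer A
cpi_b, cycle_b = 1.2, 500e-12  # Computer B
t_a = cpi_a * cycle_a          # 500 ps per instruction
t_b = cpi_b * cycle_b          # 600 ps per instruction
print(round(t_b / t_a, 2))     # 1.2 -> A is 1.2x faster
```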

26 CPI in More Detail

• If different instruction classes take different numbers of cycles:

 Clock Cycles = Σ(i=1..n) (CPI_i × Instruction Count_i)

• Weighted average CPI:

 CPI = Clock Cycles / Instruction Count = Σ(i=1..n) (CPI_i × Instruction Count_i / Instruction Count)

 where Instruction Count_i / Instruction Count is the relative frequency of class i

27 CPI Example

• Alternative compiled code sequences using instructions in classes A, B, C

 Class              A  B  C
 CPI for class      1  2  3
 IC in sequence #1  2  1  2
 IC in sequence #2  4  1  1

• Sequence #1: IC = 5
 – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
 – Avg. CPI = 10/5 = 2.0
• Sequence #2: IC = 6
 – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
 – Avg. CPI = 9/6 = 1.5
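Both sequences can be recomputed with the weighted-CPI formula (the helper name is my own):

```python
# Total cycles and average CPI from per-class instruction counts and CPIs.
def total_cycles_and_cpi(inst_counts, class_cpis):
    cycles = sum(ic * cpi for ic, cpi in zip(inst_counts, class_cpis))
    return cycles, cycles / sum(inst_counts)

class_cpis = [1, 2, 3]                              # classes A, B, C
print(total_cycles_and_cpi([2, 1, 2], class_cpis))  # (10, 2.0) for seq #1
print(total_cycles_and_cpi([4, 1, 1], class_cpis))  # (9, 1.5)  for seq #2
```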

28 Performance Summary

The BIG Picture

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

• Performance depends on
 – Algorithm: affects IC, possibly CPI
 – Programming language: affects IC, CPI
 – Compiler: affects IC, CPI
 – Instruction set architecture: affects IC, CPI, Tc

29 SPEC CPU Benchmark

• Programs used to measure performance
 – Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
 – Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
 – Elapsed time to execute a selection of programs
  • Negligible I/O, so focuses on CPU performance
 – Normalize relative to reference machine
 – Summarize as geometric mean of performance ratios
  • CINT2006 (integer) and CFP2006 (floating-point)

Geometric mean = (Π(i=1..n) Execution time ratio_i)^(1/n)
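The geometric mean of normalized ratios can be sketched as follows; the input ratios here are made-up illustrative numbers, not real SPEC results:

```python
# Geometric mean of normalized execution-time ratios, as SPEC uses to
# summarize a benchmark suite into a single figure.
import math

def geometric_mean(ratios):
    return math.prod(ratios) ** (1.0 / len(ratios))

print(geometric_mean([2.0, 8.0]))  # 4.0
```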

30 Principles of Computer Design

• The Processor Performance Equation:

 CPU time = Instruction count × Cycles per instruction × Clock cycle time

31 Principles of Computer Design

• Different instruction types having different CPIs:

 CPU clock cycles = Σ(i=1..n) (IC_i × CPI_i)

32 Metrics of Performance

[Figure: levels of the machine and the metric natural to each]
 – Application: Answers per day/month
 – Programming Language / Compiler: (millions of) Instructions per second: MIPS
 – ISA: (millions of) (FP) operations per second: MFLOP/s
 – Control: Megabytes per second
 – Function Units / Transistors / Wires / Pins: Cycles per second (clock rate)

33 Impacts by Components

               Inst Count   CPI   Clock Rate
 Program           X
 Compiler          X         (X)
 Inst. Set         X          X
 Architecture                 X        X
 Technology                            X

34 Principles of Computer Design

• Take Advantage of Parallelism
 – e.g. multiple processors, disks, memory banks, pipelining, multiple functional units

• Principle of Locality
 – Reuse of data and instructions

• Focus on the Common Case – Amdahl’s Law

35 Amdahl's Law

ExTime_new = ExTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 − Fraction_enhanced)
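The overall-speedup formula transcribes directly into code (function name is mine):

```python
# Amdahl's Law: overall speedup when a fraction of execution time is
# enhanced by a given factor.
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Enhancing 50% of execution time by 10x gives well under 2x overall:
print(round(amdahl_speedup(0.5, 10.0), 3))  # 1.818
```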

36 Using Amdahl’s Law

37 Amdahl’s Law for Parallelism

• The enhanced fraction F is sped up through parallelism; assume perfect parallelism with linear speedup
 – The speedup for F is N on N processors
• Overall speedup:

 Speedup = 1 / [(1 − F) + F/N]

• Speedup upper bound (when N → ∞): 1/(1 − F)
 – 1 − F: the sequential portion of a program
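A sketch of the parallel form, showing how speedup saturates as processors are added (function name is mine):

```python
# Amdahl's Law for parallelism: fraction F runs perfectly on N processors.
def parallel_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# With F = 0.9 the speedup saturates near the 1/(1-F) = 10x upper bound:
for n in (10, 100, 10**6):
    print(round(parallel_speedup(0.9, n), 2))  # 5.26, 9.17, 10.0
```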

38 Amdahl’s Law for Parallelism

39 Pitfall: Amdahl's Law

• Improving an aspect of a computer and expecting a proportional improvement in overall performance

T_improved = T_affected / improvement factor + T_unaffected

• Example: multiply accounts for 80s of a 100s program
• How much improvement in multiply performance to get 5× overall?
 – 5× overall means a 100s/5 = 20s total, so 20 = 80/n + 20
 – That requires 80/n = 0: can't be done!
• Corollary: make the common case fast
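The impossibility is easy to see numerically; even an enormous multiply speedup leaves the untouched 20 s (function name is mine):

```python
# Pitfall check: multiply takes 80 s of a 100 s run; the other 20 s is
# untouched no matter how fast multiply gets.
def improved_time(t_affected, t_unaffected, factor):
    return t_affected / factor + t_unaffected

for factor in (10, 100, 10**6):
    print(improved_time(80, 20, factor))  # approaches 20 s, never below it
```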

40 Exercise #1: Amdahl’s Law

41 Exercise #1: Amdahl's Law solution

• Textbook page #47

42 Exercise #2: CPU time and speedup

43 Exercise #2: solution, textbook page 51
