
Lecture 02: Technology Trends and Quantitative Design and Analysis for Performance
CSE 564 Computer Architecture, Summer 2017
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
www.secs.oakland.edu/~yan

Contents
• Computers and computer components
• Computer architectures and great ideas in history and now
• Trends, cost and performance

Understanding Performance
• Algorithm
  – Determines the number of operations executed
• Programming language, compiler, architecture
  – Determine the number of machine instructions executed per operation
• Processor and memory system
  – Determine how fast instructions are executed
• I/O system (including OS)
  – Determines how fast I/O operations are executed

Below Your Program
• Application software
  – Written in a high-level language
• System software
  – Compiler: translates HLL code to machine code
  – Operating system: service code
    • Handling input/output
    • Managing memory and storage
    • Scheduling tasks & sharing resources
• Hardware
  – Processor, memory, I/O controllers

Levels of Program Code
• High-level language
  – Level of abstraction closer to the problem domain
  – Provides for productivity and portability
• Assembly language
  – Textual representation of instructions
• Hardware representation
  – Binary digits (bits)
  – Encoded instructions and data

Trends in Technology
• Integrated circuit technology
  – Transistor density: 35%/year
  – Die size: 10-20%/year
  – Integration overall: 40-55%/year
• DRAM capacity: 25-40%/year (slowing)
• Flash capacity: 50-60%/year
  – 15-20X cheaper/bit than DRAM
• Magnetic disk technology: 40%/year
  – 15-25X cheaper/bit than Flash
  – 300-500X cheaper/bit than DRAM

Bandwidth and Latency
• Bandwidth or throughput
  – Total work done in a given time
  – 10,000-25,000X improvement for processors
  – 300-1200X improvement for memory and disks
• Latency or response time
  – Time between the start and completion of an event
  – 30-80X improvement for processors
  – 6-8X improvement for memory and disks

End of Moore's Law?
• Cost per transistor is rising as transistor size continues to shrink

Power and Energy
• Problem:
  – Get power in and distribute it around
  – Get power out: dissipate heat
• Three primary concerns:
  – Maximum power requirement for a processor
  – Thermal Design Power (TDP)
    • Characterizes sustained power consumption
    • Used as the target for the power supply and cooling system
    • Lower than peak power, higher than average power consumption
  – Energy and energy efficiency
• Clock rate can be reduced dynamically to limit power consumption

Energy and Energy Efficiency
• Power: energy per unit time
  – 1 watt = 1 joule per second
  – Energy per task is often a better measurement
• Example: processor A has 20% higher average power consumption than processor B, but A executes the task in only 70% of the time needed by B
  – So the energy consumption of A will be 1.2 × 0.7 = 0.84 of B's

Dynamic Energy and Power
• Dynamic energy: consumed when a transistor switches from 0 → 1 or 1 → 0
  – Energy_dynamic ∝ Capacitive load × Voltage²
• Dynamic power
  – Power_dynamic ∝ 1/2 × Capacitive load × Voltage² × Frequency switched
• Reducing the clock rate reduces power, not energy
• The capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of the wires and the transistors (see the sketch below)
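The two relations above, together with the A-versus-B energy comparison, fit in a short sketch. This is a minimal Python illustration, not part of the lecture: the formulas are the CMOS approximations quoted on the slides, while the function names are mine and any concrete capacitance, voltage, or frequency values would be made-up inputs.

```python
# Sketch of the dynamic energy/power relations from the slides above.

def dynamic_energy(cap_load, voltage):
    """Energy per 0->1 or 1->0 transition: ~ 1/2 * C * V^2."""
    return 0.5 * cap_load * voltage ** 2

def dynamic_power(cap_load, voltage, frequency):
    """Average dynamic power: energy per switch times the switching rate, ~ 1/2 * C * V^2 * f."""
    return dynamic_energy(cap_load, voltage) * frequency

# Lowering the clock rate lowers power but not the energy per task:
# the task simply takes proportionally longer at the lower frequency.

# Energy-efficiency example from the slide above:
# processor A draws 20% more average power than B, but finishes in 70% of the time.
power_ratio_A_to_B = 1.2
time_ratio_A_to_B = 0.7
energy_ratio_A_to_B = power_ratio_A_to_B * time_ratio_A_to_B   # energy = power * time
print(energy_ratio_A_to_B)   # 0.84 -> A uses about 84% of B's energy for the task
```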
An Example from Textbook (page #21)

An Example from Textbook
• Suppose a new CPU has
  – 85% of the capacitive load of the old CPU
  – 15% voltage reduction and 15% frequency reduction
• Then:
  P_new / P_old = [C_old × 0.85 × (V_old × 0.85)² × F_old × 0.85] / [C_old × V_old² × F_old] = 0.85⁴ ≈ 0.52

Power Trends
• In CMOS IC technology:
  Power = Capacitive load × Voltage² × Frequency
  (across processor generations: power ×30, supply voltage 5 V → 1 V, clock frequency ×1000)

Power
• Intel 80386 consumed ~2 W
• A 3.3 GHz Intel Core i7 consumes 130 W
• Heat must be dissipated from a 1.5 cm × 1.5 cm chip
• This is the limit of what can be cooled by air

The Power Wall
• We can't reduce voltage further
• We can't remove more heat
• Techniques for reducing power:
  – Do nothing well
    • Turn off the clock of inactive modules
  – Dynamic voltage-frequency scaling (DVFS)
  – Low-power states for DRAM, disks
  – Overclocking, turning off cores

Static Power
• Because of leakage, current flows even when a transistor is off
• Scales with the number of transistors
• Leakage can be as high as 50%
  – In part because of large SRAM caches
• To reduce it: power gating
  – Turn off power to inactive modules

Measuring Performance
• Typical performance metrics:
  – Response time
  – Throughput
• Speedup of X relative to Y
  – Execution time_Y / Execution time_X
• Execution time
  – Wall clock time: includes all system overheads
  – CPU time: only computation time
• Benchmarks
  – Kernels (e.g. matrix multiply)
  – Toy programs (e.g. sorting)
  – Synthetic benchmarks (e.g. Dhrystone)
  – Benchmark suites (e.g. SPEC06fp, TPC-C)

Response Time and Throughput
• Response time
  – How long it takes to do a task
• Throughput
  – Total work done per unit time
    • e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
  – Replacing the processor with a faster version?
  – Adding more processors?
• We'll focus on response time for now…

Relative Performance: Speedup
• Define Performance = 1 / Execution time
• "X is n times faster than Y" means:
  Performance_X / Performance_Y = Execution time_Y / Execution time_X = n
• Example: a program takes 10 s on A and 15 s on B
  – Execution time_B / Execution time_A = 15 s / 10 s = 1.5
  – So A is 1.5 times faster than B

Measuring Execution Time
• Elapsed time
  – Total response time, including all aspects
    • Processing, I/O, OS overhead, idle time
  – Determines system performance
• CPU time
  – Time spent processing a given job
    • Discounts I/O time and other jobs' shares
  – Comprises user CPU time and system CPU time
  – Different programs are affected differently by CPU and system performance
  – "time" command in Linux

CPU Clocking
• Operation of digital hardware is governed by a constant-rate clock
  (clock waveform figure: within each clock period, data transfer and computation, then a state update)
• Clock period: duration of a clock cycle
  – e.g., 250 ps = 0.25 ns = 250 × 10⁻¹² s
• Clock frequency (rate): cycles per second
  – e.g., 4.0 GHz = 4000 MHz = 4.0 × 10⁹ Hz
  – Clock period: 1 / (4.0 × 10⁹) s = 0.25 ns

CPU Time
• CPU time = CPU clock cycles × Clock cycle time = CPU clock cycles / Clock rate
• Performance is improved by
  – Reducing the number of clock cycles
  – Increasing the clock rate
  – Hardware designers must often trade off clock rate against cycle count

CPU Time Example
• Computer A: 2 GHz clock, 10 s CPU time
• Designing Computer B
  – Aim for 6 s CPU time
  – Can use a faster clock, but that causes 1.2 × the clock cycles of A
• How fast must Computer B's clock be? (see the sketch after this slide)
  – Clock cycles_A = CPU time_A × Clock rate_A = 10 s × 2 GHz = 20 × 10⁹
  – Clock rate_B = Clock cycles_B / CPU time_B = (1.2 × Clock cycles_A) / 6 s = (1.2 × 20 × 10⁹) / 6 s = 24 × 10⁹ / 6 s = 4 GHz
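A short Python check of two worked examples above (the textbook power-reduction example and the CPU Time Example). It is only a sketch that re-evaluates the arithmetic already on the slides; the variable names are mine.

```python
# Re-compute the worked examples from the preceding slides.

# Power-reduction example: the new CPU has 85% of the capacitive load,
# 15% lower voltage and 15% lower frequency than the old one.
# P ~ C * V^2 * f, so the ratio collapses to 0.85^4.
power_ratio = 0.85 * 0.85**2 * 0.85
print(round(power_ratio, 2))                      # 0.52

# CPU Time Example: Computer A runs the program in 10 s at 2 GHz.
clock_rate_A = 2e9                                # Hz
cpu_time_A = 10.0                                 # s
clock_cycles_A = cpu_time_A * clock_rate_A        # 20e9 cycles

# Computer B must finish in 6 s but needs 1.2x as many cycles as A.
clock_cycles_B = 1.2 * clock_cycles_A
cpu_time_B = 6.0
clock_rate_B = clock_cycles_B / cpu_time_B
print(clock_rate_B / 1e9)                         # 4.0 -> B needs a 4 GHz clock
```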
Instruction Count and CPI
• Clock cycles = Instruction count × Cycles per instruction (CPI)
• CPU time = Instruction count × CPI × Clock cycle time = (Instruction count × CPI) / Clock rate
• Instruction count for a program
  – Determined by the program, the ISA and the compiler
• Average cycles per instruction
  – Determined by the CPU hardware
  – If different instructions have different CPIs
    • Average CPI is affected by the instruction mix

CPI Example
• Computer A: cycle time = 250 ps, CPI = 2.0
• Computer B: cycle time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
  – CPU time_A = Instruction count × CPI_A × Cycle time_A = I × 2.0 × 250 ps = I × 500 ps
  – CPU time_B = Instruction count × CPI_B × Cycle time_B = I × 1.2 × 500 ps = I × 600 ps
  – CPU time_B / CPU time_A = (I × 600 ps) / (I × 500 ps) = 1.2
  – So A is faster, by a factor of 1.2

CPI in More Detail
• If different instruction classes take different numbers of cycles:
  Clock cycles = Σᵢ (CPIᵢ × Instruction countᵢ)
• Weighted average CPI:
  CPI = Clock cycles / Instruction count = Σᵢ (CPIᵢ × Instruction countᵢ / Instruction count)
  where Instruction countᵢ / Instruction count is the relative frequency of class i

CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C (both CPI examples are re-computed in the sketch at the end):

  Class               A   B   C
  CPI for class       1   2   3
  IC in sequence #1   2   1   2
  IC in sequence #2   4   1   1

• Sequence #1: IC = 5
  – Clock cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence #2: IC = 6
  – Clock cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5

Performance Summary
• The BIG Picture:
  CPU time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
• Performance depends on
  – Algorithm: affects IC, possibly CPI
  – Programming language: affects IC, CPI
  – Compiler: affects IC, CPI
  – Instruction set architecture: affects IC, CPI, Tc

SPEC CPU Benchmark
• Programs used to measure performance
  – Supposedly typical of the actual workload
• Standard Performance Evaluation Corporation (SPEC)
  – Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
  – Elapsed time to execute a selection of programs
    • Negligible I/O, so it focuses on CPU performance
  – Normalize relative to a reference machine
  – Summarize as the geometric mean of the performance ratios:
    Geometric mean = (∏ᵢ₌₁ⁿ Execution time ratioᵢ)^(1/n)
    • CINT2006 (integer) and CFP2006 (floating-point)

Principles of Computer Design
• The processor performance equation:
  CPU time = Instruction count × Cycles per instruction × Clock cycle time

Principles of Computer Design
• Different instruction types having different CPIs:
  CPU clock cycles = Σᵢ (ICᵢ × CPIᵢ)

Metrics of Performance
• Application: answers per day/month
• Programming language, compiler, ISA: (millions of) instructions per second – MIPS; (millions of) FP operations per second – MFLOP/s
• Datapath, control: megabytes per second
• Function units, transistors, wires, pins: cycles per second (clock rate)

Impacts by Components
• How each component affects instruction count, CPI and clock rate (cycle time):

  Component          Inst. count   CPI   Clock rate
  Program            X
  Compiler           X             (X)
  Instruction set    X             X
  Architecture                     X     X
  Technology                             X

Principles of Computer Design
• Take advantage of parallelism
  – e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
• Principle of locality
  – Reuse of data and instructions
• Focus on the common case
  – Amdahl's Law

Amdahl's Law
• Execution time after an enhancement:
  ExTime_new = ExTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
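As a companion to the CPI slides above, here is a small Python sketch that recomputes the two CPI examples (the A-versus-B comparison and the two compiled sequences). It is illustrative only; every number comes from the slides, while the function and variable names are mine.

```python
# CPI example: same ISA, same instruction count I for both machines.
# Computer A: 250 ps cycle time, CPI = 2.0; Computer B: 500 ps cycle time, CPI = 1.2.
cycle_time_A, cpi_A = 250e-12, 2.0            # seconds per cycle, cycles per instruction
cycle_time_B, cpi_B = 500e-12, 1.2
time_per_instr_A = cpi_A * cycle_time_A       # 500 ps per instruction
time_per_instr_B = cpi_B * cycle_time_B       # 600 ps per instruction
print(time_per_instr_B / time_per_instr_A)    # 1.2 -> A is 1.2x faster

# Weighted-average CPI for the two compiled sequences
# (instruction classes A, B, C with per-class CPIs 1, 2, 3).
cpi_per_class = [1, 2, 3]
seq1 = [2, 1, 2]      # instruction counts per class, sequence #1 (IC = 5)
seq2 = [4, 1, 1]      # sequence #2 (IC = 6)

def avg_cpi(ic_per_class, cpi_per_class):
    """Weighted-average CPI = total clock cycles / total instruction count."""
    cycles = sum(ic * cpi for ic, cpi in zip(ic_per_class, cpi_per_class))
    return cycles / sum(ic_per_class)

print(avg_cpi(seq1, cpi_per_class))   # 10 cycles / 5 instructions = 2.0
print(avg_cpi(seq2, cpi_per_class))   # 9 cycles / 6 instructions = 1.5
```

And a minimal sketch of Amdahl's Law as stated on the last slide, assuming the standard textbook formulation (overall speedup is the old execution time over the new one); the 50% / 10× example values are made up for illustration and are not from the lecture.

```python
def amdahl(ex_time_old, fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: only the enhanced fraction of the old execution time is sped up."""
    ex_time_new = ex_time_old * ((1 - fraction_enhanced)
                                 + fraction_enhanced / speedup_enhanced)
    overall_speedup = ex_time_old / ex_time_new
    return ex_time_new, overall_speedup

# Illustrative (made-up) numbers: half of the run time is enhanced by a factor of 10.
new_time, speedup = amdahl(ex_time_old=100.0, fraction_enhanced=0.5, speedup_enhanced=10.0)
print(new_time, round(speedup, 2))    # 55.0 1.82 -> overall speedup stays well below 10x
```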