Integrated Circuits, Performance, Power

Alexander Nelson
January 15, 2021
University of Arkansas - Department of Computer Science and Computer Engineering

Integrated Circuits

Technologies Over Time

Transistors and Integrated Circuits
Transistors: On/off switches controlled by an electric signal
Integrated Circuit: Many transistors on a single chip
VLSI/ULSI (very/ultra-large-scale integration): Chips containing thousands, millions, or billions of transistors
Intel 8080 (1974): 6,000 transistors
AMD Ryzen 9 3900X (2019): 9.89B transistors
45 years = 9.89 × 10^9 / 6,000 ≈ 1,648,333 times more transistors

Semiconductors
Semiconductor: A solid substance with conductivity between that of an insulator and a conductor
Silicon: A natural semiconductor that can be chemically modified to become:
• A conductor
• An insulator
• Areas that can either conduct or insulate, acting as a switch

Manufacturing Process
Yield: Proportion of working dies per wafer

Intel Core i7 Wafer
300 mm wafer, 280 chips, 32 nm technology
Each chip is 20.7 × 10.5 mm

Integrated Circuit Cost
Cost has a nonlinear relation to die area and defect rate:
• Wafer cost and area are fixed
• Defect rate is determined by the manufacturing process
• Die area is determined by the architecture and circuit design

Performance

How do you define CPU performance?

Example: Which airplane has the best performance?

Response Time vs. Throughput
Response Time: How long it takes to finish a task
Throughput: Total work done per unit time
• e.g., tasks/transactions per hour
How are these two metrics affected by:
• Replacing the processor with a faster version?
• Adding more processors?
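The Integrated Circuit Cost slide states the cost relation only qualitatively. The sketch below assumes the simple textbook model used in Patterson & Hennessy's Computer Organization and Design, which these slides do not spell out: cost per die = cost per wafer / (dies per wafer × yield), dies per wafer ≈ wafer area / die area, and yield = 1 / (1 + defects per area × die area / 2)². The function names are hypothetical, and the wafer cost and defect density in the usage lines are made-up illustrative values; only the 300 mm wafer and 20.7 × 10.5 mm die come from the slides.

```python
import math


def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> float:
    """Approximate dies per wafer as wafer area / die area.

    This ignores partial dies lost at the round wafer edge, so it
    overestimates the real count.
    """
    wafer_area_mm2 = math.pi * (wafer_diameter_mm / 2) ** 2
    return wafer_area_mm2 / die_area_mm2


def die_yield(defects_per_mm2: float, die_area_mm2: float) -> float:
    """Empirical yield model: 1 / (1 + defects per area * die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_mm2 * die_area_mm2 / 2.0) ** 2


def cost_per_die(wafer_cost: float, wafer_diameter_mm: float,
                 die_area_mm2: float, defects_per_mm2: float) -> float:
    """Cost per die = cost per wafer / (dies per wafer * yield)."""
    good_dies = dies_per_wafer(wafer_diameter_mm, die_area_mm2) * die_yield(
        defects_per_mm2, die_area_mm2)
    return wafer_cost / good_dies


if __name__ == "__main__":
    # Die size from the Core i7 wafer slide; cost and defect density are hypothetical.
    area = 20.7 * 10.5  # mm^2
    wafer_cost = 5000.0  # hypothetical wafer cost in dollars
    defects = 0.002  # hypothetical defects per mm^2
    small = cost_per_die(wafer_cost, 300, area, defects)
    large = cost_per_die(wafer_cost, 300, 2 * area, defects)
    print(f"dies per wafer (no edge loss): {dies_per_wafer(300, area):.0f}")
    print(f"yield: {die_yield(defects, area):.2f}")
    print(f"cost per die: ${small:.2f}; with doubled die area: ${large:.2f}")
```

Under these assumptions, doubling the die area more than doubles the cost per die, because it both halves the number of dies per wafer and lowers the yield; this is the nonlinearity the slide refers to. The estimate of about 325 dies is higher than the 280 chips on the real wafer because edge losses are ignored.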

Relative Performance
If performance is defined as:
$$\text{Performance} = \frac{1}{\text{Execution Time}}$$
Then:
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n$$
This can be phrased as "Processor X is n times faster than Processor Y".
Example: Assume a program runs in:
• 10 s on Processor A
• 15 s on Processor B
Then, taking X = A and Y = B:
$$\frac{\text{Execution Time}_B}{\text{Execution Time}_A} = \frac{15}{10} = 1.5$$
So, A is 1.5 times faster than B.

Execution Time
How do you measure execution time?
Elapsed Real Time: Total response time, including all aspects
• Processing, I/O, OS overhead, idle time
• Determines system performance
CPU Time: Time spent processing a given job
• Discounts I/O time and other jobs' shares
• Comprises user CPU time and system CPU time
Different programs are affected differently by CPU and system performance.

CPU Clocking
Nearly all CPUs are governed by a clock.
Clock Period: Duration of a clock cycle (in seconds)
Clock Frequency: Number of clock cycles per second (in hertz)
Clock Frequency = 1 / Clock Period (e.g., a 250 ps period corresponds to 4.0 GHz)

CPU Time
CPU time can be defined as:
$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$
How to improve performance?
• Reduce the number of clock cycles per operation or per program
• Increase the clock rate
Hardware designers may need to trade off clock rate against cycle count.

CPU Time Example
Computer A: A 2 GHz clock frequency results in 10 s of CPU time.
Design Computer B such that:
• It achieves 6 s of CPU time
• It may increase the clock frequency, but doing so causes 1.2× as many clock cycles
$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$
$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^9$$
$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^9}{6\,\text{s}} = \frac{24 \times 10^9}{6\,\text{s}} = 4\,\text{GHz}$$

Instruction Counts and CPI
$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles Per Instruction}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$
Instruction Count: Number of instructions executed for a particular program. It depends on:
• Program
• ISA
• Compiler
Average Cycles Per Instruction (CPI): Determined by the ISA and CPU hardware
• Different instructions may have different CPIs
• Average CPI is affected by the mix of instruction classes in the program

CPI Example
Computer A: Cycle Time = 250 ps, CPI = 2.0
Computer B: Cycle Time = 500 ps, CPI = 1.2
Both use the same ISA. Which is faster? By how much?
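The CPU time relations above can be checked numerically. This is a minimal sketch; cpu_time and clock_rate_for are hypothetical helper names, not from the slides. For the CPI Example, the 250 ps and 500 ps cycle times are converted to 4 GHz and 2 GHz clock rates, and the instruction count is an arbitrary placeholder: because both computers share an ISA, it is the same for both and cancels out of the ratio.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time (s) = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def clock_rate_for(clock_cycles: float, target_cpu_time_s: float) -> float:
    """Clock rate (Hz) needed to execute clock_cycles in target_cpu_time_s."""
    return clock_cycles / target_cpu_time_s


if __name__ == "__main__":
    # CPU Time Example: Computer A takes 10 s at 2 GHz.
    cycles_a = 10.0 * 2e9  # CPU time x clock rate = 20 x 10^9 cycles
    rate_b = clock_rate_for(1.2 * cycles_a, 6.0)  # B needs 1.2x the cycles in 6 s
    print(f"Clock rate B: {rate_b / 1e9:.1f} GHz")

    # CPI Example: 250 ps cycle = 4 GHz, 500 ps cycle = 2 GHz.
    ic = 1e9  # arbitrary instruction count (same ISA, so it cancels out)
    time_a = cpu_time(ic, 2.0, 4e9)
    time_b = cpu_time(ic, 1.2, 2e9)
    print(f"A is {time_b / time_a:.1f}x faster than B")
```

Computing a ratio of CPU times rather than absolute times mirrors the relative-performance definition: only the product CPI × cycle time matters when the instruction count is shared.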
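
CPI Example (continued)
$$\text{CPU Time}_A = \text{Instruction Count} \times 2.0 \times 250\,\text{ps} = \text{Instruction Count} \times 500\,\text{ps}$$
$$\text{CPU Time}_B = \text{Instruction Count} \times 1.2 \times 500\,\text{ps} = \text{Instruction Count} \times 600\,\text{ps}$$
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{600\,\text{ps}}{500\,\text{ps}} = 1.2$$
A is faster, by a relative performance of 1.2.

CPI in Detail
Different instruction classes take different numbers of cycles:
$$\text{Clock Cycles} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \text{Instruction Count}_i\right)$$
Weighted average CPI:
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$
The ratio Instruction Count_i / Instruction Count is the relative frequency of instruction class i.

CPI Example
[Figure: CPI example]

Performance Summary
Performance depends on:
• Algorithm: affects instruction count, possibly CPI
• Programming language: affects instruction count and CPI
• Compiler: affects instruction count and CPI
• ISA: affects instruction count, CPI, and time per cycle
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock Cycle}}$$

Power

Power Trends
Intel Core i5-9400 9th Gen (2019): 14 nm, 2.90-4.1 GHz, 65 W
Why isn't clock frequency continuing to increase?

Battery Life Isn't Keeping Up

Dynamic Power Consumption
CMOS transistors: complementary metal-oxide-semiconductor
• Best performance per watt since 1976
• Gradually being replaced by FinFET technology (can go below 20 nm)
The primary source of energy consumption is dynamic energy, giving a dynamic power of:
$$\text{Power} \propto \frac{1}{2} \times \text{Capacitive Load} \times \text{Voltage}^2 \times \text{Frequency Switched}$$
Reducing voltage from 5 V to 1 V allowed a 1000× increase in clock frequency with only a 30× increase in power consumption.

Reducing Power
Suppose a new CPU has:
• 85% of the capacitive load of the old CPU
• A 15% voltage reduction and a 15% frequency reduction
Then:
$$\frac{P_{new}}{P_{old}} = \frac{C_{old} \times 0.85 \times (V_{old} \times 0.85)^2 \times F_{old} \times 0.85}{C_{old} \times V_{old}^2 \times F_{old}} = 0.85^4 = 0.52$$
The new CPU uses 52% of the power of the old CPU.
However:
• We can't reduce voltage further
• We can't remove more heat
How else can we improve performance?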
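The Reducing Power calculation can be reproduced with a short script. This is a minimal sketch in normalized units: the old CPU's capacitive load, voltage, and frequency are all taken as 1, so the 1/2 factor and the unknown constant of proportionality cancel in the ratio. The helper names dynamic_power and power_ratio are hypothetical.

```python
def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Dynamic power, proportional to 1/2 x C x V^2 x f (relative units)."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


def power_ratio(cap_scale: float, voltage_scale: float, freq_scale: float) -> float:
    """P_new / P_old when C, V, and f are each scaled relative to the old CPU."""
    old = dynamic_power(1.0, 1.0, 1.0)
    new = dynamic_power(cap_scale, voltage_scale, freq_scale)
    return new / old


if __name__ == "__main__":
    # Reducing Power example: 85% load, 15% lower voltage, 15% lower frequency.
    ratio = power_ratio(0.85, 0.85, 0.85)
    print(f"P_new / P_old = {ratio:.2f}")
    # Voltage enters squared, so a 15% voltage cut alone removes about 28% of the power.
    print(f"Voltage-only reduction: {power_ratio(1.0, 0.85, 1.0):.2f}")
```

The second line shows why voltage scaling was so effective historically: it is the only term that enters the power equation squared.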

Uniprocessor Performance
Since 2003, uniprocessor performance has been constrained by power, instruction-level parallelism, and memory latency.

Multiprocessors
Multicore microprocessors: More than one processor per chip
Requires explicitly parallel programming
• Compare with instruction-level parallelism
  • Hardware executes multiple instructions at once
  • Hidden from the programmer!
• Hard to do
  • Programming for performance
  • Load balancing
  • Optimizing communication and synchronization

Benchmarking Processors
SPEC: System Performance Evaluation Cooperative
• Funded by computer vendors to provide a standard set of benchmarks
[Figure: SPECINTC2006 benchmarks on 2.66 GHz Intel Core i7 920]

SPEC Power Benchmark
Power consumption of a server at different workload levels
• Performance: ssj_ops/sec
• Power: Watts (Joules/sec)
$$\text{Overall ssj\_ops per Watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \div \left(\sum_{i=0}^{10} \text{power}_i\right)$$

Power SPEC
[Figure: SPECpower_ssj2008 for Xeon X5650]

Pitfall: Amdahl's Law
Improving one aspect of a computer does not guarantee a proportional improvement in overall performance.
Amdahl's Law:
$$T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}$$
Example: Multiply accounts for 80 of the 100 seconds of execution time. How much must multiply performance improve to get a 5× overall improvement?
A 5× overall improvement means the program must run in 100 / 5 = 20 s, so:
$$20 = \frac{80}{n} + 20$$
This requires 80 / n = 0, so it can't be done!
Corollary: Make the common case fast!
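Amdahl's Law can also be solved for the required improvement factor n, which makes the "can't be done" case explicit. This is a minimal sketch with hypothetical helper names; the 4× target is an extra illustration added here, not part of the slide's example.

```python
from typing import Optional


def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Amdahl's Law: T_improved = T_affected / factor + T_unaffected."""
    return t_affected / factor + t_unaffected


def required_factor(t_affected: float, t_unaffected: float,
                    target_time: float) -> Optional[float]:
    """Solve target_time = t_affected / n + t_unaffected for n.

    Returns None when the target is at or below t_unaffected: no finite
    improvement of the affected part can reach it.
    """
    remaining = target_time - t_unaffected
    if remaining <= 0:
        return None
    return t_affected / remaining


if __name__ == "__main__":
    total, multiply = 100.0, 80.0
    rest = total - multiply
    # 5x overall speedup -> 20 s target, equal to the unaffected 20 s.
    print("5x overall:", required_factor(multiply, rest, total / 5))
    # 4x overall speedup -> 25 s target.
    print("4x overall:", required_factor(multiply, rest, total / 4))
    # Even a huge multiply speedup leaves the 20 s of unaffected time.
    print("1,000,000x multiply:", improved_time(multiply, rest, 1e6))
```

The 5× case returns None, while a 4× overall speedup needs multiply to become 16× faster; in both cases the unaffected 20 s sets a hard floor, which is why the common case should be made fast.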
