LowLow powerpower ArchitecturesArchitectures
LectureLecture #1:Introduction#1:Introduction Dr. Avi Mendelson [email protected] Contributors: Ronny Ronen, Eli Savransky, Shekhar Borkar, Fred PPollackollack
Dr. Avi Mendelson, Technion, EE department Intel
WhyWhy PowerPower meter?meter?
Dr. Avi Mendelson Page 2
1 MooreMoore’’ss LawLaw For many years technology obeys Moore’s Law Transistors Per Die 108 256M 64M 7 Memory ® 10 Memory 16M Pentium III Microprocessor 6 4M ® 10 1M Pentium II Pentium® Pro 5 256K Pentium® 10 64K i486™ 16K i386™ 104 4K 80286 1K 8086 103 8080 4004 102 101 100 ’70 ’73 ’76 ’79 ’82 ’85 ’88 ’91 ’94 '97 2000 Source: Intel Dr. Avi Mendelson Page 3
In the Last 25 Years Life was Easy(*) Doubling of transistor density every 30 months Increasing die sizes, allowed by – Increasing Wafer Size – Process technology moving from “black art” to “manufacturing science” ⇒ Doubling of transistors every 18 months
Tech Old µArch mm (linear) New µArch mm (linear) Ratio 1.0µ i386C 6.5 i486 11.5 Ratio3.1 0.7µ i486C 9.5 Pentium® 17 3.2 0.5µ Pentium® 12.2 Pentium® Pro 17.3 2.1 0.18µ Pentium® III 10.3 Next Gen ? 2--3
Implications: (in the same technology) (*) source Fred Pollack,
1. New µArch ~ 2-3X die area of the last µArch Micro-32 2. Provides 1.5-1.7X integer performance of the last µArch
Dr. Avi Mendelson Page 4
2 Suddenly,Suddenly, thethe powerpower monstermonster appearsappears
Dr. Avi Mendelson Page 5
The power crisis – power consumption
Sourse: cool- chips, Micro 32
Dr. Avi Mendelson Page 6
3 Processor Power Evolution
100 ? Pentium® II Pentium® Pro Pentium® 4
Pentium® III 10 Pentium®
Pentium® w/MMX tech. Max Power (Watts) Power Max i486i486
i386i386 1 1.5µ 1µ 0.8µ 0.6µ 0.35µ 0.25µ 0.18µ 0.13µ
New generation: always increase power Compactions: higher performance at lower power One size fits all: start with high power segment and shrink it to Mobile
Dr. Avi Mendelson Page 7
The power crisis: Power Density A real thread to the Moor law Think of watts/cm2 Power is not distributed evenly over the chip. A failure can happen if a single point reach the max power point. Complex algorithms lead to denser power: – Dense random logic Timing pressure leads to faster/bigger/power- hungrier gates – Designers put together units that communicate with each other. It creates “regions” with high activity factors -> hot spots.
Dr. Avi Mendelson Page 8
4 PowerPower DensityDensity
1000 Rocket Nozzle Nuclear Reactor Nozzle
2 100 Pentium® 4
Hot plate Pentium® III Pentium® II Watts/cm 10 Pentium® Pro i386 Pentium® i486 1 1.5µ 1µ 0.7µ 0.5µ 0.35µ 0.25µ 0.18µ 0.13µ 0.1µ 0.07µ
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference key note - 1999.
Dr. Avi Mendelson Page 9
SomeSome implicationsimplications
We can’t build microprocessors with ever increasing power density and die sizes The constraint is power – not manufacturability The design of any future micro-processor should take power into consideration. We need to distinguish between different aspects of power: Power delivery Max power (TJ) Power density - hot spots Energy – static + dynamic Power and Energy aware design should take care of each of these aspects One- size does not fit all anymore
Dr. Avi Mendelson Page 10
5 WhyWhy powerpower andand powerpower densitydensity increaseincrease overover timetime
Dr. Avi Mendelson Page 11
BasicBasic terminologyterminology
Dr. Avi Mendelson Page 12
6 Power and the digital world…
Power is consumed: – When capacitance is charged and discharged – A charged cap is a logical ‘1’, a discharged cap is ‘0’
E=CV2 IN OUT 01 10
The capacitance can be the gates of other transistors or wires (buses and long interconnects)
Dr. Avi Mendelson Page 13
Power and the digital world (2)…
Secondary effects like leakage and short-circuit current are increasing with advanced process technologies
IN OUT IN OUT 0 1 1/2
Leakage Short-circuit (sub-threshold)
Leakage is growing dramatically – 7% now, expect 20% in next process technology, 50% in next one – … Unless we do something (and we will)
Dr. Avi Mendelson Page 14
7 PowerPower && EnergyEnergy Energy “The capacity for doing work” * Important for – Battery life - lower energy per task Î longer battery life – Electric bills - lower energy per task Î lower bills Measured over time Proportional to the overall capacitance and to the voltage squared (CV2)
* Merriam-Webster’s Collegiate® Dictionary - http://www.m-w.com/
Dr. Avi Mendelson Page 15
PowerPower && EnergyEnergy Power Work done per time unit – Measured in Watts P = αCV2f (α: activity, C: capacitance, V: voltage, f: frequency) “Measured” at peak time Higher power Î higher current – Cannot exceed platform power delivery constrains Higher power Î higher temperature – Cannot exceed the thermal constrains
Dr. Avi Mendelson Page 16
8 Voltage, Power, Frequency Transistor switches faster at higher voltage Î Higher voltage enables higher frequency Maximum frequency grows about linearly with voltage …Within a given voltage range Vmin-Vmax
– V < Vmin 1000 Î transistors won’t switch XScale processor freq. & power vs. voltage * 900
– V > Vmax 800 Î the device may burn Fequency(Mhz) Î the device may burn 700 Power (mWatt)
“The cube law”: 600 P = kV3 500 (or ~1%V=3%P) 400 300
200
100
0 * Source: Intel Corp. (http://developer.intel.com) 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9
Dr. Avi Mendelson Page 17
The scalability theory – process technology
Target (Ideally):Eache generation (2-3 years) Reduce gate delay by 30% 50% freq gain 2. Increase density by 2x – 0.7 shrink on a side, 50% area reduction on compaction – Transistor Z and L shrink by 30% – Interconnect pitches shrink by ~30% – Add metal layers to make-up for (1) pitches < 30%, and (2) RC
Dr. Avi Mendelson Page 18
9 ScalingScaling theorytheory----11 ofof 22 = = = = = Width W 0.7, Length L 0.7, tox 0.7 ° Lateral and vertical dimensions reduce 30% 0.7 × 0.7 Area Cap = C a = = 0.7, 0.7
Fringing Cap = C f = 0.7, Total Cap ⇒ C = 0.7 ± Capacitance--area and fringing--reduce 30%
Die Area = X ×Y = 0.7 × 0.7 = 0.72 ² Die area reduces 50%
Dr. Avi Mendelson Page 19
ScalingScaling theorytheory----22 ofof 22 Cap 0.7 = = 0.7 Transistor 1 ³ Capacitance per transistor reduces 30%
Cap = 0.7 = 1 Area 0.7 × 0.7 0.7 ´ Capacitance per unit area increases 43% × = = = W − = 0.7 0.7 = Vdd 0.7, Vt 0.7, I (Vdd Vt ) 0.7 tox 0.7 C ×Vdd 0.7 × 0.7 0.7 × 0.72 T = = = 0.7, Power = C ×V 2 × f = = 0.72 I 0.7 0.7 µ Delay reduces 30%, power reduces 50%
Dr. Avi Mendelson Page 20
10 Process Technology – the Enabler
Every process generation (every 2-3 years), Ideally: – Shorten gate delay by 30% Î ~50% (100/70) frequency gain – Vdd scaled down by ~30% Î Results: 2 » 2/3 reduction in energy/transition (CV2Î 0.7 x 0.72 = 0.34X) 1 » 1/2 reduction in power (CV2fÎ0.7 x 0.72 x 1.5 =0.5X) » Power density unchanged
Dr. Avi Mendelson Page 21
IdealIdeal Scenarios...Scenarios...
Ideal “Shrink” Ideal New µarch – Same µarch – Same die size – 1X #Xistors – 2X #Xistors – 0.5X size – 1X size – 1.5X frequency – 1.5X frequency – 0.5X power – 1X power – 1X IPC (instr./cycle) – 2X IPC – 1.5X performance – 3X performance – 1X power density – 1X power density
Looks good. Isn’t it?
Dr. Avi Mendelson Page 22
11 Process Technologies – Reality
But in reality: – New designs squeeze frequency to 2X per process – New designs use more transistors (2X-3X to get 1.5X-1.7X perf) So, every new process and architecture generation: – Power goes up about 2X – Power density goes up 30%~80% This is bad, and… Will get worse in future process generations: – Voltage (Vdd) will scale down less – Leakage is going to the roof
Not as good as it first looked… Aha?
Dr. Avi Mendelson Page 23
In Picture…
Silicon Process Technology 1.5µ 1.0µ 0.8µ 0.6µ 0.35µ 0.25µ 0.18µ 0.13µ
Intel386™ DX Processor
Intel486™ DX Processor
Pentium® Processor
Pentium® Pro Processor
Pentium® II Processor
Pentium® III Processor
Pentium® 4 Processor
Dr. Avi Mendelson Page 24
12 Performance Efficiency of µarchitectures
Old mm New mm Area Perf Tech µArch (linear) µArch (linear) Ratio Ratio 1.0µ i386C 6.5 i486 11.5 3.1 0.7µ i486C 9.5 Pentium® proc 17 3.2 1.8 0.5µ Pentium® proc 12.2 Pentium Pro® proc 17.3 2.1 1.5 0.18µ Pentium III® proc 10.3 Pentium® 4 proc 14.7 2 1.4
3.5 Area Implications: (in the same technology) 3 Perf. 1. New µarch ~2-3X die area of the last µarch 2.5 2
2. Provides 1.4-1.8X integer performance of 1.5 the last µarch 1 0.5
0 486=>PP PP=>Ppro PIII=>P4P We are on the Wrong Side of a Square Law
Dr. Avi Mendelson Page 25
Power Evolution (Theoretical)
250250 100100 Leakage Power ) ) 2 Active Power 2 200200 Power Density 7575
150150 5050 Watts Watts 100100
2525 5050 Power Density (W/cm Power Density Power Density (W/cm Power Density 00 00 0.25µ0.25µ 0.18µ0.18µ 0.13µ0.13µ 0.1µ0.1µ For a 15mm/side die (225mm2) Assume 2X frequency increase each generation Future process numbers are estimated
Dr. Avi Mendelson Page 26
13 More aspects of power aware architectures
Dr. Avi Mendelson Page 27
The “Power Bottleneck” (Hot Spots) The Thermal Story Power Map On-Die Temperature
110 250 100 200 90
150 80
100 70 Heat Flux Heat 2)
60 Temperature 50 50 0 40
Silicon is not a good heat conductor With high power density, cannot assume power uniformity – High temperature Î high leakage Î high power Î higher temperature Artificially expanding the die size not help. Must attack the hot spots Smart layout that separates the hot units increases the processor’s power envelop! * “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference key note - 1999.
Dr. Avi Mendelson Page 28
14 The Mythical Power Envelop CPU “power envelop” – Maximum power that commercial cooling technologies can dissipate Limited by – Total system power – Processor power. Typical figures: Server <130W, Desktop 50-80W, Notebook 20-30W, sub-notebook <10W Bigger systems cool better and dissipate more power – Heat syncs – Heat pipes – Better TIM (Thermal Interface Materials) Average power density matters: – Uniformly distributed power allows for higher CPU dissipation
Dr. Avi Mendelson Page 29
Energy Efficiency
Energy per task – Proportional to # of processed insts. per task – Proportional to the average work consumed per instruction – Deteriorates as speculation increases and complexity grows Or Formally, per a given task, – Energy per retired instruction is: β*W, where » β: Ratio of Total to Retired number of processed instructions » W: Average energy spent in processing an instruction Both figures grow with every new micro-architecture In that respect: high performance modern micro-architectures are less energy-efficient Luckily, process technology offsets that by reducing energy per switch
Dr. Avi Mendelson Page 30
15 VoltageVoltage ScalingScaling
Within a given voltage range, higher voltage allows higher freq. Used for trading power and frequency. Either – Statically, at manufacturing time – Dynamically, at run time (e.g., Intel’s SpeedStep® Technology
Actual range depends on specific 1000 XScale proc. freq & power vs voltage design and process technology 900
Examples*: 800 Fequency(Mhz) – Intel® XScale™ processors runs 700 Power (mWatt)
from 0.75V (150MHz/50mW) 600
to 1.65V (800MHz/900mW) 500
– Intel mobile Pentium® III processor 400
sells from 1.1V (600MHz) 300 to 1.7V (1GHz) 200
100
0 * Source: Intel Corp. (http://developer.intel.com) 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9
Dr. Avi Mendelson Page 31
AlphaAlpha hothot spotsspots
Area 30% Freq. 50%
Source - CoolChips-99 Power 67%
Dr. Avi Mendelson Page 32
16 VoltageVoltage ScalingScaling (cont.)(cont.)
Huge effect on Power: 20% freq reduction Î 20% voltage reduction
Î 35% energy reduction. (αCV2 = αC*0.82 = αC*0.64)
Î 50% power reduction. (αCV2f = αC*0.83 = αC*0.51) Even more impressive if we recall: – 20% freq hit Î only 10%-15% performance hit* Voltage scaling can be used to trade performance with power!
* Depends mainly on core to bus frequency ratio and caches size.
Dr. Avi Mendelson Page 33
What is the impact of the computer architecture?
Dr. Avi Mendelson Page 34
17 Power and performance--trade-off
Voltage scaling alone is not enough to cap power So far, the analysis indicates that: You will have to tradeoff performance for power – Reduce die area => Reduce Active C => Reduce Perf(C) & Power(C) – Reduce Vdd and freq => Perf(freq) & Power(freq,Vdd2)--cubic – Reduce dies size, Vdd, and freq
Argument: ° Tradeoff performance for power does Goal #1 make sense? ± Set a goal that comprehends both: performance and power. ² Use energy delay product (E*D) to evaluate tradeoffs
Dr. Avi Mendelson Page 35
E*D product (lower is better) may provide a better criteria 4 1
3 E = energy / instruction 2
= Power * sec / instruction Delay
Energy (PJ) 1 = Watt / MIPS 0 0 D = sec / instruction 0123 D = sec / instruction Vdd (volts) = 1 / MIPS 400
300 2 E *D ~ Watt / MIPS E x D 200
100 0123 Vdd (volts)
Dr. Avi Mendelson Page 36
18 PowerPower segmentssegments
One size does not fit all: Embedded systems – Most of the power is consumed by the CPU – We are not thermally limited. – What we really care about is battery life. – In real time systems we can take advantage of known “deadlines” Laptops (Mobile systems) – We are thermally limited. – We can not use deadlines (most of the time). – We need to optimize for max battery life and max performance in a given power envelop. Desktops: – We mainly care about power awareness and Thermal issues
Dr. Avi Mendelson Page 37
Example:Example: AdderAdder DesignsDesigns FA Various algorithms exist to implement an integer adder – Ripple, select, skip (x2), Look-ahead, conditional-sum. – Each with its own characteristics of timing and power consumption. Ripple Carry Carry Select
FA FA FA FA FA FA FAFA FAFA 0 1
Variable/Fixed Width Carry Skip
Carry Look-ahead FA FA FA FA
FA FA FA FA
Dr. Avi Mendelson Page 38
19 Power and Delay Numbers FA
According to Callaway and Swartzlander*:
Energy Delay (pJ) (nSec) Ripple Carry 117 54.27 Constant Width Carry Skip 109 28.38 Variable Width Carry Skip 126 21.84 Carry Lookahead 171 17.13 Carry Select 216 19.56 Conditional Sum 304 20.05
If we must choose one option, as-is: – If power is the objective – use “constant width carry skip” – If delay is most important – use “carry look-ahead”
* “Estimating the power consumption of CMOS adders” - Callaway, T.K.; Swartzlander, E.E., Jr. 11th Symposium on Computer Arithmetic, 1993. Proceedings.
Dr. Avi Mendelson Page 39
Power Complexity Metrics Power α C V2 f Metrics: suppose we introduce new feature that consumes extra x power and gain y performance: 1. Power/Perf (Î Energy), assuming same technology (same C) and same voltage » For battery life, energy bills. » For a given power envelope – without voltage scaling. 2. Power/Perf2 (Î Energy*Delay) » Balance performance and power needs. 3. Power/Perf3 (Î Energy*Delay2) » For a given power envelope – with voltage scaling. assuming that we can (1) trade frequency and voltage scaling, and (2) we can lower the voltage as much as we wish
Dr. Avi Mendelson Page 40
20