LowLow powerpower ArchitecturesArchitectures

LectureLecture #1:Introduction#1:Introduction Dr. Avi Mendelson [email protected] Contributors: Ronny Ronen, Eli Savransky, Shekhar Borkar, Fred PPollackollack

Dr. Avi Mendelson, Technion, EE department

WhyWhy PowerPower meter?meter?

Dr. Avi Mendelson 2

1 MooreMoore’’ss LawLaw For many years technology obeys Moore’s Law Per Die 108 256M 64M 7 Memory ® 10 Memory 16M III 6 4M ® 10 1M Pentium II Pentium® Pro 5 256K Pentium® 10 64K ™ 16K ™ 104 4K 80286 1K 8086 103 8080 4004 102 101 100 ’70 ’73 ’76 ’79 ’82 ’85 ’88 ’91 ’94 '97 2000 Source: Intel Dr. Avi Mendelson Page 3

In the Last 25 Years Life was Easy(*) „ Doubling of density every 30 months „ Increasing die sizes, allowed by – Increasing Wafer Size – Process technology moving from “black art” to “manufacturing science” „ ⇒ Doubling of transistors every 18 months

Tech Old µArch mm (linear) New µArch mm (linear) Ratio 1.0µ i386C 6.5 i486 11.5 Ratio3.1 0.7µ i486C 9.5 Pentium® 17 3.2 0.5µ Pentium® 12.2 Pentium® Pro 17.3 2.1 0.18µ Pentium® III 10.3 Next Gen ? 2--3

Implications: (in the same technology) (*) source Fred Pollack,

1. New µArch ~ 2-3X die area of the last µArch Micro-32 2. Provides 1.5-1.7X integer performance of the last µArch

Dr. Avi Mendelson Page 4

2 Suddenly,Suddenly, thethe powerpower monstermonster appearsappears

Dr. Avi Mendelson Page 5

The power crisis – power consumption

Sourse: cool- chips, Micro 32

Dr. Avi Mendelson Page 6

3 Processor Power Evolution

100 ? Pentium® II Pentium® Pro Pentium® 4

Pentium® III 10 Pentium®

Pentium® w/MMX tech. Max Power (Watts) Power Max i486i486

i386i386 1 1.5µ 1µ 0.8µ 0.6µ 0.35µ 0.25µ 0.18µ 0.13µ

New generation: always increase power Compactions: higher performance at lower power One size fits all: start with high power segment and shrink it to Mobile

Dr. Avi Mendelson Page 7

The power crisis: Power Density A real thread to the Moor law „ Think of watts/cm2 „ Power is not distributed evenly over the chip. A failure can happen if a single point reach the max power point. „ Complex algorithms lead to denser power: – Dense random logic „ Timing pressure leads to faster/bigger/power- hungrier gates – Designers put together units that communicate with each other. It creates “regions” with high activity factors -> hot spots.

Dr. Avi Mendelson Page 8

4 PowerPower DensityDensity

1000 Rocket Nozzle Nuclear Reactor Nozzle

2 100 Pentium® 4

Hot plate Pentium® III Pentium® II Watts/cm 10 Pentium® Pro i386 Pentium® i486 1 1.5µ 1µ 0.7µ 0.5µ 0.35µ 0.25µ 0.18µ 0.13µ 0.1µ 0.07µ

* “New Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference key note - 1999.

Dr. Avi Mendelson Page 9

SomeSome implicationsimplications

„ We can’t build with ever increasing power density and die sizes „ The constraint is power – not manufacturability „ The design of any future micro-processor should take power into consideration. We need to distinguish between different aspects of power: „Power delivery „ Max power (TJ) „Power density - hot spots „ Energy – static + dynamic „ Power and Energy aware design should take care of each of these aspects One- size does not fit all anymore

Dr. Avi Mendelson Page 10

5 WhyWhy powerpower andand powerpower densitydensity increaseincrease overover timetime

Dr. Avi Mendelson Page 11

BasicBasic terminologyterminology

Dr. Avi Mendelson Page 12

6 Power and the digital world…

„ Power is consumed: – When capacitance is charged and discharged – A charged cap is a logical ‘1’, a discharged cap is ‘0’

E=CV2 IN OUT 01 10

„ The capacitance can be the gates of other transistors or wires (buses and long interconnects)

Dr. Avi Mendelson Page 13

Power and the digital world (2)…

„ Secondary effects like leakage and short-circuit current are increasing with advanced process technologies

IN OUT IN OUT 0 1 1/2

Leakage Short-circuit (sub-threshold)

„ Leakage is growing dramatically – 7% now, expect 20% in process technology, 50% in next one – … Unless we do something (and we will)

Dr. Avi Mendelson Page 14

7 PowerPower && EnergyEnergy Energy „ “The capacity for doing work” * „ Important for – Battery life - lower energy per task Î longer battery life – Electric bills - lower energy per task Î lower bills „ Measured over time „ Proportional to the overall capacitance and to the voltage squared (CV2)

* Merriam-Webster’s Collegiate® Dictionary - http://www.m-w.com/

Dr. Avi Mendelson Page 15

PowerPower && EnergyEnergy Power „ Work done per time unit – Measured in Watts „P = αCV2f (α: activity, C: capacitance, V: voltage, f: frequency) „ “Measured” at peak time „ Higher power Î higher current – Cannot exceed platform power delivery constrains „ Higher power Î higher temperature – Cannot exceed the thermal constrains

Dr. Avi Mendelson Page 16

8 Voltage, Power, Frequency „ Transistor switches faster at higher voltage Î Higher voltage enables higher frequency „ Maximum frequency grows about linearly with voltage …Within a given voltage range Vmin-Vmax

– V < Vmin 1000 Î transistors won’t switch XScale processor freq. & power vs. voltage * 900

– V > Vmax 800 Î the device may burn Fequency(Mhz) Î the device may burn 700 Power (mWatt)

„ “The cube law”: 600 P = kV3 500 (or ~1%V=3%P) 400 300

200

100

0 * Source: Intel Corp. (http://developer.intel.com) 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9

Dr. Avi Mendelson Page 17

The scalability theory – process technology

Target (Ideally):Eache generation (2-3 years) „ Reduce gate delay by 30% 50% freq gain „ 2. Increase density by 2x – 0.7 shrink on a side, 50% area reduction on compaction – Transistor Z and L shrink by 30% – Interconnect pitches shrink by ~30% – Add metal layers to make-up for (1) pitches < 30%, and (2) RC

Dr. Avi Mendelson Page 18

9 ScalingScaling theorytheory----11 ofof 22 = = = = = Width W 0.7, Length L 0.7, tox 0.7 ° Lateral and vertical dimensions reduce 30% 0.7 × 0.7 Area Cap = C a = = 0.7, 0.7

Fringing Cap = C f = 0.7, Total Cap ⇒ C = 0.7 ± Capacitance--area and fringing--reduce 30%

Die Area = X ×Y = 0.7 × 0.7 = 0.72 ² Die area reduces 50%

Dr. Avi Mendelson Page 19

ScalingScaling theorytheory----22 ofof 22 Cap 0.7 = = 0.7 Transistor 1 ³ Capacitance per transistor reduces 30%

Cap = 0.7 = 1 Area 0.7 × 0.7 0.7 ´ Capacitance per unit area increases 43% × = = = W − = 0.7 0.7 = Vdd 0.7, Vt 0.7, I (Vdd Vt ) 0.7 tox 0.7 C ×Vdd 0.7 × 0.7 0.7 × 0.72 T = = = 0.7, Power = C ×V 2 × f = = 0.72 I 0.7 0.7 µ Delay reduces 30%, power reduces 50%

Dr. Avi Mendelson Page 20

10 Process Technology – the Enabler

„ Every process generation (every 2-3 years), Ideally: – Shorten gate delay by 30% Î ~50% (100/70) frequency gain – Vdd scaled down by ~30% Î Results: 2 » 2/3 reduction in energy/transition (CV2Î 0.7 x 0.72 = 0.34X) 1 » 1/2 reduction in power (CV2fÎ0.7 x 0.72 x 1.5 =0.5X) » Power density unchanged

Dr. Avi Mendelson Page 21

IdealIdeal Scenarios...Scenarios...

„ Ideal “Shrink” „ Ideal New µarch – Same µarch – Same die size – 1X #Xistors – 2X #Xistors – 0.5X size – 1X size – 1.5X frequency – 1.5X frequency – 0.5X power – 1X power – 1X IPC (instr./cycle) – 2X IPC – 1.5X performance – 3X performance – 1X power density – 1X power density

Looks good. Isn’t it?

Dr. Avi Mendelson Page 22

11 Process Technologies – Reality

„ But in reality: – New designs squeeze frequency to 2X per process – New designs use more transistors (2X-3X to get 1.5X-1.7X perf) „ So, every new process and architecture generation: – Power goes up about 2X – Power density goes up 30%~80% „ This is bad, and… „ Will get worse in future process generations: – Voltage (Vdd) will scale down less – Leakage is going to the roof

Not as good as it first looked… Aha?

Dr. Avi Mendelson Page 23

In Picture…

Silicon Process Technology 1.5µ 1.0µ 0.8µ 0.6µ 0.35µ 0.25µ 0.18µ 0.13µ

Intel386™ DX Processor

Intel486™ DX Processor

Pentium® Processor

Pentium® Pro Processor

Pentium® II Processor

Pentium® III Processor

Pentium® 4 Processor

Dr. Avi Mendelson Page 24

12 Performance Efficiency of µarchitectures

Old mm New mm Area Perf Tech µArch (linear) µArch (linear) Ratio Ratio 1.0µ i386C 6.5 i486 11.5 3.1 0.7µ i486C 9.5 Pentium® proc 17 3.2 1.8 0.5µ Pentium® proc 12.2 ® proc 17.3 2.1 1.5 0.18µ Pentium III® proc 10.3 Pentium® 4 proc 14.7 2 1.4

3.5 Area Implications: (in the same technology) 3 Perf. 1. New µarch ~2-3X die area of the last µarch 2.5 2

2. Provides 1.4-1.8X integer performance of 1.5 the last µarch 1 0.5

0 486=>PP PP=>Ppro PIII=>P4P We are on the Wrong Side of a Square Law

Dr. Avi Mendelson Page 25

Power Evolution (Theoretical)

250250 100100 Leakage Power ) ) 2 Active Power 2 200200 Power Density 7575

150150 5050 Watts Watts 100100

2525 5050 Power Density (W/cm Power Density Power Density (W/cm Power Density 00 00 0.25µ0.25µ 0.18µ0.18µ 0.13µ0.13µ 0.1µ0.1µ For a 15mm/side die (225mm2) Assume 2X frequency increase each generation Future process numbers are estimated

Dr. Avi Mendelson Page 26

13 More aspects of power aware architectures

Dr. Avi Mendelson Page 27

The “Power Bottleneck” (Hot Spots) The Thermal Story Power Map On-Die Temperature

110 250 100 200 90

150 80

100 70 Heat Flux Heat 2)

60 Temperature 50 50 0 40

„ Silicon is not a good heat conductor „ With high power density, cannot assume power uniformity – High temperature Î high leakage Î high power Î higher temperature „ Artificially expanding the die size not help. Must attack the hot spots „ Smart layout that separates the hot units increases the processor’s power envelop! * “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference key note - 1999.

Dr. Avi Mendelson Page 28

14 The Mythical Power Envelop „ CPU “power envelop” – Maximum power that commercial cooling technologies can dissipate „ Limited by – Total system power – Processor power. Typical figures: Server <130W, Desktop 50-80W, Notebook 20-30W, sub-notebook <10W „ Bigger systems cool better and dissipate more power – Heat syncs – Heat pipes – Better TIM (Thermal Interface Materials) „ Average power density matters: – Uniformly distributed power allows for higher CPU dissipation

Dr. Avi Mendelson Page 29

Energy Efficiency

„ Energy per task – Proportional to # of processed insts. per task – Proportional to the average work consumed per instruction – Deteriorates as speculation increases and complexity grows Or Formally, per a given task, – Energy per retired instruction is: β*W, where » β: Ratio of Total to Retired number of processed instructions » W: Average energy spent in processing an instruction Both figures grow with every new micro-architecture „ In that respect: high performance modern micro-architectures are less energy-efficient „ Luckily, process technology offsets that by reducing energy per switch

Dr. Avi Mendelson Page 30

15 VoltageVoltage ScalingScaling

„ Within a given voltage range, higher voltage allows higher freq. „ Used for trading power and frequency. Either – Statically, at manufacturing time – Dynamically, at run time (e.g., Intel’s SpeedStep® Technology

„ Actual range depends on specific 1000 XScale proc. freq & power vs voltage design and process technology 900

Examples*: 800 Fequency(Mhz) – Intel® XScale™ processors runs 700 Power (mWatt)

from 0.75V (150MHz/50mW) 600

to 1.65V (800MHz/900mW) 500

– Intel mobile Pentium® III processor 400

sells from 1.1V (600MHz) 300 to 1.7V (1GHz) 200

100

0 * Source: Intel Corp. (http://developer.intel.com) 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9

Dr. Avi Mendelson Page 31

AlphaAlpha hothot spotsspots

Area 30% Freq. 50%

Source - CoolChips-99 Power 67%

Dr. Avi Mendelson Page 32

16 VoltageVoltage ScalingScaling (cont.)(cont.)

„ Huge effect on Power: 20% freq reduction Î 20% voltage reduction

Î 35% energy reduction. (αCV2 = αC*0.82 = αC*0.64)

Î 50% power reduction. (αCV2f = αC*0.83 = αC*0.51) „ Even more impressive if we recall: – 20% freq hit Î only 10%-15% performance hit* „ Voltage scaling can be used to trade performance with power!

* Depends mainly on core to frequency ratio and caches size.

Dr. Avi Mendelson Page 33

What is the impact of the architecture?

Dr. Avi Mendelson Page 34

17 Power and performance--trade-off

Voltage scaling alone is not enough to cap power So far, the analysis indicates that: You will have to tradeoff performance for power – Reduce die area => Reduce Active C => Reduce Perf(C) & Power(C) – Reduce Vdd and freq => Perf(freq) & Power(freq,Vdd2)--cubic – Reduce dies size, Vdd, and freq

Argument: ° Tradeoff performance for power does Goal #1 make sense? ± Set a goal that comprehends both: performance and power. ² Use energy delay product (E*D) to evaluate tradeoffs

Dr. Avi Mendelson Page 35

E*D product (lower is better) may provide a better criteria 4 1

3 „ E = energy / instruction 2

„ = Power * sec / instruction Delay

Energy (PJ) 1 „ = Watt / MIPS 0 0 „ D = sec / instruction 0123 D = sec / instruction Vdd (volts) „ = 1 / MIPS 400

300 2 „ E *D ~ Watt / MIPS E x D 200

100 0123 Vdd (volts)

Dr. Avi Mendelson Page 36

18 PowerPower segmentssegments

One size does not fit all: „ Embedded systems – Most of the power is consumed by the CPU – We are not thermally limited. – What we really care about is battery life. – In real time systems we can take advantage of known “deadlines” „ Laptops (Mobile systems) – We are thermally limited. – We can not use deadlines (most of the time). – We need to optimize for max battery life and max performance in a given power envelop. „ Desktops: – We mainly care about power awareness and Thermal issues

Dr. Avi Mendelson Page 37

Example:Example: AdderAdder DesignsDesigns FA „ Various algorithms exist to implement an integer adder – Ripple, select, skip (x2), Look-ahead, conditional-sum. – Each with its own characteristics of timing and power consumption. Ripple Carry Carry Select

FA FA FA FA FA FA FAFA FAFA 0 1

Variable/Fixed Width Carry Skip

Carry Look-ahead FA FA FA FA

FA FA FA FA

Dr. Avi Mendelson Page 38

19 Power and Delay Numbers FA

„ According to Callaway and Swartzlander*:

Energy Delay (pJ) (nSec) Ripple Carry 117 54.27 Constant Width Carry Skip 109 28.38 Variable Width Carry Skip 126 21.84 Carry Lookahead 171 17.13 Carry Select 216 19.56 Conditional Sum 304 20.05

„ If we must choose one option, as-is: – If power is the objective – use “constant width carry skip” – If delay is most important – use “carry look-ahead”

* “Estimating the power consumption of CMOS adders” - Callaway, T.K.; Swartzlander, E.E., Jr. 11th Symposium on Computer Arithmetic, 1993. Proceedings.

Dr. Avi Mendelson Page 39

Power Complexity Metrics „ Power α C V2 f „ Metrics: suppose we introduce new feature that consumes extra x power and gain y performance: 1. Power/Perf (Î Energy), assuming same technology (same C) and same voltage » For battery life, energy bills. » For a given power envelope – without voltage scaling. 2. Power/Perf2 (Î Energy*Delay) » Balance performance and power needs. 3. Power/Perf3 (Î Energy*Delay2) » For a given power envelope – with voltage scaling. assuming that we can (1) trade frequency and voltage scaling, and (2) we can lower the voltage as much as we wish

Dr. Avi Mendelson Page 40

20