Low Energy & Power Design Issues

• Processor trends
• Circuit and Technology Issues
• Architectural optimizations
• Low power µP research project

Low Power Design Problem

[Figure: Microprocessor Power (source: ISSCC). Power (Watt), 0 to 30, versus Year, 75 to 95]

When the supply voltage drops to 1 Volt, 100 Watts = 100 Amps

Slide 2

Portable devices

Required Portable Functions
• Multimodal radio
• Protocols, ECC, ...
• Voice I/O compression & decompression
• Handwriting recognition
• Text/Graphics processing
• Video decompression
• Speech recognition
• Java interpreter

Battery (40+ lbs)

How to get 1 month of operation?

Two Kinds of Computation

• General purpose processing (what you have been studying so far)
  • Bursty - mostly idle with bursts of computation
  • Maximum possible throughput required during active periods
• Signal processing (for multimedia, wireless communications, etc.)
  • Stream based computation
  • No advantage in increasing processing rate above that required for real-time requirements

Slide 3 Slide 4

Optimizing for Energy Consumption

• Conventional General Purpose processors (e.g. Pentiums)
  • Performance is everything ... somehow we’ll get the power in and back out
  • 10-100 Watts, 100-1000 Mips = .01 Mips/mW
• Energy Optimized but General Purpose
  • Keep the generality, but reduce the energy as much as possible - e.g. StrongArm
  • .5 Watts, 160 Mips = .3 Mips/mW
• Energy Optimized and Dedicated
  • 100 Mops/mW

Switching Energy

[Figure: CMOS inverter with supply Vdd, input Vin, output Vout and load capacitance CL]

Energy/transition = $C_L \cdot V_{dd}^2$

Power = Energy/transition * f = $C_L \cdot V_{dd}^2 \cdot f$
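The Mips/mW figures quoted for the processor classes above are simply throughput divided by power. A minimal Python sketch of that arithmetic follows; the function name `mips_per_mw` and the pairing of the low and high ends of the quoted ranges are illustrative assumptions, not something stated on the slides.

```python
def mips_per_mw(mips: float, watts: float) -> float:
    """Throughput per unit power, in MIPS per milliwatt."""
    return mips / (watts * 1000.0)

# Figures quoted on the slide (throughput in MIPS, power in watts).
conventional_low = mips_per_mw(100, 10)      # low end: 100 MIPS at 10 W
conventional_high = mips_per_mw(1000, 100)   # high end: 1000 MIPS at 100 W
strongarm = mips_per_mw(160, 0.5)            # 160 MIPS at 0.5 W

print(f"conventional: {conventional_low:.2f} to {conventional_high:.2f} MIPS/mW")
print(f"StrongArm:    {strongarm:.2f} MIPS/mW")
```

Under this reading the StrongArm figure works out to 0.32 Mips/mW, which the slide rounds to .3.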

Slide 5 Slide 6

Low Power & Low Energy System Design

System: Design partitioning, Power Down
Algorithm: Complexity, Concurrency, Locality, Regularity, Data representation
Architecture: Voltage scaling, Parallelism, Instruction set, Signal correlations
Circuit/Logic: Sizing, Logic optimization, Activity Driven Power Down, Low-swing logic, Adiabatic switching
Technology: Threshold Reduction, Multi-thresholds

Energy Reduction in CPU’s

• Standard power management helps
  • Sleep modes
  • Power down blocks
• Clock rate reduction doesn’t help
  • Number of operations = Nops
  • Energy/operation = $C V^2$
  • Total Energy = Nops * $C V^2$
  Energy is independent of clock rate!
• Reducing the clock rate only degrades throughput, with no savings in battery life - unless the voltage is changed
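The claim that total energy does not depend on clock rate can be checked numerically. The sketch below is only an illustration: the workload size, the switched capacitance per operation, the one-operation-per-cycle assumption and the voltage/frequency pairs are all hypothetical values chosen for the example.

```python
def total_energy(n_ops: int, cap_farads: float, vdd: float) -> float:
    """Total switching energy in joules: Nops * C * Vdd^2."""
    return n_ops * cap_farads * vdd ** 2

def run_time(n_ops: int, f_clk_hz: float) -> float:
    """Completion time in seconds, assuming one operation per clock cycle."""
    return n_ops / f_clk_hz

N_OPS = 1_000_000   # hypothetical workload
C_EFF = 1e-9        # hypothetical switched capacitance per operation (farads)

# (clock rate, supply voltage): full speed, half speed, half speed at lower Vdd
for f_clk, vdd in [(40e6, 5.0), (20e6, 5.0), (20e6, 2.9)]:
    energy = total_energy(N_OPS, C_EFF, vdd)
    time_s = run_time(N_OPS, f_clk)
    print(f"f={f_clk / 1e6:.0f} MHz, Vdd={vdd} V: "
          f"energy={energy * 1e3:.2f} mJ, time={time_s * 1e3:.1f} ms, "
          f"avg power={energy / time_s:.2f} W")
```

Halving the clock at a fixed voltage halves the average power but doubles the run time, so the energy drawn from the battery is unchanged; only the lower-voltage case reduces it.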

Slide 7 Slide 8

Node Transition Activity and Power

Switch a CMOS gate for N clock cycles

$E_N = C_L \cdot V_{dd}^2 \cdot n(N)$

$E_N$: the energy consumed for N clock cycles
$n(N)$: the number of 0->1 transitions in N clock cycles

$P_{avg} = \lim_{N \to \infty} \frac{E_N}{N} \cdot f_{clk} = \left( \lim_{N \to \infty} \frac{n(N)}{N} \right) \cdot C_L \cdot V_{dd}^2 \cdot f_{clk}$

$\alpha_{0 \to 1} = \lim_{N \to \infty} \frac{n(N)}{N}$

$P_{avg} = \alpha_{0 \to 1} \cdot C_L \cdot V_{dd}^2 \cdot f_{clk}$

Factors Affecting Transition Activity, α0->1

“Static” component (does not account for timing)
• Type of Logic Function (NOR vs. XOR)
• Type of Logic Style (Static vs. Dynamic)
• Signal Statistics
• Inter-signal Correlations

“Dynamic” or timing dependent component
• Circuit Topology
• Signal Statistics and Correlations
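The definition of the transition activity as the limit of n(N)/N suggests a direct way to estimate it from a recorded sequence of node values. The following Python sketch does that for a hypothetical node driven by independent random bits; the helper names, the 50 fF load, the 5 V supply and the 40 MHz clock in the usage lines are assumptions made for the example.

```python
import random

def count_rising(bits):
    """n(N): number of 0->1 transitions in a sequence of node values."""
    return sum(1 for prev, cur in zip(bits, bits[1:]) if prev == 0 and cur == 1)

def average_power(alpha, c_load, vdd, f_clk):
    """P_avg = alpha_0->1 * C_L * Vdd^2 * f_clk, in watts."""
    return alpha * c_load * vdd ** 2 * f_clk

random.seed(0)  # fixed seed so repeated runs give the same estimate
N = 100_000
node = [random.randint(0, 1) for _ in range(N)]  # hypothetical random node values
alpha = count_rising(node) / N                   # finite-N estimate of n(N)/N

print(f"estimated alpha_0->1 = {alpha:.3f}")
print(f"P_avg = {average_power(alpha, 50e-15, 5.0, 40e6) * 1e6:.2f} uW")
```

For independent, equally likely bits the estimate should settle near 1/4, since a 0->1 transition requires a 0 followed by a 1.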

Slide 9 Slide 10

Static 2 Input NOR Gate

[Figure: static CMOS 2-input NOR gate with inputs A and B, output Out and load CL]

Assume:
prob(A=1) = 1/2
prob(B=1) = 1/2

A B Out
0 0 1
0 1 0
1 0 0
1 1 0

Then:
prob(Out=1) = 1/4
prob(0→1) = prob(Out=0).prob(Out=1) = 3/4 × 1/4 = 3/16

α0->1 = 3/16

Type of Logic Style: Static vs. Dynamic

[Figure: static NOR and clocked (CLK) dynamic NOR gates, each with inputs A and B and output Out]

STATIC NOR: α0->1 = 3/16

DYNAMIC NOR: power is dissipated when Out=0

$\alpha_{0 \to 1} = \frac{N_0}{2^N} = \frac{3}{4}$
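Both NOR activity figures above can be reproduced by enumerating the truth table. The sketch below assumes independent inputs that are 1 with probability 1/2, and it reads N0 as the number of truth-table rows whose output is 0; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def nor(*inputs):
    """N-input NOR: the output is 1 only when every input is 0."""
    return 0 if any(inputs) else 1

def output_zero_count(gate, n_inputs):
    """N0: number of truth-table rows whose output is 0."""
    return sum(1 for row in product((0, 1), repeat=n_inputs) if gate(*row) == 0)

def static_activity(gate, n_inputs):
    """Static CMOS: alpha = prob(Out=0) * prob(Out=1) for uniform, independent inputs."""
    p0 = Fraction(output_zero_count(gate, n_inputs), 2 ** n_inputs)
    return p0 * (1 - p0)

def dynamic_activity(gate, n_inputs):
    """Precharged dynamic logic: alpha = N0 / 2^N, one recharge per cycle with Out=0."""
    return Fraction(output_zero_count(gate, n_inputs), 2 ** n_inputs)

print(static_activity(nor, 2))   # 3/16
print(dynamic_activity(nor, 2))  # 3/4
```

The same two functions can be applied to any other gate written as a Python function of its inputs, for example an XOR, to compare logic functions under the same input assumptions.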

Slide 11 Slide 12

“Dynamic” or Glitching Activity in CMOS

[Figure: 16-bit ripple adder (Cin, Add0 ... Add15, sum outputs S0 ... S15) with a plot of Sum Output Voltage (Volts, 0 to 4) versus Time (ns, 0 to 10) for the sum bits]

α0->1 can be > 1 due to glitching!

Glitch Reduction Using Balanced Paths

[Figure: two gate structures computing F from inputs A0 ... A7, labeled Ripple and Merge]

Slide 13 Slide 14

Switching activity and capacitance minimization

• Gated clocks. (disable all modules not in use each cycle)
• Block enables. (enable only those modules using a ...)
• Instruction Buffer. (0th level cache)
• Add stop and sleep instructions to the instruction set.
• Minimum size busses
• Minimize I/O - on-chip memory

Minimum Supply Voltage

[Figure: NORMALIZED DELAY (1.0 to 7.5) versus Vdd (volts, 2.0 to 6.0) in 2.0µm technology, with curves for a multiplier, clock generator, ring oscillator, microcoded DSP chip, adder and adder (SPICE)]

$T_d = \frac{C_L \cdot V_{dd}}{I}$, with $I \sim (V_{dd} - V_t)^2$

$\frac{T_d(V_{dd}=1.5)}{T_d(V_{dd}=5)} = \frac{(1.5) \cdot (5 - 0.7)^2}{(5) \cdot (1.5 - 0.7)^2} \approx 8$ times slower at 1.5 V

Velocity saturated: $I \sim (V_{dd} - V_t)$, not quite so bad a penalty

Lowering Vdd reduces energy but increases delays. The critical difference is the amount above Vt.
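The delay penalty at 1.5 V can be evaluated directly from the delay model above, Td = CL • Vdd / I with the current proportional to a power of (Vdd - Vt). In the sketch below CL cancels out of the ratio; Vt = 0.7 V is taken from the slide, while the function names and the `exponent` parameter are illustrative assumptions.

```python
def relative_delay(vdd, vt=0.7, exponent=2.0):
    """Delay up to a constant factor: Td ~ Vdd / (Vdd - Vt)^exponent."""
    if vdd <= vt:
        raise ValueError("model only applies for Vdd above Vt")
    return vdd / (vdd - vt) ** exponent

def slowdown(v_low, v_high, vt=0.7, exponent=2.0):
    """Ratio Td(v_low) / Td(v_high) under the same model."""
    return relative_delay(v_low, vt, exponent) / relative_delay(v_high, vt, exponent)

square_law = slowdown(1.5, 5.0)                 # I ~ (Vdd - Vt)^2
vel_sat = slowdown(1.5, 5.0, exponent=1.0)      # I ~ (Vdd - Vt), velocity saturated

print(f"square-law slowdown at 1.5 V:         {square_law:.2f}x")
print(f"velocity-saturated slowdown at 1.5 V: {vel_sat:.2f}x")
```

With these numbers the square-law ratio evaluates to about 8.7, which the slide quotes as 8 times slower, and the velocity-saturated ratio is only about 1.6, which is why the penalty is described as not quite so bad.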

Slide 15 Slide 16

Architecture Trade-offs - Reference Datapath

[Figure: reference datapath. Latched inputs A, B and C feed an adder followed by a comparator producing A>B; latches are clocked at 1/T]

Area = 636 x 833 µ²

⇒ Critical path delay Tadder + Tcomparator (= 25ns)
⇒ fref = 40Mhz
Total capacitance being switched = Cref
⇒ Vdd = Vref = 5V
Power for reference datapath = $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$

from [Chandrakasan92] (IEEE JSSC)

Parallel Datapath

[Figure: parallel datapath. Two copies of the adder and comparator datapath with latches clocked at 1/2T, combined by an output MUX running at 1/T]

Area = 1476 x 1219 µ²

The clock rate can be reduced by half with the same throughput ⇒ fpar = fref / 2
Vpar = Vref / 1.7, Cpar = 2.15Cref

$P_{par} = (2.15 C_{ref}) (V_{ref}/1.7)^2 (f_{ref}/2) \approx 0.36 P_{ref}$
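The parallel-datapath figures follow from P = C V² f once every quantity is expressed relative to the reference design. The Python sketch below is a rough check of that arithmetic; the function name `relative_power` is an illustrative choice, and the second evaluation assumes the supply is rounded to the 2.9 V listed in the architecture summary table.

```python
T_CRITICAL_NS = 25.0                  # Tadder + Tcomparator, from the slide
f_ref_mhz = 1e3 / T_CRITICAL_NS       # reference clock rate in MHz
f_par_mhz = f_ref_mhz / 2             # each parallel unit runs at half rate

def relative_power(c_ratio, v_ratio, f_ratio):
    """P / Pref = (C/Cref) * (V/Vref)^2 * (f/fref), from P = C * V^2 * f."""
    return c_ratio * v_ratio ** 2 * f_ratio

p_exact = relative_power(2.15, 1 / 1.7, 0.5)      # Vpar = Vref / 1.7 taken literally
p_rounded = relative_power(2.15, 2.9 / 5.0, 0.5)  # Vpar rounded to 2.9 V

print(f"fref = {f_ref_mhz:.0f} MHz, fpar = {f_par_mhz:.0f} MHz")
print(f"Ppar/Pref with Vref/1.7: {p_exact:.3f}")
print(f"Ppar/Pref with 2.9 V:    {p_rounded:.3f}")
```

Taking Vref/1.7 literally gives a ratio near 0.37, while the rounded 2.9 V supply gives the quoted value of about 0.36.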

Slide 17 Slide 18

The More Parallel the Better??

[Figure: NORMALIZED POWER (0.00 to 1.00) versus Vdd (volts, 1.00 to 5.00) at Fixed Throughput, with the Minimal Area and Minimal Power points marked]

Capacitance overhead starts to dominate at “high” levels of parallelism and results in an optimum voltage

Pipelined Datapath

[Figure: pipelined datapath. A pipeline latch P is inserted between the adder and the comparator; all latches are clocked at 1/T]

Area = 640 x 1081 µ²

⇒ Critical path delay is less: max [Tadder , Tcomparator]
Keeping clock rate constant: fpipe = fref
⇒ Voltage can be dropped: Vpipe = Vref / 1.7
Capacitance slightly higher: Cpipe = 1.15Cref

$P_{pipe} = (1.15 C_{ref}) (V_{ref}/1.7)^2 f_{ref} \approx 0.39 P_{ref}$

Slide 19 Slide 20

Architecture Summary

Architecture type                                 Voltage   Area   Power
Simple datapath (no pipelining or parallelism)    5V        1      1
Pipelined datapath                                2.9V      1.3    0.39
Parallel datapath                                 2.9V      3.4    0.36
Pipeline-Parallel                                 2.0V      3.7    0.2

Simple Memory Architecture

[Figure: 4 bit display interface, two organizations. Serial Access: memory cell array with row decoding (Addr) read 4 bits at a time at rate f into a latch. Parallel Access: memory cell array read 8 nibbles at a time at rate f / 8 into a latch, followed by a Mux running at f]

Serial Access: Voltage = 3V
Parallel Access: Voltage = 1.1V
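The area and power columns of the architecture summary for the pipelined and parallel datapaths can be reproduced from numbers given on the datapath slides: the layout dimensions, the capacitance ratios, the 2.9 V supply and the clock-rate ratios. The sketch below is one way to redo that bookkeeping; the tuple layout and the function name `summarize` are assumptions made for the example, and the pipeline-parallel row is left out because its capacitance ratio is not given.

```python
V_REF = 5.0
REF_AREA = 636 * 833  # reference datapath area in square microns

# (name, width, height, C/Cref, supply voltage, f/fref), values from the slides
designs = [
    ("Pipelined datapath", 640, 1081, 1.15, 2.9, 1.0),
    ("Parallel datapath", 1476, 1219, 2.15, 2.9, 0.5),
]

def summarize(width, height, c_ratio, vdd, f_ratio):
    """Return (area, power), both normalized to the reference datapath."""
    area = (width * height) / REF_AREA
    power = c_ratio * (vdd / V_REF) ** 2 * f_ratio
    return area, power

results = {name: summarize(*rest) for name, *rest in designs}
for name, (area, power) in results.items():
    print(f"{name:20s} area x{area:.1f}  power x{power:.2f}")
```

Rounded to the precision of the table, this gives 1.3 and 0.39 for the pipelined datapath and 3.4 and 0.36 for the parallel one.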

Slide 21 Slide 22

Proposed CPU Architecture: LP-ARM

[Figure: LP-ARM block diagram. Blocks: Clock Generator (fref, ∆V, fCLK, VDD), Bus/DMA Controller, Bus I/O Buffers, Inst. Cache, ARM Core, Data Cache, Cache Controller, Processor State. Signals: Mem[N:0], Add[31:0], Data[31:0], Int[7:0], Abort, Rst, Complete]

LP-ARM: Energy Estimation

Instruction Cache (8kB):
• Low-Power SRAM: 2 kByte Block = 78 pJ [Burstein]
• Complete Instruction Cache Design: ~150 pJ

Clock Generation
• Global & External 50pF line = 70 pJ
• Total Clock Generation: ~100 pJ

ARM Core
• Register File + ALU + Shifter > 50% Total [ARM,Burd]
• Register File: 30 pJ, ALU: 24 pJ, Shifter: 16 pJ
• Total Core: ~140 pJ

[Figure annotation: Dominant Energy Consumer]

Slide 23 Slide 24

Research Goal

10 MIPS, 1 nJ/inst. (10 mW) ⇔ 80 MIPS, 9 nJ/inst. (720 mW)

[Figure: system energy budget per instruction]
• DC-DC Converter: 100 pJ
• LP-ARM CPU: 500 pJ
• Processor Bus: << 100 pJ
• I/O Interface: 100 pJ
• 0.5 MB SRAM (8 ICs): 300 pJ

Improve energy efficiency by an order of magnitude
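The research goal numbers can be cross-checked with two small calculations: average power is instruction rate times energy per instruction, and the per-instruction budget is the sum of the component energies. The sketch below assumes the pairing of labels and energies read off the diagram (DC-DC converter 100 pJ, LP-ARM CPU 500 pJ, I/O interface 100 pJ, SRAM 300 pJ) and leaves out the processor bus, which is given only as much less than 100 pJ.

```python
def power_mw(mips: float, nj_per_inst: float) -> float:
    """Average power in mW: (MIPS * 1e6 inst/s) * (nJ * 1e-9 J/inst) * 1e3 mW/W."""
    return mips * nj_per_inst

# Per-instruction energy budget in pJ; the label/value pairing is an assumption
budget_pj = {
    "DC-DC converter": 100,
    "LP-ARM CPU": 500,
    "I/O interface": 100,
    "0.5 MB SRAM (8 ICs)": 300,
}
total_nj = sum(budget_pj.values()) / 1000.0

print(f"goal:      {power_mw(10, 1):.0f} mW at 10 MIPS")
print(f"reference: {power_mw(80, 9):.0f} mW at 80 MIPS")
print(f"budget:    {total_nj:.1f} nJ/inst")
```

Under that reading the component energies add up to the 1 nJ/inst. target, roughly an order of magnitude below the 9 nJ/inst. reference point.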

Slide 25