Dynamic Voltage Scaling

EE241 - Spring 2007 Advanced Digital Integrated Circuits

Lecture 13: Dynamic voltage scaling

Announcements

Homework 2 Due today Please let me know when you have updated your proposal abstracts Midterm reports next Thursday

1 Class Material

Last lecture Lowering power Today’s lecture Dynamic voltage scaling Clock gating

Power /Energy Optimizaton Space

Constant Throughput/Latency Variable Throughput/Latency

Energy Design Time Sleep Mode Run Time

Logic design Scaled V DFS, DVS Active DD Clock Gating TSizing

Multi-VDD

Stack effects Sleep T’s

Leakage + Multi-VT Multi-VDD Variable VT + Variable VT + Input control 4

2 Level-Converting Flip-Flop

CLK CK Q CK M M CK CK 1 2 D CK CK CK

Dual-Supply-Datapath: Layout Issue

: VDDH circuit : VDDL circuit : Signal flow

VDDL Row

VDDH Row (empty) (a) Dedicated row (Conventional)

V Row DDL Complex interconnections VDDH Row (b) Possible layout reduction (Conventional)

3 Standard-Cell Dual-Supply-Voltage N-well isolation VDDH VDDL VDDH VDDL

i1 o1 i2 o2

VSS

VDDH circuit VDDL circuit VDDH circuit VDDL circuit (a) circuit schematic (b) layout

A VDDH circuit is assigned only to a critical path

A VDDL circuit is used in a non-critical path and for driving a large capacitive load 7

Shared-Well Dual-Supply-Voltage Shared N-well

VDDH VDDH VDDL

VDDL i1 o1 i2 o2

VSS VSS

VDDH circuit VDDL circuit VDDH circuit VDDL circuit (a) circuit schematic (b) layout Both circuits can be placed in the same N-well Cell layout becomes complex An intrinsic negative back-biasing of PMOS degrades speed 8 Shimazaki, ISSCC’03

4 ALU Block Diagram

clock gen.

clk ain0 ain carry sum 9:1 5:1 carry sum MUX MUX gen. sel. gp INV1 INV2 gen. s0/s1 9:1 2:1 partial MUX MUX sum 0.5pF bin logical : VDDH circuit unit

: VDDL circuit sumb (long loop-back bus) 9

Low Swing Bus & Level Converter

VDDH VDDH pc keeper V V ain0 DDL DDL sel sumb (V ) sum DDH INV1 INV2

domino level converter (9:1 MUX)

Delay of INV1 does not increase INV2 is placed near 9:1 MUX to increase noise immunity Level conversion is done by a domino 9:1 MUX 10

5 Measured Results: Energy & Delay Room temp. 800 1.16GHz 700 Single-supply VDDL=1.4V 600 Energy:-25.3% Shared well Delay :+2.8% (V =1.8V) 500 DDH Energy [pJ] 400 VDDL=1.2V Energy:-33.3% 300 Delay :+8.3% 200 0.6 0.8 1.0 1.2 1.4 1.6

TCYCLE [ns] The dual-supply technique expands the power-delay optimization space 11

Adaptive Supply Voltages

6 Processors for Portable Devices

1000 Dynamic Voltage 100 Scaling Notebook Computers

10 Pocket-PCs

Performance (MIPS) 1 PDAs 0.1 110 Processor Energy (Watt*sec) Burd ISSCC’00 • Eliminate performance ↔ energy trade-off. 13

Typical MPEG IDCT Histogram

7 Processor Usage Model

Desired Compute-intensive and Throughput low-latency processes Maximum Processor Speed

Background and time System Idle high-latency processes System Optimizations: Burd • Maximize Peak Throughput ISSCC’00 • Minimize Average Energy/operation 15

Common Design Approaches (Fixed VDD)

Compute ASAP: Excess throughput

Always high throughput time Clock Frequency Reduction:

fCLK Reduced Delivered Throughput

Energy/operation remains unchanged… time 16 while throughput scaled down with fCLK

8 Scale VDD with Clock Frequency

Constant supply voltage 1 3.3V

~10x Energy 0.5 Reduction

Reduce VDD, slow circuits down. Energy/operation 0 1.1V

00.51Burd Throughput (∝ f ) ISSCC’00 CLK 17

CMOS Circuits Track Over VDD

1.0 CLK f

Inverter RingOsc 0.5 RegFile SRAM Normalized max. 0 VT 2VT 3VT 4VT VDD Burd Delay tracks within +/- 10% ISSCC’00 18

9 Dynamic Voltage Scaling (DVS)

1 Vary fCLK,VDD 2 Dynamically adapt Delivered Throughput

time

Burd • Dynamically scale energy/operation with throughput. ISSCC’00 • Always minimize speed → minimize average energy/operation. • Extend battery life up to 10x with the exact same hardware! 19

Operating System Sets Processor Speed

• DVS requires a voltage scheduler (VS). • VS predicts workload to estimate CPU cycles. • Applications supply completion deadlines. Processor Speed (MPEG)

CPU cycles (MHz) = F 40 Δ time DESIRED 20 DESIRED DESIRED

F 0 0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Time (sec) 20

10 Converter Loop Sets VDD, fCLK

IDD fCLK

RST Counter

f1MHz Latch Ring Oscillator Processor

T FMEAS A 7 V B

PENAB F V Set by DES FERR DD N O.S. Σ ENAB L 0110100 + CDD Register Digital Loop Filter Buck converter • Feedback loop sets VDD so that FERR → 0. • Ring oscillator delay-matched to CPU critical paths. Burd • Custom loop implementation → Can optimize C . ISSCC’00 DD 21

Recent DVS-Enabled Microprocessors

Xscale: 180nm 1.8V bulk-CMOS [Intel00] 0.7-1.75V, 200-1000MHz, 55-1500mW (typ) Max. Energy Efficiency: ~23 MIPS/mW

PowerPC: 180nm 1.8V bulk-CMOS [Nowka02] 0.9-1.95V, 11-380MHz, 53-500mW (typ) Max. Energy Efficiency : ~11 MIPS/mq

Crusoe: 130nm 1.5V bulk-CMOS [Transmeta03] 0.8-1.3V, 300-1000MHz, 0.85-7.5W (peak)

Pentium M: 130nm 1.5V bulk-CMOS [Intel03] 0.95-1.5V, 600-1600MHz, 4.2-31W (peak)

11 Design Over Wide Range of Voltages

• Circuit design constraints. (Functional verification)

• Circuit delay variation. (Timing verification)

• Noise margin reduction. (Power grid, coupling)

• Delay sensitivity. (Local power distribution)

Design verification complexity similar to

high-performance processor design @ fixed VDD 23

Delay Variation & Circuit Constraints

1.0 CLK f Inverter RingOsc 0.5 RegFile SRAM Normalized max. 0 VT 2VT 3VT 4VT VDD Cannot use NMOS pass gates – fails for V < 2V . • DD T Burd • Functional verification only needed at one V value ISSCC’00 DD . 24

12 Relative Delay Variation Delay relative to ring oscillator +40

Four extreme cases of +20 critical paths:

Gate 0 Interconnect Diffusion All vary monotonically with VDD. Series Percent Delay Variation -20 V 2VT 3VT 4VT T V Burd DD ISSCC’00 • Timing verification only needed at min. & max. VDD. 25

Delay Sensitivity

∂Delay ∂Delay ΔVDD ≈ ⋅ , ΔVDD = I(VDD )⋅R Delay ∂VDD Delay(VDD ) 1

0.8

0.6 Delay / ∂ 0.4

0.2 Burd 0 Normalized ISSCC’00 VT 2VT 3VT 4VT VDD • Design of local power grid (for timing constraints) only need to consider VDD ≈ 2VT. 26

13 Design for Dynamically Varying VDD

• Static CMOS logic.

• Ring oscillator.

• Dynamic logic (& tri-state busses).

• Sense amp (& memory cell).

Max. allowed |dVDD/dt| → Min. CDD = 100nF (0.6μm)

Circuits continue to properly operate as VDD changes 27

Static CMOS Logic

VDD

rds|PMOS Vin = 0 Vout = VDD

Vout CL

max. τ = 4ns

0.6μm CMOS: |dVDD/dt| < 200V/μs

• Static CMOS robustly operates with varying VDD. 28

14 Ring Oscillator

Simulated with dVDD/dt = 20V/μs 4

2 VDD Volts 1

fCLK 0 60 80 100 120 140 160 180 200 220 240 260 Time (ns)

• Output fCLK instantaneously adapts to new VDD. 29

Dynamic Logic

VDD clk = 1 Errors clk Vout ΔVDD False logic low: ΔVDD > VTP VDD V Vin out

Volts −ΔVDD Latch-up: ΔV > V clk DD be

Time

0.6μm CMOS: |dVDD/dt| < 20V/μs • Cannot gate clock in evaluation state. • Tri-state busses fail similarly → Use hold circuit. 30

15 Measured System Performance & Energy

100

Dynamic VDD 80 x 85 MIPS @ 60 Static VDD 5.6 mW/MIPS (3.8V) 40 6 MIPS @ 20 0.54 mW/MIPS Dhrystone 2.1 MIPS (1.2V) 0 0 1 2 3 4 5 6 Energy (mW/MIPS) Burd ISSCC’00 • Dynamic operation can increase energy efficiency > 10x. 31

VDD-Hopping

MPEG-4 encoding Time 1 #n #n+1 Transition 0.8 time between ƒ 0.6 levels Next milestone = 200µs n-th slice finished here 0.4 0.2 Application slicing and software Normalized power feedback guarantee real-time 0 operation. 1 23 8 # of frequency levels Two hopping levels are sufficient. 32

16 Reducing Active Power: System-Level

Don’t switch capacitance if you don’t need to!

• Application-specific processing • Preservation of data correlations • Locality of reference • Distributed processing • Demand-driven / Data-driven computation

Clock gating

Requires careful skew control ... Well handled in today’s EDA tools

17 Clock-gating efficiently reduces power

Without clock gating 30.6mW

With clock gating MPEG4 decoder 8.5mW

DEU 0 5 10 15 20 25 VDE Power [mW] MIF DSP/ 90% of F/F’s were clock-gated. HIF 896Kb SRAM 70% power reduction by clock- gating alone.

Courtesy M. Ohashi, Matsushita, ISSCC 2002 35

Circuit-Level Activity Encoding

Conditional Inversion Coding for Interconnect

18 Number Representation