EE241 - Spring 2007 Advanced Digital Integrated Circuits
Lecture 13: Dynamic voltage scaling
Announcements
• Homework 2 due today
• Please let me know when you have updated your proposal abstracts
• Midterm reports next Thursday
Class Material
Last lecture
• Lowering power
Today’s lecture
• Dynamic voltage scaling
• Clock gating
Power/Energy Optimization Space

            Constant Throughput/Latency            Variable Throughput/Latency
Energy      Design Time         Sleep Mode         Run Time
Active      Logic design,       Clock Gating       DFS, DVS
            Scaled VDD,
            TSizing,
            Multi-VDD
Leakage     Stack effects,      Sleep T’s,         + Variable VT
            + Multi-VT          Multi-VDD,
                                Variable VT,
                                + Input control
Level-Converting Flip-Flop
[Figure: level-converting flip-flop schematic with supplies VH and VL, clock CLK (internal CK phases), input D, output Q, and devices M1 and M2]
Dual-Supply-Datapath: Layout Issue
[Figure legend: VDDH circuit, VDDL circuit, signal flow]
(a) Dedicated row (conventional): separate VDDL and VDDH rows, with the VDDH row partly empty
(b) Possible layout reduction (conventional): VDDL and VDDH rows with complex interconnections
(c) Shared-well layout
• A shared-well technique is appropriate for random placement of cells
Standard-Cell Dual-Supply-Voltage
[Figure: (a) circuit schematic and (b) layout of VDDH and VDDL circuits with N-well isolation; signals i1, o1, i2, o2; rails VDDH, VDDL, VSS]
• A VDDH circuit is assigned only to a critical path
• A VDDL circuit is used in a non-critical path and for driving a large capacitive load
Shared-Well Dual-Supply-Voltage
[Figure: (a) circuit schematic and (b) layout of VDDH and VDDL circuits in a shared N-well; signals i1, o1, i2, o2; rails VDDH, VDDL, VSS]
• Both circuits can be placed in the same N-well
• Cell layout becomes complex
• An intrinsic negative back-biasing of PMOS degrades speed
Shimazaki, ISSCC’03
ALU Block Diagram
[Figure: ALU block diagram. Labeled blocks: clock gen., 9:1 MUX and 5:1 MUX on inputs ain and bin, gp gen., carry gen., partial sum, sum sel. (2:1 MUX, s0/s1), logical unit, INV1, INV2. The sumb output drives a long loop-back bus (0.5pF) back to input ain0. Legend: VDDH circuit, VDDL circuit.]
Low Swing Bus & Level Converter
[Figure: low-swing bus and level converter. INV1 and INV2 run from VDDL and drive sumb to ain0; the domino level converter (9:1 MUX) runs from VDDH, with pc, keeper, and sel signals, and outputs sum at VDDH.]
• Delay of INV1 does not increase
• INV2 is placed near 9:1 MUX to increase noise immunity
• Level conversion is done by a domino 9:1 MUX
Measured Results: Energy & Delay
[Figure: measured energy [pJ] versus cycle time TCYCLE [ns] at room temperature, comparing the single-supply design with the shared-well design (VDDH = 1.8V); 1.16GHz point marked]
Shared well, VDDL = 1.4V: Energy −25.3%, Delay +2.8%
Shared well, VDDL = 1.2V: Energy −33.3%, Delay +8.3%
• The dual-supply technique expands the power-delay optimization space
Adaptive Supply Voltages
Processors for Portable Devices
[Figure: performance (MIPS, 0.1 to 1000) versus processor energy (Watt*sec), with regions for notebook computers, Pocket-PCs, PDAs, and dynamic voltage scaling]
• Eliminate performance ↔ energy trade-off.
Burd, ISSCC’00
Typical MPEG IDCT Histogram
Processor Usage Model
[Figure: desired throughput versus time, bounded by the maximum processor speed. Regions: compute-intensive and low-latency processes; background and high-latency processes; system idle.]
System Optimizations:
• Maximize Peak Throughput
• Minimize Average Energy/operation
Burd, ISSCC’00
Common Design Approaches (Fixed VDD)
Compute ASAP:
• Always high throughput
• Excess throughput
Clock Frequency Reduction:
• fCLK reduced
• Energy/operation remains unchanged while throughput scales down with fCLK
[Figure: delivered throughput versus time for each approach]
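A minimal sketch of why lowering fCLK alone does not help at fixed VDD. It assumes the usual first-order switching model, E ≈ C·VDD² per operation with one operation per cycle; the model and the capacitance value are illustrative assumptions, not numbers from the slide.

```python
def energy_per_op(c_eff_farads, vdd_volts):
    """First-order switching energy per operation: E = C_eff * VDD^2."""
    return c_eff_farads * vdd_volts ** 2


def average_power(c_eff_farads, vdd_volts, f_clk_hz):
    """Average switching power, assuming one operation per clock cycle."""
    return energy_per_op(c_eff_farads, vdd_volts) * f_clk_hz


C_EFF = 1e-9  # hypothetical effective switched capacitance per operation (F)
VDD = 3.3     # fixed supply voltage (V)

for f_mhz in (100, 50, 25):
    e_nj = energy_per_op(C_EFF, VDD) * 1e9
    p_w = average_power(C_EFF, VDD, f_mhz * 1e6)
    print(f"{f_mhz:>3} MHz: power = {p_w:.3f} W, energy/op = {e_nj:.2f} nJ")
```

Power falls in proportion to fCLK, but the energy spent per operation (and therefore per task) stays the same because VDD has not changed.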
Scale VDD with Clock Frequency
[Figure: normalized energy/operation (0 to 1) versus normalized throughput (∝ fCLK, 0 to 1). The constant-supply-voltage curve stays at 3.3V; scaling VDD down to 1.1V gives ~10x energy reduction.]
• Reduce VDD, slow circuits down.
Burd, ISSCC’00
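As a rough check on the ~10x figure, assuming energy/operation is dominated by switching energy and so scales as the square of the supply voltage (a first-order assumption, not stated on the slide):

```latex
\frac{E_{op}(3.3\,\mathrm{V})}{E_{op}(1.1\,\mathrm{V})}
  \approx \frac{C\,(3.3)^2}{C\,(1.1)^2}
  = \left(\frac{3.3}{1.1}\right)^{2}
  = 9
```

A factor of 9 is consistent with the "~10x" label on the plot.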
CMOS Circuits Track Over VDD
[Figure: normalized max. fCLK (0 to 1.0) versus VDD (VT to 4VT) for four circuits: inverter, ring oscillator, register file, SRAM]
• Delay tracks within +/- 10%
Burd, ISSCC’00
Dynamic Voltage Scaling (DVS)
1. Vary fCLK, VDD
2. Dynamically adapt
[Figure: delivered throughput versus time]
• Dynamically scale energy/operation with throughput.
• Always minimize speed → minimize average energy/operation.
• Extend battery life up to 10x with the exact same hardware!
Burd, ISSCC’00
Operating System Sets Processor Speed
• DVS requires a voltage scheduler (VS).
• VS predicts workload to estimate CPU cycles.
• Applications supply completion deadlines.
F_DESIRED (MHz) = CPU cycles / Δtime
[Figure: Processor Speed (MPEG): F_DESIRED (0 to 80 MHz) versus time (0 to 1.4 sec)]
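The relation F_DESIRED = CPU cycles / Δtime can be sketched as a small helper. This is only an illustration: the function name, the 80 MHz ceiling (read off the plot axis), and the frame workloads are assumptions, and a real voltage scheduler would also have to predict the cycle counts.

```python
def desired_frequency_mhz(cpu_cycles, time_to_deadline_s, f_max_mhz=80.0):
    """Return F_DESIRED in MHz: cycles still to run divided by time to the deadline.

    The result is clamped to an assumed maximum processor speed.
    """
    if time_to_deadline_s <= 0:
        return f_max_mhz  # deadline already reached: run at full speed
    f_mhz = cpu_cycles / time_to_deadline_s / 1e6
    return min(f_mhz, f_max_mhz)


# Hypothetical per-frame cycle estimates, each frame due in 1/30 s.
frame_cycles = [1.2e6, 0.6e6, 2.0e6, 3.0e6]
speeds = [desired_frequency_mhz(c, 1 / 30) for c in frame_cycles]
for cycles, speed in zip(frame_cycles, speeds):
    print(f"{cycles:.1e} cycles -> {speed:.1f} MHz")
```

Light frames get a low clock (and therefore a low VDD), while a frame that would need more than the maximum speed is simply run flat out.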
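Converter Loop Sets VDD, fCLK
[Figure: converter loop block diagram. Labeled blocks and signals: counter (RST) with 1MHz reference (f1MHz) and latch producing FMEAS; FDES set by O.S. (7-bit register); summing node Σ producing FERR; digital loop filter; buck converter with PENAB/NENAB enables, inductor L, and capacitor CDD; ring oscillator and processor supplied by VDD (current IDD); ring oscillator output fCLK.]
• Feedback loop sets VDD so that FERR → 0.
• Ring oscillator delay-matched to CPU critical paths.
• Custom loop implementation → Can optimize CDD.
Burd, ISSCC’00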
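The idea that the loop adjusts VDD until FERR → 0 can be sketched with a toy discrete-time integrator. Everything numeric here is an assumption for illustration: the alpha-power-law ring-oscillator model, its constants, the loop gain, and the VDD range. The actual loop on the slide is a digital filter driving a buck converter, which this sketch does not model.

```python
VT = 0.8                 # assumed threshold voltage (V)
ALPHA = 1.5              # assumed alpha-power-law exponent
K = 66.8                 # assumed scale factor: about 80 MHz at 3.3 V
V_MIN, V_MAX = 0.9, 3.8  # assumed converter output range (V)


def ring_osc_mhz(vdd):
    """Toy ring-oscillator model: f = K * (VDD - VT)^alpha / VDD, in MHz."""
    overdrive = max(vdd - VT, 0.0)
    return K * overdrive ** ALPHA / vdd


def run_loop(f_des_mhz, vdd_start, gain=0.005, steps=300):
    """Integrate the frequency error into VDD until F_ERR is close to zero."""
    vdd = vdd_start
    for _ in range(steps):
        f_err = f_des_mhz - ring_osc_mhz(vdd)  # F_ERR = F_DES - F_MEAS
        vdd = min(max(vdd + gain * f_err, V_MIN), V_MAX)
    return vdd


vdd_final = run_loop(f_des_mhz=40.0, vdd_start=1.0)
print(f"VDD settles near {vdd_final:.2f} V, "
      f"f = {ring_osc_mhz(vdd_final):.2f} MHz")
```

Because the ring oscillator is delay-matched to the critical paths, regulating its frequency rather than VDD directly means the loop settles on whatever voltage the circuits actually need for the requested speed.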
Recent DVS-Enabled Microprocessors
XScale: 180nm 1.8V bulk-CMOS [Intel00]
• 0.7-1.75V, 200-1000MHz, 55-1500mW (typ)
• Max. Energy Efficiency: ~23 MIPS/mW

PowerPC: 180nm 1.8V bulk-CMOS [Nowka02]
• 0.9-1.95V, 11-380MHz, 53-500mW (typ)
• Max. Energy Efficiency: ~11 MIPS/mW

Crusoe: 130nm 1.5V bulk-CMOS [Transmeta03]
• 0.8-1.3V, 300-1000MHz, 0.85-7.5W (peak)

Pentium M: 130nm 1.5V bulk-CMOS [Intel03]
• 0.95-1.5V, 600-1600MHz, 4.2-31W (peak)
Design Over Wide Range of Voltages
• Circuit design constraints. (Functional verification)
• Circuit delay variation. (Timing verification)
• Noise margin reduction. (Power grid, coupling)
• Delay sensitivity. (Local power distribution)
Design verification complexity is similar to that of high-performance processor design at fixed VDD.
Delay Variation & Circuit Constraints
[Figure: normalized max. fCLK (0 to 1.0) versus VDD (VT to 4VT) for inverter, ring oscillator, register file, and SRAM]
• Cannot use NMOS pass gates: they fail for VDD < 2VT.
• Functional verification only needed at one VDD value.
Burd, ISSCC’00
Relative Delay Variation
[Figure: percent delay variation relative to ring oscillator (−20 to +40) versus VDD (VT to 4VT)]
Four extreme cases of critical paths:
• Gate
• Interconnect
• Diffusion
• Series
All vary monotonically with VDD.
• Timing verification only needed at min. & max. VDD.
Burd, ISSCC’00
Delay Sensitivity
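\[
\frac{\partial \mathrm{Delay}}{\mathrm{Delay}} \approx \frac{\partial \mathrm{Delay}}{\partial V_{DD}} \cdot \frac{\Delta V_{DD}}{\mathrm{Delay}(V_{DD})}, \qquad \Delta V_{DD} = I(V_{DD}) \cdot R
\]
[Figure: normalized delay sensitivity (0 to 1) versus VDD (VT to 4VT)]
• Design of the local power grid (for timing constraints) only needs to consider VDD ≈ 2VT.
Burd, ISSCC’00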
Design for Dynamically Varying VDD
• Static CMOS logic.
• Ring oscillator.
• Dynamic logic (& tri-state busses).
• Sense amp (& memory cell).
Max. allowed |dVDD/dt| → Min. CDD = 100nF (0.6μm)
Circuits continue to operate properly as VDD changes.
Static CMOS Logic
[Figure: static CMOS gate with Vin = 0, so Vout = VDD; the output load CL is connected to VDD through the PMOS on-resistance rds|PMOS]
max. τ = 4ns
0.6μm CMOS: |dVDD/dt| < 200V/μs
• Static CMOS robustly operates with varying VDD.
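A first-order sketch of how the output tracks a moving supply, assuming the gate behaves as a simple RC low-pass with time constant τ = rds·CL. For a linear VDD ramp, such a network settles to a constant lag of slew × τ. The slide does not say how the 200V/μs limit was derived, so this only shows the size of the lag at that slew rate; the starting voltage and ramp duration are arbitrary.

```python
def simulate_tracking(slew_v_per_us, tau_ns, t_end_ns=20.0, dt_ns=0.001, vdd0=1.2):
    """Forward-Euler simulation of dVout/dt = (VDD - Vout) / tau during a VDD ramp.

    Returns the final lag VDD - Vout in volts.
    """
    slew_v_per_ns = slew_v_per_us / 1000.0
    vdd = vout = vdd0
    t = 0.0
    while t < t_end_ns:
        vout += (vdd - vout) / tau_ns * dt_ns  # output charges toward VDD
        vdd += slew_v_per_ns * dt_ns           # supply keeps ramping
        t += dt_ns
    return vdd - vout


lag = simulate_tracking(slew_v_per_us=200.0, tau_ns=4.0)
analytic = 200.0 / 1000.0 * 4.0  # steady-state lag = slew * tau
print(f"simulated lag = {lag:.3f} V, slew * tau = {analytic:.3f} V")
```

With τ = 4ns and 200V/μs the lag approaches 0.8V; slower ramps give proportionally smaller lag.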
Ring Oscillator
[Figure: simulated VDD and fCLK waveforms (Volts, 0 to 4) versus time (60 to 260 ns), with dVDD/dt = 20V/μs]
• Output fCLK instantaneously adapts to new VDD.
Dynamic Logic
[Figure: dynamic gate (clk, Vin, Vout) held with clk = 1 while VDD changes by +ΔVDD or −ΔVDD; VDD and Vout waveforms (Volts versus time)]
Errors:
• False logic low: ΔVDD > VTP
• Latch-up: ΔVDD > Vbe
0.6μm CMOS: |dVDD/dt| < 20V/μs
• Cannot gate clock in evaluation state.
• Tri-state busses fail similarly → Use hold circuit.
Measured System Performance & Energy
[Figure: Dhrystone 2.1 MIPS (0 to 100) versus energy (mW/MIPS, 0 to 6) for dynamic VDD and static VDD operation]
• 85 MIPS @ 5.6 mW/MIPS (3.8V)
• 6 MIPS @ 0.54 mW/MIPS (1.2V)
• Dynamic operation can increase energy efficiency > 10x.
Burd, ISSCC’00
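A quick arithmetic check of the "> 10x" claim using only the two measured points above. The comparison with (VDD ratio)² assumes switching energy scales as VDD², which is a first-order assumption rather than something stated on the slide. Note that 1 mW/MIPS is the same as 1 nJ per instruction.

```python
high = {"mips": 85, "mw_per_mips": 5.6, "vdd": 3.8}
low = {"mips": 6, "mw_per_mips": 0.54, "vdd": 1.2}

# Ratio of measured energy per instruction between the two operating points.
energy_ratio = high["mw_per_mips"] / low["mw_per_mips"]

# First-order prediction if energy/operation scaled purely as VDD^2.
vdd_sq_ratio = (high["vdd"] / low["vdd"]) ** 2

# Total power at each point: throughput times energy per instruction.
power_high_mw = high["mips"] * high["mw_per_mips"]
power_low_mw = low["mips"] * low["mw_per_mips"]

print(f"measured energy ratio: {energy_ratio:.1f}x")
print(f"(VDD ratio)^2:         {vdd_sq_ratio:.1f}x")
print(f"power: {power_high_mw:.0f} mW vs {power_low_mw:.2f} mW")
```

The measured ratio (about 10.4x) is close to the VDD² prediction (about 10.0x), which supports reading the gain as a supply-voltage effect.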
VDD-Hopping
[Figure: timeline of MPEG-4 encoding slices #n and #n+1, marking where the n-th slice finished and the next milestone; transition time between ƒ levels = 200µs]
• Application slicing and software feedback guarantee real-time operation.
[Figure: normalized power (0 to 1) versus # of frequency levels (1, 2, 3, 8)]
• Two hopping levels are sufficient.
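One way the slice-by-slice feedback could work is sketched below: at each slice boundary, pick the lowest frequency level that still finishes the slice's worst-case work before the next milestone, charging the 200µs transition time when leaving the top level. The function, the two frequency levels, and the cycle counts are hypothetical; the slide does not give the actual algorithm.

```python
def choose_level(worst_case_cycles, time_to_milestone_s, f_levels_hz,
                 t_transition_s=200e-6):
    """Return the lowest frequency that meets the milestone in the worst case."""
    f_max = max(f_levels_hz)
    for f in sorted(f_levels_hz):
        # Assumption: running below f_max costs one level transition.
        overhead = 0.0 if f == f_max else t_transition_s
        if worst_case_cycles / f + overhead <= time_to_milestone_s:
            return f
    return f_max  # nothing fits: run at full speed


LEVELS_HZ = [50e6, 100e6]  # hypothetical two hopping levels

# The previous slice finished early, leaving 2.5 ms for a 100k-cycle slice.
relaxed = choose_level(100e3, 2.5e-3, LEVELS_HZ)
# The previous slice ran long, leaving only 1.5 ms.
tight = choose_level(100e3, 1.5e-3, LEVELS_HZ)
print(relaxed / 1e6, "MHz when ahead of schedule")
print(tight / 1e6, "MHz when behind schedule")
```

Because the decision is always made against the worst-case cycle count, the milestone is met even if the slice turns out to be as slow as possible.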
Reducing Active Power: System-Level
Don’t switch capacitance if you don’t need to!
• Application-specific processing
• Preservation of data correlations
• Locality of reference
• Distributed processing
• Demand-driven / Data-driven computation
Clock gating
• Requires careful skew control
• Well handled in today’s EDA tools
Clock-gating efficiently reduces power
[Figure: MPEG4 decoder die photo with blocks DEU, VDE, MIF, DSP/HIF, and 896Kb SRAM; bar chart of power [mW]]
Without clock gating: 30.6mW
With clock gating: 8.5mW
• 90% of F/F’s were clock-gated.
• 70% power reduction by clock-gating alone.
Courtesy M. Ohashi, Matsushita, ISSCC 2002
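A short check that the two measured power figures match the quoted reduction (variable names are illustrative):

```python
p_without_mw = 30.6  # measured power without clock gating (mW)
p_with_mw = 8.5      # measured power with clock gating (mW)

# Fraction of the original power that clock gating removed.
reduction = (p_without_mw - p_with_mw) / p_without_mw
print(f"power reduction: {reduction:.1%}")
```

The result is a little over 72%, in line with the "70% power reduction" stated on the slide.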
Circuit-Level Activity Encoding
Conditional Inversion Coding for Interconnect
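The slide content for this topic did not survive extraction. As an illustration only, the sketch below assumes "conditional inversion coding" refers to the well-known bus-invert scheme: if more than half of the bus lines would toggle, send the inverted word and assert one extra invert line. The function names and the example data are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Encode words as (bus_value, invert_bit) pairs, starting from an all-zero bus."""
    mask = (1 << width) - 1
    bus, encoded = 0, []
    for w in words:
        if 2 * hamming(bus, w) > width:  # more than half the lines would toggle
            bus, inv = (~w) & mask, 1
        else:
            bus, inv = w, 0
        encoded.append((bus, inv))
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [(~b) & mask if inv else b for b, inv in encoded]


def count_transitions(values, start=0):
    """Total number of bit toggles when values are driven in sequence."""
    total, prev = 0, start
    for v in values:
        total += hamming(prev, v)
        prev = v
    return total


words = [0x00, 0xFF, 0x0F, 0xF0, 0xAA]
encoded = bus_invert_encode(words, width=8)
raw_toggles = count_transitions(words)
data_toggles = count_transitions([b for b, _ in encoded])
invert_toggles = count_transitions([inv for _, inv in encoded])
print(raw_toggles, data_toggles + invert_toggles)
```

With this scheme no transfer toggles more than half of the data lines, at the cost of one extra wire and the encode/decode logic.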
Number Representation