EE241 - Spring 2012 Advanced Digital Integrated Circuits

Lecture 18:

Outline

• Finish multiple supplies
• Dynamic voltage scaling

Supply Voltage Tradeoffs

Multiple Supplies in a Block

CVS Layout:

Usami’98

Level-Converting Flip-Flop

[Figure: level-converting flip-flop schematic with supplies VH and VL, clock signals CLK/CK, devices M1 and M2, data input D, and output Q.]

Three VDD’s

[Figure: power reduction ratio plotted against V2 (V) and V3 (V), both axes running from 0.4 to 1.4.]

From Kuroda. V1 = 1.5V, VTH = 0.3V, p(t):lambda

Optimum Numbers of Supplies

[Figure: three panels for the supply sets { V1, V2 }, { V1, V2, V3 }, and { V1, V2, V3, V4 }. Each panel plots the supply voltage ratios (V2/V1, V3/V1, V4/V1) and the power dissipation ratio (P2/P1, P3/P1, P4/P1) against V1 (V) from 0.5 to 1.5.]

• The more VDD’s, the less power, but the effect saturates.
• The power reduction effect decreases as the VDD’s are scaled.
• Optimum V2/V1 is around 0.7.

Hamada, CICC’01

Multiple Supply Voltages

• Two supply voltages per block are optimal
• Optimal ratio between the supply voltages is 0.7
• Level conversion is performed on the voltage boundary, using a level-converting flip-flop (LCFF)
• An option is to use an asynchronous (combinatorial) level converter
  • More sensitive to coupling and supply noise
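As a rough illustration of why a second supply at about 0.7 of the first is attractive, the sketch below estimates the dynamic power of a dual-supply block relative to a single-supply one. It assumes dynamic power scales as C·V² at a fixed clock and ignores level-converter overhead; `frac_low` (the fraction of switched capacitance moved to the lower supply) is a hypothetical input, not a number from the lecture.

```python
def dual_supply_power_ratio(v_ratio, frac_low):
    """Dynamic power of a dual-supply block relative to single-supply.

    v_ratio : VL / VH (the slides quote an optimum near 0.7)
    frac_low: fraction of switched capacitance moved to VL (hypothetical)
    Assumes P ~ C * V^2 * f and ignores level-converter overhead.
    """
    if not 0.0 <= frac_low <= 1.0:
        raise ValueError("frac_low must be in [0, 1]")
    return (1.0 - frac_low) + frac_low * v_ratio ** 2


if __name__ == "__main__":
    for frac in (0.25, 0.5, 0.75):
        print(frac, round(dual_supply_power_ratio(0.7, frac), 3))
```

Under these assumptions, the saving grows linearly with the share of capacitance that can tolerate the lower supply, which is why the assignment of non-critical paths to VDDL matters as much as the voltage ratio itself.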

Dual-Supply-Voltage: Layout Issue

[Figure legend: VDDH circuit, VDDL circuit, signal flow.]

(a) Dedicated row (Conventional): separate VDDL and VDDH rows, with part of the VDDH row left empty.
(b) Possible layout reduction (Conventional): VDDL and VDDH rows with complex interconnections.
(c) Shared-well layout

• A shared-well technique is appropriate for random placement of cells

Standard-Cell Dual-Supply-Voltage

[Figure: (a) circuit schematic and (b) layout of alternating VDDH and VDDL circuits (inputs i1, i2; outputs o1, o2) on a common VSS, with N-well isolation between VDDH and VDDL cells.]

• A VDDH circuit is assigned only to a critical path
• A VDDL circuit is used in a non-critical path and for driving a large capacitive load

Shared-Well Dual-Supply-Voltage

[Figure: (a) circuit schematic and (b) layout of VDDH and VDDL circuits (inputs i1, i2; outputs o1, o2) on a common VSS, placed in a shared N-well.]

• Both circuits can be placed in the same N-well
• Cell layout becomes complex
• An intrinsic negative back-biasing of PMOS degrades speed

Shimazaki, ISSCC’03

Measured Results: Energy & Delay

[Figure: energy [pJ] (200 to 800) versus TCYCLE [ns] (0.6 to 1.6) at room temperature, comparing single-supply and shared-well (VDDH = 1.8V) designs; a 1.16GHz point is marked.]

  VDDL = 1.4V: Energy -25.3%, Delay +2.8%
  VDDL = 1.2V: Energy -33.3%, Delay +8.3%

• The dual-supply technique expands the power-delay optimization space

Power/Energy Optimization Space

            Constant Throughput/Latency            Variable Throughput/Latency
Energy      Design Time        Sleep Mode          Run Time
Active      Logic design       Clock gating        DFS, DVS
            Scaled VDD
            Trans. sizing
            Multi-VDD
Leakage     Stack effects      Sleep T’s           + Variable VTh
            Trans sizing       Multi-VDD
            Scaling VDD        Variable VTh
            + Multi-VTh        + Input control

Adaptive Supply Voltages

Processors for Portable Devices

[Figure: performance (MIPS, 0.1 to 1000) versus energy (Watt*sec) for Notebook Computers, Pocket-PCs, and PDAs, with the range spanned by Dynamic Voltage Scaling.]

• Eliminate performance ↔ energy trade-off

Burd, ISSCC’00

Typical MPEG IDCT Histogram

Processor Usage Model

[Figure: desired throughput versus time, relative to the maximum processor speed. Regions: compute-intensive and low-latency processes; background and high-latency processes; system idle.]

System Optimizations:
• Maximize Peak Throughput
• Minimize Average Energy/operation

Burd, ISSCC’00

Common Design Approaches (Fixed VDD)

Compute ASAP:
[Figure: delivered throughput versus time; always high throughput, with excess throughput.]

Clock Frequency Reduction:
[Figure: delivered throughput versus time with fCLK reduced.]

• Energy/operation remains unchanged while throughput is scaled down with fCLK.

Scale VDD with Clock Frequency

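[Figure: energy/operation (normalized, 0 to 1) versus throughput (fCLK, normalized, 0 to 1). The constant-supply-voltage curve stays at the 3.3V level; scaling the supply down to 1.1V gives ~10x energy reduction.]

• Reduce VDD, slow circuits down.

Burd, ISSCC’00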
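The ~10x figure can be sanity-checked with a first-order estimate. The sketch below assumes switching energy per operation scales as C·VDD² (leakage and short-circuit energy are ignored) and uses the 3.3V and 1.1V endpoints from the slide; the function name is illustrative.

```python
def energy_per_op_ratio(vdd_low, vdd_high):
    """Ratio of switching energy per operation, assuming E ~ C * VDD^2."""
    return (vdd_low / vdd_high) ** 2


# Endpoints taken from the slide: 3.3 V (full speed) and 1.1 V (slowest).
ratio = energy_per_op_ratio(1.1, 3.3)
reduction = 1.0 / ratio

if __name__ == "__main__":
    print(f"energy/op at 1.1 V relative to 3.3 V: {ratio:.3f}")
    print(f"reduction factor: {reduction:.1f}x")
```

Under the quadratic assumption, the reduction is a factor of about 9, in line with the slide's ~10x. Lowering fCLK alone at a fixed 3.3V leaves this ratio at 1, which is the point of the previous slide.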

CMOS Circuits Track Over VDD

[Figure: normalized max. fCLK (0 to 1.0) versus VDD (VT to 4VT) for Inverter, RingOsc, RegFile, and SRAM.]

• Delay tracks within +/- 10%

Burd, ISSCC’00

Dynamic Voltage Scaling (DVS)

[Figure: delivered throughput versus time. (1) Vary fCLK, VDD. (2) Dynamically adapt.]

• Dynamically scale energy/operation with throughput.
• Always minimize speed → minimize average energy/operation.
• Extend battery life up to 10x with the exact same hardware!

Burd, ISSCC’00

Operating System Sets Processor Speed

• DVS requires a voltage scheduler (VS).
• VS predicts workload to estimate CPU cycles.
• Applications supply completion deadlines.

$F_{\mathrm{DESIRED}} = \dfrac{\text{CPU cycles}}{\Delta\,\text{time}}$

[Figure: Processor Speed (MPEG): F_DESIRED (MHz, 0 to 80) versus time (0 to 1.4 sec).]
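The scheduler idea (predicted cycles divided by the time left until the deadline) can be sketched as follows. This is a minimal illustration, not the scheduler from the cited work: the `Task` type, the deadline-ordered execution assumption, and the `f_max_hz` cap are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    cycles_remaining: float  # predicted CPU cycles still needed
    deadline_s: float        # absolute completion deadline, in seconds


def desired_frequency_hz(tasks, now_s, f_max_hz):
    """Lowest clock frequency that meets every deadline, capped at f_max_hz.

    Assumes tasks run back to back in deadline order, so the work due by
    each deadline is the cumulative cycle count up to that task.
    """
    f_needed = 0.0
    cycles = 0.0
    for task in sorted(tasks, key=lambda t: t.deadline_s):
        cycles += task.cycles_remaining
        dt = task.deadline_s - now_s
        if dt <= 0:
            return f_max_hz  # deadline already reached: run flat out
        f_needed = max(f_needed, cycles / dt)
    return min(f_needed, f_max_hz)


if __name__ == "__main__":
    tasks = [Task(2e6, 0.1), Task(6e6, 0.2)]
    print(desired_frequency_hz(tasks, now_s=0.0, f_max_hz=80e6) / 1e6, "MHz")
```

Running at the lowest frequency that still meets the deadlines is what lets the supply voltage, and hence energy/operation, come down.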

Converter Loop Sets VDD, fCLK

[Figure: feedback loop block diagram. Blocks and signals: ring oscillator, processor, counter/latch (f = 1MHz, RST), 7-bit FMEAS, FDES register set by O.S. (0110100), FERR summing node, digital loop filter, buck converter (PENAB, NENAB, L, CDD), and outputs VDD, IDD, fCLK.]

• Feedback loop sets VDD so that FERR → 0.
• Ring oscillator delay-matched to CPU critical paths.
• Custom loop implementation → Can optimize CDD.

Burd, ISSCC’00
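A behavioral sketch of such a loop is given below. Everything numeric in it is an assumption made for illustration: the ring oscillator follows a made-up alpha-power-law model, FMEAS is read as the integer number of oscillator cycles counted in a 1 µs window (so it is in MHz and saturates at 7 bits), the loop filter is a plain integrator, and the buck converter is idealized as setting VDD directly.

```python
def ring_osc_mhz(vdd, vt=0.8, k=40.0, alpha=1.5):
    """Hypothetical ring-oscillator frequency model (alpha-power law), in MHz."""
    if vdd <= vt:
        return 0.0
    return k * (vdd - vt) ** alpha / vdd


def measure(vdd):
    """7-bit FMEAS: whole oscillator cycles counted in a 1 us window."""
    return min(int(ring_osc_mhz(vdd)), 127)


def run_loop(f_des, vdd=1.2, gain=0.01, steps=500, vdd_min=1.0, vdd_max=3.8):
    """Integrate FERR = FDES - FMEAS into VDD until the error reaches zero."""
    for _ in range(steps):
        f_err = f_des - measure(vdd)
        if f_err == 0:
            break
        # Idealized converter: VDD moves by gain volts per MHz of error.
        vdd = min(max(vdd + gain * f_err, vdd_min), vdd_max)
    return vdd, measure(vdd)


if __name__ == "__main__":
    f_des = 0b0110100  # register value shown on the slide, i.e. 52
    vdd, f_meas = run_loop(f_des)
    print(f"VDD settles near {vdd:.2f} V with FMEAS = {f_meas}")
```

Because the oscillator is matched to the critical paths, the loop never needs an explicit voltage-to-frequency table: it simply raises or lowers VDD until the measured frequency equals the requested one.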

Design Over Wide Range of Voltages

• Circuit design constraints. (Functional verification)

• Circuit delay variation. (Timing verification)

• Noise margin reduction. (Power grid, coupling)

• Delay sensitivity. (Local power distribution)

Design verification complexity similar to high-performance @ fixed VDD

Delay Variation & Circuit Constraints

[Figure: normalized max. fCLK (0 to 1.0) versus VDD (VT to 4VT) for Inverter, RingOsc, RegFile, and SRAM.]

• Cannot use NMOS pass gates – fails for VDD < 2VT.
• Functional verification only needed at one VDD value.

Burd, ISSCC’00

Relative Delay Variation

[Figure: percent delay variation (-20 to +40), delay relative to ring oscillator, versus VDD (VT to 4VT). Four extreme cases of critical paths: Gate, Interconnect, Diffusion, Series.]

• All vary monotonically with VDD.
• Timing verification only needed at min. & max. VDD.

Burd, ISSCC’00

Delay Sensitivity

$$\frac{\Delta \mathrm{Delay}}{\mathrm{Delay}} = \frac{\partial \mathrm{Delay}}{\partial V_{DD}} \cdot \frac{\Delta V_{DD}}{\mathrm{Delay}(V_{DD})}, \qquad \Delta V_{DD} = I(V_{DD}) \cdot R$$

[Figure: normalized ΔDelay / Delay (0 to 1) versus VDD (VT to 4VT).]

• Design of local power grid (for timing constraints) only needs to consider VDD ≈ 2VT.

Burd, ISSCC’00

Multiple Path Tracking

A. Drake, ISSCC’07

Alternative: Error Detection

Bull, ISSCC’2010

Design for Dynamically Varying VDD

• Static CMOS logic.

• Ring oscillator.

• Dynamic logic (& tri-state busses).

• Sense amp (& memory cell).

Max. allowed |dVDD/dt| → Min. CDD = 100nF (0.6µm)

Circuits continue to properly operate as VDD changes

Static CMOS Logic

[Figure: static CMOS gate with Vin = 0; the output node (load CL) is held at Vout = VDD through the PMOS on-resistance rds|PMOS.]

max. τ = 4ns

0.6µm CMOS: |dVDD/dt| < 200V/µs

• Static CMOS robustly operates with varying VDD.
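One way to relate the two numbers on this slide is to treat the high output node as a first-order RC circuit (time constant τ = rds·CL) driven by a linear VDD ramp. This is an assumed model, not a derivation taken from the lecture. For a ramp of slope s, the output lags the supply by s·τ·(1 − e^(−t/τ)), which settles at s·τ.

```python
import math


def tracking_lag_v(slew_v_per_s, tau_s, t_s):
    """Lag VDD(t) - Vout(t) of a first-order RC node driven by a VDD ramp.

    Solves tau * dVout/dt = VDD - Vout with VDD = V0 + slew * t and
    Vout(0) = V0 (assumed model of a static CMOS output held high).
    """
    return slew_v_per_s * tau_s * (1.0 - math.exp(-t_s / tau_s))


# Slide values: tau = 4 ns, |dVDD/dt| = 200 V/us = 200e6 V/s.
lag = tracking_lag_v(200e6, 4e-9, 40e-9)

if __name__ == "__main__":
    print(f"lag after 10 time constants: {lag:.3f} V")
```

With the slide's values, the settled lag is 0.8 V, so at lower slew rates the output stays correspondingly closer to the moving supply.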

Ring Oscillator

[Figure: simulated VDD and fCLK waveforms (0 to 4 Volts) versus time (60 to 260 ns), with dVDD/dt = 20V/µs.]

• Output fCLK instantaneously adapts to new VDD.

Dynamic Logic

[Figure: dynamic logic gate in the evaluation state (clk = 1) with input Vin and floating output Vout, and waveforms (Volts versus time) of VDD, clk, and Vout as VDD changes.]

Errors:
• False logic low: ΔVDD > VTP
• Latch-up: ΔVDD > Vbe

0.6µm CMOS: |dVDD/dt| < 20V/µs

• Cannot gate clock in evaluation state.
• Tri-state busses fail similarly → Use hold circuit.

Measured System Performance & Energy

[Figure: Dhrystone 2.1 MIPS (0 to 100) versus energy (mW/MIPS, 0 to 6) for dynamic VDD and static VDD operation.]

  85 MIPS @ 5.6 mW/MIPS (3.8V)
  6 MIPS @ 0.54 mW/MIPS (1.2V)

• Dynamic operation can increase energy efficiency > 10x.

Burd, ISSCC’00
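The two measured operating points can be turned into total power and an efficiency ratio with simple arithmetic. The sketch below uses only the numbers quoted on the slide; the variable and function names are illustrative.

```python
# (MIPS, mW/MIPS) at the two measured operating points from the slide.
HIGH = (85.0, 5.6)   # 3.8 V
LOW = (6.0, 0.54)    # 1.2 V


def power_mw(mips, mw_per_mips):
    """Total power is throughput times energy per unit throughput."""
    return mips * mw_per_mips


efficiency_gain = HIGH[1] / LOW[1]

if __name__ == "__main__":
    print(f"power at 3.8 V: {power_mw(*HIGH):.1f} mW")
    print(f"power at 1.2 V: {power_mw(*LOW):.2f} mW")
    print(f"energy/operation improvement: {efficiency_gain:.1f}x")
```

The energy/operation ratio between the two points is a little over 10, matching the "> 10x" claim, while total power drops by a much larger factor because throughput falls as well.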

VDD-Hopping

[Figure: MPEG-4 encoding timeline showing slices #n and #n+1, the point where the n-th slice finishes, and the next milestone. Transition time between ƒ levels = 200µs.]

[Figure: normalized power (0 to 1) versus # of frequency levels (1, 2, 3, 8).]

• Application slicing and software feedback guarantee real-time operation.
• Two hopping levels are sufficient.


Clock gating

• Requires careful skew control ...
• Well handled in today’s EDA tools

Clock-gating efficiently reduces power

[Figure: MPEG4 decoder power [mW] bar chart (0 to 25) and chip floorplan with blocks DEU, VDE, MIF, DSP/HIF, and 896Kb SRAM.]

  Without clock gating: 30.6mW
  With clock gating: 8.5mW

• 90% of F/F’s were clock-gated.
• 70% power reduction by clock-gating alone.

Courtesy M. Ohashi, Matsushita, ISSCC 2002 38
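The quoted reduction follows directly from the two measured power numbers. A short check, with an illustrative function name:

```python
def power_reduction(before_mw, after_mw):
    """Fractional power saving when going from before_mw to after_mw."""
    return 1.0 - after_mw / before_mw


# Slide values for the MPEG4 decoder, without and with clock gating.
saving = power_reduction(30.6, 8.5)

if __name__ == "__main__":
    print(f"saving from clock gating: {saving:.1%}")
```

The result is roughly 72%, consistent with the slide's rounded 70% figure.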

Local Clock Gating

[Figure: ‘clock on demand’ flip-flop schematic with data-transition look-ahead (XNOR), pulse generator, internal clocks CKI/CKIB, pulse CP, data D/DI, and output Q; devices annotated with sizes (0.5, 0.85, 1.2, 2).]


Circuit-Level Activity Encoding

Conditional Inversion Coding for Interconnect
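The slide gives only the title, so the sketch below assumes it refers to a bus-invert style scheme: if sending the next word would toggle more than half of the bus wires, the complemented word is sent instead and an extra invert wire is asserted. Function names and the all-zero initial bus state are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    prev_bus = 0  # assumed initial state of the data wires
    encoded = []
    for word in words:
        word &= mask
        # Invert when more than half of the data wires would toggle.
        invert = 2 * hamming(prev_bus, word) > width
        bus = (~word & mask) if invert else word
        encoded.append((bus, int(invert)))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(bus ^ mask) if inv else bus for bus, inv in encoded]


def count_transitions(values):
    """Total wire toggles for a sequence of bus states, starting from 0."""
    total, prev = 0, 0
    for v in values:
        total += hamming(prev, v)
        prev = v
    return total


if __name__ == "__main__":
    data = [0x00, 0xFF, 0x00, 0xFF]
    enc = bus_invert_encode(data, 8)
    raw = count_transitions(data)
    coded = count_transitions([bus | (inv << 8) for bus, inv in enc])
    print("raw toggles:", raw, "coded toggles (incl. invert wire):", coded)
```

The encoded count includes the invert wire itself, so the comparison reflects the cost of the extra wire as well as the savings on the data wires.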


Number Representation

Next Lecture

Leakage management
