Dynamic Voltage Scaling

EE241 - Spring 2012
Advanced Digital Integrated Circuits
Lecture 18: Dynamic Voltage Scaling

Outline
• Finish multiple supplies
• Dynamic voltage scaling

Supply Voltage Tradeoffs

Multiple Supplies in a Block

CVS Layout: Usami’98

Level-Converting Flip-Flop
[Figure: level-converting flip-flop schematic with supplies VH and VL, data D, output Q, clocks CLK/CK, transistors M1 and M2]

Three VDDs
[Figure: power reduction ratio versus the two lower supply voltages; from Kuroda, V1 = 1.5 V, VTH = 0.3 V, p(t): lambda]

Optimum Number of Supplies
[Figure: supply voltage ratios (V2/V1, V3/V1, V4/V1) and power dissipation ratios (P2/P1, P3/P1, P4/P1) versus V1 (0.5 to 1.5 V) for {V1, V2}, {V1, V2, V3} and {V1, V2, V3, V4}]
• The more VDDs, the lower the power, but the effect saturates.
• The power reduction decreases as VDD is scaled down.
• The optimum V2/V1 is around 0.7.
Hamada, CICC’01

Multiple Supply Voltages
• Two supply voltages per block are optimal.
• The optimal ratio between the supply voltages is 0.7.
• Level conversion is performed at the voltage boundary, using a level-converting flip-flop (LCFF).
• An option is to use an asynchronous (combinational) level converter, which is more sensitive to coupling and supply noise.

Dual-Supply Datapath: Layout Issue
[Figure: VDDH circuits, VDDL circuits and signal flow in (a) dedicated VDDL and VDDH rows (conventional), (b) a possible layout reduction with complex interconnections (conventional), (c) a shared-well layout]
• A shared-well technique is appropriate for random placement of cells.

Standard-Cell Dual-Supply Voltage
[Figure: VDDH and VDDL cells with N-well isolation: (a) circuit schematic, (b) layout]
• A VDDH circuit is assigned only to critical paths.
• A VDDL circuit is used on non-critical paths and for driving large capacitive loads.

Shared-Well Dual-Supply Voltage
[Figure: VDDH and VDDL cells in a shared N-well: (a) circuit schematic, (b) layout]
• Both circuits can be placed in the same N-well.
• The cell layout becomes complex.
• An intrinsic negative back-bias of the PMOS degrades speed.
Shimazaki, ISSCC’03

Measured Results: Energy & Delay
[Figure: energy (pJ) versus TCYCLE (ns) at room temperature, single supply versus shared well with VDDH = 1.8 V; 1.16 GHz point marked]
• VDDL = 1.4 V: energy -25.3%, delay +2.8%.
• VDDL = 1.2 V: energy -33.3%, delay +8.3%.
• The dual-supply technique expands the power-delay optimization space.

Power/Energy Optimization Space
Constant throughput/latency covers design time and sleep mode; variable throughput/latency covers run time.
• Active energy. Design time: logic design, scaled VDD, transistor sizing, multi-VDD. Sleep mode: clock gating. Run time: DFS, DVS.
• Leakage energy. Design time: stack effects, transistor sizing, scaling VDD, multi-VTh. Sleep mode: sleep transistors, multi-VDD, variable VTh, input control. Run time: variable VTh.

Adaptive Supply Voltages

Processors for Portable Devices
[Figure: performance (MIPS) versus processor energy (Watt*sec) for PDAs, Pocket-PCs, notebook computers and dynamic voltage scaling]
• Eliminate the performance-energy trade-off.
Burd, ISSCC’00

Typical MPEG IDCT Histogram

Processor Usage Model
[Figure: desired throughput versus time: compute-intensive and low-latency processes at maximum processor speed, background and high-latency processes below it, and system idle periods]
System optimizations (Burd, ISSCC’00):
• Maximize peak throughput.
• Minimize average energy/operation.

Common Design Approaches (Fixed VDD)
• Compute ASAP: throughput is always high, so excess throughput is delivered.
• Clock frequency reduction: delivered throughput is scaled down with fCLK, while energy/operation remains unchanged.

Scale VDD with Clock Frequency
[Figure: normalized energy/operation versus throughput (fCLK): constant supply voltage (3.3 V) versus VDD scaled down to 1.1 V, a ~10x energy reduction; Burd, ISSCC’00]
• Reduce VDD and let the circuits slow down.

CMOS Circuits Track Over VDD
[Figure: normalized maximum fCLK versus VDD (VT to 4VT) for an inverter, ring oscillator, register file and SRAM; Burd, ISSCC’00]
• Delay tracks within +/- 10%.

Dynamic Voltage Scaling (DVS)
[Figure: delivered throughput versus time: (1) vary fCLK and VDD, (2) dynamically adapt; Burd, ISSCC’00]
• Dynamically scale energy/operation with throughput.
• Always run at the minimum speed needed, which minimizes average energy/operation.
• Extend battery life up to 10x with the exact same hardware!

Operating System Sets Processor Speed
• DVS requires a voltage scheduler (VS).
• The VS predicts the workload to estimate CPU cycles.
• Applications supply completion deadlines.
[Figure: processor speed for MPEG: FDESIRED in CPU cycles (MHz, 0 to 80) versus time (0 to 1.4 s)]

Converter Loop Sets VDD, fCLK
[Figure: a counter clocked by the ring oscillator and latched by a 1 MHz reference measures FMEAS; the O.S. writes FDES into a register; FERR, the difference between FDES and FMEAS, drives a digital loop filter and a buck converter that set VDD on CDD for the ring oscillator and the processor; Burd, ISSCC’00]
• The feedback loop sets VDD so that FERR → 0.
• The ring oscillator is delay-matched to the CPU critical paths.
• A custom loop implementation can optimize CDD.

Design Over Wide Range of Voltages
• Circuit design constraints (functional verification).
• Circuit delay variation (timing verification).
• Noise margin reduction (power grid, coupling).
• Delay sensitivity (local power distribution).
Design verification complexity is similar to that of high-performance processor design at a fixed VDD.

Delay Variation & Circuit Constraints
[Figure: normalized maximum fCLK versus VDD (VT to 4VT) for an inverter, ring oscillator, register file and SRAM; Burd, ISSCC’00]
• NMOS pass gates cannot be used: they fail for VDD < 2VT.
• Functional verification is only needed at one VDD value.

Relative Delay Variation
[Figure: percent delay variation relative to the ring oscillator (-20% to +40%) versus VDD (VT to 4VT); Burd, ISSCC’00]
• Four extreme cases of critical paths: gate, interconnect, diffusion, series.
• All vary monotonically with VDD.
• Timing verification is only needed at the minimum and maximum VDD.
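The energy argument behind "Scale VDD with Clock Frequency" and DVS can be sketched numerically. The model is an assumption, not taken from the slides: switching energy per operation E = C·VDD², and maximum clock frequency from an alpha-power-law delay model with hypothetical VT and ALPHA, normalized so that 3.3 V delivers full throughput. For a target throughput, the sketch picks the lowest VDD between the slide's 1.1 V and 3.3 V endpoints that still meets it (bisection works because fmax rises monotonically with VDD above VT).

```python
"""Energy/operation versus throughput, with and without supply scaling.

Hypothetical model (not from the slides):
  * energy per operation E = C_EFF * VDD**2 (switching energy only)
  * alpha-power-law max frequency f_max ~ (VDD - VT)**ALPHA / VDD,
    normalized so that f_max(3.3 V) = 1.0
"""

VT = 0.8      # hypothetical threshold voltage (V)
ALPHA = 1.5   # hypothetical velocity-saturation index
C_EFF = 1.0   # normalized switched capacitance per operation
V_MAX = 3.3   # upper supply from the slide (V)
V_MIN = 1.1   # lower supply from the slide (V)


def _f_raw(vdd: float) -> float:
    return (vdd - VT) ** ALPHA / vdd


K = 1.0 / _f_raw(V_MAX)  # normalization constant


def f_max(vdd: float) -> float:
    """Normalized maximum clock frequency at supply vdd (monotonic above VT)."""
    return K * _f_raw(vdd)


def min_vdd_for(f_target: float, tol: float = 1e-6) -> float:
    """Lowest VDD in [V_MIN, V_MAX] with f_max(VDD) >= f_target, by bisection.

    Targets above f_max(V_MAX) are clamped to V_MAX.
    """
    if f_max(V_MIN) >= f_target:
        return V_MIN
    lo, hi = V_MIN, V_MAX  # invariant: f_max(lo) < f_target
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f_max(mid) >= f_target:
            hi = mid
        else:
            lo = mid
    return hi


def energy_per_op(vdd: float) -> float:
    return C_EFF * vdd ** 2


def energy_fixed_vdd(f_target: float) -> float:
    """Clock-frequency reduction only: VDD stays at V_MAX for any f_target."""
    return energy_per_op(V_MAX)


def energy_dvs(f_target: float) -> float:
    """Scale VDD with fCLK: run at the lowest supply that meets f_target."""
    return energy_per_op(min_vdd_for(f_target))


if __name__ == "__main__":
    print(f"{'throughput':>10} {'VDD (V)':>8} {'E fixed':>8} {'E DVS':>8}")
    for f in (1.0, 0.5, 0.2, 0.05):
        print(f"{f:10.2f} {min_vdd_for(f):8.2f} "
              f"{energy_fixed_vdd(f):8.2f} {energy_dvs(f):8.2f}")
```

With frequency reduction alone, energy/operation stays at its 3.3 V value at every throughput; with supply scaling it falls toward (3.3/1.1)² = 9x lower at the 1.1 V end, in line with the ~10x quoted on the slide. The hypothetical VT and ALPHA only change where along the throughput axis the savings appear, not the V² endpoint ratio.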
Delay Sensitivity
[Figure: normalized delay sensitivity ΔDelay/Delay to a supply drop ΔVDD (IR drop) versus VDD (VT to 4VT); Burd, ISSCC’00]
• The local power grid (for timing constraints) only needs to be designed for VDD ≈ 2VT.

Multiple Path Tracking
A. Drake, ISSCC’07

Alternative: Error Detection
Bull, ISSCC 2010

Design for Dynamically Varying VDD
• Static CMOS logic.
• Ring oscillator.
• Dynamic logic (& tri-state busses).
• Sense amp (& memory cell).
Max. allowed |dVDD/dt| per circuit type; min. CDD = 100 nF (0.6 µm).
Circuits continue to operate properly as VDD changes.

Static CMOS Logic
[Figure: inverter with Vin = 0 and Vout = VDD, PMOS resistance rds|PMOS driving CL; τmax = 4 ns]
• 0.6 µm CMOS: |dVDD/dt| < 200 V/µs.
• Static CMOS robustly operates with varying VDD.

Ring Oscillator
[Figure: simulated VDD and fCLK (volts) versus time (60 to 260 ns) with dVDD/dt = 20 V/µs]
• Output fCLK instantaneously adapts to the new VDD.

Dynamic Logic
[Figure: dynamic gate in evaluation (clk = 1): Vout and VDD versus time]
Errors:
• False logic low: ΔVDD > VTP.
• Latch-up: Vout > VDD + Vbe.
0.6 µm CMOS: |dVDD/dt| < 20 V/µs.
• Cannot gate the clock in the evaluation state.
• Tri-state busses fail similarly: use a hold circuit.

Measured System Performance & Energy
[Figure: Dhrystone 2.1 MIPS versus energy (mW/MIPS) for dynamic VDD versus static VDD; Burd, ISSCC’00]
• 85 MIPS at 5.6 mW/MIPS (3.8 V).
• 6 MIPS at 0.54 mW/MIPS (1.2 V).
• Dynamic operation can increase energy efficiency by more than 10x.

VDD-Hopping
[Figure: MPEG-4 encoding timeline with slices #n and #n+1, the transition time between f levels, and the n-th slice finishing before the next milestone (200 µs); normalized power versus number of frequency levels (1, 2, 3, 8)]
• Application slicing and software feedback guarantee real-time operation.
• Two hopping levels are sufficient.

Clock Gating
• Requires careful skew control.
• Well handled in today’s EDA tools.

Clock Gating Efficiently Reduces Power
[Figure: power (mW) of an MPEG4 decoder (DEU, VDE, MIF, DSP/HIF, 896 Kb SRAM): 30.6 mW without clock gating, 8.5 mW with clock gating]
• 90% of the flip-flops were clock-gated.
• 70% power reduction by clock gating alone.
Courtesy M. Ohashi, Matsushita, ISSCC 2002

Local Clock Gating
[Figure: 'clock on demand' flip-flop with data-transition look-ahead (XNOR) and a pulse generator producing CKI/CKIB from CP; transistor sizes annotated]

Circuit-Level Activity Encoding
• Conditional inversion coding for interconnect.
• Number representation.

Next Lecture
• Leakage management
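The conditional inversion coding for interconnect, listed under circuit-level activity encoding, is commonly realized as bus-invert coding; the slides give no details, so the scheme here is an assumption. In this sketch, with a hypothetical 8-bit bus plus one extra invert wire, both starting at 0, the transmitter sends the complement of a word (and raises the invert wire) whenever sending it unchanged would toggle more than half of the data lines. That caps data-line transitions per transfer at half the bus width.

```python
"""Sketch of bus-invert (conditional inversion) coding for a parallel bus.

Assumptions (hypothetical, for illustration): an 8-bit data bus plus one extra
'invert' wire, all lines start at 0, and the word is inverted whenever more
than half of the data lines would otherwise toggle.
"""
from typing import List, Tuple

WIDTH = 8                    # hypothetical data-bus width
MASK = (1 << WIDTH) - 1


def popcount(x: int) -> int:
    """Number of 1 bits, i.e. lines that toggle when x is an XOR difference."""
    return bin(x).count("1")


def encode(words: List[int]) -> List[Tuple[int, int]]:
    """Map each data word to (value driven on the bus, invert-wire level)."""
    prev_bus = 0
    out = []
    for w in words:
        w &= MASK
        if popcount(prev_bus ^ w) > WIDTH // 2:
            bus, inv = ~w & MASK, 1   # complement toggles the other lines
        else:
            bus, inv = w, 0
        out.append((bus, inv))
        prev_bus = bus
    return out


def decode(symbols: List[Tuple[int, int]]) -> List[int]:
    """Receiver side: undo the inversion when the invert wire is high."""
    return [bus ^ MASK if inv else bus for bus, inv in symbols]


def toggles(symbols: List[Tuple[int, int]]) -> int:
    """Total transitions on the data wires plus the invert wire."""
    count, prev_bus, prev_inv = 0, 0, 0
    for bus, inv in symbols:
        count += popcount(prev_bus ^ bus) + (prev_inv ^ inv)
        prev_bus, prev_inv = bus, inv
    return count


if __name__ == "__main__":
    data = [0xFF, 0x00, 0xFF, 0x00]
    coded = encode(data)
    raw = [(w, 0) for w in data]
    print("raw toggles:", toggles(raw))        # 32
    print("coded toggles:", toggles(coded))    # 4
    assert decode(coded) == data
```

The alternating 0xFF/0x00 pattern is the worst case for an uncoded bus (every line toggles every cycle); with inversion the data lines stay at 0 and only the invert wire switches. The price is one extra wire and an XOR per line at each end, so the saving matters mainly on long, heavily loaded interconnect.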
