Chapter 4 Low-Power VLSI Design

Jin-Fu Li Advanced Reliable Systems (ARES) Lab. Department of Electrical Engineering National Central University Jhongli, Taiwan Outline

• Introduction • Low-Power Gate-Level Design • Low-Power Architecture-Level Design • Algorithmic-Level Power Reduction • RTL Techniques for Optimizing Power

National Central University EE4012VLSI Design 2 Introduction • Most SOC design teams now regard power as one of their top design concerns • Why low-power design?  Battery lifetime (especially for portable devices)  Reliability • Power consumption  Peak power  Average power

National Central University EE4012VLSI Design 3 Overview of Power Consumption • Average power consumption  Dynamic power consumption  Short-circuit power consumption  Leakage power consumption  Static power consumption • Dynamic power dissipation during switching

Cinput

interconnect Cdrain input

National Central University EE4012VLSI Design 4 Overview of Power Consumption • Generic representation of a CMOS logic gate for switching power calculation

V A pMOS

VB network

Vout V A nMOS Cdrain   Cint erconnect   Cinput network VB

TT/ 2 1 dVout dVout Pavg  [ Vout (Cload )dt  (VDD Vout )(Cload )dt] T 02dt T / dt

National Central University EE4012VLSI Design 5 Overview of Power Consumption • The average power consumption can be expressed as 1 P  C V 2  C V 2 f avg T load DD load DD CLK • The node transition rate can be slower than the clock rate. To better represent this behavior, a node transition factor ( T ) should be introduced 2 Pavg   T C load V DD f CLK • The switching power expressed above are derived by taking into account the output node load capacitance

National Central University EE4012VLSI Design 6 Overview of Power Consumption

V VA A V internal VB

VB C internal Vinternal

Vout

C Vout VA VB load

The generalized expression for the average power dissipation can be rewritten as  # ofnodes  Pavg     Ti C iV i V DD f CLK  i 1 

National Central University EE4012VLSI Design 7 Gate-Level Design – Technology Mapping • The objective of logic minimization is to reduce the boolean function. • For low-power design, the signal switching activity is minimized by restructuring a logic circuit • The power minimization is constrained by the delay, however, the area may increase. • During this phase of logic minimization, the function to be minimized is

 Pi (1  Pi ) C i i

National Central University EE4012VLSI Design 8 Gate-Level Design – Technology Mapping • The first step in technology mapping is to decompose each logic function into two-input gates • The objective of this decomposition is to minimize the total power dissipation by reducing the total switching activity

A 0.2  0.0384   0.0196 B 0.2   0.0099 C 0.5 D 0.5 A B C D A 0.2   0.0384 B 0.2

C 0.5   0.0099 D 0.5   0.1875

National Central University EE4012VLSI Design 9 Gate-Level Design – Phase Assignment

High activity node High activity node

A A

B B

C C

National Central University EE4012VLSI Design 10 Gate-Level Design – Pin Swapping

adb c adb c

d a Switching activity Switching activity c b

b c

a d

d c a b b c a d

National Central University EE4012VLSI Design 11 Gate-Level Design – Glitching Power • Glitches  spurious transitions due to imbalanced path delays • A design has more balanced delay paths  has fewer glitches, and thus has less power dissipation • Note that there will be no glitches in a dynamic CMOS logic

A A B B D C E D C E

National Central University EE4012VLSI Design 12 Gate-Level Design – Glitching Power • A chain structure has more glitches • A tree structure has fewer glitches

A B

C Chain structure

D

A

B Tree structure C D

National Central University EE4012VLSI Design 13 Gate-Level Design – Precomputation

REG REG Combinational Logic R1 R2

REG REG Combinational Logic R1 R2

Precomputation Logic

National Central University EE4012VLSI Design 14 Gate-Level Design – Precomputation

A REG 1- Comparator B R1 (MSB)

REG A R2

(n-1)-bit REG Enable Comparator R4 Precomputation logic F REG B R3

National Central University EE4012VLSI Design 15 Gate-Level Design – Gating Clock

D Q D Q D Q D Q Fail DFT rule clk checking

T Add control pin D Q D Q D Q D Q to solve DFT violation problem

clk

National Central University EE4012VLSI Design 16 Gate-Level Design – Input Gating

f1

clk

+ select

f2

National Central University EE4012VLSI Design 17 Clock-Gating in Low-Power Flip-Flop

D D Q

CK

Source: Prof. V. D. Agrawal

National Central University EE4012VLSI Design 18 Reduced-Power Shift Register

D D Q D Q D Q D Q

Output multiplexer D Q D Q D Q D Q

CK(f/2)

Flip-flops are operated at full voltage and half the clock frequency.

Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 19 Power Consumption of Shift Register

16-bit shift register, 2μ CMOS 2 P = C’VDD f/n Deg. Of Freq Power 1.0 parallelism (MHz) (μW)

1 33.0 1535

2 16.5 887 4 8.25 738 0.5

0.25 Normalized power C. Piguet, “Circuit and Logic Level Design,” pages 103-133 in W. Nebel 0.0 and J. Mermet (ed.), Low Power 1 2 4 Design in Deep Submicron Degree of parallelism, n Electronics, Springer, 1997. Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 20 Architecture-Level Design – Parallelism

16 16 A R A R 32 16 32 16x16 16x16 f fref/2 ref multiplier multiplier 16 R B R M 32 U f fref ref/2 X

Assume that With the same 16x16 R multiplier, the power supply can fref be reduced from V to V /1.83. ref ref 16x16 fref/2 Vref 2 fref multiplier P  2.2C ( )  0.33P 16 32 parallel ref 1.83 2 ref B R 16

fref/2 National Central University EE4012VLSI Design 21 Architecture-Level Design – Pipelining The hardware between the pipeline stages is reduced then the reference voltage Vref can be reduced to Vnew to maintain the same worst case delay. For example, let a 50MHz multiplier is broken into two equal parts as shown below. The delay between the pipeline stages can be remained at 50MHz when the voltage Vnew is equal to Vref/1.83

32 Half Half 32 (A ,B) REG REG multiplier multiplier

fref V P  1.2C ( ref ) 2 f  0.36 P pipeline ref 1.83 ref ref National Central University EE4012VLSI Design 22 Architecture-Level Design – Retiming

Retiming is a transformation technique used to change the locations of delay elements in a circuit without affecting the input/output characteristics of the circuit.

Two versions of an IIR filter.

(1) (1) x(n) y(n) x(n) y(n) D D D2Da a w(n) D (2) 2D (1) (2) (1) w1(n) b retiming D b

w2(n) (2) (2)

National Central University EE4012VLSI Design 23 Architecture-Level Design – Retiming Retiming for pipeline design

REG C1 C2 REG C3 (6ns) (2ns) (4ns)

fref

C1 C2 REG REG C3 (6ns) (2ns) (4ns)

fref

National Central University EE4012VLSI Design 24 Architecture-Level Design – Retiming

Clock cycle is 4 gate delays

Clock cycle is 2 gate delays

National Central University EE4012VLSI Design 25 Architecture-Level Design – Power Management

C2

C1

C1_FREEZE

C2_FREEZE

C2

C1

C1_FREEZE

C2_FREEZE

National Central University EE4012VLSI Design 26 Architecture-Level Design – Bus Segmentation • Avoid the sharing of resources  Reduce the switched capacitance • For example: a global system bus  A single shared bus is connected to all modules, this structure results in a large bus capacitance due to  The large number of drivers and receivers sharing the same bus  The parasitic capacitance of the long bus line • A segmented bus structure  Switched capacitance during each bus access is significantly reduced  Overall routing area may be increased

National Central University EE4012VLSI Design 27 Architecture-Level Design – Bus Segmentation

Cbus

Cbus1 Bus Interface

Cbus1

National Central University EE4012VLSI Design 28 Algorithmic-Level Design – factivity Reduction

Minimization of the switching activity, at high level, is one way to reduce the power dissipation of digital processors. One method to minimize the switching signals, at the algorithmic level, is to use an appropriate coding for the signals rather than straight . The table shown below shows a comparison of 3-bit representation of the binary and Gray codes.

Binary Code Decimal Equivalent 000 000 0 001 001 1 010 011 2 011 010 3 100 110 4 101 111 5 110 101 6 111 100 7

National Central University EE4012VLSI Design 29 State Encoding for a Counter • Two-bit binary counter:  State sequence, 00 → 01 → 10 → 11 → 00  Six bit transitions in four clock cycles  6/4 = 1.5 transitions per clock • Two-bit Gray-code counter  State sequence, 00 → 01 → 11 → 10 → 00  Four bit transitions in four clock cycles  4/4 = 1.0 transition per clock • Gray-code counter is more power efficient.

G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers (now Springer), 1998. Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 30 Binary Counter: Original Encoding

a Present Next state state A b abAB B 0001 0110 1011 1100

A = a’b + ab’ CK B = a’b’ + ab’ CLR Source: Prof. V. D. Agrawal

National Central University EE4012VLSI Design 31 Binary Counter: Gray Encoding

Present a Next state state A abAB B 0001 b 0111 1000 1110

A = a’b + ab CK B = a’b’ + a’b CLR Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 32 Three-Bit Counters

Binary Gray-code State No. of toggles State No. of toggles 000 - 000 - 001 1 001 1 010 2 011 1 011 1 010 1 100 3 110 1 101 1 111 1 110 2 101 1 111 1 100 1 000 3 000 1 Av. Transitions/clock = 1.75 Av. Transitions/clock = 1 Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 33 N-Bit Counter: Toggles in Counting Cycle

• Binary counter: T(binary) = 2(2N –1) • Gray-code counter: T(gray) = 2N • T(gray)/T(binary) = 2N-1/(2N –1) → 0.5

Bits T(binary) T(gray) T(gray)/T(binary) 1221.0 2 6 4 0.6667 3 14 8 0.5714 4 30 16 0.5333 5 62 32 0.5161 6 126 64 0.5079 ∞ - - 0.5000

Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 34 FSM State Encoding

Transition probability based on 0.6 0.6 PI statistics 11 01 0.3 0.3 0.1 0.1 0.4 0.4 00 01 00 11 0.1 0.1 0.6 0.9 0.6 0.9 Expected number of state-bit transitions: 2(0.3+0.4) + 1(0.1+0.1) = 1.6 1(0.3+0.4+0.1) + 2(0.1) = 1.0

State encoding can be selected using a power-based cost function.

Source: Prof. V. D. Agrawal

National Central University EE4012VLSI Design 35 FSM: Clock-Gating • Moore machine: Outputs depend only on the state variables.  If a state has a self-loop in the state transition graph (STG), then clock can be stopped whenever a self-loop is to be executed.

Xi/Zk Si Sk Xk/Zk

Clock can be stopped Sj Xj/Zk when (Xk, Sk) combination occurs.

Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 36 Clock-Gating in Moore FSM

PI

Combinational logic PO Flip-flops

Clock activation Latch logic L. Benini and G. De Micheli, Dynamic Power Management, CK Boston: Springer, 1998. Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 37 Bus Encoding for Reduced Power

• Example: Four bit bus  0000 → 1110 has three transitions.  If of second pattern are inverted, then 0000 → 0001 will have only one transition. • Bit-inversion encoding for N-bit bus:

N

N/2

0 after inversion encoding Number of bit transitions bit Number of 0N/2N Number of bit transitions Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 38 Bus-Inversion Encoding Logic Sent data Received data

Bus register Polarity decision M. Stan and W. Burleson, “Bus-Invert Coding for Low Power I/O,” IEEE logic Polarity bit Trans. VLSI Systems, vol. 3, no. 1, pp. 49-58, March 1995. Source: Prof. V. D. Agrawal National Central University EE4012VLSI Design 39 RTL-Level Design – Signal Gating

Simple Decoder Decoder with enable

module decoder (a, sel); module decoder (en,a, sel); input [1:0[ a; input en; ouput [3:0] sel; input [1:0[ a; reg [3:0] sel; ouput [3:0] sel; always @(a) begin reg [3:0] sel; case (a) always @({en,a}) begin 2’b00: sel=4’b0001; case ({en,a}) 2’b01: sel=4’b0010; 3’b100: sel=4’b0001; 2’b10: sel=4’b0100; 3’b101: sel=4’b0010; 2’b11: sel=4’b1000; 3’b110: sel=4’b0100; endcase 3’b111: sel=4’b1000; end default: sel=4’b0000; endmodule endcase end endmodule

National Central University EE4012VLSI Design 40 RTL-Level Design – Datapath Reordering

Initial Reordered

stable Mux A

glitchy Mux A

National Central University EE4012VLSI Design 41 RTL-Level Design – Memory Partition

128x32 din 32 addr dout write

noe 8 addr[7:0] M 32 pre_addr d q addr[7:1] U dout X clk noe write addr dout din 32

addr0 128x32

National Central University EE4012VLSI Design 42 RTL-Level Design – Memory Partition

• Application-driven memory partition

Reads 64K bytes

Data ARM Addr Core R/W

Addr 28K 4K 32K Range 64K

National Central University EE4012VLSI Design 43 RTL-Level Design – Memory Partition

• A power-optimal partitioned memory organization

Decoder

ARM R/W Addr R/W Addr R/W Addr CS Data CS Data CS Data Core

National Central University EE4012VLSI Design 44