Power1.Ps (Mpage)
Low Energy & Power Design Issues
• Processor trends
• Circuit and Technology Issues
• Architectural optimizations
• Low power µP research project

Low Power Design Problem
[Figure: Microprocessor Power (source ISSCC); Power (Watt), 0 to 30, versus Year, 75 to 95]
When the supply voltage drops to 1 Volt, then 100 Watts = 100 Amps
Slide 2

Portable devices
Portable Functions
• Multimodal radio
• Protocols, ECC, ...
• Voice I/O compression & decompression
• Handwriting recognition
• Text/Graphics processing
• Video decompression
• Speech recognition
• Java interpreter
[Figure: Battery (40+ lbs)]
How to get 1 month of operation?
Slide 3

Two Kinds of Computation Required
• General purpose processing (what you have been studying so far)
    • Bursty: mostly idle with bursts of computation
    • Maximum possible throughput required during active periods
• Signal processing (for multimedia, wireless communications, etc.)
    • Stream based computation
    • No advantage in increasing the processing rate above that required to meet real-time requirements
Slide 4

Optimizing for Energy Consumption
• Conventional General Purpose processors (e.g. Pentiums)
    • Performance is everything ... somehow we'll get the power in and back out
    • 10-100 Watts, 100-1000 Mips = 0.01 Mips/mW
• Energy Optimized but General Purpose
    • Keep the generality, but reduce the energy as much as possible, e.g. StrongArm
    • 0.5 Watts, 160 Mips = 0.3 Mips/mW
• Energy Optimized and Dedicated
    • 100 Mops/mW
Slide 5

Switching Energy
[Figure: CMOS inverter with supply Vdd, input Vin, output Vout, and load capacitance CL]
$$\text{Energy/transition} = C_L \cdot V_{dd}^2$$
$$\text{Power} = \text{Energy/transition} \cdot f = C_L \cdot V_{dd}^2 \cdot f$$
Slide 6

Low Power & Low Energy System Design
System          Design partitioning, Power Down
Algorithm       Complexity, Concurrency, Locality, Regularity, Data representation
Architecture    Voltage scaling, Parallelism, Instruction set, Signal correlations
Circuit/Logic   Transistor Sizing, Logic optimization, Activity Driven Power Down, Low-swing logic, Adiabatic switching
Technology      Threshold Reduction, Multi-thresholds
Slide 7

Energy Reduction in CPU's
• Standard power management helps
    • Sleep modes
    • Power down blocks
• Clock rate reduction doesn't help
    • Number of operations = $N_{ops}$
    • Energy/operation = $C V^2$
    • Total Energy = $N_{ops} \cdot C V^2$
Energy is independent of clock rate!
• Reducing the clock rate only degrades throughput and gives no savings in battery life, unless the voltage is changed
Slide 8

Node Transition Activity and Power
Switch a CMOS gate for N clock cycles:
$$E_N = C_L \cdot V_{dd}^2 \cdot n(N)$$
$E_N$: the energy consumed for N clock cycles
$n(N)$: the number of 0->1 transitions in N clock cycles
$$P_{avg} = \lim_{N \to \infty} \frac{E_N}{N} \cdot f_{clk} = \left( \lim_{N \to \infty} \frac{n(N)}{N} \right) \cdot C_L \cdot V_{dd}^2 \cdot f_{clk}$$
$$\alpha_{0 \to 1} = \lim_{N \to \infty} \frac{n(N)}{N}$$
$$P_{avg} = \alpha_{0 \to 1} \cdot C_L \cdot V_{dd}^2 \cdot f_{clk}$$
Slide 9

Factors Affecting Transition Activity, $\alpha_{0 \to 1}$
"Static" component (does not account for timing)
• Type of Logic Function (NOR vs. XOR)
• Type of Logic Style (Static vs. Dynamic)
• Signal Statistics
• Inter-signal Correlations
"Dynamic" or timing dependent component
• Circuit Topology
• Signal Statistics and Correlations
Slide 10

Static 2 Input NOR Gate
A   B   Out
0   0   1
0   1   0
1   0   0
1   1   0
Assume:
    prob(A=1) = 1/2
    prob(B=1) = 1/2
Then:
    prob(Out=1) = 1/4
    prob(0->1) = prob(Out=0) . prob(Out=1) = 3/4 × 1/4 = 3/16
$\alpha_{0 \to 1} = 3/16$
Slide 11

Type of Logic Style: Static vs. Dynamic
[Figure: static NOR gate and clocked (CLK) dynamic NOR gate, each with inputs A, B, supply Vdd, and load CL]
STATIC NOR: $\alpha_{0 \to 1} = 3/16$
DYNAMIC NOR: power is dissipated when Out=0, so $\alpha_{0 \to 1} = N_0 / 2^N = 3/4$
($N_0$: number of 0 entries in the output column of the truth table; $N$: number of inputs)
Slide 12

"Dynamic" or Glitching Activity in CMOS
[Figure: 16-bit ripple adder, carry-in Cin through stages Add0 ... Add15 with sum outputs S0 ... S15; plot of Sum Output Voltage (Volts, 0.0 to 4.0) versus Time (ns, 0 to 10) for individual sum bits such as S1, S10, and S15]
$\alpha_{0 \to 1}$ can be > 1 due to glitching!
Slide 13

Glitch Reduction Using Balanced Paths
[Figure: output F computed from inputs A0 ... A7 by a ripple chain versus a balanced merge tree]
Slide 14

Switching activity and capacitance minimization
• Gated clocks (disable all modules not in use each cycle)
• Block enables (enable only those modules using a bus)
• Instruction Buffer (0th level cache)
• Add stop and sleep instructions to the instruction set
• Minimum size busses
• Minimize I/O: on-chip memory
Slide 15

Minimum Supply Voltage
[Figure: NORMALIZED DELAY (1.0 to 7.5) versus Vdd (volts, 2.0 to 6.0) in 2.0µm technology for a multiplier, clock generator, ring oscillator, microcoded DSP chip, adder, and adder (SPICE)]
$$T_d = \frac{C_L \cdot V_{dd}}{I}, \qquad I \sim (V_{dd} - V_t)^2$$
$$\frac{T_d(V_{dd}=1.5)}{T_d(V_{dd}=5)} = \frac{(1.5) \cdot (5 - 0.7)^2}{(5) \cdot (1.5 - 0.7)^2} \approx 8 \text{ times slower at } 1.5\,\text{V}$$
Velocity saturated: $I \sim (V_{dd} - V_t)$, not quite so bad a penalty
Lowering Vdd reduces energy but increases delays
Critical difference is the amount above Vt
Slide 16

Architecture Trade-offs - Reference Datapath
[Figure: inputs A, B, C latched at rate 1/T, feeding an ADDER followed by a COMPARATOR that produces A>B; Area = 636 x 833 µm²]
Critical path delay ⇒ $T_{adder} + T_{comparator}$ (= 25 ns)
⇒ $f_{ref}$ = 40 MHz
Total capacitance being switched = $C_{ref}$
$V_{dd} = V_{ref}$ = 5 V
⇒ Power for reference datapath = $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$
from [Chandrakasan92] (IEEE JSSC)
Slide 17

Parallel Datapath
[Figure: two copies of the adder-comparator datapath, each latched at rate 1/2T, with an output MUX running at rate 1/T; Area = 1476 x 1219 µm²]
The clock rate can be reduced by half with the same throughput ⇒ $f_{par} = f_{ref} / 2$
$V_{par} = V_{ref} / 1.7$, $C_{par} = 2.15\,C_{ref}$
$$P_{par} = (2.15\,C_{ref}) \left( \frac{V_{ref}}{1.7} \right)^2 \left( \frac{f_{ref}}{2} \right) \approx 0.36\,P_{ref}$$
Slide 18

The More Parallel the Better??
[Figure: NORMALIZED POWER (0.00 to 1.00) versus Vdd (volts, 1.00 to 5.00) at Fixed Throughput, with the Minimal Area and Minimal Power points marked]
Capacitance overhead starts to dominate at "high" levels of parallelism and results in an optimum voltage
Slide 19

Pipelined Datapath
[Figure: adder and comparator separated by pipeline latches, all latches clocked at rate 1/T; Area = 640 x 1081 µm²]
Critical path delay is less ⇒ max [$T_{adder}$, $T_{comparator}$]
Keeping clock rate constant: $f_{pipe} = f_{ref}$
Voltage can be dropped ⇒ $V_{pipe} = V_{ref} / 1.7$
Capacitance slightly higher: $C_{pipe} = 1.15\,C_{ref}$
$$P_{pipe} = (1.15\,C_{ref}) \left( \frac{V_{ref}}{1.7} \right)^2 f_{ref} \approx 0.39\,P_{ref}$$
Slide 20

Architecture Summary for a Simple Datapath
Architecture type                                  Voltage   Area   Power
Simple datapath (no pipelining or parallelism)     5 V       1      1
Pipelined datapath                                 2.9 V     1.3    0.39
Parallel datapath                                  2.9 V     3.4    0.36
Pipeline-Parallel                                  2.0 V     3.7    0.2
Slide 21

Memory Architecture
[Figure: 4 bit display interface built two ways. Serial Access: memory cell array with row decoding, read 4 bits at rate f, Voltage = 3V. Parallel Access: memory cell array with row decoding, read 8 nibbles at rate f / 8 into a latch and multiplexed out 4 bits at rate f, Voltage = 1.1V]
Slide 22

Proposed CPU Architecture: LP-ARM
[Figure: block diagram with Clock Generator (inputs fref, ∆V; outputs fCLK, VDD), Bus/DMA Controller (Abort, Rst, Complete), Inst. Cache, ARM Core, Data Cache, Interrupt Controller (Int[7:0]), Processor State, and Bus I/O Buffers on Mem[N:0], Add[31:0], and Data[31:0]]
Slide 23

LP-ARM: Energy Estimation
Instruction Cache (8kB):
• Low-Power SRAM: 2 kByte Block = 78 pJ [Burstein]
• Complete Instruction Cache Design: ~150 pJ
Clock Generation:
• Global & External 50pF line = 70 pJ
• Total Clock Generation: ~100 pJ
ARM Core:
• Register File + ALU + Shifter > 50% Total [ARM, Burd]
• Register File: 30 pJ, ALU: 24 pJ, Shifter: 16 pJ
• Total Core: ~140 pJ
[Slide annotation: "Dominant Energy Consumer"]
Slide 24

Research Goal
10 MIPS, 1 nJ/inst. (10 mW) ⇔ 80 MIPS, 9 nJ/inst. (720 mW)
[Figure: system block diagram with per-instruction energy targets]
• DC-DC Converter: 100 pJ
• LP-ARM CPU: 500 pJ
• Processor Bus: << 100 pJ
• I/O Interface: 100 pJ
• 0.5 MB SRAM (8 ICs): 300 pJ
Improve energy efficiency by an order of magnitude
Slide 25
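
Worked example for Slides 6 and 8. The sketch below evaluates the switching-energy relation (energy per transition = CL * Vdd^2) and the claim that total energy is independent of clock rate. The workload size, the effective capacitance per operation, the voltages, and the clock rates are hypothetical values chosen only for illustration, and one operation per clock cycle is assumed.

```python
import math

def switching_energy(c_load, vdd):
    """Energy drawn per 0->1 transition: C_L * Vdd^2 (joules)."""
    return c_load * vdd ** 2

def total_energy(n_ops, c_per_op, vdd):
    """Total task energy: Nops * C * V^2. The clock rate does not appear."""
    return n_ops * switching_energy(c_per_op, vdd)

def average_power(n_ops, c_per_op, vdd, f_clk):
    """Average power while running, assuming one operation per clock cycle."""
    run_time = n_ops / f_clk  # seconds needed to finish the task
    return total_energy(n_ops, c_per_op, vdd) / run_time

# Hypothetical workload (not from the slides).
N_OPS = 1_000_000      # operations in the task
C_PER_OP = 1e-9        # effective switched capacitance per operation, farads

energy_fast = total_energy(N_OPS, C_PER_OP, vdd=3.0)
energy_slow = total_energy(N_OPS, C_PER_OP, vdd=3.0)   # same task, slower clock
power_fast = average_power(N_OPS, C_PER_OP, vdd=3.0, f_clk=100e6)
power_slow = average_power(N_OPS, C_PER_OP, vdd=3.0, f_clk=50e6)
energy_low_v = total_energy(N_OPS, C_PER_OP, vdd=1.5)  # halve the supply voltage

print(f"power at 100 MHz: {power_fast:.3f} W, at 50 MHz: {power_slow:.3f} W")
print(f"energy at either clock rate: {energy_fast:.4f} J")
print(f"energy at half the voltage:  {energy_low_v:.4f} J")
```

Halving the clock halves the power but doubles the run time, so the battery delivers the same energy; only the voltage change reduces it, and by a factor of four.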
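
Worked example for Slides 9 to 12. The following sketch computes the 0->1 transition activity of a gate from its truth table, for a static gate (probability the output was 0 times probability it becomes 1) and for a dynamic gate that is precharged high each cycle (activity equals the probability the evaluated output is 0). It assumes independent inputs with equal probability of being 1; the function names are illustrative and not taken from the slides.

```python
from itertools import product

def output_one_probability(gate, n_inputs, p_one=0.5):
    """Probability that the gate output is 1 for independent inputs."""
    prob = 0.0
    for bits in product((0, 1), repeat=n_inputs):
        weight = 1.0
        for b in bits:
            weight *= p_one if b else (1.0 - p_one)
        if gate(*bits):
            prob += weight
    return prob

def static_activity(gate, n_inputs, p_one=0.5):
    """Static CMOS: alpha = prob(Out=0) * prob(Out=1)."""
    p1 = output_one_probability(gate, n_inputs, p_one)
    return (1.0 - p1) * p1

def dynamic_activity(gate, n_inputs, p_one=0.5):
    """Precharged-high dynamic logic: the output node is recharged
    every cycle in which it evaluated to 0, so alpha = prob(Out=0)."""
    return 1.0 - output_one_probability(gate, n_inputs, p_one)

def nor2(a, b):
    return int(not (a or b))

def xor2(a, b):
    return a ^ b

print("static NOR :", static_activity(nor2, 2))
print("dynamic NOR:", dynamic_activity(nor2, 2))
print("static XOR :", static_activity(xor2, 2))
```

With uniform inputs the dynamic-gate result coincides with the N0 / 2^N expression on Slide 12, since N0 of the 2^N equally likely input combinations produce a 0.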
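
Worked example for Slide 16. This sketch evaluates the delay model Td = CL * Vdd / I with I proportional to (Vdd - Vt)^2, and the velocity-saturated variant with I proportional to (Vdd - Vt). The threshold Vt = 0.7 V and the two supply voltages come from the slide; the `exponent` parameter and the unit load and drive constants are conveniences introduced here, and they cancel in the ratio.

```python
def gate_delay(vdd, vt=0.7, exponent=2.0, c_load=1.0, k_drive=1.0):
    """Delay model Td = C_L * Vdd / I, with I = k * (Vdd - Vt)**exponent.

    exponent=2.0 is the long-channel (square-law) case on the slide;
    exponent=1.0 is the velocity-saturated case.
    """
    if vdd <= vt:
        raise ValueError("model only applies for Vdd above Vt")
    current = k_drive * (vdd - vt) ** exponent
    return c_load * vdd / current

# Slowdown when dropping the supply from 5 V to 1.5 V.
square_law_slowdown = gate_delay(1.5) / gate_delay(5.0)
velocity_sat_slowdown = gate_delay(1.5, exponent=1.0) / gate_delay(5.0, exponent=1.0)

print(f"square-law slowdown:         {square_law_slowdown:.2f}x")
print(f"velocity-saturated slowdown: {velocity_sat_slowdown:.2f}x")
```

The square-law case lands between 8 and 9, in line with the slide's figure of roughly 8 times slower, while the velocity-saturated case gives a much smaller penalty, which is the slide's point that the penalty is "not quite so bad".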
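
Worked example for Slides 17 to 21. The sketch below applies P = C * V^2 * f to the reference, parallel, and pipelined datapaths, normalized to the reference design. Capacitance factors and clock factors are taken from the slides; for the scaled supply it uses the 2.9 V listed in the summary table rather than the rounded divisor 1.7, which is a choice made here so that the result can be compared against the table.

```python
V_REF = 5.0  # reference supply voltage from Slide 17, volts

def relative_power(cap_factor, vdd, freq_factor, v_ref=V_REF):
    """Power relative to the reference datapath: (C/Cref) * (V/Vref)^2 * (f/fref)."""
    return cap_factor * (vdd / v_ref) ** 2 * freq_factor

designs = {
    # name: (C / Cref, supply voltage in volts, f / fref)
    "reference": (1.00, 5.0, 1.0),
    "parallel":  (2.15, 2.9, 0.5),   # two copies, half the clock rate
    "pipelined": (1.15, 2.9, 1.0),   # extra latches, same clock rate
}

results = {name: relative_power(*params) for name, params in designs.items()}

for name, power in results.items():
    print(f"{name:10s} P/Pref = {power:.2f}")
```

With the rounded divisor 1.7 in place of 2.9 V the ratios come out slightly higher (about 0.37 and 0.40), so the small difference from the quoted 0.36 and 0.39 is a rounding effect rather than a different model.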
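
Worked example for Slide 25. This sketch checks the arithmetic behind the research goal: instruction rate times energy per instruction gives power, and the per-block energy targets add up to the 1 nJ/inst. budget. The assignment of each pJ figure to a block is read off the slide's two-column layout and should be treated as an assumption, as should leaving the processor bus (given only as "<< 100 pJ") out of the sum.

```python
import math

def power_mw(mips, nj_per_inst):
    """Average power in mW for a given instruction rate and energy per instruction."""
    watts = (mips * 1e6) * (nj_per_inst * 1e-9)
    return watts * 1e3

def mips_per_mw(mips, nj_per_inst):
    """Energy efficiency in MIPS per mW."""
    return mips / power_mw(mips, nj_per_inst)

# Per-instruction energy targets in pJ (block mapping assumed, see above).
budget_pj = {
    "DC-DC converter": 100,
    "LP-ARM CPU": 500,
    "I/O interface": 100,
    "0.5 MB SRAM (8 ICs)": 300,
}
total_nj = sum(budget_pj.values()) / 1000.0

goal_power = power_mw(10, 1.0)       # low-power operating point
fast_power = power_mw(80, 9.0)       # high-throughput operating point
efficiency_gain = mips_per_mw(10, 1.0) / mips_per_mw(80, 9.0)

print(f"budget total: {total_nj} nJ/inst")
print(f"10 MIPS at 1 nJ/inst: {goal_power:.0f} mW")
print(f"80 MIPS at 9 nJ/inst: {fast_power:.0f} mW")
print(f"efficiency ratio between the two points: {efficiency_gain:.1f}x")
```

The ratio between the two operating points is the ratio of their energies per instruction, which is close to the order-of-magnitude improvement the slide names as the goal.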