Power1.Ps (Mpage)

Low Energy & Power Design Issues
• Processor trends
• Circuit and Technology Issues
• Architectural optimizations
• Low power µP research project

Low Power Design Problem
[Figure: Microprocessor Power (source ISSCC). Power (Watt), 0 to 30, versus Year, 75 to 95.]
When the supply voltage drops to 1 Volt, 100 Watts = 100 Amps
Slide 2

Portable devices
Portable Functions
• Multimodal radio
• Protocols, ECC, ...
• Voice I/O compression & decompression
• Handwriting recognition
• Text/Graphics processing
• Video decompression
• Speech recognition
• Java interpreter
[Figure: Battery (40+ lbs)]
How to get 1 month of operation?
Slide 3

Two Kinds of Computation Required
• General purpose processing (what you have been studying so far)
  • Bursty - mostly idle with bursts of computation
  • Maximum possible throughput required during active periods
• Signal processing (for multimedia, wireless communications, etc.)
  • Stream based computation
  • No advantage in increasing the processing rate above that required for real-time operation
Slide 4

Optimizing for Energy Consumption
• Conventional General Purpose processors (e.g. Pentiums)
  • Performance is everything ... somehow we'll get the power in and back out
  • 10-100 Watts, 100-1000 Mips = .01 Mips/mW
• Energy Optimized but General Purpose
  • Keep the generality, but reduce the energy as much as possible - e.g. StrongArm
  • .5 Watts, 160 Mips = .3 Mips/mW
• Energy Optimized and Dedicated
  • 100 Mops/mW
Slide 5

Switching Energy
[Figure: CMOS inverter with supply Vdd, input Vin, output Vout and load capacitance CL]
Energy/transition = $C_L \cdot V_{dd}^2$
Power = Energy/transition $\cdot f$ = $C_L \cdot V_{dd}^2 \cdot f$
Slide 6

Low Power & Low Energy System Design
Level           Techniques
System          Design partitioning, Power Down
Algorithm       Complexity, Concurrency, Locality, Regularity, Data representation
Architecture    Voltage scaling, Parallelism, Instruction set, Signal correlations
Circuit/Logic   Transistor Sizing, Logic optimization, Activity Driven Power Down, Low-swing logic, Adiabatic switching
Technology      Threshold Reduction, Multi-thresholds
Slide 7

Energy Reduction in CPU's
• Standard power management helps
  • Sleep modes
  • Power down blocks
• Clock rate reduction doesn't help
  • Number of operations = $N_{ops}$
  • Energy/operation = $C V^2$
  • Total Energy = $N_{ops} \cdot C V^2$
Energy is independent of clock rate!
• Reducing the clock rate only degrades throughput and gives no savings in battery life, unless the voltage is also changed
Slide 8

Node Transition Activity and Power
Switch a CMOS gate for N clock cycles:
$E_N = C_L \cdot V_{dd}^2 \cdot n(N)$
$E_N$: the energy consumed for N clock cycles
$n(N)$: the number of 0->1 transitions in N clock cycles
$P_{avg} = \lim_{N \to \infty} \frac{E_N}{N} \cdot f_{clk} = \left( \lim_{N \to \infty} \frac{n(N)}{N} \right) \cdot C_L \cdot V_{dd}^2 \cdot f_{clk}$
$\alpha_{0 \to 1} = \lim_{N \to \infty} \frac{n(N)}{N}$
$P_{avg} = \alpha_{0 \to 1} \cdot C_L \cdot V_{dd}^2 \cdot f_{clk}$
Slide 9

Factors Affecting Transition Activity, $\alpha_{0 \to 1}$
"Static" component (does not account for timing)
• Type of Logic Function (NOR vs. XOR)
• Type of Logic Style (Static vs. Dynamic)
• Signal Statistics
• Inter-signal Correlations
"Dynamic" or timing dependent component
• Circuit Topology
• Signal Statistics and Correlations
Slide 10

Static 2 Input NOR Gate
[Figure: static CMOS NOR gate with inputs A and B, output Out and load CL]
Assume:
prob(A=1) = 1/2
prob(B=1) = 1/2
A  B  Out
0  0  1
0  1  0
1  0  0
1  1  0
Then:
prob(Out=1) = 1/4
prob(0→1) = prob(Out=0) · prob(Out=1) = 3/4 × 1/4 = 3/16
$\alpha_{0 \to 1}$ = 3/16
Slide 11

Type of Logic Style: Static vs. Dynamic
[Figure: static NOR gate and clocked (CLK) dynamic NOR gate, each with inputs A and B and load CL]
STATIC NOR: $\alpha_{0 \to 1}$ = 3/16
DYNAMIC NOR: power is dissipated when Out=0, so $\alpha_{0 \to 1} = \frac{N_0}{2^N} = \frac{3}{4}$ (3 of the 4 truth-table rows give Out = 0)
Slide 12

"Dynamic" or Glitching Activity in CMOS
[Figure: ripple-carry adder chain Add0 ... Add15 with carry input Cin and sum outputs S0 ... S15; plot of Sum Output Voltage (Volts, 0.0 to 4.0) versus Time (ns, 0 to 10) for the sum bits, including S1, S10 and S15]
$\alpha_{0 \to 1}$ can be > 1 due to glitching!
Slide 13

Glitch Reduction Using Balanced Paths
[Figure: two implementations of a function F of inputs A0 ... A7, labelled Ripple and Merge]
Slide 14

Switching activity and capacitance minimization
• Gated clocks. (disable all modules not in use each cycle)
• Block enables. (enable only those modules using a bus)
• Instruction Buffer. (0th level cache)
• Add stop and sleep instructions to the instruction set.
• Minimum size busses
• Minimize I/O - on-chip memory
Slide 15

Minimum Supply Voltage
[Figure: NORMALIZED DELAY (1.0 to 7.5) versus Vdd (volts, 2.0 to 6.0) in a 2.0µm technology; curves for multiplier, clock generator, ring oscillator, microcoded DSP chip, adder and adder (SPICE)]
$T_d = \frac{C_L \cdot V_{dd}}{I}$, with $I \sim (V_{dd} - V_t)^2$
$\frac{T_d(V_{dd}=1.5)}{T_d(V_{dd}=5)} = \frac{(1.5)(5 - 0.7)^2}{(5)(1.5 - 0.7)^2} \approx 8.7$, i.e. about 8 times slower at 1.5 V
Velocity saturated: $I \sim (V_{dd} - V_t)$, so the penalty is not quite so bad
Lowering Vdd reduces energy but increases delays
The critical difference is the amount above Vt
Slide 16

Architecture Trade-offs - Reference Datapath
[Figure: inputs A, B and C pass through latches clocked at 1/T into an adder followed by a comparator (A>B). Area = 636 x 833 µ²]
Critical path delay ⇒ $T_{adder} + T_{comparator}$ (= 25ns) ⇒ $f_{ref}$ = 40MHz
Total capacitance being switched = $C_{ref}$
$V_{dd} = V_{ref}$ = 5V
Power for reference datapath = $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$
from [Chandrakasan92] (IEEE JSSC)
Slide 17

Parallel Datapath
[Figure: two copies of the adder and comparator datapath, each with latches clocked at 1/2T, combined by a multiplexer clocked at 1/T. Area = 1476 x 1219 µ²]
The clock rate can be reduced by half with the same throughput ⇒ $f_{par} = f_{ref} / 2$
$V_{par} = V_{ref} / 1.7$, $C_{par} = 2.15 C_{ref}$
$P_{par} = (2.15 C_{ref}) (V_{ref}/1.7)^2 (f_{ref}/2) \approx 0.36 P_{ref}$
Slide 18

Pipelined Datapath
[Figure: adder and comparator separated by an additional pipeline latch stage, all latches clocked at 1/T. Area = 640 x 1081 µ²]
Critical path delay is less ⇒ max [$T_{adder}$, $T_{comparator}$]
Keeping clock rate constant: $f_{pipe} = f_{ref}$
Voltage can be dropped ⇒ $V_{pipe} = V_{ref} / 1.7$
Capacitance slightly higher: $C_{pipe} = 1.15 C_{ref}$
$P_{pipe} = (1.15 C_{ref}) (V_{ref}/1.7)^2 f_{ref} \approx 0.39 P_{ref}$
Slide 19

The More Parallel the Better??
[Figure: NORMALIZED POWER (0.00 to 1.00) versus Vdd (volts, 1.00 to 5.00) at Fixed Throughput, with the Minimal Area and Minimal Power points marked]
Capacitance overhead starts to dominate at "high" levels of parallelism and results in an optimum voltage
Slide 20

Architecture Summary for a Simple Datapath
Architecture type                                 Voltage   Area   Power
Simple datapath (no pipelining or parallelism)    5V        1      1
Pipelined datapath                                2.9V      1.3    0.39
Parallel datapath                                 2.9V      3.4    0.36
Pipeline-Parallel                                 2.0V      3.7    0.2
Slide 21

Memory Architecture
[Figure: Serial Access and Parallel Access organizations of a memory cell array with address (Addr) row decoding, feeding a 4 bit display interface. Serial access reads one 4-bit nibble per cycle through a mux and latch clocked at f. Parallel access reads 8 nibbles at once into a latch clocked at f / 8, followed by a mux clocked at f.]
Serial Access: Voltage = 3V
Parallel Access: Voltage = 1.1V
Slide 22

Proposed CPU Architecture: LP-ARM
[Figure: block diagram with Clock Generator (inputs fref, ∆V; outputs fCLK, VDD), Bus/DMA Controller, Bus I/O Buffer, Inst. Cache, ARM Core, Data Cache, Interrupt Controller and Processor State; external signals Mem[N:0], Add[31:0], Data[31:0], Int[7:0], Abort, Rst, Complete]
Slide 23

LP-ARM: Energy Estimation
Instruction Cache (8kB):
• Low-Power SRAM: 2 kByte Block = 78 pJ [Burstein]
• Instruction Cache Design: ~150 pJ
Clock Generation:
• Global & External 50pF line = 70 pJ
• Total Clock Generation: ~100 pJ
ARM Core:
• Register File + ALU + Shifter > 50% Total [ARM,Burd]
• Register File: 30 pJ, ALU: 24 pJ, Shifter: 16 pJ
• Total Core: ~140 pJ
[Slide annotation: Dominant Energy Consumer]
Slide 24

Research Goal
10 MIPS, 1 nJ/inst. (10 mW) ⇔ 80 MIPS, 9 nJ/inst. (720 mW)
Component                 Energy
DC-DC Converter           100 pJ
LP-ARM CPU                500 pJ
Processor Bus             << 100 pJ
I/O Interface             100 pJ
0.5 MB SRAM (8 ICs)       300 pJ
Improve energy efficiency by an order of magnitude
Slide 25
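
The arithmetic behind Slides 9, 16, 18, 19 and 25 can be checked with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical (they do not come from the slides), the delay model is the simple square-law one from Slide 16, and the scaled supply is taken as 2.9 V from the summary table on Slide 21 rather than the rounded Vref / 1.7.

```python
# Illustrative check of the slide formulas. All function names are
# hypothetical helpers, not part of the original lecture material.

def switching_power(alpha, c_load, vdd, f_clk):
    """Average dynamic power: alpha * C_L * Vdd^2 * f_clk (Slide 9)."""
    return alpha * c_load * vdd ** 2 * f_clk


def delay_ratio(v_low, v_high, vt):
    """Td(v_low) / Td(v_high), with Td = C*Vdd/I and I ~ (Vdd - Vt)^2 (Slide 16)."""
    def td(v):
        # The load capacitance cancels in the ratio, so it is omitted.
        return v / (v - vt) ** 2
    return td(v_low) / td(v_high)


def relative_power(c_scale, v_scale, f_scale):
    """P / Pref when C, V and f are scaled relative to the reference datapath."""
    return c_scale * v_scale ** 2 * f_scale


def power_watts(mips, nj_per_inst):
    """Power = instruction rate * energy per instruction (Slide 25)."""
    return (mips * 1e6) * (nj_per_inst * 1e-9)


if __name__ == "__main__":
    v_scale = 2.9 / 5.0  # assumed: 2.9 V supply from the Slide 21 table, Vref = 5 V
    print(f"delay penalty at 1.5 V: {delay_ratio(1.5, 5.0, 0.7):.1f}x")       # about 8.7x
    print(f"parallel  P/Pref: {relative_power(2.15, v_scale, 0.5):.2f}")      # about 0.36
    print(f"pipelined P/Pref: {relative_power(1.15, v_scale, 1.0):.2f}")      # about 0.39
    print(f"goal:    {power_watts(10, 1) * 1e3:.0f} mW")                      # 10 mW
    print(f"today:   {power_watts(80, 9) * 1e3:.0f} mW")                      # 720 mW
```

With the rounded factor 1/1.7 instead of 2.9/5, the same function gives roughly 0.37 and 0.40, so the quoted 0.36 and 0.39 appear to follow from the 2.9 V figure.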
