Low Power Computing
Jeffrey Myers
Josaphat Valdivia

• Motivation
• Improvements
  ◦ Transistor Level
  ◦ CPU Level
  ◦ System Level (Hardware/Software Codesign)
  ◦ Cloud Computing
• DARPA Project

Motivation
• Mobile Computing
  ◦ Battery concerns
• Super Computing
  ◦ Exascale on current technology: 200 MW
    • Would require its own nuclear reactor

• Cloud Computing
  ◦ Large amounts of unused computing power

Improvements
• Improve from the transistor level
  ◦ Intel's Tri-Gate
• Possible to improve any one single component
  ◦ RAM
  ◦ CPU
    • Low voltage/underclocked CPU
    • Clock Gating
    • Significance Compression
• System Improvements

Tri-Gate
• 37% performance increase
• Greater than 50% dynamic power reduction
• Lower leakage (static power reduction)
• Benefits from smaller transistors

CPU-level techniques, ordered from "Sacrifices Performance" to "Preserves Performance":
• Underclocking / Undervolting
• Significance Compression
• Clock Gating

Underclocking & Undervolting

Details
• Power = Capacitance × Voltage² × Frequency
• Operating voltage determines the frequency of operation
• Examples:
  ◦ Intel SpeedStep
  ◦ AMD PowerNow!
  ◦ AMD Cool'n'Quiet

Benefits
• Power consumption reduction
• Stability
• Lower heat/fan noise
  ◦ Can be almost eliminated

Drawbacks
• Degraded performance
  ◦ Not as bad as might be thought, due to memory bottleneck

Clock Gating

Details
• Transistors switch even if the data doesn't change
• Place an enable on the clock to turn switching on/off

Benefits
• Lowers dynamic activity
• Minimal hardware needed

Drawbacks/Limitations
• High-leakage technology does not benefit as much
• Statically enabled/disabled

Improvement
• Comparator-Based Gating
  ◦ Use a comparator with the enable to prevent switching when data does not change
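The idea behind comparator-based gating can be sketched with a small behavioral model: a register is clocked only in cycles where the incoming value differs from the stored one. This is a simplified software model, not the circuit from the cited work; the input stream is hypothetical, and the power drawn by the comparator itself is ignored.

```python
def count_register_clock_events(data_stream, comparator_gating):
    """Count the cycles in which the register's clock is allowed to toggle."""
    stored = None
    clock_events = 0
    for value in data_stream:
        # With gating on, the comparator output acts as the clock enable.
        enable = (value != stored) if comparator_gating else True
        if enable:
            clock_events += 1
            stored = value
    return clock_events


stream = [5, 5, 5, 7, 7, 3, 3, 3, 3, 9]   # hypothetical input values
ungated = count_register_clock_events(stream, comparator_gating=False)
gated = count_register_clock_events(stream, comparator_gating=True)
print(ungated, gated)
```

In this model the ungated register is clocked on every cycle, while the gated one is clocked only when the data changes, which is where the reduction in dynamic activity comes from.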

Significance Compression

Details
• Operate on a byte-per-byte basis
  ◦ Byte-serial pipeline datapath
• Tweaks ISA to compress instructions
  ◦ 80% of immediate values are only 8 bits
  ◦ Most common R-types recoded to use 3 bytes
  ◦ Adds extension bits to gate logic
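A minimal sketch of the underlying idea, assuming 32-bit two's-complement values: keep only the low-order bytes that cannot be recovered by sign extension, plus a small count playing the role of the extension bits. The function names are hypothetical, and this is a simplification of the encoding in the cited paper, not a reproduction of it.

```python
def significant_bytes(value, width_bytes=4):
    """Smallest number of low-order bytes that sign-extend back to `value`."""
    for n in range(1, width_bytes):
        lo, hi = -(1 << (8 * n - 1)), (1 << (8 * n - 1)) - 1
        if lo <= value <= hi:
            return n
    return width_bytes


def compress(value, width_bytes=4):
    """Return (byte count, payload): the bytes the datapath must process."""
    n = significant_bytes(value, width_bytes)
    raw = value.to_bytes(width_bytes, "little", signed=True)
    return n, raw[:n]


def decompress(payload):
    """Sign-extend the payload back to the full value."""
    return int.from_bytes(payload, "little", signed=True)


n, payload = compress(-2)       # a small value needs one byte
restored = decompress(payload)
```

Logic for the dropped upper bytes can stay gated, which is how narrow values translate into lower activity.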

Drawbacks:
• CPI increased by 79%
• Byte-parallel operation: CPI is only around 6% higher

| | Underclocking & Undervolting | Clock Gating | Significance Compression |
| Gains | 6-30+ Watts; lower temperatures; lower noise | Less register activity; 47% power reduction in experiment pipeline; 11% power reduction in real processor | 30-40% logic activity reduction |
| Costs | Clock frequency | Some added complexity | Higher CPI or significant complexity; modified ISA |
| Best For | Mobile devices; cloud computers | All | Mobile devices; embedded systems |

System Level (Hardware/Software Codesign)
• Application Specific Processors have a lower power consumption
  ◦ DSPs (contain special logic to perform operations faster and more efficiently)
• Compiler techniques can be used on general-purpose processors (GPPs) to improve power
  ◦ Best achieved when compiler and hardware are designed together

• Shutdown for Low Power (SLOP) instruction
• Register bypassing
• Cache improvements

SLOP Instruction

Details:
• SLOP is functionally equivalent to a NOP
• SLOP instructions are used to gate the clock, place memory into lower-voltage states, and gate logic
• SLOP instructions are added by the compiler or dynamically to prevent speculation
  ◦ Incorrect speculation leads to wasted calculations and wasted power
• Is more effective than slowing down the clock with high-leakage transistors
  ◦ Static power consumption becomes more dominant when the clock is slowed
  ◦ The longer it takes to process the instruction, the more static energy is consumed

Register Bypassing
• Compiler optimization used to bypass the register file
  ◦ Temporary variables that are only read or used once are put directly into functional units
  ◦ Reduces transistor switch activity (dynamic power)
• 55% decrease in register activity
• 30% decrease in the cycle count
• Results in a 35% energy reduction in data path consumption
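The register bypassing idea can be sketched as a small compiler pass: find temporaries that are written once and read exactly once, so their value could be handed straight to the consuming functional unit instead of going through the register file. The three-address format, the names, and the live-out set below are hypothetical, and the pass assumes straight-line code in which each temporary is defined before its use; it is an illustration, not the algorithm from the cited work.

```python
from collections import Counter


def bypass_candidates(block, live_out):
    """Destinations written once, read exactly once, and dead after the block."""
    reads = Counter(src for _, _, srcs in block for src in srcs)
    writes = Counter(dest for dest, _, _ in block)
    return [dest for dest, _, _ in block
            if writes[dest] == 1 and reads[dest] == 1 and dest not in live_out]


# Hypothetical three-address code: (destination, operation, sources)
block = [
    ("t1", "add", ("a", "b")),
    ("t2", "mul", ("t1", "c")),    # t1 is read only here
    ("t3", "sub", ("t2", "a")),    # t2 is read only here
    ("r",  "add", ("t3", "t3")),   # t3 is read twice, so it stays in a register
]
candidates = bypass_candidates(block, live_out={"r"})
saved_accesses = 2 * len(candidates)   # one register write and one read each
print(candidates, saved_accesses)
```

Each candidate avoids one register-file write and one read in this model, which is the kind of register activity the slide's percentages refer to.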

Cache Improvements

Details
• Cache dissipates 14-27% of power
• 2-layer instruction cache
  ◦ 2nd layer is very small; the size of a basic block
  ◦ 1st layer in low power state during loops
• Compiler optimized code

Benefits
• Compiler does most of the work
• Minor hardware changes
• Lower switching in cache

Drawbacks/Limitations
• Mimics the knapsack problem (NP-complete)
• Requires an advanced compiler to take advantage of the hardware

| | SLOP Instruction | Register Bypassing | Caching Improvement |
| Gains | Lower power without much added complexity; good for high-leakage transistors | 55% decrease in register activity; 35% energy reduction | 30-40% logic activity reduction |
| Costs | Speculation; performance | Compiler complexity | Higher CPI or significant complexity; modified ISA |
| Best For | Mobile devices | All | All |
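To make the knapsack comparison for the small instruction cache concrete, the sketch below picks which basic blocks to place in it using the textbook 0/1 knapsack dynamic program. The block names, sizes, benefit values, and capacity are hypothetical, and the cited work may use a different heuristic; the dynamic program is only practical here because the capacity is small.

```python
def select_blocks(blocks, capacity):
    """0/1 knapsack by dynamic programming.

    blocks: list of (name, size, benefit); capacity: small-cache size in the
    same units as size. Returns (total_benefit, chosen_names).
    """
    # best[c] = (benefit, chosen names) achievable within capacity c
    best = [(0, [])] * (capacity + 1)
    for name, size, benefit in blocks:
        # Iterate capacities downward so each block is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity]


# Hypothetical basic blocks: (name, size in instructions, estimated benefit)
blocks = [
    ("loop_a", 12, 900),
    ("loop_b", 20, 1000),
    ("loop_c", 8, 500),
    ("init", 30, 30),
]
total_benefit, chosen = select_blocks(blocks, capacity=32)
print(total_benefit, chosen)
```

The benefit figure stands in for whatever the compiler estimates, for example how often a block executes inside a loop.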

Note: All of these improvements require additional hardware support.

Cloud Computing

Situation
• Average server loads: 10-50%
• Lots of high performance processors sitting idle

Solution
• Dynamically predict the load
• Allocate to processor in heterogeneous server
• Place unused processors in sleep mode

Benefits
• Up to 60% power reduction
• Processors that remain on maintain 80% utilization

Conclusions
• Power reduction can be achieved without significant performance loss
  ◦ Other bottlenecks are still the dominant factor in the system
  ◦ Can be done with near-zero performance loss at the expense of significant complexity
  ◦ Designed together, the hardware and software can achieve the best balance between performance and power
• Most proposed improvements can be implemented simultaneously

DARPA Project
• DARPA's PERFECT
  ◦ Power Efficiency Revolution For Embedded Computing Technologies
    • Near-threshold voltage operation
    • Architectural advances
    • New memory hierarchies
    • Application specific cores
    • Reconfigurable, synthesized processing elements
    • Software algorithms to minimize consumption

  ◦ Goal: From 1 GFLOP/watt to 75 GFLOP/watt