Apollo: Adaptive power optimization and Control for the land Warrior
Massoud Pedram Dept. of EE-Systems University of Southern California NirajK.Jha Dept. of EE Princeton University
April 26, 2001
Outline
Part I: Dynamic Power Management and Power-Aware Architecture Reorganization OS-directed power management Architecture organization techniques Apollo Testbed Summary I Part II: Power-Aware Behavioral and System Synthesis Input space adaptive software Power-aware dynamic scheduler High-level software energy macromodels Leakage power analysis and optimization HW-SW co-synthesis with DR-FPGAs Low power distributed system of SOCs Summary II
M. Pedram and N. Jha Introduction
Land Warrior (LW) system’s design objectives include: Situational awareness Power awareness Average power reduction of the LW system while meeting key minimum performance and quality-of-service criteria is an important design driver Much of the power savings is at the system-level and can be achieved thru dynamic power management (DPM) and power-aware architecture organization Our research focuses on these two Land Warrior approaches to system-level power reduction and is expected to deliver 2- 4 X power savings for the LW system
M. Pedram and N. Jha
Project Overview
Land Warrior System Spec
on System Synthesis et •Transformations Process / c s in u •HW/SW partitioning Component Pr c Fo •Allocation and binding Library
Power-aware Hardware Synthesis, Power-aware Dynamic Scheduling Architecture Optimizer
s Data Logger / OS-Directed cu Fo Emulator Power / QoS Management C US
M. Pedram and N. Jha OS-directed Power Management
OS “controls” system Power resources. It can also Saving control the power states of the resources
Deep
ACPI Power Consumption Need to develop (Advanced Configuration and Work Sleep Sleep Off power management Power Interface) policies ACPI - Power Saving Modes provides an interface between the OS and system resources Program 1 Program 2 Program 3 Application Program Interface System Resources Kernel Device Drivers I/O CPU Hard Disk Power Manager (PM)
ds an mm Co
W R R
Service Queue (SQ) Service Provider (SP) Service Requestor (SR)
M. Pedram and N. Jha
Policy Optimization Flow
Power management policy optimization flow: Build stochastic models of Application Program the service requesters (SRs) and service providers (SPs) System Utilities Construct a continuous- requests time Markovian decision hints process model (CTMDP) of the power-managed system Service Queue Obtain stochastic parameter values by State static profiling and Data info runtime monitoring Read/Write Solve the resulting power Power Manager optimization problem under performance and commands quality-of-service constraints to obtain the Device Driver optimal policy
M. Pedram and N. Jha Example: A Fujitsu Hard Disk
Two abstract models of the Fujitsu Hard Disk
busy busy
idle A 4-state SP model A 3-state SP model idle
sleep sleep standby
Policies under study: “Always On” policy “Greedy” reactive policy “Time Out” policies with different time-out values “N” policies with different N values 3CTMDP: CTMDP policies using the 3-state SP model 4CTMDP: CTMDP policies using the 4-state SP model CTMDP-Poll: 3CTMDP with polling to address the long tail of distribution
M. Pedram and N. Jha
Experimental Results
Always On Greedy N-Policy Time-Out 3CTMDP 4CTMDP CTMDP-POLL
1.4
1.2
1
0.8
Power (W) 0.6
0.4
0.2 0.001 0.01 0.1 1 10 Delay (sec)
Power-delay trade-off curves for Uniform transition time distribution for the hard disk and a combination of Exponential and Pareto distributions for the request inter-arrival times M. Pedram and N. Jha A Two-layered PM Strategy for the LW
Mission Type
Combat Reconnaissance DataData Trace Trace
Terrain/Weather To OSPM Woods, foggy Urban, rainy LWLW System System PolicyPolicy Generator Generator
Mission time
2 hours 12 hours SystemSystem Mode Mode
Mode-based Decision Tree Parametrizable PQ policy
M. Pedram and N. Jha
On-chip Memory Bus Power Reduction
The processor-memory bus is Large highly capacitive. Significant capacitance power is consumed to drive these busses Memory
Processor
Floating point average
30%
20% Integer average
10%
0123459 6 78 10 Computer Architecture, Bits of branch displacement A Quantitative Approach, Hennessy and Patterson
M. Pedram and N. Jha Address Bus Encoding: Alborz Code
Coding-enable Flag Encoder block:
Code word
SUB D LWC XOR Source word BUS
Sender’s codebook
Decoder block: Coding-enable Flag
Code word
XOR LWC D ADD BUS Source word Receiver’s codebook
M. Pedram and N. Jha
Experimental Results
Benchmark Base Fixed ALBORZ 512 Adaptive Adaptive ALBORZ Case Entry Redundant ALBORZ Redun. 32 Entry Irredun. Pwr Pwr 32 64 Pwr Entry Entry Art 34.3 2.765 2.466 2.110 2.382 92.0% 92.9% 93.9% 93.1% Gzip 34.7 4.005 3.103 2.651 3.483 88.5% 91.1% 92.4% 90.0% Vortex 37.3 6.048 6.244 5.554 4.918 85.9% 83.3% 85.2% 86.9% Equak 35.4 6.383 7.673 6.472 6.576 82.1% 78.4% 87.8% 81.5% Gcc 35.4 6.373 7.651 6.881 7.490 82.1% 78.5% 80.6% 78.9% Parser 37.0 4.177 4.273 3.614 4.560 88.8% 88.5% 90.3% 87.3% Vpr 35.9 7.021 6.085 5.502 5.928 80.5% 83.1% 84.7% 83.5% %power 85.3% 85.1% 87.0% 85.9% reduction M. Pedram and N. Jha Off-chip Memory Bus Power Reduction
DRAM address bus is multiplexed between the row and column addresses SRAM DRAM
A1: Row1 Col1 0001 Row1 00 Internal Col1 01 External A2: Row2 Col2 0011 Row2 00 Internal Col2 11
SA=1 SA=4 Find a permutation (one-to-one function) that generates the minimum external switching activity
M. Pedram and N. Jha
Example
Address space of 0, 1, 2, 3 = 2^2 Multiplexed on a 1-bit bus 2^1 nodes and 2^2 edges
Pyramid Binary 00 00 0 00 0 0 0 0 01 0 01 0 1 1 10 01 11 1 10 1 1 1 0 10 1 11 1 11 0 1
SA=4 SA=2 SA=6 SA=4
M. Pedram and N. Jha RC Graph and Eulerian Cycle
Construct a merged Row-Column Graph where nodes represent row and column addresses and edge (u,v) corresponds to binary number uv Find an Eulerian cycle on this graph using forward or backward traversal to obtain the Pyramid code
00 01 10 11 00 01 00 0 8 11 1
01 10 9 14 6
10 15 13 12 4
10 11 11 7 5 3 2
Pyramid I – Forward Traversal M. Pedram and N. Jha
Experimental Results
Binary vs. Pyramid Binary Pyramid
2500
2000
1500
1000 Switching Activity 500
0 1024 512 256 128 64 32 16 8 4 Segment Size
SPEC95: compress, perl, and ijpeg
14 12 10 Invert 8 External 6 Internal 4 2 0 Transition Count in Millions rt d I rt e i ry e ry ary v m v in n id-B n id-BI B yra Bina yramid Bina mid-BI P P Pyramid us I yram yram ra B P Bus I P Bus Invert Py M. Pedram and N. Jha Apollo Testbed Block Diagram
USB Webcam
PS/2 KB & Mouse RS232 GPS Assabet : Neponset: SA 1110 SA 1111 CF Flash ROM
Audio Speaker & Microphone $SROOR7HVWEHG PCMCIA WLAN
M. Pedram and N. Jha
Device Information
Microtek EyeStar U2S Webcam 900 800 USB interface 700 600 VISION CPiA chip (Color Processor 500 High Resolution interface ASIC) 400 Low Power Standby 300 30 fps 200 24-bit color 100 0 640x480 pixels Power (mW)
1800 Cisco Aironet 340 Wireless LAN PC Card 1600 1400 IEEE 802.11b 1200 Data rate: 1, 2, 5.5 and 11 Mbps 1000 Transmit 800 Receive Range: 120 m outdoors; 30 m indoors Idle 600 400 Adjustable Transmit Power
200 200 0 180 Power (mW) 160 Trimble Lassen LP GPS Kit 140 120 8-channel, continuous tracking receiver 100 Normal 80 Deep Sleep Update rate: 1 Hz 60 40 Position accuracy: 25 m 20 0 Velocity accuracy: 0.1 m/sec Power (mW) M. Pedram and N. Jha Apollo Testbed Photo Shots
M. Pedram and N. Jha
Hardware/Software Layers
Situational awareness software with power management object-db view event comm map mail gps GTK gpsd Gnome Xwindow (4.0.2) Software Linux (2.4.0) SA Patch (rmk1) Board Patch (np1) switchd wvlan-cs scale frontlight
Assabet Neponset SA1110 SA1111 PCMCIA FlashROM PS/2 RS232 PWM USB SA1 Hardware
WLAN CF KB GPS FL Webcam M. Pedram and N. Jha Map Viewer/Finder Snap Shots
M. Pedram and N. Jha
Summary I
The degree of power savings due to OSPM is dependent on the workload conditions and the minimum performance and quality-of-service requirements We expect a factor of 2-4 reduction in the total power dissipation of the LW system. This is being demonstrated on the Apollo Testbed Even higher power savings are possible if the processor, I/O device hardware, and device driver software are equipped with built-in power management facilities Future research work includes power reduction of the PCMCIA wireless LAN and the USB Webcam. The next generation of the Apollo Testbed will use the XScale processor from Intel. XScale microarchitecture offers a 4-8 X improvement in the power-performance metric over Intel StrongARM
M. Pedram and N. Jha Outline
Part I: Dynamic Power Management and Power-Aware Architecture Reorganization Introduction OS-directed power management Architecture organization techniques Apollo Testbed Conclusions Part II: Power-Aware Behavioral and System Synthesis Input space adaptive software Power-aware dynamic scheduler High-level software energy macromodels Leakage power analysis and optimization HW-SW co-synthesis with DR-FPGAs Low power distributed system of SOCs Summary II
M. Pedram and N. Jha
Input Space Adaptive Software Synthesis
Inputs Simulate program Apply enabling compiler Identify promising with input traces 1 transformations 2 sub-programs for ISAD 3 For each selected sub-program
Choose the input Optimize the selected sub- Outputs sub-spaces that lead to programs under the most energy savings 4 chosen input sub-spaces 5
Improvements in performance and energy (SPARClite/StrongARM): Performance Energy Energy*delay Maximum 7.2X/8.5X 7.6X/7.8X 54.7X/66.3X Average 3.7X/3.0X 3.7X/3.1X 13.7X/9.3X
M. Pedram and N. Jha Dynamic Scheduling
Dynamic voltage Dynamic power scaling management
Servicing soft aperiodic tasks Low power consumption Resource reclaiming Online Responsive service of soft Slack time exploitation Scheduling aperiodic tasks
Off-line flexibility Static Precedence relationship Global flexibility optimization Scheduling Hard timing constraints
Execution slot reservation for hard aperiodic tasks Static Resource Allocation
Power reduction over non-power-conscious scheme: ave.: 2.2X; max.: 3.1X
M. Pedram and N. Jha
Battery-aware Scheduling
Approaches Objectives
Slack-based list Initial static Precedence relationship scheduling scheduling Hard timing constraints
Slack-time reallocation Variable voltage Minimize the average power consumption Schedule slots shifting to scaling matchtheallocatedslacktime Increase the battery life span
Battery-aware Optimize the profile of the Schedule slots schedule discharge current interchanging processing Increase the battery life span
Battery lifespan extension: average: 1.3X, maximum: 1.6X
M. Pedram and N. Jha Energy Macromodels for Software
Typical inputs
Instrumented Un-instrumented function function
Model Behavioral Power simulation & Post- simulation = + + E b1c1 b2c2 b3c3 processing
c1 c2 c3 Energy = Regression b1 ? 1 2 1 1.00E-07 tool = 2 4 1 1.27E-07 b2 ? 3 5 1 4.24E-07 4 6 1 8.05E-07 b = ? .. 3 60 1 1 0.00029 61 2 2 0.000304 62 3 2 0.000309 63 4 2 0.000326
M. Pedram and N. Jha
Complexity-based Model Results
Examples Mo dels # of samples SPARClite
Error Sp eedup
chksum c + c N 400 1.4% 1361
1 2
igray c + c log (N ) 2560 8.0% 540
1 2 2
edgedet c + c M + c N + c MN 1000 0.3% 673325
1 2 3 4
2
ins sort c + c N + c N 250 6.7% 30050
1 2 3
mult c + c L + c LM + c LM N 625 2.4% 32213
1 2 3 4
myqsort c + c N + c Nlog (N ) 250 5.3% 38155
1 2 3 2
msort c + c N + c Nlog (N ) 1000 4.2% 126780
1 2 3 2
myfrag c + c N 1500 14.6% 81517
1 2
Ultra SPARC II (336 MHz) running SunOS 5.7 Real memory 4GB
M. Pedram and N. Jha Leakage Power Optimization
CDFG Derive Extract high- Determine off- utilization of probability critical path Schedule on-critical paths FU’s path FU’s
Prioritize FU’s
MTCMOS Estimate Library Synthesis System Leakage Power
M. Pedram and N. Jha
Leakage Results
80 Pleak/Ptotal Opt: Pleak/Ptotal 70 % Reduction: Ptotal (additional) 60 %Reduction:Pleak 50 40 30 20 10 0 0.35u 0.18u 0.13u 0.10u 0.07u Technology Family
M. Pedram and N. Jha Hardware/software Co-Synthesis with DR-FPGAs
Allocation Allocation transformation 250 Assignment 200 Assignment 150 Reconfiguration prefetch Scheduling transformation Our approach 100
Initialization Scheduling 50 solution Reconfiguration power (mW) 0 12345678910 Solution Solution prune Exa mples prioritizing
Solution
14000000 Solution 12000000 700 10000000 600 8000000 Reconfiguration prefetch 500 6000000 Our approach
400 4000000 2000000 300
Schedule length (Time units) 0 200 12345678910
Power consumption (mW) Exa mples 100
0 100 200 300 400 500 600 System price (Dollars) M. Pedram and N. Jha
System-level Functional Partitioning
Functional Partitioning Integrated with System Synthesis - Homogeneous model with partitioner/synthesizer - Integrated evolutionary algorithm - Optimize power/price under area and real-time constraints
System specification Target architecture Design constraints Design space Partitioning and exploration synthesis environment
Functional partitioning with allocation
Functional partitioning Resource with assignment library
Global scheduling
Cost calculation
Distributed system of SOCs
M. Pedram and N. Jha Summary II
Input space adaptive SW synthesis reduces energy*delay by up to 66X Power-aware dynamic scheduler reduces power by up to 3.1X Battery-aware scheduler extends battery lifespan by up to 1.6X High-level SW energy macromodels speed up power estimation by 1-to-5 orders of magnitude with an average error of only 5% Leakage power optimization reduces leakage in .07 micron technology by up to 8.4X and total power by up to 2.3X HW-SW co-synthesis with DR-FPGAs reduces power by up to 6X Functional partitioning for distributed system of SOCs reduces power by up to 2.5X
M. Pedram and N. Jha