Apollo: Adaptive Power Optimization and Control for the Land Warrior

Massoud Pedram, Dept. of EE-Systems, University of Southern California
Niraj K. Jha, Dept. of EE, Princeton University

April 26, 2001

Outline

Part I: Dynamic Power Management and Power-Aware Architecture Reorganization
- OS-directed power management
- Architecture organization techniques
- Apollo Testbed
- Summary I

Part II: Power-Aware Behavioral and System Synthesis
- Input space adaptive software
- Power-aware dynamic scheduler
- High-level software energy macromodels
- Leakage power analysis and optimization
- HW-SW co-synthesis with DR-FPGAs
- Low power distributed system of SOCs
- Summary II

M. Pedram and N. Jha

Introduction

Land Warrior (LW) system design objectives include:
- Situational awareness
- Power awareness

Reducing the average power of the LW system while meeting key minimum performance and quality-of-service criteria is an important design driver.

Much of the power savings is at the system level and can be achieved through dynamic power management (DPM) and power-aware architecture organization.

Our research focuses on these two approaches to system-level power reduction and is expected to deliver 2-4X power savings for the LW system.


Project Overview

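[Figure: project overview. The Land Warrior system spec feeds system synthesis (transformations, HW/SW partitioning, allocation and binding), supported by a process/component library. The remaining blocks are power-aware hardware synthesis, power-aware dynamic scheduling, an architecture optimizer, a data logger/emulator, and OS-directed power/QoS management. The system synthesis side is marked "Princeton Focus" and the OS-directed power management side is marked "USC Focus".]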

OS-directed Power Management

- The OS “controls” system resources. It can also control the power states of the resources
- ACPI (Advanced Configuration and Power Interface) provides an interface between the OS and system resources
- Need to develop power management policies

[Figure: ACPI power saving modes (Work, Sleep, Deep Sleep, Off) shown against power consumption and power saving.]

[Figure: system model. Programs 1 to 3 sit above the application program interface, the kernel, and the device drivers, which control the system resources (I/O, CPU, hard disk). The power manager (PM) issues commands; read (R) and write (W) requests from the service requester (SR) wait in the service queue (SQ) for the service provider (SP).]


Policy Optimization Flow

Power management policy optimization flow:
- Build stochastic models of the service requesters (SRs) and service providers (SPs)
- Construct a continuous-time Markovian decision process (CTMDP) model of the power-managed system
- Obtain stochastic parameter values by static profiling and runtime monitoring
- Solve the resulting power optimization problem under performance and quality-of-service constraints to obtain the optimal policy

[Figure: the application program and system utilities send requests and hints; the service queue supplies state info; the power manager issues commands to the device driver, which also handles data read/write.]
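As a rough illustration of one building block in this flow, the sketch below evaluates a single fixed policy rather than optimizing over policies: once a policy is fixed, the power-managed SP behaves as a continuous-time Markov chain, and its average power follows from the stationary distribution. The generator matrix, the state powers, and the function names are hypothetical and are not taken from the slides; solving the full CTMDP under constraints is not shown.

```python
def stationary_distribution(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for a small irreducible CTMC."""
    n = len(Q)
    # Balance equations are the rows of Q transposed; one of them is
    # redundant, so the last is replaced by the normalization constraint.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - tail) / A[r][r]
    return pi


def average_power(Q, power):
    """Expected power: state power weighted by stationary probability."""
    pi = stationary_distribution(Q)
    return sum(p * w for p, w in zip(pi, power))


# Hypothetical 3-state SP (busy, idle, sleep) under one fixed policy.
# Rates are per second and power is in watts; all values are made up.
Q = [
    [-2.0, 2.0, 0.0],    # busy -> idle
    [1.0, -1.5, 0.5],    # idle -> busy, idle -> sleep
    [0.25, 0.0, -0.25],  # sleep -> busy
]
POWER = [2.0, 1.0, 0.1]
avg = average_power(Q, POWER)
```

Comparing `average_power` across several candidate generator matrices (one per policy) gives a crude way to rank policies before any constraint handling is added.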

Example: A Fujitsu Hard Disk

Two abstract models of the Fujitsu Hard Disk:
- A 4-state SP model: busy, idle, standby, sleep
- A 3-state SP model: busy, idle, sleep

Policies under study:
- “Always On” policy
- “Greedy” reactive policy
- “Time Out” policies with different time-out values
- “N” policies with different N values
- 3CTMDP: CTMDP policies using the 3-state SP model
- 4CTMDP: CTMDP policies using the 4-state SP model
- CTMDP-Poll: 3CTMDP with polling to address the long tail of the distribution
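To make the heuristic policies concrete, here is a minimal trace-driven sketch of a time-out policy on a simplified two-mode disk (on or asleep). It assumes that "Always On" corresponds to an infinite time-out and "Greedy" to a zero time-out. The disk parameters and the request trace are hypothetical and are not the Fujitsu measurements; the N-policy and CTMDP policies are not modeled.

```python
import math


def simulate_timeout(arrivals, timeout, service, p_busy, p_idle,
                     p_sleep, wake_time, p_wake):
    """Return (average power, mean delay per request) for a time-out policy.

    The disk sleeps once it has been idle for `timeout` seconds and needs
    `wake_time` seconds at power `p_wake` to wake up for the next request.
    """
    energy = 0.0
    delay = 0.0
    now = 0.0  # time at which the disk last became idle
    for t in arrivals:
        idle = max(0.0, t - now)   # idle period before this request
        start = max(t, now)        # request waits if the disk is still busy
        if idle > timeout:
            # Idle until the time-out fires, sleep, then pay the wake-up cost.
            energy += p_idle * timeout + p_sleep * (idle - timeout)
            energy += p_wake * wake_time
            start += wake_time
        else:
            energy += p_idle * idle
        delay += start - t
        energy += p_busy * service
        now = start + service
    return energy / now, delay / len(arrivals)


# Hypothetical trace (seconds) and disk parameters (watts, seconds).
ARRIVALS = [1.0, 10.0]
PARAMS = dict(service=1.0, p_busy=2.0, p_idle=1.0, p_sleep=0.1,
              wake_time=0.5, p_wake=3.0)

always_on = simulate_timeout(ARRIVALS, timeout=math.inf, **PARAMS)
greedy = simulate_timeout(ARRIVALS, timeout=0.0, **PARAMS)
time_out = simulate_timeout(ARRIVALS, timeout=2.0, **PARAMS)
```

On this toy trace the three policies land at different points of the power-delay trade-off: the greedy policy uses the least power and adds the most delay, and the always-on policy does the opposite.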


Experimental Results

[Figure: power (W) versus delay (sec, logarithmic axis from 0.001 to 10) for the Always On, Greedy, N-Policy, Time-Out, 3CTMDP, 4CTMDP, and CTMDP-Poll policies.]

Power-delay trade-off curves for a Uniform transition time distribution for the hard disk and a combination of Exponential and Pareto distributions for the request inter-arrival times

A Two-layered PM Strategy for the LW

[Figure: two-layered power management flow. The mission type (combat, reconnaissance), terrain/weather (woods and foggy, urban and rainy), and mission time (2 hours, 12 hours) feed a mode-based decision tree that selects the system mode. The system mode and a data trace feed the LW system policy generator, which produces a parametrizable PQ policy that is passed to the OSPM.]


On-chip Memory Bus Power Reduction

The processor-memory bus is highly capacitive. Significant power is consumed to drive these buses.

[Figure: processor and memory connected by a bus with large capacitance.]

[Figure: percentage (10% to 30%) versus bits of branch displacement (0 to 10), with integer-average and floating-point-average curves. Source: Computer Architecture: A Quantitative Approach, Hennessy and Patterson.]

Address Bus Encoding: Alborz Code

[Figure: encoder block. Inputs: source word, coding-enable flag, and the sender's codebook. Components: SUB, D, LWC, XOR. Output: code word driven onto the bus.]

[Figure: decoder block. Inputs: code word from the bus, coding-enable flag, and the receiver's codebook. Components: XOR, LWC, D, ADD. Output: source word.]


Experimental Results

Power (Pwr) per benchmark, with the percentage power reduction in parentheses:

Benchmark          Base Case   Fixed ALBORZ,      Adaptive ALBORZ,   Adaptive ALBORZ,   Adaptive ALBORZ,
                   Pwr         512 Entry Redun.   32 Entry Redun.    64 Entry Redun.    32 Entry Irredun.
Art                34.3        2.765 (92.0%)      2.466 (92.9%)      2.110 (93.9%)      2.382 (93.1%)
Gzip               34.7        4.005 (88.5%)      3.103 (91.1%)      2.651 (92.4%)      3.483 (90.0%)
Vortex             37.3        6.048 (85.9%)      6.244 (83.3%)      5.554 (85.2%)      4.918 (86.9%)
Equak              35.4        6.383 (82.1%)      7.673 (78.4%)      6.472 (87.8%)      6.576 (81.5%)
Gcc                35.4        6.373 (82.1%)      7.651 (78.5%)      6.881 (80.6%)      7.490 (78.9%)
Parser             37.0        4.177 (88.8%)      4.273 (88.5%)      3.614 (90.3%)      4.560 (87.3%)
Vpr                35.9        7.021 (80.5%)      6.085 (83.1%)      5.502 (84.7%)      5.928 (83.5%)
% power reduction              85.3%              85.1%              87.0%              85.9%

Off-chip Memory Bus Power Reduction

The DRAM address bus is multiplexed between the row and column addresses.

Example with two consecutive addresses, A1 = Row1 Col1 = 0001 and A2 = Row2 Col2 = 0011 (SA = switching activity):
- SRAM (internal, non-multiplexed): 0001, 0011; SA=1
- DRAM (external, multiplexed): Row1 = 00, Col1 = 01, Row2 = 00, Col2 = 11; SA=4

Find a permutation (one-to-one function) that generates the minimum external switching activity.
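The two SA values in this example can be reproduced by counting bit flips between consecutive bus words. The sketch below assumes that switching activity is the sum of Hamming distances between successive words, and that a multiplexed bus sends the row half first and the column half second; the helper names are illustrative only.

```python
def hamming(a, b):
    """Number of bit positions in which two bus words differ."""
    return bin(a ^ b).count("1")


def switching_activity(words):
    """Total bit flips over a sequence of bus words."""
    return sum(hamming(x, y) for x, y in zip(words, words[1:]))


def multiplex(addresses, half_bits):
    """Split each address into (row, column) halves sent back to back."""
    mask = (1 << half_bits) - 1
    words = []
    for a in addresses:
        words.append(a >> half_bits)  # row half goes first
        words.append(a & mask)        # column half follows
    return words


addresses = [0b0001, 0b0011]
sa_direct = switching_activity(addresses)                     # 4-bit bus
sa_multiplexed = switching_activity(multiplex(addresses, 2))  # 2-bit bus
```

For the two addresses above, the direct bus sees a single bit flip, while the multiplexed bus carries 00, 01, 00, 11 and sees four.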


Example

- Address space of 0, 1, 2, 3 = 2^2
- Multiplexed on a 1-bit bus
- 2^1 nodes and 2^2 edges

[Figure: Pyramid and Binary code assignments for the four addresses on the two-node graph, annotated with switching activities SA=4, SA=2, SA=6, and SA=4.]

RC Graph and Eulerian Cycle

- Construct a merged Row-Column (RC) graph where nodes represent row and column addresses and edge (u,v) corresponds to the binary number uv
- Find an Eulerian cycle on this graph using forward or backward traversal to obtain the Pyramid code

Pyramid I – Forward Traversal (the entry in row u, column v is the position of edge (u,v) in the cycle):

        00   01   10   11
  00     0    8   11    1
  01    10    9   14    6
  10    15   13   12    4
  11     7    5    3    2
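The sketch below illustrates why an Eulerian cycle on the RC graph helps: consecutive edges share a node, so the column half of one address equals the row half of the next and the bus does not toggle between them. It uses Hierholzer's algorithm as a generic way to find an Eulerian cycle; this is not necessarily the forward or backward traversal used for the Pyramid code, and the function names are illustrative.

```python
def hamming(a, b):
    """Number of bit positions in which two bus words differ."""
    return bin(a ^ b).count("1")


def switching_activity(words):
    """Total bit flips over a sequence of bus words."""
    return sum(hamming(x, y) for x, y in zip(words, words[1:]))


def eulerian_cycle(k):
    """Hierholzer's algorithm on the complete digraph (with self-loops)
    over 2**k nodes; returns the visited nodes, first == last."""
    n = 1 << k
    unused = {u: list(range(n - 1, -1, -1)) for u in range(n)}
    stack, cycle = [0], []
    while stack:
        u = stack[-1]
        if unused[u]:
            stack.append(unused[u].pop())  # follow an unused edge u -> v
        else:
            cycle.append(stack.pop())      # dead end: emit the node
    cycle.reverse()
    return cycle


def bus_words(edges):
    """Multiplexed bus traffic: row word, then column word, per address."""
    words = []
    for row, col in edges:
        words += [row, col]
    return words


K = 2  # 2-bit row and column addresses, 16 addresses in total
cycle = eulerian_cycle(K)
eulerian_edges = list(zip(cycle, cycle[1:]))
binary_edges = [(a >> K, a & ((1 << K) - 1)) for a in range(1 << (2 * K))]

sa_eulerian = switching_activity(bus_words(eulerian_edges))
sa_binary = switching_activity(bus_words(binary_edges))
```

For a sequential sweep over all 16 addresses, the Eulerian ordering only pays for transitions inside each address, whereas the binary ordering also pays for transitions between addresses.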

Experimental Results

[Figure: Binary vs. Pyramid. Switching activity (0 to 2500) versus segment size (1024, 512, 256, 128, 64, 32, 16, 8, 4) for the Binary and Pyramid codes.]

SPEC95: compress, perl, and ijpeg

[Figure: transition count in millions (0 to 14), broken down into internal, external, and invert transitions, for the Binary, Pyramid, Bus Invert, and Pyramid-BI codes.]

Apollo Testbed Block Diagram

[Figure: Apollo Testbed block diagram. The Assabet (SA 1110) and Neponset (SA 1111) boards connect to a USB webcam, PS/2 keyboard and mouse, RS232 GPS, CF Flash ROM, audio speaker and microphone, and PCMCIA WLAN.]


Device Information

Microtek EyeStar U2S Webcam
- USB interface
- VISION CPiA chip (Color Processor interface ASIC)
- 30 fps
- 24-bit color
- 640x480 pixels
[Figure: power (mW, 0 to 900) in the High Resolution, Low Power, and Standby modes.]

Cisco Aironet 340 Wireless LAN PC Card
- IEEE 802.11b
- Data rate: 1, 2, 5.5 and 11 Mbps
- Range: 120 m outdoors; 30 m indoors
- Adjustable Transmit Power
[Figure: power (mW, 0 to 1800) in the Transmit, Receive, and Idle modes.]

Trimble Lassen LP GPS Kit
- 8-channel, continuous tracking receiver
- Update rate: 1 Hz
- Position accuracy: 25 m
- Velocity accuracy: 0.1 m/sec
[Figure: power (mW, 0 to 200) in the Normal and Deep Sleep modes.]

Apollo Testbed Photo Shots


Hardware/Software Layers

Software:
- Situational awareness software with power management: object-db, view, event, comm, map, mail, gps
- GTK, gpsd
- Gnome
- Xwindow (4.0.2)
- (2.4.0), SA Patch (rmk1), Board Patch (np1)
- switchd, wvlan-cs, scale, frontlight

Hardware:
- Assabet (SA1110) and Neponset (SA1111)
- Interfaces and attached devices: PCMCIA (WLAN), FlashROM (CF), PS/2 (KB), RS232 (GPS), PWM (FL), USB (Webcam)

Map Viewer/Finder Snap Shots


Summary I

- The degree of power savings due to OSPM depends on the workload conditions and the minimum performance and quality-of-service requirements
- We expect a factor of 2-4 reduction in the total power dissipation of the LW system. This is being demonstrated on the Apollo Testbed
- Even higher power savings are possible if the processor, I/O device hardware, and device driver software are equipped with built-in power management facilities
- Future research work includes power reduction of the PCMCIA wireless LAN and the USB Webcam
- The next generation of the Apollo Testbed will use the XScale processor. The XScale microarchitecture offers a 4-8X improvement in the power-performance metric over the Intel StrongARM

Outline

Part I: Dynamic Power Management and Power-Aware Architecture Reorganization
- Introduction
- OS-directed power management
- Architecture organization techniques
- Apollo Testbed
- Conclusions

Part II: Power-Aware Behavioral and System Synthesis
- Input space adaptive software
- Power-aware dynamic scheduler
- High-level software energy macromodels
- Leakage power analysis and optimization
- HW-SW co-synthesis with DR-FPGAs
- Low power distributed system of SOCs
- Summary II


Input Space Adaptive Software Synthesis

Input space adaptive design (ISAD) flow, from inputs to outputs:
1. Simulate program with input traces
2. Apply enabling compiler transformations
3. Identify promising sub-programs for ISAD
For each selected sub-program:
4. Choose the input sub-spaces that lead to the most energy savings
5. Optimize the selected sub-programs under the chosen input sub-spaces

Improvements in performance and energy (SPARClite/StrongARM):

           Performance   Energy      Energy*delay
Maximum    7.2X/8.5X     7.6X/7.8X   54.7X/66.3X
Average    3.7X/3.0X     3.7X/3.1X   13.7X/9.3X
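As a toy illustration of steps 4 and 5, the sketch below wraps a general sub-program with a guard that dispatches to cheaper specialized code when the input falls in a chosen sub-space. The polynomial example and the sub-spaces `x == 0` and `x == 1` are hypothetical; in the flow above such sub-spaces would come from profiling with input traces, and this is not the authors' synthesis tool.

```python
def poly_eval(coeffs, x):
    """General sub-program: Horner's rule, highest-degree coefficient first."""
    result = 0
    for c in coeffs:
        result = result * x + c
    return result


def poly_eval_adaptive(coeffs, x):
    """Input-space adaptive variant of poly_eval.

    Guards test whether the input lies in a selected sub-space and, if so,
    run a specialized body that avoids the multiplications.
    """
    if coeffs:
        if x == 0:
            return coeffs[-1]   # only the constant term survives
        if x == 1:
            return sum(coeffs)  # every power of x equals 1
    return poly_eval(coeffs, x)  # fall back to the general code
```

The specialized paths must return exactly what the general code returns, so the guard only changes cost, never behavior.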

Dynamic Scheduling

Techniques: dynamic voltage scaling and dynamic power management

Scheduling layers:
- Online scheduling: servicing soft aperiodic tasks, resource reclaiming, slack time exploitation. Objectives: low power consumption, responsive service of soft aperiodic tasks
- Static scheduling: off-line flexibility, global flexibility optimization. Objectives: precedence relationship, hard timing constraints
- Static resource allocation: execution slot reservation for hard aperiodic tasks

Power reduction over non-power-conscious scheme: average 2.2X; maximum 3.1X
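A minimal sketch of slack exploitation with dynamic voltage scaling: run a task at the lowest available frequency that still meets its deadline. It relies on the common first-order assumptions that supply voltage scales linearly with frequency and that dynamic energy per cycle scales with the square of the voltage. The frequency levels and the task are hypothetical, and the sketch does not reproduce the scheduler or the reported results.

```python
def pick_frequency(cycles, deadline, levels):
    """Lowest frequency (Hz) in `levels` that finishes `cycles` by `deadline` (s)."""
    feasible = [f for f in sorted(levels) if cycles / f <= deadline]
    if not feasible:
        raise ValueError("deadline cannot be met at any available frequency")
    return feasible[0]


def relative_energy(freq, f_max):
    """Energy relative to running at f_max, assuming V scales with f and
    energy per cycle scales with V squared (cycle count is unchanged)."""
    return (freq / f_max) ** 2


LEVELS = [50e6, 100e6, 200e6]  # hypothetical frequency levels in Hz
CYCLES = 1e6                   # hypothetical task length in cycles
DEADLINE = 0.015               # seconds

chosen = pick_frequency(CYCLES, DEADLINE, LEVELS)
saving = relative_energy(chosen, max(LEVELS))
```

Here the slowest level would miss the deadline, so the middle level is chosen, and under the stated assumptions the task uses a quarter of the full-speed energy.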


Battery-aware Scheduling

Stages, with their approaches and objectives:
- Initial static scheduling. Approach: slack-based list scheduling. Objectives: precedence relationship, hard timing constraints
- Variable voltage scaling. Approaches: slack-time reallocation, schedule slots shifting to match the allocated slack time. Objectives: minimize the average power consumption, increase the battery life span
- Battery-aware schedule processing. Approach: schedule slots interchanging. Objectives: optimize the profile of the discharge current, increase the battery life span

Battery lifespan extension: average: 1.3X, maximum: 1.6X

Energy Macromodels for Software

[Figure: macromodel construction flow. Typical inputs drive an instrumented function (behavioral simulation and post-processing) and an un-instrumented function (power simulation). The resulting parameter values and energies are passed to a regression tool that solves for b1, b2, and b3.]

Model: $E = b_1 c_1 + b_2 c_2 + b_3 c_3$

Regression tool data:

c1 c2 c3 Energy
1 2 1 1.00E-07
2 4 1 1.27E-07
3 5 1 4.24E-07
4 6 1 8.05E-07
..
60 1 1 0.00029
61 2 2 0.000304
62 3 2 0.000309
63 4 2 0.000326
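The regression step can be sketched as an ordinary least-squares fit of the model E = b1 c1 + b2 c2 + b3 c3. The version below solves the normal equations in plain Python; the sample rows and coefficient values are synthetic and are not the data shown on the slide, and the actual regression tool is not specified in the source.

```python
def fit_macromodel(C, E):
    """Least-squares coefficients b for E[i] ~ sum_j b[j] * C[i][j]."""
    m = len(C[0])
    # Normal equations: (C^T C) b = C^T E.
    A = [[sum(row[i] * row[j] for row in C) for j in range(m)]
         for i in range(m)]
    y = [sum(row[i] * e for row, e in zip(C, E)) for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        y[col], y[piv] = y[piv], y[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
                y[r] -= f * y[col]
    return [y[i] / A[i][i] for i in range(m)]


def predict_energy(b, c):
    """Evaluate the fitted macromodel for one set of parameter values."""
    return sum(bj * cj for bj, cj in zip(b, c))


# Synthetic samples generated from known coefficients (hypothetical values).
TRUE_B = [2.0, 0.5, 1.0]
SAMPLES = [[1, 2, 1], [2, 4, 3], [3, 5, 1], [4, 6, 2], [5, 1, 4]]
ENERGY = [predict_energy(TRUE_B, c) for c in SAMPLES]

fitted_b = fit_macromodel(SAMPLES, ENERGY)
```

Once the coefficients are fitted, `predict_energy` replaces a power simulation with a handful of multiplications, which is where the estimation speedup of a macromodel comes from.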


Complexity-based Model Results

Examples   Models                                   # of samples   SPARClite Error   SPARClite Speedup
chksum     $c_1 + c_2 N$                            400            1.4%              1361
igray      $c_1 + c_2 \log_2(N)$                    2560           8.0%              540
edgedet    $c_1 + c_2 M + c_3 N + c_4 MN$           1000           0.3%              673325
ins sort   $c_1 + c_2 N + c_3 N^2$                  250            6.7%              30050
mult       $c_1 + c_2 L + c_3 LM + c_4 LMN$         625            2.4%              32213
myqsort    $c_1 + c_2 N + c_3 N \log_2(N)$          250            5.3%              38155
msort      $c_1 + c_2 N + c_3 N \log_2(N)$          1000           4.2%              126780
myfrag     $c_1 + c_2 N$                            1500           14.6%             81517

Ultra SPARC II (336 MHz) running SunOS 5.7, real memory 4GB

Leakage Power Optimization

[Figure: leakage optimization flow. CDFG, schedule, derive utilization of FU's, extract high-probability paths, determine off-critical path and on-critical path FU's, prioritize FU's, synthesis with the MTCMOS library, system, estimate leakage power.]


Leakage Results

[Figure: leakage results by technology family (0.35u, 0.18u, 0.13u, 0.10u, 0.07u), vertical axis 0 to 80. Series: Pleak/Ptotal; Opt: Pleak/Ptotal; % Reduction: Ptotal (additional); % Reduction: Pleak.]

Hardware/software Co-Synthesis with DR-FPGAs

[Figure: co-synthesis flow. Initialization, then allocation, assignment, and scheduling with their transformations (allocation transformation, assignment transformation, scheduling transformation), followed by solution pruning and solution prioritizing to reach the solution.]

[Figure: reconfiguration power (mW) for examples 1 to 10, reconfiguration prefetch versus our approach.]

[Figure: schedule length (time units) for examples 1 to 10, reconfiguration prefetch versus our approach.]

[Figure: power consumption (mW) versus system price (dollars).]

System-level Functional Partitioning

Functional Partitioning Integrated with System Synthesis
- Homogeneous model with partitioner/synthesizer
- Integrated evolutionary algorithm
- Optimize power/price under area and real-time constraints

[Figure: partitioning and synthesis environment. The system specification, target architecture, and design constraints feed design space exploration, which covers functional partitioning with allocation, functional partitioning with assignment (using the resource library), global scheduling, and cost calculation, and produces a distributed system of SOCs.]

Summary II

- Input space adaptive SW synthesis reduces energy*delay by up to 66X
- Power-aware dynamic scheduler reduces power by up to 3.1X
- Battery-aware scheduler extends battery lifespan by up to 1.6X
- High-level SW energy macromodels speed up power estimation by 1 to 5 orders of magnitude with an average error of only 5%
- Leakage power optimization reduces leakage in 0.07 micron technology by up to 8.4X and total power by up to 2.3X
- HW-SW co-synthesis with DR-FPGAs reduces power by up to 6X
- Functional partitioning for distributed system of SOCs reduces power by up to 2.5X
