Lecture Organization
• Lecture 1: Software-level Power-Aware Computing
  – Introduction to low-power systems
  – Low-power binary encoding
  – Power-aware compiler techniques
• Lectures 2 & 3: Dynamic Voltage Scaling (DVS) techniques
  – OS-level DVS: Inter-task DVS
  – Compiler-level DVS: Intra-task DVS
  – Application-level DVS
• Lecture 4
  – Software power estimation & optimization
  – Low-power techniques for multiprocessor systems
  – Leakage reduction techniques

2 Low Power SW.2 J. Kim/SNU

Voltage, Frequency & Energy
(figure: energy as a function of supply voltage, 0.5~2.5 V)

Basic Idea of DVS
E ∝ Ncycle · VDD², with the deadline at time 25:
(a) No power-down: 12.5×10^8 cycles @ 50 MHz, 5.0 V → 31.25 J (busy for the whole interval)
(b) Power-down: 5×10^8 cycles @ 50 MHz, 5.0 V → 12.5 J (done at time 10, idle until 25)
(c) Dynamic voltage scaling: 5×10^8 cycles @ 20 MHz, 2.0 V → 2.0 J (finishes exactly at the deadline)
→ Slow and steady wins the race!
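The three scenarios follow directly from E ∝ Ncycle · VDD². A quick numeric check; the 1e-9 proportionality constant is an assumption chosen only so the joule figures match the slide:

```c
#include <assert.h>
#include <math.h>

/* Energy model from the slide: E proportional to Ncycle * VDD^2.
 * The 1e-9 J/(cycle*V^2) scale factor is assumed, not from the slide. */
double energy_joules(double ncycles, double vdd)
{
    return 1e-9 * ncycles * vdd * vdd;
}
```

Scenario (c) executes 2.5× fewer cycles than (a) at 2.5× lower voltage, so energy drops by a factor of 2.5 × 2.5² = 15.625, i.e. 31.25 J → 2 J.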

Key Issues for Successful DVS
• Efficient detection of slack/idle intervals
• Efficient voltage scaling policy for slack intervals

Commercial DVS Processors
• Transmeta Crusoe (LongRun)
• AMD K6-2+ (PowerNow! technology)
• Intel SpeedStep
• Intel XScale

(figure: a slack interval on the time line; the two questions are how to detect it and how to scale the voltage during it)


Voltage Scaling Processors
• Commercial
  – Transmeta Crusoe (LongRun): 200~700 MHz, 1.1~1.65 V; scaling time < 300 μs
  – AMD Mobile K6 (PowerNow!): 192~588 MHz, 0.9~2.0 V; scaling time 200 μs
  – Intel PXA250 (XScale): 100~400 MHz, 0.85~1.3 V; scaling time 500 μs
• Academic
  – UC Berkeley (ARM8): 5~80 MHz, 1.2~3.8 V; scaling time 520 μs; scaling power 130 μJ
  – LART (StrongARM): 59~251 MHz, 0.79~1.65 V; 59↔251 MHz in 140 μs; 0.79→1.65 V in 40 μs, 1.65→0.79 V in 5.5 ms

DVS Support in PXA250
• Uses two registers in the PXA250 XScale core:
  – CCCR (Core Clock Configuration Register): specifies the memory clock & core clock
  – CCLKCFG (Core Clock Configuration register, CP14 register 6): set the FCS (Frequency Change Sequence) bit to change the clock speed
    • Bit layout: bits 31..2 reserved, bit 1 FCS, bit 0 TURBO; the clock changes when the FCS bit = 1

CCCR Setting Example
• CCCR is memory-mapped at 0x41300000, with multiplier fields L, M, and N
(table: for L = 1..5 and the available M/N settings, the resulting memory/core clock frequencies range from 99.5 MHz at 0.85 V up to about 398.9 MHz at 1.3 V)

Voltage Scaling Code

```c
#include <stdio.h>
#include <string.h>

/* Writes the CCCR (memory-mapped at 0x41300000) and the CCLKCFG
 * coprocessor register to change the PXA250 clock speed.
 * CP14_WRITE_CCLKCFG is a platform-provided macro wrapping the
 * CP14 register-6 write; direct physical-address accesses assume
 * a bare-metal or kernel context. */
void change_clock_speed(int speed)
{
    int settings[20] = { 0, 0x121, 0x122, 0x123, 0x124, 0x125, 0x1a2,
                         0x141, 0x1a4, 0x142, 0x1a5, 0x143, 0x144, 0x145 };
    int cccr_val = 0x121, clkcfg_val = 2;

    cccr_val = settings[speed];
    switch (speed) {
    case 6:  clkcfg_val = 3; break;   /* FCS + TURBO */
    case 8:  clkcfg_val = 3; break;
    case 10: clkcfg_val = 3; break;
    default: clkcfg_val = 2; break;   /* FCS only */
    }
    memcpy((void *)(0x40000000 + 0x1300000), &cccr_val, 4);  /* CCCR */
    CP14_WRITE_CCLKCFG(clkcfg_val);
}

/* Reads the OS timer counter (memory-mapped at 0x40A00010). */
int get_os_time(void)
{
    int ostime;
    memcpy(&ostime, (void *)(0x40000000 + 0xa00010), 4);
    return ostime;
}

int main(void)
{
    int i, bb, cc, j, k;

    for (k = 1; k < 13; k++) {        /* sweep the speed settings */
        change_clock_speed(k);
        bb = get_os_time();
        for (i = 0; i < 10000; i++) j = 10;   /* fixed busy loop */
        cc = get_os_time();
        printf("%d : %d\n", k, cc - bb);      /* elapsed timer ticks */
    }
    return 0;
}
```


Voltage Scaling in Linux / ARM IEM Demo
(figure: SVS kernel module; the DVS scheduler calls setScaledSpeed(), which wakes a kernel thread; the kernel thread calls setNewVoltage() to adjust the voltage, then sleeps. The voltage value is written through the device driver via ltc1663_i2c_write_data() (2~3 ms), and the LTC1663 DAC regulates the CPU voltage.)

Successful Low Power S/W Techniques
1. Understand workload variations of your target
2. Devise efficient ways to detect them
3. Devise efficient ways to utilize the detected workload variations using available H/W supports

Roadmap
• DVS in Non-Real-Time Systems
• DVS in Real-Time Systems
  – Compiler-level DVS: Intra-task DVS
  – OS-level DVS: Inter-task DVS
  – Application-level DVS: MPEG-decoder implementation
  – Algorithm-level DVS: low-power convolution


Non-Real-Time Jobs
• No timing constraints
• No periodic executions
• Unknown WCET
→ It is hard to predict the future workload!!

DVS for Non-Real-Time Jobs
• Basic approach: predict the workload based on history information
• Usually based on some variation of interval-based scheduling:
  – PAST, FLAT
  – LONG_SHORT, AGED_AVERAGE
  – CYCLE, PATTERN, PEAK

Key Question
How can we predict the future workload?
• Based on long-term history: hard to adapt quickly to a changed workload
• Based on short-term history: too many clock/voltage changes

PAST
• Look at a fixed window into the past
• Assume the next window will be like the previous one
• If the past window was
  – mostly busy ⇒ increase speed
  – mostly idle ⇒ decrease speed
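A minimal sketch of the PAST policy; the 0.7/0.5 busy thresholds and the 0.2 speed step are illustrative assumptions, not values from the slide:

```c
#include <assert.h>

/* PAST: assume the next window will look like the last one.
 * Speeds are normalized to [0,1]; thresholds and step size are
 * illustrative assumptions. */
double past_next_speed(double cur_speed, double busy, double window)
{
    double util = busy / window;            /* utilization of past window */
    if (util > 0.7 && cur_speed < 1.0)      /* mostly busy: speed up */
        cur_speed += 0.2;
    else if (util < 0.5 && cur_speed > 0.2) /* mostly idle: slow down */
        cur_speed -= 0.2;
    return cur_speed > 1.0 ? 1.0 : cur_speed;
}
```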


Example: PAST
  Utilization = busy time / window size
(figure: the past window had low utilization, so PAST assumes the coming window will too and decreases the speed)

FLAT
• Try to smooth the speed to a global average
• Make the utilization of the next window equal to a target value
• Set the speed fast enough to complete the predicted new work being pushed into the coming window

Example: FLAT
(figure: with target utilization 0.7, the speed is increased/decreased so that the next window's utilization becomes 0.7)

LONG-SHORT
• Look at the last 12 windows
  – Short-term past: the 3 most recent windows
  – Long-term past: the remaining 9 windows
• Workload prediction: the utilization of the next window will be a weighted average of these 12 windows' utilizations


Example: LONG-SHORT
  utilization = # cycles of busy interval / window size
  Window utilizations (windows 1..12): 0, .3, .5, 1, 1, 1, .8, .5, .3, .1, 0, 0
  Predicted utilization = (0 + .3 + .5 + 1 + 1 + 1 + .8 + .5 + .3 + 4·(.1 + 0 + 0)) / (9 + 4·3) = 0.276
  fclk = 0.276 × fmax

AGED-AVERAGE
• Employs an exponential-smoothing method
• Workload prediction: the utilization of the next window will be a weighted average of all previous windows' utilizations, with geometrically decreasing weights
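The weighted average above (short-term windows weighted 4×, long-term 1×) can be sketched as:

```c
#include <assert.h>
#include <math.h>

/* LONG-SHORT: predict the next window's utilization as a weighted
 * average of the last 12 windows. u[0..8] are the older (long-term)
 * windows with weight 1; u[9..11] are the 3 most recent (short-term)
 * windows with weight 4, as in the slide's example. */
double long_short_predict(const double u[12])
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < 12; i++) {
        double w = (i >= 9) ? 4.0 : 1.0;
        num += w * u[i];
        den += w;
    }
    return num / den;
}
```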

Example: AGED_AVERAGE
  utilization = # cycles of busy interval / window size
  Window utilizations (windows 1..12): 0, .3, .5, 1, 1, 1, .8, .5, .3, .1, 0, 0
  average = (1/3)·0 + (2/9)·0 + (4/27)·(0.1) + (8/81)·(0.3) + ···   (most recent window first)
  fclk = average × fmax

CYCLE
• Workload prediction: examine the last 16 windows
  – Does there exist a cycle of length X?
  – If so, predict by extending this cycle
  – Otherwise, use the FLAT algorithm
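Exponential smoothing with the slide's geometric weights (each older window's weight is multiplied by 2/3, giving the 1/3, 2/9, 4/27, ... series) might look like:

```c
#include <assert.h>
#include <math.h>

/* AGED_AVERAGE: weighted average of past windows with geometrically
 * decaying weights w_k = (1/3)*(2/3)^k, k = 0 for the most recent
 * window (u[n-1]). The weights sum to 1 as n grows. */
double aged_average_predict(const double *u, int n)
{
    double avg = 0.0, w = 1.0 / 3.0;
    for (int k = 0; k < n; k++) {
        avg += w * u[n - 1 - k];
        w *= 2.0 / 3.0;
    }
    return avg;
}
```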


Example: CYCLE
  utilization = # cycles of busy interval / window size
  Window utilizations (windows 1..8): 0, .4, .8, .1, .3, .5, .7, .0
  For a candidate cycle of length 4:
    error measure = (|0 − .3| + |.4 − .5| + |.8 − .7| + |.1 − 0|) / 4 = 0.15
  Predict: the next utilization will be .3 (window 9 repeats window 5)
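The cycle check can be sketched as follows; the slide only computes the error measure, so any acceptance threshold on it is an assumption:

```c
#include <assert.h>
#include <math.h>

/* CYCLE: compare the last cycle of length x against the cycle before
 * it; a small mean absolute error suggests a repeating pattern. */
double cycle_error(const double *u, int n, int x)
{
    double err = 0.0;
    for (int i = 0; i < x; i++)
        err += fabs(u[n - 2 * x + i] - u[n - x + i]);
    return err / x;
}

/* Predict by extending the cycle: the next window repeats the
 * window one cycle length back. */
double cycle_predict(const double *u, int n, int x)
{
    return u[n - x];
}
```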

PATTERN
• A generalized version of CYCLE
• Workload prediction:
  – Convert the n most recent windows' utilizations into a pattern over the alphabet {A, B, C, D}
  – Find the same pattern in the past and predict from what followed it

Example: PATTERN
  Letters A~D correspond to utilization levels 0, 0.25, 0.5, 0.75~1.0
  Utilizations 0, .3, .5, 1, 1 (windows 1..5) → pattern ABCDD
  Utilizations .1, .35, .6, .9 (windows 9..12) → pattern ABCD
  The earlier occurrence of ABCD was followed by D ⇒ predict: the next utilization will be D
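A sketch of the PATTERN idea, quantizing utilizations into {A, B, C, D} and searching the history for the most recent earlier match; the quantization boundaries at 0.25/0.5/0.75 are assumptions:

```c
#include <assert.h>
#include <string.h>

/* Quantize a utilization into one of four letters A..D
 * (boundaries are assumed). */
char pat_letter(double u)
{
    if (u < 0.25) return 'A';
    if (u < 0.50) return 'B';
    if (u < 0.75) return 'C';
    return 'D';
}

/* Find the most recent earlier occurrence of the last m letters in the
 * quantized history of n windows (n <= 64) and return the letter that
 * followed it, or 0 if there is no match. */
char pattern_predict(const double *u, int n, int m)
{
    char s[64];
    for (int i = 0; i < n; i++)
        s[i] = pat_letter(u[i]);
    for (int start = n - m - 1; start >= 0; start--)
        if (memcmp(s + start, s + n - m, m) == 0)
            return s[start + m];
    return 0;
}
```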


Perspectives-Based Algorithm
• Per-task basis performance predictions [Flautner, OSDI 2002]:
  WorkEst_new = (k × WorkEst_old + Work_fse) / (k + 1)
  Deadline_new = (k × Deadline_old + Work_fse + Idle) / (k + 1)
  Perf = WorkEst / Deadline

Roadmap
• DVS in Non-Real-Time Systems
• DVS in Real-Time Systems
  – Compiler-level DVS: Intra-task DVS
  – OS-level DVS: Inter-task DVS
  – Application-level DVS: MPEG-decoder implementation
  – Algorithm-level DVS: low-power convolution

Two Types of DVS Algorithms
• Inter-task DVS algorithms: determine the supply voltage and clock speed on a task-by-task basis
• Intra-task DVS algorithms: determine the supply voltage and clock speed within a single task boundary

Inter-task DVS
• Inter-task voltage scheduling for hard real-time systems [Yao95, Hong98, Okuma99, Shin99, Lee99]
• Problem: given a set of tasks, how to assign the proper speed to each task dynamically while guaranteeing all their deadlines
• Task-by-task speed assignment
  – The slack time due to a task is used by the following tasks, not by the current one
• Practical limitations
  – Requires OS modifications
  – Cannot be applied to a single-task environment
  – Can be ineffective in a multi-task environment
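The per-task estimates above are running averages; a sketch of the estimator (the struct and field names are mine, not from the slide):

```c
#include <assert.h>
#include <math.h>

/* Perspectives-based per-task estimate, following the slide's formulas:
 * WorkEst_new  = (k*WorkEst_old + work) / (k+1)
 * Deadline_new = (k*Deadline_old + work + idle) / (k+1)
 * Perf = WorkEst / Deadline */
struct task_perspective {
    double k;         /* averaging window parameter */
    double work_est;  /* estimated work (full-speed equivalent) */
    double deadline;  /* estimated deadline */
};

/* Fold one observed (work, idle) pair into the estimates and return
 * the predicted performance level. */
double perspective_update(struct task_perspective *p,
                          double work, double idle)
{
    p->work_est = (p->k * p->work_est + work) / (p->k + 1.0);
    p->deadline = (p->k * p->deadline + work + idle) / (p->k + 1.0);
    return p->work_est / p->deadline;
}
```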


An Example of Ineffective Inter-Task DVS
  Task   T   D   C
  A      8   4   1
  B      8   4   1
  C      8   8   4   (dominant task)
• The dominant task (C) exploits only the small slack times from the other tasks; it cannot use its own slack

Earlier Version of Intra-task DVS
• Run-time voltage hopping [Lee00]
  – Each task is partitioned into N timeslots
  – Frequency and voltage are determined for each timeslot
  – Voltage scheduling is embedded in the application programs
  – Can be applied to a conventional non-DVS OS
• Limitations
  – No systematic guideline
  – Manual selection of scaling points
  – Too much burden for average programmers

Overview of Intra-task DVS
• Intra-task voltage scheduling framework based on a static timing analysis of RT programs
• The clock speed is adjusted within a task by embedded code
• Fully exploits all slack times coming from execution time variations within a single task
• No OS modification; applicable to a single-task environment
• Provides a systematic methodology for developing DVS-aware programs ⇒ Automatic Voltage Scaler tool

Basic Idea: Inter-task DVS
• Task τ with (WCEC, deadline) = (160 cycles, 2 μs) and 32 different execution paths (p1, ..., p32)
• Under inter-task DVS the task completes its execution at 80 MHz:
  – p1 (WCEP): 2 μs; p2: 1.5 μs; p3: 1 μs; ...; p32: 0.5 μs
  – On a non-WCEP path there is a slack time before the deadline

Basic Idea: Optimal DVS
• If the execution path were known in advance, each path could run at its own constant speed and finish exactly at the deadline:
  – p1: 80 MHz, p2: 60 MHz, p3: 40 MHz, ..., p32: 20 MHz → every path takes 2 μs, no slack time
• But we cannot know the execution path in advance!!

Basic Idea: Intra-task DVS
(figure: CFG with blocks b1(10 cycles), b2(10), b_wh(10), b_if(5), b3(10), b4(10), b5(10), b6(5), b7(10))
• When the execution is not on the WCEP, slack intervals appear → slow the speed down at that point
• To choose the new speed, we need to know the remaining worst-case execution cycles

39 40 Low Power SW.2 J. Kim/SNU Low Power SW.2 J. Kim/SNU Remaining Execution Cycle Speed Assignment Algorithm CiComputation • Branching edges are CFG candidate for speed program CRWEC(bi) : the remaining b1 b [160] changes : 1 [160] 10 worst case execution cycles 10 S1; • Branches : The (150→30) if (cond1) S2; execution control [150,110,70,30] else b bw [150,110 ,70 ,30] follows the shorter [30] 2 b bw 10 h path at if-then-else [30] 2 while (cond2) { 10 10 h S3; node. => B-type 10 VSEs if (cond3)S) S4; [140,100 ,60] [140,100,60] • Loops : The execution S5; [20] b control exits a loop } if b3 b [20] Maximum # if b3 5 10 after it iterates by the 5 if (cond4)S) S6; of loop 10 [130,90,50] smaller number of S7; iterations = 3 times than the [15] [130,90,50] b6 b4 maximum iteration [15] b6 b4 5 10 number. => L-type 5 10 VSEs

b b CEC(bi) : the # of clock 7 5 b b [[0]10] 10 [120,80 ,40] [10] 7 5 cycles needed to execute b 10 10 10 i [120,80,40] 41 42 Low Power SW.2 J. Kim/SNU Low Power SW.2 J. Kim/SNU

The Effect of Intra-Task Scheduling
For the execution path (b1, b2, b_if, b7):
(a) Without intra-task scheduling: the whole path runs at 80 MHz (2.5 V), completes at 0.44 μs, and the processor idles until the 2 μs deadline
(b) With intra-task scheduling: b1 runs at 80 MHz, and each VSE lowers the speed (to 16 MHz (0.7 V), then 10.7 MHz) so that execution completes right at the deadline
→ Energy consumption of (b) / energy consumption of (a) = 0.34

The Change of C_RWEC
(figure: C_RWEC(t) falls from 150 cycles to 0 within 2 μs in both cases; with intra-task scheduling the slope, i.e. the speed, is reduced after each VSE)

B-type VSE
• At an if node, taking the shorter branch drops C_RWEC, e.g. 150 → 30
• Speed update: S(b_j) ← S(b_i) × 30/150 = S(b_i) × 1/5

L-type VSE
• When the actual number of loop iterations measured at run time is 2 (maximum = 3), C_RWEC drops, e.g. 60 → 20
• Speed update: S(b_j) ← S(b_i) × 20/60 = S(b_i) × 1/3

The transition overhead is considered when determining the new speed.
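Both VSE types reduce the clock in proportion to the drop in remaining worst-case cycles; a minimal sketch, ignoring the transition overhead the slide mentions:

```c
#include <assert.h>
#include <math.h>

/* At a VSE the remaining worst-case cycles drop from c_old to c_new
 * (B-type: shorter branch taken; L-type: loop exits early), so the
 * clock can be scaled by the same ratio and the worst case still
 * finishes by the deadline. Transition overhead is ignored here. */
double vse_new_speed(double cur_speed_mhz, double c_old, double c_new)
{
    return cur_speed_mhz * (c_new / c_old);
}
```

With the slide's numbers: the B-type example scales 80 MHz by 30/150 = 1/5 down to 16 MHz, and the L-type example scales by 20/60 = 1/3.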


VSE Selection
• Compute M = MaxSpeed × D and C_RWEC(b_i) for each block
• Find candidate L-type VSEs
• Compute the maximum increase C_inc in the original C_WCEC
• Select B-type VSEs based on their speed update ratios

Code Generation for VSEs
• At a selected B-type VSE (e.g. on the edge b1 → b2), insert voltage scaling code:
    SpeedUpdateRatio = SpeedTable(b1, b2);
    NewSpeed = CurSpeed × SpeedUpdateRatio;
    Change_f_V(NewSpeed);
• At a loop (e.g. b_wh), insert loop-counting code: LoopIterNum(b_wh) = 0 before the loop, and LoopIterNum(b_wh)++ in the loop body

Automatic Voltage Scaler
(figure: tool flow; the C program, user-provided information (e.g. loop bounds), and the deadline feed a modified compiler; a timing analyzer works on the syntax tree and call graph to produce timing information; the program transformer and speed allocator emit the transformed code together with a speed table)

Simulation Results
(figure: normalized energy consumption of the MPEG4 encoder and decoder vs. voltage transition time, 0~200 μs)
• Energy drops to less than 25% (encoder) and 7% (decoder) of the original program
• There is a large difference between the WCET and ACET of the MPEG-4 decoder


Simulation Results
(figure: number of voltage transitions vs. voltage transition time, 0~200 μs, for the MPEG4 encoder and decoder)
• How many times was the voltage scaling code executed? When C_VTO < 30 μs in the MPEG-4 encoder, the number of voltage transitions decreases sharply, and energy consumption does not increase rapidly
• How many copies of voltage scaling code? About 20 VSEs are inserted when C_VTO > 50 μs

A Profile-Based Intra-Task Voltage Scheduling
• IntraVS algorithm based on average-case execution information
• Average-case execution paths (ACEPs) are the most frequently executed paths
• More effective than the original IntraVS algorithm
• The timing constraints of a hard real-time program are still satisfied, even if the ACEPs are used for voltage scaling decisions

General IntraVS Algorithm
1. Select a predicted reference execution path (e.g. the WCEP)
2. Set the initial speed
3. On a prediction miss: if the new execution path is longer than the reference path, the speed is raised (to meet the deadline); otherwise, the speed is lowered (to save energy)

RAEP-based IntraVS
• Motivations
  – To make the common case more energy-efficient
  – If we use one of the hot paths as the reference path for IntraVS, the speed change graph for the hot paths becomes a near-flat curve with few changes in clock speed
  – Even paths that are not hot paths become more energy-efficient, because they can start with a lower clock speed than under RWEP-based IntraVS
  – The RAEP is the best representative of the hot paths

RAEP-based IntraVS
• C_RAEC(b_i): the remaining average-case execution cycles at b_i
(figure: CFG annotated with C_RAEC values, e.g. b1[95], b2[85,55,25], b_if[75,45], b7[65,35]; average number of loop iterations = 2; the frequently executed path is (b1, b2, b_if, b6, b7))
• (a) With RWEP-based IntraVS: the path starts at 80 MHz (2.5 V) and drops to 16 MHz (0.7 V), completing at the 2 μs deadline
• (b) With RAEP-based IntraVS: the path runs at a near-constant 47.5 MHz (1.35 V)
• There are Up-VSEs as well as Down-VSEs in RAEP-based IntraVS
• The RAEP-based IntraVS achieves 55% more energy reduction
• But the deadline can be missed

Reference Path Modification
(figure: example CFG, deadline = 0.5 μs, maximum speed 100 MHz; b1[30] with speed-update ratio r = 0.5, b2/b3[20] with r = 2, b4/b5[10]; after modification, b1[34] with r = 0.41, b2/b3[24] with r = 1.43, b4[14] with r = 0.71)
(figure: speed schedules of RWEP, RAEP, and the modified RAEP (M-RAEP) for path (b1, b3, b5) and path (b1, b3, b4); the modified reference path keeps every path within the deadline)

Experimental Results
(figure: energy results plotted against the slack factor, where Slack Factor = (deadline − WCET) / deadline)

Conclusion
• Presented a novel intra-task DVS algorithm using static timing analysis of RWECs
• Provides a framework for automatic generation of DVS-aware low-power programs
• The RAEP-based IntraVS algorithm exploits the fact that the average-case execution paths are more likely to be followed at run time than the WCEP
• Demonstrated the effectiveness of the approach using MPEG-4 encoder/decoder programs

Experiments on Itsy

Experimental Environment
• Itsy Pocket Computer V2.6
  – CPU: Intel StrongARM SA1110
  – Frequency scaling: 11 levels (59.0 MHz ~ 206.4 MHz)
  – Voltage scaling: 30 levels (1.00 V ~ 2.00 V)
  – Default setting: 1.55 V / 206.4 MHz
  – Linux operating system (ver. 2.0.30)
(figure: measurement setup; multimeter 1 measures Vin into the rest of the Itsy hardware, multimeter 2 measures the voltage across Rbatt; both feed a recording computer)


Experimental Results
(figures: measured results on Itsy)

Roadmap
• DVS in Non-Real-Time Systems
• DVS in Real-Time Systems
  – Compiler-level DVS: Intra-task DVS
  – OS-level DVS: Inter-task DVS
  – Application-level DVS: MPEG-decoder implementation
  – Algorithm-level DVS: low-power convolution

Inter-task DVS Overview
• Inter-task DVS algorithms determine the supply voltage and clock speed on a task-by-task basis
• Inter-task DVS is similar to imprecise computation in conventional real-time systems:
  – Imprecise computation uses the slack time to increase the value of the results, while guaranteeing a feasible schedule of the tasks
  – Dynamic voltage scaling uses the slack time to lower the voltage/clock speed, while guaranteeing a feasible schedule of the tasks

Preliminaries
• Computing model
  – Non-real-time: tasks have no timing constraints
  – Real-time: timing constraints, periodic and/or aperiodic tasks, scheduling policy (EDF, RM, etc.)
• Different DVS algorithms are necessary depending on the computing model

Inter-Task DVS
• "Run-Calculate-Assign-Run" strategy for supply voltage determination:
  1. Run the current task
  2. Calculate the maximum allowable execution time for the next task (its WCET plus the slack time)
  3. Assign the supply voltage for the next task
  4. Run the next task

Generic Inter-DVS Algorithms
• Consist of two parts:
  – Slack estimation: identify as much slack time as possible
    • Static slack times: extra times available for the next task that can be identified statically
    • Dynamic slack times: slack caused by run-time variations of the task executions
  – Slack distribution: adjust the speed so that the resultant speed schedule is as flat as possible

Static and Dynamic Approaches
• Off-line (static) voltage scheduling: the execution times are assumed to be known a priori; there are optimal solutions for EDF, RM, etc.
• On-line (dynamic) voltage scheduling: the execution times are not known in advance; there cannot be an optimal solution

Slack Estimation Methods
  Scaling decision | Voltage scaling methods
  IntraDVS (off-line) | (1) Path-based method; (2) Stochastic method
  InterDVS (on-line)  | (3) Maximum constant speed; (4) Stretching to NTA; (5) Priority-based slack-stealing; (6) Utilization updating


Maximum Constant Speed
• The lowest possible clock speed that guarantees a feasible schedule of a task set
• EDF scheduling
  – If the worst-case processor utilization U of a given task set is lower than 1.0 under the maximum speed fmax, the task set can be scheduled with the new maximum speed
      f_MCS = U · fmax, where U = Σ_{i=1..n} c_i / p_i
• Rate Monotonic scheduling
  – f_MCS = fmax × max_{1≤i≤n} min_{0<t≤d_i} L_i(t), where L_i(t) = ( Σ_{j=1..i} ⌈t / T_j⌉ · C_j ) / t

Example
  Task     WCET  Period
  Task 1     1     2
  Task 2     1     3

  U = 1/2 + 1/3 = 0.833 ⇒ f_MCS = 0.833 · fmax
(figure: under EDF, the task set is schedulable at the maximum constant speed 0.833·fmax; under RM, running at 0.833·fmax causes a deadline miss)
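The EDF case above is a one-line utilization sum; a sketch, assuming c_i and p_i are given in the same time unit:

```c
#include <assert.h>
#include <math.h>

/* Maximum constant speed under EDF: f_MCS = U * fmax, where
 * U = sum of c_i / p_i. The schedule is feasible while U <= 1. */
double edf_mcs(const double *c, const double *p, int n, double fmax)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += c[i] / p[i];
    return u * fmax;
}
```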

Stretching to NTA
• Even when a task set is scheduled at the maximum constant speed, the actual execution times of tasks are usually much less than their WCETs, so the tasks usually have dynamic slack times
• For a task τ scheduled at time t:
  – If the next task arrival (NTA) is later than t + WCET(τ),
  – we can slow down the execution of τ so that it completes exactly at the next task arrival time

Example
(figure: four cases of stretching the current task's execution from the current time toward the NTA)
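The stretching step picks the lowest speed that still finishes the worst case by the NTA; a sketch (names are mine; WCET is expressed as time at fmax):

```c
#include <assert.h>
#include <math.h>

/* Stretching to NTA: if the next task arrival is later than
 * t + WCET(tau) at full speed, run tau at the reduced speed that
 * makes its worst case finish exactly at the NTA. */
double stretch_to_nta(double wcet_at_fmax, double t, double nta,
                      double fmax)
{
    double avail = nta - t;
    if (avail <= wcet_at_fmax)
        return fmax;                    /* no dynamic slack to use */
    return fmax * (wcet_at_fmax / avail);
}
```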


Priority-Based Slack Stealing
• Exploits basic properties of priority-driven scheduling such as EDF and RM
• When a higher-priority task completes its execution earlier than its WCET, the following lower-priority tasks can use the slack time from the completed higher-priority task
• Advantage: most task instances in a hyper-period may have chances to utilize dynamic slack times
  – Because most task executions complete earlier than their WCETs
  – Therefore, many task instances can be scheduled with lowered voltages and speeds

Example
  Task     WCET  Period
  Task 1     1     2
  Task 2     1     3
  Task 3     1     6
(figure: speed schedules over one hyper-period (0~6) showing lower-priority tasks absorbing the slack of early-completing higher-priority tasks)

Utilization Updating
• The actual processor utilization during run time is usually lower than the worst-case processor utilization
• This method estimates the required processor performance at the current scheduling point
  – By recalculating the expected worst-case processor utilization
  – Using the actual execution times of completed task instances

Example
  Task     WCET  Period
  Task 1     1     2
  Task 2     1     3

  Initially: U = 1/2 + 1/3 = 0.833
  Task 1 completes in 0.6 (at t = 0.6): U = 0.6/2 + 1/3 = 0.633
  Task 2 completes in 0.8; at t = 2 a new instance of Task 1 arrives: U = 1/2 + 0.8/3 = 0.766
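The recalculation above can be sketched as follows (array and flag names are mine):

```c
#include <assert.h>
#include <math.h>

/* Utilization updating: at a scheduling point, replace a completed
 * instance's WCET share with its actual execution time when
 * recomputing the expected utilization. */
double updated_util(const double *wcet, const double *actual,
                    const int *done, const double *period, int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += (done[i] ? actual[i] : wcet[i]) / period[i];
    return u;
}
```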

Slack Distribution Method
• Greedy approach: all the slack times are given to the next activated task
  – Most inter-task DVS algorithms have adopted it
  – Clearly not an optimal solution, but widely used because of its simplicity

Existing Inter-Task DVS Algorithms
  Category  | Scheduling policy   | DVS policy   | Method used
  InterDVS  | EDF                 | lppsEDF      | (3)+(4)
            |                     | ccEDF        | (6)
            |                     | laEDF        | (6)*
            |                     | DRA          | (3)+(4)+(5)
            |                     | AGR          | (4)*+(5)
            |                     | lpSHE        | (3)+(4)+(5)*
            | RM                  | lppsRM       | (3)+(4)
            |                     | ccRM         | (4)*
  IntraDVS  | Path-based method   | intraShin    | (1)
            | Stochastic method   | intraGruian  | (2)

SimDVS: A Unified DVS Evaluation Environment
(figure: SimDVS block diagram; a task set generator and the inputs (machine specification, task set specification, executable program, profile information) feed the InterDVS module (off-line slack estimation, slack information), the task execution module, and the energy estimation module; the IntraDVS preprocessing module turns a CFG into a DVS-aware CFG via the voltage scaler, with an intra-task simulator and a stochastic data table; the outputs are energy consumption and speed transition data)

Experimental Results
(figures: inter-task DVS results)

Experimental Results (IntraDVS)
(figure: relative energy consumption (0.5~1.5) of schemes a~d vs. slack ratio (available time / WCET, 1.0~3.0) for the MPEG4 decoder and the MPEG4 encoder)


DEW - DVS Evaluation Workbench

• XScale-based DVS evaluation environment

• Pros
  – Allows monitoring of real DVS system behaviors (performance evaluation of DVS algorithms for hard real-time systems using DEW)
• Cons
  – Slower than software simulation, because DEW runs actual applications
  – Less flexible for experimental studies, because DEW represents a single machine specification
(figure: DEW stack; embedded RT applications on the VELOS DVS API and embedded RTOS/DVS module, running on the PXA250 DVS processor on an Intel DBPXA250 board)

Evaluation Results Using SimDVS and DEW
(figure: VELOS software stack; applications over JAVA, POSIX, Socket, Win32, and DVS APIs; JAVA thread library (KVM), protocol stack, window manager, and energy estimator above a kernel with scheduler, interrupt service, memory allocator, device drivers, DVS module, and boot code; libraries over the PXA250 DVS processor on the Intel DBPXA250 platform)
(figure: normalized energy consumption vs. number of tasks (2~8) for lppsEDF, ccEDF, laEDF, DRA, AGR, and lpSHE, under SimDVS and under DEW)


Sources of Differences
• System overhead
  – Basic: context switching overhead and tick scheduler overhead
  – DVS: slack computation and clock/voltage scaling time
• System timing resolution
  – Simulator: continuous time model
  – Real system: discrete time model
• Memory behavior
  – Changes in cache and memory access behavior
  – Data/instruction fetch latency

Example of System Overheads on a Real Platform
(figure: a first task executes at 50 MHz, followed by a second task at 50 MHz between t and (t+25) ms; context switching delay and tick scheduling occupy part of the interval)

System Overhead
• The system overhead increases very quickly as the task execution frequency increases
• In particular, the DVS parts increase quickly
• DVS H/W: the ratio of time delay caused by the clock/voltage scaling hardware
• DVS S/W: the ratio of time delay caused by the slack computation in a DVS algorithm
• SYS rest: the ratio of the rest of the system overhead, such as context switching and timer service

System Overhead Variations
(figures: system overhead ratio broken into DVS H/W, DVS S/W, and SYS rest for lppsEDF, ccEDF, laEDF, DRA, AGR, and lpSHE with 2~8 tasks; around 0.35% for long- and medium-period task sets, rising to around 3.5% for the short-period task set)


Energy Efficiency Variations
• In DRA, AGR, and lpSHE, the increased system overhead (due to the increased execution frequency) significantly affects the energy efficiency
(figure: normalized energy consumption vs. number of tasks for the six algorithms, for long-, medium-, and short-period task sets)

Changes in Memory System Behaviors (1)
• Under a DVS-enabled RTOS, a task's execution time increases due to the lowered clock speed
  – Desirable for reducing energy consumption
  – But it can introduce negative side effects as well: an increase in the number of task preemptions, which increases the number of memory accesses
• In aggressive algorithms, the number of preemptions increases more rapidly than in the others
(figure: normalized preemption count vs. number of tasks (2~8) for lppsEDF, ccEDF, laEDF, DRA, AGR, and lpSHE)

Changes in Memory System Behaviors (2)
• Measured with the PXA250 Performance Monitoring Unit
  – 32-way set-associative instruction/data caches
  – Each application: 16-KB program code
• The increases in memory accesses can be attributed to two sources:
  – The increase in the number of preemptions
  – The increase in memory accesses from the algorithm itself
(figure: normalized memory access count vs. number of tasks (2~8) for lppsEDF, ccEDF, laEDF, DRA, AGR, and lpSHE)

References
• Transmeta Corporation. Crusoe Processor. http://www.transmeta.com, June 2000.
• AMD Corporation. PowerNow! Technology. http://www.amd.com, December 2000.
• Intel Corporation. Intel XScale Technology. http://developer.intel.com/design/intelxscale/, November 2001.
• I. Hong, G. Qu, M. Potkonjak, and M. B. Srivastava. Synthesis Techniques for Low-Power Hard Real-Time Systems on Variable Voltage Processors. In Proceedings of the IEEE Real-Time Systems Symposium, pages 178-187, December 1998.
• Y. Shin, K. Choi, and T. Sakurai. Power Optimization of Real-Time Embedded Systems on Variable Speed Processors. In Proceedings of the International Conference on Computer-Aided Design, pages 365-368, November 2000.
• H. Aydin, R. Melhem, D. Mosse, and P. M. Alvarez. Dynamic and Aggressive Scheduling Techniques for Power-Aware Real-Time Systems. In Proceedings of the IEEE Real-Time Systems Symposium, December 2001.


References
• P. Pillai and K. G. Shin. Real-Time Dynamic Voltage Scaling for Low-Power Embedded Operating Systems. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP'01), October 2001.
• D. Shin, J. Kim, and S. Lee. Intra-Task Voltage Scheduling for Low-Energy Hard Real-Time Applications. IEEE Design and Test of Computers, 18(2):20-30, March 2001.
• F. Gruian. Hard Real-Time Scheduling Using Stochastic Data and DVS Processors. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 46-51, August 2001.
• W. Kim, J. Kim, and S. L. Min. A Dynamic Voltage Scaling Algorithm for Dynamic-Priority Hard Real-Time Systems Using Slack Time Analysis. To appear in Proceedings of Design, Automation and Test in Europe (DATE'02), March 2002.
• D. Grunwald, P. Levis, and K. I. Farkas. Policies for Dynamic Clock Scheduling. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation, pages 73-86, October 2000.
• S. Lee and T. Sakurai. Run-time Voltage Hopping for Low-Power Real-Time Systems. In Proceedings of the 37th Design Automation Conference, pages 806-809, June 2000.
• T. Burd and R. Brodersen. Design Issues for Dynamic Voltage Scaling. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 9-14, July 2000.
• D. Burger and T. M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report 1342, University of Wisconsin-Madison, CS Department, June 1997.
• F. Yao, A. Demers, and S. Shenker. A Scheduling Model for Reduced CPU Energy. In Proceedings of the IEEE Foundations of Computer Science, pages 374-382, 1995.

Image Convolution

DVS-Aware Algorithm Development

• Convolution: each output pixel is the weighted sum of its neighborhood under a small kernel, e.g.

   0  1  1
  -1  0  1
  -1 -1  0

• One of the fundamental operations of image processing.
• DVS unfriendly!! (the direct implementation has a constant per-pixel workload, so no slack arises)
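The direct implementation can be sketched as follows (an illustrative Python sketch, not the lecture's code; the function and variable names are mine):

```python
# Direct 2-D convolution: for a p x p kernel, every output pixel costs
# exactly p*p multiplications and p*p additions -- the workload is
# constant, so there is no per-pixel slack for DVS to exploit.

def convolve_direct(image, kernel):
    n = len(image)          # assume a square image (list of rows)
    p = len(kernel)         # assume a square, odd-sized kernel
    h = p // 2
    out = [[0] * n for _ in range(n)]
    for y in range(h, n - h):           # interior pixels only
        for x in range(h, n - h):
            acc = 0
            for ky in range(p):
                for kx in range(p):
                    acc += kernel[ky][kx] * image[y + ky - h][x + kx - h]
            out[y][x] = acc
    return out
```

With the edge-detection kernel above, every pixel costs the same 9 multiply-accumulates regardless of the image content.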


Direct Implementation: Constant-Workload Algorithm
• p² multiplications and p² additions for each convolved element.
• No workload variations!

Low-Power Implementation: Variable-Workload Algorithm
• Kernel Analysis & Rearrangement produces a decomposed kernel. E.g., for the kernel (-1 -1 -1 / -1 8 -1 / -1 -1 -1): Kernel[0] holds v = -1 with its list of (xoffset, yoffset) positions, Kernel[1] holds v = 8; entries are linked via next pointers and carry PosPtr/NegPtr lists.
• Modified Convolution Computation
• Execution Time Prediction & Voltage/Frequency Setting
→ A variable-workload algorithm based on kernel characteristics.

Low-Power Implementation
Kernel Analysis and Rearrangement

• Property 1. For most kernels, the number of distinct kernel elements is small.
• Property 2. 0, 1 and -1 are used frequently.
• Property 3. Many kernel elements have the same absolute values.

Overall processing step: the input image and kernel pass through Kernel Analysis & Rearrangement, yielding a decomposed kernel; the Modified Convolution Process then produces the output image, while Execution Time Prediction & Voltage/Frequency Setting selects the operating point. Example kernels:

 0  1  1      -1 -2 -1
-1  0  1       0  0  0
-1 -1  0       1  2  1


Modified Convolution Algorithm: SDMK
Direct vs. SDMK

• For 1 or -1, no multiplication.
• For 0, no addition and no multiplication.

• For the same absolute values, a single multiplication.

Sliding-window view: with the original kernel (a b c), the reversed kernel (c b a) slides over the input sequence d1 d2 d3 d4 d5 d6; at step i it is aligned with d1 d2 d3 (the partial products *c, *b, *a are accumulated), at step i+1 with d2 d3 d4, and so on, producing one convolved output per step.

For the example kernel (0 1 1 / -1 0 1 / -1 -1 0):

  Operations/pixel:
    Direct implementation: 9 additions & 9 multiplications
    SDMK implementation:   6 additions & 0 multiplications
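The SDMK rules above can be sketched as follows (a hypothetical Python rendering of the decomposed-kernel idea; the helper names and the exact decomposition data structure are assumptions, not the lecture's implementation):

```python
# SDMK sketch: decompose the kernel once, grouping offsets by coefficient
# value, then convolve without multiplying by 0, 1, or -1.  For the
# example kernel the per-pixel cost drops to 6 additions, 0 multiplications
# (counting one addition per accumulated nonzero element, as the slide does).

def decompose(kernel):
    """Map each nonzero coefficient value to its list of (dy, dx) offsets."""
    groups = {}
    h = len(kernel) // 2
    for ky, row in enumerate(kernel):
        for kx, v in enumerate(row):
            if v != 0:                      # zeros cost nothing at all
                groups.setdefault(v, []).append((ky - h, kx - h))
    return groups

def convolve_sdmk_pixel(image, groups, y, x, ops):
    """Convolve one pixel, tallying additions/multiplications in `ops`."""
    acc = 0
    for v, offsets in groups.items():
        part = 0
        for dy, dx in offsets:              # one addition per nonzero element
            part += image[y + dy][x + dx]
            ops['add'] += 1
        if v == 1:
            acc += part                     # +1 coefficients: add only
        elif v == -1:
            acc -= part                     # -1 coefficients: subtract only
        else:
            acc += v * part                 # one multiplication per distinct value
            ops['mul'] += 1
    return acc
```

For kernels dominated by 0 and ±1 the multiplication count collapses, which is exactly where the variable (data-dependent) workload, and hence the DVS opportunity, comes from.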

Exec Time Prediction & Speed Setting
Experimental Environments

• By a static method
  – Based on the number of required arithmetic operations
• By a dynamic method
  – Based on actual measurements of execution times
• In the direct algorithm, by a pre-constructed speed table

Experimental environment:
• Itsy Pocket Computer V2.6
• CPU: Intel StrongARM 1110
  – Frequency scaling: 11 levels (59.0 MHz ~ 226.4 MHz)
  – Voltage scaling: 30 levels (1.00 V ~ 2.00 V)
• Linux operating system (ver. 2.0.30)
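A minimal sketch of the static speed-setting step on such a processor. Only the endpoint frequencies (59.0 and 226.4 MHz) come from the slide; the interior levels and the predicted cycle counts are made-up placeholders:

```python
# Static speed setting: predict the cycle count from the operation counts
# that kernel analysis yields, then pick the lowest frequency level that
# still meets the deadline.  Interior level values below are illustrative.

FREQ_LEVELS_MHZ = [59.0, 73.7, 88.5, 103.2, 118.0, 132.7,
                   147.5, 162.2, 176.9, 191.7, 226.4]   # 11 levels, ascending

def pick_speed(predicted_cycles, deadline_s):
    """Lowest discrete frequency (MHz) finishing predicted_cycles in time."""
    required_mhz = predicted_cycles / deadline_s / 1e6
    for f in FREQ_LEVELS_MHZ:
        if f >= required_mhz:
            return f
    return FREQ_LEVELS_MHZ[-1]              # saturate at the top level
```

Because the available frequencies are discrete, the chosen level is rounded up from the ideal continuous speed; the supply voltage is then set to the lowest value supporting that frequency.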


Results (Energy Dissipation)
Results (Execution Time)

• There is no performance degradation over the direct approach.
• Average 67.6% energy saving in the core processor, and 62.8% in the whole Itsy system.

Conclusions
• Presented a low-power implementation of the image convolution algorithm for variable-voltage processors.
• The energy efficiency of the proposed implementation comes from (recall E ∝ Ncycle · VDD²):
  – Smaller Ncycle
  – Lower VDD
  – Fewer memory references, i.e., less energy consumed in non-CPU components

MPEG Decoder Implementation
• Buffer-based DVS
  – Makes it possible to exploit the slack time caused by workload variations
  – Consumes additional memory resources:
    one_buffer_size = image_width * image_height * byte_per_pixel

[Figure: decoding timeline over successive periods; frames j, j+1, j+2 with VSTj, AETj, OPj+1, and the deadlines for sj, sj+1, and sj+2.]
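The buffer cost and the per-frame speed choice above can be sketched as follows (an assumed model: the lecture's policy is buffer-based DVS using per-frame WCET, and these helper names are hypothetical):

```python
# Buffer-based DVS sketch for an MPEG decoder: one extra frame buffer
# costs image_width * image_height * byte_per_pixel bytes; in exchange,
# a frame can start early and stretch over the slack accumulated in the
# buffer instead of being squeezed into one fixed period.

def one_buffer_size(image_width, image_height, byte_per_pixel):
    """Memory cost of each additional output frame buffer, in bytes."""
    return image_width * image_height * byte_per_pixel

def frame_speed_hz(wcet_cycles, now_s, deadline_s):
    """Lowest constant speed finishing the frame's WCET by its deadline."""
    return wcet_cycles / (deadline_s - now_s)
```

Decoding ahead into the buffer moves a frame's effective deadline later, so frame_speed_hz returns a lower clock (and thus a lower voltage) whenever earlier frames finished under their WCET.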


Measurement Results
Demo (1)

• Buffer-based DVS scheme (using WCET)
• Bitrate of sample video: 163 Kbps
• Policies compared:
  – Ideal DVS
  – DVS using a moving average
  – Buffer-based DVS
• In the demo, data profiled with a DAQ board is displayed alongside the actual video.

• Energy saving: up to 53%

• The voltage at which the current frame is being decoded is displayed.

Demo (2)

• Total energy consumption is compared across the DVS policies.

Energy-Optimal Off-Line Voltage Scheduling


Off-Line Voltage Scheduling Problem
• Voltage schedule (speed schedule) S(t): the processor speed as a function of time.
• The energy consumption under S(t) is given by
  E(S) = ∫ P(S(t)) dt over the scheduling interval,
  where P is a convex function from speed to power.
• Given N jobs J1, J2, …, JN, where
  – ri: the release time of Ji
  – di: the deadline of Ji
  – ci: the workload (# of execution cycles) of Ji, assumed to be known a priori
  – pi: the priority of Ji
  compute a feasible voltage schedule S(t) that minimizes E(S).
• S(t) is feasible iff S(t) gives each Ji its workload ci between ri and di, for all J1, J2, …, JN.

Existing Works for the Problem
• Note that the system model covers
  – Fixed-priority (RM, DM) periodic/aperiodic task sets
  – EDF periodic/aperiodic task sets (pi < pj iff di < dj)
• For EDF job sets (a special case), the problem can be solved in polynomial time by Yao's algorithm [FOCS'95]
  – Solution space = convex, objective function = convex
• For general job sets, the problem becomes much more difficult
  – Main source of difficulty: the feasibility condition
  – Quan & Hu [TCAD'03]: exhaustive optimal algorithm
  – Yun & Kim [TECS'03]: NP-hardness & FPTAS

References
• Optimal algorithm for EDF job sets
  – F. Yao, A. Demers, and S. Shenker, "A Scheduling Model for Reduced CPU Energy", In Proc. Foundations of Computer Science (FOCS'95), 1995.

• Heuristic for FP job sets
  – G. Quan and X. Hu, "Energy Efficient Scheduling for Real-Time Systems on Variable Voltage Processors", In Proc. Design Automation Conference (DAC'01), 2001.
• Exhaustive optimal algorithm for FP job sets
  – G. Quan and X. Hu, "Minimum Energy Fixed-Priority Scheduling for Variable Voltage Processors", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2003.
• NP-hardness proof & FPTAS for FP job sets
  – H.-S. Yun and J. Kim, "On Energy-Optimal Voltage Scheduling for Fixed-Priority Hard Real-Time Systems", ACM Transactions on Embedded Computing Systems, 2003.
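Yao's algorithm for the EDF special case, referenced above, repeatedly finds the critical interval of maximum intensity (total contained work divided by interval length), runs it at that constant speed, removes its jobs, and collapses the interval. A didactic, unoptimized Python sketch under the job model (ri, di, ci) from the slides:

```python
# Critical-interval sketch of Yao-Demers-Shenker [FOCS'95] for EDF job
# sets.  jobs: list of (release, deadline, cycles).  Returns a list of
# (speed, work) pieces; the speeds come out in non-increasing order.

def yao_schedule(jobs):
    jobs = list(jobs)
    pieces = []
    while jobs:
        best = None
        times = sorted({t for r, d, _ in jobs for t in (r, d)})
        for z1 in times:                    # try every candidate interval
            for z2 in times:
                if z2 <= z1:
                    continue
                # intensity = work fully contained in [z1, z2] / length
                work = sum(c for r, d, c in jobs if z1 <= r and d <= z2)
                if work and (best is None or work / (z2 - z1) > best[0]):
                    best = (work / (z2 - z1), z1, z2, work)
        speed, z1, z2, work = best
        pieces.append((speed, work))
        length = z2 - z1
        rest = []
        for r, d, c in jobs:
            if z1 <= r and d <= z2:
                continue                    # scheduled inside this piece
            # collapse the critical interval for the remaining jobs
            r2 = r - length if r >= z2 else min(r, z1)
            d2 = d - length if d >= z2 else min(d, z1)
            rest.append((r2, d2, c))
        jobs = rest
    return pieces
```

This enumerates O(N²) candidate intervals per round for clarity; the convexity of both the solution space and the objective (noted on the slide) is what makes this greedy construction optimal.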