
Is Compiling for Performance == Compiling for Power?

Madhavi Valluri and Lizy John

Laboratory for Computer Architecture

Department of Electrical and Computer Engineering

The University of Texas at Austin

[email protected]

[email protected]

Abstract

Energy consumption and power dissipation are increasingly becoming important design constraints in high performance microprocessors. Compilers traditionally are not exposed to the energy details of the processor. However, with the increasing power/energy problem, it is important to evaluate how the existing optimizations influence energy consumption and power dissipation in the processor. In this paper we present a quantitative study wherein we examine the effect of the standard optimization levels -O1 to -O4 of DEC Alpha's cc compiler on the power and energy of the processor. We also evaluate the effect of four individual optimizations on power/energy and attempt to classify them as "low energy" or "low power" optimizations. In our experiments we find that optimizations that improve performance by reducing the number of instructions are optimized for energy. Such optimizations reduce the total amount of work done by the program. This is in contrast to optimizations that improve performance by increasing the overlap in the program during execution. The latter kind of optimizations increase the average power dissipated in the processor.

1 Introduction

Energy consumption and power dissipation are increasingly becoming important design constraints in high performance microprocessors. Power dissipation affects circuit reliability and packaging costs. Energy consumption directly affects battery life. With the increasing use of general purpose processors in the embedded world, designing low energy processors is important. Gowan et al. [5] discuss the power and energy trends of three generations of Alpha processors. Power dissipation increases significantly from one generation to the next despite the reduced supply voltages and advanced processor technologies. The paper shows the power in the Alpha 21264 increasing almost linearly with frequency, with power reaching 72 Watts at 600 MHz. The maximum power dissipated under worst case conditions was found to be about 95 Watts. These examples clearly indicate that power dissipation and energy consumption will soon become important limiting factors in the design of high performance processors.

Until recently, the two problems were being dealt with only at the circuit level. Voltage scaling, low-swing buses, conditional clocking, etc., have helped alleviate the problems enormously. However, architectural-level and compiler-level analysis can help tackle these problems much earlier in the design cycle. Recently, several architectural and compiler techniques have been proposed to reduce power and energy [3, 6, 7, 8, 9, 10, 11, 12]. In our work we concentrate on the influence of compilers on power dissipation and energy consumption.

Compilers traditionally are not exposed to the energy details of the processor. Current compiler optimizations are tuned primarily for performance and occasionally for code size. With the increasing power/energy problem, it is important to evaluate how the existing optimizations influence energy consumption and power dissipation in the processor. An interesting question to answer would be: if we compile for performance, are we automatically compiling for low power or low energy? Current compilers already have two axes in the optimizations used, namely compiling for speed (in general-purpose processors) and compiling for code size (in embedded

systems). Do we need a third axis with optimizations that compile for power/energy?

To answer the above questions, we present a quantitative study wherein we examine the influence of a few state-of-the-art compiler optimizations on the energy and power of the complete processor. We study the effect of the standard optimization levels -O1 to -O4 of DEC Alpha's cc compiler on the power and energy of the processor. We also evaluate the effect of four individual optimizations on power/energy and attempt to classify them as "low energy optimizations" or "low power optimizations" or both. The optimizations we study are simple basic-block scheduling, loop unrolling, function inlining, and aggressive global scheduling. For our experiments, we use Wattch [2], an architectural simulator that estimates CPU energy consumption. Wattch integrates parameterizable power models into the SimpleScalar [4] processor simulator.

In our study we find that the set of compiler optimizations that improve performance by reducing the number of instructions executed are optimized for both energy and power. This is in contrast to optimizations that improve performance by increasing the existing parallelism in the program. The latter kind of optimizations increase the average power dissipated in the processor. We find that optimizations such as common-subexpression elimination, copy propagation, and loop unrolling are very good for reducing energy since they reduce the number of instructions in the program; hence the total amount of work done is less in programs with these optimizations. Such optimizations should definitely be included in the compile-for-power/energy switch. Optimizations such as instruction scheduling significantly increase power (and may occasionally increase energy) because they increase the overlap in programs without reducing the total number of instructions in the program. However, such optimizations can easily be modified to take power details into consideration and can be used to increase performance without increasing average power.

The rest of the paper is organized as follows: In Section 2, we discuss some previous work that has been done in the area of compilers and low power/energy. Section 3 shows a few examples that motivate the need for our study. We describe the different compiler optimizations evaluated in Section 4. In Section 5, we describe our experimental framework and discuss in detail the results obtained. Finally, we provide concluding remarks and future directions in Section 6.

2 Related Work

In this section we present some of the previous work done in understanding the interaction between the compiler and the power/energy of the processor.

The study by Kandemir et al. [7] quantitatively examines the influence of different high-level compiler optimizations on system energy. However, in their study, they evaluate only loop-nest optimizations such as loop fusion, loop fission, blocking, tiling, scalar expansion, and unrolling. In our paper, we discuss both power dissipation and energy consumption details, while the paper by [7] reports only energy details. Their main observation is that the optimizations appear to increase the energy consumed in the core while reducing the energy consumed in the memory system. Unoptimized codes consume more energy in the memory system.

There have been a few techniques proposed which attempt to reduce the power dissipated in the processor. Su et al. [11] proposed cold scheduling, wherein they assign priority to instructions based on some pre-determined power cost and use a generic list scheduler to schedule the instructions. The power cost of scheduling an instruction depends on the instruction it is being scheduled after. This corresponds to the switching activity on the control path. Toburen et al. [12] propose another power-aware scheduler which schedules as many instructions as possible in a given cycle until the energy threshold of that cycle is reached. Once that precomputed threshold is reached, scheduling proceeds to the next time-step or cycle. In our work, by evaluating several state-of-the-art optimizations, we attempt to identify other optimizations besides instruction scheduling that can be improved if the power/energy models of the processor were exposed to them.

Significant work has been done in reducing energy consumption in the memory. Most techniques achieve a reduction in energy through innovative architectural techniques [6, 8, 9, 10]. Some of the works that include compiler involvement are [10] and [6]. In [6], the authors suggest the use of an L-cache. An L-cache is a small cache which is placed between the I-cache and the CPU. The L-cache is very small (holds a

few basic blocks), and hence consumes less energy. The compiler is used to select good basic blocks to place in the L-cache. Another approach to reducing memory energy is Gray code addressing [10]. This form of addressing reduces the bit switching activity in the instruction address path. Bunda et al. [3] and Asanovic [1] investigated the effect of energy-aware instruction sets. These techniques would involve the compiler even earlier in the code generation process. The paper by Bunda et al. [3] concentrates on reducing memory energy, and Asanovic [1] investigates new instructions to reduce energy in the memory, register files and pipeline stages.

3 Motivating Examples

Consider the data dependence graph (DDG) shown in Figure 1(a). It contains six operations. All operations except op E have a latency of one cycle; op E takes two cycles to complete. We will assume there are infinite functional units for this example. An instruction scheduler that attempts to also optimize for registers would schedule op E as close to op F as possible. The resulting schedule can be seen in Figure 1(b). If we assume that each operation consumes one unit of power, the schedule in Figure 1(c) dissipates less peak power than the schedule in Figure 1(b) (2 units vs. 3 units). Figure 1(c) is also a valid schedule. By extending the lifetime of op E by one cycle, we reduce the peak power dissipated without affecting performance. The design choice of letting op E occupy a register for one cycle longer than required will prove to be inexpensive only if there is a sufficient number of registers. Current schedulers do not take power details into consideration and hence might schedule op E in cycle 2 even if there are sufficient registers. This example shows that two variations of the same code can have the same performance but different power requirements.

Another good candidate for reducing energy without increasing power would be function inlining. Function inlining is done in cases where the callee procedure body is small. In these cases, the code required for the calling sequences outweighs the code in the procedure body. If this procedure is called many times, inlining can save a tremendous number of instructions. Function inlining does not increase the overlap the way instruction scheduling does; hence this optimization keeps energy low and holds the power constant. This optimization can be a good candidate to use in the "compile for power/energy" switch.

These examples show that compilers can be optimized to produce code for low power or low energy without sacrificing performance. In this study we hope to expose the current void in the area of power/energy-aware compilers and attempt to identify good candidates for further improvement.

4 Compiler Optimizations

In our study we evaluate the influence of compiler optimizations on processor power/energy using the native compiler cc on a DEC Alpha 21064 running the OSF1 operating system. We also used the gcc compiler to study the effect of a few individual optimizations. The details of both compilers and their different options are presented in the following subsections.

4.1 Standard Optimization Levels in cc and gcc

The different levels in the cc compiler, along with the optimizations performed at each level, are described below.

-O0 No optimizations are performed. At this level, the compiler's goal is to reduce the cost of compilation. Only variables declared register are allocated in registers.

-O1 Many local and global optimizations are performed. These include recognition and elimination of common subexpressions, copy propagation, dead-code elimination, code motion, test replacement, split lifetime analysis, and some minimal code scheduling.

-O2 This level does inline expansion of static procedures. Additional global optimizations that improve speed (at the cost of extra code size), such as integer multiplication and division expansion (using shifts), loop unrolling, and code replication to eliminate branches, are also performed. Loop unrolling and elimination of branch instructions increase the size of the basic blocks. This helps the hardware exploit instruction-level parallelism (ILP) in the program.

-O3 Includes all -O2 optimizations; inline expansion of global procedures is also performed.

Figure 1: Motivating example. (a) Example DDG with six operations A-F. (b) Schedule with op E issued in cycle 2: peak power = 3, energy = 6. (c) Schedule with op E issued in cycle 1: peak power = 2, energy = 6.
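The peak power and energy annotations in Figure 1 follow from a simple accounting: each operation draws one unit of power in the cycle it issues, so peak power is the largest per-cycle count and energy is the total count. A minimal sketch of that accounting (the schedules are transcribed from Figure 1; the helper names are ours):

```python
# Schedules from Figure 1: cycle -> ops issued in that cycle.
# In (b) op E issues in cycle 2; in (c) its issue is moved up to
# cycle 1, extending its register lifetime by one cycle.
sched_b = {1: ["A"], 2: ["B", "C", "E"], 3: ["D"], 4: ["F"]}
sched_c = {1: ["A", "E"], 2: ["B", "C"], 3: ["D"], 4: ["F"]}

def peak_power(sched):
    # Peak power: most ops issued in any single cycle (1 unit per op).
    return max(len(ops) for ops in sched.values())

def energy(sched):
    # Energy: total units over the whole run; the same six ops
    # execute either way, so energy is identical for both schedules.
    return sum(len(ops) for ops in sched.values())

print(peak_power(sched_b), energy(sched_b))  # 3 6
print(peak_power(sched_c), energy(sched_c))  # 2 6
```

Both schedules take four cycles and do six units of work; only the per-cycle distribution of that work, and hence peak power, differs.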

-O4 Software pipelining, an aggressive instruction scheduling technique used to exploit ILP in loops, is performed using dependency analysis. Vectorization of some loops on 8-bit and 16-bit data is also done. This level also invokes a scheduling pass which inserts NOP instructions to improve the scheduling.

We use the FORTRAN g77 compiler to compile the SpecFP benchmarks. g77 is a program that calls gcc with options to recognize programs written in Fortran. The standard optimization levels offered by gcc are listed below:

-O0 No optimizations are performed.

-O1 This level is very similar to -O1 in cc. Optimizations performed are common subexpression elimination, combining instructions through substitution (copy propagation), dead-store elimination, loop optimizations, and minimal scheduling.

-O2 Nearly all supported optimizations that do not involve a space-speed tradeoff are performed. Loop unrolling and function inlining are not done, for example. This level also includes an aggressive instruction scheduling pass.

-O3 This turns on everything that -O2 does, along with inlining of procedures.

We note that in both cc and gcc, the optimizations that increase the ILP in a program are in

optimization levels -O2, -O3 and -O4 (-O4 only in cc). The different levels include almost the same optimizations in both compilers. We use both cc and gcc in our work: cc wherever possible, and gcc wherever specific hooks to control individual optimizations are required.

4.2 Individual Optimizations

We analyze the impact of four different individual optimizations provided by gcc. We chose gcc for this because gcc provides more distinct individual optimizations than cc to choose from. All the individual optimizations are applied on top of the optimizations performed at -O1. The individual optimizations chosen are:

-fschedule-insns This optimization attempts to reorder instructions to eliminate execution stalls that occur due to unavailability of required data. This helps machines that have slow floating point or memory load instructions by allowing other instructions to be issued until the result of the load or floating point instruction is required. The scheduler used is a basic-block list scheduler, and it is run after local register allocation has been performed.

-fschedule-insns2 Similar to -fschedule-insns, but requests an additional pass of instruction scheduling after register allocation has been done. This pass does aggressive global scheduling before and after global register allocation. Postpass scheduling (when scheduling is done after register allocation) minimizes the pipeline stalls due to the spill instructions introduced by register allocation.

-finline-functions Integrates all simple functions into their callers. The compiler heuristically decides which functions are simple enough to be worth integrating in this way.

-funroll-loops Performs the optimization of loop unrolling. This is done only for loops whose number of iterations can be determined at compile time or run time.

5 Experimental Results

In this section we first describe the Wattch simulator and our benchmarks. We then present a detailed analysis of our results.

5.1 Wattch 1.0 and Benchmarks

We use the Wattch 1.0 simulator [2] for our experimentation. Wattch is an architectural simulator that estimates CPU energy consumption. The power/energy estimates are based on a suite of parameterizable power models for various hardware structures in the processor and on the resource usage counts. The power models are interfaced with SimpleScalar [4]. sim-outorder, SimpleScalar's out-of-order issue simulator, has been modified to keep track of which unit is being accessed in each cycle and to record the total energy consumed for an application.

Wattch has three different options for clock gating to disable unused resources in the processor. The simplest clocking style assumes that the full modeled power will be consumed if any accesses occur in a given cycle, and zero otherwise. This is ideal clock gating. The second possibility assumes that if only a portion of a unit's ports are accessed, the power is scaled linearly according to the number of ports being used. In the third clock gating scheme, power is scaled linearly with port or unit usage, but unused units dissipate 10% of their maximum power. This corresponds to the static power dissipated when there is no activity in a unit. We chose power and energy results corresponding to the third scheme since it is the most realistic of the three. We used the default configuration in sim-outorder for our study, but changed the RUU (Register Update Unit) size from 16 to 32 and the LSQ (Load Store Queue) size from 8 to 16. The functional unit latencies exactly match the functional unit latencies in the Alpha 21064 processor. We use the process parameters for a .35um process at 600MHz.

We chose six different benchmarks for our study: three SpecInt95 benchmarks, namely compress, go and li; two SpecFP95 benchmarks, su2cor and swim; and saxpy, a toy benchmark.

5.2 Results

In the following subsections we present a detailed analysis of the results obtained. We first discuss the influence of the standard optimizations on energy and power, following which we study the effects of individual optimizations.

5.2.1 Influence of Standard Optimizations on Energy

Table 1 shows the results obtained when the benchmarks are compiled with the different standard optimization levels. We present the results of all optimizations relative to the result of optimization level -O0. For example, when we consider the number of instructions, the percentage of instructions executed by a benchmark optimized with option -O2 is given by:

% of Insts Executed by Program_O2 =
    (# of Insts Executed by Program_O2 / # of Insts Executed by Program_O0) x 100

For example, in Table 1, we see that compress when compiled with -O2 executed 17.96% fewer instructions than compress when compiled with -O0. Our results are presented in this form for all benchmarks and for all optimizations. As mentioned in Section 4, we used cc to compile the SpecInt benchmarks and saxpy, and g77 to compile the SpecFP benchmarks su2cor and swim.

We observe that the number of instructions committed drops drastically from optimization level -O0 to -O1, and also drops significantly in codes optimized with -O2 and -O3. There is, however, a very marginal increase in the number of instructions in compress. In codes optimized with the -O4 option, the number of instructions increases due to the extra NOP code generated for scheduling.

The reduction in the number of instructions directly influences execution time or performance. The performance improvement is significant in -O1 when compared to -O0, sometimes as high as 73% (swim). -O2 and -O3 also lead to significant improvement over -O1; for example, we see an 8% improvement in li with the -O2 optimization. In some benchmarks like saxpy the improvement is only about 0.6%. Optimizations -O2 and -O3 improve performance in compress even though the number of instructions increases.

The energy consumed by the code is again directly proportional to the number of instructions. Here we see that even though -O2 and -O3 improve performance in compress, the energy consumed is higher. This is because of the higher number of instructions; hence, the amount of work done is more. In all the benchmarks, we see that the energy decreases when the number of instructions decreases. Hence, if we are compiling for energy, we should choose optimizations such as common subexpression elimination, induction variable elimination, and unrolling that reduce the number of instructions executed. Optimizations such as the ones in -O4 (inserting NOPs to improve scheduling) may improve performance, but can also increase the number of instructions, leading to higher energy requirements. The energy increase is seen to be up to 4% (in compress).

5.2.2 Influence of Standard Optimizations on Power

To study the influence of compiler optimizations on power, we again refer to Table 1. We see that though the number of instructions and the number of cycles taken reduce at the higher optimization levels, the number of instructions does not reduce enough to keep the instructions per cycle (IPC) constant. IPC reduces in -O1 codes but increases in -O2, -O3 and -O4 codes. IPC in -O0 is low because of the poor quality of the code produced. Since optimizations such as common subexpression elimination improve code by reducing instructions rather than by increasing the available parallelism, IPC does not increase in -O1 codes. Most optimizations that increase IPC, such as instruction scheduling and loop unrolling, are included in the -O2, -O3 and -O4 levels. Power dissipated is the amount of work done in one cycle; this is directly proportional to the IPC. Hence, we see that optimizations that increase IPC increase the power dissipated. Instruction scheduling and the other -O2, -O3 optimizations are good for performance improvement but are bad when instantaneous power is the main concern.

5.2.3 Influence of Individual Optimizations on Energy and Power

We refer to Tables 2 to 7 for experiments on how the different individual optimizations affect power/energy. We show the results for each benchmark separately. The tables show the performance, power and energy of each of the optimizations relative to the performance, power and energy of code compiled with -O0 (similar to Table 1). Since the individual optimizations are applied over the -O1 option, in our discussions we always compare the results of the optimizations with the results of -O1. We first discuss the effects of the instruction scheduling options.

The -fschedule-insns optimization does simple basic-block list scheduling and -fschedule-insns2 does aggressive global scheduling. We expect both options

Table 1: Effects of Standard Optimization Levels on Power/Energy

Benchmark opt level Energy Exec Time Insts Avg Power IPC

O0 100.00 100.00 100.00 100.00 100.00

O1 74.48 81.55 81.52 91.33 99.96

compress O2 75.13 81.44 82.04 92.25 100.73

O3 75.13 81.44 82.04 92.25 100.73

O4 79.01 82.77 86.11 95.45 104.03

O0 100.00 100.00 100.00 100.00 100.00

O1 66.20 64.13 68.94 103.23 107.50

go O2 62.62 61.31 63.01 102.14 102.78

O3 62.62 61.31 63.01 102.14 102.78

O4 63.67 62.19 63.75 102.38 102.51

O0 100.00 100.00 100.00 100.00 100.00

O1 81.32 83.66 83.18 97.20 99.42

li O2 79.60 75.97 82.97 104.78 109.21

O3 79.60 75.97 82.97 104.78 109.21

O4 85.71 77.89 90.96 110.05 116.78

O0 100.00 100.00 100.00 100.00 100.00

O1 97.38 100.24 92.49 97.15 92.27

saxpy O2 97.69 99.38 92.49 98.30 93.07

O3 97.69 99.38 92.49 98.30 93.07

O4 98.31 99.27 92.84 99.02 93.51

O0 100.00 100.00 100.00 100.00 100.00

O1 42.09 51.04 33.21 82.46 65.06

su2cor O2 40.99 47.52 33.10 86.28 69.67

O3 40.99 46.37 33.10 87.65 71.38

O0 100.00 100.00 100.00 100.00 100.00

O1 30.10 36.64 20.01 82.15 54.63

swim O2 28.93 34.01 19.05 85.06 56.01

O3 28.93 34.01 19.05 85.06 56.01
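The derived columns in Table 1 are internally consistent: average power is energy divided by execution time, and IPC is instructions divided by cycles, so the relative Avg Power and IPC columns can be recomputed from the first three. A quick check (the compress -O1 values are taken from Table 1; the helper names are ours):

```python
def relative_avg_power(energy_rel, time_rel):
    # Avg power = energy / exec time, so the relative (% of -O0)
    # avg power is the ratio of the relative energy and time.
    return 100.0 * energy_rel / time_rel

def relative_ipc(insts_rel, time_rel):
    # IPC = instructions / cycles; cycle count scales with exec time.
    return 100.0 * insts_rel / time_rel

# compress at -O1 (Table 1): Energy 74.48, Exec Time 81.55, Insts 81.52
print(round(relative_avg_power(74.48, 81.55), 2))  # 91.33
print(round(relative_ipc(81.52, 81.55), 2))        # 99.96
```

The recomputed values match the Avg Power (91.33) and IPC (99.96) entries for compress at -O1.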

Table 2: Individual Optimizations on Compress

opt level Energy Exec Time Insts Power IPC

O0 100.0 100.0 100.0 100.0 100.0

O1 67.66 74.68 60.46 90.60 80.95

inline-func 67.69 74.68 60.46 90.63 80.95

sched-instr2 68.82 74.94 63.21 91.82 84.35

sched-instr 66.66 73.47 59.83 90.72 81.43

unroll-loops 66.84 74.19 59.90 90.09 80.74

Table 3: Individual Optimizations on li

opt level Energy Exec Time Insts Power IPC

O0 100.00 100.00 100.00 100.00 100.00

O1 70.91 74.67 66.18 94.96 88.63

inline-func 71.02 73.14 68.00 97.11 92.97

sched-instr2 69.56 66.65 68.33 104.36 102.52

sched-instr 69.56 66.65 68.33 104.36 102.52

unroll-loops 66.05 59.91 68.19 110.24 113.81
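The instruction-count reduction that makes unrolling attractive for energy comes from paying the loop's bookkeeping overhead (increment, compare, branch) less often. A toy model of this effect (the instruction counts are illustrative, not the simulator's accounting):

```python
def dynamic_insts(n_iters, body=4, overhead=3, unroll=1):
    # Each iteration runs `body` useful instructions; each trip
    # around the (unrolled) loop pays `overhead` bookkeeping
    # instructions for the increment, compare, and branch.
    full_trips, leftover = divmod(n_iters, unroll)
    return n_iters * body + (full_trips + leftover) * overhead

rolled = dynamic_insts(1000)              # 4000 useful + 3000 overhead
unrolled = dynamic_insts(1000, unroll=4)  # 4000 useful + 750 overhead
print(rolled, unrolled)  # 7000 4750
```

The useful work is unchanged; only the overhead shrinks. The enlarged basic block is, in turn, what lets the hardware raise IPC, and with it power.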

to increase the IPC and hence the power. We can see from the tables that IPC goes up in most benchmarks, in some benchmarks by up to 4.6% (in su2cor). The power increase is up to 3.9%. In li, the power increases by as much as 10%. The aggressive scheduler (the prepass scheduler) increases register pressure

Table 4: Individual Optimizations on saxpy

opt level Energy Exec Time Insts Power IPC

O0 100.00 100.00 100.00 100.00 100.00

O1 96.78 98.56 96.21 98.19 97.61

inline-func 96.78 98.56 96.21 98.19 97.61

sched-instr2 97.07 97.14 96.27 99.93 99.11

sched-instr 96.79 98.52 96.15 98.24 97.60

unroll-loops 96.87 98.72 95.97 98.13 97.21

Table 5: Individual Optimizations on su2cor

opt level Energy Exec Time Insts Power IPC

O0 100.00 100.00 100.00 100.00 100.00

O1 42.09 51.04 33.21 82.47 65.07

inline-func 42.06 51.01 33.21 82.46 65.11

sched-instr2 42.49 50.36 34.02 84.38 67.55

sched-instr 40.90 47.79 33.30 85.58 69.67

unroll-loops 40.17 48.35 31.17 83.08 64.46

Table 6: Individual Optimizations on swim

opt level Energy Exec Time Insts Power IPC

O0 100.00 100.00 100.00 100.00 100.00

O1 30.06 36.64 20.02 82.02 54.64

inline-func 30.06 36.64 20.02 82.02 54.64

sched-instr2 30.91 36.39 20.53 84.92 56.41

sched-instr 29.83 35.11 20.32 84.95 57.86

unroll-loops 29.29 35.38 18.19 82.80 51.43

Table 7: Individual Optimizations on go

opt level Energy Exec Time Insts Power IPC

O0 100.00 100.00 100.00 100.00 100.00

O1 40.97 42.75 42.65 95.83 99.77

inline-func 40.92 42.78 42.58 95.64 99.54

sched-instr2 43.07 44.01 45.25 97.87 102.82

sched-instr 43.52 44.89 46.52 96.96 103.63

unroll-loops 39.38 41.95 39.30 93.88 93.69
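The small inline-func deltas in Tables 2 to 7 can be read against a simple call-overhead model: inlining saves only the calling sequence, so it pays off only when tiny procedures are called very often. A hypothetical sketch (all counts are illustrative):

```python
def with_calls(n_calls, body=3, call_overhead=4):
    # Each call executes the callee body plus the calling sequence
    # (argument setup, call, return, result move).
    return n_calls * (body + call_overhead)

def with_inlining(n_calls, body=3):
    # Inlining keeps the body but eliminates the calling sequence.
    return n_calls * body

print(with_calls(10000), with_inlining(10000))  # 70000 30000
```

When the callee body is large relative to the calling sequence, or the call site is cold, the savings shrink toward zero, which is consistent with the marginal changes observed in these benchmarks.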

and hence causes a significant number of spills, thereby increasing the total number of instructions executed and the total energy. The increases in the number of instructions and energy are up to 3.52% and 2.14% respectively. This optimization needs to be improved upon if power and energy are a concern. We would see a greater impact of these optimizations if the target processor were an in-order machine, wherein the compiler is fully responsible for exposing the parallelism. In an out-of-order issue machine, the hardware can find the parallelism even if the compiler does not do any reordering. The reason why we see some improvement in performance (and increase in IPC) is that the hardware is limited by the instruction window size; the global scheduler, which has the full program as its scope, helps the hardware see more instructions than it otherwise would have.

We next discuss the impact of unrolling. Unrolling appears to be a good optimization to use for energy because the number of instructions reduces significantly. We are able to reduce the number of

instructions by 3.35% in go, and the energy falls by 1%. We see that in some benchmarks the energy falls by 5% (li). However, reducing the energy does not necessarily reduce power. For instance, in li, the power goes up by 10%. Unrolling increases the size of the basic block and hence allows the hardware to increase the overlap of instructions. This leads to an increase in the number of simultaneous operations being executed. It may be noted that the IPC in li increases by 25%. However, this observation is not consistent among all the benchmarks; in many benchmarks, there is no increase in IPC. This is because the target architecture has a good branch predictor and does unrolling in hardware, thus reducing the impact of software unrolling. We are currently investigating how the unrolling optimization affects power if we turn off the branch prediction hardware. We expect to see a significant increase in IPC and power in the codes after unrolling has been applied.

Our next optimization is inlining of function calls. Inlining, as explained in the motivation section, will reduce the number of instructions and hence energy. However, in our benchmarks, only go and su2cor show a very marginal decrease in energy. In our future work, we will be investigating further with a better set of benchmarks more suited for this optimization.

6 Conclusions

In this paper we evaluated the impact of using the different levels of optimizations in the cc compiler on system power and energy. We also evaluated the effect of a few individual optimizations. We found that energy consumption reduces when the optimizations reduce the number of instructions executed by the program, i.e., when the amount of work done is less. The standard optimization level -O1 reduces the number of instructions drastically as compared to -O0 because it invokes optimizations such as common subexpression elimination, an optimization used to eliminate redundant computations in the program. The drop is not as significant in the -O2, -O3 and -O4 optimizations. The energy also drops in the same proportion.

We found power dissipation to be directly proportional to the average IPC of the program. The -O2, -O3 and -O4 levels have significantly higher IPC and hence higher average power. The optimization levels -O2, -O3 and -O4 include optimizations such as instruction scheduling, which are typically used to increase the parallelism in the code.

Out of the four individual optimizations we evaluated, we found unrolling to be a good optimization for energy reduction, but it increases power dissipation. Function inlining is good for both energy reduction and reducing power dissipation. Instruction scheduling was found to be a bad optimization to use when power is a concern. Simple schedulers did not affect the energy consumption, but aggressive schedulers, i.e., schedulers that increased register pressure and introduced spills, increased the energy consumption as well. For our future work, we would like to evaluate more individual optimizations and improve the ones that we find are currently unoptimized for power or energy.

References

[1] K. Asanovic. Energy-exposed instruction set architectures. Work In Progress Session, Sixth International Symposium on High Performance Computer Architecture, Jan 2000.

[2] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In 27th International Symposium on Computer Architecture, Jun 2000.

[3] J. Bunda, W. C. Athas, and D. Fussell. Evaluating power implication of CMOS microprocessor design decisions. In 1994 International Workshop on Low Power Design, April 1994.

[4] D. Burger and T. M. Austin. Evaluating future microprocessors: The SimpleScalar tool set. Technical report, Dept. of Comp. Sci., Univ. of Wisconsin, Madison, 1997.

[5] M. K. Gowan, L. L. Biro, and D. B. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In Design Automation Conference, pages 726-731, 1998.

[6] N. B. I. Hajj, C. Polychronopoulos, and G. Stamoulis. Architectural and compiler support for energy reduction in the memory hierarchy of high performance microprocessors. In ISLPED 98, pages 70-75, July 1998.

[7] M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and W. Ye. Influence of compiler optimizations on system power. In Design Automation Conference, July 2000.

[8] J. Kin, M. Gupta, and W. H. Mangione-Smith. The filter cache: An energy efficient memory structure. In 30th International Symposium on Microarchitecture, pages 184-193, Dec 1997.

[9] S. Manne, A. Klauser, and D. Grunwald. Pipeline gating: Speculation control for energy reduction. In 25th International Symposium on Computer Architecture, pages 1-10, Jun 1998.

[10] C.-L. Su and A. M. Despain. Cache designs for energy efficiency. In 28th Annual Hawaii International Conference on System Sciences, pages 306-315, 1995.

[11] C. L. Su, C. Y. Tsui, and A. M. Despain. Low power architecture design and compilation techniques for high-performance processors. In IEEE COMPCON, Feb. 1994.

[12] M. C. Toburen, T. M. Conte, and M. Reilly. Instruction scheduling for low power dissipation in high performance microprocessors. In Power-Driven Microarchitecture Workshop in Conjunction with ISCA 1998, Jun 1998.