Is Compiling for Performance == Compiling for Power?
Madhavi Valluri and Lizy John
Laboratory for Computer Architecture
Department of Electrical and Computer Engineering
The University of Texas at Austin
Abstract

Energy consumption and power dissipation are increasingly becoming important design constraints in high performance microprocessors. Compilers traditionally are not exposed to the energy details of the processor. However, with the increasing power/energy problem, it is important to evaluate how the existing compiler optimizations influence energy consumption and power dissipation in the processor. In this paper we present a quantitative study wherein we examine the effect of the standard optimization levels -O1 to -O4 of DEC Alpha's cc compiler on power and energy of the processor. We also evaluate the effect of four individual optimizations on power/energy and attempt to classify them as "low energy" or "low power" optimizations. In our experiments we find that optimizations that improve performance by reducing the number of instructions are optimized for energy. Such optimizations reduce the total amount of work done by the program. This is in contrast to optimizations that improve performance by increasing the overlap in the program during execution. The latter kind of optimizations increase the average power dissipated in the processor.

1 Introduction

Energy consumption and power dissipation are increasingly becoming important design constraints in high performance microprocessors. Power dissipation affects circuit reliability and packaging costs. Energy consumption directly affects battery life. With the increasing use of general purpose processors in the embedded world, designing low energy processors is important. Gowan et al. [5] discuss the power and energy trends of three generations of Alpha processors. Power dissipation increases significantly from one generation to the next despite the reduced supply voltages and advanced processor technologies. The paper shows the power in the Alpha 21264 increasing almost linearly with frequency, with power reaching 72 Watts at 600 MHz. The maximum power dissipated under worst-case conditions was found to be about 95 Watts. These examples clearly indicate that power dissipation and energy consumption will soon become important limiting factors in the design of high performance processors.

Until recently, the two problems were being dealt with only at the circuit level. Voltage scaling, low-swing buses, conditional clocking, etc. have helped alleviate the problems enormously. However, architectural-level and compiler-level analysis can help tackle these problems much earlier in the design cycle. Recently, several architectural and compiler techniques have been proposed to reduce power and energy [3, 6, 7, 8, 9, 10, 11, 12]. In our work we concentrate on the influence of compilers on power dissipation and energy consumption.

Compilers traditionally are not exposed to the energy details of the processor. Current compiler optimizations are tuned primarily for performance and occasionally for code size. With the increasing power/energy problem, it is important to evaluate how the existing optimizations influence energy consumption and power dissipation in the processor. An interesting question to answer would be: if we compile for performance, are we automatically compiling for low power or low energy? Current compilers already have two axes in the optimizations used, namely compiling for speed (in general-purpose processors) and compiling for code size (in embedded systems); do we need a third axis with optimizations that compile for power/energy?
To answer the above questions, we present a quantitative study wherein we examine the influence of a few state-of-the-art compiler optimizations on energy and power of the complete processor. We study the effect of the standard optimization levels -O1 to -O4 of DEC Alpha's cc compiler on power and energy of the processor. We also evaluate the effect of four individual optimizations on power/energy and attempt to classify them as "low energy optimizations" or "low power optimizations" or both. The optimizations we study are simple basic-block scheduling, loop unrolling, function inlining, and aggressive global scheduling. For our experiments, we use Wattch [2], an architectural simulator that estimates CPU energy consumption. Wattch integrates parameterizable power models into the Simplescalar [4] processor simulator.

In our study we find that the set of compiler optimizations that improve performance by reducing the number of instructions executed are optimized for both energy and power. This is in contrast to optimizations that improve performance by increasing the existing parallelism in the program. The latter kind of optimizations increase the average power dissipated in the processor. We find that optimizations such as common-subexpression elimination, copy propagation, and loop unrolling are very good for reducing energy, since they reduce the number of instructions in the program; hence the amount of total work done is less in programs with these optimizations. Such optimizations should definitely be included in the compile-for-power/energy switch. Optimizations such as instruction scheduling significantly increase power (and may occasionally increase energy) because they increase the overlap in programs without reducing the total number of instructions in the program. However, such optimizations can be easily modified to take power details into consideration and can be used to increase performance without increasing average power.

The rest of the paper is organized as follows. In Section 2, we discuss some previous work that has been done in the area of compilers and low power/energy. Section 3 shows a few examples that motivate the need for our study. We describe the different compiler optimizations evaluated in Section 4. In Section 5, we describe our experimental framework and discuss in detail the results obtained. Finally, we provide concluding remarks and future directions in Section 6.

2 Related Work

In this section we present some of the previous work done in understanding the interaction between the compiler and power/energy of the processor.

The study by Kandemir et al. [7] quantitatively examines the influence of different high-level compiler optimizations on system energy. However, in their study, they evaluate only loop-nest optimizations such as loop fusion, loop fission, blocking, tiling, scalar expansion and unrolling. In our paper, we discuss both the power dissipation and energy consumption details, while in the paper by [7] they report only energy details. Their main observation is that the optimizations appear to increase the energy consumed in the core while reducing the energy consumed in the memory system. Unoptimized codes consume more energy in the memory system.

There have been a few instruction scheduling techniques proposed which attempt to reduce the power dissipated in the processor. Su et al. [11] proposed cold scheduling, wherein they assign priority to instructions based on some pre-determined power cost and use a generic list scheduler to schedule the instructions. The power cost of scheduling an instruction depends on the instruction it is being scheduled after. This corresponds to the switching activity on the control path. Toburen et al. [12] propose another power-aware scheduler which schedules as many instructions as possible in a given cycle until the energy threshold of that cycle is reached. Once that precomputed threshold is reached, scheduling proceeds to the next time-step or cycle. In our work, by evaluating several state-of-the-art optimizations, we attempt to identify other optimizations besides instruction scheduling that can be improved if the power/energy models of the processor were exposed to them.
Significant work has been done in reducing energy consumption in the memory. Most techniques achieve a reduction in energy through innovative architectural techniques [6, 8, 9, 10]. Some of the works that include compiler involvement are [10] and [6]. In [6], the authors suggest the use of an L-cache. An L-cache is a small cache which is placed between the I-cache and the CPU. The L-cache is very small (it holds a few basic blocks), hence consumes less energy. The compiler is used to select good basic blocks to place in the L-cache. Another approach to reduce memory energy is Gray code addressing [10]. This form of addressing reduces the bit switching activity in the instruction address path. Bunda et al. [3] and Asanovic [1] investigated the effect of energy-aware instruction sets. These techniques would involve the compiler even earlier in the code generation process. The paper by Bunda et al. [3] concentrates on reducing memory energy, and Asanovic [1] investigates new instructions to reduce energy in the memory, register files and pipeline stages.

3 Motivating Examples

Consider the data dependence graph (DDG) shown in Figure 1(a). It contains six operations. All operations except op E have a latency of one cycle; op E takes two cycles to complete. We will assume there are infinite functional units for this example. An instruction scheduler that attempts to also optimize for registers would schedule op E as close to op F as possible. The resulting schedule can be seen in Figure 1(b). If we assume that each operation consumes one unit of power, the schedule in Figure 1(c) dissipates less peak power than the one in Figure 1(b) (2 units vs. 3 units). Figure 1(c) is also a valid schedule. By extending the lifetime of op E by one cycle, we reduce the peak power dissipated without affecting performance. The design choice of letting op E occupy the register for one cycle longer than required will prove to be inexpensive only if there is a sufficient number of registers. Current schedulers do not take power details into consideration and hence might schedule op E in cycle 2 even if there are sufficient registers. This example shows that two variations of the same code can have the same performance but different power requirements.

Another good candidate for reducing energy without increasing power would be function inlining. Function inlining is done in cases where the callee procedure body is small. In these cases, the code required for the calling sequences outweighs the code in the procedure body. If this procedure is called many times, inlining can save a tremendous number of instructions. Function inlining does not increase the overlap the way instruction scheduling does; hence this optimization keeps energy low and holds the power constant. This optimization can be a good candidate to use in the "compile for power/energy" switch.

These examples show that compilers can be optimized to produce code for low power or low energy, without sacrificing performance. In this study we hope to expose the current void in the area of power/energy-aware compilers and attempt to identify good candidates for further improvement.

4 Compiler Optimizations

In our study we evaluate the influence of compiler optimizations on processor power/energy using the native C compiler cc on a DEC Alpha 21064 running the OSF1 operating system. We also used the gcc compiler to study the effect of a few individual optimizations. The details of both compilers and their different options are presented in the following subsections.

4.1 Standard Optimization Levels on cc and gcc

The different levels in the cc compiler, along with the optimizations performed at each level, are described below.

-O0 No optimizations performed. At this level, the compiler's goal is to reduce the cost of compilation. Only variables declared register are allocated in registers.

-O1 Many local and global optimizations are performed. These include recognition and elimination of common subexpressions, copy propagation, induction variable elimination, code motion, test replacement, split lifetime analysis, and some minimal code scheduling.

-O2 This level does inline expansion of static procedures. Additional global optimizations that improve speed (at the cost of extra code size), such as integer multiplication and division expansion (using shifts), loop unrolling, and code replication to eliminate branches, are also performed. Loop unrolling and elimination of branch instructions increase the size of the basic blocks. This helps the hardware exploit instruction level parallelism (ILP) in the program.

-O3 Includes all -O2 optimizations and also does inline expansion of global procedures.
Figure 1: Motivating example. (a) Example DDG; (b) schedule with peak power = 3, energy = 6; (c) schedule with peak power = 2, energy = 6.
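The arithmetic behind Figure 1 can be checked with a few lines of C. This is a toy model following the text's assumption of one unit of power per operation; the per-cycle operation counts are read off the figure:

```c
#include <assert.h>

/* Operations issued per cycle for the two schedules of Figure 1. */
static const int sched_b[4] = {1, 3, 1, 1}; /* A | B C E | D | F */
static const int sched_c[4] = {2, 2, 1, 1}; /* E A | B C | D | F */

/* Energy: total work done, i.e., total operations executed. */
static int energy(const int *ops, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += ops[i];
    return total;
}

/* Peak power: the most operations executed in any single cycle. */
static int peak_power(const int *ops, int n) {
    int peak = 0;
    for (int i = 0; i < n; i++)
        if (ops[i] > peak)
            peak = ops[i];
    return peak;
}
```

Both schedules do the same total work (energy 6), but schedule (c) spreads it out, lowering the peak from 3 to 2 at no cost in cycles.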
-O4 Software pipelining, an aggressive instruction scheduling technique used to exploit ILP in loops, is performed using dependency analysis. Vectorization of some loops on 8-bit and 16-bit data is also done. This level also invokes a scheduling pass which inserts NOP instructions to improve the scheduling.

We use the FORTRAN g77 compiler to compile the SpecFP benchmarks. g77 is a program that calls gcc with options to recognize programs written in Fortran. The standard optimization levels offered by gcc are listed below:

-O0 No optimizations performed.

-O1 This level is very similar to -O1 in cc. Optimizations performed are common subexpression elimination, combining instructions through substitution (copy propagation), dead-store elimination, loop strength reduction and minimal scheduling.

-O2 Nearly all supported optimizations that do not involve a space-speed tradeoff are performed. Loop unrolling and function inlining are not done, for example. This level also includes an aggressive instruction scheduling pass.

-O3 This turns on everything that -O2 does, along with inlining of procedures.

We note that in both cc and gcc, the optimizations that increase the ILP in a program are in optimization levels -O2, -O3 and -O4 (-O4 only in cc). The different levels include almost the same optimizations in both compilers. We use both cc and gcc in our work: cc wherever possible, and gcc wherever specific hooks to control individual optimizations are required.

4.2 Individual Optimizations

We analyze the impact of four different individual optimizations provided by gcc. We chose gcc because it provides more distinct individual optimizations than cc to choose from. All the individual optimizations are applied on top of the optimizations performed at -O1. The individual optimizations chosen are:

-fschedule-insns This optimization attempts to reorder instructions to eliminate execution stalls that occur due to unavailability of required data. It helps machines that have slow floating point or memory load instructions by allowing other instructions to be issued until the result of the load or floating point instruction is required. The scheduler used is a basic-block list scheduler, and it is run after local register allocation has been performed.

-fschedule-insns2 Similar to -fschedule-insns, but requests an additional pass of instruction scheduling after register allocation has been done. This pass does aggressive global scheduling before and after global register allocation. Postpass scheduling (when scheduling is done after register allocation) minimizes the pipeline stalls due to the spill instructions introduced by register allocation.

-finline-functions Integrates all simple functions into their callers. The compiler heuristically decides which functions are simple enough to be worth integrating in this way.

-funroll-loops Performs loop unrolling. This is done only for loops whose number of iterations can be determined at compile time or run time.

5 Experimental Results

In this section we first describe the Wattch simulator and our benchmarks. We then present a detailed analysis of our results.

5.1 Wattch 1.0 and Benchmarks

We use the Wattch 1.0 simulator [2] for our experimentation. Wattch is an architectural simulator that estimates CPU energy consumption. The power/energy estimates are based on a suite of parameterizable power models for various hardware structures in the processor and on the resource usage counts. The power models are interfaced with Simplescalar [4]. sim-outorder, Simplescalar's out-of-order issue simulator, has been modified to keep track of which unit is being accessed in each cycle and record the total energy consumed for an application.

Wattch has three different options for clock gating to disable unused resources in the processor. The simplest clocking style assumes that the full modeled power will be consumed if any accesses occur in a given cycle, and zero otherwise. This is ideal clock gating. The second possibility assumes that if only a portion of a unit's ports is accessed, the power is scaled linearly according to the number of ports being used. In the third clock gating scheme, power is scaled linearly with port or unit usage, but unused units dissipate 10% of their maximum power. This corresponds to the static power dissipated when there is no activity in the unit. We chose power and energy results corresponding to the third scheme since it is the most realistic of the three. We used the default configuration in sim-outorder for our study, but changed the RUU (Register Update Unit) size from 16 to 32 and the LSQ (Load Store Queue) size from 8 to 16. The functional unit latencies exactly match the functional unit latencies in the Alpha 21064 processor. We use the process parameters for a .35um process at 600MHz.

We chose six different benchmarks for our study: three SpecInt95 benchmarks, namely compress, go and li; two SpecFP95 benchmarks, su2cor and swim; and saxpy, a toy benchmark.

5.2 Results

In the following subsections we present a detailed analysis of the results obtained. We first discuss the influence of standard optimizations on energy and power, following which we study the effects of individual optimizations.
5.2.1 Influence of Standard Optimizations on Energy

Table 1 shows the results obtained when the benchmarks are compiled with the different standard optimization levels. We present the results of all optimizations relative to the result of optimization level -O0. For example, when we consider the number of instructions, the percentage of instructions executed by a benchmark optimized with option -O2 is given by:

  % of Insts Executed by Program(O2) = [# of Insts Executed by Program(O2) / # of Insts Executed by Program(O0)] x 100

For example, in Table 1, we see that compress when compiled with -O2 executed 17.96% fewer instructions than compress when compiled with -O0. Our results are presented in this form for all benchmarks and for all optimizations. As mentioned in Section 4, we used cc to compile the SpecInt benchmarks and saxpy, and g77 to compile the SpecFP benchmarks su2cor and swim.

We observe that the number of instructions committed drops drastically from optimization -O0 to -O1, and also drops significantly in codes optimized with -O2 and -O3. There is however a very marginal increase in the number of instructions in compress. In codes optimized with the -O4 option, the number of instructions increases due to the extra NOP code generated for scheduling.

The reduction in the number of instructions directly influences execution time, or performance. The performance improvement is significant in -O1 when compared to -O0, sometimes as high as 73% (swim). -O2 and -O3 also lead to significant improvement over -O1; for example, we see an 8% improvement in li with the -O2 optimization. In some benchmarks like saxpy the improvement is only about 0.6%. Optimizations -O2 and -O3 improve performance in compress even though the number of instructions increases.

The energy consumed by the code is again directly proportional to the number of instructions. Here we see that even though -O2 and -O3 improve performance in compress, the energy consumed is higher. This is because of the higher number of instructions; the amount of work done is more. In all the benchmarks, we see that the energy decreases when the number of instructions decreases. Hence, if we are compiling for energy, we should choose optimizations such as common subexpression elimination, induction variable elimination and unrolling that reduce the number of instructions executed. Optimizations such as the ones in -O4 (inserting NOPs to improve scheduling) may improve performance, but can also increase the number of instructions, leading to higher energy requirements. The energy increase is seen to be up to 4% (in compress).

5.2.2 Influence of Standard Optimizations on Power

To study the influence of compiler optimizations on power, we again refer to Table 1. We see that though the number of instructions and the number of cycles taken reduce at higher optimization levels, the number of instructions does not reduce enough to keep the instructions per cycle (IPC) constant. IPC reduces in -O1 codes but increases in -O2, -O3 and -O4 codes. IPC in -O0 is low because of the poor quality of code produced. Since optimizations such as common subexpression elimination improve code by reducing instructions rather than increasing available parallelism, IPC does not increase in -O1 codes. Most optimizations that increase IPC, such as instruction scheduling, loop unrolling etc., are included in the -O2, -O3 and -O4 levels. Power dissipated is the amount of work done in one cycle. This is directly proportional to the IPC. Hence, we see that optimizations that increase IPC increase the power dissipated. Instruction scheduling and other -O2, -O3 optimizations are good for performance improvement but are bad when instantaneous power is the main concern.

5.2.3 Influence of Individual Optimizations on Energy and Power

We refer to Tables 2 to 7 for experiments on how the different individual optimizations affect power/energy. We show the results for each benchmark separately. The tables show the performance, power and energy of each of the optimizations relative to the performance, power and energy of code compiled with -O0 (similar to Table 1). Since the individual optimizations are applied over the -O1 option, in our discussions we always compare results of the optimizations with the results of -O1. We first discuss the effects of the instruction scheduling options.

The -fschedule-insns optimization does simple basic-block list scheduling and -fschedule-insns2 does aggressive global scheduling. We expect both options
Table 1: Effects of Standard Optimizations on Power/Energy
Benchmark opt level Energy Exec Time Insts Avg Power IPC
O0 100.00 100.00 100.00 100.00 100.00
O1 74.48 81.55 81.52 91.33 99.96
compress O2 75.13 81.44 82.04 92.25 100.73
O3 75.13 81.44 82.04 92.25 100.73
O4 79.01 82.77 86.11 95.45 104.03
O0 100.00 100.00 100.00 100.00 100.00
O1 66.20 64.13 68.94 103.23 107.50
go O2 62.62 61.31 63.01 102.14 102.78
O3 62.62 61.31 63.01 102.14 102.78
O4 63.67 62.19 63.75 102.38 102.51
O0 100.00 100.00 100.00 100.00 100.00
O1 81.32 83.66 83.18 97.20 99.42
li O2 79.60 75.97 82.97 104.78 109.21
O3 79.60 75.97 82.97 104.78 109.21
O4 85.71 77.89 90.96 110.05 116.78
O0 100.00 100.00 100.00 100.00 100.00
O1 97.38 100.24 92.49 97.15 92.27
saxpy O2 97.69 99.38 92.49 98.30 93.07
O3 97.69 99.38 92.49 98.30 93.07
O4 98.31 99.27 92.84 99.02 93.51
O0 100.00 100.00 100.00 100.00 100.00
O1 42.09 51.04 33.21 82.46 65.06
su2cor O2 40.99 47.52 33.10 86.28 69.67
O3 40.99 46.37 33.10 87.65 71.38
O0 100.00 100.00 100.00 100.00 100.00
O1 30.10 36.64 20.01 82.15 54.63
swim O2 28.93 34.01 19.05 85.06 56.01
O3 28.93 34.01 19.05 85.06 56.01
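The columns of Table 1 are internally consistent, which makes a quick sanity check possible (this check is ours, not part of the original methodology): since energy is average power integrated over execution time, the normalized figures should satisfy Energy ≈ Avg Power × Exec Time / 100, and likewise IPC ≈ Insts / Exec Time × 100. The compress -O1 row can be verified in a few lines of C:

```c
#include <assert.h>
#include <math.h>

/* Normalized values from the compress -O1 row of Table 1 (-O0 = 100). */
static const double energy_o1 = 74.48;
static const double time_o1   = 81.55;
static const double insts_o1  = 81.52;
static const double power_o1  = 91.33;
static const double ipc_o1    = 99.96;

/* Energy = avg power x time, so normalized energy ~= power * time / 100. */
static double predicted_energy(double power, double time) {
    return power * time / 100.0;
}

/* IPC = instructions / cycles, so normalized IPC ~= insts / time * 100. */
static double predicted_ipc(double insts, double time) {
    return insts / time * 100.0;
}
```

Both predictions land within rounding error of the tabulated values, confirming that the Avg Power and IPC columns are derived from the energy, time and instruction counts rather than measured independently.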
Table 2: Individual Optimizations on Compress
opt level Energy Exec Time Insts Power IPC
O0 100.0 100.0 100.0 100.0 100.0
O1 67.66 74.68 60.46 90.60 80.95
inline-func 67.69 74.68 60.46 90.63 80.95
sched-instr2 68.82 74.94 63.21 91.82 84.35
sched-instr 66.66 73.47 59.83 90.72 81.43
unroll-loops 66.84 74.19 59.90 90.09 80.74
Table 3: Individual Optimizations on li
opt level Energy Exec Time Insts Power IPC
O0 100.00 100.00 100.00 100.00 100.00
O1 70.91 74.67 66.18 94.96 88.63
inline-func 71.02 73.14 68.00 97.11 92.97
sched-instr2 69.56 66.65 68.33 104.36 102.52
sched-instr 69.56 66.65 68.33 104.36 102.52
unroll-loops 66.05 59.91 68.19 110.24 113.81
to increase the IPC and hence the power. We can see from the tables that IPC goes up in most benchmarks, in some benchmarks by up to 4.6% (in su2cor). The power increase is up to 3.9%. In li, the power increases by as much as 10%. The aggressive scheduler (prepass scheduler) increases register pressure
Table 4: Individual Optimizations on saxpy
opt level Energy Exec Time Insts Power IPC
O0 100.00 100.00 100.00 100.00 100.00
O1 96.78 98.56 96.21 98.19 97.61
inline-func 96.78 98.56 96.21 98.19 97.61
sched-instr2 97.07 97.14 96.27 99.93 99.11
sched-instr 96.79 98.52 96.15 98.24 97.60
unroll-loops 96.87 98.72 95.97 98.13 97.21
Table 5: Individual Optimizations on su2cor
opt level Energy Exec Time Insts Power IPC
O0 100.00 100.00 100.00 100.00 100.00
O1 42.09 51.04 33.21 82.47 65.07
inline-func 42.06 51.01 33.21 82.46 65.11
sched-instr2 42.49 50.36 34.02 84.38 67.55
sched-instr 40.90 47.79 33.30 85.58 69.67
unroll-loops 40.17 48.35 31.17 83.08 64.46
Table 6: Individual Optimizations on swim
opt level Energy Exec Time Insts Power IPC
O0 100.00 100.00 100.00 100.00 100.00
O1 30.06 36.64 20.02 82.02 54.64
inline-func 30.06 36.64 20.02 82.02 54.64
sched-instr2 30.91 36.39 20.53 84.92 56.41
sched-instr 29.83 35.11 20.32 84.95 57.86
unroll-loops 29.29 35.38 18.19 82.80 51.43
Table 7: Individual Optimizations on go
opt level Energy Exec Time Insts Power IPC
O0 100.00 100.00 100.00 100.00 100.00
O1 40.97 42.75 42.65 95.83 99.77
inline-func 40.92 42.78 42.58 95.64 99.54
sched-instr2 43.07 44.01 45.25 97.87 102.82
sched-instr 43.52 44.89 46.52 96.96 103.63
unroll-loops 39.38 41.95 39.30 93.88 93.69
and hence causes a significant number of spills, thereby increasing the total number of instructions executed and the total energy. The increases in the number of instructions and in energy are up to 3.52% and 2.14% respectively. This optimization needs to be improved upon if power and energy are a concern. We would see a greater impact of these optimizations if the target processor were an in-order machine, wherein the compiler is fully responsible for exposing the parallelism. In an out-of-order issue machine, the hardware can find the parallelism even if the compiler does not do any reordering. The reason why we see some improvement in performance (and an increase in IPC) is that the hardware is limited by the instruction window size; the global scheduler, which has the full program as its scope, helps the hardware see more instructions than it otherwise would have.

We next discuss the impact of unrolling. Unrolling appears to be a good optimization to use for energy because the number of instructions reduces significantly. We are able to reduce the number of instructions by 3.35% in go; the energy falls by 1%. We see that in some benchmarks the energy falls by 5% (li). However, reducing the energy does not necessarily reduce power. For instance, in li, the power goes up by 10%. Unrolling increases the size of the basic block, hence allowing the hardware to increase the overlap of instructions. This leads to an increase in the number of simultaneous operations being executed. It may be noted that the IPC in li increases by 25%. However, this observation is not consistent among all the benchmarks; in many benchmarks, there is no increase in IPC. This is because the target architecture has a good branch predictor and does unrolling in hardware, reducing the impact of software unrolling. We are currently investigating how the unrolling optimization affects power if we turn off the branch prediction hardware. We expect to see a significant increase in IPC and power in the codes after unrolling has been applied.
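The transformation itself can be sketched as follows (a hand-written illustration in our own minimal version of the saxpy kernel, not the benchmark's actual source or the code the compiler emits): unrolling by four replaces four compare-and-branch sequences with one, shrinking the dynamic instruction count while enlarging the basic block whose operations the hardware can overlap.

```c
/* saxpy with the loop unrolled by 4: one loop test and branch is
   executed per four elements instead of one per element, so the
   dynamic instruction count drops and the basic block grows. */
void saxpy_unrolled(int n, float a, const float *x, float *y) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* unrolled body */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)             /* leftover iterations */
        y[i] += a * x[i];
}
```

The enlarged body is exactly what lets the hardware raise the overlap (and hence IPC and power) noted above.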
Our next optimization is inlining of function calls. Inlining, as explained in the motivation section, will reduce the number of instructions and hence energy. However, in our benchmarks, only go and su2cor show a very marginal decrease in energy. In our future work, we will investigate further with a better set of benchmarks more suited to this optimization.
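The effect inlining targets can be seen in a small sketch (our own toy example; the names scale and sum_scaled are hypothetical, not benchmark code): substituting the one-line body at the call site removes the per-call argument setup, call and return instructions without adding any overlap.

```c
/* A small callee: at each call site the calling sequence (argument
   setup, call, return) can outweigh this one-line body. */
static int scale(int x) { return 2 * x; }

/* One call per element: call overhead is paid n times. */
static int sum_scaled(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += scale(v[i]);
    return s;
}

/* What the compiler effectively produces after inlining: the body of
   scale() is substituted directly and the call overhead disappears. */
static int sum_scaled_inlined(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += 2 * v[i];
    return s;
}
```

Both versions compute the same result; only the dynamic instruction count differs, which is why inlining lowers energy while holding power roughly constant.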
6 Conclusions

In this paper we evaluated the impact of using the different levels of optimizations in the cc compiler on system power and energy. We also evaluated the effect of a few individual optimizations. We found that energy consumption reduces when the optimizations reduce the number of instructions executed by the program, i.e., when the amount of work done is less. The standard optimization level -O1 reduces the number of instructions drastically as compared to -O0 because it invokes optimizations such as common subexpression elimination, an optimization used to eliminate redundant computations in the program. The drop is not as significant in the -O2, -O3 and -O4 optimizations. The energy also drops in the same proportion.

We found power dissipation to be directly proportional to the average IPC of the program. The -O2, -O3 and -O4 levels have significantly higher IPC and hence higher average power. The optimization levels -O2, -O3 and -O4 include optimizations such as instruction scheduling, which are typically used to increase the parallelism in the code.

Out of the four individual optimizations we evaluated, we found unrolling to be a good optimization for energy reduction, but it increases power dissipation. Function inlining is good both for energy reduction and for reducing power dissipation. Instruction scheduling was found to be a bad optimization to use when power is a concern. Simple schedulers did not affect the energy consumption, but aggressive schedulers, i.e., schedulers that increased register pressure and introduced spills, increased the energy consumption as well. For our future work, we would like to evaluate more individual optimizations and improve the ones that we find are currently unoptimized for power or energy.

References

[1] K. Asanovic. Energy-exposed instruction set architectures. Work In Progress Session, Sixth International Symposium on High Performance Computer Architecture, Jan 2000.

[2] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In 27th International Symposium on Computer Architecture, Jun 2000.

[3] J. Bunda, W. C. Athas, and D. Fussell. Evaluating power implications of CMOS microprocessor design decisions. In 1994 International Workshop on Low Power Design, April 1994.

[4] D. Burger and T. M. Austin. Evaluating future microprocessors: The SimpleScalar tool set. Technical report, Dept. of Comp. Sci., Univ. of Wisconsin, Madison, 1997.

[5] M. K. Gowan, L. L. Biro, and D. B. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In Design Automation Conference, pages 726-731, 1998.

[6] N. Bellas, I. Hajj, C. Polychronopoulos, and G. Stamoulis. Architectural and compiler support for energy reduction in the memory hierarchy of high performance microprocessors. In ISLPED '98, pages 70-75, July 1998.
[7] M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and W. Ye. Influence of compiler optimizations on system power. In Design Automation Conference, July 2000.

[8] J. Kin, M. Gupta, and W. H. Mangione-Smith. The filter cache: An energy efficient memory structure. In 30th International Symposium on Microarchitecture, pages 184-193, Dec 1997.

[9] S. Manne, A. Klauser, and D. Grunwald. Pipeline gating: Speculation control for energy reduction. In 25th International Symposium on Computer Architecture, pages 1-10, Jun 1998.

[10] C.-L. Su and A. M. Despain. Cache designs for energy efficiency. In 28th Annual Hawaii International Conference on System Sciences, pages 306-315, 1995.

[11] C. L. Su, C. Y. Tsui, and A. M. Despain. Low power architecture design and compilation techniques for high-performance processors. In IEEE COMPCON, Feb. 1994.

[12] M. C. Toburen, T. M. Conte, and M. Reilly. Instruction scheduling for low power dissipation in high performance microprocessors. In Power-Driven Microarchitecture Workshop in conjunction with ISCA 1998, Jun 1998.