W&M ScholarWorks
Dissertations, Theses, and Masters Projects
2003
Power considerations for memory-related microarchitecture designs
Zhichun Zhu College of William & Mary - Arts & Sciences
Follow this and additional works at: https://scholarworks.wm.edu/etd
Part of the Computer Sciences Commons
Recommended Citation Zhu, Zhichun, "Power considerations for memory-related microarchitecture designs" (2003). Dissertations, Theses, and Masters Projects. Paper 1539623427. https://dx.doi.org/doi:10.21220/s2-az56-s960
This Dissertation is brought to you for free and open access by the Theses, Dissertations, & Master Projects at W&M ScholarWorks. It has been accepted for inclusion in Dissertations, Theses, and Masters Projects by an authorized administrator of W&M ScholarWorks. For more information, please contact [email protected].
Power Considerations for Memory-related Microarchitecture Designs
A Dissertation
Presented to
The Faculty of the Department of Computer Science
The College of William & Mary in Virginia
In Partial Fulfillment
Of the Requirements for the Degree of
Doctor of Philosophy
by
Zhichun Zhu
2003
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
APPROVAL SHEET
This dissertation is submitted in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy
Zhichun Zhu
Approved, July 2003
Xiaodong Zhang, Thesis Advisor
Phil Kearns
Bruce Lowekamp
Dimitrios Nikolopoulos
Jun Yang, University of California, Riverside
Table of Contents
Acknowledgments vii
List of Tables viii
List of Figures x
Abstract xiii
1 Introduction 2
1.1 Importance of Low-power Architecture Designs ...... 3
1.2 Our Work ...... 4
2 Background 8
2.1 Sources of Power Consumption ...... 8
2.2 Low-power Techniques ...... 9
2.2.1 Power-saving Techniques at Physical, Circuit and Logic Levels ...... 10
2.2.2 Power-saving Techniques at Architectural Level ...... 11
2.2.3 Power-saving Techniques at Software Level ...... 12
3 Evaluation Environment 15
3.1 Evaluation Metrics ...... 15
3.1.1 Performance Metrics ...... 15
3.1.2 Power and Energy Metrics ...... 16
3.1.3 Energy-efficiency Metrics ...... 17
3.2 Performance Evaluation Tools ...... 18
3.3 Energy Consumption Evaluation Tools ...... 21
3.4 Workloads ...... 22
4 Access Mode Predictions for Low-power Cache Design 25
4.1 Significance of Low-power Cache Design ...... 25
4.2 Related Work ...... 27
4.3 Comparisons between Phased and Way-prediction Caches ...... 28
4.4 AMP Caches ...... 31
4.4.1 Strategy ...... 31
4.4.2 Power Consumption ...... 33
4.5 Access Mode Predictors ...... 34
4.5.1 Designs ...... 34
4.5.2 Accuracy ...... 37
4.5.3 Overhead ...... 38
4.6 Multicolumn-based Way-prediction ...... 39
4.6.1 Limitation of MRU-based Way-prediction ...... 39
4.6.2 Multicolumn Cache ...... 42
4.6.3 Power Considerations for Multicolumn Cache ...... 45
4.7 Experimental Environment ...... 49
4.8 Experimental Results ...... 50
4.8.1 Comparisons of Multicolumn and MRU Caches ...... 50
4.8.2 Energy Reduction of AMP Caches ...... 53
4.8.3 Energy-efficiency of AMP Caches ...... 59
4.9 Summary ...... 62
5 Look-ahead Adaptation Techniques to Reduce Processor Power Consumption 64
5.1 Motivation ...... 64
5.2 Load Indicator Scheme ...... 68
5.2.1 Power Saving Opportunity ...... 68
5.2.2 Load Indicator ...... 70
5.3 Considerations for Load Indicator Scheme ...... 75
5.4 Experimental Environment ...... 78
5.4.1 Power Savings ...... 80
5.5 Effectiveness of Load Indicator Scheme ...... 82
5.6 Combining Local and Look-ahead Optimizations ...... 85
5.6.1 Motivation ...... 85
5.6.2 Load-instruction Indicator ...... 88
5.7 Comparisons between Different Schemes ...... 90
5.7.1 Power Savings ...... 90
5.7.2 Performance Impact ...... 92
5.7.3 Energy Reduction ...... 93
5.7.4 Different Configurations ...... 94
5.7.5 Average Intervals in Each Power Mode ...... 96
5.8 Load-register Indicator Scheme ...... 97
5.9 Effectiveness of Load-register Scheme ...... 100
5.9.1 Power Savings ...... 100
5.9.2 Energy Savings ...... 102
5.10 Related Work ...... 103
5.11 Summary ...... 105
6 Conclusion and Future Work 107
6.1 Conclusion ...... 107
6.2 Future Work ...... 109
Bibliography 112
ACKNOWLEDGMENTS
Pursuing this degree has been a long journey for me. I have learned a lot during this journey, not only from my studies and research work, but also from many people, including my adviser, my professors, and my fellow students. I would like to thank all of them for their help.

First of all, I want to thank my adviser, Dr. Xiaodong Zhang, for his mentoring during these five years. He created an excellent research environment for his students through his guidance, hard work, determination, high standards, and vast knowledge. The Department of Computer Science at the College of William and Mary has been a wonderful place for me to pursue my doctoral degree. I would like to thank all the faculty and staff in this department for their help over these years. I would like to acknowledge Dr. Phil Kearns, Dr. Bruce Lowekamp, Dr. Dimitrios Nikolopoulos, and Dr. Robert Noonan for serving on the dissertation committee. I would also like to thank Dr. William Bynum, Dr. Evgenia Smirni, and Dr. Andreas Stathopoulos for serving on the proposal committee. Especially, I thank Dr. William Bynum for reviewing many of my manuscripts. I want to thank Dr. Evgenia Smirni, Dr. Andreas Stathopoulos, and Dr. Dimitrios Nikolopoulos for their great help and advice in my career development. I would also like to acknowledge Vanessa Godwin for the great help she has offered me during my graduate study.

I would like to thank the members of the High Performance Computing and Software Lab and other graduate students, Songqing Chen, Xin Chen, Lei Guo, Song Jiang, and Qi Zhang, who helped create an environment full of stimulation. I also want to thank Dr. Xing Du for his great help in my research. I want to thank Dr. Jun Yang of the University of California, Riverside, for taking time from her busy schedule to serve as the external member of the committee. I would also like to acknowledge Dr. Yiming Hu of the University of Cincinnati for his kind help in my career development.
I would also like to acknowledge the funding agencies that provided the funds and equipment that supported my research: Sun Microsystems, for their donations of experimental equipment, and the National Science Foundation and the Air Force Office of Scientific Research, whose grants funded the majority of my graduate research.

I want to thank all my family members; I cannot imagine how I could have gone through this without their support. My special thanks go to my mother for encouraging me during difficult times, for taking care of my life, and for all she has done for me. Last, but definitely not least, I want to thank my husband for his care, his support, his encouragement, his tolerance, and his confidence in me. I am so lucky to have him with me through all the good and hard times.
List of Tables
3.1 Descriptions of SPEC2000 benchmarks [66] ...... 24
4.1 Comparisons of sizes and misprediction rates of different access mode predictors. The separate 64 KB instruction cache (iL1) and data cache (dL1) are 8-way with a block size of 64 B. The unified L2 cache (uL2) is 8-way with a block size of 128 B. The misprediction rates are averages over all the SPEC2000 programs ...... 37
4.2 The overall hit rates and first hit rates of multi-column and MRU structures for the L1 instruction cache. The cache is 64 KB with a block size of 64 B ...... 52
4.3 The overall hit rates and first hit rates of multi-column and MRU structures for the L1 data cache. The cache is 64 KB with a block size of 64 B ...... 53
4.4 The overall hit rates and first hit rates of multi-column and MRU structures for the L2 cache. The cache is 4 MB with a block size of 128 B ...... 54
4.5 Access latencies for way-prediction hit/miss and phased access on 64 KB 4-way L1 and 4 MB 8-way L2 caches ...... 60
5.1 The distribution of high demanding, low demanding, and idle windows for
program swim...... 74
5.2 Experimental parameters ...... 79
5.3 Percentage of execution time in low power mode, and percentage of power
reduction on the issue logic and execution units by the load indicator scheme
under the system with 1 GHz processor...... 82
5.4 Parameters for the instruction indicator. EC and DC are the conditions for enabling and disabling the low power mode. I_IPC and FP_IPC are the number of integer instructions issued per cycle and the number of floating-point instructions issued per cycle, respectively ...... 85
5.5 The conditions for enabling and disabling the low power mode under different schemes. LD is the number of outstanding load misses. I_reg and F_reg are the numbers of free integer and floating-point registers. TE_xx and TD_xx are thresholds for enabling and disabling the low power mode, respectively ...... 99
5.6 The conditions for enabling and disabling the low power mode by the load-register indicator scheme. LD is the number of outstanding load misses. I_reg and F_reg are the numbers of free integer and floating-point registers, respectively ...... 103
List of Figures
4.1 The access patterns for (a) a conventional n-way set-associative cache; (b)
a phased n-way set-associative cache; and (c) a way-prediction n-way set-
associative cache ...... 26
4.2 Cache access latencies of the tag path, data path, sequential access, and parallel access on a 16 KB 4-way set-associative cache with a block size of 32 bytes. The values are estimated using the CACTI model [58] ...... 29
4.3 The hit rates for SPEC2000 benchmarks on an eight-way 64 KB data cache with a block size of 64 bytes and an eight-way 4 MB unified L2 cache with a block size of 128 bytes ...... 30
4.4 The access pattern of n-way AMP cache...... 32
4.5 Way-prediction strategy for MRU cache ...... 41
4.6 Overall cache hit rate and first hit rate of a 64 KB MRU data cache for
program 172.mgrid ...... 42
4.7 Way-prediction strategies for multicolumn cache ...... 44
4.8 Way-prediction strategy for multicolumn cache without swapping ...... 45
4.9 Comparisons of the first hit rates among the multicolumn and MRU caches. 51
4.10 Energy reduction of multicolumn caches compared with MRU caches ...... 55
4.11 Energy consumption of multicolumn cache, phased cache, and AMP cache.
The system has four-way 64 KB instruction and data caches and eight-way
4 MB L2 cache...... 56
4.12 Decomposition of energy consumed by instruction, data, and L2 caches. . . 57
4.13 Energy consumed by data and L2 caches ...... 58
4.14 CPI and E-D product reductions of the AMP cache compared with the multicolumn cache ...... 61
4.15 CPI and E-D product reductions of the AMP cache compared with the phased
cache...... 62
5.1 Comparisons of the optimization based on current system status and the
look-ahead optimization ...... 66
5.2 Sampled IPC values and number of outstanding loads during an arbitrarily
selected 1024-cycle interval for program swim. “W64” and “W32” correspond
to the sample window sizes of 64 cycles and 32 cycles, respectively...... 72
5.3 CDF of arrival interval of miss streams ...... 76
5.4 Distribution of the number of concurrent accesses ...... 77
5.5 IPC values under the system with 1 GHz processors ...... 83
5.6 Percentage of total execution time in low power mode under our scheme
(load) and instruction indicator (instruction) ...... 86
5.7 Percentage of power reduction ...... 91
5.8 Normalized IPC values ...... 92
5.9 Percentage of energy reduction ...... 93
5.10 Power reduction under systems with processor speed scaling from 466 MHz
to 1 GHz and 2 GHz ...... 95
5.11 Average intervals in the low power mode and the normal execution mode
under different schemes ...... 96
5.12 Percentage of power reduction under the load indicator, load-instruction indicator, and load-register indicator schemes on systems with 1 GHz processors ...... 101
5.13 Percentage of total execution time in the low power mode under the load-
register indicator scheme. The load and register portions correspond to the
low power execution periods triggered only by the load indicator and the
register indicator, respectively. The overlap portion corresponds to those
caused by both triggers ...... 102
5.14 Percentage of energy reduction under the load indicator, load-instruction
indicator, and load-register indicator schemes under systems with 1 GHz
processors ...... 104
ABSTRACT
The fast performance improvement of computer systems in the last decade has come with a consistent increase in power consumption. In recent years, power dissipation has become a design constraint even for high-performance systems. Higher power dissipation means higher packaging and cooling costs, and lower reliability. This Ph.D. dissertation investigates several memory-related design and optimization issues of general-purpose computer microarchitectures, aiming to reduce power consumption without sacrificing performance. The memory system consumes a large percentage of the system's power. In addition, its behavior affects the processor's power consumption significantly. In this dissertation, we propose two schemes to address the power-aware architecture issues related to memory:
• We develop and evaluate low-power techniques for high-associativity caches. By dynamically applying different access modes for cache hits and misses, our proposed cache structure can achieve nearly the lowest possible power consumption with minimal performance penalty.
• We propose and evaluate look-ahead architectural adaptation techniques that reduce power consumption in processor pipelines based on memory access information. The scheme can significantly reduce the power consumption of memory-intensive applications. Combined with other adaptation techniques, our schemes can effectively reduce power consumption for both compute- and memory-intensive applications.
The significance, potential impacts, and contributions of this dissertation are:
1. Academia and industry R&D have solely targeted the objective of high performance in both hardware and software designs since the beginning stage of building computer systems. However, the pursuit of high performance without considering energy consumption will inevitably lead to increased power dissipation and thus will eventually limit the development and progress of the increasingly demanded mobile, portable, and high-performance computing systems.
2. Since our proposed method adaptively combines the merits of existing low-power cache designs, it approaches the optimum in terms of both retaining performance and saving energy. This low-power solution for highly associative caches can be easily deployed at a low cost.
3. Using a cache miss, a common program execution event, as a triggering signal to slow down the processor issue rate, our scheme can effectively reduce processor power consumption. This design can be easily and practically deployed in many processor architectures at a low cost.
Chapter 1
Introduction
Computer system designs have been driven by the dedicated pursuit of high performance for several decades. As a consequence, we have seen a technology trend that matches Moore's law well: both the clock rate of a processor and the number of transistors on a chip have doubled roughly every eighteen months over the last decades. However, this fast improvement in system performance does not come without cost. As performance improves exponentially, chip power consumption also increases significantly. For instance, the power dissipation of the Alpha processors increased from 30 Watts on an early product, the 21064, to 50 Watts on the 21164, then to 90 Watts on the 21264. It is estimated that this value will reach 125-150 Watts on the future product, the 21464 [78]. In recent years, high power consumption has put great pressure on computer system designs. For example, in order to keep the processor running correctly under such high power consumption (which leads to high temperature), special cooling and packaging techniques must be applied.

This dissertation investigates several memory-related design and optimization issues of general-purpose computer architectures. For general-purpose systems, high performance is still the major concern. Thus, low-power designs for those systems should reduce power consumption without (or with only a slight) performance loss.
1.1 Importance of Low-power Architecture Designs
Traditionally, power consumption has been a major concern for portable system designers. On one hand, users' demand for computation power is ever increasing. Today, many portable systems have processing power comparable to desktops or even servers produced just a few years ago. On the other hand, the operation of those systems is limited by battery lifetime. Users' requirements for light weight and long operation time never end. Although advancements in battery technology do increase battery lifetime, the improvement is only linear, far behind the exponential increase in power demand. Thus, reducing power consumption can directly improve the usability of those systems.
In recent years, power consumption has become a critical design constraint even for high-end computer systems. Higher power dissipation means higher packaging and cooling costs, and also lower reliability. For instance, it is estimated that once a processor dissipates more than 40 Watts, every additional Watt of power consumption costs about $1 [64] in cooling and packaging. With the fast increase in power consumption, air-cooling techniques will soon reach their limits. Other cooling techniques, such as water cooling, do not seem feasible due to their cost, noise, and weight. The increasing power consumption of computer systems may also cause environmental problems. The US EPA estimates that 10% of current electricity usage in the US is directly due to desktop computers. In addition, this percentage may increase in the future as ubiquitous computing becomes more and more popular.
Power issues have been attacked at almost all design levels, ranging from physical to application. Traditionally, low-power circuit and logic techniques, such as reducing the supply voltage and logic optimization, have been the major research focus in this field. These techniques have effectively reduced the power dissipation of a single transistor or a single device. However, as the number of transistors on a chip and the clock rate increase exponentially, low-power CMOS and logic designs alone can no longer solve all the power problems. Computer system designs must be power-aware. This requires work in architecture, operating systems, and compilers [41, 52]. Architectural approaches have been and will continue playing an important role here. For example, the projected power consumption of a Pentium III 800 MHz processor scaled from the Pentium Pro technology would reach 90 Watts. After careful design and optimization at the architectural level, the product's power consumption is only 22 Watts [21].
1.2 Our Work
As the speed gap between processor and memory continues to widen, the design of the memory system becomes increasingly important to the performance of many applications. To achieve high performance, large, highly associative, and multi-level caches are commonly deployed in today's computer systems. To tolerate long memory access latencies, modern processors widely exploit complicated issue logic. As a consequence, the memory system contributes a large amount of the system's power consumption. In addition, the behavior of the memory system also affects the power consumption of other components significantly. This Ph.D. dissertation investigates several memory-related design and optimization issues of general-purpose computer architectures, aiming to reduce power consumption without sacrificing execution performance.
We focus on designs and optimizations of the processor pipeline and caches, which are not only on the critical paths of program execution, but are also major power consumers in a computer system. For example, it is estimated that the motherboard consumes more than 50% of notebook power [41, 52]. Even in high-end computer systems, the processor is the single most power-hungry component and contributes about 25% of total system power consumption [70].
As stated above, the processor is a particularly important target for power-aware designs due to its high power dissipation within a small die size (e.g., the Alpha 21264 consumes 90 Watts within a 314 mm^2 area). High-performance processors normally contain on-chip caches to reduce memory stall time. On-chip caches occupy about 40% of the chip area and are among the major power consumers in microprocessors. For instance, on-chip caches in the Alpha 21264 processor consume 15% of the total power [32]. It is estimated that this portion will further increase to 26% in a future product, the Alpha 21464 [78]. For some low-power processors with simple execution logic, this percentage may be even larger. For example, in the low-power embedded processor SA-110, caches consume 43% of total power [54]. Thus, it is very important to reduce on-chip cache power consumption.
Our first work targets reducing the power consumption of set-associative caches. Set-associative caches are widely used for their ability to reduce cache conflict misses. However, the conventional implementation of set-associative caches is not energy-efficient, because multiple blocks in a set are accessed for each cache reference but at most one block contains the required data. We notice that none of the existing techniques are optimized for both cache hits and misses in terms of performance and power consumption. Since cache hit rates are highly application-dependent, those techniques only perform well for certain types of applications. To address this limitation, we propose an access mode prediction (AMP) cache that dynamically applies different access modes based on cache hit/miss prediction. Our experimental results indicate that the AMP cache can achieve the lowest energy consumption across a wide range of applications with near-optimal performance.
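The mode-selection idea can be illustrated with a minimal sketch. The predictor below is a hypothetical 2-bit saturating-counter table (the table size, indexing, and counter width are illustrative assumptions, not the predictor organizations evaluated in Chapter 4):

```python
# A minimal sketch of the AMP idea with an assumed 2-bit saturating-counter
# hit/miss predictor. Table size and indexing are hypothetical.

TABLE_SIZE = 1024

class AccessModePredictor:
    """2-bit saturating counters; values >= 2 predict a cache hit."""
    def __init__(self):
        self.table = [3] * TABLE_SIZE  # bias toward 'hit', the common case

    def predict_hit(self, addr):
        return self.table[addr % TABLE_SIZE] >= 2

    def update(self, addr, was_hit):
        i = addr % TABLE_SIZE
        self.table[i] = min(3, self.table[i] + 1) if was_hit else max(0, self.table[i] - 1)

def choose_access_mode(predictor, addr):
    # Predicted hit  -> way-prediction mode: probe only the predicted way,
    #                   which is cheap when the reference actually hits.
    # Predicted miss -> phased mode: read all tags first and touch the data
    #                   array only on a tag match, avoiding wasted data reads.
    return "way-prediction" if predictor.predict_hit(addr) else "phased"
```

The key point is that neither access mode is best for all references: way-prediction wins on hits, phased access wins on misses, and the predictor lets the cache pick per reference.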
To boost performance, modern processors can issue multiple instructions in each cycle and execute instructions in an order different from that of the program. To support these features, the issue logic and execution units are very complicated and contribute a large percentage of processor power consumption. For example, the issue logic and execution units consume about 40% of the total processor power in the Alpha 21264 [78]. Thus, dynamically adjusting the processor issue rate can significantly reduce processor power consumption [6].
Our second work focuses on reducing processor power consumption based on memory access information. When a cache load miss goes all the way to main memory, it is almost certain that the processor will stall for this memory access, so the full processor issue rate is not necessary to maintain performance. We propose a load indicator scheme that adjusts the processor issue rate according to the existence of main memory accesses. Our experiments show that this simple scheme is very effective in reducing the power consumption of memory-intensive applications. This scheme can be further combined with other indicators based on the average number of instructions issued per cycle or the number of unoccupied registers to reduce power consumption for both compute- and memory-intensive applications.
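The core of the load indicator can be sketched in a few lines. The issue widths and the single-threshold policy here are illustrative assumptions; the actual mechanism and its thresholds are detailed in Chapter 5:

```python
# Sketch of the load indicator: throttle the issue rate while any load miss
# is outstanding to main memory, since the pipeline will mostly stall anyway.
# The widths below are assumed values, not the dissertation's configuration.

NORMAL_ISSUE_WIDTH = 4      # assumed full issue rate
LOW_POWER_ISSUE_WIDTH = 1   # assumed throttled issue rate

def issue_width(outstanding_memory_loads):
    """Pick next cycle's issue width from the number of load misses
    currently outstanding to main memory (the 'load indicator')."""
    if outstanding_memory_loads > 0:
        return LOW_POWER_ISSUE_WIDTH  # low power mode
    return NORMAL_ISSUE_WIDTH         # normal execution mode
```

The appeal of the design is that the trigger is a single, readily available signal (the miss status of loads to main memory) rather than a complex prediction.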
The significance, potential impacts, and contributions of this dissertation are:
1. Academia and industry R&D have solely targeted the objective of high performance in both hardware and software designs since the beginning stage of building computer systems. However, the pursuit of high performance without considering energy consumption will inevitably lead to increased power dissipation and thus will eventually limit the development and progress of the increasingly demanded mobile, portable, and high-performance systems. This dissertation addresses this important issue by focusing on power considerations for memory-related microarchitecture designs.

2. Since our proposed method adaptively combines the merits of existing low-power caches, it approaches the optimum in terms of both retaining performance and saving energy. This low-power solution for highly associative caches can be easily deployed at a low cost.

3. Using a cache miss, a common program execution event, as a triggering signal to slow down the processor issue rate, our schemes are able to accurately and effectively eliminate unnecessary processor power consumption. This design can be easily and practically deployed in many standard and special-purpose processors at a low cost.
The dissertation is organized as follows. In the next chapter, we will present some
technical background about low power designs. Then, we will introduce the evaluation
environment in Chapter 3. In Chapter 4, we will discuss the low power cache designs based
on dynamic power mode switching. In Chapter 5, we will present our work in reducing
pipeline power consumption using memory access information. Finally, we will summarize
our work and discuss future work in Chapter 6.
Chapter 2
Background
In this chapter, we will briefly discuss the sources of power consumption and introduce some
existing work in reducing the power consumption.
2.1 Sources of Power Consumption
In CMOS circuitry, power is consumed by two different types of activity. Dynamic power dissipation is caused by capacitance charging and discharging as signals transition from 0 to 1 and from 1 to 0. The dynamic power dissipation can be expressed as in [52]:

    Power_dynamic = (1/2) · C · V^2 · A · f,    (2.1)

where C is the capacitance of the switching nodes, V is the supply voltage, A is the activity factor, and f is the clock frequency.

The capacitance C is a function of wire length and transistor size, which are determined by the underlying circuit technology. The value of C is also roughly proportional to the number of transistors in the circuit. The activity factor A is the average transition probability, which is determined by the circuit implementation and the program behavior. The value of A is between 0 and 1.
Another type of power dissipation is static power consumption. Unlike dynamic power dissipation, which arises only from signal transitions, static power is consumed in the absence of any switching activity. It is caused by leakage current flowing through every transistor that is powered on. Static power consumption can be modeled at the architectural level as in [15]:

    Power_static = N · V · k_design · I_leak,    (2.2)

where N is the number of transistors, V is the supply voltage, k_design is an empirically determined parameter representing the characteristics of a transistor, and I_leak is a technology-related parameter describing the leakage current per transistor.
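Equations (2.1) and (2.2) can be transcribed directly; the parameter values in the example below are purely illustrative, not measurements of any chip discussed in this dissertation:

```python
def dynamic_power(C, V, A, f):
    """Eq. (2.1): P_dynamic = (1/2) * C * V^2 * A * f."""
    return 0.5 * C * V**2 * A * f

def static_power(N, V, k_design, I_leak):
    """Eq. (2.2): P_static = N * V * k_design * I_leak."""
    return N * V * k_design * I_leak

# Illustrative numbers only: 30 nF effective switched capacitance,
# 1.8 V supply, activity factor 0.5, 1 GHz clock.
print(dynamic_power(30e-9, 1.8, 0.5, 1e9))  # roughly 24.3 W
```

Note how the quadratic dependence on V in Equation (2.1) makes supply-voltage reduction the most effective single lever on dynamic power.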
Currently, dynamic power dissipation dominates total chip power consumption, accounting for more than 95% of the total. Static dissipation accounts for only 2-5% of total chip power [43]. Thus, like most existing low-power techniques, our work focuses on reducing the dynamic power consumed by computer systems. As transistors become smaller and faster, the percentage of static power dissipation will increase in the future [15], and a growing number of research projects target reducing static power consumption. Other sources of power consumption, such as short-circuit power dissipation due to finite-slope input signals, also account for only a very small percentage of total chip power.
2.2 Low-power Techniques
As stated in Chapter 1, increasing power consumption has become a severe design constraint as computer systems become faster and more complicated. Researchers have proposed low-power techniques at almost all design levels, ranging from the physical level to the software level, to address this issue [41, 92].
2.2.1 Power-saving Techniques at Physical, Circuit and Logic Levels
Equations (2.1) and (2.2) show that the basic principles of low-power design are to reduce the supply voltage, the capacitance, the switching frequency, the short-circuit currents, and the leakage currents. These goals can only be achieved through technology advancements at the physical and circuit levels. Other techniques, such as transistor pin ordering, delay-path balancing, and network restructuring, can produce low-energy gate and functional unit designs. Such logic optimizations can also effectively reduce circuit power consumption.

Traditionally, the major research focus in the low-power design area has been on techniques at these design levels. Researchers have proposed many techniques that effectively keep chip power consumption at a reasonably low level. However, the fast increase in both the speed and the complexity of computer systems has pushed power dissipation to a point where low-power techniques at the physical, circuit, and logic levels alone can no longer solve all the problems.
In general, the supply voltage V decreases by 30% for each process generation, and the wire width also shrinks by about 30% (which decreases the capacitance of each transistor) [15]. Both reductions help keep the power consumption down. On the other hand, the number of transistors on the chip doubles with each process generation, and the clock frequency of processors doubles every 18 months. These two factors push the chip power consumption up. As a simple estimate, the chip capacitance C will increase to 0.7 × 2 × C = 1.4C in the next process generation, the supply voltage V will decrease to 0.7V, and the clock rate f will double to 2f. According to Equation (2.1), the chip dynamic power dissipation will therefore increase by about 40% for each process generation, even after low-power circuit-level techniques have been applied. Thus, the dramatic increases in both the complexity and the clock frequency of computer systems have already made it necessary to consider power issues at higher design levels.
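This back-of-the-envelope estimate can be checked numerically. The sketch below uses the scaling factors from the text (capacitance 0.7 × 2, voltage 0.7×, clock rate 2×) and treats Equation (2.1) simply as P ∝ C · V² · f, dropping all constants:

```python
# Dynamic power scales as P ~ C * V^2 * f (Equation 2.1, constants dropped).
def dynamic_power(c, v, f):
    return c * v * v * f

p_old = dynamic_power(c=1.0, v=1.0, f=1.0)  # baseline generation (normalized)
# Next generation: per-transistor capacitance shrinks by 30% while the
# transistor count doubles (net C = 0.7 * 2), V drops by 30%, f doubles.
p_new = dynamic_power(c=0.7 * 2, v=0.7, f=2.0)

print(round(p_new / p_old, 2))  # 1.37, i.e. roughly a 40% increase
```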
2.2.2 Power-saving Techniques at Architectural Level
In most cases, reducing power consumption and improving performance are conflicting interests. For instance, from Equation (2.1), we can see that slowing down the clock is a straightforward way to reduce power dissipation. However, the execution time of a program is inversely proportional to the clock rate f. Thus, it is crucial for power-aware designs to make trade-offs between performance and power consumption.
As a natural bridge between software and circuits, the architectural design can effectively
utilize the low-level power-saving techniques according to the program’s execution behavior
and achieve a good balance between performance and power consumption. Since the major
target for high-end computer systems is still high performance, power-aware designs for those systems should reduce power consumption with no (or only a slight) performance loss.
Normally, the design of a general-purpose computer system is optimized to achieve the
best average performance for a wide range of representative applications. As a consequence,
it is common that some hardware resources are underutilized by a specific application. Disabling those underutilized resources does not affect system performance, while the power consumed by those resources is saved.
A lot of work has been done at the architectural level to reduce the power consumption of
the whole computer system, including the processor, cache, memory, and bus. For example,
pipeline gating [51] reduces the energy consumption of processors with branch prediction
mechanisms by determining when the prediction is likely to be incorrect and preventing
wrong-path instructions from entering the pipelines. The filter cache [47] adds a very small
buffer between the CPU and LI cache to filter references to the LI cache. A hit on the
buffer will consume less energy than an access to the LI cache due to the smaller size of
the buffer. Signal encoding with Gray code [67] or invert coding [42] can reduce the energy
consumption on the bus by reducing the bit-switching frequency. Compression is another effective class of approaches to reducing the energy consumption of the processor core, cache, and bus; the compression can be based on significant bytes, zeros, or other frequently used values [20, 75, 80].
Researchers have also proposed many architectural schemes to reduce the static power
consumption. For instance, the cache decay technique can reduce cache leakage power by predicting which blocks will not be used again and placing those blocks in a low-leakage standby mode [43].
We will discuss architectural low-power techniques related to our work in more detail
in subsequent chapters.
2.2.3 Power-saving Techniques at Software Level
The techniques for energy reduction at software level can be applied at three different layers:
algorithm/application, compiler, and operating systems.
The most effective approach to energy reduction is algorithm optimization. In this respect, reducing energy consumption is identical to reducing computational complexity [92]. As a simple example, if a 30% reduction in the number of instructions executed shortens the program's execution time by 30%, this translates directly into a 30% energy saving, provided the power consumed per instruction is constant.
Some performance-oriented compilation techniques also benefit energy consumption. For example, loop transformations and data transformations can improve program locality and thus reduce the memory stall time. These techniques also reduce energy consumption because they reduce the number of memory references, which are more energy-hungry than cache accesses. However, some performance-oriented optimizations are not the best from an energy point of view. For example, compared with performance-oriented instruction reordering, energy-oriented reordering can reduce energy consumption by 10-15% [41]. It is therefore necessary to perform optimizations based on different targets.
As the traditional resource manager, operating systems also play an important role in
energy saving. Currently, many hardware devices provide multiple power modes, where the modes with lower power dissipation have longer response times. Operating systems can take
advantage of this feature for power management. Dynamic power management achieves
energy-efficient computation by selectively turning off system components when they are
idle or partially unexploited [9]. For example, spinning down the disk when its inactive
time passes a threshold is a commonly used policy for energy reduction. ACPI (Advanced
Configuration and Power Interface) is an open industry specification that defines the manner
in which the OS, motherboard hardware, and peripheral devices talk to each other regarding power usage [40].
Although software-level techniques are effective in reducing power dissipation, their effects are restricted to a specific application (algorithm or compiler techniques) or a specific execution environment (OS power management). To deliver a product (a processor or a computer system) with a targeted power consumption, one must still rely on power-aware designs and optimizations at the architectural level.
In this dissertation, we will present two memory-related architectural designs to reduce
the system power consumption.
Chapter 3
Evaluation Environment
3.1 Evaluation Metrics
3.1.1 Performance Metrics
A direct indicator of system performance is the execution time, or delay, of a group of representative programs on the target computer system. For the purpose of computer architecture research, two related metrics, CPI (cycles per instruction) and its reciprocal IPC (instructions per cycle), provide more insightful information about the system performance. The execution time or delay (D) of a program can be expressed as:

D = I · CPI · (1/f),    (3.1)

D = I / (IPC · f),    (3.2)

where I is the number of instructions executed, and f is the clock rate. D is measured in seconds.
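A quick numerical illustration of Equations (3.1) and (3.2), with made-up values rather than measured ones, confirms that the two forms agree, since IPC = 1/CPI:

```python
# D = I * CPI / f  (Eq. 3.1)  and  D = I / (IPC * f)  (Eq. 3.2) must agree,
# since IPC = 1 / CPI.  All numbers below are hypothetical.
I = 1_000_000_000   # instructions executed
cpi = 1.25          # cycles per instruction
f = 2e9             # clock rate: 2 GHz

d1 = I * cpi / f            # Equation (3.1)
d2 = I / ((1.0 / cpi) * f)  # Equation (3.2), IPC = 1/CPI = 0.8
print(d1, d2)               # both ~0.625 seconds
```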
The value of CPI is determined by two factors. The first is the program's behavior, such as its inherent instruction-level parallelism (ILP) and its locality. The second is the underlying architecture, for example, the instruction set, the processor issue rate, and the cache size. Analyzing the change and breakdown of the CPI or IPC values can provide much insightful information on architectural designs, such as the performance bottleneck and the effectiveness of a new scheme.
3.1.2 Power and Energy Metrics
Power consumption of a system is the rate at which energy is drawn over time. It is measured in watts. The power consumption directly determines the system operating temperature. Thus, it is essential to guarantee that the power consumption will never be so high as to drive the system to a dangerous temperature level [36]. Lower power dissipation alleviates some of the burden on packaging and cooling systems, and makes the system more reliable.
Energy consumption is measured in joules. Its relationship with power dissipation is expressed as:

E = P · D,    (3.3)

where E and P are the energy and power consumption, respectively. Lower energy consumption means more useful work can be done for a given energy budget. For portable systems, reducing energy consumption means increasing the battery lifetime.
According to Equation (3.3), it seems straightforward that reducing either the power consumption P or the execution delay D can effectively reduce the energy consumption. However, in most cases, reducing power consumption comes at the cost of degrading performance (i.e., enlarging the delay). For example, slowing down the clock rate can effectively reduce the power dissipation, since P ∝ f (Equation 2.1). On the other hand, slowing down the clock rate will also enlarge the execution time, since D ∝ (1/f) (Equation 3.2). Thus, simply slowing down the clock rate cannot save energy. Low-power designs must also consider their performance impact.
3.1.3 Energy-efficiency Metrics
In some cases, a low-power technique is also beneficial to performance, or at least has no negative effect on it. For example, loop optimization not only reduces the program's execution time, but also reduces its energy consumption in the memory system. In these cases, the energy or power consumption metric is sufficient to evaluate the effectiveness of the technique.
However, as we discussed above, in most cases a design optimized for performance causes unnecessary energy consumption. As an example, aggressively allowing multiple instructions to execute along different branch paths can improve the performance of superscalar processors, but the energy consumed by instructions executed on wrong paths is wasted. Thus, designers must make trade-offs between performance and energy for their target systems. A new metric is needed to evaluate this trade-off.
The energy-delay product (EDP) is widely adopted by researchers to evaluate the energy-efficiency of their designs [12, 31, 39, 47, 48, 67, 92]:

EDP = E · D.    (3.4)
In general, lower energy consumption corresponds to longer execution time. The target of
energy-efficient designs is to find the optimal design point where the energy-delay product
is minimized.
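As an illustration of the trade-off the EDP captures (with hypothetical numbers, not measured data), a design that saves energy but slows execution more than proportionally ends up with a worse EDP:

```python
# EDP = E * D (Equation 3.4).  Two hypothetical design points: B saves 20%
# energy relative to A but runs 30% slower.  The numbers are made up.
def edp(energy_j, delay_s):
    return energy_j * delay_s

edp_a = edp(energy_j=10.0, delay_s=1.0)  # baseline design
edp_b = edp(energy_j=8.0, delay_s=1.3)   # "low-power" variant

# B consumes less energy, yet its EDP is higher, so by this metric the
# trade-off is a net loss.
print(edp_a, edp_b)  # 10.0 10.4
```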
The authors of [13, 52] argue that it is more appropriate to apply different metrics to different classes of systems. For low-power portable systems, the battery life limits the system operation, and users are more tolerant of slow response times on those systems. Thus, energy is the primary consideration there. This makes the power-delay product (PDP), which is identical to energy (E), a natural metric for such systems. On the other hand, for the highest-performance server-class systems, the program delay is more important than the energy consumption. To reflect this, it may be more suitable to give extra weight to the delay term; the ED²P (E · D²) metric is more appropriate for the highest-end systems. EDP metrics cannot be applied across diverse classes of systems for comparing system efficiency, but they are still useful within a given class of systems [52].
In our study, we will use the EDP metric to evaluate the power-performance trade-offs,
since our analysis is within a single class of systems, the general-purpose systems.
3.2 Performance Evaluation Tools
The most straightforward and convincing way to evaluate the performance of a target
system is direct measurement. For example, to promote their products, computer vendors
normally report the SPEC results of the products by running the standard benchmarks
provided by Standard Performance Evaluation Corporation [66]. In order to provide detailed
information on system performance, today’s high-performance processors provide hardware
counters that can monitor the number of certain events happening during a program’s
execution. For instance, users can use hardware counters to check the cache miss rate of a
given program and identify the performance bottleneck.
Although direct measurement is accurate, several factors limit its use in architecture research. Hardware counters can only monitor a few events at a time. More importantly, direct measurement can only be performed on existing systems; thus, it is not suitable for evaluating new ideas across variants of architectural parameters. Simulation and modeling are more suitable in that situation.
Normally, modeling requires a relatively small amount of computational resources and takes a short time. However, modern computer systems are very complicated: the processor can issue multiple instructions in each cycle, instructions can be executed speculatively, and the memory access latency is not constant. All these factors make it very difficult, if not impossible, to develop an accurate and tractable model for real architectural study.
Simulation has been extensively used in computer architecture research because of the high cost of building hardware prototypes and because of its accuracy in estimating the performance of complicated computer systems. The SimpleScalar tool set [14] has been the most commonly used simulator for evaluating new ideas for uniprocessor systems in recent years. It is an execution-driven simulator that generates simulation results on the fly and produces comprehensive statistics for the program execution. The most advanced simulator in the tool set, sim-outorder, emulates a superscalar processor with a five-stage pipeline1 and a multi-level memory system. It supports several advanced techniques such as multi-issue, out-of-order execution, speculative execution, and non-blocking loads/stores. Currently, SimpleScalar supports two instruction sets. The PISA version supports the SimpleScalar Instruction Set Architecture, which is a close derivative of the MIPS instruction
1 The five stages of the pipeline are fetch, dispatch, execution, writeback, and commit.
set. The Alpha version supports Alpha ISA and can directly run binary code compiled on
real Alpha systems.
We use the Alpha version of the sim-outorder simulator in our experiments and add some new functions to the simulator for our study. For example, we have implemented multicolumn-based way-prediction in the cache access function for our low-power cache study.
SimpleScalar only emulates the behavior of the processor. Compared with simulators (e.g., SimOS [60]) that emulate the behavior of the whole system, including the processor, memory, disks, and operating system, SimpleScalar cannot capture the program's execution behavior due to context switches or system calls. However, compared with those whole-system simulators, it provides more detailed and flexible simulation of the processor, which is the target of our study. In addition, the benchmark suite we use, SPEC 2000 [66], mainly consists of scientific applications, whose execution behavior is much less dependent on the operating system than that of commercial workloads. To characterize the execution behavior of commercial workloads, we would need to use whole-system simulators [84, 27]. However, since the major concern of this dissertation is the microarchitecture, using a detailed processor simulator is more suitable. Moreover, the behaviors that the simulator cannot capture do not bias the results in our favor. For example, the variation of the cache miss rate would be larger if context switches were performed, which would make our low-power cache designs save even more energy than under the single-process environment.
3.3 Energy Consumption Evaluation Tools
Energy consumption can be analyzed by either simulation-based or probabilistic techniques [41]. Simulation-based techniques accurately monitor transitions at each cycle and use lower-level analysis results to construct higher-level models. In contrast, probabilistic techniques view signals as random transitions between 0 and 1 with certain statistical characteristics. Normally, simulation-based techniques are more accurate than probabilistic ones but take much more computation time.
Until recently, power estimation tools were only available at low design levels, such as Hspice [5] and PowerMill [68]. Although these circuit-level simulators can estimate dynamic and static power dissipation to within a few percent, they are not very useful for making architectural decisions [12]. Their simulation is too slow, so they are only practical for circuits with a relatively small number of transistors. More importantly, these tools need complete circuit designs as input, which are not available at the architectural design stage.
Researchers have been making efforts to provide energy estimation tools at the architecture level. As a regular structure, the cache has been a good candidate for energy analysis [67, 42, 46, 58]. For example, the authors of [42] propose an analytical energy dissipation model for conventional set-associative caches and use it to evaluate several low-power techniques. CACTI [79] is an analytical model for estimating the access and cycle times of direct-mapped and set-associative caches. It takes the cache size, block size, associativity, and process parameters as input. It has been extended to CACTI 2.0 [58], which includes both timing and power models for on-chip caches. We use CACTI 2.0 to estimate the power consumption of different implementations of set-associative caches in our study.
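The flavor of such analytical models can be sketched as follows. This toy estimator is our own simplification for illustration (not CACTI's actual equations), but like CACTI it takes the cache size, block size, and associativity as input:

```python
# Toy analytical estimate of the dynamic read energy of a set-associative
# cache.  A conventional access reads all `assoc` tag and data ways of one
# set; energy is charged per bit read out.  E_PER_BIT is an arbitrary
# made-up constant, not a CACTI parameter.
E_PER_BIT = 1.0  # arbitrary units

def read_energy(cache_bytes, block_bytes, assoc, addr_bits=32):
    sets = cache_bytes // (block_bytes * assoc)
    index_bits = sets.bit_length() - 1        # log2(number of sets)
    offset_bits = block_bytes.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    data_bits = block_bytes * 8
    return assoc * (tag_bits + data_bits) * E_PER_BIT

# For a fixed 16-KB cache with 32-byte blocks, the energy of a conventional
# access grows roughly linearly with associativity (~4x from 1-way to 4-way).
print(read_energy(16 * 1024, 32, 4) / read_energy(16 * 1024, 32, 1))
```

Real models such as CACTI additionally account for decoders, wordlines, sense amplifiers, and output drivers, which is why their estimates are not simply linear in associativity.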
Only recently have energy consumption simulators for the integrated system become publicly available [12, 74]. Wattch [12] integrates parameterized power models of common structures in modern superscalar processors into the SimpleScalar architectural simulator [14]. The cycle-level performance simulator generates cycle-by-cycle hardware access counts for each basic module and feeds these counts to the parameterized power models to estimate the power dissipation of the whole processor and of each component. SimplePower [74] also builds its infrastructure on the SimpleScalar simulator. However, it uses a combination of analytical and transition-sensitive models for energy estimation. For regular structures such as caches, the analytical models are applied. For other components, such as functional units, whose energy consumption depends on the switching activity and capacitance, the transition-sensitive models are applied. The simulator contains switch capacitance tables for functional units for each input transition, obtained from lower-level simulations; table lookups are done based on the input and output bits of the estimated components. Wattch provides a fully parameterizable power model for the processor, while SimplePower only provides parameterized models for caches. The currently available public version of SimplePower only models energy consumption for an in-order five-stage pipelined datapath with perfect caches running only integer instructions.
3.4 Workloads
SPEC CPU2000 is an industry-standard benchmark suite that contains compute-intensive benchmarks exercising a wide range of hardware [66]. It is extensively used to evaluate the performance of processors, memory systems, and compilers.
The SPEC CPU2000 suite consists of two groups of benchmarks, an integer group (CINT2000) and a floating-point group (CFP2000). Table 3.1 briefly describes all the programs [66].
We use SPEC CPU2000 as the workload in our study, with the reference input data files. We use the precompiled Alpha version of the SPEC2000 binaries [77] and run them under the SimpleScalar simulator. To eliminate start-up effects, we fast-forward the first four billion instructions, then collect detailed statistics on the next one billion instructions.
Category   Program        Description
CINT2000   164.gzip       Compression
           175.vpr        FPGA Circuit Placement and Routing
           176.gcc        C Programming Language Compiler
           181.mcf        Combinatorial Optimization
           186.crafty     Game Playing: Chess
           197.parser     Word Processing
           252.eon        Computer Visualization
           253.perlbmk    PERL Programming Language
           254.gap        Group Theory, Interpreter
           255.vortex     Object-oriented Database
           256.bzip2      Compression
           300.twolf      Place and Route Simulator
CFP2000    168.wupwise    Physics / Quantum Chromodynamics
           171.swim       Shallow Water Modeling
           172.mgrid      Multi-grid Solver: 3D Potential Field
           173.applu      Parabolic / Elliptic Partial Differential Equations
           177.mesa       3-D Graphics Library
           178.galgel     Computational Fluid Dynamics
           179.art        Image Recognition / Neural Networks
           183.equake     Seismic Wave Propagation Simulation
           187.facerec    Image Processing: Face Recognition
           188.ammp       Computational Chemistry
           189.lucas      Number Theory / Primality Testing
           191.fma3d      Finite-element Crash Simulation
           200.sixtrack   High Energy Nuclear Physics Accelerator Design
           301.apsi       Meteorology: Pollutant Distribution

Table 3.1: Descriptions of SPEC2000 benchmarks [66].
Chapter 4
Access Mode Predictions for
Low-power Cache Design
4.1 Significance of Low-power Cache Design
As we have stated in Chapter 1, the successful pursuit of high performance in computer systems has produced a negative byproduct: high power dissipation. Circuit-level techniques alone can no longer keep the power dissipation at a reasonable level, so increasing efforts have been made to reduce power dissipation through architectural approaches [13]. One research focus is reducing the power consumed by on-chip caches, which are among the major power consumers in microprocessors.
Today's computer systems normally contain a multi-level memory hierarchy, and the performance of caches is critical to many applications' performance. To achieve high performance, caches are getting larger and their associativity is increasing. In addition, many processors place both the level-one and level-two caches on chip. As a consequence, on-chip caches occupy about 40% of the chip area and consume a large portion of the chip
power. For instance, on-chip caches in Alpha 21264 processors consume 15% of the total
power [32]. It is estimated that this portion will further increase to 26% in a future product, the Alpha 21464 processor [78]. This percentage can be even larger in some low-power processors. For example, caches consume 43% of the total power of the embedded processor SA-110 [54].

Figure 4.1: The access patterns for (a) a conventional n-way set-associative cache; (b) a phased n-way set-associative cache; and (c) a way-prediction n-way set-associative cache.
Set-associative caches are commonly used in modern computer systems for their ability to reduce cache conflict misses. However, a conventional set-associative cache implementation is not power-efficient by nature. As Figure 4.1-(a) shows, a conventional n-way set-associative cache probes all n blocks (both the tag and the data portions) in a set for each cache reference, although at most one block will contain the required data. The percentage of wasted energy increases as the cache associativity n increases. We have already seen high-associativity caches in some commercial processors; for example, the Intel Pentium 4 processor uses four-way L1 caches and an eight-way L2 cache. Thus, it is very important to reduce the power consumption of set-associative caches.
4.2 Related Work
An effective approach to reducing the power consumption of a set-associative cache is to reduce the number of memory cells involved in an access. One scheme is to divide each data RAM into several sub-banks and to activate only the words at the required offset in all cache ways [67]. Another alternative is to selectively disable a subset of cache ways during execution periods with modest cache activity [3]. A further example is the technique used in the SA-110 processor, which divides each of the 32-way instruction and data caches into 16 fully-associative caches and enables only one-eighth of the cache for each access [54]. The filter cache [47] adds a very small buffer between the CPU and the L1 cache to filter references to the L1 cache; a hit in the buffer consumes less energy than an access to the cache due to the much smaller size of the buffer. The tag comparison elimination technique keeps links within the instruction cache and allows instruction fetch to bypass the tag array [50, 85]. This only works for instruction caches, whose access patterns are regular and predictable.
Two representative techniques for reducing the power consumption of set-associative caches are the phased cache and the way-prediction cache [39]. The phased cache [11, 34] first compares all the tags with the accessing address, then probes only the desired data way. The way-prediction technique first speculatively selects a way to access before making a normal cache access. Combining these two techniques, the predictive phased cache [37] first probes all the tag sub-arrays and the predicted data sub-array. This approach reduces the energy consumption of non-first hits (cache hits whose locations are mispredicted) in the way-prediction cache, at the price of increased energy consumption for both first hits (cache hits whose locations are correctly predicted) and cache misses.
4.3 Comparisons between Phased and Way-prediction Caches
Figures 4.1-(b) and (c) illustrate the access patterns for phased and way-prediction n-way set-associative caches, respectively. Compared with the conventional implementation, the phased cache probes only one data sub-array instead of n data sub-arrays1. However, the sequential accesses of tag and data increase the cache access latency. As shown in Figure 4.2, the tag path without output takes about 1.15 ns, and the data path alone without output takes about 0.90 ns on a 16-KB, 4-way set-associative cache with a block size of 32 bytes2. Under the conventional implementation, the tag and the data are accessed in parallel, and each cache access takes about 1.32 ns. In contrast, if the tag and the data are accessed one after the other, the access latency almost doubles, to 2.22 ns. Thus, although the phased implementation can reduce the power consumption of set-associative caches, it increases the cache access latency and harms the overall system performance.
The way-prediction cache first accesses the tag and data sub-arrays of the predicted way. For instance, the MRU-based way-prediction cache always accesses the most recently accessed block in a set first. If the prediction is incorrect, it then probes the rest of the tag and data sub-arrays simultaneously. An access in the phased cache consumes more energy and
1 Each way comprises a tag sub-array and a data sub-array.
2 The values are estimated using the CACTI model [58].
Figure 4.2: Cache access latencies of the tag path, data path, sequential access, and parallel access on a 16-KB, 4-way set-associative cache with a block size of 32 bytes. The values are estimated using the CACTI model [58].
has a longer latency than a correctly predicted access in the way-prediction cache, but consumes less energy than a mispredicted access. When the prediction accuracy is high, the way-prediction cache is more energy-efficient than the phased cache [39].
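This comparison can be made concrete with a simple expected-energy model for the cache-hit case. The per-probe energies and associativity below are hypothetical placeholders (in a real study they would come from a model such as CACTI), and the access patterns follow Figure 4.1:

```python
# Expected dynamic energy per access for an n-way set-associative cache,
# in arbitrary units.  E_TAG and E_DATA are hypothetical per-probe energies;
# real values would come from a circuit-level model such as CACTI.
E_TAG, E_DATA = 1.0, 4.0
N = 4  # associativity

def phased_energy():
    # Phased mode: probe all n tags first, then the single matching data way.
    return N * E_TAG + E_DATA

def way_prediction_energy(first_hit_rate):
    # Way-prediction mode: probe one predicted way (tag + data); on a
    # misprediction, probe the remaining n-1 tag and data sub-arrays.
    first_probe = E_TAG + E_DATA
    repair = (N - 1) * (E_TAG + E_DATA)
    return first_probe + (1.0 - first_hit_rate) * repair

print(phased_energy())              # 8.0
print(way_prediction_energy(0.95))  # ~5.75: wins when prediction is accurate
print(way_prediction_energy(0.5))   # 12.5: loses when prediction is poor
```

With these placeholder numbers, way-prediction beats the phased scheme only when the first-hit rate is high, which is exactly the application-dependent behavior discussed below.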
The way-prediction hit rate is bounded by the cache hit rate, which is highly application-dependent. Figure 4.3 shows the hit rates of all the SPEC2000 benchmark programs on a 64-KByte data cache and a 4-MByte L2 cache3. Most applications have L1 data cache hit rates as high as 95%, with a few exceptions, such as art, whose hit rate is only 65%. The distribution of L2 cache hit rates is even more diverse than that of L1 cache hit rates. Fourteen of the twenty-six programs have L2 cache hit rates higher than 95%, while ten
3 We will discuss the experimental environment in detail in Section 4.7.
Figure 4.3: The hit rates for SPEC2000 benchmarks on an eight-way 64-KByte data cache with a block size of 64 bytes and an eight-way 4-MByte unified L2 cache with a block size of 128 bytes.
programs have L2 cache hit rates below 80%.
The way-prediction cache will consume less energy on applications with good locality,
while the phased cache will consume less energy on applications with poor locality. Thus,
neither way-prediction cache nor phased cache can perform consistently well in power saving
across all these applications.
4.4 AMP Caches
4.4.1 Strategy
The energy consumption of a cache access can be minimized if a cache hit is handled by
the way-prediction mode, while a cache miss is handled by the phased mode. Motivated
by this relationship between a cache access mode and its energy consumption, and by
the application-dependent cache hit patterns, we propose to use an access mode prediction
technique based on cache hit/miss prediction for low power cache design [89]. It combines
the energy-efficient merits of both phased and way-prediction cache structures. In the case
of predicted cache hits, the way-prediction scheme is applied to determine the desired way
and to probe that way only. In the case of predicted misses, the phased scheme is applied
to access all the tags first, then probe the required way only. We call this the AMP cache.
Figure 4.4 presents the access sequence controlled by access mode predictions. Upon a
reference, the access mode predictor makes a decision based on the cache access history. If
the predictor indicates that the way-prediction mode should be applied, the following access
sequence will be handled by the way-prediction scheme. Otherwise, it will be handled by
the phased access scheme.
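The decision flow described above can be sketched in simulator-style code. This is an illustrative sketch under the assumption that both predictions are available before the access; the function signature and probe-count bookkeeping are our own, not the dissertation's hardware design:

```python
def amp_access(cache_set, tag, predict_hit, predicted_way):
    """One AMP cache access.  Returns (hit, tag_probes, data_probes).

    predict_hit   -- access mode predictor's decision (True -> way-prediction mode)
    predicted_way -- the way-predictor's guess, used only in way-prediction mode
    """
    n = len(cache_set)
    if predict_hit:
        # Way-prediction mode: probe only the predicted way first (1 tag + 1 data).
        if cache_set[predicted_way] == tag:
            return True, 1, 1                      # first hit
        # Way misprediction: probe the remaining n-1 ways (tags and data).
        return (tag in cache_set), n, n
    # Phased mode: read all n tags first, then at most one data array.
    hit = tag in cache_set
    return hit, n, (1 if hit else 0)
```

A correctly predicted hit thus touches one tag and one data sub-array, while a correctly predicted miss touches n tags and no data sub-array, matching the energy argument in the next subsection.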
The two most important components in the AMP cache design are the access mode
predictor and the way-predictor.
• Access Mode Predictor
A good access mode predictor is essential to select a suitable access technique for
each cache reference. Since we target low-power cache design, the access mode
predictor should add only a small overhead in both latency and energy consumption.
Figure 4.4: The access pattern of n-way AMP cache.
Motivated by existing branch prediction techniques, we derive a simple predictor that
uses a global access history register and a global pattern history table, obtaining high
prediction accuracy.
• Way-predictor
A good way-predictor can minimize the energy consumption of cache hits. Previous
studies have shown that MRU-based way-prediction can effectively reduce the energy
consumption of set-associative caches. However, we find that as the cache associativity
increases, the effectiveness of MRU-based way-prediction in energy saving continues
to decrease. To address this problem, we study several way-prediction policies and
find that the multicolumn-based way-prediction is a nearly optimal technique. It is
highly effective for energy reduction of high associativity caches. We also propose a
power-efficient variant of multicolumn cache to eliminate the swapping operation in
the original design.
4.4.2 Power Consumption
In this section, we will use a simplified timing and energy model to quantify the observations
that we have discussed above. The latency and energy consumption of the different types
of caches used in our experiments are estimated based on the CACTI timing and power
consumption model [58].
Let E_tag and E_data be the energy consumed by a tag sub-array and a data sub-array
upon a reference, respectively. For a correctly predicted hit in the way-prediction cache, the
energy consumed is E_tag + E_data, compared with n · E_tag + E_data in the phased cache, where
n is the associativity of the cache. On the other hand, a miss in the way-prediction cache
will consume (n + 1) · E_tag + (n + 1) · E_data, in comparison with (n + 1) · E_tag + E_data in the
phased cache. Regarding the access latency, a correctly predicted hit in the way-prediction
cache takes one time unit, compared with two time units in the phased cache. Our objective
is to pursue the lowest possible energy consumption and latency for both cache hits and
misses.
Let H, M, and W_b be the numbers of cache hits, misses, and write-backs, respectively.
A phased cache, a way-prediction cache, and an AMP cache consume energy as follows:
E_phased = H · (n · E_tag + E_data) + M · ((n + 1) · E_tag + E_data) + W_b · E_data    (4.1)

E_way_prediction = H · (E_tag + E_data) + M · ((n + 1) · E_tag + E_data) + M_r · n · E_data + W_b · E_data + Δ_w_mispred    (4.2)

E_AMP = H · (E_tag + E_data) + M · ((n + 1) · E_tag + E_data) + W_b · E_data + Δ_A_mispred    (4.3)
where Δ_w_mispred and Δ_A_mispred are the energy overheads of mispredictions in the way-prediction
cache and the AMP cache, respectively. M_r is the number of read misses.
For an AMP cache with a perfect access mode prediction and a perfect way-prediction, its
energy consumption is the lower bound of energy consumption for the set-associative cache.
The misprediction overhead is determined by both the accuracy of the access mode predictor
and that of the way-predictor. We will discuss these two predictors in the following sections.
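Omitting the misprediction overheads Δ_w_mispred and Δ_A_mispred, the three totals in Equations 4.1-4.3 follow directly from the access counts. The sketch below evaluates them with illustrative counts and unit energies, not measured CACTI values:

```python
def cache_energy(H, M, M_r, Wb, n, E_tag, E_data):
    """Energy totals per Equations (4.1)-(4.3), misprediction Delta terms omitted.

    H, M, M_r, Wb -- counts of hits, misses, read misses, and write-backs
    n             -- cache associativity
    """
    E_phased = H * (n * E_tag + E_data) + M * ((n + 1) * E_tag + E_data) + Wb * E_data
    E_way = (H * (E_tag + E_data) + M * ((n + 1) * E_tag + E_data)
             + M_r * n * E_data + Wb * E_data)
    # AMP with perfect predictors: hits use way-prediction cost, misses use phased cost.
    E_amp = H * (E_tag + E_data) + M * ((n + 1) * E_tag + E_data) + Wb * E_data
    return E_phased, E_way, E_amp
```

With, say, H = 90, M = 10, M_r = 8, W_b = 2, n = 8, E_tag = 1, and E_data = 4, the AMP total is the smallest of the three, illustrating the lower-bound claim above.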
4.5 Access Mode Predictors
4.5.1 Designs
Authors in [82] propose to use cache hit/miss prediction for improving load instruction
scheduling. In order to reduce the memory bandwidth requirement, authors in [72] use
miss prediction to dynamically mark which load instruction is cacheable/non-allocatable.
We want to apply cache hit/miss prediction to choose way-prediction or phased access for
energy savings.
This is different from most work on predicting cache behavior, which targets prefetching
or reducing the cache miss rate. For example, the next cache line and set predictor is used for
instruction fetch prediction and branch prediction [19]. Prediction caches use a history of
recent cache misses to predict whether a replaced cache line will be accessed in the near
future and thus should be put into the victim cache to reduce the overall cache miss rate [10].
The access mode predictor shares a similar objective with the branch predictor. Thus,
we derive the access mode predictor from existing branch predictors [65, 81, 55, 53]. The
intuition behind the access mode predictor is that cache misses are clustered and program
behavior is repetitive. Technically, nearly all branch prediction techniques may be adapted.
However, since we target reducing cache power consumption, the access mode predictor
must be simple. Thus we only present variants with low resource requirements here.
Each cache (instruction, data, or L2) has its own access mode predictor, which uses its
reference address or hit/miss history as the index to the prediction table. We have studied
the following prediction variants.
• Saturating counter.
This prediction has the simplest implementation. It is based on the two-bit saturating
up/down counter [65]. Each cache has its own prediction table. The number of entries
in the table equals the number of cache sets. Each entry is a two-bit saturating
counter. When a cache reference arrives, its set index determines which counter
should be accessed. If the most significant bit of the counter is “1”, the way-prediction
scheme will be applied. Otherwise, the phased scheme will be applied. The counter is
incremented for each way-prediction hit and is decremented for each way-prediction
miss or cache miss.
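A minimal software sketch of this per-set two-bit saturating counter predictor; the class interface and the initial counter value (weakly biased toward way-prediction) are our assumptions:

```python
class SaturatingCounterPredictor:
    """One two-bit saturating up/down counter per cache set (illustrative sketch)."""

    def __init__(self, num_sets):
        self.counters = [2] * num_sets        # 2-bit counters in range 0..3

    def predict_way_mode(self, set_index):
        # MSB of the 2-bit counter == 1  ->  apply the way-prediction scheme.
        return self.counters[set_index] >= 2

    def update(self, set_index, way_prediction_hit):
        c = self.counters[set_index]
        if way_prediction_hit:
            # Increment on each way-prediction hit.
            self.counters[set_index] = min(3, c + 1)
        else:
            # Decrement on each way misprediction or cache miss.
            self.counters[set_index] = max(0, c - 1)
```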
• Two-level adaptive predictor.
Another alternative is derived from the two-level adaptive branch predictor [81]. We
implement both GAg and PAg versions. In the implementation derived from GAg
version, a global k-bit access history register records the results of the most recent k
accesses. If the access is a way-prediction hit, a “1” is recorded; otherwise, a “0” is
recorded. The global access history register is the index to a global pattern history
table, which contains 2^k entries. Each entry is a two-bit saturating counter. In the
implementation based on PAg version, each set has its own access history register.
A single pattern history table is indexed by all the access history registers. In our
experiments, the number of entries in the global pattern history table for both GAg
and PAg predictors is set to the number of sets in the corresponding cache.
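The GAg variant can be sketched as follows; the table size, initial counter values, and update order are illustrative assumptions rather than the exact hardware:

```python
class GAgModePredictor:
    """GAg-style access mode predictor: a k-bit global access history register
    indexes a 2^k-entry pattern history table of 2-bit counters (a sketch)."""

    def __init__(self, k):
        self.k = k
        self.history = 0                      # global access history register
        self.pht = [2] * (1 << k)             # 2-bit saturating counters

    def predict_way_mode(self):
        # MSB == 1 -> apply the way-prediction scheme for the next access.
        return self.pht[self.history] >= 2

    def update(self, way_prediction_hit):
        c = self.pht[self.history]
        self.pht[self.history] = min(3, c + 1) if way_prediction_hit else max(0, c - 1)
        # Shift the outcome ("1" = way-prediction hit) into the history register.
        self.history = ((self.history << 1) | int(way_prediction_hit)) & ((1 << self.k) - 1)
```

Note that the prediction needs only the global history, not the address of the next reference, which is the property exploited in Section 4.5.3.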
• (M,N) correlation predictor.
This is based on the scheme proposed in [55]. An M-bit shift register stores the
hit/miss history for the most recent M accesses. Each set has 2^M entries, which
are indexed by the M-bit register. Each entry is an N-bit counter. We apply a (2,2)
predictor in the experiments.
• gshare predictor.
The gshare predictor was originally proposed in [53]. The global pattern history table is
indexed by the exclusive OR of the global access history with the set index of the current
reference. The number of entries in the table is equal to the number of sets in the
cache in our experiments.
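The gshare indexing step amounts to a single XOR and mask; a minimal sketch, where the table-size parameter is an assumption:

```python
def gshare_index(global_history, set_index, table_bits):
    """gshare-style table index: XOR of the global access history with the
    set index, truncated to a 2^table_bits-entry pattern history table."""
    return (global_history ^ set_index) & ((1 << table_bits) - 1)
```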
4.5.2 Accuracy
Access Mode Predictor    Size (bits)              Misprediction Rate (%)
                         iL1    dL1    uL2        iL1    dL1    uL2
Saturating Counter       256    256    8K         0.11   5.68   14.44
GAg                      ~256   ~256   ~8K        0.12   4.97   5.51
PAg                      ~1K    ~1K    ~56K       0.13   3.94   3.83
(2,2)                    ~1K    ~1K    ~32K       0.11   5.27   13.43
gshare                   ~256   ~256   ~8K        0.12   6.01   15.57
Table 4.1: Comparisons of sizes and misprediction rates of different access mode predictors. The separate 64 KB instruction cache (iL1) and data cache (dL1) are 8-way with block size of 64 B. The unified L2 cache (uL2) is 8-way with block size of 128 B. The misprediction rates are averages over all the SPEC2000 programs.
Table 4.1 compares the sizes and misprediction rates of the five different policies
discussed above on a system with 8-way 64 KB instruction/data caches and an 8-way 4
MB L2 cache. In our experiments, for the saturating counter, a 128-entry 2-bit (32-byte) table
is used for the instruction and data caches, and a 4096-entry 2-bit (1-KB) table is used for the L2
cache. For GAg and gshare, both the instruction cache and the data cache have a 32-byte
global pattern history table and a 7-bit global access history register; the L2 cache has a 1
KB global pattern history table and a 12-bit global access history register. For PAg, both
instruction and data caches have a 32-byte global pattern history table and a 112-byte local
access history register table; the L2 cache has a 1-KB global pattern history table and a
6144-byte local access history register table. For (2,2), both instruction and data caches
have a 128-byte table and a 2-bit register; the L2 cache has a 4-KB table and a 2-bit register.
The values of the misprediction rates presented are the averages over all twenty-six
SPEC2000 programs. The two-level adaptive predictor PAg obtains the lowest misprediction
rates, which are below 4% on the instruction, data, and L2 caches. However, it also
occupies more additional area than the other predictors. The GAg predictor achieves the second
lowest misprediction rates, which are below 6% on both the L1 and L2 caches, and requires the
smallest additional area.
4.5.3 Overhead
We select the GAg predictor for the remainder of our experiments for several
reasons. The prediction accuracy of GAg is only slightly lower than that of PAg, but
the GAg predictor requires much less additional space for recording access history than the
PAg predictor. For the GAg predictor, the 8-way 64 KB instruction and data caches each need only
a 32-byte global pattern history table and a 7-bit global access history register; the 4
MB L2 cache requires only a 1 KB global pattern history table and a 12-bit global access
history register. Regarding the area, the overhead of the GAg predictors for the instruction and
data caches is only 0.05%, and is only 0.02% for the L2 cache.
More importantly, unlike the other predictors, the GAg predictor does not require the
address of the next cache reference to perform the mode prediction. Thus, the access mode can
be selected before the next cache reference arrives. Since the access time of the small
prediction table is less than the access time of the cache, there is no timing overhead introduced
by the GAg predictor. Regarding the energy consumption, our simulation results indicate
that the overall overhead on instruction, data, and L2 caches by using mode predictors is
under 0.8% for all the programs.
4.6 Multicolumn-based Way-prediction
In the previous section, we discussed the selection of access mode predictors. The effectiveness
of the AMP cache in energy reduction depends not only on the accuracy of the access mode
predictor, but also on the underlying way-prediction technique. In this section, we will
introduce a way-prediction technique based on multicolumn caches [83].
Way-prediction caches were originally proposed for reducing the average access latency of
set associative caches [22, 2, 44, 1, 18, 83]. Some prediction strategies are used to predict the
way containing the desired data and read from that way first. Hash-rehash [2], MRU [22, 44],
column-associative [1], predictive sequential associative [18], and multicolumn [83] caches
are a few such examples.
Way-prediction techniques can also be beneficial for power saving in set-associative
caches [39]. If the way-prediction is correct, which is called a first hit in this dissertation, only
the desired way is probed. In this case, the power consumption is close to that of a
direct-mapped cache. However, if the prediction is wrong, the way-prediction cache will consume
more energy than a conventional implementation because some additional operations are
performed to maintain the prediction mechanism. Thus, the accuracy of the prediction
technique, measured by the first-hit rate, is crucial to the reduction of power consumption.
4.6.1 Limitation of MRU-based Way-prediction
Authors in [39] propose to use the MRU way-prediction technique to reduce the power
consumption for set-associative caches. Authors in [46] show that the hash-rehash, the
column-associative, and the MRU way-prediction techniques effectively reduce the power
dissipation of two-way set-associative caches. Previous studies have shown that an MRU-
based way-prediction cache is more power-efficient than other techniques [39, 46].
The MRU cache maintains an MRU list that marks the location of the most recently
used (or accessed) block for each set. For any reference mapping into the set, the search
always begins from the MRU location. This search order is effective for both level one
and level two caches because of the temporal locality in reference streams [44]. For low-
associativity caches, such as two-way or four-way caches, the MRU technique works well in
predicting the desired way. However, as the cache associativity further increases, the MRU
structure may potentially decrease the first-hit rate.
Figure 4.5 shows how an MRU cache searches the locations for two incoming references on
a four-way associative cache that has four sets. For clarity, only the tag and set portions
of a reference are presented here. Thus the first reference (000101) maps to set "01" with
tag "0001".
Upon the first reference (000101), the MRU cache begins its search from way 1 according
to the MRU information of set 1 (Figure 4.5). The predicted way does not contain the
desired data, so the MRU cache probes all the remaining ways as the second attempt. In
this example, the first reference is a cache miss. Thus, the least recently used block at way
2 is replaced and the corresponding MRU information is updated to point to way 2. When
the second reference to the same set, (101101), arrives, the search still begins from the MRU
location, which is now way 2. This is not a first hit. The MRU cache then probes all the
remaining ways and finds the desired block at way 3. The MRU information is updated
again to reflect the fact that the MRU location is way 3 now. In this example, neither of
the two references is a first hit on the MRU cache.
Figure 4.5: Way-prediction strategy for MRU cache.
In an MRU cache, the number of search entries is equal to the number of sets. As
the cache associativity doubles for a fixed cache size and a fixed block size, the number of
search entries in the MRU cache is halved. This means the number of entries available for
way-prediction is reduced by half, and the interference on the MRU information worsens.
Normally, this causes the way-prediction accuracy to decrease.

Figure 4.6: Overall cache hit rate and first hit rate of a 64 KB MRU data cache for program 172.mgrid.

As shown in Figure 4.6, our experiments indicate that for the SPECfp2000 program 172.mgrid, as the associativity of a
64 KB L1 MRU data cache increases from four to eight, then to sixteen, the first hit rate of
the cache drops from 91.8% to 81.3%, then to 59.4%, respectively. This trend is consistent
with the results presented in [22]. In addition, the first hit rate of a high-associativity MRU
cache falls far behind the overall cache hit rate. This indicates that MRU-based way-prediction
is not optimal for power saving.
4.6.2 Multicolumn Cache
The multicolumn cache [83] addresses the limitation of the MRU cache discussed above. The
authors observe that the importance of a cache block in a set may be different for different
references mapping into the same set. To capture these differences, the authors define the
concept of a major location as the location to which a reference can be directly mapped4,
and use the major location as guidance for the access sequence. The multicolumn cache
maps a block as in a direct-mapped cache but replaces a block as in a set-associative cache.
As the associativity increases, the first-hit rate does not change much in the multicolumn
cache. The number of search entries in the multicolumn cache equals the product of the
number of sets and the cache associativity. Thus, as the associativity increases for a fixed
cache size and a fixed block size, the number of search entries in the multicolumn cache
remains the same. Using the same program 172.mgrid as an example, as the associativity of
a 64 KB L1 multicolumn data cache increases from four to eight, then to sixteen, the first-hit
rate stays the same (92.7% in all three cases, as shown in Figure 4.6).
In addition, the first hit rate of the multicolumn cache is very close to the overall cache hit
rate. This indicates that the multicolumn cache is a nearly optimal way-prediction policy, since
the cache hit rate is the upper bound of the first hit rate that can be achieved by any
way-prediction policy.
Figure 4.7 shows the search sequence in a multicolumn cache for the same reference
stream as in Section 4.6.1. For the first reference (000101), its tag is "0001". For an n-way
set-associative cache, the major location of a reference is determined by the low order log n
bits of its tag. Thus the major location of the first reference is way 1. The search begins
here. The predicted way does not contain the desired block, so all the remaining ways
are probed as the second attempt. Since this reference is a cache miss, the LRU block at
way 2 is replaced. However, the new block (000101) and the block already residing at way 1
(110101) have the same major location "01". The design of the multicolumn cache ensures that
the MRU block is always at its major location.

4For an n-way set associative cache, the major location of a reference is determined by the low order log n bits of its tag.

Figure 4.7: Way-prediction strategies for multicolumn cache.

In this example, the new block (000101) is
placed at its major location way 1, and the block originally at way 1 (110101) is swapped
to way 2. For the second reference (101101), its major location is way 3. Thus the tag and
data portion of way 3 are probed as the first attempt. This is a first hit. In comparison,
the second reference is a non-first hit in the MRU implementation due to the interference
on MRU information.

Figure 4.8: Way-prediction strategy for multicolumn cache without swapping.
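The major-location computation is a simple bit selection on the tag. A minimal sketch that reproduces the two references of the example above (tags written in binary; the function name is ours):

```python
def major_location(tag, n_ways):
    """Major location = low-order log2(n_ways) bits of the tag.

    n_ways is assumed to be a power of two, so masking with n_ways - 1
    selects exactly the low-order log2(n_ways) bits.
    """
    return tag & (n_ways - 1)
```

For the 4-way example, tag "0001" selects major location "01" (way 1) and tag "1011" selects "11" (way 3), matching Figure 4.7.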
4.6.3 Power Considerations for Multicolumn Cache
In the original design of multicolumn caches, a swapping mechanism is applied to ensure
that the MRU block always resides at its major location after a reference (though other
data blocks may replace it later). From a power consumption point of view, swapping is an
expensive operation. Two cache ways are involved in each swapping operation. The power
consumption approximately equals the cost of two accesses to a single way.
In order to eliminate the swapping operation, we propose a power-efficient variation of
the multicolumn cache. An index entry is maintained for each major location in a set to
record its MRU information.
Figure 4.8 shows the implementation of the multicolumn cache without swapping. We again
use the same access sequence as in Section 4.6.1. For the first reference
(000101), its tag is "0001", and its major location is "01". The MRU information of major
location 1 at set 1 is retrieved, which points to way 1. Thus, the search begins from there.
The predicted way does not contain the desired block, so all the remaining ways are
probed as the second attempt. This reference is a cache miss, so the LRU block at way 2
is replaced. However, the new block (000101) and the block already residing at way 1
(110101) have the same major location 1. Since the new block (000101) is the most recently
accessed block whose major location is 1, the MRU information of major location 1 needs
to be updated to reflect this change. In the implementation without swapping, the MRU
information of major location 1 is updated to point to way 2 where the new block resides.
By recording the MRU information for each major location, the MRU block does not need
to always reside at its major location. Thus, the energy-consuming swapping operations are
avoided. For the second reference (101101), its major location is 3 and the corresponding
MRU information points to way 3. The tag and data portions of way 3 are probed as the
first attempt. This is a first hit. No swapping is performed and the MRU information
remains unchanged.
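A simplified software model of one set of the no-swap variant; the `victim_way` argument stands in for the LRU replacement decision, and the whole interface is our assumption rather than the hardware design:

```python
class NoSwapMulticolumnSet:
    """One set of a multicolumn cache without swapping: a per-major-location
    MRU entry points at the way that actually holds the MRU block."""

    def __init__(self, n_ways):
        self.tags = [None] * n_ways
        self.mru = list(range(n_ways))        # one MRU entry per major location

    def access(self, tag, victim_way):
        """Return True on a first hit; victim_way models the LRU choice."""
        loc = tag & (len(self.tags) - 1)      # major location by bit selection
        way = self.mru[loc]                   # probe the recorded way first
        if self.tags[way] == tag:
            return True                       # first hit: no swap, MRU unchanged
        if tag in self.tags:
            # Non-first hit: redirect the MRU entry to the way that holds the block.
            self.mru[loc] = self.tags.index(tag)
            return False
        # Miss: fill the victim way and redirect the MRU entry -- no block swap.
        self.tags[victim_way] = tag
        self.mru[loc] = victim_way
        return False
```

Replaying the example: the miss on tag 0001 fills way 2 and redirects the MRU entry of major location 1 to way 2, and the reference to tag 1011 is a first hit via the MRU entry of major location 3.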
There are several performance and energy consumption trade-offs in the implementations
of multicolumn caches. For the original design, the search always begins from the major
location of a reference, which is determined by a simple bit selection on the reference
address. Thus, the decision of a search order will not lengthen the cache access time, and
will not consume additional energy. However, to maintain this way-prediction mechanism,
the swapping operation may be performed. This swapping operation is energy-consuming,
and may also delay subsequent references.
On the other hand, the implementation without swapping avoids the energy-consuming
operation at the price of additional cache area for recording MRU information, additional
energy consumption for retrieving it, and a possible increase in cache access
time. Just like the MRU cache, the multicolumn cache without swapping needs to fetch the
MRU information first. Since the reference address is used to index the MRU information
table, the retrieval of MRU information may lengthen the cache access time. However,
the arguments for MRU cache can also be applied here. If the reference address could be
available earlier, the cache access could begin at an earlier pipeline stage [18]. In addition,
the overhead of fetching MRU information can be tolerated for L2 caches [44].
Regarding the area, the overhead of recording the MRU information is trivial. Compared
with the MRU cache, which has one MRU entry for each set, the multicolumn cache without
swapping has a larger MRU table in which each block has an MRU entry. However, the MRU
table in the multicolumn cache still accounts for only a very small portion of the data array.
The area overhead is log2(associativity) / (blocksize · 8). The cache block size in a modern
computer system is normally at least 32 bytes, so the area overhead is very small. For
example, for a sixteen-way set-associative cache with a block size of 64 bytes, each
set needs 64 bits to record the MRU information. The MRU table accounts for less
than 1% of the data array.
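The area-overhead formula works out as follows for this example (a quick arithmetic check; the function name is ours):

```python
from math import log2

def mru_area_overhead(associativity, block_size_bytes):
    """MRU bits per block divided by data bits per block, per the formula
    log2(associativity) / (blocksize * 8) given in the text."""
    return log2(associativity) / (block_size_bytes * 8)

# 16-way, 64-byte blocks: 4 MRU bits per 512-bit block,
# and 16 ways * 4 bits = 64 MRU bits per set, as in the example.
```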
The extra dimension of the MRU table in the multicolumn cache also makes the access
more complicated: it is necessary to first identify which of the n MRU entries in a
set should be fetched. This is determined by the major location of the reference. Since the
determination comes from a bit selection, the latency of fetching the MRU information in the
multicolumn cache is comparable to that in the MRU cache. Furthermore, the bit selection
also makes it possible to probe only the desired entry. As a result, the power consumption
for indexing the MRU table in the multicolumn cache is also comparable to that in the MRU
cache. The energy consumption overhead of accessing such a small MRU table is likewise trivial.
We evaluate the frequency of swapping operations in eight-way multicolumn caches under
the default system configuration in our experiments. For all the SPEC2000 benchmarks,
on average, only 0.2% of references to the instruction cache involve swapping operations,
while 3.8% and 12.0% of references to the data and L2 caches do, respectively. Since
the probability that a reference to the L1 instruction cache involves a swapping operation
is very low, it is more feasible to directly use the major location as the
search guidance, together with swapping operations, for this latency-sensitive cache. On
the other hand, the probability that a reference to the L2 cache involves a swapping operation
is as high as 12.0%, which means about 20% additional energy will be consumed by
swapping operations. In this case, it is more feasible to apply the multicolumn implementation
without swapping and overlap the fetch of MRU information with the accesses
to the L1 caches.
In summary, the original multicolumn cache design has no overhead on cache access
latency. However, it requires energy-consuming swapping operations to maintain the way-prediction
mechanism. The multicolumn implementation without swapping eliminates the
swapping operations at the cost of a trivial increase in cache area and access latency. In our
experiments, presented in the following sections, we will use the multicolumn cache with
swapping for the instruction and data caches, which are latency-sensitive and involve a
relatively small percentage of swapping operations, and apply the implementation without swapping
for the L2 cache, which is less latency-sensitive but involves a large percentage of swapping
operations.
4.7 Experimental Environment
CACTI [79] is a timing model for on-chip caches. It has been extended to CACTI 2.0 [58]
that includes both timing and power models. We use CACTI 2.0 to estimate the power
consumption of different implementations of set-associative caches in our study. For caches
with access mode prediction, we estimate the energy consumption of first hits, non-first
hits, misses, and write-backs in both the way-prediction and phased accessing modes,
respectively. We use the SimpleScalar tool [14] to collect the cache access and program execution
statistics. We modified the cache simulator in order to emulate the behavior of different
way-prediction and mode prediction approaches.
SPEC CPU2000 is a comprehensive benchmark suite that contains compute-intensive
benchmarks exercising a wide range of hardware. We use the precompiled Alpha version
of SPEC2000 binaries [77]. The reference input data files are used in the experiments. We
fast-forward the first four billion instructions, then collect detailed statistics on the next
one billion instructions.
In our execution-driven simulation, the memory hierarchy is the main focus. The simulated
1 GHz 8-issue processor has separate 64 KB instruction and data caches with a block
size of 64 Bytes, and a unified 4 MB L2 cache with a block size of 128 Bytes. This
configuration is similar to those used in high-end workstations such as the Alpha 21264 and
UltraSPARC III. The associativity of the L1 and L2 caches ranges from four to sixteen. The caches
have two sub-banks in each data RAM [67]. We assume that 0.18 µm technology is used.
4.8 Experimental Results
4.8.1 Comparisons of Multicolumn and MRU Caches
As we have stated in previous sections, the first-hit rate is crucial for the power reduction of
way-prediction caches. In this section, we first compare the first-hit rate of the multicolumn-based
way-prediction with that of the MRU-based way-prediction for set-associative caches.
Figure 4.9 shows the difference in first-hit rates between the multicolumn and MRU
caches as the cache associativity increases. For clarity, this figure only presents the average
first-hit rates over all the SPEC2000 programs and the first-hit rates of two representative
programs, vpr (an integer program) and facerec (a floating-point program). The overall hit
rates and the first-hit rates of the MRU and multicolumn instruction, data, and L2 caches for
each program are presented in Table 4.2 to Table 4.4, respectively.
The first-hit rates of MRU caches decrease as the cache associativity increases. In
contrast, the first-hit rates of multicolumn caches remain almost unchanged. For example, as
the cache associativity increases to sixteen, the first-hit rate of the MRU L2 cache is only
40% of that of the multicolumn L2 cache for program vpr. In addition, the first-hit rates of
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. CHAPTER 4. AMP CACHE 51
Figure 4.9: Comparison of the first-hit rates of the multicolumn and MRU caches.
multicolumn caches are very close to the overall cache hit rates, which are the upper bounds
of first-hit rates, at different cache levels and for different cache organizations. The average
first-hit rates of multicolumn caches are above 98% of the average overall cache hit rates.
On the other hand, the first-hit rates of MRU caches have noticeable differences compared
with the overall cache hit rates. For the sixteen-way L2 cache, the average first-hit rate of
MRU-based way-prediction is only 69% of the average overall hit rate.
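The sensitivity of way-prediction energy to the first-hit rate can be sketched with a simple unit-energy model (a hypothetical illustration, not the dissertation's CACTI-based model): a first hit probes only the predicted way, while a first miss must also probe the remaining ways.

```python
def way_prediction_energy(first_hit_rate, n_ways, e_way=1.0):
    """Expected data-array energy per access for an n-way way-prediction
    cache, in units of one way probe (e_way).

    A first hit reads only the predicted way; on a first miss the
    predicted way is read first and then the other n_ways - 1 ways,
    so all n_ways end up being probed.
    """
    first_miss_rate = 1.0 - first_hit_rate
    return first_hit_rate * e_way + first_miss_rate * n_ways * e_way

# Using the 16-way L2 first-hit rates of vpr from Table 4.4:
multicolumn = way_prediction_energy(0.9994, 16)  # ~1.01 way-probes/access
mru = way_prediction_energy(0.4011, 16)          # ~9.98 way-probes/access
```

Under this toy model, the MRU variant's low first-hit rate translates directly into nearly ten times the data-array energy per access at sixteen ways.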
In terms of first-hit rates, multicolumn-based way-prediction is a nearly optimal way-prediction technique. However, a high first-hit rate is not our final target. Figure 4.10 illustrates the effectiveness of multicolumn-based way-prediction in reducing cache energy
program        four-way                 eight-way                sixteen-way
               overall mul-col MRU      overall mul-col MRU      overall mul-col MRU
164.gzip       1.0000  1.0000  0.9955   1.0000  1.0000  0.9955   1.0000  1.0000  0.9955
175.vpr        1.0000  1.0000  0.9975   1.0000  1.0000  0.9975   1.0000  1.0000  0.9637
176.gcc        0.9996  0.9961  0.9898   0.9998  0.9962  0.9851   0.9999  0.9962  0.9824
181.mcf        1.0000  1.0000  1.0000   1.0000  1.0000  1.0000   1.0000  1.0000  1.0000
186.crafty     1.0000  0.9997  0.9750   1.0000  0.9997  0.9564   1.0000  0.9997  0.9404
197.parser     1.0000  1.0000  0.9995   1.0000  1.0000  0.9939   1.0000  1.0000  0.9914
252.eon        1.0000  0.9979  0.9897   1.0000  0.9979  0.9802   1.0000  0.9979  0.9639
253.perlbmk    0.9993  0.9969  0.9895   0.9998  0.9969  0.9774   1.0000  0.9970  0.9706
254.gap        1.0000  0.9988  0.9943   1.0000  0.9988  0.9867   1.0000  0.9988  0.9795
255.vortex     0.9961  0.9897  0.9694   0.9981  0.9904  0.9615   0.9986  0.9906  0.9474
256.bzip2      1.0000  1.0000  1.0000   1.0000  1.0000  1.0000   1.0000  1.0000  0.9889
300.twolf      1.0000  0.9990  0.9909   1.0000  0.9990  0.9869   1.0000  0.9990  0.9800
168.wupwise    1.0000  1.0000  0.9981   1.0000  1.0000  0.9963   1.0000  1.0000  0.9837
171.swim       1.0000  1.0000  1.0000   1.0000  1.0000  1.0000   1.0000  1.0000  0.9999
172.mgrid      1.0000  1.0000  1.0000   1.0000  1.0000  1.0000   1.0000  1.0000  0.9903
173.applu      1.0000  1.0000  0.9987   1.0000  1.0000  0.9942   1.0000  1.0000  0.9934
177.mesa       1.0000  0.9990  0.9894   1.0000  0.9990  0.9785   1.0000  0.9990  0.9577
178.galgel     1.0000  1.0000  1.0000   1.0000  1.0000  1.0000   1.0000  1.0000  1.0000
179.art        1.0000  1.0000  1.0000   1.0000  1.0000  1.0000   1.0000  1.0000  1.0000
183.equake     1.0000  0.9962  0.9860   1.0000  0.9963  0.9642   1.0000  0.9963  0.9531
187.facerec    1.0000  1.0000  1.0000   1.0000  1.0000  0.9992   1.0000  1.0000  0.9937
188.ammp       1.0000  0.9995  0.9992   1.0000  0.9995  0.9983   1.0000  0.9995  0.9974
189.lucas      1.0000  1.0000  1.0000   1.0000  1.0000  1.0000   1.0000  1.0000  1.0000
191.fma3d      0.9996  0.9970  0.9913   1.0000  0.9971  0.9873   1.0000  0.9971  0.9802
200.sixtrack   1.0000  0.9991  0.9991   1.0000  0.9991  0.9987   1.0000  0.9991  0.9845
301.apsi       1.0000  0.9995  0.9911   1.0000  0.9995  0.9863   1.0000  0.9995  0.9834
Average        0.9998  0.9988  0.9940   0.9999  0.9988  0.9894   0.9999  0.9988  0.9816
Table 4.2: The overall hit rates and first-hit rates of the multicolumn and MRU structures for the L1 instruction cache. The cache is 64 KB with a block size of 64 B.
consumption. This figure shows the reduction of the overall energy consumed by the instruction cache, data cache, and L2 cache under multicolumn-based way-prediction compared with MRU-based way-prediction. Compared with the MRU-based way-prediction, the multicolumn-based technique can reduce the overall energy consumption of four-way instruction, data, and L2 caches by 0.1% to 39.6% (6.8% on average). As the cache associativity increases to eight, the energy reduction is 0.4% to 50.3% (16.6% on average). For sixteen-way caches, the energy reduction of multicolumn caches is even more promising, which
program        four-way                 eight-way                sixteen-way
               overall mul-col MRU      overall mul-col MRU      overall mul-col MRU
164.gzip       0.9896  0.9840  0.9692   0.9898  0.9839  0.9574   0.9900  0.9839  0.9351
175.vpr        0.9617  0.9343  0.9027   0.9621  0.9343  0.8539   0.9624  0.9343  0.8083
176.gcc        0.9512  0.9500  0.9453   0.9512  0.9500  0.9406   0.9513  0.9500  0.9326
181.mcf        0.8087  0.8039  0.7952   0.8084  0.8035  0.7841   0.8079  0.8032  0.7488
186.crafty     0.9942  0.9677  0.9252   0.9967  0.9683  0.8473   0.9973  0.9684  0.7851
197.parser     0.9803  0.9720  0.9455   0.9809  0.9719  0.9170   0.9811  0.9720  0.8794
252.eon        0.9998  0.9945  0.9733   0.9999  0.9945  0.9559   0.9999  0.9945  0.9172
253.perlbmk    0.9977  0.9922  0.9742   0.9985  0.9925  0.9616   0.9989  0.9925  0.9403
254.gap        0.9973  0.9964  0.9921   0.9973  0.9964  0.9874   0.9973  0.9964  0.9773
255.vortex     0.9929  0.9820  0.9623   0.9939  0.9821  0.8928   0.9948  0.9822  0.8734
256.bzip2      0.9821  0.9772  0.9443   0.9824  0.9772  0.9331   0.9826  0.9772  0.9159
300.twolf      0.9490  0.9411  0.9123   0.9490  0.9409  0.8920   0.9494  0.9410  0.8610
168.wupwise    0.9893  0.9746  0.9675   0.9893  0.9746  0.9611   0.9893  0.9746  0.9513
171.swim       0.9175  0.9145  0.9134   0.9149  0.9145  0.9119   0.9149  0.9145  0.8869
172.mgrid      0.9665  0.9270  0.9179   0.9665  0.9270  0.8131   0.9665  0.9270  0.5935
173.applu      0.9431  0.9351  0.9185   0.9432  0.9352  0.8958   0.9432  0.9352  0.8651
177.mesa       0.9962  0.9954  0.9799   0.9962  0.9954  0.9405   0.9963  0.9954  0.9050
178.galgel     0.9884  0.9706  0.8783   0.9884  0.9706  0.8056   0.9884  0.9706  0.6926
179.art        0.6599  0.6565  0.6457   0.6599  0.6565  0.6323   0.6599  0.6565  0.6072
183.equake     0.9999  0.9999  0.9893   0.9999  0.9998  0.9753   0.9999  0.9998  0.9154
187.facerec    0.9672  0.9657  0.9476   0.9672  0.9657  0.9411   0.9672  0.9657  0.9260
188.ammp       0.9371  0.9295  0.9115   0.9379  0.9296  0.8989   0.9384  0.9297  0.8726
189.lucas      0.9188  0.9145  0.9034   0.9188  0.9145  0.8882   0.9188  0.9145  0.8585
191.fma3d      1.0000  0.9996  0.9925   1.0000  0.9996  0.9726   1.0000  0.9996  0.9518
200.sixtrack   0.9880  0.9774  0.9371   0.9882  0.9775  0.8805   0.9882  0.9775  0.8080
301.apsi       0.9719  0.9337  0.8899   0.9720  0.9337  0.8626   0.9720  0.9337  0.8257
Average        0.9557  0.9457  0.9244   0.9559  0.9458  0.8963   0.9560  0.9458  0.8552
Table 4.3: The overall hit rates and first-hit rates of the multicolumn and MRU structures for the L1 data cache. The cache is 64 KB with a block size of 64 B.
ranges from 5.1% to 67.3% (40.0% on average).
4.8.2 Energy Reduction of AMP Caches
Figure 4.11 compares the energy consumption of the multicolumn cache, the phased cache, and the AMP cache. We have measured the energy consumed by the four-way instruction and data caches and the eight-way L2 cache. All the values are normalized to the energy consumed by the
multicolumn implementation. The results show that the relative energy consumption of
program        four-way                 eight-way                sixteen-way
               overall mul-col MRU      overall mul-col MRU      overall mul-col MRU
164.gzip       0.9831  0.9816  0.9773   0.9827  0.9813  0.9713   0.9825  0.9811  0.9374
175.vpr        0.9994  0.9994  0.8233   0.9994  0.9994  0.5972   0.9994  0.9994  0.4011
176.gcc        0.9981  0.9969  0.8416   0.9981  0.9969  0.7289   0.9981  0.9969  0.6378
181.mcf        0.7076  0.6879  0.5329   0.6926  0.6720  0.5068   0.6858  0.6653  0.4764
186.crafty     0.9956  0.9909  0.9358   0.9922  0.9895  0.7615   0.9908  0.9903  0.5612
197.parser     0.9700  0.9569  0.8405   0.9702  0.9561  0.7516   0.9701  0.9556  0.6512
252.eon        0.9941  0.9940  0.9332   0.9905  0.9904  0.9844   0.9880  0.9879  0.9874
253.perlbmk    0.9988  0.9941  0.9628   0.9977  0.9977  0.9126   0.9959  0.9958  0.8960
254.gap        0.7590  0.7545  0.7401   0.7589  0.7545  0.7221   0.7590  0.7546  0.6896
255.vortex     0.9812  0.9572  0.8869   0.9729  0.9448  0.7706   0.9669  0.9346  0.6591
256.bzip2      0.9806  0.9647  0.8126   0.9803  0.9637  0.6521   0.9791  0.9623  0.4663
300.twolf      0.9995  0.9995  0.8252   0.9995  0.9995  0.6364   0.9995  0.9995  0.4880
168.wupwise    0.6617  0.6484  0.5480   0.6460  0.6398  0.5213   0.6433  0.6382  0.4990
171.swim       0.6983  0.6808  0.6802   0.7048  0.6878  0.6862   0.7048  0.6878  0.6859
172.mgrid      0.7686  0.6956  0.6883   0.7678  0.6957  0.6826   0.7663  0.6957  0.5229
173.applu      0.6645  0.6574  0.6484   0.6644  0.6567  0.6463   0.6644  0.6567  0.6307
177.mesa       0.8909  0.8676  0.8365   0.8936  0.8686  0.7916   0.8931  0.8683  0.7132
178.galgel     0.9930  0.9916  0.5660   0.9939  0.9919  0.5491   0.9939  0.9919  0.4934
179.art        0.9998  0.9998  0.5669   0.9998  0.9998  0.4646   0.9998  0.9998  0.4273
183.equake     0.7344  0.7348  0.7340   0.7343  0.7342  0.7237   0.7342  0.7342  0.7052
187.facerec    0.7460  0.7328  0.7065   0.7460  0.7328  0.6810   0.7464  0.7328  0.4594
188.ammp       0.7407  0.5977  0.4367   0.6188  0.5444  0.3665   0.5972  0.5337  0.2916
189.lucas      0.5550  0.4207  0.2955   0.6155  0.4810  0.3557   0.6118  0.5051  0.3855
191.fma3d      0.9994  0.9994  0.9994   0.3659  0.3648  0.3659   0.3574  0.3591  0.3556
200.sixtrack   0.9982  0.9949  0.7412   0.9982  0.9952  0.6656   0.9982  0.9951  0.5141
301.apsi       0.8321  0.8292  0.6986   0.8310  0.8284  0.5947   0.8309  0.8283  0.5002
Average        0.8711  0.8511  0.7407   0.8429  0.8256  0.6573   0.8406  0.8250  0.5783
Table 4.4: The overall hit rates and first-hit rates of the multicolumn and MRU structures for the L2 cache. The cache is 4 MB with a block size of 128 B.
the multicolumn and phased caches is application-dependent. Among all the twenty-six
programs, phased caches consume less energy than multicolumn caches for six programs.
Compared with multicolumn caches, phased caches may consume up to 37% more energy
or up to 40% less energy, depending on the program behavior. On average, phased caches
consume 16.7% more energy than multicolumn caches.
For all the programs, the AMP caches consume less energy than multicolumn caches.
The energy reduction is 8.6% on average and up to 46.2%. This indicates that the AMP
Figure 4.10: Energy reduction of multicolumn caches compared with MRU caches.
caches can effectively save energy on way-prediction misses and cache misses. Compared with phased caches, the AMP caches consume less energy for all the programs but one (mcf). This exception is due to the program's very high miss rates in the data and L2 caches and its irregular access patterns. For mcf, the AMP caches consume 31.1% more energy than phased caches. For the other programs, the access mode prediction technique reduces the energy consumption of phased caches by 10.6% to 27.1%. Over all the programs, the average energy reduction of AMP caches is 20.0% compared with phased caches. This indicates that the locations of many references can be correctly predicted by the way-prediction policy and a significant amount of energy can be saved.
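The application dependence described here can be illustrated with a hedged per-access energy sketch. The unit energies e_tag and e_way are hypothetical (one tag probe and one data-way probe, respectively), and the AMP model assumes an ideal hit/miss predictor; the dissertation's actual numbers come from CACTI.

```python
def multicolumn_energy(first_hit, hit, n=8, e_tag=0.2, e_way=1.0):
    """Way-prediction path: a first hit probes one tag + one data way;
    any other access ends up probing all n ways."""
    one_way = e_tag + e_way
    return first_hit * one_way + (1.0 - first_hit) * n * one_way

def phased_energy(hit, n=8, e_tag=0.2, e_way=1.0):
    """Phased path: probe all n tags first, then read one data way
    only when a tag matches (i.e., on a hit)."""
    return n * e_tag + hit * e_way

def amp_energy(first_hit, hit, n=8, e_tag=0.2, e_way=1.0):
    """Idealized AMP: predicted hits take the way-prediction path,
    predicted misses take the phased (tag-only) path."""
    one_way = e_tag + e_way
    return (first_hit * one_way
            + (hit - first_hit) * n * one_way   # way-prediction misses
            + (1.0 - hit) * n * e_tag)          # cache misses: tags only

# A high hit rate favors way prediction; a low hit rate favors phased:
# multicolumn_energy(0.99, 1.00) < phased_energy(1.00)
# multicolumn_energy(0.50, 0.55) > phased_energy(0.55)
```

Because the AMP path charges misses only for the tag probes, it tracks whichever of the two baselines is cheaper for a given hit-rate regime, matching the mcf exception noted above: when way-prediction misses dominate the hits, the phased cache can still win.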
Figure 4.11: Energy consumption of multicolumn cache, phased cache, and AMP cache. The system has four-way 64 KB instruction and data caches and eight-way 4 MB L2 cache.
To show more details about the energy reduction of caches with access mode predictions, we use six programs as examples and show the decomposition of the energy consumed by the instruction, data, and L2 caches in Figure 4.12. All of the values are normalized to the overall energy consumption of multicolumn caches. Looking into this decomposition, we find that phased instruction caches consistently consume more energy than both multicolumn and AMP instruction caches. This is due to the high first-hit rates of multicolumn and AMP instruction caches. Compared with these two caches, the phased cache consumes more energy for first hits. The relative energy
Figure 4.12: Decomposition of energy consumed by instruction, data, and L2 caches.
consumption on the data and L2 caches is application-dependent. For example, lucas has high L2 cache miss rates. Since the phased cache has lower energy consumption for misses than the multicolumn cache, the L2 phased cache consumes much less energy than the L2 multicolumn cache. The cache with access mode predictions always aims to consume the lowest possible energy for both hits and misses. For program lucas, its energy consumption in the instruction cache is comparable to that of the multicolumn cache, and its energy consumption in the L2 cache is comparable to that of the phased cache. Our study shows that the cache with access mode predictions is the most energy-efficient one.
The percentage of energy consumed by instruction phased caches is application-
Figure 4.13: Energy consumed by data and L2 caches.
dependent. It ranges from 26.4% (for art) to 82.0% (for fma3d). Since the phased instruction cache always consumes more energy than the multicolumn and AMP instruction caches, to make a fairer comparison, we assume the instruction caches always have the same AMP structure.
Figure 4.13 compares the energy consumption per instruction of only the data and L2 caches for the three different cache implementations. For seventeen of the twenty-six programs,
the multicolumn data and L2 caches consume less energy than the phased data and L2
caches by up to 26.6%. For the other nine programs, the phased implementation consumes
less energy than the multicolumn implementation by up to 63.4%. On average, their energy
consumption is comparable. The phased implementation consumes 0.3% more energy than the multicolumn implementation on average. The AMP data and L2 caches consume less energy than the multicolumn data and L2 caches for all the programs, by up to 60.2% and by 14.4% on average. For twenty-two programs, the AMP caches consume less energy than the phased implementation, by 6.1% to 26.7%. For the other four programs, the phased implementation consumes less energy than the AMP caches, by 5.5% to 39.7%. On average, the AMP data and L2 caches consume 10.7% less energy than the phased data and L2 caches.
In summary, the multicolumn cache consumes less energy for applications with high cache hit rates, while the phased cache consumes less energy for applications with low hit rates. Since the hit rate is highly application-dependent, especially for the data and L2 caches, neither the multicolumn cache nor the phased cache works well for a wide range of applications in terms of energy reduction. In contrast, the AMP cache always aims to consume the lowest possible energy for both cache hits and misses. It consumes less energy than both the multicolumn and the phased caches for a wide range of applications.
4.8.3 Energy-efficiency of AMP Caches
Regarding access latency, a correctly predicted hit in the multicolumn cache or in the AMP cache has a shorter latency than an access in the phased cache. Table 4.5 presents the access latencies for a way-prediction hit, a way-prediction miss, and a phased access, obtained using the CACTI model. According to these values, we set the cache access latencies in processor cycles in our experiments as follows. For the instruction and data caches, a way-prediction hit takes one cycle, and a way-prediction miss or a phased access takes two cycles.
                      iL1/dL1               L2
Way-prediction hit    0.90 ns (1 cycle)     5.85 ns (6 cycles)
Way-prediction miss   1.61 ns (2 cycles)    11.39 ns (12 cycles)
Phased access         1.88 ns (2 cycles)    10.77 ns (12 cycles)
Table 4.5: Access latencies for a way-prediction hit/miss and a phased access on the 64 KB 4-way L1 and 4 MB 8-way L2 caches.
For the L2 cache, a way-prediction hit takes six cycles, and a way-prediction miss or a phased access takes twelve cycles.
There is a concern that the AMP cache may not efficiently handle access sequences with varied latencies. However, way-prediction caches such as the MRU cache also need to handle accesses with varied latencies. Thus, the AMP cache does not introduce additional complexity in dealing with different access latencies compared with way-prediction caches. For the same reason, mixing phased accesses with way-prediction hits introduces no additional complexity in resolving conflicts on the data array, compared with mixing way-prediction misses with way-prediction hits.
Figure 4.14 and Figure 4.15 compare the performance and energy-delay product of the
AMP cache with those of the multicolumn cache and the phased cache, respectively.
Compared with the multicolumn cache, the AMP cache may mispredict the access modes of some way-prediction hits and perform phased accesses. This only slightly increases the average cache access latency and degrades the overall performance. Among the twenty-six programs, the AMP cache achieves the same CPI as the multicolumn cache for seven programs. The maximum CPI increase is 1%, and the average CPI increase is only 0.1%. This indicates that only a very small percentage of way-prediction hits are mispredicted as phased accesses by the access mode predictor. The AMP cache obtains almost the same
Figure 4.14: CPI and E-D product reductions of the AMP cache compared with the multicolumn cache.
performance as the multicolumn cache.
Way-prediction hits have shorter latencies than phased accesses. Thus, the average access latency of the AMP cache is shorter than that of the phased cache. Compared with the phased cache, the CPI reduction of the AMP cache ranges from 0.4% to 14.9%, and is 6.1% on average.
The AMP cache reduces the energy-delay product by 8.5% on average (up to 46.2%)
compared with the multicolumn cache, and by 24.8% on average (up to 35.3%) compared
with the phased cache.
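These E-D figures follow directly from composing the energy and delay ratios: for instance, the average 20.0% energy and 6.1% CPI reductions over the phased cache compose to roughly the 24.8% E-D reduction reported above.

```python
def ed_product_reduction(energy_reduction, delay_reduction):
    """Relative reduction of the energy-delay product, given the
    relative reductions of energy and delay separately:
    1 - (E_new/E_old) * (D_new/D_old)."""
    return 1.0 - (1.0 - energy_reduction) * (1.0 - delay_reduction)

print(ed_product_reduction(0.200, 0.061))  # ~0.249, i.e. ~24.9%
```

The composed value is slightly larger than either input reduction alone, which is why the E-D improvement over the phased cache exceeds the energy improvement by itself.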
Figure 4.15: CPI and E-D product reductions of the AMP cache compared with the phased cache.
4.9 Summary
The way-prediction cache and the phased cache are two existing techniques for reducing the power consumption of set-associative caches. However, neither minimizes energy consumption for both cache hits and misses, and their effectiveness in reducing energy consumption is highly application-dependent. To further reduce the energy consumption of set-associative caches, we propose a cache structure with access mode predictions (the AMP cache), which combines the merits of way-prediction and phased accessing. Our study shows the following results:
• With a simple access mode prediction based on cache hit/miss prediction, the AMP cache can effectively reduce energy consumption for a wide range of applications in systems with moderately to highly associative caches. For example, the AMP cache reduces the energy consumed by four-way L1 and eight-way L2 caches by 9% and 20% on average (up to 46% and 27%) compared with the multicolumn and phased implementations, respectively.
• The AMP cache is an energy-efficient design. It can reduce the energy-delay product
of a system with four-way LI and eight-way L2 caches by 9% and 25% on average
(up to 46% and 35%) compared with the multicolumn and phased implementations,
respectively.
• The multicolumn-based way-prediction technique is a nearly optimal scheme, which
can correctly predict the locations of 98% of cache hits on average. For four-way,
eight-way, and sixteen-way caches, multicolumn-based way-prediction can reduce the
energy consumption by 7%, 17%, and 40% on average, respectively, compared with
MRU-based way-prediction. In addition, the AMP cache can also utilize other way-
prediction techniques.
• The AMP cache can exploit the same mechanism used in the way-prediction cache to
handle the varied latencies from different access modes and way-prediction hits and
misses.
• The additional overhead of the access mode predictor and multicolumn-based way-prediction is negligible in terms of cache area, access latency, and energy consumption.
Chapter 5

Look-ahead Adaptation Techniques to Reduce Processor Power Consumption
5.1 Motivation
The pursuit of high performance in general-purpose processors has been increasing both the speed and the complexity of processors. As a byproduct, the last decade has seen a dramatic increase in processor power consumption [78]. To address this issue, researchers have targeted reducing processor power dissipation with minimal performance impact. One effective approach at the architectural level, called architecture adaptation, is to adaptively activate and deactivate hardware resources in accord with the dynamic changes of a running program's execution behavior [17, 6, 29, 56, 61]. This is based on the observation that program execution behavior varies significantly among different programs and among different execution phases of a single program [76].
There are two key factors in architecture adaptation [61]: (1) when to trigger the adaptation, i.e., under what conditions to activate/deactivate resources; and (2) what adaptation
techniques to apply. Our work focuses on the first issue.
In order to reduce processor power consumption while retaining performance, most existing solutions try to meet the current program requirement with a minimum number of active resources. However, this type of approach shares a common limitation: the adaptation is triggered only after the change of demand has been detected. Without foreknowledge of future demand variations, resources are kept active when the current demand is high, even if the demand is about to drop. Figure 5.1-(a) shows an example of power-saving optimization based on the current system status. The monitored value is compared with a threshold at the end of each sampling window (the dots). At time A, the scheme will put the processor into the normal execution mode based on the current knowledge that the measured value is higher than the threshold. However, it does not foresee that a slack exists between time B and D, where the processor is almost idle.
Ideally, the deactivation of hardware resources should start earlier than the drop point
to maximize the power saving, subject to maintaining the same performance. As illustrated
in Figure 5.1-(b), part of the work is delayed to time C. However, since the same amount of work is finished before time D, the overall performance remains the same even though the processor stays in the low-power mode for a longer time. Thus, optimization based on the current system status loses some power-saving opportunities compared with look-ahead optimization.
In this study, we show that some hardware events can accurately predict future demand degradation, so that hardware resources can be confidently deactivated in advance even if the current demand is high. Specifically, this study utilizes the events of main memory accesses
(a) Optimization based on current system status
(b) Look-ahead optimization
Figure 5.1: Comparisons of the optimization based on current system status and the look-ahead optimization.
to trigger architecture adaptation.
As the speed gap between processors and memory continues to widen, the memory access latency consistently increases relative to the processor cycle time. Once an L2 cache load miss to main memory happens, it is almost certain that the processor will stall for this cache miss (although the processor may perform some useful work on subsequent instructions before stalling). For example, consider a moderate system configuration: a 4-way issue processor with a 128-entry instruction window runs at a 2 GHz clock rate, and the memory access latency is 100 ns. Once a load miss happens, the load cannot be resolved within 200 cycles, while the instruction window will become full in 32 cycles at the full issue rate, forcing the processor to stall. Thus, when a load miss happens, maintaining the full processor issue width is wasteful even if the current program demand is high. In addition, this resource degradation period will grow as the speed gap between the processor and
memory continues to widen.
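The arithmetic behind this observation can be written out directly, using the example configuration above:

```python
def guaranteed_stall_slack(window_entries=128, issue_width=4,
                           clock_ghz=2.0, mem_latency_ns=100.0):
    """Cycles the processor is certain to sit stalled on an L2 load miss:
    the miss takes mem_latency_ns to resolve, while the instruction
    window fills (forcing a stall) after window_entries / issue_width
    cycles even at the full issue rate."""
    miss_cycles = mem_latency_ns * clock_ghz      # 100 ns * 2 GHz = 200
    fill_cycles = window_entries / issue_width    # 128 / 4 = 32
    return miss_cycles - fill_cycles

print(guaranteed_stall_slack())  # 168.0
```

Those 168 cycles are slack that a look-ahead scheme can spend in a low-power mode without delaying any instruction past the point where the miss returns.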
We propose a new scheme, called the load indicator, that triggers processor issue rate adaptation upon main memory accesses. In particular, the scheme reduces the issue width when a load miss occurs, and resumes the full issue rate when all outstanding loads finish. Previous studies have shown that adjusting the processor issue rate is an effective adaptation technique for power saving [6, 61]. With an accurate prediction of future demand degradation, the load indicator scheme redistributes the CPU work over a relatively long period of time to maximize the power saving while achieving the same performance.
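A minimal sketch of the load indicator's control logic follows; names and widths are illustrative, and the real trigger is a hardware signal rather than software.

```python
class LoadIndicator:
    """Reduce the issue width while any main-memory load miss is
    outstanding; resume the full width once all such loads return."""

    def __init__(self, full_width=8, reduced_width=4):
        self.full_width = full_width
        self.reduced_width = reduced_width
        self.outstanding_misses = 0

    def on_l2_load_miss(self):
        # A miss to main memory predicts an imminent demand drop.
        self.outstanding_misses += 1

    def on_miss_return(self):
        self.outstanding_misses -= 1

    def issue_width(self):
        return (self.reduced_width if self.outstanding_misses > 0
                else self.full_width)
```

Unlike sampling-window schemes, no history or averaging is needed: the miss event itself is the look-ahead signal.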
Our experiments show that, for memory-intensive applications, this scheme can save more power with a performance loss comparable to that of the pipeline balancing technique [6], which periodically adjusts the processor issue rate based on the average issued IPC (instructions per cycle). For seven memory-intensive applications from the SPEC2000 benchmark suite, our scheme reduces the power consumption of the issue logic and execution units by 24% and 11%, respectively. The total processor power consumption is reduced by 5.4% on average with a performance loss of 0.5%, compared with a 4.2% average power saving with a performance loss of 0.7% for the pipeline balancing technique.
The load indicator scheme foresees the degradation in resource demand caused by memory accesses; thus it only has an effect on memory-intensive applications. We further propose two variants of the scheme to cover a wider range of applications. The first variant, called the load-instruction indicator scheme, utilizes information from both load misses and IPC changes (in the absence of load misses) to adjust the processor issue rate. The second variant, called the load-register indicator scheme, reduces the processor issue rate when cache load misses happen or the number of free registers drops below a threshold, and resumes the full issue rate when there are no outstanding load misses and the number of free registers rises above another threshold. Both variants capture power-saving opportunities caused by program behavior changes and hardware constraints beyond main memory accesses. They effectively save energy for applications with a wide range of memory stall portions.
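The load-register variant can be sketched as follows. The two register thresholds are hypothetical values chosen for illustration (the dissertation does not fix them here); using a lower throttle threshold and a higher resume threshold gives hysteresis, so the issue width does not oscillate when the free-register count hovers near a single cut-off.

```python
class LoadRegisterIndicator:
    """Sketch of the load-register indicator: throttle on outstanding
    load misses OR low free-register count; resume only when no miss is
    outstanding and free registers recover past a second, higher
    threshold."""

    def __init__(self, full_width=8, reduced_width=4,
                 throttle_below=8, resume_above=16):
        self.full_width = full_width
        self.reduced_width = reduced_width
        self.throttle_below = throttle_below
        self.resume_above = resume_above
        self.throttled = False

    def issue_width(self, outstanding_misses, free_registers):
        if outstanding_misses > 0 or free_registers < self.throttle_below:
            self.throttled = True
        elif outstanding_misses == 0 and free_registers > self.resume_above:
            self.throttled = False
        # Between the two thresholds the previous decision is kept.
        return self.reduced_width if self.throttled else self.full_width
```
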
5.2 Load Indicator Scheme
As indicated in [76], the execution behavior of programs varies significantly among different programs and among different execution phases of the same program. Thus, architectural adaptations based on variations in program requirements are effective in reducing processor power consumption with a negligible performance loss [17, 6, 29, 56, 61].
5.2.1 Power Saving Opportunity
The primary target of modern general-purpose processors is high performance. Modern processors can execute multiple instructions in each cycle to boost performance, and most state-of-the-art general-purpose processors support out-of-order execution to improve performance further. These techniques improve processor performance significantly. As a simple example, for a system with 16 KB instruction/data caches (1-cycle access time) and a 256 KB L2 cache (6-cycle access time), as the processor issue width increases from four to eight, the performance of the SPEC2000 program 171.swim improves by 25%; on the other hand, if the processor only supports in-order execution, the performance degrades by 45%.
Although these techniques do improve performance effectively, they also increase energy consumption significantly. In order to execute multiple instructions concurrently, the processor needs to contain multiple functional units. Compared with single-issue processors, duplicating functional units also multiplies the energy consumption of the datapath. In order to support out-of-order execution, the issue logic of the processor is much more complicated than that of a processor supporting only in-order execution. For an n-way issue in-order execution processor, the issue logic only needs to check the dependences and data availability of the first n instructions in the instruction window. In comparison, for an n-way issue out-of-order execution processor, the issue logic needs to check all the m instructions in the instruction window and select n ready instructions whose data dependences are resolved and whose corresponding functional units are available. In general, m ≫ n: the issue logic picks n ready instructions out of m entries. Thus the issue logic of an out-of-order execution processor is much more complicated than that of an in-order execution processor, and this increased complexity corresponds to an increase in energy consumption.
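The asymmetry in selection work can be sketched as follows; this is a software illustration of logic that is combinational hardware in a real processor, with hypothetical helper names.

```python
def select_in_order(window, n, is_ready):
    """In-order issue: examine only the first n window entries and stop
    at the first instruction that is not ready."""
    issued = []
    for inst in window[:n]:
        if not is_ready(inst):
            break
        issued.append(inst)
    return issued

def select_out_of_order(window, n, is_ready):
    """Out-of-order issue: scan all m window entries (m >> n) and pick
    up to n ready instructions -- the wakeup/select logic scales with
    the whole window, not just the issue width."""
    issued = []
    for inst in window:
        if is_ready(inst):
            issued.append(inst)
            if len(issued) == n:
                break
    return issued
```

The in-order selector touches at most n entries; the out-of-order selector may touch all m, which is the source of the extra issue-logic energy discussed above.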
As a consequence, the issue logic and execution units account for about 40% of the total processor power consumption. Previous studies have shown that adjusting the processor issue rate is an effective power-saving technique, which can effectively reduce the power consumed by the issue logic and execution units [6, 61]. The authors of [6] propose a technique, called pipeline balancing, that dynamically adjusts the processor issue width for each sampling window based on the average issued IPC measured in the previous window. For convenience of discussion, we will refer to this technique as the instruction indicator in the following sections. The authors of [61] propose a technique that uses the mean functional unit utilization and the number of structural hazards over the last period to trigger the issue rate adaptation for the next period.
Those techniques share a common limitation: they capture the change of program behavior only after the change has happened. More importantly, they cannot foresee the degradation in resource requirements. Thus, they only perform power optimization based on current knowledge of program behavior. As shown in Figure 5.1, this loses some power-saving opportunities.
If a scheme can accurately predict a future degradation in program demand, it is safe to satisfy only part of the currently demanded work. The main idea of our look-ahead optimization is to redistribute work over a relatively long period of time in order to generate more power-saving opportunities while finishing the same amount of work over that period. In principle, this is similar to many power-saving schemes for multimedia applications. For multimedia applications, as long as the demanded work of a "frame" finishes within its deadline, the performance is considered the same regardless of the actual completion time. Thus, redistributing the work within a frame using dynamic voltage scaling is effective in saving power while retaining performance [38, 61].
However, general-purpose applications do not have such a concept of a “frame” as
multimedia applications do. Thus, the question here is how to find time periods within which
work can be redistributed to save energy while the same amount of work can still be finished
within the periods.
5.2.2 Load Indicator
There are several possible solutions for identifying the performance degradation in advance,
such as hints provided by the application, compiler, or operating system. For example,
the authors of [73] propose to use compiler-driven static IPC prediction at the loop level to
guide fetch throttling for energy saving. Our study explores a hardware-based indicator
with a simple implementation that matches current technology trends.
As the speed gap between processors and memory continues to widen, the memory access
latency increases steadily relative to the processor cycle time. Researchers have
proposed many techniques to reduce the average memory access latency, for example,
deploying advanced DRAM technologies [25, 88], exploiting DRAM row buffer
locality [86, 28, 87], and reordering memory accesses [59, 90]. However, even after
applying those latency-reduction techniques, for modern multi-issue, multi-GHz processors,
once a cache load miss to main memory happens, it is almost certain that the processor
will stall due to the long main memory access time. Thus, maintaining a partial processor
issue width is enough to retain performance from the time when the load miss happens
to the time when the missing data are returned. The existence of load misses thus signals
a future reduction in the program's requirements on computation resources.
Figure 5.2 shows the sampled IPC values of program swim and the number of outstanding
cache load misses in a representative 1024-cycle interval during the program's execution.
For clarity, we only present the floating-point IPC values in the figure. The integer IPC
values have a similar pattern. The system configuration will be presented in detail in
Section 5.4. From the figure, we can see that multiple cache load misses normally happen
together. From the starting time when the first cache load miss occurs to the ending time
when all the load misses return, the program execution behavior forms a “frame”: an active
period followed by an idle period. This result shows that the information of load misses can
be used to accurately predict future performance degradation.
This figure also presents the measured IPC values under different sampling window
widths.

Figure 5.2: Sampled IPC values and number of outstanding loads during an arbitrarily selected 1024-cycle interval for program swim. “W64” and “W32” correspond to sampling window sizes of 64 cycles and 32 cycles, respectively.

A larger window size smooths out spikes and avoids thrashing between different
power modes (e.g., from cycle 192 to cycle 256). However, a larger window size may also fail
to capture the change of program behavior (e.g., from cycle 480 to cycle 544). Thus, the
sampling window size must be tuned carefully.
We propose a scheme, called the load indicator, that saves power by reducing the issue width
when one or more load misses occur, and retains performance by resuming the full issue
rate when there are no outstanding load misses¹. In the implementation, a register is used
to record the number of outstanding L2 cache load misses. When a load miss happens, if
the processor is in the normal execution mode (i.e., with the full issue rate), the processor
will transition to the low power mode. In the low power mode, the processor issue rate and the
¹ We only consider cache load misses since write misses can normally be handled well by write buffers and will not directly cause the processor to stall.
number of active functional units are reduced by half. When a load miss returns or a
speculated load miss is squashed², the register value is decreased by one. When the register
value drops to zero, the processor returns to the normal execution mode immediately.
if (a cache load miss happens)
    miss register value + 1;
    if (processor is in normal mode)
        processor transfers to the low power mode;

if (a cache load miss resolves)
    miss register value - 1;
    if (miss register value == 0 AND processor is in low power mode)
        processor returns to the normal mode;
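As an illustration, this state machine can be modeled in a few lines (a hypothetical Python sketch, not the simulator code used in this study; all names are invented):

```python
class LoadIndicator:
    """Behavioral model of the load indicator.

    A single hardware register counts outstanding L2 cache load misses.
    While the count is nonzero the processor runs in the low power mode
    (issue width and active functional units halved); when it drops back
    to zero, the full issue rate is restored immediately.
    """

    def __init__(self, full_issue_width=6):
        self.full_issue_width = full_issue_width
        self.outstanding_misses = 0  # the one added register

    def on_load_miss(self):
        """An L2 cache load miss is detected."""
        self.outstanding_misses += 1

    def on_miss_resolved(self):
        """A miss returns from memory, or a speculated miss is squashed."""
        self.outstanding_misses -= 1

    @property
    def mode(self):
        return "low" if self.outstanding_misses > 0 else "normal"

    @property
    def issue_width(self):
        w = self.full_issue_width
        return w // 2 if self.mode == "low" else w
```

For example, two overlapping misses keep the model in the low power mode until both resolve; the mode flips back to normal only on the last on_miss_resolved() call.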
The load indicator can identify the execution periods in which a program does not require
the full processing power. Several techniques can be applied during those periods to reduce
the processor power consumption. For instance, if the processor has a dual-speed pipeline
structure [57, 62], instructions can be issued to the slow pipeline when cache load misses
happen. In this study, we apply the technique of reducing the processor issue rate, because
it is very effective and can be applied to most architectures.
The complexity of the load indicator scheme is low, and the overhead is trivial. Only
one register is added to record the number of outstanding loads. The control logic is also
very simple because it only checks the register value and then makes the adaptation. There
is some additional logic to adjust the processor issue rate. However, compared with clock
gating each component on a cycle-by-cycle basis, the adjustment of the issue rate is done at a
coarser granularity. The additional power consumed by the scheme is negligible.

² The scheme does not distinguish between true and speculative load misses for simplicity. Our experiments show that this has only a small impact on power consumption and performance.

Scheme            High demanding window      Low demanding window       Idle window
                  Percent  IPC_I   IPC_F     Percent  IPC_I   IPC_F     Percent
Normal execution  60.4%    0.9852  0.7620    17.7%    0.2980  0.1945    21.9%
Load indicator    55.9%    0.9741  0.7644    23.8%    0.4070  0.2660    20.3%

Table 5.1: The distribution of high demanding, low demanding, and idle windows for program swim.
Table 5.1 presents the redistribution of work after applying the load indicator scheme,
using program swim as the example. Under normal execution, in 60.4% of the sampling
windows, either the issued integer IPC value (IPC_I) or the issued floating-point IPC value
(IPC_F) is higher than its corresponding threshold. On the other hand, the processor issues
almost no instructions in 21.9% of the sampling windows. This indicates that a large space is
left to shift some of the work from the high demanding windows to the low demanding or idle
windows. After applying the load indicator scheme, the percentage of sampling windows
with high issued IPC values drops to 55.9%, the percentage of sampling windows with
low IPC values increases to 23.8%, and the percentage of idle windows drops to 20.3%. The
average IPC values in the low demanding windows also increase. This indicates that more work
is done in low-IPC windows, and the processor gets more chances to run in the low power
mode.
5.3 Considerations for Load Indicator Scheme
A concern about the load indicator scheme is that reducing the processor issue rate when a
main memory access happens will delay finding the next main memory access and
might degrade performance. Another concern is that the existence of main memory
accesses might cause the processor issue rate to switch too frequently. However, these two
concerns do not affect the effectiveness of the load indicator scheme, because main
memory accesses are normally clustered together.
In [90], we study the burstiness of main memory accesses. The system configuration is
as follows. The processor is 4-way issue, running at 2 GHz. The instruction and data
caches are 4-way 64 KB with a 2-cycle access latency, and the L2 cache is 4-way 1 MB with
an 8-cycle access latency. The memory system is a 2-channel Direct Rambus DRAM system
with a 60 ns access latency.
We first measure the arrival intervals of cache misses so as to gauge the burstiness of
the miss stream. If a large fraction of accesses arrive within short intervals, the
miss stream is highly bursty. Because the DRAM access delay will slow down the
instruction execution and thus the rate at which the processor generates cache misses, we
configure a perfect DRAM system that has the latency of an L2 cache hit and infinite
bandwidth. Figure 5.3 shows the CDF (Cumulative Distribution Function) of the arrival
intervals of the memory accesses of fifteen selected SPEC2000 programs. The top and bottom
figures contain the floating-point and integer programs, respectively.
All of the applications have a dense distribution at short intervals. For example, mcf
has 22% of its distribution at the zero-cycle interval, which means that 22% of misses follow
some other misses in the same cycle.

Figure 5.3: CDF of arrival intervals of miss streams.

For swim, 10% of misses follow some other misses at the
same cycle. For mcf and swim, the average arrival intervals are nine cycles and 25 cycles,
respectively. Among the fifteen selected programs, eight have more than 25% of their L2
misses clustered within seven-cycle intervals, and four have more than 40% of their L2
misses clustered within seven-cycle intervals.
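This measurement amounts to taking the gaps between consecutive miss timestamps and accumulating their distribution. A minimal sketch of the calculation (a hypothetical helper with made-up miss times, not the instrumented simulator):

```python
from collections import Counter

def arrival_interval_cdf(miss_cycles):
    """CDF of the intervals (in cycles) between consecutive cache misses."""
    times = sorted(miss_cycles)
    intervals = [b - a for a, b in zip(times, times[1:])]
    counts = Counter(intervals)
    total = len(intervals)
    cdf, cumulative = {}, 0
    for gap in sorted(counts):
        cumulative += counts[gap]
        cdf[gap] = cumulative / total
    return cdf

# A bursty stream: a cluster of misses, then a long gap, then another cluster.
cdf = arrival_interval_cdf([100, 100, 101, 103, 400, 401])
```

Here 20% of the intervals are zero cycles (two misses in the same cycle), mirroring how the 22% figure for mcf is read off the CDF.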
Figure 5.4 further presents the distribution of the number of concurrent accesses in the
bursty phase, that is, when two or more outstanding memory references happen together.
The top figure contains the programs whose bursty-phase fraction is higher than 40%; the
bottom one contains the programs with lower bursty-phase fractions.

Figure 5.4: Distribution of the number of concurrent accesses.

Most programs have high burstiness within the bursty phase. In general, programs with a
higher fraction of bursty phase tend to have a higher probability of a large number of
concurrent accesses. For all the programs presented in the top figure, more than 40% of the
bursty references are grouped with
at least three other references. Even for some programs with a small bursty-phase fraction,
the burstiness inside the bursty phase is still high. For example, program bzip2 spends only
6% of its execution time in the bursty phase; however, more than 60% of its concurrent
accesses are clustered in groups of at least eight references.
These results indicate that multiple memory accesses normally happen within very short
intervals. There are two reasons for the high burstiness. First, when a miss happens, it is
highly likely that the current working set does not reside in the cache, so more misses are
expected to happen in the near future. Second, wide-issue processors execute multiple
instructions per cycle, causing cache misses to cluster.
The clustering of memory accesses makes the load indicator scheme effective in capturing
power-saving opportunities while still maintaining performance. Even when the processor
issue rate is reduced, the next memory access can still be found before the previous memory
accesses finish. In addition, the clustering of memory accesses also keeps the processor
issue rate from switching too frequently. We will discuss the power mode switch frequency in
more detail in Section 5.7.5.
5.4 Experimental Environment
We use sim-alpha as the simulator, which has been validated against a 466 MHz Alpha
21264 processor [26]. The Alpha 21264 is a superscalar processor that can issue
up to four integer instructions and two floating-point instructions every cycle. Its pipeline
contains seven basic stages. The on-chip instruction cache is 64 KB, 2-way set-associative
with a block size of 64 bytes, and contains a line predictor and a way predictor. The on-chip
data cache is 64 KB, 2-way set-associative with 64-byte blocks, and is backed up with an
8-entry victim buffer [45, 24]. The off-chip L2 cache is 2 MB, direct-mapped, with 64-byte
blocks. The memory bus is 128 bits wide and operates at 75 MHz [26]. Considering
current processor speeds, we also scale the processor speed to 1 GHz and 2 GHz in our
experiments. The cache access latency, DRAM access latency, and bus bandwidth are scaled
accordingly. The scaled cache access latencies are estimated with the CACTI 3.0 model [63].
We assume that the process technology for caches is also scaled down as the processor speed
increases. We also assume that the DRAM access latency and bus bandwidth improve by
30% when the processor speed increases from 466 MHz to 1 GHz, and again from 1 GHz to
2 GHz. Table 5.2 presents the key parameters of the experiments.

Processor speed         466 MHz / 1 GHz / 2 GHz
Processor issue rate    6-way
Functional units        4 int ALU, 4 int Mult; 1 fp ALU, 1 fp Mult
Integer queue           20 entries
Floating-point queue    15 entries
Reorder buffer          80 entries
Load queue size         32
Store queue size        32
Register file           80-entry integer, 72-entry floating-point
Instruction cache       2-way 64 KB, 64 B blocks, 1/1/2-cycle hit latency
Data cache              2-way 64 KB, 64 B blocks, 3/3/4-cycle hit latency
L2 cache                1-way 2 MB, 64 B blocks, 7/12/14-cycle hit latency
DRAM latency            2 (RAS), 4 (CAS), 2 (precharge), 2 (controller) bus cycles
Bus bandwidth           1.2/1.6/2.1 GB/s

Table 5.2: Experimental parameters.
We modify the simulator to model our schemes and the pipeline balancing technique.
The processor issue rate is dynamically adjusted when the conditions defined by the schemes
are satisfied. The modified version also collects statistics of the program's execution
behavior under each power-aware design for calculating the power consumption.

SPEC CPU2000 benchmark programs are used as the workloads [66]. We use the
precompiled Alpha version of the SPEC2000 binaries [77]. The reference input data files are
used in the experiments. The first four billion instructions are fast-forwarded; detailed
statistics are then collected over the next one billion instructions.
5.4.1 Power Savings
Reducing the processor issue rate reduces the power consumed by the issue logic and
execution units [6]. These are two major power consumers in the processor. Of the 72
Watts consumed on average by the 600 MHz Alpha 21264 processor, 18% is consumed
by the instruction issue units, 10% by the integer execution units, and another
10% by the floating-point execution units [32]. We use these data in calculating
power savings.
We assume that as the processor speed increases, the percentage of power consumed by
the issue logic and the execution units does not change. This assumption is conservative
and does not favor our schemes. As the processor speed increases, the percentage of power
consumed by the issue logic and execution units tends to increase. For example, the issue
units and execution units, including their driving clocks, consume 46% and 22% of the total
processor power for the projected Alpha 21464 processor [78]. As estimated in [6], the 46%
power share includes the power consumed by the register files and register mapping,
thus the issue logic consumes about 23% of total power.
The Alpha 21264 pipeline has a clustered structure. The integer registers and functional
units are organized as two clusters. The floating-point registers and functional units form
another cluster [45, 24]. We assume that reducing the processor issue rate only reduces
the power consumed by the issue logic and the integer functional units. This is also a
conservative assumption. When the issue width is reduced by half, one of the integer
clusters is gated.
One might argue that reducing the processor issue rate cannot bring much energy saving
when aggressive clock gating techniques are applied to each component on a cycle-by-cycle
basis. However, the increased design and verification complexity, clock skew, and inductive
noise make implementing such aggressive clock gating challenging [69, 13]. Thus, the best
granularity at which to apply clock gating is a trade-off between power saving and design
complexity [69]. In addition, gating at a coarser grain may save more power, since an entire
resource or block can be powered off [13, 6]. Even if aggressive clock gating is applied,
deactivating resources through architecture adaptation may save the power remaining after
clock gating [61].
As discussed in [6], since reducing the issue rate does not change the number of
instructions executed by the execution units, it can only reduce the power consumed by the
clocks driving the execution units. The clock network consumes 32% of the processor power.
Assuming the percentage of power consumed by clocks is constant for each component,
reducing the issue rate by half can halve the power consumed by the clocks driving the
integer execution units. This equals a saving in total processor power consumption of
10% x 32% x 1/2 = 1.6%.
                                        mcf    art    swim   lucas  applu  ammp   facerec
Time in low power mode (%)              91.9   73.8   88.1   79.9   70.2   47.8   30.8
Power reduction on issue logic (%)      32.2   25.8   30.8   28.0   24.6   16.7   10.8
Power reduction on execution units (%)  14.7   11.8   14.1   12.8   11.2   7.6    4.9

Table 5.3: Percentage of execution time in low power mode, and percentage of power reduction on the issue logic and execution units under the load indicator scheme, for the system with the 1 GHz processor.
It is normally unknown how many instructions can be issued in a given cycle until all
the choices are exhausted. Thus, reducing the issue rate reduces the number of issue
attempts as long as the program execution time is not extended. Therefore, reducing the
issue rate can save power on both the issue arbiters and the clocks of the issue logic. For
the issue logic, the arbiters, including their driving clocks, consume 70% of the issue queue
power. Thus, reducing the issue rate by half can reduce the total processor power
consumption by 18% x 70% x 1/2 = 6.3%. The total processor power reduction from halving
the issue rate is therefore 6.3% + 1.6% = 7.9%. This estimation method is the same as that
used in [6].
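The estimate can be reproduced directly from the component shares quoted above (18% issue logic, 10% integer execution units, 32% clock network, 70% arbiters-plus-clocks within the issue queue); a sketch of the arithmetic:

```python
ISSUE_LOGIC_SHARE = 0.18     # issue units' share of total processor power
INT_EXEC_SHARE = 0.10        # integer execution units' share
CLOCK_SHARE = 0.32           # clock network's share of processor power
ARBITER_CLOCK_SHARE = 0.70   # arbiters + driving clocks within the issue queue

# Halving the issue rate halves the clock power driving the integer units...
exec_saving = INT_EXEC_SHARE * CLOCK_SHARE * 0.5              # 1.6% of total
# ...and halves the arbiter/clock activity of the issue logic.
issue_saving = ISSUE_LOGIC_SHARE * ARBITER_CLOCK_SHARE * 0.5  # 6.3% of total
total_saving = exec_saving + issue_saving                     # 7.9% of total
```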
5.5 Effectiveness of Load Indicator Scheme
In this section, we discuss the effectiveness of the load indicator scheme in reducing
the processor power consumption of memory-intensive applications. Results for a wider
range of applications will be presented in Section 5.7. All results are for the system with
the 1 GHz processor unless mentioned otherwise.

From the SPEC2000 benchmark suite, we select seven memory-intensive applications,
whose memory stall portions are higher than 20% under our experimental setup³.

³ The memory stall portion of a program is defined as the percentage of time difference between running the program under a real system and under a system with an infinitely large L2 cache.

Figure 5.5: IPC values under the system with 1 GHz processors.

Table 5.3 shows the percentage of power reduction on the issue logic and execution units for these
seven applications by the load indicator scheme. The load indicator scheme puts the proces
sor in the low power execution mode at 30.8% (for program facerec) to 91.9% (for program
mcf) of the total execution time. On average, the processor spends 68.9% of time in the low
power mode. This translates to 24.1% (35% x 68.9%) and 11.0% (16% x 68.9%) average
reduction on the power consumed by the issue logic and execution units compared with the
normal execution scheme, respectively. Thus, the load indicator scheme is very effective in
capturing power saving opportunities for memory-intensive applications.
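Each row pair in Table 5.3 follows mechanically from a program's low-power-mode residency: while in the low power mode, the issue logic saves 35% (half of its 70% arbiter-and-clock share) and the execution units save 16% (half of the 32% clock share). A sketch of that calculation:

```python
def component_reductions(low_power_fraction):
    """Power reduction on the issue logic and execution units, given the
    fraction of execution time spent in the low power mode."""
    issue_logic = (0.70 / 2) * low_power_fraction   # 35% x residency
    exec_units = (0.32 / 2) * low_power_fraction    # 16% x residency
    return issue_logic, exec_units

# mcf spends 91.9% of its execution time in the low power mode (Table 5.3).
issue_logic, exec_units = component_reductions(0.919)
```

This reproduces the 32.2% and 14.7% entries for mcf in Table 5.3.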
For general-purpose computer systems, the reduction in processor power consumption
must not cause severe performance loss. Figure 5.5 shows the IPC values of the seven
memory-intensive applications under the different schemes. Compared with the scheme in
which the processor always executes in the normal mode (normal), the scheme that always
reduces the processor issue rate by half (always-halving) decreases IPC by up to
13.9% (program facerec). For facerec, the always-halving scheme saves 7.9% of power
but at the price of a 13.9% performance loss. This results in a 7.0% increase in the total
energy consumption. Thus, simply reducing the processor issue rate is not a solution for
saving power and energy. The processor issue rate must be reduced at the right time in
order to retain performance and save energy.
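The 7.0% figure follows from energy = power x time: a 7.9% power saving combined with a 13.9% longer run still costs energy overall. A sketch:

```python
def energy_change(power_saving, perf_loss):
    """Relative change in total energy, given a fractional power saving
    and a fractional performance (execution-time) loss."""
    relative_power = 1.0 - power_saving
    relative_time = 1.0 / (1.0 - perf_loss)   # a slower run takes longer
    return relative_power * relative_time - 1.0

# facerec under always-halving: 7.9% power saved, 13.9% performance lost.
delta = energy_change(0.079, 0.139)           # about +7.0% energy
```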
The average performance loss of the load indicator scheme on the seven programs is
only 0.5%. The maximum IPC decrease is only 1.7% (program lucas). The performance
of program applu actually improves slightly, because fewer mis-speculated instructions are
issued and executed. As an example, program facerec executes in the low power mode for
30.8% of its total execution time. If the 13.9% performance loss of the always-halving
scheme were evenly distributed over the whole execution period, reducing the issue rate for
30.8% of the time would cost 4.3% in performance. However, the load indicator scheme
decreases the program's performance by only 0.9%. This indicates that the load indicator
reduces the processor issue rate at the right time, and can save power and retain
performance simultaneously.
Parameters    Values
Sample width  64 cycles
EC            (I_IPC < 1.1) & (FP_IPC < 0.4)
DC            (I_IPC > 1.2) | (FP_IPC > 0.5)

Table 5.4: Parameters for the instruction indicator. EC and DC are the conditions for enabling and disabling the low power mode. I_IPC and FP_IPC are the numbers of integer and floating-point instructions issued per cycle, respectively.
5.6 Combining Local and Look-ahead Optimizations
5.6.1 Motivation
In previous sections, we have shown that the existence of outstanding cache load
misses is a strong indicator of future variations in a program's requirements on processor
resources. Based on this information, the load indicator scheme dynamically adjusts the
processor issue rate to save power without degrading performance. However, this
technique only has a significant impact on memory-intensive applications. Programs
with negligible memory stall times seldom have a chance to enter the low power mode. Thus,
their power consumption and performance will not be affected.
Figure 5.6 shows the percentage of program execution time spent in the low power mode
under our scheme and under the instruction indicator for fifteen SPEC2000 applications with
a wide range of computation and memory demands. The programs are sorted by their memory
stall portions, which range from 71% for mcf down to 3% for twolf. The parameters used in
the instruction indicator scheme are summarized in Table 5.4; they have been tuned for
these applications under this system configuration.
For ten of the fifteen programs, the load indicator puts the processor into the low power
mode at least as frequently as the instruction indicator does.

Figure 5.6: Percentage of total execution time in low power mode under our scheme (load) and the instruction indicator (instruction).

For
programs losing more than half of their total execution time to memory stalls (mcf and art),
our scheme and the instruction indicator put the processor into the low power mode for
almost the same amount of time. Because main memory accesses are so frequent for these
two applications (more than five accesses per 100 instructions), the low IPC values are
mainly caused by the frequent memory stalls. Thus, the load indicator and the instruction
indicator capture the same effect from different perspectives. The look-ahead
ability of our scheme does not gain much, because the execution demand is already low
during most of the execution time. For programs with memory stall portions ranging
from 35% to 40% (applu to swim), our scheme captures many more opportunities to put the
processor into the low power mode than the instruction indicator does. As discussed in
Section 5.2, this mainly comes from the prediction of future variations and the redistribution
of current work. For programs with modest memory stall portions (8% for mgrid to 29% for
ammp), the load indicator scheme captures more power-saving opportunities for four
applications and fewer for three applications. For applications with low memory stall
portions (about 3% for vpr to twolf), our scheme captures fewer opportunities to put the
processor into the low power mode, since outstanding load misses exist for only a small
percentage of the time.
Although using memory access information to guide the processor issue rate adjustment
is simple and timely, it fails to capture another important factor that also causes
variations in a program's resource requirements: the change in the program's inherent
instruction-level parallelism (ILP). In contrast, the metric of instructions per cycle (IPC)
reflects the overall effects of both the change in ILP and the change in program behavior
due to limited hardware resources [6]. However, in order to smooth out the spikes of IPC
variation caused by temporary factors, the IPC values used to trigger issue rate switching
must be monitored and averaged over a large enough time window. Thus, simply using IPC
variation as the indicator may also lose opportunities to save power or retain performance
due to delayed captures and triggers.
In comparison, the load indicator scheme is application- and system-independent, while
the instruction indicator scheme needs system- and application-dependent parameters. The
load indicator scheme captures future program variations caused by memory accesses,
while the instruction indicator scheme captures current program variations caused by the
program's inherent ILP changes and hardware constraints.
5.6.2 Load-instruction Indicator
With knowledge of future variations, it is possible to redistribute processor work in order
to generate more power-saving opportunities. With knowledge of current variations,
it is possible to capture current power-saving opportunities. Combining these two kinds
of optimizations, we propose another technique, called the load-instruction indicator,
that utilizes both memory access information and IPC values to guide processor issue
rate switches.
The scheme works as follows. When there are no outstanding load misses, the hardware
monitors the issued IPC and adjusts the processor issue rate, as in the instruction indicator
technique. When a cache load miss happens and the processor is in the normal execution
mode, the processor switches to the low power mode. During the service of cache load
misses, the monitoring of issued IPC is suspended. It resumes when all the outstanding
load misses finish. By combining local and look-ahead optimizations, the switching of the
processor issue rate can be done more effectively and in a more timely manner. The overhead
of the scheme is negligible, since both the load indicator and the instruction indicator have
trivial overhead.
There are several choices for suspending and resuming the IPC monitoring. One choice
is to clear all the counters when a cache load miss is detected. Another is to freeze
the counter values while cache load misses are being served. We tested both choices and
found that they perform almost identically. In our experiments, all the counters are cleared
when a load miss is detected.
if (a load miss happens) {
    if (counter for load misses == 0)
        record current power mode as previous mode;
    counter for load misses + 1;
    if (in normal mode)
        transfer to low power mode;
    reset or stop counters for sampling cycles and issued instructions;
}

if (a load miss is resolved) {
    counter for load misses - 1;
    if (no outstanding load misses) {
        return to normal mode (OR previous mode);
        restart counters for sampling cycles and issued instructions;
    }
}

if (no outstanding load misses) {
    if (reach sampling window boundary) {
        if (in normal mode AND issued instructions below deactivating thresholds)
            transfer to low power mode;
        if (in low power mode AND issued instructions above activating thresholds)
            return to normal mode;
        reset counters for sampling cycles and issued instructions;
    } else {
        sampling cycles + 1;
        issued instructions + number of instructions issued in this cycle;
    }
}
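As a behavioral sketch, this policy can be modeled per cycle as follows (hypothetical Python; the 64-cycle window and EC/DC thresholds are the tuned values from Table 5.4, and this version returns to the normal mode, rather than the saved previous mode, when the last miss resolves):

```python
class LoadInstructionIndicator:
    """Per-cycle model combining the load and instruction indicators."""

    def __init__(self, window=64,
                 ec=(1.1, 0.4),    # EC: both IPCs below -> enter low power
                 dc=(1.2, 0.5)):   # DC: either IPC above -> back to normal
        self.window, self.ec, self.dc = window, ec, dc
        self.misses = 0
        self.mode = "normal"
        self.cycles = self.int_issued = self.fp_issued = 0

    def load_miss(self):
        self.misses += 1
        self.mode = "low"               # look-ahead part: miss -> low power
        self.cycles = self.int_issued = self.fp_issued = 0  # clear counters

    def miss_resolved(self):
        self.misses -= 1
        if self.misses == 0:
            self.mode = "normal"        # resume full rate and IPC monitoring

    def cycle(self, int_issued, fp_issued):
        if self.misses:                 # monitoring suspended during misses
            return
        self.cycles += 1
        self.int_issued += int_issued
        self.fp_issued += fp_issued
        if self.cycles == self.window:  # local part: window-based IPC test
            i_ipc = self.int_issued / self.window
            fp_ipc = self.fp_issued / self.window
            if self.mode == "normal" and i_ipc < self.ec[0] and fp_ipc < self.ec[1]:
                self.mode = "low"
            elif self.mode == "low" and (i_ipc > self.dc[0] or fp_ipc > self.dc[1]):
                self.mode = "normal"
            self.cycles = self.int_issued = self.fp_issued = 0
```

For instance, an idle 64-cycle window drives the model into the low power mode, a subsequent high-IPC window restores the normal mode, and any outstanding load miss forces the low power mode until it resolves.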
5.7 Comparisons between Different Schemes
In this section, we compare the three schemes we have discussed so far: the load indicator,
instruction indicator, and load-instruction indicator.
5.7.1 Power Savings
Figure 5.7 shows the processor power reductions under three different schemes for the fifteen
programs. We can see that for the seven memory intensive applications, the load indicator
reduces the processor power consumption by 5.4%, compared with a 4.2% power saving by
the instruction indicator scheme. In some cases, the power reduction difference is large.
For example, the load indicator saves 7.0% of total power for program swim, while the
instruction indicator only saves 2.1% of total power. As we have discussed in Section 5.2,
for this program, 60.4% of sampling windows have IPC values higher than the thresholds
under the normal execution, although almost no instructions are issued in 21.9% of sampling
windows. Thus, under the instruction indicator scheme, less than 40% of time windows can
be executed in the low power mode. In contrast, the load indicator scheme foresees the
low-demand and idle periods and delays current work. Thus, the processor can stay in the
low power mode much longer without causing severe performance loss. For three programs
whose memory stall portions are only about 3% (vpr to twolf), the load indicator saves less
power than the instruction indicator.

Figure 5.7: Percentage of power reduction.

For all the fifteen applications, the power reduction
by the load indicator ranges from 0.5% (twolf) to 7.3% (mcf). The instruction indicator
saves power by 0.6% (mgrid) to 7.4% (mcf). The average power reductions by the load
indicator and the instruction indicator are 3.3% and 3.1%, respectively.
The load indicator performs better for memory-intensive applications, while the instruction
indicator performs better for programs with small memory stall portions. The load-instruction
technique combines the merits of both techniques and achieves higher power
saving than either technique alone. The power reduction by the load-instruction technique ranges
from 1.5% ( wupwise) to 7.3% (mcf). The average power saving is 4.4%.
Figure 5.8: Normalized IPC values.
5.7.2 Performance Impact
Figure 5.8 presents the IPC values under the three schemes for the fifteen programs. The
results under the always-halving scheme are also presented as a reference. All the values are
normalized to the IPC results under the processor that is always in the normal execution
mode.
The load indicator causes a performance loss of up to 1.7% (lucas). The average
performance loss is 0.5%. The instruction indicator degrades performance by up to 3.7%
(facerec). The average performance loss is 0.9%. The load-instruction technique causes a
performance loss of up to 4.1% (vpr). The average performance loss is 1.1%.

Figure 5.9: Percentage of energy reduction.

Among the
fifteen applications, the load indicator scheme achieves higher IPC values than the instruction
indicator scheme on six programs. The load-instruction indicator scheme yields the lowest
performance on only six programs, although it puts the processor into the low power mode
most frequently.
5.7.3 Energy Reduction
The energy reduction is shown in Figure 5.9. The load indicator reduces the processor energy
consumption by 0.1% (wupwise) to 7.1% (applu). The maximum energy reduction by the
instruction indicator is 6.3% (mcf). The load-instruction technique reduces the processor
energy consumption by up to 7.2% (applu). The load indicator and the load-instruction
indicator do not increase energy consumption for any program. However, the instruction
indicator slightly increases the energy consumption for program facerec (by 0.5%). This is
because the energy saved by the scheme cannot offset the increase due to the performance
loss. Among the fifteen programs, the load-instruction technique achieves the lowest energy
consumption for eleven programs. The average energy reduction by the load-instruction
technique is 3.2%, compared with 2.9% by the load indicator and 2.2% by the instruction
indicator.
For high performance systems, reducing power is mainly for temperature control. In this
respect, the energy-delay product (EDP) and the energy-delay-squared product (ED2P) are
two suitable metrics [13]. The load indicator, instruction indicator, and load-instruction
indicator reduce the average EDP by 2.4%, 1.3%, and 2.0%, and the average ED2P by
1.9%, 0.3%, and 1.0%, respectively.
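As a sanity check, the reported EDP and ED2P reductions can be reproduced from the average energy reduction and performance loss given earlier; the sketch below treats the IPC loss as the relative increase in execution time, which is an approximation:

```python
# Relative EDP  = (1 - e) * (1 + d)  and  relative ED2P = (1 - e) * (1 + d)**2,
# where e is the fractional energy reduction and d the fractional delay increase.

def edp_reduction(e, d):
    """Fractional EDP reduction given energy reduction e and delay increase d."""
    return 1 - (1 - e) * (1 + d)

def ed2p_reduction(e, d):
    """Fractional ED2P reduction given energy reduction e and delay increase d."""
    return 1 - (1 - e) * (1 + d) ** 2

# Load indicator: 2.9% average energy reduction, 0.5% performance loss.
print(round(edp_reduction(0.029, 0.005) * 100, 1))   # 2.4 (% EDP reduction)
print(round(ed2p_reduction(0.029, 0.005) * 100, 1))  # 1.9 (% ED2P reduction)
```

The same formulas recover the instruction-indicator numbers (2.2% energy reduction and 0.9% performance loss give roughly a 1.3% EDP reduction).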
5.7.4 Different Configurations
Next, we will compare the effectiveness of the load indicator, instruction indicator, and
load-instruction indicator under different system configurations. We only present the power
saving results here. Figure 5.10 shows the percentage of reduction on the whole processor
power consumption. For clarity, we only present five representative applications in the
figure, covering applications with high, moderate, and low memory access portions.
The average values are for all the fifteen applications. When the processor speed scales
from 466 MHz to 1 GHz and 2 GHz, the average power reduction by the load indicator
increases from 5.1% to 5.4% and 5.8%.

Figure 5.10: Power reduction under systems with processor speed scaling from 466 MHz to 1 GHz and 2 GHz.

As the speed gap between processor and memory
widens, the processor spends more time waiting for memory accesses to return. Thus, the
effectiveness of the load indicator in reducing power and energy consumption increases.
As the processor speed scales from 466 MHz to 1 GHz and 2 GHz, the average power
reduction by the instruction indicator increases from 2.9% to 3.1% and 5.8%; and the
average power reduction by the load-instruction indicator increases from 4.0% to 4.4% and
6.6%. The load-instruction indicator achieves the largest power saving under all of these
system configurations.
Figure 5.11: Average intervals in the low power mode and the normal execution mode under different schemes.
5.7.5 Average Intervals in Each Power Mode
Figure 5.11 compares the average intervals in the low power mode and the normal execution
mode under the three schemes. On average, the programs stay at the low power mode for
236 cycles before returning to the normal execution mode under the load indicator scheme.
The average interval in the normal execution mode is 631 cycles. Under the instruction
indicator scheme, the average intervals in the low power mode and the normal execution
mode are 859 cycles and 1101 cycles, respectively. The average low power interval of the
load-instruction scheme is 260 cycles, and the average normal execution interval is 196
cycles.
In general, the load indicator scheme generates shorter intervals in each power mode than
the instruction indicator scheme, indicating that it is more responsive. On the other hand,
the load indicator scheme causes the programs with low memory stall portions to stay in
the normal execution mode longer than the instruction indicator scheme, because the load
indicator cannot capture changes of program behavior caused by variations in program ILP.
For most programs, the load-instruction indicator scheme causes the most frequent power
mode switching. It can capture variations in program behavior more promptly than the
other two schemes, and covers both compute- and memory-intensive applications.
5.8 Load-register Indicator Scheme
In previous sections, we have already shown that the load indicator scheme is an effective
look-ahead technique in capturing the power saving opportunities caused by main memory
accesses. The load indicator can be further combined with the instruction indicator to save
power for both compute- and memory-intensive applications. A different direction to extend
the load indicator in order to cover a wider range of applications is to combine it with other
early indicators that reflect changes of program behavior. One such indicator is the
usage of hardware resources.
Each on-the-fly instruction occupies some hardware resources, such as physical registers,
reorder buffer entries, and load/store queue entries. When there are only a small number of
free entries left in those hardware resources, it is very likely that the processor will stall in
the near future. The accumulation of occupied entries may come from a lack of ILP among
the on-the-fly instructions or from the lack of other hardware resources.
For the system used in our study, we find that the number of free physical registers can
be used to identify the execution periods in which programs only require partial processor
issue width. We propose a scheme that uses both the register usage information and the
cache load miss information to trigger the issue rate switching. We call this scheme the
load-register indicator. The scheme reduces the processor issue rate when the number of free
registers drops below a threshold or when there are outstanding load misses. The full issue
rate is resumed when the number of free registers increases above another threshold and
when there are no outstanding load misses. Considering that a shortage of functional
units may also cause instructions to hold registers for a long time, the scheme checks the
availability of functional units when the number of free registers drops below the switching
threshold. If any instruction fails to get the required functional unit in the previous cycle,
the processor will stay at the normal execution mode.
if (a cache load miss happens) {
    miss register value += 1
    if (processor is in normal mode)
        processor transfers to the low power mode
}
if (number of free registers < threshold_low AND no FU shortage) {
    if (processor is in normal mode)
        processor transfers to the low power mode
}
if (a cache load miss resolves) {
    miss register value -= 1
    if (miss register value == 0 AND number of free registers > threshold_high
            AND processor is in low power mode)
        processor returns to the normal mode
}

Scheme                    Enabling low power mode                       Disabling low power mode
Load indicator            LD > 0                                        LD = 0
Load-register indicator   (LD > 0) | (Ireg < TE_IR) | (Freg < TE_FR)    (LD = 0) & (Ireg > TD_IR) & (Freg > TD_FR)

Table 5.5: The conditions for enabling and disabling the low power mode under different schemes. LD is the number of outstanding load misses. Ireg and Freg are the numbers of free integer and floating-point registers. TE_xx and TD_xx are thresholds for enabling and disabling the low power mode, respectively.
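The load-register logic above can likewise be written as a small executable sketch. The threshold values follow Table 5.6 (enable below 8 free integer or 4 free floating-point registers; disable above 12 and 8, respectively, with no outstanding misses); the class and method names are illustrative:

```python
# Sketch of the load-register indicator: throttle on a cache load miss or on
# register shortage (unless a functional-unit shortage explains the pressure);
# resume only when no misses are outstanding and both register pools recover.

class LoadRegisterIndicator:
    INT_LOW, INT_HIGH = 8, 12    # free integer-register thresholds
    FP_LOW, FP_HIGH = 4, 8       # free floating-point-register thresholds

    def __init__(self):
        self.low_power = False
        self.load_misses = 0     # outstanding cache load misses

    def on_load_miss(self):
        self.load_misses += 1
        self.low_power = True    # look-ahead: throttle immediately

    def on_load_miss_resolved(self):
        self.load_misses -= 1

    def on_cycle(self, free_int, free_fp, fu_shortage):
        # Register shortage triggers low power only when no functional-unit
        # shortage was observed in the previous cycle.
        if (free_int < self.INT_LOW or free_fp < self.FP_LOW) and not fu_shortage:
            self.low_power = True
        elif (self.load_misses == 0 and free_int > self.INT_HIGH
              and free_fp > self.FP_HIGH):
            self.low_power = False
```

Because the free-register counts are sampled every cycle rather than per sampling window, this controller reacts at a finer time granularity than the IPC-based instruction indicator.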
In contrast to the load indicator scheme, which needs no system-dependent parameters,
the load-register scheme requires system-dependent threshold values to trigger the processor
issue rate switching. However, the load-register scheme has effects on a wider range of
programs, as shown later in the experimental results. The increase of hardware complexity
of the load-register scheme is also trivial: only counters for the free registers and
comparators against the thresholds are needed.
The load-register scheme can capture variations in program resource requirements
caused by both changes in ILP and limited hardware resources. In addition, it can
identify program behavior changes more promptly than the instruction indicator, because
the existence of cache load misses and the number of free registers can be monitored at a
finer time granularity than the IPC values.
Table 5.5 summarizes the conditions for enabling and disabling the low power mode
under the load indicator and the load-register indicator schemes.
5.9 Effectiveness of Load-register Scheme
In this section, we show that the combination of cache load misses and the number of
free physical registers can be used to identify power saving opportunities for both compute-
and memory-intensive applications.
5.9.1 Power Savings
Figure 5.12 compares the processor power reduction under the load-register scheme with
the load indicator and the load-instruction indicator schemes. For all the fifteen applica
tions, the load-register indicator scheme saves more power than the load indicator scheme.
Compared with the load-instruction indicator scheme, the load-register indicator scheme
saves more or at least comparable power for thirteen programs. Only for two programs,
vpr and twolf, the load-register scheme saves less power than the load-instruction indicator.
The power reduction by the load-register scheme ranges from 2.4% (for gcc) to 7.8% (for
swim). The average power reduction is 5.5%, which is 69% of the maximum power saving
that can be achieved by halving the processor issue rate. In comparisons, the load indicator
scheme and the load-instruction indicator scheme only achieve 42% and 56% of the maxi
mum power saving, respectively. The power savings for ten programs by the load-register
indicator scheme is higher than 5%. In contrast, only five programs get power saving higher
than 5% under the load indicator scheme; and five programs have power saving higher than
5% under the load-instruction indicator scheme.
Figure 5.12: Percentage of power reduction under the load indicator, load-instruction indicator, and load-register indicator schemes on systems with 1 GHz processors.
Figure 5.13 further shows the decomposition of program execution time in the low power
mode under the load-register indicator scheme for the fifteen SPEC2000 applications. The
load and register portions correspond to the low power execution periods triggered only by
the load indicator and the register indicator, respectively. The overlap portion corresponds
to those caused by both triggers. In general, the existence of cache load misses triggers more
transitions to the low power mode for applications with high memory stall portions, while
the shortage of free registers plays a more important role for applications with low memory
stall portions.
Table 5.6 summarizes the parameters used in the load-register scheme.
Figure 5.13: Percentage of total execution time in the low power mode under the load-register indicator scheme. The load and register portions correspond to the low power execution periods triggered only by the load indicator and the register indicator, respectively. The overlap portion corresponds to those caused by both triggers.
5.9.2 Energy Savings
The reduction of energy consumption is shown in Figure 5.14. The load-register technique
reduces the processor energy consumption by up to 8.8% (applu). However, it increases the
energy consumption for programs wupwise and vortex by 3.0% and 0.2%, respectively. This
is because the energy saved by the scheme cannot offset the increase due to the performance
loss. The average energy reduction by the load-register technique is 3.0%, compared with
2.9% by the load indicator and 3.2% by the load-instruction indicator.
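The energy increase for wupwise and vortex follows directly from energy being average power times execution time: a power reduction is wiped out once the runtime grows by more than roughly the same fraction. A tiny sketch with illustrative (not measured) numbers:

```python
# Relative energy change under a power-saving scheme that also slows the
# program down: energy = average power x execution time.

def energy_change(power_reduction, slowdown):
    """Fractional energy change: negative means saving, positive an increase."""
    return (1 - power_reduction) * (1 + slowdown) - 1

print(round(energy_change(0.05, 0.02), 3))  # power -5%, runtime +2%: -0.031 (saving)
print(round(energy_change(0.02, 0.05), 3))  # power -2%, runtime +5%: 0.029 (increase)
```

This is why the load-register scheme, which throttles more aggressively, can lose energy on programs whose performance is sensitive to the reduced issue rate.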
Enabling:   (LD > 0) | (Ireg < 8) | (Freg < 4)
Disabling:  (LD = 0) & (Ireg > 12) & (Freg > 8)

Table 5.6: The conditions for enabling and disabling the low power mode by the load-register indicator scheme. LD is the number of outstanding load misses. Ireg and Freg are the numbers of free integer and floating-point registers, respectively.
In comparison, both the load-instruction indicator and the load-register indicator can
effectively reduce power consumption for compute- and memory-intensive applications. Both
need system-related parameters to guide the issue rate adjustment. The load-register scheme
can make the adjustment at a finer grain than the load-instruction indicator. Our experimental
results indicate that further combining the load indicator, the instruction indicator, and
the register indicator brings only diminishing power-saving returns compared with either the
load-instruction or the load-register scheme.
5.10 Related Work
Recently, there has been a growing number of studies on reducing the power
consumption of general-purpose processors. Pipeline gating reduces the processor energy
consumption by preventing wrong-path instructions from entering the pipelines [51]. The
authors of [29] propose power saving techniques that avoid unnecessary comparisons in the
wake-up logic. Dynamically adjusting the issue queue size, the reorder buffer size, and the
load/store queue size is an effective approach to reducing the power consumption of the issue
logic [17, 29, 56].
Figure 5.14: Percentage of energy reduction under the load indicator, load-instruction indicator, and load-register indicator schemes on systems with 1 GHz processors.

The authors of [30] propose a method to reduce processor energy consumption by switching
the processor execution between out-of-order and in-order execution modes under the
guidance of an external performance indicator. The dual-speed pipeline processor saves
power by executing non-critical instructions in slow components and retains performance
by executing critical instructions in fast components [57, 62]. The Pentium 4 processor
uses the StopClock mechanism to reduce power consumption, halting the clock signal
to a large portion of the processor logic for a short period [33]. The pipeline balancing
technique dynamically adjusts the processor issue rate for each sampling window based on
the monitored IPC [6]. The authors of [8] propose to use the current rate of instructions
passing through pipeline stages to throttle the processor front-end for power saving. The
authors of [38, 61] target energy saving for multimedia applications by using dynamic voltage
scaling and architectural adaptation at the granularity of a frame and a fixed period within
the frame.
Buyuktosunoglu et al. propose a combination of fetch gating and issue queue adaptation
to reduce the power consumed by the front-end instruction delivery path [16]. Huang et al.
propose a positional adaptation technique that chooses configurations based on particular
code sections [35].
The effectiveness of the load indicator scheme in reducing processor power consumption
depends on the memory system performance. Recently, researchers have proposed utilizing
the idle intervals to perform precomputation-based prefetching [4, 7, 23, 49, 91]. Those
are performance-oriented techniques that perform a large number of speculative (possibly
duplicate or useless) operations, and thus may not be efficient from an energy consumption
point of view. In contrast, our work targets energy saving by utilizing the idle intervals.
5.11 Summary
Adjusting the processor issue rate based on the resource requirements of programs can
effectively reduce the processor power consumption with negligible performance loss. Our
study has shown that the existence of cache load misses is a strong indicator of a reduction
in near-future resource requirements. For memory-intensive applications, simply applying
this indicator can capture most program behavior variations promptly. Compared with the
pipeline balancing technique using the instruction indicator, the load indicator can save more
power with comparable performance losses.
The load-instruction scheme adaptively utilizes both the load indicator and the instruction
indicator used in the pipeline balancing technique. Compared with the load indicator,
this scheme can capture the current power saving opportunities caused by variations in
program ILP. Compared with pipeline balancing, it can identify future program variations
and redistribute current work. As shown by our experimental results, the load-instruction
scheme works well for both compute- and memory-intensive applications.
The number of free registers can also promptly indicate variations in a program's future
resource requirements. The load-register indicator utilizes the information of both cache load
misses and register utilization, and is also effective for both compute- and memory-intensive
applications.
The load indicator, the load-instruction indicator, and the load-register indicator
schemes reduce the processor power consumed by the issue logic and execution units. They
can be applied together with other techniques that reduce the power consumed by other
components, such as on-chip caches, to further reduce the total processor power consumption.
Chapter 6
Conclusion and Future Work
6.1 Conclusion
With the dramatic improvements in fabrication techniques and architecture designs,
computer system performance has been increasing exponentially for several decades. A negative
byproduct of this fast performance improvement, nevertheless, is an undesirable increase in
power consumption. High power consumption shortens the battery lifetime of portable systems
and raises cooling and packaging costs on performance-oriented systems. To counter
this trend, researchers have proposed a number of techniques at various design levels. As
a natural bridge between circuits and applications, architectural designs play an important
role in reducing system power consumption. They can effectively utilize low-power circuit
techniques by exploiting application characteristics.
A principle in low-power architecture designs is to avoid unnecessary operations
whenever possible. For general-purpose systems, the system designs are optimized to achieve the
best performance for a wide range of applications. Thus, it is common that a significant
amount of hardware resources are underutilized for a specific application or during a certain
execution period. However, those underutilized resources will consume the same amount
of power if no power-aware designs are applied. In this dissertation, we have discussed two
memory-related low-power microarchitecture designs that reduce unnecessary accesses to
caches and processor pipelines.
Set-associative caches are widely used in modern computer systems because they can
reduce cache conflict misses and improve the memory system performance. However, the
conventional implementations of set-associative caches are not energy-efficient by nature.
Our first work proposes an access mode prediction (AMP) cache structure to address this
issue. We find that the way-prediction technique works well only for applications with
good locality, while the phased technique works well only for applications with poor locality.
The cache energy consumption can be minimized when a cache hit is handled by the
way-prediction technique and a cache miss is handled by the phased technique. The AMP
cache dynamically applies these two techniques for the next cache reference based on the
cache hit/miss prediction. We also optimize a way-prediction scheme, multicolumn-based
way-prediction, for energy reduction. Our experimental results indicate that the AMP
cache with multicolumn way-prediction is a nearly optimal cache structure in terms of both
energy consumption and performance, compared with existing solutions.
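The energy argument behind the AMP policy can be illustrated with a toy access-energy model. The associativity, per-array energies, and fallback behavior below are illustrative assumptions for a conventional set-associative cache, not the dissertation's measured parameters:

```python
# Toy per-access energy model for a set-associative cache (relative units).
# Way prediction probes one predicted way first; phased access reads all tags
# first and then at most one data way.

WAYS = 4
E_TAG, E_DATA = 1, 4          # assumed relative energy of one tag / data array

def access_energy(mode, hit, way_correct=True):
    if mode == "way_prediction":
        first = E_TAG + E_DATA            # probe the predicted way
        if hit and way_correct:
            return first
        return first + (WAYS - 1) * (E_TAG + E_DATA)  # probe remaining ways
    if mode == "phased":
        return WAYS * E_TAG + (E_DATA if hit else 0)  # tags first, then data
    raise ValueError(mode)

# Way prediction is cheap on hits, phased access is cheap on misses, which is
# why predicting hit/miss and choosing the mode per access can approach the
# minimum of both:
assert access_energy("way_prediction", hit=True) < access_energy("phased", hit=True)
assert access_energy("phased", hit=False) < access_energy("way_prediction", hit=False)
```

Under this model an AMP-style policy pays the way-prediction cost on predicted hits and the phased cost on predicted misses, matching the intuition stated above.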
Our second work focuses on reducing power consumption in the processor pipeline.
Dynamically adjusting the processor issue rate is an effective technique to reduce the power
consumed by the issue logic and execution units. We find that a simple scheme based on
the existence of main memory load accesses can predict future declines in computation
demand and capture the power saving opportunities. Our study indicates that this load
indicator scheme can effectively reduce the power consumption for memory-intensive
applications with negligible performance impact. The load indicator scheme can be further
combined with other indicators to save power for both compute- and memory-intensive
applications. The load-instruction indicator scheme utilizes both the existence of memory
accesses and the number of instructions issued in each time window. The load-register
scheme uses the memory access information and the number of unoccupied registers to guide
the issue rate switching. Unlike the load indicator scheme, the latter two schemes
need some system-dependent parameters to make the adaptation. However, they can cover
a wider range of applications.
The techniques we have proposed in this dissertation can be applied together with each
other or with power-saving techniques that target other components to further reduce the
power consumption of the whole system.
6.2 Future Work
This dissertation has focused on hardware approaches. Based on our experience with
low-power designs, further significant improvements are likely to come from hardware/software
cooperation at different levels, which has been studied mainly for performance
objectives. At the system level, for example, operating systems may actively turn on/off
computing nodes, hard disks, or a portion of the processors in multiprocessors, as long as
future performance demand is predictable. At a lower level, profiling tools may analyze phases
of runtime power demands for a particular application, and the processor power can then be
adjusted to the phases with appropriate hardware support. Currently, it is hard to predict
the performance loss of a program after applying an architecture adaptation. This is
undesirable, especially for applications requiring guaranteed or predictable response time.
We want to first study how to control the performance loss with software support. With its
flexibility and long-range visibility, software may help address this issue by providing
feedback information to the adaptation mechanism. We also want to study, when software
support is present, how to dynamically choose the best indicator or the best combination
of indicators, and how to reduce the complexity and overhead of the implementation.
SMT (simultaneous multithreading) [71], a recent advance in computer architecture,
also provides new platforms for low power research. On SMT processors, multiple threads
share common hardware resources such as the instruction fetch units, physical registers,
instruction issue queue, scheduling logic, caches, and main memory units. How to
efficiently and effectively utilize those resources among multiple threads is an important
research topic. Some hardware resources, particularly caches, may be either shared or
partitioned among the threads, each with performance tradeoffs for different application
scenarios. If all or most threads are memory-intensive applications, cache conflicts may be
a severe issue, and thus partitioned caches are preferred to avoid significant performance
degradation. On the other hand, when the threads have unbalanced working set sizes, shared
caches are favored because they will be better utilized than partitioned caches. While
performance-oriented studies have been conducted, energy efficiency has received much less
attention. From an energy consumption point of view, building caches from small blocks
is much more energy-efficient, because a smaller fraction of the chip area is charged on each
access. Thus, partitioning data caches, especially large L2 or even L3 caches, is favored for
energy saving. We are investigating approaches that partition the caches into a number of
small blocks and dynamically assign those blocks to threads according to their current data
access patterns.
Our future research attempts to answer the following questions: (1) how to implement cache
partitioning efficiently, without the increase of cache access time; (2) how to dynamically
assign blocks to multiple threads to maximize performance; and (3) how to improve cache
energy efficiency by dynamically turning on/off cache blocks.
Bibliography
[1] A. Agarwal and S. Pudar. Column-associative caches: a technique for reducing the miss rate for direct-mapped caches. In Proceedings of the 20th International Symposium on Computer Architecture, San Diego, CA, 1993.
[2] A. Agarwal, M. Horowitz, and J. Hennessy. Cache performance of operating systems and multiprogramming workloads. ACM Transactions on Computer Systems, 6(4):393-431, November 1988.
[3] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-32), pages 248-259, 1999.
[4] M. Annavaram, J. M. Patel, and E. S. Davidson. Data prefetching by dependence graph precomputation. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 52-61, 2001.
[5] Avant! Corporation. Avant! Star-Hspice Data Manual. http://www.avantcorp.com.
[6] R. I. Bahar and S. Manne. Power and energy reduction via pipeline balancing. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 218-229, 2001.
[7] R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi. Dynamically allocating processor resources between nearby and distant ILP. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 26-37, 2001.
[8] A. Baniasadi and A. Moshovos. Instruction flow-based front-end throttling for power-aware high-performance processors. In Proceedings of the 2001 International Symposium on Low Power Electronics and Design, pages 16-21, 2001.
[9] L. Benini, A. Bogliolo, and G. D. Micheli. A survey of design techniques for system-level dynamic power management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 8(3), June 2000.
[10] J. E. Bennett and M. J. Flynn. Prediction caches for superscalar processors. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-30), pages 81-91, 1997.
[11] W. J. Bowhill et al. A 300-MHz 64-b quad-issue CMOS RISC microprocessor. In Proceedings of the IEEE International Solid-State Circuits Conference, 1995.
[12] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 83-94, 2000.
[13] D. M. Brooks, P. Bose, S. E. Schuster, H. Jacobson, P. N. Kudva, A. Buyuktosunoglu, J.-D. Wellman, V. Zyuban, M. Gupta, and P. W. Cook. Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors. IEEE Micro, 20(6):26-44, November/December 2000.
[14] D. C. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical Report CS-TR-1997-1342, University of Wisconsin, Madison, 1997.
[15] J. A. Butts and G. S. Sohi. A static power model for architects. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, pages 191-201, 2000.
[16] A. Buyuktosunoglu, T. Karkhanis, D. Albonesi, and P. Bose. Energy efficient co-adaptive instruction fetch and issue. In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003.
[17] A. Buyuktosunoglu, S. Schuster, D. Brooks, P. Bose, P. Cook, and D. Albonesi. An adaptive issue queue for reduced power at high performance. In Workshop on Power-Aware Computer Systems, in conjunction with ASPLOS-IX, 2000.
[18] B. Calder, D. Grunwald, and J. Emer. Predictive sequential associative cache. In Proceedings of the Second International Symposium on High-Performance Computer Architecture (HPCA '96), pages 244-253, 1996.
[19] B. Calder and D. Grunwald. Next cache line and set prediction. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 287-296, 1995.
[20] R. Canal, A. Gonzalez, and J. E. Smith. Very low power pipelines using significance compression. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-33), pages 181-190, 2000.
[21] D. Carmean. Invited talk: Power, a perspective from the desktop. In Kool Chips Workshop, in conjunction with MICRO-33, 2000.
[22] J. H. Chang, H. Chao, and K. So. Cache design of a sub-micron CMOS System/370. In Proceedings of the 14th Annual International Symposium on Computer Architecture, pages 208-213, 1987.
[23] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 14-25, 2001.
[24] Compaq Computer Corporation. 21264/EV68CB and 21264/EV68DC Hardware Reference Manual, 2001.
[25] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A performance comparison of contemporary DRAM architectures. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 222-233, 1999.
[26] R. Desikan, D. Burger, and S. W. Keckler. Measuring experimental error in microprocessor simulation. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 266-277, 2001.
[27] X. Du, X. Zhang, and Z. Zhu. Memory hierarchy considerations for cost-effective cluster computing. IEEE Transactions on Computers, 49(9):915-933, September 2000.
[28] W.-F. Lin, S. K. Reinhardt, and D. Burger. Reducing DRAM latencies with an integrated memory hierarchy design. In Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, pages 301-312, 2001.
[29] D. Folegnani and A. Gonzalez. Energy-effective issue logic. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 230-239, 2001.
[30] S. Ghiasi, J. Casmira, and D. Grunwald. Using IPC variation in workloads with externally specified rates to reduce power consumption. In Workshop on Complexity-Effective Design, in conjunction with the 27th Annual International Symposium on Computer Architecture, 2000.
[31] R. Gonzalez and M. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, 31(9):1277-1284, September 1996.
[32] M. K. Gowan, L. L. Biro, and D. B. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In Proceedings of the 1998 Conference on Design Automation, pages 726-731, 1998.
[33] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, Q1, 2001.
[34] A. Hasegawa, I. Kawasaki, K. Yamada, S. Yoshioka, S. Kawasaki, and P. Biswas. SH3: high code density, low power. IEEE Micro, 15(6):11-19, December 1995.
[35] M. Huang, J. Renau, and J. Torrellas. Positional adaptation of processors: application to energy reduction. In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003.
[36] M. Huang, J. Renau, S.-M. Yoo, and J. Torrellas. A framework for dynamic energy efficiency and temperature management. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-33), pages 202-213, 2000.
[37] M. Huang, J. Renau, S.-M. Yoo, and J. Torrellas. L1 data cache decomposition for energy efficiency. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, 2001.
[38] C. J. Hughes, J. Srinivasan, and S. V. Adve. Saving energy with architectural and frequency adaptations for multimedia applications. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 250-261, 2001.
[39] K. Inoue, T. Ishihara, and K. Murakami. Way-predicting set-associative cache for high performance and low energy consumption. In Proceedings of the 1999 International Symposium on Low Power Electronics and Design, pages 273-275, 1999.
[40] Intel Corporation. ACPI specification minimizes power usage by PCs. http://www.intel.com.
[41] M. J. Irwin and V. Narayanan. Low-power design: From soup to nuts. Tutorial in conjunction with ISCA 2000, 2000.
[42] M. B. Kamble and K. Ghose. Analytical energy dissipation models for low-power caches. In Proceedings of the 1997 International Symposium on Low Power Electronics and Design, pages 143-148, 1997.
[43] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 240-251, 2001.
[44] R. E. Kessler, R. Jooss, A. Lebeck, and M. D. Hill. Inexpensive implementations of set-associativity. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 131-139, 1989.
[45] R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24-36, 1999.
[46] H. Kim, V. Narayanan, M. Kandemir, and M. Irwin. Multiple access caches: Energy implications. In Proceedings of the IEEE Computer Society Annual Workshop on VLSI (WVLSI'00), 2000.
[47] J. Kin, M. Gupta, and W. H. Mangione-Smith. The filter cache: An energy efficient memory structure. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 184-193, 1997.
[48] A. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power aware page allocation. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 105-116, 2000.
[49] C.-K. Luk. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 40-51, 2001.
[50] A. Ma, M. Zhang, and K. Asanovic. Way memoization to reduce fetch energy in instruction caches. In Workshop on Complexity-Effective Design, in conjunction with the 28th International Symposium on Computer Architecture, 2001.
[51] S. Manne, A. Klauser, and D. Grunwald. Pipeline gating: Speculation control for energy reduction. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 132-141, 1998.
[52] M. Martonosi, D. Brooks, and P. Bose. Modeling and analyzing CPU power and performance: Metrics, methods, and abstractions. Tutorial in conjunction with HPCA 2001, 2001.
[53] S. McFarling. Combining branch predictors. Technical Report TN-36, Digital Equipment Corporation, Western Research Lab, June 1993.
[54] J. Montanaro et al. A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor. Digital Technical Journal, 9(1):49-62, 1997.
[55] S.-T. Pan, K. So, and J. T. Rahmeh. Improving the accuracy of dynamic branch prediction using branch correlation. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), pages 76-84, 1992.
[56] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 90-101, 2001.
[57] R. Pyreddy and G. Tyson. Evaluating design tradeoffs in dual speed pipelines. In Workshop on Complexity-Effective Design, in conjunction with the 28th International Symposium on Computer Architecture, 2001.
[58] G. Reinman and N. Jouppi. An integrated cache timing and power model. Technical report, Compaq Western Research Lab, 1999.
[59] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 128-138, 2000.
[60] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod. Using the SimOS machine simulator to study complex computer systems. ACM Transactions on Modeling and Computer Simulation, 7(1):78-103, January 1997.
[61] R. Sasanka, C. J. Hughes, and S. V. Adve. Joint local and global hardware adaptations for energy. In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), 2002.
[62] J. S. Seng, E. S. Tune, and D. M. Tullsen. Reducing power with dynamic critical path information. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 114-123, 2001.
[63] P. Shivakumar and N. Jouppi. An integrated cache timing, power, and area model. Technical report, Compaq Western Research Lab, 2001.
[64] D. Singh and V. Tiwari. Power challenges in the internet world. In Cool Chips Tutorial: An industrial perspective on low power processor design, in conjunction with the 32nd Annual International Symposium on Microarchitecture, pages 8-15, 1999.
[65] J. E. Smith. A study of branch prediction strategies. In Proceedings of the 8th Annual International Symposium on Computer Architecture, pages 135-148, 1981.
[66] Standard Performance Evaluation Corporation. SPEC CPU2000. http://www.spec.org.
[67] C.-L. Su and A. M. Despain. Cache design trade-offs for power and performance optimization: a case study. In Proceedings of the 1995 International Symposium on Low Power Design, pages 63-68, 1995.
[68] Synopsys Inc. PowerMill Data Sheet. http://www.synopsys.com.
[69] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez. Reducing power in high-performance microprocessors. In Proceedings of the 1998 Conference on Design Automation, pages 732-737, 1998.
[70] Tom's Hardware Guide. Comparison of 21 Power Supplies. http://www.tomshardware.com/howto.
[71] D. Tullsen, S. Eggers, and H. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 392-403, 1995.
[72] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun. A modified approach to data cache management. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 93-103, 1995.
[73] O. Unsal, I. Koren, C. Krishna, and C. Moritz. Cool-Fetch: Compiler-enabled power-aware fetch throttling. ACM Computer Architecture Letters, 1:100-103, 2002.
[74] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye. Energy-driven integrated hardware-software optimizations using SimplePower. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 95-106, 2000.
[75] L. Villa, M. Zhang, and K. Asanovic. Dynamic zero compression for cache energy reduction. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-33), pages 214-220, 2000.
[76] D. W. Wall. Limits of instruction-level parallelism. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1991.
[77] C. Weaver. SPEC2000 binaries. http://www.simplescalar.org/spec2000.html.
[78] K. Wilcox and S. Manne. Alpha processors: A history of power issues and a look to the future. In Cool Chips Tutorial: An industrial perspective on low power processor design, in conjunction with the 32nd Annual International Symposium on Microarchitecture, pages 16-37, 1999.
[79] S. J. E. Wilton and N. P. Jouppi. An enhanced access and cycle time model for on-chip caches. Technical Report WRL Research Report 93/5, DEC Western Research Laboratory, 1994.
[80] J. Yang, Y. Zhang, and R. Gupta. Frequent value compression in data caches. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-33), pages 258-265, 2000.
[81] T.-Y. Yeh and Y. N. Patt. Alternative implementations of two-level adaptive branch prediction. In The 19th Annual International Symposium on Computer Architecture (ISCA), pages 124-134, 1992.
[82] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan. Speculation techniques for improving load related instruction scheduling. In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA '99), pages 42-53, 1999.
[83] C. Zhang, X. Zhang, and Y. Yan. Two fast and high-associativity cache schemes. IEEE Micro, 17(5):40-49, September/October 1997.
[84] X. Zhang, Z. Zhu, and X. Du. Analysis of commercial workload on SMP multiprocessors. In Performance '99 (extended abstract), 1999.
[85] Y. Zhang and J. Yang. Low cost instruction cache designs for tag comparison elimination. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, 2003.
[86] Z. Zhang, Z. Zhu, and X. Zhang. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, pages 32-41, 2000.
[87] Z. Zhang, Z. Zhu, and X. Zhang. Breaking address mapping symmetry at multi-levels of memory hierarchy to reduce DRAM row-buffer conflicts. Journal of Instruction-Level Parallelism, 3, October 2001.
[88] Z. Zhang, Z. Zhu, and X. Zhang. Cached DRAM for ILP processor memory access latency reduction. IEEE Micro, 21(4):22-32, July/August 2001.
[89] Z. Zhu and X. Zhang. Access-mode predictions for low-power cache design. IEEE Micro, 22(2):58-71, March/April 2002.
[90] Z. Zhu, Z. Zhang, and X. Zhang. Fine-grain priority scheduling on multi-channel memory systems. In Proceedings of the Eighth International Symposium on High-Performance Computer Architecture, pages 107-116, 2002.
[91] C. Zilles and G. Sohi. Execution-based prediction using speculative slices. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 2-13, 2001.
[92] V. V. Zyuban. Inherently low-power high-performance superscalar architectures. PhD thesis, University of Notre Dame, Department of Computer Science and Engineering, 2000.
VITA
Zhichun Zhu
Zhichun Zhu was born in Zhuzhou, Hunan, China. She received her B.S. degree in Computer
Engineering from Huazhong University of Science and Technology (HUST), Wuhan, Hubei,
China, in 1992. She entered the Ph.D. program in Computer Science at the College of
William and Mary in Fall 1998. Her research interests are computer architecture, performance
evaluation, and parallel and distributed computing. She is a student member of the
IEEE and the ACM.