UNIVERSITY OF CINCINNATI

Date: ______

I, ______, hereby submit this work as part of the requirements for the degree of ______ in ______.

It is entitled: ______

This work and its defense approved by:

Chair: ______

Performance Analysis of Location Cache for Low Power Cache System

A thesis submitted to the

Division of Research and Advanced Studies of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in the Department of Electrical and Computer Engineering and Computer Science of the College of Engineering

May 28, 2007

by

Bin Qi

B.E. (Electrical Engineering), Shanghai JiaoTong University, China, July 1999

Thesis Advisor and Committee Chair: Dr. Wen-Ben Jone

To my dearest parents and brother

Abstract

In modern microprocessors, more memory hierarchy levels and larger caches are integrated on chip to bridge the performance gap between the high-speed CPU core and the low-speed memory. Large set-associative L2 caches draw a lot of power, generate a large amount of heat, and reduce the overall yield of the chip. As a result, the large power consumption of the cache memory system has become a new bottleneck for many microprocessors. In this research, we analyze the performance of a location cache which works with a low power L2 cache system implemented with the drowsy cache technique. A small direct-mapped location cache is added to the traditional L2 cache system. It caches the way location information for L2 cache accesses [6]. With this way location information, the L2 cache can be accessed as a direct-mapped cache to save both dynamic and leakage power. A detailed mathematical analysis of the location cache power saving rate is presented in this work. To evaluate the power consumption of the location cache system on real world workloads, both SPEC CPU2000 and SPEC CPU2006 benchmark applications are simulated with the reference input set. Simulation results demonstrate that the location cache system saves a significant amount of power for all benchmark applications under the L1 write through policy, and saves power for benchmark applications with high L1 miss rates under the L1 write back policy.

Acknowledgement

I wish to express my sincere thanks to my advisor, Dr. Wen-Ben Jone, for the endless hours he spent discussing and refining my work. With an office door always open for students, and a mind ready to catch any point I might have missed, he was untiring in his efforts to provide guidance and constructive criticism throughout my thesis. Thank you, Dr. Jone, for all the great help and sincere concern. I would like to thank the members of my committee, Dr. Ranga R. Vemuri and Dr. Yiming Hu, for spending their valuable time reviewing this work. Special thanks to Rui Min for many thought-provoking discussions, and to Honghao Wang for his help when I met problems in setting up the simulation environment.

Last but not least, I would like to express my gratitude to my parents and my brother for their constant support and love.

Contents

1 Introduction
2 Background
  2.1 Basics of Cache
    2.1.1 Memory Hierarchy
    2.1.2 Cache mapping techniques
    2.1.3 Cache Write Policies
  2.2 Leakage Current and Drowsy Technique in Cache
    2.2.1 Leakage current
    2.2.2 Drowsy technique
3 Location Cache Design and Analysis
  3.1 Traditional L2 Cache System Architecture
  3.2 Location Cache Architecture
    3.2.1 Structure of Location Cache
    3.2.2 Working Principle and Hardware Organization of Location Cache
  3.3 Timing Analysis of Location Cache
    3.3.1 Timing Components and Delay Model
    3.3.2 Timing Characteristic of Location Cache
  3.4 Power Analysis of Location Cache
    3.4.1 Power Components and Power Model for CACTI
    3.4.2 Mathematical Power Analysis for L2 Location Cache System
4 Power Experiment Results of Location Cache
  4.1 Experimental Environment
  4.2 Experimental Results with Write Through L1 Data Cache
  4.3 Experimental Results with Write Back L1 Data Cache
  4.4 Experimental Results Without Drowsy State
5 Conclusions and Future Work

List of Figures

2.1 The structure of memory hierarchy
2.2 Leakage current components in an NMOS transistor
2.3 Drowsy SRAM control and leakage power reduction
2.4 Deterioration of inverter VTC under low-Vdd
3.1 Physically addressed L2 cache architecture
3.2 Virtually addressed L2 cache architecture
3.3 Physically addressed location cache architecture
3.4 Virtually addressed location cache architecture
3.5 Flow diagram for location cache content update
3.6 Internal cache structure
4.1 SPEC CINT2000 location cache hit rate vs. large entry number: write through
4.2 SPEC CINT2000 location cache hit rate vs. small entry number: write through
4.3 SPEC CFP2000 location cache hit rate vs. large entry number: write through
4.4 SPEC CFP2000 location cache hit rate vs. small entry number: write through
4.5 SPEC CPU2006 location cache hit rate vs. large entry number: write through
4.6 SPEC CPU2006 location cache hit rate vs. small entry number: write through
4.7 SPEC CINT2000 location cache power saving rate vs. large entry number: write through
4.8 SPEC CINT2000 location cache power saving rate vs. small entry number: write through
4.9 SPEC CFP2000 location cache power saving rate vs. large entry number: write through
4.10 SPEC CFP2000 location cache power saving rate vs. small entry number: write through
4.11 SPEC CPU2006 location cache power saving rate vs. large entry number: write through
4.12 SPEC CPU2006 location cache power saving rate vs. small entry number: write through
4.13 SPEC CINT2000 location cache hit rate vs. large entry number: write back
4.14 SPEC CINT2000 location cache hit rate vs. small entry number: write back
4.15 SPEC CFP2000 location cache hit rate vs. large entry number: write back
4.16 SPEC CFP2000 location cache hit rate vs. small entry number: write back
4.17 SPEC CPU2006 location cache hit rate vs. large entry number: write back
4.18 SPEC CPU2006 location cache hit rate vs. small entry number: write back
4.19 SPEC CINT2000 location cache power saving rate vs. large entry number: write back
4.20 SPEC CINT2000 location cache power saving rate vs. small entry number: write back
4.21 SPEC CFP2000 location cache power saving rate vs. large entry number: write back
4.22 SPEC CFP2000 location cache power saving rate vs. small entry number: write back
4.23 SPEC CPU2006 location cache power saving rate vs. large entry number: write back
4.24 SPEC CPU2006 location cache power saving rate vs. small entry number: write back

List of Tables

3.1 Normalized access delays for various location cache configurations
4.1 The 12 CINT2000 benchmark applications
4.2 The 14 CFP2000 benchmark applications
4.3 The 6 SPEC CPU2006 benchmark applications
4.4 L1 cache miss rate of SPEC CINT2000: write through
4.5 L1 cache miss rate of SPEC CFP2000: write through
4.6 L1 cache miss rate of SPEC CPU2006: write through
4.7 L1 cache miss rate of SPEC CINT2000: write back
4.8 L1 cache miss rate of SPEC CFP2000: write back
4.9 L1 cache miss rate of SPEC CPU2006: write back
4.10 Location cache power saving rate of SPEC CINT2000: write through
4.11 Location cache power saving rate of SPEC CFP2000: write through
4.12 Location cache power saving rate of SPEC CPU2006: write through
4.13 Location cache power saving rate of SPEC CINT2000: write back
4.14 Location cache power saving rate of SPEC CFP2000: write back
4.15 Location cache power saving rate of SPEC CPU2006: write back

Chapter 1

Introduction

A cache memory is normally a very fast random access memory which the processor can access much more quickly than the main memory (DRAM). Nowadays, as the processor-memory performance gap has been steadily increasing, with processors operating at speeds far greater than the speeds at which main memories can supply the required data, the cache has been playing a vital role in bridging the performance gap between high-speed microprocessors and low-speed main memories. As a result, larger caches and deeper memory hierarchies are embedded in modern microprocessors. For example, in the third-generation Intel Itanium 2 processor, a 24-way set-associative 6 MB L3 cache is integrated into the processor, and the cache occupies more than 60% of the chip area [5]. A 4-way set-associative, 4 MB L2 data cache is used in a 64-bit, 90-nm SPARC RISC microprocessor, with 55% of the chip area invested in the L2 cache block [4].

In order to reduce the number of accesses to long latency memories, a larger cache system is required. However, a large cache generally draws a lot of power, generates a large amount of heat, and reduces the overall yield of the chip. Caches can consume 50% or more of the total chip energy in a modern processor. For example, the Pentium Pro dissipates 33% [7] and the StrongARM-110 dissipates 42% [8] of its total power in caches. In the past, dynamic power dominated the total power consumption of CMOS transistors: when CMOS transistors are not switching, they are in the OFF state and their leakage power is negligible. However, as the feature size decreases, leakage power increases exponentially with process technology and is projected to dominate the total power dissipation. According to the projection in [9], for 70-nm processes, more than 60% of power can be consumed in L1 caches if left unchecked.

With the development of personal communication, the fast-growing portable device market requires devices with both high performance and low power consumption. As a result, designing power-efficient cache memories has become one of the most critical design concerns. Several techniques on the architecture level have been proposed to reduce the power consumed by caches. In [10, 11, 12], phased caches first access the tag arrays and then the data arrays. Only the hit data way is accessed in the second phase, resulting in less data way access energy at the expense of a longer access time. Speculative way activation is another kind of selective way activation, which attempts to predict the way where the required data may be located. If the prediction is correct, the cache access latency and power consumption are similar to those of a direct-mapped cache of the same size. If the prediction is wrong, the cache is accessed again to retrieve the desired data, so the cache is accessed as a direct-mapped cache twice. Another disadvantage is that the predicted way number must be made available before the actual data address is generated [13, 15, 14]; the CPU must be modified so as to produce predictions in earlier stages of the instruction pipeline.

Apart from optimizations on the architecture level, several techniques on the circuit level have been presented to reduce the leakage power. The gated-Vdd technique inserts a high-Vth transistor between the SRAM cell and one of its power supply rails (Vdd/GND) [16]. The SRAM cell is detached from its power supply when it is not expected to be used, and the state of the cache is lost. A multi-threshold CMOS (MTCMOS) technique has also been proposed to satisfy both the requirement of lowering the threshold voltage and reducing the standby or subthreshold leakage current [17]. To increase operating speed, low-Vth MOSFETs are used for logic gates, and during long standby times, the power supply is disconnected with high-Vth MOSFETs. However, the MTCMOS technique cannot be applied to the memory cell array, because the sleep control transistors in the MTCMOS technique cut the power supply off, destroying the state of the memory cells. If high-Vth MOSFETs are used in the memory cell array instead, the overall memory access time increases. A simple but effective drowsy technique was proposed in [9]. This method implements caches with a drowsy/standby mode and a normal mode, between which different supply voltages can be selected. An SRAM cell consumes significantly less leakage power when placed into drowsy mode by supplying a lower voltage, and in drowsy mode the SRAM cell content is still preserved.

The architecture we analyze in this work is known as the location cache, a low power L2 cache system [6]. It adds a small direct-mapped location cache to the traditional L2 cache system. This small location cache is accessed simultaneously with the L1 cache, and tries to cache the access way location information of the L2 cache. As it is accessed at the same time as the L1 cache, there is no performance penalty and no modification of the CPU pipeline is needed to support this cache architecture. A detailed micro-architecture implementation of the location cache is presented in this work. Though only the location information is cached in the location cache, both dynamic and leakage power of the L2 cache can be saved. Dynamic power is saved by accessing the set-associative L2 cache as a direct-mapped cache, while leakage power is saved by putting the other, unaccessed ways into the drowsy state. Although the location cache saves L2 cache power, this added small cache has its own power overhead. A detailed mathematical analysis of the location cache power saving rate is presented in this work. The cache power consumption is estimated by the CACTI 4.2 cache model [22]. To evaluate the power consumption of the proposed location cache system on real world workloads, SPEC CPU2000 and SPEC CPU2006 benchmark applications [25] are simulated with the reference input set. The thesis is organized as follows:

Chapter 2 reviews some of the background information of memory hierarchy, cache mapping techniques, cache write policies, leakage current, and drowsy technique in cache.

Chapter 3 shows the traditional L2 cache system architecture as well as the location cache L2 cache system architecture. Detailed timing analysis and power analysis of the location cache are presented thereafter.

Chapter 4 describes the experimental environment of this work, and performs power simulation of the location cache in both write through and write back policies. Further analysis of the relationship between power saving rate and location cache size is presented based on the simulation results.

Chapter 5 concludes this thesis and discusses future work.

Chapter 2

Background

This chapter briefly discusses all background information related to this work. Firstly, memory hierarchy, cache mapping techniques and cache write policies are presented. Secondly, leakage current and drowsy technique used in cache are introduced.

2.1 Basics of Cache

Cache memories play a vital role in keeping the speed of memory access com- parable to today’s high-end microprocessors. In this section some basic cache concepts, such as memory hierarchy, cache mapping techniques and cache write policies are introduced.

Figure 2.1: The structure of memory hierarchy

2.1.1 Memory Hierarchy

A memory system is essentially divided into main memory and cache memory. Main memory (DRAM) is relatively cheap but suffers from the disadvantage of being slow. An access from the main memory into a register may take 20-50 clock cycles. However, there are other kinds of memory that are very fast, and can be accessed within several clock cycles. This fast memory is what is called cache.

The purpose of a cache is to speed up memory accesses by storing recently used chunks of memory. Figure 2.1 shows the structure of a memory hierarchy. This structure allows the processor to have an access time that is determined primarily by level 1 of the hierarchy, and yet have a memory virtually as large as level n [2]. Deciding what data to store in the cache is governed by a policy that takes advantage of the concept of spatial and temporal locality. Spatial and temporal locality of memory accesses refers to the idea that if a certain chunk of memory was accessed recently, it is likely that the same memory (or memory near it) will be accessed again in the near future. For example, when we execute consecutive instructions in the main memory, it is reasonable to assume that the next few instructions to be accessed are contained in a given contiguous block of data, and hence the next instruction to be accessed will be close by. Similarly, if we are accessing elements of an array, they are located consecutively, and therefore it makes sense to store a block of data that has recently been used in a cache.

A cache system can have more than one level, with the size of the cache increasing and the speed of access decreasing with each lower level. When the processor needs to execute an instruction, it first looks in the level-1 cache, then the level-2 cache, and so on. If the data is not found in any cache level, the CPU then looks for it in the main memory. When the CPU finds data in one of its cache locations, it is termed a hit; failure to find the data is a miss. Every miss introduces a delay, or latency, as the processor tries a slower level; this is also called a miss penalty. Nowadays, most modern microprocessors have on-chip L1 and L2 caches, and some even have large on-chip L3 caches, e.g., the Intel Itanium 2 family [5].
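As an illustration of this lookup process, the following C sketch walks the hierarchy level by level (a minimal sketch; the cache_level type and its probe callback are hypothetical and not part of any simulator used in this thesis):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct cache_level cache_level;
    struct cache_level {
        cache_level *next;                      /* lower level; NULL when main memory is next */
        bool (*probe)(cache_level *, uint64_t); /* returns true on a hit at this level */
    };

    /* Try L1 first, then each slower level; every miss adds a miss penalty
     * before the next level is tried.  Returns the level number that
     * supplied the data, or -1 when only main memory holds it. */
    int hierarchy_lookup(cache_level *l1, uint64_t addr)
    {
        int level = 1;
        for (cache_level *c = l1; c != NULL; c = c->next, level++)
            if (c->probe(c, addr))
                return level;   /* hit */
        return -1;              /* miss in every cache level */
    }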

2.1.2 Cache mapping techniques

There are three basic cache mapping techniques: direct, fully associative and set associative.

• Direct cache mapping: A direct-mapped cache is the easiest to implement, but is also the most restrictive. In a direct-mapped cache, each main memory location can be copied into one and only one location in the cache. This is accomplished by dividing the main memory into pages that correspond in size with the cache. Direct-mapped caches have a speed advantage in that only one location needs to be searched for a reference to determine if it is there. The disadvantage manifests itself when a processor repeatedly requests data from two different pages with the same offset. When this happens, the cache is said to be 'thrashing', which results in poor cache performance. Direct-mapped caches can be referred to as one-way set-associative caches.

• Fully associative cache mapping: Fully associative cache mapping is the most complex type to implement, but is the most flexible with regard to where the data can reside. A newly read block of the main memory can be placed anywhere in a fully associative cache. When a memory reference occurs, all address tags must be checked to determine if the reference is in the cache. This must be accomplished as quickly as possible if the cache is to respond to memory requests fast enough to achieve zero wait-state performance. When the cache is too large, more time is needed to check the cache than to simply go to the main memory for a reference. A fully associative cache has only one set, but many block frames per set. A fully associative cache is expensive in hardware and may slow the processor, leading to lower overall performance.

• Set associative cache mapping: Set associative cache mapping combines the best of both direct and fully associative cache mapping techniques. As with a direct-mapped cache, blocks of main memory data still map into a specific set, but can now be in any of the N cache block frames within the set. Consider, for example, a two-way set-associative cache: two-way means that each set contains two block frames. Once a memory reference is generated, a set mapping function is applied to the address to determine which set the data will be located in, just like in a direct-mapped cache. When the set is determined, each block's tag in the set must be compared with the reference to see if it resides in the cache. The higher the associativity, i.e., the more blocks there are per set, the more comparisons need to occur before a hit or miss is determined. Typical implementations are 2-, 4-, 8- and 16-way set associativity. The set mapping computation common to these schemes is sketched in code below.
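To make the set mapping function concrete, the following C fragment extracts the set index and the tag from an address; the geometry constants are illustrative only and are not taken from this thesis. A direct-mapped cache is simply the special case with one block frame per set.

    #include <stdint.h>

    #define NUM_SETS   256u   /* illustrative number of sets    */
    #define BLOCK_SIZE  64u   /* illustrative block size, bytes */

    /* Discard the block-offset bits, then take log2(NUM_SETS) bits as the
     * set index; the remaining high-order bits form the tag that is
     * compared against every block frame in the selected set. */
    static inline uint32_t set_index(uint64_t addr)
    {
        return (uint32_t)((addr / BLOCK_SIZE) % NUM_SETS);
    }

    static inline uint64_t tag_bits(uint64_t addr)
    {
        return addr / BLOCK_SIZE / NUM_SETS;
    }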

2.1.3 Cache Write Policies

There are two basic cache write policies: write back and write through.

• Write back: the data is written only to the block in the cache of the current level. The modified cache block is written to the lower-level cache or main memory only when it is replaced. To reduce the frequency of writing back blocks on replacement, a dirty bit is commonly used to indicate whether the block is modified (dirty) or not (clean). If the block is clean, it is not written back on a miss, since identical information is already present in the lower levels. With write back, writes occur at the speed of the current level of cache, and multiple writes within a block require only one write to the lower-level memory. Since some writes hit in the current level of cache, there are fewer accesses to the lower-level memory, and less memory bandwidth is used.

• Write through: the data is written both to the block in the cache and to the block in the lower level of cache or the main memory. Write through is easier to implement than write back. The cache is always clean, so unlike write back, read misses never result in writes to the lower level. Write through also has the advantage that the next lower level always holds the most current copy of the data, which simplifies data coherency. Due to these advantages, some modern microprocessors also use this write policy, such as the Intel Itanium 2 family [5]. Both policies are sketched in code below.
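The difference between the two policies can be made concrete with a short C sketch (the block metadata and the write_lower callback are hypothetical, chosen only for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t tag;
        bool     valid;
        bool     dirty;  /* meaningful for write back only */
    } cache_block;

    /* Write back: update only this level and mark the block dirty;
     * the lower level is written later, on eviction. */
    void write_back_store(cache_block *b)
    {
        b->dirty = true;
    }

    void write_back_evict(cache_block *b, void (*write_lower)(uint64_t))
    {
        if (b->valid && b->dirty)
            write_lower(b->tag);  /* one write may cover many stores */
        b->valid = b->dirty = false;
    }

    /* Write through: propagate every store immediately, so the cache is
     * always clean and eviction never writes downstream. */
    void write_through_store(cache_block *b, void (*write_lower)(uint64_t))
    {
        write_lower(b->tag);
    }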

2.2 Leakage Current and Drowsy Technique in Cache

In our location cache architecture, we also use the information obtained from the location cache to put more L2 cache parts into drowsy state, which saves standby leakage power significantly. In this section, we give a brief description of leakage current in a CMOS transistor and the drowsy technique used in a cache.

2.2.1 Leakage current

There are four main sources of leakage current in a CMOS transistor (see Figure 2.2):

1. Reverse-biased junction leakage current (I_REV)

2. Gate induced drain leakage (I_GIDL)

3. Gate direct-tunneling leakage (I_G)

4. Sub-threshold (weak inversion) leakage (I_SUB)

Each of these components is described next.

Figure 2.2: Leakage current components in an NMOS transistor.

• Reverse-biased junction leakage (I_REV). The junction leakage occurs from the source or drain to the substrate through the reverse-biased diodes when a transistor is OFF. A reverse-biased pn junction leakage has two main components: one is minority carrier diffusion/drift near the edge of the depletion region; the other is due to electron-hole pair generation in the depletion region of the reverse-biased junction [3]. The magnitude of the diode's leakage current depends on the area of the drain diffusion and the leakage current density, which is in turn determined by the doping concentration. For present technology, junction reverse-bias leakage components from both the source-drain diodes and the well diodes are generally negligible with respect to the other three leakage components [3][27].

• Gate induced drain leakage (I_GIDL). This leakage current flows from the drain-gate overlap to the substrate of a transistor. It arises from the high electric field under the gate/drain overlap region, which causes deep depletion. Thinner oxide and higher supply voltage increase the GIDL current.

• Gate direct-tunneling leakage (I_G). According to quantum mechanics, charged carriers can pass through the gate oxide potential barrier into the gate [27], and this causes the gate direct-tunneling leakage current. The magnitude of the gate direct-tunneling current increases exponentially as the gate oxide thickness T_ox decreases. An effective approach to overcoming gate leakage currents while maintaining excellent gate control is to replace the currently used silicon dioxide gate insulator with a high-K dielectric material such as TiO2 and Ta2O5.

• Sub-threshold (weak inversion) leakage (I_SUB). When the gate-to-source voltage V_gs is smaller than the threshold voltage V_th, I_SUB flows from drain to source. The magnitude of the subthreshold current is a function of temperature, supply voltage, device size, and the process parameters, of which the threshold voltage (V_T) plays a dominant role.

In current CMOS technologies, the sub-threshold leakage current, I_SUB, is much larger than the other leakage current components [26]. This is mainly because of the relatively low V_T in modern CMOS devices. Based on the BSIM3 leakage equation, the sub-threshold leakage current of a single transistor can be modeled as [23]:

$$I_{leakage} = \mu_0 \cdot C_{OX} \cdot \frac{W}{L} \cdot e^{b(V_{dd}-V_{dd0})} \cdot v_t^2 \cdot \left(1 - e^{-\frac{V_{dd}}{v_t}}\right) \cdot e^{\frac{-|V_{th}|-V_{off}}{n \cdot v_t}} \quad (2.1)$$

where $\mu_0$ is the zero-bias mobility, $C_{OX}$ is the gate oxide capacitance per unit area, $W/L$ is the aspect ratio of the transistor, $e^{b(V_{dd}-V_{dd0})}$ is the DIBL factor derived by curve fitting, $V_{dd0}$ is the default supply voltage for each technology, $v_t = kT/q$ is the thermal voltage, $V_{th}$ is the threshold voltage, $n$ is the subthreshold swing coefficient, and $V_{off}$ is an empirically determined BSIM3 parameter that is also a function of the threshold voltage.
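For reference, Equation 2.1 can be transcribed directly into C; this is only a sketch of the formula, and all parameter values would have to come from the BSIM3 model card of the target process (none are given here):

    #include <math.h>

    /* Sub-threshold leakage of a single transistor, Equation 2.1. */
    double subthreshold_leakage(double mu0,  /* zero-bias mobility            */
                                double cox,  /* gate oxide cap. per unit area */
                                double w, double l,  /* device W and L        */
                                double b,    /* DIBL curve-fitting factor     */
                                double vdd, double vdd0,
                                double vt,   /* thermal voltage kT/q          */
                                double vth,  /* threshold voltage             */
                                double n,    /* subthreshold swing coeff.     */
                                double voff) /* empirical BSIM3 parameter     */
    {
        return mu0 * cox * (w / l)
             * exp(b * (vdd - vdd0))                  /* DIBL factor */
             * vt * vt                                /* v_t^2       */
             * (1.0 - exp(-vdd / vt))
             * exp((-fabs(vth) - voff) / (n * vt));
    }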

2.2.2 Drowsy technique

In this section first the drowsy technique of a single SRAM cell is presented, and then the way to determine the data retention voltage is described.

The drowsy technique, introduced in [9], utilizes DVS (dynamic voltage scaling) to reduce the leakage power of SRAM cells. In active mode, the standard supply voltage is provided to the SRAM cells. However, when cells are not expected to be accessed for a period of time, they are placed into a "sleep" or "drowsy" mode by supplying a standby voltage in the range of 200-300 mV to the SRAM cells.

In drowsy mode, the leakage power is significantly reduced due to the decrease in both leakage current and supply voltage. Figure 2.3 shows the supply voltage control mechanism and the sub-threshold leakage power reduction of the SRAM cell with DVS.

Figure 2.3: Drowsy SRAM control and leakage power reduction [9].

As shown in Figure 2.3(b), the leakage power of 6T and 4T SRAM cells reduces significantly as the supply voltage is scaled down. According to the results in [9], the leakage power of the 4T and 6T SRAM cells can be reduced by 92% and 77%, respectively, at a 300 mV standby voltage. However, the standby voltage (data retention voltage, DRV) cannot be reduced without limit, for the reason presented in the rest of this section.

Cell stability is often characterized using the static noise margin (SNM), where noises such as mismatches and disturbances are modeled as DC offsets [28, 29, 30]. When these DC offsets exceed the SNM of an SRAM cell, the cell switches falsely. The SNM can be visualized by superimposing the voltage transfer curves (VTC) of the two cross-coupled inverters within an SRAM cell; its value is defined as the edge of the maximum square that fits between both VTC curves [30]. In Figure 2.4, V_T and V_F denote the voltages of nodes T and F in the SRAM cell. VTC_T denotes the VTC of the inverter whose input is T and output is F, while VTC_F denotes the VTC of the inverter whose input is F and output is T.

Figure 2.4: Deterioration of inverter VTC under low Vdd (shown for Vdd = 0.36 V and Vdd = 0.10 V).

When Vdd is 0.36 V, the resulting SNM is around 100 mV. When Vdd reduces to 0.1 V, the VTC degrades such that the noise margin reduces to 0. If Vdd reduces further, the SRAM cell cannot retain the stored data any more. Moreover, the real noise margin degrades not only with reduced Vdd, but also with temperature, process variation, etc., so the standby voltage cannot be reduced all the way down to 0.1 V. In [31], it is found that a guard band of over 100 mV above the minimum voltage (the one with zero SNM) is sufficient to overcome these noise effects. In our work, 0.35 V is used as the DRV for our 0.13µm technology.

Chapter 3

Location Cache Design and Analysis

In this chapter, we first describe the traditional L2 cache system architecture, and then introduce the location cache architecture for the L2 cache system. Finally, we present the timing and power analysis of the location cache.

3.1 Traditional L2 Cache System Architecture

As illustrated in Figure 2.1, the L1 cache is normally small, and access speed is the most important factor to consider when determining the L1 cache size. The L2 cache is larger and slower, and normally uses a set-associative organization to reduce the miss rate. Depending on how an L1 cache is addressed, it can be classified as physically addressed or virtually addressed. When the L1 cache is physically addressed, the virtual address issued by the microprocessor core is first translated into its corresponding physical address by the translation look-aside buffer (TLB), and is then used to access the L1 cache. Figure 3.1 illustrates this structure in an L2 cache system.

Figure 3.1: Physically addressed L2 cache architecture.

In this organization, the cache is physically indexed and physically tagged. In such a system, the time to access memory, assuming a cache hit, must accommodate both a TLB access and a cache access; of course, these accesses can be pipelined. The advantage of a physically addressed cache is that the design of the TLB is flexible, and it does not have to carefully coordinate the minimum page size, the cache size, and the associativity. Also, a physically addressed cache will not have the aliasing problem that a virtually addressed cache may have.

Alternatively, a microprocessor core can access its cache with an address that is completely or partially virtual. When the cache is fully virtually addressed, the TLB is not used until a cache miss happens: on a miss, the processor needs to translate the virtual address to its physical address so that it can fetch the cache block from the main memory. When the cache is both virtually indexed and virtually tagged, the aliasing problem may arise [2]. When the cache is accessed with a virtual address and pages are shared between programs (which may access them with different virtual addresses), there is the possibility of aliasing. Aliasing occurs when the same object has two names, in this case two virtual addresses for the same page. This ambiguity creates a problem because a word on such a page may be cached in two different locations, each corresponding to a different virtual address; it would allow one program to write the data without the other program being aware that the data had changed. A common solution is to make the cache virtually indexed but physically tagged. In this design, the page offset portion of the address, which is really a physical address since it is not translated, is used to index the L1 cache, and the L1 cache and TLB are accessed simultaneously. Figure 3.2 illustrates this structure for an L2 cache system. In this design there is no aliasing problem, but the cache size of each way cannot exceed the page size [1].

Another advantage of using a physically addressed cache is that the tag array of the L1 cache is smaller than in a virtually addressed cache. Since the virtual address is normally larger than its physical counterpart, the number of bits in each tag entry will be larger in a virtually addressed cache if the same block size and index address are used. For example, in the Intel Itanium 2 family, the virtual address is 64 bits while the physical address is 50 bits [5]. If the tag is virtually addressed instead of physically addressed, each tag entry will have 14 more bits. This will introduce more access delay and more power consumption in the tag array.

Figure 3.2: Virtually addressed L2 cache architecture.

Due to structural hazards, an L1 cache can be split into a separate L1 instruction cache and L1 data cache. Correspondingly, a pair of dedicated instruction and data TLBs is associated with the L1 caches. The L2 cache may likewise have separate data and instruction caches, or a unified cache. The Intel Itanium 2 family has separate physically addressed L1 data and instruction caches of 16 KB each, and a unified L2 cache of 256 KB [5].

3.2 Location Cache Architecture

In order to reduce the miss rate of an L2 cache, the L2 cache is normally a large set-associative cache with multiple ways. For instance, the L2 cache in the Intel Itanium 2 family is a 256 KB, 8-way set-associative cache. Accessing a set-associative cache wastes power because multiple data and tag ways are probed simultaneously, but only one way carries the required data. To resolve this problem, a new cache architecture, called the location cache, is proposed in [6]. In this section, firstly the structure of a location cache is introduced, and secondly the detailed hardware organization and working principle of the location cache are described.

3.2.1 Structure of Location Cache

The location cache is a small direct-mapped cache that uses address affinity information to provide accurate way location information for L2 cache references [6]. The proposed location cache system reduces cache access power while improving performance, when compared with a conventional set-associative L2 cache. Depending on the L2 cache architecture described in the last section, a location cache can be physically addressed or virtually addressed. Figure 3.3 illustrates the revised L2 cache system architecture with a physically addressed location cache.

In this physically addressed cache system, the location cache is physically addressed. It caches the access way location information of the L2 cache (the way number within a set that a memory reference falls into). This cache works in parallel with the L1 cache. As the location cache tries to cache the L2 location information, the block address (composed of the index address and the tag address) of the location cache should be of the same length as that of the L2 cache. For instance, in Intel Itanium 2 the physical address is 50 bits and the L2 cache block size is 128 bytes, so the block address of the location cache has 43 (50-7) bits.

Figure 3.3: Physically addressed location cache architecture.

If the location cache has 512 entries (i.e., the index contains 9 bits), then each tag array entry will have 34 (43-9) bits.

The location cache can also be virtually addressed, based on the architecture shown in Figure 3.2. The revised L2 cache system architecture with a virtually addressed location cache is illustrated in Figure 3.4. In this virtually addressed cache system, the location cache works in parallel with the TLB and the L1 cache. On an L1 cache miss, the physical address (physical tag and index) translated by the TLB and the way information provided by the location cache are both presented to the L2 cache. Since the location cache tries to cache the location information of entire blocks in the L2 cache, the location cache should have the same block address as the L2 cache, instead of the L1 cache.

Figure 3.4: Virtually addressed location cache architecture.

For instance, in Intel Itanium 2 the virtual address is 64 bits and the L2 cache block size is 128 bytes, so the block address of the location cache will be 57 (64-7) bits. If the location cache has 512 entries, then each tag array entry will have 48 (57-9) bits. Compared with the physically addressed location cache, each entry of the tag array in a virtually addressed location cache will have 14 (64-50) more bits.

The data array of a location cache is used to store the way location information of the L2 cache. If the L2 cache has N ways, then the maximum number of bits for storing this information is N bits. As the way number is normally a power of 2, we can also use binary encoding, which needs log2 N bits to store this information. For instance, if the L2 cache is an 8-way set-associative cache, each entry of the data array will be 3 bits in the location cache. We emphasize that the L2 cache is accessed based on the physical address, while the L1 and location caches can be accessed by either physical or virtual address, depending on the implementation strategy.

3.2.2 Working Principle and Hardware Organization of Location Cache

One interesting issue arises here: for which references should locations be cached? Obviously, the location cache should catch the references that turn out to be L1 misses. A recency-based strategy is not suitable, because recent accesses to the L2 cache are very likely to be cached in the L1 caches. The equation below defines the optimal coverage of the location cache:

OptCoverage = L2Coverage − L1Coverage (3.1)

The actual coverage of a location cache will increase as the size of the location cache increases. But, unfortunately, the access time, access power, and leakage power of the location cache will increase too. This will be further explored in the following sections. The proposed cache system works in the following way. The location cache is accessed in parallel with the L1 cache. If the L1 cache sees a hit, the result obtained from the location cache is discarded. If there is a miss in the L1 cache and a hit in the location cache, the L2 cache is accessed as a direct-mapped cache. If both the L1 cache and the location cache see a miss, the L2 cache is accessed as a conventional set-associative cache. When there is a hit in the location cache, both access time and access power of the L2 cache are saved. As opposed to way-prediction methods, the cached location is not a prediction: even on a location cache miss, we do not see any extra delay penalty as seen in way-prediction caches.
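This access flow can be summarized in a compact C sketch; every helper function below is hypothetical and stands in for the corresponding hardware action:

    #include <stdbool.h>
    #include <stdint.h>

    bool l1_probe(uint64_t addr);
    bool loc_probe(uint64_t addr, int *way);        /* fills the cached way number */
    void l2_access_direct(uint64_t addr, int way);  /* activates one way only      */
    int  l2_access_set_assoc(uint64_t addr);        /* probes all N ways           */
    void loc_update(uint64_t addr, int way);

    void memory_access(uint64_t addr)
    {
        int  way;
        bool loc_hit = loc_probe(addr, &way);  /* in parallel with the L1 probe */

        if (l1_probe(addr))
            return;                     /* L1 hit: location result is discarded */

        if (loc_hit) {
            l2_access_direct(addr, way);       /* L2 behaves as direct-mapped */
        } else {
            way = l2_access_set_assoc(addr);   /* conventional N-way access   */
            loc_update(addr, way);             /* remember the way next time  */
        }
    }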

The content (i.e., the new way information) in the location cache is updated when both an L1 miss and a location cache miss occur. The flow diagram for the location cache content update is shown in Figure 3.5. As the location cache stores the location information of the L2 cache, it uses the same block address as the L2 cache, instead of the L1 cache. Normally, the block size of the L2 cache is larger than that of the L1 cache; for instance, in Intel Itanium 2, the L1 block size is 64 bytes while the L2 block size is 128 bytes. Due to this different addressing scheme, the location cache can still catch many references that are L1 misses but location cache hits. When the L2 block size is the same as the L1 block size, although this is not very common, the location cache entry number (e.g., 512) is allocated to be larger than that of the L1 cache (e.g., 64) without exceeding the access time of the L1 cache. This is possible because the location cache data array and its tag array are both small compared with the L1 data array. As a result, the location cache can still catch many L1 misses when the block size in the L1 cache is the same as that in the L2 cache.

Figure 3.5: Flow diagram for location cache content update.

According to the description of the drowsy technique in Section 2.2.2, a cache can be put into drowsy mode when it is not accessed for a period of time. In drowsy mode, the leakage power is significantly reduced compared to the normal mode. As the L1 cache hit rate is normally high, the L2 cache can also be put into drowsy mode when it is idle for a period of time. In a simple L2 cache system, all the ways of the L2 cache are woken up when there is an access to the L2 cache. With the way information stored in the location cache, when there is a hit in the location cache, only the hit way of the L2 cache is woken up, while the other ways can be kept in the drowsy state. So, in the location cache system, separate multiplexors for selecting Vdd or Vdd_low are used for each way, and they are controlled by the location information provided by the location cache. The location cache is organized as a normal direct-mapped cache with a data array and a tag array. For the data array of the location cache, each entry has log2 N bits, where N is the way number of the L2 cache. As the tag array of the location cache has the same block address as the L2 cache, the entry width (W) of its tag array can be calculated by the following equation:

$$W = T - \log_2 B - \log_2 E \quad (3.2)$$

where T is the total access address bit width, E is the entry number of the location cache, and B is the block size of the L2 cache in bytes.
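Equation 3.2 is simple enough to check in a few lines of C (a sketch assuming all three quantities are powers of two, as in the examples that follow):

    #include <math.h>

    /* Tag entry width of the location cache, Equation 3.2. */
    unsigned tag_entry_width(unsigned t,            /* total address bits T     */
                             unsigned block_bytes,  /* L2 block size B in bytes */
                             unsigned entries)      /* location cache entries E */
    {
        return t - (unsigned)log2((double)block_bytes)
                 - (unsigned)log2((double)entries);
    }

    /* Itanium 2 numbers from the text:
     * tag_entry_width(50, 128, 256) == 35,
     * tag_entry_width(50, 128, 512) == 34. */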

Normally the block size of an L2 cache is fixed, so each entry of the tag array will have fewer bits if more entries are used in the location cache. For instance, in Intel Itanium 2 the physical address is 50 bits and the L2 cache block size is 128 bytes. If the location cache has 256 entries, then each tag array entry will have 35 (50-7-8) bits. However, if the location cache has 512 entries, then each tag array entry will have 34 (50-7-9) bits. For an L2 cache system with location cache support, small encoder and decoder hardware must be added. A binary decoder is needed to decode the location information stored in the location cache and select which way to access in the L2 cache. A binary encoder is needed to convert the comparator output of the L2 tag array into the location information stored in the location cache. For instance, if the L2 cache has 8 ways, a 3-to-8 decoder and an 8-to-3 encoder are needed.
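The encoder and decoder pair amounts to a conversion between a binary way number and a one-hot way-select vector, as the following C sketch shows for N = 8 (illustration only; the real blocks are combinational logic):

    #include <stdint.h>

    /* 3-to-8 decoder: binary way number -> one-hot way enable. */
    static inline uint8_t decode_3_to_8(uint8_t way)   /* way in 0..7 */
    {
        return (uint8_t)(1u << way);
    }

    /* 8-to-3 encoder: one-hot tag comparator output -> way number.
     * Exactly one bit of onehot is assumed to be set. */
    static inline uint8_t encode_8_to_3(uint8_t onehot)
    {
        uint8_t way = 0;
        while (!(onehot & 1u)) {
            onehot >>= 1;
            way++;
        }
        return way;
    }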

3.3 Timing Analysis of Location Cache

In this section, first the timing components of a cache and its delay model used in CACTI are introduced, and then the timing characteristic of the location cache is analyzed and justified by CACTI.

3.3.1 Timing Components and Delay Model

The internal cache structure discussed here is shown in Figure 3.6. The decoder first decodes the address and selects the appropriate row by driving one wordline in the data array and one wordline in the tag array. Each array contains as many wordlines as there are rows in the array, but only one wordline in each array can go high at a time. Each memory cell along the selected row is associated with a pair of bitlines; each bitline is initially precharged high. When a wordline goes high, each memory cell in that row pulls down one of its two bitlines; the value stored in the memory cell determines which bitline goes low. Each sense amplifier monitors a pair of bitlines and detects the memory cell state when one changes: by detecting which bitline goes low, the sense amplifier can determine the logic value stored in the memory cell.

The information read from the tag array is compared with the tag bits of the address provided by the CPU. In an N-way set-associative cache, N comparators are required. The results of the N comparisons are used to drive a valid (hit/miss) output as well as the output multiplexors. These output multiplexors select the proper data from the data array (in a set-associative cache, or in a cache in which the data array width is larger than the output width), and drive the selected data out of the cache. From Figure 3.6, the following components can be identified:

• Decoder

• Wordlines (in both the data and tag arrays)

• Bitlines (in both the data and tag arrays)

• Sense Amplifiers (in both the data and tag arrays)

• Comparators

• Multiplexor Drivers

• Output Drivers (data output and valid signal output)

The delay of each component is estimated separately, and then combined to estimate the access time of the entire cache. The analytical model of CACTI is developed by decomposing the circuit of each component into many equivalent RC circuits, and using simple RC equations to estimate the delay of each stage [19]. Since a detailed area model is also provided by CACTI, it also calculates the wire length and hence the associated capacitance and resistance of the address and data routing tracks [21].

Figure 3.6: Internal cache structure.

There are two potential critical paths in a cache read access. If the time to read the tag array, perform the comparison, and drive the multiplexor select signals is larger than the time to read the data array, then the tag side is the critical path. However, if it takes longer to read the data array, then the data side is the critical path [19].

For a direct-mapped cache, the access time is simply the larger of the path through the tag array or the path through the data array, and we have:

$$T_{access,dm} = \max(T_{dataSide} + T_{outDrive,data},\ T_{tagSide,dm} + T_{outDrive,valid}) \quad (3.3)$$

where:

$$T_{dataSide} = T_{decoder,data} + T_{wordline,data} + T_{bitline,data} + T_{sense,data},$$

$$T_{tagSide,dm} = T_{decoder,tag} + T_{wordline,tag} + T_{bitline,tag} + T_{sense,tag} + T_{compare}.$$

In a set-associative cache, the tag array must be read before the data signals can be driven. Thus, the access time of a set-associative cache can be written as:

$$T_{access,sa} = \max(T_{dataSide},\ T_{tagSide,sa}) + T_{outDrive,data} \quad (3.4)$$

where

$$T_{tagSide,sa} = T_{decoder,tag} + T_{wordline,tag} + T_{bitline,tag} + T_{sense,tag} + T_{compare} + T_{muxDriver} + T_{outDrive,inv}.$$
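Equations 3.3 and 3.4 translate directly into code; the sketch below only restates the two formulas, with every delay term supplied externally (e.g., by CACTI):

    static double max2(double a, double b) { return a > b ? a : b; }

    /* Equation 3.3: direct-mapped access time.  The data and tag paths
     * proceed in parallel, and the slower one determines the delay. */
    double t_access_dm(double t_data_side, double t_out_drive_data,
                       double t_tag_side_dm, double t_out_drive_valid)
    {
        return max2(t_data_side + t_out_drive_data,
                    t_tag_side_dm + t_out_drive_valid);
    }

    /* Equation 3.4: set-associative access time.  The tag comparison must
     * finish before the selected data can be driven out, so the output
     * drive time is added after taking the maximum. */
    double t_access_sa(double t_data_side, double t_tag_side_sa,
                       double t_out_drive_data)
    {
        return max2(t_data_side, t_tag_side_sa) + t_out_drive_data;
    }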

3.3.2 Timing Characteristic of Location Cache

As the location cache is accessed in parallel with the L1 cache, the added location cache will not increase the time to access the L2 cache as long as the access time of the location cache is smaller than that of the L1 cache. From the analysis of the location cache structure in Section 3.2.1, we can see that the tag array is much larger than the data array in the location cache; therefore, the critical path of the location cache is in its tag array. Table 3.1 lists the access delays of the location cache for various entry sizes.

The L1 cache discussed in this research is configured the same as in Itanium 2: a 16 KB 4-way set-associative cache with a cache line size of 64 bytes, implemented in a 0.13µm technology. The results were produced using the CACTI 4.2 simulator. We chose the access delay of the L1 cache as the baseline and normalized it to 1. It can be observed that a location cache with up to 1024 entries still has a shorter access latency than the L1 cache, so access delay is not a critical factor in choosing the entry number of the location cache. In the following sections, the relation between the power consumption of the location cache and the entry number is analyzed.

When there is a hit in the location cache, the L2 cache is accessed as a direct-mapped cache. If non-unified L2 cache access, i.e., both direct-mapped and set-associative access, is supported, the location cache design can also improve the performance of the L2 cache.

Table 3.1: Normalized access delays for various location cache configurations

Cache config.  L1 cache  1024 entry  512 entry  256 entry  128 entry  64 entry  32 entry  16 entry
Access delay   1         0.819       0.814      0.796      0.773      0.732     0.731     0.730

To verify the benefit of accessing the L2 cache as a direct-mapped cache, our L2 cache uses the same configuration as that of Itanium 2, which is 256 KB, 8-way, with a block size of 128 bytes, implemented in a 0.13µm technology. From the simulation results of CACTI 4.2, when the L2 cache is accessed as a set-associative cache, the access delay is 1.94736 ns. In contrast, when the L2 cache is accessed as a direct-mapped cache, the access delay is 0.973827 ns, which is 49.99% faster.

3.4 Power Analysis of Location Cache

Location cache not only saves the L2 dynamic access power, but also puts other unaccessed ways into drowsy state to further save static leakage power. In this section, the power model used in CACTI is introduced first, and then mathematical power analysis of the L2 location cache system is presented.

3.4.1 Power Components and Power Model for CACTI

The basic model used for estimation of dynamic power dissipation is shown in Equation 3.5.

$$P_{dyn} = C_L \cdot V_{dd}^2 \cdot P_{0 \to 1} \cdot f \quad (3.5)$$

where $C_L$ is the physical capacitance of a device, $V_{dd}$ is the device supply voltage,

$P_{0 \to 1}$ is the probability of a transition at the capacitive load from '0' to '1', and $f$ is the frequency of the cache operation [20]. Since the CACTI model tracks the physical capacitance of each stage of the cache model, the corresponding switching activity, and the number of such devices, the dynamic power consumed at each stage can be modeled by Equation 3.5. As the read and write operations involve different components of the cache, different dynamic power consumptions are estimated in CACTI. As the sub-threshold leakage current is the dominant leakage source, only sub-threshold leakage is modeled in CACTI. The sub-threshold leakage current in a MOSFET is modeled as Equation 2.1 in Section 2.2.1. In Equation 2.1, for a given threshold voltage ($V_{th}$) and temperature ($T$), all terms except the device width ($W$) are constant for all the transistors in a given design.

So, Equation 2.1 can be reduced to Equation 3.6, where Il is the leakage of a unit width transistor at a given temperature and threshold voltage [18].

$$I_{lkg} = W \cdot I_l(T, V_{th}) \quad (3.6)$$

The width of a transistor in a cache design is determined based on the capacitive load driven by its gate. CACTI works by calculating the leakage for each gate after the width of the gate has been determined. The leakage per gate is then multiplied by the width of the circuit and then by the number of instances of that particular circuit [22]. When a cache is not active, it can be put into the drowsy state to reduce the leakage power. According to the analysis of Section 2.2.2, Vdd can then be reduced to Vdd_low. In the following analysis, CACTI 4.2 is simulated with Vdd_low set to 0.35 V to obtain the drowsy leakage power.

3.4.2 Mathematical Power Analysis for L2 Location Cache System

As the location cache caches the way locations of the L2 cache, it knows which way to activate for the next access when there is a location cache hit. According to this location information, it accesses the L2 cache as a direct-mapped cache to save dynamic power, and puts the other, unaccessed ways into the drowsy state to save leakage power as well. In a memory hierarchy, the L2 cache is accessed only when there is an L1 cache miss. When the L2 cache is active, it consumes dynamic power and normal leakage power. When the L2 cache is in standby mode, it can be put into the drowsy state, in which it consumes only drowsy leakage power.

In the following, the dynamic power and leakage power of a traditional L2 cache are analyzed first, and then the power consumption of an L2 cache system with a location cache is analyzed. Finally, the saved power and power saving rate of the location cache system are both presented.

1. The power consumption of a traditional L2 cache (Pcache(no loc)) can be written as shown in Equation 3.7:

$$P_{cache}(no\_loc) = (P_{l2\_dyn} + P_{l2\_lkg}) \cdot MR_{l1} + P_{l2\_dwy} \cdot (1 - MR_{l1}) \quad (3.7)$$

where we have:

$P_{l2\_lkg}$: the leakage power of the L2 cache,
$P_{l2\_dwy}$: the drowsy leakage power of the L2 cache,
$MR_{l1}$: the L1 cache miss rate,
$P_{l2\_dyn}$: the average dynamic power of the L2 cache,

$$P_{l2\_dyn} = P_{l2\_rd} \cdot R_{rd} + P_{l2\_wt} \cdot R_{wt}$$

$R_{rd}$: the ratio of read operations to total operations,
$R_{wt}$: the ratio of write operations to total operations.

2. For an L2 cache system with a location cache, the total power ($P_{cache}(with\_loc)$) is composed of two components: the power of the added location cache ($P_{loc}$), and the power of the L2 cache ($P_{l2}(with\_loc)$).

(a) The added power consumption due to the small location cache (Ploc):

$$P_{loc} = P_{loc\_dyn} + P_{loc\_lkg} \quad (3.8)$$

where we have:

$P_{loc\_dyn}$: the dynamic power of the location cache,
$P_{loc\_lkg}$: the leakage power of the location cache.

(b) The total power consumption of an L2 cache ($P_{l2}(with\_loc)$) with a location cache can be categorized into the following three classes:

36 • Class 1: when the L1 cache misses, and the location cache hits.

$$P^1_{l2}(with\_loc) = \frac{1}{N} \cdot (P_{l2\_dyn} + P_{l2\_lkg}) \cdot MR_{l1} \cdot HR_{loc} + \frac{N-1}{N} \cdot P_{l2\_dwy} \cdot MR_{l1} \cdot HR_{loc} \quad (3.9)$$

where $N$ is the number of ways in the L2 cache and $HR_{loc}$ is the location cache hit rate.

• Class 2: when the L1 cache misses, and the location cache misses.

$$P^2_{l2}(with\_loc) = (P_{l2\_dyn} + P_{l2\_lkg}) \cdot MR_{l1} \cdot (1 - HR_{loc}) \quad (3.10)$$

• Class 3: when the L1 cache hits.

$$P^3_{l2}(with\_loc) = P_{l2\_dwy} \cdot (1 - MR_{l1}) \quad (3.11)$$

The total L2 cache power (Pl2(with loc)) in a location cache system can thus be determined as:

$$P_{l2}(with\_loc) = P^1_{l2}(with\_loc) + P^2_{l2}(with\_loc) + P^3_{l2}(with\_loc) \quad (3.12)$$

(c) The total power (Pcache(with loc)) of an L2 location cache system is:

$$P_{cache}(with\_loc) = P_{l2}(with\_loc) + P_{loc} \quad (3.13)$$

3. The saved power ($P_{sav}$) of the L2 location cache system can be derived as:

$$P_{sav} = P_{cache}(no\_loc) - P_{cache}(with\_loc) \quad (3.14)$$
$$= \frac{N-1}{N} \cdot (P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1} \cdot HR_{loc} - P_{loc} \quad (3.15)$$

and the power saving rate Rpw sav is:

$$R_{pw\_sav} = P_{sav} / P_{cache}(no\_loc) \quad (3.16)$$
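Equations 3.7 through 3.16 can be folded into a single routine, which is how one would reproduce the power saving rate from simulator statistics (a sketch; the parameter names are ours, the power numbers would come from CACTI, and mr_l1 and hr_loc from the CPU simulator):

    typedef struct {
        double p_l2_dyn;   /* average L2 dynamic power      */
        double p_l2_lkg;   /* L2 leakage power, normal mode */
        double p_l2_dwy;   /* L2 leakage power, drowsy mode */
        double p_loc_dyn;  /* location cache dynamic power  */
        double p_loc_lkg;  /* location cache leakage power  */
        int    n_ways;     /* L2 associativity N            */
    } power_params;

    /* Returns R_pw_sav of Equation 3.16; a negative value means the
     * location cache overhead outweighs the L2 savings. */
    double power_saving_rate(const power_params *p,
                             double mr_l1, double hr_loc)
    {
        double n = (double)p->n_ways;

        /* Equation 3.7: traditional L2 cache. */
        double p_no_loc = (p->p_l2_dyn + p->p_l2_lkg) * mr_l1
                        + p->p_l2_dwy * (1.0 - mr_l1);

        /* Equations 3.9-3.11: the three access classes. */
        double c1 = (1.0 / n) * (p->p_l2_dyn + p->p_l2_lkg) * mr_l1 * hr_loc
                  + ((n - 1.0) / n) * p->p_l2_dwy * mr_l1 * hr_loc;
        double c2 = (p->p_l2_dyn + p->p_l2_lkg) * mr_l1 * (1.0 - hr_loc);
        double c3 = p->p_l2_dwy * (1.0 - mr_l1);

        /* Equations 3.8, 3.12, 3.13: add the location cache overhead. */
        double p_with_loc = c1 + c2 + c3 + p->p_loc_dyn + p->p_loc_lkg;

        /* Equations 3.14, 3.16. */
        return (p_no_loc - p_with_loc) / p_no_loc;
    }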

Since the location cache is accessed in parallel with the L1 cache, it is accessed all the time. When the L1 cache has a very high hit rate, the location cache itself consumes a considerable amount of power that is entirely wasted, sometimes even more than the power saved in the L2 cache system. In the above analysis, if $P_{sav} > 0$, the location cache system saves power when compared with a conventional L2 cache system. Otherwise, the saved power cannot offset the power consumed by the added location cache hardware. Detailed experimental results are presented in the next chapter.

When the L2 cache is woken up from the drowsy state to the normal state, the power supply voltage is switched from $V_{dd\_low}$ to $V_{dd}$. This causes a wake-up power consumption, which is dynamic power in nature and can be modeled as Equation 3.17:

$$P_{wakeup} = \frac{(V_{dd} - V_{dd\_low}) \cdot C}{T} \quad (3.17)$$

where C is the total switching capacitance (e.g., the diffusion capacitance connected to VVDD in Figure 2.3(a)), and T is the wake-up time. When only one way of the L2 cache is woken up, the total switching capacitance is only 1/N of the capacitance C for waking up all ways of the L2 cache. So, if we define the wake-up power of all ways of the L2 cache as $P_{wake\_all}$, the wake-up power of only one way is $P_{wake\_one} = \frac{1}{N} \cdot P_{wake\_all}$. In our location cache system, when there is a hit in the location cache, only one of the N ways in the L2 cache needs to be woken up. Consequently, the wake-up power can also be saved in our location cache architecture.

When considering the wake-up power, Equations (3.7), (3.13), (3.15) and (3.16) can be modified as:

$$P'_{cache}(no\_loc) = P_{cache}(no\_loc) + P_{wake\_all} \cdot MR_{l1} \quad (3.18)$$

$$P'_{cache}(with\_loc) = P_{cache}(with\_loc) + \frac{1}{N} \cdot P_{wake\_all} \cdot MR_{l1} \cdot HR_{loc} + (1 - HR_{loc}) \cdot P_{wake\_all} \cdot MR_{l1} \quad (3.19)$$

$$P'_{sav} = P'_{cache}(no\_loc) - P'_{cache}(with\_loc) = P_{sav} + \frac{N-1}{N} \cdot P_{wake\_all} \cdot MR_{l1} \cdot HR_{loc} \quad (3.20)$$

$$R'_{pw\_sav} = P'_{sav} / P'_{cache}(no\_loc) \quad (3.21)$$

where Pwake all is the wake-up power of all ways for the L2 cache as discussed above.

We can prove that $R_{pw\_sav} < R'_{pw\_sav}$, i.e., that the following inequality holds:

$$R_{pw\_sav} < R'_{pw\_sav} \quad (3.22)$$

Substituting the power saving rate with wake-up power considered, we have:

$$\frac{P_{sav}}{P_{cache}(no\_loc)} < \frac{P_{sav} + \frac{N-1}{N} \cdot P_{wake\_all} \cdot MR_{l1} \cdot HR_{loc}}{P_{cache}(no\_loc) + P_{wake\_all} \cdot MR_{l1}} \quad (3.23)$$

Manipulating both sides of the above equation, we get:

$$P_{sav} \cdot P_{wake\_all} \cdot MR_{l1} < P_{cache}(no\_loc) \cdot \frac{N-1}{N} \cdot HR_{loc} \cdot P_{wake\_all} \cdot MR_{l1} \quad (3.24)$$

Finally, we have to prove that the following inequality holds:

$$\frac{P_{sav}}{P_{cache}(no\_loc)} < \frac{N-1}{N} \cdot HR_{loc} \quad (3.25)$$

Inequality 3.25 can be proven by replacing $P_{sav}$ and $P_{cache}(no\_loc)$ with Equations 3.15 and 3.7, and by the following derivation:

$$\frac{P_{sav}}{P_{cache}(no\_loc)} = \frac{\frac{N-1}{N} \cdot HR_{loc} \cdot (P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1} - P_{loc}}{(P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1} + P_{l2\_dwy}}$$
$$< \frac{\frac{N-1}{N} \cdot HR_{loc} \cdot (P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1}}{(P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1}}$$
$$= \frac{N-1}{N} \cdot HR_{loc}$$

From the proof above, R'pw sav is always greater than Rpw sav; that is, the power saving rate is greater when the wake-up power is considered than when it is not. As the wake-up circuitry is not included in the CACTI cache model and varies across implementations, this part of the power saving is not included in our experimental results in the next chapter. However, if the wake-up power were considered, even better power saving results could be expected for the proposed location cache design.
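As a quick numeric sanity check of this result, the sketch below reuses the placeholder values from the earlier sketch; Pwake all is likewise an assumed normalized number, not a measured wake-up power:

    # Sanity check of R'_pw_sav > R_pw_sav (Equations 3.18-3.21), using
    # hypothetical placeholder values; P_wake_all is assumed, not measured.
    N, MR_l1, HR_loc = 8, 0.10, 0.90
    P_sav, P_no_loc = 0.0942, 0.195     # Equations 3.15 and 3.7, from above
    P_wake_all = 0.30

    P_sav_w = P_sav + (N - 1) / N * P_wake_all * MR_l1 * HR_loc   # Eq. 3.20
    P_no_loc_w = P_no_loc + P_wake_all * MR_l1                    # Eq. 3.18

    R = P_sav / P_no_loc          # Equation 3.16
    R_w = P_sav_w / P_no_loc_w    # Equation 3.21
    # The proof needs R < (N-1)/N * HR_loc, which holds here, so R_w > R.
    assert R < (N - 1) / N * HR_loc and R_w > R
    print(R, R_w)   # about 0.483 vs. 0.524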

Chapter 4

Power Experiment Results of Location Cache

In this chapter, the simulation environment for the power experiments is described first, and then the experimental results when the L1 data cache uses the write through and the write back policies are presented.

4.1 Experimental Environment

For the power experiments of the location cache system, the CACTI 4.2 cache model is used to evaluate the power consumption of the location cache and of the L1 and L2 caches.

Compared to previous versions, CACTI 4.2 has several major improvements [22]. First, the technology scaling model in previous versions of CACTI was a simple ideal linear scaling; since many aspects of deep-submicron technology scaling are far from ideal, CACTI 4.2 uses new device parameters specific to each technology. Second, some basic circuit structures are updated to reflect changes in modern processes: a smaller individual SRAM cell is used, which affects the estimated gate and wire capacitances, and the original current-mirror-based sense amplifier design is replaced by a latch-based voltage-mode design, which greatly reduces the power consumed by the sense amplifiers. Third, a leakage power consumption model is added. We also use this leakage power model to derive the drowsy-state leakage power consumption of the L2 cache, as described in Section 3.4.1.

The Simplescalar 3.0d CPU simulator is used to obtain the L1 cache miss rate and the location cache hit rate needed to evaluate the power consumption of the location cache system [24]. The write through policy was not supported in the original Simplescalar simulator, but this policy is sometimes used for an L1 data cache, as discussed in the next section; thus, the simulator was extended to support the write through policy for the L1 data cache. A new location cache simulation model was also added to the Simplescalar simulator. The L2 cache system configuration is similar to that of the Intel Itanium 2 processor [5]. Physically addressed, separate 16 KB L1 instruction and data caches are used; both are 4-way set-associative with a cache line size of 64 bytes and 64 sets. The L2 cache is a 256 KB, 8-way set-associative unified cache with a cache line size of 128 bytes. The bus between the L1 and L2 caches is 256 bits wide. A 0.13 µm technology is used for the cache power simulation, similar to the Itanium 2 processor.
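To illustrate the kind of bookkeeping such a simulator extension must perform, the following is a minimal sketch of a location cache model. It is not the actual Simplescalar patch; the direct-mapped indexing and the (tag, way) entry format are simplifying assumptions made here:

    # A minimal sketch (not the authors' Simplescalar extension) of the
    # bookkeeping a location cache model needs.
    class LocationCacheModel:
        def __init__(self, num_entries):
            self.table = [None] * num_entries   # one (tag, way) per entry
            self.hits = 0
            self.probes = 0

        def probe(self, l2_set, tag):
            """Look up the cached L2 way; on a hit, the L2 cache can be
            accessed as a direct-mapped cache, otherwise a full
            set-associative L2 access is needed."""
            self.probes += 1
            entry = self.table[l2_set % len(self.table)]
            if entry is not None and entry[0] == tag:
                self.hits += 1
                return entry[1]
            return None

        def update(self, l2_set, tag, way):
            """Record the way once the L2 cache has resolved it."""
            self.table[l2_set % len(self.table)] = (tag, way)

        def hit_rate(self):            # the HR_loc fed into Equation 3.15
            return self.hits / self.probes if self.probes else 0.0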

To evaluate the power consumption of a location cache system on real-world workloads, all 26 SPEC CPU2000 and 6 SPEC CPU2006 benchmark applications [25] are simulated with the reference input set for 100 million instructions. The details of the 12 CINT2000, 14 CFP2000, and 6 CPU2006 benchmark applications are listed in Table 4.1, Table 4.2, and Table 4.3, respectively.

Table 4.1: The 12 CINT2000 benchmark applications
164.gzip      Data compression utility
175.vpr       FPGA circuit placement and routing
176.gcc       C compiler
181.mcf       Minimum cost flow network
186.crafty    Chess program
197.parser    Natural language processing
252.eon       Ray tracing
253.perlbmk   Perl
254.gap       Computational group theory
255.vortex    Object oriented database
256.bzip2     Data compression utility
300.twolf     Place and route simulator

4.2 Experimental Results with Write Through L1 Data Cache

As described in Section 2.1.3, two write policies, write through and write back, can be used in a cache, each with its own advantages and disadvantages. Under the write through policy, every write operation also generates an access to the next level of cache or main memory, no matter whether it hits or misses at this level. Consequently, with write through the ratio of the number of L2 cache accesses to the number of L1 cache accesses, which is defined here as the L1 miss rate, is much higher than with write back. The Intel Itanium 2 processor uses the write through policy for its L1 data cache, so we first present the simulation results using the write through policy.
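The effect of the write policy on this ratio can be illustrated with a small sketch; the trace below is fabricated, and dirty write-backs on eviction are ignored for simplicity:

    # Why write through inflates the L2-access-to-L1-access ratio: every
    # write, hit or miss, reaches the L2 cache. Illustrative sketch only;
    # dirty write-backs on eviction are ignored for simplicity.
    def l2_access_ratio(ops, write_through):
        """ops: list of (kind, hit) with kind in {'r', 'w'}."""
        l2_accesses = 0
        for kind, hit in ops:
            if not hit:
                l2_accesses += 1               # misses always go to L2
            elif kind == 'w' and write_through:
                l2_accesses += 1               # write hits also go to L2
        return l2_accesses / len(ops)

    # A fabricated trace: 80% hits, 36% writes.
    trace = ([('r', True)] * 50 + [('w', True)] * 30
             + [('r', False)] * 14 + [('w', False)] * 6)
    print(l2_access_ratio(trace, write_through=True))    # 0.50
    print(l2_access_ratio(trace, write_through=False))   # 0.20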

Table 4.2: The 14 CFP2000 benchmark applications
168.wupwise   Quantum chromodynamics
171.swim      Shallow water modeling
172.mgrid     Multi-grid solver in 3D potential field
173.applu     Parabolic/elliptic partial differential equations
177.mesa      3D graphics library
178.galgel    Fluid dynamics: analysis of oscillatory instability
179.art       Neural network simulation; adaptive resonance theory
183.equake    Finite element simulation; earthquake modeling
187.facerec   Computer vision: recognizes faces
188.ammp      Computational chemistry
189.lucas     Number theory: primality testing
191.fma3d     Finite element crash simulation
200.sixtrack  Particle accelerator model
301.apsi      Solves problems regarding temperature, wind, velocity, and distribution of pollutants

Table 4.3: The 6 SPEC CPU2006 benchmark applications
401.bzip2     Integer; compression
429.mcf       Integer; combinatorial optimization
445.gobmk     Integer; artificial intelligence: Go
433.milc      Floating point; physics: quantum chromodynamics
470.lbm       Floating point; fluid dynamics
999.specrand  Floating point; pseudorandom number generator

Table 4.4: L1 cache miss rate of SPEC CINT2000: write through.
Benchmark     gzip    vpr     gcc     mcf     crafty  parser
L1 miss rate  0.1466  0.0503  0.0966  0.1483  0.0693  0.0607
Benchmark     eon     perlbmk gap     vortex  bzip2   twolf
L1 miss rate  0.1187  0.1398  0.1578  0.1332  0.1407  0.0690

Table 4.5: L1 cache miss rate of SPEC CFP2000: write through.
Benchmark     wupwise swim    mgrid   applu   mesa    galgel   art
L1 miss rate  0.0547  0.0197  0.0777  0.0756  0.0608  0.0396   0.1445
Benchmark     equake  facerec ammp    lucas   fma3d   sixtrack apsi
L1 miss rate  0.0700  0.1408  0.1382  0.0045  0.0424  0.1827   0.2289

The L1 cache miss rates of SPEC CINT2000, SPEC CFP2000, and SPEC CPU2006 are shown in Table 4.4, Table 4.5, and Table 4.6, respectively. From these tables, it can be observed that the L1 cache miss rate varies from 0.45% (lucas of CFP2000) to 40.5% (bzip2 of CPU2006), depending on the benchmark application.

The location cache hit rates for SPEC CINT2000 are shown in Figure 4.1 and Figure 4.2 for different location cache sizes. The location cache hit rates for SPEC CFP2000 are shown in Figure 4.3 and Figure 4.4, and those for SPEC CPU2006 in Figure 4.5 and Figure 4.6. From these figures, it can be seen that the location cache hit rate increases as the cache entry number increases. Whether the hit rate is sensitive to the entry number is highly dependent on the application type.

Table 4.6: L1 cache miss rate of SPEC CPU2006: write through.
Benchmark     bzip2     mcf       gobmk     milc      lbm       specrand
L1 miss rate  0.405286  0.399973  0.072794  0.399940  0.134418  0.108008


Figure 4.1: SPEC CINT2000 location cache hit rate vs large entry number: write through.

In some applications, e.g., mcf and milc in CPU2006, the location cache hit rates are not sensitive to the cache entry number. In other applications, e.g., art in CFP2000, the hit rates are sensitive only to small entry numbers: an entry number as small as 16 already gives art a relatively high hit rate, and increasing the entry number further does not raise the hit rate significantly. According to Equation 3.15, the saved power of an L2 location cache system is highly dependent on the L1 cache miss rate and the location cache hit rate. In the following, Figure 4.7 and Figure 4.8 show the location cache power saving rate for SPEC CINT2000 applications; Figure 4.9 and Figure 4.10 show it for SPEC CFP2000 applications; and Figure 4.11 and Figure 4.12 show it for SPEC CPU2006 applications. Here, the power saving rate is the ratio of the power saved by the location cache system (Equation 3.15) to the power consumption of the traditional L2 cache (Equation 3.7).


Figure 4.2: SPEC CINT2000 location cache hit rate vs small entry number: write through.


Figure 4.3: SPEC CFP2000 location cache hit rate vs large entry number: write through.


Figure 4.4: SPEC CFP2000 location cache hit rate vs small entry number: write through.


Figure 4.5: SPEC CPU2006 location cache hit rate vs large entry number: write through.


Figure 4.6: SPEC CPU2006 location cache hit rate vs small entry number: write through.


From the analysis in the last chapter, the location cache itself adds power consumption, but it saves power in the L2 cache. Thus, Equation 3.15 has two components, one positive and one negative. The positive part is the saved power of the L2 cache, which depends on the L1 cache miss rate and the location cache hit rate: when the location cache entry number increases, the location cache hit rate normally increases, and the saved L2 power increases accordingly. The negative part is the added power consumption of the location cache, which depends on its size: when the entry number increases, both the dynamic and the leakage power of the location cache increase.


Figure 4.7: SPEC CINT2000 location cache power saving rate vs large entry number: write through.


Figure 4.8: SPEC CINT2000 location cache power saving rate vs small entry number: write through.


Figure 4.9: SPEC CFP2000 location cache power saving rate vs large entry number: write through.


Figure 4.10: SPEC CFP2000 location cache power saving rate vs small entry number: write through.


Figure 4.11: SPEC CPU2006 location cache power saving rate vs large entry number: write through.


Figure 4.12: SPEC CPU2006 location cache power saving rate vs small entry number: write through.

From Figure 4.7 to Figure 4.12, the location cache system saves power for all the applications, except lucas, and swim with large entry numbers, in CFP2000. This is because lucas has a very low L1 cache miss rate (0.45%) and swim has a relatively low L1 cache miss rate (1.97%). Depending on the application, for CINT2000 the power saving rates range from 30.4% (vpr) to 55.3% (gap); for CFP2000 (except lucas) they range from 11.0% (swim) to 57.7% (sixtrack); for CPU2006 they range from 37.2% (gobmk) to 72.6% (mcf).

4.3 Experimental Results with Write Back L1 Data Cache

In the last section, we showed the power experiment results when the L1 data cache uses the write through policy. In this section, we show the power experiment results when the L1 data cache uses the write back policy, with all other configurations the same as in the last section.

The L1 cache miss rates for SPEC CINT2000, SPEC CFP2000, and SPEC CPU2006 are shown in Table 4.7, Table 4.8, and Table 4.9, respectively. Since the write back policy is used, many write operations are captured as hits in the L1 data cache, so the overall L1 miss rate with write back is smaller than with write through. In our simulation results, the L1 cache miss rates vary from 0.11% (lucas of CFP2000) to 10.2% (art of CFP2000), depending on the benchmark application. The location cache hit rates for SPEC CINT2000 are shown in Figure 4.13 and Figure 4.14 for different location cache sizes.

Table 4.7: L1 cache miss rate of SPEC CINT2000: write back.
Benchmark     gzip    vpr     gcc     mcf     crafty  parser
L1 miss rate  0.0249  0.0018  0.0262  0.0356  0.0262  0.0024
Benchmark     eon     perlbmk gap     vortex  bzip2   twolf
L1 miss rate  0.0058  0.0198  0.0394  0.0264  0.0199  0.0039

Table 4.8: L1 cache miss rate of SPEC CFP2000: write back.
Benchmark     wupwise swim    mgrid   applu   mesa    galgel   art
L1 miss rate  0.0023  0.0022  0.0255  0.0183  0.0020  0.0026   0.1020
Benchmark     equake  facerec ammp    lucas   fma3d   sixtrack apsi
L1 miss rate  0.0037  0.0328  0.0447  0.0011  0.0096  0.0440   0.0685

The location cache hit rates for SPEC CFP2000 are shown in Figure 4.15 and Figure 4.16, and those for SPEC CPU2006 in Figure 4.17 and Figure 4.18. From these figures, it can be observed that the location cache hit rate increases as the cache entry number increases. For most applications, the hit rate is not sensitive to the entry number when the entry number is smaller than 16; when the entry number is larger than 16, increasing it raises the hit rate greatly. When 512 entries are used, for CINT2000 the hit rates range from 58.7% (bzip2) to 87.2% (twolf); for CFP2000 from 33.3% (galgel) to 96.2% (equake); for CPU2006 from 65.7% (bzip2) to 97.2% (specrand).

In the following, Figure 4.19 and Figure 4.20 show the location cache power saving rates for SPEC CINT2000 applications; Figure 4.21 and Figure 4.22 show those for SPEC CFP2000 applications; and Figure 4.23 and Figure 4.24 show those for SPEC CPU2006 applications.

Table 4.9: L1 cache miss rate of SPEC CPU2006: write back.
Benchmark     bzip2     mcf       gobmk     milc      lbm       specrand
L1 miss rate  0.041754  0.049996  0.011061  0.049997  0.016475  0.028562


Figure 4.13: SPEC CINT2000 location cache hit rate vs large entry number: write back.


Figure 4.14: SPEC CINT2000 location cache hit rate vs small entry number: write back.


Figure 4.15: SPEC CFP2000 location cache hit rate vs large entry number: write back.


Figure 4.16: SPEC CFP2000 location cache hit rate vs small entry number: write back.


Figure 4.17: SPEC CPU2006 location cache hit rate vs large entry number: write back.


Figure 4.18: SPEC CPU2006 location cache hit rate vs small entry number: write back.


Figure 4.19: SPEC CINT2000 location cache power saving rate vs large entry number: write back.

According to Equation 3.15, the saved power of an L2 location cache system is highly dependent on the L1 cache miss rate and the location cache hit rate. Although a large entry number (e.g., 512 entries) in the location cache can bring a high location cache hit rate, it does not guarantee that the location cache system will save power. Since the L1 cache miss rate with write back is much lower than with write through, it plays a big role here. In Equation 3.15, there are two components: a positive one, which is the saved power of the L2 cache, and a negative one, which is the added power consumption of the location cache itself. When the L1 cache miss rate is lower than a threshold rate, e.g., 3%, then no matter which entry number is chosen, the saved power of the L2 cache cannot offset the added power consumption of the location cache, and the location cache system will not save power.

59 0.04

0.02

0

−0.02

−0.04

−0.06

Power Saving Rate −0.08

−0.1

1 entry 2 entry −0.12 4 entry 8 entry 16 entry −0.14 gzip vpr gcc mcf crafty parser eon perlbmk gap vortex bzip2 twolf CINT2000 Benchmark

Figure 4.20: SPEC CINT2000 location cache power saving rate vs small entry number: write back.


Figure 4.21: SPEC CFP2000 location cache power saving rate vs large entry number: write back.

60 0.1

0.05

0

−0.05 Power Saving Rate

−0.1 1 entry 2 entry 4 entry 8 entry 16 entry −0.15 wupwise swim mgrid applu mesa galgel art equake facerec ammp lucas fma3d sixtrack apsi CFP2000 Benchmark

Figure 4.22: SPEC CFP2000 location cache power saving rate vs small entry number: write back.


Figure 4.23: SPEC CPU2006 location cache power saving rate vs large entry number: write back.


Figure 4.24: SPEC CPU2006 location cache power saving rate vs small entry number: write back.

This is illustrated by vpr, eon, and twolf in CINT2000; wupwise, swim, and mesa in CFP2000; and gobmk in CPU2006. Based on the simulation results from CACTI, read power is larger than write power, so the ratio of read operations to write operations also plays a role here. For the following estimation, we assume a read-to-write ratio of 1:1 in the L2 cache. The threshold rate can be calculated by setting the positive part equal to the negative part in Equation 3.15. We assume the hit rate of the location cache to be 100% (although it is normally not reachable), and let MRl1 th be the threshold miss rate of the L1 cache. By

Equation 3.8 and Equation 3.15, when HRloc is assigned 100%, MRl1 th can be expressed in the following equation:

MRl1 th = (Ploc dyn + Ploc lkg) / ((N−1)/N · (Pl2 dyn + Pl2 lkg − Pl2 dwy))   (4.1)

If the L1 cache miss rate is smaller than the threshold miss rate (MRl1 th), the location cache will not save power. Although Ploc dyn, Ploc lkg, and HRloc vary with the entry number, we can still use an entry number of 512 to estimate the L1 threshold miss rate. From the data of our cache power model, it is calculated as 2.67%. This agrees with our simulation results that the location cache system does not save power when the L1 cache miss rate is too small.
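The calculation behind Equation 4.1 can be reproduced with a short sketch; the power values below are stand-in placeholders, while the 2.67% figure quoted above comes from the actual CACTI 4.2 model:

    # Threshold L1 miss rate from Equation 4.1, with placeholder powers.
    N = 8
    P_loc_dyn, P_loc_lkg = 0.018, 0.004              # assumed location cache power
    P_l2_dyn, P_l2_lkg, P_l2_dwy = 1.0, 0.5, 0.05    # assumed L2 power terms

    MR_l1_th = (P_loc_dyn + P_loc_lkg) / (
        (N - 1) / N * (P_l2_dyn + P_l2_lkg - P_l2_dwy))
    print(f"MR_l1_th = {MR_l1_th:.2%}")   # about 1.73% with these numbers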

When the L1 cache miss rate is larger than the threshold miss rate (MRl1 th), as for mcf and gap in CINT2000; art, ammp, and sixtrack in CFP2000; and mcf, milc, and specrand in CPU2006, the location cache system saves power with large entry numbers. From Figure 4.19, Figure 4.21, and Figure 4.23, it can be observed that 256 or 512 entries achieve the most significant power saving. The power saving rate reaches as high as 21.1% (gap) in CINT2000, 22.7% (art) in CFP2000, and 30.0% (specrand) in CPU2006.

4.4 Experimental Results Without Drowsy State

From the location information stored in the location cache, both the dynamic power and the leakage power of an L2 cache can be saved by not waking up the unaccessed ways. However, if the drowsy technique is not used, only the dynamic power of the L2 cache is saved in our location cache architecture. In this section, we present the power analysis of the location cache without the drowsy state, and compare the results with those obtained with the drowsy state. By repeating an analysis similar to that presented in Section 3.4.2 (cf. Equations 3.7 and 3.15), the power consumption (Pnodwy cache(no loc)) of a traditional L2 cache (without a location cache and without the drowsy state) and the saved power (Pnodwy sav) of the L2 location cache system (with a location cache but without the drowsy state) can be derived as Equations 4.2 and 4.3:

Pnodwy cache(no loc) = Pl2 dyn · MRl1 + Pl2 lkg   (4.2)

Pnodwy sav = (N−1)/N · Pl2 dyn · MRl1 · HRloc − Ploc   (4.3)
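A quick side-by-side evaluation of Equations 3.15 and 4.3, again with placeholder numbers rather than CACTI outputs, shows how much of the saving is lost when the drowsy state is dropped:

    # Saved power with vs. without the drowsy state (Equations 3.15, 4.3).
    # Hypothetical normalized values, not CACTI outputs.
    N, MR_l1, HR_loc = 8, 0.10, 0.90
    P_l2_dyn, P_l2_lkg, P_l2_dwy, P_loc = 1.0, 0.5, 0.05, 0.02

    P_sav = ((N - 1) / N * (P_l2_dyn + P_l2_lkg - P_l2_dwy)
             * MR_l1 * HR_loc - P_loc)                             # with drowsy
    P_nodwy_sav = (N - 1) / N * P_l2_dyn * MR_l1 * HR_loc - P_loc  # without

    assert P_nodwy_sav < P_sav    # the leakage saving is lost
    print(P_sav, P_nodwy_sav)     # about 0.094 vs. 0.059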

Compared with Pcache(no loc) in Equation 3.7, Pnodwy cache(no loc) is larger, since the L2 cache is not put into the drowsy state when it is in standby mode. However, Pnodwy sav is smaller than Psav in Equation 3.15, since the location cache hit information is no longer used to save L2 leakage power; only the L2 dynamic power is saved. The simulation results with both the write through and write back policies are shown in Tables 4.10 to 4.15. Here, the location cache has 256 entries.

Table 4.10: Location cache power saving rate of SPEC CINT2000: write through.
Benchmark                         gzip     vpr      gcc     mcf     crafty  parser
Power saving rate with drowsy     0.5111   0.2391   0.4090  0.4993  0.3220  0.2928
Power saving rate without drowsy  0.0129   -0.0101  0.0203  0.0029  0.0150  -0.0080
Benchmark                         eon      perlbmk  gap     vortex  bzip2   twolf
Power saving rate with drowsy     0.4861   0.5051   0.5113  0.4806  0.5050  0.3442
Power saving rate without drowsy  0.0063   0.0201   0.0040  0.0036  0.0089  -0.0031

Table 4.11: Location cache power saving rate of SPEC CFP2000: write through.
Benchmark                         wupwise  swim     mgrid   applu   mesa     galgel   art
Power saving rate with drowsy     0.2585   -0.0054  0.3485  0.3929  0.3010   0.1613   0.3712
Power saving rate without drowsy  -0.0104  -0.0159  0.0064  0.0094  -0.0072  -0.0129  0.0553
Benchmark                         equake   facerec  ammp    lucas    fma3d    sixtrack apsi
Power saving rate with drowsy     0.3455   0.5055   0.4610  -0.2028  0.2174   0.5457   0.5532
Power saving rate without drowsy  -0.0037  0.0216   0.0005  -0.0183  -0.0019  0.0087   0.0367

Table 4.12: Location cache power saving rate of SPEC CPU2006: write through.
Benchmark                         bzip2   mcf     gobmk    milc    lbm     specrand
Power saving rate with drowsy     0.7000  0.6972  0.3299   0.6972  0.4934  0.4826
Power saving rate without drowsy  0.0506  0.0389  -0.0069  0.0389  0.0014  0.0307

Table 4.13: Location cache power saving rate of SPEC CINT2000: write back.
Benchmark                         gzip     vpr      gcc      mcf     crafty  parser
Power saving rate with drowsy     0.0448   -0.2407  0.0469   0.1873  0.0782  -0.2210
Power saving rate without drowsy  -0.0050  -0.0181  -0.0025  0.0037  0.0002  -0.0174
Benchmark                         eon      perlbmk  gap     vortex   bzip2    twolf
Power saving rate with drowsy     -0.1166  0.0093   0.2110  0.1050   0.009    -0.1597
Power saving rate without drowsy  -0.0129  -0.0063  0.0060  -0.0022  -0.0076  -0.0147

Table 4.14: Location cache power saving rate of SPEC CFP2000: write back.
Benchmark                         wupwise  swim     mgrid    applu    mesa     galgel   art
Power saving rate with drowsy     -0.2209  -0.2250  0.0803   0.0972   -0.2163  -0.2521  0.2273
Power saving rate without drowsy  -0.0175  -0.0176  -0.0016  -0.0004  -0.0170  -0.0187  0.0320
Benchmark                         equake   facerec  ammp    lucas    fma3d    sixtrack apsi
Power saving rate with drowsy     -0.1697  0.0669   0.2179  -0.2487  -0.0692  0.2239   0.2229
Power saving rate without drowsy  -0.0152  -0.0019  0.0075  -0.0183  -0.0107  0.0079   0.0181

Table 4.15: Location cache power saving rate of SPEC CPU2006: write back.
Benchmark                         bzip2   mcf     gobmk    milc    lbm      specrand
Power saving rate with drowsy     0.1782  0.2657  -0.0551  0.2657  0.0149   0.2246
Power saving rate without drowsy  0.0066  0.0124  -0.0115  0.0124  -0.0084  0.0142

From the results in these tables, it can be observed that the power saving rate without the drowsy mode is much smaller than that with it. For some applications, such as vpr and twolf of CINT2000 in write through, the saving rate changes from positive to negative, meaning that the saved L2 dynamic power cannot offset the power added by the location cache. For other applications, such as vpr and parser of CINT2000 in write back, the location cache cannot save power regardless of whether the drowsy mode is used. This is because of their very low L1 cache miss rates (0.18% for vpr and 0.24% for parser): most cache accesses are caught by the L1 cache, so the added location cache power is larger than the power saved from the L2 cache. In both Psav (Equation 3.15) and Pnodwy sav (Equation 4.3), if the L1 miss rate is extremely small, the dominant part is the negative term (Ploc). So the numerators of the two power saving rates are close to each other (i.e., approximately −Ploc), while the denominator Pcache(no loc) is smaller than Pnodwy cache(no loc). As a result, for these applications, the absolute value of the power saving rate with drowsy is larger than that without drowsy, which matches our simulation results.
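This argument can be checked numerically; the sketch below uses an artificially low L1 miss rate together with the same placeholder power values as before:

    # For a very low L1 miss rate, both saved powers collapse to about
    # -P_loc, but the drowsy baseline (Eq. 3.7) is much smaller than the
    # non-drowsy baseline (Eq. 4.2), so |R| is larger with drowsy.
    # Hypothetical placeholder values only.
    N, MR_l1, HR_loc = 8, 0.002, 0.5
    P_l2_dyn, P_l2_lkg, P_l2_dwy, P_loc = 1.0, 0.5, 0.05, 0.02

    P_sav = ((N - 1) / N * (P_l2_dyn + P_l2_lkg - P_l2_dwy)
             * MR_l1 * HR_loc - P_loc)                             # Eq. 3.15
    P_nodwy_sav = (N - 1) / N * P_l2_dyn * MR_l1 * HR_loc - P_loc  # Eq. 4.3
    P_base = (P_l2_dyn + P_l2_lkg - P_l2_dwy) * MR_l1 + P_l2_dwy   # Eq. 3.7
    P_nodwy_base = P_l2_dyn * MR_l1 + P_l2_lkg                     # Eq. 4.2

    R = P_sav / P_base
    R_nodwy = P_nodwy_sav / P_nodwy_base
    assert R < 0 and R_nodwy < 0 and abs(R) > abs(R_nodwy)
    print(R, R_nodwy)   # about -0.354 vs. -0.038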

Chapter 5

Conclusions and Future Work

In this research, we analyzed the performance of the location cache designed for a drowsy L2 cache system. The detailed working principle and hardware organization of the L2 location cache system were reviewed. With the location information stored in the location cache, both the dynamic power and the leakage power of the L2 cache can be greatly reduced. A mathematical power saving analysis of the L2 cache system was presented, considering both dynamic and leakage power saving. CACTI 4.2 was used to model the cache for our timing and power analyses, and the Simplescalar 3.0d simulator was used to evaluate the cache behavior (e.g., hit rate and miss rate) of the location cache system under real-world workloads. Both SPEC CPU2000 and CPU2006 benchmark applications were simulated for both the write-through and write-back policies. Our goal was to determine the performance loss and the amount of power saved due to the location cache architecture. According to the CACTI 4.2 simulation, the location cache system incurs no performance loss. If a dual-mode L2 cache access (i.e., direct-mapped access and set-associative access) is supported, the location cache design can even improve the performance of the L2 cache (a 49.99% improvement in our experiment).

The power saving rate of the location cache is highly dependent on the L1 cache miss rate, so the location cache achieves a higher saving rate with the L1 write-through policy than with the write-back policy. From our simulation results, with the write through policy, for CINT2000 the power saving rate ranges from 30.4% (vpr) to 55.3% (gap); for CFP2000 from 11.0% (swim) to 57.7% (sixtrack); and for CPU2006 from 37.2% (gobmk) to 72.6% (mcf). With the write back policy, the power saving rate reaches as high as 21.1% (gap) in CINT2000, 22.7% (art) in CFP2000, and 30.0% (specrand) in CPU2006. Since the location cache is accessed simultaneously with the L1 cache, its access frequency is much higher than that of the L2 cache. When the L1 cache miss rate is very low (i.e., below the threshold rate of 3% in our experiments), the saved L2 cache power does not offset the added location cache power, so the location cache architecture does not save power in these situations.

For future work, we suggest the following:

1. Since the location cache will not save power when the L1 cache miss rate is very low, and since the location cache hit rate is not sensitive to its entry number for some applications, an adaptive location cache may be developed. Some cache lines of an adaptive location cache could be virtually detached from the location cache depending on the dynamic profile of the running applications.

2. The location cache may also be added between the L2 cache and the L3 cache, though further investigation is needed to determine how much power can be saved in this architecture.

3. The wake-up power is not considered in this research, although it has been proved that more power will be saved if it is considered. CACTI is used as the cache model to estimate the power consumption, and CACTI simulation results may deviate by about 10% from SPICE simulation results [22]. Therefore, the layout of an entire cache system with the location cache implementation may be developed to obtain more accurate cache power consumption figures; after that, the wake-up power can also be considered.

4. Lastly, in this research the way information of the L2 cache is stored in the location cache, and the L2 cache is woken up way-wise. Finer granularity may be explored, such that other location information is stored in the location cache and the L2 cache can be woken up by cache lines or cache blocks.

Bibliography

[1] J. L. Hennessy, D. A. Patterson, and D. Goldberg, Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann Publishers, San Francisco, 2003.

[2] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3rd ed., Elsevier/Morgan Kaufmann, Amsterdam/Boston, 2004.

[3] F. Fallah and M. Pedram, "Standby and active leakage current control and minimization in CMOS VLSI circuits," IEICE Trans. Electron., vol. E88-C, pp. 509-519, Apr. 2005.

[4] H. McIntyre, D. Wendell, K. J. Lin, P. Kaushik, S. Seshadri, A. Wang, V. Sundararaman, P. Wang, S. Kim, W. Hsu, H. C. Park, G. Levinsky, J. Lu, M. Chirania, R. Heald, P. Lazar, and S. Dharmasena, "A 4-MB on-chip L2 cache for a 90-nm 1.6-GHz 64-bit microprocessor," IEEE J. Solid-State Circuits, vol. 40, pp. 52-59, Jan. 2005.

[5] S. Rusu, H. Muljono, and B. Cherkauer, "Itanium 2 processor 6M: higher frequency and larger L3 cache," IEEE Micro, vol. 24, pp. 10-18, Apr. 2004.

[6] R. Min, W.-B. Jone, and Y. Hu, "Location cache: a low-power L2 cache system," Proceedings of the 2004 International Symposium on Low Power Electronics and Design, pp. 120-125, Aug. 2004.

[7] C. Zhang, J. Yang, and F. Vahid, "Low static-power frequent-value data caches," Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, vol. 1, pp. 214-219, Feb. 2004.

[8] J. Montanaro et al., "A 160MHz 32b 0.5W CMOS RISC microprocessor," Proceedings of the 43rd International Symposium on Solid-State Circuits, pp. 214-215, Feb. 1996.

[9] N. S. Kim, K. Flautner, D. Blaauw, and T. Mudge, "Circuit and microarchitectural techniques for reducing cache leakage power," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 2, pp. 167-184, Feb. 2004.

[10] C. Su and A. Despain, "Cache design tradeoffs for power and performance optimization: a case study," Proceedings of the International Symposium on Low Power Electronics and Design, pp. 63-68, 1997.

[11] A. Hasegawa, I. Kawasaki, K. Yamada, S. Yoshioka, S. Kawasaki, and P. Biswas, "SH3: high code density, low power," IEEE Micro, vol. 15, pp. 11-19, Dec. 1995.

[12] T. Lyon, E. Delano, C. McNairy, and D. Mulla, "Data cache design considerations for the Itanium 2 processor," Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02), pp. 356-362, 2002.

[13] B. Calder, D. Grunwald, and J. Emer, "Predictive sequential associative cache," Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture (HPCA '96), pp. 244-254, 1996.

[14] T. N. Vijaykumar, "Reactive-associative caches," Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'01), pp. 49-61, 2001.

[15] K. Inoue, T. Ishihara, and K. Murakami, "Way-predicting set-associative cache for high performance and low energy consumption," Proceedings of the International Symposium on Low Power Electronics and Design, pp. 273-275, 1999.

[16] M. Powell, S. H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, "Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories," Proceedings of the International Symposium on Low Power Electronics and Design, pp. 90-95, 2000.

[17] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, "1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS," IEEE J. Solid-State Circuits, vol. 30, no. 8, pp. 847-857, Aug. 1995.

[18] M. Mamidipaka and N. Dutt, "eCACTI: an enhanced power estimation model for on-chip caches," Technical Report TR-04-28, Center for Embedded Computer Systems, University of California, Irvine, Sept. 2004.

[19] S. Wilton and N. Jouppi, "An enhanced access and cycle time model for on-chip caches," Western Research Lab Research Report 93/5, Jun. 1994.

[20] G. Reinman and N. Jouppi, "CACTI 2.0: an integrated cache timing and power model," Western Research Lab Research Report 2000/7, Feb. 2000.

[21] P. Shivakumar and N. Jouppi, "CACTI 3.0: an integrated cache timing, power, and area model," Western Research Lab Research Report 2001/2, Aug. 2001.

[22] "CACTI 4," http://quid.hpl.hp.com:9081/cacti/

[23] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan, "HotLeakage: a temperature-aware model of subthreshold and gate leakage for architects," Technical Report TR-CS-2003-05, Dept. of Computer Science, Univ. of Virginia, Mar. 2003.

[24] D. Burger and T. M. Austin, "The SimpleScalar tool set," University of Wisconsin, Tech. Rep. TR-97-1342, 1997.

[25] SPEC Benchmark Suite, http://www.spec.org

[26] Semiconductor Industry Association, International Technology Roadmap for Semiconductors, 2005 edition, http://public.itrs.net/

[27] C. Piguet, Low-Power Electronics Design, CRC Press LLC, 2005.

[28] T. Douseki and S. Mutoh, "Static-noise margin analysis for a scaled-down CMOS memory cell," Electronics & Communications in Japan, Part II: Electronics, vol. 75, no. 7, pp. 102-115, 1992.

[29] M. Lee, W. I. Sze, and C. M. Wu, "Static noise margin and soft-error rate simulations for thin film transistor cell stability in a 4Mbit SRAM design," Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 937-940, 1995.

[30] E. Seevinck, F. J. List, and J. Lohstroh, "Static-noise margin analysis of MOS SRAM cells," IEEE J. Solid-State Circuits, vol. 22, no. 5, pp. 748-754, Oct. 1987.

[31] H. Qin, Y. Cao, D. Markovic, A. Vladimirescu, and J. Rabaey, "SRAM leakage suppression by minimizing standby supply voltage," Proceedings of the International Symposium on Quality Electronic Design, pp. 55-60, 2004.