UNIVERSITY OF CINCINNATI

Date: ______

I, ______, hereby submit this work as part of the requirements for the degree of ______ in ______.

It is entitled: ______

This work and its defense approved by:

Chair: ______

Performance Analysis of Location Cache for Low Power Cache System

A thesis submitted to the

Division of Research and Advanced Studies of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in the Department of Electrical and Computer Engineering and Computer Science of the College of Engineering

May 28, 2007

by

Bin Qi

B.E. (Electrical Engineering), Shanghai JiaoTong University, China, July 1999

Thesis Advisor and Committee Chair: Dr. Wen-Ben Jone

To my dearest parents and brother

Abstract

In modern microprocessors, more memory hierarchy levels and larger caches are integrated on chip to bridge the performance gap between the high-speed CPU core and the low-speed memory. Large set-associative L2 caches draw a lot of power, generate a large amount of heat, and reduce the overall yield of the chip. As a result, the large power consumption of the cache memory system has become a new bottleneck for many microprocessors. In this research, we analyze the performance of a location cache which works with a low power L2 cache system implemented with the drowsy cache technique. A small direct-mapped location cache is added to the traditional L2 cache system. It caches the way location information for L2 cache accesses [6]. With this way location information, the L2 cache can be accessed as a direct-mapped cache to save both dynamic and leakage power. A detailed mathematical analysis of the location cache power saving rate is presented in this work. To evaluate the power consumption of the location cache system on real world workloads, both SPEC CPU2000 and SPEC CPU2006 benchmark applications are simulated with the reference input set. Simulation results demonstrate that the location cache system saves a significant amount of power for all benchmark applications under the L1 write through policy, and saves power for benchmark applications with high L1 miss rates under the L1 write back policy.

Acknowledgement

I wish to express my sincere thanks to my advisor, Dr. Wen-Ben Jone, for the endless hours he spent discussing and refining my work. With an office door always open for students, and a mind ready to catch any point I might have missed, he was untiring in his efforts to provide guidance and constructive criticism throughout my thesis. Thank you, Dr. Jone, for all the great help and sincere concern. I would like to thank the members of my committee, Dr. Ranga R. Vemuri and Dr. Yiming Hu, for spending their valuable time reviewing this work. Special thanks to Rui Min for many thought-provoking discussions, and to Honghao Wang for his help when I met problems in setting up the simulation environment.

Last but not least, I would like to express my gratitude to my parents and my brother for their constant support and love.

Contents

1 Introduction
2 Background
  2.1 Basics of Cache
    2.1.1 Memory Hierarchy
    2.1.2 Cache mapping techniques
    2.1.3 Cache Write Policies
  2.2 Leakage Current and Drowsy Technique in Cache
    2.2.1 Leakage current
    2.2.2 Drowsy technique
3 Location Cache Design and Analysis
  3.1 Traditional L2 Cache System Architecture
  3.2 Location Cache Architecture
    3.2.1 Structure of Location Cache
    3.2.2 Working Principle and Hardware Organization of Location Cache
  3.3 Timing Analysis of Location Cache
    3.3.1 Timing Components and Delay Model
    3.3.2 Timing Characteristic of Location Cache
  3.4 Power Analysis of Location Cache
    3.4.1 Power Components and Power Model for CACTI
    3.4.2 Mathematical Power Analysis for L2 Location Cache System
4 Power Experiment Results of Location Cache
  4.1 Experimental Environment
  4.2 Experimental Results with Write Through L1 Data Cache
  4.3 Experimental Results with Write Back L1 Data Cache
  4.4 Experimental Results Without Drowsy State
5 Conclusions and Future Work

List of Figures

2.1 The structure of memory hierarchy
2.2 Leakage current components in an NMOS transistor
2.3 Drowsy SRAM control and leakage power reduction
2.4 Deterioration of inverter VTC under low-Vdd
3.1 Physically addressed L2 cache architecture
3.2 Virtually addressed L2 cache architecture
3.3 Physically addressed location cache architecture
3.4 Virtually addressed location cache architecture
3.5 Flow diagram for location cache content update
3.6 Internal cache structure
4.1 SPEC CINT2000 location cache hit rate vs. large entry number: write through
4.2 SPEC CINT2000 location cache hit rate vs. small entry number: write through
4.3 SPEC CFP2000 location cache hit rate vs. large entry number: write through
4.4 SPEC CFP2000 location cache hit rate vs. small entry number: write through
4.5 SPEC CPU2006 location cache hit rate vs. large entry number: write through
4.6 SPEC CPU2006 location cache hit rate vs. small entry number: write through
4.7 SPEC CINT2000 location cache power saving rate vs. large entry number: write through
4.8 SPEC CINT2000 location cache power saving rate vs. small entry number: write through
4.9 SPEC CFP2000 location cache power saving rate vs. large entry number: write through
4.10 SPEC CFP2000 location cache power saving rate vs. small entry number: write through
4.11 SPEC CPU2006 location cache power saving rate vs. large entry number: write through
4.12 SPEC CPU2006 location cache power saving rate vs. small entry number: write through
4.13 SPEC CINT2000 location cache hit rate vs. large entry number: write back
4.14 SPEC CINT2000 location cache hit rate vs. small entry number: write back
4.15 SPEC CFP2000 location cache hit rate vs. large entry number: write back
4.16 SPEC CFP2000 location cache hit rate vs. small entry number: write back
4.17 SPEC CPU2006 location cache hit rate vs. large entry number: write back
4.18 SPEC CPU2006 location cache hit rate vs. small entry number: write back
4.19 SPEC CINT2000 location cache power saving rate vs. large entry number: write back
4.20 SPEC CINT2000 location cache power saving rate vs. small entry number: write back
4.21 SPEC CFP2000 location cache power saving rate vs. large entry number: write back
4.22 SPEC CFP2000 location cache power saving rate vs. small entry number: write back
4.23 SPEC CPU2006 location cache power saving rate vs. large entry number: write back
4.24 SPEC CPU2006 location cache power saving rate vs. small entry number: write back

List of Tables

3.1 Normalized access delays for various location cache configurations
4.1 The 12 CINT2000 benchmark applications
4.2 The 14 CFP2000 benchmark applications
4.3 The 6 SPEC CPU2006 benchmark applications
4.4 L1 cache miss rate of SPEC CINT2000: write through
4.5 L1 cache miss rate of SPEC CFP2000: write through
4.6 L1 cache miss rate of SPEC CPU2006: write through
4.7 L1 cache miss rate of SPEC CINT2000: write back
4.8 L1 cache miss rate of SPEC CFP2000: write back
4.9 L1 cache miss rate of SPEC CPU2006: write back
4.10 Location cache power saving rate of SPEC CINT2000: write through
4.11 Location cache power saving rate of SPEC CFP2000: write through
4.12 Location cache power saving rate of SPEC CPU2006: write through
4.13 Location cache power saving rate of SPEC CINT2000: write back
4.14 Location cache power saving rate of SPEC CFP2000: write back
4.15 Location cache power saving rate of SPEC CPU2006: write back

Chapter 1

Introduction

A cache memory is normally a very fast random access memory which the processor can access much more quickly than the main memory (DRAM). Nowadays, as the processor-memory performance gap has been steadily increasing, with processors operating at speeds far greater than the speeds at which main memories can supply the required data, the cache has been playing a vital role in bridging the performance gap between high-speed microprocessors and low-speed main memories. As a result, larger caches and deeper memory hierarchies are embedded in modern microprocessors. For example, in the third-generation Intel Itanium 2 processor, a 24-way set-associative 6 MB L3 cache is integrated into the processor, and the cache occupies more than 60% of the chip area [5]. A 4-way set-associative, 4 MB L2 data cache is used in a 64-bit, 90-nm SPARC RISC microprocessor, with 55% of the chip area invested in the L2 cache block [4].

In order to reduce the number of accesses to long latency memories, a larger cache system is required. However, a large cache generally draws a lot of power, generates a large amount of heat, and reduces the overall yield of the chip. Caches can consume 50% or more of the total chip energy in a modern processor. For example, the Pentium Pro dissipates 33% [7] and the StrongARM-110 dissipates 42% [8] of its total power in caches. In the past, dynamic power dominated the total power consumption of CMOS transistors: when CMOS transistors are not switching, they are in the OFF state and their leakage power is negligible. However, as the feature size decreases, leakage power increases exponentially with process technology and is projected to dominate the total power dissipation. According to the projection in [9], for 70-nm processes, more than 60% of power can be consumed in L1 caches if left unchecked.

With the development of personal communication, the fast-growing portable device market requires devices with both high performance and low power consumption. As a result, designing power-efficient cache memories has become one of the most critical design concerns. Several techniques on the architecture level have been proposed to reduce the power consumed by caches. In [10, 11, 12], phased caches first access the tag arrays and then the data arrays. Only the hit data way is accessed in the second phase, resulting in less data way access energy at the expense of a longer access time. Speculative way activation is another kind of selective way activation, which attempts to predict the way where the required data may be located. If the prediction is correct, the cache access latency and power consumption are similar to those of a direct-mapped cache of the same size. If the prediction is wrong, the cache is accessed again to retrieve the desired data, so the cache is accessed as a direct-mapped cache twice. Another disadvantage is that the predicted way number must be made available before the actual data address is generated [13, 15, 14]; the CPU must be modified so as to produce predictions in earlier stages of the instruction pipeline.

Apart from optimizations on the architecture level, several techniques on the circuit level have been presented to reduce the leakage power. The gated-Vdd technique inserts a high-Vth transistor between the SRAM cell and one of its power supply rails (Vdd/GND) [16]. The SRAM cell is detached from its power supply when it is not expected to be used, and the state of the cache is lost. A multi-threshold CMOS (MTCMOS) technique has also been proposed to satisfy both the requirement of lowering the threshold voltage and reducing the standby or subthreshold leakage current [17]. To increase operating speed, low-Vth MOSFETs are used for logic gates, and during long standby times, the power supply is disconnected with high-Vth MOSFETs. However, the MTCMOS technique cannot be applied to the memory cell array, because the sleep control transistors in the MTCMOS technique cut the power supply off, destroying the state of the memory cells. If high-Vth MOSFETs are used in the memory cell array instead, the overall memory access time increases. A simple but effective drowsy technique was proposed in [9]. This method implements caches with a drowsy/standby mode and a normal mode, between which different supply voltages can be selected. An SRAM cell consumes significantly less leakage power when placed into drowsy mode by supplying a lower voltage, and in drowsy mode the SRAM cell content is still preserved.

The architecture we analyze in this work is known as the location cache, a low power L2 cache system [6]. It adds a small direct-mapped location cache to the traditional L2 cache system. This small location cache is accessed simultaneously with the L1 cache, and tries to cache the access way location information of the L2 cache. As it is accessed at the same time as the L1 cache, there is no performance penalty and no modification of the CPU pipeline is needed to support this cache architecture. A detailed micro-architecture implementation of the location cache is presented in this work. Though only the location information is cached in the location cache, both dynamic and leakage power of the L2 cache can be saved. Dynamic power is saved by accessing the set-associative L2 cache as a direct-mapped cache, while leakage power is saved by putting the other, unaccessed ways into the drowsy state. Although the location cache saves L2 cache power, this added small cache has its own power overhead. A detailed mathematical analysis of the location cache power saving rate is presented in this work. The cache power consumption is estimated by the CACTI 4.2 cache model [22]. To evaluate the power consumption of the proposed location cache system on real world workloads, SPEC CPU2000 and SPEC CPU2006 benchmark applications [25] are simulated with the reference input set. The thesis is organized as follows:

Chapter 2 reviews some of the background information of memory hierarchy, cache mapping techniques, cache write policies, leakage current, and drowsy technique in cache.

Chapter 3 shows the traditional L2 cache system architecture as well as the location cache L2 cache system architecture. Detailed timing analysis and power analysis of the location cache are presented thereafter.

Chapter 4 describes the experimental environment of this work, and performs power simulation of the location cache in both write through and write back policies. Further analysis of the relationship between power saving rate and location cache size is presented based on the simulation results.

Chapter 5 concludes this thesis and discusses future work.

Chapter 2

Background

This chapter briefly discusses all background information related to this work. Firstly, memory hierarchy, cache mapping techniques and cache write policies are presented. Secondly, leakage current and drowsy technique used in cache are introduced.

2.1 Basics of Cache

Cache memories play a vital role in keeping the speed of memory access com- parable to today’s high-end microprocessors. In this section some basic cache concepts, such as memory hierarchy, cache mapping techniques and cache write policies are introduced.

Figure 2.1: The structure of memory hierarchy

2.1.1 Memory Hierarchy

A memory system is essentially divided into main memory and cache memory. Main memory (DRAM) is relatively cheap but suffers from the disadvantage of being slow. An access from the main memory into a register may take 20-50 clock cycles. However, there are other kinds of memory that are very fast, and can be accessed within several clock cycles. This fast memory is what is called cache.

The purpose of a cache is to speed up memory accesses by storing recently used chunks of memory. Figure 2.1 shows the structure of a memory hierarchy. This structure allows the processor to have an access time that is determined primarily by level 1 of the hierarchy, and yet have a memory virtually as large as level n [2]. Deciding what data to store in the cache is governed by a policy that takes advantage of the concept of spatial and temporal locality. Spatial and temporal locality of memory accesses refers to the idea that if a certain chunk of memory was accessed recently, it is likely that the same memory (or memory near it) will be accessed again in the near future. For example, when we execute consecutive instructions in the main memory, it is reasonable to assume that the next few instructions to be accessed are contained in a given contiguous block of data, and hence the next instruction to be accessed will be close by. Similarly, if we are accessing elements of an array, they are located consecutively, and therefore it makes sense to store a block of data that has recently been used in a cache.

A cache system can have more than one level, with the size of the cache increasing and the speed of access decreasing with each lower level. When the processor needs to execute an instruction, it first looks in the level-1 cache, then the level-2 cache, and so on. If the data is not found in any cache level, the CPU then looks for it in the main memory. When the CPU finds data in one of its cache locations, it is termed a hit; failure to find the data is a miss. Every miss introduces a delay, or latency, as the processor tries a slower level; this is also called a miss penalty. Nowadays, most modern microprocessors have on-chip L1 and L2 caches, and some even have large on-chip L3 caches, e.g., the Intel Itanium 2 family [5].
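As an illustration of this lookup process, the following C sketch walks the hierarchy level by level (a minimal sketch; the cache_level type and its probe callback are hypothetical and not part of any simulator used in this thesis):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct cache_level cache_level;
    struct cache_level {
        cache_level *next;                      /* lower level; NULL when main memory is next */
        bool (*probe)(cache_level *, uint64_t); /* returns true on a hit at this level */
    };

    /* Try L1 first, then each slower level; every miss adds a miss penalty
     * before the next level is tried.  Returns the level number that
     * supplied the data, or -1 when only main memory holds it. */
    int hierarchy_lookup(cache_level *l1, uint64_t addr)
    {
        int level = 1;
        for (cache_level *c = l1; c != NULL; c = c->next, level++)
            if (c->probe(c, addr))
                return level;   /* hit */
        return -1;              /* miss in every cache level */
    }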

2.1.2 Cache mapping techniques

There are three basic cache mapping techniques: direct, fully associative and set associative.

• Direct cache mapping: A direct-mapped cache is the easiest to implement, but is also the most restrictive. In a direct-mapped cache, each main memory location can be copied into one and only one location in the cache. This is accomplished by dividing the main memory into pages that correspond in size with the cache. Direct-mapped caches have a speed advantage in that only one location needs to be searched for a reference to determine if it is there. The disadvantage manifests itself when a processor repeatedly requests data from two different pages with the same offset. When this happens, the cache is said to be 'thrashing', which results in poor cache performance. Direct-mapped caches can be referred to as one-way set-associative caches.

• Fully associative cache mapping: Fully associative cache mapping is the most complex type to implement, but is the most flexible with regard to where the data can reside. A newly read block of the main memory can be placed anywhere in a fully associative cache. When a memory reference occurs, all address tags must be checked to determine if the reference is in the cache. This must be accomplished as quickly as possible if the cache is to respond to memory requests fast enough to achieve zero wait-state performance. When the cache is too large, more time is needed to check the cache than to simply go to the main memory for a reference. A fully associative cache has only one set, but many block frames per set. A fully associative cache is expensive in hardware and may slow the processor, leading to lower overall performance.

• Set associative cache mapping: Set associative cache mapping combines the best of both direct and fully associative cache mapping techniques. As with a direct-mapped cache, blocks of main memory data still map into a specific set, but can now be in any of the N cache block frames within the set. Consider, for example, a two-way set-associative cache: two-way means that each set contains two block frames. Once a memory reference is generated, a set mapping function is applied to the address to determine which set the data will be located in, just like in a direct-mapped cache. When the set is determined, each block's tag in the set must be compared with the reference to see if it resides in the cache. The higher the associativity, i.e., the more blocks there are per set, the more comparisons need to occur before a hit or miss is determined. Typical implementations are 2-, 4-, 8- and 16-way set associativity. The set mapping computation common to these schemes is sketched in code below.
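To make the set mapping function concrete, the following C fragment extracts the set index and the tag from an address; the geometry constants are illustrative only and are not taken from this thesis. A direct-mapped cache is simply the special case with one block frame per set.

    #include <stdint.h>

    #define NUM_SETS   256u   /* illustrative number of sets    */
    #define BLOCK_SIZE  64u   /* illustrative block size, bytes */

    /* Discard the block-offset bits, then take log2(NUM_SETS) bits as the
     * set index; the remaining high-order bits form the tag that is
     * compared against every block frame in the selected set. */
    static inline uint32_t set_index(uint64_t addr)
    {
        return (uint32_t)((addr / BLOCK_SIZE) % NUM_SETS);
    }

    static inline uint64_t tag_bits(uint64_t addr)
    {
        return addr / BLOCK_SIZE / NUM_SETS;
    }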

2.1.3 Cache Write Policies

There are two basic cache write policies: write back and write through.

• Write back: the data is written only to the block in the cache of the current level. The modified cache block is written to the lower-level cache or main memory only when it is replaced. To reduce the frequency of writing back blocks on replacement, a dirty bit is commonly used to indicate whether the block is modified (dirty) or not (clean). If the block is clean, it is not written back on a miss, since identical information is already present in the lower levels. With write back, writes occur at the speed of the current level of cache, and multiple writes within a block require only one write to the lower-level memory. Since some writes hit in the current level of cache, there are fewer accesses to the lower-level memory, and less memory bandwidth is used.

• Write through: the data is written both to the block in the cache and to the block in the lower level of cache or the main memory. Write through is easier to implement than write back. The cache is always clean, so unlike write back, read misses never result in writes to the lower level. Write through also has the advantage that the next lower level always holds the most current copy of the data, which simplifies data coherency. Due to these advantages, some modern microprocessors also use this write policy, such as the Intel Itanium 2 family [5]. Both policies are sketched in code below.
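The difference between the two policies can be made concrete with a short C sketch (the block metadata and the write_lower callback are hypothetical, chosen only for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t tag;
        bool     valid;
        bool     dirty;  /* meaningful for write back only */
    } cache_block;

    /* Write back: update only this level and mark the block dirty;
     * the lower level is written later, on eviction. */
    void write_back_store(cache_block *b)
    {
        b->dirty = true;
    }

    void write_back_evict(cache_block *b, void (*write_lower)(uint64_t))
    {
        if (b->valid && b->dirty)
            write_lower(b->tag);  /* one write may cover many stores */
        b->valid = b->dirty = false;
    }

    /* Write through: propagate every store immediately, so the cache is
     * always clean and eviction never writes downstream. */
    void write_through_store(cache_block *b, void (*write_lower)(uint64_t))
    {
        write_lower(b->tag);
    }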

2.2 Leakage Current and Drowsy Technique in Cache

In our location cache architecture, we also use the information obtained from the location cache to put more L2 cache parts into drowsy state, which saves standby leakage power significantly. In this section, we give a brief description of leakage current in a CMOS transistor and the drowsy technique used in a cache.

2.2.1 Leakage current

There are four main sources of leakage current in a CMOS transistor (see Figure 2.2):

1. Reverse-biased junction leakage current (I_REV)

2. Gate induced drain leakage (I_GIDL)

3. Gate direct-tunneling leakage (I_G)

4. Sub-threshold (weak inversion) leakage (I_SUB)

Each of these components is described next.

Figure 2.2: Leakage current components in an NMOS transistor.

• Reverse-biased junction leakage (I_REV). The junction leakage occurs from the source or drain to the substrate through the reverse-biased diodes when a transistor is OFF. A reverse-biased pn junction leakage has two main components: one is minority carrier diffusion/drift near the edge of the depletion region; the other is due to electron-hole pair generation in the depletion region of the reverse-biased junction [3]. The magnitude of the diode's leakage current depends on the area of the drain diffusion and the leakage current density, which is in turn determined by the doping concentration. For present technology, junction reverse-bias leakage components from both the source-drain diodes and the well diodes are generally negligible with respect to the other three leakage components [3][27].

• Gate induced drain leakage (I_GIDL). This leakage current flows from the drain-gate overlap to the substrate of a transistor. It arises from the high electric field under the gate/drain overlap region, which causes deep depletion. Thinner oxide and higher supply voltage increase the GIDL current.

• Gate direct-tunneling leakage (I_G). According to quantum mechanics, charged carriers can pass through the gate oxide potential barrier into the gate [27], and this causes the gate direct-tunneling leakage current. The magnitude of the gate direct-tunneling current increases exponentially as the gate oxide thickness T_ox decreases. An effective approach to overcoming gate leakage currents while maintaining excellent gate control is to replace the currently used silicon dioxide gate insulator with a high-K dielectric material such as TiO2 and Ta2O5.

• Sub-threshold (weak inversion) leakage (I_SUB). When the gate-to-source voltage V_gs is smaller than the threshold voltage V_th, I_SUB flows from drain to source. The magnitude of the subthreshold current is a function of temperature, supply voltage, device size, and the process parameters, of which the threshold voltage (V_T) plays a dominant role.

In current CMOS technologies, the sub-threshold leakage current, I_SUB, is much larger than the other leakage current components [26]. This is mainly because of the relatively low V_T in modern CMOS devices. Based on the BSIM3 leakage equation, the sub-threshold leakage current of a single transistor can be modeled as [23]:

$$I_{leakage} = \mu_0 \cdot C_{OX} \cdot \frac{W}{L} \cdot e^{b(V_{dd}-V_{dd0})} \cdot v_t^2 \cdot \left(1 - e^{-\frac{V_{dd}}{v_t}}\right) \cdot e^{\frac{-|V_{th}|-V_{off}}{n \cdot v_t}} \quad (2.1)$$

where $\mu_0$ is the zero-bias mobility, $C_{OX}$ is the gate oxide capacitance per unit area, $W/L$ is the aspect ratio of the transistor, $e^{b(V_{dd}-V_{dd0})}$ is the DIBL factor derived by curve fitting, $V_{dd0}$ is the default supply voltage for each technology, $v_t = kT/q$ is the thermal voltage, $V_{th}$ is the threshold voltage, $n$ is the subthreshold swing coefficient, and $V_{off}$ is an empirically determined BSIM3 parameter that is also a function of the threshold voltage.
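For reference, Equation 2.1 can be transcribed directly into C; this is only a sketch of the formula, and all parameter values would have to come from the BSIM3 model card of the target process (none are given here):

    #include <math.h>

    /* Sub-threshold leakage of a single transistor, Equation 2.1. */
    double subthreshold_leakage(double mu0,  /* zero-bias mobility            */
                                double cox,  /* gate oxide cap. per unit area */
                                double w, double l,  /* device W and L        */
                                double b,    /* DIBL curve-fitting factor     */
                                double vdd, double vdd0,
                                double vt,   /* thermal voltage kT/q          */
                                double vth,  /* threshold voltage             */
                                double n,    /* subthreshold swing coeff.     */
                                double voff) /* empirical BSIM3 parameter     */
    {
        return mu0 * cox * (w / l)
             * exp(b * (vdd - vdd0))                  /* DIBL factor */
             * vt * vt                                /* v_t^2       */
             * (1.0 - exp(-vdd / vt))
             * exp((-fabs(vth) - voff) / (n * vt));
    }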

2.2.2 Drowsy technique

In this section first the drowsy technique of a single SRAM cell is presented, and then the way to determine the data retention voltage is described.

The drowsy technique, introduced in [9], utilizes DVS (dynamic voltage scaling) to reduce the leakage power of SRAM cells. In active mode, the standard supply voltage is provided to the SRAM cells. However, when cells are not expected to be accessed for a period of time, they are placed into a "sleep" or "drowsy" mode by supplying a standby voltage in the range of 200-300 mV to the SRAM cells.

In drowsy mode, the leakage power is significantly reduced due to the decrease in both leakage current and supply voltage. Figure 2.3 shows the supply voltage control mechanism and the sub-threshold leakage power reduction of the SRAM cell with DVS.

Figure 2.3: Drowsy SRAM control and leakage power reduction [9].

As shown in Figure 2.3(b), the leakage power of 6T and 4T SRAM cells reduces significantly as the supply voltage is scaled down. According to the results in [9], the leakage power of the 4T and 6T SRAM cells can be reduced by 92% and 77%, respectively, at a 300 mV standby voltage. However, the standby voltage (data retention voltage, DRV) cannot be reduced without limit, for the reason presented in the rest of this section.

Cell stability is often characterized using the static noise margin (SNM), where noises such as mismatches and disturbances are modeled as DC offsets [28, 29, 30]. When these DC offsets exceed the SNM of an SRAM cell, the cell switches falsely. The SNM can be visualized by superimposing the voltage transfer curves (VTC) of the two cross-coupled inverters within an SRAM cell; its value is defined as the edge of the maximum square that fits between both VTC curves [30]. In Figure 2.4, V_T and V_F denote the voltages of nodes T and F in the SRAM cell. VTC_T denotes the VTC of the inverter whose input is T and output is F, while VTC_F denotes the VTC of the inverter whose input is F and output is T.

Figure 2.4: Deterioration of inverter VTC under low Vdd (shown for Vdd = 0.36 V and Vdd = 0.10 V).

When Vdd is 0.36 V, the resulting SNM is around 100 mV. When Vdd reduces to 0.1 V, the VTC degrades such that the noise margin reduces to 0. If Vdd reduces further, the SRAM cell cannot retain the stored data any more. Moreover, the real noise margin degrades not only with reduced Vdd, but also with temperature, process variation, etc., so the standby voltage cannot be reduced all the way down to 0.1 V. In [31], it is found that a guard band of over 100 mV above the minimum voltage (the one with zero SNM) is sufficient to overcome these noise effects. In our work, 0.35 V is used as the DRV for our 0.13µm technology.

Chapter 3

Location Cache Design and Analysis

In this chapter, we first describe the traditional L2 cache system architecture, and then introduce the location cache architecture for the L2 cache system. Finally, we present the timing and power analysis of the location cache.

3.1 Traditional L2 Cache System Architecture

As illustrated in Figure 2.1, the L1 cache is normally small, and access speed is the most important factor to consider when determining the L1 cache size. The L2 cache is larger and slower, and normally uses a set-associative organization to reduce the miss rate. Depending on how an L1 cache is addressed, it can be classified as physically addressed or virtually addressed. When the L1 cache is physically addressed, the virtual address issued by the microprocessor core is first translated into its corresponding physical address by the translation look-aside buffer (TLB), and is then used to access the L1 cache. Figure 3.1 illustrates this structure in an L2 cache system.

Figure 3.1: Physically addressed L2 cache architecture.

In this organization, the cache is physically indexed and physically tagged. In such a system, the time to access memory, assuming a cache hit, must accommodate both a TLB access and a cache access; of course, these accesses can be pipelined. The advantage of a physically addressed cache is that the design of the TLB is flexible, and it does not have to carefully coordinate the minimum page size, the cache size, and the associativity. Also, a physically addressed cache will not have the aliasing problem that a virtually addressed cache may have.

Alternatively, a microprocessor core can access its cache with an address that is completely or partially virtual. When the cache is fully virtually addressed, the TLB is not used until a cache miss happens: on a miss, the processor needs to translate the virtual address to its physical address so that it can fetch the cache block from the main memory. When the cache is both virtually indexed and virtually tagged, the aliasing problem may arise [2]. When the cache is accessed with a virtual address and pages are shared between programs (which may access them with different virtual addresses), there is the possibility of aliasing. Aliasing occurs when the same object has two names, in this case two virtual addresses for the same page. This ambiguity creates a problem because a word on such a page may be cached in two different locations, each corresponding to a different virtual address; it would allow one program to write the data without the other program being aware that the data had changed. A common solution is to make the cache virtually indexed but physically tagged. In this design, the page offset portion of the address, which is really a physical address since it is not translated, is used to index the L1 cache, and the L1 cache and TLB are accessed simultaneously. Figure 3.2 illustrates this structure for an L2 cache system. In this design there is no aliasing problem, but the cache size of each way cannot exceed the page size [1].

Another advantage of using a physically addressed cache is that the tag array of the L1 cache is smaller than in a virtually addressed cache. Since the virtual address is normally larger than its physical counterpart, the number of bits in each tag entry will be larger in a virtually addressed cache if the same block size and index address are used. For example, in the Intel Itanium 2 family, the virtual address is 64 bits while the physical address is 50 bits [5]. If the tag is virtually addressed instead of physically addressed, each tag entry will have 14 more bits. This will introduce more access delay and more power consumption in the tag array.

Figure 3.2: Virtually addressed L2 cache architecture.

Due to structural hazards, an L1 cache can be split into a separate L1 instruction cache and L1 data cache. Correspondingly, a pair of dedicated instruction and data TLBs is associated with the L1 caches. The L2 cache may likewise have separate data and instruction caches, or a unified cache. The Intel Itanium 2 family has separate physically addressed L1 data and instruction caches of 16 KB each, and a unified L2 cache of 256 KB [5].

3.2 Location Cache Architecture

In order to reduce the miss rate of an L2 cache, the L2 cache is normally a large set-associative cache with multiple ways. For instance, the L2 cache in the Intel Itanium 2 family is a 256 KB, 8-way set-associative cache. Accessing a set-associative cache wastes power because multiple data and tag ways are probed simultaneously, but only one way carries the required data. To resolve this problem, a new cache architecture, called the location cache, is proposed in [6]. In this section, firstly the structure of a location cache is introduced, and secondly the detailed hardware organization and working principle of the location cache are described.

3.2.1 Structure of Location Cache

The location cache is a small direct-mapped cache that uses address affinity information to provide accurate way location information for L2 cache references [6]. The proposed location cache system reduces cache access power while improving performance, when compared with a conventional set-associative L2 cache. Depending on the L2 cache architecture described in the last section, a location cache can be physically addressed or virtually addressed. Figure 3.3 illustrates the revised L2 cache system architecture with a physically addressed location cache.

In this physically addressed cache system, the location cache is physically addressed. It caches the access way location information of the L2 cache (the way number within a set that a memory reference falls into). This cache works in parallel with the L1 cache. As the location cache tries to cache the L2 location information, the block address (composed of the index address and the tag address) of the location cache should be of the same length as that of the L2 cache. For instance, in Intel Itanium 2 the physical address is 50 bits and the L2 cache block size is 128 bytes, so the block address of the location cache has 43 (50-7) bits.

Figure 3.3: Physically addressed location cache architecture.

If the location cache has 512 entries (i.e., the index contains 9 bits), then each tag array entry will have 34 (43-9) bits.

The location cache can also be virtually addressed, based on the architecture shown in Figure 3.2. The revised L2 cache system architecture with a virtually addressed location cache is illustrated in Figure 3.4. In this virtually addressed cache system, the location cache works in parallel with the TLB and the L1 cache. On an L1 cache miss, the physical address (physical tag and index) translated by the TLB and the way information provided by the location cache are both presented to the L2 cache. Since the location cache tries to cache the location information of entire blocks in the L2 cache, the location cache should have the same block address as the L2 cache, instead of the L1 cache.

Figure 3.4: Virtually addressed location cache architecture.

For instance, in Intel Itanium 2 the virtual address is 64 bits and the L2 cache block size is 128 bytes, so the block address of the location cache will be 57 (64-7) bits. If the location cache has 512 entries, then each tag array entry will have 48 (57-9) bits. Compared with the physically addressed location cache, each entry of the tag array in a virtually addressed location cache will have 14 (64-50) more bits.

The data array of a location cache is used to store the way location information of the L2 cache. If the L2 cache has N ways, then the maximum number of bits for storing this information is N bits. As the way number is normally a power of 2, we can also use binary encoding, which needs log2 N bits to store this information. For instance, if the L2 cache is an 8-way set-associative cache, each entry of the data array will be 3 bits in the location cache. We emphasize that the L2 cache is accessed based on the physical address, while the L1 and location caches can be accessed by either physical or virtual address, depending on the implementation strategy.

3.2.2 Working Principle and Hardware Organization of Location Cache

One interesting issue arises here: for which references should locations be cached? Obviously, the location cache should catch the references that turn out to be L1 misses. A recency-based strategy is not suitable, because recent accesses to the L2 cache are very likely to be cached in the L1 caches. The equation below defines the optimal coverage of the location cache:

OptCoverage = L2Coverage − L1Coverage (3.1)

The actual coverage of a location cache will increase as the size of the location cache increases. But, unfortunately, the access time, access power, and leakage power of the location cache will increase too. This will be further explored in the following sections. The proposed cache system works in the following way. The location cache is accessed in parallel with the L1 cache. If the L1 cache sees a hit, the result obtained from the location cache is discarded. If there is a miss in the L1 cache and a hit in the location cache, the L2 cache is accessed as a direct-mapped cache. If both the L1 cache and the location cache see a miss, the L2 cache is accessed as a conventional set-associative cache. When there is a hit in the location cache, both access time and access power of the L2 cache are saved. As opposed to way-prediction methods, the cached location is not a prediction: even on a location cache miss, we do not see any extra delay penalty as seen in way-prediction caches.
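This access flow can be summarized in a compact C sketch; every helper function below is hypothetical and stands in for the corresponding hardware action:

    #include <stdbool.h>
    #include <stdint.h>

    bool l1_probe(uint64_t addr);
    bool loc_probe(uint64_t addr, int *way);        /* fills the cached way number */
    void l2_access_direct(uint64_t addr, int way);  /* activates one way only      */
    int  l2_access_set_assoc(uint64_t addr);        /* probes all N ways           */
    void loc_update(uint64_t addr, int way);

    void memory_access(uint64_t addr)
    {
        int  way;
        bool loc_hit = loc_probe(addr, &way);  /* in parallel with the L1 probe */

        if (l1_probe(addr))
            return;                     /* L1 hit: location result is discarded */

        if (loc_hit) {
            l2_access_direct(addr, way);       /* L2 behaves as direct-mapped */
        } else {
            way = l2_access_set_assoc(addr);   /* conventional N-way access   */
            loc_update(addr, way);             /* remember the way next time  */
        }
    }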

The content (i.e., the new way information) in the location cache is updated when both an L1 miss and a location cache miss occur. The flow diagram for the location cache content update is shown in Figure 3.5. As the location cache stores the location information of the L2 cache, it uses the same block address as the L2 cache, instead of the L1 cache. Normally, the block size of the L2 cache is larger than that of the L1 cache; for instance, in Intel Itanium 2, the L1 block size is 64 bytes while the L2 block size is 128 bytes. Due to this different addressing scheme, the location cache can still catch many references that are L1 misses but location cache hits. When the L2 block size is the same as the L1 block size, although this is not very common, the location cache entry number (e.g., 512) is allocated to be larger than that of the L1 cache (e.g., 64) without exceeding the access time of the L1 cache. This is possible because the location cache data array and its tag array are both small compared with the L1 data array. As a result, the location cache can still catch many L1 misses when the block size in the L1 cache is the same as that in the L2 cache.

Figure 3.5: Flow diagram for location cache content update.

According to the description of the drowsy technique in Section 2.2.2, a cache can be put into drowsy mode when it is not accessed for a period of time. In drowsy mode, the leakage power is significantly reduced compared to the normal mode. As the L1 cache hit rate is normally high, the L2 cache can also be put into drowsy mode when it is idle for a period of time. In a simple L2 cache system, all the ways of the L2 cache are woken up when there is an access to the L2 cache. With the way information stored in the location cache, when there is a hit in the location cache, only the hit way of the L2 cache is woken up, while the other ways can be kept in the drowsy state. So, in the location cache system, separate multiplexors for selecting Vdd or Vdd_low are used for each way, and they are controlled by the location information provided by the location cache. The location cache is organized as a normal direct-mapped cache with a data array and a tag array. For the data array of the location cache, each entry has log2 N bits, where N is the way number of the L2 cache. As the tag array of the location cache has the same block address as the L2 cache, the entry width (W) of its tag array can be calculated by the following equation:

$$W = T - \log_2 B - \log_2 E \quad (3.2)$$

where T is the total access address bit width, E is the entry number of the location cache, and B is the block size of the L2 cache in bytes.
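Equation 3.2 is simple enough to check in a few lines of C (a sketch assuming all three quantities are powers of two, as in the examples that follow):

    #include <math.h>

    /* Tag entry width of the location cache, Equation 3.2. */
    unsigned tag_entry_width(unsigned t,            /* total address bits T     */
                             unsigned block_bytes,  /* L2 block size B in bytes */
                             unsigned entries)      /* location cache entries E */
    {
        return t - (unsigned)log2((double)block_bytes)
                 - (unsigned)log2((double)entries);
    }

    /* Itanium 2 numbers from the text:
     * tag_entry_width(50, 128, 256) == 35,
     * tag_entry_width(50, 128, 512) == 34. */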

Normally the block size of an L2 cache is fixed, so each entry of the tag array will have fewer bits if more entries are used in the location cache. For instance, in Intel Itanium 2 the physical address is 50 bits and the L2 cache block size is 128 bytes. If the location cache has 256 entries, then each tag array entry will have 35 (50-7-8) bits. However, if the location cache has 512 entries, then each tag array entry will have 34 (50-7-9) bits. For an L2 cache system with location cache support, small encoder and decoder hardware must be added. A binary decoder is needed to decode the location information stored in the location cache and select which way to access in the L2 cache. A binary encoder is needed to convert the comparator output of the L2 tag array into the location information stored in the location cache. For instance, if the L2 cache has 8 ways, a 3-to-8 decoder and an 8-to-3 encoder are needed.
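The encoder and decoder pair amounts to a conversion between a binary way number and a one-hot way-select vector, as the following C sketch shows for N = 8 (illustration only; the real blocks are combinational logic):

    #include <stdint.h>

    /* 3-to-8 decoder: binary way number -> one-hot way enable. */
    static inline uint8_t decode_3_to_8(uint8_t way)   /* way in 0..7 */
    {
        return (uint8_t)(1u << way);
    }

    /* 8-to-3 encoder: one-hot tag comparator output -> way number.
     * Exactly one bit of onehot is assumed to be set. */
    static inline uint8_t encode_8_to_3(uint8_t onehot)
    {
        uint8_t way = 0;
        while (!(onehot & 1u)) {
            onehot >>= 1;
            way++;
        }
        return way;
    }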

3.3 Timing Analysis of Location Cache

In this section, first the timing components of a cache and its delay model used in CACTI are introduced, and then the timing characteristic of the location cache is analyzed and justified by CACTI.

3.3.1 Timing Components and Delay Model

The internal cache structure discussed here is shown in Figure 3.6. The decoder first decodes the address and selects the appropriate row by driving one wordline in the data array and one wordline in the tag array. Each array contains as many wordlines as there are rows in the array, but only one wordline in each array can go high at a time. Each memory cell along the selected row is associated with a pair of bitlines; each bitline is initially precharged high. When a wordline goes high, each memory cell in that row pulls down one of its two bitlines; the value stored in the memory cell determines which bitline goes low. Each sense amplifier monitors a pair of bitlines and detects the memory cell state when one changes: by detecting which bitline goes low, the sense amplifier can determine the logic value stored in the memory cell.

The information read from the tag array is compared with the tag bits of the address provided by the CPU. In an N-way set-associative cache, N comparators are required. The results of the N comparisons are used to drive a valid (hit/miss) output as well as the output multiplexors. These output multiplexors select the proper data from the data array (in a set-associative cache, or in a cache in which the data array width is larger than the output width), and drive the selected data out of the cache. From Figure 3.6, the following components can be identified:

• Decoder

• Wordlines (in both the data and tag arrays)

• Bitlines (in both the data and tag arrays)

• Sense Amplifiers (in both the data and tag arrays)

• Comparators

• Multiplexor Drivers

• Output Drivers (data output and valid signal output)

The delay of each component is estimated separately, and then combined to estimate the access time of the entire cache. The analytical model of CACTI is developed by decomposing the circuit of each component into many equivalent RC circuits, and using simple RC equations to estimate the delay of each stage [19]. Since a detailed area model is also provided by CACTI, it also calculates the wire length and hence the associated capacitance and resistance of the address and data routing tracks [21].

Figure 3.6: Internal cache structure.

There are two potential critical paths in a cache read access. If the time to read the tag array, perform the comparison, and drive the multiplexor select signals is larger than the time to read the data array, then the tag side is the critical path. However, if it takes longer to read the data array, then the data side is the critical path [19].

For a direct-mapped cache, the access time is simply the larger of the path through the tag array or the path through the data array, and we have:

$$T_{access,dm} = \max(T_{dataSide} + T_{outDrive,data},\ T_{tagSide,dm} + T_{outDrive,valid}) \quad (3.3)$$

where:

$$T_{dataSide} = T_{decoder,data} + T_{wordline,data} + T_{bitline,data} + T_{sense,data},$$

$$T_{tagSide,dm} = T_{decoder,tag} + T_{wordline,tag} + T_{bitline,tag} + T_{sense,tag} + T_{compare}.$$

In a set-associative cache, the tag array must be read before the data signals can be driven. Thus, the access time of a set-associative cache can be written as:

$$T_{access,sa} = \max(T_{dataSide},\ T_{tagSide,sa}) + T_{outDrive,data} \quad (3.4)$$

where

$$T_{tagSide,sa} = T_{decoder,tag} + T_{wordline,tag} + T_{bitline,tag} + T_{sense,tag} + T_{compare} + T_{muxDriver} + T_{outDrive,inv}.$$
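Equations 3.3 and 3.4 translate directly into code; the sketch below only restates the two formulas, with every delay term supplied externally (e.g., by CACTI):

    static double max2(double a, double b) { return a > b ? a : b; }

    /* Equation 3.3: direct-mapped access time.  The data and tag paths
     * proceed in parallel, and the slower one determines the delay. */
    double t_access_dm(double t_data_side, double t_out_drive_data,
                       double t_tag_side_dm, double t_out_drive_valid)
    {
        return max2(t_data_side + t_out_drive_data,
                    t_tag_side_dm + t_out_drive_valid);
    }

    /* Equation 3.4: set-associative access time.  The tag comparison must
     * finish before the selected data can be driven out, so the output
     * drive time is added after taking the maximum. */
    double t_access_sa(double t_data_side, double t_tag_side_sa,
                       double t_out_drive_data)
    {
        return max2(t_data_side, t_tag_side_sa) + t_out_drive_data;
    }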

3.3.2 Timing Characteristic of Location Cache

As the location cache is accessed in parallel with the L1 cache, the added location cache will not increase the time to access the L2 cache as long as the access time of the location cache is smaller than that of the L1 cache. From the analysis of the location cache structure in Section 3.2.1, we can see that the tag array is much larger than the data array in the location cache; therefore, the critical path of the location cache is in its tag array. Table 3.1 lists the access delays of the location cache for various entry sizes.

The L1 cache discussed in this research is configured the same as in Itanium 2: a 16 KB 4-way set-associative cache with a cache line size of 64 bytes, implemented in a 0.13µm technology. The results were produced using the CACTI 4.2 simulator. We chose the access delay of the L1 cache as the baseline and normalized it to 1. It can be observed that a location cache with up to 1024 entries still has a shorter access latency than the L1 cache, so access delay is not a critical factor in choosing the entry number of the location cache. In the following sections, the relation between the power consumption of the location cache and the entry number is analyzed.

When there is a hit in the location cache, the L2 cache is accessed as a direct-mapped cache. If non-unified L2 cache access, i.e., both direct-mapped and set-associative access, is supported, the location cache design can also improve the performance of the L2 cache.

Table 3.1: Normalized access delays for various location cache configurations

Cache config.  L1 cache  1024 entry  512 entry  256 entry  128 entry  64 entry  32 entry  16 entry
Access delay   1         0.819       0.814      0.796      0.773      0.732     0.731     0.730

To verify the benefit of accessing the L2 cache as a direct-mapped cache, our L2 cache uses the same configuration as that of Itanium 2, which is 256 KB, 8-way, with a block size of 128 bytes, implemented in a 0.13µm technology. From the simulation results of CACTI 4.2, when the L2 cache is accessed as a set-associative cache, the access delay is 1.94736 ns. In contrast, when the L2 cache is accessed as a direct-mapped cache, the access delay is 0.973827 ns, which is 49.99% faster.

3.4 Power Analysis of Location Cache

Location cache not only saves the L2 dynamic access power, but also puts other unaccessed ways into drowsy state to further save static leakage power. In this section, the power model used in CACTI is introduced first, and then mathematical power analysis of the L2 location cache system is presented.

3.4.1 Power Components and Power Model for CACTI

The basic model used for estimation of dynamic power dissipation is shown in Equation 3.5.

$$P_{dyn} = C_L \cdot V_{dd}^2 \cdot P_{0 \to 1} \cdot f \quad (3.5)$$

where $C_L$ is the physical capacitance of a device, $V_{dd}$ is the device supply voltage,

$P_{0 \to 1}$ is the probability of a transition at the capacitive load from '0' to '1', and $f$ is the frequency of the cache operation [20]. Since the CACTI model tracks the physical capacitance of each stage of the cache model, the corresponding switching activity, and the number of such devices, the dynamic power consumed at each stage can be modeled by Equation 3.5. As the read and write operations involve different components of the cache, different dynamic power consumptions are estimated in CACTI. As the sub-threshold leakage current is the dominant leakage source, only sub-threshold leakage is modeled in CACTI. The sub-threshold leakage current in a MOSFET is modeled as Equation 2.1 in Section 2.2.1. In Equation 2.1, for a given threshold voltage ($V_{th}$) and temperature ($T$), all terms except the device width ($W$) are constant for all the transistors in a given design.

So, Equation 2.1 can be reduced to Equation 3.6, where Il is the leakage of a unit width transistor at a given temperature and threshold voltage [18].

$$I_{lkg} = W \cdot I_l(T, V_{th}) \quad (3.6)$$

The width of a transistor in a cache design is determined based on the capacitive load driven by its gate. CACTI works by calculating the leakage for each gate after the width of the gate has been determined. The leakage per gate is then multiplied by the width of the circuit and then by the number of instances of that particular circuit [22]. When a cache is not active, it can be put into the drowsy state to reduce the leakage power. According to the analysis of Section 2.2.2, Vdd can then be reduced to Vdd_low. In the following analysis, CACTI 4.2 is simulated with Vdd_low set to 0.35 V to obtain the drowsy leakage power.

3.4.2 Mathematical Power Analysis for L2 Location Cache System

As the location cache caches the way locations of the L2 cache, it knows which way to activate for the next access when there is a location cache hit. According to this location information, it accesses the L2 cache as a direct-mapped cache to save dynamic power, and puts the other, unaccessed ways into the drowsy state to save leakage power as well. In a memory hierarchy, the L2 cache is accessed only when there is an L1 cache miss. When the L2 cache is active, it consumes dynamic power and normal leakage power. When the L2 cache is in standby mode, it can be put into the drowsy state, in which it consumes only drowsy leakage power.

In the following, the dynamic power and leakage power of a traditional L2 cache are analyzed first, and then the power consumption of an L2 cache system with a location cache is analyzed. Finally, the saved power and power saving rate of the location cache system are both presented.

1. The power consumption of a traditional L2 cache (Pcache(no loc)) can be written as shown in Equation 3.7:

$$P_{cache}(no\_loc) = (P_{l2\_dyn} + P_{l2\_lkg}) \cdot MR_{l1} + P_{l2\_dwy} \cdot (1 - MR_{l1}) \quad (3.7)$$

where we have:

$P_{l2\_lkg}$: the leakage power of the L2 cache,
$P_{l2\_dwy}$: the drowsy leakage power of the L2 cache,
$MR_{l1}$: the L1 cache miss rate,
$P_{l2\_dyn}$: the average dynamic power of the L2 cache,

$$P_{l2\_dyn} = P_{l2\_rd} \cdot R_{rd} + P_{l2\_wt} \cdot R_{wt}$$

$R_{rd}$: the ratio of read operations to total operations,
$R_{wt}$: the ratio of write operations to total operations.

2. For an L2 cache system with a location cache, the total power ($P_{cache}(with\_loc)$) is composed of two components: the power of the added location cache ($P_{loc}$), and the power of the L2 cache ($P_{l2}(with\_loc)$).

(a) The added power consumption due to the small location cache (Ploc):

$$P_{loc} = P_{loc\_dyn} + P_{loc\_lkg} \quad (3.8)$$

where we have:

$P_{loc\_dyn}$: the dynamic power of the location cache,
$P_{loc\_lkg}$: the leakage power of the location cache.

(b) The total power consumption of an L2 cache ($P_{l2}(with\_loc)$) with a location cache can be categorized into the following three classes:

36 • Class 1: when the L1 cache misses, and the location cache hits.

$$P^1_{l2}(with\_loc) = \frac{1}{N} \cdot (P_{l2\_dyn} + P_{l2\_lkg}) \cdot MR_{l1} \cdot HR_{loc} + \frac{N-1}{N} \cdot P_{l2\_dwy} \cdot MR_{l1} \cdot HR_{loc} \quad (3.9)$$

where $N$ is the number of ways in the L2 cache and $HR_{loc}$ is the location cache hit rate.

• Class 2: when the L1 cache misses, and the location cache misses.

$$P^2_{l2}(with\_loc) = (P_{l2\_dyn} + P_{l2\_lkg}) \cdot MR_{l1} \cdot (1 - HR_{loc}) \quad (3.10)$$

• Class 3: when the L1 cache hits.

$$P^3_{l2}(with\_loc) = P_{l2\_dwy} \cdot (1 - MR_{l1}) \quad (3.11)$$

The total L2 cache power (Pl2(with loc)) in a location cache system can thus be determined as:

$$P_{l2}(with\_loc) = P^1_{l2}(with\_loc) + P^2_{l2}(with\_loc) + P^3_{l2}(with\_loc) \quad (3.12)$$

(c) The total power (Pcache(with loc)) of an L2 location cache system is:

$$P_{cache}(with\_loc) = P_{l2}(with\_loc) + P_{loc} \quad (3.13)$$

3. The saved power ($P_{sav}$) of the L2 location cache system can be derived as:

$$P_{sav} = P_{cache}(no\_loc) - P_{cache}(with\_loc) \quad (3.14)$$
$$= \frac{N-1}{N} \cdot (P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1} \cdot HR_{loc} - P_{loc} \quad (3.15)$$

and the power saving rate Rpw sav is:

$$R_{pw\_sav} = P_{sav} / P_{cache}(no\_loc) \quad (3.16)$$
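Equations 3.7 through 3.16 can be folded into a single routine, which is how one would reproduce the power saving rate from simulator statistics (a sketch; the parameter names are ours, the power numbers would come from CACTI, and mr_l1 and hr_loc from the CPU simulator):

    typedef struct {
        double p_l2_dyn;   /* average L2 dynamic power      */
        double p_l2_lkg;   /* L2 leakage power, normal mode */
        double p_l2_dwy;   /* L2 leakage power, drowsy mode */
        double p_loc_dyn;  /* location cache dynamic power  */
        double p_loc_lkg;  /* location cache leakage power  */
        int    n_ways;     /* L2 associativity N            */
    } power_params;

    /* Returns R_pw_sav of Equation 3.16; a negative value means the
     * location cache overhead outweighs the L2 savings. */
    double power_saving_rate(const power_params *p,
                             double mr_l1, double hr_loc)
    {
        double n = (double)p->n_ways;

        /* Equation 3.7: traditional L2 cache. */
        double p_no_loc = (p->p_l2_dyn + p->p_l2_lkg) * mr_l1
                        + p->p_l2_dwy * (1.0 - mr_l1);

        /* Equations 3.9-3.11: the three access classes. */
        double c1 = (1.0 / n) * (p->p_l2_dyn + p->p_l2_lkg) * mr_l1 * hr_loc
                  + ((n - 1.0) / n) * p->p_l2_dwy * mr_l1 * hr_loc;
        double c2 = (p->p_l2_dyn + p->p_l2_lkg) * mr_l1 * (1.0 - hr_loc);
        double c3 = p->p_l2_dwy * (1.0 - mr_l1);

        /* Equations 3.8, 3.12, 3.13: add the location cache overhead. */
        double p_with_loc = c1 + c2 + c3 + p->p_loc_dyn + p->p_loc_lkg;

        /* Equations 3.14, 3.16. */
        return (p_no_loc - p_with_loc) / p_no_loc;
    }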

Since the location cache is accessed in parallel with the L1 cache, it is accessed all the time. When the L1 cache has a very high hit rate, the location cache itself consumes a considerable amount of power that is entirely wasted, sometimes even more than the power saved in the L2 cache system. In the above analysis, if $P_{sav} > 0$, the location cache system saves power when compared with a conventional L2 cache system. Otherwise, the saved power cannot offset the power consumed by the added location cache hardware. Detailed experimental results are presented in the next chapter.

When the L2 cache is woken up from the drowsy state to the normal state, the power supply voltage is switched from $V_{dd\_low}$ to $V_{dd}$. This causes a wake-up power consumption, which is dynamic power in nature and can be modeled as Equation 3.17:

$$P_{wakeup} = \frac{(V_{dd} - V_{dd\_low}) \cdot C}{T} \quad (3.17)$$

where C is the total switching capacitance (e.g., the diffusion capacitance connected to VVDD in Figure 2.3(a)), and T is the wake-up time. When only one way of the L2 cache is woken up, the total switching capacitance is only 1/N of the capacitance C for waking up all ways of the L2 cache. So, if we define the wake-up power of all ways of the L2 cache as $P_{wake\_all}$, the wake-up power of only one way is $P_{wake\_one} = \frac{1}{N} \cdot P_{wake\_all}$. In our location cache system, when there is a hit in the location cache, only one of the N ways in the L2 cache needs to be woken up. Consequently, the wake-up power can also be saved in our location cache architecture.

When considering the wake-up power, Equations (3.7), (3.13), (3.15) and (3.16) can be modified as:

$$P'_{cache}(no\_loc) = P_{cache}(no\_loc) + P_{wake\_all} \cdot MR_{l1} \quad (3.18)$$

$$P'_{cache}(with\_loc) = P_{cache}(with\_loc) + \frac{1}{N} \cdot P_{wake\_all} \cdot MR_{l1} \cdot HR_{loc} + (1 - HR_{loc}) \cdot P_{wake\_all} \cdot MR_{l1} \quad (3.19)$$

$$P'_{sav} = P'_{cache}(no\_loc) - P'_{cache}(with\_loc) = P_{sav} + \frac{N-1}{N} \cdot P_{wake\_all} \cdot MR_{l1} \cdot HR_{loc} \quad (3.20)$$

$$R'_{pw\_sav} = P'_{sav} / P'_{cache}(no\_loc) \quad (3.21)$$

where Pwake all is the wake-up power of all ways for the L2 cache as discussed above.

We can prove that $R_{pw\_sav} < R'_{pw\_sav}$, i.e., that the following inequality holds:

$$R_{pw\_sav} < R'_{pw\_sav} \quad (3.22)$$

Substituting the power saving rate with wake-up power considered, we have:

$$\frac{P_{sav}}{P_{cache}(no\_loc)} < \frac{P_{sav} + \frac{N-1}{N} \cdot P_{wake\_all} \cdot MR_{l1} \cdot HR_{loc}}{P_{cache}(no\_loc) + P_{wake\_all} \cdot MR_{l1}} \quad (3.23)$$

Manipulating both sides of the above equation, we get:

$$P_{sav} \cdot P_{wake\_all} \cdot MR_{l1} < P_{cache}(no\_loc) \cdot \frac{N-1}{N} \cdot HR_{loc} \cdot P_{wake\_all} \cdot MR_{l1} \quad (3.24)$$

Finally, we have to prove that the following inequality holds:

$$\frac{P_{sav}}{P_{cache}(no\_loc)} < \frac{N-1}{N} \cdot HR_{loc} \quad (3.25)$$

Inequality 3.25 can be proven by replacing $P_{sav}$ and $P_{cache}(no\_loc)$ with Equations 3.15 and 3.7, and by the following derivation:

$$\frac{P_{sav}}{P_{cache}(no\_loc)} = \frac{\frac{N-1}{N} \cdot HR_{loc} \cdot (P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1} - P_{loc}}{(P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1} + P_{l2\_dwy}}$$
$$< \frac{\frac{N-1}{N} \cdot HR_{loc} \cdot (P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1}}{(P_{l2\_dyn} + P_{l2\_lkg} - P_{l2\_dwy}) \cdot MR_{l1}}$$
$$= \frac{N-1}{N} \cdot HR_{loc}$$

From the proof above, R'pw sav is always greater than Rpw sav; that is, the power saving rate is greater when the wake-up power is considered than when it is not. As the wake-up circuitry is not included in the CACTI cache model and varies across implementations, this part of the power saving is not included in our experimental results in the next chapter. However, if the wake-up power were considered, even better power saving results could be expected for the proposed location cache design.
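As a quick numeric sanity check of this result, the sketch below reuses the placeholder values from the earlier sketch; Pwake all is likewise an assumed normalized number, not a measured wake-up power:

    # Sanity check of R'_pw_sav > R_pw_sav (Equations 3.18-3.21), using
    # hypothetical placeholder values; P_wake_all is assumed, not measured.
    N, MR_l1, HR_loc = 8, 0.10, 0.90
    P_sav, P_no_loc = 0.0942, 0.195     # Equations 3.15 and 3.7, from above
    P_wake_all = 0.30

    P_sav_w = P_sav + (N - 1) / N * P_wake_all * MR_l1 * HR_loc   # Eq. 3.20
    P_no_loc_w = P_no_loc + P_wake_all * MR_l1                    # Eq. 3.18

    R = P_sav / P_no_loc          # Equation 3.16
    R_w = P_sav_w / P_no_loc_w    # Equation 3.21
    # The proof needs R < (N-1)/N * HR_loc, which holds here, so R_w > R.
    assert R < (N - 1) / N * HR_loc and R_w > R
    print(R, R_w)   # about 0.483 vs. 0.524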

Chapter 4

Power Experiment Results of Location Cache

In this chapter, the simulation environment for the power experiments is described first, and then the experimental results when the L1 data cache uses the write through and the write back policies are presented.

4.1 Experimental Environment

For the power experiments of the location cache system, the CACTI 4.2 cache model is used to evaluate the power consumption of the location cache and of the L1 and L2 caches.

Compared to previous versions, CACTI 4.2 has several major improvements [22]. First, the technology scaling model in previous versions of CACTI was a simple ideal linear scaling; since many aspects of deep-submicron technology scaling are far from ideal, CACTI 4.2 uses new device parameters specific to each technology. Second, some basic circuit structures are updated to reflect changes in modern processes: a smaller individual SRAM cell is used, which affects the estimated gate and wire capacitances, and the original current-mirror-based sense amplifier design is replaced by a latch-based voltage-mode design, which greatly reduces the power consumed by the sense amplifiers. Third, a leakage power consumption model is added. We also use this leakage power model to derive the drowsy-state leakage power consumption of the L2 cache, as described in Section 3.4.1.

The Simplescalar 3.0d CPU simulator is used to obtain the L1 cache miss rate and the location cache hit rate needed to evaluate the power consumption of the location cache system [24]. The write through policy was not supported in the original Simplescalar simulator, but this policy is sometimes used for an L1 data cache, as discussed in the next section; thus, the simulator was extended to support the write through policy for the L1 data cache. A new location cache simulation model was also added to the Simplescalar simulator. The L2 cache system configuration is similar to that of the Intel Itanium 2 processor [5]. Physically addressed, separate 16 KB L1 instruction and data caches are used; both are 4-way set-associative with a cache line size of 64 bytes and 64 sets. The L2 cache is a 256 KB, 8-way set-associative unified cache with a cache line size of 128 bytes. The bus between the L1 and L2 caches is 256 bits wide. A 0.13 µm technology is used for the cache power simulation, similar to the Itanium 2 processor.
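To illustrate the kind of bookkeeping such a simulator extension must perform, the following is a minimal sketch of a location cache model. It is not the actual Simplescalar patch; the direct-mapped indexing and the (tag, way) entry format are simplifying assumptions made here:

    # A minimal sketch (not the authors' Simplescalar extension) of the
    # bookkeeping a location cache model needs.
    class LocationCacheModel:
        def __init__(self, num_entries):
            self.table = [None] * num_entries   # one (tag, way) per entry
            self.hits = 0
            self.probes = 0

        def probe(self, l2_set, tag):
            """Look up the cached L2 way; on a hit, the L2 cache can be
            accessed as a direct-mapped cache, otherwise a full
            set-associative L2 access is needed."""
            self.probes += 1
            entry = self.table[l2_set % len(self.table)]
            if entry is not None and entry[0] == tag:
                self.hits += 1
                return entry[1]
            return None

        def update(self, l2_set, tag, way):
            """Record the way once the L2 cache has resolved it."""
            self.table[l2_set % len(self.table)] = (tag, way)

        def hit_rate(self):            # the HR_loc fed into Equation 3.15
            return self.hits / self.probes if self.probes else 0.0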

To evaluate the power consumption of a location cache system on real-world workloads, all 26 SPEC CPU2000 and 6 SPEC CPU2006 benchmark applications [25] are simulated with the reference input set for 100 million instructions. The details of the 12 CINT2000, 14 CFP2000, and 6 CPU2006 benchmark applications are listed in Table 4.1, Table 4.2, and Table 4.3, respectively.

Table 4.1: The 12 CINT2000 benchmark applications
164.gzip      Data compression utility
175.vpr       FPGA circuit placement and routing
176.gcc       C compiler
181.mcf       Minimum cost flow network
186.crafty    Chess program
197.parser    Natural language processing
252.eon       Ray tracing
253.perlbmk   Perl
254.gap       Computational group theory
255.vortex    Object oriented database
256.bzip2     Data compression utility
300.twolf     Place and route simulator

4.2 Experimental Results with Write Through L1 Data Cache

As described in Section 2.1.3, two write policies, write through and write back, can be used in a cache, each with its own advantages and disadvantages. Under the write through policy, every write operation also generates an access to the next level of cache or main memory, no matter whether it hits or misses at this level. Consequently, with write through the ratio of the number of L2 cache accesses to the number of L1 cache accesses, which is defined here as the L1 miss rate, is much higher than with write back. The Intel Itanium 2 processor uses the write through policy for its L1 data cache, so we first present the simulation results using the write through policy.
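The effect of the write policy on this ratio can be illustrated with a small sketch; the trace below is fabricated, and dirty write-backs on eviction are ignored for simplicity:

    # Why write through inflates the L2-access-to-L1-access ratio: every
    # write, hit or miss, reaches the L2 cache. Illustrative sketch only;
    # dirty write-backs on eviction are ignored for simplicity.
    def l2_access_ratio(ops, write_through):
        """ops: list of (kind, hit) with kind in {'r', 'w'}."""
        l2_accesses = 0
        for kind, hit in ops:
            if not hit:
                l2_accesses += 1               # misses always go to L2
            elif kind == 'w' and write_through:
                l2_accesses += 1               # write hits also go to L2
        return l2_accesses / len(ops)

    # A fabricated trace: 80% hits, 36% writes.
    trace = ([('r', True)] * 50 + [('w', True)] * 30
             + [('r', False)] * 14 + [('w', False)] * 6)
    print(l2_access_ratio(trace, write_through=True))    # 0.50
    print(l2_access_ratio(trace, write_through=False))   # 0.20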

Table 4.2: The 14 CFP2000 benchmark applications
168.wupwise   Quantum chromodynamics
171.swim      Shallow water modeling
172.mgrid     Multi-grid solver in 3D potential field
173.applu     Parabolic/elliptic partial differential equations
177.mesa      3D graphics library
178.galgel    Fluid dynamics: analysis of oscillatory instability
179.art       Neural network simulation; adaptive resonance theory
183.equake    Finite element simulation; earthquake modeling
187.facerec   Computer vision: recognizes faces
188.ammp      Computational chemistry
189.lucas     Number theory: primality testing
191.fma3d     Finite element crash simulation
200.sixtrack  Particle accelerator model
301.apsi      Solves problems regarding temperature, wind, velocity, and distribution of pollutants

Table 4.3: The 6 SPEC CPU2006 benchmark applications
401.bzip2     Integer; compression
429.mcf       Integer; combinatorial optimization
445.gobmk     Integer; artificial intelligence: Go
433.milc      Floating point; physics: quantum chromodynamics
470.lbm       Floating point; fluid dynamics
999.specrand  Floating point; pseudorandom number generator

Table 4.4: L1 cache miss rate of SPEC CINT2000: write through.
Benchmark     gzip    vpr     gcc     mcf     crafty  parser
L1 miss rate  0.1466  0.0503  0.0966  0.1483  0.0693  0.0607
Benchmark     eon     perlbmk gap     vortex  bzip2   twolf
L1 miss rate  0.1187  0.1398  0.1578  0.1332  0.1407  0.0690

Table 4.5: L1 cache miss rate of SPEC CFP2000: write through.
Benchmark     wupwise swim    mgrid   applu   mesa    galgel   art
L1 miss rate  0.0547  0.0197  0.0777  0.0756  0.0608  0.0396   0.1445
Benchmark     equake  facerec ammp    lucas   fma3d   sixtrack apsi
L1 miss rate  0.0700  0.1408  0.1382  0.0045  0.0424  0.1827   0.2289

The L1 cache miss rates of SPEC CINT2000, SPEC CFP2000, and SPEC CPU2006 are shown in Table 4.4, Table 4.5, and Table 4.6, respectively. From these tables, it can be observed that the L1 cache miss rate varies from 0.45% (lucas of CFP2000) to 40.5% (bzip2 of CPU2006), depending on the benchmark application.

The location cache hit rates for SPEC CINT2000 are shown in Figure 4.1 and Figure 4.2 for different location cache sizes. The location cache hit rates for SPEC CFP2000 are shown in Figure 4.3 and Figure 4.4, and those for SPEC CPU2006 in Figure 4.5 and Figure 4.6. From these figures, it can be seen that the location cache hit rate increases as the cache entry number increases. Whether the hit rate is sensitive to the entry number is highly dependent on the application type.

Table 4.6: L1 cache miss rate of SPEC CPU2006: write through.
Benchmark     bzip2     mcf       gobmk     milc      lbm       specrand
L1 miss rate  0.405286  0.399973  0.072794  0.399940  0.134418  0.108008


Figure 4.1: SPEC CINT2000 location cache hit rate vs large entry number: write through.

In some applications, e.g., mcf and milc in CPU2006, the location cache hit rates are not sensitive to the cache entry number. In other applications, e.g., art in CFP2000, the hit rates are sensitive only to small entry numbers: an entry number as small as 16 already gives art a relatively high hit rate, and increasing the entry number further does not raise the hit rate significantly. According to Equation 3.15, the saved power of an L2 location cache system is highly dependent on the L1 cache miss rate and the location cache hit rate. In the following, Figure 4.7 and Figure 4.8 show the location cache power saving rate for SPEC CINT2000 applications; Figure 4.9 and Figure 4.10 show it for SPEC CFP2000 applications; and Figure 4.11 and Figure 4.12 show it for SPEC CPU2006 applications. Here, the power saving rate is the ratio of the power saved by the location cache system (Equation 3.15) to the power consumption of the traditional L2 cache (Equation 3.7).


Figure 4.2: SPEC CINT2000 location cache hit rate vs small entry number: write through.


Figure 4.3: SPEC CFP2000 location cache hit rate vs large entry number: write through.


Figure 4.4: SPEC CFP2000 location cache hit rate vs small entry number: write through.


Figure 4.5: SPEC CPU2006 location cache hit rate vs large entry number: write through.


Figure 4.6: SPEC CPU2006 location cache hit rate vs small entry number: write through.


From the analysis in the last chapter, the location cache itself adds power consumption, but it saves power in the L2 cache. Thus, Equation 3.15 has two components, one positive and one negative. The positive part is the saved power of the L2 cache, which depends on the L1 cache miss rate and the location cache hit rate: when the location cache entry number increases, the location cache hit rate normally increases, and the saved L2 power increases accordingly. The negative part is the added power consumption of the location cache, which depends on its size: when the entry number increases, both the dynamic and the leakage power of the location cache increase.


Figure 4.7: SPEC CINT2000 location cache power saving rate vs large entry number: write through.


Figure 4.8: SPEC CINT2000 location cache power saving rate vs small entry number: write through.


Figure 4.9: SPEC CFP2000 location cache power saving rate vs large entry number: write through.


Figure 4.10: SPEC CFP2000 location cache power saving rate vs small entry number: write through.


Figure 4.11: SPEC CPU2006 location cache power saving rate vs large entry number: write through.


Figure 4.12: SPEC CPU2006 location cache power saving rate vs small entry number: write through.

From Figure 4.7 to Figure 4.12, the location cache system saves power for all the applications, except lucas, and swim with large entry numbers, in CFP2000. This is because lucas has a very low L1 cache miss rate (0.45%) and swim has a relatively low L1 cache miss rate (1.97%). Depending on the application, for CINT2000 the power saving rates range from 30.4% (vpr) to 55.3% (gap); for CFP2000 (except lucas) they range from 11.0% (swim) to 57.7% (sixtrack); for CPU2006 they range from 37.2% (gobmk) to 72.6% (mcf).

4.3 Experimental Results with Write Back L1 Data Cache

In the last section, we showed the power experiment results when the L1 data cache uses the write through policy. In this section, we show the power experiment results when the L1 data cache uses the write back policy, with all other configurations the same as in the last section.

The L1 cache miss rates for SPEC CINT2000, SPEC CFP2000, and SPEC CPU2006 are shown in Table 4.7, Table 4.8, and Table 4.9, respectively. Since the write back policy is used, many write operations are captured as hits in the L1 data cache, so the overall L1 miss rate with write back is smaller than with write through. In our simulation results, the L1 cache miss rates vary from 0.11% (lucas of CFP2000) to 10.2% (art of CFP2000), depending on the benchmark application. The location cache hit rates for SPEC CINT2000 are shown in Figure 4.13 and Figure 4.14 for different location cache sizes.

Table 4.7: L1 cache miss rate of SPEC CINT2000: write back.
Benchmark     gzip    vpr     gcc     mcf     crafty  parser
L1 miss rate  0.0249  0.0018  0.0262  0.0356  0.0262  0.0024
Benchmark     eon     perlbmk gap     vortex  bzip2   twolf
L1 miss rate  0.0058  0.0198  0.0394  0.0264  0.0199  0.0039

Table 4.8: L1 cache miss rate of SPEC CFP2000: write back.
Benchmark     wupwise swim    mgrid   applu   mesa    galgel   art
L1 miss rate  0.0023  0.0022  0.0255  0.0183  0.0020  0.0026   0.1020
Benchmark     equake  facerec ammp    lucas   fma3d   sixtrack apsi
L1 miss rate  0.0037  0.0328  0.0447  0.0011  0.0096  0.0440   0.0685

The location cache hit rates for SPEC CFP2000 are shown in Figure 4.15 and Figure 4.16, and those for SPEC CPU2006 in Figure 4.17 and Figure 4.18. From these figures, it can be observed that the location cache hit rate increases as the cache entry number increases. For most applications, the hit rate is not sensitive to the entry number when the entry number is smaller than 16; when the entry number is larger than 16, increasing it raises the hit rate greatly. When 512 entries are used, for CINT2000 the hit rates range from 58.7% (bzip2) to 87.2% (twolf); for CFP2000 from 33.3% (galgel) to 96.2% (equake); for CPU2006 from 65.7% (bzip2) to 97.2% (specrand).

In the following, Figure 4.19 and Figure 4.20 show the location cache power saving rates for SPEC CINT2000 applications; Figure 4.21 and Figure 4.22 show those for SPEC CFP2000 applications; and Figure 4.23 and Figure 4.24 show those for SPEC CPU2006 applications.

Table 4.9: L1 cache miss rate of SPEC CPU2006: write back.
Benchmark     bzip2     mcf       gobmk     milc      lbm       specrand
L1 miss rate  0.041754  0.049996  0.011061  0.049997  0.016475  0.028562


Figure 4.13: SPEC CINT2000 location cache hit rate vs large entry number: write back.


Figure 4.14: SPEC CINT2000 location cache hit rate vs small entry number: write back.


Figure 4.15: SPEC CFP2000 location cache hit rate vs large entry number: write back.


Figure 4.16: SPEC CFP2000 location cache hit rate vs small entry number: write back.


Figure 4.17: SPEC CPU2006 location cache hit rate vs large entry number: write back.


Figure 4.18: SPEC CPU2006 location cache hit rate vs small entry number: write back.


Figure 4.19: SPEC CINT2000 location cache power saving rate vs large entry number: write back.

According to Equation 3.15, the saved power of an L2 location cache system is highly dependent on the L1 cache miss rate and the location cache hit rate. Although a large entry number (e.g., 512 entries) in the location cache can bring a high location cache hit rate, it does not guarantee that the location cache system will save power. Since the L1 cache miss rate with write back is much lower than with write through, it plays a big role here. In Equation 3.15, there are two components: a positive one, which is the saved power of the L2 cache, and a negative one, which is the added power consumption of the location cache itself. When the L1 cache miss rate is lower than a threshold rate, e.g., 3%, then no matter which entry number is chosen, the saved power of the L2 cache cannot offset the added power consumption of the location cache, and the location cache system will not save power.

59 0.04

0.02

0

−0.02

−0.04

−0.06

Power Saving Rate −0.08

−0.1

1 entry 2 entry −0.12 4 entry 8 entry 16 entry −0.14 gzip vpr gcc mcf crafty parser eon perlbmk gap vortex bzip2 twolf CINT2000 Benchmark

Figure 4.20: SPEC CINT2000 location cache power saving rate vs small entry number: write back.


Figure 4.21: SPEC CFP2000 location cache power saving rate vs large entry number: write back.

60 0.1

0.05

0

−0.05 Power Saving Rate

−0.1 1 entry 2 entry 4 entry 8 entry 16 entry −0.15 wupwise swim mgrid applu mesa galgel art equake facerec ammp lucas fma3d sixtrack apsi CFP2000 Benchmark

Figure 4.22: SPEC CFP2000 location cache power saving rate vs small entry number: write back.


Figure 4.23: SPEC CPU2006 location cache power saving rate vs large entry number: write back.


Figure 4.24: SPEC CPU2006 location cache power saving rate vs small entry number: write back.

This is illustrated by vpr, eon, and twolf in CINT2000; wupwise, swim, and mesa in CFP2000; and gobmk in CPU2006. Based on the simulation results from CACTI, read power is larger than write power, so the ratio of read operations to write operations also plays a role here. For the following estimation, we assume a read-to-write ratio of 1:1 in the L2 cache. The threshold rate can be calculated by setting the positive part equal to the negative part in Equation 3.15. We assume the hit rate of the location cache to be 100% (although it is normally not reachable), and let MRl1 th be the threshold miss rate of the L1 cache. By

Equation 3.8 and Equation 3.15, when HRloc is assigned 100%, MRl1 th can be expressed in the following equation:

MRl1 th = (Ploc dyn + Ploc lkg) / ((N−1)/N · (Pl2 dyn + Pl2 lkg − Pl2 dwy))   (4.1)

If the L1 cache miss rate is smaller than the threshold miss rate (MRl1 th), the location cache will not save power. Although Ploc dyn, Ploc lkg, and HRloc vary with the entry number, we can still use an entry number of 512 to estimate the L1 threshold miss rate. From the data of our cache power model, it is calculated as 2.67%. This agrees with our simulation results that the location cache system does not save power when the L1 cache miss rate is too small.
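The calculation behind Equation 4.1 can be reproduced with a short sketch; the power values below are stand-in placeholders, while the 2.67% figure quoted above comes from the actual CACTI 4.2 model:

    # Threshold L1 miss rate from Equation 4.1, with placeholder powers.
    N = 8
    P_loc_dyn, P_loc_lkg = 0.018, 0.004              # assumed location cache power
    P_l2_dyn, P_l2_lkg, P_l2_dwy = 1.0, 0.5, 0.05    # assumed L2 power terms

    MR_l1_th = (P_loc_dyn + P_loc_lkg) / (
        (N - 1) / N * (P_l2_dyn + P_l2_lkg - P_l2_dwy))
    print(f"MR_l1_th = {MR_l1_th:.2%}")   # about 1.73% with these numbers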

When the L1 cache miss rate is larger than the threshold miss rate (MRl1 th), as for mcf and gap in CINT2000; art, ammp, and sixtrack in CFP2000; and mcf, milc, and specrand in CPU2006, the location cache system saves power with large entry numbers. From Figure 4.19, Figure 4.21, and Figure 4.23, it can be observed that 256 or 512 entries achieve the most significant power saving. The power saving rate reaches as high as 21.1% (gap) in CINT2000, 22.7% (art) in CFP2000, and 30.0% (specrand) in CPU2006.

4.4 Experimental Results Without Drowsy State

From the location information stored in the location cache, both the dynamic power and the leakage power of an L2 cache can be saved by not waking up the unaccessed ways. However, if the drowsy technique is not used, only the dynamic power of the L2 cache is saved in our location cache architecture. In this section, we present the power analysis of the location cache without the drowsy state, and compare the results with those obtained with the drowsy state. By repeating an analysis similar to that presented in Section 3.4.2 (cf. Equations 3.7 and 3.15), the power consumption (Pnodwy cache(no loc)) of a traditional L2 cache (without a location cache and without the drowsy state) and the saved power (Pnodwy sav) of the L2 location cache system (with a location cache but without the drowsy state) can be derived as Equations 4.2 and 4.3:

Pnodwy cache(no loc) = Pl2 dyn · MRl1 + Pl2 lkg   (4.2)

Pnodwy sav = (N−1)/N · Pl2 dyn · MRl1 · HRloc − Ploc   (4.3)
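A quick side-by-side evaluation of Equations 3.15 and 4.3, again with placeholder numbers rather than CACTI outputs, shows how much of the saving is lost when the drowsy state is dropped:

    # Saved power with vs. without the drowsy state (Equations 3.15, 4.3).
    # Hypothetical normalized values, not CACTI outputs.
    N, MR_l1, HR_loc = 8, 0.10, 0.90
    P_l2_dyn, P_l2_lkg, P_l2_dwy, P_loc = 1.0, 0.5, 0.05, 0.02

    P_sav = ((N - 1) / N * (P_l2_dyn + P_l2_lkg - P_l2_dwy)
             * MR_l1 * HR_loc - P_loc)                             # with drowsy
    P_nodwy_sav = (N - 1) / N * P_l2_dyn * MR_l1 * HR_loc - P_loc  # without

    assert P_nodwy_sav < P_sav    # the leakage saving is lost
    print(P_sav, P_nodwy_sav)     # about 0.094 vs. 0.059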

Compared with Pcache(no loc) in Equation 3.7, Pnodwy cache(no loc) is larger, since the L2 cache is not put into the drowsy state when it is in standby mode. However, Pnodwy sav is smaller than Psav in Equation 3.15, since the location cache hit information is no longer used to save L2 leakage power; only the L2 dynamic power is saved. The simulation results with both the write through and write back policies are shown in Tables 4.10 to 4.15. Here, the location cache has 256 entries.

Table 4.10: Location cache power saving rate of SPEC CINT2000: write through.
Benchmark                         gzip     vpr      gcc     mcf     crafty  parser
Power saving rate with drowsy     0.5111   0.2391   0.4090  0.4993  0.3220  0.2928
Power saving rate without drowsy  0.0129   -0.0101  0.0203  0.0029  0.0150  -0.0080
Benchmark                         eon      perlbmk  gap     vortex  bzip2   twolf
Power saving rate with drowsy     0.4861   0.5051   0.5113  0.4806  0.5050  0.3442
Power saving rate without drowsy  0.0063   0.0201   0.0040  0.0036  0.0089  -0.0031

Table 4.11: Location cache power saving rate of SPEC CFP2000: write through.
Benchmark                         wupwise  swim     mgrid   applu   mesa     galgel   art
Power saving rate with drowsy     0.2585   -0.0054  0.3485  0.3929  0.3010   0.1613   0.3712
Power saving rate without drowsy  -0.0104  -0.0159  0.0064  0.0094  -0.0072  -0.0129  0.0553
Benchmark                         equake   facerec  ammp    lucas    fma3d    sixtrack apsi
Power saving rate with drowsy     0.3455   0.5055   0.4610  -0.2028  0.2174   0.5457   0.5532
Power saving rate without drowsy  -0.0037  0.0216   0.0005  -0.0183  -0.0019  0.0087   0.0367

Table 4.12: Location cache power saving rate of SPEC CPU2006: write through.
Benchmark                         bzip2   mcf     gobmk    milc    lbm     specrand
Power saving rate with drowsy     0.7000  0.6972  0.3299   0.6972  0.4934  0.4826
Power saving rate without drowsy  0.0506  0.0389  -0.0069  0.0389  0.0014  0.0307

Table 4.13: Location cache power saving rate of SPEC CINT2000: write back.
Benchmark                         gzip     vpr      gcc      mcf     crafty  parser
Power saving rate with drowsy     0.0448   -0.2407  0.0469   0.1873  0.0782  -0.2210
Power saving rate without drowsy  -0.0050  -0.0181  -0.0025  0.0037  0.0002  -0.0174
Benchmark                         eon      perlbmk  gap     vortex   bzip2    twolf
Power saving rate with drowsy     -0.1166  0.0093   0.2110  0.1050   0.009    -0.1597
Power saving rate without drowsy  -0.0129  -0.0063  0.0060  -0.0022  -0.0076  -0.0147

Table 4.14: Location cache power saving rate of SPEC CFP2000: write back.
Benchmark                         wupwise  swim     mgrid    applu    mesa     galgel   art
Power saving rate with drowsy     -0.2209  -0.2250  0.0803   0.0972   -0.2163  -0.2521  0.2273
Power saving rate without drowsy  -0.0175  -0.0176  -0.0016  -0.0004  -0.0170  -0.0187  0.0320
Benchmark                         equake   facerec  ammp    lucas    fma3d    sixtrack apsi
Power saving rate with drowsy     -0.1697  0.0669   0.2179  -0.2487  -0.0692  0.2239   0.2229
Power saving rate without drowsy  -0.0152  -0.0019  0.0075  -0.0183  -0.0107  0.0079   0.0181

Table 4.15: Location cache power saving rate of SPEC CPU2006: write back.
Benchmark                         bzip2   mcf     gobmk    milc    lbm      specrand
Power saving rate with drowsy     0.1782  0.2657  -0.0551  0.2657  0.0149   0.2246
Power saving rate without drowsy  0.0066  0.0124  -0.0115  0.0124  -0.0084  0.0142

From the results in these tables, it can be observed that the power saving rate without the drowsy mode is much smaller than that with it. For some applications, such as vpr and twolf of CINT2000 in write through, the saving rate changes from positive to negative, meaning that the saved L2 dynamic power cannot offset the power added by the location cache. For other applications, such as vpr and parser of CINT2000 in write back, the location cache cannot save power regardless of whether the drowsy mode is used. This is because of their very low L1 cache miss rates (0.18% for vpr and 0.24% for parser): most cache accesses are caught by the L1 cache, so the added location cache power is larger than the power saved from the L2 cache. In both Psav (Equation 3.15) and Pnodwy sav (Equation 4.3), if the L1 miss rate is extremely small, the dominant part is the negative term (Ploc). So the numerators of the two power saving rates are close to each other (i.e., approximately −Ploc), while the denominator Pcache(no loc) is smaller than Pnodwy cache(no loc). As a result, for these applications, the absolute value of the power saving rate with drowsy is larger than that without drowsy, which matches our simulation results.
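This argument can be checked numerically; the sketch below uses an artificially low L1 miss rate together with the same placeholder power values as before:

    # For a very low L1 miss rate, both saved powers collapse to about
    # -P_loc, but the drowsy baseline (Eq. 3.7) is much smaller than the
    # non-drowsy baseline (Eq. 4.2), so |R| is larger with drowsy.
    # Hypothetical placeholder values only.
    N, MR_l1, HR_loc = 8, 0.002, 0.5
    P_l2_dyn, P_l2_lkg, P_l2_dwy, P_loc = 1.0, 0.5, 0.05, 0.02

    P_sav = ((N - 1) / N * (P_l2_dyn + P_l2_lkg - P_l2_dwy)
             * MR_l1 * HR_loc - P_loc)                             # Eq. 3.15
    P_nodwy_sav = (N - 1) / N * P_l2_dyn * MR_l1 * HR_loc - P_loc  # Eq. 4.3
    P_base = (P_l2_dyn + P_l2_lkg - P_l2_dwy) * MR_l1 + P_l2_dwy   # Eq. 3.7
    P_nodwy_base = P_l2_dyn * MR_l1 + P_l2_lkg                     # Eq. 4.2

    R = P_sav / P_base
    R_nodwy = P_nodwy_sav / P_nodwy_base
    assert R < 0 and R_nodwy < 0 and abs(R) > abs(R_nodwy)
    print(R, R_nodwy)   # about -0.354 vs. -0.038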

Chapter 5

Conclusions and Future Work

In this research, we analyzed the performance of the location cache designed for a drowsy L2 cache system. The detailed working principle and hardware organization of the L2 location cache system were reviewed. With the location information stored in the location cache, both the dynamic power and the leakage power of the L2 cache can be greatly reduced. A mathematical power saving analysis of the L2 cache system was presented, considering both dynamic and leakage power saving. CACTI 4.2 was used to model the cache for our timing and power analyses, and the Simplescalar 3.0d simulator was used to evaluate the cache behavior (e.g., hit rate and miss rate) of the location cache system under real-world workloads. Both SPEC CPU2000 and CPU2006 benchmark applications were simulated for both the write-through and write-back policies. Our goal was to determine the performance loss and the amount of power saved due to the location cache architecture. According to the CACTI 4.2 simulation, the location cache system incurs no performance loss. If a dual-mode L2 cache access (i.e., direct-mapped access and set-associative access) is supported, the location cache design can even improve the performance of the L2 cache (a 49.99% improvement in our experiment).

The power saving rate of the location cache is highly dependent on the L1 cache miss rate, so the location cache achieves a higher saving rate with the L1 write-through policy than with the write-back policy. From our simulation results, with the write through policy, for CINT2000 the power saving rate ranges from 30.4% (vpr) to 55.3% (gap); for CFP2000 from 11.0% (swim) to 57.7% (sixtrack); and for CPU2006 from 37.2% (gobmk) to 72.6% (mcf). With the write back policy, the power saving rate reaches as high as 21.1% (gap) in CINT2000, 22.7% (art) in CFP2000, and 30.0% (specrand) in CPU2006. Since the location cache is accessed simultaneously with the L1 cache, its access frequency is much higher than that of the L2 cache. When the L1 cache miss rate is very low (i.e., below the threshold rate of 3% in our experiments), the saved L2 cache power does not offset the added location cache power, so the location cache architecture does not save power in these situations.

For future work, we suggest the following:

1. Since the location cache will not save power when the L1 cache miss rate is very low, and since the location cache hit rate is not sensitive to its entry number for some applications, an adaptive location cache may be developed. Some cache lines of an adaptive location cache could be virtually detached from the location cache depending on the dynamic profile of the running applications.

2. The location cache may also be added between the L2 cache and the L3 cache, though further investigation is needed to determine how much power can be saved in this architecture.

3. The wake-up power is not considered in this research, although it has been proved that more power will be saved if it is considered. CACTI is used as the cache model to estimate the power consumption, and CACTI simulation results may deviate by about 10% from SPICE simulation results [22]. Therefore, the layout of an entire cache system with the location cache implementation may be developed to obtain more accurate cache power consumption figures; after that, the wake-up power can also be considered.

4. Lastly, in this research the way information of the L2 cache is stored in the location cache, and the L2 cache is woken up way-wise. Finer granularity may be explored, such that other location information is stored in the location cache and the L2 cache can be woken up by cache lines or cache blocks.

Bibliography

[1] J. L. Hennessy, D. A. Patterson, and D. Goldberg, Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann Publishers, San Francisco, 2003.

[2] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3rd ed., Elsevier/Morgan Kaufmann, Amsterdam/Boston, 2004.

[3] F. Fallah and M. Pedram, "Standby and active leakage current control and minimization in CMOS VLSI circuits," IEICE Trans. Electron., vol. E88-C, pp. 509-519, Apr. 2005.

[4] H. McIntyre, D. Wendell, K. J. Lin, P. Kaushik, S. Seshadri, A. Wang, V. Sundararaman, P. Wang, S. Kim, W. Hsu, H. C. Park, G. Levinsky, J. Lu, M. Chirania, R. Heald, P. Lazar, and S. Dharmasena, "A 4-MB on-chip L2 cache for a 90-nm 1.6-GHz 64-bit microprocessor," IEEE J. Solid-State Circuits, vol. 40, pp. 52-59, Jan. 2005.

[5] S. Rusu, H. Muljono, and B. Cherkauer, "Itanium 2 processor 6M: higher frequency and larger L3 cache," IEEE Micro, vol. 24, pp. 10-18, Apr. 2004.

[6] R. Min, W.-B. Jone, and Y. Hu, "Location cache: a low-power L2 cache system," Proceedings of the 2004 International Symposium on Low Power Electronics and Design, pp. 120-125, Aug. 2004.

[7] C. Zhang, J. Yang, and F. Vahid, "Low static-power frequent-value data caches," Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, vol. 1, pp. 214-219, Feb. 2004.

[8] J. Montanaro et al., "A 160MHz 32b 0.5W CMOS RISC microprocessor," Proceedings of the 43rd International Symposium on Solid-State Circuits, pp. 214-215, Feb. 1996.

[9] N. S. Kim, K. Flautner, D. Blaauw, and T. Mudge, "Circuit and microarchitectural techniques for reducing cache leakage power," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 2, pp. 167-184, Feb. 2004.

[10] C. Su and A. Despain, "Cache design tradeoffs for power and performance optimization: a case study," Proceedings of the International Symposium on Low Power Electronics and Design, pp. 63-68, 1997.

[11] A. Hasegawa, I. Kawasaki, K. Yamada, S. Yoshioka, S. Kawasaki, and P. Biswas, "SH3: high code density, low power," IEEE Micro, vol. 15, pp. 11-19, Dec. 1995.

[12] T. Lyon, E. Delano, C. McNairy, and D. Mulla, "Data cache design considerations for the Itanium 2 processor," Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02), pp. 356-362, 2002.

[13] B. Calder, D. Grunwald, and J. Emer, "Predictive sequential associative cache," Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture (HPCA '96), pp. 244-254, 1996.

[14] T. N. Vijaykumar, "Reactive-associative caches," Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'01), pp. 49-61, 2001.

[15] K. Inoue, T. Ishihara, and K. Murakami, "Way-predicting set-associative cache for high performance and low energy consumption," Proceedings of the International Symposium on Low Power Electronics and Design, pp. 273-275, 1999.

[16] M. Powell, S. H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, "Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories," Proceedings of the International Symposium on Low Power Electronics and Design, pp. 90-95, 2000.

[17] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, "1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS," IEEE J. Solid-State Circuits, vol. 30, no. 8, pp. 847-857, Aug. 1995.

[18] M. Mamidipaka and N. Dutt, "eCACTI: an enhanced power estimation model for on-chip caches," Technical Report TR-04-28, Center for Embedded Computer Systems, University of California, Irvine, Sept. 2004.

[19] S. Wilton and N. Jouppi, "An enhanced access and cycle time model for on-chip caches," Western Research Lab Research Report 93/5, Jun. 1994.

[20] G. Reinman and N. Jouppi, "CACTI 2.0: an integrated cache timing and power model," Western Research Lab Research Report 2000/7, Feb. 2000.

[21] P. Shivakumar and N. Jouppi, "CACTI 3.0: an integrated cache timing, power, and area model," Western Research Lab Research Report 2001/2, Aug. 2001.

[22] "CACTI 4," http://quid.hpl.hp.com:9081/cacti/

[23] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan, "HotLeakage: a temperature-aware model of subthreshold and gate leakage for architects," Technical Report TR-CS-2003-05, Dept. of Computer Science, Univ. of Virginia, Mar. 2003.

[24] D. Burger and T. M. Austin, "The SimpleScalar tool set," University of Wisconsin, Tech. Rep. TR-97-1342, 1997.

[25] SPEC Benchmark Suite, http://www.spec.org

[26] Semiconductor Industry Association, International Technology Roadmap for Semiconductors, 2005 edition, http://public.itrs.net/

[27] C. Piguet, Low-Power Electronics Design, CRC Press LLC, 2005.

[28] T. Douseki and S. Mutoh, "Static-noise margin analysis for a scaled-down CMOS memory cell," Electronics & Communications in Japan, Part II: Electronics, vol. 75, no. 7, pp. 102-115, 1992.

[29] M. Lee, W. I. Sze, and C. M. Wu, "Static noise margin and soft-error rate simulations for thin film transistor cell stability in a 4Mbit SRAM design," Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 937-940, 1995.

[30] E. Seevinck, F. J. List, and J. Lohstroh, "Static-noise margin analysis of MOS SRAM cells," IEEE J. Solid-State Circuits, vol. 22, no. 5, pp. 748-754, Oct. 1987.

[31] H. Qin, Y. Cao, D. Markovic, A. Vladimirescu, and J. Rabaey, "SRAM leakage suppression by minimizing standby supply voltage," Proceedings of the International Symposium on Quality Electronic Design, pp. 55-60, 2004.