System-level Energy Optimisation Methodologies for DRAM Memory of Embedded Systems

by Su Myat Min Shwe

A Thesis Submitted in Accordance with the Requirements for the Degree of Doctor of Philosophy

School of Computer Science and Engineering
The University of New South Wales

November 2013

© Copyright by Su Myat Min Shwe, 2013. All Rights Reserved.


Thesis Publications

• S. M. Min, H. Javaid, A. Ignjatovic and S. Parameswaran. A Case Study on Exploration of Last-level Cache for Energy Reduction in DDR3 DRAM. In 2nd Mediterranean Conference on Embedded Computing, MECO 2013 & ECyPS'2013, Budva, Montenegro.

• S. M. Min, H. Javaid and S. Parameswaran. XDRA: Exploration and Optimization of Last-level Cache for Energy Reduction in DDR DRAMs. In Design Automation Conference, DAC'13, USA, June 2013.

• S. M. Min, H. Javaid and S. Parameswaran. RExCache: Rapid Exploration of Unified Last-level Cache. In Asia and South Pacific Design Automation Conference, ASP-DAC’13, Japan, Jan 2013.

• S. M. Min, J. Peddersen, and S. Parameswaran. Realising Cycle Accurate Processor-Memory Simulation via Interface Abstraction. In 24th International Conference on VLSI Design, VLSI'11, India, Jan 2011.

Contributions of this Thesis

• A novel interface abstraction layer between the processor and memory system is proposed to implement a cycle-accurate processor-memory system simulator, so that detailed statistics of the memory system, such as performance and power consumption, can be captured cycle-accurately.

• A novel methodology for estimating the execution time and energy consumption of the memory system is proposed.

• A rapid exploration framework is presented to quickly estimate a suitable last- level cache configuration which enables maximum power savings with negligible performance degradation of the memory system. This framework integrates the cycle-accurate processor-memory simulator, cache simulator and proposed execution time/energy estimators in order to greatly reduce the simulation time.

• An improved power mode controller to efficiently manage the DRAM power modes for DRAM energy reduction is presented.

• A DRAM energy reduction estimator, derived using a small number of cycle-accurate simulations, is proposed to obtain the DRAM system's energy savings accurately and rapidly for a specific last-level cache.

• An exploration framework is presented to explore the last-level cache design space for maximum DRAM energy reduction. The framework uses novel analysis techniques for computing the proposed DRAM energy reduction estimator's parameters which do not require cycle-accurate simulations of all the last-level cache configurations, and thus enables fast exploration of the large design space.

Acknowledgements

I would like to express my deepest gratitude to my supervisor Prof. Sri Parameswaran for his continuous support, patience, motivation, constant encouragement, and immense knowledge. Without his insightful guidance and kind support, this dissertation would not have been possible. I greatly appreciate all the support he has provided to me throughout my candidature.

I would also like to express my deepest appreciation to my dear husband, Win. Without his love, constant motivation, understanding, and endless support, I would not have made it this far. I am greatly indebted to him and I cannot find words to express my gratitude to him.

Special gratitude goes to Dr. Haris Javaid for his kind support, guidance, helpful suggestions and sharing of his great technical knowledge. I owe him my heartfelt appreciation.

I would also like to thank my joint-supervisor, Dr. Aleksander Ignjatovic, for his support and sharing of mathematical knowledge related to my research. Many thanks to the academic committee members for reviewing my research: Dr. Oliver Diessel, Dr. Bruno Gaeta, Dr. Annie Guo, and Dr. Hui Wu. Their comments and feedback always guided my research in the right direction. My sincere thanks go to Dr. Jorgen Peddersen for his valuable comments and useful advice. I am also truly grateful to all the members of the Embedded Systems group: Dr. Jude Angelo Ambrose, Liang, Tuo, Josef, Haseeb, Babak, Dr. Xin He and Dr. Krutartha Patel for their cooperation, continuous moral support, and all forms of help throughout the study. I would also like to extend my thanks to members of the Computer Science and Engineering Department, UNSW, for the various forms of support I received over my candidature.

Furthermore, I would like to take this opportunity to thank my lovely sister, Dr. Thazin Aung, for being supportive and for her invaluable suggestions throughout my life. I am lucky to be her sister and I will forever be thankful to her. Last,

but not least, I would like to thank my family for their unconditional love and encouragement. My humble apologies to anyone whose name I might not have mentioned here, but I appreciate your support from the bottom of my heart. To all, whom I have mentioned and whom I have forgotten to mention, I would like to dedicate this work.

Abstract

Managing power/energy consumption in complex SoC (System on Chip) systems and Application Specific Instruction set Processors (ASIPs) is emerging as a major concern in the design of embedded systems. In these systems, especially in battery-operated portable devices, the performance of the system is measured not only by the speed and functionality that the system provides but also by the lifetime of the battery, which is directly proportional to the power/energy consumption of the system. Among the different components of the system, DRAM is one of the largest power consumers. The increasing demand for long battery life requires power/energy aware methodologies and a comprehensive design process flow to optimise the DRAM power/energy consumption of such power-hungry devices.

For power/energy estimation purposes, a high level system simulation guided approach is necessary, because RTL/gate-level performance and power estimation is a time consuming process. A two-step simulation approach (in which memory trace sequences are captured with a processor simulator or a hardware-assisted approach in the first step, and the collected traces are used in a second-step memory system simulation) yields inaccurate results due to the lack of feedback from one memory request to the next. This thesis presents a design methodology with a seamless interface layer that glues the processor component and memory component together to build a one-step system level processor-memory simulation framework, so that every memory request from the processor component can be sent directly to the memory component for on-the-fly memory simulation. Over six mediabench benchmarks, our one-step simulation approach provides greater accuracy than trace-driven memory simulation, which exhibited an 80% variation in the choice of fixed memory latency required to achieve the most accurate power consumption.

Exploiting the last-level cache is a well-known technique that reduces the DRAM memory traffic. The last-level cache is inserted just before the DRAM level in the

memory hierarchy design in order to improve the performance of the system. Estimating the performance improvement for a last-level cache configuration (cache size, cache line size and associativity) with a cycle-accurate simulation approach is exorbitantly slow. Thus, cycle-accurate simulation of a large last-level cache configuration design space is not a feasible option for obtaining highly accurate estimates. This thesis introduces a technique to rapidly determine the performance and energy consumption of the whole system for different last-level cache configurations. The proposed technique utilises a combination of a one-time slow cycle-accurate simulation and a large number of fast trace-based simulations for all the configurations, and thus reduces the total simulation time (in the largest case, for the h264 Enc application, from 257 days to 21 hours). Our methodology significantly reduces the turnaround time to obtain highly accurate execution time numbers with reasonably accurate energy numbers (average absolute accuracy of 99.74% in execution time and 80.31% in energy consumption for nine multimedia applications).

DRAM's energy consumption is a very important component of the total energy consumption of a system design. Exploiting both the last-level cache and DRAM's power modes together creates an opportunity to reduce the energy of the DRAM memory system. However, the increase or reduction in energy consumption depends on the application's request pattern and the last-level cache configuration. Selecting a suitable last-level cache configuration (from a large design space) for a target application to obtain the maximum energy savings takes a great amount of time. Thus, we developed a design framework and an energy reduction estimator to quickly explore for a suitable configuration for maximum DRAM energy reduction. First, we analysed the energy increase/reduction of test configurations chosen with Latin Hypercube Sampling, a well-known design-of-experiments technique. Based on this analysis, we proposed an energy reduction estimator that captures the dependence of the memory system's energy reduction on certain parameters such as memory traffic, power mode switching time, etc. The energy

reduction estimator of the DRAM system is modelled by capturing the relationship between the energy reduction and highly correlated DRAM parameters, using the Kriging prediction method. We show that our technique is able to predict the DRAM energy reduction for 330 last-level cache configurations in several days (with an average error within 4.4%) for 11 applications from the mediabench and SPEC2000 suites, whereas cycle-accurate simulation took several months.

Contents

Statement of Originality
Copyright Statement
Authenticity Statement
Thesis Publications
Contributions of this Thesis
Acknowledgements
Abstract
Table of Contents
List of Tables
List of Figures

1 Introduction
1.1 The Importance of Memory System
1.2 SoC Design Issues for Battery-operated Devices
1.3 Energy Awareness in DRAM Memory System
1.4 Design Automation with SoC Design Challenges
1.5 The Need for High Level Modelling
1.6 Simulation-assisted Analytical Modelling
1.7 Research Aims and Contributions
1.8 Thesis Overview

2 DRAM Background
2.1 Main Memory System Types
2.2 Overview of DRAM
2.3 DRAM Execution Time and Power Consumption
2.4 DRAM Controller Organisation
2.5 DDR3 SDRAM
2.5.1 DDR3 Device Configuration

3 Literature Review
3.1 Introduction
3.2 High Level System Simulation
3.2.1 Instruction Set Simulation
3.2.2 Memory Simulation
3.2.3 System Simulation
3.2.4 The Need for Execution-Driven System Simulation
3.2.5 Summary of System Simulation Research
3.3 Exploitation of Last-level Cache
3.3.1 Last-level Cache Exploration
3.3.2 DRAM Performance Improvement
3.3.3 DRAM Energy-Aware Scheme
3.3.4 Summary: Exploitation of Last-level Cache Work
3.4 DRAM Power/Energy Management
3.4.1 Compiler-directed DRAM Power/Energy Management
3.4.2 OS-level DRAM Power/Energy Management
3.4.3 System-level DRAM Power/Energy Management
3.4.4 DRAM Power/Energy Estimation
3.4.5 Summary of DRAM Power/Energy Management Research

4 Interface Abstraction Layer
4.1 Introduction
4.2 Motivation
4.3 Contribution
4.4 Background
4.4.1 Processor Simulator Component
4.4.2 Memory Simulator Component
4.5 Proposed Integration Methodology
4.5.1 Case Study
4.6 Experimental Tests and Results
4.7 Summary

5 Rapid Exploration of Unified Last-level Cache
5.1 Introduction
5.2 Problem Statement
5.3 RExCache Framework
5.3.1 Application Trace Generation
5.3.2 Cache Simulation
5.3.3 Cache Exploration
5.4 Experimental Methodology
5.5 Results and Analysis
5.6 Advantages and Limitations
5.7 Summary

6 Effects of Last-level Cache Configurations
6.1 Introduction
6.2 Power Mode Controller
6.3 Results
6.4 Fast Design Space Exploration
6.5 Summary

7 Energy Reduction in DDR DRAMs
7.1 Introduction
7.2 Problem Statement
7.3 DRAM Energy Reduction Estimator
7.3.1 Kriging Model
7.4 XDRA Framework
7.5 Experiments and Results
7.6 Advantages and Limitations
7.7 Summary

8 Conclusions
8.1 Future Work

Bibliography


List of Tables

2.1 DDR3 1 Gb Device Configuration from Micron [1]
2.2 256MB and 4GB DDR3 DRAM Device Configuration
4.1 Configuration Settings of Processor and Memory
4.2 Average Simulation Time [seconds]
4.3 Total Execution Time [clock cycles]
4.4 Average Power Consumption [mW]
5.1 Number of Memory Accesses for Last-level Cache Policies
5.2 Detailed Analysis of Execution-Time and Energy Estimators [s/m/d in Simulation Time represents seconds/minutes/days respectively]
5.3 Cache Configurations w.r.t. Minimum Execution Time and Minimum Energy Consumption from RExCache
5.4 Exploration Time Comparison of Cycle-accurate Simulations, Traditional Method and RExCache
6.1 Power Modes of Micron DDR3 DRAM. SB and PD stand for StandBy and PowerDown respectively
6.2 Optimal Cache Configurations and their Area Footprints
7.1 Power Modes of Micron DDR3 DRAM. SB and PD stand for StandBy and PowerDown respectively
7.2 L2 Cache Configurations with Maximum DRAM Energy Reduction (BC PD) from XDRA for different DRAM sizes. The numbers in parentheses are area footprints in mm2
7.3 Time Comparison of Cycle-accurate Processor-memory Simulator and XDRA for 256MB DRAM
7.4 Time Comparison of Cycle-accurate Processor-memory Simulator and XDRA for 4GB DRAM

List of Figures

1.1 Embedded Systems and General Purpose Computing System Shipments
1.2 Typical Embedded System
1.3 Power Consumption Trends for SoC
2.1 Evolution of DRAM Architecture from Conventional DRAM through the state-of-the-art DDR3 DRAM
2.2 DRAM Organisation and Terminology
2.3 Different Row Buffer Policies of DRAM with Latency Effect
2.4 Overview of Memory Controller Design
2.5 DRAM Address Mapping
2.6 Detailed Memory Controller Design
3.1 A Taxonomy of Architecture Simulation Tools from Different Aspects
3.2 Write-induced Interference
3.3 Loop Fusion Transformation
3.4 Loop Fission Transformation
3.5 Loop Tiling Transformation
4.1 Available Power Modes and Approximate Power Consumption in Specific Power Mode of DDR3 SDRAM [2]
4.2 Memory Latency [clock cycles] to obtain the closest values to correct Execution Time and Average Power Consumption
4.3 Detailed Simulation Framework
4.4 Interface Abstraction Layer
4.5 Applied IAL simulator framework in building a Functional, Cycle-accurate Processor-memory Simulator
4.6 Error Rate for Total Execution Time
4.7 Error Rate for Average Power Consumption
5.1 Execution Time and Energy Consumption of different L2 (last-level) Cache Configurations for g721 Enc application's execution on Target System of Figure 5.2
5.2 An example of a Target System
5.3 RExCache Framework. Dotted-lined rectangles and broken arrows show our novel contributions
5.4 An example of LCI period for Target System of Figure 5.2 (L2 is the last-level cache)
5.5 An example of Execution Time estimation for Target System of Figure 5.2 (L2 is the last-level cache)
5.6 Execution Time of different Cache Configurations normalised to Common Cache (CC) Configuration
5.7 Energy of different Cache Configurations normalised to Common Cache (CC) Configuration
6.1 An example of Target System
6.2 Energy Saving and Performance Degradation of PMC system w.r.t. NoPMC system
6.3 Mixed Impact of Poorly Chosen Cache Configuration
7.1 Power Consumption Breakdown in a Uniprocessor System with on-chip L1 cache, off-chip L2 cache and DRAM memory
7.2
7.3 Total Cache and DRAM Energy Reduction for different L2 (last-level) Cache Configurations
7.4 Effect of two distinct L2 (last-level) Cache Configurations on DRAM idle periods. PD stands for PowerDown
7.5 An example of Target System
7.6 Correlation Coefficients of most common Parameters, averaged over 2 DRAM sizes and 11 Applications
7.7 XDRA Framework. Dotted-lined rectangles and broken arrows show our novel contributions
7.8 An example of LCI period for Target System of Figure 7.5 (L2 is the last-level Cache)
7.9 An example of calculating PDcycles and PDCnt
7.10 Average Error in estimated Energy Reduction for 256MB DRAM
7.11 Average Error in estimated Energy Reduction for 4GB DRAM
7.12 Error in Energy Reduction from Cache Configurations selected under differing Area Constraints for 256MB DRAM
7.13 Error in Energy Reduction from Cache Configurations selected under differing Area Constraints for 4GB DRAM
7.14 Normalised DRAM Energy Consumption Breakdown of adpcmEnc for different L2 caches and DRAM sizes
7.15 Normalised DRAM Energy Consumption Breakdown of adpcmDec for different L2 caches and DRAM sizes
7.16 Normalised DRAM Energy Consumption Breakdown of jpegEnc for different L2 caches and DRAM sizes
7.17 Normalised DRAM Energy Consumption Breakdown of jpegDec for different L2 caches and DRAM sizes
7.18 Normalised DRAM Energy Consumption Breakdown of g721Enc for different L2 caches and DRAM sizes
7.19 Normalised DRAM Energy Consumption Breakdown of g721Dec for different L2 caches and DRAM sizes
7.20 Normalised DRAM Energy Consumption Breakdown of mpeg2Enc for different L2 caches and DRAM sizes
7.21 Normalised DRAM Energy Consumption Breakdown of mpeg2Dec for different L2 caches and DRAM sizes
7.22 Normalised DRAM Energy Consumption Breakdown of spec vpr for different L2 caches and DRAM sizes
7.23 Normalised DRAM Energy Consumption Breakdown of spec bzip2 for different L2 caches and DRAM sizes
7.24 Normalised DRAM Energy Consumption Breakdown of spec gzip for different L2 caches and DRAM sizes

Chapter 1

Introduction

The use of embedded systems is pervasive in almost all modern devices. Embedded systems are widespread in automotive electronics, aircraft electronics, trains, telecommunication, medical systems, military applications, authentication systems, consumer electronics, fabrication equipment, smart buildings, robotics and many other areas, which demonstrates the huge variety of embedded systems in our daily environment.

It has been estimated that approximately 79% of all processors are used in embedded systems [3]. In 2010, over 16 billion embedded devices were sold [4]. The market is estimated to grow at an annual rate of 15% and to reach more than 40 billion embedded devices by 2020 [4]. Figure 1.1 illustrates the trend in shipment units of embedded systems and general purpose computing systems, as recently reported by the International Data Corporation (IDC) research firm [5]. BCC Research estimated the market for embedded technology at $113 billion in 2010 and expected it to grow at 7% annually over the following five years, reaching $158.6 billion by 2015 [6].

Embedded systems are the origins of ubiquitous computing [7] (providing "information anytime, anywhere") or pervasive computing [8] (focusing on practical aspects and exploitation of already available technology) [3]. Ubiquitous computing systems are now invading every aspect of our daily life and are profoundly driving the development of embedded systems.


Figure 1.1: Embedded Systems and General Purpose Computing System Shipments, sourced from [5]

Unlike a general purpose system, an embedded system is highly specialised to the application domain and designed to perform a specific task or a class of tasks. A general purpose computing system is a combination of generic hardware and a general purpose operating system for executing a variety of applications, whereas an embedded system is a combination of special purpose hardware and an embedded OS/firmware for executing a specific set of applications. Hence, the design of an embedded system can be optimised to the application's needs.

Embedded systems are more constrained in hardware usage and/or software functionality than a general purpose system. Hardware constraints include processing performance, power consumption, memory, display size, hardware functionality, and so on. Software limitations include scaled-down applications, no operating system (OS) or a limited OS, less abstraction-level code, and so on. Besides the functional constraints, many other design constraints of embedded systems, such as timeliness, fault tolerance, security and safety, vary from one application domain to another. For example, real-time constraints are extremely important in aircraft electronics systems, and high performance constraints are necessary for consumer electronics systems such as digital cameras and game consoles. In other embedded systems, such as telecommunication devices like mobile phones, low power design constraints are a must.

A typical embedded system, shown in Figure 1.2, is usually constituted by one or more dedicated chip controllers such as a microprocessor (Intel 80486 [9]), a microcontroller (Microchip's PIC32 [10]), a Field Programmable Gate Array (FPGA) device (Xilinx's Spartan-6 [11]), a Digital Signal Processor (DSP) (TI's TMS320C6000 [12]) or an Application Specific Integrated Circuit (ASIC) (for example, a custom-made IC designed to monitor a human's heart beat). The chip controller and all I/O communications are controlled by firmware which typically resides in the main memory.

Figure 1.2: Typical Embedded System (a system core such as an FPGA, ASIC, DSP, SoC or embedded microprocessor/microcontroller, connected to memory, a communication interface (USB), input ports (sensors), output ports (actuators), and other integrated circuits and subsystems), adapted from [3]

Rapidly improving silicon technology allows us to integrate one or more processors on a single chip along with memories, application specific circuits, and interconnect infrastructure. Ever-increasing computational demands from the application side, together with the requirement to integrate numerous components onto a single integrated chip, make systems-on-chip (SoCs) necessary. Many of today's embedded systems are based on SoC platforms [13]. An example of a SoC-based embedded system is one of today's most successful embedded devices, the mobile phone.

1.1 The Importance of Memory System

In today's advanced SoC embedded systems (from general purpose microprocessors to customised application specific systems), memory (whether static RAM or dynamic RAM) plays a dominant role in determining the performance, power consumption and cost of the system [14–16]. According to Moore's law, processor performance increases on average by 60% annually; however, memory performance increases by only roughly 10% annually [17]. This processor-memory performance gap leads to the memory wall problem [18], which limits system performance because memory performance cannot keep up with that of the processor. As memory accesses become slower with respect to the processor speed, and consume more power with increasing memory size, the consideration of memory performance, energy consumption and area cost becomes increasingly important. In modern computing systems, roughly 90% of a computer system's time is spent not computing but waiting for the memory system [19]. This delayed response of the memory system indicates that the performance of the whole system is influenced by the latency of the memory system. The memory system (on-chip or off-chip) also heavily dominates the power consumption (50-70% of the total system power [15, 20]) in the design of electronic systems, especially embedded systems. Moreover, memory occupies a significant fraction of the available die area in nearly all SoC designs, as reported in [21]. Obviously, optimisation of an SoC design under various design constraints must take the memory system into consideration. Overall, capturing the correct memory system behaviour is the most important factor not only for processor performance evaluations but also for determining the performance of the whole system. Since the application is known a priori in an embedded system, it is possible to customise the memory system for the specific application under the desired constraints.

1.2 SoC Design Issues for Battery-operated Devices

Recent trends, motivated by user preferences towards carrying less, have focused on portable and feature-rich devices. The rapid emergence of portable consumer devices (such as cell phones, pagers, MP3 players, PDAs, laptops, cameras, camcorders and portable GPS units) places greater emphasis on the power/energy consumption of embedded electronic systems.

Figure 1.3 shows the trends of power consumption and power requirements for portable SoCs over time. The widening gap between the power consumption trend and the power requirements trend shows that it is necessary to reduce the power consumption of SoC designs. Consequently, power/energy consumption has become one of the most critical design constraints for SoCs due to limited battery lifetime, heat dissipation, size constraints (which directly impact static power), and cost [22]. Low power/energy optimisations have been performed at different stages of the design flow: from the high level (associated with the functional/behavioural description) to RTL (including registers and logical operations), logic/gate level (with flip-flops and logic cells) and layout level (related to physical layout connections).

Figure 1.3: Power Consumption Trends for SoC, sourced from [27]

According to the authors of [23, 24], more power-saving opportunities exist at the high (system) level of design abstraction, since small changes at the lower levels require more effort to validate the design, and thus increase the complexity of optimisation.

Targeting energy-constrained, battery-equipped systems, energy (power consumption over a time period) minimisation is crucial. Even though energy efficiency is a dominant factor for battery-powered devices, performance and area footprint have to be considered as well and traded off as necessary. Design metrics such as the energy-delay product are sometimes used to evaluate alternative designs that are equivalent in total energy dissipation but differ in performance [25, 26]. The area metric has a huge impact on profit margins due to manufacturing costs, as well as a proportional effect on power consumption. Specifically, today's SoC design practices increasingly focus on the requirements of the end application and target markets, rather than just overall system optimisation.

1.3 Energy Awareness in DRAM Memory System

High storage density and low manufacturing costs are making DRAM (Dynamic Random Access Memory) a primary option for today's SoC memory design. This is especially true for data-hungry products such as multimedia-rich consumer SoCs, due to the ever-increasing demand for enormous amounts of data storage. Generally, the larger the DRAM size, the higher the power consumption of the memory system, and thus there is a growing need to improve DRAM energy efficiency. DRAM power can be categorised into two groups: static power, representing leakage current, and dynamic power, accounting for switching activity. Static power (also known as leakage power) is dissipated regardless of activity, whereas dynamic power (also known as switching power) is consumed when memory access patterns (such as read/write operations) are applied to DRAM. Since static power is always being dissipated, performance (execution time) improvement will lead to system-wide energy savings. Hence, the energy efficiency of the system can be enhanced either by improving energy-aware performance efficiency or by minimising the DRAM power consumption of its memory system [16]. Much research has recently been done to optimise the energy consumption of DRAM by minimising static/dynamic power consumption in various domains such as the compiler domain [28, 29], the operating system (OS) domain [30, 31], and the system architecture domain [32, 33].

Compiler-directed approaches statically analyse application code to detect memory idle periods in addition to data access patterns, in order to optimally place both code and data in DRAM. OS-level approaches, on the other hand, manage memory traffic at the kernel layer through page migration, power-aware page allocation and similar policies. Both compiler-level and OS-level DRAM power optimisation work at a much coarser granularity than the system architecture approach. The system architecture approach has greater control over finer-grained components to optimise the characteristics of the system memory such as performance, area, and energy consumption. As a result, many researchers have worked on system-level optimisation techniques to reduce DRAM energy consumption by exploiting the underlying memory architecture and utilising power-aware mechanisms. Examples of such DRAM system-level optimisation approaches include memory controller optimisation [34], memory-aware cache algorithm modification [35] and DRAM design scheme changes [33]. Memory controller optimisation consists of memory access scheduling [36], memory request reordering [37], power-saving mechanisms that utilise DRAM's different power modes [38], etc. Cache algorithm modifications typically design cache replacement policies with awareness of DRAM characteristics. Finally, DRAM design scheme changes include modifications to the microarchitecture of the DRAM chip itself. Additionally, tuning caches for a given application is also a good way to reduce power consumption, since the power consumption of the memory system is application dependent. All these techniques improve DRAM energy efficiency through energy-aware performance speedup or by explicitly controlling the reduction of DRAM power consumption.

1.4 Design Automation with SoC Design Challenges

The ever-shortening time-to-market is critical for success in today's SoC market. One of the key enablers for reducing time-to-market with efficient SoC design is implementing the system with a major advance in productivity. To improve SoC design productivity, many reusable hardware blocks or intellectual properties (IPs) are being utilised [39–41]. For example, the IBM Blue Logic Library [42] and CoreConnect Architecture [43] provide reusable designed cores and interconnection modules respectively, which enable SoC designers to rapidly realise a system. Even though the utilisation of reusable components can shorten the design cycle time to some extent, making the right choices of design components can be difficult. The right hardware components and the best interconnection scheme must be selected in consideration of functional and other constraints such as performance, power consumption, area cost, etc. Early design decisions are crucial in the SoC development process, since design errors can be costly and making changes later in the development cycle becomes more problematic and expensive. As a result of increased design complexity over the years, reduced price demands and rapid deployment, design automation is increasingly important for obtaining quantitative insights before deploying a design on the target.

Different types of prediction techniques (heuristic, analytical and simulation approaches) are applied in the design automation process [44]. Heuristic methods [45] use general guidelines or empirical studies to model the design process. Analytical approaches [46, 47] employ mathematical formulations to predict system behaviour. Finally, simulation systems [48] model the execution of the design activity to capture detailed system behaviour before the actual system is constructed. Functional and timing correctness can be ensured by means of simulation and modelling.

1.5 The Need for High Level Modelling

Most SoC designers use hardware modelling at the register transfer level (RTL), where the design is synthesised at the gate level [49]. However, a gate-level hardware implementation (RTL description) is often very large and its simulation requires a considerable amount of time. In the RTL design process, register requirements, signal integrity, chip size, pin assignments and other details must be identified. Additionally, various RTL design issues (such as concurrency, synchronisation and scheduling) make RTL modelling tedious and difficult. These factors increase the complexity of RTL modelling at an early analysis phase of the SoC design process. Thus, system level modelling (modelling at a higher level of abstraction in advance of RTL design) is adopted due to the demand for shortened design turnaround time, reduced time-to-market pressure, lower cost requirements, quick functional verification, and easy architectural simulation/exploration. Additionally, higher abstraction-level modelling allows verification to discover problems much earlier in the design process. Typically, designers begin by developing a performance model of the target architecture (system level), followed by the actual logic design (RTL). After that, the logic specifications are converted into circuits by a circuit designer, and the circuits are then positioned into the SoC floor plan by a layout engineer [48]. At the system level, the models are programmed in C/C++, whereas hardware designs use Verilog/VHDL languages. Basically, system level architecture simulation is generally used for system analysis to achieve the desired goal with the specified constraints by exploring different design alternatives in the early stages of the design cycle [50].

1.6 Simulation-assisted Analytical Modelling

Architecture simulators can be classified as functional simulators and timing/performance simulators. Functional simulators emphasise the functionalities of the modelled components, while timing/performance simulators accurately incorporate the timing features of the target modules in addition to their functionalities. Timing simulators model the target architecture on a cycle-by-cycle basis, and thus the accuracy of the model is higher in timing/performance simulations. Cycle-accurate simulations are extensively used due to the need for detailed study of the target model before building the prototype [51]. In terms of input sets, a simulator is classified as trace-driven (address traces of the execution are fed into the simulator) or execution-driven (the application is directly fed into the simulator). Trace-driven simulation is mostly used in memory system simulations, in which the trace, captured from a processor simulator or hardware device, is used as the input. Though trace-driven simulation is much simpler and easier to understand, it lacks the timing feedback information which is necessary to accurately predict the execution time. Therefore, cycle-accurate execution-driven simulation is needed to properly evaluate system performance [52]. However, the more detailed the model, the slower the simulation, especially for large-scale workloads and the exploration of large design spaces. In such cases, simulation-assisted analytical modelling is used to estimate the model accurately as well as to speed up the modelling time. Analytical models alone cannot capture the detailed system architecture, and hence the accuracy of an analytical model may not be sufficient for design decisions. In simulation-assisted analytical modelling, cycle-accurate simulation is necessary only for a partial design space, to expose information which is then utilised in the analytical model. Such a technique involves capturing the activity by running a few detailed cycle-accurate simulations and feeding the results to the analytical estimation technique. Examples of this technique are statistical regression based power modelling [41] and estimation of power consumption with an efficient power model [53].
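To make the simulation-assisted analytical modelling idea concrete, the sketch below fits a simple linear power model to a handful of cycle-accurate samples and then predicts power for configurations that were only run through a fast simulator. It is a minimal illustration only: the feature names (reads, writes, row activations, idle cycles), the sample values and the least-squares model are assumptions for this example, not the specific models used in [41] or [53].

import numpy as np

# Activity counters from a few slow cycle-accurate simulations
# (hypothetical features: reads, writes, row activations, idle cycles).
X = np.array([
    [1.2e6, 0.4e6, 0.20e6, 3.0e6],
    [0.8e6, 0.3e6, 0.15e6, 4.1e6],
    [2.1e6, 0.9e6, 0.35e6, 1.9e6],
    [1.6e6, 0.6e6, 0.28e6, 2.5e6],
    [0.5e6, 0.2e6, 0.09e6, 5.2e6],
    [1.9e6, 0.7e6, 0.31e6, 2.2e6],
])
# Average power (mW) reported by those same cycle-accurate runs (made-up values).
y = np.array([310.0, 265.0, 402.0, 351.0, 231.0, 384.0])

# Fit power ~ w0 + w1*reads + w2*writes + w3*activations + w4*idle by least squares.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_power_mw(reads, writes, activations, idle):
    """Estimate average power for a configuration explored only with a
    fast trace-driven simulator (no cycle-accurate run needed)."""
    return float(coeffs @ np.array([1.0, reads, writes, activations, idle]))

print(predict_power_mw(1.0e6, 0.5e6, 0.22e6, 3.2e6))

The same pattern (a few detailed simulations feeding an analytical predictor) underlies the estimators developed later in this thesis, although they use different model forms.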

1.7 Research Aims and Contributions

The aim of this thesis is to improve the energy efficiency of DRAM memory with highly accurate estimation techniques and a design methodology that quickly explore a large design space of last-level cache configurations. The key aspects of this thesis are:

1. Interface Abstraction Layer for High Level System Simulation. The complexity of SoC designs has increased drastically over time, which requires early testing and verification of the system in order to reduce time-to-market windows. However, testing the system directly on a physical prototype at an early stage of design is not a good option, due to the high cost and time involved in prototyping. Hence, there is a need for software simulation paradigms to explore a wide range of design alternatives rapidly and in a cost-effective manner at a very early design stage. These can substantially reduce the time-to-market pressure for building a system with the rich features

of functionality requirements. To simulate a processor with a memory system, a two-step approach is typically applied, where memory traces are captured with the processor simulator in the first step and processed with the memory simulator in the second step. In the trace generation (first step), a fixed memory latency and fixed memory power consumption are used for all memory requests, since the processor simulator has no knowledge of them. Using fixed memory latency and power consumption degrades the correctness of the memory simulation (which in turn affects the correctness of the whole system), because memory latency and power consumption vary from one memory request to another based on the state and configuration settings of the memory module. Thus, we describe a generic methodology for implementing a combined simulator which contains both the processor and memory models in order to provide accurate timing and memory power information. We show how an interface component can be added to communicate between a processor simulator and a memory simulator.

2. Performance/Energy Estimation of the System including Last-level Cache. Exploiting a last-level cache is a well-known technique for improving the performance of the system as well as its energy efficiency, by reducing expensive DRAM traffic. Cache exploration with slow cycle-accurate simulation is a very time-consuming process, since the design space of last-level cache configurations is typically very large (reaching up to 330 configurations per application in our experiments). An alternative is to use fast trace-driven simulation to produce cache hit and miss statistics for all the cache configurations, which are then used with an analytical model to estimate execution time and power consumption. However, cache statistics alone do not contain sufficient timing information for accurate estimation. Hence, we propose a simple, absolutely accurate execution time estimator and a simple, reasonably accurate energy estimator for a system whose processor module

uses a fixed-latency and fixed-power-consumption memory model. Additionally, we provide an exploration framework to quickly estimate the execution time and energy consumption of the system with distinct cache configurations, using a minimal number of slow full-system cycle-accurate simulations, by combining a cycle-accurate simulator, a trace-driven cache simulator and our novel execution time and energy estimators.

3. Efficient DRAM Power Mode Control with Last-level Cache Exploitation. With the availability of computationally complex embedded systems, the demand for larger DRAM sizes in these systems is growing. Along with increased DRAM sizes, memory power consumption is a major concern, especially for power-hungry portable devices. Although modern DDRx (DDR, DDR2, DDR3, etc.) DRAM devices include an internal default power management mechanism, it controls the power consumption of DRAM by setting the DRAM device into a low power consuming mode after the processing of each memory request. In this default mechanism, a high power mode switching overhead is incurred due to the consecutive power mode switching activities, since the DRAM device is required to switch back to the high power consuming mode (the active power mode) before processing the next memory request. We study a DRAM power mode control mechanism which sets the DRAM device into the lowest power consuming mode only after a predefined threshold period of DRAM idle time, so that the power mode switching overhead can be reduced. In order to create enough DRAM idle time to gain energy reduction benefits, the last-level cache is utilised to keep the DRAM device idle while memory requests are being satisfied from the last-level cache. We propose a DRAM power control algorithm to manage the transitions between the DRAM active state (for read/write operations) and the lowest power consuming state (a minimal sketch of such a threshold-based controller is shown after this list).

4. DRAM Energy Reduction Estimation Using a Regression Analysis Energy Model. DRAM energy consumption can be significantly reduced with last-level cache exploitation and our proposed efficient power mode control technique. However, the amount of energy reduction varies from one last-level cache configuration to another, as well as across target applications. An estimation technique is applied in order to determine the positive/negative energy reduction for a particular last-level cache configuration of a specific target application. One of the estimation methods, regression analysis, is capable of capturing the relationship between the energy reduction and certain DRAM parameters such as DRAM memory traffic, DRAM power mode switching time, DRAM idle time, etc. Our proposed highly accurate energy reduction estimator uses the regression-analysis-based Kriging model, which considers the spatial correlation between the current design point and the initial training data. The Kriging-based energy estimation model is built with training data collected from a small number of cycle-accurate simulations for a certain set of cache configurations.

5. Rapid Design Space Exploration of Last-level Cache Configurations for Maximum DRAM Energy Reduction. The energy reduction of the DRAM memory while utilising the last-level cache can be accurately determined with our proposed DRAM energy estimator. Some cache configurations for a specific application do not gain a significant DRAM energy reduction. Thus, exploration of all the last-level cache configurations is necessary to obtain the cache configuration which gives the maximum DRAM energy reduction. Such exploration of a large design space of last-level cache configurations is an exhaustive search, which would be a very time-consuming process. To solve this problem, we propose a methodology which utilises a one-time slow cycle-accurate simulation and fast trace-driven cache simulations of all the cache configurations, together with our novel DRAM energy reduction estimator. Our proposed design flow does not require cycle-accurate simulations

of all the cache configurations for computation of the estimator parameters, and thus enables fast exploration of last-level cache configurations.
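The following sketch illustrates the kind of threshold-based power mode controller described in contribution 3 above. It is a simplified, hypothetical model: the power mode names, the idle-cycle threshold and the transition bookkeeping are illustrative assumptions, not the exact controller implemented in this thesis.

# Hypothetical threshold-based DRAM power mode controller (illustrative only).
ACTIVE, POWER_DOWN = "ACTIVE", "POWER_DOWN"

class PowerModeController:
    def __init__(self, idle_threshold_cycles):
        self.idle_threshold = idle_threshold_cycles  # wait this long before powering down
        self.mode = ACTIVE
        self.idle_cycles = 0
        self.switch_count = 0  # number of mode transitions (a proxy for switching overhead)

    def tick(self, dram_busy):
        """Call once per DRAM clock cycle; dram_busy is True while a request is in flight."""
        if dram_busy:
            if self.mode == POWER_DOWN:
                self.mode = ACTIVE        # must wake up before servicing the request
                self.switch_count += 1
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
            # Enter the low power mode only after a sufficiently long idle period,
            # so that short gaps between requests do not cause switching overhead.
            if self.mode == ACTIVE and self.idle_cycles >= self.idle_threshold:
                self.mode = POWER_DOWN
                self.switch_count += 1

# Example: requests served by the last-level cache leave DRAM idle long enough
# for the controller to power the device down.
pmc = PowerModeController(idle_threshold_cycles=100)
for cycle in range(500):
    pmc.tick(dram_busy=(cycle < 50))  # a busy burst, then a long idle stretch
print(pmc.mode, pmc.switch_count)

Compared with the default per-request power-down mechanism, such a controller trades a small amount of idle-time energy for far fewer mode transitions, which is the behaviour studied in Chapter 6.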

1.8 Thesis Overview

The remainder of the thesis is organised as follows. Chapter 2 presents DRAM background information, terminology of the DRAM architecture, DRAM system organisation and information about the specific DRAM type that this thesis focuses on. Chapter 3 conducts a detailed literature survey of simulation frameworks used to estimate the performance and energy consumption of DRAM memory. Chapter 3 also reviews DRAM power/energy management mechanisms and performance/energy estimation techniques for improving DRAM energy efficiency. Chapter 4 elaborates on the cycle-accurate processor-memory simulator. Chapter 5 presents the rapid exploration of the last-level cache together with the accurate estimation of the performance and energy consumption of the system. Chapter 6 describes the efficient power mode controller used to explicitly control the DRAM power consumption and shows the effects of last-level cache configurations on the DRAM energy savings. Chapter 7 provides the DRAM energy reduction estimator and a design flow to rapidly explore for the last-level cache configuration with maximum DRAM energy reduction. Chapter 8 concludes the thesis by describing the overall contribution of the research, followed by directions for future research.

Chapter 2

DRAM Background

2.1 Main Memory System Types

Many types of memory devices, such as Random Access Memory (RAM), Read-Only Memory (ROM) and flash memory, are available in modern computer systems. Amongst them, RAM is mainly used for the main memory. RAM devices can be further classified as static RAM (SRAM) and dynamic RAM (DRAM). The main difference between SRAM and DRAM is the lifetime of the data that they store. The data inside SRAM remains as long as power is provided. Additionally, DRAM density is higher than that of SRAM, and DRAM is cheaper. As such, DRAM is mainly used for main memory storage and SRAM is used as cache for speed. Synchronous DRAM (SDRAM) is a type of DRAM which has a synchronous interface (waiting for a clock signal before responding to control inputs) to the system bus. Newer generations of SDRAM are DDR SDRAM (Double Data Rate SDRAM), DDR2 SDRAM and DDR3 SDRAM, with higher transfer rates and lower power consumption from one generation to the next. DDR3 SDRAM is the state-of-the-art memory device and all the experiments in this thesis are based on DDR3 SDRAM. The evolution of the DRAM architecture is illustrated in Figure 2.1.


[Figure 2.1 summarises the evolution: conventional DRAM; SDRAM (3.3 V, max. transfer rate 200 Mbps); DDR SDRAM (double data rate, 2.5 V, max. 400 Mbps); DDR2 SDRAM (1.8 V, max. 800 Mbps); DDR3 SDRAM (1.5 V, max. 1600 Mbps).]

Figure 2.1: Evolution of DRAM Architecture from Conventional DRAM through the state-of-the-art DDR3 DRAM.

2.2 Overview of DRAM

The basic DRAM memory cell is a single-transistor cell with a small capacitor. The content of the memory cell is indicated by the presence or absence of charge on the capacitor. The two main types of DRAM are asynchronous DRAM (conventional DRAM) and synchronous DRAM (SDRAM). The operations of asynchronous DRAM are not based on a clock, while those of SDRAM are driven by a clock.

The structure of DRAM (whether conventional DRAM or SDRAM) is illustrated in Figure 2.2. The DRAM device consists of one or more ranks, and each rank is composed of one or more DRAM chips. A DRAM chip is formed from a number of banks, each of which has a separate row buffer (sense amplifier, also called the DRAM page). If the required data comes from different banks, the multiple row buffers of the different banks can be accessed independently. The status of the row buffer is classified as close page (if there is no data inside the row buffer) or open page (if the row buffer is occupied with data from one of the DRAM rows). Depending on the status of the row buffer and the location of the requested data, the DRAM access latency and power consumption differ from one memory operation to another.

Figure 2.2: DRAM Organisation and Terminology

2.3 DRAM Execution Time and Power Consumption

There are three sequential steps needed to access memory data: row precharging, row activation and column access. Row activation occurs when a particular row of data from the bank needs to be brought into the row buffer (because the row buffer is in the close status) before the column data can be accessed. If data from another row of the bank needs to be accessed, the data in the row buffer has to be written back to its DRAM location before another data row can be activated (the row precharging process), because of the conflict between the requested data and the existing data in the row buffer, even though the row buffer is in the open page status. Thus, a memory request can result in a row hit, a row conflict or a row closed access. The DRAM latency effect and the DRAM command sequence generated for each row buffer status are shown in Figure 2.3.

[Figure 2.3 summarises the three cases: (a) row buffer closed, latency = tRAS + tCAS (commands RAS, CAS); (b) row buffer open with a row hit, latency = tCAS (command CAS); (c) row buffer conflict, latency = tRP + tRAS + tCAS (commands PRE, RAS, CAS). Here tRAS is the row access latency, tCAS the column access latency and tRP the row precharge latency.]

Figure 2.3: Different Row Buffer Policies of DRAM with Latency Effect

In Figure 2.3(a), both the row access latency (row activation) and the column access latency are incurred because the row buffer is closed (row closed). Only the column access latency is incurred in Figure 2.3(b), because the row buffer is open and the requested data hits inside the row buffer (row hit). In contrast, the combination of the row precharge latency, the row access latency and the column access latency is required in Figure 2.3(c), due to the data conflict in the open row buffer (row conflict). As row hits require only column accesses, they have the shortest latency of the three row buffer situations described.
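As a concrete illustration of the three cases above, the small sketch below computes the access latency from the row buffer state. The timing values are placeholders chosen only to make the example runnable; real values come from the device datasheet (e.g., Micron DDR3 [1]).

# Illustrative DRAM access latency model based on row buffer state.
# Timing parameters (in memory clock cycles) are placeholder values.
T_RP, T_RAS, T_CAS = 9, 24, 9  # row precharge, row access, column access

def access_latency(open_row, requested_row):
    """open_row is the row currently held in the bank's row buffer
    (None if the row buffer is closed)."""
    if open_row is None:                 # row closed: activate, then read the column
        return T_RAS + T_CAS
    if open_row == requested_row:        # row hit: column access only
        return T_CAS
    return T_RP + T_RAS + T_CAS          # row conflict: precharge, activate, read

print(access_latency(None, 7))   # 33 cycles (row closed)
print(access_latency(7, 7))      # 9 cycles  (row hit)
print(access_latency(3, 7))      # 42 cycles (row conflict)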

The power consumption of the DRAM device is based on the power defined in the device specification (specification power), the command scheduling conditions (scheduled power) and the actual operating VDD and clock frequency (system power) [54]. First, the specification power of each subcomponent is calculated from the predefined specifications. After that, the specification power is derated based on the command scheduling conditions. Finally, the system power is calculated by derating the scheduled power to the system's actual voltage and frequency. The total system power is the sum of the system power of the background power, activation power, read power, write power, I/O termination power and refresh power. The details of the power calculation for DDR3 DRAM can be obtained from Micron's system power calculation documentation [54]. Estimating accurate power for a specific application is essential, as the expected power consumption affects how much heat is produced by the circuit and what kind of power supply it requires. For example, even for the same application, a device configuration with a larger row size will consume more current than one with a smaller row size, since each row activation draws significantly more current.
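The sketch below shows the general shape of such a calculation: each component's scheduled power is derated to the operating voltage and frequency and the components are summed into a total system power. The component values, the quadratic voltage scaling and the linear frequency scaling are illustrative assumptions made only for this example; the authoritative procedure is Micron's power calculation [54].

# Illustrative DDR3 system power roll-up (assumed derating model, placeholder numbers).
SPEC_VDD, SPEC_FREQ_MHZ = 1.5, 800.0   # specification operating point

# Scheduled power per component in mW (already adjusted for command scheduling).
scheduled_mw = {
    "background": 60.0,
    "activation": 110.0,
    "read": 70.0,
    "write": 55.0,
    "io_termination": 45.0,
    "refresh": 20.0,
}

def system_power_mw(actual_vdd, actual_freq_mhz):
    """Derate scheduled power to the actual voltage and frequency, then sum.
    Assumes power scales with (VDD/VDD_spec)^2 and linearly with frequency."""
    v_scale = (actual_vdd / SPEC_VDD) ** 2
    f_scale = actual_freq_mhz / SPEC_FREQ_MHZ
    return sum(p * v_scale * f_scale for p in scheduled_mw.values())

total_mw = system_power_mw(actual_vdd=1.35, actual_freq_mhz=667.0)
energy_mj = total_mw * 0.5   # power (mW) x time (s) = energy (mJ) over a 0.5 s run
print(round(total_mw, 1), "mW,", round(energy_mj, 1), "mJ")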

2.4 DRAM Controller Organisation

The memory controller serves as a mediator between the processor(s) and the DRAM device(s). A channel connects the memory controller and the DRAM device (as shown in Figure 2.2), while the system bus is used between the processor and the memory controller. The DRAM controller manages the movement of data into and out of the DRAM device while ensuring that the timing and resource constraints of the DRAM device are met. Figure 2.4 illustrates the abstract DRAM controller, whose typical operations include transaction scheduling, address translation (DRAM address mapping), command scheduling and DRAM row buffer/bank management.

Figure 2.4: Overview of Memory Controller Design.

The memory controller performs transaction scheduling for the incoming requests (from one or more processors and one or more I/O devices). In the transaction scheduling stage, a transaction ordering policy (First Come First Serve, Priority, Read over Write, etc. [36]) is applied in order to reduce latency and improve memory bandwidth utilisation. The next stage is address translation, in which the arbiter interface converts the physical address to the logical address (the DRAM-specific location in terms of channel ID, rank ID, bank ID, row ID and column ID). After the address mapping is performed, the request transaction is placed in the specific bank queue according to the bank ID of that transaction. An example of memory address translation (a physical address such as 0x40000004 being decomposed into channel, rank, bank, row and column) is shown in Figure 2.5.

Figure 2.5: DRAM Address Mapping

Moreover, each memory request transaction is scheduled as a sequence of DRAM commands (command scheduling) in order to access the physical DRAM device. Additionally, the memory controller manages the operation of the DRAM page (row buffer) based on a predefined row buffer management policy (typically an open page or close page policy). In the open page policy, the row buffer remains open after a memory request has been processed in order to maximise row buffer hits. In the close page policy, the row buffer is closed after the processing of each memory request. The detailed flow of a typical memory controller is illustrated in Figure 2.6. Incoming memory requests are queued in the bus interface queue of the memory controller. The memory controller then processes the memory requests from the bus interface queue, going through the transaction scheduling, address translation, DRAM command scheduling, and DRAM row buffer and bank management stages. When the processing of a memory request is finished, the memory controller notifies the caller that the transaction is complete.
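The bit-slicing below sketches how a physical address can be decomposed into the DRAM coordinates described above. The field widths (1 channel, 1 rank, 8 banks, 8K rows, 1K columns of 4 bytes) and the assumed DRAM base address are hypothetical values chosen only to make the example self-contained; a real controller derives them from the device configuration and its address mapping policy.

# Illustrative physical-to-DRAM address mapping (hypothetical layout).
# Assumed memory map: DRAM starts at 0x40000000; 32-bit bus (4-byte columns),
# 1K columns, 8 banks, 8K rows, 1 rank, 1 channel.
DRAM_BASE   = 0x40000000
OFFSET_BITS = 2    # 4 bytes per column
COL_BITS    = 10   # 1K columns
BANK_BITS   = 3    # 8 banks
ROW_BITS    = 13   # 8K rows

def map_address(paddr):
    addr = (paddr - DRAM_BASE) >> OFFSET_BITS
    col  = addr & ((1 << COL_BITS) - 1);  addr >>= COL_BITS
    bank = addr & ((1 << BANK_BITS) - 1); addr >>= BANK_BITS
    row  = addr & ((1 << ROW_BITS) - 1)
    return {"channel": 0, "rank": 0, "bank": bank, "row": row, "column": col}

print(map_address(0x40000004))
# {'channel': 0, 'rank': 0, 'bank': 0, 'row': 0, 'column': 1}

Different mapping policies simply permute which address bits feed the bank, row and column fields, which changes how consecutive requests spread across banks and rows.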

2.5 DDR3 SDRAM

The DDR3 SDRAM uses a double data rate architecture to achieve high performance operation. It uses an 8n-prefetch architecture: for every 8n-bit-wide access at the internal memory core, eight corresponding n-bit-wide data transfers occur at the interface I/O pins. The DRAM device is internally configured as an 8-bank DRAM, and read/write accesses to the DDR3 SDRAM are burst-oriented, meaning that accesses start at a selected location and continue for a programmed number of locations in a programmed sequence. The DDR3 SDRAM allows a pipelined, multibank mechanism for concurrent operations, which can provide high bandwidth by hiding the time of row precharging and activation.
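To make the double data rate and 8n-prefetch ideas concrete, the arithmetic below computes the data moved per burst and the peak bandwidth for an assumed DDR3 interface (the 800 MHz bus clock and 32-bit bus width are illustrative values, not a statement about the devices used later in this thesis).

# Illustrative DDR3 throughput arithmetic (assumed interface parameters).
bus_clock_mhz       = 800   # I/O bus clock
transfers_per_clock = 2     # double data rate: data on both clock edges
bus_width_bits      = 32    # external data bus width
burst_length        = 8     # 8n-prefetch: eight n-bit transfers per access

data_rate_mtps  = bus_clock_mhz * transfers_per_clock     # mega-transfers per second
bytes_per_burst = burst_length * bus_width_bits // 8      # data moved per burst
peak_bw_mb_s    = data_rate_mtps * bus_width_bits // 8    # peak bandwidth

print(data_rate_mtps, "MT/s;", bytes_per_burst, "bytes/burst;", peak_bw_mb_s, "MB/s")
# 1600 MT/s; 32 bytes/burst; 6400 MB/s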

[Figure 2.6 shows a memory request entering the bus interface/transaction queue, passing through transaction scheduling (round robin, priority, etc.), physical-to-DRAM address mapping (mapping policy), per-bank command queues (Bank 0 to Bank N) with bank and buffer management, and command scheduling, after which DRAM commands are issued and the completed transaction is returned to the caller.]

Figure 2.6: Detailed Memory Controller Design.

2.5.1 DDR3 Device Configuration

DDR3 SDRAM [1] is a high speed dynamic random access memory. The number of rows and the number of columns differ according to the internal data bus width. Table 2.1 describes the three DDR3 device configurations (256M x 4, 128M x 8 and 64M x 16) provided by Micron [1] for 1Gb DDR3 devices. In each device configuration description (256M x 4, 128M x 8 or 64M x 16), the first part (256M, 128M or 64M) represents the DRAM depth (the number of addressable locations) while the second part (4, 8 or 16) represents the width of the data bus (the number of bits in one column), so that the total DRAM size for each configuration is 1Gb. In the 256M x 4 DDR3 1Gb device configuration, the total number of rows is 16K (16384 rows), so 14 address pins are needed to select the row address. As the total number of columns is 2K (2048 columns), 11 address bits are needed to select the column location. As the 128M x 8 configuration contains 16384 rows and 1024 columns, it needs 14 address lines for row access and 10 address lines for column access.
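As a quick consistency check on the 256M x 4 organisation quoted above (this arithmetic uses only the figures given in the text), the per-chip capacity is

\[ 8\ \text{banks} \times 2^{14}\ \text{rows} \times 2^{11}\ \text{columns} \times 4\ \text{bits} = 2^{30}\ \text{bits} = 1\ \text{Gb}, \]

and correspondingly \( \log_2 16384 = 14 \) row address bits and \( \log_2 2048 = 11 \) column address bits are required.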

Table 2.1: DDR3 1 Gb Device Configuration from Micron [1]

    Device Configuration     256M x 4       128M x 8      64M x 16
    No. of Banks             8              8             8
    No. of Rows              16K [13:0]     16K [13:0]    8K [12:0]
    No. of Columns           2K [11,9:0]    1K [9:0]      1K [9:0]
    Data Bus Width (bits)    4              8             16

In the 64M x 16 device format, however, the total number of rows is only 8K (8192 rows) and the total number of columns is 1K (1024 columns). For all the experiments in this thesis, a 256MB DRAM and a 4GB DRAM, each with a 32-bit bus interface, are built using Micron's 64M x 16 and 256M x 4 DDR3 1Gb device configurations respectively. Table 2.2 shows the configurations for the 256MB and 4GB DRAM sizes utilised in this thesis. The 256MB DRAM is composed of two 1Gb (64M x 16) chips forming one rank on one channel. The 4GB DRAM uses eight 1Gb (256M x 4) chips per rank, with four ranks on one channel.

Table 2.2: 256MB and 4GB DDR3 DRAM Device Configuration

    DRAM Size              256MB         4GB
    DRAM chips per rank    2             8
    No. of Ranks           1             4
    No. of Channels        1             1
    DRAM configuration     (64M x 16)    (256M x 4)

Chapter 3

Literature Review

3.1 Introduction

Memory power/energy consumption has become a major concern in portable embedded SoC designs, since the memory system is a major consumer of energy and therefore a key determinant of battery life. Analysing the power/energy consumption of the memory system only at the final design stage, in order to meet a battery lifetime target, is very costly and is not a sound approach to system design. As such, early design estimation is necessary, and simulation is the best approach to facilitate design and functional validation against the target specification before real hardware becomes available. Furthermore, simulation is also the most widely used method to evaluate memory system performance. This chapter gives a broad overview of the different types of simulation approaches used to generate detailed and accurate memory statistics. A discussion then follows on existing DRAM power/energy management techniques. This chapter also describes estimation techniques for the performance and/or energy savings of the DRAM memory system.


3.2 High Level System Simulation

System architecture simulators are indispensable tools for performance evaluation, both in design space exploration and in computer system design activities such as software development and optimisation. Architecture simulators can be classified from three perspectives: scope, detail and input. From the scope perspective, a simulator is named according to which component/module it models, for example a processor, memory or I/O module. The detail perspective considers how the simulator models the target throughout the simulation (functionally, or with accurate timing). The last factor, the input perspective, considers how the simulator is driven (by a sequence of memory references from an application, or by the application itself). This taxonomy of architecture simulators is illustrated in Figure 3.1, and more details are given in the following paragraphs.

3.2.1 Instruction Set Simulation

There are several types of processor, memory, and system simulators. These can be divided into functional and timing (or performance) simulators. Functional simulators test only the functionality of the simulated module, while timing/performance simulators take the timing behaviour of the target into account in addition to its functionality. Basically, a functional simulator does not model the real architecture on each individual processor cycle. Thus, a functional simulator cannot be used to measure the performance of the system, although it is useful for functional tests, module verification, and obtaining basic information such as the number of executed instructions, the frequency of branch instructions, etc. The most popular functional simulators are Sim-fast [55], Sim-safe [55], and SimCore [56]. Sim-fast and Sim-safe are optimised subset simulators of the SimpleScalar toolset [55] intended to speed up simulation. They perform only functional simulation and do not account for the behaviour of the microarchitecture. The difference between Sim-safe and Sim-fast is that the former checks the correct alignment and the access permissions of each memory reference, while the latter does not. SimCore [56] is derived from the SimAlpha [57] system simulator and offers high readability and high-speed execution. SimCore has almost all the same functions as Sim-fast with improved speed, though the target architecture of SimCore is fixed to the Alpha processor.

[Figure 3.1: A Taxonomy of Architecture Simulation Tools from Different Aspects. Simulators are classified by scope (processor, memory, or full system simulator), by detail (functional or timing, cycle-accurate or non-cycle-accurate), and by input (trace-driven or execution-driven).]

Timing (or performance) simulators, on the other hand, measure the performance of the system by keeping track of individual clock cycles. Timing simulators can further be divided into cycle-accurate and non-cycle-accurate simulators. Examples of cycle-accurate processor simulators include FaCSim [58] for the ARM926EJ-S core, PTLSim [59] for x86-64 architectures, the Xtensa ISS for Tensilica platforms [60], the RealView ARMulator ISS [61] for ARM architectures, and the C64x+ CPU simulator for the Texas Instruments platform [62]. Cycle-accurate processor simulators involve full modelling of the microarchitecture on a cycle-by-cycle basis. Non-cycle-accurate processor simulators (also known as instruction set simulators [63]) are represented by Sim-outorder [55] for the SimpleScalar architecture (a close derivative of the MIPS architecture [64], which can also emulate the Alpha, PISA, ARM and x86 instruction sets), and Sim-Alpha [65] for the Alpha instruction set platform [66]. Sim-outorder from the SimpleScalar tool is a widely used instruction-level performance simulator with detailed microarchitecture implementations, including branch prediction, caches, speculative execution support, ALUs, and external memory. Sim-Alpha [65] is an extension of the SimpleScalar tool targeting the Alpha 21264 microarchitecture. The strength of Sim-Alpha (as compared to other simulators such as SimpleScalar and SimOS) lies in its validation against actual hardware. FaCSim accurately models an ARM926EJ-S based embedded system with an interpretive simulation technique (it computes elapsed cycles and incrementally adds them to advance the core clock instead of performing cycle-by-cycle simulation) in order to achieve flexibility while maintaining high speed. FaCSim is divided into a functional front-end and a cycle-accurate back-end that run in parallel by exploiting functional/timing model decoupling techniques [67]. Most simulators have been built to simulate only one fixed architecture. In order to expose various architectures in a processor simulation environment, CPU Sim [68] was designed to facilitate hands-on experience in a learning environment. In CPU Sim, the detailed implementations of microinstructions, machine instructions, and the instructions of the target architecture need to be specified at the register-transfer level (RTL). MikroSim [69] is another processor simulation tool for educational use. Similar to CPU Sim, the MikroSim approach is based on implementation at the RTL, and its ease of use is the main attraction of MikroSim.

3.2.2 Memory Simulation

Several simulators, such as Sim-Cache [55], Sim-Cheetah [55], LDA [70], DCMSim [71], CacheSim [72], MSCSim [73], Dinero [74], and SuSeSim [75], are known in the context of memory simulation. Sim-Cache and Sim-Cheetah are from the SimpleScalar simulation suite and generate cache statistics/profiles for a single cache configuration and for multiple cache configurations in a single program execution, respectively. Some memory simulators, such as MSCSim, DCMSim, and CacheSim, are proposed for didactic purposes. MSCSim (Multilevel and Split Cache Simulator) [73] is a memory hierarchy simulation tool whose main feature is the ability to simulate the behaviour of various memory components, such as unified caches, split caches, multi-level caches, and virtual memory. DCMSim (Didactic Cache Memory Simulator) [71] serves as a learning aid by providing detailed step-by-step execution with visualisation in its interface. Additionally, DCMSim follows an object-oriented design, so that the objects representing the logic blocks of cache memory systems can be reused if needed. The CacheSim simulator [72] has features similar to DCMSim, but CacheSim provides a fully graphical interface and a modular implementation instead of an object-oriented approach. The latency of data access varies from one memory level to another in the memory hierarchy. Thus, the Latency-of-Data-Access (LDA) model [70] was developed to take into account multiple hierarchical memory levels with different latencies and cost models. Moreover, LDA incorporates contention analysis based on queuing theory or event-driven simulation, while considering the latency of the memory accesses of concurrent tasks. Cache behaviour is also modelled by the well-known Dinero cache simulator [74], which supports the sub-block placement technique [76] (splitting blocks into smaller units of transfer) in order to reduce the miss penalty. The ACME simulator [77] exploits adaptive caching techniques to manage cache replacement policies, which can further improve hit rates. Most cache simulators, including Dinero, are multi-pass simulators; that is, the simulator cannot process multiple cache configurations in a single program execution (the input sequence has to be read for each and every cache configuration). The method of Janapsatya [78], the CRCB algorithm [79] and SuSeSim [80] are single-pass cache simulators (the input needs to be read only once) which can process multiple cache configurations at one time. The purpose of simulating multiple cache configurations simultaneously is to reduce the cache simulation time. These cache simulators are neither functional simulators nor timing simulators. Moreover, they examine cache behaviour only and do not provide timing simulation of the memory system. Alternatively, Cacti [81] is the most widely used design tool for modelling the access time and power consumption of caches and other memories. Though Cacti can provide some statistics for DRAM memory, it does not accurately simulate the timing behaviour of a DRAM system. Detailed timing simulation of memory and cache behaviour is also provided by Ruby [82] (from the GEMS toolset [82]).
Though Ruby provides a flexible infrastructure capable of accurately simulating a wide variety of cache-coherent memory systems, Ruby does not faithfully simulate DRAM behaviour [83]. Thus, the authors of [84] proposed a detailed and cycle-accurate DRAM simulator known as DRAMsim, which uses accurate power and timing models to provide useful power consumption and delay information for the DRAM memory.
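To make the multi-pass versus single-pass distinction discussed above concrete, the following minimal sketch (in C++, with hypothetical class and function names; it is not the algorithm of [78-80], which rely on more sophisticated data structures) evaluates several direct-mapped cache configurations in a single pass over an address trace read from standard input:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // One direct-mapped cache model: tags only, since hit/miss counting needs no data.
    struct DirectMappedCache {
        uint32_t line_bits, set_bits;          // log2(line size), log2(number of sets)
        std::vector<uint64_t> tags;
        std::vector<bool> valid;
        uint64_t hits = 0, misses = 0;

        DirectMappedCache(uint32_t lb, uint32_t sb)
            : line_bits(lb), set_bits(sb), tags(1u << sb, 0), valid(1u << sb, false) {}

        void access(uint64_t addr) {
            uint64_t set = (addr >> line_bits) & ((1u << set_bits) - 1);
            uint64_t tag = addr >> (line_bits + set_bits);
            if (valid[set] && tags[set] == tag) { ++hits; }
            else { ++misses; valid[set] = true; tags[set] = tag; }
        }
    };

    int main() {
        // Candidate configurations: (line size, number of sets) pairs.
        std::vector<DirectMappedCache> configs;
        for (uint32_t lb = 4; lb <= 6; ++lb)       // 16B, 32B, 64B lines
            for (uint32_t sb = 6; sb <= 10; ++sb)  // 64 to 1024 sets
                configs.emplace_back(lb, sb);

        // Single pass: every trace address updates every candidate configuration.
        uint64_t addr;
        while (std::cin >> std::hex >> addr)
            for (auto &c : configs) c.access(addr);

        for (auto &c : configs) {
            uint64_t total = c.hits + c.misses;
            std::cout << (1u << c.line_bits) << "B x " << (1u << c.set_bits)
                      << " sets: miss rate "
                      << (total ? double(c.misses) / double(total) : 0.0) << "\n";
        }
    }

A multi-pass simulator would instead re-read the whole trace once per configuration, which is where the single-pass approaches gain their speed.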

3.2.3 System Simulation

The behaviour of the processor depends on the memory system, and the behaviour of the memory system depends on the processor, due to the interactions between these two components. Hence, a system simulator becomes necessary to capture the timing-dependent effects between them. Several full system simulators have been previously proposed, such as SimOS [85, 86], Simics [87], and Mambo [88]. SimOS (implemented for SPARC and MIPS based machines) simulates the computer system using the services provided by the underlying operating system (OS). Examples of such OS services include the OS's process abstraction and signals to simulate a CPU, file mapping operations to simulate the memory management unit, and separate processes to simulate I/O devices [85, 86]. SimOS models the complete hardware of the machine, and thus can be used for studying operating systems or realistic workloads. Simics, on the other hand, is a multi-processor full system simulator for various ISAs such as SPARC, Alpha, x86, PowerPC, MIPS, and ARM. The Simics Central tool coordinates the connection of machines with heterogeneous processors (different processor types). New device models can easily be plugged into the Simics simulator due to its extensibility. The SimFlex simulation tool [89] has been built on the Virtutech Simics system simulator in order to provide fast, accurate and flexible simulation of large-scale systems. SimFlex uses well-defined component interfaces for flexibility and statistical sampling theory [90] (estimating performance without executing the complete program) to reduce simulation time while achieving high accuracy. Another system simulator based on the Simics functional simulator is FeS2, which adds timing simulation on top of Simics. FeS2 targets the x86 platform and its memory model leverages GEMS's Ruby memory timing model. Mambo [88] is also a full system simulator, specific to PowerPC based systems. Simics and Mambo provide functional simulation to quickly simulate the system as well as timing model simulation to obtain highly accurate results. AMD's SimNow simulator [91] is a fast and configurable simulator of x86 and AMD64 platforms for AMD's family of processors. A SimNow simulation can be saved at any point to a media file, from which the simulation can be re-run at a later time. Based on the functional SimNow simulator, AMD and HP jointly developed a full system simulation infrastructure called COTSon [92] to model the complete software stack together with complete hardware models. All of these full system simulators focus more on functional behaviour and do not attempt to calculate accurate timing information and power consumption of the memory system. To enable timing modelling in system simulation, the GEMS toolset [82] (which uses the Simics functional simulator as its foundation) was released with a detailed microarchitectural processor timing model, Opal, and a memory system timing simulator, Ruby. Although GEMS decouples simulation functionality and timing events in simulator development, the memory system of GEMS, Ruby, does not model the detailed features of the DRAM module. Another simulator, called M5 [93], has been designed to include all the features of SimOS and GEMS plus detailed network I/O and models of multiple networked systems. M5 provides a highly configurable object-oriented simulation framework supporting the Alpha, SPARC, and MIPS ISAs.
Since M5 does not support an accurate memory timing model, the gem5 full system simulator [94] was developed by merging the best aspects of the M5 and GEMS simulators. The key feature of the gem5 simulator is its flexibility in choosing the CPU model (simple functional, in-order, out-of-order, or timing processor model), the system mode (user-level mode, or full system mode with both user-level and OS kernel-level services), and the memory system (M5's fast and simple memory system or GEMS's Ruby memory model) according to the specific simulation requirements. The gem5 simulator supports most commercial ISAs, such as ARM, Alpha, MIPS, Power, SPARC, and x86.

3.2.4 The Need for Execution-Driven System Simulation

In terms of the input to the simulator, simulation can be trace-driven [95] or execution-driven [96]. Trace-driven simulation uses a simulator whose input is a memory trace (a sequence of memory references). The input memory trace can be collected by hardware-based snooping (for example, DragonHead [97]), microcode modification (modifying the machine-level code to write the addresses of all memory references into a reserved area), an instruction-set emulator/processor simulator (modifying the emulator/simulator so that an application generates address traces), static code annotation/software-based instrumentation (modifying the program to emit a memory trace, for example with PIN [98]), or single-step execution at the operating system level (recording memory references while stepping through a program one instruction at a time) [99]. In execution-driven simulation, the input application is executed directly on the simulator.
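As a minimal illustration of the trace-driven style (the record layout and names below are hypothetical, not those of any particular tool), a replay driver simply streams recorded requests into a memory model:

    #include <cstdint>
    #include <fstream>
    #include <iostream>

    // A hypothetical trace record: address, read/write flag, and the cycle it was issued.
    struct TraceRecord {
        uint64_t address;
        uint8_t  is_write;
        uint64_t issue_cycle;
    };

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        std::ifstream trace(argv[1], std::ios::binary);
        TraceRecord rec;
        uint64_t reads = 0, writes = 0;
        // Trace replay: records are fed to the memory model in recorded order; the
        // model's response time cannot influence which request comes next.
        while (trace.read(reinterpret_cast<char *>(&rec), sizeof(rec))) {
            (rec.is_write ? writes : reads)++;
            // memory_model.enqueue(rec.address, rec.is_write, rec.issue_cycle);  // hypothetical hook
        }
        std::cout << reads << " reads, " << writes << " writes replayed\n";
    }

The comment on the fixed request ordering is precisely the limitation discussed in the next paragraph: the trace cannot react to the memory system's timing.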

Memory performance and power estimation techniques typically rely on trace-driven simulation: the memory trace is collected with one of the aforementioned trace collection methods (the reference generation step) and the collected sequence of memory requests is executed by the memory simulator (the trace simulation step). A reference generator produces a sequence of memory requests (a reference trace), and a trace simulator models the memory system by processing each memory request in the reference trace. Trace-driven simulation offers portability, since a trace can be reused across multiple simulations. However, trace-driven memory simulation is not adequate for simulating the behaviour of the memory system correctly and accurately, since it lacks the feedback (the response time of one memory request affects the processing of the next) necessary to accurately predict execution time. As a result, the interaction between the CPU pipeline and the memory system cannot be modelled exactly in trace-driven memory simulation [100]. Hence, a cycle-accurate processor simulator is a vital component, alongside a detailed cycle-accurate memory simulator, for capturing accurate memory statistics (on-the-fly simulation). The on-the-fly simulation technique also avoids the reference generation overhead, which is significant when traces have large storage requirements. As a result, we need to bring together a processor simulator with a detailed timing and power memory model. Researchers have combined the Sim-alpha [65] and GEMS [101] platforms with DRAMsim, but these processor simulators are not cycle-accurate. Additionally, these integrations are each targeted at a specific instruction set architecture (for instance, Sim-alpha simulates only the Alpha processor and GEMS targets the SPARC processor) and cannot be re-targeted to different processors without extensive work. A designer may therefore need to evaluate a different processor instruction set architecture with a different memory model, a combination which cannot be handled directly by existing full system simulators. Hence, there is a need for a generic integration methodology for a detailed, cycle-accurate, functional and performance processor-memory simulator. This thesis describes a generic methodology which can be used to integrate a cycle-accurate processor simulator with a DRAM memory simulator that takes memory addresses as input. The output of the DRAM simulator can be timing information and/or power/energy information. Note that the DRAM simulator does not need to manipulate data values (this is typical of most memory simulators). However, for cycle-accurate simulation, data values are important, and a mechanism must be provided for manipulating these values.
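The sketch below shows one way such an integration could be structured in C++; it is only an outline of the general idea (the class and method names are hypothetical) and not the interface abstraction layer actually proposed later in this thesis.

    #include <cstdint>
    #include <functional>
    #include <queue>

    // Hypothetical abstraction: the processor simulator only sees this interface,
    // so any memory simulator that implements it can be plugged in behind it.
    class MemoryInterface {
    public:
        using Callback = std::function<void(uint64_t /*address*/)>;
        virtual ~MemoryInterface() = default;
        virtual void sendRequest(uint64_t address, bool is_write, Callback done) = 0;
        virtual void tick() = 0;   // advance the memory model by one clock cycle
    };

    // A toy fixed-latency memory standing in for a real cycle-accurate DRAM simulator.
    class FixedLatencyMemory : public MemoryInterface {
        struct Pending { uint64_t address; uint64_t ready_cycle; Callback done; };
        std::queue<Pending> pending_;
        uint64_t cycle_ = 0;
        uint64_t latency_;
    public:
        explicit FixedLatencyMemory(uint64_t latency) : latency_(latency) {}
        void sendRequest(uint64_t address, bool, Callback done) override {
            pending_.push({address, cycle_ + latency_, std::move(done)});
        }
        void tick() override {
            ++cycle_;
            while (!pending_.empty() && pending_.front().ready_cycle <= cycle_) {
                pending_.front().done(pending_.front().address);  // notify the processor model
                pending_.pop();
            }
        }
    };

In such a scheme the processor simulator calls sendRequest() when a load or store leaves its cache hierarchy and stalls the corresponding instruction until the callback fires, while a top-level loop ticks both simulators every cycle; swapping FixedLatencyMemory for a wrapper around a cycle-accurate DRAM simulator would leave the processor side unchanged.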

3.2.5 Summary of System Simulation Research

Much of the early research on system simulation comes from the area of simulation and modelling. Various simulation techniques exist for different system components to obtain performance statistics, such as the execution time and power consumption of the target component. High level system simulation is increasingly popular as a way to shorten the time-to-market of embedded systems. A large number of instruction set simulators, memory module simulators and full system simulators (in which the processor and memory components are simulated together) are available in the area of architectural high-level simulation. In order to capture the timing dependencies between the processor and memory components, full system simulation is necessary so that detailed statistics of both components can be obtained accurately. Most existing full system simulation models are non-cycle-accurate functional simulators, and thus a cycle-accurate full system simulator for a target platform is necessary if accurate design trade-offs are to be measured. The interface abstraction methodology we propose in this thesis does not depend on any such simulator, making it suitable for integrating existing processor and memory simulator models into a seamless cycle-accurate processor-memory full system simulator. In this thesis, we present an approach to building a high level cycle-accurate processor-memory simulation model for embedded systems. We investigate a technique that seamlessly integrates the processor component and the memory component in a cycle-accurate manner to provide detailed statistics of both. We experimentally show the accuracy that can be achieved by utilising our proposed interface abstraction layer to integrate an existing processor simulation model and a memory simulation model.

3.3 Exploitation of Last-level Cache

DRAM performance is critical in modern computing systems, particularly with the widespread use of data-centric applications (for example, streaming applications). Cache memories have been used extensively to reduce DRAM access times. In a cache memory hierarchy, the last-level cache plays an important role in reducing the access times of the DRAM system [102]. Thus, caches act as a bridge to close the performance gap between the processor and the DRAM memory system. A suitable last-level cache configuration for a particular system can achieve high performance at low cost, especially for memory-bound applications [100].

3.3.1 Last-level Cache Exploration

To explore the behaviour of the last-level cache, Wang et al. [100] proposed a trace-driven hardware/software co-simulation approach, in which hardware is used for cycle-accurate extraction of long traces and software simulation is used to simulate the collected trace. There are two classes of simulation, according to the input fed to the simulator: trace-driven simulation and execution-driven simulation (more details are in Section 3.2.4). In general, the results of trace-driven simulation are not as accurate as those of execution-driven simulation, because trace-driven simulation does not consider the interaction between the CPU pipeline and the memory system. Execution-driven cycle-accurate simulation (system simulation including at least a processor module and a memory module together) captures system behaviour more accurately [103], but is extremely slow. Thus, the co-simulation technique proposed in [100] tries to overcome the slowness of execution-driven simulation by obtaining accurate and flexible trace-driven simulation for the exploration of last-level caches. This approach captures the memory request transactions and timestamp information of the front side bus (FSB) by snooping on the FSB signals via a logic analyser. The collected trace is then simulated with an enhanced CASPER cache simulator (enhanced by adding support for long traces and timing information to CASPER [104]). The approach presented in [100] thus utilises hardware/software co-simulation to reflect the actual behaviour of applications in the exploration of last-level caches. In contrast, this thesis proposes system-level software simulation approaches (without needing any hardware-related process) for the exploration of a large design space of last-level caches for early verification of the system design.

3.3.2 DRAM Performance Improvement

DRAM memory requests (arising from last-level cache misses) exhibit repetitive patterns due to temporal locality of reference [105] (there is a high probability that recently accessed data will be accessed again in the near future). Exploiting this temporal locality, Scavenger [106] proposed a technique to reduce DRAM latency by retaining the most frequently missed blocks in the last-level cache. In Scavenger, the last-level cache is divided into a conventional cache portion and a victim file portion. The conventional portion operates as the last-level cache, while the victim portion keeps the most frequently missed blocks from the conventional part of the cache. Blocks evicted from the conventional cache which have been missed most frequently in the past (and are therefore more likely to be used in the future) are placed in the victim part of the last-level cache. When a miss occurs in the conventional cache part, Scavenger [106] calculates the block's miss frequency and places the missing block in the victim part if its miss frequency is higher than that of all the blocks in the conventional part as well as all the blocks in the victim part. Scavenger reduces DRAM latency by preventing the repeated eviction of the same block addresses from the last-level cache.

One way to increase DRAM throughput is to manage the writes from the last-level cache effectively. Write requests can significantly interfere with the processing of read requests, which degrades overall system performance because the read requests are delayed [107]. The interference of write operations with read requests in a continuous memory sequence is called write-induced interference [107]. Write-induced interference (shown in Figure 3.2) increases the latency of read operations because of the timing gaps required between consecutive read and write operations in the DRAM memory. Additionally, the interference incurs the memory data bus's read-to-write or write-to-read switching penalty. Previous studies [107–112] have researched ways of managing last-level cache writeback events in order to improve DRAM efficiency.

Figure 3.2: Write-induced Interference, sourced from [112]

Eager Writeback [108] is a technique for managing write events to DRAM. Instead of sending the write data to DRAM at the actual eviction time, the eager writeback approach sends the eventually-evicted last-level cache lines to an intermediate write buffer in advance, by modifying the traditional writeback policy [103]. The traditional writeback policy issues the write to DRAM immediately when a dirty cache line is evicted, while the eager writeback approach issues the write to the write buffer whenever the bus is idle (prior to the actual eviction to DRAM). Eager writeback of the dirty data in the last-level cache is triggered when the current memory request results in a writeback event, and the writeback to DRAM is performed when the intermediate writeback buffer is full. Even though the eager writeback approach improves overall system performance by redistributing the writing of dirty cache lines, it is limited by the size of the write buffer used to service the writeback events in advance.

Thus, the authors of [109] proposed a Virtual Write Queue (VWQ) which coordinates the writeback traffic of the last-level cache with the scheduling policies of the DRAM memory controller to significantly improve both system performance and energy savings. The VWQ technique modifies the memory controller to schedule the writeback events of the last-level cache with the intention of reducing write-induced interference [107]. The VWQ approach instructs the cache to transfer specific cache lines to the VWQ of the memory controller. The cache lines are selected based on a specific mapping to DRAM resources (such as a mapping to a specific rank/bank) in order to generate a burst of write operations, so that the bus turnaround penalty can be reduced. The VWQ approach improves DRAM performance and power efficiency by increasing the row buffer/page mode locality [21], since successive accesses go to the same DRAM page instead of different, random pages.
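The buffering idea common to these writeback-management schemes can be sketched as follows (a deliberately simplified C++ illustration with hypothetical names; it is not the actual Eager Writeback or VWQ mechanism, both of which interact with real cache and memory controller state):

    #include <cstdint>
    #include <vector>

    // A pending dirty line waiting to be written back to DRAM.
    struct DirtyLine { uint64_t address; };

    class WritebackBuffer {
        std::vector<DirtyLine> buffer_;
        std::size_t capacity_;
    public:
        explicit WritebackBuffer(std::size_t capacity) : capacity_(capacity) {}

        // Called when the bus is idle: stage a dirty line early instead of waiting
        // for its eviction from the last-level cache.
        bool stageEarly(const DirtyLine &line) {
            if (buffer_.size() >= capacity_) return false;   // buffer full, cannot stage
            buffer_.push_back(line);
            return true;
        }

        bool full() const { return buffer_.size() >= capacity_; }

        // Called when the buffer fills (or a drain is otherwise triggered): issue the
        // buffered writes to DRAM as one batch to amortise the bus turnaround penalty.
        template <typename IssueFn>
        void drain(IssueFn issueToDram) {
            for (const auto &line : buffer_) issueToDram(line.address);
            buffer_.clear();
        }
    };

The schemes differ mainly in when stageEarly() and drain() are invoked and in how the lines are chosen (bus idleness, rank idleness, or mapping to the same DRAM page).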

The technique of driving last-level cache writeback operations to reduce write-induced interference is also discussed in [111]. In contrast to the Eager Writeback [108] and VWQ [109] approaches, [111] utilises the arrival times of the memory requests so that it can avoid the large penalty of a read request arriving soon after a write request. The authors of [111] proposed a rank idle time predictor to predict long rank idle periods. During the predicted rank idle cycles, a sequence of writeback events is serviced so that the delay of a read request following a write operation (write-induced interference) can be minimised. The rank idle time predictor uses the address of the last-level cache miss block and the DRAM address mapping policy to accurately predict whether this block will be accessed again before being evicted. As soon as a rank becomes idle, the predictor estimates whether a read request is likely to arrive at that rank within the next m cycles. If no read request is expected within that time (m cycles), a sequence of dirty cache blocks is serviced within it. The approach in [111] reduces write-induced interference by exploiting rank idle time to service sequences of read requests and write requests separately.

Similar to the eager writeback approach [108], Zhe et al. [112] proposed early scheduling of the eventually-evicted last-level cache writeback blocks. In the eager writeback approach, the early eviction is triggered whenever the bus is idle. In contrast, the approach in [112] triggers the eviction of a writeback block from the last-level cache only when there is a possibility of that writeback event happening within a predefined time after a specific read memory request. In order to send a batch of writeback events to DRAM and thereby increase DRAM page locality, the approach in [112] uses an evicted block buffer to keep the predicted eventually-evicted last-level cache blocks. In [112], the write requests have higher priority than the read requests. A sequence of write requests from the evicted block buffer to DRAM is performed when the actual writeback event of a block inside the buffer occurs or when the evicted block buffer is full. The early eviction of the eventually-evicted last-level cache blocks in [112] improves DRAM efficiency by effectively redistributing the write memory requests.

The authors of [107] also proposed a DRAM-aware last-level cache writeback policy. The DRAM-aware writeback approach of [107] aggressively sends a sequence of additional last-level cache writeback data that resides in the same DRAM page as the current writeback event. This technique improves DRAM performance and energy efficiency not only by exploiting DRAM row buffer locality for write operations, but also by reducing the bus switching penalty of read-to-write or write-to-read transitions. The same authors proposed a DRAM-aware last-level cache replacement policy in [110]. The two proposed replacement policies are the Latency and Parallelism-Aware (LPA) and the Write-caused Interference-Aware (WIA) replacement policies. The LPA policy exploits DRAM bank-level parallelism (by servicing multiple writeback requests concurrently in different DRAM banks) when sending a batch of last-level cache writeback data. The WIA policy, on the other hand, uses the same technique as [107], in which a sequence of additional last-level cache writeback data residing in the same DRAM page is sent together with the writeback event. The proposed LPA and WIA replacement policies enhance system performance by exploiting bank-level parallelism and row-buffer locality, and by reducing write-induced interference.

3.3.3 DRAM Energy-Aware Scheme

Lee et al. [113] proposed a DRAM energy-aware data prefetching scheme. The method of [113] prefetches data from DRAM into the last-level cache to increase DRAM idle times by clustering DRAM accesses. The authors of [113] utilise a stride-based prefetching mechanism to predict the patterns of consecutive memory references that have a regular address gap arising from looping structures. They prefetch the data several iterations ahead to completely hide memory latencies. A reference prediction table (indexed like a direct-mapped instruction cache) is used to keep the previously prefetched addresses, the stride (the difference between the previous address and the current address), and the lookahead distance (used to generate the next prefetch address). The next prefetch address is calculated from the information in the associated entry of the reference prediction table (accessed with an index in the same way as a direct-mapped instruction cache). With this DRAM access clustering scheme, the authors of [113] improve DRAM performance and energy savings significantly.
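A simplified reference-prediction-table stride prefetcher of this general kind could be sketched as follows (the structure and field names are hypothetical, and the lookahead handling and confirmation state machine in [113] are more elaborate):

    #include <cstdint>
    #include <optional>
    #include <vector>

    // One reference prediction table entry, tagged by the PC of the load instruction.
    struct RptEntry {
        uint64_t pc = 0;
        uint64_t last_address = 0;
        int64_t  stride = 0;
        bool     confirmed = false;   // stride observed at least twice in a row
    };

    class StridePrefetcher {
        std::vector<RptEntry> table_;
        uint64_t lookahead_;
    public:
        StridePrefetcher(std::size_t entries, uint64_t lookahead)
            : table_(entries), lookahead_(lookahead) {}

        // Called on every load; returns a prefetch address when a stable stride is seen.
        std::optional<uint64_t> observe(uint64_t pc, uint64_t address) {
            RptEntry &e = table_[pc % table_.size()];        // direct-mapped indexing by PC
            if (e.pc != pc) { e = {pc, address, 0, false}; return std::nullopt; }
            int64_t stride = int64_t(address) - int64_t(e.last_address);
            e.confirmed = (stride != 0 && stride == e.stride);
            e.stride = stride;
            e.last_address = address;
            if (!e.confirmed) return std::nullopt;
            // Prefetch several iterations ahead to hide the DRAM latency.
            return address + uint64_t(e.stride) * lookahead_;
        }
    };

Issuing the returned addresses in bursts is what clusters the DRAM accesses and lengthens the idle periods between them.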

3.3.4 Summary: Exploitation of Last-level Cache Work

The techniques discussed above show that exploitation of the last-level cache plays a vital role in improving the performance and power efficiency of the DRAM system. However, most of the prior research related to last-level cache exploitation focuses mainly on improving DRAM performance. In this thesis, we exploit the last-level cache to reduce the background power consumption of the DRAM system. Longer DRAM idle periods can be achieved if successive memory requests are satisfied consecutively from the last-level cache, and the background power of the DRAM system can be reduced if the DRAM device is switched to a lower power consuming mode while the DRAM is idle. The pattern of DRAM idle periods for a specific application differs from one last-level cache configuration (cache size, cache line size, associativity, replacement policy, etc.) to another. Although many researchers have shown that performance and/or energy efficiency can be improved with the aid of a last-level cache, these works use only predefined last-level cache configurations. Some last-level cache configurations even increase energy consumption and degrade performance to some extent. We show that only a suitable last-level cache can achieve a significant amount of DRAM energy efficiency. The approach used in this thesis provides a mechanism for setting the DRAM device into the lowest power consuming mode during the long DRAM idle periods that arise from the use of a suitable last-level cache. We also present an accurate and quick exploration mechanism to find an application-specific suitable last-level cache from the hundreds of such configurations which make up a very large design space.

3.4 DRAM Power/Energy Management

Memory components in embedded systems, particularly in embedded systems running data-intensive applications, are among the main contributors to system power consumption [114]. Efforts to reduce the energy consumed by the memory system have been made at various levels: the compiler level, the operating system (OS) level, and the system level. Compiler-directed approaches [28, 29, 115–118] statically analyse the behaviour of the application and system software to detect the idle times during which memory modules (banks/ranks) can be put into a lower power mode. To prolong the idleness of the memory modules (the longer the idleness, the longer the memory modules can stay in a lower power mode), array data are analysed and rearranged, either by placing data with similar lifetime patterns into the same modules or by re-ordering the computation so that the data accessed within a specified time window is located in a small number of memory modules. OS-level approaches [30, 31, 119, 120] (for example, page migration and power-aware page allocation mechanisms) consider code and data placement at the kernel layer. The compiler-level and OS-level approaches are generally not aware of the system architecture, which offers greater control over finer-grained components for optimising the performance and the energy consumption of the system memory. Several system-level energy optimisation techniques [32, 33, 35, 121–123] have been proposed to reduce energy consumption by exploiting the underlying memory architecture and utilising power-aware techniques.

Some previous DRAM power/energy optimisation techniques exploit the locality in the row buffer, while others take advantage of bank-level or rank-level parallelism to reduce the power/energy consumption of DRAM. Each DRAM device consists of one or more ranks, and each rank is composed of one or more DRAM chips. Each DRAM chip contains a number of banks, each of which has a separate row buffer (page). Data accesses served from the current row buffer (requests to the same page) have shorter access latency and lower power consumption than accesses to different pages (since the data must then be fetched from the DRAM array into the row buffer again). Thus, successive accesses to the current row buffer (row buffer locality) can improve DRAM performance and energy efficiency to some extent. Alternatively, DRAM efficiency can be increased by accessing data from different banks (bank-level parallelism) or different ranks (rank-level parallelism) simultaneously [28]. In optimising the energy/power consumption of DRAM, some research utilises the different background power modes/states supported by DRAM, since the background power is the largest component of total memory power consumption [35]. More details on the DRAM architecture and the available power states are given in Chapter 2. Even though DRAM can be set to lower power consuming modes to reduce the background power, the DRAM device has to be in the active state before any operation such as a read or write can be performed. There is a penalty (power mode switching cost, or resynchronisation overhead) for switching the device from a lower power background state to the active state. Consequently, the trade-off between the energy saved while in the lower power mode and the resynchronisation cost has to be considered, since a mode that saves more energy incurs a higher resynchronisation cost. Thus, the idle period of the memory bank should be long enough to amortise the power mode switching penalty in order to achieve significant energy savings.
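This trade-off can be captured by a simple break-even condition (an illustrative formulation; the symbols are generic rather than taken from a particular DRAM datasheet). Switching a module into a low power mode is only worthwhile when the idle period satisfies

\[ (P_{\mathrm{standby}} - P_{\mathrm{low}})\, t_{\mathrm{idle}} > E_{\mathrm{switch}} \quad\Longrightarrow\quad t_{\mathrm{idle}} > t_{\mathrm{be}} = \frac{E_{\mathrm{switch}}}{P_{\mathrm{standby}} - P_{\mathrm{low}}}, \]

where \( P_{\mathrm{standby}} \) and \( P_{\mathrm{low}} \) are the background powers in the standby and low power states, and \( E_{\mathrm{switch}} \) is the total energy overhead of entering and exiting the low power mode (including resynchronisation). A deeper power mode lowers \( P_{\mathrm{low}} \) but raises \( E_{\mathrm{switch}} \), and hence raises the break-even idle time \( t_{\mathrm{be}} \).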

3.4.1 Compiler-directed DRAM Power/Energy Management

Significant energy savings and performance improvements can be obtained by exploiting memory operating modes and banked memories simultaneously [28]. Examples of banked memory architectures are Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR SDRAM), and Rambus DRAM (RDRAM). Kandemir et al. [115] proposed a mechanism to restructure the execution order of loop iterations to optimise the energy consumption of the banked memory. The array data and the underlying bank structure are analysed, and the execution order of the loops is restructured according to the number of simultaneous accesses to different banks. After reordering the loop iterations, the data that needs to be accessed within a certain period is located in a small number of memory banks; during this time, the unused banks are set to a lower power mode. Zhong et al. [28] proposed a graph modelling approach to partition the variables located across different memory banks so as to maximally improve system performance and power consumption. If the variables can be grouped into a few memory banks, the other, unused memory banks can be switched to a lower power consuming mode so that the overall energy consumption of the memory system is reduced. In the graph modelling approach of [28], a memory access graph, including the memory location of each variable (which bank/row of memory), is built after generating the data dependencies of all the variables in the program's control flow graph. Based on the memory access graph, the mechanism of [28] iteratively finds an optimal variable partition that achieves the maximum energy saving while satisfying the given performance constraint.

Kandemir discussed a data layout transformation strategy for banked memory in [29] to reduce the energy consumption of the memory system. Array layouts are transformed to group data access patterns into the same bank so that the other, unused memory banks can be placed into a lower power consuming mode. The author of [29] utilised a two-step mapping strategy in the data layout transformations to obtain longer idle times for the unused memory banks. Array indices are mapped to a virtual bank in the first step, and the virtual bank is mapped to a physical bank in the second step. In this approach, only the second step (virtual bank to physical memory bank) needs to be modified when the underlying memory architecture changes.

Luz et al. [124] also proposed a dynamic data migration strategy (DM) that places arrays with temporal proximity (if two or more data items have been accessed together in the past, it is likely that they will be accessed together again in the near future) into the same set of banks to achieve energy reduction in multi-banked memory systems. DM captures the temporal access relations between arrays (by counting the temporally proximate accesses between them) and, from time to time, dynamically migrates arrays with a high number of temporally proximate accesses into the same memory banks. DM samples proximity accesses at regular intervals to capture the number of times two arrays are accessed together within the sampling window. The DM mechanism incurs a high energy overhead as well as an execution time overhead for the migration process. Additionally, the memory controller operations must be modified to track the new location of an array after the migration process.

Delaluz et al. [125] proposed a technique to cluster array variables for DRAM energy reduction. The approach of [125] uses a three-step process: clustering related array variables; generating a bank access profile; and inserting power mode switching instructions into the application code. Related array variables (variables which have similar lifetime access patterns) are grouped together (by modifying the variable declaration order, since variables are placed in the physical memory modules according to the declaration order) so that they can be placed in the same memory module. Different heuristics (such as grouping array variables with the same first-usage program point, or grouping array variables with the same last-usage program point) are applied while clustering the array variables. After that, the bank access profiles are generated based on the clustered variables and the physical memory configuration. The bank access profile records which banks are accessed in each program phase (for example, bank zero and bank one are accessed in phase one while bank two and bank three are idle). With the bank access profile, the points of power mode transition are determined and annotated in the application code.

Delaluz et al. [125] also proposed hardware-assisted run-time memory management (self-monitored) techniques to reduce DRAM energy consumption. The self-monitored approach automatically detects the idleness of the memory modules using a prediction technique and transitions to different low power modes according to the length of the idle period. The self-monitored approach explores three hardware predictors (an adaptive threshold predictor, a constant threshold predictor, and a history-based predictor), each of which predicts the threshold, i.e. the waiting period before switching to a specific low power mode, so as to compensate for the power mode switching overhead. The adaptive threshold predictor (ATP) starts with an initial threshold and transitions to the lower power consuming mode if the module is not accessed within this threshold. If the next access comes within a short period after switching to the low power mode, ATP doubles the threshold for the next power mode switch to improve the actual energy savings. In the constant threshold predictor (CTP), a constant threshold for each low power mode is used for the whole execution (for example, switch to standby mode after x cycles of idleness, and switch to the next lower power consuming mode after (x+y) cycles of idleness). The history-based predictor (HBP) predicts the threshold for the next power mode switch based on the history of the memory access patterns and the thresholds used within a predefined history time window.

The authors of [126, 127] proposed loop-level transformation techniques for multi-bank memories in order to obtain energy benefits for array-dominated applications, such as image and video processing applications. The approach presented in [126] mainly targets the loop transformation techniques and does not apply different low power modes. However, the techniques of [127] utilise the hardware-assisted constant-threshold power management (CTP) proposed in [121] (setting idle memory banks into a suitable low power mode depending on the monitored idle duration). In [126, 127], the application is modified to apply loop-level transformations using loop fusion/fission [128]. Loop fusion (shown in Figure 3.3) combines two loops into a single loop in order to group array references which access the same memory banks within the same loop. However, loop fusion sometimes cannot achieve actual energy savings, due to the extra bank activations which result from mismatched cache line sizes when larger amounts of data are accessed inside one loop [127].


Figure 3.3: Loop Fusion Transformation, sourced from [127]
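Since the loop bodies of Figure 3.3 are not reproduced here, the following generic C++ example (an illustration of the transformation, not the exact code from [127]) shows the idea: two loops that traverse arrays a and b separately are fused so that both arrays are touched in the same iteration.

    #include <cstddef>

    void before_fusion(int *a, int *b, std::size_t N) {
        // Two separate loops: each streams through a different array, so the banks
        // holding a and b are activated in two separate phases.
        for (std::size_t i = 0; i < N; ++i)
            a[i] += 1;
        for (std::size_t i = 0; i < N; ++i)
            b[i] *= 2;
    }

    void after_fusion(int *a, int *b, std::size_t N) {
        // Fused loop: both arrays are accessed in every iteration of a single loop.
        for (std::size_t i = 0; i < N; ++i) {
            a[i] += 1;
            b[i] *= 2;
        }
    }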

Loop fission (also known as loop splitting or loop distribution [129]), shown in Figure 3.4, on the other hand splits a given loop into two or more loops, placing different arrays in separate loops so that the number of bank activations for a given loop is minimised. With the loop fission technique, if the access period of a specific bank can be prolonged (because the different arrays reside in separate memory banks), the other, unused banks can stay in a lower power mode for longer. Thus, loop fusion improves the cache's spatial locality (the likelihood of referencing data when nearby data was just referenced), whereas loop fission exploits bank locality (accessing data from the same bank).


Figure 3.4: Loop Fission Transformation, sourced from [127]
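Again as a generic illustration (not the code of [127]), loop fission splits one loop into per-array loops so that only one array's bank needs to be active at a time, assuming a and b reside in different banks.

    #include <cstddef>

    void before_fission(int *a, int *b, std::size_t N) {
        // Original loop: a and b are accessed in every iteration, so both banks
        // must stay active for the whole loop.
        for (std::size_t i = 0; i < N; ++i) {
            a[i] += 1;
            b[i] *= 2;
        }
    }

    void after_fission(int *a, int *b, std::size_t N) {
        // After fission: while the a-loop runs, the bank holding b can remain in a
        // low power mode, and vice versa.
        for (std::size_t i = 0; i < N; ++i)
            a[i] += 1;
        for (std::size_t i = 0; i < N; ++i)
            b[i] *= 2;
    }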

Loop tiling, presented in [127, 130] and shown in Figure 3.5, is another technique which partitions a loop into smaller uniform blocks and schedules their execution so as to enhance temporal cache locality (if data is referenced at one point, the same data is likely to be accessed again in the near future) across multiple loops. Figure 3.5 shows how an array can be accessed in blocks of size T, so that memory traffic is reduced because the loop data stays in the cache until it is reused.


Figure 3.5: Loop Tiling Transformation, sourced from [127]
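A generic sketch of loop tiling (not the exact code of [127, 130]): the i loop is broken into blocks of size T so that the block of a currently being updated stays resident in the cache while it is reused across the whole j loop.

    #include <cstddef>

    void tiled_update(int *a, const int *b, std::size_t N, std::size_t T) {
        // The iteration space of i is partitioned into tiles of size T; each tile of a
        // is small enough to stay in the cache while the inner loops reuse it.
        for (std::size_t ii = 0; ii < N; ii += T) {
            std::size_t upper = (ii + T < N) ? ii + T : N;
            for (std::size_t j = 0; j < N; ++j)          // b[j] is reused across the tile
                for (std::size_t i = ii; i < upper; ++i)
                    a[i] += b[j];
        }
    }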

Koc et al. [131] also proposed a DRAM energy reduction technique which performs extra computation for data located in a low-power memory bank so that the power mode switching penalty (from the lower power mode to the active power mode for read/write operations) can be avoided. If data from the low-power memory bank is needed, the requested data is recomputed from the values stored in the active banks instead of reactivating the low-power bank. However, energy savings for the memory system cannot always be achieved using the data recomputation of [131], since the recomputation itself adds execution time overhead; the extra computation sometimes has a large negative impact on the energy savings of the memory system.

3.4.2 OS-level DRAM Power/Energy Management

Lebeck et al. [119] explored the interaction of the page allocation scheme with static and dynamic hardware power management policies at the OS layer. The static power management scheme uniformly places the whole memory chip into a single low power state, whereas the dynamic power management scheme places each memory chip into a different low power state according to its idle time. The dynamic power management scheme proposed in [119] is similar to the constant threshold predictor employed in [125] (a compiler-directed DRAM power control approach), discussed in Section 3.4.1. The dynamic policy exploits the idle time between accesses to a memory module and sets the module into a lower power state if it is not accessed within a predefined threshold time (say, x cycles). If the module has still not been accessed within the next (x+y) cycles, it is set to the next lower power state. To maximise energy efficiency, the authors of [119] also modified the OS kernel to allocate physical pages to the minimum number of DRAM chips and to set the DRAM chips unused within a predefined time window into low power states using the static/dynamic power management policies.

Tolentino et al. [132] discussed three OS-level page shaping techniques (reactive shaping, proactive shaping, and hybrid shaping) which minimise the number of active memory devices; the non-active memory devices can then be placed into a lower power mode, which in turn reduces the energy consumption of the memory system. In the reactive shaping policy, the traditionally scattered allocated page frames are reactively aggregated into a subset of memory modules through page migration. To avoid the page migration overhead, the proactive shaping approach modifies the page allocator to allocate pages from a subset of memory modules, occupying only the minimal number of modules at allocation time. However, the proactive shaping approach leads to page fragmentation over time. The hybrid approach therefore combines the allocation-time page placement of the proactive shaping policy (to occupy the minimal number of modules at allocation time) with the migration feature of the reactive shaping policy (to perform periodic page aggregation into a minimal set of memory modules).

Mingsong et al. [133] proposed a power-mode switching delay hiding mechanism at the OS level which performs the power-mode switching of DRAM ranks early (from active mode to low power mode, or from low power mode to active mode). The early power mode switching (switching the device to/from the low power mode just in time, before the actual switching point) is managed through the OS-controlled buffer cache [134], which has a direct mapping to DRAM for I/O accesses. Most accesses to the buffer cache are made through the OS's system calls, and thus the OS has direct control over the early power mode switching. The mechanism of [133] completely hides the switching delay by setting the DRAM device into the low power state as soon as the I/O processing finishes, and switching the device back into the active power state just before the memory request arrives at the memory controller.

Delaluz et al. [135] proposed an OS-level power mode control strategy that keeps track of the number of accesses to DRAM banks by each process. The method of [135] switches the memory banks that are unused by the next process into low power mode at context switching time (the period during which the current process is preempted and control is transferred to the next process). The unused memory banks of the next process are predicted from the number of memory bank accesses made by each process within a previous predefined time window, as recorded in the OS page table. The OS updates the bank usage information of each process based on where the physical page resides in DRAM, the mapping of the physical page to the virtual page, and the status of the virtual pages inside the page table.

Similar to the approach of [135], which controls the power mode at bank-level granularity, Huang et al. [136] proposed a power management technique, called power-aware virtual memory (PAVM), at rank-level granularity. The PAVM mechanism keeps track of the usage of ranks by different processes and sets the ranks that are inactive for the next process into low power mode at process scheduling time (context-switching time). PAVM also aggregates the application data related to each process into a subset of active ranks, so that the number of ranks required to be in the active power mode for each process is minimised. PABC (Power-Aware Buffer Cache management) [31], another DRAM power management technique, takes into consideration the OS-related data space as well as the application's data space when aggregating the process-related data into a subset of active ranks.

Min et al. [137] also proposed a power mode control mechanism similar to those of [135, 136]. However, [137] decides when to place a memory module into a lower power consuming mode based on the number of consecutive non-accesses to that module, instead of setting all unused memory modules into the lower power mode at context-switching time as in [135, 136]. In [137], the OS kernel keeps track of the number of consecutive accesses and non-accesses for all memory modules by updating an access/non-access counter for each module at every memory access. If the number of consecutive non-accesses to a memory module reaches a predefined threshold, that module is switched into the lower power mode, on the assumption that temporal locality will prevail (if the module has not been accessed for some time, it is unlikely to be accessed again in the near future).

Huang et al. [120] described a memory traffic reshaping mechanism at the OS level that prolongs the idle periods of DRAM ranks so that they can remain longer in the low power mode. The idea of creating longer idle periods in the DRAM device proposed in [120] is similar to the ideas of [28, 29, 115] discussed in Section 3.4.1. The proposed reshaping mechanism of [120] classifies each rank as either a cold rank (for placing infrequently used pages) or a hot rank (for placing frequently accessed pages). The OS kernel keeps track of the number of times each page is accessed and performs page migration to move frequently accessed pages from the cold ranks to the hot ranks. By moving frequently accessed pages from the cold ranks to the hot ranks, the cold ranks can be kept in low power mode longer and energy savings can be achieved, since frequent power mode switching no longer occurs for the cold ranks.

Moshnyaga et al. [138] also proposed an OS-level page allocation mechanism. Unlike the page allocation schemes proposed in [119, 120, 132], which mainly aim to obtain some unused DRAM modules by aggregating pages into a subset of memory modules, the approach of [138] allocates OS pages so as to reduce the number of DRAM refresh operations. Generally, unused memory modules do not need to be refreshed at the periodic refresh points of the DRAM memory. The operating system has complete information on the unused pages and their associated DRAM locations. The approach presented in [138] therefore allocates pages so as to minimise the number of occupied DRAM locations and skips the refresh operations for the DRAM locations that hold only unused pages. DRAM energy consumption can thus be reduced by controlling the refresh processing, which consumes an amount of energy that depends on the size of the region being refreshed.

Bathen et al. [139] proposed a power-aware page allocation mechanism, ViPZonE, to reduce the DRAM power consumption. ViPZonE considers the utilisation and the power consumption of DRAM operations for the various DIMMs (dual in-line memory modules consisting of a series of DRAMs). ViPZonE inserts code annotations at the application level (to generate special memory allocation requests to the OS kernel) and modifies the standard memory allocation algorithm of the GLIBC (GNU C) library (to identify the preferred DRAM location). With the ViPZonE approach, the OS kernel places the allocations requested by the annotated code inside the specified DIMMs. Thus, ViPZonE optimises the DRAM energy consumption by allocating from certain DIMMs so that other, unused DIMMs can be set to a lower power mode (within a specified time window).

3.4.3 System-level DRAM Power/Energy Management

Several researchers have worked to reduce DRAM energy consumption by setting the DRAM device into one of the low power states at the system level. The default power management mechanism inside the memory controller sets the DRAM device into low power mode after every memory request. Therefore, the default power management scheme of DRAM can significantly degrade performance, since every memory access incurs the power mode switching cost [133].

Liu et al. [123] introduced a DRAM power management technique that buffers DRAM write operations in an intermediate buffer (called the page hit aware write buffer, PHA-WB) to improve the hit rate of the row buffer (also known as the sense amplifier or DRAM page). If the requested data is in the row buffer, row precharging and activation (which must otherwise be performed to bring new data from the DRAM array into the row buffer) are not needed and hence the DRAM activate power can be reduced. The authors of [123] intended to reduce the frequent DRAM page conflicts caused by write operations (the write-induced interference [107] discussed in Section 3.3). Although read operations cannot be delayed, since the processor requires their responses before executing other instructions, write operations can be delayed without impacting the performance of the system [111]. Thus, the PHA-WB approach buffers a write operation if it targets a location different from the locations currently in the row buffer. At every row activation, the buffered write operations whose target locations match the locations in the row buffer are triggered. The PHA-WB approach reduces the activate power (by shifting the write operations) as well as the read power (by serving the response from the PHA-WB instead of accessing the slow DRAM when the requested data resides in the PHA-WB).

Although the PHA-WB approach can reduce the activate power and read power, the extra power consumed by the PHA-WB itself (which depends on the number of entries) adversely affects the DRAM power savings. Thus, Liu et al. [123] proposed a throughput-aware PHA-WB (TAP) mechanism which dynamically adjusts the size of the PHA-WB (using clock gating [140]) according to the DRAM access patterns. The TAP approach keeps track of the number of write operations within a sequence of consecutive DRAM accesses in a predefined interval and modifies the number of active entries in the PHA-WB accordingly. Different from the PHA-WB approach, which reduces the activate power and read power, this thesis utilises only a last-level intermediate cache in order to reduce the background power, read power, and write power of the DRAM system.

The authors of [32] described a queue-aware power-down mechanism and a power-aware memory scheduler to optimise the power consumption of DRAM by managing the commands inside the command queue of the DRAM controller. The queue-aware power-down mechanism monitors all the commands inside the command queue and sets the idle DRAM ranks (those which do not have to service any commands in the command queue) into low power mode. The power-aware memory scheduler, on the other hand, groups the commands inside the command queue according to the target rank.
The idle time of each DRAM rank is increased by servicing same-rank commands consecutively, which reduces the number of power-mode switching events. In order to further prolong the idle time (power-down duration) of the memory modules, [32] estimated a delay time during which all the memory commands are blocked. In [32], the acceptable delay time is estimated with a linear-regression-based delay estimator which uses the DRAM state information and a predefined power threshold.

Kim et al. [141] proposed a power-aware scheduling scheme that modifies the command scheduler of the memory controller in order to minimise the number of power state transitions. The power-aware scheduler of [32] groups the commands which target the same ranks by considering all the commands in the queue, whereas [141] groups only the write commands which target the same ranks. In [141], a batch of write transactions from the command queue is issued when the number of pending write transactions is larger than a predefined threshold. The batch of write transactions is selected from the pending write transactions based on the power mode switching cost from the current power state of the target rank to the active state: the lower the power mode switching cost, the higher the chance that the write operation is selected to be issued to DRAM.

Kim et al. [141] also proposed a DRAM power-aware rank scheduler that modifies the cache replacement policy of the last-level cache in order to reduce the number of write operations to DRAM (and thus the active power). In [141], the cache blocks are divided into two regions (a least-recently-used (LRU) region and a non-LRU region), and two techniques (clean block selection and target-rank power-state-based block selection) are applied to select the replacement block in the write-back cache. Clean block selection chooses the replacement block from the non-dirty blocks (whose data has not been modified) of the LRU region, since no write-back operation is triggered when a non-dirty block is replaced. If a clean block cannot be found in the LRU region, the target-rank power-state-based block selection method is applied, in which the replacement block is selected based on the power state of the target rank of the required write-back data. The rank selection is prioritised according to the power mode switching cost from the current power state to the active power state (the lower the switching cost of the target rank, the higher the chance that the cache block is selected).

Wu et al. [142] proposed a system-level memory design called RAMZzz to optimise the energy consumption of DRAM. In RAMZzz, the ranks are categorised as hot ranks and cold ranks according to the rank idle time. The concept of hot and cold ranks is also used in [143] (discussed in Section 3.4.2), where the ranks are divided based on the access frequency of the OS pages. In RAMZzz, for a specific memory request pattern, the hot ranks exhibit a large number of short idle periods whereas the cold ranks exhibit a small number of long idle periods. RAMZzz periodically migrates the DRAM pages that have short idle periods into the hot ranks so that the cold ranks can be kept longer in a low power consuming mode. In the RAMZzz memory system, the memory controller also monitors the memory access locality periodically and groups DRAM pages with similar access locality (based on the effect of the idle period pattern on the ranks) into the same rank, so that the pages in the same rank have roughly the same level of hotness [142].

Trajkovic et al. [144] proposed a DRAM energy reduction technique that exploits a small prefetch buffer and a write buffer inside the memory controller. An intermediate write buffer is also used in the PHA-WB approach [123], which targets an increase in the page hit rate to reduce activate power. The technique of [144] aims to minimise both the activate power (due to the activation and precharging processes) and the active power (due to active operations such as reading and writing). On every read miss, additional cache lines are fetched together with the missed cache line into the prefetch buffer. A predefined number of consecutive write operations is buffered, and the group of buffered write operations is sent only when the DRAM is not busy servicing any other read or write request. The approach presented in [144] uses fixed configurations (16-byte line size with full associativity) for both the prefetch buffer and the write buffer. In this thesis, we also exploit a last-level cache (used as a prefetch buffer), but our approach targets the reduction of the background power consumption in addition to the reduction of the activate power and active power.

Amin et al. [35] also proposed a modification of the cache replacement policy, called the rank-aware last-level cache replacement policy (RARE), to reduce the background power consumption of DRAM. RARE prevents the cache blocks of pre-chosen prioritised ranks from being replaced so that the prioritised ranks can stay longer in the low power mode, thereby reducing the background power for these ranks. In order to avoid cache pollution (eviction of frequently required data from the cache) due to the prolonged prioritisation of ranks, RARE updates the set of prioritised ranks in a round-robin fashion after a predetermined time period. The utilisation of DRAM's deepest low power background mode (the self-refresh power-down mode) in RARE is somewhat similar to the approach presented in this thesis; however, our work provides an insightful evaluation of the self-refresh mode setting with different last-level cache configurations.

Amin et al. [35] also proposed the RAWB (rank-aware write buffer) approach. In the RAWB approach, an intermediate write buffer (similar to [123, 144]) is utilised to buffer write transactions that target memory ranks in low power mode, so as to avoid the frequent power mode switching caused by the write requests. The idea of utilising an intermediate write buffer is also proposed by Liu et al. [123] (the page hit aware write buffer approach), where the write transactions are buffered and sent later as a batch of write operations to improve the page hit rate. The RAWB approach sends a batch of write requests to a certain rank only when a cache read miss to that rank occurs.

RAIDR [145] (Retention-Aware Intelligent DRAM Refresh) is proposed to reduce the number of refresh operations (which in turn reduces the refresh power) by skipping unnecessary refreshes of the DRAM device. DRAM refresh operations delay the processing of memory requests and thus degrade system performance and incur high power consumption. Reducing the number of refreshes is also proposed in [138], which skips the refresh operations for unused memory modules through OS-level DRAM power control, as discussed in Section 3.4.2. In contrast, RAIDR groups DRAM rows into different retention time bins based on their retention time (the interval within which a row must be refreshed) and applies a different refresh rate to each bin. RAIDR profiles the retention time of each DRAM row by monitoring the time interval until one of its bits first changes after the previous refresh operation. If the retention time of a row is longer than the normal periodic refresh interval, that row does not need to be refreshed at every regular refresh point. After profiling the retention times, RAIDR groups the DRAM rows into the retention time bins and performs refresh operations only for the rows in each bin that actually need to be refreshed, at the refresh rate required for that bin.

An approach to reduce refresh power is also proposed in the smart refresh mechanism by Ghosh et al. [146]. The smart refresh technique avoids performing periodic refresh operations on DRAM rows which have recently been read or written. The smart refresh approach uses a counter for each DRAM row to determine whether the row needs to be refreshed at the periodic refresh points. The counters of rows being read or written are reset to the periodic refresh interval value, and the next periodic refresh operation is skipped for rows whose counters still hold a non-zero value. RAIDR [145] uses the retention-bin approach, whereas the smart refresh technique [146] uses the counter approach to reduce refresh power.

Huang et al. [143] proposed a cooperative OS-level and system-level power management mechanism to efficiently manage DRAM power consumption. In the cooperative technique of [143], a context-aware power management unit (PMU) is implemented in the system-level memory controller, which uses information on the different memory access behaviours on a per-process basis given by the OS kernel. The PMU keeps the history of rank access information for each process within a predefined time interval. Based on the rank access information of each process, the PMU determines a suitable threshold value for each rank (the waiting time before triggering the power mode switching process) so as to switch the device into low power mode. The PMU approach also sets the unused memory ranks (determined by checking the rank-access timing pattern of each process in the PMU) into low power mode.

3.4.4 DRAM Power/Energy Estimation

DRAM power/energy estimation techniques are proposed in [147, 148]. Kadayif et al. [147] proposed a compiler-directed energy-aware compilation framework (EAC) to estimate and optimise the energy consumption of DRAM for a specific target application. EAC estimates the energy consumption of the system in order to assess the energy impact of data optimisations such as loop tiling [127, 130] and data transformations [29] (which were discussed in Section 3.4.1). EAC computes the energy consumption of the system by extracting, at the compiler level, the information required to compute the power consumed in the datapath, cache memory, main memory, buses, and clock network. The information extracted by EAC includes the number and type of each instruction (branch instruction, addition instruction, etc.) to compute the datapath energy, the number of hits/misses and reads/writes to compute the cache energy, the number of execution cycles and memory stall cycles for the clock energy, and the number of accesses to compute the memory energy consumption [147]. In this thesis, we estimate the total energy savings by using different memory-related information, such as the total idle (power-down) duration, the total number of refresh operations, and the total number of DRAM requests.
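The decomposition used by compiler-level estimators such as EAC can be summarised by a generic additive model of the following form (a restatement of the description above, not the exact EAC formulation):

E_system ≈ E_datapath + E_cache + E_memory + E_bus + E_clock,
where, for example,
E_cache  ≈ N_hit · E_hit + N_miss · E_miss,
E_memory ≈ N_access · E_access,
E_clock  ≈ (N_exec_cycles + N_stall_cycles) · E_cycle.

Each per-event energy term (E_hit, E_miss, E_access, E_cycle) is a technology-dependent constant, and the event counts are extracted by the compiler from the optimised code.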

An estimation model to predict the idle period distribution pattern is also proposed in RAMZzz [142]. RAMZzz's estimation model uses the history of the access locality of each page of the memory ranks within the previous predefined time slot to estimate the idle duration patterns for the next time slot. RAMZzz also determines the idle waiting period before performing a power mode transition, since some of the short idle periods do not yield actual power savings. In contrast, the estimation model presented in this thesis estimates the total energy savings based on the application-specific idle period pattern.

Recently, Thomas et al. [148] proposed a prediction-based power saving policy (PSP) that uses a history-based predictor to forecast the duration of the idle period and employs a suitable low-power mode (either self-refresh or power-down mode, or a combination of both). PSP predicts the waiting time threshold (the idle time to wait before switching to the low power mode) as well as the time at which the DRAM modules should be switched back to the active power mode, in order to minimise the power mode switching costs. Instead of predicting the duration of the idle period within a specified time window at run time, this thesis processes a sequence of memory requests to obtain an estimate of the total idle (power-down) duration.

3.4.5 Summary of DRAM Power/Energy Management Research

DRAM power consumption is one of the most important components of the power consumption of the whole system, and thus its reduction is targeted during the power-aware high-level optimisation process. Current power optimisation approaches at the compiler level are limited by the requirement that the source code be available. Additionally, a compiler-directed approach has control only over the data portion and does not have direct access to optimise instructions. Moreover, compiler-directed approaches are unaware of the underlying system architecture. Hence, it is not possible to obtain fine-grained power optimisation of the DRAM system with compiler-directed power management techniques.

OS-directed power management approaches are transparent to the system architecture, but OS-level approaches mostly manage memory optimisation with OS-level paging units. Thus, OS-level power optimisation techniques only achieve coarse-grained power optimisation, which cannot realise the maximum energy benefits of the DRAM system. System-level power optimisation approaches, on the other hand, can fully utilise the available resources and have complete control over the fine-grained components. Therefore, system-level power management is well suited to optimising the power/energy consumption of the whole memory chip at once.

Most of the previous approaches do not exploit the lowest power consuming mode of DRAM memory, because of the high power mode switching cost incurred by frequent power mode switching events. In this thesis, we utilise the lowest power consuming mode (the self-refresh power-down mode) for the DDR(x)-DRAM system (DDR(x) refers to DDR, DDR2, and DDR3). Our work differs from all of the above system-level approaches in that we explicitly exploit spatial locality by inserting a last-level cache, utilise DRAM's lowest power background mode, and examine the effects of the configurations of this last-level cache on the DRAM energy savings. We also show a quick estimation method to predict the energy savings of the DRAM system for each last-level cache configuration, out of hundreds of last-level cache design space configurations, for a typical embedded system. Finally, we show how to utilise the proposed estimation method to select a suitable last-level cache configuration in order to achieve the maximum energy benefits of the DRAM system. This thesis will thus help in providing a DRAM power-aware design methodology at the system level with reduced design time.

Chapter 4

Interface Abstraction Layer

4.1 Introduction

Embedded systems are heavily utilised in all aspects of modern life. Most modern embedded systems are built using processors and include memory. Presently, the most commonly used and economical memory type is DDR SDRAM. The channel is the connection between the DRAM device and the memory controller. The DRAM device may contain one or more ranks, each of which is composed of one or more DRAM chips. Each DRAM chip contains a number of banks, and each bank contains a number of rows of memory elements. Within a bank, a single row is brought into what is known as the "row buffer" in a process known as activation. This activation process costs time and energy. Once the row is in the row buffer, numerous accesses can be made to it in rapid succession. If another row is needed, then the data currently in the row buffer must first be written back to its original row (known as precharge), and the new row is then activated by writing its data into the row buffer. Since there are multiple banks in the memory, multiple row buffers can be filled and made ready to be read or written. Thus, accessing different data from the same row takes far less time than accessing data from different rows.


Up to 90% of the non-I/O energy is consumed by the memory in embedded systems [114]. A number of different modes are available in a typical modern memory system. Figure 4.1 shows the modes available and the respective power consumption in each of the modes (for a 1Gb DDR3 device from Micron Inc. [2]). In Figure 4.1, the ovals refer to the background power states whereas the rectangular boxes refer to the specific activities (reading, writing, activating or precharging). Thus, the energy consumed will vary (especially the background energy) based upon which modes are visited for an access. In summary, the performance and power consumption of the memory system depend mostly on the status and the power mode of the device, respectively.

[Figure omitted: DDR3 power-mode state diagram. Background states and their approximate currents: Precharge Standby (55 mA), Precharge Power-Down (12 mA slow exit, 35 mA fast exit), Active Power-Down (35 mA), Active Standby (57 mA), and Self-refresh Power-Down (6 mA). The activity boxes are activating, reading, writing and precharging, with the exit latencies (in clock cycles) annotated on the transitions. PD: Power Down, SE: Slow Exit, FE: Fast Exit, SB: Standby.]

Figure 4.1: Available Power Modes and Approximate Power Consumption in Specific Power Modes of DDR3 SDRAM [2]
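As a rough illustration of the gap between the background modes (assuming the nominal DDR3 supply voltage of 1.5 V and the currents shown in Figure 4.1; the actual background power also includes other current components):

P_background ≈ I_DD × V_DD
Precharge Standby:        ≈ 55 mA × 1.5 V ≈ 82.5 mW
Self-refresh Power-Down:  ≈  6 mA × 1.5 V ≈  9.0 mW

so keeping the device in the deepest background mode reduces the background power by roughly an order of magnitude.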

Testing the application on the real target platform is nearly impossible at design time. If an erroneous outcome of the system design is observed only after the target product is finalised, the design and manufacturing cost will be much higher. Thus, there is a need for simulation at design time. To simulate a processor with a memory system, typically an instruction set simulator is used, and a fixed number of clock cycles is assumed for memory accesses. To simulate and obtain memory performance and power consumption figures, the designer typically uses a two-pass approach: the first pass generates a memory trace from the Instruction Set Simulator (ISS); and the second pass processes the memory trace from the first pass in a memory simulator. In the two-pass approach, the processor simulator has no knowledge of the memory latency and power consumption of the memory module. Thus, the processor component uses a fixed latency and a fixed power consumption for all memory requests. Using a fixed memory latency to generate the memory trace degrades the correctness of the system memory simulation, as the memory latency and the consumed power vary from one memory request to another based on the state of the memory system and the memory configuration settings. This error is more acute in modern memories such as DDR2 and DDR3, where different accesses can take quite different amounts of time and power.

4.2 Motivation

Advances in system memory technology have helped to reduce the memory wall problem [18]. However, the memory latency, which varies from one memory request to another in more advanced memory devices such as DDR memories, leads to inaccurate performance outcomes in the simulation phase. Figure 4.2 shows the fixed latency that should be used to obtain the closest execution time and power consumption values for six different benchmarks using the Xtensa processor [60] and DDR3-SDRAM memory [2]. As can be seen in Figure 4.2, not only do the best fixed clock latency values vary for the same benchmark, they also vary between benchmarks, in order to obtain the correct results for different metrics (in this case, execution time and power consumption).

[Figure omitted: bar chart of the fixed memory latency (in clock cycles) that gives the closest match to the correct value for each benchmark and metric:

                         adpcm Enc   adpcm Dec   jpeg Enc   jpeg Dec   g721 Enc   g721 Dec
For Execution Time              32          33         36         31         33         33
For Power Consumption           36          33         31         32         25         20  ]

Figure 4.2: Memory Latency [clock cycles] to obtain the closest values to the correct Execution Time and Average Power Consumption

By using an incorrect simulator (say, one using fixed cycle counts, which at best can only be guessed at), a designer can easily make significant errors, particularly for real-time systems. Correct simulation, in contrast, guides the designer towards an optimal system. In this research, we study the problem of memory simulation in the two-pass approach and propose a novel methodology for integrating the processor module and memory module that overcomes the performance inaccuracies of the typical two-pass approach to system architecture simulation.

The rest of this chapter is organised as follows. Section 4.3 presents our contributions in this chapter. Section 4.4 introduces the underlying concepts of a processor-memory simulator. The proposed design methodology for integrating the processor simulator and memory simulator is discussed in Section 4.5. Section 4.6 compares the results of our approach with the typical two-pass simulation approach. Finally, the summary of this chapter is presented in Section 4.7.

4.3 Contribution

The main contributions of this chapter can be summarised as follows:

• For the first time, we provide a generic methodology to integrate a cycle-accurate processor simulator with a DRAM simulator to provide timing and memory power information;

• As a proof of concept, we provide a case study which utilises the above methodology to integrate Tensilica's cycle-accurate ISS with DRAMsim's timing and power model; and,

• We show the usefulness of such a cycle accurate simulation model over fixed latency simulators for gaining accurate results.

4.4 Background

The novel generic integration component presented in this chapter acts as a bridge between the functional cycle-accurate processor module and the memory module, in order to obtain accurate performance and power metrics. This section provides background on processor and memory simulation and highlights the problems involved in the typical separate simulation.

4.4.1 Processor Simulator Component

Traditional full-system simulation techniques first generate a trace file of the accessed memory elements and then feed the trace into a memory component to obtain the memory statistics. However, this leads to large inaccuracies due to the lack of feedback from the memory system to the processor simulator (for example, even though different requests take different amounts of time, the processor simulator is not able to utilise these values). The initial processor simulation assumes a fixed average delay for every memory access within the system. In reality, the memory delays are variable, and this variance can give erroneous results, especially where dynamic power optimisation techniques are being applied. Many factors contribute to this variation, such as the latency of switching memory banks or the reordering of requests by the memory control hardware to reduce power consumption. Therefore, it is necessary to create a full-system simulator to achieve accurate results that can be used to properly analyse the system.

4.4.2 Memory Simulator Component

Most memory simulators are not functional simulators and do not manipulate the stored values. All the system memory requests (mainly consisting of memory addresses and timing information) are handled by the memory controller, which manages the flow of data going to and from the main memory. Generally, the memory controller performs transaction scheduling, address translation, command scheduling, and buffer and bank management. The transaction scheduling component schedules the memory request transactions according to transaction policies such as First Come First Served (FCFS), Priority Scheduling, First Ready-FCFS (FR-FCFS), and Read Over Write. The physical address given by the processor has to be translated into the DRAM memory address in the form of channel, rank, bank, row and column. Typically, the memory controller uses one of two policies, open page or close page, to control the DRAM row buffer: the open page policy keeps a row open after reading/writing data to/from the row buffer, whereas in the close page policy the row buffer is closed immediately after the data has been processed. In command scheduling, the memory controller arranges the DRAM access protocol signals, such as the row address strobe (RAS) and the column address strobe (CAS). In general, the variation in memory latency depends on the states of the above four typical phases of the memory controller. As the system performance and power utilisation mainly rely on the memory responses to the processor, there is a need for detailed and accurate memory simulation to analyse performance and power consumption tradeoffs.
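A minimal sketch of the address translation phase is given below; the geometry (one channel, two ranks, eight banks, 16384 rows, 1024 columns, 4-byte column granularity) and the bit ordering are illustrative assumptions, since real controllers such as the one in DRAMsim support several configurable mapping policies.

#include <stdint.h>

typedef struct {
    unsigned channel, rank, bank, row, col;
} dram_addr_t;

/* Translate a flat physical address into a DRAM address for the assumed
 * geometry. Field order from least to most significant: column, bank,
 * rank, row, channel.                                                   */
dram_addr_t translate(uint32_t phys_addr)
{
    dram_addr_t a;
    uint32_t x = phys_addr >> 2;        /* drop the byte offset within a column */
    a.col     = x % 1024;  x /= 1024;
    a.bank    = x % 8;     x /= 8;
    a.rank    = x % 2;     x /= 2;
    a.row     = x % 16384; x /= 16384;
    a.channel = x;                      /* single channel in this example       */
    return a;
}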

4.5 Proposed Integration Methodology

Central to the novel methodology implemented in this work is the Interface Abstraction Layer (IAL), which allows the processor core to communicate with the system memory for all memory transactions. In our proposed approach, we build the memory model into the IAL so that the processor component and memory component can be simulated together to obtain performance statistics with high accuracy. It should be noted that functional simulators need to manipulate the values in a storage model; such manipulation of values is not necessary for mere performance evaluation. All the data/instruction requests (cache misses) from the processor determine their individual latency via a cycle-accurate memory model instead of using a fixed latency for every memory request. The IAL has to take care of the functionality of the memory model, as the processor core needs to actually retrieve/store values in the memory. If the memory request operations were not handled accurately, the application would not be truly cycle-accurate within the system simulator (for example, if there are data-dependent loops within the program, then the number of times the loops execute depends on the values stored). To make certain that unnecessary blocking does not occur in the system performing the simulation, the processor simulation, the IAL and the memory simulation are executed as separate processes. Thus, the processor component can continue processing items which are independent of the result of the requested memory operations. The detailed data flows of the processor component and memory component across the IAL are illustrated in Figure 4.3. There are five subcomponents in the IAL, as shown in Figure 4.4: the memory request handler; the transaction controller; the location index controller; the memory simulation controller; and the timer callback handler. The memory request handler and the timer callback handler are the integral components for communicating between the processor component and the memory module.

[Figure omitted: detailed simulation framework. The Xtensa core (running the benchmark application with its cache model) posts cache misses and load/store requests to the IAL, whose memory request handler, transaction controller, timer callback and DRAM storage update logic forward transactions to the DRAMsim memory simulator (bus interface request queue, physical-to-DRAM address mapping, command scheduling and per-bank command queues), and OK/error responses are returned for completed transactions.]

Figure 4.3: Detailed Simulation Framework

The timer callback handler must be triggered to perform the interface layer processing and memory simulation once for every simulated processor clock cycle. The memory request handler accepts the memory requests (cache misses) from the processor module: whenever a cache miss occurs, the processor simulator sends the memory request to the memory request handler. In some systems, the size of the memory request coming from the processor simulator may not be exactly the same as the interface bus (front side bus) width of the system memory.

[Figure omitted: internal structure of the Interface Abstraction Layer. Cache misses from the processor module trigger the memory request handler; the transaction controller issues the (possibly block) memory requests to the memory module, the DRAM location controller performs the read/write on the DRAM storage for completed transactions, the memory simulation controller drives the memory module once per clock cycle, and the IAL then waits for the next clock-cycle trigger before responding OK/error to the processor.]

Figure 4.4: Interface Abstraction Layer

Generally, the size of the memory request is larger than the front side bus width. When such a case occurs, the transaction controller in the IAL has to issue multiple memory requests according to the requested size and the interface bus width of the system memory (this is referred to as a block memory request). For example, the transaction controller must issue two memory requests when the requested size is 64 bits and the interface bus width is only 32 bits. As a consequence, the transaction controller has to collect multiple responses for a block read request. Similarly, for a block write request, the processor core is notified only after the last transaction of the block write request has been processed. The timer callback handler is a critical component for driving the memory simulation: in every processor clock cycle it is triggered and, in turn, sends a notification to the memory simulation controller to perform the necessary memory simulation actions. The memory requests from the transaction controller are processed during this memory simulation, according to the transaction handling policy of the system memory simulator.
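How the transaction controller might derive the number of bus transfers for a block request can be sketched as follows; the function name is hypothetical and not part of the actual IAL code.

/* Split one processor request into bus-wide memory transactions.
 * request_bits: size of the cache-line refill request (e.g. 64).
 * bus_bits:     width of the memory interface bus (e.g. 32).      */
static inline unsigned num_transactions(unsigned request_bits, unsigned bus_bits)
{
    return (request_bits + bus_bits - 1) / bus_bits;    /* ceiling division */
}

/* Example: num_transactions(64, 32) == 2, so the block read is issued as two
 * memory transactions, and the processor is answered only after the second
 * (last) transaction completes.                                             */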

After the memory simulation for the current clock cycle has finished, the transaction controller has to check whether there is any completed memory request transaction from the memory simulator. The memory request simulation may be only partially completed if the initial memory request is a block request: for a block read/write memory request, even though the current memory request simulation has finished, the block request that generated it may not be complete, in which case the transaction controller needs to begin memory simulation of the next transaction of the block request, continuing until all transactions in the block request have been completed. The IAL location controller manages the read/write operation on the DRAM storage after getting the transaction-complete signal from the transaction controller. The location controller has to convert the logical address requested by the processor into a physical DRAM address (in the form of channel id, rank id, bank id, row id, column id), which in turn is translated into the storage location index. After getting the storage index from the location index module, the data read/write operation is performed on the simulated DRAM storage and the completed-transaction signal is sent to the processor core. Finally, the IAL component waits to perform the simulation again during the next processor clock cycle.
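The per-cycle behaviour described above can be summarised by the following skeleton; the helper functions and the transaction record are hypothetical stand-ins for the IAL subcomponents, and the real IAL is driven by the simulator's timer callback rather than by a plain function call.

#include <stdint.h>

typedef struct transaction {            /* hypothetical completed-transaction record      */
    struct block_req *parent;           /* the block request this transaction belongs to  */
    unsigned channel, rank, bank, row, col;
    int      is_write;
    uint32_t data;
} transaction_t;

/* Hypothetical helpers standing in for the IAL subcomponents. */
extern void           memory_simulate_one_cycle(void);
extern transaction_t *get_completed_transaction(void);
extern int            block_request_finished(struct block_req *b);
extern void           issue_next_transaction(struct block_req *b);
extern uint32_t      *location_index(unsigned ch, unsigned rk, unsigned bk,
                                     unsigned row, unsigned col);
extern void           respond_to_core(struct block_req *b);

/* Invoked once per simulated processor clock cycle by the timer callback handler. */
void ial_tick(void)
{
    transaction_t *t;

    memory_simulate_one_cycle();                  /* advance the DRAM model by one cycle  */

    while ((t = get_completed_transaction()) != NULL) {
        if (!block_request_finished(t->parent)) {
            issue_next_transaction(t->parent);    /* continue the block request           */
            continue;
        }
        /* whole block done: perform the read/write on the simulated DRAM storage */
        uint32_t *loc = location_index(t->channel, t->rank, t->bank, t->row, t->col);
        if (t->is_write) *loc = t->data;
        else             t->data = *loc;
        respond_to_core(t->parent);               /* unblock the processor's request      */
    }
    /* control then returns to the simulator until the next clock-cycle trigger */
}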

4.5.1 Case Study

The case study provided in this chapter is based on an existing processor simulator (Tensilica's Xtensa tool suite) and an existing DDR3-DRAM memory simulator. The interface abstraction component presented in this chapter is applied to connect these two simulators in order to showcase the full system simulation. While any cycle-accurate processor simulator and any memory model can be used in our technique, for the purpose of illustration we have used the Tensilica Xtensa tool suite, specifically the Tensilica Xtensa LX2 core simulator [60], and the DRAMsim memory simulator [84] in the experiments of this chapter. The processor model is constructed using the Tensilica Xtensa tool suite [149], which includes a C/C++ compiler, Instruction Set Simulator (ISS), Tensilica Instruction Extension (TIE), Xtensa PRocessor Extension Synthesis (XPRES) [150], Xtensa SystemC (XTSC) and the Xtensa Modelling Protocol (XTMP). XTSC (a SystemC simulation environment) and XTMP (a C/C++ simulation environment) are the modelling protocols for performing simulation at the system level when designing custom memories, hardware devices, multiple Xtensa cores, etc. The Xtensa family of processors [60] provides an extensible processor platform for creating different processor configurations. The designer can vary the processor bus width, off-chip memory size, on-chip memory size, instruction/data cache size and instruction/data cache line size, etc., to tailor each processor implementation to the SoC's target application requirements. To acquire detailed memory timing and power information, we employ the widely used DRAMsim memory simulator [84]. DRAMsim includes detailed timing and power consumption models for a variety of memory architectures such as SDRAM, DDR, DDR2, and DRDRAM. In DRAMsim [84], all the memory requests (cache misses) coming from the processor side are put into the transaction queue via the bus interface unit (BIU). The DRAMsim system controller selects a memory transaction from the BIU and sends it to the transaction queue, where the physical address is mapped to the DRAM memory address. After that, the DRAMsim system controller generates the DRAM command sequences based on the row buffer management policy selected at system initialisation time. We utilised the IAL in combination with the processor module and memory module to create the full system simulator; the detailed flow of the IAL-based system simulator is shown in Figure 4.5. The targeted simulation environment is a cycle-accurate full system simulator for an embedded system.

Being an execution-driven simulation approach, our proposed simulator takes program executables as its input, as opposed to the traditional trace-driven simulation approach whose input is modelled only as a trace representing the instruction sequence, generated by a functional simulator or by trace generators from a target machine. As the Tensilica Xtensa XTMP simulation environment allows the use of a multi-threading mechanism, we created the IAL component and memory component in a separate process context, called the IAL context, in addition to the original context, which we call the core context. As can be seen in Figure 4.4 and Figure 4.5, the open-headed arrow processing is done in the core context while the closed-headed arrow processing is done in the IAL process context. The timer callback handler (the process with the clock icon within the IAL in Figure 4.4) is registered when the IAL process is created, and triggers on every processor clock cycle. The memory request handler is attached to the processor component to receive all the cache misses by using the XTMP protocol feature of the Xtensa core. Whenever a cache miss occurs, the core sends the memory request to the memory request handler. The transaction controller takes care of all the block memory request processing and sends the memory requests to the bus interface queue of the memory component.

[Figure omitted: the IAL-based simulator framework. The benchmark application is compiled with the Xtensa C compiler and executed on the Xtensa LX2 core ISS (with the configured instruction and data caches); cache misses and load/store requests pass through the Interface Abstraction Layer, which queues transactions into DRAMsim (bus interface, transaction scheduling, physical-to-DRAM address mapping, command scheduling, buffer and bank management) according to the memory configuration, and the performance and power consumption statistics are collected.]

Figure 4.5: Applied IAL simulator framework in building a Functional, Cycle-accurate Processor-memory simulator

Whenever the IAL timer is triggered, the memory simulation controller starts the memory simulation action by processing the queued transactions sent by the transaction controller, according to the transaction scheduling policy (FCFS, FR-FCFS, etc.) of the memory simulator. The details of how the memory simulator works can be found in [84].

Algorithm 1: (DRAM Memory Location Index Retrieval)

 1  UINT *ddr3Mem[chCnt][rankCnt][bankCnt];
 2  UINT* getMemLoc(chId, rankId, bankId, rowId, colId)
 3  {
 4      UINT **loc = NULL;
 5      UINT s = 0;
 6      loc = &(ddr3Mem[chId][rankId][bankId]);
 7      if ( NULL == *loc ) then
 8          bankSize = rowCnt * colCnt;
 9          s = sizeof(UINT);
10          *loc = (UINT*) malloc(bankSize * s);
11      index = (colCnt * rowId) + colId;
12      return (*loc + index);
13  }

For the completed transaction requests, the DRAM location address is calculated in order to perform the actual read/write operation on the DRAM storage in the IAL. The physical location in this study is retrieved from the address mapping module of DRAMsim.

As shown in Algorithm 1, we implemented the DDR3 storage as a five-dimensional array structure which can be accessed with the physical DRAM address. The term chCnt refers to the total number of channels, rankCnt is the total number of ranks, bankCnt is the number of banks per chip, rowCnt refers to the total number of rows, and colCnt is the total number of columns. It should be noted that the data type will differ based on the data bus width and the number of chips in one rank, as the total amount of data per read/write request is obtained by multiplying the data bus width by the number of chips per rank.

In order to handle memory efficiently, we dynamically allocate the space for a bank (identified by chId, rankId and bankId for its channel, rank and bank) only when it is first accessed. UINT in Algorithm 1 refers to the unsigned integer type, which is 32 bits for our system, calculated from the 4-bit data bus per chip and the 8 chips per rank. The assignment to loc at line 6 of Algorithm 1 obtains the location address for the requested chId, rankId and bankId, while line 10 allocates the total bank space at this location. Based on the requested rowId and colId, the index into the allocated space is then computed. After the DRAM read/write operation is performed, the IAL location controller responds to the core for the completed transaction and moves to the next state (the state with the clock icon in Figure 4.4) until the next clock-cycle event is triggered.
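For illustration, a sample access through Algorithm 1 could look as follows; the wrapper function and the chosen coordinates are hypothetical, while the 32-bit width of UINT follows from the geometry used in this chapter.

void example_access(void)
{
    /* One read/write request returns (data bus width per chip) x (chips per rank)
     * = 4 bits x 8 chips = 32 bits, hence UINT is a 32-bit unsigned integer.      */
    UINT *p = getMemLoc(0 /*ch*/, 0 /*rank*/, 0 /*bank*/, 5 /*row*/, 17 /*col*/);

    *p = 0xDEADBEEF;          /* store performed by the IAL location controller    */
    UINT value = *p;          /* a later load to the same location returns it      */
    (void) value;
}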

4.6 Experimental Tests and Results

Benchmarks from the MediaBench multimedia benchmark suite [151] were utilised to compare the separate simulation of the processor and memory simulators with the combined simulator built using the IAL. Comparison results for the two different approaches are shown below. The first is the combined approach, where the simulator simultaneously simulates the processor module and memory module. In the second approach, called the two-pass approach, a fixed delay is allocated in the processor simulator for each memory access and the total time taken is recorded; the trace from the processor simulator is then fed to the memory simulator separately to obtain the power consumption figures. We simulate the two-pass approach with 31 different fixed memory latency values, ranging from 10 clock cycles to 40 clock cycles. By varying these memory latencies, we seek to find, for each benchmark, the memory latency at which the processor simulation gives a result closest to that of the combined approach. Note that the combined approach was verified for correctness at the RTL level by the use of models provided by the memory vendors. The configuration settings that we use for the processor core and memory component are described in Table 4.1.

For the Xtensa Instruction Set Architecture (ISA), which employs 32-bit addressing (a 2^32 address space), we configure the simulation for a 4 GB memory storage based on the Micron 1Gb x4 DDR3 SDRAM chip [2]. Having 8 chips in each rank, with each chip providing a 4-bit data bus, creates a memory system with a 32-bit data bus.

Processor:
  Architecture                      Tensilica LX2
  Interface Bus Width               32 bits
  Processor Speed                   563 MHz
  Instruction & Data Cache Size     2KB & 1KB
  Cache Line Size                   32 bits

Memory:
  Architecture                      DDR3
  Speed Grade                       -187E
  Clock Frequency                   533 MHz
  Memory Transaction Selection      First Come First Served
  Buffer Management                 open-page
  Refresh Period                    64 ms

Table 4.1: Configuration Settings of Processor and Memory

We automate the test steps with a Perl script to apply different configuration settings (cache size, timing parameters, transaction selection policy, etc.), compile the benchmark application for the targeted platform, simulate the system as if the application were running on the target machine, and collect the output statistics. All the experiments were conducted on an Opteron quad-core machine running at 2.15 GHz with 8 GB of RAM. All the configuration settings and the simulation environment of the two approaches were verified to be the same so that the output statistics could be compared between the two approaches; the only difference between the two approaches is the memory latency. As our simulation is based on the cycle-accurate Tensilica processor simulator [60] and the well-proven cycle-accurate DRAMsim memory simulator [84], the execution time and power consumption figures provided by the combined approach can be taken to be the accurate values.

The total simulation time taken by the combined approach, as well as by the two-pass approach with differing amounts of memory latency, is presented in Table 4.2. The IAL-based (combined) approach takes more time than the average simulation time of the two-pass approach, in exchange for more accurate results. Note that the results for the g721 benchmarks do not include the complete simulations (see the next paragraph). The average increase in simulation time for the one-pass approach, over those benchmarks that were able to complete the power simulation, is 13.5%. Thus, the combined approach is feasible in terms of cost and simulation time.

              One-pass Approach    Two-pass Approach
adpcm Enc                    81                   75
adpcm Dec                    93                   83
jpeg Enc                    923                  754
jpeg Dec                    303                  271
g721 Enc                 13,479               7,664*
g721 Dec                 16,736               8,201*

* only for the two simulations with fixed latencies of 20 and 25

Table 4.2: Average Simulation Time [second]

Table 4.3 and Table 4.4 give the results of the experiments conducted to find the best fixed memory latency. The memory latencies shown are those whose execution time or power consumption results are closest to the results provided by the combined approach. The first column in both tables indexes the results in the other columns, and columns 2 to 7 show the results for the benchmarks named in the first row. In Table 4.3, ActET is the actual execution time and clk x is the total execution time obtained with a fixed memory latency of x clock cycles. Similarly, in Table 4.4, ActPC is the actual power consumption and clk x is the average power consumption obtained with a fixed memory latency of x clock cycles. OptClk in both Table 4.3 and Table 4.4 refers to the optimal fixed memory latency, i.e. the one that should be used to get closest to the actual value. For example, a memory latency of 36 clock cycles would be the best choice for the jpeg Enc application when execution time is considered, and 31 clock cycles when power consumption is considered. The power results for memory latencies greater than 25 clock cycles could not be obtained for the g721 applications because of memory request timing errors.

              adpcm Enc    adpcm Dec       jpeg Enc      jpeg Dec        g721 Enc        g721 Dec
ActET        21,717,249   26,005,447    269,752,716    89,149,087   4,219,051,350   5,099,543,721
clk 20       16,884,353   18,426,664    156,134,953    59,208,924   2,698,605,371   3,543,221,347
clk 25       18,834,969   21,178,108    191,877,067    72,877,698   3,289,939,246   3,949,336,138
clk 31       21,183,023   24,511,276    234,812,580    89,301,512   4,000,190,880   4,212,374,325
clk 32       21,576,250   25,103,382    241,970,934    92,040,304   4,118,567,059   4,958,760,816
clk 33       21,970,090   25,774,264    249,129,408    94,779,607   4,237,014,693   5,103,038,959
clk 36       23,153,271   27,809,243    270,605,461   102,997,562   4,954,813,221   5,374,217,421
OptClk               32           33             36            31              33              33

Table 4.3: Total Execution Time [clock cycles]

              adpcm Enc    adpcm Dec     jpeg Enc     jpeg Dec     g721 Enc     g721 Dec
ActPC            950.62       982.81       981.39      1002.08      1002.23      1006.27
clk 20          1006.57      1064.04      1017.99      1114.88      1117.25      1004.51
clk 25          1001.37      1051.67      1015.11      1079.32      1089.18      1095.27
clk 31           970.55      1006.96       978.07      1011.56            –            –
clk 32           965.08       998.74       971.08      1000.65            –            –
clk 33           959.76       989.99       964.48       990.32            –            –
clk 36           945.33       960.29       945.79       963.06            –            –
OptClk               36           33           31           32           25           20

Table 4.4: Average Power Consumption [mW]

In order to maintain the contents of the DRAM, it needs to be refreshed at particular intervals (64 ms in our experiment). At this stage the memory simulator is unable to produce reliable results for large applications in which the progress of the refresh commands cannot be tracked because of memory request timing conflicts. Thus, we were unable to obtain results for the g721 Enc and g721 Dec applications for memory latencies of 31 clock cycles and higher. We hope to rectify this problem in the future; however, it does not in any way reduce the need for a combined simulator. The error rates of the execution time and power consumption of the two-pass approach compared to the accurate one-pass results are shown in Figure 4.6 and Figure 4.7 respectively. The vertical axes in both graphs refer to the error rate percentage, and the horizontal axes show the benchmark applications run with the different fixed memory latencies.
For example, an error rate of 22.3% in the total execution time is obtained for adpcm Enc with a 20 clock cycle memory latency, as shown in the leftmost bar of adpcm Enc in Figure 4.6. An error rate of approximately 5.9% in the power consumption of adpcm Enc with a 20 clock cycle memory latency can be seen in the first (leftmost) bar of adpcm Enc in Figure 4.7. As can be seen from the results, the optimal fixed memory latency for the two-pass approach differs from one benchmark application to another. Furthermore, the fixed memory latency also depends on the particular metric, such as execution time or power consumption; that is, the chosen memory latency may achieve accurate performance statistics but provide erroneous results for power consumption. The errors in the performance figures when varying the memory latency can be observed in Table 4.3. The optimal memory latency for the adpcm Enc application (32 clock cycles) does not perform well for the jpeg Enc application: as shown in Figure 4.6, adpcm Enc achieves the closest match to the actual execution time with an error rate of only 0.6%, while there is a 10.3% error rate for jpeg Enc if the fixed memory latency of 32 clock cycles is used. Similarly, the memory latency of 36 clock cycles is the best for the jpeg Enc application, incurring only a 0.3% error rate, but it is not the optimal latency for the other benchmarks.
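The error rates in Figures 4.6 and 4.7 are relative errors with respect to the combined (one-pass) result, which is consistent with the reported values; for instance, using the Table 4.3 values for adpcm Enc with a 20 clock cycle latency:

error = |T_fixed - T_combined| / T_combined × 100%
      = |16,884,353 - 21,717,249| / 21,717,249 × 100% ≈ 22.3%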

[Figure omitted: bar chart of the relative error rate (%) in total execution time for each benchmark under the different fixed memory latencies:

                 adpcm Enc   adpcm Dec   jpeg Enc   jpeg Dec   g721 Enc   g721 Dec
20 clk cycles         22.3        29.1       42.1       33.6       36.0       30.5
25 clk cycles         13.3        18.6       28.9       18.3       22.0       22.6
31 clk cycles          2.5         5.7       13.0        0.2        5.2       17.4
32 clk cycles          0.6         3.5       10.3        3.2        2.4        2.8
33 clk cycles          1.2         0.9        7.6        6.3        0.4        0.1
36 clk cycles          6.6         6.9        0.3       15.5       17.4        5.4  ]

Figure 4.6: Error Rate for Total Execution Time

[Figure omitted: bar chart of the relative error rate (%) in average power consumption for each benchmark under the different fixed memory latencies (no results for g721 above 25 clock cycles):

                 adpcm Enc   adpcm Dec   jpeg Enc   jpeg Dec   g721 Enc   g721 Dec
20 clk cycles          5.9         8.3        3.7       11.3       11.5        0.2
25 clk cycles          5.3         7.0        3.4        7.7        8.7        8.8
31 clk cycles          2.1         2.5        0.3        0.9          –          –
32 clk cycles          1.5         1.6        1.1        0.1          –          –
33 clk cycles          1.0         0.7        1.7        1.2          –          –
36 clk cycles          0.6         2.3        3.6        3.9          –          –  ]

Figure 4.7: Error Rate for Average Power Consumption

Note that even a single clock cycle difference in the memory latency can lead to large differences in execution time. This can be clearly seen in Table 4.3, in which a difference of 118,376,179 clock cycles separates the simulations using 31 and 32 clock cycles of memory latency for the g721 Enc application. The inaccuracy is more significant in larger applications such as jpeg or g721. Based on the error rate values for execution time and power consumption (Figure 4.6 and Figure 4.7), there is no single memory latency that is optimal for all applications and for all metrics (execution speed and power consumption). The optimal memory latency for both execution time and power consumption may be very close for small applications such as adpcm, but there are applications, such as jpeg Enc, for which the optimal memory latency differs between the two metrics: 36 clock cycles for execution time and 31 clock cycles for power consumption, as shown in Table 4.3 and Table 4.4. If we use 31 clock cycles as the memory latency, the error rate of the execution time increases from 0.3% to 13%, an increment of 35,792,881 clock cycles; if we choose 36 clock cycles, the error rate of the power consumption goes up from 0.3% to 3.6%, an increase of 32.28 mW.

Therefore, it is difficult to estimate an average optimal memory latency for the system, since it is entirely application-specific. We observed that the inaccuracy in average power consumption is not very significant for almost all the benchmarks we tested; the reason is that the current system does not include any low power consumption techniques. If low power mechanisms were implemented to control the power modes based on the application access history, the inaccuracy would be higher. Thus, it can be seen that the fixed clock cycle counts (if a two-pass approach were to be used) that give accurate results for power consumption and for execution time of an application can be quite different from one another. In a realistic situation, a designer trying to guess the fixed clock latency for memory accesses will necessarily fail. Thus, a combined approach, from which accurate results can be obtained, is necessary, and we believe that the systematic integration layer methodology proposed in this work is helpful in building such a cycle-accurate full system simulator.

4.7 Summary

In this chapter, we described the shortcomings that occur when the processor and memory components are simulated separately. In order to overcome the performance inaccuracies encountered in the two-pass simulation approach, a novel systematic integration layer methodology to build a full system simulator was proposed. Additionally, an applied study of the proposed method, building a cycle-accurate system simulator for an embedded system from a processor simulator and a memory simulator, was provided. With the proposed integration layer component, the combined simulator can be used to analyse the system performance and power utilisation of an application very efficiently. Moreover, the combined system provides accurate results at design time, which is not possible in a typical two-pass method due to memory latency assumptions.

Chapter 5

Rapid Exploration of Unified Last-level Cache

5.1 Introduction

Current processor-based embedded systems heavily rely on caches to bridge the speed gap between processors and main memories. Typical embedded systems contain an embedded processor with one (or two) level(s) of on-chip caches, an off-chip unified (for both instructions and data) last-level cache and an off-chip main memory (DRAM memory). The deployment of caches results in significant improvements in both system performance and energy efficiency. However, an incorrect cache configuration may increase energy consumption and reduce performance when compared to a system with an appropriate cache configuration.

A lot of research has been done to rapidly explore and find the optimal on-chip L1 cache configuration (cache size, line size and associativity) [75,78,79] for a given application or a set of applications. A few researchers have also focused on simultaneous exploration of both on-chip L1 and L2 caches [152–154]. However, not much attention has been paid to exploration of (on-/off-chip) unified last-level caches. In particular, little or no attention has been paid to the estimation of execution time and energy consumption in the presence of a configurable last-level cache.

Use of inaccurate execution time estimators often leads to over-design of a real-time system, in the hope of meeting real-time performance constraints. Therefore, accurate estimation of execution time will allow early, more realistic exploration of the last-level cache in real-time systems. Earlier research has shown that optimisation of caches (excluding last-level caches) can save up to 40% [155] of energy consumption. A suitable last-level cache configuration can further improve energy efficiency, because: 1) most of the memory transactions will be serviced by the last-level cache, thus leaving main memory idle for longer periods; and 2) the main memory can be transitioned to low-power modes during those idle periods [32,144,156].

Therefore, in this chapter, we focus on the rapid exploration and optimisation of a unified last-level cache with respect to performance and energy consumption. We assume that the last-level cache is configurable and off-chip, while the other caches are on-chip and preconfigured. Note that the exploration of on- and off-chip last-level caches is identical. Ideally, it would be best to explore all cache levels simultaneously; however, that is beyond the scope of this thesis.

Motivation and Contribution. Rapid cache exploration is hampered by slow full-system cycle-accurate simulations [157]. Thus, trace-driven cache simulation [74,75,78,79] is an attractive alternative to cycle-accurate simulation of all the cache configurations. Trace-driven cache simulators take the application trace as input and capture cache hit and miss statistics of all the cache configurations, which are then used with an analytical model to estimate execution time and energy consumption [78,158–160]. Although this is a fast method, cache statistics alone do not contain sufficient timing information for accurate estimation of performance and energy consumption. Hence, trace-driven cache simulations have typically focused on exploration of on-chip L1 caches only [78,158–160]. If the analytical models for execution time and energy consumption from [78,158–160] are extended to include last-level caches (referred to as traditional estimators in this thesis), then they result in quite inaccurate estimations (see Section 5.5). This is particularly due to the fact that at some instants the processor executes without even accessing the last-level cache or the memory. Consider the g721 Enc application (from mediabench [151]) running on a uniprocessor system with on-chip separate L1 instruction and data caches, an off-chip unified L2 cache and memory. Figure 5.1 plots the execution time and energy consumption of the system against the cache hits of different L2 cache configurations. It is clear from Figure 5.1 that the execution time of the last-level cache configuration with maximum hits is significantly different from that of the cache configuration with minimum execution time. The same applies to the last-level cache configuration with minimum energy consumption. This illustrates the fact that cache hits/misses alone are not sufficient to estimate execution time and energy consumption.

[Two scatter plots: execution time (cycles) and energy consumption (uJ) versus L2 cache hits, marking the configurations with maximum hits, minimum execution time and minimum energy; the configuration with maximum hits yields neither the minimum execution time nor the minimum energy consumption.]

Figure 5.1: Execution Time and Energy Consumption of different L2 (last-level) Cache Configurations for g721 Enc application's execution on Target System of Figure 5.2.

One might fall back to cycle-accurate simulations of all last-level cache configurations instead of a trace-based approach. However, this is not a feasible option, as cycle-accurate simulations are exorbitantly slow. Hence, the challenge is to quantify the effect of last-level cache configurations on system execution time and energy consumption with a minimal number of cycle-accurate simulations. To this end, we make the following contributions:

• Novel execution time estimator. The execution time estimator uses cache line size, memory accesses, and Last-level Cache Idle (LCI) periods. An LCI period refers to an execution period during which the last-level cache is idle, and hence captures those memory transactions that hit in lower level caches. Our results indicate a worst average error of only 0.26%. Although our estimator is simple (because it uses little micro-architectural knowledge of the system), its high absolute accuracy makes it suitable for rapid exploration of last-level cache configurations, particularly for real-time systems.

• An energy estimator. The energy estimator uses the power consumption of the processor and on-chip caches (base power consumption, measured through cycle-accurate simulation of the processor with on-chip caches and the largest last-level cache) and execution time (obtained from the above execution time estimator) to compute energy consumption, which is then adjusted by the addition of the energy consumption of the last-level cache itself. Our results indicate a worst average error of 19.69%. Although the estimator is simple, it has reasonable absolute accuracy, making it suitable for rapid exploration of last-level cache configurations.

• RExCache framework for quick exploration of last-level cache. RExCache integrates a cycle-accurate simulator and a trace-driven cache simulator with our novel execution time estimator and energy estimator. RExCache simulates an application cycle-accurately only once to capture the LCI profile and base power consumption (see Section 5.3 for details). A cache simulator and CACTI [81] are then used to provide memory accesses, cache line size and LCI periods to the execution time estimator, and base power consumption, execution time and cache energy to the energy estimator. Once execution time and energy estimates are available, RExCache chooses the best cache configuration (minimum execution time, minimum energy, etc.).

In summary, we propose the exploration of a unified last-level cache to improve the performance and energy efficiency of a uniprocessor system. To facilitate such an exploration, we propose the RExCache framework, based on a novel execution time estimator and energy estimator, which can be used to select the most suitable cache configuration for a real-time application. Unlike a general-purpose system, such an exploration is possible for an embedded system because design space exploration is often performed to optimise it for an application or a class of applications [157].

The rest of this chapter is organised as follows. Section 5.2 describes our target system and states the specific problem we aim to solve. Section 5.3 presents our proposed rapid exploration framework for a unified last-level cache. The experimental setup and the results are explained in Section 5.4 and Section 5.5 respectively. Advantages and limitations of our proposed RExCache framework are discussed in Section 5.6. Finally, Section 5.7 summarises this chapter.

5.2 Problem Statement

We target a uniprocessor system with a multi-level cache hierarchy and DRAM memory where the last-level cache is unified and off-chip (we assume that the on-chip caches are preconfigured and unchangeable). An example system is shown in Figure 5.2, where the processor has on-chip separate L1 instruction and data caches, which are connected to a unified off-chip L2 cache. The L2 cache is interfaced with DRAM memory that contains both application instructions and data. Here, L2 is the last-level cache and we use this system as an example throughout the thesis in addition to its use in our experiments.

Our goal is to determine which last-level cache configuration will maximally reduce the execution time and/or energy consumption of a uniprocessor system executing an application, given that other architectural parameters such as processor type, memory type, lower level caches, etc. remain unchanged. The parameters varied in the last-level cache are its size, line size and associativity.

[Block diagram: a CPU with on-die L1 I-cache and L1 D-cache, connected to an off-chip unified L2 cache, which is in turn connected to DRAM.]

Figure 5.2: An example of a Target System.

5.3 RExCache Framework

Our framework, RExCache, which quickly explores the last-level cache, is shown in Figure 5.3. The input to RExCache consists of applications and last-level cache configurations. At a high level, RExCache cycle-accurately simulates each application to capture its memory trace. This memory trace is then fed to a cache simulator to record statistics for all the last-level cache configurations. Those statistics, along with the memory trace of the application, are used in the estimators to estimate the execution time and energy of all the cache configurations. Finally, RExCache chooses the best cache configuration (minimum execution time, minimum energy, etc.) for each application. The following paragraphs explain the core components of RExCache in greater detail from the perspective of one application.

5.3.1 Application Trace Generation

An application is simulated in a cycle-accurate simulator with the largest last-level cache to record two entities at the interface between the second-last-level and last-level caches: 1) the Last-level Cache (LC) memory trace; and 2) the Last-level Cache Idle (LCI) profile. The LC trace will contain only those memory requests that miss in the lower level caches. For instance, the memory trace captured at the L1–L2 interface in Figure 5.2 will only contain L1 cache misses. Since the LC trace will be the same regardless of the last-level cache configuration, we use the largest cache to capture the LC trace due to its lower simulation time.

[Block diagram of the RExCache framework: applications and last-level cache configurations are the inputs; a cycle-accurate processor simulator (the base system with on-chip caches and the largest off-chip last-level cache) produces the LC trace, LCI profile and base power; a cache simulator produces the cache profile and cache line size, and CACTI produces the cache energy; the cache exploration step combines these to select the best cache configuration for each application.]

Figure 5.3: RExCache Framework. Dotted-lined rectangles and broken arrows show our novel contributions.

An LCI period refers to an application's execution period that does not access the last-level cache. For instance, at the L1–L2 interface in Figure 5.2, LCI periods will be the execution periods with no memory requests from the application (consecutive non-load and non-store instructions) and execution periods with memory requests that hit in the L1 caches. Figure 5.4 illustrates such an LCI period, where the L2 cache will not be accessed due to non-load and non-store instructions, and hits in the L1 caches. Hence, the L2 cache will be idle during the marked LCI period. An LCI profile captures all LCI periods, in clock cycles, from the execution of an application. For an in-order processor, which is typical of embedded processors, an LCI profile of the application will not change across different last-level cache configurations, because the processor pipeline is stalled during each memory request and the lower level caches remain unchanged. Hence, LCI periods will contribute to the application execution time irrespective of which last-level cache configuration is used.

ld  R5, R1+20    L1 CM
ld  R6, R1+21    L1 CM
add R7, R6, R5   N/A    \
sub R8, R6, R5   N/A     |  Last-level Cache
xor R9, R8, R7   N/A     |  Idle (LCI) period
st  R9, R1+20    L1 CH  /
ld  R5, R1+40    L1 CM
...

L1 CH: L1 Cache Hit   L1 CM: L1 Cache Miss   N/A: Not Applicable

Figure 5.4: An example of LCI period for Target System of Figure 5.2 (L2 is the last-level cache).

Note that an LCI profile captures micro-architectural events from the actual application execution without the need for detailed analytical modelling of the system's micro-architecture. This is the only step in RExCache where an application is cycle-accurately simulated. Thus, despite the fact that hundreds of last-level cache configurations are possible, only one of them is cycle-accurately simulated.
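Although no implementation is listed here, the LCI profile can be derived from the timestamps recorded at the second-last-level/last-level cache interface. The following sketch is our own illustration under that assumption; the trace record and its field names are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class LCRequest:
        arrival_cycle: int     # cycle at which the request reaches the last-level cache
        complete_cycle: int    # cycle at which the request has been serviced

    def build_lci_profile(lc_trace):
        """Return the LCI periods (in cycles): the idle gaps between the completion
        of one last-level cache request and the arrival of the next."""
        lci_periods = []
        for prev, curr in zip(lc_trace, lc_trace[1:]):
            idle = curr.arrival_cycle - prev.complete_cycle
            if idle > 0:   # cache idle: non-load/store instructions or L1 hits
                lci_periods.append(idle)
        return lci_periods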

5.3.2 Cache Simulation

In this step, we feed the LC trace to a cache simulator to generate a cache profile for all the last-level cache configurations. A cache profile reports whether the memory will be accessed or not for each memory request in the LC trace. A memory request in an LC trace is considered a memory access or non-access in the cache profile depending on the cache policy and whether the request hits or misses the cache. For example, a read miss in a cache with a write-back policy may result in one or two memory accesses. The decision depends on the dirtiness of the cache data to be replaced; that is, if the cache data is dirty then it will be written to memory first and then the missed data will be read from memory, resulting in two memory accesses; otherwise the missed data will only be read from memory, resulting in only one memory access. Likewise, a write miss in a cache with write-back and write-allocate policies will result in a memory access only if the cache data to be replaced is dirty. Table 5.1 reports the number of memory accesses required for a read miss, write hit and write miss under different cache policies.

Policy          Read Miss                  Write Miss (allocate)      Write Miss (no-allocate)   Write Hit
write-through   1                          1                          1                          1
write-back      2 if dirty, 1 otherwise    1 if dirty, 0 otherwise    1                          0

Table 5.1: Number of Memory Accesses for Last-level Cache Policies.
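The policy-dependent counts of Table 5.1 map directly onto a small decision function. The sketch below is a minimal illustration of how a cache simulator could derive memory accesses per request; the function and parameter names are ours, not those of the modified tool from [75]:

    def memory_accesses(policy, event, dirty_victim=False, write_allocate=True):
        """Number of main-memory accesses caused by one last-level cache event,
        following Table 5.1. policy: 'write-through' or 'write-back';
        event: 'read_miss', 'write_miss' or 'write_hit'."""
        if policy == "write-through":
            return 1                             # every listed event goes to memory
        # write-back
        if event == "read_miss":
            return 2 if dirty_victim else 1      # write back dirty victim, then fetch
        if event == "write_miss":
            if write_allocate:
                return 1 if dirty_victim else 0  # only a dirty victim reaches memory
            return 1                             # no-allocate: write goes to memory
        return 0                                 # write hit is absorbed by the cache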

5.3.3 Cache Exploration

Recall from Section 5.1 that the last-level cache profile (hits/misses) alone is not sufficient to estimate execution time because it lacks the necessary timing information. We transform the memory access and non-access information from the cache profile into a more useful representation, the Application Execution (AE) profile. An AE profile captures an application's execution from a given cache profile and an LCI profile. We insert LCI periods from the LCI profile between the appropriate memory accesses and non-accesses in the cache profile to capture the exact execution of an application for a given last-level cache profile. The execution time can then be calculated by summing up the cycles spent in Memory Accesses (MAs) and Non-Accesses (MNAs), and the LCI profile. This leads to the following execution time estimator:

ET = LCIcycles + (MNAs × CacheHL) + (MAs × CacheLS × CacheML)

where CacheLS, CacheHL and CacheML refer to the cache line size, cache hit latency and cache miss latency respectively. The LCIcycles are computed from the LCI profile by summing up all the LCI periods. These cycles contribute to the execution time irrespective of the last-level cache configuration. The second factor estimates the cycles spent in accessing instructions/data from the last-level cache in the case of hits. In the third factor, the product of the last-level cache line size and the total number of memory accesses measures the amount of traffic going to memory during the execution of an application. The memory traffic is then converted into cycles by multiplying it by the last-level cache miss latency. These three factors capture almost the whole application execution. Figure 5.5 shows an example of how the execution time of an application can be estimated from a given cache profile and LCI profile.

[Worked example: a cache profile with memory requests MR1–MR10, of which six are memory accesses and four are memory non-accesses; an LCI profile with idle periods of 20, 110, 70 and 20 cycles (LCI cycles = 220); and the AE profile obtained by inserting each LCI period after the appropriate memory request. With a 30 cycle memory access latency, a 4 cycle memory non-access latency and a 1 word cache line size, ET = 220 + 4 × 4 + 6 × 1 × 30 = 416 cycles.]

Figure 5.5: An example of Execution Time estimation for Target System of Figure 5.2 (L2 is the last-level cache).

The cache profile reports memory accesses and non-accesses (for the sake of simplicity, the example assumes that a memory request results in at most one memory access), while the LCI profile reports the L2 cache idle periods. Each LCI period is inserted after the appropriate Memory Request (MR) in the cache profile. For example, the second LCI period occurs after MR3 and hence is inserted between MR3 and MR4 in the AE profile. The AE profile shows the exact execution of the application. The execution time, 416 cycles in this example, is then calculated using a 4 cycle memory non-access latency (L2 cache hit latency) and a 30 cycle memory access latency (L2 cache miss latency). Note that the construction of an AE profile is not required, but is used here for visual illustration of how an exact application execution is captured using the LCI and cache profiles.
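The execution time estimator itself is only a few lines of code. The sketch below is our own rendering (names and argument conventions are assumptions, not the RExCache implementation) and reproduces the worked numbers of Figure 5.5:

    def estimate_execution_time(lci_periods, mem_non_accesses, mem_accesses,
                                cache_hit_latency, cache_line_size_words,
                                cache_miss_latency):
        """ET = LCIcycles + MNAs*CacheHL + MAs*CacheLS*CacheML (Section 5.3.3)."""
        lci_cycles = sum(lci_periods)
        return (lci_cycles
                + mem_non_accesses * cache_hit_latency
                + mem_accesses * cache_line_size_words * cache_miss_latency)

    # Figure 5.5 example: LCI periods of 20, 110, 70 and 20 cycles, 4 non-accesses,
    # 6 accesses, 4-cycle hit latency, 1-word line, 30-cycle miss latency.
    assert estimate_execution_time([20, 110, 70, 20], 4, 6, 4, 1, 30) == 416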

The energy estimator uses the power consumption of the processor and on-chip caches (BasePower), the execution time and the cache energy consumption. Energy consumption is estimated by multiplying BasePower by the execution time (obtained from the execution time estimator), which is then adjusted by the addition of the energy consumption of the last-level cache itself. Hence, the energy estimator is:

E = (BasePower × ET) + CacheE

where ET is the execution time and CacheE is the energy consumption of the last-level cache. We measure the BasePower by cycle-accurate simulation of the system with the largest last-level cache (the same system configuration that is used in the application trace generation step). BasePower contains both the static and dynamic power of the processor and on-chip caches. The cache energy also contains both static and dynamic energy, where the static energy is estimated by multiplying the static power (obtained from CACTI [81]) by the execution time, and the dynamic energy is estimated by multiplying the numbers of hits and misses by the energy per hit and per miss (obtained from CACTI [81]) respectively.
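The energy estimator can be sketched in the same style. Again, this is an illustration with our own naming and SI-unit conventions rather than the actual RExCache code; the static power and per-hit/per-miss energies stand in for the CACTI outputs:

    def estimate_energy(base_power_w, exec_time_s,
                        cache_static_power_w, cache_hits, cache_misses,
                        energy_per_hit_j, energy_per_miss_j):
        """E = BasePower * ET + CacheE (Section 5.3.3), all quantities in SI units.
        CacheE = static power * ET + dynamic energy of hits and misses."""
        cache_energy = (cache_static_power_w * exec_time_s
                        + cache_hits * energy_per_hit_j
                        + cache_misses * energy_per_miss_j)
        return base_power_w * exec_time_s + cache_energy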

Cache exploration computes the execution time and energy consumption of all the last-level cache configurations using the estimators. The execution time estimator uses LCIcycles, which are computed only once for an application, and the memory accesses and cache line size, which are provided by the cache simulator. On a similar note, the energy estimator uses BasePower, which is computed only once for an application, and the execution time and cache energy, which are provided by the execution time estimator and CACTI [81] respectively. Since the estimators require only one cycle-accurate simulation of an application and little micro-architectural knowledge of the system, they are simple, fast and hence suitable for rapid exploration of last-level cache configurations. Once the estimates are available, RExCache chooses the cache configuration with minimum execution time or minimum energy consumption. The exploration time of RExCache is dominated by the application trace generation step, especially when applications are large, rather than by the cache simulation and/or our estimators. The application trace generation step can be sped up by trace sampling [161]; however, such a technique will not capture the whole application execution and will introduce errors. Hence, we argue that an exhaustive search of the design space to find an optimal cache configuration is better than a heuristic search to find a near-optimal cache configuration, because 1) the estimators are simple and fast (see Section 5.5), and 2) the number of last-level cache configurations is in the hundreds rather than thousands or millions (see Section 5.4).
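Combining the two estimators, the exploration step amounts to an exhaustive loop over the cache configurations. A simplified sketch, reusing the two estimator functions above and assuming each configuration record carries the cache-simulator and CACTI outputs (the field names are ours):

    def explore(configs, lci_periods, base_power_w, clock_period_s, objective="energy"):
        """Exhaustively score every last-level cache configuration and return the
        best one for the chosen objective (energy or execution time)."""
        best_cfg, best_score = None, float("inf")
        for cfg in configs:
            cycles = estimate_execution_time(lci_periods, cfg.mem_non_accesses,
                                             cfg.mem_accesses, cfg.hit_latency,
                                             cfg.line_size_words, cfg.miss_latency)
            energy = estimate_energy(base_power_w, cycles * clock_period_s,
                                     cfg.static_power_w, cfg.hits, cfg.misses,
                                     cfg.energy_per_hit_j, cfg.energy_per_miss_j)
            score = energy if objective == "energy" else cycles
            if score < best_score:
                best_cfg, best_score = cfg, score
        return best_cfg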

5.4 Experimental Methodology

We evaluated the RExCache framework by exploring L2 cache configurations in the target system of Figure 5.2. The target system is implemented using Tensilica's LX4 processor [60] with 2KB L1 instruction and 1KB L1 data caches, both direct-mapped with a 4B line size. We considered 330 off-chip L2 cache configurations by changing the cache size from 4KB to 4MB, the line size from 4B to 128B and the associativity from 1 to 16. These configurations are typically explored in an embedded system's design [75,78,157]. We used a 4 cycle and a 30 cycle latency for L2 cache hit and miss respectively, which are typical of an embedded system [152]. In our experiments, the L2 cache was configured for Least Recently Used (LRU), write-back and write-allocate policies.

We used an Xtensa instruction set simulator (in the XTensa Modelling Protocol environment) and the tool from [75]1 as the cycle-accurate simulator and cache simulator in RExCache. In addition, CACTI 6.5 [81] was configured for 90nm technology to obtain the energy consumption of L2 cache configurations. For evaluation purposes, we used the adpcm Enc/Dec, jpeg Enc/Dec, g721 Enc/Dec, mpeg2 Enc/Dec and H.264 Enc applications from mediabench [151]. All experiments were conducted on an Intel Xeon 64 core machine with 256GB RAM.

5.5 Results and Analysis

Analysis of Estimators. We evaluated the execution time and energy estimators by comparing the estimated values with actual values from cycle-accurate simulations. Table 5.2 reports the absolute accuracy of our estimators in its second major column. The execution time estimator exhibited a worst average absolute error of only 0.26%. Our execution time estimator is simple, yet very accurate, and hence eminently suitable for rapid exploration of last-level cache configurations in real-time systems. Our energy estimator is not as accurate as the execution time estimator because the BasePower is fairly, but not entirely, constant across different last-level cache configurations. Additionally, the error in the estimated execution time also affects the energy estimation. A worst average absolute error of 19.69% is observed for the adpcm Dec application.

Table 5.2: Detailed Analysis of Execution-Time and Energy Estimators. [s/m/d in Simulation Time represent seconds/minutes/days respectively.]
[The table reports, for each application, the average and maximum absolute errors (%) in execution time and energy, and the simulation time, for both RExCache and the traditional estimators [160].]

1 We modified their tool to generate memory accesses and non-accesses rather than cache hits and misses. Note that their tool is much faster than DineroIV [74].

We also compared our estimators to traditional estimators that are based on an execution time estimator proposed for on-chip L1 separate instruction and data caches in [160]2, which uses cache hits, cache misses and average NCPI (Net Clock Cycles per Instruction) to estimate execution time. As explained in [160,162,163], the average NCPI is obtained from cycle-accurate simulations of a number of cache configurations [160], comprising: 1) differing cache sizes with the smallest/largest cache line size and associativity; 2) differing cache line sizes with the smallest/largest cache size and associativity; and 3) differing associativities with the smallest/largest cache size and line size. This results in tens of simulations for a reasonably accurate value of average NCPI. Since the estimator in [160] cannot be applied directly, we extended it as follows to include off-chip L2 cache statistics:

(L1-Hits × L1-HL) + (Total Instructions × average NCPI) + (L2-Hits × L2-HL) + (L2-Misses × L2-LS × L2-ML)

where HL, ML and LS refer to hit latency, miss latency and cache line size respectively. The first factor estimates the time spent in fetching instructions/data from the L1 caches, while the second factor estimates the net time to execute the fetched instructions. The remaining two factors estimate the time spent in the L2 cache and memory using the numbers of L2 hits and misses. Note that the LCI cycles used in our estimator are estimated by the first and second factors of the traditional execution time estimator. The traditional energy estimator is the same as our energy estimator (see Section 5.3.3) except that the execution time is obtained from the traditional execution time estimator. The third major column of Table 5.2 reports the absolute accuracy of the traditional estimators. The results indicate worst average errors of 3.9% and 21.28% for the traditional execution time and energy estimators respectively. This is because L1 hits and average NCPI do not capture all the LCI periods of an application execution, thus introducing more errors than our estimators.
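For reference, the extended traditional estimator can be written in the same style; this is our own sketch of the formula above, not code from [160]:

    def traditional_execution_time(l1_hits, l1_hit_latency, total_instructions,
                                   average_ncpi, l2_hits, l2_hit_latency,
                                   l2_misses, l2_line_size_words, l2_miss_latency):
        """Extended traditional estimator: L1 access time + net compute time
        + L2 hit time + memory traffic time."""
        return (l1_hits * l1_hit_latency
                + total_instructions * average_ncpi
                + l2_hits * l2_hit_latency
                + l2_misses * l2_line_size_words * l2_miss_latency)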

2 The estimator in [160] had, on average, 80% and 21% better absolute accuracy in execution time and energy consumption respectively, than the estimator in [78].

The improved absolute accuracy of our estimators will allow better design space exploration under real-time constraints. We applied 100 different execution time and energy constraints to all the applications, and then searched for the minimum energy and minimum execution time cache configurations respectively. The traditional method selected incorrect cache configurations in 31 cases, while the cache configurations from RExCache were similar to the ones from the cycle-accurate method in all 100 cases. This result indicates that our estimators have greater applicability than the traditional estimators due to their higher accuracy. Furthermore, the simulation times of the RExCache and Traditional approaches are also compared in Table 5.2 (shown in the last column under each approach). RExCache outperforms the Traditional approach with at least a 97% simulation time reduction.

Selection of Cache Configurations. In RExCache, we explored the L2 cache's design space to find the cache configurations with minimum execution time and minimum energy consumption for different applications. The selected configurations are reported in Table 5.3, denoted as [cache size, cache line size, cache associativity]. These results indicate that the minimum execution time configuration will not necessarily have minimum energy consumption. To quantify the execution time and energy efficiency improvements of RExCache, we selected the following four cache configurations:

• CC: System with a reasonably Common Cache configuration (64KB, 64B line size, direct-mapped L2 cache), which has been reported in [152,154].
• BC ET: System with the Best Cache configuration w.r.t. Execution Time.
• BC E: System with the Best Cache configuration w.r.t. Energy consumption.
• LC: System with the Largest Cache, [4MB, 128B, 16A].

Figure 5.6 plots the execution time of BC ET, BC E and LC normalised to CC for all the applications. The BC ET configuration reduced execution time by up to 50% (H.264 Enc) compared to the CC configuration. Hence, the use of an arbitrary last-level cache configuration is not a suitable design choice.

Application   Cache Configuration with
              Minimum Execution Time   Minimum Energy Consumption
adpcm Enc     [16KB, 4B, 8A]           [8KB, 64B, 16A]
adpcm Dec     [8KB, 4B, 16A]           [4KB, 64B, 16A]
jpeg Enc      [512KB, 4B, 1A]          [16KB, 16B, 16A]
jpeg Dec      [128KB, 4B, 4A]          [16KB, 16B, 16A]
g721 Enc      [32KB, 4B, 8A]           [8KB, 64B, 8A]
g721 Dec      [16KB, 4B, 8A]           [4KB, 16B, 16A]
mpeg2 Enc     [2MB, 16B, 1A]           [8KB, 64B, 4A]
mpeg2 Dec     [512KB, 16B, 1A]         [4KB, 16B, 16A]
H.264 Enc     [4MB, 4B, 4A]            [32KB, 16B, 16A]

Table 5.3: Cache Configurations w.r.t. minimum Execution Time and minimum Energy Consumption from RExCache.

The BC E configuration has little performance degradation compared to CC, with a maximum of only 1% for mpeg2 Dec. In addition, BC E configurations always had higher execution times than BC ET configurations because BC E configurations minimise energy consumption rather than execution time. The LC configuration's execution time is almost the same as BC ET for all the applications because it reduces cache misses maximally. However, the LC configuration consumes a large amount of energy due to its large area (see Figure 5.7). The energy consumption of BC ET, BC E and LC normalised to CC is plotted in Figure 5.7 for all the applications, where the energy consumption includes that of the processor, the on-chip L1 caches and the off-chip unified L2 cache, but excludes the energy consumption of memory. As expected, the energy consumption of LC is the highest, considerably more than all the other configurations (at least 2.3 times the energy consumption of the CC configuration) because of its large area. The maximum energy savings are achieved by the BC E configurations in all the applications, reporting a saving of up to 35% (g721 Dec application) compared to the CC configuration. These improvements illustrate the significance of exploring last-level cache configurations to optimise a system for execution time or energy consumption.

[Bar chart: for each application (adpcm Enc/Dec, jpeg Enc/Dec, g721 Enc/Dec, mpeg2 Enc/Dec, H.264 Enc), the execution times of the BC ET, BC E and LC systems normalised to the CC configuration.]

Figure 5.6: Execution Time of different Cache Configurations normalised to Common Cache (CC) Configuration.

Exploration Time of RExCache. A major concern in exploration frameworks like RExCache is the exploration time. Table 5.4 reports the time taken by the cycle-accurate simulator, the traditional method and RExCache for exploration of 330 L2 cache configurations for each of the nine applications. For the cycle-accurate simulator, we simulated all the L2 cache configurations to measure the simulation time. The traditional method column reports the time for generation and processing of the application trace to compute the average NCPI (TGP), the time for simulation and exploration of the cache configurations (CE), and the total exploration time. The RExCache column also reports the aforementioned times, except that the TGP sub-column refers to the generation and processing of the application trace to compute the LCI cycles.

[Bar chart: for each application, the energy consumption of the BC ET, BC E and LC systems normalised to the CC configuration; the LC bars are the highest in every case.]

Figure 5.7: Energy of different Cache Configurations normalised to Common Cache (CC) configuration.

RExCache reduced the exploration time from several days to a few hours compared to the cycle-accurate simulations, enabling quick exploration of the last-level cache due to the simple, fast execution time and energy estimators. Additionally, RExCache takes much less time than the traditional method because it simulates an application only once, compared to the tens of simulations needed in the traditional method (to compute the average NCPI). Consequently, RExCache reduced the simulation time by 98% on average, while providing better absolute accuracy.

Application   Cycle-Accurate   Traditional [160]           RExCache
              Simulator        TGP     CE    Total         TGP    CE    Total
adpcm Enc     3h               22m     8s    22.1m         39s    8s    39s
adpcm Dec     2h               14.3m   8s    14.4m         20s    8s    28s
jpeg Enc      10h              2h      13s   2h            2m     13s   2.3m
jpeg Dec      4h               30.7m   8s    30.8m         42s    8s    50s
g721 Enc      7d               1d      6m    1d            35m    6m    41m
g721 Dec      8d               1d      6m    1d            37m    6m    43m
mpeg2 Enc     116d             16.4d   16m   16.7d         9h     16m   9.3h
mpeg2 Dec     32d              4.1d    8m    4.1d          3h     8m    3.1h
H.264 Enc     257d             35.8d   2h    35.9d         19h    2h    21h

Table 5.4: Exploration Time comparison of Cycle-accurate Simulations, Traditional Method and RExCache.

5.6 Advantages and Limitations

RExCache features several advantages over cycle-accurate simulations and standalone trace-driven cache simulators. RExCache: 1) is fast as it uses only one cycle-accurate simulation per application; and 2) integrates cache simulation and our simple, fast estimators to quickly provide the execution time and energy consumption of all last-level cache configurations. Although we use an exhaustive search to find the optimal cache configuration (because there are only hundreds of cache configurations), a designer can use a heuristic instead to search larger design spaces.

RExCache is very flexible, as any cycle-accurate simulator can be used if the memory trace at the interface between the second-last-level and last-level caches can be captured. Likewise, any cache simulator can be used if the cache profile described in Section 5.3.2 can be produced. Such a profile requires only minor modifications to already available cache simulators [74,75,79]. Note that RExCache can explore the last-level cache of a uniprocessor system with any number of levels in the cache hierarchy, irrespective of whether the last-level cache is on-chip or off-chip. However, in this work, we applied RExCache to an off-chip unified last-level cache only.

RExCache can also be used to find the best last-level cache configuration for a class of applications. In this case, the trace from the combined execution of the applications should be captured and used as input to the cache simulator. A truly representative trace from multiple applications' execution might not be possible due to indeterminism; however, this is a different problem and is not the focus of this thesis. RExCache will explore last-level cache configurations to find the best one for a given trace, irrespective of whether that trace captures the execution of a single application or multiple applications.

RExCache as such cannot explore last-level caches in a multi-processor system, because the LCI periods will be different across different last-level cache configurations (unlike a uniprocessor system, where LCI periods are the same across different last-level cache configurations) due to inter-processor dependencies and cache coherency. In future, we will look into extending RExCache for multi-processor systems.

5.7 Summary

In this chapter, we have proposed rapid exploration of the unified last-level cache in a uniprocessor system to improve performance and energy efficiency. To this end, we proposed the RExCache framework, which integrates cycle-accurate and cache simulators with a simple, fast and highly accurate execution time estimator and a simple, fast and reasonably accurate energy estimator, to avoid slow cycle-accurate simulations of all cache configurations. The cache configurations selected by RExCache for nine mediabench applications reduced execution time by up to 50% and energy consumption by up to 35% compared to a common cache configuration ([64KB, 64B, 1A], reported in [152,154]). RExCache took only a few hours (21 hours for H.264 Enc) to explore 330 cache configurations, compared to several days (257 days for H.264 Enc) of cycle-accurate simulations, illustrating the usefulness of our estimators and framework. When compared to a traditional method (based on [160]), RExCache provided 4% and 2% improvements in the accuracy of execution time and energy estimation with 98% less simulation time.3

3 This research was supported under the Australian Research Council's Discovery Projects funding scheme (project number DP120104158).

Chapter 6

Effects of Last-level Cache Configurations

6.1 Introduction

As embedded systems become computationally more complex, there is a greater demand for increased main memory in those systems. At the same time, main memory in such systems can consume up to 80% of the system's energy [114]. Therefore, the use of DDR3 DRAMs as the main memory in embedded systems is becoming common due to their higher operational frequency and lower power consumption compared to previous generations of DDR DRAMs. Additionally, DDR3 DRAMs offer a greater number of low-power modes, which provides opportunities for better power/energy optimisation.

In general, a DDR3 DRAM consumes three main types of power: background, active and refresh power. Background power is consumed continuously, while active power is consumed during read/write operations. Refresh power is consumed periodically when the DDR3 has to refresh itself to retain its data. A DDR3 DRAM can be transitioned to one of its low-power modes to save background power; however, there is a cost (wake-up latency) associated with transitioning it back to the active state (for read/write operations).

The different low-power modes and their wake-up latencies for DDR3 DRAM [2] are shown in Table 6.1. Energy reduction cannot be achieved if the DDR3 DRAM is set to a low-power mode whose wake-up latency is longer than the idle period of the DRAM. The wake-up latency varies with the low-power mode; the low-power mode with minimum power consumption incurs the highest wake-up latency. For example, the self-refresh power down mode consumes the least amount of power in DDR3 DRAMs and incurs the highest wake-up latency (9× less power than the active state, with a wake-up latency of 64 clock cycles [2]). Maximum energy reduction is possible when the DDR3 is transitioned into the self-refresh power down mode and rarely switched back to active mode (for read/write operations). If a last-level cache can capture the spatial and temporal locality of memory requests to a good extent, then the DRAM will be idle for longer periods, improving the possibility of higher energy savings with the use of the self-refresh power down mode.

Low-power Mode             Current (mA)   Wakeup Latency (clock cycles)
Active SB (active state)   57             0
Precharge SB               55             3
Active PD                  35             4
Precharge PD               12             13
Self-refresh PD            6              64

Table 6.1: Power Modes of Micron DDR3 DRAM. SB and PD stand for StandBy and PowerDown respectively.
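Whether the self-refresh power down mode pays off for a given idle period can be judged with a rough break-even check using the currents of Table 6.1. The following sketch is our own back-of-the-envelope illustration: it uses charge (current × cycles) as a proxy for energy under a constant supply voltage, compares against staying in precharge standby, and charges the extra wake-up cycles at the standby current:

    def self_refresh_saves_energy(idle_cycles, wakeup_cycles=64,
                                  standby_current_ma=55, self_refresh_current_ma=6):
        """Rough check: is it worth spending an idle period in self-refresh PD
        instead of a shallow mode (precharge standby here)?"""
        stay_in_standby = idle_cycles * standby_current_ma
        use_self_refresh = (idle_cycles * self_refresh_current_ma
                            + wakeup_cycles * standby_current_ma)
        return use_self_refresh < stay_in_standby

    print(self_refresh_saves_energy(50))    # False: idle period too short
    print(self_refresh_saves_energy(200))   # True: long enough to amortise the wake-up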

In this chapter, we present a case study on how different last-level cache configurations affect the idle periods of a DDR3 DRAM, and show that a suitable last-level cache configuration combined with the self-refresh power down mode can result in significant savings in the energy consumption of DDR3 DRAM. We propose a power mode controller for DDR3 DRAM which adaptively transitions the DRAM to self-refresh power down mode when a memory request hits in the last-level cache, and switches it back to the active state (for read/write operations) when a memory request misses in the cache. Our power mode controller works with the last-level cache controller and the memory controller to handle the use of the self-refresh power down mode for DDR3 DRAM. We performed simulations with hundreds of last-level cache configurations and observed that energy savings of up to 89% are possible when a suitable last-level cache with self-refresh power down mode is used.

We target a uniprocessor system with a multi-level cache hierarchy and DDR3 DRAM. An example system is shown in Figure 6.1, where the processor has on-chip separate L1 instruction and data caches, which are connected to a unified off-chip L2 cache. The L2 cache is interfaced to the DRAM memory through a memory controller. The DRAM contains both application instructions and data. Here, L2 is the last-level cache and we use this system as an example throughout the chapter in addition to its use in our experiments. The DDR3 DRAM contains an internal mode controller which automatically sets it into one of the low-power modes, except the self-refresh power down mode, whenever it is idle. The Power Mode Controller (PMC) is another module, typically embedded into the memory controller firmware, that controls the use of low-power modes [144].

[Block diagram: a CPU with on-die L1 I-cache and L1 D-cache connected to an off-chip unified L2 cache; the L2 cache controller signals L1/L2 cache misses to the PMC inside the off-chip memory controller, which issues memory control operations to the DRAM.]

Figure 6.1: An example of Target System.

In this chapter, the PMC transitions the DRAM into either the active state or the self-refresh power down mode based upon whether memory requests hit or miss in the last-level cache. The PMC obtains the hit/miss status of a memory request from the last-level cache controller using an external connection (dotted line in Figure 6.1). The proposed PMC can be implemented either as software in the memory controller firmware or as a separate controller.

6.2 Power Mode Controller

The Power Mode Controller (PMC) algorithm is described in Algorithm 2. In general, the PMC tries to transition the DRAM into self-refresh power down mode whenever possible to enable maximum energy saving. Therefore, whenever there is a hit in the last-level cache, the PMC switches the DRAM to self-refresh power down mode if the DRAM is in the active state (lines 1–4). However, before the transition, the PMC waits for any ongoing DRAM transactions due to previous memory requests so that those requests are not interrupted. If a memory request misses the last-level cache, then the PMC has to activate the DRAM when it is in one of the low-power modes (lines 5–9). During this transition, the DRAM incurs a wake-up latency (64 cycles for DDR3 DRAM [2]) which affects performance.

Algorithm 2: (Power Mode Controller Algorithm)

1  if (last-level cache hit) then
2      if (DRAM is in active state) AND (no ongoing DRAM transactions) then
3          Switch DRAM to self-refresh power down mode
4  else
5      if (DRAM is in low-power mode) then
6          Switch DRAM to active state (incurs wake-up latency)
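A software rendering of Algorithm 2, for example inside a memory-controller model, could look like the sketch below. The DRAM interface (state attribute, pending-transaction check, mode-switch and wake-up hooks) is an assumption of ours, not the actual firmware:

    ACTIVE = "active"
    SELF_REFRESH_PD = "self_refresh_pd"

    class PowerModeController:
        """Minimal model of Algorithm 2: enter self-refresh PD on a last-level
        cache hit, wake the DRAM up on a miss."""

        def __init__(self, dram):
            # `dram` is assumed to expose .state, .has_pending_transactions(),
            # .switch_mode(mode) and .wake_up() -- hypothetical hooks.
            self.dram = dram

        def on_last_level_cache_access(self, hit):
            if hit:
                # lines 1-3: only transition once outstanding transactions finish
                if self.dram.state == ACTIVE and not self.dram.has_pending_transactions():
                    self.dram.switch_mode(SELF_REFRESH_PD)
            else:
                # lines 5-6: a miss needs the DRAM active (pays the wake-up latency)
                if self.dram.state != ACTIVE:
                    self.dram.wake_up()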

For a given application, the idle periods of the DRAM vary from one last-level cache configuration to another. Since the PMC transitions the DRAM into self-refresh power down mode whenever possible, the energy savings might be offset by the high wake-up latency when idle periods are not sufficiently long. Therefore, in this chapter, we explore different last-level cache configurations to select the one which maximally reduces the energy consumption of both the last-level cache and DRAM.


[Bar chart: for each application (adpcmEnc/Dec, jpegEnc/Dec, g721Enc/Dec, mpeg2Enc/Dec) and for the SC, LC and OC cache configurations, the energy saving (%) of the PMC system on the left y-axis and its performance degradation (%) on the right y-axis.]

Figure 6.2: Energy Saving and Performance Degradation of PMC system w.r.t. NoPMC system.

We use the target system of Figure 6.1 with LRU and write-back policies for the L2 cache. The target system is implemented using Tensilica's LX2 processor [60] with 2KB L1 instruction and 1KB L1 data caches, both direct-mapped with a 4B line size. We used -125E DDR3 DRAM (1600 Million Transfers per second) from Micron [2] to create a system with 4GB (4 ranks) of memory. The DRAM had an interface width of 4B and used an open-page row buffer policy and an internal refresh time of 64 ms. A wake-up latency of 64 cycles was used for the self-refresh power down mode. We used our in-house cycle-accurate simulator, which integrates DRAMSim [84] with LX4's simulator to create a detailed processor-memory simulator (presented in Chapter 4). The low-level power estimation tool from Micron [2] is used to calibrate the DDR3 DRAM power model in our simulator. In addition, CACTI 6.5 [81] was configured for 90nm technology to obtain the energy consumption and area of L2 cache configurations. The L2 cache design space consisted of 140 configurations, constituted by changing the cache size from 2KB to 128KB, the line size from 8B to 128B and the associativity from 1 to 8. For evaluation purposes, we used the adpcmenc/dec, jpegenc/dec, g721enc/dec and mpeg2enc/dec applications from mediabench. All experiments were conducted on an Intel Xeon 64 core machine with 256GB RAM.

6.3 Results

We consider two flavours of the target system for comparison purposes:

• NoPMC, where only a standard memory controller is used, without the PMC and self-refresh power down mode; and

• PMC, where the PMC with self-refresh power down mode is used.

The standard memory controller in the NoPMC system only uses one of the shallow low-power modes rather than the self-refresh power down mode. In essence, NoPMC takes advantage of only the cache, whereas PMC exploits the prolonged idle periods (due to the spatial and temporal locality captured by the cache) through the use of self-refresh power down mode. Figure 6.2 reports the energy saving (on left y-axis) and performance degradation (on right y-axis) of the PMC system over the NoPMC system.

The percentage of energy savings (Energ Sav) and the percentage of performance degradation (Perf Degr) obtained with the PMC system over the NoPMC system are given in Equation 6.1 and Equation 6.2 respectively. Energ Consum in Equation 6.1 refers to the energy consumption and Execu Time in Equation 6.2 refers to the execution time.

Energ Sav (%) = ((Energ Consum(NoPMC) − Energ Consum(PMC)) / Energ Consum(NoPMC)) × 100        (6.1)

Perf Degr (%) = ((Execu Time(NoPMC) − Execu Time(PMC)) / Execu Time(NoPMC)) × 100        (6.2)

In Equation 6.1, the energy consumption of the PMC system is computed using the energy consumption of both the L2 cache and the DDR3 DRAM. The energy consumption of the L2 cache (Energ Consum L2) is calculated according to Equation 6.3. In Equation 6.3, the leakage power (consumed in circuit blocks [164]) and the dynamic energy per read/write access are retrieved from CACTI [81], while the total execution time and the total numbers of read/write accesses to the L2 cache are obtained from the cycle-accurate simulator.

Energ Consum L2 = (leakage power × execution time)
                + (dynamic energy per read access × total read accesses)
                + (dynamic energy per write access × total write accesses)        (6.3)
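Equations 6.1 to 6.3 translate directly into code. The sketch below uses our own variable names; the leakage power and per-access energies stand in for the CACTI outputs, and the remaining inputs for values reported by the cycle-accurate simulator:

    def l2_energy(leakage_power_w, exec_time_s, e_read_j, reads, e_write_j, writes):
        """Equation 6.3: leakage (static) energy plus dynamic read/write energy."""
        return leakage_power_w * exec_time_s + e_read_j * reads + e_write_j * writes

    def energy_saving_pct(energy_nopmc_j, energy_pmc_j):
        """Equation 6.1."""
        return (energy_nopmc_j - energy_pmc_j) / energy_nopmc_j * 100.0

    def perf_degradation_pct(time_nopmc_s, time_pmc_s):
        """Equation 6.2, exactly as defined in the text."""
        return (time_nopmc_s - time_pmc_s) / time_nopmc_s * 100.0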

For each application, we have plotted the energy saving and performance degradation for three different cache configurations: SC, the system with the smallest cache configuration; LC, the system with the largest cache configuration; and OC, the system with the optimal cache configuration (minimum total energy consumption of both the L2 cache and DDR3 DRAM). The smallest and largest caches are chosen to report the extreme points of the design space. The SC system reduced energy consumption for the relatively small applications, by around 60% for adpcmEnc and adpcmDec, and by around 2% for jpegEnc and jpegDec. For the remaining, relatively large applications, the SC system increased energy consumption by up to 32% (mpeg2Dec). This is because the use of the self-refresh power down mode is not suitable when the idle periods are short, due to the high wake-up latency. This fact can also be observed through the performance degradation, where the SC system degrades the performance by up to 52% (g721Enc) for the relatively large applications. Both the LC and OC systems significantly improve the energy saving compared to the SC system. This is because most of the memory requests are serviced by the cache, which increases the DRAM idle periods and reduces the frequency of DRAM transitions to self-refresh power down mode. The reduction in the frequency of DRAM transitions reduces the overhead of the wake-up latency, and hence the degradation of performance. Both the LC and OC systems degraded the performance by a maximum of only 2%. It is interesting to note that the OC system saves more energy than the LC system (g721Enc and g721Dec) because the largest cache consumes a relatively large amount of energy itself, which offsets the energy saving of the DDR3 DRAM. In essence, the largest cache may result in an overdesigned system. In summary, the OC system reduced energy consumption by up to 89% with a maximum performance degradation of only 2%. These results point to the fact that a suitable last-level cache, without overdesigning the system, should be used to exploit the full potential of the self-refresh power down mode for DDR3 DRAM.

Table 6.2 reports the cache configurations used in the OC systems for all eight applications, along with the area footprints of those cache configurations. It should be noted that the optimal cache configurations vary from one application to another because of the different memory access patterns. More importantly, the area footprint of the optimal cache configurations is significantly smaller than the area footprint of the largest cache (0.26 mm2). This result again points to the fact that use of the largest cache can result in an overdesigned system.

Performance degradation can be overwhelming, which in turn can increase the energy consumption, if the wrong intermediate buffer configuration is selected. Thus, there is a need to choose the right buffer configuration at design time in order to achieve the optimal energy consumption with the proposed PMC module.

Application   Optimal Cache Configuration               Area Footprint (mm2)
              [cache size, line size, associativity]
adpcm Enc     [16KB, 128B, 8A]                          0.035
adpcm Dec     [4KB, 32B, 8A]                            0.023
jpeg Enc      [128KB, 128B, 2A]                         0.22
jpeg Dec      [128KB, 128B, 2A]                         0.22
g721 Enc      [16KB, 16B, 2A]                           0.08
g721 Dec      [16KB, 128B, 8A]                          0.04
mpeg2 Enc     [64KB, 128B, 4A]                          0.11
mpeg2 Dec     [128KB, 128B, 2A]                         0.22

Table 6.2: Optimal Cache Configurations and their Area Footprints.

The impact on energy savings and performance of a poorly selected configuration (16KB prefetch buffer size, 32 word block size, 8-way associativity) can be seen in Figure 6.3. The data representation on the X-axis and the two Y-axes is similar to that in Figure 6.2. Minimal energy savings can be seen for jpegEnc and jpegDec (less than 3%), with performance penalties of 6% and 7% respectively, for the selected configuration. In such a case, it may not be useful to employ the PMC due to the additional overhead.

[Bar chart, titled "16KB Buffer Size, 32 Word Block Size, 8 Associativity Configuration": for each application, the energy savings (%) on the left y-axis and the performance slow down (%) on the right y-axis.]

Figure 6.3: Mixed Impact of Poorly Chosen Cache Configuration

6.4 Fast Design Space Exploration

Although our experiments show that exploration of the last-level cache design space for selection of the optimal cache configuration can yield significant energy savings with little performance degradation, such an exploration can be slow and time-consuming.

In this chapter, we used cycle-accurate simulations to obtain the energy savings of all the cache configurations; however, cycle-accurate simulations will be slow for larger applications and larger design spaces. In such a situation, design space exploration can be sped up by:

• Using analytical models or estimation methods to predict the energy saving for a given cache configuration; or,

• Using heuristics to cycle-accurately simulate/search a subset of design space.

Additionally, design space exploration can also include constrained optimisation, where an optimal cache configuration is searched for under different user constraints such as area, energy, execution time, etc.

6.5 Summary

In this chapter, we conducted a study on how the last-level cache can affect the idle periods of a DRAM, and how those idle periods can be exploited by the use of the self-refresh power down mode of DDR3 DRAM to enable maximum energy saving. We proposed a power mode controller to adaptively transition the DRAM to self-refresh power down mode when a memory request hits the last-level cache, and to activate the DRAM when a memory request misses the last-level cache. The proposed power mode controller works in conjunction with the cache controller and the memory controller. Our experiments with eight mediabench applications illustrate that the use of a suitable last-level cache with self-refresh power down mode can save up to 89% of energy consumption, with a maximum performance degradation of only 2%, compared to a system with just the standard memory controller. Therefore, we conclude that exploration of the last-level cache in the context of DDR3 DRAMs has significant potential for energy optimisation of the memory subsystem.

Chapter 7

Energy Reduction in DDR DRAMs

7.1 Introduction

Better energy efficiency increases reliability and improves battery life of embedded systems. Main memories in embedded systems consume up to 80% of total system power (excluding I/O power) [114] (also illustrated by our experiments reported in Figure 7.1). Thus, reducing energy consumption of the main memory can have a great impact on the overall energy efficiency of the system.

Figure 7.2, drawn to scale, shows the current drawn (which in turn governs the power) by a modern DDR3 DRAM memory. The activate current (which consists of the activation and precharge currents) is shown with a (red) dashed line, whereas the read and write current is shown with a (blue) dotted line. The former is used to activate the DRAM, whereas the latter is used to perform read/write operations. Much like other DDR DRAMs, DDR3 has to be refreshed periodically, which consumes the refresh current shown with a (purple) dashed-dotted line. The (green) solid line shows the background current, which is drawn continuously. However, the amount of background current varies depending upon the memory mode. Figure 7.2 shows this background current for the different low-power modes (see Table 7.1).


[Stacked bar chart: power consumption (mW) of the processor, L2 cache, DRAM background, DRAM active and DRAM refresh components for adpcmenc/dec, jpegenc/dec, g721enc/dec and mpeg2enc/dec; the background power is 90% of the DRAM power and 70% of the total system power.]

Figure 7.1: Power Consumption Breakdown in a Uniprocessor System with on-chip L1 cache, off-chip L2 cache and DRAM memory.

These modes are used when the DDR3 DRAM is not performing any read/write operations, and hence is idle. According to our experiments, it is the background power (due to background current) that consumes 90% of DRAM power and about 70% of total system power (see Figure 7.1). In this thesis, we focus on the use of the self-refresh Power Down (PD) mode to reduce this background power consumption, because it is the deepest low-power mode, and consumes approximately only a tenth of the background power of the shallowest low-power mode, Active StandBy (SB) mode (see Table 7.1).

Low-power Mode     Current (mA)    Wakeup Latency (clock cycles)
Active SB          57              0
Precharge SB       55              3
Active PD          35              4
Precharge PD       12              13
Self-refresh PD    6               64

Table 7.1: Power modes of Micron DDR3 DRAM. SB and PD stand for StandBy and PowerDown respectively.

Transition to any of the low-power modes comes at the cost of a wakeup latency, which is reported in Table 7.1 for the different low-power modes of DDR3 DRAM.

[Figure: memory command timeline (ACT, R/W, PRE, REF, entry into self-refresh mode) against clock cycles, showing the read/write current, activate current, refresh current and background current for the Active SB, Precharge SB, Active PD, Precharge PD and Self-refresh PD modes.]

Figure 7.2: Different Currents drawn by a DDR3 DRAM, adapted from Micron [2].

Efficient exploitation of a low-power mode requires DRAM idle periods to be long enough to amortise the performance penalty due to the wakeup latency. The high wakeup latency of the self-refresh PD mode (highest amongst all the low-power modes) has typically hampered its use, and only two works [35, 148] are known to date which employ this mode. The authors of [148] proposed a history-based predictor to forecast the durations of DRAM idle periods, and a memory controller policy for selective use of precharge PD and self-refresh PD modes. They focused on maximal exploitation of DRAM idle periods, without attempting to prolong those periods for complete use of the self-refresh PD mode. On the other hand, Amin et al. [35] proposed a cache replacement policy to skew cache-memory traffic in such a way that idle periods for some of the DRAM ranks (a DRAM consists of multiple ranks, where a rank is the smallest granularity at which low-power modes are applied) are prolonged to enable the use of the self-refresh PD mode for those ranks. Such a policy, as shown in [35], is effective, but requires the use of a completely new cache policy, and thus significant hardware modification.

In this thesis, we also focus on prolonging the DRAM idle periods; however, for the first time, we propose to do so by the use of a suitable last-level cache configuration (cache size, line size and associativity), which is obtained by exploration of the last-level cache design space. The selected cache configuration is the one which maximally reduces the total energy consumption of the last-level cache and DRAM. Our proposal is motivated by the fact that many modern embedded processors such as ARM [61], Xtensa [60], and Microblaze [165] allow configuration of the cache so that one can choose the cache configuration to be implemented. Additionally, since embedded systems execute an application or a class of applications repeatedly, one can tune cache parameters for better energy efficiency. In fact, the authors of [166] also advocated the importance of last-level cache exploration for a given application; however, they focused on energy reduction of only the cache rather than both the cache and DRAM.

The design space of the last-level cache can be explored either using full-system cycle-accurate simulations (including processor, caches and memory) or cache simulators. Cycle-accurate simulations are exorbitantly slow [157], especially when multi-level caches and DRAM are involved. Hence, they are not feasible for exploration of hundreds of cache configurations, which are typical of an embedded system design [157, 167]. Fast cache simulators [167], an alternative option, use the memory trace of an application to output the number of hits and misses for all the cache configurations. Calculation of DDR3 DRAM energy consumption is a complex process [2, 168], and using only the number of last-level cache hits/misses or the cache size is not sufficient. Figure 7.3 plots total energy reduction of cache and DRAM against the number of hits for different L2 cache configurations in a uniprocessor system with two levels of cache, executing the mpeg2Enc application, compared to a system without a second level cache. The cache configurations with maximum energy reduction (MaxER) and maximum number of hits (MaxHits) are marked along with the largest cache configuration. It can be seen that the MaxER configuration is different from both MaxHits and the largest cache.

[Figure: total cache and DRAM energy reduction (J) plotted against the number of cache hits, with the Max. Energy Reduction, Max. Hits and Largest Cache configurations marked.]

Figure 7.3: Total Cache and DRAM Energy Reduction for different L2 (last-level) Cache Configurations.

When the cache area footprint is compared, MaxER (0.052 mm2) is significantly smaller than MaxHits (3.22 mm2) and the largest cache (6.21 mm2). This plot illustrates the fact that cache hits or cache sizes alone are not sufficient for selection of a suitable cache configuration. Hence, the challenge is to quantify the effect of last-level cache configurations on DRAM energy consumption with a minimal number of cycle-accurate simulations. Therefore, in this chapter, we propose a novel DRAM energy reduction estimator to quickly predict energy reductions for differing cache configurations, and a novel framework around the estimator for quick exploration and selection of a suitable last-level cache configuration.

Novel Contributions. In particular, we make the following contributions in this chapter.

• A novel DRAM energy reduction estimator. The estimator uses five parameters: self-refresh PD cycles, self-refresh PD count, row buffer conflicts, refresh count and DRAM traffic to predict the energy reduction of DRAM for a given last-level cache configuration. The estimator is based upon a Kriging model that is derived through a small number of cycle-accurate simulations. Our estimator has an average error of less than 4.4% when tested on several applications from the mediabench and SPEC2000 suites, and on 256MB and 4GB DRAMs.

• XDRA framework. Our framework integrates processor-memory and cache simulators with our estimator with the help of novel analysis techniques. These techniques do not require cycle-accurate simulations of all the cache configurations for computation of the estimator parameters, and thus enable fast exploration of last-level cache configurations. When used to select cache configurations with maximal reduction in total energy of last-level cache and DRAM, XDRA found the same configurations as cycle-accurate simulations in 20 out of 22 cases. The suboptimal configurations were off by a maximum of 3.9% from their optimal counterparts.

Motivational Example. To illustrate how differing cache configurations can affect DRAM idle periods and improve its energy efficiency, let us consider a JPEG encoder (jpegEnc) application running on a uniprocessor system with L1 and L2 caches and a DRAM memory. Figure 7.4 reports DRAM activity for several thousand cycles, where a value of 1 means the DRAM is accessed and a value of 0 means that it is idle. Two L2 cache configurations are used: 1) a 4KB, 4B line size, direct-mapped cache, denoted as [4KB, 4B, 1A] and shown in the lower plot; and 2) a 128KB, 8B line size and 2-way associative cache, denoted as [128KB, 8B, 2A] and shown in the upper plot. The values inside the parentheses report DRAM idle cycles and the number of idle periods. For instance, in the [128KB, 8B, 2A] L2 cache system, DRAM was idle for 27.6 million cycles, distributed across 7,700 idle periods. The [128KB, 8B, 2A] L2 cache increased DRAM idle cycles by 22%, which was expected due to its larger size. The number of idle periods reduced from 270,000 to only 7,700, and this reduction will be advantageous in reducing DRAM energy consumption because: 1) idle periods will be longer, enabling DRAM's transition to self-refresh PD mode; 2) the DRAM can remain in self-refresh PD mode for longer periods; and 3) fewer wakeups from self-refresh PD mode mean less performance penalty. The plot for the [128KB, 8B, 2A] cache corroborates our belief, where several short idle periods of the [4KB, 4B, 1A] cache that are not suitable for self-refresh PD mode have mostly been consolidated into two longer idle periods which in fact are long enough to transition DRAM into self-refresh PD mode. In this example, the [128KB, 8B, 2A] cache reduced 38% more DRAM energy consumption than the [4KB, 4B, 1A] cache. This example shows the importance of exploring the last-level cache: choosing a suitable configuration in the first place can tremendously reduce DRAM energy consumption.

[Figure: DRAM activity (0 = idle, 1 = accessed) over clock cycles for the [128KB, 8B, 2A] cache (27.6M idle cycles, 7.7K idle periods) and the [4KB, 4B, 1A] cache (22.6M idle cycles, 270K idle periods); long idle periods increase the chance of using self-refresh PD mode, whereas short idle periods are not suitable for it.]

Figure 7.4: Effect of two distinct L2 (last-level) Cache Configurations on DRAM idle periods. PD stands for PowerDown.

The rest of this chapter is organised as follows. Section 7.2 describes our target system and the specific problem we aim to solve. Section 7.3 presents our proposed DRAM energy reduction estimator. Section 7.4 explains our proposed quick exploration framework to obtain a suitable last-level cache for maximum DRAM energy reduction. The experimental results are explained in Section 7.5. Advantages and limitations of our proposed XDRA framework are discussed in Section 7.6. Finally, Section 7.7 summarises this chapter.

7.2 Problem Statement

We target a uniprocessor system with a multi-level cache hierarchy and a DDR3 DRAM. An example system is shown in Figure 7.5, where the processor has separate on-chip L1 instruction and data caches, which are connected to a unified off-chip L2 cache.

The L2 cache is interfaced to the DRAM memory through a memory controller. The DRAM contains both application instructions and data. Here, L2 is the last-level cache and we use this system as an example throughout the thesis in addition to its use in our experiments.

[Figure: the CPU with on-chip L1 instruction and data caches connects to a unified off-chip L2 cache; the L2 cache connects through a Memory Controller (containing the PMC, which issues memory control operations) to the DRAM of the memory system.]

Figure 7.5: An example of Target System.

The DDR3 DRAM contains an internal mode controller which automatically sets it into one of the shallow or deep low-power modes whenever it is idle. The Power Mode Controller (PMC) is another module that is typically embedded into the memory controller firmware to control the use of low-power modes [144]. In our work, the PMC transitions the DRAM into the deepest low-power mode, self-refresh PD, if it is idle for at least a threshold number of clock cycles, as used in [32, 35]. This is done to avoid significant degradation of performance that may arise from greedy use of self-refresh PD mode for every idle period. We use 30 cycles as the self-refresh PD threshold (PDthreshold), which was obtained through experimentation [32, 35].

Given an application, a uniprocessor system with unified last-level cache and DRAM, a PDthreshold and last-level cache configurations (size, line size and associativity), our goal is to determine, for all the last-level cache configurations, the reduction in combined energy consumption of the last-level cache and DRAM compared to the system without last-level cache. Using the aforementioned energy reductions, we aim to select the cache configuration with maximal energy reduction with/without a constraint on last-level cache area footprint.
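To make the selection criterion explicit, the goal can be stated as a small optimisation problem. This is only a minimal formalisation of the statement above; the symbols ER(c), E_DRAM(c), E_cache(c), Area(c) and A_max are introduced here purely for illustration:

$$ER(c) = E^{\mathrm{noLC}}_{\mathrm{DRAM}} - \bigl(E_{\mathrm{DRAM}}(c) + E_{\mathrm{cache}}(c)\bigr), \qquad c^{*} = \arg\max_{c \in \mathcal{C}} ER(c) \;\; \text{subject to (optionally)} \;\; Area(c) \le A_{\mathrm{max}},$$

where $\mathcal{C}$ is the set of last-level cache configurations (size, line size, associativity), $E^{\mathrm{noLC}}_{\mathrm{DRAM}}$ is the DRAM energy consumption of the system without a last-level cache, and $A_{\mathrm{max}}$ is the optional area budget.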

7.3 DRAM Energy Reduction Estimator

Estimation of DRAM energy consumption is a complex process [2, 168] due to the involvement of different DRAM states and their transitions during the execution of an application. In this chapter, we build a DRAM energy reduction estimator based upon the concept of Kriging models [169]. Note that CACTI [81] does not consider the detailed DRAM architecture, and hence is not suitable for estimation of DRAM energy consumption [168].

Estimator Parameters. The parameters of an estimator are a key choice as they directly influence the accuracy, and hence the usefulness, of an estimator. One should only include the most influential parameters, because other parameters with little influence increase an estimator's complexity without any useful benefit. We conducted several experiments with differing last-level cache configurations to record DRAM energy reduction, and used a correlation-based technique to obtain the significance of the following most common parameters (some of which are explained in more detail later) on the amount of energy reduction:

• Application parameters: total memory requests (TMR).

• Cache parameters: size (CS), line size (CLS), associativity (CA), hits (CH) and misses (CM).

• DRAM parameters: accesses (DAs), non-accesses (DNAs), memory traffic (MT), self-refresh PD cycles (PDcycles), number of times self-refresh PD mode is used (PDCnt), row buffer conflicts (RBConf) and number of times DRAM is refreshed (RefCnt).

Figure 7.6 depicts the average value of correlation coefficients for all the parameters over 2 DRAM sizes (256MB and 4GB) and 11 applications (from the mediabench and SPEC2000 benchmarks). A value of +1 signifies a perfect positive correlation (that is, an increase in the parameter value will increase energy reduction), whereas a value of -1 signifies a perfect negative correlation (that is, an increase in the parameter value will decrease energy reduction). A value close to zero means no correlation. It can be seen from the figure that PDcycles, PDCnt, RBConf, RefCnt and MT are the most significant parameters (marked with rectangles) as their correlation values are either greater than 0.8 or less than -0.8. An intuitive reasoning for these parameters is as follows:

• PDcycles: the total number of cycles for which DRAM is in PD mode during the execution of an application. This parameter depends on the total number of DRAM idle cycles and the PDthreshold. More PDcycles mean DRAM remains in PD mode for longer periods, providing more energy reduction.

• PDCnt: the total number of times DRAM is transitioned to self-refresh PD mode. More transitions to self-refresh PD mode mean more overhead, which results in less energy reduction.

• RBConf: In DRAM, data is brought to a row buffer before it can be accessed [2]. A row buffer conflict occurs when the requested data is not present in the row buffer, resulting in its reloading (precharging and/or activation), which increases DRAM energy consumption.

• RefCnt: DRAM must be refreshed periodically to retain its contents. Refreshing DRAM consumes refresh power, which in turn increases DRAM energy consumption.

• MT: The product of the last-level cache line size and the total number of DRAM accesses measures the amount of traffic going to DRAM during the execution of an application. More traffic to DRAM means that it will be active for a longer time, increasing its energy consumption.
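A minimal sketch of the correlation-based screening described above, assuming the per-configuration parameter values and measured energy reductions from the experiments are already available as arrays. The function and variable names are illustrative and not part of the thesis tooling:

```python
import numpy as np

def screen_parameters(samples: dict, energy_reduction: np.ndarray, cutoff: float = 0.8):
    """Rank candidate parameters by their correlation with DRAM energy reduction.

    samples maps a parameter name (e.g. 'PDcycles', 'MT') to a 1-D array holding one
    value per simulated cache configuration; energy_reduction holds the measured
    energy reduction for the same configurations.
    """
    significant = {}
    for name, values in samples.items():
        # Pearson correlation coefficient between the parameter and energy reduction.
        r = np.corrcoef(values, energy_reduction)[0, 1]
        if abs(r) >= cutoff:            # keep only strongly correlated parameters
            significant[name] = r
    return dict(sorted(significant.items(), key=lambda kv: -abs(kv[1])))
```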

7.3.1 Kriging Model

Kriging models take into account the spatial correlation between the current design point, x_i, and an initial set of design points (training set), x, to estimate the output at x_i. We chose the Kriging model because it allows various polynomial functions and correlation functions, and does not restrict to just linear regression models. Additionally, Kriging models can capture complex interactions between parameters due to the use of spatial correlations, and thus perform better than linear regression models [170]. Our experiments reveal similar results, which are detailed in Section 7.5.

[Figure: bar chart of correlation coefficients, ranging from -1 to +1, for CS, CLS, CA, CH, CM, TMR, DAs, DNAs, MT, PDCycles, PDCnt, RBConf and RefCnt; PDCycles, PDCnt, RBConf, RefCnt and MT are highlighted as the most significant parameters.]

Figure 7.6: Correlation Coefficients of most common Parameters, averaged over 2 DRAM sizes and 11 Applications.

More formally, a Kriging model combines a global model with local trends in the form of:

$$y(x_i) = f(x_i) + z(x_i)$$

where y(x) is to be estimated, f(x) is a known approximation function, and z(x) is a stochastic process with mean zero, variance $\sigma^2$, and non-zero covariance. While f(x) globally approximates the design space, z(x) creates local deviations which are interpolated by the Kriging model with the use of spatial correlations. In other words, the regression function f(x) captures the global impact of the parameters, whereas the correlation function z(x) captures the local impact of the parameters. In many cases, f(x) is taken as either a constant or a linear polynomial in x's parameters [171]. Additionally, many correlation functions such as exponential, gaussian, etc. are available to be used as z(x), where the correlation of two points depends upon the weighted distance between them. For example, the widely used exponential correlation function is:

$$\mathrm{corr}(x_i, x_j) = \prod_{h=1}^{p} e^{-\theta_h \lvert x_i(h) - x_j(h) \rvert^{s_h}}$$

where p is the total number of parameters in a design point, and $\theta_h$ and $s_h$ are unknowns that govern the correlation distance weights. $\theta_h$ represents the significance of parameter h, whereas $s_h$ represents the smoothness of the function in the direction of parameter h. Given a training set, the coefficients of f(x) and the weights of z(x) are estimated using the maximum likelihood technique [172]. Once these are known, the final model is written as:

$$\hat{y}(x_i) = \hat{f}(x) + r(x)^{\prime} R^{-1} \bigl(y - \hat{f}(x)\bigr)$$

where r(x) is the correlation vector between $x_i$ and x, R is the correlation matrix representing the correlation between all the pairs of design points in x, and y contains the output values at the design points in the training set x.
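The thesis builds its Kriging models with the DACE toolbox for MATLAB (see Section 7.5). Purely as an illustration of the same idea, the sketch below fits a Kriging-style model (a Gaussian process with a constant trend) on the five-parameter training set using scikit-learn; the use of GaussianProcessRegressor, the RBF kernel and the file names are assumptions for illustration, not the thesis's configuration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# X_train: one row per training configuration, columns [PDcycles, PDCnt, RBConf, RefCnt, MT];
# y_train: measured DRAM energy reduction for each training configuration (cycle-accurate runs).
X_train = np.loadtxt("training_parameters.csv", delimiter=",")        # placeholder input files
y_train = np.loadtxt("training_energy_reduction.csv", delimiter=",")

# Constant trend (the regression part f(x)) times a smooth correlation function (the z(x) part);
# one length-scale per parameter plays the role of the theta_h weights above.
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(X_train.shape[1]))
model = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
model.fit(X_train, y_train)

# Predict the energy reduction for the parameters of an unexplored cache configuration.
x_new = np.array([[1.2e7, 5400, 3100, 180, 2.6e6]])                   # illustrative values only
predicted_er = model.predict(x_new)
```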

7.4 XDRA Framework

Our framework, XDRA, which quickly explores the last-level cache in the context of DRAM energy reduction, is shown in Figure 7.7. XDRA integrates our estimator with a cycle-accurate processor-memory simulator and a cache simulator, and employs our novel analysis techniques to quickly compute the parameters used in the estimator. The following paragraphs explain the core components of XDRA in more detail.

Last-level Cache Idle (LCI) Profile Generation. An application is simulated in a cycle-accurate processor-memory simulator without the last-level cache and the PMC's power down mechanism to record two entities at the second-last-level cache and memory interface: 1) the No Last-level Cache (NoLC) memory trace; and 2) the LCI profile. The NoLC trace will contain only those memory requests that miss in the lower level caches. For instance, in Figure 7.5 without the L2 cache, the memory trace captured at the L1–memory interface will only contain L1 cache misses. An LCI period refers to an application's execution period that does not access DRAM. For instance, at the L1–memory interface without the L2 cache in Figure 7.5, LCI periods will be the execution periods with no memory requests from the application (consecutive non-load and non-store instructions) and the execution periods with memory requests that hit in the L1 caches. Hence, the L2 cache will be idle during LCI periods.

[Figure: the application and the last-level cache configurations feed a processor-memory simulator (producing the NoLC trace and LCI profile of the base system), a cache simulator (producing a cache profile per configuration) and CACTI (producing cache energy and area); cache profile analysis computes PDCycles, PDCnt, RBConf, RefCnt and MT under the given PDthreshold, which drive the DRAM energy reduction estimator and cache exploration to output the best cache configuration. NoLC: No Last-level Cache; LCI: Last-level Cache Idle; the base system excludes the last-level cache and the PD mechanism.]

Figure 7.7: XDRA Framework. Dotted-lined rectangles and broken arrows show our novel contributions.

Figure 7.8 illustrates such an LCI period. An LCI profile captures all LCI periods in clock cycles from the execution of an application. For an in-order processor, which is typical of embedded processors, an LCI profile of the application will not change across different last-level cache configurations because the processor pipeline is stalled during each memory request and the lower level caches remain unchanged. Note that DRAM will be idle during all LCI periods and hence these periods will contribute to DRAM idle periods. The DRAM energy consumption from this simulation is used as the reference point for calculation of energy reductions for all the last-level cache configurations. This is the only step in XDRA where an application is cycle-accurately simulated.
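As a rough illustration of what the LCI profile contains, the sketch below derives LCI periods from a per-request trace captured at the L1–memory interface. The trace format (issue cycle plus a flag marking requests that go to memory) and all names are assumptions made for this sketch only:

```python
from typing import List, Tuple

def build_lci_profile(trace: List[Tuple[int, bool]], total_cycles: int) -> List[Tuple[int, int]]:
    """Return LCI periods as (start_cycle, length_in_cycles) pairs.

    trace: (cycle, goes_to_memory) entries in program order; goes_to_memory is True
    only for requests that miss all lower-level caches. The cycles between two such
    misses (non-memory instructions and L1 hits) form an LCI period.
    """
    lci_periods = []
    prev_miss_cycle = 0
    for cycle, goes_to_memory in trace:
        if goes_to_memory:
            gap = cycle - prev_miss_cycle
            if gap > 0:
                lci_periods.append((prev_miss_cycle, gap))   # last-level cache idle in this gap
            prev_miss_cycle = cycle
    if total_cycles > prev_miss_cycle:                        # trailing idle period, if any
        lci_periods.append((prev_miss_cycle, total_cycles - prev_miss_cycle))
    return lci_periods
```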

Cache Simulation. In this step, we feed the NoLC trace to a cache simulator to generate a cache profile for each of the last-level cache configurations. A cache profile reports whether DRAM will be accessed or not for each memory request in the NoLC trace. A memory request in the NoLC trace is considered a DRAM access or non-access in the cache profile depending upon the cache policy (such as write-back, write-allocate, etc.) and whether the request hits or misses the cache. For more details, readers are referred to [173].

[Figure: instruction sequence with per-instruction L1 outcome: ld R2, R1+20 (L1 CM); shl R2, 1 (N/A); ld R3, R1+10 (L1 CM); neg R3 (N/A); xor R2, R3 (N/A); inc R4 (N/A); mov R3, R4 (L1 CH); ld R5, R1+40 (L1 CM). The run of non-memory instructions and L1 cache hits between two L1 cache misses forms a Last-level Cache Idle (LCI) period. L1 CH: L1 Cache Hit; L1 CM: L1 Cache Miss; N/A: Not Applicable.]

Figure 7.8: An example of an LCI period for the Target System of Figure 7.5 (L2 is the last-level Cache).
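A minimal sketch of how a cache profile could be derived from the NoLC trace, assuming a write-back, write-allocate last-level cache; cache.access is an assumed cache-simulator hook introduced only for this sketch, and the handling of dirty evictions is a simplification:

```python
def cache_profile(nolc_trace, cache):
    """Classify each NoLC-trace request as a DRAM access (DA) or non-access (DNA).

    For a write-back, write-allocate last-level cache: hits are DNAs, misses are DAs
    (line fill), and evicting a dirty victim adds one extra DA for the write-back.
    cache.access(addr, is_write) is assumed to return (hit, dirty_victim_evicted).
    """
    profile = []
    for addr, is_write in nolc_trace:
        hit, dirty_victim = cache.access(addr, is_write)
        if hit:
            profile.append("DNA")
        else:
            profile.append("DA")            # line fill from DRAM
            if dirty_victim:
                profile.append("DA")        # write-back of the evicted dirty line
    return profile
```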

Cache Profile Analysis. This step analyses the cache profile of each last-level cache configuration to compute the five parameters that are used in our estimator. A DRAM Idle (DI) profile captures the duration of each DI period from a given cache profile and the LCI profile of the application, where a DI period refers to consecutive cycles of DRAM idleness. First, we insert LCI periods from the LCI profile between the appropriate DRAM accesses and non-accesses in the cache profile. Then, all DI periods are marked, where an idle period consists of consecutive DRAM non-accesses and LCI periods. PDcycles are computed by summing all DI periods, whereas PDCnt is equal to the number of DI periods that are greater than the PDthreshold. Note that this step allows a designer to apply any threshold that he/she deems suitable.

Figure 7.9 shows an example of how PDcycles and PDCnt are derived from a cache profile and an LCI profile. The cache profile reports DRAM accesses and non-accesses while the LCI profile reports L2 cache idle periods. First, each LCI period from the LCI profile is inserted after the appropriate Memory Request (MR) in the cache profile. For example, the second LCI period occurs after MR3 and hence is inserted between MR3 and MR4 in the DI profile. After merging these two profiles, the DI profile shows all the DI periods marked in dotted-lined rectangles. The cycles in each DI period are calculated using a 4 cycle DRAM non-access latency (the L2 cache hit latency). For example, for the first DI period, there are 20 and 110 cycles from two LCI periods and 16 cycles from 4 DRAM non-accesses, totalling 146 cycles. Finally, the DI profile is converted to PDcycles by applying the PDthreshold, that is, the initial threshold number of cycles in each DI period will not contribute to PDcycles. Note that the last DI period will be filtered out since its duration is only 28 cycles. For this example, DRAM will remain in self-refresh PD mode for 156 cycles, whereas PDCnt will be 2.
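A minimal sketch of this thresholding step, mirroring the example of Figure 7.9. It assumes the DI period lengths have already been computed by merging the cache and LCI profiles (with 4 cycles charged per DRAM non-access); the function name is illustrative:

```python
def pd_cycles_and_count(di_periods, pd_threshold=30):
    """Apply the PD threshold to DRAM Idle (DI) period lengths given in cycles.

    The first pd_threshold cycles of each DI period do not contribute to PDcycles,
    and DI periods no longer than the threshold never enter self-refresh PD mode.
    """
    pd_cycles = 0
    pd_count = 0
    for length in di_periods:
        if length > pd_threshold:
            pd_cycles += length - pd_threshold
            pd_count += 1
    return pd_cycles, pd_count

# The example of Figure 7.9: DI periods of 146, 70 and 28 cycles give 156 PD cycles and PDCnt = 2.
assert pd_cycles_and_count([146, 70, 28]) == (156, 2)
```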

The number of row buffer conflicts, RBConf, depends upon the DRAM address mapping and the row buffer policy (such as open-page, closed-page, etc.). A row buffer conflict occurs when the requested data is not present in the row buffer, which requires its loading from DRAM through the process of activation. We calculate the location (rank, bank and row ids [2]) of each DRAM access using the DRAM address mapping, which is known a priori. Afterwards, if the rank and bank ids of two successive DRAM accesses are the same, but the row ids are different, then we record a row buffer conflict. Depending upon the row buffer policy, the row buffer may be empty (closed) between DRAM accesses to the same bank, which is accounted for using the information from the row buffer policy. Note that the aforementioned method does not capture all the row buffer conflicts because DRAM needs to be refreshed periodically to retain its contents. During the refresh process, the contents of the row buffer are lost, which results in a row buffer conflict for the next DRAM access. Thus, the inaccuracies in RBConf depend upon RefCnt, the number of times DRAM is refreshed during the execution of an application.

[Figure: the cache profile (MR1-MR10 marked as DRAM Access (DA) or DRAM Non-Access (DNA)) is merged with the LCI profile (LCI periods of 20, 110, 70 and 20 cycles inserted after MR1, MR3, MR5 and MR8) to form the DI profile; with a 4-cycle DNA latency and a PDthreshold of 30 cycles, DI periods of 146, 70 and 28 cycles yield PD periods of 116, 40 and 0 cycles, so PDcycles = 116 + 40 = 156 cycles and PDCnt = 2. MR: Memory Request; DA: DRAM Access; DNA: DRAM Non-Access; LCI: Last-level Cache Idle; DI: DRAM Idle; PD: Power Down.]

Figure 7.9: An Example of calculating PDcycles and PDCnt.

The parameter RefCnt depends upon the time for which DRAM is active. Accurate computation of DRAM's active time is not possible because DRAM accesses take differing numbers of cycles depending upon the DRAM state [156], which is unknown unless cycle-accurate simulation is performed. Therefore, we estimate the DRAM active time as [DRAM accesses × fixed DRAM access latency], where the fixed latency is an average of the latencies of all DRAM accesses from the cycle-accurate simulation performed during the 'LCI profile generation' step. This DRAM active time is then divided by the refresh period of DRAM, which is known a priori (for example, 64 ms for DDR3 from Micron [2]), to get an estimate of RefCnt. Since both RBConf and RefCnt are estimated, they might cause inaccuracies in the estimator; however, our results (see Section 7.5) show that such inaccuracies are small.
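A minimal sketch of these two estimates; the address-decoding helper, the clock constant and all names are placeholders and depend on the actual DRAM address mapping, row buffer policy and clock of the system under design:

```python
def estimate_refcnt(dram_accesses: int, avg_access_latency_cycles: float,
                    clock_period_ns: float, refresh_period_ms: float = 64.0) -> int:
    """Estimate RefCnt as [estimated DRAM active time / refresh period]."""
    active_time_ns = dram_accesses * avg_access_latency_cycles * clock_period_ns
    return int(active_time_ns / (refresh_period_ms * 1e6))

def estimate_rbconf(accesses, decode):
    """Count row-buffer conflicts: same rank and bank but a different row on successive accesses.

    accesses: DRAM addresses in access order; decode maps an address to (rank, bank, row)
    using the address mapping known a priori.
    """
    conflicts = 0
    last_row = {}                                  # (rank, bank) -> last open row
    for addr in accesses:
        rank, bank, row = decode(addr)
        if last_row.get((rank, bank), row) != row:
            conflicts += 1
        last_row[(rank, bank)] = row
    return conflicts
```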

The parameter MT measures the total traffic going to DRAM during the execution of an application. MT is calculated in bytes by multiplying the number of DRAM accesses by the last-level cache line size. In this way, we analyse the cache profile of each last-level cache configuration to calculate the values of the five parameters for use in our estimator, without the need for cycle-accurate simulations of those configurations.

Cache Exploration. At the final step, DRAM energy reduction is computed for each last-level cache configuration using the PDcycles, PDCnt, RBConf, RefCnt and MT in our estimator. The predicted energy reductions are adjusted by subtracting the energy consumptions of the respective cache configurations, which are obtained from CACTI [81]. Once the net DRAM energy reductions for all last-level cache configurations are available, the cache configuration with maximum reduction is selected. Here, the decision to search all the cache configurations for the global optimum or to use a heuristic search is left to the designer. We chose to use the former approach in this thesis.
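The final selection step can be summarised by a short sketch (illustrative only: estimator.predict stands in for the Kriging model of Section 7.3, while cache_energy and cache_area stand in for the CACTI figures):

```python
def explore(configs, estimator, cache_energy, cache_area, area_limit=None):
    """Return the cache configuration with maximum net energy reduction.

    configs maps a configuration (size, line_size, assoc) to its five estimator
    parameters [PDcycles, PDCnt, RBConf, RefCnt, MT]; cache_energy and cache_area
    map the same configurations to energy (J) and area (mm^2) values.
    """
    best, best_net_er = None, float("-inf")
    for cfg, params in configs.items():
        if area_limit is not None and cache_area[cfg] > area_limit:
            continue                                   # optional area constraint
        net_er = estimator.predict([params])[0] - cache_energy[cfg]
        if net_er > best_net_er:
            best, best_net_er = cfg, net_er
    return best, best_net_er
```

This exhaustive scan mirrors the global search chosen in the thesis; a heuristic search would simply restrict the set of configurations visited.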

7.5 Experiments and Results

We use the target system of Figure 7.5 with LRU and write-back policies for the L2 cache. The target system is implemented using Tensilica's LX4 processor [60] with 2KB L1 instruction and 1KB L1 data caches, both direct-mapped with 4B line size. We used -125E DDR3 DRAM (1600 Million Transfers per second) from Micron [2] to implement the DRAM memory. We created two target systems with 256MB (1 rank) and 4GB (4 ranks) memories to observe the applicability of our estimator. These DRAMs had an interface width of 4B and used an open-page row buffer policy and an internal refresh time of 64 ms. A threshold of 30 cycles and a wakeup latency of 64 cycles were used for the self-refresh Power Down (PD) mode.

We used the cycle-accurate simulator presented in Chapter 4 [156] (which integrated DRAMSim [84] with the LX4's simulator to create a detailed processor-memory simulator) and the tool from [167] as the processor-memory simulator and cache simulator in XDRA. In addition, CACTI 6.5 [81] was configured for a 90nm technology to obtain the energy consumption and area of the L2 cache configurations. The L2 cache design space consisted of 330 configurations, constituted by changing the cache size from 4KB to 4MB, the line size from 4B to 128B, and the associativity from 1 to 16. These configurations are typically explored in an embedded system's design [157, 167]. For evaluation purposes, we used the adpcmEnc, adpcmDec, jpegEnc, jpegDec, g721Enc, g721Dec, mpeg2Enc and mpeg2Dec applications from mediabench, and the memory-bound vpr, gzip and bzip2 applications from SPEC2000 [174]. For the SPEC2000 applications, the first 500 million memory requests were used [175]. Please note that the power consumption breakdown for applications from the SPEC2000 benchmark is not reported in Figure 7.1. This is because we used the first 500 million memory requests for the SPEC2000 applications, and stopped the cycle-accurate simulation afterwards. Stopping a simulation in Tensilica's XTensa MultiProcessor (XTMP) environment does not generate a partial energy consumption report for the processor and caches. All experiments were conducted on an Intel Xeon 64 core machine with 256GB RAM.

For selection of design points in the training set, we use a well-known design of experiments technique - Latin Hypercube Sampling (LHS). LHS is a statistical technique that evenly distributes multiple points in the design space so that all segments of the design space's dimensions are represented. The minimum requirement of LHS for the number of design points is the summation of the ranges of all the design space dimensions, which in our case is 22 (11 cache sizes, 6 line sizes and 5 associativities). Additionally, we used 8 corner design points [176] from the design space to create a training set with 30 design points. The training set was cycle-accurately simulated to capture the actual energy reductions and the values of the parameters in the Kriging model. We created Kriging models using the DACE toolbox [171] for MATLAB, where constant and linear polynomials were used for the regression function f(x), while linear, gaussian, exponential, spherical and spline functions were used as the correlation function z(x). Note that training of a Kriging model takes a few seconds, and hence several different functions can be evaluated to best fit the training set data.
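As an illustration of the training-set construction, the sketch below draws a 22-point LHS sample over the three cache dimensions and snaps it to the discrete configuration grid; the use of scipy.stats.qmc and the snapping helper are assumptions made for this sketch (the thesis does not prescribe a particular LHS implementation), while the grids match the design space stated above:

```python
import numpy as np
from scipy.stats import qmc

cache_sizes   = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]   # KB (11 values)
line_sizes    = [4, 8, 16, 32, 64, 128]                               # bytes (6 values)
associativity = [1, 2, 4, 8, 16]                                      # ways (5 values)

sampler = qmc.LatinHypercube(d=3, seed=0)
unit_points = sampler.random(n=22)            # 11 + 6 + 5 = 22 LHS points in the unit cube

def snap(u, grid):
    # Map a unit-interval coordinate to the nearest index of a discrete dimension.
    return grid[min(int(u * len(grid)), len(grid) - 1)]

training_set = {(snap(p[0], cache_sizes), snap(p[1], line_sizes), snap(p[2], associativity))
                for p in unit_points}
# The 8 corner design points (e.g. the smallest and largest configurations) are added
# afterwards to reach the 30-point training set described above.
```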

LinReg and Simple Estimators. For rigorous evaluation of our estimator, we created two other estimators: 1) LinReg, based on a linear regression model; and 2) Simple, based on a constant energy per DRAM access model.

The first estimator is based on linear regression where the same parameters as the Kriging estimator are used. We refer to it as the LinReg estimator, and it is written as:

$$ER = \beta_0\,\mathrm{PDCycles} + \beta_1\,\mathrm{PDCnt} + \beta_2\,\mathrm{RBConf} + \beta_3\,\mathrm{RefCnt} + \beta_4\,\mathrm{MT}$$

The coefficients of the LinReg estimator were computed using the same training set as used for the Kriging estimator.
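A minimal sketch of this coefficient fit via ordinary least squares; the file names are placeholders for the training data described above:

```python
import numpy as np

# One row per training configuration: [PDcycles, PDCnt, RBConf, RefCnt, MT];
# y holds the measured energy reductions from the cycle-accurate training runs.
X = np.loadtxt("training_parameters.csv", delimiter=",")
y = np.loadtxt("training_energy_reduction.csv", delimiter=",")

beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # beta_0 ... beta_4 of the LinReg estimator
predict = lambda params: np.asarray(params) @ beta
```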

The Simple estimator is written as:

$$ER = \bigl(DA_{wc} - DA_{c}\bigr) \times \frac{\mathrm{DRAMEnergy}_{wc}}{DA_{wc}}$$

where DAwc and DAc are the number of DRAM accesses without last-level cache and with a given last-level cache configuration respectively. DRAMEnergywc is the energy consumption of DRAM in a system without last-level cache and is computed from the cycle-accurate simulation of such a system. The first factor estimates the number of DRAM accesses reduced by a given cache configuration, which is then multiplied by a constant energy consumption per DRAM access (second factor) to compute overall DRAM energy reduction.

Estimator Accuracy. We evaluated the accuracy of the three estimators by calculating average errors using the estimated energy reductions and the actual energy reductions from cycle-accurate simulations for all the cache configurations. Figure 7.10 reports the average errors when 256MB DRAM is used for all the applications. Our estimator had a worst average error of 2.2% across all the applications. On the other hand, LinReg and Simple had maximum average errors of 15.3% and 6.8% respectively. For 4GB DRAM, the errors are 3.4%, 26.2% and 99.4% for the Kriging, LinReg and Simple estimators, which are illustrated in Figure 7.11. It is noteworthy that the average error of our estimator is consistently low compared to the significant variations for LinReg and Simple, which means better applicability for our estimator.

To further stress the accuracy of the estimators, we performed constrained optimisation on the estimated cache design spaces. We searched for a cache configuration with maximum energy reduction under a constraint on the area footprint of the cache, which is typical of embedded systems. For each application, we applied several constraints, ranging from the largest to the smallest cache area. The cache configurations selected by the estimators were compared to the corresponding cache configurations from the actual design space (obtained using the same constraints) to record the minimum and maximum errors in energy reduction over all the constraints. The results for 256MB DRAM are illustrated in Figure 7.12 for all the applications. Again, our estimator performs consistently well with a maximum error of less than 4.4% over all the applications. The LinReg and Simple estimators had significant variations in the selection of cache configurations, with errors up to 74.1% and 92.3% respectively, which renders them unreliable compared to our estimator. Results for 4GB DRAM are reported in Figure 7.13, where the Kriging, LinReg and Simple estimators had errors up to 3.3%, 67.9% and 7% respectively. These results show that our estimator is better at consistently modelling DRAM energy reductions with differing last-level cache configurations, and thus the rest of this section reports results for our estimator only.

[Figure: average error (%) in estimated energy reduction of the Kriging, LinReg and Simple estimators for each application (adpcmEnc, adpcmDec, jpegEnc, jpegDec, g721Enc, g721Dec, mpeg2Enc, mpeg2Dec, spec_vpr, spec_gzip, spec_bzip2) with 256MB DRAM.]

Figure 7.10: Average Error in estimated Energy Reduction for 256MB DRAM.

[Figure: average error (%) in estimated energy reduction of the Kriging, LinReg and Simple estimators for each application with 4GB DRAM.]

Figure 7.11: Average Error in estimated Energy Reduction for 4GB DRAM.

7.5. EXPERIMENTS AND RESULTS 141

spec_gzip

spec_bzip2

spec_vpr

mpeg2Dec

mpeg2Enc

g721Dec g721Enc

ConstEnrg Model jpegDec

jpegEnc

adpcmDec

adpcmEnc

spec_gzip

spec_bzip2

spec_vpr

mpeg2Dec

mpeg2Enc

g721Dec

g721Enc jpegDec

LinReg ModelLinReg jpegEnc

adpcmDec

adpcmEnc

spec_gzip

spec_bzip2

spec_vpr

mpeg2Dec

mpeg2Enc

g721Dec

g721Enc jpegDec

KrigingModel jpegEnc

Error in Energy Reduction from Cache Configurations selected under differing Area Constraints for 256MB

Min. Max. adpcmDec adpcmEnc 0 80 60 40 20 100 Figure 7.12: DRAM.

142 CHAPTER 7. ENERGY REDUCTION IN DDR DRAMS

spec_gzip

spec_bzip2

spec_vpr

mpeg2Dec

mpeg2Enc

g721Dec

g721Enc jpegDec

ConstEnrg Model

jpegEnc

adpcmDec

adpcmEnc

spec_gzip

spec_bzip2

spec_vpr

mpeg2Dec

mpeg2Enc

g721Dec

g721Enc

jpegDec jpegEnc

LinReg Model LinReg

adpcmDec

adpcmEnc

spec_gzip

spec_bzip2

spec_vpr

mpeg2Dec

mpeg2Enc

g721Dec

g721Enc jpegDec

Kriging Model Kriging

jpegEnc Error in Energy Reduction from Cache Configurations selected under differing Area Constraints for 4GB

Min. Max. adpcmDec adpcmEnc 0 70 60 50 40 30 20 10 Figure 7.13: DRAM. 7.5. EXPERIMENTS AND RESULTS 143

Cache Selection with XDRA. For each application, we used XDRA to explore the L2 cache and select the cache configuration with maximum DRAM energy reduction. Table 7.2 reports the L2 cache configurations selected by XDRA together with their area footprints. Out of 22 cache configurations (two configurations for two different DRAM sizes per application), XDRA found the same configurations as the cycle-accurate processor-memory simulations in 20 cases. The suboptimal configurations had a maximum increase of just 3.9% in total energy consumption of the cache and DRAM.

Application    256MB DRAM                   4GB DRAM
adpcm Enc      [4KB, 16B, 16A] (0.03)       [16KB, 32B, 8A] (0.05)
adpcm Dec      [4KB, 16B, 16A] (0.03)       [16KB, 32B, 4A] (0.05)
jpeg Enc       [256KB, 128B, 1A] (0.5)      [256KB, 128B, 1A] (0.5)
jpeg Dec       [256KB, 128B, 1A] (0.5)      [64KB, 16B, 16A] (0.31)
g721 Enc       [8KB, 16B, 16A] (0.05)       [32KB, 4B, 1A] (0.5)
g721 Dec       [8KB, 64B, 8A] (0.03)        [8KB, 16B, 16A] (0.05)
mpeg2 Enc      [32KB, 128B, 8A] (0.05)      [1MB, 128B, 8A] (1.5)
mpeg2 Dec      [16KB, 16B, 16A] (0.08)      [512KB, 128B, 1A] (0.86)
spec vpr       [64KB, 16B, 16A] (0.31)      [256KB, 128B, 8A] (0.5)
spec bzip2     [512KB, 128B, 4A] (0.9)      [1MB, 128B, 2A] (1.5)
spec gzip      [256KB, 128B, 8A] (0.52)     [512KB, 32B, 4A] (1.6)

Table 7.2: L2 Cache Configurations with maximum DRAM Energy Reduction (BC PD) from XDRA for different DRAM sizes. The numbers in parentheses are area footprint in mm2.

Note that, for comparison with Table 7.2, the area footprint of the CC PD configuration (defined below) is 0.145 mm2. Once the best cache configuration was known from XDRA, for comparison purposes, we simulated the following three systems in the processor-memory simulator to obtain their actual DRAM energy reduction:

• BS: Base system without L2 cache and self-refresh PD mechanism, but with the PD mechanism of the DDR3 internal mode controller.

• CC PD: System with a reasonable Common Cache configuration (64KB, 64B line size, direct-mapped L2 cache, which has been reported in [152]) and the PD mechanism.

• BC PD: System with the Best Cache configuration and the PD mechanism.

[Figure: normalised energy consumption of adpcmEnc broken down into DRAM_Refresh, DRAM_Active, DRAM_BckGrnd and L2 cache for the BaseSys, CC_PD and BestC_PD systems with 256MB and 4GB DRAM.]

Figure 7.14: Normalised DRAM Energy Consumption Breakdown of adpcmEnc for different L2 caches and DRAM sizes.

[Figure: the corresponding breakdown for adpcmDec.]

Figure 7.15: Normalised DRAM Energy Consumption Breakdown of adpcmDec for different L2 caches and DRAM sizes.

The energy reduction comparison of the above three systems for all 11 applications with 256MB and 4GB DRAMs is reported in Figures 7.14–7.24. Only the result of the vpr application (Figure 7.22) is discussed, since the results for the other applications are similar.

[Figure: the corresponding breakdown for jpegEnc.]

Figure 7.16: Normalised DRAM Energy Consumption Breakdown of jpegEnc for different L2 caches and DRAM sizes.

[Figure: the corresponding breakdown for jpegDec.]

Figure 7.17: Normalised DRAM Energy Consumption Breakdown of jpegDec for different L2 caches and DRAM sizes.

Figure 7.22 reports the normalised energy consumption breakdown for the 'vpr' application from SPEC2000. Both the CC PD and BC PD systems significantly reduce the energy consumption of DRAM; however, BC PD is more energy efficient than CC PD, by a factor of 12× and 24× for 256MB and 4GB DRAMs respectively. These results show that a suitable cache configuration with self-refresh PD reduces both the active and background power of DRAM (although the reduction in background power is more significant than in active power) because: 1) most of the memory requests are serviced by the cache, which reduces DRAM activity; and 2) the DRAM can be put into self-refresh PD mode more often. In summary, the BC PD system from XDRA reduced, on average, 3.6× and 4× more cache and DRAM energy compared to CC PD for 256MB and 4GB DRAMs respectively. These results indicate the usefulness of XDRA in selecting a suitable cache configuration; the use of an arbitrary cache configuration is not always the most energy efficient choice.

[Figure: the corresponding breakdown for g721Enc.]

Figure 7.18: Normalised DRAM Energy Consumption Breakdown of g721Enc for different L2 caches and DRAM sizes.

[Figure: the corresponding breakdown for g721Dec.]

Figure 7.19: Normalised DRAM Energy Consumption Breakdown of g721Dec for different L2 caches and DRAM sizes.

[Figure: the corresponding breakdown for mpeg2Enc.]

Figure 7.20: Normalised DRAM Energy Consumption Breakdown of mpeg2Enc for different L2 caches and DRAM sizes.

[Figure: the corresponding breakdown for mpeg2Dec.]

Figure 7.21: Normalised DRAM Energy Consumption Breakdown of mpeg2Dec for different L2 caches and DRAM sizes.

The performance penalty due to the wakeup latency of the self-refresh PD mode is measured by comparing the execution times of BC PD with a similar system but without the self-refresh PD mechanism. In our experiments, a maximum penalty of 2% was observed due to the self-refresh PD mode. This result illustrates the fact that a suitable cache not only increases DRAM idle periods but also consolidates them into longer periods, making them suitable for the self-refresh PD mode and reducing the number of DRAM wakeups, and hence reducing the overall performance penalty. It is important to note that none of the BC PD systems incurred any performance penalty compared to the BS system, because the reduction in execution time due to cache hits amortised the wakeup latency of the self-refresh PD mode. The above results point to the fact that a suitable last-level cache with the self-refresh PD mode can tremendously increase DRAM energy efficiency with: 1) marginal performance penalty compared to a similar system but without the PD mechanism; and 2) performance improvement compared to a similar system but without any cache.

[Figure: the corresponding breakdown for spec vpr.]

Figure 7.22: Normalised DRAM Energy Consumption Breakdown of spec vpr for different L2 caches and DRAM sizes.

[Figure: the corresponding breakdown for spec bzip2.]

Figure 7.23: Normalised DRAM Energy Consumption Breakdown of spec bzip2 for different L2 Caches and DRAM sizes.

[Figure: the corresponding breakdown for spec gzip.]

Figure 7.24: Normalised DRAM Energy Consumption Breakdown of spec gzip for different L2 Caches and DRAM sizes.

Table 7.3 reports the time taken by the cycle-accurate processor-memory simulator and by XDRA for exploration of the 330 L2 cache configurations with 256MB DRAM. The total time for XDRA has been broken down into: the time to train our estimator (TS); the time to generate the LCI profile and run the cache simulation (LCI); and the time for cache profile analysis and cache exploration (CPA). XDRA reduces the exploration time from several hundred days to a few days, resulting in savings of at least 86% of the simulation time, which enables quick exploration of the last-level cache. Note that the total time of XDRA with the LinReg estimator is the same as with our estimator because the same training set is used. If the Simple estimator is used in XDRA, then its time will be equal to the LCI time only, making it faster than our estimator. However, as shown before, the Simple estimator had average errors as high as 99.4% and does not perform consistently across various applications, which limits its practical use. Results for 4GB DRAM reveal similar savings, and are reported in Table 7.4.

Application    Cycle-Accurate Simulator    XDRA TS    XDRA LCI    XDRA CPA    XDRA Total
adpcm Enc      3.2h                        17m        45s         6m          23m
adpcm Dec      2.1h                        12m        33s         6m          17m
jpeg Enc       21.4h                       2h         5m          23m         2.4h
jpeg Dec       7.9h                        43m        2m          9m          55m
g721 Enc       13.8d                       1d         1h          7h          1.5d
g721 Dec       14.6d                       1d         1h          7h          1.7d
mpeg2 Enc      155.2d                      14d        12h         1d          15.5d
mpeg2 Dec      41.7d                       4d         4h          13h         4.5d
spec vpr       103.7d                      9d         9h          2d          11.3d
spec bzip2     143.9d                      13d        12h         2d          15.1d
spec gzip      124.1d                      11d        10h         1d          13.4d

Table 7.3: Time Comparison of Cycle-accurate Processor-memory Simulator and XDRA for 256MB DRAM.

Application    Cycle-Accurate Simulator    XDRA TS    XDRA LCI    XDRA CPA    XDRA Total
adpcm Enc      4.9h                        27m        1m          6m          32m
adpcm Dec      3.5h                        19m        47s         6m          25m
jpeg Enc       1.3d                        3h         6m          22m         3.3h
jpeg Dec       12.8h                       1h         3m          8m          1.3h
g721 Enc       19.5d                       2d         2h          17h         2.1d
g721 Dec       14.4d                       1d         1h          19h         1.7d
mpeg2 Enc      183.6d                      17d        14h         3d          18.6d
mpeg2 Dec      46.9d                       4d         4h          1d          4.9d
spec vpr       129.8d                      12d        11h         2d          14.1d
spec bzip2     192d                        18d        15h         22h         19.3d
spec gzip      295.7d                      27d        23h         2d          29.5d

Table 7.4: Time Comparison of Cycle-accurate Processor-memory Simulator and XDRA for 4GB DRAM.

7.6 Advantages and Limitations

XDRA features several advantages: 1) XDRA is fast as it only uses one cycle-accurate simulation per application; 2) it integrates the processor-memory and cache simulators with analysis techniques to quickly compute parameters for the DRAM energy reduction estimator; and 3) it uses the estimator for fast exploration of last-level cache configurations. Although several cycle-accurate simulations are required to build the estimator, the number of simulations is less than 10% (30 out of 330, see Section 7.5) of the whole design space.

XDRA is very flexible, as any processor-memory and cache simulators can be used provided the profiles described earlier can be produced, which would require minor modifications to existing simulators. Finally, a designer has the flexibility to use the cache policies, PDthreshold, DRAM address mapping, row buffer policies, etc. of his/her system under design, although we did not test all the possible combinations of such options.

XDRA can also be used to find the best last-level cache configuration for a class of applications. In this case, the trace from the combined execution of the applications should be captured and fed to the cache simulation. A truly representative trace from multiple applications' execution might not be possible due to indeterminism; however, this is a different problem and not the focus of our work. XDRA will explore last-level cache configurations to find the best one for a given trace irrespective of whether that trace captures the execution of a single application or of multiple applications.

XDRA as such cannot explore the last-level cache in a multi-processor system, because the LCI periods will be different across different last-level cache configurations (unlike a uniprocessor system where the LCI periods are the same across different last-level configurations) due to inter-processor dependencies and cache coherency. In future, we will look into extending XDRA for multi-processor systems.

7.7 Summary

This thesis for the first time proposes an estimator to predict the energy reductions when last-level caches are used and the main memory power consumption is reduced by aggressively switching the memory to self-refresh power mode whenever the memory is idle. The estimator is Kriging based and is used within a framework containing a processor-memory simulator, a cache simulator and novel analysis techniques. The predictor is accurate to within 4.4% on average for 11 applications from the mediabench and SPEC2000 suites and two DRAM sizes. We used the estimated energy reductions to find suitable cache configurations, and were able to do so in 20 of the 22 cases. In the two cases which were non-optimal, the errors were within 4% of the optimal values. XDRA is comparatively fast, and reduces 85% of the time taken by complete simulation.

Chapter 8

Conclusions

Power estimation and reduction have been the primary concerns across the various abstraction levels in computer systems, ranging from the high-level application layer to low-level layout design. Among the different system components, main memory consumes a high amount of power, and memory power consumption is becoming the limiting factor in achieving overall energy savings. Plenty of research on memory power reduction has been carried out in the compiler, operating system and system architecture areas. This thesis has explored efficient techniques for memory power estimation and reduction at the system architecture level.

Increasing adoption of high level simulation in the design flow demands power estimation and power reduction methodologies with high level modelling techniques. However, accuracy and efficiency are two main concerns in the modelling approaches, especially in the estimation of power utilisation. This thesis presented a cycle-accurate simulation platform to obtain the detailed statistics of the memory system. In this simulation platform, a novel interface abstraction layer (IAL) is proposed to glue the processor system component and the DRAM memory system component seamlessly. The latency of processing each memory request differs from one memory operation to another. Previously proposed memory simulators apply a two-pass trace-driven mechanism in which the memory trace sequences are captured with a typical processor/system simulator (using a fixed memory latency) and fed into the simulator. Even though the variance of memory latencies is taken into consideration in these trace-driven memory simulations, inaccurate memory statistics are produced because the effect of one memory request's response time on subsequent memory requests is not considered. In this thesis, the processor-memory simulation with IAL uses an on-the-fly one-pass approach, in which an incoming memory request to the memory component is processed directly instead of being captured as a trace, to achieve correct memory performance figures in the memory simulation step. As a result, the proposed framework with IAL provides far more accurate results than trace-driven memory simulations, whose accuracy varies by up to 80% with the choice of fixed memory latency used to obtain power consumption figures.

In order to reduce the power consumption and improve the performance of the system, one of the techniques at the architecture level is utilising the last-level cache residing just before the DRAM memory level in the memory hierarchy. The last-level cache plays an important role in avoiding expensive memory accesses to the off-chip DRAM system. Finding a suitable last-level cache configuration for a specific target application, to achieve the maximum power savings and/or minimum execution time of the whole system, is a key concern in a large design space of cache configurations. Some prior research used a last-level cache exploitation approach in order to reduce DRAM's long latency. These previous approaches applied an exhaustive searching mechanism in order to explore the behaviour of last-level caches. In typical embedded systems, the last-level cache configuration (cache size, cache line size and associativity) ranges from [4KB cache size, 4 byte cache line size, 1-way associativity] to [128KB cache size, 128 byte line size, 16-way associativity]. With the availability of various cache configurations and a huge design space to explore, efficient methodologies and design flows are necessary to help in faster architecture exploration, leading to reduced design time and increased productivity. This thesis proposed a rapid exploration methodology, including a fast and highly accurate performance simulator and a reasonably accurate energy estimator, to tune a unified last-level cache according to the application running on it. This proposed rapid exploration mechanism avoids cycle-accurate simulations for all the possible cache configurations by performing the slow cycle-accurate simulation only once and fast cache simulations for all the last-level cache configurations. The proposed execution time and energy estimators had absolute accuracy of at least 99% and 83% respectively, with a 98% simulation time reduction when compared to the previous approach.

This thesis also presented a power reduction technique by proposing a DRAM power mode controller. The proposed power mode controller utilises an additional buffer as the last-level cache in the memory hierarchy. In this design, the last-level cache is modelled as an intermediate prefetch buffer which prefetches a batch of data from the DRAM memory such that the next memory request can be serviced from the prefetch buffer. The time to service memory requests from the prefetch buffer becomes DRAM's idle time, during which the proposed power mode controller sets the DRAM device into a low power mode. The model used in our experiments has four main power states of memory - active, standby, sleep and deep sleep. The proposed power mode controller sets the DRAM device into the lowest power consuming mode (the deep sleep power state) whenever DRAM is idle and the DRAM's idle time is larger than a predefined threshold period, in order to avoid setting the device to low power mode within short idle times. The default internal power management system of DRAM controls the DRAM device to go into the lower power consuming modes (standby and sleep power states) after every memory request. Before processing the next memory request, the device is switched back to active mode from the low power mode with a power mode switching overhead. Thus, the internal power management scheme does not gain power saving benefits, due to the frequent power mode switching overhead. In contrast, the proposed power mode controller achieved significant power reductions with the exploitation of DRAM's lowest power consuming mode, and the last-level cache which reduces the power mode switching overhead.

Though DRAM energy consumption is reduced with the last-level cache and DRAM's deep sleep power state, not all the last-level cache configurations are suitable to achieve the highest power savings. Some of the last-level cache configurations make DRAM energy consumption even higher. With the price reduction trend of SRAM and the demand for faster processing, the design space of last-level cache configurations is becoming larger and larger, even for area-constrained embedded systems. Finding a suitable last-level cache configuration to obtain maximum energy savings with an exhaustive searching mechanism is not a feasible solution, especially in such a large design space. This thesis developed a design framework, together with a proposed novel energy reduction estimator, which can rapidly find a maximum energy savings cache configuration. The energy reduction estimator proposed in this thesis is a Kriging based model, which is trained through a small set of data chosen by using the Latin Hypercube Sampling technique. This Kriging based energy reduction estimator utilises various DRAM parameters (which correlate strongly with the amount of energy reduction) as the estimator's variables, compared to the constant DRAM energy based characterisation used in previous approaches. It also gives better visibility of DRAM's power consumption, because this approach helps in relating the activity of DRAM's power states to power consumption. This proposed rapid exploration framework is verified on a variety of benchmarks from the mediabench and SPEC2000 suites, and simulated with a Tensilica processor model and Micron's DDR3 memory model for a uniprocessor system. An elaborate discussion is also provided on how to improve the estimator when a class of applications or a multi-processor system is used. It was shown that, with a reasonably large number of data samples, the proposed DRAM energy saving estimator is accurate to within 4.4% on average with respect to cycle-accurate simulation. Also, the presented framework design is comparatively fast, and reduces 85% of the time taken by complete simulation.

8.1 Future Work

This thesis presented power/energy estimation and reduction techniques for system architecture based design frameworks. There are many possible future directions, some of which are outlined below.

1. We presented the interface abstraction layer (IAL) which supports communication between the processor simulation and the memory simulation. Even though our target system is a uniprocessor system, this approach can be extended to multiprocessor systems-on-chip (MPSoCs) by keeping data consistent across the local caches of the different processors. In other words, a cache coherence mechanism for MPSoCs needs to be incorporated into the IAL so that all the local caches of the different processors hold only up-to-date data. Additionally, memory consistency is an important factor for shared-memory multiprocessor systems. The memory consistency model allows multiple processors to concurrently read/write data from/to the shared memory in a synchronised manner. Our proposed IAL can therefore be extended to multiprocessor systems by applying cache coherence and memory consistency mechanisms, as sketched below.
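A hypothetical, much simplified view of such an extension follows: on a write, the extended layer invalidates the line in every other processor's local cache before forwarding the request to the shared memory model. The class and method names (invalidate, hit, fill, issue_read, issue_write) are assumptions for the sketch and do not correspond to the actual IAL interface.

```python
# Hypothetical sketch of invalidation-based coherence inside an IAL-like
# layer for an MPSoC simulator. The cache and memory model objects, and
# their methods, are assumed placeholders.

class CoherentInterfaceLayer:
    def __init__(self, local_caches, memory_model):
        self.caches = local_caches      # one private cache model per processor
        self.memory = memory_model      # shared (e.g. cycle-accurate DRAM) model

    def write(self, cpu_id, addr, data):
        # Invalidate every other processor's copy so only up-to-date data remains.
        for other_id, cache in enumerate(self.caches):
            if other_id != cpu_id:
                cache.invalidate(addr)
        self.caches[cpu_id].write(addr, data)
        self.memory.issue_write(addr, data)

    def read(self, cpu_id, addr):
        cache = self.caches[cpu_id]
        if cache.hit(addr):
            return cache.read(addr)     # serviced locally, DRAM stays idle
        data = self.memory.issue_read(addr)
        cache.fill(addr, data)
        return data
```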

2. The DRAM power mode control presented in Chapter 6 can be extended in many ways. One possible extension is a threshold-adaptive mechanism for dynamic (run-time) control of different power modes by monitoring the history of previous memory requests. The threshold-based power mode control presented in this thesis uses a constant threshold value. However, further power savings could be achieved by utilising multiple power modes (only the lowest power consuming mode is applied in this thesis) with different threshold values for setting the device into a low power mode. The selection of the power mode for the next round would depend on information such as the length of the DRAM's idle periods and the frequency of idle times, which can be obtained from the history of previous memory requests within a certain time frame; a sketch of such a history-driven selector follows this item. A second possibility is to extend the power mode control to the granularity of the DRAM's banks (a DRAM chip is composed of multiple banks). Instead of applying the same low power mode to the whole DRAM chip, as presented in this thesis, a different power saving mode could be applied to each bank, considering the different workloads directed to that bank. The workload information can be analysed off-line with captured memory access sequences or with run-time history information. Additionally, DDR3 DRAM energy consumption can be further reduced by the use of multiple low-power modes after the selection of the optimal cache configuration, rather than just the self-refresh power-down mode. It is possible that some of the DRAM idle periods might not be long enough for the self-refresh power-down mode even with the optimal cache configuration. In such a scenario, the PMC should select one of the other low-power modes suitable for the DRAM idle period at hand, to avoid the overhead of using the self-refresh power-down mode at all times.
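The sketch below illustrates the history-driven selection idea mentioned in this item: recent idle-period lengths are recorded, and the deepest power mode whose (assumed) break-even time the predicted idle period exceeds is chosen. The mode list, the break-even values and the median-based prediction are placeholders, not measured DDR3 parameters.

```python
# Illustrative history-driven power-mode selector (a possible extension,
# not the controller of Chapter 6). Break-even values are assumed numbers.

from collections import deque

# (mode, break_even_cycles): deeper modes save more power but cost more to exit.
MODES = [("deep_sleep", 2000), ("sleep", 500), ("standby", 100)]

class AdaptivePowerModeSelector:
    def __init__(self, history_len=32):
        self.idle_history = deque(maxlen=history_len)

    def record_idle_period(self, length_cycles):
        # Called whenever an idle period ends, with its observed length.
        self.idle_history.append(length_cycles)

    def select_mode(self):
        if not self.idle_history:
            return "standby"
        # Conservative prediction: median of the recent idle-period lengths.
        predicted = sorted(self.idle_history)[len(self.idle_history) // 2]
        for mode, break_even in MODES:
            if predicted >= break_even:
                return mode
        return "active"   # idle gaps too short to justify any low-power mode
```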

3. Another possible direction is to find a suitable code/data placement in the DDR3 DRAM to prolong the idle periods of DRAM ranks/banks. The DRAM accesses of an application can be captured by offline analysis and then analysed to place code/data according to the spatial and temporal locality of those accesses; a simple illustration of such an analysis is sketched below. After code/data placement, a valid mapping between the last-level cache and the DRAM would be either created or selected from existing address mapping policies [177].
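As a toy example of such an offline analysis, the sketch below counts DRAM accesses per data object from a captured trace and packs the hottest objects into the lowest-numbered banks, so that the remaining banks see longer idle periods. The trace format, the notion of an object identifier and the per-bank capacity are assumptions of the sketch, not part of the thesis framework.

```python
# Toy offline placement analysis: concentrate the hottest objects onto a few
# banks so that other banks receive fewer accesses and stay idle longer.
# Trace format (cycle, object_id) and objects_per_bank are assumed.

from collections import Counter

def pack_hot_objects(trace, num_banks, objects_per_bank):
    """trace: iterable of (cycle, object_id) DRAM accesses.
    Returns a dict mapping object_id -> bank index."""
    counts = Counter(obj for _, obj in trace)
    placement = {}
    # Hottest objects fill bank 0 first, then bank 1, and so on.
    for rank, (obj, _) in enumerate(counts.most_common()):
        placement[obj] = min(rank // objects_per_bank, num_banks - 1)
    return placement
```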

4. The performance/energy estimators utilising the last-level cache presented in Chapters 5 and 7 can be improved for a class of applications, since some embedded systems mainly target multiple applications. The last-level cache should then be optimised for a class of applications, which has not been considered in this thesis. In the proposed energy reduction estimator, the training data guiding the estimator construction is collected per application. Training data drawn from multiple applications could help the estimation model capture a wider variety of power reduction patterns.

5. Finally, our estimation methods were not extensively targeted at multiprocessor systems. It will be interesting to see how the interactions between different processors impact the DRAM's idle periods, which are required by our proposed energy reduction estimators. The variation of the DRAM's idle periods across different last-level cache configurations is due to interprocessor dependencies and cache coherency. A good mechanism to quickly explore the DRAM's idle periods for each cache configuration would help in extending the energy reduction estimator to a multiprocessor system.

Bibliography

[1] Micron Technology, Inc. http://download.micron.com/pdf/datasheets/dram/ddr3/1Gb_DDR3_SDRAM.pdf.

[2] Micron, Inc., “Micron DDR3.” http://www.micron.com/products/dram/ddr3/.

[3] P. Marwedel, Embedded System Design. Kluwer Academic, 2003.

[4] “esfacts.” Available at: http://www.artemis-ju.eu/embedded systems.

[5] “IDC.” Available at: http://www.idc.com/.

[6] “Embedded systems: Technologies and markets.” Available at: http://www.bccresearch.com/report/embedded-systems-technologies- markets-ift016d.html.

[7] M. Weiser, “Hot topics-ubiquitous computing,” Computer, vol. 26, pp. 71 –72, oct 1993.

[8] M. Satyanarayanan, “Pervasive computing: vision and challenges,” Personal Communications, IEEE, vol. 8, pp. 10 –17, aug 2001.

[9] N. Srinath, 8085 Microprocessor: Programming and Interfacing. Prentice-Hall Of India Pvt. Limited, 2005.

[10] “Microchip’s 32-bit microcontrollers.” Available at: http://www.microchip.com/pagehandler/en-us/family/32bit/.

[11] “All programmable fpgas.” Available at: http://www.xilinx.com/products/silicon-devices/fpga/index.htm?from=hpsb.

[12] “Digital signal processors.” Available at: http://www.ti.com/lsds/ti/dsp/overview.page.

[13] A. D. Pimentel, C. Erbas, and S. Polstra, “A systematic approach to exploring embedded system architectures at multiple abstraction levels,” IEEE Trans. Comput., vol. 55, pp. 99–112, Feb. 2006.


[14] S. Przybylski, “Morning tutorial: Sorting out the new DRAMs,” 1997.

[15] K.-B. Lee and T.-S. Chang, SoC MEMORY SYSTEM DESIGN. Springer, 2006.

[16] S. C. Jung Ho Ahn and S. O, Energy Awareness in Contemporary Memory Systems. Springer, 2011.

[17] P. Grun, N. D. Dutt, and A. Nicolau, Memory architecture exploration for programmable embedded systems. Kluwer, 2003.

[18] W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,” SIGARCH Comput. Archit. News, vol. 23, pp. 20–24, 1995.

[19] B. L. Jacob, The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2009.

[20] F. Catthoor, E. d. Greef, and S. Suytack, Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design. Norwell, MA, USA: Kluwer Academic Publishers, 1998.

[21] B. L. Jacob, S. W. Ng, and D. T. Wang, Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann, 2008.

[22] A. Crone and G. Chidolue, “Functional verification of low power designs at rtl,” vol. 4644, pp. 288–299, 2007.

[23] A. Raghunathan, N. K. Jha, and S. Dey, High-Level Power Analysis and Op- timization. Norwell, MA, USA: Kluwer Academic Publishers, 1998.

[24] O. S. Unsal and I. Koren, “System-level power-aware design techniques in real-time systems,” in Proceedings of the IEEE, pp. 1055–1069, 2003.

[25] P. R. Panda, A. Shrivastava, B. Silpa, and K. Gummidipudi, Power-efficient System Design. Springer, 2010.

[26] J. Rabaey, Low Power Design Essentials. Engineering (Springer-11647), Springer, 2009.

[27] “A practical guide to low-power design, user experience with cpf, silicon inte- gration initiative, inc (si2).” Available at: http://www.si2.org/?page=1061.

[28] Z. Wang and X. S. Hu, “Power aware variable partitioning and instruction scheduling for multiple memory banks,” in Proceedings of the conference on Design, automation and test in Europe - Volume 1, DATE ’04, (Washington, DC, USA), pp. 10312–, IEEE Computer Society, 2004.

[29] M. Kandemir, “Impact of data transformations on memory bank locality,” in Proceedings of the conference on Design, automation and test in Europe - Volume 1, DATE ’04, (Washington, DC, USA), pp. 10506–, IEEE Computer Society, 2004.

[30] Y.-H. Lu, L. Benini, and G. De Micheli, “Operating-system directed power reduction,” in Proceedings of the 2000 international symposium on Low power electronics and design, ISLPED ’00, (New York, NY, USA), pp. 37–42, ACM, 2000.

[31] M. Lee, E. Seo, J. Lee, and J. soo Kim, “Pabc: Power-aware buffer cache management for low power consumption,” IEEE Transactions on Computers, vol. 56, pp. 488–501, 2007.

[32] I. Hur and C. Lin, “A comprehensive approach to dram power management,” in HPCA, pp. 305–316, 2008.

[33] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu, “Mini-rank: Adaptive dram architecture for improving memory power efficiency,” Microar- chitecture, IEEE/ACM International Symposium on, vol. 0, pp. 210–221, 2008.

[34] X. Fan, C. S. Ellis, and A. R. Lebeck, “Memory controller policies for dram power management,” in In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED, pp. 129–134, 2001.

[35] A. M. Amin and Z. A. Chishti, “Rank-aware cache replacement and write buffering to improve dram energy efficiency,” in Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design, ISLPED ’10, (New York, NY, USA), pp. 383–388, ACM, 2010.

[36] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens, “Memory access scheduling,” in Computer Architecture, 2000. Proceedings of the 27th Interna- tional Symposium on, pp. 128 –138, june 2000.

[37] J. Shao and B. T. Davis, “A burst scheduling access reordering mechanism.,” in HPCA, pp. 285–294, IEEE Computer Society, 2007.

[38] I. Hur and C. Lin, “A comprehensive approach to dram power management,” in HPCA, pp. 305–316, 2008.

[39] J. A. Darringer, R. A. Bergamaschi, S. Bhattacharya, D. Brand, A. Herkersdorf, J. K. Morrell, I. Nair, P. Sagmeister, and Y. Shin, “Early analysis tools for system-on-a-chip design,” IBM J. Res. Dev., vol. 46, pp. 691–707, Nov. 2002.

[40] W. Kruijtzer, P. van der Wolf, E. de Kock, J. Stuyt, W. Ecker, A. Mayer, S. Hustin, C. Amerijckx, S. de Paoli, and E. Vaumorin, “Industrial ip integration flows based on ip-xact standards,” in Proceedings of the conference on Design, automation and test in Europe, DATE ’08, (New York, NY, USA), pp. 32–37, ACM, 2008.

[41] S. Ahuja, A. Lakshminarayana, and S. Shukla, Low Power Design with High-Level Power Estimation and Power-Aware Synthesis. Springer, 2011.

[42] “IBM Blue Logic Technology.” Available at: http://www-3.ibm.com/chips/bluelogic/.

[43] “IBM CoreConnect Bus Architecture White Paper.” Available at: www-3.ibm.com/chips/products/coreconnect/index.html.

[44] A. Dewey, Design Automation Technology. Circuits & Filters Handbook 3e, CRC Press, 2009.

[45] D. A. Menascé, E. Casalicchio, and V. Dubey, “A heuristic approach to optimal service selection in service oriented architectures,” in Proceedings of the 7th international workshop on Software and performance, WOSP ’08, (New York, NY, USA), pp. 13–24, ACM, 2008.

[46] D. R. Rice, “An analytical model for computer system performance evaluation,” SIGMETRICS Perform. Eval. Rev., vol. 2, pp. 14–30, June 1973.

[47] Y. C. Tay, Analytical Performance Modeling for Computer Systems. Morgan and Claypool Publishers, 1st ed., 2010.

[48] S. S. Mukherjee, S. V. Adve, T. Austin, J. Emer, and P. S. Magnusson, “Performance simulation tools,” Computer, vol. 35, pp. 38–39, Feb. 2002.

[49] A. Raghunathan, S. Dey, and N. K. Jha, “Register-transfer level estimation techniques for switching activity and power consumption,” in Proc. Int. Conf. Computer-Aided Design, pp. 158–165, 1996.

[50] K. Olukotun, M. Heinrich, and D. Ofelt, “Digital system simulation: Methodologies and examples,” in Proc. of the 35th Design Automation Conference (DAC’98), pp. 658–663, ACM/IEEE, 1998.

[51] J. J. Yi and D. J. Lilja, “Simulation of computer architectures: Simulators, benchmarks, methodologies, and recommendations,” IEEE Trans. Comput., vol. 55, pp. 268–280, Mar. 2006.

[52] M. Heinrich, “Memory system simulation and the right energy metric,” 2008.

[53] L. Negri and A. Chiarini, “Power simulation of communication protocols with statec,” pp. 277–294, 2006.

[54] Micron Technology, Inc. http://download.micron.com/pdf/technotes/ddr3/TN41_01DDR3%20Power.pdf.

[55] D. Burger, “The simplescalar tool set, version 2.0,” tech. rep., 1997.

[56] K. Kise, T. Katagiri, H. Honda, and T. Yuba, “The simcore/alpha functional simulator,” in Proceedings of the 2004 workshop on Computer architecture ed- ucation: held in conjunction with the 31st International Symposium on Com- puter Architecture, WCAE ’04, (New York, NY, USA), ACM, 2004.

[57] K. Kise, H. Honda, and T. Yuba, “Simalpha version 1.0: Simple and readable alpha processor simulator,” in Asia-Pacific Computer Systems Architecture Conference, pp. 122–136, 2003.

[58] J. Lee, J. Kim, C. Jang, S. Kim, B. Egger, K. Kim, and S. Han, “Facsim: a fast and cycle-accurate architecture simulator for embedded systems,” in LCTES ’08: Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems, (New York, NY, USA), pp. 89–100, ACM, 2008.

[59] M. T. Yourst, “Ptlsim: A cycle accurate full system x86-64 microarchitectural simulator,” in ISPASS, pp. 23–34, 2007.

[60] Tensilica, Inc., “Xtensa Configurable Processors.” http://www.tensilica. com.

[61] ARM Limited, “RealView ARMulator ISS User Guide, Version 1.4.3.” http: //infocenter.arm.com, 2007.

[62] Texas Instruments., “C64x+ CPU cycle-accurate simulator.” http://www.ti. com/.

[63] M. Poncino and J. Zhu, “Dynamosim: a trace-based dynamically compiled instruction set simulator,” in ICCAD, pp. 131–136, 2004.

[64] T. R. Gross, J. L. Hennessy, S. A. Przybylski, and C. Rowen, “Measurement and evaluation of the mips architecture and processor.,” ACM Trans. Comput. Syst., vol. 6, no. 3, pp. 229–257, 1988.

[65] R. D. Doug, D. Burger, S. W. Keckler, and T. Austin, “Sim-alpha: a validated, execution-driven alpha 21264 simulator,” 2001.

[66] R. E. Bryant, “Alpha assembly language guide,” 1998.

[67] C. J. Mauer, M. D. Hill, and D. A. Wood, “Full-system timing-first simulation,” in SIGMETRICS, pp. 108–116, ACM, 2002.

[68] D. Skrien, “Cpu sim 3.1: A tool for simulating computer architectures for computer organization classes,” J. Educ. Resour. Comput., vol. 1, pp. 46–59, Dec. 2001.

[69] MikroSim., “Mikrocodesimulator MikroSim 2010.” http://www. mikrocodesimulator.de/index_eng.php.

[70] J. Simon and J.-M. Wierum, “The latency-of-data-access model for analyzing parallel computation,” Inf. Process. Lett., vol. 66, no. 5, pp. 255–261, 1998.

[71] E. Cordeiro, I. Stefani, T. Soares, and C. Martins, “Dcmsim: didactic cache memory simulator,” in Frontiers in Education, 2003. FIE 2003 33rd Annual, vol. 2, pp. F1C – 14–19 Vol.2, nov. 2003.

[72] M. L. C. Cabeza, M. I. G. Clemente, and M. L. Rubio, “Cachesim: a cache simulator for teaching memory hierarchy behaviour.,” in ITiCSE (C. Erickson, T. Wilusz, M. Daniels, R. McCauley, and B. Z. Manaris, eds.), p. 181, ACM, 1999.

[73] L. Coutinho, J. Mendes, and C. Martins, “Mscsim -multilevel and split cache simulator,” pp. 7 –12, oct. 2006.

[74] J. Edler and M. D. Hill, “Dinero iv trace-driven uniprocessor cache simulator.” http://pages.cs.wisc.edu/~markhill/DineroIV/.

[75] M. S. Haque, A. Janapsatya, and S. Parameswaran, “Susesim: a fast sim- ulation strategy to find optimal l1 cache configuration for embedded sys- tems,” in Proceedings of the 7th IEEE/ACM international conference on Hard- ware/software codesign and system synthesis, CODES+ISSS ’09, (New York, NY, USA), pp. 295–304, ACM, 2009.

[76] U. Choudhary, P. Phadke, V. Puttagunta, and S. Udayashankar, “Analysis of sub-block placement and victim caching techniques.”

[77] I. Ari, A. Amer, R. Gramacy, E. L. Miller, S. A. Brandt, and D. D. E. Long, “Acme: Adaptive caching using multiple experts,” in IN PROCEEDINGS IN INFORMATICS, pp. 143–158, 2002.

[78] A. Janapsatya, A. Ignjatovic, and S. Parameswaran, “Finding optimal l1 cache configuration for embedded systems,” in ASP-DAC, pp. 796–801, 2006.

[79] N. Tojo, N. Togawa, M. Yanagisawa, and T. Ohtsuki, “Exact and fast l1 cache simulation for embedded systems,” in Proceedings of the 2009 Asia and South Pacific Design Automation Conference, ASP-DAC ’09, (Piscataway, NJ, USA), pp. 817–822, IEEE Press, 2009.

[80] M. S. Haque, A. Janapsatya, and S. Parameswaran, “Susesim: a fast simula- tion strategy to find optimal l1 cache configuration for embedded systems,” in CODES+ISSS, pp. 295–304, 2009.

[81] HP, “Cacti 6.5.” http://www.hpl.hp.com/research/cacti/.

[82] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, “Multifacet’s general execution-driven multiprocessor simulator (gems) toolset,” SIGARCH Comput. Archit. News, vol. 33, no. 4, pp. 92–99, 2005.

[83] M. Ghosh and H.-H. S. Lee, “Dram decay: Using decay counters to reduce energy consumption in drams.”

[84] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob, “Dramsim: a memory system simulator,” SIGARCH Comput. Archit. News, vol. 33, pp. 100–107, Nov. 2005.

[85] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta, “Complete computer system simulation: The simos approach,” IEEE Parallel Distrib. Technol., vol. 3, no. 4, pp. 34–43, 1995.

[86] IBM Austin Research Lab., “SimOS-PowerPC web page.” http://www.research.ibm.com/arl/projects/SimOSppc.html, 2000.

[87] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hog- berg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A full system simu- lation platform,” Computer, vol. 35, pp. 50 –58, feb 2002.

[88] P. Bohrer, J. Peterson, M. Elnozahy, R. Rajamony, A. Gheith, R. Rock- hold, C. Lefurgy, H. Shafi, T. Nakra, R. Simpson, E. Speight, K. Sudeep, E. Van Hensbergen, and L. Zhang, “Mambo: a full system simulator for the powerpc architecture,” SIGMETRICS Perform. Eval. Rev., vol. 31, pp. 8–12, Mar. 2004.

[89] N. Hardavellas, S. Somogyi, T. F. Wenisch, E. Wunderlich, S. Chen, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk, “Simflex: A fast, accurate, flexible full-system simulation framework for performance evaluation of server archi- tecture,” SIGMETRICS Performance Evaluation Review, vol. 31, pp. 31–35, 2004.

[90] T. F. Wenisch, R. E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J. C. Hoe, “Simflex: Statistical sampling of computer system simulation,” IEEE Micro, vol. 26, no. 4, pp. 18–31, 2006.

[91] “AMD SimNow Simulator,” tech. rep., 11 2010.

[92] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero, and D. Ortega, “Cotson: Infrastructure for full system simulation,” Operating Systems Review, 2009.

[93] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, “The m5 simulator: Modeling networked systems,” IEEE Micro, vol. 26, pp. 52–60, July 2006.

[94] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hes- tness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” SIGARCH Comput. Archit. News, vol. 39, pp. 1–7, Aug. 2011.

[95] H. Khalid, “A trace-driven simulation methodology,” SIGARCH Comput. Ar- chit. News, vol. 23, pp. 27–33, Dec. 1995.

[96] R. Covington, S. Dwarkada, J. R. Jump, J. B. Sinclair, and S. Madala, “The efficient simulation of parallel computer systems,” in International Journal in Computer Simulation, pp. 31–58, 1991.

[97] W. Li, E. Li, A. Jaleel, J. Shan, Y. Chen, Q. Wang, R. R. Iyer, R. Illikkal, Y. Zhang, D. Liu, M. Liao, W. Wei, and J. Du, “Understanding the mem- ory performance of data-mining workloads on small, medium, and large-scale cmps using hardware-software co-simulation.,” in ISPASS, pp. 35–43, IEEE Computer Society, 2007.

[98] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Janapa Reddi, and K. Hazelwood, “Pin: building customized program analysis tools with dynamic instrumentation,” in PLDI ’05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, pp. 190–200, ACM Press, 2005.

[99] R. A. Uhlig and T. N. Mudge, “Trace-driven memory simulation: A survey,” ACM Computing Surveys, vol. 29, pp. 128–170, 2004.

[100] T. Wang, Q. Wang, D. Liu, M. Liao, K. Wang, L. Cao, L. Zhao, R. Iyer, R. Il- likkal, L. Wang, and J. Du, “Hardware/software co-simulation for last level cache exploration,” in Proceedings of the 2009 IEEE International Conference on Networking, Architecture, and Storage, NAS ’09, (Washington, DC, USA), pp. 371–378, IEEE Computer Society, 2009.

[101] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, “Multifacet’s general execution-driven multiprocessor simulator (gems) toolset,” SIGARCH Comput. Archit. News, vol. 33, no. 4, pp. 92–99, 2005.

[102] M. Lodde, J. Flich, and M. E. Acacio, “Dynamic last-level cache allocation to reduce area and power overhead in directory coherence protocols,” in Proceed- ings of the 18th international conference on Parallel Processing, Euro-Par’12, (Berlin, Heidelberg), pp. 206–218, Springer-Verlag, 2012.

[103] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fourth Edition: A Quantitative Approach. San Francisco, CA, USA: Morgan Kaufmann Pub- lishers Inc., 2006.

[104] R. R. Iyer, “On modeling and analyzing cache hierarchies using casper,” in MASCOTS, pp. 182–187, 2003.

[105] J.-H. Lee, J.-S. Lee, and S.-D. Kim, “A new cache architecture based on temporal and spatial locality,” Journal of Systems Architecture, vol. 46, no. 15, pp. 1451 – 1467, 2000.

[106] A. Basu and N. Kirman, “Scavenger: A new last level cache architecture with global block priority,” in In Proceedings of the 40th International Symposium on Microarchitecture, 2007.

[107] C. J. Lee, V. Narasiman, E. Ebrahimi, O. Mutlu, and Y. N. Patt, “Dram-aware last level cache writeback: Reducing write-caused interference in memory system,” tech. rep., TR-HPS-2010-002, 2010.

[108] H.-H. S. Lee and G. Tyson, “Eager writeback - a technique for improving bandwidth utilization,” in Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, pp. 11–21, ACM Press, 2000.

[109] J. Stuecheli, D. Kaseridis, D. Daly, H. C. Hunter, and L. K. John, “The virtual write queue: coordinating dram and last-level cache policies.,” in ISCA (A. Seznec, U. C. Weiser, and R. Ronen, eds.), pp. 72–82, ACM, 2010.

[110] C. J. Lee, E. Ebrahimi, V. Narasiman, O. Mutlu, and Y. N. Patt, “Dram-aware last-level cache replacement,” tech. rep., TR-HPS-2010-007, 2010.

[111] Z. Wang, S. M. Khan, and D. A. Jiménez, “Rank idle time prediction driven last-level cache writeback,” in Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC ’12, (New York, NY, USA), pp. 21–29, ACM, 2012.

[112] Z. Wang, S. M. Khan, and D. A. Jiménez, “Improving writeback efficiency with decoupled last-write prediction,” in Proceedings of the 39th International Symposium on Computer Architecture, ISCA ’12, (Piscataway, NJ, USA), pp. 309–320, IEEE Press, 2012.

[113] Y. Lee and S. Kim, “Dram energy reduction by prefetching-based memory traffic clustering,” in Proceedings of the 21st edition of the great lakes sym- posium on Great lakes symposium on VLSI, GLSVLSI ’11, (New York, NY, USA), pp. 103–108, ACM, 2011.

[114] F. Catthoor, E. d. Greef, and S. Suytack, Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design. Norwell, MA, USA: Kluwer Academic Publishers, 1998.

[115] M. Kandemir, U. Sezer, and V. Delaluz, “Improving memory energy using access pattern classification,” in Computer Aided Design, 2001. ICCAD 2001. IEEE/ACM International Conference on, pp. 201 –206, 2001.

[116] C.-G. Lyuh and T. Kim, “Memory access scheduling and binding considering energy minimization in multi-bank memory systems,” in Proceedings of the 41st annual Design Automation Conference, DAC ’04, (New York, NY, USA), pp. 81–86, ACM, 2004.

[117] G. Chen, M. T. Kandemir, H. Saputra, and M. J. Irwin, “Exploiting bank locality in multi-bank memories.,” in CASES (J. H. Moreno, P. K. Murthy, T. M. Conte, and P. Faraboschi, eds.), pp. 287–297, ACM, 2003.

[118] G. Chen, F. Li, and M. Kandemir, “Compiler-directed channel allocation for saving power in on-chip networks,” SIGPLAN Not., vol. 41, pp. 194–205, January 2006.

[119] A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis, “Power aware page allocation,” SIGPLAN Not., vol. 35, pp. 105–116, November 2000.

[120] H. Huang, K. G. Shin, C. Lefurgy, and T. Keller, “Improving energy efficiency by making dram less randomly accessed,” in Proceedings of the 2005 inter- national symposium on Low power electronics and design, ISLPED ’05, (New York, NY, USA), pp. 393–398, ACM, 2005.

[121] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin, “Dram energy management using software and hardware directed power mode control,” in Proceedings of the 7th International Symposium on High- Performance Computer Architecture, HPCA ’01, (Washington, DC, USA), pp. 159–, IEEE Computer Society, 2001.

[122] X. Fan, C. Ellis, and A. Lebeck, “Memory controller policies for dram power management,” in Proceedings of the 2001 international symposium on Low power electronics and design, ISLPED ’01, (New York, NY, USA), pp. 129–134, ACM, 2001.

[123] S. Liu, S. Ogrenci Memik, Y. Zhang, and G. Memik, “An approach for adaptive dram temperature and power management,” in Proceedings of the 22nd annual international conference on Supercomputing, ICS ’08, (New York, NY, USA), pp. 63–72, ACM, 2008.

[124] V. D. L. Luz, M. Kandemir, and I. Kolcu, “Automatic data migration for reducing energy consumption in multi-bank memory systems,” in Proceedings of the 39th annual Design Automation Conference, DAC ’02, (New York, NY, USA), pp. 213–218, ACM, 2002.

[125] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin, “Hardware and software techniques for controlling dram power modes,” IEEE TRANSACTIONS ON COMPUTERS, vol. 50, pp. 1154–1173, 2001.

[126] V. Delaluz, M. T. Kandemir, N. Vijaykrishnan, and M. J. Irwin, “Energy- oriented compiler optimizations for partitioned memory architectures.,” in CASES, pp. 138–147, 2000.

[127] M. T. Kandemir, I. Kolcu, and I. Kadayif, “Influence of loop optimizations on energy consumption of multi-bank memory systems.,” in CC (R. N. Horspool, ed.), vol. 2304 of Lecture Notes in Computer Science, pp. 276–292, Springer, 2002.

[128] A. Fraboulet, K. Kodary, and A. Mignotte, “Loop fusion for memory space optimization,” in Proceedings of the 14th international symposium on Systems synthesis, ISSS ’01, (New York, NY, USA), pp. 95–100, ACM, 2001.

[129] M. Wolfe, High performance compilers for parallel computing. Addison-Wesley, 1996.

[130] Q. Huang, J. Xue, and X. Vera, “Code tiling for improving the cache perfor- mance of pde solvers.,” in ICPP, pp. 615–, IEEE Computer Society, 2003.

[131] H. Koc, O. Ozturk, M. Kandemir, S. H. K. Narayanan, and E. Ercanli, “Min- imizing energy consumption of banked memories using data recomputation,” in Proceedings of the 2006 international symposium on Low power electronics and design, ISLPED ’06, (New York, NY, USA), pp. 358–362, ACM, 2006.

[132] M. Tolentino, J. Turner, and K. Cameron, “An implementation of page allo- cation shaping for energy efficiency,” in Proceedings of the 3rd workshop on High-performance, Power-aware Computing (HPPAC), 2007.

[133] M. Bi, R. Duan, and C. Gniady, “Delay-hiding energy management mechanisms for dram,” in HPCA, pp. 1–10, 2010.

[134] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Geiger: moni- toring the buffer cache in a virtual machine environment,” SIGARCH Comput. Archit. News, vol. 34, pp. 14–24, Oct. 2006.

[135] V. Delaluz, A. Sivasubramaniam, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin, “Scheduler-based dram energy management,” in IN PROCEEDINGS OF THE 39TH CONFERENCE ON DESIGN AUTOMATION, pp. 697–702, ACM Press, 2002.

[136] H. Huang, P. Pillai, and K. G. Shin, “Design and implementation of power- aware virtual memory,” in Proceedings of the annual conference on USENIX Annual Technical Conference, ATEC ’03, (Berkeley, CA, USA), pp. 5–5, USENIX Association, 2003.

[137] J.-H. Min, H. Cha, and V. P. Srini, “Dynamic power management of dram using accessed physical addresses,” Microprocess. Microsyst., vol. 31, pp. 15– 24, Feb. 2007.

[138] V. Moshnyaga, H. Vo, G. Reinman, and M. Potkonjak, “Reducing energy of dram/flash memory system by os-controlled data refresh,” in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, pp. 2108 – 2111, may 2007.

[139] L. A. D. Bathen, M. Gottscho, N. Dutt, A. Nicolau, and P. Gupta, “Vip- zone: Os-level memory variability-driven physical address zoning for energy savings,” in Proceedings of the eighth IEEE/ACM/IFIP international confer- ence on Hardware/software codesign and system synthesis, CODES+ISSS ’12, (New York, NY, USA), pp. 33–42, ACM, 2012.

[140] Y. Luo, J. Yu, J. Yang, and L. Bhuyan, “Low power network processor design using clock gating,” in Proceedings of the 42nd annual Design Automation Conference, DAC ’05, (New York, NY, USA), pp. 712–715, ACM, 2005.

[141] S. Kim, S. Kim, and Y. Lee, “Dram power-aware rank scheduling,” in Pro- ceedings of the 2012 ACM/IEEE international symposium on Low power elec- tronics and design, ISLPED ’12, (New York, NY, USA), pp. 397–402, ACM, 2012.

[142] D. Wu, B. He, X. Tang, J. Xu, and M. Guo, “Ramzzz: rank-aware dram power management with dynamic migrations and demotions,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, (Los Alamitos, CA, USA), pp. 32:1–32:11, IEEE Computer Society Press, 2012.

[143] H. Huang and K. G. Shin, “Co-operative software-hardware power management for main memory,” in Proceedings of PACS, 2004.

[144] J. Trajkovic, A. V. Veidenbaum, and A. Kejariwal, “Improving sdram ac- cess energy efficiency for low-power embedded systems,” ACM Trans. Embed. Comput. Syst., vol. 7, pp. 24:1–24:21, May 2008.

[145] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “Raidr: Retention-aware intelligent dram refresh,” in Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA ’12, (Washington, DC, USA), pp. 1–12, IEEE Computer Society, 2012.

[146] M. Ghosh and H.-H. S. Lee, “Smart refresh: An enhanced memory controller design for reducing energy in conventional and 3d die-stacked drams,” in Pro- ceedings of the 40th Annual IEEE/ACM International Symposium on Microar- chitecture, MICRO 40, (Washington, DC, USA), pp. 134–145, IEEE Computer Society, 2007.

[147] I. Kadayif, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and A. Sivasubra- maniam, “Eac: A compiler framework for high-level energy estimation and optimization,” in Proceedings of the 2002 Design, Automation and Test in Europe Conference and, pp. 436–442, 2002.

[148] G. Thomas, K. Chandrasekar, B. Akesson, B. Juurlink, and K. Goossens, “A predictor-based power-saving policy for dram memories,” in Proc. 15th Euromicro Conference on Digital System Design, (Izmir, Turkey), September 2012.

[149] C. Rowen and D. Maydan, “Automated processor generation for system-on-chip,” tech. rep., June 2000.

[150] Tensilica, “XPRES Compiler.” Available at: http://www.tensilica.com/products/devtools/hw dev/xpres/, 2008.

[151] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “Mediabench: a tool for evaluating and synthesizing multimedia and communications systems,” in MICRO 30: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, (Washington, DC, USA), pp. 330–335, IEEE Computer Society, 1997.

[152] A. Gordon-Ross, F. Vahid, and N. D. Dutt, “Fast configurable-cache tuning with a unified second-level cache,” IEEE Trans. Very Large Scale Integr. Syst., vol. 17, pp. 80–91, Jan. 2009.

[153] W. Zang and A. Gordon-Ross, “T-spacs: a two-level single-pass cache simulation methodology,” in Proceedings of the 16th Asia and South Pacific Design Automation Conference, ASPDAC ’11, (Piscataway, NJ, USA), pp. 419–424, IEEE Press, 2011.

[154] A. G. Silva-Filho and F. R. Cordeiro, “A combined optimization method for tuning two-level memory hierarchy considering energy consumption,” EURASIP J. Embedded Syst., vol. 2011, pp. 2:1–2:12, Jan. 2011.

[155] C. Zhang, F. Vahid, and W. Najjar, “A highly configurable cache architecture for embedded systems,” SIGARCH Comput. Archit. News, vol. 31, pp. 136– 146, May 2003.

[156] S. Min, J. Peddersen, and S. Parameswaran, “Realising cycle accurate pro- cessor memory simulation via interface abstraction,” in VLSI Design (VLSI Design), 2011 24th International Conference on, pp. 141 –146, jan. 2011.

[157] L. Benini, A. Macii, and M. Poncino, “Energy-aware design of embedded mem- ories: A survey of technologies, architectures, and optimization techniques,” ACM Trans. Embed. Comput. Syst., vol. 2, pp. 5–32, February 2003.

[158] Y. Li and J. Henkel, “A framework for estimation and minimizing energy dissipation of embedded hw/sw systems,” in Proceedings of the 35th annual Design Automation Conference, DAC ’98, (New York, NY, USA), pp. 188–193, ACM, 1998.

[159] W.-T. Shiue, S. Udayanarayanan, and C. Chakrabarti, “Data memory design and exploration for low-power embedded systems,” ACM Trans. Des. Autom. Electron. Syst., vol. 6, pp. 553–568, Oct. 2001.

[160] H. Javaid, A. Janapsatya, M. Haque, and S. Parameswaran, “Rapid runtime estimation methods for pipelined mpsocs,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pp. 363 –368, march 2010.

[161] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically char- acterizing large scale program behavior,” SIGARCH Comput. Archit. News, vol. 30, pp. 45–57, Oct. 2002.

[162] J. J. Yi, D. J. Lilja, and D. M. Hawkins, “A statistically rigorous approach for improving simulation methodology,” in Proceedings of the 9th International Symposium on High-Performance Computer Architecture, HPCA ’03, (Wash- ington, DC, USA), pp. 281–, IEEE Computer Society, 2003.

[163] B. C. Lee and D. M. Brooks, “Accurate and efficient regression modeling for microarchitectural performance and power prediction,” SIGARCH Comput. Archit. News, vol. 34, pp. 185–194, Oct. 2006.

[164] J. H. Anderson, F. N. Najm, and T. Tuan, “Active leakage power optimization for fpgas,” in Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays, FPGA ’04, (New York, NY, USA), pp. 33–41, ACM, 2004.

[165] “Microblaze soft processor core.” Available at: http://www.xilinx.com/tools/microblaze.htm.

[166] K. Swaminathan, E. Kultursay, V. Saripalli, V. Narayanan, and M. Kandemir, “Design space exploration of workload-specific last-level caches,” in Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design, ISLPED ’12, (New York, NY, USA), pp. 243–248, ACM, 2012.

[167] M. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran, “Dew: A fast level 1 cache simulation approach for embedded processors with fifo replacement policy,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pp. 496–501, March 2010.

[168] K. Chandrasekar, B. Akesson, and K. Goossens, “Improved power modeling of ddr sdrams,” in DSD, pp. 99–108, 2011.

[169] T. J. Santner, B. J. Williams, and W. I. Notz, The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.

[170] G. Mariani, A. Brankovic, G. Palermo, J. Jovic, V. Zaccaria, and C. Silvano, “A correlation-based design space exploration methodology for multi-processor systems-on-chip,” in Proceedings of the 47th Design Automation Conference, DAC ’10, (New York, NY, USA), pp. 120–125, ACM, 2010.

[171] “Kriging toolbox for matlab.” http://www2.imm.dtu.dk/~hbni/dace/.

[172] D. R. Jones, M. Schonlau, and W. J. Welch, “Efficient global optimization of expensive black-box functions,” J. of Global Optimization, vol. 13, pp. 455–492, Dec. 1998.

[173] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fourth Edition: A Quantitative Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2006.

[174] A. Jaleel, “Memory characterization of workloads using instrumentation-driven simulation.” http://www.jaleels.org/ajaleel/workload/SPECanalysis.pdf.

[175] S. Sair and M. Charney, “Memory Behavior of the SPEC2000 Benchmark Suite,” tech. rep., IBM T.J. Watson Research Center, Oct 2000.

[176] R. Baysal, B. Nelson, and J. Staum, “Response surface methodology for simulating hedging and trading strategies,” in Simulation Conference, 2008. WSC 2008. Winter, pp. 629–637, Dec. 2008.

[177] J. Shao and B. T. Davis, “The bit-reversal sdram address mapping,” in SCOPES (K. M. Kavi and R. Cytron, eds.), ACM International Conference Proceeding Series, pp. 62–71, 2005.