Analytical Evaluation of Energy and Throughput for Multilevel Caches
Total Page:16
File Type:pdf, Size:1020Kb
Analytical Evaluation of Energy and Throughput for Multilevel Caches Muhammad Yasir Qadri, Klaus D. McDonald-Maier School of Computer Science and Electronic Engineering University of Essex, Colchester CO4 3SQ, UK Email: [email protected], [email protected] Abstract--With the increase of processor-memory II. RELATED WORK performance gap, it has become important to gauge the performance of cache architectures so as to evaluate their This section presents the related research findings impact on energy requirement and throughput of the in the area of cache performance estimation and its system. Multilevel caches are found to be increasingly usage for various applications. prevalent in the high-end processors. Additionally, the Lim et al. [5] have proposed a set of equations to recent drive towards multicore systems has necessitated estimate accurately worst case time analysis (WCTA) the use of multilevel cache hierarchies for shared memory for RISC processors. Their models detail the architectures. This paper presents simplified and pipelining, instruction cache and data cache effects on accurate mathematical models to estimate the energy consumption and the impact on throughput for multilevel real-timeliness of the system. Another reference of caches for single core systems. timing models is the work by Peuto et al. [6], in which the authors proposed an instruction timing model Keywords-cache mathematical models; energy; accounting cache effects. Taha et al. [7] presented an throughput; multilevel-cache instruction throughput model of Superscalar processors. Their model include parameters such as I. INTRODUCTION superscalar width, depth of pipeline, instruction fetch Cache memories were primarily introduced to mechanism (in-order/out-of-order), branch predictor, bridge the gap between processor and memory central issue window width, number of functional units performance. With the processors operating at their latencies and throughputs, re-order buffer width remarkable speeds and DRAM speeds operating at and cache size and latency etc. Their model resulted in fraction of that, multilevel cache hierarchies provided a errors up to 5.5% when compared to the SimpleScalar viable solution to keep up the trend of the Dave simulator [8]. Wada et al. [9] proposed detailed circuit House’s revision of Moore’s law predicting that level analytical access time model for on-chip cache computer performance will double every 18 months memories. The model takes inputs such as number of [1]. The mathematical models presented in this paper tag/data array per word/bit line etc. On comparing with analyze the energy consumption and throughput for SPICE results the model gives 20% error for an 8 ns multilevel data cache using PowerPC750, and access time cache memory. UltraSPARC-II processors. The throughput model was Simunic et al. [10] proposed analytical models for validated using SIMICS full system simulator and energy estimation in embedded systems. The per cycle energy model using per cycle energy consumption energy model presented in their work comprises statistics presented in the processor datasheet. The components such as energy consumption of processor, models are multilevel extension of the single level memory, interconnects and pins, DC-to-DC converters models presented by the authors in [2-4]. The aim of and level two (L2) cache. The model validation was this research is to propose simplified mathematical performed using integrated simulator of ARM SDK models for multilevel cache so in the future [11]. This was found to be within 5% of the hardware multiprocessor configurations could be evaluated using measurements for the same operating frequency. The them, in a resource constrained environment such as models presented in their work holistically analyze the the system itself. embedded system power and do not estimate energy The rest of this paper is divided into five sections. consumption for individual components of a processor. In the following section related work is discussed. The Kamble et al. in [12] also presented detailed cache energy and throughput models for multilevel cache are energy model. The analytical models for conventional introduced in section 3. In the fourth section the caches were found to be accurate to within 2% error. models are validated using a two level cache hierarchy, However, the technique over–predicts the power and the final section presents the conclusion. dissipations of low–power caches by as much as 30%. 1 Revisiting the models in [13] the authors found them to a full system simulation tool. The simulator is targeted be accurate within 20% for low power caches upon to provide accurate timing profile, but presently does comparing with their designed simulator CAche Power not support energy profiling of the target platforms. Estimator (CAPE) for the simulated execution of Austin et al. [8] presented SimpleScalar another full several SPEC95 benchmarks. Li et al. [14] proposed a system virtualization platform. The tool uses an full system energy model comprising of cache, main execution-driven simulation technique that reproduces memory, and software energy components. Their work a device’s internal operation. Exploiting SimpleScalar also details a framework to assess and optimize energy interface, Brooks et al. [20] developed a tool called dissipation of embedded systems. Tiwari et al. [15] Wattch for architectural level power analysis. It proposed an instruction level energy model estimating maintains accuracy within 10% as compared to results energy consumed in individual pipeline stages. An of circuit level power analysis tools. A further addition identical methodology was applied in [16] by the to this was done by Flores et al. [21] by proposing authors to study the effects of cache enabling and Sim-PowerCMP for chip multiprocessors. The tool disabling. estimates both dynamic and leakage power for CMP A simulator for analyzing cache energy dissipation architectures based on a Linux x86 model of RSIM called CAche Power Estimator (CAPE) was developed [22] presented by Hughes et al. by Kamble et al. based on the mathematical models they proposed [12, 13]. Based on these models another III. THE CACHE ENERGY AND THROUGHPUT MODELS tool named INCAPE was developed by Korkmaz et al. This section presents the energy and throughput [17]. This simulator is capable of estimating energy models for a two-level cache hierarchy. dissipation for complete memory hierarchy including multilevel cache system. Another widely referred, open A. Energy Model source tool is CACTI (cache access and cycle time If , , and is the energy consumed by model) [18] by HP Laboratories Inc. It provides instruction, data and level 2 (L2) cache operations, thorough, near accurate memory access time and is the Energy consumed by the instructions energy estimates. However it is not a trace driven which do not require data memory access, and simulator, so energy consumption resulting in number the leakage energy of the processor, then the total of hits or misses is not accounted for a particular energy consumption of the code in Joules [J] can application. Magnusson et al. [19] presented SIMICS, be defined as, (1) Where, L1 Instruction Cache (2) (3) (4) L1 Data Cache (5) (6) (7) (8) L2 Cache (9) (10) (11) (12) 2 In the above equations , , and calculated by multiplying the number of memory denote the read, write and miss penalty energy accesses with their read and write cycles energy. of the corresponding cache x (i.e. instruction, data or The idle mode leakage energy of the processor L2 cache). The read and write cycle energy per cache can be calculated as access is denoted by and . The , (13) number of data read and write transactions of the cache (including all hits and miss) is denoted by and . Furthermore , , where [Sec] is the total time for which processor denote the L2 cache’s instruction fetch, data read and was idle. data write transactions respectively. The processor’s B. Throughput Model per cycle energy consumption is denoted by ; If , , and is the time taken in instruction, , , and denote the data and level 2 (L2) cache operations, and the read/write miss penalty (in terms of number of cycles) time taken in execution of cache access instructions and their corresponding miss rates. The energy [Sec], and the time taken in consumed in L2 cache to data and code memory is read, write and miss penalty for cache x; then the denoted by and that could also be total time taken by an application could be estimated as (14) Furthermore, L1 Instruction Cache (15) (16) (17) L1 Data Cache (18) (19) (20) (21) L2 Cache (22) (23) (24) (25) and, (26) 3 where , is the time taken per cache TABLE II. CACHE SIMULATOR DATA read and write cycle and is the processor cycle CACTI Data time in seconds [sec]. L1 I-Cache PowerPC750 UltraSPARC II TABLE I. PROCESSOR PARAMETERS Cache Size [KBytes] 32 Cache Size [KBytes] 16 Line Size [Bytes] 32 Line Size [Bytes] 32 Parameter Value Number of Lines 1024 Number of Lines 512 R/W Ports 1 R/W Ports 1 Processor PowerPC750 UltraSPARC II Associativity 8 Associativity 2 Execution mode In-order In-order Miss Penalty[cycles] 7 Miss Penalty[cycles] 10 Clock frequency [MHz] 373.5 168 Read ports 0 Read ports 0 Cycle Time [ns] 2.68 5.95 Write ports 0 Write ports 0 Fabrication Technology [μm] 0.2 0.35 Access Time [ns] 1.82 Access Time [nSec] 2.91 V [V] 2.2 3.3 Cycle Time [ns] 1.04 Cycle Time [nSec] 1.58 dc Read Energy [nJ] 0.725 Read Energy [nJ] 0.49 Operating current IDD [A] 2 9.4 Energy per Cycle [nJ] 11.8 185 MontaVista Operating System Fedora Core 3 L1 D-Cache Linux 2.1 PowerPC750 UltraSPARC II Cache Size [KBytes] 32 Cache Size [KBytes] 16 Line Size [Bytes] 32 Line Size [Bytes] 32 IV. MODEL VALIDATION Number of Lines 1024 Number of Lines 512 To validate the accuracy of the proposed models, R/W Ports 1 R/W Ports 1 Associativity 8 Associativity 2 Virtutech’s SIMICS full system simulator was used to Miss Penalty[cycles] 7 Miss Penalty[cycles] 10 simulate two processor models i.e.