
The Cache and Codec Model for Storing and Manipulating Data

This article describes an analytical model of systems that store and manipulate data. The cache and codec model captures the interplay between two central aspects of such systems: information representation and data-processing efficiency. The article describes the model and explores design options for systems that incorporate data encryption and compression. In each case study, the authors raise several research implications and directions for future research.

Van Bui
Martha A. Kim
Columbia University

The recent explosion of data has thrown traditional data management questions into stark relief. As the data growth rate outpaces technology scaling, systems designers will need to become increasingly strategic about what data they choose to store, where they store it, and in what representation (for example, compressed, encrypted, processed, or raw). Although modern computer systems span a range of scales and applications, at their core they all store and transform information. The cache and codec model that we developed captures, at a high abstraction level, the ways a computer system stores and transforms information. This simple model of how data flows through a system, potentially manipulated or transformed along the way, can help researchers and designers confront and explore a range of design options.

The cache and codec model represents the system as a set of interacting components, each of which has two properties. The first is the cost per bit of data processed. For a storage component, this might be the time or energy per bit read or written. For encryption, this might be the time or energy per bit encrypted or decrypted. This term is directly linked to how a particular function is implemented. For example, DRAM requires more time and energy per bit read than static RAM (SRAM). Similarly, software encryption is slower and more power hungry than a hardware implementation.

The second property is bits of information per bit of data. Data's value is in the information it contains. Depending on how the information is encoded, the data might use more or fewer bits to represent it. A storage component that does not manipulate the data or change the representation would not change this term. In contrast, a lossless compressor would maintain the information while shrinking the data, thereby increasing this term.

Ultimately, what matters is the cost per bit of information produced, that is:

cost/bit_info = (cost/bit_data) x (bit_data/bit_info)

The cache and codec model captures the interplay between the data representation and the characteristics of

Published by the IEEE Computer Society, 0272-1732/14/$31.00 © 2014 IEEE

the components that manipulate it, thereby enabling high-level design-space explorations for systems that must manipulate large volumes of data efficiently.

In this article, we describe the cache and codec model, along with two case studies that use the model to explore the properties of encrypted and compressed storage systems.

Cache and codec model
We model a simple yet general class of computer systems consisting of a processor that draws data from a storage hierarchy. This storage hierarchy consists of caches that store information and, optionally, codecs that re-encode data along the way. Because the storage system is often a hierarchy, the model is an extension of the classic expression for the expected access time to the ith level of a cache hierarchy:

E_i = k C_i + R_i E_(i+1)    (1)

In this expression, k is the total number of bits accessed, and C_i is the access cost per bit at level i. There are two key conceptual differences with respect to the classic expression. First, each level of the hierarchy consists of either a cache (or other storage module) or a codec. In either case, the first term in Equation 1 captures the cost of accessing that module, whether it is storing or re-encoding data. Second, we have substituted R_i for the traditional miss rate in Equation 1. R_i is the "ratio" of a module, which, for both caches and codecs, captures the module's influence on the volume of data. In the case of a cache module, R_i is simply the miss rate, indicating that for every n bits requested from a cache, it is expected to request R_i x n bits from the lower levels of the hierarchy. For a 4x compression module, the R value would be either 4 or 0.25, depending on the direction of the compression.

As a concrete example, consider a simple two-level hierarchy. In this scenario, Equation 1 expands to

E_1 = k C_1 + k R_1 C_2

The first term, k C_1, is the access cost to the first level; the second term, k R_1 C_2, is the access cost at level 2.

The access cost at a particular level, which we will call A_i, depends on the volume of data drawn from the module at the ith level and the cost of doing so:

A_i = V_i C_i    (2)

where V_i is the expected volume of data drawn from a module, in bits. The volume of data at each level is a function of k and the R values of the upper-level module(s) (for example, the number of accesses to DRAM depends on what is happening in the last-level cache [LLC] and above):

V_i = V_(i-1) R_(i-1)    (3)

The volume of data being accessed at the first level, V_0, is simply k, which is an input to the model. In addition to k, the other inputs to the model are the order of the modules and an (R_i, C_i) pair for each module.

Figure 1 illustrates a slightly more complex example modeling memory compression. This hierarchy contains three levels of cache, backed by compressed memory. Thus, between the LLC and the memory sits a compressor/decompressor that compresses incoming data on writes and decompresses outgoing data on reads. The key to using the model is reasoning about appropriate R and C values for each module. The compressor modeled in this example compresses data by a factor of 4, so R = 0.25. Similarly, in this example we model an L2 with a 20-percent miss rate, so for that module R = 0.20. Using Equation 3, we can calculate the expected volume of data drawn from each module, as illustrated by the dotted red lines in Figure 1. Then, using those volume estimates and the efficiency parameter for each module, C, we can calculate the cost contribution of each module according to Equation 2, as shown with the dashed lines in Figure 1. Although the case studies that follow center on storage, the model is applicable to various other systems where data gets manipulated and stored in some form or fashion.
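The recurrence in Equations 1 through 3 is simple to mechanize. Below is a minimal sketch in Python using the module parameters of the Figure 1 example; the function name and structure are our illustration, not part of the model's definition:

```python
def access_costs(k, modules):
    """Walk the hierarchy top to bottom, cascading volumes (Eq. 3)
    and pricing each module's traffic (Eq. 2).

    modules: ordered list of (R, C) pairs, where R is a cache's miss
    rate or a codec's ratio, and C is its per-bit access cost.
    """
    costs = []
    volume = k  # V_0 = k: all requested bits hit the first module
    for r, c in modules:
        costs.append(volume * c)  # A_i = V_i * C_i
        volume *= r               # V_(i+1) = V_i * R_i
    return costs

# Figure 1's example hierarchy: L1, L2, L3, compressor/decompressor, memory
figure1 = [(0.10, 5), (0.20, 20), (0.25, 50), (0.25, 30), (0.00, 100)]
print([round(a, 3) for a in access_costs(1.0, figure1)])
# -> [5.0, 2.0, 1.0, 0.15, 0.125], matching A = 5k, 2k, 1k, 0.15k, 0.125k
```

Because every term scales with k, running the model with k = 1 yields each module's cost contribution per requested bit, which can then be scaled to any working-set size.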

Case studies
We present two case studies of the cache and codec model for encrypted and compressed storage systems, demonstrating the sort of design-space explorations that it supports.

[Figure 1 diagram: the hierarchy L1 -> L2 -> L3 -> C/D -> Memory, with volumes V = k, 0.1k, 0.02k, 0.005k, and 0.00125k; ratios R = 0.10, 0.20, 0.25, 0.25, and 0.00; costs C = 5, 20, 50, 30, and 100; and access costs A = 5k, 2k, 1k, 0.15k, and 0.125k, respectively.]

Figure 1. Illustration of how the cache and codec model captures the primary features of a memory compression system, namely the compression's impact on the volume of data stored in memory and its cost relative to the cost of the system's other components.

Data encryption
Big data is a rich target for hackers because of the sheer amount of information stored on those servers, with high-profile data breaches of customer records or unauthorized accesses regularly reported. Even RSA, a company with expertise in security technology, is not immune to these breaches.1 Applying data encryption to big-data systems could be a second line of defense for security. This not only is a deterrent to hackers, but also can minimize the damage should a breach occur.

The sooner, or higher in the memory hierarchy, that data is encrypted, the more security this technique offers. However, early encryption can also be more costly, because data in the upper levels of a hierarchy is accessed, by design, more often, and thus will be encrypted and decrypted more often. There is thus a tradeoff between complexity, cost, and security.

In this case study, we use the cache and codec model to analyze the energy cost associated with encrypting data in the LLC, memory, and nonvolatile storage. For each of the three levels, we examine several technologies, as summarized in Table 1. The table also lists the R and C parameters for each cache and codec used in this analysis.

We use the SPEC OMP benchmarks, which have an average dataset size of 31.09 Mbytes and an average LLC miss rate of 36 percent on Dunnington.2 The page fault rates were estimated as an average of typical virtual-memory miss rates on systems with page sizes ranging from 4 to 64 Kbytes.3 The read and write energy efficiency values (C parameters) for the LLC are from Chang et al.,4 and the values for memory and disk are from Ramos and Bianchini.5

LLC encryption. For each LLC encryption scenario, we consider three cache technologies. In the ideal case, the LLC will have both high density and low power. SRAM is the more traditional cache implementation, with fast access times but low density relative to the other technologies and high leakage current. Spin-transfer torque magnetic RAM (STT-RAM) is an alternative technology with higher density, but it has high write latency and write energy consumption. A third option is embedded DRAM (eDRAM), which also features high density and low leakage. A drawback of eDRAM is that it requires refresh operations, which are a major source of power dissipation. Each of these LLC technologies has tradeoffs with respect to density and power.

Because the LLC is already power constrained, adding a data encryption module to the system could further increase total system energy costs. We use our cache and codec model to explore a system with an encryption unit followed by the LLC (SRAM, STT-RAM, or eDRAM), DRAM, and finally a hard disk. We are interested in understanding the energy efficiency that data encryption technologies should achieve in order to prevent significant increases in total energy cost in these systems.

Figure 2a shows the total access energy as a function of data encryption and decryption energy, ranging from an idealized point where encryption is free to it costing 10 nJ/bit. We observe that regardless of the LLC technology, energy overheads can be made quite low provided the encryption is implemented efficiently (at 0.2 nJ/bit or lower), but that beyond that point, the overhead rises rapidly, to as much as 251 percent when encryption/decryption runs at 10 nJ/bit. Of the three locations to place an encryption module, before the LLC is, not surprisingly, the most sensitive to encryption efficiency.

Table 1. Technology parameters for the encryption case study.

Level                   Technology  R        C_read         C_write         Source
Last-level cache (LLC)  SRAM        0.36     2.10 nJ/block  2.21 nJ/block   Zhang et al.2 and Chang et al.4
                        STT-RAM     0.36     0.94 nJ/block  20.25 nJ/block  Zhang et al.2 and Chang et al.4
                        eDRAM       0.36     1.74 nJ/block  1.79 nJ/block   Zhang et al.2 and Chang et al.4
Memory                  PCM         5.05e-6  4.94 pJ/bit    33.64 pJ/bit    Hennessy et al.3 and Ramos et al.5
                        DRAM        5.05e-6  1.17 pJ/bit    0.39 pJ/bit     Hennessy et al.3 and Ramos et al.5
Disk                    SSD         0        150 µJ/4KB     150 µJ/4KB      Ramos et al.5
                        HDD         0        16.8 mJ/4KB    16.8 mJ/4KB     Ramos et al.5

eDRAM: embedded DRAM; HDD: hard disk drive; PCM: phase-change memory; SRAM: static RAM; SSD: solid-state drive; STT-RAM: spin-transfer torque magnetic RAM.
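Note that Table 1 quotes C in mixed units (nJ/block, pJ/bit, and energy per 4-Kbyte access), so the values must be normalized to a common unit before they can be combined in the model. A small conversion sketch, assuming 64-byte cache blocks (a block size the table does not state, so treat it as an illustration):

```python
BLOCK_BITS = 64 * 8    # assumed 64-byte cache block
PAGE_BITS = 4096 * 8   # 4-Kbyte page

def to_pj_per_bit(value, unit):
    """Convert a Table 1 C value to pJ/bit."""
    scale = {
        "nJ/block": 1e3 / BLOCK_BITS,
        "pJ/bit": 1.0,
        "uJ/4KB": 1e6 / PAGE_BITS,
        "mJ/4KB": 1e9 / PAGE_BITS,
    }
    return value * scale[unit]

print(round(to_pj_per_bit(2.10, "nJ/block"), 2))  # SRAM read: 4.1 pJ/bit
print(round(to_pj_per_bit(16.8, "mJ/4KB")))       # HDD access: 512695 pJ/bit
```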

Memory encryption. For memory encryption, we evaluate DRAM and phase-change memory (PCM) using the cache and codec model. PCM is being explored as an alternative to DRAM because it is denser and has a lower idle power. Although DRAM has lower access times, PCM does not require refreshing. Given these tradeoffs, researchers have proposed hybrid PCM-DRAM systems.8

We do not expect the energy overhead of memory encryption to be as high as at the LLC, because of the less-frequent memory reads and writes that require encryption and decryption. Nonetheless, it is important to test this hypothesis. Figure 2b shows the data access energy for encrypted memory as a function of the encryption efficiency. As with the LLC, the total energy cost differs little for PCM and DRAM. To keep energy costs from increasing above 5 percent, encryption efficiency should be 0.6 nJ/bit or better, three times less efficient than the 0.2 nJ/bit required for 5 percent overhead with LLC encryption. On the other hand, the overhead does not rise as steeply as at the LLC, with an 88 percent overhead when encryption requires 10 nJ/bit.

Nonvolatile storage encryption. We evaluate two encrypted nonvolatile storage media, solid-state drives (SSDs) and hard disk drives (HDDs). SSDs offer roughly three orders of magnitude lower access latency than HDDs and are also more energy efficient. However, they have lower capacity per dollar,9 creating a tradeoff between performance, energy, and cost. We focus on evaluating the energy costs with the cache and codec model, but we could adjust the C model parameters to analyze performance and dollar costs as well. Examining Figure 2c, we note that encryption at this level is far less sensitive to encryption efficiency than the other two, suggesting that lower-cost, possibly even software, encrypters would be sufficient. Although the energy overhead for both SSDs and HDDs is just a fraction of a percent, the total SSD energy costs are 92 percent lower than for HDDs, which is what we would expect to find when analyzing the energy dimension of the tradeoff.

The energy efficiency of various software AES implementations can vary between 1.55 and 28 nJ/bit, depending on how optimized the design is.10 For ASIC AES-128 implementations, energy efficiency can range from 0.42 pJ/bit to 0.18 nJ/bit.11 Our models suggest that hardware-based AES encryption is more suitable for LLC and memory encryption in order to minimize energy overheads. At 1.55 nJ/bit energy efficiency, software AES would have to improve energy efficiency by at least 87 percent for LLC encryption and 61 percent in the case of memory encryption to keep energy overheads low.

Combining data encryption with compression can provide strong protection for data,

within the performance constraints required to support big data analytics. We consider data compression in our second case study.

[Figure 2 plots: three panels of data access energy (mJ) versus encryption/decryption efficiency (nJ/bit, 0.1 to 10): (a) eDRAM-, STT-RAM-, and SRAM-based LLCs over DRAM and an HDD; (b) an SRAM LLC over DRAM or PCM and an HDD; (c) an SRAM LLC over DRAM with an HDD or SSD; read and write curves shown for each configuration.]

Figure 2. Encrypted data access costs as a function of encryption/decryption efficiency. Although efficient encryption and decryption can keep the overheads low regardless of placement, the encrypted LLC (a) is the most sensitive to efficiency compared to the encrypted memory (b) and encrypted disk (c).

Data compression
Memory compression is not a new technology; it dates back to the HP Omnibook, introduced in 1993. Thanks to ever-growing data sizes and concerns about energy consumption, cache and memory compression are once again a topic of great interest, with significant academic scrutiny.6,12

The space of potential compressed storage designs is vast. Which modules in the memory hierarchy should store compressed or uncompressed data? How will each choice impact performance and energy? Does the answer differ according to the anticipated workload? Should compression be done in hardware or software? Will new technologies such as 3D stacking and embedded DRAM make a difference? These questions, and others, should be evaluated during the design phase. The cache and codec model can capture the first-order effects of such systems and help designers perform the necessary high-level, path-finding design-space explorations.

In this case study, we demonstrate how the cache and codec model can be used to explore extensions of Co-DCC,6 a recently proposed compressed LLC. Decoupled Compressed Cache (DCC) and an optimized design called Co-Compacted DCC (Co-DCC) seek to increase effective cache capacity at area overheads comparable to previous compressed caches.6 DCC reduces area overheads using superblocks, which contain contiguous cache blocks that share a single address tag. Fragmentation is reduced in a superblock by decoupling the address tags, allowing sub-blocks in a set to map to any tag in that set. Co-DCC further optimizes DCC to reduce fragmentation by compacting the blocks in a superblock into the same set of sub-blocks. DCC increases normalized effective capacity by 2.2 times, whereas Co-DCC increases this average by 2.6 times for several workloads.6 Co-DCC addresses the common issues in cache compression design, including internal fragmentation and high area overheads, and can improve both performance and energy efficiency across a range of workloads.

Using the cache and codec model, we explore what would happen if the Co-DCC techniques were applied to other caches in addition to the LLC. As in Figure 1, each component in the system, including the compressor/decompressor (C/D), has an associated (R, C) pair.
Table 2. Experimental R and C values for the compression case study.

Level                          R     C_latency                                      C_energy                                                Source
Compressor/decompressor (C/D)  0.26  9 cycles decompression, 16 cycles compression  0.05 nJ/block decompression, 0.13 nJ/block compression  Sardashti et al.6 and Muralimanohar et al.7
Level 1 (L1)                   0.06  2 cycles                                       0.17 nJ/block                                           Sardashti et al.6 and Muralimanohar et al.7
Level 2 (L2)                   0.43  10 cycles                                      0.32 nJ/block                                           Sardashti et al.6 and Muralimanohar et al.7
Level 3 (L3)                   0.36  30 cycles                                      1.47 nJ/block                                           Sardashti et al.6 and Muralimanohar et al.7
Memory                         0.00  160 cycles                                     60.35 nJ/block                                          Sardashti et al.6 and Muralimanohar et al.7
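To give a feel for how such a comparison runs, the sketch below slides the C/D through the hierarchy using Table 2's energy values. It is a simplified, assumption-laden illustration, not the authors' simulator: it charges only the decompression energy for all C/D traffic, and it applies Co-DCC's reported 24 percent miss-rate reduction to every cache holding compressed data; the `read_energy` helper is ours.

```python
CD = (0.26, 0.05)        # C/D: R = compression factor; C = 0.05 nJ/block (decompression)
LEVELS = [(0.06, 0.17),  # L1: (miss rate M, C_energy in nJ/block)
          (0.43, 0.32),  # L2
          (0.36, 1.47),  # L3
          (0.00, 60.35)] # Memory
PC = 0.24                # Co-DCC's average LLC miss-rate reduction

def read_energy(k, cd_pos):
    """Expected read energy with the C/D inserted just above level cd_pos.
    Modules below the C/D hold compressed data, so their miss rates
    become (1 - PC) * M, following R_i = (1 - Pc) * M."""
    modules = []
    for i, (miss, cost) in enumerate(LEVELS):
        if i == cd_pos:
            modules.append(CD)
        r = (1 - PC) * miss if i >= cd_pos else miss
        modules.append((r, cost))
    volume, total = k, 0.0
    for r, c in modules:
        total += volume * c  # A_i = V_i * C_i
        volume *= r          # V_(i+1) = V_i * R_i
    return total

for pos, name in enumerate(["L1", "L2", "L3", "memory"]):
    print(f"C/D above {name}: {read_energy(1.0, pos):.4f} nJ per block requested")
```

Swapping the energy column for the latency column (and weighting compression versus decompression traffic) would yield the corresponding performance comparison.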

The R values for the caches and memory are the miss rates, whereas for the C/D module the R value is the compression factor (that is, the compressed size over the original size). Co-DCC uses the CPACK+Z dictionary-based compression algorithm, which combines CPACK and the zero-block detection algorithm. The compression factor for CPACK+Z is on average 0.26, with a decompression latency of nine cycles.6 Cache compression lowers the cache miss rate, which is captured in the R values for the storage modules containing compressed data. In particular, Co-DCC achieves on average a 24 percent lower LLC miss rate across a range of workloads,6 so the R values of any compressed caches are tuned accordingly:

R_i = (1 - P_c) M

where P_c is the percentage decrease in miss rate (that is, 24 percent), and M is the original miss rate.

In this case study, we set k to be the average data size for the SPEC OMP benchmark suite, which is 31.09 Mbytes.2 To explore the system's dynamics, we model average miss rates ranging from 0 to 100 percent. Additionally, we model the average miss rates (R parameter) for the data storage modules using SPEC OMP running on Intel's Dunnington.2 The R parameter for the C/D module is based on the compression factor of CPACK+Z used in the implementation of Co-DCC.6

The C values are either the per-bit time or energy cost for each module in the system. We model the same system as in the original Co-DCC experiments: a 3.2-GHz processor with a 32-Kbyte private L1 cache, a 256-Kbyte private L2 cache, an 8-Mbyte shared L3 cache, and 4 Gbytes of main memory.6 We use Cacti to derive both the energy and latency cost of each data storage module.7 The energy and latency costs for the C/D module are nine cycles and 0.05 nJ per bit for CPACK+Z.6 Table 2 summarizes the R and C parameters for this experiment.

We use the cache and codec model to analyze four systems that differ in terms of where the C/D is placed in the hierarchy, thereby varying how much of the hierarchy caches compressed versus uncompressed data, and how much data is expected to flow through the compressor/decompressor.

Figure 3 shows the time and energy breakdown for each configuration, normalized to the access time and energy of a hierarchy with no compression. Examining both access time and energy, we observe that the higher in the storage hierarchy the compressor is placed, the more time (or energy) is spent on compression. This is because the higher in the hierarchy the compressor sits, the more data it is called on to compress or decompress, thus increasing its cost.

When miss rates are low, this cost is not offset by savings from the lower levels of the hierarchy caching compressed data. However, when miss rates are high and more data is flowing through the lower levels, the savings due to compression offset its cost. We observe that the higher the miss rates, the higher in the hierarchy the compressor should be placed. Although the focus to date has been on memory and LLC compression, this exploration suggests that there are many circumstances in which pulling the compressor higher into the storage hierarchy could be beneficial. Compression might even be thought of as a way to mitigate high cache-miss rates, and, using compression, a system could perhaps substitute smaller or lower-associativity (and lower-power) caches without impacting performance.

It is also instructive to compare and contrast the performance and energy trends. For example, although Figure 3 suggests that L1 compression on workloads with very low miss rates is detrimental to performance, it is a clear win from an energy perspective, reducing energy by almost half.

[Figure 3 plots: normalized read time (%) and normalized read energy (%) for miss rates from 0 to 100 percent plus the SPEC average, under five configurations (no compression, and compression at L1, L2, L3, or memory), with stacked per-module (C/D, L1, L2, L3, memory) breakdowns.]

Figure 3. Time and energy breakdown for different cache miss rates and data compressor locations in the memory hierarchy for Co-Compacted Decoupled Compressed Cache (Co-DCC) compression. Data compression at L1 has excessively high performance costs for memory-efficient workloads, whereas energy costs are more manageable.

This article has presented the cache and codec model for the storage and manipulation of data. As data volumes and system design spaces grow, thanks in part to new algorithms and technologies, high-level path finding will increase in importance. This model provides one tool for this purpose. Thanks to its generality, it can be adapted and applied to a broad range of systems and applications beyond the two described here. MICRO

References
1. J. Markoff, "SecurID Company Suffers a Breach of Data Security," New York Times, 17 Mar. 2011; www.nytimes.com/2011/03/18/technology/18secure.html.
2. Y. Zhang, M. Kandemir, and T. Yemliha, "Studying Inter-core Data Reuse in Multicores," Proc. ACM SIGMETRICS Joint Int'l Conf. Measurement and Modeling of Computer Systems, 2011, pp. 25-36.
3. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, 2011.
4. M.-T. Chang et al., "Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM," Proc. IEEE 19th Int'l Symp. High Performance Computer Architecture, 2013, pp. 143-154.
5. L. Ramos and R. Bianchini, "Exploiting Phase-Change Memory in Cooperative Caches," Proc. IEEE 24th Int'l Symp. Computer Architecture and High Performance Computing, 2012, pp. 227-234.
6. S. Sardashti and D.A. Wood, "Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching," Proc. 46th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2013, pp. 62-73.
7. N. Muralimanohar, R. Balasubramonian, and N. Jouppi, CACTI 6.0: A Tool to Model Large Caches, tech. report HPL-2009-85, Hewlett-Packard, 2009.
8. M.K. Qureshi, V. Srinivasan, and J.A. Rivers, "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," Proc. 36th Ann. Int'l Symp. Computer Architecture, 2009, pp. 24-33.
9. D. Narayanan et al., "Migrating Server Storage to SSDs: Analysis of Tradeoffs," Proc. 4th ACM European Conf. Computer Systems, 2009, pp. 145-158.
10. B. Liu and B.M. Baas, "Parallel AES Encryption Engines for Many-Core Processor Arrays," IEEE Trans. Computers, vol. 62, no. 3, 2013, pp. 536-547.
11. H.W. Chung, "An Energy Efficient AES Engine with DPA-Resistance," PhD dissertation, Dept. Electrical Eng. and Computer Science, Massachusetts Institute of Technology, 2009.
12. A. Shafiee et al., "MemZip: Exploiting Unconventional Benefits from Memory Compression," to be published in Proc. 20th Int'l Symp. High-Performance Computer Architecture, 2014.

Martha A. Kim is an assistant professor in the Computer Science Department at Columbia University. Her research interests include computer architecture, parallel hardware and software systems, and energy-efficient computation on big data. Kim has a PhD in computer science from the University of Washington. She is a member of IEEE and the ACM.

Direct questions and comments about this article to Martha Kim, 450 Computer Science Building, 1214 Amsterdam Ave., Mailcode: 0401, New York, NY 10027-7003; [email protected].

Van Bui is a PhD candidate in the Computer Science Department at Columbia University. Her research interests include computer architecture, parallel computing, and high-performance and energy-efficient computing. Bui has an MS in computer science from the University of Houston. She is a member of IEEE.