The Cache and Codec Model for Storing and Manipulating Data
Van Bui and Martha A. Kim, Columbia University

This article describes an analytical model of systems that store and manipulate data. The cache and codec model captures the interplay between two central aspects of such systems: information representation and data-processing efficiency. This article describes the model and explores design options for systems that incorporate data encryption and compression. In each case study, the authors raise several research implications and directions for future research.

The recent explosion of data has thrown traditional data management questions into stark relief. As the data growth rate outpaces technology scaling, computer systems designers will need to become increasingly strategic about what data they choose to store, where they store it, and in what representation (for example, compressed, encrypted, processed, or raw). Although modern computer systems span a range of scales and applications, at their core they all store and transform information. The cache and codec model that we developed captures, at a high abstraction level, the ways a computer system stores and transforms information. This simple model of how data flows through a system, potentially manipulated or transformed along the way, can help researchers and designers confront and explore a range of design options.

The cache and codec model represents the system as a set of interacting components, each of which has two properties. The first is the cost per bit of data processed. For a storage component, this might be the time or energy per bit read or written. For encryption, this might be the time or energy per bit encrypted or decrypted. This term is directly linked to how a particular function is implemented. For example, DRAM requires more time and energy per bit read than static RAM (SRAM). Similarly, software encryption is slower and more power hungry than a hardware implementation.

The second property is bits of information per bit of data. Data's value is in the information it contains. Depending on how the information is encoded, the data might use more or fewer bits to represent it. A storage component that does not manipulate the data or change the representation would not change this term. In contrast, a lossless compressor would maintain the information while shrinking the data, thereby increasing this term.

Ultimately, what matters is the cost per bit of information produced; that is,

    cost/bit_info = (cost/bit_data) × (bit_data/bit_info)

The cache and codec model captures the interplay between the data representation and the characteristics of the components that manipulate it, thereby enabling high-level design-space explorations for systems that must manipulate large volumes of data efficiently.
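To make the accounting concrete, this identity can be evaluated directly. The short sketch below uses made-up numbers (a 100-unit per-bit storage cost and a hypothetical 4x lossless compressor), not values from the article, to show how a denser representation lowers the cost per bit of information even when the per-data-bit cost is unchanged.

    # Cost per bit of information, per the identity
    #   cost/bit_info = (cost/bit_data) * (bit_data/bit_info)
    # All numbers are illustrative placeholders, not values from the article.

    def cost_per_info_bit(cost_per_data_bit, data_bits_per_info_bit):
        """Combine a module's per-data-bit cost with its representation density."""
        return cost_per_data_bit * data_bits_per_info_bit

    # Raw storage: one data bit per information bit, 100 cost units per bit stored.
    raw = cost_per_info_bit(cost_per_data_bit=100.0, data_bits_per_info_bit=1.0)

    # Lossless 4x compression: 0.25 data bits per information bit,
    # stored in the same 100-unit-per-bit medium.
    compressed = cost_per_info_bit(cost_per_data_bit=100.0, data_bits_per_info_bit=0.25)

    print(raw, compressed)  # 100.0 vs. 25.0 cost units per bit of information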
In this article, we describe the cache and codec model, along with two case studies that use the model to explore the properties of encrypted and compressed storage systems.

Cache and codec model

We model a simple yet general class of computer systems consisting of a processor that draws data from a storage hierarchy. This storage hierarchy consists of caches that store information and, optionally, codecs that re-encode data along the way.

Because the storage system is often a hierarchy, the model is an extension of the classic expression for the expected access time to the ith level of a cache hierarchy:

    Ei = k × Ci + Ri × Ei+1    (1)

In this expression, k is the total number of bits accessed, and Ci is the access cost per bit at level i. There are two key conceptual differences with respect to the classic expression. First, each level of the hierarchy consists of either a cache (or other storage module) or a codec. In either case, the first term in Equation 1 captures the cost of accessing that module, whether it is storing or re-encoding data. Second, we have substituted Ri for the traditional miss rate in Equation 1. Ri is the "ratio" of a module, which, for both caches and codecs, captures the module's influence on the volume of data. In the case of a cache module, Ri is simply the miss rate, indicating that for every n bits requested from a cache, it is expected to request Ri × n bits from the lower levels of the hierarchy. For a 4× compression module, the R value would be either 4 or 0.25, depending on the direction of the compression.

As a concrete example, consider a simple two-level hierarchy. In this scenario, Equation 1 expands to

    E1 = k × C1 + k × R1 × C2

The first term, k × C1, is the access cost to the first level; the second term, k × R1 × C2, is the access cost at level 2. The access cost at a particular level, which we will call Ai, depends on the volume of data drawn from the module at the ith level and the cost of doing so:

    Ai = Vi × Ci    (2)

where Vi is the expected volume of data drawn from a module in bits. The volume of data at each level is a function of k and the R values of the upper-level module(s) (for example, the number of accesses to DRAM depends on what is happening in the last-level cache [LLC] and above):

    Vi = Vi-1 × Ri-1    (3)

The volume of data being accessed at the first level, V0, is simply k, which is an input to the model. In addition to k, the other inputs to the model are the order of the modules and an (Ri, Ci) pair for each module.

Figure 1 illustrates a slightly more complex example modeling memory compression. This hierarchy contains three levels of cache, backed by compressed memory. Thus, between the LLC and the memory sits a compressor/decompressor that compresses incoming data on writes and decompresses outgoing data on reads. The key to using the model is reasoning about appropriate R and C values for each module. The compressor modeled in this example compresses data by a factor of 4, so R = 0.25. Similarly, in this example we model an L2 with a 20-percent miss rate, so for that module R = 0.20.

Using Equation 3, we can calculate the expected volume of data drawn from each module, as illustrated by the dotted red lines in Figure 1. Then, using those volume estimates and the efficiency parameter for each module, C, we can calculate the cost contribution for each module according to Equation 2, shown with the dashed lines in Figure 1. Although the case studies that follow center on big data, the model is applicable to various other systems where data gets manipulated and stored in some form or fashion.

Figure 1. Illustration of how the cache and codec model captures the primary features of a memory compression system, namely the compression's impact on the volume of data stored in memory and its cost relative to the cost of the system's other components. The module parameters and derived values annotated in the figure:

    Module     R      C     V (volume)   A = V × C
    L1         0.10     5   k            5k
    L2         0.20    20   0.1k         2k
    L3 (LLC)   0.25    50   0.02k        1k
    C/D        0.25    30   0.005k       0.15k
    Memory     0.00   100   0.00125k     0.125k
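Equations 1 through 3 are simple enough to transcribe directly. The following minimal Python sketch, using the (R, C) pairs annotated in Figure 1 and normalizing k to 1, reproduces the per-module volumes and cost contributions listed above; the module and variable names are ours, chosen only for illustration.

    # Sketch of the cache and codec model (Equations 1-3).
    # Each module is described by R (its ratio: the fraction of incoming data
    # volume it passes to the next level) and C (its cost per bit of data).
    # The (R, C) pairs below are the ones annotated in Figure 1; k is normalized to 1.

    modules = [
        ("L1",       0.10,   5),   # for a cache, R is the miss rate
        ("L2",       0.20,  20),
        ("L3 (LLC)", 0.25,  50),
        ("C/D",      0.25,  30),   # 4x compressor between the LLC and memory
        ("Memory",   0.00, 100),   # last level: nothing propagates further
    ]

    k = 1.0                  # total bits accessed; V0 = k (Equation 3's base case)
    volume = k
    total = 0.0
    for name, ratio, cost_per_bit in modules:
        contribution = volume * cost_per_bit      # Equation 2: Ai = Vi * Ci
        total += contribution
        print(f"{name:9s} V = {volume:7.5f}k   A = {contribution:5.3f}k")
        volume *= ratio                           # Equation 3: Vi+1 = Vi * Ri

    print(f"Total expected cost E1 = {total:.3f}k")  # 5 + 2 + 1 + 0.15 + 0.125 = 8.275k

Expanding the recursive Equation 1 over the whole hierarchy yields exactly this sum of per-module contributions, which is why a single pass over the ordered modules suffices.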
Case studies

We present two case studies of the cache and codec model for encrypted and compressed storage systems, demonstrating the sort of design-space explorations that it supports.

Data encryption

Big data is a rich target for hackers because of the sheer amount of information stored on big-data servers, with high-profile data breaches of customer records or unauthorized accesses regularly reported. Even RSA, a company with expertise in data security technology, is not immune to these breaches.[1] Applying data encryption to big-data storage systems could be a second line of attack for security. This not only is a deterrent to hackers, but also can minimize the damage should a breach occur.

[...] percent on Intel Dunnington.[2] The page fault rates were estimated to be an average of typical virtual-memory miss rates on systems with page sizes ranging from 4 to 64 Kbytes.[3] The read and write energy efficiency values (C parameters) for the LLC are from Chang et al.,[4] and the values for memory and disk are from Ramos and Bianchini.[5]

LLC encryption. For each LLC encryption scenario, we consider three cache technologies. In the ideal case, the LLC will have both high density and low power. SRAM is the more traditional cache implementation, with fast access times but low density relative to the other technologies and high leakage current. Spin-transfer torque magnetic RAM (STT-RAM) is an alternative technology with higher density, but it has high write latency and write energy consumption. A third option is embedded DRAM (eDRAM), which also features high density and low leakage. A drawback of eDRAM is that it requires refresh operations, and this is a major source of power dissipation. Each of these LLC technologies has tradeoffs with respect to density and power.

Because the LLC is already power constrained, adding a data encryption module to the system could further increase total system energy costs. We use our cache and codec model to explore a system with an encryption [...]
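To give a flavor of how such an exploration could be set up, the sketch below treats the encryption hardware as one more codec in the hierarchy, with R close to 1 (ciphertext occupies roughly as many bits as plaintext) and a per-bit cost C that depends on the implementation. Every numeric parameter here is a placeholder chosen for illustration; these are not the C values from Chang et al. or Ramos and Bianchini cited above, and the upper cache levels are omitted for brevity.

    # Hypothetical exploration of an encrypted-LLC hierarchy using the model.
    # The encryption codec is modeled with R = 1.0 (it re-encodes data without
    # changing its volume) and a placeholder per-bit cost. The LLC per-bit costs
    # for SRAM, STT-RAM, and eDRAM are also placeholders, not published figures.

    def total_cost(modules, k=1.0):
        """Sum of Ai = Vi * Ci with V0 = k and Vi+1 = Vi * Ri (Equations 2 and 3)."""
        volume, total = k, 0.0
        for ratio, cost_per_bit in modules:
            total += volume * cost_per_bit
            volume *= ratio
        return total

    llc_cost = {"SRAM": 40, "STT-RAM": 55, "eDRAM": 45}  # energy per bit, arbitrary units
    encryptor = (1.0, 10)    # codec: R = 1, placeholder per-bit encryption cost
    memory = (0.0, 100)      # last level: nothing propagates further

    for tech, c in llc_cost.items():
        plain = total_cost([(0.25, c), memory])               # LLC (R = 0.25) + memory
        secure = total_cost([encryptor, (0.25, c), memory])   # encrypt data entering the LLC
        print(f"{tech:8s} plaintext LLC = {plain:6.2f}k   encrypted LLC = {secure:6.2f}k")

In a fuller study, one would sweep the encryption cost and the measured cache parameters for each technology; the point of the model is that such sweeps reduce to varying the (R, C) pairs of the affected modules.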