
CRAM: Efficient Hardware-Based Memory Compression for Bandwidth Enhancement

Vinson Young, Sanjay Kariyappa, and Moinuddin K. Qureshi
Georgia Institute of Technology
{vyoung,sanjaykariyappa,moin}@gatech.edu

Abstract—This paper investigates hardware-based memory compression designs to increase the memory bandwidth. When lines are compressible, the hardware can store multiple lines in a single memory location, and retrieve all these lines in a single access, thereby increasing the effective memory bandwidth. However, relocating and packing multiple lines together depending on the compressibility causes a line to have multiple possible locations. Therefore, memory compression designs typically require metadata to specify the compressibility of the line. Unfortunately, even in the presence of dedicated metadata caches, maintaining and accessing this metadata incurs significant bandwidth overheads and can degrade performance by as much as 40%. Ideally, we want to implement memory compression while eliminating the bandwidth overheads of metadata accesses.

This paper proposes CRAM, a bandwidth-efficient design for memory compression that is entirely hardware based and does not require any OS support or changes to the memory modules or interfaces. CRAM uses a novel implicit-metadata mechanism, whereby the compressibility of the line can be determined by scanning the line for a special marker word, eliminating the overheads of metadata access. CRAM is equipped with a low-cost Line Location Predictor (LLP) that can determine the location of the line with 98% accuracy. Furthermore, we also develop a scheme that can dynamically enable or disable compression based on the bandwidth cost of storing compressed lines and the bandwidth benefits of obtaining compressed lines, ensuring no degradation for workloads that do not benefit from compression. Our evaluations, over a diverse set of 27 workloads, show that CRAM provides a speedup of up to 73% (average 6%) without causing slowdown for any of the workloads, and consuming a storage overhead of less than 300 bytes at the memory controller.

I. INTRODUCTION

As modern compute systems pack more and more cores on the chip, the memory systems must also scale proportionally in terms of bandwidth in order to supply data to all the cores. Unfortunately, memory bandwidth is dictated by the pin count of the processor chip, and this limited memory bandwidth is one of the bottlenecks for system performance. Data compression is a promising solution for enabling a higher effective memory bandwidth. For example, when the data is compressible, we can pack multiple memory lines within one memory location and retrieve all of these lines with a single memory request, increasing memory bandwidth and performance. In this paper, we study main memory compression designs that can increase memory bandwidth.

Prior works on memory compression [1][2][3] aim to obtain both the capacity and bandwidth benefits from compression, trying to accommodate as many pages as possible in the main memory, depending on the compressibility of the data. As the effective memory capacity of such designs can change at runtime, these designs need support from the Operating System (OS) or the hypervisor to handle the dynamically changing memory capacity. Unfortunately, this means such memory compression solutions are not viable unless both the hardware vendors (e.g., AMD, etc.) and the OS vendors (Microsoft, etc.) can co-ordinate with each other on the interfaces, or such solutions will be limited to systems where the same vendor provides both the hardware and the OS. We are interested in practical designs for memory compression that can be implemented entirely in hardware, without relying on any OS/hypervisor support. Such designs would provide the bandwidth benefits, while providing constant memory capacity.¹

A prior study, MemZip [5], tried to increase the memory bandwidth using hardware-based compression; however, it requires significant changes to the memory organization and the memory access protocols. Instead of striping the line across all the chips on a memory DIMM, MemZip places the entire line in one chip, and changes the number of bursts required to stream out the line, depending on the compressibility of the line. Thus, MemZip requires significant changes to the data organization of commodity memories and the memory controller to support variable burst lengths. Ideally, we want to obtain the memory bandwidth benefits from compression while retaining support for commodity DRAM modules and using conventional data organization and protocols.

Compression can change both the size and location of the line. Without additional information, the memory controller would not know how to interpret the data obtained from the memory (compressed or uncompressed). Conventional designs for memory compression rely on explicit metadata that indicates the compressibility status of the line, and this information is used to determine the location of the line. Such designs store the metadata in a separate region of memory. Unfortunately, accessing the metadata can incur significant bandwidth overhead. While on-chip metadata caches [3] reduce the bandwidth required to obtain the metadata, such caches are designed mainly to exploit spatial locality and are not as effective for workloads that have limited spatial locality. Our goal is to design hardware-compressed memory, without any OS support, using commodity memories and protocols, and without the metadata lookup.

¹In fact, a few months ago, Qualcomm's Centriq [4] system was announced with a feature that tries to provide higher bandwidth through memory compression while forgoing the extra capacity available from memory compression. Centriq's design relies on increasing the linesize to 128 bytes, striping this wider line across two channels, having ECC DIMMs in each channel to track compression status, and obtaining the 128-byte line from one channel if the line is compressible. Ideally, we want to obtain bandwidth benefits without changing the linesize, or relying on ECC DIMMs, and without getting limited to a 2x compression ratio. Nonetheless, the Centriq announcement shows the commercial appeal of such hardware-based memory compression.
Fig. 1. Number of accesses to read 4 lines: A, B, X, Y for (a) uncompressed memory, (b) ideal compressed memory, and (c) compressed memory with metadata lookup overhead. Metadata lookup causes significant bandwidth overhead.

We explain the problem of bandwidth overhead of metadata with an example. Figure 1 shows three memory systems, each servicing four memory requests A, B, X and Y. A and B are compressible and can reside in one line, whereas X and Y are incompressible. For the baseline system (a), servicing these four requests would require four memory accesses. For an idealized compressed memory system (b) (that does not require metadata lookup), lines A and B can be obtained in a single access, whereas X and Y would require one access each, for a total of 3 accesses for all the four lines. However, when we account for metadata lookup (c), it could take up to 5 accesses to read and interpret all the lines, causing degradation relative to an uncompressed scheme. Our studies show that even in the presence of metadata caching, the metadata lookup can degrade performance by as much as 40%. Ideally, we want to implement memory compression without incurring the bandwidth overheads of metadata accesses.

To this end, this paper presents CRAM, an efficient hardware-based main-memory compression design. CRAM decouples and separately solves the issues of (i) how to interpret the data, and (ii) where to look for the data, to eliminate the metadata lookup. To efficiently interpret the data received on an access, we propose an implicit-metadata scheme, whereby compressed lines are required to contain a special value, called a marker. For example, with a four-byte marker, the last four bytes of a compressed line are required to always be equal to the marker value. We leverage the insight that compressed data rarely uses the full 64-byte space, so we can store compressed data within 60 bytes and use the remaining four bytes to store the marker. On a read to a line that contains the marker, the line is interpreted as a compressed line. Similarly, an access to a line that does not contain the marker is interpreted as an uncompressed line. The likelihood that an uncompressed line coincidentally matches with a marker is quite small (less than one in a billion), and CRAM handles such rare cases of marker collisions simply by identifying lines that cause marker collisions on a write and storing such lines in an inverted form (more details in Section V-A).

The implicit-metadata scheme eliminates the need to do a separate metadata lookup. However, CRAM now needs an efficient mechanism to determine the location of the given line. CRAM restricts the possible locations of the line, based on compressibility. For example, in Figure 1(b), when A and B are compressible, CRAM restricts that both A and B must reside in the location of A. Therefore, the location of A remains unchanged regardless of compression. However, for B, the location depends on compressibility. We propose a history-based Line Location Predictor (LLP) that can identify the correct location of the line with a high accuracy, which helps in obtaining a given line in a single memory access. The LLP is based on the observation that lines within a page tend to have similar compressibility. We propose a page-based last-compressibility predictor to predict compressibility and thus location, and this allows us to access the correct location with 98% accuracy. CRAM, combined with implicit-metadata and LLP, eliminates metadata lookups and achieves an average speedup of 8.5% on SPEC workloads.

Unfortunately, even after eliminating the bandwidth overheads of the metadata lookup, some workloads still see slowdown with compression due to the inherent bandwidth overheads associated with compressing memory. For example, compressing and writing back clean lines incurs bandwidth overhead, as those lines are not written to memory in an uncompressed design. For workloads with poor reuse, this bandwidth overhead of writing compressed data does not get amortized by the subsequent accesses. To avoid performance degradation in such scenarios, we develop Dynamic-CRAM, a sampling-based scheme that can dynamically enable or disable compression depending on when compression is beneficial. Dynamic-CRAM ensures no slowdown for workloads that do not benefit from compression.

Overall, this paper makes the following contributions:

1. It proposes CRAM, a hardware-based compressed memory design to provide bandwidth benefits without requiring OS support or changes to the memory module and protocols. CRAM performs memory accesses using conventional linesize (64 bytes) and does not rely on the availability of ECC-DIMMs.

2. It proposes implicit-metadata to eliminate the storage and bandwidth overheads of metadata, by providing both the compressibility status and data in a single memory access.

3. It proposes a low-cost Line Location Predictor (LLP) to determine the location of the line with a high accuracy.

4. It proposes Dynamic-CRAM, to enable or disable compression at runtime based on the cost and benefit of compression.

Our evaluations show that CRAM improves bandwidth by 9% and provides a speedup of up to 73% (average 6%), while ensuring no slowdown for any of the workloads. CRAM can be implemented with minor changes to the memory controller, while incurring a storage overhead of less than 300 bytes.

II. BACKGROUND AND MOTIVATION

Compression exploits redundancy in data values and can provide both larger effective capacity and higher effective bandwidth for the memory systems. While exploiting the increased memory capacity requires OS support (to handle the dynamically changing capacity depending on data values), memory compression for exploiting only the bandwidth benefits can potentially be implemented entirely in hardware. We provide background on hardware-based memory compression, the potential benefit from compression, the challenges in implementing such a compressed design, and insights that can help in developing efficient designs.

A. Memory Compression for Bandwidth

Figure 2 provides an overview of a compressed memory design. We will assume that the design is geared towards obtaining only the bandwidth benefits, and the extra capacity created by compression remains unused. Compressed memory designs leverage compression algorithms [6][7][8][9][10][11][12] to accommodate data into a smaller space. If the lines are compressible we can either store them in their original location and stream out in a smaller burst, or place multiple compressed lines in one location and stream out all these lines in one access. We use the second option as it avoids dynamically changing the burst length, thus retaining compliance with the protocols and data mapping used in existing memory designs. Thus, if two lines A and B are compressible, then both are resident in the location of A. One memory access can provide both A and B, thereby increasing the effective bandwidth if both A and B get used. The bandwidth benefit of such a design is dictated by both (a) the ability to pack multiple lines together, and (b) the spatial locality of the workload (ability to use adjacent lines).

Fig. 2. Overview of Compressed Memories. Conventional designs track compression status of each line in a metadata region and use a metadata cache.

B. The Challenge of Metadata Accesses

An access to the compressed memory obtains a 64-byte line; however, the memory controller would not know if the line contains compressed data or not. For example, an access to A would provide both A and B, if both lines are compressible, and only A if the lines are uncompressed. Simply obtaining the line from location A is insufficient to provide the information about compressibility of the line. A separate region in memory, which we refer to as the metadata region, keeps track of the Compression Status Information (CSI) for each line. Thus, we need the CSI of the line along with the data line to not only interpret the data line, but also to determine the location of the data line. Even if we provisioned only one bit per line in memory, the size of this metadata would be quite large. For example, for our 16GB memory, having 1 bit per line to specify if the line is compressed or not would require a capacity of 32 megabytes. Therefore, conventional designs keep the metadata region in memory, access this metadata region on a demand basis, and cache it in an on-chip metadata cache [3][5]. Such designs are effective only when the metadata cache has a high hit-rate, due to either high spatial locality or small workload footprint. However, these approaches become ineffective when scaled to much larger workloads with low spatial locality. In the worst case, these designs may need a separate metadata access for every data access, constituting a potential bandwidth overhead of 50-80% (e.g., in xz and cactu). Therefore, avoiding the bandwidth overhead of metadata accesses is vital to building an effective memory compression design.
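To make this metadata-size arithmetic concrete, the short calculation below (our own sketch, not code from the paper) reproduces the 32-megabyte figure for 1-bit-per-line metadata, and the 24-megabyte figure for the 3-bits-per-group metadata used later in Section IV-B, assuming the paper's 16GB memory with 64-byte lines.

# Sketch: size of explicit compression-status metadata for a 16GB, 64B-line memory.
MEMORY_BYTES = 16 * 2**30                         # 16GB main memory
LINE_BYTES = 64                                   # cache-line size

num_lines = MEMORY_BYTES // LINE_BYTES            # ~268 million lines

# Section II-B: 1 bit of compression-status metadata per line.
print((num_lines * 1 // 8) // 2**20, "MB")        # -> 32 MB

# Section IV-B: 3 bits per group of 4 lines (0.75 bits per line).
print(((num_lines // 4) * 3 // 8) // 2**20, "MB") # -> 24 MB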
C. Potential for Performance Improvement

Our goal is to develop an efficient memory compression design that provides higher bandwidth. Figure 3 shows the performance benefit from an idealized compression scheme that does not maintain any metadata and simply transfers all the lines that would be together in a compressed memory system, thereby obtaining all the benefits of compression and none of the overheads. We also show the performance of a practical memory compression that maintains metadata in memory and is equipped with a 32KB metadata cache.

Fig. 3. Speedup from ideal compression (no metadata lookup) and practical compression (w/ metadata cache).

On average, compression can provide a speedup of 9%; however, the overheads of implementing compression erode this. In fact, we observe significant degradation with compression for several workloads. For example, Graph workloads (bc twi - pr web) have small potential benefits from compression, due to poor spatial locality and low data reuse. It is important that the implementation of compressed memory does not degrade the performance of such workloads.

D. Insight: Store Metadata in Unused Space

To reduce the metadata access of compressed memory, we leverage the insight that not all the space of the 64-byte line is used by compressed memory. For example, when we are trying to compress two lines (A and B) they must fit within 64 bytes; however, the compressed size could still be smaller than 64 bytes (and not large enough to store additional lines C and D). We can leverage the unused space in the compressed memory line to store metadata information within the line. For example, we could require that the compressed lines store a 4-byte marker (a predefined value) at the end of the line, and the space available to store the compressed lines would now get reduced to 60 bytes. Figure 4 shows the probability of a pair of adjacent lines compressing to ≤64B and ≤60B. As the probabilities of compressing pairs of lines to ≤64B and ≤60B are 38% and 36%, respectively, we find that reserving space for this marker does not substantially impact the likelihood of compressing two lines together and thus would not have a significant impact on compression ratio.

Fig. 4. Probability of a pair of adjacent lines compressing to ≤64B and ≤60B. We can use 4 bytes to indicate status of the line, without significantly affecting compressibility.
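The packing rule can be illustrated with a small sketch. The code below is only illustrative: zlib stands in for the FPC+BDI compressors that CRAM actually uses, and the 2-to-1 marker constant is the example value from Section V-A; the rule itself (compressed payload must fit in 60 bytes, marker occupies the last 4 bytes) follows the text above.

import os
import zlib

LINE_BYTES = 64
MARKER_BYTES = 4
MARKER_2TO1 = b"\x22\x22\x22\x22"   # example 2-to-1 marker value (Section V-A)

def pack_pair(line_a: bytes, line_b: bytes):
    """Return one 64B block holding both lines, or None if the pair does not
    compress to 60 bytes and must therefore stay uncompressed."""
    assert len(line_a) == LINE_BYTES and len(line_b) == LINE_BYTES
    payload = zlib.compress(line_a + line_b)        # stand-in for FPC/BDI
    if len(payload) <= LINE_BYTES - MARKER_BYTES:   # fits within 60 bytes
        pad = bytes(LINE_BYTES - MARKER_BYTES - len(payload))
        return payload + pad + MARKER_2TO1          # marker in the last 4 bytes
    return None

print(pack_pair(bytes(64), bytes(64)) is not None)            # True: zero-filled lines pack
print(pack_pair(os.urandom(64), os.urandom(64)) is not None)  # False: random data does not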

We can use this insight to store the metadata implicitly within the line and avoid the bandwidth overheads of accessing the metadata explicitly. If the line obtained from memory contains the marker value, the line is deemed compressed, whereas, if the line does not have the marker value, then it is deemed uncompressed. However, there could be a case where the uncompressed line coincidentally stores the marker value. A practical solution must efficiently handle such collisions, even though such collisions are expected to be extremely rare. We propose CRAM, an efficient hardware-based compression design, that does not require any OS support or changes to the memory module/protocols, and avoids the bandwidth overheads of metadata lookups. We discuss our evaluation methodology before discussing our solution.

III. METHODOLOGY

A. Framework and Configuration

We use USIMM [13], a simulator with a detailed memory system model. Table I shows the configuration used in our study. We assume a three-level cache hierarchy (L1, L2, L3 being on-chip SRAM caches). All caches use a line-size of 64 bytes. The DRAM model is based on DDR4.

TABLE I
SYSTEM CONFIGURATION

Processors               8 cores; 3.2GHz, 4-wide OoO
Last-Level Cache         8MB, 16-way
Compression Algorithm    FPC + BDI
Main Memory
  Capacity               16GB
  Bus Frequency          800MHz (DDR 1.6GHz)
  Configuration          2 channel, 2x rank, 64-bit bus
  tCAS-tRCD-tRP-tRAS     11-11-11-39 ns

We model a virtual memory system to perform virtual-to-physical translations, and this ensures that the memory accesses of different cores do not map to the same physical page. Note that, other than the virtual memory translation, the OS is not extended to provide any support to enable the compressed memory.

For compression, we use a hybrid compression scheme where we use FPC and BDI and compress with the one that gives better compression. Information about the compression algorithm used and the compression-specific metadata (e.g., the base for BDI) are stored within the compressed line, and are counted towards determining the size of the compressed line.

B. Workloads

We use a representative slice of 1-billion instructions selected by PinPoints [14], from benchmark suites that include SPEC 2006 [15], SPEC 2017 [16], and GAP [17]. We evaluate all SPEC 2006 and SPEC 2017 workloads, and mark '06 or '17 to denote the version when the workload is common to both. We additionally run the GAP suite, which is graph analytics with real data sets (twitter, web sk-2005) [18]. We show detailed evaluation of the workloads with at least five misses per thousand instructions (MPKI). The evaluations execute benchmarks in rate mode, where all eight cores execute the same benchmark. Table II shows L3 miss rates and memory footprints of the workloads we have evaluated in detail. In addition to these workloads, we also include 6 mixed workloads that are formed by randomly mixing the SPEC workloads.

TABLE II
WORKLOAD CHARACTERISTICS

Suite   Workload   L3 MPKI   Footprint
SPEC    fotonik    26.2      6.8 GB
        lbm17      25.5      3.4 GB
        soplex     23.3      2.1 GB
        libq       23.1      418 MB
        mcf17      22.8      4.4 GB
        milc       21.9      3.1 GB
        Gems       17.2      5.8 GB
        parest     16.4      465 MB
        sphinx     11.9      223 MB
        leslie     11.9      861 MB
        cactu17    10.6      2.1 GB
        omnet17    8.6       1.9 GB
        gcc06      5.8       205 MB
        xz         5.7       943 MB
        wrf17      5.2       798 MB
GAP     bc twi     66.6      9.2 GB
        bc web     7.4       10.0 GB
        cc twi     101.8     6.0 GB
        cc web     8.1       5.3 GB
        pr twi     144.8     8.3 GB
        pr web     13.1      8.2 GB

We perform timing simulation until each benchmark in a workload executes at least 1 billion instructions. We use weighted speedup to measure aggregate performance of the workload normalized to the baseline and report the geometric mean for the average speedup across all the 27 workloads (7 SPEC2006, 8 SPEC2017, 6 GAP, 6 MIX). For other workloads that are not memory bound, we present full results of all 64 benchmarks evaluated (29 SPEC2006, 23 SPEC2017, 6 GAP, 6 MIX) in Section VII-B.
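As a reference for how the reported averages can be read, the snippet below shows one common way to compute weighted speedup and its geometric mean. It is a sketch with made-up numbers; the paper does not spell out the exact formula beyond the description above, so the function names and normalization are our assumptions.

from math import prod

def weighted_speedup(ipc_test, ipc_base):
    """Sum of per-core IPC ratios relative to the uncompressed baseline."""
    return sum(t / b for t, b in zip(ipc_test, ipc_base))

def geomean(values):
    return prod(values) ** (1.0 / len(values))

# Example with made-up IPC numbers for an 8-core rate-mode run.
ws = weighted_speedup([1.1] * 8, [1.0] * 8) / 8   # normalized weighted speedup
print(ws)                                         # -> ~1.1 (a 10% speedup)
print(geomean([1.10, 1.02, 0.99]))                # geometric mean across workloads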

IV. CRAM: BASIC DESIGN

Our proposed design, CRAM, tries to obtain bandwidth benefits using memory compression without requiring OS support, without changes to the bus protocol, and while maintaining the existing organization for the memory modules. In this section, we provide an overview of the basic CRAM design, and discuss the shortcomings of maintaining and retrieving metadata associated with compression.

A. Organization and Operation of CRAM

Figure 5 shows an overview of CRAM. The main memory can store compressed data, and the job of compression and decompression is performed by the logic on the memory controller on the processor chip. The L3 cache is assumed to store data in uncompressed form. The bus connecting the memory controller and the memory modules uses the existing JEDEC protocol and transfers 64 bytes on each access. If the lines are compressible, then a single access can provide multiple neighboring lines, and increase effective bandwidth.

Fig. 5. An Overview of the CRAM Design.

Restricted Data Mapping: Without loss of generality, CRAM supports up to 4-to-1 compression, which means up to four compressed lines can be resident in one memory location. If the compressibility is not high enough to store 4 lines in one location, then the design tries 2-to-1 compression, where two neighboring lines are placed in one location. If the lines are uncompressed they retain their original location. If we organize the data layout appropriately, we can reduce the amount of uncertainty (i.e., the number of possible positions) in locating lines. CRAM restricts the location of the lines in a group of 4 lines to help in locating the lines easily.

Fig. 6. CRAM relocates and packs adjacent lines. Restricting placement reduces the number of valid positions (possible locations for lines A, B, C, D: 1, 2, 2, and 3).

Figure 6 shows the five different line permutations for a group of 4 lines under CRAM, based on whether the lines undergo 2-to-1 compression, 4-to-1 compression, or no compression. Thus, line A (lines with line-address ending in "00") is always resident in the same location, whereas line B (lines with address ending in "01") can be in the original location at B (if B is uncompressed) or at A (if B is compressed). Note that, on average, there are only two locations for each line in the group. An access to line A can provide location information for all four lines in the group if the line is 4-to-1 compressed, or for line B otherwise. Thus, a sequential access across the memory would obtain the first line in the group always from the original location, and this line can provide location information for the subsequent lines in the group.

Write Operation: When a cacheline gets evicted from the LLC, the memory controller checks if (a) the neighboring cachelines are present in the LLC, and if (b) the group of 2 or 4 cachelines can be compressed to the size of a single uncompressed cacheline (64 bytes). If the group of cachelines can be compressed to a single block, the memory controller compacts them together and issues a write containing the 2 or 4 compressed lines to one physical location. Note that for a compressed memory, the controller can have the flexibility to compress and write back clean lines as well, otherwise the benefits of compression will become restricted only to dirty lines. Our default policy compacts and writes back clean lines if they are compressible, in the hope that this bandwidth cost will be amortized by future re-use.

Read Operation: On a read, the controller needs to determine the compressibility and the location of the line, as the line can get relocated based on compressibility. Conventional designs rely on metadata in memory to provide the Compression Status Information (CSI) of each line. In our case, this CSI-metadata would be a 3-bit entity for a group of 4 lines (to indicate one of 5 possible states for the group, based on Figure 6). If the CSI metadata is available, the read can determine the location of the line, access the memory, decompress the line(s) if needed, and store all the retrieved lines in the L3 cache.
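The restricted mapping of Figure 6 can be expressed as a small address-mapping function. The sketch below is our own illustration (line addresses are in units of 64-byte lines, and the state names are ours); it reproduces the "1, 2, 2, 3 possible locations" property noted in Figure 6.

def line_location(line_addr: int, group_state: str) -> int:
    """Slot (line address) that holds `line_addr` for a given group state;
    group_state is 'uncompressed', '2to1', or '4to1'. Addresses are in lines."""
    group_base = line_addr & ~0x3            # groups of 4 aligned lines
    offset = line_addr & 0x3                 # position within the group (0..3)
    if group_state == "4to1":
        return group_base                    # all four lines live in slot A
    if group_state == "2to1":
        return group_base + (offset & ~0x1)  # pairs (A,B) and (C,D) share a slot
    return line_addr                         # uncompressed: original location

def candidate_locations(line_addr: int):
    """All slots a line may occupy across the three compression states."""
    return sorted({line_location(line_addr, s)
                   for s in ("uncompressed", "2to1", "4to1")})

# Matches Figure 6: A has 1 candidate location, B and C have 2, D has 3.
print([len(candidate_locations(a)) for a in (0x100, 0x101, 0x102, 0x103)])  # [1, 2, 2, 3]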

B. The Problem With Explicit Metadata

We can design CRAM with explicit metadata, where the metadata specifies the compression status of the line. Given that the size of the metadata is 3 bits per group of four lines, we need on average 0.75 bits per line. For our 16GB memory, containing over 250 million lines, the total size of this metadata would be 24 megabytes, much larger than the capacity of on-chip structures. We can keep the metadata in memory and cache it in an on-chip metadata cache, as done in prior works [3][5]. For workloads with good spatial locality or small memory footprint, most metadata requests and updates will be serviced by the cache and such a design would work well. Unfortunately, having explicit metadata requires accessing memory on a miss in the metadata cache. Figure 7 shows the performance of CRAM with explicit metadata with a 32KB metadata cache. On average, this scheme shows a 10% slowdown relative to an uncompressed memory, because of the metadata accesses.

Fig. 7. Speedup of CRAM with explicit metadata (+ metadata cache) compared to uncompressed memory.

Figure 8 shows the breakdown of the bandwidth consumed by the CRAM design, normalized to the uncompressed memory. In general, compression is effective at reducing the number of requests for data. However, depending on the workload, the metadata cache can have a poor hit-rate, and require frequent accesses to obtain the metadata. These extra metadata accesses can constitute a significant bandwidth overhead. For example, xz needs over 50% extra bandwidth just to fetch the metadata. Thus, schemes that require a separate metadata lookup can end up degrading performance relative to uncompressed memory. We develop a solution that eliminates metadata lookups.

Fig. 8. Bandwidth consumption for data, compressed writebacks, and metadata for CRAM with explicit metadata, normalized to uncompressed memory. Metadata accesses constitute significant bandwidth overhead.

V. CRAM: OPTIMIZED DESIGN

To avoid the bandwidth overheads of the metadata access, CRAM decouples the information provided by the metadata into two parts: (a) determining the compression status, and (b) determining the location of the cachelines, and solves each of them separately. The first solution tackles the problem of interpreting accessed lines with implicit-metadata using marker values. The second component handles the problem of locating lines with a line-location predictor.

A. Implicit-Metadata: No Metadata Lookups

We exploit the insight that a pair of compressed lines does not always use all the available 64 bytes. Our analysis (presented in Figure 4) shows that the probability that a pair of lines is compressible within 64 bytes but not within 60 bytes is quite small (close to 2%). We exploit this leftover space in compressed lines to specify the compression status of the line, using a predefined special value, which we call the marker.

Figure 9 shows the implicit-metadata design using markers, for lines that are compressible with 2-to-1 compression, 4-to-1 compression, or no compression. If the line contains two compressed lines (e.g., A and B both reside in A), then the line is required to contain the marker corresponding to 2-to-1 compression (x22222222 in our example) in the last four bytes. Similarly, if the line contains four compressed lines (A, B, C, and D, all residing in A), then the line is required to contain the marker corresponding to 4-to-1 compression (x44444444 in our example). The marker reduces the available space for compressed lines to 60 bytes. If the compressed lines cannot fit within 60 bytes, then they are stored in uncompressed form. An incompressible line is stored in its original form, without any space reserved for the marker. The probability that an uncompressed line coincidentally matches with the 32-bit marker is quite small (less than 1 in a billion). Our solution handles such extremely rare cases of collision with marker values by storing such lines in an inverted form. This ensures that the only lines in memory that contain the marker value (in the last four bytes) are the compressed lines.

Fig. 9. Implicit metadata using markers: Compressed lines always contain a marker in the last four bytes.

Determining Compression Status with Markers: When a line is retrieved, the memory controller scans the last four bytes for a match with the markers. If there is a match with either the 2-to-1 marker or the 4-to-1 marker, we know that the line contains compressed data for either two lines or four lines, respectively. If there is no match, the line is deemed to store uncompressed data. Thus, with a single access, CRAM obtains both the data and the compression status.

Handling Collisions with Marker via Inversion: We define a marker collision as the scenario where the data in an uncompressed line (last four bytes) matches with one of the markers. Since our design generates per-line markers, the likelihood of a marker collision is quite rare (less than one in a billion). However, we still need a way to handle it without incurring significant storage or complexity. CRAM handles marker collisions simply by inverting the uncompressed line and writing this inverted line to memory, as shown in Figure 10. Doing so ensures that the only lines in memory that contain the marker value (in the last four bytes) are compressed lines.

Fig. 10. Line inversion handles collisions of uncompressed lines with the marker.

A dedicated on-chip structure, called the Line Inversion Table (LIT), keeps track of all the lines in memory that are stored in an inverted form. The likelihood that multiple lines resident in memory concurrently encounter marker collisions is negligibly small. For example, if the system continuously writes to memory, then it will take more than 10 million years to obtain a scenario where more than 16 lines are concurrently stored in inverted form. Therefore, for our 16GB memory, we provision a 16-entry LIT in CRAM.

When a line is fetched from memory, it is not only checked against the marker, but also against the complement of the marker. If the line matches with the inverted value of the marker, then we know that the line is uncompressed. However, we do not know if the retrieved data is the original data for the line or if the line was stored in memory in an inverted form due to a collision with the marker. In such cases, we consult the LIT. If the line address is present in the LIT, then we know the line was stored in an inverted fashion and we will write the reverted value in the LLC. Otherwise, the data obtained from the memory is written as-is to the LLC.

On a write to the memory, if the line address is present in the LIT, and the last four bytes of the line no longer match with any of the markers, then we write the line in its original form and remove this line address from the LIT. Each entry in the LIT contains a valid bit and the line address (30 bits), so our 16-entry LIT incurs a storage overhead of only 64 bytes. We recommend that the size of the LIT be increased in proportion to the memory size.

Efficiently Handling LIT Overflows: In the extremely rare cases where the LIT overflows, we have two solutions to handle this scenario. (Option-1) Make the LIT memory-mapped (one inversion-bit for every line in memory, stored in memory), which can support every line in memory having a collision. On a marker-collision, the memory system has to make two accesses: one access to the memory, and another to the LIT to resolve the collision. Under adversarial settings, the worst-case effect would simply be twice the bandwidth consumption. We implement updates to the LIT by resetting the LIT entry when lines with marker-collisions are brought into the LLC and marking these cachelines as dirty. On eviction, these lines will be forced to go through the marker-collision check and will appropriately set the corresponding LIT entry. (Option-2) On an LIT overflow, CRAM can regenerate new marker values using the random number generator, encode the entire memory with the new marker values, and resume the execution. As cases of LIT overflows are rare (once per 10 million years), the latency of handling LIT overflows does not affect performance.

Attack-Resilient Marker Codes: The markers in Figure 9 were chosen for simplicity of explanation. Markers generated from simple address-based hash functions can be a target for a Denial-of-Service attack. An adversary with knowledge of the hash function can write data values intended to cause frequent LIT overflows, resulting in severe performance degradation. We address this vulnerability by using a cryptographically secure hash function (e.g., DES [19], given that marker generation can happen off the critical path) to generate marker values on a per-line basis. This would make the marker values impractical to guess without knowledge of the secret keys of the hash function, which are generated randomly for each machine. Furthermore, the secret keys are regenerated in the event of an LIT overflow, which changes the per-line markers.

Efficiently Invalidating Stale Data: Compression can relocate the lines, and, when lines get moved, they can leave behind a potentially stale copy of the line. For example, in Figure 11, if adjacent lines A and B became compressible (into values A' and B'), we could move B' and store lines A' and B' together in one physical location. However, an old value of B would still exist in the previous location. Reading the previous location would reveal an old value of B that could still be erroneously interpreted as a valid uncompressed cacheline. Keeping all locations of the line in sync requires significant bandwidth overheads. Therefore, we simply mark such lines as invalid using a special 64-byte marker value, called the Invalid Line Marker (Marker-IL). Marker-IL is also initialized at boot time using a randomly generated value. Per-line Marker-IL values can be generated as in Section V-A. Collisions with Marker-IL are extremely rare (1 in 2^512 probability, less than one in a quadrillion years), and are also handled using line inversion, and are tracked by the LIT.

Fig. 11. Compression relocates lines and can create copies of data. We mark such lines as invalid to ensure correct operation.

Handling Updates to Compressed Lines: An update to a compressed line can render the entire group (of two or four lines) from compressible to incompressible. Such updates must be performed carefully so that the data of the other line(s) in the group gets relocated to their original location(s). To accomplish this, we need to know the compressibility of the line when the line was obtained from memory. To track this information, we provision 2 bits in the tag store of the LLC that denote the compression level when the data was read from memory. On an eviction, we can determine if the lines were previously uncompressed, 2-to-1 compressed, or 4-to-1 compressed by checking these two bits, and we can send writes and invalidates (when applicable) to the appropriate locations.

Ganged Eviction: Write-back of a cacheline that belongs to a compressed group can require a read-modify-write operation if the other cachelines in the group are not present in the cache. Our design avoids this by using a ganged-eviction scheme which forces the eviction of all members of a compressed group if one of its members gets evicted. This ensures that all the members of a compressed group are either simultaneously present or absent from the LLC, effectively avoiding the need for read-modify-write operations. Our evaluations show that ganged eviction has negligible impact on the LLC hit rate.
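The write-side actions described in this subsection (marker-collision inversion, LIT updates, and invalidation of stale copies) can be summarized in one sketch. This is illustrative pseudocode in Python, not the controller logic: `memory` is a dictionary of 64-byte slots keyed by line address, and the all-zero Marker-IL below is a stand-in for the randomly generated 64-byte invalid-line marker.

MARKERS = {b"\x22\x22\x22\x22", b"\x44\x44\x44\x44"}   # example markers (Fig. 9)
MARKER_IL = bytes(64)            # stand-in for the random 64B invalid-line marker

def invert(block: bytes) -> bytes:
    return bytes(b ^ 0xFF for b in block)

def write_uncompressed(memory: dict, lit: set, addr: int, line: bytes):
    """Store one 64B line; invert it (and record it in the LIT) on a collision."""
    if line[-4:] in MARKERS:
        memory[addr] = invert(line)
        lit.add(addr)
    else:
        memory[addr] = line
        lit.discard(addr)        # line no longer collides: drop any stale LIT entry

def write_group(memory: dict, lit: set, base: int, packed: bytes, lines: list):
    """Write a group of 2 or 4 lines: `packed` is the 64B compressed block
    produced by the controller, or None if the group did not fit."""
    if packed is not None:
        memory[base] = packed                  # all lines live in the first slot
        for offset in range(1, len(lines)):
            memory[base + offset] = MARKER_IL  # invalidate now-stale copies
    else:
        for offset, line in enumerate(lines):
            write_uncompressed(memory, lit, base + offset, line)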

Fig. 12. Speedup of CRAM with explicit metadata and CRAM with implicit metadata, normalized to uncompressed memory. CRAM with implicit metadata eliminates metadata lookup and improves performance.

B. Prediction for Line Location

With implicit-metadata, CRAM can efficiently determine the compressibility status of any line retrieved from memory. Reading the line from an incorrect location returns the invalid-line marker (Marker-IL). However, in such cases, a second request must be sent to another location to obtain the line (for example, an access to B gets routed to A because A contains both A and B). Sending multiple accesses to retrieve a line from memory wastes bandwidth. To obtain the line in a single access (in the common case), we develop a Line Location Predictor (LLP) that predicts the compressibility status of the line. Knowing the compressibility helps in determining the location of the line (e.g., B will be in its original location if incompressible and at A if compressible).

To design a low-cost LLP, we exploit the observation that lines within a page are likely to have similar compressibility [3][20]. Figure 13 shows the organization of the LLP. The LLP contains the Last Compressibility Table (LCT), which tracks the last compression status seen for a given index. The LCT is indexed with a hash of the page address. So, for a given access, the index corresponding to the page address is used to predict the compressibility, and then the line location. The LCT is used only when a prediction is needed (for example, A is always resident in its own location and does not need a prediction). We use a 512-entry LCT, so the storage overhead is 128 bytes.

Fig. 13. The Line Location Predictor uses the line address and a compressibility prediction (based on last-time compressibility) to predict the location.
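A software model of the predictor can be sketched as below. This is our own illustration: the 512-entry size and the page-based indexing follow the description above, while the modulo hash, the state encoding, and the page geometry of 64 lines per 4KB page are assumptions made for the example.

LCT_ENTRIES = 512

class LineLocationPredictor:
    """Page-based last-compressibility predictor (Fig. 13), modeled in software."""
    def __init__(self):
        # Each entry remembers the last compression state seen for a page:
        # 'uncompressed', '2to1', or '4to1'.
        self.lct = ["uncompressed"] * LCT_ENTRIES

    def _index(self, line_addr: int) -> int:
        page_addr = line_addr >> 6          # 64 lines of 64B per 4KB page
        return page_addr % LCT_ENTRIES      # simple stand-in for the hash

    def predict_location(self, line_addr: int) -> int:
        """Predicted slot to read first; addresses are in units of lines."""
        state = self.lct[self._index(line_addr)]
        group_base, offset = line_addr & ~0x3, line_addr & 0x3
        if state == "4to1":
            return group_base
        if state == "2to1":
            return group_base + (offset & ~0x1)
        return line_addr

    def update(self, line_addr: int, observed_state: str):
        """Train with the compression state observed on the fetched block."""
        self.lct[self._index(line_addr)] = observed_state

llp = LineLocationPredictor()
llp.update(0x1001, "2to1")                  # lines of this page were last seen 2-to-1 packed
print(hex(llp.predict_location(0x1001)))    # -> 0x1000 (the location of A)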

With explicit-metadata, if there is a hit in the metadata cache, we can determine the location of the line and obtain the line in one memory access. However, a miss in the metadata cache means we need to send two accesses, one for the metadata and a second for the data. Figure 14 compares the hit-rate of the metadata cache (32KB) with the prediction accuracy of the LLP (128 bytes). Even though the LLP is quite small, it provides an accuracy of 98%, much higher than the hit-rate of the metadata cache. On an LLP misprediction, we re-issue the request to the other possible locations of the line.

C. Speedup of CRAM with Optimizations

CRAM, when combined with implicit-metadata and the LLP, can accomplish the task of locating and interpreting lines without the need for a separate metadata lookup. Figure 12 shows the performance of CRAM (with implicit-metadata + LLP) compared to the basic CRAM design (with explicit-metadata). CRAM (with implicit-metadata) eliminates the metadata lookup, which significantly helps both compressible and incompressible workloads. For SPEC workloads, CRAM provides a speedup. However, for Graph workloads, CRAM still causes a slowdown. We investigate the bandwidth of CRAM to determine the cause.

D. Bandwidth Breakdown of CRAM

Figure 15 shows the bandwidth consumption of CRAM (with implicit-metadata + LLP), normalized to uncompressed memory. The components of bandwidth consumption of CRAM are data, second accesses due to LLP mispredictions, and clean writebacks + invalidates (for writing compressed data). The high location-prediction accuracy means we are able to effectively remove the cost of metadata lookup, except for bc twi. For Graph workloads, the inherent cost of compression (i.e., compressing and writing back clean lines, and invalidating) is the dominant source of bandwidth overhead and the cause of performance degradation. We develop an effective scheme to disable compression when compression degrades performance.

Fig. 14. Probability of finding the line in one access for explicit-metadata and CRAM with the LLP predictor.

Fig. 15. Bandwidth consumption for the optimized CRAM approach, normalized to uncompressed memory.

Fig. 16. Speedup of Static-CRAM, Dynamic-CRAM, and ideal memory compression. Dynamic-CRAM avoids slowdown for workloads that do not benefit from compression, and performs similar to the ideal scheme with no overheads.

VI. CRAM: DYNAMIC DESIGN

Thus far, we have focused only on avoiding the metadata access overheads of compressed memory. However, even after eliminating all of the bandwidth overheads of the metadata, there is still performance degradation for several workloads. Compression requires additional writebacks to memory which can consume additional bandwidth. For example, when a cacheline is found to be compressible on eviction from the LLC, it needs to be written back in its compressed form to memory. What could have been a clean evict in an uncompressed memory is now an additional writeback, which becomes a bandwidth overhead.² Additionally, CRAM requires invalidates to be sent, which further adds to the bandwidth cost of implementing compression. In general, if the workload has enough reuse and spatial locality, the bandwidth cost of compression yields bandwidth savings in the long run. But for a workload with poor reuse and spatial locality (such as several Graph workloads), the cost of compression does not get recovered, causing performance degradation.

²CRAM installs new pages in an uncompressed form to avoid inaccurate prefetches. By compressing adjacent lines that are evicted from the LLC, we can ensure that prefetches are done only when the neighboring lines have been previously accessed together and are thus expected to be useful.

A. Design of Dynamic-CRAM

We can avoid the degradation by dynamically disabling compression when compression is found to degrade performance. Doing so would return the workload to its baseline performance with an uncompressed memory. We call this design Dynamic-CRAM. Dynamic-CRAM compares at runtime the "bandwidth cost of doing compression" with the "bandwidth benefits from compression", and enables or disables compression based on this cost-benefit analysis.

Bandwidth Cost of Compression: The bandwidth overhead of compression comes from sending extra writebacks (compressed writebacks from clean locations), sending invalidates, and sending requests to mispredicted locations. These are additional requests incurred due to compression that could have been avoided if we had used an uncompressed design.

Bandwidth Benefits of Compression: Compression provides bandwidth benefits by enabling bandwidth-free prefetching. On reading a compressed line, adjacent lines get fetched without any extra bandwidth. This saves bandwidth if the prefetched lines are useful. Tracking useful prefetches can allow us to determine the benefits from compression.

Dynamic-CRAM monitors the bandwidth costs and benefits of compression at run-time, to determine if compression should be enabled or disabled. To efficiently implement Dynamic-CRAM, we use set-sampling, whereby a small fraction of sets in the LLC (1% in our study) always implement compression and we track the cost-benefit statistics only for the sampled sets. The decision for the remaining (99%) of the sets is determined by the cost-benefit analysis on the sampled sets, as shown in Figure 17. To track the cost and benefit of compression, we use a simple saturating counter. The counter is decremented on seeing the bandwidth cost and is incremented on seeing the bandwidth benefit of compression. The Most Significant Bit (MSB) of the counter determines if compression should be enabled or disabled for the remaining sets. We use a 12-bit counter in our design. We extend Dynamic-CRAM to support per-core decisions by maintaining a 12-bit counter per core and a 3-bit tag storage for the lines in the sampled sets to identify the core that requested the cacheline.

Fig. 17. Dynamic-CRAM analyzes the cost-benefit of compression on sampled sets. This analysis determines the compression policy for the other sets.

B. Effectiveness of Dynamic-CRAM

Figure 16 shows the performance of Optimized CRAM (that always tries to compress) and Dynamic-CRAM. CRAM without the Dynamic optimization provides performance improvement for SPEC workloads; however, it degrades performance for GAP workloads. Dynamic-CRAM eliminates all of the degradation, ensuring robust performance – the design is able to obtain performance when compression is beneficial and avoid degradation when compression is harmful. On average, Dynamic-CRAM provides 6% performance improvement, nearing two-thirds of the performance of an idealized compression design that does not incur any bandwidth overheads for implementing compression. Thus, Dynamic-CRAM is a robust and efficient way to implement hardware-based main memory compression.

Fig. 18. S-curve showing speedup of Dynamic-CRAM for 64 workloads, sorted by speedup.

VII. RESULTS AND ANALYSIS

A. Storage Overhead of CRAM Structures

CRAM can be implemented with minor changes at the memory controller. Table III shows the storage overheads required for implementing Dynamic-CRAM. The total storage of the additional structures at the memory controller is less than 300 bytes. In addition to these structures, CRAM needs 2 bits in the tag-store of each line in the LLC to track prior compressibility. And, per-core Dynamic-CRAM needs 4 bits per line in the sampled sets (1%) for reuse and core id.

TABLE III
STORAGE OVERHEAD OF CRAM STRUCTURES

Structure                       Storage Cost
Marker for 2-to-1               4 Bytes
Marker for 4-to-1               4 Bytes
Marker for Invalid Line         64 Bytes
Line Inversion Table (LIT)      64 Bytes
Line Location Predictor (LLP)   128 Bytes
Dynamic-CRAM counter            12 Bytes
Total                           276 Bytes

B. Extended Evaluation

We perform our study on 27 workloads that are memory intensive. Figure 18 shows the speedup with Dynamic-CRAM across an extended set of 64 workloads (29 SPEC2006, 23 SPEC2017, 6 GAP, and 6 mixes), including ones that are not memory intensive. Dynamic-CRAM is robust in terms of performance, as it avoids degradation for any of the workloads while retaining improvement when compression helps.

C. Impact on Energy and Power

Figure 19 shows the power, energy consumption, and energy-delay-product (EDP) of a system using Dynamic-CRAM, normalized to a baseline uncompressed main memory. Energy consumption is reduced as a consequence of a smaller number of requests to main memory. Overall, Dynamic-CRAM reduces energy by 5% and improves EDP by 10%.

Fig. 19. Dynamic-CRAM impact on energy and power.

D. CRAM Sensitivity to Number of Memory Channels

CRAM offers bandwidth-free adjacent-line prefetch, which provides latency benefits that exist regardless of the number of memory channels. Table IV shows that CRAM consistently provides a speedup of about 5% even with a larger number of channels.

TABLE IV
CRAM SENSITIVITY TO NUMBER OF MEMORY CHANNELS

Num. Channels   Avg. Speedup of CRAM
1               4.8%
2               5.5%
4               4.6%

E. Comparison to Larger Fetch for L3

CRAM can install adjacent lines from the memory into the L3 cache. While this may seem similar to prefetching, we note there is a fundamental difference. CRAM installs additional lines in L3 only when those lines are obtained without any bandwidth overhead. Meanwhile, prefetches result in an extra memory access which incurs additional bandwidth. We compare the performance of next-line prefetching and Dynamic-CRAM in Table V. Next-line prefetching causes an average slowdown of 10%, while CRAM achieves a speedup of 6% as it obtains adjacent lines without the bandwidth cost.

TABLE V
COMPARISON OF CRAM TO NEXT-LINE PREFETCH

        Next-Line Prefetch   Dynamic-CRAM
SPEC    -5.7%                +8.5%
GAP     -21.1%               +0.0%
MIX     -7.3%                +4.2%
ALL27   -9.7%                +5.5%

VIII. RELATED WORK

To the best of our knowledge, this is the first paper to propose a robust hardware-based main-memory compression for bandwidth improvement, without requiring any OS support and without causing changes to the memory organization and protocols. We discuss prior research related to our study.

A. Low-Latency Compression Algorithms

As decompression latency is in the critical path of memory accesses, hardware compression techniques typically use simple per-line compression schemes [6][7][8][9][10][11][12]. We evaluate CRAM using a hybrid compression scheme using FPC [6] and BDI [10]. However, CRAM is orthogonal to the compression algorithm and can be implemented with any compression algorithm, including dictionary-based ones [21][22][23][24][25].

B. Main Memory Compression

Hardware-based memory compression has been applied to increase the capacity of main memory [1][2][3]. To locate the line, these approaches extend the page-table entries to include information on the compressibility of the page. These approaches are attractive as they allow locating and interpreting lines using the TLB. However, such approaches inherently require software support (from the OS or hypervisor) that limits their applicability. We want a design that can be built entirely in hardware, without any OS support.

10 Explicit-Metadata Optimized for Row-Buffer Hits (Memzip / LCP) Dynamic CRAM 1.40 1.74 1.20 1.00 0.80 Speedup 0.60 0.40 xz libq milc mix1 mix2 mix3 mix4 mix5 mix6 GAP MIX Gems leslie wrf17 bc twi cc twi pr twi fotoniklbm17soplex mcf17 parestsphinx gcc06 bc web cc web pr web SPEC ALL27 cactu17omnet17 Fig. 20. Performance of schemes that need explicit metadata management (Memzip, LCP) optimized for row-buffer hits. Prior approaches still require significant bandwidth to retrieve and update metadata. Several studies [5][26][27] propose to send compressed Furthermore, SCC requires skewed-associative lookup of all data across links in smaller bursts, and send additional ECC or possible positions, which is possible to do in a cache; however, metadata bits when there is still room in a burst length. These such unrestricted probes of all possible placement locations proposals try to improve the bandwidth of the memory system would incur intolerable bandwidth overheads in main memory. by sending fewer bursts per memory request. However, these D. Adaptive Cache-Compression proposals require either non-traditional data organization (such Prior works have looked at adaptive or dynamic cache as MiniRank) [28][29][30] or changes to the bus protocols compression [11][32][36][37] to avoid performance degra- or both. CRAM enables compressed memory systems with dation due to latency overheads of decompression or due existing memory devices and protocols. to extra misses caused by sub-optimal replacement in com- Prior studies [3][5] have advocated reducing the latency pressed caches. These designs are primarily target cache hit for metadata lookups by placing the metadata in the same rate. Whereas, our main memory proposal targets bandwidth row buffer as the data line. However, this does not reduce the overheads inherent in memory compression (metadata or bandwidth required to obtain the metadata. For comparison, we compressed writes). Additionally, fine-grain adaptive mem- implement an explicit-metadata scheme optimized to access the ory compression has been previously unexplored, as prior same row as the data line. Figure 20 compares the performance approaches have had no capability to turn off (except by of Dynamic-CRAM with optimized explicit-metadata provi- expensive global operation). sioned with a 32KB metadata cache. The bandwidth overheads of obtaining metadata is still significant, causing slowdown. E. Predicting Cache Indices Whereas, Dynamic-CRAM provides performance improvement. Several studies have looked at predicting indices in asso- COP [31] proposes in-lining ECC into compressed lines ciative caches [38][39][40][41][42][43][44][45][20]. A and uses the ECC as markers to identify compressed lines. cache can verify such predictions simply by checking the tag, Unfortunately, COP is designed to provide reliability at low and issuing a second request in case of a misprediction. Our cost and provides no performance benefit if the system does work is quite different from these, in that we try to predict not need ECC or already has an ECC-DIMM. Whereas, the location for memory. Since memory does not provide tags CRAM is designed to provide bandwidth benefits by fetching to identify the data like caches, we verify our prediction by multiple lines, and helps regardless of whether the system integrating implicit-metadata within the line, allowing memory has ECC-DIMM or not. 
C. SRAM-Cache Compression

Prior work has looked at using compression to increase the capacity of on-chip SRAM caches. Cache compression is typically done by accommodating more ways in a cache set and statically allocating more tags [11][32]. Recent proposals, such as SCC, investigate reducing the SRAM tag overhead by sharing tags across contiguous sets [33][34][35]. Compressed caches typically obtain compression metadata by storing the metadata beside the tag and retrieving it along with the tag access. However, these approaches do not scale to memory, as there is no tag space or tag lookup to enable easy access to metadata. The restricted data-mapping in CRAM is inspired by the placement in SCC [34], in that the location of the line is determined by its compressibility. However, unlike SCC, our placement ensures that a significant fraction of lines do not change their locations, regardless of their compression status. Furthermore, SCC requires a skewed-associative lookup of all possible positions, which is feasible in a cache; however, such unrestricted probes of all possible placement locations would incur intolerable bandwidth overheads in main memory.

D. Adaptive Cache-Compression

Prior works have looked at adaptive or dynamic cache compression [11][32][36][37] to avoid performance degradation due to the latency overheads of decompression or due to extra misses caused by sub-optimal replacement in compressed caches. These designs primarily target cache hit rate, whereas our main-memory proposal targets the bandwidth overheads inherent in memory compression (metadata accesses and compressed writes). Additionally, fine-grain adaptive memory compression has been previously unexplored, as prior approaches have had no capability to turn compression off (except by an expensive global operation).

E. Predicting Cache Indices

Several studies have looked at predicting indices in associative caches [38][39][40][41][42][43][44][45][20]. A cache can verify such predictions simply by checking the tag, and it can issue a second request in case of a misprediction. Our work is quite different, in that we try to predict the location of a line in memory. Since memory does not provide tags to identify the data as caches do, we verify our prediction by integrating implicit-metadata within the line, allowing a memory access to indicate whether the location contains compressed data or not. Our predictors utilize this implicit-metadata to verify the location prediction and issue a request to an alternate location on a misprediction.

IX. CONCLUSIONS

This paper investigates practical designs for main-memory compression to obtain higher memory bandwidth. The proposed design, CRAM, is hardware-based and does not require any OS/hypervisor support or changes to the memory modules or access protocols. We show that for compressed memory designs, the bandwidth overheads of accessing metadata can be significant enough to cause slowdown for several workloads. We propose the implicit-metadata design, based on marker values, to eliminate the storage and bandwidth overheads of the metadata access. We also propose a simple and effective predictor for the location of a line in compressed memory, and a dynamic scheme that disables compression when compression degrades performance. Our proposed design provides an average speedup of 6% and avoids slowdown for any of the workloads. It can be implemented with minor additions to the memory controller.

REFERENCES

[1] B. Abali, H. Franke, X. Shen, D. Poff, and T. Smith, "Performance of hardware compressed main memory," in High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium on, 2001.
[2] M. Ekman and P. Stenstrom, "A robust main-memory compression scheme," in ACM SIGARCH Computer Architecture News, vol. 33, no. 2. IEEE Computer Society, 2005.
[3] G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, "Linearly compressed pages: A low-complexity, low-latency main memory compression framework," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46. New York, NY, USA: ACM, 2013. [Online]. Available: http://doi.acm.org/10.1145/2540708.2540724
[4] Qualcomm, "Qualcomm centriq 2400 processor," https://www.qualcomm.com/media/documents/files/qualcomm-centriq-2400-processor.pdf, 2017. [Online].
[5] A. Shafiee, M. Taassori, R. Balasubramonian, and A. Davis, "Memzip: Exploring unconventional benefits from memory compression," in High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on. IEEE, 2014.
[6] A. R. Alameldeen and D. A. Wood, "Frequent pattern compression: A significance-based compression scheme for l2 caches," Dept. Comp. Sci., Univ. Wisconsin-Madison, Tech. Rep. 1500, 2004.
[7] J. Dusser, T. Piquet, and A. Seznec, "Zero-content augmented caches," in Proceedings of the 23rd International Conference on Supercomputing, ser. ICS '09. New York, NY, USA: ACM, 2009. [Online]. Available: http://doi.acm.org/10.1145/1542275.1542288
[8] Y. Zhang, J. Yang, and R. Gupta, "Frequent value locality and value-centric data cache design," in ACM SIGOPS Operating Systems Review, vol. 34, no. 5. ACM, 2000.
[9] J. Yang, Y. Zhang, and R. Gupta, "Frequent value compression in data caches," in Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, ser. MICRO 33. New York, NY, USA: ACM, 2000. [Online]. Available: http://doi.acm.org/10.1145/360128.360154
[10] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, "Base-delta-immediate compression: Practical data compression for on-chip caches," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, 2012.
[11] A. R. Alameldeen, D. Wood et al., "Adaptive cache compression for high-performance processors," in Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on. IEEE, 2004.
[12] J. Kim, M. Sullivan, E. Choukse, and M. Erez, "Bit-plane compression: Transforming data for better compression in many-core architectures," in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, 2016.
[13] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, "Usimm: the utah simulated memory module," University of Utah, Tech. Rep., 2012.
[14] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, "Pinpointing representative portions of large intel itanium programs with dynamic instrumentation," in Microarchitecture, 2004. MICRO-37 2004. 37th International Symposium on, Dec 2004.
[15] J. L. Henning, "Spec cpu2006 benchmark descriptions," SIGARCH Comput. Archit. News, vol. 34, Sep. 2006. [Online]. Available: http://doi.acm.org/10.1145/1186736.1186737
[16] S. P. E. Corporation, "Spec cpu 2017," 2017, accessed: 2017-11-10. [Online]. Available: https://www.spec.org/cpu2017/
[17] S. Beamer, K. Asanovic, and D. A. Patterson, "The GAP benchmark suite," CoRR, vol. abs/1508.03619, 2015. [Online]. Available: http://arxiv.org/abs/1508.03619
[18] T. A. Davis and Y. Hu, "The university of florida sparse matrix collection," ACM Trans. Math. Softw., vol. 38, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2049662.2049663
[19] D. Coppersmith, "The data encryption standard (des) and its strength against attacks," IBM J. Res. Dev., vol. 38, May 1994. [Online]. Available: http://dx.doi.org/10.1147/rd.383.0243
[20] V. Young, P. J. Nair, and M. K. Qureshi, "Dice: Compressing dram caches for bandwidth and capacity," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: ACM, 2017. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080243
[21] X. Chen, L. Yang, R. P. Dick, L. Shang, and H. Lekatsas, "C-pack: A high-performance microprocessor cache compression algorithm," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 18, 2010.
[22] T. M. Nguyen and D. Wentzlaff, "Morc: A manycore-oriented compressed cache," in Microarchitecture (MICRO), 2015 48th Annual IEEE/ACM International Symposium on. IEEE, 2015.
[23] A. Arelakis and P. Stenstrom, "Sc2: A statistical compression cache scheme," in Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, June 2014.
[24] A. Arelakis, F. Dahlgren, and P. Stenstrom, "Hycomp: A hybrid cache compression method for selection of data-type-specific compression methods," in Proceedings of the 48th International Symposium on Microarchitecture, ser. MICRO-48. New York, NY, USA: ACM, 2015. [Online]. Available: http://doi.acm.org/10.1145/2830772.2830823
[25] Y. Tian, S. M. Khan, D. A. Jiménez, and G. H. Loh, "Last-level cache deduplication," in Proceedings of the 28th ACM International Conference on Supercomputing, ser. ICS '14. New York, NY, USA: ACM, 2014. [Online]. Available: http://doi.acm.org/10.1145/2597652.2597655
[26] V. Sathish, M. J. Schulte, and N. S. Kim, "Lossless and lossy memory i/o link compression for improving performance of gpgpu workloads," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '12. New York, NY, USA: ACM, 2012. [Online]. Available: http://doi.acm.org/10.1145/2370816.2370864
[27] H. Kim, P. Ghoshal, B. Grot, P. V. Gratz, and D. A. Jiménez, "Reducing network-on-chip energy consumption through spatial locality speculation," in Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip, ser. NOCS '11. New York, NY, USA: ACM, 2011. [Online]. Available: http://doi.acm.org/10.1145/1999946.1999983
[28] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu, "Mini-rank: Adaptive dram architecture for improving memory power efficiency," in 2008 41st IEEE/ACM International Symposium on Microarchitecture, Nov 2008.
[29] D. H. Yoon, M. K. Jeong, and M. Erez, "Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: ACM, 2011. [Online]. Available: http://doi.acm.org/10.1145/2000064.2000100
[30] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, "Future scaling of processor-memory interfaces," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Nov 2009.
[31] D. J. Palframan, N. S. Kim, and M. H. Lipasti, "Cop: To compress and protect main memory," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015.
[32] J. Gaur, A. R. Alameldeen, and S. Subramoney, "Base-victim compression: An opportunistic cache compression architecture," in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, 2016.
[33] S. Sardashti and D. A. Wood, "Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2013.
[34] S. Sardashti, A. Seznec, D. Wood et al., "Skewed compressed caches," in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on. IEEE, 2014.
[35] B. Panda and A. Seznec, "Dictionary sharing: An efficient cache compression scheme for compressed caches," in 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016.
[36] Y. Xie and G. H. Loh, "Thread-aware dynamic shared cache compression in multi-core processors," in 2011 IEEE 29th International Conference on Computer Design (ICCD), Oct 2011.
[37] S. Kim, S. Lee, T. Kim, and J. Huh, "Transparent dual memory compression architecture," in 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), Sept 2017.
[38] A. Agarwal, J. Hennessy, and M. Horowitz, "Cache performance of operating system and multiprogramming workloads," ACM Trans. Comput. Syst., vol. 6, Nov. 1988. [Online]. Available: http://doi.acm.org/10.1145/48012.48037
[39] A. Agarwal and S. D. Pudar, Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. ACM, 1993, vol. 21, no. 2.
[40] J. J. Valls, A. Ros, J. Sahuquillo, and M. E. Gomez, "Ps-cache: An energy-efficient cache design for chip multiprocessors," J. Supercomput., vol. 71, Jan. 2015. [Online]. Available: http://dx.doi.org/10.1007/s11227-014-1288-5

[41] B. Calder, D. Grunwald, and J. Emer, "Predictive sequential associative cache," in Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture, ser. HPCA '96. Washington, DC, USA: IEEE Computer Society, 1996. [Online]. Available: http://dl.acm.org/citation.cfm?id=525424.822662
[42] D. H. Albonesi, "Selective cache ways: On-demand cache resource allocation," in Microarchitecture, 1999. MICRO-32. Proceedings. 32nd Annual International Symposium on. IEEE, 1999.
[43] M. D. Powell, A. Agarwal, T. N. Vijaykumar, B. Falsafi, and K. Roy, "Reducing set-associative cache energy via way-prediction and selective direct-mapping," in Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, ser. MICRO 34. Washington, DC, USA: IEEE Computer Society, 2001. [Online]. Available: http://dl.acm.org/citation.cfm?id=563998.564007
[44] H.-C. Chen and J.-S. Chiang, "Low-power way-predicting cache using valid-bit pre-decision for parallel architectures," in 19th International Conference on Advanced Information Networking and Applications (AINA'05) Volume 1 (AINA papers), vol. 2, March 2005.
[45] A. Deb, P. Faraboschi, A. Shafiee, N. Muralimanohar, R. Balasubramonian, and R. Schreiber, "Enabling technologies for memory compression: Metadata, mapping, and prediction," in 2016 IEEE 34th International Conference on Computer Design (ICCD), Oct 2016.
