Compresso: Pragmatic Main Memory Compression

Esha Choukse, University of Texas at Austin, [email protected]
Mattan Erez, University of Texas at Austin, [email protected]
Alaa R. Alameldeen, Intel Labs, [email protected]

Abstract—Today, larger memory capacity and higher memory bandwidth are required for better performance and energy efficiency for many important client and datacenter applications. Hardware memory compression provides a promising direction to achieve this without increasing system cost. Unfortunately, current memory compression solutions face two significant challenges. First, keeping memory compressed requires additional memory accesses, sometimes on the critical path, which can cause performance overheads. Second, they require changing the operating system in order to take advantage of the increased capacity, and to handle incompressible data, which delays deployment. We propose Compresso, a hardware memory compression architecture that minimizes memory overheads due to compression, with no changes to the OS. We identify new data-movement trade-offs and propose optimizations that reduce additional memory movement to improve system efficiency. We also propose a holistic evaluation methodology for compressed systems. Our results show that Compresso achieves a 1.85x compression ratio for main memory on average, with a 24% speedup over a competitive hardware-compressed system for single-core systems and 27% for multi-core systems. As compared to competitive compressed systems, Compresso not only reduces the performance overhead of compression, but also increases the performance gain from higher memory capacity.

I. INTRODUCTION

Memory compression can improve performance and reduce cost for systems with high memory demands, such as those used for machine learning, graph analytics, databases, gaming, and autonomous driving. We present Compresso, the first compressed main-memory architecture that: (1) explicitly optimizes for new trade-offs between compression mechanisms and the additional data movement required for their implementation, and (2) can be used without any modifications to either applications or the operating system.

Compressing data in main memory increases its effective capacity, resulting in fewer accesses to secondary storage, thereby boosting performance. Fewer I/O accesses also improve tail latency [1] and decrease the need to partition tasks across nodes just to reduce I/O accesses [2, 3]. Additionally, transferring compressed cache lines from memory requires fewer bytes, thereby reducing memory bandwidth usage. The saved bytes may be used to prefetch other data [4, 5], or may even directly increase effective bandwidth [6, 7].

Hardware memory compression solutions like IBM MXT [8], LCP [4], RMC [9], Buri [10], DMC [11], and CMH [12] have been proposed to enable larger capacity and higher bandwidth with low overhead. Unfortunately, hardware memory compression faces major challenges that prevent its widespread adoption. First, compressed memory management results in additional data movement, which can lead to performance overheads, and which has been left widely unacknowledged in previous work. Second, to take advantage of larger memory capacity, hardware memory compression techniques need support from the operating system to dynamically change system memory capacity and to address scenarios where memory data is incompressible. Such support limits the adoption potential of these solutions. On the other hand, if compression is implemented without OS support, the hardware needs innovative solutions for dealing with incompressible data.

Additional data movement in a compressed system is caused by metadata accesses for additional translation, changes in the compressibility of the data, and compression across cache line boundaries in memory. We demonstrate and analyze the magnitude of the data movement challenge, and identify novel trade-offs. We then use allocation-, data-packing- and prediction-based optimizations to alleviate this challenge. Using these optimizations, we propose a new, efficient compressed memory system, Compresso, that provides high compression ratios with low overhead, maximizing performance gain. In addition, Compresso is OS-transparent and can run any standard modern OS (e.g., Linux), with no OS or software modifications.
Unlike prior approaches that sacrifice OS transparency for situations when the system is running out of promised memory [8, 10], Compresso utilizes existing features of modern operating systems to reclaim memory without needing the OS to be compression-aware. We also propose a novel methodology to evaluate the performance impact of the increased effective memory capacity due to compression. The main contributions of this paper are:

• We identify, demonstrate, and analyze the data-movement-related overheads of main memory compression, and present new trade-offs that need to be considered when designing a main memory compression architecture.
• We propose Compresso, with optimizations to reduce compressed data movement in a hardware-compressed memory, while maintaining a high compression ratio by repacking data at the right time. Compresso exhibits only 15% compressed data-movement accesses, as compared to 63% in an enhanced LCP-based competitive baseline (Section IV).

Fig. 1: Overview of data organization in compressed memory systems. (a) Compressed address spaces (OS-transparent). (b) Ways to allocate a page in compressed space. (c) Cache line overflow.

• We propose the first approach where the OS is completely agnostic to main memory compression, and all hardware changes are limited to the memory controller (Section V).
• We devise a novel methodology for holistic evaluation that takes into account the capacity benefits from compression, in addition to its overheads (Section VI).
• We evaluate Compresso and compare it to an uncompressed memory baseline. When memory is constrained, Compresso increases the effective memory capacity by 85% and achieves a 29% speedup, as opposed to an 11% speedup achieved by the best prior work. When memory is unconstrained, Compresso matches the performance of the uncompressed system, while the best prior work degrades performance by 6%. Overall, Compresso outperforms the best prior work by 24% (Section VII).

II. OVERVIEW

In this paper, we assume that the main memory stores compressed data, while the caches store uncompressed data (cache compression is orthogonal to memory compression). In this section, we discuss important design parameters for compressed memory architectures, and the corresponding Compresso design choices.

A. Compression Algorithm and Granularity

The aim of memory compression is to increase memory capacity and lower bandwidth demand, which requires a sufficiently high compression ratio to make an observable difference. However, since the core uses uncompressed data, the decompression latency lies on the critical path. Several compression algorithms have been proposed and used in this domain: Frequent Pattern Compression (FPC) [13], Lempel-Ziv (LZ) [14], C-Pack [15], Base-Delta-Immediate (BDI) [16], Bit-Plane Compression (BPC) [6], and others [17, 18]. Although LZ results in the highest compression, its dictionary-based approach results in high energy overhead. We chose BPC due to its higher average compression ratio compared to the other algorithms. BPC is a context-based compressor that transforms the data using its Delta-Bitplane-Xor transform to increase data compressibility, and then encodes the transformed data. Kim et al. [6] describe BPC in the context of memory bandwidth optimization for a GPU. We adapt it for CPU memory-capacity compression by decreasing the compression granularity from 128 bytes to 64 bytes to match the cache lines in CPUs. We also observe that always applying BPC's transform is suboptimal. We therefore add a module that compresses data with and without the transform, in parallel, and chooses the better option. Our optimizations save an average of 13% more memory compared to baseline BPC. Compresso uses this modified BPC compression algorithm, achieving 1.85x average compression on a wide range of applications.

Compression Granularity. A larger compression granularity, i.e., compressing larger blocks of data, allows a higher compression ratio at the cost of higher latency and data-movement requirements, due to the different access granularities of the core and the memory. Compresso uses a compression granularity of 64B.
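To make this selection step concrete, the sketch below compresses a line both with and without a transform and keeps the smaller result. The delta transform and the size estimate here are simplified stand-ins for BPC's actual Delta-Bitplane-Xor pipeline and encoder, so treat this as an illustration of the dual-path idea rather than the real compressor.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// One 64B cache line as 16 32-bit words.
using Line = std::array<uint32_t, 16>;

// Schematic stand-in for the real transform: a simple word-wise delta.
static Line delta_transform(const Line& in) {
    Line out{};
    out[0] = in[0];
    for (size_t i = 1; i < in.size(); ++i) out[i] = in[i] - in[i - 1];
    return out;
}

// Schematic stand-in for the encoder: a toy zero-run cost model.
static size_t encoded_bits(const Line& w) {
    size_t bits = 0;
    for (uint32_t v : w) bits += (v == 0) ? 3 : 35;
    return bits;
}

// Compress with and without the transform (in parallel in hardware)
// and keep the smaller encoding -- the selection step added to BPC.
size_t best_size_bits(const Line& line) {
    return std::min(encoded_bits(delta_transform(line)),
                    encoded_bits(line));
}
```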
B. Address Translation Boundaries

Since different data compress to different sizes, cache lines and pages are variable-sized in compressed memory systems. As shown in Fig. 1a, the memory space available to the OS for allocating to applications is greater than the actually installed memory. In any compressed memory system, there are two levels of translation: (i) Virtual Address (VA) to OS Physical Address (OSPA), the traditional virtual-to-physical address translation, occurring before accessing the caches to avoid aliasing problems, and (ii) OSPA to Machine Physical Address (MPA), occurring before accessing main memory.

Fixed-size pages and cache lines from the VA need to be translated to variable-sized ones in the MPA. OS-transparent compression translates both the page and cache line levels in the OSPA-to-MPA layer in the memory controller. On the other hand, a compression-aware OS has variable-sized pages in the OSPA space itself, and the memory controller only translates addresses into the variable cache line sizes for the MPA space. Bus transactions in the system happen in the OSPA space. Since OS-aware solutions have different page sizes as per compressibility in the OSPA, their bus transactions do not follow current architectural specifications. Hence, all hardware, including all peripheral devices and associated drivers, must be modified to accommodate the change. Compresso, being OS-transparent, has the same fixed page size in the VA and OSPA spaces, for example, only 4KB pages. Larger OS page sizes (2MB, 1GB, etc.) can be broken into their 4KB building blocks in the MPA.
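A minimal sketch of the OSPA-to-MPA step that the memory controller performs in the OS-transparent design; the field layout is a simplification of the metadata described in Section III, and the helper name is ours.

```cpp
#include <cstdint>

// Illustrative only, not the exact hardware layout: the OS page tables
// handle VA->OSPA as usual; the memory controller then resolves
// OSPA->MPA through a per-page metadata entry that points to the 512B
// chunks backing the compressed page.
struct MetadataEntry {
    bool     valid, zero, compressed;
    uint8_t  page_size;   // encoded MPA page size
    uint64_t mpfn[8];     // machine frame numbers of 512B chunks
};

// compressed_off is the line's byte offset inside the compressed page,
// computed from the per-line sizes (Section II-C).
uint64_t ospa_to_mpa(const MetadataEntry& m, uint64_t compressed_off) {
    const uint64_t chunk = compressed_off / 512;  // which backing chunk
    return m.mpfn[chunk] * 512 + compressed_off % 512;
}
```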

C. Packing Compressed Cache Lines

Compressed cachelines are variable-sized, thus requiring a scheme for their placement within a page. The tradeoff lies between a high compression ratio and the ease of calculating the offset of a cacheline within the page. There are two main ways of packing cachelines: (i) LCP-Packing: Linearly Compressed Pages (LCP) [4] proposed a smart way to reduce translation overhead, by compressing all the cache lines within a page to the same target size. Exceptions, i.e., cache lines that do not compress to the target size, are stored uncompressed in the page, with explicit pointers in the metadata. Recent work [19] shows that these exceptions can be frequent for some benchmarks. (ii) LinePack: Storing the encoded sizes of each cacheline within the page, and calculating the offset of a cacheline by summing the line-sizes before it. For example, allowing 4 possible sizes per compressed cacheline requires 2 bits of metadata per cacheline.

There are two main motivations for using LCP-packing compared to LinePack. (i) Simpler line-offset calculation. However, this calculation is required only when accessing main memory, and in our evaluation we describe a circuit that effectively computes the cacheline offset for LinePack in one memory cycle. (ii) LCP-packing allows a speculative main memory request in parallel with the metadata request. On the other hand, LCP-packing leads to extra memory accesses for exceptions, and assumes the TLB to be aware of the target line-size per page, which is not feasible in an OS-transparent system.

Given its lower packing flexibility, LCP-packing works well only for those pages across which data is similar; it underperforms when the compressibility of cache lines is more variable. We find that LCP-packing trades away high compression for simpler translation. As shown in Fig. 2, when used with the simpler BDI, LCP-packing loses only 2.3% in compression ratio compared to LinePack, while it reduces the compression ratio by 13% when used with the more-aggressive BPC. Compresso uses LinePack with 4 possible cache line sizes.

Fig. 2: Compression ratio with BPC and BDI, using LCP-Packing vs. LinePack.

D. Page Packing and Allocation

Fig. 1b shows the two ways a system can allocate variable-sized pages: (i) Variable-sized chunks: The drawback of this method is that it causes fragmentation and requires more sophisticated management. (ii) Incremental allocation in fixed-sized chunks: Fixed-sized chunks are trivial to manage, but need more pointers in the metadata. Compresso uses incremental allocation in 512B chunks, thereby allowing 8 page sizes (512B, 1KB, 1.5KB, and so on). We choose 512B as the base unit in order to balance metadata overheads with compression; a smaller base unit allows a better maximum compression. Both allocation schemes have non-zero minimum MPA page sizes for OSPA pages; however, 512B chunk-based packing supports 8 MPA page sizes. For variable-sized chunks, more page sizes lead to more fragmentation, and more page sizes can also lead to more compressed data movement, as discussed later in Section IV-A. Hence, we choose to evaluate the variable-sized chunk scheme with only 4 page sizes (512B, 1KB, 2KB, and 4KB), and compare the two schemes in Section IV.

III. COMPRESSO METADATA

In an OS-transparent system, the OSPA-to-MPA translation happens using metadata. This translation requires identifying the MPA page number that corresponds to the OSPA page number, and the offset of a cacheline within the page, on each main-memory access (an LLC fill or writeback). To perform this translation, Compresso maintains metadata for each OSPA page in a dedicated MPA space. The total number of OSPA pages is fixed and equal to the memory capacity Compresso advertises to the OS during boot. Compresso uses 64B of metadata per OSPA page, similar to LCP [4], amounting to a storage overhead of 1.6% of the main memory capacity. This metadata is not exposed to the OS and is stored in main memory, with one metadata entry per OSPA page, making the metadata address computation just a bitwise shift and add.

Fig. 3 shows a metadata entry in Compresso. The control section includes the page size and the valid, zero, and compressed page flags. The valid bit is set if the page is mapped in the MPA; the zero bit is set if the OSPA page contains all zeros; and the compressed bit is set if the page is compressed and cleared if the page is stored uncompressed. Compresso also tracks the free space in each page, to dynamically identify opportunities for repacking the page for better compression.

The second part of the metadata entry is the machine page frame numbers (MPFNs), pointing to the 512B chunks that comprise a compressed page. Up to 8 MPFNs are stored for each OSPA page. Each cacheline can be compressed to one of 4 possible sizes, and Compresso needs 16B (2 bits X 64 lines) to store the 64 encoded line sizes of a 4KB page.

Fig. 3: Metadata entry organization.
IV. ADDITIONAL DATA MOVEMENT

Addressing and managing compressed data in the main memory can lead to significant additional data movement, an average of 63% more accesses than an uncompressed memory, and this cost remains unevaluated in previous work. The additional memory accesses, compared to an uncompressed system, are due to three sources: (i) Change in compressibility: writebacks can cause cache line overflows and page overflows, as described below. (ii) Split-access cache lines: since compressed cache lines are variable-sized, they might be stored across the regular cache line boundaries in the memory, thereby leading to multiple memory accesses on the critical path (Fig. 5a). (iii) Metadata accesses: every memory access needs the relevant OSPA-to-MPA translation; despite caching the metadata, there can be additional memory accesses due to metadata cache misses, for both reads and writes.

Inflation Room. As a cache line is written back from the LLC to the memory, its compressibility could have decreased, meaning it no longer fits in the space that had been allocated to it. Such a cache line overflow occurs, on average, 4 times per 1000 instructions, and as shown in Fig. 1c, it changes the placement of the cache lines underneath the overflowing line. Moreover, if incompressible data is streamed to a page, its cache lines grow one by one, and the page eventually overflows its current page size to the next possible page size, for example, from 1KB to 2KB. Such a page overflow can happen, say, once every million instructions, and can require the relocation of the complete page, leading to a lot of additional memory accesses. In a naive compressed memory system, an overflow would lead to moving the rest of the lines in the page as its compressed size grows. Instead, Compresso allows overflowing cache lines to be placed in an inflation room at the end of the MPA page, provided there is space in that region (Fig. 5a). This is similar to the exception region in LCP, but is used for an entirely different reason—to reduce compression-related data movement, rather than to support a specific packing scheme. The corresponding metadata holds the sequence numbers of the inflated lines, along with the total number of inflated lines: we use 17 pointers for the inflated lines (6 bits each), the number of inflated cachelines (6 bits), and a leftover counter (6 bits). The inflated lines are kept aligned to the cache line size (64B) in the page, which is optimal for reducing extra data accesses.

Fig. 4: Additional compression-related data-movement memory traffic relative to an uncompressed baseline. For each benchmark, the left bar uses fixed-sized 512B chunks and the right bar uses 4 variable-sized chunks; each bar is broken into overflow handling accesses, split cache-line accesses, and metadata cache misses.

Fig. 4 shows the additional memory accesses from compression in a competitive baseline (based on LCP), relative to the original memory accesses of each benchmark. These additional accesses amount to a high average of 63%, with a maximum of 180%.

A. Data Movement Related Trade-offs

A major contribution of Compresso is identifying new data-movement-related trade-offs.

1) Possible Sizes for Cache Lines and Pages: There exists a trade-off between the compression ratio and the frequency of compressed data movement while choosing the number of possible sizes for compressed pages and cache lines. Here, a possible size indicates one of the permissible compressed sizes of a page or cache line. Prior work only tries to maximize the compression ratio. However, while more size bins lead to better compression, more bins are also likely to cause more overflows, and overflows trigger compressed data movement. With respect to cache line packing, using 8 or 4 cache line sizes offers a 1.82 or 1.59 average compression ratio, respectively, when coupled with 8 possible page sizes. However, 17.5% more cache line overflows occur with 8 cache line sizes. Supporting 8 sizes also requires more encoded line-size bits in the metadata, due to the larger number of size bins, and complicates the offset-calculation circuit. Compresso therefore uses 4 cache line bins.

A similar tradeoff exists with page size bins. 8 page sizes can achieve an average compression ratio of 1.85, but 4 page sizes only reach 1.59. On the other hand, 4 page sizes lead to up to 53% more page-resizing accesses compared to incremental allocation in fixed-sized chunks of 512B. Compresso uses 8 page sizes in 512B chunks. This choice enables some important data-movement optimizations, which we discuss later; without those optimizations, the better choice would be to have only 4 page sizes.

2) Cache Line Alignment: Split-access cache lines that cross cache line boundaries in memory lead to 30.9% extra memory accesses on average (Fig. 4), and yet have not been discussed in previous work.
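The alignment discussion interacts directly with how a line's offset is computed under LinePack. The following sketch shows that computation with Compresso's 2-bit size encoding; the 3-bit right shift reduces the summands to 0/1/4/8, mirroring the adder circuit described in Section VII-E. The function shape is illustrative.

```cpp
#include <cstdint>

// Bin sizes in bytes for the 2-bit encoded line sizes (Section IV-B1).
static const uint32_t kBinBytes[4] = {0, 8, 32, 64};

// sizes2b[i] holds the 2-bit encoded bin of cache line i (64 lines/page).
// The hardware sums up to 63 values of 0/1/4/8 after a 3-bit shift; here
// the same computation is done in plain software.
uint32_t line_offset_bytes(const uint8_t sizes2b[64], unsigned line) {
    uint32_t sum_shifted = 0;                        // in units of 8 bytes
    for (unsigned i = 0; i < line; ++i)
        sum_shifted += kBinBytes[sizes2b[i]] >> 3;   // add 0/1/4/8
    return sum_shifted << 3;                         // back to bytes
}
```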
In order to avoid such split-accesses, the system can choose to keep holes in the page to make sure that the lines are aligned. This leads to a loss in compression (23% loss on average with BPC) and complex calculations for cache line offsets. The other possible solution is to have alignment-friendly cache line sizes, as we discuss next.

B. Data-Movement Optimizations

We now describe the optimizations we use in Compresso in order to alleviate the problem of compressed data movement. Fig. 6 shows how these optimizations reduce data movement. Without the full set of optimizations, our choice of incremental allocation of 512B chunks does not outperform allocation with 4 variable page sizes. However, with all optimizations, Compresso saves significant data movement compared to other configurations, reducing the average relative extra accesses to 15%, compared to 63% in a competitive compressed memory system that uses 4 page sizes. We describe and analyze the mechanisms below.

1) Alignment-Friendly Cache Line Sizes

Previous work [4, 9] selected the possible sizes for cache lines as an optimization problem, maximizing the compression ratio. This led them to use the cache line sizes of 0, 22, 44 and 64B. This choice leads to an average of 30.9% of cache lines stored split across cache line boundaries. Accessing any of these cache lines requires an extra memory access on the critical path. We redo this search in order to maintain a high compression ratio while decreasing the alignment problem, arriving at cache line sizes of 0, 8, 32 and 64B. The decrease in compression is just 0.25%, while at the same time the fraction of split-access cache lines is brought down from 30.9% to 3.2%. This decrease in split-access cache lines is intuitive, given that the cache line boundaries are at 64B. We note that if changes are made to the cache line packing algorithm such that it introduces holes in the pages to avoid split-access cache lines, we can completely avoid extra accesses from misalignment. However, that leads to higher complexity in the line-packing algorithm, as well as in the offset calculation of a line in a page. Hence, we skip that optimization. Alignment-friendly cache line sizes bring down the extra accesses from 63% to 36%.

2) Page-Overflow Prediction

One of the major contributors to compression data movement is frequent cache line overflows. This often occurs in scenarios of streaming incompressible data. For example, it is common for applications to initialize their variables with zeros, and then write back incompressible values to them. If a zero page, which offers the maximum compressibility, is being written back with incompressible data, its cache lines will overflow one by one, causing the complete page to overflow multiple times, as it jumps through the possible page sizes one by one. In such cases, we use a predictor to keep the page uncompressed to avoid the data movement due to compression. A repacking mechanism later restores the overall compression ratio.

We associate a 2-bit saturating counter with each entry in the metadata cache (Fig. 5b). The counter is incremented when any writeback to the associated page results in a cache line overflow, and is decremented upon cache line underflows (i.e., new data being more compressible). Another 3-bit global predictor changes state based on page overflows in the system. We speculatively increase a page's size to the maximum (4KB) when the local as well as the global predictor have their higher bit set. Hence, a page is stored uncompressed if it receives multiple streaming cache line overflows during a phase when the overall system is experiencing page overflows. False negatives of this predictor lose the opportunity to save data movement, while false positives squander compression without any reduction in data movement. We incur 22.5% false negatives and 19% false positives on average. This optimization reduces the remaining extra accesses from 36% to 26%.
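A sketch of the predictor state and update rules just described; the 2-bit local and 3-bit global counters follow the text, while the exact global update and decay policy is our assumption.

```cpp
#include <algorithm>
#include <cstdint>

struct OverflowPredictor {
    uint8_t global = 0;                      // 3-bit, system-wide

    void page_overflow() { global = std::min<uint8_t>(global + 1, 7); }
    void global_decay()  { global = global > 0 ? global - 1 : 0; }  // assumed

    struct Entry { uint8_t local = 0; };     // 2-bit, per metadata-cache entry

    void line_overflow(Entry& e)  { e.local = std::min<uint8_t>(e.local + 1, 3); }
    void line_underflow(Entry& e) { e.local = e.local > 0 ? e.local - 1 : 0; }

    // Inflate the page to its maximum size (4KB, i.e., store it
    // uncompressed) when both predictors have their high bit set.
    bool predict_incompressible(const Entry& e) const {
        return (e.local & 0b10) && (global & 0b100);
    }
};
```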
3) Dynamic Inflation Room Expansion

As discussed earlier, we use the inflation room to store the cache lines that have overflown their initial compressed size. Frequently, however, when a cache line overflows, it cannot be placed in the inflation room because of insufficient space in the MPA page, despite there being available inflation pointers in the metadata. As shown in Option 1 in Fig. 5c, prior work would recompress the complete page, involving up to 64 cache line reads and 64 writes. Instead, we go for the optimized Option 2 (Fig. 5c) and allocate an additional 512B chunk, expanding the inflation room and requiring only 1 cache line write. Due to metadata constraints, this can be done only while the page has fewer than 8 allocated 512B chunks and fewer than 17 inflated lines. With this, we reduce the remaining extra accesses from 26% to 19%.

4) Dynamic Page Repacking

Previous work [4, 9] in this field does not consider or evaluate repacking (recompressing) a page while it is in main memory. It is widely assumed that a page only grows in size from its allocation till it is freed (which then leads to zero storage for OS-aware compression). However, we note that even for OS-aware compressed systems, repacking is important for long-running applications. Fig. 7 shows the loss in compression ratio if no repacking is performed (24% of the storage benefits are squandered on average).

So far we have only focussed on dealing with decreasing compressibility, and not on increases in compressibility, or underflows. However, as the cache lines within a page become more compressible, the page should be repacked. Additionally, we have proposed data-movement optimizations that leave compressible pages uncompressed in order to save on data movement. Such poor packing squanders potentially high compressibility. On the other hand, repacking requires data movement, potentially moving many cache lines within the page. If a page were repacked on each writeback that improves its compression ratio, it could lead to significant data movement.

The compression-squandering optimizations mostly affect the entries that are hot in the metadata cache, since they target the pages with streaming incompressible data, which get evicted from the LLC with high locality. Using this insight, we choose the metadata cache eviction of a page's entry as the trigger to check for its repacking.
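Putting these mechanisms together, a line overflow is handled roughly as follows; the thresholds (8 chunks, 17 inflated lines) come from the metadata constraints above, while the field names are illustrative.

```cpp
#include <cstdint>

struct PageState {
    uint32_t free_inflation_bytes;  // space left in the inflation room
    uint32_t chunks_allocated;      // 512B chunks backing this page
    uint32_t inflated_lines;        // lines tracked by inflation pointers
};

enum class Action { UseInflationRoom, ExpandInflationRoom, RepackPage };

Action on_line_overflow(const PageState& p, uint32_t new_line_bytes) {
    if (p.free_inflation_bytes >= new_line_bytes)
        return Action::UseInflationRoom;        // 1 cache line write
    if (p.chunks_allocated < 8 && p.inflated_lines < 17)
        return Action::ExpandInflationRoom;     // allocate 512B, 1 write
    return Action::RepackPage;                  // up to 64 reads + 64 writes
}
```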

(a) Inflation Room. (b) Page-Overflow Predictor. (c) Dynamic Inflation Room Expansion.

Fig. 5: Illustrations for data-movement optimizations.

Compresso dynamically tracks the potential free space within each page, as part of its metadata entry. This is novel to Compresso and important to keep the overheads of repacking low. Repacking is only triggered if enough space is available to positively impact the actual effective memory use (i.e., to deallocate at least one 512B chunk). Dynamic Page Repacking reduces the compression squandered from 24% to 2.6%, while amounting to only 1.8% extra accesses. The small memory access overhead of repacking is attributed to a reasonable choice of trigger for repacking.

Fig. 7: Loss in compression ratio from no repacking (compression ratio without repacking, relative to dynamic repacking, per benchmark).

5) Metadata Cache Optimization

A metadata cache of at least the same size as the second-level TLB should be used, in order to keep the common case of a TLB hit fast. For most applications, this is a large-enough size for capturing the metadata of their hot pages. However, some benchmarks, like Forestfire and omnetpp, have high miss rates, as shown in Fig. 4. A multi-core scenario with such applications is particularly problematic.

We identify an optimization that expands metadata cache coverage with low cost. We exploit the fact that all cache lines in an uncompressed OSPA page are exactly 64B, and only cache the first 32B of metadata for such pages (Fig. 3). The saved cache space from these half-sized entries is used for additional metadata entries, at the cost of a small amount of extra logic and tag area. This optimization allows us to double the effective cache size for incompressible data. Fig. 6 shows that this optimization provides a crucial increase in hit rates for omnetpp, Forestfire, Pagerank and Graph500, despite using the same area as a regular cache. This optimization brings down the remaining extra accesses from 19% to 15%.

Fig. 6 shows that, after applying all the data-movement optimizations, the extra accesses are down to 15% on average, of which 3.2% are due to split-access cache lines, 2.1% due to change in compression, and 9.7% due to metadata cache misses. Note that these optimizations are orthogonal and can be applied separately, with the exception of Dynamic Inflation Room Expansion, which needs to be used in combination with fixed-size chunk-based allocation.

Fig. 6: Reduction in additional compression-related data-movement memory traffic as the optimizations (alignment-friendliness, prediction, dynamic IR expansion, and the metadata cache optimization) are applied one by one. For each benchmark, the left bar uses fixed-sized 512B chunks and the right bar uses 4 variable-sized chunks.

V. OS-TRANSPARENCY CHALLENGES

There are a few challenges of compressed memory that are specific to OS-transparent systems (Tab. I).

Challenge to deal with                  OS-Aware  OS-Transparent
Translation from OSPA to MPA            Yes       Yes
Data movement due to size change        Yes       Yes
Metadata access overheads               Yes       Yes
No knowledge of free pages in OSPA      No        Yes
Overcommitment of memory by the OS      No        Yes

Tab. I: OS-aware vs. OS-transparent compression.

A. Keeping Memory Compressed

It is important to keep the memory highly compressed to avoid running out of memory. For this, a compression-aware OS banks on using zero bytes for free pages. Even partially OS-aware solutions like IBM MXT [8] require the OS to zero out a page upon freeing it, which breaks OS transparency and has high performance overheads. None of the previous work actively repacks pages for better compression. Compresso keeps memory compressed by aggressively re-compressing pages to a smaller size when possible, as explained in Section IV-B4.

B. Running Out Of Main Memory

In a compressed system, the OS is promised an address space larger than what is physically available. In cases where the data in main memory is less compressible than initially assumed, this can lead to an out-of-memory scenario in the MPA space. When this happens, solutions prior to Compresso raise an exception to the OS, thereby breaking OS transparency and requiring OS modifications. Compresso, on the other hand, maintains complete OS transparency by leveraging existing OS features developed for use by virtual machines. Our challenge is very similar to the over-commitment of memory in a virtualized environment, where the total memory of all running virtual machines exceeds the memory capacity of the host. Any modern OS that can run as a virtual machine, including Linux, Windows, and MacOS, uses memory ballooning [20]. Ballooning helps safely overcommit memory in a virtualized environment without substantial performance degradation due to paging on the host (Fig. 8). When the memory pressure increases in one of the virtual machines (e.g., VM0), the hypervisor uses a ballooning driver, which is run within each guest, to reclaim memory from another guest VM (e.g., VM1) and allocate it to the memory-hogging virtual machine (VM0). To reclaim space, the ballooning driver "inflates" by demanding pages from the guest OS using system calls. With more memory required for the balloon, the guest uses its regular paging mechanism to provide that space, possibly paging out used pages. This means that only free or cold pages are freed up. The page numbers claimed by the balloon driver are communicated to the hypervisor, which then unmaps them in physical memory.

Fig. 8: Ballooning for OS-transparent handling of out-of-memory.

We propose to use the same OS facility to solve the problem of memory pressure resulting from poorly-compressed data. The Compresso driver can inflate and inform the hardware about the freed pages. Hardware then marks these pages as invalid in its metadata, requiring no storage in the MPA space. In the case of Linux, the Compresso driver uses a wrapper around the core function __alloc_pages(). We recently discovered similar OS-transparency ideas in a patent [21], with little detail and no evaluation. Liu et al. [22] evaluate the performance overheads of ballooning on Linux. They show that in a 2GB RAM system, 1GB can be reclaimed in under 500ms without fragmentation, and in 1s for highly fragmented systems.

VI. METHODOLOGY

A. Novel Dual-Simulation Methodology

In order to evaluate a compressed system in a holistic fashion, it is imperative to understand all its aspects: (i) overheads due to compression latency and additional data movement, (ii) bandwidth impact of compression due to inherent prefetching benefits and zero cache lines, and (iii) impact of the memory capacity increase on the application. The first two aspects of a compressed system have been evaluated in most previous work using cycle-based simulations. However, we observe that the memory-capacity impact evaluation is missing, without which a compressed system's evaluation is incomplete. For a more complete evaluation of the system, we use two performance evaluation mechanisms: (i) cycle-based simulations to evaluate the latency overheads and bandwidth benefits of compression; and (ii) memory-capacity impact runs to evaluate the performance benefits of reduced OS paging when more effective capacity is made available through compression.

Memory-Capacity Impact Evaluation. Memory-capacity impact evaluation requires complete runs of benchmarks, since the compression ratio does not generally exhibit a recurring phase behavior. Since complete runs in simulators are highly time-consuming, we run benchmarks on real hardware (an Intel Xeon E5 v3 CPU) and OS (Linux kernel 3.16). The idea is to change the memory available to the benchmark dynamically, according to its real-time compressibility. We use a profiling stage, in which we run the applications and pause them every 200M instructions to dump the memory they have allocated in the physical address space, and save vectors of the compression ratio over instruction intervals. We then run the benchmark using taskset on a fixed core, and use the hardware counters to calculate the number of retired instructions on that core. This instruction count is used to look up the saved vectors to get the compression ratio and accordingly change the memory budget. We use the cgroups feature of Linux to budget the memory, either statically or dynamically.

Static budgeting of the memory replicates a regular system. Dynamically changing the memory available to the system based on the compressibility of the data in memory allows us to emulate a compressed system. The OS uses the swap area to page out the pages that do not fit in the limited memory provided for the runs.

We evaluate the impact of compression on a system with memory constrained to different fractions of the footprint of the running benchmark or workload (Tab. II). As the system becomes more memory constrained, the benefits of compression generally increase, but at 60% of the footprint, several benchmarks stall because of frequent paging. We illustrate the results with 70%-constrained memory in Fig. 10 and Fig. 11. Note that if the system is not memory-capacity constrained, then the cycle-based evaluation alone captures the performance impact of Compresso.

Memory %   LCP (1-core / 4-core)   Compresso (1-core / 4-core)   Unconstrained (1-core / 4-core)
80%*       1.04 / 1.54             1.15 / 1.78                   1.24 / 2.10
70%*       1.11 / 1.97             1.29 / 2.33                   1.39 / 2.51
60%*       1.28 / 2.45             1.56 / 2.81                   1.72 / 3.23

Tab. II: Speedup results from the memory-capacity impact evaluation with different constrained-memory baselines. All performance is relative to the uncompressed constrained-memory baseline; asterisks mark constraints under which not all benchmarks finish execution.

B. CompressPoints for Representative Regions

Cycle-based simulation requires representative regions of a benchmark whose simulation results can be extrapolated to the complete benchmark. SimPoints [24] uses just the basic-block vectors (basic-block execution counts) to characterize the intervals of a benchmark, and has been shown to correlate well with pipeline and caching behavior. However, while evaluating a compressed system, the representativeness of the data is crucial, leading us to use CompressPoints [23] instead. CompressPoints extends the basic-block vectors with memory compression metrics, including the compression ratio, the rate of page overflows and underflows, and the memory usage within each interval. This leads to a much better representativeness of the compression ratio. Consider GemsFDTD (Fig. 9), which exhibits remarkably different compressibility between the SimPoint and the CompressPoint. The performance simulation results presented in this paper use 10 CompressPoints of length 200 million instructions each.
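The idea can be sketched as follows: each interval's basic-block vector is simply extended with the compression metrics before clustering. The concatenation and any weighting are assumptions here; see [23] for the actual method.

```cpp
#include <vector>

struct IntervalFeatures {
    std::vector<double> bbv;       // normalized basic-block counts
    double compression_ratio;
    double page_overflow_rate;
    double page_underflow_rate;
    double memory_usage;           // footprint within the interval
};

// Build the extended feature vector used as input to clustering
// (e.g., k-means), so intervals are grouped by data behavior too.
std::vector<double> feature_vector(const IntervalFeatures& f) {
    std::vector<double> v = f.bbv;
    v.push_back(f.compression_ratio);
    v.push_back(f.page_overflow_rate);
    v.push_back(f.page_underflow_rate);
    v.push_back(f.memory_usage);
    return v;
}
```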
C. Accuracy of Virtual Address Translation

Modern operating systems map all zero-initialized pages in the system to a single page frame, marking it as copy-on-write. Therefore, if the virtual address space of the benchmark is used to take the memory snapshots, the compression ratios can be misleadingly high. In order to obtain a realistic compression ratio, we follow the approach used in [19] and [23], and use the Linux proc filesystem. We use the pagemap file to get the physical addresses, and the kpageflags file to check whether a page is resident in main memory.
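A sketch of the pagemap side of this check, using the documented /proc/<pid>/pagemap entry format (bit 63 = present, bits 0-54 = physical frame number); error handling is omitted.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Returns whether the virtual page holding vaddr is resident in RAM,
// and if so, its physical frame number (PFN).
bool page_resident(int pid, uint64_t vaddr, uint64_t* pfn_out) {
    const uint64_t idx = vaddr / 4096;              // 8-byte entry per page
    std::ifstream f("/proc/" + std::to_string(pid) + "/pagemap",
                    std::ios::binary);
    f.seekg(static_cast<std::streamoff>(idx * 8));
    uint64_t entry = 0;
    f.read(reinterpret_cast<char*>(&entry), sizeof(entry));
    const bool present = (entry >> 63) & 1;         // page is in RAM
    if (present && pfn_out) *pfn_out = entry & ((1ULL << 55) - 1);
    return present;
}
```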

D. Simulator and Benchmarks

We use SPECcpu2006, Graph500, Forestfire [25], and Pagerank (centrality) [25] benchmarks. For cycle-based simulation, we extend zsim [26]. For energy estimation, we use McPAT [27] and CACTI [28]. For the BPC energy estimate, we use 40nm TSMC standard cells. The evaluation parameters are detailed in Tab. III. We add a 12-cycle latency for compression/decompression with BPC. The model is similar to what is proposed by Kim et al. [6] for BPC in GPGPUs. Extending it to CPUs, we use 8 cycles for buffering the cache line from DDR4, 2 cycles to transform it into 17 bit-planes, and 2 cycles for the concatenation.

Core          3GHz OOO core × 1-4; issue width = 4; ROB = 192 entries
Caches        64KB L1D, 512KB L2; 16-way 2MB L3 for 1-core, 8MB shared L3 for 4-core; 64B cachelines; TLB miss overhead = 0
Memory        8GB DRAM, DDR4-2666MHz (capacity varied in the memory-capacity evaluation); BL=8, tCL=18, tRCD=18, tRP=18
Compression   Decompression/compression overhead = 12 cycles; metadata cache hit latency = 2 cycles; compression-related accesses added to the regular RD/WR queues; compressed line sizes: 0/8/32/64B; compressed page sizes: 0/512B/1K/2K/4K (basic compressed system), 0/512B/1K/1.5K/2K/2.5K/3K/3.5K/4K (Compresso)

Tab. III: Performance simulation parameters.

E. Multi-Core Simulations

We divide the benchmarks into groups (high and low) based on their single-core speedup, metadata cache hit rate, and sensitivity to limited memory. The mixes are created by selecting benchmarks such that there is equal representation from each group, as shown in Tab. IV. Mix10 represents a worst-case scenario for compression overhead. We use 10 multi-core CompressPoints, each of 200M instructions, found using the methodology described in [23]. We use the syncedFastForward feature of zsim for multi-core simulations, which fast-forwards the execution of all the cores till they reach their CompressPoints, and then simulates their run together, such that they are always under contention. For the memory-capacity impact evaluation, the workload consists of benchmarks rerunning under contention till the longest-running benchmark completes execution on the unconstrained memory system. The same workload is then run on the constrained memory system, and the percentage progress of each benchmark is measured after the elapsed time equals the total runtime on the unconstrained memory system. The average progress across the individual benchmarks in the workload is chosen as the metric for comparison.
Fig. 9: Difference between the compressibility representativeness of SimPoint and CompressPoint for GemsFDTD and astar [23].

F. Evaluated Systems and Baselines

In order to get a competitive baseline, we compare Compresso to an optimized version of the system described in the LCP [4] paper. It uses 4 different compressed page sizes with an inflation room, and the same size of metadata cache as Compresso. It uses the OS-aware LCP scheme with the BPC compression algorithm, and includes our optimized forms of zero-line accesses and free prefetch benefits; this competitive baseline is thus only lacking Compresso's data-movement optimizations. Another related prior work, DMC [11], reports high compression ratios, which are achieved by opportunistically changing the compression granularity at the cost of additional data movement. Buri [10] is another substantial compressed memory system, optimized for large databases, which also uses LCP. Therefore, an LCP-based system with OS-aware behavior is the most competitive baseline for Compresso. In addition, we separate out the impact of alignment-friendly cache lines by evaluating LCP+Align, an enhanced version of the LCP-system. We also evaluated a version of Compresso that uses LCP-packing (instead of LinePack), but since its results were inferior across all benchmarks (an average slowdown of 11.8%), we do not report those performance numbers in the paper.

Overall Performance. We calculate overall performance as the multiplication of the two speedups from the memory-capacity impact evaluation and the cycle-based simulation, since the two are mutually independent. This represents our best effort to estimate the overall performance by combining the results from the memory-capacity impact and cycle-based evaluations. Unfortunately, this metric does not take into account application phase changes and more complex interactions where capacity and latency do not have a uniform impact on performance. The unconstrained memory system provides an upper bound on the performance gains possible with a larger memory capacity.
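For example, with the 70%-constrained single-core numbers reported in Section VII-A and Tab. II (a cycle-based geomean of 0.998 for Compresso and a memory-capacity speedup of 1.29):

\[
S_{\text{overall}} = S_{\text{cycle}} \times S_{\text{capacity}} \approx 0.998 \times 1.29 \approx 1.29,
\]

which is close to the average overall speedup of 1.28 reported there; the combination is done per benchmark, so the aggregate is not exactly the product of the two means.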
calculix Compresso opesdsse eas fti rfthn.Compression prefetching. a this in of accesses because memory system free fewer high compressed 12% a exhibits requires as also memory demand, which acts bandwidth compressed benchmark, and libquantum to lines The access cache prefetch. of compressed 64B performance multiple single returns relative a Second, overall 6%. an 25%, exhibits and bandwidth 43% highest requirements, the of having soplex, accesses these, zero-line Among respectively. high exhibit accessing leslie3d soplex by like handled and Benchmarks lines are alone. cache metadata and compression access all-zero (cached) memory of require writebacks) not (and two do of fills the because First, over happens performance reasons. This 2% system. than uncompressed more baseline gain soplex and leslie3d, capacity. memory and compression increased benefits, of to bandwidth impact due to the not due slowdowns improvements show or overheads only Compresso. for results 0.998 and LCP-system These lines, for cache The 0.961 Alignment-friendly LCP-system, system. with for uncompressed 0.938 an are to geomeans respect with performance Performance Single-Core A. energy-area and simulations, present we performance section, overheads. from this In results accesses. the memory additional the on calculix

h plctosgc rp50 atsD,libquantum, cactusADM, Graph500, gcc, applications The Simulation. Cycle-Based optimization data-movement each of impact the shows 6 Fig. hmmer hmmer

sjeng sjeng

GemsFDTD GemsFDTD

libquantum libquantum

h264ref Unconstrained memory h264ref EV A N O I UAT L VA E . I I V tonto tonto Compresso Mem-CapImpact Compresso Cycle-Based

lbm lbm

omnetpp omnetpp

i.1asosterelative the shows 10a Fig. astar astar

sphinx3 sphinx3

xalancbmk xalancbmk

Forestfire Forestfire

Pagerank 7.08 Pagerank

Graph500 Graph500

GEOMEAN Geomean 0.0 0.5 1.0 1.5 2.0 2.5 0.5 1.0 1.5 2.0 hycmlt xcto,a eso nScinVII-B. Section in compressible, show are that we benchmarks as with execution, workload complete a they of run part when However, a either. incom- as help all not are paging does they frequent compression Since to pressible, environment. due memory highly constrained stall are a and lbm in capacity and memory GemsFDTD, to Mcf, mem- sensitive 10b). unconstrained (Fig. an system of ory system performance constrained to maximum uncompressed sensitivity the the much achieves even show as not memory, do baseline. limited bzip2 uncompressed and an constrained h264ref, to a relative Gamess, 70%), to (at applied system compression memory memory of impact the metadata and 0.5%. 1.1% by by optimization 1.2%, Expansion cache by Room incompressibility predicting Inflation 2.4%, Dynamic alignment-friendly by sizes using line increased cache is performance average, high very not with is applications rates. accessed the miss being benefits metadata line memory This cache metadata exception. based the a memory an that with the assumption parallel to the access in on speculative that allows fact LCP the This access, Compresso. to than due better is slightly performs LCP+Align 15%. the to the 31% decreases from Compresso and compression 4% 1%. to to due to 91% slowdown 54% from maximum from gcc of libquantum bring access optimizations of memory movement explained extra data directly the instance, down be For can 6. Fig. numbers with These libquantum, soplex. cactusADM, over and gcc, Compresso astar, like of benchmarks benefits for speedup LCP-system results incremental result This high a implement. the as we in overall, optimizations overflows movement page data addition of the number of in lower mechanism a handling having overflow to page Compresso cheaper OS-transparent, a Being has overflow. page every upon baseline. for above speedup 2% is overall other Compresso The and with libquantum libquantum benchmarks. of spatial-locality locality buffer high row the increases also results. evaluation impact Memory-capacity and Cycle-based (a) eoyCpct matEvaluation. Impact Capacity Memory Breakdown. Performance Optimization-Based h eto h ecmrsaeete almost-linearly either are benchmarks the of rest The Pagerank, and Forestfire, omnetpp, mcf, benchmarks For fault page a requires OS-aware, being (baseline), LCP-system Performance relative to uncompressed system 0 1 2 3 Compresso Cycle-Based LCP+Align Cycle-Based LCP Cycle-Based mix1

mix2

mix3

mix4

mix5

mix6

mix7 Compresso Mem-CapImpact LCP+Align Mem-CapImpact LCP Mem-CapImpact i.1:Promnerslsfrmlicr system multi-core for results Performance 11: Fig. mix8

mix9 i.1ashows 10a Fig.

mix10

gmean 0 1 2 3 On 10 i1 oetr,Pgrn,Gah0,cactusADM Graph500, Pagerank, hmmer Forestfire, gamess, povray, Forestfire, perlbench Graph500, bwaves, mcf, Pagerank h264ref, povray, bwaves, gobmk gromacs, bzip2, perlbench, sphinx3 Mix10 calculix, cactusADM, xalancbmk, namd Mix9 gcc, omnetpp, sjeng, hmmer Mix8 leslie3d, lbm, Forestfire, tonto Mix7 gamess, astar, milc, soplex Mix6 libquantum, GemsFDTD, mcf, Mix5 Mix4 Mix3 Mix2 Mix1 n08 peu ie,1%sodw) opes out-performs Compresso resulting slowdown). 13% compression, (i.e., good high speedup have 0.87 have also in which but contains of rates, Mix10 miss all 0.92. metadata Graph500, of performance and low Pagerank, a Forestfire, little, in a results effect compression still this The it balance in but namd rates. and performers gcc miss Mix4 from bad metadata benefits 10. bandwidth high are and to which 4 due sjeng, Mixes isolation in and seen omnetpp be contains can impact The overhead.scenario. handling overflow move- per- page data higher low the much and to due for exhibit optimizations, LCP, also ment seen with Mixes than be Compresso These can with 8. formance pattern and similar 3, A 2, the soplex Mixes Mix1. of to in because due libquantum is benefits and This reduction runs. bandwidth single-core pronounced the more in is performer which mcf, bad containing a good despite shows Compresso, Mix1 system. with 4-core performance a in compression of overheads Evaluation Performance 4-Core B. 24.2%. by for LCP Compresso. 1.06 the LCP, for outperforms for 1.28 1.03 and of speedup LCP+Align overall relative average an 1.39. of speedup relative a memory achieves bound unconstrained which upper with system, The approximated 1.29. is Compresso’s performance performance to relative for compared average as an 1.11 exhibits of compression well it lower LCP, perform the to of to memory benefits Due of cases, namd). level some Forestfire, threshold in (Graph500, or, some have, least they at memory need of amount the to sensitive h rsueo h eaaaccei ihri 4-core a in higher is cache metadata the on pressure The Simulation. Cycle-Based observe we evaluations, two the from results the Combining

Performance relative to

a.I:Wrlasfrmlicr evaluation multi-core for Workloads IV: Tab. uncompressed constrained memory system 0 1 2 3

mix1 5.39 mix2 performance. Overall (b)

mix3 LCP+Align LCP

mix4 28.29 i.1asosteperformance the shows 11a Fig. mix5

mix6 Unconstrained memory Compresso

mix7 opes therefore, Compresso

mix8

mix9

mix10

gmean 0 1 2 3 opes civsa2.%promneoe LCP. over performance 27.5% a achieves Compresso opes,17 o C n,19frLPAin relative LCP+Align, system. for memory 1.9 constrained for and, the 2.27 LCP to is for speedup 1.78 overall memory-capacity Compresso, relative and The evaluations. cycle-based impact the combining scenario, bound, 1.97 upper 11b). the (Fig. to to (2.51) compared close performance very memory’s as reaches unconstrained system, and LCP, memory by achieved performance constrained relative a 2.33 over achieves LCP. over Compresso LinePack gain average, using all On when Graph500 capacity and memory it LCP. Pagerank, significant over since Forestfire, view, Compresso because with of is performance LCP. point This 100% to this than compared from more as interesting gains space is more again up Mix10 frees Compresso 4. since Mix can of This behavior difference. the a in make well seen not performs be does workload memory the additional the memory, and in more benchmarks use incompressible to other workload the allow to isolation. enough in run when are memory which limited of despite to availability, all sensitive 5 povray, memory very 2, and the Mixes different Pagerank, to why sensitive xalancbmk, of very reason containing phases not the are the is 7 This on and all workload. based a not memory time, in since slack same benchmarks the a that at allows Note footprint it maximum system. their memory reach applied benchmarks constrained when a compression of to benefits performance the shows optimization. cache metadata incompressibility, and predicting expansion by each IR 1% dynamic and cache 5.5%, by alignment-friendly sizes using line increased is performance age, LCP+Align. for 0.95 and LCP, for Pagerank. performance and better Forestfire, to omnetpp, lead like benchmarks could in which size increased, If cache be cache. metadata could metadata the 96KB computing, a warehouse-scale the with targeting to systems access. due these memory respectively, speculative evaluated 1.4%, We parallel and LCP’s 1.3% from by gained 10 benefits Compresso and out-performs 4 Mixes LCP+Align for Mixes. all across LCP -oeMmr aaiyIpc Evaluation. Impact Capacity Memory 4-Core Breakdown. Performance Optimization-Based promnei multi-core a in performance overall the shows also 11b Fig. Compresso, with performance better exhibit Mixes other The compressed have workload the in benchmarks some Once 0.9 Compresso, for 0.975 are speedups relative average, On Energy relative to uncompressed system 0 1 2 3

perlbench DRAM Energy(LCP) bzip2 gcc bwaves gamess mcf milc

vrl,i -oesetup, 4-core a in Overall, zeusmp DRAM Energy(LCP+Align) gromacs

i.1:Eeg vlaino Compresso of evaluation Energy 12: Fig. cactusADM leslie3d namd i.11a Fig. naver- On gobmk soplex povray 11 (i)

Work calculix uiPrilyM 4 C No Yes No No LinePack Light LCP LCP N/A 64B or N/A LinePack LCP 64B 1KB or Overheads 64B Energy 64B C. 64B MC N/A MC MC MC TLBs, MC BST, Partially Yes Partially No 1KB No Compresso DMC MC LLC, Buri LCP Partially RMC IBM-MXT Related ai erae aig h vrl nrysvnsaepositive. are savings energy overall compression the better its paging, and decreases 45% compressed ratio of average reduces an Compresso by a movement that across data Given average reduce benchmarks. on of 24% can set by memory encoding energy baseline switching energy-optimized internal-DRAM a an than though energy more system, [31] consume switching al. not DRAM et does internal Ghose compression Further, BDI-based energy. that subsystem conclude DRAM only the the comprises that of example, demonstrated 7% for have energy, [30] switching al. channel models et DRAM bus Seol of scope, evaluation our detailed beyond is a While usage. energy. energy toggling core bus equal has and reduces 11% Compresso by system, energy uncompressed DRAM system. an LCP+Align to the over achieves Compared 19% and Compresso savings, LCP-system, energy more the 60% to move- comparison data astar In compression-related and ment. to due misses, energy cache DRAM higher metadata has from to due accesses energy Forest- memory DRAM sjeng, extra higher Mcf, exhibit cache. omnetpp metadata and in Pagerank hit fire, that accesses lines to cache due zero energy. benefits to energy access DRAM read exhibit DRAM cactusADM a and 0.08 of is 0.8% a cache than of metadata less 96KB 0.4% is 8-way which than an channel. nJ, of DDR4-2666 less energy 2GB is access a The for active which power mW, the active that 40nm channel’s 7 find DRAM with is we 800MHz, BPC utilization at power synthesizing [28] cells Upon standard TSMC compressor/decompressor cache. BPC metadata and the movement; and to data due in energy change controller to due energy rvoswr 2]sosta opeso a increase may compression that shows [29] work Previous zeusmp like benchmarks well-compressed that shows 12 Fig. on: are energy system on Compresso of effects Major oeseeg sg u oslowdown/speedup; to due usage energy core’s

Tab. V: Related work summary

| Related Work | OS-transparent | Hardware changes | Compression granularity | Cacheline packing | Data-movement optimizations |
| IBM-MXT [8]  | Partially      | MC, LLC          | 1KB                      | N/A               | No                          |
| RMC [9]      | No             | MC, BST          | 64B                      | N/A               | No                          |
| LCP [4]      | No             | MC, TLBs         | 64B                      | LCP               | No                          |
| Buri [10]    | Partially      | MC               | 64B                      | LCP               | No                          |
| DMC [11]     | Yes            | MC               | 64B or 1KB               | LCP               | No                          |
| Compresso    | Yes            | MC               | 64B                      | LinePack          | Yes                         |

(MC: memory controller; BST: block size table.)

D. Area Overhead

The BPC compressor unit with our optimizations, synthesized using 40nm TSMC standard cells [32] at 800MHz, requires an overall area of 43Kµm², corresponding to roughly 61K NAND2 gates. A 96KB metadata (single read/write port) cache has an area of roughly 100Kµm² [28]. Although the area overhead is not negligible, the benefits from Compresso justify it.

E. Cache Line Offset Calculation

In Compresso, we find the position of a cache line within a compressed page by looking up the metadata cache and calculating the offset with a custom low-latency arithmetic unit. The bin sizes (0/8/32/64B) are first shifted right by 3 bits to reduce their width. We then need to add up to 63 numbers with values 0/1/4/8. A simple 63-input 4-bit adder requires under 1.5K NAND gates and 38 NAND-gate delays, which can be reduced to 32 gate delays with some optimizations that take the possible inputs into account. DDR4-2666 allows only ~30 gate delays in one cycle, but this offset calculation can be partially parallelized with the metadata cache lookup, thereby making the overhead only 1 cycle.
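To make the arithmetic concrete, the following is a minimal software sketch of this offset calculation. The 2-bit bin encoding, the `bins` array layout, and all identifiers are our illustrative assumptions, not details from the design; the hardware implements the sum as a single-cycle adder tree over the shifted bin sizes rather than a loop.

```c
#include <stdint.h>

/* Compressed-line bin sizes (0/8/32/64B) shifted right by 3 bits,
 * as described above: they become 0/1/4/8 and fit in 4 bits. */
static const uint8_t kBinShifted[4] = { 0, 1, 4, 8 };

/* Byte offset of cache line `line` (0..63) within its compressed page.
 * bins[i] holds an assumed 2-bit code (0 -> 0B, 1 -> 8B, 2 -> 32B,
 * 3 -> 64B) for line i, as read from the metadata cache. Hardware sums
 * the up-to-63 preceding values 0/1/4/8 in one cycle; this loop is the
 * functional equivalent. */
uint32_t compresso_line_offset(const uint8_t bins[64], unsigned line)
{
    uint32_t sum = 0;                  /* accumulated size, in 8B units */
    for (unsigned i = 0; i < line; i++)
        sum += kBinShifted[bins[i] & 0x3];
    return sum << 3;                   /* convert back to bytes */
}
```

The shift-by-3 trick keeps every adder input at 4 bits (the maximum sum is 63 x 8 = 504, i.e., 9 bits), which is what makes the 1.5K-gate, ~32-delay adder tree feasible.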
VIII. RELATED WORK

Software memory compression has been implemented in commercial systems for the past decade. For example, Linux (since Linux 3.14 and Android 4.4), iOS [33] and Windows [34] keep background applications compressed. Any access to a compressed page generates a compressed-page fault, prompting the OS to decompress the page into a full physical frame. Since the hardware is not involved, hot pages are stored uncompressed, thereby limiting the compression benefits.

Most other prior work in system compression deals with cache compression or bandwidth compression [5]-[7], [13], [16], [18], [35]-[45]. We summarize the prior work in main memory compression in Tab. V.

IBM MXT [8] was mostly non-intrusive, but required a few OS changes. The OS was informed of an out-of-memory condition by the hardware changing watermarks based on compressibility. Furthermore, the OS was required to zero out free pages to avoid repacking data in hardware. Since the compression granularity was 1KB, a memory read would result in 1KB worth of data being decompressed and transferred. To address this, a large 32MB L3 cache with a 1KB line size was introduced. These features would significantly hinder the performance and energy efficiency of MXT in today's systems.

Robust Memory Compression [9] was an OS-aware mechanism, proposed to improve the performance of MXT. RMC used translation metadata as part of the page table data and cached this metadata on chip in the block size table. RMC used 4 possible page sizes, and divided a compressed page into 4 subpages, with a hysteresis area at the end of each subpage to store inflated cache lines.

Linearly Compressed Pages [4] was also OS-aware. It addressed the challenge of cache line offset calculation by compressing all cache lines in a page to the same size, such that compression still remains high. It used an exception region for larger cache lines in order to achieve higher compression, hence still requiring a metadata access for every memory access.
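For contrast with Compresso's variable bins, here is a minimal sketch of LCP-style offset computation under our simplifying assumptions; the field names and exception-region layout are illustrative, not taken from the LCP design [4]. Because every line is compressed to one target size, the common case is a single multiply instead of a 63-input sum.

```c
#include <stdint.h>

/* Assumed per-page LCP-style metadata: one target compressed size for
 * all 64 lines, plus a bitmap marking lines that did not fit and are
 * stored uncompressed in an exception region after the 64 slots. */
typedef struct {
    uint16_t target_size;       /* compressed bytes per line */
    uint64_t exception_bitmap;  /* bit i set: line i is an exception */
} lcp_meta_t;

uint32_t lcp_line_offset(const lcp_meta_t *m, unsigned line /* 0..63 */)
{
    if (!((m->exception_bitmap >> line) & 1))
        return line * m->target_size;          /* common case: one multiply */

    /* Exception slot index = number of exception lines before this one. */
    uint64_t before = (line == 0) ? 0 : (m->exception_bitmap << (64 - line));
    unsigned slot = (unsigned)__builtin_popcountll(before); /* GCC/Clang */
    return 64u * m->target_size + slot * 64u;  /* 64B uncompressed slots */
}
```

The trade-off the text describes is visible in the sketch: lines that outgrow the target size spill to the exception region, so a metadata access is still needed on every reference to pick between the two paths.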
Buri [10] targeted lightweight changes to the OS by managing memory allocation and deallocation in hardware. However, it required the OS to adapt to scenarios with incompressible data, using methods similar to MXT's. It used an OS-visible "shadow address" space, similar to OSPA, and used the LCP approach to handle computing cache line offsets within a compressed page.

DMC [11] proposed using two kinds of compression: LZ at 1KB granularity for cold pages and LCP with BDI for hot pages. It decides between the 2 compression mechanisms in an OS-transparent fashion, at 32KB boundaries, which can potentially increase the data movement. Additionally, LCP is now applied to 32KB pages instead of 4KB pages, resulting in lower compression.

CMH [12] proposed compression to increase the capacity of a Hybrid Memory Cube (HMC), by having compressed and uncompressed regions per vault, with a floating boundary in between. The compression and allocation granularity is at the cache-block level, thereby not needing any page movement. However, having to maintain strictly separate compressed and uncompressed regions amplifies the problem of cache-block-level data movement, causing their scheme to require explicit compaction every 100 million cycles.

IX. CONCLUSION

We identify important trade-offs between compression aggressiveness and the unwanted data movement overheads of compression, and optimize on these trade-offs. This enables Compresso to achieve higher compression benefits while also reducing the performance overhead from this data movement. We present the first main-memory compression architecture that is designed to run an unmodified operating system. We propose a detailed holistic evaluation with high accuracy using novel methodologies such as the Memory Capacity Impact Evaluation. Overall, by reducing the data movement overheads while keeping compression high, we make main memory compression pragmatic for adoption by real systems, and show the same with a holistic evaluation.

X. ACKNOWLEDGEMENTS

We thank our anonymous reviewers for their valuable feedback and suggestions. This work was funded in part by a gift from Intel Corporation, and supported by the National Science Foundation Grant #1719061. We thank Michael Sullivan, Jungrae Kim, Yongkee Kwon, Tianhao Zheng, Vivek Kozhikkottu, Amin Firoozshahian, and Omid Azizi for their help and feedback during various phases of the work.
REFERENCES

[1] J. Dean and L. A. Barroso, "The Tail at Scale," Communications of the ACM, vol. 56, pp. 74–80, 2013.
[2] M. E. Haque, Y. hun Eom, Y. He, S. Elnikety, R. Bianchini, and K. S. McKinley, "Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services," in ASPLOS, 2015, pp. 161–175.
[3] M. Jeon, Y. He, S. Elnikety, A. L. Cox, and S. Rixner, "Adaptive Parallelism for Web Search," in EuroSys, 2013, pp. 155–168.
[4] G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, P. Gibbons, M. Kozuch, and T. Mowry, "Linearly compressed pages: a low-complexity, low-latency main memory compression framework," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013, pp. 172–184.
[5] Y. Zhang and R. Gupta, "Enabling Partial Cache Line Prefetching Through Data Compression," in ICPP, 2000, pp. 277–285.
[6] J. Kim, M. Sullivan, E. Choukse, and M. Erez, "Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures," in Proceedings of the 43rd Annual International Symposium on Computer Architecture, 2016, pp. 329–340.
[7] J. Yang, Y. Zhang, and R. Gupta, "Frequent value compression in data caches," in Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000, pp. 258–265.
[8] R. B. Tremaine, P. A. Franaszek, J. T. Robinson, C. O. Schulz, T. B. Smith, M. Wazlowski, and P. M. Bland, "IBM Memory Expansion Technology (MXT)," IBM Journal of Research and Development, vol. 45, no. 2, 2001, pp. 271–285.
[9] M. Ekman and P. Stenstrom, "A Robust Main-Memory Compression Scheme," in Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005, pp. 74–85.
[10] J. Zhao, S. Li, J. Chang, J. L. Byrne, L. L. Ramirez, K. Lim, Y. Xie, and P. Faraboschi, "Buri: Scaling big-memory computing with hardware-based memory expansion," ACM Trans. Archit. Code Optim., vol. 12, no. 3, pp. 31:1–31:24, 2015.
[11] S. Kim, S. Lee, T. Kim, and J. Huh, "Transparent dual memory compression architecture," in Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, 2017, pp. 206–218.
[12] C. Qian, L. Huang, Q. Yu, Z. Wang, and B. Childers, "CMH: Compression management for improving capacity in the hybrid memory cube," in Proceedings of the 15th ACM International Conference on Computing Frontiers, 2018.
[13] A. R. Alameldeen and D. A. Wood, "Frequent Pattern Compression: A significance-based compression scheme for L2 caches," Technical Report 1500, Computer Sciences Department, University of Wisconsin-Madison, 2004.
[14] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.
[15] X. Chen, L. Yang, R. Dick, L. Shang, and H. Lekatsas, "C-PACK: A high-performance microprocessor cache compression algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, 2010, pp. 1196–1208.
[16] G. Pekhimenko, V. Seshadri, O. Mutlu, P. Gibbons, M. Kozuch, and T. Mowry, "Base-delta-immediate compression: practical data compression for on-chip caches," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, 2012, pp. 377–388.
[17] A. Arelakis, F. Dahlgren, and P. Stenstrom, "HyComp: A Hybrid Cache Compression Method for Selection of Data-type-specific Compression Methods," in Proceedings of the 48th International Symposium on Microarchitecture, 2015, pp. 38–49.
[18] A. Arelakis and P. Stenstrom, "SC2: A Statistical Compression Cache Scheme," in Proceedings of the 41st Annual International Symposium on Computer Architecture, 2014, pp. 145–156.
[19] S. Sardashti and D. A. Wood, "Could compression be of general use? Evaluating memory compression across domains," ACM Trans. Archit. Code Optim., vol. 14, no. 4, pp. 44:1–44:24, Dec. 2017.
[20] C. A. Waldspurger, "Memory resource management in VMware ESX server," in Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), 2002, pp. 181–194.
[21] P. A. Franaszek and D. E. Poff, "Management of Guest OS Memory Compression In Virtualized Systems," Patent US20 080 307, 2007.
[22] T. Chen, "Introduction to ACPI-based memory hot-plug," in LinuxCon/CloudOpen Japan, 2013.
[23] E. Choukse, M. Erez, and A. R. Alameldeen, "CompressPoints: An Evaluation Methodology for Compressed Memory Systems," IEEE Computer Architecture Letters, vol. PP, no. 99, 2018, pp. 1–1.
[24] E. Perelman, G. Hamerly, M. Biesbrouck, T. Sherwood, and B. Calder, "Using SimPoint for accurate and efficient simulation," in Proceedings of the International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2003, pp. 318–319.
[25] J. Leskovec and R. Sosič, "SNAP: A General-Purpose Network Analysis and Graph-Mining Library," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 8, p. 1, 2016.
[26] D. Sanchez and C. Kozyrakis, "ZSim: fast and accurate microarchitectural simulation of thousand-core systems," in Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013, pp. 475–486.
[27] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 469–480.
[28] S. J. E. Wilton and N. P. Jouppi, "CACTI: An Enhanced Cache Access and Cycle Time Model," IEEE Journal of Solid-State Circuits, vol. 31, pp. 677–688, 1996.
[29] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W. Keckler, "A case for toggle-aware compression for GPU systems," in International Symposium on High Performance Computer Architecture (HPCA), 2016.
[30] H. Seol, W. Shin, J. Jang, J. Choi, J. Suh, and L.-S. Kim, "Energy Efficient Data Encoding in DRAM Channels Exploiting Data Value Similarity," in Proceedings of the 43rd International Symposium on Computer Architecture, 2016.
[31] S. Ghose, A. G. Yaglikci, R. Gupta, D. Lee, K. Kudrolli, W. X. Liu, H. Hassan, K. K. Chang, N. Chatterjee, A. Agrawal, M. O'Connor, and O. Mutlu, "What Your DRAM Power Models Are Not Telling You: Lessons from a Detailed Experimental Study," in SIGMETRICS, 2018.
[32] Taiwan Semiconductor Manufacturing Company, "40nm CMOS Standard Cell Library v120b," 2009.
[33] "Apple Releases Developer Preview of OS X Mavericks With More Than 200 New Features," 2013. [Online]. Available: https://www.apple.com/pr/library/2013/06/10Apple-Releases-Developer-Preview-of-OS-X-Mavericks-With-More-Than-200-New-Features.html
[34] "Announcing Windows 10 Insider Preview," 2015. [Online]. Available: https://blogs.windows.com/windowsexperience/2015/08/18/announcing-windows-10-insider-preview-build-10525/#36QzdLwDd3Eb45ol.97
[35] E. G. Hallnor and S. K. Reinhardt, "A unified compressed memory hierarchy," in Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005, pp. 201–212.
[36] A. R. Alameldeen and D. A. Wood, "Adaptive cache compression for high-performance processors," in Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004, pp. 212–223.
[37] A. Shafiee, M. Taassori, R. Balasubramonian, and A. Davis, "MemZip: Exploring unconventional benefits from memory compression," in International Symposium on High Performance Computer Architecture (HPCA), 2014, pp. 638–649.
[38] D. J. Palframan, N. S. Kim, and M. H. Lipasti, "COP: To Compress and Protect Main Memory," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015, pp. 682–693.
[39] S. Sardashti, A. Seznec, and D. A. Wood, "Yet Another Compressed Cache: A Low-Cost Yet Effective Compressed Cache," ACM Trans. Archit. Code Optim., pp. 27:1–27:25, 2016.
[40] J. Kim, M. Sullivan, S.-L. Gong, and M. Erez, "Frugal ECC: Efficient and Versatile Memory Error Protection Through Fine-grained Compression," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 12:1–12:12.
[41] "Qualcomm Centriq 2400 Processor," 2017. [Online]. Available: https://www.qualcomm.com/media/documents/files/qualcomm-centriq-2400-processor.pdf
[42] S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, A. Landin, S. Yip, H. Zeffer, and M. Tremblay, "Rock: A high-performance Sparc CMT processor," IEEE Micro, vol. 29, no. 2, pp. 6–16, March 2009.
[43] R. de Castro, A. Lago, and M. Silva, "Adaptive compressed caching: Design and implementation," in 15th Symposium on Computer Architecture and High Performance Computing, 2003.
[44] H. Alam, T. Zhang, M. Erez, and Y. Etsion, "Do-It-Yourself Virtual Memory Translation," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 457–468.
[45] P. Franaszek, J. Robinson, and J. Thomas, "Parallel compression with cooperative dictionary construction," in Proceedings of the Data Compression Conference, 1996, pp. 200–209.
