2017 IEEE International Conference on Consumer Electronics (ICCE)

LZ4m: A Fast Compression Algorithm for In-Memory Data

Se-Jun Kwon∗, Sang-Hoon Kim†, Hyeong-Jun Kim∗, and Jin-Soo Kim∗
∗College of Information and Communication Engineering, Sungkyunkwan University
Email: [email protected], [email protected], [email protected]
†Korea Advanced Institute of Science and Technology (KAIST)
Email: [email protected]

Abstract—Compressing in-memory data is a cost-effective solution for dealing with the memory demand of data-intensive applications. This paper proposes a fast compression algorithm for in-memory data that improves performance by utilizing the characteristics frequently observed in in-memory data.

I. INTRODUCTION

Dealing with the ever-increasing memory requirements of data-intensive applications is the subject of intensive work in both industry and academia. In particular, it has become an important issue in consumer devices, as their fast evolution places a high demand on memory. As a cost-effective response to this demand, manufacturers attempt to compress in-memory data, thereby increasing the effective memory size and lowering manufacturing cost [1], [2]. In-memory data compression techniques are becoming especially important as the market for consumer devices becomes extremely price sensitive and sales shift from premium models to cheaper ones equipped with less memory [3].

Data compression, or compression for short, refers to techniques that reduce the size of data by representing it in a more concise form. There have been many studies that employ compression for in-memory data, ranging from computer architecture to operating systems. For instance, some memory technologies [4] compress cache lines at the memory controller before writing them to memory, providing a larger amount of memory than is physically equipped. Compcache and zram [5] find rarely referenced page frames and compress them to reduce their memory footprint.

A compression algorithm for in-memory data must be fast in both compression and decompression while providing an acceptable compression ratio (throughout the paper, we use the term compression ratio to refer to the ratio of the compressed size to the original size), as the performance of memory accesses heavily influences overall system performance. To this end, Wilson et al. proposed the WKdm algorithm, which utilizes the regularities frequently observed in in-memory data [6]. However, although WKdm exhibits outstanding compression speed, its poor compression ratio is the major concern that hinders its application. On the other hand, general-purpose compression algorithms such as Deflate [7], LZ4 [8], and LZO [9] compress in-memory data better than WKdm, but they exhibit poor compression and decompression speed, thereby impairing system performance.

This paper proposes a novel compression algorithm called LZ4m, which stands for LZ4 for in-memory data. We expedite the scanning of the input stream of the original LZ4 algorithm by utilizing the characteristics frequently observed in in-memory data. We also revise the encoding scheme of the LZ4 algorithm so that the compressed data can be represented in a more concise form. Our evaluation with data obtained from a running mobile device shows that LZ4m outperforms previous compression algorithms in compression and decompression speed by up to 2.1× and 1.8×, respectively, with a marginal loss of less than 3% in compression ratio.

(This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIP) (No. NRF-2016R1A2A1A05005494).)

II. ZIV-LEMPEL ALGORITHM AND LZ4

Many compression algorithms aiming at low compression latency are based on the Lempel-Ziv algorithm [10]. The key strategy of the Lempel-Ziv algorithm is to scan an input stream and find the longest match between the substring starting at the current scanning position and the substrings in the already-scanned part of the input stream. If such a match is found, the matched substring at the current position is substituted with a pair of the backward offset to the match and the length of the match. However, identifying the longest match can incur high overheads in time and space. Thus, algorithms in practice usually alter the strategy to quickly identify a substring that is long enough to be compressed [7], [8], [9].

LZ4 is one of the most popular and successful compression algorithms based on the Lempel-Ziv algorithm. Figure 1 illustrates how LZ4 identifies a match and encodes an input stream. LZ4 scans the input stream with a 4-byte window ("DABE" at position 10 in this case) and checks whether the substring in the window appeared before. To assist the matching, LZ4 maintains a hash table that maps a 4-byte substring to a position from the start of the input stream. If the hash table contains an entry for the current 4-byte substring in the scanning window (Figure 1(a)-①), it implies that the current substring appeared at a certain position in the previously scanned stream (Figure 1(a)-②). Therefore, the substring starting at that position has an identical prefix with the substring starting at the current scanning position.
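As a concrete illustration of this lookup, the sketch below shows one way to hash the 4-byte scanning window in C. This is a minimal sketch under our reading of the scheme, not the authors' code: the name hash4, HASH_LOG, and the table geometry are assumptions, though the multiplicative constant is the Knuth-style hash popularized by LZ4-like compressors.

    #include <stdint.h>
    #include <string.h>

    #define HASH_LOG  12
    #define HASH_SIZE (1u << HASH_LOG)   /* 4096 slots; illustrative size */

    /* Map the 4 bytes under the scanning window to a table slot. */
    static uint32_t hash4(const uint8_t *p)
    {
        uint32_t v;
        memcpy(&v, p, sizeof(v));        /* unaligned-safe 4-byte read */
        return (v * 2654435761u) >> (32 - HASH_LOG);
    }

    /* table[hash4(p)] records the most recent position whose 4-byte
     * substring hashed to that slot; a later lookup yields a candidate
     * match, which is then verified by comparing the actual bytes. */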


Fig. 1. Matching process and encoding of LZ4. [Figure: (a) matching — a 4-byte scanning window slides over an example stream; a hash table maps 4-byte substrings to positions (FCDA → 0, CDAB → 1, DABE → 2, matched again at position 10). (b) encoding — token: literal length (4 bits), match length (4 bits); body: extended literal length (0+ bytes), literal data (0+ bytes), match offset (2 bytes), extended match length (0+ bytes).]

Fig. 2. Matching process and encoding of LZ4m. [Figure: (a) matching — the same stream scanned at 4-byte granularity, with hash entries FCDA → 0, BEGE → 4 (matched at 12), ABDA → 8. (b) encoding — token: match offset (2 bits), literal length (3 bits), match length (3 bits); body: extended literal length (0+ bytes), literal data (0+ bytes), match offset (1 byte), extended match length (0+ bytes).]

LZ4 utilizes this property and finds the longest match starting from these two positions. In this example, the longest match is "DABEGEA". Then, the corresponding hash table entry is updated to the current scanning position and the scanning window slides forward by the length of the match (Figure 1(a)-③). If no entry for the current substring is found in the hash table, a new entry is inserted into the hash table and the scanning window is advanced by one byte. The scanning is repeated until the scanning window reaches the end of the stream.

According to this matching scheme, the input stream is partitioned into a number of substrings, called literals and matches. A literal is a substring that does not match any previously appeared substring. A literal is always followed by one or more matches unless it is the last substring of the input stream. LZ4 encodes the substrings into encoding units comprised of a token and a body, as shown in Figure 1(b). Every encoding unit starts with a 1-byte token. If the substring to encode is a literal followed by a match, the upper 4 bits of the token are set to the length of the literal and the lower 4 bits to the length of the following match. If 4 bits are not enough to represent a length, i.e., the substring is longer than 15 bytes, the corresponding bits are set to 1111(2) and the length minus 15 is placed in the body, which follows the token. The body contains the literal data and the description of the match, which is comprised of a backward offset to the match and the length of the match. As the offset is encoded in 2 bytes, LZ4 can look back up to 64 KB to find a match. Additional matches immediately following a match are encoded in a similar way, except that the literal length field in the token is set to 0000(2) and the body omits the part for the literal.

III. PROPOSED LZ4M ALGORITHM

LZ4 focuses on compression and decompression speed with an acceptable compression ratio. However, as LZ4 is designed as a general-purpose compression algorithm, it does not utilize the inherent characteristics of in-memory data.

In-memory data consists of virtual memory pages that contain data from the stack and the heap regions of applications. The stack and the heap regions contain constants, variables, pointers, and arrays of basic types, which are usually structured and aligned to the word boundary of the system [6]. Based on this observation, we can expect that data compression can be accelerated without a significant loss of compression opportunity by scanning and matching data at word granularity. In addition, as the larger granularity requires fewer bits to represent the length and the offset, the substrings (i.e., literals and matches) can be encoded in a more concise form.

We revise the LZ4 algorithm to utilize these characteristics and propose the LZ4m algorithm, which stands for LZ4 for in-memory data. Figure 2(a) illustrates the modified compression scheme and the unit encoding of LZ4m. LZ4m uses the same scanning window and hash table as the original LZ4. In contrast to the original LZ4 algorithm, LZ4m scans the input stream and finds matches at a 4-byte granularity. If the hash table indicates that no prefix match exists, LZ4m advances the window by 4 bytes and repeats identifying the prefix match. As shown in Figure 2(a), after examining the prefix at position 8, the next position of the window is 12 instead of 9. If a prefix match is found, LZ4m compares subsequent data and finds the longest match at the 4-byte granularity as well. In the example, although there is a 6-byte match starting from position 12, LZ4m only considers the 4-byte match from position 12 to 15, and slides the scanning window forward by four bytes, to position 16 (Figure 2(a)-③).
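The following C sketch illustrates this modified scan loop. It is a minimal illustration under our reading of the paper, not the authors' code: lz4m_scan and emit_unit are assumed names, hash4 and the includes come from the earlier sketch, and the candidate validation a production implementation needs is reduced to a byte comparison.

    /* hash4() and HASH_SIZE as in the earlier sketch. */
    static uint32_t hash4(const uint8_t *p);

    /* Defined in the encoding sketch below (Section III). */
    static uint8_t *emit_unit(uint8_t *out, const uint8_t *lit, size_t lit_len,
                              size_t match_off, size_t match_len);

    /* Scan a page: advance in 4-byte steps, match in 4-byte units.
     * table has HASH_SIZE entries and is zeroed by the caller. */
    static uint8_t *lz4m_scan(const uint8_t *in, size_t len,
                              uint16_t *table, uint8_t *out)
    {
        size_t pos = 0, anchor = 0;      /* anchor: start of pending literal */

        while (pos + 4 <= len) {
            uint32_t h = hash4(in + pos);
            size_t cand = table[h];      /* candidate earlier position */
            table[h] = (uint16_t)pos;

            /* No prefix match: advance by 4 bytes, not 1 as in LZ4. */
            if (cand == pos || memcmp(in + cand, in + pos, 4) != 0) {
                pos += 4;
                continue;
            }

            /* Prefix match: extend the match in 4-byte units only. */
            size_t mlen = 4;
            while (pos + mlen + 4 <= len &&
                   memcmp(in + cand + mlen, in + pos + mlen, 4) == 0)
                mlen += 4;

            /* Emit literals in [anchor, pos) plus the (offset, length) pair. */
            out = emit_unit(out, in + anchor, pos - anchor, pos - cand, mlen);
            pos += mlen;
            anchor = pos;
        }
        /* Bytes in [anchor, len) remain as trailing literals. */
        return out;
    }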


We also optimize the encoding scheme to utilize the 4-byte granularity. As the offset is aligned to the 4-byte boundary and the length is a multiple of 4 bytes, the two least significant bits of these fields are always 00(2). Thus, every field representing a length or an offset (the literal length and the match length in the token, and the literal length, the match length, and the match offset in the body) can be shortened by 2 bits. Moreover, as LZ4m targets compressing 4 KB pages at a 4-byte granularity, the field for the match offset requires at most 10 bits. Consequently, we place the two most significant bits of the match offset in the token and put the remaining 8 bits in the body.
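To make the layout concrete, the sketch below packs one LZ4m encoding unit, completing the emit_unit helper assumed in the scan sketch above. The field widths follow Figure 2(b), but the bit order within the token and the single extension byte for oversized lengths are our assumptions; the original LZ4 chains further extension bytes for very long lengths.

    /* Pack one LZ4m unit: token = [offset bits 9-8 : 2][literal len : 3]
     * [match len : 3]. Offsets and lengths are multiples of 4, so their
     * two always-zero low bits are dropped before packing. */
    static uint8_t *emit_unit(uint8_t *out, const uint8_t *lit, size_t lit_len,
                              size_t match_off, size_t match_len)
    {
        size_t ll  = lit_len   >> 2;     /* stored in units of 4 bytes */
        size_t ml  = match_len >> 2;
        size_t off = match_off >> 2;     /* at most 10 bits in a 4 KB page */

        *out++ = (uint8_t)(((off >> 8) << 6) |         /* 2 MSBs of offset */
                           ((ll < 7 ? ll : 7) << 3) |  /* 3-bit literal length */
                            (ml < 7 ? ml : 7));        /* 3-bit match length */

        if (ll >= 7) *out++ = (uint8_t)(ll - 7);       /* extended literal length */
        memcpy(out, lit, lit_len);                     /* literal data */
        out += lit_len;
        *out++ = (uint8_t)(off & 0xff);                /* low 8 bits of offset */
        if (ml >= 7) *out++ = (uint8_t)(ml - 7);       /* extended match length */

        return out;
    }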

IV. EVALUATION

We evaluated the proposed LZ4m algorithm using real in-memory data obtained from a mobile device. We evaluated LZ4 and LZO1x as representatives of general-purpose algorithms, and WKdm as a specialized one. To collect in-memory data, particularly the kind that in-memory data compression schemes target, we collected data evicted from main memory via swapping. We set up a mobile system platform on a mobile system development board and configured the kernel to swap data to a custom block device emulated with an SSD development platform board. The custom block device stores swap data in a particular partition that can be examined later off-line. We collected 98,995 pages (386.7 MB) of swap data while launching 24 popular applications 512 times in total.

Figure 3 compares the algorithms in terms of their compression ratio and compression/decompression speed. Recall that the reported compression ratio is averaged over pages, and a smaller compression ratio implies a smaller compressed size for the same data. Also, the speeds are normalized to that of LZ4m. Compared to the general-purpose algorithms (i.e., LZ4 and LZO1x), LZ4m shows a comparable compression ratio, degraded by only up to 3%. However, LZ4m outperforms these algorithms in speed by up to 2.1× and 1.8× for compression and decompression, respectively. On the other hand, LZ4m outperforms WKdm significantly in compression ratio and decompression speed at the cost of a 21% slowdown in compression speed. These results suggest that LZ4m substantially improves the compression/decompression speed of LZ4 with a marginal loss of compression ratio.

Fig. 3. Compression ratio and compression/decompression speed of various compression algorithms. [Figure: (a) compression ratio of LZ4m, LZ4, LZO1x, and WKdm; (b) compression and decompression speed normalized to LZ4m.]

Figure 4 shows the cumulative distribution of compression ratio over pages. The compression ratio curve of LZ4m does not fall much behind those of the LZO and LZ4 algorithms. However, WKdm shows a distinctively poor compression ratio curve that lags far behind the other algorithms. In addition, 6.8% of pages cannot be compressed at all with WKdm, whereas less than 1% cannot with the others. This suggests that WKdm's speed-up in compression can be offset by its poor compression ratio.

Fig. 4. Cumulative distribution of compression ratio. [Figure: cumulative ratio of pages over compression ratio for LZ4m, LZ4, LZO1x, and WKdm.]

To further analyze the implication of the 4-byte granularity of the match offset and the match length, we counted the lengths of match substrings in the trace. Figure 5 compares the results from the original LZ4 and LZ4m as a cumulative count of matches over match length. The inset magnifies the original result for match lengths between 0 and 32. We can verify that the increased granularity decreases the total occurrence of matches by only 2.5%, which implies that the 4-byte granularity scheme has little influence on the opportunity to find a match. Also, the distribution of the match lengths shows that an n-byte match in the original scheme can be substituted with a ⌊n/4⌋ × 4-byte match. This suggests that the disadvantage of the 4-byte granularity in the match length is also marginal.


Fig. 5. Cumulative count of matches over match length. [Figure: cumulative match count (×10^6) over match lengths from 0 to 4096 bytes for LZ4 and LZ4m; the inset magnifies match lengths between 0 and 32.]

Fig. 7. Average decompression speed over compression ratio. [Figure: decompression time (microseconds) over compression ratio for LZ4m, LZ4, LZO1x, and WKdm.]

Figure 6 shows the relationship between the compression speed and the compression ratio of the algorithms. The speed is obtained by measuring the time to compress each page and averaging the times of pages having the same compression ratio. The times to compress well-compressed pages (pages with small compression ratio values) are similar across the algorithms. For badly-compressed pages (pages with large compression ratio values), however, LZ4m shows outstanding compression speed compared to LZ4 and LZO1x. This considerable speedup originates from the scanning process of LZ4m, which advances the scanning window by 4 bytes when no prefix match is found, expediting the scanning by four times on hardly-compressible pages.

Fig. 6. Average compression speed over compression ratio. [Figure: compression time (microseconds) over compression ratio for LZ4m, LZ4, LZO1x, and WKdm.]

Figure 7 shows the relationship between the decompression speed and the compression ratio of the algorithms. The speed is obtained in the same way as the average compression speed. We can confirm that LZ4m outperforms the other algorithms in decompression speed over almost the entire range of compression ratios. Interestingly, WKdm does not excel the other algorithms in decompression speed, especially for pages that are not compressed well, although it promises outstanding compression speed. This is a critical drawback for a compression algorithm for in-memory data, since decompression usually happens on performance-critical paths in memory subsystems, such as fault handling or swap-in.

V. CONCLUSION

We optimized a popular general-purpose compression algorithm by utilizing the inherent characteristics of in-memory data. Evaluation with real-life data confirms that the proposed LZ4m greatly boosts the compression/decompression speed without substantial loss in compression ratio. We plan to apply LZ4m to a real in-memory compression system and to measure its implications on overall system performance.

REFERENCES

[1] Samsung Electronics Co., "Galaxy Note 3." [Online]. Available: http://www.samsung.com/uk/consumer/mobile-devices/smartphones/galaxy-note/SM-N9005ZKEBTU
[2] Android, "Low RAM." [Online]. Available: https://source.android.com/devices/tech/low-ram.html
[3] International Data Corporation (IDC), "Worldwide quarterly mobile phone tracker," Aug. 2015. [Online]. Available: http://www.idc.com/tracker/showtrackerhome.jsp
[4] S. Arramreddy, D. Har, K.-K. Mak, R. B. Tremaine, and M. Wazlowski, "IBM "MXT" memory compression technology debuts in a ServerWorks northbridge," in Proceedings of Hot Chips 12, 2000.
[5] F. Douglis, "The compression cache: Using on-line compression to extend physical memory," in Proceedings of the Winter USENIX Conference, 1993, pp. 519–529.
[6] P. R. Wilson, S. F. Kaplan, and Y. Smaragdakis, "The case for compressed caching in virtual memory systems," in Proceedings of the 1999 USENIX Annual Technical Conference, June 1999, pp. 101–116.
[7] J.-l. Gailly and M. Adler. (2013) zlib: A massively spiffy yet delicately unobtrusive compression library. [Online]. Available: http://www.zlib.net/
[8] Y. Collet. (2013) LZ4: Extremely fast compression algorithm. [Online]. Available: http://code.google.com/p/lz4
[9] M. F. Oberhumer. (2011) LZO real-time data compression library. [Online]. Available: http://www.oberhumer.com/opensource/lzo
[10] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, 1977.
