DeepSqueeze: Deep Semantic Compression for Tabular Data
Amir Ilkhechi, Andrew Crotty, Alex Galakatos, Yicong Mao, Grace Fan, Xiran Shi, Ugur Cetintemel
Department of Computer Science, Brown University
[email protected]

ABSTRACT

With the rapid proliferation of large datasets, efficient data compression has become more important than ever. Columnar compression techniques (e.g., dictionary encoding, run-length encoding, delta encoding) have proved highly effective for tabular data, but they typically compress individual columns without considering potential relationships among columns, such as functional dependencies and correlations. Semantic compression techniques, on the other hand, are designed to leverage such relationships to store only a subset of the columns necessary to infer the others, but existing approaches cannot effectively identify complex relationships across more than a few columns at a time.

We propose DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation. DeepSqueeze also supports guaranteed error bounds for lossy compression of numerical data and works in conjunction with common columnar compression formats. Our experimental evaluation uses real-world datasets to demonstrate that DeepSqueeze can achieve over a 4× size reduction compared to state-of-the-art alternatives.

1 INTRODUCTION

In nearly every domain, datasets continue to grow in both size and complexity, driven primarily by a "log everything" mentality with the assumption that the data will be useful or necessary at some time in the future. Applications that produce these enormous datasets range from centralized (e.g., scientific instruments) to widely distributed (e.g., telemetry data from autonomous vehicles, geolocation coordinates from mobile phones). In many cases, long-term archival for such applications is crucial, whether for analysis or compliance purposes.

For example, current Tesla vehicles have a huge number of on-board sensors, including cameras, ultrasonic sensors, and radar, all of which produce high-resolution readings [11]. Data from every vehicle is sent to the cloud and stored indefinitely [30], a practice that has proved useful for a variety of technical and legal reasons. In 2014, for instance, Tesla patched an engine overheating issue after identifying the problem by analyzing archived sensor readings [30].

However, the cost of data storage and transmission has not kept pace with this trend, and effective compression algorithms have become more important than ever. For tabular data, columnar compression techniques [12] (e.g., dictionary encoding, run-length encoding, delta encoding) can often achieve much greater size reductions than general-purpose tools (e.g., gzip [9], bzip2 [4]). Yet, traditional approaches that operate on individual columns miss an important insight: inherent relationships among columns frequently exist in real-world datasets.

Semantic compression, therefore, attempts to leverage these relationships in order to store the minimum number of columns needed to accurately reconstruct all values of entire tuples. A classic example is to retain only the zip code of a mailing address while discarding the city and state, since the former can be used to uniquely recover the latter two. Existing semantic compression approaches [14, 22, 25, 26, 31], though, can only capture simple associations (e.g., functional dependencies, pairwise correlations), thereby failing to identify complex relationships across more than a few columns at a time. Moreover, some require extensive manual tuning [22], while others work only for certain data types [31].

To overcome these limitations, we present DeepSqueeze, a new semantic compression framework for tabular data that uses autoencoders to map tuples to a lower-dimensional space. We additionally introduce several enhancements to the basic model, as well as columnar compression techniques that we have adapted for our approach. As we show in our evaluation, DeepSqueeze can outperform existing approaches by over 4× on several real-world datasets.
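To make this idea concrete, the following is a minimal sketch of such a mapping, assuming Python with PyTorch; the layer sizes, code width, and training loop are illustrative choices, not the architecture actually used by DeepSqueeze. The encoder compresses each tuple to a short code, and a compressor along these lines would store those codes (plus whatever corrections are needed for error-bounded reconstruction) instead of the original columns.

```python
import torch
import torch.nn as nn

class TupleAutoencoder(nn.Module):
    """Maps fixed-width numeric tuples to a low-dimensional code and back."""

    def __init__(self, num_cols: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_cols, 32), nn.ReLU(),
            nn.Linear(32, code_dim),            # the compressed representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32), nn.ReLU(),
            nn.Linear(32, num_cols),            # the reconstructed tuple
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Toy table: 10,000 tuples with 8 (already normalized) numerical columns.
rows = torch.rand(10_000, 8)
model = TupleAutoencoder(num_cols=8, code_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train so that decode(encode(x)) approximates x; afterwards, the
# 2-dimensional codes stand in for the original 8 columns.
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(rows), rows)
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = model.encoder(rows)                 # what would be written to storage
```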
In summary, we make the following contributions:

• We present DeepSqueeze, a new semantic compression framework that uses autoencoders to capture complex relationships among both categorical and numerical columns in tabular data.
• We propose several optimizations, including extensions to the basic model (e.g., automatic hyperparameter tuning) and columnar compression techniques specifically tailored to our approach.
• Our experimental results show that DeepSqueeze can outperform state-of-the-art semantic compression techniques by over 4×.

The remainder of this paper is organized as follows. Section 2 provides background on a wide range of compression techniques, including the most closely related work. Next, in Section 3, we present DeepSqueeze and give a high-level overview of our approach. Then, Sections 4–6 describe each stage of DeepSqueeze's compression pipeline in detail. Finally, we present our experimental evaluation in Section 7 and conclude in Section 8.

2 RELATED WORK

Compression is a well-studied problem, both in the context of data management and in other areas. This interest has spawned a huge variety of techniques, ranging from general-purpose algorithms to more specialized approaches.

We note that many of these techniques are not mutually exclusive and can often be combined. For example, many existing semantic compression algorithms [14, 25, 26] apply other general-purpose compression techniques to further reduce their output size. Similarly, DeepSqueeze combines ideas from semantic and deep compression methods, as well as further incorporating columnar (and even some general-purpose) compression techniques.

In the following, we give an overview of the compression techniques most relevant to our work.

2.1 General-Purpose Compression

2.1.1 Lossless. As the name suggests, lossless compression is reversible, meaning that the original input can be perfectly recovered from the compressed format. Lossless compression algorithms typically operate by identifying and removing statistical redundancy in the input data. For example, Huffman coding [24] replaces symbols with variable-length codes, assigning shorter codes to more frequent symbols such that they require less space.

One common family of lossless compression algorithms includes LZ77 [43], LZ78 [44], LZSS [36], and LZW [42], among others. These approaches work in a streaming fashion to replace sequences with dictionary codes built over a sliding window.

Many hybrid algorithms exist that utilize multiple lossless compression techniques in conjunction. For example, DEFLATE [20] combines Huffman coding and LZSS, whereas Zstandard [16] uses a combination of LZ77, Finite State Entropy coding, and Huffman coding. A number of common tools (e.g., gzip [9], bzip2 [4]) also implement hybrid variations of these core techniques.

2.1.2 Lossy. Unlike lossless compression, lossy compression reduces data size by discarding nonessential information, such as truncating the least significant digits of a measurement or subsampling an image. Due to the loss of information, the process is not reversible. While some use cases require perfect recoverability of the input (e.g., banking transactions), our work focuses on data archival scenarios that can usually tolerate varying degrees of lossiness.

The discrete cosine transform [13] (DCT), which represents data as a sum of cosine functions, is the most widely used lossy compression algorithm. For example, DCT is the basis for a number of image (e.g., JPEG), audio (e.g., MP3), and video (e.g., MPEG) compression techniques.

Another simple form of lossy compression is quantization, which involves discretizing continuous values via rounding or bucketing. DeepSqueeze uses a form of quantization to enable lossy compression of numerical columns, which we describe further in Section 4.
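To illustrate why bucketing yields a guaranteed error bound, the sketch below quantizes a numerical column into uniform buckets of width 2ε, so every reconstructed value is within ε of the original. The function names and the uniform-bucket choice are illustrative assumptions, not necessarily the exact scheme DeepSqueeze uses in Section 4.

```python
import numpy as np

def quantize(col: np.ndarray, eps: float) -> np.ndarray:
    """Map each value to the index of its bucket of width 2*eps."""
    return np.round(col / (2 * eps)).astype(np.int64)

def dequantize(buckets: np.ndarray, eps: float) -> np.ndarray:
    """Reconstruct each value as its bucket's midpoint."""
    return buckets * (2 * eps)

col = np.random.uniform(0.0, 100.0, size=1_000_000)
eps = 0.5                                # user-specified error tolerance
buckets = quantize(col, eps)             # small integers compress well downstream
restored = dequantize(buckets, eps)

# Each value lands within eps of its bucket midpoint (modulo float rounding).
assert np.abs(restored - col).max() <= eps
```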
2.2 Columnar Compression

The advent of column-store DBMSs (e.g., C-Store [35], Vertica [27]) prompted the investigation of several compression techniques specialized for columnar storage formats, including dictionary encoding.
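As a concrete illustration of two of these columnar techniques, the sketch below dictionary-encodes a string column and then run-length-encodes the resulting integer codes; the helper names are hypothetical and not tied to any particular system. Dictionary encoding pays off when a column has few distinct values, and run-length encoding then exploits consecutive repeats of the same code.

```python
from itertools import groupby

def dictionary_encode(column):
    """Replace repeated values with small integer codes plus a lookup table."""
    mapping = {}
    codes = [mapping.setdefault(v, len(mapping)) for v in column]
    return codes, mapping

def run_length_encode(codes):
    """Collapse runs of identical codes into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]

city = ["Portland", "Portland", "Portland", "Seattle", "Seattle", "Portland"]
codes, mapping = dictionary_encode(city)   # codes == [0, 0, 0, 1, 1, 0]
runs = run_length_encode(codes)            # runs  == [(0, 3), (1, 2), (0, 1)]
```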