Understanding and Modeling Lossy Compression Schemes on HPC Scientific Data
Tao Lu1, Qing Liu1,2, Xubin He3, Huizhang Luo1, Eric Suchyta2, Jong Choi2, Norbert Podhorszki2, Scott Klasky2, Mathew Wolf2, Tong Liu3, and Zhenbo Qiao1
1 Department of Electrical and Computer Engineering, New Jersey Institute of Technology
2 Computer Science and Mathematics Division, Oak Ridge National Laboratory
3 Department of Computer and Information Sciences, Temple University

Abstract—Scientific simulations generate large amounts of floating-point data, which are often not very compressible using traditional reduction schemes such as deduplication or lossless compression. The emergence of lossy floating-point compression holds promise to satisfy the data reduction demand from HPC applications; however, lossy compression has not been widely adopted in science production. We believe a fundamental reason is that there is a lack of understanding of the benefits, pitfalls, and performance of lossy compression on scientific data. In this paper, we conduct a comprehensive study of state-of-the-art lossy compressors, including ZFP, SZ, and ISABELA, using real and representative HPC datasets. Our evaluation reveals the complex interplay between compressor design, data features, and compression performance. The impact of reduced accuracy on data analytics is also examined through a case study of fusion blob detection, offering domain scientists insights into what to expect from fidelity loss. Furthermore, the trial-and-error approach to understanding compression performance involves substantial compute and storage overhead. To this end, we propose a sampling-based estimation method that extrapolates the reduction ratio from data samples, to guide domain scientists in making more informed data reduction decisions.

I. INTRODUCTION

Cutting-edge computational research in various domains relies on high-performance computing (HPC) systems to accelerate the time to insights. Data generated during such simulations enable domain scientists to validate theories and investigate new microscopic phenomena at a scale that was not possible in the past. Because of the fidelity requirements in both spatial and temporal dimensions, analysis output produced by scientific simulations can easily reach terabytes or even petabytes per run [1]–[3], capturing the time evolution of physics phenomena on a fine spatiotemporal scale. The volume and velocity of data movement are imposing unprecedented pressure on storage and interconnects [4], [5], for both writing data to persistent storage and retrieving them for post-simulation analysis. As HPC storage infrastructure is being pushed to its scalability limits in terms of both throughput and capacity [6], the communities are striving to find new approaches to curbing the storage cost. Data reduction1, among others, is deemed a promising candidate, as it reduces the amount of data moved to storage systems.

1 The terms data reduction and compression are used interchangeably throughout this paper.

Data deduplication and lossless compression have been widely used in general-purpose systems to reduce redundancy in data. In particular, deduplication [7] eliminates redundant data at the file or chunk level, which can result in a high reduction ratio if there are a large number of identical chunks at the granularity of tens of kilobytes. For scientific data, this rarely occurs. It was reported that deduplication typically reduces dataset size by only 20% to 30% [8], which is far from useful in production. On the other hand, lossless compression in HPC was designed to reduce the storage footprint of applications, primarily for checkpoint/restart. Shannon entropy [9] provides a theoretical upper limit on data compressibility; simulation data often exhibit high entropy, and as a result, lossless compression usually achieves a modest reduction ratio of less than two [10]. With the growing disparity between compute and I/O, more aggressive data reduction schemes are needed to further reduce data by an order of magnitude or more [11], and very recently the focus has shifted towards lossy compression.
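To make the entropy argument concrete, the minimal sketch below (not part of the paper's artifacts; it assumes NumPy and uses synthetic data) estimates the byte-level Shannon entropy H = -sum(p_i * log2(p_i)) of an array and the corresponding upper bound of roughly 8/H on the reduction ratio of a byte-oriented lossless coder under an i.i.d. byte model.

import numpy as np

def byte_entropy(values: np.ndarray) -> float:
    """Shannon entropy (bits per byte) of the raw byte stream of an array."""
    raw = np.frombuffer(values.tobytes(), dtype=np.uint8)
    counts = np.bincount(raw, minlength=256)
    p = counts[counts > 0] / raw.size
    return float(-(p * np.log2(p)).sum())

# Hypothetical example: smooth data versus noisy simulation-like data.
smooth = np.sin(np.linspace(0, 4 * np.pi, 1_000_000))
noisy = smooth + np.random.normal(scale=1e-3, size=smooth.size)

for name, data in [("smooth", smooth), ("noisy", noisy)]:
    h = byte_entropy(data)
    # An entropy of h bits/byte bounds the lossless reduction ratio at about 8/h.
    print(f"{name}: entropy = {h:.2f} bits/byte, ratio bound ~ {8.0 / h:.2f}x")

Even the smooth array carries near-random mantissa bytes, which is why double-precision simulation output typically sits close to the 8 bits/byte ceiling and compresses losslessly by less than a factor of two.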
Lossy compression, such as JPEG [12], has long been used in computer graphics and digital images, leveraging the fact that the visual resolution of human eyes is well below machine precision. However, its application in the scientific domain is less well established. Since scientific data are primarily composed of high-dimensional floating-point values, lossy floating-point data compressors have begun to emerge, including ISABELA [13], ZFP [14], and SZ [15]. Although lossy reduction offers the most potential to mitigate the growing storage and I/O cost, there is a lack of understanding of how to effectively use lossy compression from a user perspective, e.g., which compressor should be used for a particular dataset, and what level of reduction ratio should be expected. To this end, this paper performs extensive evaluations of state-of-the-art lossy floating-point compressors, using real and representative HPC datasets across various scientific domains. We focus on addressing the following broad questions:

• Q1: What data features are indicative of compressibility? (Section III-A)
• Q2: How does the error bound influence the compression ratio? Which compressor (or technique) benefits the most from loosening the error bound? (Section III-B)
• Q3: How does the design of a compressor influence compression throughput? What is the relationship between compression ratio and throughput? (Section III-C)
• Q4: What is the impact of lossy compression on data fidelity and complex scientific data analytics? (Section III-D)
• Q5: How can we extract data features and accurately predict the compression ratios of various compressors? (Section IV)

Through answering these questions, we aim to help HPC end users understand what to expect from lossy compressors. As a completely unbiased third-party evaluation without ad-hoc performance tunings, we hope to shed light on the limitations of existing compressors and point out new R&D opportunities for compressor developers and the communities to make further optimizations, thus ensuring the broad adoption of reduction in science production. Our experiments for evaluating floating-point data compressors, including the scientific datasets and scripts we used for evaluation, are publicly available at https://github.com/taovcu/LossyCompressStudy.

II. BACKGROUND

Data deduplication and lossless compression fully maintain data fidelity and reduce data by identifying duplicate contents through hash signatures and eliminating redundant data. We utilize two lossless schemes throughout this work, GZIP [16] and FPC [17], for performance comparisons with lossy compression. The deflate algorithm implemented in GZIP is based on LZ77 [18], [19]. The core of the algorithm compares the symbols in the look-ahead buffer with symbols in the search buffer to determine a match. FPC employs fcm [20] and dfcm [21] to predict double-precision values. An XOR operation is conducted between the original value and the prediction. With high prediction accuracy, the XOR result is expected to contain many leading zero bits, which can be easily compressed. Ultimately, the effectiveness of these methods depends on the repetition of symbols in the data. However, for even slightly variant floating-point values, the binary representations contain few repeated symbols.
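The prediction-and-XOR idea behind FPC can be illustrated with the following sketch. It is not FPC's code: for brevity it substitutes a trivial previous-value predictor for the fcm/dfcm hash-based predictors [20], [21], but it shows how an accurate prediction turns a double into a residual with many leading zero bits, which a back-end coder can then store compactly.

import struct

def leading_zero_bits(x: int, width: int = 64) -> int:
    """Number of leading zero bits in a width-bit unsigned integer."""
    return width - x.bit_length()

def xor_residuals(values):
    """Predict each double by its predecessor, XOR the 64-bit patterns,
    and report the residual together with its leading-zero count."""
    prev_bits = 0
    out = []
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        residual = bits ^ prev_bits      # many leading zeros if the prediction is close
        out.append((residual, leading_zero_bits(residual)))
        prev_bits = bits                 # the last value becomes the next prediction
    return out

# Hypothetical example: slowly varying values predict well, an outlier does not.
for residual, lz in xor_residuals([1.0, 1.0000001, 1.0000002, 3.141592653589793]):
    print(f"residual = {residual:016x}, leading zeros = {lz}")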
ZFP [14] partitions the input array into small blocks and converts the floating-point values in each block into signed integers with a common exponent; a near-orthogonal block transform is then applied to the signed integers. This transform is carefully designed to mitigate the spatial correlation between data points, with the intent of generating near-zero coefficients that can be compressed efficiently. Finally, embedded coding [24] is used to encode the coefficients, producing a stream of bits that is roughly ordered by its impact on the error, and the stream can be truncated to satisfy any user-specified error bound.

Motivated by the reduction potential of spline functions [25], [26], ISABELA [13] uses B-spline based curve fitting to compress traditionally incompressible scientific data. Intuitively, fitting a monotonic curve yields a more accurate model than fitting random data. Based on this, ISABELA first sorts the data to convert a highly irregular sequence into a monotonic curve.

Similarly, SZ [15] employs multiple curve-fitting models to encode data streams, with the goal of accurately approximating the original data. SZ compression involves three main steps: array linearization, adaptive curve fitting, and compression of the unpredictable data. To reduce memory overhead, it uses the intrinsic memory layout of the original data to linearize a multi-dimensional array into a one-dimensional sequence. The best-fit routine employs three prediction models based on the adjacent data values in the sequence: constant, linear, and quadratic, which require one, two, and three precursor data points, respectively. The model that yields the closest approximation is adopted. If none satisfies the pre-defined error bound, SZ marks the data point as unpredictable, and it is then encoded by binary-representation analysis. The curve-fitting step transforms the fitted data into integer quantization factors, which are further encoded using a Huffman tree. Unlike ISABELA [13], SZ does not sort the original data, thereby avoiding the indexing overhead. The encoded data are further compressed using GZIP. (A sketch of this precursor-based prediction step is given at the end of this section.)

While lossy compression has been identified as a means to potentially reduce scientific data by more than 10x, determining the compressibility of data without compressing the full dataset remains challenging.
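To illustrate the precursor-based prediction step of SZ described above, the sketch below applies the constant, linear, and quadratic predictors to a linearized sequence and classifies each point against an absolute error bound. It is an illustration only, not SZ's implementation: the quantization, Huffman coding, and GZIP stages are omitted, and the example series and bound are hypothetical.

def sz_style_predict(seq, error_bound):
    """Classify each point as predictable (with its best predictor) or
    unpredictable, using the constant/linear/quadratic precursor models."""
    labels = []
    for i, x in enumerate(seq):
        candidates = []
        if i >= 1:                                   # constant: previous value
            candidates.append(("constant", seq[i - 1]))
        if i >= 2:                                   # linear extrapolation
            candidates.append(("linear", 2 * seq[i - 1] - seq[i - 2]))
        if i >= 3:                                   # quadratic extrapolation
            candidates.append(("quadratic",
                               3 * seq[i - 1] - 3 * seq[i - 2] + seq[i - 3]))
        best = min(candidates, key=lambda c: abs(c[1] - x), default=None)
        if best is not None and abs(best[1] - x) <= error_bound:
            labels.append((best[0], best[1]))        # SZ would store a quantization factor here
        else:
            labels.append(("unpredictable", x))      # SZ would fall back to binary-representation analysis
    return labels

# Hypothetical example with an absolute error bound of 0.05.
series = [0.0, 0.1, 0.2, 0.31, 0.90, 0.41]
for model, value in sz_style_predict(series, error_bound=0.05):
    print(model, round(value, 3))

The more points the predictors capture within the error bound, the more compressible the quantization-factor stream becomes, which is why smooth fields reduce far better under SZ than noisy ones.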