
Understanding and Modeling Lossy Compression Schemes on HPC Scientific Data

Tao Lu1, Qing Liu1,2, Xubin He3, Huizhang Luo1, Eric Suchyta2, Jong Choi2, Norbert Podhorszki2, Scott Klasky2, Mathew Wolf2, Tong Liu3, and Zhenbo Qiao1

1 Department of Electrical and Computer Engineering, New Jersey Institute of Technology
2 Computer Science and Mathematics Division, Oak Ridge National Laboratory
3 Department of Computer and Information Sciences, Temple University

Abstract—Scientific simulations generate large amounts of floating-point data, which are often not very compressible using the traditional reduction schemes, such as deduplication or lossless compression. The emergence of lossy floating-point compression holds promise to satisfy the data reduction demand from HPC applications; however, lossy compression has not been widely adopted in science production. We believe a fundamental reason is that there is a lack of understanding of the benefits, pitfalls, and performance of lossy compression on scientific data. In this paper, we conduct a comprehensive study on state-of-the-art lossy compression, including ZFP, SZ, and ISABELA, using real and representative HPC datasets. Our evaluation reveals the complex interplay between compressor design, data features and compression performance. The impact of reduced accuracy on data analytics is also examined through a case study of fusion blob detection, offering domain scientists insights into what to expect from fidelity loss. Furthermore, the trial-and-error approach to understanding compression performance involves substantial compute and storage overhead. To this end, we propose a sampling-based estimation method that extrapolates the reduction ratio from data samples, to guide domain scientists to make more informed data reduction decisions.

I. INTRODUCTION

Cutting-edge computational research in various domains relies on high-performance computing (HPC) systems to accelerate the time to insights. Data generated during such simulations enable domain scientists to validate theories and investigate new microscopic phenomena at a scale that was not possible in the past. Because of the fidelity requirements in both spatial and temporal dimensions, analysis output produced by scientific simulations can easily reach terabytes or even petabytes per run [1]–[3], capturing the time evolution of physics phenomena at a fine spatiotemporal scale. The volume and velocity of data movement are imposing unprecedented pressure on storage and interconnects [4], [5], for both writing data to persistent storage and retrieving them for post-simulation analysis. As HPC storage infrastructure is being pushed to its scalability limits in terms of both throughput and capacity [6], the communities are striving to find new approaches to curbing the storage cost. Data reduction¹, among others, is deemed to be a promising candidate by reducing the amount of data moved to storage systems.

¹The terms data reduction and compression are used interchangeably throughout this paper.

Data deduplication and lossless compression have been widely used in general-purpose systems to reduce redundancy in data. In particular, deduplication [7] eliminates redundant data at the file or chunk level, which can result in a high reduction ratio if there are a large number of identical chunks at the granularity of tens of kilobytes. For scientific data, this rarely occurs. It was reported that deduplication typically reduces dataset size by only 20% to 30% [8], which is far from being useful in production. On the other hand, lossless compression in HPC was designed to reduce the storage footprint of applications, primarily for checkpoint/restart. Shannon entropy [9] provides a theoretical upper limit on data compressibility; simulation data often exhibit high entropy, and as a result, lossless compression usually achieves a modest reduction ratio of less than two [10]. With the growing disparity between compute and I/O, more aggressive data reduction schemes are needed to further reduce data by an order of magnitude or more [11], and very recently the focus has shifted towards lossy compression.

Lossy compression, such as JPEG [12], has long been used in computer graphics and digital images, leveraging the fact that the visual resolution of human eyes is well below machine precision. However, its application in the scientific domain is less well established. Since scientific data are primarily composed of high-dimensional floating-point values, lossy floating-point data compressors have begun to emerge, including ISABELA [13], ZFP [14], and SZ [15].

Although lossy reduction offers the most potential to mitigate the growing storage and I/O cost, there is a lack of understanding of how to effectively use lossy compression from a user perspective, e.g., which compressor should be used for a particular dataset, and what level of reduction ratio should be expected. To this end, this paper aims to perform extensive evaluations of state-of-the-art lossy floating-point compressors, using real and representative HPC datasets across various scientific domains. We focus on addressing the following broad questions:

• Q1: What data features are indicative of compressibility? (Section III-A)
• Q2: How does the error bound influence the compression ratio? Which compressor (or technique) can benefit the most from loosening the error bound? (Section III-B)
• Q3: How does the design of a compressor influence compression throughput? What is the relationship between compression ratio and throughput? (Section III-C)
• Q4: What is the impact of lossy compression on data fidelity and complex scientific data analytics? (Section III-D)
• Q5: How to extract data features and accurately predict the compression ratios of various compressors? (Section IV)

Through answering these questions, we aim at helping HPC end users understand what to expect from lossy compressors. As a completely unbiased third-party evaluation without ad-hoc performance tunings, we hope to shed light on the limitations of existing compressors, and point out some of the new R&D opportunities for compressor developers and the communities to make further optimizations, thus ensuring the broad adoption of reduction in science production. Our experiments for evaluating floating-point data compressors, including the scientific datasets and scripts we used for evaluation, are publicly available at https://github.com/taovcu/LossyCompressStudy.

II. BACKGROUND

Data deduplication and lossless compression fully maintain data fidelity and reduce data by identifying duplicate contents through hash signatures and eliminating redundant data. We utilize two lossless schemes throughout this work, GZIP [16] and FPC [17], for performance comparisons with lossy compression. The deflate algorithm implemented in GZIP is based on LZ77 [18], [19]. The core of the algorithm is comparing the symbols in the look-ahead buffer with symbols in the search buffer to determine a match. FPC employs fcm [20] and dfcm [21] to predict the double-precision values. An XOR operation is conducted between the original value and the prediction. With a high prediction accuracy, the XOR result is expected to contain many leading zero bits, which can be easily compressed. Ultimately, the effectiveness of these methods depends on the repetition of symbols in data. However, for even slightly variant floating-point values, the binary representations contain few identical symbols. Hence, scientific simulations use lossless compression only if necessary, e.g., for checkpoint/restart [22].
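To make the predict-and-XOR step concrete, here is a minimal sketch (our own illustration, not FPC code: a naive last-value predictor stands in for FPC's fcm/dfcm hash-based predictors, and the data series is hypothetical). When the prediction is close to the actual double, the XOR of their bit patterns starts with a long run of zero bits that a lossless coder can store compactly.

```python
import struct

def to_bits(x: float) -> int:
    """Reinterpret a double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def leading_zeros64(v: int) -> int:
    """Count leading zero bits of a 64-bit value."""
    return 64 - v.bit_length()

# Hypothetical smooth series where a last-value predictor is accurate.
series = [1.000000, 1.000012, 1.000025, 1.000037, 1.000049]

prediction = series[0]
for actual in series[1:]:
    residual = to_bits(actual) ^ to_bits(prediction)  # XOR of prediction and actual
    print(f"actual={actual:.6f}  leading zero bits={leading_zeros64(residual)}")
    prediction = actual  # FPC instead predicts with its fcm/dfcm value predictors
```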
The general acceptance of precision loss provides an opportunity to drastically improve the compression ratio: if two symbols are within the error tolerance, they can be represented using the same code. This paper includes studies using three lossy compressors: ISABELA, SZ, and ZFP, which were shown to be superior in prior work [15]. Each compressor is briefly described as follows.

Motivated by fixed-rate encoding and random access, ZFP [23] follows the classic transform-coding scheme for image data. Working in blocks of 4^d values (where d is the number of dimensions), ZFP first aligns the floating-point data points within a block to a common exponent, which is determined by the largest absolute value. The original data in the block are then converted to mantissas along with the common exponent. Second, the exponent is encoded and stored. The mantissas are then converted to fixed-point signed integers. Third, a reversible orthogonal block transform (e.g., discrete cosine transform) is applied to the signed integers. This transform is carefully designed to mitigate the spatial correlation between data points, with the intent of generating near-zero coefficients that can be compressed efficiently. Finally, embedded coding [24] is used to encode the coefficients, producing a stream of bits roughly ordered by their impact on error, and the stream can be truncated to satisfy any user-specified error bound.
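A minimal sketch of the first two steps on a single 4-value (4^1) block appears below, under our own simplifying assumptions: the block transform, embedded coding, and the exact scaling rules of the real ZFP codec are omitted, and the precision and function names are ours.

```python
import math

def align_block(block, precision_bits=30):
    """Align a block to the common exponent of its largest-magnitude value and
    convert the values to fixed-point signed integers (simplified sketch)."""
    if all(v == 0.0 for v in block):
        return 0, [0] * len(block)
    emax = max(math.frexp(v)[1] for v in block if v != 0.0)  # common exponent
    scale = 2.0 ** (precision_bits - emax)
    return emax, [int(round(v * scale)) for v in block]      # fixed-point mantissas

def restore_block(emax, ints, precision_bits=30):
    """Invert the alignment step (the only loss is the fixed-point rounding)."""
    scale = 2.0 ** (precision_bits - emax)
    return [i / scale for i in ints]

block = [0.1250, 0.1251, 0.1248, 0.1302]   # one 4-value block of a 1-D array
emax, ints = align_block(block)
print(emax, ints)                          # near-equal integers, ripe for a
print(restore_block(emax, ints))           # decorrelating block transform
```

After alignment, the integers of a smooth block differ only slightly, which is exactly the redundancy the decorrelating transform and embedded coding exploit.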
Motivated by the reduction potential of spline functions [25], [26], ISABELA [13] applies B-spline based curve-fitting to the traditionally incompressible scientific data. Intuitively, fitting a monotonic curve can provide a model that is more accurate than fitting random data. Based on this, ISABELA first sorts the data to convert highly irregular data into a monotonic curve.
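The sort-then-fit idea can be sketched as follows (a simplification under our own assumptions, not the ISABELA implementation: window size, knot selection, and the encoding of the sort permutation and per-point errors are glossed over).

```python
import numpy as np
from scipy.interpolate import splev, splrep

rng = np.random.default_rng(0)
window = rng.normal(size=1024)        # one window of irregular, hard-to-fit values

order = np.argsort(window)            # permutation that must be stored for decoding
sorted_vals = window[order]           # monotonic curve: far easier to approximate

x = np.linspace(0.0, 1.0, sorted_vals.size)
tck = splrep(x, sorted_vals, s=1e-3)  # cubic B-spline fit with light smoothing
approx_sorted = splev(x, tck)

approx = np.empty_like(window)        # undo the sort to recover the original order
approx[order] = approx_sorted
print("max abs error:", float(np.max(np.abs(window - approx))))
```

The price of the easier fit is that the permutation index must be kept for decompression, which is the indexing overhead that SZ, described next, avoids by not sorting.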
Similarly, SZ [15] employs multiple curve-fitting models to encode data streams, with the goal of accurately approximating the original data. SZ compression involves three main steps: array linearization, adaptive curve-fitting, and compressing the unpredictable data. To reduce memory overhead, it uses the intrinsic memory sequence of the original data to linearize a multi-dimensional array into a one-dimensional sequence. The best-fit routine employs three prediction models based on the adjacent data values in the sequence: constant, linear, and quadratic, which require one, two, and three precursor data points, respectively. The model that yields the closest approximation is adopted. If none satisfies the pre-defined error bound, SZ marks the data point as unpredictable, and it is then encoded by binary-representation analysis. The curve-fitting step transforms the fitted data into integer quantization factors, which are further encoded using a Huffman tree. Unlike ISABELA [13], SZ does not sort the original data, to avoid the indexing overhead. The encoded data are further compressed using GZIP.
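The adaptive prediction and quantization loop can be sketched in a few lines (our simplification, not the SZ code: only a one-dimensional stream is handled, the cap on the quantization range is an assumed parameter, and the Huffman and GZIP stages are replaced by simply collecting the factors).

```python
def sz_like_encode(data, err_bound, max_quant=2**15):
    """Predict each point from already-decoded predecessors (constant, linear,
    quadratic), quantize the best residual, and mark curve-missed points."""
    quant_factors, outliers, decoded = [], [], []
    for i, v in enumerate(data):
        preds = []
        if i >= 1:
            preds.append(decoded[i - 1])                                    # constant
        if i >= 2:
            preds.append(2 * decoded[i - 1] - decoded[i - 2])               # linear
        if i >= 3:
            preds.append(3 * decoded[i - 1] - 3 * decoded[i - 2] + decoded[i - 3])  # quadratic
        best = min(preds, key=lambda p: abs(v - p), default=None)
        if best is not None and abs(v - best) <= max_quant * 2 * err_bound:
            q = round((v - best) / (2 * err_bound))      # integer quantization factor
            quant_factors.append(q)
            decoded.append(best + q * 2 * err_bound)     # error stays within err_bound
        else:
            quant_factors.append(None)                   # curve-missed (unpredictable)
            outliers.append(v)
            decoded.append(v)
    return quant_factors, outliers

factors, outliers = sz_like_encode([0.10, 0.11, 0.13, 0.16, 5.0, 5.02], err_bound=0.01)
print(factors, outliers)
```

Note that the prediction itself looks only at a bounded neighborhood of preceding points, whereas the Huffman coding of the collected factors is global; this distinction matters for the sampling analysis in Section IV.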
While lossy compression has been identified as a means to potentially reduce scientific data by more than 10x, determining the compressibility of data without compressing the full data, and the impact of information loss on data analytics, have not been fully studied. Although trial and error can certainly answer these questions, it incurs overhead in terms of compute and storage, and should be avoided as much as possible. Our proposed evaluation and modeling aim to fill these gaps in data reduction, and allow users to understand the outcome before they perform reduction.

III. EVALUATION

We evaluate the compression throughput and compression ratio of various compressors on a SUSE Linux Enterprise Server 11 (x86_64) with a 32-core AMD Opteron(tm) 6410 processor and 256GB DRAM. Our measurements of compression throughput do not include the time spent on disk I/O, since the goal is to evaluate the compression algorithms instead of system performance as a whole. Our evaluations focus on the following metrics: (1) Error bound: it limits the accuracy loss during compression. An error bound can be enforced as an absolute value, a relative value, or both. Assuming the value of a data point is denoted as V, a point-wise absolute error bound constrains the absolute difference between V and its decompressed counterpart.
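Written out (with V' denoting the decompressed value; the symbols are ours), the two point-wise error-bound modes are commonly expressed as

|V - V'| \le \epsilon_{abs} \qquad \text{and} \qquad \frac{|V - V'|}{|V|} \le \epsilon_{rel},

where some compressors instead define the relative bound with respect to the value range of the dataset rather than the individual point value.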
TABLE II: A list of symbols for SZ compression

Symbol        Description
PointCount    Number of points in a dataset
QuantIntv     Quantization interval
NodeCount     Number of Huffman tree nodes
HitRatio      Curve-fitting hit ratio
TreeSize      Size of the Huffman tree
EncodeSize    Total size of the encoded curve-fitted points
OutlierCount  Number of curve-missed points
OutlierSize   Total size of curve-missed points
TotalSize     Size of the compressed data
CR            Compression ratio

Statement 1: The compression ratio of sample data is an unbiased estimation of the compression ratio of full data, using a block based random sampling method.

Proof. The compression ratio of sample data is said to be an unbiased estimation of the full dataset when the compression ratio of sample data is expected to equal that of the full dataset. As mentioned, ZFP compresses floating-point data in blocks. Assuming the full and sample datasets contain f and s blocks, BlockSize is the original size of a block, and CR_{Full\_block_i} and CR_{Sample\_block_j} indicate the compression ratios of the i-th and j-th blocks in the full and sample datasets, respectively. After reduction, the size of block i is RS_{Full\_block_i}. For the full dataset, the reciprocal of the compression ratio is:

\frac{1}{CR_{Full}} = \frac{\sum_{i=1}^{f} RS_{Full\_block_i}}{f \cdot BlockSize} = \frac{1}{f}\sum_{i=1}^{f}\frac{1}{CR_{Full\_block_i}}

Similarly, for the sample data,

\frac{1}{CR_{Sample}} = \frac{1}{s}\sum_{j=1}^{s}\frac{1}{CR_{Sample\_block_j}}

As a result of random sampling, for any block_j (1 ≤ j ≤ s) in the sample, the probability that this block is block_i (1 ≤ i ≤ f) in the full dataset is 1/f. For any j (1 ≤ j ≤ s), the expected reciprocal of the compression ratio of block_j in the sample can be calculated as:

E\left[\frac{1}{CR_{Sample\_block_j}}\right] = E\left[\frac{1}{f}\sum_{i=1}^{f}\frac{1}{CR_{Full\_block_i}}\right]

The overall expected reciprocal of the compression ratio of the sample can be calculated as:

E\left[\frac{1}{CR_{Sample}}\right] = \frac{1}{s} E\left[\sum_{j=1}^{s}\frac{1}{CR_{Sample\_block_j}}\right]
= \frac{1}{s}\sum_{j=1}^{s} E\left[\frac{1}{f}\sum_{i=1}^{f}\frac{1}{CR_{Full\_block_i}}\right]
= \frac{1}{f} E\left[\sum_{i=1}^{f}\frac{1}{CR_{Full\_block_i}}\right] = E\left[\frac{1}{CR_{Full}}\right]

Therefore, the expected reciprocal of CR_Sample equals that of CR_Full, and the compression ratio of the sample dataset, using random block based sampling, provides an unbiased estimation for the full dataset.
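The argument can be checked with a small simulation over hypothetical per-block compression ratios (the numbers below are synthetic and not tied to any dataset in this study): averaging the per-block reciprocals of a random block sample matches the full-data reciprocal in expectation.

```python
import random

random.seed(1)

# Hypothetical per-block compression ratios for a "full" dataset of f blocks.
f = 10_000
full_block_cr = [random.uniform(1.5, 8.0) for _ in range(f)]

def reciprocal_cr(block_crs):
    """1/CR of a dataset of equal-sized blocks = mean of per-block 1/CR."""
    return sum(1.0 / cr for cr in block_crs) / len(block_crs)

true_recip = reciprocal_cr(full_block_cr)

# Average the estimator over many random block samples of s blocks each.
s, trials = 100, 2000
est = sum(reciprocal_cr(random.sample(full_block_cr, s)) for _ in range(trials)) / trials

print(f"full 1/CR = {true_recip:.4f}, mean sampled 1/CR = {est:.4f}")
# Note: 1/CR is estimated without bias; CR itself, a non-linear function of 1/CR,
# generally is not, which is the point made next via Jensen's inequality.
```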
Statement 2: For SZ compression, no sampling method can guarantee an unbiased estimation of the full dataset.

Proof. SZ adopts a Huffman tree to encode the quantization factors of the curve-fitted points, with the goal of further reducing the storage footprint. The compression ratio of sample data point j (3 ≤ j ≤ s) is influenced by points j − 2 and j − 1, and by the associated Huffman tree constructed from the sample data. Similarly, the compression of full data point i (3 ≤ i ≤ f) is influenced by points i − 2 and i − 1, and by a different Huffman tree constructed from the full data. For any given point j in the sample, even if the neighboring points are accordingly sampled to maintain the bounded locality, the two Huffman trees are likely different. Therefore, there does not exist a one-to-one or linear relation between CR_Sample_point_j and CR_Full. By Jensen's inequality [30], for a non-linear function f and a mean-unbiased estimator U of a parameter p, the composite estimator f(U) is not a mean-unbiased estimator of f(p). That is, mean-unbiasedness is not preserved under non-linear transformations.

As confirmed in Figure 8(b), using the compression ratio of samples to estimate that of the full dataset is inaccurate for SZ. The average estimation error is at least 42%. Therefore, more advanced models are needed to accurately estimate the SZ compression ratio.

Finding 7: For compression schemes that have bounded locality, a sampling based approach can provide an unbiased estimation of the full data performance. Without bounded locality, the compression ratio of sample data may deviate significantly from that of the full data.

C. Advanced Models for Estimating SZ Compression Ratio

In SZ compression, a curve-fitted data point is mapped to a quantization factor, observing a pre-defined error bound, and a curve-missed point is encoded by binary-representation analysis. Quantization factors may occur with different probabilities and thus can be encoded using a Huffman tree for storage efficiency. Therefore, the compressed data of SZ consists of three parts: the Huffman tree, the encoding of curve-fitted points, and the encoding of curve-missed points.

The binary-representation analysis based encoding for curve-missed points is fairly straightforward to estimate: since our tests show the encoding results in approximately equal size per point for the sample and full data (Figure 9(a)), our model assumes the encoding size is proportional to the number of data points. However, as Figure 9 shows, the storage cost per Huffman tree node, and the code length per curve-fitted data point, vary considerably between the sample and full data. Moreover, we notice that the majority of datasets result in a high curve-fitting ratio (Figure 4), and in this case the Huffman tree and the encoding of curve-fitted points will account for most of the space after compression.

To predict the storage cost of the Huffman tree and the corresponding Huffman coding, the key is to accurately estimate the number of tree nodes, which determines the tree size and height, which in turn directly impact the average code length. We identify the opportunity to predict the node count of the Huffman tree by observing the distribution of quantization factors. Figure 10 demonstrates the distribution when compressing the sample and full data, respectively. It is evident that the distributions of the sample and full data are highly similar, both approximately following a Gaussian distribution. Given this important observation, this paper employs the Gaussian distribution of the quantization factors observed in the sample to estimate the Huffman coding cost of the full data (GaussModel).
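The sketch below shows one way such an estimate could be computed (this is our reconstruction of the idea rather than the paper's GaussModel: the fitting procedure, the per-node storage cost, and the use of the fitted distribution's entropy as the average code length are all assumptions). A Gaussian is fitted to the quantization factors of the sample, the number of integer bins expected to be populated in the full data approximates the Huffman leaf count, and the entropy of the fitted distribution approximates the code length per curve-fitted point.

```python
import numpy as np

def estimate_sz_encoding_cost(sample_factors, full_point_count, bytes_per_node=8):
    """Rough estimate of Huffman tree size and total code size for the full data,
    from the quantization factors observed in a sample (sketch only)."""
    mu, sigma = np.mean(sample_factors), np.std(sample_factors)

    # Expected number of distinct quantization factors ~ Huffman leaf count:
    # count integer bins whose Gaussian probability mass yields >= 1 expected point.
    lo, hi = int(mu - 8 * sigma), int(mu + 8 * sigma)
    bins = np.arange(lo, hi + 1)
    pmf = np.exp(-0.5 * ((bins - mu) / sigma) ** 2)
    pmf /= pmf.sum()
    leaf_count = int(np.sum(pmf * full_point_count >= 1.0))
    node_count = 2 * leaf_count - 1                  # full binary Huffman tree

    # Entropy of the fitted distribution ~ average code length in bits per point
    # (an approximation we assume, not an exact Huffman code length).
    nz = pmf[pmf > 0]
    avg_code_bits = float(-(nz * np.log2(nz)).sum())

    tree_bytes = node_count * bytes_per_node         # bytes_per_node is assumed
    code_bytes = full_point_count * avg_code_bits / 8.0
    return tree_bytes, code_bytes

sample = np.random.default_rng(0).normal(1024, 6, size=50_000).round().astype(int)
print(estimate_sz_encoding_cost(sample, full_point_count=5_000_000))
```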
The estimation error of GaussModel is much lower than that of the naive sampling based approach. Specifically, GaussModel has an average estimation error of 29%, while the naive sample based estimation has an average error of 73%. We comment that using samples to extrapolate performance inevitably introduces errors, due to the huge information loss in sampling. However, the results show that our model, combined with sampling, can dramatically reduce the estimation error.

V. CONCLUSION AND FUTURE WORK

In this paper, we conduct comprehensive evaluations of lossy compression schemes on HPC scientific datasets. We expose how lossy compression schemes work, what data features are indicators of compressibility, the relationship between the compression ratio and the error bound, how the compressibility influences the compression throughput, as well as the impact of lossy compression on data fidelity. We also present how to properly sample datasets for the purpose of making compression ratio estimations. We prove that random block-based sampling ensures an unbiased compression ratio estimation for ZFP, and the estimation accuracy can be up to 99%. The proposed GaussModel dramatically increases the SZ estimation accuracy. This work can help HPC users understand the outcome of lossy compression, which is crucial for the broad adoption of lossy compression in HPC production environments. The new understanding of the relationship between data features and compressibility, as well as accurate compression ratio estimation models, are essential to enable the online "to compress or not" decision and the compressor selection. We plan to implement and integrate a decision algorithm into ADIOS (the Adaptable I/O System) [31], so that scientists can simply describe their requirements and transparently use the best compression scheme.

ACKNOWLEDGMENT

The work is partially sponsored by U.S. National Science Foundation grants CCF-1718297, CCF-1717660, CNS-1702474, and the DOE SIRIUS project.
REFERENCES

[1] T. Lu et al., "Canopus: A paradigm shift towards elastic extreme-scale data analytics on HPC storage," in 2017 IEEE International Conference on Cluster Computing (CLUSTER), Sept 2017, pp. 58–69.
[2] D. Tao, S. Di, Z. Chen, and F. Cappello, "Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.
[3] W. Austin, G. Ballard, and T. G. Kolda, "Parallel tensor compression for large-scale scientific data," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016, pp. 912–922.
[4] D. Ghoshal and L. Ramakrishnan, "MaDaTS: Managing data on tiered storage for scientific workflows," in Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC). ACM, 2017, pp. 41–52.
[5] B. Dong, K. Wu, S. Byna, J. Liu, W. Zhao, and F. Rusu, "ArrayUDF: User-defined scientific data analysis on arrays," in Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC). ACM, 2017.
[6] ASCAC Subcommittee, "Top ten exascale research challenges," 2014. [Online]. Available: https://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf
[7] W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, M. Fu, Y. Zhang, and Y. Zhou, "A comprehensive study of the past, present, and future of data deduplication," Proceedings of the IEEE, 2016.
[8] D. Meister, J. Kaiser, A. Brinkmann, T. Cortes, M. Kuhn, and J. Kunkel, "A study on data deduplication in HPC storage systems," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012, pp. 1–11.
[9] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, July 1948.
[10] M. Burtscher and P. Ratanaworabhan, "FPC: A high-speed compressor for double-precision floating-point data," IEEE Transactions on Computers, vol. 58, no. 1, pp. 18–31, 2009.
[11] I. Foster et al., "Computing just what you need: Online data analysis and reduction at extreme scales," in European Conference on Parallel Processing (Euro-Par'17), Santiago de Compostela, Spain, 2017.
[12] G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[13] S. Lakshminarasimhan, N. Shah, S. Ethier, S. Klasky, R. Latham, R. Ross, and N. Samatova, "Compressing the incompressible with ISABELA: In-situ reduction of spatio-temporal data," Euro-Par 2011 Parallel Processing, pp. 366–379, 2011.
[14] P. Lindstrom, "Fixed-rate compressed floating-point arrays," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2674–2683, 2014.
[15] S. Di and F. Cappello, "Fast error-bounded lossy HPC data compression with SZ," in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2016, pp. 730–739.
[16] J.-L. Gailly, "gzip: The data compression program," 2016. [Online]. Available: https://www.gnu.org/software/gzip/manual/gzip.pdf
[17] M. Burtscher and P. Ratanaworabhan, "FPC: A high-speed compressor for double-precision floating-point data," IEEE Transactions on Computers, vol. 58, no. 1, pp. 18–31, Jan 2009.
[18] A. Lempel and J. Ziv, "On the complexity of finite sequences," IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 75–81, 1976.
[19] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.
[20] Y. Sazeides and J. E. Smith, "The predictability of data values," in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 1997, pp. 248–258.
[21] B. Goeman, H. Vandierendonck, and K. De Bosschere, "Differential FCM: Increasing value prediction accuracy by improving table usage efficiency," in Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2001, pp. 207–216.
[22] N. Sasaki et al., "Exploration of lossy compression for application-level checkpoint/restart," in 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2015, pp. 914–922.
[23] P. Lindstrom, "Fixed-rate compressed floating-point arrays," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2674–2683, Dec 2014.
[24] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3445–3462, 1993.
[25] S. Wold, "Spline functions in data analysis," Technometrics, vol. 16, no. 1, pp. 1–11, 1974.
[26] X. He and P. Shi, "Monotone B-spline smoothing," Journal of the American Statistical Association, vol. 93, no. 442, pp. 643–650, 1998.
[27] D. Harnik, R. I. Kat, O. Margalit, D. Sotnikov, and A. Traeger, "To zip or not to zip: Effective resource usage for real-time compression," in FAST, 2013, pp. 229–242.
[28] J. Zhou, Y. Li, V. K. Adhikari, and Z.-L. Zhang, "Counting YouTube videos via random prefix sampling," in Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (IMC '11). New York, NY, USA: ACM, 2011, pp. 371–380.
[29] O. Rana et al., "PaDDMAS: Parallel and distributed data mining application suite," in Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS 2000), 2000, pp. 387–392.
[30] J. L. W. V. Jensen, "Sur les fonctions convexes et les inégalités entre les valeurs moyennes," Acta Mathematica, vol. 30, no. 1, pp. 175–193, 1906.
[31] Q. Liu et al., "Hello ADIOS: The challenges and lessons of developing leadership class I/O frameworks," Concurrency and Computation: Practice and Experience, vol. 26, no. 7, pp. 1453–1473, May 2014.