Understanding and Modeling Lossy Compression Schemes on HPC Scientific Data


Tao Lu1, Qing Liu1,2, Xubin He3, Huizhang Luo1, Eric Suchyta2, Jong Choi2, Norbert Podhorszki2, Scott Klasky2, Mathew Wolf2, Tong Liu3, and Zhenbo Qiao1

1 Department of Electrical and Computer Engineering, New Jersey Institute of Technology
2 Computer Science and Mathematics Division, Oak Ridge National Laboratory
3 Department of Computer and Information Sciences, Temple University

Abstract—Scientific simulations generate large amounts of floating-point data, which are often not very compressible using traditional reduction schemes such as deduplication or lossless compression. The emergence of lossy floating-point compression holds promise to satisfy the data reduction demand from HPC applications; however, lossy compression has not been widely adopted in science production. We believe a fundamental reason is a lack of understanding of the benefits, pitfalls, and performance of lossy compression on scientific data. In this paper, we conduct a comprehensive study of state-of-the-art lossy compressors, including ZFP, SZ, and ISABELA, using real and representative HPC datasets. Our evaluation reveals the complex interplay between compressor design, data features, and compression performance. The impact of reduced accuracy on data analytics is also examined through a case study of fusion blob detection, offering domain scientists insights into what to expect from fidelity loss. Furthermore, the trial-and-error approach to understanding compression performance involves substantial compute and storage overhead. To this end, we propose a sampling-based estimation method that extrapolates the reduction ratio from data samples, to guide domain scientists toward more informed data reduction decisions.

I. INTRODUCTION

Cutting-edge computational research in various domains relies on high-performance computing (HPC) systems to accelerate the time to insight. Data generated during such simulations enable domain scientists to validate theories and investigate new microscopic phenomena at a scale that was not possible in the past. Because of the fidelity requirements in both the spatial and temporal dimensions, analysis output produced by scientific simulations can easily reach terabytes or even petabytes per run [1]–[3], capturing the time evolution of physics phenomena at a fine spatiotemporal scale. The volume and velocity of data movement are imposing unprecedented pressure on storage and interconnects [4], [5], both for writing data to persistent storage and for retrieving them for post-simulation analysis. As HPC storage infrastructure is being pushed to its scalability limits in terms of both throughput and capacity [6], the communities are striving to find new approaches to curbing the storage cost. Data reduction¹, among others, is deemed a promising candidate because it reduces the amount of data moved to storage systems.

¹The terms data reduction and compression are used interchangeably throughout this paper.

Data deduplication and lossless compression have been widely used in general-purpose systems to reduce redundancy in data. In particular, deduplication [7] eliminates redundant data at the file or chunk level, which can result in a high reduction ratio if there are a large number of identical chunks at the granularity of tens of kilobytes. For scientific data, this rarely occurs: it was reported that deduplication typically reduces dataset size by only 20% to 30% [8], which is far from useful in production. On the other hand, lossless compression in HPC was designed to reduce the storage footprint of applications, primarily for checkpoint/restart. Shannon entropy [9] provides a theoretical upper limit on data compressibility; simulation data often exhibit high entropy, and as a result, lossless compression usually achieves a modest reduction ratio of less than two [10]. With the growing disparity between compute and I/O, more aggressive data reduction schemes are needed to further reduce data by an order of magnitude or more [11], and very recently the focus has shifted towards lossy compression.

Lossy compression, such as JPEG [12], has long been used in computer graphics and digital imaging, leveraging the fact that the visual resolution of human eyes is well below machine precision. However, its application in the scientific domain is less well established. Since scientific data are primarily composed of high-dimensional floating-point values, lossy floating-point data compressors have begun to emerge, including ISABELA [13], ZFP [14], and SZ [15]. Although lossy reduction offers the most potential to mitigate the growing storage and I/O cost, there is a lack of understanding of how to use lossy compression effectively from a user perspective, e.g., which compressor should be used for a particular dataset, and what level of reduction ratio should be expected. To this end, this paper performs extensive evaluations of state-of-the-art lossy floating-point compressors, using real and representative HPC datasets across various scientific domains. We focus on addressing the following broad questions:

• Q1: What data features are indicative of compressibility? (Section III-A)
• Q2: How does the error bound influence the compression ratio? Which compressor (or technique) benefits the most from loosening the error bound? (Section III-B)
• Q3: How does the design of a compressor influence compression throughput? What is the relationship between compression ratio and throughput? (Section III-C)
• Q4: What is the impact of lossy compression on data fidelity and complex scientific data analytics? (Section III-D)
• Q5: How can we extract data features and accurately predict the compression ratios of various compressors? (Section IV)

Through answering these questions, we aim to help HPC end users understand what to expect from lossy compressors. As a completely unbiased third-party evaluation without ad-hoc performance tunings, we hope to shed light on the limitations of existing compressors and point out new R&D opportunities for compressor developers and the communities to make further optimizations, thus ensuring the broad adoption of reduction in science production. Our experiments for evaluating floating-point data compressors, including the scientific datasets and scripts we used for evaluation, are publicly available at https://github.com/taovcu/LossyCompressStudy.
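Q1 above asks which data features indicate compressibility. As a rough, hedged illustration of the Shannon-entropy bound mentioned in the introduction (a naive byte-level estimate, not the feature analysis of Section III), the sketch below computes the entropy of a floating-point field and the implied upper bound on its lossless reduction ratio; `field` is a placeholder for any array loaded with numpy.

```python
import numpy as np

def entropy_bound(field: np.ndarray) -> float:
    """Estimate byte-level Shannon entropy (bits/byte) of an array and
    return the implied upper bound on the lossless reduction ratio."""
    raw = field.astype(np.float64).tobytes()
    counts = np.bincount(np.frombuffer(raw, dtype=np.uint8), minlength=256)
    p = counts[counts > 0] / len(raw)
    bits_per_byte = max(-np.sum(p * np.log2(p)), 1e-9)  # Shannon entropy H
    return 8.0 / bits_per_byte                          # best-case reduction ratio

# Synthetic example; real simulation fields typically yield a bound
# well below 2, consistent with the observation in the introduction.
field = np.random.rand(1_000_000)
print(f"entropy-based upper bound on reduction ratio: {entropy_bound(field):.2f}")
```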
II. BACKGROUND

Data deduplication and lossless compression fully maintain data fidelity; they reduce data by identifying duplicate content through hash signatures and eliminating the redundancy. We utilize two lossless schemes throughout this work, GZIP [16] and FPC [17], for performance comparisons with lossy compression. The deflate algorithm implemented in GZIP is based on LZ77 [18], [19]; the core of the algorithm is comparing the symbols in the look-ahead buffer with the symbols in the search buffer to determine a match. FPC employs fcm [20] and dfcm [21] predictors to predict double-precision values. An XOR operation is conducted between the original value and the prediction; with a high prediction accuracy, the XOR result is expected to contain many leading zero bits, which can be easily compressed. Ultimately, the effectiveness of these methods depends on the repetition of symbols in the data. However, for even slightly variant floating-point values, the binary representations contain few repeated bit patterns, which severely limits the achievable reduction.
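To make the FPC idea above concrete, here is a minimal, hedged sketch (not the FPC implementation, and using a trivial last-value predictor as a stand-in for fcm/dfcm) that XORs each double against its prediction and counts the leading zero bits; long runs of leading zeros are what make the residual stream easy to compress.

```python
import struct
import numpy as np

def leading_zero_bits(x: int, width: int = 64) -> int:
    """Number of leading zero bits in a 64-bit unsigned integer."""
    return width - x.bit_length()

def xor_residuals(values: np.ndarray) -> list:
    """XOR each double with a last-value prediction (stand-in for fcm/dfcm)."""
    residuals, prediction = [], 0.0
    for v in values.astype(np.float64):
        a, = struct.unpack("<Q", struct.pack("<d", v))
        b, = struct.unpack("<Q", struct.pack("<d", prediction))
        residuals.append(a ^ b)
        prediction = v                     # trivial predictor: previous value
    return residuals

values = np.cumsum(np.random.rand(8))      # smooth, slowly varying series
for r in xor_residuals(values):
    print(f"{leading_zero_bits(r):2d} leading zeros -> {r:016x}")
```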
ZFP [14] partitions an array into small blocks and, within each block, aligns the values to a common exponent and converts them to fixed-point signed integers; a reversible, nearly orthogonal block transform (similar to a discrete cosine transform) is applied to the signed integers. This transform is carefully designed to mitigate the spatial correlation between data points, with the intent of generating near-zero coefficients that can be compressed efficiently. Finally, embedded coding [24] is used to encode the coefficients, producing a stream of bits that is roughly ordered by its impact on the error, and the stream can be truncated to satisfy any user-specified error bound.

Motivated by the reduction potential of spline functions [25], [26], ISABELA [13] uses B-spline based curve fitting to compress traditionally incompressible scientific data. Intuitively, fitting a monotonic curve provides a more accurate model than fitting random data. Based on this, ISABELA first sorts the data to convert a highly irregular sequence into a monotonic curve.

Similarly, SZ [15] employs multiple curve-fitting models to encode data streams, with the goal of accurately approximating the original data. SZ compression involves three main steps: array linearization, adaptive curve fitting, and compression of the unpredictable data. To reduce memory overhead, it uses the intrinsic memory order of the original data to linearize a multi-dimensional array into a one-dimensional sequence. The best-fit routine employs three prediction models based on the adjacent data values in the sequence: constant, linear, and quadratic, which require one, two, and three preceding data points, respectively. The model that yields the closest approximation is adopted; if none satisfies the pre-defined error bound, SZ marks the data point as unpredictable, and it is then encoded by binary-representation analysis. The curve-fitting step transforms the fitted data into integer quantization factors, which are further encoded using a Huffman tree. Unlike ISABELA [13], SZ does not sort the original data, thereby avoiding the indexing overhead. The encoded data are further compressed using GZIP.

While lossy compression has been identified as a means to potentially reduce scientific data by more than 10x, determining the compressibility of data without compressing the data in full remains difficult.
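The prediction-and-quantization step of SZ described above can be illustrated with a small, hedged sketch. This is not the SZ implementation; it only mimics the constant, linear, and quadratic predictors over a 1D sequence under an absolute error bound, and names such as `predictors` and `quantize_stream` are our own.

```python
import numpy as np

def predictors(prev):
    """Constant, linear, and quadratic predictions from the last three
    already-reconstructed values, mirroring SZ's best-fit step."""
    p1, p2, p3 = prev[-1], prev[-2], prev[-3]
    return (p1,                      # constant:  x[i-1]
            2 * p1 - p2,             # linear:    2*x[i-1] - x[i-2]
            3 * p1 - 3 * p2 + p3)    # quadratic: 3*x[i-1] - 3*x[i-2] + x[i-3]

def quantize_stream(data, err):
    """Return (model codes, unpredictable values) for an absolute error bound."""
    recon = list(data[:3])           # first three values stored verbatim
    codes, unpredictable = [], []
    for x in data[3:]:
        preds = predictors(recon)
        best = int(np.argmin([abs(x - p) for p in preds]))
        if abs(x - preds[best]) <= err:
            codes.append(best)         # small integers, Huffman-friendly
            recon.append(preds[best])  # decoder reproduces exactly this value
        else:
            codes.append(-1)           # marked unpredictable, stored separately
            unpredictable.append(x)
            recon.append(x)
    return codes, unpredictable

data = np.sin(np.linspace(0, 4 * np.pi, 200))
codes, raw = quantize_stream(data, err=1e-3)
print(f"{len(raw)} of {len(data)} points unpredictable at error bound 1e-3")
```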
Recommended publications
  • A Survey Paper on Different Speech Compression Techniques
    Vol-2 Issue-5 2016 IJARIIE-ISSN (O)-2395-4396. A Survey Paper on Different Speech Compression Techniques. Kanawade Pramila R.1, Prof. Gundal Shital S.2. 1 M.E. Electronics, Department of Electronics Engineering, Amrutvahini College of Engineering, Sangamner, Maharashtra, India. 2 HOD, Department of Electronics Engineering, Amrutvahini College of Engineering, Sangamner, Maharashtra, India. ABSTRACT: This paper describes different types of speech compression techniques. Speech compression can be divided into two main types, lossless and lossy compression, and this survey covers waveform-based, parametric-based, and hybrid speech compression. Compression means reducing the size of data while keeping memory requirements in mind. Speech compression compresses voiced signals for applications such as high-quality speech databases, multimedia, music databases, and Internet applications. The main purpose of speech compression is to compress audio that is transferred over a communication channel, because of limited channel bandwidth, limited data storage capacity, and low bit-rate requirements. Both lossless and lossy techniques reduce the number of bits in the original information; with lossless compression no information is lost, whereas with lossy compression some bits are discarded. Keywords: bit rate, compression, waveform-based speech compression, parametric-based speech compression, hybrid speech compression. 1. INTRODUCTION: Speech compression is used in the encoding system.
  • Fast Cosine Transform to Increase Speed-Up and Efficiency of Karhunen-Loève Transform for Lossy Image Compression
    Fast Cosine Transform to increase speed-up and efficiency of Karhunen-Loève Transform for lossy image compression. Mario Mastriani and Juliana Gambini. Abstract—In this work, we present a comparison between two techniques of image compression. In the first case, the image is divided into blocks which are collected according to a zig-zag scan. In the second one, we apply the Fast Cosine Transform to the image, and then the transformed image is divided into blocks which are likewise collected according to a zig-zag scan. Later, in both cases, the Karhunen-Loève transform is applied to the mentioned blocks. In addition, we present three new metrics based on eigenvalues for a better comparative evaluation of the techniques. Simulations show that the combined version is the best, with lower Mean Absolute Error (MAE) and Mean Squared Error (MSE), higher Peak Signal-to-Noise Ratio (PSNR), and better image quality.
    Several authors have tried to combine the DCT with the KLT, but with questionable success [1], with particular interest in multispectral imagery [30, 32, 34]. In all those cases, the KLT is used to decorrelate in the spectral domain. All images are first decomposed into blocks, and each block uses its own KLT instead of one single matrix for the whole image. In this paper, we use the KLT for decorrelation between the sub-blocks resulting from the application of a DCT with zig-zag scan, that is to say, in the spectral domain. We introduce an appropriate sequence, decorrelating first the data in the spatial domain using the DCT.
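A minimal sketch of the block-wise KLT idea described in this abstract (our own toy illustration under stated assumptions, not the authors' code): flatten image blocks into vectors, estimate their covariance, and project onto the leading eigenvectors; reconstruction error drops as more eigenvectors are kept.

```python
import numpy as np

def blockwise_klt(img: np.ndarray, block: int = 8, keep: int = 16):
    """Toy KLT (PCA) over flattened image blocks; keeps `keep` eigenvectors."""
    h, w = img.shape
    blocks = (img[:h - h % block, :w - w % block]
              .reshape(h // block, block, -1, block)
              .swapaxes(1, 2)
              .reshape(-1, block * block))          # one row per block
    mean = blocks.mean(axis=0)
    cov = np.cov(blocks - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    basis = eigvecs[:, -keep:]                      # leading components only
    coeffs = (blocks - mean) @ basis                # compact representation
    recon = coeffs @ basis.T + mean
    return coeffs, float(np.mean((blocks - recon) ** 2))

img = np.random.rand(128, 128)
_, mse = blockwise_klt(img, keep=16)
print(f"MSE with 16 of 64 KLT coefficients per block: {mse:.4f}")
```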
  • Image Compression Using DCT and Wavelet Transformations
    International Journal of Signal Processing, Image Processing and Pattern Recognition, Vol. 4, No. 3, September 2011. Image Compression Using DCT and Wavelet Transformations. Prabhakar Telagarapu, V. Jagan Naveen, A. Lakshmi Prasanthi, G. Vijaya Santhi. GMR Institute of Technology, Rajam – 532 127, Srikakulam District, Andhra Pradesh, India. [email protected], [email protected], [email protected], [email protected]. Abstract: Image compression is a widely researched area, and many compression standards are in place, but there is still scope for high compression with quality reconstruction. The JPEG standard makes use of the Discrete Cosine Transform (DCT) for compression. The introduction of wavelets gave a different dimension to compression. This paper analyzes compression using the DCT and the wavelet transform; by selecting a proper thresholding method, better PSNR results have been obtained. Extensive experimentation has been carried out to arrive at the conclusions. Keywords: Discrete Cosine Transform, wavelet transform, PSNR, image compression. 1. Introduction: Compressing an image is significantly different from compressing raw binary data. Of course, general-purpose compression programs can be used to compress images, but the result is less than optimal, because images have certain statistical properties which can be exploited by encoders specifically designed for them. Also, some of the finer details in the image can be sacrificed for the sake of saving a little more bandwidth or storage space, which means that lossy compression techniques can be used in this area. Uncompressed multimedia (graphics, audio and video) data require considerable storage capacity and transmission bandwidth. Despite rapid progress in mass-storage density, processor speeds, and digital communication system performance, demand for data storage capacity and data-transmission bandwidth continues to outstrip the capabilities of available technologies.
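As a rough illustration of the threshold-based approach this abstract refers to (a hedged sketch under our own assumptions, not the authors' method), the code below keeps only 2-D DCT coefficients above a threshold and reports the fraction kept and the PSNR of the reconstruction, using `scipy.fft.dctn`/`idctn`:

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_threshold_psnr(img: np.ndarray, thresh: float):
    """Zero out small 2-D DCT coefficients and measure reconstruction PSNR."""
    coeffs = dctn(img, norm="ortho")
    kept = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
    recon = idctn(kept, norm="ortho")
    mse = np.mean((img - recon) ** 2)
    peak = img.max() - img.min()
    psnr = 10 * np.log10(peak ** 2 / mse) if mse > 0 else np.inf
    return np.count_nonzero(kept) / kept.size, psnr

img = np.outer(np.sin(np.linspace(0, 3, 256)), np.cos(np.linspace(0, 5, 256)))
frac, psnr = dct_threshold_psnr(img, thresh=1e-3)
print(f"kept {frac:.1%} of DCT coefficients, PSNR = {psnr:.1f} dB")
```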
  • Lossless Compression of Audio Data
    CHAPTER 12. Lossless Compression of Audio Data. ROBERT C. MAHER. OVERVIEW: Lossless data compression of digital audio signals is useful when it is necessary to minimize the storage space or transmission bandwidth of audio data while still maintaining archival quality. Available techniques for lossless audio compression, or lossless audio packing, generally employ an adaptive waveform predictor with a variable-rate entropy coding of the residual, such as Huffman or Golomb-Rice coding. The amount of data compression can vary considerably from one audio waveform to another, but ratios of less than 3 are typical. Several freeware, shareware, and proprietary commercial lossless audio packing programs are available. 12.1 INTRODUCTION: The Internet is increasingly being used as a means to deliver audio content to end-users for entertainment, education, and commerce. It is clearly advantageous to minimize the time required to download an audio data file and the storage capacity required to hold it. Moreover, the expectations of end-users with regard to signal quality, number of audio channels, metadata such as song lyrics, and similar additional features provide incentives to compress the audio data. 12.1.1 Background: In the past decade there have been significant breakthroughs in audio data compression using lossy perceptual coding [1]. These techniques lower the bit rate required to represent the signal by establishing perceptual error criteria, meaning that a model of human hearing perception is used to guide the elimination of excess bits that can be either reconstructed (redundancy in the signal) or ignored (inaudible components in the signal).
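To illustrate the predictor-plus-entropy-coding structure described above (a simplified sketch, not any particular codec), the following applies a second-order linear predictor to integer PCM samples and estimates the size of a Rice-coded residual stream:

```python
import numpy as np

def rice_bits(residuals: np.ndarray, k: int) -> int:
    """Total bits to Rice-code residuals with parameter k (zig-zag mapped)."""
    mapped = np.where(residuals >= 0, 2 * residuals, -2 * residuals - 1)
    return int(np.sum((mapped >> k) + 1 + k))   # unary quotient + k-bit remainder

def predicted_residuals(samples: np.ndarray) -> np.ndarray:
    """Second-order predictor: estimate x[n] as 2*x[n-1] - x[n-2]."""
    pred = 2 * samples[1:-1] - samples[:-2]
    return samples[2:] - pred

# Synthetic 16-bit "audio": a 440 Hz tone plus a little noise.
n = np.arange(44100)
samples = (10000 * np.sin(2 * np.pi * 440 * n / 44100)
           + np.random.randint(-50, 50, n.size)).astype(np.int64)
res = predicted_residuals(samples)
best_k = min(range(16), key=lambda k: rice_bits(res, k))
ratio = (16 * samples.size) / rice_bits(res, best_k)
print(f"best Rice parameter k={best_k}, estimated compression ratio = {ratio:.2f}")
```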
  • The H.264 Advanced Video Coding (AVC) Standard
    Whitepaper: The H.264 Advanced Video Coding (AVC) Standard: What It Means to Web Camera Performance. Introduction: A new generation of webcams is hitting the market that makes video conferencing a more lifelike experience for users, thanks to adoption of the breakthrough H.264 standard. This white paper explains some of the key benefits of H.264 encoding and why cameras with this technology should be on the shopping list of every business. The Need for Compression: Today, Internet connection rates average in the range of a few megabits per second. While uncompressed VGA video requires 147 megabits per second (Mbps) of data, full high definition (HD) 1080p video requires almost one gigabit per second of data, as illustrated in Table 1.
    Table 1. Display Resolution Format Comparison
    Format   Horizontal Pixels   Vertical Lines   Pixels      Megabits per second (Mbps)
    QVGA     320                 240              76,800      37
    VGA      640                 480              307,200     147
    720p     1280                720              921,600     442
    1080p    1920                1080             2,073,600   995
    Video Compression Techniques: Digital video streams, especially at high definition (HD) resolution, represent huge amounts of data. In order to achieve real-time HD resolution over typical Internet connection bandwidths, video compression is required. The amount of compression required to transmit 1080p video over a three megabits per second link is 332:1! Video compression techniques use mathematical algorithms to reduce the amount of data needed to transmit or store video. Lossless Compression: Lossless compression changes how data is stored without resulting in any loss of information. Zip files are losslessly compressed so that when they are unzipped, the original files are recovered.
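The figures in Table 1 are consistent with uncompressed video at 16 bits per pixel and 30 frames per second (our assumption; the whitepaper does not state the color format or frame rate), and the 332:1 figure follows directly, as this small check shows:

```python
# Reproduce Table 1 and the 332:1 ratio, assuming 16 bits/pixel at 30 fps.
FORMATS = {"QVGA": (320, 240), "VGA": (640, 480),
           "720p": (1280, 720), "1080p": (1920, 1080)}
BITS_PER_PIXEL, FPS = 16, 30

for name, (w, h) in FORMATS.items():
    mbps = w * h * BITS_PER_PIXEL * FPS / 1e6
    print(f"{name:>5}: {w * h:>9,} pixels -> {mbps:5.0f} Mbps uncompressed")

link_mbps = 3                                   # a 3 Mbps Internet link
ratio = (1920 * 1080 * BITS_PER_PIXEL * FPS / 1e6) / link_mbps
print(f"compression needed for 1080p over {link_mbps} Mbps: {ratio:.0f}:1")
```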
  • Understanding Compression of Geospatial Raster Imagery
    Understanding Compression of Geospatial Raster Imagery. Document Overview: This document was created for the North Carolina Geographic Information and Coordinating Council (GICC), http://ncgicc.com, by the GIS Technical Advisory Committee (TAC). Its purpose is to serve as a best practice or guidance document for GIS professionals who are compressing raster images. This document only addresses compressing geospatial raster data, specifically aerial or orthorectified imagery. It does not address compressing LiDAR data. Compression Overview: Compression is the process of making data more compact so it occupies less disk storage space. The primary benefit of compressing raster data is the reduction in file size. An added benefit is greatly improved performance over a network, because the user is transferring less data from a server to an application; however, compressed data must be decompressed for display in GIS software. The result may be slower raster display in GIS software than with uncompressed data. Compressed data can also increase CPU requirements on the server or desktop. Glossary of Common Terms: Raster is a spatial data model made of rows and columns of cells. Each cell contains an attribute value identifying its color and location coordinate. Geospatial raster data like satellite images and aerial photographs are typically larger on average than vector data (predominantly points, lines, or polygons). Compression is the process of making a (raster) file smaller while preserving all or most of the data it contains. Imagery compression enables storage of more data (image files) on a disk than if they were uncompressed. Compression ratio is the amount or degree of reduction of an image's file size.
  • Analysis of Image Compression Algorithm Using
    IJSTE - International Journal of Science Technology & Engineering, Volume 3, Issue 12, June 2017, ISSN (online): 2349-784X. Analysis of Image Compression Algorithm using DCT. Aman George (Student, Department of Information Technology, SSTC, SSGI, FET Junwani, Bhilai - 490020, Chhattisgarh, India) and Ashish Sahu (Assistant Professor, Department of Information Technology, Swami Vivekananda Technical University, Durg, C.G., India). Abstract: Image compression means reducing the size of a graphics file without compromising its quality. Depending on whether the reconstructed image must be exactly the same as the original or some unidentified loss may be incurred, two techniques for compression exist. This paper presents a DCT implementation, which belongs to the lossy techniques. Image compression is the application of data compression to digital images. The discrete cosine transform (DCT) is a technique for converting a signal into elementary frequency components. The image compression algorithm was implemented in Matlab code and modified to perform better when implemented in a hardware description language. The IMAP and IMAQ blocks of MATLAB were used to analyse and study the results of image compression using the DCT, and varying coefficients for compression were used to show the resulting image and the error image relative to the original images. The original image is transformed in 8-by-8 blocks and then inverse transformed in 8-by-8 blocks to create the reconstructed image. Image compression is especially used where tolerable degradation is acceptable. This paper is a survey of lossy image compression using the Discrete Cosine Transform; it covers the JPEG compression algorithm, which is used for full-colour still image applications, and describes all of its components.
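A small sketch of the 8-by-8 block procedure described above (our illustration, not the paper's Matlab/HDL code): each block is DCT-transformed, only the low-frequency coefficients are kept, and the block is inverse transformed; the error image is the difference from the original.

```python
import numpy as np
from scipy.fft import dctn, idctn

def block_dct_compress(img: np.ndarray, keep: int = 4) -> np.ndarray:
    """Keep only the top-left keep x keep DCT coefficients of each 8x8 block."""
    out = np.zeros_like(img)
    for i in range(0, img.shape[0], 8):
        for j in range(0, img.shape[1], 8):
            coeffs = dctn(img[i:i + 8, j:j + 8], norm="ortho")
            mask = np.zeros_like(coeffs)
            mask[:keep, :keep] = 1.0              # low-frequency zone only
            out[i:i + 8, j:j + 8] = idctn(coeffs * mask, norm="ortho")
    return out

img = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))   # smooth 64x64 ramp
recon = block_dct_compress(img, keep=4)
error_image = img - recon
print(f"kept 16/64 coefficients per block, max abs error = {np.abs(error_image).max():.4f}")
```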
  • Lossy Audio Compression Identification
    2018 26th European Signal Processing Conference (EUSIPCO). Lossy Audio Compression Identification. Bongjun Kim (Northwestern University, Evanston, USA) and Zafar Rafii (Gracenote, Emeryville, USA). [email protected] zafar.rafi[email protected]. Abstract—We propose a system which can estimate, from an audio recording that has previously undergone lossy compression, the parameters used for the encoding, and therefore identify the corresponding lossy coding format. The system analyzes the audio signal and searches for the compression parameters and framing conditions which match those used for the encoding. In particular, we propose a new metric for measuring traces of compression which is robust to variations in the audio content, and a new method for combining the estimates from multiple audio blocks which can refine the results. We evaluated this system with audio excerpts from songs and movies, compressed into various coding formats, using different bit rates, and captured digitally as well as through analog transfer. Results showed that our system can identify the correct format in almost all cases, even at high bit rates.
    A method for estimating the compression parameters from an audio signal, based on AAC, was presented in [3]. The first implementation of that work, based on MP3, was then proposed in [4]. The idea was to search for the compression parameters and framing conditions which match those used for the encoding, by measuring traces of compression in the audio signal, which typically correspond to time-frequency coefficients quantized to zero. The first work to investigate alterations, such as deletion, insertion, or substitution, in audio signals which have undergone lossy compression, namely MP3, was presented in [5]. The idea was to measure traces of compression in the signal along time and detect discontinuities in the estimated framing.
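A toy sketch of the "traces of compression" idea described above (our own illustration under simplifying assumptions, not the authors' metric): for each candidate frame offset, transform the signal frame by frame and count near-zero coefficients; the framing used by the original encoder tends to maximize that count.

```python
import numpy as np
from scipy.fft import dct, idct

def zero_trace_score(signal, frame, offset, eps=1e-6):
    """Fraction of near-zero per-frame DCT coefficients for a given framing offset."""
    x = signal[offset:]
    n_frames = len(x) // frame
    coeffs = dct(x[:n_frames * frame].reshape(n_frames, frame), norm="ortho", axis=1)
    return np.mean(np.abs(coeffs) < eps)

# Simulate "lossy compression": zero small coefficients on frames aligned at offset 7.
rng = np.random.default_rng(0)
frame, true_offset = 1024, 7
raw = rng.standard_normal(frame * 200 + true_offset)
c = dct(raw[true_offset:].reshape(-1, frame), norm="ortho", axis=1)
c[np.abs(c) < 1.5] = 0.0                           # crude quantization to zero
compressed = np.concatenate([raw[:true_offset], idct(c, norm="ortho", axis=1).ravel()])

scores = {off: zero_trace_score(compressed, frame, off) for off in range(16)}
print("most likely framing offset:", max(scores, key=scores.get))   # expected: 7
```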
  • Telestream Cloud High Quality Video Encoding Service in the Cloud
    Telestream Cloud: High quality video encoding service in the cloud. Telestream Cloud is a high quality video encoding software-as-a-service (SaaS) in the cloud, offering fast and powerful encoding for your workflow. Telestream Cloud pricing is structured as pay-as-you-go monthly subscriptions for quick scaling as your workflow changes, with no upfront expenses.
    Who uses Telestream Cloud? Telestream Cloud's elegantly simple user interface scales quickly and seamlessly for media professionals who utilize video transcoding heavily in their business, like:
    ■ Advertising agencies
    ■ On-line broadcasters
    ■ Gaming
    ■ Entertainment
    ■ OTT service providers for media streaming
    ■ Boutique post-production studios
    For content creators and video producers, the Telestream Cloud encoding service is pay-as-you-go and allows you to collocate video files with your other cloud services across multiple cloud storage options. The service provides high quality output in most formats to meet the needs of:
    ■ Small businesses
    ■ Houses of worship
    ■ Digital marketing
    ■ Education
    ■ Nonprofits
    Telestream Cloud provides unlimited scalability to address peak demand times and encoding requirements for large files for video pros who work in:
    ■ Broadcast networks
    ■ Broadcast news or sports
    ■ Cable service providers
    ■ Large post-production houses
    Telestream Cloud is a complementary service to the Vantage platform. During peak demand times, the Cloud service can offset the workload. The service also allows easy collaboration on projects across multiple sites or remote locations. Why choose Telestream Cloud? The best video encoding quality for all formats & codecs: an extensive set of encoding profiles is built in for all major formats to ensure the output is optimized for the highest quality.
  • Conference Paper
    International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXIV-5/W10. Accurate 3D information extraction from large-scale data compressed image and the study of the optimum stereo imaging method. Riichi NAGURA, Kanagawa Institute of Technology, [email protected]. KEY WORDS: High Resolution Imaging, Data Compression, Height Information Extraction, Remote Sensing. ABSTRACT: The paper reports an accurate height information extraction method under the condition of large-scale compressed images, and proposes an optimum stereo imaging system using multi-directional imaging and the time delay integration method. Recently, the transmission data rate of stereo images has increased enormously, accompanying the progress of high-resolution imaging. Large-scale data compression therefore becomes indispensable, especially in the case of data transmission from satellites. In the early stages lossless compression was used; however, the data rate of losslessly compressed images is not sufficiently small. Therefore, this paper mainly discusses the effects of lossy compression methods. First, the paper describes the effects of the JPEG compression method using the DCT, and next analyzes the effects of the JPEG2000 compression method using the wavelet transform. The paper also indicates the effects of the JPEG-LS compression method. Correlative detection techniques are used in these analyses. As the results show, JPEG2000 is superior to the former JPEG, as foreseen; however, bit errors affect many neighboring pixels under large-scale compression, so correction of the bit errors becomes indispensable. The paper reports the results of associating Turbo coding techniques with the compressed data; consequently, the bit errors can be removed sufficiently even in the case of a noisy transmission path.
  • CS 1St Year: M&A Types of Compression: the Two Types of Compression Are: Lossy Compression
    CS 1st Year: M&A. Types of compression: The two types of compression are lossy and lossless. Lossy compression: data bytes are removed from the file, which results in a smaller file but also lower quality. It makes use of data redundancies and human perception – for example, removing data that cannot be perceived by humans. So whilst quality might be affected, the substance of the file is still present. Lossy compression would commonly be used over the internet, where large files present a problem. An example of lossy compression is MP3 compression, which removes wavelength extremes that are outside the hearing range of normal people. MP3 has a compression ratio of 11:1. Another example is JPEG (Joint Photographic Experts Group), which is used to compress images. JPEG works by grouping pixels of an image which have similar colour or brightness, changing them all to a uniform "average" colour, and then replacing the similar pixels with codes. The lossy compression method eliminates some amount of data that is not noticeable. This technique does not allow a file to be restored to its original form, but it significantly reduces the size. The lossy compression technique is beneficial if the quality of the data is not your priority: it slightly degrades the quality of the file or data but is convenient when one wants to send or store the data. This type of data compression is used for organic data like audio signals and images. Lossy compression techniques – Transform coding: this method transforms correlated pixels into a decorrelated representation.
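A toy sketch of the "replace groups of similar pixels with an average" idea these notes describe (a simplification for illustration only; real JPEG uses a block DCT, quantization, and entropy coding):

```python
import numpy as np

def average_blocks(img: np.ndarray, block: int = 2) -> np.ndarray:
    """Lossy toy codec: replace each block x block group of pixels with its mean."""
    h, w = img.shape
    return (img[:h - h % block, :w - w % block]
            .reshape(h // block, block, -1, block)
            .mean(axis=(1, 3)))                     # one stored value per block

img = np.random.randint(0, 256, (64, 64)).astype(np.float64)
small = average_blocks(img, block=2)
print(f"stored values: {small.size} instead of {img.size} "
      f"(ratio {img.size / small.size:.0f}:1); detail within each block is lost")
```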
  • Discrete Cosine Transformation Based Image Data Compression Considering Image Restoration
    (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 6, 2020. Discrete Cosine Transformation based Image Data Compression Considering Image Restoration. Kohei Arai, Graduate School of Science and Engineering, Saga University, Saga City, Japan. Abstract—Discrete Cosine Transformation (DCT) based image data compression considering image restoration is proposed. An image data compression method based on the DCT and featuring an image restoration method is proposed. DCT image compression is widely used and has four major image defects. In order to reduce the noise and distortions, the proposed method expresses a set of parameters for the assumed distortion model based on an image restoration method. The results from an experiment with Landsat TM (Thematic Mapper) data of Saga show good image compression performance in terms of compression factor and image quality; namely, the proposed method achieved a 25% improvement of the compression factor compared to the existing DCT method, with almost comparable image quality between the two methods.
    ...restored image and the original image, not on the space of the original image but on the observed image. A general inverse filter, a least-squares filter with constraints, and a projection filter and a partial projection filter that may be significantly affected by noise in the restored image have been proposed [2]. However, the former is insufficient for optimization of the evaluation criteria, etc., and is under study. On the other hand, the latter is essentially a method for finding a non-linear solution, so it can only take an approach based on iteration, and various methods based on iterative methods have been tried. There are various iterative methods, including stationary iterative methods typified by ...