Geosci. Model Dev., 12, 4099–4113, 2019
https://doi.org/10.5194/gmd-12-4099-2019
© Author(s) 2019. This work is distributed under the Creative Commons Attribution 4.0 License.

Evaluation of lossless and lossy algorithms for the compression of scientific datasets in netCDF-4 or HDF5 files

Xavier Delaunay1, Aurélie Courtois1, and Flavien Gouillon2
1Thales, 290 allée du Lac, 31670 Labège, France
2CNES, Centre Spatial de Toulouse, 18 avenue Edouard Belin, 31401 Toulouse, France
Correspondence: Xavier Delaunay ([email protected])
Received: 5 October 2018 – Discussion started: 20 November 2018
Revised: 7 June 2019 – Accepted: 3 July 2019 – Published: 23 September 2019

Abstract. The increasing volume of scientific datasets requires the use of compression to reduce data storage and transmission costs, especially for the oceanographic or meteorological datasets generated by Earth observation mission ground segments. These data are mostly produced in netCDF files. Indeed, the netCDF-4/HDF5 file formats are widely used throughout the global scientific community because of the useful features they offer. HDF5 in particular offers a dynamically loaded filter plugin so that users can write compression/decompression filters, for example, and process the data before reading or writing them to disk. This study evaluates lossy and lossless compression/decompression methods through netCDF-4 and HDF5 tools on analytical and real scientific floating-point datasets. We also introduce the Digit Rounding algorithm, a new relative error-bounded data reduction method inspired by the Bit Grooming algorithm. The Digit Rounding algorithm offers a high compression ratio while keeping a given number of significant digits in the dataset. It achieves a higher compression ratio than the Bit Grooming algorithm with slightly lower compression speed.

1 Introduction

Ground segments processing scientific mission data are facing challenges due to the ever-increasing resolution of on-board instruments and the volume of data to be processed, stored and transmitted. This is the case for oceanographic and meteorological missions, for instance. Earth observation mission ground segments produce very large files mostly in netCDF format, which is standard in the oceanography field and widely used by the meteorological community. This file format is widely used throughout the global scientific community because of its useful features. The fourth version of the netCDF library, denoted netCDF-4/HDF5 (as it is based on the HDF5 layer), offers "Deflate" and "Shuffle" algorithms as native compression features. However, the compression ratio achieved does not fully meet ground processing requirements, which are to significantly reduce the storage and dissemination cost as well as the I/O times between two modules in the processing chain.
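As a brief illustration of these native features (a minimal sketch, not code from this study), the example below uses the netCDF4 Python bindings to write a single-precision variable with the Shuffle filter and Deflate compression enabled; the file name, variable name, array shape and compression level are arbitrary choices made for the example.

    # Illustrative sketch: names and parameters are arbitrary, not from the paper.
    import numpy as np
    from netCDF4 import Dataset

    # Synthetic single-precision field standing in for a mission product.
    data = np.random.default_rng(0).normal(size=(1024, 1024)).astype("float32")

    with Dataset("example.nc", "w", format="NETCDF4") as nc:
        nc.createDimension("y", data.shape[0])
        nc.createDimension("x", data.shape[1])
        var = nc.createVariable(
            "field", "f4", ("y", "x"),
            zlib=True,      # native Deflate lossless coding
            complevel=4,    # Deflate level (1 = fastest, 9 = highest ratio)
            shuffle=True)   # byte Shuffle preprocessing before Deflate
        var[:] = data

Enabling Shuffle simply rearranges the bytes of the values before Deflate runs, which generally makes floating-point fields more compressible.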
In response to the ever-increasing volume of data, scientists are keen to compress data. However, they have certain requirements: both compression and decompression have to be fast. Lossy compression is acceptable only if the compression ratios are higher than those of lossless algorithms and if the precision, or data loss, can be controlled. There is a trade-off between the data volume and the accuracy of the compressed data. Nevertheless, scientists can accept small losses if they remain below the data's noise level. Noise is difficult to compress and of little interest to scientists, so they do not consider data degradation that remains under the noise level as a loss (Baker et al., 2016). In order to increase the compression ratio within the processing chain, "clipping" methods may be used to degrade the data before compression. These methods increase the compression ratio by removing the least significant digits in the data. Indeed, at some level, these least significant digits may not be scientifically meaningful in datasets corrupted by noise.

This paper studies compression and clipping methods that can be applied to scientific datasets in order to maximize the compression ratio while preserving scientific data content and numerical accuracy. It focuses on methods that can be applied to scientific datasets, i.e. vectors or matrices of floating-point numbers. First, lossless compression algorithms can be applied to any kind of data. The standard is the "Deflate" algorithm (Deutsch, 1996), native in the netCDF-4/HDF5 libraries. It is widely used in compression tools such as the zip, gzip, and zlib libraries, and has become a benchmark for lossless data compression. Recently, alternative lossless compression algorithms have emerged. These include Google Snappy, LZ4 (Collet, 2013) or Zstandard (Collet and Turner, 2016). To achieve faster compression than the Deflate algorithm, none of these algorithms use Huffman coding. Second, preprocessing methods such as Shuffle, available in HDF5, or Bitshuffle (Masui et al., 2015) are used to optimize lossless compression by rearranging the data bytes or bits into a "more compressible" order. Third, some lossy/lossless compression algorithms, such as FPZIP (Lindstrom and Isenburg, 2006), ZFP (Lindstrom, 2014) or Sz (Tao et al., 2017), are specifically designed for scientific data – and in particular floating-point data – and can control data loss. Fourth, data reduction methods such as Linear Packing (Caron, 2014a), Layer Packing (Silver and Zender, 2017), Bit Shaving (Caron, 2014b), and Bit Grooming (Zender, 2016a) lose some data content without necessarily reducing its volume. Preprocessing methods and lossless compression can then be applied to obtain a higher compression ratio.

This paper focuses on compression methods implemented for netCDF-4 or HDF5 files. These scientific file formats are widespread among the oceanographic and meteorological communities. HDF5 offers a dynamically loaded filter plugin that allows users to write compression/decompression filters (among others), and to process data before reading or writing them to disk. Consequently, many compression/decompression filters – such as Bitshuffle, Zstandard, LZ4, and Sz – have been implemented by members of the HDF5 user community and are freely available. The netCDF Operator toolkit (NCO) (Zender, 2016b) also offers some compression features, such as Bit Shaving, Decimal Rounding and Bit Grooming.

The rest of this paper is divided into six sections. Section 2 presents the lossless and lossy compression schemes for scientific floating-point datasets. Section 3 introduces the Digit Rounding algorithm, which is an improvement of the Bit Grooming algorithm that optimizes the number of mantissa bits preserved. Section 4 defines the performance metrics used in this paper. Section 5 describes the performance assessment of a selection of lossless and lossy compression methods on synthetic datasets. It presents the datasets and compression results before making some recommendations. Section 6 provides some compression results obtained with real CFOSAT and SWOT datasets. Finally, Sect. 7 provides our conclusions.

2 Compression algorithms

Compression schemes for scientific floating-point datasets usually entail several steps: data reduction, preprocessing, and lossless coding. These three steps can be chained as illustrated in Fig. 1. The lossless coding step is reversible. It does not degrade the data while reducing its volume. It can be implemented by lossless compression algorithms such as Deflate, Snappy, LZ4 or Zstandard. The preprocessing step is also reversible. It rearranges the data bytes or bits to enhance lossless coding efficiency. It can be implemented by algorithms such as Shuffle or Bitshuffle. The data reduction step is not reversible because it entails data losses. The goal is to remove irrelevant data such as noise or other scientifically meaningless data. Data reduction can reduce data volume, depending on the algorithm used. For instance, the Linear Packing and Sz algorithms reduce data volume, but the Bit Shaving and Bit Grooming algorithms do not.

Figure 1. Compression chain showing the data reduction, preprocessing and lossless coding steps.
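To make this three-step chain concrete, the sketch below (an illustrative example, not code from this study) uses a crude bit-shaving function as the data reduction step, zeroing the lowest mantissa bits of IEEE 754 single-precision values, and then lets HDF5 apply the Shuffle filter and Deflate (gzip) coding through h5py; the helper function, the number of mantissa bits kept, the array shape and the chunk size are all assumptions made for the example.

    import numpy as np
    import h5py

    def shave_bits(values, keep_mantissa_bits=12):
        # Generic bit shaving (hypothetical helper): zero the lowest
        # (23 - keep_mantissa_bits) mantissa bits of float32 values.
        # This is a simple stand-in for the data reduction step, not the
        # paper's Digit Rounding algorithm.
        drop = 23 - keep_mantissa_bits
        mask = np.uint32((0xFFFFFFFF >> drop) << drop)
        bits = values.astype(np.float32).view(np.uint32)
        return (bits & mask).view(np.float32)

    data = np.random.default_rng(0).normal(size=(1024, 1024)).astype("float32")
    reduced = shave_bits(data)                # step 1: data reduction (lossy)

    with h5py.File("example.h5", "w") as f:
        f.create_dataset("field", data=reduced,
                         chunks=(256, 256),
                         shuffle=True,        # step 2: Shuffle preprocessing
                         compression="gzip",  # step 3: Deflate lossless coding
                         compression_opts=4)

After bit shaving, the discarded mantissa bits are all zero, so the Shuffle filter gathers long runs of identical bytes that Deflate can code very compactly, which is the rationale for chaining the three steps.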
This paper evaluates the lossless compression algorithms Deflate, LZ4, and Zstandard: Deflate because it is the benchmark algorithm, LZ4 because it is a widely used, very high-speed compressor, and Zstandard because it provides better results than Deflate in terms of both compression ratio and compression/decompression speed. The Deflate algorithm uses LZ77 dictionary coding (Ziv and Lempel, 1977) and Huffman entropy coding (Huffman, 1952). LZ77 and Huffman coding exploit different types of redundancies to enable Deflate to achieve high compression ratios. However, the computational cost of the Huffman coder is high and makes Deflate compression rather slow. LZ4 is a dictionary coding algorithm designed to provide high compression/decompression speeds rather than a high compression ratio. It does this without an entropy coder. Zstandard is a fast lossless compressor offering high compression ratios. It makes use of dictionary coding (repcode modeling) and a finite-state entropy coder (tANS) (Duda, 2013). It offers a

zero in the binary representation of the quantized floating-point value. It is also similar to Sz's error-controlled quantization (Tao et