Noname manuscript No. (will be inserted by the editor)

Polynomial data compression for large-scale physics experiments

Pierre Aubert · Thomas Vuillaume · Gilles Maurin · Jean Jacquemier · Giovanni Lamanna · Nahid Emad

Received: date / Accepted: date

Abstract The new generation research experiments will introduce huge data surge to a continuously increasing data production by current experiments. This data surge necessitates efficient compression techniques. These compression techniques must guarantee an optimum tradeoff between compression rate and the corresponding compression/decompression speed ratio without affecting the data integrity. This work presents a lossless compression algorithm to compress physics data generated by Astronomy, Astrophysics and Particle Physics experiments. The developed algorithms have been tuned and tested on a real use case: the next generation ground-based high-energy gamma ray observatory, Cherenkov Telescope Array (CTA), requiring important compression performance. Stand-alone, the proposed compression method is very fast and reasonably efficient. Alternatively, applied as a pre-compression algorithm, it can accelerate common methods like LZMA, keeping close performance.

Keywords Big data · HPC · lossless compression · white noise

P. Aubert · T. Vuillaume · G. Maurin · J. Jacquemier · G. Lamanna
Univ. Grenoble Alpes, Univ. Savoie Mont Blanc, CNRS, LAPP, 74000 Annecy, France
E-mail: [email protected]

P. Aubert · N. Emad
Laboratoire d'informatique Parallélisme Réseaux Algorithmes Distribués, UFR des Sciences, 45 avenue des États-Unis, 78035 Versailles

P. Aubert · N. Emad
Maison de la Simulation, Université de Versailles Saint-Quentin-en-Yvelines, USR 3441 CEA Saclay, 91191 Gif-sur-Yvette cedex

1 Introduction

Several current and next generation experimental infrastructures are concerned by the increasing volume of data that they generate and manage. This is also the case in the Astrophysics and Astroparticle Physics research domains, where several projects are going to produce a data deluge of the order of several tens of Peta-Bytes (PB) per year [1] (as in the case of CTA) up to some Exa-Bytes (as for the next generation astronomical radio observatory SKA [2]). Such an increasing data-rate implies considerable technical issues at all levels of the data flow, such as data storage, processing, dissemination and preservation.

The most efficient compression algorithms generally used for pictures (JPEG), videos (H264) or music (MP3) files provide compression ratios greater than 10. These algorithms are lossy, therefore not applicable in a scientific context where the data content is critical and inexact approximations and/or partial data discarding are not acceptable. In the context of this work, we focus on compression methods that respond to data size reduction, storage, handling, and transmission issues while not compromising the data content.

The following types of lossless compression methods are applicable in the aforementioned situations. LZMA [3], LZ78 [4], BZIP2 [5], GZIP [6], Zstandard [7] or the Huffman algorithms are often employed because they provide the best compression ratio. The compression speeds of these methods however impose significant constraints considering the data volumes at hand. Character-oriented lossless compression methods, such as CTW (Context Tree Weighting) [8], LZ77 [9], LZW [10], the Burrows-Wheeler transform, or PPM [11], cannot be used efficiently on physics data as they do not have the same characteristics as text data, like the occurrence or repetition of characters. Other experiments have recently solved this data compression issue [12], [13] for smaller data rates.

With the increasing data rate, both the compression speed and ratio have to be improved. This paper primarily addresses the data compression challenges. In this paper, we propose a polynomial approach to compress integer data dominated by a white noise in a shorter time than the classical methods, with a reasonable compression ratio. This paper focuses on both the compression ratio and time because the decompression time is typically shorter.

The paper is organized as follows. Section 2 explains some motivations. Section 3 describes our three polynomial compression methods. Section 4 reports the improvement obtained from our best polynomial compression method on given distributions and CTA data [14]. Section 5 gives further details about compression quality. In section 6, some concluding remarks and future plans will be given.

2 Motivation

As the data volumes generated by current and upcoming experiments rapidly increase, the transfer and storage of data becomes an economical and technical issue. As an example, CTA, the next generation ground-based gamma-ray observatory, will generate hundreds of PB of data by 2030. The CTA facility is based on two observing sites, one per hemisphere, and will be composed of more than one hundred telescopes in total. Each of them is equipped with photo-sensors equipping the telescopes' cameras and generating about two hundred PB/year of uncompressed raw data, which are then reduced on site, after data selection conditions, to the order of the PB/year off-site data yield. The CTA pipeline thus implies a need for both lossy and lossless compression, and the amount of lossy compression should be minimized while also ensuring good data reading and writing speed. The writing speed needs to be close to real-time, since there is limited capacity on site to buffer such large data volumes. Furthermore, decompression speed is also an issue; the whole cumulated data are expected to be reprocessed yearly, which means that the amount of data needed to be read from disk (i.e. decompressed) and processed will grow each year (e.g. 4 PB, 8 PB, 12 PB, ...).

In CTA, as in many other experiments, the data acquired by digitization can be described by two components: a Poissonian distribution representing the signal, dominated by a Gaussian-like distribution representing the noise, which is most commonly white noise.

Fig. 1 Example of analog signal digitization in most physics experiments. In many cases the white noise (a Gaussian distribution) dominates the signal (generally a Poissonian distribution). So, the biggest part of the data we want to compress follows a Gaussian distribution.

Fig. 2 Illustration of the reduction principle. The upper line represents the data (different colours for different values). In the second line, the orange blocks represent the changes between the different values to compress. The last line shows the compressed data (as they are stored): first, the minimum value of the data, next, the base b = max − min + 1, which defines the data variations set, Z/bZ, finally the data variations. Several data can be stored in the same unsigned int and only the changes between the data are stored. The common parameters, like the range of the data (minimum and maximum or compression base), are stored only once.

As shown in figure 1, the noise generally significantly dominates the searched signal.

In this paper, we propose a compression algorithm optimised for experimental situations with such characteristics, a Gaussian distribution added to a Poissonian one. Furthermore, in order to respond to the time requirement and allow for almost real-time execution, the proposed solution can also be combined with the most powerful known compression algorithms, such as LZMA, to increase tremendously their speed.

3 The polynomial compression

An unsigned int range, ⟦0, 2^{32}⟦, defines a mathematical set Z/dZ, called a ring, where d = 2^{32}. The digitized data also define a ring; in this case, the minimum is v_min and the maximum is v_max, so the corresponding ring is defined as Z/bZ with b = v_max − v_min + 1. In many cases b < d, so it is possible to store several pieces of data in the same unsigned int (see figure 2). This compression can be made by using a polynomial approach. The power of a base is given by the values range. This allows to add different values in the same integer and compute them back.

3.1 Basic compression method

Considering an n-element data vector, v ∈ N^n, its minimum, v_min, and its maximum, v_max, define its associated ring. If the data ring is smaller than the unsigned int ring, it is possible to store several values in one unsigned int. The smaller the base, the higher the compression ratio. As the data are in ⟦v_min, v_max⟧, the range between 0 and v_min is useless. Therefore, the data can be compressed by subtracting the minimum value, forming a smaller base. The minimum can be stored once before the compressed data. The compression base B is defined by B = v_max − v_min + 1. With this base we are able to store (v_max − v_min) different values. The compression ratio, p, is given by the number of bases B that can be stored in one unsigned int (in ⟦0, 2^{32}⟦):

p = \left\lfloor \frac{\ln(2^{32} - 1)}{\ln B} \right\rfloor    (1)

The compressed elements, s_j, are given by:

s_j = \sum_{i=1}^{p} v_{i + p \times (j-1)} \times B^{i-1} \quad \text{for } 1 \le j < \frac{n}{p}    (2)

A polynomial division can be used to uncompress the data.
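As an illustration of this basic scheme, the following C++ sketch packs and unpacks a vector using equations (1) and (2). It is a minimal example written for this article, not the PLIBS_8 implementation, and it assumes v_max > v_min so that B ≥ 2.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Basic polynomial packing: p values in base B are accumulated into one
// 32-bit word, s_j = sum_i (v_i - vMin) * B^(i-1), following eqs. (1) and (2).
std::vector<uint32_t> packBasic(const std::vector<uint32_t>& v,
                                uint32_t vMin, uint32_t vMax) {
    const uint64_t B = uint64_t(vMax) - vMin + 1;   // compression base (B >= 2 assumed)
    const size_t p = size_t(std::log(4294967295.0) / std::log(double(B)));  // eq. (1)
    std::vector<uint32_t> packed;
    for (size_t j = 0; j < v.size(); j += p) {
        uint64_t s = 0, power = 1;
        for (size_t i = j; i < j + p && i < v.size(); ++i) {
            s += (v[i] - vMin) * power;             // eq. (2): value times B^(i-1)
            power *= B;
        }
        packed.push_back(uint32_t(s));
    }
    return packed;
}

// Decompression by successive Euclidean (polynomial) divisions by B.
std::vector<uint32_t> unpackBasic(const std::vector<uint32_t>& packed,
                                  uint32_t vMin, uint32_t vMax, size_t n) {
    const uint64_t B = uint64_t(vMax) - vMin + 1;
    const size_t p = size_t(std::log(4294967295.0) / std::log(double(B)));
    std::vector<uint32_t> v;
    for (uint64_t s : packed)
        for (size_t i = 0; i < p && v.size() < n; ++i) {
            v.push_back(uint32_t(s % B) + vMin);
            s /= B;
        }
    return v;
}
```

For instance, with a data range giving B = 1000, equation (1) yields p = 3, so three values are stored in each 32-bit word.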

3.2 Advanced compression method

The inconvenience of the basic compression method is the unused space at the end of each packed unsigned int (see figure 2). The ideal case is the one that has no unused space when storing the compressed data, as illustrated in figure 3. The advanced polynomial compression tends towards this ideal case, while keeping the read and write time minimal in order to provide a fast compression and decompression speed.

Fig. 3 Illustration of the advanced reduction. The upper line represents the data (different colours for different values). In the second line, the orange blocks represent the changes between the different values to compress. The last line shows the compressed data (as they are stored): first, the minimum value of the data, next, the base b = max − min + 1, which defines the data variations set, Z/bZ, and finally the data variations. The storage space is optimized by avoiding useless gaps between data. With this method there is no useless space to store the compressed data.

Fig. 4 This figure shows how the data of the vector v are stored in the packed vector. The first line gives the base used to store the values, the second line shows the variables used to store the values with respect to their base. To increase the compression ratio we need to split the last base B into bases R and R' in order to use the storage capacity of an unsigned int as much as we can. The values on the left of an unsigned int are stored with a low power of the base B. The values on the right of an unsigned int are stored with a high power of the base B.

The compression ratio can be improved by splitting the last base (see figures 2 and 3), used to store less data in the same unsigned int, into two other bases, R and R', in order to have B ≤ R × R'. In this case, the base R is stored in the current packed unsigned int and the base R' is stored in the next one (see figure 4). This configuration ensures a more efficient data order for CPU data pre-fetching at decompression time, in order to ensure a decompression faster than the compression.

The data are accumulated from the highest exponent of the base B to the lowest. This ensures that the decompression will produce uncompressed contiguous data.

This splitting stores a value to be compressed, v, in two bases, R and R', with two variables r and r'. The variables r and r' are stored in two consecutive packed elements (unsigned int).

The calculation of the bases R and R' is possible when the number of bases B that can be stored in an unsigned int is known. The number of bases B that can be stored in the first packed unsigned int, p_1, is given by the following equation:

p_1 = \left\lfloor \frac{\ln(2^{32} - 1)}{\ln B} \right\rfloor    (3)

The split base R_1 is given by:

R_1 = \left\lfloor \frac{2^{32} - 1}{B^{p_1}} \right\rfloor    (4)

The R_1 base must be completed, to store an element e ∈ ⟦0, B⟦, by:

R'_1 = \left\lfloor \frac{B}{R_1} \right\rfloor + (1 \text{ if } B \bmod R_1 \ne 0)    (5)

So R_1 × R'_1 ≥ B. Each base R_i and R'_i is associated to a stored value r_i and r'_i respectively. The first packed element s_1 can be written as follows:

s_1 = r_1 + R_1 \times \left( \sum_{k=1, k \ne p_1}^{p_1} v_k B^{p_1 - k} \right)    (6)

Where:

r_1 = \left\lfloor \frac{v_{p_1}}{R'_1} \right\rfloor    (7)

r'_1 = v_{p_1} - r_1 \times R'_1    (8)

The value r_1 is associated to the base R and the value r'_1 is associated to the base R'. The number of bases B that can be stored in the second packed element, p_2, is given by:

p_2 = \left\lfloor \frac{\ln(2^{32} - 1) - \ln R'_1}{\ln B} \right\rfloor    (9)

The next split base, R_2, can be written as:

R_2 = \left\lfloor \frac{2^{32} - 1}{R'_1 \, B^{p_2}} \right\rfloor    (10)

With the base R_2, the second packed element can be calculated as:

s_2 = r_2 + R_2 \times \left( \sum_{k=1, k \ne p_1}^{p_1} v_{p_1 + k} \, B^{p_1 - k} + r'_1 \, B^{p_1} \right)    (11)

Equation (5) can be used to calculate R'_2.
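To make the split concrete, the short sketch below evaluates equations (3) to (8) for an example base. It is a numerical illustration written for this article, with names chosen by us, not code from the paper.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t B = 1000;                      // example compression base
    const double maxWord = 4294967295.0;          // 2^32 - 1

    // Eq. (3): number of full bases B fitting in the first packed word.
    const uint32_t p1 = uint32_t(std::log(maxWord) / std::log(double(B)));

    // Eq. (4): split base stored in the current word.
    uint64_t Bp1 = 1;
    for (uint32_t i = 0; i < p1; ++i) Bp1 *= B;
    const uint64_t R1 = 4294967295ULL / Bp1;

    // Eq. (5): complementary base stored in the next word (ceiling of B / R1).
    const uint64_t R1p = B / R1 + (B % R1 != 0 ? 1 : 0);

    // Eqs. (7)-(8): split the value v_{p1} into r1 (base R1) and r1' (base R1').
    const uint64_t v = 987;                       // example value in [0, B)
    const uint64_t r1  = v / R1p;                 // goes into the current word
    const uint64_t r1p = v - r1 * R1p;            // goes into the next word

    std::printf("p1=%u R1=%llu R1'=%llu  v=%llu -> r1=%llu r1'=%llu\n",
                p1, (unsigned long long)R1, (unsigned long long)R1p,
                (unsigned long long)v, (unsigned long long)r1,
                (unsigned long long)r1p);
    return 0;
}
```

With B = 1000 this gives p_1 = 3, R_1 = 4 and R'_1 = 250, so R_1 × R'_1 = 1000 ≥ B as required, and the value 987 is split into r_1 = 3 and r'_1 = 237.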

The compression of a whole vector can be done by using a mathematical series to calculate the split base for each packed element. Assuming the base R'_0 = 1 for the first step, the mathematical series used to compress an entire vector of unsigned int can be written as follows (for 0 < i ≤ n_p, where n_p is the number of packed elements):

p_i = \left\lfloor \frac{\ln(2^{32} - 1) - \ln R'_{i-1}}{\ln B} \right\rfloor

R_i = \left\lfloor \frac{2^{32} - 1}{R'_{i-1} \, B^{p_i}} \right\rfloor

R'_i = \left\lceil \frac{B}{R_i} \right\rceil

q_i = i - 1 + \sum_{k=1}^{i-1} p_k \quad (\text{with } q_0 = 0)

r_i = \left\lfloor \frac{v_{q_i + p_i}}{R'_i} \right\rfloor

r'_i = v_{q_i + p_i} - r_i \times R'_i

s_i = r_i + R_i \times \left( \sum_{k=1, k \ne p_i}^{p_i} v_{q_i + k + 1} \, B^{p_i - k} + r'_{i-1} \, B^{p_i} \right)

Where p_i is the number of bases B that can be stored in the i-th packed element, R_i and R'_i are the split bases, r_i and r'_i their corresponding values, q_i is used to know how many elements have been packed until the i-th packed element, and finally s_i is the value of the i-th packed element.

3.3 Blocked compression method

We observed that the smaller the signal range, the more efficient is the compression. The advanced compression method presented above is particularly efficient to compress white noise with a small spread. Conversely, if the gaussian noise or the poissonian signal is spread out, the efficiency decreases. However, the efficiency can be improved by dividing the vector into blocks, to diminish the impact of the higher values on the global compression ratio.

The block efficiency and the determination of the block size will be discussed in section 4.2.

4 Experiments and analysis

In order to test and evaluate the performance of the previously described polynomial compression method, in the following Monte Carlo simulated distributions will be used. These distributions are in agreement with the measured data from Cherenkov cameras (see figure 5).

Fig. 5 This figure illustrates the typical signal distribution obtained in several of the cameras used in CTA [14].

4.1 Simulation of the distribution

The data can be described by a random gaussian distribution with a given standard deviation (the white noise in the cameras' signals) and by adding a uniform distribution in the given camera's signal range (the physics signal). Consider the set of the camera pixels distribution values, A:

A(\mu, \sigma, x, y, a, N) = \mathcal{N}^{N-a}(\mu, \sigma) \cup \mathcal{U}^{a}(x, y)    (12)

Where:
– µ : gaussian noise mean
– σ : gaussian noise standard deviation
– (x, y) : range of the uniform signal values
– a : number of values in the uniform distribution (signal)
– N : total number of values in the vector

(N denotes the normal distribution, the simulated noise, and U describes a uniform distribution, the simulated signal.)
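A vector following the set A of equation (12) can be generated with the C++ standard library, as in the sketch below. The helper name and the rounding of the normal draws to non-negative integers are our own assumptions about the digitization, not part of the published tools.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draws N values: N - a from a Gaussian N(mu, sigma) (the white noise)
// and a from a uniform U(x, y) (the physics signal), as in eq. (12).
// Assumes a <= N.
std::vector<uint32_t> simulateA(double mu, double sigma,
                                uint32_t x, uint32_t y,
                                size_t a, size_t N, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> noise(mu, sigma);
    std::uniform_int_distribution<uint32_t> signal(x, y);
    std::vector<uint32_t> v(N);
    for (size_t i = 0; i < N - a; ++i) {
        double d = std::max(0.0, std::round(noise(gen)));  // digitized, non-negative
        v[i] = static_cast<uint32_t>(d);
    }
    for (size_t i = N - a; i < N; ++i) v[i] = signal(gen);
    return v;
}

// Example matching the parameters of figure 6:
// auto pixels = simulateA(3000.0, 500.0, 2000, 45000, 9, 1855);
```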

An example of simulation is presented in figure 6.

Fig. 6 Typical simulated distribution used to improve the data reduction, in the set A(3 000, 500, 2 000, 45 000, 9, 1855). In this case, the gaussian distribution represents 99% of the pixels' values and the uniform distribution represents 1% of this distribution.

In this paper we have tested the distribution A with µ = 3 000, σ ∈ [100, 10 000], x = 2 000, y ∈ ⟦20 000, 100 000⟧, a ∈ ⟦1, 1 000⟧ and N ∈ ⟦1855, 10 000⟧.

4.2 Polynomial reduction on given distributions

Fig. 7 Top panel: illustration of the compression ratio versus the number of elements in the blocks used to compress a vector of data. The red points (+) give the compression ratio for a distribution with σ = 500 and range = 45 000, in A(3 000, 500, 2 000, 45 000, 4, 10 000). The blue points (∗) give the compression ratio for a distribution with σ = 500 and range = 100 000, in A(3 000, 500, 2 000, 100 000, 4, 10 000). The green points (×) give the compression ratio for a distribution with σ = 1 000 and range = 45 000, in A(3 000, 1 000, 2 000, 45 000, 4, 10 000). The tails of the plots give the compression ratio of the advanced polynomial reduction method. Bottom panel: the same plot zoomed.

The implementation of the blocked polynomial reduction has been tested on given distributions. This test determines the influence of the distribution parameters on the compression ratio. As the polynomial compression uses statistical properties to compress data, the test can only be done with a set of distributions. Figure 7 shows the compression ratio for 1 000 vectors with A(3000, 500, 2000, 45 000, 4, 1855) used to compute the variations (red curve). The gaussian σ variation has a high influence on the final compression ratio, of the order of 25% from σ = 1000 to σ = 500 in the best block size case. The signal range influence is lighter, 5% or 10% depending on the block size, and 5% for the best block size. The block size choice is important too. The compression ratio is weaker if the blocks are too long: 25% lower compression for σ = 500 and 30% for σ = 1000. In this case, using blocks of 154 elements allows a compression ratio of 2.47054, which is 17% larger than the basic compression (2.10361).
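This kind of block-size scan can be estimated with a few lines, assuming the per-word count of equation (1) for each block and neglecting the small per-block header (minimum and base); both simplifications are ours, so the numbers will only approximate the measured curves.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Estimate the blocked compression ratio: each block of `blockSize` values is
// packed with its own base B = max - min + 1, p values per 32-bit word (eq. 1).
double blockedRatio(const std::vector<uint32_t>& v, size_t blockSize) {
    if (v.empty() || blockSize == 0) return 1.0;
    size_t packedWords = 0;
    for (size_t start = 0; start < v.size(); start += blockSize) {
        size_t end = std::min(start + blockSize, v.size());
        auto mm = std::minmax_element(v.begin() + start, v.begin() + end);
        double B = double(*mm.second) - double(*mm.first) + 1.0;
        size_t p = (B < 2.0) ? 64                 // constant block: cap the packing
                 : size_t(std::log(4294967295.0) / std::log(B));
        packedWords += (end - start + p - 1) / p; // words needed for this block
    }
    return double(v.size()) / double(packedWords);
}

// Example scan: for (size_t bs = 16; bs <= 2048; bs *= 2) { blockedRatio(v, bs); }
```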

Fig. 8 Illustration of the signal/noise ratio influence on the final compression ratio for a vector of 10 000 values. The red points (+) give the compression ratio for a distribution with σ = 500 and range = 45 000. The blue points (∗) give the compression ratio for a distribution with σ = 500 and range = 100 000. The green points (×) give the compression ratio for a distribution with σ = 1 000 and range = 45 000. The compression ratio is constant from 10%.

Figure 8 shows that the signal/noise ratio has a high influence on the compression ratio if the spread is not too high. On the contrary, if the noise spread is important (green curve), the signal/noise ratio has less influence on the final compression ratio.

4.3 Polynomial reduction on CTA data

In the previous section 4.2 we have described the compression ratios obtained with the blocked polynomial reduction applied on modelled/simulated data distributions. Such an improvement cannot reflect properly the compression ratio with physics data, since these result typically from a superposition of several distributions coming from several photo-sensors (e.g. pixels) read simultaneously.

We have therefore tested our compression method on Monte Carlo simulated CTA-like data (i.e. Cherenkov light emitted by atmospheric electromagnetic showers and captured by cameras on telescopes). Only shower pictures registered in stereoscopy by more than one telescope are recorded. Each telescope's camera produces individually a file containing its own picture (also called an "event") resulting from the different signals registered by all pixels/photo-sensors of the camera itself. An event data-file is then composed of a header, used to describe properties like its timestamp, plus the recorded camera data, e.g. either the integrated signal from all pixels and/or the dynamical evolution of the signal in time (waveform). The event data file can have a size of typically several tens of thousands of bytes. A selection on the pixels to be saved will likely be applied in the acquisition pipeline in order to reduce the final data rate.

Among the various specifications that have to be fulfilled by the CTA data format, each data file has to be readable by part, to enable the access to its header without requiring a full decompression step. Therefore, a blocked compression is allowed.

Our test was performed on CTA Monte-Carlo Prod 3 [15] files, which simulate telescopes observing the Cherenkov light emitted by particles' showers in the atmosphere and are used to characterize the scientific output of CTA in experimental conditions; thus they are reasonably realistic. The CTA Monte Carlo data are converted into a specific high performance data format, which stores pixels' values in 16 bits to enable fast computation. The test files contain 624 telescopes, 22 000 images, 7.6 GB of data in waveform mode and 474 MB of integrated signal. Each image is composed of the lighted pixels concerned by both the noise as well as by the genuine signal and having an almost elliptical shape (see figure 9).

In the following we present the way to adapt the polynomial compression method to the CTA prerequisites. All tests are executed on an Intel core i5 clocked at 2.67 GHz with SSE4 instructions, without an SSD disk.

4.3.1 Test on waveform CTA data

The waveform data of CTA record the electromagnetic showers' expansion. Thus, each pixel has values in time. The number of values depends on the camera type. Each value is digitized in 12 or 16 bits and is stored in 16 bits for computing reasons.

Table 1 The polynomial compression ratio, time and compressed file size compared to LZMA (the best existing compression). The tested file is the full waveform simulation of the PROD_3 (run 3998) of the CTA experiment. The used CPU was an Intel core i7 M 560 with 19 GB of RAM installed with a Fedora distribution.

| Method | Compression ratio | Compression elapsed time | File size (GB) | Decompression elapsed time | Compression elapsed time (RAM) | Decompression elapsed time (RAM) |
| No compression | 1 | 0 | 7.6 | 0 | 0 | 0 |
| Advanced Polynomial Reduction | 2.71 | 5 min 36.025 s | 2.8 | 2 min 35 s | 2 min 56 s | 2 min 30 s |
| BZIP2 | 2.62 | 19 min 18.247 s | 2.9 | 2 min 23.676 s | - | - |
| LZMA (-mx=9 -mfb=64 -md=200m) | 6.52 | 2 h 14 min 44 s | 1.166 | 1 min 49.689 s | 2 h 03 min 44 s | 2 min 03 s |
| LZMA (-mx=1 -mfb=64 -md=32m) | 5.88 | 14 min 00 s | 1.293 | 2 min 10 s | 15 min 42 s | 2 min 09 s |
| Poly + LZMA (9 16 32) | 5.02 | 10 min 49 s | 1.513 | 2 min 13 s | 4 min 57 s | 2 min 15 s |

Fig. 9 Illustration of the ellipse shape of a particles' shower recorded by a camera in the CTA Monte-Carlo. The colour scale represents the number of photons detected in a pixel.

To improve waveform compression and enable High Performance Computing (CPU data pre-fetching and vectorization) we choose to store them into matrices. The matrix element M_{i,j} corresponds to the i-th time of the j-th pixel of the current camera. The row alignment enables a better compression because it increases the number of sequences of similar values. This configuration enables the optimisation of the waveforms' pictures computing.
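The following sketch shows one possible row-aligned layout for such a waveform matrix. The class name and the exact row/column orientation are our own illustration of the idea, not the data format used by the authors.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-aligned storage of waveforms: element at (i, j) is the i-th time sample of
// the j-th pixel, kept contiguous row by row so that one row holds all pixels
// at the same instant (mostly pedestal noise, hence long runs of close values).
struct WaveformMatrix {
    size_t nSamples = 0;                 // number of time samples (rows)
    size_t nPixels  = 0;                 // number of pixels (columns)
    std::vector<uint16_t> data;          // row-major, 16-bit storage as in the text

    WaveformMatrix(size_t samples, size_t pixels)
        : nSamples(samples), nPixels(pixels), data(samples * pixels, 0) {}

    uint16_t& at(size_t i, size_t j) { return data[i * nPixels + j]; }

    // Fill one pixel trace; the scattered writes here trade off against the
    // contiguous, vectorization-friendly reads at compression time.
    void setTrace(size_t pixel, const std::vector<uint16_t>& trace) {
        for (size_t i = 0; i < trace.size() && i < nSamples; ++i)
            at(i, pixel) = trace[i];
    }
};
```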

Our polynomial compression reduces the test file of 7.6 GB to a file of 2.8 GB (compression ratio of 2.71) in 5 min 36 s. Table 1 compares our results with classical methods. We test the BZIP2 and LZMA algorithms over the whole file. Our method is 3.5 times faster than the BZIP2 algorithm and offers a better compression. For the LZMA algorithm, the program is used in two tests.

Fig. 10 Comparison of the compression ratio for different compression block sizes on the CTA PROD_3 Monte-Carlo.

First, we investigate how to reach the best compression ratio with the command line 7z a -t7z -m0=lzma -mx=9 -mfb=64 -md=200m -ms=on. Thus, the test file is compressed with a compression ratio of 6.52 in 2 h 14 min 44 s. We also investigate the fast compression mode of LZMA (7z a -t7z -m0=lzma -mx=1 -mfb=64 -md=32m -ms=on) and we obtain a compression ratio of 5.88 in 14 min 00 s. Finally, we combine the polynomial compression with LZMA and obtain a compression ratio of 5.02 in only 10 min 49 s. The decompression times of the different algorithms are similar.
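The combination is a simple two-stage pipeline: the polynomially packed words are handed to a general-purpose byte compressor. The sketch below uses zlib's one-call API as a stand-in back end because its interface is compact; the measurements in this article were made with LZMA through 7z, so the figures obtained with this sketch will differ.

```cpp
#include <cstdint>
#include <vector>
#include <zlib.h>   // DEFLATE, used here only as a stand-in general-purpose back end

// Two-stage compression: the packed 32-bit words produced by the polynomial
// reduction (any of the packers sketched above) are fed to a byte compressor.
std::vector<unsigned char> packThenDeflate(const std::vector<uint32_t>& packed,
                                           int level = 6) {
    if (packed.empty()) return {};
    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(packed.data());
    const uLong srcLen = static_cast<uLong>(packed.size() * sizeof(uint32_t));
    std::vector<unsigned char> out(compressBound(srcLen));
    uLongf dstLen = static_cast<uLongf>(out.size());
    if (compress2(out.data(), &dstLen, bytes, srcLen, level) != Z_OK)
        out.clear();                     // propagate failure as an empty buffer
    else
        out.resize(dstLen);
    return out;
}
```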

However, the compression/decompression over a whole file is not in agreement with the data requirement (events or blocks have to be compressed separately). A solution is to compress several events packed in blocks, with a higher granularity level. With LZMA (native), this method achieves a compression ratio of 6 by compressing 300 events per block, which represents less than 0.02 second of signal for the LST-CAM. In this case, each block contains approximately 100 MB of data (depending on the camera).

4.3.2 Test on integrated CTA data

The integrated data are obtained by the reduction of the waveform signal. This reduction can be performed on all the pixels' waveforms or on several pixels' waveforms. In our case, we reduced the matrices of section 4.3.1 into vectors to enable High Performance Computing. The produced vector can be described by 24 or 32 bit data depending on the cameras.

The polynomial compression achieves a compression ratio of 3.74 in 3.7 s on a 474 MB test file.

Figure 10 shows the compression ratios obtained with different compression block sizes. The plot variations denote the different compression ratios from the different cameras in the file. Statistically, the camera data do not have the same compression ratio. This is why there are fluctuations. The difference with the simulated distribution comes from the 7 types of cameras. Each camera has a typical ellipse size. If one block contains the full ellipse signal, it is less compressed compared to others. The result is a better global compression.

Table 2 compares the polynomial compression with the LZMA and BZIP2 algorithms.

Table 2 The polynomial compression ratio, time and compressed file size compared to LZMA (the best existing compression). The tested file is the simulation of the PROD_3 (run 497) of the CTA experiment. The combination of our advanced polynomial compression and LZMA allows a compression as good as a pure LZMA compression but 19 times faster. The used CPU was an Intel core i5 M 560 with 8 GB of RAM installed with an 16.4 system.

| Method | Compression ratio | Elapsed time | File size (MB) | Compression elapsed time (RAM) | Decompression elapsed time (RAM) |
| No compression | 1 | 0 | 474 | 0 | 0 |
| Advanced Polynomial Reduction | 3.74 | 3.7 s | 127 | 0.9 s | 0.9 s |
| BZIP2 | 4.69 | 1 min 48 s | 101 | 1 min 0 s | 6.23 s |
| LZMA (7z) | 4.84 | 7 min 48.636 s | 98 | 1 min 18 s | 9.28 s |
| Advanced Polynomial Reduction + LZMA | 4.84 | 24.646 s | 98 | 1 min 20 s | 11 s |

The BZIP2 algorithm provides a better compression ratio (4.69) but in 1 min 48 s (29 times slower than our compression). The best compression ratio is obtained with the LZMA algorithm (4.84) but in 7 min 48 s (126 times slower than our compression).

By combining our advanced polynomial reduction with the LZMA compression we obtain the same compression ratio as pure LZMA, but 19 times faster. Moreover, the use of the polynomial reduction helps LZMA because it packs small values into larger ones: the average value increases and the byte profile becomes flatter. The polynomial reduction also allows a better compression ratio than a classical bit-shifted compression because the space left in an unsigned int is used.

Extrapolating to the CTA yearly data rate of 4 PB [16], the usage of the LZMA algorithm in this case would require more than 1750 core-years only for the compression.
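The advantage over bit-shifted packing mentioned above can be checked by comparing how many values fit in a 32-bit word under both schemes; this is an illustrative calculation written for this article, not a measurement from the paper.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Values per 32-bit word for a classical bit-shifted packing (each value padded
// to a whole number of bits) versus the polynomial packing of equation (1).
int main() {
    for (uint32_t B : {5u, 17u, 100u, 1000u, 40001u}) {
        uint32_t bits = uint32_t(std::ceil(std::log2(double(B))));
        uint32_t perWordShift = 32 / bits;
        uint32_t perWordPoly  = uint32_t(std::log(4294967295.0) / std::log(double(B)));
        std::printf("B=%6u  bit-shifted: %2u values/word  polynomial: %2u values/word\n",
                    B, perWordShift, perWordPoly);
    }
    return 0;
}
```

For B = 17 the polynomial packing stores 7 values per word instead of 6, and for B = 5 it stores 13 instead of 10; for bases close to a power of two the two schemes coincide.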

5 Bytes occurrences

For further test purposes, one can compare the distributions of the byte values in the initial file and in the compressed one, with the purpose of verifying that the compression algorithm has not altered the physical distributions. At this point the file cannot be further compressed with a lossless compression [3].

Figure 11 shows the different byte-value distributions in the initial file and in the file compressed with a polynomial reduction, LZMA and BZIP2, for the integrated test files.

Fig. 11 Comparison of the different values of bytes in a binary file. In red, the profile of the uncompressed file (CTA PROD_3 Monte-Carlo). In green, the profile of the only polynomially reduced file. In blue, the profile of the corresponding compressed file with the polynomial reduction and LZMA compression. In purple, the profile of the corresponding compressed file with the polynomial reduction and GZIP compression. In cyan, the profile of the corresponding compressed file with the advanced polynomial reduction and LZMA compression.

This figure shows that the polynomial compression smoothens the profile. The best compression is obtained with the LZMA compression on a file compressed with a polynomial reduction, because its profile is flat. The combination of the polynomial compression and the BZ2 algorithm does not provide a better compression ratio or a faster compression speed.
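A byte-occurrence profile like the one in figure 11 can be computed with a short helper; the function below is our own illustration and not part of the published tools.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Byte-value profile of a file: the flatter the histogram, the closer the file
// is to incompressible white noise for a byte-oriented lossless compressor.
std::array<uint64_t, 256> byteOccurrences(const char* path) {
    std::array<uint64_t, 256> counts{};
    if (std::FILE* f = std::fopen(path, "rb")) {
        unsigned char buffer[1 << 16];
        size_t n;
        while ((n = std::fread(buffer, 1, sizeof(buffer), f)) > 0)
            for (size_t i = 0; i < n; ++i) ++counts[buffer[i]];
        std::fclose(f);
    }
    return counts;
}
```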

6 Conclusion

In this article, we introduced a new lossless compression algorithm to compress integers coming from digitized signals and dominated by a white noise.

This method is very fast, helps CPU data pre-fetching and eases vectorization. It can be integrated in each data format that deals with tables or matrices of integers. This method compresses preferentially integers, but an adaptation of this algorithm can enable floating-point data compression with a fixed precision that returns integers. It compresses matrices and tables separately in order to keep a similar data structure between compressed and uncompressed data. The decompression is roughly twice as fast as the compression. This method can also be vectorized to improve its speed.

Tests on CTA Monte-Carlo data show that the polynomial compression is less efficient than LZMA but more efficient than BZIP2 on waveform data. The integrated data compression is very efficient and fast. Used as a pre-compression for LZMA, we obtain the same compression ratio as pure LZMA but in a compression duration 19 times shorter.

The method's simplicity offers easy development in many languages and the possibility to be used on simple embedded systems, or to reduce the data volume produced by highly sensitive sensors on FPGA. It can also be used as a pre-compression for stronger methods (like LZMA) and accelerate them.

Acknowledgements This work is realised under the Astronomy ESFRI and Research Infrastructure Cluster (ASTERICS project) supported by the European Commission Framework Programme Horizon 2020 Research and Innovation action under grant agreement n. 653477.

References

1. T. Berghofer et al. Towards a Model for Computing in European Astroparticle Physics. 2015.
2. P. J. Hall (ed). An SKA engineering overview. SKA Memorandum 91, 2007.
3. Abraham Lempel and Jacob Ziv. Lempel–Ziv–Markov chain algorithm. 1996.
4. Jacob Ziv. A constrained-dictionary version of LZ78 asymptotically achieves the finite-state compressibility for any individual sequence. CoRR, abs/1409.1323, 2014.
5. Julian Seward. Burrows–Wheeler algorithm with Huffman compression. 1996.
6. Jean-loup Gailly and Mark Adler. GNU gzip. 1992.
7. Zstandard. 2015.
8. F.M.J. Willems, Y.M. Shtarkov and T.J. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41, 2002.
9. Abraham Lempel and Jacob Ziv. Lempel–Ziv lossless data compression algorithms. 1977.
10. Jan Platos and Jiri Dvorský. Word-based text compression. CoRR, abs/0804.3680, 2008.
11. J. Cleary and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, COM-32, 1984.
12. M. L. Ahnen et al. Data compression for the first G-APD Cherenkov telescope. 2015.
13. William Pence, Rob Seaman, Richard L. White. A tiled-table convention for compressing FITS binary tables. 2010.
14. CTA Consortium. Introducing the CTA concept. Astroparticle Physics, 43:3–18, 2013.
15. T. Hassan, L. Arrabito, K. Bernlöhr, J. Bregeon, J. Hinton, T. Jogler, G. Maier, A. Moralejo, F. Di Pierro, M. Wood, for the CTA Consortium. Second large-scale Monte Carlo study for the Cherenkov Telescope Array. ArXiv e-prints, August 2015.
16. CTA Consortium. CTA data management technical design report version 2.0. 2016.

The source code of the polynomial compression method discussed in this work is available at https://gitlab.in2p3.fr/CTA-LAPP/PLIBS_8.