Lossless Compression of Double-Precision Floating-Point Data for Numerical Simulations: Highly Parallelizable Algorithms for GPU Computing
IEICE TRANS. INF. & SYST., VOL.E95–D, NO.12 DECEMBER 2012, p.2778
PAPER: Special Section on Parallel and Distributed Computing and Networking

Mamoru OHARA†a) and Takashi YAMAGUCHI†, Members

SUMMARY: In numerical simulations using massively parallel computers like GPGPU (General-Purpose computing on Graphics Processing Units), we often need to transfer computational results from external devices such as GPUs to the main memory or secondary storage of the host machine. Since the size of the computation results is sometimes unacceptably large to hold, it is desirable that the data be compressed before being stored. In addition, considering the overhead of transferring data between the devices and host memories, it is preferable that the compression itself be performed as part of the parallel computation on the devices. Traditional compression methods for floating-point numbers do not always exhibit good parallelism. In this paper, we propose a new compression method for massively parallel simulations running on GPUs, in which we combine a few successive floating-point numbers and interleave their bytes to improve compression efficiency. We also present numerical examples of the compression ratio and throughput obtained from experimental implementations of the proposed method running on CPUs and GPUs.

key words: GPGPU, massively-parallel numerical simulation, data compression, double-precision floating-point data

Manuscript received January 10, 2012. Manuscript revised June 7, 2012.
†The authors are with the Tokyo Metropolitan Industrial Technology Research Institute, Tokyo, 135–0064 Japan.
a) E-mail: [email protected]
DOI: 10.1587/transinf.E95.D.2778
Copyright © 2012 The Institute of Electronics, Information and Communication Engineers

1. Introduction

Massively parallel simulation has become a commonly used scientific methodology with the popularization of COTS chip multiprocessors and coprocessors. Today, we can achieve billions of FLOPS by interconnecting PCs containing multicore CPUs with high-speed LANs. Moreover, even a PC employing a dedicated graphics processing unit (GPU) can attain over 500 GFLOPS in some parallel applications by using the dedicated unit for general-purpose processing, an approach called GPGPU (General-Purpose processing with GPU).

In numerical simulations using massively parallel computers like GPGPU or dedicated hardware accelerators, which are sometimes implemented in FPGAs (Field-Programmable Gate Arrays), we often need to transfer intermediate computational results from external devices such as GPUs to the main memory or secondary storage of the host machine in order to display them or save them for checkpointing [1], [2]. Since the scale of simulations expands along with the growth of computing power, the size of the computation results is sometimes unacceptably large to hold. It is therefore desirable that the data be compressed and stored; this is especially true for checkpoint data, which is used only when failures occur. Considering the overhead of transferring data between the devices and host memories, it is preferable that the data be shrunk on the devices themselves. For example, we can implement compressor circuits for simulation data in the same FPGA in which we also implement the accelerator circuits; some techniques for implementing such compressors in FPGAs have been proposed [3], [4]. On the other hand, while numerous lossy compression algorithms for graphics data running on GPUs have been reported, only a few target simulation data. In this paper, we propose GPGPU algorithms for compressing simulation results on GPUs.

For numerical simulations, compression must be lossless. Moreover, most numerical-simulation data are double-precision floating-point (FP) numbers, so the resulting data usually consist of many double-precision FP values. In the literature, only a few lossless compression techniques for FP data have been reported, and some of them are designed for specific uses: Isenburg et al. presented an arithmetic-coding-based method for compressing 3D geometric data [5]–[7], and Harada et al. developed compressors for MPEG-4 audio data [8]. These are not always suitable for compressing the results of numerical simulations.

FPC is a general-purpose lossless compression technique for double-precision FP numbers [9] (hereafter we simply refer to a double-precision FP number as a double). FPC compresses FP data well across a wide range of applications, such as message passing, satellite observation, and numerical simulations. For performance, FPC deploys a one-pass compression algorithm: it predicts each value in a data stream with a history-based technique and takes the difference between the true and predicted values. When the prediction is accurate, the difference is likely to contain many zero-valued bits. FPC compresses the leading zero-valued bytes of the difference by encoding the number of leading zero bytes (NLZ) into a few bits. Encoding the NLZ is expected to be more suitable for one-pass compression than entropy coding or dictionary-based methods.

Conventional one-pass prediction algorithms such as the one used in FPC have some disadvantages for massively parallel numerical simulations. Above all, these one-pass algorithms are designed for serial processing and are therefore not easy to implement on massively parallel computers. Although Burtscher et al. developed a parallel version of FPC, pFPC [10], it is suited to computers with a few multi-core processors rather than to massively parallel systems. Therefore, O'Neil and Burtscher recently proposed another parallel FPC, named GFC, whose prediction algorithm is thoroughly redesigned for GPGPU [11]. While GFC exceeds pFPC in compression throughput on GPUs by about five times, its mean compression ratio is much worse than that of pFPC. In addition, precise prediction of FP values is essentially difficult, especially for 64-bit doubles: we can usually obtain only a small number of leading zero bytes in FP data, and thus we cannot take full advantage of NLZ-based methods.

In this paper, we propose a highly parallelizable lossless compression technique for double data resulting from numerical simulations. A data stream produced by a numerical simulation is expected to have few discontinuities and to vary smoothly, so we suppose that its values can be predicted by interpolating polynomials. Similar to some previous work on arithmetic predictors [3], [4], [12], we construct a simple and highly parallelizable predictor based on the Newton polynomial. Using simple arithmetic predictors improves the parallelism of the compression technique. This approach is similar to GFC; however, by combining two different polynomials in the same manner as FPC, the accuracy of our predictor is comparable to that of FPC and much higher than that of GFC.

We also propose a recombination technique in which we combine and interleave the bytes of several FP numbers in order to improve the compression ratio. An FP number has two portions in terms of predictability: while the sign bit, the exponent, and a few high-order bits of the mantissa are relatively predictable, the low-order bits of the mantissa are very hard to predict. We combine four doubles and rearrange their byte order so that the relatively predictable portions are gathered at the front in order to produce long NLZ. It is expected that, through this rearrangement, we can efficiently compress the predictable portions with almost no negative effect on the compression of the unpredictable portions, which are barely compressible anyway. […] Finally, Sect. 6 concludes this paper.

2. Preliminaries

In this section, we discuss some compression methods for FP numbers. In various types of processors and software, including CPUs and GPUs, the FP number formats standardized in IEEE 754 [13] are used. We suppose that most numerical-simulation applications also adopt the IEEE 754 format; therefore, in this paper we present a technique for compressing floating-point data based on IEEE 754.

An IEEE 754 FP number has a sign bit followed by an le-bit exponent and an lm-bit mantissa. IEEE 754 defines a number of formats with different field widths, namely binary16 (half precision: le = 5, lm = 10), binary32 (single precision: le = 8, lm = 23), binary64 (double precision: le = 11, lm = 52), binary128 (quadruple precision: le = 15, lm = 112), and some decimal formats. In scientific applications the binary64 format, the so-called double-precision FP number, is widely used; thus, in this paper we mainly focus on the binary64 format. While the order in which the 8 bytes of a double are held in memory differs among processor architectures, in this paper we always place the sign bit at the highest position, with the exponent and fraction parts following, as shown in Fig. 1. The exponent field is biased by 1023 to avoid a separate sign bit for the exponent; the mantissa is normalized so that its integer part is always 1, and only the fraction part is held in the fraction field.

For compressing the results of scientific applications, we need lossless compression techniques. In the literature, only a few lossless compression techniques for FP data have been reported. Isenburg et al. presented an arithmetic-coding-based method for compressing 3D geometric data [5]–[7]. In their method, we first predict the coordinates of a point in a 3D mesh model, each of which is expressed as an FP number, from the coordinates of the adjacent points by some technique, typically a parallelogram predictor. Then we compress the difference between the true value and the predicted value with range coders [14],
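The binary64 layout described above (one sign bit, an 11-bit exponent biased by 1023, and a 52-bit fraction with an implicit leading 1) can be made concrete with a short sketch. The helper name below is ours, not from the paper:

```python
import struct

def double_fields(x: float):
    """Split a binary64 value into its (sign, biased exponent, fraction) fields."""
    # Big-endian view puts the sign bit at the highest position, as in Fig. 1.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # 11-bit field, biased by 1023
    fraction = bits & ((1 << 52) - 1)    # 52-bit fraction; integer part 1 is implicit
    return sign, exponent, fraction

# 1.0 = +1.0 x 2^0, so the stored exponent is 0 + 1023 and the fraction is 0.
sign, exp, frac = double_fields(1.0)
```

For example, `double_fields(-2.0)` yields sign 1 and stored exponent 1024, since -2.0 = -1.0 x 2^1.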
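The predict-difference-encode scheme that FPC builds on can be sketched as follows. This is our own simplified illustration, not FPC itself: it uses a trivial last-value predictor instead of FPC's hash-based predictors (or this paper's polynomial pair), and it encodes each XOR residual as a plain NLZ count followed by the remaining bytes rather than FPC's exact bit layout. All function names are ours.

```python
import struct

def to_bits(x: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def from_bits(bits: int) -> float:
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def nlz_bytes(residual: int) -> int:
    """Number of leading zero bytes in a 64-bit residual (0..8)."""
    for n in range(8):
        if (residual >> (56 - 8 * n)) & 0xFF:
            return n
    return 8

def compress(values):
    """One-pass scheme: predict, XOR, keep only the non-zero tail.

    Each double becomes (NLZ count, remaining residual bytes)."""
    out, prev = [], 0.0
    for x in values:
        predicted = prev  # trivial last-value predictor, for illustration only
        # An accurate prediction leaves many leading zero bytes in the residual.
        residual = to_bits(x) ^ to_bits(predicted)
        n = nlz_bytes(residual)
        out.append((n, residual.to_bytes(8, "big")[n:]))
        prev = x
    return out

def decompress(stream):
    """Inverse of compress(); the scheme is lossless."""
    values, prev = [], 0.0
    for n, tail in stream:
        residual = int.from_bytes(b"\x00" * n + tail, "big")
        x = from_bits(to_bits(prev) ^ residual)
        values.append(x)
        prev = x
    return values
```

A repeated value yields an all-zero residual (NLZ = 8) and costs only the count bits, which is why prediction quality translates directly into compression ratio.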
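Extrapolating a smooth stream with an interpolating (Newton) polynomial reduces to a small integer-weighted combination of recent samples, which is what makes such predictors cheap and easy to parallelize. The sketch below is our illustration of the idea, not the paper's exact predictor, which combines two different polynomials in the manner of FPC:

```python
from math import comb

def newton_predict(history, degree):
    """Extrapolate the next sample of a degree-`degree` polynomial fitted
    through the last degree+1 samples.

    Uses the fact that the (degree+1)-th forward difference of a
    degree-`degree` polynomial is zero, so the next value is an
    alternating binomial combination of the previous k = degree+1 values."""
    k = degree + 1
    return sum((-1) ** (i + 1) * comb(k, i) * history[-i] for i in range(1, k + 1))
```

With degree 1 this reduces to 2*x[n-1] - x[n-2]; with degree 2, to 3*x[n-1] - 3*x[n-2] + x[n-3]. Because each prediction depends only on a fixed window of past samples, every element of a block can be predicted independently by its own GPU thread.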
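The recombination technique amounts to a byte-wise transpose of a 4 x 8 block: the high-order (relatively predictable) bytes of four doubles are gathered at the front, and the noisy low-order mantissa bytes sink to the back. A minimal sketch, where the group size of four follows the text but the function names are ours:

```python
import struct

def interleave4(a, b, c, d):
    """Byte-transpose four doubles: byte 0 of each double first, then byte 1
    of each, and so on. Sign/exponent bytes end up gathered in front."""
    rows = [struct.pack(">d", x) for x in (a, b, c, d)]  # big-endian layout
    return bytes(rows[i][j] for j in range(8) for i in range(4))

def deinterleave4(block):
    """Invert interleave4(), recovering the original four doubles exactly."""
    rows = [bytes(block[j * 4 + i] for j in range(8)) for i in range(4)]
    return tuple(struct.unpack(">d", r)[0] for r in rows)
```

After this rearrangement, residuals of nearby values tend to agree in their leading bytes, so an NLZ-based coder sees longer zero runs than it would on the four doubles stored separately.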