High Speed Lossless Image Compression
Hendrik Siedelmann 1,2, Alexander Wender 1, and Martin Fuchs 1
1 University of Stuttgart, Germany
2 Heidelberg University, Germany

Abstract. We introduce a simple approach to lossless image compression, which makes use of SIMD vectorization at every processing step to provide very high speed on modern CPUs. This is achieved by basing the compression on delta coding for prediction and bit packing for the actual compression, allowing a tunable tradeoff between efficiency and speed via the block size used for bit packing. The maximum achievable speed surpasses main memory bandwidth on the tested CPU, as well as the speed of all previous methods that achieve at least the same coding efficiency.

1 Introduction

For applications which need to process large amounts of image data, such as high speed video or high resolution light fields, the I/O subsystem can easily become the bottleneck, even when using a RAID0 or SSDs for storage. This makes compression an attractive tool to widen this bottleneck, increasing overall performance. As lossy compression incurs signal degradation and may interfere with further processing, we only consider lossless compression in the following. While dictionary based compression methods like lzo [18] can reach a high bandwidth, the compression ratio of such generic methods is quite low compared to dedicated image compression methods. However, research in lossless image compression has mainly been concerned with maximizing the compression ratio, and even relatively fast image compression schemes like jpeg-ls cannot keep up with the transfer rates of fast I/O configurations.

In this article, we present a simple lossless image compression scheme which makes use of the SIMD instructions of modern CPUs to achieve extremely high performance. Our implementation allows a configurable tradeoff between speed and compression, and exceeds the memory transfer speeds of modern CPUs in its fastest configuration, allowing the incorporation of lossless compression into many areas where compression was not feasible before. Our specific use case, which motivated this work, is an example of such an application: recording a very large and dense light field data set (several terabytes in size), requiring a continuous compression bandwidth of 360 MiB/s, a rate which the I/O configuration in use could not guarantee on its own.

[Fig. 1: scatter plot of single core bandwidth in MiB/s (log scale) against compression ratio for all tested methods.]

Fig. 1. Bandwidth vs. compression ratio for all tested methods, showing single core results on an Intel Core i7-2600 CPU on the "blue sky" test sequence from the SVT data set [12]. "BBP" denotes our method for several block sizes from 2048 down to 4 bytes. An example 4xRAID0 configuration and main memory bandwidth are included for comparison.
The simdcomp method is the C implementation of [14] and can, in general, not compress image data, but is shown for bandwidth comparison. For lzo and ffvhuff all configurations are shown, as well as all available presets for x264 and x265.

1.1 Applications of Lossless Image Compression

Compression is always a tradeoff of processing resources against bandwidth and storage. While lossy compression provides high compression ratios, its application is limited to areas where the distortion of the signal due to lossy compression does not pose a problem. But in some cases losses are not acceptable; examples include reference data sets for image processing, or sophisticated image processing like super-resolution. In this case, lossless compression may still provide an overall performance increase due to bandwidth savings as well as reduced storage requirements. However, lossless compression achieves lower compression ratios than lossy compression, which reduces the gain obtained from using it. In many capture scenarios, the required bandwidth for on-the-fly compression, before initial storage, imposes a hard constraint. This is addressed by the proposed scheme, as its compression bandwidth can be adjusted to speeds that surpass memory transfer speeds, a hard limit for any hardware configuration.

2 Related Work

In the following, we briefly discuss lossless compression methods that focus on high speed; see Fig. 1 for an overview of the evaluated methods. The methods are ordered by the dimensionality of the compression scheme, starting with generic 1D compression algorithms. Any compression method with lower dimensionality may be used to compress data of higher dimensionality at reduced efficiency, as correlation in the missing dimensions cannot be exploited. On the other hand, lower dimensionality can lead to lower algorithmic complexity and therefore higher speed. While the method discussed in this paper is a pure image compression method, we regard and evaluate compression from the perspective of compressing a video stream, which allows the inclusion of advanced lossless video compression methods like AVC and HEVC that achieve higher compression ratios by exploiting the 3D correlation. Moreover, applications for high bandwidth image processing, like light field or high speed video capture, may allow the use of video compression methods, making an evaluation based on video data adequate.

2.1 Dictionary and 1D Compression

General purpose compression methods like deflate (zlib/gzip) [10] or bzip2 [23] achieve good compression ratios for most types of inputs, but are quite slow, with a maximum bandwidth of around 30 MiB/s on an off-the-shelf Intel Core i7-2600 CPU. Faster methods, including lzo, lz4 and gipfeli [18,9,15], are more directly based on the original LZ77 compression scheme [28] and provide a bandwidth of up to 152 MiB/s in our configuration, see Fig. 1. This performance has led to their use for lossless 4K image transmission [11], but compared to image based methods the compression ratio is rather poor, as dictionary based methods are not well suited to the task of image coding. Specialized methods adapted to specific tasks, like integer or floating point compression, reach much higher speeds of several gigabytes per second by using a simple bit packing approach [8,14] and exploiting the SIMD instructions provided by modern CPU architectures [14]. However, those methods operate on 32 bit integers or 64 bit floating point values and are thus not directly applicable to the compression of 8 bit image data. We have developed an algorithm which uses SIMD bit packing for the actual compression, combining it with a prediction scheme and small block sizes, allowing efficient compression of image data; see section 3 for details.
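To illustrate the combination of delta coding and per-block bit packing on 8 bit data, the following is a minimal scalar sketch; the vectorized SIMD formulation and the exact stream format of our method are given in section 3, and the function names, the zigzag mapping and the one byte block header used here are illustrative assumptions rather than our actual implementation.

    #include <stddef.h>
    #include <stdint.h>

    enum { B = 16 };  /* example block size in bytes; Fig. 1 evaluates
                         block sizes from 4 to 2048 bytes */

    /* Map a wrapped 8 bit delta, read as signed, to an unsigned code so
       that small magnitudes yield small values:
       0,-1,1,-2,2,... -> 0,1,2,3,4,... */
    static uint8_t zigzag(uint8_t delta)
    {
        int8_t d = (int8_t)delta;
        return (uint8_t)((d << 1) ^ (d >> 7));
    }

    /* Compress one block of B pixels: delta code each pixel against its
       predecessor, find the widest residual in the block, and store only
       that many bits per value after a one byte header holding the width.
       Returns the number of bytes written to out. */
    size_t pack_block(const uint8_t *px, uint8_t prev, uint8_t *out)
    {
        uint8_t v[B];
        unsigned bits = 0;

        for (int i = 0; i < B; i++) {
            v[i] = zigzag((uint8_t)(px[i] - prev));  /* prediction residual */
            prev = px[i];
            while (v[i] >> bits)                     /* track max bit width */
                bits++;
        }

        out[0] = (uint8_t)bits;                      /* per-block header */
        size_t n = 1;
        uint32_t acc = 0;
        unsigned fill = 0;

        for (int i = 0; i < B; i++) {                /* the actual bit packing */
            acc |= (uint32_t)v[i] << fill;
            fill += bits;
            while (fill >= 8) {
                out[n++] = (uint8_t)acc;
                acc >>= 8;
                fill -= 8;
            }
        }
        if (fill)                                    /* flush remaining bits */
            out[n++] = (uint8_t)acc;
        return n;
    }

Decompression reverses these steps; since the delta wraps modulo 256 and the zigzag mapping is a bijection on 8 bit values, the pixels are recovered exactly. A constant image region costs a single header byte per block, while the worst case adds only one byte of overhead per block; smaller blocks adapt better to local image statistics and thus compress better, at the cost of speed (cf. the BBP curve in Fig. 1).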
2.2 Image Compression

Dedicated lossless image compression methods like jpeg-ls [27] peak at around 25 MiB/s, with modifications reaching 75 MiB/s on an Intel Core i7-920 processor at 2.67 GHz [26]. The jpeg-ls codec was standardized in 1999 and is still widely used as the baseline for the evaluation of lossless image compression methods. To our knowledge, there are no significantly faster methods available, even though the gap to the fastest 1D compression method [14] amounts to more than two orders of magnitude. Later works mainly concentrate on increasing compression efficiency at reduced speed, which is not the focus of this work.

2.3 Video Compression

State of the art video standards like HEVC, as well as its predecessor AVC, include lossless profiles which provide high compression ratios. The ffvhuff coder from the ffmpeg library [3], based on HuffYUV [24], uses the jpeg-ls predictor (sketched at the end of this section) together with simple Huffman coding and obtains a speed of 343 MiB/s, the highest speed of any video codec we evaluated, see Fig. 1. Another interesting compressor is the ffv1 coder [17] at 44 MiB/s, also from ffmpeg, which combines a jpeg-ls style predictor with a context adaptive entropy coder based on principles similar to the one from AVC, with context adaption over several frames. The ffv1 coder achieves a compression ratio and speed competitive with AVC, see Fig. 1.

2.4 GPU Aided Compression

There have been various efforts to improve the performance of different compression techniques by using the massively parallel computation capabilities of modern GPUs. For 1D compression, a text based method has been ported to the GPU by Ozsoy et al. [21], demonstrating a speedup of up to 2.21x for the GPU solution. Floating point compression based on bit packing has also been shown to reach a speedup of 5x [20], but at the cost of reduced compression efficiency, and in comparison to a CPU version which does not make use of SIMD for bit packing as implemented in [14]. An approach for GPU based predictive lossless image coding has been presented by Kau and Chen, showing a speedup of up to 5x for a combined system utilizing GPU and CPU [13]; however, the absolute performance is quite low at less than 1 MiB/s. The cuj2k library, implementing jpeg2000 on the GPU, provides performance similar to a parallel jpeg2000 implementation for the CPU [2]. The work presented by Treib et al. [25], implementing a simple wavelet based compressor, achieves slightly worse compression than jpeg2000, but at a compression speed of up to 700 MiB/s, significantly higher than previous approaches. The difficulties in porting standard image compression methods to the GPU are analyzed by Richter and Simon [22], specifically for jpeg and jpeg2000. They conclude that especially the entropy coding, together with the codestream build-up, proves difficult for modern GPUs.
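For reference, the jpeg-ls predictor used by HuffYUV and, in adapted form, by ffv1 (Sec. 2.3) is the median edge detector (MED) of LOCO-I; a minimal sketch follows, with the function name being our own choice.

    #include <stdint.h>

    /* MED predictor from LOCO-I/jpeg-ls: predict the current pixel from
       its left (a), top (b) and top-left (c) neighbours. */
    static uint8_t med_predict(uint8_t a, uint8_t b, uint8_t c)
    {
        uint8_t mn = a < b ? a : b;
        uint8_t mx = a < b ? b : a;

        if (c >= mx) return mn;       /* c is a local maximum: pick the smaller */
        if (c <= mn) return mx;       /* c is a local minimum: pick the larger  */
        return (uint8_t)(a + b - c);  /* smooth area: planar (gradient) estimate */
    }

The two clamping cases act as a simple edge detector, selecting the neighbour on the far side of a suspected edge, while the fall-through case is the plane through the three neighbours; the resulting residuals are then entropy coded, with simple Huffman coding in HuffYUV and context adaptive coding in ffv1.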