
High-Throughput Lossy-to-Lossless 3D Image Compression

Diego Rossinelli, Gilles Fourestey, Felix Schmidt, Björn Busse, and Vartan Kurtcuoglu

Abstract— The rapid increase in medical and biomedical image acquisition rates has opened up new avenues for image analysis, but has also introduced formidable challenges. This is evident, for example, in selective plane illumination microscopy, where acquisition rates of about 1-4 GB/s sustained over several days have redefined the scale of I/O bandwidth required by image analysis tools. Although the effective bandwidth could, principally, be increased by lossy-to-lossless data compression, this is of limited value in practice due to the high computational demand of current schemes such as JPEG2000, which reach a compression throughput one order of magnitude below that of image acquisition. Here we present a novel lossy-to-lossless data compression scheme with a compression throughput well above 4 GB/s and compression rates and rate-distortion curves competitive with those achieved by JPEG2000 and JP3D.

Index Terms— Image compression, integration of multi-scale information, parallel computing.

I. INTRODUCTION

PERKEL has recently referred to the challenge of effectively handling very large sets of images as "the struggle with image glut" [1]. While this struggle is already evident in medical imaging, the development of adjacent fields foreshadows what is yet to come. For instance, selective plane illumination microscopy (SPIM), a tool employed, e.g., in developmental biology, may generate data at rates of up to 4 GB/s [2]. Image sets of 10-30 TB in size are not unusual. Other biomedical imaging modalities producing large footprint images include optical coherence tomography [3] and high-resolution peripheral quantitative computed tomography (HR-pQCT) [4], used clinically to assess bone structure and strength, particularly in the elderly. Such large footprints call for parallel file systems for archival as well as HPC clusters for image analysis, as illustrated by Reynaud et al. [5]. Furthermore, sharing those large images over the network is still a largely unsolved problem.

Over the last three decades, the signal processing community has developed sophisticated compression schemes for images, relying on the concept of multi-resolution analysis (MRA) and wavelets [6]–[11]. These schemes have brought two game-changing benefits: a substantial increase in effective storage capacity and a drastic increase in effective I/O and network bandwidth. Part of this research culminated in the JPEG 2000 standard [12], with 3D images considered in Part 10 [13], which is usually referred to as "JP3D". The bitstreams generated by JPEG2000 are not just compressed, but also lossy-to-lossless, scalable with respect to quality and resolution, and region of interest (ROI)-accessible. Lossy-to-lossless refers to the ability to read just a fraction of the compressed bitstream to get a reasonable approximation of the entire image, whereas reading the entire bitstream leads to a lossless decompression. Quality-scalable bitstreams provide us with the ability to control the distortion by prescribing a reading bitrate. The efficiency of a quality-scalable bitstream is generally characterized by its rate-distortion curve (r-d curve) in terms of peak signal-to-noise ratio (PSNR) versus bits-per-sample (BPS). ROI-accessibility allows us to read exclusively those portions of the bitstream describing a specific ROI, whereas resolution-scalable bitstreams directly expose sequences of bits representing a specific resolution.

While lossy-to-lossless is, principally, a promising approach to handling and sharing the image glut in medicine and the life sciences, the JPEG2000 and JP3D formats are inadequate: on the latest CPUs, their performance is one order of magnitude below what is required to keep up with the highest image acquisition rates [1], [5], [14]. Pursuing substantial improvements in compression speed, Amat et al. [14] proposed the Keller-Lab Block (KLB) file format, achieving a throughput of about 600 MB/s (which would correspond to 3 GB/s on the platforms considered here, assuming perfect scaling). This was bought at the expense of dropping all bitstream features but ROI-accessibility. As the file format name suggests, the raw file is decomposed into spatiotemporal tiles, hence the ROI-accessibility, and each block is independently compressed, exploiting the available thread-level parallelism (TLP).

Today's image analysis software is primarily limited by I/O bandwidth, much more so than by storage [1].

Manuscript received September 9, 2020; revised October 17, 2020; accepted October 20, 2020. Date of publication October 23, 2020; date of current version February 2, 2021. This work was supported, in part, by the Swiss National Science Foundation through NCCR Kidney.CH as well as through Grants 153523 and 182683. (Corresponding author: Diego Rossinelli.)
Diego Rossinelli is with the Institute of Physiology, University of Zurich, 8057 Zürich, Switzerland, and also with Lucid Concepts AG, 8005 Zürich, Switzerland (e-mail: [email protected]).
Gilles Fourestey is with SCITAS, EPFL, 1015 Lausanne, Switzerland (e-mail: gilles.fourestey@epfl.ch).
Felix Schmidt and Björn Busse are with the Center for Experimental Medicine, Institute of Osteology and Biomechanics, Universitätsklinikum Hamburg-Eppendorf (UKE), 20251 Hamburg, Germany (e-mail: [email protected]; [email protected]).
Vartan Kurtcuoglu is with the Institute of Physiology, University of Zurich, 8057 Zürich, Switzerland (e-mail: [email protected]).
Digital Object Identifier 10.1109/TMI.2020.3033456



Quality-scalable bitstreams directly address this issue. Visually lossless image previews can be achieved by reading perhaps just 10% of the bitstream, resulting in a 10X boost of the effective I/O bandwidth. While the KLB format typically leads to a 2:1 compression rate, it brings no benefits in terms of effective I/O bandwidth, as KLB bitstreams are not quality-scalable.

A. Contributions

The contributions of this article are as follows. We demonstrate that it is possible to devise data compression schemes generating scalable lossy-to-lossless bitstreams at throughputs exceeding the acquisition rates of the latest microscopes and scanners. Specifically, the scheme described herein leads to compressed bitstreams featuring:
• quality-scalability and ROI-accessibility
• multiresolution representation
• compression rates comparable to those of JP3D
• r-d curves comparable to those of JP3D
• throughput of 30 GB/s, per node
• lossless decompression throughput of 30 GB/s, per node
• lossy decompression throughput of 80 GB/s, per node
We are not aware of open source or commercial counterparts with comparable performance. The effective I/O bandwidth achieved with our scheme allows out-of-core analysis algorithms to be accelerated by one to two orders of magnitude.

In the following text we describe our approach and assess both timings and compression rates against the state of the art. The assessment relies on three datasets acquired with three different modalities: wide-field microCT, scanning electron microscopy (SEM) and HR-pQCT.

B. State of the Art

1) Wavelets: The term "wavelet" was introduced by Grossman and Morlet [6] and Morlet [7], and extends the work of Gabor [15], in which signals are decomposed into "logons", discretized elements in the time-frequency chart. The pioneering work of Meyer [16], Mallat [8]–[10], Daubechies [17], and Cohen et al. [18] has laid the mathematical foundation of MRA and the compactly supported biorthogonal wavelets, upon which a wide range of technologies relies today.

Despite their notable advantages, wavelets are far from perfect. Cohen [19] talks about "three major curses": inefficiency in describing irregular domains, inefficiency in representing signals with strong preferential directions, and high computational cost due to the associated hierarchical yet sparse data structures. Selesnick et al. report additional "troubles in paradise" [20], namely the high oscillations when manipulating wavelet coefficients, the non-translation-invariance of the discrete wavelet transform (DWT), and their non-ideal low-pass/high-pass filter banks.

Part of the research efforts to alleviate these limitations led to the lifting scheme introduced by Calderbank et al. [21], Sweldens [22], and Daubechies and Sweldens [23]. Sometimes also referred to as second-generation wavelets, the lifting scheme decomposes the DWT into a sequence of prediction and update steps, corresponding to a polyphase matrix factorization of the transform. Among several other advantages, this decomposition provides a two-fold algorithmic improvement over first-generation wavelets [23]. Integer wavelets, whose transform is implemented exclusively with integer operations [21], [24], show the power of the lifting scheme and have direct implications for data compression.
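The bitwise reversibility of an integer lifting transform can be made concrete with a short, self-contained sketch. The following C++ example implements one level of the 1D CDF 5/3 transform in its lifted, integer form (the reversible wavelet of JPEG2000); it assumes an even-length signal and uses simple boundary mirroring, and is an illustration only, not the code of any package discussed in this article.

```cpp
#include <cstdio>
#include <vector>

// Forward CDF 5/3 transform in lifted, integer form (one level, 1D).
// 'x' is split in place into n/2 low-pass followed by n/2 high-pass coefficients.
// Assumes an even signal length and mirrors samples at the borders.
static void fwt53(std::vector<int>& x) {
    const int n = static_cast<int>(x.size());
    std::vector<int> s(n / 2), d(n / 2);
    for (int i = 0; i < n / 2; ++i) { s[i] = x[2 * i]; d[i] = x[2 * i + 1]; }
    for (int i = 0; i < n / 2; ++i) {                      // predict step
        const int sr = (i + 1 < n / 2) ? s[i + 1] : s[i];  // mirror at the right border
        d[i] -= (s[i] + sr) >> 1;
    }
    for (int i = 0; i < n / 2; ++i) {                      // update step
        const int dl = (i > 0) ? d[i - 1] : d[i];          // mirror at the left border
        s[i] += (dl + d[i] + 2) >> 2;
    }
    for (int i = 0; i < n / 2; ++i) { x[i] = s[i]; x[n / 2 + i] = d[i]; }
}

// Inverse transform: undo the lifting steps in reverse order with the same rounding.
static void iwt53(std::vector<int>& x) {
    const int n = static_cast<int>(x.size());
    std::vector<int> s(x.begin(), x.begin() + n / 2), d(x.begin() + n / 2, x.end());
    for (int i = 0; i < n / 2; ++i) {
        const int dl = (i > 0) ? d[i - 1] : d[i];
        s[i] -= (dl + d[i] + 2) >> 2;
    }
    for (int i = 0; i < n / 2; ++i) {
        const int sr = (i + 1 < n / 2) ? s[i + 1] : s[i];
        d[i] += (s[i] + sr) >> 1;
    }
    for (int i = 0; i < n / 2; ++i) { x[2 * i] = s[i]; x[2 * i + 1] = d[i]; }
}

int main() {
    std::vector<int> x = {7, 9, 12, 200, 198, 14, 11, 8}, y = x;
    fwt53(y);
    iwt53(y);
    std::printf("bitwise reversible: %s\n", (x == y) ? "yes" : "no");
    return 0;
}
```

Because the inverse applies exactly the same integer expressions in reverse order, the round trip reproduces the input bit for bit, which is what makes such transforms attractive for lossless coding.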
2) Zerotree Codecs: A major advancement in image compression came in the form of zerotree codecs in conjunction with wavelets, such as the embedded zerotrees wavelet (EZW) codec proposed by Shapiro [25]. These codecs generate quality-scalable bitstreams by exploiting the parent-children coefficient correlation across the image's MRA. Zerotree-based codecs saw a peak in recognition with the work of Said and Pearlman [26], [27], where the reasons for the outstanding EZW efficiency were elucidated and the set partitioning in hierarchical trees (SPIHT) codec was outlined. The SPIHT codec is surprisingly compact and improves upon the compression results achieved by EZW.

With their ability to encode zerotrees - insignificant pyramidal regions within the MRA - with a single symbol, zerotree codecs [25]–[27] lead to outstanding compression rates. With respect to other 2D and 3D codecs, zerotree codecs provide us with several other advantages:
• compactness and low computational complexity
• flexibility in granularity
• capture of data correlation across subbands
Simplicity and compactness enable us to quickly assess (and often discard) ways to map these codecs on current CPUs. The ability to process groups of coefficients rather than individual coefficients gives us the flexibility to trade compression efficiency for speed.

3) JPEG2000: This standard, abbreviated as JP2K, is widely adopted by public institutions. It is based on the 5/3 and 9/7 "Daubechies wavelets", belonging to the family of biorthogonal smooth wavelets with compact support [18]. In JPEG2000, lossless compression is performed with the 5/3 wavelet: the associated filters translate into an instruction stream featuring exclusively integer operations and are thus bitwise reversible. The 5/3 wavelets are in a sense optimal as they represent the simplest symmetrical biorthogonal wavelets featuring two vanishing moments [28]. While exhibiting four vanishing moments and thus providing more competitive r-d curves, the 9/7 wavelets lead to a transform featuring floating point arithmetic operations, are bitwise irreversible and therefore not suitable for straightforward lossless compression.

A cornerstone of the JP2K standard is its codec: embedded block-coding with optimized truncation (EBCOT) [29]. This intra-band codec [13], as opposed to zerotree codecs, does not exploit the parent-children dependency in the MRA. An advantage of EBCOT is that wavelet coefficients are not necessarily computed by the dyadic transform outlined by Mallat, but can be obtained by the discrete packet wavelet transform (DPWT), sometimes resulting in sparser representations, hence higher compression rates. EBCOT is decomposed into two tiers: tier-1 is responsible for generating independently compressed bitstreams from blocks of coefficients, whereas tier-2 performs


the so-called post-compression rate-distortion (PCRD) optimization by coalescing and reordering the bitstreams into a single scalable one [29].

4) JP3D: Part 2 of JP2K allows the compression of 3D images by reinterpreting slices as different components and applying a multi-component transform (MCT) that implements a full wavelet transform along the depth axis, combined in turn with the vertical and horizontal wavelet transforms of Part 1. As described by Schelkens et al. [13], JP3D is the 3D extension of Parts 1 and 2 of JP2K. A considerable amount of research exists in the context of wavelet-based compression of volumetric images. Schelkens et al. [30] showed that wavelet-based compression significantly outperforms DCT-based techniques. Xiong et al. [24] benchmarked their newly designed integer wavelets against ESCOT, a variation of EBCOT, on volumetric medical images by comparing r-d curves and lossless compression rates. Zhang et al. [31] investigated the use of JP3D on hyperspectral imagery, concluding that JP3D is not significantly better than JP2K because of the weaker correlation across adjacently stacked hyperspectral images. As of today, JP3D represents the "gold standard" against which new 3D image compression schemes are benchmarked.

5) HTJ2K: The JPEG committee has recently released an additional chapter (Part 15) of the JP2K standard, known as High-Throughput JPEG 2000 (HTJ2K). HTJ2K aims at replacing EBCOT (Part 1) with a new block coding, improving the execution time by one or more orders of magnitude at the expense of coding efficiency, although the loss in efficiency was set to be 15% or less [32], [33]. The algorithm behind HTJ2K was derived from the fast block coding with optimized truncation (FBCOT) proposed by Taubman et al. [33]. While HTJ2K does mention JP3D and the compression of volumetric images [32], we could not find any report on the application of HTJ2K to 3D images. Quantitatively similar improvements have been achieved by Auli-Llinas et al. with bitplane image coding with parallel coefficients (BPC-BaCo) [34]. BPC-BaCo does not appear to have been designed with the compression of 3D images in mind.

6) H.265/HEVC: Because of their high throughput, video codecs represent a viable option for compressing volumetric images, although it is believed that the absence of rigid body motions and the higher bit depth of volumetric data affect their efficiency adversely [24]. H.265, also known as HEVC [35], is a state-of-the-art video codec able to generate lossy-to-lossless bitstreams from 2D + t footage.

Bui et al. [36] investigated the use of video codecs to compress medical volumetric images. H.264, an earlier codec, reached about 40 MB/s on a mid-range consumer CPU, corresponding to about 3 GB/s on the platforms considered here based on ideal projections. While this is, in principle, competitive to our work, the codec is lossy as the images need to be bit-shaved from 16 to 8 bits in a preprocessing step.

Bruylants et al. [28] benchmarked HEVC against JP3D. In 10 out of 11 medical images (computed tomography, magnetic resonance, and ultrasound images), JP3D outperformed the H.265 codec in all aspects: lossless compression rate, r-d curve, and throughput. The ideally projected throughput of their work [28] on our platforms is about 700 MB/s for JP3D and 70 MB/s for the HEVC codec compiled to support a bit depth of 16. To our knowledge, the coded bitstreams feature to date just ROI-accessibility on the temporal axis by placing 'key' frames, and poor quality-scalability, if any.

7) GPUs: Cornelis et al. [37] investigated the use of OpenCL [38] for the GPU-acceleration of IRIS [39], an implementation of the JP2K/JP3D standard. They report a ten-fold improvement in speed over the baseline, achieving an overall throughput of about 50-100 MB/s on a Tesla GPU with a peak memory bandwidth of 144 GB/s and a 1 TFLOP/s peak in single precision (SP). By designing a novel wavelet-based compression scheme, Treib et al. [40] achieved approximately 700 MB/s by processing 2D images on a GeForce GPU with 192 GB/s peak memory bandwidth and a 1.5 TFLOP/s SP peak. Balevic et al. developed CUJ2K [41], a CUDA implementation of JP2K Part 1 achieving approximately 22 MB/s. According to Richter and Simon [42], the relatively poor performance of GPU-enabled wavelet-based compression software can be attributed to the irregular pattern exhibited by the entropy coding, i.e. the final stage of these compression schemes.

Recently, Naman and Taubman [43] investigated the use of GPUs for accelerating HTJ2K, although their work only considered the decoding stage and focused only on 2D images. In spite of an 18:1 nominal peak performance ratio between GPU and CPU, they report an outperforming factor of just 3X or less. On the GPU, Enfedaque et al. reported an improvement factor of 25X of BPC-BaCo over Kakadu [44], a CPU implementation of HTJ2K. However, in their work, the nominal peak performance ratio between the GPU and CPU was only 5 to 1. It remains unclear why the authors did not compare against the CPU implementation of BPC-BaCo. No results on 3D images were reported there.

TABLE I
SUMMARY OF THE STATE OF THE ART

Table I summarizes the research efforts representing the state of the art. The field 'Name' denotes the name of the compression approach, 'Q-s' stands for quality-scalable, 'R-s' for resolution-scalable, 'ROI' for region-of-interest accessibility, 'IPT' for ideally projected throughput on the platforms considered herein, and 'SQ' for supported quantization (bit depth and signedness) of image pixels.

II. METHOD

In this section, we first introduce our data compression approach with its conceptual breakdown. We then elucidate the cornerstones of the method, namely second-generation wavelets, zerotree codecs and asymmetric numeral systems.


A. Conceptual Breakdown

In the broad field of image compression, the method described herein belongs to the family of "lossy plus lossless residual encoding" [46]. To work on very large datasets, we decompose images into Cartesian 3D subdomains, referred to as tiles, with memory footprints that fit into the last-level data cache. Tiles do not overlap and are processed independently.

Figure 1 depicts the stages of our compression and decompression schemes for a single tile. The first stage of the encoding is the forward wavelet transform ("3D FWT") of the tile. The wavelet coefficients, represented as floating point numbers, are then used to generate integer coefficients with a dead-zone quantizer ("Quantizer"). The quantized coefficients are subsequently passed on to a zerotree codec ("3D Codec"). The output of the latter is further compressed by an entropy coder ("EC").

Fig. 1. The numerical stages in our data compression (left) and decompression (right) schemes. The computationally most intensive tasks are depicted in orange. FWT: forward (3D) wavelet transform, IWT: inverse wavelet transform, EC: entropy coding.

The compression includes encoding of the residuals, i.e. the back-transformed quantization round-off errors. The latter are obtained by dequantizing the integer coefficients ("Dequantizer") and carrying out the inverse wavelet transform ("3D IWT"). The residuals are entropy coded and embedded into the compressed bitstream. The computation of residuals represents a fundamental deviation from the classical wavelet-based compression schemes that rely on bitwise reversible transforms [24], [28]. The reason behind our choice is to accommodate wavelets that yield the most competitive r-d curves.¹

¹ Especially when considering small tiles, such as in our case.

It is important to point out that encoding the residuals to make reversible coders with "irreversible wavelets" (i.e., wavelets that are built upon non-integer coefficients) is not a completely independent additional step: one has to find a quantization that both minimizes the number of bitplanes to code and minimizes the footprint of the residual bitstream. Because of this relation, the residual encoding affects both the r-d curve and throughput, even at lossy bitrates.
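As a rough, stand-alone illustration of the quantizer/dequantizer pair and of the round-off errors involved, the sketch below applies a textbook dead-zone quantizer with a hypothetical step size delta (the step actually used in the benchmarks is not stated here) to a few coefficients. Note that in the scheme above the residuals are formed in image space, after the inverse wavelet transform of the dequantized coefficients, not per coefficient as in this toy example.

```cpp
#include <cmath>
#include <cstdio>

// Textbook dead-zone quantizer: truncation toward zero makes bin 0 twice as wide
// as the other bins, which tends to zero out low-energy wavelet coefficients.
static int quantize(float c, float delta) {
    const int sign = (c < 0.f) ? -1 : 1;
    return sign * static_cast<int>(std::fabs(c) / delta);
}

// Mid-point reconstruction of a quantized coefficient.
static float dequantize(int q, float delta) {
    if (q == 0) return 0.f;
    const int sign = (q < 0) ? -1 : 1;
    return sign * (std::fabs(static_cast<float>(q)) + 0.5f) * delta;
}

int main() {
    const float delta = 1.0f;  // hypothetical step size, for illustration only
    const float coeffs[] = {0.3f, -2.7f, 14.2f, -0.9f};
    for (float c : coeffs) {
        const int   q = quantize(c, delta);
        const float r = c - dequantize(q, delta);  // round-off error to be coded as residual
        std::printf("c=% 6.2f  q=% 3d  residual=% 6.2f\n", c, q, r);
    }
    return 0;
}
```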
1) Second-Generation Wavelets: In our approach, the cornerstone behind the "3D FWT" and "3D IWT" stages in Figure 1 are the second-generation wavelets. Biorthogonal wavelets [18] are employed to generate the MRA of a discretized signal \{c^L_k\}_k for some L \ge 0 by repeatedly applying the computational building block

c^l_k = \sum_j h_{2k-j}\, c^{l+1}_j,
d^l_k = \sum_j g_{2k-j}\, c^{l+1}_j,    (1)

where h, g are the low-pass and high-pass analysis filters, respectively, and \{c^l_k\}_k, \{d^l_k\}_k represent the subband coefficients [47], referred to as scaling and detail coefficients, respectively, at a coarser level l. From the MRA, one can reconstruct the original signal by repeatedly filtering the subband coefficients with the synthesis filters \tilde h, \tilde g:

c^l_k = \sum_j \big( \tilde h_{2j-k}\, c^{l-1}_j + \tilde g_{2j-k}\, d^{l-1}_j \big).    (2)

By using the lifting scheme [18], [22], we design novel second-generation wavelets accommodating the following features:
• wavelets on the finite interval
• synthesis scaling function: 3rd-order average-interpolant
• analysis scaling functions: conservation of M^0, M^1, M^2,
where M^i represents the i-th moment of the signal, \sum_k c^0_k k^i.

The "on the interval" property [48] is critical in this context, as it provides us with the best approach to algebraically represent the signal boundaries, thus achieving maximum energy compaction in wavelet space. Classical wavelets are designed by considering an infinite time domain, thus infinite-length signals. JP2K circumvents this requirement by mirroring the original signal, thus obtaining periodic C^0 patterns. This approach leads to C^1 discontinuities across the image boundaries. Other approaches include a smooth window modulation of the extended (mirrored) signal.

The third-order average-interpolant synthesis scaling function leads to an analysis wavelet function offering a competitive advantage over the 5/3 and the 9/7 analysis wavelets. For smooth signals, its energy compaction is more effective than that of the 5/3 analysis wavelets. When dealing with discontinuous signals, our analysis wavelets have fewer high-energy coefficients than the 9/7 analysis wavelets.

Here, the lifting scheme is employed to manipulate the analysis scaling function such that it conserves the signal's first three moments, locally and globally, across coarser resolutions. This improves the r-d curve at close-to-lossless bitrates considerably, and also produces fewer artifacts, especially in signal amplitude, when changing resolution. For illustrative purposes, our synthesis scaling and synthesis wavelet functions are shown in Figure 2 for the second coarsest resolution level.

2) Zerotree Codec: In our approach, zerotree codecs are the cornerstone of the "3D Codec" stage in Figure 1, which is responsible for discovering redundant patterns within the MRA.


TABLE II SINGLE-NODE PERFORMANCE OF THE TARGET PLATFORMS

Fig. 2. Our synthesis scaling (left) and wavelet (right) functions at the second coarsest resolution level. The left and right functions depicted in red are centered at x = 5/16, and are associated with c^1_2 and d^1_2, respectively.

Our zerotree codec is composed of a sequence of identical iterations. At each iteration, a bitplane is extracted in compressed form. The extraction relies on comparing the quantized coefficients against a significance threshold that is halved at each iteration. An iteration is conceptually divided into two passes: a significance pass, where a new set of significant coefficients is found, and a refinement pass, where the less significant bits of the already significant coefficients are extracted.

Here we process coefficients in groups of 2 × 2 × 2. This brings the dual benefit of pruning the codec's algorithmic cost 8-fold and dealing with bytes rather than individual bits. We must mention that a coarser granularity does not always have a detrimental effect on the compression rate. If the entropy coding stage is static, as in our case, arranging coefficients in groups of 8 leads to bitrates describing the 8th-order entropy [26].

Capturing the correlation between parent-children coefficients in the MRA allows us to potentially achieve higher compression rates, although the magnitude of this correlation has been subject to debate [30], [49].
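The two passes can be illustrated with a deliberately simplified, stand-alone bitplane encoder over a flat array of quantized coefficients: the threshold is halved once per iteration, the significance pass emits a flag (and a sign) for coefficients that first exceed it, and the refinement pass emits one further magnitude bit for coefficients that are already significant. The sketch omits the 2 × 2 × 2 grouping, the zerotree symbols and the entropy coder of the actual scheme, and is not the authors' implementation.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Minimal embedded bitplane encoder (no zerotrees, no groups, no entropy coding).
// 'num_planes' must cover the largest magnitude, i.e. |c| < 2^num_planes.
static std::vector<int> encode_bitplanes(const std::vector<int>& coeff, int num_planes) {
    std::vector<int> bits;
    std::vector<char> significant(coeff.size(), 0);
    for (int plane = num_planes - 1; plane >= 0; --plane) {
        const int threshold = 1 << plane;
        // significance pass: flag coefficients that become significant at this threshold
        for (size_t i = 0; i < coeff.size(); ++i) {
            if (significant[i]) continue;
            const int now_significant = (std::abs(coeff[i]) & threshold) ? 1 : 0;
            bits.push_back(now_significant);
            if (now_significant) {
                bits.push_back(coeff[i] < 0 ? 1 : 0);  // sign bit
                significant[i] = 1;
            }
        }
        // refinement pass: one more magnitude bit for previously significant coefficients
        for (size_t i = 0; i < coeff.size(); ++i) {
            if (significant[i] && std::abs(coeff[i]) >= (threshold << 1))
                bits.push_back((std::abs(coeff[i]) >> plane) & 1);
        }
    }
    return bits;
}

int main() {
    const std::vector<int> coeff = {0, 37, -5, 2, 0, -19, 64, 1};
    const std::vector<int> bits = encode_bitplanes(coeff, 7);  // 7 planes cover |c| < 128
    std::printf("emitted %zu bits for %zu coefficients\n", bits.size(), coeff.size());
    return 0;
}
```

Truncating the emitted bit sequence after any pass still allows a decoder replaying the same scan order to reconstruct an approximation, which is the essence of quality scalability.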
3) Asymmetric Numeral Systems: Within our approach, asymmetric numeral systems are the cornerstone of the "EC" stage in Figure 1. The output of the zerotree codec, in particular the one of the significance pass, can be further compressed with an entropy coder. Here we consider the asymmetric numeral systems (ANS) coders recently introduced by Duda [50]. ANS arguably represent a major advancement in data compression over the last two decades. Today, ANS coders are responsible for the "heavy lifting" tasks behind some of the most popular compression tools [51] such as Apple's LZFSE [52] and Facebook's ZSTD [53], and other domain-specific tools such as CRAM DNA [54].

The specific ANS considered in this work is the range variant ANS (RANS) [50]. The state of a RANS coder is represented by an arbitrarily large integer x. Given s, the symbol to encode, the next state of the encoder C is defined as

C(s, x) = m \lfloor x / l_s \rfloor + b_s + mod(x, l_s),    (3)

where m is a power of two (in our case m = 4096), \{l_s\}_s are the (discrete) probabilities for each symbol, renormalized such that \sum_s l_s = m, and \{b_s\}_s is an exclusive prefix sum, b_s = \sum_{i=0}^{s-1} l_i.

Although x is theoretically an arbitrarily large number, in practice it is represented by a fixed-size integer. The compressed bitstream is then generated as a side product of preventing integer overflows: the most significant bits are dumped to memory while x undergoes a renormalization process.

In a symmetric fashion, a RANS decoder's next state D relates to its current state x and outputs s as a side product:

D(x) = l_s \lfloor x / m \rfloor + mod(x, m) - b_s,
s = s(mod(x, m)).    (4)

The final step to decode s is a binary search through \{b_s\} that can be accelerated with the help of a lookup table. We note that encoding and decoding are symmetric in their access patterns: if the encoding starts by operating on the first symbol, the decoding starts by outputting the last symbol.
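A compact C++ sketch of a byte-wise rANS coder following equations (3) and (4) is given below. The scale m = 4096, the 32-bit state and the byte-wise renormalization mirror common rANS practice (e.g., the variant popularized by Giesen [55]); the symbol table is a made-up example, and this is not the implementation benchmarked in this article.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint32_t kScaleBits = 12;
constexpr uint32_t kM = 1u << kScaleBits;  // m = 4096
constexpr uint32_t kLow = 1u << 23;        // renormalization lower bound of the state

struct Symbol { uint32_t freq, cum; };     // l_s and b_s (exclusive prefix sum)

// Encode in reverse symbol order so that the decoder emits symbols forward.
// Renormalization bytes and the final state are pushed on 'out' (used as a stack).
static void encode(const std::vector<int>& msg, const std::vector<Symbol>& tab,
                   std::vector<uint8_t>& out) {
    uint32_t x = kLow;
    for (size_t i = msg.size(); i-- > 0;) {
        const Symbol& s = tab[msg[i]];
        const uint32_t x_max = ((kLow >> kScaleBits) << 8) * s.freq;
        while (x >= x_max) { out.push_back(x & 0xff); x >>= 8; }   // renormalize
        x = ((x / s.freq) << kScaleBits) + s.cum + (x % s.freq);   // C(s, x), eq. (3)
    }
    for (int b = 0; b < 4; ++b) { out.push_back(x & 0xff); x >>= 8; }  // flush final state
}

static std::vector<int> decode(size_t count, const std::vector<Symbol>& tab,
                               std::vector<uint8_t>& in) {
    uint32_t x = 0;
    for (int b = 0; b < 4; ++b) { x = (x << 8) | in.back(); in.pop_back(); }
    std::vector<int> msg(count);
    for (size_t i = 0; i < count; ++i) {
        const uint32_t slot = x & (kM - 1);                        // mod(x, m)
        int s = 0;
        while (slot >= tab[s].cum + tab[s].freq) ++s;              // linear symbol search
        msg[i] = s;
        x = tab[s].freq * (x >> kScaleBits) + slot - tab[s].cum;   // D(x), eq. (4)
        while (x < kLow) { x = (x << 8) | in.back(); in.pop_back(); }  // renormalize
    }
    return msg;
}

int main() {
    // Four symbols with probabilities renormalized so that the l_s sum to m = 4096.
    const std::vector<Symbol> tab = {{2048, 0}, {1024, 2048}, {512, 3072}, {512, 3584}};
    const std::vector<int> msg = {0, 0, 1, 3, 0, 2, 1, 0, 0, 2};
    std::vector<uint8_t> buf;
    encode(msg, tab, buf);
    std::vector<uint8_t> stream = buf;
    const std::vector<int> dec = decode(msg.size(), tab, stream);
    std::printf("round trip %s, %zu bytes\n", dec == msg ? "ok" : "failed", buf.size());
    return 0;
}
```

Encoding runs over the symbols in reverse so that the decoder, which inverts the last encoding step first, emits them in forward order; interleaving several such encoders, as discussed in the design choices below, is what recovers instruction-level parallelism.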


B. Platforms

The target platforms of this work are the latest server-grade x86 CPUs supporting the SIMD AVX2/AVX-512 instruction sets, allowing us to process up to two 16-way SIMD FMA instructions per cycle, totaling 64 scalar SP operations per core. Specifically, we consider the nodes of the Euler supercomputer of ETH Zurich, Switzerland. The numbers characterizing the single-node performance are reported in Table II. The I/O bandwidth figures have been measured by writing/reading to/from their local NVMe storage. In our tests, the measured system memory bandwidth is about 75% of its nominal value. The ratio between measured I/O and measured RAM bandwidth is thus about 300:1. Since I/O operations and computation are concurrent, we note that the time spent to transfer one byte from disk can cover a budget of about 10^4 FLOPs. These FLOPs are "free".

The main challenge of the present work is the effective conversion of these free 10^4 FLOPs into the smallest number of written (or read) bytes possible, through data compression.

C. Design Choices

We briefly elucidate our strategy for effectively mapping the numerical schemes presented in Section II onto the target platforms described in Section II-B.

1) Small Tiles: Wavelets lead to a separable 3D transform. The transform relies upon three building blocks: a batched 1D transform of the tile rows, the x-y transposition of the tile slices, and the y-z transposition of the tile. The 3D transform consists of executing three times the batched 1D transforms, interleaved by the two transpositions. The overall access pattern of the 3D transform exhibits high spatial locality as well as long-term temporal locality: all the coefficients are accessed multiple times in the five steps. Since the temporal locality is long-term, we restrain the data cache misses to compulsory misses by considering tiles with footprints fitting in the L3 data cache, e.g. 64 × 64 × 64 or 32 × 32 × 32 voxels.

A direct consequence of this choice is the relatively "shallow" MRA, featuring perhaps just 4-6 resolution levels. This in turn affects the compression efficiency of the scheme when a tile exhibits a very smooth signal. Empirically, we observe a loss in compression rate by considering smaller tiles on the benchmark datasets. However, the benefit of considering tiles with footprints above 2 MB is subject to a strongly diminishing return.

Besides the benefit of providing finer-grained ROI-accessibility, smaller tile sizes bring another important advantage. With the appropriate source code transformation techniques, the computational pattern of the 1D transform is captured and manipulated to be executed very efficiently. The observed performance improvement of these synthesized kernels over the baseline code is more than eight-fold when mapped to AVX2 instructions, and sometimes sixteen-fold for AVX-512 instructions.

2) RANS Encoders: Although qualitatively faster than arithmetic coding, RANS implementations lead to instruction streams featuring tight data dependencies and practically just scalar integer plus control flow operations. The out-of-order (OOO) back-end of contemporary CPUs seems unable to effectively cover these latencies² because of the lack of instruction-level parallelism (ILP) in the code. Here we adopt the strategy proposed by Giesen [55] of interleaving independent RANS encoders to enhance ILP. A stream of symbols is assigned to an array of encoders in a round-robin fashion. The generated compressed bitstreams are then multiplexed into a single bitstream on-the-fly. By increasing the number of encoders, we increase the ILP as well as the allocated hardware resources. In our context, we empirically found a sweet spot with 8 interleaved encoders.³

Even if small tile sizes are non-ideal for very smooth signals, they can be effectively coupled with static - thus very fast - context-insensitive entropy coders. The computation of a 256-entry histogram is a convenient way to approximate the local context. The calculation of a static histogram translates to simple gather-add-scatter operations, although its vectorization requires a conflict-detection mechanism. Calculating histograms with a sparse/dynamic representation would expose a substantially lower degree of ILP and no DLP. The calculation of a static histogram is thus expected to be about one order of magnitude faster with respect to the simplest dynamic/adaptive histogram.

On current CPU microarchitectures, the integer multiplication has a reciprocal throughput of about 1 cycle/op whereas the integer division is performed at about 35 cycles/op [56]. Providing static histograms to the entropy coder does not only decrease its computational complexity, but also allows us to replace the expensive integer division of equation (3) during the encoding with an integer multiplication, a bitwise shift operation, and a look-up on a table with 256 entries.

The reciprocals of \{l_s\}_s, representing the normalized histogram entries, are computed once before the coding of the bitstream, by fetching entries from a bigger table of 4096 entries. As proposed by Alverson [57], the division x/l_s and the reciprocal (a_s, h_s) are computed as

\lfloor x / l_s \rfloor = \lfloor x\, a_s\, 2^{-h_s} \rfloor,
a_s = \lceil 2^{h_s} / l_s \rceil,
h_s = w + \lceil \log_2 l_s \rceil,    (5)

where w = 16 in our case, representing the considered word length in bits. We must mention that this approach was first suggested by Giesen [58].

The possibly longer latencies of this approach are expected to be almost completely covered by ILP, if multiple RANS encoders are interleaved. Overall, the observed improvement over the integer division is more than four-fold.
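The reciprocal trick of equation (5) is easy to verify in isolation. The sketch below precomputes (a_s, h_s) for a divisor l and checks, for w = 16, that the multiply-and-shift reproduces the integer division for every 16-bit dividend; the chosen divisors are arbitrary, and the code illustrates the identity rather than the lookup tables used in the actual coder.

```cpp
#include <cstdint>
#include <cstdio>

// Reciprocal of a divisor l in the spirit of equation (5): a = ceil(2^h / l),
// h = w + ceil(log2 l), so that x / l == (x * a) >> h for every x < 2^w (here w = 16).
struct Reciprocal { uint64_t a; uint32_t h; };

static Reciprocal make_reciprocal(uint32_t l, uint32_t w = 16) {
    uint32_t ceil_log2 = 0;
    while ((1u << ceil_log2) < l) ++ceil_log2;
    const uint32_t h = w + ceil_log2;
    const uint64_t a = ((uint64_t(1) << h) + l - 1) / l;  // ceil(2^h / l), computed once
    return {a, h};
}

static uint32_t div_by_reciprocal(uint32_t x, const Reciprocal& r) {
    return static_cast<uint32_t>((uint64_t(x) * r.a) >> r.h);  // multiply + shift, no division
}

int main() {
    // Exhaustive check over all 16-bit dividends for a few divisors up to m = 4096.
    const uint32_t divisors[] = {1, 3, 5, 100, 641, 1024, 2047, 4096};
    for (uint32_t l : divisors) {
        const Reciprocal r = make_reciprocal(l);
        for (uint32_t x = 0; x < (1u << 16); ++x)
            if (div_by_reciprocal(x, r) != x / l) { std::printf("mismatch\n"); return 1; }
    }
    std::printf("multiply-and-shift matches integer division for all 16-bit inputs\n");
    return 0;
}
```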
D. Software

1) Parallelism: Data-level parallelism (DLP) is enforced in two ways: the kernels featuring operational intensities⁴ below the ridge point [59] are written in ISPC⁵, whereas those kernels that are expected to be compute-bound are synthesized by M4 scripts featuring almost exclusively AVX2/AVX-512 intrinsics. The latter are the kernels transposing tile slices, zipping tile rows, extracting/expanding bitplanes, and 1D-transforming tile rows.

TLP is delegated to the client code. For our benchmarks we rely entirely on the message passing interface (MPI) to expose TLP. The initial version of the software employed kernel-level threads and thread cooperation, but it turned out to be significantly slower than the MPI-only implementation. We believe that workloads in the order of 10^7 cycles - the computational time for compressing a tile - are probably too small to expect any improvement by considering cooperating threads.

2) MPI I/O: The ability of extracting and compressing tiles in an out-of-core fashion enables our software to process very large images. However, the I/O performance of the tile extraction is very sensitive to the access pattern. To address this issue, we organized the computation into a two-level hierarchy: MPI tasks within a NUMA node, and the group of NUMA nodes. The image is processed by considering sets of slices to form "layers" of tiles. Sets are assigned to NUMA nodes in a round-robin fashion. Within a NUMA node, MPI tasks call MPI I/O collective blocking functions to load a few consecutive image slices into shared memory⁶ and feature local synchronization points to prevent data races. A similar pattern is used for writing the bitstreams to disk after all the tiles in the layer have been compressed. Although we rely on blocking collective I/O operations within a NUMA node, the read/compress/write stages of the different NUMA nodes are free to overlap.
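A heavily simplified sketch of the collective slice loading is given below. It uses a plain collective blocking read (MPI_File_read_at_all) with each rank fetching a distinct block of consecutive slices; the file name, slice geometry and slice count are placeholders, and the NUMA grouping, the System V shared-memory staging and the local synchronization of the actual software are omitted.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Simplified collective loading of a contiguous set of image slices:
// each rank reads a distinct block of slices from the raw volume with MPI-I/O.
// "volume.raw", the slice size and the slice count are placeholders.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const MPI_Offset slice_bytes = MPI_Offset(1024) * 1024 * 2;  // one 1024x1024 slice, 16 bit
    const int slices_per_rank = 4;                               // a few consecutive slices each

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "volume.raw", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    std::vector<char> buffer(slices_per_rank * slice_bytes);
    const MPI_Offset offset = MPI_Offset(rank) * slices_per_rank * slice_bytes;

    // Collective, blocking read: all ranks of the communicator participate,
    // which lets the MPI-I/O layer coalesce the requests into large accesses.
    MPI_File_read_at_all(fh, offset, buffer.data(), int(buffer.size()), MPI_BYTE,
                         MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    if (rank == 0) std::printf("each of %d ranks read %d slices\n", size, slices_per_rank);
    MPI_Finalize();
    return 0;
}
```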

² Mostly due to Read-After-Write (RAW) data hazards.
³ When running 1 software thread per hardware thread. The sweet spot moves to higher values if HyperThreading is disabled.
⁴ Meant in the broader "operations per byte" sense.
⁵ DLP has been shown to bring performance benefits even to memory-bound kernels by maximizing the off-chip transferred bytes per instruction density.
⁶ Using System V IPC shared memory.


TABLE III
BENCHMARK DATASETS

III. EXPERIMENTS AND RESULTS

In this section we assess the performance of our data compression scheme in terms of time-to-solution (TTS), throughput, lossless compression rates as well as r-d curves, using datasets acquired with different imaging modalities.

A. Reference Software

The performance of the present approach is assessed against three software packages: IRIS [39] (version 1.1.5), a C implementation of the JP2K/JP3D standard, the Keller Lab Block repository [60], and Kakadu [44], a partial implementation of the JP2K standard that does not include Part 10/JP3D.

Several points must be taken into account regarding IRIS. Firstly, IRIS supports only images with unsigned pixel values. We therefore have generated unsigned versions of the Cochlea and Tibia images for IRIS, by shifting their value ranges appropriately without causing overflows. Secondly, by examining and modifying the IRIS source code, we managed to substantially improve the associated r-d curves without affecting execution time. For the r-d curves reported herein, we observe a PSNR improvement ranging between 35% and 100%. Thirdly, as neither the library nor the client code of this software exposes TLP, to compare its performance on our multicore platforms, we combine IRIS with a multitasking approach⁷ in which independent tasks are scheduled to process the image tiles. This requires a preprocessing step that decomposes the image into tiles of 64 × 64 × 64 pixels and stores them in the local scratch space. It is important to note that this preprocessing stage is specific to IRIS. Furthermore, the IRIS preprocessing overhead is not taken into account in our benchmarks. We compiled IRIS with Intel's icc with all optimization flags enabled.

The KLB source code is written in C++11 and relies on C++11 threads for TLP. The library does not entail workload decomposition on a distributed memory architecture and therefore is limited to run on a single node. We modified the KLB source code to process tiles of 64 × 64 × 64 pixels⁸ and compiled it with Intel's icpc with -Ofast -march=native -fno-alias -DNDEBUG.

Kakadu is arguably a CPU-efficient implementation of JP2K including the HTJ2K part. Kakadu does not implement JP3D/Part 10 of the standard. However, the software is somehow able to compress volumetric images by reinterpreting image slices as different channels and subsequently performing a multi-component transform (Part 2), which can be steered to carry out a DWT in the z-direction, prior to the 2D image compression. In our benchmark we relied on kdu_buffered_compress of Kakadu 8.0.5.

Fig. 3. Opaque isosurface rendering of the Cochlea dataset for the isovalue 9500.

B. Datasets

We consider three image sets in our benchmarks: Cochlea, Kidney, and Tibia. The information characterizing these images is reported in Table III. The field 'Range' denotes the bit depth and signedness of the pixel values.

The Cochlea dataset was imaged from a human cochlea specimen with a Scanco Medical µCT100 microtomography scanner. Figure 3 illustrates the geometrical complexity of the Cochlea dataset that renders data compression challenging. The Kidney dataset was acquired by serial block face scanning electron microscopy (SEM) of a mouse kidney sample on an Apreo VS system (Thermo Fisher Scientific, Eindhoven, The Netherlands) at the University of Zurich Center for Microscopy and Image Analysis. The set consists of 170 images of 9974 × 10,051 pixels, with a voxel size of 10 nm × 10 nm × 50 nm. Figure 4 shows the dataset.

The Tibia dataset was acquired from a human distal tibia using a Scanco Medical XtremeCT II HR-pQCT scanner at Universitaetsklinikum Hamburg-Eppendorf (UKE). The distal tibia is a clinically relevant region for bone strength assessment. Figure 5 shows a cut view of the dataset.

⁷ Implemented with a simple MPI + C scheduler that distributes the tiles in a round-robin fashion.
⁸ By default the KLB tests are carried out with tiles of 96 × 96 × 8 pixels.


Fig. 4. Volume rendering of the Kidney image. Regions depicted in blue have pixel values of 64 or lower, regions depicted in pale pink assume values around 120, and regions depicted in red assume values around 200.

Fig. 5. Volume rendering of the Tibia image. Regions depicted in blue have pixel values of 1000 or lower, regions depicted in pale pink assume values around 4000, and regions depicted in red assume values around 8000.

TABLE IV
RESULTS OF THE TIME-TO-SOLUTION BENCHMARK FOR THE XEON GOLD 6150 (TOP) AND EPYC 7742 (BOTTOM) NODES

C. Time-to-Solution

The first benchmark simulates the image "acquisition + data compression" use case, consisting of a microscope or scanner connected to a single high-end workstation. To keep the benchmark simple, we focus on the overall TTS. This includes reading the image, compressing it and writing bitstreams to file. We must mention that the reading time is included for the sake of clarity, rather than for realism: modern microscopes and scanners are able to place data chunks directly into memory through, e.g., USB 3.x ports and appropriate firmware/drivers.

IRIS, Kakadu, and the present software were set to perform lossless compression featuring 3 lossy quality layers, thus totaling 4 layers. For IRIS and Kakadu, the layers' target bitrates were 0.5, 1.5, 3, and the lossless compression bitrate. The present software was set to group bitplanes into quality layers as follows: the first layer embeds the first 7 bitplanes, the second layer embeds two bitplanes, the third one embeds just one bitplane, and the fourth layer includes all the remaining bitplanes as well as the residuals.

Results are reported in Table IV. By employing all node cores, the present software outperforms the competing software by about 2X-7X on the Xeon Gold node, and 5X-11X on the EPYC node. Although the metric of this benchmark is straightforward, there are a number of considerations to make.

Firstly, both the input image and the compressed image are read from/written to the local storage (NVMe) of the node. Relying on the local storage is a convenient way of minimizing the interactions between the benchmark node and the cluster file system. The thin operating system layer of the node (absence of a graphical user interface, decreased presence of external processes) makes this experimental setup close to ideal and representative of high-end workstations.

Secondly, the table entries represent the median values of 40 repeated, end-to-end benchmark runs. Running the benchmark multiple times amplifies caching effects of the I/O subsystem. Except for the Cochlea image, reading the same image from disk more than once led to up to 10-fold improvements in reading time.

Thirdly, the TTS of the present software is completely dominated by the I/O; choosing different tile sizes in the range between 32 and 64 pixels does not seem to influence it. The compression workload is completely covered by the I/O read timing for the Cochlea dataset, and by the write timings for the other datasets.

D. Scale-Out

Processing images featuring very large footprints requires computer clusters. For this reason, we examine the performance of the present software with respect to the number of nodes by compressing the 98 GB Cochlea image.

In this benchmark, we read/write images from/to the cluster parallel file system (Lustre). In principle, this filesystem can achieve an I/O bandwidth of around 32 GB/s in perfect benchmark conditions (no other users running on the system, aligned I/O, 1 file per thread, 5 threads per RAID group) and without fabric partitioning. However, the file system was operational when the present benchmark was carried out, with hundreds of users and with fabric partitioning. In these conditions, a maximum achievable bandwidth of 20 GB/s is expected.


TABLE VI
AGGREGATE DECOMPRESSION THROUGHPUTS ON THE XEON GOLD 6150 (TOP) AND EPYC 7742 (BOTTOM) NODES

Fig. 6. Strong scaling (left) and aggregate throughputs (right) of the present software, for reading (yellow), compressing (red), and writing (blue) images to disk. The ideal trend is depicted in black. Error bars represent 95% confidence intervals.

TABLE V
AGGREGATE COMPRESSION THROUGHPUTS ON THE XEON GOLD 6150 (TOP) AND EPYC 7742 (BOTTOM) NODES

Fig. 7. Rate-distortion curves for IRIS (blue squares: 5/3 DWT, green squares: 9/7 DWT), Kakadu (pink triangles: 5/3 DWT, violet triangles: 9/7 DWT), the present approach with tiles of 32×32×32 voxels (red circles) and 64 × 64 × 64 voxels (orange circles), KLB (black, dashed line) for the Cochlea image.

The left picture of Figure 6 reports the read/compress/write timings with respect to the number of nodes, whereas the right picture shows the aggregate throughputs in GB/s.

The error bars depict 95% confidence intervals and the plotted lines represent the median values of measurements over 40 end-to-end benchmark runs.

The data compression timing decreases following a relatively clean 1/n trend, with the compression throughput scaling linearly as expected.

We note that the read and write timings do not improve substantially with more than 3-4 nodes. We attribute the slight degradation of I/O bandwidth to resource contention across the nodes. The measured read/write plateaux are 21 GB/s and 19 GB/s, respectively.

E. Throughput

The performance of reading images from disk and writing bitstreams to disk strongly depends on the considered input and output file formats, as well as hardware details of the I/O subsystem. The focus of this benchmark is thus on the compression throughput, defined as footprint(input) / time(compress(input)).

Aggregate throughputs are reported in Table V, together with the associated outperforming factors, for both tiles of 32 × 32 × 32 pixels (TS32) and 64 × 64 × 64 pixels (TS64). Each entry in the table represents the median value of 40 end-to-end benchmark runs.

The throughput of our software was directly measured by instrumenting the source code. The throughput of the competing software was instead estimated by subtracting both compulsory I/O timings (reading the uncompressed image and writing the compressed one) from their TTS. The compulsory timings have been computed with the I/O rates reported in Table II. We would like to emphasize that these approximations overestimate the actual throughput because I/O read, compression, and I/O write can be carried out concurrently, as in the present software.

The throughput of the present software appears to outperform the competition by one to two orders of magnitude. On the Xeon Gold node, the highest achieved throughput was 10.5 GB/s with tiles of 32 × 32 × 32 pixels, whereas on the EPYC node, the highest throughput was 29.8 GB/s with tiles of 64 × 64 × 64 pixels.

Table VI reports the decompression throughput of the present software, defined as footprint(output) / time(decompress(input)). The throughput of the lossy decompression strongly depends on the number of considered bitplanes. Here we decompress the four quality layers mentioned earlier in the text. Lossless decompression achieves a throughput of 30 GB/s, whereas the fastest lossy decompression achieves more than 80 GB/s.

F. Compression Rates

We compare the compression rates of the present approach against IRIS, Kakadu, and KLB. To obtain well-sampled r-d curves, we set IRIS, Kakadu, and the present software to generate 20 quality layers at disparate bitrates. Along with the r-d curves, we provide a visual comparison of results achieved with the four compression approaches. In addition, we provide assessments of whether the compression can be considered visually lossless. The assessments were performed in a randomized, controlled, blinded fashion by experts in the respective fields of the datasets.
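Distortion in the r-d curves that follow is reported as PSNR. For reference, a minimal helper computing the PSNR between an original and a decompressed 16-bit volume might look like the sketch below; the peak value of 65535 and the flat-array layout are assumptions for illustration, not details taken from the benchmark code.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// PSNR between an original and a reconstructed volume, both stored as flat
// arrays of 16-bit samples. 'peak' is the assumed maximum representable value.
static double psnr(const std::vector<uint16_t>& orig, const std::vector<uint16_t>& test,
                   double peak = 65535.0) {
    double mse = 0.0;
    for (size_t i = 0; i < orig.size(); ++i) {
        const double d = double(orig[i]) - double(test[i]);
        mse += d * d;
    }
    mse /= double(orig.size());
    if (mse == 0.0) return INFINITY;  // lossless reconstruction
    return 10.0 * std::log10(peak * peak / mse);
}

int main() {
    const std::vector<uint16_t> a = {100, 2000, 30000, 4000};
    const std::vector<uint16_t> b = {101, 1995, 30010, 4000};
    std::printf("PSNR = %.2f dB\n", psnr(a, b));
    return 0;
}
```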


Fig. 8. Visual assessment of the distortion for the Cochlea image. Regions depicted in blue have pixel values of −1600 or lower, regions depicted in white assume values around 5000, and regions depicted in red assume values around 13000.

1) Cochlea: Figure 7 shows the r-d curves of the Cochlea bitstreams, for both IRIS and the present software. The curves are obtained by compressing the entire image in lossless form, then decompressing it by prescribing, respectively, a set of bitrates or PSNR values and computing the PSNR values or bitrates with respect to the original image.

In terms of lossless compression rates, Kakadu achieves the most competitive result with 2.1:1, whereas IRIS achieves 2:1, while both KLB and the present method achieve 1.9:1.

In the lossy regime, we note that Kakadu features significantly better r-d curves compared to both IRIS and the present method. At lower bitrates, the distortion of IRIS with the 9/7 DWT seems more pronounced with respect to the curves of Kakadu and the present method. We find the curve discrepancies between Kakadu and IRIS surprising, as they are implementations of the same standard. We also note that the IRIS r-d curve with the 9/7 DWT is significantly more competitive than the one with the 5/3 DWT. Both reach a plateau where the distortion becomes independent of the bitrate.

Figure 8 illustrates the distortion at low bitrates within a 400 by 400 pixels ROI. At 0.5 bps, IRIS and Kakadu yield results clearly superior to the present software, showing no visible artifacts due to tiling. Our domain expert assessed the compression with Kakadu to be visually lossless already at 0.5 bps. Compression with IRIS was considered visually lossless at 1.5 bps, whereas the present method achieved visually lossless compression at 3 bps.

Fig. 9. Rate-distortion curves for IRIS (blue squares: 5/3 DWT, green squares: 9/7 DWT), Kakadu (pink triangles: 5/3 DWT, violet triangles: 9/7 DWT), the present approach with tiles of 32×32×32 voxels (red circles) and 64×64×64 voxels (orange circles), KLB (black, dashed line) on the Kidney image.

2) Kidney: As depicted in Figure 9, the most competitive lossless compression is achieved by IRIS, reaching a rate of 2:1, whereas Kakadu achieves 1.9:1. The present software reaches a lossless compression rate of about 1.8:1 for a tile size of 64 × 64 × 64. KLB attains a lossless compression rate of 1.66:1.

Figure 10 illustrates the distortion at various bitrates for the Kidney dataset, within a 400 by 400 pixels ROI. At 0.25 bps and 0.5 bps, none of the approaches achieve visually lossless compression according to our domain expert. IRIS and Kakadu show minor spurious artifacts, whereas the present approach exhibits rather strong tiling. At 1.5 bps, IRIS features visually lossless compression, whereas Kakadu and the present method still show significant deviations from the original image.


Fig. 10. Visual assessment of the distortion for the Kidney image. Regions depicted in blue have pixel values of 30 or lower, regions depicted in white assume values around 100, and regions depicted in red assume values around 200.

Fig. 11. Rate-distortion curves generated by IRIS (blue squares: 5/3 DWT, green squares: 9/7 DWT), Kakadu (pink triangles: 5/3 DWT, violet triangles: 9/7 DWT), the present approach with tiles of 32×32×32 voxels (red circles) and 64×64×64 voxels (orange circles), KLB (black, dashed line) on the Tibia image.

3) Tibia: Figure 11 depicts the r-d curves for the Tibia dataset. Here, both Kakadu and IRIS fare considerably better than the present approach.

Figure 12 illustrates the distortion at low bitrates for the Tibia dataset. At 0.5 bps and 1.5 bps, none of the compression approaches are considered visually lossless by our domain expert. At 3 bps, both IRIS and Kakadu achieve visually lossless compression, whereas the present method still shows observable differences with respect to the original image.

G. Performance Analysis

Figure 13 reports the time distribution of the most time-consuming kernels by considering tiles of 32 × 32 × 32 voxels.

During data compression, more than 30% of the cycles are spent in the zerotree codec. While not reported in the figure, most of the cycles inside the codec are spent in the significance pass, where new significant coefficients are discovered in the MRA. The instruction stream associated with this pass is dense with RAW hazards and control flow instructions, exposing almost no ILP.

The FWT and IWT are the second and third most time-consuming kernels, respectively, taking together about 30% of the overall cycles. These kernels achieve 40% of the nominal peak in single precision. With respect to AVX2, the FWT and IWT stages are accelerated by 30%-60% by using the AVX512F instructions, leading to an overall TTS decrease of about 10%.

The remaining cycles are spent mostly in performing data reordering (helping in turn the 3D codec in querying the MRA), the computation of the residuals, and the entropy coding.

We note that our entropy coder takes less than 10% of the total cycles. This is in strong contrast with classical wavelet-based codecs, where most cycles are consumed by arithmetic coding.


Fig. 12. Visual assessment of the distortion for the Tibia image. Regions depicted in blue have pixel values of 1000 or lower, regions depicted in white assume values around 5000, and regions depicted in red assume values around 1000.

Fig. 13. Time distribution for the compression scheme (left) and (lossless) decompression scheme (right) of the present approach.

Figure 13 (right) reports the time distribution for the lossless decompression of a tile. As mentioned earlier, this operation takes about half of the computational time of the compression. Here, the codec and the inverse wavelet transform take about the same time. The computational complexity of the decoding stage (3D codec) decreases about two-fold with respect to the encoding. The main reason behind this is that the decoding stage does not search for coefficients that became significant: the bitstream contains precisely this information.

IV. CONCLUSION

We have described an effective 3D data compression scheme that maps well on contemporary CPUs. For the benchmarks considered herein, the present scheme has shown clear superiority over IRIS, Kakadu, and KLB in terms of time-to-solution and throughput. Our approach leads to rate-distortion curves and lossless compression rates with variable efficiency depending on the image at hand.

Scalable lossy-to-lossless bitstreams are generated at 30 GB/s, exceeding the acquisition rate of the latest 4-sCMOS SPIM microscopes. To our knowledge, this speed is above the acquisition rates of all commercial medical and biomedical scanners and microscopes available today. A lossless decompression of the image achieves about 30 GB/s, whereas the lossy decompression achieves more than 80 GB/s. This boosts, in turn, the effective I/O bandwidth of image analysis software, enabling the analysis of very large images with out-of-core algorithms at unprecedented speeds.

The relatively large excess in throughput (2X-7X) of the present software ultimately suggests to invest more CPU cycles in compressing the data or improving the r-d curve, for example by introducing a post-compression rate-distortion optimization (PCRD) stage, similar to JP2K [29]. The compression of 4D datasets, such as in cardiac magnetic resonance imaging (cardiac MRI), would also represent an interesting next step.

ACKNOWLEDGMENT

The authors wish to thank Prof. Peter Schelkens (ETRO, Vrije Universiteit Brussel) for providing the IRIS-JP3D software and Dr. Tim Bruylants (intoPIX) for his support throughout this project. They would like to thank Dr. Aous Naman (School of Electrical Engineering and Telecommunications, University of New South Wales) for his support with the Kakadu usage. They also thank, respectively, Dr. Willy Kuo


always below 5%. For a single tile, the bypass decreases the overall CPU cycles by 15% on average. Moreover, most of the tiles exhibit histograms with very few non-zero entries during the refinement pass. Therefore, static histograms are either serialized by run-length encoding or a sparse entry Fig. 14. Two-channel filter bank of the DWT, factored into lifting steps. representation, depending on which approach yields the best coding efficiency. (Institute of Physiology, University of Zurich) and Dr. Dominik Péus (Department of Otorhinolaryngology, REFERENCES Kantonsspital Baselland) for their visual assessment of the kidney and cochlea images. They would like to thank Dr. [1] J. M. Perkel, “The struggle with image glut,” Nature, vol. 533, no. 7601, pp. 131–132, May 2016. Bruno Koller (SCANCO Medical) for sharing the cochlea [2] Andor. (2016). Zyla 4.2 PLUS sCMOS. [Online]. Available: image, Dr. Arne Seitz, Dr. Olivier Guiet, Dr. Romain Burri http://www.andor.com/scientific-cameras/neo-and-zyla-scmos- (BIOP, EPFL), and Dr. Kyle Michael Douglass (LEB, EPFL) cameras/zyla-42-plus-scmos [3] M. Faiza, S. Adabi, B. Daoud, and M. R. Avanaki, “High-resolution for sharing their data for this project. They wish to thank Dr. wavelet-fractal compressed optical coherence tomography images,” Andres Kaech (Center for Microscopy and Image Analysis, Appl. Opt., vol. 56, no. 4, pp. 1119–1123, 2017. University of Zurich) for providing the kidney SEM image [4] A. M. Cheung et al., “High-resolution peripheral quantitative computed tomography for the assessment of bone strength and structure: A review set. They gratefully acknowledge the support of Dr. Vittoria by the canadian bone strength working group,” Current Osteoporosis Rezzonico (SCITAS, EPFL), Prof. Jan Hesthaven (Faculty Rep., vol. 11, no. 2, pp. 136–146, Jun. 2013. of Basic Sciences, EPFL), and Lucid Concepts AG for their [5] E. G. Reynaud, J. Peychl, J. Huisken, and P. Tomancak, “Guide to light- sheet microscopy for adventurous biologists,” Nature Methods, vol. 12, support throughout this project. They would like to thank no. 1, pp. 30–34, Jan. 2015. Ugo Varetto and David Schibeci (Pawsey Supercomputing [6] A. Grossmann and J. Morlet, “Decomposition of hardy functions into Centre) and the Euler cluster support team (ETH Zurich), for square integrable wavelets of constant shape,” SIAM J. Math. Anal., vol. 15, no. 4, pp. 723–736, Jul. 1984. giving them access to their computing infrastructure. [7] J. Morlet, Sampling Theory and Wave Propagation. Berlin, Germany: Springer, 1983, pp. 233–261, doi: 10.1007/978-3-642-82002-1_12. APPENDIX A [8] S. G. Mallat, “A theory for multiresolution signal decomposition: The SOFTWARE wavelet representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, Jul. 1989. In order to allow the reader to reproduce our results and [9] S. G. Mallat, “Multifrequency channel decompositions of images and verify the efficiency and coding performance of our approach, wavelet models,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, we provide a binary executable packaged within an off- no. 12, pp. 2091–2110, 1989. [10] S. G. Mallat, “Multiresolution approximations and wavelet orthonormal the-shelf .MSI installer for Windows. The installer can be bases of L2(R),” Trans. Amer. Math. Soc., vol. 315, no. 1, p. 69, found at https://drain.lucid.ch/, and here [61]. Please contact Jan. 1989. the corresponding author for the executables. The [11] O. Rioul and M. 
APPENDIX C
ZEROTREE AND ENTROPY CODING

We opted to bypass the entropy coding for the bitstream generated by the Zerotree codec during the significance pass, since we empirically observed that the compression rate there is always below 5%. For a single tile, the bypass decreases the overall CPU cycles by 15% on average. Moreover, most of the tiles exhibit histograms with very few non-zero entries during the refinement pass. Static histograms are therefore serialized either by run-length encoding or by a sparse entry representation, depending on which approach yields the better coding efficiency.
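As an illustration of how that choice can be made, the C++ sketch below compares the serialized size of a static histogram under a zero-run-length layout and a sparse (bin, count) layout and returns the cheaper one. The field widths used here (1-byte run lengths and bin indices, 4-byte counts) are assumptions of the sketch, not the codec's actual serialization format.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    enum class HistLayout { RunLength, SparseEntries };

    // Pick the cheaper serialization for a static histogram: run-length
    // coding of zero/non-zero runs versus explicit (bin, count) pairs.
    // Byte costs are illustrative: 1-byte run lengths and bin indices
    // (assuming at most 256 bins and short runs), 4-byte counts.
    HistLayout pick_histogram_layout(const std::vector<uint32_t>& hist)
    {
        // layout 1: alternating runs; every run stores a 1-byte length,
        // and non-zero runs additionally store one 4-byte count per bin
        std::size_t rle_bytes = 0;
        for (std::size_t i = 0; i < hist.size();) {
            const bool zero_run = (hist[i] == 0);
            std::size_t j = i;
            while (j < hist.size() && (hist[j] == 0) == zero_run) ++j;
            rle_bytes += 1 + (zero_run ? 0 : 4 * (j - i));
            i = j;
        }

        // layout 2: 1-byte entry counter, then one (1-byte bin, 4-byte count)
        // pair per non-zero histogram entry
        std::size_t nonzero = 0;
        for (const uint32_t c : hist)
            nonzero += (c != 0) ? 1 : 0;
        const std::size_t sparse_bytes = 1 + 5 * nonzero;

        return (sparse_bytes <= rle_bytes) ? HistLayout::SparseEntries
                                           : HistLayout::RunLength;
    }

With very few non-zero bins, as observed during the refinement pass, the sparse layout typically wins; dense histograms fall back to the run-length layout.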


REFERENCES

[1] J. M. Perkel, "The struggle with image glut," Nature, vol. 533, no. 7601, pp. 131–132, May 2016.
[2] Andor. (2016). Zyla 4.2 PLUS sCMOS. [Online]. Available: http://www.andor.com/scientific-cameras/neo-and-zyla-scmos-cameras/zyla-42-plus-scmos
[3] M. Faiza, S. Adabi, B. Daoud, and M. R. Avanaki, "High-resolution wavelet-fractal compressed optical coherence tomography images," Appl. Opt., vol. 56, no. 4, pp. 1119–1123, 2017.
[4] A. M. Cheung et al., "High-resolution peripheral quantitative computed tomography for the assessment of bone strength and structure: A review by the Canadian Bone Strength Working Group," Current Osteoporosis Rep., vol. 11, no. 2, pp. 136–146, Jun. 2013.
[5] E. G. Reynaud, J. Peychl, J. Huisken, and P. Tomancak, "Guide to light-sheet microscopy for adventurous biologists," Nature Methods, vol. 12, no. 1, pp. 30–34, Jan. 2015.
[6] A. Grossmann and J. Morlet, "Decomposition of Hardy functions into square integrable wavelets of constant shape," SIAM J. Math. Anal., vol. 15, no. 4, pp. 723–736, Jul. 1984.
[7] J. Morlet, Sampling Theory and Wave Propagation. Berlin, Germany: Springer, 1983, pp. 233–261, doi: 10.1007/978-3-642-82002-1_12.
[8] S. G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, Jul. 1989.
[9] S. G. Mallat, "Multifrequency channel decompositions of images and wavelet models," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 12, pp. 2091–2110, 1989.
[10] S. G. Mallat, "Multiresolution approximations and wavelet orthonormal bases of L2(R)," Trans. Amer. Math. Soc., vol. 315, no. 1, p. 69, Jan. 1989.
[11] O. Rioul and M. Vetterli, "Wavelets and signal processing," IEEE Signal Process. Mag., vol. 8, no. 4, pp. 14–38, Oct. 1991.
[12] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 still image coding system: An overview," IEEE Trans. Consum. Electron., vol. 46, no. 4, pp. 1103–1127, Nov. 2000.
[13] P. Schelkens, A. Munteanu, A. Tzannes, and C. Brislawn, "JPEG2000 part 10—Volumetric data encoding," in Proc. IEEE Int. Symp. Circuits Syst., 2006, p. 4.
[14] F. Amat, B. Höckendorf, Y. Wan, W. C. Lemon, K. McDole, and P. J. Keller, "Efficient processing and analysis of large-scale light-sheet microscopy data," Nature Protocols, vol. 10, no. 11, pp. 1679–1696, Nov. 2015.
[15] D. Gabor, "Theory of communication. Part 1: The analysis of information," J. Inst. Electr. Eng. III, Radio Commun. Eng., vol. 93, no. 26, pp. 429–441, Nov. 1946.
[16] Y. Meyer, Ondelettes et Opérateurs. Paris, France: Hermann, 1991.
[17] I. Daubechies, "Orthonormal bases of compactly supported wavelets," Commun. Pure Appl. Math., vol. 41, no. 7, pp. 909–996, Oct. 1988.
[18] A. Cohen, I. Daubechies, and J.-C. Feauveau, "Biorthogonal bases of compactly supported wavelets," Commun. Pure Appl. Math., vol. 45, no. 5, pp. 485–560, Jun. 1992.
[19] A. Cohen, "Adaptive methods for PDE's: Wavelets or mesh refinement?" 2002, arXiv:math/0212414. [Online]. Available: https://arxiv.org/abs/math/0212414
[20] I. W. Selesnick, R. G. Baraniuk, and N. C. Kingsbury, "The dual-tree complex wavelet transform," IEEE Signal Process. Mag., vol. 22, no. 6, pp. 123–151, Nov. 2005.
[21] A. R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, "Wavelet transforms that map integers to integers," Appl. Comput. Harmon. Anal., vol. 5, no. 3, pp. 332–369, Jul. 1998.
[22] W. Sweldens, "The lifting scheme: A construction of second generation wavelets," SIAM J. Math. Anal., vol. 29, no. 2, pp. 511–546, Mar. 1998.
[23] I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps," J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247–269, May 1998.
[24] Z. Xiong, X. Wu, S. Cheng, and J. Hua, "Lossy-to-lossless compression of medical volumetric data using three-dimensional integer wavelet transforms," IEEE Trans. Med. Imag., vol. 22, no. 3, pp. 459–470, Mar. 2003.
[25] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3445–3462, 1993.
[26] A. Said and W. A. Pearlman, "A new, fast, and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243–250, Jun. 1996.
[27] A. Said and W. A. Pearlman, "An image multiresolution representation for lossless and lossy compression," IEEE Trans. Image Process., vol. 5, no. 9, pp. 1303–1310, Sep. 1996.
[28] T. Bruylants, A. Munteanu, and P. Schelkens, "Wavelet based volumetric medical image compression," Signal Process., Image Commun., vol. 31, pp. 112–133, Feb. 2015.
[29] D. Taubman, "High performance scalable image compression with EBCOT," IEEE Trans. Image Process., vol. 9, no. 7, pp. 1158–1170, Jul. 2000.
[30] P. Schelkens, A. Munteanu, J. Barbarien, M. Galca, X. Giro-Nieto, and J. Cornelis, "Wavelet coding of volumetric medical datasets," IEEE Trans. Med. Imag., vol. 22, no. 3, pp. 441–458, Mar. 2003.
[31] J. Zhang, J. E. Fowler, N. H. Younan, and G. Liu, "Evaluation of JP3D for lossy and lossless compression of hyperspectral imagery," in Proc. IEEE Int. Geosci. Remote Sens. Symp., vol. 4, Jul. 2009, p. IV-474.
[32] JPEG Committee. (2017). High Throughput JPEG 2000 (HTJ2K): Call for Proposals. [Online]. Available: https://jpeg.org/downloads/htj2k/HTJ2K_draft_cfp.pdf
[33] D. Taubman, A. Naman, and R. Mathew, "High throughput block coding in the HTJ2K compression standard," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 1079–1083.
[34] F. Auli-Llinas, P. Enfedaque, J. C. Moure, and V. Sanchez, "Bitplane image coding with parallel coefficient processing," IEEE Trans. Image Process., vol. 25, no. 1, pp. 209–219, Jan. 2016.
[35] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[36] H. Bui et al., "Scalable parallel I/O on a Blue Gene/Q supercomputer using compression, topology-aware data aggregation, and subfiling," in Proc. 22nd Euromicro Int. Conf. Parallel, Distrib., Netw.-Based Process., Feb. 2014, pp. 107–111.
[37] J. G. Cornelis, J. Lemeire, T. Bruylants, and P. Schelkens, "Heterogeneous acceleration of volumetric JPEG 2000," in Proc. 23rd Euromicro Int. Conf. Parallel, Distrib., Netw.-Based Process., Mar. 2015, pp. 1–8.
[38] A. Munshi, "The OpenCL specification," in Proc. IEEE Hot Chips 21 Symp. (HCS), Aug. 2009, pp. 1–314.
[39] (2011). JPEG 2000: IRIS-JP3D Software. [Online]. Available: http://vubtechtransfer.be/success-stories/jpeg-2000-iris-jp3d-software/
[40] M. Treib, F. Reichl, S. Auer, and R. Westermann, "Interactive editing of gigasample terrain fields," Comput. Graph. Forum, vol. 31, no. 2, pp. 383–392, 2012.
[41] A. Balevic, N. Fuerst, M. Heide, S. Papandreou, and A. Weiss, "Cuj2k: A JPEG2000 encoder in CUDA," Inst. Parallel Distrib. Syst., Univ. Stuttgart, Stuttgart, Germany, Tech. Rep., 2009.
[42] T. Richter and S. Simon, "Comparison of CPU and GPU based coding on low-complexity algorithms for display signals," Proc. SPIE, vol. 8856, Sep. 2013, Art. no. 885615.
[43] A. Naman and D. Taubman, "Decoding high-throughput JPEG2000 (HTJ2K) on a GPU," in Proc. IEEE Int. Conf. Image Process., 2019, pp. 1084–1088.
[44] The Kakadu Software. [Online]. Available: https://kakadusoftware.com/
[45] P. Enfedaque, F. Auli-Llinas, and J. C. Moure, "GPU implementation of bitplane coding with parallel coefficient processing for high performance image compression," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 8, pp. 2272–2284, Aug. 2017.
[46] M. Rabbani and P. W. Jones, Digital Image Compression Techniques, vol. 7. Bellingham, WA, USA: SPIE, 1991.
[47] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding. Upper Saddle River, NJ, USA: Prentice-Hall, 1995.
[48] D. L. Donoho, "Interpolating wavelet transforms," Dept. Statist., Stanford Univ., Stanford, CA, USA, Tech. Rep., 1992, vol. 2, no. 3, pp. 1–54.
[49] M. W. Marcellin and A. Bilgin, "Quantifying the parent-child coding gain in zero-tree-based coders," IEEE Signal Process. Lett., vol. 8, no. 3, pp. 67–69, Mar. 2001.
[50] J. Duda, "Asymmetric numeral systems: Entropy coding combining speed of Huffman coding with compression rate of arithmetic coding," 2013, arXiv:1311.2540. [Online]. Available: http://arxiv.org/abs/1311.2540
[51] J. Duda and M. Niemiec, "Lightweight compression with encryption based on asymmetric numeral systems," 2016, arXiv:1612.04662. [Online]. Available: http://arxiv.org/abs/1612.04662
[52] Apple Inc. (2016). LZFSE Compression Library and Command Line Tool. [Online]. Available: https://github.com/lzfse/lzfse
[53] Facebook, "Zstandard - Fast real-time compression algorithm," 2016. [Online]. Available: https://github.com/facebook/zstd
[54] SAMtools. (2016). CRAMv3. [Online]. Available: https://samtools.github.io/hts-specs/CRAMv3.pdf
[55] F. Giesen, "Interleaved entropy coders," 2014, arXiv:1402.3392. [Online]. Available: http://arxiv.org/abs/1402.3392
[56] A. Fog. (2020). Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-Operation Breakdowns for Intel, AMD and VIA CPUs. Copenhagen University College of Engineering. [Online]. Available: http://www.agner.org/
[57] R. Alverson, "Integer division using reciprocals," in Proc. 10th IEEE Symp. Comput. Arithmetic, Jun. 1991, pp. 186–190.
[58] F. Giesen. (2016). rANS With Static Probability Distributions. [Online]. Available: https://fgiesen.wordpress.com/2014/02/18/rans-with-static-probability-distributions/
[59] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, 2009.
[60] F. Amat. (2017). Keller Lab Block File Type (.klb) C++11 Source Code and API. [Online]. Available: https://bitbucket.org/fernandoamat/keller-lab-block-filetype
[61] D. Rossinelli, "Software for high-throughput lossy-to-lossless 3D image compression," Zenodo, 2020, doi: 10.5281/zenodo.4097961.
[62] S. Houben. (2002). Second-Generation Wavelets on Finite Intervals. [Online]. Available: https://www.win.tue.nl/casa/meetings/seminar/previous/_index_files/wavelets4.pdf
