Learning Better Lossless Compression Using Lossy Compression

Fabian Mentzer (ETH Zurich), Luc Van Gool (ETH Zurich), Michael Tschannen (Google Research, Brain Team)

Abstract

We leverage the powerful lossy image compression algorithm BPG to build a lossless image compression system. Specifically, the original image is first decomposed into the lossy reconstruction obtained after compressing it with BPG and the corresponding residual. We then model the distribution of the residual with a convolutional neural network-based probabilistic model that is conditioned on the BPG reconstruction, and combine it with entropy coding to losslessly encode the residual. Finally, the image is stored using the concatenation of the bitstreams produced by BPG and the learned residual coder. The resulting compression system achieves state-of-the-art performance in learned lossless full-resolution image compression, outperforming previous learned approaches as well as PNG, WebP, and JPEG2000.

Figure 1. Overview of the proposed learned lossless compression approach. To encode an input image x, we feed it into the Q-Classifier (QC) CNN to obtain an appropriate quantization parameter Q, which is used to compress x with BPG. The resulting lossy reconstruction xl is fed into the Residual Compressor (RC) CNN, which predicts the probability distribution of the residual, p(r|xl), conditionally on xl. An arithmetic coder (AC) encodes the residual r to a bitstream, given p(r|xl). In gray we visualize how to reconstruct x from the bitstream. Learned components are shown in violet.

1. Introduction

The need to efficiently store the ever growing amounts of data generated continuously on mobile devices has spurred a lot of research on compression. Algorithms like JPEG [51] for images and H.264 [53] for videos are used by billions of people daily.

After the breakthrough results achieved with deep neural networks in image classification [27], and the subsequent rise of deep-learning based methods, learned lossy image compression has emerged as an active area of research (e.g. [6, 45, 46, 37, 2, 4, 30, 28, 48]). In lossy compression, the goal is to achieve small bitrates R given a certain allowed distortion D in the reconstruction, i.e., the rate-distortion trade-off R + λD is optimized. In contrast, in lossless compression, no distortion is allowed, and we aim to reconstruct the input perfectly by transmitting as few bits as possible. To this end, a probabilistic model of the data can be used together with entropy coding techniques to encode and transmit data via a bitstream. The theoretical foundation for this idea is given in Shannon's landmark paper [40], which proves a lower bound for the bitrate achievable by such a probabilistic model, and the overhead incurred by using an imprecise model of the data distribution. One beautiful result is that maximizing the likelihood of a parametric probabilistic model is equivalent to minimizing the bitrate obtained when using that model for lossless compression with an entropy coder (see, e.g., [29]). Learning parametric probabilistic models by likelihood maximization has been studied to a great extent in the generative modeling literature (e.g. [50, 49, 39, 34, 25]). Recent works have linked these results to learned lossless compression [29, 18, 47, 24].

Even though recent learned lossy image compression methods achieve state-of-the-art results on various data sets, the results obtained by the non-learned H.265-based BPG [43, 7] are still highly competitive, without requiring sophisticated hardware accelerators such as GPUs to run. While BPG was outperformed by learning-based approaches across the bitrate spectrum in terms of PSNR [30] and visual quality [4], it still excels particularly at high-PSNR lossy reconstructions.

In this paper, we propose a learned lossless compression system by leveraging the power of the lossy BPG, as

illustrated in Fig. 1. Specifically, we decompose the input image x into the lossy reconstruction xl produced by BPG and the corresponding residual r. We then learn a probabilistic model p(r|xl) of the residual, conditionally on the lossy reconstruction xl. This probabilistic model is fully convolutional and can be evaluated using a single forward pass, both for encoding and decoding. We combine it with an arithmetic coder to losslessly compress the residual and store or transmit the image as the concatenation of the bitstrings produced by BPG and the residual compressor. Further, we use a computationally inexpensive technique from the generative modeling literature, tuning the "certainty" (temperature) of p(r|xl), as well as an auxiliary shallow classifier to predict the quantization parameter of BPG, in order to optimize our compressor on a per-image basis. These components together lead to a state-of-the-art full-resolution learned lossless compression system. All of our code and data sets are available on github.1

1 https://github.com/fab-jul/RC-PyTorch

In contrast to recent work in lossless compression, we do not need to compute and store any side information (as opposed to L3C [29]), and our CNN is lightweight enough to train and evaluate on high-resolution natural images (as opposed to [18, 24], which have not been scaled to full-resolution images to our knowledge).

In summary, our main contributions are:

• We leverage the power of the classical state-of-the-art lossy compression algorithm BPG in a novel way to build a conceptually simple learned lossless image compression system.

• Our system is optimized on a per-image basis with a light-weight post-training step, where we obtain a lower-bitrate probability distribution by adjusting the confidence of the predictions of our probabilistic model.

• Our system outperforms the state-of-the-art in learned lossless full-resolution image compression, L3C [29], as well as the classical engineered algorithms WebP, JPEG2000, and PNG. Further, in contrast to L3C, we also outperform FLIF on Open Images, the domain where our approach (as well as L3C) is trained.

2. Related Work

Learned Lossless Compression  Arguably most closely related to this paper, Mentzer et al. [29] build a computationally cheap hierarchical generative model (termed L3C) to enable practical compression on full-resolution images. Townsend et al. [47] and Kingma et al. [24] leverage the "bits-back scheme" [17] for lossless compression of an image stream, where the overall bitrate of the stream is reduced by leveraging previously transmitted information. Motivated by recent progress in generative modeling using (continuous) flow-based models (e.g. [35, 23]), Hoogeboom et al. [18] propose Integer Discrete Flows (IDFs), defining an invertible transformation for discrete data. In contrast to L3C, the latter works focus on smaller data sets such as MNIST, CIFAR-10, ImageNet32, and ImageNet64, where they achieve state-of-the-art results.

Likelihood-Based Generative Modeling  As mentioned in Section 1, virtually every generative model can be used for lossless compression, when used with an entropy coding algorithm. Therefore, while the following generative approaches do not take a compression perspective, they are still related. The state-of-the-art PixelCNN [50]-based models rely on auto-regression in RGB space to efficiently model a conditional distribution. The original PixelCNN [50] and PixelRNN [49] model the probability distribution of a pixel given all previous pixels (in raster-scan order). To use these models for lossless compression, O(H · W) forward passes are required, where H and W are the image height and width, respectively. Various speed optimizations and a probability model amenable to faster training were proposed in [39]. Different other parallelization techniques were developed, including those from [34], modeling the image distribution conditionally on subsampled versions of the image, as well as those from [25], conditioning on an RGB pyramid and grayscale images. Similar techniques were also used by [9, 31].

Engineered Lossless Compression Algorithms  The wide-spread PNG [33] applies simple autoregressive filters to remove redundancies from the RGB representation (e.g. replacing pixels with the difference to their left neighbor), and then uses the DEFLATE [11] algorithm for compression. In contrast, WebP [52] uses larger windows to transform the image (enabling patch-wise conditional compression), and relies on a custom entropy coder for compression. Mainly in use for lossy compression, JPEG2000 [41] also has a lossless mode, where an invertible mapping from RGB to compression space is used. At the heart of FLIF [42] is an entropy coding method called "meta-adaptive near-zero integer arithmetic coding" (MANIAC), which is based on the CABAC method used in, e.g., H.264 [53]. In CABAC, the context model used to compress a symbol is selected from a finite set based on local context [36]. The "meta-adaptive" part in MANIAC refers to the context model, which is a decision tree learned per image.

Artifact Removal  Artifact removal methods in the context of lossy compression are related to our approach in that they aim to make predictions about the information lost during the lossy compression process. In this context, the goal is to produce sharper and/or more visually pleasing images given a lossy reconstruction from, e.g., JPEG. Dong et al. [12] proposed the first CNN-based approach using a network inspired by super-resolution networks. [44] extends

this using a residual structure, and [8] relies on hierarchical skip connections and a multi-scale loss. Generative models in the context of artifact removal are explored by [13], which proposes to use GANs [14] to obtain more visually pleasing results.

Figure 2. Histogram of the marginal pixel distribution of residual values obtained using BPG and Q predicted from QC, on Open Images.

3. Background

3.1. Lossless Compression

We give a very brief overview of lossless compression basics here and refer to the literature for details [40, 10]. In lossless compression, we consider a stream of symbols x1, ..., xN, where each xi is an element from the same finite set X. The stream is obtained by drawing each symbol xi independently from the same distribution p̃, i.e., the xi are i.i.d. according to p̃. We are interested in encoding the symbol stream into a bitstream, such that we can recover the exact symbols by decoding. In this setup, the entropy of p̃ is equal to the expected number of bits needed to encode each xi:

H(p̃) = E_{xi∼p̃}[bits(xi)] = E_{xi∼p̃}[−log2 p̃(xi)].

In general, however, the exact p̃ is unknown, and we instead consider the setup where we have an approximate model p. Then, the expected bitrate will be equal to the cross-entropy between p̃ and p, given by

H(p̃, p) = E_{xi∼p̃}[−log2 p(xi)].    (1)

Intuitively, the higher the discrepancy between the model p used for coding and the real p̃, the more bits we need to encode data that is actually distributed according to p̃.

Entropy Coding  Given a symbol stream xi as above and a probability distribution p (not necessarily p̃), we can encode the stream using entropy coding. Intuitively, we would like to build a table that maps every element in X to a bit sequence, such that xi gets a short sequence if p(xi) is high. The optimum is to output −log2 p(xi) bits for symbol xi, which is what entropy coding algorithms achieve. Examples include Huffman coding [19] and arithmetic coding [54].

In general, we can use a different distribution pi for every symbol xi in the stream, as long as the pi are also available for decoding. Adaptive entropy coding algorithms work by allowing such varying distributions as a function of previously encoded symbols. In this paper, we use adaptive arithmetic coding [54].

3.2. Lossless Image Compression with CNNs

As explained in the previous section, all we need for lossless compression is a model p, since we can use entropy coding to encode and decode any input losslessly given p. In particular, we can use a CNN to parametrize p. To this end, one general approach is to introduce (structured) side information z available both at encoding and decoding time, and to model the probability distribution of natural images x^(i) conditionally on z, using the CNN to parametrize p(x|z).2 Assuming that both the encoder and decoder have access to z and p, we can losslessly encode x^(i) as follows: We first use the CNN to produce p(x|z). Then, we employ an entropy encoder (described in the previous section) with p(x|z) to encode x^(i) to a bitstream. To decode, we once again feed z to the CNN, obtaining p(x|z), and decode x^(i) from the bitstream using the entropy decoder.

One key difference among the approaches in the literature is the factorization of p(x|z). In the original PixelCNN paper [49], the image x is modeled as a sequence of pixels, and z corresponds to all previous pixels. Encoding as well as decoding are done autoregressively. In IDF [18], x is mapped to a z using an invertible function, and z is then encoded using a fixed prior p(z), i.e., p(x|z) here is a deterministic function of z. In approaches based on the bits-back paradigm [47, 24], while encoding, z is obtained by decoding from additional available information (e.g. previously encoded images). In L3C [29], z corresponds to features extracted with a hierarchical model that are also saved to the bitstream using hierarchically predicted distributions.

2 We write p(x) to denote the entire probability mass function and p(x^(i)) to denote p(x) evaluated at x^(i).

3.3. BPG

BPG is a lossy image compression method based on the HEVC video coding standard [43], essentially applying HEVC on a single image. To motivate our usage of BPG, we show the histogram of the marginal pixel distribution of the residuals obtained by BPG on Open Images (one of our testing sets, see Section 5.1) in Fig. 2. Note that while the possible range of a residual is {−255, ..., 255}, we observe that for most images, nearly every point in the residual is in the restricted set {−6, ..., 6}, which is indicative of the high-PSNR nature of BPG. Additionally, Fig. A1 (in the suppl.) presents a comparison of BPG to the state-of-the-art learned image compression methods, showing that BPG is still very competitive in terms of PSNR.

BPG follows JPEG in having a chroma format parameter to enable chroma subsampling, which we disable by setting it to 4:4:4. The only remaining parameter to set

is the quantization parameter Q, where Q ∈ {1, ..., 51}. Smaller Q results in less quantization and thus better quality (i.e., different to the quality factor of JPEG, where larger means better reconstruction quality). We learn a classifier to predict Q, described in Section 4.4.

4. Proposed Method

We give an overview of our method in Fig. 1. To encode an image x, we first obtain the quantization parameter Q from the Q-Classifier (QC) network (Section 4.4). Then, we compress x with BPG, to obtain the lossy reconstruction xl, which we save to a bitstream. Given xl, the Residual Compressor (RC) network (Section 4.1) predicts the probability mass function of the residual r = x − xl, i.e.,

p(r|xl) = RC(xl).

We model p(r|xl) as a discrete mixture of logistic distributions (Section 4.2). Given p(r|xl) and r, we compress r to the bitstream using an adaptive arithmetic coding algorithm (see Section 3.1). Thus, the bitstream B consists of the concatenation of the codes corresponding to xl and r. To decode x from B, we first obtain xl using the BPG decoder, then we obtain once again p(r|xl) = RC(xl), and subsequently decode r from the bitstream using p(r|xl). Finally, we can reconstruct x = xl + r. In the formalism of Section 3.2, we have x = r, z = xl.

Note that no matter how bad RC is at predicting the real distribution of r, we can always do lossless compression. Even if RC were to predict, e.g., a uniform distribution, we would simply need many bits to store r.

4.1. Residual Compressor

We use a CNN inspired by ResNet [15] and U-Net [38], shown in detail in Fig. 3. We first extract an initial feature map fin with Cf = 128 channels, which we then downscale using a stride-2 convolution, and feed through 16 residual blocks. Instead of BatchNorm [20] layers as in ResNet, our residual blocks contain GDN layers proposed by [5]. Subsequently, we upscale back to the resolution of the input image using a transposed convolution. The resulting features are concatenated with fin, and convolved to contract the 2 · Cf channels back to Cf, like in U-Net. Finally, the network splits into four tails, predicting the different parameters of the mixture model, π, µ, σ, λ, described next.

4.2. Logistic Mixture Model

We use a discrete mixture of logistics to model the probability mass function of the residual, p(r|xl), similar to [29, 39]. We closely follow the formulation of [29] here: Let c denote the RGB channel and u, v the spatial location. We define

p(r|xl) = ∏_{u,v} p(r_{1uv}, r_{2uv}, r_{3uv} | xl).    (2)

We use a (weak) autoregression over the three RGB channels to define the joint distribution over channels via logistic mixtures pm:

p(r_1, r_2, r_3 | xl) = pm(r_1 | xl) · pm(r_2 | xl, r_1) · pm(r_3 | xl, r_2, r_1),    (3)

where we removed the indices uv to simplify the notation. For the mixture pm we use a mixture of K = 5 logistic distributions pL. Our distributions are defined by the outputs of the RC network, which yields mixture weights π^k_{cuv}, means µ^k_{cuv}, variances σ^k_{cuv}, as well as mixture coefficients λ^k_{cuv}. The autoregression over RGB channels is only used to update the means, using a linear combination of µ and the target r of previous channels, scaled by the coefficients λ. We thereby obtain µ̃:

µ̃^k_{1uv} = µ^k_{1uv},
µ̃^k_{2uv} = µ^k_{2uv} + λ^k_{αuv} r_{1uv},
µ̃^k_{3uv} = µ^k_{3uv} + λ^k_{βuv} r_{1uv} + λ^k_{γuv} r_{2uv}.    (4)

With these parameters, we can define

pm(r_{cuv} | xl, r_{prev}) = Σ_{k=1}^{K} π^k_{cuv} pL(r_{cuv} | µ̃^k_{cuv}, σ^k_{cuv}),    (5)

where r_{prev} denotes the channels with index smaller than c (see Eq. 3), used to obtain µ̃ as shown above, and pL is the logistic distribution:

pL(r | µ, σ) = e^{−(r−µ)/σ} / (σ (1 + e^{−(r−µ)/σ})^2).

We evaluate pL at discrete r via its CDF, as in [39, 29], evaluating

pL(r) = CDF(r + 1/2) − CDF(r − 1/2).    (6)

4.3. Loss

As motivated in Section 3.1, we are interested in minimizing the cross-entropy between the real distribution of the residual p̃(r) and our model p(r): the smaller the cross-entropy, the closer p is to p̃, and the fewer bits an entropy coder will use to encode r. We consider the setting where we have N training images x^(1), ..., x^(N). For every image, we compute the lossy reconstruction x_l^(i) as well as the corresponding residual r^(i) = x^(i) − x_l^(i). While the true distribution p̃(r) is unknown, we can consider the empirical distribution obtained from the samples and minimize

L(RC) = −Σ_{i=1}^{N} log p(r^(i) | x_l^(i)).    (7)

This loss decomposes over samples, allowing us to minimize it over mini-batches. Note that minimizing Eq. 7 is the same as maximizing the likelihood of p, which is the perspective taken in the likelihood-based generative modeling literature.
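The discretized logistic mixture of Eqs. 5 and 6 is straightforward to evaluate numerically. The following NumPy sketch illustrates it for a single sub-pixel, using a toy K = 2 mixture rather than the paper's K = 5 and ignoring the channel autoregression of Eq. 4; all parameter values are made up for illustration:

```python
import numpy as np

def logistic_cdf(r, mu, sigma):
    """CDF of the logistic distribution with mean mu and scale sigma."""
    return 1.0 / (1.0 + np.exp(-(r - mu) / sigma))

def discrete_logistic_mixture_pmf(r, pi, mu, sigma):
    """Eq. 5/6 for one sub-pixel: a K-component logistic mixture,
    discretized by taking CDF differences over bins of width 1.
    pi, mu, sigma are length-K arrays; pi must sum to 1."""
    upper = logistic_cdf(r + 0.5, mu, sigma)
    lower = logistic_cdf(r - 0.5, mu, sigma)
    return float(np.sum(pi * (upper - lower)))

# Toy mixture with K = 2 components (illustrative values only).
pi = np.array([0.7, 0.3])
mu = np.array([0.0, 2.0])
sigma = np.array([1.0, 1.5])

# The PMF over the residual range sums to (almost) 1; in practice the
# tails are truncated at the ends of the range.
support = np.arange(-255, 256)
pmf = np.array([discrete_logistic_mixture_pmf(r, pi, mu, sigma) for r in support])
assert abs(pmf.sum() - 1.0) < 1e-6

# Cost in bits of encoding the true residual r = 0 under this model,
# i.e., one term of the loss in Eq. 7 expressed in base 2.
bits = -np.log2(discrete_logistic_mixture_pmf(0, pi, mu, sigma))
```

An entropy coder driven by this PMF spends close to `bits` bits on the symbol, which is why minimizing Eq. 7 directly minimizes the bitrate.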

Figure 3. The architecture of the residual compressor (RC). On the left, we show a zoom-in of a residual block.

4.4. Q-Classifier

A random set of natural images is expected to contain images of varying "complexity", where complex can mean a lot of high frequency structure and/or noise. While virtually all lossy compression methods have a parameter like BPG's Q to navigate the trade-off between bitrate and quality, it is important to note that compressing a random set of natural images with the same fixed Q will usually lead to the bitrates of these images being spread around some mean.

Indeed, in our pipeline we have a trade-off between the bits allocated to BPG and the bits allocated to encoding the residual. This trade-off can be controlled with Q: For example, if an image contains components that are easier to model for the RC network, it is beneficial to use a higher Q, such that BPG does not waste bits for encoding these components. We observe that there is a single optimal Q for a fixed image and a trained RC (Q was selected using the Open Images validation set). Thus, fixing a single Q for all images in our approach is suboptimal.

To efficiently obtain a good Q given an image x, we train a simple classifier network, the Q-Classifier (QC), and then use the predicted Q(x) to compress x with BPG. For the architecture, we use a light-weight ResNet-inspired network with 8 residual blocks for QC, and train it to predict a class in Q = {11, ..., 17}. While the input to QC is the full-resolution image, the network is shallow and downsamples multiple times, making this a computationally lightweight component. In contrast to ResNet, we employ no normalization layers (to ensure that the prediction is independent of the input size). Further, the final features are obtained by average pooling each of the channels of the final feature map. The resulting vector is fed to a fully connected layer to obtain the logits for the |Q| classes, which are then normalized with a softmax. Details are shown in Section A.1 in the supplementary material.
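To make the input-size independence of such a classifier concrete, here is a minimal NumPy sketch of a pooling-plus-softmax head of this kind. A single random convolution stands in for QC's eight residual blocks, and all widths, seeds, and weights are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the convolutional trunk: one 3x3 valid
# convolution producing C_f feature channels.
C_f, NUM_CLASSES = 4, 7          # 7 classes, as in Q = {11, ..., 17}
kernel = rng.normal(size=(C_f, 3, 3))
fc_w = rng.normal(size=(NUM_CLASSES, C_f))

def qc_logits(img):
    """Global average pooling makes the output independent of H x W."""
    H, W = img.shape
    feats = np.empty((C_f, H - 2, W - 2))
    for c in range(C_f):
        for i in range(H - 2):
            for j in range(W - 2):
                feats[c, i, j] = np.sum(img[i:i+3, j:j+3] * kernel[c])
    pooled = feats.mean(axis=(1, 2))      # -> C_f-dimensional vector
    logits = fc_w @ pooled                # fully connected layer
    return logits - logits.max()          # stabilized for softmax

def predict_q(img):
    probs = np.exp(qc_logits(img))
    probs /= probs.sum()                  # softmax over the classes
    return 11 + int(np.argmax(probs))     # map class index to a Q value

# The same (untrained) network handles different input resolutions:
q_small = predict_q(rng.normal(size=(16, 16)))
q_large = predict_q(rng.normal(size=(32, 48)))
assert 11 <= q_small <= 17 and 11 <= q_large <= 17
```

Because the spatial dimensions are averaged out before the fully connected layer, no normalization over a fixed spatial extent is needed, which is the property the text attributes to dropping normalization layers.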
lse,wihare which classes, Conv with stride 2 x h ewr rdcstepoaiiydsrbto fteresidual, the of distribution probability the predicts network the , . Q C o ex- For : f 256 = Residual Blocks Q Q Q Q = = - , 6642 eyhg erigrt of rate learning high very a iodi Eq. in lihood eol edt otefradps hog Coc,to once, RC through pass forward the do get to need only we image. the on depending iterations, 10-20 in converges ootmz Eq. optimize To where r by certain” smaller “more a logistic choosing predicted the making by smaller eody h piiaini nyoe 5prmtr.Fi- parameters. practical 15 for over since nally, only is optimization the Secondly, rtv oeigltrtr eg [ (e.g. literature modeling erative t“o eti”(.. h rbblt ascnetae too concentrates mass probability around the tightly (i.e., certain” “too it usmldvrinof version subsampled ito sol ae on based only is diction 4.5. illcto) hscue opeeyngiil overhead negligible completely of a spa- every causes a for this learn not location), (and only tial mixture we every since and However, channel trans- every to bitstream. for have the thus via we it and decoding, mit for known be to needs τ H uigecdn,we eadtoal aeacs othe to access have additionally target we when encoding, during eol edt evaluate to need only we ilsamr optimal more a yields h rdce distribution predicted the fR rdcsa predicts RC if a aetecosetoyi Eq. in cross-entropy the make can c cuv k hsnfreeymixture every for chosen , C ent htti sas opttoal ha.Firstly, cheap. computationally also is this that note We find We hl Ci led rie olanagood a learn to trained already is RC While nprdb h eprtr cln mlydi h gen- the in employed scaling temperature the by Inspired · ,σ ,π λ, σ, µ, oee,teei raigpit hr emake we where point, breaking a is there However, . W τ · •Optimization K r p ecnd h u nEq. 
in sum the do can we , cuv τ 3 = AAAC1HicbVFNaxNBGJ6sXzV+pfYowmBaaDWE3XhQkEjAi8cKJi3sLmF29t1myMzsMjNrGsc9qSfBq7/Dm+gv8Yd4d3bTQNv0hYFnnuf9fpOCM218/2/Lu3b9xs1bW7fbd+7eu/+gs/1wovNSURjT nOfqOCEaOJMwNsxwOC4UEJFwOErmb2r96AMozXL53iwLiAU5kSxjlBhHTTt7kYFT0+SxCSd0XtmIMkU5pHY3SoR9Vu1W1bTT9ft+Y3gTBGegO3r879dOK/1yON1u/YjSnJYCpKGcaB0GfmFiS5RhLnfVjkoNhStHTiB0UBIBOrZNHxXec0yKs1y5Jw1u2PMRlgitlyJx noKYmb6s1eSV2um6wKaUiKvosDTZy9gyWZQGJF21lpUcmxzX68QpU0ANXzpAqGJuOkxnRBFq3NIvFDBs/tHNLWFBcyGITJ+uFx0GsY1qOVxfcrhfJ+nX34PYtvE5i2SeQqhnpIDhKr6XMc6Hixkz0EsVWfSYlKCwqzwcuJ3jJtcBtt2getWcMrh8uE0wGfSD5/3Bu6A7 eo1WtoUeoSdoHwXoBRqht+gQjRFF39BP9Bv98SbeJ++z93Xl6rXOYnbQBfO+/weq9uWY + seulto equal is by , τ c k n hni vr tpo the of step every in then and , 7 ·

o ie image given a for Conv µ 5 min nta image that on rescaling cuv 8 τ floats Conv Transpose µ euesohsi rdetdsetwith descent gradient stochastic use we ,

[Figure 3: network architecture of the residual compressor RC, built from Conv, Residual Block, ReLU, and Tail layers, predicting the mixture parameters π, μ, σ, λ of p(r|x_l).]

τ-Optimization We can improve the final bitrate with a simple trick: after training, we further optimize the scales σ^(k)_{cuv} predicted by RC, using a factor τ_c per channel c,

    σ̃^(k)_{cuv} = σ^(k)_{cuv} · τ_c,    (8)

chosen by minimizing the likelihood of p(r|x_l). Intuitively, a τ_c that shrinks the scales too much shifts probability mass towards the predicted means, and the cross-entropy increases again.
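The τ-optimization can be illustrated with a small numpy sketch; for brevity it replaces our logistic mixture with a single discretized logistic per pixel and uses a plain grid search over one shared τ instead of per-channel factors (all function names are illustrative):

```python
import numpy as np

def logistic_cdf(x, mu, sigma):
    return 1.0 / (1.0 + np.exp(-(x - mu) / sigma))

def discretized_logistic_nll(r, mu, sigma):
    # Bitrate in bits per symbol: P(r) = CDF(r + 0.5) - CDF(r - 0.5)
    # for integer residuals r.
    p = logistic_cdf(r + 0.5, mu, sigma) - logistic_cdf(r - 0.5, mu, sigma)
    return float(-np.log2(np.maximum(p, 1e-12)).mean())

def optimize_tau(r, mu, sigma, taus=np.linspace(0.5, 1.5, 101)):
    # Grid search over a shared scale factor tau on the predicted sigmas,
    # minimizing the cross-entropy (i.e., the bitrate) of the residuals.
    best_tau, best_nll = 1.0, discretized_logistic_nll(r, mu, sigma)
    for tau in taus:
        nll = discretized_logistic_nll(r, mu, tau * sigma)
        if nll < best_nll:
            best_tau, best_nll = tau, nll
    return best_tau, best_nll

rng = np.random.default_rng(0)
r = rng.integers(-4, 5, size=(3, 64, 64)).astype(np.float64)
mu = np.zeros_like(r)
sigma = np.full_like(r, 5.0)  # deliberately over-estimated scales
tau, nll = optimize_tau(r, mu, sigma)
```

Since the scales here are over-estimated, the search settles on a τ < 1 and a strictly lower bitrate than the unscaled predictions.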
[bpsp]      Open Images    CLIC.mobile    CLIC.pro       DIV2K
RC (Ours)   2.790          2.538          2.933          3.079
L3C         2.991 (+7.2%)  2.639 (+4.0%)  2.944 (+0.4%)  3.094 (+0.5%)
PNG         4.005 (+44%)   3.896 (+54%)   3.997 (+36%)   4.235 (+38%)
JPEG2000    3.055 (+9.5%)  2.721 (+7.2%)  3.000 (+2.3%)  3.127 (+1.6%)
WebP        3.047 (+9.2%)  2.774 (+9.3%)  3.006 (+2.5%)  3.176 (+3.2%)
FLIF        2.867 (+2.8%)  2.492 (−1.8%)  2.784 (−5.1%)  2.911 (−5.5%)

Table 1. Compression performance of the proposed method (RC) compared to the learned L3C [29], as well as the classical engineered approaches PNG, JPEG2000, WebP, and FLIF. For each baseline we show the difference to our approach in percent; a positive percentage means we achieve a better bpsp, a negative one that the baseline is better.
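The bpsp values in Table 1 are total bits divided by the number of subpixels (3 per RGB pixel); as a minimal illustration (helper name ours):

```python
def bpsp(num_bytes: int, height: int, width: int, channels: int = 3) -> float:
    """Bits per subpixel: total encoded bits divided by the subpixel count."""
    return num_bytes * 8 / (height * width * channels)

# A 512x512 RGB image stored in 262 144 bytes costs 8/3 ~= 2.667 bpsp.
rate = bpsp(262_144, 512, 512)
```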

5. Experiments

5.1. Data sets

Training Like L3C [29], we train on 300 000 images from the Open Images data set [26]. These images are made available as JPEGs, which is not ideal for the lossless compression task we are considering, but we are not aware of a similarly large-scale lossless training data set. To prevent overfitting on JPEG artifacts, we downscale each training image using a factor randomly selected from [0.6, 0.8] by means of the Lanczos filter provided by the Pillow library [32]. For a fair comparison, the L3C baseline results were also obtained by training on the exact same data set.

Evaluation We evaluate our model on four data sets: Open Images is a subset of 500 images from the Open Images validation set, preprocessed like the training data. CLIC.mobile and CLIC.pro are two new data sets commonly used in recent image compression papers, released as part of the “Workshop and Challenge on Learned Image Compression” (CLIC) [1]. CLIC.mobile contains 61 images taken using cell phones, while CLIC.pro contains 41 images from DSLRs, retouched by professionals. Finally, we evaluate on the 100 images from DIV2K [3], a super-resolution data set with high-quality images. We show examples from these data sets in Section A.3.

For a small fraction of exceptionally high-resolution images (note that the considered testing sets contain images of widely varying resolution), we follow L3C in extracting 4 non-overlapping crops x_c from the image x such that combining the x_c yields x. We then compress the crops individually. However, we evaluate the non-learned baselines on the full images to avoid a bias in favor of our method.

5.2. Training Procedures

Residual Compressor We train for 50 epochs on batches of 16 random 128×128 crops extracted from the training set, using the RMSProp optimizer [16]. We start with an initial learning rate (LR) of 5E−5, which we decay every 100 000 iterations by a factor of 0.75. Since our Q-Classifier is trained on the output of a trained RC network, it is not available while training the RC network. Thus, we compress the training images with a random Q selected from {12, 13, 14}, obtaining a pair (x, x_l) for every image.

Q-Classifier Given a trained RC network, we randomly select 10% of the training set, and compress each selected image x once for each Q ∈ Q, obtaining an x_l^(Q) for each Q ∈ Q. We then evaluate RC for each pair (x, x_l^(Q)) to find the optimal Q′ that gives the minimum bitrate for that image. The resulting list of pairs (x, x_l^(Q′)) forms the training set for the QC. For training, we use a standard cross-entropy loss between the softmax-normalized logits and the one-hot encoded ground truth Q′. We train for 11 epochs on batches of 32 random 128×128 crops, using the Adam optimizer [21]. We set the initial LR to the Adam-default 1E−4, and decay after 5 and 10 epochs by a factor of 0.25.

5.3. Architecture and Training Ablations

Training on Fixed Q As noted in Section 5.2, we select a random Q during training, since QC is only available after training. We explored fixing Q to one value (trying Q ∈ {12, 13, 14}) and found that this hurts generalization performance. This may be explained by the fact that RC sees more varied residual statistics during training if we have random Q’s.

Effect of the Crop Size Using crops of 128×128 to train a model evaluated on full-resolution images may seem too constraining. To explore the effect of crop size, we trained different models, each seeing the same number of pixels in every iteration, but distributed differently in terms of batch size vs. crop size. We trained each model for 600 000 iterations, and then evaluated on the Open Images validation set (using a fixed Q = 14 for training and testing). The results are shown in the following table and indicate that smaller crops and bigger batch sizes are beneficial.

Batch Size   Crop Size   BPSP on Open Images
16           128×128     2.854
4            256×256     2.864
1            512×512     2.877
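The search for the optimal Q′ described above is simply an argmin over the candidate Q values; a minimal illustration, where rc_bitrate is a hypothetical stand-in for compressing the image with BPG at quality Q and evaluating the RC network on the resulting pair:

```python
def best_q(image, candidate_qs, rc_bitrate):
    """Return the Q' that minimizes the bitrate of (image, lossy(image, Q))."""
    return min(candidate_qs, key=lambda q: rc_bitrate(image, q))

# Toy stand-in for the expensive forward passes: pretend Q=13 gives the
# lowest bpsp for this particular image.
toy_bitrates = {12: 2.91, 13: 2.84, 14: 2.88}
q_opt = best_q(None, [12, 13, 14], lambda img, q: toy_bitrates[q])
```

The Q-Classifier replaces this exhaustive loop with a single forward pass that predicts Q′ directly.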

GDN We found that the GDN layers are crucial for good performance. We also explored instance normalization and conditional instance normalization layers, in the latter case conditioning on the bitrate of BPG, in the hope that this would allow the network to distinguish different operation modes. However, we found that instance normalization is more sensitive to the resolution used for training, which led to worse overall bitrates.

6. Results and Discussion

6.1. Compression performance in bpsp

We follow previous work in evaluating bits per subpixel (each RGB pixel has 3 subpixels), bpsp for short, sometimes called bits per dimension. In Table 1, we show the performance of our approach on the described test sets. On Open Images, the domain where we train, we outperform all methods, including FLIF. Note that while L3C was trained on the same data set, it does not outperform FLIF. On the other data sets, we consistently outperform both L3C and the non-learned approaches PNG, WebP, and JPEG2000.

These results indicate that our simple approach of using a powerful lossy compressor to compress the high-level image content, and leveraging a complementary learned probabilistic model of the low-level variations for lossless residual compression, is highly effective. Even though we only train on Open Images, our method generalizes to various domains of natural images: mobile phone pictures (CLIC.mobile), images retouched by professional photographers (CLIC.pro), as well as high-quality images with diverse complex structures (DIV2K).

In Fig. 4 we show the bpsp of each of the 500 images of Open Images when compressed using our method, FLIF, and PNG. For our approach, we also show the bits used to store x_l for each image, measured in bpsp on top (“x_l only”), and as a percentage on the bottom. The percentage averages at 42%, going up towards the high-bpsp end of the figure. This plot shows the wide range of bpsp covered by a random set of natural images, and motivates our Q-Classifier. We can also see that while our method tends to outperform FLIF on average, FLIF is better for some high-bpsp images, where the bpsp of both FLIF and our method approaches that of PNG.

6.2. Runtime

We compare the decoding speed of RC to that of L3C for 512×512 images, using an NVidia Titan XP. For our components: BPG: 163 ms; RC: 166 ms; arithmetic coding: 89.1 ms; i.e., 418 ms in total, compared to L3C’s 374 ms. QC and τ-optimization are only needed for encoding. We discussed above that both components are computationally cheap. In terms of actual runtime: QC: 6.48 ms; τ-optimization: 35.2 ms.

6.3. Q-Classifier and τ-Optimization

In Table 2 we show the benefits of using the Q-Classifier as well as the τ-optimization. We show the resulting bpsp for the Open Images validation set (top) and for DIV2K (bottom), as well as the percentage of predicted Q that are ±1 away from the optimal Q′ (denoted “±1 to Q′”), against a baseline of using a fixed Q = 14 (the mean over QC’s training set, see Section 5.2). The last column shows the required number of forward passes through RC.

Q-Classifier We first note that even though the QC was only trained on Open Images (see Section 5.2), we get similar behavior on Open Images and DIV2K. Moreover, we see that using QC is clearly beneficial over using a fixed Q for all images, and only incurs a small increase in bpsp compared to using the optimal Q′ (0.18% for Open Images, 0.26% for DIV2K). This can be explained by the fact that QC manages to predict Q within ±1 of Q′ for 94.8% of the images in Open Images and 90.2% of the DIV2K images. Furthermore, the small increase in bpsp is traded for a reduction from requiring 7 forward passes to compute Q′ to a single one. In that sense, using the QC is similar to the “fast” modes common in image compression algorithms, where speed is traded against bitrate.

τ-Optimization Table 2 shows that using the τ-optimization on top of QC reduces the bitrate on both testing sets.

Discussion While the gains of both components are small, their computational complexity is also very low (see Section 6.2). As such, we found it quite impressive to get the reported gains. We believe the direction of tuning a handful of parameters post training on an instance basis is a very promising direction for image compression. One fruitful direction could be using dedicated architectures and including a tuning step end-to-end, as in meta learning.

Figure 4. Top: Distribution of bpsp on the 500 images from the Open Images validation set (curves: Ours, x_l only, FLIF, PNG; x-axis: image index). The images are sorted by the bpsp achieved using our approach. We show PNG and FLIF, as well as the bpsp needed to store the lossy reconstruction only (“x_l only”). Bottom: Fraction of total bits used by our approach that are used to store x_l (“Fraction BPG / Total”). Images follow the same order as in the top panel.

6.4. Visualizing the learned p(r|x_l)

While the bpsp results from the previous section validate the compression performance of our model, it is interesting to investigate the distribution predicted by RC. Note that we predict a mixture distribution per pixel, which is hard to visualize directly. Instead, we sample from the predicted distribution. We expect the samples to be visually similar to the ground-truth residual r = x − x_l.

The sampling results are shown in Fig. 5, where we visualize two images from CLIC.pro with their lossy reconstructions, as obtained by BPG. We also show the ground-truth residuals r. Then, we show two samples obtained from the probability distribution p(r|x_l) predicted by our RC network. For the top image, r is in {−9, ..., 9}, for the bottom it is in {−5, ..., 4} (cf. Fig. 2), and we re-normalized r to the RGB range {0, ..., 255} for visualization, but to reduce eye strain we replaced the most frequent value (128, i.e., gray) with white.

We can clearly see that our approach i) learned to model the noise patterns discarded by BPG that are inherent to these images, ii) learned to correctly predict a zero residual where BPG manages to perfectly reconstruct, and iii) learned to predict structures similar to the ones in the ground truth.

Figure 5. Panels, left to right: input/output x, lossy reconstruction x_l, residual r = x − x_l, and two samples from our predicted p(r|x_l). Visualizing the learned distribution p(r|x_l) by sampling from it. We compare the samples to the ground-truth target residual r. We also show the image x that we losslessly compress, as well as the lossy reconstruction x_l obtained from BPG. For easier visualization, pixels in the residual images equal to 0 are set to white instead of gray. Best viewed on screen due to the high-frequency noise.

7. Conclusion

In this paper, we showed how to leverage BPG to achieve state-of-the-art results in full-resolution learned lossless image compression. Our approach outperforms L3C, PNG, WebP, and JPEG2000 consistently, and also outperforms the hand-crafted state-of-the-art FLIF on images from the Open Images data set. Future work should investigate input-dependent optimizations, which are also used by FLIF and which we started to explore here by optimizing the scale of the probabilistic model for the residual (τ-optimization). Similar approaches could also be applied to latent probability models of lossy image and video compression methods.

Data set      Setup          bpsp    ±1 to Q′   # forward passes
Open Images   Optimal Q′     2.789   100%       |Q| = 7
              Fixed Q = 14   2.801   82.6%      1
              Our QC         2.794   94.8%      1
              Our QC + τ     2.790   94.8%      1
DIV2K         Optimal Q′     3.080   100%       |Q| = 7
              Fixed Q = 14   3.096   73.0%      1
              Our QC         3.088   90.2%      1
              Our QC + τ     3.079   90.2%      1

Table 2. On Open Images and DIV2K, we compare using the optimal Q′ for encoding images vs. a fixed Q = 14 vs. using the Q predicted by the Q-Classifier. For each data set, the last row shows the additional gains obtained from applying the τ-optimization. The fourth column shows the percentage of predicted Q that are ±1 away from the optimal Q′, and the last column corresponds to the number of forward passes required for Q-optimization.
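The per-pixel sampling used for Fig. 5 can be sketched as follows, assuming a discretized logistic mixture for a single channel; shapes and names are illustrative, not our actual implementation:

```python
import numpy as np

def sample_residual(pi, mu, sigma, rng):
    """Sample integer residuals from a per-pixel logistic mixture.

    pi, mu, sigma have shape (K, H, W): K mixture components per pixel.
    """
    K, H, W = pi.shape
    # Pick a mixture component per pixel from the categorical weights pi.
    cum = np.cumsum(pi, axis=0)
    k = np.minimum((rng.random((1, H, W)) > cum).sum(axis=0), K - 1)
    rows, cols = np.ogrid[:H, :W]
    m, s = mu[k, rows, cols], sigma[k, rows, cols]
    # Inverse-CDF sampling of a logistic, then rounding to the integer grid.
    u = np.clip(rng.random((H, W)), 1e-9, 1.0 - 1e-9)
    return np.rint(m + s * np.log(u / (1.0 - u))).astype(int)

rng = np.random.default_rng(0)
K, H, W = 3, 8, 8
pi = np.full((K, H, W), 1.0 / K)   # uniform mixture weights
mu = np.zeros((K, H, W))           # zero-mean components
sigma = np.full((K, H, W), 1.5)    # moderate scales
r_sample = sample_residual(pi, mu, sigma, rng)
```

With zero-mean components, the samples are integer residuals scattered around 0, mirroring the high-frequency noise visible in Fig. 5.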

References

[1] Workshop and Challenge on Learned Image Compression. https://www.compression.cc/challenge/.
[2] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc Van Gool. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. In NIPS, 2017.
[3] Eirikur Agustsson and Radu Timofte. NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In CVPR Workshops, 2017.
[4] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative Adversarial Networks for Extreme Learned Image Compression. In ICCV, 2019.
[5] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281, 2015.
[6] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end Optimized Image Compression. ICLR, 2016.
[7] Fabrice Bellard. BPG Image format. https://bellard.org/bpg/.
[8] Lukas Cavigelli, Pascal Hager, and Luca Benini. CAS-CNN: A deep convolutional neural network for image compression artifact suppression. In IJCNN, pages 752–759, 2017.
[9] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An Improved Autoregressive Generative Model. In ICML, 2018.
[10] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[11] Peter Deutsch. DEFLATE compressed data format specification version 1.3. Technical report, 1996.
[12] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In ICCV, pages 576–584, 2015.
[13] Leonardo Galteri, Lorenzo Seidenari, Marco Bertini, and Alberto Del Bimbo. Deep generative adversarial compression artifact removal. In ICCV, pages 4826–4835, 2017.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[16] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent.
[17] Geoffrey Hinton and Drew Van Camp. Keeping neural networks simple by minimizing the description length of the weights. In COLT, 1993.
[18] Emiel Hoogeboom, Jorn W. T. Peters, Rianne van den Berg, and Max Welling. Integer discrete flows and lossless compression. In NeurIPS, 2019.
[19] David A. Huffman. A method for the construction of minimum-redundancy codes. Proc. IRE, 40(9):1098–1101, 1952.
[20] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[22] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS, 2018.
[23] Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In NIPS, 2016.
[24] Friso H. Kingma, Pieter Abbeel, and Jonathan Ho. Bit-Swap: Recursive bits-back coding for lossless compression with hierarchical latent variables. In ICML, 2019.
[25] Alexander Kolesnikov and Christoph H. Lampert. PixelCNN Models with Auxiliary Variables for Natural Image Modeling. In ICML, 2017.
[26] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[28] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Conditional Probability Models for Deep Image Compression. In CVPR, 2018.
[29] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Practical full resolution learned lossless image compression. In CVPR, 2019.
[30] David Minnen, Johannes Ballé, and George D. Toderici. Joint Autoregressive and Hierarchical Priors for Learned Image Compression. In NeurIPS, 2018.
[31] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image Transformer. In ICML, 2018.
[32] Pillow Library for Python. https://python-pillow.org.
[33] Portable Network Graphics (PNG). http://libpng.org/pub/png/libpng.html.
[34] Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando de Freitas. Parallel Multiscale Autoregressive Density Estimation. In ICML, 2017.
[35] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
[36] Iain E. Richardson. H.264 and MPEG-4 video compression: video coding for next-generation multimedia. John Wiley & Sons, 2004.
[37] Oren Rippel and Lubomir Bourdev. Real-Time Adaptive Image Compression. In ICML, 2017.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
[39] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications. In ICLR, 2017.
[40] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27(3):379–423, 1948.
[41] Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The JPEG 2000 still image compression standard. IEEE Signal Processing Magazine, 18(5):36–58, 2001.
[42] J. Sneyers and P. Wuille. FLIF: Free lossless image format based on MANIAC compression. In ICIP, 2016.
[43] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012.
[44] Pavel Svoboda, Michal Hradis, David Barina, and Pavel Zemcik. Compression artifacts removal using convolutional neural networks. arXiv preprint arXiv:1605.00366, 2016.
[45] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy Image Compression with Compressive Autoencoders. In ICLR, 2017.
[46] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full Resolution Image Compression with Recurrent Neural Networks. In CVPR, 2017.
[47] James Townsend, Tom Bird, and David Barber. Practical lossless compression with latent variables using bits back coding. In ICLR, 2019.
[48] Michael Tschannen, Eirikur Agustsson, and Mario Lucic. Deep Generative Models for Distribution-Preserving Lossy Compression. In NeurIPS, 2018.
[49] Aäron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional Image Generation with PixelCNN Decoders. In NIPS, 2016.
[50] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel Recurrent Neural Networks. In ICML, 2016.
[51] Gregory K. Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992.
[52] WebP Image format. https://developers.google.com/speed/webp/.
[53] Thomas Wiegand, Gary J. Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
[54] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987.
