Enhanced Invertible Encoding for Learned Image Compression

Yueqi Xie∗    Ka Leong Cheng∗    Qifeng Chen
HKUST         HKUST              HKUST

∗Both authors contributed equally.

ABSTRACT

Although deep learning based image compression methods have achieved promising progress these days, their performance still cannot match the latest compression standard, Versatile Video Coding (VVC). Most of the recent developments focus on designing a more accurate and flexible entropy model that can better parameterize the distributions of the latent features. However, few efforts are devoted to structuring a better transformation between the image space and the latent feature space. In this paper, instead of employing previous autoencoder style networks to build this transformation, we propose an enhanced Invertible Encoding Network with invertible neural networks (INNs) to largely mitigate the information loss problem for better compression. Experimental results on the Kodak, CLIC, and Tecnick datasets show that our method outperforms the existing learned image compression methods and compression standards, including VVC (VTM 12.1), especially for high-resolution images. Our source code is available at https://github.com/xyq7/InvCompress.

CCS CONCEPTS

• Computing methodologies → Machine learning; Artificial intelligence.

KEYWORDS

Image compression; Invertible neural networks

ACM Reference Format:
Yueqi Xie, Ka Leong Cheng, and Qifeng Chen. 2021. Enhanced Invertible Encoding for Learned Image Compression. In Proceedings of the 29th ACM Int'l Conference on Multimedia (MM '21), Oct. 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3474085.3475213

1 INTRODUCTION

Lossy image compression has been a fundamental and important research topic in media storage and transmission for decades. Classical image compression standards are usually based on handcrafted schemes, such as JPEG [45], JPEG2000 [37], WebP [17], BPG [10], and Versatile Video Coding (VVC) [24]. Some of them are widely used in practice. Recently, there has been increasing interest in learned image compression methods [6, 11, 29, 33, 34] considering their competitive performance. Generally, the recent VAE-based methods [6, 7] follow a common process: an encoder transforms the original pixels x into lower-dimensional latent features y and then quantizes them to ŷ, which can be losslessly compressed using entropy coding methods like arithmetic coding [39, 48]. A jointly optimized decoder is utilized to transform ŷ back to the reconstructed image x̂.

Most of the recent developments on this topic focus on the improvement of the entropy model. Ballé et al. [8] propose a variational image compression method with a scale hyperprior. Afterward, Minnen et al. [34] combine the context model [27] with an autoregressive model over the latent features. Cheng et al. [11] further improve the method with attention modules and employ discretized Gaussian mixture likelihoods to better parameterize the latent features. These recent improvements on entropy models contribute greatly to efficient compression. However, they usually model the transformation between the image space and the latent feature space using an autoencoder framework. Although autoencoders have a strong capacity to select the important portions of the information for reconstruction, the information neglected during encoding is usually lost and unrecoverable for decoding.

To resolve the information loss problem, a possible idea is to integrate invertible neural networks (INNs) into the encoding-decoding process, because INNs have the strictly invertible property to help preserve information. However, it is still nontrivial to leverage INNs to replace the original encoder-decoder architecture. On the one hand, to ensure strict invertibility, the total pixel number in the output should be the same as the input pixel number. But it is intractable to quantize and compress such high-dimensional features [16], and it is hard to reach desirable low bit rates for compression. The work [20] explores the usage of INNs to break the autoencoder limit but only achieves good performance in the high-bpp range. For lower-bpp image compression with INNs, Wang et al. [46] take a subspace of the output to get lower-dimensional features, model the lost information with a given distribution, and sample from it for the inverse passing. However, this strategy introduces unwanted errors and leads to unstable training, so they further introduce a knowledge distillation module to guide and stabilize the training of INNs, but it may lead to sub-optimal solutions. On the other hand, although the mathematical network design guarantees the invertibility of INNs, it leaves INNs with limited nonlinear transformation capacity [13] compared to other encoder-decoder architectures.

Concerning these aspects, we propose an enhanced Invertible Encoding Network for image compression, which maintains a highly invertible structure based on INNs. Instead of using the unstable sampling mechanism [46], we propose an attentive channel squeeze layer to stabilize the training and to flexibly adjust the feature dimension for a lower bit rate. We also present a feature enhancement module with same-resolution transforms and residual connections to improve the nonlinear representation capacity of the network for better image information preservation. With our design, we can directly integrate INNs to capture the lost information without relying on any sub-optimal guidance, and boost performance.

Experimental results show that our proposed method outperforms the existing learned methods and traditional codecs on three widely-used datasets for image compression, including Kodak [12] and two high-resolution datasets, namely the CLIC Professional Validation dataset [43] and the Tecnick dataset [5]. Visual results demonstrate that our reconstructed images contain more details under similar bpp rates, benefiting from our highly invertible design. Further analysis of our proposed attentive channel squeeze layer and feature enhancement module verifies their functionality.
The contributions of this paper can be summarized as follows:

• Unlike widely-used autoencoder style networks for learned image compression, we present an enhanced Invertible Encoding Network with an INN architecture for the transformation between the image space and the feature space to largely mitigate the information loss problem.
• The proposed method outperforms the existing learned and traditional image compression methods, including the latest VVC (VTM 12.1), especially on high-resolution datasets.
• We propose an attentive channel squeeze layer to resolve the unstable and sub-optimal training of INN-based networks for image compression. We further leverage a feature enhancement module to improve the nonlinear representation capacity of our network.

2 RELATED WORK

2.1 Lossy Image Compression

Traditional methods. Lossy image compression has long been an important and fundamental topic in image processing. Traditional compression standards include JPEG [45], JPEG2000 [37], WebP [17], BPG [10], and Versatile Video Coding (VVC) [24]. Many of them are widely used in practice. Typically, they follow the pipeline of transformation, quantization, and entropy coding. The transformation is usually based on handcrafted modules with prior knowledge, such as the discrete cosine transformation (DCT) [3] and the discrete wavelet transform (DWT) [32]. The entropy coders include the Huffman coder and some arithmetic coding methods [32, 39, 48]. Some modern standards like BPG and VVC further introduce intra prediction for better compression performance. However, since these traditional methods generally perform compression based on image blocks, their reconstructed images usually suffer from blocking effects.

Learned methods. Deep learning based methods have raised great interest recently with impressive performance, employing neural networks instead of handcrafted rules to learn the nonlinear transformation between the image and the feature space. Several recurrent neural network (RNN) based methods [23, 42, 44] progressively encode the residual information from the previous step to compress the image. However, these RNN-based models rely on binary representations at each iteration and cannot directly optimize the rate during training.

Another branch of methods is based on variational autoencoders (VAEs). Several early works [1, 7, 41] solve the problems of non-differentiable quantization and estimation of bit rates, which lay the foundation for end-to-end optimization to minimize the estimated bit rates and the reconstructed image distortion. Afterward, the improvement is mainly in two directions. One direction is to build a more effective entropy model for rate estimation. The work [8] proposes a hyperprior entropy model that uses additional bits to capture the distribution of the latent features. Some follow-up methods [27, 33, 34] integrate context factors to further reduce the spatial redundancy within the latent features. Also, 3D context entropy models [19], channel-wise models [35], and hierarchical entropy models [21, 34] are used to better extract correlations of the latent features. Cheng et al. [11] propose Gaussian mixture likelihoods to replace the original single Gaussian model to improve the accuracy. Another direction is to improve the VAE architecture. CNNs with generalized divisive normalization (GDN) layers [6, 7] achieve good performance for image compression. Attention mechanisms and residual blocks [11, 29, 51, 52] are incorporated into the VAE architecture. Some other progress includes generative adversarial training [2, 38, 40], spatial RNNs [28], and multi-scale fusion [38].

Generally, the VAE-based methods account for a more significant proportion among all the existing learned methods for image compression, considering the great performance and stability of the encoder and decoder architecture. Still, they cannot explicitly solve the information loss problem during encoding, making the neglected information usually unrecoverable for decoding.

2.2 Invertible Neural Networks

Invertible neural networks (INNs) [13, 14, 26] are popular for three great properties of their design: (i) INNs maintain a bijective mapping between inputs and outputs; (ii) the bijective mapping is efficiently accessible; (iii) there exists a tractable Jacobian of the bijective mapping to compute posterior probabilities explicitly. Many recent works in different areas with INN architectures achieve better performance than those with autoencoder style frameworks, especially for tasks with inherent invertibility.

RealNVP [14] first uses a multi-scale architecture with coupling layers and convolution operations to deal with image processing tasks. Ardizzone et al. [4] demonstrate the effectiveness of INNs on both synthetic data and two real-world applications in the medicine and astrophysics fields. Recently, many works start to use normalizing flow methods with exact likelihoods during training in generative tasks, where the key is to parametrize the distribution using INNs. SRFlow [31] uses a conditional INN architecture to better resolve the ill-posed super-resolution problem compared to GAN-based methods. Pumarola et al. [36] propose a conditional generative flow model for image and 3D point cloud generation. For the image rescaling task, Xiao et al. [49] use INNs to build a bijective mapping between high-resolution images and low-resolution images with additional latent variables. Xing et al. [50] propose an invertible image signal processing pipeline based on INNs.

3 METHOD

3.1 Background

Figure 1 provides a high-level overview of general learned image compression models in the transform coding approach [18]. A baseline model (Figure 1a) is formulated as follows:

    y = g_a(x),   ŷ = Q(y),   x̂ = g_s(ŷ).   (1)
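For illustration, a minimal PyTorch sketch of this baseline formulation is given below. It is a hedged reading rather than any published implementation: the layer widths are arbitrary, and plain ReLUs stand in for the GDN nonlinearities used in [6, 7]. The uniform-noise quantization surrogate shown here is the training-time relaxation discussed below in this subsection.

```python
import torch
import torch.nn as nn

class BaselineCodec(nn.Module):
    """Toy autoencoder-style codec illustrating Eq. (1):
    y = g_a(x), y_hat = Q(y), x_hat = g_s(y_hat)."""

    def __init__(self, n=128):
        super().__init__()
        # Analysis transform g_a: four stride-2 stages (16x spatial downscaling).
        self.g_a = nn.Sequential(
            nn.Conv2d(3, n, 5, 2, 2), nn.ReLU(),
            nn.Conv2d(n, n, 5, 2, 2), nn.ReLU(),
            nn.Conv2d(n, n, 5, 2, 2), nn.ReLU(),
            nn.Conv2d(n, n, 5, 2, 2),
        )
        # Synthesis transform g_s mirrors g_a with transposed convolutions.
        self.g_s = nn.Sequential(
            nn.ConvTranspose2d(n, n, 5, 2, 2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(n, n, 5, 2, 2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(n, n, 5, 2, 2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(n, 3, 5, 2, 2, output_padding=1),
        )

    def forward(self, x):
        y = self.g_a(x)
        if self.training:
            # Differentiable surrogate for Q: additive uniform noise U(-0.5, 0.5).
            y_hat = y + torch.empty_like(y).uniform_(-0.5, 0.5)
        else:
            y_hat = torch.round(y)  # hard integer rounding at test time
        return self.g_s(y_hat)
```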


(a) Baseline [6] (b) Ballé et al. [8] (c) Minnen et al. [34] (d) Cheng et al. [11]

Figure 1: The overview of existing learned compression methods. EC and ED denote the encoder and decoder of the entropy model, respectively, and C_m denotes the context model.

For encoding, a parametric analysis transform g_a encodes the image x into latent features y, and y is then quantized to obtain the discrete latent features ŷ, which are then losslessly compressed into bitstreams using entropy coding like arithmetic coding [39]. Note that since integer rounding is a fundamentally non-differentiable function, we use the method in [7] to approximately model the quantized latent features by adding a uniform noise U(−0.5, 0.5) to y during training. For simplicity, we use ŷ to denote both the latent features with uniform noise added during training and the discretely quantized latent features during testing. For decoding, a parametric synthesis transform g_s decodes the quantized ŷ back to the reconstructed image x̂.

The fundamental optimization objective of learned image compression is to minimize a weighted sum of the rate-distortion tradeoff during training:

    L = R(ŷ) + λD(x, x̂).   (2)

The rate R is the entropy of ŷ, which is estimated by a non-parametric, fully factorized density model (entropy model) p_{ŷ|θ} during training. So, we have the rate estimation:

    R = E[−log₂ p_{ŷ|θ}(ŷ | θ)].   (3)

The distortion D is defined as D = MSE(x, x̂) for MSE optimization and D = 1 − MS-SSIM(x, x̂) for MS-SSIM [47] optimization. We use λ to control the rate-distortion tradeoff for different bit rates.

Later, Ballé et al. [8] propose a scale hyperprior on top of the general learned image compression model, as shown in Figure 1b. Specifically, they stack another parametric analysis transform h_a on top of y to capture the significant spatial dependencies in the quantized latent features ŷ and obtain an additional set of encoded variables z. In this way, they model ŷ as a zero-mean Gaussian distribution with standard deviations σ for all the elements. The standard deviations are estimated by another parametric synthesis transform h_s, which takes the quantized ẑ as input and outputs the estimated standard deviations σ̂. So, we have the conditional probability distribution p_{ŷ|ẑ} = N(0, σ̂²). Similarly, an entropy model p_{ẑ|θ} is applied for the entropy estimation of ẑ. In this case, the rate estimation contains two terms:

    R = E[−log₂ p_{ŷ|ẑ}(ŷ | ẑ)] + E[−log₂ p_{ẑ|θ}(ẑ | θ)].   (4)

Later works further improve the hyperprior to better parameterize the distributions of the quantized latent features with a more accurate and flexible entropy model. Minnen et al. [34] propose an autoregressive context model with a mean and scale hyperprior, as shown in Figure 1c. Cheng et al. [11] utilize discretized Gaussian mixture likelihoods with attention enhancement to model the distributions of ŷ, as shown in Figure 1d.

3.2 Proposed Method

Instead of optimizing the parameterization h_a, h_s of the latent feature distribution, our proposed method focuses on enhancing the analysis g_a and synthesis g_s transforms between the image space X and the latent feature space Y. Considering the natural invertibility in image compression, we use an invertible network design with a feature enhancement module and an attentive channel squeeze layer to play the role of the analysis g_a and synthesis g_s transforms. Figure 2 shows an overview of our proposed approach.

INN architecture. We formulate an INN architecture design to serve as the analysis g_a and synthesis g_s transforms. It consists of two essential invertible layers: the downscaling layer and the coupling layer. Existing methods [8, 11, 34] usually employ 4 downscaling and 4 corresponding upscaling modules in the analysis and synthesis transforms, respectively. Similarly, we also stack 4 invertible blocks in our INN architecture to downscale the input resolution by a factor of 2^4, where each block sequentially contains 1 downscaling layer and 3 coupling layers. The kernel sizes of the coupling layers for the 4 blocks are empirically set as k = 5, 5, 3, 3. The downscaling layer is composed of a pixel shuffling layer [14] and an invertible 1×1 convolution [26]; each downscaling layer reduces the resolution of the input tensor by 2 and quadruples the channel dimension.

We use the affine coupling layer design introduced in RealNVP [14]. The i-th coupling layer takes an input u^(i)_{1:C} with dimensional size C, splits the input at position c < C into two parts, and gives a C-dimensional output u^(i+1)_{1:C}:

    u^(i+1)_{1:c} = u^(i)_{1:c} ⊙ exp(σ_c(g_2(u^(i)_{c+1:C}))) + h_2(u^(i)_{c+1:C}),   (5)
    u^(i+1)_{c+1:C} = u^(i)_{c+1:C} ⊙ exp(σ_c(g_1(u^(i+1)_{1:c}))) + h_1(u^(i+1)_{1:c}),   (6)

where ⊙ denotes the Hadamard product, exp(·) denotes the exponential function, and σ_c(·) denotes the center sigmoid function. Symmetrically, the i-th coupling layer inversely takes u^(i+1)_{1:C} as input with splitting position c. The affine coupling layer gives a perfect inverse:

    u^(i)_{c+1:C} = (u^(i+1)_{c+1:C} − h_1(u^(i+1)_{1:c})) ⊙ exp(−σ_c(g_1(u^(i+1)_{1:c}))),   (7)
    u^(i)_{1:c} = (u^(i+1)_{1:c} − h_2(u^(i)_{c+1:C})) ⊙ exp(−σ_c(g_2(u^(i)_{c+1:C}))).   (8)

Please note that such invertibility is inherently guaranteed by the mathematical design. Thus, the invertibility holds for arbitrary feedforward functions g_1, g_2, h_1, h_2, meaning that these functions need not be invertible. In our implementation, all four functions use the following bottleneck design: Conv-LeakyReLU-Conv-LeakyReLU-Conv, where the kernel size of the first and last convolution is set as k, and the middle one is a convolutional layer with kernel size 1.
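For concreteness, Eqs. (5)-(8) translate almost line-for-line into code. The sketch below is a hedged reading, not the released implementation: the class and helper names are invented, and the center sigmoid σ_c is approximated by a plain sigmoid.

```python
import torch
import torch.nn as nn

def bottleneck(c_in, c_out, k):
    """Conv-LeakyReLU-Conv-LeakyReLU-Conv with kernel sizes (k, 1, k)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2), nn.LeakyReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 1), nn.LeakyReLU(inplace=True),
        nn.Conv2d(c_out, c_out, k, padding=k // 2),
    )

class AffineCoupling(nn.Module):
    """Affine coupling layer of Eqs. (5)-(8) with split position c."""

    def __init__(self, channels, c, k=3):
        super().__init__()
        self.c = c
        self.g1 = bottleneck(c, channels - c, k)  # scales for the second part
        self.h1 = bottleneck(c, channels - c, k)  # shifts for the second part
        self.g2 = bottleneck(channels - c, c, k)  # scales for the first part
        self.h2 = bottleneck(channels - c, c, k)  # shifts for the first part

    def forward(self, u):
        u1, u2 = u[:, :self.c], u[:, self.c:]
        v1 = u1 * torch.exp(torch.sigmoid(self.g2(u2))) + self.h2(u2)     # Eq. (5)
        v2 = u2 * torch.exp(torch.sigmoid(self.g1(v1))) + self.h1(v1)     # Eq. (6)
        return torch.cat((v1, v2), dim=1)

    def inverse(self, v):
        v1, v2 = v[:, :self.c], v[:, self.c:]
        u2 = (v2 - self.h1(v1)) * torch.exp(-torch.sigmoid(self.g1(v1)))  # Eq. (7)
        u1 = (v1 - self.h2(u2)) * torch.exp(-torch.sigmoid(self.g2(u2)))  # Eq. (8)
        return torch.cat((u1, u2), dim=1)
```

A quick numerical check of the perfect-inverse property: for any input u, layer.inverse(layer(u)) recovers u up to floating-point error, which is exactly what lets the decoder retrace the encoder.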
Feature enhancement module. INNs are powerful when modeling the invertible transformation. However, since the property of invertibility is guaranteed by the strictly invertible network design, INNs usually have a limited capacity of nonlinear representation [13]. Hence, we add a feature enhancement module in a residual manner before the INN architecture to improve the nonlinear representativeness of our network. Specifically, this module is based on the popular Dense Block [22]; the "Convs (1,3,1)" blocks in Figure 2 denote three cascaded convolutions with kernel sizes of 1, 3, and 1.

Attentive channel squeeze. All the operations in INNs cannot change the total number of pixels in the input tensor, many of which are simply redundant pixels to be compressed. To resolve this problem, we introduce an attentive channel squeeze layer to reduce the channel dimension of the output tensor of INNs. The idea is as follows. Given a compression ratio α and an input tensor v of size (d, h, w), the attentive channel squeeze layer first reshapes the tensor into a shape of (α, d/α, h, w) and performs an average operation along the first dimension, followed by the attention module proposed in [11], to obtain the latent features y of size (d/α, h, w). For the inverse process, the attentive channel squeeze layer makes α copies of the quantized ŷ after attention enhancement and reshapes them back into a tensor of size (d, h, w).

Figure 2: Overview and workflow of our proposed enhanced Invertible Encoding Network.

3.3 Overall Workflow

For the encoding process, given an input image x ∈ R^{3×H×W} to be compressed with height H and width W, the feature enhancement module first extracts and adds some nonlinear representations to x. Then the forward pass of our INN architecture transforms it into an output tensor v of size (d, H/2^4, W/2^4). Given the compression ratio α, the attentive channel squeeze layer compresses v into the latent features y, which are quantized to ŷ. We employ the same hyperprior proposed by Minnen et al. [34], which uses a mean and scale Gaussian prior to parameterize the distribution of ŷ, with a pair of analysis h_a and synthesis h_s transforms. Specifically, h_a obtains the side information z from y, which is quantized to ẑ; a factorized-prior entropy model p_{ẑ|θ} introduced in [7] is used to parametrize the distribution of ẑ. The synthesis transform h_s takes the quantized ẑ as input and works with the causal context model C_m introduced in [34] for the mean and scale Gaussian model p_{ŷ|ẑ}. For entropy coding, we use the asymmetric numeral systems (ANS) [15] to losslessly compress ŷ and ẑ into bitstreams.

For the decoding process, the attentive channel squeeze layer transforms the decoded ŷ back into v̂, which finally undergoes the inverse pass of the INN architecture. Since there is no feature enhancement module for the reconstructed image, the inverse pass of the INN architecture directly decodes v̂ into the reconstructed image x̂.

Figure 3: Performance evaluation on the Kodak dataset.

4 EXPERIMENTS

We conduct experiments on three commonly used image compression datasets to validate our method. In the remainder of this section, we first introduce the detailed experimental setup, followed by quantitative and qualitative comparisons with existing state-of-the-art methods. We further conduct some analysis and ablation studies to examine the effectiveness of our proposed attentive channel squeeze layer and feature enhancement module.

4.1 Experimental Setup

We use the Flicker 2W dataset used in [30], consisting of 20,745 high-quality general images, for the image compression task. We randomly select around 200 images for our validation set, and the remaining images are used for training. Note that we drop a few images with either height or width smaller than 256 px for convenience. Our network is trained on 256 × 256 randomly cropped patches, using the recently well-developed CompressAI PyTorch library [9].

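Training then minimizes the rate-distortion objective of Eqs. (2)-(4) on these patches. As a hedged sketch, assuming CompressAI's convention that a model's forward pass returns the reconstruction "x_hat" and a dictionary of "likelihoods" for the latents and side information:

```python
import torch
import torch.nn.functional as F

def rd_loss(x, out, lam):
    """Rate-distortion loss of Eq. (2): the rate is the estimated entropy of the
    quantized latents under the entropy model (Eqs. (3)-(4)), in bits per pixel;
    the distortion is MSE for the PSNR-oriented models."""
    num_pixels = x.size(0) * x.size(2) * x.size(3)
    bpp = sum((-torch.log2(l)).sum() for l in out["likelihoods"].values()) / num_pixels
    mse = F.mse_loss(out["x_hat"], x)
    return bpp + lam * mse
```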
Figure 4: Performance evaluation on the CLIC Professional Validation dataset.
Figure 5: Performance evaluation on the Tecnick dataset.

Training details. All the experiments are conducted on a single RTX 2080 Ti GPU and trained for 600 epochs with a batch size of 8 using the Adam [25] optimizer. Generally, it takes around 10 days to train a model. Our network is first optimized for 450 epochs with an initial learning rate of 10^−4; then the learning rate is reduced to 10^−5 at epoch 450 and further down to 10^−6 at epoch 550.

We use the channel number N = C/α and the weight factor λ as our quality parameters. In total, we train 8 models optimized with the MSE (mean squared error) quality metric and 5 models with the MS-SSIM (multiscale structural similarity) quality metric [47] under different quality levels. For the MSE models, λ is chosen from the set {0.0016, 0.0024, 0.0032, 0.0075, 0.015, 0.03, 0.045, 0.09}, in which the first four λ values are paired with channel number N = 128 for lower-rate models, and N is set as 192 to pair with the remaining four λ values for higher-rate models. For the MS-SSIM models, λ belongs to the set {6, 12, 40, 120, 220}, and we use N = 128 for the first two λ values to train lower-rate models and N = 192 for the remaining three λ values to train higher-rate models.

Table 1: Area under curve (AUC) of VVC (VTM 12.1) and our MSE method on different datasets, with the bpp range determined by our quality 1 and quality 8 models.

Dataset   VVC (VTM 12.1)   Ours [MSE]
Kodak     33.309           33.373
CLIC      25.153           25.238
Tecnick   23.563           23.693
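The channel number N = C/α above is produced by the attentive channel squeeze layer of Section 3.2. A minimal sketch of its average/copy pair follows; the attention module from [11] is omitted here, and the class name is an invented placeholder:

```python
import torch
import torch.nn as nn

class ChannelSqueeze(nn.Module):
    """Average/copy pair of the attentive channel squeeze layer (Section 3.2)."""

    def __init__(self, alpha):
        super().__init__()
        self.alpha = alpha

    def forward(self, v):
        # (B, C, H, W) -> (B, alpha, C/alpha, H, W) -> mean over the alpha groups.
        b, c, h, w = v.shape
        return v.reshape(b, self.alpha, c // self.alpha, h, w).mean(dim=1)

    def inverse(self, y_hat):
        # Dimension matching by copying: tile the N = C/alpha channels alpha times.
        return y_hat.repeat(1, self.alpha, 1, 1)
```

The copy inverse deviates from the pre-averaging tensor only by the within-group spread, which Section 4.4 quantifies as the deviation ε.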
Figure 6: Visualization of sample reconstructed images from the Kodak dataset.

Evaluation. We evaluate our methods on three commonly used datasets for image compression: the Kodak PhotoCD image dataset (Kodak) [12], the CLIC Professional Validation dataset (CLIC) [43], and the old Tecnick dataset [5]. The Kodak dataset contains 24 uncompressed images with resolutions of 768 × 512; the CLIC dataset comprises 41 high-quality images with much higher resolutions; the Tecnick dataset contains 100 images with high resolutions of 1200 × 1200. We use the peak signal-to-noise ratio (PSNR) and the multiscale structural similarity index (MS-SSIM) [47] to quantify the image distortion level, and we use bits per pixel (bpp) to evaluate the rate performance. We draw rate-distortion (RD) curves to compare the coding efficiency of different methods, and we also report the area under the rate-distortion curve (AUC) as an aggregate measurement to better compare methods with similar performance.

Table 2: Deviation and scaled deviation of the averaging operation of our attentive channel squeeze layer on the Kodak dataset.

Quality                Q1      Q2      Q3      Q4      Q5      Q6      Q7      Q8
Deviation (ε)          0.1344  0.1448  0.1662  0.2344  0.2047  0.2520  0.3541  0.4568
Scaled deviation (ε̃)   0.6640  0.5884  0.5912  0.5015  0.4421  0.3637  0.4106  0.3611


Figure 7: Scaled deviation maps of images kodim20 and kodim24 from the Kodak dataset.
Figure 8: Ablation study on the feature enhancement module.

4.2 Rate-distortion Performance

We compare our model with state-of-the-art learned image compression models, including the methods proposed by Ballé et al. [8], Minnen et al. [34], Lee et al. [27], Hu et al. [21], and Cheng et al. [11]. The corresponding data points on their RD curves are collected from their papers or their official GitHub pages. We also compare with some widely-used image compression codecs, including JPEG [45], JPEG2000 [37], WebP [17], BPG [10], and VVC [24]. We evaluate their performance using the CompressAI evaluation platform. For VVC, we use the most up-to-date VVC Official Test Model VTM 12.1 (accessed in April 2021) with an intra-profile configuration from the official GitHub page to test on images. To evaluate the BPG performance, we use the BPG software with the YUV444 subsampling mode, the x265 HEVC implementation, and a bit depth of 8 to test on images.

Figure 3 shows the RD curve comparison on the Kodak dataset. Similar to existing work [11], we convert MS-SSIM to −10 log₁₀(1 − MS-SSIM) for clearer comparison. It can be observed that our method slightly outperforms VVC (VTM 12.1) and yields much better performance when compared with both the existing learned methods and other traditional image compression standards.

For the CLIC dataset and the Tecnick dataset, we compare our MSE optimized results with traditional compression standards and with the learned methods whose official testing results are available in their papers or official GitHub pages. We show the RD curves on the CLIC dataset in Figure 4 and the Tecnick dataset in Figure 5. We can see that our MSE optimized method outperforms all other approaches. Note that most images in the CLIC dataset and the Tecnick dataset are of high resolutions, implying that our method is more robust and promising for compressing high-resolution images.

From the RD curves, we can see that VVC (VTM 12.1) and our approach achieve similar performance in terms of PSNR. Hence, we further report their corresponding AUC values for better comparison and ranking. The statistics in Table 1 indicate that our method outperforms the latest VVC (VTM 12.1) codec in terms of the aggregated AUC metric.

4.3 Qualitative Results

We show some qualitative comparisons of sample reconstructed images from the Kodak dataset in Figure 6. The first sample is image kodim01 with an approximate bpp of 0.145; the second sample is image kodim07 with approximately 0.125 bpp; the last sample is image kodim22 with around 0.130 bpp. For JPEG and JPEG2000, we use the lowest quality since they cannot reach the mentioned bpp levels. We can see that our MSE optimized method achieves good performance compared with the latest VVC (VTM 12.1) codec, much better than the performance of other codecs. Also, our MS-SSIM optimized method can reconstruct images with much more structural details than all the traditional codecs. We present more qualitative results in our supplementary materials.
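The paper does not spell out the exact AUC computation behind Table 1. One plausible reading, assumed in the sketch below, is a trapezoidal integral of the interpolated PSNR-bpp curve over the bpp range set by the quality-1 and quality-8 models:

```python
import numpy as np

def rd_auc(bpp, psnr, bpp_lo, bpp_hi, n=256):
    """Area under a PSNR-bpp rate-distortion curve between two bpp bounds."""
    bpp, psnr = np.asarray(bpp, float), np.asarray(psnr, float)
    order = np.argsort(bpp)                  # RD points sorted by rate
    grid = np.linspace(bpp_lo, bpp_hi, n)    # common bpp grid for both codecs
    curve = np.interp(grid, bpp[order], psnr[order])
    return np.trapz(curve, grid)             # trapezoidal-rule integral
```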

4.4 Analysis of Attentive Channel Squeeze

In this section, we demonstrate that our proposed attentive channel squeeze layer introduces only minor deviation, enabling a stable and tractable dimension adjustment. We also analyze the distribution of the deviation maps on images and the deviation levels of different quality models.

Let γ ∈ R^{d×h×w} and γ̂ ∈ R^{(d/α)×h×w} be the input and output latent feature tensors of the averaging operation, respectively. For simplicity of notation, we reshape γ as (α, l) and γ̂ as (1, l), where l = (d/α) × h × w. The averaging operation in the attentive channel squeeze layer compresses γ by the compression ratio α along the channel dimension. Specifically, each pixel in γ̂ is the mean value of the corresponding α pixels in γ. The corresponding inverse operation for dimension matching is copying. It is obvious that such an operation can lead to some deviation.

To quantify such deviation, we do some analysis using the Kodak dataset to calculate the mean absolute pixel deviation as follows:

    ε = (1/l) Σ_{j=1}^{l} Σ_{i=1}^{α} |γ_{i,j} − γ̂_{1,j}|,   (9)

where γ_{i,j} and γ̂_{1,j} denote the corresponding pixels in γ and γ̂, respectively. As shown in Table 2, the value of the deviation is minor (less than or comparable with the quantization error in most cases), which is beneficial for stable training.

To compare the deviation among models under different quality levels, it is unfair to directly compare the values, because γ and γ̂ have a larger value range for the model with higher quality. Accordingly, the absolute value of the deviation is amplified. As a result, calculating the "relative" deviation with respect to the value range is a more reasonable practice. A naive way is to divide the mean absolute pixel deviation ε by the value range for a fair comparison among different quality models. However, we find that the distribution of the pixel values has long tails, so the value range is very likely to be affected by outlier pixel values. Hence, to better scale the mean absolute pixel deviation, we introduce a scaling factor μ:

    μ = (1/l) Σ_{j=1}^{l} Σ_{i=1}^{α} |γ_{i,j}|.   (10)

The scaled mean absolute pixel deviation is then ε̃ = ε/μ. We report the ε and ε̃ values of our models under different quality levels in Table 2.

We further visualize the scaled deviation maps between γ and γ̂ of images kodim20 and kodim24 for our models under different quality levels in Figure 7. Note that each deviation map is of shape (h, w), and each pixel is the mean of the absolute deviation along the channel dimension after scaling with μ. We can see that the "relative" deviation generally decreases as the quality increases, indicating less information loss in the averaging operation. This is in line with our intuition: higher-quality models usually lead to less information loss in the process of compression.
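For concreteness, the statistics of Eqs. (9) and (10) take only a few lines to compute. The sketch below assumes a single latent tensor γ of shape (d, h, w), grouped exactly as in the squeeze layer's averaging; the function name is an invented placeholder:

```python
import torch

def squeeze_deviation(gamma, alpha):
    """Mean absolute deviation (Eq. 9), scaling factor (Eq. 10), and the
    scaled deviation of the channel-averaging step."""
    d, h, w = gamma.shape
    g = gamma.reshape(alpha, (d // alpha) * h * w)  # gamma viewed as (alpha, l)
    g_hat = g.mean(dim=0, keepdim=True)             # averaged latents, shape (1, l)
    l = g.shape[1]
    eps = (g - g_hat).abs().sum() / l               # Eq. (9)
    mu = g.abs().sum() / l                          # Eq. (10)
    return eps, mu, eps / mu                        # deviation, scale, scaled deviation
```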
4.5 Ablation Study

To verify the contribution of the proposed nonlinear feature enhancement module, we conduct a corresponding ablation study on this module. Specifically, we train two models, with and without the nonlinear feature enhancement module, on the Flicker 2W dataset for 600 epochs, using the same weight factor λ = 0.01 and channel number N = 192.

Figure 8 shows the rate-distortion points of the two models evaluated on the Kodak dataset. We can observe that the proposed nonlinear feature enhancement module improves the modeling capacity of the model, compressing images with lower bit rates and higher PSNR and MS-SSIM values.

5 CONCLUSION

Unlike existing autoencoder style networks, our proposed enhanced Invertible Encoding Network based on invertible neural networks (INNs) can better model image compression as an invertible process. The critical issues in integrating INNs for image compression include unstable training and limited nonlinear transformation capacity. Our proposed attentive channel squeeze layer offers stable and tractable feature dimension adjustment; the incorporated feature enhancement module increases the nonlinear transformation capacity of the network. Overall, our network maintains a highly invertible architecture to largely mitigate information loss when compressing images.

Extensive experiments on three widely-used datasets show that our approach outperforms state-of-the-art learned image compression methods and existing compression standards, including VVC (VTM 12.1), especially on the two high-resolution image datasets. Furthermore, the visual results demonstrate that our compression methods preserve more detailed information than current compression standards.

REFERENCES
[1] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc Van Gool. 2017. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. In Advances in Neural Information Processing Systems.
[2] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. 2019. Generative Adversarial Networks for Extreme Learned Image Compression. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 221–231.
[3] Nasir U. Ahmed, T. Natarajan, and Kamisetty R. Rao. 1974. Discrete Cosine Transform. IEEE Trans. Comput. C-23, 1 (1974), 90–93.
[4] Lynton Ardizzone, Jakob Kruse, Sebastian Wirkert, Daniel Rahner, Eric W. Pellegrini, Ralf S. Klessen, Lena Maier-Hein, Carsten Rother, and Ullrich Köthe. 2019. Analyzing Inverse Problems with Invertible Neural Networks. In Proceedings of the International Conference on Learning Representations.
[5] Nicola Asuni and Andrea Giachetti. 2014. TESTIMAGES: a Large-scale Archive for Testing Visual Devices and Basic Image Processing Algorithms. In Proceedings of the Conference on Smart Tools and Applications in Computer Graphics. 63–70.
[6] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. 2016. End-to-end Optimization of Nonlinear Transform Codes for Perceptual Quality. In Proceedings of Picture Coding Symposium. 1–5.
[7] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. 2017. End-to-end Optimized Image Compression. In Proceedings of the International Conference on Learning Representations.
[8] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. 2018. Variational Image Compression with a Scale Hyperprior. In Proceedings of the International Conference on Learning Representations.
[9] Jean Bégaint, Fabien Racapé, Simon Feltman, and Akshay Pushparaja. 2020. CompressAI: a PyTorch Library and Evaluation Platform for End-to-end Compression Research. arXiv preprint arXiv:2011.03029 (2020).
[10] Fabrice Bellard. 2015. BPG Image Format. https://bellard.org/bpg/
[11] Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. 2020. Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7939–7948.
[12] Eastman Kodak Company. 1999. Kodak Lossless True Color Image Suite. http://r0k.us/graphics/kodak/
[13] Laurent Dinh, David Krueger, and Yoshua Bengio. 2015. NICE: Non-linear Independent Components Estimation. In Proceedings of the International Conference on Learning Representations Workshops.
[14] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2017. Density Estimation using Real NVP. In Proceedings of the International Conference on Learning Representations.
[15] Jarek Duda. 2009. Asymmetric numeral systems. arXiv preprint arXiv:0902.0271 (2009).
[16] Allen Gersho and Robert M. Gray. 2012. Vector Quantization and Signal Compression. Vol. 159.
[17] Google. 2010. Web Picture Format. https://chromium.googlesource.com/webm/libwebp
[18] Vivek K. Goyal. 2001. Theoretical Foundations of Transform Coding. IEEE Signal Processing Magazine 18, 5 (2001), 9–21.
[19] Zongyu Guo, Yaojun Wu, Runsen Feng, Zhizheng Zhang, and Zhibo Chen. 2020. 3-D Context Entropy Model for Improved Practical Image Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 116–117.
[20] Leonhard Helminger, Abdelaziz Djelouah, Markus Gross, and Christopher Schroers. 2021. Lossy Image Compression with Normalizing Flows. In Proceedings of the International Conference on Learning Representations Workshop Neural Compression.
[21] Yueyu Hu, Wenhan Yang, and Jiaying Liu. 2020. Coarse-to-Fine Hyper-Prior Modeling for Learned Image Compression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11013–11020.
[22] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2261–2269.
[23] Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Troy Chinen, Sung Jin Hwang, Joel Shor, and George Toderici. 2018. Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4385–4393.
[24] Joint Video Experts Team (JVET). 2021. VVC Official Test Model VTM. https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tree/VTM-12.1 Accessed on April 5, 2021.
[25] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations.
[26] Durk P. Kingma and Prafulla Dhariwal. 2018. Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems, Vol. 31.
[27] Jooyoung Lee, Seunghyun Cho, and Seung-Kwon Beack. 2019. Context-adaptive Entropy Model for End-to-end Optimized Image Compression. In Proceedings of the International Conference on Learning Representations.
[28] Chaoyi Lin, Jiabao Yao, Fangdong Chen, and Li Wang. 2020. A Spatial RNN Codec for End-to-End Image Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13266–13274.
[29] Haojie Liu, Tong Chen, Peiyao Guo, Qiu Shen, Xun Cao, Yao Wang, and Zhan Ma. 2019. Non-local Attention Optimized Deep Image Compression. arXiv preprint arXiv:1904.09757 (2019).
[30] Jiaheng Liu, Guo Lu, Zhihao Hu, and Dong Xu. 2020. A Unified End-to-End Framework for Efficient Deep Image Compression. arXiv preprint arXiv:2002.03370 (2020).
[31] Andreas Lugmayr, Martin Danelljan, Luc Van Gool, and Radu Timofte. 2020. SRFlow: Learning the Super-Resolution Space with Normalizing Flow. In Proceedings of the European Conference on Computer Vision.
[32] Detlev Marpe, Heiko Schwarz, and Thomas Wiegand. 2003. Context-based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard. IEEE Transactions on Circuits and Systems for Video Technology 13, 7 (2003), 620–636.
[33] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. 2018. Conditional Probability Models for Deep Image Compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4394–4402.
[34] David Minnen, Johannes Ballé, and George Toderici. 2018. Joint Autoregressive and Hierarchical Priors for Learned Image Compression. In Advances in Neural Information Processing Systems. 10794–10803.
[35] David Minnen and Saurabh Singh. 2020. Channel-Wise Autoregressive Entropy Models for Learned Image Compression. In Proceedings of the IEEE International Conference on Image Processing. 3339–3343.
[36] Albert Pumarola, Stefan Popov, Francesc Moreno-Noguer, and Vittorio Ferrari. 2020. C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[37] Majid Rabbani. 2002. JPEG2000: Image Compression Fundamentals, Standards and Practice. Journal of Electronic Imaging 11, 2 (2002), 286.
[38] Oren Rippel and Lubomir Bourdev. 2017. Real-time Adaptive Image Compression. In Proceedings of the International Conference on Machine Learning. 2922–2930.
[39] Jorma Rissanen and Glen G. Langdon. 1981. Universal Modeling and Coding. IEEE Transactions on Information Theory 27, 1 (1981), 12–23.
[40] Shibani Santurkar, David M. Budden, and Nir Shavit. 2018. Generative Compression. In Proceedings of Picture Coding Symposium. 258–262.
[41] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. 2017. Lossy Image Compression with Compressive Autoencoders. In Proceedings of the International Conference on Learning Representations.
[42] George Toderici, Sean M. O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. 2015. Variable Rate Image Compression with Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations.
[43] George Toderici, Wenzhe Shi, Radu Timofte, Lucas Theis, Johannes Ballé, Eirikur Agustsson, Nick Johnston, and Fabian Mentzer. 2021. Workshop and Challenge on Learned Image Compression. http://www.compression.cc
[44] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. 2017. Full Resolution Image Compression with Recurrent Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5306–5314.
[45] Gervais Knox Wallace. 1992. The JPEG Still Picture Compression Standard. IEEE Transactions on Consumer Electronics 38, 1 (1992), xviii–xxxiv.
[46] Yaolong Wang, Mingqing Xiao, Chang Liu, Shuxin Zheng, and Tie-Yan Liu. 2020. Modeling Lost Information in Lossy Image Compression. arXiv preprint arXiv:2006.11999 (2020).
[47] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. 2003. Multiscale Structural Similarity for Image Quality Assessment. In Proceedings of the Asilomar Conference on Signals, Systems and Computers, Vol. 2. 1398–1402.
[48] Ian H. Witten, Radford M. Neal, and John G. Cleary. 1987. Arithmetic Coding for Data Compression. Commun. ACM 30, 6 (1987), 520–540.
[49] Mingqing Xiao, Shuxin Zheng, Chang Liu, Yaolong Wang, Di He, Guolin Ke, Jiang Bian, Zhouchen Lin, and Tie-Yan Liu. 2020. Invertible Image Rescaling. In Proceedings of the European Conference on Computer Vision. 126–144.
[50] Yazhou Xing, Zian Qian, and Qifeng Chen. 2021. Invertible Image Signal Processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[51] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. 2019. Residual Non-local Attention Networks for Image Restoration. In Proceedings of the International Conference on Learning Representations.
[52] Lei Zhou, Zhenhong Sun, Xiangji Wu, and Junmin Wu. 2019. End-to-end Optimized Image Compression with Attention Mechanism. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.