arXiv:2012.15463v1 [cs.CV] 31 Dec 2020

LEARNED MULTI-RESOLUTION VARIABLE-RATE IMAGE COMPRESSION WITH OCTAVE-BASED RESIDUAL BLOCKS

Mohammad Akbari∗, Jie Liang∗, Jingning Han†, Chengjie Tu‡

Simon Fraser University, Canada∗, Google Inc.†, Tencent Technologies‡

ABSTRACT

Recently, deep learning-based image compression has shown the potential to outperform traditional codecs. However, most existing methods train multiple networks for multiple bit rates, which increases the implementation complexity. In this paper, we propose a new variable-rate image compression framework, which employs generalized octave convolutions (GoConv) and generalized octave transposed-convolutions (GoTConv) with built-in generalized divisive normalization (GDN) and inverse GDN (IGDN) layers. Novel GoConv- and GoTConv-based residual blocks are also developed in the encoder and decoder networks. Our scheme also uses a stochastic rounding-based scalar quantization. To further improve the performance, we encode the residual between the input and the reconstructed image from the decoder network as an enhancement layer. To enable a single model to operate with different bit rates and to learn multi-rate image features, a new objective function is introduced. Experimental results show that the proposed framework trained with the variable-rate objective function outperforms standard codecs such as the H.265/HEVC-based BPG and state-of-the-art learning-based variable-rate methods.

Index Terms: learned image compression, variable-rate, deep learning, residual coding, generalized octave convolutions, multi-resolution autoencoder, multi-resolution image coding

1. INTRODUCTION

In the last few years, deep learning has made tremendous progress in the well-studied topic of image compression. Deep learning-based image compression [1, 2, 3, 4, 5, 6, 7] has shown the potential to outperform standard codecs such as JPEG2000, the H.265/HEVC-based BPG image codec [8], and also the new versatile video coding test model (VTM) [9, 10], making it a very promising tool for next-generation image compression.

In traditional compression methods, many components, such as the linear transform and entropy coding, are fixed and hand-crafted. Deep learning-based approaches have the potential to exploit the data features automatically, thereby achieving better compression performance. In addition, deep learning allows non-linear transform coding and more flexible context modeling [11]. Various learning-based image compression frameworks have been proposed in the last few years.

In [12, 13], a scheme involving a generalized divisive normalization (GDN)-based nonlinear analysis transform, a uniform quantizer, and an inverse GDN (IGDN)-based synthesis transform was proposed. The encoding network consists of three stages of convolution and GDN layers. The decoding network consists of three stages of transposed-convolution and IGDN layers. Despite its simple architecture, it outperforms JPEG2000 in both PSNR and SSIM.

A compressive auto-encoder framework with residual connections as in ResNet (residual neural network) was proposed in [5], where the quantization was replaced by a smooth approximation, and a scaling approach was used to obtain different rates. In [14], a soft-to-hard vector quantization approach was introduced, and a unified framework was developed for both image compression and neural network model compression.

In [1], a deep semantic segmentation-based layered image compression (DSSLIC) was proposed, taking advantage of the Generative Adversarial Network (GAN). A low-dimensional representation and a segmentation map of the input, along with the residual between the input and the synthesized image, were encoded. It outperforms the BPG codec (RGB4:4:4) in both PSNR and MS-SSIM [15].

Most previous works used fixed entropy models shared between the encoder and decoder. In [16], a conditional entropy model based on a Gaussian scale mixture (GSM) was proposed, where the scale parameters were conditioned on a hyper-prior learned using a hyper auto-encoder. The compressed hyper-prior was transmitted and added to the bit stream as side information. The model in [16] was extended in [4, 17], where a Gaussian mixture model (GMM) with both mean and scale parameters conditioned on the hyper-prior was utilized. In these methods, the hyper-priors were combined with auto-regressive priors generated using context models, which outperformed BPG in terms of both PSNR and MS-SSIM. These approaches are jointly optimized to effectively capture the spatial dependencies and probabilistic structures of the latents, which leads to significant compression performance. However, some of these latents are spatially redundant. In [10], a learned multi-resolution image compression approach was proposed that uses generalized octave convolutions (GoConv) and generalized octave transposed-convolutions (GoTConv) to factorize the latents into high and low resolutions. As a result, the spatial redundancy in the latents is reduced, which improves the compression performance.

Since most learned image compression methods need to train multiple networks for multiple bit rates, variable-rate image compression approaches have also been proposed, in which a single neural network model is trained to operate at multiple bit rates. This approach was first introduced in [18], and was then generalized for full-resolution images using deep learning-based entropy coding in [6].

In [18], long short-term memory (LSTM)-based recurrent neural networks (RNNs) and residual-based layer coding were used to compress thumbnail images. Better SSIM results than JPEG and WebP were reported. This approach was generalized in [6], which proposed a variable-rate framework for full-resolution images by introducing a gated recurrent unit, residual scaling, and deep learning-based entropy coding. This method can outperform JPEG in terms of PSNR.

In [19], a CNN-based multi-scale decomposition transform was optimized for all scales. Rate allocation algorithms were also applied to determine the optimal scale of each image block. The results in [19] were reported to be better than BPG in MS-SSIM. In [20], a learned progressive image compression model was proposed, in which bit-plane decomposition was adopted. Bidirectional assembling gated units were also introduced to reduce the correlation between different bit-planes [20].

In [21], we proposed a variable-rate image compression scheme, by applying GDN-based residual blocks as in ResNet and two novel multi-bit objective functions. The residual between the input and the reconstructed image was also encoded by BPG to further improve the performance. Experimental results show that our variable-rate model can outperform state-of-the-art learning-based variable-rate methods.

In this paper, we extend and improve our previous approach in [21] and propose a new deep learning-based multi-resolution variable-rate image compression framework, which employs the GoConv and GoTConv layers developed in [10] to factorize all the feature maps into high resolution (HR) and low resolution (LR) information. Two novel types of octave-based residual sub-networks are also developed in the encoder and decoder networks, by incorporating separate shortcut connections for the HR and LR feature maps. Our scheme uses the stochastic rounding-based scalar quantization as in [18, 22, 23]. As in [21], to further improve the performance, we encode the residual between the input and the reconstructed image from the decoder network by BPG as an enhancement layer. To enable a single model to operate with different bit rates and to learn multi-rate image features, a new variable-rate objective function is introduced [21]. Experimental results show that the proposed framework trained with the variable-rate objective function outperforms standard codecs, including the H.265/HEVC-based BPG in the more challenging YUV4:2:0 and YUV4:4:4 formats, as well as state-of-the-art learning-based variable-rate methods in terms of the MS-SSIM metric.

This paper is organized as follows. The architecture of the proposed deep encoder and decoder networks is described in Section 2.1. Following that, the formulations of the deep encoder and decoder are presented in Sections 2.2 and 2.4. In Section 2.6, the objective functions are formulated and explained. Finally, we present the experimental results on the Kodak image set as well as the ablation studies in Section 3.

2. THE PROPOSED METHOD

The overall framework of the proposed codec is shown in Fig. 1. At the encoder side, two layers of information are encoded: the encoder network output (code map) and the residual image. The code map is composed of HR and LR parts denoted by {y^H, y^L}, which are obtained by the deep encoder f_E. The maps are quantized by a uniform scalar quantizer Q, and then separately encoded by the FLIF lossless codec [24]. The reconstruction of the input image (denoted by x̄) is obtained from the quantized maps {ŷ^H, ŷ^L} by the deep decoder f_D. To further improve the performance, the residual r between the input and the reconstruction is encoded by the BPG codec as an enhancement layer [1]. At the decoder side, the reconstruction x̄ from the deep decoder and the decoded residual image r′ are added to get the final reconstruction x̃.

Fig. 1: Overall framework of the proposed codec. x: input image; f_E: deep encoder; {y^H, y^L}: factorized code map (including high and low resolution maps); Q: uniform scalar quantizer; {ŷ^H, ŷ^L}: quantized code map; f_D: deep decoder; x̄: generated image by the deep decoder; r: residual image; r′: decoded residual image; x̃: final reconstructed image.

Fig. 2: Architecture of the proposed deep encoder and deep decoder networks. GoConv/GoTConv (n, k×k, s): generalized octave convolutions and transposed-convolutions with n filters of size k×k and stride of s. GoRes/GoTRes (n): GoConv- and GoTConv-based residual blocks with n filters.
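
To make the data flow of Fig. 1 concrete, the two-layer scheme of Section 2 can be sketched roughly as follows. This is only an illustrative toy, not the proposed networks: the functions toy_encoder and toy_decoder are hypothetical stand-ins for f_E and f_D built from plain 2×2 average pooling and nearest-neighbour upsampling, the FLIF entropy coding step is omitted, and a coarse uniform quantizer with an assumed step size q_res stands in for the BPG residual codec. Stochastic rounding is used here as the scalar quantizer Q.

```python
import numpy as np

def stochastic_round(y, rng):
    """Round each value down or up at random, with P(up) equal to the fractional part."""
    floor = np.floor(y)
    return floor + (rng.random(y.shape) < (y - floor))

def avg_pool2(a):
    """2x2 average pooling (height and width must be even)."""
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(a):
    """2x nearest-neighbour upsampling."""
    return a.repeat(2, axis=0).repeat(2, axis=1)

def toy_encoder(x):
    """Hypothetical stand-in for f_E: returns an HR map and an LR map at half its size."""
    y_l = avg_pool2(avg_pool2(x))       # quarter-resolution coarse content
    y_h = avg_pool2(x) - upsample2(y_l)  # half-resolution detail not captured by y_l
    return y_h, y_l

def toy_decoder(y_h, y_l):
    """Hypothetical stand-in for f_D: merges the two maps back to image resolution."""
    return upsample2(y_h + upsample2(y_l))

def encode(x, rng, q_res=8.0):
    """Base layer: quantized code maps. Enhancement layer: quantized residual symbols."""
    y_h, y_l = toy_encoder(x)
    y_hat_h = stochastic_round(y_h, rng)
    y_hat_l = stochastic_round(y_l, rng)
    x_bar = toy_decoder(y_hat_h, y_hat_l)  # the encoder replicates the decoder output
    r = x - x_bar                          # residual image
    r_sym = np.round(r / q_res)            # assumed stand-in for the BPG residual codec
    return y_hat_h, y_hat_l, r_sym

def decode(y_hat_h, y_hat_l, r_sym, q_res=8.0):
    """Final reconstruction: decoder output plus decoded residual."""
    x_bar = toy_decoder(y_hat_h, y_hat_l)
    r_prime = r_sym * q_res
    return x_bar + r_prime, x_bar

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(8, 8)).astype(np.float64)  # toy single-channel image
y_hat_h, y_hat_l, r_sym = encode(x, rng)
x_tilde, x_bar = decode(y_hat_h, y_hat_l, r_sym)
print("MSE without enhancement layer:", np.mean((x - x_bar) ** 2))
print("MSE with enhancement layer:   ", np.mean((x - x_tilde) ** 2))
```

In this toy setting the enhancement layer can never increase the per-pixel error, since the residual quantizer leaves an error of at most q_res / 2 and maps small residuals to zero. How much a real residual codec such as BPG helps depends on its own rate and quality settings, which are not modelled here.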
