Variable Rate Deep Image Compression with Modulated Autoencoder

Fei Yang, Luis Herranz, Joost van de Weijer, José A. Iglesias Guitián, Antonio M. López, Mikhail G. Mozerov

All authors are with the Computer Vision Center at the Universitat Autònoma de Barcelona (UAB). Antonio and Joost are also with the Computer Science Dpt. at UAB. Fei is also with the Key Laboratory of Information Fusion Technology, Northwestern Polytechnical University, China. E-mail: [email protected]. The authors thank Audi Electronics Venture GmbH for supporting this work, and the Generalitat de Catalunya CERCA Program and its ACCIO agency. Fei Yang acknowledges the Chinese Scholarship Council grant No. 201706290127. Luis acknowledges the support of the Spanish project RTI2018-102285-A-I00. Joost acknowledges the support of the Spanish project TIN2016-79717-R. Antonio and Jose acknowledge the support of project TIN2017-88709-R (MINECO/AEI/FEDER, UE). Antonio also thanks the support of the ICREA Academia programme. Jose and Luis acknowledge the support of the EU's Horizon 2020 R&I programme under the Marie Skłodowska-Curie grant No. 665919.

Abstract—Variable rate is a requirement for flexible and adaptable image and video compression. However, deep image compression (DIC) methods are optimized for a single fixed rate-distortion (R-D) tradeoff. While this can be addressed by training multiple models for different tradeoffs, the memory requirements increase proportionally to the number of models. Scaling the bottleneck representation of a shared autoencoder can provide variable rate compression with a single shared autoencoder. However, the R-D performance of this simple mechanism degrades at low bitrates, and the effective range of bitrates also shrinks. To address these limitations, we formulate the problem of variable R-D optimization for DIC, and propose modulated autoencoders (MAEs), where the representations of a shared autoencoder are adapted to the specific R-D tradeoff via a modulation network. Jointly training this modulated autoencoder and the modulation network provides an effective way to navigate the R-D operational curve. Our experiments show that the proposed method can achieve almost the same R-D performance as independent models with significantly fewer parameters.

Index Terms—Deep image compression, variable bitrate, autoencoder, modulated autoencoder

Fig. 1: Image compression and R-D control: (a) JPEG transform, (b) pre-trained autoencoder with bottleneck scaling [1], and (c) our proposed modulated autoencoder with joint training. Entropy coding/decoding is omitted for simplicity. (Block diagrams: (a) DCT, scaling with a quantization table, quantization Q, bit-stream, inverse scaling, IDCT; (b) encoder, scaling by a per-tradeoff factor, Q, bit-stream, inverse scaling, decoder; (c) encoder, Q, bit-stream, decoder, with trainable modulating and demodulating networks.)

I. INTRODUCTION

Image compression is a fundamental and well-studied problem in image processing and computer vision [2], [3], [4]. The goal is to design binary representations (i.e. bitstreams) with minimal entropy [5] that minimize the number of bits required to represent an image (i.e. bitrate) at a given level of fidelity (i.e. distortion) [6]. In many applications, communication networks or storage devices may impose a constraint on the maximum bitrate, which requires the image encoder to adapt to a given bitrate budget. In some applications this constraint may even change dynamically over time (e.g. video transmission). In all these cases, a bitrate control mechanism is required, and it is available in most traditional image and video compression codecs. In general, reducing the bitrate causes an increase in the distortion, i.e. there is a rate-distortion (R-D) tradeoff. This control mechanism is typically based on scaling the latent representation prior to quantization to obtain finer or coarser quantizations, and then inverting the scaling at the decoder side (Fig. 1a).

Recent studies show that deep image compression (DIC) achieves comparable or even better results than classical image compression techniques [7], [8], [9], [10], [1], [11], [12], [13], [14], [15]. In this paradigm, the parameters of the encoder and decoder are learned from image data by jointly minimizing rate and distortion at a particular R-D tradeoff. However, variable bitrate requires an independent model for every R-D tradeoff. This is an obvious limitation, since it requires storing each model separately, resulting in a large memory requirement.

To address this limitation, Theis et al. [1] use a single autoencoder whose bottleneck representation is scaled before quantization depending on the target bitrate (Fig. 1b). However, this approach only considers the importance of the different channels of the bottleneck representation of a learned autoencoder under an R-D tradeoff constraint. In addition, the autoencoder is optimized for a single specific R-D tradeoff (typically a high bitrate). These aspects lead to a drop in performance at low bitrates and a narrow effective range of bitrates.

In order to tackle the limitations of multiple independent models and bottleneck scaling, we formulate the problem of variable R-D optimization for DIC and propose the modulated autoencoder (MAE) framework, where the representations of a shared autoencoder at different layers are adapted to a specific R-D tradeoff via a modulating network. The modulating network is conditioned on the target R-D tradeoff, which is synchronized with the actual tradeoff used to learn the parameters of the autoencoder and the modulating network. MAEs can achieve almost the same operational R-D points as independent models with far fewer overall parameters (i.e. just the shared autoencoder plus the small overhead of the modulating network). Multi-layer modulation does not suffer from the main limitations of bottleneck scaling, namely the drop in performance at low rates and the shrinkage of the effective range of bitrates.
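To make this mechanism more concrete before the formal treatment in Sections II and III, the following is a minimal PyTorch-style sketch of a modulating network that maps a target tradeoff λ to one channel-wise modulation vector per encoder layer; a demodulating network on the decoder side would be symmetric. The class name, layer widths, number of modulated layers, and the use of Softplus are illustrative assumptions for this sketch, not the exact architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class ModulatingNetwork(nn.Module):
    """Maps a target R-D tradeoff lambda to one channel-wise modulation
    vector per encoder layer. Illustrative sketch only."""

    def __init__(self, layer_channels=(128, 128, 192), hidden=64):
        super().__init__()
        self.heads = nn.ModuleList()
        for c in layer_channels:
            self.heads.append(nn.Sequential(
                nn.Linear(1, hidden), nn.ReLU(),
                nn.Linear(hidden, c), nn.Softplus(),   # positive per-channel factors
            ))

    def forward(self, lam):
        lam = torch.as_tensor([[float(lam)]])            # shape (1, 1)
        # One modulation vector per layer, broadcastable over the H x W dimensions.
        return [head(lam).view(1, -1, 1, 1) for head in self.heads]

# Hypothetical usage inside a modulated encoder: multiply each intermediate
# feature map by the corresponding modulation vector.
#   m = modnet(lam)
#   h1 = conv1(x)  * m[0]
#   h2 = conv2(h1) * m[1]
#   z  = conv3(h2) * m[2]
```

Conditioning every layer, rather than only the bottleneck, is what distinguishes this scheme from the bottleneck scaling of Fig. 1b.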
II. BACKGROUND

Almost all lossy image and video compression approaches follow the transform coding paradigm [16]. The basic structure is a transform $z = f(x)$ that takes an input image $x \in \mathbb{R}^N$ and obtains a transformed representation $z$, followed by a quantizer $q = Q(z)$, where $q \in \mathbb{Z}^D$ is a discrete-valued vector. The decoder reverses the quantization (i.e. dequantizer $\hat{z} = Q^{-1}(q)$) and the transform (i.e. inverse transform) as $\hat{x} = g(\hat{z})$, reconstructing the output image $\hat{x} \in \mathbb{R}^N$. Before transmission (or storage), the discrete-valued vector $q$ is binarized and serialized into a bitstream $b$. Entropy coding [17] is used to exploit the statistical redundancy in that bitstream and reduce its length.

In DIC, the handcrafted analysis and synthesis transforms are replaced by the encoder $z = f(x; \theta)$ and decoder $\hat{x} = g(\hat{z}; \phi)$ of a convolutional autoencoder, parametrized by $\theta$ and $\phi$. The fundamental difference is that the transforms are not designed but learned from training data. The model is typically trained by solving the optimization problem

$\arg\min_{\theta,\phi}\; R(b) + \lambda D(x, \hat{x})$,   (1)

where $R(b)$ measures the rate of the bitstream $b$, $D(x, \hat{x})$ represents a distortion metric between $x$ and $\hat{x}$, and the Lagrange multiplier $\lambda$ controls the R-D tradeoff. Note that $\lambda$ is fixed in this case. The problem is solved using gradient descent and backpropagation [18]. To make the model differentiable, which is required to apply backpropagation, quantization is approximated during training by additive uniform noise, i.e. $\tilde{z} = z + \Delta z$ with $\Delta z \sim \mathcal{U}$. We will use the entropy model described in [8] to approximate $P_q$ by $p_{\tilde{z}}(\tilde{z})$. Finally, we will use mean squared error (MSE) as the distortion metric. With these particular choices, (1) becomes

$\arg\min_{\theta,\phi}\; R(\tilde{z}; \theta) + \lambda D(x, \hat{x}; \theta, \phi)$,   (2)

with

$R(\tilde{z}; \theta) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}} \left[ -\log_2 p_{\tilde{z}}(\tilde{z}) \right]$,   (3)

$D(x, \hat{x}; \theta, \phi) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}} \left[ \lVert x - \hat{x} \rVert^2 \right]$.   (4)
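As an illustration of Eqs. (2)-(4), the following is a minimal PyTorch-style sketch of the loss for one training batch at a fixed tradeoff λ, with the quantizer replaced by additive uniform noise. The `encoder`, `decoder`, and `entropy_model` arguments are hypothetical stand-ins (the entropy model of [8] is not reproduced here); `entropy_model(z_tilde)` is assumed to return the per-element likelihoods $p_{\tilde{z}}(\tilde{z})$.

```python
import torch

def rd_loss(x, encoder, decoder, entropy_model, lam):
    """Fixed-tradeoff DIC objective of Eqs. (2)-(4) for one batch of images x (NCHW)."""
    z = encoder(x)                                    # analysis transform z = f(x; theta)
    noise = torch.empty_like(z).uniform_(-0.5, 0.5)   # additive uniform noise Delta z
    z_tilde = z + noise                               # differentiable proxy for quantization
    x_hat = decoder(z_tilde)                          # synthesis transform x_hat = g(z_tilde; phi)

    likelihoods = entropy_model(z_tilde)              # p(z_tilde) under the learned entropy model
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate = -torch.log2(likelihoods).sum() / num_pixels    # rate estimate in bits per pixel, Eq. (3)
    distortion = torch.mean((x - x_hat) ** 2)              # MSE distortion, Eq. (4)

    return rate + lam * distortion                     # R + lambda * D, Eq. (2)
```

In practice this loss is minimized with a stochastic gradient optimizer over a training set of images, with one model trained per value of λ.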
III. MULTI-RATE DEEP IMAGE COMPRESSION WITH MODULATED AUTOENCODERS

A. Problem definition

We are interested in DIC models that can operate satisfactorily on different R-D tradeoffs, and adapt to a specific one when required. Note that Eq. (2) optimizes rate and distortion for a fixed tradeoff $\lambda$. We extend that formulation to multiple R-D tradeoffs (i.e. $\lambda \in \Lambda = \{\lambda_1, \ldots, \lambda_M\}$) as the multi-rate-distortion problem

$\arg\min_{\theta,\phi} \sum_{\lambda \in \Lambda} \left[ R(\tilde{z}; \theta, \lambda) + \lambda D(x, \hat{x}; \theta, \phi, \lambda) \right]$,   (5)

with

$R(\tilde{z}; \theta, \lambda) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}} \left[ -\log_2 p_{\tilde{z}}(\tilde{z}) \right]$,   (6)

$D(x, \hat{x}; \theta, \phi, \lambda) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}} \left[ \lVert x - \hat{x} \rVert^2 \right]$,   (7)

where we simplify the notation by omitting the dependency of the features on $\lambda$, i.e. $\tilde{z} = \tilde{z}(\lambda) = f(x; \theta, \lambda)$ and $\hat{x} = \hat{x}(\lambda) = g(\tilde{z}(\lambda); \phi, \lambda)$. This formulation can easily be extended to a continuous range of tradeoffs. Note also that these optimization problems assume that all R-D operational points are equally important. It would be possible to integrate an importance function $I(\lambda)$ to give more weight to certain R-D operational points if required. We assume uniform importance (continuous or discrete) for simplicity.

B. Bottleneck scaling

A possible way to make the encoder and decoder aware of $\lambda$ is simply to scale the latent representation in the bottleneck before quantization (implicitly scaling the quantization bin), and then invert the scaling in the decoder. In this case, $q = Q(z \cdot s(\lambda))$ and $\hat{x}(\lambda) = g(\hat{z}(\lambda) \cdot (1/s(\lambda)); \phi)$, where $s(\lambda)$ is the specific scaling factor for the tradeoff $\lambda$.
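For illustration, the sketch below shows bottleneck scaling at coding time and one stochastic way to optimize the multi-tradeoff objective of Eq. (5) by sampling λ per step. The scaling table, module names, and the joint training loop are assumptions for this sketch; in [1] the autoencoder is pre-trained for a single tradeoff and only the scaling factors are fitted afterwards, which is precisely the limitation that motivates the modulated autoencoder.

```python
import random
import torch

# Hypothetical per-tradeoff scaling factors s(lambda): larger s keeps more precision
# before rounding (finer quantization, higher rate), smaller s discards more.
SCALES = {0.001: 0.25, 0.01: 1.0, 0.1: 4.0}   # lambda -> s(lambda)

def code_with_bottleneck_scaling(x, encoder, decoder, lam):
    """Variable rate with a shared autoencoder and bottleneck scaling (Fig. 1b)."""
    s = SCALES[lam]
    z = encoder(x)                # shared analysis transform
    q = torch.round(z * s)        # q = Q(z * s(lambda)): scale, then quantize
    z_hat = q / s                 # decoder inverts the scaling
    x_hat = decoder(z_hat)        # shared synthesis transform
    return q, x_hat

def multi_rd_step(x, encoder, decoder, entropy_model):
    """One stochastic step of Eq. (5): sample a tradeoff, return R + lambda * D."""
    lam = random.choice(list(SCALES))
    s = SCALES[lam]
    z = encoder(x) * s
    z_tilde = z + torch.empty_like(z).uniform_(-0.5, 0.5)    # quantization proxy
    x_hat = decoder(z_tilde / s)
    rate = -torch.log2(entropy_model(z_tilde)).mean()         # bits per latent element
    distortion = torch.mean((x - x_hat) ** 2)                  # MSE
    return rate + lam * distortion
```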