Nonlinear Transform Coding

Johannes Ballé, Philip A. Chou, David Minnen, Saurabh Singh, Nick Johnston, Eirikur Agustsson, Sung Jin Hwang, George Toderici
Google Research, Mountain View, CA 94043, USA
{jballe, philchou, dminnen, saurabhsingh, nickj, eirikur, sjhwang, gtoderici}@google.com

arXiv:2007.03034v2 [cs.IT] 24 Oct 2020

Abstract—We review a class of methods that can be collected under the name nonlinear transform coding (NTC), which over the past few years have become competitive with the best linear transform codecs for images, and have superseded them in terms of rate–distortion performance under established perceptual quality metrics such as MS-SSIM. We assess the empirical rate–distortion performance of NTC with the help of simple example sources, for which the optimal performance of a vector quantizer is easier to estimate than with natural data sources. To this end, we introduce a novel variant of entropy-constrained vector quantization. We provide an analysis of various forms of stochastic optimization techniques for NTC models; review architectures of transforms based on artificial neural networks, as well as learned entropy models; and provide a direct comparison of a number of methods to parameterize the rate–distortion trade-off of nonlinear transforms, introducing a simplified one.

I. INTRODUCTION

There is no end in sight for the world's reliance on multimedia communication. Digital devices have been increasingly permeating our daily lives, and with them comes the need to store, send, and receive images and audio ever more efficiently. Almost universally, transform coding (TC) has been the method of choice for compressing this type of data source. In his 2001 article for IEEE Signal Processing Magazine [1], Vivek Goyal attributed the success of TC to a divide-and-conquer paradigm: the practical benefit of TC is that it separates the task of decorrelating a source from coding it. Any source can be optimally compressed in theory using vector quantization (VQ) [2]. However, in general, VQ quickly becomes computationally infeasible for sources of more than a handful of dimensions, mainly because the codebook of reproduction vectors, as well as the computational complexity of the search for the best reproduction of the source vector, grow exponentially with the number of dimensions. TC simplifies quantization and coding by first mapping the source vector into a latent space via a decorrelating invertible transform, such as the Karhunen–Loève Transform (KLT), and then separately quantizing and coding each of the latent dimensions.

Much of the theory surrounding TC is based on an implicit or explicit assumption that the source is jointly Gaussian, because this assumption allows for closed-form solutions. If the source is Gaussian, all that is needed to make the latent dimensions independent is decorrelation. When speaking of TC, it is almost always assumed that the transforms are linear, even if the source is far from Gaussian. As an example, consider the banana-shaped distribution in fig. 1: while linear transform coding (LTC) is limited to lattice quantization, nonlinear transform coding (NTC) can more closely adapt to the source, leading to better compression performance.

Fig. 1. Linear transform code (left) and nonlinear transform code (right) of a banana-shaped source distribution, both obtained by empirically minimizing the rate–distortion Lagrangian (eq. (13)). Lines represent quantization bin boundaries, while dots indicate code vectors. While LTC is limited to lattice quantization, NTC can more closely adapt to the source, leading to better compression performance (RD results in fig. 3; details in section III).
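As a minimal sketch of the classical TC pipeline just described (a decorrelating linear transform followed by separate uniform scalar quantization of each latent dimension), the following NumPy snippet is illustrative only and not taken from the paper; the helper name ltc_roundtrip and the toy banana-like source are our own choices, and entropy coding of the indices is omitted.

    import numpy as np

    def ltc_roundtrip(x, step=0.5):
        """Sketch of linear transform coding: KLT plus uniform scalar quantization.

        x: array of shape (num_samples, N), samples from the source.
        step: quantization step size applied to every latent dimension.
        Returns the reconstructed samples and the integer latent indices.
        """
        mean = x.mean(axis=0)
        # Estimate the KLT from the sample covariance (eigenvectors of Cov[x]).
        cov = np.cov(x - mean, rowvar=False)
        _, klt = np.linalg.eigh(cov)          # columns form the transform basis
        y = (x - mean) @ klt                  # decorrelated latents
        k = np.round(y / step).astype(int)    # separate uniform quantization per dimension
        x_hat = (k * step) @ klt.T + mean     # inverse transform of the dequantized latents
        return x_hat, k

    # Example: a crude banana-shaped source, loosely in the spirit of fig. 1.
    rng = np.random.default_rng(0)
    t = rng.normal(size=(10000, 1))
    x = np.hstack([t, 0.5 * t**2 + 0.1 * rng.normal(size=(10000, 1))])
    x_hat, k = ltc_roundtrip(x)
    print("mean squared error:", np.mean((x - x_hat) ** 2))

Because every latent dimension is quantized on a uniform grid, the induced partition of the source space is a rotated lattice, which is exactly the limitation of LTC visible on the left of fig. 1.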
Until a few years ago, one of the fundamental constraints in designing transform codes was that determining nonlinear transforms with desirable properties, such as improved independence between latent dimensions, is a difficult problem for high-dimensional sources. As a result, not much practical research had been conducted in directly using nonlinear transforms for compression. However, this premise has changed with the recent resurgence of artificial neural networks (ANNs). It is well known that, with the right set of parameters, ANNs can approximate arbitrary functions [3]. It turns out that in combination with stochastic optimization methods, such as stochastic gradient descent (SGD), and massively parallel computational hardware, a nearly universal set of tools for function approximation has emerged. These tools have also been used in the context of data compression [4]–[9]. Even though these methods were developed from scratch, they have rapidly become competitive with modern conventional compression methods such as HEVC [10], which are the culmination of decades of incremental engineering efforts. This demonstrates, as it has in other fields, the flexibility and ease of prototyping that universal function approximation brings over designing methods manually, and the power of developing methods in a data-driven fashion.

This paper reviews some of the recent developments in data-driven lossy compression; in particular, we focus on a class of methods that can be collectively called nonlinear transform coding (NTC), providing insights into its capabilities and challenges. We assess the empirical rate–distortion (RD) performance of NTC with the help of simple example sources: the Laplace source and the two-dimensional distribution of fig. 1. To this end, we introduce a novel variant of the entropy-constrained vector quantization (ECVQ) algorithm [11]. Further, we provide insights into various forms of optimization techniques for NTC models and review ANN-based transform architectures, as well as entropy modeling for NTC. A further contribution of this paper is to provide a direct comparison of a number of methods to parameterize the RD trade-off, and to introduce a simplified method.

Fig. 2. Top: A near-optimal entropy-constrained scalar quantizer of a standard Laplacian source, found using the VECVQ algorithm (eq. (5)). Bottom: an entropy-constrained vector quantizer of a banana-shaped source, found using the same algorithm. (Top panel: the source density, the quantization bin boundaries, and the codebook are plotted over the source space.)

In the next section, we first review stochastic gradient optimization of the RD Lagrangian, a necessary tool for optimizing ANNs for lossy compression. We introduce variational ECVQ, illustrating this type of optimization. VECVQ also serves as a baseline to evaluate NTC in the subsequent section. In that section, we discuss various approaches for approximating the gradient of the RD objective and review ANN architectures. Section IV reviews entropy modeling via learned forward and backward adaptation, and illustrates its performance gains on image compression. Section V compares several ways of parameterizing the transforms to continuously traverse the RD curve with a single set of transforms. The last two sections discuss connections to related work and conclude the paper, respectively.

II. STOCHASTIC RATE–DISTORTION OPTIMIZATION

Consider the following lossy compression scenario. Alice is drawing vectors $x \in \mathbb{R}^N$ from some data source, whose probability density function we denote $p_{\text{source}}$. Here, Alice is concerned with compressing each vector into a bit sequence, communicating this sequence to Bob, who then uses the information to reconstruct an approximation to $x$. Each possible vector $x$ is approximated using a codevector $c_k \in \mathcal{C}$, where $\mathcal{C} = \{c_k \in \mathbb{R}^N \mid 0 \le k < K\}$ is called the codebook. Once the codevector index $k = e(x)$ for a given $x$ is determined using the encoder $e(\cdot)$, Alice subjects it to lossless entropy coding, such as Huffman coding or arithmetic coding, which yields a bit sequence of nominal length $s(k)$. In what follows, we'll assume that the performance of this entropy coding method is optimized to closely approximate the theoretical limit, i.e., that Alice and Bob share an estimate of the marginal probability distribution of $k$, also called an entropy model, $P(k)$, and that $s(k) \approx -\log P(k)$. To the extent that $P(k)$ approximates $M(k) = \mathbb{E}_{x \sim p_{\text{source}}} \delta(k, e(x))$, the true marginal distribution of $k$ (where $\delta$ denotes the Kronecker delta function), $s(k)$ is close to optimal, since codes of length $-\log M(k)$ would achieve the lowest possible average rate, the entropy of $k$. Since Alice and Bob also share knowledge of the codebook, Bob can decode the index $k$ and finally look up the reconstructed vector $c_k$.
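As a small numerical illustration of the rate argument above, the following snippet (toy numbers, not from the paper) evaluates the nominal code lengths $s(k) = -\log_2 P(k)$ in bits for a hypothetical entropy model $P$ and compares the resulting average rate, the cross entropy, with the entropy of an assumed true marginal $M$.

    import numpy as np

    # Toy example with K = 4 codebook indices. M(k) is the (assumed) true marginal
    # of the index k; P(k) is the entropy model shared by Alice and Bob.
    M = np.array([0.50, 0.25, 0.15, 0.10])
    P = np.array([0.40, 0.30, 0.20, 0.10])

    s = -np.log2(P)                    # nominal code lengths s(k) = -log2 P(k), in bits
    cross_entropy = np.sum(M * s)      # average rate actually paid when coding with P
    entropy = -np.sum(M * np.log2(M))  # lowest achievable average rate (entropy of k)

    print(f"average rate under P: {cross_entropy:.3f} bits")
    print(f"entropy of k:         {entropy:.3f} bits")

The difference between the two values is the Kullback–Leibler divergence from $M$ to $P$, i.e., the rate overhead incurred by coding with a mismatched entropy model.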
To optimize the efficiency of this scheme, Alice and Bob seek to simultaneously minimize the cross entropy of the index under the entropy model (the rate) as well as the distortion between $x$ and the reconstructed vector, quantified by some distortion measure $d$:

$L = \mathbb{E}_{x \sim p_{\text{source}}} \left[ -\log P(k) + \lambda\, d(x, c_k) \right], \qquad (1)$

with $k = e(x)$ as determined by the encoder $e(\cdot)$, choosing a codebook index for each possible source vector. Many authors formulate this as a minimization problem over one of the terms given a hard constraint on the other [12]. In this paper, we consider the Lagrangian relaxation of the distortion-constrained problem, with the Lagrange multiplier $\lambda$ on the distortion term determining the trade-off between rate and distortion.

The top panel of fig. 2 illustrates a lossy compression method for a simple, one-dimensional Laplacian source, optimized for squared error distortion (i.e., $d(x, c) = \|x - c\|_2^2$).

Writing the resulting objective of the parameterized scheme as $L_{\text{VQ}}$, the expectation over the source of a per-sample Lagrangian $\ell_\Theta(k, x)$ with parameters $\Theta$, its derivative can be approximated by the sample expectation

$\frac{\partial}{\partial \Theta} L_{\text{VQ}} \approx \frac{1}{B} \sum_{\{x_b \sim p_{\text{source}} \,\mid\, 0 \le b < B\}} \frac{\partial \ell_\Theta(k_b, x_b)}{\partial \Theta}, \qquad (7)$

with $k_b = e_\Theta(x_b)$. This is an unbiased estimator of the derivative of $L_{\text{VQ}}$ based on averaging the derivatives of $\ell$ over a batch of $B$ source vector samples. Minimization of $L_{\text{VQ}}$ will fit the entropy model to the marginal distribution of $k$, $M(k) = \mathbb{E}_{x \sim p_{\text{source}}} \delta(k, e(x))$, as well as adjust the codebook vectors to minimize distortion.
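To make the optimization concrete, the sketch below is a simplified illustration under our own assumptions, not the paper's VECVQ implementation: it takes the per-sample Lagrangian to be the integrand of eq. (1), $\ell_\Theta(k, x) = -\log P_\Theta(k) + \lambda (x - c_k)^2$, uses a deterministic encoder that picks the index minimizing this quantity (a standard ECVQ rule, assumed here), and applies the minibatch gradient estimate of eq. (7) while treating the selected indices as constants; handling the encoder's discontinuous dependence on the parameters is one of the gradient-approximation questions discussed later in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    K = 16       # codebook size
    lam = 10.0   # rate-distortion trade-off (lambda in eq. (1))
    lr = 0.05    # SGD step size
    B = 512      # batch size B in eq. (7)

    c = np.linspace(-4.0, 4.0, K)   # scalar codebook vectors c_k
    theta = np.zeros(K)             # entropy-model logits; P(k) = softmax(theta)_k

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    for step in range(2000):
        x = rng.laplace(size=B)     # minibatch drawn from a standard Laplacian source
        P = softmax(theta)

        # Deterministic ECVQ-style encoder (an assumption, not the paper's variational
        # encoder): pick the index minimizing -log P(k) + lambda * (x - c_k)^2.
        cost = -np.log(P)[None, :] + lam * (x[:, None] - c[None, :]) ** 2
        k = cost.argmin(axis=1)

        # Minibatch gradient of the average per-sample Lagrangian, as in eq. (7),
        # treating the selected indices k as constants.
        one_hot = np.eye(K)[k]                                  # shape (B, K)
        grad_theta = (P[None, :] - one_hot).mean(axis=0)        # grad of -log P(k) wrt logits
        grad_c = (-2.0 * lam * (x - c[k]))[:, None] * one_hot   # grad of lam*(x - c_k)^2 wrt c
        theta -= lr * grad_theta
        c -= lr * grad_c.mean(axis=0)

    P = softmax(theta)
    rate = float(np.mean(-np.log2(P)[k]))            # bits per sample, last minibatch
    distortion = float(np.mean((x - c[k]) ** 2))
    print(f"rate ~ {rate:.2f} bits/sample, distortion ~ {distortion:.4f}")

As the text above describes, the two gradient terms play complementary roles: the update on the logits fits the entropy model to the empirical marginal of the selected indices, while the update on the codebook moves each selected codevector toward the samples assigned to it.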