Compressing Variational Posteriors

Yibo Yang∗, Robert Bamler∗, Stephan Mandt
UC Irvine
[email protected], [email protected], [email protected]

Abstract

Deep Bayesian latent variable models have enabled new approaches to both model compression and data compression. Here, we propose a new algorithm for compressing latent representations in deep probabilistic models, such as VAEs, in post-processing. Our algorithm generalizes arithmetic coding to the continuous domain, using an adaptive discretization accuracy that exploits posterior uncertainty. Our approach separates model design and training from the compression task, and thus allows for various rate-distortion trade-offs with a single trained model, eliminating the need to train multiple models for different bit rates. We obtain promising experimental results on compressing Bayesian neural word embeddings, and outperform JPEG on image compression over a wide range of bit rates using only a single VAE.

1 Introduction and Related Work

One natural application of deep generative modeling is data compression. Here we focus on lossy compression of continuous data, which entails quantization, i.e., discretizing arbitrary real numbers to a countable number of representative values, resulting in the classic trade-off between rate (the entropy of the discretized values) and distortion (the error introduced by discretization). State-of-the-art deep learning based approaches to lossy data compression [1–5] train a machine learning model to perform lossy compression by optimizing a rate-distortion objective, substituting the quantization step with a differentiable approximation during training. Consequently, a new model has to be trained for every desired combination of rate and distortion. To support variable-bitrate compression, one has to train and store several deep learning models for different rates, resulting in a possibly very large total codec size. As an additional downside, in real-time applications such as streaming, the codec has to load a new model into memory every time a change in bandwidth requires adjusting the bitrate.

By contrast, we consider a different but related problem: given a trained model, what is the best way to encode the information contained in its continuous latent variables? Our proposed solution is a lossy neural compression method that decouples training from compression, and enables variable-bitrate compression with a single model. We generalize the classical entropy-coding algorithm, Arithmetic Coding [6, 7], from the discrete to the continuous domain. The proposed method, Bayesian Arithmetic Coding, is in essence an adaptive quantization scheme that exploits posterior uncertainty to automatically reduce the accuracy of latent variables about which the model is uncertain anyway.

2 Method and Experimental Results

Problem Setup We consider a wide class of generative models, such as variational autoencoders (VAEs), defined by a joint distribution p(x, z) = p(z) p(x|z), where z ∈ R^K are continuous latent variables. Although we discuss the unsupervised setting, our framework also captures the supervised setup, in terms of a conditional model for a target y given auxiliary input x: p(y, z|x) = p(z) p(y|z, x).
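As a concrete reference point, the following is a minimal PyTorch sketch of the model class assumed here: a factorized standard-normal prior and a diagonal-Gaussian variational posterior. The MLP encoder/decoder and layer sizes are illustrative placeholders, not the architecture used in our experiments (which follows [1]).

```python
# Minimal sketch of the assumed model class: factorized standard-normal prior p(z)
# and diagonal-Gaussian posterior q(z|x). Layer sizes are illustrative only.
import torch
import torch.nn as nn
from torch.distributions import Normal

class FactorizedGaussianVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, data_dim))
        self.latent_dim = latent_dim

    def prior(self):
        # p(z) = N(0, I), factorized over the K latent dimensions
        return Normal(torch.zeros(self.latent_dim), torch.ones(self.latent_dim))

    def posterior(self, x):
        # q(z|x) = N(mu(x), diag(sigma(x)^2))
        mu, log_sigma = self.encoder(x).chunk(2, dim=-1)
        return Normal(mu, log_sigma.exp())

    def likelihood_mean(self, z):
        # mean of p(x|z); used by the receiver as the reconstruction x_hat
        return self.decoder(z)

# Example: infer the variational posterior for a random input
model = FactorizedGaussianVAE()
q = model.posterior(torch.randn(1, 784))
```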

∗Corresponding authors; equal contribution.

[Figure 1: Comparison of standard Arithmetic Coding (AC, left) and Bayesian AC (right, proposed). Both methods use the prior CDF (orange) to map non-uniformly distributed data to a number ξ ∼ U(0, 1), and truncate it to finite binary precision based on a notion of uncertainty.]

[Figure 2: Average rate-distortion performance on the Kodak dataset, measured in PSNR and MS-SSIM [8] (higher is better; x-axis: average bits per pixel). Bayesian AC (blue, proposed) outperforms JPEG at all tested bitrates with a single model. By contrast, Ballé et al. [1] (black squares) rely on individually optimized models for each bitrate that all have to be included in the codec. Additional baselines shown: uniform quantization and k-means quantization.]

We assume two parties in communication, a sender and a receiver. Both have access to the model p(x, z) = p(z) p(x|z), and given a data sample x, the sender wishes to send it to the receiver as efficiently as possible. The sender first infers a variational distribution q(z|x), then uses our proposed algorithm to select a latent vector ẑ that has high probability under q, and that can be encoded in a compressed bitstring, which is sent to the receiver. The receiver losslessly decodes the bitstring back into ẑ and uses the likelihood p(x|ẑ) to generate a reconstructed data point x̂, typically setting x̂ = arg max_x p(x|ẑ). Currently, our algorithm assumes a factorized prior p(z) and posterior q(z|x).
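The round trip can be summarized schematically as below. The quantizer here is a deliberately naive stand-in (rounding posterior means to two decimals), and the function arguments are hypothetical placeholders; the actual encoding step, Bayesian Arithmetic Coding, is described next.

```python
# Schematic sender/receiver round trip; the encoding step is a naive placeholder.
import numpy as np

def sender(x, infer_posterior_mean):
    mu = infer_posterior_mean(x)        # mean of the variational posterior q(z | x)
    z_hat = np.round(mu, 2)             # naive stand-in for the proposed encoding step
    return z_hat                        # in practice: a losslessly entropy-coded bitstring

def receiver(z_hat, likelihood_mode):
    return likelihood_mode(z_hat)       # x_hat = argmax_x p(x | z_hat)

# Toy round trip with identity stand-ins for the encoder and decoder:
x = np.array([0.314, -1.272])
x_hat = receiver(sender(x, infer_posterior_mean=lambda x: x), likelihood_mode=lambda z: z)
```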

Bayesian Arithmetic Coding Given a discrete source p, standard arithmetic coding (AC) uses its CDF to map a data sample m to an interval I_m ⊂ [0, 1) of width p(m), and uniquely identifies it with a real number ξ̂ ∈ I_m using only ⌈− log₂ p(m)⌉ binary digits. In the continuous case, considering the ith latent variable z_i, we make the analogous observation that the CDF F(z_i) transforms z_i ∼ p(z_i) into ξ_i ∼ U(0, 1). In binary, the digits of ξ_i = (0.b₁b₂ ...)₂ form an infinite sequence of Bernoulli(0.5) random variables and cannot be compressed further. Our goal, then, is to truncate ξ_i to a finite number of bits, corresponding to a value ξ̂_i ∈ [0, 1), which in turn identifies a quantized value ẑ_i = F⁻¹(ξ̂_i) in the latent space; we would like ẑ_i to represent the original value z_i well, while spending as few bits on ξ̂_i as possible. Similar to how standard AC performs this truncation using the width of an "uncertainty interval" I_m ⊂ [0, 1), in the continuous case we consider the posterior uncertainty in z_i-space and map it to ξ_i-space, as illustrated in Figure 1. Approximating the posterior p(z_i|x) by the variational distribution q(z_i|x), we consider the (negative) distortion function g(ξ_i) := q(F⁻¹(ξ_i) | x), and search for the optimal ξ̂ by minimizing the rate-distortion objective:
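To make the CDF transform concrete, the following small example (assuming a standard-normal prior and using SciPy's norm) truncates ξ = F(z) to a fixed number of binary digits and maps the result back to the latent space; in practice the per-dimension number of digits is chosen by the rate-distortion search below, rather than fixed.

```python
# Worked illustration: truncating xi = F(z) to n_bits binary digits induces a
# non-uniform grid of quantized values z_hat, coarser where the prior density is low.
import numpy as np
from scipy.stats import norm

z = 1.3                                              # original latent value
xi = norm.cdf(z)                                     # xi ~ U(0, 1) if z ~ N(0, 1)
for n_bits in (2, 4, 8):
    xi_hat = np.floor(xi * 2**n_bits) / 2**n_bits    # keep the first n_bits binary digits
    z_hat = norm.ppf(xi_hat)                         # map back to latent space
    print(f"{n_bits} bits: xi_hat = {xi_hat:.4f}, z_hat = {z_hat:.4f}")
```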

\[
  \mathcal{L}(\hat{\xi} \mid x)
  \;=\; -\log q(\hat{z} \mid x) + \lambda\,|\hat{\xi}|
  \;=\; \sum_{i=1}^{K} \bigl[ -\log q(\hat{z}_i \mid x) + \lambda\,|\hat{\xi}_i| \bigr]
  \;=\; \sum_{i=1}^{K} \bigl[ -\log g(\hat{\xi}_i) + \lambda\,|\hat{\xi}_i| \bigr],
\]
where λ > 0 controls the rate-distortion trade-off, and |ξ̂_i| is the number of bits needed to express ξ̂_i in binary (so that |ξ̂| = Σᵢ |ξ̂_i|). The optimization decouples across the K dimensions and can be solved efficiently in parallel. The resulting optimal ξ̂ is transmitted losslessly to the receiver, who then recovers ẑ_i = F⁻¹(ξ̂_i) for each i = 1, . . . , K.
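Below is a minimal per-dimension sketch of this optimization, assuming a standard-normal prior and a Gaussian posterior q(z_i|x) = N(µ_i, σ_i²). The search strategy, enumerating code lengths and evaluating only the two grid points adjacent to F(µ_i), is one simple way to minimize the objective shown here for illustration, not necessarily the exact procedure used in our implementation; it also counts the rate as the full code length, ignoring droppable trailing zeros.

```python
# Per-dimension rate-distortion search (illustrative): minimize
# -log q(F^{-1}(xi_hat) | x) + lam * n_bits over truncated binary values xi_hat.
import numpy as np
from scipy.stats import norm

def encode_dimension(mu_i, sigma_i, lam, max_bits=16):
    """Return (xi_hat, n_bits) approximately minimizing the per-dimension objective."""
    xi_star = norm.cdf(mu_i)                  # prior CDF evaluated at the posterior mode
    best = (None, None, np.inf)
    for n_bits in range(1, max_bits + 1):
        scale = 2 ** n_bits
        # q is unimodal, so the best grid point at this code length neighbors xi_star
        for k in (np.floor(xi_star * scale), np.ceil(xi_star * scale)):
            k = int(np.clip(k, 1, scale - 1))            # keep xi_hat strictly inside (0, 1)
            xi_hat = k / scale
            z_hat = norm.ppf(xi_hat)                     # F^{-1}(xi_hat)
            loss = -norm.logpdf(z_hat, loc=mu_i, scale=sigma_i) + lam * n_bits
            if loss < best[2]:
                best = (xi_hat, n_bits, loss)
    return best[0], best[1]

# An uncertain latent (large sigma) gets a shorter code than a confident one:
print(encode_dimension(mu_i=0.8, sigma_i=1.0, lam=1.0))
print(encode_dimension(mu_i=0.8, sigma_i=0.05, lam=1.0))
```

Running this on a confident latent (small σ_i) yields a longer code than on an uncertain one, which is exactly the adaptive accuracy described above.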

Results on Image Compression We trained a VAE with a standard Gaussian prior and diagonal Gaussian posterior, using the same ImageNet dataset and neural architecture as in [1]. See Figure 2 for results, where we compare to Ballé et al. [1], JPEG, and a uniform quantization baseline that ignores posterior variance and quantizes posterior means using a uniformly spaced grid. Although we fall short of Ballé et al. [1], which represents the end-to-end optimized performance on this task, we outperform JPEG at all tested bitrates, both quantitatively and visually, using a single VAE.
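For reference, the uniform quantization baseline mentioned above amounts to the following sketch (the grid spacing is a free parameter trading off rate against distortion; the resulting integer indices would then be entropy-coded):

```python
# Uniform-quantization baseline: ignore posterior variance, round posterior means
# to the nearest point of a uniformly spaced grid.
import numpy as np

def uniform_quantize(posterior_means, grid_spacing):
    indices = np.round(np.asarray(posterior_means) / grid_spacing).astype(int)
    z_hat = indices * grid_spacing        # reconstructed latents on the uniform grid
    return indices, z_hat                 # indices would then be entropy-coded

indices, z_hat = uniform_quantize([0.37, -1.82, 0.05], grid_spacing=0.5)
# indices -> [ 1 -4  0],  z_hat -> [ 0.5 -2.   0. ]
```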

Results on Compressing Bayesian Neural Word Embeddings We trained a Bayesian Skip-gram model [9] and applied our method to compress the learned word embeddings. We compare again to baselines based on uniformly quantizing the posterior means, this time combined with more sophisticated entropy-coding algorithms such as gzip and bzip2. On all performance metrics considered (semantic and syntactic reasoning tasks [10]), our algorithm outperforms all baselines by a wide margin (full results omitted due to space constraints), demonstrating the method's flexibility and its potential for model compression.

References

[1] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. In International Conference on Learning Representations, 2017.

[2] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In International Conference on Learning Representations, 2018.

[3] Oren Rippel and Lubomir Bourdev. Real-time adaptive image compression. In International Conference on Machine Learning, 2017.

[4] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4394–4402, 2018.

[5] Salvator Lombardo, Jun Han, Christopher Schroers, and Stephan Mandt. Deep generative video compression. In Advances in Neural Information Processing Systems, pages 9283–9294, 2019.

[6] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987.

[7] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

[8] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, volume 2, pages 1398–1402. IEEE, 2003.

[9] Robert Bamler and Stephan Mandt. Dynamic word embeddings. In International Conference on Machine Learning, 2017.

[10] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
