Variational Bayesian Quantization

Yibo Yang * 1 Robert Bamler * 1 Stephan Mandt 1

Abstract

We propose a novel algorithm for quantizing continuous latent representations in trained models. Our approach applies to deep probabilistic models, such as variational autoencoders (VAEs), and enables both data and model compression. Unlike current end-to-end neural compression methods that cater the model to a fixed quantization scheme, our algorithm separates model design and training from quantization. Consequently, our algorithm enables "plug-and-play" compression with variable rate-distortion trade-off, using a single trained model. Our algorithm can be seen as a novel extension of arithmetic coding to the continuous domain, and uses adaptive quantization accuracy based on estimates of posterior uncertainty. Our experimental results demonstrate the importance of taking into account posterior uncertainties, and show that image compression with the proposed algorithm outperforms JPEG over a wide range of bit rates using only a single VAE. Further experiments on Bayesian neural word embeddings demonstrate the versatility of the proposed method.

1. Introduction

Latent-variable models have become a mainstay of modern machine learning. Scalable approximate Bayesian inference methods, in particular Black Box Variational Inference (Ranganath et al., 2014; Rezende et al., 2014), have spurred the development of increasingly large and expressive probabilistic models, including deep generative probabilistic models such as variational autoencoders (Kingma & Welling, 2014b) and Bayesian neural networks (MacKay, 1992; Blundell et al., 2015).

One natural application of deep latent variable modeling is data compression, and recent work has focused on end-to-end procedures that optimize a model for a particular compression objective. Here, we study a related but different problem: given a trained model, what is the best way to encode the information contained in its continuous latent variables?

As we show, our proposed solution provides a new "plug-and-play" approach to compression of both data instances (represented by local latent variables, e.g., in a VAE) as well as model parameters (represented by global latent variables that serve as parameters of a Bayesian statistical model). Our approach separates the compression task from model design and training, thus implementing variable-bitrate compression as an independent post-processing step in a wide class of existing latent variable models.

At the heart of our proposed method lies a novel quantization scheme that optimizes a rate-distortion trade-off by exploiting posterior uncertainty estimates. Quantization is central to lossy compression, as continuous-valued data like natural images, audio, and distributed representations ultimately need to be discretized to a finite number of bits for digital storage or transmission. Lossy compression algorithms therefore typically find a discrete approximation of some semantic representation of the data, which is then encoded with a lossless compression method.

In classical lossy compression methods such as JPEG or MP3, the semantic representation is carefully designed to support compression at variable bitrates. By contrast, state-of-the-art deep learning based approaches to lossy data compression (Ballé et al., 2017; 2018; Rippel & Bourdev, 2017; Mentzer et al., 2018; Lombardo et al., 2019) are trained to minimize a distortion metric at a fixed bitrate. To support variable-bitrate compression in this approach, one has to train several models for different bitrates. While training several models may be viable in many cases, a bigger issue is the increase in decoder size, as the decoder has to store the parameters of not one but several deep neural networks for each bitrate setting. In applications like video streaming under fluctuating connectivity, the decoder further has to load a new deep learning model into memory every time a change in bandwidth requires adjusting the bitrate.

By contrast, we propose a quantization method for latent variable models that decouples training from compression, and that enables variable-bitrate compression with a single model. We generalize a classical entropy coding algorithm, Arithmetic Coding (Witten et al., 1987; MacKay, 2003), from the discrete to the continuous domain.

* Equal contribution. 1 Department of Computer Science, University of California, Irvine. Correspondence to: Yibo Yang, Robert Bamler.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Our proposed algorithm, Variational Bayesian Quantization, exploits posterior uncertainty estimates to automatically reduce the quantization accuracy of latent variables for which the model is uncertain anyway. This strategy is analogous to the way humans communicate quantitative information. For example, Wikipedia lists the population of Rome in 2017 with the specific number 2,879,728. By contrast, its population in the year 500 AD is estimated by the round number 100,000 because the high uncertainty would make a more precise number meaningless. Our ablation studies show that this posterior-informed quantization scheme is crucial to obtaining competitive performance.

In detail, our contributions are as follows:

• A new discretization scheme. We present a novel approach to discretizing latent variables in a variational inference framework. Our approach generalizes arithmetic coding from discrete to continuous distributions and takes posterior uncertainty into account.

• Single-model compression at variable bitrates. The decoupling of modeling and compression allows us to adjust the trade-off between bitrate and distortion in post-processing. This is in contrast to existing approaches to both data and model compression, which often require specially trained models for each bitrate.

• Automatic self-pruning. Deep latent variable models often exhibit posterior collapse, i.e., the variational posterior collapses to the model prior. In our approach, latent dimensions with collapsed posteriors require close to zero bits and thus don't require manual pruning.

• Competitive experimental performance. We show that our method outperforms JPEG over a wide range of bitrates using only a single model. We also show that we can successfully compress word embeddings with minimal loss, as evaluated on a semantic reasoning task.

The paper is structured as follows: Section 2 reviews related work in neural compression; Section 3 proposes our Variational Bayesian Quantization algorithm. We give empirical results in Section 4, and conclude in Section 5. Section 6 provides additional theoretical insight about our method.

2. Related Work

Compressing continuous-valued data is a classical problem in the signal processing community. Typically, a distortion measure (often the squared error) and a source distribution are assumed, and the goal is to design a quantizer that optimizes the rate-distortion (R-D) performance (Lloyd, 1982; Berger, 1972; Chou et al., 1989). Optimal vector quantization, although theoretically well-motivated (Gallager, 1968), is not tractable in high-dimensional spaces (Gersho & Gray, 2012) and not scalable in practice. Therefore, most classical lossy compression algorithms map data to a suitably designed semantic representation, in such a way that coordinate-wise scalar quantization can be fruitfully applied.

Recent machine-learning-based data compression methods learn such representations from data rather than hand-designing them, but similar to classical methods, most such ML methods directly take quantization into account in the generative model design or training. Various approaches have been proposed to approximate the non-differentiable quantization operation during training, such as stochastic binarization (Toderici et al., 2016; 2017), additive uniform noise (Ballé et al., 2017; 2018; Habibian et al., 2019), or other differentiable approximations (Agustsson et al., 2017; Theis et al., 2017; Mentzer et al., 2018; Rippel & Bourdev, 2017); many such schemes result in quantization with a uniformly-spaced grid, with the exception of (Agustsson et al., 2017), which optimizes for quantization grid points. Yang et al. (2020) consider optimal quantization at compression time, but assume the fixed quantization scheme of (Ballé et al., 2017) during training.

We depart from such approaches by treating quantization as a post-processing step decoupled from model design and training. Crucial to our approach is a new quantization scheme that automatically adapts to different length scales in the representation space based on posterior uncertainty estimates. To our best knowledge, the only prior work that uses posterior uncertainty for compression is in the context of bits-back coding (Honkela & Valpola, 2004; Townsend et al., 2019), but these works focus on lossless compression, with the recent exception of (Yang et al., 2020).

Most existing neural image compression methods require training a separate machine learning model for each desired bitrate setting (Ballé et al., 2017; 2018; Mentzer et al., 2018; Theis et al., 2017; Lombardo et al., 2019). In fact, Alemi et al. (2018) showed that any particular fitted VAE model only targets one specific point on the rate-distortion curve. Our approach has the same benefit of variable-bitrate single-model compression as methods based on recurrent VAEs (Gregor et al., 2016; Toderici et al., 2016; 2017; Johnston et al., 2018); but unlike these methods, which use dedicated model architectures for progressive image reconstruction, we instead focus more broadly on quantizing latent representations in a given generative model, designed and trained for specific application purposes (possibly other than compression, e.g., modeling complex scientific observations).

3. Posterior-Informed Variable-Bitrate Compression

We now propose an algorithm for quantizing latent variables in trained models. After describing the problem setup and assumptions (Subsection 3.1), we briefly review Arithmetic Coding (Subsection 3.2). Subsection 3.3 describes our proposed lossy compression algorithm, which generalizes Arithmetic Coding to the continuous domain.

3.1. Problem Setup

Generative Model and Variational Inference. We consider a wide class of generative probabilistic models with data x and unknown (or "latent") variables z ∈ R^K from some continuous latent space with dimension K. The generative model is defined by a joint probability distribution,

    p(x, z) = p(z) p(x|z)    (1)

with a prior p(z) and a likelihood p(x|z). Although our presentation focuses on unsupervised representation learning, our framework also captures the supervised setup.¹

Our proposed compression method uses z as a proxy to describe the data x. This requires "solving" Eq. 1 for z given x, i.e., inferring the posterior p(z|x) = p(x, z) / ∫ p(x, z) dz. Since exact Bayesian inference is often intractable, we resort to Variational Inference (VI) (Jordan et al., 1999; Blei et al., 2017; Zhang et al., 2019), which approximates the posterior by a so-called variational distribution q_φ(z|x) by minimizing the Kullback-Leibler divergence D_KL(q_φ(z|x) || p(z|x)) over a set of variational parameters φ.

Factorization Assumptions. We assume that both the prior p(z) and the variational distribution q_φ(z|x) are fully factorized (mean-field assumption). For concreteness, our examples use a Gaussian variational distribution. Thus,

    p(z) = ∏_{i=1}^K p(z_i);  and    (2)
    q_φ(z|x) = ∏_{i=1}^K N(z_i; μ_i(x), σ_i^2(x)),    (3)

where p(z_i) is a prior for the i-th component of z, and the means μ_i and standard deviations σ_i together comprise the variational parameters φ over which VI optimizes.²

Prominently, the model class defined by Eqs. 1-3 includes variational autoencoders (VAEs) (Kingma & Welling, 2014a) for data compression, but we stress that the class is much wider, capturing also Bayesian neural nets (MacKay, 2003), probabilistic word embeddings (Barkan, 2017; Bamler & Mandt, 2017), matrix factorization (Mnih & Salakhutdinov, 2008), and topic models (Blei et al., 2003).

Protocol Overview. We consider two parties in communication, a sender and a receiver. In the case of data compression, both parties have access to the model, but only the sender has access to the data point x, which it uses to fit a variational distribution q_φ(z|x). It then uses the algorithm proposed below to select a latent variable vector ẑ that has high probability under q_φ, and that can be encoded into a compressed bitstring, which gets transmitted to the receiver. The receiver losslessly decodes the compressed bitstring back into ẑ and uses the likelihood p(x|ẑ) to generate a reconstructed data point x̂, typically setting x̂ = arg max_x p(x|ẑ). In the case of model compression, the sender infers a distribution q_φ(z|x) over model parameters z given training data x, and uses our algorithm to select a suitable vector ẑ of quantized model parameters. The receiver receives ẑ and uses it to reconstruct the model.

The rest of this section describes how the proposed algorithm selects ẑ and encodes it into a compressed bitstring.

3.2. Background: Arithmetic Coding

Our quantization algorithm, introduced in Section 3.3 below, is inspired by a lossless compression algorithm, arithmetic coding (AC) (Witten et al., 1987; MacKay, 2003), which we generalize from discrete data to the continuous space of latent variables z ∈ R^K. To get there, we first review the main idea of AC that our proposed algorithm borrows.

AC is an instance of so-called entropy coding. It uniquely maps messages m ∈ M from a discrete set M to a compressed bitstring of some length R_m (the "bitrate"). Entropy coding exploits prior knowledge of the distribution p(m) of messages to map probable messages to short bitstrings while spending more bits on improbable messages. This way, entropy coding algorithms aim to minimize the expected rate E_{p(m)}[R_m]. For lossless compression, the expected rate has a fundamental lower bound, the entropy H = E_{p(m)}[h(m)], where h(m) = −log_2 p(m) is the Shannon information content of m. AC provides near optimal lossless compression as it maps each message m ∈ M to a bitstring of length R_m = ⌈h(m)⌉, where ⌈·⌉ denotes the ceiling function.

AC is usually discussed in the context of streaming compression where m is a sequence of symbols from a finite alphabet, as AC improves on this task over the more widely known Huffman coding (Huffman, 1952). In our work, we focus on a different aspect of AC: its use of a cumulative probability distribution function to map a nonuniformly distributed random variable m ~ p(m) to a number ξ that is nearly uniformly distributed over the interval [0, 1).

¹ For supervised learning with labels y, we would consider a conditional generative model p(y, z|x) = p(y|z, x) p(z) with conditional likelihood p(y|z, x), where z are the model parameters, treated as a Bayesian latent variable with associated prior p(z).
² These parameters are often amortized by a neural network (in which case μ_i and σ_i depend on x), but don't have to (in which case μ_i and σ_i do not depend on x and are directly optimized).
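The following sketch (our illustration, not the authors' code) makes this CDF idea concrete: a message m with probability p(m) owns the CDF interval [F_<(m), F_≤(m)), and any binary fraction inside that interval identifies m; one with roughly −log_2 p(m) bits always exists. The binomial example mirrors Figure 1 (left).

```python
# Minimal sketch (our illustration) of the aspect of arithmetic coding that VBQ
# borrows: pick the shortest binary fraction inside the CDF interval of a symbol.
import math
import numpy as np

def ac_interval(probs, m):
    """Left/right-sided CDF bounds of symbol m for the probability table `probs`."""
    cdf = np.concatenate(([0.0], np.cumsum(probs)))
    return cdf[m], cdf[m + 1]

def shortest_binary_fraction(low, high):
    """Shortest xi = (0.b1...br)_2 with low <= xi < high."""
    r = 1
    while True:
        k = math.ceil(low * 2**r)        # smallest candidate k / 2^r at or above `low`
        if k / 2**r < high:
            return k / 2**r, r
        r += 1

# Toy example: the binomial-distributed message of Figure 1 (left).
probs = np.array([math.comb(10, k) for k in range(11)]) / 2**10
low, high = ac_interval(probs, m=2)
xi, rate = shortest_binary_fraction(low, high)
print(xi, rate, -math.log2(probs[2]))    # e.g. 0.03125 with 5 bits vs. h(m) ~ 4.5 bits
```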

Figure 1. Comparison of Arithmetic Coding (AC, left) and VBQ (right, proposed). Both methods use a prior CDF (orange) to map nonuniformly distributed data to a number ξ ~ U(0, 1), and both require an uncertainty region for truncation. (Left panel: discrete message m with F_<(m), F_≤(m), and p(m); right panel: latent variable z_i ∈ R with p(z_i), μ_i ± σ_i, F(z_i), q(z_i|x), and g(ξ_i); the vertical axis marks the binary code points (0.001)_2 = 1/8, (0.01)_2 = 1/4, ..., (0.111)_2 = 7/8.)

Figure 1 (left) illustrates AC for a binomial-distributed message m ∈ {0, ..., 10} (the number of 'heads' in a sequence of ten coin flips). The solid and dashed orange lines show the left- and right-sided cumulative distribution functions, F_<(m) := Σ_{m'<m} p(m') and F_≤(m) := Σ_{m'≤m} p(m').

3.3. Variational Bayesian Quantization

Like AC, VBQ exploits knowledge of a prior probability distribution p(z) in combination with a (soft) uncertainty region to encode probable values of z into short bitstrings.

From Intervals to Distributions. The main ideas that VBQ borrows from AC are as follows: (1) the use of a cumulative distribution function to map a non-uniformly distributed random variable to a uniformly distributed random variable ξ over the interval (0, 1), and (2) the use of an "uncertainty region" to select a number from this interval to encode the message with as few bits as possible. While AC uses an interval I_m with hard boundaries, VBQ softens this uncertainty region by considering posterior uncertainty.

We consider a single continuous latent variable z_i ∈ R with arbitrary prior p(z_i). The cumulative distribution function (CDF) of the prior,

    F(z_i) := ∫_{-∞}^{z_i} p(z_i') dz_i',    (5)

is shown in orange in Figure 1 (right). It maps z_i ~ p(z_i) to ξ_i ∈ (0, 1). In contrast to the discrete case, there is no hard interval from which to pick a code point; instead, VBQ derives a soft uncertainty region for ξ_i from the posterior q_φ(z_i|x), as formalized by the rate-distortion objective below.
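The mapping just described is the standard probability integral transform; the short check below (our illustration, assuming a standard normal prior) confirms that quantiles ξ_i = F(z_i) of prior samples are roughly uniform on (0, 1).

```python
# Quick numerical check (our illustration, not from the paper): pushing samples
# z_i ~ p(z_i) through the prior CDF F yields nearly uniform quantiles on (0, 1).
import numpy as np
from scipy.stats import norm

z = norm.rvs(size=100_000, random_state=0)   # samples from a standard normal prior
xi = norm.cdf(z)
hist, _ = np.histogram(xi, bins=10, range=(0.0, 1.0))
print(hist / xi.size)                        # each of the 10 bins holds roughly 10%
```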

Algorithm 1 Rate-Distortion Optimization for Dimension i

Input: Prior CDF F(z_i), rate penalty λ > 0, variational mode μ_i(x) and variance σ_i^2(x).
Output: Optimal code point ξ̂_i* ≡ (0.b_1 b_2 ... b_{R(ξ̂_i*)})_2.
  Evaluate ξ_i† ← F(μ_i(x)).
  Initialize r ← 0, ξ̂_i* ← null, ℓ* ← ∞.
  repeat
    Update r ← r + 1
    Set ξ̂_i^{r,left} ← 2^{-r} ⌊2^r ξ_i†⌋ and ξ̂_i^{r,right} ← 2^{-r} ⌈2^r ξ_i†⌉.
    if ξ̂_i^{r,left} ≠ 0 and ℓ_λ(ξ̂_i^{r,left} | x) < ℓ* then
      Update ξ̂_i* ← ξ̂_i^{r,left}, ℓ* ← ℓ_λ(ξ̂_i^{r,left} | x).
    end if
    if ξ̂_i^{r,right} ≠ 1 and ℓ_λ(ξ̂_i^{r,right} | x) < ℓ* then
      Update ξ̂_i* ← ξ̂_i^{r,right}, ℓ* ← ℓ_λ(ξ̂_i^{r,right} | x).
    end if
  until log g(ξ_i†) − log g(ξ̂_i*) < λ(r + 1 − R(ξ̂_i*)).

Figure 2. Effect of an anisotropic posterior distribution on quantization. Left: linear regression model with optimal fit (green) and fits of models with quantized parameters (orange, purple). Right: posterior distribution and quantized model parameters following two different quantization schemes. Although both quantized models are equally far away from the optimal solution (green dot), VBQ (orange) fits the data better because it takes the anisotropy of the posterior into account.

Optimizing the Rate-Distortion Trade-Off. Rather than considering a hard uncertainty region, VBQ simply tries to find a point ξ ≡ (ξ_i)_{i=1}^K that identifies latent variables z ≡ (z_i)_{i=1}^K with high probability under the variational distribution q_φ(z|x) while being expressible in few bits. We thus express log q_φ(z|x) in terms of the coordinates ξ_i = F(z_i) using Eq. 3,

    log q_φ(z|x) = − Σ_{i=1}^K (1/2) (F^{-1}(ξ_i) − μ_i(x))^2 / σ_i^2(x) + cnst.    (7)

For each dimension i, we restrict the quantile ξ_i ∈ (0, 1) to the set of code points ξ̂_i that can be represented in binary via Eq. 4 with a finite but arbitrary bitlength R(ξ̂_i). We define the total bitlength R(ξ̂) := Σ_{i=1}^K R(ξ̂_i), i.e., the length of the concatenation of all codes ξ̂_i, i ∈ {1, ..., K}, neglecting, for now, an overhead for delimiters (see below). Using a rate penalty parameter λ > 0 that is shared across all dimensions i, we minimize the rate-distortion objective

    L_λ(ξ̂ | x) = − log q_φ(ẑ | x) + λ R(ξ̂)    (8)
               = Σ_{i=1}^K [ (1/2) (F^{-1}(ξ̂_i) − μ_i(x))^2 / σ_i^2(x) + λ R(ξ̂_i) ] + cnst.

The optimization thus decouples across all latent dimensions i, and can be solved efficiently and in parallel by minimizing the K independent objective functions

    ℓ_λ(ξ̂_i | x) = (F^{-1}(ξ̂_i) − μ_i(x))^2 + 2λ σ_i^2(x) R(ξ̂_i).    (9)

Although the bitlength R(ξ̂_i) is discontinuous (it counts the number of binary digits, see Eq. 4), ℓ_λ(ξ̂_i | x) can be efficiently minimized over ξ̂_i using Algorithm 1. The algorithm iterates over all rates r ∈ {1, 2, ...} and searches for the code point ξ̂_i* that minimizes ℓ_λ(ξ̂_i | x). For each r, the algorithm only needs to consider the two code points ξ̂_i^{r,left} ≤ ξ_i† and ξ̂_i^{r,right} ≥ ξ_i† with rate at most r that enclose the optimum ξ_i† := F(μ_i(x)) and are closest to it; these two code points can be easily computed in constant time. The iteration terminates as soon as the maximally possible remaining increase in log q(z_i|x) = log g(ξ_i) is smaller than the minimum penalty for an increasing bitlength (in practice, the iteration rarely exceeds r ≈ 8).

Encoding. After finding the optimal code points (ξ̂_i*)_{i=1}^K, they have to be encoded into a single bitstring. Simply concatenating the binary representations (Eq. 4) of all ξ̂_i* would be ambiguous due to their variable lengths R(ξ̂_i*) (see detailed discussion in the Supplementary Material). Instead, we treat the code points as symbols from a discrete vocabulary and encode them via lossless entropy coding, e.g., Arithmetic Coding. The entropy coder requires a probabilistic model over all code points; here we simply use their empirical distribution. When using our method for model compression, this empirical distribution has to be transmitted to the receiver as additional header information that counts towards the total bitrate. For data compression, by contrast, we obtain the empirical distribution of code points on training data and include it in the decoder.

Discussion. The proposed algorithm adjusts the accuracy for each latent variable z_i based on two factors: (i) a global rate setting λ that is shared across all dimensions i; and (ii) a per-dimension posterior uncertainty estimate σ_i(x). Point (i) allows tuning the rate-distortion trade-off, whereas (ii) takes the anisotropy of the latent space into account.
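For concreteness, here is a self-contained Python sketch of the per-dimension optimization (our illustration, not the authors' implementation). It assumes a standard normal prior so that F and F^{-1} are the normal CDF and quantile function, and it uses a slightly simplified termination test derived directly from Eq. 9; the helper names `rate` and `quantize_dim` are ours.

```python
# Sketch of Algorithm 1 for one dimension (our illustration). Assumes a standard
# normal prior; any prior CDF / quantile function could be passed instead.
from math import inf
from scipy.stats import norm

def rate(xi_hat: float) -> int:
    """Number of binary digits in xi_hat = (0.b1...br)_2, i.e. R(xi_hat)."""
    r = 0
    while xi_hat != round(xi_hat * 2**r) / 2**r:
        r += 1
    return r

def quantize_dim(mu, sigma2, lam, F=norm.cdf, F_inv=norm.ppf, max_r=32):
    """Minimize Eq. 9: (F^{-1}(xi_hat) - mu)^2 + 2*lam*sigma2*R(xi_hat)."""
    def loss(xi_hat):
        return (F_inv(xi_hat) - mu) ** 2 + 2.0 * lam * sigma2 * rate(xi_hat)

    xi_dag = F(mu)                                   # quantile of the posterior mode
    best, best_loss = None, inf
    for r in range(1, max_r + 1):
        left = (2**r * xi_dag) // 1 / 2**r           # closest code points with at
        right = -((-(2**r) * xi_dag) // 1) / 2**r    # most r bits enclosing xi_dag
        for cand in (left, right):
            if 0.0 < cand < 1.0 and loss(cand) < best_loss:
                best, best_loss = cand, loss(cand)
        # Stop once the largest possible remaining distortion gain can no longer
        # pay for the smallest additional rate penalty (simplified termination test).
        if best is not None and (F_inv(best) - mu) ** 2 < 2.0 * lam * sigma2 * (r + 1 - rate(best)):
            break
    return best

xi_hat = quantize_dim(mu=1.3, sigma2=0.05, lam=0.1)   # returns a short dyadic quantile
```

As in the paper's termination criterion, a large posterior variance makes the rate penalty dominate early, so uncertain dimensions receive coarse, cheap code points.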

Figure 2 illustrates the effect of anisotropy in latent space. The right panel plots the posterior of a toy Bayesian linear regression model y = ax + b (see left panel) with only two latent variables z ≡ (a, b). Due to the elongated shape of the posterior, VBQ uses a higher accuracy for a than for b. As a result, the algorithm finds a quantization ẑ (orange dot in the right panel) that is closer to the optimal (MAP) solution (green dot) along the a-axis than along the b-axis.

The purple dot in Figure 2 (right) compares to a more common quantization method, which simply rounds the MAP solution to the nearest point (which is then entropy coded) from a fixed grid with spacing δ > 0. We tuned δ so that the resulting quantized model parameters (purple dot) have the same distance to the optimum as our proposed solution (orange dot). Despite the equal distance to the optimum, VBQ finds model parameters with higher posterior probability. The resulting model fits the data better (left panel).

This concludes the description of the proposed Variational Bayesian Quantization algorithm. In the next section, we analyze the algorithm's behaviour experimentally and demonstrate its performance for variable-bitrate compression on both word embeddings and images.

4. Experiments

We tested our approach on two very different domains: word embeddings and images. For word embeddings, we measured the performance drop on a semantic reasoning task due to lossy compression. Our proposed VBQ method significantly improves model performance over uniform discretization and compression with either Arithmetic Coding (AC), gzip, bzip2, or lzma at equal bitrate. For image compression, we show that a single standard VAE, compressed with VBQ, outperforms JPEG and other baselines at a wide range of bitrates, both quantitatively and visually.

4.1. Compressing Word Embeddings

We consider the Bayesian Skip-gram model for neural word embeddings (Barkan, 2017), a probabilistic generative formulation of word2vec (Mikolov et al., 2013b) which interprets word and context embedding vectors as latent variables and associates them with Gaussian approximate posterior distributions. Point estimating the latent variables would result in classical word2vec. Even though the model was not specifically designed or trained with model compression taken into consideration, the proposed algorithm can successfully compress it in post-processing.

Experiment Setup. We implemented the Black Box VI version of the Bayesian Skip-gram model proposed in (Bamler & Mandt, 2017),⁴ and trained the model on books published between 1980 and 2008 from the Google Books corpus (Michel et al., 2011), following the preprocessing described in (Bamler & Mandt, 2017) with a vocabulary of V = 100,000 words and embedding dimension d = 100. In the trained model, we observed that the distribution of posterior modes μ_{w,j} across all words w and all dimensions j of the embedding space was quite different from the prior. To improve the bitrate of our method, we used an "empirical prior" for encoding that is shared across all w and j; we chose a Gaussian N(0, σ_0^2) where σ_0^2 is the empirical variance of all variational means (μ_{w,j})_{w=1,...,V; j=1,...,d} (see the sketch at the end of this subsection).

We compare our method's performance to a baseline that quantizes to a uniform grid and then uses the empirical distribution of quantized coordinates for lossless entropy coding. We also compare to uniform quantization baselines that replace the entropy coding step with the standard compression libraries gzip, bzip2, and lzma. These methods are not restricted by a factorized distribution of code points and could therefore detect and exploit correlations between quantized code points across words or dimensions.

We evaluate performance on the semantic and syntactic reasoning task proposed in (Mikolov et al., 2013a), a popular dataset of semantic relations like "Japan : yen = Russia : ruble" and syntactic relations like "amazing : amazingly = lucky : luckily", where the goal is to predict the last word given the first three words. We report Hits@10, i.e., the fraction of challenges for which the compressed model ranks the correct prediction among the top ten.

Figure 3. Performance of compressed word embeddings on a standard semantic and syntactic reasoning task (Mikolov et al., 2013a). VBQ (orange, proposed) leads to much smaller file sizes at equal model performance over a wide range of performances. (Curves: uncompressed model; VBQ (proposed); uniform quant. + AC; uniform quant. + lzma; uniform quant. + bzip2; uniform quant. + gzip. Axes: bits per latent dimension vs. Hits@10, higher is better.)

Results. Figure 3 shows the model performance on the semantic and syntactic reasoning tasks as a function of compression rate. Our proposed VBQ significantly outperforms all baselines and reaches the same Hits@10 at less than half the bitrate over a wide range.⁵

⁴ See Supplementary Material for hyperparameters. Our code is available at https://github.com/mandt-lab/vbq.
⁵ The uncompressed model performance (dotted gray line in Figure 3) is not state of the art. This is not a shortcoming of the compression method but merely of the model, and can be attributed to the smaller vocabulary and training set used compared to (Mikolov et al., 2013b) due to hardware constraints.
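The following is a minimal sketch of that empirical prior (our illustration, not the released code; `mu` is a hypothetical stand-in for the trained model's V × d matrix of variational means). Its CDF and quantile function then replace F and F^{-1} in the quantization step.

```python
# Sketch of the "empirical prior" used for encoding the word embeddings.
import numpy as np
from scipy.stats import norm

def empirical_prior(mu):
    """Zero-mean Gaussian whose scale is the empirical spread of all means."""
    sigma0 = np.std(mu)
    return norm(loc=0.0, scale=sigma0)

# The prior's CDF / quantile function plays the role of F / F^{-1} in VBQ and is
# shared across all words w and dimensions j, e.g. (hypothetical usage with the
# quantize_dim sketch from Section 3.3):
#   prior = empirical_prior(mu)
#   xi_hat = quantize_dim(mu[w, j], sigma2[w, j], lam, F=prior.cdf, F_inv=prior.ppf)
```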

4.2. Image Compression

While Section 4.1 demonstrated the proposed VBQ method for model compression, we now apply the same method to data compression using a variational autoencoder (VAE). We first provide qualitative insight on small-scale images, and then quantitative results on full resolution color images.

Model. For simplicity, we consider regular VAEs with a standard normal prior and Gaussian variational posterior. The generative network parameterizes a factorized categorical or Gaussian likelihood model in the experiments of Sec. 4.2.1 or 4.2.2, respectively. Network architectures are described below and in more detail in the Supplementary Material.

Baselines. We consider the following baselines:

• Uniform quantization: for a given image x, we quantize each dimension of the posterior mean vector μ(x) to a uniform grid. We report the bitrate for encoding the resulting quantized latent representation via standard entropy coding (e.g., arithmetic coding). Entropy coding requires prior knowledge of the probabilities of each grid point; here, we use the empirical frequencies of grid points over a subset of the training set;

• k-means quantization: similar to "uniform quantization", but with the placement of grid points optimized via k-means clustering on a subset of the training data;

• Quantization with generalized Lloyd algorithm: similar to above, but the grid points are optimized using the generalized Lloyd algorithm (Chou et al., 1989), a widely-used state-of-the-art classical quantization method;

• JPEG: we used the libjpeg implementation packaged with the Python Pillow library, using default configurations (e.g., 4:2:0 subsampling), and we adjust the quality parameter to vary the rate-distortion trade-off;

• Deep learning baseline: we compare to Ballé et al. (2017), who directly optimized for rate and distortion, training a separate model for each point on the R-D curve. In our large-scale experiment, we adopted their model architecture, so their performance essentially represents the end-to-end optimized performance upper bound for our method (which uses a single model).

4.2.1. QUALITATIVE ANALYSIS ON TOY DATASETS

We trained a VAE on the MNIST dataset and the Frey Faces dataset, using 5- and 4-dimensional latent spaces, respectively. See the Supplementary Material for experimental details.

Figure 4 shows example image reconstructions from our VBQ algorithm with increasing λ, and thus decreasing bitrate. The right-most column is the extreme case λ → ∞, resulting in the shortest possible bitstring encoding ξ̂_i = (0.1)_2 = 1/2 (i.e., ẑ_i being the median of the prior p(z_i)) for every dimension i. As the bitrate decreases (as R(ξ̂) → 0), our method gradually "confuses" the original image with a generic image (roughly in the center of the embedding space), while preserving approximately the same level of sharpness. This is in contrast to JPEG, which typically introduces blocky and/or pixel-level artifacts at lower bitrates.

Figure 4. Qualitative behavior of our proposed VBQ algorithm on two data sets of small-scale images (MNIST and Frey Faces). With decreasing bitrate (left to right: decreasing bitrate, increasing distortion), the method starts to confuse the encoded object with a generic one (encoded by the median of the prior p(z)).

4.2.2. FULL-RESOLUTION COLOR IMAGES

We apply our VBQ method to a VAE trained on color images, and obtain practical image compression performance rivaling JPEG, while outperforming baselines that ignore posterior uncertainty and directly quantize latent variables.

Model and Dataset. The inference and generative networks of our VAE are identical to the analysis and synthesis networks of Ballé et al. (2017), using 3 layers of 256 filters each in a convolutional architecture. We used a diagonal Gaussian likelihood model, whose mean is computed by the generative net and whose variance σ^2 is fixed as a hyper-parameter, similar to a β-VAE (Higgins et al., 2017) approach (σ^2 was tuned to 0.001 to ensure the VAE achieved an overall good R-D trade-off; see (Alemi et al., 2018)). We trained the model on the same subset of the ImageNet dataset as used in (Ballé et al., 2017). We evaluated performance on the standard Kodak (Kodak) dataset, a separate set of 24 uncompressed color images. As in the word embedding experiment, we also observed that using an empirical prior for our method improved the bitrate; for this, we used the flexible density model of Ballé et al. (2018), fitting a different distribution per latent channel, on samples of posterior means μ_i (treating spatial dimensions as i.i.d.).
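To make the role of the fixed likelihood variance concrete, here is a small sketch (ours, not the paper's code) of the distortion term implied by a diagonal Gaussian likelihood with fixed σ²: it is a scaled squared error, so σ² effectively plays the role of the β weight in a β-VAE.

```python
# Sketch (our illustration): with the likelihood N(x; x_mean, sigma2 * I) and
# sigma2 fixed (0.001 in our experiments), the reconstruction term of the VAE
# objective is a scaled squared error, so sigma2 acts like a beta-VAE weight.
import numpy as np

def neg_log_likelihood(x, x_mean, sigma2=0.001):
    """-log N(x; x_mean, sigma2 * I), dropping the constant that involves log sigma2."""
    return 0.5 * np.sum((x - x_mean) ** 2) / sigma2
```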

(a) Original; (b) JPEG, MS-SSIM = 0.813; (c) VBQ, MS-SSIM = 0.933; (d) Uniform grid, MS-SSIM = 0.723; (e) Ballé et al., MS-SSIM = 0.958

Figure 5. Image reconstructions at matching bitrate (0.24 bits per pixel). VBQ (c; proposed) outperforms AC with uniform quantization (d) and JPEG (b), and is comparable to the approach by (Ballé et al., 2017) (e), despite using a model that is not optimized for this specific bitrate. Uniform quantization here used a modified version of the VAE in Figure 6, using an additional conv layer with smaller dimensions to reduce the bitrate down to 0.24 (this was not possible in the original model even with the largest possible grid spacing).

Figure 6. Aggregate rate-distortion performance on the Kodak dataset (higher is better). VBQ (blue, proposed) outperforms JPEG for all tested bitrates with a single model. By contrast, (Ballé et al., 2017) (black squares) relies on individually optimized models for each bitrate that all have to be included in the decoder. (Axes: Average Bits per Pixel (BPP) vs. Average PSNR (RGB) and Average MS-SSIM (RGB); curves: VBQ (proposed), Uniform Quantization, k-means Quantization, Generalized Lloyd Quant., JPEG, Ballé et al. (2017).)

Per-channel statistics shown in the panels of Figure 7 (each panel also plots the prior p(z_i), the aggregate posterior q(z_i), and the density of the posterior means μ_i):

Channel | E_x[D_KL(q(z_i|x) || p(z_i))] | E_x[num. bits] (proposed) | E_x[num. bits] (baseline)
   1    |             28.8              |           1.07            |           0.54
   2    |             0.09              |           0.02            |           0.18
   3    |             0.11              |           0.02            |           0.10
   4    |             82.4              |           1.98            |           0.97
   5    |             0.06              |           0.02            |           0.11
   6    |             300               |           4.42            |           1.01
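The per-channel KL statistic tabulated above has a closed form for the Gaussian posteriors and standard normal prior used here; the sketch below (our illustration) shows how it would be computed, with collapsed channels giving values near zero.

```python
# Sketch (our illustration): the per-channel statistic E_x[D_KL(q(z_i|x) || p(z_i))]
# reported in Figure 7, for a diagonal Gaussian posterior and standard normal prior.
import numpy as np

def gaussian_kl_to_std_normal(mu, sigma2):
    """D_KL(N(mu, sigma2) || N(0, 1)) in closed form."""
    return 0.5 * (mu**2 + sigma2 - np.log(sigma2) - 1.0)

# mu, sigma2: hypothetical arrays of shape (batch, H, W) for one latent channel
# channel_kl = gaussian_kl_to_std_normal(mu, sigma2).mean()
```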

Figure 7. Variational posteriors and the encoding cost for the first 6 latent channels of an image-compression VAE trained with a high β setting. "Baseline" refers to uniform quantization. The proposed VBQ method wastes fewer bits on channels that exhibit posterior collapse (channels 2, 3, and 5) than the baseline method. It instead spends more bits on channels without posterior collapse.

Results. As common in image compression work, we measure distortion between original and compressed images under two quality metrics (the higher the better): Peak Signal-to-Noise Ratio (PSNR) and MS-SSIM (Wang et al., 2003), over all RGB channels. Figure 6 shows rate-distortion performance. We averaged both the bits per pixel (BPP) and the quality measure over all images in the Kodak dataset for each λ (we got similar results by averaging only over the quality metrics for fixed bitrates via interpolation). We found that our method generally produced images with higher quality, both in terms of PSNR and perceptual quality, compared to JPEG and uniform quantization. Similar to (Ballé et al., 2017), our method avoids unpleasant artifacts, and introduces blurriness at low bitrate. See Figure 5 for example image reconstructions. For more examples and R-D curves on individual images, see the Supplementary Material.

Although our results fall short of the end-to-end optimized rate-distortion performance of Ballé et al. (2017), it is worth emphasizing that our method operates anywhere on the R-D curve with a single trained VAE model, unlike Ballé et al. (2017), which requires costly optimization and storage of individual models for each point on the R-D curve. On the other hand, as with any quantization method, the reconstruction quality of VBQ is upper-bounded by that of the full-precision latents; e.g., evaluating on uncompressed latents gives a PSNR upper bound of about 38.9 in Figure 6.

Indifference to Posterior Collapse. A known issue in deep generative models such as VAEs is posterior collapse, where the model ignores some subset of latent variables, whose variational posterior distributions collapse to closely match the prior. Such collapsed dimensions constitute an overhead in conventional neural compression approaches, which is often dealt with by a pruning step. One curious consequence of our approach is that it automatically spends close to zero bits encoding the collapsed latent dimensions.

As an illustration, we trained a VAE as used in the color image compression experiment with a high β setting to purposefully induce posterior collapse, and examine the average number of bits spent on various latent channels. Figure 7 shows the prior p(z_i), the aggregated (approximate) posterior q(z_i) := E_x[q(z_i|x)], and histograms of posterior means μ_i(x) for the first six channels of the VAE; all the quantities were averaged over an image batch and across latent spatial dimensions. We observe that channels 2, 3, and 5 appear to exhibit posterior collapse, as the aggregated posteriors closely match the prior while the posterior means tightly cluster at zero; this is also reflected by the low average KL divergence between the variational posterior q(z_i|x) and prior p(z_i), see the text inside each panel. We observe that, for these collapsed channels, our method spends fewer bits on average than uniform quantization (baseline) at the same total bitrate, and more bits instead on channels 1, 4, and 6, which do not exhibit posterior collapse. The explanation is that a collapsed posterior has an unusually high variance σ_i^2(x), causing our method to refrain from long code words due to the high penalty ∝ σ_i^2(x) per bitrate R(ξ̂_i) in Eq. 9.

5. Conclusions

We proposed a novel algorithm for discretizing latent representations in trained latent variable models, with applications to both data and model compression. Our proposed "Variational Bayesian Quantization" (VBQ) algorithm automatically adapts encoding accuracy to posterior uncertainty estimates, and is applicable to arbitrary probabilistic generative models with factorized prior and mean-field variational posterior distributions. As our approach separates quantization from model design and training, it enables "plug-&-play" compression at variable bitrate with pretrained models.

We showed the empirical effectiveness of our proposed VBQ method for both model and data compression. For model compression, VBQ retained significantly higher task performance of a trained word embedding model than other methods that compress in post-processing. For data compression, we showed that VBQ can outperform JPEG over a wide range of bitrates with a single trained standard VAE. Given its versatility, we believe that VBQ holds promise for compressing Bayesian neural networks, especially in applications that demand rate-distortion trade-offs. Lastly, as VBQ relies on accurate posterior approximation, its rate-distortion performance provides a new metric for quantitative evaluation of approximate Bayesian inference methods.

6. Theoretical Considerations

Here, we provide additional theoretical insights into the proposed VBQ method based on reviewer feedback.

Dense Quantization Grid. Section 3.3 describes the proposed VBQ algorithm in terms of quantiles ξ_i. While quantiles simplify the discussion, it is also instructive to consider what effectively happens directly in representation space. In the space of representations z, VBQ optimizes over a dense quantization grid, shown in Figure 8 (left) for a single dimension z_i. Each code point ξ̂_i ∈ (0, 1), i.e., each quantile with a finite-length binary representation, defines a grid point ẑ_i = F^{-1}(ξ̂_i). The height of each orange bar in the figure shows the bitlength R(ẑ_i). Interestingly, the grid places many points with small R(ẑ_i) in regions of high prior probability density (purple curve), while regions of low prior probability contain only grid points with large R(ẑ_i). This observation can be formalized as follows.

Figure 8. Dense quantization grid of VBQ (left) and a subset of it (center) that resembles the more common uniform grid (right).

Theorem 1. For a latent dimension z_i with prior p(z_i) and for any interval I ⊂ R, there exists a grid point ẑ_i ∈ I whose bitlength is bounded by the information content of I, i.e., R(ẑ_i) ≤ −log_2 P(I) with P(I) = ∫_I p(z_i) dz_i.

Proof. R(ẑ_i) is the number of nontrivial bits in the quantile ξ̂_i := F(ẑ_i). For example, the quantiles 1/4 = (0.01)_2, 1/2 = (0.1)_2, and 3/4 = (0.11)_2 each have at most one nontrivial bit (since the initial "0." and the terminal "1" are considered trivial), and they are equally spaced with spacing 1/4 = 2^{-2}. Incrementing the number of bits by one divides the spacing in half. Thus, for any r ∈ N, the quantiles with at most r nontrivial bits are equally spaced with spacing 2^{-r-1}. The prior CDF F maps any interval I ⊂ R to an interval F(I) ⊂ (0, 1) of size |F(I)| = P(I). Since P(I) > 2^{-r-1} for r = ⌊−log_2 P(I)⌋, the interval contains at least one grid point ẑ_i with R(ẑ_i) ≤ r.

Delimitation Overhead. The above theorem provides an upper bound on the bitrate for a single coordinate z_i. Encoding a high dimensional latent representation z requires some overhead for delimiting the individual coordinates. We provide a theoretical upper bound on this overhead for a severely restricted variant of VBQ that does not have access to posterior uncertainty estimates and that operates only on a subset of the proposed grid. Specifically, we pick only one grid point for each interval I_n = [n − 1/2, n + 1/2) with n ∈ Z. According to Theorem 1, there always exists a grid point ẑ_i ∈ I_n with R(ẑ_i) ≤ −log_2 P(I_n). Thus, the resulting subset of the grid (Figure 8, center) resembles the more commonly used uniform grid (Figure 8, right), whose bitrate under standard entropy coding is the information content.

If we restrict VBQ to this sparse subset of the dense grid, then the algorithm collapses to a method known as Shannon-Fano-Elias coding (Cover & Thomas, 2012), whose overhead over the information-theoretically minimal bitrate is known to be at most one bit per dimension. The full VBQ algorithm has more freedom: it exploits posterior uncertainty estimates to reduce accuracy where it is not needed, thus saving bits and improving compression performance.
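Below is a small numerical illustration of Theorem 1 (ours, not from the paper), assuming a standard normal prior. Bitlengths follow the theorem's "nontrivial bits" convention, in which the terminal 1 of the binary expansion does not count, so R(1/2) = 0 and R(1/4) = R(3/4) = 1.

```python
# Numerical illustration of Theorem 1 (our sketch, assuming a standard normal prior):
# enumerate the dense grid up to 12 binary digits and check the bound on an interval.
from math import log2
from scipy.stats import norm

def dense_grid(r_max):
    """Grid points z_hat = F^{-1}(k / 2^r) mapped to their bitlength R(z_hat)."""
    bitlength = {}
    for r in range(1, r_max + 1):
        for k in range(1, 2**r):
            xi = k / 2**r
            bitlength.setdefault(xi, r - 1)   # first r at which xi appears, minus the trivial bit
    return {norm.ppf(xi): R for xi, R in bitlength.items()}

grid = dense_grid(r_max=12)
a, b = 1.7, 1.9                               # an arbitrary interval I = [a, b)
P = norm.cdf(b) - norm.cdf(a)                 # P(I)
best = min(R for z, R in grid.items() if a <= z < b)
assert best <= -log2(P)                       # Theorem 1: some grid point in I is cheap enough
print(best, -log2(P))                         # e.g. 4 <= ~5.98
```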

Acknowledgements

We thank Johannes Ballé for helping us reproduce the results in prior work (Ballé et al., 2017). Yibo Yang acknowledges funding from the Hasso Plattner Foundation. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0021. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Furthermore, this work was supported by the National Science Foundation under Grants NSF-1928718, NSF-2003237, and by Qualcomm.

References

Agustsson, E., Mentzer, F., Tschannen, M., Cavigelli, L., Timofte, R., Benini, L., and Gool, L. V. Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, 2017.

Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. In International Conference on Machine Learning, 2018.

Ballé, J., Laparra, V., and Simoncelli, E. P. End-to-end optimized image compression. International Conference on Learning Representations, 2017.

Ballé, J., Minnen, D., Singh, S., Hwang, S. J., and Johnston, N. Variational image compression with a scale hyperprior. In International Conference on Learning Representations, 2018.

Bamler, R. and Mandt, S. Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning, pp. 380–389, 2017.

Barkan, O. Bayesian neural word embedding. In Association for the Advancement of Artificial Intelligence, pp. 3135–3143, 2017.

Berger, T. Optimum quantizers and permutation codes. IEEE Transactions on Information Theory, 18:759–765, 1972.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

Chou, P. A., Lookabaugh, T. D., and Gray, R. M. Entropy-constrained vector quantization. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37:31–42, 1989.

Cover, T. and Thomas, J. Elements of Information Theory. Wiley, 2012. ISBN 9781118585771. URL https://books.google.com/books?id=VWq5GG6ycxMC.

Gallager, R. G. Information Theory and Reliable Communication, volume 2. Springer, 1968.

Gersho, A. and Gray, R. M. Vector Quantization and Signal Compression, volume 159. Springer Science & Business Media, 2012.

Gregor, K., Besse, F., Rezende, D. J., Danihelka, I., and Wierstra, D. Towards conceptual compression, 2016.

Habibian, A., Rozendaal, T. v., Tomczak, J. M., and Cohen, T. S. Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042, 2019.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations, 2017.

Honkela, A. and Valpola, H. Variational learning and bits-back coding: an information-theoretic view to Bayesian learning. IEEE Transactions on Neural Networks, 15(4):800–810, 2004.

Huffman, D. A. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.

Johnston, N., Vincent, D., Minnen, D., Covell, M., Singh, S., Chinen, T., Jin Hwang, S., Shor, J., and Toderici, G. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393, 2018.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014a.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, pp. 1–9, 2014b.

Kodak, E. The Kodak PhotoCD dataset. https://www.cns.nyu.edu/~lcv/iclr2017/. Accessed: 2020-01-09.

Lloyd, S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

Lombardo, S., Han, J., Schroers, C., and Mandt, S. Deep generative video compression. In Advances in Neural Information Processing Systems, pp. 9283–9294, 2019.

MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

MacKay, D. J. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., and Van Gool, L. Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402, 2018.

Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., et al. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013b.

Mnih, A. and Salakhutdinov, R. R. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pp. 1257–1264, 2008.

Ranganath, R., Gerrish, S., and Blei, D. M. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, pp. 814–822, 2014.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.

Rippel, O. and Bourdev, L. Real-time adaptive image compression, 2017.

Theis, L., Shi, W., Cunningham, A., and Huszár, F. Lossy image compression with compressive autoencoders. International Conference on Learning Representations, 2017.

Toderici, G., O'Malley, S. M., Hwang, S. J., Vincent, D., Minnen, D., Baluja, S., Covell, M., and Sukthankar, R. Variable rate image compression with recurrent neural networks. International Conference on Learning Representations, 2016.

Toderici, G., Vincent, D., Johnston, N., Hwang, S. J., Minnen, D., Shor, J., and Covell, M. Full resolution image compression with recurrent neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5435–5443, 2017.

Townsend, J., Bird, T., and Barber, D. Practical lossless compression with latent variables using bits back coding. arXiv preprint arXiv:1901.04866, 2019.

Wang, Z., Simoncelli, E. P., and Bovik, A. C. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, volume 2, pp. 1398–1402. IEEE, 2003.

Witten, I. H., Neal, R. M., and Cleary, J. G. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987.

Yang, Y., Bamler, R., and Mandt, S. Improving inference for neural image compression. arXiv preprint arXiv:2006.04240, 2020.

Zhang, C., Butepage, J., Kjellstrom, H., and Mandt, S. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

Supplementary Material to "Variational Bayesian Quantization"

Yibo Yang * 1 Robert Bamler * 1 Stephan Mandt 1

This document provides details of the proposed compression method (Section S1), model and experiment details (Section S2), and additional examples of compressed images (Section S3).

S1. Delimitation Overhead

We elaborate on the "Encoding" paragraph of Section 3.3 of the main paper. After finding a quantized code point ξ̂_i for each dimension i ∈ {1, ..., K} of the latent space, these code points have to be losslessly encoded into a single bitstring for transmission or storage. We experimented with two encoding schemes, described in Subsections S1.1 and S1.2 below. Subsection S1.3 provides further analysis.

S1.1. Encoding via Concatenation

We first describe an encoding scheme that we did not end up using, but that makes it easier to understand the objective function of VBQ (Eq. 8 of the main text). This encoding scheme concatenates the binary representations of ξ̂_i (Eq. 4 of the main text) for all i ∈ {1, ..., K} into a single bitstring. As each dimension i contributes R(ξ̂_i) bits to the concatenated bitstring, this encoding scheme justifies the rate penalty term λR(ξ̂_i) in Eq. 8 of the main text.

One also has to transmit the rates R(ξ̂_i) (in compressed form using traditional entropy coding) so that the decoder can split the concatenated bitstring at the correct positions. While this incurs some overhead, the variable-bitlength representation of ξ̂_i also saves one bit per dimension i because the last bit in the binary representation of each ξ̂_i does not need to be transmitted as it is always equal to one (otherwise, the optimization algorithm in VBQ would favor an equivalent shorter binary representation of ξ̂_i).

S1.2. Encoding via Standard Entropy Coding

The actual encoding scheme we ended up using does not deal with the binary representation of each ξ̂_i explicitly. Instead, we treat each ξ̂_i as a discrete symbol and directly encode the sequence (ξ̂_i)_{i=1}^K of symbols via entropy coding (e.g., standard arithmetic coding). The entropy coder needs a model of the probability p(ξ̂_i) of each symbol. For model compression, we use the empirical frequencies, which we transmit as extra header information that counts towards the total bitrate. For data compression, we estimate the frequencies on training data and include them in the decoder.

Transmitting the empirical frequencies leads to a negligible overhead in the word embeddings experiment. Only a few hundred code points (depending on λ) had nonzero frequencies, so that the compressed file size was dominated by the encoding of K = V·d = 10^7 quantized latent variables.

S1.3. Justification of the Rate Penalty Term λR(ξ̂_i)

All experimental results are reported with the encoding scheme of Section S1.2, as it leads to slightly lower bitrates in practice. A peculiarity of this encoding scheme is that it ignores the length R(ξ̂_i) of the binary representation of each ξ̂_i. For a sequence of symbols ξ̂ ≡ (ξ̂_i)_{i=1}^K with an i.i.d. entropy model p(ξ̂_i), an optimal entropy coder (such as arithmetic coding) achieves the total bitrate R(ξ̂) = ⌈h(ξ̂)⌉ with the information content

    h(ξ̂) = Σ_{i=1}^K h(ξ̂_i) = − Σ_{i=1}^K log_2 p(ξ̂_i).    (S1)

In particular, Eq. S1 does not depend on R(ξ̂_i). This poses the question whether the rate penalty term λR(ξ̂_i) in the VBQ objective (Eq. 8 of the main text) is justified. Ideally, the algorithm would minimize λh(ξ̂_i) = −λ log_2 p(ξ̂_i) instead, but this quantity is unknown until the quantizations (ξ̂_i)_{i=1}^K and therefore the empirical frequencies p(ξ̂_i) are obtained. Our experiments suggest that R(ξ̂_i) is a useful proxy for the eventual value of h(ξ̂_i).
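The comparison behind Figure S1 can be sketched as follows (our illustration, not the released code): compute each code point's empirical frequency, its information content h = −log_2 p (Eq. S1), and its rate estimate R (number of binary digits).

```python
# Sketch (our illustration) of the comparison behind Figure S1: entropy-code the
# chosen code points with an i.i.d. model given by their empirical frequencies
# (Section S1.2) and compare information content h with the rate estimate R.
from collections import Counter
from math import log2

def binary_digits(xi_hat: float) -> int:
    r = 0
    while xi_hat != round(xi_hat * 2**r) / 2**r:
        r += 1
    return r

def rate_vs_information(code_points):
    counts = Counter(code_points)
    total = len(code_points)
    info = {c: -log2(n / total) for c, n in counts.items()}   # h(xi_hat_i), Eq. S1
    est = {c: binary_digits(c) for c in counts}               # R(xi_hat_i)
    total_bits = sum(info[c] for c in code_points)            # ideal entropy-coded size
    return info, est, total_bits

# hypothetical toy sequence of quantized quantiles
info, est, total_bits = rate_vs_information([0.5, 0.5, 0.25, 0.5, 0.75, 0.875, 0.5])
```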

h 20 S2.1. Word Embeddings The word embeddings experiment involved only minimal 10 hyperparameter tuning, and we only optimized for perfor- mance of the uncompressed model since the goal of the experiment was to test the proposed compression method on a model that was not tuned for compression. We trained information content 0 5 4 0 2 4 6 8 10 for 10 iterations with minibatches of 10 randomly drawn words and contexts due to hardware constraints. We tried rate estimate (ξˆ ) R i learning rates 0.1 and 1 and chose 0.1.

S2.1. Word Embeddings

The word embeddings experiment involved only minimal hyperparameter tuning, and we only optimized for performance of the uncompressed model, since the goal of the experiment was to test the proposed compression method on a model that was not tuned for compression. We trained for $10^4$ iterations with minibatches of 10 randomly drawn words and contexts due to hardware constraints. We tried learning rates 0.1 and 1 and chose 0.1.

S2.2. Experiments on Images

As mentioned in the main text, we used regular VAEs in the image experiments, with standard normal prior $p(z)$ and factorized normal posterior $q(z\,|\,x)$ with diagonal covariance.

S2.3. MNIST

The VAE's inference network has two convolutional layers followed by a fully connected layer. The two conv layers use 32 and 64 filters, respectively, with kernel size 3, stride 2, and ReLU activation. The fully connected layer has output dimension 10, so that $\mu$ and $\sigma^2$ of $q(z\,|\,x)$ each have dimension 5.

The generative network mirrors the inference network in reverse: it starts with a dense layer that maps the 5-dimensional latent variable to 1568 dimensions, treated as 32-channel 7x7 activations, followed by two deconvolutional layers with 64 and 32 filters (with the same padding and stride as the convolutional layers). The output is deconvolved with a single 3x3 filter with sigmoid activation. For each pixel, the (scalar) output of the last layer parameterizes the likelihood of the pixel being white.

We trained the network on binarized MNIST images for 100 epochs, using the Adam optimizer with learning rate $10^{-4}$.
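For concreteness, the following PyTorch sketch is one possible rendering of the MNIST architecture described above. It is an illustration under our own assumptions (in particular, the padding and output-padding choices and the use of PyTorch itself), not the code used in our experiments.

```python
import torch
import torch.nn as nn

class MnistVAE(nn.Module):
    """Sketch of the MNIST VAE described above; padding choices are assumed."""

    def __init__(self, latent_dim: int = 5):
        super().__init__()
        # Inference network: two stride-2 convolutions (32 and 64 filters,
        # kernel size 3, ReLU), then a dense layer with 2 * latent_dim outputs.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 2 * latent_dim),  # first half: mu, second half: log sigma^2
        )
        # Generative network: mirror image of the inference network.
        self.decoder_dense = nn.Linear(latent_dim, 32 * 7 * 7)  # 1568 dims, viewed as 32-channel 7x7
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 64, kernel_size=3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),        # 7x7 -> 14x14
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),        # 14x14 -> 28x28
            nn.ConvTranspose2d(32, 1, kernel_size=3, padding=1), nn.Sigmoid(), # Bernoulli prob. per pixel
        )

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
        h = self.decoder_dense(z).view(-1, 32, 7, 7)
        return self.decoder(h), mu, log_var
```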

S2.4. Frey Faces

On the Frey Faces dataset, we observed poor reconstruction quality when training on binarized images with a factorized Bernoulli likelihood model; instead, we treat each pixel as an observation from a factorized categorical likelihood model with 256 possible outcomes.

The VAE's inference network has two layers. The first layer flattens the input image, converts each pixel value in $\{0, 1, \ldots, 255\}$ into a one-hot vector $\in \mathbb{R}^{256}$, and uses it to index a 128-dimensional dense vector. The second layer flattens the result of the first layer (which has dimensionality equal to $128 \times$ the number of pixels) and fully connects it to 8 hidden units. The final output is split to obtain the 4-dimensional $\mu$ and $\log \sigma^2$ of $q(z\,|\,x)$.

The generative network has two fully connected layers. The first layer uses 4 hidden units and ReLU activation; the second layer uses $256 \times$ the number of pixels hidden units and applies a 256-way softmax to compute the categorical probability of each pixel taking a value in $\{0, 1, \ldots, 255\}$.

We obtained the Frey Faces images from https://cs.nyu.edu/~roweis/data.html. We trained on a random subset of 1800 images for 800 epochs, using the Adam optimizer with learning rate $10^{-4}$. On both MNIST and Frey Faces, we varied the rate-distortion trade-off parameter $\lambda$ of Variational Bayesian Quantization between $10^{-5}$ and $10^{4}$.
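Analogously, the following PyTorch sketch is an illustrative reconstruction of the Frey Faces architecture from the description above. The use of an embedding table for the one-hot pixel lookup and the default image size of 28x20 pixels are assumptions on our part.

```python
import torch
import torch.nn as nn

class FreyFacesVAE(nn.Module):
    """Sketch of the Frey Faces VAE described above; details are assumed."""

    def __init__(self, num_pixels: int = 28 * 20, latent_dim: int = 4):
        super().__init__()
        # Inference network: an embedding table implements the "one-hot pixel
        # value indexes a 128-dimensional dense vector" layer.
        self.pixel_embedding = nn.Embedding(num_embeddings=256, embedding_dim=128)
        self.encoder_dense = nn.Linear(128 * num_pixels, 8)  # split into 4-dim mu and 4-dim log sigma^2

        # Generative network: two fully connected layers; the second outputs
        # 256 logits per pixel for the factorized categorical likelihood.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 4), nn.ReLU(),
            nn.Linear(4, 256 * num_pixels),
        )
        self.num_pixels = num_pixels

    def forward(self, x_int):
        # x_int: integer pixel values in {0, ..., 255}, shape (batch, num_pixels).
        emb = self.pixel_embedding(x_int)                        # (batch, num_pixels, 128)
        mu, log_var = self.encoder_dense(emb.flatten(1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        logits = self.decoder(z).view(-1, self.num_pixels, 256)
        # 256-way categorical distribution over each pixel's intensity.
        return torch.log_softmax(logits, dim=-1), mu, log_var
```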

S2.5. Color Image Compression

As mentioned in the main text, the VAE here uses a fully convolutional architecture with 3 layers of 256 filters each, the same as in Ballé et al. (2017); see the latter for detailed descriptions. We tuned the variance $\sigma^2$ of the likelihood model on a logarithmic grid from $10^{-4}$ to $0.1$ and set it to $0.001$. The VAE was trained on the same dataset as in Ballé et al. (2017) for 2 million steps, using Adam with learning rate $10^{-4}$.

In the image compression R-D curves, $\lambda$ ranges from $2^{-6}$ to $2^{16}$. In Figure 5 of the main text, $\lambda$ was set to 17.5 for VBQ to match the bitrate of the other methods. The uniform quantization result was obtained with 4 quantization levels, on a separately tuned model that had an additional convolutional layer with 64 channels. The additional conv layer served to reduce the latent dimensionality, as uniform quantization could not achieve bitrates lower than 0.5 bits per pixel even with only 2 grid points in the original 3-layer model.
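As a rough orientation only, the sketch below instantiates a fully convolutional encoder/decoder pair with three layers of 256 filters each, matching the layer count and width stated above. Kernel sizes, strides, and the GDN nonlinearities of Ballé et al. (2017) are not restated in this section; the generic stride-2 5x5 convolutions and ReLUs below are placeholders of our own, and the cited paper should be consulted for the actual transforms.

```python
import torch.nn as nn

# Placeholder analysis (encoder) and synthesis (decoder) transforms with three
# layers of 256 filters each. Kernel size, stride, and nonlinearity are
# assumptions; they stand in for the transforms of Balle et al. (2017).
analysis_transform = nn.Sequential(
    nn.Conv2d(3, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=5, stride=2, padding=2),
)
synthesis_transform = nn.Sequential(
    nn.ConvTranspose2d(256, 256, kernel_size=5, stride=2, padding=2, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 256, kernel_size=5, stride=2, padding=2, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 3, kernel_size=5, stride=2, padding=2, output_padding=1),
)
```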

S3. Additional Image Compression Examples

In the following, we provide detailed compression results for individual images from the Kodak dataset. For each image, we show the rate-distortion performance of various methods, followed by reconstructions using our proposed method and JPEG at equal bitrate.¹

¹ The present version of this document contains a subset of example images due to a file size limit on arXiv submissions.
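For reference, the PSNR (RGB) values reported in the figure captions below follow the standard definition $\mathrm{PSNR} = 10 \log_{10}\!\big(255^2 / \mathrm{MSE}\big)$, with the mean squared error taken over all pixels and RGB channels. The short sketch below spells this out; the function name and the random example images are ours, and MS-SSIM is omitted since it requires a dedicated implementation.

```python
import numpy as np

def psnr_rgb(original: np.ndarray, reconstruction: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio over all pixels and RGB channels, in dB."""
    mse = np.mean((original.astype(np.float64) - reconstruction.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

# Example with two random 8-bit RGB images of Kodak size (768 x 512 pixels).
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(512, 768, 3), dtype=np.uint8)
b = rng.integers(0, 256, size=(512, 768, 3), dtype=np.uint8)
print(f"PSNR: {psnr_rgb(a, b):.2f} dB")
```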

[Rate-distortion curves for this Kodak image: PSNR (RGB) and MS-SSIM (RGB) versus bits per pixel (BPP), comparing Bayesian AC (proposed), Uniform Quantization, k-means Quantization, JPEG, and Ballé et al. (2017).]

Figure S2. Proposed. bits-per-pixel: 0.27, PSNR: 23.595, MS-SSIM: 0.871

Figure S3. JPEG. bits-per-pixel: 0.27, PSNR: 22.015, MS-SSIM: 0.816

[Rate-distortion curves for this Kodak image; same methods and axes as above.]

Figure S4. Proposed. bits-per-pixel: 0.19, PSNR: 29.226, MS-SSIM: 0.882

Figure S5. JPEG. bits-per-pixel: 0.19, PSNR: 26.658, MS-SSIM: 0.747

[Rate-distortion curves for this Kodak image; same methods and axes as above.]

Figure S6. Proposed. bits-per-pixel: 0.19, PSNR: 30.664, MS-SSIM: 0.94

Figure S7. JPEG. bits-per-pixel: 0.19, PSNR: 26.201, MS-SSIM: 0.842

[Rate-distortion curves for this Kodak image; same methods and axes as above.]

Figure S8. Proposed. bits-per-pixel: 0.34, PSNR: 22.177, MS-SSIM: 0.903

Figure S9. JPEG. bits-per-pixel: 0.34, PSNR: 21.331, MS-SSIM: 0.882

[Rate-distortion curves for this Kodak image; same methods and axes as above.]

(a) Proposed. bits-per-pixel: 0.22, PSNR: 29.584, MS-SSIM: 0.935
(b) JPEG. bits-per-pixel: 0.22, PSNR: 26.543, MS-SSIM: 0.846

[Rate-distortion curves for this Kodak image; same methods and axes as above.]

Figure S11. Proposed. bits-per-pixel: 0.2, PSNR: 29.523, MS-SSIM: 0.928

Figure S12. JPEG. bits-per-pixel: 0.2, PSNR: 26.6, MS-SSIM: 0.838

[Rate-distortion curves for this Kodak image; same methods and axes as above.]

Figure S13. Proposed. bits-per-pixel: 0.21, PSNR: 28.77, MS-SSIM: 0.908

Figure S14. JPEG. bits-per-pixel: 0.21, PSNR: 26.247, MS-SSIM: 0.818

[Rate-distortion curves for this Kodak image; same methods and axes as above.]

(a) Proposed. bits-per-pixel: 0.2, PSNR: 29.006, MS-SSIM: 0.936
(b) JPEG. bits-per-pixel: 0.2, PSNR: 25.389, MS-SSIM: 0.835

[Rate-distortion curves for this Kodak image; same methods and axes as above.]

Figure S16. Proposed. bits-per-pixel: 0.2, PSNR: 31.425, MS-SSIM: 0.945

Figure S17. JPEG. bits-per-pixel: 0.2, PSNR: 26.976, MS-SSIM: 0.833

[Rate-distortion curves for this Kodak image; same methods and axes as above.]

Figure S18. Proposed. bits-per-pixel: 0.24, PSNR: 24.254, MS-SSIM: 0.882

Figure S19. JPEG. bits-per-pixel: 0.24, PSNR: 22.562, MS-SSIM: 0.818