Sinkhorn Autoencoders
Giorgio Patrini*• (UvA-Bosch Delta Lab), Rianne van den Berg*† (University of Amsterdam), Patrick Forré (University of Amsterdam), Marcello Carioni (KFU Graz), Samarth Bhargav (University of Amsterdam), Max Welling (University of Amsterdam, CIFAR), Tim Genewein‡ (Bosch Center for Artificial Intelligence), Frank Nielsen (École Polytechnique, Sony CSL)

*Equal contributions. •Now at Deeptrace. †Now at Google Brain. ‡Now at DeepMind.

Abstract

Optimal transport offers an alternative to maximum likelihood for learning generative autoencoding models. We show that minimizing the p-Wasserstein distance between the generator and the true data distribution is equivalent to the unconstrained min-min optimization of the p-Wasserstein distance between the encoder aggregated posterior and the prior in latent space, plus a reconstruction error. We also identify the role of its trade-off hyperparameter as the capacity of the generator: its Lipschitz constant. Moreover, we prove that optimizing the encoder over any class of universal approximators, such as deterministic neural networks, is enough to come arbitrarily close to the optimum. We therefore advertise this framework, which holds for any metric space and prior, as a sweet-spot of current generative autoencoding objectives.

We then introduce the Sinkhorn AutoEncoder (SAE), which approximates and minimizes the p-Wasserstein distance in latent space via backpropagation through the Sinkhorn algorithm. SAE works directly on samples, i.e. it models the aggregated posterior as an implicit distribution, with no need for a reparameterization trick for gradient estimation. SAE is thus able to work with different metric spaces and priors with minimal adaptations.

We demonstrate the flexibility of SAE on latent spaces with different geometries and priors and compare with other methods on benchmark data sets.

1 INTRODUCTION

Unsupervised learning aims at finding the underlying rules that govern a given data distribution PX. It can be approached by learning to mimic the data generation process, or by finding an adequate representation of the data. Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) belong to the former class, by learning to transform noise into an implicit distribution that matches the given one. AutoEncoders (AE) (Hinton and Salakhutdinov, 2006) are of the latter type, by learning a representation that maximizes the mutual information between the data and its reconstruction, subject to an information bottleneck. Variational AutoEncoders (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) provide both a generative model — i.e. a prior distribution PZ on the latent space with a decoder G(X|Z) that models the conditional likelihood — and an encoder Q(Z|X) approximating the posterior distribution of the generative model. Optimizing the exact marginal likelihood is intractable in latent variable models such as VAEs. Instead, one maximizes the Evidence Lower BOund (ELBO) as a surrogate. This objective trades off a reconstruction error of the input distribution PX against a regularization term that aims at minimizing the Kullback-Leibler (KL) divergence from the approximate posterior Q(Z|X) to the prior PZ.

An alternative principle for learning generative autoencoders comes from the theory of Optimal Transport (OT) (Villani, 2008), where the usual KL-divergence KL(PX, PG) is replaced by OT-cost divergences Wc(PX, PG), among which the p-Wasserstein distances Wp are proper metrics. In the papers Tolstikhin et al. (2018) and Bousquet et al. (2017) it was shown that the objective Wc(PX, PG) can be re-written as the minimization of the reconstruction error of the input PX over all probabilistic encoders Q(Z|X) constrained to the condition of matching the aggregated posterior QZ — the average (approximate) posterior E_{PX}[Q(Z|X)] — to the prior PZ in latent space. In Wasserstein AutoEncoders (WAE) (Tolstikhin et al., 2018) it was suggested, following standard optimization principles, to softly enforce that constraint via a penalization term depending on a choice of a divergence D(QZ, PZ) in latent space. For any such choice of divergence this leads to the minimization of a lower bound of the original objective, leaving the question about the status of the original objective open. Nonetheless, WAE empirically improves upon VAE for the two choices made there, namely either a Maximum Mean Discrepancy (MMD) (Gretton et al., 2007; Sriperumbudur et al., 2010, 2011) or an adversarial loss (GAN), again both in latent space.
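For concreteness, an MMD penalty of the kind used in WAE-MMD can be estimated from a minibatch of encoded points and an equally sized batch of prior samples. The sketch below is our own illustration (not the authors' code), assuming an RBF kernel with a hand-picked bandwidth:

```python
import torch

def mmd_penalty(z_q: torch.Tensor, z_p: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Unbiased MMD^2 estimate between encoded latents z_q (samples of Q_Z)
    and prior samples z_p (samples of P_Z), with an RBF kernel of bandwidth sigma.
    Shapes: z_q is (n, d), z_p is (m, d), with n, m >= 2."""
    def rbf(a, b):
        d2 = torch.cdist(a, b) ** 2          # pairwise squared Euclidean distances
        return torch.exp(-d2 / (2 * sigma ** 2))

    k_qq, k_pp, k_qp = rbf(z_q, z_q), rbf(z_p, z_p), rbf(z_q, z_p)
    n, m = z_q.shape[0], z_p.shape[0]
    # Drop diagonal terms in the within-sample averages (U-statistic estimate).
    return ((k_qq.sum() - k_qq.diagonal().sum()) / (n * (n - 1))
            + (k_pp.sum() - k_pp.diagonal().sum()) / (m * (m - 1))
            - 2 * k_qp.mean())
```

In a WAE-style objective this quantity plays the role of D(QZ, PZ); the kernel family and bandwidth are implementation choices that vary across works.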
We contribute to the formal analysis of autoencoders with OT. First, using the Monge-Kantorovich equivalence (Villani, 2008), we show that (in non-degenerate cases) the objective Wc(PX, PG) can be reduced to the minimization of the reconstruction error of PX over any class containing the class of all deterministic encoders Q(Z|X), again constrained to QZ = PZ. Second, when restricted to the p-Wasserstein distance Wp(PX, PG), and by using a combination of the triangle inequality and a form of data processing inequality for the generator G, we show that the soft and unconstrained minimization of the reconstruction error of PX together with the penalization term γ·Wp(QZ, PZ) is actually an upper bound to the original objective Wp(PX, PG), where the regularization/trade-off hyperparameter γ needs to match at least the capacity of the generator G, i.e. its Lipschitz constant. This suggests that using a p-Wasserstein metric Wp(QZ, PZ) in latent space in the WAE setting (Tolstikhin et al., 2018) is a preferable choice.

Third, we show that the minimum of that objective can be approximated from above by any class of universal approximators for Q(Z|X) to arbitrarily small error. In case we choose the Lp-norms ||·||_p and the corresponding p-Wasserstein distances Wp, one can use the results of (Hornik, 1991) to show that any class of probabilistic encoders Q(Z|X) that contains the class of deterministic neural networks has all those desired properties. This justifies the use of such classes in practice. Note that analogous results are unknown for other divergences and function classes.

Fourth, as a corollary we get the folklore claim that matching the aggregated posterior QZ and the prior PZ is a necessary condition for learning the true data distribution PX in rigorous mathematical terms. Any deviation will thus be punished with a poorer performance. Altogether, we have addressed and answered the open questions in (Tolstikhin et al., 2018; Bousquet et al., 2017) in detail and highlighted the sweet-spot framework for generative autoencoder models based on Optimal Transport (OT) for any metric space and any prior distribution PZ, with special emphasis on Euclidean spaces, Lp-norms and neural networks.

The theory supports practical innovations. We are now in a position to learn deterministic autoencoders Q(Z|X), G(X|Z) by minimizing a reconstruction error for PX and the p-Wasserstein distance on the latent space between samples of the aggregated posterior and the prior, Wp(Q̂Z, P̂Z). The computation of the latter is known to be difficult and costly (cp. the Hungarian algorithm (Kuhn, 1955)). A fast approximate solution is provided by the Sinkhorn algorithm (Cuturi, 2013), which uses an entropic relaxation. We follow (Frogner et al., 2015) and (Genevay et al., 2018) by exploiting the differentiability of the Sinkhorn iterations and unrolling them for backpropagation. In addition, we correct for the entropic bias of the Sinkhorn algorithm (Genevay et al., 2018; Feydy et al., 2018). Altogether, we call our method the Sinkhorn AutoEncoder (SAE).
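To make the unrolled Sinkhorn computation and the debiasing step concrete, here is a minimal log-domain sketch in PyTorch. It is our own illustration under assumed choices of the entropic regularization eps, the number of iterations, and the cost exponent p, not the paper's implementation:

```python
import math
import torch

def sinkhorn_cost(x, y, eps=0.1, n_iters=50, p=2):
    """Entropy-regularized OT cost <P, C> between uniform empirical measures on the
    rows of x (n, d) and y (m, d), with cost C_ij = ||x_i - y_j||_2^p.
    The log-domain iterations are unrolled, so gradients flow back to x and y."""
    n, m = x.shape[0], y.shape[0]
    C = torch.cdist(x, y) ** p
    log_a = torch.full((n,), -math.log(n), device=x.device)   # uniform weights, log scale
    log_b = torch.full((m,), -math.log(m), device=x.device)
    f = torch.zeros(n, device=x.device)                       # dual potentials
    g = torch.zeros(m, device=x.device)
    for _ in range(n_iters):
        # Alternating soft-min updates of the dual potentials.
        f = -eps * torch.logsumexp((g[None, :] - C) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - C) / eps + log_a[:, None], dim=0)
    # Transport plan in log space, then the transported cost.
    log_P = (f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :]
    return (log_P.exp() * C).sum()

def sinkhorn_divergence(x, y, **kw):
    """Debiased variant in the spirit of Genevay et al. (2018) and Feydy et al. (2018):
    subtracting the self-terms removes the entropic bias, so the value is close to
    zero when x and y are drawn from the same distribution."""
    return (sinkhorn_cost(x, y, **kw)
            - 0.5 * sinkhorn_cost(x, x, **kw)
            - 0.5 * sinkhorn_cost(y, y, **kw))
```

In an SAE-style training step, x would be a minibatch of encoded latents and y an equally sized sample from the prior; the debiased value (raised to 1/p) then serves as a differentiable surrogate for Wp(Q̂Z, P̂Z).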
The Sinkhorn AutoEncoder is agnostic to the analytical form of the prior, as it optimizes a sample-based cost function which is aware of the geometry of the latent space. Furthermore, as a byproduct of using deterministic networks, it models the aggregated posterior as an implicit distribution (Mohamed and Lakshminarayanan, 2016), with no need of the reparametrization trick for learning the encoder (Kingma and Welling, 2013). Therefore, with essentially no change in the algorithm, we can learn models with normally distributed priors and aggregated posteriors, as well as distributions living on manifolds such as hyperspheres (Davidson et al., 2018) and probability simplices.

In our experiments we explore how well the Sinkhorn AutoEncoder performs on the benchmark datasets MNIST and CelebA with different prior distributions PZ and geometries in latent space, e.g. the Gaussian in Euclidean space or the uniform distribution on a hypersphere. Furthermore, we compare the SAE to the VAE (Kingma and Welling, 2013), to the WAE-MMD (Tolstikhin et al., 2018), and to other methods of approximating the p-Wasserstein distance in latent space like the Hungarian algorithm (Kuhn, 1955) and the Sliced Wasserstein AutoEncoder (Kolouri et al., 2018). We also explore the idea of matching the aggregated posterior QZ to a standard Gaussian prior PZ via the fact that the 2-Wasserstein distance has a closed form for Gaussian distributions: we estimate the mean and covariance of QZ on minibatches and use the loss W2(Q̂Z, PZ) for backpropagation. Finally, we train SAE on MNIST with a probability simplex as a latent space and visualize the matching of the aggregate posterior and the prior.

2 PRINCIPLES OF WASSERSTEIN AUTOENCODING

2.1 OPTIMAL TRANSPORT

We follow Tolstikhin et al.

2.2 THE WASSERSTEIN AUTOENCODER (WAE)

Tolstikhin et al. (2018) show that if the decoder G(X|Z) is deterministic, i.e. PG = G#PZ, or in other words, if all stochasticity of the generative model is captured by PZ, then:

W_c(P_X, P_G) = \inf_{Q(Z|X):\, Q_Z = P_Z} \mathbb{E}_{X \sim P_X}\, \mathbb{E}_{Z \sim Q(Z|X)} [c(X, G(Z))].   (4)

Learning the generative model G with the WAE amounts to the objective:

\min_G \min_{Q(Z|X)} \mathbb{E}_{X \sim P_X}\, \mathbb{E}_{Z \sim Q(Z|X)} [c(X, G(Z))] + \beta \cdot D(Q_Z, P_Z),   (5)
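Read as a minibatch loss, objective (5) is a reconstruction term plus a weighted divergence between the aggregated posterior and the prior in latent space. The sketch below shows that structure only; encoder, decoder, prior_sampler and divergence are hypothetical placeholders (divergence could be the MMD or debiased Sinkhorn sketches above), and the cost c(X, G(Z)) is taken as squared L2 purely for illustration.

```python
import torch

def wae_style_loss(x, encoder, decoder, prior_sampler, divergence, beta=1.0):
    """One-minibatch estimate of a loss shaped like objective (5):
    reconstruction error plus beta times a divergence D(Q_Z, P_Z) estimated
    from encoded latents and prior samples."""
    z = encoder(x)                                   # minibatch samples of Q_Z
    x_rec = decoder(z)                               # G(z)
    recon = ((x - x_rec) ** 2).flatten(1).sum(dim=1).mean()   # squared-L2 cost c(X, G(Z))
    z_prior = prior_sampler(z.shape[0]).to(z.device)          # minibatch samples of P_Z
    return recon + beta * divergence(z, z_prior)
```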