Sinkhorn AutoEncoders

Giorgio Patrini*• (UvA Bosch Delta Lab, University of Amsterdam), Rianne van den Berg*† (University of Amsterdam), Patrick Forré (University of Amsterdam), Marcello Carioni (KFU Graz), Samarth Bhargav (University of Amsterdam), Max Welling (University of Amsterdam, CIFAR), Tim Genewein‡ (Bosch Center for Artificial Intelligence), Frank Nielsen (École Polytechnique, Sony CSL)

*Equal contributions. •Now at Deeptrace. †Now at Google Brain. ‡Now at DeepMind.

Abstract

Optimal transport offers an alternative to maximum likelihood for learning generative autoencoding models. We show that minimizing the p-Wasserstein distance between the generator and the true data distribution is equivalent to the unconstrained min-min optimization of the p-Wasserstein distance between the encoder aggregated posterior and the prior in latent space, plus a reconstruction error. We also identify the role of its trade-off hyperparameter as the capacity of the generator: its Lipschitz constant. Moreover, we prove that optimizing the encoder over any class of universal approximators, such as deterministic neural networks, is enough to come arbitrarily close to the optimum. We therefore advertise this framework, which holds for any metric space and prior, as a sweet-spot of current generative autoencoding objectives.

We then introduce the Sinkhorn AutoEncoder (SAE), which approximates and minimizes the p-Wasserstein distance in latent space via backpropagation through the Sinkhorn algorithm. SAE directly works on samples, i.e. it models the aggregated posterior as an implicit distribution, with no need for a reparameterization trick for gradient estimation. SAE is thus able to work with different metric spaces and priors with minimal adaptations. We demonstrate the flexibility of SAE on latent spaces with different geometries and priors and compare with other methods on benchmark data sets.

1 INTRODUCTION

Unsupervised learning aims at finding the underlying rules that govern a given data distribution P_X. It can be approached by learning to mimic the data generation process, or by finding an adequate representation of the data. Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) belong to the former class, by learning to transform noise into an implicit distribution that matches the given one. AutoEncoders (AE) (Hinton and Salakhutdinov, 2006) are of the latter type, by learning a representation that maximizes the mutual information between the data and its reconstruction, subject to an information bottleneck. Variational AutoEncoders (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) provide both a generative model — i.e. a prior distribution P_Z on the latent space with a decoder G(X|Z) that models the conditional likelihood — and an encoder Q(Z|X) approximating the posterior distribution of the generative model. Optimizing the exact marginal likelihood is intractable in latent variable models such as VAEs. Instead one maximizes the Evidence Lower BOund (ELBO) as a surrogate. This objective trades off a reconstruction error of the input distribution P_X and a regularization term that aims at minimizing the Kullback-Leibler (KL) divergence from the approximate posterior Q(Z|X) to the prior P_Z.

An alternative principle for learning generative autoencoders comes from the theory of Optimal Transport (OT) (Villani, 2008), where the usual KL-divergence KL(P_X, P_G) is replaced by OT-cost divergences W_c(P_X, P_G), among which the p-Wasserstein distances W_p are proper metrics. In Tolstikhin et al. (2018) and Bousquet et al. (2017) it was shown that the objective W_c(P_X, P_G) can be re-written as the minimization of the reconstruction error of the input P_X over all probabilistic encoders Q(Z|X) constrained to the condition of matching the aggregated posterior Q_Z — the average (approximate) posterior E_{P_X}[Q(Z|X)] — to the prior P_Z in the latent space.

In Wasserstein AutoEncoders (WAE) (Tolstikhin et al., 2018), it was suggested, following the standard optimization principles, to softly enforce that constraint via a penalization term depending on a choice of a divergence D(Q_Z, P_Z) in latent space. For any such choice of divergence this leads to the minimization of a lower bound of the original objective, leaving the question about the status of the original objective open. Nonetheless, WAE empirically improves upon VAE for the two choices made there, namely either a Maximum Mean Discrepancy (MMD) (Gretton et al., 2007; Sriperumbudur et al., 2010, 2011) or an adversarial loss (GAN), again both in latent space.

We contribute to the formal analysis of autoencoders with OT. First, using the Monge-Kantorovich equivalence (Villani, 2008), we show that (in non-degenerate cases) the objective W_c(P_X, P_G) can be reduced to the minimization of the reconstruction error of P_X over any class containing the class of all deterministic encoders Q(Z|X), again constrained to Q_Z = P_Z. Second, when restricted to the p-Wasserstein distance W_p(P_X, P_G), and by using a combination of the triangle inequality and a form of data processing inequality for the generator G, we show that the soft and unconstrained minimization of the reconstruction error of P_X together with the penalization term γ · W_p(Q_Z, P_Z) is actually an upper bound to the original objective W_p(P_X, P_G), where the regularization/trade-off hyperparameter γ needs to match at least the capacity of the generator G, i.e. its Lipschitz constant. This suggests that using a p-Wasserstein metric W_p(Q_Z, P_Z) in latent space in the WAE setting (Tolstikhin et al., 2018) is a preferable choice.

Third, we show that the minimum of that objective can be approximated from above by any class of universal approximators for Q(Z|X) to arbitrarily small error. In case we choose the L_p-norms ||·||_p and corresponding p-Wasserstein distances W_p, one can use the results of Hornik (1991) to show that any class of probabilistic encoders Q(Z|X) that contains the class of deterministic neural networks has all those desired properties. This justifies the use of such classes in practice. Note that analogous results for other divergences and function classes are unknown.

Fourth, as a corollary we get the folklore claim that matching the aggregated posterior Q_Z and the prior P_Z is a necessary condition for learning the true data distribution P_X in rigorous mathematical terms. Any deviation will thus be punished with a poorer performance. Altogether, we have addressed and answered the open questions in (Tolstikhin et al., 2018; Bousquet et al., 2017) in detail and highlighted the sweet-spot framework for generative models based on Optimal Transport for any metric space and any prior distribution P_Z, with special emphasis on Euclidean spaces, L_p-norms and neural networks.

The theory supports practical innovations. We are now in a position to learn deterministic autoencoders Q(Z|X), G(X|Z) by minimizing a reconstruction error for P_X and the p-Wasserstein distance on the latent space between samples of the aggregated posterior and the prior, W_p(Q̂_Z, P̂_Z). The computation of the latter is known to be difficult and costly (cp. the Hungarian algorithm (Kuhn, 1955)). A fast approximate solution is provided by the Sinkhorn algorithm (Cuturi, 2013), which uses an entropic relaxation. We follow Frogner et al. (2015) and Genevay et al. (2018) by exploiting the differentiability of the Sinkhorn iterations and unrolling them for backpropagation. In addition we correct for the entropic bias of the Sinkhorn algorithm (Genevay et al., 2018; Feydy et al., 2018). Altogether, we call our method the Sinkhorn AutoEncoder (SAE).

The Sinkhorn AutoEncoder is agnostic to the analytical form of the prior, as it optimizes a sample-based cost function which is aware of the geometry of the latent space. Furthermore, as a byproduct of using deterministic networks, it models the aggregated posterior as an implicit distribution (Mohamed and Lakshminarayanan, 2016) with no need of the reparametrization trick for learning the encoder (Kingma and Welling, 2013). Therefore, with essentially no change in the algorithm, we can learn models with normally distributed priors and aggregated posteriors, as well as distributions living on manifolds such as hyperspheres (Davidson et al., 2018) and probability simplices.
In our experiments we explore how well the Sinkhorn AutoEncoder performs on the benchmark datasets MNIST and CelebA with different prior distributions P_Z and geometries in latent space, e.g. the Gaussian in Euclidean space or the uniform distribution on a hypersphere. Furthermore, we compare the SAE to the VAE (Kingma and Welling, 2013), to the WAE-MMD (Tolstikhin et al., 2018) and to other methods of approximating the p-Wasserstein distance in latent space, like the Hungarian algorithm (Kuhn, 1955) and the Sliced Wasserstein AutoEncoder (Kolouri et al., 2018). We also explore the idea of matching the aggregated posterior Q_Z to a standard Gaussian prior P_Z via the fact that the 2-Wasserstein distance has a closed form for Gaussian distributions: we estimate the mean and covariance of Q_Z on minibatches and use the loss W_2(Q̂_Z, P_Z) for backpropagation. Finally, we train SAE on MNIST with a probability simplex as a latent space and visualize the matching of the aggregated posterior and the prior.

2 PRINCIPLES OF WASSERSTEIN AUTOENCODING

2.1 OPTIMAL TRANSPORT

We follow Tolstikhin et al. (2018) and denote with 𝒳, 𝒴, 𝒵 the sample spaces and with X, Y, Z and P_X, P_Y, P_Z the corresponding random variables and distributions. Given a map F : 𝒳 → 𝒴 we denote by F_# the push-forward map acting on a distribution P as P ∘ F^{-1}. If F(Y|X) is non-deterministic, we define the push-forward F(Y|X)_# P_X of a distribution P_X as the induced marginal of the joint distribution F(Y|X) P_X. For any measurable non-negative cost c : 𝒳 × 𝒴 → ℝ_+ ∪ {∞}, one can define the following OT-cost between distributions P_X and P_Y via:

    W_c(P_X, P_Y) = inf_{Γ ∈ Π(P_X, P_Y)} E_{(X,Y) ∼ Γ}[c(X, Y)],    (1)

where Π(P_X, P_Y) is the set of all joint distributions that have P_X and P_Y as the marginals. The elements of Π(P_X, P_Y) are called couplings from P_X to P_Y. If c(x, y) = d(x, y)^p for a metric d and p ≥ 1, then W_p := W_c^{1/p} is called the p-Wasserstein distance.

Let P_X denote the true data distribution on 𝒳. We define a latent variable model as follows: we fix a latent space 𝒵 and a prior distribution P_Z on 𝒵 and consider the conditional distribution G(X|Z) (the decoder) parameterized by a neural network G. Together they specify a generative model as G(X|Z) P_Z. The induced marginal will be denoted by P_G. Learning P_G such that it approximates the true P_X is then defined as:

    min_G W_c(P_X, P_G).    (2)

Because of the infimum over Π(P_X, P_G) inside W_c, this is intractable. To rewrite this objective we consider the posterior distribution Q(Z|X) (the encoder) and its aggregated posterior Q_Z:

    Q_Z = Q(Z|X)_# P_X = E_{X ∼ P_X} Q(Z|X),    (3)

the induced marginal of the joint Q(Z|X) P_X.
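To make the OT-cost of Eq. (1) concrete for discrete distributions, the following minimal NumPy/SciPy sketch (ours, not part of the original paper; all names are placeholders) solves the primal linear program over couplings directly. It only illustrates the definition, not the estimators used later in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost(a, b, C):
    """Exact OT-cost of Eq. (1) for discrete marginals a, b and cost matrix C.

    a: (n,) weights of P_X, b: (m,) weights of P_Y, C: (n, m) pairwise costs.
    The coupling Gamma is an (n, m) matrix with row sums a and column sums b.
    """
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0          # sum_j Gamma[i, j] = a[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                   # sum_i Gamma[i, j] = b[j]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)

# Tiny example: two point masses on the real line moved onto two others.
x, y = np.array([0.0, 1.0]), np.array([0.0, 2.0])
C = np.abs(x[:, None] - y[None, :]) ** 2          # c(x, y) = d(x, y)^2
cost, coupling = ot_cost(np.array([0.5, 0.5]), np.array([0.5, 0.5]), C)
print(cost)                                       # 0.5, hence W_2 = sqrt(0.5)
```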
2.2 THE WASSERSTEIN AUTOENCODER (WAE)

Tolstikhin et al. (2018) show that if the decoder G(X|Z) is deterministic, i.e. P_G = G_# P_Z, or in other words, if all stochasticity of the generative model is captured by P_Z, then:

    W_c(P_X, P_G) = inf_{Q(Z|X): Q_Z = P_Z} E_{X ∼ P_X} E_{Z ∼ Q(Z|X)}[c(X, G(Z))].    (4)

Learning the generative model G with the WAE amounts to the objective:

    min_G min_{Q(Z|X)} E_{X ∼ P_X} E_{Z ∼ Q(Z|X)}[c(X, G(Z))] + λ · D(Q_Z, P_Z),    (5)

where λ > 0 is a Lagrange multiplier and D is any divergence measure on probability distributions on 𝒵. The specific choice for D is left open. WAE uses either MMD (Gretton et al., 2012) or a discriminator trained adversarially for D. As discussed in Bousquet et al. (2017), Eq. 5 is a lower bound of Eq. 4 for any choice of D and any value of λ > 0. Minimizing this lower bound does not ensure a minimization of the original objective of Eq. 4.

2.3 THEORETICAL CONTRIBUTIONS

We improve upon the analysis of Tolstikhin et al. (2018) of generative autoencoders in the framework of Optimal Transport in several ways. Our contributions can be summarized by the following theorem, upon which we will comment directly after.

Theorem 2.1. Let 𝒳, 𝒵 be endowed with any metrics and p ≥ 1. Let P_X be a non-atomic distribution¹ and G(X|Z) be a deterministic generator/decoder that is γ-Lipschitz. Then we have the equality:

    W_p(P_X, P_G) = inf_{Q ∈ ℱ} { ( E_{X ∼ P_X} E_{Z ∼ Q(Z|X)}[d(X, G(Z))^p] )^{1/p} + γ · W_p(Q_Z, P_Z) },    (6)

where ℱ is any class of probabilistic encoders that at least contains a class of universal approximators. If 𝒳, 𝒵 are Euclidean spaces endowed with the L_p-norms ||·||_p, then a valid minimal choice for ℱ is the class of all deterministic neural network encoders Q (here written as a function), for which the objective reduces to:

    W_p(P_X, P_G) = inf_{Q ∈ ℱ} { ( E_{X ∼ P_X}[||X − G(Q(X))||_p^p] )^{1/p} + γ · W_p(Q_Z, P_Z) }.

¹A probability measure is non-atomic if every point in its support has zero measure. It is important to distinguish between the empirical data distribution P̂_X, which is always atomic, and the underlying true distribution P_X, which is the only one we need to assume to be non-atomic.

The proof of Theorem 2.1 can be found in Appendix A, B and C. It uses the following three arguments:

i.) It is the Monge-Kantorovich equivalence (Villani, 2008) for non-atomic P_X that allows us to restrict to deterministic encoders Q(Z|X). This is a first theoretical improvement over Eq. 4 from Tolstikhin et al. (2018).

ii.) The upper bound can be achieved by a simple triangle inequality:

    W_p(P_X, P_G) ≤ W_p(P_X, P̃_X) + W_p(P̃_X, P_G),

where P̃_X := (G ∘ Q(Z|X))_# P_X = G_# Q_Z is the reconstruction of P_X. Note that the triangle inequality is not available for other general cost functions or divergences. This might be a reason for the difficulty of getting upper bounds in such settings. On the other hand, if a divergence satisfies the triangle inequality, then one can use the same argument to arrive at new variational optimization objectives and principles.

iii.) We then prove the data processing inequality for the W_p-distance:

    W_p(G_# Q_Z, G_# P_Z) ≤ γ · W_p(Q_Z, P_Z),

with any γ ≥ ||G||_Lip, the Lipschitz constant of G. Such an inequality is available and known for several other divergences, usually with γ = 1.

Putting all three pieces together we immediately arrive at the equality (upper and lower bound) of the first part of Theorem 2.1. This insight directly suggests that using the divergence W_p(Q_Z, P_Z) in latent space with a hyperparameter γ ≥ ||G||_Lip in the WAE setting is a preferable choice. These are two further improvements over Tolstikhin et al. (2018).

iv.) For the second part of Theorem 2.1 we use the universal approximator property of neural networks (Hornik, 1991) and the compatibility of the L_p-norm ||·||_p with the p-Wasserstein distance W_p. Proving such statements for other divergences seems to require much more effort (if possible at all).

When the encoders are restricted to be neural networks of limited capacity, e.g. if their architecture is fixed, then enforcing Q_Z ≈ P_Z might not be feasible in the general case of dimensionality mismatch between 𝒳 and 𝒵 (Rubenstein et al., 2018). In fact, since the class of deterministic neural networks (of limited capacity) is much smaller than the class of deterministic measurable maps, one might consider adding noise to the output, i.e. use stochastic networks instead. Nonetheless, neural networks can approximate any measurable map up to arbitrarily small error (Hornik, 1991). Furthermore, in practice the encoder Q(Z|X) maps from the high-dimensional data space 𝒳 to the much lower-dimensional latent space 𝒵, suggesting that the task of matching distributions in the lower-dimensional latent space 𝒵 should be feasible. Also, in view of Theorem 2.1, it follows that learning deterministic autoencoders is sufficient to approach the theoretical bound; this will thus be our empirical choice.

Theorem 2.1 certifies that failing to match the aggregated posterior and the prior makes learning the data distribution impossible. Matching in latent space should be seen as being as fundamental as minimizing the reconstruction error, a fact known about the performance of VAE (Hoffman and Johnson, 2016; Higgins et al., 2017; Alemi et al., 2018; Rosca et al., 2018). This necessary condition for learning the data distribution turns out to be also sufficient, assuming that the set of encoders is expressive enough to nullify the reconstruction error.

With the help of Theorem 2.1 we arrive at the following unconstrained min-min optimization objective over deterministic decoder and encoder neural networks (Q written as a function here):

    min_G min_Q ( E_{X ∼ P_X}[||X − G(Q(X))||_p^p] )^{1/p} + γ · W_p(Q_Z, P_Z),

with γ ≥ ||G||_Lip for all occurring G.

Note that if G is a neural network with activation function g with ||g'||_∞ ≤ 1 (e.g. ReLU, sigmoid, tanh, etc.) and weight matrices (B_ℓ)_{ℓ=1,...,L}, then G is γ-Lipschitz for any γ ≥ ||B_1||_p ··· ||B_L||_p, where the latter is the product of the L_p-matrix norms (cp. Balan et al. (2017)).
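As a small illustration of the remark above, the following sketch (our code, not from the paper, assuming p = 2 so that the L_p-matrix norm is the spectral norm) computes the product of layer-wise operator norms, which upper-bounds ||G||_Lip and can therefore serve as a valid γ.

```python
import numpy as np

def lipschitz_upper_bound(weight_matrices):
    """Upper bound on ||G||_Lip for a network with 1-Lipschitz activations:
    the product of the layer-wise operator norms (here p = 2, i.e. the
    largest singular value). Any gamma >= this product is a valid choice
    for the trade-off weight in Eq. (6).
    """
    return float(np.prod([np.linalg.svd(B, compute_uv=False)[0]
                          for B in weight_matrices]))

rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 784)) * 0.02,   # toy decoder weights
          rng.normal(size=(10, 64)) * 0.1]
gamma = lipschitz_upper_bound(layers)
print(gamma)
```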
3 THE SINKHORN AUTOENCODER

3.1 ENTROPY REGULARIZED OPTIMAL TRANSPORT

Even though the theory supports the use of the p-Wasserstein distance W_p(Q_Z, P_Z) in latent space, it is notoriously hard to compute or estimate. In practice, we will need to approximate W_p(Q_Z, P_Z) via samples from Q_Z (and P_Z). The sample version W_p(Q̂_Z, P̂_Z), with P̂_Z = (1/M) Σ_{m=1}^M δ_{z_m} and Q̂_Z = (1/M) Σ_{m=1}^M δ_{z̃_m}, has an exact solution, which can be computed using the Hungarian algorithm (Kuhn, 1955) in near O(M³) time (time complexity). Furthermore, W_p(Q̂_Z, P̂_Z) will differ from W_p(Q_Z, P_Z) in size of about O(M^{-1/k}) (sample complexity), where k is the dimension of 𝒵 (Weed and Bach, 2017). Both complexity measures are unsatisfying in practice, but they can be improved via entropy regularization (Cuturi, 2013), which we will explain next.

Following Genevay et al. (2018, 2019) and Feydy et al. (2018), we define the entropy regularized OT cost with ε ≥ 0:

    S̃_{c,ε}(P_X, P_Y) := inf_{Γ ∈ Π(P_X, P_Y)} E_{(X,Y) ∼ Γ}[c(X, Y)] + ε · KL(Γ, P_X ⊗ P_Y).    (7)

This is in general not a divergence due to its entropic bias. When we remove this bias we arrive at the Sinkhorn divergence:

    S_{c,ε}(P_X, P_Y) := S̃_{c,ε}(P_X, P_Y) − (1/2) ( S̃_{c,ε}(P_X, P_X) + S̃_{c,ε}(P_Y, P_Y) ).    (8)

The Sinkhorn divergence has the following limiting behaviour:

    S_{c,ε}(P_X, P_Y) → W_c(P_X, P_Y)       as ε → 0,
    S_{c,ε}(P_X, P_Y) → MMD_{−c}(P_X, P_Y)  as ε → ∞.

This means that the Sinkhorn divergence S_{c,ε} interpolates between OT-divergences and MMDs (Gretton et al., 2012). On the one hand, for small ε it is known that S_{c,ε} deviates from the initial objective W_c by about O(ε log(1/ε)) (Genevay et al., 2019). On the other hand, if ε is big enough then S_{c,ε} will have the more favourable sample complexity of O(M^{-1/2}) of MMDs, which is independent of the dimension, as was proven in Genevay et al. (2019). Furthermore, the Sinkhorn algorithm (Cuturi, 2013), which will be explained in Section 3.3, allows for faster computation of the Sinkhorn divergence S_{c,ε}, with time complexity close to O(M²) (Altschuler et al., 2017). Therefore, if we balance ε well, we are close to our original objective and at the same time have favourable computational and statistical properties.
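The exact sample version W_p(Q̂_Z, P̂_Z) discussed at the beginning of this section can be computed in a few lines; the sketch below (ours, with placeholder names) uses SciPy's assignment solver, which performs the Hungarian-type matching that the HAE baseline in the experiments relies on.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def empirical_wasserstein(z_tilde, z, p=2):
    """Exact W_p between two equal-size empirical measures (uniform weights).

    For M-vs-M uniform samples the optimal coupling is a permutation, so the
    OT problem reduces to an assignment problem (Hungarian algorithm).
    """
    C = cdist(z_tilde, z, metric="minkowski", p=p) ** p   # c = d(x, y)^p
    rows, cols = linear_sum_assignment(C)                 # optimal permutation
    return (C[rows, cols].mean()) ** (1.0 / p)

rng = np.random.default_rng(0)
q_samples = rng.normal(size=(128, 8))      # stand-in for samples of Q_Z
p_samples = rng.normal(size=(128, 8))      # samples of the prior P_Z
print(empirical_wasserstein(q_samples, p_samples))
```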

3.2 THE SINKHORN AUTOENCODER OBJECTIVE

Guided by the theoretical insights, we can restrict the WAE framework (Tolstikhin et al., 2018) to Sinkhorn divergences with cost c̃ in latent space and c in data space to arrive at the objective:

    min_G min_{Q(Z|X)} E_{X ∼ P_X} E_{Z ∼ Q(Z|X)}[c(X, G(Z))] + λ · S_{c̃,ε}(Q_Z, P_Z),    (9)

with hyperparameters λ ≥ 0 and ε ≥ 0.

Restricting further to p-Wasserstein distances, corresponding Sinkhorn divergences and deterministic encoder/decoder neural networks, we arrive at the Sinkhorn AutoEncoder (SAE) objective:

    min_G min_Q ( E_{X ∼ P_X}[||X − G(Q(X))||_p^p] )^{1/p} + λ · S_{||·||_p^p, ε}(Q_Z, P_Z)^{1/p},    (10)

which is then, up to the ε-terms, close to the original objective. Note that for computational reasons it is sometimes convenient to remove the p-th roots again. The inequality a^{1/p} + b^{1/p} ≤ 2 (a + b)^{1/p} shows that the additional loss is small, while still minimizing an upper bound (using λ := γ^p).

3.3 THE SINKHORN ALGORITHM

Now that we have the general Sinkhorn AutoEncoder optimization objective, we need to review how the Sinkhorn divergence S_{c̃,ε}(Q_Z, P_Z) can be estimated in practice by the Sinkhorn algorithm (Cuturi, 2013) using samples.

If we take M samples each from Q_Z and P_Z, we get the corresponding empirical (discrete) distributions concentrated on M points: P̂_Z = (1/M) Σ_{m=1}^M δ_{z_m} and Q̂_Z = (1/M) Σ_{m=1}^M δ_{z̃_m}. Then the optimal coupling of the (empirical) entropy regularized OT-cost S̃_{c̃,ε}(Q̂_Z, P̂_Z) with ε ≥ 0 is given by the matrix:

    R* := argmin_{R ∈ S_M} (1/M) ⟨R, C̃⟩_F − ε · H(R),    (11)

where C̃_{ij} = c̃(z̃_i, z_j) is the matrix associated to the cost c̃, R is a doubly stochastic matrix as defined in S_M = { R ∈ ℝ_{≥0}^{M×M} | R1 = 1, R^T 1 = 1 }, ⟨·,·⟩_F denotes the Frobenius inner product, 1 is the vector of ones, and H(R) = −Σ_{i,j=1}^M R_{i,j} log R_{i,j} is the entropy of R.

Cuturi (2013) shows that the Sinkhorn Algorithm 1 (Sinkhorn, 1964) returns the ε-regularized optimum R* (see Eq. 11) in the limit L → ∞, which is also unique due to the strong convexity of the entropy. The Sinkhorn algorithm is a fixed point algorithm that is much faster than the Hungarian algorithm: it runs in nearly O(M²) time (Altschuler et al., 2017) and can be efficiently implemented with matrix multiplications; see Algorithm 1.

Algorithm 1 Sinkhorn
  Input: {z̃_i}_{i=1}^M ∼ Q_Z, {z_j}_{j=1}^M ∼ P_Z, ε, L
  ∀ i, j: C̃_{ij} = c̃(z̃_i, z_j)
  K = exp(−C̃/ε), u ← 1              # element-wise exp
  repeat until convergence, but at most L times:
    v ← 1/(K^T u)                    # element-wise division
    u ← 1/(K v)
  R* ← Diag(u) K Diag(v)             # plus rounding step
  Output: R*, C̃
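For reference, here is a direct NumPy transcription of Algorithm 1 (our sketch): it uses the plain multiplicative updates rather than the log-space variant discussed below and omits the final rounding step, so very small ε can underflow.

```python
import numpy as np

def sinkhorn(z_tilde, z, eps, L):
    """Algorithm 1 (sketch): entropic optimal coupling R* of Eq. (11) between
    two sets of M samples, with cost c~(a, b) = ||a - b||_2^2.
    """
    C = ((z_tilde[:, None, :] - z[None, :, :]) ** 2).sum(-1)   # C~_ij
    K = np.exp(-C / eps)
    u = np.ones(len(z_tilde))
    for _ in range(L):
        v = 1.0 / (K.T @ u)            # element-wise division
        u = 1.0 / (K @ v)
    R = u[:, None] * K * v[None, :]    # Diag(u) K Diag(v), ~doubly stochastic
    return R, C

rng = np.random.default_rng(0)
M = 64
z_tilde = rng.normal(size=(M, 4))      # samples of the aggregated posterior
z = rng.normal(size=(M, 4))            # samples of the prior
R, C = sinkhorn(z_tilde, z, eps=1.0, L=100)
print((R * C).sum() / M)               # biased loss (1/M) <R*, C~>_F
```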

For better differentiability properties we deviate from Eq. 8 and use the unbiased sharp Sinkhorn loss (Luise et al., 2018; Genevay et al., 2018) by dropping the entropy terms (only) in the evaluations:

    S_{c̃,ε}(Q̂_Z, P̂_Z) := (1/M) ⟨R*, C̃⟩_F − (1/(2M)) ( ⟨R*_{Q̂_Z}, C̃_{Q̂_Z}⟩_F + ⟨R*_{P̂_Z}, C̃_{P̂_Z}⟩_F ),    (12)

where the indices Q̂_Z and P̂_Z refer to Eq. 11 applied to the samples from Q_Z in both arguments and then to the samples from P_Z in both arguments, respectively.

Since this only deviates from Eq. 8 in ε-terms, we still have all the mentioned properties, e.g. that the optimum of this Sinkhorn distance approaches the optimum of the OT-cost with the stated rate (Genevay et al., 2018; Cominetti and San Martín, 1994; Weed, 2018). Furthermore, for numerical stability we use the Sinkhorn algorithm in log-space (Chizat et al., 2016; Schmitzer, 2016). In order to round the R that results from a finite number L of Sinkhorn iterations to a doubly stochastic matrix, we use the procedure described in Algorithm 2 of Altschuler et al. (2017).

The smaller the ε, the smaller the entropy and the better the approximation of the OT-cost. At the same time, a larger number of steps O(L) is needed to converge, while the rate of convergence remains linear in L (Genevay et al., 2018). Note that all Sinkhorn operations are differentiable. Therefore, when the distance is used as a cost function, we can unroll O(L) iterations and backpropagate (Genevay et al., 2018). In conclusion, we obtain a differentiable surrogate for the OT-cost between empirical distributions; the approximation arises from sampling, entropy regularization and the finite number of steps in place of convergence.

3.4 TRAINING THE SINKHORN AUTOENCODER

To train the Sinkhorn AutoEncoder with encoder Q_A and decoder G_B, with weights A and B respectively, we sample minibatches x = {x_i}_{i=1}^M from the data distribution P_X and z = {z_j}_{j=1}^M from the prior P_Z. After encoding x we then run the Sinkhorn Algorithm 1 three times (for the pairs (z̃, z), (z̃, z̃) and (z, z)) to find the optimal couplings, and then compute the unbiased Sinkhorn loss via Eq. 12. Note that the L Sinkhorn steps in Algorithm 1 are differentiable. The weights can then be updated via (auto-)differentiation through the Sinkhorn steps (together with the gradient of the reconstruction loss). One training round is summarized in Algorithm 2. Small ε and large L worsen the numerical stability of the Sinkhorn. In most experiments, both c and c̃ will be ||·||₂². Experimentally we found that the re-calculation of the three optimal couplings at each iteration is not a significant overhead.

Algorithm 2 SAE Training round
  Input: encoder weights A, decoder weights B, ε, L, λ
  Minibatch: x = {x_i}_{i=1}^M ∼ P_X, z = {z_j}_{j=1}^M ∼ P_Z
  z̃ ← Q_A(x), x̃ ← G_B(z̃)
  D = (1/M) ||x − x̃||_p^p
  S = SinkhornLoss(z̃, z, ε, L)        # 3 × Alg. 1 + Eq. 12
  Update: A, B with gradient ∇_{(A,B)}(D + λ · S)

SAE can in principle work with arbitrary priors. The only requirement coming from the Sinkhorn is the ability to generate samples. The choice should be motivated by the desired geometric properties of the latent space.
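For concreteness, here is a compact PyTorch sketch of one SAE training round in the spirit of Algorithm 2. The architectures, batch size, learning rate and hyperparameter values are placeholders of ours, the Sinkhorn is the plain (non log-space) version, and the rounding step is omitted, so this is an illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

def sinkhorn_coupling(a, b, eps=0.5, L=50):
    """Differentiable entropic coupling between two equal-size sample sets
    (plain Sinkhorn iterations, cost c~ = squared L2)."""
    C = torch.cdist(a, b, p=2) ** 2
    K = torch.exp(-C / eps)
    u = torch.ones(a.shape[0], device=a.device)
    for _ in range(L):                              # unrolled, differentiable
        v = 1.0 / (K.t() @ u)
        u = 1.0 / (K @ v)
    R = u[:, None] * K * v[None, :]
    return R, C

def sharp_sinkhorn_loss(z_tilde, z, eps=0.5, L=50):
    """Debiased 'sharp' Sinkhorn loss of Eq. (12): three couplings, with the
    entropy terms dropped in the evaluation."""
    M = z.shape[0]
    loss = 0.0
    for a, b, w in [(z_tilde, z, 1.0), (z_tilde, z_tilde, -0.5), (z, z, -0.5)]:
        R, C = sinkhorn_coupling(a, b, eps, L)
        loss = loss + w * (R * C).sum() / M
    return loss

# One training round (Algorithm 2) with placeholder networks and sizes.
d_x, d_z, M, lam = 784, 8, 64, 100.0
encoder = nn.Sequential(nn.Linear(d_x, 256), nn.ReLU(), nn.Linear(256, d_z))
decoder = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(), nn.Linear(256, d_x))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)

x = torch.rand(M, d_x)                 # minibatch from P_X (placeholder data)
z = torch.randn(M, d_z)                # minibatch from the prior P_Z
z_tilde = encoder(x)
x_tilde = decoder(z_tilde)
D = ((x - x_tilde) ** 2).sum(dim=1).mean()   # reconstruction, p = 2, roots removed
S = sharp_sinkhorn_loss(z_tilde, z)
opt.zero_grad()
(D + lam * S).backward()               # backprop through the Sinkhorn steps
opt.step()
```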
4 CLOSED FORM OF THE 2-WASSERSTEIN DISTANCE

The 2-Wasserstein distance W_2(Q_Z, P_Z) has a closed form in Euclidean space if both Q_Z and P_Z are Gaussian (Peyré and Cuturi (2018), Rem. 2.31):

    W_2²( N(μ_1, Σ_1), N(μ_2, Σ_2) ) = ||μ_1 − μ_2||₂² + tr( Σ_1 + Σ_2 − 2 (Σ_2^{1/2} Σ_1 Σ_2^{1/2})^{1/2} ),    (13)

which further simplifies if P_Z is a standard Gaussian. Even though the aggregated posterior Q_Z might not be Gaussian, we use the above formula for matching and backpropagation, by estimating μ_1 and Σ_1 on minibatches of Q_Z via the standard formulas: μ̂_1 := (1/M) Σ_{i=1}^M z̃_i and Σ̂_1 := (1/(M−1)) Σ_{i=1}^M (z̃_i − μ̂_1)(z̃_i − μ̂_1)^T. We refer to this method as W2GAE (Wasserstein Gaussian AutoEncoder). We will compare this method against SAE and other baselines, as discussed next in the related work section.
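The closed form of Eq. (13) is easy to evaluate numerically; the sketch below (ours) estimates μ̂_1 and Σ̂_1 on a toy minibatch and plugs them into the formula against a standard Gaussian prior, using SciPy's matrix square root. The actual W2GAE backpropagates through a differentiable version of this computation, which is not shown here.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared_gaussian(mu1, cov1, mu2, cov2):
    """Closed form of Eq. (13): squared 2-Wasserstein distance between the
    Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    s2 = sqrtm(cov2)
    cross = np.real(sqrtm(s2 @ cov1 @ s2))   # discard tiny imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(cov1 + cov2 - 2.0 * cross))

# W2GAE-style use: match minibatch estimates of Q_Z to a standard Gaussian prior.
rng = np.random.default_rng(0)
z_tilde = 0.7 * rng.normal(size=(256, 8)) + 0.1   # toy samples of Q_Z
mu_hat = z_tilde.mean(axis=0)
cov_hat = np.cov(z_tilde, rowvar=False)           # unbiased, 1/(M-1)
d = z_tilde.shape[1]
print(w2_squared_gaussian(mu_hat, cov_hat, np.zeros(d), np.eye(d)))
```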
5 RELATED WORK

The Gaussian prior is common in VAEs for the reason of tractability. In fact, changing the prior and/or the approximate posterior distributions requires the use of tractable densities and the appropriate reparametrization trick. A hyperspherical prior is used by Davidson et al. (2018) with improved experimental performance; the algorithm models a von Mises-Fisher posterior, with a non-trivial posterior sampling procedure and a reparametrization trick based on rejection sampling. Our implicit encoder distribution sidesteps these difficulties. Recent advances on variable reparametrization can also simplify these requirements (Figurnov et al., 2018). We are not aware of methods embedding on probability simplices, except the use of Dirichlet priors by the same Figurnov et al. (2018).

Hoffman and Johnson (2016) showed that the objective of a VAE does not force the aggregated posterior and prior to match, and that the mutual information of input and codes may be minimized instead. Just like the WAE, SAE avoids this effect by construction. Makhzani et al. (2015) and WAE improve latent matching by GAN/MMD. With the same goal, Alemi et al. (2017) and Tomczak and Welling (2017) introduce learnable priors in the form of a mixture of posteriors, which can be used in SAE as well.

The Sinkhorn (1964) algorithm gained interest after Cuturi (2013) showed its application for fast computation of Wasserstein distances. The algorithm has been applied to ranking (Adams and Zemel, 2011), domain adaptation (Courty et al., 2014), multi-label classification (Frogner et al., 2015), metric learning (Huang et al., 2016) and ecological inference (Muzellec et al., 2017). Santa Cruz et al. (2017) and Linderman et al. (2018) used it for supervised combinatorial losses. Our use of the Sinkhorn for generative modeling is akin to that of Genevay et al. (2018), which matches data and model samples with adversarial training, and to Ambrogioni et al. (2018), which matches samples from the model joint distribution and a variational joint approximation. WAE and WGAN objectives are linked respectively to primal and dual formulations of OT (Tolstikhin et al., 2018). Our approach for training the encoder alone qualifies as self-supervised representation learning (Donahue et al., 2017; Noroozi and Favaro, 2016; Noroozi et al., 2017). As in noise-as-target (NAT) (Bojanowski and Joulin, 2017), and in contrast to most other methods, we can sample pseudo-labels (from the prior) independently from the input. In Appendix D we show a formal connection with NAT.

Another way of estimating the 2-Wasserstein distance in Euclidean space is the Sliced Wasserstein AutoEncoder (SWAE) (Kolouri et al., 2018). The main idea is to sample one-dimensional lines in Euclidean space and exploit the explicit form of the 2-Wasserstein distance in terms of cumulative distribution functions in the one-dimensional setting. We will compare our methods to SWAE as well.

6 EXPERIMENTS

Figure 1: a) Swiss Roll and its b) squared and c) spherical embeddings learned by Sinkhorn encoders. MNIST embedded onto a 10D sphere viewed through t-SNE, with classes by colours: d) encoder only or e) encoder + decoder.

6.1 REPRESENTATION LEARNING WITH SINKHORN ENCODERS

We demonstrate qualitatively that the Sinkhorn distance is a valid objective for unsupervised representation learning by training the encoder in isolation. The task consists of embedding the input distribution in a lower dimensional space, while preserving the local data geometry and minimizing the loss function L = (1/M) ⟨R*, C̃⟩_F, with c̃(z, z') = ||z − z'||₂². Here M is the minibatch size.

We display the representation of a 3D Swiss Roll and of MNIST. For the Swiss Roll we set ε = 10^{-3}, while for MNIST it is set to 0.5, and L is picked to ensure convergence. For the Swiss Roll (Figure 1a) we use a 50-50 fully connected network with ReLUs.

Figures 1b and 1c show that the local geometry of the Swiss Roll is conserved in the new representational spaces — a square and a sphere. Figure 1d shows the t-SNE visualization (Maaten and Hinton, 2008) of the learned representation of the MNIST test set. With neither labels nor reconstruction error, we learn an embedding that is aware of class-wise clusters. Minimization of the Sinkhorn distance achieves this by encoding onto a d-dimensional hypersphere with a uniform prior, such that points are encouraged to map far apart. A contractive force is present due to the inductive prior of neural networks, which are known to be Lipschitz functions. On the one hand, points in the latent space disperse in order to fill up the sphere; on the other hand, points close in image space cannot be mapped too far from each other. As a result, local distances are conserved while the overall distribution is spread. When the encoder is combined with a decoder G, the contractive force is enlarged: they collaborate in learning a latent space which makes reconstruction possible despite finite capacity; see Figure 1e.

6.2 AUTOENCODING EXPERIMENTS

For the autoencoding task we compare SAE against (β-)VAE, HVAE, SWAE and WAE-MMD. We furthermore denote by HAE the model that matches the samples in latent space with the Hungarian algorithm. Where compatible, all methods are evaluated both on the hypersphere and with a standard normal prior. Results from our proposed W2GAE method, as discussed in Section 4 for Gaussian priors, are shown as well. We compute FID scores (Heusel et al., 2017) on CelebA and MNIST. For MNIST we use LeNet as proposed in (Bińkowski et al., 2018). For details on the experimental setup, see Appendix E. The results for MNIST and CelebA are shown in Table 1. Extrapolations, interpolations and samples of WAE and SAE for CelebA are shown in Fig. 2. Visualizations for MNIST are shown in Appendix D. Interpolations on the hypersphere are defined on geodesics connecting points on the hypersphere. FID scores of SAE with a hyperspherical prior are on par with or better than the competing methods. Note that although the FID scores for the VAE are slightly better than those of SAE/HAE, the reconstruction error of the VAE is significantly higher. Surprisingly, the simple W2GAE method is on par with WAE on CelebA.

Table 1: Results of the autoencoding task. Top 3 results for the FID scores are indicated with boldface numbers. We compute MMD in latent space to evaluate the matching between the aggregated posterior and prior. MMD results are reported times 10². Note that MMD scores are not comparable for different priors. For SAE and the Gaussian prior, we used ε = 10 as lower values led to numerical instabilities. For the hypersphere we set ε = 0.1. *The value of λ = 2000 is similar to the value λ = 100 as used in (Tolstikhin et al., 2018), as a prefactor of 0.05 was used there for the reconstruction cost. †Comparing with Davidson et al. (2018) in high-dimensional latent spaces turned out to be unfeasible, due to CPU-based computations.

| method       | prior | cost      | MNIST λ | MNIST MMD | MNIST RE | MNIST FID | CelebA λ | CelebA MMD | CelebA RE | CelebA FID |
|--------------|-------|-----------|---------|-----------|----------|-----------|----------|------------|-----------|------------|
| VAE          | N     | KL        | 1       | 0.28      | 12.22    | 11.4      | 1        | 0.20       | 94.19     | 55         |
| β-VAE        | N     | KL        | 0.1     | 2.20      | 11.76    | 50.0      | 0.1      | 0.21       | 67.80     | 65         |
| WAE          | N     | MMD       | 100     | 0.50      | 7.07     | 24.4      | 2000*    | 0.21       | 65.45     | 58         |
| SWAE         | N     | SW        | 100     | 0.32      | 7.46     | 18.8      | 100      | 0.21       | 65.28     | 64         |
| W2GAE (ours) | N     | W2        | 1       | 0.67      | 7.04     | 30.5      | 1        | 0.20       | 65.55     | 58         |
| HAE (ours)   | N     | Hungarian | 100     | 5.79      | 11.84    | 16.8      | 100      | 32.09      | 84.51     | 293        |
| SAE (ours)   | N     | Sinkhorn  | 100     | 5.34      | 12.81    | 17.2      | 100      | 4.82       | 90.54     | 187        |
| HVAE†        | H     | KL        | 1       | 0.25      | 12.73    | 21.5      | -        | -          | -         | -          |
| WAE          | H     | MMD       | 100     | 0.24      | 7.88     | 22.3      | 2000*    | 0.25       | 66.54     | 59         |
| SWAE         | H     | SW        | 100     | 0.24      | 7.80     | 27.6      | 100      | 0.41       | 63.64     | 80         |
| HAE (ours)   | H     | Hungarian | 100     | 0.23      | 8.69     | 12.0      | 100      | 0.26       | 63.49     | 58         |
| SAE (ours)   | H     | Sinkhorn  | 100     | 0.25      | 8.59     | 12.5      | 100      | 0.24       | 63.97     | 56         |

(N denotes a standard Gaussian prior, H a uniform prior on the hypersphere.)

Figure 2: From left to right: CelebA extrapolations, interpolations, and samples. Models from Table 1: WAE with a Gaussian prior (top) and SAE with a uniform prior on the hypersphere (bottom).

For the Gaussian prior on CelebA, both HAE and SAE perform very poorly. In Appendix F we analyzed the behaviour of the Hungarian algorithm in isolation for two sets of samples from high-dimensional Gaussian distributions. The Hungarian algorithm finds a better matching between samples from a smaller-variance Gaussian and samples from the standard normal distribution. This behaviour gets worse for higher dimensions, and also occurs for the Sinkhorn algorithm. This might be due to the fact that most probability mass of a high-dimensional isotropic Gaussian with standard deviation σ lies on a thin annulus at radius σ√d from its origin. For a finite number of samples the L2 cost function can lead to a lower matching cost for samples between two annuli of different radii. This effect leads to an encoder with a variance lower than one. When sampling from the prior after training, this yields saturated sampled images. See Appendix D for reconstructions and samples for HAE with a Gaussian prior on CelebA. Note that neither SWAE nor W2GAE suffers from this problem in our experiments, even though these methods also provide an estimate of the 2-Wasserstein distance. For W2GAE this problem does start at even higher dimensions (Appendix F).
6.3 DIRICHLET PRIORS

We further demonstrate the flexibility of SAE by using Dirichlet priors on MNIST. The prior draws samples on the probability simplex; hence we constrain the encoder by a final softmax layer. We use priors that concentrate on the vertices with the purpose of clustering the digits. A 10-dimensional Dir(1/2) prior (Figure 3a) results in an embedding qualitatively similar to the uniform sphere (1e). With a more skewed prior Dir(1/5), the latent space could be organized such that each digit is mapped to a vertex, with little mass in the center. We found that in dimension 10 this is seldom the case, as multiple vertices can be taken by the same digit to model different styles, while other digits share the same vertex. We therefore experiment with a 16-dimensional Dir(1/5), which yields more disconnected clusters (3b); the effect is evident when showing the prior and the aggregated posterior that tries to cover it (3c). Figure 3d (leftmost and rightmost columns) shows that every digit 0-9 is indeed represented on one of the 16 vertices, while some digits are present with multiple styles, e.g. the 7. The central samples in the figure are the interpolations obtained by sampling on edges connecting vertices; no real data is autoencoded. Samples from the vertices appear much crisper than other prior samples (3e), a sign of mismatch between prior and aggregated posterior on areas with lower probability mass. Finally, we could even learn the Dirichlet hyperparameter(s) with a reparametrization trick (Figurnov et al., 2018) and let the data inform the model on the best prior.

7 CONCLUSION

We introduced a generative model built on the principles of Optimal Transport. Working with empirical Wasserstein distances and deterministic networks provides us with a flexible likelihood-free framework for latent variable modeling.

Figure 3: t-SNEs of SAE latent spaces on MNIST: a) 10-dim Dir(1/2) and b) 16-dim Dir(1/5) priors. For the latter: c) aggregated posterior (red) vs. prior (blue), d) vertices interpolation and e) samples from the prior.

References

Adams, R. P. and Zemel, R. S. (2011). Ranking via Sinkhorn Propagation. arXiv preprint arXiv:1106.1925.

Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., and Murphy, K. (2018). Fixing a Broken ELBO. In ICML.

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2017). Deep variational information bottleneck. In ICLR.

Altschuler, J., Weed, J., and Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In NIPS.

Ambrogioni, L., Güçlü, U., Güçlütürk, Y., Hinne, M., van Gerven, M. A., and Maris, E. (2018). Wasserstein Variational Inference. In NIPS.

Balan, R., Singh, M., and Zou, D. (2017). Lipschitz properties for deep convolutional networks. arXiv preprint arXiv:1701.05217.

Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying MMD GANs. arXiv preprint arXiv:1801.01401.

Bojanowski, P. and Joulin, A. (2017). Unsupervised Learning by Predicting Noise. In ICML.

Bousquet, O., Gelly, S., Tolstikhin, I., Simon-Gabriel, C.-J., and Schoelkopf, B. (2017). From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642.

Chizat, L., Peyré, G., Schmitzer, B., and Vialard, F.-X. (2016). Scaling Algorithms for Unbalanced Transport Problems. arXiv preprint arXiv:1607.05816.

Cominetti, R. and San Martín, J. (1994). Asymptotic analysis of the exponential penalty trajectory in linear programming. Math. Programming, 67(2, Ser. A):169-187.

Courty, N., Flamary, R., and Tuia, D. (2014). Domain adaptation with regularized optimal transport. In KDD.

Cuturi, M. (2013). Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In NIPS.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. (2018). Hyperspherical Variational Auto-Encoders. In UAI.

Donahue, J., Krähenbühl, P., and Darrell, T. (2017). Adversarial feature learning. In ICLR.

Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-i., Trouvé, A., and Peyré, G. (2018). Interpolating between Optimal Transport and MMD using Sinkhorn Divergences. arXiv preprint arXiv:1810.08278.

Figurnov, M., Mohamed, S., and Mnih, A. (2018). Implicit Reparameterization Gradients. In NIPS.

Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. (2015). Learning with a Wasserstein Loss. In NIPS.

Genevay, A., Chizat, L., Bach, F., Cuturi, M., Peyré, G., et al. (2019). Sample Complexity of Sinkhorn Divergences. In AISTATS.

Genevay, A., Peyré, G., Cuturi, M., et al. (2018). Learning Generative Models with Sinkhorn Divergences. In AISTATS.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Nets. In NIPS.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. (2007). A kernel method for the two-sample-problem. In NIPS.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2017). β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.

Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

Hoffman, M. D. and Johnson, M. J. (2016). ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257.

Huang, G., Guo, C., Kusner, M. J., Sun, Y., Sha, F., and Weinberger, K. Q. (2016). Supervised word mover's distance. In NIPS.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kolouri, S., Martin, C. E., and Rohde, G. K. (2018). Sliced-Wasserstein Autoencoder: An Embarrassingly Simple Generative Model. arXiv preprint arXiv:1804.01947.

Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83-97.

Linderman, S. W., Mena, G. E., Cooper, H., Paninski, L., and Cunningham, J. P. (2018). Reparameterizing the Birkhoff Polytope for Variational Permutation Inference. In AISTATS.

Luise, G., Rudi, A., Pontil, M., and Ciliberto, C. (2018). Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance. In NIPS.

Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. (2015). Adversarial autoencoders. ICLR.

Mohamed, S. and Lakshminarayanan, B. (2016). Learning in implicit generative models. In ICML.

Muzellec, B., Nock, R., Patrini, G., and Nielsen, F. (2017). Tsallis Regularized Optimal Transport and Ecological Inference. In AAAI.

Noroozi, M. and Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV.

Noroozi, M., Pirsiavash, H., and Favaro, P. (2017). Representation learning by learning to count. CVPR.

Peyré, G. and Cuturi, M. (2018). Computational Optimal Transport. arXiv preprint arXiv:1803.00567.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.

Rosca, M., Lakshminarayanan, B., and Mohamed, S. (2018). Distribution Matching in Variational Inference. arXiv preprint arXiv:1802.06847.

Rubenstein, P. K., Schoelkopf, B., and Tolstikhin, I. (2018). Wasserstein Auto-Encoders: Latent Dimensionality and Random Encoders. In ICLR workshop.

Santa Cruz, R., Fernando, B., Cherian, A., and Gould, S. (2017). DeepPermNet: Visual permutation learning. In CVPR.

Schmitzer, B. (2016). Stabilized sparse scaling algorithms for entropy regularized transport problems. arXiv preprint arXiv:1610.06519.

Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist., 35.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. G. (2011). Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(Jul):2389-2410.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517-1561.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. (2018). Wasserstein Auto-Encoders. In ICLR.

Tomczak, J. M. and Welling, M. (2017). VAE with a VampPrior. In AISTATS.

Villani, C. (2008). Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg.

Weed, J. (2018). An explicit analysis of the entropic penalty in linear programming. arXiv preprint arXiv:1806.01879.

Weed, J. and Bach, F. (2017). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. In NIPS.