
Geometrically Enriched Latent Spaces

Georgios Arvanitidis (MPI for Intelligent Systems, Tübingen), Søren Hauberg (DTU Compute, Lyngby), Bernhard Schölkopf (MPI for Intelligent Systems, Tübingen)

Abstract

A common assumption in generative models is that the generator immerses the latent space into a Euclidean ambient space. Instead, we consider the ambient space to be a Riemannian manifold, which allows for encoding domain knowledge through the associated Riemannian metric. Shortest paths can then be defined accordingly in the latent space to both follow the learned manifold and respect the ambient geometry. Through careful design of the ambient metric we can ensure that shortest paths are well-behaved even for deterministic generators that otherwise would exhibit a misleading bias. Experimentally we show that our approach improves the interpretability and the functionality of learned representations, both for stochastic and deterministic generators.

Figure 1: The proposed shortest path ( ) favors the smiling class, while the standard shortest path ( ) merely minimizes the distance on the data manifold.

1 Introduction

Unsupervised representation learning has made tremendous progress with generative models such as variational autoencoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014) and generative adversarial networks (GANs) (Goodfellow et al., 2014). These, and similar, models provide a flexible and efficient parametrization of the density of observations in an ambient space X through a typically lower dimensional latent space Z.

While the latent space Z constitutes a compressed representation of the data, it is by no means unique. Like most other latent variable models, these generative models are subject to identifiability problems, such that different representations can give rise to identical densities (Bishop, 2006). This implies that straight lines in Z are not shortest paths in any meaningful sense, and therefore do not constitute natural interpolants. To overcome this issue, it has been proposed to endow the latent space with a Riemannian metric such that curve lengths are measured in the ambient observation space X (Tosi et al., 2014; Arvanitidis et al., 2018). In other words, this ensures that any smooth invertible transformation of Z does not change the distance between a pair of points, as long as the ambient path in X remains the same. This approach immediately solves the identifiability problem.

While distances in X are well-defined and give rise to an identifiable latent representation, they need not be particularly useful. We take inspiration from metric learning (Weinberger et al., 2006; Arvanitidis et al., 2016) and propose to equip the ambient observation space X with a Riemannian metric and measure curve lengths in latent space accordingly. With this approach it is straightforward to steer shortest paths in latent space to avoid low-density regions, but also to incorporate higher level semantic information. For instance, Fig. 1 shows a shortest path under an ambient metric that favors images of smiling people. In such a way, we can control, and potentially unbias, distance based methods by utilizing domain knowledge, for example in an individual fairness scenario. Hence, we get both identifiable and useful latent representations.

In summary, we consider the ambient space of a generative model as a Riemannian manifold, where the metric can be defined by the user in order to encode high-level information about the problem of interest. In such a way, the resulting shortest paths in the latent space move optimally on the data manifold, while respecting the geometry of the ambient space. This can be useful in scenarios where a domain expert wants to control the shortest paths in an interpretable way. In addition, we propose a simple method to construct diagonal metrics in the ambient space, as well as an architecture for the generator in order to extrapolate meaningfully. We show how this enables us to properly capture the geometry of the data manifold in deterministic generators, which is otherwise infeasible.

Figure 2: Examples of a tangent vector ( ) and a shortest path ( ) on an embedded M ⊂ X (left) and on an ambient X (right).

2 Applied Riemannian geometry intro

We are interested in Riemannian manifolds (do Carmo, 1992), which constitute well-defined metric spaces, where the inner product is defined only locally and changes smoothly throughout space. In a nutshell, these are smooth spaces where we can compute shortest paths, which prefer regions where the magnitude of the inner product is small. In this work, we show how to use these geometric structures in machine learning, where it is commonly assumed that data lie near a low dimensional manifold in an ambient observation space.

Definition 1. A Riemannian manifold is a smooth manifold M, equipped with a positive definite Riemannian metric M(x) ∀ x ∈ M, which changes smoothly and defines a local inner product on the tangent space T_xM at each point x ∈ M as ⟨v, u⟩_x = ⟨v, M(x)u⟩ with v, u ∈ T_xM.

A smooth manifold is a topological space which locally is homeomorphic to a Euclidean space. An intuitive way to think of a d-dimensional smooth manifold is as an embedded non-intersecting surface M in an ambient space X, for example R^D with D > d (see Fig. 2 left). In this case, the tangent space T_xM is a d-dimensional vector space tangential to M at the point x ∈ M. Hence, v_x ∈ T_xM is a vector v ∈ R^D, and the Riemannian metric is actually a map M_X : M → R^{D×D}_{≻0}. Thus, the simplest approach is to assume that X is equipped with the Euclidean metric M_X(x) = I_D and its restriction is utilized as the Riemannian metric on T_xM. Since the choice of M_X(·) has a direct impact on M, we can utilize other metrics designed to encode high-level semantic information (see Sec. 3).

Another view is to consider as smooth manifold the whole ambient space X = R^D. Hence, the tangent space T_xX = R^D is centered at x ∈ R^D and again the simplest Riemannian metric is the Euclidean M_X(x) = I_D. However, we are able to use other suitable metrics that simply change the way we measure distances in X (see Sec. 3). For instance, given a set of points in X we can construct a metric with small magnitude near the data to pull the shortest paths towards them (see Fig. 2 right).

For a d-dimensional embedded manifold M ⊂ X, a collection of chart maps φ_i : U_i ⊂ M → R^d is used to assign local intrinsic coordinates to neighborhoods U_i ⊂ M, and for simplicity, we assume that a global chart map φ(·) exists. By definition, when M is smooth the map φ(·) and its inverse φ^{-1} : φ(M) ⊂ R^d → M ⊂ X exist and are smooth. Thus, a v_x ∈ T_xM can be expressed as v_x = J_{φ^{-1}}(z) v_z, where z = φ(x) ∈ R^d and v_z ∈ R^d are the representations in the intrinsic coordinates. Also, the Jacobian J_{φ^{-1}}(z) ∈ R^{D×d} defines a basis that spans T_xM, and thus, we represent the ambient metric M_X(·) in the intrinsic coordinates as

    \langle v_x, v_x \rangle_x = \langle v_z, J_{\phi^{-1}}(z)^\top M_X(\phi^{-1}(z)) J_{\phi^{-1}}(z) v_z \rangle = \langle v_z, M(z) v_z \rangle = \langle v_z, v_z \rangle_z,    (1)

with M(z) = J_{\phi^{-1}}(z)^\top M_X(\phi^{-1}(z)) J_{\phi^{-1}}(z) ∈ R^{d×d}_{≻0} being smooth. As we discuss below, we should be able to evaluate the intrinsic M(z) in order to find length minimizing curves on M. However, when M is embedded the chart maps are usually unknown, and a global chart rarely exists. In contrast, for ambient-like manifolds the global chart is φ(x) = x, which is convenient to use in practice.

In general, one of the main utilities of a Riemannian manifold M ⊆ X is to enable us to compute shortest paths therein. Intuitively, the norm √⟨dx, dx⟩_x represents how the infinitesimal displacement vector dx ≈ x' − x on M is locally scaled. Thus, for a curve γ : [0, 1] → M that connects two points x = γ(0) and y = γ(1), the length on M, or equivalently in φ(M) using that γ(t) = φ^{-1}(c(t)) and Eq. 1, is measured as

    \mathrm{length}[\gamma(t)] = \int_0^1 \sqrt{\langle \dot{\gamma}(t), \dot{\gamma}(t) \rangle_{\gamma(t)}} \, dt = \int_0^1 \sqrt{\langle \dot{c}(t), M(c(t)) \dot{c}(t) \rangle} \, dt = \mathrm{length}[c(t)],    (2)

where γ̇(t) = ∂_t γ(t) ∈ T_{γ(t)}M is the velocity of the curve and accordingly ċ(t) ∈ T_{c(t)}φ(M). The minimizers of this functional are the shortest paths, also known as geodesics.
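To make the length functional in Eq. 2 concrete, the sketch below numerically approximates the length of a discretized latent curve under a user-supplied metric M(z). The metric function and the test curve are illustrative assumptions made only for this example, not part of the paper.

```python
import numpy as np

def curve_length(curve, metric):
    """Approximate Eq. 2 for a discretized curve c(t) in intrinsic coordinates.

    curve  : (T, d) array, points c(t_0), ..., c(t_{T-1}) along the curve.
    metric : callable z -> (d, d) positive definite matrix M(z).
    """
    length = 0.0
    for a, b in zip(curve[:-1], curve[1:]):
        delta = b - a                 # finite-difference velocity times dt
        mid = 0.5 * (a + b)           # evaluate the metric at the segment midpoint
        length += np.sqrt(delta @ metric(mid) @ delta)
    return length

# Illustrative conformal metric M(z) = (1 + ||z||^2) * I_2, chosen only to show
# the interface; any smooth SPD-valued function of z works here.
metric = lambda z: (1.0 + z @ z) * np.eye(2)

t = np.linspace(0.0, 1.0, 100)[:, None]
straight = (1 - t) * np.array([-1.0, 0.0]) + t * np.array([1.0, 0.0])
print("length of the straight line under M(z):", curve_length(straight, metric))
```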

We find these geodesics by solving a system of 2nd order nonlinear ordinary differential equations (ODEs) defined in the intrinsic coordinates. Notably, for ambient-like manifolds the trivial chart map enables us to compute the shortest paths in practice by solving this ODE system. In a certain sense, the behavior of the shortest paths is to avoid regions of high measure √|M(c(t))|. For general manifolds, the analytical solution is intractable, so we rely on approximate solutions (Hennig and Hauberg, 2014; Yang et al., 2018; Arvanitidis et al., 2019). For further information on Riemannian geometry and the ODE system see Appendix 1.

2.1 Unifying the two manifold perspectives

In all related works, the ambient space X is considered as a Euclidean space. Instead, we propose to consider X as a Riemannian manifold. This allows us to encode high-level information through the associated metric, which constitutes an interpretable way to control the shortest paths. In order to find such a path on an embedded M ⊂ X, we have to solve a system of ODEs defined in the intrinsic coordinates. However, when M is embedded the chart maps are commonly unknown. Hence, the usual trick is to utilize another manifold Z that has a trivial chart map and to represent the geometry of M therein. Thus, we find the curve in Z which corresponds to the actual shortest path on M.

In particular, assume an embedded d-dimensional manifold M ⊂ X within a Riemannian manifold X = R^D with metric M_X(·), a Euclidean space Z = R^{d_Z} called the latent space, and a smooth function g : Z → X called the generator. Since g(·) is smooth, M_Z = g(Z) ⊂ X is an immersed d_Z-dimensional smooth (sub)manifold.¹ In general, we assume that d_Z = d and also that g(·) closely approximates the true embedded manifold, M_Z ≈ M, while if d_Z < d then g(·) can only approximate a submanifold of M. Consequently, the Jacobian matrix J_g(z) ∈ R^{D×d_Z} is a basis that spans T_{x=g(z)}M_Z, and maps a tangent vector v_z ∈ T_zZ to a tangent vector v_x ∈ T_{x=g(z)}M_Z. Thus, as before, the restriction of M_X(·) on T_xM_Z induces a metric in the latent space Z as

    \langle v_x, v_x \rangle_x = \langle v_z, J_g(z)^\top M_X(g(z)) J_g(z) v_z \rangle.    (3)

¹A function g : Z → M ⊂ X is an immersion if J_g : T_zZ → T_{g(z)}M is injective (full rank) ∀ z ∈ Z. Intuitively, M is a dim(Z)-dimensional surface and intersections are allowed, while g(·) is an embedding if it is an injective function, which means that intersections are not allowed. An immersion is locally an embedding.

The induced Riemannian metric in the latent space, M(z) = J_g(z)^\top M_X(g(z)) J_g(z), is known as the pull-back metric. Essentially, it captures the intrinsic geometry of the immersed M_Z, while taking into account the geometry of X. Therefore, the space Z together with M(z) constitutes a Riemannian manifold, but since Z = R^{d_Z} the chart map and the tangent space T_zZ are trivial. Consequently, we can evaluate the metric M(z) in intrinsic coordinates, which enables us to compute shortest paths on Z by solving the ODE system. Intuitively, these paths in Z move optimally on M_Z, while simultaneously respecting the geometry of the ambient space X. Also, note that g(·) is not a chart map, and hence, it is easier to learn. Moreover, if the data manifold has a special topology, we have to choose the latent space accordingly (Davidson et al., 2018; Mathieu et al., 2019). For further discussion and the theoretical properties of g(·) see Appendix 2.
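As a minimal illustration of Eq. 3, the sketch below pulls an ambient metric M_X back through a generator g using a finite-difference Jacobian. The toy generator and the step size are assumptions made only for this example.

```python
import numpy as np

def jacobian(g, z, eps=1e-5):
    """Finite-difference Jacobian J_g(z) of shape (D, d_Z)."""
    z = np.asarray(z, dtype=float)
    g0 = g(z)
    J = np.zeros((g0.shape[0], z.shape[0]))
    for i in range(z.shape[0]):
        dz = np.zeros_like(z)
        dz[i] = eps
        J[:, i] = (g(z + dz) - g0) / eps
    return J

def pullback_metric(g, ambient_metric, z):
    """Eq. 3: M(z) = J_g(z)^T  M_X(g(z))  J_g(z)."""
    J = jacobian(g, z)
    return J.T @ ambient_metric(g(z)) @ J

# Toy generator immersing Z = R^1 into X = R^2 (an assumption for illustration).
g = lambda z: np.array([z[0], np.sin(z[0])])
euclidean = lambda x: np.eye(x.shape[0])     # M_X(x) = I_D recovers the usual pull-back metric

print(pullback_metric(g, euclidean, np.array([0.3])))  # approx. [[1 + cos(0.3)^2]]
```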

3 Data learned Riemannian manifolds

We discuss previous approaches that apply differential geometry, which largely inspire our work. Briefly, we present two related Riemannian metric learning methods, and then, we propose a simple technique to construct similar metrics in the ambient space X. This constitutes a principled and interpretable way to encode high-level information as domain knowledge in our models. Also, we present the related work where the structure of an embedded data manifold is properly captured in the latent space of stochastic generators.

3.1 Riemannian metrics in ambient space

Assume that a set of points {x_n}_{n=1}^N in X = R^D is given. The Riemannian metric learning task is to learn a positive definite metric tensor M_X : X → R^{D×D}_{≻0} that changes smoothly across the space. The actual behavior of the metric depends on the problem that we want to model. For example, if we want the shortest paths to stay on the data manifold (see Fig. 2 right), the meaningful behavior for the metric is that the measure √|M_X(·)| should be small near the data and large as we move away. Similarly, in Fig. 1 the ambient metric is designed such that its measure is small near the data points with a smiling face, and thus, the shortest paths tend to follow this semantic constraint. In such a way, we can regulate the inductive bias of a model, by including high-level information through the ambient metric and controlling the shortest paths accordingly.

One of the first approaches to learn such a Riemannian metric was presented by Hauberg et al. (2012), where M_X(x) is the convex combination of a predefined set of metrics, using a smooth weighting function. In particular, at first K metrics {M_k}_{k=1}^K ∈ R^{D×D}_{≻0} are estimated, centered at the locations {c_k}_{k=1}^K ∈ R^D. Then, we can evaluate the metric at a new point as

    M_X(x) = \sum_{k=1}^K w_k(x) M_k, \quad w_k(x) = \frac{\tilde{w}_k(x)}{\sum_{j=1}^K \tilde{w}_j(x)},    (4)

where the kernel \tilde{w}_k(x) = \exp(-\frac{\|x - c_k\|_2^2}{2\sigma^2}) with bandwidth σ ∈ R_{>0} is a smooth function, and thus, the metric is smooth as a linear combination of smooth functions. A practical example for the base metrics M_k is the local Linear Discriminant Analysis (LDA), where local metrics are learned using labeled data such that the classes are separated well locally (Hastie and Tibshirani, 1994). In a similar spirit, a related approach is the Large Margin Nearest Neighbor (LMNN) classifier (Weinberger et al., 2006). Note that the domain of metric learning provides a huge list of options that can be considered (Suárez et al., 2018).

Similarly, in an unsupervised setting Arvanitidis et al. (2016) proposed to construct the Riemannian metric in a non-parametric fashion as the inverse of the local diagonal covariance. In particular, for a given point set {x_n}_{n=1}^N, at a point x the diagonal elements of the metric M_X(·) are equal to

    M_{X_{dd}}(x) = \Big( \sum_{n=1}^N w_n(x) (x_{nd} - x_d)^2 + \varepsilon \Big)^{-1},    (5)

with w_n(x) = \exp(-\frac{\|x_n - x\|_2^2}{2\sigma^2}), where the parameter σ ∈ R_{>0} controls the curvature of the Riemannian manifold, i.e., how fast the metric changes, and ε > 0 is a small scalar to upper bound the metric. Although these are quite flexible and intuitive metrics, selecting the parameter σ is a challenging task (Arvanitidis et al., 2017), especially due to the curse of dimensionality (Bishop, 2006) and the sample size N due to the non-parametric regime.

The proposed Riemannian metrics. Inspired by the approaches described above and Peyré et al. (2010), we propose a general and simple technique to easily construct metrics in X, which allows us to encode information depending on the problem of interest. An unsupervised diagonal metric can be defined as

    M_X(x) = (\alpha \cdot h(x) + \varepsilon)^{-1} \cdot I_D,    (6)

where I_D is the identity matrix of dimension D, the function h(x) : R^D → R_{>0} behaves as h(x) → 1 when x is near the data manifold and h(x) → 0 otherwise, and α, ε ∈ R_{>0} are scaling factors to lower and upper bound the metric, respectively. One simple but very effective approach is to use a positive Radial Basis Function (RBF) network (Que and Belkin, 2016) as h(x) = w^\top φ(x) with weights w ∈ R^K_{>0} and φ_k(x) = \exp(-0.5 \cdot λ_k \cdot \|x - x_k\|_2^2) with bandwidth λ_k ∈ R_{>0}. Also, h(x) can be the probability density function of the given data. Usually, the true density function is unknown and difficult to learn; however, we can approximate it roughly by utilizing a simple model such as the Gaussian Mixture Model (GMM) (Bishop, 2006). Such a Riemannian metric pulls the shortest paths towards areas of X with high h(x) (see Fig. 2 right).

In a similar context, a supervised version can be defined where the function h(x) represents cost, while in Eq. 6 we do not use the inversion. In this way, shortest paths will tend to avoid regions of the ambient space X where the cost function is high. For instance, in Fig. 1 we can think of a cost function that is high over all the non-smiling data points. Of course, such a cost function can be learned as an independent regression or classification problem, while in other cases it can even be given by the problem or a domain expert. Further details about all the metrics above are in Appendix 3.
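A minimal sketch of the proposed diagonal metrics: the density-like variant of Eq. 6 and the supervised cost variant without the inversion. The RBF centers, bandwidths, and scaling factors below are illustrative assumptions; in the paper they would be fitted to data or provided by a domain expert.

```python
import numpy as np

def rbf_h(x, centers, lam, w):
    """Positive RBF network h(x) = w^T phi(x) with phi_k(x) = exp(-0.5 * lam_k * ||x - x_k||^2)."""
    sq_dists = np.sum((centers - x) ** 2, axis=1)
    return float(w @ np.exp(-0.5 * lam * sq_dists))

def ambient_metric_unsupervised(x, h, alpha=1.0, eps=1e-2):
    """Eq. 6: M_X(x) = (alpha * h(x) + eps)^{-1} * I_D -- small near the data, large far away."""
    return np.eye(x.shape[0]) / (alpha * h(x) + eps)

def ambient_metric_cost(x, cost, alpha=1.0, eps=1e-2):
    """Supervised variant: no inversion, so the metric is large where the cost is high."""
    return (alpha * cost(x) + eps) * np.eye(x.shape[0])

# Illustrative setup: RBF centers placed on a few data points in R^2.
centers = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])
lam = np.full(len(centers), 4.0)     # bandwidths lambda_k > 0
w = np.full(len(centers), 1.0)       # positive weights

h = lambda x: rbf_h(x, centers, lam, w)
print(ambient_metric_unsupervised(np.array([1.0, 0.5]), h))   # small entries near the data
print(ambient_metric_unsupervised(np.array([5.0, 5.0]), h))   # large entries far away
```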

3.2 Riemannian metrics in latent space

As discussed in Sec. 2.1, we can capture the geometry of the given embedded data manifold M ⊂ X by learning a smooth generator g : Z → M_Z ⊂ X such that M_Z ≈ M. Previous works have shown how to learn such a function g(·) in practice, and also the mild conditions it has to satisfy so that the induced Riemannian metric properly captures the structure of M. In the latent space Z = R^d we call the points {z_n ∈ Z | x_n = g(z_n), n = 1, ..., N} latent codes or representations.

First, Tosi et al. (2014) considered the Gaussian Process Latent Variable Model (GP-LVM) (Lawrence, 2005), where g(·) is a stochastic function defined as a multi-output Gaussian process g ∼ GP(m(z), k(z, z')). This stochastic generator induces a random Riemannian metric in Z and the expected metric is used for computing shortest paths. The advantage of such a stochastic generator is that the metric magnitude increases analogous to the uncertainty of g(·), which happens in regions of Z where there are no latent codes. Apart from this desired behavior, this metric is not very practical due to the computational cost of GPs.

Another set of approaches, known as deep generative models, parameterizes g(·) as a deep neural network (DNN). On the one hand we have the explicit density models, where the marginal likelihood can be computed, with main representatives the VAE (Kingma and Welling, 2014; Rezende et al., 2014) and the normalizing flow models (Dinh et al., 2016; Rezende and Mohamed, 2015). On the other hand we have the implicit density models for which the marginal likelihood is intractable, as is the case for the GAN (Goodfellow et al., 2014).

Recently, Arvanitidis et al. (2018) showed that we are able to properly capture the structure of the embedded data manifold M ⊂ X in the latent space Z of a VAE under the condition of having meaningful uncertainty quantification for the generative process. In particular, the standard VAE assumes a Gaussian likelihood p(x | z) = N(x | µ(z), I_D · σ²(z)) with a prior p(z) = N(0, I_d). Hence, the generator can be written as g(z) = µ(z) + diag(ε) · σ(z), where ε ∼ N(0, I_D) and µ : Z → X, σ : Z → R^D_{>0} are usually parametrized with DNNs. However, simply parametrizing σ(·) with a DNN does not directly imply meaningful uncertainty quantification, because in general the DNN extrapolates arbitrarily to regions of Z with no latent codes. Thus, the solution proposed in Arvanitidis et al. (2018) is to model the inverse variance β(z) = (σ²(z))^{-1} with a positive Radial Basis Function (RBF) network (Que and Belkin, 2016), which implies that moving further from the latent codes increases the uncertainty. Under this stochastic generator, the expected Riemannian metric in Z can be written as

    M(z) = E_{p(\varepsilon)}[J_{g_\varepsilon}(z)^\top J_{g_\varepsilon}(z)] = J_\mu(z)^\top J_\mu(z) + J_\sigma(z)^\top J_\sigma(z),    (7)

where g_ε(·) implies that ε is kept fixed ∀ z ∈ Z, which ensures a smooth mapping, and hence, differentiability (Eklund and Hauberg, 2019). Here, we observe that the metric increases when the generator becomes uncertain, due to the second term. This constitutes a desired behavior, as the metric informs us to avoid regions of Z where there are no latent codes, which directly implies that these regions do not correspond to parts of the data manifold in X. In some sense, we model the topology of M as well (Hauberg, 2018). In contrast, this is not the case when the uncertainty of g(·) is not taken into account, since then the behavior of the metric is arbitrary as we move away from the latent codes (Chen et al., 2018; Shao et al., 2017).

Consequently, deterministic generators such as the Auto-Encoder (AE) and the GAN capture poorly the structure of M in Z, because the second term in Eq. 7 does not exist. The reason is that these models are trained with likelihood p(x | z) = δ(x − g(z)) and the uncertainty is not quantified. Of course, for the AE one potential heuristic solution is to use the latent codes of the training data to fit post-hoc a meaningful variance estimator under the Gaussian likelihood and the maximum likelihood principle as θ* = argmax_θ \prod_{n=1}^N N(x_n | g(e(x_n)), I_D · σ_θ^2(e(x_n))), with encoder e : X → Z. In principle, we could follow the same procedure for the GAN by learning an encoder (Donahue et al., 2016; Dumoulin et al., 2016). However, it is still unclear if the encoder for a GAN learns meaningful representations or if the powerful generator ignores the inferred latent codes (Arora et al., 2018).

Therefore, to properly capture the structure of M in Z we mainly rely on stochastic generators with increasing uncertainty as we move further from the latent codes. Even if the RBF based approach is a meaningful way to get the desired behavior, in general, uncertainty quantification with parametric models is still considered an open problem (MacKay, 1992; Gal and Ghahramani, 2015; Lakshminarayanan et al., 2017; Detlefsen et al., 2019; Arvanitidis et al., 2018). Nevertheless, Eklund and Hauberg (2019) showed that the expected Riemannian metric in Eq. 7 is a reasonable approximation to use in practice. Obviously, when g(·) is deterministic, like the GAN, the second term in Eq. 7 disappears, since these models do not quantify the uncertainty of the generative process. This directly means that deterministic generators are not able, by construction, to properly capture the geometric structure of M in the latent space, and hence, exhibit a misleading bias (Hauberg, 2018).

4 Enriching the latent space with geometric information

Here, we unify the approaches presented in Sec. 3.1 and Sec. 3.2, in order to provide additional geometric structure in the latent space of a generative model. This is the first time that these two fundamentally different Riemannian views are combined. Their main difference is that the metric induced by g(·) merely tries to capture the intrinsic geometry of the given data manifold M, while M_X(·) allows us to directly encode high-level information in X based, for instance, on domain knowledge. Moreover, in the stochastic case we provide a relaxation for efficient computation of the expected metric, while in the deterministic case we combine a carefully designed ambient metric with a new architecture for g(·) that extrapolates meaningfully, which is one way to ensure well-behaved shortest paths that respect the structure of the data manifold M.

Stochastic generators. Assuming that an ambient metric M_X(·) is given (see Sec. 3.1), we learn a VAE with Gaussian likelihood, so the stochastic mapping is g(z) = µ(z) + diag(ε) · σ(z), while using a positive RBF for meaningful quantification of the uncertainty σ(·). Therefore, we can apply Eq. 3 to derive the new, more informative stochastic pull-back metric in the latent space Z, which is equal to

    M_\varepsilon(z) = J_{g_\varepsilon}(z)^\top M_X(\mu(z) + \mathrm{diag}(\varepsilon) \cdot \sigma(z)) \, J_{g_\varepsilon}(z), \quad \text{with } J_{g_\varepsilon}(z) = J_\mu(z) + \mathrm{diag}(\varepsilon) \cdot J_\sigma(z).    (8)

This is a random quantity and we can compute its expectation directly by sampling ε ∼ N(0, I_D), so that M(z) = E_{ε∼p(ε)}[M_ε(z)]. However, in practice this expectation dramatically increases the computational cost, especially since we need to evaluate the expected metric many times when computing a shortest path.
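To see why this expectation is costly, the sketch below forms the direct Monte Carlo estimate of E[M_ε(z)] from Eq. 8 using finite-difference Jacobians of µ and σ. The toy decoder, the number of samples, and the ambient metric are assumptions for illustration only; with M_X = I the estimate reduces to Eq. 7.

```python
import numpy as np

def jacobian(f, z, eps=1e-5):
    """Finite-difference Jacobian of f: R^d -> R^D, shape (D, d)."""
    f0 = f(z)
    J = np.zeros((f0.shape[0], z.shape[0]))
    for i in range(z.shape[0]):
        dz = np.zeros_like(z)
        dz[i] = eps
        J[:, i] = (f(z + dz) - f0) / eps
    return J

def expected_metric_mc(mu, sigma, ambient_metric, z, n_samples=100, rng=None):
    """Monte Carlo estimate of E_eps[ M_eps(z) ] from Eq. 8."""
    rng = rng or np.random.default_rng(0)
    Jmu, Jsig = jacobian(mu, z), jacobian(sigma, z)
    D = Jmu.shape[0]
    M = np.zeros((z.shape[0], z.shape[0]))
    for _ in range(n_samples):                      # every sample needs a fresh Jacobian product
        eps = rng.standard_normal(D)
        Jg = Jmu + np.diag(eps) @ Jsig              # J_{g_eps}(z) = J_mu(z) + diag(eps) J_sigma(z)
        x = mu(z) + eps * sigma(z)                  # g_eps(z), where M_X is evaluated
        M += Jg.T @ ambient_metric(x) @ Jg
    return M / n_samples

# Toy decoder from Z = R^2 to X = R^3 (illustrative assumption, not a trained VAE).
mu = lambda z: np.array([z[0], z[1], np.tanh(z[0] * z[1])])
sigma = lambda z: 0.1 + 0.5 * np.array([z @ z, z @ z, z @ z])   # grows away from the origin
euclidean = lambda x: np.eye(x.shape[0])

print(expected_metric_mc(mu, sigma, euclidean, np.array([0.5, -0.2])))
```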

For this reason, we relax the problem and consider only E_{ε∼p(ε)}[g_ε(z)] = µ(z) for the evaluation of the ambient metric M_X(·) in Eq. 8, which simplifies the corresponding expected Riemannian metric to

    M(z) \triangleq J_\mu(z)^\top M_X(\mu(z)) \, J_\mu(z) + J_\sigma(z)^\top M_X(\mu(z)) \, J_\sigma(z).    (9)

Essentially, the realistic underlying assumption is that near the latent codes σ(z) → 0, so g(z) → µ(z) and the first term dominates. But as we move to regions of Z with no codes, σ(z) ≫ 0, and for this reason we need the second term in the equation. In particular, J_σ(·) dominates when moving further from the latent codes, and hence, the behavior of M_X(·) will be less important there. Thus, we are allowed to consider this relaxation, for which the meaningful uncertainty estimation is still necessary. We further analyze and check empirically this relaxation in Appendix 4.

Figure 3: Using the samples in Z generated from the proposed q(z), we fit a mixture of LANDs and a GMM. Thus, we can capture the true underlying structure of the data manifold in the latent space of a GAN.
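The relaxation in Eq. 9 only needs the Jacobians of µ and σ and a single evaluation of the ambient metric at µ(z). A self-contained sketch, using the same illustrative decoder and finite-difference Jacobians as above (again an assumption, not a trained model):

```python
import numpy as np

def jacobian(f, z, eps=1e-5):
    f0 = f(z)
    J = np.zeros((f0.shape[0], z.shape[0]))
    for i in range(z.shape[0]):
        dz = np.zeros_like(z)
        dz[i] = eps
        J[:, i] = (f(z + dz) - f0) / eps
    return J

def relaxed_metric(mu, sigma, ambient_metric, z):
    """Eq. 9: M(z) = J_mu^T M_X(mu(z)) J_mu + J_sigma^T M_X(mu(z)) J_sigma.

    Only one ambient-metric evaluation (at mu(z)) and two Jacobians are needed,
    instead of a fresh Jacobian product per Monte Carlo sample of eps.
    """
    Jmu, Jsig = jacobian(mu, z), jacobian(sigma, z)
    MX = ambient_metric(mu(z))
    return Jmu.T @ MX @ Jmu + Jsig.T @ MX @ Jsig

# Same illustrative decoder as before.
mu = lambda z: np.array([z[0], z[1], np.tanh(z[0] * z[1])])
sigma = lambda z: 0.1 + 0.5 * np.array([z @ z, z @ z, z @ z])
euclidean = lambda x: np.eye(x.shape[0])     # with M_X = I this is exactly Eq. 7

print(relaxed_metric(mu, sigma, euclidean, np.array([0.5, -0.2])))
```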

Deterministic generators. For deterministic g(·), we propose one simple yet efficient way to ensure well-behaved shortest paths which respect the structure of the given data manifold M. First, we learn an ambient Riemannian metric M_X(·) in X that only roughly represents the structure of M, for instance, by using an RBF or a GMM (Eq. 6). Thus, this metric informs us how close the generated M_Z is to the data. Hence, we additionally need g(·) to extrapolate meaningfully. Therefore, g(·) should learn to generate the given data well from a prior p(z), but as we move further from the support of p(z), the generated M_Z should also move further from the given data in X. Consequently, since M_X(·) is designed to increase far from the given data, the induced Riemannian metric in Z properly captures the structure of the data manifold.

One of the simplest deterministic generators with this desirable behavior is probabilistic Principal Component Analysis (pPCA) (Tipping and Bishop, 1999). This basic model has a Gaussian prior p(z) = N(0, I_d) and a generator that is simply a linear map. The generator is constructed by the top d eigenvectors of the empirical covariance matrix, scaled by their eigenvalues. Inspired by this simple model, we propose for the deterministic generator the following architecture

    g(z) = f(z) + U \cdot \mathrm{diag}([\sqrt{\lambda_1}, \dots, \sqrt{\lambda_d}]) \cdot z + b,    (10)

where f : Z → X is a deep neural network, U ∈ R^{D×d} holds the top d eigenvectors with their corresponding eigenvalues λ_1, ..., λ_d computed from the empirical data covariance, and b ∈ R^D is the data mean. This interpretable model can be seen as a residual network (ResNet) (He et al., 2015). In particular, the desired behavior is that as we move further from the support of p(z), the linear part of Eq. 10 becomes the dominant one, especially when bounded activation functions are utilized for f(·). Hence, M_Z will extrapolate meaningfully as we move further from the support of the prior. However, we also need the generated M_Z to be a valid immersion. For further discussion see Appendix 2.

5 Experiments

5.1 Deterministic generators

Synthetic data. Usually, in GANs the g(·) is a continuous function, so we expect some generated points to fall off the given data support, mainly when the data lie near a disconnected M and irrespective of training optimality. This is known as the distribution mismatch problem. We generate a synthetic dataset (Fig. 4) and we train a Wasserstein GAN (WGAN) (Arjovsky et al., 2017) with a latent space Z = R² and p(z) = N(0, I_2). For the ambient metric we used a positive RBF (Eq. 6). Further details are in Appendix 5.

Then, we define a density function in Z using the induced Riemannian metric as q(z) ∝ \tilde{p}_r(z) \cdot (1 + \sqrt{|M(z)|})^{-1} (Lebanon, 2002; Le and Cuturi, 2015), where \tilde{p}_r(z) is a uniform density within a ball of radius r. The behavior of q(z) is interpretable, since the density is high wherever M(·) is small, and that happens only in the regions of Z that g(·) learns to map near the given data manifold in X. Also, \tilde{p}_r(z) simply ensures that the support of q(z) is within the region where g(·) is trained. Intuitively, we can think of q(z) as an approximate aggregated posterior, but without training an encoder. Thus, we compare samples generated from q(z) using Markov Chain Monte Carlo (MCMC) and the prior p(z). We clearly see in Fig. 4 that our samples align better with the given data manifold. Furthermore, we used the precision and recall metrics (Kynkäänniemi et al., 2019) with K=10, which shows that we generate more accurate samples (high precision), while our coverage is sufficiently good (high recall).
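A sketch of the generator architecture in Eq. 10 above, showing how the PCA-based linear part can be assembled from data. The residual network f below is an untrained, bounded-activation stand-in and the data are placeholders; both are assumptions made only for illustration.

```python
import numpy as np

def build_pca_generator(X, f, d):
    """Construct g(z) = f(z) + U diag(sqrt(lambda_1..d)) z + b  (Eq. 10).

    X : (N, D) data matrix, used for the empirical mean b and covariance eigenpairs.
    f : callable R^d -> R^D, a (bounded-activation) neural network residual term.
    """
    b = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)          # ascending order
    lam = eigval[-d:][::-1]                       # top-d eigenvalues
    U = eigvec[:, -d:][:, ::-1]                   # corresponding eigenvectors, shape (D, d)
    W = U * np.sqrt(lam)                          # equals U @ diag(sqrt(lambda)), still (D, d)

    def g(z):
        return f(z) + W @ z + b                   # linear pPCA part dominates far from the prior
    return g

# Illustrative stand-in for f: one tanh layer with random (untrained) weights.
rng = np.random.default_rng(0)
D, d, N = 5, 2, 200
A, c = rng.normal(size=(D, d)), rng.normal(size=D)
f = lambda z: np.tanh(A @ z + c)                  # bounded, so it saturates for large ||z||

X = rng.normal(size=(N, D))                       # placeholder data
g = build_pca_generator(X, f, d)
print(g(np.zeros(d)))
print(g(10.0 * np.ones(d)))                       # far from the prior the linear part dominates
```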

[Figure 4 panels, left to right: Standard measure; Our measure; The proposed q(z); Our samples (Precision: 0.92, Recall: 0.27); Prior samples (Precision: 0.66, Recall: 0.28).]

Figure 4: Our measure shows that the ambient metric helps to properly capture the data manifold structure in the latent space. Also, we generate more accurate samples using the proposed q(z) compared to the prior p(z).
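For completeness, a sketch of drawing samples such as those in Fig. 4 from q(z) ∝ p̃_r(z)·(1 + √|M(z)|)^{-1} with a random-walk Metropolis step. The latent metric, the ball radius, and the proposal scale below are illustrative assumptions; the paper does not prescribe a specific MCMC scheme.

```python
import numpy as np

def log_q(z, metric, radius):
    """Unnormalized log of q(z) ∝ p̃_r(z) * (1 + sqrt(|M(z)|))^{-1}."""
    if np.linalg.norm(z) > radius:                     # uniform ball p̃_r(z): zero outside
        return -np.inf
    _, logdet = np.linalg.slogdet(metric(z))
    return -np.log1p(np.exp(0.5 * logdet))             # -log(1 + sqrt(det M(z)))

def metropolis(metric, radius, n_steps=5000, step=0.2, seed=0):
    """Random-walk Metropolis targeting q(z) in a 2D latent space."""
    rng = np.random.default_rng(seed)
    z = np.zeros(2)
    samples, lp = [], log_q(z, metric, radius)
    for _ in range(n_steps):
        prop = z + step * rng.standard_normal(2)
        lp_prop = log_q(prop, metric, radius)
        if np.log(rng.uniform()) < lp_prop - lp:       # accept/reject
            z, lp = prop, lp_prop
        samples.append(z.copy())
    return np.array(samples)

# Illustrative latent metric: small on an annulus, large elsewhere, mimicking a
# pull-back metric that is small where g(z) lands near the data manifold.
def metric(z):
    on_manifold = np.exp(-((np.linalg.norm(z) - 1.5) ** 2) / 0.1)
    return np.eye(2) / (on_manifold + 1e-2)

samples = metropolis(metric, radius=3.0)
print(samples.mean(axis=0), samples.std(axis=0))
```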

[Figure 5 top panels, left to right: Precision: 0.93, Recall: 0.91; Precision: 0.90, Recall: 0.92; Precision: 0.88, Recall: 0.98.]

Figure 5: Top panels: Samples from q(z) using M_{X'}(·) are more accurate (left), even if the ambient metric is not used (middle), compared to the samples from p(z) (right). Bottom panels: our shortest path (top) respects the data manifold structure, avoiding off-the-manifold "shortcuts" (middle), while the linear one (bottom) is arbitrary.

Moreover, using the sampled latent representations we fit a mixture of LANDs (Fig. 3), which are adaptive normal distributions on Riemannian manifolds (Arvanitidis et al., 2016). Hence, we can sample from each component individually, without training a conditional GAN (Mirza and Osindero, 2014).

MNIST data. Similarly, we perform an experiment with the MNIST digits 0, 1, 2 and Z = R^5. This is a high dimensional dataset, so the RBF used for M_X(·) fits poorly. Also, the Euclidean distance for images is not meaningful. Thus, after training the WGAN, we use PCA to project the data linearly onto a d'-dimensional subspace with D > d' > d and d' = 10, where we construct the ambient metric M_{X'}(·). This removes the non-informative dimensions from the data, while the global structure of the data manifold is approximately preserved. The results in Fig. 5 show that, indeed, M_{X'}(·) improves the sampling since we achieve higher precision, but lower recall, while our path traverses the data manifold better. We discuss the linear projection step in Appendix 2, and provide further implementation details and results in Appendix 5.

Pre-trained generator. We use for g(·) a pre-trained Progressive GAN on the CelebA dataset (Karras et al., 2018). We also train a classifier that distinguishes the blond people. As before, we use a linear projection to d' = 1000, where we define a cost based ambient metric M_{X'}(·) using a positive RBF (Eq. 6). This metric is designed to penalize regions in X that correspond to blonds. Our goal is to avoid these regions when interpolating in Z. As discussed in Sec. 3, it is not guaranteed that this deterministic g(·) properly captures the structure of the data manifold in Z, but even so, we test our ability to control the paths. As we observe in Fig. 7, only our path that utilizes the informative M_{X'}(·) successfully avoids crossing regions with blond hair. In particular, it corresponds to the optimal path on the generated manifold, while taking into account the high-level semantic information. In contrast, the shortest path without M_{X'}(·) passes through the high cost region in X, as it merely minimizes the distance on the manifold and does not utilize the additional information. Also, we show in Fig. 6 the classifier prediction along several interpolants. For further details and interpolation results see Appendix 5.

Figure 6 (Classifier result): Classification of the paths in X over time. Clearly, we see that only our shortest paths ( ) respect the ambient geometry, while both the straight line ( ) and the naive shortest path ( ) traverse regions which are classified as blond.
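A sketch of how a cost based ambient metric could be assembled in a linearly projected space X': project with PCA, then let a classifier-style cost raise the metric in regions to avoid. The positive-RBF cost below stands in for the trained "blond" classifier, and all components are illustrative assumptions; see Appendix 5 of the paper for the actual setup.

```python
import numpy as np

def pca_projection(X, d_prime):
    """Return a function mapping R^D to the top d' principal subspace of the data."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[:d_prime].T                                  # (D, d')
    return lambda x: V.T @ (x - mean)

def cost_metric(x_proj, cost, alpha=1.0, eps=1e-2):
    """Supervised variant of Eq. 6 in X': the metric is large where the cost is high."""
    return (alpha * cost(x_proj) + eps) * np.eye(x_proj.shape[0])

# Illustrative data and cost in the projected space (stand-ins, not CelebA or a trained classifier).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
project = pca_projection(X, d_prime=3)

avoid_center = project(X[0])                            # pretend this region should be avoided
cost = lambda xp: np.exp(-0.5 * np.sum((xp - avoid_center) ** 2))

print(cost_metric(project(X[1]), cost))                 # larger near the region we want to avoid
```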

Figure 7: Our interpolant (top) successfully avoids the high cost regions on the data manifold (blond hair), since it utilizes the high-level semantic information that is encoded into the ambient metric. The standard shortest path (middle) provides a smoother interpolation than the linear case (bottom).

Figure 8: Comparing interpolants in the latent space under different ambient metrics and their generated images. We can effectively control the shortest paths to follow high-level information incorporated in M_{X'}(·).
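Section 5.2 below combines several ambient metrics (an LDA-style class metric, a local-covariance metric, and a cost metric) by simple linear combination. The toy component metrics and weights in this sketch are assumptions, meant only to show that a positively weighted combination of smooth SPD metric fields is again a valid ambient metric.

```python
import numpy as np

def combine_metrics(metrics, weights):
    """Linear combination of ambient metric fields: M_X(x) = sum_i w_i * M_i(x).

    Positive weights and positive definite components keep the result positive definite.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights > 0), "weights must be positive to preserve positive definiteness"

    def M(x):
        return sum(w * m(x) for w, m in zip(weights, metrics))
    return M

# Illustrative components in R^2 (stand-ins for the LDA, local-covariance, and cost metrics).
class_metric = lambda x: np.diag([1.0, 10.0])                        # stretches one direction
cost_metric = lambda x: (1.0 + np.exp(-np.sum(x ** 2))) * np.eye(2)  # expensive near the origin

M_X = combine_metrics([class_metric, cost_metric], weights=[0.5, 0.5])
print(M_X(np.array([0.0, 0.0])))
print(M_X(np.array([3.0, 3.0])))
```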

Figure 9: Compare means using the standard shortest path (+) and our approach (◦) that respects high-level information, i.e., it avoids the area around (×) in X'.

Figure 10: The KDE using the shortest path and the straight line as distance measure.

5.2 Stochastic Generators

Controlling Shortest Paths. In Fig. 8 we compare the effect of several interpretable ambient metrics on the shortest paths in the latent space of a VAE trained with Z = R² on the MNIST digits 0, 1, 2, 3. As before, we project the data linearly to d' = 10 to construct there the metric M_{X'}(·). Under the Euclidean metric in X the path ( ) merely follows the structure of the generated M_Z, since it avoids regions with no data. At first, we construct an LDA metric in X' (Eq. 4) by considering the digits 0, 1, 3 to be in the same class. Hence, the resulting path ( ) avoids crossing the regions in Z that correspond to digit 2, while simultaneously respecting the geometry of M_Z. Also, we select 3 data points (×) and, using their 100 nearest neighbors in X', we construct the local covariance based metric (Eq. 5), so that the path ( ) moves closer to the selected points. Moreover, we linearly combine these two metrics to enforce the path ( ) to pass through 0, 1, 3 while moving closer to the selected points (×). Finally, we include in this linear combination a cost related metric (Eq. 6) based on a positive RBF that increases near the points (×) in X'. Therefore, the resulting path ( ) avoids these neighborhoods, while respecting the other ambient metrics and the geometry of M_Z.

Compare means. In Fig. 9 we compute the mean values for three sets of points ( ), comparing the standard shortest path (+) with our approach (◦) using a cost based M_{X'}(·) relying on a positive RBF with centers (×) in X'. Our method allows us to use domain knowledge to get prototypes that respect some desired properties. In such a way, we can use generative models for applications as in the natural sciences, where domain experts provide information through the metric.

Density Estimation. We show in Fig. 10 the kernel density estimation (KDE) in Z comparing the straight line to the shortest path, under the same bandwidth. For M_{X'}(·) we linearly combine an LDA metric, where each digit is a separate class, and a cost based metric with centers (×) in X'. The resulting M(·) distinguishes the classes better due to the LDA metric, while the density is reduced near the high cost regions. Further discussion for all the experiments can be found in Appendix 5.

6 Conclusion

We considered the ambient space of generative models as a Riemannian manifold. This allows us to encode high-level information through the metric, and we proposed an easy way to construct suitable metrics. In order to capture the geometry in the latent space, proper uncertainty estimation is essential for stochastic generators, while in the deterministic case one way is through the proposed meaningful extrapolation. Thus, we get interpretable shortest paths that respect the ambient space geometry, while moving optimally on the learned manifold. In the future, it will be interesting to investigate if and how we can use the ambient space geometry during the learning phase of a generative model.

Acknowledgments

SH was supported by a research grant (15334) from VILLUM FONDEN. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no 757360).

References

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML).

Arora, S., Risteski, A., and Zhang, Y. (2018). Do GANs learn the distribution? Some theory and empirics. In International Conference on Learning Representations (ICLR).

Arvanitidis, G., Hansen, L. K., and Hauberg, S. (2016). A locally adaptive normal distribution. In Advances in Neural Information Processing Systems (NeurIPS).

Arvanitidis, G., Hansen, L. K., and Hauberg, S. (2017). Maximum likelihood estimation of Riemannian metrics from Euclidean data. In Geometric Science of Information (GSI).

Arvanitidis, G., Hansen, L. K., and Hauberg, S. (2018). Latent space oddity: on the curvature of deep generative models. In International Conference on Learning Representations (ICLR).

Arvanitidis, G., Hauberg, S., Hennig, P., and Schober, M. (2019). Fast and robust shortest paths on manifolds learned from data. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics).

Chen, N., Klushyn, A., Kurle, R., Jiang, X., Bayer, J., and Smagt, P. (2018). Metrics for Deep Generative Models. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. (2018). Hyperspherical variational auto-encoders.

Detlefsen, N. S., Jørgensen, M., and Hauberg, S. (2019). Reliable training and estimation of variance networks. In Advances in Neural Information Processing Systems (NeurIPS).

Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using real NVP. CoRR, abs/1605.08803.

do Carmo, M. (1992). Riemannian Geometry. Mathematics (Boston, Mass.). Birkhäuser.

Donahue, J., Krähenbühl, P., and Darrell, T. (2016). Adversarial feature learning.

Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. (2016). Adversarially learned inference.

Eklund, D. and Hauberg, S. (2019). Expected path length on random manifolds. arXiv preprint.

Gal, Y. and Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv:1506.02142.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS).

Hastie, T. and Tibshirani, R. (1994). Discriminant adaptive nearest neighbor classification.

Hauberg, S. (2018). Only Bayes should learn a manifold.

Hauberg, S., Freifeld, O., and Black, M. (2012). A Geometric Take on Metric Learning. In Advances in Neural Information Processing Systems (NeurIPS).

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition.

Hennig, P. and Hauberg, S. (2014). Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR).

Kingma, D. P. and Welling, M. (2014). Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).

Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. (2019). Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems (NeurIPS).

Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS).

Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research.

Le, T. and Cuturi, M. (2015). Unsupervised Riemannian metric learning for histograms using Aitchison transformations. In International Conference on Machine Learning (ICML).

Lebanon, G. (2002). Learning Riemannian metrics. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI).

MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation.

Mathieu, E., Le Lan, C., Maddison, C. J., Tomioka, R., and Teh, Y. W. (2019). Continuous hierarchical representations with Poincaré variational auto-encoders. In Advances in Neural Information Processing Systems (NeurIPS).

Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets.

Peyré, G., Péchaud, M., Keriven, R., and Cohen, L. D. (2010). Geodesic methods in computer vision and graphics. Foundations and Trends in Computer Graphics and Vision.

Que, Q. and Belkin, M. (2016). Back to the future: Radial basis function networks revisited. In International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain.

Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning (ICML).

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML).

Shao, H., Kumar, A., and Fletcher, P. T. (2017). The Riemannian Geometry of Deep Generative Models. arXiv preprint.

Suárez, J. L., García, S., and Herrera, F. (2018). A tutorial on distance metric learning: Mathematical foundations, algorithms and experiments.

Tipping, M. E. and Bishop, C. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B.

Tosi, A., Hauberg, S., Vellido, A., and Lawrence, N. D. (2014). Metrics for Probabilistic Geometries. In The Conference on Uncertainty in Artificial Intelligence (UAI).

Weinberger, K. Q., Blitzer, J., and Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems (NeurIPS).

Yang, T., Arvanitidis, G., Fu, D., Li, X., and Hauberg, S. (2018). Geodesic clustering in deep generative models. arXiv preprint.