A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning
Yoshihiro Nagano 1  Shoichiro Yamaguchi 2  Yasuhiro Fujita 2  Masanori Koyama 2

Abstract

Hyperbolic space is a geometry that is known to be well-suited for representation learning of data with an underlying hierarchical structure. In this paper, we present a novel hyperbolic distribution called the pseudo-hyperbolic Gaussian, a Gaussian-like distribution on hyperbolic space whose density can be evaluated analytically and differentiated with respect to the parameters. Our distribution enables gradient-based learning of probabilistic models on hyperbolic space that could not be considered before. Moreover, we can sample from this hyperbolic probability distribution without resorting to auxiliary means like rejection sampling. As applications of our distribution, we develop a hyperbolic analog of the variational autoencoder and a method of probabilistic word embedding on hyperbolic space. We demonstrate the efficacy of our distribution on various datasets including MNIST, Atari 2600 Breakout, and WordNet.

[Figure 1: Visual results of the Hyperbolic VAE applied to an artificial dataset generated by applying random perturbations to a binary tree. Panels: (a) a tree representation of the training dataset; (b) Vanilla VAE (β = 1.0); (c) Hyperbolic VAE. The visualization is done on the Poincaré ball. The red points are the embeddings of the original tree, and the blue points are the embeddings of noisy observations generated from the tree. The pink × represents the origin of the hyperbolic space. The VAE was trained without prior knowledge of the tree structure. See Section 6.1 for experimental details.]

1. Introduction

Recently, hyperbolic geometry has been drawing attention as a powerful geometry that assists deep networks in capturing fundamental structural properties of data, such as a hierarchy. The hyperbolic attention network (Gülçehre et al., 2019) improved the generalization performance of neural networks on various tasks, including machine translation, by imposing hyperbolic geometry on several parts of the network. Poincaré embeddings (Nickel & Kiela, 2017) succeeded in learning a parsimonious representation of symbolic data by embedding the dataset into Poincaré balls.

In the task of data embedding, the choice of the target space determines the properties of the dataset that can be learned from the embedding. For a dataset with a hierarchical structure, in particular, the number of relevant features can grow exponentially with the depth of the hierarchy. Euclidean space is often inadequate for capturing this structural information (Figure 1). If the choice of the target space of the embedding is limited to Euclidean space, one might have to prepare an extremely high-dimensional target space to guarantee small distortion. However, the same embedding can be done remarkably well if the destination is hyperbolic space (Sarkar, 2012; Sala et al., 2018).

1 Department of Complexity Science and Engineering, The University of Tokyo, Japan  2 Preferred Networks, Inc., Japan. Correspondence to: Yoshihiro Nagano <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
Now, the next natural question is: "how can we extend these works to probabilistic inference problems on hyperbolic space?" When we know in advance that there is a hierarchical structure in the dataset, a prior distribution on hyperbolic space might serve as a good informative prior. We might also want to perform Bayesian inference on a dataset with a hierarchical structure by training a variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) with latent variables defined on hyperbolic space. We might also want to conduct probabilistic word embedding into hyperbolic space while taking into account the uncertainty that arises from the underlying hierarchical relationships among words. Finally, it would be best if we could compare different probabilistic models on hyperbolic space based on popular statistical measures like divergences, which require the explicit form of the probability density function.

The endeavors mentioned in the previous paragraph all require probability distributions on hyperbolic space that admit a parametrization of the density function that can be computed analytically and differentiated with respect to the parameters. We also want to be able to sample from the distribution efficiently; that is, we do not want to resort to auxiliary methods like rejection sampling.

In this study, we present a novel hyperbolic distribution called the pseudo-hyperbolic Gaussian, a Gaussian-like distribution on hyperbolic space that resolves all these problems. We construct this distribution by defining a Gaussian distribution on the tangent space at the origin of the hyperbolic space and projecting the distribution onto the hyperbolic space after transporting the tangent space to a desired location in the space. This operation can be formalized by a combination of the parallel transport and the exponential map for the Lorentz model of hyperbolic space.

We can use our pseudo-hyperbolic Gaussian distribution to construct a probabilistic model on hyperbolic space that can be trained with gradient-based learning. For example, our distribution can be used as a prior of a VAE (Figure 1, Figure 6). It is also possible to extend existing probabilistic embedding methods, such as probabilistic word embedding, to hyperbolic space using our distribution. We will demonstrate the utility of our method through experiments with probabilistic hyperbolic models on benchmark datasets including MNIST, Atari 2600 Breakout, and WordNet.

2. Background

2.1. Hyperbolic Geometry

Hyperbolic geometry is a non-Euclidean geometry with a constant negative Gaussian curvature, and it can be visualized as the forward sheet of the two-sheeted hyperboloid. There are four common equivalent models used for hyperbolic geometry: the Klein model, the Poincaré disk model, the Lorentz (hyperboloid/Minkowski) model, and the Poincaré half-plane model. Many applications of hyperbolic space to machine learning to date have adopted the Poincaré disk model as the subject of study (Nickel & Kiela, 2017; Ganea et al., 2018a;b; Sala et al., 2018). In this study, however, we will use the Lorentz model, which, as claimed in Nickel & Kiela (2018), comes with a simpler closed form of the geodesics and does not suffer from the numerical instabilities in approximating the distance. We will also exploit the fact that both the exponential map and parallel transport have a clean closed form in the Lorentz model.

The Lorentz model $\mathbb{H}^n$ (Figure 2(a)) can be represented as the set of points $z \in \mathbb{R}^{n+1}$ with $z_0 > 0$ whose Lorentzian product (negative Minkowski bilinear form)

$$\langle z, z' \rangle_{\mathcal{L}} = -z_0 z'_0 + \sum_{i=1}^{n} z_i z'_i$$

with itself is $-1$. That is,

$$\mathbb{H}^n = \left\{ z \in \mathbb{R}^{n+1} : \langle z, z \rangle_{\mathcal{L}} = -1,\ z_0 > 0 \right\}. \qquad (1)$$

The Lorentzian inner product also functions as the metric tensor on hyperbolic space. We will refer to the one-hot vector $\mu_0 = [1, 0, 0, \dots, 0] \in \mathbb{H}^n \subset \mathbb{R}^{n+1}$ as the origin of the hyperbolic space. The distance between two points $z, z'$ on $\mathbb{H}^n$ is given by $d_\ell(z, z') = \operatorname{arccosh}\left(-\langle z, z' \rangle_{\mathcal{L}}\right)$, which is also the length of the geodesic that connects $z$ and $z'$.
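To make these definitions concrete, here is a minimal NumPy sketch of the Lorentzian product, the membership test of Eq. (1), and the geodesic distance. The function names are ours, not the paper's, and the clamping inside `lorentz_distance` is a numerical-safety choice rather than something the paper prescribes.

```python
import numpy as np

def lorentz_inner(z, zp):
    """Lorentzian product <z, z'>_L = -z_0 z'_0 + sum_{i>=1} z_i z'_i."""
    return -z[0] * zp[0] + np.dot(z[1:], zp[1:])

def on_hyperboloid(z):
    """Membership test for Eq. (1): <z, z>_L = -1 and z_0 > 0."""
    return np.isclose(lorentz_inner(z, z), -1.0) and z[0] > 0

def lorentz_distance(z, zp):
    """Geodesic distance d_l(z, z') = arccosh(-<z, z'>_L).
    The argument is clamped to >= 1 to guard against rounding error."""
    return np.arccosh(np.maximum(-lorentz_inner(z, zp), 1.0))

# The origin mu_0 = [1, 0, 0] of H^2, and a point at geodesic
# distance t from mu_0 along the first spatial axis.
mu0 = np.array([1.0, 0.0, 0.0])
t = 1.5
z = np.array([np.cosh(t), np.sinh(t), 0.0])
assert on_hyperboloid(mu0) and on_hyperboloid(z)
assert np.isclose(lorentz_distance(mu0, z), t)
```

The final assertion illustrates the distance formula directly: $-\langle \mu_0, z \rangle_{\mathcal{L}} = \cosh(t)$, so $\operatorname{arccosh}$ recovers exactly $t$.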
2.2. Parallel Transport and Exponential Map

The rough explanation of our strategy for the construction of the pseudo-hyperbolic Gaussian $\mathcal{G}(\mu, \Sigma)$, with $\mu \in \mathbb{H}^n$ and a positive definite matrix $\Sigma$, is as follows. We (1) sample a vector from $\mathcal{N}(0, \Sigma)$, (2) transport the vector from $\mu_0$ to $\mu$ along the geodesic, and (3) project the vector onto the surface. To formalize this sequence of operations, we need to define the tangent space of hyperbolic space, as well as the way to transport the tangent space and the way to project a vector in the tangent space onto the surface. The transportation of the tangent vector requires parallel transport, and the projection of the tangent vector onto the surface requires the exponential map.

Tangent space of hyperbolic space. Let us use $T_\mu \mathbb{H}^n$ to denote the tangent space of $\mathbb{H}^n$ at $\mu$ (Figure 2(a)). Representing $T_\mu \mathbb{H}^n$ as a set of vectors in the same ambient space $\mathbb{R}^{n+1}$ into which $\mathbb{H}^n$ is embedded, $T_\mu \mathbb{H}^n$ can be characterized as the set of points satisfying the orthogonality relation with respect to the Lorentzian product:

$$T_\mu \mathbb{H}^n := \{ u : \langle u, \mu \rangle_{\mathcal{L}} = 0 \}. \qquad (2)$$

[Figure 2: (a) The one-dimensional Lorentz model $\mathbb{H}^1$ (red) and its tangent space $T_\mu \mathbb{H}^1$ (blue). (b) Parallel transport carries $v \in T_{\mu_0} \mathbb{H}^n$ (green) to $u \in T_\mu \mathbb{H}^n$ (blue) while preserving $\|\cdot\|_{\mathcal{L}}$. (c) The exponential map projects $u \in T_\mu \mathbb{H}^n$ (blue) to $z \in \mathbb{H}^n$ (red). The distance between $\mu$ and $\exp_\mu(u)$, measured on the surface of $\mathbb{H}^n$, coincides with $\|u\|_{\mathcal{L}}$.]

$T_\mu \mathbb{H}^n$ can be literally thought of as the tangent space of the forward hyperboloid sheet at $\mu$. Note that $T_{\mu_0} \mathbb{H}^n$ consists of $v \in \mathbb{R}^{n+1}$ with $v_0 = 0$, and that $\|v\|_{\mathcal{L}} := \sqrt{\langle v, v \rangle_{\mathcal{L}}} = \|v\|_2$ on this tangent space.

Parallel transport and inverse parallel transport. Next, for an arbitrary pair of points $\mu, \nu \in \mathbb{H}^n$, the parallel transport from $\nu$ to $\mu$ is defined as a map $\mathrm{PT}_{\nu \to \mu}$ from $T_\nu \mathbb{H}^n$ to $T_\mu \mathbb{H}^n$ that carries a vector in $T_\nu \mathbb{H}^n$ along the geodesic from $\nu$ to $\mu$ in a parallel manner, preserving the Lorentzian norm (Figure 2(b)).

Exponential map. The exponential map $\exp_\mu$ sends a tangent vector $v \in T_\mu \mathbb{H}^n$ to the point on the manifold reached by traveling along the geodesic in the direction of $v$ for a distance equal to the norm of $v$. For hyperbolic space, this map (Figure 2(c)) is given by

$$z = \exp_\mu(u) = \cosh(\|u\|_{\mathcal{L}})\, \mu + \sinh(\|u\|_{\mathcal{L}})\, \frac{u}{\|u\|_{\mathcal{L}}}. \qquad (5)$$

As we can confirm with a straightforward computation, this exponential map is norm preserving in the sense that $d_\ell(\mu, \exp_\mu(u)) = \operatorname{arccosh}\left(-\langle \mu, \exp_\mu(u) \rangle_{\mathcal{L}}\right) = \|u\|_{\mathcal{L}}$.
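Below is a hedged NumPy sketch of the three-step construction (sample, transport, project) together with the exponential map of Eq. (5). The closed-form expression used inside `parallel_transport` is the standard one for the Lorentz model; it is not written out in the excerpt above, so treat it as our assumption rather than a quotation of the paper. All function names are ours.

```python
import numpy as np

def lorentz_inner(z, zp):
    # Lorentzian product, as in the previous sketch.
    return -z[0] * zp[0] + np.dot(z[1:], zp[1:])

def lorentz_norm(u):
    # ||u||_L = sqrt(<u, u>_L); non-negative for tangent vectors,
    # clipped to guard against tiny negative values from rounding.
    return np.sqrt(np.clip(lorentz_inner(u, u), 0.0, None))

def exp_map(mu, u, eps=1e-12):
    # Eq. (5): project u in T_mu H^n onto the hyperboloid.
    r = lorentz_norm(u)
    if r < eps:                    # exp_mu(0) = mu
        return mu
    return np.cosh(r) * mu + np.sinh(r) * (u / r)

def parallel_transport(v, nu, mu):
    # Carry v in T_nu H^n to T_mu H^n along the geodesic, preserving
    # the Lorentzian norm. Standard Lorentz-model closed form (our
    # assumption; the excerpt defines PT only implicitly):
    #   PT(v) = v + <mu - alpha*nu, v>_L / (alpha + 1) * (nu + mu),
    #   alpha = -<nu, mu>_L.
    alpha = -lorentz_inner(nu, mu)
    return v + lorentz_inner(mu - alpha * nu, v) / (alpha + 1.0) * (nu + mu)

def sample_g(mu, Sigma, rng):
    # Steps (1)-(3) of Section 2.2 for a draw z ~ G(mu, Sigma).
    n = mu.shape[0] - 1
    mu0 = np.zeros(n + 1)
    mu0[0] = 1.0                                         # origin of H^n
    v = np.zeros(n + 1)                                  # (1) v ~ N(0, Sigma)
    v[1:] = rng.multivariate_normal(np.zeros(n), Sigma)  #     with v_0 = 0
    u = parallel_transport(v, mu0, mu)                   # (2) carry to T_mu
    return exp_map(mu, u)                                # (3) project onto H^n

rng = np.random.default_rng(0)
t = 1.0
mu = np.array([np.cosh(t), np.sinh(t), 0.0])             # a point on H^2
z = sample_g(mu, 0.1 * np.eye(2), rng)
assert np.isclose(lorentz_inner(z, z), -1.0)             # z lies on H^2
```

Because every step of `sample_g` is differentiable with respect to $\mu$ and $\Sigma$, a sampler of this form is compatible with reparametrization-style gradient-based learning, which is the property the paper sets out to exploit.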