A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning

Yoshihiro Nagano^1, Shoichiro Yamaguchi^2, Yasuhiro Fujita^2, Masanori Koyama^2

Abstract

Hyperbolic space is a geometry that is known to be well-suited for representation learning of data with an underlying hierarchical structure. In this paper, we present a novel hyperbolic distribution called the hyperbolic wrapped distribution, a wrapped normal distribution on hyperbolic space whose density can be evaluated analytically and differentiated with respect to the parameters. Our distribution enables the gradient-based learning of probabilistic models on hyperbolic space that could never have been considered before. Also, we can sample from this hyperbolic probability distribution without resorting to auxiliary means like rejection sampling. As applications of our distribution, we develop a hyperbolic analog of the variational autoencoder and a method of probabilistic word embedding on hyperbolic space. We demonstrate the efficacy of our distribution on various datasets including MNIST, Atari 2600 Breakout, and WordNet.

Figure 1: The visual results of Hyperbolic VAE applied to an artificial dataset generated by applying random perturbations to a binary tree. (a) A tree representation of the training dataset. (b) Vanilla VAE (β = 1.0). (c) Hyperbolic VAE. The visualization is done on the Poincaré ball. The red points are the embeddings of the original tree, and the blue points are the embeddings of noisy observations generated from the tree. The pink × represents the origin of the hyperbolic space. The VAE was trained without prior knowledge of the tree structure. Please see Section 6.1 for experimental details.

1. Introduction

Recently, hyperbolic geometry has been drawing attention as a powerful geometry to assist deep networks in capturing fundamental structural properties of data such as a hierarchy. Hyperbolic attention networks (Gülçehre et al., 2019) improved the generalization performance of neural networks on various tasks including machine translation by imposing hyperbolic geometry on several parts of the networks. Poincaré embeddings (Nickel & Kiela, 2017) succeeded in learning a parsimonious representation of symbolic data by embedding the dataset into Poincaré balls.

In the task of data embedding, the choice of the target space determines the properties of the dataset that can be learned from the embedding. For a dataset with a hierarchical structure, in particular, the number of relevant features can grow exponentially with the depth of the hierarchy. Euclidean space is often inadequate for capturing this structural information (Figure 1). If the choice of the target space is limited to Euclidean space, one might have to prepare an extremely high dimensional target space to guarantee small distortion. However, the same embedding can be done remarkably well if the destination is hyperbolic space (Sarkar, 2012; Sala et al., 2018).

^1 Department of Complexity Science and Engineering, The University of Tokyo, Japan. ^2 Preferred Networks, Inc., Japan. Correspondence to: Yoshihiro Nagano.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Now, the next natural question is: "how can we extend these works to probabilistic inference problems on hyperbolic space?" When we know in advance that there is a hierarchical structure in the dataset, a prior distribution on hyperbolic space might serve as a good informative prior. We might also want to make Bayesian inference on a dataset with hierarchical structure by training a variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) with latent variables defined on hyperbolic space. We might also want to conduct probabilistic word embedding into hyperbolic space while taking into account the uncertainty that arises from the underlying hierarchical relationship among words. Finally, it would be best if we could compare different probabilistic models on hyperbolic space based on popular statistical measures like divergences, which require the explicit form of the probability density function.

The endeavors we mentioned in the previous paragraph all require probability distributions on hyperbolic space that admit a parametrization of the density function that can be computed analytically and differentiated with respect to the parameters. Also, we want to be able to sample from the distribution efficiently; that is, we do not want to resort to auxiliary methods like rejection sampling.

In this study, we present a novel hyperbolic distribution called the hyperbolic wrapped distribution, a wrapped normal distribution on hyperbolic space that resolves all these problems. We construct this distribution by defining a Gaussian distribution on the tangent space at the origin of the hyperbolic space and projecting the distribution onto hyperbolic space after transporting the tangent space to a desired location in the space. This operation can be formalized by a combination of the parallel transport and the exponential map for the Lorentz model of hyperbolic space.

We can use our hyperbolic wrapped distribution to construct probabilistic models on hyperbolic space that can be trained with gradient-based learning. For example, our distribution can be used as a prior of a VAE (Figure 1, Figure 6). It is also possible to extend existing probabilistic embedding methods, such as probabilistic word embedding, to hyperbolic space using our distribution. We will demonstrate the utility of our method through experiments with probabilistic hyperbolic models on benchmark datasets including MNIST, Atari 2600 Breakout, and WordNet.

2. Background

2.1. Hyperbolic Geometry

Hyperbolic geometry is a non-Euclidean geometry with a constant negative Gaussian curvature, and it can be visualized as the forward sheet of the two-sheeted hyperboloid. There are four common equivalent models used for hyperbolic geometry: the Klein model, the Poincaré disk model, the Lorentz (hyperboloid/Minkowski) model, and the Poincaré half-plane model. Many applications of hyperbolic space to machine learning to date have adopted the Poincaré disk model as the subject of study (Nickel & Kiela, 2017; Ganea et al., 2018a;b; Sala et al., 2018). In this study, however, we will use the Lorentz model, which, as claimed in Nickel & Kiela (2018), comes with a simpler closed form of the geodesics and does not suffer from the numerical instabilities in approximating the distance. We will also exploit the fact that both the exponential map and the parallel transport have a clean closed form in the Lorentz model.

The Lorentz model H^n (Figure 2(a)) can be represented as the set of points z ∈ R^{n+1} with z_0 > 0 whose Lorentzian product (negative Minkowski bilinear form)

    ⟨z, z'⟩_L = −z_0 z'_0 + ∑_{i=1}^n z_i z'_i

with itself is −1. That is,

    H^n = { z ∈ R^{n+1} : ⟨z, z⟩_L = −1, z_0 > 0 }.   (1)

The Lorentzian inner product also functions as the metric tensor on hyperbolic space. We will refer to the one-hot vector µ_0 = [1, 0, 0, ..., 0] ∈ H^n ⊂ R^{n+1} as the origin of the hyperbolic space. Also, the distance between two points z, z' on H^n is given by d_ℓ(z, z') = arccosh(−⟨z, z'⟩_L), which is also the length of the geodesic that connects z and z'.
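As a concrete illustration of these definitions, the following minimal numpy sketch (our own example, not code from the paper; the point z is an arbitrary choice) evaluates the Lorentzian product, verifies membership in H^n via eq. (1), and computes the geodesic distance d_ℓ.

    import numpy as np

    def lorentz_inner(x, y):
        # Lorentzian product <x, y>_L = -x_0 y_0 + sum_{i>=1} x_i y_i
        return -x[0] * y[0] + np.dot(x[1:], y[1:])

    def distance(z1, z2):
        # d_l(z, z') = arccosh(-<z, z'>_L), the geodesic distance on H^n
        return np.arccosh(-lorentz_inner(z1, z2))

    mu0 = np.array([1.0, 0.0, 0.0])                   # origin of H^2
    z = np.array([np.cosh(1.0), np.sinh(1.0), 0.0])   # another point on H^2
    print(lorentz_inner(z, z))   # approximately -1, so z satisfies eq. (1)
    print(distance(mu0, z))      # 1.0: the geodesic distance from the origin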

Figure 2: (a) One-dimensional Lorentz model H^1 (red) and its tangent space T_µH^1 (blue). (b) Parallel transport carries v ∈ T_{µ_0}H^n (green) to u ∈ T_µH^n (blue) while preserving ∥·∥_L. (c) The exponential map projects u ∈ T_µH^n (blue) to z ∈ H^n (red). The distance between µ and exp_µ(u), measured on the surface of H^n, coincides with ∥u∥_L.

2.2. Parallel Transport and Exponential Map

The rough explanation of our strategy for the construction of the hyperbolic wrapped distribution G(µ, Σ), with µ ∈ H^n and a positive definite matrix Σ, is as follows. We (1) sample a vector from N(0, Σ), (2) transport the vector from µ_0 to µ along the geodesic, and (3) project the vector onto the surface. To formalize this sequence of operations, we need to define the tangent space of hyperbolic space as well as the way to transport the tangent space and the way to project a vector in the tangent space onto the surface. The transportation of the tangent vector requires the parallel transport, and the projection of the tangent vector onto the surface requires the exponential map.

Tangent space of hyperbolic space.
Let us use T_µH^n to denote the tangent space of H^n at µ (Figure 2(a)). Representing T_µH^n as a set of vectors in the same ambient space R^{n+1} into which H^n is embedded, T_µH^n can be characterized as the set of points satisfying the orthogonality relation with respect to the Lorentzian product:

    T_µH^n := { u : ⟨u, µ⟩_L = 0 }.   (2)

The set T_µH^n can literally be thought of as the tangent space of the forward hyperboloid sheet at µ. Note that T_{µ_0}H^n consists of v ∈ R^{n+1} with v_0 = 0, and ∥v∥_L := sqrt(⟨v, v⟩_L) = ∥v∥_2.

Parallel transport and inverse parallel transport.
Next, for an arbitrary pair of points µ, ν ∈ H^n, the parallel transport from ν to µ is defined as a map PT_{ν→µ} from T_νH^n to T_µH^n that carries a vector in T_νH^n along the geodesic from ν to µ in a parallel manner without changing its metric tensor. In other words, if PT is the parallel transport on hyperbolic space, then ⟨PT_{ν→µ}(v), PT_{ν→µ}(v')⟩_L = ⟨v, v'⟩_L.

The explicit formula for the parallel transport on the Lorentz model (Figure 2(b)) is given by

    PT_{ν→µ}(v) = v + (⟨µ − αν, v⟩_L / (α + 1)) (ν + µ),   (3)

where α = −⟨ν, µ⟩_L. The inverse parallel transport PT^{-1}_{ν→µ} simply carries the vector in T_µH^n back to T_νH^n along the geodesic. That is,

    v = PT^{-1}_{ν→µ}(u) = PT_{µ→ν}(u).   (4)

Exponential map and inverse exponential map.
Finally, we describe the function that maps a vector in a tangent space onto the surface. According to the basic theory of differential geometry, every u ∈ T_µH^n determines a unique maximal geodesic γ_µ : [0, 1] → H^n with γ_µ(0) = µ and γ̇_µ(0) = u. The exponential map exp_µ : T_µH^n → H^n is defined by exp_µ(u) = γ_µ(1), and we can use this map to project a vector u in T_µH^n onto H^n in a way that the distance from µ to the destination of the map coincides with ∥u∥_L, the metric norm of u. For hyperbolic space, this map (Figure 2(c)) is given by

    z = exp_µ(u) = cosh(∥u∥_L) µ + sinh(∥u∥_L) u / ∥u∥_L.   (5)

As we can confirm with a straightforward computation, this exponential map is norm preserving in the sense that d_ℓ(µ, exp_µ(u)) = arccosh(−⟨µ, exp_µ(u)⟩_L) = ∥u∥_L.

Now, in order to evaluate the density of a point on hyperbolic space, we need to be able to map the point back to the tangent space, on which the distribution is initially defined. We therefore also need to compute the inverse of the exponential map, which is called the logarithm map. Solving eq. (5) for u, we obtain the inverse exponential map as

    u = exp_µ^{-1}(z) = (arccosh(α) / sqrt(α² − 1)) (z − αµ),   (6)

where α = −⟨µ, z⟩_L. See Appendix A.1 and A.2 for further details.

3. Hyperbolic Wrapped Distribution

3.1. Construction

We are now ready to provide the construction of our hyperbolic wrapped distribution G(µ, Σ) on hyperbolic space with µ ∈ H^n and positive definite Σ (Figure 3). In the language of differential geometry, our strategy can be re-described as follows:

1. Sample a vector ṽ from the Gaussian distribution N(0, Σ) defined over R^n.

2. Interpret ṽ as an element of T_{µ_0}H^n ⊂ R^{n+1} by rewriting ṽ as v = [0, ṽ].

3. Parallel transport the vector v to u ∈ T_µH^n ⊂ R^{n+1} along the geodesic from µ_0 to µ.

4. Map u to H^n by exp_µ.

Algorithm 1 is an algorithmic description of the sampling procedure based on our construction.

Algorithm 1 Sampling on hyperbolic space
  Input: parameter µ ∈ H^n, Σ
  Output: z ∈ H^n
  Require: µ_0 = (1, 0, ..., 0)^⊤ ∈ H^n
  Sample ṽ ∼ N(0, Σ) ∈ R^n
  v = [0, ṽ] ∈ T_{µ_0}H^n
  Move v to u = PT_{µ_0→µ}(v) ∈ T_µH^n by eq. (3)
  Map u to z = exp_µ(u) ∈ H^n by eq. (5)

3.2. Probability Density Function

Note that both PT_{µ_0→µ} and exp_µ are differentiable functions that can be evaluated analytically. Thus, by the construction of G(µ, Σ), we can compute the probability density of G(µ, Σ) at z ∈ H^n using a composition of differentiable functions, PT_{µ_0→µ} and exp_µ. Let proj_µ := exp_µ ∘ PT_{µ_0→µ}.

In general, if X is a random variable endowed with the probability density function p(x), the log likelihood of Y = f(X) at y can be expressed as

    log p(y) = log p(x) − log det(∂f/∂x),

where f is an invertible and continuous map. Thus, all we need in order to evaluate the probability density of G(µ, Σ) at z = proj_µ(v) is a way to evaluate det(∂ proj_µ(v)/∂v):

    log p(z) = log p(v) − log det(∂ proj_µ(v)/∂v).   (7)

Algorithm 2 is an algorithmic description of the computation of the pdf.

Algorithm 2 Calculate log-pdf
  Input: sample z ∈ H^n, parameter µ ∈ H^n, Σ
  Output: log p(z)
  Require: µ_0 = (1, 0, ..., 0)^⊤ ∈ H^n
  Map z to u = exp_µ^{-1}(z) ∈ T_µH^n by eq. (6)
  Move u to v = PT^{-1}_{µ_0→µ}(u) ∈ T_{µ_0}H^n by eq. (4)
  Calculate log p(z) by eq. (7)

For the implementation of Algorithm 1 and Algorithm 2, we need to be able to evaluate not only exp_µ(u), PT_{µ_0→µ}(v), and their inverses, but also the determinant. We provide an analytic solution to each one of them below.

Log-determinant.
For the evaluation of (7), we need to compute the log determinant of the projection function that maps a vector in the tangent space T_{µ_0}(H^n) at the origin to the tangent space T_µ(H^n) at an arbitrary point µ in the hyperbolic space. Appealing to the chain rule and the multiplicative property of the determinant, we can decompose the expression into two components:

    det(∂ proj_µ(v)/∂v) = det(∂ exp_µ(u)/∂u) · det(∂ PT_{µ_0→µ}(v)/∂v).   (8)

We evaluate each piece one by one. First, let us recall that ∂ exp_µ(u)/∂u is a map that sends an element of T_u(T_µ(H^n)) = T_µ(H^n) to an element of T_z(H^n), where z = exp_µ(u). We have a freedom in choosing a basis to evaluate the determinant of this expression. For convenience, let us choose an orthonormal basis of T_µ(H^n) that contains ū = u/∥u∥_L:

    {ū, u'_1, u'_2, ..., u'_{n−1}}.

Now, computing the directional derivative of exp_µ with respect to each basis element, we get

    d exp_µ(ū) = sinh(r) µ + cosh(r) ū,   (9)

    d exp_µ(u'_k) = (sinh(r)/r) u'_k,   (10)

where r = ∥u∥_L. Because the directional derivative in the direction of ū has magnitude 1 and because the components are orthogonal to each other, the determinant is given by the product of the norms of the directional derivatives:

    det(∂ exp_µ(u)/∂u) = (sinh(r)/r)^{n−1},   (11)

which yields the value of the desired determinant. Let us next compute the determinant of the derivative of the parallel transport. Evaluating the determinant with an orthonormal basis of T_v(T_{µ_0}(H^n)), we get

    d PT_{µ_0→µ}(ξ) = PT_{µ_0→µ}(ξ).

Because the parallel transport is itself a norm preserving map, this means that det(∂ PT_{µ_0→µ}(v)/∂v) = 1. All together, we get

    det(∂ proj_µ(v)/∂v) = (sinh(r)/r)^{n−1}.
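To make the two algorithms concrete, the following self-contained numpy sketch implements eqs. (3)-(7) together with the log-determinant above. It is our own illustrative implementation under the stated equations (the function names, the numerical guards, and the toy parameters are our assumptions), not the authors' released code.

    import numpy as np

    def lorentz_inner(x, y):
        # Lorentzian product <x, y>_L = -x_0 y_0 + sum_{i>=1} x_i y_i
        return -x[0] * y[0] + np.dot(x[1:], y[1:])

    def origin(n):
        # mu_0 = [1, 0, ..., 0], the origin of H^n
        e = np.zeros(n + 1)
        e[0] = 1.0
        return e

    def exp_map(mu, u):
        # Eq. (5): exp_mu(u) = cosh(||u||_L) mu + sinh(||u||_L) u / ||u||_L
        r = np.sqrt(max(lorentz_inner(u, u), 0.0))
        if r < 1e-12:
            return mu.copy()
        return np.cosh(r) * mu + np.sinh(r) * u / r

    def exp_map_inv(mu, z):
        # Eq. (6): exp_mu^{-1}(z) = arccosh(a) / sqrt(a^2 - 1) * (z - a mu), a = -<mu, z>_L
        a = -lorentz_inner(mu, z)
        return np.arccosh(a) / np.sqrt(max(a * a - 1.0, 1e-12)) * (z - a * mu)

    def parallel_transport(v, frm, to):
        # Eq. (3): PT_{frm->to}(v) = v + <to - a*frm, v>_L / (a + 1) * (frm + to), a = -<frm, to>_L
        a = -lorentz_inner(frm, to)
        return v + lorentz_inner(to - a * frm, v) / (a + 1.0) * (frm + to)

    def sample(mu, Sigma, rng):
        # Algorithm 1: draw one sample from G(mu, Sigma).
        n = mu.shape[0] - 1
        v_tilde = rng.multivariate_normal(np.zeros(n), Sigma)  # step 1
        v = np.concatenate(([0.0], v_tilde))                   # step 2: lift to T_{mu0} H^n
        u = parallel_transport(v, origin(n), mu)               # step 3: move to T_mu H^n
        return exp_map(mu, u)                                  # step 4: project onto H^n

    def log_prob(z, mu, Sigma):
        # Algorithm 2: log-density of G(mu, Sigma) at z via eq. (7).
        n = mu.shape[0] - 1
        u = exp_map_inv(mu, z)                     # back to T_mu H^n
        v = parallel_transport(u, mu, origin(n))   # inverse transport to T_{mu0} H^n
        v_tilde = v[1:]
        _, logdet_S = np.linalg.slogdet(Sigma)
        log_normal = -0.5 * (n * np.log(2.0 * np.pi) + logdet_S
                             + v_tilde @ np.linalg.solve(Sigma, v_tilde))
        r = np.sqrt(max(lorentz_inner(u, u), 1e-12))
        log_det_proj = (n - 1) * (np.log(np.sinh(r)) - np.log(r))  # log (sinh r / r)^(n-1)
        return log_normal - log_det_proj

    rng = np.random.default_rng(0)
    mu = exp_map(origin(2), np.array([0.0, 0.5, -0.3]))  # an arbitrary point on H^2
    Sigma = np.diag([0.3, 0.1])
    z = sample(mu, Sigma, rng)
    print(lorentz_inner(z, z))       # approximately -1: z lies on H^2
    print(log_prob(z, mu, Sigma))    # analytic log-density of the sample

Swapping the numpy calls for their counterparts in an automatic-differentiation framework makes log_prob differentiable with respect to µ and Σ, which is all that gradient-based learning requires.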

We provide the details of this computation in Appendix A.3 and A.4. The whole evaluation of the log determinant can be computed in O(n). Figure 3 shows example densities on H^2.

Since the metric on the tangent space coincides with the Euclidean metric, we can produce various types of distributions on hyperbolic space by applying our construction strategy to other distributions defined on Euclidean space, such as the Laplace distribution.

Figure 3: The heatmaps of the log-likelihood of the hyperbolic wrapped distribution with various µ and Σ. We designate the origin of hyperbolic space by the × mark. See Appendix B for further details.

4. Applications of G(µ, Σ)

4.1. Hyperbolic Variational Autoencoder

As an application of the hyperbolic wrapped distribution G(µ, Σ), we introduce the hyperbolic variational autoencoder (Hyperbolic VAE), a variant of the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) in which the latent variables are defined on hyperbolic space. Given a dataset D = {x^(i)}_{i=1}^N, the method of variational autoencoder aims to train a decoder model p_θ(x|z) that can create from p_θ(z) a dataset that resembles D. The decoder model is trained together with the encoder model q_ϕ(z|x) by maximizing the sum of the evidence lower bounds (ELBO) defined for each x^(i):

    log p_θ(x^(i)) ≥ L(θ, ϕ; x^(i)) = E_{q_ϕ(z|x^(i))}[log p_θ(x^(i)|z)] − D_KL(q_ϕ(z|x^(i)) ∥ p_θ(z)),   (12)

where q_ϕ(z|x^(i)) is the variational posterior distribution. In the classic VAE, the prior p_θ(z) is the standard normal, and the posterior distribution is also variationally approximated by a Gaussian. Hyperbolic VAE is a simple modification of the classic VAE in which p_θ(z) = G(µ_0, I) and q_ϕ(z|x) = G(µ, Σ). The model of µ and Σ is often referred to as the encoder. This parametric formulation of q_ϕ is called the reparametrization trick, and it enables the evaluation of the gradient of the objective function with respect to the network parameters. As a baseline to compare our method against, we used β-VAE (Higgins et al., 2017), a variant of VAE that applies a scalar weight β to the KL term in the objective function.

In Hyperbolic VAE, we ensure that the output µ of the encoder is in H^n by applying exp_{µ_0} to the final layer of the encoder. That is, if h is the output of the final layer, we can simply use

    µ = exp_{µ_0}(h) = [cosh(∥h∥_2), sinh(∥h∥_2) h/∥h∥_2]^⊤.

As stated in the previous sections, our distribution G(µ, Σ) allows us to evaluate the ELBO exactly and to take the gradient of the objective function. In a way, our distribution for the variational posterior is a hyperbolic analog of the reparametrization trick.
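The projection above can be written in a few lines; the sketch below is our own illustration (the input h is a hypothetical final-layer output, not a value from the paper). The KL term of eq. (12) can then be estimated by Monte Carlo as the average of log q_ϕ(z|x) − log p_θ(z) over samples z ∼ q_ϕ, using the sampling and log-density routines of Algorithms 1 and 2.

    import numpy as np

    def encode_mean(h):
        # mu = exp_{mu0}(h) = [cosh(||h||_2), sinh(||h||_2) * h / ||h||_2]
        norm = np.linalg.norm(h)
        if norm < 1e-12:
            return np.concatenate(([1.0], np.zeros_like(h)))  # exp_{mu0}(0) = mu0
        return np.concatenate(([np.cosh(norm)], np.sinh(norm) * h / norm))

    h = np.array([0.7, -1.2, 0.4])   # hypothetical output of the encoder's final layer
    mu = encode_mean(h)
    print(-mu[0] ** 2 + np.sum(mu[1:] ** 2))   # approximately -1, so mu lies in H^3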

4.2. Word Embedding

We can use our hyperbolic wrapped distribution G for probabilistic word embedding. The work of Vilnis & McCallum (2015) attempted to extract the linguistic and contextual properties of words in a dictionary by embedding every word and every context into a Gaussian distribution defined on Euclidean space. We may extend their work by changing the destination of the map to the family G. Let us write a ∼ b to convey that there is a link between the words a and b, and let us use q_s to designate the distribution to be assigned to the word s. The objective function used in Vilnis & McCallum (2015) is given by

    L(θ) = E_{(s∼t, s≁t')}[max(0, m + E(s, t) − E(s, t'))],

where E(s, t) represents the dissimilarity (energy) between s and t, evaluated as D_KL(q_s ∥ q_t). In the original work, q_s and q_t were chosen to be Gaussian distributions. We can incorporate hyperbolic geometry into this idea by choosing q_s = G(µ(s), Σ(s)).
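A minimal sketch of this objective is shown below. It uses diagonal Gaussians on Euclidean space, as in Vilnis & McCallum (2015), so that E(s, t) = D_KL(q_s ∥ q_t) has a closed form; the margin value and the toy embeddings are our own assumptions. For the hyperbolic version, each q_s becomes G(µ(s), Σ(s)), and the KL term is instead estimated by Monte Carlo with the sampling and log-density routines of Section 3.

    import numpy as np

    def kl_diag_gauss(mu1, var1, mu2, var2):
        # Closed-form KL(N(mu1, diag(var1)) || N(mu2, diag(var2))).
        return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

    def hinge_loss(pos_pairs, neg_pairs, margin=1.0):
        # L = mean over pairs of max(0, m + E(s, t) - E(s, t')),
        # where (s, t) is a linked pair and (s, t') is a negative sample.
        losses = []
        for (s, t), (s2, t_neg) in zip(pos_pairs, neg_pairs):
            e_pos = kl_diag_gauss(*s, *t)
            e_neg = kl_diag_gauss(*s2, *t_neg)
            losses.append(max(0.0, margin + e_pos - e_neg))
        return float(np.mean(losses))

    # Toy 2-dimensional embeddings: each word is a (mean, diagonal variance) pair.
    q_a = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
    q_b = (np.array([0.2, -0.1]), np.array([1.0, 1.0]))   # linked to a
    q_c = (np.array([3.0, 3.0]), np.array([1.0, 1.0]))    # negative sample
    print(hinge_loss([(q_a, q_b)], [(q_a, q_c)]))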

5. Related Work

The construction of a hyperbolic distribution based on projection is not entirely new on its own. For example, CCM-AAE (Grattarola et al., 2018) uses a prior distribution on hyperbolic space centered at the origin, obtained by projecting a distribution constructed on the tangent space. The wrapped normal distribution on the sphere is also a creation of similar philosophy. Still, as mentioned in the introduction, most studies to date that use hyperbolic space consider only deterministic mappings (Nickel & Kiela, 2017; 2018; Ganea et al., 2018a;b; Gülçehre et al., 2019).

Very recently, Ovinnikov (2019) and Mathieu et al. (2019) proposed extensions of the Gaussian distribution on the Poincaré ball model and their applications to VAEs. Ovinnikov (2019) used the Wasserstein Maximum Mean Discrepancy (Gretton et al., 2012), and Mathieu et al. (2019) used a Monte Carlo approximation of the ELBO for training their models. By construction, however, their methods can only create isotropic distributions on the Riemannian manifold, and they also have to use rejection sampling. Meanwhile, our method can wrap N(µ, Σ) onto H^n for an arbitrary choice of Σ, and we do not have to use rejection sampling. Our distribution G(µ, Σ) can be defined for any µ in H^n and any positive definite matrix Σ ∈ R^{n×n}.

For word embedding, several deterministic methods have been proposed to date, including the celebrated Word2Vec (Mikolov et al., 2013). The aforementioned Nickel & Kiela (2017) use deterministic hyperbolic embedding to exploit the hierarchical relationships among words. Probabilistic word embedding was first proposed by Vilnis & McCallum (2015). As stated in the method section, their method maps each word to a Gaussian distribution on Euclidean space. Their work suggests the importance of investigating the uncertainty of word embedding. In the field of representation learning of word vectors, our work is the first to use a hyperbolic probability distribution for word embedding.

On the other hand, the idea to use a noninformative, non-Gaussian prior in a VAE is not new. For example, Davidson et al. (2018) propose the use of a von Mises-Fisher prior, and Rolfe (2017) and Jang et al. (2017) use discrete distributions as their prior. With the method of normalizing flows (Rezende & Mohamed, 2015), one can construct even more complex priors as well (Kingma et al., 2016). The appropriate choice of the prior shall depend on the type of dataset. As we will show in the experiment section, our distribution is well suited to datasets with underlying tree structures. Another choice of the VAE prior that specializes in such datasets has been proposed by Vikram et al. (2018). For the sampling, they use the time-marginalized coalescent, a model that samples a random tree structure by a stochastic process. Theoretically, their method can be used in combination with our approach by replacing their Gaussian random walk with a hyperbolic random walk.

6. Experiments

6.1. Synthetic Binary Tree

We trained a Hyperbolic VAE on an artificial dataset constructed from a binary tree of depth d = 8. To construct the dataset, we first obtained a binary representation for each node in the tree so that the Hamming distance between any pair of nodes is the same as the distance on the graph representation of the tree (Figure 1(a)). Let us call the set of binaries obtained this way A_0. We then generated a set of binaries, A, by randomly flipping each coordinate value of A_0 with probability ϵ = 0.1. The binary set A was then embedded into R^d by mapping a_1 a_2 ... a_d to [a_1, a_2, ..., a_d]. We used a multilayer perceptron (MLP) of depth 3 with 100 hidden units at each layer for both the encoder and the decoder. For the activation function we used tanh.

Table 1: Results of the tree embedding experiments for the Hyperbolic VAE and Vanilla VAEs trained with different weight constants β for the KL term. We report the mean and the 1 SD over five different experiments.

    Model                Correlation     Correlation w/ noise
    Vanilla (β = 0.1)    0.665 ± .006    0.470 ± .018
    Vanilla (β = 1.0)    0.644 ± .007    0.550 ± .012
    Vanilla (β = 2.0)    0.644 ± .011    0.537 ± .012
    Vanilla (β = 3.0)    0.638 ± .004    0.501 ± .044
    Vanilla (β = 4.0)    0.217 ± .143    0.002 ± .042
    Hyperbolic           0.768 ± .003    0.590 ± .018

Table 1 summarizes the quantitative comparison of Vanilla VAE against our Hyperbolic VAE. For each pair of points in the tree, we computed their Hamming distance as well as their distance in the latent space of the VAE. That is, we used the hyperbolic distance for Hyperbolic VAE and the Euclidean distance for Vanilla VAE. We used the strength of correlation between the Hamming distances and the distances in the latent space as a measure of performance. Hyperbolic VAE performed better both on the original tree and on the artificial dataset generated from the tree. Vanilla VAE performed the best with β = 2.0, and collapsed with β = 3.0. The difference between Vanilla VAE and Hyperbolic VAE can be observed with much more clarity using the 2-dimensional visualization of the generated dataset on the Poincaré ball (see Figure 1 and Appendix C.1). The red points are the embeddings of A_0, and the blue points are the embeddings of all other points in A. The pink × mark designates the origin of hyperbolic space. For the visualization, we used the canonical diffeomorphism between the Lorentz model and the Poincaré ball model.

Table 2: Quantitative comparison of Hyperbolic VAE against Vanilla VAE on the MNIST dataset in terms of log-likelihood (LL) for several values of the latent space dimension n. LL was computed using 500 samples of latent variables. We report the mean and the 1 SD over five different experiments.

    n     Vanilla VAE       Hyperbolic VAE
    2     −140.45 ± .47     −138.61 ± .45
    5     −105.78 ± .51     −105.38 ± .61
    10    −86.25 ± .52      −86.40 ± .28
    20    −77.89 ± .36      −79.23 ± .20

6.2. MNIST

We applied Hyperbolic VAE to a binarized version of MNIST. We used an MLP of depth 3 with 500 hidden units at each layer for both the encoder and the decoder. Table 2 shows the quantitative results of the experiments. The log-likelihood was approximated with an empirical integration of the Bayesian predictor with respect to the latent variables (Burda et al., 2016). Our method outperformed Vanilla VAE with small latent dimension. We also evaluated our method in terms of ELBO; please see Appendix C.2 for further results. Figure 4(a) shows samples from the Hyperbolic VAE that was trained with 5-dimensional latent variables, and Figure 4(b) shows the Poincaré ball representations of the interpolations produced on H^2 by the Hyperbolic VAE that was trained with 2-dimensional latent variables.

Figure 4: (a) Samples generated from the Hyperbolic VAE trained on MNIST with latent dimension n = 5. (b) Interpolation of the MNIST dataset produced by the Hyperbolic VAE with latent dimension n = 2, represented on the Poincaré ball.

6.3. Atari 2600 Breakout

In reinforcement learning, the number of possible state-action trajectories grows exponentially with the time horizon. We may say that these trajectories often have a tree-like hierarchical structure that starts from the initial states. We applied our Hyperbolic VAE to a set of trajectories that were explored by a trained policy during multiple episodes of Breakout in Atari 2600. To collect the trajectories, we used a pretrained Deep Q-Network (Mnih et al., 2015) and an epsilon-greedy policy with ϵ = 0.1. We amassed a set of trajectories whose total length is 100,000, of which we used 80,000 as the training set, 10,000 as the validation set, and 10,000 as the test set. Each frame in the dataset was gray-scaled and resized to 80 × 80. The images in Figure 5 are samples from the dataset. We used a DCGAN-based architecture (Radford et al., 2016) with latent space dimension n = 20. Please see Appendix D for more details.

Figure 5: Examples of the observed screens in Atari 2600 Breakout.

Figure 6 is a visualization of our results. The top three rows are the samples from Vanilla VAE, and the bottom three rows are the samples from Hyperbolic VAE. Each row consists of samples generated from latent variables of the form aṽ/∥ṽ∥_2 with positive scalar a in the range [1, 10]. Samples in each row are listed in increasing order of a. For Vanilla VAE, we used N(0, I) as the prior. For Hyperbolic VAE, we used G(µ_0, I) as the prior. We can see that the number of blocks decreases gradually and consistently in each row for Hyperbolic VAE. Please see Appendix C.3 for more details and more visualizations.

Figure 6: Samples from Vanilla and Hyperbolic VAEs trained on Atari 2600 Breakout screens. Each row was generated by sweeping the norm of ṽ from 1.0 to 10.0 in a log-scale.

In Breakout, the number of blocks is always finite, and blocks are located only in a specific region. Let us refer to this specific region as R. In order to evaluate each model output based on the number of blocks, we binarized each pixel in each output based on a prescribed luminance threshold and measured the proportion of the pixels with pixel value 1 in the region R. For each generated image, we used this proportion as the measure of the number of blocks contained in the image.

Figure 7 shows the estimated proportions of remaining blocks for Vanilla and Hyperbolic VAEs with different norms of ṽ. For Vanilla VAE, samples generated from ṽ with its norm as large as ∥ṽ∥_2 = 200 contained a considerable amount of blocks. On the other hand, the number of blocks contained in a sample generated by Hyperbolic VAE decreased more consistently with the norm ∥ṽ∥_2. This fact suggests that the cumulative reward up to a given state can be approximated well by the norm of the Hyperbolic VAE's latent representation. To validate this, we computed the latent representation for each state in the test set and measured its correlation with the cumulative reward. The correlation was 0.846 for the Hyperbolic VAE. For the Vanilla VAE, the correlation was 0.712. We emphasize that no information regarding the reward was used during the training of both Vanilla and Hyperbolic VAEs.

Figure 7: Estimated proportions of remaining blocks for Vanilla and Hyperbolic VAEs trained on Atari 2600 Breakout screens as they vary with the norm of latent variables sampled from a prior.
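The block-proportion measure described above can be sketched as follows; the coordinates of the region R and the luminance threshold are not specified in the paper, so the values below are purely illustrative.

    import numpy as np

    def block_proportion(frame, region=(slice(10, 30), slice(0, 80)), threshold=0.5):
        # Proportion of pixels above the luminance threshold inside the block region R.
        # `region` and `threshold` are illustrative assumptions; the frame is assumed
        # to be an 80x80 grayscale image with values in [0, 1].
        patch = frame[region]
        return float(np.mean(patch > threshold))

    frame = np.zeros((80, 80))
    frame[10:20, :] = 1.0           # a band of "blocks" in the upper part of the screen
    print(block_proportion(frame))  # 0.5 of the assumed region R is filled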

6.4. Word Embeddings

Lastly, we applied the hyperbolic wrapped distribution to the word embedding problem. We trained probabilistic word embedding models with the WordNet nouns dataset (Miller, 1998) and evaluated their reconstruction performance (Table 3). We followed the procedure of Poincaré embeddings (Nickel & Kiela, 2017) and initialized all embeddings in the neighborhood of the origin. In particular, we initialized each weight in the first linear part of the embedding by N(0, 0.01). We treated the first 50 epochs as a burn-in phase and reduced the learning rate by a factor of 40 after the burn-in phase.

In Table 3, 'Euclid' refers to the word embedding with a Gaussian distribution on Euclidean space (Vilnis & McCallum, 2015), and 'Hyperbolic' refers to our proposed method based on the hyperbolic wrapped distribution. Our hyperbolic model performed better than Vilnis' Euclidean counterpart when the latent space is low dimensional. We used diagonal variance for both models above. Please see Appendix C.4 for the full results. We also performed the same experiment with unit variance; the performance difference with small latent dimension was much more remarkable when we used unit variance.

Table 3: Experimental results of the reconstruction performance on the transitive closure of the WordNet noun hierarchy for several latent space dimensions n. We report the mean and the 1 SD over three different experiments.

             Euclid                        Hyperbolic
    n        MAP            Rank           MAP            Rank
    5        0.296 ± .006   25.09 ± .80    0.506 ± .017   20.55 ± 1.34
    10       0.778 ± .007   4.70 ± .05     0.795 ± .007   5.07 ± .12
    20       0.894 ± .002   2.23 ± .03     0.897 ± .005   2.54 ± .20
    50       0.942 ± .003   1.51 ± .04     0.975 ± .001   1.19 ± .01
    100      0.953 ± .002   1.34 ± .02     0.978 ± .002   1.15 ± .01
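The MAP and Rank columns can be computed with the reconstruction protocol of Nickel & Kiela (2017); the sketch below is a simplified variant of that protocol written by us (the toy dissimilarity matrix is made up), which ranks, for each node, its ground-truth neighbors among all other nodes by their energy and reports the mean rank and the mean average precision.

    import numpy as np

    def reconstruction_metrics(dissim, neighbors):
        # dissim[i, j]: dissimilarity (e.g., KL-based energy) between nodes i and j.
        # neighbors[i]: set of ground-truth neighbors of node i in the hierarchy.
        ranks, ap_scores = [], []
        n = dissim.shape[0]
        for i in range(n):
            if not neighbors[i]:
                continue
            order = np.argsort(dissim[i])            # closest first
            order = order[order != i]                # exclude the node itself
            is_true = np.isin(order, list(neighbors[i]))
            positions = np.nonzero(is_true)[0] + 1   # 1-based ranks of the true neighbors
            ranks.extend(positions.tolist())
            precisions = np.arange(1, len(positions) + 1) / positions
            ap_scores.append(precisions.mean())      # average precision for node i
        return float(np.mean(ranks)), float(np.mean(ap_scores))

    # Toy example with 4 nodes; node 0 is linked to nodes 1 and 2.
    dissim = np.array([[0.0, 0.1, 0.3, 0.9],
                       [0.1, 0.0, 0.5, 0.8],
                       [0.3, 0.5, 0.0, 0.7],
                       [0.9, 0.8, 0.7, 0.0]])
    neighbors = {0: {1, 2}, 1: {0}, 2: {0}, 3: set()}
    mean_rank, map_score = reconstruction_metrics(dissim, neighbors)
    print(mean_rank, map_score)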

7. Conclusion

In this paper, we proposed a novel parametrization for the density of the hyperbolic wrapped distribution that can both be differentiated and evaluated analytically. Our experimental results on hyperbolic word embedding and hyperbolic VAE suggest that there is much more room left for the application of hyperbolic space. Our parametrization enables gradient-based training of probabilistic models defined on hyperbolic space and opens the door to the investigation of complex models on hyperbolic space that could not have been explored before.

Acknowledgements

We are deeply grateful to Masaki Watanabe for profound discussions and comments on our entire study. We would like to thank Tomohiro Hayase and Kenta Oono for helpful discussions. We also thank Takeru Miyato and Sosuke Kobayashi for insightful reviews on the paper. This paper is based on results obtained from Nagano's internship at Preferred Networks, Inc. This work is partially supported by the Masason Foundation.

References

Burda, Y., Grosse, R. B., and Salakhutdinov, R. Importance weighted autoencoders. In Proceedings of the 4th International Conference on Learning Representations, 2016.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial Intelligence, 2018.

Ganea, O., Bécigneul, G., and Hofmann, T. Hyperbolic entailment cones for learning hierarchical embeddings. In Proceedings of the 35th International Conference on Machine Learning, pp. 1632–1641, 2018a.

Ganea, O., Bécigneul, G., and Hofmann, T. Hyperbolic neural networks. In Advances in Neural Information Processing Systems 31, pp. 5350–5360, 2018b.

Grattarola, D., Livi, L., and Alippi, C. Adversarial autoencoders with constant-curvature latent manifolds. CoRR, abs/1812.04314, 2018.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13:723–773, March 2012.

Gülçehre, Ç., Denil, M., Malinowski, M., Razavi, A., Pascanu, R., Hermann, K. M., Battaglia, P., Bapst, V., Raposo, D., Santoro, A., and de Freitas, N. Hyperbolic attention networks. In Proceedings of the 7th International Conference on Learning Representations, 2019.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, pp. 4743–4751, 2016.

Mathieu, E., Lan, C. L., Maddison, C. J., Tomioka, R., and Teh, Y. W. Hierarchical representations with Poincaré variational auto-encoders. CoRR, abs/1901.06033, 2019.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013.

Miller, G. WordNet: An electronic lexical database. MIT Press, 1998.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pp. 6338–6347, 2017.

Nickel, M. and Kiela, D. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In Proceedings of the 35th International Conference on Machine Learning, pp. 3776–3785, 2018.

Ovinnikov, I. Poincaré Wasserstein autoencoder. CoRR, abs/1901.01427, 2019.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the 4th International Conference on Learning Representations, 2016.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pp. 1278–1286, 2014.

Rolfe, J. T. Discrete variational autoencoders. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Sala, F., De Sa, C., Gu, A., and Ré, C. Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 4460–4469, 2018.

Sarkar, R. Low distortion Delaunay embedding of trees in hyperbolic plane. In Graph Drawing, pp. 355–366, 2012.

Vikram, S., Hoffman, M. D., and Johnson, M. J. The LORACs prior for VAEs: Letting the trees speak for the data. CoRR, abs/1810.06891, 2018.

Vilnis, L. and McCallum, A. Word representations via Gaussian embedding. In Proceedings of the 3rd International Conference on Learning Representations, 2015.