
Lorentzian Distance Learning for Hyperbolic Representations

Marc T. Law 1,2,3   Renjie Liao 1,2   Jake Snell 1,2   Richard S. Zemel 1,2

Abstract

We introduce an approach to learn representations based on the Lorentzian distance in hyperbolic geometry. Hyperbolic geometry is especially suited to hierarchically-structured datasets, which are prevalent in the real world. Current hyperbolic representation learning methods compare examples with the Poincaré distance. They try to minimize the distance of each node in a hierarchy with its descendants while maximizing its distance with other nodes. This formulation produces node representations close to the centroid of their descendants. To obtain efficient and interpretable algorithms, we exploit the fact that the centroid w.r.t. the squared Lorentzian distance can be written in closed form. We show that the Euclidean norm of such a centroid decreases as the curvature of the hyperbolic space decreases. This property makes it appropriate to represent hierarchies where parent nodes minimize the distances to their descendants and have smaller Euclidean norm than their children. Our approach obtains state-of-the-art results in retrieval and classification tasks on different datasets.

1. Introduction

Generalizations of Euclidean space are important forms of data representation in machine learning. For instance, kernel methods (Shawe-Taylor et al., 2004) rely on Hilbert spaces that possess the structure of the inner product and are therefore used to compare examples. The properties of such spaces are well known, and closed-form relations are often exploited to obtain efficient, scalable, and interpretable training algorithms. While representing examples in a Euclidean space is appropriate to compare lengths and angles, non-Euclidean representations are useful when the task requires a specific structure.

A common and natural non-Euclidean representation space is the spherical model (e.g. (Wang et al., 2017)), where the data lies on a unit hypersphere S^d = {x ∈ R^d : ‖x‖_2 = 1} and angles are compared with the cosine similarity function. Recently, some machine learning approaches (Nickel & Kiela, 2017; 2018; Ganea et al., 2018) have considered representing hierarchical datasets with the hyperbolic model. The motivation is that any finite tree can be mapped into a finite hyperbolic space while approximately preserving distances (Gromov, 1987), which is not the case for Euclidean space (Linial et al., 1995). Since hierarchies can be formulated as trees, hyperbolic spaces can be used to represent hierarchically structured data where the high-level nodes of the hierarchy are represented close to the origin whereas leaves lie further away from the origin.

An important question is the form of the hyperbolic distance used to compare representations. Since their first formulation in the early nineteenth century by Lobachevsky and Bolyai, hyperbolic spaces have been used in many domains. In particular, they became popular in mathematics (e.g. space theory and differential geometry), and in physics when Varicak (1908) discovered that special relativity theory (Einstein, 1905) had a natural interpretation in hyperbolic geometry. Various hyperbolic geometries and related distances have been studied since then. Among them are the Poincaré metric, the Lorentzian distance (Ratcliffe, 2006), and the gyrodistance (Ungar, 2010; 2014).

In the case of hierarchical datasets, machine learning approaches that learn hyperbolic representations designed to preserve the hierarchical similarity order have typically employed the Poincaré metric. Usually, the optimization problem is formulated so that the representation of a node in a hierarchy should be closer to the representation of its children and other descendants than to any other node in the hierarchy. Based on (Gromov, 1987), the Poincaré metric is a sensible dissimilarity function as it satisfies all the properties of a distance metric and is thus natural to interpret.

In this paper, we explain why the squared Lorentzian distance is a better choice than the Poincaré metric. One analytic argument relies on Jacobi field (Lee, 2006) properties of Riemannian centers of mass (also called "Karcher means", although Karcher (2014) strongly discourages the use of that term). One other interesting property is that its centroid can be written in closed form.

1 University of Toronto, Canada. 2 Vector Institute, Canada. 3 NVIDIA; work done while affiliated with the University of Toronto. Correspondence to: Marc Law.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Contributions: The main contributions of this paper are the study of the Lorentzian distance for learning hyperbolic representations. We show that interpreting the squared Lorentzian distances with a set of points is equivalent to interpreting the distance with their centroid. We also study the dependence of the centroid on some hyperparameters, particularly the curvature of the manifold, which has an impact on its Euclidean norm; the Euclidean norm is used as a proxy for depth in the hierarchy. This is the key motivation for our theoretical work characterizing its behavior w.r.t. the curvature. We relate the Lorentzian distance to other hyperbolic distances/geometries and explore its performance on retrieval and classification problems.

2. Background

In this section, we provide some technical background about hyperbolic geometry and introduce relevant notation. The interested reader may refer to (Ratcliffe, 2006).

2.1. Notation and definitions

To simplify the notation, we consider that vectors are row vectors and ‖·‖ is the ℓ2-norm. In the following, we consider three important spaces.

Poincaré ball: The Poincaré ball P^d is defined as the set of d-dimensional vectors with Euclidean norm smaller than 1 (i.e. P^d = {x ∈ R^d : ‖x‖ < 1}). Its associated distance is the Poincaré distance metric defined in Eq. (3).

Hyperboloid model: We consider some specific hyperboloid models H^{d,β} ⊆ R^{d+1} defined as follows:

    H^{d,\beta} := \{ a = (a_0, \dots, a_d) \in R^{d+1} : \|a\|^2_L = -\beta,\ a_0 > 0 \}    (1)

where β > 0 and ‖a‖²_L = ⟨a,a⟩_L is the squared Lorentzian norm of a. The squared Lorentzian norm is derived from the Lorentzian inner product, defined for all a = (a_0, ..., a_d) ∈ H^{d,β}, b = (b_0, ..., b_d) ∈ H^{d,β} as:

    \langle a, b \rangle_L := -a_0 b_0 + \sum_{i=1}^d a_i b_i \le -\beta    (2)

It is worth noting that ⟨a,b⟩_L = −β iff a = b. Otherwise, ⟨a,b⟩_L < −β for all pairs (a,b) ∈ (H^{d,β})². Vectors in H^{d,β} are a subset of positive time-like vectors (a vector a that satisfies ⟨a,a⟩_L < 0 is called time-like, and it is called positive iff a_0 > 0). The hyperboloid H^{d,β} has constant negative curvature −1/β. Moreover, every vector a ∈ H^{d,β} satisfies a_0 = √(β + Σ_{i=1}^d a_i²). We note H^d := H^{d,1} the space obtained when β = 1; it is called the unit hyperboloid model and is the main hyperboloid model considered in the literature.
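To make the notation concrete, the following is a minimal NumPy sketch of Eqs. (1)-(2); the function names and the example point are illustrative choices of ours, not taken from the paper's released code.

```python
import numpy as np

def lorentz_inner(a, b):
    """Lorentzian inner product <a,b>_L = -a_0 b_0 + sum_i a_i b_i (Eq. 2)."""
    return -a[0] * b[0] + np.dot(a[1:], b[1:])

def on_hyperboloid(a, beta, tol=1e-9):
    """Check membership in H^{d,beta}: <a,a>_L = -beta and a_0 > 0 (Eq. 1)."""
    return abs(lorentz_inner(a, a) + beta) < tol and a[0] > 0

# Build a point from its last d coordinates: a_0 = sqrt(beta + ||a_{1:d}||^2).
beta = 0.1
tail = np.array([0.3, -1.2, 0.7])
a = np.concatenate(([np.sqrt(beta + tail @ tail)], tail))
assert on_hyperboloid(a, beta)
assert lorentz_inner(a, a) <= -beta + 1e-9   # <a,b>_L <= -beta, with equality iff a = b
```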
Model space: Finally, we note F^d ⊆ R^d the output vector space of our model (e.g. the output representation of some neural network). We consider that F^d = R^d.

2.2. Optimizing the Poincaré distance metric

Most methods that compare hyperbolic representations (Nickel & Kiela, 2017; 2018; Ganea et al., 2018; Gulcehre et al., 2019) consider the Poincaré distance metric, defined for all c ∈ P^d, d ∈ P^d as:

    d_P(c, d) = \cosh^{-1}\left( 1 + 2\,\frac{\|c - d\|^2}{(1 - \|c\|^2)(1 - \|d\|^2)} \right)    (3)

which satisfies all the properties of a distance metric and is therefore natural to interpret. Direct optimization in P^d of problems using the distance formulation in Eq. (3) is numerically unstable for two main reasons (see for instance (Nickel & Kiela, 2018) or (Ganea et al., 2018, Section 4)). First, the denominator depends on the norms of the examples, so optimizing over c and d when either of their norms is close to 1 leads to numerical instability. Second, elements have to be re-projected onto the Poincaré ball at each iteration with a fixed maximum norm. Moreover, Eq. (3) is not differentiable when c = d (see proof in appendix).

For better numerical stability of their solver, Nickel & Kiela (2018) propose to use an equivalent formulation of d_P in the unit hyperboloid model. They use the fact that there exists an invertible mapping h : H^{d,β} → P^d defined for all a = (a_0, ..., a_d) ∈ H^{d,β} as:

    h(a) := \frac{1}{1 + \sqrt{1 + \sum_{i=1}^d a_i^2}} (a_1, \dots, a_d) \in P^d    (4)

When β = 1, a ∈ H^d, b ∈ H^d, we have the following equivalence:

    d_H(a, b) = d_P(h(a), h(b)) = \cosh^{-1}\left( -\langle a, b \rangle_L \right)    (5)

Nickel & Kiela (2018) show that optimizing the formulation in Eq. (5) in H^d is more stable numerically.

Duality between spherical and hyperbolic geometries: One can observe from Eq. (5) that preserving the order of Poincaré distances is equivalent to preserving the reverse order of Lorentzian inner products (defined in Eq. (2)), since the cosh^{-1} function is monotonically increasing on its domain [1, +∞). The relationship between the Poincaré metric and the Lorentzian inner product is actually similar to the relationship between the distance cos^{-1}(⟨p, q⟩) and the cosine ⟨p, q⟩ (or the squared Euclidean distance ‖p − q‖² = 2 − 2⟨p, q⟩) when p and q are on a unit hypersphere S^d, because of the duality between these geometries (Ratcliffe, 2006). The hyperboloid H^{d,β} can be seen as a half hypersphere of imaginary radius i√β. In the same way as kernel methods consider inner products in Hilbert spaces as similarity measures, we consider in this paper the Lorentzian inner product and its induced distance.
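A small NumPy sketch of Eqs. (3)-(5), checking numerically that the hyperboloid formulation matches the Poincaré distance for β = 1; the helper names (lift, h) and the test points are arbitrary choices of ours.

```python
import numpy as np

def poincare_dist(c, d):
    """Poincare distance on the open unit ball (Eq. 3)."""
    num = np.sum((c - d) ** 2)
    den = (1.0 - np.sum(c ** 2)) * (1.0 - np.sum(d ** 2))
    return np.arccosh(1.0 + 2.0 * num / den)

def h(a):
    """Map from the hyperboloid to the Poincare ball P^d (Eq. 4)."""
    tail = a[1:]
    return tail / (1.0 + np.sqrt(1.0 + tail @ tail))

def lift(tail, beta=1.0):
    """Lift a vector in R^d onto H^{d,beta} by setting a_0 = sqrt(beta + ||tail||^2)."""
    return np.concatenate(([np.sqrt(beta + tail @ tail)], tail))

a, b = lift(np.array([0.5, -0.2])), lift(np.array([-1.0, 0.8]))
lhs = np.arccosh(-(-a[0] * b[0] + a[1:] @ b[1:]))   # cosh^{-1}(-<a,b>_L), Eq. (5)
rhs = poincare_dist(h(a), h(b))
assert np.allclose(lhs, rhs)
```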

3. Lorentzian distance learning gβ(f1) < g β(f2) . The proof of the following theorem � � � � is given in the appendix: We present the (squared) Lorentzian distance function d L which has been studied in differential geometry but not Theorem 3.1 (Order of Euclidean norms). The following order is preserved for all a d,b d with h g where used to learn hyperbolic representations to the best of our ∈F ∈F ◦ β knowledge. Nonetheless, it was used in contexts where rep- h andg β are defined in Eq. (4) and Eq. (7), respectively: resentations are not hyperbolic (Liu et al., 2010; Sun et al., a < b h(g (a)) < h(g (b)) (10) 2015)(i.e. not constrained to belong to some hyperboloid). � � � � ⇐⇒ � β � � β � We give the formulation of the Lorentzian centroid when representations are hyperbolic and show that its Euclidean In conclusion, the Euclidean norms of examples can be norm, which is used as a proxy for depth in the hierarchy, compared equivalently in any space. This is particularly depends on the curvature 1/β. We show that studying useful if we want to study the Euclidean norm of centroids. − (squared) Lorentzian distances with a set of points is equiv- alent to studying the distance with their centroid. Moreover, 3.2. Centroid properties we exploit results in (Karcher, 1987) to explain why the The center of mass (here called centroid) is a concept in Lorentzian distance is a better choice than the Poincare´ statistics motivated in (Frechet´ , 1948) to estimate some sta- distance. Finally, we discuss optimization details. tistical dispersion of a set of points (e.g. the variance). It is the minimizer of an expectation of (squared) distances with 3.1. Lorentzian distance and mappings a set of points and was extended to Riemannian manifolds The squared Lorentzian distance (Ratcliffe, 2006) is defined in (Grove & Karcher, 1973). We now study the centroid for all paira d,β,b d,β as: w.r.t. the squared Lorentzian distance. Ideally, we would ∈H ∈H like the centroid of node representations to be (close to) the d2 (a,b) = a b 2 = 2β 2 a,b (6) L � − � L − − � � L representation of their lowest common ancestor. It satisfies all the axioms of a distance metric except the Lemma 3.2 (Center of mass of the Lorentzian inner prod- triangle inequality. uct). The point µ d,β that maximizes the following prob- ∈Hn d,β lem maxµ d,β i=1 νi xi,µ where i,x i , Mapping: Current hyperbolic machine learning models ex- ∈H � � L ∀ ∈H i,ν i 0, νi >0 is unique and formulated as: ploit d . They re-project at each iteration their learned rep- ∀ ≥ i � H resentations onto the Poincare´ ball (Nickel & Kiela, 2017) � n i=1 νixi or use the exponential map of the unit hyperboloid model µ= β n (11) d νixi (Nickel & Kiela, 2018; Gulcehre et al., 2019) to directly |� �i=1 �L| H � optimize on it. Since our approach does not necessarily � consider that β=1 , we consider the invertible mapping where a = a 2 is the modulus of the imaginary d d,β d,β |� �L| |� �L| gβ : , called a local parametrization of , Lorentzian norm of the positive time-like vector a. The F →H d H � defined for allf=(f 1, ,f d) as: proof is given in the appendix. The centroid formulation ··· ∈F in Eq. (11) generalizes the centroid formulations given in g (f) := ( f 2 +β,f , ,f ) d,β (7) β � � 1 ··· d ∈H (Galperin, 1993; Ratcliffe, 2006) for any number of points d and any value of the constant curvature 1/β. As mentioned in Section� 2.1, is the output space of our − d F d,β 1 model, i.e., R in practice. 
3.2. Centroid properties

The center of mass (here called centroid) is a concept in statistics motivated in (Fréchet, 1948) to estimate some statistical dispersion of a set of points (e.g. the variance). It is the minimizer of an expectation of (squared) distances with a set of points and was extended to Riemannian manifolds in (Grove & Karcher, 1973). We now study the centroid w.r.t. the squared Lorentzian distance. Ideally, we would like the centroid of node representations to be (close to) the representation of their lowest common ancestor.

Lemma 3.2 (Center of mass of the Lorentzian inner product). The point µ ∈ H^{d,β} that maximizes the problem max_{µ ∈ H^{d,β}} Σ_{i=1}^n ν_i ⟨x_i, µ⟩_L, where ∀i, x_i ∈ H^{d,β}, ∀i, ν_i ≥ 0, Σ_i ν_i > 0, is unique and formulated as:

    \mu = \sqrt{\beta}\ \frac{\sum_{i=1}^n \nu_i x_i}{\left| \left\| \sum_{i=1}^n \nu_i x_i \right\|_L \right|}    (11)

where |‖a‖_L| = √|‖a‖²_L| is the modulus of the imaginary Lorentzian norm of the positive time-like vector a. The proof is given in the appendix. The centroid formulation in Eq. (11) generalizes the centroid formulations given in (Galperin, 1993; Ratcliffe, 2006) to any number of points and any value of the constant curvature −1/β.

Theorem 3.3 (Centroid of the squared Lorentzian distance). The point µ ∈ H^{d,β} that minimizes the problem min_{µ ∈ H^{d,β}} Σ_{i=1}^n ν_i d²_L(x_i, µ), where ∀i, x_i ∈ H^{d,β}, ν_i ≥ 0, Σ_i ν_i > 0, is given in Eq. (11).

The proof exploits the formulation given in Eq. (6). The formulation of the centroid µ can be used to perform hard clustering where a uniform measure over the data in a cluster is assumed (i.e. ∀i, ν_i = 1/n, or equivalently in this context ∀i, ν_i = 1). One can see that the centroid of one example is the example itself. We now study its Euclidean norm.

Theorem 3.4. The Euclidean norm of the centroid of different points in P^d decreases as β > 0 decreases.

The proof is given in the appendix.
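A minimal NumPy sketch of the closed-form centroid of Eq. (11) with uniform weights, illustrating the behavior stated in Theorem 3.4; the random points and the projection onto the Poincaré ball (via Eq. (4)) are illustrative choices of ours.

```python
import numpy as np

def g(f, beta):
    return np.concatenate(([np.sqrt(f @ f + beta)], f))

def lorentz_centroid(xs, nu, beta):
    """Closed-form centroid w.r.t. the squared Lorentzian distance (Eq. 11)."""
    s = (nu[:, None] * xs).sum(axis=0)
    mod = np.sqrt(np.abs(-s[0] ** 2 + s[1:] @ s[1:]))   # | ||s||_L |
    return np.sqrt(beta) * s / mod

def to_ball(a):
    """Project onto the Poincare ball (Eq. 4) to compare Euclidean norms."""
    return a[1:] / (1.0 + np.sqrt(1.0 + a[1:] @ a[1:]))

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 2))
nu = np.ones(10) / 10.0
for beta in (1.0, 0.1, 0.01):      # Theorem 3.4: the norm shrinks as beta decreases
    mu = lorentz_centroid(np.stack([g(f, beta) for f in feats]), nu, beta)
    print(beta, np.linalg.norm(to_ball(mu)))
```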

(a) Poincaré distance   (b) β = 1, squared Lorentzian dist.   (c) β = 0.1, squared Lorentzian dist.
Figure 1. n = 10 examples are represented in green in a Poincaré ball. Their centroid w.r.t. d_H and d²_L for different values of β ∈ {1, 10^{-1}} (and ∀i ∈ {1, ..., n}, ν_i = 1/n) is in magenta, and the level sets represent the sum of the distances between the current point and the 10 examples. Smaller values of β induce smaller Euclidean norms of the centroid.

Fig. 1 illustrates the 2-dimensional Poincaré ball representation of the centroid of a set of 10 different points w.r.t. the Poincaré distance and the squared Lorentzian distance for different values of β > 0. One can see that the centroid w.r.t. the Poincaré metric does not have smaller norm. On the other hand, the Euclidean norm of the Lorentzian centroid does decrease as β decreases; it can then be enforced to be smaller by choosing lower values of β > 0. Centroids w.r.t. other hyperbolic distances are illustrated in the appendix.

We provide in the following some side remarks that are useful to understand the behavior of the Lorentzian distance.

Theorem 3.5 (Nearest point). Let B ⊆ H^{d,β} be a subset of H^{d,β}, and let µ ∈ H^{d,β} be the centroid of the set x_1, ..., x_n ∈ H^{d,β} w.r.t. the squared Lorentzian distance (see Theorem 3.3). We have the following relation:

    \arg\min_{b \in B} \sum_{i=1}^n \nu_i\, d^2_L(x_i, b) = \arg\min_{b \in B} d^2_L(\mu, b)    (12)

The theorem shows that distances to a set of points can be compared by using only one point, namely the centroid (a numerical check is given at the end of this subsection). We will show the interest of this theorem in the last experiment in Section 4.3.

Moreover, from the Cauchy-Schwarz inequality, as β tends to 0, the Lorentzian distance in Eq. (9) to f_1 tends to 0 for any vector f_2 that can be written f_2 = τ f_1 with τ ≥ 0. The distance is greater otherwise. Therefore, the Lorentzian distance to a set of points tends to be smaller along the ray that contains elements that can be written as τµ, where τ ≥ 0 and µ is their centroid (see illustrations in the appendix).

Curvature adaption: From Jacobi field properties, when the chosen metric is the Poincaré metric d_H on the unit hyperboloid H^d, the Hessian of d_H has the eigenvalue 0 along the radial direction. However, the eigenvalues of the Hessian are principal curvatures of the level surface. Since the vector field is more important, and to get a "better curvature adaption" (Karcher, 2014) (i.e. nonzero eigenvalues), Karcher (1987) recommends the "modified distance" (−1 + cosh(d_H)), which is equal to ½ d²_L on H^d.

Hyperbolic centroids in the literature: To the best of our knowledge, two other works exploit hyperbolic centroids. Sala et al. (2018) use a centroid related to the Poincaré metric but do not have a closed-form solution for it, so they compute it via gradient descent. Gulcehre et al. (2019) optimize a problem based on the Poincaré metric but exploit the gyro-centroid (Ungar, 2014), which belongs to another type of hyperbolic geometry. When the set contains only 2 points, it is the minimizer of the Einstein addition of the gyrodistances between it and the two points of the set by using the gyrotriangle inequality. Otherwise, it can be seen as a point which preserves left gyrotranslation. One limitation of the gyrocentroid is that it cannot be seen as a minimizer of an expectation of distances (unlike Fréchet means) since the Einstein addition is not commutative.
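To make Theorem 3.5 concrete, here is a small NumPy check that the candidate nearest to the centroid is also the candidate minimizing the sum of squared Lorentzian distances to the whole set; the set sizes, random points and helper names are arbitrary choices of ours.

```python
import numpy as np

def g(f, beta):
    return np.concatenate(([np.sqrt(f @ f + beta)], f))

def d2L(a, b, beta):
    """Squared Lorentzian distance on H^{d,beta} (Eq. 6)."""
    return -2.0 * beta - 2.0 * (-a[0] * b[0] + a[1:] @ b[1:])

def centroid(xs, beta):
    """Centroid of Eq. (11) with uniform weights."""
    s = xs.mean(axis=0)
    return np.sqrt(beta) * s / np.sqrt(abs(-s[0] ** 2 + s[1:] @ s[1:]))

beta = 0.1
rng = np.random.default_rng(1)
X = np.stack([g(f, beta) for f in rng.normal(size=(10, 3))])   # the set {x_i}
B = np.stack([g(f, beta) for f in rng.normal(size=(50, 3))])   # candidate points b
mu = centroid(X, beta)
sum_cost = [sum(d2L(x, b, beta) for x in X) for b in B]
mu_cost = [d2L(mu, b, beta) for b in B]
assert np.argmin(sum_cost) == np.argmin(mu_cost)               # Theorem 3.5
```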

3.3. Optimization and solver

Hyperbolic approaches that optimize the Poincaré distance on the hyperboloid (Nickel & Kiela, 2018), as defined in Eq. (5), exploit Riemannian stochastic gradient descent. This choice of optimizer is motivated by the fact that the domain of Eq. (5) cannot be considered to be a vector space (e.g. R^{d+1} × R^{d+1}). Indeed, d_H(a, b) is not defined for pairs a ∈ R^{d+1}, b ∈ R^{d+1} that satisfy ⟨a,b⟩_L > −1, due to the definition of cosh^{-1}. Therefore, the directional derivative of d_H lacks a suitable vector space structure, and standard optimizers cannot be used.

On the other hand, the squared Lorentzian distance relies only on the formulation of the Lorentzian inner product in Eq. (2), which is well defined and smooth for any pair a ∈ R^{d+1}, b ∈ R^{d+1}. We formulate our training loss functions as taking representations in F^d = R^d that are mapped to H^{d,β} via g_β : F^d → H^{d,β} and compared with the Lorentzian inner product, for which a directional derivative in vector space exists. Therefore, methods of real analysis apply (see Chapter 3.1.1 of (Absil et al., 2009)), and the chain rule is used to derive standard optimization techniques such as SGD. In the case of the unit hyperboloid model, the geodesic distance d_H (i.e. Poincaré) and the vector space distance d_L (i.e. Lorentzian) are increasing functions of each other due to the monotonicity of cosh^{-1}. Therefore, by trying to decrease the Lorentzian distance in vector space, the standard SGD also decreases the geodesic distance.

It is worth mentioning that, due to the formulation of g_β, the parameter space is not Euclidean and has a Riemannian structure. The direction in parameter space which gives the largest change in the objective per unit of change is parameterized by the inverse of the Fisher information matrix and is called the natural gradient (NG) (Amari, 1998). We have tried different optimizers such as the standard SGD with and without momentum, and Riemannian metric methods that approximate the NG (Ollivier, 2015). We experimentally observed that the standard SGD with momentum provides a good tradeoff in terms of running time and retrieval evaluation metrics. Although its convergence is slightly worse than NG approximation methods, it is much simpler to use since it does not require storing gradient matrices, and it returns better retrieval performance. Momentum methods are known to account for the curvature directions of the parameter space. The convergence comparison of the SGD and NG approximations for the ACM dataset is illustrated in Fig. 2. A detailed comparison of standard SGD and NG optimizers is provided in the appendix. In the following, we only use the standard SGD with momentum.

Figure 2. Training loss obtained by different solvers (standard SGD with and without momentum, outer-product NG, quasi-diagonal NG) as a function of the number of epochs for the ACM dataset.

4. Experiments

We evaluate the Lorentzian distance in three different tasks. The first two experiments consider the same evaluation protocol as the baselines and therefore do not exploit the formulation of the center of mass, which does not exist in closed form for the Poincaré metric. The first experiment considers similarity constraints based on subsumption in hierarchies. The second experiment performs binary classification to determine whether or not a test node belongs to a specific subtree of the hierarchy. The final experiment extracts a category hierarchy from a dataset where the taxonomy is partially provided.

4.1. Representation of hierarchies

The goal of the first experiment is to learn the hyperbolic representation of a hierarchy. To this end, we consider the same evaluation protocol as (Nickel & Kiela, 2018) and the same datasets. Each dataset can be seen as a directed acyclic graph (DAG) where a parent node links to its children and descendants in the tree to create a set of edges D. More exactly, Nickel & Kiela (2017) consider subsumption relations in a hierarchy (e.g. WordNet nouns). They consider the set D = {(u, v)} such that each node u is an ancestor of v in the tree. For each node u, a set of negative nodes N(u) is created: each example in N(u) is not a descendant of u in the tree. Their problem is then formulated as learning the representations {f_i}_{i=1}^n that minimize:

    L_D = - \sum_{(u,v) \in D} \log \frac{ e^{-d(f_u, f_v)} }{ \sum_{v' \in N(u) \cup \{v\}} e^{-d(f_u, f_{v'})} }    (13)

where d is the chosen distance function. Eq. (13) tries to get similar examples closer to each other than dissimilar ones. Nickel & Kiela (2017) and Nickel & Kiela (2018) optimize the Poincaré metric with different solvers. We replace d(f_u, f_v) by d²_L(g_β(f_u), g_β(f_v)) as defined in Eq. (9).
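The following PyTorch sketch shows one term of Eq. (13) with the squared Lorentzian distance, optimized with standard SGD with momentum (learning rate 0.1, momentum 0.9, as stated in the implementation details below); the embedding size, the indices and the helper names are illustrative and do not reproduce the authors' released implementation.

```python
import torch

def sq_lorentz_dist(f_u, f_v, beta):
    """Squared Lorentzian distance between g_beta(f_u) and g_beta(f_v) (Eq. 9)."""
    x0_u = torch.sqrt(f_u.pow(2).sum(-1) + beta)
    x0_v = torch.sqrt(f_v.pow(2).sum(-1) + beta)
    return -2.0 * (beta + (f_u * f_v).sum(-1) - x0_u * x0_v)

def loss_D(f_u, f_v, f_neg, beta):
    """Eq. (13) for one positive pair (u, v) and its sampled negatives N(u)."""
    d_pos = sq_lorentz_dist(f_u, f_v, beta)                    # scalar
    d_neg = sq_lorentz_dist(f_u.unsqueeze(0), f_neg, beta)     # (num_neg,)
    logits = -torch.cat([d_pos.reshape(1), d_neg])             # v is included in the denominator
    return d_pos + torch.logsumexp(logits, dim=0)

emb = torch.nn.Embedding(100, 10)                              # f_i in F^d = R^10 (illustrative)
opt = torch.optim.SGD(emb.parameters(), lr=0.1, momentum=0.9)  # standard SGD with momentum
u, v, neg = torch.tensor(0), torch.tensor(1), torch.tensor([2, 3, 4])
loss = loss_D(emb(u), emb(v), emb(neg), beta=0.01)
opt.zero_grad(); loss.backward(); opt.step()
```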
As in (Nickel & Kiela, 2018), the quality of the learned embeddings is measured with the following metrics: the mean rank (MR), the mean average precision (MAP) and the Spearman rank-order correlation ρ (SROC) between the Euclidean norms of the learned embeddings and the normalized ranks of nodes. Following (Nickel & Kiela, 2018), the normalized rank of a node u is formulated as:

    \mathrm{rank}(u) = \frac{sp(u)}{sp(u) + lp(u)} \in [0, 1]    (14)

where lp(u) is the longest path between u and one of its descendant leaves, and sp(u) is the shortest path from the root of the hierarchy to u. A leaf in the hierarchy tree then has a normalized rank equal to 1, and nodes closer to the root have smaller normalized ranks. ρ is measured as the SROC between the list of Euclidean norms of the embeddings and their normalized ranks.

Table 1. Evaluation of Taxonomy embeddings. MR: Mean Rank (lower is better). MAP: Mean Average Precision (higher is better). ρ: Spearman’s rank-order correlation (higher is better).

Dataset         Metric   d_P in P^d              d_P in H^d              Ours          Ours         Ours        Ours
                         optimizer proposed in   optimizer proposed in   β=0.01, λ=0   β=0.1, λ=0   β=1, λ=0    β=0.01, λ=0.01
                         (Nickel & Kiela, 2017)  (Nickel & Kiela, 2018)

WordNet Nouns   MR       4.02                    2.95                    1.46          1.59         1.72        1.47
                MAP      86.5                    92.8                    94.0          93.5         91.5        94.7
                ρ        58.5                    59.5                    40.2          45.2         43.1        71.1

WordNet Verbs   MR       1.35                    1.23                    1.11          1.14         1.23        1.13
                MAP      91.2                    93.5                    94.6          93.7         91.9        94.0
                ρ        55.1                    56.6                    36.8          38.7         37.2        73.0

EuroVoc         MR       1.23                    1.17                    1.06          1.06         1.09        1.06
                MAP      94.4                    96.5                    96.5          96.0         95.0        96.1
                ρ        61.4                    67.5                    41.8          44.2         45.6        61.7

ACM             MR       1.71                    1.63                    1.03          1.06         1.16        1.04
                MAP      94.8                    97.0                    98.8          96.9         94.1        98.1
                ρ        62.9                    65.9                    53.9          55.9         46.7        66.4

MeSH            MR       12.8                    12.4                    1.31          1.30         1.40        1.33
                MAP      79.4                    79.9                    90.1          90.5         85.5        90.3
                ρ        74.9                    76.3                    46.1          47.2         41.5        78.7

Our optimization problem learns representations {f_i ∈ F^d}_{i=1}^n that minimize the following problem:

    L_D + \lambda \sum_{(u,v)\,:\,\mathrm{rank}(u) < \mathrm{rank}(v)} \max\left( \|f_u\|^2 - \|f_v\|^2,\ 0 \right)    (15)

where λ ≥ 0 is a regularization parameter and the normalized ranks are defined in Eq. (14).

Datasets: We use the taxonomies listed in Table 1, which include the EuroVoc and ACM taxonomies, Medical Subject Headings (MeSH) (Rogers, 1963), a medical thesaurus provided by the U.S. National Library of Medicine, and WordNet (Miller, 1998), a large lexical database. As in (Nickel & Kiela, 2018), we consider the noun and verb hierarchies of WordNet. More details about these datasets can be found in (Nickel & Kiela, 2018).

Implementation details: Following (Nickel & Kiela, 2017), we implemented our method in PyTorch 0.3.1, building on the source code available at https://github.com/facebookresearch/poincare-embeddings. We use the standard SGD optimizer with a learning rate of 0.1 and a momentum of 0.9. For the largest datasets, WordNet Nouns and MeSH, we stop training after 1500 epochs; we stop training at 3000 epochs for the other datasets. The mini-batch size is 50, and the number of sampled negatives per example is 50. The weights of the embeddings are initialized from a continuous uniform distribution.
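The rank regularizer added in Eq. (15) is a simple hinge on squared Euclidean norms; a minimal PyTorch sketch (the function name is ours):

```python
import torch

def rank_regularizer(f_u, f_v):
    """Hinge penalty of Eq. (15) for a pair with rank(u) < rank(v):
    positive whenever the higher-level node u has a larger Euclidean norm than v."""
    return torch.clamp(f_u.pow(2).sum(-1) - f_v.pow(2).sum(-1), min=0.0)

f_parent, f_child = torch.tensor([0.9, 0.2]), torch.tensor([0.3, 0.1])
print(rank_regularizer(f_parent, f_child))   # > 0: the pair violates the desired norm order
```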

- Case where λ = 0: Our approach obtains better mean rank and mean average precision scores than (Nickel & Kiela, 2018) for small values of β when we use the same constraints. The fact that the retrieval performance of our approach changes with different values of β ∈ {0.01, 0.1, 1} shows that the curvature of the space has an impact on the distances between examples and on the behavior of the model. As explained in Section 3.2, the squared Lorentzian distance tends to behave more radially (i.e. the distance tends to decrease along the ray that can be written τµ, where τ ≥ 0 and µ is the centroid) as β > 0 decreases. Children then tend to have larger Euclidean norms than their parents while being close w.r.t. the Lorentzian distance. We evaluate in the appendix the percentage of nodes that have a Euclidean norm greater than that of their parent in the tree. More than 90% of pairs (parent, child) satisfy the desired order of Euclidean norms. The percentage increases with smaller values of β, which illustrates our point on the impact of the curvature on the Euclidean norm of the center of mass.

On the other hand, our approach obtains worse performance for the ρ metric, which evaluates how the order of the Euclidean norms is correlated with the normalized rank/depth of nodes in the hierarchy tree. This result is expected due to the formulation of the constraints of L_D, which considers only local similarity orders between pairs of examples. The loss L_D does not take into account the global structure of the tree; it only considers whether pairs of concepts subsume each other or not. The worse ρ performance of the Lorentzian distance d_L could be due to the fact that d_L does not take into account the global structure of the hierarchy tree but does tend to preserve local structure (as shown in Table 3).

- Case where λ > 0: As a consequence of the worse ρ performance, we evaluate the performance of our model when including normalized rank information during training. As can be seen in Table 1, this improves the ρ performance and outperforms the baselines (Nickel & Kiela, 2017; 2018) for all the evaluation metrics on some datasets. The mean rank and mean average precision performances remain comparable with the case where λ = 0. Global rank information can then be exploited during training without having a significant impact on retrieval performance. Increasing λ even more can improve ρ.

In conclusion, we have shown that the Lorentzian distance outperforms the standard Poincaré distance for retrieval (i.e. mean rank and MAP). In particular, retrieval performance can be improved by tuning the hyperparameter β.

4.2. Binary classification

Another task of interest for hyperbolic representations is to determine whether a given node belongs to a specific subtree of the hierarchy or not. We follow the same binary classification protocol as (Ganea et al., 2018) on the same datasets. We describe their protocol below.

(Ganea et al., 2018) extract some pre-trained hyperbolic embeddings of the WordNet nouns hierarchy; these representations are learned with the Poincaré metric. They then consider four subtrees whose roots are the following synsets: animal.n.01, group.n.01, worker.n.01 and mammal.n.01. For each subtree, they consider that every node that belongs to it is positive and all the other nodes of WordNet nouns are negative. They then select 80% of the positive nodes for training and the rest for test, and they select the same percentage of negative nodes for training and test. At test time, they evaluate the F1 score of the binary classification performance. They do it for 3 different training/test splits. We refer to (Ganea et al., 2018) for details on the baselines.

Our goal is to evaluate the relevance of the Lorentzian distance to represent hierarchical datasets. We therefore use the embeddings trained with our approach (with β = 0.01 and different values of λ) and classify a test node by assigning it to the category of the nearest training example w.r.t. the Lorentzian distance (with β = 1). The test performance of our approach is reported in Table 2. It outperforms the classification performance of the baselines, which are based on the Poincaré or Euclidean distance. This shows that our embeddings can also be used to perform classification.

Table 2. Test F1 classification scores for four different subtrees of the WordNet noun tree.

Dataset                   animal.n.01      group.n.01       worker.n.01       mammal.n.01
(Ganea et al., 2018)      99.26 ± 0.59%    91.91 ± 3.07%    66.83 ± 11.83%    91.37 ± 6.09%
Euclidean dist            99.36 ± 0.18%    91.38 ± 1.19%    47.29 ± 3.93%     77.76 ± 5.08%
log_0 + Eucl              98.27 ± 0.70%    91.41 ± 0.18%    36.66 ± 2.74%     56.11 ± 2.21%
Ours (β = λ = 0.01)       99.57 ± 0.24%    99.75 ± 0.11%    94.50 ± 1.21%     96.65 ± 1.18%
Ours (β = 0.01, λ = 0)    99.77 ± 0.17%    99.86 ± 0.03%    96.32 ± 1.05%     97.73 ± 0.86%
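A minimal NumPy sketch of the nearest-neighbour rule used above (a test node is assigned the label of its closest training node under the squared Lorentzian distance); the embeddings and labels below are random placeholders of ours, not the trained WordNet embeddings.

```python
import numpy as np

def g(f, beta):
    """Lift each row of f onto H^{d,beta} (Eq. 7)."""
    return np.concatenate([np.sqrt((f ** 2).sum(-1, keepdims=True) + beta), f], axis=-1)

def d2L_matrix(A, B, beta):
    """Pairwise squared Lorentzian distances between rows of A and rows of B (Eq. 6)."""
    inner = -np.outer(A[:, 0], B[:, 0]) + A[:, 1:] @ B[:, 1:].T
    return -2.0 * beta - 2.0 * inner

beta = 1.0                                   # the classification step above uses beta = 1
rng = np.random.default_rng(0)
train_f, train_y = rng.normal(size=(80, 10)), rng.integers(0, 2, 80)   # toy embeddings/labels
test_f = rng.normal(size=(20, 10))
D = d2L_matrix(g(test_f, beta), g(train_f, beta), beta)
pred = train_y[np.argmin(D, axis=1)]         # label of the nearest training example
```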

4.3. Automatic hierarchy extraction

The previous experiments compare the performances of the Lorentzian distance and the standard Poincaré distance but do not exploit the closed-form expression of the center of mass, since we use the same evaluation protocols as the baselines (i.e. we use the same pairwise constraints). We now exploit the formulation of our centroids to efficiently extract the category hierarchy of a dataset whose taxonomy is partially provided. In particular, we test our approach on the CIFAR-100 dataset (Krizhevsky & Hinton, 2009), which contains 100 classes/categories of 600 images each. The 100 classes are grouped into 20 superclasses (e.g. the classes apples, mushrooms, oranges, pears and sweet peppers belong to the superclass fruits and vegetables); the detailed class hierarchy of CIFAR-100 is available at https://www.cs.toronto.edu/~kriz/cifar.html. We consider the hierarchy formed by these superclasses to learn hyperbolic representations with a neural network.

Figure 3. Taxonomy extracted for CIFAR-100 by optimizing Eq. (18) with the Lorentzian distance (left) and the Euclidean distance (right).

Let {C_c}_{c=1}^k be the set of categories (with k = 100) and {A_a}_{a=1}^κ be the set of superclasses (with κ = 20) of the dataset. We formulate the posterior probabilities:

    p(C_c \mid x_i) = \frac{ \exp(-d(x_i, \mu_{C_c})) }{ \sum_{m=1}^k \exp(-d(x_i, \mu_{C_m})) }    (16)

    p(A_a \mid x_i) = \frac{ \exp(-d(x_i, \mu_{A_a})) }{ \sum_{e=1}^{\kappa} \exp(-d(x_i, \mu_{A_e})) }    (17)

where d is the chosen distance. Our goal is to learn the representations x_i and µ that minimize the following loss:

    - \sum_{x_i \in C_c} \log p(C_c \mid x_i) \;-\; \alpha \sum_{x_i \in A_a} \log p(A_a \mid x_i)    (18)

where α ≥ 0 is a regularization parameter. Eq. (18) tries to group samples which belong to the same (super-)class into a single cluster. It corresponds to the supervised clustering task (Law et al., 2016; 2019). Following works on clustering, the optimal values of the different µ are the centers of mass of the elements belonging to the respective (super-)classes. If the chosen metric is the squared Lorentzian distance, the optimal value of µ_{C_c} is:

    \mu_{C_c} = \sqrt{\beta}\ \frac{ \sum_{i=1}^n \nu_i x_i }{ \left| \left\| \sum_{i=1}^n \nu_i x_i \right\|_L \right| } \quad \text{s.t.} \quad \nu_i = 1_{C_c}(x_i)    (19)

where 1 is the indicator function and n = 600 is the size of the mini-batch. We trained a convolutional neural network φ so that its output for the i-th image is f_i ∈ F^d and x_i = g_β(f_i) (see details in the appendix). One advantage of our loss function is that its complexity is linear in the number of examples and linear in the number of (super-)classes. The algorithm is then more efficient than pairwise constraint losses such as Eq. (13), where the cardinality of the set of negative nodes N(u) is so large that it has to be subsampled (e.g. with the sampling strategy in (Jean et al., 2015)).

From Theorem 3.5, each category can be represented by a single centroid. We then trained a neural network with output dimensionality d = 10 on the whole dataset and extracted the superclass centroids, on which we applied hierarchical agglomerative clustering based on complete-linkage clustering (Defays, 1977). Fig. 3 illustrates the taxonomies extracted when the chosen distance d is the squared Lorentzian distance (left) or the squared Euclidean distance (right). The hierarchical clusters extracted with hyperbolic representations are more natural. For instance, insects are closer to flowers with the Lorentzian distance. This is explained by the fact that bees (known for their role in pollination) are in the insects superclass. Trees are also closer to outdoor scenes and things with the Lorentzian distance. The reptile superclass contains aquatic animals (e.g. turtles and crocodiles), which is why it is close to fish and aquatic mammals.
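A PyTorch sketch of the class term of Eqs. (16)-(19), using the closed-form class centroids of Eq. (19); the batch construction, label handling and dimensions are illustrative assumptions of ours, not the authors' training code (the actual model is a CNN described in their appendix), and the superclass term of Eq. (18) (weighted by α) is handled identically.

```python
import torch

def g(f, beta):
    """Map network outputs f in F^d onto H^{d,beta} (Eq. 7)."""
    x0 = torch.sqrt(f.pow(2).sum(-1, keepdim=True) + beta)
    return torch.cat([x0, f], dim=-1)

def lorentz_centroid(x, beta):
    """Closed-form centroid of the rows of x (Eq. 19, indicator weights)."""
    s = x.sum(dim=0)
    mod = torch.sqrt(torch.abs(-s[0] ** 2 + s[1:].pow(2).sum()))
    return beta ** 0.5 * s / mod

def d2L(x, mu, beta):
    """Squared Lorentzian distance between each row of x and the centroid mu (Eq. 6)."""
    inner = -x[:, 0] * mu[0] + (x[:, 1:] * mu[1:]).sum(-1)
    return -2.0 * beta - 2.0 * inner

beta = 0.01
f = torch.randn(600, 10)                  # CNN outputs for one mini-batch (illustrative)
labels = torch.arange(600) % 20           # one of 20 superclasses per image (illustrative)
x = g(f, beta)
mus = torch.stack([lorentz_centroid(x[labels == c], beta) for c in range(20)])
dists = torch.stack([d2L(x, mu, beta) for mu in mus], dim=1)     # (600, 20)
log_p = torch.log_softmax(-dists, dim=1)                         # Eqs. (16)-(17)
loss = -log_p[torch.arange(600), labels].mean()                  # one term of Eq. (18)
```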

5. Conclusion

In this paper, we proposed a distance learning approach based on the Lorentzian distance to represent hierarchically-structured datasets. Unlike most of the literature, which considers the unit hyperboloid model, we show that the performance of the learned model can be improved by decreasing the curvature of the chosen hyperboloid model. We give a formulation of the centroid w.r.t. the squared Lorentzian distance as a function of the curvature, and we show that the Euclidean norm of its projection in the Poincaré ball decreases as the curvature decreases. Hierarchy constraints are generally formulated such that high-level nodes are similar to all their descendants, and thus their representation should be close to the centroid of the descendants. Decreasing the curvature implicitly enforces high-level nodes to have smaller Euclidean norm than their descendants and is therefore more appropriate for learning representations of hierarchically-structured datasets.

Acknowledgements

We thank Kuan-Chieh Wang and the anonymous reviewers for helpful feedback on early versions of this manuscript. This work was supported by Samsung and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

References

P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Daniel Defays. An efficient algorithm for a complete link method. The Computer Journal, 20(4):364–366, 1977.

Albert Einstein. Zur Elektrodynamik bewegter Körper. Annalen der Physik, 322(10):891–921, 1905.

Maurice Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst. H. Poincaré, 10(3):215–310, 1948.

GA Galperin. A concept of the mass center of a system of material points in the constant curvature spaces. Communications in Mathematical Physics, 154(1):63–84, 1993.

Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, 2018.

Mikhael Gromov. Hyperbolic groups. In Essays in Group Theory, pp. 75–263. Springer, 1987.

Karsten Grove and Hermann Karcher. How to conjugate C1-close group actions. Mathematische Zeitschrift, 132(1):11–20, 1973.

Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, and Nando de Freitas. Hyperbolic attention networks. In International Conference on Learning Representations, 2019.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pp. 1–10, 2015.

Hermann Karcher. Riemannian comparison constructions. SFB 256, 1987.

Hermann Karcher. Riemannian center of mass and so called Karcher mean. arXiv preprint arXiv:1407.2087, 2014.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Marc T Law, Yaoliang Yu, Matthieu Cord, and Eric P Xing. Closed-form training of Mahalanobis distance for supervised clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3909–3917, 2016.

Marc T Law, Jake Snell, Amir-massoud Farahmand, Raquel Urtasun, and Richard S Zemel. Dimensionality reduction for representing the knowledge of probabilistic models. In International Conference on Learning Representations, 2019.

John M Lee. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media, 2006.

Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995.

Risheng Liu, Zhouchen Lin, Zhixun Su, and Kewei Tang. Feature extraction by learning Lorentzian metric tensor and its extensions. Pattern Recognition, 43(10):3298–3306, 2010.

George Miller. WordNet: An electronic lexical database. MIT Press, 1998.

Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6338–6347, 2017.

Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning, 2018.

Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 4(2):108–153, 2015.

John Ratcliffe. Foundations of hyperbolic manifolds, volume 149. Springer Science & Business Media, 2006.

Frank B Rogers. Medical subject headings. Bulletin of the Medical Library Association, 51:114–116, 1963.

Frederic Sala, Chris De Sa, Albert Gu, and Christopher Ré. Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, pp. 4457–4466, 2018.

John Shawe-Taylor, Nello Cristianini, et al. Kernel methods for pattern analysis. Cambridge University Press, 2004.

Ke Sun, Jun Wang, Alexandros Kalousis, and Stéphane Marchand-Maillet. Space-time local embeddings. In Advances in Neural Information Processing Systems, pp. 100–108, 2015.

Abraham Albert Ungar. Barycentric calculus in Euclidean and hyperbolic geometry: A comparative introduction. World Scientific, 2010.

Abraham Albert Ungar. Analytic hyperbolic geometry in n dimensions: An introduction. CRC Press, 2014.

Vladimir Varicak. Beiträge zur nichteuklidischen Geometrie. Jahresbericht der Deutschen Mathematiker-Vereinigung, 17:70–83, 1908.

Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 2017 ACM on Multimedia Conference, pp. 1041–1049. ACM, 2017.