Are Hyperbolic Representations in Graphs Created Equal?
Max Kochurov 1   Sergey Ivanov 2   Eugeny Burnaev 1

1 Skolkovo Institute of Science and Technology, Russia. 2 Criteo AI Lab, France. Correspondence to: Max Kochurov <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Abstract

Recently there has been increasing interest in applications of graph neural networks in non-Euclidean geometry; however, are non-Euclidean representations always useful for graph learning tasks? For different problems, such as node classification and link prediction, we compute hyperbolic embeddings and conclude that for tasks that require global prediction consistency it might be useful to use non-Euclidean embeddings, while for other tasks Euclidean models are superior. To do so, we first fix an issue of the existing models associated with the optimization process at zero curvature. Current hyperbolic models deal with gradients at the origin in an ad-hoc manner, which is inefficient and can lead to numerical instabilities. We resolve the instabilities of the κ-Stereographic model at zero curvature and evaluate the approach of embedding graphs into the manifold on several graph representation learning tasks.

1. Introduction

Hierarchies in data are common in real-world settings and can be observed in many scenarios. For example, languages have relations between words and contextual dependencies within a sentence. Words can be viewed as entities, and one may define natural type dependencies on them. These dependencies may be entirely arbitrary, and different relation graphs may exist for the same dictionary.

In general, there are several settings in graph problems: knowledge graph completion (Balažević et al., 2019), node classification and link prediction (Liu et al., 2019; Chami et al., 2019; Bachmann et al., 2020), or graph embedding (Gu et al., 2019). While Euclidean baselines are strong, there is a general trend of making hyperbolic embeddings more efficient and interpretable.

In this work we highlight the shortcomings of past research on hyperbolic deep learning for graph data. We review the model from (Bachmann et al., 2020) and fix it to be more robust in the zero-curvature case. Furthermore, we evaluate the κ-Stereographic model on node classification, link prediction, graph classification, and graph embedding problems.

2. Preliminaries

2.1. Base Model

The κ-Stereographic model M^n_κ (Bachmann et al., 2020) is a unification of the constant curvature manifolds: the hyperboloid and the sphere. The model is a Riemannian manifold of constant sectional curvature κ and dimension n. Besides defining the curvature, the parameter κ determines how parameters are constrained, which is crucial for optimization algorithms:

$$\mathcal{M}^n_\kappa = \begin{cases} \left(\{x \in \mathbb{R}^n : \|x\|_2 < 1/\sqrt{-\kappa}\},\ \lambda^\kappa_x\right) & \kappa < 0 \\ \left(\mathbb{R}^n,\ \lambda^\kappa_x\right) & \kappa \ge 0 \end{cases} \qquad (1)$$

where $\lambda^\kappa_x = \frac{2}{1 + \kappa\|x\|_2^2}$ is the conformal (angle-preserving) metric tensor at point x.

Figure 1. Stereographic projection geodesics for the sphere (a: κ = 1) and the hyperboloid (b: κ = −1).
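To make the domain constraint in equation (1) and the conformal factor concrete, the following minimal PyTorch sketch (our own illustration; the function names and the margin constant EPS are assumptions, not part of the reference implementation) computes λ^κ_x and projects points back into the open ball of radius 1/sqrt(−κ) when κ < 0.

import torch

EPS = 1e-5  # margin keeping points strictly inside the open ball

def conformal_factor(x: torch.Tensor, kappa: float) -> torch.Tensor:
    # lambda_x^kappa = 2 / (1 + kappa * ||x||_2^2), see equation (1)
    return 2.0 / (1.0 + kappa * x.pow(2).sum(dim=-1, keepdim=True))

def project_to_domain(x: torch.Tensor, kappa: float) -> torch.Tensor:
    # For kappa < 0 the model lives in the open ball of radius 1/sqrt(-kappa);
    # for kappa >= 0 the domain is all of R^n and no projection is needed.
    if kappa >= 0:
        return x
    max_norm = (1.0 - EPS) / (-kappa) ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-15)
    scale = torch.where(norm > max_norm, max_norm / norm, torch.ones_like(norm))
    return x * scale

This projection is exactly the step that becomes necessary after a curvature update (Section 2.3): when κ becomes more negative, the ball radius shrinks while the stored parameter values stay fixed.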
For optimization we need the exponential map (gradient update) and parallel transport (momentum update). For the positive and negative curvature cases they are well defined. The exponential map is the function that performs unit-time travel along the geodesic (straight) line from a given point, i.e. exp^κ_x : T_x M^n_κ → M^n_κ. For the κ-Stereographic model the exponential map is defined as follows:

$$\exp^\kappa_x(u) = x \oplus_\kappa \left( \tan_\kappa\!\big(\|u\|_x / 2\big)\, \frac{u}{\|u\|_2} \right), \qquad (2)$$

where $\|u\|_x = \lambda^\kappa_x \|u\|_2$, the Möbius addition $\oplus_\kappa$ is defined as

$$x \oplus_\kappa y = \frac{\big(1 - 2\kappa\langle x, y\rangle - \kappa\|y\|_2^2\big)\,x + \big(1 + \kappa\|x\|_2^2\big)\,y}{1 - 2\kappa\langle x, y\rangle + \kappa^2\|x\|_2^2\,\|y\|_2^2}, \qquad (3)$$

and $\tan_\kappa$ as

$$\tan_\kappa(x) = \begin{cases} \kappa^{-1/2}\tan(x\,\kappa^{1/2}) & \kappa > 0 \\ x & \kappa = 0 \\ |\kappa|^{-1/2}\tanh(x\,|\kappa|^{1/2}) & \kappa < 0 \end{cases} \qquad (4)$$

Parallel transport (e.g. for the momentum update in SGD) along a geodesic looks like

$$P^\kappa_{x\to y}(v) = \operatorname{gyr}[y, -x]\,v\,\lambda^\kappa_x / \lambda^\kappa_y, \qquad (5)$$

where the gyration $\operatorname{gyr}[u, v]w$ is defined as

$$\operatorname{gyr}[u, v]w = \big(-(u \oplus_\kappa v)\big) \oplus_\kappa \big(u \oplus_\kappa (v \oplus_\kappa w)\big). \qquad (6)$$

Computing distances and similarity measures on the κ-Stereographic manifold is used in the final classification layers of neural networks. Importantly, the distance on the manifold is defined as

$$d_\kappa(x, y) = 2\tan_\kappa^{-1}\!\big(\|(-x) \oplus_\kappa y\|_2\big), \qquad (7)$$

the Gromov product, an extension of the inner product to manifolds, as

$$(x, y)_r = \big(d_\mathcal{M}(x, r)^2 + d_\mathcal{M}(y, r)^2 - d_\mathcal{M}(x, y)^2\big)/2, \qquad (8)$$

and the distance to the hyperplane $\tilde H^\kappa_{a,p}$ as

$$d_\kappa(x, \tilde H^\kappa_{a,p}) = \inf_{w \in \tilde H^\kappa_{a,p}} d_\kappa(x, w) \qquad (9)$$

$$= \sin_\kappa^{-1}\!\left( \frac{2\,\big|\langle (-p) \oplus_\kappa x,\ a\rangle\big|}{\big(1 + \kappa\|(-p) \oplus_\kappa x\|_2^2\big)\,\|a\|_2} \right), \qquad (10)$$

where

$$\sin_\kappa(x) = \begin{cases} \kappa^{-1/2}\sin(x\,\kappa^{1/2}) & \kappa > 0 \\ x & \kappa = 0 \\ |\kappa|^{-1/2}\sinh(x\,|\kappa|^{1/2}) & \kappa < 0 \end{cases} \qquad (11)$$

2.2. Fixing Missing Gradients

Equations (4) and (11) are incomplete: at the point of zero curvature the gradient with respect to κ is not well defined. However, we can generalize them by taking the left and right Taylor expansions, in which the gradients with respect to κ appear:

$$\tan_\kappa(x) \underset{\kappa=+\varepsilon}{\approx} x + \tfrac{1}{3}\kappa x^3 + \tfrac{2}{15}\kappa^2 x^5, \qquad (12)$$

$$\tan_\kappa(x) \underset{\kappa=-\varepsilon}{\approx} x - \tfrac{1}{3}(-\kappa) x^3 + \tfrac{2}{15}(-\kappa)^2 x^5 = x + \tfrac{1}{3}\kappa x^3 + \tfrac{2}{15}\kappa^2 x^5. \qquad (13)$$

Both one-sided expansions coincide, so the complete formula in equation (4) for tan_κ(x) that is differentiable at κ = 0 should be the following:

$$\tan_\kappa(x) = \begin{cases} \kappa^{-1/2}\tan(x\,\kappa^{1/2}) & \kappa > 0 \\ x + \tfrac{1}{3}\kappa x^3 + \tfrac{2}{15}\kappa^2 x^5 & \kappa = 0 \\ |\kappa|^{-1/2}\tanh(x\,|\kappa|^{1/2}) & \kappa < 0 \end{cases} \qquad (14)$$

where the $\tfrac{2}{15}\kappa^2 x^5$ term is optional and is required only for higher-order gradients. Taylor expansions for $\tan_\kappa^{-1}$ and the functions involved in computing distances to hyperplanes may be found in Appendix B.
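The following minimal PyTorch sketch illustrates the fix in equation (14); it is our own illustration under stated assumptions (the function name, the eps threshold, and the rule for switching to the Taylor branch are ours, not the authors' code). Inside a small neighbourhood of κ = 0 it uses the shared Taylor expansion, so autodiff produces a finite gradient with respect to κ at zero curvature.

import torch

def tan_kappa(x: torch.Tensor, kappa: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # Equation (14): for |kappa| < eps use the shared Taylor expansion,
    # so the function stays differentiable in kappa at zero curvature.
    taylor = x + kappa * x.pow(3) / 3.0 + (2.0 / 15.0) * kappa.pow(2) * x.pow(5)
    k = kappa.abs().clamp_min(eps).sqrt()  # clamp avoids 0/0 in the unused exact branch
    exact = torch.where(kappa > 0, torch.tan(x * k) / k, torch.tanh(x * k) / k)
    return torch.where(kappa.abs() < eps, taylor, exact)

# Usage: the gradient with respect to kappa is now defined at zero curvature.
kappa = torch.zeros(1, requires_grad=True)
x = torch.linspace(-0.5, 0.5, 8)
tan_kappa(x, kappa).sum().backward()
print(kappa.grad)  # finite value (no NaN), unlike the raw three-branch form of eq. (4)

An analogous Taylor branch is needed for $\tan_\kappa^{-1}$ and $\sin_\kappa^{-1}$ used in equations (7) and (10); the corresponding expansions are the ones referenced above as Appendix B.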
2.3. Curvature optimization

Curvature optimization in hyperbolic models is an important challenge. The optimization step changes the geometry of the space, and the parameter constraints also depend on the curvature parameter. Therefore the parameters and the curvature are tied together, and a curvature update cannot be performed without parameter updates.

We are interested in optimizing parameters {p_i} of the model that lie in the κ-Stereographic model. Formally, the optimization problem is

$$\min_{\kappa \in \mathbb{R},\ p_i \in \mathcal{M}^n_\kappa} L(\kappa, \{p_i\}). \qquad (15)$$

To work with the parameters {p_i} on the manifold, they are represented as tuples of numbers in some chart, i.e. a numerical parametrization of the manifold. One cannot make independent parameter updates. It is fair to note that all curvatures represent the same model, and one can be obtained from another by an isomorphism. However, this is not the case for product manifolds, and numerical precision differs for different curvatures (Gu et al., 2019; Sa et al., 2018).

Model decomposition changes the loss landscape, and this was shown to improve performance (Salimans & Kingma, 2016). Parameter constraints vary with κ, as seen in equation (1). Once a negative κ changes, the numerical representation of the parameters may no longer satisfy the boundary conditions, because the ball's radius changes (equation (1)) while the parameters remain fixed. In order to satisfy the constraints, we should project these parameters back onto the domain. There are a few alternatives for performing optimization in such a complex parameter space.

Alternating optimization. One way to perform optimization is to alternate the parameter updates: in one step we optimize κ and project p_i, in the other step we optimize p_i. This partially resolves the inconsistency but introduces another problem. After a κ update, all mutual distances are updated and perturbed. Therefore the parameter distances to the origin change, and the momentum for these parameters corresponds to a new tangent space, which does not match the old one. From one point of view, it is just a vector space around a specific point; from another, it is attached to a point on a concrete manifold with its metric tensor. After curvature updates, the metric tensor changes as well. It is not obvious how to take this into account. However, simply ignoring both problems worked sufficiently well in practice.

Joint optimization. The purpose of alternating optimization was to decouple the curvature and the parameters. However, it is not practical: the forward model pass is done twice, once per parameter group. More efficiently, we can compute and apply the gradients once for all parameter groups. There are two options for how to do that: one is to update κ first and then p_i, the other is vice versa. In the former case we have biased gradients for the parameters, in the latter for the curvature. The bias appears because we separate the updates into two blocks. In parameter updates, the curvature affects the exponential map.

3.1. Graph Embedding

As shown in (Sa et al., 2018), Euclidean space cannot capture all tree-like data structures, while embeddings in hyperbolic space can represent any tree without significant distortion. Other works (Gu et al., 2019) successfully applied mixed-curvature spaces to graph embeddings and showed that non-Euclidean spaces are good at capturing relations in real-world graphs. The analysis of mixed-curvature spaces is still not explored to its full extent: the curvature signs have to be specified before training and cannot be changed during training, and this limitation requires a costly grid search to tune hyperparameters. The graph embedding problem is not self-contained enough to serve as a real-world example, but it clearly shows the advantages and disadvantages of Riemannian optimization compared to Euclidean optimization in terms of the number of iterations and running time. It allows us to compare purely the optimization methods and the loss function rather than the complexity of neural networks.
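For illustration, below is a small, self-contained PyTorch sketch of the graph embedding setup. It is our own toy example: the graph, the dimensionality, the learning rate, the number of steps, and the relative-distortion objective (written in the spirit of Gu et al., 2019) are assumptions, not the paper's exact experimental configuration. It trains free embeddings in the κ = −1 ball by matching the manifold distance of equation (7) to graph shortest-path distances, using a plain Euclidean optimizer on the chart coordinates followed by a projection step.

import torch

def mobius_add(x, y, kappa=-1.0):
    # Equation (3) with a fixed curvature value
    xy = (x * y).sum(dim=-1, keepdim=True)
    x2 = x.pow(2).sum(dim=-1, keepdim=True)
    y2 = y.pow(2).sum(dim=-1, keepdim=True)
    num = (1 - 2 * kappa * xy - kappa * y2) * x + (1 + kappa * x2) * y
    den = 1 - 2 * kappa * xy + kappa ** 2 * x2 * y2
    return num / den.clamp_min(1e-15)

def dist(x, y, kappa=-1.0):
    # Equation (7) at kappa = -1: d(x, y) = 2 * artanh(||(-x) (+)_kappa y||_2)
    d = mobius_add(-x, y, kappa).norm(dim=-1)
    return 2.0 * torch.atanh(d.clamp(max=1 - 1e-7))

# Toy graph: pairwise shortest-path distances of a small tree (made-up example)
d_G = torch.tensor([[0., 1., 1., 2.],
                    [1., 0., 2., 3.],
                    [1., 2., 0., 1.],
                    [2., 3., 1., 0.]])
n = d_G.shape[0]

emb = (torch.randn(n, 2) * 1e-2).requires_grad_(True)  # free 2-d embeddings near the origin
opt = torch.optim.Adam([emb], lr=1e-2)                  # Euclidean optimizer on chart coordinates
idx_i, idx_j = torch.triu_indices(n, n, offset=1)

for step in range(2000):
    opt.zero_grad()
    d_emb = dist(emb[idx_i], emb[idx_j])
    # relative-distortion objective in the spirit of (Gu et al., 2019)
    loss = ((d_emb / d_G[idx_i, idx_j]).pow(2) - 1).abs().sum()
    loss.backward()
    opt.step()
    with torch.no_grad():  # project back into the unit ball after the Euclidean step
        norms = emb.norm(dim=-1, keepdim=True).clamp_min(1e-15)
        emb.data = torch.where(norms > 1 - 1e-5, emb * (1 - 1e-5) / norms, emb)

Replacing the Euclidean Adam step plus projection with a Riemannian optimizer that updates the parameters through the exponential map of equation (2) gives the Riemannian counterpart that the comparison above refers to.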