
diss. eth no. 26998

ON THE GEOMETRY OF DATA REPRESENTATIONS

A thesis submitted to attain the degree of Doctor of Sciences of ETH Zürich (Dr. sc. ETH Zürich)

presented by

Gary Bécigneul

Master, University of Cambridge

born on 11.11.1994
citizen of France

Accepted on the recommendation of

Prof. Dr. Thomas Hofmann (ETH Zürich)
Prof. Dr. Gunnar Rätsch (ETH Zürich)
Prof. Dr. Tommi Jaakkola (MIT)
Prof. Dr. Regina Barzilay (MIT)

2020

Gary Bécigneul: On the Geometry of Data Representations, © 2020

ABSTRACT

The vast majority of state-of-the-art Machine Learning (ML) methods nowadays internally represent the input data as being embedded in a “continuous” space, i.e. as sequences of floats, where nearness in this space is meant to define semantic or statistical similarity w.r.t. the task at hand. As a consequence, the choice of which metric is used to measure nearness, as well as the way data is embedded in this space, currently constitute some of the cornerstones of building meaningful data representations. Characterizing which points should be close to each other in a given set essentially defines a “geometry”. Interestingly, certain geometric properties may be incompatible in a non-trivial manner − or put another way, selecting desired geometric properties may have non-trivial implications. The investigation of which geometric properties are desirable in a given context, and how to enforce them, is one of the motivations underlying this thesis. Initially motivated by uncovering how Convolutional Neural Networks (CNNs) disentangle representations to make tangled classes linearly separable, we start our investigation by studying how invariance to nuisance deformations may help untangle classes in a classification problem, by locally contracting and flattening orbits within data classes. We then take an interest in directly representing data in a Riemannian space of choice, with a particular emphasis on hyperbolic spaces, which are known to be better suited to representing tree-like graphs. We develop a word embedding method generalizing GloVe, as well as a new way of measuring entailment between concepts. In the hyperbolic space, we also develop the tools needed to build neural networks, including matrix multiplication, pointwise non-linearities, Multinomial Logistic Regression (MLR) and the Gated Recurrent Unit (GRU). Since many more optimization tools are available for Euclidean domains than for the hyperbolic space, we needed to adapt some of the most powerful adaptive schemes − Adagrad, Adam, Amsgrad − to such spaces, in order to give our hyperbolic models a chance to outperform their Euclidean counterparts. We also provide convergence

guarantees for these new methods, which recover those already known for the particular case of the Euclidean space. Finally, the growing prominence of graph-like data led us to extend some of the most successful Graph Neural Network (GNN) architectures. First, we generalize Graph Convolutional Networks (GCNs) to hyperbolic and spherical spaces. We then leverage Optimal Transport (OT) geometry to turn current architectures into universal approximators, by dispensing with the last node aggregation step yielding the final graph embedding. We hope that this work will help motivate further investigations in these geometrically-flavored directions.

RÉSUMÉ

The vast majority of state-of-the-art Machine Learning methods nowadays represent the input data in a “continuous” space, i.e. as a sequence of decimal numbers, in a space where nearness is meant to define semantic or statistical similarity with respect to the task. As a consequence, the choice of the metric used to measure nearness, as well as the way the data are embedded in this space, currently constitute one of the cornerstones of building good data representations. Characterizing which points should be close to each other in a given set essentially defines what is called a geometry. It turns out that certain geometric properties may be incompatible in a non-trivial way − or, put another way, selecting certain geometric properties may have non-trivial implications. Investigating which geometric properties are desirable in a given context, and how to obtain them, is one of the main motivations underlying this thesis. Initially motivated by uncovering how convolutional neural networks manage to disentangle data so as to make tangled classes linearly separable, we begin our investigation by studying how invariance to nuisance transformations can help disentangle the classes of a classification problem, by locally contracting and flattening group orbits within data classes. We then turn to representing data directly in a Riemannian space of choice, with a particular emphasis on hyperbolic spaces, which are known to be better suited to representing graphs with an underlying tree-like structure. We develop a word embedding method generalizing GloVe, as well as a new way of measuring asymmetric relations of semantic entailment between concepts. We also develop the tools needed to build neural networks in hyperbolic spaces: matrix multiplication, pointwise non-linearities, multinomial logistic regression and GRUs.

Since far more optimization tools exist for Euclidean spaces than for hyperbolic spaces, we needed to adapt some of the most powerful adaptive methods − Adagrad, Adam, Amsgrad − to these spaces, in order to give our hyperbolic models a chance to outperform their Euclidean counterparts. We thereby prove convergence guarantees for these new methods, which recover those already known for the particular case of the Euclidean space. Finally, the increased presence of graph-structured data led us to extend some of the most powerful graph neural network architectures. First, we generalize graph convolutional networks to hyperbolic and spherical spaces. Then, we leverage the geometry of optimal transport to turn current architectures into a universal approximator, by removing the final aggregation of the internal node representations which previously yielded the final graph representation. We hope that this will help motivate further explorations in these geometrically-inclined research directions.

PUBLICATIONS

The material presented in this thesis has in part been published in the following publications:

1. Octavian-Eugen Ganea¹, Gary Bécigneul¹ and Thomas Hofmann. “Hyperbolic Neural Networks.” NeurIPS 2018: Advances in Neural Information Processing Systems. [GBH18c].

2. Gary Bécigneul and Octavian-Eugen Ganea. “Riemannian Adaptive Optimization Methods.” ICLR 2019: International Conference on Learning Representations. [BG19].

3. Alexandru Tifrea¹, Gary Bécigneul¹ and Octavian-Eugen Ganea¹. “Poincaré GloVe: Hyperbolic Word Embeddings.” ICLR 2019: International Conference on Learning Representations. [TBG19].

4. Gregor Bachmann¹, Gary Bécigneul¹ and Octavian-Eugen Ganea. “Constant Curvature Graph Convolutional Networks.” ICML 2020: International Conference on Machine Learning. [BBG20].

In part currently under review:

5. Gary Bécigneul¹, Octavian-Eugen Ganea¹, Benson Chen¹, Regina Barzilay, Tommi Jaakkola. “Optimal Transport Graph Neural Networks.” Under Review at ICLR 2021. [Béc+20].

And in part unpublished:

6. Gary Bécigneul. “On the Effect of Pooling on the Geometry of Representations.” arXiv preprint arXiv:1703.06726, 2017. [Béc17].

¹ Equal contribution.

The following publications were part of my PhD research and present results that are supplemental to this work or build upon its results, but are not covered in this dissertation:

7. Octavian-Eugen Ganea, Gary Bécigneul and Thomas Hofmann. “Hyperbolic Entailment Cones for Learning Hierarchical Embeddings.” ICML 2018: International Conference on Machine Learning. [GBH18a].

8. Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul and Aliaksei Severyn. “Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities.” ICML 2019: International Conference on Machine Learning. [Gan+19].

9. Ondrej Skopek, Octavian-Eugen Ganea and Gary Bécigneul. “Mixed-Curvature Variational Autoencoders.” ICLR 2020: International Conference on Learning Representations. [SGB20].

10. Foivos Alimisis, Antonio Orvieto, Gary Bécigneul and Aurélien Lucchi. “A Continuous-time Perspective for Modeling Acceleration in Riemannian Optimization.” AISTATS 2020: International Conference on Artificial Intelligence and Statistics. [Ali+20b].

11. Calin Cruceru, Gary Bécigneul and Octavian-Eugen Ganea. “Computationally Tractable Riemannian Manifolds for Graph Embeddings.” AAAI 2021: Association for the Advancement of Artificial Intelligence. [CBG21].

Also currently under review are other parts of my PhD research that present results supplemental to this work or that build upon its results, but are not covered in this dissertation:

12. Louis Abraham, Gary Bécigneul, Benjamin Coleman, Anshumali Shrivastava, Bernhard Schölkopf and Alexander J. Smola. “Bloom Origami Assays: Practical Group Testing.” Under Review at AISTATS 2021. [Abr+20].

13. Foivos Alimisis, Antonio Orvieto, Gary Bécigneul and Aurélien Lucchi. “Practical Accelerated Optimization on Riemannian Manifolds.” Under Review at AISTATS 2021. [Ali+20a].

Finally, the following pieces of work were also part of my PhD research but remained unpublished:

14. Gary Bécigneul, Yannic Kilcher, Thomas Hofmann. “Learning a Riemannian Metric with an Energy-based GAN.” Unpublished. [BKH18].

15. Yannic Kilcher¹, Gary Bécigneul¹, Thomas Hofmann. “Escaping Flat Areas via Function Preserving Structural Network Modifications.” OpenReview preprint openreview:H1eadi0cFQ, 2018. [KBH18].

16. Yannic Kilcher¹, Gary Bécigneul¹, Thomas Hofmann. “Parametrizing Filters of a CNN with a GAN.” arXiv preprint arXiv:1710.11386, 2017. [KBH17].


ACKNOWLEDGMENTS

First, I thank my professor Thomas Hofmann for his continuous support during my PhD studies at ETH. His trust in my success and the scientific freedom and guidance he provided were decisive to the result of this thesis. Next, I would also like to express my gratitude to Octavian-Eugen Ganea, with whom I brainstormed almost daily. I am also very grateful to Tommi Jaakkola and Regina Barzilay, who warmly welcomed me in their lab to work with them at MIT, showed me new problems and inspired me, and to Benson Chen who also worked with us. Many thanks to Bernhard Schölkopf for thinking of me to help design a group testing method and for his close collaboration; to Louis Abraham for solving non-trivial problems on short notice; and to Ben Coleman, Anshumali Shrivastava and Alexander Smola for their part in this team work. I would also like to acknowledge my co-workers during this PhD: Alexandru Tifrea, Gregor Bachmann, Foivos Alimisis, Antonio Orvieto, Aurélien Lucchi, Andreas Bloch, Calin Cruceru, Ondrej Skopek, Sylvain Gelly, Yannic Kilcher, Octav Dragoi, Philipp Wirth, Panayiotis Panayiotou, Charles Ladari, Severin Bahman, Jovan Andonov, Igor Petrovski, Gokul Santhanam, Yashas Annadani, Aliaksei Severyn, and all my Data Analytics Lab, CSAIL and MPI colleagues. Thank you! I am especially grateful to Grégoire Fanneau, who sparked my interest in mathematics in high school, and to Prof. Jean-Claude Jacquens and Prof. Luc Albert, who taught me mathematics and computer science and challenged me in French preparatory classes. Also, many thanks to Simon Janin for bringing light to the blind spots in my workflow and for coaching me over the years. Thanks to Rémi Saidi for raising my standards of mathematical rigor. Thanks to Victor Perez, Nicole and Dominique Secret for believing in me and supporting my undergraduate studies. Last but not least, all my love to my mother for her encouragement, trust and support from day one. I dedicate all my work to my little brother, Tom Bécigneul, who left us too early, but whose joy of life continues to inspire me.


CONTENTS

acronyms
1 introduction, motivations & goals
  1.1 A Primer on Machine Learning
  1.2 Euclidean Representations
    1.2.1 Power & Ubiquity
    1.2.2 Limitations
  1.3 Beyond Euclidean Geometry
    1.3.1 Characterizing the Geometry of Discrete Data
    1.3.2 Non-Euclidean Properties of Data Embedded in a Euclidean Space
    1.3.3 Hyperbolic Representations
    1.3.4 Riemannian Optimization
    1.3.5 Graph-like Data
  1.4 Thesis Contributions
2 group invariance & curvature of representations
  2.1 Background Material
    2.1.1 Groups, Lie Algebras & Haar Measures
    2.1.2 I-theory
  2.2 Main Results: Formal Framework and Theorems
    2.2.1 Group Pooling Results in Orbit Contraction
    2.2.2 Group Pooling Results in Orbit Flattening
    2.2.3 Orbit Flattening of R² ⋊ SO(2)
  2.3 Numerical Experiments: Shears and Translations on Images
    2.3.1 Synthetic Orbit Contraction in Pixel Space
    2.3.2 CIFAR-10: Orbit Contraction in Logit Space
  2.4 Conclusive Summary
  2.5 Reminders of Known Results
3 hyperbolic spaces: embeddings & classification
  3.1 Mathematical Preliminaries
    3.1.1 Reminders of Differential Geometry
    3.1.2 A Brief Introduction to …
    3.1.3 Riemannian Geometry of Gyrovector Spaces
  3.2 Word Embeddings
    3.2.1 Adapting GloVe to Metric Spaces
    3.2.2 Fisher Information and Hyperbolic Products
    3.2.3 Training Details
    3.2.4 Similarity & Analogy
    3.2.5 Hypernymy: A New Way of Assessing Entailment
    3.2.6 Motivating Hyperbolic GloVe via δ-hyperbolicity
  3.3 Classification & Regression
    3.3.1 Matrix Multiplications & Pointwise non-Linearities
    3.3.2 Recurrent Networks & Gated Recurrent Units
    3.3.3 Multinomial Logistic Regression
  3.4 Additional Related Work
  3.5 Conclusive Summary
4 adaptive optimization in non-euclidean geometry
  4.1 Preliminaries and Notations
    4.1.1 Elements of Geometry
    4.1.2 Riemannian Optimization
    4.1.3 Amsgrad, Adam, Adagrad
  4.2 Adaptive Schemes in Riemannian Manifolds
    4.2.1 Obstacles in the General Setting
    4.2.2 Adaptivity Across Manifolds in a Cartesian Product
  4.3 Algorithms & Convergence Guarantees
    4.3.1 Ramsgrad
    4.3.2 RadamNc
    4.3.3 Intuition Behind the Proofs
  4.4 Empirical Validation
  4.5 Additional Related Work
  4.6 Conclusive Summary
  4.7 Reminders of Known Results
5 non-euclidean geometry for graph-like data
  5.1 Background & Mathematical Preliminaries
    5.1.1 The Geometry of Spaces of Constant Curvature
    5.1.2 Optimal Transport Geometry
    5.1.3 GCN: Graph Convolutional Networks
    5.1.4 Directed Message Passing Neural Networks
  5.2 κ-GCN: Constant Curvature GCN
    5.2.1 Tools for a κ-GCN
    5.2.2 κ-Right-Matrix-Multiplication
    5.2.3 κ-Left-Matrix-Multiplication: Midpoint Extension
    5.2.4 Logits
    5.2.5 Final Architecture of the κ-GCN
  5.3 Experiments: Graph Distortion & Node Classification
    5.3.1 Distortion of Graph Embeddings
    5.3.2 Node Classification
  5.4 OTGNN: Optimal Transport Graph Neural Networks
    5.4.1 Model & Architecture
    5.4.2 Contrastive Regularization
    5.4.3 Optimization & Differentiation
    5.4.4 Computational Complexity
    5.4.5 Theoretical Analysis: Universality & Definiteness
    5.4.6 Additional Related Work
  5.5 Empirical Validation on Small Molecular Graphs
    5.5.1 Experimental Setup
    5.5.2 Experimental Results
    5.5.3 Further Experimental Details
  5.6 Conclusive Summary
6 thesis conclusion
bibliography

LIST OF FIGURES

Figure 2.1  Concerning the drawing on the left-hand side, the blue and red areas correspond to the compact neighborhood G_0 centered in f and L_g f respectively, the grey area represents only a visible subpart of the whole group orbit, the thick, curved line is a geodesic between f and L_g f inside the orbit G · f, and the dotted line represents the line segment between f and L_g f in L^2(R^2), whose size is given by the Euclidean distance ‖L_g f − f‖_2. See Figures 2.2 and 2.3.
Figure 2.2  Illustration of orbits (first part).
Figure 2.3  Illustration of orbits (second part).
Figure 2.4  Orbits in the logit space of a CNN (CIFAR-10).
Figure 3.1  Isometric deformation ϕ of D^2 (left end) into H^2 (right end).
Figure 3.2  We show here one of the D^2 spaces of 20D word embeddings embedded in (D^2)^10 with our unsupervised hyperbolic GloVe algorithm. This illustrates the three steps of applying the isometry. From left to right: the trained embeddings, raw; then after centering; then after rotation; finally after isometrically mapping them to H^2. The isometry was obtained with the weakly-supervised method WordNet 400 + 400. Legend: WordNet levels (root is 0). Model: h = (·)^2, full vocabulary of 190k words. More of these plots for other D^2 spaces are shown in [TBG19].
Figure 3.3  Different hierarchies captured by a 10x2D model with h(x) = x^2, in some selected 2D half-planes. The y coordinate encodes the magnitude of the variance of the corresponding Gaussian embeddings, representing word generality/specificity. Thus, this type of Nx2D models offer an amount of interpretability.
Figure 3.4  Left: Gaussian variances of our learned hyperbolic embeddings (trained unsupervised on co-occurrence statistics, isometry found with “Unsupervised 1k+1k”) correlate with WordNet levels; Right: performance of our embeddings on hypernymy (HyperLex dataset) evolves when we increase the amount of supervision x used to find the correct isometry in the model WN x + x. As can be seen, a very small amount of supervision (e.g. 20 words from WordNet) can significantly boost performance compared to fully unsupervised methods.
Figure 3.5  Test accuracies for various models and four datasets. “Eucl” denotes Euclidean, “Hyp” denotes hyperbolic. All word and sentence embeddings have dimension 5. We highlight in bold the best method (or methods, if the difference is less than 0.5%). Implementation of these experiments was done by co-authors in [GBH18c].
Figure 3.6  PREFIX-30% accuracy and first (premise) sentence norm plots for different runs of the same architecture: hyperbolic GRU followed by hyperbolic FFNN and hyperbolic/Euclidean (half-half) MLR. The X axis shows millions of training examples processed. Implementation of these experiments was done by co-authors in [GBH18c].
Figure 3.7  PREFIX-30% accuracy and first (premise) sentence norm plots for different runs of the same architecture: Euclidean GRU followed by Euclidean FFNN and Euclidean MLR. The X axis shows millions of training examples processed. Implementation of these experiments was done by co-authors in [GBH18c].
Figure 3.8  PREFIX-30% accuracy and first (premise) sentence norm plots for different runs of the same architecture: hyperbolic RNN followed by hyperbolic FFNN and hyperbolic MLR. The X axis shows millions of training examples processed. Implementation of these experiments was done by co-authors in [GBH18c].
Figure 3.9  An example of a hyperbolic hyperplane in D_1^3 plotted using sampling. The red point is p. The shown normal axis to the hyperplane through p is parallel to a.
Figure 3.10 Hyperbolic (left) vs direct Euclidean (right) binary MLR used to classify nodes as being part of the group.n.01 subtree of the WordNet noun hierarchy solely based on their Poincaré embeddings. The positive points (from the subtree) are in red, the negative points (the rest) are in yellow and the trained positive separation hyperplane is depicted in green.
Figure 3.11 Test F1 classification scores (%) for four different subtrees of the WordNet noun tree. 95% confidence intervals for 3 different runs are shown for each method and each dimension. “Hyp” denotes our hyperbolic MLR, “Eucl” denotes directly applying Euclidean MLR to hyperbolic embeddings in their Euclidean parametrization, and log_0 denotes applying Euclidean MLR in the tangent space at 0, after projecting all hyperbolic embeddings there with log_0. Implementation of these experiments was done by co-authors in [GBH18c].
Figure 4.1  Comparison of the Riemannian and Euclidean versions of Amsgrad.
Figure 4.2  Results for methods doing updates with the exponential map. From left to right we report: training loss, MAP on the train set, MAP on the validation set.
Figure 4.3  Results for methods doing updates with the retraction. From left to right we report: training loss, MAP on the train set, MAP on the validation set.
Figure 5.1  Geodesics in the Poincaré disk (left) and the stereographic projection of the sphere (right).
Figure 5.2  Heatmap of the distance function d_κ(x, ·) in st_κ^2 for κ = −0.254 (left) and κ = 0.248 (right).
Figure 5.3  We illustrate, for a given 2D point cloud, the optimal transport plan obtained from minimizing the Wasserstein costs; c(·, ·) denotes the Euclidean distance (top) or squared difference (bottom). A, B are the Euclidean distance matrices obtained from point clouds X, Y. A higher dotted-line thickness illustrates a greater mass transport.
Figure 5.4  Left: Spherical gyromidpoint of four points. Right: Möbius gyromidpoint in the Poincaré model defined by [Ung08] and alternatively, here in eq. (5.27).
Figure 5.5  Weighted Euclidean midpoint αx + βy.
Figure 5.6  Heatmap of the distance from a st_κ^2-hyperplane to x ∈ st_κ^2 for κ = −0.254 (left) and κ = 0.248 (right).
Figure 5.7  Histogram of curvatures from “Deviation of Parallelogram Law”.
Figure 5.8  Intuition for our Wasserstein prototype model. We assume that a few prototypes, e.g. some functional groups, highlight key facets or structural features of graphs in a particular graph classification/regression task at hand. We then express graphs by relating them to these abstract prototypes represented as free point cloud parameters. Note that we do not learn the graph structure of the prototypes.
Figure 5.9  2D embeddings of prototypes and of real molecule samples with (right, with weight 0.1) and without (left) contrastive regularization for runs using the exact same random seed. Points in the prototypes tend to cluster and collapse more when no regularization is used, implying that the optimal transport plan is no longer uniquely discriminative. Prototype 1 (red) is enlarged for clarity: without regularization (left), it is clumped together, but with regularization (right), it is distributed across the space.
Figure 5.10 Left: GNN model; Right: ProtoW-L2 model; Comparison of the correlation between graph embedding distances (X axis) and label distances (Y axis) on the ESOL dataset.
Figure 5.11 2D heatmaps of T-SNE [MH08] projections of molecular embeddings (before the last linear layer) w.r.t. their associated predicted labels. Heat colors are interpolations based only on the test molecules from each dataset. Comparing (a) vs (b) and (c) vs (d), we can observe a smoother space of our model compared to the GNN baseline as explained in the main text.

LIST OF TABLES

Table 3.1  Word similarity results for 100-dimensional models. Highlighted: the best and the 2nd best. Implementation of these experiments was done in collaboration with co-authors in [TBG19].
Table 3.2  Nearest neighbors (in terms of Poincaré distance) for some words using our 100D hyperbolic embedding model.
Table 3.3  Word analogy results for 100-dimensional models on the Google and MSR datasets. Highlighted: the best and the 2nd best. Implementation of these experiments was done in collaboration with co-authors in [TBG19].
Table 3.4  Some words selected from the 100 nearest neighbors and ordered according to the hypernymy score function for a 50x2D hyperbolic embedding model using h(x) = x^2.
Table 3.5  WBLESS results in terms of accuracy for 3 different model types ordered according to their difficulty.
Table 3.6  Hyperlex results in terms of Spearman correlation for 3 different model types ordered according to their difficulty.
Table 3.7  Average distances, δ-hyperbolicities and ratios computed via sampling for the metrics induced by different h functions, as defined in eq. (3.51).
Table 3.8  Similarity results on the unrestricted (190k) vocabulary for various h functions. This table should be read together with Table 3.7.
Table 3.9  Various properties of similarity benchmark datasets. The frequency index indicates the rank of a word in the vocabulary in terms of its frequency: a low index describes a frequent word. The median of indexes seems to best discriminate WordSim from SimLex and SimVerb.
Table 5.1  Minimum achieved average distortion of the different models. H and S denote hyperbolic and spherical models respectively.
Table 5.2  Node classification: Average accuracy across 5 splits with estimated uncertainties at 95 percent confidence level via bootstrapping on our data splits. H and S denote hyperbolic and spherical models respectively. Implementation of these experiments was done in collaboration with co-authors in [BBG20].
Table 5.3  Average curvature obtained for node classification. H and S denote hyperbolic and spherical models respectively. Curvature for Pubmed was fixed for the product model. Implementation of these experiments was done in collaboration with co-authors in [BBG20].
Table 5.4  Results of different models on the property prediction datasets. Best in bold, second best underlined. Proto methods are ours. Lower RMSE is better, while higher AUC is better. The prototype-based models generally outperform the GNN, and the Wasserstein models perform better than the model using only simple Euclidean distances, suggesting that the Wasserstein distance provides more powerful representations. Wasserstein models trained with contrastive regularization as described in section 5.4.2 outperform those without. Implementation of these experiments was done in collaboration with co-authors in [Béc+20].
Table 5.5  The Spearman and Pearson correlation coefficients on the ESOL dataset for the GNN and ProtoW-L2 model w.r.t. the pairwise difference in embedding vectors and labels. Implementation of these experiments was done in collaboration with co-authors in [Béc+20].
Table 5.6  The parameters for our models (the prototype models all use the same GNN base model), and the values that we used for hyperparameter search. When there is only a single value in the search list, it means we did not search over this value, and used the specified value for all models.

ACRONYMS

ML Machine Learning

AA Apprentissage Automatique

CNN Convolutional Neural Network

GNN Graph Neural Network

GCN Graph Convolutional Network

DMPNN Directed Message Passing Neural Network

OT Optimal Transport

OTGNN Optimal Transport Graph Neural Network

SGD Stochastic Gradient Descent

RSGD Riemannian Stochastic Gradient Descent

NCE Noise Contrastive Estimation

FFNN Feed-forward Neural Network

RNN Recurrent Neural Network

GRU Gated Recurrent Unit

MLR Multinomial Logistic Regression

MDS Multidimensional Scaling

PCA Principal Component Analysis




1

INTRODUCTION, MOTIVATIONS & GOALS

1.1 a primer on machine learning

In the last two decades, the combined increased availability of data on the one hand and of computational power on the other hand has given rise to a plethora of new powerful statistical methods. Machine Learning grew wider as a field in opposite directions at the same time: highly engineered, “black box”, uninterpretable models with millions, sometimes billions of parameters partly took over the state-of-the-art spotlight [Bro+20]; [Dev+18]; [KSH12], while most new fundamental contributions were becoming more and more mathematical.

It is worth noting that what most of these successful models happen to have in common is to either take as input data represented as a Euclidean embedding (be it images, words or waveforms) or to internally represent symbolic data as such. This tendency has naturally motivated the development of general-purpose representations to be used in downstream tasks, the most striking example being perhaps word embeddings with Word2Vec [Mik+13b], GloVe [PSM14] and FastText [Boj+16]. More broadly, finding “good” data representations and understanding what good means has quickly become one of the cornerstones of machine learning: Y. Bengio has even described the disentangling of the underlying factors explaining the data as “pre-solving any possible task relevant to the observed data” [Ben13], which has recently been emphasized when a step in this direction won the Best Paper Award at ICML 2019 [Loc+19].

Acknowledging the importance of adequately representing data, we decided to concentrate our efforts in this direction. The observation of how relatively unexplored non-Euclidean geometries were in the above-mentioned pipelines, together with a strong taste for geometry, has led us through the story we tell in this dissertation.


1.2 euclidean representations

1.2.1 Power & Ubiquity

As computers allow for significant amounts of operations on numbers, representing data as finite sequences or grids of numbers quickly appeared as a natural choice. When the necessity to measure similarity or discrepancy between data points came to light, one of the most canonical choices was to generalize to higher dimensions the way we naturally compute distances in our intuition-friendly, 3-dimensional physical world: via Euclidean geometry. From a computational standpoint, this geometry has numerous advantages: (i) many important quantities such as distances, inner-products and geodesics can be computed very efficiently; (ii) the linearity of the Euclidean space provides us with a tremendous amount of mathematical tools, e.g. Linear Algebra or Probability Theory; (iii) reasoning in this space matches in many ways our human intuition, compared to other, more exotic geometries. Euclidean representations are at the heart of most state-of-the-art models available today, from language modeling [Dev+18]; [Rad+19] to classification [He+16]; [KSH12], generation [Goo+14], machine translation [BCB14]; [Vas+17], speech recognition [GMH13] or recommender systems [He+17].

1.2.2 Limitations

Despite its success, like almost any particular modeling choice, Euclidean geometry may have certain limitations. Or, put another way: non-Euclidean geometries may constitute a superior choice in a given setting. Indeed, recent research has shown that many types of complex data (e.g. graph data) from a multitude of fields (e.g. Biology, Network Science, Computer Graphics or Computer Vision) exhibit a highly non-Euclidean latent anatomy [Bro+17]. In such cases, where Euclidean properties such as shift-invariance are not desired, the Euclidean space does not provide the most powerful or meaningful geometrical representations. Moreover, as canonical as Euclidean geometry may seem, it actually constrains the geometric properties of certain elementary objects:

1. The sum of angles of a triangle is always π.


2. The volume of a ball grows polynomially with its radius.

3. The Euclidean inner product defines a positive definite kernel:

\[
\forall \alpha \in \mathbb{R}^n \setminus \{0_n\},\ \forall (x_i)_{1 \leq i \leq n} \in (\mathbb{R}^n)^n : \quad \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \langle x_i, x_j \rangle > 0. \tag{1.1}
\]

which, as one can show (see section 5.4.5), can be reformulated equivalently as the fact that the squared Euclidean distance defines a conditionally negative definite kernel (c.n.d.):

\[
\forall \alpha \in \mathbb{R}^n,\ \forall (x_i)_{1 \leq i \leq n} \in (\mathbb{R}^n)^n : \quad \text{if } \sum_{i=1}^{n} \alpha_i = 0, \text{ then } \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \| x_i - x_j \|_2^2 \leq 0. \tag{1.2}
\]
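As a quick illustration (not part of the original text), the following NumPy sketch checks both properties numerically on random points: the Gram matrix of inner products yields a positive quadratic form, and the squared-distance quadratic form is non-positive whenever the coefficients sum to zero.

```python
# Hypothetical check (not from the thesis): verify (1.1) and (1.2) numerically.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 10
X = rng.normal(size=(n, d))                          # n points in R^d (a.s. independent)

G = X @ X.T                                          # Gram matrix of inner products
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared Euclidean distances

alpha = rng.normal(size=n)                           # generic non-zero coefficients
print(alpha @ G @ alpha > 0)                         # (1.1): strictly positive

beta = rng.normal(size=n)
beta -= beta.mean()                                  # enforce sum(beta) = 0
print(beta @ D2 @ beta <= 1e-10)                     # (1.2): non-positive up to round-off
```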

At this stage, one may wonder: to what extent are these properties restrictive? Given a task to solve and a discrete/symbolic dataset, there often exists an inherent geometry underlying the task: be it semantic similarity in natural language processing or image classification, proximity between users and items in recommender systems, or, more broadly, the geometry that characterizes nearness in the space of labels for any classification or regression problem. From a given task and dataset, such geometries can be defined by ideal similarity or discrepancy metrics. But what guarantees that these metrics will share the above listed properties of Euclidean geometry? Indeed, there exist other embedding geometries possessing properties contradicting those listed above. For instance, as we will see in greater detail later on, the angles of a triangle sum below (resp. above) π in a hyperbolic (resp. spherical) space. Another example: in a b-ary tree, the number of k-hop neighbors of the root grows exponentially with k; similarly, in a hyperbolic space, the volume of a ball grows exponentially with its radius, which is not the case for Euclidean and spherical spaces. Finally, a Gram matrix need not be positive definite nor c.n.d. to represent a valid similarity or discrepancy kernel. This immediately raises the question of how we can assess in advance which geometry is best suited to embed a dataset, relative to a given task. We discuss this matter in the next section.

pondering. In practice, additional considerations must be taken into account beyond such a geometric match at the mathematical level: (i) computational efficiency, of course, but also (ii) hardness of optimization. Indeed, the vast majority of embedding methods rely on gradient-based optimization. A non-Euclidean embedding metric could theoretically be a better choice if one were able to find a good optimum, but it may make the optimization problem much harder. Developing new non-Euclidean machine learning methods requires carefully considering these matters.

1.3 beyond euclidean geometry

1.3.1 Characterizing the Geometry of Discrete Data

As mentioned above, how can we know which embedding geometry would theoretically best suit a given dataset/task? Although this remains an open problem, certain quantities can be computed to get an idea.

metric distortion. Given a weighted graph or a discrete metric space, how closely can we embed it into a “continuous” space, so that distances in the latter reflect those in the former? A natural measurement of this distortion is given by taking the sum, over the finite metric space, of squared differences between the target metric d_ij and the distances between the corresponding embeddings x_i, x_j:

\[
\min_{x} \sum_{i,j} \left( d(x_i, x_j) - d_{ij} \right)^2. \tag{1.3}
\]

Note that finding a family x of embeddings minimizing distortion can be intractable in practice. Empirical upper bounds can be obtained either via combinatorial or gradient-based methods (e.g. see [De +18a] in hyperbolic spaces), and for simple cases theoretical lower bounds can be computed by leveraging the metric properties (e.g. see Section 1.3.2 of [Gan19]).
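For concreteness, here is a minimal, illustrative sketch (not code from the thesis) that evaluates the distortion objective (1.3) and minimizes it with plain gradient descent in NumPy for a tiny target metric; the learning rate and step count are arbitrary choices.

```python
# Illustrative sketch: minimize the distortion objective (1.3) by gradient descent.
import numpy as np

def distortion_loss(X, D):
    """Sum of squared differences between embedding distances and target distances."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    return ((dist - D) ** 2).sum()

def distortion_grad(X, D):
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    coeff = 2.0 * (dist - D) / dist
    np.fill_diagonal(coeff, 0.0)
    # each unordered pair appears twice in the double sum, hence the extra factor 2
    return 2.0 * (coeff[:, :, None] * diff).sum(axis=1)

# Target metric: shortest-path distances of the path graph 0 - 1 - 2 - 3.
D = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :]).astype(float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))                    # a 2D Euclidean embedding
for _ in range(5000):
    X -= 1e-2 * distortion_grad(X, D)
print(distortion_loss(X, D))                   # a path embeds isometrically, so ~0
```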

δ-hyperbolicity. This notion, originally proposed by Gromov [Gro87], characterizes how tree-like a space is, by looking at how close each geodesic segment of a triangle is to its barycenter: in a tree, e.g. for the 3-star graph made of one root node and three leaf nodes, this

distance is zero; in a Euclidean space, this distance is unbounded; in a hyperbolic space, it is upper bounded by a factor going to zero as curvature goes to −∞. As it is defined as a worst-case measure over all triangles in a metric space, Borassi et al. [BCC15] more recently proposed an averaged version yielding more robust and efficiently computable statistics for finite graphs. We provide further definitions and explanations about the original and averaged δ-hyperbolicities, as well as numerical computations, in section 3.2.6.

discrete ricci-ollivier curvature. Originally introduced by Y. Ollivier [Oll09], this notion provides a discrete analogue of the celebrated Ricci curvature from Riemannian geometry. It captures the same intuition: namely, it compares the distance between small balls (w.r.t. optimal transportation) to the distance between their centers: a ratio below (resp. above) one characterizes a positive (resp. negative) curvature. On a graph, a “ball” and its center are replaced by a (transition) probability distribution over neighbors, allowing one to define the curvature of Markov chains over the graph. It was computed mathematically in expectation for Erdős–Rényi random graphs [LLY11] and empirically for certain graphs representing the internet topology [Ni+15]. Other recent variants exist, such as discrete Forman [For03] and sectional [Gu+19a] curvatures, although less is known about the latter.
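To make the δ-hyperbolicity notion above concrete, here is a small, illustrative sketch (not the thesis code) that computes the four-point Gromov δ of a finite metric space with respect to a base point, together with a crude averaged variant obtained by replacing the maximum over pairs with a mean.

```python
# Illustrative sketch: four-point Gromov delta-hyperbolicity w.r.t. a base point w,
# plus a crude averaged variant (mean over pairs instead of the worst case).
import numpy as np

def delta_hyperbolicity(D, w=0):
    """Return (worst-case delta, mean delta) for a distance matrix D and base point w."""
    # Gromov products (x|y)_w = 0.5 * (d(x, w) + d(y, w) - d(x, y))
    gp = 0.5 * (D[:, w][:, None] + D[:, w][None, :] - D)
    # delta(x, y) = max_z min((x|z)_w, (y|z)_w) - (x|y)_w
    m = np.minimum(gp[:, None, :], gp[None, :, :])        # m[x, y, z]
    delta_xy = m.max(axis=2) - gp
    return delta_xy.max(), delta_xy.mean()

# 3-star tree (one root, three leaves): trees are 0-hyperbolic.
D_star = np.array([[0, 1, 1, 1],
                   [1, 0, 2, 2],
                   [1, 2, 0, 2],
                   [1, 2, 2, 0]], dtype=float)
print(delta_hyperbolicity(D_star))   # (0.0, 0.0)
```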

1.3.2 Non-Euclidean Properties of Data Embedded in a Euclidean Space

In machine learning and neuroscience, certain computational structures and algorithms are known to yield disentangled representations without us understanding why, the most striking examples being perhaps convolutional neural networks and the ventral stream of the visual cortex in humans and primates. As for the latter, it was conjectured that representations may be disentangled by being flattened progressively and at a local scale [DC07]. An attempt at a formalization of the role of invariance in learning representations was made recently, referred to as I-theory [Ans+13b]. In this framework and using the language of differential geometry, we will show that pooling over a group of transformations of the input contracts the metric and reduces its curvature, and provide quantitative bounds, with the aim of moving towards a theoretical understanding of how to disentangle representations.


1.3.3 Hyperbolic Representations

Interestingly, [De +18a] shows that arbitrary tree structures cannot be embedded with arbitrarily low distortion (i.e. almost preserving their metric) in the Euclidean space even with an unbounded number of dimensions, but this task becomes surprisingly easy in the hyperbolic space with only 2 dimensions, where the exponential growth of distances matches the exponential growth of the number of nodes with the tree depth. But how should one generalize deep neural models to non-Euclidean domains? Hyperbolic spaces constitute an interesting non-Euclidean domain to explore for two essential reasons: (i) important mathematical quantities such as distances, the exponential map, gradients and geodesics are known in efficiently computable closed-form formulas, and (ii) they are known to be better suited than the Euclidean space to represent trees, and hence potentially tree-like graphs as well. Indeed, the tree-likeness properties of hyperbolic spaces have been extensively studied [Gro87]; [Ham17]; [Ung08] and used to visualize large taxonomies [LRP95] or to embed heterogeneous complex networks [Kri+10]. In machine learning, recently, hyperbolic representations greatly outperformed Euclidean embeddings for hierarchical, taxonomic or entailment data [De +18a]; [GBH18b]; [NK17a]. Disjoint subtrees from the latent hierarchical structure surprisingly disentangle and cluster in the embedding space as a simple reflection of the space's negative curvature. However, appropriate deep learning tools are needed to embed feature data in this space and use it in downstream tasks. For example, implicitly hierarchical sequence data (e.g. textual entailment data, phylogenetic trees of DNA sequences or hierarchical captions of images) would benefit from suitable hyperbolic Recurrent Neural Networks (RNNs). On the other hand, Euclidean word embeddings are ubiquitous nowadays as first layers in neural network and deep learning models for natural language processing. They are essential in order to move from the discrete word space to the continuous space where differentiable loss functions can be optimized. The popular models of GloVe [PSM14], Word2Vec [Mik+13b] or FastText [Boj+16] provide efficient ways to learn word vectors fully unsupervised from raw text corpora, solely based on word co-occurrence statistics. These models are then successfully applied to word similarity and other downstream tasks and,

surprisingly (or not [Aro+16]), exhibit a linear structure that is also useful for solving word analogies. However, unsupervised word embeddings still largely suffer from failing to reveal antisymmetric word relations, including the latent hierarchical structure of words. This is currently one of the key limitations in automatic text understanding, e.g. for tasks such as textual entailment [Bow+15]. To address this issue, [MC18]; [VM15] propose to move from point embeddings to probability density functions, the simplest being Gaussian or Elliptical distributions. Their intuition is that the variance of such a distribution should encode the generality/specificity of the respective word. However, this method results in losing the arithmetic properties of point embeddings (e.g. for analogy reasoning), and it becomes unclear how to properly use them in downstream tasks. To this end, we propose to take the best from both worlds: we embed words as points in a Cartesian product of hyperbolic spaces and, additionally, explain how they are bijectively mapped to Gaussian embeddings with diagonal covariance matrices, where the hyperbolic distance between two points becomes the Fisher distance between the corresponding probability distribution functions. This allows us to derive a novel principled is-a score on top of word embeddings that can be leveraged for hypernymy detection. We learn these word embeddings unsupervised from raw text by generalizing the GloVe method. Moreover, the linear arithmetic property used for solving word analogy has a mathematically grounded correspondence in this new space, based on the established notion of parallel transport in Riemannian manifolds.
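As a point of reference for the hyperbolic embeddings discussed above, the following sketch (our own illustration, not the thesis implementation) computes the geodesic distance in the Poincaré ball and in a Cartesian product of such balls, the basic quantity a hyperbolic GloVe-style loss would be built on (e.g. by replacing the Euclidean inner product with a function h of the distance).

```python
# Minimal sketch (assumed notation): Poincare-ball and product-space distances.
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between points of the open unit ball (curvature -1)."""
    sq_diff = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / (denom + eps))

def product_distance(xs, ys):
    """Distance in a Cartesian product of balls: l2 combination of factor distances."""
    return np.sqrt(sum(poincare_distance(x, y) ** 2 for x, y in zip(xs, ys)))

x, y = np.array([0.1, 0.2]), np.array([-0.3, 0.4])
print(poincare_distance(x, y))
print(product_distance([x, x], [y, y]))       # e.g. a point of a product of two balls
```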

1.3.4 Riemannian Optimization

Developing powerful stochastic gradient-based optimization algorithms is of major importance for a variety of application domains. In particular, for computational efficiency, it is common to opt for a first-order method when the number of parameters to be optimized is large enough. Such cases have recently become ubiquitous in engineering and computational sciences, from the optimization of deep neural networks to learning embeddings over large vocabularies. This new need resulted in the development of empirically very successful first-order methods such as Adagrad [DHS11], Adadelta [Zei12], Adam [KB15] or its recent update Amsgrad [RKK18].


Note that these algorithms are designed to optimize parameters living in a Euclidean space R^n, which has often been considered the default geometry to be used for continuous variables. However, a recent line of work has been concerned with the optimization of parameters lying on a Riemannian manifold, a more general setting allowing non-Euclidean geometries. This family of algorithms has already found numerous applications, including for instance solving Lyapunov equations [VV10], matrix factorization [Tan+14], geometric programming [SH15], dictionary learning [CS17] or hyperbolic taxonomy embedding [De +18b]; [GBH18b]; [NK17a]; [NK18]. A few first-order stochastic methods have already been generalized to this setting (see section 4.5), the seminal one being Riemannian Stochastic Gradient Descent (RSGD) [Bon13], along with new methods for their convergence analysis in the geodesically convex case [ZS16]. However, the above-mentioned empirically successful adaptive methods, together with their convergence analysis, have yet to find their respective Riemannian counterparts. Indeed, the adaptivity of these algorithms can be thought of as assigning one learning rate per coordinate of the parameter vector. However, on a Riemannian manifold, one is generally not given an intrinsic coordinate system, rendering meaningless the notions of sparsity or coordinate-wise updates.
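To fix ideas about the Riemannian setting (this is our own toy sketch, not the adaptive Riemannian algorithms developed later in the thesis), here is a single retraction-based Riemannian SGD step on the Poincaré ball, where the Euclidean gradient is rescaled by the inverse of the conformal metric factor before the step, and the iterate is projected back into the open ball.

```python
# Toy sketch of retraction-based Riemannian SGD on the Poincare ball.
import numpy as np

def rsgd_step(x, euclidean_grad, lr=0.1, eps=1e-5):
    """One RSGD step: rescale by the inverse metric, take a first-order step, project."""
    # The Poincare-ball metric is conformal: g_x = lambda_x^2 * I with
    # lambda_x = 2 / (1 - |x|^2), so the Riemannian gradient is
    # ((1 - |x|^2) / 2)^2 times the Euclidean gradient.
    scale = ((1.0 - np.sum(x ** 2)) / 2.0) ** 2
    x_new = x - lr * scale * euclidean_grad
    norm = np.linalg.norm(x_new)
    if norm >= 1.0:                               # stay inside the open unit ball
        x_new = x_new / norm * (1.0 - eps)
    return x_new

# Toy usage: descend the Euclidean loss |x - y|^2 while staying in the ball.
x, y = np.array([0.5, 0.0]), np.array([0.0, 0.6])
for _ in range(200):
    x = rsgd_step(x, 2.0 * (x - y))
print(x)   # approaches y without ever leaving the ball
```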

1.3.5 Graph-like Data

1.3.5.1 On the Success of Euclidean GCNs

Recently, there has been considerable interest in developing learning algorithms for structured data such as graphs. For example, molecular property prediction has many applications in chemistry and drug discovery [Vam+19]; [Yan+19]. Historically, graphs were systematically decomposed into features such as molecular fingerprints, turned into non-parametric graph kernels [She+11]; [Vis+10], or, more recently, learned representations via GNNs [DBV16a]; [Duv+15]; [KW17b]. Indeed, the success of convolutional networks and deep learning for image data has inspired generalizations for graphs for which sharing parameters is consistent with the graph geometry. [Bru+14]; [HBL15] are the pioneers of spectral GCNs in the graph Fourier space using

localized spectral filters on graphs. However, in order to reduce the graph-dependency on the Laplacian eigenmodes, [DBV16b] approximate the convolutional filters using Chebyshev polynomials, leveraging a result of [HVG11]. The resulting method is computationally efficient and superior in terms of accuracy and complexity. Further, [KW17a] simplify this approach by considering first-order approximations, obtaining high scalability. The proposed GCN locally aggregates node embeddings via a symmetrically normalized adjacency matrix, and this weight sharing can be understood as an efficient diffusion-like regularizer. Recent works extend GCNs to achieve state-of-the-art results for link prediction [ZC18], graph classification [HYL17]; [Xu+18] and node classification [KBG19]; [Vel+18].
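For orientation, the symmetrically normalized aggregation of [KW17a] mentioned above can be written as H' = ReLU(D̂^{-1/2}(A + I)D̂^{-1/2} H W); the following minimal NumPy sketch (illustrative only) implements one such layer on a toy graph.

```python
# Minimal sketch of a single GCN layer with symmetric normalization and ReLU.
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)         # ReLU(A_norm H W)

# Toy graph: a 4-cycle with random node features and random weights.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
print(gcn_layer(A, H, W).shape)                    # (4, 2) node embeddings
```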

1.3.5.2 Extending GCNs Beyond the Euclidean Domain

In spite of this success, on the one hand, certain types of data (e.g. hierarchical, scale-free or spherical data) have been shown to be better represented by non-Euclidean geometries [Bro+17]; [Def+19]; [Gu+19b]; [NK17a], leading in particular to the rich theories of manifold learning [RS00]; [TDL00] and information geometry [AN07]. The standard mathematical framework used to manipulate non-Euclidean geometries is known as Riemannian geometry [Spi79]. On the other hand, GNNs are also often underutilized in whole-graph prediction tasks such as molecule property prediction. Specifically, while GNNs produce node embeddings for each atom in the molecule, these are typically aggregated via simple operations such as a sum or average, turning the molecule into a single vector prior to classification or regression. As a result, some of the information naturally extracted by node embeddings may be lost. These observations led us to propose (i) a first variant of the GCN architecture, called κ-GCN, which internally represents data in a family of chosen non-Euclidean domains, and (ii) another variant, called Optimal Transport Graph Neural Network (OTGNN), dispensing completely with the final node (sum/average) aggregation yielding the final Euclidean graph embedding, thus preserving the graph representation as a point cloud of node embeddings considered w.r.t. Wasserstein (Optimal Transport) geometry.


κ-GCN. An interesting trade-off between the general framework of Riemannian geometry, which often yields computationally intractable geometric quantities, and the well-known Euclidean space is given by manifolds of constant sectional curvature. They define together what are called hyperbolic (negative curvature), elliptic/spherical (positive curvature) and Euclidean (zero curvature) geometries. Benefits from representing certain types of data in hyperbolic [GBH18b]; [Gu+19b]; [NK17a]; [NK18] or spherical [Dav+18]; [Def+19]; [Gra+18]; [Gu+19b]; [Mat13]; [Wil+14]; [XD18] spaces have recently been shown. As mentioned earlier, this mathematical framework has the benefit of yielding efficiently computable geometric quantities of interest, such as distances, gradients, the exponential map and geodesics. We propose a unified architecture for spaces of both (constant) positive and negative sectional curvature by extending the mathematical framework of gyrovector spaces to positive curvature via Euler's formula and complex analysis, thus smoothly interpolating between geometries of constant curvature irrespective of their signs. This is possible when working with the Poincaré ball and the stereographic spherical projection models of respectively hyperbolic and spherical spaces.
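To illustrate the kind of smooth interpolation across curvatures meant here, the sketch below (our reading of a κ-stereographic setup with the usual curvature-dependent Möbius addition; names and conventions are ours, not the authors' code) computes a distance d_κ(x, y) = 2 tan_κ⁻¹(‖(−x) ⊕_κ y‖) that varies continuously through the flat limit κ → 0, where it reduces to twice the Euclidean distance.

```python
# Hedged sketch of a kappa-stereographic distance interpolating across curvatures.
import numpy as np

def tan_k_inv(r, k):
    """Inverse of tan_kappa; reduces to the identity in the flat limit k -> 0."""
    if k > 0:
        return np.arctan(np.sqrt(k) * r) / np.sqrt(k)
    if k < 0:
        return np.arctanh(np.sqrt(-k) * r) / np.sqrt(-k)
    return r

def mobius_add(x, y, k):
    """Curvature-dependent Mobius addition on the kappa-stereographic model."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * k * xy - k * y2) * x + (1 + k * x2) * y
    den = 1 - 2 * k * xy + k ** 2 * x2 * y2
    return num / den

def dist_k(x, y, k):
    return 2.0 * tan_k_inv(np.linalg.norm(mobius_add(-x, y, k)), k)

x, y = np.array([0.1, 0.2]), np.array([-0.3, 0.1])
for k in (-1.0, -1e-6, 0.0, 1e-6, 0.248):
    print(k, dist_k(x, y, k))      # varies smoothly; ~2 * ||x - y|| near k = 0
```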

OTGNN. As mentioned above, we also sought to address the potential loss of information incurred by the final node aggregation step present in nearly all GNN architectures to yield the final graph embedding. As it turns out, recent work by Togninalli et al. [Tog+19] already proposed to dispense with the aggregation step altogether and instead derive a kernel function over graphs by directly comparing node embeddings as point clouds through optimal transport (the Wasserstein distance). Their non-parametric model yields better empirical performance than popular graph kernels, but has not so far been extended to the more challenging parametric case. Motivated by this observation and drawing inspiration from prototypical networks [SSZ17], we introduce a new class of GNNs where the key representational step consists of comparing each input graph to a set of abstract prototypes, as illustrated in Figure 5.8. These prototypes play the role of dictionary items or basis functions in the comparison; they are also stored as point clouds, as if they were encoded from actual real graphs. Each input graph is first encoded into a set of node embeddings using a GNN. We then compare this resulting embedding point cloud to those corresponding to the prototypes. Formally, the distance between two point clouds is measured by appeal

to optimal transport Wasserstein distances. The prototypes, as abstract basis functions, can be understood as keys that highlight property values associated with different structural features. In contrast to kernel methods, the prototypes are learned together with the GNN parameters in an end-to-end manner. Our model improves upon traditional aggregation by explicitly tapping into the full set of node embeddings without collapsing them first to a single vector. We theoretically prove that, unlike standard GNN aggregation, our model defines a class of set functions that is a universal approximator. Most importantly, introducing point clouds as free parameters creates a challenging optimization problem. Indeed, as the models are trained end-to-end, the primary signal is initially available only in aggregate form. If trained as is, the prototypes often collapse to single points, reducing the Wasserstein distance between point clouds to Euclidean comparisons of their means. To counter this effect, we introduce a contrastive regularizer which effectively prevents the model from collapsing (Section 5.4.2). We demonstrate empirically that it both improves model performance and generates richer prototypes. Our contributions are summarized in the following section.
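As a concrete illustration of the point-cloud comparison underlying this model (an illustrative sketch only, not the OTGNN implementation, which differentiates through the transport plan during training), the snippet below computes an entropy-regularized Wasserstein distance between a set of node embeddings and a hypothetical prototype point cloud with a few Sinkhorn iterations.

```python
# Illustrative sketch: Sinkhorn-regularized Wasserstein distance between point clouds.
import numpy as np

def sinkhorn_wasserstein(X, Y, reg=0.1, n_iter=200):
    """Approximate W-distance between uniform measures on the rows of X and Y."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared-L2 cost matrix
    a = np.full(len(X), 1.0 / len(X))                    # uniform weights on nodes
    b = np.full(len(Y), 1.0 / len(Y))                    # uniform weights on prototype points
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]                      # (approximate) transport plan
    return (T * C).sum()

rng = np.random.default_rng(0)
node_embeddings = rng.normal(size=(5, 4))   # 5 node embeddings of dimension 4
prototype = rng.normal(size=(3, 4))         # a free "prototype" point cloud (3 points)
print(sinkhorn_wasserstein(node_embeddings, prototype))
```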

1.4 thesis contributions

This thesis is organized into four main chapters, subsequently taking deep dives into manifold representations, methods in hyperbolic spaces, Riemannian adaptive optimization and finally graph-like data.

chapter 2. We start by taking an interest in the effect of pooling on the geometry of data representations. We prove two theorems about orbit contraction (Theorem 2.1) and flattening (Theorem 2.2), and illustrate the phenomenon of interest via numerical simulations and visualizations.

chapter 3. Then, we explore representing data in hyperbolic spaces. We start by generalizing the word embedding method GloVe, and describe how to evaluate similarities and analogies. In particular, leveraging a connection with Fisher information, we develop a new method (Algorithm 1) to assess hypernymy, i.e. entailment relationships between concepts. We motivate the use of hyperbolic geometry by computing averaged delta-hyperbolicities on a weighted graph obtained from

co-occurrence counts, upon which GloVe is based. We then turn to generalizing neural architectures to hyperbolic spaces, adapting matrix multiplication, the GRU and MLR, also called softmax. To understand how to perform bias translations, we connect gyro-addition to parallel transport, the latter being computed explicitly (Theorem 3.1) from the Levi-Civita connection and its Christoffel symbols. To adapt GRUs to hyperbolic spaces, we prove a so-called gyro-chain-rule (Lemma 3.5).

chapter 4. Motivated by the lack of hyperbolic analogues of the celebrated Euclidean adaptive optimization algorithms (Adam, Adagrad, Amsgrad) − much needed for high performance of various embedding methods, including the training of GloVe − we study adapting those to certain Riemannian manifolds (Algorithm 4.1a). We obtain convergence guarantees (Theorems 4.1 and 4.2), recovering the known bounds in the particular case of Euclidean spaces. Our methods are validated on the task of embedding the WordNet hierarchy in the Poincaré disk.

chapter 5. Finally, we investigate leveraging non-Euclidean geometries to better represent graph-like data. We start by adapting GCNs to hyperbolic and spherical spaces by leveraging Einstein midpoint operations (Definition 5.5) and explore some of their mathematical properties (Theorems 5.3, 5.6 and 5.7), showing good empirical performance on graph embeddings (w.r.t. distortion) and node classification. In a second part, we take an interest in dispensing with the final aggregation step performed by most GNN architectures. We do so by computing Wasserstein distances (Optimal Transport) between point clouds of node embeddings and so-called prototype point clouds, i.e. free parameters. We study the universality (Theorem 5.11) and non-definiteness (Theorem 5.13) properties of the resulting kernel and introduce a new regularizer (Eq. (5.57)) inspired by Noise Contrastive Estimation (NCE), whose use we show is crucial in order to prevent the model from naturally collapsing back to the standard sum/average node aggregation. We illustrate the empirical efficacy of the method with both qualitative and quantitative experiments on molecular datasets.

2

GROUP INVARIANCE & CURVATURE OF REPRESENTATIONS

What does disentangling representations mean? In machine learning and neuroscience, the notion of representations being tangled has two principal interpretations, and they are intimately connected with each other. The first one is geometrical: consider two sheets of paper of different colors, place one of the two on top of the other, and crumple them together into a paper ball; now, it may prove difficult to separate the two sheets with a third one: they are tangled, each color sheet representing one class of a classification problem. The second one is analytical: consider a dataset parametrized by a set of coordinates {x_i}_{i∈I}, such as images parametrized by pixels, and a classification task between two classes of images. On the one hand, we cannot find a subset {x_i}_{i∈J} with J ⊂ I of this coordinate system such that varying these coordinates would not change the class of an element, while still spanning a reasonable variety of different images of this class. On the other hand, we are likely to be capable of finding a large number of transformations preserving the class of any image of the dataset without being expressible as linear transformations on this coordinate system, and this is another way to interpret representations or factors of variation as being tangled. Why is disentangling representations important? On the physiological side, the brains of humans and primates alike have been observed to solve object recognition tasks by progressively disentangling their representations along the visual stream, from V1 to the IT cortex [DC07]; [DZR12]. On the side of deep learning, deep convolutional neural networks are also able to disentangle highly tangled representations, since a softmax − which, geometrically, performs essentially a linear separation − computed on the representation of their last hidden layer can yield very good accuracy [KSH12]. Conversely, disentangling representations might be sufficient to pre-solve practically any task relevant to the observed data [Ben13]. How can we design algorithms in order to move towards more disentangled representations? Although it was conjectured that the visual stream might disentangle representations by flattening them locally,

15 group invariance & curvature of representations thus inducing a decrease in the curvature globally [DC07], the mecha- nisms underlying such a disentanglement, whether it be for the brain or deep learning architectures, remain very poorly understood [Ben13]; [DC07]. However, it is now of common belief that computing represen- tations that are invariant with respect to irrelevant transformations of the input data can help. Indeed, on the one hand, deep convolutional networks have been noticed to naturally learn more invariant features with deeper layers [Goo+09]; [LV15]; [TM16]. On the other hand, the V1 part of the brain similarly achieves invariance to translations and rotations via a “pinwheels” structure, which can be seen as a prin- cipal fiber bundle [Pet03]; [Pog+12]. Conversely, enforcing a higher degree of invariance with respect to not only translations, but also rotations, flips, and other groups of transformation has been shown to achieve state-of-the-art results in various machine learning tasks [BM13]; [CW16]; [CW17]; [DDK16]; [GD14]; [OM15], and is believed to help in linearizing small [Mal16]. To the best of our knowledge, the main theoretical efforts in this direction include the theory of scattering operators [Mal12]; [WB15] as well as I-theory [Ans+13a]; [Ans+13b]; [AP14]; [ARP16]. In particular, I-theory permits to use the whole apparatus of kernel theory to build invariant features [MVP15]; [Raj+17]. Our work builds a bridge between the idea that disentangling is a result of (i) a local decrease in the curvature of the representations, and (ii) building representations that are invariant to nuisance deformations, by proving that pooling over such groups of transformations results in a local decreasing of the curvature. We start by providing some background material, after which we introduce our formal framework and theorems, which we then discuss in the case of the non-commutative group generated by translations and rotations. Research presented in this chapter was conducted at the very begin- ning of the author’s PhD, and was only published as an ArXiv preprint [Bec´ 17].

2.1 background material

2.1.1 Groups, Lie Algebras & Haar Measures

A group is a set G together with a map · : G × G → G such that:

16 2.1 background material

1. ∀g, g0, g00 ∈ G, g · (g0 · g00) = (g · g0) · g00,

2. ∃e ∈ G, ∀g ∈ G, g · e = e · g = g,

3. ∀g ∈ G, ∃g−1 ∈ G : g · g−1 = g−1 · g = e, where e is called the identity element. We write gg0 instead of g · g0 for simplicity. If, moreover, gg0 = g0g for all g, g0 ∈ G, then G is said to be commutative or abelian.

A of G is a set H ⊂ G such that for all h, h0 ∈ H, hh0 ∈ H and h−1 ∈ H. A subgroup H of a group G is said to be normal in G if for all g ∈ G, gH = Hg, or equivalently, for all g ∈ G and h ∈ H, ghg−1 ∈ H. If G is abelian, then all of its are normal in G.

A is a group which is also a smooth manifold, and such that its product law and inverse map are smooth with respect to its manifold structure. A Lie group is said to be locally compact if each of its element possesses a compact neighborhood. On every locally compact Lie group, one can define a Haar measure, which is a left-invariant, non- trivial Lebesgue measure on its Borel algebra, and is uniquely defined up to a positive scaling constant. If this Haar measure is also right- invariant, then the group is said to be unimodular. This Haar measure is always finite on compact sets, and strictly positive on non-empty open sets. Examples of unimodular Lie groups include in particular all abelian groups, compact groups, semi-simple Lie groups and connected nilpotent Lie groups.

A group G is said to be acting on a set X if we have a map · : G × X → X such that for all g, g0 ∈ G, for all x ∈ X, g · (g0 · x) = (gg0) · x and e · x = x. If this map is also smooth, then we say that G is smoothly acting on X. We write gx instead of g · x for simplicity. Then, the group orbit of x ∈ X under the action of G is defined by G · x = {gx | g ∈ G}, and the stabilizer of x by Gx = {g ∈ G | gx = x}. Note that Gx is always a subgroup of G, and that for all x, y ∈ X, we have either (G · x) ∩ (G · y) = ∅, or G · x = G · y. Hence, we can write X as the disjoint union of its group orbits, i.e. there exists a minimal subset ˜ X ⊂ X such that X = tx∈X˜ G · x. The set of orbits of X under the action of G is written X/G, and is in one-to-one correspondence with X˜ . Moreover, note that if H is a subgroup of G, then H is naturally

17 group invariance & curvature of representations acting on G via (h, g) ∈ H × G 7→ hg ∈ G; if we further assume that H is normal in G, then one can define a canonical group structure on G/H, thus turning the canonical projection g ∈ G 7→ H · g into a group morphism.

A between two manifolds is a map that is smooth, bijective and has a smooth inverse. A group morphism between two 0 0 groups G and G is a map ϕ : G → G such that for all g1, g2 ∈ G, ϕ(g1g2) = ϕ(g1)ϕ(g2).A group is a bijective group mor- phism, and a Lie group isomorphism is a group isomorphism that is also a diffeomorphism.

The Lie algebra g of a Lie group G is its tangent space at e, and is endorsed with a bilinear map [·, ·] : g × g → g called its Lie bracket, and such that for all x, y, z ∈ g, [x, y] = −[y, x] and [x, [y, z]] + [y, [z, x]] + [z, [x, y]] = 0. Moreover, there is a bijection between g and left-invariant vector fields on G, defined by ξ ∈ g 7→ {g ∈ G 7→ de Lg(ξ)}, where Lg(h) = gh is the left translation. Finally, the flow t 7→ φt of such a left-invariant vector field Xξ is given by φt(g) = g exp(tξ), where exp : g → G is the exponential map on G.

For more on Lie groups, Lie algebras, Lie brackets and group rep- resentations, see [Kir08], and for a rapid and clear presentation of the notions of sectional curvature and Riemannian curvature, see [AH10].

2.1.2 I-theory

I-theory aims at understanding how to compute a representation of an image I that is both unique and invariant under some deformations of a group G, and how to build such representations in a hierarchical way [Ans+13a]; [Ans+13b]; [AP14]; [ARP16]; [Pog+12].

Suppose that we are given a Hilbert space X , typically L2(R2), repre- senting the space of images. Let G be a locally compact group acting on X . Then, note that the group orbit G · I constitutes such an invariant and unique representation of I, as G · I = G · (gI), for all g ∈ G, and since two group orbits intersecting each other are equal.

18 2.1 background material

But how can we compare such group orbits? For an image I ∈ X , define the map ΘI : g ∈ G 7→ gI ∈ X and the probability distribution −1 PI (A) = µG(ΘI (A)) for any borel set A of X , where µG is the Haar measure on G. For I, I0 ∈ X , write I ∼ I0 is there exists g ∈ G such that 0 0 I = gI . Then, one can prove that I ∼ I if and only if PI = PI0 . Hence, 0 we could compare G · I and G · I by comparing PI and PI0 . However, computing PI can be difficult, so one must be looking for ways to 2 2 approximate PI. If t ∈ S(L (R )), define PhI,ti to be the distribution associated with the random variable g 7→ hgI, ti. One can then prove 2 2 that PI = PI0 if and only if PhI,ti = PhI0,ti for all t ∈ S(L (R )), and then provide a lower bound on the sufficient number K of such templates 2 2 tk, 1 6 k 6 K, drawn uniformly on S(L (R )), in order to recover the information of PI up to some error ε and with high probability 1 − δ.

Finally, each PhI,tki can be approximated by a histogram

k 1 hn(I) = ∑ ηn(hgI, tki),(2.1) |G| g∈G if G is finite or Z k 1 hn(I) = ηn(hgI, tki)dµG(g),(2.2) µG(G) g∈G if G is compact, where ηn are various non-negative and possibly non- linear functions, 1 6 n 6 N, such as sigmoid, ReLU, modulus, hyper- bolic tangent or x 7→ |x|p, among others.

In the case where the group G is only partially observable (for in- stance if G is only locally compact but not bounded), one can define k instead a “partially invariant representation”, replacing each hn(I) by 1 Z ηn(hgI, tki)dµG(g),(2.3) µG(G0) g∈G0 where G0 is a compact subset of G which can be observed in practice. Under some “localization condition” (see [Ans+13b]), it can be proved that this representation is invariant under deformations by elements of G0. When this localization condition is not met, we do not have any exact invariance a priori, but one might expect that the variation in directions defined by the symmetries of G0 is going to be reduced.

19 group invariance & curvature of representations

For instance, let G be the group R2 of translations in the plane, 2 2 G0 = [−a, a] for some a > 0, η : x 7→ (σ(x)) where σ is a point-wise 2 2 non-linearity commonly used in neural networks, and tk ∈ S(L (R )) for 1 6 k 6 K. Then, note that the quantities q s |G0|hk(I) = ∑ η(hgI, tki),(2.4) g∈G0 for 1 6 k 6 K are actually computed by a 1-layer convolutional neural 2 network with filters (tk)1 k K, non-linearity σ and L -pooling. More- p 6 6 over, the family ( |G0|hk(gI))g∈G is exactly the output of this convolu- tional layer, thus describing a direct correspondence between pooling and locally averaging over a group of transformations.

Another correspondence can be made between this framework and deep learning architectures. Indeed, assume that during learning, the set of filters of a layer of a convolutional neural network becomes stable under the action of some unknown group G acting on the pixel space, and denote by σ the point-wise non-linearity computed by the network. Moreover, suppose that the convolutional layer and point-wise p p non-linearity are followed by an L -pooling, defined by Πφ(I)(x) = R p 1/p y∈R2 |I(y)1[0,a]2 (x − y)| dy . Then, observe that the convolutional layer outputs the following feature maps: p {Πφ(σ(I ? tk))}16k6K.(2.5) Besides, if the group G has a unitary representation, and if its action preserves R2, then for all g ∈ G and 1 6 k 6 K, we have p p −1 p −1 Πφ(σ(gI ? tk)) = Πφ(σ(g(I ? g tk))) = gΠφ(σ(I ? g tk)).(2.6) Then, the following layer of the convolutional network is going to compute the sum across channels k of these different signals. However, if our set of filters tk can be written as G0 · t for some filter t and a subpart G0 of G, then this sum will be closely related to a histogram as in I-theory:

p p −1 ∑ Πφ(σ(I ? gt)) = ∑ gΠφ(σ(g I ? tk)).(2.7) g∈G0 g∈G0 In other words, (local) group invariances are free to appear during learning among filters of a convolutional neural network, and will nat- urally be pooled over by the next layer. For more on this, see [BSL13];

20 2.2 main results: formal framework and theorems

[Mal16].

Finally, let’s mention that this implicit pooling over symmetries can also be computed explicitly, and such group invariances across filters enforced, if we know the group in advance, as in G-CNNs and steerable CNNs [CW16]; [CW17].

2.2 main results: formal framework and theorems

Let G be a finite-dimensional, locally compact and unimodular Lie 2 group smoothly acting on R . This defines an action (Lg f )(x) = −1 2 2 f (g x) on L (R ). Let G0 be a compact neighborhood of the identity element e in G, and assume that there exists λ > 0 such that for all g0 ∈ G0, supx∈R2 |Jg0 (x)| 6 λ, where Jg is the determinant of the Jacobian matrix of g seen as a diffeomorphism of R2. We define Φ : 2 2 2 2 L (R ) → L (R ), the averaging operator on G0, by

1 Z Φ( f ) = Lg f dµG(g).(2.8) µG(G0) g∈G0

disclaimer. Although the mathematics in this paragraph may initially appear overly complicated to the reader, they were necessary to formulate things rigorously, and most importantly in order to avoid writing something wrong, which happens so easily when manipulating these objects. However, the profane reader may safely skip the mathematical subtleties and remain able to understand the high-level intuition behind the point we try to make about orbit contraction and flattening.

2.2.1 Group Pooling Results in Orbit Contraction

Our first result describes how the euclidean distance in L2(R2) be- tween a function f and its translation by some g ∈ G0 is contracted by this locally averaging operator.

21 group invariance & curvature of representations

Theorem 2.1. For all f ∈ L2(R2), for all g ∈ G, √  q  µG((G0g)∆G0) kΦ(Lg f ) − Φ( f )k2 6 λ max 1, kJgk k f k2. ∞ µG(G0) (2.9)

Proof. We have

2 2 µG(G0) kΦ(Lg f ) − Φ( f )k2 = Z Z 0 2 Lg0 g f (x) − Lg0 f (x)dµG(g ) dx,(2.10) 2 0 x∈R g ∈G0 but Z 0 (Lg0 g f (x) − Lg0 f (x))dµG(g ) = 0 g ∈G0 Z Z 0 0 Lg0 g f (x)dµG(g ) − Lg0 f (x)dµG(g ),(2.11) 0 0 g ∈G0 g ∈G0

00 0 i.e. setting g = g g and using the right-invariance of µG,

Z 0 (Lg0 g f (x) − Lg0 f (x))dµG(g ) = 0 g ∈G0 Z Z 00 0 Lg00 f (x)dµG(g ) − Lg0 f (x)dµG(g ).(2.12) 00 0 g ∈G0g g ∈G0 R R R R R R R And using A h − B h = ( A\B h + A∩B h) − ( B\A h + B∩A h) = A\B h − R B\A h, we have:

Z 0 (Lg0 g f (x) − Lg0 f (x))dµG(g ) = 0 g ∈G0 Z Z 0 0 Lg0 f (x)dµG(g ) − Lg0 f (x)dµG(g ).(2.13) 0 0 g ∈G0g\G0 g ∈G0\G0g

Plugging this in the first equation gives

µG(G0)kΦ(Lg f ) − Φ( f )k2 = Z Z 0 0 k (Lg0 f )dµG(g ) − (Lg0 f )dµG(g )k2,(2.14) 0 0 g ∈G0g\G0 g ∈G0\G0g

22 2.2 main results: formal framework and theorems i.e. using a triangle inequality

µG(G0)kΦ(Lg f ) − Φ( f )k2 6 Z Z 0 0 k (Lg0 f )dµG(g )k2 + k (Lg0 f )dµG(g )k2.(2.15) 0 0 g ∈G0g\G0 g ∈G0\G0g Now observe that by interverting the integrals using Fubini’s theorem, Z 0 k (Lg0 f )dµG(g )k2 = 0 g ∈G0\G0g s Z Z Z  (Lg1 f )(x)(Lg2 f )(x)dx dµG(g1)dµG(g2), 2 g1∈G0\G0g g2∈G0\G0g x∈R (2.16) and using a Cauchy-Schwarz inequality, Z 0 k (Lg0 f )dµG(g )k2 0 6 g ∈G0\G0g sZ Z

kLg1 f k2kLg2 f k2dµG(g1)dµG(g2).(2.17) g1∈G0\G0g g2∈G0\G0g √ 0 q As for all g ∈ G0 we have kLg0 f k2 = k f |Jg0 |k2 6 λk f k2 with a change of variables, we have: Z √ 0 k (Lg0 f )dµG(g )k2 λµG(G0 \ (G0g))k f k2.(2.18) 0 6 g ∈G0\G0g For the other term, note that by setting g00 = g0g−1, we have Z Z 0 00 k (Lg0 f )dµG(g )k2 = k (Lg00 g f )dµG(g )k2 (2.19) 0 00 −1 g ∈G0g\G0 g ∈G0\G0g Z 00 = k (Lg00 Lg f )dµG(g )k2, 00 −1 g ∈G0\G0g (2.20) and then similarly, Z 00 k (Lg00 Lg f )dµG(g )k2 = 00 −1 g ∈G0\G0g sZ Z

kLg1 Lg f k2kLg2 Lg f k2dµG(g1)dµG(g2).(2.21) −1 −1 g1∈G0\G0g g2∈G0\G0g

23 group invariance & curvature of representations

0 q q As for all g ∈ G0 we have kLg0 Lg f k2 = k f |Jg0 g|k2 6 λkJgk k f k2, ∞ we have Z 0 q −1 k (Lg0 f )dµG(g )k2 λkJgk µG(G0 \ (G0g ))k f k2.(2.22) 0 6 g ∈G0g\G0 ∞ Therefore

µG(G0)kΦ(Lg f ) − Φ( f )k2 6 √ q −1 λkJgk µG(G0 \ (G0g ))k f k2 + λµG(G0 \ (G0g))k f k2,(2.23) ∞ and the following fact concludes the proof:

−1 µG(G0 \ (G0g )) + µG(G0 \ (G0g)) = µG((G0g) \ G0) + µG(G0 \ (G0g)) (2.24)

= µG((G0g)∆G0).(2.25)

The symbol ∆ above is defined A∆B = (A ∪ B) \ (A ∩ B) = (A \ B) ∪ (B \ A). Note that, as one could have expected, this result doesn’t de- pend on the scaling constant of the Haar measure. Intuitively, this result formalizes the idea that locally averaging with respect to some factors of variation, or coordinates, will reduce the variation with respect to those coordinates. The following drawings illustrate the intuition behind Theorem 2.1, where we pass from left to right by applying Φ.

(( ) ) Note that the quantity µG G0g ∆G0 , depending on the geometry of µG(G0) the group, is likely to decrease when we increase the size of G0: if 2 2 G = R is the translation group, G0 = [0, a] for some a > 0, and gε is the translation by the vector (ε, ε), then µG is just the usual Lebesgue measure in R2 and µ ((G g )∆G ) 2aε 4ε G 0 ε 0 ∼ 2 = .(2.26) → 2 p µG(G0) ε 0 a µG(G0)

Indeed, locally averaging over a wider area will decrease the variation even more.

24 2.2 main results: formal framework and theorems

Figure 2.1: Concerning the drawing on the left-hand side, the blue and red areas correspond to the compact neighborhood G0 centered in f and Lg f respectively, the grey area represents only a visible subpart of the whole group orbit, the thick, curved line is a geodesic between f and Lg f inside the orbit G · f , and the dotted line represents the line segment 2 2 between f and Lg f in L (R ), whose size is given by the euclidean distance kLg f − f k2. See Figures 2.2 and 2.3.

2.2.2 Group Pooling Results in Orbit Flattening

As images are handily represented by functions from the space of pixels R2 to either R or C, let us define our dataset X to be a finite-dimensional manifold embedded in a bigger space of func- tions Y. As for technical reasons we will need our functions to be L2, smooth, and with a gradient having a fast decay at infinity, we choose Y to be the set of functions f ∈ L2(R2) ∩ C∞(R2) such that |h∇ ( ) i| = ( 1 ) > f x , x Ox→∞ kxk1+ε , for some fixed small ε 0. Note that in practice, images are only non-zero on a compact domain, therefore these assumptions are not restrictive.

Further assume that for all f ∈ X , for all g ∈ G, Lg f ∈ X . Intuitively, X is our manifold of images, and G corresponds to the group of trans- formations that are not relevant to the task at hand. Recall that from I-theory, the orbit of an image f under G constitutes a good unique and invariant representation. Here, we are interested in comparing G · f and Φ(G · f ), i.e. before and after locally averaging.

25 group invariance & curvature of representations

But how can we compute a bound on the curvature of Φ(G · f )? It is well known that in a Lie group endorsed with a bi-invariant pseudo- Riemannian metric h·, ·i, the Riemann curvature tensor is given by

1 R(X, Y, Z, W) = − h[X, Y], [Z, W]i,(2.27) 4 where X, Y, Z, W are left-invariant vector-fields, and hence if (X, Y) forms an orthonormal basis of the plane they span, then the sectional curvature is given by

1 κ(X ∧ Y) = R(X, Y, Y, X) = h[X, Y], [X, Y]i.(2.28) 4 Therefore, would we be able to define a Lie group structure and a bi-invariant pseudo-Riemannian metric on Φ(G · f ), we could use this formula to compute its curvature. First, we are going to define a Lie group structure on G · f , which we will then transport on Φ(G · f ). As a Lie group structure is made of a smooth manifold structure and a compatible group structure, we need to construct both. In order to obtain the group structure on the orbit, let’s assume that the stabilizer Gf is normal; a condition that is met for instance if G is abelian, or if this subgroup is trivial, meaning that f does not have internal symmetries corresponding to those of G, which is only a technical condition, as it can be enforced in practice by slightly deforming f , by breaking the relevant symmetries with a small noise. Besides, in order to obtain a smooth manifold structure on the orbits, we need to assume that Gf is an embedded Lie subgroup of G, which, from proposition 2.3, is met automatically when this group admits a finite-dimensional representa- tion.

Then, from proposition 2.4, there is one and only one manifold struc- ture on the topological quotient space G/Gf turning the canonical projection π : G → G/Gf into a smooth submersion; moreover, the action of G on G/Gf is smooth, G/Gf is a Lie group, π is a Lie group morphism, the Lie algebra g f of Gf is an ideal of the Lie algebra g of G and the linear map from TeG/TeGf to TeGf (G/Gf ) induced by Teπ is a Lie algebra isomorphism from g/g f to the Lie algebra of G/Gf .

Finally, we need a geometrical assumption on the orbits, insuring that G is warped on G · f in a way that is not “fractal”, i.e. that this orbit can

26 2.2 main results: formal framework and theorems be given a smooth manifold structure: assume that G · f is locally closed in X . Using this assumption and proposition 2.5, the canonical map Θ f : G/Gf → X defined by Θ f (gGf ) = Lg f is a one-to-one immersion, whose image is the orbit G · f , which is a submanifold of X ; moreover, Θ f is a diffeomorphism from G/Gf to G · f . Further notice that Θ f is G-equivariant, i.e. for all g, g0 ∈ G,

0 0 0 Θ f (g(g Gf )) = Lgg f = Lg Lg0 f = LgΘ f (g Gf ).(2.29) Moreover, we can define on G · f a group law by

(Lg1 f ) · (Lg2 f ) := Lg1g2 f ,(2.30) for g1, g2 ∈ G. Indeed, let’s prove that this definition doesn’t depend on the choice of g1, g2. Assume that gi = aibi for ai ∈ G and bi ∈ Gf , 0 i ∈ {1, 2}. Then, as Gf is normal in G, there exists b1 ∈ Gf such that 0 0 b1a2 = a2b1. Then g1g2 = a1a2b1b2 and hence Lg1g2 f = La1a2 f , and this group law is well-defined. Now that G · f is a group, observe that Θ f is a group isomorphism from G/Gf to G · f . Indeed, it is bijective since it is a diffeomorphism, and it is a group morphism as

0 0 Θ f ((gGf )(g Gf )) = Θ f ((gg )Gf ) (2.31)

= Lgg0 f = (Lg f ) · (Lg0 f ) (2.32) 0 = Θ f (gGf ) · Θ f (g Gf ).(2.33)

Hence, G · f is also a Lie group, since G/Gf is a Lie group and Θ f : G/Gf → G · f is a diffeomorphism. Moreover, Lie(G · f ) is isomorphic to g/g f as a Lie algebra, since they are isomorphic as vector spaces (Θ f being an immersion), and by the fact that the pushforward of a diffeomorphism always preserves the Lie bracket.

Now that we have defined a Lie group structure on G · f , how can we obtain one on Φ(G · f )? Suppose that Φ is injective on G · f and on Lie(G · f ). We can thus define a group law on Φ(G · f ) by:

∀g1, g2 ∈ G/Gf , Φ(Lg1 f ) · Φ(Lg2 f ) := Φ(Lg1g2 f ). As the inverse function theorem tells us that Φ is a diffeomorphism from G · f onto its image, Φ(G · f ) is now endorsed with a Lie group structure. However, in order to carry out the relevant calculations, we still need to define left-invariant vector-fields on our Lie group orbits.

27 group invariance & curvature of representations

For all ξ ∈ g, define the following left-invariant vector-fields respec- tively on G · f and Φ(G · f ):

d Xξ : Lg f 7→ (Lg Lexp(tξ) f ),(2.34) dt |t=0 ˜ d Xξ : Φ(Lg f ) 7→ Φ(Lg Lexp(tξ) f ).(2.35) dt |t=0

We can now state the following theorem:

Theorem 2.2. For all f ∈ X , for all ξ, ξ0 ∈ g, (( ( [ 0])) ) ˜ ˜ 2  d µG G0 exp s ξ, ξ ∆G0 2 2 k[Xξ, Xξ0 ]Φ( f )k2 6 λ k f k2.(2.36) ds |s=0 µG(G0)

Proof. Let us start by proving the following lemma:

Lemma 2.1. For all f ∈ X and ξ ∈ g,

d d  Φ(Lexp(tξ) f ) = Φ (Lexp(tξ) f ) . dt |t=0 dt |t=0

Proof. For all x ∈ R2,

d  Φ(Lexp(tξ) f ) (x) = dt |t=0 Z d 1 0  (Lg0 L ( ) f )(x)dµG (g ) 0 exp tξ 0 |t=0 dt µG(G0) g ∈G0 Z 1 d 0−1  0 = f (exp(−tξ)g x) dµG (g ) 0 |t=0 0 µG(G0) g ∈G0 dt Z 1 0−1  0 = d( 0−1 ) f − ξ(g x) dµG (g ) 0 g x 0 µG(G0) g ∈G0  = Φ d· f − ξ(·) (x) d  = Φ (Lexp(tξ) f ) (x).(2.37) dt |t=0

28 2.2 main results: formal framework and theorems

As Φ realizes a diffeomorphism from G · f onto its image, and as Φ equals its differential from Lemma 2.1, we have that for all vector field X on G · f , Φ∗(X)(Φ( f )) = (dΦ) f (X( f )) = Φ(X( f )). Hence ˜ ˜ [Xξ, Xξ0 ]Φ( f ) = [Φ(Xξ ), Φ(Xξ0 )] f (2.38) −1 −1 = [Φ(Xξ ) ◦ Φ , Φ(Xξ0 ) ◦ Φ ]Φ( f ) (2.39)

= [Φ∗(Xξ ), Φ∗(Xξ0 )]Φ( f ) (2.40)

= Φ∗([Xξ, Xξ0 ])(Φ( f )) (2.41)

= Φ([Xξ, Xξ0 ] f ).(2.42)

Recall that the Lie bracket of left-invariant vector fields is given by the opposite of the Lie bracket of their corresponding generators, hence in our case: [Xξ, Xξ0 ] = X−[ξ,ξ0] = −X[ξ,ξ0].(2.43) Therefore,

˜ ˜ k[Xξ, Xξ0 ]Φ( f )k2 = kΦ([Xξ, Xξ0 ] f )k2 (2.44)

= kΦ(X[ξ,ξ0]( f ))k2 (2.45) 1 = kΦ(lim (Lexp(t[ξ,ξ0]) f − f ))k2 (2.46) t→0 t 1  = k lim Φ(Lexp(t[ξ,ξ0]) f ) − Φ( f ) k2.(2.47) t→0 t From Theorem 2.1, we have

kΦ(Lexp(t[ξ,ξ0]) f ) − Φ( f )k2 6 √ 0 q µG((G0 exp(t[ξ, ξ ]))∆G0) λ max(1, kJexp(t[ξ,ξ0])k ) × k f k2.(2.48) ∞ µG(G0)

As exp(t[ξ, ξ0]) → e when t → 0, its Jacobian goes to 1. Moreover, as f has a gradient with fast decay, we can take the limit out of the L2-norm, which concludes the proof.

As X is a manifold embedded in L2(R2), it inherits a Riemannian metric by projection of the usual inner-product of L2(R2) on the tan- gent bundle of X . Moreover, if we further assume that for all g ∈ G, |Jg| = 1, then this Riemannian metric is bi-invariant, and we can finally use the above formula on the Riemannian curvature, together with the previous inequality, to compute a bound on the curvature in a Lie

29 group invariance & curvature of representations group endorsed with an bi-invariant metric:

Corollary. For all f ∈ X , for all ξ, ξ0 ∈ g, (( ( [ 0])) ) ˜ ˜ ˜ ˜ 1 d µG G0 exp s ξ, ξ ∆G0 2 2 0 6 RΦ( f )(Xξ, Xξ0 , Xξ0 , Xξ ) 6 k f k2. 2 ds |s=0 µG(G0) (2.49) And if (X˜ ξ, X˜ ξ0 ) forms an orthonormal basis of the plane they span in Lie(Φ(G · f )) = Φ(Lie(G · f )), then: (( ( [ 0])) ) ˜ ˜ 1 d µG G0 exp s ξ, ξ ∆G0 2 2 0 6 κΦ( f )(Xξ ∧ Xξ0 ) 6 k f k2. 2 ds |s=0 µG(G0) (2.50)

remark. The sectional curvature of the basis (X˜ ξ, X˜ ξ0 ) at Φ( f ) is also the Gaussian curvature of the two-dimensional surface swept out by small geodesics induced by linear combinations of X˜ ξ (Φ( f )) and X˜ ξ0 (Φ( f )).

Among well-known finite-dimensional, locally compact and uni- modular Lie group smoothly acting on R2, there are the group R2 of translations, the compact groups O(2) and SO(2), the E(2), as well as transvections, or shears. Moreover, another class of suitable unimodular Lie groups is given by the one-dimensional flows of Hamiltonian systems, which, as deformations of images, could be interpreted as the smooth evolutions of the screen in a video over time, provided that these evolutions can be expressed as group actions on the pixel space.

2.2.3 Orbit Flattening of R2 n SO(2)

Finally, let’s see what Theorem 2.2 gives us in the case G = R2 n SO(2). Note that this group is not commutative, and its curvature form 2 is not identically zero. Let θ ∈ (−π, π), a > 0, and G0 = [−θ, θ] × [0, a] . A representation of this group is given by matrices of the form  cos(θ) − sin(θ) x  g(θ, x, y) =  sin(θ) cos(θ) y  ,(2.51) 0 0 1

30 2.2 main results: formal framework and theorems and a representation of its Lie algebra is given by  0 −ζ x  ξ(ζ, x, y) =  ζ 0 y  .(2.52) 0 0 0 The Lie bracket is then given by [ξ(ζ, x, y), ξ(ζ0, x0, y0)] = ξ(ζ, x, y)ξ(ζ0, x0, y0) − ξ(ζ0, x0, y0)ξ(ζ, x, y) (2.53) = ξ(0, ζ0y − ζy0, ζx0 − ζ0x).(2.54) As the exponential map on the group of translations is the identity map, and as the Haar measure on R2 n SO(2) is just the product of the Haar measures on R2 and SO(2), we have

0 0 0 µG((G0 exp(s[ξ(ζ, x, y), ξ(ζ , x , y ]))∆G0) = 0 0 0 0 2θ × µR2 (([s(ζ y − ζy ), s(ζ y − ζy ) + a]× [s(ζx0 − ζ0x), s(ζx0 − ζ0x) + a])∆[0, a]2),(2.55)

2 and µG(G0) = 2θa . Therefore, when s → 0, we have

0 0 0 µG((G0 exp(s[ξ(ζ, x, y), ξ(ζ , x , y ]))∆G0) ∼ 2θ × 2(as(ζ0y − ζy0) + as(ζx0 − ζ0x)) = 4θas(ζ(x0 − y0) − ζ0(x − y)),(2.56) from what we deduce that 1 d µ ((G exp(s[ξ(ζ, x, y), ξ(ζ0, x0, y0]))∆G )  G 0 0 2 = 2 ds |s=0 µG(G0) (ζ(x0 − y0) − ζ0(x − y))2 .(2.57) a2 As a consequence, if f ∈ X is an image in our dataset, of L2-norm equal to 1, and if we choose ξ(ζ, x, y) and ξ(ζ0, x0, y0) such that the 2 ˜ ˜ 2 L functions Xξ(ζ,x,y)(Φ( f )) and Xξ(ζ0,x0,y0)(Φ( f )) are orthogonal in L and have L2-norm equal to 1, then the Gaussian curvature κ of the 2-dimensional surface swept out by these two vector fields around Φ( f ), in the Lie group Φ(G · f ), is smaller than:

(ζ(x0 − y0) − ζ0(x − y))2 κ .(2.58) 6 a2

31 group invariance & curvature of representations

2.3 numerical experiments: shears and translations on images

2.3.1 Synthetic Orbit Contraction in Pixel Space

In this subsection, we present numerical simulations illustrating the theoretical results. Experiments were run twice, with the same parameters but on two different grayscale images of size 100 × 100: a handwritten ‘9’, which is a very simple image, and a grayscale version of ‘Lena’, being closer to a natural image. Note that each image can be 2 seen as a point in R100 , where each coordinate corresponds to a pixel value. For each image I, we start by defining a set of transformations G˜ which is a discrete approximation of a connected subpart of the continuous Lie group G, and apply each of them to the image I, thus obtaining a discrete point-cloud G˜ · I approximating the continuous group orbit G · I. From this point-cloud, we build a second one, denoted by Φ(G˜ · I), which is obtained by replacing each point gI in G˜ · I by the average Φ(gI) of the set {ggI˜ } where G˜ is a discrete set of g˜∈G˜0 0 transformations approximating the continuous neighbourhood G0 of 2 the identity element. Hence, from a point I ∈ R100 , we build two point- 2 2 clouds G˜ · I ⊂ R100 and Φ(G˜ · I) ⊂ R100 , where each point represents an image. As the Lie groups we use in our experiments are manifolds of dimen- sion either 1 or 2, our two point-clouds will be points sampled on 1- or 2-dimensional manifolds. Hence, even if these datasets are embedded in a euclidean space of very high dimension, they actually lie on much smaller-dimensional subspaces, meaning that using a good coordinate system, we could visualize them in a 3-dimensional space with very little loss of information. In order to find this coordinate system, we use a Multidimensional Scaling (MDS), which preserves as much as possible euclidean distances between points.

Note that MDS was applied to the union of the two datasets G˜ · I and Φ(G˜ · I), so that we can visualize both of them in the same plot, as blue and red points respectively. On each of the six figures, the plot on the left-hand side (resp. right-hand side) corresponds to experiments obtained using the hand-written ‘9’ (resp. using ‘Lena’).

32 2.3 numerical experiments: shears and translations on images

These two images are shown in Figure 2.2 (a). Then, Figures 2.2 and 2.3 present the plots obtained by translating these images along the x-axis with 50 translations spaced uniformly in the range [−20, 20].

In all other figures, the blue and the red point-clouds contain 502 = 2500 images each, obtained by composing one of 50 shears, with one of 50 translations. The shears we used were always along the x-axis, i.e. of the form (x, y) ∈ R2 7→ (x + λy, y). The translations we used were either only along the x-axis or only along the y-axis.

Note that our shears commute with translations along the x-axis but not with those along the y-axis.

As expected, the red point-clouds, which represent the image of the blue ones under pooling, are contracted. In each figure from 3 to 6, one can see a green point: it is the origin of the space, i.e. the image with pixel value zero everywhere.

Note that as the blue point-cloud always has the same number of points as the red one, it is easy to see on these plots that the overall mass of the red point-clouds is closer to the origin than the blue mass is. As blue orbits get closer to zero after pooling, and see their variation reduced, their internal distances are contracted, which corresponds to a loss of information: as points get closer, they are progressively identified as being the same, thus loosing the information allowing to differentiate between them.

Identifying points in the same group orbit (w.r.t. G) as being the same corresponds to associating a progressively more and more (G-)invariant representation to each of these points.

Note that Figures 2.2 and 2.3 nicely illustrate what is described in Figure 2.1.

33 group invariance & curvature of representations

(a) Original images.

(b) Translations along x-axis.

(c) Translations along x-axis and shears.

Figure 2.2: Illustration of orbits (first part).

34 2.3 numerical experiments: shears and translations on images

(a) Translations along x-axis and shears (other perspective).

(b) Translations along y-axis and shears.

(c) Translations along y-axis and shears (other perspective).

Figure 2.3: Illustration of orbits (second part).

35 group invariance & curvature of representations

2.3.2 CIFAR-10: Orbit Contraction in Logit Space

(a) Before training. (b) After only 10% of an epoch.

Figure 2.4: Translation orbits in the logit space of a CNN (CIFAR-10).

In this subsection, we seek to qualitatively support the claim that the phenomenon of orbit contraction and flattening described in the previous sections correlates with what happens in practice with real data and standard architectures. We train a 5-layers CNN for classification on the 10-classes image dataset CIFAR-10. Each of the six colors represents the translation orbit of an image of a different class, taken in the logit space, i.e. just before the softmax. 3D visualization was obtained via MDS. Note that the manifolds we plot are 2-dimensional, which supports the use of MDS despite the high dimension of the ambient space. One can observe in Figure 2.4 that after only 10% of an epoch, classes start to disentangle (i.e. become more and more linearly separable). Moreover, translation orbits appear to be contracted (as can be seen on the scale) and locally flattened. Note that contraction in scale could in part also be due to weight regularization. Nonetheless, Figure 2.4 seems to support the claim that orbit contraction and flattening play a role in class disentanglement, and hence in solving classification.

2.4 conclusive summary

Being able to disentangle highly tangled representations is a very important and challenging problem in machine learning. In deep learn- ing in particular, there exist successful algorithms that may disentangle highly tangled representations in some situtations, without us under-

36 2.5 reminders of known results standing why. Similarly, the ventral stream of the visual cortex in humans and primates seems to perform such a disentanglement of representations, but, again, the reasons behind this process are difficult to understand. It is believed that making representations invariant to some nuisance deformations, as well as locally flattening them, might help or even be an essential part of the disentangling process. As shown by our theorems, there is a connection between these two intuitions, in the sense that achieving a higher degree of invariance with respect to some group transformations will flatten the representations in direc- tions of the tangent space corresponding to the Lie algebra generators of these transformations. We showed that in the case of the group of positive affine isometries, a precise bound on the sectional curvature can be computed, with respect to the pooling parameters. Numerical simulations supported the claim.

2.5 reminders of known results

The next three propositions are taken from the publicly available french textbook [Pau14], in which they’re respectively numbered as E.7, 1.60, 1.62.

Proposition 2.3. Let G be a Lie group and ρ : G → GL(V) a finite- dimensional Lie group representation of G. Then for all v ∈ V, the map defined by g ∈ G 7→ ρ(g)v has constant rank, and the stabilizer Gv is an embedded Lie subgroup of G.

Proposition 2.4. Let G be a Lie group, H be an embedded Lie subgroup of G, and π : G → G/H be the canonical projection. There exists one and only one smooth manifold structure on the topological quotient space G/H turning π into a smooth submersion. Moreover, the action of G on G/H is smooth, and if H is normal in G, then G/H is a Lie group, π is a Lie group morphism, the Lie algebra h of H is an ideal of the Lie algebra g of G and the linear map from TeG/Te H to TeH(G/H) induced by Teπ is a Lie algebra isomorphism from g/h to the Lie algebra of G/H. Proposition 2.5. Let M be a manifold together with a smooth action of a Lie group G, and x ∈ M; (i) the canonical map Θx : G/Gx → M defined by

37 group invariance & curvature of representations

Θx(gGx) = gx is a one-to-one immersion, whose image is the orbit G · x; (ii) the orbit G · x is a submanifold of M if and only if it is locally closed in M; (iii) if G · x is locally closed, then Θx is a diffeomorphism from G/Gx to G · x.

38 3

HYPERBOLICSPACES:EMBEDDINGS& CLASSIFICATION

Hyperbolic spaces have recently gained momentum in the context of machine learning due to their high capacity and tree-likeliness prop- erties. However, the representational power of hyperbolic geometry is not yet on par with Euclidean geometry, because of the absence of corresponding mathematical tools and statistical methods. This makes it hard to leverage hyperbolic geometry for data analysis and represen- tation. In this chapter, we take a step towards bridging this gap in a principled manner by combining the formalism of Mobius¨ gyrovector spaces with the Riemannian geometry of the Poincare´ model of hyper- bolic spaces. We support the choice of this geometry via the notions of Fisher Information and δ-hyperbolicity. We start by generalizing the celebrated word embeddings method GloVe [PSM14] to metric spaces, specifying how to evaluate similarity, analogy and hypernymy of word embeddings in hyperbolic spaces. We then develop elementary opera- tions in hyperbolic spaces akin to matrix multiplications and pointwise non-linearities, which enable us to generalize GRU [Chu+14]. Finally, we explain to how to perform MLR in these spaces, also referred to as softmax layer, which is essential to a variety of neural architectures. impact. Some of our hyperbolic methods have been incorporated into the PyTorch package GeoOpt [KKK20]. Our contributions were suc- cessfully utilized by a broad variety of work, generalizing to hyperbolic spaces: graph neural networks [BBG20]; [Cha+19]; [LNK19]; [Zha+19], variational auto-encoders [Dai+20]; [Mat+19]; [Ovi19]; [SGB20], rela- tional models [BAH19], normalizing flows [Bos+20] and others. The content of this chapter has in part been published at NeurIPS 2018 [GBH18c] and ICLR 2019 [TBG19], vulgarized in blogposts1 and was funded by the Max Planck ETH Center for Learning Systems while the author was at ETH Zurich.¨

1http://hyperbolicdeeplearning.com

39 hyperbolic spaces: embeddings & classification

3.1 mathematical preliminaries

We start by presenting the elementary notions of Riemannian geome- try that are necessary to the exposition, before which we introduce the tools of hyperbolic geometry that we use in our models.

3.1.1 Reminders of Differential Geometry

For a deeper exposition, we refer the interested reader to [Spi79] and [HA10]. manifold. A manifold M of dimension n is a set that can be locally approximated by the Euclidean space Rn. For instance, the sphere S2 and the torus T2 embedded in R3 are 2-dimensional manifolds, also called surfaces, as they can locally be approximated by R2. The notion of manifold is a generalization of the notion of surface. tangent space. For x ∈ M, the tangent space TxM of M at x is defined as the n-dimensional vector-space approximating M around x at a first order. It can be defined as the set of vectors v ∈ Rn that can be obtained as v := c0(0), where c : (−ε, ε) → M is a smooth path in M such that c(0) = x. riemannian metric. A Riemannian metric g on M is a collection (gx)x of inner-products gx : TxM × TxM → R on each tangent space TxM, varying smoothly with x. Although it defines the geometry of M locally, it induces a global distance function d : M × M → R+ by setting d(x, y) to be the infimum of all lengths of smooth curves joining x to y in M, where the length ` of a curve γ : [0, 1] → M is defined by integrating the length of the speed vector living in the tangent space:

Z 1 q 0 0 `(γ) = gγ(t)(γ (t), γ (t))dt.(3.1) 0 riemannian manifold. A smooth manifold M equipped with a Riemannian metric g is called a Riemannian manifold and is denoted by the pair (M, g). Subsequently, due to their metric properties, we will only consider such manifolds.

40 3.1 mathematical preliminaries geodesics. A geodesic (straight line) between two points x, y ∈ M is a smooth curve of minimal length joining x to y in M. Geodesics define shortest paths on the manifold. They are a generalization of lines in the Euclidean space. parallel transport. In certain spaces, such as the hyperbolic space, there is a unique geodesic between two points, which allows to consider the parallel transport from x to y (implicitly taken along this unique geodesic) Px→y : Tx M → Ty M, which is a linear isometry between tangent spaces corresponding to moving tangent vectors along geodesics and defines a canonical way to connect tangent spaces. exponential map. The exponential map expx : TxM → M around x, when well-defined, maps a small perturbation of x by a vector v ∈ TxM to a point expx(v) ∈ M, such that t ∈ [0, 1] 7→ expx(tv) is a geodesic joining x to expx(v). In Euclidean space, we simply have expx(v) = x + v. The exponential map is important, for instance, when performing gradient-descent over parameters lying in a mani- fold [Bon13]. For geodesically complete manifolds, such as the Poincare´ ball considered in this work, expx is well-defined on the full tangent space TxM. conformality. A metric g˜ on M is said to be conformal to g if it defines the same angles, i.e. for all x ∈ M and u, v ∈ TxM\{0},

g˜x(u, v) gx(u, v) p p = p p .(3.2) g˜x(u, u) g˜x(v, v) gx(u, u) gx(v, v)

This is equivalent to the existence of a smooth function λ : M → (0, ∞) 2 such that g˜x = λxgx, which is called the conformal factor of g˜ (w.r.t. g).

3.1.2 A Brief Introduction to Hyperbolic Geometry

The hyperbolic space of dimension n ≥ 2 is a fundamental object in Riemannian geometry. It is (up to isometry) uniquely characterized as a complete, simply connected Riemannian manifold with constant negative curvature [Can+97]. The other two model spaces of constant sectional curvature are the flat Euclidean space Rn (zero curvature) and the hyper-sphere Sn (positive curvature).

41 hyperbolic spaces: embeddings & classification

The hyperbolic space has five models which are often insightful to work in. They are isometric to each other and conformal to the Euclidean space [Can+97]; [Par13]. We prefer to work in the Poincare´ ball model Dn for the same reasons as [NK17b]. poincare´ metric tensor. The Poincare´ ball model (Dn, gD) is defined by the manifold Dn = {x ∈ Rn : kxk < 1} equipped with the following Riemannian metric

2 gD = λ2 gE, where λ := ,(3.3) x x x 1 − kxk2

E where g is the Euclidean metric tensor with components In of the standard space Rn, w.r.t. the usual Cartesian coordinates. As the above model is a Riemannian manifold, its metric tensor is fundamental in order to uniquely define most of its geometric prop- erties like distances, inner products (in tangent spaces), straight lines (geodesics), curve lengths or volume elements. In the Poincare´ ball model, the Euclidean metric is changed by a simple scalar field, hence the model is conformal (i.e. angle preserving), yet distorts distances. induced distance and norm. It is known [NK17b] that the in- duced distance between 2 points x, y ∈ Dn is given by

 2  −1 kx − yk dD(x, y) = cosh 1 + 2 .(3.4) (1 − kxk2) · (1 − kyk2)

The Poincare´ norm is then defined as:

−1 kxkD := dD(0, x) = 2 tanh (kxk).(3.5) geodesics and exponential map. Closed-form expression of the exponential map is obtained by first obtaining a closed form of unit- speed geodesics. We present these formulas below but refer the reader n n to [GBH18b] for explicit computations. Let x ∈ D and v ∈ TxD (= n D n R ) such that gx (v, v) = 1. The unit-speed geodesic γx,v : R+ → D with γx,v(0) = x and γ˙ x,v(0) = v is given by

λ cosh(t) + λ2hx, vi sinh(t) x + λ sinh(t)v ( ) = x x x γx,v t 2 .(3.6) 1 + (λx − 1) cosh(t) + λxhx, vi sinh(t)

42 3.1 mathematical preliminaries

As a consequence, one can then derive an explicit formula for the exponential map in the Poincare´ ball. The exponential map at a point n n n x ∈ D , namely expx : TxD → D , is given by

 v  λx cosh(λxkvk) + hx, kvk i sinh(λxkvk) expx(v) = v x 1 + (λx − 1) cosh(λxkvk) + λxhx, kvk i sinh(λxkvk) 1 kvk sinh(λxkvk) + v v.(3.7) 1 + (λx − 1) cosh(λxkvk) + λxhx, kvk i sinh(λxkvk) products. We will also embed words in products of hyperbolic spaces, and explain why later on. A product of p balls (Dn)p, with the induced product geometry, is known to have distance function

p 2 2 d(Dn)p (x, y) = ∑ dDn (xi, yi) .(3.8) i=1 half-plane. Finally, another model of interest for us is the Poincare´ 2 ∗ half-plane H = R × R+ illustrated in fig. 3.1, with distance function

 2  −1 kx − yk2 dH2 (x, y) = cosh 1 + .(3.9) 2x2y2 Figure 3.1 shows an isometry ϕ from D2 to H2 mapping the vertical ∗ 2 2 segment {0} × (−1, 1) to {0} × R+, where 0 ∈ D becomes (0, 1) ∈ H .

Figure 3.1: Isometric deformation ϕ of D2 (left end) into H2 (right end).

3.1.3 Riemannian Geometry of Gyrovector Spaces

In Euclidean space, natural operations inherited from the vectorial structure, such as vector addition, subtraction and scalar multiplication

43 hyperbolic spaces: embeddings & classification are often useful. The framework of gyrovector spaces provides an elegant non-associative algebraic formalism for hyperbolic geometry just as vector spaces provide the algebraic setting for Euclidean geometry [Alb08]; [Ung01]; [Ung08]. In particular, these operations are used in , allowing to add speed vectors belonging to the Poincare´ ball of radius c (the celerity, i.e. the speed of light) so that they remain in the ball, hence not exceeding the speed of light. We will make extensive use of these operations in our definitions of hyperbolic neural networks. ≥ 2 Dn = { ∈ Rn | k k2 < } For c 0, denote by c : x c x 1 . Note that√ if n n n c = 0, then Dc = R ; if c > 0, then Dc is the open ball of radius 1/ c. n 0 n If c = 1 then we recover the usual ball D . Note that for c, c > 0, Dc n and Dc0 are isometric.

n mobius¨ addition. The Mobius¨ addition of x and y in Dc is defined as (1 + 2chx, yi + ckyk2)x + (1 − ckxk2)y x ⊕ y := .(3.10) c 1 + 2chx, yi + c2kxk2kyk2 In particular, when c = 0, one recovers the Euclidean addition of two vectors in Rn. Note that without loss of generality, the case c > 0 can be reduced to c = 1. Unless stated otherwise, we will use ⊕ as ⊕1 to simplify notations. For general c > 0, this operation is not commutative nor associative. However, it satisfies x ⊕c 0 = 0 ⊕c x = x. n Moreover, for any x, y ∈ Dc , we have (−x) ⊕c x = x ⊕c (−x) = 0 and (−x) ⊕c (x ⊕c y) = y (left-cancellation law). The Mobius¨ substraction is then defined by the use of the following notation: x c y := x ⊕c (−y). See [Ver05, section 2.1] for a geometric interpretation of the Mobius¨ addition. mobius¨ scalar multiplication. For c > 0, the Mobius¨ scalar n multiplication of x ∈ Dc \{0} by r ∈ R is defined as √ √ x r ⊗ x := (1/ c) tanh(r tanh−1( ckxk)) ,(3.11) c kxk and r ⊗c 0 := 0. Note that similarly as for the Mobius¨ addition, one recovers the Euclidean scalar multiplication when c goes to zero: limc→0 r ⊗c x = rx. This operation satisfies desirable properties such √ 2We take different notations as in [Ung01] where the author uses s = 1/ c.

44 3.1 mathematical preliminaries

0 0 as n ⊗c x = x ⊕c · · · ⊕c x (n additions), (r + r ) ⊗c x = r ⊗c x ⊕c r ⊗c x 3 0 0 (scalar distributivity ), (rr ) ⊗c x = r ⊗c (r ⊗c x) (scalar associativity) and |r| ⊗c x/kr ⊗c xk = x/kxk (scaling property). distance. If one defines the generalized hyperbolic metric tensor gc as the metric conformal to the Euclidean one, with conformal factor c 2 n c λx := 2/(1 − ckxk ), then the induced distance function on (Dc , g ) is given by4

√ √ −1  dc(x, y) = (2/ c) tanh ck − x ⊕c yk .(3.12)

Again, observe that limc→0 dc(x, y) = 2kx − yk, i.e. we recover Eu- 5 clidean geometry in the limit . Moreover, for c = 1 we recover dD of Eq. (3.4). We now present how geodesics in the Poincare´ ball model are usually described with Mobius¨ operations, and push one step further the existing connection between gyrovector spaces and the Poincare´ ball by finding new identities involving the exponential map, and parallel transport. In particular, these findings provide us with a simpler formulation of Mobius¨ scalar multiplication, yielding a natural definition of matrix- vector multiplication in the Poincare´ ball. riemannian gyroline element. The Riemannian gyroline ele- ment is defined for an infinitesimal dx as ds := (x + dx) c x, and its size is given by [Ung08, section 3.7]:

2 kdsk = k(x + dx) c xk = kdxk/(1 − ckxk ).(3.13)

What is remarkable is that it turns out to be identical, up to a scaling fac- tor of 2, to the usual line element 2kdxk/(1 − ckxk2) of the Riemannian n c manifold (Dc , g ).

3 ⊗c has priority over ⊕c in the sense that a ⊗c b ⊕c c := (a ⊗c b) ⊕c c and a ⊕c b ⊗c c := a ⊕c (b ⊗c c). 4 The notation −x ⊕c y should always be read as (−x) ⊕c y and not −(x ⊕c y). 5 2 The factor 2 comes from the conformal factor λx = 2/(1 − kxk ), which is a convention setting the curvature to −1.

45 hyperbolic spaces: embeddings & classification

n geodesics. The geodesic connecting points x, y ∈ Dc is shown in [Alb08]; [Ung08] to be given by:

γx→y(t) := x ⊕c (−x ⊕c y) ⊗c t, n with γx→y : R → Dc s.t. γx→y(0) = x and γx→y(1) = y.(3.14)

Note that when c goes to 0, geodesics become straight-lines, recovering Euclidean geometry. In the remainder of this subsection, we connect the gyrospace framework with Riemannian geometry.

n n c Lemma 3.1. For any x ∈ D and v ∈ TxDc s.t. gx(v, v) = 1, the unit- speed geodesic starting from x with direction v is given by:

 √ t  v  γx,v(t) = x ⊕c tanh c √ , 2 ckvk n where γx,v : R → D s.t. γx,v(0) = x and γ˙ x,v(0) = v.(3.15)

Proof. One can use Eq. (3.14) and reparametrize it to unit-speed using Eq. (3.12). Alternatively, direct computation and identification with the formula in [GBH18b, Thm. 1] would give the same result. Using Eq. (3.12) and Eq. (3.15), one can sanity-check that dc(γ(0), γ(t)) = t, ∀t ∈ [0, 1]. exponential and logarithmic maps. The following lemma gives the closed-form derivation of exponential and logarithmic maps.

n c n Lemma 3.2. For any point x ∈ Dc , the exponential map expx : TxDc → n c n n Dc and the logarithmic map logx : Dc → TxDc are given for v 6= 0 and y 6= x by:

 √ c   c λxkvk v exp (v) = x ⊕c tanh c √ ,(3.16) x 2 ckvk 2 √ −x ⊕ y c (y) = √ −1( ck − x ⊕ yk) c logx c tanh c .(3.17) cλx k − x ⊕c yk

c Proof. Following the proof of [GBH18b, Cor. 1.1], one gets expx(v) = c γ v c 3 15 x, c (λxkvk). Using Eq. ( . ) gives the formula for expx. Algebraic λxkvk c c check of the identity logx(expx(v)) = v concludes.

46 3.1 mathematical preliminaries

The above maps have more appealing forms when x = 0, namely for n n v ∈ T0Dc \{0}, y ∈ Dc \{0}:

√ v √ y expc (v) = tanh( ckvk)√ , logc (y) = tanh−1( ckyk)√ . 0 ckvk 0 ckyk (3.18)

Moreover, we still recover Euclidean geometry in the limit c → c 0, as limc→0 expx(v) = x + v is the Euclidean exponential map, and c limc→0 logx(y) = y − x is the Euclidean logarithmic map. mobius¨ scalar multiplication using exponential and log- arithmic maps. We studied the exponential and logarithmic maps in order to gain a better understanding of the Mobius¨ scalar multiplica- tion (Eq. (3.11)). We found the following:

Lemma 3.3. The quantity r ⊗ x can actually be obtained by projecting x in the tangent space at 0 with the logarithmic map, multiplying this projection n by the scalar r in T0Dc , and then projecting it back on the manifold with the exponential map:

c c n r ⊗c x = exp0(r log0(x)), ∀r ∈ R, x ∈ Dc .(3.19)

In addition, we recover the well-known relation between geodesics connecting two points and the exponential map:

c c γx→y(t) = x ⊕c (−x ⊕c y) ⊗c t = expx(t logx(y)), t ∈ [0, 1].(3.20)

This last result enables us to generalize scalar multiplication in order to define matrix-vector multiplication between Poincare´ balls, one of the essential building blocks of hyperbolic neural networks. parallel transport. Finally, we connect parallel transport along the unique geodesic from 0 to x to gyrovector spaces with the following theorem.

47 hyperbolic spaces: embeddings & classification

n c Theorem 3.1. In the manifold (Dc , g ), the parallel transport w.r.t. the n n Levi-Civita connection of a vector v ∈ T0Dc to another tangent space TxDc is given by the following isometry:

λc c ( ) = c ( ⊕ c ( )) = 0 P0→x v logx x c exp0 v c v.(3.21) λx

n Proof. The geodesic in Dc from 0 to x is given in Eq. (3.14) by γ(t) = n x ⊗c t, for t ∈ [0, 1]. Let v ∈ T0Dc . Then it is of common knowledge that 6 n there exists a unique parallel vector field X along γ (i.e. X(t) ∈ Tγ(t)Dc , ∀t ∈ [0, 1]) such that X(0) = v. Let’s define:

c c n X : t ∈ [0, 1] 7→ logγ(t)(γ(t) ⊕c exp0(v)) ∈ Tγ(t)Dc .(3.22)

Clearly, X is a vector field along γ such that X(0) = v. Now define

c n c c n P0→x : v ∈ T0Dc 7→ logx(x ⊕c exp0(v)) ∈ TxDc .(3.23)

c c λ0 c From Eq. (3.17), it is easily seen that P (v) = c v, hence P is a 0→x λx 0→x n n c linear isometry from T0Dc to TxDc . Since P0→x(v) = X(1), it is enough c to prove that X is parallel in order to guarantee that P0→x is the parallel n n transport from T0Dc to TxDc . Since X is a vector field along γ, its covariant derivative can be expressed with the Levi-Civita connection ∇c associated to gc:

DX = ∇c X.(3.24) ∂t γ˙ (t)

Let’s compute the Levi-Civita connection from its Christoffel symbols. In a local coordinate system, they can be written as

1 Γi = (gc)il(∂ gc + ∂ gc − ∂ gc ),(3.25) jk 2 j lk k lj l jk where superscripts denote the inverse metric tensor and using Einstein’s c c 2 n notations. As gij = (λ ) δij, at γ(t) ∈ Dc this yields:

i c Γjk = cλγ(t)(δikγ(t)j + δijγ(t)k − δjkγ(t)i).(3.26)

6 DX D i.e. that ∂t = 0 for t ∈ [0, 1], where ∂t denotes the covariant derivative.

48 3.1 mathematical preliminaries

( ) = ( c c ) On the other hand, since X t λ0/λγ(t) v, we have

c ! c ! c i c i c λ0 j i c λ0 ∇γ˙ (t)X = γ˙ (t) ∇i X = γ˙ (t) ∇i c v = v γ˙ (t) ∇i c ej . λγ(t) λγ(t) √ √ (3.27) ( ) = ( ) ( −1( k k)) x Since γ t 1/ c tanh t tanh c x kxk , it is easily seen that x γ˙ (t) is colinear to γ(t). Hence there exists Kt ∈ R such that γ˙ (t) = x Kt γ(t). Moreover, we have the following Leibniz rule: ! ! λc λc ∂ λc ∇c 0 e = 0 ∇ce + 0 e .(3.28) i c j c i j ( ) c j λγ(t) λγ(t) ∂γ t i λγ(t)

Combining these yields ! ! DX λc ∂ λc = Kxvjγ(t)i 0 ∇ce + 0 e .(3.29) t c i j ( ) c j ∂t λγ(t) ∂γ t i λγ(t)

Replacing with the Christoffel symbols of ∇c at γ(t) gives

c c λ0 c λ0 k k k k c ∇i ej = c Γijek = 2c[δj γ(t)i + δi γ(t)j − δijγ(t) ]ek.(3.30) λγ(t) λγ(t)

Moreover, ! ∂ λc ∂ 0 e = −ckγ(t)k2 e = −2cγ(t) e .(3.31) ( ) c j ( ) j i j ∂γ t i λγ(t) ∂γ t i

Putting together everything, we obtain

DX   = Kxvjγ(t)i 2c[δkγ(t) + δkγ(t) − δ γ(t)k]e − 2cγ(t) e (3.32) ∂t t j i i j ij k i j x j i  k  = 2cKt v γ(t) γ(t)jei − δijγ(t) ek (3.33)

x j  i i k  = 2cKt v γ(t)jγ(t) ei − γ(t) δijγ(t) ek (3.34)

x j  i k  = 2cKt v γ(t)jγ(t) ei − γ(t)jγ(t) ek (3.35) = 0, (3.36) which concludes the proof.

49 hyperbolic spaces: embeddings & classification

As we’ll see later, this result is crucial in order to define and optimize parameters shared between different tangent spaces, such as biases in hyperbolic neural layers or parameters of hyperbolic MLR. Finally, another operator of interest is the so-called gyro operator (see Eq.(1.27) of [Ung08]), defined by: Au + Bv gyr[u, v]w = (u ⊕ v) ⊕ {u ⊕ (v ⊕ w)} = w + 2 ,(3.37) D where

A = −hu, wikvk2 + hv, wi + 2hu, vi · hv, wi,(3.38) B = −hv, wikuk2 − hu, wi,(3.39) D = 1 + 2hu, vi + kuk2kvk2.(3.40)

3.2 word embeddings

3.2.1 Adapting GloVe to Metric Spaces euclidean glove. The Glove [PSM14] algorithm is an unsuper- vised method for learning word representations in the Euclidean space from statistics of word co-occurrences in a text corpus, with the aim to geometrically capture the words’ meaning and relations. We use the notations: Xij is the number of times word j occurs in the same window context as word i; Xi = ∑k Xik is the word frequency; the embedding of a target word i is written wi, while the embedding of a context word k is written w˜ k. The initial formulation of the Glove [PSM14] model suggests to learn embeddings as to math the word log-co-occurrence counts using vector dot-product together with biases: T ˜ wi w˜ k + bi + bk = log(Xik).(3.41) The authors suggest to enforce this equality by optimizing a weighted least-square loss:

V  2 T ˜ J = ∑ f (Xij) wi w˜ j + bi + bj − log Xij ,(3.42) i,j=1 where V is the size of the vocabulary and f down-weights the sig- nal coming from frequent words (it is typically chosen to be f (x) = α min{1, (x/xm) }, with α = 3/4 and xm = 100).

50 3.2 word embeddings glove in metric spaces. How should one generalize Glove to non- Euclidean spaces such as the hyperbolic space? Note that there is no clear correspondence of the Euclidean inner-product in a hyperbolic space. However, we are provided with a distance function. Further notice that one could rewrite eq. (3.41) with the Euclidean distance as

1 − kw − w˜ k2 + b + b˜ = log(X ) (3.43) 2 i k i k ik where we absorbed the squared norms of the embeddings into the biases. We thus replace the Glove loss by:

V ˜ 2 J = ∑ f (Xij) −h(d(wi, w˜ j)) + bi + bj − log Xij ,(3.44) i,j=1

where h is a function to be chosen as a hyperparameter of the model, and d can be any differentiable distance function. Although the most di- rect correspondence with Glove would suggest h(x) = x2/2, we some- times obtained better results with other functions, such as h = cosh2 (see section 3.2.6). Note that [De +18b] also apply cosh to their distance matrix for hyperbolic MDS before applying Principal Component Anal- 2 ysis (PCA). Understanding why h = cosh is a good choice would be interesting future work.

3.2.2 Fisher Information and Hyperbolic Products

In order to endow Euclidean word embeddings with richer informa- tion, [VM15] proposed to represent words as Gaussians, i.e. with a mean vector and a covariance matrix7, expecting the variance parameters to capture how generic/specific a word is, and, hopefully, entailment relations. On the other hand, [NK17b] proposed to embed words of the WordNet hierarchy [Mil+90] in hyperbolic space, because this space is mathematically known to be better suited to embed tree-like graphs. It is hence natural to wonder: is there a connection between the two? the fisher geometry of gaussians is hyperbolic. It turns out that there exists a striking connection [CSS15]. Note that a 1D 2 ∗ Gaussian N (µ, σ ) can be represented as a point (µ, σ) in R × R+. Then,

7diagonal or even spherical, for simplicity.

51 hyperbolic spaces: embeddings & classification the Fisher distance between two distributions relates to the hyperbolic distance in H2: √ √ √ 2 0 02   0 0  dF N (µ, σ ), N (µ , σ ) = 2dH2 (µ/ 2, σ), (µ / 2, σ ) .(3.45)

For n-dimensional Gaussians with diagonal covariance matrices written Σ = diag(σ)2, it becomes:

s n √ √ 2 0 0   0 0  dF N (µ, Σ), N (µ , Σ ) = ∑ 2dH2 (µi/ 2, σi), (µi/ 2, σi ) . i=1 (3.46) Hence there is a direct correspondence between diagonal Gaussians and the product space (H2)n. This connection allows us to mathematically ground

• word analogy computations for Gaussian embeddings using hyper- bolic geometry – section 3.2.4.2.

• hypernymy detection for hyperbolic embeddings using Gaussian distributions – section 3.2.5. fisher distance, kl & gaussian embeddings. The above para- graph lets us relate the word2gauss algorithm [VM15] to hyperbolic word embeddings. Although one could object that word2gauss is trained using a KL divergence, while hyperbolic embeddings relate to Gaussian distributions via the Fisher distance dF, let us remind that KL and dF define the same local geometry. Indeed, the KL is known to be related to dF, as its local approximation [Jef46]. riemannian optimization. A benefit of representing words in (products of) hyperbolic spaces, as opposed to (diagonal) Gaussian dis- tributions, is that one can use recent Riemannian adaptive optimization tools such as Radagrad [BG19]. Note that without this connection, it would be unclear how to define a variant of Adagrad [DHS11] intrinsic to the statistical manifold of Gaussians. Empirically, we indeed noticed better results using Radagrad, rather than simply Riemannian sgd [Bon13]. Similarly, note that Glove trains with Adagrad.

52 3.2 word embeddings

3.2.3 Training Details experimental setup. We trained all models on a corpus provided by [LG14]; [LGD15] used in other word embeddings related work. Corpus preprocessing is explained in the above references. The dataset has been obtained from an English Wikipedia dump and contains 1.4 billion tokens. Words appearing less than one hundred times in the corpus have been discarded, leaving 189, 533 unique tokens. The co- occurrence matrix contains approximately 700 millions non-zero entries, for a symmetric window size of 10. All models were trained for 50 epochs, and unless stated otherwise, on the full corpus of 189,533 word types. For certain experiments, we also trained the model on a restricted vocabulary of the 50, 000 most frequent words, which we specify by appending either “(190k)” or “(50k)” to the experiment’s name in the table of results. poincare´ models, euclidean baselines. We report results for both 100D embeddings trained in a 100D Poincare´ ball, and for 50x2D embeddings, which were trained in the Cartesian product of 50 2D Poincare´ balls. Note that in the case of both models, one word will be represented by exactly 100 parameters. For the Poincare´ models we employ both h(x) = x2 and h(x) = cosh2(x). All hyperbolic models were optimized with Radagrad [BG19] as explained in section 3.2.2. On the tasks of similarity and analogy, we compare against a 100D Euclidean GloVe model which was trained using the hyperparameters suggested in the original GloVe paper [PSM14]. The vanilla GloVe model was optimized using Adagrad [DHS11]. For the Euclidean baseline as well as for models with h(x) = x2 we used a learning rate of 0.05. For Poincare´ models with h(x) = cosh2(x) we used a learning rate of 0.01. the initialization trick. We obtained improvement in the ma- jority of the metrics when initializing our Poincare´ model with pre- trained parameters. These were obtained by first training the same model on the restricted (50k) vocabulary, and then using this model as an initialization for the full (190K) vocabulary. This will be referred to as the “initialization trick“. For fairness, we also trained the vanilla (Euclidean) GloVe model in the same fashion.

53 hyperbolic spaces: embeddings & classification

3.2.4 Similarity & Analogy

3.2.4.1 Similarity quantitative results. Word similarity is assessed using a num- ber of well established benchmarks shown in table 3.1. We summarize here our main results, but more extensive experiments (including in lower dimensions) are shown in [TBG19]. We note that, with a single exception, our 100D and 50x2D models outperform the vanilla Glove baselines in all settings. qualitative results. Nearest neighbors w.r.t the Poincare´ dis- tance are shown in Table 3.2.

Table 3.1: Word similarity results for 100-dimensional models. High- lighted: the best and the 2nd best. Implementation of these experiments was done in collaboration with co-authors in [TBG19]. Rare Word Experiment name SimLex SimVerb MC RG Word Sim 100D Vanilla GloVe 0.3798 0.5901 0.2963 0.1675 0.6524 0.6894 100D Vanilla GloVe 0.3787 0.5668 0.2964 0.1639 0.6562 0.6757 w/ init trick 100D Poincare´ GloVe h(x) = cosh2(x), 0.4187 0.6209 0.3208 0.1915 0.7833 0.7578 w/ init trick 50x2D Poincare´ GloVe h(x) = cosh2(x), 0.4276 0.6234 0.3181 0.189 0.8046 0.7597 w/ init trick 50x2D Poincare´ GloVe h(x) = x2, 0.4104 0.5782 0.3022 0.1685 0.7655 0.728 w/ init trick

3.2.4.2 Analogy A common task used to evaluate word embeddings, called analogy, consists in finding which word d is to the word c, what the word b is to the word a. For instance, queen is to woman what king is to man. In

54 3.2 word embeddings

Table 3.2: Nearest neighbors (in terms of Poincare´ distance) for some words using our 100D hyperbolic embedding model. sixties seventies, eighties, nineties, 60s, 70s, 1960s, 80s, 90s, 1980s, 1970s dance dancing, dances, music, singing, musical, performing, hip- hop, pop, folk, dancers daughter wife, married, mother, cousin, son, niece, granddaughter, husband, sister, eldest vapor vapour, refrigerant, liquid, condenses, supercooled, fluid, gaseous, gases, droplet ronaldo cristiano, ronaldinho, rivaldo, messi, zidane, romario, pele, zinedine, xavi, robinho mechanic electrician, fireman, machinist, welder, technician, builder, janitor, trainer, brakeman algebra algebras, homological, heyting, geometry, subalgebra, quaternion, calculus, mathematics, unital, algebraic the Euclidean embedding space, the solution to this problem is usually taken geometrically as d = c + (b − a) = b + (c − a). Note that this implies that the same d is also to b, what c is to a. How should one intrinsically define “analogy parallelograms” in a space of Gaussian distributions? Note that [VM15] do not evaluate their Gaussian embeddings on the analogy task, and that it would be unclear how to do so. However, since we can go back and forth between (diagonal) Gaussians and (products of) hyperbolic spaces as explained in section 3.2.2, we can use the fact that parallelograms are naturally defined in the Poincare´ ball, by the notion of gyro-translation [Ung12, section 4]. In the Poincare´ ball, the two solutions d1 = c + (b − a) and d2 = b + (c − a) are respectively generalized to

d1 = c ⊕ gyr[c, a]( a ⊕ b), and d2 = b ⊕ gyr[b, a]( a ⊕ c). (3.47)

where the gyr operator as well as the closed-form formulas for these operations are described in Eq. (3.37), being easy to implement. The fact that d1 and d2 differ is due to the curvature of the space. For evaluation, we chose a point mt := d ⊕ ((−d ⊕ d ) ⊗ t) located d1d2 1 1 2

55 hyperbolic spaces: embeddings & classification

on the geodesic between d1 and d2 for some t ∈ [0, 1]; if t = 1/2, this is called the gyro-midpoint and then m0.5 = m0.5 , which is at equal d1d2 d2d1 hyperbolic distance from d1 as from d2. We select t based on 2-fold cross-validation, as explained in [TBG19], Note that continuously deforming the Poincare´ ball to the Euclidean space (by sending its radius to infinity) lets these analogy computations recover their Euclidean counterparts, which is a nice sanity check. Indeed, one can rewrite eq. (3.47) with tools from differential geometry as

c ⊕ gyr[c, a]( a ⊕ b) = expc(Pa→c(loga(b))),(3.48)

where Px→y = (λx/λy)gyr[y, x] denotes the parallel transport along the unique geodesic from x to y. The exp and log maps of Riemannian geometry are related to the theory of gyrovector spaces as mentioned in section 3.1.3. We also mention again that, when continuously deform- ing the hyperbolic space Dn into the Euclidean space Rn, sending its curvature κ from −1 to 0 (i.e. the radius of Dn from 1 to ∞), the Mobius¨ operations ⊕κ, κ, ⊗κ, gyrκ recover their respective Euclidean counter- parts +, −, ·, Id. Hence, the analogy solutions d , d , mt of eq. (3.47) 1 2 d1d2 would then all recover the Euclidean formulation d = c + b − a. evaluation. For word analogy, we evaluate on the Google bench- mark [Mik+13a] and its two splits that contain semantic and syntactic analogy queries. We also use a benchmark by MSR that is also com- monly employed in other word embedding works. For the Euclidean baselines we use 3COSADD [LGD15]. For our models, the solution d to the problem “which d is to c, what b is to a” is selected as mt , as d1d2 described in section 3.2.4.2. In order to select the best t without overfitting on the benchmark dataset, we used the same 2-fold cross-validation method used by [LGD15, section 5.1], which resulted in selecting t = 0.3. We report our main results in table 3.3. More extensive experiments in various settings (including in lower dimensions) are shown in [TBG19]. We note that the vast majority of our models outperform the vanilla Glove baselines, with the 100D hyperbolic embeddings being the absolute best.

56 3.2 word embeddings

Table 3.3: Word analogy results for 100-dimensional models on the Google and MSR datasets. Highlighted: the best and the 2nd best. Implementation of these experiments was done in collaboration with co-authors in [TBG19]. Experiment name Method SemG SynG G MSR 100D Vanilla GloVe 3COSADD 0.6005 0.5869 0.59310 .4868 100D Vanilla GloVe 3COSADD 0.6427 0.595 0.6167 0.4826 w/ init trick 100D Poincare´ GloVe h(x) = cosh2(x) Cosine dist 0.66410 .60880 .63390 .4971 w/ init. trick 50x2D Poincare´ GloVe h(x) = x2 Poincare´ dist 0.65820 .60660 .63000 .4672 w/ init. trick 50x2D Poincare´ GloVe h(x) = cosh2(x) Poincare´ dist 0.6048 0.6042 0.6045 0.4849 w/ init. trick

3.2.5 Hypernymy: A New Way of Assessing Entailment

We are interested in using our word embeddings to address another task, namely hypernymy detection, i.e. to predict relations of type is-a(v,w) such as is-a(dog, animal). As a strong first baseline, we focus on the method of [NK17b] that uses an heuristic entailment score in order to predict whether u is-a v, defined in terms of their embeddings u, v ∈ Dn as

is-a(u, v) := −(1 + α(kvk2 − kuk2))d(u, v) (3.49)

This choice is based on the intuition that the Euclidean norm should encode generality/specificity of a concept/word. However, such a choice depends on the parametrization and origin of the hyperbolic space, which is problematic when the word embedding training loss involves only the distance function. A second baseline is that of Gaussian word embeddings [VM15] in which words are modeled as normal probability distributions. The authors propose using the entailment score is-a(P, Q) := −KL(P||Q) for continuous distributions P and Q representing two words. Their

57 hyperbolic spaces: embeddings & classification argument is that a low KL(P||Q) indicates that we can encode Q easily as P, implying that Q entails P. However, we would like to mitigate this statement. Indeed, if P = N (µ, σ) and Q = N (µ, σ0) are two 1D Gaussian distributions with same mean, then KL(P||Q) = z2 − 1 − log(z) where z := σ/σ0, which is not a monotonic function of z. This breaks the idea that the magnitude of the variance should encode the generality/specificity of the concept. Our contribution consists in using the connection detailed in sec- tion 3.2.2 in order to introduce a novel principled score that can be applied on top of our (unsupervised) learned Poincare´ Glove embed- dings to address the task of hypernymy detection. This hypernymy score can leverage either fully unsupervised information (i.e. word fre- quencies) or weakly-supervised (i.e. WordNet tree levels) information. invariance to isometric transformations of distance based embeddings. The method of [NK17b] uses a heuristic entailment score in order to predict whether u is-a v, defined for u, v ∈ Dn as

is-a(u, v) := −(1 + α(kvk2 − kuk2))d(u, v) (3.50)

This choice is based on the intuition that the Euclidean norm should encode generality/specificity of a concept/word. However, such a choice is not intrinsic to the hyperbolic space when the training loss involves only the distance function. We say that training is intrinsic to Dn, i.e. invariant to applying any isometric transformation ϕ : Dn → Dn to all word embeddings (such as hyperbolic translation). But their “is-a” score is not intrinsic, i.e. depends on the parametrization. For this reason, we argue that an isometry has to be found and fixed before using the trained word embeddings in any non-intrinsic manner, e.g. to define hypernymy scores. To discover this transformation, we leverage the connection between hyperbolic and Gaussian embeddings as follows. hyperbolic to gaussian embeddings via an isometry. We argue that generality of a concept as being embedded in a Poincare´ ball is better captured by a direction in hyperbolic space, i.e. by a geodesic, rather than by the distance from the origin. Why? For a 1D Gaus- sian N (µ, σ2) representing a concept, generality should be naturally encoded in the magnitude of σ [VM15]. As shown in section 3.2.2, the space of Gaussians endorsed with the Fisher distance is naturally mapped to the hyperbolic upper half-plane H2, where the variance

58 3.2 word embeddings

2 ∗ σ corresponds to the (positive) second coordinate in H = R × R+. Moreover, H2 can be isometrically mapped to D2, where the second co- ∗ ordinate σ ∈ R+ corresponds to the open vertical segment {0} × (−1, 1) in D2. However, in D2, any (hyperbolic) translation or any rotation w.r.t. the origin is an isometry 8. Hence, in order to map a word x ∈ D2 to a Gaussian N (µ, σ2) via H2, we first have to find this hidden alignment isometry. This transformation should align {0} × (−1, 1) ⊂ H2 with the (unknown) geodesic in D2 that encodes generality. For simplicity, we as- sume it is composed of a centering and a rotation operations in D2. As a consequence, we start by identifying two sets G and S of potentially generic and specific words, respectively. For the re-centering, we then compute the means g and s of G and S respectively, and m := (s + g)/2, and Mobius¨ translate all words by the global mean with the operation w 7→ m ⊕ w. For the rotation, we set u := ( m ⊕ g)/k m ⊕ gk2, and rotate all words so that u is mapped to (0, 1). Figure 3.2 and Algo- rithm 1 illustrate these steps. In order to identify the two sets G and S,

Figure 3.2: We show here one of the D2 spaces of 20D word embed- dings embedded in (D2)10 with our unsupervised hyper- bolic Glove algorithm. This illustrates the three steps of applying the isometry. From left to right: the trained embed- dings, raw; then after centering; then after rotation; finally after isometrically mapping them to H2. The isometry was obtained with the weakly-supervised method WordNet 400 + 400. Legend: WordNet levels (root is 0). Model: h = (·)2, full vocabulary of 190k words. More of these plots for other D2 spaces are shown in [TBG19]. we propose the following two methods.

• Unsupervised 5K+5K: a fully unsupervised method. We first de- fine a restricted vocabulary of the 50k most frequent words among

8See http://bulatov.org/math/1001 for intuitive animations describing hyper- bolic isometries.

59 hyperbolic spaces: embeddings & classification

the unrestricted one of 190k words, and rank them by frequency; we then define G as the 5k most frequent ones, and S as the 5k least frequent ones of the 50k vocabulary (to avoid extremely rare words which might have received less signal during training).

• Weakly-supervised WN x+x: a weakly-supervised method that uses words from the WordNet hierarchy. We define G as the top x words from the top 4 levels of the WordNet hierarchy, and S as x of the bottom words from the bottom 3 levels, randomly sampled in case of ties. gaussian embeddings. [VM15] propose using the entailment score is-a(P, Q) := −KL(P||Q) for distributions P and Q representing two words. Their argument for this choice is that a low KL(P||Q) indicates that we can encode Q easily as P, implying that Q entails P. However, we would like to mitigate this statement. Indeed, if P = N (µ, σ) and Q = N (µ, σ0) are two 1D Gaussian distributions with same mean, then KL(P||Q) = z2 − 1 − log(z) where z := σ/σ0, which is not a monotonic function of z. This breaks the idea that the magnitude of the variance should encode the generality/specificity of the concept. a new entailment score for gaussian embeddings. What would constitute a good number for the variance’s magnitude of a n- dimensional Gaussian distribution N (µ, Σ)? It is known that 95% of its p mass is contained within a hyper-ellipsoid of volume VΣ = Vn det(Σ), n where Vn denotes the volume of a ball of radius 1 in R . For simplicity, we propose dropping the dependence in µ and define a simple score 0 0 n 0 is-a(Σ, Σ ) := log(VΣ) − log(VΣ) = ∑i=1 (log(σi ) − log(σi)). Note that using difference of logarithms has the benefit of removing the scaling constant Vn, and makes the entailment score invariant to a rescaling of the covariance matrices: is-a(rΣ, rΣ0) = is-a(Σ, Σ0), ∀r > 0. To com- pute this is-a score between two hyperbolic word embeddings, we first map all word embeddings to Gaussians as explained above and, subse- quently, apply the above proposed is-a score. Algorithm 1 illustrates these steps. Results are shown in Figure 3.4 and Tables 3.6, 3.5.

60 3.2 word embeddings

Algorithm 1 is-a(v, w) hypernymy score using Poincare´ embeddings 1: procedure is-a score(v, w) 2 p 2 2: Input: v, w ∈ (D ) , v := [v1, ..., vp], w := [w1, ..., wp], vi, wi ∈ D 3: Output: is-a(v, w) lexical entailment score 4: G ← set of Poincare´ embeddings of generic words 5: S ← set of Poincare´ embeddings of specific words 6: for i from 1 to p do // Fixing the correct isometry. 7: gi ← mean{xi|x ∈ G} // Euclidean mean of generic words 8: si ← mean{yi|y ∈ S} // Euclidean mean of specific words 9: mi ← (gi + si)/2 // Total mean 0 10: vi ← mi ⊕ vi // Mobius¨ translation by mi 0 11: wi ← mi ⊕ wi // Mobius¨ translation by mi 12: ui ← ( mi ⊕ gi)/|| mi ⊕ gi||2 // Compute rotation vector 00 0 0 13: vi ← rotate(vi, ui) // rotate vi 00 0 0 14: wi ← rotate(wi, ui) // rotate wi // Convert from Poincare´ disk coordinates to half-plane coordinates. 00 15: v˜i ← poincare2halfplane(vi ) 00 16: w˜ i ← poincare2halfplane(wi ) // Convert√ half-plane coordinates to Gaussian parameters. v ← v ← 17: µi v˜i1/ √2; σi v˜i2 w w 18: µi ← w˜ i1/ 2; σi ← w˜ i2 p v w return ∑i=0(log(σi ) − log(σi ))

61 hyperbolic spaces: embeddings & classification

Figure 3.3: Different hierarchies captured by a 10x2D model with h(x) = x2, in some selected 2D half-planes. The y coordinate encodes the magnitude of the variance of the corresponding Gaussian embeddings, representing word generality/speci- ficity. Thus, this type of Nx2D models offer an amount of interpretability. hypernymy evaluation. For hypernymy evaluation we use the Hyperlex [Vul+17] and WBLess (subset of BLess) [BL11] datasets. We classify all the methods in three categories depending on the supervi- sion used for word embedding learning and for the hypernymy score, respectively. For Hyperlex we report results in table 3.6 and use the baseline scores reported in [NK17b]; [Vul+17]. For WBLess we report results in table 3.5 and use the baseline scores reported in [Ngu+17]. Implementation of these experiments was done in collaboration with co-authors in [TBG19].

Table 3.4: Some words selected from the 100 nearest neighbors and ordered according to the hypernymy score function for a 50x2D hyperbolic embedding model using h(x) = x2. reptile amphibians, carnivore, crocodilian, fish-like, dinosaur, alli- gator, triceratops algebra mathematics, geometry, topology, relational, invertible, en- domorphisms, quaternions music performance, composition, contemporary, rock, jazz, elec- troacoustic, trio feeling sense, perception, thoughts, impression, emotion, fear, shame, sorrow, joy

62 3.2 word embeddings hypernymy results discussion. We first note that our fully un- supervised 50x2D, h(x) = x2 model outperforms all its corresponding baselines setting a new state-of-the-art on unsupervised WBLESS ac- curacy and matching the previous state-of-the-art on unsupervised HyperLex Spearman correlation.

Second, once a small amount of weakly supervision is used for the hypernymy score, we obtain significant improvements as shown in the same tables. We note that this weak supervision is only as a post- processing step (after word embeddings are trained). Moreover, it does not consist of hypernymy pairs, but only of 400 or 800 generic and specific sets of words from WordNet. Even so, our unsupervised learned embeddings are remarkably able to outperform all (except WN-Poincare)´ supervised embedding learning baselines on HyperLex which have the great advantage of using the hypernymy pairs to train the word embeddings.

Figure 3.4: Left: Gaussian variances of our learned hyperbolic em- beddings (trained unsupervised on co-occurrence statistics, isometry found with “Unsupervised 1k+1k”) correlate with WordNet levels; Right: performance of our embeddings on hypernymy (HyperLex dataset) evolve when we increase the amount of supervision x used to find the correct isometry in the model WN x + x. As can be seen, a very small amount of supervision (e.g. 20 words from WordNet) can signifi- cantly boost performance compared to fully unsupervised methods.

63 hyperbolic spaces: embeddings & classification

Table 3.5: WBLESS results in terms of accuracy for 3 different model types ordered according to their difficulty. Method Acc. i) Supervised embedding learning, Unsupervised hypernymy score [Wee+14] 0.75 WN-Poincare´ from [NK17b] 0.86 [Ngu+17] 0.87 ii) Unsupervised embedding learning, Weakly-supervised hypernymy score 50x2D Poincare´ GloVe, h(x) = cosh2(x), init trick (190k) 0.749 50x2D Poincare´ GloVe, h(x) = x2, init trick (190k) 0.790

iii) Unsupervised embedding learning, Unsupervised hypernymy score SGNS from [Ngu+17] 0.48 [Wee+14] 0.58 50x2D Poincare´ GloVe, h(x) = cosh2(x), init trick (190k) 0.575 50x2D Poincare´ GloVe, h(x) = x2, init trick (190k) 0.652

Table 3.6: Hyperlex results in terms of Spearman correlation for 3 dif- ferent model types ordered according to their difficulty. Method ρ i) Supervised embedding learning, Unsupervised hypernymy score Order Embeddings [Ven+15] 0.191 PARAGRAM + CF 0.320 WN-Basic 0.240 WN-WuP 0.214 WN-LCh 0.214 WN-Eucl from [NK17b] 0.389 WN-Poincare´ from [NK17b] 0.512 ii) Unsupervised embedding learning, Weakly-supervised hypernymy score 50x2D Poincare´ GloVe, h(x) = cosh2(x), init trick (190k) 0.402 50x2D Poincare´ GloVe, h(x) = x2, init trick (190k) 0.421

iii) Unsupervised embedding learning, Unsupervised hypernymy score Word2Gauss-DistPos 0.206 SGNS-Deps 0.205 Frequency 0.279 SLQS-Slim 0.229 Vis-ID 0.253 DIVE-W∆S[Cha+18] 0.333 SBOW-PPMI-C∆S from [Cha+18] 0.345 50x2D Poincare´ GloVe h(x) = cosh2(x) init trick (190k) 0.284 50x2D Poincare´ GloVe, h(x) = x2, init trick (190k) 0.341

64 3.2 word embeddings

3.2.6 Motivating Hyperbolic GloVe via δ-hyperbolicity

Why would we embed words in a hyperbolic space? Given some symbolic data, such as a vocabulary along with similarity measures between words − in our case, co-occurrence counts Xij − can we under- stand in a principled manner which geometry would represent it best? Choosing the right metric space to embed words can be understood as selecting the right inductive bias − an essential step.

δ-hyperbolicity. A particular quantity of interest describing quali- tative aspects of metric spaces is the δ-hyperbolicity. This metric intro- duced by Gromov [Gro87] quantifies the tree-likeliness of a metric space. Formally, the hyperbolicity δ(x, y, z, t) of a 4-tuple (x, y, z, t) is defined as half the difference between the biggest two of the following sums: d(x, y) + d(z, t), d(x, z) + d(y, t), d(x, t) + d(y, z). The δ-hyperbolicity of a metric space is defined as the supremum of these numbers over all 4-tuples. Following [ADM14], we will denote this number by δworst (it is a worst-case measure), and by δavg the average of these over all 4-tuples, when the space is a finite set. Intuitively, a low δavg of a finite metric space characterizes that this space has an underlying hyperbolic geometry, i.e. an approximate tree-like structure, and that the hyper- bolic space would be well suited to isometrically embed it. We also report the ratio 2 ∗ δavg/davg (invariant to metric scaling), where davg is the average distance in the finite space, as suggested by [BCC15], whose low value also characterizes the “hyperbolicness” of the space. An equivalent and more intuitive definition holds for geodesic spaces, i.e. when we can define triangles: its δ-hyperbolicity (δworst) is the smallest δ > 0 such that for any triangle ∆xyz, there exists a point at distance at most δ from each side of the triangle. [Che+13] and [BCC15] analyzed δworst and δavg for specific graphs, respectively. A low hyperbolicity of a graph indicates that it has an underlying hyperbolic geometry, i.e. that it is approximately tree-like, or at least that there exists a taxonomy of nodes. Conversely, a high hyperbolicity of a graph suggests that it possesses long cycles, or could not be embedded in a low dimensional hyperbolic space without distortion. For instance, the Euclidean space Rn is not δ-hyperbolic for any δ > 0, and is hence 2 described as ∞-hyperbolic, while√ the Poincare´ disk D is known to have a δ-hyperbolicity of log(1 + 2) ' 0.88. On the other-hand, a product D2 × D2 is ∞-hyperbolic, because a 2D plane R2 could be

65 hyperbolic spaces: embeddings & classification isometrically embedded in it using for instance the first coordinates of each D2. However, if D2 would constitute a good choice to embed some given symbolic data, then most likely D2 × D2 would as well. This stems from the fact that δ-hyperbolicity (δworst) is a worst case measure which does not reflect what one could call the “hyperbolic capacity” of 4 the space. Furthermore, note that computing δworst requires O(n ) for a graph of size n, while δavg can be approximated via sampling. Finally, δavg is robust to adding/removing a node from the graph, while δworst is not. For all these reasons, we choose δavg as a measure of hyperbolicity.

computing δavg . Since our methods are trained on a weighted graph of co-occurrences, it makes sense to look for the corresponding hyperbolicity δavg of this symbolic data. The lower this value, the more hyperbolic is the underlying nature of the graph, thus indicating that the hyperbolic space should be preferred over the Euclidean space for embedding words. However, in order to do so, one needs to be provided with a distance d(i, j) for each pair of words (i, j), while our symbolic data is only made of similarity measures. Inspired by eq. (3.44), we 9 associate to words i, j the distance h(d(i, j)) := − log(Xij) + bi + bj ≥ 0 with the choice bi := log(Xi), i.e.

−1 d(i, j) := h (log((XiXj)/Xij)).(3.51)

Table 3.7 shows values for different choices of h. The discrete metric spaces we obtained for our symbolic data of co-occurrences appear to have a very low hyperbolicity, i.e. to be very much “hyperbolic”, which suggests to embed words in (products of) hyperbolic spaces.

9One can replace log(x) with log(1 + x) to avoid computing the logarithm of zero.

66 3.2 word embeddings h(x) log(x) x x2 cosh(x) cosh2(x) cosh4(x) cosh10(x)

davg 18950.4 18.9505 4.3465 3.68 2.3596 1.7918 1.4947

δavg 8498.6 0.7685 0.088 0.0384 0.0167 0.0072 0.0026 2δ avg 0.8969 0.0811 0.0405 0.0209 0.0142 0.0081 0.0034 davg

Table 3.7: Average distances, δ-hyperbolicities and ratios computed via sampling for the metrics induced by different h functions, as defined in eq. (3.51).

Table 3.8: Similarity results on the unrestricted (190k) vocabulary for various h functions. This table should be read together with Table 3.7. Experiment’s name Rare Word WordSim SimLex SimVerb 100D Vanilla 0.3840 0.5849 0.3020 0.1674 100D Poincare,´ cosh 0.3353 0.5841 0.2607 0.1394 100D Poincare,´ cosh2 0.3981 0.6509 0.3131 0.1757 100D Poincare,´ cosh3 0.4170 0.6314 0.3155 0.1825 100D Poincare,´ cosh4 0.4272 0.6294 0.3198 0.1845 more experiments. As explained above, we computed hyperbolic- ities of the metric space induced by different h functions, on the matrix of co-occurrence counts, as reported in Table 3.7. We also conducted similarity experiments, reported in Table 3.8. Apart from WordSim, results improved for higher powers of cosh, corresponding to more hyperbolic spaces. However, also note that higher powers will tend to result in words embedded much closer to each other, i.e. with smaller distances. In order to know whether this benefit comes from contract- ing distances or making the space more “hyperbolic”, it would be interesting to learn (or cross-validate) the curvature c of the Poincare´ ball (or equivalently, its radius) jointly with the h function. Finally, it order to explain why WordSim behaved differently compared to other benchmarks, we investigated different properties of these, as reported

67 hyperbolic spaces: embeddings & classification in Table 3.9. The geometry of the words appearing in WordSim do not seem to have a different hyperbolicity compared to other benchmarks; however, WordSim seems to contain much more frequent words. Since hyperbolicities are computed with the assumption that bi = log(Xi) (see Eq. (3.51)), it would be interesting to explore whether learned biases indeed take these values. We left this as future work.

Table 3.9: Various properties of similarity benchmark datasets. The fre- quency index indicates the rank of a word in the vocabulary in terms of its frequency: a low index describes a frequent word. The median of indexes seems to best discriminate WordSim from SimLex and SimVerb. Property WordSim SimLex SimVerb # of test instances 353 999 3,500 # of different words 419 1,027 822 min. index (frequency) 57 38 21 max. index (frequency) 58,286 128,143 180,417 median of indexes 2,723 4,463 9,338

2 δavg, (·) 0.0738 0.0759 0.0799 2 δavg, cosh 0.0154 0.0156 0.0164

2 2δavg/dagv, (·) 0.0381 0.0384 0.0399 2 2δavg/davg, cosh 0.0136 0.0137 0.0143

3.3 classification & regression

3.3.1 Matrix Multiplications & Pointwise non-Linearities

In order to define hyperbolic neural networks, it is crucial to define a canonically simple parametric family of transformations, playing the role of linear mappings in usual Euclidean neural networks, and to know how to apply pointwise non-linearities. Inspiring ourselves from our reformulation of Mobius¨ scalar multiplication in eq. (3.19), we define:

68 3.3 classification & regression

Definition 3.2 (Mobius¨ version). For f : Rn → Rm, we define the n m Mobius¨ version of f as the map from Dc to Dc by:

⊗c c c f (x) := exp0( f (log0(x))),(3.52)

c m m c n n where exp0 : T0Dc → Dc and log0 : Dc → T0Dc .

Note that similarly as for other Mobius¨ operations, we recover the Euclidean mapping in the limit c → 0 if f is continuous, as ⊗c limc→0 f (x) = f (x). This definition satisfies a few desirable prop- erties too, such as: ( f ◦ g)⊗c = f ⊗c ◦ g⊗c for f : Rm → Rl and g : Rn → Rm (morphism property), and f ⊗c (x)/k f ⊗c (x)k = f (x)/k f (x)k for f (x) 6= 0 (direction preserving). It is then straight-forward to prove the following result: Lemma 3.4 (Mobius¨ matrix-vector multiplication). If M : Rn → Rm is a linear map, which we identify with its matrix representation M ∈ Mm,n, n then ∀x ∈ Dc , if Mx 6= 0 we have √  kMxk √  Mx M⊗c (x) = (1/ c) tanh tanh−1( ckxk) ,(3.53) kxk kMxk and M⊗c (x) = 0 if Mx = 0. Moreover, if we define the Mobius¨ matrix-vector ⊗ multiplication of M with x by M ⊗c x := M c (x), then we have:

0 0 0 • (MM ) ⊗c x = M ⊗c (M ⊗c x) for any M ∈ Ml,m(R) and M ∈ Mm,n(R) (matrix associativity)

• (rM) ⊗c x = r ⊗c (M ⊗c x) for r ∈ R and M ∈ Mm,n(R) (scalar- matrix associativity)

• M ⊗c x = Mx for all M ∈ On(R) (rotations are preserved) pointwise non-linearity. If ϕ : Rn → Rn is a pointwise non- linearity, then its Mobius¨ version ϕ⊗c can be applied to elements of the Poincare´ ball. bias translation. The generalization of a translation in the Poincare´ ball is naturally given by moving along geodesics. But should we use n the Mobius¨ sum x ⊕c b with a hyperbolic bias b ∈ Dc or the expo- c 0 0 n nential map expx(b ) with a Euclidean bias b ∈ R ? These views are

69 hyperbolic spaces: embeddings & classification unified with parallel transport (see theorem 3.1). Mobius¨ translation of n n a point x ∈ Dc by a bias b ∈ Dc is given by

 λc  ← ⊕ = c ( c ( c ( ))) = c 0 c ( ) x x c b expx P0→x log0 b expx c log0 b .(3.54) λx

We recover Euclidean translations in the limit c → 0. Note that bias translations play a particular role in this model. Indeed, consider multiple layers of the form fk(x) = ϕk(Mkx), each of which having ⊗c ⊗c Mobius¨ version fk (x) = ϕk (Mk ⊗c x). Then their composition can ⊗c ⊗c c c be re-written fk ◦ · · · ◦ f1 = exp0 ◦ fk ◦ · · · ◦ f1 ◦ log0. This means that these operations can essentially be performed in Euclidean space. Therefore, it is the interposition between those with the bias transla- tion of eq. (3.54) which differentiates this model from its Euclidean counterpart. concatenation of multiple input vectors. If a vector x ∈ n+p n p R is the (vertical) concatenation of two vectors x1 ∈ R , x2 ∈ R , and M ∈ Mm,n+p(R) can be written as the (horizontal) concatenation of two matrices M1 ∈ Mm,n(R) and M2 ∈ Mm,p(R), then Mx = M1x1 + M2x2. n We generalize this to hyperbolic spaces: if we are given x1 ∈ Dc , p T n p x2 ∈ Dc , x = (x1x2) ∈ Dc × Dc , and M, M1, M2 as before, then we define M ⊗c x := M1 ⊗c x1 ⊕c M2 ⊗c x2 (3.55) Note that, when c goes to zero, we recover the Euclidean formulation, as

lim M ⊗c x = lim M1 ⊗c x1 ⊕c M2 ⊗c x2 = M1x1 + M2x2 = Mx (3.56) c→0 c→0

n Moreover, hyperbolic vectors x ∈ Dc can also be “concatenated” with m real features y ∈ R via M ⊗c x ⊕c y ⊗c b with learnable bias b ∈ Dc and weight matrices M ∈ Mm,n(R).

3.3.2 Recurrent Networks & Gated Recurrent Units naive rnn. A simple RNN is defined by

ht+1 = ϕ(Wht + Uxt + b) (3.57)

70 3.3 classification & regression where ϕ is a pointwise non-linearity, typically tanh, sigmoid, ReLU, etc. This formula can be naturally generalized to the hyperbolic space as m follows. For parameters W ∈ Mm,n(R), U ∈ Mm,d(R), b ∈ Dc , we define:

⊗c n d ht+1 = ϕ (W ⊗c ht ⊕c U ⊗c xt ⊕c b), ht ∈ Dc , xt ∈ Dc .(3.58)

c Note that if inputs xt’s are Euclidean, one can write x˜ t := exp0(xt) and use the above formula because of the following relation

c c c exp ⊗ (P0→W⊗ h (Uxt)) = W ⊗c ht ⊕c exp0(Uxt) W cht c t (3.59) = W ⊗c ht ⊕c U ⊗c x˜ t gru architecture. One can also adapt the GRU architecture:

r r r rt = σ(W ht−1 + U xt + b ) z z z zt = σ(W ht−1 + U xt + b ) (3.60) h˜ t = ϕ(W(rt ht−1) + Uxt + b)

ht = (1 − zt) ht−1 + zt h˜ t, where denotes pointwise product. First, how should we adapt the pointwise multiplication by a scaling gate? Note that the definition of the Mobius¨ version (see eq. (3.52)) can be naturally extended to maps f : Rn × Rp → Rm as

⊗c n p m ⊗c 0 c c c 0 f : Dc × Dc 7→ Dc , f (h, h ) = exp0( f (log0(h), log0(h ))) (3.61) In particular, choosing f (h, h0) := σ(h) h0 yields10

⊗c 0 c c c 0 f (h, h ) = exp0(σ(log0(h)) log0(h )) c 0 (3.62) = diag(σ(log0(h))) ⊗c h

Hence we adapt rt ht−1 to diag(rt) ⊗c ht−1 and the reset gate rt to:

c r r r rt = σ log0(W ⊗c ht−1 ⊕c U ⊗c xt ⊕c b ),(3.63)

and similarly for the update gate zt. Note that as the argument of σ in the above is unbounded, rt and zt can a priori take values onto the full range (0, 1). Now the intermediate hidden state becomes:

10If x has n coordinates, then diag(x) denotes the diagonal matrix of size n with xi’s on its diagonal.

71 hyperbolic spaces: embeddings & classification

⊗c h˜ t = ϕ ((Wdiag(rt)) ⊗c ht−1 ⊕c U ⊗c xt ⊕ b),(3.64)

where Mobius¨ matrix associativity simplifies W ⊗c (diag(rt) ⊗c ht−1) into (Wdiag(rt)) ⊗c ht−1. Finally, we propose to adapt the update-gate equation as

ht = ht−1 ⊕c diag(zt) ⊗c (−ht−1 ⊕c h˜ t).(3.65)

Note that when c goes to zero, one recovers the usual GRU. Moreover, if zt = 0 or zt = 1, then ht becomes ht−1 or h˜ t respectively, similarly as in the usual GRU. This adaptation was obtained by adapting [TO18]: in this work, the authors re-derive the update-gate mechanism from a first principle called time-warping invariance. We adapted their derivation to the hyperbolic setting by using the notion of gyroderivative [BU01] and proving a gyro-chain-rule. The idea consists in the following. recovering the update-gate from time-warping. A naive RNN is given by the equation

h(t + 1) = ϕ(Wh(t) + Ux(t) + b) (3.66)

Let’s drop the bias b to simplify notations. If h is seen as a differ- entiable function of time, then a first-order Taylor development gives dh h(t + δt) ≈ h(t) + δt dt (t) for small δt. Combining this for δt = 1 with the naive RNN equation, one gets

dh (t) = ϕ(Wh(t) + Ux(t)) − h(t).(3.67) dt As this is written for any t, one can replace it by t ← α(t) where α is a (smooth) increasing function of t called the time-warping. Denoting ˜ dh˜ by h(t) := h(α(t)) and x˜(t) := x(α(t)), using the chain rule dt (t) = dα dh dt (t) dt (α(t)), one gets

dh˜ dα dα (t) = (t)ϕ(Wh˜(t) + Ux˜(t)) − (t)h˜(t).(3.68) dt dt dt Removing the tildas to simplify notations, discretizing back with dh dt (t) ≈ h(t + 1) − h(t) yields dα  dα  h(t + 1) = (t)ϕ(Wh(t) + Ux(t)) + 1 − (t) h(t).(3.69) dt dt

72 3.3 classification & regression

Requiring that our class of neural networks be invariant to time- warpings means that this class should contain RNNs defined by Eq. (3.69), dα i.e. that dt (t) can be learned. As this is a positive quantity, we can parametrize it as z(t) = σ(Wzh(t) + Uzx(t)), recovering the forget-gate equation:

h(t + 1) = z(t)ϕ(Wh(t) + Ux(t)) + (1 − z(t))h(t).(3.70) adapting this idea to hyperbolic rnns. The gyroderivative n [BU01] of a map h : R → Dc is defined as

dh 1 (t) = lim ⊗c (−h(t) ⊕c h(t + δt)).(3.71) dt δt→0 δt

Using Mobius¨ scalar associativity and the left-cancellation law leads us to

dh h(t + δt) ≈ h(t) ⊕ δt ⊗ (t),(3.72) c c dt for small δt. Combining this with the equation of a simple hyperbolic RNN of Eq. (3.58) with δt = 1, one gets

dh (t) = −h(t) ⊕ ϕ⊗c (W ⊗ h(t) ⊕ U ⊗ x(t)).(3.73) dt c c c c

For the next step, we need the following lemma:

Lemma 3.5 (Gyro-chain-rule). For α : R → R differentiable and h : R → n ˜ Dc with a well-defined gyro-derivative, if h := h ◦ α, then we have

dh˜ dα dh (t) = (t) ⊗ (α(t)),(3.74) dt dt c dt

dα where dt (t) denotes the usual derivative.

73 hyperbolic spaces: embeddings & classification

Proof. dh˜ 1 (t) = lim ⊗c [−h˜(t) ⊕c h˜(t + δt)] (3.75) dt δt→0 δt 1 0 = lim ⊗c [−h(α(t)) ⊕c h(α(t) + δt(α (t) + O(δt)))] (3.76) δt→0 δt 0 α (t) + O(δt) 0 = lim 0 ⊗c [−h(α(t)) ⊕c h(α(t) + δt(α (t) + O(δt)))] δt→0 δt(α (t) + O(δt)) (3.77) 0 α (t) 0 = lim 0 ⊗c [−h(α(t)) ⊕c h(α(t) + δt(α (t) + O(δt)))] δt→0 δt(α (t) + O(δt)) (3.78) α0(t) = lim ⊗c [−h(α(t)) ⊕c h(α(t) + u)] (3.79) u→0 u dα dh = (t) ⊗ (α(t)) (Mobius¨ scalar associativity) dt c dt (3.80) where we set u = δt(α0(t) + O(δt)), with u → 0 when δt → 0, which concludes.

Using lemma 3.5 and Eq. (3.73), with similar notations as in Eq. (3.68) we have dh˜ dα (t) = (t) ⊗ (−h˜(t) ⊕ ϕ⊗c (W ⊗ h˜(t) ⊕ U ⊗ x˜(t))).(3.81) dt dt c c c c c Finally, discretizing back with Eq. (3.72), using the left-cancellation law and dropping the tildas yields dα h(t + 1) = h(t) ⊕ (t) ⊗ (−h(t) ⊕ ϕ⊗c (W ⊗ h(t) ⊕ U ⊗ x(t))). c dt c c c c c (3.82) Since α is a time-warping, by definition its derivative is positive and one can choose to parametrize it with an update-gate zt (a scalar) defined with a sigmoid. Generalizing this scalar scaling by the Mobius¨ version of the pointwise scaling yields the Mobius¨ matrix scaling diag(zt) ⊗c ·, leading to our proposed Eq. (3.65) for the hyperbolic GRU.

Empirical Validation: Textual Entailment snli task and dataset. We evaluate our method on two tasks. The first is natural language inference, or textual entailment. Given two

74 3.3 classification & regression sentences, a premise (e.g. “Little kids A. and B. are playing soccer.”) and a hypothesis (e.g. “Two children are playing outdoors.”), the binary classification task is to predict whether the second sentence can be inferred from the first one. This defines a partial order in the sentence space. We test hyperbolic networks on the biggest real dataset for this task, SNLI [Bow+15]. It consists of 570K training, 10K validation and 10K test sentence pairs. Following [Ven+15], we merge the “contradic- tion” and “neutral” classes into a single class of negative sentence pairs, while the “entailment” class gives the positive pairs. prefix task and datasets. We conjecture that the improvements of hyperbolic neural networks are more significant when the underlying data structure is closer to a tree. To test this, we design a proof-of- concept task of detection of noisy prefixes, i.e. given two sentences, one has to decide if the second sentence is a noisy prefix of the first, or a random sentence. We thus build synthetic datasets PREFIX-Z% (for Z being 10, 30 or 50) as follows: for each random first sentence of random length at most 20 and one random prefix of it, a second positive sentence is generated by randomly replacing Z% of the words of the prefix, and a second negative sentence of same length is randomly generated. Word vocabulary size is 100, and we generate 500K training, 10K validation and 10K test pairs. models architecture. Our neural network layers can be used in a plug-n-play manner exactly like standard Euclidean layers. They can also be combined with Euclidean layers. However, optimization w.r.t. hyperbolic parameters is different (see below) and based on Riemannian gradients which are just rescaled Euclidean gradients when working in the conformal Poincare´ model [NK17b]. Thus, back-propagation can be applied in the standard way. In our setting, we embed the two sentences using two distinct hyper- bolic RNNs or GRUs. The sentence embeddings are then fed together with their squared distance (hyperbolic or Euclidean, depending on their geometry) to a Feed-forward Neural Network (FFNN) (Euclidean or hyperbolic, see section 3.3.1) which is further fed to an MLR (Eu- clidean or hyperbolic, see section 3.3.3) that gives probabilities of the two classes (entailment vs neutral). We use cross-entropy loss on top. Note that hyperbolic and Euclidean layers can be mixed, e.g. the full network can be hyperbolic and only the last layer be Euclidean, in

75 hyperbolic spaces: embeddings & classification

which case one has to use log0 and exp0 functions to move between the two manifolds in a correct manner as explained for eq. (3.52). optimization. Our models have both Euclidean (e.g. weight ma- trices in both Euclidean and hyperbolic FFNNs, RNNs or GRUs) and hyperbolic parameters (e.g. word embeddings or biases for the hyper- bolic layers). We optimize the Euclidean parameters with Adam [KB15] (learning rate 0.001). Hyperbolic parameters cannot be updated with an equivalent method that keeps track of gradient history due to the absence of a Riemannian Adam. Thus, they are optimized using full RSGD [Bon13]. We also experiment with projected RSGD [NK17b], but optimization was sometimes less stable. We use a different constant learning rate for word embeddings (0.1) and other hyperbolic weights (0.01) because words are updated less frequently. numerical errors. Gradients of the basic operations defined ⊕ above (e.g. c, exponential map) are not defined√ when the hyper- bolic argument vectors are on the ball border, i.e. ckxk = 1. Thus, we always project results of these operations in the ball of radius 1 − e, where e = 10−5. Numerical errors also appear when hyperbolic vectors get closer to 0, thus we perturb them with an e0 = 10−15 before they are used in any of the above operations. Finally, arguments of the tanh function are clipped between ±15 to avoid numerical errors, while arguments of tanh−1 are clipped to at most 1 − 10−5. hyperparameters. For all methods, baselines and datasets, we use c = 1, word and hidden state embedding dimension of 5 (we focus on the low dimensional setting that was shown to already be effective [NK17b]), batch size of 64. We ran all methods for a fixed number of 30 epochs. For all models, we experiment with both identity (no non-linearity) or tanh non-linearity in the RNN/GRU cell, as well as identity or ReLU after the FFNN layer and before MLR. As expected, for the fully Euclidean models, tanh and ReLU respectively surpassed the identity variant by a large margin. We only report the best Euclidean results. Interestingly, for the hyperbolic models, using only identity for both non-linearities works slightly better and this is likely due to the fact that our hyperbolic layers already contain non-linearities by their nature.

76 3.3 classification & regression

PREFIX PREFIX PREFIX SNLI 10% 30% 50%

Eucl RNN+FFNN 79.34 % 89.62 % 81.71 % 72.10 % Eucl MLR Hyp RNN+FFNN 79.18 % 96.36 % 87.83 % 76.50 % Eucl MLR Hyp RNN+FFNN 78.21 % 96.91 % 87.25 % 62.94 % Hyp MLR

Eucl GRU+FFNN 81.52 % 95.96 % 86.47 % 75.04 % Eucl MLR Hyp GRU+FFNN 79.76 % 97.36 % 88.47 % 76.87 % Eucl MLR Hyp RNN+FFNN 81.19 % 97.14 % 88.26 % 76.44 % Hyp MLR

Figure 3.5: Test accuracies for various models and four datasets. “Eucl” denotes Euclidean, “Hyp” denotes hyperbolic. All word and sentence embeddings have dimension 5. We highlight in bold the best method (or methods, if the difference is less than 0.5%). Implementation of these experiments was done by co-authors in [GBH18c].

For the results shown in fig. 3.5, we run each model (baseline or ours) exactly 3 times and report the test result corresponding to the best validation result from these 3 runs. We do this because the highly non-convex spectrum of hyperbolic neural networks sometimes results in convergence to poor local minima, suggesting that initialization is very important. results. Results are shown in fig. 3.5. Note that the fully Euclidean baseline models might have an advantage over hyperbolic baselines because more sophisticated optimization algorithms such as Adam do not have a hyperbolic analogue at the moment. We first observe

77 hyperbolic spaces: embeddings & classification

(a) Test accuracy

(b) Norm of the first sentence. Averaged over all sentences in the test set.

Figure 3.6: PREFIX-30% accuracy and first (premise) sentence norm plots for different runs of the same architecture: hyper- bolic GRU followed by hyperbolic FFNN and hyperbol- ic/Euclidean (half-half) MLR. The X axis shows millions of training examples processed. Implementation of these experiments was done by co-authors in [GBH18c].

that all GRU models overpass their RNN variants. Hyperbolic RNNs and GRUs have the most significant improvement over their Euclidean variants when the underlying data structure is more tree-like, e.g. for PREFIX-10% − for which the tree relation between sentences and their prefixes is more prominent − we reduce the error by a factor of 3.35 for hyperbolic vs Euclidean RNN, and by a factor of 1.5 for hyperbolic vs Euclidean GRU. As soon as the underlying structure diverges more and more from a tree, the accuracy gap decreases − for example, for PREFIX-50% the noise heavily affects the representational power of hyperbolic networks. Also, note that on SNLI our methods perform similarly as with their Euclidean variants. Moreover, hyperbolic and Euclidean MLR are on par when used in conjunction with hyperbolic sentence embeddings, suggesting further empirical investigation is needed for this direction (see below).

78 3.3 classification & regression

(a) Test accuracy

(b) Norm of the first sentence. Averaged over all sentences in the test set.

Figure 3.7: PREFIX-30% accuracy and first (premise) sentence norm plots for different runs of the same architecture: Euclidean GRU followed by Euclidean FFNN and Euclidean MLR. The X axis shows millions of training examples processed. Im- plementation of these experiments was done by co-authors in [GBH18c].

additional experimental results We observed that, in the hyperbolic setting, accuracy is often much higher when sentence em- beddings increase norms, tending to diverge towards the “infinity” bor- der of the Poincare´ ball. Moreover, the faster the two sentence norms go to 1, the more it’s likely that a good local minima was reached. See figs. 3.6 and 3.8.

We often observe that test accuracy starts increasing exactly when sentence embedding norms do. However, in the hyperbolic setting, the sentence embeddings norms remain close to 0 for a few epochs, which does not happen in the Euclidean case. See figs. 3.6 to 3.8. This surprising behavior was also exhibited in a similar way by [NK17b] which suggests that the model first has to adjust the angular layout in the almost Euclidean vicinity of 0 before increasing norms and fully exploiting hyperbolic geometry.

79 hyperbolic spaces: embeddings & classification

(a) Test accuracy

(b) Norm of the first sentence. Averaged over all sentences in the test set.

Figure 3.8: PREFIX-30% accuracy and first (premise) sentence norm plots for different runs of the same architecture: hyper- bolic RNN followed by hyperbolic FFNN and hyperbolic MLR. The X axis shows millions of training examples pro- cessed. Implementation of these experiments was done by co-authors in [GBH18c].

3.3.3 Multinomial Logistic Regression

In order to perform multi-class classification on the Poincare´ ball, one needs to generalize MLR − also called softmax regression − to the Poincare´ ball. reformulating euclidean mlr. Let’s first reformulate Euclidean MLR from the perspective of distances to margin hyperplanes, as in [LL04, Section 5]. This will allow us to easily generalize it. Given K classes, one learns a margin hyperplane for each such class k ∈ {1, ..., K} using softmax probabilities:

p(y = k|x) ∝ exp ((hak, xi − bk)) (3.83)

n n where bk ∈ R is a class bias, ak ∈ R is a class weight vector and x ∈ R is an embedding vector of the point to be classified. Note that any affine

80 3.3 classification & regression hyperplane in Rn can be written with a normal vector a ∈ Rn \{0} and a scalar shift b ∈ R:

n Ha,b = {x ∈ R : ha, xi − b = 0} (3.84)

As in [LL04, Section 5], we note that

ha, xi − b = sign(ha, xi − b)kakd(x, Ha,b) (3.85)

n Using eq. (3.83), we get for bk ∈ R, x, ak ∈ R :

( = | ) ( (h i − )k k ( )) p y k x ∝ exp sign ak, x bk ak d x, Hak,bk (3.86)

As it is not immediately obvious how to generalize the Euclidean hyperplane of eq. (3.84) to other spaces such as the Poincare´ ball, we reformulate it as follows:

n ⊥ H˜ a,p = {x ∈ R : h−p + x, ai = 0} = p + {a} (3.87) where p ∈ Rn, a ∈ Rn \{0}. This new definition relates to the previous ˜ one as Ha,p = Ha,ha,pi. Rewriting eq. (3.86) with b = ha, pi:

˜ p(y = k|x) ∝ exp(sign(h−pk + x, aki)kakkd(x, Hak,pk )) (3.88)

n where pk, x, ak ∈ R . It is now natural to adapt the previous definition to the hyperbolic setting by replacing + by ⊕c:

n n Definition 3.3 (Poincare´ hyperplanes). For p ∈ Dc , a ∈ TpDc \{0}, we define

⊥ n c n {a} := {z ∈ TpDc : gp(z, a) = 0} = {z ∈ TpDc : hz, ai = 0} (3.89)

Then, we definea Poincare´ hyperplanes as

˜ c n c c ⊥ Ha,p := {x ∈ Dc : hlogp(x), aip = 0} = expp({a} ) (3.90)

aWhere h·, ·i denotes the (Euclidean) inner-product of the ambient space.

An alternative definition of the Poincare´ hyperplanes is

Lemma 3.6. ˜ c n Ha,p = {x ∈ Dc : h−p ⊕c x, ai = 0}.(3.91)

81 hyperbolic spaces: embeddings & classification

Proof. Two steps proof: c ⊥ n i) expp({a} ) ⊆ {x ∈ Dc : h−p ⊕c x, ai = 0}: Let z ∈ {a}⊥. From eq. (3.17), we have that:

c expp(z) = −p ⊕c βz, for some β ∈ R.(3.92) This, together with the left-cancellation law in gyrospaces (see sec- tion 3.1.3), implies that

c h−p ⊕c expp(z), ai = hβz, ai = 0 (3.93) which is what we wanted.

n c ⊥ ii) {x ∈ Dc : h−p ⊕c x, ai = 0} ⊆ expp({a} ): n Let x ∈ Dc s.t. h−p ⊕c x, ai = 0. Then, using eq. (3.17), we derive that:

c logp(x) = β(−p ⊕c x), for some β ∈ R,(3.94) c ⊥ which is orthogonal to a, by assumption. This implies logp(x) ∈ {a} , c ⊥ hence x ∈ expp({a} ). ˜ c Turning back, Ha,p can also be described as the union of images of n all geodesics in Dc orthogonal to a and containing p. Notice that our definition matches that of hypergyroplanes, see [Ung14, definition 5.8]. A 3D hyperplane example is depicted in fig. 3.9. Theorem 3.4. ˜ c dc(x, Ha,p) := inf dc(x, w) ˜ c w∈Ha,p √ (3.95) 1  2 c|h−p ⊕ x, ai|  = √ −1 c sinh 2 . c (1 − ck − p ⊕c xk )kak Proof. See [GBH18c]. final formula for mlr in the poincare´ ball. Putting to- gether eq. (3.88) and theorem 3.4, we get the hyperbolic MLR formula- tion. Given K classes and k ∈ {1, . . . , K}, the separation hyperplane are n n n determined by pk ∈ Dc , ak ∈ Tpk Dc \{0} and given for all x ∈ Dc by: q p(y = k| ) ( (h− ⊕ i) gc ( )d ( H˜ c )) x ∝ exp sign pk c x, ak pk ak, ak c x, ak,pk (3.96)

82 3.3 classification & regression

3 Figure 3.9: An example of a hyperbolic hyperplane in D1 plotted using sampling. The red point is p. The shown normal axis to the hyperplane through p is parallel to a. or, equivalently

c  √ ! λ kakk 2 ch−p ⊕ x, a i ( = | ) p√k −1 k c k p y k x ∝ exp sinh 2 c (1 − ck − pk ⊕c xk )kakk (3.97)

Notice that when c goes to zero, the above formula becomes

p(y = k| ) ( h− + i) = (( 0 )2h− + i) x ∝ exp 4 pk x, ak exp λpk pk x, ak (3.98) recovering the usual Euclidean softmax.

However, it is unclear how to perform optimization over ak, since n these vectors live in Tpk Dc and, hence, depend on pk. The solution is that one should write

= Pc ( 0 ) = ( c c ) 0 ak 0→pk ak λ0/λpk ak (3.99)

0 n n 0 where ak ∈ T0Dc = R , and optimize ak as a Euclidean parameter.

83 hyperbolic spaces: embeddings & classification

Empirical Validation: Classification of Hierarchical Word Embeddings setup description. For the sentence entailment classification task we did not see a clear advantage of using the hyperbolic MLR layer compared to its Euclidean variant. A possible reason is that, when trained end-to-end, the model might decide to place positive and neg- ative embeddings in a manner that is already well separated with a classic MLR. As a consequence, we further investigate MLR for the task of subtree classification. Using an open source implementation11 of [NK17b], we pre-trained Poincare´ embeddings of the WordNet noun hierarchy (82,115 nodes). We then choose one node in this tree (see fig. 3.11) and classify all other nodes (solely based on their embeddings) as being part of the subtree rooted at this node. All nodes in such a subtree are divided into positive training nodes (80%) and positive test nodes (20%).

Figure 3.10: Hyperbolic (left) vs Direct Euclidean (right) binary MLR used to classify nodes as being part in the group.n.01 subtree of the WordNet noun hierarchy solely based on their Poincare´ embeddings. The positive points (from the subtree) are in red, the negative points (the rest) are in yellow and the trained positive separation hyperplane is depicted in green.

11https://github.com/dalab/hyperbolic_cones

84 3.4 additional related work

The same splitting procedure is applied for the remaining WordNet nodes that are divided into a negative training and negative test sets. training details. Three variants of MLR are then trained on top of pre-trained Poincare´ embeddings[NK17b] to solve this binary clas- sification task: hyperbolic MLR, Euclidean MLR applied directly on the hyperbolic embeddings (even if mathematically this is not respecting the hyperbolic geometry) and Euclidean MLR applied after mapping all embeddings in the tangent space at 0 using the log0 map. We use different embedding dimensions : 2, 3, 5 and 10. For the hyperbolic MLR, we use full RSGD with a learning rate of 0.001. For the two Euclidean models we use ADAM optimizer and the same learning rate. During training, we always sample the same number of negative and positive nodes in each minibatch of size 16; thus positive nodes are frequently re-sampled. All methods are trained for 30 epochs and the final F1 score is reported (no hyperparameters to validate are used, thus we do not require a validation set). This procedure is repeated for four subtrees of different sizes. results. Quantitative results are presented in fig. 3.11. We can see that the hyperbolic MLR overpasses its Euclidean variants in almost all settings, sometimes by a large margin. Moreover, to provide further understanding, we plot the 2-dimensional embeddings and the trained separation hyperplanes (geodesics in this case) in fig. 3.10.

3.4 additional related work

Recent supervised methods can be applied to embed any tree or directed acyclic graph in a low dimensional space with the aim of improving link prediction, either by imposing a partial order in the embedding space [AW18]; [Ven+15]; [Vil+18], by using hyperbolic geometry [NK17b]; [NK18], or both. To learn word embeddings that exhibit hypernymy or hierarchical information, supervised methods [Ngu+17]; [VM18] leverage external information (e.g. WordNet) together with raw text corpora. However, the same goal is also targeted by more ambitious fully unsupervised models which move away from the "point" assumption and learn various probability densities for each word [AW17]; [MC18]; [Sin+18]; [VM15].


WordNet subtree (train/test)   Model   d = 2           d = 3           d = 5           d = 10
animal.n.01 (3218 / 798)       Hyp     47.43 ± 1.07    91.92 ± 0.61    98.07 ± 0.55    99.26 ± 0.59
                               Eucl    41.69 ± 0.19    68.43 ± 3.90    95.59 ± 1.18    99.36 ± 0.18
                               log0    38.89 ± 0.01    62.57 ± 0.61    89.21 ± 1.34    98.27 ± 0.70
group.n.01 (6649 / 1727)       Hyp     81.72 ± 0.17    89.87 ± 2.73    87.89 ± 0.80    91.91 ± 3.07
                               Eucl    61.13 ± 0.42    63.56 ± 1.22    67.82 ± 0.81    91.38 ± 1.19
                               log0    60.75 ± 0.24    61.98 ± 0.57    67.92 ± 0.74    91.41 ± 0.18
worker.n.01 (861 / 254)        Hyp     12.68 ± 0.82    24.09 ± 1.49    55.46 ± 5.49    66.83 ± 11.38
                               Eucl    10.86 ± 0.01    22.39 ± 0.04    35.23 ± 3.16    47.29 ± 3.93
                               log0     9.04 ± 0.06    22.57 ± 0.20    26.47 ± 0.78    36.66 ± 2.74
mammal.n.01 (953 / 228)        Hyp     32.01 ± 17.14   87.54 ± 4.55    88.73 ± 3.22    91.37 ± 6.09
                               Eucl    15.58 ± 0.04    44.68 ± 1.87    59.35 ± 1.31    77.76 ± 5.08
                               log0    13.10 ± 0.13    44.89 ± 1.18    52.51 ± 0.85    56.11 ± 2.21

Figure 3.11: Test F1 classification scores (%) for four different subtrees of the WordNet noun tree. 95% confidence intervals over 3 different runs are shown for each method and each dimension. "Hyp" denotes our hyperbolic MLR, "Eucl" denotes directly applying Euclidean MLR to hyperbolic embeddings in their Euclidean parametrization, and log0 denotes applying Euclidean MLR in the tangent space at 0, after projecting all hyperbolic embeddings there with log_0. Implementation of these experiments was done by co-authors in [GBH18c].

There have been two very recent attempts at learning unsupervised word embeddings in the hyperbolic space [Dhi+18]; [LW18]. However, they suffer from either not being competitive on standard tasks in high dimensions, not showing the benefit of using hyperbolic spaces to model antisymmetric relations, or not being trained on realistically large corpora. We address these problems and, moreover, the connection with density based methods is made explicit and leveraged to improve hypernymy detection.


3.5 conclusive summary

We propose to adapt the GloVe algorithm to hyperbolic spaces and to leverage a connection between statistical manifolds of Gaussian distributions and hyperbolic geometry, in order to better interpret entailment relations between hyperbolic embeddings. We justify the choice of products of hyperbolic spaces via this connection to Gaussian distributions and via computations of the hyperbolicity of the symbolic data upon which GloVe is based. Empirically we present the first model that can simultaneously obtain state-of-the-art results, or results close to them, on the three tasks of word similarity, analogy and hypernymy detection. While there is no single model that outperforms all the baselines on all presented tasks, one can remark that the model 50x2D, h(x) = x², with the initialization trick, obtains state-of-the-art results on hypernymy detection and is close to the best models for similarity and analogy (also Poincaré GloVe models), while almost always outperforming the vanilla GloVe baseline on these. This is the first model that can achieve competitive results on all three of these tasks, additionally offering interpretability via the connection to Gaussian word embeddings. Future work includes jointly learning the curvature of the model together with the h function defining the geometry from co-occurrence counts, as well as re-running experiments in the hyperboloid model, which has been reported to lead to fewer numerical instabilities.

We also showed how classic Euclidean deep learning tools such as MLR, FFNN, RNN or GRU can be generalized in a principled manner to all spaces of constant negative curvature, combining Riemannian geometry with the elegant theory of gyrovector spaces. Empirically we found that our models outperform or are on par with corresponding Euclidean architectures on sequential data with implicit hierarchical structure. We hope to trigger exciting future research related to a better understanding of the hyperbolic non-convexity spectrum and to the development of other non-Euclidean deep learning methods.


4

ADAPTIVE OPTIMIZATION IN NON-EUCLIDEAN GEOMETRY

In this chapter, we discuss generalizing several celebrated adaptive algorithms from the Euclidean domain to certain Riemannian settings. We start by discussing the natural difficulties and obstacles in doing so, before defining a realistic setting for generalization. We provide new algorithms, together with convergence guarantees which recover their already known Euclidean analogues in the particular case where the selected Riemannian space is the Euclidean space. We obtain faster convergence, to a lower training loss, for our Riemannian adaptive methods compared to their corresponding baselines on the realistic task of embedding the WordNet taxonomy into the Poincaré ball. More detailed context is provided in section 1.3.4.

our initial motivation. The particular application that motivated us in developing Riemannian versions of Adagrad and Adam was the learning of symbolic embeddings in non-Euclidean spaces. As an example, the GloVe algorithm [PSM14] − an unsupervised method for learning Euclidean word embeddings capturing semantic/syntactic relationships − benefits significantly from optimizing with Adagrad compared to using Stochastic Gradient Descent (SGD), presumably because different words are sampled at different frequencies. Hence the absence of Riemannian adaptive algorithms could constitute a significant obstacle to the development of competitive optimization-based Riemannian embedding methods.

impact. Our methods were published at ICLR 2019 [BG19], have been incorporated into the PyTorch GeoOpt package [KKK20], and contributed to new state-of-the-art results in a variety of tasks, from word embeddings [TBG19] to graph neural networks [LNK19].


4.1 preliminaries and notations

4.1.1 Elements of Geometry

We recall here some elementary notions of differential geometry. For more in-depth expositions, we refer the interested reader to [Spi79] and [RS11].

manifold, tangent space, riemannian metric. A manifold M of dimension n is a space that can locally be approximated by a Euclidean space R^n, and which can be understood as a generalization to higher dimensions of the notion of surface. For instance, the sphere S^{n−1} := {x ∈ R^n | ‖x‖_2 = 1} embedded in R^n is an (n − 1)-dimensional manifold. In particular, R^n is a very simple n-dimensional manifold, with zero curvature. At each point x ∈ M, one can define the tangent space T_xM, which is an n-dimensional vector space and can be seen as a first order local approximation of M around x. A Riemannian metric ρ is a collection ρ := (ρ_x)_{x∈M} of inner products ρ_x(·, ·) : T_xM × T_xM → R on T_xM, varying smoothly with x. It defines the geometry locally on M. For x ∈ M and u ∈ T_xM, we also write ‖u‖_x := √(ρ_x(u, u)). A Riemannian manifold is a pair (M, ρ).

induced distance function, geodesics. Notice how a choice of Riemannian metric ρ induces a natural global distance function on M. Indeed, for x, y ∈ M, we can set d(x, y) to be equal to the infimum of the lengths of smooth paths between x and y in M, where the length ℓ(c) of a path c is given by integrating the size of its speed vector ċ(t) ∈ T_{c(t)}M in the corresponding tangent space: ℓ(c) := ∫_{t=0}^{1} ‖ċ(t)‖_{c(t)} dt. A geodesic γ in (M, ρ) is a smooth curve γ : (a, b) → M which locally has minimal length. In particular, a shortest path between two points in M is a geodesic.

exponential and logarithmic maps. Under some assumptions, one can define at a point x ∈ M the exponential map exp_x : T_xM → M. Intuitively, this map folds the tangent space onto the manifold. Locally, if v ∈ T_xM, then for small t, exp_x(tv) tells us how to move in M so as to take a shortest path from x with initial direction v. In R^n, exp_x(v) = x + v. In some cases, one can also define the logarithmic map log_x : M → T_xM as the inverse of exp_x.

parallel transport. In the Euclidean space, if one wants to transport a vector v from x to y, one simply translates v along the straight line from x to y. In a Riemannian manifold, the resulting transported vector will depend on which path was taken from x to y. The parallel transport P_x(v; w) of a vector v from a point x in the direction w and in unit time gives a canonical way to transport v with zero acceleration along a geodesic starting from x, with initial velocity w.

4.1.2 Riemannian Optimization

Consider performing an SGD update of the form

$$x_{t+1} \leftarrow x_t - \alpha g_t, \quad (4.1)$$

where g_t denotes the gradient of the objective f_t¹ and α > 0 is the step-size. In a Riemannian manifold (M, ρ), for smooth f : M → R, [Bon13] defines Riemannian SGD by the following update:

$$x_{t+1} \leftarrow \exp_{x_t}(-\alpha g_t), \quad (4.2)$$

where g_t ∈ T_{x_t}M denotes the Riemannian gradient of f_t at x_t. Note that when (M, ρ) is the Euclidean space (R^n, I_n), these two updates match, since we then have exp_x(v) = x + v. Intuitively, applying the exponential map enables one to perform an update along the shortest path in the relevant direction in unit time, while remaining in the manifold.

In practice, when expx(v) is not known in closed-form, it is common to replace it by a retraction map Rx(v), most often chosen as Rx(v) = x + v, which is a first-order approximation of expx(v).
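As an illustration of Eqs. (4.1)–(4.2) and of the retraction, here is a minimal PyTorch sketch of one RSGD step on the Poincaré ball of curvature −1; the function names and the clamping constants are our own choices, not part of the thesis.

```python
import torch

def mobius_add(x, y):
    # Mobius addition on the Poincare ball of curvature -1.
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2).clamp_min(1e-15)

def poincare_expmap(x, v):
    # Exponential map exp_x(v) on the Poincare ball (Eq. 4.2 with M = D^n).
    lam = 2.0 / (1.0 - (x * x).sum(-1, keepdim=True)).clamp_min(1e-15)
    vnorm = v.norm(dim=-1, keepdim=True).clamp_min(1e-15)
    return mobius_add(x, torch.tanh(lam * vnorm / 2.0) * v / vnorm)

def rsgd_step(x, euc_grad, lr, use_retraction=False):
    # Riemannian gradient = (1 / lambda_x^2) * Euclidean gradient on the ball.
    lam = 2.0 / (1.0 - (x * x).sum(-1, keepdim=True)).clamp_min(1e-15)
    rgrad = euc_grad / lam ** 2
    if use_retraction:
        return x - lr * rgrad            # first-order approximation R_x(v) = x + v
    return poincare_expmap(x, -lr * rgrad)
```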

4.1.3 Amsgrad, Adam, Adagrad

Let’s recall here the main algorithms that we are taking interest in.

¹ To be interpreted as the objective with the same parameters, evaluated at the minibatch taken at time t.

adagrad. Introduced by [DHS11], the standard form of its update step is defined as²

$$x^i_{t+1} \leftarrow x^i_t - \alpha\, g^i_t \Big/ \sqrt{\sum_{k=1}^{t} (g^i_k)^2}. \quad (4.3)$$

Such updates, rescaled coordinate-wise depending on the size of past gradients, can yield huge improvements when gradients are sparse, or in deep networks where the size of a good update may depend on the layer. However, the accumulation of all past gradients can also slow down learning.

adam. Proposed by [KB15], the Adam update rule is given by

$$x^i_{t+1} \leftarrow x^i_t - \alpha\, m^i_t / \sqrt{v^i_t}, \quad (4.4)$$

where m_t = β_1 m_{t−1} + (1 − β_1) g_t can be seen as a momentum term and v^i_t = β_2 v^i_{t−1} + (1 − β_2)(g^i_t)² is an adaptivity term. When β_1 = 0, one essentially recovers the unpublished method Rmsprop [TH12], the only difference to Adagrad being that the sum is replaced by an exponential moving average, hence past gradients are forgotten over time in the adaptivity term v_t. This circumvents the issue of Adagrad that learning could stop too early when the sum of accumulated squared gradients is too significant. Let us also mention that the momentum term introduced by Adam for β_1 ≠ 0 has been observed to often yield huge empirical improvements.

amsgrad. More recently, [RKK18] identified a mistake in the convergence proof of Adam. To fix it, they proposed to either modify the Adam algorithm with³

$$x^i_{t+1} \leftarrow x^i_t - \alpha\, m^i_t / \sqrt{\hat v^i_t}, \qquad \text{where } \hat v^i_t = \max\{\hat v^i_{t-1}, v^i_t\}, \quad (4.5)$$

which they coin Amsgrad, or to choose an increasing schedule for β_2, making it time dependent, which they call AdamNc (for non-constant).

² A small ε = 10⁻⁸ is often added inside the square root for numerical stability, omitted here for simplicity.
³ With m_t and v_t defined by the same equations as in Adam (see the above paragraph).
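For reference, a compact NumPy sketch of one coordinate-wise Amsgrad step as in Eq. (4.5); the state-threading signature is our own convention.

```python
import numpy as np

def amsgrad_step(x, g, m, v, v_hat, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Amsgrad step (Eq. 4.5); using v instead of v_hat in the last line
    recovers Adam (Eq. 4.4)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)
    x = x - alpha * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat
```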


4.2 adaptive schemes in riemannian manifolds

4.2.1 Obstacles in the General Setting

intrinsic updates. It is easily understandable that writing any coordinate-wise update requires the choice of a coordinate system. However, on a Riemannian manifold (M, ρ), one is generally not provided with a canonical coordinate system. The formalism only allows one to work with certain local coordinate systems, also called charts, and several different charts can be defined around each point x ∈ M. One usually says that a quantity defined using a chart is intrinsic to M if its definition does not depend on which chart was used. For instance, it is known that the Riemannian gradient grad f of a smooth function f : M → R can be defined intrinsically to (M, ρ), but its Hessian is only intrinsically defined at critical points⁴. It is easily seen that the RSGD update of Eq. (4.2) is intrinsic, since it only involves exp and grad, which are objects intrinsic to (M, ρ). However, it is unclear whether it is possible at all to express any of Eqs. (4.3, 4.4, 4.5) in a coordinate-free or intrinsic manner.

a tempting solution. Note that since an update is defined in a tangent space, one could be tempted to fix a canonical coordinate system e := (e^{(1)}, ..., e^{(n)}) in the tangent space T_{x_0}M ≃ R^d at the initialization x_0 ∈ M, and to parallel-transport e along the optimization trajectory, adapting Eq. (4.3) to:

$$x_{t+1} \leftarrow \exp_{x_t}(\Delta_t), \qquad e_{t+1} \leftarrow P_{x_t}(e_t; \Delta_t), \qquad \text{with } \Delta_t := -\alpha\, g_t \Big/ \sqrt{\sum_{k=1}^{t} (g_k)^2}, \quad (4.6)$$

where the division and the square (·)² are taken coordinate-wise, relatively to the coordinate system e_t. In the Euclidean space, parallel transport between two points x and y does not depend on the path along which it is taken, because the space has no curvature. However, in a general Riemannian manifold, not only does it depend on the chosen path, but curvature will also give parallel transport a rotational component⁵, which will almost surely break the

⁴ Because the Poisson bracket cancels at critical points [Mil63, part 1.2].
⁵ The rotational component of parallel transport inherited from curvature is called the holonomy.

sparsity of the gradients and hence the benefit of adaptivity. Besides, the interpretation of adaptivity as optimizing different features (i.e. gradient coordinates) at different speeds is also completely lost here, since the coordinate system used to represent gradients depends on the optimization path. Finally, note that the techniques we used to prove our theorems would not apply to updates defined in the vein of Eq. (4.6).

4.2.2 Adaptivity Across Manifolds in a Cartesian Product

From now on, we assume additional structure on (M, ρ), namely that it is the Cartesian product of n Riemannian manifolds (M_i, ρ^i), where ρ is the induced product metric:

$$\mathcal{M} := \mathcal{M}_1 \times \cdots \times \mathcal{M}_n, \qquad \rho := \begin{pmatrix} \rho^1 & & \\ & \ddots & \\ & & \rho^n \end{pmatrix}. \quad (4.7)$$

product notations. The induced distance function d on M is known to be given by d(x, y)² = ∑_{i=1}^{n} d^i(x^i, y^i)², where d^i is the distance in M_i. The tangent space at x = (x^1, ..., x^n) is given by T_xM = T_{x^1}M_1 ⊕ · · · ⊕ T_{x^n}M_n, and the Riemannian gradient g of a smooth function f : M → R at a point x ∈ M is simply the concatenation g = ((g^1)^T · · · (g^n)^T)^T of the Riemannian gradients g^i ∈ T_{x^i}M_i of each partial map f^i : y ∈ M_i ↦ f(x^1, ..., x^{i−1}, y, x^{i+1}, ..., x^n). Similarly, the exponential map, the log map and the parallel transport in M are the concatenations of those in each M_i.

riemannian adagrad. We just saw in the above discussion that designing meaningful adaptive schemes − intuitively corresponding to one learning rate per coordinate − in a general Riemannian manifold is difficult, because of the absence of intrinsic coordinates. Here, we propose to see each component x^i ∈ M_i of x as a "coordinate", yielding a simple adaptation of Eq. (4.3):

$$x^i_{t+1} \leftarrow \exp_{x^i_t}\!\left(-\alpha\, g^i_t \Big/ \sqrt{\sum_{k=1}^{t} \|g^i_k\|^2_{x^i_k}}\right). \quad (4.8)$$

on the adaptivity term. Note that we take (squared) Riemannian norms ‖g^i_t‖²_{x^i_t} = ρ_{x^i_t}(g^i_t, g^i_t) in the adaptivity term rescaling the gradient. In the Euclidean setting, this quantity is simply a scalar (g^i_t)², which is related to the size of an SGD update of the i-th coordinate, rescaled by the learning rate (see Eq. (4.1)): |g^i_t| = |x^i_{t+1} − x^i_t|/α. By analogy, note that the size of an RSGD update in M_i (see Eq. (4.2)) is given by d^i(x^i_{t+1}, x^i_t) = d^i(exp_{x^i_t}(−α g^i_t), x^i_t) = ‖−α g^i_t‖_{x^i_t}, hence we also recover ‖g^i_t‖_{x^i_t} = d^i(x^i_{t+1}, x^i_t)/α, which indeed suggests replacing the scalar (g^i_t)² by ‖g^i_t‖²_{x^i_t} when transforming a coordinate-wise adaptive scheme into a manifold-wise adaptive one.
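The per-manifold update of Eq. (4.8) can be sketched as follows, with the exponential map and the squared Riemannian norm of each factor M_i passed in as (assumed) callables; this is illustrative only, not the implementation used in the experiments.

```python
def radagrad_step(xs, rgrads, state, expmap, rnorm2, lr=0.01, eps=1e-10):
    """One Radagrad update (Eq. 4.8) on a product manifold.
    xs, rgrads: lists of per-manifold points x^i_t and Riemannian gradients g^i_t;
    state: list of running scalars accumulating sum_k ||g^i_k||^2_{x^i_k};
    expmap(x, v), rnorm2(x, v): exponential map and squared Riemannian norm of M_i."""
    new_xs = []
    for i, (x, g) in enumerate(zip(xs, rgrads)):
        state[i] = state[i] + float(rnorm2(x, g))     # accumulate squared Riemannian norms
        step = -lr * g / (state[i] + eps) ** 0.5      # one learning rate per manifold
        new_xs.append(expmap(x, step))
    return new_xs
```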

4.3 algorithms & convergence guarantees

In section 4.1, we briefly presented Adagrad, Adam and Amsgrad. Intuitively, Adam can be described as a combination of Adagrad with a momentum (of parameter β1), with the slight modification that the sum of the past squared-gradients is replaced with an exponential moving average, for an exponent β2. Let’s also recall that Amsgrad implements a slight modification of Adam, allowing to correct its convergence proof. Finally, AdamNc is simply Adam, but with a particular non-constant schedule for β1 and β2. On the other hand, what is interesting to note is that the schedule initially proposed by [RKK18] for β2 in AdamNc, namely β2t := 1 − 1/t, lets vt recover the sum of squared-gradients of Adagrad. Hence, AdamNc without momentum (i.e. β1t = 0) yields Adagrad.

assumptions and notations. For 1 ≤ i ≤ n, we assume (M_i, ρ^i) is a geodesically complete Riemannian manifold with sectional curvature lower bounded by κ_i ≤ 0. As written in Eq. (4.7), let (M, ρ) be the product manifold of the (M_i, ρ^i)'s. For each i, let X_i ⊂ M_i be a compact, geodesically convex set and define X := X_1 × · · · × X_n, the set of feasible parameters. Define Π_{X_i} : M_i → X_i to be the projection operator, i.e. Π_{X_i}(x) is the unique y ∈ X_i minimizing d^i(y, x). Denote by P^i, exp^i and log^i the parallel transport, exponential and log maps in (M_i, ρ^i), respectively. For f : M → R, if g = grad f(x) for x ∈ M, denote by x^i ∈ M_i and by g^i ∈ T_{x^i}M_i the corresponding components of x and g. In the sequel, let (f_t) be a family of differentiable, geodesically convex functions from M to R. Assume that each


X_i ⊂ M_i has a diameter bounded by D_∞ and that for all 1 ≤ i ≤ n, t ∈ [T] and x ∈ X, ‖(grad f_t(x))^i‖_{x^i} ≤ G_∞. Our convergence guarantees will bound the regret, defined at the end of T rounds as R_T = ∑_{t=1}^{T} f_t(x_t) − min_{x∈X} ∑_{j=1}^{T} f_j(x), with the goal that R_T = o(T). Finally, ϕ^i_{x^i→y^i} denotes any isometry from T_{x^i}M_i to T_{y^i}M_i, for x^i, y^i ∈ M_i. Following the discussion in section 4.2.2 and especially Eq. (4.8), we present Riemannian Amsgrad in Figure 4.1a. For comparison, we show next to it the standard Amsgrad algorithm in Figure 4.1b.

(a) Ramsgrad in M_1 × · · · × M_n:
    Require: x_1 ∈ X, {α_t}_{t=1}^T, {β_{1t}}_{t=1}^T, β_2
    Set m_0 = 0, τ_0 = 0, v_0 = 0 and v̂_0 = 0
    for t = 1 to T do (for all 1 ≤ i ≤ n)
        g_t = grad f_t(x_t)
        m^i_t = β_{1t} τ^i_{t−1} + (1 − β_{1t}) g^i_t
        v^i_t = β_2 v^i_{t−1} + (1 − β_2) ‖g^i_t‖²_{x^i_t}
        v̂^i_t = max{v̂^i_{t−1}, v^i_t}
        x^i_{t+1} = Π_{X_i}( exp_{x^i_t}( −α_t m^i_t / √(v̂^i_t) ) )
        τ^i_t = ϕ^i_{x^i_t → x^i_{t+1}}(m^i_t)
    end for

(b) Amsgrad in R^n:
    Require: x_1 ∈ X, {α_t}_{t=1}^T, {β_{1t}}_{t=1}^T, β_2
    Set m_0 = 0, v_0 = 0 and v̂_0 = 0
    for t = 1 to T do (for all 1 ≤ i ≤ n)
        g^i_t = grad f_t(x_t)^i
        m^i_t = β_{1t} m^i_{t−1} + (1 − β_{1t}) g^i_t
        v^i_t = β_2 v^i_{t−1} + (1 − β_2) (g^i_t)²
        v̂^i_t = max{v̂^i_{t−1}, v^i_t}
        x^i_{t+1} = Π_{X_i}( x^i_t − α_t m^i_t / √(v̂^i_t) )
    end for

Figure 4.1: Comparison of the Riemannian and Euclidean versions of Amsgrad.

Write h^i_t := −α_t m^i_t / √(v̂^i_t). As a natural choice for ϕ^i, one could first parallel-transport⁶ m^i_t from x^i_t to x̃^i_{t+1} := exp_{x^i_t}(h^i_t) using P^i(· ; h^i_t), and then from x̃^i_{t+1} to x^i_{t+1} along a minimizing geodesic. As can be seen, if (M_i, ρ^i) = R for all i, Ramsgrad and Amsgrad coincide: we then have κ_i = 0, d^i(x^i, y^i) = |x^i − y^i|, ϕ^i = Id, exp_{x^i}(v^i) = x^i + v^i, M_1 × · · · × M_n = R^n, and ‖g^i_t‖²_{x^i_t} = (g^i_t)² ∈ R. From these algorithms, Radam and Adam are obtained simply by removing the max operations, i.e. replacing v̂^i_t = max{v̂^i_{t−1}, v^i_t} with v̂^i_t = v^i_t.

⁶ The idea of parallel-transporting m_t from T_{x_t}M to T_{x_{t+1}}M previously appeared in [CL17].
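The following sketch mirrors the Ramsgrad loop of Figure 4.1a on a product of n manifolds, with the per-manifold exponential map, an isometry playing the role of ϕ^i, the squared Riemannian norm and the projection onto X_i supplied as (assumed) callables; it is a sketch for illustration, not the experimental implementation.

```python
import torch

def ramsgrad(x, grad_fn, expmap, transp, rnorm2, proj, T=1000,
             alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-10):
    """x: list of per-manifold points; grad_fn(t, x) -> list of Riemannian gradients;
    expmap(x_i, v), transp(x_i, y_i, v), rnorm2(x_i, v), proj(x_i) are the per-manifold
    exponential map, isometry phi^i_{x -> y}, squared Riemannian norm and projection."""
    n = len(x)
    tau = [torch.zeros_like(xi) for xi in x]       # transported momentum tau^i_{t-1}
    v = [0.0] * n
    v_hat = [0.0] * n
    for t in range(1, T + 1):
        g = grad_fn(t, x)
        alpha_t = alpha / t ** 0.5
        new_x = []
        for i in range(n):
            m_i = beta1 * tau[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * float(rnorm2(x[i], g[i]))
            v_hat[i] = max(v_hat[i], v[i])
            step = -alpha_t * m_i / (v_hat[i] + eps) ** 0.5
            x_next = proj(expmap(x[i], step))
            tau[i] = transp(x[i], x_next, m_i)     # phi^i_{x_t -> x_{t+1}}(m^i_t)
            new_x.append(x_next)
        x = new_x
    return x
```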

4.3.1 Ramsgrad

The convergence guarantee that we obtain for Ramsgrad is presented in Theorem 4.1, where the quantity ζ is defined by [ZS16] as

$$\zeta(\kappa, c) := \frac{\sqrt{|\kappa|}\, c}{\tanh(\sqrt{|\kappa|}\, c)} = 1 + \frac{|\kappa|}{3} c^2 + O_{\kappa \to 0}(\kappa^2). \quad (4.9)$$

For comparison, we also show the convergence guarantee of the original Amsgrad in Section 4.7. Note that when (M_i, ρ^i) = R for all i, the convergence guarantees of Ramsgrad and Amsgrad coincide as well. Indeed, the curvature dependent quantity (ζ(κ_i, D_∞) + 1)/2 in the Riemannian case then becomes equal to 1, recovering the convergence theorem of Amsgrad. It is also interesting to understand at what speed the regret bound worsens when the curvature is small but non-zero: by a multiplicative factor of approximately 1 + D_∞²|κ|/6 (see Eq. (4.9)). Similar remarks hold for RadamNc, whose convergence guarantee is shown in Theorem 4.2. Finally, notice that β_1 := 0 in Theorem 4.2 yields a convergence proof for Radagrad, whose update rule we defined in Eq. (4.8).
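As a quick numerical sanity check of this curvature penalty, one can compare (ζ(κ, D_∞) + 1)/2 with its small-curvature approximation 1 + D_∞²|κ|/6; the values of D_∞ and κ below are arbitrary.

```python
import math

def zeta(kappa, c):
    """Curvature quantity of Eq. (4.9); zeta(0, c) is defined by continuity as 1."""
    x = math.sqrt(abs(kappa)) * c
    return x / math.tanh(x) if x > 0 else 1.0

D, kappa = 2.0, -0.05
print((zeta(kappa, D) + 1) / 2)        # exact factor, approx. 1.0332
print(1 + D ** 2 * abs(kappa) / 6)     # small-curvature approximation, approx. 1.0333
```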

Theorem 4.1 (Convergence of Ramsgrad). Let (x_t) and (v̂_t) be the sequences obtained from Algorithm 4.1a, α_t = α/√t, β_1 = β_{11}, β_{1t} ≤ β_1 for all t ∈ [T] and γ = β_1/√β_2 < 1. We then have:

$$R_T \le \frac{\sqrt{T} D_\infty^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{n} \sqrt{\hat v_T^i} + \frac{D_\infty^2}{2(1-\beta_1)} \sum_{i=1}^{n} \sum_{t=1}^{T} \frac{\beta_{1t}\sqrt{\hat v_t^i}}{\alpha_t} + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^2(1-\gamma)\sqrt{1-\beta_2}} \sum_{i=1}^{n} \frac{\zeta(\kappa_i, D_\infty)+1}{2} \sqrt{\sum_{t=1}^{T} \|g_t^i\|^2_{x_t^i}}. \quad (4.10)$$

q i i i i Proof. Denote by x˜t+1 := exp i (−αtmt/ vˆt) and consider the geodesic xt i i i i i i triangle defined by x˜t+1, xt and x∗. Now let a = d (x˜t+1, x∗), b =


d^i(x̃^i_{t+1}, x^i_t), c = d^i(x^i_t, x^i_*) and A = ∠ x̃^i_{t+1} x^i_t x^i_*. Combining the following formula⁷:

$$d^i(x^i_t, \tilde x^i_{t+1})\, d^i(x^i_t, x^i_*) \cos(\angle \tilde x^i_{t+1} x^i_t x^i_*) = \left\langle -\alpha_t m^i_t / \sqrt{\hat v^i_t},\; \log_{x^i_t}(x^i_*)\right\rangle_{x^i_t}, \quad (4.11)$$

with the following inequality (given by lemma 4.3):

$$a^2 \le \zeta(\kappa, c)\, b^2 + c^2 - 2bc\cos(A), \qquad \text{with } \zeta(\kappa, c) := \frac{\sqrt{|\kappa|}\,c}{\tanh(\sqrt{|\kappa|}\,c)}, \quad (4.12)$$

yields

$$\langle -m^i_t, \log_{x^i_t}(x^i_*)\rangle_{x^i_t} \le \frac{\sqrt{\hat v^i_t}}{2\alpha_t}\left(d^i(x^i_t, x^i_*)^2 - d^i(\tilde x^i_{t+1}, x^i_*)^2\right) + \zeta(\kappa_i, d^i(x^i_t, x^i_*))\,\frac{\alpha_t}{2\sqrt{\hat v^i_t}}\, \|m^i_t\|^2_{x^i_t}, \quad (4.13)$$

where we use the notation ⟨·,·⟩_{x^i} for ρ_{x^i}(·,·) when it is clear which metric is used. By definition of Π_{X_i}, we can safely replace x̃^i_{t+1} by x^i_{t+1} in the above inequality. Plugging m^i_t = β_{1t} ϕ^i_{x^i_{t−1}→x^i_t}(m^i_{t−1}) + (1 − β_{1t}) g^i_t into Eq. (4.13) gives us

$$\langle -g^i_t, \log_{x^i_t}(x^i_*)\rangle_{x^i_t} \le \frac{\sqrt{\hat v^i_t}}{2\alpha_t(1-\beta_{1t})}\left(d^i(x^i_t, x^i_*)^2 - d^i(x^i_{t+1}, x^i_*)^2\right) + \zeta(\kappa_i, d^i(x^i_t, x^i_*))\,\frac{\alpha_t}{2(1-\beta_{1t})\sqrt{\hat v^i_t}}\,\|m^i_t\|^2_{x^i_t}$$
$$\qquad + \frac{\beta_{1t}}{1-\beta_{1t}}\,\left\langle \phi^i_{x^i_{t-1}\to x^i_t}(m^i_{t-1}),\; \log_{x^i_t}(x^i_*)\right\rangle_{x^i_t}. \quad (4.14)$$

7 Note that since each Xi is geodesically convex, logarithms are well-defined.


Now applying the Cauchy-Schwarz and Young inequalities to the last term yields

$$\langle -g^i_t, \log_{x^i_t}(x^i_*)\rangle_{x^i_t} \le \frac{\sqrt{\hat v^i_t}}{2\alpha_t(1-\beta_{1t})}\left(d^i(x^i_t,x^i_*)^2 - d^i(x^i_{t+1},x^i_*)^2\right) + \zeta(\kappa_i, d^i(x^i_t,x^i_*))\,\frac{\alpha_t}{2(1-\beta_{1t})\sqrt{\hat v^i_t}}\,\|m^i_t\|^2_{x^i_t}$$
$$\qquad + \frac{\beta_{1t}\,\alpha_t}{2(1-\beta_{1t})\sqrt{\hat v^i_t}}\,\|m^i_{t-1}\|^2_{x^i_{t-1}} + \frac{\beta_{1t}\sqrt{\hat v^i_t}}{2(1-\beta_{1t})\,\alpha_t}\,\|\log_{x^i_t}(x^i_*)\|^2_{x^i_t}. \quad (4.15)$$

From the geodesic convexity of ft for 1 ≤ t ≤ T, we have

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x_*) \le \sum_{t=1}^{T} \langle -g_t, \log_{x_t}(x_*)\rangle_{x_t} = \sum_{i=1}^{n}\sum_{t=1}^{T} \langle -g^i_t, \log_{x^i_t}(x^i_*)\rangle_{x^i_t}. \quad (4.16)$$

Let us look at the first term. Using β_{1t} ≤ β_1 and with a change of indices, we have

$$\sum_{i=1}^{n}\sum_{t=1}^{T} \frac{\sqrt{\hat v^i_t}}{2\alpha_t(1-\beta_{1t})}\left(d^i(x^i_t,x^i_*)^2 - d^i(x^i_{t+1},x^i_*)^2\right) \quad (4.17)$$
$$\le \frac{1}{2(1-\beta_1)}\left[\sum_{i=1}^{n}\sum_{t=2}^{T}\left(\frac{\sqrt{\hat v^i_t}}{\alpha_t} - \frac{\sqrt{\hat v^i_{t-1}}}{\alpha_{t-1}}\right) d^i(x^i_t,x^i_*)^2 + \sum_{i=1}^{n}\frac{\sqrt{\hat v^i_1}}{\alpha_1}\, d^i(x^i_1,x^i_*)^2\right] \quad (4.18)$$
$$\le \frac{1}{2(1-\beta_1)}\left[\sum_{i=1}^{n}\sum_{t=2}^{T}\left(\frac{\sqrt{\hat v^i_t}}{\alpha_t} - \frac{\sqrt{\hat v^i_{t-1}}}{\alpha_{t-1}}\right) D_\infty^2 + \sum_{i=1}^{n}\frac{\sqrt{\hat v^i_1}}{\alpha_1}\, D_\infty^2\right] \quad (4.19)$$
$$= \frac{D_\infty^2}{2\alpha_T(1-\beta_1)}\sum_{i=1}^{n}\sqrt{\hat v^i_T}, \quad (4.20)$$

where the last equality comes from a standard telescopic summation. We now need the following lemma.

Lemma 4.1.
$$\sum_{t=1}^{T}\frac{\alpha_t}{\sqrt{\hat v^i_t}}\,\|m^i_t\|^2_{x^i_t} \le \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)(1-\gamma)\sqrt{1-\beta_2}}\sqrt{\sum_{t=1}^{T}\|g^i_t\|^2_{x^i_t}}. \quad (4.21)$$


Proof. Let’s start by separating the last term, and removing the hat on v.

$$\sum_{t=1}^{T}\frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} \le \sum_{t=1}^{T-1}\frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} + \frac{\alpha_T}{\sqrt{\hat v^i_T}}\|m^i_T\|^2_{x^i_T} \quad (4.22)$$
$$\le \sum_{t=1}^{T-1}\frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} + \frac{\alpha_T}{\sqrt{v^i_T}}\|m^i_T\|^2_{x^i_T}. \quad (4.23)$$

i Let’s now have a closer look at the last term. We can reformulate mT as:

$$m^i_T = \sum_{j=1}^{T}\left((1-\beta_{1j})\prod_{k=1}^{T-j}\beta_{1,(T-k+1)}\right)\phi^i_{x^i_{T-1}\to x^i_T}\circ\cdots\circ\phi^i_{x^i_j\to x^i_{j+1}}(g^i_j). \quad (4.24)$$

Applying lemma 4.4, we get

$$\|m^i_T\|^2_{x^i_T} \le \left(\sum_{j=1}^{T}(1-\beta_{1j})\prod_{k=1}^{T-j}\beta_{1,(T-k+1)}\right)\left(\sum_{j=1}^{T}(1-\beta_{1j})\prod_{k=1}^{T-j}\beta_{1,(T-k+1)}\;\|\phi^i_{x^i_{T-1}\to x^i_T}\circ\cdots\circ\phi^i_{x^i_j\to x^i_{j+1}}(g^i_j)\|^2_{x^i_T}\right). \quad (4.25)$$

Since ϕ^i is an isometry, we always have ‖ϕ^i_{x→y}(u)‖_y = ‖u‖_x, i.e.

$$\|\phi^i_{x^i_{T-1}\to x^i_T}\circ\cdots\circ\phi^i_{x^i_j\to x^i_{j+1}}(g^i_j)\|^2_{x^i_T} = \|g^i_j\|^2_{x^i_j}. \quad (4.26)$$

Using that β1k ≤ β1 for all k ∈ [T],

$$\|m^i_T\|^2_{x^i_T} \le \left(\sum_{j=1}^{T}(1-\beta_{1j})\beta_1^{T-j}\right)\left(\sum_{j=1}^{T}(1-\beta_{1j})\beta_1^{T-j}\|g^i_j\|^2_{x^i_j}\right). \quad (4.27)$$

Finally, (1 − β_{1j}) ≤ 1 and ∑_{j=1}^{T} β_1^{T−j} ≤ 1/(1 − β_1) yield

$$\|m^i_T\|^2_{x^i_T} \le \frac{1}{1-\beta_1}\sum_{j=1}^{T}\beta_1^{T-j}\|g^i_j\|^2_{x^i_j}. \quad (4.28)$$

i Let’s now look at vT. It is given by

$$v^i_T = (1-\beta_2)\sum_{j=1}^{T}\beta_2^{T-j}\|g^i_j\|^2_{x^i_j}. \quad (4.29)$$


Combining Eq. (4.28) and Eq. (4.29) allows us to bound the last term of Eq. (4.22):

$$\frac{\alpha_T}{\sqrt{v^i_T}}\|m^i_T\|^2_{x^i_T} \le \frac{\alpha}{(1-\beta_1)\sqrt{T}}\;\frac{\sum_{j=1}^{T}\beta_1^{T-j}\|g^i_j\|^2_{x^i_j}}{\sqrt{(1-\beta_2)\sum_{j=1}^{T}\beta_2^{T-j}\|g^i_j\|^2_{x^i_j}}} \quad (4.30)$$
$$\le \frac{\alpha}{(1-\beta_1)\sqrt{T}}\sum_{j=1}^{T}\frac{\beta_1^{T-j}\|g^i_j\|^2_{x^i_j}}{\sqrt{(1-\beta_2)\beta_2^{T-j}\|g^i_j\|^2_{x^i_j}}} \quad (4.31)$$
$$= \frac{\alpha}{(1-\beta_1)\sqrt{T(1-\beta_2)}}\sum_{j=1}^{T}\gamma^{T-j}\|g^i_j\|_{x^i_j}. \quad (4.32)$$

With this inequality, we can now bound every term of Eq. (4.22):

$$\sum_{t=1}^{T}\frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} \le \sum_{t=1}^{T}\frac{\alpha}{(1-\beta_1)\sqrt{t(1-\beta_2)}}\sum_{j=1}^{t}\gamma^{t-j}\|g^i_j\|_{x^i_j} \quad (4.33)$$
$$= \frac{\alpha}{(1-\beta_1)\sqrt{1-\beta_2}}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\sum_{j=1}^{t}\gamma^{t-j}\|g^i_j\|_{x^i_j} \quad (4.34)$$
$$= \frac{\alpha}{(1-\beta_1)\sqrt{1-\beta_2}}\sum_{t=1}^{T}\|g^i_t\|_{x^i_t}\sum_{j=t}^{T}\gamma^{j-t}/\sqrt{j} \quad (4.35)$$
$$\le \frac{\alpha}{(1-\beta_1)\sqrt{1-\beta_2}}\sum_{t=1}^{T}\|g^i_t\|_{x^i_t}\sum_{j=t}^{T}\gamma^{j-t}/\sqrt{t} \quad (4.36)$$
$$\le \frac{\alpha}{(1-\beta_1)\sqrt{1-\beta_2}}\sum_{t=1}^{T}\|g^i_t\|_{x^i_t}\,\frac{1}{(1-\gamma)\sqrt{t}} \quad (4.37)$$
$$\le \frac{\alpha}{(1-\beta_1)(1-\gamma)\sqrt{1-\beta_2}}\sqrt{\sum_{t=1}^{T}\|g^i_t\|^2_{x^i_t}}\sqrt{\sum_{t=1}^{T}\frac{1}{t}} \quad (4.38)$$
$$\le \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)(1-\gamma)\sqrt{1-\beta_2}}\sqrt{\sum_{t=1}^{T}\|g^i_t\|^2_{x^i_t}}. \quad (4.39)$$


Putting together Eqs. (4.15), (4.16), (4.20) and lemma 4.1 lets us bound the regret:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x_*) \le \sum_{i=1}^{n}\sum_{t=1}^{T}\langle -g^i_t, \log_{x^i_t}(x^i_*)\rangle_{x^i_t} \quad (4.40)$$
$$\le \frac{\sqrt{T}D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{n}\sqrt{\hat v^i_T} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^{n}\sum_{t=1}^{T}\frac{\beta_{1t}\sqrt{\hat v^i_t}}{\alpha_t} \quad (4.41)$$
$$\qquad + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^2(1-\gamma)\sqrt{1-\beta_2}}\sum_{i=1}^{n}\frac{\zeta(\kappa_i, D_\infty)+1}{2}\sqrt{\sum_{t=1}^{T}\|g^i_t\|^2_{x^i_t}}, \quad (4.42)$$

where we used the facts that d ↦ ζ(κ, d) is an increasing function, and that α_t/√(v̂^i_t) ≤ α_{t−1}/√(v̂^i_{t−1}), which enables us to bound both the second and third terms of the right-hand side of Eq. (4.15) using lemma 4.1.

remark. Let us notice that, similarly to Amsgrad, Ramsgrad also has a regret bounded by O(G_∞√T). This is easy to see from the proof of lemma 4.2. Hence the actual upper bound on the regret is the minimum between the one in O(G_∞√T) and the one of Theorem 4.1.

4.3.2 RadamNc

Theorem 4.2 (Convergence of RadamNc). Let (x_t) and (v_t) be the sequences obtained from RadamNc, α_t = α/√t, β_1 = β_{11}, β_{1t} = β_1 λ^{t−1}, λ < 1, β_{2t} = 1 − 1/t. We then have:

$$R_T \le \sum_{i=1}^{n}\left(\frac{D_\infty^2}{2\alpha(1-\beta_1)} + \frac{\alpha(\zeta(\kappa_i, D_\infty)+1)}{(1-\beta_1)^3}\right)\sqrt{\sum_{t=1}^{T}\|g^i_t\|^2_{x^i_t}} + \frac{\beta_1 D_\infty^2 G_\infty n}{2\alpha(1-\beta_1)(1-\lambda)^2}. \quad (4.43)$$


Proof. Similarly as for the proof of Theorem 4.1 (and with the same notations), we obtain the inequality:

$$\langle -g^i_t, \log_{x^i_t}(x^i_*)\rangle_{x^i_t} \le \frac{\sqrt{v^i_t}}{2\alpha_t(1-\beta_{1t})}\left(d^i(x^i_t,x^i_*)^2 - d^i(x^i_{t+1},x^i_*)^2\right) + \zeta(\kappa_i, d^i(x^i_t,x^i_*))\,\frac{\alpha_t}{2(1-\beta_{1t})\sqrt{v^i_t}}\,\|m^i_t\|^2_{x^i_t}$$
$$\qquad + \frac{\beta_{1t}\,\alpha_t}{2(1-\beta_{1t})\sqrt{v^i_t}}\,\|m^i_{t-1}\|^2_{x^i_{t-1}} + \frac{\beta_{1t}\sqrt{v^i_t}}{2(1-\beta_{1t})\,\alpha_t}\,\|\log_{x^i_t}(x^i_*)\|^2_{x^i_t}. \quad (4.44)$$

From the geodesic convexity of ft for 1 ≤ t ≤ T, we have

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x_*) \le \sum_{t=1}^{T}\langle -g_t, \log_{x_t}(x_*)\rangle_{x_t} = \sum_{i=1}^{n}\sum_{t=1}^{T}\langle -g^i_t, \log_{x^i_t}(x^i_*)\rangle_{x^i_t}. \quad (4.45)$$

With the same techniques as before, we obtain the same bound on the first term:

$$\sum_{i=1}^{n}\sum_{t=1}^{T}\frac{\sqrt{v^i_t}}{2\alpha_t(1-\beta_{1t})}\left(d^i(x^i_t,x^i_*)^2 - d^i(x^i_{t+1},x^i_*)^2\right) \le \frac{D_\infty^2}{2\alpha_T(1-\beta_1)}\sum_{i=1}^{n}\sqrt{v^i_T}. \quad (4.46)$$

However, for the other terms, we now need a new lemma.

Lemma 4.2.
$$\sum_{t=1}^{T}\frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} \le \frac{2\alpha}{(1-\beta_1)^2}\sqrt{\sum_{t=1}^{T}\|g^i_t\|^2_{x^i_t}}. \quad (4.47)$$

Proof. Let’s start by separating the last term.

$$\sum_{t=1}^{T}\frac{\alpha_t}{\sqrt{v^i_t}}\|m^i_t\|^2_{x^i_t} \le \sum_{t=1}^{T-1}\frac{\alpha_t}{\sqrt{v^i_t}}\|m^i_t\|^2_{x^i_t} + \frac{\alpha_T}{\sqrt{v^i_T}}\|m^i_T\|^2_{x^i_T}. \quad (4.48)$$

Similarly as before, we have

$$\|m^i_T\|^2_{x^i_T} \le \frac{1}{1-\beta_1}\sum_{j=1}^{T}\beta_1^{T-j}\|g^i_j\|^2_{x^i_j}. \quad (4.49)$$


i Let’s now look at vT. Since β2t = 1 − 1/t, it is simply given by

$$v^i_T = \frac{1}{T}\sum_{t=1}^{T}\|g^i_t\|^2_{x^i_t}. \quad (4.50)$$

Combining these yields:

$$\frac{\alpha_T}{\sqrt{v^i_T}}\|m^i_T\|^2_{x^i_T} \le \frac{\alpha}{1-\beta_1}\;\frac{\sum_{j=1}^{T}\beta_1^{T-j}\|g^i_j\|^2_{x^i_j}}{\sqrt{\sum_{t=1}^{T}\|g^i_t\|^2_{x^i_t}}} \le \frac{\alpha}{1-\beta_1}\sum_{j=1}^{T}\frac{\beta_1^{T-j}\|g^i_j\|^2_{x^i_j}}{\sqrt{\sum_{k=1}^{j}\|g^i_k\|^2_{x^i_k}}}. \quad (4.51)$$

Using this inequality at all time-steps gives

$$\sum_{t=1}^{T}\frac{\alpha_t}{\sqrt{v^i_t}}\|m^i_t\|^2_{x^i_t} \le \frac{\alpha}{1-\beta_1}\sum_{j=1}^{T}\frac{\sum_{l=0}^{T-j}\beta_1^{l}\,\|g^i_j\|^2_{x^i_j}}{\sqrt{\sum_{k=1}^{j}\|g^i_k\|^2_{x^i_k}}} \quad (4.52)$$
$$\le \frac{\alpha}{(1-\beta_1)^2}\sum_{j=1}^{T}\frac{\|g^i_j\|^2_{x^i_j}}{\sqrt{\sum_{k=1}^{j}\|g^i_k\|^2_{x^i_k}}} \quad (4.53)$$
$$\le \frac{2\alpha}{(1-\beta_1)^2}\sqrt{\sum_{j=1}^{T}\|g^i_j\|^2_{x^i_j}}, \quad (4.54)$$

where the last inequality comes from lemma 4.5.

Putting everything together, we finally obtain

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x_*) \le \sum_{i=1}^{n}\sum_{t=1}^{T}\langle -g^i_t, \log_{x^i_t}(x^i_*)\rangle_{x^i_t} \quad (4.55)$$
$$\le \frac{\sqrt{T}D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{n}\sqrt{v^i_T} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^{n}\sum_{t=1}^{T}\frac{\beta_{1t}\sqrt{v^i_t}}{\alpha_t} \quad (4.56)$$
$$\qquad + \frac{\alpha}{(1-\beta_1)^3}\sum_{i=1}^{n}(\zeta(\kappa_i, D_\infty)+1)\sqrt{\sum_{t=1}^{T}\|g^i_t\|^2_{x^i_t}}, \quad (4.57)$$

where we used that for this choice of α_t and β_{2t}, we have α_t/√(v^i_t) ≤ α_{t−1}/√(v^i_{t−1}). Finally,


$$\frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^{n}\sum_{t=1}^{T}\frac{\beta_{1t}\sqrt{v^i_t}}{\alpha_t} \le \frac{D_\infty^2 G_\infty n}{2\alpha(1-\beta_1)}\sum_{t=1}^{T}\sqrt{t}\,\beta_{1t} \le \frac{\beta_1 D_\infty^2 G_\infty n}{2\alpha(1-\beta_1)(1-\lambda)^2}. \quad (4.58)$$

This combined with Eq. (4.50) yields the final result.

remark. Notice the appearance of a factor n/α in the last term of the last equation. This term is missing in corollaries 1 and 2 of [RKK18], which is a mistake. However, this dependence on n is not too harmful here, since this term does not depend on T.

4.3.3 Intuition Behind the Proofs

the role of convexity. Note how the notion of convexity in Theorem 4.3 got replaced by the notion of geodesic convexity in Theorem 4.1. Let us compare the two definitions: the differentiable functions f : R^n → R and g : M → R are respectively convex and geodesically convex if for all x, y ∈ R^n, u, v ∈ M:

$$f(x) - f(y) \le \langle \mathrm{grad}\, f(x),\, x - y\rangle, \qquad g(u) - g(v) \le \rho_u\big(\mathrm{grad}\, g(u),\, -\log_u(v)\big). \quad (4.59)$$

But how does this come into play in the proofs? Regret bounds for convex objectives are usually obtained by bounding ∑_{t=1}^{T} f_t(x_t) − f_t(x_*) using Eq. (4.59) for any x_* ∈ X, which boils down to bounding each ⟨g_t, x_t − x_*⟩. In the Riemannian case, this term becomes ρ_{x_t}(g_t, −log_{x_t}(x_*)).

the role of the cosine law. How does one obtain a bound on ⟨g_t, x_t − x_*⟩? For simplicity, let us look at the particular case of an SGD update, from Eq. (4.1). Using a cosine law, this yields

$$\langle g_t, x_t - x_*\rangle = \frac{1}{2\alpha}\left(\|x_t - x_*\|^2 - \|x_{t+1} - x_*\|^2\right) + \frac{\alpha}{2}\|g_t\|^2. \quad (4.60)$$

One now has two terms to bound: (i) when summing over t, the first one simplifies as a telescopic summation; (ii) the second term ∑_{t=1}^{T} α_t‖g_t‖² will require a well chosen decreasing schedule for α. In Riemannian manifolds, this step is generalized using the analogous lemma 4.3 introduced by [ZS16], valid in all Alexandrov spaces, which

includes our setting of geodesically convex subsets of Riemannian manifolds with lower bounded sectional curvature. The curvature dependent quantity ζ of Eq. (4.10) appears from this lemma, letting us bound ρ_{x^i_t}(g^i_t, −log_{x^i_t}(x^i_*)).

the benefit of adaptivity. Let us also mention that the above bounds significantly improve for sparse (per-manifold) gradients. In practice, this could happen for instance for algorithms embedding each word i (or node of a graph) in a manifold M_i, when just a few words are updated at a time.

on the choice of ϕ^i. The fact that our convergence theorems (see lemma 4.1) do not require specifying ϕ^i suggests that the regret bounds could be improved by exploiting momentum/acceleration in the proofs for a particular ϕ^i. Note that this remark also applies to Amsgrad [RKK18].

4.4 empirical validation

We empirically assess the quality of the proposed algorithms − Radam, Ramsgrad and Radagrad − compared to the non-adaptive RSGD method (Eq. 4.2). For this, we follow [NK17a] and embed the transitive closure of the WordNet noun hierarchy [Mil+90] in the n-dimensional Poincaré model D^n of hyperbolic geometry, which is well-known to be better suited to embedding tree-like graphs than the Euclidean space [De +18b]; [Gro87]. In this case, each word is embedded in the same space of constant curvature −1, thus M_i = D^n, ∀i. Note that it would also be interesting to explore the benefit of our optimization tools for the algorithms proposed in [De +18b]; [GBH18b]; [NK18]. The choice of the Poincaré model is justified by the access to closed form expressions for all the quantities used in Alg. 4.1a:

• Metric tensor: ρ_x = λ_x² I_n, ∀x ∈ D^n, where λ_x = 2/(1 − ‖x‖²) is the conformal factor.

• Riemannian gradients are rescaled Euclidean gradients: grad f(x) = (1/λ_x²) ∇_E f(x).

• Distance function and geodesics, [GBH18c]; [NK17a]; [Ung08].


• Exponential and logarithmic maps: exp_x(v) = x ⊕ (tanh(λ_x‖v‖/2) · v/‖v‖), where ⊕ is the generalized Möbius addition [GBH18c]; [Ung08].

• Parallel transport along the unique geodesic from x to y: P_{x→y}(v) = (λ_x/λ_y) · gyr[y, −x]v. This formula was derived from [GBH18c]; [Ung08], gyr being given in closed form in [Ung08, Eq. (1.27)].

dataset & model. The transitive closure of the WordNet taxonomy graph consists of 82,115 nouns and 743,241 hypernymy Is-A relations (directed edges E). These words are embedded in D^n such that the distance between words connected by an edge is minimized, while being maximized otherwise. We minimize the same loss function as [NK17a], which is similar to a log-likelihood, but approximating the partition function using sampling of negative word pairs (non-edges), fixed to 10 in our case. Note that this loss does not use the direction of the edges in the graph⁸:

$$\mathcal{L}(\theta) = \sum_{(u,v)\in E} \log\frac{e^{-d_{\mathbb{D}}(u,v)}}{\sum_{u'\in \mathcal{N}(v)} e^{-d_{\mathbb{D}}(u',v)}}. \quad (4.61)$$

metrics. We report both the loss value and the mean average precision (MAP) [NK17a]: for each directed edge (u, v), we rank its distance d(u, v) among the full set of ground truth negative examples {d(u', v) | (u', v) ∉ E}. We use the same two settings as [NK17a], namely: reconstruction (measuring representation capacity) and link prediction (measuring generalization). For link prediction we sample a validation set of 2% of the edges from the set of transitive closure edges that contain no leaf node or root. We only focused on 5-dimensional hyperbolic spaces.

training details. For all methods we use the same "burn-in phase" described in [NK17a] for 20 epochs, with a fixed learning rate of 0.03 and using RSGD with retraction as explained in Sec. 4.1.2. Solely during this phase, we sampled negative words based on their graph degree raised to the power 0.75. This strategy improves all metrics. After that, when the different optimization methods start, we sample negatives uniformly. We use n = 5, following [NK17a].

⁸ In a pair (u, v), u denotes the parent, i.e. u entails v − the parameters θ are the coordinates of all u, v.
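A minimal PyTorch sketch of the per-edge term of Eq. (4.61) (written as a quantity to be minimized), using the closed-form Poincaré distance; the sampling of N(v) and the helper names are our own conventions.

```python
import torch

def poincare_dist(u, v, eps=1e-5):
    # Distance in the Poincare ball D^n of curvature -1.
    sq = ((u - v) ** 2).sum(-1)
    den = (1 - (u * u).sum(-1)).clamp_min(eps) * (1 - (v * v).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / den)

def edge_loss(u, v, negatives):
    """Negative log-softmax term for one edge (u, v); `negatives` has shape (k, n)
    and stands for the sampled negatives u' with (u', v) not in E."""
    pos = -poincare_dist(u, v)
    neg = -poincare_dist(negatives, v.unsqueeze(0).expand_as(negatives))
    scores = torch.cat([pos.view(1), neg])      # u itself is included in N(v)
    return -(pos - torch.logsumexp(scores, dim=0))
```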

optimization methods. Experimentally we obtained slightly better results for Radam over Ramsgrad, so we will mostly report the former. Moreover, we unexpectedly observed convergence to lower loss values when replacing the true exponential map with its first order approximation − i.e. the retraction R_x(v) = x + v − in both RSGD and in our adaptive methods from Alg. 4.1a. One possible explanation is that retraction methods need fewer steps and smaller gradients to "escape" points sub-optimally collapsed onto the ball border of D^n, compared to fully Riemannian methods. As a consequence, we report "retraction"-based methods in a separate setting, as they are not directly comparable to their fully Riemannian analogues.

results. We show in Figures 4.2 and 4.3 results for "exponential"-based and "retraction"-based methods. We ran all our methods with different learning rates from the set {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0}. For the RSGD baseline we show in orange the best learning rate setting, but we also show the previous lower (slower convergence, in blue) and the next higher (faster overfitting, in green) learning rates. For Radam and Ramsgrad we only show the best settings. We always use β₁ = 0.9 and β₂ = 0.999 for these methods, as these achieved the lowest training loss. Radagrad was consistently worse, so we do not report it. As can be seen, Radam always achieves the lowest training loss. On the MAP metric for both reconstruction and link prediction settings, the same method also outperforms all the other methods for the full Riemannian setting (i.e. Tab. 4.2). Interestingly, in the "retraction" setting, Radam reaches the lowest training loss value and is on par with RSGD on the MAP evaluation for both reconstruction and link prediction settings. However, Ramsgrad is faster to converge in terms of MAP for the link prediction task, suggesting that this method has a better generalization capability.

4.5 additional related work

After RSGD was introduced by [Bon13], a plethora of other first order Riemannian methods arose, such as Riemannian SVRG [ZRS16], Riemannian Stein variational gradient descent [LZ17], Riemannian accelerated gradient descent [Liu+17]; [ZS18b] and averaged RSGD [Tri+18], along with new methods for their convergence analysis in the geodesically convex case [ZS16]. Stochastic gradient Langevin dynamics was generalized as well, to improve optimization on the probability simplex [PT13].


Figure 4.2: Results for methods doing updates with the exponential map. From left to right we report: training loss, MAP on the train set, MAP on the validation set.


Figure 4.3: Results for methods doing updates with the retraction. From left to right we report: training loss, MAP on the train set, MAP on the validation set.


Let us also mention that [KMH18] proposed Riemannian counterparts of SGD with momentum and RMSprop, suggesting to transport the momentum term using parallel translation, which is an idea that we preserved. However, (i) no convergence guarantee is provided and (ii) their algorithm performs the coordinate-wise adaptive operations (squaring and division) w.r.t. a coordinate system in the tangent space, which, as we discussed in section 4.2.1, compromises the possibility of obtaining convergence guarantees. Finally, another version of Riemannian Adam for the Grassmann manifold G(1, n) was previously introduced by [CL17], also transporting the momentum term using parallel translation. However, their algorithm completely removes the adaptive component, since the adaptivity term v_t becomes a scalar. No adaptivity across manifolds is discussed, which is the main point of our discussion. Moreover, no convergence analysis is provided either.

4.6 conclusive summary

Driven by recent work in learning non-Euclidean embeddings for symbolic data, we propose to generalize popular adaptive optimization tools (e.g. Adam, Amsgrad, Adagrad) to Cartesian products of Riemannian manifolds in a principled and intrinsic manner. We derive convergence rates that are similar to those of the corresponding Euclidean models. Experimentally we show that our methods outperform popular non-adaptive methods such as RSGD on the realistic task of hyperbolic word taxonomy embedding.

4.7 reminders of known results

Theorem 4.3 (Convergence of Amsgrad). Let (f_t) be a family of differentiable, convex functions from R^n to R. Let (x_t) and (v_t) be the sequences obtained from Algorithm 4.1b, α_t = α/√t, β_1 = β_{11}, β_{1t} ≤ β_1 for all t ∈ [T] and γ = β_1/√β_2 < 1. Assume that each X_i ⊂ R has a diameter bounded by D_∞ and that for all 1 ≤ i ≤ n, t ∈ [T] and x ∈ X, ‖grad f_t(x)‖_∞ ≤ G_∞.


For (x_t) generated using Amsgrad (Algorithm 4.1b), we have the following bound on the regret:

$$R_T \le \frac{\sqrt{T}D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{n}\sqrt{\hat v^i_T} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^{n}\sum_{t=1}^{T}\frac{\beta_{1t}\sqrt{\hat v^i_t}}{\alpha_t} + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^2(1-\gamma)\sqrt{1-\beta_2}}\sum_{i=1}^{n}\sqrt{\sum_{t=1}^{T}(g^i_t)^2}. \quad (4.62)$$

Proof. See Theorem 4 of [RKK18].

useful lemmas. The following lemma is a user-friendly inequality developed by [ZS16] in order to prove convergence of gradient-based optimization algorithms, for geodesically convex functions, in Alexandrov spaces.

Lemma 4.3 (Cosine inequality in Alexandrov spaces). If a, b, c are the sides (i.e. side lengths) of a geodesic triangle in an Alexandrov space with curvature lower bounded by κ, and A is the angle between sides b and c, then

$$a^2 \le \frac{\sqrt{|\kappa|}\,c}{\tanh(\sqrt{|\kappa|}\,c)}\, b^2 + c^2 - 2bc\cos(A). \quad (4.63)$$

Proof. See section 3.1, lemma 6 of [ZS16].

Lemma 4.4 (An analogue of Cauchy-Schwarz). For all p, k ∈ N*, u_1, ..., u_k ∈ R^p, a_1, ..., a_k ∈ R_+, we have

$$\Big\|\sum_i a_i u_i\Big\|_2^2 \le \Big(\sum_i a_i\Big)\Big(\sum_i a_i \|u_i\|_2^2\Big). \quad (4.64)$$


Proof. The proof consists in applying the Cauchy-Schwarz inequality twice:

$$\Big\|\sum_i a_i u_i\Big\|_2^2 = \sum_{i,j} a_i a_j\, u_i^T u_j \quad (4.65)$$
$$= \sum_{i,j} \sqrt{a_i a_j}\,(\sqrt{a_i}\,u_i)^T(\sqrt{a_j}\,u_j) \quad (4.66)$$
$$\le \sum_{i,j} \sqrt{a_i a_j}\,\|\sqrt{a_i}\,u_i\|_2\,\|\sqrt{a_j}\,u_j\|_2 \quad (4.67)$$
$$= \Big(\sum_i \sqrt{a_i}\,\|\sqrt{a_i}\,u_i\|_2\Big)^2 \quad (4.68)$$
$$\le \Big(\sum_i a_i\Big)\Big(\sum_i a_i\|u_i\|_2^2\Big). \quad (4.69)$$

Finally, this last lemma is used by [RKK18] in their convergence proof for AdamNc. We need it too, in an analogous lemma.

Lemma 4.5 ([ACG02]). For any non-negative real numbers y_1, ..., y_t, the following holds:

$$\sum_{i=1}^{t}\frac{y_i}{\sqrt{\sum_{j=1}^{i} y_j}} \le 2\sqrt{\sum_{i=1}^{t} y_i}. \quad (4.70)$$


5

NON-EUCLIDEAN GEOMETRY FOR GRAPH-LIKE DATA

In this chapter, we present two natural extensions of GCN leveraging non-Euclidean geometries: κ-GCN and OTGNN. More introductory context is provided in section 1.3.5. We start by introducing background and mathematical preliminaries, and then follow on by motivating and introducing our models.

5.1 background & mathematical preliminaries

5.1.1 The Geometry of Spaces of Constant Curvature

riemannian geometry. A manifold M of dimension d is a generalization to higher dimensions of the notion of surface, and is a space that locally looks like R^d. At each point x ∈ M, M can be associated with a tangent space T_xM, which is a vector space of dimension d that can be understood as a first order approximation of M around x. A Riemannian metric g is given by an inner-product g_x(·, ·) at each tangent space T_xM, with g_x varying smoothly with x. A given g defines the geometry of M, because it can be used to define the distance between x and y as the infimum of the lengths of smooth paths γ : [0, 1] → M from x to y, where the length is defined as ℓ(γ) := ∫_0^1 √(g_{γ(t)}(γ̇(t), γ̇(t))) dt. Under certain assumptions, a given g also defines a curvature at each point.

Figure 5.1: Geodesics in the Poincare´ disk (left) and the stereographic projection of the sphere (right).


Figure 5.2: Heatmap of the distance function d_κ(x, ·) in st²_κ for κ = −0.254 (left) and κ = 0.248 (right).

unifying all curvatures κ. There exist several models of respectively constant positive and negative curvature. For positive curvature, we choose the stereographic projection of the sphere, while for negative curvature we choose the Poincaré model, which is the stereographic projection of the Lorentz model. As explained below, this choice allows us to generalize the gyrovector space framework and unify spaces of both positive and negative curvature κ into a single model, which we call the κ-stereographic model.

the κ-stereographic model. For a curvature κ ∈ R and a dimension d ≥ 2, we study the model st^d_κ defined as st^d_κ = {x ∈ R^d | −κ‖x‖²_2 < 1}, equipped with its Riemannian metric g^κ_x = 4/(1 + κ‖x‖²)² I =: (λ^κ_x)² I. Note in particular that when κ ≥ 0, st^d_κ is R^d, while when κ < 0 it is the open ball of radius 1/√(−κ).

gyrovector spaces & riemannian geometry. As discussed earlier, the gyrovector space formalism is used to generalize vector spaces to the Poincaré model of hyperbolic geometry [Ung05]; [Ung08]. In addition, important quantities from Riemannian geometry can be rewritten in terms of the Möbius vector addition and scalar-vector multiplication [GBH18c]. We here extend gyrovector spaces to the κ-stereographic model, i.e. allowing positive curvature.

For κ > 0 and any point x ∈ st^d_κ, we will denote by x̃ the unique point of the sphere of radius κ^{−1/2} in R^{d+1} whose stereographic projection is x. It is given by:

$$\tilde x := \left(\lambda^\kappa_x\, x,\; \kappa^{-1/2}(\lambda^\kappa_x - 1)\right). \quad (5.1)$$

Let us recall why. Fix the south pole z = (0, −1/√κ) of the sphere of curvature κ > 0, i.e. of radius R := κ^{−1/2}. The stereographic projection is the map:

$$\Phi : S^n_R \longrightarrow \mathbb{R}^n, \qquad x \mapsto x' = \frac{1}{1+\sqrt{\kappa}\, x_{n+1}}\, x_{1:n}, \quad (5.2)$$

with the inverse given by

$$\Phi^{-1} : \mathbb{R}^n \longrightarrow S^n_R, \qquad x \mapsto x' = \left(\lambda^\kappa_x\, x,\; \frac{1}{\sqrt{\kappa}}(\lambda^\kappa_x - 1)\right), \quad (5.3)$$

where we define λ^κ_x = 2/(1 + κ‖x‖²). Again we take the image of the sphere S^n_R under the extended projection, Φ((0, ..., 0, −1/√κ)) = 0, leading to the stereographic model of the sphere. The metric tensor transforms as:

$$g^\kappa_{ij} = (\lambda^\kappa_x)^2\, \delta_{ij}. \quad (5.4)$$

For x, y ∈ st^d_κ, we now define the κ-addition in the κ-stereographic model by:

$$x \oplus_\kappa y = \frac{(1 - 2\kappa x^T y - \kappa\|y\|^2)\,x + (1 + \kappa\|x\|^2)\,y}{1 - 2\kappa x^T y + \kappa^2\|x\|^2\|y\|^2} \in \mathrm{st}^d_\kappa. \quad (5.5)$$

The κ-addition is defined in all cases except for spherical geometry and x = y/(κ‖y‖²), as stated by the following theorem.

Theorem 5.1 (Definiteness of κ-addition). We have 1 − 2κx^Ty + κ²‖x‖²‖y‖² = 0 if and only if κ > 0 and x = y/(κ‖y‖²).

Proof. Using the Cauchy-Schwarz inequality, we have A := 1 − 2κx^Ty + κ²‖x‖²‖y‖² ≥ 1 − 2|κ|‖x‖‖y‖ + κ²‖x‖²‖y‖² = (1 − |κ|‖x‖‖y‖)² ≥ 0. Since equality in the Cauchy-Schwarz inequality is only reached for colinear vectors, we have that A = 0 is equivalent to κ > 0 and x = y/(κ‖y‖²).

For s ∈ R and x ∈ st^d_κ (and |s tan_κ^{−1}‖x‖| < κ^{−1/2}π/2 if κ > 0), the κ-scaling in the κ-stereographic model is given by:

$$s \otimes_\kappa x = \tan_\kappa\!\left(s \cdot \tan_\kappa^{-1}\|x\|\right)\frac{x}{\|x\|} \in \mathrm{st}^d_\kappa, \quad (5.6)$$

where tan_κ equals κ^{−1/2} tan if κ > 0 and (−κ)^{−1/2} tanh if κ < 0. This formalism yields simple closed forms for various quantities, including

the distance function (see fig. 5.2) inherited from the Riemannian manifold (st^d_κ, g^κ), the exp and log maps, and geodesics (see fig. 5.1), as shown by the following theorem.

Theorem 5.2 (Extending gyrovector spaces to positive curvature). For x, y ∈ st^d_κ, x ≠ y, v ≠ 0 (and x ≠ −y/(κ‖y‖²) if κ > 0), the distance function is given byᵃ:

$$d_\kappa(x, y) = 2|\kappa|^{-1/2}\, \tan_\kappa^{-1}\|{-x} \oplus_\kappa y\|, \quad (5.7)$$

the unit-speed geodesic from x to y is unique and given by

γx→y(t) = x ⊕κ (t ⊗κ (−x ⊕κ y)) ,(5.8)

and finally the exponential and logarithmic maps are described as:

$$\exp^\kappa_x(v) = x \oplus_\kappa \left(\tan_\kappa\!\left(|\kappa|^{1/2}\,\frac{\lambda^\kappa_x \|v\|}{2}\right)\frac{v}{\|v\|}\right) \quad (5.9)$$
$$\log^\kappa_x(y) = \frac{2|\kappa|^{-1/2}}{\lambda^\kappa_x}\,\tan_\kappa^{-1}\|{-x}\oplus_\kappa y\|\;\frac{-x\oplus_\kappa y}{\|{-x}\oplus_\kappa y\|} \quad (5.10)$$

ᵃ We write −x ⊕ y for (−x) ⊕ y and not −(x ⊕ y).

Proof. Let us start by proving that for x ∈ R^n and v ∈ T_xR^n the exponential map is given by

$$\exp^\kappa_x(v) = \frac{\lambda^\kappa_x\left(\alpha - \sqrt{\kappa}\, x^T\frac{v}{\|v\|}\,\beta\right)x + \frac{1}{\sqrt{\kappa}}\,\beta\,\frac{v}{\|v\|}}{1 + (\lambda^\kappa_x - 1)\,\alpha - \sqrt{\kappa}\,\lambda^\kappa_x\, x^T\frac{v}{\|v\|}\,\beta}, \quad (5.11)$$

where α = cos_κ(λ^κ_x‖v‖) and β = sin_κ(λ^κ_x‖v‖).

Indeed, take a unit speed geodesic γ_{x,v}(t) starting from x with direction v. Notice that the unit speed geodesic on the sphere starting from x' ∈ S^n_R is given by Γ_{x',v'}(t) = x' cos_κ(t) + (1/√κ) v' sin_κ(t). By the Egregium theorem, we know that Φ^{−1}(γ_{x,v}(t)) is again a unit speed geodesic in the sphere, where Φ^{−1} : x ↦ x' = (λ^κ_x x, (1/√κ)(λ^κ_x − 1)). Hence


Φ^{−1}(γ_{x,v}(t)) is of the form Γ_{x',v'} for some x' and v'. We can determine those by

$$x' = \Phi^{-1}(\gamma(0)) = \Phi^{-1}(x) = \left(\lambda^\kappa_x\, x,\; \frac{1}{\sqrt\kappa}(\lambda^\kappa_x - 1)\right), \qquad v' = \dot\Gamma(0) = \frac{\partial \Phi^{-1}(y)}{\partial y}\bigg|_{\gamma(0)}\,\dot\gamma(0).$$

Notice that ∇_x λ^κ_x = −κ(λ^κ_x)² x, and we thus get

$$v' = \begin{pmatrix} -2\kappa(\lambda^\kappa_x)^2\, (x^T v)\, x + \lambda^\kappa_x\, v \\ -\sqrt{\kappa}\,(\lambda^\kappa_x)^2\, x^T v \end{pmatrix}.$$

We can obtain γ_{x,v} again by inverting back, calculating γ_{x,v}(t) = Φ(Γ_{x',v'}(t)), resulting in

$$\gamma_{x,v}(t) = \frac{\left(\lambda^\kappa_x \cos_\kappa(t) - \sqrt{\kappa}\,(\lambda^\kappa_x)^2 (x^T v) \sin_\kappa(t)\right)x + \frac{1}{\sqrt\kappa}\,\lambda^\kappa_x \sin_\kappa(t)\, v}{1 + (\lambda^\kappa_x - 1)\cos_\kappa(t) - \sqrt{\kappa}\,(\lambda^\kappa_x)^2 (x^T v)\sin_\kappa(t)}.$$

Denoting g^κ_x(v, v) = ‖v‖²(λ^κ_x)², we have that

$$\exp^\kappa_x(v) = \gamma_{x,\; v/\sqrt{g^\kappa_x(v,v)}}\!\left(\sqrt{g^\kappa_x(v, v)}\right), \quad (5.12)$$

which concludes the proof of the above formula for the exponential map. One then notices that it can be re-written in terms of the κ-addition. The formula for the logarithmic map is easily checked by verifying that it is indeed the inverse of the exponential map. Finally, the distance formula is obtained via the well-known identity d_κ(x, y) = ‖log^κ_x(y)‖_x, where ‖v‖_x = √(g^κ_x(v, v)). Note that, as expected, exp^κ_x(v) →_{κ→0} x + v, converging to the Euclidean exponential map.

around κ = 0. One notably observes that choosing κ = 0 yields all corresponding Euclidean quantities, which guarantees a continuous interpolation between κ-stereographic models of different curvatures, via Euler's formula tan(x) = −i tanh(ix), where i := √−1. But is this interpolation differentiable with respect to κ? It is, as shown by the following theorem.


Theorem 5.3 (Smoothness of st^d_κ w.r.t. κ around 0). Let v ≠ 0 and x, y ∈ R^d, such that x ≠ y (and x ≠ −y/(κ‖y‖²) if κ > 0). The quantities in Eqs. (5.7, 5.8, 5.9, 5.10) are well-defined for |κ| < 1/min(‖x‖², ‖y‖²), i.e. for κ small enough. Their first order derivatives at 0⁻ and 0⁺ exist and are equal. Moreover, for the distance we have, up to quadratic terms in κ:

$$d_\kappa(x, y) \approx 2\|x-y\| - 2\kappa\left(\|x-y\|^3/3 + (x^Ty)\|x-y\|^2\right). \quad (5.13)$$

Proof. We first compute a Taylor development of the κ-addition w.r.t κ around zero:

$$\begin{aligned} x \oplus_\kappa y &= \frac{(1 - 2\kappa x^Ty - \kappa\|y\|^2)x + (1 + \kappa\|x\|^2)y}{1 - 2\kappa x^Ty + \kappa^2\|x\|^2\|y\|^2} \\ &= \left[(1 - 2\kappa x^Ty - \kappa\|y\|^2)x + (1 + \kappa\|x\|^2)y\right]\left[1 + 2\kappa x^Ty + O(\kappa^2)\right] \\ &= (1 - 2\kappa x^Ty - \kappa\|y\|^2)x + (1 + \kappa\|x\|^2)y + 2\kappa x^Ty\,[x + y] + O(\kappa^2) \\ &= (1 - \kappa\|y\|^2)x + (1 + \kappa\|x\|^2)y + 2\kappa(x^Ty)y + O(\kappa^2) \\ &= x + y + \kappa\left[\|x\|^2 y - \|y\|^2 x + 2(x^Ty)y\right] + O(\kappa^2). \end{aligned} \quad (5.14)$$

We then notice that, using the Taylor expansion of ‖·‖₂, given by ‖x + v‖₂ = ‖x‖₂ + ⟨x, v⟩ + O(‖v‖₂²) for v → 0, we get

$$\|x \oplus_\kappa y\| = \|x + y\| + \kappa\,\langle \|x\|^2 y - \|y\|^2 x + 2(x^Ty)y,\; x + y\rangle + O(\kappa^2) = \|x+y\| + \kappa\,(x^Ty)\,\|x+y\|^2 + O(\kappa^2). \quad (5.15)$$

Finally, Taylor developments of tan_κ(|κ|^{1/2}u) and |κ|^{−1/2}tan_κ^{−1}(u) w.r.t. κ around 0 for fixed u yield, for κ → 0⁺:

$$\tan_\kappa(|\kappa|^{1/2}u) = \kappa^{-1/2}\tan(\kappa^{1/2}u) = \kappa^{-1/2}\left(\kappa^{1/2}u + \kappa^{3/2}u^3/3 + O(\kappa^{5/2})\right) = u + \kappa u^3/3 + O(\kappa^2). \quad (5.16)$$


For κ → 0⁻,

$$\tan_\kappa(|\kappa|^{1/2}u) = (-\kappa)^{-1/2}\tanh((-\kappa)^{1/2}u) = (-\kappa)^{-1/2}\left((-\kappa)^{1/2}u - (-\kappa)^{3/2}u^3/3 + O(\kappa^{5/2})\right) = u + \kappa u^3/3 + O(\kappa^2). \quad (5.17)$$

The left and right derivatives match; hence, even though κ ↦ |κ|^{1/2} is not differentiable at κ = 0, the function κ ↦ tan_κ(|κ|^{1/2}u) is. A similar analysis yields the same conclusion for κ ↦ |κ|^{−1/2}tan_κ^{−1}(u), yielding, for κ → 0,

$$|\kappa|^{-1/2}\tan_\kappa^{-1}(u) = u - \kappa u^3/3 + O(\kappa^2). \quad (5.18)$$

Since a composition of differentiable functions is differentiable, we consequently obtain that ⊗_κ, exp^κ, log^κ and d_κ are differentiable functions of κ, under the assumptions on x, y, v stated in Theorem 5.3. Finally, the Taylor development of d_κ follows by composition of Taylor developments:

$$\begin{aligned} d_\kappa(x, y) &= 2|\kappa|^{-1/2}\tan_\kappa^{-1}\left(\|(-x)\oplus_\kappa y\|\right) \\ &= 2\left(\|x-y\| + \kappa((-x)^Ty)\|x-y\|^2\right)\left(1 - (\kappa/3)\big(\|x-y\| + O(\kappa)\big)^2 + O(\kappa^2)\right) \\ &= 2\left(\|x-y\| + \kappa((-x)^Ty)\|x-y\|^2\right)\left(1 - (\kappa/3)\|x-y\|^2 + O(\kappa^2)\right) \\ &= 2\|x-y\| - 2\kappa\left((x^Ty)\|x-y\|^2 + \|x-y\|^3/3\right) + O(\kappa^2). \end{aligned}$$

Note that for xTy ≥ 0, this tells us that an infinitesimal change of curvature from zero to small negative, i.e. towards 0−, while keeping x, y fixed, has the effect of increasing their distance. As a consequence, we have a unified formalism that interpolates smoothly between all three geometries of constant curvature.
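To illustrate the unified formalism, here is a small PyTorch sketch of the κ-addition (Eq. 5.5), the exponential map (Eq. 5.9) and the distance (Eq. 5.7), with the κ = 0 limits handled explicitly; the function names are ours, numerical safeguards are minimal, and domain checks for κ > 0 are omitted.

```python
import torch

def kappa_add(x, y, k):
    # kappa-addition of Eq. (5.5); for k = 0 this reduces to x + y.
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 - 2 * k * xy - k * y2) * x + (1 + k * x2) * y
    return num / (1 - 2 * k * xy + k ** 2 * x2 * y2)

def expmap_kappa(x, v, k):
    # Exponential map of Eq. (5.9); the kappa -> 0 limit is x + v.
    if k == 0:
        return x + v
    lam = 2.0 / (1.0 + k * (x * x).sum(-1, keepdim=True))
    vnorm = v.norm(dim=-1, keepdim=True).clamp_min(1e-15)
    u = abs(k) ** 0.5 * lam * vnorm / 2.0
    tan_k = torch.tan(u) / k ** 0.5 if k > 0 else torch.tanh(u) / (-k) ** 0.5
    return kappa_add(x, tan_k * v / vnorm, k)

def dist_kappa(x, y, k):
    # Distance of Eq. (5.7); the kappa -> 0 limit is 2 * ||x - y||.
    if k == 0:
        return 2 * (x - y).norm(dim=-1)
    u = kappa_add(-x, y, k).norm(dim=-1)
    if k > 0:
        return 2 / k ** 0.5 * torch.atan(k ** 0.5 * u)
    return 2 / (-k) ** 0.5 * torch.atanh(((-k) ** 0.5 * u).clamp(max=1 - 1e-7))
```

For k = -1 these reduce to the usual Möbius addition, exponential map and distance of the Poincaré ball.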

5.1.2 Optimal Transport Geometry

OT is a mathematical framework that defines distances or similarities between objects such as probability distributions, either discrete or


continuous, as the cost of an OT plan from one to the other. In this work we will invoke several variants: the Wasserstein, Gromov-Wasserstein and Fused Gromov-Wasserstein geometries.

Figure 5.3: We illustrate, for a given 2D point cloud, the optimal transport plan obtained from minimizing the Wasserstein costs; c(·, ·) denotes the Euclidean distance (top) or squared difference (bottom). A, B are the Euclidean distance matrices obtained from point clouds X, Y. A higher dotted-line thickness illustrates a greater mass transport.

wasserstein for point clouds. Let a point cloud X = {x_i}_{i=1}^n of size n be a set of n points x_i ∈ R^d. Given point clouds X, Y of respective sizes n, m, a transport plan (or coupling) is a matrix T of size n × m with entries in [0, 1], satisfying the two following marginal constraints: T 1_m = (1/n) 1_n and T^T 1_n = (1/m) 1_m. Intuitively, the marginal constraints mean that T preserves the mass from X to Y. We denote the set of such couplings as C_{XY}. Given a cost function c on R^d, its associated Wasserstein discrepancy is defined as

$$W(X, Y) = \min_{T\in\mathcal{C}_{XY}} \sum_{ij} T_{ij}\, c(x_i, y_j). \quad (5.19)$$

We further describe the shape of optimal transport plans for point clouds of the same size in Section 5.4.5.3.

gromov-wasserstein for adjacency tensors. We consider an adjacency tensor of dimension d over a set X = {x_i}_{i=1}^n to be defined as


a square matrix A of size n, where each entry A_{ij} ∈ R^d represents an embedding of the (directed) relation from x_i to x_j. Given two adjacency tensors A and B over respective sets X, Y of sizes n, m, and a cost function c on R^d, we define a Gromov-Wasserstein discrepancy as:

$$GW(A, B) = \min_{T\in\mathcal{C}_{XY}} \sum_{ij}\sum_{kl} T_{ij} T_{kl}\, c(A_{ik}, B_{jl}). \quad (5.20)$$

intuition. A first observation is that the Wasserstein geometry relates point clouds living in the same embedding space R^d, while the Gromov-Wasserstein discrepancy operates on abstract sets which may live in "uncomparable" spaces. The Wasserstein distance represents the cost of moving one point cloud to the other in the same embedding space, while the Gromov-Wasserstein one reflects the cost of aligning relation embeddings within each set. This is illustrated in Figure 5.3. If c is a similarity function instead, then replacing the above minimum by a maximum yields a (Gromov-)Wasserstein similarity. Peyré & Cuturi [PC+19] and Villani [Vil08] respectively provide excellent resources for the computational and mathematical aspects of OT.

fused gromov-wasserstein. Finally, as proposed by Titouan et al. [Tit+19a], it is possible to combine these two geometries with the below definition:

$$FGW_\alpha(A, B) = \min_{T\in\mathcal{C}_{XY}} \;\alpha \sum_{ij}\sum_{kl} T_{ij}T_{kl}\, c(A_{ik}, B_{jl}) + (1-\alpha)\sum_{ij} T_{ij}\, c(x_i, y_j), \quad (5.21)$$

where α ∈ [0, 1] is a parameter allowing one to weigh the trade-off between the two geometries.
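For two point clouds of the same size with uniform weights, the optimal coupling of Eq. (5.19) is (up to the factor 1/n) a permutation matrix, so W can be computed exactly with a linear assignment solver; a minimal SciPy sketch, with the helper name ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein_equal_size(X, Y):
    """W(X, Y) of Eq. (5.19) for two point clouds of the same size n with
    uniform weights 1/n and cost c = Euclidean distance; the optimal coupling
    is then (1/n) times a permutation matrix, found by linear assignment."""
    n = len(X)
    C = cdist(X, Y)                        # cost matrix c(x_i, y_j)
    rows, cols = linear_sum_assignment(C)  # optimal permutation
    return C[rows, cols].sum() / n

X = np.random.randn(5, 2)
Y = np.random.randn(5, 2) + 1.0
print(wasserstein_equal_size(X, Y))
```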

5.1.3 GCN: Graph Convolutional Networks

The problem of node classification on a graph has long been tackled with explicit regularization using the graph Laplacian [Wes+12]. Namely, for a directed graph with adjacency matrix A, by adding the following term to the loss: ∑_{i,j} A_{ij}‖f(x_i) − f(x_j)‖² = f(X)^T L f(X), where L = D − A is the (unnormalized) graph Laplacian, D_{ii} := ∑_k A_{ik} defines the (diagonal) degree matrix, f contains the trainable parameters of the model and X = (x_i^j)_{ij} holds the node features of the model. Such a regularization is expected to improve generalization if connected


nodes in the graph tend to share labels; node i with feature vector x_i is represented as f(x_i) in a Euclidean space. With the aim of obtaining more scalable models, [DBV16b]; [KW17a] propose to make this regularization implicit by incorporating it into what they call GCN, which they motivate as a first order approximation of spectral graph convolutions, yielding the following scalable layer architecture:

$$H^{(t+1)} = \sigma\!\left(\tilde D^{-\frac12}\,\tilde A\,\tilde D^{-\frac12}\, H^{(t)} W^{(t)}\right), \quad (5.22)$$

where Ã = A + I has added self-connections, D̃_{ii} = ∑_k Ã_{ik} defines its diagonal degree matrix, σ is a non-linearity such as sigmoid, tanh or ReLU = max(0, ·), and W^{(t)} and H^{(t)} are the parameter and activation matrices of layer t respectively, with H^{(0)} = X the input feature matrix.
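A dense, minimal sketch of the GCN layer of Eq. (5.22); the helper name is ours, and real implementations use sparse matrices.

```python
import torch

def gcn_layer(H, A, W, activation=torch.relu):
    """One GCN layer: H' = sigma(D~^{-1/2} A~ D~^{-1/2} H W),
    with A the dense (n, n) adjacency matrix and H the (n, d) activations."""
    A_tilde = A + torch.eye(A.shape[0], device=A.device)   # add self-connections
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)              # D~^{-1/2} as a vector
    A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
    return activation(A_hat @ H @ W)

# Toy usage: 4 nodes, 3 input features, 2 output features.
A = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
H, W = torch.randn(4, 3), torch.randn(3, 2)
print(gcn_layer(H, A, W).shape)   # torch.Size([4, 2])
```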

5.1.4 Directed Message Passing Neural Networks

We briefly recall here the simplified Directed Message Passing Neural Network (DMPNN) [DDS16] architecture which was successfully used for molecular property prediction by Yang et al. [Yan+19]. This model takes as input a directed graph G = (V, E), with node and edge features denoted by x_v and e_vw respectively, for v, w in the vertex set V and when v → w is an edge in E. The parameters of DMPNN are the below weight matrices {W_i, W_m, W_o}. It keeps track of messages m^t_vw and hidden states h^t_vw for each step t, defined as follows. An initial hidden state is set to h^0_vw := ReLU(W_i cat(x_v, e_vw)), where "cat" denotes concatenation. Then, the message passing operates as

m^{t+1}_vw = ∑_{k ∈ N(v)\{w}} h^t_kv,    h^{t+1}_vw = ReLU(h^0_vw + W_m m^{t+1}_vw),    (5.23)

where N(v) = {k ∈ V | (k, v) ∈ E} denotes v's incoming neighbors. After T steps of message passing, node embeddings are obtained by summing edge embeddings:

m_v = ∑_{w ∈ N(v)} h^T_vw,    h_v = ReLU(W_o cat(x_v, m_v)).    (5.24)

A final graph embedding is then obtained as h = ∑v∈V hv, which is usually fed to a multilayer perceptron (MLP) for classification or regression.
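The following NumPy sketch mirrors the DMPNN updates (5.23)-(5.24) on a toy three-node graph; the weight shapes, features and number of steps are illustrative, and the indexing convention (summing incoming directed-edge states) follows the published DMPNN, so it may differ cosmetically from the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_node, d_edge, d_hid, T_steps = 4, 3, 8, 3
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]                 # directed edge set E
x = {v: rng.normal(size=d_node) for v in range(3)}       # node features x_v
e = {vw: rng.normal(size=d_edge) for vw in edges}        # edge features e_vw
W_i = rng.normal(size=(d_hid, d_node + d_edge))
W_m = rng.normal(size=(d_hid, d_hid))
W_o = rng.normal(size=(d_hid, d_node + d_hid))
relu = lambda z: np.maximum(0.0, z)

def incoming(v):                                         # N(v) = {k : (k, v) in E}
    return [k for (k, u) in edges if u == v]

h0 = {(v, w): relu(W_i @ np.concatenate([x[v], e[(v, w)]])) for (v, w) in edges}
h = dict(h0)
for _ in range(T_steps):
    m = {(v, w): sum((h[(k, v)] for k in incoming(v) if k != w),
                     np.zeros(d_hid)) for (v, w) in edges}
    h = {(v, w): relu(h0[(v, w)] + W_m @ m[(v, w)]) for (v, w) in edges}

m_v = {v: sum((h[(w, v)] for w in incoming(v)), np.zeros(d_hid)) for v in x}
h_v = {v: relu(W_o @ np.concatenate([x[v], m_v[v]])) for v in x}
h_graph = sum(h_v.values())                              # aggregated graph embedding
```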


5.2 κ-gcn: constant curvature gcn

As explained earlier, interest has been rising lately towards methods representing data in non-Euclidean spaces, e.g. hyperbolic or spherical, that provide specific inductive biases useful for certain real-world data properties, e.g. scale-free, hierarchical or cyclical. However, the popular GNN are currently limited to modeling data only via Euclidean geometry and associated vector space operations. Here, we bridge this gap by proposing mathematically grounded generalizations of GCN to (products of) constant curvature spaces. We do this by (i) introducing a unified formalism that can interpolate smoothly between all geometries of constant curvature, and (ii) leveraging gyro-barycentric coordinates that generalize the classic Euclidean concept of the center of mass. Our class of models smoothly recovers its Euclidean counterparts when the curvature goes to zero from either side. Empirically, we outperform Euclidean GCNs in the tasks of node classification and distortion minimization for symbolic data exhibiting non-Euclidean behavior, according to their discrete curvature. The following content has in part been published at ICML 2020 [BBG20].

We start by introducing the methods upon which we build. We present our models for spaces of constant sectional curvature, in the κ-stereographic model. However, the generalization to Cartesian products of such spaces [Gu+19b] follows naturally from these tools.

5.2.1 Tools for a κ-GCN

Learning a parametrized function f_θ that respects hyperbolic geometry has been studied in [GBH18c]: neural layers and hyperbolic softmax. We generalize their definitions into the κ-stereographic model, unifying operations in positive and negative curvature. We explain how curvature introduces a fundamental difference between left and right matrix multiplications, depicting the Möbius matrix multiplication of [GBH18c] as a right multiplication, acting independently on each embedding. We then introduce a left multiplication, by extension of gyromidpoints, which ties the embeddings together; this is essential for GNN.


Figure 5.4: Left: Spherical gyromidpoint of four points. Right: Möbius gyromidpoint in the Poincaré model defined by [Ung08] and, alternatively, here in eq. (5.27).

5.2.2 κ-Right-Matrix-Multiplication

Let X ∈ R^{n×d} denote a matrix whose n rows are d-dimensional embeddings in st_κ^d, and let W ∈ R^{d×e} denote a weight matrix. Let us first understand what a right matrix multiplication is in Euclidean space: the Euclidean right multiplication can be written row-wise as (XW)_{i•} = X_{i•} W. Hence each d-dimensional Euclidean embedding is modified independently by a right matrix multiplication. A natural adaptation of this operation to the κ-stereographic model yields the following definition.

Definition 5.4. Given a matrix X ∈ R^{n×d} holding κ-stereographic embeddings in its rows and weights W ∈ R^{d×e}, the κ-right-matrix-multiplication is defined row-wise as

(X ⊗_κ W)_{i•} = exp^κ_0( (log^κ_0(X) W)_{i•} ) = tan_κ( α_i tan_κ^{−1}(‖X_{i•}‖) ) (XW)_{i•} / ‖(XW)_{i•}‖,    (5.25)

where α_i = ‖(XW)_{i•}‖ / ‖X_{i•}‖ and exp^κ_0 and log^κ_0 denote the exponential and logarithmic map in the κ-stereographic model.

This definition is in perfect agreement with the hyperbolic scalar multiplication for κ < 0, which can also be written as r ⊗_κ x = exp^κ_0(r log^κ_0(x)). This operation is known to have desirable properties such as associativity [GBH18c].
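A small NumPy sketch of the κ-right-matrix-multiplication of Definition 5.4, via the exponential and logarithmic maps at the origin of the κ-stereographic model; the curvature value and dimensions are illustrative assumptions.

```python
import numpy as np

def tan_k(u, k):                        # tan_kappa
    if k > 0: return np.tan(np.sqrt(k) * u) / np.sqrt(k)
    if k < 0: return np.tanh(np.sqrt(-k) * u) / np.sqrt(-k)
    return u

def arctan_k(u, k):                     # tan_kappa^{-1}
    if k > 0: return np.arctan(np.sqrt(k) * u) / np.sqrt(k)
    if k < 0: return np.arctanh(np.sqrt(-k) * u) / np.sqrt(-k)
    return u

def exp0(v, k):                         # exponential map at the origin
    nrm = np.linalg.norm(v, axis=-1, keepdims=True)
    return tan_k(nrm, k) * v / np.maximum(nrm, 1e-15)

def log0(x, k):                         # logarithmic map at the origin
    nrm = np.linalg.norm(x, axis=-1, keepdims=True)
    return arctan_k(nrm, k) * x / np.maximum(nrm, 1e-15)

def kappa_right_matmul(X, W, k):        # (X (x)_kappa W)_{i.} = exp_0((log_0(X) W)_{i.})
    return exp0(log0(X, k) @ W, k)

kappa = -1.0
rng = np.random.default_rng(0)
X = 0.1 * rng.normal(size=(5, 4))       # rows lie in st_kappa^4 (small norms)
W = rng.normal(size=(4, 3))
Y = kappa_right_matmul(X, W, kappa)
```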

5.2.3 κ-Left-Matrix-Multiplication: Midpoint Extension

For GNN we also need the notion of message passing among neighboring nodes, i.e. an operation that combines / aggregates the respective embeddings together. In Euclidean space such an operation is given by the left multiplication of the embeddings matrix with the (preprocessed) adjacency Â: H^(l+1) = σ(Â Z^(l)) where Z^(l) = H^(l) W^(l). Let us consider this left multiplication. For A ∈ R^{n×n}, the matrix product is given row-wise by:

(AX)_{i•} = A_{i1} X_{1•} + ··· + A_{in} X_{n•}

This means that the new representation of node i is obtained by calculating the linear combination of all the other node embeddings, weighted by the i-th row of A. An adaptation to the κ-stereographic model hence requires a notion of weighted linear combination. We propose such an operation in st_κ^d by performing a κ-scaling of a gyromidpoint − whose definition is recalled below. Indeed, in Euclidean space, the weighted linear combination αx + βy can be re-written as (α + β) m_E(x, y; α, β) with Euclidean midpoint m_E(x, y; α, β) := α/(α+β) x + β/(α+β) y. See fig. 5.5 for a geometric illustration. This motivates generalizing the above operation to st_κ^d as follows.

Definition 5.5. Given a matrix X ∈ R^{n×d} holding κ-stereographic embeddings in its rows and weights A ∈ R^{n×n}, the κ-left-matrix-multiplication is defined row-wise as

(A ⊠_κ X)_{i•} := (∑_j A_ij) ⊗_κ m_κ(X_{1•}, ··· , X_{n•}; A_{i•}).    (5.26)

The κ-scaling is motivated by the fact that d_κ(0, r ⊗_κ x) = |r| d_κ(0, x) for all r ∈ R, x ∈ st_κ^d. Recall that the gyromidpoint is defined when κ ≤ 0 in the κ-stereographic model as [Ung10]:

m_κ(x_1, ··· , x_n; α) = (1/2) ⊗_κ ( ∑_{i=1}^n [ α_i λ^κ_{x_i} / ∑_{j=1}^n α_j (λ^κ_{x_j} − 1) ] x_i ),    (5.27)


Figure 5.5: Weighted Euclidean midpoint αx + βy

Figure 5.6: Heatmap of the distance from a st_κ^2-hyperplane to x ∈ st_κ^2 for κ = −0.254 (left) and κ = 0.248 (right)

with λ^κ_x = 2/(1 + κ‖x‖²). Whenever κ > 0, we have to further require the following condition:

∑_j α_j (λ^κ_{x_j} − 1) ≠ 0.    (5.28)

For two points, one can calculate that (λ^κ_x − 1) + (λ^κ_y − 1) = 0 is equivalent to κ‖x‖‖y‖ = 1, which holds in particular whenever x = −y/(κ‖y‖²). See fig. 5.4 for illustrations of gyromidpoints.
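A NumPy sketch of the gyromidpoint (5.27) and of the κ-left-matrix-multiplication (5.26) is given below, with the κ-scaling implemented directly as r ⊗_κ x = tan_κ(r tan_κ^{−1}(‖x‖)) x/‖x‖; curvature, weights and embeddings are illustrative, and for κ > 0 the condition (5.28) is assumed to hold.

```python
import numpy as np

def tan_k(u, k):
    if k > 0: return np.tan(np.sqrt(k) * u) / np.sqrt(k)
    if k < 0: return np.tanh(np.sqrt(-k) * u) / np.sqrt(-k)
    return u

def arctan_k(u, k):
    if k > 0: return np.arctan(np.sqrt(k) * u) / np.sqrt(k)
    if k < 0: return np.arctanh(np.sqrt(-k) * u) / np.sqrt(-k)
    return u

def scalar_mul(r, x, k):               # r (x)_kappa x
    nrm = np.linalg.norm(x)
    return tan_k(r * arctan_k(nrm, k), k) * x / max(nrm, 1e-15)

def lam(x, k):                         # conformal factor lambda_x^kappa
    return 2.0 / (1.0 + k * np.sum(x * x, axis=-1))

def gyromidpoint(X, alpha, k):         # m_kappa(x_1, ..., x_n; alpha), Eq. (5.27)
    lams = lam(X, k)                                 # shape (n,)
    denom = np.sum(alpha * (lams - 1.0))             # must not vanish if kappa > 0
    return scalar_mul(0.5, (alpha * lams / denom) @ X, k)

def kappa_left_matmul(A, X, k):        # (A [x]_kappa X)_{i.}, Eq. (5.26)
    return np.stack([scalar_mul(A[i].sum(), gyromidpoint(X, A[i], k), k)
                     for i in range(A.shape[0])])

kappa = -1.0
rng = np.random.default_rng(0)
X = 0.1 * rng.normal(size=(4, 3))      # kappa-stereographic embeddings
A = np.abs(rng.normal(size=(4, 4)))
A = A / A.sum(axis=1, keepdims=True)   # right-stochastic, cf. Theorem 5.7
Z = kappa_left_matmul(A, X, kappa)
```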

Our operation ⊠_κ satisfies interesting properties:

Theorem 5.6 (Neutral element & κ-scalar-associativity). We have I_n ⊠_κ X = X, and for r ∈ R,

r ⊗_κ (A ⊠_κ X) = (rA) ⊠_κ X.


Proof. If A = In then for all i we have ∑j Aij = 1, hence

(I_n ⊠_κ X)_{i•} = (1/2) ⊗_κ ( ∑_j [ δ_ij λ^κ_{x_j} / ∑_k δ_ik (λ^κ_{x_k} − 1) ] x_j )    (5.29)
= (1/2) ⊗_κ ( [ λ^κ_{x_i} / (λ^κ_{x_i} − 1) ] x_i )    (5.30)
= (1/2) ⊗_κ (2 ⊗_κ x_i)    (5.31)
= x_i    (5.32)
= (X)_{i•}.    (5.33)

For associativity, we first note that the gyromidpoint is unchanged by a scalar rescaling of A. The property then follows by scalar associativity of the κ-scaling.

the matrix a. In most GNN, the matrix A is intended to be a preprocessed adjacency matrix, i.e. renormalized by the diagonal degree matrix D_ii = ∑_k A_ik. This normalization is often taken either (i) to the left: D^{−1} A, (ii) symmetric: D^{−1/2} A D^{−1/2}, or (iii) to the right: A D^{−1}. Note that the latter case makes the matrix right-stochastic^1, which is a property that is preserved by matrix product and exponentiation. For this case, we provide the following result:

Theorem 5.7 (κ-left-multiplication by right-stochastic matrices is intrinsic). If A, B are right-stochastic, φ is an isometry of st_κ^d and X, Y are two matrices holding κ-stereographic embeddings:

∀i,  d_κ( (A ⊠_κ φ(X))_{i•}, (B ⊠_κ φ(Y))_{i•} ) = d_κ( (A ⊠_κ X)_{i•}, (B ⊠_κ Y)_{i•} ).    (5.34)

Proof. It is proved in [Ung05] that the gyromidpoint commutes with isometries. The exact same proof holds for positive curvature, with the same algebraic manipulations. Moreover, when the matrix A is right-stochastic, for each row the sum over columns gives 1, hence our operation ⊠_κ reduces to a gyromidpoint. As a consequence, ⊠_κ commutes with isometries in this case. Since isometries preserve distance, we have proved the theorem.

1 M is right-stochastic if for all i, ∑_j M_ij = 1.


The above result means that A can easily be preprocessed so as to make its κ-left-multiplication intrinsic to the metric space (st_κ^d, d_κ). At this point, one could wonder: do there exist other ways to take weighted centroids on a Riemannian manifold? We comment on two plausible alternatives.

fréchet/karcher means. They are obtained as:

arg min_x ∑_i α_i d_κ(x, x_i)².    (5.35)

Note that although they are also intrinsic, they usually require solving an optimization problem which can be prohibitively expensive, especially when one requires gradients to flow through the solution − moreover, for the space st_κ^d, it is known that the minimizer is unique if and only if κ ≤ 0.

tangential aggregations. The linear combination is here lifted to the tangent space by means of the exponential and logarithmic maps; this was in particular used in the recent works of [Cha+19] and [LNK19].

Definition 5.8. The tangential aggregation of x_1, ..., x_n ∈ st_κ^d w.r.t. weights {α_i}_{1≤i≤n}, at point x ∈ st_κ^d (for x_i ≠ −x/(κ‖x‖²) if κ > 0), is defined by:

tg^κ_x(x_1, ..., x_n; α_1, ..., α_n) := exp^κ_x( ∑_{i=1}^n α_i log^κ_x(x_i) ).    (5.36)

The below theorem describes that for the κ-stereographic model, this operation is also intrinsic.

Theorem 5.9 (Tangential aggregation is intrinsic). For any isometry φ of st_κ^d, we have

tg_{φ(x)}({φ(x_i)}; {α_i}) = φ( tg_x({x_i}; {α_i}) ).    (5.37)

Proof. We begin our proof by stating the left-cancellation law:

x ⊕_κ (−x ⊕_κ y) = y    (5.38)

and the following simple identity stating that orthogonal maps commute with κ-addition:

Rx ⊕_κ Ry = R(x ⊕_κ y), ∀R ∈ O(d).    (5.39)

Next, we generalize the gyro operator from Möbius gyrovector spaces as defined in [Ung08]:

gyr[u, v]w := −(u ⊕_κ v) ⊕_κ (u ⊕_κ (v ⊕_κ w)).    (5.40)

Note that this definition applies only for u, v, w ∈ st_κ^d for which the κ-addition is defined (see theorem 1). Following [Ung08], we have an alternative formulation (verifiable via computer algebra):

gyr[u, v]w = w + 2 (Au + Bv)/D,    (5.41)

where the quantities A, B, D have the following closed-form expressions:

A = −κ² ⟨u, w⟩ ‖v‖² − κ ⟨v, w⟩ + 2κ² ⟨u, v⟩ · ⟨v, w⟩,    (5.42)
B = −κ² ⟨v, w⟩ ‖u‖² + κ ⟨u, w⟩,    (5.43)
D = 1 − 2κ ⟨u, v⟩ + κ² ‖u‖² ‖v‖².    (5.44)

We then have the following relations:

Lemma 5.1. For all u, v, w ∈ st_κ^d for which the κ-addition is defined, we have: i) gyration is a linear map, ii) u ⊕_κ v = gyr[u, v](v ⊕_κ u), iii) −(z ⊕_κ u) ⊕_κ (z ⊕_κ v) = gyr[z, u](−u ⊕_κ v), iv) ‖gyr[u, v]w‖ = ‖w‖.

Proof. The proof is similar to the one for negative curvature given in [Ung08]. The fact that gyration is a linear map can be easily verified from its definition. For the second part, we have

−gyr[u, v](v ⊕_κ u) = gyr[u, v](−(v ⊕_κ u))
= −(u ⊕_κ v) ⊕_κ (u ⊕_κ (v ⊕_κ (−(v ⊕_κ u))))    (5.45)
= −(u ⊕_κ v),

where the first equality is a trivial consequence of the fact that gyration is a linear map, while the last equality is a consequence of the left-cancellation law. The third part follows easily from the definition of the gyration and the left-cancellation law. The fourth part can be checked using the alternate form in equation (5.41).

We now follow [Ung14] and describe all isometries of st_κ^d spaces:

Theorem 5.10. Any isometry φ of st_κ^d can be uniquely written as:

φ(x) = z ⊕_κ Rx,  where z ∈ st_κ^d, R ∈ O(d).    (5.46)

The proof is exactly the same as in theorems 3.19 and 3.20 of [Ung14], so we will skip it. We can now prove the main theorem. Let φ(x) = z ⊕_κ Rx be any isometry of st_κ^d, where R ∈ O(d) is an orthogonal matrix. Let us denote v := ∑_{i=1}^n α_i log^κ_x(x_i). Then, using lemma 5.1 and the formula of the log map from theorem 2, one obtains the following identity:

∑_{i=1}^n α_i log^κ_{φ(x)}(φ(x_i)) = (λ^κ_x / λ^κ_{φ(x)}) gyr[z, Rx] R v    (5.47)

and, thus, using the formula of the exp map from theorem 2 we obtain:

tg_{φ(x)}({φ(x_i)}; {α_i}) = φ(x) ⊕_κ gyr[z, Rx] R ( −x ⊕_κ exp^κ_x(v) ).    (5.48)

Using eq. (5.40), we get that ∀w ∈ st_κ^d:

gyr[z, Rx] Rw = −φ(x) ⊕_κ (z ⊕_κ (Rx ⊕_κ Rw)),    (5.49)

giving the desired

tg_{φ(x)}({φ(x_i)}; {α_i}) = z ⊕_κ R exp^κ_x(v) = φ( tg_x({x_i}; {α_i}) ).    (5.50)


5.2.4 Logits

Finally, we need the logit and softmax layer, a necessity for any classification task. We here use the model of [GBH18c], which was obtained in a principled manner for the case of negative curvature. Their derivation rests upon the closed-form formula for the distance to a hyperbolic hyperplane. We naturally extend this formula to st_κ^d, hence also allowing for κ > 0, but leave for future work the adaptation of their theoretical analysis.

p(y = k | x) = S( (‖a_k‖_{p_k} / √|κ|) sin_κ^{−1}( 2 √|κ| ⟨z_k, a_k⟩ / ((1 + κ‖z_k‖²) ‖a_k‖) ) ),    (5.51)

where a_k ∈ T_0 st_κ^d ≅ R^d, x ∈ st_κ^d, p_k ∈ st_κ^d and S(·) is the softmax function. We refer the reader to fig. 5.6 for an illustration of eq. 5.51.

5.2.5 Final Architecture of the κ-GCN

We are now ready to introduce our κ-stereographic GCN [KW17a], denoted by κ-GCN². Assume we are given a graph with node-level features G = (V, A, X) where X ∈ R^{n×d} with each row X_{i•} ∈ st_κ^d and adjacency A ∈ R^{n×n}. We first perform a preprocessing step by mapping the Euclidean features to st_κ^d via the projection X ↦ X/(2√|κ| ‖X‖_max), where ‖X‖_max denotes the maximal Euclidean norm among all stereographic embeddings in X. For l ∈ {0, . . . , L − 2}, the (l + 1)-th layer of κ-GCN is given by:

H^(l+1) = σ^{⊗_κ}( Â ⊠_κ (H^(l) ⊗_κ W^(l)) ),    (5.52)

where H^(0) = X, σ^{⊗_κ}(x) := exp^κ_0(σ(log^κ_0(x))) is the Möbius version [GBH18c] of a pointwise non-linearity σ and Â = D̃^{−1/2} Ã D̃^{−1/2}. The final layer is a κ-logit layer:

H^(L) = softmax( Â logit_κ(H^(L−1), W^(L−1)) ),    (5.53)

where W^(L−1) contains the parameters a_k and p_k of the κ-logits layer. A very important property of κ-GCN is that its architecture recovers the Euclidean GCN when we let the curvature go to zero:

2 To be pronounced "kappa" GCN; the Greek letter κ is commonly used to denote sectional curvature.


κ-GCN → GCN  as  κ → 0.

5.3 experiments: graph distortion & node classification

5.3.1 Distortion of Graph Embeddings

We evaluate the architectures introduced in the previous sections on the tasks of node classification and minimizing embedding distortion for several synthetic as well as real datasets.

experimental details. We split the data into training, early stopping, validation and test sets. Namely we first split the dataset into a known subset of size n_known and an unknown subset consisting of the rest of the nodes. For all the graphs we use n_known = 1500 except for the airport dataset, where we follow the setup of [Cha+19] and use n_known = 2700. For the citation graphs, the known subset is further split into a training set consisting of 20 data points per label, an early stopping set of size 500 and a validation set of the remaining nodes. For airport, we separate the known subset into 2100 training nodes, 300 validation nodes and 300 early stopping nodes. Notice that the whole structure of the graph and all the node features are used in an unsupervised fashion, since the embedding of a training node might for instance depend on the embedding of a node from the validation set. But when calculating the loss, we only provide supervision with the training data. The unknown subset serves as the test data and is only used for the final evaluation of the model. Hyperparameter tuning is performed on the validation set. We further use early stopping in all the experiments. We stop training as soon as the early stopping cross-entropy loss has not decreased in the last n_patience = 200 epochs or as soon as we have reached n_max = 2000 epochs. The model chosen is the one with the highest accuracy score on the early stopping set. For the final evaluation we test the model on five different data splits and two runs each and report mean accuracy and bootstrapped confidence intervals. We use the described setup for both the Euclidean and non-Euclidean models to ensure a fair comparison.


Table 5.1: Minimum achieved average distortion of the different models. H and S denote hyperbolic and spherical models respectively.

Model               Tree     Toroidal   Spherical

E10 (Linear)        0.045    0.0607     0.0415
E10 (ReLU)          0.0502   0.0603     0.0409
H10 (κ-GCN)         0.0029   0.272      0.267
S10 (κ-GCN)         0.473    0.0485     0.0337
H5 × H5 (κ-GCN)     0.0048   0.112      0.152
S5 × S5 (κ-GCN)     0.51     0.0464     0.0359
(H2)^4 (κ-GCN)      0.025    0.084      0.062
(S2)^4 (κ-GCN)      0.312    0.0481     0.0378

minimizing distortion. Our first goal is to evaluate the graph embeddings learned by our GCN models on the representation task of fitting the graph metric in the embedding space. We desire to minimize the average distortion, defined similarly as in [Gu+19b]: (1/n²) ∑_{i,j} ( (d(x_i, x_j)/d_G(i, j))² − 1 )², where d(x_i, x_j) is the distance between the embeddings of nodes i and j, while d_G(i, j) is their graph distance (shortest path length). We create three synthetic datasets that best reflect the different geometries of interest: i) "Tree": a balanced tree of depth 5 and branching factor 4, consisting of 1365 nodes and 1364 edges. ii) "Torus": we sample points (nodes) from the (planar) torus, i.e. from the unit connected square; two nodes are connected by an edge if their toroidal distance (the warped distance) is smaller than a fixed R = 0.01; this gives 1000 nodes and 30626 edges. iii) "Spherical Graph": we sample points (nodes) from S², connecting nodes if their distance is smaller than 0.2, leading to 1000 nodes and 17640 edges. For the GCN models, we use 1-hot initial node features. We use two GCN layers with dimensions 16 and 10. The non-Euclidean models


Table 5.2: Node classification: Average accuracy across 5 splits with estimated uncertainties at 95 percent confidence level via bootstrapping on our datasplits. H and S denote hyperbolic and spherical models respectively. Implementation of these experiments was done in collaboration with co-authors in [BBG20].

Model                 Citeseer          Cora              Pubmed            Airport

E16 [KW17a]           0.729 ± 0.0054    0.814 ± 0.004     0.792 ± 0.0039    0.814 ± 0.0029
H16 [Cha+19]          0.71 ± 0.0049     0.803 ± 0.0046    0.798 ± 0.0043    0.844 ± 0.0041
H16 (κ-GCN)           0.732 ± 0.0051    0.812 ± 0.005     0.785 ± 0.0036    0.819 ± 0.0033
S16 (κ-GCN)           0.721 ± 0.0045    0.819 ± 0.0045    0.788 ± 0.0049    0.809 ± 0.0058
Prod-GCN (κ-GCN)      0.711 ± 0.0059    0.808 ± 0.0041    0.781 ± 0.006     0.817 ± 0.0044

do not use additional non-linearities between layers. All Euclidean parameters are updated using the ADAM optimizer with learning rate 0.01. Curvatures are learned using (stochastic) gradient descent with a learning rate of 0.0001. All models are trained for 10000 epochs and we report the minimal achieved distortion.

distortion results. The obtained distortion scores shown in table 5.1 reveal the benefit of our models. The best performing architecture is the one that matches the underlying geometry of the graph.
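As a reference for the numbers reported in Table 5.1, here is a small helper computing the average distortion defined at the beginning of this subsection (up to the normalization convention); the distance matrices below are random placeholders rather than learned embeddings.

```python
import numpy as np

def average_distortion(d_emb, d_graph):
    """d_emb, d_graph: (n, n) embedding and shortest-path distance matrices."""
    n = d_emb.shape[0]
    off = ~np.eye(n, dtype=bool)                 # ignore the diagonal (i = j)
    ratio = d_emb[off] / d_graph[off]
    return float(np.mean((ratio ** 2 - 1.0) ** 2))

rng = np.random.default_rng(0)
P = rng.random((10, 2))                          # stand-in 2D embeddings
d_emb = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
d_graph = np.ceil(10 * d_emb) / 10               # stand-in graph distances
print(average_distortion(d_emb, d_graph))
```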

5.3.2 Node Classification

We consider the popular node classification datasets Citeseer [Sen+08], Cora-ML [McC+00] and Pubmed [Nam+12]. Node labels correspond to the particular subfield the published document is associated with. We compare against the Euclidean model [KW17a] and the recently proposed hyperbolic variant [Cha+19].

curvature estimations of datasets. To understand how far the real graphs of the above datasets are from Euclidean geometry, we first estimate the graph curvature of the four studied datasets


Figure 5.7: Histogram of curvatures from the "Deviation of Parallelogram Law"

using the deviation from the parallelogram law [Gu+19b]. Curvature histograms are shown in fig. 5.7. It can be noticed that the datasets are mostly non-Euclidean, thus offering a good motivation to apply our constant-curvature GCN architectures.

training details. We trained the baseline models in the same setting as done in [Cha+19]. Namely, for GCN we use one hidden layer of size 16, dropout of rate 0.5 on the embeddings and the adjacency, as well as L2-regularization for the weights of the first layer. We used ReLU as the non-linear activation function.


Table 5.3: Average curvature obtained for node classification. H and S denote hyperbolic and spherical models respectively. Curva- ture for Pubmed was fixed for the product model. Implemen- tation of these experiments was done in collaboration with co-authors in [BBG20].

Model          Citeseer                       Cora                            Pubmed         Airport

H16 (κ-GCN)    −1.306 ± 0.08                  −1.51 ± 0.11                    −1.42 ± 0.12   −0.93 ± 0.08
S16 (κ-GCN)    0.81 ± 0.05                    1.24 ± 0.06                     0.71 ± 0.15    1.49 ± 0.08
Prod κ-GCN     [1.21, −0.64] ± [0.09, 0.07]   [−0.83, −1.61] ± [0.04, 0.06]   [−1, −1]       [1.23, −0.89] ± [0.07, 0.11]

For the non-Euclidean architectures, we used a combination of dropout and dropconnect as reported in [Cha+19], as well as L2-regularization for the first layer. All models have the same number of parameters and for fairness are compared in the same setting, without attention. We use one hidden layer of dimension 16. For the product models we consider two-component spaces (e.g. H8 × S8) and we split the embedding space into equal dimensions of size 8. We also distribute the input features equally among the components. Non-Euclidean models use the Möbius version of ReLU as activation function. Euclidean parameters use a learning rate of 0.01 for all models using ADAM. The curvatures are learned using gradient descent with a learning rate of 0.01. We show the values of the learned curvatures in Table 5.3.

We use early stopping: we first train for a maximum of 2000 epochs, then we check every 200 epochs for improvement in the validation cross-entropy loss; if that is not observed, we stop.

node classification results. These are shown in table 5.2. It can be seen that our models are competitive with the Euclidean GCN considered and outperform [Cha+19] on Citeseer and Cora, showcasing the benefit of our proposed architecture.


5.4 otgnn: optimal transport graph neural networks

We now present our second GNN model. Recall that the aim of the previously introduced model (κ-GCN) was to leverage geometries of constant sectional curvature for node and graph embedding, replacing Euclidean aggregation by its corresponding analogue respectively in hyperbolic or spherical space.

In the following work, we take a different direction, focusing on smaller input graphs (fewer than about 50 nodes per graph, for instance small molecules, e.g. antibiotics), and on another weakness of current architectures. Indeed, current GNN architectures naively average or sum node embeddings into an aggregated graph representation − potentially losing structural or semantic information. We here introduce OTGNN, which computes graph embeddings from OT distances between the set of GNN node embeddings and "prototype" point clouds kept as free parameters. This allows different prototypes to highlight key facets of different graph subparts − as illustrated in Figure 5.8. We show that our function class on point clouds satisfies a universal approximation theorem, a fundamental property which was lost by sum aggregation. Nevertheless, empirically the model has a natural tendency to collapse back to the standard aggregation during training. We address this optimization issue by proposing an efficient noise contrastive regularizer, steering the model towards truly exploiting the optimal transport geometry. Our model consistently exhibits better generalization performance on several molecular property prediction tasks, while also yielding smoother representations.

This work has in part been the subject of a recent submission, currently under review and referred to as an ArXiv preprint [Béc+20]. It was done while the author was at, and in collaboration with, the Computer Science and Artificial Intelligence Lab (CSAIL) at the Massachusetts Institute of Technology (MIT), funded via the Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS³) and DARPA AMD programs.

3https://mlpds.mit.edu


Figure 5.8: Intuition for our Wasserstein prototype model. We assume that a few prototypes, e.g. some functional groups, highlight key facets or structural features of graphs in a particular graph classification/regression task at hand. We then express graphs by relating them to these abstract prototypes, represented as free point cloud parameters. Note that we do not learn the graph structure of the prototypes.

5.4.1 Model & Architecture

reformulating standard architectures. As mentioned at the end of Section 5.1.4, the final graph embedding h obtained by aggregating node embeddings is usually fed to a MLP performing a matrix multiplication (Rh)_i = ⟨r_i, h⟩. Replacing ⟨·, ·⟩ by a distance/kernel k(·, ·) allows the processing of more general graph representations than just vectors in R^d, such as point clouds or adjacency tensors.

from a single point to a point cloud. We propose to replace the aggregated graph embedding h = ∑_{v∈V} h_v by the point cloud (of unaggregated node embeddings) H = {h_v}_{v∈V}, and the inner-products ⟨h, r_i⟩ by the below written Wasserstein discrepancy:

W(H, Q_i) := min_{T ∈ C_{H Q_i}} ∑_{vj} T_vj c(h_v, q_i^j),    (5.54)


where the Q_i = {q_i^j}_j are point clouds and free parameters, and the cost is chosen as c = ‖· − ·‖₂² or c = −⟨·, ·⟩. Note that both options yield identical OT plans.

greater representational power. We formulate mathematically in Section 5.4.5 to what extent this kernel has a strictly greater representational power than the kernel corresponding to a standard inner-product on top of a sum aggregation, to distinguish between different point clouds. In practice, we would also like our model to exploit its additional representational power. This practical concern is discussed in the next subsection.

additional remarks. Although we also experimented in practice with the Gromov-Wasserstein and Fused-Gromov-Wasserstein geometries, our best performing models ended up being the simplest Wasserstein models. However, it is worth mentioning that further empirical investigations would be needed before turning this empirical finding into a definitive conclusion.

from a point cloud to an adjacency tensor. For GNN architectures performing an edges → node aggregation such as the DMPNN presented in Section 5.1.4, one can also further remove this aggregation by directly considering the adjacency tensor of edge embeddings as the graph representation X, computing a Gromov-Wasserstein discrepancy as:

GW(X, P_i) := min_{T ∈ C_{X P_i}} ∑_{uj} ∑_{vk} T_uj T_vk c(X_uv, p_i^{jk}),    (5.55)

where the P_i = {p_i^{jk}}_{jk} are adjacency tensors and free parameters.

5.4.2 Contrastive Regularization

What would happen to W(H, Q_i) if all points q_i^j belonging to point cloud Q_i were to collapse to the same point q_i? All transport plans would yield the same cost, giving for c = −⟨·, ·⟩:

W(H, Q_i) = − ∑_{vj} T_vj ⟨h_v, q_i⟩ = −⟨h, q_i / |V|⟩.    (5.56)


In this scenario, our proposal would simply over-parametrize the standard Euclidean model.

Figure 5.9: 2D embeddings of prototypes and of real molecule samples with (right, with weight 0.1) and without (left) contrastive regularization, for runs using the exact same random seed. Points in the prototypes tend to cluster and collapse more when no regularization is used, implying that the optimal transport plan is no longer uniquely discriminative. Prototype 1 (red) is enlarged for clarity: without regularization (left), it is clumped together, but with regularization (right), it is distributed across the space.

A first obstacle. Our first empirical trials with OT-enhanced GNNs showed that a model trained with only the Wasserstein component would sometimes perform similarly to the Euclidean baseline, in spite of its greater representational power. Since these two models achieved similar test and train performance, the absence of improvement in generalization was most likely not due to overfitting.

the cause. Further investigation revealed that the Wasserstein model would naturally displace the points in each of its free point clouds in such a way that the OT plan T obtained by maximizing ∑_{vj} T_vj ⟨h_v, q_i^j⟩ was not discriminative, i.e. many other transports would yield a similar Wasserstein cost. Indeed, as shown in Eq. (5.56), if each point cloud collapses to its mean, then the Wasserstein geometry collapses to Euclidean geometry. In this scenario, any transport plan yields the same Wasserstein cost. One may speculate that it was locally easier for the model to extract valuable information by behaving like the Euclidean component, preventing it from exploring other regions of the

142 5.4 otgnn: optimal transport graph neural networks optimization landscape. To better understand this situation, consider the scenario in which a subset of points in a free point cloud “collapses”, i.e. become close to each other (see Figure 5.9), thus sharing similar dis- tances to all the node embeddings of real input graphs. The submatrix of the OT matrix corresponding to these collapsed points can be equally replaced by any other submatrix with the same marginals (i.e. same two vectors obtained by summing rows or columns), meaning that the OT matrix is not discriminative. In general, we want to avoid any two rows or columns in the Wasserstein cost matrix being proportional. An additional problem of point collapsing is that it is a non-escaping situation when using gradient-based learning methods. The reason is that gradients of these collapsed points would become and remain identical, thus nothing will encourage them to “separate” in the future. contrastive regularization. This observation has lead us to consider the use of a regularizer which would encourage the model to displace its free point clouds such that the OT plans it computes would be discriminative against chosen contrastive transport plans. Namely, consider a point cloud Y of node embeddings and let Ti be an OT i plan obtained in the computation of W(Y, Qi); for each T , we then i build a set N(T ) ⊂ CYQi of noisy/contrastive transports. If we denote by WT(X, Y) := ∑kl Tklc(xk, yl) the Wasserstein cost obtained for the particular transport T, then our contrastive regularization consists in maximizing the term:

∑_i log( e^{−W_{T^i}(Y, Q_i)} / ( e^{−W_{T^i}(Y, Q_i)} + ∑_{T ∈ N(T^i)} e^{−W_T(Y, Q_i)} ) ),    (5.57)

which can be interpreted as the log-likelihood that the correct transport T^i be (as it should) a better minimizer of W_T(Y, Q_i) than its negative samples. This can be considered as an approximation of log(Pr(T^i | Y, Q_i)), where the partition function is approximated by our selection of negative examples, as done e.g. by Nickel & Kiela [NK17b]. Its effect is shown in Figure 5.9.

on the choice of contrastive samples. The selection of negative examples must reflect the following trade-off: (i) it should not be too large, for computational efficiency, while (ii) it should contain sufficiently meaningful and challenging contrastive samples. Our experiments

143 non-euclidean geometry for graph-like data were conducted with ten negative samples for each correct transport plan. Five of them were obtained by initializing a matrix with uniform i.i.d entries from [0, 10) and performing around five Sinkhorn iterations [Cut13] in order to make the matrix satisfy the marginal constraints. The other five were obtained by randomly permuting the columns of the correct transport plan. The latter choice has the desirable effect of penal- izing the points of a free point cloud Qi to collapse onto the same point. i Indeed, the rows of T ∈ CHQi index points in H, while its columns i index points in Qi. Note that replacing the set N(T ) with a singleton {T} for a contrastive random variable T would let us rewrite Eq. (5.57) 4 as ∑i log σ(WT − WTi ), reminiscent of noise contrastive estimation [GH10].

5.4.3 Optimization & Differentiation

Backpropagating gradients through OT has been the subject of recent research investigations: Genevay et al. [GPC17] explain how to unroll and differentiate through the Sinkhorn procedure solving OT, which was extended by Schmitz et al. [Sch+18] to Wasserstein barycenters. However, more recently, Xu [Xu19] proposed to simply invoke the envelope theorem [Afr71] to support the idea of keeping the OT plan fixed during the back-propagation of gradients through Wasserstein distances. For the sake of simplicity and training stability, we resort to the latter procedure: keeping T fixed during back-propagation.

5.4.4 Computational Complexity

wasserstein. Computing the Wasserstein OT plan between two point clouds consists in the minimization of a linear function under linear constraints. It can either be performed exactly by using network simplex methods or interior point methods as done by [PW09] in time Õ(n³), or approximately up to ε via the Sinkhorn algorithm [Cut13] in time Õ(n²/ε³). More recently, [DGK18] proposed an algorithm solving OT up to ε with time complexity Õ(min{n^{9/4}/ε, n²/ε²}) via a primal-dual method inspired from accelerated gradient descent. In our experiments, we used the Python Optimal Transport (POT) library [FC17]. We noticed empirically that the EMD solver yielded faster and

4where σ(·) is the sigmoid function.

more accurate solutions than Sinkhorn for our datasets, because the graphs and point clouds were small enough (< 30 elements). However, Sinkhorn may take the lead for larger graphs. An important note is that in order to accurately solve OT via Sinkhorn, one has to choose a low entropy regularization ε. However, the lower ε, the more unstable Sinkhorn becomes. To counter this effect, a stabilized version of Sinkhorn [PC+19] allows for lower choices of ε while remaining stable, at the cost of a slight additional memory requirement.

gromov-wasserstein. Computing the Gromov-Wasserstein OT plan between two adjacency tensors consists in the minimization of a non-convex, quadratic function under linear constraints. Note that if n is the number of points, evaluating the Gromov-Wasserstein discrepancy naively for a given transport plan costs Ω(n⁴) operations. However, a smart refactoring [PCS16, proposition 1] allows reducing it to O(n³) whenever the ground metric has a particular structure, e.g. ‖· − ·‖² or −⟨·, ·⟩; note that since we use order-3 adjacency tensors instead of (order-2) similarity matrices, we had to generalize this trick to the multidimensional case (indeed, we compare edge embeddings instead of scalar similarities). Regarding the choice of optimization algorithm, we adapted the exact solver of Flamary et al. [FC17] to multidimensional adjacency tensors, which internally iteratively calls an EMD solver on the successive linearizations described in [PC+19, Eq. (10.28)].

general remarks. Significant speed-ups could potentially be obtained by rewriting the POT library so that it solves OT in batches over GPUs. In our experiments, we ran all jobs on CPUs. A slow-down by a factor of 4 was observed when going from purely Euclidean to purely Wasserstein models, and by a factor of 10 from purely Euclidean to purely Gromov-Wasserstein.
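For illustration, the snippet below contrasts the exact EMD solver and the (stabilized) Sinkhorn solver from POT on a toy problem; the regularization value and the stabilized-method option are reasonable defaults rather than the exact settings used in our experiments.

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
n, m = 20, 25
X, Y = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
C = ot.dist(X, Y)                                   # squared Euclidean costs

T_exact = ot.emd(a, b, C)                           # exact network-simplex plan
T_sink = ot.sinkhorn(a, b, C, reg=1e-2,
                     method="sinkhorn_stabilized")  # entropic, stabilized
print(np.sum(T_exact * C), np.sum(T_sink * C))      # entropic cost is slightly larger
```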

5.4.5 Theoretical Analysis: Universality & Definiteness

In this section we show that the standard architecture lacks a fundamental property of universal approximation of functions defined on point clouds, and that our proposed architecture recovers this property. We will denote by X_d^n the set of point clouds X = {x_i}_{i=1}^n of size n in R^d.


5.4.5.1 Universality

As seen in Section 5.4.1, we have replaced the sum aggregation − followed by the Euclidean inner-product − by Wasserstein discrepancies. How does this affect the function class and representations? A common framework used to analyze the geometry inherited from similarities and discrepancies is that of kernel theory. A kernel k on a set X is a symmetric function k : X × X → R, which can either measure similarities or discrepancies. An important property of a given kernel on a space X is whether simple functions defined on top of this kernel can approximate any continuous function on the same space. This is called universality: a crucial property to regress unknown target functions.

Universal kernels. A kernel k defined on X_d^n is said to be universal if the following holds: for any compact subset X ⊂ X_d^n, the set of functions of the form ∑_{j=1}^m α_j σ(k(·, θ_j) + β_j)⁵ is dense in the set C(X) of continuous functions from X to R, w.r.t. the sup norm ‖·‖_{∞,X}, σ denoting the sigmoid. Although the notion of universality does not indicate how easy it is in practice to learn the correct function, it at least guarantees the absence of a fundamental bottleneck of the model using this kernel.

Theorem 5.11. We have that:

1. The aggregation kernel agg is not universal.

2. The Wasserstein kernel W_{L2} defined in Theorem 5.13 is universal.

Proof:

1. Let us first justify why agg is not universal. Consider a function f ∈ C(X) such that there exist X, Y ∈ X satisfying both f(X) ≠ f(Y) and ∑_k x_k = ∑_l y_l. Clearly, any function of the form ∑_i α_i σ(agg(W_i, ·) + θ_i) would take equal values on X and Y and hence would not approximate f arbitrarily well.

2. To justify that W is universal, we take inspiration from the proof of universality of neural networks [Cyb89].

5 For m ∈ N*, α_j, β_j ∈ R and θ_j ∈ X_d^n.

notation. Denote by M(X) the space of finite, signed regular Borel measures on X.

definition. We say that σ is discriminatory w.r.t. a kernel k if, for a measure µ ∈ M(X),

∫_X σ(k(Y, X) + θ) dµ(X) = 0

for all Y ∈ X_d^n and θ ∈ R implies that µ ≡ 0. We start by recalling a lemma coming from the original paper on the universality of neural networks by Cybenko [Cyb89].

lemma. If σ is discriminatory w.r.t. k then k is universal.

Proof: Let S be the subset of functions of the form ∑_{i=1}^m α_i σ(k(·, Q_i) + θ_i) for any θ_i ∈ R, Q_i ∈ X_d^n and m ∈ N*, and denote by S̄ the closure of S in C(X)⁶. Assume by contradiction that S̄ ≠ C(X). By the Hahn-Banach theorem, there exists a bounded linear functional L on C(X) such that for all h ∈ S̄, L(h) = 0 and such that there exists h_0 ∈ C(X) s.t. L(h_0) ≠ 0. By the Riesz representation theorem, this bounded linear functional is of the form:

L(h) = ∫_{X ∈ X} h(X) dµ(X),

for all h ∈ C(X), for some µ ∈ M(X). Since σ(k(Q, ·) + θ) is in S̄, we have

∫_X σ(k(Q, X) + θ) dµ(X) = 0

for all Q ∈ X_d^n and θ ∈ R. Since σ is discriminatory w.r.t. k, this implies that µ = 0 and hence L ≡ 0, which is a contradiction with L(h_0) ≠ 0. Hence S̄ = C(X), i.e. S is dense in C(X) and k is universal.



Now let us look at the part of the proof that is new.

6 W.r.t. the topology defined by the sup norm ‖f‖_{∞,X} := sup_{X∈X} |f(X)|.


lemma. σ is discriminatory w.r.t. W_{L2}.

Proof: Note that for any X, Y, θ, ϕ, when λ → +∞ we have that σ(λ(W_{L2}(X, Y) + θ) + ϕ) goes to 1 if W_{L2}(X, Y) + θ > 0, to 0 if W_{L2}(X, Y) + θ < 0, and to σ(ϕ) if W_{L2}(X, Y) + θ = 0. Denote by Π_{Y,θ} := {X ∈ X | W_{L2}(X, Y) − θ = 0} and B_{Y,θ} := {X ∈ X | √(W_{L2}(X, Y)) < θ} for θ ≥ 0, and ∅ for θ < 0. By the Lebesgue Bounded Convergence Theorem we have:

0 = lim_{λ→+∞} ∫_{X ∈ X} σ(λ(W_{L2}(X, Y) − θ) + ϕ) dµ(X) = σ(ϕ) µ(Π_{Y,θ}) + µ(X \ B_{Y,√θ}).

Since this is true for any ϕ, it implies that µ(Π_{Y,θ}) = µ(X \ B_{Y,√θ}) = 0. From µ(X) = 0 (because B_{Y,√θ} = ∅ for θ < 0), we also have µ(B_{Y,√θ}) = 0. Hence µ is zero on all balls defined by the metric √W_{L2}. From the Hahn decomposition theorem, there exist disjoint Borel sets P, N such that X = P ∪ N and µ = µ⁺ − µ⁻, where µ⁺(A) := µ(A ∩ P), µ⁻(A) := µ(A ∩ N) for any Borel set A, with µ⁺, µ⁻ being positive measures. Since µ⁺ and µ⁻ coincide on all balls of a finite-dimensional metric space, they coincide everywhere [Hof76] and hence µ ≡ 0.



Universality of the Wasserstein kernel W_{L2} essentially comes from the fact that its square root defines a metric, and in particular from the axiom of separation of distances: if d(x, y) = 0 then x = y.

5.4.5.2 Definiteness

For the sake of simplified mathematical analysis, similarity kernels are often required to be positive definite (p.d.), which corresponds to discrepancy kernels being conditionally negative definite (c.n.d.). Although such a property has the benefit of yielding the mathematical framework of Reproducing Kernel Hilbert Spaces, it essentially implies linearity, i.e. the possibility to embed the geometry defined by that kernel in a linear vector space. We now show that, interestingly, the Wasserstein kernel we used does not satisfy this property, and hence constitutes an interesting instance of a universal, non-p.d. kernel. Let us recall these notions.


Kernel definiteness. A kernel k is positive definite (p.d.) on X if for n ∈ N*, x_1, ..., x_n ∈ X and c_1, ..., c_n ∈ R, we have ∑_{ij} c_i c_j k(x_i, x_j) ≥ 0. It is conditionally negative definite (c.n.d.) on X if for n ∈ N*, x_1, ..., x_n ∈ X and c_1, ..., c_n ∈ R such that ∑_i c_i = 0, we have ∑_{ij} c_i c_j k(x_i, x_j) ≤ 0. These two notions relate to each other via the below result [BTB05]:

Proposition 5.12. Let k be a symmetric kernel on X , let x0 ∈ X and define the kernel:

k̃(x, y) := −(1/2) [ k(x, y) − k(x, x_0) − k(y, x_0) + k(x_0, x_0) ].    (5.58)

Then k̃ is p.d. if and only if k is c.n.d. Example: k = ‖· − ·‖₂² and x_0 = 0 yield k̃ = ⟨·, ·⟩.

The aggregating kernel against which we wish to compare the Wasserstein kernel is the inner-product over a summation of the points in the point clouds: agg(X, Y) := ⟨∑_i x_i, ∑_j y_j⟩. One can easily show that this also defines a p.d. kernel, and that agg(·, ·) ≤ n² W(·, ·). However, the Wasserstein kernel is not p.d., as shown by the below theorem.

Theorem 5.13. We have that:

1. The (similarity) Wasserstein kernel W_dot is not positive definite;

2. The (discrepancy) Wasserstein kernel W_{L2} is not conditionally negative definite, where:

W_{L2}(X, Y) := min_{T ∈ C_XY} ∑_{ij} T_ij ‖x_i − y_j‖₂²,    W_dot(X, Y) := max_{T ∈ C_XY} ∑_{ij} T_ij ⟨x_i, y_j⟩.    (5.59)

Proof:

1. We build a counterexample. We consider 4 point clouds of size n = 2 and dimension d = 2. First, define u_i = (⌊i/2⌋, i mod 2) for i ∈ {0, ..., 3}. Then take X_1 = {u_0, u_1}, X_2 = {u_0, u_2}, X_3 = {u_0, u_3} and X_4 = {u_1, u_2}. On the one hand, if W(X_i, X_j) = 0, then all vectors in the two point clouds are orthogonal, which can only happen for {i, j} =


{1, 2}. On the other hand, if W(X_i, X_j) = 1, then either i = j = 3 or i = j = 4. This yields the following Gram matrix

(W(X_i, X_j))_{1≤i,j≤4} = (1/2) *
[ 1  0  1  1 ]
[ 0  1  1  1 ]    (5.60)
[ 1  1  2  1 ]
[ 1  1  1  2 ]

whose determinant is −1/16, which implies that this matrix has a negative eigenvalue.

2. This comes from proposition 5.12. Choosing k = W_{L2} and x_0 = 0 to be the trivial point cloud made of n copies of the zero vector yields k̃ = W_dot. Since k̃ is not positive definite by the previous point of the theorem, k is not conditionally negative definite by proposition 5.12.
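The counterexample of item 1 can be checked numerically: since by Proposition 5.14 (below) W_dot is attained at a rescaled permutation, enumerating the two matchings of size-2 point clouds suffices. A quick verification sketch:

```python
import numpy as np
from itertools import permutations

u = [np.array([i // 2, i % 2], dtype=float) for i in range(4)]
clouds = [[u[0], u[1]], [u[0], u[2]], [u[0], u[3]], [u[1], u[2]]]

def w_dot(X, Y):
    n = len(X)
    return max(sum(float(X[s[j]] @ Y[j]) for j in range(n)) / n
               for s in permutations(range(n)))

G = np.array([[w_dot(Xi, Xj) for Xj in clouds] for Xi in clouds])
print(np.linalg.eigvalsh(G))   # one eigenvalue is negative, so W_dot is not p.d.
```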



5.4.5.3 Shape of the Optimal Transport Plan for Point Clouds of Same Size

The below result describes the shape of OT plans for point clouds of same size. For the sake of curiosity, we also illustrate in Figure 5.3 the OT for point clouds of different sizes. We note that non-square transports seem to remain relatively sparse as well. This is in line with our empirical observations.

Proposition 5.14. For X, Y ∈ X_d^n there exists a rescaled permutation matrix (1/n)(δ_{iσ(j)})_{1≤i,j≤n} which is an optimal transport plan, i.e.

W_{L2}(X, Y) = (1/n) ∑_{j=1}^n ‖x_{σ(j)} − y_j‖₂²,    W_dot(X, Y) = (1/n) ∑_{j=1}^n ⟨x_{σ(j)}, y_j⟩.    (5.61)

Proof. It is well known from Birkhoff's theorem that every square doubly-stochastic matrix is a convex combination of permutation matrices. Since the Wasserstein cost for a given transport T is a linear function, it is also a convex/concave function, and hence it is maximized/minimized over the convex compact set of couplings at one of its extremal points, namely one of the rescaled permutations, yielding the desired result.
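In line with Proposition 5.14 and its proof, for equally sized point clouds the optimal plan can be computed as an assignment problem; a hedged sketch using scipy's Hungarian-type solver (one possible exact solver, not necessarily the one used in our experiments):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, d = 6, 3
X, Y = rng.normal(size=(n, d)), rng.normal(size=(n, d))

C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # squared L2 costs
rows, cols = linear_sum_assignment(C)                      # optimal permutation sigma
W_L2 = C[rows, cols].mean()                                # matches Eq. (5.61)
```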


5.4.6 Additional Related Work

GNN were introduced by Gori et al. [GMS05] and Scarselli et al. [Sca+08] as a form of recurrent neural network. GCN made their first appearance later on in various forms. Duvenaud et al. [Duv+15] and Atwood et al. [AT16] proposed a propagation rule inspired from convolution and diffusion, although these methods do not scale to graphs with either large degree distribution or node cardinality, respectively. Niepert et al. [NAK16] defined a GCN as a 1D convolution on a chosen node ordering. Kearnes et al. [Kea+16] also used graph convolutions with great success to generate high quality molecular fingerprints. Efficient spectral methods were also proposed [Bru+13]; [DBV16a]. Kipf & Welling [KW17b] simplified their propagation rule, motivated as well from spectral graph theory [HVG11], achieving impressive empirical results. Most of these different architectures were later unified into a message passing neural network framework by Gilmer et al. [Gil+17], which applies them to molecular property prediction. A directed variant of message passing was motivated by Dai et al. [DDS16], which was later used to improve the state-of-the-art in molecular property prediction on a wide variety of datasets by ChemProp [Yan+19]. Another notable application includes recommender systems [Yin+18a]. Ying et al. [Yin+18b] proposed DiffPool, which performs a pooling operation for GNN in a hierarchical fashion. Inspired by DeepSets [Zah+17], Xu et al. [Xu+19b] suggest both a simplification and a generalization of certain GNN architectures, which should theoretically be powerful enough to discriminate between any different local neighborhoods, provided that hidden dimensions grow as much as the input size. Other recent approaches suggest modifying the sum-aggregation of node embeddings in the GCN architecture with the aim to preserve more information [Kon+18]; [Pei+20]. In particular, Hongbin et al. [Pei+20] propose to preserve more semantic information by performing a bi-level aggregation which depends on the local geometry of the neighborhood of the given node in the graph. Other recent geometry-inspired GNN include adaptations to embeddings lying in hyperbolic spaces [Cha+19]; [LNK19] or spaces of constant sectional curvature [BBG20].

[Ye+20] also suggests an architecture enhancement incorporating information relating to the discrete Ollivier-Ricci curvature of the graph.

151 non-euclidean geometry for graph-like data

We refer the reader to [Wu+19] for a more extensive review of the GNN literature.

Optimal Transport. Comprehensive and extensive introductions to the mathematical and computational aspects of OT are respectively provided by Villani [Vil08] and Peyré & Cuturi [PC+19]. Gromov-Wasserstein OT was recently introduced in the mathematical community by Mémoli [Mém11], in order to compare metric measure spaces. Peyré et al. [PCS16] later contributed to its computational aspect by suggesting to solve it via iteratively optimizing an intermediate Wasserstein cost. Efficient approximations of both the Wasserstein and Gromov-Wasserstein distances were proposed by [Bon+15] and [Tit+19b] respectively, who propose to "slice" the distributions by taking random 1D projections, and then solve the problem in closed form in 1D. Other approaches learning embeddings involving Wasserstein distances include Wasserstein dictionary learning [Sch+18] and the GOT framework for graph comparison [Mar+19], while Togninalli et al. [Tog+19] propose a non-parametric graph kernel involving Wasserstein distances. On the other hand, Gromov-Wasserstein OT was applied to compare word embeddings in different languages [AJ18]. The very related problem of graph matching [ZS18a] was also tackled via the use of Gromov-Wasserstein geometry [Xu+19a]. Titouan et al. [Tit+19a] propose a "fused" version of Wasserstein and Gromov-Wasserstein OT, aimed at capturing interactions between node feature distributions and structural information in graphs, while Xu [Xu19] suggests decomposing graphs into "factors" in the Gromov-Wasserstein geometry via the use of Fréchet barycenters. Finally, note that although our work was empirically validated for relatively small graphs, one could explore potential extensions to large graphs via hierarchical [SS13], structured [AJJ18] or large-scale stochastic [Gen+16] optimal transport.

5.5 empirical validation on small molecular graphs

5.5.1 Experimental Setup

We test our model on 4 benchmark molecular property prediction datasets [Yan+19] including both regression (ESOL, Lipophilicity) and classification (BACE, BBBP) tasks. These datasets cover a variety of different complex chemical properties (e.g. ESOL - water solubility, LIPO - octanol/water distribution coefficient, BACE - inhibition of

human β-secretase, BBBP - blood-brain barrier penetration). We show that our models improve over state-of-the-art baselines.

GNN is the state-of-the-art GNN that we use as our primary baseline, as well as the underlying graph model for our prototype models. Its architecture is described in section 5.1.4. ProtoW-L2/Dot is the model that treats point clouds as point sets, and computes the Wasserstein distances to each point cloud (using either L2 distances or (minus) dot-product costs) as the molecular embedding. ProtoS-L2 is a special case of ProtoW-L2, in which the point clouds have a single point. Instead of using Wasserstein distances, we just compute simple Euclidean distances between the aggregated graph embedding and the point clouds. Here, we omit using dot product distances, as that model is mathematically equivalent to the GNN model. We use the POT library [FC17] to compute Wasserstein distances using the network simplex algorithm (ot.emd), which we find empirically to be faster than using the Sinkhorn algorithm, due to the small size of the graphs present in our datasets. We define the cost matrix by taking the pairwise L2 or negative dot product distances. As mentioned in Section 5.4.3, we fix the transport plan, and only backprop through the cost matrix for computational efficiency. Additionally, since the sum aggregation operator easily accounts for the sizes of input graphs, we multiply the OT distance between two point clouds by their respective sizes. To avoid the problem of point clouds collapsing, we employ the contrastive regularizer defined in Section 5.4.2. More details about the experimental setup are given in Section 5.5.3. We also tried extensions to our prototype models using Gromov-Wasserstein geometry. However, we found that these models proved much more difficult to optimize in practice.

5.5.2 Experimental Results

Regression and Classification. The results on the property prediction datasets are shown in Table 5.4. We find that the prototype models outperform the GNN on all 4 property prediction tasks, showing that this model paradigm can be more powerful than conventional GNN models. Moreover, the prototype models using Wasserstein distance (ProtoW-L2/Dot) achieve better performance on 3 out of 4 of the datasets compared to the prototype model using only Euclidean


Table 5.4: Results of different models on the property prediction datasets. Best in bold, second best underlined. Proto methods are ours. Lower RMSE is better, while higher AUC is better. The prototype-based models generally outperform the GNN, and the Wasserstein models perform better than the model using only simple Euclidean distances, suggesting that the Wasserstein distance provides more powerful representations. Wasserstein models trained with contrastive regularization as described in section 5.4.2 outperform those without. Implementation of these experiments was done in collaboration with co-authors in [Béc+20].

Model                   ESOL (RMSE)    Lipo (RMSE)    BACE (AUC)     BBBP (AUC)
# graphs                n = 1128       n = 4199       n = 1512       n = 2039

GNN/Chemprop            .635 ± .027    .646 ± .041    .865 ± .013    .915 ± .010
ProtoS-L2               .611 ± .034    .580 ± .016    .865 ± .010    .918 ± .009
ProtoW-Dot (no reg.)    .608 ± .029    .637 ± .018    .867 ± .014    .919 ± .009
ProtoW-Dot              .594 ± .031    .629 ± .015    .871 ± .014    .919 ± .009
ProtoW-L2 (no reg.)     .616 ± .028    .615 ± .025    .870 ± .012    .920 ± .010
ProtoW-L2               .605 ± .029    .604 ± .014    .873 ± .015    .920 ± .010

distances (ProtoS-L2). This confirms our hypothesis that the Wasserstein distance confers greater discriminative power compared to traditional aggregation methods (summation).

Noise Contrastive Regularizer. Without any constraints, the Wasserstein prototype model will often collapse the set of points in a point cloud into a single point. As mentioned in Section 5.4.2, we use a contrastive regularizer to force the model to meaningfully distribute point clouds in the embedding space. We show 2D embeddings in Fig. 5.9, illustrating that without contrastive regularization, prototype point clouds are often displaced close to their mean, while regularization forces them to nicely scatter.


Learned Embedding Space: Qualitative and Quantitative Results. To further support our claim that the Wasserstein distance provides more powerful representations, we also examine the embedding space of the GNN baseline and our Wasserstein model. Using the best performing models, we compute the pairwise differences in embedding vectors and in labels for each pair of test data points on the ESOL dataset. Then, we compute two correlation measures, the Spearman correlation coefficient (ρ) and the Pearson correlation coefficient (r). This procedure is reminiscent of evaluation tasks for word embeddings w.r.t. how semantic similarity in embedding space correlates with human labels [LSM13].
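A sketch of this evaluation protocol with scipy (the embeddings and labels below are random placeholders standing in for test-set molecular embeddings and ESOL values):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 16))       # test-set graph embeddings
labels = rng.normal(size=50)          # e.g. measured solubility values

iu = np.triu_indices(len(labels), k=1)                        # unordered pairs
d_emb = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)[iu]
d_lab = np.abs(labels[:, None] - labels[None, :])[iu]

rho, _ = spearmanr(d_emb, d_lab)      # Spearman correlation coefficient
r, _ = pearsonr(d_emb, d_lab)         # Pearson correlation coefficient
print(rho, r)
```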

Our ProtoW-L2 achieves better ρ and r scores compared to the GNN model (Table 5.5), indicating that our Wasserstein model constructs more meaningful embeddings with respect to the label distribution. Indeed, Figure 5.10 plots the pairwise scores for the GNN model (left) and the ProtoW-L2 model (right). Our ProtoW-L2 model, trained to optimize distances in the embedding space, produces more meaningful representations with respect to the label of interest.

Table 5.5: The Spearman and Pearson correlation coefficients on the ESOL dataset for the GNN and ProtoW-L2 model w.r.t. the pairwise difference in embedding vectors and labels. Implementation of these experiments was done in collaboration with co-authors in [Béc+20].

Model         Spearman ρ     Pearson r

GNN           .424 ± .029    .393 ± .049
ProtoS-L2     .561 ± .087    .414 ± .141
ProtoW-Dot    .592 ± .150    .559 ± .216
ProtoW-L2     .815 ± .026    .828 ± .020

Moreover, as can be seen in Figure 5.11, our model also provides more robust molecular embeddings compared to the baseline, in the following sense: we observe that a small perturbation of a molecular embedding corresponds to a small change in predicted property value – a desirable phenomenon that rarely holds for the baseline GNN model. Qualitatively, this is shown in Figure 5.11. Our Wasserstein prototype

155 non-euclidean geometry for graph-like data models yields smoother heatmaps, which is desirable for molecular optimization in the latent space via gradient methods.

5.5.3 Further Experimental Details

Each dataset is split randomly 5 times into 80%:10%:10% train, validation and test sets. For each of the 5 splits, we run each model 5 times to reduce the variance in particular data splits (resulting in each model being run 25 times). We search hyperparameters for each split of the data, and then take the average performance over all the splits. The hyperparameters are separately searched for each data split, so that the model performance is based on a completely unseen test set, and there is no data leakage across data splits. The models are trained for 150 epochs with early stopping if the validation error has not improved in 50 epochs, and a batch size of 16. We train the models using the Adam optimizer with a learning rate of 5e-4. For the prototype models, we use different learning rates for the GNN and the point clouds (5e-4 and 5e-3 respectively), because empirically we find that the gradients are much smaller for the point clouds. The molecular datasets used for experiments here are small in size (varying from 1-4k data points), so this is a fair method of comparison, and is indeed what is done in other works on molecular property prediction [Yan+19]. More details are given in Table 5.6.
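A hedged PyTorch sketch of the two-learning-rate setup mentioned above (5e-4 for the GNN parameters and 5e-3 for the free point clouds), using parameter groups; module and tensor names are placeholders, not the actual training code.

```python
import torch

gnn = torch.nn.Linear(8, 8)                                  # stand-in for the GNN
point_clouds = torch.nn.Parameter(torch.randn(10, 10, 5))    # n_pc x pc_size x dim

optimizer = torch.optim.Adam([
    {"params": gnn.parameters(), "lr": 5e-4},    # GNN learning rate
    {"params": [point_clouds], "lr": 5e-3},      # point-cloud learning rate
])
```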

Figure 5.10: Correlation between graph embedding distances (X axis) and label distances (Y axis) on the ESOL dataset. Left: GNN model; right: ProtoW-L2 model.

5.6 conclusive summary

In this chapter, we introduced two natural extensions of GCNs: one to the stereographic models of both positive and negative curvature in a unified manner, and another dispensing with the final node aggregation step by leveraging OT. We showed how κ-GCN smoothly interpolates between positive and negative curvature, allowing the curvature of the model to be trained independently of an initial sign choice, and we introduced an efficient regularizer which prevents our OTGNN from collapsing back to standard aggregation.

(a) ESOL GNN (b) ESOL ProtoW-L2

(c) LIPO GNN (d) LIPO ProtoW-L2

Figure 5.11: 2D heatmaps of t-SNE [MH08] projections of molecular embeddings (before the last linear layer) w.r.t. their associated predicted labels. Heat colors are interpolated based only on the test molecules from each dataset. Comparing (a) vs. (b) and (c) vs. (d), we observe a smoother space for our model compared to the GNN baseline, as explained in the main text.
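A rough sketch of how such a heatmap can be produced, assuming embeddings (an n × d array) and predicted labels (a length-n array) are available for the test molecules; the projection and interpolation settings used for Figure 5.11 may differ.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_heatmap(embeddings, predicted_labels, title):
    # Project the high-dimensional molecular embeddings to 2D with t-SNE.
    xy = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    # Interpolate heat colors between the projected test molecules only.
    plt.tricontourf(xy[:, 0], xy[:, 1], predicted_labels, levels=20, cmap="viridis")
    plt.scatter(xy[:, 0], xy[:, 1], c="k", s=4)
    plt.colorbar(label="predicted property")
    plt.title(title)
    plt.show()
```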

Table 5.6: The parameters for our models (the prototype models all use the same GNN base model) and the values used for the hyperparameter search. When a search list contains a single value, we did not search over that parameter and used the specified value for all models.

Parameter name   Search values        Description
n epochs         {150}                Number of epochs trained
batch size       {16}                 Size of each batch
lr               {5e-4}               Overall learning rate for model
lr pc            {5e-3}               Learning rate for the free parameter point clouds
n layers         {5}                  Number of layers in the GNN
n hidden         {50, 200}            Size of hidden dimension in GNN
n ffn hidden     {1e2, 1e3, 1e4}      Size of the output feed forward layer
dropout gnn      {0.}                 Dropout probability for GNN
dropout fnn      {0., 0.1, 0.2}       Dropout probability for feed forward layer
n pc             {10, 20}             Number of free parameter point clouds in prototype models
pc size          {10}                 Number of points in free parameter point clouds
pc hidden        {5, 10}              Size of hidden dimension in point clouds
nc coef          {0., 0.01, 0.1, 1}   Coefficient for noise contrastive regularization
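For reference, the searched values above can be written down as a simple grid and enumerated exhaustively; the snippet below is only a sketch (the underscored names mirror the table, and the actual search procedure used in the experiments may differ).

```python
from itertools import product

SEARCH_SPACE = {
    "n_hidden": [50, 200],
    "n_ffn_hidden": [100, 1000, 10000],
    "dropout_fnn": [0.0, 0.1, 0.2],
    "n_pc": [10, 20],
    "pc_hidden": [5, 10],
    "nc_coef": [0.0, 0.01, 0.1, 1.0],
}
FIXED = {"n_epochs": 150, "batch_size": 16, "lr": 5e-4, "lr_pc": 5e-3,
         "n_layers": 5, "dropout_gnn": 0.0, "pc_size": 10}

# Enumerate every candidate configuration in the grid.
keys = list(SEARCH_SPACE)
configs = [dict(zip(keys, values), **FIXED)
           for values in product(*SEARCH_SPACE.values())]
print(len(configs))  # 2 * 3 * 3 * 2 * 2 * 4 = 288 candidate configurations
```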

6 THESIS CONCLUSION

In this dissertation, we started by analyzing the intrinsic geometry of data manifolds embedded in Euclidean spaces, seeking to unveil how CNN representations become disentangled during training. We then took interest in explicitly embedding data into non-Euclidean domains, with a particular emphasis on Riemannian manifolds of constant sectional curvature: hyperbolic and spherical spaces. We developed a word embedding method in the Poincaré disk, as well as a family of elementary functions playing the role of neural networks. Then, motivated by the need for powerful Riemannian adaptive optimization algorithms to let our hyperbolic methods compete with their Euclidean counterparts, we developed Riemannian analogues of Adam and Adagrad. Finally, we extended the state-of-the-art GCN and DMPNN architectures to spaces of constant curvature, and dispensed with their potentially destructive final node aggregation step by leveraging tools from OT geometry. A natural and interesting extension of our work would be to combine our proposed κ-GCN and OTGNN architectures into a κ-OTGNN.

BIBLIOGRAPHY

[Abr+20] Louis Abraham, Gary Becigneul,´ Benjamin Coleman, An- shumali Shrivastava, Bernhard Scholkopf,¨ and Alexander J Smola. Bloom Origami Assays: Practical Group Testing.“ ” In: Under Review (2020) (cit. on p. ix). [ACG02] Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms.“ ” In: Journal of Computer and System Sciences 64.1 (2002), pp. 48–75 (cit. on p. 113). [ADM14] Reka´ Albert, Bhaskar DasGupta, and Nasim Mobasheri. Topological implications of negative curvature for biologi- ” cal and social networks.“ In: Physical Review E 89.3 (2014), p. 032811 (cit. on p. 65). [Afr71] SN Afriat. Theory of maxima and the method of La- ” grange.“ In: SIAM Journal on Applied Mathematics 20.3 (1971), pp. 343–357 (cit. on p. 144). [AH10] Ben Andrews and Christopher Hopper. The Ricci flow in Riemannian geometry: a complete proof of the differentiable 1/4- pinching sphere theorem. springer, 2010 (cit. on p. 18). [AJ18] David Alvarez-Melis and Tommi S Jaakkola. Gromov- ” wasserstein alignment of word embedding spaces.“ In: Association for Computational Linguistics (ACL). 2018 (cit. on p. 152). [AJJ18] David Alvarez-Melis, Tommi Jaakkola, and Stefanie Jegelka. Structured optimal transport.“ In: International Conference ” on Artificial Intelligence and Statistics. 2018, pp. 1771–1780 (cit. on p. 152). [Alb08] Ungar Abraham Albert. Analytic hyperbolic geometry and Albert Einstein’s special theory of relativity. World scientific, 2008 (cit. on pp. 44, 46).

[Ali+20a] Foivos Alimisis, Antonio Orvieto, Gary Becigneul,´ and Aurelien Lucchi. Practical Accelerated Optimization on ” Riemannian Manifolds.“ In: Under Review (2020) (cit. on p. ix). [Ali+20b] Foivos Alimisis, Antonio Orvieto, Gary Becigneul,´ and Aurelien´ Lucchi. A Continuous-time Perspective for Mod- ” eling Acceleration in Riemannian Optimization.“ In: AIS- TATS (2020) (cit. on p. viii). [AN07] Shun-ichi Amari and Hiroshi Nagaoka. Methods of informa- tion geometry. Vol. 191. American Mathematical Soc., 2007 (cit. on p. 11). [Ans+13a] Fabio Anselmi, Joel Z Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, and Tomaso Poggio. Magic materials: ” a theory of deep hierarchical architectures for learning sensory representations.“ In: CBCL paper (2013) (cit. on pp. 16, 18). [Ans+13b] Fabio Anselmi, Joel Z Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, and Tomaso Poggio. Unsupervised ” learning of invariant representations in hierarchical ar- chitectures.“ In: arXiv preprint arXiv:1311.4158 (2013) (cit. on pp. 7, 16, 18, 19). [AP14] Fabio Anselmi and Tomaso Poggio. Representation learn- ing in sensory cortex: a theory. Tech. rep. Center for Brains, Minds and Machines (CBMM), 2014 (cit. on pp. 16, 18). [Aro+16] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to pmi- ” based word embeddings.“ In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 385–399 (cit. on p. 9). [ARP16] Fabio Anselmi, Lorenzo Rosasco, and Tomaso Poggio. On ” invariance and selectivity in representation learning.“ In: Information and Inference: A Journal of the IMA 5.2 (2016), pp. 134–158 (cit. on pp. 16, 18). [AT16] James Atwood and Don Towsley. Diffusion-convolutional ” neural networks.“ In: Advances in neural information process- ing systems. 2016, pp. 1993–2001 (cit. on p. 151).

[AW17] Ben Athiwaratkun and Andrew Wilson. Multimodal Word ” Distributions.“ In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2017, pp. 1645–1656 (cit. on p. 85). [AW18] Ben Athiwaratkun and Andrew Gordon Wilson. Hier- ” archical Density Order Embeddings.“ In: arXiv preprint arXiv:1804.09843 (2018) (cit. on p. 85). [BAH19] Ivana Balazevic, Carl Allen, and Timothy Hospedales. Multi-relational Poincare´ graph embeddings.“ In: Advances ” in Neural Information Processing Systems. 2019, pp. 4463– 4473 (cit. on p. 39). [BBG20] Gregor Bachmann*, Gary Becigneul*,´ and Octavian-Eugen Ganea. Constant Curvature Graph Convolutional Net- ” works.“ In: ICML (2020) (cit. on pp. vii, 39, 125, 136, 138, 151). [BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align ” and translate.“ In: arXiv preprint arXiv:1409.0473 (2014) (cit. on p. 4). [BCC15] Michele Borassi, Alessandro Chessa, and Guido Caldarelli. Hyperbolicity measures democracy in real-world networks.“ ” In: Physical Review E 92.3 (2015), p. 032812 (cit. on pp. 7, 65). [Bec+´ 20] Gary Becigneul,´ Octavian-Eugen Ganea, Benson Chen, Regina Barzilay, and Tommi Jaakkola. Optimal Transport ” Graph Neural Networks.“ In: Under Review (2020) (cit. on pp. vii, 139, 154, 155). [Bec´ 17] Gary Becigneul.´ On the effect of pooling on the geome- ” try of representations.“ In: arXiv preprint arXiv:1703.06726 (2017) (cit. on pp. vii, 16). [Ben13] Yoshua Bengio. Deep learning of representations: Looking ” forward.“ In: International Conference on Statistical Language and Speech Processing. Springer. 2013, pp. 1–37 (cit. on pp. 3, 15, 16).

[BG19] Gary Becigneul´ and Octavian-Eugen Ganea. Riemannian ” Adaptive Optimization Methods.“ In: International Con- ference on Learning Representations. 2019. url: https:// openreview.net/forum?id=r1eiqi09K7 (cit. on pp. vii, 52, 53, 89). [BKH18] Gary Becigneul,´ Yannic Kilcher, and Thomas Hofmann. Learning a Riemannian Metric with an Energy-based ” GAN.“ In: Unpublished (2018) (cit. on p. ix). [BL11] Marco Baroni and Alessandro Lenci. How we BLESSed ” distributional semantic evaluation.“ In: Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Lan- guage Semantics. Association for Computational Linguistics. 2011, pp. 1–10 (cit. on p. 62). [BM13] Joan Bruna and Stephane´ Mallat. Invariant scattering con- ” volution networks.“ In: IEEE transactions on pattern analysis and machine intelligence 35.8 (2013), pp. 1872–1886 (cit. on p. 16). [Boj+16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword ” information.“ In: arXiv preprint arXiv:1607.04606 (2016) (cit. on pp. 3, 8). [Bon+15] Nicolas Bonneel, Julien Rabin, Gabriel Peyre,´ and Hanspeter Pfister. Sliced and radon wasserstein barycenters of mea- ” sures.“ In: Journal of Mathematical Imaging and Vision 51.1 (2015), pp. 22–45 (cit. on p. 152). [Bon13] Silvere Bonnabel. Stochastic gradient descent on Rieman- ” nian manifolds.“ In: IEEE Transactions on Automatic Control 58.9 (2013), pp. 2217–2229 (cit. on pp. 10, 41, 52, 76, 91, 108). [Bos+20] Avishek Joey Bose, Ariella Smofsky, Renjie Liao, Prakash Panangaden, and William L Hamilton. Latent Variable ” Modelling with Hyperbolic Normalizing Flows.“ In: arXiv preprint arXiv:2002.06336 (2020) (cit. on p. 39). [Bow+15] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for ” learning natural language inference.“ In: Proceedings of the

2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 632–642 (cit. on pp. 9, 75). [Bro+17] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learn- ” ing: going beyond euclidean data.“ In: IEEE Signal Process- ing Magazine 34.4 (2017), pp. 18–42 (cit. on pp. 4, 11). [Bro+20] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.“ In: arXiv preprint ” arXiv:2005.14165 (2020) (cit. on p. 3). [Bru+13] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks ” on graphs.“ In: arXiv preprint arXiv:1312.6203 (2013) (cit. on p. 151). [Bru+14] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally connected networks ” on graphs.“ In: International Conference on Learning Represen- tations (ICLR2014), CBLS, April 2014. 2014, http–openreview (cit. on p. 10). [BSL13] Joan Bruna, Arthur Szlam, and Yann LeCun. Learning ” stable group invariant representations with convolutional networks.“ In: arXiv preprint arXiv:1301.3537 (2013) (cit. on p. 20). [BTB05] Sabri Boughorbel, J-P Tarel, and Nozha Boujemaa. Con- ” ditionally positive definite kernels for svm based image recognition.“ In: 2005 IEEE International Conference on Mul- timedia and Expo. IEEE. 2005, pp. 113–116 (cit. on p. 149). [BU01] Graciela S Birman and Abraham A Ungar. The hyperbolic ” derivative in the poincare´ ball model of hyperbolic geome- try.“ In: Journal of mathematical analysis and applications 254.1 (2001), pp. 321–333 (cit. on pp. 72, 73). [Can+97] James W Cannon, William J Floyd, Richard Kenyon, Wal- ter R Parry, et al. Hyperbolic geometry.“ In: Flavors of ” geometry 31 (1997), pp. 59–115 (cit. on pp. 41, 42).

[CBG21] Calin Cruceru, Gary Becigneul,´ and Octavian-Eugen Ganea. Computationally Tractable Riemannian Manifolds for Graph ” Embeddings.“ In: AAAI (2021) (cit. on p. viii). [Cha+18] Haw-Shiuan Chang, Ziyun Wang, Luke Vilnis, and An- drew McCallum. Distributional Inclusion Vector Embed- ” ding for Unsupervised Hypernymy Detection.“ In: Proceed- ings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Vol. 1. 2018, pp. 485– 495 (cit. on p. 64). [Cha+19] Ines Chami, Rex Ying, Christopher Re,´ and Jure Leskovec. Hyperbolic graph convolutional neural networks.“ In: Ad- ” vances in Neural Information processing systems (2019) (cit. on pp. 39, 130, 134, 136–138, 151). [Che+13] Wei Chen, Wenjie Fang, Guangda Hu, and Michael W Mahoney. On the hyperbolicity of small-world and tree- ” like random graphs.“ In: Internet Mathematics 9.4 (2013), pp. 434–491 (cit. on p. 65). [Chu+14] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent ” neural networks on sequence modeling.“ In: arXiv preprint arXiv:1412.3555 (2014) (cit. on p. 39). [CL17] Minhyung Cho and Jaehyung Lee. Riemannian approach ” to batch normalization.“ In: Advances in Neural Information Processing Systems. 2017, pp. 5225–5235 (cit. on pp. 96, 111). [CS17] Anoop Cherian and Suvrit Sra. Riemannian dictionary ” learning and sparse coding for positive definite matrices.“ In: IEEE transactions on neural networks and learning systems 28.12 (2017), pp. 2859–2871 (cit. on p. 10). [CSS15] Sueli IR Costa, Sandra A Santos, and Joao˜ E Strapasson. Fisher information distance: a geometrical reading.“ In: ” Discrete Applied Mathematics 197 (2015), pp. 59–69. url: https://arxiv.org/pdf/1210.2354.pdf (cit. on p. 51). [Cut13] Marco Cuturi. Sinkhorn distances: Lightspeed computa- ” tion of optimal transport.“ In: Advances in neural information processing systems. 2013, pp. 2292–2300 (cit. on p. 144).

[CW16] Taco Cohen and Max Welling. Group equivariant convo- ” lutional networks.“ In: International Conference on Machine Learning. 2016, pp. 2990–2999 (cit. on pp. 16, 21). [CW17] Taco S Cohen and Max Welling. Steerable CNNs.“ In: ” International Conference on Learning Representations. 2017 (cit. on pp. 16, 21). [Cyb89] George Cybenko. Approximation by superpositions of a ” sigmoidal function.“ In: Mathematics of control, signals and systems 2.4 (1989), pp. 303–314 (cit. on pp. 146, 147). [Dai+20] Shuyang Dai, Zhe Gan, Yu Cheng, Chenyang Tao, Lawrence Carin, and Jingjing Liu. APo-VAE: Text Generation in Hy- ” perbolic Space.“ In: arXiv preprint arXiv:2005.00054 (2020) (cit. on p. 39). [Dav+18] Tim R. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M. Tomczak. Hyperspherical Variational Auto-Encoders. Uncertainty in Artificial Intelligence (UAI), 856- 865, 2018 (cit. on p. 12). [DBV16a] Michael¨ Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional Neural Networks on Graphs with Fast Lo- ” calized Spectral Filtering.“ In: Advances in Neural Informa- tion Processing Systems 29. Ed. by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett. Curran Associates, Inc., 2016, pp. 3844–3852 (cit. on pp. 10, 151). [DBV16b] Michael¨ Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast local- ” ized spectral filtering.“ In: Advances in Neural Information Processing Systems. 2016, pp. 3844–3852 (cit. on pp. 11, 124). [DC07] James J DiCarlo and David D Cox. Untangling invari- ” ant object recognition.“ In: Trends in cognitive sciences 11.8 (2007), pp. 333–341 (cit. on pp. 7, 15, 16). [DDK16] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural net- ” works.“ In: arXiv preprint arXiv:1602.02660 (2016) (cit. on p. 16).

[DDS16] Hanjun Dai, Bo Dai, and Le Song. Discriminative embed- ” dings of latent variable models for structured data.“ In: International conference on machine learning. 2016, pp. 2702– 2711 (cit. on pp. 124, 151). [De +18a] Christopher De Sa, Albert Gu, Christopher Re,´ and Fred- eric Sala. Representation Tradeoffs for Hyperbolic Embed- ” dings.“ In: Proceedings of the 35th International Conference on Machine Learning (ICML-18) (2018) (cit. on pp. 6, 8). [De +18b] Christopher De Sa, Albert Gu, Christopher Re,´ and Fred- eric Sala. Representation Tradeoffs for Hyperbolic Em- ” beddings.“ In: International Conference on Machine Learning. 2018 (cit. on pp. 10, 51, 106). [Def+19] Michael¨ Defferrard, Nathanael¨ Perraudin, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: towards an equivariant ” graph-based spherical CNN.“ In: ICLR Workshop on Rep- resentation Learning on Graphs and Manifolds. 2019. arXiv: 1904.05146. url: https://arxiv.org/abs/1904.05146 (cit. on pp. 11, 12). [Dev+18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- ” formers for language understanding.“ In: arXiv preprint arXiv:1810.04805 (2018) (cit. on pp. 3, 4). [DGK18] Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity ” by accelerated gradient descent is better than by Sinkhorn’s algorithm.“ In: arXiv preprint arXiv:1802.04367 (2018) (cit. on p. 144). [Dhi+18] Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi, Andrew Dai, and George Dahl. Embedding Text in Hy- ” perbolic Spaces.“ In: Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12). 2018, pp. 59–69 (cit. on p. 86). [DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive ” subgradient methods for online learning and stochastic optimization.“ In: Journal of Machine Learning Research 12.Jul (2011), pp. 2121–2159 (cit. on pp. 9, 52, 53, 92).

[Duv+15] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan´ Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs ” for learning molecular fingerprints.“ In: Advances in neural information processing systems. 2015, pp. 2224–2232 (cit. on pp. 10, 151). [DZR12] James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition?“ In: ” Neuron 73.3 (2012), pp. 415–434 (cit. on p. 15). [FC17] Remi´ Flamary and Nicolas Courty. Pot python optimal ” transport library.“ In: GitHub: https://github. com/rflamary/POT (2017) (cit. on pp. 144, 145, 153). [For03] Robin Forman. Bochner’s method for cell complexes and ” combinatorial Ricci curvature.“ In: Discrete and Computa- tional Geometry 29.3 (2003), pp. 323–374 (cit. on p. 7). [Gan+19] Octavian-Eugen Ganea, Sylvain Gelly, Gary Becigneul,´ and Aliaksei Severyn. Breaking the Softmax Bottleneck via ” Learnable Monotonic Pointwise Non-linearities.“ In: arXiv preprint arXiv:1902.08077 (2019) (cit. on p. viii). [Gan19] Octavian-Eugen Ganea. Non-Euclidean Neural Represen- ” tation Learning of Words, Entities and Hierarchies.“ PhD thesis. ETH Zurich, 2019 (cit. on p. 6). [GBH18a] Octavian-Eugen Ganea, Gary Becigneul,´ and Thomas Hof- mann. Hyperbolic Entailment Cones for Learning Hi- ” erarchical Embeddings.“ In: Proceedings of the 35th Inter- national Conference on Machine Learning, ICML 2018, Stock- holmsmassan,¨ Stockholm, Sweden, July 10-15, 2018. 2018, pp. 1632– 1641. url: http://proceedings.mlr.press/v80/ganea18a. html (cit. on p. viii). [GBH18b] Octavian-Eugen Ganea, Gary Becigneul,´ and Thomas Hof- mann. Hyperbolic Entailment Cones for Learning Hierar- ” chical Embeddings.“ In: International Conference on Machine Learning. 2018 (cit. on pp. 8, 10, 12, 42, 46, 106). [GBH18c] Octavian Ganea*, Gary Becigneul*,´ and Thomas Hofmann. Hyperbolic Neural Networks.“ In: Advances in Neural In- ” formation Processing Systems 31. Ed. by S. Bengio, H. Wal- lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and

R. Garnett. Curran Associates, Inc., 2018, pp. 5345–5355. url: http://papers.nips.cc/paper/7780-hyperbolic- neural-networks.pdf (cit. on pp. vii, 39, 77–80, 82, 86, 106, 107, 116, 125, 127, 133). [GD14] Robert Gens and Pedro M Domingos. Deep symmetry net- ” works.“ In: Advances in neural information processing systems. 2014, pp. 2537–2545 (cit. on p. 16). [Gen+16] Aude Genevay, Marco Cuturi, Gabriel Peyre,´ and Francis Bach. Stochastic optimization for large-scale optimal trans- ” port.“ In: Advances in neural information processing systems. 2016, pp. 3440–3448 (cit. on p. 152). [GH10] Michael Gutmann and Aapo Hyvarinen.¨ Noise-contrastive ” estimation: A new estimation principle for unnormalized statistical models.“ In: Proceedings of the Thirteenth Interna- tional Conference on Artificial Intelligence and Statistics. 2010, pp. 297–304 (cit. on p. 144). [Gil+17] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for ” quantum chemistry.“ In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org. 2017, pp. 1263–1272 (cit. on p. 151). [GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hin- ton. Speech recognition with deep recurrent neural net- ” works.“ In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE. 2013, pp. 6645–6649 (cit. on p. 4). [GMS05] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A ” new model for learning in graph domains.“ In: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. Vol. 2. IEEE. 2005, pp. 729–734 (cit. on p. 151). [Goo+09] Ian Goodfellow, Honglak Lee, Quoc V Le, Andrew Saxe, and Andrew Y Ng. Measuring invariances in deep net- ” works.“ In: Advances in neural information processing systems. 2009, pp. 646–654 (cit. on p. 16).

[Goo+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.“ In: Ad- ” vances in neural information processing systems. 2014, pp. 2672– 2680 (cit. on p. 4). [GPC17] Aude Genevay, Gabriel Peyre,´ and Marco Cuturi. Learn- ” ing generative models with sinkhorn divergences.“ In: arXiv preprint arXiv:1706.00292 (2017) (cit. on p. 144). [Gra+18] Daniele Grattarola, Daniele Zambon, Cesare Alippi, and Lorenzo Livi. Learning Graph Embeddings on Constant- ” Curvature Manifolds for Change Detection in Graph Streams.“ In: stat 1050 (2018), p. 16 (cit. on p. 12). [Gro87] Mikhael Gromov. Hyperbolic groups.“ In: Essays in group ” theory. Springer, 1987, pp. 75–263 (cit. on pp. 6, 8, 65, 106). [Gu+19a] Albert Gu, Frederic Sala, Beliz Gunel, and Christopher Re.´ Learning Mixed-Curvature Representations in Product ” Spaces.“ In: International Conference on Learning Represen- tations. 2019. url: https://openreview.net/forum?id= HJxeWnCcF7 (cit. on p. 7). [Gu+19b] Albert Gu, Frederic Sala, Beliz Gunel, and Christopher Re.´ Learning Mixed-Curvature Representations in Product ” Spaces.“ In: Proceedings of the International Conference on Learning Representations, 2019 (cit. on pp. 11, 12, 125, 135, 137). [HA10] Christopher Hopper and Ben Andrews. The Ricci flow in Riemannian geometry. Springer, 2010 (cit. on p. 40). [Ham17] Matthias Hamann. On the tree-likeness of hyperbolic ” spaces.“ In: Mathematical Proceedings of the Cambridge Philo- sophical Society (2017), pp. 1–17 (cit. on p. 8). [HBL15] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep con- ” volutional networks on graph-structured data.“ In: arXiv preprint arXiv:1506.05163 (2015) (cit. on p. 10). [He+16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.“ In: Pro- ” ceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778 (cit. on p. 4).

[He+17] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering.“ In: ” Proceedings of the 26th international conference on world wide web. International World Wide Web Conferences Steering Committee. 2017, pp. 173–182 (cit. on p. 4). [Hof76] Jorgen Hoffmann-Jorgensen. Measures which agree on ” balls.“ In: Mathematica Scandinavica 37.2 (1976), pp. 319–326 (cit. on p. 148). [HVG11] David K Hammond, Pierre Vandergheynst, and Remi´ Gri- bonval. Wavelets on graphs via spectral graph theory.“ ” In: Applied and Computational Harmonic Analysis 30.2 (2011), pp. 129–150 (cit. on pp. 11, 151). [HYL17] William L. Hamilton, Rex Ying, and Jure Leskovec. Induc- tive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems, 2017 (cit. on p. 11). [Jef46] Harold Jeffreys. An invariant form for the prior proba- ” bility in estimation problems.“ In: Proc. R. Soc. Lond. A 186.1007 (1946), pp. 453–461 (cit. on p. 52). [KB15] Diederik P Kingma and Jimmy Ba. Adam: A method ” for stochastic optimization.“ In: International Conference on Learning Representations. 2015 (cit. on pp. 9, 76, 92). [KBG19] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Gunnemann.¨ Predict then propagate: graph neural net- ” works meet personalized pagerank.“ In: International Con- ference on Learning Representations (2019) (cit. on p. 11). [KBH17] Yannic Kilcher*, Gary Becigneul*,´ and Thomas Hofmann. Parametrizing filters of a CNN with a GAN.“ In: arXiv ” preprint arXiv:1710.11386 (2017) (cit. on p. ix). [KBH18] Yannic Kilcher*, Gary Becigneul*,´ and Thomas Hofmann. Escaping Flat Areas via Function-Preserving Structural ” Network Modifications.“ In: https: // openreview. net/ forum? id= H1eadi0cFQ (2018) (cit. on p. ix). [Kea+16] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: ” moving beyond fingerprints.“ In: Journal of computer-aided molecular design 30.8 (2016), pp. 595–608 (cit. on p. 151).

[Kir08] Alexander Kirillov Jr. An introduction to Lie groups and Lie algebras. Vol. 113. Cambridge University Press, 2008 (cit. on p. 18). [KKK20] Max Kochurov, Rasul Karimov, and Sergei Kozlukov. Geoopt: ” Riemannian Optimization in PyTorch.“ In: arXiv preprint arXiv:2005.02819 (2020) (cit. on pp. 39, 89). [KMH18] Soumava Kumar Roy, Zakaria Mhammedi, and Mehrtash Harandi. Geometry aware constrained optimization tech- ” niques for deep learning.“ In: Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition. 2018, pp. 4460–4469 (cit. on p. 111). [Kon+18] Risi Kondor, Hy Truong Son, Horace Pan, Brandon An- derson, and Shubhendu Trivedi. Covariant compositional ” networks for learning graphs.“ In: International conference on machine learning. 2018 (cit. on p. 151). [Kri+10] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kit- sak, Amin Vahdat, and Marian´ Boguna.´ Hyperbolic ge- ” ometry of complex networks.“ In: Physical Review E 82.3 (2010), p. 036106 (cit. on p. 8). [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural ” networks.“ In: Advances in neural information processing sys- tems. 2012, pp. 1097–1105 (cit. on pp. 3, 4, 15). [KW17a] Thomas N Kipf and Max Welling. Semi-Supervised Clas- ” sification with Graph Convolutional Networks.“ In: Inter- national Conference on Learning Representations (2017) (cit. on pp. 11, 124, 133, 136). [KW17b] Thomas N Kipf and Max Welling. Semi-supervised classi- ” fication with graph convolutional networks.“ In: Interna- tional Conference on Learning Representations (ICLR) (2017) (cit. on pp. 10, 151). [LG14] Omer Levy and Yoav Goldberg. Linguistic Regularities ” in Sparse and Explicit Word Representations.“ In: Proceed- ings of the Eighteenth Conference on Computational Natural Language Learning. Ann Arbor, Michigan: Association for Computational Linguistics, 2014, pp. 171–180. url: http: //www.aclweb.org/anthology/W14-1618 (cit. on p. 53).

[LGD15] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving ” Distributional Similarity with Lessons Learned from Word Embeddings.“ In: Transactions of the Association for Compu- tational Linguistics 3 (2015), pp. 211–225. url: https:// transacl.org/ojs/index.php/tacl/article/view/570 (cit. on pp. 53, 56). [Liu+17] Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated First-order Methods for ” Geodesically Convex Optimization on Riemannian Mani- folds.“ In: Advances in Neural Information Processing Systems 30. 2017, pp. 4868–4877 (cit. on p. 108). [LL04] Guy Lebanon and John Lafferty. Hyperplane margin clas- ” sifiers on the multinomial manifold.“ In: Proceedings of the international conference on machine learning (ICML). ACM. 2004, p. 66 (cit. on pp. 80, 81). [LLY11] Yong Lin, Linyuan Lu, and Shing-Tung Yau. Ricci cur- ” vature of graphs.“ In: Tohoku Mathematical Journal, Second Series 63.4 (2011), pp. 605–627 (cit. on p. 7). [LNK19] Qi Liu, Maximilian Nickel, and Douwe Kiela. Hyperbolic ” Graph Neural Networks.“ In: Advances in Neural Informa- tion processing systems (2019) (cit. on pp. 39, 89, 130, 151). [Loc+19] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Scholkopf,¨ and Olivier Bachem. Challenging common assumptions in the un- ” supervised learning of disentangled representations.“ In: International Conference on Machine Learning. 2019, pp. 4114– 4124 (cit. on p. 3). [LRP95] John Lamping, Ramana Rao, and Peter Pirolli. A focus+ ” context technique based on hyperbolic geometry for visu- alizing large hierarchies.“ In: Proceedings of the SIGCHI conference on Human factors in computing systems. ACM Press/Addison-Wesley Publishing Co. 1995, pp. 401–408 (cit. on p. 8). [LSM13] Minh-Thang Luong, Richard Socher, and Christopher D Manning. Better word representations with recursive neu- ” ral networks for morphology.“ In: Proceedings of the Seven-

teenth Conference on Computational Natural Language Learn- ing. 2013, pp. 104–113 (cit. on p. 155). [LV15] Karel Lenc and Andrea Vedaldi. Understanding image ” representations by measuring their equivariance and equiv- alence.“ In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 991–999 (cit. on p. 16). [LW18] Matthias Leimeister and Benjamin J Wilson. Skip-gram ” word embeddings in hyperbolic space.“ In: arXiv preprint arXiv:1809.01498 (2018) (cit. on p. 86). [LZ17] Chang Liu and Jun Zhu. Riemannian Stein Variational ” Gradient Descent for Bayesian Inference.“ In: arXiv preprint arXiv:1711.11216 (2017) (cit. on p. 108). [Mal12] Stephane´ Mallat. Group invariant scattering.“ In: Com- ” munications on Pure and Applied Mathematics 65.10 (2012), pp. 1331–1398 (cit. on p. 16). [Mal16] Stephane´ Mallat. Understanding deep convolutional net- ” works.“ In: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374.2065 (2016), p. 20150203 (cit. on pp. 16, 21). [Mar+19] Hermina Petric Maretic, Mireille El Gheche, Giovanni Chierchia, and Pascal Frossard. GOT: An Optimal Trans- ” port framework for Graph comparison.“ In: Advances in Neural Information Processing Systems. 2019, pp. 13876–13887 (cit. on p. 152). [Mat+19] Emile Mathieu, Charline Le Lan, Chris J Maddison, Ryota Tomioka, and Yee Whye Teh. Continuous Hierarchical ” Representations with Poincare´ Variational Auto-Encoders.“ In: Advances in Neural Information Processing Systems (2019) (cit. on p. 39). [Mat13] Jiri Matousek. Lecture notes on metric embeddings. 2013 (cit. on p. 12). [MC18] Boris Muzellec and Marco Cuturi. Generalizing Point ” Embeddings using the Wasserstein Space of Elliptical Dis- tributions.“ In: arXiv preprint arXiv:1805.07594 (2018) (cit. on pp. 9, 85).

[McC+00] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet por- tals with machine learning. Information Retrieval, 3(2):127–163, 2000 (cit. on p. 136). [Mem´ 11] Facundo Memoli.´ Gromov–Wasserstein distances and the ” metric approach to object matching.“ In: Foundations of computational mathematics 11.4 (2011), pp. 417–487 (cit. on p. 152). [MH08] Laurens van der Maaten and Geoffrey Hinton. Visualizing ” data using t-SNE.“ In: Journal of machine learning research 9.Nov (2008), pp. 2579–2605 (cit. on p. 157). [Mik+13a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector ” space.“ In: arXiv preprint arXiv:1301.3781 (2013) (cit. on p. 56). [Mik+13b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and ” phrases and their compositionality.“ In: Advances in neural information processing systems. 2013, pp. 3111–3119 (cit. on pp. 3, 8). [Mil+90] George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. Introduction to Word- ” Net: An on-line lexical database.“ In: International journal of lexicography 3.4 (1990), pp. 235–244 (cit. on pp. 51, 106). [Mil63] John Milnor. Morse theory.(AM-51). Vol. 51. Princeton Uni- versity Press, 1963 (cit. on p. 93). [MVP15] Youssef Mroueh, Stephen Voinea, and Tomaso A Poggio. Learning with Group Invariant Features: A Kernel Per- ” spective.“ In: Advances in Neural Information Processing Sys- tems. 2015, pp. 1558–1566 (cit. on p. 16). [NAK16] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs.“ In: ” International conference on machine learning. 2016, pp. 2014– 2023 (cit. on p. 151).

[Nam+12] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven Active Surveying for Collective Classification. International Workshop on Mining and Learning with Graphs (MLG), 2012 (cit. on p. 136). [Ngu+17] Kim Anh Nguyen, Maximilian Koper,¨ Sabine Schulte im Walde, and Ngoc Thang Vu. Hierarchical Embeddings ” for Hypernymy Detection and Directionality.“ In: Proceed- ings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017, pp. 233–243 (cit. on pp. 62, 64, 85). [Ni+15] Chien-Chun Ni, Yu-Yao Lin, Jie Gao, Xianfeng David Gu, and Emil Saucan. Ricci curvature of the Internet topol- ” ogy.“ In: 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE. 2015, pp. 2758–2766 (cit. on p. 7). [NK17a] Maximillian Nickel and Douwe Kiela. Poincare´ embed- ” dings for learning hierarchical representations.“ In: Ad- vances in Neural Information Processing Systems. 2017, pp. 6341– 6350 (cit. on pp. 8, 10–12, 106, 107). [NK17b] Maximillian Nickel and Douwe Kiela. Poincare´ embed- ” dings for learning hierarchical representations.“ In: Ad- vances in neural information processing systems. 2017, pp. 6338– 6347 (cit. on pp. 42, 51, 57, 58, 62, 64, 75, 76, 79, 84, 85, 143). [NK18] Maximilian Nickel and Douwe Kiela. Learning Continu- ” ous Hierarchies in the Lorentz Model of Hyperbolic Geom- etry.“ In: International Conference on Machine Learning. 2018 (cit. on pp. 10, 12, 85, 106). [Oll09] Yann Ollivier. Ricci curvature of Markov chains on met- ” ric spaces.“ In: Journal of Functional Analysis 256.3 (2009), pp. 810–864 (cit. on p. 7). [OM15] Edouard Oyallon and Stephane´ Mallat. Deep roto-translation ” scattering for object classification.“ In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 2865–2873 (cit. on p. 16). [Ovi19] Ivan Ovinnikov. Poincare´ Wasserstein Autoencoder.“ In: ” arXiv preprint arXiv:1901.01427 (2019) (cit. on p. 39). [Par13] Jouni Parkkonen. Hyperbolic Geometry.“ In: (2013) (cit. ” on p. 42).

[Pau14] Fred´ eric´ Paulin. Groupes et geom´ etries.“´ In: Notes de cours ” (2014) (cit. on p. 37). [PC+19] Gabriel Peyre,´ Marco Cuturi, et al. Computational optimal ” transport.“ In: Foundations and Trends® in Machine Learning 11.5-6 (2019), pp. 355–607 (cit. on pp. 123, 145, 152). [PCS16] Gabriel Peyre,´ Marco Cuturi, and Justin Solomon. Gromov- ” wasserstein averaging of kernel and distance matrices.“ In: International Conference on Machine Learning. 2016, pp. 2664– 2672 (cit. on pp. 145, 152). [Pei+20] Hongbin Pei, Bingzhe Wei, Chen-Chuan Chang Kevin, Yu Lei, and Bo Yang. Geom-GCN: Geometric Graph Convo- ” lutional Networks.“ In: International conference on machine learning. 2020 (cit. on p. 151). [Pet03] Jean Petitot. The neurogeometry of pinwheels as a sub- ” Riemannian contact structure.“ In: Journal of Physiology- Paris 97.2-3 (2003), pp. 265–309 (cit. on p. 16). [Pog+12] Tomaso Poggio, Jim Mutch, Joel Leibo, Lorenzo Rosasco, and Andrea Tacchetti. The computational magic of the ” ventral stream: sketch of a theory (and why some deep architectures work).“ In: (2012) (cit. on pp. 16, 18). [PSM14] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global Vectors for Word Representa- ” tion.“ In: EMNLP. Vol. 14. 2014, pp. 1532–43 (cit. on pp. 3, 8, 39, 50, 53, 89). [PT13] Sam Patterson and Yee Whye Teh. Stochastic gradient Rie- ” mannian Langevin dynamics on the probability simplex.“ In: Advances in Neural Information Processing Systems. 2013, pp. 3102–3110 (cit. on p. 111). [PW09] Ofir Pele and Michael Werman. Fast and robust earth ” mover’s distances.“ In: 2009 IEEE 12th International Confer- ence on Computer Vision. IEEE. 2009, pp. 460–467 (cit. on p. 144). [Rad+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsu- ” pervised multitask learners.“ In: OpenAI Blog 1.8 (2019), p. 9 (cit. on p. 4).

[Raj+17] Anant Raj, Abhishek Kumar, Youssef Mroueh, Tom Fletcher, and Bernhard Scholkopf.¨ Local group invariant represen- ” tations via orbit embeddings.“ In: Artificial Intelligence and Statistics. 2017, pp. 1225–1235 (cit. on p. 16). [RKK18] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the ” convergence of adam and beyond.“ In: ICLR. 2018 (cit. on pp. 9, 92, 95, 105, 106, 112, 113). [RS00] Sam T Roweis and Lawrence K Saul. Nonlinear dimen- ” sionality reduction by locally linear embedding.“ In: science 290.5500 (2000), pp. 2323–2326 (cit. on p. 11). [RS11] Joel W Robbin and Dietmar A Salamon. Introduction to ” differential geometry.“ In: ETH, Lecture Notes, preliminary version, January (2011) (cit. on p. 90). [Sca+08] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Ha- genbuchner, and Gabriele Monfardini. The graph neural ” network model.“ In: IEEE Transactions on Neural Networks 20.1 (2008), pp. 61–80 (cit. on p. 151). [Sch+18] Morgan A Schmitz, Matthieu Heitz, Nicolas Bonneel, Fred Ngole, David Coeurjolly, Marco Cuturi, Gabriel Peyre,´ and Jean-Luc Starck. Wasserstein dictionary learning: Opti- ” mal transport-based unsupervised nonlinear dictionary learning.“ In: SIAM Journal on Imaging Sciences 11.1 (2018), pp. 643–678 (cit. on pp. 144, 152). [Sen+08] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lisa Getoor, Brian Gallagher, and T. Eliassi-Rad. Collective Classification in Network Data. AI Magazine, 29(3):93–106, 2008 (cit. on p. 136). [SGB20] Ondrej Skopek, Octavian-Eugen Ganea, and Gary Becigneul.´ Mixed-Curvature Variational Autoencoders.“ In: Inter- ” national Conference on Learning Representations. 2020. url: https://arxiv.org/abs/1911.08411 (cit. on pp. viii, 39). [SH15] Suvrit Sra and Reshad Hosseini. Conic geometric opti- ” mization on the manifold of positive definite matrices.“ In: SIAM Journal on Optimization 25.1 (2015), pp. 713–739 (cit. on p. 10).

[She+11] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler- ” lehman graph kernels.“ In: Journal of Machine Learning Research 12.Sep (2011), pp. 2539–2561 (cit. on p. 10). [Sin+18] Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, and Martin Jaggi. Wasserstein is all you need.“ In: arXiv preprint ” arXiv:1808.09663 (2018) (cit. on p. 85). [Spi79] Michael Spivak. A comprehensive introduction to differ- ” ential geometry. Volume four.“ In: (1979) (cit. on pp. 11, 40, 90). [SS13] Bernhard Schmitzer and Christoph Schnorr.¨ A hierarchi- ” cal approach to optimal transport.“ In: International Con- ference on Scale Space and Variational Methods in Computer Vision. Springer. 2013, pp. 452–464 (cit. on p. 152). [SSZ17] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypi- ” cal networks for few-shot learning.“ In: Advances in neural information processing systems. 2017, pp. 4077–4087 (cit. on p. 12). [Tan+14] Mingkui Tan, Ivor W Tsang, Li Wang, Bart Vandereycken, and Sinno Jialin Pan. Riemannian pursuit for big matrix ” recovery.“ In: International Conference on Machine Learning. 2014, pp. 1539–1547 (cit. on p. 10). [TBG19] Alexandru Tifrea*, Gary Becigneul*,´ and Octavian-Eugen Ganea*. Poincare´ Glove: Hyperbolic Word Embeddings.“ ” In: International Conference on Learning Representations. 2019. url: https://openreview.net/forum?id=Ske5r3AqK7 (cit. on pp. vii, 39, 54, 56, 57, 59, 62, 89). [TDL00] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimension- ” ality reduction.“ In: science 290.5500 (2000), pp. 2319–2323 (cit. on p. 11). [TH12] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: ” Divide the gradient by a running average of its recent magnitude.“ In: COURSERA: Neural networks for machine learning 4.2 (2012), pp. 26–31 (cit. on p. 92).

[Tit+19a] Vayer Titouan, Nicolas Courty, Romain Tavenard, and Remi´ Flamary. Optimal Transport for structured data ” with application on graphs.“ In: International Conference on Machine Learning. 2019, pp. 6275–6284 (cit. on pp. 123, 152). [Tit+19b] Vayer Titouan, Remi´ Flamary, Nicolas Courty, Romain Tavenard, and Laetitia Chapel. Sliced Gromov-Wasserstein.“ ” In: Advances in Neural Information Processing Systems. 2019, pp. 14726–14736 (cit. on p. 152). [TM16] Christopher Tensmeyer and Tony Martinez. Improving ” invariance and equivariance properties of convolutional neural networks.“ In: (2016) (cit. on p. 16). [TO18] Corentin Tallec and Yann Ollivier. Can recurrent neural ” networks warp time?“ In: Proceedings of the International Conference on Learning Representations (ICLR). 2018 (cit. on p. 72). [Tog+19] Matteo Togninalli, Elisabetta Ghisu, Felipe Llinares-Lopez,´ Bastian Rieck, and Karsten Borgwardt. Wasserstein weisfeiler- ” lehman graph kernels.“ In: Advances in Neural Information Processing Systems. 2019, pp. 6436–6446 (cit. on pp. 12, 152). [Tri+18] Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach, and Michael I Jordan. Averaging Stochastic Gradient De- ” scent on Riemannian Manifolds.“ In: Proceedings of Machine Learning Research vol 75 (2018), pp. 1–38 (cit. on p. 108). [Ung01] Abraham A Ungar. Hyperbolic trigonometry and its appli- ” cation in the Poincare´ ball model of hyperbolic geometry.“ In: Computers & Mathematics with Applications 41.1-2 (2001), pp. 135–147 (cit. on p. 44). [Ung05] Abraham A Ungar. Analytic hyperbolic geometry: Mathemati- cal foundations and applications. World Scientific, 2005 (cit. on pp. 116, 129). [Ung08] Abraham Albert Ungar. A gyrovector space approach to ” hyperbolic geometry.“ In: Synthesis Lectures on Mathematics and Statistics 1.1 (2008), pp. 1–194 (cit. on pp. 8, 44–46, 50, 106, 107, 116, 126, 131). [Ung10] Abraham Ungar. Barycentric Calculus in Euclidean and Hy- perbolic Geometry. World Scientific, ISBN 9789814304931, 2010 (cit. on p. 127).

[Ung12] Abraham A Ungar. Beyond the Einstein addition law and its gyroscopic : The theory of gyrogroups and gyrovector spaces. Vol. 117. Springer Science & Business Media, 2012 (cit. on p. 55). [Ung14] Abraham Albert Ungar. Analytic hyperbolic geometry in n dimensions: An introduction. CRC Press, 2014 (cit. on pp. 82, 132). [Vam+19] Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Mad- abhushi, Parantu Shah, Michaela Spitzer, et al. Applica- ” tions of machine learning in drug discovery and devel- opment.“ In: Nature Reviews Drug Discovery 18.6 (2019), pp. 463–477 (cit. on p. 10). [Vas+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.“ In: Advances in ” Neural Information Processing Systems. 2017, pp. 5998–6008 (cit. on p. 4). [Vel+18] Petar Velickoviˇ c,´ Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio,` and Yoshua Bengio. Graph ” Attention Networks.“ In: International Conference on Learn- ing Representations (2018) (cit. on p. 11). [Ven+15] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Ur- tasun. Order-embeddings of images and language.“ In: ” arXiv preprint arXiv:1511.06361 (2015) (cit. on pp. 64, 75, 85). [Ver05] J Vermeer. A geometric interpretation of Ungar’s addition ” and of gyration in the hyperbolic plane.“ In: Topology and its Applications 152.3 (2005), pp. 226–242 (cit. on p. 44). [Vil+18] Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCal- lum. Probabilistic Embedding of Knowledge Graphs with ” Box Measures.“ In: arXiv preprint arXiv:1805.06627 (2018) (cit. on p. 85). [Vil08] Cedric´ Villani. Optimal transport: old and new. Vol. 338. Springer Science & Business Media, 2008 (cit. on pp. 123, 152).

[Vis+10] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kon- dor, and Karsten M Borgwardt. Graph kernels.“ In: Journal ” of Machine Learning Research 11.Apr (2010), pp. 1201–1242 (cit. on p. 10). [VM15] Luke Vilnis and Andrew McCallum. Word representa- ” tions via gaussian embedding.“ In: ICLR (2015) (cit. on pp. 9, 51, 52, 55, 57, 58, 60, 85). [VM18] Ivan Vulic´ and Nikola Mrksiˇ c.´ Specialising Word Vectors ” for Lexical Entailment.“ In: Proceedings of the 2018 Con- ference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol- ume 1 (Long Papers). Vol. 1. 2018, pp. 1134–1145 (cit. on p. 85). [Vul+17] Ivan Vulic,´ Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. Hyperlex: A large-scale evaluation of ” graded lexical entailment.“ In: Computational Linguistics 43.4 (2017), pp. 781–835 (cit. on p. 62). [VV10] Bart Vandereycken and Stefan Vandewalle. A Riemannian ” optimization approach for computing low-rank solutions of Lyapunov equations.“ In: SIAM Journal on Matrix Analy- sis and Applications 31.5 (2010), pp. 2553–2579 (cit. on p. 10). [WB15] Thomas Wiatowski and Helmut Bolcskei.¨ A mathematical ” theory of deep convolutional neural networks for feature extraction.“ In: IEEE Transactions on Information Theory (Dec. 2015). url: http : / / www . nari . ee . ethz . ch / commth / /pubs/p/deep-2016 (cit. on p. 16). [Wee+14] Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. Learning to distinguish hypernyms and ” co-hyponyms.“ In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Tech- nical Papers. Dublin City University and Association for Computational Linguistics. 2014, pp. 2249–2259 (cit. on p. 64). [Wes+12] Jason Weston, Fred´ eric´ Ratle, Hossein Mobahi, and Ro- nan Collobert. Deep learning via semi-supervised em- ” bedding.“ In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655 (cit. on p. 123).

[Wil+14] Richard C Wilson, Edwin R Hancock, Elzbieta˙ Pekalska, and Robert PW Duin. Spherical and hyperbolic embed- ” dings of data.“ In: IEEE transactions on pattern analysis and machine intelligence 36.11 (2014), pp. 2255–2269 (cit. on p. 12). [Wu+19] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey ” on graph neural networks.“ In: arXiv preprint arXiv:1901.00596 (2019) (cit. on p. 152). [XD18] Jiacheng Xu and Greg Durrett. Spherical Latent Spaces ” for Stable Variational Autoencoders.“ In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018, pp. 4503–4513 (cit. on p. 12). [Xu+18] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks?“ In: Interna- ” tional Conference on Learning Representations (2018) (cit. on p. 11). [Xu+19a] Hongteng Xu, Dixin Luo, Hongyuan Zha, and Lawrence Carin. Gromov-wasserstein learning for graph matching ” and node embedding.“ In: arXiv preprint arXiv:1901.06003 (2019) (cit. on p. 152). [Xu+19b] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks?“ In: Interna- ” tional Conference on Learning Representations (ICLR) (2019) (cit. on p. 151). [Xu19] Hongteng Xu. Gromov-Wasserstein Factorization Models ” for Graph Clustering.“ In: arXiv preprint arXiv:1911.08530 (2019) (cit. on pp. 144, 152). [Yan+19] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing ” learned molecular representations for property predic- tion.“ In: Journal of chemical information and modeling 59.8 (2019), pp. 3370–3388 (cit. on pp. 10, 124, 151, 152, 156). [Ye+20] Ze Ye, Kin Sum Liu, Tengfei Ma, Jie Gao, and Chao Chen. Curvature Graph Network.“ In: International Conference on ” Learning Representations. 2020 (cit. on p. 151).

[Yin+18a] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombat- chai, William L Hamilton, and Jure Leskovec. Graph con- ” volutional neural networks for web-scale recommender systems.“ In: Proceedings of the 24th ACM SIGKDD Inter- national Conference on Knowledge Discovery & Data Mining. 2018, pp. 974–983 (cit. on p. 151). [Yin+18b] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph ” representation learning with differentiable pooling.“ In: Ad- vances in neural information processing systems. 2018, pp. 4800– 4810 (cit. on p. 151). [Zah+17] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barn- abas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets.“ In: Advances in neural information processing sys- ” tems. 2017, pp. 3391–3401 (cit. on p. 151). [ZC18] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, 2018 (cit. on p. 11). [Zei12] Matthew D Zeiler. ADADELTA: an adaptive learning rate ” method.“ In: arXiv preprint arXiv:1212.5701 (2012) (cit. on p. 9). [Zha+19] Yiding Zhang, Xiao Wang, Xunqiang Jiang, Chuan Shi, and Yanfang Ye. Hyperbolic graph attention network.“ In: ” arXiv preprint arXiv:1912.03046 (2019) (cit. on p. 39). [ZRS16] Hongyi Zhang, Sashank J Reddi, and Suvrit Sra. Rieman- ” nian SVRG: Fast stochastic optimization on Riemannian manifolds.“ In: Advances in Neural Information Processing Systems. 2016, pp. 4592–4600 (cit. on p. 108). [ZS16] Hongyi Zhang and Suvrit Sra. First-order methods for ” geodesically convex optimization.“ In: Conference on Learn- ing Theory. 2016, pp. 1617–1638 (cit. on pp. 10, 97, 105, 108, 112). [ZS18a] Andrei Zanfir and Cristian Sminchisescu. Deep learning ” of graph matching.“ In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 2684– 2693 (cit. on p. 152).

[ZS18b] Hongyi Zhang and Suvrit Sra. Towards Riemannian Accel- ” erated Gradient Methods.“ In: arXiv preprint arXiv:1806.02812 (2018) (cit. on p. 108).
