University of Granada Department of Computer Science and Artificial Intelligence

Supervised Data Mining in Networks: Link Prediction and Applications

Doctoral Thesis

Víctor Martínez Gómez

Supervisors

Fernando Berzal Galiano

Juan Carlos Cubero Talavera

Programa de Doctorado en Tecnologías de la Información y la Comunicación

PhD Program in Information and Communication Technologies

Granada, May 2018

Editor: Universidad de Granada. Tesis Doctorales
Autor: Víctor Martínez Gómez
ISBN: 978-84-9163-982-4
URI: http://hdl.handle.net/10481/53613

El doctorando Víctor Martínez Gómez y los directores de la tesis Fernando Berzal Galiano y Juan Carlos Cubero Talavera garantizamos, al firmar esta tesis doctoral, que el trabajo ha sido realizado por el doctorando bajo la dirección de los directores de la tesis y hasta donde nuestro conocimiento alcanza, en la realización del trabajo, se han respetado los derechos de otros autores a ser citados, cuando se han utilizado sus resultados o publicaciones. Granada, Mayo 2018.

The doctoral candidate Víctor Martínez Gómez and the thesis supervisors Fernando Berzal Galiano and Juan Carlos Cubero Talavera guarantee, by signing this doctoral thesis, that the work has been carried out by the doctoral student under the direction of the thesis supervisors and that, as far as we know, the rights of other authors to be cited have been respected in the realization of this work whenever their results or publications have been used. Granada, May 2018.

Víctor Martínez Gómez

Fernando Berzal Galiano Juan Carlos Cubero Talavera

Esta tesis doctoral ha sido desarrollada con la financiación de la ayuda con referencia BES-2013-064699 bajo el plan Ayudas para Contratos Predoctorales para la Formación de Doctores 2013 y con la financiación del proyecto con referencia TIN2012-36951 bajo el plan Proyectos Nacionales de Investigación, adscritos al Ministerio de Economía, Industria y Competitividad.

This doctoral thesis is partially supported by the Spanish Ministry of Economy and the European Regional Development Fund (FEDER), under grant Ayudas para Contratos Predoctorales para la Formación de Doctores 2013 with reference BES-2013-064699 and grant Proyectos Nacionales de Investigación with reference TIN2012-36951.

Table of Contents

Resumen 9

Abstract 11

I PhD dissertation 13

1 Introduction ...... 13

2 Preliminaries ...... 16

2.1 Basic concepts ...... 16

2.2 The link prediction problem ...... 17

3 Objectives ...... 19

4 Discussion of results ...... 20

4.1 Link prediction ...... 20

4.1.1 Link prediction: The state of the art ...... 20

4.1.2 Adaptive link prediction ...... 23

4.1.3 Probabilistic local link prediction ...... 25

4.2 Applications ...... 26

4.2.1 Prioritization using heterogeneous data ...... 26

4.2.2 Disambiguation of semantic relations ...... 29

4.2.3 An automorphic distance metric for node role discovery 31

4.3 A network data mining framework ...... 32

5 Concluding remarks ...... 35

6 Future work ...... 36

Bibliography 37

II Publications: published, accepted, and submitted papers 43


1 Link prediction ...... 43
1.1 Link prediction: The state of the art ...... 43
1.2 Adaptive link prediction ...... 91
1.3 Probabilistic local link prediction ...... 113
2 Applications ...... 121
2.1 Prioritization using heterogeneous data ...... 121
2.2 Disambiguation of semantic relations ...... 151
2.3 An automorphic distance metric for node role discovery ...... 179
3 A network data mining framework ...... 199

Resumen

La predicción de enlaces consiste en predecir la existencia de enlaces no observados actualmente o enlaces que aparecerán en el futuro entre pares de nodos en redes complejas. Este problema ha atraído la atención de investigadores en diversas disciplinas debido a su utilidad en una amplia gama de aplicaciones, entre las que se encuentran la identificación de genes asociados a determinadas enfermedades o la mejora de las sugerencias realizadas por los sistemas de recomendación. Esta tesis doctoral comprende diferentes líneas de trabajo, todas ellas estrechamente relacionadas con el problema de la predicción de enlaces. Por un lado, después de un estudio exhaustivo del estado del arte en predicción de enlaces, se identificaron las principales limitaciones de los enfoques actualmente propuestos. Estas limitaciones se relacionaban con las dificultades asociadas al equilibrio entre la escalabilidad y el rendimiento de las técnicas de predicción de enlaces. Se han propuesto dos técnicas escalables de predicción de enlaces que siguen diferentes enfoques para explotar características locales de la red. Por otro lado, se han abordado diferentes aplicaciones para las técnicas de predicción de enlaces. Se ha propuesto un nuevo algoritmo para priorización genérica, como la priorización de genes asociados a una determinada enfermedad, que logró mejores resultados que otras técnicas gracias a su capacidad para integrar fuentes de datos heterogéneas. También se ha desarrollado un algoritmo para la desambiguación de los sentidos de las palabras en relaciones semánticas entre conceptos, basado en la predicción de enlaces y que no requiere datos anotados. En este trabajo, mostramos cómo nuestro algoritmo logró una mayor precisión que otras técnicas del estado del arte en diferentes tareas de evaluación y cómo las relaciones extraídas pueden usarse para mejorar el rendimiento de las técnicas de última generación para la desambiguación del sentido de las palabras. Además, dado que la función de los nodos influye en cómo se forman los enlaces en redes complejas, hemos desarrollado una nueva métrica de distancia basada en el concepto de equivalencia automórfica con aplicación al descubrimiento de los roles de los nodos. Finalmente, hemos desarrollado una herramienta de minería de datos para redes complejas. Esta herramienta, llamada NOESIS, contiene implementaciones eficientes de una extensa lista de algoritmos relacionados con redes, incluyendo una biblioteca completa de técnicas de predicción de enlaces.


Abstract

Link prediction is the problem of predicting the existence of currently unobserved links or links that will appear in the future between pairs of nodes in complex networks. This problem has attracted a great deal of attention from researchers in diverse disciplines due to its applicability in a wide range of tasks, such as the identification of disease-associated candidate genes or the improvement of recommendations suggested by recommender systems. This PhD dissertation comprises different lines of work, all of them closely related to the link prediction problem. On the one hand, after an exhaustive study of the state of the art in link prediction, the main limitations of currently proposed approaches were identified. These limitations were related to the difficulties associated with the trade-off between scalability and performance in link prediction techniques. Two scalable link prediction techniques were proposed that follow different approaches to exploit local network features. On the other hand, different applications of link prediction techniques were addressed. We proposed a novel algorithm for generic prioritization, such as disease-gene prioritization, which achieved better results than other state-of-the-art techniques due to its capacity for integrating heterogeneous data sources. We also developed a novel algorithm for word sense disambiguation of semantic relations between concepts, based on link prediction and without the requirement of annotated data. We showed how our algorithm achieved better accuracy than other state-of-the-art techniques in different evaluation tasks and how relations extracted using our approach could improve the performance of state-of-the-art general-purpose word sense disambiguation techniques. In addition, since node role influences how links are formed in complex networks, we developed a novel distance metric based on the concept of automorphic equivalence with application to node role discovery. Finally, we developed a software framework for network data mining. This framework, called NOESIS, contains efficient implementations of an extensive list of network-related algorithms, including a complete library of link prediction techniques.


Chapter I

PhD dissertation

The aim of this chapter is to describe the context of this doctoral dissertation. In the first section, we introduce and motivate the addressed problem. The second section is devoted to the introduction of the essential concepts and terminology used throughout this dissertation. In the third section, the objectives set for this dissertation are described. The results achieved in this thesis, and the resulting associated publications, are summarized in the fourth section. In the fifth section, concluding remarks are presented. Finally, the sixth section proposes future lines of work that arise from the research described in this dissertation.

1 Introduction

Leonhard Euler laid the foundations of graph theory when working on the problem of the Seven Bridges of Königsberg in 1736. Despite the prominence of this area of Mathematics, most work over the centuries was done from a theoretical perspective on the study of small graphs [1]. However, new graph-related problems and challenges have arisen as the amount of available data has increased dramatically over the last few decades [2]. Nowadays, many systems are composed of thousands or even millions of entities or agents with relationships or interactions between them. These systems exhibit complex dynamics and are studied in many different fields. These complex networks arise in very different domains, such as social networks [3], transport networks [4], or protein-protein interaction networks [5], just to mention some examples. The ubiquity of these networks and the limitations of graph theory led to the creation of a whole new area of scientific research that tries to model and characterize these complex systems on the foundation of other well-established areas, including graph theory [6]. Network theory involves the resolution of very different problems in the context of these complex networks, many of which have important applications in the real world. For example, different topological metrics have been proposed in network theory for the quantification of interesting features and behaviors present in complex networks. Metrics related to network robustness can be used to build more resilient infrastructures, such as communication networks or power grids [7]. Network theory also involves the study of the role of nodes in complex networks [8]. For example, in

social networks, some nodes represent people that could be considered heavily influential for a specific group, while other nodes are best described as bridges between these groups. The task of identifying these roles is known as role discovery and has many interesting applications, including entity resolution [9] and anomaly detection [10]. Another widely-studied problem in network theory is community detection [11], which is the task of identifying groups or communities in complex networks. These communities, which summarize the structure of the network, are exploited in many interesting applications such as targeted marketing, which identifies different segments of the population with similar interests in social networks [12]. These are just some examples of problems that are studied by network theory, but there are many more, such as network visualization techniques, which generate pleasant visual representations of networks [13], or network formation models, which are theoretical models used to generate random networks that exhibit similar features to those observed in real-world networks [14].

The present dissertation focuses on the link prediction problem [15], which is the problem of inferring missing links that have not been observed or predicting links that will appear in the future in a network. There is a wide range of applications that motivate the study of this problem. For example, the performance of collaborative filtering recommender systems, which compute recommendations for users based on the preferences of similar users, can be improved by exploiting link prediction techniques to find more similar users [16]. Similarly, the Quality of Service (QoS) in mobile ad hoc networks can be improved by reducing the ratio of data packet loss, using link prediction techniques to estimate the most robust communication routes [17]. Another interesting application is the detection of anomalous messages in communication networks, such as email networks, by identifying very unlikely links using link prediction techniques [18]. In the context of bioinformatics, the onset of future diseases in patients can be predicted using link prediction techniques in comorbidity networks representing the co-occurrence of diseases [19]. Closely related to the previous problem, link prediction techniques can be used to identify new candidate genes that may be associated with complex diseases using protein-protein interaction networks and disease similarity networks [20].

The link prediction problem requires understanding and modelling the dynamics that drive the formation of links in networks, which vary across networks from different domains. The link prediction problem is a supervised classification problem where the instances are pairs of nodes in the network with two possible labels, one representing link presence and the other representing link absence. Predicting interactions in complex networks involves facing a series of challenges. To begin with, most real-world networks exhibit highly dynamic and stochastic behaviors that can be difficult to capture by models. Furthermore, the number of possible pairs of nodes, and therefore of instances in the problem, grows quadratically with respect to the number of nodes in the network. This situation implies the need for highly scalable techniques, since complex networks can be composed of thousands or even millions of nodes.
In addition, the link prediction problem is a highly imbalanced classification problem, because the number of unconnected pairs of nodes in a sparse network is usually orders of magnitude greater than the number of connected pairs. This must be taken into account, given that some machine learning techniques show difficulties when dealing with imbalanced datasets. Finally, a major concern in the link prediction problem is the presence of uncertainty and noise, as the absence of a link in the network does not imply the actual absence of a relationship or interaction between the entities represented by the pair of nodes that the link would connect. The absence

of reliable negative instances transforms the link prediction problem into a positive and unlabeled learning problem, usually known as PU learning. Fortunately, despite these difficulties, the topology of the network and the features of nodes and links can be exploited to predict these missing or future links with reasonable accuracy in many cases [21].

Different network formation models have been proposed in the literature to capture the link formation dynamics observed in real-world networks. For example, the preferential attachment mechanism was proposed by Albert-László Barabási and Réka Albert to explain why, in some networks, there are nodes with a very high number of links in comparison to the rest of the nodes in the network [14]. In these networks, known as scale-free networks, the more links a node has, the more likely it is to form new links with other nodes. One of the most prominent examples of this type of network is the Internet [22]. However, the preferential attachment model has severe limitations when it comes to explaining most of the links that are observed in real-world networks. Empirical observations also suggest that nodes tend to form links with similar nodes, a phenomenon known as homophily. In the context of link prediction in complex networks, the definition of similarity is problem-dependent: what may work well in some networks could perform badly in others. On the one hand, since homophily has been widely observed in social networks, where people tend to form links with persons with similar “sociodemographic, behavioural, and intrapersonal characteristics” [23], similarity could be defined in terms of node features. On the other hand, similarity could also be defined in terms of the topology of the network, as in triadic closure, where two nodes are more likely to be connected if they are close in the network or if they are connected to the same nodes [24].

The present dissertation addresses different topics related to the link prediction problem. An exhaustive study of the existing link prediction approaches has been performed. After identifying the limitations of existing techniques, we propose two efficient link prediction techniques based on different approaches. First, we introduce an adaptive approach based on the novel observation that there is a correlation between topological properties of the network and the definition of node similarity, based on local network features, that exhibits the best performance. Second, we consider a probabilistic approach that aggregates evidence from the topology of the network to estimate the likelihood of link existence. A contribution of this thesis will be the development of several new applications for link prediction techniques. A novel technique for candidate disease-gene prioritization using a link-prediction-based approach through propagation of information on heterogeneous networks will be proposed. The presented technique outperforms other state-of-the-art methods in different evaluation tasks. We will also tackle the problem of disambiguating semantic relations using a novel approach based on ideas gathered from the link prediction problem. Our experimentation will show how our method outperforms other state-of-the-art alternatives, and how our proposal can be applied to improve the performance of state-of-the-art word sense disambiguation techniques. Side contributions will also arise as a result of our main line of study.
On the one hand, major contributions will be made to NOESIS, a high-performance network data mining framework, including the implementation of an extensive collection of link prediction techniques. On the other hand, we will propose a novel distance metric, based on the concept of automorphic node equivalence, which can be used in the role discovery

task with potential applications to the link prediction problem.

2 Preliminaries

This section introduces all the major foundational concepts required to understand this dissertation. In its first subsection, basic concepts and terminology from graph theory and network theory are introduced. Its second subsection is devoted to the formal definition of the link prediction problem and the strategies used to evaluate the performance of link prediction techniques.

2.1 Basic concepts

Graph theory is the branch of mathematics that studies graphs, which are mathematical structures used to represent relations between objects. A graph $G = (V, E)$ is a structure composed of a set $V$ of objects, which are called vertices. Pairs of vertices are connected by a set $E$ of connections $e_{x,y}$, where $x, y \in V$, which are called edges when they represent undirected relations and arcs when they represent directed relations. In the multidisciplinary field of network theory, graphs with attributes are the key data structure used to represent and study complex networks. In this field, it is a convention to denote the vertices as nodes and the arcs and edges as links. For the sake of homogeneity, the network theory convention will be used for the rest of the dissertation. A node is said to be a neighbor of, or adjacent to, another node if both are connected by a link. The neighborhood of a node $x$ is the set $\Gamma_x$ comprised of its neighbor nodes. The cardinality of this set is the degree of the node; in other words, the degree of a node is defined as the number of links connected to the node. In directed networks, we can make a distinction between in-degree, out-degree, and total degree, defined as the number of incoming links, the number of outgoing links, and the sum of the number of incoming and outgoing links, respectively. An important node-related measure is the clustering coefficient, which measures the tendency of a node to cluster with other nodes, forming triangles. The clustering coefficient of a node $x$ is a value ranging between 0 and 1. It can be computed as

$$C_x = \frac{|\{e_{y,z} : y, z \in \Gamma_x,\ e_{y,z} \in E\}|}{|\Gamma_x|\,(|\Gamma_x| - 1)}$$

A path is a sequence of distinct consecutive links connecting a pair of nodes in the network, where the same node can only be visited once. The length of a path is defined as the number of links that the path contains. The shortest path between two nodes is defined as the path between them with the smallest length. The diameter of a network is equal to the length of the longest shortest path between any two nodes in the network. A network can also be represented as a square $|V| \times |V|$ matrix, known as the adjacency matrix. In unweighted networks, the elements of the adjacency matrix $A$ are defined as


$$A_{i,j} = \begin{cases} 1 & \text{if } e_{i,j} \in E \\ 0 & \text{otherwise} \end{cases}$$

In weighted networks, where a weight $w_{i,j}$ is associated with each link, the elements of the adjacency matrix are defined as

$$A_{i,j} = \begin{cases} w_{i,j} & \text{if } e_{i,j} \in E \\ 0 & \text{otherwise} \end{cases}$$

It should be noted that the adjacency matrix must always satisfy $A_{i,j} = A_{j,i}$ for undirected networks. A random walk in a network is a stochastic process simulating a single walker randomly moving through adjacent nodes in the network. Random walks can be used to iteratively estimate the probability distribution of reaching a node $v_j \in V$ starting from a node $v_i \in V$. This probability distribution $\vec{P}$ can be iteratively computed as

$$\vec{P}(t) = M^T \vec{P}(t-1)$$

where $M$ is the transition probability matrix, defined as $M_{i,j} = A_{i,j} / \sum_k A_{i,k}$, and $\vec{P}(0)$ is a vector representing the probability mass function of the starting position of the walker. This expression is iteratively applied until convergence, when the absolute value of the sum of the differences between two consecutive iterations is smaller than a predefined value indicating what can be considered a negligible difference.
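As an illustration of the iterative computation described above, the following minimal Python sketch (written for this summary and not part of the NOESIS framework; the tolerance value is an assumed parameter) builds the transition matrix from an adjacency matrix and applies the update rule until convergence.

import numpy as np

def random_walk_distribution(A, start, tol=1e-6, max_iter=1000):
    # Iterate P(t) = M^T P(t-1) until the sum of absolute differences
    # between consecutive iterations falls below the tolerance.
    A = np.asarray(A, dtype=float)
    M = A / A.sum(axis=1, keepdims=True)   # M[i, j] = A[i, j] / sum_k A[i, k]
    p = np.zeros(A.shape[0])
    p[start] = 1.0                         # walker starts at the given node
    for _ in range(max_iter):
        p_next = M.T @ p
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next

# Toy usage on a 4-node path graph 0-1-2-3.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(random_walk_distribution(A, start=0))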

2.2 The link prediction problem

The link prediction problem is defined as, given a complex network represented as a graph $G = (V, E)$, and assuming that some links are missing or will appear in the future, predicting or estimating the likelihood of the existence of a link between each pair of unconnected nodes in $V$. Link prediction techniques compute a probability or score for each pair of unconnected nodes $x, y \in V$ (i.e., $e_{x,y} \notin E$) reflecting the likelihood that there is a link between both nodes. The link prediction problem is a supervised classification problem, which aims to predict the actual class of each pair of nodes. Two different classes are considered in this problem: link existence and link absence. Therefore, the most common validation methodology for link prediction techniques is the use of a training set, a test set, and an optional validation set. Although other approaches can be followed, the most common approach is splitting the set of links $E$ to create these sets. Links in the training set must be used by link prediction techniques to try to predict the links in the test set. The validation set is used to tune the hyperparameters, if any, of the link prediction technique. A popular approach in the evaluation of classification tasks is cross-validation, a validation technique comprising rounds of evaluation over complementary subsets of the data. In k-fold cross-validation, the original set of links $E$ is split into $k$ disjoint subsets. For each evaluation round, $k - 1$ subsets are combined and used as the training set, while the remaining subset is used as the test set. This process is repeated $k$ times so that each subset acts as the test set once.
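As a minimal illustration of this validation methodology (a sketch written for this summary, not code from the dissertation), the set of links can be shuffled and partitioned into k folds, with each fold acting as the test set once:

import random

def k_fold_link_split(links, k=5, seed=0):
    # Split the set of links E into k disjoint subsets and yield
    # (training links, test links) pairs, one per evaluation round.
    links = list(links)
    random.Random(seed).shuffle(links)
    folds = [links[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test

links = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3), (2, 4), (0, 3)]
for train, test in k_fold_link_split(links, k=4):
    print(len(train), "training links,", len(test), "test links")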


Two methodologies can be used to measure the performance of link prediction techniques. The first one is the approach commonly used in traditional classification, where the classifier must output the predicted class and the evaluation is performed by comparing the predicted class against the actual class. Since most link prediction techniques usually output a score or probability proportional to the likelihood of the existence of a link, a second approach is based on ranking all pairs of nodes according to these scores and returning the top $t$ pairs of nodes as positive instances, where $t$ is equal to the number of links in the test set. The rest of the pairs of nodes are returned as negative instances, predicting that they are unconnected.
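A sketch of this ranking-based evaluation strategy (a hypothetical helper assuming the scores of all unconnected pairs have already been computed):

def ranking_based_predictions(scores, t):
    # scores: dict mapping each unconnected node pair (x, y) to its score.
    # t: number of links in the test set.
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted_links = set(ranked[:t])      # top t pairs predicted as links
    predicted_non_links = set(ranked[t:])  # the rest predicted as unconnected
    return predicted_links, predicted_non_links

scores = {(0, 2): 0.9, (1, 3): 0.4, (0, 3): 0.7, (2, 4): 0.1}
print(ranking_based_predictions(scores, t=2)[0])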

                                      Actual class
                                      Positive (link)        Negative (no link)
Predicted    Positive (link)          True positive (TP)     False positive (FP)
class        Negative (no link)       False negative (FN)    True negative (TN)

Table 1: Confusion matrix for a link predictor.

As in any supervised classification problem, a confusion matrix can be defined for a link predictor. The confusion matrix describes the performance of the classifier by counting the number of instances according to their actual and predicted class, as shown in Table 1. Different evaluation metrics can be defined using the values encoded in the confusion matrix. A metric typically used in the evaluation of classification problems is accuracy, defined as the fraction of instances that are correctly classified. Accuracy is computed as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

However, when evaluating link prediction, we must take into consideration that we are dealing with an imbalanced classification problem, since most real-world networks are sparse. This means that the number of non-existing links is usually orders of magnitude larger than the number of existing links. A trivial link prediction technique assigning to every pair of nodes the class corresponding to link non-existence would achieve a very high accuracy in most cases. In order to perform a proper evaluation, we must rely on other measures like precision and recall. Precision and recall are defined as

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

Since trying to maximize both metrics is a multi-objective problem, both scores are typically combined into a single score, called the F1 score, by computing their harmonic mean. When working with the ranking-based evaluation approach, precision can be used as a standalone evaluation metric because the number of positive instances that link prediction models will return is fixed beforehand to be equal to the size of the test set. A powerful tool for evaluating classifiers, widely used to evaluate the performance of link prediction techniques, is the receiver operating characteristic (ROC) curve [25], which is the result of plotting the true positive rate, also known as sensitivity, as a

function of the false positive rate, i.e., one minus the specificity. The closer the curve is to the upper-left corner of the plot, the better the classifier. The curve of a random classifier matches the diagonal from the bottom-left corner to the upper-right corner. The AUC, which stands for the area under the ROC curve, is a value ranging from zero to one that is equal to the probability of the classifier ranking a positive example higher than a negative example. Therefore, higher AUC values indicate better classification performance, where an AUC of 1 represents a perfect classifier and a value of 0.5 represents a completely random classifier. In the context of link prediction, the AUC value can be asymptotically approximated by sampling random pairs of instances. Each sampled pair of instances must consist of a pair of connected nodes, considered the positive instance, and a pair of unconnected nodes, considered the negative instance. The AUC value can be approximated as

$$AUC \approx \frac{n_1 + 0.5\,n_2}{n}$$

where $n$ is the number of sampled pairs of instances, $n_1$ is the number of properly ranked sample pairs, for which the positive instance is ranked higher than the negative instance, and $n_2$ is the number of pairs where both instances are equally ranked.
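The sampling-based approximation above can be sketched as follows (an illustrative helper written for this summary; the scoring function and the sampled sets of connected and unconnected pairs are assumed inputs):

import random

def approximate_auc(score, connected_pairs, unconnected_pairs, n=10000, seed=0):
    # Sample n (positive, negative) pairs and count how often the positive
    # instance is ranked higher (n1) or equally (n2).
    rng = random.Random(seed)
    n1 = n2 = 0
    for _ in range(n):
        pos = rng.choice(connected_pairs)
        neg = rng.choice(unconnected_pairs)
        if score(*pos) > score(*neg):
            n1 += 1
        elif score(*pos) == score(*neg):
            n2 += 1
    return (n1 + 0.5 * n2) / n

score = lambda x, y: 1.0 / (1 + abs(x - y))   # toy scoring function
print(approximate_auc(score, [(1, 2), (3, 4)], [(1, 9), (2, 7)], n=1000))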

3 Objectives

The main objectives of this dissertation revolve around the link prediction problem in complex networks, both in the development of new link prediction techniques and the study of novel applications related to this problem. In particular, the present dissertation is organized around the following objectives:

• Analysis of existing link prediction techniques. We have studied the state of the art of link prediction techniques. Our study was motivated by the new techniques that had been developed since the publication of the latest relevant reviews of the field [15, 21]. Our review provides a classification of the different link prediction techniques according to the approach that they follow. We have also studied their computational complexity and carried out some experiments to evaluate their predictive strength.

• Adaptive link prediction. Similarity-based link prediction techniques are commonly used due to their high scalability. However, these techniques do not learn from the network and their performance strongly varies from network to network. One of our main goals was to study how these techniques are related and what properties of the network influence their performance. As a result of our study, we have proposed an adaptive link prediction technique.

• Local probabilistic link prediction. Probabilistic techniques can be used to model the uncertainty present in the link prediction problem, where link absence does not guarantee absence of an actual relationship. However, these techniques present a high computational complexity in the phase required for estimating the parameters of the model. To solve this limitation, we have proposed a probabilistic model based on aggregating evidence from local features.

• With respect to the applications of link prediction, we address several problems listed below:


– Prioritization using heterogeneous data. Gene prioritization refers to a family of computational techniques for inferring disease genes. We propose a link-prediction-based approach, which integrates heterogeneous networks.

– Disambiguation of semantic relations. We propose a novel knowledge-based approach for the disambiguation of semantic relations using the redundancy of some knowledge bases.

– Automorphic distance metric for node role discovery. We propose a novel distance metric capturing the concept of automorphic distance, which measures how far two nodes are from playing the same topological role in the network. We also show how generating vector representations based on the application of multidimensional scaling on automorphic distances leads to improved embeddings that capture role information better than previously proposed techniques.

• Development of a framework for network data mining. Another goal of this thesis is the development of new features for the NOESIS open-source framework for network data mining, especially those features related to link prediction. In addition, we also develop a Python API for NOESIS.

4 Discussion of results

In this Section, we summarize the proposals and the results we have obtained within this dissertation.

4.1 Link prediction

This Subsection is devoted to the analysis of existing link prediction techniques, the proposal of adaptive link prediction, and the study of local probabilistic link prediction.

4.1.1 Link prediction: The state of the art

The challenging task of link prediction has attracted a lot of attention from the scientific community as a result of the many real-world applications of this problem. Progress in this field has led to the development of a large number of approaches modelling different connectivity patterns. Extensive surveys of the link prediction problem have been published in the past [15, 26, 21, 27]. However, a new review of the state of the art in link prediction was desirable, as new techniques have been developed since the publication of these surveys. The link prediction problem has been tackled following very different approaches. One of our major contributions is the proposal of a two-level taxonomy to classify the wide spectrum of existing link prediction techniques (see Figure 1). In the first level of our proposed taxonomy, techniques have been categorized as similarity-based methods, probabilistic and statistical methods, algorithmic methods, and preprocessing methods. Similarity-based methods rely on the assumption that nodes tend to form connections with similar nodes. These techniques take the form of a function that assigns a similarity



score, where higher similarity implies a higher probability of link existence, to a given pair of nodes based on the topology of the network. Different techniques capture different definitions of similarity. In the second level of the proposed taxonomy, similarity-based techniques are classified according to the range of topological information that they take into account: local approaches, which only use direct neighborhood information; global approaches, which consider the whole topology of the network to estimate the similarity between any pair of nodes; and quasi-local approaches, which are an intermediate point between local and global approaches.

Figure 1: Proposed taxonomy for link prediction techniques: similarity-based methods (local, global, and quasi-local), probabilistic and statistical methods, algorithmic methods (classifier-based, metaheuristic-based, and factorization-based), and preprocessing methods.

The most basic local link prediction technique, which surprisingly works very well in many cases, is counting the number of shared neighbors between nodes [28]. However, assuming that all shared neighbors contribute to the same extent is a naïve approach. The Adamic-Adar index [29] and the Resource Allocation index [30] penalize the contribution of each shared neighbor proportionally to its degree. Other techniques, such as the Jaccard index [31] or the Hub Promoted index [32], penalize node similarity proportionally to the number of non-shared neighbors. In contrast, global similarity-based techniques usually consider the existing paths between nodes. The most basic approach would be the negated length of the shortest path between a pair of nodes. However, this approach obtains poor results in practice as a result of not taking indirect paths into account [33]. The Katz index, which can be computed in closed form, overcomes this limitation by summing the influence of all possible paths between the considered nodes [34]. Another family of global techniques is based on estimating the probability of reaching the target node starting from the source node by means of random walks. Random Walks with Restart [35] and Flow Propagation [36] are examples of this type of technique. Finally, quasi-local approaches are usually restricted versions of global techniques, such as the Local Path Index [37], highly inspired by the Katz index, or Local Random Walks [38], based on length-restricted random walks.

The category of probabilistic and statistical methods comprises techniques that assume an a priori known network structure with unknown parameters that are estimated using statistical analysis and probability theory. Once the parameters that best fit the network have been estimated, the obtained model can be used to predict the likelihood of existence of each non-observed link. For example, the hierarchical structure model assumes a hierarchical organization of the network, where hub nodes act as bridges between communities of highly connected nodes [39]. A different approach, known as the stochastic block model, assumes that the network is organized in communities or blocks of densely connected nodes [40]. The parameters of this model, which are

the probabilities of a link between nodes of each possible pair of blocks, are estimated by maximizing the likelihood of the model given a network. Other examples of techniques in this group are the cycle formation model [41], based on estimating the probability of existence of cycles of a certain length, and the local co-occurrence model [42], based on the existence of relevant nodes in the paths between other nodes.

Since the link prediction problem is a supervised classification task, an evident approach is the use of machine learning techniques, such as classifiers or metaheuristics. These kinds of learning and optimization techniques are categorized in our taxonomy as algorithmic methods. There are many studies on the application of the most popular classification algorithms to the link prediction problem [43]. The problem of class imbalance and their poor scalability limit their applicability in many situations. Recently, metaheuristic optimization algorithms, such as evolutionary algorithms, have been used to estimate the influence of different possible factors driving the dynamics of link formation in networks [44]. Matrix factorization techniques, which have been widely used in recommender systems, are also included in this category. These techniques are based on the factorization of the adjacency matrix of the network in order to learn latent features explaining the existence or absence of links in the network [45].

Finally, preprocessing methods comprise the last group in our taxonomy. These approaches aim to reduce the noise present in the network by removing weak or spurious links. For example, a low-rank approximation of the adjacency matrix can be used to remove less relevant links while maintaining the overall structure of the network [46]. Other techniques, like unseen bigrams, based on the idea of substituting nodes by similar elements, and filtering, based on using other link prediction techniques to score the strength of links and removing the weakest ones, are also included in this category [33].
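To make the local similarity-based indices discussed above concrete, the following sketch (an illustration under the standard definitions cited in the survey, not the NOESIS implementation) computes the Common Neighbors, Adamic-Adar, Resource Allocation, and Jaccard scores for a pair of nodes from a neighborhood dictionary:

import math

def local_similarity_scores(neighbors, x, y):
    # neighbors: dict mapping each node to the set of its neighbors.
    shared = neighbors[x] & neighbors[y]
    union = neighbors[x] | neighbors[y]
    return {
        "common_neighbors": len(shared),
        # Shared neighbors contribute less the higher their degree.
        "adamic_adar": sum(1.0 / math.log(len(neighbors[z])) for z in shared),
        "resource_allocation": sum(1.0 / len(neighbors[z]) for z in shared),
        # Penalization proportional to the number of non-shared neighbors.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

neighbors = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(local_similarity_scores(neighbors, 0, 3))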

Figure 2: Number of times among the 5 best techniques in experiments with 40 techniques on 7 networks. Abbreviations stand for common neighbours (CN), Adamic-Adar index (AA), and Resource Allocation index (RA).


Another contribution that we made in our review of the state of the art of link prediction is the theoretical study of the computational complexity of similarity-based techniques. In addition, an exhaustive empirical validation was carried out by applying these link prediction techniques to complex networks from different domains in order to measure and compare their precision using fivefold cross-validation. The results obtained in our experimentation show that no technique is strictly better than the others in terms of precision and that local and quasi-local techniques achieve very competitive results, sometimes even better than those of global techniques. A summary of the obtained results is shown in Figure 2, where the number of times each technique appears among the top five techniques with the best precision is indicated, including only techniques appearing at least once among the best five. The journal article associated with this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. A survey of link prediction in complex networks. ACM Computing Surveys 49(4):69:1-69:33, 2017. DOI 10.1145/3012704.

4.1.2 Adaptive link prediction

Despite their simplicity, local similarity-based link prediction techniques remain a relevant approach for the link prediction problem due to their reasonable precision and their high scalability, which is a critical requisite in many practical applications. Due to the large number of similarity-based techniques proposed in the literature and the difficulty of knowing which techniques will work better in practice, the current strategy is to manually evaluate different approaches in order to choose the definition of similarity that achieves the best performance in a particular context. In order to overcome this practical limitation while maintaining the benefits in terms of scalability of similarity-based techniques, we proposed an adaptive link prediction technique. We aimed to develop a novel approach that could adapt to the dynamics of different networks without requiring a learning phase. We hypothesized that topological characteristics of the network could be used to estimate which technique would perform better for a given network. To test our hypothesis, we derived an expression that generalizes different degree-penalization local similarity-based techniques under the equation

$$S(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} |\Gamma_z|^{-\alpha}$$

where $x$ and $y$ are the pair of nodes for which link likelihood is being estimated, $\Gamma_z$ is the set of neighbors of node $z$, and $\alpha$ is the adaptive parameter. It is straightforward to see that this expression is equivalent to the common neighbors technique when $\alpha = 0$ and to the resource allocation index when $\alpha = 1$, and that it closely approximates the Adamic-Adar index when $\alpha \approx 0.37$. One of our main contributions in this work is the study of how the parameter $\alpha$ of our proposed generalization relates to structural features of the network. For our experimentation, we gathered a set of fifteen networks, from very different domains, exhibiting very heterogeneous features. We evaluated the precision of our proposal on these networks for values of $\alpha$ between $-1$ and $2$ in steps of size $0.1$. The performance curves exhibited a pronounced convex shape, showing how the proper estimation of the $\alpha$ parameter has a critical impact on the achieved accuracy.
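A minimal sketch of the generalized degree-penalization score defined above (illustrative code, not the implementation evaluated in the associated paper) is shown below; setting alpha to 0 recovers the common neighbors count and setting it to 1 recovers the resource allocation index.

def degree_penalized_score(neighbors, x, y, alpha):
    # Sum over shared neighbors z of |Gamma_z| raised to the power -alpha.
    shared = neighbors[x] & neighbors[y]
    return sum(len(neighbors[z]) ** (-alpha) for z in shared)

neighbors = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(degree_penalized_score(neighbors, 0, 3, alpha=0.37))  # approximates Adamic-Adar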


Figure 3: Relationship between the best-performing alpha value and different topological properties for a set of 15 networks. The r value is the Pearson correlation coefficient.

Our next step was to study whether the best-performing value of $\alpha$ was related to some topological property of the network. In order to carry out this test, we computed the Pearson correlation coefficient between the value of $\alpha$ in our expression that achieved the best results and different global topological features of the network, including the number of nodes in the network, the number of links in the network, the average degree, the average clustering coefficient, the average shortest path length, the diameter, the heterogeneity, and the assortativity. Our experimentation showed that most of these properties did not show a statistically significant correlation with the optimal value of $\alpha$ (see Figure 3). However, the average clustering coefficient showed a very high and statistically significant correlation, suggesting that this feature could be used to accurately predict an approximation of the optimal value of $\alpha$, and therefore could be used to estimate the most adequate definition of similarity, based on shared neighbors penalized by their degree, between nodes in the network.


In the light of these results, we introduced the average clustering coefficient in our proposed generalization as

$$S(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} |\Gamma_z|^{-\beta C}$$

where $C$ is the average clustering coefficient of the network and $\beta$ is a constant that had to be estimated. Linear regression without intercept on the relation between the average clustering coefficient and the best-performing value of $\alpha$ in the previously mentioned networks was used to estimate $\beta \approx 2.5$ as the optimal value for this constant. It should be clarified that $\beta$ remains a constant regardless of the network where this similarity-based technique is applied; the average clustering coefficient of the network is the element that varies from one network to another. Finally, in order to evaluate the precision of our approach, we performed a validation over a set of seven test networks and measured the precision and AUC achieved compared to different similarity-based link prediction techniques. The obtained results showed that our approach consistently achieved the best results in most cases. The Friedman test confirmed that this improvement was statistically significant. The journal article associated with this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. Adaptive degree penalization for link prediction. Journal of Computational Science 13:1-9, 2016. DOI 10.1016/j.jocs.2015.12.003.

4.1.3 Probabilistic local link prediction

After our review of the state of the art of the link prediction problem, we realized that most probabilistic approaches had severe limitations in their practical applicability due to the high computational complexity of estimating the parameters of their models. However, scalable probabilistic approaches based on local features have been proposed recently. These approaches exhibit remarkable precision, as they can satisfactorily use probabilistic models to capture the uncertainty inherently present in the network, where the absence of a link between a pair of nodes may be the result of the non-existence or the non-observation of a link that could actually exist. In the light of these results, we developed a probabilistic approach based on the accumulation and aggregation of evidence of link existence. If the event corresponding to the existence of a link between nodes $x$ and $y$ is denoted as $L_{xy}$ and the set of shared neighbors of these nodes is denoted as $\Gamma_{x \cap y}$, according to Bayes' theorem we can write

$$P(L_{xy} \mid \Gamma_{x \cap y}) = \frac{P(\Gamma_{x \cap y} \mid L_{xy})\, P(L_{xy})}{P(\Gamma_{x \cap y})}$$

Conditional independence between the evidence provided by each shared neighbor was assumed, so this expression was rewritten as

$$P(L_{xy} \mid \Gamma_{x \cap y}) = \prod_{z \in \Gamma_{x \cap y}} \frac{P(z \mid L_{xy})}{P(z)}\, P(L_{xy})$$

Although the independence assumption is technically wrong in most cases, it has been empirically shown that good results can be obtained, as many problems are satisfactorily tackled using approaches that make this assumption, such as the naïve Bayes classifier [47]. Since, according to Bayes' theorem, the term $P(z \mid L_{xy})/P(z)$ is equal to the term $P(L_{xy} \mid z)/P(L_{xy})$, and since the probability of link absence is $1 - P(L_{xy})$, this expression can be rewritten as

$$P(L_{xy} \mid \Gamma_{x \cap y}) = 1 - \prod_{z \in \Gamma_{x \cap y}} \frac{1 - P(L_{xy} \mid z)}{1 - P(L_{xy})}\, (1 - P(L_{xy}))$$

where $P(L_{xy} \mid z)$ is the probability of a link between $x$ and $y$ given the shared neighbor $z$. The definition of this probability is flexible, but we propose a simple approach that estimates this value as the probability of the existence of a link in the network between a pair of nodes given a shared neighbor of the same degree as $z$, which can be computed as

$$P(L_{xy} \mid z) = \frac{1}{|N_{|\Gamma_z|}|} \sum_{k \in N_{|\Gamma_z|}} \frac{\sum_{i \neq j,\; i,j \in \Gamma_k} P(L_{ij})}{|\Gamma_k|\,(|\Gamma_k| - 1)}$$

where $N_{|\Gamma_z|}$ is the set of nodes with the same degree as $z$. In order to assess the performance of our probabilistic approach, we measured the precision of our proposal on ten complex networks from different domains and compared the achieved results with those obtained by popular similarity-based techniques and by a local naïve Bayes approach [48]. The experimentation showed that our approach achieved the best results more consistently than the other approaches, also obtaining the best results on average. The book chapter associated with this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. Probabilistic local link prediction in complex networks. In Proceedings of the 11th International Conference on Scalable Uncertainty Management. Lecture Notes in Computer Science, 10564:391-396, 2017. DOI 10.1007/978-3-319-67582-4.
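Returning to the probabilistic model above, the following sketch (a simplified illustration of the aggregation formula; the prior P(Lxy) and the per-neighbor probabilities P(Lxy | z) are assumed to be given) combines the evidence contributed by each shared neighbor:

def aggregate_link_probability(prior, neighbor_probs):
    # prior: P(L_xy), the prior probability of a link between x and y.
    # neighbor_probs: list with one P(L_xy | z) value per shared neighbor z.
    product = 1.0
    for p in neighbor_probs:
        product *= (1.0 - p) / (1.0 - prior)
    return 1.0 - product * (1.0 - prior)

# Toy usage: a sparse-network prior and two shared neighbors providing evidence.
print(aggregate_link_probability(0.01, [0.2, 0.35]))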

4.2 Applications

This Subsection is devoted to the applications of link prediction studied within this dissertation, including prioritization using heterogeneous data, the disambiguation of semantic relations, and an automorphic distance metric for node role discovery.

4.2.1 Prioritization using heterogeneous data

Disease-gene prioritization is the task of identifying candidate genes that may be related to specific complex diseases [49]. A typical prioritization technique, given a disease as input, would evaluate the relevance of each gene for that disease based on training data and would return a ranking of genes sorted according to their inferred relevance to the disease. Most of these approaches rely on the fact that associated or interacting biological entities are more likely to share a common function and to be involved in the

same processes [50]. The application of this principle allows us to tackle this problem as a link prediction problem. After studying the state of the art of prioritization techniques, we found that most of them were highly tailored to specific prioritization tasks, mainly disease-gene prioritization. In addition, most of these techniques were designed to exploit very specific types of data, without providing mechanisms for the integration of data of different types. In order to overcome these limitations, we proposed a novel approach called ProphNet. ProphNet was designed to handle heterogeneous data by abstracting from the biological or clinical domain and, therefore, working over abstract complex networks. Our model works over a set of networks D, each one representing the interactions or similarities between specific elements (e.g., protein-protein interactions or disease phenotype similarities), and a set of bipartite networks R connecting networks from D such that a path exists between every pair of networks in D, forming a global network. The proposed approach can handle different prioritization tasks by allowing the user to define the query entity or query set of entities, which are the entities with respect to which the user wants to obtain associated candidate entities, and the target network, which is the network from set D representing the entities that the user wants to rank according to their association with the query. For example, for candidate gene prioritization, the studied disease would be the query entity and the protein-protein interaction network would be the target network.


Figure 4: Toy example of two different ProphNet runs: in example A, the query node and the target node are highly correlated, in contrast to example B. Information is propagated from the query nodes (colored in red) in the query network (containing circled nodes) to the target network (containing squared nodes) through an additional network (containing rhomboid nodes).

Our technique carries out the prioritization by performing a propagation of information from the nodes representing the query entities to the nodes in the target network, as shown in Figure 4. This process is performed by alternating two

propagation steps for each possible path composed of networks from D and R connecting the query network, which is the network containing the query set of entities, and the target network, which is the network containing the nodes that are going to be ranked according to their relevance to the query set. The first step, which must be applied to make information flow through each network from D, involves an inner propagation of information using the Flow Propagation algorithm [36]. The second step, which is applied to make information flow between networks from D through networks from R, is computed as the average of the flow propagated to the neighbors of a node. This process of information propagation is repeated until the propagated flow reaches each network adjacent to the target network. The relevance to the query set of each entity in the target network is finally computed by measuring the correlation between the scores resulting from the propagated flow and the scores obtained as a result of performing a propagation from the evaluated entity in the target network. Since this second propagation is a computationally expensive process, because it must be performed for each node in the target network, we suggest the precalculation of these propagations.

In order to evaluate the precision of our approach in different prioritization tasks, we applied ProphNet over a global network built using different datasets to build the subnetworks. First, a disease phenotype network built by using text mining techniques on the Online Mendelian Inheritance in Man (OMIM, [51]) database. The network contained 5080 disease phenotypes with 19729 weighted links connecting the most similar phenotypes. Similarity between phenotypes was estimated by counting the occurrences of each term from specific sections of the Medical Subject Headings (MeSH) vocabulary. Second, a gene network, represented as a protein-protein interaction network, built with data from the Human Protein Reference Database (HPRD, [52]). The network contains 8919 proteins and 32331 interactions. Third, a network of protein domains with 48778 relationships between 5490 domains extracted from DOMINE [53] and InterDom [54]. Fourth, a network of 1393 connections between phenotypes and genes based on data directly extracted from the phenotype-gene relationship field of OMIM. Fifth, a network of relationships between domains and genes extracted from Pfam [55]. Sixth, a network of relationships between domains and phenotypes extracted from Pfam and the UniProt database [56].

Two evaluation tests were performed. The first one consisted of a leave-one-out cross-validation gene-disease prioritization task. Furthermore, a validation test using new gene-disease associations added to OMIM between 2007 and 2010 was carried out. We compared the ROC curves and the obtained AUC with those obtained by rcNet [57], a state-of-the-art gene-disease prioritization technique. The second evaluation consisted of a leave-one-out cross-validation for domain-disease prioritization. The achieved results were compared with those obtained by DomainRBF [58], a state-of-the-art domain-disease prioritization technique. The results obtained in our experiments show that our proposed approach performs substantially better than rcNet and DomainRBF, obtaining statistically significant superior performance scores in the different evaluation tasks.
We also performed a robustness test to evaluate how the performance of our proposed technique is affected by hyperparameter selection. The obtained results suggested that ProphNet's performance remained fairly consistent within a range of reasonable hyperparameter values.


Finally, we studied some interesting cases by applying our approach to disease-gene prioritization. We applied ProphNet to compute the list of candidate genes for Alzheimer's disease, Diabetes Mellitus Type 2, and Breast Cancer. We showed how recent scientific literature relates the top-ranked genes to these diseases, even though these relations were not explicitly present in our training data. The journal article associated with this part of the dissertation is:

V. Martínez, C. Cano, A. Blanco. ProphNet: A generic prioritization method through propagation of information. BMC Bioinformatics 15(S-1):S5, 2014. DOI 10.1186/1471-2105-15-S1-S5.
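As an illustration of the final correlation-based ranking step described above, the following simplified sketch (hypothetical code: it assumes the query-propagated scores and the precomputed propagation profile of each target entity are already available, and uses the Pearson correlation as the correlation measure) ranks the entities of the target network by their relevance to the query:

import numpy as np

def rank_target_entities(query_scores, target_profiles):
    # query_scores: scores propagated from the query set into the networks
    #               adjacent to the target network.
    # target_profiles: dict mapping each target entity to its precomputed
    #                  propagation profile (same length as query_scores).
    relevance = {entity: np.corrcoef(query_scores, profile)[0, 1]
                 for entity, profile in target_profiles.items()}
    return sorted(relevance, key=relevance.get, reverse=True)

query_scores = np.array([0.4, 0.1, 0.0, 0.3])
target_profiles = {"gene_a": np.array([0.5, 0.2, 0.0, 0.2]),
                   "gene_b": np.array([0.0, 0.3, 0.6, 0.1])}
print(rank_target_entities(query_scores, target_profiles))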

4.2.2 Disambiguation of semantic relations

Word-sense disambiguation (WSD) is the problem in natural language processing (NLP) that consists of identifying the actual sense or meaning of a word in a text [59]. This task remains an open problem because the accuracy of these techniques is still far from human performance. A successful approach to WSD is the use of knowledge-based techniques that exploit structured information, such as ontologies, to disambiguate words in texts. However, the acquisition of this knowledge is limited because the manual annotation of these relations by humans is a highly time- and resource-consuming task, leading to the well-known knowledge acquisition bottleneck problem [60]. We proposed an approach to automatically disambiguate semantic relations between concepts based on ideas borrowed from the link prediction problem. Our approach was motivated by the hypothesis that the overlap between ambiguous relations could be exploited to disambiguate them, as shown in Figure 5. Relations involving similar or even shared senses can mutually reinforce each other. We also hypothesized that these disambiguated relations could enhance the performance of general-purpose knowledge-based WSD techniques.

Our proposed approach relies on a taxonomy of concepts, which is used to propagate evidence between similar concepts. We used WordNet [61], a general lexical database of English, as our taxonomy of concepts, as it provides groups of synonymous words, called synsets, and hypernymy/hyponymy relations that make up the taxonomy. Given an ambiguous relation, evidence is propagated across the taxonomy for each possible sense of the two involved words or concepts. Evidence is represented as a tuple containing a unique identifier for the ambiguous relation, a binary label encoding whether the term is the source or the target in the relation, and the probability value of the evidence. The propagation process requires propagating this evidence by annotating the same tuple, with an updated probability, in adjacent concepts in the taxonomy. Two parameters are introduced. First, α, called the upward propagation factor, indicates the probability of a parent node having a property observed in a child node. Second, β, called the downward propagation factor, indicates the probability of a child node having a property observed in a parent node. Storing all this evidence in the taxonomy is an infeasible procedure in practice because it would require a number of annotations equal to the number of concepts in the taxonomy for each piece of evidence. Instead, we propose an alternative approach that solves the same problem but only requires the annotation of evidence in the nodes in the path


Figure 5: Example of two semantic relations, (bus, hasA, trunk) and (car, hasA, boot), that mutually reinforce each other for specific sense assignments due to similar sense sharing.

This new approach implies that the number of annotations becomes proportional only to the depth of the taxonomy. Our proposed approach works as follows. First, given the collection of ambiguous semantic relations, evidence is annotated for each of the possible senses of the concepts involved. Then, these annotations are propagated across the nodes in the path from the original nodes containing the initial evidence to the root of the taxonomy. The likelihood of existence of a relation between two concepts is computed by gathering the evidence stored in the nodes associated to these concepts and in the nodes in the path from these concepts to the root node. This evidence is aggregated to compute the probability of existence of the relation using the evidence aggregation framework that we proposed in our work. Finally, disambiguation is performed by computing the probability of relation existence for all possible pairs of sense assignments and choosing the most likely one as their correct senses.
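To make the propagation scheme more concrete, the following minimal Python sketch illustrates how evidence tuples could be annotated along the path to the root and later gathered for a candidate sense. The data structures, helper names, and the fixed value of the upward propagation factor are illustrative assumptions and do not correspond to the actual implementation described in the associated paper.

# Minimal sketch of the evidence propagation scheme described above.
# The taxonomy is a toy dictionary mapping each synset to its parent
# (None for the root); ALPHA is the upward propagation factor.
# All names and values here are illustrative assumptions.

ALPHA = 0.8  # assumed upward propagation factor

def path_to_root(taxonomy, synset):
    """Return the list of synsets from `synset` up to the root of the taxonomy."""
    path = [synset]
    while taxonomy[path[-1]] is not None:
        path.append(taxonomy[path[-1]])
    return path

def annotate_evidence(annotations, taxonomy, synset, relation_id, is_source):
    """Annotate one evidence tuple on every node in the path to the root,
    attenuating its probability by ALPHA at each upward step."""
    probability = 1.0
    for node in path_to_root(taxonomy, synset):
        annotations.setdefault(node, []).append((relation_id, is_source, probability))
        probability *= ALPHA

def gather_evidence(annotations, taxonomy, synset):
    """Collect all the evidence visible from `synset`, i.e., the evidence
    stored in the nodes on its path to the root."""
    evidence = []
    for node in path_to_root(taxonomy, synset):
        evidence.extend(annotations.get(node, []))
    return evidence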

We carried out different evaluation tasks in order to measure the accuracy of our approach. Due to the novelty of our proposal, we could not find previously proposed techniques applicable to our problem setting, so we repurposed several reasonable approaches. First, a random approach, which serves as a baseline for our experiments. Second, an approach based on choosing the most frequent sense of each word as the right sense; frequency information was extracted from WordNet, which estimates it from a large corpus of text. Third, an approach that chooses the pair of senses with the lowest shortest-path length in the taxonomy. Finally, a state-of-the-art technique for learning semantic vector representations of word senses, known as NASARI [62]; for NASARI, the predicted pair of senses was the one with the lowest cosine distance.

For the first evaluation task, we created ambiguous relations from different WordNet [61] semantic relations and applied the disambiguation techniques previously described. The results obtained in this experiment showed that our approach outperformed the other approaches by a large margin. We also used these data to show that our approach is robust to variations in its hyperparameters. In the second evaluation task, we applied the compared techniques to disambiguate a popular ambiguous knowledge base, known as ConceptNet [63], relying only on the WordNet taxonomy and the ambiguous concepts stored in ConceptNet. NASARI and our approach obtained comparable results, above the rest of the techniques in this experiment. In the last experiment, we evaluated how the performance of general-purpose state-of-the-art word sense disambiguation techniques improves when relations disambiguated by these specific disambiguation techniques are included. This evaluation task showed that including the relations obtained using our approach led to the best word sense disambiguation performance. The article associated to this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. Disambiguation of semantic relations using evidence aggregation according to a sense inventory. Submitted to IEEE Transactions on Knowledge and Data Engineering (ISSN: 1041-4347).

4.2.3 An automorphic distance metric for node role discovery

Node role discovery is the problem of dividing the nodes of a network into groups of nodes exhibiting similar connectivity patterns [8]. This is a problem of interest in the context of the link prediction problem because the role of nodes can be an important element driving the dynamics of link formation in real-world networks. Although different definitions of role exist, it can be defined in terms of automorphic equivalence, where two nodes fall in the same class if the graph would remain the same if their labels were swapped. However, the definition of role in terms of automorphic equivalence is too restrictive and will rarely be satisfied in complex networks. In order to overcome this limitation, we proposed a metric capturing the concept of automorphic distance between a pair of nodes. The proposed metric captures, in terms of distance, how close two nodes are to being automorphically equivalent. Our distance metric is built on top of the Weisfeiler-Lehman algorithm for graph isomorphism testing [64]. In summary, the Weisfeiler-Lehman algorithm is an iterative procedure that starts from a labelling of the nodes corresponding to their degrees and then repeatedly gathers, sorts, and hashes the labels of each node's neighbors to produce new labels for the next iteration. This procedure is repeated until the partition of nodes obtained by grouping them according to their labels in one iteration is equivalent to the partition obtained in the next iteration. The labels obtained after convergence are known as canonical labels and, if two graphs share the same canonical labels, they are deemed isomorphic by the test.
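As an illustration of this refinement procedure (a simplified sketch under assumed data structures, not the implementation used in the thesis), the following Python function computes the converged labels for a graph given as an adjacency dictionary:

import hashlib

def weisfeiler_lehman_labels(adjacency):
    """Iteratively refine node labels until the partition of nodes induced by
    the labels no longer changes. `adjacency` maps each node to the set of its
    neighbors."""
    # The initial label of each node is its degree.
    labels = {node: str(len(neighbors)) for node, neighbors in adjacency.items()}
    while True:
        new_labels = {}
        for node, neighbors in adjacency.items():
            # Gather and sort the labels of the neighbors, then hash the signature.
            signature = labels[node] + "|" + ",".join(sorted(labels[n] for n in neighbors))
            new_labels[node] = hashlib.sha1(signature.encode()).hexdigest()
        # Stop when both labellings group the nodes into the same sets.
        old_groups, new_groups = {}, {}
        for node in adjacency:
            old_groups.setdefault(labels[node], set()).add(node)
            new_groups.setdefault(new_labels[node], set()).add(node)
        if set(map(frozenset, old_groups.values())) == set(map(frozenset, new_groups.values())):
            return new_labels
        labels = new_labels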


The proposed automorphic distance metric for nodes consists of measuring the distance between their canonical labels obtained using the Weisfeiler-Lehman algorithm. The distance between the labels of a pair of nodes can be used as a measure of how close these two nodes are to playing the same topological role in the network. We proposed building these distances iteratively. The Weisfeiler-Lehman algorithm begins by assigning a label to each node such that two nodes have the same label if and only if they have the same degree. We defined the distance between labels in this first assignment as the absolute difference between the degrees of the nodes associated with each label. Given these initial distances, the distances between labels in subsequent iterations are calculated as the sum of the distances between the neighbor labels from the previous iteration. In order to compute this summation, labels are paired with their most similar counterparts by solving the assignment problem using the Hungarian algorithm [65], as sketched below. We proved that our proposal is a valid distance metric by showing how it satisfies the required conditions, namely non-negativity, identity of indiscernibles, symmetry, and triangle inequality.
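A minimal sketch of one refinement step of the label distances, using SciPy's implementation of the Hungarian algorithm to pair neighbor labels, is shown below. It is illustrative only: the handling of unmatched labels when the two nodes have different degrees is omitted, and the data structures are assumptions rather than the actual implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def refined_label_distance(neighbor_labels_a, neighbor_labels_b, previous_distance):
    """Pair the neighbor labels of two nodes so that the total distance from the
    previous iteration is minimal (assignment problem) and return that total.
    `previous_distance` maps a pair of labels to their previous-iteration distance;
    in the first iteration it is the absolute difference of the associated degrees."""
    cost = np.array([[previous_distance[(a, b)] for b in neighbor_labels_b]
                     for a in neighbor_labels_a])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return cost[rows, cols].sum()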


Figure 6: Node embeddings for a world trade network computed using different techniques: the automorphic distance proposed in this thesis, node2vec, and struc2vec.

Different experiments were performed in order to evaluate the performance of our distance metric for the node role discovery problem. We used the obtained distances between nodes to compute node embeddings using multidimensional scaling (MDS, [66]). We plotted the embeddings of the nodes in Zachary's karate club network, which represents the friendships in a karate club after a dispute between the two leaders of the club. The plots showed that our approach can linearly separate the different node classes better than other approaches. We also tested our approach on a world trade network, as shown in Figure 6. Our results show that our method separates the different categories of countries better than alternative approaches. The technical report associated to this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. An automorphic distance metric and its application to node embedding for role mining. ArXiv e-prints 1712.06979, December 2017.

4.3 A network data mining framework

One of the research lines of this thesis was the development of software tools for network data mining, especially link prediction techniques, under the NOESIS project.


NOESIS¹, which stands for Network-Oriented Exploration, Simulation, and Induction System, is an open source framework for network data mining containing a large number of network analysis algorithms, including metrics for evaluating network structural properties, community detection techniques, and network layouts, among others. This framework, oriented towards high efficiency and scalability through parallelization, is the result of the work of different developers. The main contribution in the context of this thesis is the implementation of a large number of link prediction techniques in the framework and the development of an API that allows NOESIS to be used from Python². NOESIS provides a friendly interface for a large number of network analysis algorithms. For example, the following Java code snippet, where boilerplate code has been omitted, shows how to create a network using the Barabási–Albert model and perform link prediction using preferential attachment:

// Create a random network using the Barabási–Albert model
Network network = new BarabasiAlbertNetwork(100, 10);
// Score node pairs using the preferential attachment index
LinkPredictionScore method = new PreferentialAttachmentScore(network);
// Compute the matrix of link prediction scores
Matrix result = method.call();

The same example implemented using the Python API is shown in the following code fragment:

# Start a NOESIS session
ns = Noesis()
# Create a network using the Barabási–Albert model
network = ns.create_network_from_model('BarabasiAlbert', 100, 10)
# Create a link predictor based on the preferential attachment index
predictor = ns.create_link_predictor('PreferentialAttachment')
# Compute the link prediction scores for the network
result = predictor.compute(network)
# End the NOESIS session
ns.end()

In order to disseminate our work, we wrote a paper explaining the NOESIS framework design and the collection of techniques implemented. The whole NOESIS ecosystem is implemented on top of the hardware abstraction layer, which provides routines for high-performance parallel computation while hiding the underlying complexity behind popular paradigms such as the map-reduce pattern [67]. On top of this layer, the reflective kernel and the data access layer are implemented. On the one hand, the kernel layer provides the main NOESIS base models and tasks as data structures and algorithms, respectively. This layer also implements the corresponding meta-objects, which can be dynamically manipulated at runtime. On the other hand, the data access layer provides a unified interface for accessing heterogeneous external data sources. This part of the system provides the I/O interfaces to read and write data in different formats, including several standard network storage file formats. The NOESIS API is built on top of these layers, providing the interfaces that developers must use to exploit the different features of the framework. Additionally, an application generator module provides a graphical user interface for using NOESIS algorithms without requiring programming skills.

Since our manuscript introduces the different implemented techniques, our paper also serves as a general review of the network analysis field.

¹ http://noesis.ikor.org
² https://github.com/fvictor/noesis-python

For example, we described some of the most relevant theoretical network formation models, commonly used to study how real-world networks are formed and their underlying dynamics, including random models such as the Watts–Strogatz and Barabási–Albert models. We also summarized several metrics for network structural properties, including reachability and betweenness, among other groups of metrics. Different network visualization techniques that compute pleasant layouts for network nodes are described in our work, including force-based, hierarchical, and regular layouts.

An important part of our work is the categorization of the implemented community detection techniques. These techniques are categorized in our work as: hierarchical, which build a hierarchy of communities using agglomerative or divisive approaches; modularity-based, which compute partitions by maximizing their modularity; partitional, which perform a partition of the network and relocate nodes iteratively; spectral, which work over the Laplacian representation of the network; and overlapping, which are able to compute overlapping communities where nodes may belong to different clusters. Another contribution of our work is the computational complexity analysis of the described community detection techniques. Finally, we also describe the implemented link prediction and scoring techniques and provide their computational complexity, inspired by the categorization that we proposed in our survey about link prediction.

Figure 7: Execution times in milliseconds, represented on a logarithmic scale, of different network analysis frameworks (NOESIS-WSS, NOESIS-FS, NOESIS-TPS, igraph, SNAP, and NetworkX) for four different tasks (BC, LB, CN, and APSP) applied to three complex networks (Wikipedia, p2p-Gnutella, and Facebook).

Finally, we carried out different experiments in order to compare the performance of NOESIS with that of three popular network analysis frameworks: igraph [68], SNAP [69], and NetworkX [70]. We considered three different task schedulers for NOESIS: the work stealing scheduler (WSS), the future scheduler (FS), and the thread pool scheduler (TPS). Four different tasks were considered: betweenness centrality (BC), link betweenness (LB), closeness (CN), and all shortest paths from every node using Dijkstra's algorithm (APSP). The APSP task could not be carried out using SNAP due to its lack of support for this operation. The execution times for three different large complex networks are shown in Figure 7. A logarithmic scale was used because the performance of our proposal differs by orders of magnitude from that of the compared approaches. The article associated to this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. The NOESIS network-oriented exploration, simulation, and induction system. ArXiv e-prints 1611.04810, June 2017. Submitted to Complexity (ISSN: 1099-0526).


5 Concluding remarks

Different research lines were followed in this thesis with a common theme: link prediction in complex networks. The main lines of research in this thesis were the proposal of novel generic link prediction approaches, the development of software for network data mining, and the analysis of novel applications for link prediction techniques.

In the first line of research, our goal was the exhaustive study and understanding of the state of the art of the link prediction problem, including novel techniques and approaches. This study led to the publication of a survey paper in ACM Computing Surveys, the proposal of a taxonomy to categorize link prediction techniques, and a complete empirical evaluation of these techniques using different complex networks from diverse real-world domains. As a result of the acquired knowledge, we proposed two novel link prediction techniques. The first was based on a generalization of degree-penalization similarity-based link prediction techniques and the observation that the optimal parameters for this penalization are strongly correlated with measurable network topological features. This approach consistently outperformed the reference link prediction techniques in the evaluated prediction tasks without increasing the required computational resources. The success of this approach suggested the importance of considering adaptive techniques instead of relying on fixed definitions of similarity. The second proposed technique was a local probabilistic technique where different pieces of evidence are aggregated by assuming their conditional independence. After testing this novel approach using a simple feature such as node degree, results showed that our technique achieved superior performance compared to the reference techniques while maintaining the high scalability of local link prediction techniques.

The second line of research comprised an important fraction of the contributions made in this thesis. In this line, we focused on developing novel applications of link prediction techniques to solve problems from very different domains. In the context of bioinformatics, we developed a novel prioritization technique through the integration of heterogeneous data, which outperformed two state-of-the-art approaches for disease-gene prioritization and disease-domain prioritization. In the context of natural language processing, we tackled the problem of disambiguating semantic relations between concepts as a link prediction problem. Our probabilistic approach gathers redundant evidence to disambiguate the semantic meaning of words in relations extracted from knowledge bases without using labelled data. Finally, we developed a novel approach for node role discovery based on the Weisfeiler-Lehman test for graph isomorphism, with applications to different network-related problems, including link prediction.

The third line of research of this thesis was dedicated to the development of software for high-performance link prediction and general complex network analysis. We worked on the NOESIS framework to implement a collection of state-of-the-art link prediction techniques and developed an API that allows NOESIS functionality to be used from Python. We also wrote a technical report to introduce NOESIS to the scientific community; since it describes the most relevant network mining algorithms, it also serves as a general introduction to the field of network analysis.


6 Future work

Some promising research lines arise from the conclusions drawn from this thesis. Different ideas introduced in this dissertation can be used as the basis for the development of new link prediction techniques and new applications in different fields. Some interesting lines of future work are proposed below:

• Novel adaptive and scalable link prediction techniques: Our proposed link prediction techniques, both the adaptive and the probabilistic, aim to maintain a low computational complexity to preserve their high scalability while obtaining improved performance as a result of adapting to the network under study. Most current learning-based link prediction techniques consider the whole network in the learning process, limiting their applicability in real-world applications where networks can be comprised of millions of nodes. We plan to develop new link prediction techniques that can adapt to the network based on local or efficiently computable features.

• Novel applications of prioritization techniques: In this thesis, we proposed a technique that can exploit heterogeneous data to carry out different types of prioritization, with applications in the context of bioinformatics. Our results showed how the integration of data can improve the performance of these techniques. However, our experimentation is still limited, as we only considered three networks in our seminal study. The study of the integration of more biological and medical data is a prominent line of future research to consider.

• New techniques for the disambiguation of semantic relations: Our work shows how redundancy in natural language as a result of synonymy, hyponymy, and hypernymy can be used to disambiguate semantic relations using approaches inspired by link prediction techniques, and how these disambiguated relations can be used to improve the results obtained by state-of-the-art word sense disambiguation techniques. An important point to improve in our proposal is its scalability, especially in the context of big data. We intend to work on a variant of our algorithm using distributed representations, which should lead to more scalable alternatives to address this problem in massive datasets.

• Further development of the NOESIS framework: NOESIS is currently in a mature state of development. Our results have shown that NOESIS not only includes a larger collection of network analysis techniques than popular frameworks, but is also more efficient than well-established alternatives. A future line of work is to keep the NOESIS framework up to date with the ever-evolving field of network data mining.

Bibliography

[1] Douglas B. West. Introduction to Graph Theory. Prentice Hall, 2nd edition, September 2000.
[2] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding. Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, 26(1):97–107, 2014.
[3] Stephen P. Borgatti, Ajay Mehra, Daniel J. Brass, and Giuseppe Labianca. Network analysis in the social sciences. Science, 323(5916):892–895, 2009.
[4] Roger Guimerà, Stefano Mossa, Adrian Turtschi, and L.A. Nunes Amaral. The worldwide air transportation network: Anomalous centrality, community structure, and cities' global roles. Proceedings of the National Academy of Sciences, 102(22):7794–7799, 2005.
[5] Jean-François Rual, Kavitha Venkatesan, Tong Hao, Tomoko Hirozane-Kishikawa, Amélie Dricot, Ning Li, Gabriel F. Berriz, Francis D. Gibbons, Matija Dreze, Nono Ayivi-Guedehoussou, et al. Towards a proteome-scale map of the human protein–protein interaction network. Nature, 437(7062):1173, 2005.
[6] Mark E. J. Newman. The structure and function of complex networks. Society for Industrial and Applied Mathematics Review, 45(2):167–256, 2003.
[7] Giuliano Andrea Pagani and Marco Aiello. The power grid as a complex network: A survey. Physica A: Statistical Mechanics and its Applications, 392(11):2688–2700, 2013.
[8] Ryan A. Rossi and Nesreen K. Ahmed. Role discovery in networks. IEEE Transactions on Knowledge and Data Engineering, 27(4):1112–1131, 2015.
[9] Indrajit Bhattacharya and Lise Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1(1):5, 2007.
[10] Ryan A. Rossi, Brian Gallagher, Jennifer Neville, and Keith Henderson. Modeling dynamic behavior in large evolving graphs. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 667–676. ACM, 2013.
[11] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
[12] Lei Tang, Huan Liu, Jianping Zhang, and Zohreh Nazeri. Community evolution in dynamic multi-mode networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 677–685. ACM, 2008.


[13] Roberto Tamassia. Handbook of graph drawing and visualization. CRC press, 2013.

[14] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[15] David Liben-Nowell and Jon M. Kleinberg. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7):1019–1031, 2007.

[16] Ilham Esslimani, Armelle Brun, and Anne Boyer. Densifying a behavioral recommender system by social networks link prediction methods. Social Network Analysis and Mining, 1(3):159–172, 2011.

[17] Anita Yadav, Yatindra Nath Singh, and R. R. Singh. Improving routing performance in AODV with link prediction in mobile adhoc networks. Wireless Personal Communications, 83(1):603–618, 2015.

[18] Zan Huang and Daniel Dajun Zeng. A link prediction approach to anomalous email detection. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Taipei, Taiwan, October 8-11, 2006, pages 1131–1136, 2006.

[19] Francesco Folino and Clara Pizzuti. Link prediction approaches for disease networks. In Information Technology in Bio- and Medical Informatics - Third International Conference, ITBAM 2012, Vienna, Austria, September 4-5, 2012. Proceedings, pages 99–108, 2012.

[20] U. Martin Singh-Blom, Nagarajan Natarajan, Ambuj Tewari, John O Woods, Inderjit S. Dhillon, and Edward M. Marcotte. Prediction and validation of gene- disease associations using methods inspired by social network analyses. PloS one, 8(5):e58977, 2013.

[21] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.

[22] Albert-László Barabási, Réka Albert, and Hawoong Jeong. Scale-free characteristics of random networks: The topology of the world-wide web. Physica A: Statistical Mechanics and its Applications, 281(1-4):69–77, 2000.

[23] Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.

[24] Mark E.J. Newman. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2):404–409, 2001.

[25] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.

[26] Mohammad Al Hasan and Mohammed J. Zaki. A survey of link prediction in social networks. In Social Network Data Analytics, pages 243–275. 2011.

[27] Peng Wang, Baowen Xu, Yurong Wu, and Xiaoyu Zhou. Link prediction in social networks: The state-of-the-art. Science China Information Sciences, 58(1):1–38, 2015.

[28] Mark E.J. Newman. Clustering and preferential attachment in growing networks. Physical Review E, 64(2):025102, 2001.


[29] Lada A. Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003.

[30] Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630, 2009.

[31] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat, 37:547–579, 1901.

[32] Erzsébet Ravasz, Anna Lisa Somera, Dale A. Mongru, Zoltán N. Oltvai, and A.-L. Barabási. Hierarchical organization of modularity in metabolic networks. Science, 297(5586):1551–1555, 2002.

[33] David Liben-Nowell. An algorithmic approach to social networks. PhD thesis, Massachusetts Institute of Technology, 2005.

[34] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.

[35] Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18-22 December 2006, Hong Kong, China, pages 613– 622, 2006.

[36] Oron Vanunu and Roded Sharan. A propagation-based algorithm for inferring gene–disease associations. In German Conference on Bioinformatics, pages 54–52, 2008.

[37] Linyuan Lü, Ci-Hang Jin, and Tao Zhou. Similarity index based on local paths for link prediction of complex networks. Physical Review E, 80(4):046122, 2009.

[38] Weiping Liu and Linyuan Lü. Link prediction based on local random walk. Europhysics Letters, 89(5):58007, 2010.

[39] Aaron Clauset, Cristopher Moore, and Mark E.J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98, 2008.

[40] Roger Guimerà and Marta Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52):22073–22078, 2009.

[41] Zan Huang. Link prediction based on graph topology: The predictive value of generalized clustering coefficient. In Workshop on Link Analysis: Dynamics and Static of Large Networks, 2006.

[42] Chao Wang, Venu Satuluri, and Srinivasan Parthasarathy. Local probabilistic models for link prediction. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 322–331. IEEE, 2007.

[43] Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed Zaki. Link prediction using supervised learning. In Workshop on link analysis, counter- terrorism and security, 2006.

[44] Catherine A. Bliss, Morgan R. Frank, Christopher M. Danforth, and Peter Sheridan Dodds. An evolutionary algorithm approach to link prediction in dynamic social networks. Journal of Computational Science, 5(5):750–764, 2014.


[45] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 437–452. Springer, 2011.

[46] Jérôme Kunegis and Andreas Lommatzsch. Learning spectral graph transformations for link prediction. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 561–568. ACM, 2009.

[47] Harry Zhang. The optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, Miami Beach, Florida, USA, pages 562–567, 2004.

[48] Zhen Liu, Qian-Ming Zhang, Linyuan Lü, and Tao Zhou. Link prediction in complex networks: A local Naïve Bayes model. Europhysics Letters, 96(4):48007, 2011.

[49] Stein Aerts, Diether Lambrechts, Sunit Maity, Peter Van Loo, Bert Coessens, Frederik De Smet, Leon-Charles Tranchevent, Bart De Moor, Peter Marynen, Bassem Hassan, et al. Gene prioritization through genomic data fusion. Nature Biotechnology, 24(5):537, 2006.

[50] Stephen Oliver. Proteomics: Guilt-by-association goes global. Nature, 403(6770):601, 2000.

[51] Marc A. Van Driel, Jorn Bruggeman, Gert Vriend, Han G. Brunner, and Jack A.M. Leunissen. A text-mining analysis of the human phenome. European Journal of Human Genetics, 14(5):535, 2006.

[52] Suraj Peri, J. Daniel Navarro, Ramars Amanchy, Troels Z. Kristiansen, Chandra Kiran Jonnalagadda, Vineeth Surendranath, Vidya Niranjan, Babylakshmi Muthusamy, T.K.B. Gandhi, Mads Gronborg, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research, 13(10):2363–2371, 2003.

[53] Balaji Raghavachari, Asba Tasneem, Teresa M. Przytycka, and Raja Jothi. Domine: A database of protein domain interactions. Nucleic Acids Research, 36(S1):D656– D661, 2007.

[54] See-Kiong Ng, Zhuo Zhang, Soon-Heng Tan, and Kui Lin. InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Research, 31(1):251–254, 2003.

[55] Robert D. Finn, Jaina Mistry, Benjamin Schuster-Böckler, Sam Griffiths-Jones, Volker Hollich, Timo Lassmann, Simon Moxon, Mhairi Marshall, Ajay Khanna, Richard Durbin, et al. Pfam: Clans, web tools and services. Nucleic Acids Research, 34(S1):D247–D251, 2006.

[56] Eric Jain, Amos Bairoch, Severine Duvaud, Isabelle Phan, Nicole Redaschi, Baris E. Suzek, Maria J. Martin, Peter McGarvey, and Elisabeth Gasteiger. Infrastructure for the life sciences: design and implementation of the uniprot website. BMC Bioinformatics, 10(1):136, 2009.

[57] TaeHyun Hwang, Wei Zhang, Maoqiang Xie, Jinfeng Liu, and Rui Kuang. Inferring disease and gene set associations with rank coherence in networks. Bioinformatics, 27(19):2692–2699, 2011.


[58] Wangshu Zhang, Yong Chen, Fengzhu Sun, and Rui Jiang. DomainRBF: a Bayesian regression approach to the prioritization of candidate domains for complex diseases. BMC Systems Biology, 5:55, 2011.

[59] Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):10, 2009.

[60] Douglas B. Lenat, Mayank Prakash, and Mary Shepherd. CYC: Using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks. AI Magazine, 6(4):65, 1985.

[61] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[62] José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. NASARI: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240:36–64, 2016.

[63] Hugo Liu and Push Singh. ConceptNet—a practical commonsense reasoning tool- kit. BT Technology Journal, 22(4):211–226, 2004.

[64] Boris Weisfeiler and A.A. Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.

[65] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics, 2(1-2):83–97, 1955.

[66] Ingwer Borg and Patrick J.F. Groenen. Modern Multidimensional Scaling: Theory and applications. Springer Science & Business Media, 2005.

[67] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[68] Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5):1–9, 2006.

[69] Jure Leskovec and Rok Sosič. SNAP: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology, 8(1):1, 2016.

[70] Daniel A. Schult. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference. Citeseer, 2008.


Chapter II

Publications: published, accepted, and submitted papers

This chapter collects the scientific publications resulting from this dissertation. These publications are organized into three categories: publications related to the study of the link prediction problem, publications about applications of link prediction, and publications associated to the software developed during this thesis.

1 Link prediction

This section contains the publications related to the formal study of the link prediction problem, including a survey of link prediction techniques, a journal paper proposing an adaptive link prediction technique, and a book chapter about the probabilistic link prediction approach.

1.1 Link prediction: The state of the art

The journal paper associated to this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. A survey of link prediction in complex networks. ACM Computing Surveys 49(4):69:1-69:33, 2017. DOI 10.1145/3012704.

• Status: Published
• ISSN: 0360-0300
• Impact Factor (JCR 2016): 6.748
• Subject Category:
  – Computer Science, Theory & Methods. Ranking: 2/104.

A Survey of Link Prediction in Complex Networks

Víctor Martínez, Fernando Berzal, Juan-Carlos Cubero

Abstract

Networks have become increasingly important to model complex systems comprised of interacting elements. Network data mining has a large number of applications in many disciplines, including protein-protein interaction networks, social networks, transportation networks, and telecommunication networks. Different empirical studies have shown that it is possible to predict new relationships between elements based on the topology of the network and the properties of its elements. The problem of predicting new relationships in networks is called link prediction. Link prediction aims to infer the behavior of the network link formation process by predicting missing or future relationships based on currently observed connections. It has become an attractive area of study since it allows us to predict how networks will evolve. In this survey we will review the general-purpose techniques at the heart of the link prediction problem, which can be complemented by domain-specific heuristic methods in practice.

1 Introduction

A link is a connection between two nodes in a network. This simple concept can be used to represent extremely complex systems where a large number of elements interact with one another. The proliferation of data that can be represented as networks has created new opportunities, but also new challenges, in the field of data mining. A large number of problems related to network mining are currently being studied, including community detection [1], structural network analysis [2], and network visualization [3].

One of the most interesting network-related problems is link prediction, which consists of inferring the existence of new relationships or still unknown interactions between pairs of entities based on their properties and the currently observed links [4]. The approaches and techniques designed to solve this problem enable the extraction of implicit information present in the network and the identification of spurious links, as well as the modeling and evaluation of network evolution mechanisms.

Link prediction methods have been successfully applied to biological networks in order to predict previously unknown interactions between proteins [5], significantly reducing the costs of empirical approaches [6]. They have also been used to model highly dynamic systems, such as email or telephone call networks [7]. Link prediction techniques are widely present in our daily lives, suggesting people we may know but are not yet connected to in our social networks [4] or products we could be interested in on electronic commerce sites [8].


Networks have been extensively studied since the proposal of the first basic models intended to identify the laws that drive network formation and lead to their structural features [9]. Some techniques that could be considered link prediction methods were proposed back then. However, it was not until specific link-prediction-oriented seminal works [10], which performed a comprehensive analysis of the problem, that this field came under the spotlight due to its applicability and usefulness in a great variety of contexts.

Link prediction is grounded in the empirical evidence that two entities are more likely to interact if they are similar. Similarity in networks must be understood as an abstract concept and can vary between networks. Understanding the domain that the network represents is a crucial step to define the similarity between two nodes. In most domains, it has been observed that nodes tend to form highly connected communities [11]. This has led to the common definition of similarity as the number of relevant direct or indirect paths between nodes.

One of the main difficulties in link prediction is achieving a good balance between the amount of information considered to perform the prediction and the algorithmic complexity of the techniques needed to collect that information. Since real networks are usually formed by hundreds of thousands or even millions of nodes, the techniques used to perform link prediction must be highly efficient. However, considering only local information could lead to poor predictions, especially in very sparse networks.

Different reviews about this topic have been previously published and have influenced this work [4, 12, 13, 14]. However, new approaches have been developed since their publication and a new review of the state of the art is desirable. A large number of domain-specific link prediction techniques have been proposed. However, these techniques are left out of this review, since most of them are tuned variations of the basic methods that will be described below. In this work, we focus on link prediction techniques for undirected networks using derived topological features. These techniques are more versatile than attribute-based methods, since topology-based techniques are not domain-specific.

We make several contributions in this survey. First, we perform a detailed and thorough study of the state of the art of link prediction approaches and methods using a unified notation. This study includes the computational complexity analysis of the most important techniques, which is not always provided by their original authors. Second, we propose a taxonomy to classify link prediction techniques according to the methodology that they employ and the amount of information they use. Finally, we perform an empirical study of the techniques by applying the most important methods to a set of networks with different properties and evaluating their results.

2 The link prediction problem

The link prediction problem in undirected networks is defined as follows. Given a snapshot of an undirected network at time t, where each node represents an entity or actor and each link represents an interaction or relationship between the pair of entities connected

through the link, the link prediction problem can be formally defined as inferring the subset of missing links (existing but non-observed links) in the current snapshot or links that will be formed at time t + ∆.

Figure 1: Our proposed taxonomy for link prediction techniques: similarity-based methods (local, global, and quasi-local methods), probabilistic and statistical methods, algorithmic methods (classifier-based, metaheuristic-based, and factorization-based methods), and preprocessing methods.

Most existing link prediction techniques consider it a ranking problem where pairs of unconnected nodes are given a score proportional to the likelihood of the existence of a link between them. A threshold is usually established: all pairs with a score above the threshold are considered positive instances and all pairs below the threshold are viewed as negative instances. This threshold can be specified by the user, be application-dependent, or be automatically determined. As far as we know, automatic threshold selection in link prediction remains an unexplored problem. The link prediction problem can thus be viewed as a binary classification problem for links in the network where two classes are considered: positive (existence of a link) and negative (absence of a link).

A large number of link prediction techniques have been proposed in the specialized literature. These techniques differ in several aspects, including the evolution rules that they model, their computational complexity, and the type or amount of information they consider. We propose a custom extended version (see Figure 1) of the taxonomy presented by [13]. These taxonomies classify the previously proposed methods according to the approach they follow and the amount of information that they take into account. Each method is described in detail below.
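Before describing each family of methods, the following minimal Python sketch illustrates the ranking-and-threshold formulation described above. It is an illustration only, not code from any surveyed method: it scores every unconnected pair with a user-supplied similarity function, ranks the pairs, and keeps those above a threshold.

from itertools import combinations

def rank_candidate_links(adjacency, score, threshold):
    """Score every unconnected node pair with `score(x, y)`, rank the pairs in
    decreasing order of score, and keep those above `threshold` as predicted links.
    `adjacency` maps each node to the set of its neighbors."""
    candidates = [(score(x, y), x, y)
                  for x, y in combinations(adjacency, 2)
                  if y not in adjacency[x]]
    candidates.sort(key=lambda triple: triple[0], reverse=True)
    return [(x, y, s) for s, x, y in candidates if s >= threshold]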

2.1 Applications of Link Prediction

Link prediction techniques have found a large number of applications in very different fields. Any domain where entities interact in a structured way can potentially benefit from link prediction. Some interesting or widely-used applications of link prediction are described below. Link prediction techniques are used to improve the selection of similar users in recommender

systems that follow a collaborative approach, leading to better recommendation results [15]. A similar application is related to social networks, which have become extremely popular in modern society. The users of these systems expect simple and effective mechanisms to find their acquaintances among the massive number of registered users. Most social networks use link prediction techniques to automatically suggest acquaintances with a high degree of accuracy.

In the field of biology, link prediction techniques are being applied to find possible interactions between pairs of proteins in protein-protein interaction networks (PPI networks) [16]. In vitro experiments to determine which proteins interact are expensive in terms of money and time, so targets are carefully selected when there is prior evidence, which can be obtained computationally.

Another application is collaboration prediction in scientific co-authorship networks. Collaboration data is easily accessible, since some journal indexing sites make their collections public. Link prediction methods have become a tool to better understand how some research fields will evolve by predicting which authors or groups could potentially collaborate in the future [17].

Entity resolution, also known as record linkage or deduplication, consists of finding duplicated references or records in a dataset. Traditionally, entity resolution relied only on attribute similarity between entries. However, some authors have recently shown that, in network-structured domains, considering context information through link prediction techniques that take into account the similarity between instances can lead to improvements in entity resolution [18].

Social network analysis has been widely used to analyze the structure of criminal and terrorist networks in order to fight organized crime [19]. For example, [20] has shown that the topology of some criminal networks does not change if a significant fraction of links are reinserted using link prediction techniques. These results suggest that link prediction can reveal actual links in criminal networks, allowing us to anticipate certain criminal actions.

Finally, networks can be used to analyze how trends spread across society. Network analysis can be used to improve marketing studies. Some authors have shown how link prediction can be used in viral marketing in order to achieve better marketing plans [21].

2.2 Terminology and Notation

A graph or network G is an ordered pair G = (V, E), where V is a set of optionally-labeled vertices or nodes and E is a set of links between pairs of elements from the set V. A link between two nodes x and y is denoted as e_{x,y}. The number of nodes in the network, also known as the size of the network, is denoted as |V|. The number of links is denoted as |E|. We can distinguish between directed links (called arcs), which connect a source node to a destination node, and undirected links (called edges), when there is no concept of source and destination. A directed graph is composed only of arcs. Similarly, an undirected graph

is composed only of edges. Finally, a mixed graph can contain both types of links (arcs and edges). The set of nodes connected through an edge to a node x ∈ V is called the neighborhood of x and is denoted as Γ_x. In undirected graphs, the degree of a node x is defined as the number of edges connected to the node and is denoted as |Γ_x|. In directed graphs, the degree of a node is the sum of the out-degree and the in-degree, which are the counts of outgoing and incoming arcs, respectively. The average degree of a network is denoted as ⟨Γ⟩ and is equal to the average degree of all its nodes.

Let a loop be an edge or arc connecting a node to itself. A simple graph is defined as a graph without loops and with no more than one edge or arc between each pair of vertices. The techniques reviewed in this survey assume there are no loops in the network. A path is a sequence of links that connect a sequence of nodes in the graph. In directed networks, path steps are restricted to move from the source node to the destination node of each arc. The path length is the number of links in the path. The shortest path between two vertices is the path with the smallest length between those vertices. Multiple shortest paths for a pair of vertices can coexist. A graph is called connected if there is a path between each pair of nodes x, y ∈ V. If the graph is not connected, it is composed of components. A component is a connected subgraph. A connected graph has only one component. If one of the components has a significantly larger number of nodes compared to the other components, it is usually called the main or giant component.

3 Similarity-based Methods

Similarity-based methods assume that nodes tend to form links with other similar nodes. These methods stem from the hypothesis that two nodes are similar if they are connected to similar nodes or are near each other in the network according to a given distance function. These approaches define a function s(x, y) that assigns a score known as similarity to every pair of nodes x and y. This measure is computed for each pair of nodes of interest, usually those without an observed link between them. Pairs of nodes are ranked in decreasing order based on their similarity scores, so links at the top of the ranking are supposed to be more likely to be present in the set of missing links.

The definition of similarity is not a trivial task, since it has a heuristic component. The similarity function can vary between networks, even from the same domain. Unsurprisingly, a large number of similarity-based methods with different definitions of similarity have been proposed. It has been empirically shown that the similarity between nodes can be defined in terms of network topological properties.

As an additional contribution of this survey, the algorithmic complexity is also computed for every similarity-based method using big O notation. Three variables have been considered to measure problem size: v as the number of nodes, e as the number of edges, and k as the maximum degree of a node. Some simple optimizations will be taken into account in our algorithmic complexity analysis.


For example, the intersection between two sets of sizes n and m, respectively, can be computed with complexity O(n + m) using hash tables. Insertion and lookup time complexities are O(1) using hashing. A set is iterated to fill a hash table, and the second set is iterated to check whether its members are already in the hash table. Collisions can be totally avoided since each node has a unique assigned number. The set union operation has the same time complexity.

Matrix operations are also considered. Matrix addition and subtraction are O(n^2). However, in sparse networks they can be computed with complexity O(nt), where t << n. The inversion of a matrix of size n × n has cubic time complexity O(n^3) for the naive algorithm. Some algorithms have been proposed to improve this efficiency, but they are always worse than quadratic. The multiplication of two matrices of sizes n × m and m × p, respectively, has complexity O(mnp) for dense matrices. However, for sparse matrices, this can be reduced to O(mtp), where t << n. It should be noted that the adjacency matrices of real-world networks are typically sparse.
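As a small illustration of the hashing argument above (a sketch, not code from the survey), the O(n + m) set intersection could be written as:

def intersection_size(set_a, set_b):
    """Compute the size of the intersection of two node sets in O(n + m)
    expected time using a hash table (a Python set)."""
    table = set(set_a)                              # O(n): fill the hash table
    return sum(1 for x in set_b if x in table)      # O(m): constant-time lookups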

3.1 Local Approaches

Local similarity-based approaches use node neighborhood-related structural information to compute the similarity of each node with other nodes in the network. These approaches are faster than non-local techniques and highly parallelizable. In addition, they allow us to efficiently handle the link prediction problem in very dynamic and changing networks, such as online social networks. Their main drawback is that using only local information restricts the set of nodes for which similarity can be computed to distance-two nodes (neighbors of neighbors). This can be a big drawback, as many links are formed at distances greater than two in many real-world networks, especially in non-small-world networks [4]. However, these methods have shown very competitive prediction accuracy against more complex techniques. It should be noted that, since these methods are limited to two-hop nodes, their time complexity is O(vk^2 f(k)), where f(k) is the complexity of computing the similarity between a pair of nodes, and their spatial complexity is O(vk^2).

3.1.1 Common neighbors (CN)

Common neighbors is the simplest local technique. The similarity between two nodes is defined as the number of shared neighbors between both nodes [4]. It makes sense to assume that, if two individuals share many acquaintances, they are more likely to meet than two individuals without common contacts. Different studies have confirmed this hypothesis by observing a correlation between the number of shared neighbors between pairs of nodes and the probability of being linked [22]. This method defines the similarity function as

s(x, y) = |\Gamma_x \cap \Gamma_y|

Despite its simplicity, this measure performs surprisingly well on most real-world networks and beats very complex approaches. This method is the basis for other

approaches presented below. Using this method to compute the similarity of all possible pairs results in a local link prediction technique with O(vk^2(k + k)) = O(vk^3) time complexity.
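A minimal sketch of this scheme, restricted to the distance-two pairs mentioned above (an illustration, not code from the survey):

def common_neighbors_scores(adjacency):
    """Common neighbors score for every unconnected distance-two pair.
    `adjacency` maps each node to the set of its neighbors."""
    scores = {}
    for x, neighbors in adjacency.items():
        for z in neighbors:                  # first hop
            for y in adjacency[z]:           # second hop: candidate nodes
                if y != x and y not in neighbors and (y, x) not in scores:
                    scores[(x, y)] = len(neighbors & adjacency[y])
    return scores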

3.1.2 The Adamic-Adar index (AA)

This similarity measure, initially proposed by Lada Adamic and Eytan Adar, was intended to measure the similarity between two entities based on their shared features [23]. However, each feature weight is logarithmically penalized by its appearance frequency. If we take neighbors as features, it can be written as

s(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{\log |\Gamma_z|}

This equation is a variation of the common neighbors similarity function where each shared neighbor is penalized by its degree. This intuitively makes sense in a large number of real-world networks. For example, in social networks, the amount of resources or time that a node can spend on each of its neighbors decreases as its degree increases, also decreasing its influence on them. Its computational complexity is, again, O(vk^3).

3.1.3 The resource allocation index (RA)

This index is motivated by the resource allocation process that takes place in complex networks [24]. It models the transmission of units of resources between two unconnected nodes x and y through neighborhood nodes. Each neighborhood node gets a unit of resource from x and equally distributes it to its neighbors. The amount of resources obtained by node y can be considered as the similarity between both nodes. This similarity function is formulated as

s(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{|\Gamma_z|}

It should be noted that this measure is strongly related to the common neighbors and the Adamic-Adar index [25]. The resource allocation index has been shown to be the local measure that achieves the best results in a large number of networks. Some recent works have led to the conclusion that the performance of these metrics increases by making the penalization of high-degree nodes more pronounced in scale-free networks [26]. This method also has an O(vk^3) time complexity.
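The following sketch (illustrative only, not code from the survey) makes the difference between the two degree penalization schemes explicit, reusing the adjacency-dictionary representation of the previous sketches:

from math import log

def adamic_adar(adjacency, x, y):
    """Adamic-Adar: each shared neighbor z contributes 1 / log(|Γ_z|).
    A shared neighbor always has degree at least two, so the logarithm is positive."""
    return sum(1.0 / log(len(adjacency[z])) for z in adjacency[x] & adjacency[y])

def resource_allocation(adjacency, x, y):
    """Resource allocation: each shared neighbor z contributes 1 / |Γ_z|."""
    return sum(1.0 / len(adjacency[z]) for z in adjacency[x] & adjacency[y])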

3.1.4 Resource allocation based on common neighbor interactions (RA-CNI)

Resource allocation based on common neighbor interactions is motivated by the resource allocation process where each node sends a unit of resource to its neighbors [27]. However,

this method also takes into consideration the return of resources in the opposite direction. It is defined as

s(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{|\Gamma_z|} + \sum_{e_{i,j} \in E,\ |\Gamma_i| < |\Gamma_j|,\ i \in \Gamma_x,\ j \in \Gamma_y} \left( \frac{1}{|\Gamma_i|} - \frac{1}{|\Gamma_j|} \right)

Experimental results obtained by its original authors show that this method achieves better precision than the original local resource allocation score in many real-world networks. The computational complexity of this method is, however, O(vk^4) in the worst case, worse than that of the aforementioned methods.

3.1.5 The preferential attachment index (PA)

This index is a direct result of the well-known Barabási–Albert complex network formation model [28, 29]. Many real networks exhibit node degrees that follow a power-law distribution, resulting in scale-free networks that could not be explained by previous network formation models. Albert-László Barabási and Réka Albert built a theoretical model based on the observation that the probability of link formation between two nodes increases as the degree of these nodes does. This formation model leads to the concept of "the rich get richer", which generates the power law observed in scale-free networks. The similarity between two nodes, according to the Barabási–Albert model, can be estimated as

s(x, y) = |\Gamma_x| \cdot |\Gamma_y|

This measure can also be applied in non-local contexts, since it does not rely on shared neighbors. However, its prediction accuracy is usually poor when applied as a global measure. The computational complexity of this method is O(vk^2), faster than the methods based on shared neighbors.
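A sketch of this index, which only requires the two node degrees (illustrative only, not code from the survey):

def preferential_attachment(adjacency, x, y):
    """Preferential attachment: the product of the degrees of the two nodes."""
    return len(adjacency[x]) * len(adjacency[y])

For instance, it could be plugged into the illustrative rank_candidate_links sketch shown earlier through a lambda that closes over the adjacency dictionary.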

3.1.6 The Jaccard index (JA)

This widely-used coefficient in information retrieval systems was proposed by Paul Jaccard (1868-1944) to compare the similarity and diversity of sample sets [30]. It measures the ratio of shared neighbors in the complete set of neighbors for two nodes. This similarity function is defined as

s(x, y) = \frac{|\Gamma_x \cap \Gamma_y|}{|\Gamma_x \cup \Gamma_y|}

It can be easily seen that this method is yet another variation of the common neighbors method where there is a penalization for each non-shared neighbor. The algorithmic time complexity of this method is O(vk^2(2k + 2k)) = O(vk^3).


3.1.7 The Salton index (SA)

This index is also known as the cosine similarity [31]. This measure is closely related to the Jaccard index and some works have shown that, in most practical situations, the Salton index yields a value that is approximately twice the Jaccard index [32]. This similarity function is defined as

s(x, y) = \frac{|\Gamma_x \cap \Gamma_y|}{\sqrt{|\Gamma_x| \cdot |\Gamma_y|}}

The computational time complexity of this method is, again, O(vk^3).

3.1.8 The Sørensen index (SO)

This index was developed by the botanist Thorvald Sørensen in 1948 to compare the similarity between different ecological community data samples [33]. Despite its similarity with the Jaccard index, it is less sensitive to outliers [34]. Sørensen similarity is defined as

$$s(x, y) = \frac{2\,|\Gamma_x \cap \Gamma_y|}{|\Gamma_x| + |\Gamma_y|}$$

The algorithmic time complexity of this method for all possible distance-two pairs is O(vk^3).

3.1.9 The hub promoted index (HPI)

This index was proposed as a result of studying modularity in metabolic networks [35]. These networks show a hierarchical structure with small highly internally-connected modules that are also highly isolated from each other. The main goal of this similarity measure is to avoid link formation between hub nodes and promote link formation between low degree nodes and hubs. This index defines similarity as

$$s(x, y) = \frac{|\Gamma_x \cap \Gamma_y|}{\min(|\Gamma_x|, |\Gamma_y|)}$$

The computational complexity of this method is, once more, O(vk^3).

3.1.10 The hub depressed index (HDI)

This index is based on the hub promoted index but has an opposite goal [35]. The hub depressed index promotes link formation between hubs and between low degree nodes, but not between hubs and low degree nodes. This similarity function can be defined as


$$s(x, y) = \frac{|\Gamma_x \cap \Gamma_y|}{\max(|\Gamma_x|, |\Gamma_y|)}$$

This method has the same computational complexity as the hub promoted index, which is O(vk^3).

3.1.11 The local Leicht-Holme-Newman index (LLHN)

This index is defined as the ratio of the actual number of paths of length two between two nodes to a value proportional to the expected number of paths of length two between them [36]. Its own authors claim that this index is a more sensitive measure of structural equivalence than others like the Salton index or the Jaccard index. The similarity function defined by this index can be computed as

$$s(x, y) = \frac{|\Gamma_x \cap \Gamma_y|}{|\Gamma_x|\,|\Gamma_y|}$$

Like almost all local methods, this one has a time complexity of O(vk^3).

3.1.12 The individual attraction index (IA)

This index is similar to the resource allocation index but also takes into account how connected the shared neighbors are [37]. It makes sense to assume that two nodes with the same number of shared neighbors are more likely to be connected if those shared neighbors are also highly connected among themselves. These links are called inner links. This similarity is computed as

$$s(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{e_{z,\Gamma_x \cap \Gamma_y} + 2}{|\Gamma_z|}$$

where $e_{z,\Gamma_x \cap \Gamma_y}$ is the number of links between node z and nodes from the set $\Gamma_x \cap \Gamma_y$. [37] also proposed an alternative version with a lower computational complexity, which they called the simple individual attraction index:

$$s(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{e_{\Gamma_x \cap \Gamma_y} + 2}{|\Gamma_z|\,|\Gamma_x \cap \Gamma_y|}$$

where $e_{\Gamma_x \cap \Gamma_y}$ is the number of edges in the network that connect common neighbors of nodes x and y. The computational complexity of the more complex version, which we will call IA1, is O(vk^4), whereas the computational complexity of the more efficient version, which we will call IA2, is O(vk^3).


3.1.13 Mutual information (MI)

This method treats the problem from the perspective of information theory by computing the conditional self-information of the existence of a link given the set of common neighbors [38]. The similarity between two nodes is computed as

$$s(x, y) = -I(e_{x,y} \mid \Gamma_x \cap \Gamma_y) = -I(e_{x,y}) + \sum_{z \in \Gamma_x \cap \Gamma_y} I(e_{x,y}; z)$$

where $I(e_{x,y})$ is the self-information of x and y being connected, which is computed as

$$I(e_{x,y}) = -\log_2\left(1 - \prod_{i=1}^{|\Gamma_y|} \frac{|E| - |\Gamma_x| - i + 1}{|E| - i + 1}\right)$$

and $I(e_{x,y}; z)$ is the mutual information between the existence of a link between x and y and the set of shared neighbors of these nodes, which is computed as

$$I(e_{x,y}; z) = \frac{1}{|\Gamma_z|\,(|\Gamma_z| - 1)} \sum_{u,v \in \Gamma_z : u \neq v} \left( I(e_{u,v}) - I(e_{u,v} \mid z) \right)$$

Finally, the conditional self-information of x and y being connected is computed as

$$I(e_{x,y} \mid z) = -\log_2 \frac{|\{e_{x,y} : x, y \in \Gamma_z,\ e_{x,y} \in E\}|}{\frac{1}{2}\,|\Gamma_z|\,(|\Gamma_z| - 1)}$$

The complexity of evaluating a single link is O(k^4), as shown by its authors. Therefore, applying it to nodes at distance two leads to a computational complexity of O(vk^6) when used as a local link prediction technique.

3.1.14 Local Na¨ıve Bayes (LNB)

This method assumes that each shared neighbor has a different role or degree of influence, which can be estimated using probability theory [39]. This method estimates the similarity of two nodes as

$$s(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} f(z) \log(o R_z)$$

where o is a constant for the network, which is computed as

$$o = \frac{p_{\mathrm{unconnected}}}{p_{\mathrm{connected}}} = \frac{\frac{1}{2}\,|V|\,(|V|-1)}{|E|} - 1$$

$R_z$ is the role or influence of the node, which is computed as

$$R_z = \frac{2\,|\{e_{x,y} : x, y \in \Gamma_z,\ e_{x,y} \in E\}| + 1}{2\,|\{e_{x,y} : x, y \in \Gamma_z,\ e_{x,y} \notin E\}| + 1}$$

and f(z) is a function that measures the influence of the node. The authors suggest f(z) = 1 from common neighbors, $f(z) = 1/\log|\Gamma_z|$ from the Adamic-Adar index, or $f(z) = 1/|\Gamma_z|$ from the resource allocation method. The computational complexity of this method is, therefore, O(vO(f(z)) + vk^3).

3.1.15 CAR-based indices (CAR)

CAR-based methods are proposed under the hypothesis that two nodes are more likely to be linked if their common neighbors are members of a strongly inner-linked cohort, named a local-community (LC) [40]. This assumption allows us to give more importance to nodes interlinked with other neighbors. A CAR-based version of common neighbors is defined as

$$s(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \left( 1 + \frac{|\Gamma_x \cap \Gamma_y \cap \Gamma_z|}{2} \right)$$

In a similar way, a CAR-based variation of the resource allocation can be computed as

$$s(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{|\Gamma_x \cap \Gamma_y \cap \Gamma_z|}{|\Gamma_z|}$$

The computational complexity of this approach depends on the underlying technique. Both examples shown above have a computational complexity of O(vk^4).

3.1.16 Functional similarity weight (FSW)

This method is derived from the Sørensen index considering that the probability of a node x interacting with a node y is independent of the probability of the node y interacting with the node x in directed networks [41]. However, this score can also be applied to undirected networks as

$$s(x, y) = \left( \frac{2\,|\Gamma_x \cap \Gamma_y|}{|\Gamma_x - \Gamma_y| + 2\,|\Gamma_x \cap \Gamma_y| + \lambda} \right)^2$$

where λ is computed as $\lambda = \max(0,\ \langle\Gamma\rangle - (|\Gamma_x - \Gamma_y| + |\Gamma_x \cap \Gamma_y|))$, with $\langle\Gamma\rangle$ denoting the average degree of the network. This parameter is included to penalize the similarity between nodes when any of them has a small degree. The computational complexity of this method is O(vk^3).


3.1.17 Local interacting score (LIT)

This score is an iterative variation of the functional similarity weight [42]. Initially, weights are assigned as $s_{x,y}(0) = 1$ for connected pairs of nodes and $s_{x,y}(0) = 0$ for the remaining pairs. Then, weights are iteratively updated as

$$s_{x,y}(t) = \frac{\sum_{u \in \Gamma_x \cap \Gamma_y} s_{u,x}(t-1) + \sum_{v \in \Gamma_x \cap \Gamma_y} s_{v,y}(t-1)}{\sum_{u \in \Gamma_x} s_{u,x}(t-1) + \sum_{v \in \Gamma_y} s_{v,y}(t-1) + \lambda(x) + \lambda(y)}$$

where λ(x) is computed as

$$\lambda(x) = \max\left(0,\ \frac{\sum_{u \in V} \sum_{v \in \Gamma_u} s_{u,v}(t)}{|V|} - \sum_{z \in \Gamma_x \cap \Gamma_y} s_{z,x}(t-1)\right)$$

λ(x) plays the role of the λ penalization in the functional similarity weight. The computational complexity of each iteration is O(vk^3). When the process is limited to l iterations, the final computational complexity is O(lvk^3).

3.2 Global Approaches

Global similarity-based indices use the topological information of the whole network to score each link. These methods are not limited to measuring similarity between distance-two nodes. However, their computational complexity can make them unfeasible for large networks, and their parallelization can be very complex, especially in distributed environments where the complete topology of the network may not be known by every computational agent. Although they exhibit very diverse time complexities, their spatial complexity is O(v^2), since they have to store a score for every pair of nodes.

3.2.1 Negated shortest path (NSP)

Negated shortest path [10] is a basic graph similarity measure that requires computing the shortest path between a pair of nodes. Shortest paths can be efficiently computed with Dijkstra's algorithm, which has an O(e log v) complexity using an adjacency list representation of the network and a heap data structure for its priority queue. Given the shortest path between a pair of nodes x and y, their similarity can be computed as

$$s(x, y) = -|\mathrm{shortest\ path}_{x,y}|$$

Since shortest paths must be computed for each node in the network, the overall time complexity of this method is O(ev log v). The prediction accuracy of the negated shortest path is poor even when compared to most local methods. Other methods described below, based on multiple paths, obtain significantly better results. This fact illustrates the importance of considering indirect paths in link prediction techniques.


3.2.2 The Katz index (KI)

This index sums the influence of all possible paths between a pair of nodes, incrementally penalizing paths by their length [43]. This index is defined as

$$s(x, y) = \sum_{l=1}^{\infty} \beta^l\, |\mathrm{paths}^l_{x,y}| = \sum_{l=1}^{\infty} \beta^l (A^l)_{x,y}$$

where $\mathrm{paths}^l_{x,y}$ is the set of paths of length l between nodes x and y, and A is the adjacency matrix of the network. It should be noted that the l-th power of the matrix has each of its entries equal to the count of paths of length l between the corresponding pair of nodes. The parameter β is a damping factor with 0 < β < 1. Giving a larger value to this parameter increases the influence of longer paths. If 1 is added to each element of the diagonal of the resulting similarity matrix S, this expression can be written in matrix terms as S = βAS + I. The similarity between all pairs of nodes can be directly computed in closed form by rearranging for S in the previous expression and subtracting the previously added 1s from the elements in the diagonal:

$$S = (I - \beta A)^{-1} - I$$

where I is the identity matrix. The similarity for each pair of nodes x and y is $s(x, y) = S_{x,y}$, where $S_{x,y}$ is the (x, y)-element of the matrix S. The Katz index has great predictive power, but the high algorithmic complexity required to compute the inverse of a matrix limits its applicability to small networks. The time complexity of this method is O(vk + v^3 + v), where the O(vk) term is due to matrix subtraction and scalar multiplication, O(v^3) is due to matrix inversion, and O(v) comes from the subtraction of the diagonal elements of the identity matrix. Therefore, the time complexity of this method is O(v^3).
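As an illustration of the closed form above, here is a minimal NumPy sketch (ours, not from the survey); the toy adjacency matrix and the choice of β are assumptions, and β must remain below the inverse of the largest eigenvalue of A for the underlying series to converge.

```python
import numpy as np

def katz_index(A, beta):
    """Katz similarity matrix via the closed form S = (I - beta*A)^{-1} - I."""
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - beta * A) - I

# Hypothetical 4-node path graph used only for illustration.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

S = katz_index(A, beta=0.1)
print(S[0, 2])   # Katz score between nodes 0 and 2
```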

3.2.3 The global Leicht-Holme-Newman index (GLHN)

The global version of the Leicht-Holme-Newman index is based on the same fundamentals as the Katz index [36]. This index assigns a similarity proportional to the number of paths between nodes. Similarity between all pairs of nodes is defined as

$$S = I + \sum_{l=1}^{\infty} \phi^l A^l$$

where the identity matrix term indicates maximal self-similarity. The parameter φ is similar to the damping factor used in the Katz index. The number of actual paths of length l between each pair of nodes can be replaced by their expected value, which is computed as

$$\mathrm{Expected}(A^l)_{x,y} = \frac{|\Gamma_x|\,|\Gamma_y|}{2|E|}\, \lambda_1^{l-1}$$

where $\lambda_1$ is the largest or dominant eigenvalue of the network adjacency matrix A. Replacing $A^l$ by $\mathrm{Expected}(A^l)$, this similarity can be finally expressed as

$$S = 2|E|\,\lambda_1\, D^{-1} \left( I - \frac{\beta}{\lambda_1} A \right)^{-1} D^{-1}$$

where D is a diagonal matrix with each element set to $D_{i,i} = |\Gamma_i|$ and β is a free parameter related to φ. If the constant factor is removed, this expression can be computed iteratively, avoiding the matrix inversion, as

$$S(t) = \frac{\beta}{\lambda_1}\, A\, S(t-1) + I$$

starting with S(0) having all its elements set to zero. This iterative process is performed until convergence. Iterative methods require a threshold parameter ε with a value close to zero. When the absolute difference between two iterations is below this threshold, the process is considered to have converged. In order to compute the required dominant eigenvalue, different approaches can be followed. One of the most efficient is power iteration, also known as the Von Mises iteration method. This technique performs a series of iterative steps that converge to the dominant eigenvector. It can be expressed as

$$b(t) = \frac{A\, b(t-1)}{\| A\, b(t-1) \|}$$

where A is the adjacency matrix of the network and b is a vector of length |V|, which can be initialized with random values. This process is repeated until convergence. Finally, it has been shown that the dominant eigenvalue is equal to the norm of this vector, so $\lambda_1 = \|b(c)\|$, where c is the step of convergence.

The similarity between two nodes is defined as $s(x, y) = S(c)_{x,y}$, where c is the time step when convergence is reached. The computational complexity of this method is analyzed as follows, considering c as the number of iterations required. The largest eigenvalue can be computed in O(cvk) time using the power iteration method. The complexity of the iterative process to obtain S(t) is O(cv^2 k). In summary, the complexity of the global Leicht-Holme-Newman method is O(cv^2 k + cvk) = O(cv^2 k).
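The power iteration step described above can be sketched in a few lines of NumPy; this is our own illustrative version, with hypothetical iteration and tolerance parameters.

```python
import numpy as np

def dominant_eigenvalue(A, iterations=100, tol=1e-9):
    """Estimate the dominant eigenvalue of A by power iteration."""
    b = np.random.rand(A.shape[0])
    lam = 0.0
    for _ in range(iterations):
        Ab = A @ b
        new_lam = np.linalg.norm(Ab)   # eigenvalue estimate
        b = Ab / new_lam               # normalized eigenvector estimate
        if abs(new_lam - lam) < tol:   # convergence check on the estimate
            break
        lam = new_lam
    return lam, b

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
lam, v = dominant_eigenvalue(A)
print(lam)   # approximately 2 for the triangle graph
```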

3.2.4 Random walks (RW)

Given a graph and a starting node, let us suppose that we randomly select a neighbor of this node and move to it; then, we repeat this process for each reached node. This Markov chain of randomly-selected nodes is known as a random walk on the graph [44]. Random walks were introduced by the mathematician Karl Pearson and have been applied to describe many stochastic processes in fields such as economics, physics, or biology [45]. If we

define $\vec{p}^{\,x}$ as the probability vector of reaching any node starting a random walk from node x, the probability of reaching each node can be iteratively approximated by

$$\vec{p}^{\,x}(t) = M^T \vec{p}^{\,x}(t-1)$$

where M is the transition probability matrix defined by the adjacency matrix A normalized by rows, with $M_{i,j} = A_{i,j} / \sum_k A_{i,k}$, and $\vec{p}^{\,x}(0)$ has all its elements set to 0 except $\vec{p}^{\,x}_x(0)$, which is set to 1. The transition probability matrix must satisfy some properties to ensure convergence. These constraints are satisfied for undirected simple networks. The above equation is iteratively applied until a stop condition is met, such as $\sum_{i \in V} (\vec{p}^{\,x}_i(t) - \vec{p}^{\,x}_i(t-1))^2 < \epsilon$, where ε is a threshold value close to zero. Finally, the similarity for each pair of nodes x and y is defined as $s(x, y) = \vec{p}^{\,x}_y$. Assuming that the network is sparse and c is the number of iterations until convergence, the time complexity of this method is O(cv^2 k), since a sparse multiplication between the adjacency matrix and the probability vector (O(vk)) is repeated for each node in each iteration. If we also take into account the matrix normalization, the final time complexity is O(cv^2 k + vk) = O(cv^2 k).

3.2.5 Random walks with restart (RWR)

Let us consider a model based on the definition of random walk where we pick a node and move following a random walk with probability α or we return to the starting node with probability (1 − α). This model is known as a random walk with restart [46]. It is almost equivalent to the rooted version of the popular Google PageRank algorithm [47]. This problem can be defined as an optimization problem:

$$\min_{\vec{p}^{\,x}}\ \alpha \sum_{i,j \in V} M^T_{i,j}\, (\vec{p}^{\,x}_i - \vec{p}^{\,x}_j)^2 + (1 - \alpha) \sum_{i \in V} (\vec{p}^{\,x}_i - \vec{s}^{\,x}_i)^2$$

where M is the transition probability matrix computed for random walks and $\vec{s}^{\,x}$ is the seed vector of length |V| with all its elements set to 0 except for $\vec{s}^{\,x}_x = 1$. The closed-form solution of this equation is

$$\vec{p}^{\,x} = (1 - \alpha)(I - \alpha M^T)^{-1} \vec{s}^{\,x}$$

Although this computation may be unfeasible for large networks, it can be approximated by the following iterative equation

$$\vec{p}^{\,x}(t) = \alpha M^T \vec{p}^{\,x}(t-1) + (1 - \alpha)\,\vec{s}^{\,x}$$

where $\vec{p}^{\,x}(0)$ has all its elements initially set to zero. This expression is iteratively applied, as in the random walk method. Since this measure is not symmetric, the final similarity for each pair of nodes is defined as $s(x, y) = \vec{p}^{\,x}_y + \vec{p}^{\,y}_x$. Like the random walk method, if we

assume that the network is sparse, its complexity is O(vc(vk + k) + vk) ≈ O(cv^2 k), where the O(vc(vk + k)) term is the time complexity of the random walk including vector addition and the O(vk) term is the computational complexity of the normalization process.
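A minimal NumPy sketch of the iterative RWR update follows (our own illustration, not the survey's implementation); the toy matrix, α, and stopping parameters are assumptions.

```python
import numpy as np

def rwr_vector(A, x, alpha=0.5, iterations=1000, tol=1e-8):
    """Random walk with restart from node x: p(t) = alpha*M^T p(t-1) + (1-alpha)*s."""
    M = A / A.sum(axis=1, keepdims=True)   # row-normalized transition matrix
    s = np.zeros(A.shape[0])
    s[x] = 1.0
    p = np.zeros_like(s)
    for _ in range(iterations):
        new_p = alpha * M.T @ p + (1 - alpha) * s
        if np.abs(new_p - p).sum() < tol:
            return new_p
        p = new_p
    return p

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Symmetric RWR score s(0, 3) = p^0_3 + p^3_0.
print(rwr_vector(A, 0)[3] + rwr_vector(A, 3)[0])
```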

3.2.6 Flow propagation (FP)

Despite the fact that the random walk with restart method converges to a vector of probabilities, its iterative definition corresponds to a propagation process. Alternative normalizations can be used for the adjacency matrix. For example, [48] proposed to apply RWR replacing the normalized adjacency matrix with the normalized Laplacian matrix, which can be computed as

$$M = D^l A\, D^r$$

where $D^l$ and $D^r$ are diagonal matrices whose elements are respectively defined as $D^l_{i,i} = 1/\sqrt{\sum_j A_{i,j}}$ and $D^r_{i,i} = 1/\sqrt{\sum_j A_{j,i}}$. The computational complexity of this method is the same as that of the random walk with restart method, since the transition matrix can be computed by multiplying each adjacency matrix entry by a scalar value.

3.2.7 Maximal entropy random walk (MERW)

Nodes tend to be linked to central nodes in structured networks. The maximal entropy random walk method incorporates the centrality of nodes in order to model this behaviour [49]. This method aims to maximize the entropy rate of a walk µ that is defined as

$$\mu = \lim_{l \to \infty} \frac{-\sum_{\mathrm{path}^l_{x,y} \in\, \mathrm{paths}^l} p(\mathrm{path}^l_{x,y}) \ln p(\mathrm{path}^l_{x,y})}{l}$$

where $p(\mathrm{path}^l_{x,y}) = M_{x,h} M_{h,i} \cdots M_{i,j} M_{j,y}$. In order to maximize the entropy, each element of the transition matrix is computed as

$$M_{i,j} = \frac{A_{i,j}}{\lambda} \frac{\psi_j}{\psi_i}$$

where λ is the largest eigenvalue of the adjacency matrix and ψ is the normalized eigenvector with respect to λ, satisfying $\sum_{x \in V} \psi_x^2 = 1$. Following this theory, [50] proposes entropy maximizations of other global techniques described in this survey. The computational complexity of the normalization process is O(cvk + vk), since computing the largest eigenvalue and the associated eigenvector is O(cvk) using the power iteration method, as we have described. The final computational complexity of this method is O(cvk + 2vk + vc(vk + k)) = O(cv^2 k).


3.2.8 SimRank (SR)

SimRank is a method that computes how soon two random walkers starting from nodes x and y are expected to meet [51]. This method is recommended for directed or mixed networks. However, as the number of undirected edges increases, it becomes impractical. It is recursively defined as

$$s(x, y) = \beta\, \frac{\sum_{i \in \Gamma_x} \sum_{j \in \Gamma_y} s(i, j)}{|\Gamma_x|\,|\Gamma_y|}$$

where s(z, z) = 1 and 0 < β < 1 is a damping factor. The high algorithmic complexity of this method makes it necessary to apply optimization techniques for its computation. Its original authors suggest pruning recursive branches beyond a radius l. The authors of SimRank suggest a near-linear complexity with a large constant factor in directed networks. However, this recursive expansion process has a complexity of O(k^{2l}). Since two nested summations must be performed, the complexity for each pair of nodes is O(k^{2l+2}). This score must be computed for each pair of nodes, so the final algorithmic time complexity is O(v^2 k^{2l+2}). It should be noted that, if pruning is applied with a radius lower than the network diameter, this method can be considered a quasi-local method (quasi-local methods are described below in Section 3.3).

3.2.9 Pseudoinverse of the Laplacian matrix (PLM)

The Laplacian matrix provides an alternative representation of a graph, and it is widely used in spectral graph theory [52]. It can be defined as L = D − A, where D is a diagonal matrix of size |V| with each element $D_{i,i} = |\Gamma_i|$ and A is the adjacency matrix of the graph. The Moore-Penrose pseudoinverse of the Laplacian matrix is denoted by $L^+$. This matrix is a kernel and can be considered as a similarity matrix [53]. The Moore-Penrose pseudoinverse can be computed in many different ways. One of the most popular approaches, implemented in most important mathematical packages such as MATLAB, GNU Octave, or NumPy, is based on singular value decomposition (SVD). Detailed steps to compute the SVD are included in the description of the low-rank approximation preprocessing method (see Section 6). Given the Laplacian matrix decomposition $L = U\Sigma V^T$, the Moore-Penrose pseudoinverse is computed as $L^+ = V\Sigma^+ U^T$, where $\Sigma^+$ is the matrix obtained by replacing each non-zero element of Σ by its inverse. Similarity can be computed using an inner-product-based measure such as the cosine similarity:

$$s(x, y) = \frac{L^+_{x,y}}{\sqrt{L^+_{x,x}\, L^+_{y,y}}}$$

The complexity of the Moore-Penrose pseudoinverse of the Laplacian matrix computation is dominated by the cost of computing the SVD, which is O(v^3). Once this matrix has been computed, the complexity of computing the similarity for each pair of

nodes is O(v^2). The overall complexity of this global link prediction method is, therefore, O(v^3 + v^2) = O(v^3).
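The following NumPy sketch (ours) computes this similarity using numpy.linalg.pinv, which implements the SVD-based Moore-Penrose pseudoinverse; the toy adjacency matrix is a hypothetical example.

```python
import numpy as np

def laplacian_pseudoinverse_similarity(A):
    """Similarity matrix s(x, y) = L+_{xy} / sqrt(L+_{xx} L+_{yy})."""
    D = np.diag(A.sum(axis=1))
    L = D - A                      # graph Laplacian
    L_plus = np.linalg.pinv(L)     # Moore-Penrose pseudoinverse via SVD
    d = np.sqrt(np.diag(L_plus))
    return L_plus / np.outer(d, d)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

S = laplacian_pseudoinverse_similarity(A)
print(S[0, 3])
```

The same pseudoinverse matrix can be reused by the average commute time method described next.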

3.2.10 Average commute time (ACT)

The average commute time is defined as the average number of steps that a random walker starting from node x takes to reach a node y for the first time and go back to x [44]. If the number of steps needed to reach node y starting from node x in a random walk is denoted by m(x, y) (also known as the hitting time), the ACT value n(x, y) between both nodes can be defined as

n(x, y) = m(x, y) + m(y, x)

The average commute time can also be computed in terms of the pseudoinverse of the Laplacian matrix L:

$$n(x, y) = |E| \left( L^+_{x,x} + L^+_{y,y} - 2L^+_{x,y} \right)$$

where $\sqrt{n(x, y)}$ defines a distance measure called the Euclidean commute time distance (ECTD) between nodes x and y [53]. Finally, the similarity between two nodes can be computed as the inverse of the squared ECTD between both nodes, ignoring the constant |E|:

$$s(x, y) = \frac{1}{L^+_{x,x} + L^+_{y,y} - 2L^+_{x,y}}$$

The complexity of this method is the same as that obtained for the pseudoinverse of the Laplacian matrix method, which was O(v^3 + v^2) = O(v^3).

3.2.11 Random forest kernel index (RFK)

In graph theory, a spanning tree of a graph G is defined as a connected undirected subgraph with no cycles that includes all the vertices and some or all the edges of G. The matrix-tree theorem [54] states that the number of spanning trees in G is equal to any cofactor of an entry of its Laplacian representation. A cofactor is the determinant of the matrix obtained by removing the row and column of a given element. A rooted forest is defined as the union of disjoint rooted spanning trees. It can be proved that the cofactor of $(I + L)_{x,y}$ is equal to the number of spanning rooted forests in which x and y are contained in the same x-rooted spanning tree. The inverse of this number can be considered as a measure of accessibility between x and y. Therefore, a similarity measure can be defined as

$$S = (I + L)^{-1}$$


Given this similarity matrix, the similarity between a pair of nodes is $s(x, y) = S_{x,y}$. The algorithmic time complexity of this method is O(v^3 + vk + v) = O(v^3) if we take into account the Laplacian matrix computation, O(vk), the addition of the diagonal elements of the identity matrix, O(v), and the matrix inversion, O(v^3).

3.2.12 The Blondel index (BI)

The Blondel index was initially proposed to measure similarity for a pair of vertices in different graphs [55]. However, it can be adapted to work in a single graph. It is iteratively computed as

$$S(t) = \frac{A\, S(t-1)\, A^T + A^T S(t-1)\, A}{\| A\, S(t-1)\, A^T + A^T S(t-1)\, A \|_F}$$

where S(0) = I and $\|M\|_F$ is the Frobenius matrix norm. The Frobenius norm of a matrix is computed as

$$\|M_{m \times n}\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} (M_{i,j})^2}$$

This measure is iteratively computed as in random-walk-based methods. The final similarity between a pair of nodes is defined as $s(x, y) = S_{x,y}(c)$, where c is the odd time step when convergence is reached. It is important to remark that this method reaches convergence only in odd iterations. The difference between the similarity matrices obtained in odd iterations must be computed in order to test for convergence. The complexity of this method is O(cv^2 k + cv^2) = O(cv^2 k), which can be derived from sparse matrix multiplications, O(v^2 k), matrix addition, matrix division by a scalar, and the Frobenius norm, O(v^2).

3.3 Quasi-local Approaches

Quasi-local methods have recently emerged to strike a balance between local and global measures. Quasi-local approaches are almost as efficient to compute as local methods but also consider additional topological information, as global methods do. They do not take into account the similarity between any arbitrary pair of nodes in the network, but neither are they limited to neighbors of neighbors. Some quasi-local methods have access to the whole network, but their algorithmic time complexity is still below the time complexity of global methods. Their spatial complexity is O(vk^{2+s}), where s depends on specific parameters that set the number of iterations or the length of the paths considered.


3.3.1 The local path index (LPI)

This index is strongly based on the Katz index, but it only considers a finite number of path lengths [56]. The similarity matrix can be computed as

$$S = \sum_{i=2}^{l} \beta^{i-2} A^i$$

where l ≥ 2 is the maximal path length and β a damping factor. It should be noted that, when l = 2, it would be equivalent to the common neighbors method. Similarity for each pair of nodes x and y is defined as $s(x, y) = S_{x,y}$. This measure is typically used with l = 3 due to its algorithmic complexity. When the damping factor is set to a low value, this measure obtains very similar results to the Katz index but avoids the matrix inversion computation. If the power of each adjacency matrix from the previous summation term is reused, the complexity of this method is O(lv^2 k + lv^2) = O(lv^2 k).
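A short NumPy sketch of the local path index with l = 3 follows (our illustration); the toy adjacency matrix and β are assumptions, and the previous power of A is reused as noted above.

```python
import numpy as np

def local_path_index(A, beta=0.001, l=3):
    """Local path index: S = A^2 + beta*A^3 + ... + beta^{l-2} * A^l."""
    S = np.zeros_like(A)
    Ai = A @ A                       # paths of length 2
    for i in range(2, l + 1):
        S += (beta ** (i - 2)) * Ai
        Ai = Ai @ A                  # reuse the previous power of A
    return S

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

S = local_path_index(A, beta=0.01, l=3)
print(S[0, 3])   # length-2 paths plus damped length-3 paths between nodes 0 and 3
```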

3.3.2 Local random walks (LRW)

Local random walks exploit the concept of random walks but limit the number of iterations to a small number l fixed a priori [44]. This method does not focus on the stationary state reached at convergence, unlike other random walk-based approaches. It is defined as

$$s_{x,y}(t) = \frac{|\Gamma_x|}{2|E|}\, \vec{p}^{\,x}_y(t) + \frac{|\Gamma_y|}{2|E|}\, \vec{p}^{\,y}_x(t)$$

where $\vec{p}^{\,x}_y(t)$ is the y-th element of the probability vector obtained by the random walk starting at x at iteration t. This method only requires us to compute a limited number of random walk steps for all the nodes in the network, so its computational complexity is O(lv^2 k), where usually l < i, i being the number of steps the random walk method would need to converge.

3.3.3 Superposed random walks (SRW)

Random walk-based methods are too sensitive to the topology of the network in distant zones. The superposed random walk method, which is based on the local random walk method, has been proposed to counteract this issue by continuously releasing the walker at the starting node [44]. This behaviour can be obtained by superposing each walker contribution as

$$s_{x,y}(t) = \sum_{i=1}^{t} \left( \frac{|\Gamma_x|}{2|E|}\, \vec{p}^{\,x}_y(i) + \frac{|\Gamma_y|}{2|E|}\, \vec{p}^{\,y}_x(i) \right)$$

The final similarity is computed as $s(x, y) = s_{x,y}(l)$. The computational complexity of this method is O(lv^2 k + lvk) = O(lv^2 k).


3.3.4 Third-order resource allocation based on common neighbor interactions (ORA-CNI)

This measure is an extension of resource allocation based on common neighbor interactions that also takes into consideration distance-three paths [27]. It redefines resource allocation for nodes at distance three. It is computed as

$$s(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{|\Gamma_z|} + \sum_{\substack{e_{i,j} \in E,\ |\Gamma_i| < |\Gamma_j| \\ i \in \Gamma_x,\ j \in \Gamma_y}} \left( \frac{1}{|\Gamma_i|} - \frac{1}{|\Gamma_j|} \right) + \beta \sum_{[x,p,q,y] \in\, \mathrm{paths}^3_{x,y}} \frac{1}{|\Gamma_p|\,|\Gamma_q|}$$

where β is a damping factor to adjust the influence of the 3-hop resource allocation term. This method starts from the complexity of resource allocation based on common neighbor interactions, which it must also compute for distance-three nodes. Adding the third term to reach 3-hop nodes increases its computational complexity to O(vk^3(k^2 + k^3)) = O(vk^6).

3.3.5 FriendLink (FL)

FriendLink is a quasi-local measure based on the path count between nodes of interest, like the local path index [57]. However, this method uses a normalization and other path length penalization mechanisms. The similarity between two nodes x and y is computed as

$$s(x, y) = \sum_{i=2}^{l} \frac{1}{i-1} \cdot \frac{(A^i)_{x,y}}{\prod_{j=2}^{i} (|V| - j)}$$

where $\frac{1}{i-1}$ is an attenuation factor similar to the Katz or the local path index damping factors, and the count of paths of length i between x and y is normalized by the count of paths of length i between x and y that would exist in a complete version of the network. The algorithmic complexity of this method is O(lv^2 k).

3.3.6 PropFlow predictor (PFP)

PropFlow is a similarity-based method that computes the probability that a restricted random walk starting from x ends at y in l steps or fewer [58]. Given a source node x and maximal path length l, this algorithm returns an array $S_x$ where each element $S_{x,y}$ is the score assigned to the pair of nodes x and y. New links can be predicted, as in other similarity-based methods, by setting the similarity score to $s(x, y) = S_{x,y}$. The pseudocode of the algorithm can be found in Algorithm 1. This method is similar to the random walk with restart or rooted PageRank algorithms, but it employs a localized method of propagation and, therefore, is less sensitive to noise far from the source node x. It is also more efficient to compute, since it employs a breadth-first search limited to l

steps. The analysis of this method, given that the maximum size of the set OldSearch is k^l and the process must be repeated |V| times, results in a computational complexity of O(vlk^l).

Input: Network G = (V, E), node x, and max path length l.
Output: Score S_{x,y} for all n ≤ l-degree neighbors y of x.
Found = {x};
NewSearch = {x};
S_{x,x} = 1;
for each z in V − {x} do
    S_{x,z} = 0;
end
for CurrentDegree from 0 to l do
    OldSearch = NewSearch;
    NewSearch = ∅;
    for each i in OldSearch do
        for each j in Γ_i do
            S_{x,j} ← S_{x,j} + S_{x,i} / |Γ_i|;
            if j is not in Found then
                Found = Found ∪ {j};
                NewSearch = NewSearch ∪ {j};
            end
        end
    end
end
ALGORITHM 1: PropFlow predictor (unweighted version).
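A direct Python transcription of Algorithm 1 might look as follows; this sketch is ours, it assumes the network is given as a dictionary of neighbor sets with no isolated nodes, and the toy graph is hypothetical.

```python
def propflow(graph, x, l):
    """Unweighted PropFlow scores from source node x, propagated up to l steps."""
    scores = {node: 0.0 for node in graph}
    scores[x] = 1.0
    found = {x}
    new_search = {x}
    for _ in range(l + 1):
        old_search, new_search = new_search, set()
        for i in old_search:
            outflow = scores[i] / len(graph[i])   # resource pushed to each neighbor of i
            for j in graph[i]:
                scores[j] += outflow
                if j not in found:
                    found.add(j)
                    new_search.add(j)
    return scores

# Hypothetical toy network.
graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
print(propflow(graph, "a", l=3)["e"])   # PropFlow score s(a, e)
```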

3.4 Summary

Our survey has shown that very different approaches can be used to perform link prediction. A large number of methods using diverse approaches have been described in this chapter. These methods have been catalogued using the taxonomy shown in Figure 1. This taxonomy has two levels, which allow us to categorize each method based on its main features. Similarity-based methods are the most studied category in link prediction and they form the core of this survey. These methods are usually applied in massive networks, so their time complexity is a fundamental feature. A summary of the similarity-based techniques that have been studied in this survey and their time complexity is shown in Table 1, grouping methods by their taxonomic classification. As has been seen in this chapter, most link prediction techniques are heuristics based on some coherent assumptions. The heuristic nature of the link prediction problem allows a rich variety of approaches that might work better or worse depending on the particular context, since network formation is a complex process and some fundamental factors can vary among networks. This implies that it is impossible to devise a method that works better than all other methods for all networks.


Type          Name      Time complexity        Reference
Local         CN        O(vk^3)                [4]
              AA        O(vk^3)                [23]
              RA        O(vk^3)                [24]
              RA-CNI    O(vk^4)                [27]
              PA        O(vk^2)                [28]
              JA        O(vk^3)                [30]
              SA        O(vk^3)                [31]
              SO        O(vk^3)                [33]
              HPI       O(vk^3)                [35]
              HDI       O(vk^3)                [35]
              LLHN      O(vk^3)                [36]
              IA1       O(vk^4)                [37]
              IA2       O(vk^3)                [37]
              MI        O(vk^6)                [38]
              LNB       O(vO(f(z)) + vk^3)     [39]
              CAR       O(vk^4)                [40]
              FSW       O(vk^3)                [41]
              LIT       O(lvk^3)               [42]
Global        NSP       O(ev log v)            [10]
              KI        O(v^3)                 [43]
              GLHN      O(cv^2 k)              [36]
              RW        O(cv^2 k)              [45]
              RWR       O(cv^2 k)              [46]
              FP        O(cv^2 k)              [48]
              MERW      O(cv^2 k)              [50]
              SR        O(v^2 k^(2l+2))        [51]
              PLM       O(v^3)                 [53]
              ACT       O(v^3)                 [53]
              RFK       O(v^3)                 [54]
              BI        O(cv^2 k)              [55]
Quasi-local   LPI       O(lv^2 k)              [56]
              LRW       O(lv^2 k)              [44]
              SRW       O(lv^2 k)              [44]
              ORA-CNI   O(vk^6)                [27]
              FL        O(lv^2 k)              [57]
              PFP       O(vlk^l)               [58]

Table 1: Computational complexity and references for similarity-based link prediction methods. Columns (from left to right): type of similarity-based link prediction technique, name of the technique, computational complexity of the technique, and original reference to the technique.

3.5 Experimentation

In practice, link prediction strongly relies on similarity-based techniques, since most of them are efficient enough to be applied to massive networks. Furthermore, similarity-based scores are usually used as features when more complex approaches are applied using other algorithmic techniques. We have performed an extensive comparison of these techniques in order to study how they behave in different kinds of networks. As far as we know, the most complete existing comparison was [13]. We have implemented all the similarity-based techniques described above and applied them to different networks. Other methods described in the following sections have not been considered due to their computational complexity and their need to tune different parameters, which would lead to unrepresentative results. In the following section, we describe the set of networks used in our experiments, how they were preprocessed, and how their structural properties were measured. In addition, the results obtained are presented and discussed.

3.5.1 Datasets

Seven networks with different backgrounds and different topological properties were collected: a protein-protein interaction network of budding yeast (YST, [59]), a neural network of Caenorhabditis elegans (CEL, [60]), a network describing face-to-face contacts of people during the exhibition Infectious: Stay Away in 2009 at the Science Gallery in Dublin (INF, [61]), a frequent copurchasing network of books about US politics published in 2004 during the presidential election campaign and sold by Amazon (BCK, [62]), a network of friendships among users of the hamsterster.com website (HMT, [63]), a network of flights from 1997 taken from the North American Transportation Atlas Data (NORTAD) (USA, [64]), and a coauthorship network of scientists working on network theory and experiments (NSC, [65]). Each network was preprocessed to make its links undirected, isolated nodes were deleted, and duplicated links and self-loops were removed. Some important topological properties of these networks were computed and included in the supplementary material.

3.5.2 Evaluation and results

To evaluate each technique, we conducted a 5-fold cross-validation as described in the supplement. We report the precision (see Table 2) and the AUC (see Table 3) for each possible method and network combination. Methods with parameters were evaluated with different reasonable hand-picked values. The local interacting score was tested with 2, 4, and 6 iterations and, as suggested by its authors, its best results were obtained with only 2 iterations. The Katz index (β), the global Leicht-Holme-Newman index (φ), SimRank and the local path index (β), with l fixed to 3 in both cases, and third-order RA-CNI (β) were tested with values 0.1, 0.01, and 0.001. The best results for Katz, SimRank, and local path were obtained with β = 0.001. The best results for the Leicht-Holme-Newman index were obtained with φ = 0.1. ORA-CNI was set to β = 0.001. All random walk-based methods with an α parameter were tested with 0.5, 0.7, and 0.9, finally selecting 0.5 in all cases since the results were very stable for all the tested parameter values. Finally, the local random walk, superposed random walk, and PropFlow predictors were tested with parameters t and l set to 3, 5, and 7. The best average precision was obtained choosing 3 in all cases. The number of times that each method appears in the top five for each network is


Method     YST      CEL      INF      BCK      HMT      USA      NSC
CN         0.0876   0.1119   0.3484   0.2101   0.2453   0.4091   0.3561
AA         0.1080   0.1532   0.4080   0.2608   0.3267   0.4558   0.6217
RA         0.0876   0.1448   0.4163   0.2563   0.3959   0.5160   0.6652
RA-CNI     0.0951   0.1490   0.4242   0.2630   0.4406   0.4965   0.6306
PA         0.0310   0.1051   0.0848   0.1820   0.0787   0.3819   0.1932
JA         0.0030   0.0373   0.4104   0.1297   0.2472   0.1081   0.4884
SA         0.0026   0.0340   0.4188   0.1395   0.2409   0.0866   0.4997
SO         0.0030   0.0373   0.4104   0.1297   0.2472   0.1081   0.4884
HPI        0.0000   0.0000   0.2764   0.1463   0.0000   0.0000   0.0005
HDI        0.0104   0.0364   0.4017   0.1053   0.2461   0.0810   0.4208
LLHN       0.0002   0.0023   0.1271   0.0897   0.0817   0.0104   0.2334
IA1        0.1079   0.1569   0.4195   0.2619   0.4093   0.4638   0.6170
IA2        0.0855   0.1493   0.4268   0.2427   0.4199   0.4927   0.6596
MI         0.1228   0.1676   0.4040   0.2313   0.3324   0.4365   0.5125
LNB-CN     0.1189   0.1569   0.3964   0.2494   0.2996   0.4440   0.5414
LNB-AA     0.1190   0.1555   0.4036   0.2534   0.3574   0.4685   0.6362
LNB-RA     0.0954   0.1462   0.4105   0.2540   0.4019   0.5169   0.6610
CAR-CN     0.1073   0.1240   0.3942   0.2029   0.2816   0.4228   0.3914
CAR-AA     0.1241   0.1480   0.4047   0.2222   0.3191   0.4322   0.5074
CAR-RA     0.1245   0.1560   0.4148   0.2517   0.3873   0.4487   0.5074
FSW        0.0504   0.0792   0.3637   0.1349   0.2266   0.2158   0.4849
LIT        0.1049   0.1220   0.4253   0.1814   0.4097   0.3984   0.5022
NSP        0.0000   0.0001   0.0001   0.0007   0.0000   0.0001   0.0003
KI         0.1186   0.1513   0.3924   0.2471   0.2595   0.4332   0.4300
GLHN       0.0876   0.1119   0.3484   0.2101   0.2453   0.4091   0.3561
RW         0.0891   0.1276   0.2555   0.1927   0.2639   0.1599   0.4282
RWR        0.1175   0.1983   0.4090   0.2268   0.3751   0.3316   0.5292
FP         0.1013   0.1643   0.3508   0.2449   0.2845   0.4059   0.4303
MERW       0.0129   0.0666   0.2575   0.1565   0.2618   0.0927   0.5036
SR         0.0009   0.0033   0.1309   0.1066   0.0877   0.0127   0.2918
PLM        0.0176   0.1066   0.1703   0.2494   0.0017   0.3782   0.0149
ACT        0.0284   0.0889   0.1826   0.2199   0.0807   0.3791   0.0543
RFK        0.0260   0.0847   0.2329   0.1746   0.2381   0.0814   0.4712
BI         0.0710   0.1359   0.1483   0.2221   0.0988   0.3923   0.1940
LPI        0.1069   0.1438   0.3852   0.2404   0.1723   0.4200   0.4194
LRW        0.1857   0.1839   0.3819   0.2177   0.4232   0.4972   0.4927
SRW        0.1380   0.1657   0.4123   0.2540   0.4265   0.5230   0.6580
ORA-CNI    0.0993   0.1490   0.4242   0.2630   0.4405   0.4967   0.6601
FL         0.1069   0.1443   0.3848   0.2358   0.2328   0.4191   0.4194
PFP        0.0405   0.1331   0.1834   0.2110   0.2456   0.1449   0.4776

Table 2: Precision results. Average precision of the five iterations of the cross-validation performed for each method and network pair.


Method     YST      CEL      INF      BCK      HMT      USA      NSC
CN         0.6850   0.8274   0.9264   0.8691   0.9523   0.9278   0.9056
AA         0.6855   0.8450   0.9300   0.8782   0.9553   0.9391   0.9061
RA         0.6854   0.8485   0.9305   0.8801   0.9561   0.9439   0.9061
RA-CNI     0.6854   0.8495   0.9307   0.8803   0.9564   0.9433   0.9061
PA         0.6846   0.8091   0.8991   0.8505   0.9386   0.9177   0.9043
JA         0.6837   0.7766   0.9285   0.8558   0.9492   0.8926   0.9059
SA         0.6837   0.7831   0.9285   0.8621   0.9501   0.8995   0.9059
SO         0.6837   0.7766   0.9285   0.8558   0.9492   0.8926   0.9059
HPI        0.6834   0.7933   0.9255   0.8671   0.9484   0.8666   0.9058
HDI        0.6836   0.7680   0.9277   0.8475   0.9483   0.8878   0.9057
LLHN       0.6828   0.7276   0.9195   0.8314   0.9411   0.7821   0.9056
IA1        0.6854   0.8490   0.9305   0.8791   0.9562   0.9418   0.9061
IA2        0.6853   0.8483   0.9306   0.8795   0.9562   0.9434   0.9061
MI         0.6527   0.8325   0.9083   0.8426   0.9379   0.9089   0.7765
LNB-CN     0.6858   0.8411   0.9263   0.8717   0.9541   0.9337   0.9057
LNB-AA     0.6858   0.8445   0.9285   0.8765   0.9557   0.9403   0.9059
LNB-RA     0.6857   0.8451   0.9295   0.8772   0.9561   0.9440   0.9059
CAR-CN     0.6850   0.8272   0.9264   0.8682   0.9525   0.9269   0.9056
CAR-AA     0.5792   0.7130   0.8254   0.7224   0.8832   0.8971   0.7554
CAR-RA     0.5792   0.7140   0.8255   0.7230   0.8838   0.9001   0.7554
FSW        0.6839   0.7822   0.9256   0.8535   0.9473   0.8949   0.9058
LIT        0.6730   0.8313   0.9206   0.8550   0.9501   0.9164   0.8424
NSP        0.7887   0.7443   0.9121   0.8456   0.9423   0.8135   0.9142
KI         0.8044   0.8507   0.9528   0.8946   0.9630   0.9180   0.9147
GLHN       0.6850   0.8274   0.9264   0.8691   0.9523   0.9278   0.9056
RW         0.7811   0.8354   0.9494   0.8851   0.9551   0.8796   0.9151
RWR        0.8029   0.8967   0.9637   0.9183   0.9681   0.9362   0.8892
FP         0.8113   0.8873   0.9599   0.9018   0.9674   0.9382   0.9154
MERW       0.7806   0.8519   0.9454   0.8906   0.9575   0.8809   0.9151
SR         0.6828   0.7310   0.9200   0.8334   0.9414   0.7856   0.9056
PLM        0.7725   0.8480   0.9439   0.8908   0.6435   0.9407   0.5263
ACT        0.7659   0.7370   0.7969   0.7456   0.8793   0.8749   0.5654
RFK        0.7964   0.8614   0.9549   0.8984   0.9593   0.9106   0.9150
BI         0.7784   0.7159   0.7730   0.8166   0.8800   0.8674   0.9081
LPI        0.8025   0.8345   0.9501   0.8882   0.9591   0.9136   0.9137
LRW        0.8164   0.8874   0.9514   0.8928   0.9647   0.9291   0.8536
SRW        0.8210   0.8936   0.9608   0.9139   0.9726   0.9463   0.9164
ORA-CNI    0.8135   0.8539   0.9515   0.8977   0.9682   0.9386   0.9164
FL         0.8025   0.8342   0.9500   0.8882   0.9619   0.9112   0.9137
PFP        0.8159   0.8645   0.9568   0.9081   0.9636   0.8899   0.9171

Table 3: AUC results. Average AUC of the five iterations of the cross-validation performed for each method and network pair.


Precision
Method     1st   2nd   3rd   Top 5
AA          -     -     -     1
RA          1     -     1     3
RA-CNI      1     1     -     3
IA1         -     -     1     2
IA2         1     -     -     3
MI          -     -     1     2
LNB-RA      -     2     -     2
CAR-AA      -     -     -     1
CAR-RA      -     -     1     1
LIT         -     1     -     1
RWR         1     -     -     1
FP          -     -     -     1
LRW         1     1     -     4
SRW         1     1     1     5
ORA-CNI     1     1     2     5

AUC
Method     1st   2nd   3rd   Top 5
RA          -     -     1     1
RA-CNI      -     -     -     1
IA2         -     -     -     1
LNB-RA      -     1     -     1
RWR         3     -     1     4
FP          -     -     1     6
MERW        -     -     -     1
RFK         -     -     -     2
LRW         -     1     1     3
SRW         3     3     1     7
ORA-CNI     -     2     -     3
PFP         1     -     2     5

Table 4: Summary of methods appearing in the top five of the rank. Number of times that each method appears in the first, second, and third position and in the top five for all networks, according to precision (first table) and AUC (second table). The rows of methods that never appear in the top five of the rank are omitted.

summarized in Table 4 for precision and AUC. Different conclusions can be extracted from our empirical results. Global techniques obtain the worst results on average. Their poorer performance, and the fact that tuning their parameters strongly penalizes indirect paths, suggest that global methods are worse because they tend to consider too much noise. It can be seen that global methods are less present in the top five. Only RWR, FP, MERW, and RFK reach the top five on a few occasions, in networks with a low average clustering coefficient. Local techniques work surprisingly well. Some local methods reach the top five a few times, including AA, CAR-based methods, and the mutual information technique, while RA and its variation RA-CNI, the LNB-RA method, and the IA techniques are present in the top five a reasonable number of times when compared to other methods. These results suggest that most of the information that can be used to perform link prediction is of a local nature. Quasi-local techniques also obtain good results on average. ORA-CNI, PFP, LRW, and SRW appear among the best for almost all networks. The SRW technique is the method that has been shown to obtain the best results on average in our experiments. Finally, the variety of methods at the top of the ranking shows that the performance of each technique strongly depends on the structural properties of the network. This highlights the importance of analyzing the properties of the network before choosing a particular link prediction technique. As we observe in our results, the quality of the results is related to the average clustering coefficient of the nodes with degree above one. This is reasonable since most link prediction techniques are variations of counting shared neighbors, and the

count of shared neighbors increases as the clustering coefficient does. Another variable that seems to play an important role is the average degree. This makes sense since, as we know, the more neighbors a node has, the more information we have to predict new links for it. However, revealing which specific properties play such an important role in link prediction is still an unsolved problem that requires further work.

4 Probabilistic and Statistical Approaches

Many network formation models have been successfully described in terms of statistical and probabilistic concepts [66]. These studies have opened the door to link prediction techniques based on statistical analysis and probability theory. These approaches usually assume that the network has a known structure. They build a model that fits the structure and estimate model parameters using statistical methods. These parameters are used to compute the formation probability of each non-observed link. These probability values can be used to rank potential links as we did in similarity-based methods.

4.1 The hierarchical structure model

Some studies show that many real networks are hierarchically organized, including metabolic networks, protein interaction networks, internet domains, and some social networks like actor networks [67]. In hierarchical networks, nodes with higher degree are expected to have a lower clustering coefficient than lower degree nodes. Hub nodes weakly connect isolated communities of highly clustered nodes. This way, a hierarchical structure is formed. The method proposed by [68] represents a hierarchically-structured network by a dendrogram with |V| leaves and |V| − 1 internal nodes. Each leaf represents a node from the network and each internal node represents a relationship among its descendant nodes in the dendrogram. Each internal node n has an associated probability $p_n$, which is equal to the probability of a link between nodes of both branches descending from it. Each network has multiple representations based on dendrograms depending on how internal nodes are set. Given a dendrogram representation D of the network, let $e_n$ be the number of links in the network connecting nodes that have internal node n as their lowest common ancestor in D. The likelihood of dendrogram D together with the set of internal node probabilities can be estimated as

$$\mathcal{L}(D, \{p_n\}) = \prod_{n \in D} p_n^{e_n} (1 - p_n)^{l_n r_n - e_n}$$

where $l_n$ and $r_n$ are, respectively, the numbers of leaves in the left and the right subtrees with root n. If the dendrogram D is fixed, its likelihood can be maximized by a set of probabilities $p_n$ computed as


Input: Network G = (V, E), number n of dendrograms to sample.
Output: Probability P_{x,y} for all unconnected pairs of nodes.
Samples = ∅;
for i from 1 to n do
    Initialize the Markov chain with a random dendrogram;
    Run the Monte Carlo algorithm until equilibrium is reached;
    Insert the resulting dendrogram D into Samples;
end
for each e_{x,y} in U_G − E do
    avg_prob = 0;
    for each sample in Samples do
        r ← lowest common ancestor of x and y in sample;
        avg_prob ← avg_prob + p_r / |Samples|;
    end
    P_{x,y} = avg_prob;
end
ALGORITHM 2: Link prediction based on the hierarchical structure model.

$$p_n = \frac{e_n}{l_n r_n}$$

$p_n$ represents the ratio of the number of actual edges to the number of potential ones. Based on this result, the $e_n$ term can be removed from the likelihood formula to estimate the likelihood of a dendrogram at its maximum as

$$\mathcal{L}(D) = \prod_{n \in D} \left[ p_n^{p_n} (1 - p_n)^{1 - p_n} \right]^{l_n r_n}$$

Once the theoretical background is set, these results can be used to perform link prediction. A Markov chain Monte Carlo method [69] is used to sample a set of dendrograms with a probability proportional to their likelihood. The transitions between dendrograms consist of rearranging subtrees of the current dendrogram in a different order. The complete procedure, which computes the probability of link formation between each pair of unconnected nodes, is described in Algorithm 2.

4.2 The stochastic block model

Most networks do not fit in a hierarchical schema. A more general approach is to consider that nodes are distributed in communities or blocks. The probability of link formation between two nodes directly depends on the blocks they belong to [70]. In this model, we are required to compute a partition $\mathcal{M}$ of the network where each node is assigned to one group or block $m \in \mathcal{M}$. Given a partition, the likelihood of the network structure can be estimated as

$$\mathcal{L}(G \mid \mathcal{M}) = \prod_{a,b \in \mathcal{M}} p_{a,b}^{l_{a,b}} (1 - p_{a,b})^{r_{a,b} - l_{a,b}}$$


where $l_{a,b}$ is the number of edges between nodes in groups a and b, $|\{e_{x,y} : x \in a, y \in b\}|$, and $r_{a,b}$ is the number of pairs of nodes of both groups, which is $|a|\,|b|$ when a ≠ b and $|a|(|a|-1)$ when a = b. This likelihood is maximized for

$$p_{a,b} = \frac{l_{a,b}}{r_{a,b}}$$

Applying Bayes' theorem, the probability of a link with maximum likelihood can be computed as

$$P_{x,y} = \frac{\sum_{\mathcal{M} \in \Omega} \mathcal{L}(e_{x,y} \in E \mid \mathcal{M})\, \mathcal{L}(G \mid \mathcal{M})\, p(\mathcal{M})}{\sum_{\mathcal{M}' \in \Omega} \mathcal{L}(G \mid \mathcal{M}')\, p(\mathcal{M}')}$$

where Ω is the set of possible partitions. It should be noted that Ω grows fast as the number of nodes in the network increases ($|\Omega| \in O(2^{|V|})$), which makes this approach impractical for large networks. The Metropolis algorithm can be used to sample partitions, but this process is still computationally expensive.
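As a small illustration of the maximum-likelihood estimate above, the following Python sketch (ours) computes the block-to-block probabilities p_{a,b} for a single fixed partition; the toy edges and partition are hypothetical, and a full predictor would average these probabilities over partitions sampled with the Metropolis algorithm.

```python
from itertools import combinations_with_replacement

def block_probabilities(edges, partition):
    """Maximum-likelihood p_{a,b} = l_{a,b} / r_{a,b} for a fixed partition.

    `edges` is a set of frozensets {x, y}; `partition` maps node -> block id.
    r_{a,b} follows the definition above: |a||b| if a != b, |a|(|a|-1) if a == b.
    """
    blocks = {}
    for node, block in partition.items():
        blocks.setdefault(block, set()).add(node)
    p = {}
    for a, b in combinations_with_replacement(sorted(blocks), 2):
        if a == b:
            pairs = len(blocks[a]) * (len(blocks[a]) - 1)
        else:
            pairs = len(blocks[a]) * len(blocks[b])
        links = sum(1 for e in edges if {partition[x] for x in e} == {a, b})
        p[(a, b)] = links / pairs if pairs > 0 else 0.0
    return p

edges = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]}
partition = {"a": 0, "b": 0, "c": 0, "d": 1}
print(block_probabilities(edges, partition)[(0, 1)])   # link probability between blocks 0 and 1
```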

4.3 The cycle formation model

The cycle formation model is based on the assumption that networks have a tendency to close cycles in their formation process [71]. This assumption is consistent with other techniques like the common neighbors method, which counts the number of cycles of length three that would be formed if the evaluated link existed. This method tries to capture longer cycles by extending the overall clustering coefficient to a generalized clustering coefficient. This generalized clustering coefficient is defined as

$$C(k) = \frac{\text{number of cycles of length } k}{\text{number of paths of length } k}$$

where k is the length of the cycles under analysis. A cycle formation model of degree k, denoted as CF(k) with k > 0, characterizes each order formation mechanism, g(1), ..., g(k), by a single coefficient, $c_1, \ldots, c_k$, which describes the conditional probability of k-order cycle formation. The expected clustering coefficient of degree k based on this model can be estimated as

$$f(c_1, \ldots, c_k) = \sum_i |G_i|\, Pr(G_i)\, Pr(e_{1,k+1} \in E \mid G_i)$$

where $G_i$ is the set of possible connected graphs with i nodes for each order in this model. Finally, given the coefficients, the probability of the existence of a link is computed as


Input: Network G = (V, E), model degree k.
Output: Probability P_{x,y} for all unconnected pairs of nodes.
Compute the generalized clustering coefficients C(2), ..., C(k);
c_1 = connecting probability in a random network with the same degree distribution as G;
c_2 = (1 − c_1) C(2) / (c_1 − 2 c_1 C(2) + C(2));
for i from 3 to k do
    c_i ← 0.5;
end
for i from 3 to k do
    c_i ← arg min_{c_i} |C(i) − f(c_1, ..., c_k)|;
end
for each e_{x,y} in U_G − E do
    P_{x,y} ← p_{x,y}(c_1, ..., c_k);
end
ALGORITHM 3: Link prediction based on the cycle formation model.

$$p_{x,y}(c_1, \ldots, c_k) = \frac{c_1 \prod_{i=2}^{k} c_i^{|\mathrm{paths}^i_{x,y}|}}{c_1 \prod_{i=2}^{k} c_i^{|\mathrm{paths}^i_{x,y}|} + (1 - c_1) \prod_{i=2}^{k} (1 - c_i)^{|\mathrm{paths}^i_{x,y}|}}$$

The whole link prediction method based on this cycle formation model is reproduced in Algorithm 3.

4.4 The local co-occurrence model

The probabilistic methods presented above are prohibitive for large networks due to their high computational complexity. The local co-occurrence model proposes a scalable probabilistic method based on local topological features of the network [72].

A set $C_{x,y}$ composed of relevant nodes, called the central neighborhood, is computed for each pair of nodes x and y. These sets can be obtained using different topological measures. The original paper proposes to compute these sets by obtaining all simple paths (without cycles) of length 1, ..., k. The t nodes in the most frequent paths are selected as the central neighborhood set of a pair of nodes. In addition, a collection of non-derivable itemsets (NDI) is efficiently computed for each of these pairs by a depth-first search [73]. Non-derivable itemsets are those itemsets whose occurrence statistics cannot be inferred from other itemset patterns. These itemsets provide non-redundant constraints that can be used to learn probabilistic models without losing information. These sets are used to learn a Markov random field (MRF) undirected graph model. This model is iteratively built satisfying the constraints associated with the NDI sets. Finally, the built model allows us to compute the probability of existence of each link. The final link prediction method appears in Algorithm 4.


Input: Network G = (V, E), central neighborhood set max size t, max path length k.
Output: Probability P_{x,y} for all unconnected pairs of nodes.
for each e_{x,y} in U − E do
    C_{x,y} = ∅;
    for l from 2 to k do
        p_i ← compute and sort by length and frequency paths^l_{x,y};
        for each p in p_i do
            if |C_{x,y}| < t then
                Insert all nodes in p into C_{x,y};
            end
        end
    end
    NDI = compute non-derivable itemsets from C_{x,y};
    R_{x,y} = ∅;
    for each ndi in NDI do
        if ndi in C_{x,y} then
            Insert ndi into R_{x,y};
        end
    end
    M = initialize a Markov random field using C_{x,y} and R_{x,y};
    while M does not satisfy all constraints in R_{x,y} do
        for each r in R_{x,y} do
            Update M to force satisfying r;
        end
    end
    P_{x,y} = infer the probability of e_{x,y} from M;
end
ALGORITHM 4: Link prediction based on the local co-occurrence model.

5 Algorithmic Methods

All the approaches presented in the previous sections are based on computing a score for each non-observed link by defining a similarity or a probability function. However, link prediction can also benefit from other algorithmic approaches, including supervised learning and optimization techniques. These approaches have been less explored in the literature of link prediction but present interesting properties.

5.1 Classifier-based methods

The link prediction problem can be approached by classical supervised learning techniques. It can be seen as a classification problem with two classes: existence and absence of a link. This is a very powerful approach, since it can use any topological property or measure, or even any other link prediction score, as a feature. This approach, however, has to deal with the well-known class imbalance problem [74], since almost all real networks are sparse; that is, the number of absent links is much higher than the number of existing links. Several classifier-based approaches have been proposed. Almost any type of classifier can

be used. Some works have compared different classifiers, including decision trees, support vector machines (SVMs), k-nearest neighbors, multilayer perceptrons, radial basis function networks, naive Bayes, and different ensembles of these classifiers [75]. Other authors have obtained good results using random forest classifiers [76]. Random forests are decision tree ensembles trained on the same training set but using different subsets of the available features. Many classifier-based methods do not rank possible links as similarity-based or probabilistic methods do. This property makes their comparison harder, since the number of predicted links in each class cannot be controlled in this case.
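A minimal sketch of this classifier-based approach, assuming scikit-learn is available, might look as follows; the toy network, the particular feature set, and the use of class_weight to mitigate the class imbalance mentioned above are our own illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier

def pair_features(g, x, y):
    """Topological features for a node pair: common neighbors, Jaccard, and PA."""
    cn = g[x] & g[y]
    union = g[x] | g[y]
    return [len(cn),
            len(cn) / len(union) if union else 0.0,
            len(g[x]) * len(g[y])]

# Hypothetical toy network given as a dictionary of neighbor sets.
g = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2, 4}, 4: {3, 5}, 5: {4}}
edges = [(x, y) for x in g for y in g[x] if x < y]
candidate = (1, 3)                      # unobserved pair we want to score
non_edges = [(x, y) for x in g for y in g
             if x < y and y not in g[x] and (x, y) != candidate]

X = [pair_features(g, x, y) for x, y in edges + non_edges]
labels = [1] * len(edges) + [0] * len(non_edges)   # 1 = link, 0 = no link

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0).fit(X, labels)
print(clf.predict_proba([pair_features(g, *candidate)])[0][1])   # estimated link probability
```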

5.2 Metaheuristic-based methods

Link formation is a complex process with a large number of factors involved. All approaches are heuristic, in the sense that they try to outperform a random baseline predictor by making some assumptions about link formation in the studied network. Assuming that all links are formed by the same mechanism is an oversimplification of the problem. Recently, [77] proposed an approach based on an evolutionary algorithm. Their method assumes that different link formation heuristics can coexist and cooperate in the same network. It uses an evolution strategy to optimize the influence of different base link predictors, including local and global similarity-based indices and node similarity features, in a Twitter reciprocal reply network. Each individual or candidate solution u in the population is a vector $w^{(u)}$ of as many real numbers as the number of heuristics being considered. Each of these values represents the weight or the influence of the corresponding heuristic in the network. Each candidate represents a similarity-based link predictor characterized by the similarity function

$$s(x, y) = \sum_{i=1}^{|w^{(u)}|} w^{(u)}_i s_i(x, y)$$

where $s_i(x, y)$ is the similarity function for the i-th heuristic. The fitness of each candidate is defined as the precision obtained by applying it on a second training subset and a second test subset created from links sampled from the original training set. A covariance matrix adaptation evolution strategy (CMA-ES) is applied to generate new candidates with better fitness by creating new populations of candidates based on combinations and mutations of the previous ones [78].
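The weighted similarity function that each candidate encodes can be sketched as follows; the base heuristics, the toy graph, and the weight values are our own illustrative assumptions, and the CMA-ES optimization of the weights itself is omitted.

```python
def combined_score(g, x, y, weights):
    """Weighted combination of base heuristics, one weight per predictor."""
    cn = g[x] & g[y]
    base_scores = [
        len(cn),                                  # common neighbors
        sum(1.0 / len(g[z]) for z in cn),         # resource allocation
        len(g[x]) * len(g[y]),                    # preferential attachment
    ]
    return sum(w * s for w, s in zip(weights, base_scores))

g = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
weights = [0.6, 0.3, 0.1]   # hypothetical weights; CMA-ES would tune these on a validation split
print(combined_score(g, 0, 3, weights))
```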

5.3 Factorization-based methods

Matrix factorization models have been widely used in recommender systems, since they can extract latent features or use additional features to perform prediction [79]. For example, [80] has suggested a latent feature learning method for link prediction composed of a latent vector $\vec{l}_i$ for each node i, a scaling factor $F_{x,y}$ for each link, weights for node features $W_n$, and a vector of weights for link features $\vec{w}_l$. Furthermore, each node i has a vector of features $\vec{a}_i$ and each link has a vector of features $\vec{b}_{x,y}$.


In this model, given the latent vectors, the scaling factor, and the weights, a prediction score is computed for each pair of nodes x and y as

$$s(x, y) = \frac{1}{1 + \exp\left( -\vec{l}_x^{\,T} F\, \vec{l}_y - \vec{a}_x^{\,T} W_n \vec{a}_y - \vec{w}_l^{\,T}\, \vec{b}_{x,y} \right)}$$

The latent vectors, the scaling factor, and the weights are obtained in a training stage that optimizes the following function

\min_{l, F, W_n, w_l} \sum_{e_{x,y} \in E} \ell(A_{x,y}, s(x, y)) + \Omega(l, F, W_n, w_l)

where \ell is a loss function and \Omega is a regularizer term to prevent overfitting. These terms can be selected to customize the model [79]. This training stage is performed using Stochastic Gradient Descent (SGD) until convergence.
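A minimal sketch of this kind of model, assuming a simplified variant with latent vectors and a scaling matrix only (no node or link side features), a logistic loss, an L2 regularizer, and plain per-pair gradient updates; the function names and hyperparameters are illustrative, not those of [80].

```python
import numpy as np

def factorize(A, k=8, epochs=200, lr=0.05, reg=0.01, seed=0):
    """Learn latent vectors L and a scaling matrix F so that
    sigmoid(l_x^T F l_y) approximates the adjacency entry A[x, y]."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = 0.1 * rng.normal(size=(n, k))
    F = np.eye(k)
    for _ in range(epochs):
        for x in range(n):
            for y in range(x + 1, n):
                s = 1.0 / (1.0 + np.exp(-L[x] @ F @ L[y]))
                err = A[x, y] - s            # negative gradient of the logistic loss w.r.t. the logit
                grad_x = err * (F @ L[y]) - reg * L[x]
                grad_y = err * (F.T @ L[x]) - reg * L[y]
                grad_F = err * np.outer(L[x], L[y]) - reg * F
                L[x] += lr * grad_x
                L[y] += lr * grad_y
                F += lr * grad_F
    return L, F

def link_score(L, F, x, y):
    """Predicted probability of a link between nodes x and y."""
    return 1.0 / (1.0 + np.exp(-L[x] @ F @ L[y]))

# Toy 4-node path graph; after training, observed links get high scores and absent pairs low scores.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L, F = factorize(A)
print(round(link_score(L, F, 0, 1), 3), round(link_score(L, F, 0, 3), 3))
```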

6 Preprocessing Methods

Preprocessing methods are also known as higher-level approaches or meta-approaches, since they are intended to be used in conjunction with other methods. The main goal of preprocessing approaches is to reduce the noise present in networks in the form of “weak” or “false” links, in order to improve the performance of the methods described above.

6.1 Low-rank approximation

This method simplifies the structure of the network to reduce its noise by working on the adjacency matrix A of the graph and solving the low-rank approximation problem [81]. This optimization problem tries to minimize a cost function that measures the fit between the original matrix and an approximation matrix of reduced rank. The rank of the approximated matrix is usually set to a relatively small number. This problem can be solved efficiently using the singular value decomposition (SVD) of the original matrix, which is a factorization of the form A = UΣV^T, where U and V^T are unitary matrices and Σ is a diagonal matrix with non-negative elements. There are different techniques to compute the SVD. The most basic approach relies on the fact that the singular values are the square roots of the eigenvalues of AA^T. Indeed, given the expression of the decomposition, it can be stated that

AA^T = (U\Sigma V^T)(V\Sigma U^T) = U\Sigma^2 U^T

Since the columns of U are eigenvectors of AA^T, U can be obtained using any technique that computes the eigenvectors of a matrix, which also yields the Σ matrix using simple linear algebra calculations. Unfortunately, this technique is not practical for large matrices due to its lack of numerical accuracy. One of the most commonly used SVD algorithms, which offers better accuracy, is given in [82].

Given the SVD of A, the low-rank matrix A' can be computed as

A' = U_{1:|V|,\,1:k}\, \Sigma_{1:k,\,1:k}\, V^T_{1:k,\,1:|V|}

where k is the desired rank. The leading singular vectors explain most of the variance, so the low-rank approximated matrix maintains the overall structure of the graph while removing its less significant links.
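A minimal sketch of this preprocessing step, assuming a dense NumPy adjacency matrix and using numpy.linalg.svd for the decomposition (rather than the specialized algorithm of [82]):

```python
import numpy as np

def low_rank_approximation(A, k):
    """Rank-k approximation of the adjacency matrix via truncated SVD.
    The denoised matrix can then be fed to any link prediction method."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy 4-node path graph: the rank-2 approximation keeps its dominant structure.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(np.round(low_rank_approximation(A, 2), 2))
```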

6.2 Unseen bigrams

A bigram is a sequence of two adjacent elements in a string composed of tokens or words. The frequency distribution of bigrams has been extensively studied in many applications such as linguistics, speech recognition, or cryptography. Unseen bigrams are valid bigrams not observed in a given string set. It has been observed that tokens appearing in different bigrams with similar distributions are likely to be interchangeable and to form unseen bigrams. For example, if we observe the bigrams “a house”, “the house”, “a tree”, “the tree”, and “a car”, then we can infer that “the car” is an unseen bigram. The idea of “substitution” presented by unseen bigrams can be adapted to link prediction in order to reduce noise by replacing a node by its most similar nodes [10]. For example, the common neighbors similarity can be rewritten as

s(x, y) = |S_x^{\delta} \cap \Gamma_y|

where S_x^{\delta} is the set containing the δ most similar nodes to x. The similarity method employed to obtain the set S_x^{\delta} could also be the common neighbors similarity or any other similarity-based technique.
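A small sketch of this substitution idea, using common neighbors both as the base similarity that defines S_x^δ and as the final score; the function names and the example graph are ours.

```python
import networkx as nx

def common_neighbors(g, x, y):
    return len(set(g[x]) & set(g[y]))

def top_delta_similar(g, x, delta):
    """S_x^delta: the delta nodes most similar to x under the base similarity."""
    others = sorted((v for v in g if v != x),
                    key=lambda v: common_neighbors(g, x, v), reverse=True)
    return set(others[:delta])

def unseen_bigram_score(g, x, y, delta=5):
    """|S_x^delta ∩ Γ_y|: shared neighbors after substituting x by its most similar peers."""
    return len(top_delta_similar(g, x, delta) & set(g[y]))

g = nx.karate_club_graph()
print(unseen_bigram_score(g, 0, 33))
```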

6.3 Filtering

Originally called clustering [10], we refer to this technique as filtering in order to avoid any ambiguity. Removing the weakest links (usually, those observed between nodes with few or no shared neighbors) could help improve the results obtained by link prediction methods due to the associated noise reduction. Most of the techniques presented in this survey assign a score s(x, y) to non-observed links, but they can also be applied to observed links in order to estimate their strength. Therefore, the filtering approach consists of applying any link prediction technique that assigns a score to each pair of nodes to the observed links, in order to weigh them, and then removing the fraction γ of weakest links to obtain a cleaned-up network.
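As a brief sketch of the filtering step (assuming NetworkX and the resource allocation index as the strength estimator, which are our choices for the example):

```python
import networkx as nx

def filter_weak_links(g, gamma=0.1):
    """Score every observed link with the resource allocation index and
    remove the gamma fraction with the lowest scores."""
    scored = list(nx.resource_allocation_index(g, list(g.edges())))
    scored.sort(key=lambda triple: triple[2])              # weakest links first
    weakest = [(u, v) for u, v, _ in scored[:int(gamma * len(scored))]]
    cleaned = g.copy()
    cleaned.remove_edges_from(weakest)
    return cleaned

g = nx.karate_club_graph()
print(g.number_of_edges(), filter_weak_links(g, gamma=0.1).number_of_edges())
```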


7 Link prediction in different kinds of networks

In this work, a large number of proposed techniques focusing only on undirected, unweighted networks without attributes have been discussed. However, link prediction techniques for other types of networks, as well as the consideration of additional non-topological information, are also of interest in practice. The role of weak ties in weighted networks remains an open question [83]. Heterogeneous networks, with different types of nodes and links, require approaches different from those used in homogeneous networks [84]. Predicting not only a link but also its source and its destination in directed networks has recently started to be studied [85]. Link prediction in signed networks, where links can be positive or negative, is another active area of research [86]. Time-aware methods can be proposed for networks whose links are labelled with their time of formation [87]. Aligned networks are sets of different networks, each representing the same agents in a different domain, partially matched by anchor nodes and links; specific techniques to exploit this additional information are being developed [88]. Most of the methods presented in our survey can be adapted to the characteristics of these real-world usage scenarios.

8 Conclusions

In this survey, we have performed, as far as we know, the most comprehensive study of proposed link prediction methods in terms of the number of methods and networks considered. These methods have been classified according to a custom taxonomy based on their theoretical approach. The most important results that support specific link prediction techniques and network formation models have also been described.

A comprehensive experimental evaluation has been performed for a large number of fundamental and state-of-the-art methods using a varied set of networks with different properties. It has been observed that new links can be better predicted using only local or quasi-local information in most networks. Considering indirect connections only adds noise and computational complexity to the link prediction problem.

Link prediction is a relatively young research area and many open challenges remain. Further studies are required in order to understand why some methods work better or worse than others depending on the network they are applied to. Studying which network structural properties lead to better performance for each technique is an open research problem. In addition, very few techniques adapt to the global structure of the network and no technique adapts to the local structure of networks. The main difficulty when dealing with complex networks in practice is their size, which limits the kinds of techniques that can be applied. Link prediction remains an open research problem, given its importance in many applications. New techniques with better accuracy and performance trade-offs are expected to be proposed in the near future.


Acknowledgements

This work is partially supported by the Spanish Ministry of Economy and the European Regional Development Fund (FEDER), under grant TIN2012-36951, and partially by the Ministry of Education of Spain under the program “Ayudas para contratos predoctorales para la formaci´onde doctores 2013” (grant BES-2013-064699).

References

[1] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.

[2] Gueorgi Kossinets and Duncan J Watts. Empirical analysis of an evolving social network. Science, 311(5757):88–90, 2006.

[3] Roberto Tamassia. Handbook of graph drawing and visualization. CRC press, 2013.

[4] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.

[5] V´ıctor Mart´ınez, Carlos Cano, and Armando Blanco. Prophnet: A generic prioritization method through propagation of information. BMC bioinformatics, 15(Suppl 1):S5, 2014.

[6] Benno Schwikowski, Peter Uetz, and Stanley Fields. A network of protein–protein interactions in yeast. Nature biotechnology, 18(12):1257–1261, 2000.

[7] Joshua O’Madadhain, Jon Hutchins, and Padhraic Smyth. Prediction and ranking algorithms for event-based network data. ACM SIGKDD Explorations Newsletter, 7(2):23–30, 2005.

[8] Zan Huang, Xin Li, and Hsinchun Chen. Link prediction approach to collaborative filtering. In Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries, pages 141–142. ACM, 2005.

[9] Mark EJ Newman. The structure and function of complex networks. SIAM review, 45(2):167–256, 2003.

[10] David Liben-Nowell. An algorithmic approach to social networks. PhD thesis, Massachusetts Institute of Technology, 2005.

[11] Gergely Palla, Imre Der´enyi, Ill´es Farkas, and Tam´as Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.

[12] Mohammad Al Hasan and Mohammed J Zaki. A survey of link prediction in social networks. In Social network data analytics, pages 243–275. Springer, 2011.


[13] Linyuan L¨u and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.

[14] Peng Wang, BaoWen Xu, YuRong Wu, and XiaoYu Zhou. Link prediction in social networks: the state-of-the-art. Science China Information Sciences, 58(1):1–38, 2015.

[15] J Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. Collaborative filtering recommender systems. In The adaptive web, pages 291–324. Springer, 2007.

[16] Yanjun Qi, Ziv Bar-Joseph, and Judith Klein-Seetharaman. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics, 63(3):490–500, 2006.

[17] Milen Pavlov and Ryutaro Ichise. Finding experts by link prediction in co-authorship networks. FEWS, 290:42–55, 2007.

[18] Indrajit Bhattacharya and Lise Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007.

[19] Valdis E Krebs. Mapping networks of terrorist cells. Connections, 24(3):43–52, 2002.

[20] Jennifer Xu and Hsinchun Chen. The topology of dark networks. Communications of the ACM, 51(10):58–65, 2008.

[21] Matthew Richardson and Pedro Domingos. Mining knowledge-sharing sites for viral marketing. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 61–70. ACM, 2002.

[22] Mark EJ Newman. Clustering and preferential attachment in growing networks. Physical Review E, 64(2):025102, 2001.

[23] Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003.

[24] Tao Zhou, Linyuan L¨u, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630, 2009.

[25] V´ıctor Mart´ınez, Fernando Berzal, and Juan-Carlos Cubero. Adaptive degree penalization for link prediction. Journal of Computational Science, 13:1–9, 2016.

[26] Srinivas Virinchi and Pabitra Mitra. Similarity measures for link prediction using power law degree distribution. In Neural Information Processing, pages 257–264. Springer, 2013.

[27] Jianpei Zhang, Yuan Zhang, Hailu Yang, and Jing Yang. A link prediction algorithm based on socialized semi-local information. Journal of Computational Information Systems, 10(10):4459–4466, 2014.

[28] Albert-L´aszl´o Barab´asi and R´eka Albert. Emergence of scaling in random networks. science, 286(5439):509–512, 1999.

[29] Michael Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet mathematics, 1(2):226–251, 2004.


[30] Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.

[31] Gerard Salton and Michael J McGill. Introduction to modern information retrieval. McGraw-Hill New York, 1983.

[32] Lieve Hamers, Yves Hemeryck, Guido Herweyers, Marc Janssen, Hans Keters, Ronald Rousseau, and Andr´eVanhoutte. Similarity measures in scientometric research: the jaccard index versus salton’s cosine formula. Information Processing & Management, 25(3):315–318, 1989.

[33] Thorvald Sørensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on danish commons. Biol. skr., 5:1–34, 1948.

[34] Bruce McCune, James B Grace, and Dean L Urban. Analysis of ecological communities, volume 28. MjM software design Gleneden Beach, Oregon, 2002.

[35] Erzs´ebet Ravasz, Anna Lisa Somera, Dale A Mongru, Zolt´anN Oltvai, and A-L Barab´asi. Hierarchical organization of modularity in metabolic networks. science, 297(5586):1551–1555, 2002.

[36] EA Leicht, Petter Holme, and Mark EJ Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.

[37] Yuxiao Dong, Qing Ke, Bai Wang, and Bin Wu. Link prediction based on local information. In Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference on, pages 382–386. IEEE, 2011.

[38] Fei Tan, Yongxiang Xia, and Boyao Zhu. Link prediction in complex networks: A mutual information perspective. PloS one, 9(9):e107056, 2014.

[39] Zhen Liu, Qian-Ming Zhang, Linyuan L¨u,and Tao Zhou. Link prediction in complex networks: A local na¨ıve bayes model. EPL (Europhysics Letters), 96(4):48007, 2011.

[40] Carlo Vittorio Cannistraci, Gregorio Alanis-Lobato, and Timothy Ravasi. From link- prediction in brain connectomes and protein interactomes to the local-community- paradigm in complex networks. Scientific Reports, 3, 2013.

[41] Hon Nian Chua, Wing-Kin Sung, and Limsoon Wong. Exploiting indirect neighbours and topological weight to predict protein function from protein–protein interactions. Bioinformatics, 22(13):1623–1630, 2006.

[42] Guimei Liu, Jinyan Li, and Limsoon Wong. Assessing and predicting protein interactions using both local and global network topological metrics. Genome Informatics, 21:138–149, 2008.

[43] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.

[44] Weiping Liu and Linyuan L¨u. Link prediction based on local random walk. EPL (Europhysics Letters), 89(5):58007, 2010.


[45] Karl Pearson. The problem of the random walk. Nature, 72(1865):294, 1905.

[46] Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Proceedings of the Sixth International Conference on Data Mining, ICDM ’06, pages 613–622, Washington, DC, USA, 2006. IEEE Computer Society.

[47] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.

[48] Oron Vanunu and Roded Sharan. A propagation-based algorithm for inferring gene-disease associations. In German Conference on Bioinformatics, pages 54–52. Citeseer, 2008.

[49] Z Burda, J Duda, JM Luck, and B Waclaw. Localization of the maximal entropy random walk. Physical review letters, 102(16):160602, 2009.

[50] Rong-Hua Li, Jeffrey Xu Yu, and Jianquan Liu. Link prediction: the power of maximal entropy random walk. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 1147–1156. ACM, 2011.

[51] Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543. ACM, 2002.

[52] Daniel Spielman. Spectral graph theory. Lecture Notes, Yale University, pages 740– 0776, 2009.

[53] Francois Fouss, Alain Pirotte, J-M Renders, and Marco Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. Knowledge and Data Engineering, IEEE Transactions on, 19(3):355– 369, 2007.

[54] Pavel Chebotarev and Elena Shamis. Matrix-forest theorems. arXiv preprint math/0602575, 2006.

[55] Vincent D Blondel, Anah´ı Gajardo, Maureen Heymans, Pierre Senellart, and Paul Van Dooren. A measure of similarity between graph vertices: Applications to synonym extraction and web searching. SIAM review, 46(4):647–666, 2004.

[56] Linyuan L¨u,Ci-Hang Jin, and Tao Zhou. Similarity index based on local paths for link prediction of complex networks. Physical Review E, 80(4):046122, 2009.

[57] Alexis Papadimitriou, Panagiotis Symeonidis, and Yannis Manolopoulos. Fast and accurate link prediction in social networking systems. Journal of Systems and Software, 85(9):2119–2132, 2012.

[58] Ryan N Lichtenwalter, Jake T Lussier, and Nitesh V Chawla. New perspectives and methods in link prediction. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 243–252. ACM, 2010.


[59] Dongbo Bu, Yi Zhao, Lun Cai, Hong Xue, Xiaopeng Zhu, Hongchao Lu, Jingfen Zhang, Shiwei Sun, Lunjiang Ling, Nan Zhang, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic acids research, 31(9):2443–2450, 2003.

[60] Duncan J Watts and Steven H Strogatz. Collective dynamics of small-world networks. nature, 393(6684):440–442, 1998.

[61] Lorenzo Isella, Juliette Stehl´e, Alain Barrat, Ciro Cattuto, Jean-Fran¸cois Pinton, and Wouter Van den Broeck. What's in a crowd? analysis of face-to-face behavioral networks. Journal of theoretical biology, 271(1):166–180, 2011.

[62] Valdis Krebs. A network of books about recent us politics sold by the online bookseller amazon.com. Unpublished, http://www.orgnet.com, 2008.

[63] Hamsterster full network dataset – KONECT, June 2014.

[64] Vladimir Batagelj and Andrej Mrvar. Pajek datasets. Web page http://vlado.fmf.uni-lj.si/pub/networks/data, 2006.

[65] Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical review E, 74(3):036104, 2006.

[66] Anna Goldenberg, Alice X Zheng, Stephen E Fienberg, and Edoardo M Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129–233, 2010.

[67] Erzs´ebet Ravasz and Albert-L´aszl´o Barab´asi. Hierarchical organization in complex networks. Physical Review E, 67(2):026112, 2003.

[68] Aaron Clauset, Cristopher Moore, and Mark EJ Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.

[69] Charles J Geyer. Practical markov chain monte carlo. Statistical Science, 7(4):473–483, 1992.

[70] Roger Guimer`a and Marta Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52):22073–22078, 2009.

[71] Zan Huang. Link prediction based on graph topology: The predictive value of the generalized clustering coefficient. In Workshop on Link Analysis (KDD), 2006.

[72] Chao Wang, Venu Satuluri, and Srinivasan Parthasarathy. Local probabilistic models for link prediction. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 322–331. IEEE, 2007.

[73] Toon Calders and Bart Goethals. Depth-first non-derivable itemset mining. In SDM, pages 250–261. SIAM, 2005.

[74] Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas, et al. Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30(1):25–36, 2006.


[75] Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed Zaki. Link prediction using supervised learning. In SDM06: Workshop on Link Analysis, Counter-terrorism and Security, 2006.

[76] William Cukierski, Benjamin Hamner, and Bo Yang. Graph-based features for supervised link prediction. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1237–1244. IEEE, 2011.

[77] Catherine A Bliss, Morgan R Frank, Christopher M Danforth, and Peter Sheridan Dodds. An evolutionary algorithm approach to link prediction in dynamic social networks. Journal of Computational Science, 5(5):750–764, 2014.

[78] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.

[79] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.

[80] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Machine Learning and Knowledge Discovery in Databases, pages 437–452. Springer, 2011.

[81] J´erˆome Kunegis and Andreas Lommatzsch. Learning spectral graph transformations for link prediction. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 561–568. ACM, 2009.

[82] James Demmel and William Kahan. Accurate singular values of bidiagonal matrices. SIAM Journal on Scientific and Statistical Computing, 11(5):873–912, 1990.

[83] Linyuan L¨u and Tao Zhou. Link prediction in weighted networks: The role of weak ties. EPL (Europhysics Letters), 89(1):18001, 2010.

[84] Yizhou Sun, Jiawei Han, Charu C Aggarwal, and Nitesh V Chawla. When will it happen?: relationship prediction in heterogeneous information networks. In Proceedings of the fifth ACM international conference on Web search and data mining, pages 663–672. ACM, 2012.

[85] Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Link prediction for partially observed networks. arXiv preprint arXiv:1301.7047, 2013.

[86] Panagiotis Symeonidis and Eleftherios Tiakas. Transitive node similarity: predicting and recommending links in signed social networks. World Wide Web, 17(4):743–776, 2014.

[87] Tomasz Tylenda, Ralitsa Angelova, and Srikanta Bedathur. Towards time-aware link prediction in evolving social networks. In Proceedings of the 3rd Workshop on Social Network Mining and Analysis, page 9. ACM, 2009.

[88] Zhengdong Lu, Berkant Savas, Wei Tang, and Inderjit S Dhillon. Supervised link prediction using multiple sources. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 923–928. IEEE, 2010.


Supplementary material

Evaluation methodology

A systematic evaluation methodology is required in order to evaluate the quality of the predictions performed by a particular link prediction technique. Link prediction methods require a set of links to be used as a priori information to measure the similarity between unconnected nodes. Thus, a training set E^T and a validation or test set E^V are required in order to evaluate the performance of a method for a specific network. Following the classical supervised learning validation methodology, the training set is used by link prediction methods to infer the occurrence of links contained in the validation set. If different snapshots of the network can be obtained, the links observed at time t can be used as the training set E^T, and new links observed at time t + ∆ can be used as the validation set E^V. Both sets have to satisfy E^T ∪ E^V = E and E^T ∩ E^V = ∅. A link belonging to the test set E^V is usually referred to as a missing link. Let U_G be the graph of size |V| containing the |V|(|V| − 1)/2 possible links that would exist if the network were complete. A non-observed link is a link in the set U_G − E^T. Finally, a non-existent link is a link from the set U_G − E.

Since, in most scenarios, only a single snapshot of the network can be accessed, a cross-validation process can be performed to build the training and the validation sets. Cross-validation is a validation technique that comprises different rounds of evaluation with complementary subsets. The most common approach in link prediction is k-fold cross-validation, where the original set of links E is partitioned into k subsets of equal size. Only one of these subsets is retained as the validation set E^V, while the remaining subsets are joined and used as the training set E^T. This process is repeated k times, using each subset as the validation set only once. Given the output set E^O of links predicted as existing and the validation set E^V, different performance measures can be used to evaluate the quality of the obtained results. In k-fold cross-validation, the measures obtained in each iteration are usually averaged to obtain a single global evaluation score.

There is a large collection of well-studied measures defined to evaluate the performance of supervised learning methods. As in any binary classification problem, we can define a confusion matrix or contingency table comparing the predicted class with the actual class. We have four different situations for a potential link e_{x,y}:

• True positive link (TP): if e_{x,y} ∈ E^O and e_{x,y} ∈ E^V.
• False positive link (FP): if e_{x,y} ∈ E^O and e_{x,y} ∉ E^V.
• False negative link (FN): if e_{x,y} ∉ E^O and e_{x,y} ∈ E^V.
• True negative link (TN): if e_{x,y} ∉ E^O and e_{x,y} ∉ E^V.

Given the situation for each link in U_G − E^T, different measures can be computed to evaluate the predictions performed by any link prediction technique. The most widely applied measures are described below.
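A minimal sketch of the k-fold split over links described above, assuming NetworkX graphs; the helper name is ours.

```python
import random
import networkx as nx

def k_fold_link_split(g, k=5, seed=0):
    """Yield (training graph, validation links) pairs: in each round one fold
    of the observed links E plays the role of E^V and the rest form E^T."""
    edges = list(g.edges())
    random.Random(seed).shuffle(edges)
    folds = [edges[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = g.copy()
        training.remove_edges_from(validation)   # E^T = E - E^V
        yield training, validation

g = nx.karate_club_graph()
for training, validation in k_fold_link_split(g):
    print(training.number_of_edges(), len(validation))
```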


Link prediction results can be assessed using different evaluation metrics, which are described below. One of the most commonly used metrics is precision, which is defined as the fraction of true positive links among the set of links predicted as positive. It is computed as

precision = \frac{TP}{TP + FP}

Another common evaluation metric is sensitivity, also known as recall or true positive rate, which is the ratio between the true positive links among the predicted links and the actual number of positive links. It can be calculated as

sensitivity = \frac{TP}{TP + FN} = \frac{TP}{P}

Sensitivity can be seen as the probability that an actually positive link is predicted. Obtaining maximal sensitivity is trivial in some validation contexts by predicting all elements as positive, at the cost of a lower precision. When evaluating ranking-based techniques with a threshold selected to obtain |E^O| = |E^V|, this measure is redundant if the precision is given. Specificity is another measure, also known as the true negative rate. It measures the fraction of negative links which are correctly identified as such. It is calculated as

specificity = \frac{TN}{TN + FP} = \frac{TN}{N}

Given the precision, this measure is also redundant when the conditions described for sensitivity are satisfied. On the one hand, precision and sensitivity focus only on the true positive links. On the other hand, specificity focuses only on the true negative links. A different measure, called accuracy, takes both into account. This measure is defined as the fraction of successfully predicted links. It can be written as

accuracy = \frac{TP + TN}{TP + TN + FP + FN}

If E^O is set so that |E^O| = |E^V|, this measure is also redundant given the precision. However, in actual environments, where |E^V| is unknown, this measure is very informative. Precision and sensitivity are basic measures used to evaluate supervised learning. It is important to achieve a good balance between both, since they can be negatively correlated in some contexts such as information retrieval. The F-score, also known as the F-measure, was proposed to combine precision and sensitivity in a single analytical score. Maximizing this score maximizes both measures while maintaining a balance controlled by a β parameter. The F-score considers precision and sensitivity by computing their weighted harmonic mean:

F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{sensitivity}}{(\beta^2 \times \text{precision}) + \text{sensitivity}}

where the β parameter defines the desired balance between precision and sensitivity. The F1-score is defined as F_β with β = 1, giving precision and sensitivity the same importance. The F2-score, with β = 2, emphasizes sensitivity over precision. On the other hand, the F0.5-score, with β = 0.5, emphasizes precision over sensitivity. Despite the fact that this measure does not consider the true negative rate, it has been widely used in information retrieval and machine learning. Finally, a receiver operating characteristic curve, better known as a ROC curve, is a graphical plot that illustrates the performance of a binary classifier by plotting the sensitivity (true positive rate) as a function of 1 − specificity (false positive rate) [1]. Each point in a ROC curve represents a sensitivity versus 1 − specificity pair corresponding to a particular decision threshold. The area under the ROC curve is also called AUC. This value ranges from 0 to 1 and summarizes the classifier performance. In the link prediction context, the AUC value is equal to the probability that the classifier ranks a random missing link (a link from the validation set E^V) better than a random non-existent link (a link from U_G − E). This value can be asymptotically approximated by a random sampling of pairs of missing and non-existent links as

AUC = \frac{n' + 0.5\, n''}{n}

where n is the number of sampled pairs of links, n' is the number of pairs where the missing link was ranked better than the non-existent link, and n'' is the number of pairs where both links were ranked equally (for example, by obtaining the same score or probability of existence). As expected, the AUC value for a random classifier approaches 0.5 as the number of sampled pairs is increased.
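A small sketch of this sampling approximation, given any similarity function defined on a training graph; the common neighbors score is used here only as an example.

```python
import random
import networkx as nx

def common_neighbors(g, x, y):
    return len(set(g[x]) & set(g[y]))

def sampled_auc(training, validation, score, n=5000, seed=0):
    """Approximate AUC: probability that a random missing link (from E^V)
    is ranked better than a random non-existent link (from U_G - E)."""
    rng = random.Random(seed)
    nodes = list(training.nodes())
    observed = {frozenset(e) for e in training.edges()} | {frozenset(e) for e in validation}
    better = ties = 0
    for _ in range(n):
        missing = rng.choice(validation)
        while True:
            u, v = rng.sample(nodes, 2)
            if frozenset((u, v)) not in observed:
                break                                # a genuinely non-existent link
        s_missing, s_absent = score(training, *missing), score(training, u, v)
        better += s_missing > s_absent
        ties += s_missing == s_absent
    return (better + 0.5 * ties) / n

g = nx.karate_club_graph()
validation = list(g.edges())[::5]                    # hold out every fifth link
training = g.copy()
training.remove_edges_from(validation)
print(sampled_auc(training, validation, common_neighbors))
```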

Network metrics

For each network used in our experiments, we computed different topological properties (see Table 1), including their number of nodes, their number of links, their average node degree, their average path length, and their diameter. We also include their clustering coefficient [3], which measures the tendency of a node to be linked with neighbors of its neighbors, forming triangles. We computed the average clustering coefficient of the network. The clustering coefficient of each node, ranging from 0 to 1, is computed as

C_x = \frac{|\{e_{y,z} : y \in \Gamma_x, z \in \Gamma_x, e_{y,z} \in E\}|}{|\Gamma_x|(|\Gamma_x| - 1)}

Another measure we considered is heterogeneity, which is related to the node degree distribution of the network [4]. Heterogeneity measures the variance of node degrees in the network. A higher value of heterogeneity represents a larger number of high-degree nodes compared to the number of low-degree nodes. This value can be computed as

H = \frac{\langle |\Gamma|^2 \rangle}{\langle |\Gamma| \rangle^2} = \frac{\frac{1}{|V|} \sum_{x \in V} |\Gamma_x|^2}{\left(\frac{1}{|V|} \sum_{x \in V} |\Gamma_x|\right)^2}
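Both metrics are straightforward to compute from the adjacency structure; the sketch below (our own helper names, NetworkX only for the example graph) follows the definitions above.

```python
import networkx as nx

def clustering_coefficient(g, x):
    """Fraction of ordered neighbor pairs (y, z) of x that are themselves linked."""
    neighbors = set(g[x])
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked = sum(1 for y in neighbors for z in neighbors if y != z and g.has_edge(y, z))
    return linked / (k * (k - 1))

def heterogeneity(g):
    """H = <|Γ|^2> / <|Γ|>^2 over the node degrees of the network."""
    degrees = [d for _, d in g.degree()]
    n = len(degrees)
    return (sum(d * d for d in degrees) / n) / (sum(degrees) / n) ** 2

g = nx.karate_club_graph()
average_clustering = sum(clustering_coefficient(g, x) for x in g) / g.number_of_nodes()
print(round(average_clustering, 4), round(heterogeneity(g), 4))
```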

Feature   YST       CEL       INF       BCK       HMT       USA       NSC
|V|       2284      297       410       105       2426      332       1461
|E|       6646      2148      2765      441       16630     2126      2742
⟨k⟩       5.8196    14.4646   13.4878   8.4000    13.7098   12.8072   3.7536
C         0.1345    0.2924    0.4558    0.4875    0.5375    0.6252    0.6937
C*        0.2001    0.3079    0.4672    0.4875    0.6145    0.7494    0.8782
H         2.8479    1.8008    1.3876    1.4207    3.1011    3.4639    1.8486
APL       4.2942    2.4553    3.6309    3.0788    3.1473    2.7381    2.5937
D         11        5         9         7         10        6         17
r         -0.0991   -0.1632   0.2258    -0.1279   0.0474    -0.2079   0.4616

Table 1: The structural features of the networks used in our experiments. From top to bottom: number of nodes, number of links, average degree of each node, average clustering coefficient, average clustering coefficient of nodes with degree above one, heterogeneity, average shortest path length, diameter of the network, and average degree assortativity.

Finally, the assortativity coefficient, or assortative mixing, was also taken into account. Assortativity measures the preference of nodes in a network to attach to other similar ones [2] according to a given similarity definition. We consider assortativity using node degrees. Degree assortativity is equivalent to the correlation coefficient among the degrees of every pair of connected nodes. The degree assortativity score is computed as

r = \mathrm{corr}_{e_{x,y} \in E}(|\Gamma_x|, |\Gamma_y|)

= \frac{\frac{1}{|E|} \sum_{e_{x,y} \in E} |\Gamma_x||\Gamma_y| - \left[\frac{1}{|E|} \sum_{e_{x,y} \in E} (|\Gamma_x| + |\Gamma_y|)/2\right]^2}{\frac{1}{|E|} \sum_{e_{x,y} \in E} (|\Gamma_x|^2 + |\Gamma_y|^2)/2 - \left[\frac{1}{|E|} \sum_{e_{x,y} \in E} (|\Gamma_x| + |\Gamma_y|)/2\right]^2}

References

[1] Tom Fawcett. Roc graphs: Notes and practical considerations for researchers. Machine learning, 31:1–38, 2004.

[2] Mark EJ Newman. Assortative mixing in networks. Physical review letters, 89(20):208701, 2002.

[3] Duncan J Watts and Steven H Strogatz. Collective dynamics of small-world networks. nature, 393(6684):440–442, 1998.

[4] Tao Zhou, Linyuan L¨u, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630, 2009.


1.2 Adaptive link prediction

The journal paper associated with this part of the dissertation is:

V. Mart´ınez,F. Berzal, J.C. Cubero. Adaptive degree penalization for link prediction. Journal of Computational Science 13:1-9, 2016. DOI 10.1016/j.jocs.2015.12.003.

• Status: Published
• ISSN: 1877-7503
• Impact Factor (JCR 2016): 1.748
• Subject Category:
  – Computer Science, Theory & Methods. Ranking 38/104.
  – Computer Science, Interdisciplinary Applications. Ranking 58/105.

Adaptive Degree Penalization for Link Prediction

V´ıctor Mart´ınez, Fernando Berzal, Juan-Carlos Cubero

Abstract

Many systems of interest are best described using networks that represent binary relationships among their elements. Link prediction aims to infer the link formation process by predicting missing or future relationships based on currently-observed connections. Different techniques and measures have been proposed in the literature to solve this problem. Similarity-based local methods achieve high precision with a low computational complexity. However, determining which particular technique should be applied for each particular network remains an open question. In this paper, we exploit the existence of a relationship between the best-performing degree of penalization for shared neighbors and the network clustering coefficient. We propose Adaptive Degree Penalization, a novel link prediction technique that achieves better results than previously-proposed methods.

1 Introduction

The link prediction problem consists of inferring the formation of new relationships or the existence of still-unknown connections between pairs of entities in a network based on their properties and currently-observed links [1]. This problem has attracted a lot of attention, since a large number of systems in many different fields can be described using networks. Approaches and techniques to solve this problem allow us to extract implicit information present in the network, identify spurious links, or model and evaluate network evolution mechanisms. These problems are of great interest since they are closely related to other problems usually found in different disciplines. For example, link prediction has been used to predict previously unknown protein interactions in protein-protein interaction networks [2]. It has also been used to study and predict future author collaborations and tendencies in co-authorship networks [3]. In fact, link prediction is present in our daily lives when we get friendship suggestions in social networks [4] or recommendations of new products in e-commerce web sites [5]. The link prediction problem is formally defined as follows. Let G be an undirected graph G = (V,E), where V is a set of optionally-labeled nodes and E is a set of edges (also referred to as links) between pairs of elements from set V . Given a snapshot of the network G at time t, the link prediction problem consists of inferring the subset of missing links in the current snapshot that will be formed at time t + ∆. Some notational conventions are important to properly describe the proposed solutions for the link prediction problem. An edge between nodes x and y is denoted as ex,y. The

number of nodes in the network is |V|. The number of edges is |E|. The set of nodes connected through an edge to a node x is called the neighborhood of x and is referred to as Γ_x. The degree of a node x in an undirected graph is defined as the number of edges connected to the node and will be denoted as |Γ_x|.

Many link prediction methods are based on the observation that nodes that share a higher number of neighbors are more likely to be connected [6]. Well-known link prediction techniques take into account the number of directly shared neighbors (local methods) or the number of chains of neighbors between two nodes (global methods) to estimate the probability of the existence of a potential link. However, these techniques always work the same way, regardless of the network they are applied to. Our work is motivated by the lack of further studies about how link prediction techniques are affected by network structural properties and about how existing methods can be adapted to the structural properties of particular networks in order to obtain better results.

This paper is organized as follows. Related work is presented in Section 2. We propose and describe a generalized degree penalization similarity measure in Section 3. In Section 4, we analyze the relationship of the best-performing degree penalization with respect to the topological properties of the network. A novel link prediction technique called Adaptive Degree Penalization is presented in Section 5. Finally, the conclusions drawn from this study and some lines of future research are presented in Section 6.

2 Related work

Link prediction has been the subject of many studies [7]. A large number of techniques following different approaches have been proposed to deal with the link prediction problem [8]. In this work, we limit our scope to techniques that consider only the network topology, although methods considering other attributes have also been proposed [9].

The first and most studied approach is based on the similarity between nodes [1, 8]. Similarity-based techniques assume that nodes are more likely to form links with similar nodes. A function that assigns a similarity score s(x, y) to every pair of nodes in the network is defined. This similarity score can take into account different features, which can be topological properties or network-specific attributes. All possible pairs of nodes are ranked in decreasing order based on their similarity scores. Links at the top of the ranked list are supposed to be more likely to be present in the set of missing links. Similarity-based methods can be categorized depending on the amount of information taken into consideration when computing the similarity function. For example, local similarity techniques consider only direct neighbor information. This family of techniques can achieve high precision in most networks and has a linear time complexity, which makes it suitable for large networks. On the other hand, global methods use the whole topology of the network to compute the similarity score for every possible link. These techniques have the advantage of being able to compute the similarity between each pair of nodes regardless of their distance within the network, instead of being limited to neighbor-sharing pairs of nodes. Their main drawbacks are their high computational

complexity and their sensitivity to noise, which usually lead to lower precision than local methods. Finally, quasi-local techniques have been proposed to try to find an equilibrium between the amount of considered information and the computational complexity of the resulting methods. Most quasi-local techniques are either based on local ones, with small variations to consider neighbors of neighbors, or based on global ones, with constraints on the lengths of the considered paths.

An alternative approach is to describe the network formation model in statistical terms. Statistical approaches build a parameterized model assuming the existence of a known structure in the network [10, 11, 12, 13]. The parameters of the model for a particular network are estimated using statistical methods. Finally, the adjusted model is used to compute the probability of the formation of each possible link. The main problem with this kind of technique is its very high computational cost, which limits its applicability to networks of only hundreds or a few thousand nodes. In addition, these techniques can only be applied to networks with a particular structure.

Other algorithmic approaches have also been proposed. Since link prediction techniques are inherently heuristic, some metaheuristic-based methods have been proposed in order to automatically adjust the influence of a set of local similarity-based techniques in an attempt to maximize precision [14]. The link prediction problem can also be seen as a classification problem with two classes (existence and absence of links). This point of view allows the application of traditional machine learning techniques [15, 16]. These techniques can obtain better results than other approaches at the cost of a previous training stage, which is not always possible in many applications. Furthermore, they have the drawback that the predictive model they build is often hard to understand and analyze.

In this paper, we focus our attention on local similarity-based techniques, since these techniques are widely used due to their high scalability and the reasonable precision they obtain [17]. Computational complexity is really important in link prediction, since most real networks are huge, with hundreds of thousands or even millions of nodes. Even worse, there are usually time or resource constraints in many problems related to link prediction. In fact, most recommender systems use only local techniques.

The most basic local method is called Common Neighbors (CN). This technique assigns a score based just on the number of shared neighbors:

s^{CN}(x, y) = |\Gamma_x \cap \Gamma_y|    (1)

It makes sense to assume that, if two individuals share many acquaintances, they are more likely to meet than two individuals without common contacts. Different studies have confirmed this hypothesis by observing a correlation between the number of shared neighbors between pairs of nodes and their probability of being linked [6]. Lada Adamic and Eytan Adar proposed the Adamic-Adar Index (AA) to measure the similarity between two entities based on their shared features [18]. This measure was adapted to link prediction by considering shared neighbors as features:


s^{AA}(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{\log |\Gamma_z|}    (2)

This equation is a variation of the common neighbors similarity function. Here, each shared neighbor is penalized by its degree. This intuitively makes sense in a large number of real-world networks. For example, in social networks, the amount of resources or time that a node can spend on each of its neighbors decreases as its degree increases, also decreasing its influence on them. The Resource Allocation Index (RA) was motivated by the resource allocation process which takes place in complex distribution networks [17]. It models the transmission of resources between two unconnected nodes x and y through their neighboring nodes. Each neighboring node gets a given amount of resources and distributes them evenly among its neighbors. The amount of resources obtained from node x by node y through their shared neighbors can be considered as a similarity measure between both nodes. The resource allocation index has been shown to be the local measure with the best results in a large number of networks [19]. It can be computed as

s^{RA}(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{|\Gamma_z|}    (3)

Other local similarity-based techniques have been proposed, including the Preferential Attachment Index [20], the Jaccard Index [21], the Salton Index [22], the Sørensen Index [23], the Hub-Promoted and Hub-Depressed Indices [24], and the Leicht-Holme-Newman Index [25]. Most of these techniques are variations of the previously-described measures, but also consider other features such as the number of unshared neighbors. Different comparative studies have shown that these variations work better in very specific contexts, yet are worse on average [17].

3 Similarity based on adjustable degree penalization

The Common Neighbors method, the Adamic-Adar Index, and the Resource Allocation Index have been presented in the literature as three different link prediction techniques. It can be readily seen that these methods assume that the probability of existence of a link between two nodes is proportional to the number of shared neighbors between them, but penalize each one according to their degree, with a null penalization in the Common Neighbors case. From our point of view, CN, AA, and RA are just variations of the same technique, which considers shared neighbors and penalizes them by their degree using different penalization schemes. In addition, these techniques limit the degree of penalization to fixed values without taking into account that the most suitable penalization degree could be different according to the peculiarities of the network under study.


A generalized expression for the existing degree penalization local link prediction techniques that allows us to specify the desired degree penalization level can be stated as

s^{GDP}(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} |\Gamma_z|^{-\alpha}    (4)

where α is a free parameter to adjust the penalization to each particular network and |Γ_z| is the degree of the shared neighbor z. It can be seen that this Generalized Degree Penalization (GDP) expression is equivalent to the Common Neighbors method when α = 0, where no penalization is performed. Furthermore, this expression is equal to the Resource Allocation Index when α = 1. Finally, the behavior of the Adamic-Adar Index can be closely approximated by setting α to a value between 0 and 1. For example, we obtained α ≈ 0.37 after fitting the function 1/x^α to the function 1/log x in the interval [2, 20], which encompasses the expected degree values for most nodes in typical real-world networks.

Our definition of local similarity offers two main benefits. First, it allows us to unify the analysis of three existing measures instead of having three different related methods. Second, existing measures do not obtain optimal results, since the optimal value of the α parameter is not adjusted by existing techniques. Which one of those three variations works better in practice depends upon the network they are applied to. However, no advances in determining which technique is better have been accomplished. A complete empirical evaluation is typically done on a case-by-case basis. The ideal degree penalization scheme varies among networks, since each of the aforementioned local techniques obtains better results than the others depending on the network under analysis. Since those techniques only rely on topological properties, the optimal α value should be determined by the network structure.

In order to understand how different values of α behave in networks with different properties, we tested the α parameter for a reasonable range of values and plotted the precision and the AUC obtained for each value, as shown in Figure 1 and Figure 2, respectively. The best-performing degree penalization was estimated for a collection of networks, which we describe in the appendix, by applying the Generalized Degree Penalization technique to each network and varying α in a range from −1.0 to 2.0 in steps of size 0.1. We measured the obtained precision and AUC for each network by performing a 5-fold cross-validation experiment as described in the appendix.
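The GDP measure of Eq. (4) can be sketched in a few lines (the example graph and node pair are arbitrary choices; α = 0, α ≈ 0.37, and α = 1 reproduce CN, approximately AA, and RA, respectively):

```python
import networkx as nx

def gdp_similarity(g, x, y, alpha):
    """Generalized degree penalization: sum of |Γ_z|^(-alpha) over shared neighbors z."""
    return sum(g.degree(z) ** (-alpha) for z in set(g[x]) & set(g[y]))

g = nx.karate_club_graph()
for alpha in (0.0, 0.37, 1.0):          # CN, approximately AA, and RA
    print(alpha, round(gdp_similarity(g, 0, 33, alpha), 4))
```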


Our experiments show that the best-performing α value can vary considerably across networks. It can be observed that the fixed α values used by CN, AA, and RA are not always even close to the best-performing α. Hence, a technique to adapt the α parameter to the network would be desirable. As far as we know, there are no previous studies on the relationship between the best-performing degree penalization and structural network properties. The only step that has been taken in this direction is the observation that higher performance can be obtained by increasing or decreasing the degree penalization depending on the node degree [26]. However, [26] suffers from the same limitations as CN, AA, and RA, since the particular values used for degree penalization are also fixed for every network.


Figure 1: Precision obtained by our link predictor varying the alpha parameter for different networks (see appendix for a description of the networks used in our experiments). Panels: (a) UPG, (b) HPD, (c) ERD, (d) YST, (e) ADV, (f) KHN, (g) PGP, (h) CEG, (i) LDG, (j) ZWL, (k) INF, (l) HTC, (m) CGS, (n) FBK, (o) CDM.

Figure 2: AUC obtained by our link predictor varying the alpha parameter for different networks (see appendix for a description of the networks used in our experiments); panels as in Figure 1.

4 Relating degree penalization to the network structural properties

The best-performing α value for each network might depend on the network structure. Intuitively, one might conjecture that some topological properties of the network are related to this value. Guided by this conjecture, we carried out an experiment to find the degree of correlation between the best-performing value of α and different quantifiable global topological properties of networks, whose results can be found below. We took the best-performing α value for each network and computed the Pearson correlation coefficient against different network properties to find potential correlations between the best-performing α value and the network structural properties. The value of each network topological property was computed for each training network generated in the 5-fold cross-validation process used in our experiments, as described in the appendix. The obtained correlation coefficients are shown in Table 1.


Property   Coefficients for precision   Coefficients for AUC
|V|        -0.1477 (1.0)                0.0553 (1.0)
|E|        0.5980 (0.2966)              0.4288 (1.0)
⟨k⟩        0.6041 (0.2731)              0.6498 (0.1397)
C          0.9374 (4.06 × 10⁻⁶)         0.8812 (0.0002)
ASPL       -0.4990 (0.9325)             -0.4205 (1.0)
D          -0.4203 (1.0)                -0.3758 (1.0)
H          -0.3425 (1.0)                -0.4250 (1.0)
r          0.4975 (0.9466)              0.4289 (1.0)

Table 1: Correlation coefficients (and Bonferroni-adjusted p-values) between network structural properties and the best-performing alpha value according to precision and AUC in our link prediction experiments. Properties, from top to bottom: number of nodes (|V|), number of edges (|E|), average degree (⟨k⟩), average clustering coefficient (C), average shortest path length (ASPL), diameter (D), heterogeneity (H), and assortativity (r).

As expected, some structural network properties are slightly correlated with the best-performing degree penalization value. For example, the assortativity is weakly correlated with this value. This implies that the best-performing α tends to be higher as the nodes of the network tend to be connected to similar ones in terms of their degree. On the other hand, the average shortest path length shows a negative correlation. This suggests that the best-performing α tends to increase as the length of the shortest paths between nodes decreases. However, these correlations are weak and do not help us predict the α value that should be used to improve precision or AUC in link prediction.

On the other hand, the obtained results show that the average clustering coefficient of a network and the best-performing α are strongly correlated (r = 0.9374 for precision and r = 0.8812 for AUC). The average clustering coefficients have been plotted against the best-performing alpha values for precision in Figure 3 and for AUC in Figure 4. It can be seen that they follow an almost linear correlation. This implies that the best-performing α value can be estimated with reasonable accuracy by looking just at the average clustering coefficient measured in the network. As far as we know, this relationship has not been previously mentioned or documented. The discovered relationship is coherent with previous results and allows us to explain why the Common Neighbors method usually obtains better results in low-clustered networks and the Resource Allocation method tends to obtain better results in highly-clustered ones.

5 Adaptive Degree Penalization

The generalized degree penalization measure, combined with the observation that the degree penalization that yields the best link prediction results can be estimated from the clustering coefficient of the network, allows us to propose a new link prediction technique that automatically adapts to the network.


[Figure 3 scatter plot: training networks' average clustering coefficient (x-axis) against the best-performing alpha parameter (y-axis).]

Figure 3: Average clustering coefficient shown against the best-performing alpha value according to precision for the networks used in our experiments.

We propose an Adaptive Degree Penalization link prediction technique whose definition of local similarity tries to estimate the best-performing degree penalization by using the average clustering coefficient observed in the network. We define our similarity measure as

$$s^{ADP}(x, y) = \sum_{z \in \Gamma_x \cap \Gamma_y} |\Gamma_z|^{-\beta C} \qquad (5)$$

where C is the average clustering coefficient of the network and β is a constant. The average clustering coefficient only has to be measured once before applying our link prediction algorithm. In really large or very dynamic networks, it could be estimated by a sampling procedure. The β constant is determined beforehand using a heterogeneous set of networks, since all the networks we have tested seem to follow the same correlation pattern between the clustering coefficient and the best-performing degree penalization.

In order to test the performance of the proposed technique, we have carried out an experiment using an estimated β value. To determine this value, we performed a linear regression between the clustering coefficients of a set of networks, which we call training networks, and the best-performing alpha values. We obtained a slope of β = 2.52 for precision and a slope of β = 2.47 for AUC. Since the slope is consistent for both measures, we set the estimated slope to be β = 2.5. Using this value for the β parameter, we conducted a 5-fold cross-validation experiment to determine the precision and AUC of the proposed link prediction method.
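A minimal sketch of the ADP score in Equation (5), assuming a NetworkX graph; the function and parameter names are illustrative and do not come from any released implementation:

```python
import networkx as nx

def adp_score(G, x, y, beta=2.5, avg_clustering=None):
    """Adaptive Degree Penalization similarity between nodes x and y (Eq. 5)."""
    if avg_clustering is None:
        # Measured once per network; in huge or dynamic networks it could be sampled instead.
        avg_clustering = nx.average_clustering(G)
    alpha = beta * avg_clustering           # adaptive degree-penalization exponent
    shared = set(G[x]) & set(G[y])          # common neighbors of x and y
    return sum(G.degree(z) ** (-alpha) for z in shared)
```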


[Figure 4 scatter plot: training networks' average clustering coefficient (x-axis) against the best-performing alpha parameter (y-axis).]

Figure 4: Average clustering coefficient shown against the best-performing alpha value according to AUC for the networks used in our experiments.

This experimentation is carried out for the training set (the networks used to determine the optimal β) and for a test set (networks not involved in the β estimation). We compared our method's precision and AUC against the results obtained by the Common Neighbors method, the Adamic-Adar Index, and the Resource Allocation method. In addition, we also compared it with the best precision and AUC obtained by varying the α value. It must be taken into account that, since the best precision has been estimated with a step resolution of 0.1, our β-based technique could achieve slightly better results than the estimated best-performing α. In fact, the precision obtained by Adaptive Degree Penalization is even higher than that determined by varying the α parameter in the FBK network, and the AUC is higher in the SMG and BUP networks.

The results that we obtained in our experiments are shown in Table 2 for precision and Table 3 for AUC. It can be seen how our Adaptive Degree Penalization method obtained the best results for precision in 18 of the 22 networks when compared to CN, AA, and RA (13 from the training set and 5 from the test set). With respect to AUC, our method obtained the best results for 18 of the 22 networks (12 from the training set and 6 from the test set). In most of the remaining cases, it still stands as the second best method. On average, our method obtains a precision 0.0082 higher than the Resource Allocation method, the best of the previously-proposed local techniques. This improvement may seem small. However, the best precision obtained by varying α for each network is only 0.0026 better than the average precision obtained by our ADP method. Considering AUC, ADP is also better than the best existing method (RA in this case) and almost indistinguishable from the best result obtained by testing multiple values of α.
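The β slope used above can be fitted by least squares against the training networks; a hedged sketch with hypothetical arrays, constraining the fit through the origin because the adaptive estimate is α ≈ βĈ:

```python
import numpy as np

# Hypothetical training data: average clustering coefficients and best-performing alphas.
clustering = np.array([0.05, 0.08, 0.18, 0.24, 0.36, 0.47])
best_alpha = np.array([0.10, 0.40, 0.70, 0.70, 0.80, 1.10])

# Least-squares slope of a line through the origin (alpha ~ beta * C).
beta = np.sum(clustering * best_alpha) / np.sum(clustering ** 2)
print(f"estimated beta = {beta:.2f}")   # the paper reports a slope close to 2.5
```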


Network   Ĉ      Est. α   Best α   Best precision   CN       AA       RA       ADP
Training networks:
UPG       0.05   0.13     0.1      0.0438           0.0411   0.0331   0.0306   0.0438
HPD       0.08   0.20     0.4      0.0872           0.0748   0.0828   0.0669   0.0842
ERD       0.08   0.19     0.3      0.0991           0.0883   0.0950   0.0803   0.0986
YST       0.09   0.24     0.4      0.1136           0.1103   0.1080   0.0876   0.1127
ADV       0.18   0.45     0.7      0.1947           0.1591   0.1783   0.1821   0.1862
KHN       0.18   0.44     0.5      0.1418           0.1085   0.1382   0.1185   0.1407
PGP       0.18   0.45     0.6      0.3861           0.3058   0.3655   0.3608   0.3761
CEG       0.23   0.57     0.6      0.1532           0.1401   0.1532   0.1448   0.1527
LDG       0.23   0.56     0.7      0.1740           0.1361   0.1662   0.1634   0.1727
ZWL       0.24   0.60     0.7      0.2190           0.1891   0.2110   0.2075   0.2188
INF       0.36   0.89     0.8      0.4210           0.3978   0.4080   0.4163   0.4192
HTC       0.32   0.80     0.8      0.3679           0.2406   0.3583   0.3667   0.3673
CGS       0.32   0.79     0.8      0.4339           0.2037   0.3986   0.4265   0.4337
FBK       0.47   1.17     1.1      0.5327           0.3946   0.4185   0.5272   0.5334
CDM       0.45   1.13     0.9      0.5617           0.3878   0.5338   0.5611   0.5466
Test networks:
EML       0.16   0.41     0.6      0.1990           0.1825   0.1923   0.1802   0.1967
SMG       0.22   0.55     0.4      0.1611           0.1420   0.1587   0.1408   0.1593
BUP       0.37   0.94     0.7      0.2608           0.2449   0.2608   0.2563   0.2540
GRQ       0.36   0.90     0.8      0.5550           0.4002   0.5308   0.5531   0.5539
HMT       0.40   1.00     0.9      0.3977           0.2580   0.3267   0.3959   0.3959
UAL       0.46   1.15     1.0      0.5160           0.4388   0.4558   0.5160   0.5155
NSC       0.47   1.18     1.2      0.6667           0.5058   0.6295   0.6659   0.6667
Training set average                0.2620           0.1985   0.2432   0.2494   0.2591
Test set average                    0.3938           0.3103   0.3649   0.3869   0.3917
Overall average                     0.3039           0.2341   0.2820   0.2931   0.3013

Table 2: Results obtained from our method comparison. Columns, from left to right: network name, average clustering coefficient of the training network (Ĉ), α estimated as βĈ, best-performing α value from the experiment (see Figure 1), precision obtained by the best-performing α, Common Neighbors precision, Adamic-Adar Index precision, Resource Allocation precision, and Adaptive Degree Penalization precision.

We performed a Friedman test [27] to determine if there are statistically significant differences among the link prediction methods used in our experiments. The Friedman tests confirm that there are significant differences for precision and AUC, since the obtained p-values are 2.5571 × 10^−8 and 1.1921 × 10^−11, respectively. The average ranks obtained in our experiments by the different methods for precision were 1.36 for CN, 2.55 for AA, 2.34 for RA, and 3.75 for our method (ADP). For AUC, the ranks were 1.00 for CN, 2.36 for AA, 2.84 for RA, and 3.80 for ADP.

Finally, we performed a post-hoc test using the Wilcoxon signed-rank test [28] with our link prediction method as control. The alpha level (initially set to 0.05) was adjusted with the Holm method to control the familywise error rate (FWER). According to precision, when we compared our method to CN, AA, and RA, we obtained 8.8219 × 10^−4, 0.0034, and 0.0198 as corrected p-values, respectively. According to AUC, the corrected p-values were 8.8219 × 10^−4, 0.0020, and 0.0461. In other words, our method obtains significantly better results than previous methods.


Network   Ĉ      Est. α   Best α   Best AUC   CN       AA       RA       ADP
Training networks:
UPG       0.05   0.13     0.2      0.6138     0.5882   0.6036   0.5944   0.6137
HPD       0.08   0.20     0.4      0.7948     0.7080   0.7941   0.7805   0.7912
ERD       0.08   0.19     0.3      0.8307     0.7355   0.8305   0.8172   0.8306
YST       0.09   0.24     0.3      0.7978     0.7336   0.7961   0.7790   0.7978
ADV       0.18   0.45     0.6      0.8741     0.8231   0.8683   0.8701   0.8734
KHN       0.18   0.44     0.5      0.8929     0.7436   0.8909   0.8916   0.8928
PGP       0.18   0.45     0.7      0.9342     0.8373   0.9288   0.9303   0.9318
CEG       0.23   0.57     0.9      0.8033     0.7443   0.7931   0.8029   0.8002
LDG       0.23   0.56     0.6      0.9088     0.7978   0.9053   0.9069   0.9088
ZWL       0.24   0.60     0.6      0.8733     0.8202   0.8677   0.8698   0.8733
INF       0.36   0.89     0.9      0.8692     0.8315   0.8642   0.8689   0.8691
HTC       0.32   0.80     0.7      0.8587     0.7134   0.8579   0.8558   0.8584
CGS       0.32   0.79     0.7      0.9184     0.7495   0.9123   0.9162   0.9182
FBK       0.47   1.17     1.1      0.9720     0.9514   0.9615   0.9716   0.9720
CDM       0.45   1.13     0.8      0.9318     0.8143   0.9243   0.9311   0.9298
Test networks:
EML       0.16   0.41     0.5      0.7976     0.7561   0.7970   0.7885   0.7974
SMG       0.22   0.55     0.5      0.8331     0.7535   0.8296   0.8290   0.8332
BUP       0.37   0.94     1.1      0.7486     0.6977   0.7395   0.7486   0.7489
GRQ       0.36   0.90     0.9      0.9259     0.8073   0.9171   0.9257   0.9259
HMT       0.40   1.00     0.9      0.9115     0.8417   0.8954   0.9112   0.9112
UAL       0.46   1.15     1.1      0.9313     0.8741   0.9141   0.9310   0.9313
NSC       0.47   1.18     1.0      0.9374     0.8037   0.9327   0.9374   0.9366
Training set average                0.8583     0.7728   0.8532   0.8524   0.8574
Test set average                    0.8693     0.7906   0.8608   0.8673   0.8692
Overall average                     0.8618     0.7785   0.8556   0.8572   0.8612

Table 3: Results obtained from our method comparison. Columns, from left to right: network name, average clustering coefficient of the training network (Ĉ), α estimated as βĈ, best-performing α value from the experiment (see Figure 2), AUC obtained by the best-performing α, Common Neighbors AUC, Adamic-Adar Index AUC, Resource Allocation AUC, and Adaptive Degree Penalization AUC.

In addition, we obtained corrected p-values of 0.0055 for precision and 0.0801 for AUC when we compared our adaptive degree penalization method to the method that tests many different α values. The latter result indicates that the test failed to reject the null hypothesis for AUC, so there is no evidence to suggest that the AUC obtained by our adaptive degree penalization method is significantly different from the AUC obtained by an exhaustive search of possible values for the α penalization parameter.
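The statistical comparison can be reproduced with SciPy; a minimal sketch, assuming one score per method and per network and using ADP as the control method (the arrays below are hypothetical, not the experimental results):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-network precision scores (one entry per network) for each method;
# the real comparison uses the 22 networks of Tables 2 and 3.
scores = {
    "CN":  np.array([0.041, 0.075, 0.159, 0.395, 0.205]),
    "AA":  np.array([0.033, 0.083, 0.178, 0.408, 0.400]),
    "RA":  np.array([0.031, 0.067, 0.182, 0.416, 0.426]),
    "ADP": np.array([0.044, 0.084, 0.186, 0.419, 0.434]),
}

stat, p = friedmanchisquare(*scores.values())
print(f"Friedman p-value: {p:.4g}")

# Post-hoc Wilcoxon signed-rank tests with ADP as the control method,
# Holm step-down adjustment of the p-values (alpha initially 0.05).
others = [m for m in scores if m != "ADP"]
raw = {m: wilcoxon(scores["ADP"], scores[m]).pvalue for m in others}
ordered = sorted(raw, key=raw.get)
adjusted, running_max = {}, 0.0
for i, m in enumerate(ordered):
    running_max = max(running_max, raw[m] * (len(ordered) - i))
    adjusted[m] = min(1.0, running_max)
print("Holm-adjusted p-values:", adjusted)
```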

6 Conclusions and future work

In this paper, we have presented a generalized adaptive degree penalization link prediction method that unifies previously-proposed local similarity-based techniques. We have also studied how the penalization parameter relates to some network structural properties,


finding a strong correlation between the clustering coefficient and the best-performing value of the degree penalization parameter. This result allowed us to propose a new technique that automatically estimates this parameter based on the clustering coefficient measured in the network under study. Our new method obtains statistically significantly better results than CN, AA, and RA in an experimental evaluation considering 22 networks from different application domains.

These results lead to a better understanding of local similarity-based techniques. Instead of considering different measures with different theoretical backgrounds, we consider them as instances of a more general technique. In addition, our observation that a higher degree penalization should be applied as the average clustering coefficient of the network increases is consistent with known results. A deeper theoretical study of these empirical results remains as future work and could lead to interesting insights. Some studies have started to analyze the behavior of the clustering coefficient in graphs generated from network models [29, 30]. Connecting network formation and link prediction theory could help improve the results obtained by link prediction techniques.

Acknowledgments

Work partially supported by the Spanish Ministry of Economy and the European Regional Development Fund (FEDER), under grant TIN2012-36951, and partially by the Ministry of Education of Spain under the program "Ayudas para contratos predoctorales para la formación de doctores 2013" (grant BES-2013-064699).

Appendix

Network structural measures

Different structural network measures were considered to summarize network topologies and to help us understand networks at a macroscopic level. Several measures related to shortest paths were taken into account. For example, we computed the average shortest path length (ASPL) considering every possible pair of nodes. In addition, we also obtained the network diameter (D), which is the length of the longest shortest path in the network. We also computed the clustering coefficient [31], which measures the tendency of a node to form triangles with its neighbors. More intuitively, in a social network, the clustering coefficient would measure the tendency of the friends of a given person to also be friends among themselves. We computed the average clustering coefficient of the network as

105 Adaptive degree penalization for link prediction

$$C = \frac{\sum_{x \in V} C_x}{|V|} \qquad (6)$$

where $C_x$ is the clustering coefficient, ranging from 0 to 1, of a node x, which is itself computed as

$$C_x = \frac{|\{e_{y,z} : y \in \Gamma_x,\ z \in \Gamma_x,\ e_{y,z} \in E\}|}{|\Gamma_x|\,(|\Gamma_x| - 1)} \qquad (7)$$
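A direct transcription of Equations (6) and (7), assuming an undirected NetworkX graph and interpreting the numerator of Eq. (7) as counting each neighbor pair in both orders, which matches the conventional definition implemented by nx.average_clustering:

```python
import networkx as nx

def local_clustering(G, x):
    """Clustering coefficient of node x (Eq. 7)."""
    neighbors = list(G[x])
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Links among the neighbors of x (each unordered pair counted once, then doubled).
    links = sum(1 for i, y in enumerate(neighbors)
                  for z in neighbors[i + 1:] if G.has_edge(y, z))
    return 2.0 * links / (k * (k - 1))

def average_clustering(G):
    """Average clustering coefficient of the network (Eq. 6)."""
    return sum(local_clustering(G, x) for x in G) / G.number_of_nodes()

# Sanity check against the built-in implementation:
# G = nx.karate_club_graph(); assert abs(average_clustering(G) - nx.average_clustering(G)) < 1e-12
```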

Heterogeneity is another interesting measure, related to the node degree distribution of the network [17]. It reflects the variance of the node degrees: in a typical real-world network, a higher heterogeneity value indicates a larger number of high-degree nodes compared to the number of low-degree nodes. This value can be computed as

$$H = \frac{\langle |\Gamma|^2 \rangle}{\langle |\Gamma| \rangle^2} = \frac{\frac{1}{|V|} \sum_{x \in V} |\Gamma_x|^2}{\left( \frac{1}{|V|} \sum_{x \in V} |\Gamma_x| \right)^2} \qquad (8)$$

The assortativity coefficient, or assortative mixing, is a measure that assesses the preference of the nodes in a network to attach to other similar nodes [32]. The definition of similarity can vary, but it is usually based on degree similarity, computing the correlation coefficient between the degrees of every pair of connected nodes. Considering degree similarity, this score is computed as

$$r = \mathrm{corr}_{e_{x,y} \in E}(|\Gamma_x|, |\Gamma_y|)
= \frac{\frac{1}{|E|} \sum_{e_{x,y} \in E} |\Gamma_x||\Gamma_y| - \left[ \frac{1}{|E|} \sum_{e_{x,y} \in E} \tfrac{1}{2}(|\Gamma_x| + |\Gamma_y|) \right]^2}
{\frac{1}{|E|} \sum_{e_{x,y} \in E} \tfrac{1}{2}(|\Gamma_x|^2 + |\Gamma_y|^2) - \left[ \frac{1}{|E|} \sum_{e_{x,y} \in E} \tfrac{1}{2}(|\Gamma_x| + |\Gamma_y|) \right]^2} \qquad (9)$$
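Both measures follow directly from the degree sequence; a minimal sketch assuming an undirected NetworkX graph, where nx.degree_assortativity_coefficient implements Newman's formula in Eq. (9):

```python
import networkx as nx
import numpy as np

def heterogeneity(G):
    """Degree heterogeneity H = <k^2> / <k>^2, as in Eq. (8)."""
    degrees = np.array([d for _, d in G.degree()], dtype=float)
    return (degrees ** 2).mean() / degrees.mean() ** 2

G = nx.karate_club_graph()   # illustrative network, not one of those in Table 4
print("H =", heterogeneity(G))
print("r =", nx.degree_assortativity_coefficient(G))   # assortativity, Eq. (9)
```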

Data sources

We have collected 22 networks from different sources and domains. These networks were carefully selected to cover a wide range of properties, including different sizes, average degrees, clustering coefficients, and heterogeneity indices. A summary of the collection of networks used in our experiments can be found in Table 4, and the networks themselves are available at http://noesis.ikor.org/datasets/link-prediction. UPG is a power distribution network. HPD, YST, and CEG are biological networks. ERD, KHN, LDG, SMG, ZWL, HTC, CGS, CDM, NSC, and GRQ are co-authorship networks for different fields of study. HMT, FBK, and ADV are social networks. UAL is an airport traffic network. EML is a network of individuals who exchanged emails. PGP is an interaction network of users of the Pretty Good Privacy algorithm. BUP is a network of political blogs. Finally, INF is a network of face-to-face contacts in an exhibition.


Name   |V|     |E|     ⟨k⟩     C      ASPL    D    H         r         Reference
UPG    4941    6594    2.67    0.08   18.99   46   1.4504    0.0035    [31]
HPD    8756    32331   7.38    0.11   4.19    14   4.5133    −0.051    [33]
ERD    6927    11850   3.42    0.12   3.78    4    12.6708   −0.1156   [34]
YST    2284    6646    5.82    0.13   4.29    11   2.8479    −0.0991   [35]
EML    1133    5451    9.62    0.22   3.61    8    1.9421    0.0782    [35]
ADV    5155    39285   15.24   0.25   3.22    9    5.4060    −0.0951   [36]
KHN    3772    12718   6.74    0.25   3.63    12   9.422     −0.1205   [34]
PGP    10680   24316   4.55    0.27   7.49    24   4.1465    0.2382    [37]
CEG    297     2148    14.46   0.29   2.46    5    1.8008    −0.1632   [31]
LDG    8324    41532   9.98    0.31   4.37    16   6.188     −0.0997   [34]
SMG    1024    4916    9.6     0.31   2.98    6    3.9475    −0.1925   [34]
ZWL    6651    54182   16.29   0.32   3.85    10   2.5851    0.0006    [34]
INF    410     2765    13.49   0.46   3.63    9    1.3876    0.2258    [38]
BUP    105     441     8.4     0.49   3.08    7    1.4207    −0.1279   [39]
HTC    7610    15751   4.14    0.49   5.68    19   2.0986    0.2939    [40]
CGS    6158    11898   3.86    0.49   3.62    14   3.9467    0.2426    [34]
GRQ    5241    14484   5.53    0.53   5.05    17   3.0523    0.6593    [41]
HMT    2426    16630   13.71   0.54   3.15    10   3.1011    0.0474    [42]
FBK    4024    87887   43.68   0.59   3.98    13   2.432     0.0707    [43]
UAL    332     2126    12.81   0.63   2.74    6    3.4639    −0.2079   [34]
CDM    16264   47594   5.85    0.64   5.82    18   2.2087    0.1846    [40]
NSC    1461    2742    3.75    0.69   2.59    17   1.8486    0.4616    [44]

Table 4: Network topological properties (with networks sorted by their clustering coefficient). Columns, from left to right: network name, number of nodes (|V|), number of edges (|E|), average degree (⟨k⟩), average clustering coefficient (C), average shortest path length (ASPL), diameter (D), heterogeneity (H), and assortativity (r).

Evaluating link prediction methods

A 5-fold cross-validation was carried out to measure the performance of link prediction techniques. Link prediction methods require a set of links to be used as a priori information to perform the inference process. The original set of links E of each network was partitioned into k non-overlapping subsets {E_1, ..., E_k} of equal size. On the i-th iteration, only one of these sets was retained as validation set E^V = E_i, while the remaining sets were joined and used as training set E^T = E − E_i. Hence, E^T ∪ E^V = E and E^T ∩ E^V = ∅. This process was repeated k times in order to use each subset once as validation set. The performance scores computed during each iteration were averaged to obtain a single score.

A missing link is formally defined as a link belonging to the validation set E^V. A non-observed link is defined as a link in the set U_G − E^T, where U_G is the complete graph of size |V| containing the |V|(|V| − 1)/2 possible links that would exist if the network were fully connected. Finally, a non-existent link is defined as a link from the set U_G − E.
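A minimal sketch of the edge-level k-fold split described above, assuming a NetworkX graph; the training graph keeps every node so that degree-based scores remain defined (names are illustrative):

```python
import random
import networkx as nx

def edge_kfold(G, k=5, seed=0):
    """Yield (training_graph, validation_edges) pairs for a k-fold split of E."""
    edges = list(G.edges())
    random.Random(seed).shuffle(edges)
    folds = [edges[i::k] for i in range(k)]           # k nearly equal-sized subsets of E
    for i in range(k):
        validation = folds[i]                          # E^V for this iteration
        training = nx.Graph()
        training.add_nodes_from(G.nodes())             # keep all nodes in E^T's graph
        training.add_edges_from(e for j, f in enumerate(folds) if j != i for e in f)
        yield training, validation
```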


The results obtained by a supervised machine learning algorithm can be summarized using a confusion matrix, which shows all possible combinations of actual and predicted classes. In the context of link prediction, a true positive (TP) is a predicted link that belongs to the validation set, a false positive (FP) is a predicted link that does not belong to the validation set, a false negative (FN) is a non-predicted link that belongs to the validation set, and a true negative (TN) is a non-predicted link that does not belong to the validation set. These values are used to define different scores that measure different desirable properties of a link predictor. We used precision as the performance measure for link predictors. Precision, also known as positive predictive value, is defined as the fraction of true positive links among the set of links predicted as positive:

$$Precision = \frac{TP}{TP + FP} \qquad (10)$$

Other measures have been used to evaluate the performance of link prediction techniques. For example, some authors have used the area under the curve (AUC) to compare different methods [8]. The AUC value is equal to the probability of the technique ranking a random missing link (a link from the validation set E^V) better than a random non-existent link (a link from U_G − E). Considering all pairs of missing and non-existent links, this value can be computed as

$$AUC = \frac{n' + 0.5\, n''}{n}$$

where n is the number of pairs, n' is the number of pairs where the missing link was ranked better than the non-existent link, and n'' is the number of pairs where both links were ranked equally (for example, by obtaining the same score or probability of existence). As expected, the AUC value for a random classifier should be around 0.5. Recall has also been used by some authors [45]. Recall is similar to precision but considers the number of false negatives instead of the number of false positives. However, due to the way we evaluate link prediction techniques, FN increases as FP increases, so distinguishing between precision and recall would be unnecessary in our context (as would the use of alternative measures such as the F-score).
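A minimal sketch of the AUC estimate defined above, assuming a score(u, v) function and lists of missing and non-existent node pairs; pairs are sampled here rather than fully enumerated:

```python
import random

def auc_score(score, missing_links, nonexistent_links, n_pairs=10000, seed=0):
    """AUC = (n' + 0.5 n'') / n over sampled pairs of missing vs. non-existent links."""
    rng = random.Random(seed)
    better = ties = 0
    for _ in range(n_pairs):
        s_missing = score(*rng.choice(missing_links))
        s_absent = score(*rng.choice(nonexistent_links))
        if s_missing > s_absent:
            better += 1
        elif s_missing == s_absent:
            ties += 1
    return (better + 0.5 * ties) / n_pairs
```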

References

[1] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.


[2] Víctor Martínez, Carlos Cano, and Armando Blanco. ProphNet: A generic prioritization method through propagation of information. BMC Bioinformatics, 15(Suppl 1):S5, 2014.

[3] Milen Pavlov and Ryutaro Ichise. Finding experts by link prediction in co-authorship networks. FEWS, 290:42–55, 2007.

[4] Michael Fire, Lena Tenenboim, Ofrit Lesser, Rami Puzis, Lior Rokach, and Yuval Elovici. Link prediction in social networks using computationally efficient topological features. In Privacy, Security, Risk and Trust (PASSAT), 2011 IEEE Third International Conference on and 2011 IEEE Third International Conference on Social Computing (SocialCom), pages 73–80. IEEE, 2011.

[5] Zan Huang, Xin Li, and Hsinchun Chen. Link prediction approach to collaborative filtering. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 141–142. ACM, 2005.

[6] Mark EJ Newman. Clustering and preferential attachment in growing networks. Physical Review E, 64(2):025102, 2001.

[7] Lise Getoor and Christopher P Diehl. Link mining: a survey. ACM SIGKDD Explorations Newsletter, 7(2):3–12, 2005.

[8] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.

[9] Neil Zhenqiang Gong, Ameet Talwalkar, Lester Mackey, Ling Huang, Eui Chul Richard Shin, Emil Stefanov, Elaine Runting Shi, and Dawn Song. Joint link prediction and attribute inference using a social-attribute network. ACM Transactions on Intelligent Systems and Technology (TIST), 5(2):27, 2014.

[10] Zan Huang. Link prediction based on graph topology: The predictive value of the generalized clustering coefficient. In Workshop on Link Analysis (KDD), 2006.

[11] Chao Wang, Venu Satuluri, and Srinivasan Parthasarathy. Local probabilistic models for link prediction. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 322–331. IEEE, 2007.

[12] Aaron Clauset, Cristopher Moore, and Mark EJ Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.

[13] Roger Guimerà and Marta Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52):22073–22078, 2009.

[14] Catherine A. Bliss, Morgan R. Frank, Christopher M. Danforth, and Peter Sheridan Dodds. An evolutionary algorithm approach to link prediction in dynamic social networks. Journal of Computational Science, 5(5):750–764, 2014.

[15] Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas, et al. Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30(1):25–36, 2006.


[16] William Cukierski, Benjamin Hamner, and Bo Yang. Graph-based features for supervised link prediction. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1237–1244. IEEE, 2011.

[17] Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B-Condensed Matter and Complex Systems, 71(4):623–630, 2009.

[18] Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003.

[19] Weiping Liu and Linyuan Lü. Link prediction based on local random walk. EPL (Europhysics Letters), 89(5):58007, 2010.

[20] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[21] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.

[22] Gerard Salton and Michael J McGill. Introduction to modern information retrieval. McGraw-Hill New York, 1983.

[23] Thorvald Sørensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on danish commons. Biol. skr., 5:1–34, 1948.

[24] Erzsébet Ravasz, Anna Lisa Somera, Dale A Mongru, Zoltán N Oltvai, and A-L Barabási. Hierarchical organization of modularity in metabolic networks. Science, 297(5586):1551–1555, 2002.

[25] EA Leicht, Petter Holme, and Mark EJ Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.

[26] Srinivas Virinchi and Pabitra Mitra. Similarity measures for link prediction using power law degree distribution. In Neural Information Processing, pages 257–264. Springer, 2013.

[27] Milton Friedman. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1):86–92, 1940.

[28] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics bulletin, pages 80–83, 1945.

[29] N. Eggemann and S.D. Noble. The clustering coefficient of a scale-free random graph. Discrete Applied Mathematics, 159(10):953 – 965, 2011.

[30] Pol Colomer-de Simon and Marián Boguñá. Clustering of random scale-free networks. Physical Review E, 86(2):026120, 2012.

[31] Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.


[32] Mark EJ Newman. Assortative mixing in networks. Physical Review Letters, 89(20):208701, 2002.

[33] Suraj Peri, J Daniel Navarro, Ramars Amanchy, Troels Z Kristiansen, Chandra Kiran Jonnalagadda, Vineeth Surendranath, Vidya Niranjan, Babylakshmi Muthusamy, TKB Gandhi, Mads Gronborg, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research, 13(10):2363–2371, 2003.

[34] Vladimir Batagelj and Andrej Mrvar. Pajek datasets. Web page http://vlado.fmf.uni-lj.si/pub/networks/data, 2006.

[35] Dongbo Bu, Yi Zhao, Lun Cai, Hong Xue, Xiaopeng Zhu, Hongchao Lu, Jingfen Zhang, Shiwei Sun, Lunjiang Ling, Nan Zhang, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Research, 31(9):2443–2450, 2003.

[36] Paolo Massa, Martino Salvetti, and Danilo Tomasoni. Bowling alone and trust decline in social network sites. In Dependable, Autonomic and Secure Computing, 2009. DASC'09. Eighth IEEE International Conference on, pages 658–663. IEEE, 2009.

[37] María Ángeles Serrano, Marián Boguñá, Romualdo Pastor-Satorras, and Alejandro Vespignani. Correlations in complex networks. Large Scale Structure and Dynamics of Complex Networks: From Information Technology to Finance and Natural Sciences, pages 35–66, 2007.

[38] Lorenzo Isella, Juliette Stehlé, Alain Barrat, Ciro Cattuto, Jean-François Pinton, and Wouter Van den Broeck. What's in a crowd? Analysis of face-to-face behavioral networks. Journal of Theoretical Biology, 271(1):166–180, 2011.

[39] Valdis Krebs. A network of books about recent US politics sold by the online bookseller amazon.com. Unpublished, http://www.orgnet.com, 2008.

[40] Mark EJ Newman. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2):404–409, 2001.

[41] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):2, 2007.

[42] Jérôme Kunegis. Hamsterster full network dataset – KONECT. Web page http://konect.uni-koblenz.de/networks/petster-hamster, June 2014.

[43] Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In NIPS, volume 272, pages 548–556, 2012.

[44] Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[45] Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed Zaki. Link prediction using supervised learning. In SDM06: Workshop on Link Analysis, Counter-terrorism and Security, 2006.


1.3 Probabilistic local link prediction

The book chapter associated to this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. Probabilistic local link prediction in complex networks. In Proceedings of the 11th International Conference on Scalable Uncertainty Management. Lecture Notes in Computer Science, 10564:391–396, 2017. DOI 10.1007/978-3-319-67582-4.

• Status: Published
• ISSN: 0302-9743
• ISBN: 978-3-319-67581-7
• Volume: 10564

Probabilistic Local Link Prediction in Complex Networks

Víctor Martínez, Fernando Berzal, Juan-Carlos Cubero

Abstract

Link prediction is the problem of inferring future or missing relationships between nodes in a given network. This problem has attracted great attention since it has a large number of applications. In this problem, there is always some degree of uncertainty because the absence of a link between a pair of nodes may be the result of the non-existence of the link or the result of it not being observed but actually existing. In this paper, we propose a local link prediction technique that aggregates the observed evidence to estimate the probability of each possible non-observed link. We also show how our scalable link prediction technique achieves higher precision than other well-established local techniques in several networks from very different domains.

1 Introduction

A problem that commonly appears in a large number of domains is, given a set of observed relationships or interactions between entities, predicting the most likely non-observed links. This task is known as the link prediction problem [1]. It has been applied with great success to a large number of tasks, including the prediction of interactions among proteins [2], the prediction of future author collaborations [3], the suggestion of people we may know in social networks [1], or the recommendation of commercial products [4]. Very different approaches have been proposed to deal with the link prediction problem [5]. In this work, we focus our attention on local techniques, which take into account only local information, leading to highly scalable algorithms. Efficiency is paramount because link prediction is commonly applied to massive networks, where scalability is a crucial requirement. Almost all local techniques consider the shared neighbors between nodes to estimate the likelihood of the existence of a link, weighting the contribution of each shared neighbor according to a certain feature (such as the degree of the shared node). However, different networks may require different weightings for each shared neighbor contribution, instead of the fixed weighting most local techniques use. To estimate the contribution of each shared neighbor, we must take into account the uncertainty that is present due to the fact that an unobserved link may indicate either the non-existence or the actual existence of the link. We propose a novel local link prediction technique that takes the features of the current network into account and weighs the contribution of each node according to the expected contribution for nodes of its degree. Although our technique requires sampling the whole network to estimate the required parameters, this process can be done efficiently and changes

in the network can be handled locally. We propose a probabilistic framework where the degree of belief in the existence of a link increases as more evidence, in terms of shared neighbors, is accumulated.

This paper is organized as follows. Our proposal is described in full detail in Section 2. An empirical evaluation and the results we obtained are discussed in Section 3. Finally, the conclusions drawn from our work are presented in Section 4.

2 Method

Let us assume that we have access to a snapshot of a complex network, observing links between nodes. L_xy denotes the event corresponding to the existence of a link between nodes x and y, Γ_x denotes the set of neighbors of a node x, and Γ_{x∩y} denotes the set of shared neighbors of a pair of nodes x and y. According to Bayes' theorem, we can express the complementary probability of a link between nodes x and y (i.e. the probability of its non-existence, which we denote $\overline{L}_{xy}$) given the set of shared neighbors Γ_{x∩y} as

$$P(\overline{L}_{xy} \mid \Gamma_{x \cap y}) = \frac{P(\Gamma_{x \cap y} \mid \overline{L}_{xy})\, P(\overline{L}_{xy})}{P(\Gamma_{x \cap y})}.$$

Like most local link prediction techniques, we assume independence among shared neighbors; thus, we can rewrite the previous expression as

$$P(\overline{L}_{xy} \mid \Gamma_{x \cap y}) = \prod_{z \in \Gamma_{x \cap y}} \frac{P(z \mid \overline{L}_{xy})}{P(z)}\, P(\overline{L}_{xy}).$$

According to Bayes' theorem, the term $P(z \mid \overline{L}_{xy}) / P(z)$ is equal to the term $P(\overline{L}_{xy} \mid z) / P(\overline{L}_{xy})$ and, after applying this substitution, we obtain the expression

$$P(\overline{L}_{xy} \mid \Gamma_{x \cap y}) = \prod_{z \in \Gamma_{x \cap y}} \frac{P(\overline{L}_{xy} \mid z)}{P(\overline{L}_{xy})}\, P(\overline{L}_{xy}).$$

According to basic probability axioms, the probabilities of an event and of its complementary event always total 1; thus, we can express the previous equation in terms of the probability of the existence of links as

$$P(L_{xy} \mid \Gamma_{x \cap y}) = 1 - \prod_{z \in \Gamma_{x \cap y}} \frac{1 - P(L_{xy} \mid z)}{1 - P(L_{xy})}\, (1 - P(L_{xy})),$$

where P(L_xy | z) is the probability of the existence of a link between x and y given the shared neighbor z. This value can be estimated using different methods. In this work, we

propose estimating this value as the probability of the existence of a link given a shared neighbor with the same degree as z in the network, which is computed as

$$P(L_{xy} \mid z) = \frac{1}{|N_{|\Gamma_z|}|} \sum_{k \in N_{|\Gamma_z|}} \frac{\sum_{i \neq j,\; i,j \in \Gamma_k} P(L_{ij})}{|\Gamma_k|\,(|\Gamma_k| - 1)},$$

where N_{|Γ_z|} is the set of nodes with the same degree as z. It is important to note that this value has to be computed only once for each node degree value, rather than once for each shared neighbor.

For each pair of nodes x and y, we assume P(L_xy) = 1 if a link is currently observed. Since we are only interested in ranking links, the value of P(L_xy) when no link is observed is irrelevant for us, provided that we keep it smaller than 1. Applications requiring a true probability estimation may need to estimate the prior probabilities of unobserved links.

In order to analyze the computational complexity of the proposed approach, we divide the analysis into two phases. In the first phase, the average probability of a link given a shared neighbor of a particular degree is computed. This phase has a computational complexity of O(v·d²), where v refers to the number of nodes in the network and d refers to the degree of these nodes. Once these values are computed, the computational complexity of the second phase, regarding the estimation of the likelihood of a link, is just O(2d), since shared neighbors can be efficiently computed using hash sets. As we can see, our method is highly scalable, since the range of node degrees tends to be much smaller than the number of nodes in a complex network.
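A minimal sketch of the two phases just described, assuming a NetworkX graph; the prior probability of unobserved links is set to zero purely for simplicity, and all names are illustrative rather than taken from any released implementation:

```python
import networkx as nx
from collections import defaultdict

def degree_conditional_probs(G, prior=0.0):
    """Phase 1: average P(L | shared neighbor of degree d), for each degree d."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k in G:
        nbrs = list(G[k])
        d = len(nbrs)
        if d < 2:
            continue
        linked = sum(1 for i, y in enumerate(nbrs)
                       for z in nbrs[i + 1:] if G.has_edge(y, z))
        # Fraction of neighbor pairs of k that are linked; unobserved pairs take the prior.
        p = (2.0 * linked + prior * (d * (d - 1) - 2.0 * linked)) / (d * (d - 1))
        sums[d] += p
        counts[d] += 1
    return {d: sums[d] / counts[d] for d in sums}

def pllp_score(G, x, y, p_link_given_degree, prior=0.0):
    """Phase 2: aggregate the evidence of each shared neighbor into a link probability."""
    result = 1.0 - prior
    for z in set(G[x]) & set(G[y]):
        p_z = p_link_given_degree.get(G.degree(z), prior)
        result *= (1.0 - p_z) / (1.0 - prior)
    return 1.0 - result
```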

3 Evaluation and results

In order to measure the performance of our proposal, we performed a battery of tests by applying our technique to networks gathered from very different domains. We considered the following networks: a social network from a website called Advogato (ADV, [6]), a protein- protein interaction network in budding yeast (YST, [7]), a network of e-mail exchanges between university members (EML, [8]), the metabolic network of Caenorhabditis elegans (CEG, [9]), a human protein-protein interaction network (HPD, [10]), four citation networks (CGS, LDG, SMG, ZWL, [11]), and a power distribution network (UPG, [12]). Our experimentation consisted of a 5-fold cross-validation where links in each network were randomly divided into five sets of the same size n. Considering each set as the test set, we used the four remaining sets as the training network to predict the most probable n links. As performance score, we used precision, which is computed as

$$precision = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} = \frac{\text{true positives}}{n},$$

where a link instance is classified as positive if it is ranked within the top n links, and negative otherwise. Results from each run were averaged to obtain a single performance score for each method and network combination.


We compared our approach to some well-known similarity-based local link prediction techniques. Despite their simplicity, these techniques have been shown to obtain good results in practice and they are usually considered to be reasonable choices for link prediction. The techniques included in our comparison are the following ones (a brief code sketch of the simplest baselines is provided after the list):

• Common neighbors (CN): In this method, the likelihood of the existence of a link between two nodes is proportional to the number of shared neighbors between both nodes [1]. This method was proposed after observing a correlation between the number of shared neighbors of two nodes and the probability that they will collaborate in the future in scientific collaboration networks. It is computed as

$$P(L_{xy}) \propto |\Gamma_x \cap \Gamma_y|.$$

• The Adamic-Adar index (AA): This method measures the likelihood of the link between two entities based on the logarithmically-penalized degree of each shared neighbor. This index weighs nodes with low degree much more heavily than nodes with high degree, assuming that nodes with few neighbors are more likely to contribute to the formation of a link between a pair of nodes in their neighborhood [13]. It is computed as

$$P(L_{xy}) \propto \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{\log |\Gamma_z|}.$$

• The resource allocation index (RA): This method is similar to the Adamic-Adar index, yet it considers the degree of each shared neighbor without a logarithmic penalization [14]. RA models the resource allocation process that takes place in complex networks, where two unconnected nodes exchange units of resources by equally distributing them among their neighbor nodes. The amount of exchanged resources between a pair of nodes can be viewed as a measure of similarity. It is defined as

$$P(L_{xy}) \propto \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{|\Gamma_z|}.$$

• Local Naïve Bayes (LNB): This method estimates the role or degree of influence of each shared neighbor using probability theory [15], computing the probability of the existence of a link between a pair of nodes as

$$P(L_{xy}) \propto \sum_{z \in \Gamma_x \cap \Gamma_y} f(z) \log(o R_z),$$

where o is a constant for the network computed as

$$o = \frac{p_{unconnected}}{p_{connected}} = \frac{\frac{1}{2}|V|(|V|-1)}{|E|} - 1$$

and R_z is the role or influence of the node, computed as

$$R_z = \frac{2\,|\{e_{x,y} : x, y \in \Gamma_z,\ e_{x,y} \in E\}| + 1}{2\,|\{e_{x,y} : x, y \in \Gamma_z,\ e_{x,y} \notin E\}| + 1}.$$

The function f(z) measures the influence of the shared neighbor. The authors suggest f(z) = 1, following common neighbors; f(z) = 1/log|Γ_z|, following the Adamic-Adar index; or f(z) = 1/|Γ_z|, following the resource allocation method.
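As referenced before the list, a compact sketch of the three simplest baselines, assuming a NetworkX graph; NetworkX also ships equivalent generators (nx.adamic_adar_index, nx.resource_allocation_index):

```python
import math
import networkx as nx

def common_neighbors_score(G, x, y):
    """CN: number of shared neighbors."""
    return len(set(G[x]) & set(G[y]))

def adamic_adar_score(G, x, y):
    """AA: shared neighbors penalized by the logarithm of their degree."""
    # A shared neighbor always has degree >= 2, so log(degree) is never zero.
    return sum(1.0 / math.log(G.degree(z)) for z in set(G[x]) & set(G[y]))

def resource_allocation_score(G, x, y):
    """RA: shared neighbors penalized linearly by their degree."""
    return sum(1.0 / G.degree(z) for z in set(G[x]) & set(G[y]))
```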

The results we obtained using each technique for each dataset are shown in Table 1. Our technique is listed as PLLP, an acronym for probabilistic local link prediction. It can be observed that our technique achieves a higher average precision than previous techniques across the complex networks considered in our experimentation.

          ADV      YST      EML      CEG      CGS      HPD      LDG      SMG      UPG      ZWL      Average
CN        0.1489   0.0876   0.1304   0.1119   0.1810   0.0627   0.1113   0.1357   0.0296   0.1516   0.11507
AA        0.1783   0.1080   0.1923   0.1532   0.3985   0.0828   0.1662   0.1581   0.0165   0.2110   0.16649
RA        0.1821   0.0876   0.1800   0.1448   0.4178   0.0669   0.1633   0.1407   0.0132   0.2075   0.16039
LNB-CN    0.1717   0.1189   0.1912   0.1569   0.2507   0.0850   0.1529   0.1583   0.0425   0.2005   0.15286
LNB-AA    0.1906   0.1190   0.1972   0.1555   0.4068   0.0867   0.1723   0.1597   0.0176   0.2158   0.17212
LNB-RA    0.1874   0.0954   0.1778   0.1462   0.4186   0.0658   0.1615   0.1471   0.0150   0.2067   0.16215
PLLP      0.1875   0.1141   0.1982   0.1536   0.4273   0.0874   0.1729   0.1593   0.0347   0.2167   0.17517

Table 1: Precision obtained by each method (rows) for each dataset (columns). The best results are highlighted in bold.

4 Conclusions

In this paper, we have proposed a novel technique for link prediction. Our method aggregates local evidence to estimate the probability of an uncertain event such as the existence of a link. It works by increasing the degree of belief for each potential link as the evidence provided by each shared neighbor is aggregated. Our proposal achieves better average precision results than some well-established local link prediction techniques for several networks from very different domains. Local Naïve Bayes outperforms our technique for certain networks under specific configurations. However, our method performs better on average without the need to tune parameters. Instead of including ad hoc parameters, our method achieves better results by reasoning from first principles.

In future work, we will explore and evaluate alternative approaches to the estimation of the probability of a link given a shared neighbor. Models to estimate the prior probability of links will also be studied, since they can be useful for those applications requiring true probability estimations.


Acknowledgments.

This work is partially supported by the Spanish Ministry of Economy and the European Regional Development Fund (FEDER), under grant TIN2012-36951 and the program "Ayudas para contratos predoctorales para la formación de doctores 2013" (grant BES-2013-064699).

References

[1] Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7), 1019–1031 (2007).

[2] Martínez, V., Cano, C., Blanco, A.: ProphNet: A generic prioritization method through propagation of information. BMC Bioinformatics, 15(1), S5 (2014).

[3] Pavlov, M., Ichise, R.: Finding experts by link prediction in coauthorship networks. FEWS 290, 42–55 (2007).

[4] Huang, Z., Li, X., Chen, H.: Link prediction approach to collaborative filtering. Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital libraries, ACM, 141–142 (2005).

[5] Martínez, V., Berzal, F., Cubero, J. C.: A Survey of Link Prediction in Complex Networks. ACM Computing Surveys, 49(4), 69 (2016).

[6] Massa, P., Salvetti, M., Tomasoni, D.: Bowling alone and trust decline in social network sites. In Dependable, Autonomic and Secure Computing, 2009. DASC’09. Eighth IEEE International Conference on (pp. 658-663). IEEE (2009).

[7] Bu, D., Zhao, Y., Cai, L., Xue, H., Zhu, X., Lu, H., et al.: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research, 31(9), 2443–2450 (2003).

[8] Guimerà, R., Danon, L., Díaz-Guilera, A., Giralt, F., Arenas, A.: Self-similar community structure in a network of human interactions. Physical Review E, 68(6) (2003).

[9] Duch, J., Arenas, A.: Community detection in complex networks using extremal optimization. Physical Review E, 72(2) (2005).

[10] Peri, S., Navarro, J. D., Amanchy, R., Kristiansen, T. Z., Jonnalagadda, C. K., Surendranath, V., et al.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research, 13(10), 2363– 2371 (2003).

[11] Pajek datasets, http://vlado.fmf.uni-lj.si/pub/networks/data.


[12] Watts, D. J., Strogatz, S. H.: Collective dynamics of small-world networks. Nature, 393(6684), 440–442 (1998).

[13] Adamic, L. A., Adar, E.: Friends and neighbors on the web. Social Networks, 25(3), 211–230 (2003).

[14] Lü, L., Jin, C. H., Zhou, T.: Similarity index based on local paths for link prediction of complex networks. Physical Review E, 80(4) (2009).

[15] Liu, Z., Zhang, Q. M., Lü, L., Zhou, T.: Link prediction in complex networks: A local naïve Bayes model. EPL (Europhysics Letters), 96(4) (2011).


2 Applications

This Section collects the publications related to the different applications of link prediction studied in this dissertation, including a journal paper proposing a technique for generic prioritization using heterogeneous data, the proposal of a novel approach for the disambiguation of semantic relations, and a technical report describing an automorphic distance metric for node role discovery.

2.1 Prioritization using heterogeneous data

The journal paper associated to this part of the dissertation is:

V. Martínez, C. Cano, A. Blanco. ProphNet: A generic prioritization method through propagation of information. BMC Bioinformatics 15(S-1):S5, 2014. DOI 10.1186/1471-2105-15-S1-S5.

• Status: Published
• ISSN: 1471-2105
• Impact Factor (JCR 2014): 2.576
• Subject Category:
  – Mathematical & Computational Biology. Ranking 10/57.
  – Biochemical Research Methods. Ranking 38/78.
  – Biotechnology & Applied Microbiology. Ranking 68/160.

ProphNet: A Generic Prioritization Method Through Propagation of Information

Víctor Martínez, Carlos Cano, Armando Blanco

Abstract

Background. Prioritization methods have become a useful tool for mining large amounts of data to suggest promising hypotheses in early research stages. In particular, network-based prioritization tools use a network representation of the interactions between different biological entities to identify novel indirect relationships. However, current network-based prioritization tools are strongly tailored to specific domains of interest (e.g. gene-disease prioritization) and they do not allow networks with more than two types of entities (e.g. genes and diseases) to be considered. Therefore, the direct application of these methods to accomplish new prioritization tasks is limited. Results. This work presents ProphNet, a generic network-based prioritization tool that allows an arbitrary number of interrelated biological entities to be integrated to accomplish any prioritization task. We tested the performance of ProphNet in comparison with leading network-based prioritization methods, namely rcNet and DomainRBF, for gene-disease and domain-disease prioritization, respectively. The results obtained by ProphNet show a significant improvement in terms of sensitivity and specificity for both tasks. We also applied ProphNet to disease-gene prioritization on Alzheimer, Diabetes Mellitus Type 2 and Breast Cancer to validate the results and identify putative candidate genes involved in these diseases. Conclusions. ProphNet works on top of any heterogeneous network by integrating information about different types of biological entities to rank entities of a specific type according to their degree of relationship with a query set of entities of another type. Our method works by propagating information across data networks and measuring the correlation between the propagated values for a query and a target set of entities. ProphNet is available at: http://genome2.ugr.es/prophnet. A Matlab implementation of the algorithm is also available at the website.

1 Background

The advancements in high-throughput technologies such as DNA sequencing, linkage analysis, association studies and expression arrays have fostered the research towards an effective personalized medicine. To this end, the integration of pieces of evidence of different nature derived from diverse data sources is required, together with algorithms able to mine these data and identify novel biological facts of relevance. Networks have been shown to be an useful representation for combining heterogeneous biological data. Currently, there is a huge availability of large molecular networks such as protein-protein interaction (PPI) networks, which model interactions between proteins.


Many methods have been proposed in the literature to represent and mine knowledge from biological networks [1]. For example, [2] proposes to apply text-mining in OMIM to generate a similarity network for human diseases and [3] builds a gene network based on the results of microarray experiments. These approaches have led to the emergence of new methods that exploit and integrate different data sources into networks in a variety of ways [4]. Inferring new knowledge from existing networks is usually based on "guilt-by-association" [5]. This extensively validated principle states that biological entities which are associated or interacting in a network are more likely to share a common function. This principle allows new relationships to be inferred from already known interactions.

It is in this context, with massive amounts of highly interconnected data, that prioritization methods are required. Prioritization tools are based on computational approaches that use information retrieved from diverse sources in order to obtain ranked lists of candidate biological elements (genes, proteins, diseases, etc.) related to a certain target element. Gene-disease prioritization, in which genes are ranked according to their relevance to a disease of interest (or vice versa), is the most popular prioritization task, and many methods have been proposed in the recent literature to accomplish this task [6]. Most of these methods focus on the analysis of phenotype and PPI networks for gene-disease prioritization. These methods weight the arcs connecting two proteins or phenotypes according to a measure of the similarity between them. CIPHER [7] computes correlation coefficients based on linear regressions of phenotype and PPI profiles. PRINCE [8] computes the relevance of a gene by using network propagation methods. RWRH [9] scores genes and diseases using a random walk approach on PPI and phenotype networks. rcNet [10] proposes a methodology for prioritization of candidate genes based on propagating node values and measuring the degree of correlation between two sets of nodes, one in the PPI/gene network and one in the phenotype network. Network-based gene-disease prioritization methods have been proven to provide better results than previous approaches [11–14].

Apart from gene-disease prioritization, other methods have been proposed to perform a prioritization of other biological entities. DomainRBF [15] performs a prioritization of protein domains for diseases using Bayesian linear regression. This method assumes a key role for protein domains in diseases, as shown by previous studies [16]. Domains are basic structural and functional units of proteins, which in turn are composed of multiple structural domains, each one closely linked to a specific function. Although DomainRBF exploits the functional role of protein domains in phenotypes, it does not explore the simultaneous integration of PPI, domain and phenotype networks for gene or disease prioritization.

Despite the good performance obtained by the mentioned prioritization methods, they have clear limitations. First, existing network-based prioritization methods do not allow more than two types of networks to be considered when performing the prioritization (e.g. gene and disease networks in rcNet and domain and disease networks in domainRBF). Only non-network-based methods have succeeded in integrating more than two different types of entities for prioritization.
For example, Endeavour [13] performs an independent prioritization of different entities using multiple heterogeneous generic data sources, which are integrated into a single global ranking using order statistics. However, the previously mentioned network-based methods have been shown to outperform this method using a

lower number of data sources [7]. Second, existing prioritization methods are strongly tailored to a specific domain of interest (for example, gene-disease prioritization for rcNet and protein domain-disease prioritization for domainRBF, respectively). In our opinion, these two drawbacks dramatically limit the applicability of these methods to other prioritization tasks and do not make it possible to improve the results by integrating information about other types of related entities.

In this work we present ProphNet, a generic prioritization method that outperforms previous methods by integrating and propagating information in an arbitrary number of heterogeneous data networks. Our method is generic since it allows biological entities of any type to be prioritized with respect to biological entities of another type. Therefore, the user can customize the goal of the prioritization task (disease-gene, domain-disease, drug-disease, etc.). Furthermore, the user is not restricted to the use of only two entities, and can integrate as many biological networks as desired. To compare the results obtained by ProphNet with those obtained by state-of-the-art methods, such as rcNet and domainRBF, we applied ProphNet to the prioritization of genes-diseases and domains-diseases, respectively, on a network built as the integration of protein domain, PPI and phenotype networks.

Figure 1: Our problem is to determine how related the query set and the target set are based on known relations between elements.

ProphNet measures the influence of a query set of biological entities of a certain type (e.g. genes or diseases) on a target set of entities of another type (e.g. diseases or genes, respectively). To this end, the algorithm uses a graph representation, as shown in Figure 1. In this representation, each node corresponds to a biological entity of a domain of interest (gene/protein, disease, protein domain, etc.), and the arcs between two nodes are labelled with a weight representing the strength of the relationship between the connected entities. These weights are derived from different databases and other biological sources, and their interpretation varies depending on the type of the connected entities and the final goal of the study (e.g. physical/structural similarity, regulatory dependence, similar functional roles, etc.). In our algorithm, the arc weights for each network are compiled in an adjacency matrix. The nodes of the graph are also labelled with a value (in [0, 1]), representing the

degree of association to the query or the target set. There are two kinds of networks: a) networks representing interactions or similarities between entities of the same type, and b) networks representing interactions or similarities between entities of different types. Type b) networks are used to interconnect type a) networks. The method we propose allows node values to be propagated through paths along different data networks (representing different biological entities) in order to derive new information from the existing knowledge.

This value propagation is performed in two directions. First, values are propagated within and between networks, through all the possible paths connecting the query set network to the target set network (not reaching the target set network). Second, values are also propagated within the target set network, starting from the target nodes. Both propagation processes follow the principle that the higher the weight of the arc between two entities is, the more similar the values of these two nodes should be. Therefore, these two label propagation processes derive a final graph in which the value assigned to a node represents its degree of relation with the query or target set, respectively. Finally, we measure the degree of relationship between the query and target sets by computing the correlation between the values assigned to the nodes in the target network and those assigned to their neighbour nodes in other networks, as proposed in previous works with good results for different prioritization tasks [7, 10]. This process is explained in detail in the following section.

This article is organized as follows. The method and the data sources are described in detail in the Methods section. To validate the proposed methodology, we integrate protein domain, PPI and phenotype networks and compare the results to those obtained by rcNet for gene prioritization and DomainRBF for domain prioritization. These results are presented in the Results section and show a significant improvement in terms of sensitivity and specificity. ProphNet is also applied to several case studies (namely Alzheimer, Diabetes Mellitus Type 2 and Breast Cancer) to identify putative candidate genes involved in these diseases. The results of these tests can be found in the Case Studies section. Finally, some conclusions and future work are presented.

2 Methods

Let D be a set of graphs (also referred to as networks) defined as D_i = (V_i, E_i) for i = 1, ..., n, where V_i is a set of vertices which represent biological entities from a specific domain, satisfying V_i ∩ V_j = ∅ for all i, j such that i ≠ j. Each node v_ik (with k = 1, ..., |V_i|) in D_i is labelled with a value Ψ(v_ik), initially set to zero, that indicates the degree of relationship to the query or target set, depending on the network v_ik belongs to. E_i is a set of weighted undirected arcs representing relationships, similarities or interactions between elements of V_i. The values of the nodes change while the weights of the arcs remain constant during the entire process. Let R be a set of graphs defined as R_ij = (V_i ∪ V_j, C_ij), where C_ij is a set of weighted undirected arcs representing relationships, similarities or interactions between elements of V_i and V_j, with i, j ∈ {1, ..., n} and i ≠ j. Therefore, R_ij describes the relationships between the biological entities from two different networks: D_i and D_j. We define the heterogeneous global graph G as G = (D, R). Let the graph D_q ∈ D be

125 ProphNet: A generic prioritization method through propagation of information the query network and let D D be the target network. Given the global graph G, our t ∈ goal is to find the degree of association between a set of nodes Q V called the query set ⊆ q and a set of nodes T V called the target set. ⊆ t The initial values for the nodes in the query set are set to 1 (Ψ(vqi) = 1 for all nodes v Q), while the rest of the nodes are set to 0 (Ψ(v ) = 0 for nodes v V Q). The qi ∈ qj qj ∈ q − target network is initialized in the same way, but considering the nodes in Vt and T . The rest of nodes in G are initially set to 0. As we explain below in more detail, our method performs a propagation within networks pumping information between nodes. This process is based on the Flow Propagation algorithm [17, 18], which uses the normalized Laplacian matrix to propagate labels between nodes in a network. The normalization takes into account the degree of each node to limit the bias toward annotations from high-degree nodes. This normalization is also critical for convergence. The Flow Propagation algorithm is similar to a Random Walk with Restart, basically differing in the normalization process that guides the propagation [18]. Let N be the non-normalized adjacency matrix of a network in G. Since G = (D,R) and graphs in R are bipartite (i.e. the adjacency matrices of graphs in R are not squared), let assume N has r rows and c columns. A normalization for N can be computed as:

$$norm(N) = D_G^1 N D_G^2,$$

where $D_G^1$ and $D_G^2$ are diagonal matrices whose components are defined as:

$$D^1_{G_{jj}} = \frac{1}{\sqrt{\sum_{k=1}^{c} N_{jk}}} \quad \text{for } j = 1, ..., r, \qquad \text{and} \qquad D^2_{G_{kk}} = \frac{1}{\sqrt{\sum_{j=1}^{r} N_{jk}}} \quad \text{for } k = 1, ..., c.$$

We define $M = \{M_i \mid M_i = norm(D_i), \ i = 1, ..., |D|\}$ as the set of normalized square adjacency matrices of graphs in D. Similarly, we define $S = \{S_i \mid S_i = norm(R_i), \ i = 1, ..., |R|\}$ as the set of normalized adjacency matrices of bipartite graphs in R.

Let $p_i = \{p_{i1}, ..., p_{ij}, ..., p_{il}\}$ be a path composed of networks from D connecting $D_q$ and $D_t$, satisfying $p_{ij} \in D$, $p_{i1} = D_q$, $p_{il} = D_t$ and $p_{ij} \neq p_{ik}$, $\forall j \neq k$. To compute the degree of association between the query and target sets, we first propagate values from the query set within the query network, and from the target set within the target network, as described in Section Value propagation inside networks. Next, we identify all the possible paths $P = \{p_1, ..., p_{|P|}\}$ connecting the query network to the target network. Figure 2 shows an example of a global graph G composed of five different networks or domains, with three different paths connecting the query network to the target network. Since the number of networks is usually small, the computation of all the paths connecting $D_q$ and $D_t$ can be accomplished by a brute force algorithm.
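As an illustration, a minimal Python/NumPy sketch of this normalization is shown below. The original ProphNet code was released in Matlab, so this function (including the zero-degree guard) is only an assumption about how norm(N) could be computed, not the authors' implementation.

```python
import numpy as np

def normalize_adjacency(N):
    """Degree-based normalization norm(N) = D_G^1 N D_G^2 for a possibly
    rectangular adjacency matrix (rows and columns may index different
    networks). D_G^1 and D_G^2 hold the inverse square roots of the row
    and column sums, respectively."""
    N = np.asarray(N, dtype=float)
    row_sums = N.sum(axis=1)   # one value per row (r entries)
    col_sums = N.sum(axis=0)   # one value per column (c entries)
    # Guard against isolated nodes (assumption: map their factor to zero).
    d1 = np.where(row_sums > 0, 1.0 / np.sqrt(row_sums), 0.0)
    d2 = np.where(col_sums > 0, 1.0 / np.sqrt(col_sums), 0.0)
    return (N * d1[:, None]) * d2[None, :]
```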


Figure 2: Example of computed paths. Three paths have been obtained connecting the query network to the target network.

For each computed path $p_i$, we propagate information from $p_{ij}$ to the following network $p_{i(j+1)}$ in the path, as described in Section Value propagation between networks. Then we propagate information within the network $p_{i(j+1)}$, where $j = 1, 2, ..., l-2$. The propagation continues until it has been performed within the network $p_{i(l-1)}$ in the path.

Finally, after performing this propagation through each path in P, we correlate the values of the nodes in $D_t$ against the values of the nodes in $p_{i(l-1)}$ directly connected to those in $D_t$, for all the paths $p_i \in P$. This step is described in Section Value correlation between networks. The obtained correlation value determines the degree of relationship between the query set and the target set. Although measuring the degree of relationship between the query and target sets by correlating node values seems less intuitive than continuing the propagation of node values from the neighbouring networks to the target nodes, the former approach has been shown to perform better than the latter (see Supplementary material). Therefore, it was selected as the measure of similarity in our method. This approach was proposed in previous network-based prioritization methods with good results for different prioritization tasks [7, 10].

In order to accomplish prioritization tasks, in which only the query set $Q \subseteq V_q$ and the target network $V_t$ are provided by the user, we embed this pipeline into an iterative process to score each node in the target network according to its relationship with Q. This process is described in Section Prioritization process.

2.1 Value propagation inside networks

Several propagation methods have been proposed to compute the similarity or distance between nodes within a graph [4]. Methods based on local neighbourhoods or shortest paths fail to capture global interactions, in contrast to global methods that take into account the entire network topology [19]. ProphNet implements a flow propagation approach [17, 18] that uses a network's global information to perform a propagation of the values assigned to the nodes within this


network. To carry out this propagation process within a network $D_k$, we first define the prior information set Z as those vertices $v_{kj}$ with $\Psi(v_{kj}) \neq 0$. Therefore, the prior information set matches Q when propagating values within the query network, and the prior information set matches T when propagating values within the target network. The value $\Psi(v_{kj})$ of each node $v_{kj}$ in Z ($j \in [1, |V_k|]$) is normalized as:

$$\Psi(v_{kj}) = \frac{\Psi(v_{kj})}{\sum_{v_{kx} \in V_k} \Psi(v_{kx})}$$

Let $x_0$ be a vector compiling the values initially assigned to each node in $D_k$, and $\hat{x}$ a vector with the values assigned to each node in $D_k$ after performing the propagation within $D_k$. To calculate $\hat{x}$ we need to solve the following optimization problem:

$$\min_{\hat{x}} \ \sum_{i,j} M_{k_{i,j}} (\hat{x}_i - \hat{x}_j)^2 + \frac{1-\alpha}{\alpha} \sum_i (\hat{x}_i - x_{0_i})^2$$

where $M_k$ is the network's normalized adjacency matrix. The closed-form solution of this problem is:

$$\hat{x} = (1 - \alpha)(I - \alpha M_k)^{-1} x_0.$$

This linear system can be solved exactly. However, there exists an iterative algorithm for solving this system which is faster for large networks [20]:

$$\hat{x}_{i+1} = \alpha M_k \hat{x}_i + (1 - \alpha) x_0$$

with i starting from 0. This algorithm implements an iterative process where each node propagates its value to its neighbours, based on the weight of the arc connecting them. The parameter $\alpha \in [0, 1]$ determines the importance of the prior information set. In order to further speed up this iterative process, we define the following stopping criterion: $|\hat{x}_i - \hat{x}_{i+1}| \leq \kappa$. This allows us to stop the iterative process when it has almost converged, without requiring full convergence. Experimental tests (results not shown) show that the best performance is obtained for $\kappa \leq 10^{-3}$.

For convenience, we refer to $\hat{x}_{kj}$ as the vector obtained after convergence, where each component represents the value assigned to the corresponding node in the network $D_k$ after performing the propagation within $D_k$, as part of a propagation process through the path $p_j$. Since the propagation values within the query and target networks are not affected by the propagation processes through the paths in P, we define $\hat{x}_q$ as the vector obtained after propagating node values within the query network, and $\hat{x}_t$ as the vector obtained after propagating node values within the target network.
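A minimal Python/NumPy sketch of this iterative propagation is shown below, assuming the normalized adjacency matrix and the normalized prior vector are given. The use of the L1 norm in the stopping criterion and the max_iter safeguard are assumptions added for the sketch, not part of the original formulation.

```python
import numpy as np

def propagate_within_network(M_k, x0, alpha=0.9, kappa=1e-5, max_iter=1000):
    """Iterative flow propagation within one network:
    x_{i+1} = alpha * M_k @ x_i + (1 - alpha) * x0,
    stopped when |x_i - x_{i+1}| <= kappa (L1 norm assumed here)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):   # safeguard against slow convergence
        x_next = alpha * (M_k @ x) + (1.0 - alpha) * x0
        if np.linalg.norm(x_next - x, ord=1) <= kappa:
            return x_next
        x = x_next
    return x
```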

2.2 Value propagation between networks

Given a network $D_i$ whose vertices are already assigned a value according to $\hat{x}_{il}$, we further propagate these values to the next network $D_j$ in the current path $p_l$, with $D_j \neq D_t$. Since $D_i$ and $D_j$ are connected by $R_{ij}$, the information is propagated from the nodes of $D_i$ to the nodes of $D_j$ across the edges of $R_{ij}$ by assigning each node $v_{jk}$ from $D_j$ a value computed as the mean of the values of the nodes from $D_i$ that $v_{jk}$ is connected with. This expression is formalized as:

$$\Psi(v_{jk}) = \frac{\sum_{v_{ix} \in neig_i(v_{jk})} \Psi(v_{ix})}{|neig_i(v_{jk})|}$$

where $neig_i(v_{jk})$ is the set of nodes from $D_i$ that are directly connected to $v_{jk}$ according to $R_{ij}$. A thresholding step is applied to this propagation process, since detailed experimentation suggested that nodes with very low values add noise to the process and reduce performance (see supplementary material). To this end, a parameter $\gamma \in (0, 1]$ is included in the process, so that the $\lceil |V_j| (1 - \gamma) \rceil$ lowest node values from $D_j$ after the propagation are set to $\Psi(v_{jk}) = 0$. The rest of the node values are not changed.
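A possible sketch of this step is shown below, assuming $R_{ij}$ is available as a dense matrix whose rows index the nodes of $D_i$ and whose columns index the nodes of $D_j$. The function name and the dense representation are illustrative choices, not part of the original implementation.

```python
import numpy as np

def propagate_between_networks(R_ij, x_i, gamma=0.00375):
    """Propagate node values from network D_i to network D_j through the
    bipartite relation R_ij. Each node of D_j receives the mean value of
    its neighbours in D_i; the ceil(|V_j| * (1 - gamma)) lowest resulting
    values are then set to zero (thresholding step)."""
    R = np.asarray(R_ij, dtype=float)
    adjacency = (R > 0).astype(float)          # connectivity pattern of R_ij
    degrees = adjacency.sum(axis=0)            # neighbours of each D_j node
    sums = adjacency.T @ x_i                   # sum of neighbour values
    x_j = np.divide(sums, degrees, out=np.zeros_like(sums), where=degrees > 0)
    n_zero = int(np.ceil(len(x_j) * (1.0 - gamma)))
    if n_zero > 0:
        lowest = np.argsort(x_j)[:n_zero]      # indices of the lowest values
        x_j[lowest] = 0.0
    return x_j
```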

2.3 Value correlation between networks

After the propagation process through one path finishes, the nodes in the networks which are adjacent to the target network present values that determine their degree of relationship to the query set. Also, the nodes in the target network are assigned a value that determines their degree of relationship to the target set. We can indirectly measure the relationship between the query set and the target set by measuring the similarity between the values of the nodes in the target network and those of the nodes directly connected to them in adjacent networks. This can be calculated by simultaneously correlating these node values as derived by the propagation processes through all the different paths. For each path $p_i$ of length $l$, a vector $x_i$ is computed as:

$$x_i = S_a \hat{x}_{(l-1)i}$$

where $S_a$ is the normalized adjacency matrix of $R_{(l-1)(l)}$ and $\hat{x}_{(l-1)i}$ is the vector obtained after propagating values inside the network $D_{l-1}$.

Since the values of the target network after the propagation process are represented by $\hat{x}_t$, we define the vector $t$ as:

$$t = concat(\underbrace{\hat{x}_t, ..., \hat{x}_t}_{|P| \text{ times}})$$

and the vector $x$ as:

$$x = concat(x_i) \quad \forall i \in [1, |P|]$$

where concat denotes concatenation. Both $x$ and $t$ have the same size. Finally, the correlation value which provides a measure of the relationship between the query and target sets is computed as $s = corr(x, t)$, where corr is Pearson's correlation.
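The correlation step can be sketched as follows, assuming that each entry of `projected_paths` already contains the product of the corresponding normalized bipartite matrix $S_a$ with the propagated vector of the last intermediate network; the helper name is hypothetical.

```python
import numpy as np

def correlation_score(x_t_hat, projected_paths):
    """Correlate target-network values with the values of their neighbours
    in adjacent networks, aggregated over all paths.
    x_t_hat: propagated values of the target network nodes.
    projected_paths: one vector per path, each aligned with the target nodes."""
    x = np.concatenate(projected_paths)                   # concat(x_i) over paths
    t = np.concatenate([x_t_hat] * len(projected_paths))  # x_t repeated |P| times
    return np.corrcoef(x, t)[0, 1]                        # Pearson's correlation
```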


2.4 Prioritization process

In order to obtain a prioritized list of targets for a query set of nodes, we follow an iterative approach. For each target network node $v_{te}$, we set it as the target set T and compute the correlation value $s_e$ as described in the previous section (we call this correlation $s_e$ since it is computed for $T = \{v_{te}\}$). Once this correlation value has been computed for each target network node, the nodes are sorted in decreasing order according to their $s_e$ value. Target nodes with high values of $s_e$ are expected to be strongly related to the query set. The entire algorithm is described in the pseudocode of Algorithm 1.

Algorithm 1 ProphNet prioritize(G: global graph, Q: query set, Dq: query network, Dt: target network)

  Propagate values within Dq
  P: Compute the list of paths from Dq to Dt in G
  for each path pi = {pi1, ..., pij, ..., pil} in P do
      for each network pij in the path pi, from pi1 to pi(l-1), do
          Propagate values from pij to pi(j+1)
          Propagate values within pi(j+1)
      end for
      Store the values of Di(l-1) after propagation through path pi as x̂i(l-1)
  end for
  for each entity e ∈ Vt in the target network Dt do
      Set target set T = {e}
      Propagate values within Dt
      Compute correlation coefficient se using the stored x̂i(l-1) for each path pi
  end for
  L: Sort all entities e ∈ Vt by their se values in descending order
  return L
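Tying the previous sketches together, the following hypothetical Python function mirrors the structure of Algorithm 1 under several simplifying assumptions: dictionaries M and S keyed by network identifiers and by (source, target) pairs in path order, an already-normalized query prior, and the helper functions sketched in the previous sections. It is a sketch of the pipeline, not the authors' implementation.

```python
import numpy as np

def prophnet_scores(M, S, paths, query_prior, target_network,
                    alpha=0.9, gamma=0.00375, kappa=1e-5):
    """Sketch of the prioritization loop in Algorithm 1.
    M: dict network_id -> normalized square adjacency matrix.
    S: dict (source_id, target_id) -> normalized bipartite matrix
       (rows = source network nodes, columns = target network nodes).
    paths: list of paths, each a list of network ids from Dq to Dt.
    query_prior: normalized prior vector for the query network."""
    projected = []
    for path in paths:
        # Propagate the query prior within Dq and along the intermediate networks.
        x = propagate_within_network(M[path[0]], query_prior, alpha, kappa)
        for i in range(len(path) - 2):
            x = propagate_between_networks(S[(path[i], path[i + 1])], x, gamma)
            x = propagate_within_network(M[path[i + 1]], x, alpha, kappa)
        # Project the last intermediate network onto the target network nodes.
        projected.append(S[(path[-2], path[-1])].T @ x)
    # Score each candidate target node by setting it as the target set T = {e}.
    n_t = M[target_network].shape[0]
    scores = np.zeros(n_t)
    for e in range(n_t):
        prior_t = np.zeros(n_t)
        prior_t[e] = 1.0
        x_t = propagate_within_network(M[target_network], prior_t, alpha, kappa)
        scores[e] = correlation_score(x_t, projected)
    return scores  # sort in descending order to obtain the prioritized list
```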

2.5 Prioritization example

To facilitate the understanding of the algorithm, Figure 3 shows a step-by-step representation of the ProphNet propagation processes. This figure shows two examples of a prioritization task involving three networks or domains, with the elements of each network represented by circles, squares and diamonds, respectively. For simplicity and clarity, node values are represented using a grey color scale (from white, representing value 0, to black, representing value 1) and the weight of an arc is represented by its line width. In the two examples, the prioritization task involves the same target set but different query sets. The query and target sets contain only one element in both cases, which is initially (step 1) set to 1 (black). Node values are propagated from the query nodes within the query network (step 2), and from the target nodes within the target network (step 3). There are two paths connecting the query network and the target network in these examples (circles-squares and circles-diamond-squares, respectively). Values from the

query network are then iteratively propagated to adjacent non-target networks. Since the query network is directly connected to the target network in one of the paths, this step (step 4) is only applied to the path which includes an intermediate network (diamonds). Then, the propagation is performed within this intermediate network (step 5). This propagation continues until all the networks in all the paths connecting the query and target sets have been reached. Finally, we measure the strength of the connection between the query and the target sets as the correlation between the values assigned to the nodes in the target network and the values assigned to their neighbours from other networks (step 6).

[Figure 3: panels A.1–A.5 (top row) and B.1–B.5 (bottom row)]

Figure 3: Step-by-step runs of ProphNet in two global graphs for the same target set but different query sets. Figure (a) shows an example in which propagated values from the query and target sets show a high correlation and therefore they seem to be related. In figure (b) propagated values from the query and target sets show low correlation, thus suggesting a weak relationship.

Figure 3a shows an example in which the values propagated from the query and target sets are highly correlated, suggesting a strong relationship between them. On the other hand, Figure 3b shows an example with low correlation values, which suggests that query and target sets are not related.

2.6 Algorithm complexity

The time complexity of the algorithm shown in the pseudocode of Algorithm 1 can be determined by aggregating the time complexity of each task it is composed of. Let n be the number of nodes in a network and m the number of networks in the global graph G. The task of propagating values within a network is $O(n^3)$. The propagation of values between networks is $O(n^2)$. The computation of the correlation coefficient for one path is $O(n^3)$. The number of paths is bounded by $m!$ and their length by $m$. Therefore, the computational complexity of ProphNet is bounded by $O(m! \times m \times n^3)$. Despite this high complexity, typical execution times are a few seconds, since the value of m is usually small in real applications. A summary of ProphNet execution times and memory usage for the experiments shown in this paper can be found in the Supplementary Material.

3 Results

As two specific case studies, we have applied ProphNet to prioritize candidate genes and protein domains associated with diseases. ProphNet has been compared with rcNet for gene-disease prioritization and with DomainRBF for domain-disease prioritization, since these methods were recently proposed and reported better results than previous methods [10, 15]. ProphNet was run on a global graph composed of interconnected disease, gene and protein domain networks, while rcNet and DomainRBF were run on gene-disease and domain-disease networks, respectively, as proposed by their authors. It is important to note that the ProphNet base case execution, using only the gene and disease networks, would obtain the same results as rcNet. The data sources used are described in detail in Section Data sources. We tested the performance of the different methods on several leave-one-out (LOO) cross-validation experiments and for predicting new associations recently added to OMIM. To measure the performance of the different prioritization methods, we used Receiver Operating Characteristic (ROC) curves. ROC curves plot the true positive rate vs. the false positive rate at various threshold settings. The area under the ROC curve (AUC) was also computed. Finally, the average ranking position of the true target in the prioritized lists obtained by each method was also computed and normalized by dividing by the total number of elements in the list (5080 diseases in this case). We also computed p-values for comparing the average ranking values using two-tailed t-tests and the Bonferroni correction. For the results reported for ProphNet, α was set to 0.9, the error threshold in the flow propagation was set to κ = $10^{-5}$ and γ = 0.00375. For rcNet, we set the parameters to the values providing the best results according to the authors: α = 0.9, β = 0.9 and κ = $10^{-5}$ [10], and used the enumeration-correlation-based version.
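As a rough illustration of this evaluation protocol, the following sketch computes the AUC and the normalized rank of the known targets for a single prioritization run. The real experiments aggregate ranks over many leave-one-out cases, so this is a simplification; scikit-learn's roc_auc_score is used for convenience, and the helper name is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_ranking(scores, true_targets):
    """Evaluation sketch: AUC and normalized mean rank of the true targets.
    scores: prioritization scores, one per candidate (higher = better).
    true_targets: boolean NumPy array marking the known associations."""
    auc = roc_auc_score(true_targets, scores)
    # Rank of each candidate in the list sorted by decreasing score (1 = top).
    order = np.argsort(-scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    norm_mean_rank = ranks[true_targets].mean() / len(scores)
    return auc, norm_mean_rank
```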

3.1 Data sources

The disease phenotype network has been extracted from OMIM [21] using text mining techniques as described in [2]. Also, to perform a fair comparison of the results to those reported by rcNet, we used a version of OMIM from May, 2007 [10]. The obtained disease network contains 5080 OMIM disease phenotypes. The arcs are weighted with a value in the range [0, 1]. This weight measures the similarity between two phenotypes as the inverse of the distance between the feature vectors obtained by counting the occurrences of each term from the anatomy and disease sections of the Medical Subject Headings Vocabulary


(MeSH) in the description text for the corresponding entries in OMIM. The obtained disease network contains a total of 39,458 weighted interactions. The gene network has been obtained from the Human Protein Reference Database (HPRD [22]). This protein-protein interaction network contains 64,662 unique interactions between 8,919 proteins. The network connecting genes and phenotypes has been extracted directly from OMIM (phenotype-gene relationship fields), obtaining 1,393 relationships. The domain network has been derived from DOMINE [23] and InterDom [24], containing 48,778 unique relations between 5,490 domains. Relations between domains and genes were extracted from Pfam [25]. Relations between domains and phenotypes have been extracted from Pfam and annotations of nsSNPs in the UniProt database [26]. The three networks (genes, protein domains and diseases) have been used simultaneously in the experiments performed with ProphNet. RcNet was executed using only the gene and disease networks, since this method does not allow the integration of more than two networks. DomainRBF was run on the domain and disease networks due to the same limitation.

3.2 Gene-Disease validation

To check whether the prioritization methods rcNet and ProphNet were able to retrieve a known relationship between a gene and a disease, we performed a leave-one-out cross-validation using gene-phenotype relations from OMIM. For each gene-phenotype relation reported in OMIM, we ran the two algorithms on a network in which the explicit arc connecting the gene and phenotype of interest was removed. The gene of interest was set as the query set and the methods were asked to rank all the phenotypes associated with this query set. The obtained ROC curves are shown in Figure 4. AUC values and average rank values for the target disease are displayed in Table 1. We can see that ProphNet outperforms rcNet in this test, ranking the target phenotype in a significantly higher position (corrected p-value < 0.05), with lower standard deviation and better AUC values. The large difference in AUC (over 10%) also suggests that the achieved improvement is not due to ProphNet prioritizing slightly better those targets poorly prioritized by rcNet, but to these targets being ranked at the top by our method while they are poorly ranked by rcNet. It is also important to note that, although a high percentage of the cases were prioritized at the top of the ranking, some targets were ranked very poorly by both methods, which significantly increases the mean ranking and moves it far from the top position. This also applies to the experiments described in the following two sections.

3.3 Gene-Disease validation with new OMIM associations

Another validation that we have performed consists in predicting new associations between phenotypes and genes in 387 case studies from new entries added to OMIM between May 2007 and May 2010, since these relationships are not reported in the datasets used in our study. Each case study consists of a phenotype and a set of genes (mostly only one) associated with it. Results of the comparison can be seen in Figure 4. AUC values are shown in Table 1. The results show that our algorithm clearly outperforms rcNet (corrected p-value < 0.05) when predicting new relationships not explicitly represented in the data network.


[ROC plot: sensitivity vs. 1−specificity; curves for ProphNet and rcNet on the gene-disease LOO and gene-disease new-association experiments]

Figure 4: ROC curves for gene-disease prioritizations with ProphNet and rcNet.

Test                            Method      AUC       Normalized mean ranking (Std. Dev.)
Gene-disease LOO                ProphNet    0.9393    0.0609 (0.1597)
                                rcNet       0.80572   0.1944 (0.2448)
Gene-disease new associations   ProphNet    0.80717   0.1930 (0.2618)
                                rcNet       0.71636   0.2835 (0.2907)
Domain-disease LOO              ProphNet    0.9319    0.0683 (0.1537)
                                domainRBF   0.8678    0.1322 (0.2361)

Table 1: Performance comparison for leave-one-out cross-validation prioritization experiments using OMIM.

3.4 Domain-Disease validation

To prove that our algorithm not only prioritizes genes, but can also prioritize other biological entities, we have performed a leave-one-out domain-disease validation test. For each relation between a domain and a phenotype in our datasets, we ran the prioritization methods on a global network in which the direct arc connecting the protein domain and phenotype of

interest was removed. The domain of interest was set as the query set and the methods were asked to rank all the phenotypes associated with this query set. Our method has been compared with domainRBF for this task, since this method has been recently proposed for domain-disease prioritization and builds the phenotype-domain network using the same data sources considered in this study. We set the parameters of domainRBF by testing for the best performance. A diffusion kernel was selected to compute distances in the interaction matrices. B0 and V0 were set to 0 and 1, respectively. Results show that our method significantly improves the results provided by domainRBF for disease-domain prioritization (corrected p-value < 0.05). The difference in performance is around 10% in terms of AUC, which suggests that our method prioritizes more target phenotypes at the top of the ranking. ROC curves for this comparison can be seen in Figure 5.

[ROC plot: sensitivity vs. 1−specificity; curves for ProphNet and domainRBF on the domain-disease LOO experiment]

Figure 5: ROC curves for domain-disease prioritizations with ProphNet and domainRBF.

3.5 Robustness analysis

We carried out several experiments to test the robustness of ProphNet to parameter variation. First, we checked that varying the parameter α, which controls the importance of prior information in label propagation, does not significantly affect performance, as previous works suggested for other methods [10, 17]. Values ranging between 0.5 and 0.9 reported similar performance for ProphNet, but the best results were obtained with α set to 0.9. Second, we tested the impact of varying the parameter γ on the results. γ was used to limit the propagation of noise in the label propagation between different networks. The

experiments showed that, although ProphNet reported good performance for any value of γ in [0, 1], the best results were obtained for γ in [0.002, 0.004]. Results from these experiments can be found in the supplementary data.

3.6 Case studies

In order to show the applicability of the proposed method on real case studies, we have used it for gene-disease prioritization of some multifactorial disorders, such as Alzheimer Disease, Diabetes Mellitus Type 2 and Breast Cancer, using the data sources described in section Data sources. Parameters were set to those which reported the best results in the validation experiments (α = 0.9, γ = 0.00375 and κ = $10^{-5}$). A list of the genes ranked in the top positions for each disease is shown in Table 2, together with their assigned scores. A detailed discussion about the role of these genes in the associated diseases can also be found in the Supplementary Material. A full list can be obtained by running the tool at the ProphNet website.

3.6.1 Results for Alzheimer Disease

Our method was used to prioritize genes related to Alzheimer Disease (MIM:104300). Table 2 shows genes ranked in the top positions which were previously known (OMIM records) to be connected with Alzheimer, such as APP and PSEN2. Furthermore, new relationships not explicitly reported in OMIM are suggested by analysing other genes in the top 10. For example, MAPT was ranked 3rd in the obtained prioritized list. This gene provides the instructions for making a protein called tau that can be found throughout the nervous system (including neurons of the brain), so it has been associated with Alzheimer [27]. PSEN1, with known relations to Alzheimer type 3 [28], was ranked 4th. TREM2 was ranked 5th, suggesting an important role in Alzheimer, as shown by some population studies [29, 30]. HD/HTT was ranked 6th; although this gene has not yet been directly associated with Alzheimer, it is known to be involved in Huntington's disease [31]. More details about the other genes in the top 10 are provided in the Supplementary Material.

3.6.2 Results for Diabetes Mellitus Type 2

Our method was used to prioritize genes related to Diabetes Mellitus (DM) Type 2 (MIM:125853). Genes previously known to be connected with the disease, according to OMIM records, are IRS1, INSR, IPF1, SLC2A4, PPP1R3A and TCF1, all ranked in the top 6 of the obtained prioritized list of genes. New putative candidate genes were discovered in the top 10. PLN (ranked 7th) was not related to Diabetes in the corresponding OMIM entry; however, [32] reports a role of PLN in diabetic cardiomyopathy. HADHSC was ranked 8th, since it has been related to hyperinsulinemic hypoglycemia [33, 34]. The inferred relationship between Diabetes and LEPRE1 (ranked 9th) cannot be derived from the published literature, and further studies are required to assess the possible connections of this gene to DM. Other interesting genes were ranked high, such as KCNJ11 (ranked 30th), which presents polymorphisms that confer susceptibility to Diabetes Mellitus Type 2 [35], or ABCC8 (ranked 37th), whose mutations increase the risk of diabetes, as suggested by [36].

Alzheimer Disease (MIM:104300)
Gene        Rank  Score     Gene       Rank  Score
APP∗        1     0.6639    CST3       7     0.1511
PSEN2∗      2     0.5462    ITM2B      8     0.1468
MAPT        3     0.2531    TYROBP     9     0.1296
PSEN1       4     0.1946    SNCA       10    0.1276
TREM2       5     0.1700    APOE       11    0.1141
HD/HTT      6     0.1585    NCSTN      12    0.1114

Diabetes Mellitus Type 2 (MIM:125853)
Gene        Rank  Score     Gene       Rank  Score
IRS1∗       1     0.4744    HADHSC     8     0.0976
PPP1R3A∗    2     0.4660    LEPRE1     9     0.0976
SLC2A4∗     3     0.4194    LEPREL4    10    0.0976
IPF1∗       4     0.3308    NEUROD1    14    0.0905
INSR∗       5     0.2950    KCNJ11     30    0.0595
TCF1∗       6     0.2168    ABCC8      37    0.0456
PLN         7     0.1164

Breast Cancer (MIM:114480)
Gene        Rank  Score     Gene       Rank  Score
BRCA1∗      1     0.5019    TP53       8     0.1307
RAD51∗      2     0.4919    ELAC2      9     0.1038
BRCA2∗      3     0.4813    RAD51AP1   10    0.1031
NBN/NBS1∗   4     0.3547    RAD54L     11    0.1031
PIK3CA∗     5     0.3199    FANCD2     12    0.1017
MSH2        6     0.1636    ATM        13    0.0934
RB1         7     0.1607    CHEK2      29    0.0551

Table 2: Gene symbol, rank position and assigned score for genes in the top of the ranking for each case study. Entries marked with asterisks were directly connected by an arc to the disease of interest in the data network.

3.6.3 Results for Breast Cancer

We performed a prioritization for Breast cancer (MIM:114480). Previously known genes related to this disease according to OMIM are: BRCA1, RAD51, BRCA2, NBN and PIK3CA, all included in the top 5 returned by ProphNet for this disease. New relations not explicitly represented in the data network were discovered in the top

ranking. Defects in MSH2 (ranked 6th) can cause different types of cancer, as pointed out by [37]. RB1 (ranked 7th) and TP53 (ranked 8th) act as tumour suppressors [38]. ELAC2 (ranked 9th) has not been associated with breast cancer, but with prostate cancer [39]. RAD51AP1 (10th) is closely related to RAD51 (2nd) [40]. RAD54L (11th) plays an important role in repairing and recombining DNA in mammalian cells [41]. FANCD2 (12th) interacts with the BRCA1 and BRCA2 genes in the DNA repair process to reduce the risk of breast cancer [42]. ATM (13th) has been associated with the disease in various studies [43]. Other relevant genes were found lower in the list, such as CHEK2 (ranked 29th), also associated with a predisposition to breast cancer, as shown by [44].

4 Conclusion

In this paper we have introduced ProphNet, a novel network-based method that makes it possible to accomplish any prioritization task from a network representing the corresponding data interactions. Our method is flexible and can be run on a graph composed of an arbitrary number of data networks representing biological entities of different types. This paper illustrates how to run ProphNet to perform gene-disease and domain-disease prioritization tasks, and provides experimental evidence that ProphNet outperforms other prioritization algorithms specifically designed for these tasks. A ProphNet web application has also been developed as a result of this work (the user guide can be found in the Supplementary Material). The results obtained by ProphNet on real case studies on Alzheimer, DM and Breast Cancer show the potential of the method to suggest putative candidate genes involved in a disease. A detailed analysis of the literature also allowed us to validate the results provided by the algorithm, since many of the top ranked genes were already known to be related to the diseases. We consider that prioritization methods are useful for assisting scientists at early research stages and for formulating novel hypotheses of interest.

The extensive experimentation also allowed us to study the indirect influence of considering protein domains for the prioritization of candidate genes and diseases. Results show that the addition of domain interactions produces a clear improvement with respect to existing algorithms, revealing the importance of this source of information (barely used before for this task). In the future, one of our main goals is to see how our method behaves in other prioritization problems, using different entities and sources of data not covered in this work. Furthermore, we plan to study in more detail the quality of the datasets and their influence on performance, and to apply new propagation methods in order to try to improve the results.

Competing interests

The authors declare that they have no competing interests.


Author’s contributions

VM developed the ProphNet prioritization algorithm, carried out the experiments and wrote the paper. CC and AB guided and supervised the project and participated in writing the paper. All authors read and approved the final manuscript.

Acknowledgements

This work has been carried out as part of projects P08-TIC-4299 of J. A., Sevilla and TIN2009-13489 of DGICT, Madrid. VM was supported by Plan Propio de Investigacion, University of Granada.

References

[1] Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1):56–68, 2011.

[2] Marc A van Driel, Jorn Bruggeman, Gert Vriend, Han G Brunner, and Jack AM Leunissen. A text-mining analysis of the human phenome. European journal of human genetics, 14(5):535–542, 2006.

[3] Yong Wang, Trupti Joshi, Xiang-Sun Zhang, Dong Xu, and Luonan Chen. Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics, 22(19):2413–2420, 2006.

[4] Xiujuan Wang, Natali Gulbahce, and Haiyuan Yu. Network-based methods for human disease gene prediction. Briefings in functional genomics, 10(5):280–293, 2011.

[5] Gavin S Wilkie and Eric C Schirmer. Guilt by association the nuclear envelope proteome and disease. Molecular & Cellular Proteomics, 5(10):1865–1875, 2006.

[6] Yves Moreau and Léon-Charles Tranchevent. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nature Reviews Genetics, 2012.

[7] Xuebing Wu, Rui Jiang, Michael Q Zhang, and Shao Li. Network-based global inference of human disease genes. Molecular Systems Biology, 4(1), 2008.

[8] Oron Vanunu, Oded Magger, Eytan Ruppin, Tomer Shlomi, and Roded Sharan. Associating genes and protein complexes with disease via network propagation. PLoS computational biology, 6(1):e1000641, 2010.

[9] Yongjin Li and Jagdish C Patra. Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics, 26(9):1219–1224, 2010.


[10] TaeHyun Hwang, Wei Zhang, Maoqiang Xie, Jinfeng Liu, and Rui Kuang. Inferring disease and gene set associations with rank coherence in networks. Bioinformatics, 27(19):2692–2699, 2011.

[11] Euan A Adie, Richard R Adams, Kathryn L Evans, David J Porteous, and Ben S Pickard. Speeding disease gene discovery by sequence based candidate prioritization. BMC bioinformatics, 6(1):55, 2005.

[12] Euan A Adie, Richard R Adams, Kathryn L Evans, David J Porteous, and Ben S Pickard. Suspects: enabling fast and effective prioritization of positional candidates. Bioinformatics, 22(6):773–774, 2006.

[13] Stein Aerts, Diether Lambrechts, Sunit Maity, Peter Van Loo, Bert Coessens, Frederik De Smet, Leon-Charles Tranchevent, Bart De Moor, Peter Marynen, Bassem Hassan, et al. Gene prioritization through genomic data fusion. Nature biotechnology, 24(5):537–544, 2006.

[14] Kyle J Gaulton, Karen L Mohlke, and Todd J Vision. A computational system to select candidate genes for complex human traits. Bioinformatics, 23(9):1132–1140, 2007.

[15] Wangshu Zhang, Yong Chen, Fengzhu Sun, and Rui Jiang. Domainrbf: a bayesian regression approach to the prioritization of candidate domains for complex diseases. BMC systems biology, 5(1):55, 2011.

[16] W Wang, W Zhang, R Jiang, and Y Luan. Prioritisation of associations between protein domains and complex diseases using domain-domain interaction networks. Systems Biology, IET, 4(3):212–222, 2010.

[17] Oron Vanunu and Roded Sharan. A propagation based algorithm for inferring gene-disease associations. In German Conference on Bioinformatics, pages 54–62. Citeseer, 2008.

[18] Saket Navlakha and Carl Kingsford. The power of protein interaction networks for associating genes with diseases. Bioinformatics, 26(8):1057–1063, 2010.

[19] Sebastian Köhler, Sebastian Bauer, Denise Horn, and Peter N Robinson. Walking the interactome for prioritization of candidate disease genes. American journal of human genetics, 82(4):949, 2008.

[20] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. Advances in neural information processing systems, 16(753760):284, 2004.

[21] Ada Hamosh, Alan F Scott, Joanna S Amberger, Carol A Bocchini, and Victor A McKusick. Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders. Nucleic acids research, 33(suppl 1):D514–D517, 2005.

[22] Suraj Peri, J Daniel Navarro, Ramars Amanchy, Troels Z Kristiansen, Chandra Kiran Jonnalagadda, Vineeth Surendranath, Vidya Niranjan, Babylakshmi Muthusamy, TKB Gandhi, Mads Gronborg, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome research, 13(10):2363–2371, 2003.


[23] Balaji Raghavachari, Asba Tasneem, Teresa M Przytycka, and Raja Jothi. Domine: a database of protein domain interactions. Nucleic acids research, 36(suppl 1):D656– D661, 2008.

[24] See-Kiong Ng, Zhuo Zhang, Soon-Heng Tan, and Kui Lin. Interdom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic acids research, 31(1):251–254, 2003.

[25] Robert D Finn, Jaina Mistry, Benjamin Schuster-Böckler, Sam Griffiths-Jones, Volker Hollich, Timo Lassmann, Simon Moxon, Mhairi Marshall, Ajay Khanna, Richard Durbin, et al. Pfam: clans, web tools and services. Nucleic acids research, 34(suppl 1):D247–D251, 2006.

[26] Eric Jain, Amos Bairoch, Severine Duvaud, Isabelle Phan, Nicole Redaschi, Baris E Suzek, Maria J Martin, Peter McGarvey, and Elisabeth Gasteiger. Infrastructure for the life sciences: design and implementation of the uniprot website. BMC bioinformatics, 10(1):136, 2009.

[27] John BJ Kwok, Clement T Loy, Gillian Hamilton, Edmond Lau, Marianne Hallupp, Julie Williams, Michael J Owen, G Anthony Broe, Nelson Tang, Linda Lam, et al. Glycogen synthase kinase-3β and tau genes interact in alzheimer’s disease. Annals of neurology, 64(4):446–454, 2008.

[28] CM Van Duijn, Marc Cruts, Jessie Theuns, Geert Van Gassen, Hubert Backhovens, Marleen van den Broeck, Anita Wehnert, Sally Serneels, Albert Hofman, and Christine Van Broeckhoven. Genetic association of the presenilin-1 regulatory region with early- onset alzheimer’s disease in a population-based sample. European journal of human genetics: EJHG, 7(7):801, 1999.

[29] Chiara Fenoglio, Daniela Galimberti, Laura Piccio, Diego Scalabrini, Paola Panina, Cecilia Buonsanti, Eliana Venturelli, Carlo Lovati, Gianluigi Forloni, Claudio Mariani, et al. Absence of trem2 polymorphisms in patients with alzheimer’s disease and frontotemporal lobar degeneration. Neuroscience letters, 411(2):133–137, 2007.

[30] Benoit Melchior, Angie E Garcia, Bor-Kai Hsiung, Katherine M Lo, Jonathan M Doose, J Cameron Thrash, Anna K Stalder, Matthias Staufenbiel, Harald Neumann, and Monica J Carson. Dual induction of trem2 and tolerance-related transcript, tmem176b, in amyloid transgenic mice: implications for vaccine-based therapies for alzheimer’s disease. ASN neuro, 2(3), 2010.

[31] NA Aziz, CK Jurgens, GB Landwehrmeyer, WMC van Roon-Mom, GJB Van Ommen, T Stijnen, RAC Roos, et al. Normal and mutant htt interact to affect clinical severity and progression in huntington disease. Neurology, 73(16):1280–1285, 2009.

[32] Loren E Wold, Asli F Ceylan-Isik, Cindy X Fang, Xiaoping Yang, Shi-Yan Li, Nair Sreejayan, Jamie R Privratsky, and Jun Ren. Metallothionein alleviates cardiac dysfunction in streptozotocin-induced diabetes: Role of Ca2+ cycling proteins, nadph oxidase, poly (adp-ribose) polymerase and myosin heavy chain isozyme. Free Radical Biology and Medicine, 40(8):1419–1429, 2006.


[33] Anders Molven, Guri E Matre, Marinus Duran, Ronald J Wanders, Unni Rishaug, Pål R Njølstad, Egil Jellum, and Oddmund Søvik. Familial hyperinsulinemic hypoglycemia caused by a defect in the schad enzyme of mitochondrial fatty acid oxidation. Diabetes, 53(1):221–227, 2004.

[34] Els C van Hove, Torben Hansen, Jacqueline M Dekker, Erwin Reiling, Giel Nijpels, Torben Jørgensen, Knut Borch-Johnsen, Yasmin H Hamid, Robert J Heine, Oluf Pedersen, et al. The hadhsc gene encoding short-chain l-3-hydroxyacyl-coa dehydrogenase (schad) and type 2 diabetes susceptibility the damage study. Diabetes, 55(11):3193–3196, 2006.

[35] Anna L Gloyn, Michael N Weedon, Katharine R Owen, Martina J Turner, Bridget A Knight, Graham Hitman, Mark Walker, Jonathan C Levy, Mike Sampson, Stephanie Halford, et al. Large-scale association studies of variants in genes encoding the pancreatic β-cell katp channel subunits kir6. 2 (kcnj11) and sur1 (abcc8) confirm that the kcnj11 e23k variant is associated with type 2 diabetes. Diabetes, 52(2):568–572, 2003.

[36] Andrey P Babenko, Michel Polak, Hélène Cavé, Kanetee Busiah, Paul Czernichow, Raphael Scharfmann, Joseph Bryan, Lydia Aguilar-Bryan, Martine Vaxillaire, and Philippe Froguel. Activating mutations in the abcc8 gene in neonatal diabetes mellitus. New England Journal of Medicine, 355(5):456–466, 2006.

[37] Pieter J Westenend, Ronald Schütte, Monique MCP Hoogmans, Anja Wagner, and Winand NM Dinjens. Breast cancer in an msh2 gene mutation carrier. Human pathology, 36(12):1322–1326, 2005.

[38] Tom Walsh, Silvia Casadei, Kathryn Hale Coats, Elizabeth Swisher, Sunday M Stray, Jake Higgins, Kevin C Roach, Jessica Mandell, Ming K Lee, Sona Ciernikova, et al. Spectrum of mutations in brca1, brca2, chek2, and tp53 in families at high risk of breast cancer. JAMA: the journal of the American Medical Association, 295(12):1379–1388, 2006.

[39] Annika Rökman, Tarja Ikonen, Nina Mononen, Ville Autio, Mika P Matikainen, Pasi A Koivisto, Teuvo LJ Tammela, Olli-P Kallioniemi, and Johanna Schleutker. Elac2/hpc2 involvement in hereditary and sporadic prostate cancer. Cancer research, 61(16):6038–6041, 2001.

[40] Tonko Buterin, Caroline Koch, and Hanspeter Naegeli. Convergent transcriptional profiles induced by endogenous estrogen and distinct xenoestrogens in breast cancer cells. Carcinogenesis, 27(8):1567–1578, 2006.

[41] A Naderi, AE Teschendorff, NL Barbosa-Morais, SE Pinder, AR Green, DG Powe, JFR Robertson, S Aparicio, IO Ellis, JD Brenton, et al. A gene-expression signature to predict survival in breast cancer across independent data sets. Oncogene, 26(10):1507– 1516, 2006.

[42] Nazneen Rahman, Sheila Seal, Deborah Thompson, Patrick Kelly, Anthony Renwick, Anna Elliott, Sarah Reid, Katarina Spanova, Rita Barfoot, Tasnim Chagtai, et al.


Palb2, which encodes a brca2-interacting protein, is a breast cancer susceptibility gene. Nature genetics, 39(2):165–167, 2006.

[43] Georgia Chenevix-Trench, Amanda B Spurdle, Magtouf Gatei, Helena Kelly, Anna Marsh, Xiaoqing Chen, Karen Donn, Margaret Cummings, Dale Nyholt, Mark A Jenkins, et al. Dominant negative atm mutations in breast cancer families. Journal of the National Cancer Institute, 94(3):205–215, 2002.

[44] Pia Vahteristo, Jirina Bartkova, Hannaleena Eerola, Kirsi Syrjäkoski, Salla Ojala, Outi Kilpivaara, Anitta Tamminen, Juha Kononen, Kristiina Aittomäki, Päivi Heikkilä, et al. A chek2 genetic variant contributing to a substantial fraction of familial breast cancer. The American Journal of Human Genetics, 71(2):432–438, 2002.


Supplementary material

Top 50 genes prioritized by ProphNet

Rank  Gene      Score      Rank  Gene      Score
1     APP       0.66395    26    SIRPB1    0.069677
2     PSEN2     0.54624    27    KCNIP4    0.067622
3     MAPT      0.25306    28    CHMP2B    0.066392
4     PSEN1     0.19459    29    ICAM5     0.06531
5     TREM2     0.17       30    APBB3     0.063825
6     HD        0.15854    31    APBA3     0.061983
7     CST3      0.15113    32    PRNP      0.06149
8     ITM2B     0.14676    33    COL25A1   0.059462
9     TYROBP    0.1296     34    HADHB     0.05945
10    SNCA      0.12759    35    CASP6     0.056407
11    APOE      0.11406    36    KCNIP3    0.055112
12    NCSTN     0.11144    37    SNCB      0.054846
13    PSENEN    0.09454    38    KIR2DS2   0.054442
14    APH1A     0.09454    39    NOTCH3    0.052885
15    APH1B     0.093457   40    KLRC3     0.052401
16    METTL2B   0.091739   41    PRSS3     0.05117
17    HADH2     0.088582   42    CHRNA7    0.050067
18    SPON1     0.086666   43    APBB2     0.04729
19    TM2D1     0.086666   44    NOTCH4    0.046163
20    BACE2     0.086666   45    CTNND2    0.044223
21    DOCK3     0.077497   46    SERPINI1  0.043768
22    CD300E    0.069677   47    FLNB      0.043587
23    CLEC5A    0.069677   48    CASP8     0.043244
24    NCR2      0.069677   49    APPBP1    0.042937
25    TREM1     0.069677   50    CTSD      0.042268

Table 1: Top 50 Alzheimer (MIM:104300) genes prioritized by ProphNet.


Rank  Gene      Score      Rank  Gene      Score
1     IRS1      0.4744     26    PHIP      0.062078
2     PPP1R3A   0.46597    27    ATP2A3    0.062078
3     SLC2A4    0.41937    28    ATP2A1    0.062029
4     IPF1      0.33078    29    ACOX1     0.060055
5     INSR      0.29497    30    KCNJ11    0.059551
6     TCF1      0.21682    31    C1QTNF5   0.059257
7     PLN       0.11641    32    -         0.050414
8     HADHSC    0.097584   33    ATP2A2    0.050161
9     LEPRE1    0.097584   34    STRN3     0.048852
10    -         0.097584   35    ARF3      0.048383
11    MLSTD2    0.097584   36    SLN       0.048282
12    FAM62B    0.097584   37    ABCC8     0.04558
13    IDH2      0.097584   38    PSMD7     0.044634
14    NEUROD1   0.090461   39    MVP       0.043341
15    PCSK1     0.077833   40    SNF1LK2   0.042414
16    SLC2A2    0.075549   41    EHD2      0.042374
17    TCF2      0.074199   42    PPARG     0.041591
18    MAFA      0.070823   43    PCSK1N    0.041119
19    PDIA6     0.070718   44    HK2       0.04051
20    FKBP10    0.069568   45    BPY2IP1   0.0403
21    KBTBD10   0.067907   46    SPOP      0.039407
22    DLD       0.066068   47    ENPP1     0.038238
23    ARFIP1    0.064772   48    MFRP      0.038163
24    RPS6KA1   0.064346   49    RAB7      0.037582
25    IAPP      0.062876   50    GATA5     0.036751

Table 2: Top 50 diabetes mellitus type II (MIM: 125853) genes prioritized by ProphNet.


Rank  Gene      Score      Rank  Gene      Score
1     BRCA1     0.50185    26    TREX1     0.060285
2     RAD51     0.49193    27    HMG20B    0.058471
3     BRCA2     0.4813     28    MRE11A    0.057521
4     NBN       0.3547     29    CHEK2     0.055145
5     PIK3CA    0.31986    30    CDK4      0.054203
6     MSH2      0.16361    31    ERCC2     0.052391
7     RB1       0.16072    32    BAP1      0.051401
8     TP53      0.13068    33    MSH6      0.051175
9     ELAC2     0.10385    34    -         0.050248
10    RAD51AP1  0.10305    35    MLH1      0.049023
11    RAD54L    0.10305    36    MUTYH     0.048841
12    FANCD2    0.10167    37    RPA1      0.048761
13    ATM       0.093381   38    C17orf28  0.048712
14    RNASEL    0.083185   39    TP53BP1   0.048516
15    BCCIP     0.0778     40    AXIN2     0.048076
16    SHFM1     0.077161   41    SMC1L1    0.047339
17    DCLRE1C   0.076188   42    BUB1B     0.046962
18    C11orf30  0.075987   43    FANCG     0.04636
19    BLM       0.074494   44    RAD52     0.04634
20    DMC1      0.070191   45    ATRX      0.045625
21    MDC1      0.069038   46    STK11     0.044099
22    RAD50     0.065818   47    BCL2      0.042166
23    H2AFX     0.06564    48    EXO1      0.041624
24    ATR       0.065553   49    GABRB1    0.04145
25    PTEN      0.060815   50    RET       0.040784

Table 3: Top 50 breast cancer (MIM:114480) genes prioritized by ProphNet.


Robustness analysis results

Gamma    AUC       Normalized mean ranking (Std. Dev.)
0.0005   0.79834   0.2018 (0.2613)
0.0010   0.80599   0.1942 (0.2604)
0.0020   0.80792   0.1922 (0.2615)
0.0030   0.80749   0.1927 (0.2621)
0.0040   0.80685   0.1933 (0.2619)
0.0050   0.80669   0.1935 (0.2617)
0.0100   0.80526   0.1949 (0.2619)
0.0300   0.79982   0.2003 (0.2644)
0.0500   0.79684   0.2033 (0.2649)
0.1000   0.79160   0.2086 (0.2666)
0.5000   0.78169   0.2185 (0.2696)
1.0000   0.78169   0.2185 (0.2696)

Table 4: Robustness analysis results for gene-disease validation with new OMIM associations.

ProphNet web tool user guide

The ProphNet web tool allows users to perform prioritizations of genes, diseases and protein domains. The tool has been designed to be easy to use. ProphNet does not require any registration or identification for use. When we access ProphNet, we see the following screen.

We can see three different parts in the interface. The action bar [1] provides direct access to additional information about ProphNet and to the Matlab source code. The prioritization settings area [2] is the space where the user sets the parameters of

the prioritization to be performed. Finally, the work space [3], which initially only contains information about ProphNet, is the area where the user will see the results of the prioritization. Users interact with the tool using the prioritization settings area. This area can be seen in more detail in the following figure.

This area contains multiple elements. The first one is a help text [1]. This text explains what type of input is expected. The user can also load some examples by clicking on the links. To configure a prioritization, the user first has to specify a query or input type [2] and a target or output type [3]. With the configuration shown in the figure, the user would introduce gene identifiers and would obtain a list of ranked diseases. The input list [4] is the component where the user introduces the list of identifiers of the elements for the query. Each element must be introduced on a new line. To help users input correct names, an auto-complete component is available [5]. Based on user input, this component will suggest matching names. For details on what types of identifiers are supported, the user can press the button with the exclamation mark. Finally, some action buttons [6] are provided. After entering the prioritization settings, the user must press the Prioritize button. While the prioritization is being performed, the following dialog will be displayed.


Prioritization can take several minutes depending on the server load. After this wait, results are shown as in the following figure.

We can see different parts in the results area. The ranked table [1] is the essential part. This table shows the rank, the name or identifier, and the assigned score for each element. Names can be clicked for more details about each entry. The elements in the red bar [2] are those that have been excluded from the query because they are not in the ProphNet database. Finally, the blue bar [3] lets users download the results as Excel files.


Propagation and correlation analysis results

[ROC plot: sensitivity vs. 1−specificity; curves for ProphNet with correlation and ProphNet with propagation on the gene-disease LOO experiment]

Method               Correlation   Propagation
AUC                  0.9393        0.9277
Mean ranking         309           368
Standard deviation   811           944

Table 5: Propagation and correlation analysis results for gene-disease leave-one-out validation.

ProphNet execution times and memory usage

Test                            Seconds per case   Memory usage (MB)
Gene-disease LOO                1.76               2077.9
Gene-disease new associations   1.85               2077.8
Domain-disease LOO              2.09               2077.9

Table 6: Algorithm execution times and memory usage.


2.2 Disambiguation of semantic relations

The journal paper associated with this part of the dissertation is:

V. Mart´ınez, F. Berzal, J.C. Cubero. Disambiguation of semantic relations using evidence aggregation according to a sense inventory. Submitted to IEEE Transactions on Knowledge and Data Engineering (ISSN: 1041-4347).

• Status: Submitted
• Impact Factor (JCR 2016): 3.438
• Subject Category:
  – Computer Science, Artificial Intelligence. Ranking 23/133.
  – Computer Science, Information Systems. Ranking 21/146.
  – Engineering, Electrical & Electronic. Ranking 47/262.

Disambiguation of Semantic Relations Using Evidence Aggregation According to a Sense Inventory

Víctor Martínez    Fernando Berzal    Juan-Carlos Cubero

Abstract

This paper describes EPROP, a novel technique requiring little prior knowledge for word sense disambiguation of semantic relations between pairs of ambiguous concepts in knowledge bases. Our method makes inferences by aggregating evidence from ambiguous word interpretations and propagating the acquired knowledge over a taxonomy to generalize or specialize this knowledge. This propagation process allows the estimation of the degree of belief for each possible word sense assignment given the available evidence. EPROP only requires a sense inventory structured as a taxonomy to disambiguate a knowledge base by combining evidence from the ambiguous facts stored in the knowledge base. We have performed different experiments that show that our method achieves good results on the disambiguation of the semantic relations included in WordNet and ConceptNet. We also show how our method can be used to improve the performance of state-of-the-art word sense disambiguation methods.

1 Introduction

In natural language processing (NLP), word sense disambiguation (WSD) is the task of automatically determining the right sense of a polysemous word given its context [1, 2]. The problem of WSD is of great interest due to the intrinsic ambiguity of natural languages. Other fields of NLP can potentially benefit from advances in WSD, such as question answering [3, 4] or sentiment analysis [5, 6], just to mention two examples. A large number of approaches with very different properties have been proposed to solve the WSD problem [7, 8]. A widely accepted categorization of WSD techniques comprises the following categories: supervised approaches, unsupervised approaches, and knowledge-based approaches.

Supervised approaches are based on applying machine learning techniques over sense-labeled training data to build classifiers that predict the sense of each word considering certain features [9, 10]. Their applicability is highly limited due to the need for labeled data (typically hand-annotated by human experts), which is scarce and expensive to obtain, leading to the knowledge acquisition bottleneck [11, 12]. The main argument for their usage is that these methods have seemed to report the best performance in practice. However, in recent years, other approaches have been rivaling supervised methods without requiring large sets of labeled data, relegating supervised methods to a secondary role [13, 14].


Given the limitations of the supervised approach to WSD, it would be reasonable to think of purely unsupervised approaches as an adequate alternative. Completely unsupervised approaches work by automatically analyzing large unlabeled text corpora, which are usually easy to obtain, and clustering word occurrences [15, 16]. Different clusters for the same word are assumed to represent different senses of that word. Although totally unsupervised approaches are very interesting, since they do not require prior knowledge, their main limitation is that the induced senses cannot be mapped to reference sense inventories. This leads to important difficulties when applying them and evaluating their performance in practice.

Unsurprisingly, knowledge-based approaches have become very popular in the field of WSD. They tend to offer slightly lower performance results than supervised approaches, but remarkably wider coverage with less prior knowledge [17]. Knowledge-based approaches exploit knowledge resources (such as dictionaries, thesauri, and ontologies, among others) to predict the correct word sense according to an established sense inventory. Four categories of knowledge-based techniques are usually considered:

• Lesk’s algorithm and its variations, relying on maximizing the overlap of the considered senses given their corresponding dictionary entries [18, 19, 20].

• Approaches based on choosing the senses that maximize semantic similarity, usually by minimizing distances in a given semantic graph [21, 22, 23, 24, 25].

• Selectional preference-based methods, which try to capture knowledge about which sense is more adequate given a predicate [26, 27].

• Approaches relying on properties usually observed in text, such as one sense per discourse, one sense per collocation, or just considering the most frequent sense [28, 29].

Knowledge-based methods usually rely on WordNet, a large lexical database of English which has become a de facto standard in the field of WSD [30, 31]. In WordNet, senses are represented as sets of synonym words, often called synsets, arranged in a taxonomy through hyponymy and hypernymy relations. Other semantic relations between senses are also included in WordNet, such as part holonymy or antonymy. These semantic relations are exploited by knowledge-based methods for the task of WSD. However, these semantic relations are limited, sparse, and require expert knowledge that is difficult to gather. An open research problem is the automatic extension or enhancement of WordNet with more semantic relations, with the goal of improving the performance of WSD techniques.

In this paper, we propose EPROP (short for evidence propagation), a novel technique for the disambiguation of relational knowledge bases that relies on little prior knowledge to perform sense disambiguation of their concepts according to a sense inventory. Our technique relies only on a taxonomic representation of word senses and the set of unlabeled or ambiguous instances stored in the relational knowledge base. Our method accumulates evidence over ambiguous instances while exploiting the taxonomy to extract unobserved knowledge. The accumulated evidence is finally applied to disambiguate the senses of two concepts related by a semantic relation.


This paper is organized as follows. Section 2 introduces the previous work related to our proposal. Section 3 describes the ideas that motivate our proposal, the mathematical derivation of our method, and an algorithm for its efficient implementation. In Section 4, we describe the evaluation procedure followed in this manuscript. In Section 5, we evaluate the performance and robustness of our technique compared to some baseline methods and a state-of-the-art technique in the task of disambiguating ambiguous relations derived from WordNet. In Section 6, our approach is applied to the disambiguation of some hand-annotated ConceptNet relations and its performance is evaluated and compared to baseline methods and a state-of-the-art technique. In Section 7, we show how the instances disambiguated by our approach can be used to improve the results obtained by state-of-the-art knowledge-based word sense disambiguation techniques. Conclusions and suggestions for future research are presented in Section 8.

2 Related work

Different approaches have been proposed for the automatic extension of WordNet with new semantic relations to overcome the knowledge acquisition bottleneck. A growing line of research in recent years mines Wikipedia to extract relations, which are then mapped to WordNet synsets. For example, [14] computes a mapping from Wikipedia pages to WordNet synsets, allowing the links between Wikipedia pages to be used as WordNet semantic relations. A major effort in this direction has been the construction of BabelNet using these mappings, which led to state-of-the-art WSD results [32, 33]. Other authors have explicitly expanded WordNet relations using implicit information. For example, [34] extracts relations from synset glosses by exploiting different properties, such as the transitivity of some relations. Other approaches have been proposed to extract relations from raw text, such as [35], which parses a corpus of sentences and merges dependency relations to capture semantic relations. These approaches have proven to be useful for the construction of an enhanced WordNet and its application to the WSD problem. However, they omit interesting information captured by relational knowledge bases, which explicitly store valuable knowledge using a structured representation. Different large relational knowledge bases are currently available, such as ConceptNet [36, 37] or YAGO [38]. Knowledge bases are semantic networks containing different kinds of useful knowledge, including everyday basic knowledge (“water is used for drinking”), cultural knowledge (“Alhambra is part of Granada”), and scientific knowledge (“atom is part of molecule”). The information included in these relational knowledge bases could be exploited to improve the performance of state-of-the-art knowledge-based WSD techniques. However, most of these knowledge bases are ambiguous and their concepts are not mapped to a sense inventory, as commonly required by WSD techniques, limiting their applicability to NLP-related problems. An approach for the automatic disambiguation of these concepts could be potentially useful for the NLP community. Unfortunately, little work has been done on this topic. To the best of our knowledge, the only previous attempt

to disambiguate ConceptNet was made by [39, 40]. However, their method relies on the Normalized Google Distance (NGD), which requires a large number of queries to Google’s search engine; this is not allowed under its current terms of service and prevents the design of repeatable experiments, since search results change every time Google updates its ranking algorithm.

3 Method

In this section, we describe all the details related to the derivation and implementation of our proposed approach. First, we illustrate the intuitive ideas that motivate the reasoning behind our approach. Next, we describe the step-by-step mathematical derivation of our method. Finally, we describe the proposed algorithm in full detail.

3.1 Motivation

Let us assume that we observe the ambiguous relation “a bus has a trunk”. Without further prior knowledge, we do not know whether bus refers to “a vehicle carrying many passengers” or “an electrical conductor that makes a common connection between several circuits”, nor whether trunk refers to “the main stem of a tree” or to “compartment in an automobile for luggage” (other senses are left out for the sake of simplicity). Let us assume that we then observe another ambiguous relation, “an omnibus has a boot”. For this new relation, we do not know whether omnibus refers to “a vehicle carrying many passengers” or to “an anthology of articles”, nor whether boot refers to “compartment in an automobile for luggage” or to “footwear that covers the whole foot and lower leg”. However, since both relations have overlapping sense assignments for their concepts, it is reasonable to increase our belief that the actual senses are “a vehicle carrying many passengers” for bus and omnibus, and “compartment in an automobile for luggage” for trunk and boot. As this example shows, two ambiguous instances can increase our degree of belief in one or more specific sense assignments. However, observing the exact same semantic relation with different lemmas can be difficult even when using a large training corpus. Even worse, some senses may be represented by a single polysemous word or lemma, preventing us from exploiting this redundancy. The following examples illustrate some potential solutions to this problem. Let us assume again that we only observe “a bus has a trunk”, so we have the same degree of belief for all possible sense assignments. Then, we observe “a vehicle has a boot”, where we do not know whether vehicle refers to “a motor vehicle with four wheels” or to “any substance that facilitates the use of a drug, pigment, or other material”. Since a bus is also a vehicle, we increase our degree of belief in the previously mentioned sense assignment. This exemplifies that pairs of senses related by hyponymy and hypernymy share some properties and can potentially play the same role in a semantic relation. Furthermore, observing “a car has a boot” could also increase our degree of belief in the correct sense assignment, since one of the senses of car is also a vehicle, so bus and car have some semantically very similar senses. In this context,

it is reasonable to assume that bus and car share properties, e.g., having a compartment for luggage in our example. Observing relations with overlapping senses at different abstraction levels, or between semantically similar senses, is useful for the disambiguation task, as shown in these examples, and such observations allow us to overcome some problems of existing techniques. As shown in the previous examples, the accumulation of ambiguous data can increase our confidence in particular sense assignments. We can exploit the taxonomical structure of senses to propagate evidence to other semantically similar senses, thereby overcoming the sparseness and scarcity of data. In this context, we face the problem of gradually increasing our degree of belief in certain facts given some evidence. To solve this problem, we propose a framework whose derivation is similar to the certainty factor (CF) model [41], a popular method of approximate reasoning with uncertainty. Our framework manages belief updates in a much more restricted way than the CF model does, without requiring ad hoc elements with little theoretical background and avoiding some assumptions of certainty factors that have been criticized as controversial despite being useful in practice [42].
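To make the kind of belief update we have in mind concrete, consider a small numeric illustration (the numbers are purely hypothetical; the actual combination rule is derived in Section 3.2). If two independent ambiguous observations each suggest a particular sense assignment with probability 0.3, combining them in a noisy-or fashion yields

$$1 - (1 - 0.3)(1 - 0.3) = 1 - 0.49 = 0.51,$$

so the accumulated evidence raises our degree of belief in that assignment above what either observation provides on its own.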

3.2 Approach description

Let T = (S, H) be an is-a taxonomy of senses permitting multiple inheritance, where S is the set of senses and H is the set of is-a triplets, each h_i ∈ H being defined as h_i = (s_i^s, IsA, s_i^t), with s_i^s, s_i^t ∈ S, where h_i represents “s_i^s is a s_i^t”. The left member of a triplet is called the source and the right member is called the target. For convenience, we define a function parents(T, s) and a function children(T, s). The function parents(T, s) takes a taxonomy T and a sense s as arguments, and it returns the set of senses that are the target of some is-a relation in the taxonomy where the sense s is the source. The function children(T, s) again takes a taxonomy T and a sense s as arguments, but it returns the set of senses that are the source of some is-a triplet in the taxonomy where the sense s is the target. Our approach assumes that there is an unobserved set U of unambiguous semantic relations that represents what can be known. However, we only observe this set U through a set of ambiguous observations O = {o_1, ..., o_|O|}, where o_i = (l_i^s, type, l_i^t), with l_i^s and l_i^t being lemmas, which can be mapped to subsets of senses in S. This mapping is done through a function senses(lemma), which takes a lemma and returns all the senses that match the lemma according to its associated part-of-speech (POS) tag. The problem we want to solve is, given the taxonomy of senses T, the set of ambiguous observations O, and an ambiguous instance z = (l^s, type, l^t) to disambiguate, computing the current degree of belief for each possible sense assignment y = (s^s, type, s^t) with s^s ∈ senses(l^s) and s^t ∈ senses(l^t). We treat this problem as equivalent to computing P(y ∈ U | O), the degree of belief in y belonging to the unobserved set U of true facts, denoted as P(y | O) for the sake of brevity. It is important to note that only observations of the same type as the instance under disambiguation are relevant, allowing us to ignore observations involving other types of semantic relations.
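A minimal Python sketch of these data structures may help make the notation concrete. The taxonomy, the parents/children functions, and the senses function are represented here with plain dictionaries over a toy sense inventory loosely based on the bus/omnibus example from Section 3.1; all identifiers are hypothetical and chosen only for illustration.

# Toy is-a taxonomy: each sense maps to the set of its parents (is-a targets).
# Sense names are hypothetical identifiers of the form lemma_gloss.
IS_A = {
    "bus_vehicle": {"entity"},
    "bus_circuit": {"entity"},
    "omnibus_vehicle": {"bus_vehicle"},
    "omnibus_articles": {"entity"},
    "trunk_tree": {"entity"},
    "compartment_luggage": {"entity"},   # shared by the lemmas trunk and boot
    "boot_footwear": {"entity"},
    "entity": set(),
}

def parents(taxonomy, sense):
    """Senses that are the target of an is-a triplet whose source is `sense`."""
    return taxonomy.get(sense, set())

def children(taxonomy, sense):
    """Senses that are the source of an is-a triplet whose target is `sense`."""
    return {s for s, ps in taxonomy.items() if sense in ps}

# Hypothetical lemma-to-senses mapping (the senses() function in the text).
SENSES = {
    "bus": ["bus_vehicle", "bus_circuit"],
    "omnibus": ["omnibus_vehicle", "omnibus_articles"],
    "trunk": ["trunk_tree", "compartment_luggage"],
    "boot": ["compartment_luggage", "boot_footwear"],
}

def senses(lemma):
    return SENSES.get(lemma, [])

# Ambiguous observations O: (source lemma, relation type, target lemma).
OBSERVATIONS = [("bus", "HasA", "trunk"), ("omnibus", "HasA", "boot")]

if __name__ == "__main__":
    print(parents(IS_A, "omnibus_vehicle"))   # {'bus_vehicle'}
    print(senses("boot"))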

First, we need to derive a method for computing the degree of belief for each possible solution y given the current evidence O, denoted P(y | O). We start from Bayes’ theorem,

$$P(y \mid O)\,P(O) = P(O \mid y)\,P(y),$$

which we can trivially transform into Bayes’ formula as

$$P(y \mid O) = \frac{P(O \mid y)\,P(y)}{P(O)}.$$

Assuming the conditional independence of observations, we can rewrite this expression as

$$P(y \mid O) = \prod_{o \in O} \frac{P(o \mid y)}{P(o)}\,P(y).$$

According to Bayes’ theorem, the term P(o | y)/P(o) is equal to P(y | o)/P(y), so we can rewrite the previous expression as

$$P(y \mid O) = \prod_{o \in O} \frac{P(y \mid o)}{P(y)}\,P(y).$$

Now, given the basic probability axiom $P(\bar{y}) = 1 - P(y)$, we can obtain the following expression:

$$P(y \mid O) = 1 - \prod_{o \in O} \frac{1 - P(y \mid o)}{1 - P(y)}\,(1 - P(y)).$$

Since the prior probability of each possible triplet is the same without prior knowledge, part of this expression can be transformed into a constant C as follows:

$$P(y \mid O) = 1 - C \prod_{o \in O} \left(1 - P(y \mid o)\right),$$

where $C = \prod_{o \in O} \frac{1}{1 - P(y)}\,(1 - P(y))$. This constant can be approximated by 1 since, assuming that P(y) → 0 due to the huge space of possible combinations compared to the triplets that actually exist, it is expected that C → 1.

The only remaining computation required to apply this expression is the estimate of P(y | o), which can be read as the probability of y belonging to the unobserved set of real facts given observation o. For convenience, we can express this value in terms of the probability of observing certain hidden evidence e given the observation o, the probability of y^s being the source of the triplet given evidence e, and the probability of y^t being the target of the triplet given evidence e. We can write this as


$$P(y \mid o) = P(y \mid e)\,P(e \mid o) = P(y^s \in y,\, y^t \in y \mid e)\,P(e \mid o).$$

Assuming that the probability of the source and the probability of the target are conditionally independent, we obtain

$$P(y \mid o) = P(y^s \in y \mid e)\,P(y^t \in y \mid e)\,P(e \mid o).$$

The term P(e | o), which we call the weight of the observation, is useful to weigh observations according to the confidence we have in them. If no confidence scores are given for the observations, a uniform probability distribution can be used in order to weigh all observations equally. To compute P(y^s ∈ y | e) and P(y^t ∈ y | e), we estimate the probability of each possible mapping of each side of the observation o and the probability of y^s and y^t being the source and the target of the triplet given that mapping, respectively. Since we do not have prior knowledge about the observations and we assume that they are conditionally independent, all mappings are initially assumed to have the same probability, and we can compute them as

$$P(y^s \in y \mid e) = \sum_{s^s \in senses(o^s)} P(y^s \in y \mid s^s \in o)\,P(s^s \in o \mid e) = \sum_{s^s \in senses(o^s)} \frac{P(y^s \in y \mid s^s \in o)}{|senses(o^s)|}$$

for the source of the triplet and as

$$P(y^t \in y \mid e) = \sum_{s^t \in senses(o^t)} P(y^t \in y \mid s^t \in o)\,P(s^t \in o \mid e) = \sum_{s^t \in senses(o^t)} \frac{P(y^t \in y \mid s^t \in o)}{|senses(o^t)|}$$

for the target of the triplet. The terms P(y^s ∈ y | s^s ∈ o) and P(y^t ∈ y | s^t ∈ o) are equal to the probability of y^s and y^t being the source and the target of the triplet given that s^s is the source and s^t is the target in the current observation, respectively. In order to model the behavior observed in the example used to motivate our approach, where senses may inherit properties from more abstract or general senses, we introduce a propagation process in which a node has a probability α of sharing a property with a child in the taxonomy and a probability β of inheriting a property from a parent in the taxonomy, where the property in our context is being the source or the target of a triplet. Let α be the upward propagation factor, which is the probability of a sense u being the source or the target given that it is the parent of a sense v that is the source or the target with certain probability, that is, α = P(u ∈ y | u ∈ parents(T, v), v ∈ y). Let β be the downward propagation factor, which is the probability of a sense u being the source or the target given that it is the child of a sense v that is the source or the target with certain probability, that is, β = P(u ∈ y | u ∈ children(T, v), v ∈ y). Therefore, we can compute P(y^s ∈ y | s^s ∈ o) and P(y^t ∈ y | s^t ∈ o) as a propagation process starting, respectively, from s^s and s^t. Starting with probability 1, crossing each is-a link attenuates this probability by α and crossing each reverse is-a link attenuates it by β. Therefore, the α and β damping parameters model how often properties can be generalized (promoted to properties of the parents in the is-a taxonomy) and specialized (applied to children in the is-a taxonomy), respectively. A simple example of this propagation process is shown in Figure 1. We heuristically derive the values of the parameters α and β from the assumption that the properties of a parent depend on the joint properties of its children, and that each child provides the same amount of information. Therefore, we estimate these parameters as α = β = 1/c, where c is the average number of children of non-leaf nodes. Both parameters are set to 0.2 in our experiments, since the value of c is close to 5 in WordNet.
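For readers who want to reproduce the α = β = 1/c heuristic, the sketch below estimates c over the WordNet noun hierarchy using NLTK's WordNet interface (assuming NLTK and its wordnet corpus are installed). It counts hyponyms and instance hyponyms as children and averages over non-leaf synsets; the exact value obtained will depend on the WordNet version and on which parts of speech and relations are included, so this is only meant as an illustration of the estimate.

from nltk.corpus import wordnet as wn

def average_children(pos="n"):
    """Average number of children (hyponyms + instance hyponyms) of non-leaf synsets."""
    total_children = 0
    non_leaf = 0
    for synset in wn.all_synsets(pos):
        n_children = len(synset.hyponyms()) + len(synset.instance_hyponyms())
        if n_children > 0:
            non_leaf += 1
            total_children += n_children
    return total_children / non_leaf

if __name__ == "__main__":
    c = average_children("n")
    print("average children per non-leaf noun synset:", round(c, 2))
    print("suggested alpha = beta =", round(1.0 / c, 2))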

Figure 1: Evidence propagation from sense s_1 to other senses in a small example taxonomy of senses s_1 to s_7. The value attached to each sense s_i is equal to P(s_i ∈ y | s_1 ∈ o): s_1 holds value 1, and the value decays by a factor of β with each is-a step away from s_1 (β one level away, β² two levels away). Node gray level is proportional to this value to ease visualization, and arrows represent is-a links.

We can observe that propagating evidence across the network in this way has a high computational complexity, since it requires crossing every is-a link in H for each possible sense mapping of each observation, leading to O(ohs) computational complexity, where o is the number of observations, s the number of senses per lemma, and h the number of is-a links in the taxonomy. Precomputing the propagation from each sense is also unfeasible if we consider its time and space complexity. Optionally, a threshold κ can be set to stop the propagation of beliefs when the propagated value falls below the threshold, making the algorithm more efficient (and reducing the noise from weak evidence). In the next section, we propose an algorithmic approach that, with or without a threshold, reduces the number of required computations.
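As a rough illustration of the effect of the threshold (using the values adopted later in the experiments, α = β = 0.2 and κ = 10^-4), a chain of attenuations is pruned after about six steps, since

$$0.2^5 = 3.2 \times 10^{-4} \geq 10^{-4}, \qquad 0.2^6 = 6.4 \times 10^{-5} < 10^{-4},$$

so evidence is only propagated across a handful of taxonomy levels rather than across the whole hierarchy.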


3.3 Algorithm Description

We have seen that a naïve application of our proposed approach would lead to an excessively high computational complexity, in both time and space. Instead, we propose an algorithm that yields the same results in a much more efficient way. Its basic idea is to start by propagating information upwards and, when computing the degree of belief in the existence of a connection of a certain type between a pair of senses, to traverse from these nodes to the root, gathering all the evidence that would otherwise have been propagated downwards. Finally, all the gathered evidence is aggregated to compute the degree of belief in that specific sense mapping. Our algorithmic approach is composed of three main building blocks: upward evidence propagation, evidence gathering, and evidence aggregation. These algorithmic steps are described in full detail below, together with the computational complexity of each step. The first step is upward evidence propagation, described in Algorithm 1. An annotation index, containing the evidence associated with each sense, is created and populated in this step. The evidence stored in the annotation index is the result of spreading the evidence from each observation from lower senses (those the observation directly maps to) to higher senses, until the root of the taxonomy is reached. This process consists of launching a recursive propagation to parent nodes, applying the upward propagation factor α in each step until reaching the root node or falling below the propagation threshold κ when such a threshold is used. If the same evidence would be annotated in the same annotation index entry multiple times, the highest propagated score is kept. The computational complexity of propagating the evidence for all observations is O(osd), where o is the number of observations, s the number of senses each concept maps to, and d the depth of the taxonomy. Given the populated annotation index obtained by the previous algorithm, the evidence for a particular sense can be queried using Algorithm 2. The key idea of this algorithm is to traverse the taxonomy from the query sense to the root, gathering evidence and attenuating it by the downward propagation factor β in each step. As in the upward propagation step, this process can also be stopped if the attenuation falls below an optional propagation threshold κ. This algorithm returns an annotation list, a structure containing all the evidence from all the observations that should have been propagated to the query sense node. The computational complexity of this step is also O(ods), since we must traverse the taxonomy up to its root, potentially gathering evidence from all the available observations. In practice, this is avoided by setting a propagation threshold slightly above zero, which speeds up the computation at the cost of small rounding errors. Finally, in the third step, all the gathered evidence is combined to compute the current degree of belief in both senses being connected by the specified semantic relation. This step is described in Algorithm 3. The function weight(id) returns the weight initially assigned to the relation with the id passed as argument. The function sense_cnt(id) returns the number of possible senses for the lemma that generated the annotation with the id passed as argument. All the evidence contained in the annotation lists for the source and the target senses is combined, taking its weight into consideration, as previously described.


Algorithm 1 Upward evidence propagation process.
Input: Taxonomy T, upward propagation factor α, observed ambiguous set O, optional propagation threshold κ
Output: Annotation index A

  A ← Create annotation index
  for o = (o^s, type, o^t) in O do
      id_o ← Create unique id
      for s in senses(o^s) do
          id_u ← Create unique id
          PROPAGATION(s, id_o, id_u, type, source, 1)
      end for
      for s in senses(o^t) do
          id_u ← Create unique id
          PROPAGATION(s, id_o, id_u, type, target, 1)
      end for
  end for
  return A

  procedure PROPAGATION(s, id_o, id_u, t, d, v)
      if v ≥ κ and v ≥ A[s][t][d][id_o][id_u] then
          A[s][t][d][id_o][id_u] ← v
          for p in parents(T, s) do
              PROPAGATION(p, id_o, id_u, t, d, v · α)
          end for
      end if
  end procedure

Both lists are transformed into a data structure containing the probability of that assignment for each observation. This structure is then iterated, combining and adding compatible evidence in order to obtain the final score. For symmetrical relations, the combination is also performed with source and target swapped, since observing the relation in one direction is equivalent to observing it in the opposite direction. The computational complexity of this step is O(os), although in practice we only work with a subset of all the observations. The computational complexity of initializing the annotation index and making n predictions is bounded by O(odsn). The amount of computation required in practice is much lower since, most of the time, only a small part of the taxonomy is traversed and only a fraction of all observations are gathered. Following the algorithms described in this section, the current belief in two specific senses being connected by a specific semantic relation, given a set of observations, can be updated. In order to disambiguate a specific instance, the algorithm must be applied to every pair of senses that the two concepts in the relation map to, obtaining a belief score for each pair. Finally, sense assignments can be ranked according to these scores.


Algorithm 2 Evidence gathering query process.
Input: Taxonomy T, annotation index A, downward propagation factor β, sense sense, type type, side side, optional propagation threshold κ
Output: Annotation list L

  L ← Create annotation list
  GATHER(L, sense, type, side, 1)
  return L

  procedure GATHER(L, s, t, d, γ)
      for id_o in A[s][t][d] do
          for id_u in A[s][t][d][id_o] do
              v ← A[s][t][d][id_o][id_u] · γ
              if v ≥ κ and v ≥ L[id_o][id_u] then
                  L[id_o][id_u] ← v
              end if
          end for
      end for
      for p in parents(T, s) do
          GATHER(L, p, t, d, γ · β)
      end for
  end procedure
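For illustration, the following self-contained Python sketch condenses the upward propagation of Algorithm 1 and the gathering step of Algorithm 2 (Algorithm 3 is shown after the illustrative example). It is only a minimal reading of the pseudocode under simplifying assumptions: the annotation index is a nested dictionary keyed by sense, relation type, side, observation id, and mapping id, and the toy taxonomy and sense inventory are hypothetical.

from collections import defaultdict
from itertools import count

ALPHA, BETA, KAPPA = 0.2, 0.2, 1e-4

# Toy taxonomy: sense -> set of parents (is-a targets). Hypothetical identifiers.
IS_A = {
    "bus_vehicle": {"entity"}, "bus_circuit": {"entity"},
    "omnibus_vehicle": {"bus_vehicle"}, "omnibus_articles": {"entity"},
    "trunk_tree": {"entity"}, "compartment_luggage": {"entity"},
    "boot_footwear": {"entity"}, "entity": set(),
}
SENSES = {
    "bus": ["bus_vehicle", "bus_circuit"],
    "omnibus": ["omnibus_vehicle", "omnibus_articles"],
    "trunk": ["trunk_tree", "compartment_luggage"],
    "boot": ["compartment_luggage", "boot_footwear"],
}

def new_index():
    # index[sense][rel_type][side][id_o][id_u] = best propagated score
    return defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(dict))))

def propagate(index, sense, id_o, id_u, rel_type, side, value):
    """Recursive upward propagation of one annotation (PROPAGATION in Algorithm 1)."""
    slot = index[sense][rel_type][side][id_o]
    if value >= KAPPA and value >= slot.get(id_u, 0.0):
        slot[id_u] = value
        for parent in IS_A.get(sense, ()):
            propagate(index, parent, id_o, id_u, rel_type, side, value * ALPHA)

def build_index(observations):
    """Algorithm 1: annotate the taxonomy with all ambiguous observations."""
    index, ids = new_index(), count()
    for src_lemma, rel_type, tgt_lemma in observations:
        id_o = next(ids)
        for s in SENSES[src_lemma]:
            propagate(index, s, id_o, next(ids), rel_type, "source", 1.0)
        for s in SENSES[tgt_lemma]:
            propagate(index, s, id_o, next(ids), rel_type, "target", 1.0)
    return index

def gather(index, sense, rel_type, side, attenuation=1.0, out=None):
    """Algorithm 2: collect evidence reachable from `sense` towards the root."""
    out = out if out is not None else defaultdict(dict)
    for id_o, anns in index[sense][rel_type][side].items():
        for id_u, score in anns.items():
            v = score * attenuation
            if v >= KAPPA and v >= out[id_o].get(id_u, 0.0):
                out[id_o][id_u] = v
    for parent in IS_A.get(sense, ()):
        gather(index, parent, rel_type, side, attenuation * BETA, out)
    return out

if __name__ == "__main__":
    idx = build_index([("bus", "HasA", "trunk"), ("omnibus", "HasA", "boot")])
    print(dict(gather(idx, "omnibus_vehicle", "HasA", "source")))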

3.4 An Illustrative Example

In this section, we show a simple example of how our approach works. We use the example introduced in Section 3.1 to motivate our approach, where the goal is to disambiguate the relations “a bus has a trunk” and “an omnibus has a boot”. Let us assume that we have the simplified sense taxonomy shown in Figure 2. To represent an annotation, we use the notation (unique id, relation id, side, score), where the side is denoted by s for source annotations and by t for target annotations. We introduce the annotations corresponding to the first relation, with id 1. This implies that bus_circuit and bus_vehicle are annotated with (1, 1, s, 1) and (2, 1, s, 1), respectively, and

Figure 2: Simplified sense taxonomy for the example, where nodes represent senses and arrows represent is-a relations: entity is the root, its children are bus_circuit, bus_vehicle, omnibus_articles, trunk_tree, the compartment sense shared by trunk and boot, and boot_footwear, and omnibus_vehicle is a child of bus_vehicle. For each node, the larger label represents the sense lemma and the smaller annotation indicates which specific sense is associated with the node.


Algorithm 3 Evidence aggregation.
Input: Source annotation list L_s, target annotation list L_t, type type
Output: Relative degree of belief score

  M_s ← TRANSFORM(L_s)
  M_t ← TRANSFORM(L_t)
  invScore ← AGGREGATE(M_s, M_t)
  if type is symmetrical then
      invScore ← AGGREGATE(M_t, M_s)
  end if
  score ← 1 − invScore
  return score

  procedure TRANSFORM(L)
      M ← Create hash table
      for id_o in L do
          for id_u in L[id_o] do
              v ← weight(id_o) · L[id_o][id_u] / sense_cnt(id_u)
              M[id_o] ← M[id_o] + v
          end for
      end for
      return M
  end procedure

  procedure AGGREGATE(M_s, M_t)
      locInvScore ← 1
      for id_o in M_s do
          v ← 1 − M_s[id_o] · M_t[id_o]
          locInvScore ← locInvScore · v
      end for
      return locInvScore
  end procedure
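A minimal, self-contained Python reading of Algorithm 3 is sketched below. The annotation lists are represented as nested dictionaries of the form {observation id: {mapping id: score}}, as produced by the gathering sketch shown after Algorithm 2; the uniform weights and sense counts used here are hypothetical placeholders.

def transform(annotation_list, weight, sense_cnt):
    """TRANSFORM: collapse an annotation list into per-observation probabilities."""
    m = {}
    for id_o, anns in annotation_list.items():
        for id_u, score in anns.items():
            m[id_o] = m.get(id_o, 0.0) + weight(id_o) * score / sense_cnt(id_u)
    return m

def aggregate(m_source, m_target):
    """AGGREGATE: product of (1 - source * target) over shared observation ids."""
    inv = 1.0
    for id_o, p_src in m_source.items():
        inv *= 1.0 - p_src * m_target.get(id_o, 0.0)
    return inv

def belief(l_source, l_target, weight, sense_cnt, symmetrical=False):
    """Algorithm 3: relative degree of belief for a candidate sense assignment."""
    m_s = transform(l_source, weight, sense_cnt)
    m_t = transform(l_target, weight, sense_cnt)
    inv = aggregate(m_s, m_t)
    if symmetrical:
        inv = aggregate(m_t, m_s)  # literal reading of the pseudocode's swap
    return 1.0 - inv

if __name__ == "__main__":
    # Hypothetical gathered annotation lists for a single observation (id 0).
    l_source = {0: {10: 1.0, 11: 0.2}}
    l_target = {0: {12: 1.0}}
    uniform_weight = lambda id_o: 1.0   # no confidence scores available
    two_senses = lambda id_u: 2         # each lemma maps to two senses here
    print(belief(l_source, l_target, uniform_weight, two_senses))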

trunk_tree and trunk_compartment are annotated with (3, 1, t, 1) and (4, 1, t, 1), respectively. These annotations are propagated upwards, resulting in their storage in entity with score α, the result of multiplying the upward propagation factor α with their own scores, which are initially equal to 1. We proceed with the second relation, with id 2. This step involves annotating omnibus_articles and omnibus_vehicle with source annotations (5, 2, s, 1) and (6, 2, s, 1), respectively, and boot_compartment and boot_footwear with target annotations (7, 2, t, 1) and (8, 2, t, 1), respectively. The annotations in omnibus_articles, boot_compartment, and boot_footwear are propagated to entity exactly as in the previous relation. However, the annotation in omnibus_vehicle is propagated to bus_vehicle with score α and, finally, to entity with score α². To summarize, the annotations currently stored in the annotation index are:

• entity: (1, 1, s, α), (2, 1, s, α), (3, 1, t, α), (4, 1, t, α), (5, 2, s, α), (6, 2, s, α²), (7, 2, t, α), (8, 2, t, α)

• bus_circuit: (1, 1, s, 1)

• bus_vehicle: (2, 1, s, 1), (6, 2, s, α)


• omnibus_articles: (5, 2, s, 1)

• omnibus_vehicle: (6, 2, s, 1)

• trunk_tree: (3, 1, t, 1)

• trunk/boot_compartment: (4, 1, t, 1), (7, 2, t, 1)

• boot_footwear: (8, 2, t, 1)

Once our taxonomy has been annotated, we can query which relations are more likely. For example, we may want to compute the probability P((omnibus_vehicle, HasA, boot_compartment) | O) and the probability P((omnibus_articles, HasA, boot_footwear) | O) using the currently stored annotations. To compute the first probability, we gather the annotations that should have been propagated to omnibus_vehicle, which include its own annotations, the annotations from bus_vehicle, and the annotations from entity. The gathered source annotations are (6, 2, s, 1) from itself, (2, 1, s, β) from bus_vehicle, and (1, 1, s, αβ) and (5, 2, s, αβ) from entity. For boot_compartment, the gathered target annotations are (4, 1, t, 1) and (7, 2, t, 1) from itself, and (3, 1, t, αβ) and (8, 2, t, αβ) from entity. To compute the probability of the relation, we must combine the pairs of source and target annotations with the same relation id, which are the following pairs: (1, 3), (1, 4), (2, 3), (2, 4), (5, 7), (5, 8), (6, 7), and (6, 8). Using the evidence aggregation expression, assuming C → 1, α = 0.2, and β = 0.2, we can compute the relation probability as

$$P((\text{omnibus}_{vehicle}, \text{HasA}, \text{boot}_{compartment}) \mid O) =$$
$$1 - C\left(1 - \tfrac{\alpha^2\beta^2}{4}\right)\left(1 - \tfrac{\alpha\beta}{4}\right)\left(1 - \tfrac{\alpha\beta^2}{4}\right)\left(1 - \tfrac{\beta}{4}\right)\left(1 - \tfrac{\alpha\beta}{4}\right)\left(1 - \tfrac{\alpha^2\beta^2}{4}\right)\left(1 - \tfrac{1}{4}\right)\left(1 - \tfrac{\alpha\beta}{4}\right) = 0.76$$

To compute the second probability, we proceed in the same way. First, we gather the source annotations for omnibus_articles: (5, 2, s, 1) from itself, and (1, 1, s, αβ), (2, 1, s, αβ), and (6, 2, s, α²β) from entity. Then, we gather the target annotations for boot_footwear: (8, 2, t, 1) from itself, and (3, 1, t, αβ), (4, 1, t, αβ), and (7, 2, t, αβ) from entity. Finally, we combine the compatible annotations using the evidence aggregation expression as

$$P((\text{omnibus}_{articles}, \text{HasA}, \text{boot}_{footwear}) \mid O) =$$
$$1 - C\left(1 - \tfrac{\alpha^2\beta^2}{4}\right)\left(1 - \tfrac{\alpha^2\beta^2}{4}\right)\left(1 - \tfrac{\alpha^2\beta^2}{4}\right)\left(1 - \tfrac{\alpha^2\beta^2}{4}\right)\left(1 - \tfrac{\alpha\beta}{4}\right)\left(1 - \tfrac{1}{4}\right)\left(1 - \tfrac{\alpha^3\beta^2}{4}\right)\left(1 - \tfrac{\alpha^2\beta}{4}\right) = 0.26$$

Given these probabilities, we can see that omnibus_vehicle and boot_compartment is a more likely sense assignment for the relation “an omnibus has a boot” than omnibus_articles and boot_footwear, given the current observations.


4 Evaluation Procedure

Due to the novel nature of the proposed technique, we did not find existing evaluation strategies suitable for analyzing results for the problem of disambiguating semantic relations. We therefore implemented different evaluation procedures, described in the following sections, to measure the performance of the compared techniques in different tasks related to the disambiguation of semantic relations. We used WordNet 3.0 in our experiments, which has become a de facto standard in the field of WSD [31]. We obtained our taxonomy for nouns and verbs by considering hypernymy and instance hypernymy as is-a relations. Since verbs form unconnected hierarchies within WordNet, we created a virtual root node as the parent of all the root nodes of these verb hierarchies. Adjectives and adverbs are omitted in this work, since WordNet does not provide taxonomic information for them. In the following section, we describe in detail the performance metrics used in this manuscript to compare the different disambiguation techniques evaluated in the proposed tasks. Next, we introduce the baseline and state-of-the-art methods included in our comparison.

4.1 Performance Metrics

We used two performance scores to evaluate the results obtained by each method in our experiments. First, we used the Area Under the ROC Curve (AUC) [43], which is equivalent to the probability that a randomly-chosen correct sense assignment is ranked above a randomly-chosen incorrect sense assignment [44]. It is computed as

$$AUC = \frac{\sum_{p \in P} \sum_{n \in N} I(rank(p) > rank(n))}{|P|\,|N|},$$

where P is the set of correct sense assignments, N is the set of incorrect sense assignments, and I is the indicator function, which equals 1 if the condition holds and 0 otherwise. We also used the Mean Reciprocal Rank (MRR) [45], commonly applied to evaluate information retrieval techniques, as a second performance measure. The reciprocal rank of an example is the multiplicative inverse of the rank of the first correct pair of senses, where ranks are assigned by sorting pairs of senses according to the score computed by each evaluated method. The MRR is then computed as

$$MRR = \frac{1}{|Q|} \sum_{i \in Q} \frac{1}{rank(p_i^{(1)})},$$

where Q is the set of examples in the test set and p_i^{(1)} is the first correct sense assignment for the i-th example in the test set.
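The two metrics are straightforward to compute from ranked candidate lists. The sketch below is a minimal Python implementation of the two formulas above, assuming each test example is given as a list of (score, is_correct) pairs; it is provided only for illustration and ignores the tie-handling details of a full ROC analysis.

def auc(positives, negatives, rank):
    """AUC: fraction of (correct, incorrect) pairs where the correct one ranks higher."""
    wins = sum(1 for p in positives for n in negatives if rank(p) > rank(n))
    return wins / (len(positives) * len(negatives))

def mrr(examples):
    """MRR over a test set; each example is a list of (score, is_correct) pairs."""
    total = 0.0
    for candidates in examples:
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        first_correct = next(i for i, (_, ok) in enumerate(ranked, start=1) if ok)
        total += 1.0 / first_correct
    return total / len(examples)

if __name__ == "__main__":
    # Hypothetical example: one test instance with four candidate sense assignments.
    example = [(0.76, True), (0.31, False), (0.26, False), (0.10, False)]
    pos = [c for c in example if c[1]]
    neg = [c for c in example if not c[1]]
    print("AUC:", auc(pos, neg, lambda c: c[0]))   # higher score = higher rank
    print("MRR:", mrr([example]))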


4.2 Alternative Methods

In order to evaluate the performance of our proposed method, we considered methods that could be used in our current evaluation settings. Since we do not require labeled data, supervised and semi-supervised methods were left out. Previously proposed unsupervised approaches could not be applied either, since they do not map senses to a reference inventory. Most knowledge-based techniques are omitted, since they require unambiguous expert-gathered information beyond the taxonomy of senses. Therefore, we compared our method with the following techniques:

• Random (RAND): We included a completely random approach to measure the gain each method provides over random sense assignment. The scores obtained by this method are the average of the scores of all possible sense assignments.

• Most frequent sense (MFS): Most frequent sense assignment is a reasonable strategy that has been shown to perform well in practice. WordNet provides frequency data for synsets and, even though our method ignores frequency data, we can easily devise a baseline method where

  score = rank(source) + rank(target),

with rank being the position in the list of senses sorted by frequency (highest ranks for the most frequent senses).

• Closest senses (CS): Most previous knowledge-based methods assume that the proper senses are those that minimize a specific distance-related measure between senses in a given semantic graph. We consider the most basic distance-related score, which can be defined as

  score = −dist(source, target),

where dist computes the distance between the source and target senses in the taxonomy (a minimal sketch of the MFS and CS baselines is shown after this list).

• Novel approach to a semantically-aware representation of items (NASARI): A more sophisticated way to compute how close two senses are, compared to the previously mentioned strategy, is to use vector-based embeddings for word senses. We use the NASARI embedded vector representations, which consist of 300 dimensions. These vector embeddings are the result of gathering information from different sources for BabelNet synsets and applying an iterative algorithm to obtain the final embedding vector for each synset [46]. BabelNet synsets can be directly mapped to WordNet synsets, since BabelNet provides the mapping between both resources. As suggested by the authors of NASARI, we use the cosine distance to measure how far apart two synsets are. A major limitation of NASARI is that it was designed to compute embeddings for noun concepts, but not for verbs. Therefore, it can only be used in tests involving nouns. In the cases where the vector for a sense was not available, the vector of the closest hypernym with an available embedding was used instead.
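As referenced in the list above, the MFS and CS baselines can be sketched in a few lines of Python on top of NLTK's WordNet interface. The sketch assumes, as NLTK documents, that wn.synsets returns senses in WordNet frequency order, and it uses the shortest-path distance in the hypernym graph as the dist function; both choices are illustrative approximations of the baselines described in the text.

from nltk.corpus import wordnet as wn

def mfs_score(src_synset, tgt_synset, src_lemma, tgt_lemma, pos=wn.NOUN):
    """MFS baseline: sum of frequency-based ranks (higher rank = more frequent)."""
    src_list = wn.synsets(src_lemma, pos=pos)
    tgt_list = wn.synsets(tgt_lemma, pos=pos)
    # Position 0 is the most frequent sense, so invert positions into ranks.
    src_rank = len(src_list) - src_list.index(src_synset)
    tgt_rank = len(tgt_list) - tgt_list.index(tgt_synset)
    return src_rank + tgt_rank

def cs_score(src_synset, tgt_synset):
    """CS baseline: negated shortest-path distance in the WordNet taxonomy."""
    dist = src_synset.shortest_path_distance(tgt_synset)
    return -dist if dist is not None else float("-inf")

if __name__ == "__main__":
    bus = wn.synsets("bus", pos=wn.NOUN)[0]
    trunk = wn.synsets("trunk", pos=wn.NOUN)[0]
    print("MFS:", mfs_score(bus, trunk, "bus", "trunk"))
    print("CS :", cs_score(bus, trunk))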


5 Experiments with WordNet

In this section, we evaluate the performance of the proposed technique for the disambiguation of semantic relations. For these experiments, we use some annotated semantic relations between senses included in WordNet. We considered the following semantic relations: part holonymy / meronymy (noun to noun), substance holonymy / meronymy (noun to noun), member holonymy / meronymy (noun to noun), antonymy (noun to noun, symmetric), entailment (verb to verb), cause (verb to verb), and antonymy (verb to verb, symmetric). Derivation and grouping relations were left out. We created our training set by considering each possible combination of lemmas for each given semantic relation. We initially assume that all combinations of a semantic relation have the same probability of being observed; thus, the generated observations were weighted to sum 1 for each semantic relation. The test set was similarly created by considering each possible combination of lemmas for each given semantic relation, but in this case the observations were filtered to include only those where the lemmas on both sides of the relation mapped to more than one sense, thereby ensuring the ambiguity of the test instances. Instances in the test set were also weighted to sum 1 for each relation, so that no relation had more influence on the validation regardless of the number of test instances derived from it. We performed a cross-validation over the generated data, using training instances as the ambiguous observed set of knowledge to predict senses for the test instances.
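A minimal sketch of how such weighted ambiguous observations could be generated from WordNet with NLTK is shown below for the part holonymy relation; it simply expands every lemma combination of every (synset, part holonym) pair and assigns uniform weights that sum to 1 over the relation. The helper names are hypothetical and the sketch glosses over details such as the train/test split and multiword lemmas.

from nltk.corpus import wordnet as wn

def part_holonymy_observations():
    """Ambiguous (source lemma, 'PartHolonym', target lemma, weight) observations."""
    combos = []
    for synset in wn.all_synsets("n"):
        for holonym in synset.part_holonyms():
            for src in synset.lemma_names():
                for tgt in holonym.lemma_names():
                    combos.append((src, "PartHolonym", tgt))
    weight = 1.0 / len(combos)   # observations sum to 1 for this relation
    return [(s, r, t, weight) for (s, r, t) in combos]

def is_ambiguous(obs):
    """Test-set filter: both lemmas must map to more than one noun sense."""
    src, _, tgt, _ = obs
    return len(wn.synsets(src, pos=wn.NOUN)) > 1 and len(wn.synsets(tgt, pos=wn.NOUN)) > 1

if __name__ == "__main__":
    observations = part_holonymy_observations()
    ambiguous = [o for o in observations if is_ambiguous(o)]
    print(len(observations), "observations,", len(ambiguous), "fully ambiguous")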

For the optional threshold, we chose κ = 10^-4 as a reasonable value. The results obtained by each method for each semantic relation in WordNet are shown in Table 1. As can be seen, our method clearly outperforms the other methods for all relations, often by a large margin. Our technique is only beaten in terms of AUC by the closest senses baseline for the noun antonymy relation, where this baseline obtains remarkable performance scores, especially when compared to its performance for other relations, where it is only slightly better than random sense assignment (or even worse, as for substance holonymy). The reason why the closest-distance heuristic works fairly well for antonyms is probably the tendency of antonymous senses to be semantically very similar and, thus, close in the taxonomy. However, the closest senses method obtains much worse performance scores for other relations, where the related elements need not be semantically similar according to the taxonomy (e.g., a Spaniard is a native or an inhabitant of Spain, but the former refers to a person and the latter to a country, which are far apart in the semantic taxonomy). Another limitation of distance-based techniques (CS and NASARI) is their inability to estimate the direction of the relation, since they are symmetric scores. For this reason, both methods obtain especially poor results for the substance relation, where in many cases both sides of the relation share the same lemma. These examples highlight the limitations of distance-based methods. The most frequent sense baseline also obtains slightly better results than the random method, yet clearly worse than the results obtained by our evidence propagation method. In order to test whether our approach is robust with respect to the selection of the α and β parameters, we performed a robustness evaluation experiment. This experiment consisted of fixing one of the two parameters to 0.2 and measuring the performance obtained by varying the other


Relation      #Test  #Training  Method   AUC     MRR
Part          5123   38510      RAND     0.5000  0.3151
                                MFS      0.4991  0.3569
                                CS       0.5887  0.4917
                                NASARI   0.6998  0.5675
                                EPROP    0.9139  0.7997
Substance     337    2810       RAND     0.5000  0.3769
                                MFS      0.5001  0.4637
                                CS       0.3728  0.3800
                                NASARI   0.4536  0.4244
                                EPROP    0.9025  0.8217
Member        633    60486      RAND     0.5000  0.3529
                                MFS      0.4879  0.3964
                                CS       0.4943  0.4363
                                NASARI   0.7324  0.6262
                                EPROP    0.9139  0.8346
Antonym (n)   1050   3851       RAND     0.5000  0.3253
                                MFS      0.5011  0.3871
                                CS       0.8824  0.8013
                                NASARI   0.7747  0.6335
                                EPROP    0.8721  0.7748
Entailment    1399   2352       RAND     0.5000  0.1870
                                MFS      0.6170  0.2970
                                CS       0.5937  0.3285
                                NASARI   N/A     N/A
                                EPROP    0.9148  0.6530
Cause         987    1254       RAND     0.5000  0.1958
                                MFS      0.5501  0.2343
                                CS       0.4237  0.1675
                                NASARI   N/A     N/A
                                EPROP    0.8421  0.5409
Antonym (v)   1968   3784       RAND     0.5000  0.2386
                                MFS      0.5222  0.2887
                                CS       0.6307  0.4493
                                NASARI   N/A     N/A
                                EPROP    0.8524  0.6657

Table 1: Statistics and performance in the WordNet disambiguation task.


parameter. Robustness results are shown in Figure 3. The algorithm's performance remains reasonably consistent as each parameter varies, especially for the noun-related semantic relations. These experiments show that our method not only obtains better results than random sense assignment, but also clearly outperforms the other approaches, often by a large margin, achieving remarkable AUC and MRR scores. These results show that our technique is not only reasonable in theory, but also obtains good results when applied in practice.

Figure 3: Robustness of EPROP for different semantic relations, obtained by fixing one of its propagation parameters to 0.2 and varying the other (AUC and MRR versus the alpha and beta parameter values for each relation).

6 Disambiguating ConceptNet

In this experiment, we try to disambiguate a popular ambiguous semantic network relying only on the WordNet taxonomy and the ambiguous facts stored in the semantic network itself. The semantic network to disambiguate is ConceptNet 5 [36], whose knowledge has been collected from different sources, including crowdsourced resources, expert-created resources, and games with a purpose, among others. Except for instances gathered from


WordNet, most concepts in ConceptNet are not mapped to a sense inventory, which limits its applicability to NLP-related problems.

Relation     #Test  #Training  Method   AUC     MRR
AtLocation   50     21242      RAND     0.5000  0.2791
                               MFS      0.7355  0.6558
                               CS       0.6643  0.5904
                               NASARI   0.8195  0.6933
                               EPROP    0.8725  0.7692
PartOf       50     10799      RAND     0.5000  0.2458
                               MFS      0.6296  0.4514
                               CS       0.7240  0.5636
                               NASARI   0.9024  0.7511
                               EPROP    0.8160  0.6459
HasA         50     1307       RAND     0.5000  0.2026
                               MFS      0.6571  0.4884
                               CS       0.6957  0.4760
                               NASARI   0.9033  0.7525
                               EPROP    0.8693  0.5784
MadeOf       50     1478       RAND     0.5000  0.2784
                               MFS      0.7387  0.6259
                               CS       0.6792  0.5049
                               NASARI   0.8689  0.6851
                               EPROP    0.8900  0.7268

Table 2: Statistics and performance in the ConceptNet disambiguation task.

In order to estimate the precision of the considered methods in the disambiguation of some relations in ConceptNet, we manually annotated 50 randomly-chosen instances for each of the following semantic relations: AtLocation (noun to noun), PartOf (noun to noun), HasA (noun to noun), and MadeOf (noun to noun). Other types of relations were left out, including relations where few instances actually mapped to WordNet senses, relations involving adjectives and adverbs (WordNet does not provide a taxonomy for these), and relations involving verbs, since we found that properly annotating these instances is very hard and error-prone due to the extremely fine-grained definition of verb senses in WordNet. Only instances where both concepts could be directly mapped to WordNet were considered for training and validation, and only instances where both concepts mapped to more than one sense were selected for validation. Some concepts were annotated with more than one sense due to the fine-grained nature of WordNet [47]. Despite the aforementioned limitations, we think that the included relations are representative enough

to draw some initial conclusions, which was the goal of our experiment¹. We set the same parameters used in our experiments with WordNet, described in the previous section, and we compared the results obtained by our technique with those obtained by the methods used in our previous batch of experiments. Table 2 summarizes our results. The ConceptNet experiment shows that the NASARI distance-based approach and our method achieve better performance scores than the other methods in the disambiguation of the hand-annotated instances. It can be seen that the most frequent sense approach performs better in this disambiguation task than in the WordNet disambiguation task. The closest senses heuristic also obtains slightly better results than in the previous task. The results obtained by our method, however, are quite consistent with those obtained in our WordNet experiments. As expected, NASARI improves on the results it obtained in the WordNet test. In fact, NASARI and EPROP obtain the best results in our experiments with ConceptNet: NASARI obtains the best results for the PartOf and HasA relations, whereas EPROP obtains the best results for the AtLocation and MadeOf relations.

7 Application to WSD

In the previous experiments, we have shown how our technique can be satisfactorily applied to disambiguate ambiguous instances of relational knowledge bases using only the WordNet taxonomy. In this section, we show how these disambiguated instances can be used to improve the results obtained by knowledge-based WSD techniques. In this additional experiment, we show how relations extracted using our approach can improve the results obtained by Personalized PageRank (PPR, [23]), also known as UKB, with its original default configuration². PPR uses a knowledge graph built from WordNet relations, which comprises 367576 semantic relations. In order to keep the experimentation simple, we augmented this graph by disambiguating only the ConceptNet relation RelatedTo between nouns, using the parameters and procedures described in the previous section, keeping those relations with a probability equal to or higher than 0.8 (chosen as a reasonable value), and omitting non-ambiguous relations. This led to the inclusion of 31624 new unique semantic relations using EPROP. Personalized PageRank using our extended knowledge graph is referred to as PPR+EPROP. We also consider the extended knowledge graph built by the original PPR authors from the annotated glosses of senses, where a relation is added from a sense to all the senses that appear in its description; this version of the algorithm is called PPR+glosses. Finally, since NASARI obtained competitive results in the ConceptNet test, we have also applied NASARI to extend the knowledge graph of PPR, as we did with our proposed method. We call this approach PPR+NASARI. In order to evaluate the performance of the compared approaches, we used the

¹ The hand-annotated instances we used in our experiment are available at http://noesis.ikor.org/datasets/sense-disambiguation.
² We used version 3.0 of the original implementation of PPR, which can be downloaded from http://ixa2.si.ehu.es/ukb/.


validation framework provided by [48]. For a better illustration of the obtained performance, we include the state-of-the-art knowledge-based WSD techniques considered in that work, where more detailed descriptions of the different WSD methods can be found. The validation framework evaluates the current state-of-the-art WSD techniques on senseval and semeval tasks from different years, datasets which are considered de facto standards in the evaluation of WSD techniques. We used the results and the scorer provided by this framework for all the included methods except for the PPR-based techniques, which were recomputed to carry out our experimentation. Since, for simplicity, we used EPROP only to add relations between nouns, we restricted our experimentation to noun disambiguation.
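As an illustration of how the graph augmentation step described above could look in code, the following sketch filters EPROP-disambiguated RelatedTo instances by the 0.8 probability threshold and deduplicates them into undirected synset pairs before they are appended to the PPR knowledge graph. The input format is hypothetical and the actual file format expected by the UKB implementation is not described here.

def select_new_relations(disambiguated, threshold=0.8):
    """Keep high-confidence, ambiguous RelatedTo assignments as unique synset pairs.

    `disambiguated` is assumed to be an iterable of tuples
    (source_synset_id, target_synset_id, probability, was_ambiguous).
    """
    selected = set()
    for src, tgt, prob, was_ambiguous in disambiguated:
        if not was_ambiguous:        # non-ambiguous relations are omitted
            continue
        if prob < threshold or src == tgt:
            continue
        selected.add(tuple(sorted((src, tgt))))   # deduplicate undirected pairs
    return selected

if __name__ == "__main__":
    # Hypothetical disambiguated instances (synset ids are placeholders).
    sample = [
        ("00000001-n", "00000002-n", 0.91, True),
        ("00000002-n", "00000001-n", 0.85, True),   # duplicate in reverse order
        ("00000003-n", "00000004-n", 0.42, True),   # below the threshold
        ("00000005-n", "00000006-n", 0.99, False),  # not ambiguous, omitted
    ]
    for pair in sorted(select_new_relations(sample)):
        print(pair)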

Method              senseval2  senseval3  semeval2007  semeval2013  semeval2015  Average
Lesk_ext            60.37      49.85      40.70        53.64        53.99        51.71
Lesk_ext + emb      74.58      72.67      66.04        66.24        67.80        69.46
WordNet 1st sense   72.05      72.00      65.41        62.96        66.29        67.74
Babelfy             74.02      66.67      61.01        66.42        69.87        67.60
PPR                 77.39      71.22      65.41        64.31        69.11        69.49
PPR+glosses         75.05      71.56      65.41        64.48        69.30        69.16
PPR+NASARI          75.99      72.11      65.41        64.86        68.93        69.46
PPR+EPROP           77.49      71.00      66.04        64.37        69.68        69.72

Table 3: F1 scores of different state-of-the-art knowledge-based WSD techniques in the task of noun disambiguation on senseval and semeval datasets.

The F1 scores obtained for each method and dataset combination are shown in Table 3. We can see that, with the exception of Lesk_ext, all methods obtain close F1 scores despite the fact that they follow different algorithmic approaches and use different knowledge-based information. Once Lesk_ext, which offers worse results despite its relatively widespread use in practice (probably due to its simplicity), is left out, a non-parametric Friedman test with aligned ranks over the F1 scores for the different validation datasets shows no significant differences. Even though they follow very different approaches, the different WSD techniques exhibit the typical behaviour of machine learning algorithms, a well-known phenomenon informally known as the 'no free lunch theorem' [49]. No significant difference, however, does not mean unimportant. PPR is the approach that obtains the best results. The addition of new semantic relations does not guarantee improved results, as shown by the inclusion of the relations extracted from the annotated sense glosses and the relations extracted by the NASARI technique. However, the semantic relations obtained using our approach led to the best overall results when the average F1 score is considered. These results suggest that EPROP has the potential to improve the performance of knowledge-based WSD techniques by allowing them to automatically expand their knowledge through the disambiguation of available large-scale relational knowledge bases.


8 Conclusions

Our main contribution in this work is the proposal of a novel technique for WSD in ambiguous semantic relations. The proposed technique outperformed the baseline techniques and a state-of-the-art method in most of our experiments, obtaining remarkable results and hinting at its potential for the WSD problem. Since our technique only requires a taxonomy of senses and non-annotated data to predict the right senses of the terms involved in a semantic relation, it can be used to automatically enhance the knowledge bases used by knowledge-based WSD techniques, with the potential to improve their performance, as suggested by our experiments. The unsupervised nature of our approach alleviates the knowledge acquisition bottleneck, a major problem in the field of NLP. This work serves as an introduction to the proposed evidence aggregation technique, which, as described, still has several practical limitations. WordNet does not provide taxonomies for adjectives and adverbs, limiting the applicability of our technique to nouns and verbs unless alternative taxonomies are made widely available. Future work will include attempts to build these taxonomies from alternative data sources and to test how well they perform when used in conjunction with our technique. In this work, we have shown, through different experiments, that the proposed technique can improve the performance of WSD techniques, with potential applications to real-world WSD problems. We plan to extend our WSD approach to make it widely available as part of a toolbox for NLP and text mining practitioners.

Acknowledgments

This work is partially supported by the Spanish Ministry of Economy and the European Regional Development Fund (FEDER), under grant TIN2012-36951 and the program “Ayudas para contratos predoctorales para la formación de doctores 2013” (grant BES-2013-064699).

References

[1] Mark Stevenson and Yorick Wilks. Word sense disambiguation. The Oxford Handbook of Comp. Linguistics, pages 249–265, 2003. [2] Roberto Navigli. A quick tour of word sense disambiguation, induction and related approaches. In Proceedings of the 38th Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2012, pages 115–129. Springer, 2012. [3] Jason C. Hung, Ching-Sheng Wang, Che-Yu Yang, Mao-Shuen Chiu, and George Yee. Applying word sense disambiguation to question answering system for e-Learning. In Proceedings of the 19th International Conference on Advanced Information Networking and Applications, (AINA 2005), volume 1, pages 157–162. IEEE, 2005.


[4] Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open question answering over curated and extracted knowledge bases. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, pages 1156–1165, 2014.

[5] Vassiliki Rentoumi, George Giannakopoulos, Vangelis Karkaletsis, and George A Vouros. Sentiment analysis of figurative language using a word sense disambiguation approach. In Recent Advances in Natural Language Processing, RANLP 2009, pages 370–375, 2009.

[6] Erik Cambria, Björn W. Schuller, Yunqing Xia, and Catherine Havasi. New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, 28(2):15–21, 2013.

[7] Eneko Agirre and Philip Edmonds. Word sense disambiguation: Algorithms and applications, volume 33. Springer Science & Business Media, 2007.

[8] Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):10, 2009.

[9] Yoong Keok Lee, Hwee Tou Ng, and Tee Kiah Chia. Supervised word sense disambiguation with support vector machines and multiple knowledge sources. In Senseval-3: third international workshop on the evaluation of systems for the semantic analysis of text, pages 137–140, 2004.

[10] Yee Seng Chan, Hwee Tou Ng, and Zhi Zhong. NUS-PT: exploiting parallel texts for word sense disambiguation in the English all-words tasks. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval ACL 2007, pages 253–256. Association for Computational Linguistics, 2007.

[11] Douglas B. Lenat, Mayank Prakash, and Mary Shepherd. CYC: using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks. AI Magazine, 6(4):65–85, 1986.

[12] William A. Gale, Kenneth W. Church, and David Yarowsky. Using bilingual materials to develop word sense disambiguation methods. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, pages 101–112. Citeseer, 1992.

[13] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, 26-30 June 1995, MIT, Cambridge, Massachusetts, USA, Proceedings., pages 189–196, 1995.

[14] Simone Paolo Ponzetto and Roberto Navigli. Knowledge-rich word sense disambiguation rivaling supervised systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010, pages 1522–1531. Association for Computational Linguistics, 2010.

[15] Christopher D Manning, Hinrich Schütze, et al. Foundations of statistical natural language processing. MIT Press, 2001.


[16] Ted Pedersen. Unsupervised corpus-based methods for WSD. Word sense disambiguation: algorithms and applications, pages 133–166, 2006.

[17] Rada Mihalcea. Knowledge-based methods for WSD. Word Sense Disambiguation: Algorithms and Applications, pages 107–131, 2006.

[18] Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, SIGDOC 1986, pages 24–26. ACM, 1986.

[19] Satanjeev Banerjee and Ted Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2002, pages 136–145. Springer, 2002.

[20] Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. An enhanced Lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014, pages 1591–1600, 2014.

[21] Rada Mihalcea, Paul Tarau, and Elizabeth Figa. PageRank on semantic networks, with application to word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics Proceedings of the Conference, COLING 2004. Association for Computational Linguistics, 2004.

[22] Ravi Sinha and Rada Mihalcea. Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In Proceedings of the First IEEE International Conference on Semantic Computing (ICSC 2007), pages 363–369. IEEE, 2007.

[23] Eneko Agirre and Aitor Soroa. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2009, pages 33–41. Association for Computational Linguistics, 2009.

[24] Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. Random walks for knowledge-based word sense disambiguation. Computational Linguistics, 40(1):57–84, 2014.

[25] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244, 2014.

[26] Eneko Agirre and David Martínez. Learning class-to-class selectional preferences. In Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning, CoNLL 2001, page 3. Association for Computational Linguistics, 2001.

[27] Diana McCarthy and John Carroll. Disambiguating nouns, verbs, and adjectives using automatically acquired selectional preferences. Computational Linguistics, 29(4):639–654, 2003.


[28] William A. Gale, Kenneth W. Church, and David Yarowsky. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, pages 233–237. Association for Computational Linguistics, 1992.

[29] David Yarowsky. One sense per collocation. In Proceedings of the workshop on Human Language Technology, pages 266–271. Association for Computational Linguistics, 1993.

[30] George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–244, 1990.

[31] George A Miller. WordNet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.

[32] Roberto Navigli and Simone Paolo Ponzetto. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 216–225. Association for Computational Linguistics, 2010.

[33] Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250, 2012.

[34] Myunggwon Hwang, Chang Choi, and Pan Koo Kim. Automatic enrichment of semantic relation network and its application to word sense disambiguation. IEEE Transactions on Knowledge and Data Engineering, 23(6):845–858, 2011.

[35] P. Chen, C. Bowes, W. Ding, and M. Choly. Word sense disambiguation with automatically acquired knowledge. IEEE Intelligent Systems, 27(4):46–55, July 2012.

[36] Robert Speer and Catherine Havasi. Representing general relational knowledge in ConceptNet 5. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 3679–3686, 2012.

[37] Robert Speer and Catherine Havasi. ConceptNet 5: A large semantic network for relational knowledge. In The People's Web Meets NLP, pages 161–176. Springer, 2013.

[38] Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. YAGO3: A knowledge base from multilingual wikipedias. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015.

[39] Junpeng Chen and Juan Liu. Combining ConceptNet and WordNet for word sense disambiguation. In Proceedings of the Fifth International Joint Conference on Natural Language Processing, IJCNLP 2011, pages 686–694, 2011.

[40] Junpeng Chen and Wei Yu. Enriching semantic knowledge for WSD. IEICE TRANSACTIONS on Information and Systems, 97(8):2212–2216, 2014.

[41] Edward H. Shortliffe and Bruce G. Buchanan. A model of inexact reasoning in medicine. Mathematical Biosciences, 23(3-4):351–379, 1975.


[42] David Heckerman. Probabilistic interpretations for MYCIN’s certainty factors. arXiv preprint arXiv:1304.3419, 2013.

[43] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.

[44] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.

[45] Ellen M. Voorhees. The TREC-8 question answering track report. In Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg, Maryland, USA, November 17-19, 1999.

[46] José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240:36–64, 2016.

[47] Martha Palmer, Hoa Trang Dang, and Christiane Fellbaum. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13(2):137–163, 2007.

[48] Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, pages 99–110, 2017.

[49] David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.


2.3 An automorphic distance metric for node role discovery

The technical report associated with this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. An automorphic distance metric and its application to node embedding for role mining. ArXiv e-prints 1712.06979, December 2017.

An Automorphic Distance Metric and its Application to Node Embedding for Role Mining

Víctor Martínez    Fernando Berzal    Juan-Carlos Cubero

Abstract

Role is a fundamental concept in the analysis of the behavior and function of interacting entities represented by network data. Role discovery is the task of uncovering these hidden roles. Node roles are commonly defined in terms of equivalence classes, where two nodes have the same role if they fall within the same equivalence class. Automorphic equivalence, where two nodes are equivalent when they can swap their labels to form an isomorphic graph, captures this common notion of role. However, this binary concept of equivalence is too restrictive, and nodes in real-world networks rarely belong to the same equivalence class. Instead, a relaxed definition in terms of similarity or distance is commonly used to compute the degree to which two nodes are equivalent. In this paper, we propose a novel distance metric called automorphic distance, which measures how far two nodes are from being automorphically equivalent. We also study its application to node embedding, showing how our metric can be used to generate vector representations of nodes that preserve their roles for data visualization and machine learning. Our experiments confirm that the proposed metric outperforms the RoleSim automorphic equivalence-based metric in the generation of node embeddings for different networks.

1 Introduction

Role discovery is defined as the process of finding sets of nodes following similar connectivity patterns or structural behaviors [1]. The role of a node can be understood as the function that node plays in the network. Different studies have shown the importance of roles in different domains, including predator-prey food webs [2], international relations [3], and the function of proteins in proteomes [4]. Unfortunately, this problem has received limited attention when compared to community detection [5, 6, 7], despite the fact that role discovery identifies complementary information and has found application in several useful network data mining tasks. For example, roles can be used to model and characterize the behaviors of entities in a network to predict structural changes and detect anomalies [8]. Since the same roles can be observed across different networks, this information has been successfully exploited for transfer learning [9]. Role information can also be used for enhanced visualization of interesting patterns in graphs [10]. Additional applications of role discovery are covered in more detail in [1].

Formally, two nodes have the same role if, given an equivalence relation, they belong

to the same equivalence class [11, 12]. Different equivalence classes have been studied for nodes in networks. Structural equivalence, where two nodes play the same role if they are connected to exactly the same neighbor nodes, has been widely studied [13, 14]. These nodes will have exactly the same topological properties, such as degree, clustering coefficient, or centrality, since they are indistinguishable from a structural point of view. However, different authors have pointed out the limitations of structural equivalence for modeling roles or positions (the name that roles receive in sociology), since structural equivalence is more related to the concept of locality than to the actual concept of role [15].

If the constraint of being connected to exactly the same neighbors is relaxed to being connected to neighbors with exactly the same topological function, we obtain automorphic equivalence classes, where two nodes are equivalent if they can swap their labels to form an isomorphic graph [16, 17]. Automorphically equivalent nodes will also have exactly the same topological properties but, without the locality requirement imposed by structural equivalence, pairs of nodes at distances larger than two can still have the same role. Therefore, automorphic equivalence is more closely related to the intuitive concept of role, understood as the function of a node within a network. Other equivalence classes, less relevant than the ones previously mentioned, are not covered in this work. Regular equivalence deserves special mention due to its importance as a relaxation of automorphic equivalence that only requires being connected to nodes with the same function, omitting the actual count of connections [18]. Regular equivalence does not preserve topological properties and is more suited to hierarchically-organized networks [2].

These binary equivalences are strict mathematical abstractions that rarely occur in real-world networks, leading to all nodes being assigned a different role. In practice, these equivalences are relaxed to similarities, allowing two nodes to play the same role by partially satisfying the constraints imposed by the mathematical definition of structural, automorphic, or regular equivalence.

In this paper, we present a novel automorphic distance metric, capturing distances between nodes in terms of automorphic equivalence. According to the network structure, two nodes will be at a distance that is proportional to how far they are from being automorphically equivalent. This leads to a softer definition of automorphic roles, instead of forcing all nodes to fit into strict role classes. However, when needed, these distances can be used to discover and instantiate specific role classes. Our distance function satisfies the metric axioms, as we prove below, requires neither external parameters nor feature engineering, and is computable for nodes across different networks.

We also present different applications of our proposal, with special emphasis on generating node embeddings that preserve node roles. Node embeddings are vector representations of nodes capturing relevant information in terms of pair-wise distances [19]. Much work has been done on embedding techniques that preserve neighborhoods or communities [20, 21] but, to the best of our knowledge, our work is the first one on role-preserving embeddings.

Our paper is structured as follows.
In Section 2, we discuss the relevant related work. In Section 3, we describe our proposed automorphic distance metric and study its admissibility

as a distance metric, as well as its computational complexity. In Section 4, we analyze its application to the generation of node embeddings that preserve roles and show how it outperforms previously proposed approaches. Finally, conclusions and suggestions for future research are presented in Section 5.

2 Related Work

Different metrics have been proposed to measure node similarity. One of the most popular is SimRank [22], which iteratively computes similarity scores based on the hypothesis that two nodes are similar if they link to similar nodes. Different extensions of SimRank have been proposed [23]. SimRank recursively computes the similarity of two nodes as the average similarity of all their neighbor pairs, which can also be interpreted, as suggested by its original authors, as how soon two random walkers will meet if they start from these nodes. Thus, this definition is not suitable as a similarity metric capturing automorphic equivalence, because it requires the two nodes to be close in order to play the same role. Other similarity measures not based on SimRank have been proposed, such as PageSim [24] and Leicht's vertex similarity [25]. These similarities are formally rejected as valid metrics for capturing automorphic equivalence in [26].

Since automorphic equivalence ensures the same topological properties, some authors have tried to capture automorphic equivalence by defining a similarity function over a set of topological network properties [27]. The problem with these feature-based methods is that they require combining different complex hand-crafted features obtained by experts, which is far from trivial in practice. In addition, they cannot guarantee which set of features will correctly approximate automorphic equivalence, leading to a very limited approach to automorphic equivalence discovery.

As far as we know, RoleSim [28, 26] is the only proposed metric that tries to formally capture the concept of automorphic equivalence without using limited approximations based on hand-crafted topological features. Omitting the decay factor they introduce, by setting it to 0 in order to capture the global network topology, this similarity measure is iteratively computed until convergence as

$$ s(x, y) = \max_{M(x,y)} \frac{\sum_{(u,v) \in M(x,y)} s(u, v)}{\deg(x) + \deg(y) - |M(x, y)|}, $$

where deg(n) is the degree of a node n and M(x, y) is the optimal assignment of nodes in the neighborhood of x to nodes in the neighborhood of y maximizing the expression; that is, the pairs of neighbors of x and y with maximal similarity. In their work, Jin et al. prove that this function satisfies the axioms required to be considered a valid role similarity metric. RoleSim is a form of generalized Jaccard coefficient based on a recursive definition of the similarity of neighbor roles.

Despite the admissibility of RoleSim, their approach presents several limitations. The RoleSim similarity can be considered an automorphic distance by taking its complementary or Jaccard distance: d(x, y) = 1 − s(x, y). The problem is that the Jaccard coefficient is a normalized metric, which leads to a normalized

distance. As will be shown in our experiments, this normalization has a negative impact on the results obtained by RoleSim. In addition, this similarity function exhibits serious inconsistencies. For example, in the graph shown in Figure 1, where node d has a one-to-many relationship to the x_i nodes, the node pair (a, c) has exactly the same similarity as any pair (a, x_i), independently of the number of x_i nodes. This simple example shows the limitations of RoleSim when trying to capture automorphic similarity.

Figure 1: Example graph where RoleSim yields inconsistent values. Node d has a one-to-many relationship to the x_i nodes.

As far as we know, no distance metric has been proposed that is able to capture the concept of automorphic distance in a consistent way, without relying on approximations based on extracted topological properties or forcing the normalization of the distance function.
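To make this comparison concrete, the following minimal sketch (our own illustrative code, not the RoleSim authors' implementation nor part of this thesis' software) performs one RoleSim update step with the decay factor set to 0, following the formula above. The NetworkX-based graph representation, the all-ones initialization, and the fixed number of iterations are assumptions; SciPy's Hungarian solver is used to obtain the maximal-weight neighbor matching.

    # One RoleSim iteration (decay factor = 0), as defined by the formula above.
    import networkx as nx
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def rolesim_iteration(G, s):
        """G: undirected NetworkX graph; s: dict of current similarities per node pair."""
        s_new = {}
        for x in G:
            for y in G:
                nbr_x, nbr_y = list(G.neighbors(x)), list(G.neighbors(y))
                if not nbr_x or not nbr_y:
                    s_new[(x, y)] = 0.0
                    continue
                # Pairwise similarities between the two neighborhoods.
                w = np.array([[s[(u, v)] for v in nbr_y] for u in nbr_x])
                # Maximal-weight matching M(x, y); |M| = min(deg(x), deg(y)).
                rows, cols = linear_sum_assignment(w, maximize=True)
                # Generalized Jaccard denominator: deg(x) + deg(y) - |M(x, y)|.
                s_new[(x, y)] = w[rows, cols].sum() / (G.degree(x) + G.degree(y) - len(rows))
        return s_new

    # Usage: start from all-ones similarities and iterate a few times.
    G = nx.karate_club_graph()
    s = {(u, v): 1.0 for u in G for v in G}
    for _ in range(5):
        s = rolesim_iteration(G, s)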

3 A Novel Automorphic Distance Metric

An isomorphism is a bijection between the nodes of two graphs where two nodes are adjacent in one graph if and only if the nodes that result from applying the bijective function are also adjacent in the other graph. An automorphism is an isomorphism from one graph to itself. Therefore, two nodes are automorphically equivalent if there exists an automorphism creating a correspondence between them. One way of testing for automorphic equivalence is computing the canonical form of graphs. Graph canonization is the task of computing a labeling for the nodes in a graph such that every isomorphic graph yields the same canonical labeling. Given a canonized graph, two automorphically-equivalent nodes must have been assigned the same label.

As previously stated, automorphic equivalence is too restrictive to appear in real-world networks, leading to most nodes having different canonical labels. The solution that we propose to this problem is the definition of distances between labels, which ultimately allows the definition of distances between nodes based on the concept of automorphic equivalence. This distance will be proportional to the number of changes that need to be made in the network to transform one label or equivalence class into another. A zero distance implies that two nodes are automorphically equivalent and play exactly the same role. According to this distance d, we can say that nodes x and y are more

automorphically similar, or have a more similar role, than u and v if d(x, y) < d(u, v). In order to propose a valid distance metric, we must also prove that our metric satisfies the distance metric axioms.

Our work is based on the 1-dimensional Weisfeiler-Lehman test of isomorphism [29, 30], also known as color refinement, which is an algorithm to compute the canonical labeling of graphs. These canonical labels can be used to solve related problems, such as the computation of efficient graph kernels [31]. The Weisfeiler-Lehman algorithm works by initially assigning a label to each node according to its degree, so nodes with the same degree have the same initial label. Then, the algorithm iteratively updates these labels by the following procedure. First, it takes the labels from neighbor nodes, concatenates them according to a certain arbitrary order (the same ordering must be applied for all nodes), and finally appends the label of the node at the beginning of the obtained list. Each different sequence is substituted by a newly generated unique label, so nodes exhibiting exactly the same sequence are assigned the same label. This refinement process is repeated until labels stabilize, that is, when every pair of nodes with the same label in the previous iteration have the same label in the current iteration. Therefore, after m iterations, which depend on the network diameter, the canonical form is achieved, and an additional iteration is required for testing the stabilization condition. These final labels are the canonical form of the graph and, therefore, two nodes with the same final label are automorphically equivalent. An example of running the algorithm on a simple graph is shown in Figure 2.

It can be noted that some pairs of labels appearing in the same iteration of the Weisfeiler-Lehman algorithm are more similar than others. The automorphic distance between two nodes can be defined as the distance between their canonical labels. We propose a scheme to compute distances between the labels that are obtained by the Weisfeiler-Lehman algorithm. Since distances are only defined for labels appearing in the same iteration, a special label associated with nodes of degree 0, which we call the empty label ℓ_∅, is considered for convenience. Isolated nodes are directly assigned this label and left out of the iterative process. Since the labels created in the initial assignment are based on node degree, we define the distance between the labels of nodes x and y as the number of links that must be added to or removed from node x to transform it into node y. This can be easily computed as their absolute degree difference:

$$ d(\ell_0(x), \ell_0(y)) = |\deg(x) - \deg(y)|, \qquad (1) $$

where ℓ_0(n) is the label initially assigned to node n. This definition of distance for initial labels is also valid for isolated nodes, with degree 0, which have been assigned the empty label. Given these distances for initial labels, the distance between labels in subsequent iterations can be computed as the distance of the optimally-matched pairs of labels of their neighbors from the previous iteration. The distance between labels from the i-th iteration can be computed as

[Figure 2 panels: (a) Original graph. (b) Initial labeling. (c) First iteration. (d) First relabeling. (e) Second iteration. (f) Second and final relabeling.]

Figure 2: The Weisfeiler-Lehman canonization algorithm applied to a simple graph.

$$ d(\ell_i(x), \ell_i(y)) = \min_{M_{i-1}(x,y)} \sum_{(u,v) \in M_{i-1}(x,y)} d(\ell_{i-1}(u), \ell_{i-1}(v)), \qquad (2) $$

where M_{i-1}(x, y) is the optimal assignment of neighbors of x to neighbors of y that minimizes the expression; the distance is, therefore, just the sum of distances between the optimally-matched neighbors of x and y. If the neighborhood of one node is larger than the neighborhood of the other, leading to unmatched nodes, these nodes are directly matched with virtual nodes, which are labeled with ℓ_∅. The distances of unmatched nodes to the empty label can be seen as the cost, in terms of distance, of inserting and transforming a virtual isolated node to obtain a node with the label of the unmatched node. The optimal assignment, which would consider O(n!) alternatives using a naive brute-force approach, can be computed in polynomial time using the Hungarian algorithm [32]. Initially, Equation 1 and, in subsequent iterations, Equation 2 are used to compute a distance table. At any given time, only distances from two iterations need to be maintained: the distances currently being computed and the distances from the most recent previous iteration.
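As an illustration of Equation 2, the sketch below (ours, not code from the paper) computes the distance between two labels given the previous-iteration labels of their neighbors and the previous distance table, padding the smaller neighborhood with the empty label and solving the assignment with SciPy's Hungarian algorithm. The dictionary keyed by label pairs and the use of None for ℓ_∅ are our own assumptions.

    # Distance between two labels (Equation 2) via optimal neighbor matching.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    EMPTY = None  # stands for the empty label (virtual isolated node)

    def label_distance(neigh_x, neigh_y, prev_dist):
        """neigh_x, neigh_y: previous-iteration labels of the neighbors of x and y.
        prev_dist: dict mapping frozenset label pairs to previous-iteration distances."""
        size = max(len(neigh_x), len(neigh_y))
        a = list(neigh_x) + [EMPTY] * (size - len(neigh_x))  # pad with the empty label
        b = list(neigh_y) + [EMPTY] * (size - len(neigh_y))
        cost = np.array([[0.0 if u == v else prev_dist[frozenset((u, v))]
                          for v in b] for u in a])
        rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
        return cost[rows, cols].sum()

    # Example with degree-based initial labels, where d0(a, b) = |a - b| and d0(empty, k) = k:
    d0 = {frozenset((a, b)): abs(a - b) for a in range(1, 5) for b in range(1, 5)}
    d0.update({frozenset((EMPTY, k)): k for k in range(1, 5)})
    print(label_distance([3, 4], [1, 1, 2, 4], d0))  # 3.0, i.e., d1(5, 9) in Section 3.1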


The described iterative process is carried out for each iteration of the 1-dimensional Weisfeiler-Lehman algorithm until label stabilization. The automorphic distance between a pair of nodes is defined as the distance between their canonical labels. It should be noted that labels can be represented using any set of symbols. However, for simplicity, we represent labels as positive integers.

Algorithm 1: Automorphic Distance Algorithm

procedure AutomorphicDistance
    Input: set of nodes N of an undirected graph.
    Initialize i = 0, stabilized = false, r = HashTable().
    Set ℓ_i(x) = deg(x) for all x ∈ N.
    Set d(ℓ_i(x), ℓ_i(y)) = |deg(x) − deg(y)| for all (x, y) ∈ N × N.
    while not stabilized do
        Set m = i.
        Iterate i = i + 1.
        Set h_i(x) = concatenate(sort(neighbors(x))).
        Set c_i(x) = concatenate(ℓ_{i−1}(x), h_i(x)).
        Set ℓ_i(x) = unique(c_i(x)).
        Use the Hungarian algorithm to compute M_{i−1}(x, y) for all (x, y) ∈ N × N.
        Set d(ℓ_i(x), ℓ_i(y)) = min_{M_{i−1}(x,y)} Σ_{(u,v) ∈ M_{i−1}(x,y)} d(ℓ_{i−1}(u), ℓ_{i−1}(v)) for all (x, y) ∈ N × N.
        Set stabilized = true.
        Store in r the label of x in iteration i − 1 for all x ∈ N as r[ℓ_i(x)] = ℓ_{i−1}(x).
        If the entry r[ℓ_i(x)] was already set with a different label, set stabilized = false.
    end while
    Set d(x, y) = d(ℓ_m(x), ℓ_m(y)) for each pair of nodes (x, y) ∈ N × N.
    Output: pairwise automorphic distances d(x, y) for each pair of nodes (x, y) ∈ N × N.
end procedure

The complete algorithm is shown in Algorithm 1. The function neighbors(x) returns the set of neighbor nodes of node x. The function sort(s) sorts a set of elements. The ordering among elements is not relevant for the algorithm, but the same ordering must always be applied. The function concatenate(x_1, ..., x_n) returns the concatenation of elements x_1, ..., x_n. Finally, the function unique(s) generates and returns a unique symbol, such as an integer, for each observed unique string s, where unique(s) = unique(s′) if and only if s = s′.
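For readers who prefer running code to pseudocode, here is a compact end-to-end sketch of Algorithm 1. It is our own illustrative implementation, not the NOESIS code: the adjacency-list input, the use of signatures as new labels, and the cost assigned to the empty label (the node's own previous label plus every required neighbor, which is how we read the worked tables of Section 3.1) are assumptions.

    # Illustrative end-to-end sketch of Algorithm 1.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    EMPTY = None  # the empty label

    def automorphic_distances(adj):
        """adj: dict mapping each node to a list of its neighbor nodes."""
        nodes = list(adj)
        # Initialization: labels are node degrees; d(a, b) = |a - b|, d(empty, a) = a.
        label = {x: len(adj[x]) for x in nodes}
        current = set(label.values())
        dist = {(EMPTY, EMPTY): 0.0}
        for a in current:
            dist[(EMPTY, a)] = dist[(a, EMPTY)] = float(a)
            for b in current:
                dist[(a, b)] = float(abs(a - b))

        def matched_cost(labels_a, labels_b):
            # Equation (2): optimal assignment between two label multisets,
            # padding the smaller one with the empty label.
            size = max(len(labels_a), len(labels_b))
            if size == 0:
                return 0.0
            a = list(labels_a) + [EMPTY] * (size - len(labels_a))
            b = list(labels_b) + [EMPTY] * (size - len(labels_b))
            cost = np.array([[dist[(u, v)] for v in b] for u in a])
            rows, cols = linear_sum_assignment(cost)
            return cost[rows, cols].sum()

        while True:
            # Relabeling: the signature (own label, sorted neighbor labels) acts
            # as the new, unique label of each node.
            new_label = {x: (label[x], tuple(sorted(label[u] for u in adj[x])))
                         for x in nodes}
            rep = {new_label[x]: x for x in nodes}  # one representative per label
            new_dist = {(EMPTY, EMPTY): 0.0}
            for la, x in rep.items():
                # Cost of building x from a virtual isolated node: its own previous
                # label plus every required neighbor (our reading of Section 3.1).
                cost_from_empty = (dist[(EMPTY, label[x])]
                                   + sum(dist[(EMPTY, label[u])] for u in adj[x]))
                new_dist[(EMPTY, la)] = new_dist[(la, EMPTY)] = cost_from_empty
                for lb, y in rep.items():
                    new_dist[(la, lb)] = matched_cost([label[u] for u in adj[x]],
                                                      [label[v] for v in adj[y]])
            # Stabilization: no new split, so the node partition is unchanged.
            if len(set(new_label.values())) == len(set(label.values())):
                return {(x, y): dist[(label[x], label[y])]
                        for x in nodes for y in nodes}
            label, dist = new_label, new_dist

    # Usage on a toy path graph a-b-c-d (hypothetical input):
    adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    d = automorphic_distances(adj)
    print(d[("a", "d")], d[("a", "b")])  # 0.0 1.0 (the two end nodes are equivalent)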

3.1 Case Example

In this section, we show an illustrative example applying the proposed automorphic distance metric to the network shown in Figure 2a. The proposed algorithm for computing the automorphic distance initially assigns a label to each node according to its degree, as shown in Figure 2b. Therefore, two nodes will have the same label if and only if they have the same degree. The initial distance table, represented as an upper triangular matrix due to the symmetry of distances, as will be proved in Section 3.2.3, is shown in Table 1. This distance table is computed using Equation 1 according to the initial label assignments. For example, the distance between labels 1 and 4 is 3, since this value is the absolute degree difference of the corresponding nodes.

ℓ / ℓ   ℓ_∅   1   2   3   4
ℓ_∅      0    1   2   3   4
1             0   1   2   3
2                 0   1   2
3                     0   1
4                         0

Table 1: Initialization of the distance table.

After the initialization, the algorithm enters its main loop and performs its first iteration. For each node, the labels of its neighbors are ordered and concatenated with its own label, as shown in Figure 2c. For example, the only node with label 3 has the associated string 3|2,2,4, since its neighbors have labels 2, 4, and 2. These concatenated strings are replaced by a new label, chosen so that two nodes are assigned the same new label if and only if they had the same concatenated string. This process generates a new labeling, as shown in Figure 2d. Given these new labels, the algorithm computes their pairwise distances using Equation 2, which are shown in Table 2.

ℓ / ℓ   ℓ_∅   5    6    7    8    9
ℓ_∅      0    9   11    5   14   12
5             0    3    3    3    3
6                  0    4    2    2
7                       0    6    4
8                            0    2
9                                 0

Table 2: Distance table after the first relabeling.

For example, in order to compute the distance between labels 5 and 9, the optimal assignment between their neighbors minimizing the summation of distances, according to the previous iteration, must be obtained. In the previous iteration, 3 and 4 were the labels of the neighbors of nodes with label 5. Likewise, 1, 1, 2, and 4 were the labels of the neighbors of nodes with label 9. The Hungarian algorithm matches these neighbor labels to minimize their sum of distances: (3, 2) and (4, 4) according to Table 1. Since the two neighbor labels 1 of label 9 were left unmatched, they are both matched with the empty label as (1, ℓ_∅). Given this optimal assignment, we can compute the distance between labels 5 and 9 as:

$$ d_1(5, 9) = d_0(3, 2) + d_0(4, 4) + d_0(1, \ell_\emptyset) + d_0(1, \ell_\emptyset) = 1 + 0 + 1 + 1 = 3 $$

Following this iterative process, the algorithm performs the second iteration. Concatenated strings are computed as shown in Figure 2e and labels are updated as shown in Figure 2f. It can be easily seen that labels have stabilized, obtaining the canonical labeling of this graph. The stabilization condition can be tested by performing an additional iteration and observing that nodes with label 13 are assigned the same label,

while the other nodes are assigned a unique new label. This new labeling would be equivalent to the labeling obtained in this iteration, the condition required to achieve stabilization. The pairwise distances computed in this iteration are shown in Table 3.

ℓ / ℓ   ℓ_∅   10   11   12   13   14   15   16
ℓ_∅      0    34   43   32   17   19   51   45
10            0    12    2   16   11   16   13
11                 0    14   20   18   10    8
12                      0    11   14   14   15
13                           0     2   25   21
14                                 0   28   19
15                                      0    6
16                                           0

Table 3: Distance table after the second and final relabeling.

For instance, let us compute the distance between labels 11 and 16. We start by finding the optimal assignment that minimizes the pairwise distances of their neighbor labels in the previous iteration, which are 5, 5, and 8 for label 11 and 5, 7, 7, and 8 for label 16. The Hungarian algorithm obtains the optimal matching (5, 5), (5, 7), and (8, 8), with an additional (ℓ_∅, 7) due to the difference between the node degrees associated with labels 11 and 16. Once this optimal matching has been obtained, we can easily compute the distance between labels 11 and 16 using Equation 2 as:

$$ d_2(11, 16) = d_1(5, 5) + d_1(5, 7) + d_1(8, 8) + d_1(\ell_\emptyset, 7) = 0 + 3 + 0 + 5 = 8 $$

The automorphic distance between a pair of nodes is defined as the distance between their canonical labels, which are the final labels assigned in the iteration where stabilization is achieved. Therefore, in our example, the distance between nodes is given by Table 3. For example, we can see how nodes with canonical labels 13 and 14 are close to being automorphically equivalent, since their automorphic distance is only 2. In contrast, nodes with canonical labels 14 and 15 have a large automorphic distance, equal to 28, which indicates that they are far from being automorphically equivalent.
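As a quick check (again, our own code rather than the paper's), the value d(11, 16) = 8 of Table 3 can be reproduced from the distances of Table 2 with the Hungarian algorithm, exactly as in the derivation above.

    # Reproducing d(11, 16) = 8 from Table 2 using the Hungarian algorithm.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    EMPTY = "empty"  # placeholder for the empty label
    d1 = {frozenset(p): v for p, v in {
        (EMPTY, 5): 9, (EMPTY, 6): 11, (EMPTY, 7): 5, (EMPTY, 8): 14, (EMPTY, 9): 12,
        (5, 6): 3, (5, 7): 3, (5, 8): 3, (5, 9): 3,
        (6, 7): 4, (6, 8): 2, (6, 9): 2, (7, 8): 6, (7, 9): 4, (8, 9): 2}.items()}

    neigh_11 = [5, 5, 8, EMPTY]   # neighbor labels of label 11, padded to size 4
    neigh_16 = [5, 7, 7, 8]       # neighbor labels of label 16
    cost = np.array([[0 if u == v else d1[frozenset((u, v))] for v in neigh_16]
                     for u in neigh_11])
    rows, cols = linear_sum_assignment(cost)
    print(cost[rows, cols].sum())  # 8, in agreement with Table 3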

3.2 Metric Admissibility

In this section, we prove that the distance function that we have defined is a valid distance metric. In order to assert this statement, we must prove the following four conditions: non-negativity, identity of indiscernibles, symmetry, and triangle inequality. To prove these conditions, we note that Equation 2 can be recursively decomposed as


$$
\begin{aligned}
d(\ell_m(x), \ell_m(y)) &= \min_{M_{m-1}(x,y)} \sum_{(u,v) \in M_{m-1}(x,y)} d(\ell_{m-1}(u), \ell_{m-1}(v)) \\
&= \min_{\substack{M_{m-1}(x,y)\\ M_{m-2}(u,v)}} \sum_{(u,v) \in M_{m-1}(x,y)} \; \sum_{(u',v') \in M_{m-2}(u,v)} d(\ell_{m-2}(u'), \ell_{m-2}(v')) \\
&= \dots \\
&= \sum_{(x',y') \in M^0(x,y)} d(\ell_0(x'), \ell_0(y')) \\
&= \sum_{(x',y') \in M^0(x,y)} |\deg(x') - \deg(y')| \qquad (3)
\end{aligned}
$$

where M^0(x, y) is the set of pairs of nodes that appear in the recursive summation at the deepest level of recursion as a result of choosing the optimal assignment in each iteration.

3.2.1 Proof of Non-Negativity

Non-negativity requires that the distance function satisfies d(x, y) ≥ 0 for any possible pair of nodes x and y. Given the decomposition of our metric as shown in Equation 3, it is straightforward to see that the summation of absolute values is guaranteed to always be equal to or greater than 0.

3.2.2 Proof of the Identity of Indiscernibles

The identity of indiscernibles implies that the distance function satisfies d(x, y) = 0 if and only if x ≡ y. The Weisfeiler-Lehman algorithm guarantees that automorphically equivalent pairs of nodes are assigned the same canonical label, and non-automorphically equivalent pairs of nodes are assigned different canonical labels.

Equation 3 is only equal to zero when deg(x′) = deg(y′) for every pair of nodes in M^0(x, y). Two nodes can only have the same canonical label if they are automorphically equivalent, as guaranteed by the Weisfeiler-Lehman algorithm. If two nodes are assigned the same canonical label, their neighbors must have been assigned equivalent labels in all the iterations. Since the distance of a label to itself is 0, we can see that the summation yields 0, leading to a zero distance for nodes when x ≡ y.

On the other hand, when two nodes have different canonical labels, their neighbors must have been assigned different labels during the execution of the algorithm. This implies that, in the recursive decomposition shown in Equation 3, at least one pair of nodes will not match nodes with the same initial labels, leading to a distance greater than 0 for nodes x ≢ y.


3.2.3 Proof of Symmetry

The condition of symmetry requires that the proposed distance function must satisfy the property d(x, y) = d(y, x). We can prove that Equation 3 is symmetric, as

$$
\begin{aligned}
d(\ell_m(x), \ell_m(y)) &= \sum_{(x',y') \in M^0(x,y)} |\deg(x') - \deg(y')| \\
&= \sum_{(y',x') \in M^0(y,x)} |\deg(y') - \deg(x')| \\
&= d(\ell_m(y), \ell_m(x)),
\end{aligned}
$$

since M^0(x, y) is equal to M^0(y, x) when we swap the nodes. The order of the nodes in each pair does not affect the result of our distance function.

3.2.4 Proof of the Triangle Inequality

The triangle inequality requires that the inequality d(x, y) ≤ d(x, z) + d(z, y) is satisfied by the proposed automorphic distance function. By the definition of the proposed distance, we know that

$$
\begin{aligned}
& d(x, y) \leq d(x, z) + d(z, y) \\
\implies\; & d(\ell_m(x), \ell_m(y)) \leq d(\ell_m(x), \ell_m(z)) + d(\ell_m(z), \ell_m(y)) \\
\implies\; & \sum_{(a,b) \in M^0(x,y)} |\deg(a) - \deg(b)| \leq \sum_{(a,c) \in M^0(x,z)} |\deg(a) - \deg(c)| + \sum_{(c,d) \in M^0(z,y)} |\deg(c) - \deg(d)|.
\end{aligned}
$$

We know that the absolute value satisfies the triangle inequality; thus, the lowest value that the right-hand side can take is

$$ \sum_{(a,b) \in M^0(x,y)} |\deg(a) - \deg(b)| \leq \sum_{(a,d) \in M^{00}(x,y,z)} |\deg(a) - \deg(d)|, $$

where M^{00}(x, y, z) is the set of pairs resulting from chaining or combining M^0(x, z) and M^0(z, y), so that (a, d) ∈ M^{00}(x, y, z) if and only if (a, c) ∈ M^0(x, z) and (c, d) ∈ M^0(z, y). For this inequality to hold, there must not exist a pairing of nodes better than the matching on the left-hand side. Since the Hungarian algorithm ensures that matchings are optimal, minimizing their sum of distances, the matching on the right-hand side cannot be better than optimal and, therefore, the value of the right-hand side can only be equal to or greater than the value of the left-hand side, satisfying the triangle inequality condition.


3.3 Metric Computational Complexity

In this section, we analyze the temporal and spatial computational complexity of our proposed metric. The initialization of labels based on the degree of nodes has O(n) temporal and spatial complexity, where n is the number of nodes in the network. The computation of the table for the initial distances has O(n²) temporal and spatial complexity, since the distance is computed for every pair of nodes. Each iteration of the algorithm requires computing the sorted list of labels of the neighbors of each node. This task can be accomplished for each node with computational and spatial complexity O(k), where k is the degree of the node, when using radix, bucket, or counting sort. Thus, computing these strings for all nodes has O(nk) temporal and spatial complexity. Renaming these labels can be done in O(n) time using hash-based data structures. To compute the pairwise distances between labels in the current iteration, the Hungarian algorithm, with computational complexity O(k³), must be run for each pair of nodes, leading to O(n²k³) temporal complexity and O(n²) spatial complexity. Finally, checking if the labels have stabilized can be done in O(n) using a hash-based index. The number of iterations, m, required for the 1-dimensional Weisfeiler-Lehman algorithm to converge is closely related to the diameter of the network [30]. Even though the number of iterations is theoretically bounded by n, it has been widely observed that real-world networks tend to exhibit the small-world phenomenon, presenting a small diameter [33, 34] and leading to a small number of iterations required for convergence. Therefore, by combining these partial results, the total temporal complexity is

$$ O(n + n^2 + m n^2 k^3) \approx O(m n^2 k^3), $$

where n is the number of nodes in the graph, k is the degree of nodes, and m is the number of iterations required for convergence. The spatial complexity of the algorithm is O(n²), since only the distances and labels from the previous and the current iteration must be maintained at any given time. In practice, the algorithm can handle large networks, given that m and k tend to remain small in real-world networks due to their sparse and small-world nature.

In addition, most of these steps can be easily parallelized, since most of them are independent for each node and are only based on the results from the previous iteration. For example, the initial labels for each node can be assigned independently. Once we have assigned these labels, the computation of their pairwise distances can be split among the available processors, since they are independent. The iterative assignment of labels in each loop iteration can also be parallelized using a concurrent data structure to ensure that the new labels are properly generated. Furthermore, the pairwise distances between these new labels can be easily computed in a parallel way, since they are completely independent, as they only rely on the distance table computed in the previous iteration.
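As an illustration of this parallelization strategy, the following sketch (ours, not part of NOESIS) splits the computation of the initial degree-based distance table across a pool of worker processes; the per-iteration Hungarian computations can be distributed in exactly the same way, since each pair only reads the previous iteration's distance table. Function names and the toy input are hypothetical.

    # Parallel computation of the initial pairwise distance table.
    from concurrent.futures import ProcessPoolExecutor
    from itertools import combinations

    def initial_distance(pair):
        (x, dx), (y, dy) = pair
        return (x, y), abs(dx - dy)   # Equation (1): absolute degree difference

    def parallel_initial_table(degrees, workers=4):
        """degrees: dict mapping each node to its degree."""
        pairs = list(combinations(degrees.items(), 2))
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(initial_distance, pairs, chunksize=256))

    if __name__ == "__main__":
        table = parallel_initial_table({"a": 1, "b": 2, "c": 2, "d": 1})
        print(table[("a", "b")])  # 1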


4 Experimental Evaluation

The experimental evaluation of role discovery techniques is a complicated task due to the lack of available evaluation datasets. Most role mining research projects use private datasets, which are not released to the scientific community. To overcome this limitation and perform an illustrative comparison of automorphic distances, we propose a novel experimental evaluation, which is presented in this section. As a collateral result of this experimentation, a new method for representing node roles using feature vectors is presented.

A central problem in machine learning is finding representations that ease the visualization or extraction of useful information from data [35]. A common solution is the computation of embeddings that represent complex objects in a vector space preserving certain properties [36]. Node embedding, also known as graph embedding, is the task of mapping each node in a graph to a low-dimensional vector space trying to preserve the similarity or distance between pairs of nodes. Therefore, similar nodes will be located in similar regions of the space. Node embeddings have lately gained attention, since they have achieved good results in different machine learning tasks [36]. Several models have been proposed for node embedding. However, these techniques try to preserve the connectivity of the network by obtaining embeddings that preserve structural equivalence, the neighborhood, or the community of nodes [37, 20, 21]. This information has proven to be useful due to the presence of homophily, also known as assortativity, in real-world networks [38], where entities tend to be connected to similar ones, a feature that allows us to explain certain properties of the nodes. Even though the connectivity information captured by these techniques is relevant, they fail to capture information related to the role or function of the nodes in the network, which is highly valuable information, complementary to the information obtained by locality-based embedding techniques.

We propose a new kind of embedding by exploiting this complementary information. Our node embeddings capture the roles of nodes by placing nodes that play a similar function in the network close together in the resulting vector space. To compute these embeddings, we apply classical multidimensional scaling (MDS) [39, 40] to the distance matrices computed by the different approaches. In the following sections, we show how 2-dimensional embeddings can capture relevant information in different real-world networks. We compare the results obtained by our distance metric with the results obtained by RoleSim, which can be interpreted as a distance function as previously described.
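The following sketch (ours, not the code used to produce the figures in this paper) shows how a precomputed matrix of pairwise automorphic distances can be turned into 2-dimensional role-preserving coordinates with classical MDS, using plain NumPy; the toy distance matrix is hypothetical.

    # Classical MDS (principal coordinate analysis) on a precomputed distance matrix.
    import numpy as np

    def classical_mds(D, dim=2):
        """D: symmetric (n x n) matrix of pairwise node distances."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
        B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
        eigval, eigvec = np.linalg.eigh(B)
        top = np.argsort(eigval)[::-1][:dim]  # keep the largest eigenvalues
        scale = np.sqrt(np.clip(eigval[top], 0.0, None))
        return eigvec[:, top] * scale         # one row of coordinates per node

    D = np.array([[0.0, 2.0, 5.0],            # toy symmetric distance matrix
                  [2.0, 0.0, 4.0],
                  [5.0, 4.0, 0.0]])
    coords = classical_mds(D)                 # 2-D coordinates, ready for a scatter plot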

4.1 Zachary’s Karate Club Network

Zachary’s karate club network is a popular social network representing the 34 members (as nodes) of a university karate club and their interactions (as edges) outside the club [41]. During the study carried out by Wayne W. Zachary, a conflict arose between the two club administrators, leading to the split of the club into two groups according to the leader each member decided to follow. For this reason, this network has served as a prototypical case study for community detection algorithms and some network analysis techniques.


If each member is assigned to a binary class according to the leader they decided to follow, Zachary observed that this property is highly homogeneous and assortative. Therefore, nodes tend to be connected to nodes that made the same decision. In this context, previously proposed embedding techniques generate embeddings that clearly separate nodes according to the leader they decided to follow [37]. This information is crucial when the task is related to community detection. However, these embeddings fail, by nature, to reveal the role of each node in the network.

(a) Kamada-Kawai layout. (b) Automorphic embedding. (c) RoleSim embedding.

Figure 3: Zachary’s karate club network and its node embeddings, with node role coloring.

To illustrate how our technique captures the roles of nodes, we assigned a class to each node according to objective role-related properties. The two leaders are colored in red. Nodes interacting with both leaders are colored in green. Nodes not interacting with either of the two leaders are colored in yellow. Finally, the remaining nodes, which interact with only one of the leaders, are colored in blue. Figure 3 shows the network drawn using the Kamada-Kawai layout algorithm [42] and the embeddings obtained by applying our automorphic distance and RoleSim. It can be seen that the four classes of roles are linearly separated in the embedding generated using our distance metric. In addition, it can be seen how the two leaders are clearly mapped as outliers, placed significantly apart from the other nodes representing regular members of the club. However, RoleSim fails to clearly separate the different role classes. The two leaders are placed rather close to regular club members, without capturing their unique function in the network.
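For reference, the role classes described above can be assigned programmatically; the sketch below (ours) uses the NetworkX copy of Zachary's karate club, where nodes 0 and 33 correspond to the two leaders (an assumption about that dataset's node numbering).

    # Assigning the four role-related classes on Zachary's karate club network.
    import networkx as nx

    G = nx.karate_club_graph()
    leaders = {0, 33}
    colors = {}
    for v in G.nodes():
        if v in leaders:
            colors[v] = "red"      # the two leaders
        else:
            k = len(leaders & set(G.neighbors(v)))
            colors[v] = {2: "green", 1: "blue", 0: "yellow"}[k]  # both / one / neither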

4.2 World Trade Network

In network data mining, homophily is commonly exploited in node classification tasks, since nodes tend to exhibit the same class as their neighbors [43]. Even though this situation occurs in a large number of networks from very different domains, homophily-based classification techniques fail when the classes of the nodes are defined by the role they play in the network, instead of by the community they belong to. An illustrative example of this situation is a network containing data on the trade of miscellaneous manufactures of metal among 80 countries¹, according to data gathered in 1993 and 1994 from the Commodity Trade Statistics published by the United Nations [44].

¹The dataset can be downloaded from http://vlado.fmf.uni-lj.si/pub/networks/data/esna/metalWT.htm


Each country is represented by a node in the network. Each commercial relationship is represented by an arc, which we consider an undirected edge in practice. In this case, arcs correspond to the trade of high-technology products or heavy manufactures between countries. In addition, the authors of this dataset annotated the countries in the network with their structural world economic position in 1994. World economic positions are a classification of countries in the context of the world-system theory, which explains some complex dynamics observed in the real world [45, 46]. This classification splits countries into three possible categories: core countries (colored in green), semi-periphery countries (colored in blue), and periphery countries (colored in red). In short, core countries have high economic, military, and political power, which allows them to control the world economic system. The periphery is composed of less developed countries, owning a disproportionately small share of global wealth. Finally, the semi-periphery is composed of countries that do not clearly fall into the previous two categories and exhibit an intermediate status.

(a) Kamada-Kawai layout. (b) Automorphic embedding. (c) RoleSim embedding.

Figure 4: World trade network and its corresponding node embeddings.

Figure 4 shows the trade network drawn using the Kamada-Kawai layout algorithm and the embeddings obtained by applying our automorphic distance and the RoleSim-based distance. It can be seen that both metrics achieve a good separation of core and semi-periphery countries. However, our distance metric is clearly superior in the separation of semi-periphery and periphery countries. In addition, the embedding generated using distances computed with RoleSim exhibits an artificial curved-line shape, which indicates that only one dimension would be required to represent the information captured by RoleSim. In contrast, the embedding generated using our proposed automorphic distance exploits the two dimensions to represent role-related node properties and achieves a better separation of the different classes.

5 Conclusions

In this paper, we have proposed a novel distance metric for nodes that relaxes the strict concept of automorphic equivalence. To the best of our knowledge, this is the first work to propose a consistent non-normalized distance metric that captures the concept of automorphic equivalence without approximating it using feature engineering. In addition, we have shown that the proposed distance function is a valid distance metric by proving the required conditions. Finally, we have shown how our metric can be exploited to

generate node embeddings that capture role information, in contrast to previously proposed embedding techniques, which capture locality-based information. We have also shown how our metric is superior to RoleSim in the generation of automorphic node embeddings, leading to a better separation of nodes according to their roles. Our proposal creates new opportunities in problems related to role discovery. Future work includes exploiting our distance metric in problems related to anomaly detection in networks and transfer learning based on roles shared by nodes across different networks.

Acknowledgments

This work is partially supported by the Spanish Ministry of Economy and the European Regional Development Fund (FEDER), under grant TIN2012-36951 and the program "Ayudas para contratos predoctorales para la formación de doctores 2013" (grant BES-2013-064699).

References

[1] Ryan A Rossi and Nesreen K Ahmed. Role discovery in networks. IEEE Transactions on Knowledge and Data Engineering, 27(4):1112–1131, 2015.

[2] Joseph J Luczkovich, Stephen P Borgatti, Jeffrey C Johnson, and Martin G Everett. Defining and measuring trophic role similarity in food webs using regular equivalence. Journal of Theoretical Biology, 220(3):303–321, 2003.

[3] Emilie M Hafner-Burton, Miles Kahler, and Alexander H Montgomery. Network analysis for international relations. International Organization, 63(3):559–592, 2009.

[4] Petter Holme and Mikael Huss. Role-similarity based functional prediction in networked systems: Application to the yeast proteome. Journal of the Royal Society Interface, 2(4):327–333, 2005.

[5] Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: A comparative analysis. Physical Review E, 80(5):056117, 2009.

[6] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.

[7] Symeon Papadopoulos, Yiannis Kompatsiaris, Athena Vakali, and Ploutarchos Spyridonos. Community detection in social media. Data Mining and Knowledge Discovery, 24(3):515–554, 2012.

[8] Ryan A Rossi, Brian Gallagher, Jennifer Neville, and Keith Henderson. Modeling dynamic behavior in large evolving graphs. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining, pages 667–676. ACM, 2013.


[9] Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. RolX: Structural role extraction & mining in large graphs. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 12-16, 2012, pages 1231–1239, 2012.

[10] Eric P Xing, Wenjie Fu, Le Song, et al. A state-space mixed membership blockmodel for dynamic network tomography. The Annals of Applied Statistics, 4(2):535–566, 2010.

[11] Harrison C White, Scott A Boorman, and Ronald L Breiger. Social structure from multiple networks. I. Blockmodels of roles and positions. American Journal of Sociology, 81(4):730–780, 1976.

[12] Ronald S Burt. Detecting role equivalence. Social Networks, 12(1):83–97, 1990.

[13] Francois Lorrain and Harrison C White. Structural equivalence of individuals in social networks. The Journal of Mathematical Sociology, 1(1):49–80, 1971.

[14] Lee Douglas Sailer. Structural equivalence: Meaning and definition, computation and application. Social Networks, 1(1):73–90, 1978.

[15] Stephen P Borgatti and Martin G Everett. Notions of position in social network analysis. Sociological Methodology, pages 1–35, 1992.

[16] Philippa E Pattison. Network models: Some comments on papers in this special issue. Social Networks, 10(4):383–411, 1988.

[17] Noah E Friedkin and Eugene C Johnsen. Social positions in influence networks. Social Networks, 19(3):209–222, 1997.

[18] Martin G Everett and Stephen P Borgatti. Regular equivalence: General theory. Journal of Mathematical Sociology, 19(1):29–52, 1994.

[19] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. arXiv preprint arXiv:1705.02801, 2017.

[20] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[21] Vincent W Zheng, Sandro Cavallari, Hongyun Cai, Kevin Chen-Chuan Chang, and Erik Cambria. From node embedding to community embedding. arXiv preprint arXiv:1610.09950, 2016.

[22] Glen Jeh and Jennifer Widom. SimRank: A measure of structural-context similarity. In Proceedings of the 8th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543. ACM, 2002.

[23] Masoud Reyhani Hamedani and Sang-Wook Kim. SimRank and its variants in academic literature data: Measures and evaluation. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 1102–1107. ACM, 2016.


[24] Zhenjiang Lin, Michael R Lyu, and Irwin King. PageSim: A novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on the World Wide Web, pages 1019–1020. ACM, 2006.

[25] Elizabeth A Leicht, Petter Holme, and Mark EJ Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.

[26] Ruoming Jin, Victor E Lee, and Longjie Li. Scalable and axiomatic ranking of network role similarity. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(1):3, 2014.

[27] Longjie Li, Lvjian Qian, Victor E. Lee, Mingwei Leng, Mei Chen, and Xiaoyun Chen. Fast and accurate computation of role similarity via vertex centrality, pages 123–134. Springer International Publishing, 2015.

[28] Ruoming Jin, Victor E Lee, and Hui Hong. Axiomatic ranking of network role similarity. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 922–930. ACM, 2011.

[29] Boris Weisfeiler and AA Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12– 16, 1968.

[30] Martin Fürer. Weisfeiler-Lehman refinement requires at least a linear number of iterations. In International Colloquium on Automata, Languages, and Programming, pages 322–333. Springer, 2001.

[31] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.

[32] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2):83–97, 1955.

[33] Jeffrey Travers and Stanley Milgram. The small-world problem. Psychology Today, 1:61–67, 1967.

[34] Duncan J Watts and Steven H Strogatz. Collective dynamics of small–world networks. Nature, 393(6684):440–442, 1998.

[35] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[36] Yoav Goldberg and Omer Levy. word2vec explained: deriving Mikolov et al.’s negative- sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

[37] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.


[38] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.

[39] Florian Wickelmaier. An introduction to MDS. Sound Quality Research Unit, Aalborg University, Denmark, 46, 2003.

[40] Ingwer Borg and Patrick JF Groenen. Modern multidimensional scaling: Theory and applications. Springer Science & Business Media, 2005.

[41] Wayne W Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.

[42] Tomihisa Kamada and Satoru Kawai. An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1):7–15, 1989.

[43] Smriti Bhagat, Graham Cormode, and S. Muthukrishnan. Node classification in social networks, pages 115–148. Springer US, Boston, MA, 2011.

[44] Wouter De Nooy, Andrej Mrvar, and Vladimir Batagelj. Exploratory social network analysis with Pajek. Cambridge University Press, New York, NY, USA, 2011.

[45] Daniel Chirot and Thomas D Hall. World-system theory. Annual Review of Sociology, 8(1):81–106, 1982.

[46] David A Smith and Douglas R White. Structure and dynamics of the global economy: Network analysis of international trade 1965–1980. Social Forces, 70(4):857–893, 1992.


3 A network data mining framework

This section contains the publication related to the software developed in this thesis for network data mining. The journal paper associated with this part of the dissertation is:

V. Martínez, F. Berzal, J.C. Cubero. The NOESIS network-oriented exploration, simulation, and induction system. ArXiv e-prints 1611.04810, June 2017. Submitted to Complexity (ISSN: 1099-0526).

• Status: Submitted
• Impact Factor (JCR 2016): 4.621
• Subject Category:
  – Mathematics, Interdisciplinary Applications. Ranking: 2/100.
  – Multidisciplinary Sciences. Ranking: 9/64.

NOESIS: A Framework for Complex Network Data Analysis

Víctor Martínez    Fernando Berzal    Juan-Carlos Cubero

Abstract

Network data mining has attracted a lot of attention, since a large number of real-world problems have to deal with complex network data. In this paper, we present NOESIS, an open source framework for network-based data mining. NOESIS features a large number of techniques and methods for the analysis of structural network properties, network visualization, community detection, link scoring, and link prediction. The proposed framework has been designed following solid design principles and exploits parallel computing using structured parallel programming. NOESIS also provides a stand-alone graphical user interface that allows users without prior programming experience to employ advanced data analysis techniques. The framework is available under a BSD open source software license.

1 Introduction

Data mining, an interdisciplinary subfield of Computer Science, studies the process of extracting valuable information from data by discovering patterns or relationships. Data mining includes classification, regression, clustering, and anomaly detection, among other tasks [1]. A large number of tools and techniques are available for tabular data, where all data examples can be represented as tuples in a relation and share the same set of attributes. However, many problems involve dealing with relational data, where instances are explicitly related through semantic relations. Classic data mining techniques designed for tabular data have severe limitations when we try to fully exploit the relationships available in relational data.

The scientific community has focused on the development of data mining techniques for relational data, leading to the availability of a large collection of techniques for the analysis of the structural properties of networks [2], their visualization [3], the detection of existing communities [4], and the scoring of links, either to rank existing links or to predict new ones [5, 6]. Network data mining involves very different problems. Some examples include the prediction of previously unknown protein interactions in protein-protein interaction networks [7], the prediction of collaborations and tendencies in co-authorship networks [8], or the ranking of the most relevant web sites according to a user query [9].

Different software tools for analyzing relational data have been developed, depending on their main goal and the type of user they are aimed at. Several tools provide their functionality to end users through closed graphical user interfaces, leading to improved usability, but also neglecting the possibility of using the provided techniques in ways unforeseen by their software developers and of integrating them in other software projects.


Other tools were designed as software libraries that can be used in different software projects, providing more functionality at the cost of limiting their usage to users with prior programming experience. Most existing frameworks are oriented towards a specific user community or task, as shown in Table 1. For example, Graphviz [10] and Cytoscape [11] are two popular alternatives for network visualization. On the other hand, igraph [12], NetworkX [13], and SNAP [14] are mainly focused on providing reusable collections of algorithms for network manipulation and analysis. Finally, Pajek [15], NodeXL [16], Gephi [17], and UCINET [18] are some of the most widely used tools for social network analysis (SNA).¹

Tool name    User interface   Programming library   Extensibility   Parallelization support
UCINET       X
Pajek        X
Cytoscape    X                                      X
NodeXL       X
Gephi        X                                      X
igraph                        X                     X
NetworkX                      X                     X
Graphviz                      X
SNAP                          X                     X               X
NOESIS       X                X                     X               X

Table 1: Feature comparison of some popular network analysis software packages.

In this paper, we introduce NOESIS (Network-Oriented Exploration, Simulation, and Induction System), a software framework released under a permissive BSD open source license for analyzing and mining complex networks. NOESIS is built on top of a structured parallel programming suite of design patterns, providing a large number of network mining techniques that are able to exploit the multiple processing cores available in current microprocessors for more efficient computation. Our framework is fully written in Java, which makes it portable across different hardware and software platforms, provided they include a Java virtual machine. In addition, we provide an API binding for Python, which enables the use of NOESIS from the Python scripting language, often used by data scientists.

Our paper is structured as follows. In Section 2, we describe and discuss the NOESIS architecture and design. In Sections 3 and 4, the network analysis and network mining techniques available in NOESIS are presented. In Section 5, the performance of NOESIS is compared to that of other popular network analysis tools. Finally, the current status of the project and future directions are described in Section 6.

¹ A more comprehensive and up-to-date list of available software tools for network analysis can be found at Wikipedia: https://en.wikipedia.org/wiki/Social_network_analysis_software.


2 The design of the NOESIS framework

NOESIS has been designed to be an easily–extensible framework whose architecture provides the basis for the implementation of network data mining techniques. In order to achieve this, NOESIS is designed around abstract interfaces and a set of core classes that provide essential functions, which allows the implementation of different features as independent components with strong cohesion and loose coupling. NOESIS components are designed to be maintainable and reusable, yet highly efficient.

2.1 System architecture

The NOESIS framework architecture and its core subsystems are displayed in Figure 1. These subsystems are described below.

Figure 1: The NOESIS framework architecture and its core subsystems.

The lowest-level component is the hardware abstraction layer (HAL), which provides support for the execution of algorithms in a parallel environment and hides implementation details and much of the underlying technical complexity. This component provides different building blocks for implementing well-studied parallel programming design patterns, such as MapReduce [19]. For example, we would just write result = (double) Parallel.reduce(index -> x[index] * y[index], ADD, 0, SIZE-1) to compute the dot product of two vectors in parallel (a sketch of how such a building block could be implemented is shown at the end of this subsection). The HAL does not only implement structured parallel programming design patterns, but is also responsible for task scheduling and parallel execution. It allows the adjustment of parallel execution parameters, including the task scheduling algorithm.

The reflective kernel is at the core of NOESIS and provides its main features. It provides the base models (data structures) and tasks (algorithms) needed to perform network data mining, as well as the corresponding meta-objects and meta-models, which can be manipulated at run time. It is the underlying layer that supports the large collection of network analysis algorithms and data mining techniques described in the following sections. Different types of networks are dealt with using a unified interface, allowing us to choose the particular implementation that is most adequate for the spatial and computational requirements of each application. Algorithms provided by this subsystem are built on top of the HAL building blocks, allowing the parallelized execution of algorithms whenever possible.

The data access layer (DAL) provides a unified interface to access external data sources. This subsystem allows reading and writing networks in different file formats, providing implementations for some of the most important standardized network file formats. This module also enables the development of data access components for other kinds of data sources, such as streaming network data.

Finally, an application generator is used to build a complete graphical user interface following a model-driven software development (MDSD) approach. This component provides a friendly user interface that allows users without programming skills to use most of the NOESIS framework features.
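To give a flavor of the kind of building block the HAL exposes, the following self-contained sketch shows how a reduce operation in the spirit of the Parallel.reduce call mentioned above could be implemented on top of the standard Java fork/join framework. It is an illustration of the structured parallel programming pattern under our own assumptions, not the actual NOESIS implementation; only standard Java library classes are used.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.function.IntToDoubleFunction;

// Illustrative parallel reduce (addition) over an index range [from, to].
class ParallelReduceSketch extends RecursiveTask<Double> {
    private static final int THRESHOLD = 10_000;
    private final IntToDoubleFunction mapper;
    private final int from, to;   // inclusive bounds

    ParallelReduceSketch(IntToDoubleFunction mapper, int from, int to) {
        this.mapper = mapper;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Double compute() {
        if (to - from < THRESHOLD) {             // small range: sequential reduction
            double sum = 0;
            for (int i = from; i <= to; i++)
                sum += mapper.applyAsDouble(i);
            return sum;
        }
        int mid = (from + to) >>> 1;             // split the range in two halves
        ParallelReduceSketch left = new ParallelReduceSketch(mapper, from, mid);
        ParallelReduceSketch right = new ParallelReduceSketch(mapper, mid + 1, to);
        left.fork();                             // process the left half asynchronously
        return right.compute() + left.join();    // combine both partial sums
    }

    public static void main(String[] args) {
        final int SIZE = 1_000_000;
        double[] x = new double[SIZE], y = new double[SIZE];
        java.util.Arrays.fill(x, 2.0);
        java.util.Arrays.fill(y, 3.0);
        // Dot product, analogous to Parallel.reduce(index -> x[index] * y[index], ADD, 0, SIZE-1).
        double result = ForkJoinPool.commonPool()
                .invoke(new ParallelReduceSketch(i -> x[i] * y[i], 0, SIZE - 1));
        System.out.println(result);              // prints 6.0 * SIZE
    }
}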

2.2 Core classes

The core classes and interfaces shown in Figure 2 provide the foundation for the implementation of different types of networks with specific spatial and computational requirements. Basic network operations include adding and removing nodes, adding and removing links, or querying a node neighborhood. More complex operations are provided through specialized components.

Figure 2: UML class diagram depicting some of the NOESIS core classes and interfaces.

NOESIS supports networks with attributes both in their nodes and their links. These attributes are defined according to predefined data models, including categorical and numerical values, among others.
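A minimal sketch of how these core classes might be used from Java is shown below. The Network and reader classes appear in the examples later in this paper, but the node and link manipulation methods used here (addNode, addLink, and getDegree), as well as the file name, are hypothetical names chosen to illustrate the basic operations described above; they are not taken from the NOESIS API documentation.

FileReader fileReader = new FileReader("mynetwork.gml");   // reader setup as in later examples
NetworkReader reader = new GMLNetworkReader(fileReader);
Network network = reader.read();
// Hypothetical node/link manipulation (method names are assumptions, not verified API):
int node = network.addNode();            // add a new node to the network
network.addLink(node, 0);                // connect it to an existing node
int degree = network.getDegree(node);    // query the size of its neighborhood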


2.3 Supported data formats

Different file formats have been proposed for network datasets. Some data formats are more space efficient, whereas others are more easily parseable. NOESIS supports reading and writing network data sets using the most common data formats. For example, the GDF file format is a CSV-like format used by some software tools such as GUESS and Gephi. It supports attributes in both nodes and links. Another supported file format is GML, which stands for Graph Modeling Language. GML is a hierarchical ASCII-based file format. GraphML is another hierarchical file format based on XML, the ubiquitous eXtensible Markup Language developed by the W3C. Other file formats are supported by NOESIS, such as the Pajek file format, which is similar to GDF, or the file format of the datasets from the Stanford Network Analysis Platform (SNAP) [20].
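As a small illustration of one of these formats, a minimal GML file describing an undirected network with two connected nodes could look as follows (node identifiers and labels are arbitrary):

graph [
  directed 0
  node [
    id 1
    label "A"
  ]
  node [
    id 2
    label "B"
  ]
  edge [
    source 1
    target 2
  ]
]

Such a file can then be loaded with the GML reader used in the code examples of the following sections.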

2.4 Graphical user interface

In order to allow users without programming knowledge to use most of the NOESIS features, a lightweight and easy-to-use graphical user interface is included in the standard NOESIS framework distribution. The NOESIS GUI allows non-technical end users to load, visualize, and analyze their own network data sets using all the techniques provided by NOESIS.

Figure 3: Different screenshots of the NOESIS graphical user interface.

Some screenshots of this GUI are shown in Figure 3. A canvas displays the network at all times. The network can be manipulated by clicking or dragging nodes. At the top of the window, a menu gives access to different options and data mining algorithms. The Network menu allows loading a network from an external source and

exporting the results using different file formats, as well as creating images of the current network visualization both as raster and as vector graphics. The View menu allows the customization of the network appearance by setting specific layout algorithms and custom visualization styles. In addition, this menu allows binding the visual properties of nodes and links to their attributes. The Data menu allows the exploration of the attributes of each node and link. Finally, the Analysis menu gives access to most of the techniques described in the following sections.

3 Network analysis tools

NOESIS is designed to ease the implementation of network analysis tools. It also includes reusable implementations of a large collection of popular network–related techniques, from graph visualization [3] and common graph algorithms, to network structural properties [21] and network formation models [22]. The network analysis tools included in NOESIS and the modules that implement them are introduced in this section.

3.1 Network models

NOESIS implements a number of popular random network generation models, which are described by probability distributions or random processes. Such models have been found to be useful in the study and understanding of certain properties or behaviors observed in real-world networks. Some examples of these models are shown in Figure 4.

Figure 4: Random networks generated using the Erdős–Rényi model (left), the Watts–Strogatz model (center), and the Barabási–Albert model (right).

Among the models included in NOESIS, the Erdős–Rényi model [23] is one of the simplest ones. The Gilbert model [24] is similar, but links are instead included with a given existence probability. The anchored network model is also similar to the two previous models, with the advantage of reducing the occurrence of isolated nodes, at the cost of being less than perfectly random. Finally, the connected random model is a variation of the anchored model that avoids isolated nodes. Other models included in NOESIS exhibit specific properties often found in real-world

networks. For example, the Watts–Strogatz model [25] generates networks with small-world properties, that is, low diameter and high clustering. This model starts by creating a ring lattice with a given number of nodes and a given mean degree, where each node is connected to its nearest neighbors on both sides. In the following steps, each link is rewired to a new target node with a given probability, avoiding self-loops and link duplication. Although the small-world properties exhibited by networks generated by the Watts–Strogatz model are closer to those of real-world networks than those generated by models based on the Erdős–Rényi approach, they still lack some important properties observed in real networks.

The Barabási–Albert model [26] is another well-known model that generates networks whose node degree distribution follows a power law, which leads to scale-free networks. This model is driven by a preferential attachment process, where new nodes are added and connected to existing nodes with a probability proportional to their current degree. Another model with properties very similar to those of the Barabási–Albert model is Price's citation model [27].

In addition to random network models, a number of regular network models are included in NOESIS. These models generate synthetic networks that are useful for testing new algorithms. The regular network models include complete networks, where all nodes are interconnected; star networks, where all nodes are connected to a single hub node; ring networks, where each node is connected to its two closest neighbors along a ring; tandem networks, like ring networks but without closing the loop; mesh networks, where nodes are arranged in rows and columns and connected only to their adjacent nodes; toruses, i.e. meshes where the nodes at the borders of the mesh are also connected; hypercubes; binary trees; and isolates, networks without links.
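To make the preferential attachment process behind the Barabási–Albert model more concrete, the following self-contained Java sketch grows such a network from scratch. It illustrates the model itself under our own simplifying assumptions (an initial clique as seed and uniform sampling over link endpoints), and is unrelated to the actual NOESIS generator shown later in this paper.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal Barabasi-Albert generator: each new node attaches to m existing
// nodes chosen with probability proportional to their current degree.
public class PreferentialAttachmentSketch {

    // Returns an edge list for a network with n nodes and m links per new node.
    static List<int[]> generate(int n, int m, Random random) {
        List<int[]> edges = new ArrayList<>();
        // Every link endpoint is stored once, so sampling a uniform element
        // of this list is equivalent to degree-proportional sampling.
        List<Integer> endpoints = new ArrayList<>();

        // Start from a small clique of m + 1 seed nodes.
        for (int i = 0; i <= m; i++)
            for (int j = 0; j < i; j++) {
                edges.add(new int[]{i, j});
                endpoints.add(i);
                endpoints.add(j);
            }

        // Attach the remaining nodes by preferential attachment.
        for (int node = m + 1; node < n; node++) {
            List<Integer> targets = new ArrayList<>();
            while (targets.size() < m) {
                int candidate = endpoints.get(random.nextInt(endpoints.size()));
                if (candidate != node && !targets.contains(candidate))
                    targets.add(candidate);           // distinct, degree-biased targets
            }
            for (int target : targets) {
                edges.add(new int[]{node, target});
                endpoints.add(node);
                endpoints.add(target);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        List<int[]> edges = generate(100, 10, new Random(42));
        System.out.println("links: " + edges.size());
    }
}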

3.2 Network structural properties

Network structural properties allow the quantification of features or behaviors present in the network. They can be used, for instance, to measure network robustness or to reveal important nodes and links. NOESIS considers three types of structural properties: node properties, node pair properties (for pairs both with and without links between them), and global properties. NOESIS provides a large number of techniques for analyzing network structural properties.

Many structural properties can be computed for nodes. For example, the in-degree and out-degree indicate the number of incoming and outgoing links, respectively. Related to node degree, two techniques to measure node degree assortativity have been included: biased [28] and unbiased [29] node degree assortativity. Node assortativity is a score between −1 and 1 that measures the degree correlation between pairs of connected nodes. The clustering coefficient can also be computed for nodes. The clustering coefficient of a node is the fraction of its neighbors that are also connected to each other.

Reachability scores are centrality measures that quantify how easy it is to reach a node from other nodes. The eccentricity of a node is defined as the maximum distance to any other node [30]. The closeness, however, is the inverse of the sum of the distances from a given node to all others [31]. An adjusted closeness value that normalizes

the closeness according to the number of reachable nodes can also be used. Inversely to closeness, the average path length is defined as the mean distance of the shortest paths to every other node. Decay is yet another reachability score, computed as the sum of a delta factor raised to the path length to every other node [22]. It is interesting to note that, with a delta factor close to 0, the measure becomes the degree of the node, whereas with a delta close to 1, the measure approaches the size of the connected component the node belongs to. A normalized decay score is also available.

Betweenness, like reachability, is another way to measure node centrality. Betweenness, also known as Freeman's betweenness, is computed as the number of shortest paths the node is involved in [32]. Since this score ranges from 2n − 1 to n² − (n − 1), where n is the number of nodes, in strongly-connected networks, a normalized variant is typically used.

Finally, influence algorithms provide a different perspective on node centrality. These techniques measure the power of each node to affect others. The most popular influence algorithm is PageRank [9], since it is used by the Google search engine. PageRank computes a probability distribution based on the likelihood of reaching a node starting from any other node. The algorithm works by iteratively updating node probabilities based on the probabilities of direct neighbors, which leads to convergence if the network satisfies certain properties. A similar algorithm is HITS [33], which stands for hyperlink-induced topic search. It follows an iterative approach, as PageRank does, but computes two scores per node: the hub score, which is related to how many nodes a particular node links to, and the authority score, which is related to how many hubs link to a particular node. Both scores are connected by an iterative updating process: the authority score is updated according to the hub scores of nodes connected by incoming links, and the hub score is updated according to the authority scores of nodes connected by outgoing links. Eigenvector centrality is another iterative method closely related to PageRank, where nodes are assigned a centrality score based on the sum of the centralities of their neighbors. Katz centrality considers all possible paths, but penalizes long ones using a given damping factor [34]. Finally, diffusion centrality [35] is another influence algorithm based on Katz centrality. The main difference is that, while Katz considers paths of infinite length, diffusion centrality considers only paths up to a given limited length.

In the following Java example, we show how to load a network from a data file and compute one of its structural properties using NOESIS, its PageRank scores in particular:

FileReader fileReader = new FileReader("karate.gml"); NetworkReader reader = new GMLNetworkReader(fileReader); Network network = reader.read(); PageRank task = new PageRank(network); NodeScore score = task.call();

We also show how to implement this example using the NOESIS API for Python:

ns = Noesis()
network_reader = ns.create_network_reader("GML")
network = network_reader.read("karate.gml")
pagerank_scorer = ns.create_node_scorer("PageRank")
scores = pagerank_scorer.compute(network)
ns.end()
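For reference, the iterative PageRank update computed by the snippets above corresponds to the standard textbook formulation (reproduced here for clarity, not transcribed from the NOESIS source code), where d is the damping factor, N the number of nodes, In(i) the set of nodes linking to node i, and k_j^out the out-degree of node j:

\[ PR(i) = \frac{1 - d}{N} + d \sum_{j \in In(i)} \frac{PR(j)}{k_j^{out}} \]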

Different structural properties can also be computed for links. Examples include link betweenness, defined as the number of shortest paths a link is involved in, and link rays, defined as the number of possible paths between two nodes that cross a given link. Some of these properties are used by different network data mining algorithms.

3.3 Network visualization techniques

Humans are still better than machines at recognizing certain patterns when data are analyzed visually. Network visualization is a complex task, since networks tend to be huge, with thousands of nodes and links. NOESIS enables the visualization of networks by providing the functionality needed to render the network and export the resulting visualization using different image file formats.

Figure 5: The same dolphin social network [36] represented using different network visualization algorithms: random layout (top left), Kamada–Kawai layout (top right), Fruchterman–Reingold layout (bottom left), and circular layout using average path length (bottom right).

NOESIS provides different automatic graph layout techniques, such as the well–known


Fruchterman–Reingold [37] and Kamada–Kawai [38] force-based layout algorithms (see Figure 5). Force-based layout algorithms assign forces among pairs of nodes and solve the system to reach an equilibrium point, which usually leads to an aesthetically pleasing visualization (a schematic sketch of one iteration of such an algorithm is included at the end of this subsection). Hierarchical layouts [3], which arrange nodes in layers trying to minimize edge crossings, are also included. Different radial layout algorithms are included as well [39]. These layouts are similar to the hierarchical ones, but arrange nodes in concentric circles. Finally, several regular layouts are included. These layouts are common for visualizing regular networks, such as meshes or stars.

NOESIS also allows tuning the look and feel of the network visualization. The visual properties of nodes and links can be customized, including color, size, borders, and so on. In addition, visual properties can be bound to static or dynamic properties of the network. For example, node sizes can be bound to a specific centrality score, allowing the visual display of quantitative information.
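The following self-contained Java sketch illustrates a single iteration of a Fruchterman–Reingold style force-directed layout, with repulsive forces between every pair of nodes and attractive forces along links. It is a schematic illustration of the general technique under our own assumptions (force constants and cooling schedule), not the layout code used by NOESIS.

import java.util.Random;

// Schematic force-directed layout iteration (Fruchterman-Reingold style).
public class ForceLayoutSketch {

    static void iterate(double[] x, double[] y, int[][] links, double k, double temperature) {
        int n = x.length;
        double[] dx = new double[n], dy = new double[n];

        // Repulsive forces: f_r(d) = k^2 / d for every pair of nodes.
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double ddx = x[i] - x[j], ddy = y[i] - y[j];
                double dist = Math.max(Math.hypot(ddx, ddy), 1e-6);
                double force = (k * k) / dist;
                dx[i] += force * ddx / dist;  dy[i] += force * ddy / dist;
                dx[j] -= force * ddx / dist;  dy[j] -= force * ddy / dist;
            }

        // Attractive forces: f_a(d) = d^2 / k for every link.
        for (int[] link : links) {
            int i = link[0], j = link[1];
            double ddx = x[i] - x[j], ddy = y[i] - y[j];
            double dist = Math.max(Math.hypot(ddx, ddy), 1e-6);
            double force = (dist * dist) / k;
            dx[i] -= force * ddx / dist;  dy[i] -= force * ddy / dist;
            dx[j] += force * ddx / dist;  dy[j] += force * ddy / dist;
        }

        // Move each node along its net force, limiting the step by the temperature.
        for (int i = 0; i < n; i++) {
            double disp = Math.max(Math.hypot(dx[i], dy[i]), 1e-6);
            double step = Math.min(disp, temperature);
            x[i] += dx[i] / disp * step;
            y[i] += dy[i] / disp * step;
        }
    }

    public static void main(String[] args) {
        Random random = new Random(0);
        double[] x = new double[4], y = new double[4];
        for (int i = 0; i < 4; i++) { x[i] = random.nextDouble(); y[i] = random.nextDouble(); }
        int[][] links = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};          // a small ring network
        for (int t = 0; t < 50; t++)                               // cooling schedule
            iterate(x, y, links, 0.5, 0.1 * (50 - t) / 50.0);
        System.out.println(x[0] + ", " + y[0]);
    }
}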

4 Network data mining techniques

Network data mining techniques exist for both unsupervised and supervised settings. NOESIS includes a wide array of community detection methods [4] and link prediction techniques [40]. These algorithms are briefly described below.

4.1 Community detection

Community detection can be defined as the task of finding groups of densely connected nodes. A wide range of community detection algorithms have been proposed, each exhibiting different pros and cons. NOESIS features different families of community detection techniques and implements more than ten popular community detection algorithms. Some examples of these community detection techniques are shown in Figure 6. The included algorithms, their time complexity, and their bibliographic references are shown in Table 2.

NOESIS provides hierarchical clustering algorithms. Agglomerative hierarchical clustering treats each node as a cluster and then iteratively merges clusters until all nodes are in the same cluster [53]. Different strategies for the selection of the clusters to merge have been implemented, including single-link [42], which selects the two clusters with the smallest minimum pairwise distance; complete-link [43], which selects the two clusters with the smallest maximum pairwise distance; and average-link [44], which selects the two clusters with the smallest average pairwise distance.

Modularity-based techniques are also available in our framework. Modularity is a score that measures the strength of a particular division of a network into modules. Modularity-based techniques search for communities by attempting to maximize their modularity score [54]. Different greedy strategies, including fast greedy [45] and multi-step greedy [46], are available. These greedy algorithms merge the pairs of clusters that maximize the resulting modularity, until all possible merges would reduce the network modularity.
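For reference, the modularity score that these greedy strategies maximize is commonly defined as follows (the standard definition from [54], reproduced here rather than taken from the NOESIS code), where A_{ij} is the adjacency matrix, k_i the degree of node i, m the number of links, and δ(c_i, c_j) equals 1 when nodes i and j belong to the same community and 0 otherwise:

\[ Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j) \]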


Figure 6: Different community detection methods applied to Zachary's karate club network [41]: fast greedy partitioning (top left), Kernighan-Lin bi-partitioning (top right), average-link hierarchical partitioning (bottom left), and complete-link hierarchical partitioning (bottom right).

Partitional clustering is another common approach. Partitional clustering decomposes the network and performs an iterative relocation of nodes between clusters. For example, Kernighan-Lin bi-partitioning [47] starts with an arbitrary partition into two clusters and then iteratively exchanges nodes between both clusters to minimize the number of links between them. This approach can be applied multiple times to subdivide the obtained clusters. K-means community detection [48] is an application of the traditional k-means clustering algorithm to networks and another prominent example of partitional community detection.

Spectral community detection [53] is another family of community detection techniques included in NOESIS. These techniques use the Laplacian representation of the network, which is computed by subtracting the adjacency matrix of the network from a diagonal matrix whose diagonal elements are equal to the degrees of the corresponding nodes. Then, the eigenvectors of the Laplacian representation of the network are computed. NOESIS includes the ratio cut algorithm (EIG1) [49], the Jordan and Weiss NG algorithm (KNSC1) [50], and spectral k-means [51].

Finally, the BigClam overlapping community detector is also available in NOESIS [52]. In this algorithm, each node has a profile, which consists of a score between 0 and 1 for each cluster that is proportional to the likelihood of the node belonging to that cluster. In addition, a score defined between pairs of nodes yields values proportional to the overlap of their cluster assignments. The algorithm iteratively optimizes each node profile to maximize this value between connected nodes and minimize it among unconnected nodes.

Type           Name                             Complexity     Reference
Hierarchical   Single-link (SLINK)              O(v²)          [42]
               Complete-link (CLINK)            O(v² log v)    [43]
               Average-link (UPGMA)             O(v² log v)    [44]
Modularity     Fast greedy                      O(kvd log v)   [45]
               Multi-step greedy                O(kvd log v)   [46]
Partitional    Kernighan-Lin bi-partitioning    O(v² log v)    [47]
               K-means                          O(kvd)         [48]
Spectral       Ratio cut algorithm (EIG1)       O(v³)          [49]
               Jordan-Weiss NG (KNSC1)          O(v³)          [50]
               Spectral k-means                 O(v³)          [51]
Overlapping    BigClam                          O(v²)          [52]

Table 2: Computational time complexity and bibliographic references for the community detection techniques provided by NOESIS. In the time complexity analysis, v is the number of nodes in the network, d is the maximum node degree, and k is the desired number of clusters.

In the following example, we show how to load a network from a data file and detect communities with the KNSC1 algorithm using NOESIS:

FileReader fileReader = new FileReader("mynetwork.net"); NetworkReader reader = new PajekNetworkReader(fileReader); Network network = reader.read(); CommunityDetector task = new NJWCommunityDetector(network); Matrix results = task.call();

The same example can also be coded in Python using the NOESIS API for Python:

ns = Noesis()
network_reader = ns.create_network_reader("Pajek")
network = network_reader.read("mynetwork.net")
community_detector = ns.create_community_detector("NJW")
communities = community_detector.compute(network)
ns.end()


4.2 Link scoring and prediction

Link scoring and link prediction are two closely related tasks. On the one hand, link scoring aims to compute a value or weight for a link according to specific criteria. Most link scoring techniques obtain this value by considering the overlap or relationship between the neighborhoods of the nodes at both ends of the link. On the other hand, link prediction computes a value, weight, or probability proportional to the likelihood of the existence of a certain link according to a given model of link formation.

The NOESIS framework provides a large collection of methods for link scoring and link prediction, from local methods, which only consider the direct neighborhood of nodes, to global methods, which consider the whole network topology. Some examples are shown in Figure 7. As the amount of information considered increases, so do the computational and spatial complexity of these techniques. The link scoring and prediction methods available in NOESIS are shown in Table 3.

Type     Name                                 Complexity   Reference
Local    Common neighbors count               O(vd³)       [55]
         Adamic-Adar score                    O(vd³)       [56]
         Resource-allocation index            O(vd³)       [57]
         Adaptive degree penalization score   O(vd³)       [58]
         Jaccard score                        O(vd³)       [59]
         Leicht-Holme-Newman score            O(vd³)       [60]
         Salton score                         O(vd³)       [61]
         Sorensen score                       O(vd³)       [62]
         Hub promoted index                   O(vd³)       [63]
         Hub depressed index                  O(vd³)       [63]
         Preferential attachment score        O(vd²)       [64]
Global   Katz index                           O(v³)        [34]
         Leicht-Holme-Newman score            O(cv²d)      [60]
         Random walk                          O(cv²d)      [65]
         Random walk with restart             O(cv²d)      [66]
         Flow propagation                     O(cv²d)      [67]
         Pseudoinverse Laplacian score        O(v³)        [68]
         Average commute time score           O(v³)        [68]
         Random forest kernel index           O(v³)        [69]

Table 3: Computational time complexity and bibliographic references for the link scoring and prediction methods provided by NOESIS. In the time complexity analysis, v is the number of nodes in the network, d is the maximum node degree, and c refers to the number of iterations required by iterative global link prediction methods.


Among local methods, the most basic technique is the common neighbors score [55], which is equal to the number of shared neighbors between a pair of nodes. Most techniques are variations of the common neighbors score. For example, the Adamic-Adar score [56] is the sum, over all shared neighbors, of one divided by the logarithm of the degree of each shared node. The resource-allocation index [57] follows the same expression, but directly considers the degree instead of the logarithm of the degree. The adaptive degree penalization score [58] also follows the same approach, yet automatically determines an adequate degree penalization by considering properties of the network topology.

Other local measures consider the number of shared neighbors, but normalize their value according to certain criteria. For example, the Jaccard score [59] normalizes the number of shared neighbors by the total number of neighbors. The local Leicht-Holme-Newman score [60] normalizes the count of shared neighbors by the product of the sizes of both neighborhoods. Similarly, the Salton score [61] normalizes using the square root of the product of both node degrees. The Sorensen score [62] considers twice the count of shared neighbors normalized by the sum of both node degrees. The hub promoted and hub depressed scores [63] normalize the count of shared neighbors by the minimum and the maximum of both node degrees, respectively. Finally, the preferential attachment score [64] only considers the product of both node degrees.
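To make a few of these local scores concrete, their standard definitions are reproduced below, where Γ(x) denotes the set of neighbors of node x (these are the usual textbook formulations, not transcriptions of the NOESIS source code):

\[ CN(x, y) = |\Gamma(x) \cap \Gamma(y)|, \qquad AA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|} \]

\[ Jaccard(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}, \qquad PA(x, y) = |\Gamma(x)| \cdot |\Gamma(y)| \]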

Figure 7: Different link scoring techniques applied to the Les Misérables coappearance network [70]: common neighbors (top left), preferential attachment score (top right), Sorensen score (bottom left), and Katz index (bottom right). Link width in the figure is proportional to the link score.


Global link scoring and prediction methods are more complex than local methods. For example, the Katz score [34] sums the influence of all possible paths between two nodes, incrementally penalizing paths by their length according to a given damping factor. The global Leicht-Holme-Newman score [60] is quite similar to the Katz score, but resorts to the dominant eigenvalue to compute the final result.

Random walk techniques simulate a Markov chain of randomly-selected nodes [65]. The idea is that, starting from a seed node and randomly moving through links, we can obtain a probability vector where each element corresponds to the probability of reaching each node. The classical random walk iteratively multiplies the probability vector by the transition matrix, which is the row-normalized version of the adjacency matrix, until convergence. An interesting variant is the random walk with restart [66], which models the possibility of returning to the seed node with a given probability. Flow propagation is another variant of the random walk [67], where the transition matrix is computed by performing both row and column normalization of the adjacency matrix.

Some spectral techniques are also available in NOESIS. Spectral techniques, as we mentioned when discussing community detection methods, are based on the Laplacian matrix. The pseudoinverse Laplacian score [68] is the inner product of the rows of the Laplacian matrix corresponding to the pair of nodes. Another spectral technique is the average commute time [68], which is defined as the average number of steps that a random walker starting from a particular node takes to reach another node for the first time and go back to the initial node. Although it models a random walk process, it is considered a spectral technique because it is usually computed in terms of the Laplacian matrix. Given the Laplacian matrix, it can be computed as the diagonal element of the starting node plus the diagonal element of the ending node, minus two times the element located in the row of the first node and the column of the second one. Finally, the random forest kernel score [69] is a global technique based on the concept of spanning tree, i.e. a connected undirected sub-network with no cycles that includes all the nodes and some or all the links of the network. The matrix-tree theorem states that the number of spanning trees in the network is equal to any cofactor of its Laplacian representation, i.e. the determinant obtained by removing the row and column of a given node. As a result, the inverse of the sum of the identity matrix and the Laplacian matrix gives us a matrix that can be interpreted as a measure of accessibility between pairs of nodes.

Using network data mining algorithms in NOESIS is simple. In the following code snippet, we show how to generate a Barabási–Albert preferential attachment network with 100 nodes and 10 links per node, and then compute the resource allocation score for each pair of nodes using NOESIS:

Network network = new BarabasiAlbertNetwork(100, 10);
LinkPredictionScore method = new ResourceAllocationScore(network);
Matrix result = method.call();

In Python, the previous example would be implemented as follows:

ns = Noesis()
network = ns.create_network_from_model("BarabasiAlbert", 100, 10)
predictor = ns.create_link_predictor("ResourceAllocation")
result = predictor.compute(network)
ns.end()

5 Performance comparison

NOESIS is designed to manage large complex networks comprised of thousands or even millions of nodes. It provides structures to efficiently represent and deal with relational data. While these features are often offered by most other graph-related frameworks, those frameworks lack parallel computing features. In this section, we perform an empirical comparison of different popular graph-related frameworks on some common data analysis tasks.

The igraph, SNAP, and NetworkX frameworks have been chosen for their comparison with NOESIS. For NOESIS, different CPU scheduling techniques were tested: a work-stealing scheduler (WSS), a future scheduler (FS), and a conventional thread pool scheduler (TPS). Our experiments were performed on an Intel Core i7-3630QM (8 cores at 2.40GHz) with 16GB of RAM under Windows 10 (64 bits). Three datasets were used in our experiments: a network extracted from Wikipedia with 27475 nodes and 85729 links², a peer-to-peer Gnutella file sharing network from August 2002 with 6301 nodes and 20777 edges [71], and a Facebook ego network with 4039 nodes and 88234 edges [72].

The experimentation carried out in this section consists of computing different computationally-expensive network metrics that are commonly used in network mining: betweenness centrality (BC), link betweenness (LB), closeness (CN), and all shortest paths from every node using Dijkstra's algorithm (APSP). The APSP task could not be implemented using SNAP due to support limitations. We performed 5 runs for each measure and report the average execution time required to complete the task for each software tool and network data set. The average execution time for each task/dataset/tool triplet is shown in Table 4. It should be noted that NOESIS is consistently faster than well-known existing frameworks.

The superior performance of the NOESIS framework can be attributed to different factors. First of all, NOESIS is written in pure Java, which can be orders of magnitude faster than the Python implementation of igraph and NetworkX. The NOESIS Java code is highly optimized taking into account the peculiarities of the Java virtual machine (e.g. trying to minimize memory fragmentation and using a properly-tuned cost model for different operations). As a consequence, the NOESIS Java implementation is even faster than the C++ implementation of SNAP, which might be expected to be more efficient, since highly-optimized C++ applications are usually more efficient than their Java counterparts. Finally, NOESIS parallel programming design patterns let algorithm designers easily exploit the parallelism in multicore processors and multiprocessor systems.

² The Wikipedia dataset can be downloaded from: http://spark-public.s3.amazonaws.com/sna/other/wikipedia.gml.


Dataset        Framework     BC        LB        CN       APSP
Wikipedia      NOESIS-WSS    14485     32748     23581    54374
               NOESIS-FS     13837     33082     23657    57299
               NOESIS-TPS    10854     23424     23233    48005
               igraph        43328     110681    73181    198568
               SNAP          464055    579435    184254   -
               NetworkX      2246487   2812940   409698   1598431
p2p-Gnutella   NOESIS-WSS    576       1025      879      1971
               NOESIS-FS     499       1097      868      1804
               NOESIS-TPS    331       717       865      1288
               igraph        1576      2875      4046     8947
               SNAP          18506     21854     8949     -
               NetworkX      67631     81315     16719    60983
Facebook       NOESIS-WSS    2267      6778      5087     1778
               NOESIS-FS     2500      7937      4977     2140
               NOESIS-TPS    1841      5339      4874     1443
               igraph        4677      26962     5302     15640
               SNAP          15206     17604     15295    -
               NetworkX      244842    291605    120558   481848

Table 4: The performance of different network analysis tools on different tasks over several network data sets (times shown in milliseconds). Best times for each dataset are shown in bold.

The combination of these features allows NOESIS to complete the tasks associated with computing important network-related structural properties faster than some popular software packages, even orders of magnitude faster.

In our parallelization experiments, we implemented different CPU scheduling techniques. NOESIS is faster than existing tools no matter which scheduler is employed. The NOESIS conventional thread pool scheduler, which implements a common solution to the parallelization of software with the help of a single thread pool, consistently obtains the best results in the batch of experiments we have performed. An alternative implementation, using Java futures, is somewhat slower. Similar results were obtained using a work-stealing CPU scheduler [73]. In a work-stealing scheduler, each processor in a computer system has a queue of work items (computational tasks, threads) to perform. When a processor runs out of work, it looks at the queues of other processors and "steals" their work items. In effect, work stealing distributes the scheduling work over idle processors, and as long as all processors have work to do, no scheduling overhead occurs. It is employed in the scheduler for the Cilk programming language [74], the Java fork/join framework [75], and the .NET Task Parallel Library [76]. For the tests we performed, which involved a heavy workload for a single multicore processor, the simplest strategy seemed to work better, yet typical workloads might be different and alternative CPU schedulers might be preferable.
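As a small, generic illustration of the work-stealing approach just described (unrelated to the internal NOESIS scheduler), Java's fork/join framework [75] provides a ready-made work-stealing pool that parallel streams use under the hood, and it even exposes a counter of stolen tasks:

import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

// Run a parallel computation on the common fork/join pool (a work-stealing
// pool) and inspect how many subtasks were stolen by idle worker threads.
public class WorkStealingDemo {
    public static void main(String[] args) {
        long sum = LongStream.rangeClosed(1, 10_000_000)
                             .parallel()   // range is split into subtasks placed on per-worker deques
                             .sum();
        System.out.println("sum = " + sum);
        System.out.println("steal count = " + ForkJoinPool.commonPool().getStealCount());
    }
}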

6 Conclusion

In this paper, we have presented the NOESIS framework for complex network data mining. NOESIS provides a large collection of data analysis techniques for networks in a much more efficient way than competing alternatives, since NOESIS exploits the parallelism available in existing hardware using structured parallel programming design patterns [77]. As shown in our experimentation, NOESIS is significantly faster than other libraries, even orders of magnitude faster, which is extremely important when dealing with massive networks.

Currently, the NOESIS project has achieved a mature status, with more than thirty-five thousand lines of Java code, hundreds of classes, and dozens of packages. In addition, production code is covered by automated unit tests to ensure the correctness of the implemented algorithms. NOESIS includes a custom library of reusable components with more than forty thousand lines, providing different general functionalities, such as customizable collections, structured parallel programming building blocks, mathematical routines, and a model-driven application generator on top of which the NOESIS graphical user interface is built. Since Python has become a very popular programming language for scientific computing, a complete API binding for Python is provided, so that NOESIS can be used from Python.

NOESIS is one of the most complete and efficient "out of the box" frameworks for analyzing and mining relational data. Our framework can accelerate the development of software that needs to analyze networks by providing efficient and scalable techniques that cover different aspects of network data mining. Our framework is open source and freely available from its official web site, http://noesis.ikor.org, under a permissive Berkeley Software Distribution license.

Data availability

The network datasets used in our performance evaluation are from publicly available studies, which have been cited in our paper. Network datasets can be obtained from the Stanford Large Network Dataset Collection, http://snap.stanford.edu/data.

Conflicts of interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

The NOESIS project is partially supported by the Spanish Ministry of Economy and the European Regional Development Fund (FEDER), under grant TIN2012-36951, and the Spanish Ministry of Education under the program "Ayudas para contratos predoctorales para la formación de doctores 2013" (grant BES-2013-064699). We are grateful to Aarón Rosas, Francisco-Javier Gijón, and Julio-Omar Palacio for their contributions to the implementation of community detection methods in NOESIS.

References

[1] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, et al. Introduction to data mining. Pearson, Addison Wesley, Boston, 2006.

[2] Stefano Boccaletti, Vito Latora, Yamir Moreno, Martin Chavez, and D-U Hwang. Complex networks: Structure and dynamics. Physics Reports, 424(4):175–308, 2006.

[3] Roberto Tamassia. Handbook of Graph Drawing and Visualization. CRC Press, 2013.

[4] Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: a comparative analysis. Physical Review E, 80(5):056117, 2009.

[5] Lise Getoor and Christopher P Diehl. Link mining: a survey. ACM SIGKDD Explorations Newsletter, 7(2):3–12, 2005.

[6] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.

[7] Víctor Martínez, Carlos Cano, and Armando Blanco. ProphNet: A generic prioritization method through propagation of information. BMC Bioinformatics, 15(Suppl 1):S5, 2014.

[8] Milen Pavlov and Ryutaro Ichise. Finding experts by link prediction in co-authorship networks. Finding Experts on the Web with Semantics Workshop, 290:42–55, 2007.

[9] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[10] John Ellson, Emden Gansner, Lefteris Koutsofios, Stephen C North, and Gordon Woodhull. Graphviz - open source graph drawing tools. In Graph Drawing, pages 483–484. Springer, 2002.

[11] Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11):2498–2504, 2003.

[12] Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5):1–9, 2006.


[13] Daniel A Schult and P Swart. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conferences (SciPy 2008), volume 2008, pages 11–16, 2008.

[14] Jure Leskovec and Rok Sosič. SNAP: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST), 8(1):1, 2016.

[15] Vladimir Batagelj and Andrej Mrvar. Pajek-program for large network analysis. Connections, 21(2):47–57, 1998.

[16] Marc A Smith, Ben Shneiderman, Natasa Milic-Frayling, Eduarda Mendes Rodrigues, Vladimir Barash, Cody Dunne, Tony Capone, Adam Perer, and Eric Gleave. Analyzing (social media) networks with NodeXL. In Proceedings of the fourth International Conference on Communities and Technologies, pages 255–264. ACM, 2009.

[17] Mathieu Bastian, Sebastien Heymann, Mathieu Jacomy, et al. Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media, 8:361–362, 2009.

[18] Stephen P Borgatti, Martin G Everett, and Linton C Freeman. UCINET for Windows: Software for social network analysis. Technical report, Analytic Technologies, 2002.

[19] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[20] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, jun 2014.

[21] Mark Newman. Networks: An Introduction. Oxford University Press, 2010.

[22] Matthew O. Jackson. Social and Economic Networks. Princeton University Press, Princeton, NJ, USA, 2008.

[23] Paul Erdős and Alfréd Rényi. On random graphs. Publicationes Mathematicae Debrecen, 6:290–297, 1959.

[24] Edgar N Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141– 1144, 1959.

[25] Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.

[26] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.

[27] Mark EJ Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.

[28] M Piraveenan, M Prokopenko, and AY Zomaya. Local assortativeness in scale-free networks. Europhysics Letters, 84(2):28002, 2008.


[29] Mahendra Piraveenan, Mikhail Prokopenko, and Albert Y Zomaya. Classifying complex networks using unbiased local assortativity. In ALIFE, pages 329–336, 2010.

[30] Per Hage and Frank Harary. Eccentricity and centrality in networks. Social Networks, 17(1):57–63, 1995.

[31] Alex Bavelas. Communication patterns in task-oriented groups. The Journal of the Acoustical Society of America, 22(6):725–730, 1950.

[32] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.

[33] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

[34] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.

[35] Chanhyun Kang, Cristian Molinaro, Sarit Kraus, Yuval Shavitt, and VS Subrahmanian. Diffusion centrality in social networks. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), pages 558–564. IEEE Computer Society, 2012.

[36] David Lusseau, Karsten Schneider, Oliver J Boisseau, Patti Haase, Elisabeth Slooten, and Steve M Dawson. The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54(4):396–405, 2003.

[37] Thomas MJ Fruchterman and Edward M Reingold. Graph drawing by force-directed placement. Software: Practice and experience, 21(11):1129–1164, 1991.

[38] Tomihisa Kamada and Satoru Kawai. An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1):7–15, 1989.

[39] Graham J Wills. NicheWorks: interactive visualization of very large graphs. Journal of Computational and Graphical Statistics, 8(2):190–212, 1999.

[40] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.

[41] Wayne W Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, pages 452–473, 1977.

[42] Robin Sibson. Slink: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.

[43] Daniel Defays. An efficient algorithm for a complete link method. The Computer Journal, 20(4):364–366, 1977.

[44] Bing Liu. Web Data Mining, 2nd edition. Springer Berlin Heidelberg, Berlin, Germany, 2011.


[45] Mark EJ Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69(6):066133, 2004.

[46] Philipp Schuetz and Amedeo Caflisch. Efficient modularity optimization by multistep greedy algorithm and vertex mover refinement. Physical Review E, 77(4):046112, 2008.

[47] Brian W Kernighan and Shen Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2):291–307, 1970.

[48] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, number 14 in 1, pages 281–297. Oakland, CA, USA, 1967.

[49] Lars Hagen and Andrew B Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(9):1074–1085, 1992.

[50] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849–856, 2002.

[51] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[52] Jaewon Yang and Jure Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 587–596. ACM, 2013.

[53] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.

[54] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

[55] Mark EJ Newman. Clustering and preferential attachment in growing networks. Physical Review E, 64(2):025102, 2001.

[56] Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, 2003.

[57] Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630, 2009.

[58] Víctor Martínez, Fernando Berzal, and Juan-Carlos Cubero. Adaptive degree penalization for link prediction. Journal of Computational Science, 13:1–9, 2016.

[59] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.

[60] Elizabeth A Leicht, Petter Holme, and Mark EJ Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.

[61] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986.


[62] Thorvald Sørensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter, 5:1–34, 1948.

[63] Erzsébet Ravasz, Anna Lisa Somera, Dale A Mongru, Zoltán N Oltvai, and A-L Barabási. Hierarchical organization of modularity in metabolic networks. Science, 297(5586):1551–1555, 2002.

[64] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[65] Karl Pearson. The problem of the random walk. Nature, 72:342, 1905.

[66] Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Proceedings of the Sixth International Conference on Data Mining, ICDM '06, pages 613–622. IEEE Computer Society, 2006.

[67] Oron Vanunu and Roded Sharan. A propagation-based algorithm for inferring gene-disease associations. In German Conference on Bioinformatics, pages 54–52, 2008.

[68] Francois Fouss, Alain Pirotte, Jean-Michel Renders, and Marco Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355–369, 2007.

[69] Pavel Chebotarev and Elena Shamis. Matrix-forest theorems. arXiv preprint math/0602575, 2006.

[70] Donald E. Knuth. The Stanford GraphBase: A Platform for Combinatorial Computing. Addison-Wesley Professional, 1st edition, 2009.

[71] Matei Ripeanu, Ian Foster, and Adriana Iamnitchi. Mapping the Gnutella network: Properties of large-scale peer-to-peer systems and implications for system design. arXiv preprint cs/0209028, 2002.

[72] Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, pages 539–547, 2012.

[73] Robert D Blumofe and Charles E Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM), 46(5):720–748, 1999.

[74] Robert D Blumofe, Christopher F Joerg, Bradley C Kuszmaul, Charles E Leiserson, Keith H Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.

[75] Doug Lea. A Java fork/join framework. In Proceedings of the ACM 2000 Conference on Java Grande, pages 36–43. ACM, 2000.

[76] Daan Leijen, Wolfram Schulte, and Sebastian Burckhardt. The design of a task parallel library. ACM SIGPLAN Notices, 44(10):227–242, 2009.

[77] Michael D McCool, Arch D Robison, and James Reinders. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier, 2012.
