
University of Granada
Department of Computer Science and Artificial Intelligence

Supervised Data Mining in Networks: Link Prediction and Applications

Doctoral Thesis

Víctor Martínez Gómez

Supervisors
Fernando Berzal Galiano
Juan Carlos Cubero Talavera

Programa de Doctorado en Tecnologías de la Información y la Comunicación
PhD Program in Information and Communication Technologies

Granada, May 2018

Publisher: Universidad de Granada. Tesis Doctorales
Author: Víctor Martínez Gómez
ISBN: 978-84-9163-982-4
URI: http://hdl.handle.net/10481/53613

The doctoral candidate Víctor Martínez Gómez and the thesis supervisors Fernando Berzal Galiano and Juan Carlos Cubero Talavera guarantee, by signing this doctoral thesis, that the work has been carried out by the doctoral candidate under the direction of the thesis supervisors and that, to the best of our knowledge, the rights of other authors to be cited have been respected whenever their results or publications have been used.

Granada, May 2018.

Víctor Martínez Gómez
Fernando Berzal Galiano
Juan Carlos Cubero Talavera

This doctoral thesis was carried out with funding from the grant with reference BES-2013-064699 under the Ayudas para Contratos Predoctorales para la Formación de Doctores 2013 program and from the project with reference TIN2012-36951 under the Proyectos Nacionales de Investigación program, both attached to the Ministerio de Economía, Industria y Competitividad.
This doctoral thesis is partially supported by the Spanish Ministry of Economy and the European Regional Development Fund (FEDER), under grant Ayudas para Contratos Predoctorales para la Formación de Doctores 2013 with reference BES-2013-064699 and grant Proyectos Nacionales de Investigación with reference TIN2012-36951.

Table of Contents

Resumen
Abstract

I PhD dissertation
  1 Introduction
  2 Preliminaries
    2.1 Basic concepts
    2.2 The link prediction problem
  3 Objectives
  4 Discussion of results
    4.1 Link prediction
      4.1.1 Link prediction: The state of the art
      4.1.2 Adaptive link prediction
      4.1.3 Probabilistic local link prediction
    4.2 Applications
      4.2.1 Prioritization using heterogeneous data
      4.2.2 Disambiguation of semantic relations
      4.2.3 An automorphic distance metric for node role discovery
    4.3 A network data mining framework
  5 Concluding remarks
  6 Future work
  Bibliography

II Publications: published, accepted, and submitted papers
  1 Link prediction
    1.1 Link prediction: The state of the art
    1.2 Adaptive link prediction
    1.3 Probabilistic local link prediction
  2 Applications
    2.1 Prioritization using heterogeneous data
    2.2 Disambiguation of semantic relations
    2.3 An automorphic distance metric for node role discovery
  3 A network data mining framework

Resumen

Link prediction consists of predicting the existence of currently unobserved links, or of links that will appear in the future, between pairs of nodes in complex networks. This problem has attracted the attention of researchers in diverse disciplines because of its usefulness in a wide range of applications, including the identification of genes associated with particular diseases and the improvement of the suggestions made by recommender systems.
This doctoral thesis comprises different lines of work, all closely related to the link prediction problem. On the one hand, after an exhaustive study of the state of the art in link prediction, the main limitations of the currently proposed approaches were identified. These limitations concerned the difficulties of balancing scalability and performance in link prediction techniques. Two scalable link prediction techniques have been proposed that follow different approaches to exploiting local features of the network. On the other hand, different applications of link prediction techniques have been addressed. A new algorithm has been proposed for generic prioritization, such as the prioritization of genes associated with a given disease, which achieved better results than other techniques thanks to its ability to integrate heterogeneous data sources. An algorithm has also been developed for word sense disambiguation in semantic relations between concepts, based on link prediction and requiring no annotated data. In this work, we show how our algorithm achieved higher accuracy than other state-of-the-art techniques in different evaluation tasks and how the extracted relations can be used to improve the performance of state-of-the-art techniques for word sense disambiguation. In addition, since the role of nodes influences how links form in complex networks, we have developed a new distance metric based on the concept of automorphic equivalence, with application to the discovery of node roles. Finally, we have developed a data mining tool for complex networks.
This tool, called NOESIS, contains efficient implementations of an extensive list of network-related algorithms, including a complete library of link prediction techniques.

Abstract

Link prediction is the problem of predicting the existence of currently unobserved links, or of links that will appear in the future, between pairs of nodes in complex networks. This problem has attracted a great deal of attention from researchers in diverse disciplines due to its applicability in a wide range of tasks, such as the identification of disease-associated candidate genes or the improvement of the suggestions made by recommender systems. This PhD dissertation comprises different lines of work, all of them closely related to the link prediction problem. On the one hand, after an exhaustive study of the state of the art in link prediction, the main limitations of currently proposed approaches were identified. These limitations were related to the difficulties associated with the trade-off between scalability and performance in link prediction techniques. Two scalable link prediction techniques were proposed that follow different approaches to exploit local network features. On the other hand, different applications of link prediction techniques were addressed. We proposed a novel algorithm for generic prioritization, such as disease-gene prioritization, which achieved better results than other state-of-the-art techniques due to its capacity for integrating heterogeneous data sources. We also developed a novel algorithm for word sense disambiguation of semantic relations between concepts, based on link prediction and requiring no annotated data. We showed how our algorithm achieved better accuracy than other state-of-the-art techniques in different evaluation tasks and how relations extracted using our approach could improve the performance of state-of-the-art general-purpose word sense disambiguation techniques.
In addition, since node role influences how links are formed in complex networks, we developed a novel distance metric based on the concept of automorphic equivalence with application to node role discovery. Finally, we developed a software framework for network data mining. This framework, called NOESIS, contains efficient implementations of an extensive list of network-related algorithms, including a complete library of link prediction techniques.

Chapter I
PhD dissertation

The aim of this chapter is to describe the context of this doctoral dissertation. In the first section, we introduce and motivate the addressed problem. The second section introduces the essential concepts and terminology used throughout this dissertation. In the third section, the objectives set for this dissertation are described. The results achieved in this thesis, and the associated publications, are summarized in the fourth section. In the fifth section, concluding remarks are presented. Finally, the sixth section proposes future lines of work that arise from the research described in this dissertation.

1 Introduction

Leonhard Euler laid the foundations of graph theory while working on the problem of the Seven Bridges of Königsberg in 1736. Despite the prominence of this area of mathematics, most work over the following centuries was theoretical and focused on the study of small graphs [1]. However, new graph-related problems and challenges have arisen as the amount of available data has increased dramatically over the last few decades [2]. Nowadays, many systems are composed of thousands or even millions of entities or agents with relationships or interactions between them. These systems exhibit complex dynamics and are studied in many different fields. These complex networks arise in very different domains, such as social networks [3], transport networks [4], or protein-protein