Link Prediction in Large-Scale Complex Networks (Application to Bibliographical Networks)
Total Page:16
File Type:pdf, Size:1020Kb
N◦ attribué par la bibliothèque | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | Université Paris Nord Doctoral Thesis Link Prediction in Large-scale Complex Networks (Application to Bibliographical Networks) Author: Manisha Pujari A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the research team Apprentissage Artificiel et Applications LIPN CNRS UMR-7030 Jury: Reviewer: Céline Robardet Professor INSA Lyon Bénédicte le Grand Professor Université Paris 1 Panthéon Sorbonne Examiner: Aldo Gangemi Professor SPC, Université Paris 13 Christophe Prieur Associate Professor, HDR Université Paris Diderot Director: Céline Rouveirol Professor SPC, Université Paris 13 Supervisor: Rushed Kanawati Associate Professor SPC, Université Paris 13 “ The more I learn, the more I realize how much I don’t know. ” Albert Einstein Abstract Link Prediction in Large-scale Complex Networks (Application to Bibliographical Networks) In this work, we are interested to tackle the problem of link prediction in complex net- works. In particular, we explore topological dyadic approaches for link prediction. Dif- ferent topological proximity measures have been studied in the scientific literature for finding the probability of appearance of new links in a complex network. Supervised learning methods have also been used to combine the predictions made or information provided by different topological measures. They create predictive models using various topological measures. The problem of supervised learning for link prediction is a difficult problem especially due to the presence of heavy class imbalance. In this thesis, we search different alternative approaches to improve the performance of different dyadic approaches for link prediction. We propose here, a new approach of link prediction based on supervised rank aggregation that uses concepts from computational social choice theory. Our approach is founded on supervised techniques of aggregating sorted lists (or preference aggregation). We also explore different ways of improving supervised link prediction approaches. One approach is to extend the set of attributes describing an example (pair of nodes) by attributes calculated in a multiplex network that includes the target network. Multiplex networks have a layered structure, each layer having different kinds of links between same sets of nodes. The second way is to use community information for sampling of examples to deal with the problem of class imbalance. Experiments conducted on real networks extracted from well known DBLP bibliographic database. Keywords : Complex networks, Link prediction, Supervised rank aggregation, Multi- plex network analysis. Résumé Prévision de Liens dans les Grands Graphes de Terrain (Application aux réseaux bibliographiques) Nous nous intéressons dans ce travail au problème de prévision de nouveaux liens dans des grands graphes de terrain. Nous explorons en particulier les approches topologiques dyadiques pour la prévision de liens. Différentes mesures de proximité topologique ont été étudiées dans la littérature pour prédire l’apparition de nouveaux liens. Des techniques d’apprentissage supervisé ont été aussi utilisées afin de combiner ces différentes mesures pour construire des modèles prédictifs. Le problème d’apprentissage supervisé est ici un problème difficile à cause notamment du fort déséquilibre de classes. Dans cette thèse, nous explorons différentes approches alternatives pour améliorer les per- formances des approches dyadiques pour la prévision de liens. Nous proposons d’abord, une approche originale de combinaison des prévisions fondée sur des techniques d’agré- gation supervisée de listes triées (ou agrégation de préférences). Nous explorons aussi différentes approches pour améliorer les performances des approches supervisées pour la prévision de liens. Une première approche consiste à étendre l’ensemble des attri- buts décrivant un exemple (paires de noeuds) par des attributs calculés dans un réseau multiplexe qui englobe le réseau cible. Un deuxième axe consiste à évaluer l’apport des techniques de détection de communautés pour l’échantillonnage des exemples. Des ex- périmentations menées sur des réseaux réels extraits de la base bibliographique DBLP montrent l’intérêt des approaches proposées. Mots-clès : Réseaux complexes, Prévision de liens, Agrégation supervisée de préfé- rences, Analyse de réseaux multiplexes. Dedicated to my mother Dr. Kamalini Pujari Acknowledgements Acknowledgement is something very special for me where I can show my gratitude to- wards all who have directly or indirectly affected my life and helped me to sail through the tides. I must start with my supervisor Dr. Rushed Kanawati who has been a perfect guide during this research work. He has been a great mentor for me starting from the days of my internship in 2010. I still remember the early time when I was not very confident about my ability to do a thesis, and it was Rushed who used to tell me stories from his life to build up my confidence. And indeed, I will be going out to the professional world as a much more confident person than before. His liveliness and enthusiasm towards his work is very inspiring. His advice on both research as well as on my career have been invaluable. I thank him for believing in me and giving me an opportunity to work with him. I would like to express my thanks to my thesis director, Prof. Céline Rouveirol for being very kind and supportive throughout these years of research. In spite of her busy schedules, she always took time for guiding and helping me whenever I needed. She was always ready to answer my questions on anything regarding teaching or research. Her encouragement and guidance during the final preparations of this thesis and presentation is overwhelming. She is a highly knowledgeable and experienced person. I feel myself very fortunate to have got a chance know her and to work with her. I also thank all members of the jury for accepting to be a part of final judgement day of my work. In particular, I would like to thank Prof. Céline Robardet and Prof. Bénédicte Le Grand for being the reviewers. Their comments and advices helped me a lot to improve this manuscript. A good support system is important for surviving and staying sane during Ph.D. For making this journey fun and being there with me, many thanks to all my friends: Yue Ma, Zied Yakoubi, Abdoulaye Guissé, Nasserine Benchettara, Pegah Alizadeh, Hanene Ochi, Ines Chebil, Amine Chaibi, Sarra Ben Abbes, Nouha Omrane. They will always remain in my mind and heart. I also thanks all other friends and colleagues at LIPN. My list will remain incomplete without thanking Brigitte Guéveneux, Nathalie Tavares, Marie Fontanillas and Antonia Wilk who were very kind and helpful for all types of administrative works throughout these years. In the end I would like to thank my entire family for being with me through thick and thin. My husband Samarendra Tripathy is my greatest strength. He is my best friend, my guide and my critic too. It was his dream that I do a Ph.D. and it was never possible for me to complete this work without his constant support. I feel so lucky to have him in my life. I think of my parents today. I thank my mother Kamalini Pujari for being the greatest inspiration of my life. May it be work or home, she is just perfect. I thank her for teaching me how to lead a happy life, not just by words but by example. I thank my father Byomokesh Pujari for his unconditional love and trust in me at every step of life. He has been a friend in moments when I badly needed one. I also thank my little brother Manas Ranjan Pujari for being a wonderful sibling in spite of all tantrums I throw. I greatly thank my parent-in-laws, Binod Chandra Tripathy and Snehalata Tripathy for their constant blessings and prayers which helped me to never lose hope and keep going. I thank the entire family of my husband for their fun-filled and sweet encouraging words that helped me discharging all the stress. Thank you. Manisha Pujari March, 2015 Contents Abstract v Résumé vii Acknowledgements xi List of Figures xvii List of Tables xix 1 Introduction 1 1.1 Context ..................................... 1 1.2 Contributions .................................. 2 1.3 Outline ..................................... 4 2 Complex Networks Analysis 5 2.1 Introduction ................................... 5 2.2 Complex networks ............................... 5 2.2.1 Formal definitions ........................... 10 2.3 Characteristics of complex networks ..................... 12 2.4 Network modeling ............................... 15 2.5 Tasks in network analysis ........................... 17 2.6 Bibliographical networks ............................ 22 2.7 Conclusion .................................... 27 3 Link Prediction in Complex Networks: Topological Approaches 29 3.1 Introduction ................................... 29 3.2 Problem description, notations and evaluation ................ 31 3.2.1 Evaluation ................................ 31 3.3 Link prediction approaches .......................... 33 3.3.1 Unsupervised approaches ....................... 35 3.3.1.1 Neighborhood based features ................ 36 3.3.1.2 Path based features ..................... 38 3.3.1.3 Aggregation of node topological features .......... 40 3.3.2 Supervised approaches ......................... 41 3.3.2.1 Supervised machine learning based approaches . 41 xiii Contents xiv 3.3.2.2 Matrix based approaches .................. 44 3.3.2.3 Probabilistic