Multidimensional Network Analysis
Università degli Studi di Pisa
Dipartimento di Informatica
Dottorato di Ricerca in Informatica

Ph.D. Thesis

Multidimensional Network Analysis

Michele Coscia

Supervisor: Fosca Giannotti
Supervisor: Dino Pedreschi

May 9, 2012

Abstract

This thesis is focused on the study of multidimensional networks. A multidimensional network is a network in which there may be multiple different qualitative and quantitative relations among the nodes. Traditionally, complex network analysis has focused on networks with only one kind of relation. Even with this constraint, monodimensional networks have posed many analytic challenges, being representations of complex systems that are ubiquitous in nature. However, it is a matter of common experience that considering only one relation at a time limits the set of real world phenomena that can be represented with complex networks. When multiple different relations act at the same time, traditional complex network analysis cannot provide suitable analytic tools. Providing suitable tools for this scenario is exactly the aim of this thesis: the creation and study of a Multidimensional Network Analysis, to extend the toolbox of complex network analysis and grasp the complexity of real world phenomena.

The urgency of and need for a multidimensional network analysis are presented here, along with empirical evidence of the ubiquity of this multifaceted reality in different complex networks, and some related works proposed in the last two years in this novel setting, which is yet to be systematically defined. Then, we tackle the foundations of the multidimensional setting at different levels, both by looking at the basic extensions of the known model and by developing novel algorithms and frameworks for well-understood and useful problems, such as community discovery (our main case study), temporal analysis, link prediction and more.
We conclude this thesis with two real world scenarios: a monodimensional study of international trade, which may be improved with our proposed multidimensional analysis; and the analysis of literature and bibliography in the field of classical archaeology, used to show how natural and useful the choice of a multidimensional network analysis strategy is in a problem traditionally tackled with different techniques.

To my family: my parents, for their exceptional support of all my needs, and my sister, for being way smarter and more professional than I will ever dream of being.

εαν μη ελπηται ανελπιστον ουκ εξευρησει, ανεξερευνητον εον και απορον:

Acknowledgments

The most important role in this thesis, for which I am grateful beyond what I can express, is the one played by my supervisors. Fosca Giannotti and Dino Pedreschi are a fundamental part of my professional career, and probably all the opportunities I was able to obtain are due to their hard work. I also need to thank the professional guiding figures of Ricardo Hausmann and Cesar A. Hidalgo, two incredibly enthusiastic and professional researchers who, even if they do not have any formal role in this thesis, have accompanied me through more than half of my period as a graduate student and are offering me the possibility of having a real impact on the world, far greater than the one I could have dreamed of before. Albert-Laszlo Barabasi also gave me great opportunities and a great scientific environment to live in. Profound gratitude also goes to the reviewers of this thesis, Hannu Toivonen, Yong-Yeol Ahn, Paolo Ferragina and Maria Simi, whose comments greatly improved the quality of the work I am presenting.

Among all my co-authors, the prominent figure who deserves the biggest acknowledgment is for sure Michele Berlingerio, who taught me everything I know about being a good researcher and an efficient computer scientist, probably losing in the process more time, energy and patience than he expected.
A big thanks also to Maximilian Schich, because I do believe that 90% of what I know about complex networks is probably due to him. I also cannot stress enough how important it was to work together with Anna Monreale and Ruggero Pensa, two people who worked harder than one can believe, and who are also responsible for the preparation that led me to start the PhD. Salvo Rinzivillo was the funniest and most enjoyable co-author ever, able to "make me christian", as he would say. And finally, of course, a big thanks to Amedeo Cappelli.

There is an incredible number of other people with whom I do not have a formal collaboration; nevertheless, their impact on this thesis is far from negligible. They are too many, and a data mining approach is needed to cluster them into geographically separated groups. This consideration also gives an idea of how ridiculous it is for me to take the entire credit as sole author of this thesis, which is basically a collaborative melting pot of an incredible hive-mind: I should rather take the blame for all the errors and misunderstandings I put in it.

Pisa is the place where my career was born. Therefore I should start by thanking Alessio Orlandi (for that coding summer, for Lucca Comics and for Zurich), Diego Pennacchioli (who got plenty of unrequested updates about this thesis), Roberto Trasarti (I hope he remembers the fun we had in our complex network class in Lucca), Filippo Volpini, Giulio Rossetti and Riccardo Guidotti (my beloved trio of undergraduates), Lorenzo Gabrielli (the most precise person I know), Mirco Nanni (a fellow cinephile), Chiara Renso (she always brings loads of stuff to eat and drink from her missions) and Chiara Falchi (I probably owe her way more than a month of dinners). I will probably be killed by all the people I did not mention, they are that many, but it is better to stop here.

I spent an important period of my life at the Harvard Kennedy School, and I hope to continue to stay there for a long time.
I then thank Catalina Prieto and Jennifer Gala, for being the solution to all my problems involving the US. Stephen Kosack keeps demonstrating what a great guy he is, and I am looking forward to seeing what our thousands of projects will become. A big thanks also to Juan Jimenez, Muhammed Yildirim, Isabel Meirelles and Alex Simoes (an MIT intruder!) for the work with the Product Space. Thanks also to Juan Pablo Chauvin and Jasmina Beganovic, for their appreciation of my t-shirts and the quotes on my whiteboard.

Finally, for the group at Northeastern University, the first thought goes to Sune Lehmann, for the great talks we had. And then I think about Yong-Yeol Ahn, with the hope and promise that we will be able to actually make real the collaboration we were planning. I also need to apologize to Nicholas Blumm, Dashun Wang and Chaoming Song for all the times I stole part of their desk or their chairs. And again, for all the people not mentioned here: the thesis is long enough with scientific blabbering, but if it contained all the friendship you gifted me, it would explode exponentially.

Damiano Ceccarelli for sure deserves a special thanks, being my only non-scientific co-author but, more importantly, being the voice of reason for basically everything else. Silvia Tomanin is probably my main motivator, constantly raising the threshold of what I should do to beat her, though she will keep ending up the winner anyway. I won't give up, by the way. I did not forget Gabriele Pastore, Francesca Aversa and all the guys from the forum, even if in these years I probably gave them less attention than I should have. A big special thanks also to Eleonora Grasso: our relationship was the cornerstone of three years of my life and one of the few things that kept me sane in the darkest hours of a PhD student (and there are many of those). I wish the very best for your future.

The last, final "Thank you", the special one, goes to my family: my parents and my sister.
You are simply perfect in understanding, supporting and correcting me on every occasion. I wish the world were an easier place, where work does not bring you far from home, because being that far from where my heart is, I just feel lost.

And then there is you, Clara. The biggest, most beautiful and most exciting bet I have in my future. I have put everything on it.

Contents

1 Introduction

I Setting the Stage

2 Network Analysis
  2.1 The Graph Representation
  2.2 Statistical Properties
  2.3 Community Discovery
    2.3.1 Problem Features
    2.3.2 The Definition-based classification
    2.3.3 Feature Distance
    2.3.4 Internal Density
    2.3.5 Bridge Detection
    2.3.6 Diffusion
    2.3.7 Closeness
    2.3.8 Structure Definition
    2.3.9 Link Clustering
    2.3.10 No Definition
    2.3.11 Empirical Test
    2.3.12 Alternative Classifications
  2.4 Generators
    2.4.1 Descriptive Models
    2.4.2 Generative Models
  2.5 Link Analysis
  2.6 Information Propagation
  2.7 Graph Mining
  2.8 Privacy

3 Multidimensional Network: Model Definition

4 Related Work
  4.1 Layered Networks
  4.2 Hypergraphs
  4.3 Multidimensional Networks
    4.3.1 Multidimensional Community Discovery
    4.3.2 Multidimensional Link Prediction
    4.3.3 Signed Networks
  4.4 Tensor Decomposition

5 Real World Multidimensional Networks
  5.1 Facebook
  5.2 Supermarket
  5.3 Flickr
  5.4 DBLP
  5.5 Querylog
  5.6 GTD