Algorithms for Analyzing and Mining Real-World Graphs

Frank W. Takes

The author of this PhD thesis was employed at Leiden University.

The work in this thesis has been carried out under the auspices of the research school IPA (Institute for Programming research and Algorithmics).

This research was financed by the Netherlands Organization for Scientific Research (NWO) as part of the Complex Patterns in Streams (COMPASS) project.

Copyright 2014 by Frank W. Takes
Open access: https://openaccess.leidenuniv.nl
Typeset using LaTeX; figures generated using TikZ and gnuplot
Printed by Ridderprint B.V.
ISBN 978-90-5335-957-0

Algorithms for Analyzing and Mining Real-World Graphs

PhD Thesis

to obtain the degree of Doctor at Leiden University, on the authority of the Rector Magnificus prof. mr. C.J.J.M. Stolker, according to the decision of the Doctorate Board (College voor Promoties), to be defended on Wednesday 19 November 2014 at 16:15

by

Frank Willem Takes
born in Leidschendam in 1986

Doctorate committee

Promotor: prof. dr. J.N. Kok
Copromotor: dr. W.A. Kosters

Committee members:
prof. dr. T.H. Bäck
prof. dr. H. Blockeel (KU Leuven)
prof. dr. T. Calders (Vrije Universiteit Brussel)
dr. H.J. Hoogeboom
prof. dr. A.P.J.M. Siebes (Universiteit Utrecht)

Contents

1 Introduction
   1.1 Introduction
   1.2 Graphs
   1.3 Graph algorithms
   1.4 Data mining
   1.5 Thesis outline

I Graph Algorithms

2 Determining the Diameter of Small-World Networks
   2.1 Introduction
   2.2 Preliminaries
       2.2.1 Definitions
       2.2.2 Small-world networks
   2.3 Related work
   2.4 BoundingDiameters
       2.4.1 Observations
       2.4.2 Algorithm
       2.4.3 Complexity
       2.4.4 Selection strategies
       2.4.5 Example
       2.4.6 Pruning
   2.5 Experiments
       2.5.1 Datasets
       2.5.2 Measurement methodology
       2.5.3 Results
   2.6 GPU parallelism
   2.7 Conclusion

3 Computing the Eccentricity Distribution of Large Graphs
   3.1 Introduction
   3.2 Preliminaries
   3.3 Related work
   3.4 Exact algorithm
       3.4.1 Eccentricity bounds
       3.4.2 Example run
       3.4.3 Pruning
   3.5 Approximation algorithms
       3.5.1 Random node selection
       3.5.2 Hybrid algorithm
       3.5.3 Neighborhood approximation
   3.6 Experiments
       3.6.1 Datasets
       3.6.2 Exact algorithm
       3.6.3 Hybrid algorithm
   3.7 Conclusion

4 A Bounding Framework for Computing Extreme Graph Measures
   4.1 Introduction
   4.2 Definitions
   4.3 Framework
       4.3.1 Bounds
       4.3.2 Algorithm
   4.4 Experiments on real-world graphs
       4.4.1 Datasets
       4.4.2 Results
       4.4.3 Correlation with graph properties
   4.5 Experiment on a synthetic graph
   4.6 Conclusion

5 Adaptive Landmark Selection for Shortest Path Computation
   5.1 Introduction
   5.2 Preliminaries
       5.2.1 Notation
       5.2.2 Problem definition
       5.2.3 Landmarks
   5.3 Related work
   5.4 Landmark framework
       5.4.1 Landmark selection
       5.4.2 Landmark processing
       5.4.3 Optimizations
       5.4.4 Example
   5.5 Balancing centrality and covering
       5.5.1 Adaptive landmark selection
       5.5.2 Greedy central neighbor processing
   5.6 Experiments
       5.6.1 Datasets
       5.6.2 Measurement methodology
       5.6.3 Results and discussion
   5.7 Conclusion

6 Identifying Prominent Actors in Online Social Networks
   6.1 Introduction
   6.2 Preliminaries
       6.2.1 Definitions
       6.2.2 Problem statement
       6.2.3 Online social networks
   6.3 Related work
   6.4 Prominent nodes
       6.4.1 Node properties
       6.4.2 Biased Random Walk
   6.5 Dataset
   6.6 Experiments
       6.6.1 Node properties
       6.6.2 Biased Random Walk
       6.6.3 Results
   6.7 Conclusion

II Path Traversal Patterns

7 The Difficulty of Path Traversal in an Information Network
   7.1 Introduction
   7.2 Preliminaries
       7.2.1 Concepts & definitions
       7.2.2 Wikipedia
       7.2.3 The Wiki Game
       7.2.4 Problem definition
   7.3 Related work
   7.4 Node-based difficulty measures
       7.4.1 Degree measures
       7.4.2 Neighborhood measures
   7.5 Path-based difficulty measures
       7.5.1 Path length
       7.5.2 Number of shortest paths
       7.5.3 Uniqueness of shortest paths
   7.6 Conclusion

8 Mining User-Generated Path Traversal Patterns
   8.1 Introduction
   8.2 Preliminaries
       8.2.1 Wikipedia graph
       8.2.2 The Wiki Game dataset
   8.3 Related work
   8.4 Path traversal patterns
       8.4.1 Patterns
       8.4.2 Centrality measures
       8.4.3 User-defined node centrality
       8.4.4 Measure evaluation
       8.4.5 Experiments
   8.5 Global patterns
       8.5.1 Frequent traversal graphs
       8.5.2 Subgraph centrality
   8.6 Conclusion

Bibliography

Samenvatting

Curriculum Vitae

Dankwoord

Publication List

Titles in the IPA Dissertation Series since 2008

1

Introduction

1.1 Introduction

We live in a connected world. In our social and digital lives, we are confronted with networks (or graphs) on a daily basis. When someone tells a story, it is likely that this story passed through various other people that together form a network of social interactions. Online social networks are based on gigantic networks in which people are connected through so-called friendship links. Browsing the web means traversing a large network of pages that is connected via clickable (hyper)links. Accessing one webpage on a mobile phone creates a few dozen wired or wireless connections between devices in a matter of microseconds. Networks are everywhere around us, and influence the way in which we communicate, socialize, search, navigate and consume information.

When networks are stored in a digital format, they can produce an enormous amount of data. Such a large volume of data is sometimes called big data, not only because of its quantity, but also because the data may arrive at an enormous speed and because the data is usually diverse in terms of what type of information it represents. Data is used in many disciplines of science to verify hypotheses about a certain domain. Popularized under the name data science, large (network) datasets are being generated and investigated by commercial organizations as well as a number of research disciplines.

Within the field of computer science, we specifically consider tasks related to storing, retrieving, manipulating and understanding data in an automated and efficient way. The simplest type of data is called unstructured data, which may for example be the textual content of a news article or numeric measurements from a temperature sensor. On the other hand there is structured data, which refers to data that is organized according to some data structure or model (note that other researchers sometimes use the term unstructured data for tabular data, and structured data for graphs). A common example of structured data (according to the first definition) is a database, which is made up of tabular structures consisting of different objects (rows) along with attributes (columns) that describe the objects. The majority of this text will however focus on graphs, a type of structured data which will be described in Section 1.2. Next, in Section 1.3 the focus will be on algorithms for computing certain properties of graph data. Given a dataset, one may also be interested in getting a better understanding of the knowledge incorporated in the data, a task broadly addressed by the field of data mining, which will be introduced in Section 1.4. This introductory chapter is concluded in Section 1.5 with an outline of how graphs, algorithms and data mining form the main topics of the following chapters of this thesis.

1.2 Graphs

A graph [135] is one of the most fundamental data structures used in computer science. Graphs are used to describe the relationship between objects within a certain domain. In a graph, the objects are commonly referred to as nodes (also called vertices, actors or entities) and the relationship between two vertices is called an edge (also called a link, an arc or a tie). An example of a small graph is shown in Figure 1.1.

This thesis primarily focuses on real-world graphs, often referred to by other disciplines as networks. Note that from a computer science perspective, and especially from an algorithmic point of view, the term "graph" is often preferred over "network", as the latter is often interpreted as a structure of physical connections between multiple devices. A well-known example of a real-world graph is a social network, in which a node represents a person, and a link represents a social relationship between the two people that it connects. Throughout this chapter, online social networks are used as a running example to describe the various concepts that are relevant in (real-world) graphs.

Online social networks (OSNs) [25] are commonly accessed through a website or (mobile) application, and allow a user to create a profile, and then link this profile to other users, forming a network of social connections called the friendship graph. The profile can be enriched with user attributes such as the age, location and gender of the user. Furthermore, the OSN can be used to communicate with other users or to share information by means of for example text, images or video. The first online social networks were introduced around the year 2000, and roughly five years later Facebook, LinkedIn and others were on their way to becoming online social networking services with over a hundred million members each.

Figure 1.1: A node-labeled unweighted directed graph with 8 nodes (A–H) and 12 links.

Moving back to the abstract concept of a graph, there are various ways to characterize a graph based on properties of both the nodes and the edges. If a graph contains different types of nodes, it is called heterogenic, whereas if all nodes are of the same type, it is called homogenic. A homogenic graph is also called a one-mode network. A heterogenic graph with two types of nodes is called a two-mode network or affiliation network if the set of nodes is bipartite, meaning that the node set can be split into two sets of nodes such that for every edge, the source and target node of that edge are in a different set. A two-mode network can be converted into a one-mode network consisting only of nodes of either one of the two types. In the resulting homogenic network, an edge between two nodes (of the one and only type) is present if both nodes linked to the same node of the other type in the original two-mode network. If the relationships in a graph are explicit, then this means that both actors explicitly form a connection (as is the case with a link in the friendship graph of for example Facebook). When links are implicit, it means that the link is based on some common activity or common property of the two actors, such as the fact that two users sent each other a message. A graph with implicit links is likely to be a projection of a two-mode network.

A node can have one or more attributes describing properties of the particular node. In the OSN example this could be the age and location of the person represented by the node. Similarly, an edge can have one or more attributes describing the type of relationship. Graphs with attributes on the nodes or edges are also called annotated graphs. If an edge has one numeric attribute, then this edge attribute is often called the weight of the edge, and the graph is called a weighted graph. In unweighted graphs, there is no edge weight (but for computational reasons, the weight of an edge is usually assumed to be equal to one).

In some cases, the direction of a link is relevant, and the graph is called a directed graph. This is for example the case in the online social network Twitter, where one user can follow another user, without this other user having to explicitly approve this connection. The term reciprocity is used to denote the extent to which links are mutual. Clearly, in a directed graph such as Twitter, reciprocity is only partial. On the other hand, the friendships in Facebook obviously form an undirected graph, as two people are always each other's friend, and the relationship is thus always symmetric, realizing full reciprocity by design. The number of incoming links of a node in a directed graph is called the indegree (e.g., the number of followers of a Twitter user), and similarly the number of outgoing links is called the outdegree of that node (e.g., the number of people a user follows). In an undirected graph, the indegree and outdegree are equal, and there is simply the notion of the degree of a node (e.g., the number of friends of a user on Facebook). The majority of this thesis deals with directed or undirected homogenic (one-mode) unweighted graphs.
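As an aside, the two-mode to one-mode conversion described above can be sketched in a few lines of code. The following minimal C++ fragment is not part of the thesis; the representation of the two-mode network as (author, publication) pairs and all names in it are illustrative assumptions.

    #include <algorithm>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    using Edge = std::pair<std::string, std::string>;   // (author, publication) link

    // Project a two-mode (bipartite) author-publication network onto a one-mode
    // network: two authors become connected iff they share at least one publication.
    std::set<Edge> projectToOneMode(const std::vector<Edge>& twoMode) {
        std::map<std::string, std::vector<std::string>> authorsOf;  // publication -> authors
        for (const Edge& e : twoMode)
            authorsOf[e.second].push_back(e.first);

        std::set<Edge> oneMode;  // implicit (undirected) co-authorship links
        for (const auto& entry : authorsOf) {
            const std::vector<std::string>& authors = entry.second;
            for (size_t i = 0; i < authors.size(); ++i)
                for (size_t j = i + 1; j < authors.size(); ++j) {
                    // store each undirected link once, with its endpoints ordered
                    Edge link = std::minmax(authors[i], authors[j]);
                    oneMode.insert(link);
                }
        }
        return oneMode;
    }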
A common property of the graphs that are investigated in this thesis is that they are based on real-world data, meaning that the nodes of the graph represent for example actual people, physical objects, locations, organizations, digital information or written articles. An example is the so-called webgraph: the "graph of the internet", representing the way in which millions of pages are connected by means of billions of clickable hyperlinks. Other examples are citation and collaboration networks, in which a node represents a scientist, and a link between two nodes indicates respectively a citation (a directed link) or a collaboration (an undirected link). An example of a collaboration network is given in Figure 1.2. In this figure, a node represents a staff member of the computer science department of Leiden University, and an undirected edge between two people indicates that they collaborated between 2005 and 2012 by writing a paper together. The edge width is proportional to the number of co-authored papers, and a thinner gray edge indicates indirect collaboration through a common co-author not employed in Leiden. This one-mode collaboration network can be seen as a projection of a two-mode network consisting of authors and publications, with edges connecting publications to their authors.

Furthermore, Figure 1.2 illustrates the concept of connected components: groups of nodes where for any two nodes in this group, there exists a path (a sequence of nodes connected through edges) between these two nodes. The figure has three connected components. The distance between two nodes is defined as the minimum number of edges that have to be traversed to get from one node to the other, or alternatively, as the length of a shortest path between these two nodes. Obviously, this measure of distance has a different semantic meaning depending on the type of graph that is considered. In a collaboration network such as that of Figure 1.2, the distance between two people could be an indication of the similarity of these people's research.
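The notion of a connected component can be made concrete with a linear-time graph traversal. The following minimal C++ sketch is not part of the thesis; it assumes the graph is stored as adjacency lists and is undirected.

    #include <queue>
    #include <vector>

    // g[v] lists the neighbors of node v (undirected, unweighted graph).
    // Returns the number of connected components, labeling every node on the way.
    int countComponents(const std::vector<std::vector<int>>& g) {
        std::vector<int> component(g.size(), -1);   // -1 means "not visited yet"
        int components = 0;
        for (int start = 0; start < (int)g.size(); ++start) {
            if (component[start] != -1) continue;   // already part of some component
            ++components;
            std::queue<int> q;
            component[start] = components;
            q.push(start);
            while (!q.empty()) {                    // BFS: every node reachable from
                int u = q.front(); q.pop();         // start via a path joins this component
                for (int w : g[u])
                    if (component[w] == -1) { component[w] = components; q.push(w); }
            }
        }
        return components;   // the collaboration graph of Figure 1.2 would give 3
    }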

Figure 1.2: A graph of 117 scientific collaborations (undirected edges) between 72 staff members (nodes) of the computer science institute of Leiden University. Data is based on staff publication lists from 2005 to 2012. Visualized using NodeXL (http://nodexl.codeplex.com).

Noteworthy is the fact that many real-world graphs have similar structural properties, even though they are based on completely different data. First of all, real-world graphs are typically sparse, meaning that the number of edges is very low compared to the maximum number of edges that may possibly exist between all the nodes. Second, these large graphs usually have one large connected component containing the vast majority of the nodes. For example, of a particular online social network considered in Chapter 6, over 99.9% of the eight million users are connected via friendship links. Third, most graphs are scale-free, meaning that they have a power-law degree distribution with a fat tail, i.e., there are a lot of nodes with a very low degree, and only a few nodes with a high degree. The fat tail implies that high degree nodes have a degree that is many times larger than the average degree over all nodes. Indeed, high degree nodes of the graph often serve as hubs in the network, realizing very low average node-to-node distances. These low average distances are a fourth commonly observed property, often referred to as the "small-world phenomenon" [71], which is closely related to the concept of "six degrees of separation", a theory which says that the majority of the people in the world are connected via only six handshakes.

Graphs are studied in many different disciplines of science. Since the sixties, popularized under the name "social network analysis", networks of social interaction have been extensively studied within the social sciences [139]. There, the goal is to get an understanding of the social interaction between the different actors in a network. Furthermore, physicists refer to large graphs as "complex networks", and study for example the different models behind networks [7]. It has been shown that the interaction between proteins can be understood by modeling them as a graph [64], demonstrating the applicability of graphs as a model in bioinformatics. Within the field of public administration, large networks of corporations are also studied, for example to model and better understand the global network of corporate control [59]. The structure of such a corporate graph is shown in Figure 1.3, in which a node represents a company in the Netherlands and an edge between two nodes denotes the fact that these companies have a common senior level director. Indeed, graphs are everywhere, and the interaction between objects that they model is relevant in many areas of research. Whereas other disciplines of science are usually interested in the domain-specific information incorporated in these graphs, for computer scientists the emphasis is on creating efficient algorithms for storing, analyzing, understanding and computing certain aspects of the graph and addressing the complexity issues that arise when larger graphs are considered.

1.3 Graph algorithms

An algorithm [91] can be defined as a sequence of instructions to solve a particular problem. Computer scientists are generally interested in designing algorithms that solve a problem efficiently, both in terms of time and memory usage. This thesis specifically considers algorithms for large graphs.

Figure 1.3: A graph of 3,711 board interlocks (undirected edges) between 1,422 companies (nodes) in the Netherlands. Data originates from the ORBIS database of Bureau van Dijk. Visualized using the Fruchterman-Reingold and ForceAtlas2 algorithms in Gephi (http://gephi.org).

This classification of size may seem rather vague when considering the seemingly ever-increasing amount of available storage, memory and computation power. A more precise way would be to say that large graphs cannot be stored in memory as an adjacency matrix, but only as an adjacency list, making certain operations (such as computing the distance between every pair of nodes) more complex. Algorithms for large graphs usually iterate over the nodes or edges of the graph a constant number of times: quadratic (or worse) complexity in the number of nodes or edges is prohibitive, and to ensure practical use the complexity of algorithms should be somewhat linear in the number of nodes or edges. To make these "large" numbers a bit more concrete, large graphs today typically have hundreds of thousands or even millions of nodes, and possibly hundreds of millions or billions of links. Usually, it is very hard to properly visualize graphs once the number of nodes and edges increases. See for example Figure 1.3, which shows a graph consisting of "only" 1,422 nodes and 3,711 undirected edges. Standard tools for visualizing graphs are no longer suitable when the size of the graph exceeds say a hundred thousand nodes. Therefore, when computations or measurements on larger graphs have to be done, specialized frameworks that store and manipulate the graph (without worrying about visualization) are used. For most of the experiments presented in this thesis, a custom C++ framework was used.

A substantial number of graph algorithms proposed in the literature deals with modeling or generating graphs using a mathematical model. The focus of the graph algorithms discussed in this thesis is on computing or measuring certain properties or characteristics of a given (real-world) graph. A well-known example of such a graph algorithm is Dijkstra's shortest path algorithm [43], which computes the distance between two nodes of a graph.

Graph properties can roughly be divided into local, global and subglobal properties. Examples of global properties of the graph include the graph diameter (longest shortest path length, see Chapter 2), its average clustering coefficient (the degree to which nodes tend to cluster together on a global scale), or the number of connected components of the graph. On the other hand, local properties say something about individual nodes, with a commonly addressed issue being that of node centrality, the importance of a node in the graph. In the friendship graph of an online social network, the number of friends of a user (the degree of the node) is a typical centrality measure. The importance of pages in the webgraph of the internet is commonly assessed using the PageRank [107] centrality measure, which ranks webpages based on how many other high-ranked pages link to the considered page. These two measures are incorporated in Figure 1.3, where the size of a node is proportional to its degree (larger size means a higher degree), and the node color corresponds to its PageRank value (darker means a higher PageRank value). A third type of graph algorithms deals with graphs on a subglobal level, computing or deriving aspects of a group of nodes, with community detection [89] algorithms as a well-known example. These algorithms try to cluster the nodes in the graph so that groups of nodes that are more connected amongst each other than with the rest of the graph form one community. Clustering is also a frequently addressed task in the field of data mining, which is the topic of the next section.
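Before moving on, the adjacency-list representation mentioned earlier in this section can be illustrated with a small, self-contained C++ sketch. It is not part of the thesis's framework and the tiny edge list is purely illustrative; an n-by-n adjacency matrix for a graph with millions of nodes would already need more entries than fit in memory, whereas the lists below take space proportional to n + m.

    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        // A toy undirected graph given as an edge list (illustrative values only).
        int n = 8;
        std::vector<std::pair<int, int>> edges = {
            {0, 1}, {0, 2}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}};

        // Build adjacency lists: g[v] holds the neighbors of node v.
        std::vector<std::vector<int>> g(n);
        for (const auto& e : edges) {
            g[e.first].push_back(e.second);
            g[e.second].push_back(e.first);
        }

        // A typical O(n + m) pass over the structure: the degree of every node,
        // the simplest local centrality measure discussed in the text.
        for (int v = 0; v < n; ++v)
            std::printf("degree(%d) = %zu\n", v, g[v].size());
        return 0;
    }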

1.4 Data mining

Data mining [143] is the field of research that focuses on getting a better understanding of a (large) dataset in an automated way, for example by searching for patterns in the data. The goal is often to find "something new", i.e., to discover knowledge that is not immediately visible by using common sense or by manually inspecting the data. Alternatively, one could say that data mining deals with converting information into knowledge, a process called knowledge discovery. With this in mind, it is often said that data mining is somewhat related to the fields of artificial intelligence, machine learning and statistics. The remainder of this section describes several common data mining tasks, using the small over-simplified database (table) of online social network users from Table 1.1 as an example dataset.

Association is the task of relating attributes of the objects, forming so-called association rules that describe the data. In Table 1.1, a possible association rule could be that if the age attribute has a high value, then the number of friends is low. Indeed, age seems somewhat associated with the number of friends in the example table. Clustering refers to the task of grouping sets of objects together because they have certain attributes in common. Again considering the example dataset, a possibility would be to group the users into two clusters based on their gender and location: all female users happen to be from the United States, and all male users are not. Outlier detection deals with finding single objects or small groups of objects that "stand out" because they do not comply with the patterns or constraints that the other objects do. A possible outlier in the example dataset could for example be Hugo, because he does not have any photos.

Name     Gender  Age  Location         Photos  Friends
Charlie  male     23  United Kingdom   4       416
Hugo     male     27  Mexico           0       238
Jack     male     42  Australia        8       164
Kate     female   31  United States    815     158
Rose     female   65  United States    39      16

Table 1.1: A small database (table) containing users of an online social network.

The three techniques described above are all descriptive: they attempt to describe or summarize the data, for example to discover the knowledge incorporated in the data, or to find interesting patterns that provide more insight into the considered domain. A more predictive task in data mining is that of classification: the process of determining, given an object and its attributes, some other (initially unknown) attribute, which is then referred to as the class of the object. In the example dataset, the class attribute could be the gender of the user, and one way of predicting this class using the data given in Table 1.1 could be to say that all users with more than 30 photos are female.

For the (too) simple example table, each of the traditional data mining tasks described in this section can be executed more or less perfectly. However, more often than not, a dependency or pattern is only true for a (hopefully large) percentage of the instances, and numeric measures have to be used to assess the accuracy of a derived result. When a set of association rules has been derived, a clustering has been made, or an outlier has been found, it can also be a challenging task to determine whether or not the results make sense within the given context of the data. A clustering might be based on a coincidence in the data, an association rule may be based on a trivial dependency in the attributes, and an outlier may just be an error in the dataset. Therefore, in data mining it is important to have a ground truth that can be used to verify patterns. Alternatively, one can use separate datasets for training and validating the technique, so that the performance of a certain technique can objectively be measured. Obviously, carefully choosing a correct, suitable and fair ground truth is essential for the verification of results obtained by a data mining algorithm.

When data mining techniques are applied to graph data, we speak of link mining [49] or graph mining [33]. A well-known predictive task in this context is that of link prediction: given a graph of existing nodes and edges, which edges are likely to appear (or disappear) in the future? A common descriptive graph mining task is frequent subgraph detection: given a graph, which subgraph occurs more frequently than expected, and may indicate a pattern in the graph? For each of these tasks, it is important to keep the network aspect of the data in mind: it is not only the objects and their attributes, but also the relationships between the objects that may define the knowledge that is incorporated in the data.
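To make the classification rule and the outlier from the running example concrete, here is a minimal C++ sketch. It is not part of the thesis; the data is simply copied from Table 1.1, and the "more than 30 photos" rule is the illustrative one mentioned in the text.

    #include <cstdio>
    #include <string>
    #include <vector>

    struct User {
        std::string name, gender, location;
        int age, photos, friends;
    };

    int main() {
        // The rows of Table 1.1.
        std::vector<User> users = {
            {"Charlie", "male",   "United Kingdom", 23, 4,   416},
            {"Hugo",    "male",   "Mexico",         27, 0,   238},
            {"Jack",    "male",   "Australia",      42, 8,   164},
            {"Kate",    "female", "United States",  31, 815, 158},
            {"Rose",    "female", "United States",  65, 39,  16}};

        for (const User& u : users) {
            // Classification: predict the class attribute "gender" from the photo count.
            std::string predicted = (u.photos > 30) ? "female" : "male";
            // Outlier detection: a user without any photos stands out in this table.
            bool outlier = (u.photos == 0);
            std::printf("%-8s predicted %-6s (actual %-6s)%s\n",
                        u.name.c_str(), predicted.c_str(), u.gender.c_str(),
                        outlier ? "  <- possible outlier" : "");
        }
        return 0;
    }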

1.5 Thesis outline

This thesis consists of two parts. Part I deals with efficient computation of distance-related measures and properties of graphs. The algorithms and techniques introduced in this first part help answer questions such as:

• What is the diameter of a given real-world graph? (Chapter 2)

• How can we determine which nodes form the center of a large graph? (Chapter 3 and Chapter 4)

• How can we efficiently assess the distance between two pages on the internet? (Chapter 5)

• What measures and techniques are able to identify the prominent actors in an online social network? (Chapter 6)

In Chapter 2, an algorithm for efficiently computing the exact diameter of large graphs is introduced, and a similar technique is used in Chapter 3 to also compute the eccentricity distribution. Chapter 4 provides a generalized version of the algorithms presented in the previous two chapters to efficiently compute various other distance measures including the radius, center and periphery of a graph. In each of these chapters, shortest paths in graphs are exactly computed. To speed up this process at the cost of exactness, in Chapter 5 so-called landmark strategies are discussed, which can be used to approximate the distance between two nodes with high accuracy. Finally, Chapter 6 can be seen as a case study in which the (former) Dutch online social network Hyves is considered in the context of so-called centrality measures that determine the importance of a node in a graph.

Part II of this thesis is more oriented towards data mining in information networks, as both chapters are based on data from users that are navigating the well-known free online encyclopedia Wikipedia. The second part addresses issues related to the following questions:

• How difficult is it for humans to navigate through a network of Wikipedia articles? (Chapter 7)

• What are the differences between human search strategies and algorithmic search strategies? (Chapter 7)

• What can be learned from the patterns in human navigation paths in information networks? (Chapter 8)

In Chapter 7 and Chapter 8, the studied dataset of human traversal patterns originates from the Wiki Game, an online game in which the main task is to link two given random Wikipedia articles by means of clicking the hyperlinks between these pages.

Chapter 7 focuses on both failed and successful user-generated paths in order to assess the difficulty of this path finding task, whereas Chapter 8 considers mining the successful paths in an attempt to better understand the human search strategy.

Each of the seven chapters following this introduction ends with a conclusion, summarizing the results presented in that chapter and providing suggestions for future work.

Publications

The different chapters of this thesis are based on the following peer-reviewed publications:

• F. W. Takes and W. A. Kosters. Determining the diameter of small world networks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), pages 1191–1196, 2011 (Chapter 2)

• F. W. Takes and W. A. Kosters. Computing the eccentricity distribution of large graphs. Algorithms, 6(1):100–118, 2013 (Chapter 3 and Chapter 4)

• F. W. Takes and W. A. Kosters. Adaptive landmark selection strategies for fast shortest path computation in large real-world graphs. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2014), pages 27–34, 2014 (Chapter 5)

• F. W. Takes and W. A. Kosters. Identifying prominent actors in online social networks using biased random walks. In Proceedings of the 23rd Benelux Conference on Artificial Intelligence (BNAIC 2011), pages 215–222, 2011 (Chapter 6)

• F. W. Takes and W. A. Kosters. The difficulty of path traversal in information networks. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012), pages 138–144, 2012 (Chapter 7)

• F. W. Takes and W. A. Kosters. Mining user-generated path traversal patterns in an information network. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2013), pages 284–289, 2013 (Chapter 8)

A full list of publications by the author can be found in the Publication List at the end of this thesis.

Part I

Graph Algorithms

2

Determining the Diameter of Small-World Networks

This chapter presents a novel approach to determine the exact diameter (longest shortest path length) of large graphs, in particular of the nowadays frequently studied small-world networks. Typical examples include social networks, biological networks, webgraphs and internet topology networks. Due to complexity issues, the diameter is often computed based on a sample of only a fraction of the nodes in the graph, or some approximation algorithm is applied. Instead, we propose an exact algorithm that uses various lower and upper bounds as well as effective node selection and pruning strategies in order to evaluate only the critical nodes which ultimately determine the diameter. The proposed algorithm is able to quickly determine the exact diameter of various large small-world networks with millions of nodes and hundreds of millions of links, whereas before only approximations could be given. This chapter is based on:

• F. W. Takes and W. A. Kosters. Determining the diameter of small world networks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), pages 1191–1196, 2011

2.1 Introduction

With the rapidly increasing amount of graph data that is being generated and academically studied, researchers are often interested in quickly deriving various global properties of their graphs. While several trivial static properties of the graph, such as the graph density (number of edges of the graph vs. the maximum number of edges that could possibly exist), can easily be computed, determining other properties of large graphs using straightforward algorithms may require a lot more computation time. One of these more "expensive" properties is the diameter of a graph, which is defined as the maximal distance (length of a longest shortest path) between any two nodes in the graph. Exact algorithms for computing the diameter traditionally require running an All Pairs Shortest Path (APSP) algorithm for each node in the graph, ultimately returning the length of one of the longest shortest paths that was found. While this will indeed return the exact diameter of the graph, the complexity for a graph with n vertices and m edges is in the order of O(n^3) for weighted graphs and O(mn) for sparse unweighted graphs. This naive method for obtaining the diameter is clearly not feasible in extremely large graphs with for example millions of vertices and a billion edges.

We will study the diameter of small-world networks: sparse networks that are most typically characterized by an average distance between two random nodes that grows only proportionally to the logarithm of the total number of nodes in the network [71]. Examples of small-world networks that are frequently studied are webgraphs [8, 28], internet topology networks [66] and biological networks [6, 67], but perhaps nowadays most well-known are social networks [120, 139]. With the introduction of online social networks such as Facebook, LinkedIn and Orkut, even more than before, the study of social networks has become interesting for computer scientists, as the data behind these networks can easily be gathered in a digital format. Other (implicit) social networks are telephone call graphs, e-mail networks [73] and scientific collaboration networks [13].

The diameter is a relevant property of a network for many reasons. For example in social networks, the diameter could be an indication of how quickly information reaches literally everyone in the network. Within a scientific collaboration network, a high diameter may indicate that there are groups of researchers that are not working together very closely. In an internet routing network, the diameter could reveal something about the worst-case response time between any two machines in the network. In a way, the diameter can be seen as a measure of how data or information spreads over the network in the worst case.

The diameter is not just a static property of a graph, but it is also used in various algorithms in which it serves as the maximum depth of a search procedure, for example in a Depth Limited Search algorithm [117]. Work is also being done on studying how a graph evolves over time [42]. There, knowing the exact point in time when the diameter changes can be interesting, which may favor an exact answer over an approximation. Another important advantage of studying the exact diameter is that we can observe the actual path that realizes the diameter, a piece of information that we do not get when for example an approximation algorithm is used, or when the diameter is estimated by looking at a sample.

The main contribution of this chapter consists of a new algorithm for determining the exact diameter of small-world networks. Based on various lower and upper bounds and critical node selection strategies, we improve upon the straightforward APSP algorithm as well as upon existing approximation algorithms, obtaining the exact diameter of networks with millions of nodes in a matter of seconds or minutes. The performance is empirically verified on various large small-world networks.

The rest of the chapter is structured as follows. Section 2.2 introduces various definitions and a short analysis of the problem's complexity, after which we will cover relevant related work in Section 2.3. We will use Section 2.4 to outline our algorithm for deriving the diameter of a graph, and discuss experimental results in Section 2.5. In Section 2.6 we look at a parallel version of the proposed algorithm, and finally Section 2.7 concludes.

2.2 Preliminaries

In this section we will first consider some basic definitions related to graphs, distances, eccentricity, graph diameter and shortest path problems, and then give some insight into the complexity of determining these measures. After that we will briefly discuss small-world networks.

2.2.1 Definitions

A graph G = (V, E) consists of a set of vertices (or nodes) V and a set of links (or edges) E ⊆ V × V. Throughout the chapter we will use n to denote the number of nodes |V|, and for the number of links |E| we use m. The distance d(v, w) between two nodes v, w ∈ V is defined as the length of a shortest path from v to w. This chapter deals with unweighted graphs, and we will assume that graphs are undirected, meaning that (v, w) ∈ E iff (w, v) ∈ E and thus d(v, w) = d(w, v). Our definition of an edge does not allow parallel edges, and we furthermore disallow self-loops. Thus note that m is the number of (directed) links between distinct nodes and m/2 is the number of (undirected) edges. The degree of a node v in an undirected graph is simply defined as the number of (undirected) edges connected to that node. Finally, we assume that the graph is connected, implying that for each v, w ∈ V, d(v, w) is a finite number. We are now ready to define the two most important concepts used in this chapter, node eccentricity and graph diameter:

Eccentricity The eccentricity e(v) of a node v ∈ V is defined as max_{w∈V} d(v, w): the length of a longest shortest path starting at node v.

Diameter The diameter D(G) of a graph G is defined as max_{v,w∈V} d(v, w): the longest shortest path length between any pair of nodes, or equivalently as the maximum eccentricity over all nodes: max_{v∈V} e(v).

Note that when used as a variable, we simply use D to denote the diameter. For convenience, we define two combinatorial problems that are frequently addressed in this chapter and are tightly related to eccentricity. First, the Single-source Shortest Path (SSP) problem is the problem that deals with finding all shortest paths from a single source node v ∈ V to all other nodes in the graph. For non-sparse graphs this problem has the traditional time complexity of O(n^2) (Dijkstra's shortest path algorithm). When the graph is sparse, it can more efficiently be stored using an adjacency list instead of an adjacency matrix. Then in our unweighted case, time complexity can even be reduced to O(m), as a Breadth First Search (BFS) from the starting node is sufficient to find all shortest paths starting at that node. In essence, solving the SSP problem for a node means that we have found the eccentricity of that particular node. Next, we can define the All Pairs Shortest Paths (APSP) problem as the problem of finding the shortest paths between all pairs of nodes of the graph, which increases the previous time complexity by a factor n to O(n^3) for weighted graphs, and O(mn) for the considered sparse unweighted graphs. The maximum distance value that the APSP algorithm obtains is then the maximum eccentricity (which is computed for a node v using the function ECCENTRICITY() in Algorithm 2.1) over all nodes and thus equal to the diameter of the graph. So if we solve the APSP problem, we have also found the diameter of the graph.

Algorithm 2.1 DIAMETERAPSP
1: Input: Graph G
2: Output: Diameter of G
3: D ← −∞
4: for v ∈ V do
5:    D ← max(D, ECCENTRICITY(v))   // one BFS
6: end for
7: return D
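For concreteness, the ECCENTRICITY() routine (one BFS) and Algorithm 2.1 could look as follows in C++. This is a minimal sketch, not the thesis's actual framework; it assumes an adjacency-list graph that is undirected, unweighted and connected, as in Section 2.2.1.

    #include <algorithm>
    #include <queue>
    #include <vector>

    // g[v] lists the neighbors of node v; nodes are numbered 0..n-1.
    int eccentricity(const std::vector<std::vector<int>>& g, int v) {
        std::vector<int> dist(g.size(), -1);   // -1 marks unvisited nodes
        std::queue<int> q;
        dist[v] = 0;
        q.push(v);
        int ecc = 0;
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int w : g[u]) {
                if (dist[w] < 0) {
                    dist[w] = dist[u] + 1;     // BFS levels are shortest path lengths
                    ecc = std::max(ecc, dist[w]);
                    q.push(w);
                }
            }
        }
        return ecc;                            // e(v) = max over w of d(v, w)
    }

    // Algorithm 2.1: one BFS per node, O(mn) in total for a sparse unweighted graph.
    int diameterAPSP(const std::vector<std::vector<int>>& g) {
        int diameter = 0;
        for (int v = 0; v < (int)g.size(); ++v)
            diameter = std::max(diameter, eccentricity(g, v));
        return diameter;
    }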

2.2.2 Small-world networks

In this chapter we will specifically look at the diameter of small-world networks. A good overview of algorithmic properties of these networks, of which we will discuss a few, is given in [71]. First of all, small-world networks are generally sparse: the total number of links m is very small compared to the maximum number of links n(n − 1). This may cause the reader to believe that nodes have rather long shortest paths between them, as there are very few links in general. However, a second interesting characteristic is that even though the network is very sparse, the average distance between two nodes is very small. More specifically, this distance is typically somewhat proportional to the logarithm of the total number of nodes. The node degree distribution of a small-world network usually follows a power law: there are only a few nodes with a very large number of connections, the so-called hubs, and there are many nodes with relatively few connections. Hubs are in turn responsible for realizing very low average shortest path lengths. So even though many nodes are not direct neighbors of one another, most nodes can be reached from every other node via only a small number of steps. A last property of small-world networks is that they generally contain one very large connected component (the giant component) which contains the vast majority of the nodes.

2.3 Related work

A lot of work has been done on devising algorithms for the estimation of the diameter [44, 115]. Such estimation algorithms typically determine the diameter of any type of graph (sparse or dense) with some very small additive error, but using significantly less computation time than the APSP algorithm. For example in [3], a method is suggested which finds the diameter in O(n^2.5 √(log n)) time with an additive error of 2. Work has also been done on testing if the diameter is (with some small margin of error) equal to a certain value [110].

A popular method which is used in many graph analysis toolkits is the Approximate Neighborhood Function (ANF) by Palmer et al. [109]. This technique approximates the size of the neighborhood of (sets of) nodes, and is thus also able to approximate the diameter. Based on this technique, a variant of the diameter called the effective diameter was introduced, which is defined as the 90-th percentile of the cumulative distribution of shortest path lengths. Though this measure may appear more robust to outliers, it is claimed that the diameter and the effective diameter "tend to exhibit qualitatively similar behavior" [86]. Another measure closely related to the diameter that is sometimes mistakenly spoken of as if it were the real diameter, is what some call the average diameter, which is actually the average shortest path length: the average distance between any pair of nodes, i.e., 1/(n(n − 1)) · Σ_{v≠w∈V} d(v, w). This value is often approximated by selecting a few thousand random pairs of nodes from the graph, and determining the average of their pairwise distances.

In work which does not focus on actually determining the diameter, but where it is found only as a static property of the dataset, a sample of the graph is frequently used to determine the diameter [84, 103]. There are at least two directions to determining the diameter when using a sample. The first option would be to select a sample of the nodes in the original network using a suitable sampling approach [84], and then determine the diameter of this sample using the straightforward APSP algorithm. The second option would be to assess the diameter based on selecting a few nodes from the original graph that are likely to have a high eccentricity value, which is somewhat the idea behind the method described in [88]. In this work, the diameter is determined by, starting from a random node, repeatedly selecting the farthest node, meanwhile keeping track of the highest distance so far. If this value no longer increases after a certain number of iterations, then a lower bound on the diameter has been found. Similar techniques are employed in [17, 34, 95], and it is argued that using a handful of Breadth First Searches, empirically tight bounds on the diameter can be obtained.

Most exact algorithms for finding the diameter are actually implementations of matrix multiplication that solve the APSP problem and thus also find the diameter. While these algorithms work well and have time complexity O(n^2.376) [9], they usually suffer from large hidden constants, and are often impractical due to large memory requirements. To the best of our knowledge, the exact approach suggested by Crescenzi et al. [36, 37] is the only other exact algorithm for determining the diameter of large graphs. The suggested approach uses a strategy somewhat similar to the algorithm that we propose in this chapter. A comparison is provided in [36, 99].

2.4 BoundingDiameters

In this section we describe our approach for computing the diameter. We will start with some observations about the eccentricity of neighboring nodes and how they influence the diameter. Then we describe the actual algorithm called BOUNDINGDIAMETERS, which makes use of these observations to improve upon the APSP algorithm. Next we discuss the algorithm's complexity and some simple optimization techniques.

2.4.1 Observations

If we compute the eccentricity e(v) for some node v, we know that for all nodes w with d(v, w) = k, their eccentricity e(w) lies between e(v) − k and e(v) + k. The upper bound follows because any node w at distance k of v can get to v in exactly k steps, and then reach any other node in at most e(v) steps. The lower bound can be derived in the same way, by interchanging v and w in the previous statement. In the "best" case, w is on some path that realizes the eccentricity of v, and has an eccentricity e(w) of only e(v) − k. This lower bound can of course never be less than k itself: if the shortest path between v and w has length k, then e(w) is at least equal to k. In essence, we are making use of the triangle inequality in graphs. So we have:

Observation 2.1 Node eccentricity bounds
If a node v ∈ V has eccentricity e(v), then for all nodes w ∈ V we have:

max(e(v) − d(v, w), d(v, w)) ≤ e(w) ≤ e(v) + d(v, w)

The step from these eccentricity bounds to bounds on the diameter is simple: we know that the diameter of a graph is equal to the maximum eccentricity over all nodes. Therefore, the maximum lower bound on the eccentricity over all nodes is also a lower bound for the diameter. Similarly, the maximum upper bound on the eccentricity over all nodes can be seen as an upper bound on the diameter. The upper bound can even be made tighter by observing that this bound can be at most twice as big as the smallest eccentricity upper bound over all nodes, as also observed in [95]. These observations can be formalized as follows to form lower and upper bounds on the diameter:

Observation 2.2 Diameter bounds

Let eℓ(v) and eu(v) denote currently known lower and upper bounds for the eccentricity of node v ∈ V. For the diameter D(G) of a graph G it holds that:

max_{v∈V} eℓ(v) ≤ D(G) ≤ min(max_{v∈V} eu(v), 2 · min_{v∈V} eu(v))

We will denote these lower and upper bounds on the diameter by Dℓ and Du, respectively. Note that as opposed to the even upper bound of e(v) ≤ D(G) ≤ 2 · e(v) suggested in [95], the proposed algorithm is able to derive an odd upper bound, which is obviously necessary for finding the exact diameter when the diameter itself has an odd value.

2.4.2 Algorithm

The bounds mentioned above can be used to improve the original APSP algorithm from Algorithm 2.1 by reducing the number of eccentricity computations, as only nodes that can actually contribute to the diameter bounds are considered. Pseudocode for the BOUNDINGDIAMETERS algorithm is given in Algorithm 2.2.

Algorithm 2.2 BOUNDINGDIAMETERS
1: Input: Graph G
2: Output: Diameter of G
3: W ← V ; Dℓ ← −∞; Du ← +∞
4: for w ∈ W do
5:    eℓ[w] ← −∞
6:    eu[w] ← +∞
7: end for
8: while Dℓ ≠ Du and W ≠ ∅ do
9:    v ← SELECTFROM(W)
10:   e[v] ← ECCENTRICITY(v)
11:   Dℓ ← max(Dℓ, e[v])
12:   Du ← min(Du, 2 · e[v])
13:   for w ∈ W do
14:      eℓ[w] = max(eℓ[w], max(e[v] − d(v, w), d(v, w)))
15:      eu[w] = min(eu[w], e[v] + d(v, w))
16:      if (eu[w] ≤ Dℓ and eℓ[w] ≥ Du/2) or (eℓ[w] = eu[w]) then
17:         W ← W − {w}
18:      end if
19:   end for
20:   Du ← min(Du, max_{w∈V} eu[w])
21: end while
22: return Dℓ

After setting some initial values (lines 3–7), in the main while-loop, the algorithm repeatedly selects a node v (line 9) from the candidate set W, which initially contains all the nodes. The various mechanisms for selecting the next node to be examined are outlined in Section 2.4.4. The algorithm then computes the eccentricity of that node v (line 10), and uses the result to update the lower and upper bound of the graph diameter (Dℓ and Du, lines 11–12), cf. Observation 2.2. Note that we use e[v] when we refer to the (array) variable containing the eccentricity e(v) of node v. Next, the eccentricity bounds of all nodes in the candidate set W are updated (lines 14–15) according to Observation 2.1. Then we determine which nodes (including v) can be removed from the set of candidates (line 17). This can happen either because the eccentricity of the node is already known since the lower and upper bounds are identical (which is always the case for the current node v), or because a node can no longer "contribute" to the diameter of the graph by increasing the lower bound or decreasing the upper bound (line 16). Note that because we performed a BFS to compute the eccentricity of v, we know for each node w the distance d(v, w) which is needed to apply Observation 2.1. After adjusting the node eccentricity bounds, the diameter upper bound is further tightened using the largest node eccentricity upper bound, again cf. Observation 2.2 (line 20). Finally, the algorithm stops when all nodes have been examined, or when the lower bound is equal to the upper bound (line 8). It then returns the lower bound of the diameter, which at that point contains the real value of the diameter (line 22).

A proof of the correctness of this algorithm can be constructed by considering the fact that Dℓ, which is returned on line 22, contains the largest computed eccentricity value (line 12). So, given Observation 2.1 and the assumption that in line 16 only nodes that cannot potentially increase the value of Dℓ are removed, the algorithm returns the correct value of the diameter.

2.4.3 Complexity

Computing the eccentricity of a node using the ECCENTRICITY() function (line 10) is the critical operation of the algorithm, as this requires running an SSP algorithm (one BFS), taking O(m) time. In the best case, we only have to compute the eccentricity of two nodes v and w, only to find that e(w) = 2 · e(v) (or vice versa), which means that the diameter is equal to e(w). In the worst case the algorithm needs to investigate the eccentricity of every single node, not improving the traditional APSP time complexity of O(mn). An example of a graph in which all nodes have to be investigated in order to determine the diameter is a graph where the nodes are connected through exactly one circle of edges, meaning that all nodes have identical eccentricity. Of course, graphs are generally not shaped as a circle, neither is the diameter always equal to two times the eccentricity of some node in the graph (if we are even able to quickly find two such nodes). In general, the eccentricity values of nodes in a network differ, and how much they differ will likely influence the number of iterations that is required, as larger differences in eccentricity values will result in tighter eccentricity bounds on surrounding nodes.

We claim that the algorithm specifically works well on small-world networks, which we believe is due to the power-law degree distribution within such networks, as discussed in Section 2.2.2. A small-world network has relatively few nodes with a very high degree (hubs), that will often (but not always) have a relatively low eccentricity value. The remainder of the nodes typically have a much lower degree, often (again, not always) resulting in a relatively high eccentricity value. Thus, due to the expected existence of a large diversity in eccentricity values of the nodes in a small-world network, the bounds on the diameter will typically converge very quickly. When we look at a degree-based selection strategy in Section 2.4.4.1, we will verify this claim empirically.
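To show how Algorithm 2.2 could translate into actual code, the following C++ sketch is included. It is not the custom framework used for the experiments in this thesis; it reuses a plain BFS for the eccentricity computation, implements one possible instantiation of SELECTFROM() (the interchanging strategy of Section 2.4.4.3 with degree as a tie-breaker), and writes the pruning test eℓ[w] ≥ Du/2 in an integer-safe form, so all details should be read as illustrative assumptions.

    #include <algorithm>
    #include <climits>
    #include <iterator>
    #include <queue>
    #include <set>
    #include <vector>

    using Graph = std::vector<std::vector<int>>;   // adjacency lists, nodes 0..n-1

    // One BFS from v: distances to all nodes (the graph is assumed connected).
    static std::vector<int> bfsDistances(const Graph& g, int v) {
        std::vector<int> d(g.size(), -1);
        std::queue<int> q;
        d[v] = 0;
        q.push(v);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int w : g[u])
                if (d[w] < 0) { d[w] = d[u] + 1; q.push(w); }
        }
        return d;
    }

    int boundingDiameters(const Graph& g) {
        const int n = (int)g.size();
        std::vector<int> eLow(n, INT_MIN), eUpp(n, INT_MAX);   // per-node bounds
        std::set<int> W;                                       // candidate set
        for (int v = 0; v < n; ++v) W.insert(v);
        int dLow = INT_MIN, dUpp = INT_MAX;                    // diameter bounds
        bool pickLargeUpper = true;                            // interchanging selection

        while (dLow != dUpp && !W.empty()) {
            // SELECTFROM(W): alternate between the largest eccentricity upper bound
            // and the smallest lower bound, breaking ties by degree.
            int v = *W.begin();
            for (int w : W) {
                bool better = pickLargeUpper
                    ? (eUpp[w] > eUpp[v] || (eUpp[w] == eUpp[v] && g[w].size() > g[v].size()))
                    : (eLow[w] < eLow[v] || (eLow[w] == eLow[v] && g[w].size() > g[v].size()));
                if (better) v = w;
            }
            pickLargeUpper = !pickLargeUpper;

            std::vector<int> d = bfsDistances(g, v);
            int ecc = *std::max_element(d.begin(), d.end());   // e(v), one BFS
            dLow = std::max(dLow, ecc);
            dUpp = std::min(dUpp, 2 * ecc);

            // Update the node bounds (Observation 2.1) and prune candidates that can
            // no longer influence the diameter bounds; v itself always gets pruned,
            // because its lower and upper bounds become equal.
            for (auto it = W.begin(); it != W.end(); ) {
                int w = *it;
                eLow[w] = std::max(eLow[w], std::max(ecc - d[w], d[w]));
                eUpp[w] = std::min(eUpp[w], ecc + d[w]);
                bool prune = (eUpp[w] <= dLow && 2 * eLow[w] >= dUpp) || eLow[w] == eUpp[w];
                it = prune ? W.erase(it) : std::next(it);
            }

            // Tighten the diameter upper bound with the largest node upper bound.
            int maxUpp = INT_MIN;
            for (int w = 0; w < n; ++w) maxUpp = std::max(maxUpp, eUpp[w]);
            dUpp = std::min(dUpp, maxUpp);
        }
        return dLow;   // equal to the exact diameter on termination
    }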

2.4.4 Selection strategies

This section describes the different strategies that can be used to select the next node for which we want to compute the eccentricity, outlining the possible functionality of the SELECTFROM() function that is called in line 9 of Algorithm 2.2. First notice how enumerating the nodes in some order, or selecting them at random, will in essence mean that we are executing the APSP algorithm, only now we discard nodes that can no longer contribute to the diameter bounds. While this will no doubt already improve upon the APSP algorithm, it mainly serves as a baseline of comparison, as the main objective is to tighten the bounds of the diameter as quickly as possible in order to efficiently reduce the size of the set of candidate nodes. Any strategy that we come up with has to be easy to compute so that it does not influence the overall complexity of the diameter algorithm. More specifically, it should be possible to determine the next node by iterating over the set of nodes once, such that the selection function could even be done on-the-fly while updating the bounds in the previous iteration. We will test the performance of (combinations of) the strategies described below in Section 2.5.

2.4.4.1 Degree centrality

Perhaps the simplest strategy would be to select nodes based on their degree, hoping that high degree nodes have a low eccentricity value, and vice versa. This measure, known as degree centrality, is often suggested as a simple measure of the centrality of a node within a network, but is far from perfect for predicting the eccentricity. For example, in Figure 2.2 node F has the highest degree (6 links) and eccentricity e(F) = 5, whereas node J with lower degree 3 has eccentricity e(J) = 4. Also, a low degree is no guarantee for a high eccentricity value, as in small-world networks there are typically many nodes with a low degree, and these nodes may still be connected to the most central nodes, resulting in a low eccentricity value even though the degree is also very low. The problem here is that degree centrality is merely a local measure: it does not take into account any aspects of the graph beyond its own neighborhood. Indeed, as Figure 2.1 suggests, node degree and node eccentricity are not directly related: not all nodes with a high degree have a low eccentricity value, and not all nodes with a low degree have a high eccentricity value. Therefore we suggest using the degree as a secondary selection mechanism only, mainly to break ties in other selection methods, or to select the very first node to be examined. We mention that although many more centrality measures exist [27], a downside is that they are often as hard to compute as the eccentricity or the diameter itself and therefore not suitable to serve as a selection mechanism.

2.4.4.2 Eccentricity bound difference

During the execution of the proposed algorithm, the difference eu(v) − eℓ(v) between the upper and lower eccentricity bound could be an interesting feature, as it says something about how much we already know about the eccentricity of node v and its neighborhood. If for a certain node this difference is very large, determining its eccentricity may tighten the bounds of many nearby nodes. In essence, sorting by bound difference in decreasing order means that we are repeatedly taking a node in an area of the graph which has not been very thoroughly explored yet.

2.4.4.3 Interchanging eccentricity bounds

Inspired by traditional branch-and-bound algorithms that repeatedly select the nodes with the best bound value for expansion, we could choose to select nodes from the candidate set based on their quality in terms of how well we expect them to contribute to tightening the diameter bounds. To find nodes with high eccentricity, we can select the node v with the largest upper bound eu(v), and similarly we choose a node with a small lower bound eℓ(v) to find nodes with a low eccentricity value. As the goal is to increase the lower bound and decrease the upper bound, we propose to interchange the selection of the node with the smallest lower bound and the node with the largest upper bound.
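To make this concrete, the following C++ sketch selects the next candidate node under this interchanging strategy, breaking ties by degree (cf. Section 2.4.4.1). It is a minimal illustration rather than the implementation used in the experiments; it assumes arrays lo, hi and deg holding the current eccentricity bounds and node degrees, and a boolean flag that the caller flips after every eccentricity computation. Because a single linear scan over the candidates suffices, the selection can also be fused with the bound-updating pass of the previous iteration, as noted at the start of this section.

```cpp
#include <vector>

// Sketch: pick the next node from the (non-empty) candidate set by alternating
// between the "smallest lower bound" and "largest upper bound" criteria,
// breaking ties by highest degree. One linear scan suffices, so the selection
// can be fused with the bound-updating loop of the previous iteration.
int selectFrom(const std::vector<int>& candidates,
               const std::vector<int>& lo, const std::vector<int>& hi,
               const std::vector<int>& deg, bool pickSmallLowerBound) {
    int best = candidates.front();
    for (int v : candidates) {
        bool better;
        if (pickSmallLowerBound)
            better = lo[v] < lo[best] || (lo[v] == lo[best] && deg[v] > deg[best]);
        else
            better = hi[v] > hi[best] || (hi[v] == hi[best] && deg[v] > deg[best]);
        if (better) best = v;
    }
    return best;
}
```

The caller flips pickSmallLowerBound after every eccentricity computation, so that the search alternates between the periphery and the center of the graph.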

2.4.4.4 Repeated farthest distance

Another option is to select the node that is farthest away from the previously investigated node. So starting from some initial node v, we repeatedly select the farthest possible candidate node w with d(v, w) = e(v). This is a variation of the heuristic for approximating the diameter suggested in [88]. The only difference is that in this context, the stopping criterion for the algorithm is exactly defined, namely when the diameter lower and upper bounds are equal.

2.4.5 Example

We will give an example of how BOUNDINGDIAMETERS would determine the diameter of the graph depicted in Figure 2.2. As a selection strategy we alternately choose the largest upper bound and smallest lower bound (cf. Section 2.4.4.3), breaking ties by choosing the nodes with the highest degree (cf. Section 2.4.4.1). Any remaining ties are broken by choosing a random node. In this example, we will denote the lower and upper eccentricity bounds eℓ(v) and eu(v) of a node v by [eℓ(v); eu(v)].

Initially, all nodes form the candidate set, and all lower bounds and all upper bounds are equal. We start at node F, which has the highest degree, and remove it from the candidate set. Figure 2.2 depicts the situation after this first iteration, in which node F has been investigated. The diameter lower and upper bounds are now equal to Dℓ = e(F) = 5 and Du = 2 · e(F) = 10, respectively. Notice how for nodes M and N we set the bounds to [3; 8] and not to [2; 8], because d(M, F) = d(N, F) = 3 and the eccentricity is at least equal to 3. The current eccentricity bounds do not yet require us to remove any nodes from the candidate set.

In the second iteration, we determine the eccentricity of the node with the largest eccentricity upper bound, which could be T or S as they both have bounds [5; 10]. We choose T. The eccentricity of node T turns out to be 7, and the eccentricity bounds after the second iteration are depicted in Figure 2.3. Here, nodes that can no longer contribute to computing the diameter are green if the lower and upper bounds have become equal, and red if they have bounds such that they cannot contribute to either increasing the lower bound or decreasing the upper bound. The graph diameter now lies between Dℓ = 7 and Du = 10.

Figure 2.2: Example graph with eccentricity bound values after the first BFS from F.

Figure 2.3: Example graph with eccentricity bounds after the second BFS from T.

Figure 2.4: Example graph with eccentricity bound values after the third BFS from L.

We can now remove A, B, C, D, E, H and K from the candidate list, as we have found the exact eccentricity of these nodes (but without having computed it explicitly). We can also remove nodes I and G with bounds [6; 7], because they can no longer contribute to raising the lower bound or decreasing the upper bound. For the third loop of the algorithm we compute the eccentricity of the node with the smallest lower bound (and, as secondary selection criterion, the highest degree), which is node L. It has an eccentricity of 4, meaning that we can now discard all nodes based on the same arguments as in the previous iteration, resulting in all nodes being visited (see Figure 2.4), terminating the algorithm after only 3 eccentricity computations, and returning the maximum over all lower bounds as the final value of the diameter: 7.

2.4.6 Pruning

The size of the graph can be reduced by applying the following pruning strategy beforehand. For every node we can determine if removing all of its adjacent edges would disconnect the graph. If this is the case, and multiple identically structured small subgraphs remain, we can remove all but one of them and still obtain the correct diameter value, assuming of course that a path that realizes the diameter of the graph does not run from one subgraph to another (pruned) subgraph. Therefore, the diameter of the pruned subgraph has to be smaller than D(G)/2. For example, node C in Figure 2.2 is connected to two identical subgraphs, namely the subgraph consisting of node A and the subgraph consisting of node B. We could prune one of these subgraphs, as they will both have identical eccentricity (bound) values. Similarly, P is connected to the identical subgraphs Q–S and R–T, both with identical eccentricity values. Indeed, in the second iteration of the example run described in Section 2.4.5, we could have chosen either S or T, both resulting in the same adjustments to the bound values of the remaining nodes.

The proof of the validity of this pruning strategy can be constructed based on the concept of graph isomorphism, a bijection from one graph to another graph in which the connectedness of the graph is preserved. More precisely, a graph isomorphism h : G → G′ of a graph G = (V, E) to another graph G′ = (V′, E′) is a bijection h : V → V′ from the set of nodes V in the original graph to the set of nodes in the projected graph V′ such that (u, v) ∈ E iff (h(u), h(v)) ∈ E′ [60]. In the example from the previous paragraph, nodes A and B map to each other, as do nodes Q and S to nodes R and T, respectively. Because graph isomorphism preserves connectedness, distance measures such as the eccentricity are also preserved. Although the general problem of deciding whether there exists an isomorphism from one graph to another is computationally hard (no polynomial-time algorithm is known), the strategy described above is able to efficiently detect a simple type of isomorphism, namely that of subgraphs of a very small size that arise when only one edge is removed. Nodes that are pruned contribute to speeding up the algorithm in two ways: pruned nodes do not have to be considered during the eccentricity computation and also do not have to be included in the set of candidate nodes.
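As an illustration of the variant of this strategy used later in the experiments (pruning duplicate connected subgraphs consisting of a single node, see Section 2.5.2), the following C++ sketch marks, for every node, all but one of its degree-one neighbors as pruned. It assumes an adjacency-list representation; the function name is illustrative only and not taken from the actual implementation.

```cpp
#include <vector>

// Sketch: prune duplicate single-node subgraphs. All degree-1 neighbors of the
// same node are pairwise isomorphic, so only one representative per node is
// kept; the others can be ignored both during BFS and as diameter candidates.
std::vector<bool> pruneDuplicateLeaves(const std::vector<std::vector<int>>& adj) {
    int n = static_cast<int>(adj.size());
    std::vector<bool> pruned(n, false);
    for (int v = 0; v < n; v++) {
        bool keptOne = false;                   // keep one representative leaf
        for (int w : adj[v]) {
            if (adj[w].size() == 1) {           // w is only connected to v
                if (keptOne) pruned[w] = true;  // identical to the kept leaf
                else keptOne = true;
            }
        }
    }
    return pruned;
}
```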

2.5 Experiments

This section starts with a brief description of the datasets (graphs), and then describes a measurement methodology for the different selection strategies described in Section 2.4.4. Next, the results of applying these strategies in the BOUNDINGDIAMETERS algorithm are discussed.

2.5.1 Datasets

We will verify the algorithm on various small-world networks. Characteristics of these graphs, such as the number of nodes, the number of links, the average degree deg, average node-to-node distance d and the diameter D(G), are given in Table 2.1. Numbers are based solely on the largest connected component of each graph, and originally directed graphs are interpreted as if they are undirected. Therefore, slight deviations from statistics presented in the original papers describing these graphs may be observed.

The CA-ASTROPH dataset is an undirected network of scientific collaborations (co-authorship) in the field of astrophysics, which was obtained through arXiv and analyzed in [87].

Dataset              Nodes n    Links m      deg  d     D(G)
CA-ASTROPH [87]      17,903     393,944      21   4.15  14
ENRON [73]           33,696     361,622      10   4.07  13
WEB-GOOGLE [88]      855,802    8,582,704    10   6.30  24
YOUTUBE [103]        1,134,890  5,975,248    5    5.32  24
FLICKR [103]         1,624,992  30,953,670   18   5.38  24
AS-SKITTER [86]      1,694,616  22,188,418   13   5.08  31
WIKIPEDIA-NL [10]    2,213,236  23,520,520   11   4.81  18
ORKUT [103]          3,072,441  234,370,166  76   4.16  10
LIVEJOURNAL3 [103]   5,189,809  97,839,882   19   5.48  23
HYVES [128]          8,083,964  912,067,984  112  4.75  25

Table 2.1: Characteristics of the datasets: various small-world graphs.

ENRON [73] is a well-known network of e-mail contacts within a company, in which a node represents an e-mail address and two nodes are connected if an e-mail has been sent between these two addresses. The WEB-GOOGLE dataset is a partial crawl of the world wide web [88]. The FLICKR, LIVEJOURNAL, ORKUT and YOUTUBE datasets are partial crawls of the respective online social networks, and are studied in detail in [103]. AS-SKITTER is an undirected internet topology graph created from network traceroutes, which is analyzed in detail in [86]. WIKIPEDIA-NL is a full crawl of the Dutch Wikipedia graph as present in DBpedia 3.5 [10], a community effort to extract structured information from Wikipedia. The dataset denoted by HYVES is the full friendship graph of a Dutch online social network (see Chapter 6).

2.5.2 Measurement methodology

In the following experiments we will compare the three different node selection strategies from Section 2.4.4:

• Strategy 1: Largest eccentricity bound difference (cf. Section 2.4.4.2)
• Strategy 2: Interchanging largest upper bound and smallest lower bound (cf. Section 2.4.4.3)
• Strategy 3: Farthest distance (see Section 2.4.4.4)

Ties are broken by taking the node with the highest degree (cf. Section 2.4.4.1). Any remaining ties are broken by picking the lexicographically first node, which is determined by the order in which the nodes are read from the input file. Thus, each of the selection strategies is deterministic. The critical step of the algorithm is clearly one BFS, so the step of computing the actual eccentricity of one node. The number of times that a BFS is executed (which we will refer to as the number of iterations) will therefore serve as a basis of comparison of the three strategies. Note that the number of iterations that the traditional APSP algorithm from Algorithm 2.1 would perform, is one eccentricity computation for each node in the graph, so a total of n iterations. We have chosen to only implement the simple optimization strategy (see Section 2.4.6) of pruning duplicate connected subgraphs consisting of one node.

2.5.3 Results

The number of iterations for each of the three strategies is given in the second, third and fourth column of Table 2.2; for each dataset, the lowest of these three values indicates the best result amongst the three strategies. The last column indicates the number of pruned nodes as well as the percentage of the total number of nodes that was pruned.

The results show that with only a few actual eccentricity computations, the algorithm is able to determine the exact diameter of the datasets, with Strategy 2 as the best-performing node selection strategy. We expect this to be because interchanging the search for low and high eccentricity nodes means that we are interchanging the selection of a node in the dense and the peripheral part of the small-world network, quickly lowering the upper bound and increasing the lower bound, respectively. We believe that Strategy 1 did not perform so well because, although this strategy gives a hint towards areas of the graph that have not been thoroughly explored, it does not give any guarantees on the size of this area (which could be very small) and the number of such areas (which could be very large). Strategy 3 appeared to perform worse because it was not able to find any low-eccentricity valued nodes that could lower the diameter upper bound.

We observed that even when no selection strategy is applied (so, by selecting candidate nodes at random), using the suggested lower and upper eccentricity bounds can help to converge on the diameter more quickly than using the APSP algorithm. For larger datasets, this number was well over 10,000 and thus still far too time-consuming, but for CA-ASTROPH 260 ± 95, for ENRON 316.5 ± 142 and for WEB-GOOGLE a total of 5,975 ± 2,249 iterations were needed (the number of iterations is averaged over 10 runs, so we also report the standard deviation) to obtain the exact diameter. This may suggest that contrary to the various selection strategies, random candidate node selection does not scale as the size of the graph increases.

We mention that for the YOUTUBE dataset, Strategy 2 only needs 2 eccentricity computations to determine the diameter, demonstrating the best-case performance of the algorithm. It turned out that the node with the highest degree had eccentricity 12, causing nodes with bounds [12; 24] to exist. One of these nodes apparently had an eccentricity of value 24, realizing bounds of [24; 24] and terminating the algorithm after just 2 iterations.

Dataset        Strategy 1  Strategy 2  Strategy 3  Pruned nodes
CA-ASTROPH     18          9           63          185 (1.0%)
ENRON          12          11          61          8,715 (25.8%)
WEB-GOOGLE     20          4           28          91,965 (10.7%)
YOUTUBE        2           2           2           399,553 (35.2%)
FLICKR         10          3           7           553,242 (34.0%)
AS-SKITTER     10          4           19          114,803 (6.8%)
WIKIPEDIA-NL   21          3           583         947,582 (30.8%)
ORKUT          357         106         389         27,429 (0.9%)
LIVEJOURNAL    6           3           14          318,378 (6.1%)
HYVES          40          21          44          446,258 (5.6%)

Table 2.2: Performance (number of iterations) of different node selection strategies.

For the ORKUT dataset, compared to the other graphs, a relatively large number of eccentricity computations was needed to determine the diameter. To further analyze this, we look at how quickly the number of candidate nodes decreases during the execution of the algorithm. Therefore we show the number of unvisited nodes and the lower and upper bounds on the diameter during the execution of the algorithm using Strategy 2 on the ORKUT dataset in Figure 2.5. After 16 computations, another 90 computations were needed to decide whether the diameter was equal to 9 or 10, and the number of nodes to be examined only decreases by 1 or 2 after each eccentricity computation. Apparently the remaining unvisited nodes are positioned in the graph in such a way that the computation of the actual eccentricity of these nodes cannot be avoided using the neighboring bounds. Although in this case it takes a while to find the exact diameter, tight bounds on the diameter are quickly available, as we have narrowed the diameter down to one of two values.

The last column of Table 2.2 shows how the pruning strategy is able to significantly reduce the size of the problem. This is not surprising as small-world networks typically have many low degree nodes, and it is quite likely that many of these nodes are linked to the same node, and can thus be pruned.

Figure 2.5: Candidate nodes (left vertical axis; logarithmic) and lower and upper bounds (right vertical axis) vs. iterations (horizontal axis) for the ORKUT dataset.

As an interesting side result it turned out that, very often, the actual eccentricity has been found for a large portion of the nodes in the graph when the algorithm has terminated, because the eccentricity lower and upper bounds of these nodes have become equal. For example, for the ORKUT dataset, 777,257 actual eccentricity values (25%) were obtained, while only 106 values were explicitly computed. Similar numbers were observed for the other datasets. Also interesting to note is that the exact diameter that we computed for the ENRON and AS-SKITTER datasets deviates from the values of respectively 12 and 24 that were approximated in previous work.

Although not related to the performance measurement methodology, we mention that with a straightforward C++ implementation, one node eccentricity computation takes around six seconds for a dataset with 8 million nodes and 912 million edges on a standard 3.2GHz machine with 10GB of memory. This means that we are able to determine the exact diameter in a matter of seconds or minutes, which is a big improvement over the traditional APSP approach which would easily take over a year of computation time. More information on the datasets used in this chapter, the obtained diameter paths, and a simple implementation can be found at the supporting website: www.liacs.nl/~ftakes/diameter/. In later experiments, we ran the proposed diameter algorithm on a much larger set of graphs. The results can be found in Chapter 4. For a comparison of the proposed algorithm with related work, we refer the reader to two works that were published after the original article on which this chapter is based [36, 99].

2.6 GPU parallelism

A well-known technique to improve the performance of almost any algorithm is to introduce parallelism in (parts of) the computation. With dual, quad and octa-core CPUs available in standard desktop machines, CPU parallelism can be an excellent way to reduce the runtime of an algorithm by performing certain operations in parallel. Nowadays, graphics processing units (GPUs) have even hundreds of cores, suggesting a much higher potential for exploiting parallelism compared to the CPU. However, whereas with a CPU algorithm it is often possible to obtain a speedup equal to the number of cores, using the GPU the speedup is usually much lower than the number of cores, as the architecture of the GPU is typically more "exotic" compared to what a regular CPU algorithm would expect. This section, which is largely based on work [41] together with Giso Dal, briefly summarizes to what extent GPU parallelism using the Nvidia CUDA framework can be applied to the BOUNDINGDIAMETERS algorithm.

Parallelizing an existing sequential algorithm is not always trivial, and involves carefully selecting procedures within the algorithm that can be parallelized. In case of the BOUNDINGDIAMETERS algorithm discussed in this chapter, it turned out that 97% of the running time (the performance of GPU algorithms is usually measured in seconds instead of iterations) is spent on computing eccentricity values (line 10 of Algorithm 2.2). Therefore it makes sense to optimize the eccentricity computation (so, one SSP run or one BFS). In general the specific architecture of the considered GPU should be carefully taken into account when designing data structures and algorithms that are to be used on a GPU. Extensive research on parallelizing graph traversal on a GPU has been done, and it is clear that compared to sequential CPU algorithms, GPU parallelism introduces new challenges with respect to, for example, shared memory and synchronization [57, 101].

For the considered NVIDIA GPU (Fermi, compute capability 2.0), this means that the data structure used to store the frontier of the BFS should be optimized for the specific access pattern for which the GPU memory achieves the best performance. A straightforward GPU implementation of the eccentricity function, where a thread is assigned to each node in the frontier of the BFS, already results in a speedup factor of up to 12× on various large real-world graphs. The speedup appears to be somewhat dependent on the properties of the considered graph. Most notably, it appears to be influenced by the relation between the diameter and the effective diameter. A clear bottleneck of the standard (thread-based) GPU algorithm is the size of adjacency lists and the number of vertices in the frontier of the BFS. This problem can be overcome by looking at the number of nodes in the frontier at each step of the BFS. If this number is above a certain threshold, a different approach can be used, which utilizes a so-called "warp" of 32 parallel threads that processes the longer adjacency list more efficiently. Thus, a choice is made between either using one thread per adjacency list, or using multiple threads per list (but processing fewer lists in parallel). The resulting hybrid approach, which picks a different eccentricity algorithm (thread-based or warp-based) depending on the stage of the BFS, is able to realize a speedup of up to 21× compared to the sequential CPU algorithm.
For a more detailed description of these GPU algorithms and experimental results, the reader is referred to [41].

2.7 Conclusion

We have shown that the proposed algorithm, BOUNDINGDIAMETERS, is able to efficiently determine the exact diameter of small-world networks, making use of lower and upper bounds on the eccentricity of the nodes and on the diameter itself. A proper selection strategy allows the algorithm to exploit the characteristic properties of small-world networks. Moreover, we have outlined a pruning strategy which reduces the size of the problem. We have also shown that even when the diameter is not found very quickly, very tight bounds on the diameter are available after only a few iterations.

In future work we will investigate if the proposed algorithm can be used to determine the radius (minimum eccentricity value over all nodes) of a graph. We furthermore want to see if we can obtain the eccentricity of all nodes in the network, allowing the study of the exact eccentricity distribution of a graph. These two issues will be addressed in Chapter 4 and Chapter 3, respectively. It may also be interesting to look at the problem of determining the diameter of the strongly connected component of a directed graph, and that of weighted graphs. In line with results presented in related work [39], we expect that the bounding approach will also work well on weighted graphs, as the diversity in edge weights and thus path lengths will undoubtedly influence the difference in eccentricity values and speed up the convergence of lower and upper eccentricity bounds and therewith the diameter bounds. In preliminary experiments we see that by using only the lower diameter bounds, significant improvements over the APSP algorithm can already be observed. Following up on the GPU implementation of the proposed algorithm briefly discussed in Section 2.6, work can still be done on parallelization using the CPU or a combination of the CPU and GPU. Last but not least, we hope to investigate how the exact diameter of small-world networks behaves over time, and how the algorithm can be adjusted to adapt to changes in the network, i.e., the addition and deletion of nodes and links.

3

Computing the Eccentricity Distribution of Large Graphs

The eccentricity of a node in a graph is defined as the length of a longest shortest path starting at that node. The eccentricity distribution over all nodes is a relevant descriptive property of the graph, and its extreme values allow the derivation of measures such as the radius and diameter of the graph. This chapter describes two new methods for computing the eccentricity distribution of large graphs such as social networks, biological networks, webgraphs and routing networks. We first propose an exact algorithm based on eccentricity lower and upper bounds that is significantly faster than the straightforward algorithm when computing both the extreme values of the distribution as well as the eccentricity distribution as a whole. The second algorithm that we describe is a hybrid strategy that combines the exact approach with an efficient sampling technique in order to obtain an even larger speedup on the computation of the entire eccentricity distribution. We perform a set of experiments on a number of large graphs in order to measure and compare the performance of the proposed algorithms, and demonstrate how we can efficiently compute the eccentricity distribution of various large real-world graphs. This chapter is based on:

• F. W. Takes and W. A. Kosters. Computing the eccentricity distribution of large graphs. Algorithms, 6(1):100–118, 2013

3.1 Introduction

There exist all kinds of interesting properties that describe the relationships between the objects in a dataset modeled by a graph. One of these properties is the eccentricity distribution, which indicates the distribution of the eccentricity over all nodes, where the eccentricity of a node refers to the length of a longest shortest path starting at that node. This distribution differs from properties such as the distance distribution in the sense that eccentricity can be seen as a more "extreme" measure of distance. It also differs from indicators such as the degree distribution in the sense that determining the eccentricity of every node in the graph is computationally expensive: the traditional method computes the eccentricity of each node by running an All Pairs Shortest Path (APSP) algorithm, requiring O(mn) time for a graph with n nodes and m edges. Unfortunately, this approach is too time-consuming if we consider large graphs with possibly millions of nodes.

The aforementioned complexity issues are frequently solved by determining the eccentricity of a random subset of the nodes in the original graph, and then deriving the eccentricity distribution from the obtained values [118]. While such an estimate may seem reasonable when the goal is to determine the overall average eccentricity value, we will show that this technique does not perform well when the actual extreme values of the distribution are of relevance. The nodes with the highest eccentricity values realize the diameter of the graph and form the so-called graph periphery, whereas the nodes with the lowest values realize the radius and form the center of the graph. Finding exactly these nodes can be useful within various application areas.

In routing networks, for example, it is interesting to know exactly which nodes form the periphery of the network and thus have the highest worst-case response time to any other device [96]. Also, when (routing) networks are modeled, for example for research purposes, it is important to measure the exact eccentricity distribution so that the model can be evaluated by comparing it with the distribution of real routing networks [98]. The eccentricity also plays a role in biological networks [67], for example in networks that model some biological system. There, proteins (nodes in the network) that have a low eccentricity value are easily functionally reachable by other components of the network [111]. The diameter, defined as the length of a longest shortest path, is the most frequently studied eccentricity-based measure, and efficient algorithms for its computation have been discussed in Chapter 2 of this thesis.

Generally speaking, eccentricity can be seen as an extreme measure of centrality, i.e., the relative importance of a node in a graph. Eccentricity centrality [15], centroid centrality [23] and graph centrality [26] have been suggested as centrality measures based on the eccentricity of a node. Compared to other measures such as closeness centrality, the main difference is that a node with a very low eccentricity value is relatively close to every other node, whereas a node with a low closeness centrality value is close to all the other nodes on average.

In this chapter we first discuss an algorithm based on eccentricity lower and upper bounds for determining the exact eccentricity of every node of a graph.
We also present a useful pruning strategy, and show how the proposed method significantly improves upon the traditional APSP-based algorithm. To realize an even larger speedup, we propose to incorporate a sampling technique on a specific set of nodes in the graph which allows us to obtain the eccentricity distribution much faster, while still ensuring a low error on the full eccentricity distribution and an exact result for the eccentricity-based graph properties such as the radius and diameter.

The rest of this chapter is organized as follows. We consider some notation and formulate the main problems addressed in this chapter in Section 3.2, after which we cover related work in Section 3.3. Section 3.4 describes an exact algorithm for determining the eccentricity distribution. In Section 3.5, we explain how sampling can be incorporated. Results of applying the methods to various large graphs are presented in Section 3.6, and finally Section 3.7 summarizes the chapter and offers suggestions for future work.

3.2 Preliminaries

We consider a graph G = (V, E), where V is the set of |V| = n vertices (nodes) and E ⊆ V × V is the set of |E| = m edges (also called links). The distance d(v, w) between two nodes v, w ∈ V is defined as the length of a shortest path from v to w, i.e., the minimum number of edges that have to be traversed to get from v to w. We assume that graphs are undirected, meaning that (v, w) ∈ E iff (w, v) ∈ E and thus d(v, w) = d(w, v) for all v, w ∈ V. Note that each edge (v, w) is thus included twice in E: once as a link from v to w and once as a link from w to v. We will also assume that G is connected, meaning that d(v, w) is always finite. Furthermore it is assumed that there are no parallel edges and no loops linking a node to itself. The neighborhood N(v) of a node v is defined as the set of all nodes connected to v via an edge: N(v) = {w ∈ V | (v, w) ∈ E}. The degree deg(v) of a node v can then be defined as the number of nodes connected to that node, i.e., its neighborhood size: deg(v) = |N(v)|.

The eccentricity e(v) of a node v ∈ V is defined as the length of a longest shortest path from v to any other node: e(v) = maxw∈V d(v, w). The eccentricity distribution counts the frequency f(x) of each eccentricity value x, and can easily be derived when the eccentricity of each node in the graph is known. The relative eccentricity distribution lists for each eccentricity value x its relative frequency F(x) = f(x)/n, normalizing for the number of nodes in the graph. Figure 3.1 shows the relative eccentricity distributions of a number of large graphs (for a detailed description of these datasets, see Section 3.6.1).

Various other measures can be derived using the notion of eccentricity, such as the average eccentricity e(G) of a graph G, defined as e(G) = (1/n) · Σv∈V e(v). Another related measure is the diameter D(G) of a graph, which is defined as the maximum eccentricity over all nodes in the graph: D(G) = maxv∈V e(v). Similarly, we define the radius R(G) of a graph as the minimum eccentricity over all nodes in the graph: R(G) = minv∈V e(v). The graph center C(G) refers to the set of nodes with an eccentricity equal to the radius, C(G) = {v ∈ V | e(v) = R(G)}. Similarly, the graph periphery P(G) is defined as the set of nodes with an eccentricity value equal to the diameter of the graph: P(G) = {v ∈ V | e(v) = D(G)}. For ease of notation, we will often denote these measures as variables e, D, R, C and P. Each of the metrics explained above can be derived when the eccentricity of all nodes is known. Figure 3.3 shows an example of a graph that explains the different measures covered in this section. In Figure 3.2, a larger graph in which the node color corresponds to the eccentricity value is shown.

Figure 3.1: Relative eccentricity distributions of various large graphs.

Figure 3.2: The YEAST graph, consisting of 1458 nodes and 1948 undirected edges. The color of a node represents its eccentricity value in the range from 11 to 19 from low (lighter) to high (darker) and the size of a node is proportional to its degree. Visualized using the ForceAtlas2 algorithm in Gephi (http://gephi.org).


Figure 3.3: Toy graph consisting of 12 nodes and 14 edges. The number next to a node denotes its eccentricity value. The graph has average eccentricity 4.67, radius 3 realized by center node G, and diameter 6 realized by periphery nodes D, E and L.

Computing the eccentricity of one node can be done by running Dijkstra's algorithm, and returning the largest distance found (i.e., the distance to a node furthest away from the starting node). Because we only consider unweighted graphs and thus, starting from the current node, can simply explore the neighboring nodes in level-order, computing the eccentricity of one node can be done in O(m) time. The process of determining the eccentricity of one node v is denoted by ECCENTRICITY(v) in Algorithm 3.1, representing one Breadth First Search (BFS) starting at node v. Algorithm 3.1 simply computes the eccentricity for each of the n nodes, resulting in an overall complexity of O(mn) to determine the eccentricity of every node in the graph. Clearly, in graphs with millions of nodes and possibly hundreds of millions of edges, this approach is too time-consuming. The rest of this chapter describes more efficient approaches for determining the eccentricity distribution, where we are interested in two things:

We will address these issues by answering the following two questions:

Algorithm 3.1 NAIVEECCENTRICITIES 1: Input: Graph G 2: Output: List e, containing e(v) for all v V ∈ 3: for v V do ∈ 4: e[v] ECCENTRICITY(v) ← 5: end for

6: return e Chapter 3. Computing the Eccentricity Distribution of Large Graphs 43

How can we obtain the exact eccentricity distribution by efficiently computing • the exact value of e(v) for all nodes v V ? (Section 3.4) ∈ How can we obtain an accurate approximation of the eccentricity distribution by • using a sampling technique? (Section 3.5)

3.3 Related work

3.3 Related work

Work on the eccentricity distribution dates back to at least 1975, when the term "eccentric sequence" was used to denote the sequence that counts the frequency of each eccentricity value [90]. The facility location problem was suggested as an example of the usefulness of eccentricity as a measure of centrality. When considering the placement of emergency facilities such as a hospital or fire station, assuming the map is modeled as a graph, the node with the lowest eccentricity value might be a good location for such a facility. In other situations, for example for the placement of a shopping center, a related measure called closeness centrality, defined as the average distance from a particular node v to every other node ((1/(n−1)) · Σw∈V d(v, w)), is more suitable. Generally speaking, the eccentricity is a relevant measure when some strict criterion (the firetruck has to be able to reach every location within ten minutes) has to be met [56]. An application of eccentricity as a measure on a larger scale is the network routing graph, where the eccentricity of a node says something about the worst-case response time between one machine and all other machines [98].

The most well-known eccentricity-based measure is the diameter, which has been extensively investigated in Chapter 2 and in [37, 95]. Several measures related to the eccentricity and diameter have also been considered. Kang et al. [68] study the effective radius, which they define as the 90th percentile of all the shortest distances from a node. In a similar way, the effective diameter can be defined, which is shown to be decreasing over time for many large real-world graphs [86]. Each of these measures is computed by using an approximation algorithm [109] to determine the neighborhood of a node, a technique on which we will elaborate in Section 3.5.3. An overview of algorithms for approximating the radius and diameter is given in [115].

To the best of our knowledge, there are no efficient techniques that have been specifically designed to determine the exact eccentricity distribution of a graph. Obviously, the naive approach, for example as suggested in [29], is too time-consuming. Efficient approaches for solving the APSP problem (which makes deriving the eccentricities trivial) have been developed, for example using matrix multiplication [147]. Unfortunately, such approaches are still too complex in terms of time and memory requirements. The remainder of this chapter describes both a new exact algorithm and a new approximation algorithm to efficiently compute the eccentricity distribution.

3.4 Exact algorithm

In order to obtain an algorithm that can compute the eccentricity of all n nodes in a graph faster than simply recomputing the eccentricity n times (once for each node in the graph), we have two options:

1. Reduce the size of the graph to speed up one eccentricity computation.

2. Reduce the total number of eccentricity computations.

In this section we propose to use lower and upper bounds on the eccentricity, a strategy that accommodates the second type of speedup. We will also discuss a pruning strategy that helps to reduce both the number of nodes that have to be investigated, as well as the size of the graph.

3.4.1 Eccentricity bounds

We propose to use the following bounds on the eccentricity of all nodes w ∈ V:

Observation 3.1 (Node eccentricity bounds) If a node v ∈ V has eccentricity e(v), then for all nodes w ∈ V we have:

max(e(v) − d(v, w), d(v, w)) ≤ e(w) ≤ e(v) + d(v, w)

For an explanation of these bounds, we refer the reader to Section 2.2, in which the same observation was used to determine which nodes can contribute to the process of computing the diameter of a graph. We employ an algorithm similar to what is proposed in Chapter 2, but this time using the eccentricity bounds to compute the full eccentricity distribution of the graph. The approach is outlined in Algorithm 3.2.

First, the candidate set W and the lower and upper eccentricity bounds are initialized (lines 3–7). In the main loop of the algorithm, a node v is repeatedly selected (line 9) from W, its eccentricity is determined (line 10), and finally all candidate nodes are updated (lines 11–18) according to Observation 3.1. Note that the value of d(v, w) which is used in the updating process does not have to be computed, as it is already known because it was computed for all w during the computation of the eccentricity of v. If the lower and upper eccentricity bounds for a node have become equal, then the eccentricity of that node has been derived and it is removed from W (lines 14–17). Algorithm 3.2 returns a list containing the exact eccentricity value of each node. Counting the number of occurrences of each eccentricity value results in the eccentricity distribution. An overview of possible selection strategies for the function SELECTFROM can be found in Section 2.4.4.

Algorithm 3.2 BOUNDINGECCENTRICITIES
1: Input: Graph G
2: Output: List e, containing e(v) for all v ∈ V
3: W ← V
4: for w ∈ W do
5:    eℓ[w] ← −∞
6:    eu[w] ← +∞
7: end for
8: while W ≠ ∅ do
9:    v ← SELECTFROM(W)
10:   e[v] ← ECCENTRICITY(v)
11:   for w ∈ W do
12:      eℓ[w] ← max(eℓ[w], max(e[v] − d(v, w), d(v, w)))
13:      eu[w] ← min(eu[w], e[v] + d(v, w))
14:      if (eℓ[w] = eu[w]) then
15:         e[w] ← eℓ[w]
16:         W ← W − {w}
17:      end if
18:   end for
19: end while
20: return e

In line with results presented in Chapter 2, we found that when determining the eccentricity distribution, interchanging the selection of a node with a small lower bound and a node with a large upper bound, breaking ties by taking a node with the highest degree, yielded by far the best results. As described in Section 2.4.3, examples of graphs in which this algorithm would definitely not work are complete graphs and circle-shaped graphs. However, most real-world graphs adhere to the small-world property [71], and in these graphs the eccentricity values are sufficiently diverse so that the proposed eccentricity lower and upper bounds can effectively be utilized.
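A compact C++ rendering of Algorithm 3.2 is sketched below, under the assumption of a connected, undirected graph stored as adjacency lists. The node selection is left as a placeholder for one of the strategies of Section 2.4.4; the essential part is the bound update of lines 12–13, which reuses the distances d(v, w) that the BFS from v has just computed.

```cpp
#include <algorithm>
#include <climits>
#include <queue>
#include <unordered_set>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency lists, nodes 0..n-1

// One BFS from v; returns the distance from v to every node (connected graph).
std::vector<int> bfsDistances(const Graph& g, int v) {
    std::vector<int> dist(g.size(), -1);
    std::queue<int> q;
    dist[v] = 0;
    q.push(v);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int w : g[u])
            if (dist[w] == -1) { dist[w] = dist[u] + 1; q.push(w); }
    }
    return dist;
}

// Sketch of BOUNDINGECCENTRICITIES: repeatedly compute one eccentricity exactly
// and tighten the bounds of all remaining candidates using Observation 3.1.
std::vector<int> boundingEccentricities(const Graph& g) {
    int n = static_cast<int>(g.size());
    std::vector<int> ecc(n, 0), lo(n, INT_MIN), hi(n, INT_MAX);
    std::unordered_set<int> candidates;
    for (int v = 0; v < n; v++) candidates.insert(v);

    while (!candidates.empty()) {
        int v = *candidates.begin();           // placeholder for SELECTFROM(W)
        std::vector<int> d = bfsDistances(g, v);
        ecc[v] = *std::max_element(d.begin(), d.end());
        candidates.erase(v);

        std::vector<int> settled;
        for (int w : candidates) {
            lo[w] = std::max(lo[w], std::max(ecc[v] - d[w], d[w]));
            hi[w] = std::min(hi[w], ecc[v] + d[w]);
            if (lo[w] == hi[w]) { ecc[w] = lo[w]; settled.push_back(w); }
        }
        for (int w : settled) candidates.erase(w);
    }
    return ecc;
}
```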

3.4.2 Example run

For an example run of the proposed algorithm, consider the problem of determining the eccentricities of the nodes of the toy graph from Figure 3.3. We will denote the lower and upper eccentricity bounds eℓ(v) and eu(v) of a node v by [eℓ(v); eu(v)]. If we compute the eccentricity of node G, which is 3, then we can derive bounds [2; 4] for the nodes at distance 1 (B, C, F, J and K), [2; 5] for the nodes at distance 2 (A, H and I) and [3; 6] for the nodes at distance 3 (D, E and L), as depicted in Figure 3.4. If we then compute the eccentricity of node E, which is 6, we derive bounds [5; 7] for node I, [4; 8] for node F, [3; 9] for nodes A, B and G, [4; 10] for nodes C, J and K, [5; 11] for node H, and [6; 12] for nodes D and L. If we combine these bounds for each of the nodes (cf. lines 12–13 of Algorithm 3.2), then we find that lower and upper bounds for a large number of nodes have become equal: [4; 4] for C, F, J and K, [5; 5] for H and I, and [6; 6] for D and L, as shown in Figure 3.5. Finally, computing the eccentricity of nodes A and B results in a total of 4 BFSes to compute the complete eccentricity distribution (Figure 3.6), which is a speedup of 3 compared to the naive algorithm, which would simply compute the eccentricity for all 12 nodes in the graph.

Figure 3.4: Eccentricity bounds (lower and upper bound respectively below and above the node) of the toy graph in Figure 3.3 after computing the eccentricity of node G.

Figure 3.5: Eccentricity bounds of the toy graph in Figure 3.3 after subsequently computing the eccentricity of nodes G and E.

Figure 3.6: Eccentricity bounds of the toy graph in Figure 3.3 after subsequently computing the eccentricity of nodes G, E and A.

To give a first idea of the performance of the algorithm on larger real-world graphs, Figure 3.7 shows the number of iterations (vertical axis) that are needed to compute the eccentricity of all nodes with a given eccentricity value (horizontal axis) for a number of large graphs (for a description of the datasets, see Section 3.6.1). We can clearly see that especially for the extreme values of the eccentricity distribution, very few iterations are needed to compute all of these eccentricity values, whereas many more iterations (though still much less than n) are needed to derive the values in between the extreme values.

Figure 3.7: Eccentricity values (horizontal axis) vs. number of iterations to compute the eccentricity of all nodes with this eccentricity value (vertical axis, logarithmic).

3.4.3 Pruning

In this subsection we introduce a pruning strategy that is loosely based on the pruning step introduced in Section 2.4.6. It relies on the following observation:

Observation 3.2 Assume that n > 2. For a given v ∈ V, all nodes w ∈ N(v) with deg(w) = 1 have e(w) = e(v) + 1.

Node w is only connected to node v, and will thus need node v to reach every other node in the graph. If node v can do this in e(v) steps, then node w can do this in exactly e(v) + 1 steps. The restriction n > 2 on the graph size excludes the case in which the graph consists of v and w only. All interesting real-world graphs have a lot more than two nodes, making Observation 3.2 applicable in the proposed algorithm.

Observation 3.2 can be beneficial in two ways. First, when computing the eccentricity of a single node, the pruned nodes can be ignored in the shortest path algorithm. Second, when the eccentricity of a node v has been computed or derived (line 10 or lines 14–17 of Algorithm 3.2), and this node has adjacent nodes w with deg(w) = 1, then the eccentricity of these nodes w can be set to e(w) = e(v) + 1.

In Figure 3.3, node G has two neighbors with degree one, namely J and K. According to Observation 3.2, the eccentricity values of these two nodes are equal to e(J) = e(K) = e(G) + 1 = 4. The same argument holds for nodes D and L with respect to node H. We expect that Observation 3.2 can be applied quite often, as many of the graphs that are nowadays studied have a power-law degree distribution [46], meaning that there are many nodes with a very low degree (such as a degree of 1). Furthermore, there is no significant additional computation time involved in identifying prunable nodes.

The pruning strategy described in this section is conceptually similar to the pruning strategy used in Section 2.4.6 of Chapter 2. The major difference is that there, pruned nodes are not useful for computing the diameter and can be completely removed from the graph, whereas for computing the eccentricity distribution, the actual nodes that are pruned are relevant when the final eccentricity distribution is derived from the list of eccentricity values, and should thus not be removed from the node set.
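A minimal C++ fragment applying Observation 3.2 is shown below; it assumes an adjacency-list graph and an eccentricity array in which values that are not yet known are marked as −1, and would be called right after e(v) has been computed or derived.

```cpp
#include <vector>

// After e(v) has been computed or derived, every neighbor w that is connected
// only to v (deg(w) = 1) must have eccentricity e(v) + 1 (Observation 3.2).
void propagateToLeaves(const std::vector<std::vector<int>>& adj,
                       int v, std::vector<int>& ecc) {
    for (int w : adj[v])
        if (adj[w].size() == 1 && ecc[w] == -1)  // w is a pending degree-1 node
            ecc[w] = ecc[v] + 1;
}
```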

3.5 Approximation algorithms

The use of sampling to determine the eccentricity distribution will be discussed in Section 3.5.1. Sampling is a technique in which a subset of the original dataset is evaluated in order to estimate (the distribution of) characteristics of the (elements in the) complete dataset, with faster computation as one of the main advantages. We also take into account situations where we not only want to estimate the distribution of the eccentricity values, but also want to assess some parts of it (the extreme values) with high reliability. For that purpose we propose a hybrid technique to determine the eccentricity distribution that combines the exact approach from Section 3.4 with non-random sampling in Section 3.5.2. Finally, in Section 3.5.3 we consider an adaptation of an existing approximation approach from the literature.

3.5.1 Random node selection

If we want to apply sampling to the naive algorithm for obtaining the eccentricity distribution, we could choose to evaluate only a subset of size n′ of the original n nodes (clearly, here 0 < n′ < n).

e(v)      7    8       9       10     11   12  13
f(e(v))   248  12,210  17,051  3,647  485  44  11

Table 3.1: Eccentricity distribution of the ENRON graph.

Figure 3.8 shows the relative standard deviation for each eccentricity value for different sample sizes (1%, 2%, 5%, 10%, and 20%). For each sample size, we took 100 random samples to average the result. We note that low relative standard deviations can be observed for the more common eccentricity values (in this case, 8, 9 and 10). However, the standard deviation is quite large for the extreme values, meaning that on many occasions, the sample did not correctly reflect the frequency of that particular eccentricity value. Indeed, if only 11 out of 33,696 nodes (0.03%) have a particular eccentricity value (in this case, a value of 13), we need a sample size of at least 100%/11 ≈ 9% if we want to expect just one node with that particular eccentricity value in the sample. Figure 3.8 shows that even with a large sample size of 20%, the standard deviation for the extreme value of 13 is very high (0.63). For sample sizes of 1% or 2%, the eccentricity value of 13, which exists in the real distribution, was never even found. Similar events were observed for the other datasets, which is not surprising: the eccentricity distributions are tailed on the extremes, and these tails are likely to be left out in a sample. So clearly sampling does not suffice when extreme values of the eccentricity distribution have to be found, because such extreme values are simply too rare. However, for the more common eccentricity values, errors are very low, even for small sample sizes.

Figure 3.8: Relative standard deviation (vertical axis) of eccentricity values (horizontal axis) for different random sample sizes of the ENRON dataset.

3.5.2 Hybrid algorithm

From Figure 3.7 and Figure 3.8 we can conclude that the exact approach is able to quickly derive the more extreme values of the eccentricity distribution, whereas a sampling technique is able to approximate the values in between the extremes with a very low error. Therefore we propose to combine the two approaches in order to quickly converge to an accurate estimation of the eccentricity distribution using a hybrid approach. We mention that the proposed technique assumes that the eccentricity distribution has a somewhat unimodal shape, which is the case for all real-world graphs that we have investigated.

The sampling window consists of bounding variables ℓ and r that denote between which eccentricity values the algorithm is going to use the sampling technique. The sampling window obviously depends on the distribution itself, which is not known until the exact algorithm has finished. Therefore we set the value of ℓ and r dynamically as outlined on lines 19–20 of Algorithm 3.3, which we will call BOUNDINGECCENTRICITIESLR. The main difference with respect to Algorithm 3.2 is a change in the stopping criterion in line 8. This means that the algorithm should now stop when there are no more candidates that are potentially part of the center or the periphery of the graph. Note that these bounds apply to the candidate set W, but (via the values of ℓ and r) we use information about all nodes V, ensuring that at least the center and periphery are known before the exact phase is terminated. When the exact algorithm has done its job, it can tighten ℓ and r even further, based on the eccentricity of the remaining candidate nodes, as outlined on lines 22–23. This can be beneficial when the center is much harder to find than the periphery (or the other way around), in which case the distribution of more eccentricity values may already have been computed exactly, resulting in the advantage that for these eccentricity values no sampling has to be done. After this, ℓ and r become static, and the rest of the eccentricity distribution will be derived by using the sampling approach outlined in Algorithm 3.4.

Compared to Algorithm 3.2, the difference in terms of input is that the hybrid algorithm given in Algorithm 3.4 takes as input both the original graph and a sampling rate q between 0 and 1, where q = 0.10 means that 10% of the nodes have to be sampled. In Section 3.6 we will perform experiments to determine how q should be set. The resulting list f holds the eccentricity distribution and is initialized based on the exact eccentricity values e′ for values outside the sampling window (line 7). Nodes for which the exact eccentricity is not yet known are added to set Z (line 9). In lines 12–17, the eccentricities of a total of n · q random nodes are computed, and if this value lies within the specified window, the frequency of this eccentricity value is increased proportionally to the sample size. Finally, the eccentricity distribution f is returned.

Algorithm 3.3 BOUNDINGECCENTRICITIESLR (adjustment of Algorithm 3.2)
1: Input: Graph G and a reference to variables ℓ and r
2: Output: List e, containing e(v) for values of e(v) outside [ℓ; r]
3: W ← V; ℓ ← 0; r ← ∞
4: ... // original BOUNDINGECCENTRICITIES algorithm
8: while {w ∈ W | eℓ[w] = ℓ or eu[w] = r} ≠ ∅ do
9:    ... // original BOUNDINGECCENTRICITIES algorithm
19:   ℓ = minv∈V eℓ[v]
20:   r = maxv∈V eu[v]
21: end while
22: ℓ = minv∈W eℓ[v] − 1
23: r = maxv∈W eu[v] + 1
24: return e

Algorithm 3.4 HYBRIDECCENTRICITIES
1: Input: Graph G and sampling rate q
2: Output: List f, containing the eccentricity distribution of G, initialized to 0
3: e′ ← BOUNDINGECCENTRICITIESLR(G, ℓ, r)
4: Z ← ∅
5: for v ∈ V do
6:    if e′[v] ≠ 0 and (e′[v] ≤ ℓ or e′[v] ≥ r) then
7:       f[e′[v]] ← f[e′[v]] + 1
8:    else
9:       Z ← Z ∪ {v}
10:   end if
11: end for
12: for i ← 1 to n · q do
13:   v ← RANDOMFROM(Z)
14:   e[v] ← ECCENTRICITY(v)
15:   f[e[v]] ← f[e[v]] + (1/q)
16:   Z ← Z − {v}
17: end for
18: return f

To get a first idea of how the hybrid algorithm works, we look at Figure 3.9 of the CA-HEPPH graph with radius R = 7 and diameter D = 13. The figure shows for each eccentricity value on the left axis the number of exact iterations to compute all eccentricities of that value (just like Figure 3.7) and the relative standard deviation when sampling (just like Figure 3.8) on the right axis. Clearly, if we exactly compute the eccentricity values 7, 12 and 13, and use sampling within the window [8; 11], then the region below the horizontal line in Figure 3.9 indicates how we obtain the eccentricity distribution of the CA-HEPPH graph with very low errors, while not using many exact BFSes for the extreme values. Note that we could also have chosen [8; 12] as a sampling window, as the sampling error for an eccentricity value of 12 is still rather low. We will further analyze the performance of the hybrid approach using a larger set of graphs in Section 3.6.
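The sampling phase of Algorithm 3.4 (lines 5–17) can be sketched in C++ as follows. The sketch assumes that the exact phase has filled exactEcc with e′(v) for nodes outside the window [ℓ; r] and with 0 elsewhere, and that an eccentricity callback performing one BFS is supplied; the essential step is that every sampled eccentricity value contributes 1/q to the frequency count, so that the sampled part of the distribution is scaled up to the full node set.

```cpp
#include <algorithm>
#include <functional>
#include <map>
#include <random>
#include <vector>

// Sketch of the sampling phase of the hybrid algorithm. exactEcc[v] holds the
// exact eccentricity of v if it lies outside the sampling window [l, r] and 0
// otherwise; ecc(v) performs one BFS-based eccentricity computation; q is the
// sampling rate. Frequencies of sampled values are scaled by 1/q.
std::map<int, double> hybridDistribution(const std::vector<int>& exactEcc,
                                         int l, int r, double q,
                                         const std::function<int(int)>& ecc) {
    std::map<int, double> freq;
    std::vector<int> unknown;                        // the set Z of Algorithm 3.4
    for (int v = 0; v < static_cast<int>(exactEcc.size()); v++) {
        if (exactEcc[v] != 0 && (exactEcc[v] <= l || exactEcc[v] >= r))
            freq[exactEcc[v]] += 1.0;                // exact part of the distribution
        else
            unknown.push_back(v);
    }
    std::mt19937 rng(42);
    std::shuffle(unknown.begin(), unknown.end(), rng);
    std::size_t samples = static_cast<std::size_t>(q * exactEcc.size());
    for (std::size_t i = 0; i < samples && i < unknown.size(); i++)
        freq[ecc(unknown[i])] += 1.0 / q;            // scale sampled frequencies
    return freq;
}
```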

3.5.3 Neighborhood approximation

Algorithms for efficiently computing the neighborhood function have been introduced in [18, 68, 109]. These algorithms can be adjusted as follows to also approximate the eccentricities. For an integer h > 0, the normalized size of the neighborhood Nh(u) of a node u can be defined as:

Nh(u) = (1 / (n − 1)) · |{v ∈ V | 0 < d(u, v) ≤ h}|

Figure 3.9: Number of exact iterations and sampling relative standard deviation for different eccentricity values of the CA-HEPPH dataset.

If we determine the value of Nh(u) for increasing values of h, then the eccentricity e(u) is the smallest h such that the approximated value of Nh(u) is sufficiently close to 1 or does not change in successive iterations.

We have used the Approximate Neighborhood Function (ANF) [109] algorithm by Palmer et al. in order to approximate Nh(u) (using the C source code available at the author's website). Figure 3.10 shows the actual relative eccentricity distribution of the ENRON dataset, as well as the distribution that was approximated using the ANF-based approach. As ANF gives an approximate result, we averaged the result of 100 runs. We note that although the distribution shapes are very similar, ANF clearly underestimates the eccentricity values. We furthermore mention that the obtained distribution is clearly an approximation: a real eccentricity distribution would never have eccentricity values for which the smallest and largest value differ by more than a factor of 2, as the diameter of a connected graph is at most twice its radius. Similar results were observed for other datasets, which leads us to believe that these approximation algorithms are not suitable for determining the eccentricity distribution, especially because they fail at assessing the width and the extreme values of the distribution.

Figure 3.10: Exact relative eccentricity distribution of the ENRON dataset and an approximation using ANF.

3.6 Experiments

In this section we will use a number of large real-world datasets to assess the performance of both the exact algorithm from Section 3.4 and the hybrid algorithm from Section 3.5. The performance of the algorithms is measured by comparing the number of iterations (actual eccentricity computations by means of a BFS) and is therefore independent of the hardware and software used to implement and run the algorithm.

3.6.1 Datasets

We use a variety of graph datasets, of which the name and source are listed in the first column of Table 3.2. The set of graphs covers a broad range of real-world graphs:

• Citation networks (CIT-HEPTH, CIT-HEPPH)
• Scientific collaboration networks (CA-HEPTH, CA-HEPPH, CA-CONDMAT and DBLP20080824)
• Communication networks (ENRON, SLASHDOT)
• Peer-to-peer networks (P2P-GNUTELLA)
• Protein-protein interaction networks (YEAST, DIP20090126)
• Router topology networks (ITDK0304-RLINKS, AS-SKITTER)
• Social networks (FACEBOOK, EPINIONS, SOC-SLASHDOT, FLICKR)
• Webgraphs (WEB-STANFORD, WEB-NOTREDAME, EU-2005)

We will only consider the largest connected component of each graph, which always comprises the vast majority of the original graph. We also mention that some directed graphs have been interpreted as if they were undirected, and that self-edges and parallel edges are ignored. These factors may cause minor differences between the number of nodes and edges that we present here and the numbers presented in the source papers. In Table 3.2 we also present for each dataset the exact average eccentricity e, radius R, diameter D and center and periphery sizes |C| and |P|. For a more detailed description of these graphs, we refer the reader to the source papers in which the datasets were introduced.

Looking at the exact eccentricity distributions that we were able to compute for each of the graphs, we can conclude that the distribution is not perfectly Gaussian as [97] suggests, but does appear to be unimodal. The distribution appears to be somewhat "tailed": a positive skew is noticeable in Figure 3.1, which shows the relative eccentricity distribution of a number of graphs. The tail also becomes clear when comparing the average eccentricity value e with the radius and diameter of the graph: for every dataset, the average eccentricity is closer to the radius than to the diameter. For example, for the CIT-HEPTH dataset with radius 8 and diameter 15, the average eccentricity is 10.14.

3.6.2 Exact algorithm

In order to determine the performance of the exact algorithm (Algorithm 3.2), we can count the number of shortest path computations that are needed to obtain the full distribution, and compare this value to n, the number of iterations performed by the naive algorithm (Algorithm 3.1). The number of iterations performed by the exact algorithm is displayed in the third column of Table 3.3, followed by the speedup factor compared to the naive algorithm. We note that significant speedups can be observed in terms of the number of iterations, especially for datasets for which the eccentricity distribution is wide, i.e., the difference between the radius and diameter is very large.

Dataset              Nodes n    Links m     e       R   D    |C|  |P|
YEAST [64]           1,458      3,896       13.28   11  19   48   4
CA-HEPTH [87]        8,638      49,612      12.53   10  18   74   4
CA-HEPPH [87]        11,204     235,238     9.40    7   13   12   17
DIP20090126 [121]    19,928     82,404      22.01   15  30   1    2
CA-CONDMAT [86]      21,363     182,572     10.58   8   15   6    11
CIT-HEPTH [86]       27,400     704,080     10.14   8   15   4    4
ENRON [73]           33,696     361,622     8.77    7   13   248  11
CIT-HEPPH [86]       34,401     841,612     9.18    7   14   1    2
SLASHDOT [52]        51,083     243,780     11.66   9   17   7    3
P2P-GNUTELLA [87]    62,561     295,754     8.94    7   11   55   118
FACEBOOK [103]       63,392     1,633,772   9.96    8   15   168  7
EPINIONS [114]       75,877     811,476     9.74    8   15   614  6
SOC-SLASHDOT [88]    82,168     1,086,761   8.91    7   13   484  3
ITDK0304 [121]       190,914    1,215,220   17.09   14  26   155  7
WEB-STANFORD [88]    255,265    3,883,852   106.49  82  164  1    3
WEB-NOTREDAME [12]   325,729    2,180,216   27.76   23  46   12   172
DBLP20080824 [121]   511,163    3,742,140   14.79   12  22   72   9
EU-2005 [121]        862,664    32,276,936  14.03   11  21   3    4
FLICKR [103]         1,624,992  30,953,670  15.03   12  24   17   3
AS-SKITTER [86]      1,694,616  22,188,418  21.22   16  31   5    2

Table 3.2: Number of nodes, links, average eccentricity, radius, diameter, center size and periphery size of the various graph datasets.

We believe that this is due to the fact that the nodes that form the periphery of the network are sufficiently eccentric to have a large influence on the more central nodes, so that their lower and upper eccentricity bounds can very quickly be fixed by the bounding technique. We mention that the performance of the exact algorithm does not appear to be directly related to the number of nodes.

The reduction in terms of the number of nodes as a result of the proposed pruning strategy (see Section 3.4.3) is displayed in the column labeled "Pruned", and varies per dataset, yet is sometimes very significant (anywhere between 1% and 34%). To get a better idea of the performance of the algorithm, we propose to look at the quality of both the lower and upper bounds as the exact algorithm iterates over the candidate nodes. For this we define the measure of bound accuracy as the percentage of bound values that are correct at that iteration:

                           Exact algorithm      Hybrid algorithm
Dataset         Pruned     Iter.    Speedup     Exact   Sampled  Total   Speedup
YEAST           399        213      8.7         104     483      587     3.1
CA-HEPTH        351        1,055    8.2         150     350      500     17.3
CA-HEPPH        282        1,588    7.1         57      264      321     34.9
DIP20090126     3,032      224      89.0        8       1,321    1,329   15.0
CA-CONDMAT      353        3,339    6.4         73      388      461     46.3
CIT-HEPTH       140        8,104    3.4         57      444      501     54.7
ENRON           8,715      678      49.7        536     145      681     49.5
CIT-HEPPH       150        10,498   3.3         37      271      308     112
SLASHDOT        19,255     31       1,648       24      180      204     250
P2P-GNUTELLA    16,413     21,109   3.0         8,575   177      8,752   7.1
FACEBOOK        1,075      11,185   5.7         780     168      948     66.9
EPINIONS        20,779     4,302    17.6        1,308   75       1,383   54.9
SOC-SLASHDOT    14,848     1,460    56.3        990     156      1,146   71.7
ITDK0304        16,434     10,830   17.6        312     960      1,272   150
WEB-STANFORD    10,350     9        28,363      8       1,198    1,206   212
WEB-NOTREDAME   141,178    143      2,277       94      1,381    1,475   4,528
DBLP20080824    22,579     42,273   12.1        150     355      505     1,012
EU-2005         26,507     59,751   14.4        71      1,630    1,701   507
FLICKR          553,242    4,810    338         200     1,618    1,818   932
AS-SKITTER      114,803    42,996   39.4        14      308      322     5,502

Table 3.3: Performance of the exact algorithm (Algorithm 3.2) and the hybrid algorithm (Algorithm 3.4).

\text{bound accuracy} = \frac{1}{n} \, |\{ v \in V \mid e_{\mathrm{real}}(v) = e_{\mathrm{bound}}(v) \}|

Here, e_real(v) is the actual eccentricity value, and e_bound(v) is the (lower or upper) bound that we investigate. Figure 3.11 shows the lower and upper bound accuracy for a number of datasets. We can observe that for each of the datasets, after just a few iterations, the lower bound gives a very accurate indication of the actual eccentricity values. Apparently, the majority of the iterations are spent lowering the upper bound, whereas the lower bound quickly reflects the actual eccentricity distribution. Thus, in order to get an online estimate of the distribution, we could choose to consider only the lower bound, and obtain an accuracy of around 90% after a handful of iterations.

3.6.3 Hybrid algorithm

In the second set of experiments, we have evaluated the performance of the hybrid approach that incorporates sampling for the non-extreme values of the distribution (Algorithm 3.4). We will verify the quality of the obtained distribution by using the measure of distribution error, defined as the sum of the absolute differences between the eccentricity counts of the real relative eccentricity distribution F(x) and the estimated distribution F̂(x):

Figure 3.11: Bound accuracy vs. number of iterations for Algorithm 3.2 (shown for the ENRON, CA-HEPPH, CIT-HEPTH and ITDK0304 datasets).

\text{distribution error} = \sum_{x=R}^{D} \left| F(x) - \hat{F}(x) \right|

Here a distribution error value of 0 indicates a perfect match between the actual and estimated relative eccentricity distributions. So for the hybrid approach, the error outside the sampling window is always equal to 0. Note that as a result of the exact approach, we have access to the real eccentricity distribution to make this error comparison. Therefore we can investigate the number of iterations that are needed to obtain an error of 0.05 (or lower) in order to determine how we should set the sampling rate q.

The fifth column of Table 3.3 shows the number of iterations needed to (exactly) compute the extreme values, which together with the column titled "Sampled" (averaged over 10 runs) results in the value denoted in column "Total", which shows the total number of iterations needed to obtain an error of 0.05 or lower. Apparently, for large graphs (over 100,000 nodes), a sampling rate of q = 0.10 is more than sufficient to accurately derive the eccentricity distribution. An alternative would be to change the number of samples to a constant number, which would ensure that the number of eccentricity computations does not exceed some fixed maximum.

An advantage of the exact approach is obviously that the nodes that have a certain eccentricity value can be pointed out exactly, whereas this is only possible for the extreme values in case of the hybrid approach. The hybrid approach does however improve upon the exact algorithm in terms of computation time (number of iterations) in many cases, especially for the larger graphs (over 100,000 nodes) with relatively tight eccentricity distributions, as can be seen in the rightmost column of Table 3.3. A bold value indicates the largest of the two speedups when comparing the hybrid approach to the exact approach. The exact approach performs better than the hybrid approach in a few cases, which we believe is due to the fact that the sampling phase should not use too large a window when the number of nodes is very small, because then a large number of iterations would be needed to obtain a representative sample of the distribution. This is especially the case for WEB-STANFORD, which has a very long tail. For this dataset, the exact algorithm performs really well, whereas the hybrid approach needs quite a few more iterations. It would be possible to determine a priori whether the exact or the hybrid algorithm should be used, as after a few BFSes the expected width of the eccentricity distribution is already known. We mention that, analogously to the performance of the exact algorithm, the performance of the hybrid algorithm does not appear to be correlated with the number of nodes.

From the experiments in this section we can generally conclude that we have developed a flexible approach for determining the eccentricity distribution of a large graph with high accuracy, while guaranteeing an exact result on the extreme values of the distribution.
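
As a minimal illustration of this error measure (the variable names and the dictionary-based representation are our own), the following snippet compares an estimated eccentricity distribution against the exact one:

def distribution_error(exact_counts, estimated_counts, n):
    # exact_counts / estimated_counts: dict mapping eccentricity value -> (estimated)
    # number of nodes; dividing by n turns the counts into relative frequencies.
    values = set(exact_counts) | set(estimated_counts)
    return sum(abs(exact_counts.get(x, 0) - estimated_counts.get(x, 0)) / n
               for x in values)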

3.7 Conclusion

In this chapter we have studied means of computing the eccentricity of all nodes in a graph, resulting in the (relative) eccentricity distribution of the graph. It turns out that the eccentricity distribution of real-world graphs has a unimodal shape and tends to have a positive tail. We have shown how large speedups compared to the traditional method can be achieved by considering lower and upper bounds on the eccentricity, and by applying a pruning strategy. Even when the exact algorithm does not immediately give a satisfying result, the lower bounds proposed in this chapter can serve as a reliable online estimate of the distribution as a whole.

We have also investigated the use of sampling as a means of computing the eccentricity distribution. The resulting hybrid algorithm uses the exact approach to derive the extreme values of the distribution and a sampling technique within a specific sampling window to accurately assess the values in between the extreme values. This results in an overall approach that is able to efficiently determine the eccentricity distribution with only a fraction of the number of computations required by the naive approach.

In future work we would like to investigate the extent to which eccentricity and other centrality measures are related, and when the eccentricity can be used as a meaningful centrality measure. We furthermore found that the radius, center and periphery of the graph can be derived using the exact algorithm proposed in this chapter. The question remains whether or not these measures can be computed more efficiently if the right stopping criterion is used, which we will discuss in Chapter 4. It may also be interesting to look at how the eccentricity distribution of a graph that is evolving through the addition and deletion of nodes and edges can be efficiently computed and monitored.

4

A Bounding Framework for Computing Extreme Graph Measures

In this chapter, the two algorithms for computing the exact diameter and eccentricity distribution introduced in Chapter 2 and Chapter 3 are combined and generalized so that the resulting framework can not only be used to efficiently compute the diameter and eccentricity values, but also to efficiently compute the radius, center and periphery of a graph. We will report on a number of experiments on a large set of 75 real-world graphs as well as a synthetic graph, and attempt to link the performance of the proposed framework to the various properties of the considered graphs.

4.1 Introduction

Graphs have all kinds of interesting metrics that are based on the distance between nodes. This chapter specifically considers so-called extreme distance measures that are based on node eccentricity (the longest shortest path length from a particular node). The focus is on a unified framework for computing the exact diameter (see Chapter 2), radius, center (size) and periphery (size) (see Chapter 3) of a graph. Obviously, if we know the full distance matrix of the graph, then we can immediately derive distance extrema by simply determining and counting the lowest and highest values in the matrix. However, graphs that are nowadays studied are typically very large, and exact computation of the full distance matrix by using an All Pairs Shortest Path (APSP) like approach is impossible due to time and memory limitations.

An example of an application of extreme distance measures can be found in the well-known concept of "six degrees of separation", which is sometimes interpreted as every other person in the world (a large social network of people) being at most six handshakes away from each other. Some other definitions conveniently relax this statement to six handshakes on average, as the average node-to-node distance in many social networks is often somewhere between 4 and 6. However, if we consider the first definition, then every node should have an eccentricity value of at most six, and a node's degree of separation is equal to its eccentricity. For example, the degrees of separation of various movie stars in a network in which nodes represent actors and edges connect two actors that played together in a movie, are presented in [21].

Our framework for computing different extreme distance measures is based on the diameter algorithm from Chapter 2 and the exact eccentricity algorithm discussed in Chapter 3. Although the eccentricity algorithm is in theory able to derive each of the other extreme distance measures (including the diameter), there are some significant differences. Specifically, the stopping condition, the policy of when a node is discarded as a candidate, the bound updating rules and the extent to which the pruning strategy is helpful are different for each of the considered measures. These factors also influence the performance: the number of iterations required to determine the radius of the graph is significantly less when we specifically tune the algorithm to find the radius, as compared to when we would simply discover it as a by-product of the eccentricity distribution.

For an overview of more applications of extreme distance measures, the advantages of using an exact instead of an approximation algorithm, best and worst case complexity, an example run of the bounding algorithms, the proposed pruning strategy, node selection strategies and an overview of related work, we refer the reader to Chapter 2 and Chapter 3. This chapter contributes to the previous two chapters by providing a larger number of experiments on 75 different real-world graphs consisting of up to 60 million nodes, as well as an analysis of how the algorithm's performance is related to specific properties of the graph. The strongest observed relation is then verified in a more confined experiment on a synthetic graph.

The rest of this chapter is organized as follows. We formally define the different extreme distance measures that we consider in this chapter in Section 4.2. Next, Section 4.3 explains the proposed unified bounding framework. In Section 4.4 and Section 4.5, a number of experiments are performed on a large set of real-world graphs and on a synthetic graph, respectively. Section 4.6 concludes the chapter.

4.2 Definitions

Given an unweighted graph G = (V, E) with n = |V| nodes and m = |E| edges, the distance d(u, v) between two nodes u, v ∈ V is the minimum number of edges that connects nodes u and v. We assume that the graph is undirected, meaning that (u, v) ∈ E iff (v, u) ∈ E. So, each undirected edge is included in the set of edges twice (once in each direction). The assumption is that G is connected, i.e., d(u, v) is finite for all nodes u and v. Recall that we have the following definitions:

• the eccentricity e(v) of a node v is the length of a longest shortest path starting at node v, i.e., e(v) = max_{w ∈ V} d(v, w).

• the diameter D(G) is the length of a longest shortest path over all nodes, defined as max_{v ∈ V} e(v).

• the radius R(G) is the minimum over all nodes of the longest shortest path length, i.e., the minimum eccentricity over all nodes: min_{v ∈ V} e(v).

• the periphery P(G) is the set of nodes for which the eccentricity is equal to the diameter, i.e., {v ∈ V | e(v) = D(G)}.

• the center C(G) is the set of nodes for which the eccentricity is equal to the radius, i.e., {v ∈ V | e(v) = R(G)}.

The eccentricity of a node v can be determined by performing one Breadth First Search (BFS) rooted in node v, using O(m) time. A naive algorithm for the four measures does this for each of the n nodes, costing a total of O(mn) time, essentially performing an All Pairs Shortest Path (APSP) run. An extension to weighted graphs is trivial, as the distance function can take edge weights into account. We do not consider directed graphs, but do mention that in the directed case we can either compute the desired measure only for the strongly connected component, or apply the notion of forward and backward eccentricity as described in [99].
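
For concreteness, a minimal Python sketch of this naive O(mn) approach is given below (the adjacency-dictionary representation is our own assumption); the bounding framework proposed in this chapter avoids the vast majority of these BFS runs.

from collections import deque

def extreme_measures_naive(graph):
    # graph: dict mapping node -> iterable of neighbors (undirected, connected).
    ecc = {}
    for v in graph:                       # one BFS per node: O(mn) in total
        dist = {v: 0}
        queue = deque([v])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        ecc[v] = max(dist.values())       # e(v): longest shortest path from v
    R, D = min(ecc.values()), max(ecc.values())
    center = [v for v, e in ecc.items() if e == R]
    periphery = [v for v, e in ecc.items() if e == D]
    return R, D, center, periphery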

4.3 Framework

This section describes the proposed algorithm for computing each of the four measures described in the previous section. First, some bounds on the eccentricity are discussed, after which the general framework is outlined. Each of the measures that we compute uses this framework in a different way. A simple implementation can be found at http://www.liacs.nl/~ftakes/bounding.

4.3.1 Bounds

The following bounds, in case of the diameter D(G) also discussed in Section 2.4.1, are important for efficiently determining extreme distance measures.

4.3.1.1 Diameter and radius bounds

Given two lists of length n containing derived lower and upper bounds eℓ(v) and eu(v) on the eccentricity of each node v in the graph (cf. Section 2.4.1), we say that:

• For the diameter D(G) it holds that Dℓ(G) ≤ D(G) ≤ Du(G) with:

  – The diameter lower bound Dℓ(G) = max_{v ∈ V} eℓ(v).

  – The diameter upper bound Du(G) = max_{v ∈ V} eu(v).

• For the radius R(G) it holds that Rℓ(G) ≤ R(G) ≤ Ru(G) with:

  – The radius lower bound Rℓ(G) = min_{v ∈ V} eℓ(v).

  – The radius upper bound Ru(G) = min_{v ∈ V} eu(v).

4.3.1.2 Usefulness

The proposed framework keeps a set of candidate nodes that can contribute to computing the considered metric, and the following expressions precisely define this usefulness:

• For the diameter D(G), a node v is useful if:

  – eu(v) > Dℓ(G): v can potentially increase the diameter lower bound.

  – 2 · eℓ(v) < Du(G): v can potentially decrease the diameter upper bound.

• For the radius R(G), a node v is useful if:

  – eℓ(v) < Ru(G): v can potentially decrease the radius upper bound.

  – ⌈eu(v)/2⌉ > Rℓ(G): v can potentially increase the radius lower bound.

• For the periphery P(G), a node v is useful if:

  – eu(v) ≥ Dℓ(G): v can potentially realize (or increase) the diameter (lower bound) and thus belong to the periphery.

  – 2 · eℓ(v) ≤ Du(G): v can potentially decrease the diameter upper bound. This can be discarded when the diameter has been found.

• For the center C(G), a node v is useful if:

  – eℓ(v) ≤ Ru(G): v can potentially realize (or decrease) the radius (upper bound) and thus belong to the center.

  – ⌈eu(v)/2⌉ > Rℓ(G): v can potentially increase the radius lower bound. This can be discarded when the radius has been found.

If the eccentricity bounds of a node are such that they cannot contribute to the current metric, then the eccentricity of this node does not have to be computed to determine the metric, meaning that it can be removed from the candidate set. This is the basis of our algorithm.

4.3.2 Algorithm

The main algorithm of our framework is outlined in Algorithm 4.1. Lines 3–6 initialize some bound values as well as the candidate set W, which is used in the stopping criterion (line 7) of the main loop of the algorithm. The SELECTFROM-function (line 8) selects a node from the candidate set for which the eccentricity is going to be determined. In Section 2.4.4, a set of experiments is performed to see which selection method works best. Indeed, a good selection method is crucial for quick convergence of the lower and upper bounds and thus for quickly finding the value of the considered metric. Generally, it turns out that the best performance is obtained when repeatedly selecting the node with the smallest lower bound and then the one with the largest upper bound. Any ties are broken by taking the node with the highest degree, as this node has the most potential to influence the eccentricity of neighboring nodes. The ECCENTRICITY-function (line 9) computes the eccentricity of one node (so, it performs one full BFS from the source node). In doing so, it also conveniently finds the distance from that node to every other node. These distances are needed by the distance function d used to update the eccentricity bounds on lines 11 and 12. The maximum distance found is obviously the value of the eccentricity of the node.

Algorithm 4.1 EXTREMABOUNDING
1: Input: Graph G = (V, E), desired output measure Q
2: Output: Value of Q

3: for v ∈ V do
4:   eℓ[v] ← −∞; eu[v] ← ∞
5: end for
6: W ← V
7: while W ≠ ∅ do
8:   v ← SELECTFROM(W)
9:   e[v] ← ECCENTRICITY(v)
10:  for w ∈ W do
11:    eℓ[w] ← max(eℓ[w], e[v] − d(v, w), d(v, w))
12:    eu[w] ← min(eu[w], e[v] + d(v, w))
13:    if not ISUSEFUL(w, Q) then
14:      W ← W − {w}
15:    end if
16:  end for
17: end while
18:
19: return VALUEOF(Q)

The ISUSEFUL-function (line 13) determines whether or not a node w is useful for, on the one hand, computing the current metric Q (one of the measures discussed in Section 4.2) and, on the other hand, tightening the lower and upper bounds of the eccentricity values. Depending on the considered metric, this function behaves according to the expressions with respect to usefulness in Section 4.3.1. Note that this is the part of the algorithm that determines which metric is actually computed. So depending on the desired output metric Q, the candidate set may shrink faster or slower, determined by whether or not the remaining nodes are useful for determining the current metric.
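
A compact Python sketch of this framework, instantiated for the radius with a simplified usefulness test and without the degree-based tie-breaking, is given below. The adjacency-dictionary representation and all names are our own; both bounds are updated as on lines 11–12 even though only the lower bound drives the discard rule in this simplified instance.

from collections import deque
import math

def bfs_distances(graph, v):
    # Single BFS from v; returns the distance to every node (graph assumed connected).
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def bounding_radius(graph):
    lo = {v: -math.inf for v in graph}   # eccentricity lower bounds
    up = {v: math.inf for v in graph}    # eccentricity upper bounds (kept to mirror Algorithm 4.1)
    candidates = set(graph)
    radius_upper = math.inf
    while candidates:
        v = min(candidates, key=lambda x: lo[x])   # smallest-lower-bound selection
        dist = bfs_distances(graph, v)
        ecc_v = max(dist.values())
        radius_upper = min(radius_upper, ecc_v)
        candidates.discard(v)                      # v's eccentricity is now exact
        for w in list(candidates):
            lo[w] = max(lo[w], ecc_v - dist[w], dist[w])
            up[w] = min(up[w], ecc_v + dist[w])
            if lo[w] >= radius_upper:              # w can no longer lower the radius upper bound
                candidates.discard(w)
    return radius_upper

For the path graph {1: [2], 2: [1, 3], 3: [2]}, bounding_radius returns the radius 1 after at most two of the three possible BFS computations.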

We note that the pruning strategy outlined in Section 2.4.6 can be applied before the algorithm is run. For the radius and diameter, this will not influence the result, but for the center and periphery size some simple book-keeping has to be done to ensure a correct result.

For a partial comparison of the proposed algorithm with related work, we refer the reader to [21, 36, 99].

4.4 Experiments on real-world graphs

This section presents an extensive set of experiments on a number of graph datasets, as well as a numeric comparison of the algorithm's performance with properties of the respective graphs.

4.4.1 Datasets

For a total of 75 graph datasets, we list basic statistics such as the name of the dataset, its source (website or article), the number of nodes n, the number of links m and the average node-to-node distance d and average eccentricity e (both sampled over 1,000 random node (pairs)) in Table 4.1 and Table 4.2. In the following columns, we list the exact radius R, diameter D, center size |C| and periphery size |P| of the graph. In the last four columns of the table, the number of iterations I_R, I_D, I_C and I_P to compute each of these four respective measures is listed.

4.4.2 Results

As noted in the previous chapters, the proposed bounding algorithm gives a large improvement over the traditional APSP approach. For all 75 graphs, varying in size up to 60 million nodes and over 1.5 billion edges, the number of iterations is extremely low with respect to the number of nodes n. Note that small differences in the number of iterations compared to the results in Chapter 2 can sometimes be observed, which is due to the fact that some minor optimizations have been done in terms of whether the algorithm starts by selecting the node with the smallest lower bound or the largest upper bound.

In the beginning of this chapter, we argued that which measure is being computed significantly influences the number of iterations of the bounding algorithm. The fact that the number of iterations differs depending on the computed measure is already shown in Table 4.2, where for most datasets the number of iterations differs per measure. Furthermore, the size (and therewith the contents) of the set of candidate nodes also differs significantly, as shown in Figure 4.1, in which the number of iterations versus the number of candidate nodes at that iteration is shown for each of the four measures.

4.4.3 Correlation with graph properties

To get an idea of the link between different graph measures and the performance of the bounding algorithm, Table 4.3 gives an overview of the Pearson correlation coefficient of the different graph properties and the number of iterations to compute each of the measures.

Dataset                   n      m      d     e     R   D    |C|    |P|  I_R  I_D    I_C    I_P
YEAST [64]                1.5K   4K     7.1   13.3  11  19   48     4    7    16     65     29
PETSTER-HAMSTER [74]      1.8K   25K    3.5   9.7   7   14   2      3    3    3      4      3
CORA [122]                2.5K   10K    6.2   13.3  10  19   1      2    3    6      51     21
MUS-MUSCULUS [82]         3.7K   10K    7.1   13.5  10  20   2      2    5    5      32     13
PPI-DIP-SWISS [82]        3.8K   24K    4.1   8.1   6   12   2      4    3    3      63     8
CA-GRQC [87]              4.2K   27K    5.8   11.6  9   17   13     8    9    24     22     22
P2P-GNUTELLA08 [86]       6.3K   42K    4.6   7.2   6   9    856    55   21   473    1,758  1,931
WIKI-VOTE [85]            7.1K   201K   3.2   5.5   4   7    121    46   2    9      2,638  173
P2P-GNUTELLA09 [86]       8.1K   52K    4.8   7.5   6   10   129    9    7    179    308    1,235
CA-HEPTH [87]             8.6K   50K    5.8   12.6  10  18   74     4    6    14     78     33
P2P-GNUTELLA06 [86]       8.7K   63K    4.6   7.4   6   10   338    4    19   30     1,675  864
P2P-GNUTELLA05 [86]       8.8K   64K    4.6   7.1   6   9    824    39   12   837    2,262  2,594
PGPGIANTCOMPO [14]        11K    49K    7.8   16.5  12  24   2      3    2    2      20     3
P2P-GNUTELLA04 [86]       11K    80K    4.7   7.4   6   10   214    12   8    309    843    1,522
CA-HEPPH [87]             11K    235K   4.6   9.4   7   13   12     17   9    20     38     62
GOOGLENW [108]            16K    297K   2.5   4.7   4   7    5,267  93   2    2      5,700  36
CA-ASTROPH [87]           18K    394K   4.2   10.0  8   14   139    12   11   18     244    102
DIP20090126 [121]         20K    82K    8.2   21.9  15  30   1      2    5    5      6      5
CA-CONDMAT [86]           21K    183K   5.5   10.7  8   15   6      11   3    13     255    53
COND-MAT-95-99 [13]       22K    117K   6.1   9.2   8   12   1,306  4    181  498    3,637  1,896
P2P-GNUTELLA25 [86]       23K    109K   5.6   8.6   7   11   983    5    12   862    1,367  2,857
P2P-GNUTELLA24 [86]       26K    131K   5.4   8.3   6   11   1      9    4    8      254    328
CIT-HEPTH [86]            27K    704K   4.5   10.2  8   15   4      4    3    5      277    31
EMAIL-ENRON [73]          34K    362K   4.0   8.7   7   13   248    11   3    10     304    21
CIT-HEPPH [86]            34K    842K   4.4   9.2   7   14   1      2    9    9      18     28
P2P-GNUTELLA30 [86]       37K    177K   5.6   8.7   7   11   602    29   20   1,526  2,234  5,409
PPI-GCC [82]              37K    271K   7.2   18.3  14  27   8      4    5    6      12     29
BRIGHTKITE [32]           57K    426K   5.0   11.8  9   18   1      4    2    2      263    3
P2P-GNUTELLA31 [87]       63K    296K   6.1   9.0   7   11   55     118  12   4,865  1,017  13,167
SOC-EPINIONS1 [114]       76K    811K   4.3   9.9   8   15   614    6    2    8      6,566  20
SOC-SLASHDOT [88]         82K    1.0M   4.0   8.9   7   13   484    3    3    5      778    36
WAVE [137]                156K   2.1M   23.8  41.8  31  56   17     3    18   65     104    118
ITDK0304 [121]            191K   1.2M   6.8   17.1  14  26   155    7    6    19     619    42
GOWALLA-EDGES [32]        197K   1.9M   4.4   10.8  8   16   1      29   5    5      7      15
M14B [137]                215K   3.4M   23.3  39.2  26  51   1      45   37   51     60     244
CITESEER [20]             221K   1.0M   8.4   32.0  26  52   2      3    3    3      31     3
EMAIL-EUALL [87]          225K   680K   4.0   10.0  7   14   1      48   3    3      7      3
WEB-STANFORD [88]         255K   3.9M   6.1   107   82  164  1      3    5    5      6      6

Table 4.1: Results of the BOUNDINGEXTREMA algorithm applied to 75 large graphs.

Dataset                   n       m      d     e     R    D      |C|     |P|  I_R  I_D  I_C     I_P
AMAZON0302 [83]           262K    1.8M   8.5   24.7  19   38     1       7    5    5    5       5
COM-DBLP [144]            317K    2.1M   6.7   15.3  12   23     3       4    3    8    358     43
CNR-2000 [19]             326K    5.5M   10.5  23.7  17   34     2       3    7    7    8       7
WEB-NOTRED [12]           326K    2.2M   7.4   27.9  23   46     12      172  3    3    38      3
MATHSCINET [116]          333K    1.6M   7.6   16.1  13   24     317     9    16   20   731     53
COM-AMAZON [83]           335K    1.9M   11.6  32.8  24   47     21      5    3    7    52      14
AMAZON0312 [83]           401K    4.7M   6.5   13.8  11   20     194     14   20   20   587     77
AMAZON0601 [83]           403K    4.9M   6.6   16.7  13   25     69      25   3    28   189     21
AMAZON0505 [83]           410K    4.9M   6.5   14.8  11   22     1       11   3    3    209     9
AUTO [137]                449K    6.6M   37.9  62.9  47   82     40      4    50   276  311     338
DBLP [121]                511K    3.7M   6.2   14.8  12   22     72      9    7    59   191     82
YDATA-YSM [82]            653K    4.6M   5.9   15.7  12   24     4       3    5    5    8       5
WEB-BERKSTAN [88]         655K    13M    7.1   126   104  208    1       2    5    5    6       5
WEB-GOOGLE [88]           856K    8.6M   6.0   15.4  12   24     1       11   5    5    196     12
EU-2005 [121]             863K    32M    4.9   14.0  11   21     3       4    5    16   7       232
IMDB [82]                 880K    75M    4.1   9.4   8    14     19,751  24   8    32   20,797  276
ROADNET-PA [88]           1.09M   3.1M   325   617   402  794    2       4    11   61   16      82
YOUTUBE [103]             1.13M   6M     5.4   14.9  12   24     2       11   2    2    646     2
ROADNET-TX [88]           1.35M   3.8M   424   802   540  1,064  3       11   13   78   15      113
IN-2004 [19]              1.35M   26M    8.7   26.5  22   43     49      11   3    14   441     34
FLICKR [103]              1.62M   31M    5.2   15.0  12   24     17      3    3    3    92      4
SOC-POKEC [123]           1.63M   45M    4.7   9.2   7    14     2       3    3    3    8,724   21
AS-SKITTER [86]           1.69M   22M    4.8   21.2  16   31     5       2    4    6    9       7
ROADNET-CA [88]           1.96M   5.5M   329   664   494  865    2       4    29   178  30      211
ENWIKI-20071018           2.07M   85M    3.2   6.0   5    9      16,277  6    4    7    19,109  540
WIKI-TALK [88]            2.39M   9.3M   3.9   7.5   6    11     2,385   2    3    7    12,583  157
ORKUT [103]               3.07M   234M   4.2   7.1   5    10     2       2    13   130  1,063   212
CIT-PATENTS [86]          3.76M   33M    8.0   17.8  14   26     4       4    13   95   50      167
LIVEJOURNAL1 [144]        4.00M   69M    5.4   13.6  11   21     31      6    3    8    2,923   19
LIVEJOURNAL2 [88]         4.84M   86M    5.5   12.8  10   20     1       2    3    6    2,103   12
LIVEJOURNAL3 [103]        5.19M   97M    5.2   15.3  12   23     18      5    3    5    89      17
P2P [82]                  5.38M   284M   3.8   6.7   5    9      762     27   12   58   1,115   4,015
HYVES [128]               8.08M   912M   4.8   16.1  13   25     410     11   3    7    8,612   21
ARABIC-2005 [19]          22.6M   1.10B  7.3   29.6  24   47     3       7    3    9    46      49
WIKIPEDIA-EN [74]         25.9M   1.20B  3.7   50.0  43   85     6       3    3    4    8       5
WEB [82]                  39.3M   1.56B  7.1   19.9  17   32     10,606  50   95   78   10,673  150
FACEBOOK [116]            58.8M   184M   7.9   15.4  13   24     3,198   9    17   21   5,204   125

Table 4.2: Continuation of Table 4.1 (K = thousand, M = million, B = billion).


Figure 4.1: Number of candidate nodes (vertical axis) vs. number of iterations (horizontal axis) for computing each of the four measures for the AUTO dataset.

Here, a value close to zero indicates no significant correlation, whereas the correlation is higher as the value of the coefficient approaches 1 or −1. We will say that a numeric value greater than 0.4 (or smaller than −0.4) indicates that there is some correlation between the two variables (values in bold), whereas a value greater than 0.8 (or smaller than −0.8) indicates a definite correlation (values in bold and italics).

In Chapter 3 we have already noticed that when the difference between the radius and diameter is very large (relative to the radius and/or diameter itself), then the number of iterations is typically very low, reflected in the last row of Table 4.3 by the correlation of the number of iterations for computing the radius, diameter and periphery with (D(G) − R(G))/D(G). So, when the eccentricity distribution is less wide, it is harder to compute the various measures.

Table 4.3 furthermore shows that the number of iterations to find the center I_C is correlated with the center size |C(G)| itself. A related observation was made in [99], where it was shown that the center size can only be determined using a number of iterations which is greater than the size of the center, as the eccentricity of each of these nodes has to be computed explicitly to confirm whether or not it has an eccentricity value equal to the radius. We do note that the proposed node selection strategy is apparently able to identify center candidates very efficiently, as the difference between the center size and the number of iterations is not that large, but the difference between n and the number of iterations is.

Measure                     I_R     I_D     I_C     I_P
Nodes n                     0.188   -0.061  0.221   -0.058
Links m                     0.205   -0.060  0.244   -0.033
Average degree m/n          -0.034  -0.117  0.481   0.014
Average distance d          -0.059  -0.040  -0.059  -0.054
Density m/(n(n − 1))        -0.074  -0.048  -0.094  -0.059
Average eccentricity e      0.069   -0.030  -0.110  -0.075
Radius R(G)                 0.059   -0.028  -0.118  -0.074
Diameter D(G)               0.051   -0.033  -0.119  -0.078
Center size |C(G)|          0.155   -0.030  0.870   -0.010
Periphery size |P(G)|       0.039   0.445   0.069   0.424
(D(G) − R(G))/D(G)          -0.418  -0.509  -0.173  -0.541

Table 4.3: Correlation of different graph measures with the number of iterations of the bounding algorithm as displayed in Table 4.1 and Table 4.2.

The number of iterations to compute the center also appears to be correlated with the average degree.

The above observations regarding the number of iterations to compute the center do not hold for the eccentricity value of periphery nodes, which can also be derived using the proposed bounds, as we already showed during the example run in Section 2.4.5. The number of iterations to compute the diameter and the periphery does appear to be dependent on the size of the periphery. This can be explained by the fact that if the periphery is large, then the number of nodes with an eccentricity value equal to the diameter minus one is probably also large, meaning that it may be hard for the algorithm to select correct diameter-realizing nodes by taking a node with the largest upper bound, as there may be many such nodes.

Clearly, for each of the measures the performance does not appear to be dependent on the number of nodes, links, the average distance, density, average eccentricity or the radius or diameter itself, demonstrating the scalability of the algorithm. Finally, we mention that the variables investigated in Table 4.3 are not independent. Specifically, we note that a high average degree means that there are probably many nodes that belong to the center, explaining the correlation of I_C with both the center size and the average degree. This dependency is further investigated in the following section.

4.5 Experiment on a synthetic graph

In Section 4.4.2, we demonstrated how the number of iterations to compute the center appears to correlate with the average degree and the center size itself. To verify these correlations based on more confined experiments, we generate a number of synthetic graphs with increasing average degree.

The synthetic graph is generated as follows. First, a graph G = (V, E) is initialized with |V| = 1000 nodes and E = ∅ (so, a graph with no edges). Then, for N iterations, an undirected edge (u, v) is added to E, where u is selected at random, and v is selected with a probability proportional to its degree, reflecting the well-known preferential attachment model [104]. This results in a graph in which the average degree grows gradually, as the number of nodes in the largest component quickly stabilizes and the number of edges grows linearly.

We performed the procedure explained above for N = 75,000, resulting in a graph with 150,000 links and an average degree of 150. After 4,222 iterations, 99% of the nodes resided in the giant component, and after 6,399 iterations this value increased to 100%. At each iteration, after an edge is added, the center is re-computed using the proposed algorithm. The results are depicted in Figure 4.2.
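
A minimal Python sketch of this generation procedure is given below. Selection proportional to degree is implemented by picking a random endpoint of a uniformly random existing edge; the uniform fallback while the graph has no edges yet is our own assumption, as is the skipping of self-loops and duplicates (which may leave slightly fewer than N edges).

import random

def preferential_attachment_graph(num_nodes=1000, num_edges=75000, seed=42):
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    edges = []                          # list of undirected edges (u, v)
    adj = {v: set() for v in nodes}
    for _ in range(num_edges):
        u = rng.choice(nodes)           # first endpoint: uniformly at random
        # second endpoint: proportional to degree (random endpoint of a random edge)
        v = rng.choice(rng.choice(edges)) if edges else rng.choice(nodes)
        if u != v and v not in adj[u]:  # skip self-loops and parallel edges
            edges.append((u, v))
            adj[u].add(v)
            adj[v].add(u)
    return adj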


Figure 4.2: Center size and number of iterations to compute the center (vertical axis) vs. average degree (horizontal axis) for a synthetic graph.

First of all, we observe that indeed the center size and the number of iterations are correlated, as they exhibit a similar curve. The peaks around several low average degrees and around degrees of 4, 13 and 24 indicate a drop in the radius, resulting in a smaller center (size). This explains why the correlation between the average degree and the number of iterations is not high, but certainly notable.

4.6 Conclusion

In this chapter, the results of applying the bounding framework for computing extreme graph measures (radius, diameter, center and periphery) to a large set of 75 real-world graphs are discussed. We have identified various factors that influence the performance by looking at the correlation of the performance with different graph properties, and found the following correlations:

• The radius, diameter and periphery are harder to compute using the bounding framework when the eccentricity distribution width is relatively small ((D(G) − R(G))/D(G) < 0.5).

• The diameter and periphery become harder to compute using the bounding framework as the size of the periphery increases.

• The center is harder to compute using the bounding framework when the center itself is very large or when the average degree is very high (two variables which we believe to be dependent).

Regardless of the correlations presented above, the proposed bounding technique always improves significantly upon the traditional APSP-based method for computing these measures.

5

Adaptive Landmark Selection Strategies for Fast Shortest Path Computation

This chapter considers the task of answering shortest path queries in large real-world graphs. The traditional Breadth First Search (BFS) approach for solving this problem is too time-consuming when networks with millions of nodes and possibly billions of edges are considered. A common technique to address this issue uses a small set of landmark nodes from which the distance to all other nodes is precomputed in order to then answer arbitrary distance queries by navigating via one of the selected landmarks. The problem of finding an optimal set of landmarks has been shown to be NP-hard. This chapter investigates the graph characteristics that determine the effectiveness of a landmark selection strategy. Then, we propose a new adaptive heuristic for selecting landmarks that not only picks central nodes, but also ensures that these landmarks properly cover different areas of the graph. Experiments on a diverse set of large graphs show that the proposed selection strategy and assisting node processing technique can efficiently estimate the node-to-node distance in graphs with millions of nodes with very high accuracy, while using the same amount of precomputation time as previously proposed strategies. This chapter is based on:

• F. W. Takes and W. A. Kosters. Adaptive landmark selection strategies for fast shortest path computation in large real-world graphs. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2014), pages 27–34, 2014.

5.1 Introduction

A large part of computer science research deals with finding or computing simple paths, shortest paths or distances between objects in a dataset that can be modeled as a graph. An example of a path-related query is a request for a simple yes or no answer to the question of whether or not two nodes are connected, solving the so-called reachability problem [145] in graphs, with well-known applications in for example XML parsing and ontology querying [55]. In transportation science, the focus of solvers for so-called vehicle routing problems (VRPs, see [126]) is to service a set of nodes (customers) by finding paths (routes) that are as short as possible. An optimal route is not necessarily required, and when large graphs are considered, not even computable in polynomial time [81]. In this chapter we will focus on the problem of finding the length of a shortest path between a given pair of nodes, assuming that all nodes in the graph are connected. More specifically, we consider the task of answering distance queries in large graphs, which has been shown to be a complex and challenging task [121]. By large graphs we mean that the distance matrix cannot be stored in main memory, and algorithms with quadratic space or time complexity in the number of nodes or edges are not acceptable. Common examples of large graphs include social networks, biological networks, communication networks and webgraphs.

The problem of efficiently computing the shortest path length, i.e., the distance between two nodes in a graph, has been extensively studied [150]. In unweighted graphs, which we mainly consider in this chapter, finding a shortest path originating from a particular node can be done by performing a Breadth First Search (BFS) until the goal node is found. For a graph with n nodes and m edges, one BFS considers each edge at most once, realizing a time complexity of O(m). Doing this for each of the n nodes in the graph results in a complexity of O(mn) for determining the shortest path length between all possible pairs of nodes, essentially solving the well-known All Pairs Shortest Path (APSP) problem. Clearly, for the large graphs that are nowadays studied a traditional brute-force approach is not feasible in terms of time and memory consumption.

Because of these complexity issues, approximation and estimation techniques have been introduced, most notably methods based on so-called beacons [72] or landmarks [112] that do some precomputation in order to then answer a shortest path query very quickly. Often, a small set of landmarks (nodes) is selected, for which the actual distance to every node is precomputed. When a distance query is received, an approximation based on the distance to and from these landmarks is given, using some smart lower and upper bounds on the landmark distances, assisted by checks for trivial cases in which an exact answer can be returned. Typically, the precomputation step is orders of magnitude faster than computing all the shortest paths, the size of the landmark set is orders of magnitude smaller than the number of nodes n, and the distance query time is orders of magnitude faster than the traditional BFS query time. Ultimately, there is a trade-off between space, precomputation time, query time and accuracy [121].

Although a random set of landmark nodes for estimating the shortest path lengths already works quite well [72], it has been shown that a careful selection strategy, for example based on nodes with a high centrality value [112], or based on a tree which covers different areas of the graph [5, 65], can greatly improve the performance of the landmark method. Clearly, not every selection strategy performs well on every type of graph, and choosing the correct landmark selection strategy can greatly influence the accuracy of the method for a particular graph. The problem of selecting the perfect minimal set of landmarks has been proven to be NP-hard [112].

The main contribution of this chapter consists of a careful analysis of what makes a certain landmark selection strategy perform well. We will look at the nodes for which the error is high, and attempt to characterize these nodes by their position in the graph. Based on the obtained insights, we propose two new techniques that attempt to improve landmark selection techniques based on common centrality measures.

The rest of this chapter is structured as follows. In Section 5.2 we consider some notation and precisely formulate the problem statement and landmark approach. Next, related work is discussed in Section 5.3, after which we explain the landmark framework in detail in Section 5.4. Most notably, two new landmark strategies are proposed in Section 5.5. In Section 5.6 we perform a number of experiments to compare the suggested techniques on a set of large real-world graphs. Finally, Section 5.7 summarizes the chapter and provides suggestions for future work.

5.2 Preliminaries

In this section, basic notation is briefly discussed, a formal problem statement is given, and finally the landmark approach and its constraints are explained.

5.2.1 Notation

We consider an undirected and unweighted graph G = (V, E) with n = |V| nodes and m = |E| edges. Because the graph is undirected, each edge is included twice, so (u, v) ∈ E iff (v, u) ∈ E. Then, a path is defined as a sequence of nodes connected by edges. A shortest path between two nodes u, v ∈ V is a path consisting of a minimal number of edges that connects the two nodes. The length of this shortest path, or the distance d(u, v) between nodes u and v, is simply the number of edges in such a shortest path. The assumption is that G is connected, i.e., d(u, v) is finite for all nodes u and v. The degree deg(v) of a node v is the number of edges connected to that node. We assume that the graphs are sparse, meaning that m is much smaller than the maximum number of edges n(n − 1).

5.2.2 Problem definition

We consider the problem of accurately and efficiently estimating d(u, v) for any given pair of nodes u, v ∈ V. By accurate, we mean that the estimated value should not differ too much from the actual distance value, i.e., the error, which we will define more precisely in Section 5.6.2, has to be as low as possible. By efficient, we mean that the computational step of one distance estimation should be significantly faster than one simple BFS, which can be done in O(m). Practically speaking, it should be possible for large graphs to estimate thousands of these distance values in a matter of seconds. The computational step of a distance estimation, which we call the query time, should only iterate over the set of nodes (or a subset), meaning that it should be done in at most O(n) time. To realize a low shortest path computation time, a relatively short precomputation phase is allowed. In the precomputation phase, an algorithm typically iterates over the set of nodes and/or edges a constant number of times, for example to perform a few real BFS runs (each taking O(m) time). So for a reasonably small integer constant c > 0, the precomputation time should be restricted to cm. The same requirement holds for the space complexity: the precomputation data should take no more memory than the graph data itself.

5.2.3 Landmarks

As a precomputation step, we select a set of landmarks B ⊆ V consisting of k = |B| nodes (with k ≪ n) for which we precompute for all pairs v, w (with v ∈ B and w ∈ V) the exact value of d(v, w). Because we deal with undirected graphs, we automatically also compute d(w, v). Note that k is typically very small compared to n, and thus storing k × n distances is possible. In contrast, storing n × n distances, i.e., the full distance matrix, is not possible. Indeed, for k = n we would essentially be solving the All Pairs Shortest Path (APSP) problem which is prohibited due to time and memory constraints. The problem of selecting a good set of landmarks B from the original set of nodes V is considered in Section 5.4 and Section 5.5. When we answer a distance query, i.e., a request for finding the distance d(u, v) between two nodes u and v, we first check for the applicability of some trivial cases (assuming the graph is stored using adjacency lists) for which an exact distance can easily be derived:

• If u or v is a landmark, then we can return the exact value of d(u, v) as this value is stored for the landmark.

• If u and v are identical, then obviously d(u, v) = 0.

• If u and v are direct neighbors, which can be determined in O(log(m/n)) by searching for v in the sorted list of neighbors of u (or vice versa), then obviously d(u, v) = 1.

• If u and v are at distance 2, then we can also detect this efficiently in O(m/n) as we can iterate over the sorted lists of neighboring nodes of both u and v in search of a duplicate entry, resulting in d(u, v) = 2.

Any distance larger than 2 will have to be estimated using the landmark set by considering each of the k nodes in the precomputed set of landmarks. As observed in [112], due to the triangle inequality, the following statement holds regarding the value of d(u, v) given a set of landmarks B: max_{w ∈ B}(|d(u, w) − d(w, v)|) ≤ d(u, v) ≤ min_{w ∈ B}(d(u, w) + d(w, v)). Thus, by considering the precomputed distances for the landmarks, we can obtain a lower and upper bound L and U on the distance d(u, v) between u and v. Here, L is at least equal to 3, as otherwise we would have found the distance as one of the trivial cases. So we have:

L = \max\left( \max_{w \in B} |d(u, w) - d(w, v)|,\ 3 \right)

U = \min_{w \in B} \left( d(u, w) + d(w, v) \right)

When asked for an estimate, we can return L or U itself, the mean, geometric mean, or some other variation using the two variables. It turns out that using U as an estimate gives the lowest error rate [112].
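
A minimal sketch of the landmark-based bounds (omitting the trivial-case checks listed above; the data structures and names are our own). The per-landmark distance dictionaries are assumed to have been filled beforehand, using one BFS per landmark.

def estimate_distance(u, v, landmark_dist):
    # landmark_dist: dict mapping each landmark to its precomputed exact
    # distances to every node in the graph.
    lower, upper = 3, float("inf")
    for dist in landmark_dist.values():
        du, dv = dist[u], dist[v]
        lower = max(lower, abs(du - dv))   # triangle inequality lower bound
        upper = min(upper, du + dv)        # path via the landmark: upper bound
    return lower, upper

# Tiny example: path graph a-b-c-d with landmark b.
dists = {"b": {"a": 1, "b": 0, "c": 1, "d": 2}}
print(estimate_distance("a", "d", dists))   # (3, 3): the exact distance is 3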

5.3 Related work

The problem of exactly determining the distance between any or all pairs of nodes has been widely addressed. Initially, algorithms that do fast matrix multiplication [3] were frequently used. Such algorithms improve upon the straightforward Floyd-Warshall algorithm for solving the APSP problem, but suffer from large constants and obvious memory constraints. Other exact approaches are based on A* [51], but still have poor worst-case complexity.

Considering estimation and approximation techniques, data structures for answering distance queries were introduced under the name "distance oracles", providing some theoretical results on the accuracy of the estimation [134]. Although elegant in design, these techniques are not very useful when graphs with many low distance values are considered [72], as the actual difference between the approximated and real distance can be large, which is undesirable in large graphs with relatively low pairwise distances. This happens to be the case in many of the real-world graphs that are nowadays studied, as they usually belong to the class of so-called small-world networks [71] with very low average pairwise distances. Methods based on beacons [72] or landmarks [54, 112] were suggested as a better way of handling distance queries in this type of graph. Selecting a minimal set of landmarks such that the graph is covered, meaning that the estimate for d(u, v) is correct for all pairs u, v ∈ V, was shown to be NP-hard [112]. Heuristics for selecting an optimal set of landmarks have been suggested that select an efficient set of landmarks based on the centrality of a node, or based on a "highway" [65] or a tree decomposition [5, 47] of the graph. Various optimizations based on pruning the BFS, bitwise tricks and parallelism were introduced in [4]. Furthermore, the landmark selection method in evolving graphs with edge additions and deletions has been described in [136]. The landmark method has clearly been widely addressed, but because of the NP-hardness of the landmark selection problem, it remains a challenging method worth studying.

5.4 Landmark framework

This section describes the landmark framework, which consists of two parts: landmark selection and landmark processing. Landmark selection, considered in Section 5.4.1, deals with the problem of sorting the nodes based on their likelihood of being a good landmark. Section 5.4.2 is about landmark processing, which deals with the question of how and which of the identified landmarks should finally be used. Some smaller optimizations are discussed in Section 5.4.3.

5.4.1 Landmark selection

Landmark selection deals with the task of selecting a total of k nodes from the full set of n nodes that are going to serve as landmarks. As an improvement over a random selection of k landmarks, several landmark selection strategies based on the centrality of the nodes in the graph are suggested in [112]. The idea behind this is to compute the centrality value C(v) of all nodes v ∈ V, and then select the k most central nodes (with the highest value of C(v)) as landmarks. In this chapter we will consider the following centrality measures:

Degree centrality

C_d(v) = \frac{\deg(v)}{n - 1}

Closeness centrality

C_c(v) = \frac{1}{n - 1} \sum_{w \in V} d(v, w)

Betweenness centrality

C_b(u) = \sum_{\substack{v, w \in V \\ v \neq w,\ u \neq v,\ u \neq w}} \frac{\sigma_u(v, w)}{\sigma(v, w)}

Here, σ(v, w) is the number of shortest paths from v to w and σ_u(v, w) is the number of those shortest paths that run through node u [26]. This measure can be normalized to the interval [0; 1] by dividing it by the largest betweenness value over all nodes.

PageRank C_PR(v), which is the value of PR(v) after iteratively (usually 100 iterations is enough for convergence) and simultaneously applying:

PR(v) \leftarrow \frac{1 - d}{n} + d \sum_{w \in N'(v)} \frac{PR(w)}{\deg(w)}

for each of the nodes v ∈ V, where PR(v) is initialized to 1/n, N′(v) is the set of nodes that link to node v, and the well-known random-surfer parameter d equals 0.15 [107].

Each of the four measures is normalized to the interval [0; 1], where a higher value indicates that the node is more central according to the considered measure. Betweenness and closeness are two centrality measures that are just as hard to compute as the distance between all nodes (they require O(mn) time). Luckily both measures can be estimated by means of sampling, reducing the complexity to cm where c is the number of samples. Computing the PageRank value of all the nodes also means iterating over the set of m edges a constant number of times, resulting in a similar time complexity. Although numerous other centrality measures have been suggested in the literature, we believe that these four measures are the most common, but more importantly are of four different types. Respectively, they are based on a local property of the nodes (degree), the number of shortest paths that run through a node (betweenness), the average distance from the node to every other node (closeness) and the centrality of the node based on a propagation model (PageRank).

Intuitively, the best nodes to be selected as landmarks for finding shortest path lengths would be the nodes with the highest betweenness centrality value, as the value of this measure inherently suggests that the node is part of a large portion of all the shortest paths. However, if we consider an error measure which takes the difference between the estimated and real distance into account (see Section 5.6.2), then nodes on almost-shortest paths (e.g., realizing distance plus one) are also good, but do not necessarily have the highest betweenness value. This suggests that there might be a better landmark selection strategy than simply selecting the nodes with the highest (betweenness) centrality value.
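
As an illustration of the propagation-based measure, the following power-iteration sketch applies the PageRank update written above, with d as stated in the text; dangling nodes and convergence checks are ignored, and the adjacency-dictionary representation (which for undirected graphs makes N′(v) coincide with the neighbor list) is our own assumption.

def pagerank(graph, d=0.15, iterations=100):
    # graph: dict mapping node -> list of neighbors; every node must appear as a key.
    n = len(graph)
    pr = {v: 1.0 / n for v in graph}
    incoming = {v: [] for v in graph}          # N'(v): nodes that link to v
    for w, neighbors in graph.items():
        for v in neighbors:
            incoming[v].append(w)
    for _ in range(iterations):
        pr = {v: (1 - d) / n + d * sum(pr[w] / len(graph[w]) for w in incoming[v])
              for v in graph}
    return pr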

5.4.2 Landmark processing

When a set of nodes has been generated based on some (centrality) measure or strategy, a sorted list of nodes can be generated with the most central nodes on top. The simplest form of node selection is then to take the top k nodes (recall that k is the number of landmarks) without any further evaluation of how these nodes are positioned in the graph. In [112], a number of improvements over this processing technique are suggested. First, one could choose to select as landmarks the highest ranked nodes from the list from each partition in the graph as defined by some partitioning or clustering algorithm. Although intuitively useful, this suggested improvement did not produce a significantly lower error, but comes with an additional computation cost and is thus not considered further in this chapter.

A second suggested processing technique is to process the list from top to bottom, but skipping nodes that are at most at distance x from previously selected landmarks. The idea behind this is that a central node that is close to previously selected landmarks does not contribute equally compared to a central node further away from previously selected landmarks. The latter optimization hints towards a second pitfall in simply using the most central nodes as landmarks, namely that central nodes are often direct neighbors. Although it turned out that x = 1 performed best, sometimes this processing step gives no improvement or even increases the error.
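
A sketch of this second processing step is given below; the names are our own, and for x = 1 the distance check reduces to a neighbor test, which keeps the step cheap.

def select_landmarks(graph, ranked_nodes, k, x=1):
    # ranked_nodes: nodes sorted by decreasing centrality.
    # A candidate is skipped if it lies within distance x of a chosen landmark.
    landmarks = []
    for v in ranked_nodes:
        if len(landmarks) == k:
            break
        if all(not within_distance(graph, v, b, x) for b in landmarks):
            landmarks.append(v)
    return landmarks

def within_distance(graph, u, v, x):
    # Truncated BFS from u, exploring at most x levels.
    frontier, seen = {u}, {u}
    for _ in range(x):
        frontier = {w for f in frontier for w in graph[f] if w not in seen}
        if v in frontier:
            return True
        seen |= frontier
    return u == v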

5.4.3 Optimizations

Several optimizations that can be applied after a path based on landmarks has been derived have been suggested in [54]. Most notably, if for determining d(u, v) the concatenation of paths from node u to landmark w to node v includes the same node more than once, then the intermediary nodes can be skipped, known as cycle elimination. Furthermore, it is suggested that when two paths have been concatenated, a quick check for a shortcut can be done by determining for each node in the path whether its neighborhood contains any of the successors in the path, and if so, using this edge instead of the subpath to that successor.
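The following sketch shows one possible implementation of these two post-processing steps on a path represented as a vector of node identifiers; the helper names are ours and the code is illustrative rather than the implementation of [54].

```cpp
#include <algorithm>
#include <cstdio>
#include <unordered_set>
#include <vector>

// Cycle elimination: if a node occurs twice in the concatenated path,
// remove everything between the two occurrences.
std::vector<int> eliminateCycles(const std::vector<int>& path) {
    std::vector<int> out;
    for (int v : path) {
        auto it = std::find(out.begin(), out.end(), v);
        if (it != out.end()) out.erase(it + 1, out.end());  // cut back to first occurrence
        else out.push_back(v);
    }
    return out;
}

// Shortcutting: for each node, check whether one of its neighbors occurs
// later in the path; if so, skip the subpath in between.
std::vector<int> shortcut(const std::vector<std::vector<int>>& adj,
                          std::vector<int> path) {
    for (size_t i = 0; i + 2 < path.size(); ++i) {
        std::unordered_set<int> nb(adj[path[i]].begin(), adj[path[i]].end());
        for (size_t j = path.size() - 1; j > i + 1; --j)
            if (nb.count(path[j])) {                 // edge (path[i], path[j]) exists
                path.erase(path.begin() + i + 1, path.begin() + j);
                break;
            }
    }
    return path;
}

int main() {
    // Path with a cycle around node 2: (0,1,2,3,2,4) becomes (0,1,2,4).
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3, 4}, {2}, {2}};
    std::vector<int> p = shortcut(adj, eliminateCycles({0, 1, 2, 3, 2, 4}));
    for (int v : p) std::printf("%d ", v);  // 0 1 2 4
}
```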

We introduce an additional, seemingly obvious optimization, specifically for improving the bound value for nodes with degree 1. If node u has a degree of 1, then u always needs its direct neighbor to navigate to every other node in the graph. Thus, for this node, the lower or upper bound of the neighboring node plus 1 can be returned. Real-world graphs typically have a power law degree distribution, and nodes with degree 1 are expected to be very common.

5.4.4 Example

To get an idea of how the landmark framework as defined in Section 5.2.3, assisted by the various optimizations discussed in Section 5.4.3, can answer distance queries, we give a number of examples of such queries using the example graph shown in Figure 5.1. Note that this graph has two landmark nodes, F and P.

• d(L, J): Nodes L and J are direct neighbors ((J, L) ∈ E), so the distance is equal to 1.

• d(B, D): Nodes B and D have common neighbor C, so the distance is equal to 2.

• d(F, Q): Node F is a landmark for which all distances have been precomputed, so we can simply look up d(F, Q), which is 4.

• d(A, K): Using landmark node F, we find the upper bound d(A, K) ≤ d(A, F) + d(F, K) = 2 + 2 = 4. No further optimizations can be applied, so the (in this case correctly) assessed value is 4.

• d(E, N): Using landmark P we find upper bound d(E, N) ≤ d(E, P) + d(P, N) = 4 + 1 = 5. Via landmark F we find d(E, F) + d(F, N) = 1 + 3 = 4. The minimal upper bound, and thus the (in this case correctly) assessed path length, is 4.

Figure 5.1: Example graph with landmark nodes F and P.

• d(G, M): Using landmark node F, we find upper bound d(G, M) ≤ d(G, F) + d(F, M) = 4. Alternatively, via landmark P we would find the path (G, J, L, P, L, M) with length 5. Optimizing the first path (G, F, J, L, M) via F, we find that (J, G) ∈ E, meaning that we can take a shortcut by eliminating node F, resulting in path (G, J, L, M) with length 3. Note that after cycle elimination, the path via landmark P would also result in the shortest path (G, J, L, M).

• d(I, L): Using landmark node F, we find upper bound d(I, L) ≤ d(I, F) + d(F, L) = 4. After looking for shortcuts, the intermediary node F can be removed, resulting in an upper bound of 3. However, node I has a degree of 1, meaning that we could have just looked at node G and found that G and L have common neighbor J, resulting in a distance of 2 + 1 = 3.
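Putting the examples together, a minimal sketch of the query step could look as follows; the precomputed landmark distance table and the order of the trivial-case checks are assumptions about one possible organization of the framework, not the thesis code.

```cpp
#include <algorithm>
#include <climits>
#include <cstdio>
#include <unordered_set>
#include <vector>

// distToLandmark[i][v] holds the precomputed BFS distance from landmark i to node v.
// Returns an upper bound on d(u, v): trivial checks first, then the minimum over
// all landmarks of d(u, w) + d(w, v).
int estimateDistance(const std::vector<std::vector<int>>& adj,
                     const std::vector<std::vector<int>>& distToLandmark,
                     int u, int v) {
    if (u == v) return 0;
    std::unordered_set<int> nu(adj[u].begin(), adj[u].end());
    if (nu.count(v)) return 1;                       // direct neighbors
    for (int w : adj[v]) if (nu.count(w)) return 2;  // common neighbor
    int best = INT_MAX;
    for (const auto& d : distToLandmark)             // minimum over all landmarks
        best = std::min(best, d[u] + d[v]);
    return best;
}

int main() {
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};  // path 0-1-2-3
    std::vector<std::vector<int>> dist = {{1, 0, 1, 2}};             // landmark = node 1
    std::printf("%d\n", estimateDistance(adj, dist, 0, 3));          // prints 3
}
```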

5.5 Balancing centrality and covering

The previous section has described two problems when it comes to using the most central nodes as landmarks. First, centrality measures often do not take into account almost-perfect distances and second, most importantly, central nodes are often grouped together, not properly covering different parts of the graph. In this section, improvements for overcoming these two problems are suggested for both the landmark selection and the landmark processing step discussed in the previous section.

5.5.1 Adaptive landmark selection

The landmark selection strategy that we propose in this chapter is adaptive, meaning that the strategy improves its set of landmarks based on the error reduction of its nodes. First, let us look at a preliminary figure of the performance of different selection strategies based on centrality measures in Figure 5.2. Clearly, the percentage of correctly assessed path lengths (which we will call the success rate) increases monotonically with the number of landmarks. An important observation is that centrality measures work regardless of the size of the landmark set: landmarks selected based on centrality measures are able to realize a significantly higher success rate than landmarks that were selected at random. The same was observed for the other real-world graphs that we studied, although there was no single best-performing centrality measure.

We furthermore note that not every landmark appears to contribute evenly to increasing the success rate: some landmarks (horizontal steps in Figure 5.2 from k to k + 1 landmarks) contribute significantly more to the success rate than others. Ideally, the nodes that realize a big increase in the success rate should be ranked higher than nodes that only marginally increase the success rate. Although obviously the increase realized by a landmark node is highly dependent on previously chosen landmarks, we do expect nodes with a great incremental contribution over previously selected landmarks to give a high contribution to the performance in general. This observation is the basis for the adaptive landmark selection strategy:

1. Sort the set of nodes V based on degree centrality, resulting in a list of ranked nodes, with the most central node v having rank R(v)=1.

2. Perform the sampling phase, in which a number of BFS runs is performed, storing how many times each node v is part of one or more shortest paths.

3. Compute the value of S(R(v)), the success rate at each successive rank. Note that here the rank is equal to the potential landmark count. This means that we are generating the plot in Figure 5.2 for the particular centrality measure chosen in Step 1.

4. For rank i = 1 to n, derive for each rank i the value of ∆S(R(v)): the increase in the success rate realized by the landmark node v at rank i.

Figure 5.2: Success rate of different centrality measures as landmark selection strategies, applied to the CA-CONDMAT network.

5. Re-sort the list of nodes according to ∆S(R(v)), resulting in a list with the (node with the) highest increase in success rate on top.

6. Use the list from step 5 as input for the processing step of the landmark framework, cf. Section 5.4.2.

In step 1, we have chosen degree centrality, because this measure does not require any additional computation time. Recall that in step 2, counting the number of shortest paths is just as complex as computing the shortest path [26]. Furthermore, performing one BFS from a particular node to every other node, and then comparing the actual distances with the estimated distances, allows us to quickly sample n − 1 distance computations using only one BFS. Note that the sampling procedure described above replaces the computational step of for example betweenness centrality or closeness centrality. Because the number of samples in step 2 is equal to the number of samples needed to compute the node centrality values of some centrality measure, there is no additional computation time involved in the adaptive landmark selection technique.

The intuition behind this method is that because we first sort the list based on a centrality measure and then compute the success rate difference of each node in the list, we are taking both the centrality aspect and the covering aspect into account, as nodes that do not significantly increase the success rate apparently do not cover parts of the graph that are not already covered by other nodes. Although step 2 through step 5 could be performed in an iterative process until the error converges to a minimum, we found that one iteration always results in improvement, but more than one iteration almost never improves and sometimes even results in worse performance. The latter is likely due to the fact that after multiple iterations the influence of the centrality measure is lost. Furthermore, an iterative process requires more BFS runs and thus more computation time.
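A compact sketch of steps 3 to 5 is given below. It assumes that the sampling phase (step 2) has already recorded, for every sampled node pair, the smallest rank whose node lies on a shortest path for that pair; this representation and the helper name are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Adaptive landmark selection (steps 3-5): given the degree-ranked node list and,
// per sampled node pair, the smallest rank covering that pair, compute the
// per-rank success-rate increase Delta S and re-sort the nodes by it.
std::vector<int> adaptiveReorder(const std::vector<int>& ranked,
                                 const std::vector<int>& firstSuccessRank) {
    int n = ranked.size();
    std::vector<double> deltaS(n, 0.0);
    for (int r : firstSuccessRank)              // each sample is "won" by exactly
        if (r >= 0 && r < n)                    // the first rank that covers it
            deltaS[r] += 1.0 / firstSuccessRank.size();
    std::vector<int> order(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return deltaS[a] > deltaS[b]; });
    std::vector<int> result;
    for (int i : order) result.push_back(ranked[i]);  // node ids in new order
    return result;
}

int main() {
    // Ranked nodes 7, 3, 5 (by degree); three samples, first covered at ranks 1, 1, 0.
    std::vector<int> ranked = {7, 3, 5}, first = {1, 1, 0};
    for (int v : adaptiveReorder(ranked, first)) std::printf("%d ", v);  // 3 7 5
}
```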

5.5.2 Greedy central neighbor processing

The second contribution of this chapter is an alternative landmark processing technique called greedy central neighbor processing (in short: gcn-processing), which works as follows. When processing the list of nodes generated using some strategy or centrality measure, we select for each node in the list h times its most central neighbor, if such a better neighbor exists and if this neighbor is not already a landmark. For example, we select for each node in the node list sorted by the success rate as a result of the adaptive landmark selection from the previous section, the neighbor with the highest degree (if that degree is higher than the degree of the currently considered node). The intuition behind this method is that it solves problems with many central nodes being clustered together, not covering the rest of the graph. Note that for each node in the list, one central neighbor at most h hops away is selected, so the landmark count k remains unchanged. Furthermore, the suggested technique is greedy: it does not look at the full neighborhood (which would be computationally expensive), but merely repeatedly picks the most central neighbor.

As an example, consider Figure 5.3 and Figure 5.4 that both show a visualization of the CA-HEPPH network (for a description of this graph, see Section 5.6.1). In these two visualizations of the same graph, the size of a node is proportional to its degree, and the color of a node denotes the error (see Section 5.6.2 for a description of this error measure) observed when computing a shortest path from or to that particular node. In Figure 5.3, shortest paths were computed using k = 100 landmarks selected using degree centrality, whereas the errors in Figure 5.4 are from k = 100 landmarks that were selected using random node selection, but when processing the list, each time in a greedy way selecting the most central neighbor of the considered node (with h = 3). All low error nodes (and thus all high degree nodes) in Figure 5.3 are grouped together in one densely connected cluster. The error is much higher for the rest of the graph, with higher errors as the distance to the high degree cluster increases. On the other hand, in Figure 5.4 the error is much lower in the entire graph, demonstrating the usefulness of the greedy central neighbor processing approach in solving the problem of all central nodes residing in one cluster.

In a way, the gcn-processing technique combines two centrality measures: one measure is used in the landmark selection phase and another measure is used as a guide for greedy central neighbor landmark processing. It is essentially an alternative for taking the weighted average of two measures. In order not to increase the complexity of the precomputation step, the degree can be used as a measure for determining which neighbor has to be selected. In Section 5.6 we will have a detailed look at the performance of gcn-processing for each of the previously discussed landmark selection strategies.
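The gcn-processing step itself is small; below is a sketch using the degree as the guiding centrality, with hypothetical helper names. Nodes already chosen as landmarks are not revisited, so the landmark count stays at k (up to rare collisions, which a full implementation would have to resolve).

```cpp
#include <cstdio>
#include <vector>

// Greedy central neighbor (gcn) processing: for every node in the selection list,
// repeatedly (at most h times) move to its highest-degree neighbor if that
// neighbor is more central and not yet a landmark.
std::vector<int> gcnProcess(const std::vector<std::vector<int>>& adj,
                            const std::vector<int>& selected, int h) {
    std::vector<bool> isLandmark(adj.size(), false);
    std::vector<int> landmarks;
    for (int v : selected) {
        int cur = v;
        for (int step = 0; step < h; ++step) {       // greedy: at most h hops
            int best = cur;
            for (int w : adj[cur])
                if (!isLandmark[w] && adj[w].size() > adj[best].size())
                    best = w;
            if (best == cur) break;                   // no better neighbor found
            cur = best;
        }
        if (!isLandmark[cur]) { isLandmark[cur] = true; landmarks.push_back(cur); }
    }
    return landmarks;
}

int main() {
    // Star with center 0 plus pendant node 4 attached to leaf 1; selecting node 4
    // with h = 2 greedily walks 4 -> 1 -> 0 and lands on the hub.
    std::vector<std::vector<int>> adj = {{1, 2, 3}, {0, 4}, {0}, {0}, {1}};
    for (int v : gcnProcess(adj, {4}, 2)) std::printf("%d\n", v);  // prints 0
}
```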

5.6 Experiments

In this section we perform experiments on a large set of networks to determine the performance of the different landmark selection techniques, specifically the two new selection and processing techniques introduced in the previous section: adaptive landmark selection and greedy central neighbor processing. We start by describing the used graph datasets in Section 5.6.1 and a verification approach in Section 5.6.2, after which we discuss the results in Section 5.6.3.

Figure 5.3: Absolute error in estimated distances in the CA-HEPPH graph, from low (blue) to high (red), with 100 landmarks using degree centrality.

Figure 5.4: As Figure 5.3, but with random landmarks and gcn-processing. Visualized using ForceAtlas2 in Gephi (http://gephi.org).

5.6.1 Datasets

Table 5.1 gives an overview of the graph datasets used in this study, including for each graph a reference to the paper in which its properties are discussed in detail. The second column indicates the type of the graph, including (scientific) collaboration and citation networks, webgraphs, communication networks, social networks and electronic networks. All graphs represent real-world data, are typically sparse, and adhere to the small-world property [71]. For each graph, the number of nodes n, edges m and average node-to-node distance d (sampled over 1,000 node pairs) of the largest connected component of (the undirected version of) the graph are listed.

We performed experiments for five landmark selection strategies: random selection, betweenness centrality, PageRank, degree centrality, and the newly proposed adaptive landmark selection. For each selection strategy, we experimented with six landmark processing techniques: plain top-k selection (0), skip-1 processing as described in Section 5.4.2, and greedy central neighbor processing for h = 2, h = 3, h = 4 and h = 5, as described in Section 5.5.2. In Table 5.1, which is continued in Table 5.2, the column "gcn" indicates the lowest error as well as, between brackets, the value of h for which this error was observed. Closeness centrality never performed better than other measures such as betweenness centrality (see for example Figure 5.2), so it was left out of the result table. Furthermore, random selection with skip-1 processing and degree centrality with gcn-processing (which is based again on degree centrality) make no sense, so these respective result columns were also left out.

                                                       Random                 Degree
Dataset              Type     n      m      d         0      gcn (h)        0      skip-1
CA-HEPPH [86]        collab.  11.2K  235K   4.66      .509   .080 (3)       .137   .091
GOOGLENW [108]       web      15.7K  297K   2.46      .224   .000 (3)       .001   .003
CA-CONDMAT [86]      collab.  21.3K  182K   5.47      .551   .068 (2)       .100   .098
CIT-HEPTH [86]       cit.     27.4K  704K   4.29      .562   .040 (4)       .047   .071
ENRON [73]           comm.    33.7K  362K   4.05      .615   .013 (4)       .012   .102
SOC-SLASHDOT [88]    social   82.2K  1.09M  3.94      .764   .048 (4)       .049   .053
DBLP [144]           coll.    99.3K  1.09M  3.94      .605   .135 (4)       .113   .096
M14B [137]           elec.    100K   1.28M  52.5      1.17   .743 (3)       .270   .174
WAVE [137]           elec.    156K   2.1M   22.9      .531   .461 (2)       .164   .120
WEB-STANFORD [88]    web      255K   3.88M  7.31      .343   .007 (2)       .007   .010
WEB-GOOGLE [88]      web      856K   8.58M  6.18      .884   .149 (3)       .006   .009
WIKI-TALK [88]       comm.    2.39M  9.31M  3.91      .775   .030 (4)       .039   .088
LIVEJOURNAL1 [144]   social   4.00M  69.3M  5.39      .831   .163 (3)       .082   .079
HYVES [128]          social   8.08M  912M   4.75      .528   .038 (3)       .034   .060

Table 5.1: Performance (error, lower is better) of different landmark selection strategies on various large real-world graphs.

5.6.2 Measurement methodology

In our experiments, we have consistently used 1% of the nodes in the connected component of the graph as landmarks, with a maximum of k = 100 landmarks. As this chapter specifically considers landmark selection strategies, we do not compare our methods with other distance estimation methods. For a general comparison of the landmark framework with such methods, we refer the reader to [112]. Assessing the performance of a landmark strategy can be done by computing the error (sampled over 1,000 node pairs), defined as:

$$\frac{|d_{real} - d_{estimate}|}{d_{real}}$$

Recall that the success rate discussed in Section 5.5.1 only counted the number of times a landmark node was on a shortest path. Here, d_estimate is the estimated distance (for the real distance d_real) obtained by employing the full landmark framework as described in Section 5.4, so including checks for trivial distances and the described optimizations. Depending on the type of application that is considered, alternative error measures, such as counting the percentage of distances that differ by at most 1, could be used.

The number of iterations in the precomputation step is fixed, and the final distance result is measured using the error measure. Therefore, clock time is not a relevant performance measure, nor are the specific properties of the machine used for the experiments. However, to put the results in perspective, we do mention that one BFS using our straightforward C++ code takes about six seconds for the graph consisting of 8 million nodes listed in Table 5.1. This means that the total precomputation time for approximating betweenness centrality and the adaptive landmark selection strategy, methods which both perform a total of k = 100 BFS runs along with some book-keeping, can be done in a few minutes, even for the largest graph. The same holds for PageRank, which can be computed in roughly the same time, also using 100 iterations. Random landmark selection and degree centrality obviously do not require any precomputation time.
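For completeness, the error computation itself amounts to a few lines; the sketch below assumes the sampled real and estimated distances have already been collected.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Sampled relative error |d_real - d_estimate| / d_real over a set of node pairs.
double sampledError(const std::vector<int>& realDist,
                    const std::vector<int>& estDist) {
    double sum = 0.0;
    for (size_t i = 0; i < realDist.size(); ++i)
        sum += std::abs(realDist[i] - estDist[i]) / (double)realDist[i];
    return sum / realDist.size();
}

int main() {
    std::vector<int> real = {3, 4, 5}, est = {3, 5, 5};
    std::printf("%.3f\n", sampledError(real, est));  // (0 + 0.25 + 0) / 3 = 0.083
}
```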

5.6.3 Results and discussion

As we already demonstrated for one graph in Section 5.5, random landmark selection is clearly outperformed by centrality measures. We note that of the centrality measures, most of the time betweenness centrality has the lowest error. Degree centrality and PageRank are in most cases equally good. Because the newly introduced adaptive landmark selection technique builds upon degree centrality, we say that the new adaptive strategy is successful if it outperforms degree centrality, as otherwise the new method would not be worth the additional computation time compared to degree centrality. As Table 5.1 shows, this is the case for all 14 graphs, demonstrating the usefulness of the newly proposed adaptive selection strategy. On a number of datasets, the adaptive landmark selection strategy even outperforms betweenness centrality, a result which strengthens the claim that we made in the beginning of the chapter: nodes that are part of a large number of shortest paths do not necessarily serve as the best landmarks. The error observed when using the skip-1 optimization is diverse. Although there is a big increase in performance for the adaptive landmark selection for the WAVE graph, most of the time the error is similar or worse, as was also observed in [112].

The second contribution of this chapter is the greedy central neighbor processing strategy. Obviously, this method mainly assists a landmark selection strategy in the final processing phase. To get an idea of the contribution in terms of performance of gcn-processing, we can look at the error for a random set of landmarks as compared to a random set of landmarks processed using gcn-processing (so, the columns of Table 5.1 titled "Random"). We see that the greedy central neighbor processing technique always greatly improves upon plain random landmark selection, most notably in case of the CA-HEPPH and WIKI-TALK graphs, where the gcn-processing technique applied to a random set of nodes even outperforms degree centrality. In all other cases, random selection with gcn-processing does not improve upon degree centrality, suggesting that the selected nodes are merely a local minimum. We note that for h > 4 there was never an increase in performance compared to smaller values of h.

We also applied gcn-processing to the betweenness and PageRank selection methods, and there we also observe increases in performance. Apparently, although betweenness and PageRank both take the global aspect of the graph into account, locally some optimization using the gcn-processing technique can still be achieved. In case of the adaptive landmark strategy, the increase in performance obtained by using gcn-processing is diverse, but often relevant. For example, in case of the SOC-SLASHDOT or CA-HEPPH network, the error is actually significantly lower when gcn-processing is used.

In general we can conclude that the new landmark selection and landmark processing techniques work well for the set of real-world graphs, as shortest path lengths can be determined with an error that is consistently lower than 0.10. We note that even when the average distance in the graph is relatively high, such as for the M14B and WAVE graphs, the error remains low. Finally, we note that based on the results that we obtained, the error does not appear to be influenced by variables such as the numbers of nodes or edges or the average node-to-node distance, which demonstrates the scalability of the suggested techniques.

                     Betweenness                PageRank                   Adaptive
Dataset              0     skip-1  gcn (h)      0     skip-1  gcn (h)      0     skip-1  gcn (h)
CA-HEPPH             .045  .078    .045 (2)     .140  .090    .093 (3)     .117  .103    .080 (3)
GOOGLENW             .001  .004    .000 (3)     .001  .003    .001 (2)     .000  .003    .000 (2)
CA-CONDMAT           .044  .064    .045 (2)     .059  .066    .054 (3)     .064  .083    .056 (3)
CIT-HEPTH            .031  .059    .029 (3)     .048  .069    .046 (3)     .051  .086    .044 (3)
ENRON                .010  .009    .008 (2)     .011  .098    .009 (4)     .022  .145    .012 (3)
SOC-SLASHDOT         .081  .052    .048 (2)     .085  .046    .053 (2)     .078  .052    .032 (4)
DBLP                 .090  .109    .091 (2)     .103  .105    .099 (3)     .093  .102    .093 (3)
M14B                 .276  .116    .267 (3)     .501  .152    .377 (3)     .059  .065    .054 (2)
WAVE                 .199  .142    .194 (3)     .270  .126    .211 (3)     .121  .079    .096 (3)
WEB-STANFORD         .003  .006    .003 (2)     .005  .007    .006 (2)     .010  .008    .004 (4)
WEB-GOOGLE           .006  .019    .005 (3)     .006  .006    .005 (4)     .006  .010    .005 (4)
WIKI-TALK            .038  .122    .038 (3)     .037  .040    .037 (4)     .036  .128    .036 (2)
LIVEJOURNAL1         .067  .079    .067 (2)     .075  .082    .071 (2)     .069  .074    .071 (2)
HYVES                .045  .065    .035 (3)     .035  .055    .035 (2)     .042  .069    .029 (3)

Table 5.2: Performance (error, lower is better) of different landmark selection strategies (continuation of Table 5.1).

5.7 Conclusion

The performance of the landmark methodology for assessing shortest path lengths in large real-world graphs heavily depends on the chosen landmark selection strategy. Using various experiments we have shown that the task of selecting a good set of landmarks involves at least two aspects: selecting a set of central nodes and properly covering different areas of the graph. In order to address these two aspects, in this chapter we have compared different landmark selection strategies and introduced the adaptive landmark selection strategy and the greedy central neighbor processing technique. Experimental results on a number of real-world graphs show that, using the same amount of precomputation time, the proposed strategies outperform and improve upon previously suggested landmark selection techniques based on centrality.

The question remains whether or not it is possible to determine beforehand which landmark strategy is expected to show the best performance. In future work we would like to investigate how we can link different graph-based properties to the performance of a specific selection and processing technique in order to determine a priori a suitable landmark selection strategy. Although we have demonstrated the success of the proposed strategies on a number of real-world graphs, more research is needed to determine the worst-case performance in order to give an upper bound on the error. Furthermore, we want to see if the proposed adaptive landmark selection strategy as a whole can also be applied to the problem of determining shortest path lengths in evolving graphs that are growing and shrinking as nodes and edges are added and deleted.

6 Identifying Prominent Actors in Online Social Networks using Biased Random Walks

In this chapter, the structural properties of the friendship graph of a large online social network consisting of 8 million users and close to 1 billion links are investigated. The main focus is on characterizing the prominent users that reside within the online social network based on their position in the graph. The derived structural node properties will then be used to steer an automated classification algorithm based on biased random walks for distinguishing between prominent and regular nodes (users). The effectiveness of the proposed approach is assessed using the online social network at hand, for which it is known which nodes have been manually identified as prominent individuals. It turns out that using the proposed random walk algorithm, it is possible to efficiently identify a large portion of the prominent nodes in the network, outperforming standard web-inspired measures such as HITS and PageRank. This chapter is based on:

• F. W. Takes and W. A. Kosters. Identifying prominent actors in online social networks using biased random walks. In Proceedings of the 23rd Benelux Conference on Artificial Intelligence (BNAIC 2011), pages 215–222, 2011.

6.1 Introduction

Nowadays, online social networks (OSNs) such as Facebook, Twitter and LinkedIn are extremely popular, with numbers as high as hundreds of millions of users and billions of friendship links. The main concept of these online social networks is simple: a user creates a profile with some personal attributes and then links this profile to other users, the so-called friends, creating a very large graph of befriended users: the friendship graph. In order to better understand the structure of social networks, the friendship graph is extensively being measured, modeled and mined [2, 76, 103].

In this chapter, we consider the (former) Dutch online social network HYVES, which was online for about 9 years between 2004 and 2013. We will investigate one full snapshot of the network which was obtained in September 2010. At that time, a little over 8 million users participated in the friendship graph of the network, together forming well over 900 million friendship links. The main question discussed in this study is how we can automatically identify the most important users in this online social network, based only on the structure of the network. Being able to identify such prominent actors has various useful applications. For example, companies nowadays frequently use online social networks for their viral marketing [83] campaigns, in which they want to deliver a message to as many people as possible through the linking structure of the social network. Prominent nodes may just be the places where such a campaign should start in order to efficiently reach a large number of people [61].

A relevant question is then how prominence or importance of a node within a network should actually be defined for the network that we consider. While various definitions may be correct, we will assume that someone is prominent, or important, or influential, if he or she has some celebrity status (famous politicians, soccer players, artists, movie actors, etc.) in the real world. We believe that this definition can be justified based on the fact that both online and in the real world, celebrities have a certain status, or reputation. If a celebrity promotes a certain brand, people are far more likely to identify with that brand, compared to when a regular person would promote the brand [105]. Within the online social network Twitter, tweets originating from people like the president of the United States are far more likely to be "retweeted" than when they would come from a normal person in the network. Thus, the president could be seen as a more prominent actor in the network.

In this chapter we first study the difference in characteristics between regular and prominent nodes. We consider various existing methods of determining the importance of nodes in the friendship graph of an online social network, and then introduce a new method which is based on the characteristics of the prominent nodes that we obtained. Next, we verify the performance of the discussed techniques empirically on a large complete dataset of the online social network. For this network, we know exactly which users are considered to be prominent, allowing us to verify the obtained results against a predefined ground truth.

The rest of the chapter is structured as follows. In Section 6.2 we formally define the main problem, after which we discuss related and previous work in Section 6.3.
In Section 6.4 we discuss the characteristic properties of prominent nodes, and how these properties can be incorporated in a random walk algorithm. Next, in Section 6.5 we describe the dataset that we will use in Section 6.6 to compare the performance of the discussed methods. Section 6.7 concludes.

6.2 Preliminaries

In this section we will discuss some basic concepts, formulate the problem statement and describe the application domain.

6.2.1 Definitions

We are given a friendship graph G = (V, E), where V is the set of n = |V| nodes (individuals) and E ⊆ V × V is the set of m = |E| edges (connections). The graph is undirected, meaning that the set of edges is symmetric, so if (u, v) ∈ E then also (v, u) ∈ E. A path or walk from u to v is a sequence of edges, starting with an edge containing u and ending with an edge containing v. The distance d(u, v) between two nodes u and v is defined as the length of a shortest path between these nodes. Because the graph is undirected, d(u, v) = d(v, u) for all u, v ∈ V. A connected component is a maximal subset of the set of nodes V such that any pair of nodes within this subset is connected via one or more edges.

We define the neighborhood N(v) of a node v as the set of nodes at distance 1 of v, more specifically: N(v) = {u ∈ V | (u, v) ∈ E}. The degree of a node v is defined as the number of edges starting (or ending) at node v, i.e., the size of the neighborhood of that node: deg(v) = |N(v)|.

6.2.2 Problem statement

Amongst the nodes in the network, there is an initially unknown set W ⊆ V, of size k = |W|, which contains the nodes that are considered to be "prominent". Logically, k ≤ n, but in practice, k is much smaller than n, as only a small portion of the nodes is typically considered to be prominent. The main goal is to find, given only the graph G = (V, E), an as small as possible subset I ⊆ V such that |I ∩ W| is maximal, i.e., we are trying to find as many prominent nodes as possible.

Figure 6.1: A graph consisting of 18 nodes of which 2 nodes (F and L) are manually labeled as "prominent".

In this chapter we describe various existing, derived, and new methods for determining node importance. For each of these methods M we assign a normalized value C_M(v) ∈ [0, 1] to each node v ∈ V which determines its importance. We will assume that higher values indicate a higher level of importance. In order to determine the performance of a method M, we sort the list of nodes by their importance value C_M(v) in descending order, and define I to be the top ℓ nodes of this sorted list.

The precision and recall, |W ∩ I| / |I| and |W ∩ I| / |W|, respectively, will ultimately determine the performance of a method M. More generally, we can say that the F-measure, (2 × precision × recall)/(precision + recall), measures the balance between the two. Note that if ℓ = k, precision, recall and F-measure are equal.

An example of a network with 18 nodes of which 2 nodes (F and L) are manually labeled as "prominent" is given in Figure 6.1. If some method would determine that nodes F and J are the prominent nodes, then W ∩ I = {F, L} ∩ {F, J} = {F} and the performance of this method (in terms of both precision and recall) would be 50%.
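The evaluation described above can be sketched as follows; the integer node identifiers, the ranked list and the ground-truth set W are assumed inputs.

```cpp
#include <cstdio>
#include <unordered_set>
#include <vector>

// Precision, recall and F-measure of a ranked importance list against the
// ground-truth set W of prominent nodes, taking the top-l nodes as I.
void evaluateTopL(const std::vector<int>& rankedByImportance,
                  const std::unordered_set<int>& W, size_t l) {
    size_t hits = 0;
    for (size_t i = 0; i < l && i < rankedByImportance.size(); ++i)
        if (W.count(rankedByImportance[i])) ++hits;
    double precision = (double)hits / l;
    double recall = (double)hits / W.size();
    double f = (precision + recall > 0)
                   ? 2 * precision * recall / (precision + recall) : 0.0;
    std::printf("precision %.2f  recall %.2f  F %.2f\n", precision, recall, f);
}

int main() {
    // The example from the text: W = {F, L}, method returns F and J
    // (here the letters are mapped to arbitrary ids: F = 5, J = 9, L = 11).
    evaluateTopL({5, 9}, {5, 11}, 2);  // precision = recall = F = 0.50
}
```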

6.2.3 Online social networks

Topological properties of online social networks have been studied in great detail [103]. Social networks are usually sparse and contain one large connected component consisting of the majority of the nodes, called the giant component. Often, there are a few smaller isolated communities, as well as various singletons [2]. Furthermore, it is well-known that the structure of online social networks resembles that of real-life social networks [139]. Online social networks generally belong to the class of small-world networks [140]. Such networks are characterized by relatively small pairwise distances between nodes, i.e., the average distance between two nodes is very small (typically around four to eight) compared to the total number of nodes (easily more than one million).

Online social networks also tend to exhibit a node degree distribution that follows a power law: there are relatively few nodes with an exceptionally high degree, and many nodes with a low(er) degree. The high degree nodes function as hubs, and are often grouped in a densely connected core, realizing the short pairwise distances between the more "peripheral" nodes.

6.3 Related work

Various studies have addressed the problem of identifying the prominent actors within large (online) (social) networks. Some related work deals with finding experts, or who can be trusted, within some semantic social network [50, 148]. We distinguish ourselves from these methods by not considering semantics, but only structural properties of the nodes. Centrality measures have been popularized by social scientists as possible measures for the importance, or "prestige", of a person within a social network [139]. These measures identify nodes that have a central position based on the structure of the network. Such a central position usually means that the node is connected to many other nodes, possibly indirectly via some (short) path(s). Degree centrality is by far the simplest and most common measure, and is in case of an online social network simply equal to the number of friends of a user. As we will see later on, the number of friends is a good indication of the prominence of a node, but definitely not perfect. The complexity of computing other more complex distance-based centrality measures is in the order of O(mn) or worse [27], and they are therefore not considered in this chapter.

Propagation-based methods such as PageRank [107] are known to be very successful in determining the importance of web pages [80], citation networks [31] and Wikipedia articles [69]. Therefore, in this study we will also consider the PageRank measure C_PR as defined in Section 5.4.1 as a method for identifying the prominent actors in our online social network. We will furthermore consider HITS:

Hyperlink Induced Topic Search (HITS) C_HITS. HITS [70] is a technique for assessing node centrality which assigns a hub score h(v) and an authority score a(v) to every node v in the graph, both initialized to 1. Then, for a certain number of iterations (100 iterations is usually enough for convergence), for each node v the value of a(v) is set to the sum of the (normalized) h(u) values of the nodes u for which there exists a link (u, v), after which for each node v the value of h(v) is set to the sum of the (normalized) a(w) values of the nodes w for which there exists a link (v, w).

For the centrality measure C_HITS, we use the authority score a(v).
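A minimal sketch of this HITS iteration is shown below; the L2 normalization and the directed adjacency-list representation are assumptions for illustration, as the exact normalization used is not specified here.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Minimal HITS sketch on a directed graph: out[v] lists nodes that v links to,
// in[v] lists nodes linking to v. Scores are L2-normalized every iteration.
std::vector<double> hitsAuthority(const std::vector<std::vector<int>>& out,
                                  const std::vector<std::vector<int>>& in,
                                  int iterations = 100) {
    int n = out.size();
    std::vector<double> a(n, 1.0), h(n, 1.0);
    auto normalize = [](std::vector<double>& x) {
        double s = 0.0;
        for (double v : x) s += v * v;
        s = std::sqrt(s);
        if (s > 0) for (double& v : x) v /= s;
    };
    for (int it = 0; it < iterations; ++it) {
        for (int v = 0; v < n; ++v) {       // authority: sum of hub scores of in-links
            a[v] = 0.0;
            for (int u : in[v]) a[v] += h[u];
        }
        normalize(a);
        for (int v = 0; v < n; ++v) {       // hub: sum of authority scores of out-links
            h[v] = 0.0;
            for (int w : out[v]) h[v] += a[w];
        }
        normalize(h);
    }
    return a;                                // C_HITS uses the authority score
}

int main() {
    // 0 -> 2 and 1 -> 2: node 2 should receive the highest authority score.
    std::vector<std::vector<int>> out = {{2}, {2}, {}}, in = {{}, {}, {0, 1}};
    for (double x : hitsAuthority(out, in)) std::printf("%.3f\n", x);
}
```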

The NODERANKING algorithm was proposed in [113] as a method based on random walks for determining the importance of authors in a directed citation network. Random walk algorithms generally traverse the graph, moving to a random neighbor with probability 1 − p, and jumping to a random node with probability p. The distinguishing property of the NODERANKING algorithm is that it jumps with a probability depending on the degree of the current node, where a low degree indicates a high jumping probability. In Section 6.6.2 we will compare the approach discussed in the next section with each of the algorithms discussed above.

6.4 Prominent nodes

We will now outline the proposed approach for finding the prominent nodes in an online social network. First we sketch the expected characteristic properties of the target nodes. After that, we will describe an algorithm based on random walks which uses these node properties to guide the walk towards the prominent nodes.

6.4.1 Node properties

The simplest intuition that we have about prominent people is that they have a large number of connections. Therefore we expect the degree of a node to play a great role in determining the importance of a node. So we could state that degree centrality, determining the importance of a node v based on its number of connections, could be a good first indication of importance, formally:

$$C_d(v) = \frac{|N(v)|}{n - 1}$$

However, there may be nodes in the graph with many connections that are not prominent, or vice versa, prominent people with a smaller number of connections.

Let us recall several observations regarding social networks in general, which have been described in literature. People tend to use social networks for two reasons: social searching, and social browsing [78]. These two terms refer to reconfirming real-life friendships online, and browsing for completely new relationships, respectively. Another common concept is that of triadic closure: the vast majority of all friendships formed within a social network takes place between two people who have at least one friend in common. The probability of a link being formed has been shown to increase with the number of common acquaintances [75] as well as with the degree of a node, a phenomenon called preferential attachment [104]. For example, in the graph in Figure 6.1, the connection (A, B) would be more likely to appear than the connection (A, K), as A and B have node C as a common friend, and A and K have no direct common friends. The connection (A, E) would in turn be more likely than (A, B), as E already has a higher degree.

Based on the observations above, we expect that a user adding someone like the president of the United States is not within the circle of friends of the president, making this friendship more likely a result of the aforementioned social browsing instead of searching. More generally, we argue that the friends of prominent nodes have fewer connections in common than regular nodes. We call this concept the neighborhood density (nd) of a node:

$$C_{nd}(v) = 1 - \sum_{\substack{w \in N(v) \\ |N(w)| > 1}} \frac{|N(w) \cap N(v)|}{(|N(w)| - 1) \cdot |N(v)|}$$

Here, the numerator defines the number of common connections, whereas the denominator normalizes the result so that it is independent of the degree of v or the degree of w. If |N(v)| > 1, C_nd(v) is minimal in case N(v) is fully connected, and becomes larger as a smaller fraction of the neighborhood is interconnected. The proposed measure differs from related measures such as the clustering coefficient in the sense that this measure normalizes for both the neighborhood size of the considered node as well as the neighborhood of each of the adjacent nodes.
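A direct sketch of computing C_nd(v) from an adjacency list follows; the choice to return 0 for nodes of degree at most 1 is an assumption for illustration, as the measure is only discussed for |N(v)| > 1.

```cpp
#include <cstdio>
#include <unordered_set>
#include <vector>

// Neighborhood density C_nd(v): one minus the normalized amount of
// interconnection within the neighborhood of v, following the formula above.
double neighborhoodDensity(const std::vector<std::vector<int>>& adj, int v) {
    const auto& nv = adj[v];
    if (nv.size() <= 1) return 0.0;          // sketch choice for degree <= 1
    std::unordered_set<int> nvSet(nv.begin(), nv.end());
    double sum = 0.0;
    for (int w : nv) {
        if (adj[w].size() <= 1) continue;     // skip neighbors with |N(w)| <= 1
        int common = 0;
        for (int u : adj[w]) if (nvSet.count(u)) ++common;   // |N(w) ∩ N(v)|
        sum += common / ((adj[w].size() - 1.0) * nv.size());
    }
    return 1.0 - sum;
}

int main() {
    // Triangle 0-1-2 plus pendant node 3 attached to node 0.
    std::vector<std::vector<int>> adj = {{1, 2, 3}, {0, 2}, {0, 1}, {0}};
    std::printf("C_nd(0) = %.3f\n", neighborhoodDensity(adj, 0));  // 0.333
    std::printf("C_nd(1) = %.3f\n", neighborhoodDensity(adj, 1));  // 0.250
}
```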

6.4.2 BiasedRandomWalk

We believe that a combination of the two measures from Section 6.4.1 will be able to identify a large portion of the various prominent actors. Therefore we devised an algorithm based on random walks, which has a parameterized bias towards each of these properties. The random walk algorithm has as an advantage that it only needs local information to determine the next state of the algorithm, allowing the algorithm to run efficiently even when the entire graph cannot be stored in main memory.

The proposed algorithm, called BIASEDRANDOMWALK (BRW), takes as input an unweighted graph G = (V, E) and parameters N, p and α, and outputs a function value C_BRW(v) for each node v ∈ V in the graph, determining its importance. Here N is the number of steps in the random walk algorithm, p is the jumping probability (which we fix at 0.15 as suggested in literature [84]), and α is used to define the focus on either one of the two measures that we discussed. The procedure is outlined in Algorithm 6.1, and works as follows. After setting some initialization values in lines 3–6, the algorithm starts by selecting a random node from V (line 7). After that, for N iterations, the algorithm repeatedly increases the function value (line 9) of the current node v by 1/N (to keep the final function value within [0; 1]). Then, the algorithm either selects a new node from the neighborhood N(v) of the current node v using the function BIASSELECTFROM() with probability 1 − p (line 11), or jumps to a completely random node with probability p, denoted by the RANDOMNODEFROM() function (line 13).

Algorithm 6.1 BIASEDRANDOMWALK
 1: Input: Graph G = (V, E), N, p, α
 2: Output: List C_BRW, containing C_BRW[v] for each node v ∈ V
 3: for v ∈ V do
 4:     C_BRW[v] ← 0
 5: end for
 6: i ← 0
 7: v ← RANDOMNODEFROM(V)
 8: while (i < N) do
 9:     C_BRW[v] ← C_BRW[v] + (1/N)
10:     if (rand(0, 1) > p) then
11:         v ← BIASSELECTFROM(N(v), α)
12:     else
13:         v ← RANDOMNODEFROM(V)
14:     end if
15:     i ← i + 1
16: end while
17: return C_BRW[]

If the function BIASSELECTFROM() would select a random neighbor, the algorithm would be a plain random walk algorithm. However, in this specific case, the function BIASSELECTFROM() selects a node with a probability dependent on different function values of the prominence measures, as we expect that each of these functions tells us something about the probability of that node being prominent. This means that, given current node v, the probability P(w) of selecting node w ∈ N(v) in the next step is equal to:

$$P(w) = \frac{\alpha \, C_d(w) + (1 - \alpha) \, C_{nd}(w)}{\sum_{u \in N(v)} \left( \alpha \, C_d(u) + (1 - \alpha) \, C_{nd}(u) \right)}$$

Here, α ∈ [0, 1] defines the focus on either one of the two measures. Not surprisingly, setting the value of α to 1 resulted in roughly the same result as degree centrality. A value of 0 for α did not find any of the prominent actors, which we believe is due to the fact that even though C_nd is normalized, the degree plays a significant role in identifying the prominence of a node, and very low degree nodes can still get a high neighborhood density score. In an attempt to linearly tune parameter α with steps of 0.1, it turned out that any value between 0.2 and 0.8 gave consistently better results than a lower or higher value. Thus, apparently, both of the discussed measures to some extent influence the final result. Therefore we fixed the parameter to 0.5 to give equal focus to both measures. Finally, N, the number of iterations, should be set to a value significantly larger than the number of nodes n. We investigate the value of N more precisely in Section 6.6.2 when we look at the convergence of the BIASEDRANDOMWALK algorithm.
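A sketch of the biased neighbor selection (the BIASSELECTFROM step) using this probability is given below; the precomputed centrality vectors and the random number generator setup are illustrative assumptions.

```cpp
#include <cstdio>
#include <random>
#include <vector>

// BIASSELECTFROM sketch: pick a neighbor of the current node with probability
// proportional to alpha * C_d(w) + (1 - alpha) * C_nd(w). The centrality vectors
// cd and cnd are assumed to be precomputed (or cached) for all nodes.
int biasSelectFrom(const std::vector<int>& neighbors,
                   const std::vector<double>& cd, const std::vector<double>& cnd,
                   double alpha, std::mt19937& rng) {
    std::vector<double> weight;
    for (int w : neighbors)
        weight.push_back(alpha * cd[w] + (1.0 - alpha) * cnd[w]);
    std::discrete_distribution<int> pick(weight.begin(), weight.end());
    return neighbors[pick(rng)];   // probability P(w) as in the formula above
}

int main() {
    std::mt19937 rng(42);
    std::vector<double> cd = {0.1, 0.9, 0.3}, cnd = {0.2, 0.8, 0.4};
    int counts[3] = {0, 0, 0};
    for (int i = 0; i < 1000; ++i)
        counts[biasSelectFrom({0, 1, 2}, cd, cnd, 0.5, rng)]++;
    std::printf("%d %d %d\n", counts[0], counts[1], counts[2]);  // node 1 picked most
}
```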

6.5 Dataset

In this chapter, we consider an anonymized full snapshot of the friendship graph of the Dutch online social network HYVES from September 2010. Some statistics such as the number of nodes and edges, the average degree and the density (defined as the number of edges divided by the maximum number of edges, i.e., m/(n(n − 1))) of this graph are given in Table 6.1. Although the graph has almost 10,000 connected components (see Figure 6.2 and note the logarithmic vertical axis), the vast majority of the nodes resides within the largest connected component. According to the statistics provided on the website of the social network at the time the snapshot of the network was made, the website had over 11 million members.

Full network
  Nodes                   8,113,017
  Links                   912,120,070
  Average degree          112
  Density                 1.386 · 10^−5
  Connected components    9,926
Largest component
  Nodes                   8,083,964
  Nodes %                 99.6%
  Links                   912,067,984
  Links %                 99.99%
  Density                 1.396 · 10^−5
  Average distance        4.75
  Radius                  13
  Diameter                25

Table 6.1: Friendship graph statistics.

Figure 6.2: Component size distribution (excluding the largest component of 8,083,964 nodes).

This means that there were roughly 3 million users that were not participating in the friendship graph at all. The node degree distribution of the graph is shown in Figure 6.3. This distribution follows a clear power law, and has an even longer tail than visible, going all the way up to one node with a degree of 285,827. Note that the social network had a maximum number of friends at 1,000, 1,500 and 2,000, which could only be removed upon request with the network administrators, causing some noise in the tail of the degree distribution.

The node-to-node distance distribution shown in Figure 6.4 demonstrates how the network adheres to the small-world property (see Section 6.2.3). This distribution was obtained by sampling 100,000 node pairs u, v ∈ V, computing the value of the distance d(u, v), and then counting for each obtained distance value how frequently it was observed. This distance distribution was also used to derive the average distance of 4.75 listed in Table 6.1. The radius and diameter of the graph were computed using the algorithms described in Chapter 2 and Chapter 4.

A set of nodes W of size |W| = 4,867 (0.06%) has been manually labeled by the network administrators as "prominent". This subset consists of various Dutch politicians, artists, athletes and actors, and will be considered as a ground truth for assessing the performance of different measures of prominence. We note that all prominent nodes are part of the giant component.

Figure 6.3: Degree distribution (all nodes vs. prominent nodes).

Figure 6.4: Distance distribution.

To the best of our knowledge, the only other study of a full snapshot of the HYVES graph is provided in [35]. In this work, similar observations regarding the structural properties of the network are reported, and the distributions of various node attributes such as the age of the user are given.

6.6 Experiments

In this section we will compare the proposed algorithm to various existing approaches for determining node importance in networks. The algorithm as well as the other discussed measures have been implemented in C++. Experiments were run on a 3.2GHz machine with 10GB memory, allowing us to keep the large network dataset in memory. We start with a verification of the different node properties, after which we assess the performance of the BIASEDRANDOMWALK algorithm.

6.6.1 Node properties

We have verified the two measures discussed in Section 6.4.1 on the discussed online social network dataset (see Section 6.5 for a description of the dataset). From the degree distribution shown in Figure 6.3, we can conclude that prominent users indeed have many more friends than regular users. A simple strategy for identifying prominent users would be to say that any user with more than for example 2,000 friends is prominent. However, this would not only be incorrect because such a cut-off may be domain-specific and dependent on the type of social network, but it will also not help to identify the significant number of prominent users with anywhere between 0 and 2,000 friends. For this degree range, there is also a (much larger) number of regular users with the same degree (notice the logarithmic vertical axis of Figure 6.3).

For the neighborhood density C_nd, we computed this value for 1,000 randomly selected regular nodes and 1,000 randomly selected prominent nodes, and found values of 1 − 0.131 = 0.869 and 1 − 0.035 = 0.965, respectively (the one-minus notation is used to indicate the significant difference in the density summation of the measure of neighborhood density, see Section 6.4.1). This result is consistent with the intuition of regular users having a relatively more dense neighborhood than prominent users. Clearly, both of the properties that we discussed are related to the prominence of a user in the considered network.

6.6.2 BiasedRandomWalk

To verify the applicability of the proposed random walk algorithm, we first consider its convergence in terms of whether or not the set of identified prominent nodes becomes consistent as the random walk algorithm runs.

Figure 6.5: BIASEDRANDOMWALK convergence: number of iterations (horizontal axis) vs. precision (vertical axis) for a n = 1 million node sample of the original network.

The result is displayed in Figure 6.5 for a 1 million node sample of the original graph. Clearly, the obtained value of the precision of the algorithm converges. We experimented with various sample sizes (10,000, 100,000 and 1 million nodes) of the original 8 million node dataset, and consistently found that after N = 10 · n iterations, the precision did not show any significant improvement. Thus, for this network we conclude that the parameter N can be set to 10 · n to ensure a suitable result is obtained, which means that the running time would scale linearly with the number of nodes.

6.6.3 Results

The results of applying the various algorithms to the full friendship graph are outlined in Table 6.2. Here we compare the results of each of the methods based on k = ℓ, meaning that we select exactly as many prominent people as there are in the dataset. Recall from Section 6.4 that we thus select the top ℓ = k nodes from the list of nodes sorted by their prominence function value. The BIASEDRANDOMWALK algorithm was executed with a budget of N = 10 · n = 81 million iterations. Except for degree centrality, which is fully deterministic, results are averaged over 10 runs in order to flatten the effect of outliers due to the inherent randomness of the approaches, resulting in standard deviations of less than 3% for each of the methods.

Measure              Precision   Time
Random               0.06%       0 sec
HITS                 51.4%       10 min
PageRank             56.4%       10 min
RandomWalk           63.9%       1 min
NODERANKING          64.0%       1 min
Degree Centrality    64.2%       0 sec
BIASEDRANDOMWALK     70.1%       13 min

Table 6.2: Precision and indication of computation time of various importance measures with k = ℓ.

As a baseline for comparison we could say that if we were to select ℓ random nodes from the complete set of nodes V to form the set I, we would on average find 0.06% of the prominent nodes in the network. Degree centrality, RandomWalk and the NODERANKING algorithm have roughly equal performance, and greatly improve upon this baseline by already identifying about 64.0% correctly. It turns out that HITS and PageRank performed significantly worse, which might be due to the fact that these algorithms were at least initially designed for directed graphs. The proposed BIASEDRANDOMWALK method improves another 6 percentage points upon degree centrality, the best performing existing method, demonstrating the advantage of looking at both the degree and the neighborhood density during the random walk.

As for the running time, obviously random selection and degree centrality require no additional computation time. PageRank and HITS both iterate over the set of edges 100 times to update the node values based on their neighboring nodes, each taking roughly 10 minutes in doing so. The random walk algorithms each run for 10 · n steps, where in case of the plain random walk and the NODERANKING algorithm, no additional computation is done in each node. The BIASEDRANDOMWALK method selects a neighbor based on the bias values of the neighbors, which takes some computation time (but, assuming the graph is static, these values can be cached), running in little over 13 minutes in total.

One may argue that the number of prominent actors in a network is not always known in advance. Therefore we also did experiments in which we varied ℓ between 0.01 · k and 2 · k on the 1 million node sample of the full dataset, allowing the study of the precision, recall and F-measure curves. Assuming that we want to take a number of false positives for granted, as is often permitted in practical applications, we may choose to focus solely on maximizing the recall value. Therefore, the recall value for each of the approaches as a function of the fraction of k is presented in Figure 6.6. In Figure 6.7 we furthermore present a comparison of the different F-values. At 0.75 · k, the F-value appears optimal for BIASEDRANDOMWALK, and we would have a good balance between recall and precision. This optimum lies slightly lower, around 0.70 · k, for the measure of degree centrality (we left out the other two identically performing measures because they both performed somewhat equal to degree centrality). Finally, note that for small values of ℓ (up to 0.3 · k), the obtained result is always perfect regardless of the method considered: apparently the top of the list is the same for each of the measures. It turns out these nodes simply had an enormously high degree (over 2,000, see Figure 6.3), and were therefore selected by each of the methods.

In general, we can conclude that the proposed BIASEDRANDOMWALK method works well, and is able to identify a significant portion of the prominent actors of the friendship graph of the considered online social network.

Figure 6.6: Recall for each of the methods.

Figure 6.7: F-value for each of the methods.

6.7 Conclusion

We have outlined various characteristic node properties of prominent actors in an online social network, and used these properties to create an algorithm for identifying prominent actors. Our algorithm, called BIASEDRANDOMWALK, combines the measures of degree centrality and neighborhood density in a random walk algorithm by having a bias towards nodes with high values for these two measures. Neighborhood density can be seen as a measure of the percentage of triadic closure, which is significantly lower for prominent actors as compared to regular nodes. On the other hand, the degree of prominent nodes is typically high. Experiments show that the proposed method works quite well, as standard centrality measures such as degree centrality, HITS and PageRank are outperformed in terms of precision, recall and F-measure.

In future work we would like to verify the extent to which the proposed random walk technique can be applied to other types of (social) networks, and how we can make the method parameter-free, or determine good parameter values based on properties of the network. We also want to consider the temporal aspect of importance, and study how properties of prominent nodes within a social network change over time.

Acknowledgment

We thank HYVES for making the structure of their (anonymized) friendship graph available.

Part II: Path Traversal Patterns

7 The Difficulty of Path Traversal in an Information Network

This chapter introduces a set of measures for determining the difficulty, for a human, of traversing paths in networks. The focus is on determining which node-based and path-based structural graph properties and measures say something about the difficulty of finding a certain path between two given nodes in a graph. Using a large corpus of over two million traversed paths on the online information network Wikipedia, it is possible to demonstrate how the proposed techniques are able to accurately assess the human difficulty of finding a path between two given Wikipedia articles. The clickpaths analyzed in this chapter originate from the Wiki Game, an online game in which the main task is to connect two given random Wikipedia articles in as few clicks as possible. This chapter is based on:

• F. W. Takes and W. A. Kosters. The difficulty of path traversal in information networks. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012), pages 138–144, 2012.

7.1 Introduction

Searching and navigating through structured information such as Wikipedia, the web or a social network has become a common activity for many users. In this chapter we will analyze the way in which humans traverse structured data in search of a specific piece of information. The main goal of this study is to measure, understand and predict the difficulty of finding a path between two documents within a structured collection of information. The motivation for this work comes from the idea that understanding the difficulty of path traversal may lead to a better understanding of human search behavior in general [62], which may in turn lead to improvements in the search strategy of an artificially intelligent search algorithm. Moreover, if we understand the aspects which complicate path traversal within structured data, then this information can possibly be used to improve the structure of the linked data itself [16].

Although search engines [40] can often help to find the content within a structured dataset that the user is looking for, sometimes search engine performance does not exactly meet the needs of the user [133]. This can happen for example because the user does not know the exact keyword that describes what he is looking for, because the search query was misinterpreted by the search engine, or because the required information is not indexed and is possibly located within the so-called Deep Web [58]. The Deep Web is the part of the internet which is not accessible to search engines, for example because the content resides within a database, because the pages are dynamic based on specific properties or settings of the user, or because the content is only accessible from a limited range of machines.

When browsing for a piece of information within an information network, the user will have to reach the desired article by traversing the links that exist between the articles within the information network, forming a path towards the correct piece of information. We will study this type of path traversal by analyzing over two million paths traversed by (human) users of the well-known online encyclopedia Wikipedia (http://www.wikipedia.org). An advantage of studying paths on Wikipedia compared to for example clickstreams from the world wide web [11] is that Wikipedia contains much less "noise", referring to duplicate, false or untrusted information. The Wikipedia paths analyzed in this chapter were gathered from the Wiki Game (http://www.thewikigame.com), a free online game in which the user is asked to connect two given random articles on Wikipedia. That is, starting from a certain source article, the main objective of the user is to reach the goal article by repeatedly following the clickable links within Wikipedia articles. This chapter studies the difficulty of this particular task. If we are able to a priori determine the expected difficulty of a task, then this can be used to define multiple levels of difficulty for the Wiki Game.

Figure 7.1: Subgraph of a (fictive) Wikipedia graph.

As an example of a path traversal task which is to be solved by a player of the Wiki Game, consider the path from the Wikipedia article on MP3 to the article on Northern Ireland. An actual (computed) shortest path of length 3 runs subsequently via the articles on the United States and Ice Hockey (see Figure 7.1). Human users attempting to find a path tend to know that Northern Ireland is somewhere in Europe, so from the article on MP3 they first find their way to an article related to Europe, for example via the page on the Internet, which is a direct link from the article on MP3. Next, they will for example navigate to the article on the United Kingdom, from where they find the article on Northern Ireland. Some users take another detour on the way, for example via the page on the Republic of Ireland and the page on Ireland (island).

It turns out that humans, especially after some practice, are often able to link two given random articles on Wikipedia in less than ten clicks. This is actually a quite remarkable accomplishment, because even though a standard backtracking algorithm is certainly able to match or even beat humans in terms of path length, a human does not use millions of backtracking steps, but rather relies on background knowledge in terms of expected semantic relatedness [48] to find a path. Incorporating such extensive knowledge into an algorithm for classifying path difficulty, for example via ontologies [146], may be very complex in large information networks such as Wikipedia.

Throughout this chapter a range of node-based and path-based structural network properties and measures are proposed as indicators for the difficulty of connecting two articles. An advantage of considering structural features is that they may capture the direct relationship between the various concepts within the network, independent of which exact information network is studied. Also, structural properties are relatively easy to derive and compute, and do not require prior knowledge about the dataset. Moreover, while both the content as well as the linking structure of Wikipedia are subject to change, the classifiers that are proposed will only be affected by the second type of change, as article semantics are not considered. We will measure the quality of the proposed difficulty indicators by comparing their performance with the average human performance in terms of success or failure at completing a path.

The rest of this chapter is organized as follows. First, Section 7.2 discusses some notation, the various datasets used in this chapter and the main problem statement. We discuss related work in Section 7.3. Next we describe, analyze, test and compare the aspects which influence the difficulty of path traversal, at a node-based and a path-based scale, in Section 7.4 and 7.5, respectively. Finally, Section 7.6 concludes.

7.2 Preliminaries

In this section we discuss various concepts, definitions, notation and the considered datasets. Finally, we formulate the main problem statement and verification approach.

7.2.1 Concepts & definitions

The information network is represented by a directed graph G = (V, E) with n = |V| nodes and m = |E| links. When we talk about a path between two nodes u, v ∈ V, we mean a sequence consisting of at least two nodes, starting at u and ending at v, where there is a link from each node to the next node in the sequence. A shortest path between two nodes u, v ∈ V is a path of length ℓ ≥ 1 between u and v for which there is no other path from u to v of length smaller than ℓ. The length of such a shortest path, or in short the distance, is denoted by d(u, v). Obviously, cycles may occur in paths, but not in shortest paths. Of course, it can happen that there are no (shortest) paths (d(u, v) = ∞) or that there are multiple (shortest) paths connecting two nodes. Because the graph is directed, it can happen that d(u, v) ≠ d(v, u). We define the indegree of a node v ∈ V as the number of links pointing to node v, and similarly, the outdegree as the number of links pointing from node v to some other node.
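As a concrete illustration of these definitions, the following minimal Python sketch (not the implementation used in this thesis) stores a directed graph as adjacency lists and computes the distance d(u, v) with a breadth-first search; the example nodes and edges loosely follow the fictive subgraph of Figure 7.1 and are purely illustrative.

```python
from collections import deque

# Directed graph as adjacency lists: node -> list of successors (illustrative edges).
graph = {
    "MP3": ["Internet", "United States"],
    "Internet": ["Europe", "United States"],
    "United States": ["Ice Hockey"],
    "Europe": ["United Kingdom"],
    "United Kingdom": ["Northern Ireland"],
    "Ice Hockey": ["Northern Ireland"],
    "Northern Ireland": [],
}

def distance(graph, u, v):
    """d(u, v): length of a shortest path from u to v, or infinity if no path exists."""
    if u == v:
        return 0
    dist = {u: 0}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, []):
            if succ not in dist:
                dist[succ] = dist[node] + 1
                if succ == v:
                    return dist[succ]
                queue.append(succ)
    return float("inf")

print(distance(graph, "MP3", "Northern Ireland"))  # 3 in this toy graph
```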

7.2.2 Wikipedia

According to its own definition, "Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project with over 3.9 million articles in English alone" (as observed in 2011). Considering solely the content of the articles and the links they contain, Wikipedia can be seen as a large directed graph, where each node represents an article, and each directed link a hyperlink within the source article pointing to the target article.

Property                Value
Articles (n)            3,464,902
Directed links (m)      82,019,786
Largest WCC             99.9%
Average indegree        26
Average outdegree       22
Average distance (d)    4.8
Effective diameter      7
Diameter                11

Table 7.1: Wikipedia dataset.

In this study we will use the August 2011 English dataset of Wikipedia pagelinks from DBpedia version 3.7 (see [10] or http://dbpedia.org), from which we consider only the links to other Wikipedia articles, so we exclude links to external websites. Links to "special" articles, such as articles describing a file, the category to which an article belongs, or translations of the article, are also ignored. After some pruning and cleaning, the final Wikipedia graph that will be used for this study has statistics as presented in Table 7.1.

We note that the edge-to-node ratio, the diameter (defined as the length of a longest shortest path), the effective diameter (the 90-th percentile of the cumulative distribution of shortest path lengths), the average distance (between two nodes, sampled over 10,000 node pairs) and the size of the largest weakly connected component (WCC) are consistent with that of other small-world networks [140]. The Wikipedia network furthermore has a power-law node degree distribution [149], suggesting that the Wikipedia graph indeed resembles other frequently studied real-world networks, such as the world wide web [12], internet topology networks [46] and social networks [76].

7.2.3 The Wiki Game

The Wiki Game is an online game launched in 2009, in which the user is assigned the task of connecting two given random articles on Wikipedia. Starting from a certain source article, the main objective is to reach the goal article by repeatedly clicking links on the page of the current article. While various types of games such as "Five clicks to Jesus" and "Six degrees of Wikipedia" exist, we will solely focus on "Speed Race" games, in which the task is to connect two given random Wikipedia articles in as few steps as possible, as quickly as possible, ultimately with a time limit of 120 seconds.

Property                Value
Tasks attempted         407,268
User-generated paths    2,278,986
Failed paths            72.0%
Successful paths        28.0%

Table 7.2: The Wiki Game dataset used in Chapter 7.

The Wiki Game dataset T consists of games (or tasks) and associated user-generated paths. A task t ∈ T is essentially a (start, goal) pair (u, v) indicating between which two articles u and v (with u, v ∈ V) a path has to be formed. For each of these tasks we have a list of paths generated by the (fully anonymized) users that made an attempt at solving this task. These paths describe either a successful or a failed attempt at finding the goal article, and each have an associated path length. The data was filtered to exclude non-serious attempts (more than 40 clicks per task, or no clicks at all). This resulted in a dataset as presented in Table 7.2. Apparently, on average a task was performed by 5 to 6 users, and a little less than one third of the total set of tasks presented to the users was successfully completed.

Figure 7.2 further clarifies the shortest path lengths of the tasks in the dataset, as well as the path length of the user-generated paths. We observe that even though shortest paths of length greater than 6 exist within Wikipedia, none of these tasks were included in the database of attempted tasks. Most tasks had a shortest path length somewhere between 2 and 4. Apparently, the average distance between the two pages composing a task is lower than the average distance in the entire Wikipedia dataset, indicating a small bias towards less obscure start and goal pages in the task database. Figure 7.2 also shows how the distribution of the successful user-generated paths follows the same distribution as that of the shortest paths, but with an average path length that is roughly 2 times larger than the shortest path length (between 5 and 7), and a relatively fat tail. The distribution of the path length over all user-generated paths is clearly dominated by the failed paths, but as opposed to the successful paths, these distributions roughly follow a fat-tailed power law, indicating that when people "drop out" of the path traversal process, they frequently do this early in the traversal process.

7.2.4 Problem definition

The main goal is to assess the difficulty of finding a path between two nodes in a directed graph:

Figure 7.2: Relative frequency (vertical axis, logarithmic) of various path lengths (horizontal axis), for all user-generated paths, failed user-generated paths, successful user-generated paths, and the shortest path length over all tasks.

Given a directed graph G = (V, E) and nodes u, v ∈ V, can we assign a function value f(u, v) ∈ [0; 1] indicating the difficulty of finding a path from u to v?

In this chapter we will consider various approaches (or difficulty classifiers) of assigning such a function value. We will evaluate the quality of an approach based on a comparison with the results obtained by the users on tasks from the Wiki Game. For each of the user-generated paths of a certain task t ∈ T we know whether or not the path was successfully formed, allowing the definition of the average percentage of success g(t) ∈ [0; 1] for task t. This information will serve as a ground truth for assessing the quality of the various difficulty classifiers.

Each difficulty classifier f can assign a function value f(t) to all tasks t ∈ T, which allows us to create a partition {T_1, T_2, ..., T_q} of the set of tasks T. The partitioning is done in such a way that the tasks within each T_i have the same function value (range), so that the (average) function value of the tasks in T_i is always greater than the average function value of the tasks in T_{i−1}, and where every T_i is maximal in size. The partitions can be used to define q different difficulty levels for the Wiki Game. The overall quality of a classification measure will be determined by computing the Pearson correlation coefficient c(f, g) of the classifier f and the average percentage of success of the user-generated paths g, defined as:

c(f,g) = \frac{q \sum_i f(i)\, g(i) - \sum_i f(i) \sum_i g(i)}{\sqrt{q \sum_i f(i)^2 - \left(\sum_i f(i)\right)^2}\ \sqrt{q \sum_i g(i)^2 - \left(\sum_i g(i)\right)^2}}

Here, f(i) is equal to the average function value f(t) of paths t ∈ T_i, and g(i) is the average percentage of success of the paths in T_i. As a second measure of comparison we will also consider the Spearman rank correlation coefficient rc(f, g) of f and g, defined as:

rc(f,g) = \frac{\sum_i \left(f(i) - \bar{f}\right)\left(g(i) - \bar{g}\right)}{\sqrt{\sum_i \left(f(i) - \bar{f}\right)^2}\ \sqrt{\sum_i \left(g(i) - \bar{g}\right)^2}}

Here, \bar{f} and \bar{g} are equal to the average value over all i of f(i) and g(i), respectively. This coefficient measures the extent to which the relation between the classifier output and path difficulty can be described using a monotonic function. If we want a task at a certain difficulty level to always be harder than a task at the previous level, then we primarily aim for a high rank correlation coefficient. In general, we will call a measure f correlated with path difficulty if it has a correlation greater than 0.8 (or smaller than −0.8) with the percentage of success g. For simplicity, we will denote the correlation coefficient and rank correlation coefficient by c and rc, respectively.
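As an illustration of how these two coefficients could be computed over the q difficulty classes, the sketch below implements the Pearson formula directly and obtains the rank correlation as the Pearson correlation of the ranks; the example values for f(i) and g(i) are made up, and the tie-breaking rule in the ranking is a simplifying assumption.

```python
from math import sqrt

def pearson(f_values, g_values):
    """Correlation coefficient c(f, g) over the q difficulty classes."""
    q = len(f_values)
    sf, sg = sum(f_values), sum(g_values)
    sfg = sum(x * y for x, y in zip(f_values, g_values))
    sff = sum(x * x for x in f_values)
    sgg = sum(y * y for y in g_values)
    return (q * sfg - sf * sg) / (
        sqrt(q * sff - sf * sf) * sqrt(q * sgg - sg * sg))

def ranks(values):
    """Rank of each value (1 = smallest); ties broken by position (simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(f_values, g_values):
    """Rank correlation rc(f, g): the Pearson correlation of the ranks."""
    return pearson(ranks(f_values), ranks(g_values))

# f(i): average classifier value per class, g(i): average success percentage (made up).
f = [0.1, 0.3, 0.5, 0.7, 0.9]
g = [0.65, 0.55, 0.50, 0.42, 0.35]
print(pearson(f, g), spearman(f, g))
```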

7.3 Related work

The structure behind Wikipedia has been analyzed in great detail, addressing tasks such as improving the linking structure [102] and automatic disambiguation of articles [63]. Furthermore, Wikipedia is frequently used as a knowledge base for external knowledge discovery tasks [138], and can serve as an excellent platform for computing (semantic) relatedness of concepts [48]. Patterns within clickpaths have also been analyzed extensively [24], and have proven useful for tasks such as page prediction [1, 119]. These patterns are often found within clickstreams from the web, where there is a great deal of "noise", i.e., duplicate, false or untrusted information. On the web, information is frequently authored by one person or a very small group of people, whereas the number of participating users of Wikipedia is sufficiently large to counter spammers that spread for example false or biased information.

West and Leskovec [141] have compared human navigation in information networks such as Wikipedia with that of agents, using a dataset similar to the dataset studied in this chapter. They found that humans, when navigating within an information network, have expectations about what links should exist and base a high-level reasoning plan upon this, and then use local information to navigate through the network. They furthermore mention that humans often miss "good" link opportunities on a page as their idea of semantic relatedness often overrules opportunistic clicking. In [142], the same authors show that progress in a goal-finding task is easiest far from and close to the target, with hubs being crucial in the beginning. To the best of our knowledge, the issue of path difficulty has so far not been addressed.

7.4 Node-based difficulty measures

In this section we consider node-based difficulty measures, by which we refer to properties that can be derived solely based on a node and its neighborhood (so, local information), in this case the Wikipedia article and its linked or linking articles. The advantage of such properties is that they are relatively easy to compute, and that they do not require knowledge about the entire dataset, which can be an advantage in extremely large datasets such as the world wide web or Wikipedia.

7.4.1 Degree measures

Having a large number of outgoing links for a certain node is likely to make it easier to directly reach a larger part of the graph from that particular node. Similarly, we expect that the number of incoming links of a node will probably make it relatively easier to reach that node from any other node. We will verify the actual influence of these two measures on path difficulty by analyzing q = 100 ranges of the indegree of the goal article and the outdegree of the start article. The result is depicted in Figure 7.3, and a Bezier curve is drawn to better visualize the overall correlation. We observe no real significant correlation with the outdegree of the starting article (c = 0.637 and rc = 0.789). However, a strong correlation (c = 0.850 and rc = 0.960) is noticeable with respect to the indegree of the goal article and the actual percentage of success.

We can conclude from these results that the degree of the goal article is of significant influence on the difficulty of finding a certain path, whereas the degree of the starting node does not appear to play a notable role. An advantage of the node-based degree measure is that because the graph is stored as an adjacency list, the measure can be computed in O(1).
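A minimal sketch of such a degree-based classifier is given below; it assumes the graph is available as adjacency lists of successors, precomputes the indegrees once, and then answers each task in constant time. The function and variable names are illustrative only.

```python
from collections import defaultdict

def precompute_degrees(graph):
    """graph: node -> list of successors. Returns (outdegree, indegree) tables."""
    outdeg = {v: len(succs) for v, succs in graph.items()}
    indeg = defaultdict(int)
    for succs in graph.values():
        for w in succs:
            indeg[w] += 1
    return outdeg, indeg

def degree_features(task, outdeg, indeg):
    """Start outdegree and goal indegree of a (start, goal) task, O(1) per task."""
    start, goal = task
    return outdeg.get(start, 0), indeg.get(goal, 0)
```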

7.4.2 Neighborhood measures

As the indegree is apparently a relevant indicator for the difficulty of finding a certain goal, it makes sense to refine this measure. Therefore we define the h-neighborhood

N_h(v) of a node v ∈ V as the set of nodes with distance at most h from v, more specifically: N_h(v) = {w ∈ V | d(v, w) ≤ h}. Similarly, we can define N'_h(v) = {u ∈ V | d(u, v) ≤ h}, the reverse neighborhood, which is the set of all articles u with distance at most h to v. The h-neighborhood size is the number of nodes in the neighborhood of v, denoted by |N_h(v)|, and similarly we can define the reversed h-neighborhood size |N'_h(v)|. Obviously, the 1-neighborhood size and reversed 1-neighborhood size are equal to the outdegree and indegree of a node plus 1 (the node itself), respectively.

We compared the neighborhood measures described above with path difficulty and found a significant correlation with the reversed 2-neighborhood size, which is essentially looking one step further than the indegree. The functionality of this method can be explained by looking at the example graph in Figure 7.1. There, the article on Ice Hockey and the article on Ireland (island) both have an indegree of 1, while based on the degree of the neighbors, Ice Hockey seems much easier to reach than Ireland (island). This is nicely reflected by the reversed 2-neighborhood size, as |N'_2(Ireland (island))| = 3 and |N'_2(Ice Hockey)| = 6, whereas the indegree would consider the two nodes equally difficult to reach.


Figure 7.3: Start outdegree and goal indegree (horizontal axes, logarithmic) vs. percentage successful (vertical axis).

Figure 7.4 shows a plot of q = 100 intervals of the reversed 2-neighborhood of the goal article, again compared to the success percentage, and strong correlation coefficients (c = 0.915 and rc = 0.978) can be observed. Especially for the hardest tasks in the database (g(t) < 0.35), looking beyond the indegree helps to increase the amount of monotonicity.

In line with results from the previous section, the 2-neighborhood of the starting node did not appear to be correlated with the path difficulty (c = 0.397 and rc = 0.492). Furthermore we mention that, even though in some graphs it might make sense to look at (reverse) neighborhoods larger than h = 2, in the dense Wikipedia graph, considering more than the 2-neighborhood will quickly yield almost the entire graph, and indeed, correlation coefficients lower than 0.5 are observed when considering larger neighborhoods.

The neighborhood measures discussed in this subsection can be computed in O((m/n)^{h−1}) time per task. The average node indegree (or outdegree), m/n, is between 20 and 30, still allowing for quick computation of the measure, especially in case of h = 2. So, the reversed 2-neighborhood size is a good indicator for path difficulty, whereas measures related to the degree or neighborhood of the starting article do not appear to be good at classifying path difficulty. This can be explained by considering the small-world property of Wikipedia: with relatively few steps it is possible to reach a large portion of the graph.


Figure 7.4: Start and goal 2-neighborhood measures (horizontal axes, logarithmic) vs. percentage successful (vertical axis).

It seems plausible that the starting node on its own is of little influence in general, because the user will often find his way to a hub-like node very quickly, from where the actual search for the goal node starts.
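The reversed h-neighborhood size used above can be computed with a breadth-first search of depth at most h over the reversed graph. The sketch below is a simple Python illustration under that assumption, not the implementation used for the experiments in this chapter.

```python
from collections import defaultdict, deque

def reverse_graph(graph):
    """graph: node -> list of successors. Returns the reversed adjacency lists."""
    rev = defaultdict(list)
    for v, succs in graph.items():
        for w in succs:
            rev[w].append(v)
    return rev

def reversed_neighborhood_size(rev, v, h=2):
    """|N'_h(v)|: number of nodes at distance at most h *to* v, v itself included."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        node = queue.popleft()
        if dist[node] == h:
            continue  # do not expand beyond depth h
        for pred in rev.get(node, []):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)
    return len(dist)
```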

7.5 Path-based difficulty measures

In contrast with the previous section, we will now look at path-based properties, meaning that we look at actual paths between start and goal nodes in order to determine the difficulty of finding a path, possibly using global knowledge about the entire graph. Although the outcome in terms of difficulty prediction strength is expected to be higher, the computation time of path-based measures is longer: O(m) per task.

7.5.1 Path length

When selecting two random articles on Wikipedia, due to the small-world property of the Wikipedia graph, the probability of selecting a pair of articles that is at a small distance of each other is quite large. This is reflected in Figure 7.2, where the shortest path distribution of all played games is depicted, as well as the distribution of the lengths of all successful human-formed paths. As mentioned in Section 7.2, the distribution of human path lengths does appear to have the same shape as that of the actual shortest paths, suggesting a correlation between shortest path length and path difficulty.

Whereas we were able to aggregate the node-based measures from the previous section into q = 100 intervals, in case of path length we only have 6 different values. In Figure 7.6, the solid line shows for each actual distance (shortest path length) the percentage of successful human paths. This shows a strong correlation coefficient of c = −0.957 between the computed shortest path length and the percentage of successful paths, and an obvious rank correlation of rc = −1.000. However, a downside of considering distance as an indicator for difficulty is obviously the fact that it is only possible to define q = 6 different difficulty levels.

7.5.2 Number of shortest paths

We may also choose to look at the number of shortest paths σ(u, v) between the start and goal articles u and v. Intuitively, if there is only one shortest path from the start node to the end node, the task will be much harder compared to when there would have been thousands of shortest paths. Luckily, counting the number of shortest paths is as easy as computing actual shortest path lengths, as σ(u, u) = 1 and

\sigma(u,v) = \sum_{w \in B(u,v)} \sigma(u,w), \qquad B(u,v) = \{\, w \in N'_1(v) \mid d(u,v) = d(u,w) + 1 \,\}

[26]. The question is then how the number of shortest paths should be incorporated in a function value for assessing path difficulty. The number of shortest paths alone showed no significant correlation with path difficulty, which is understandable: a path of length 2 with 20 possible shortest paths is expected to be much easier to find than a path of length 4 with 20 shortest paths. So we propose to combine the distance with the number of shortest paths:

dsp(u,v) = d(u,v) + \alpha \left( 1 - \frac{\log \sigma(u,v)}{\max_{w,z \in V} \log \sigma(w,z)} \right)

The reason why we take the log of σ(u, v) is motivated by Figure 7.5, where the plots of "shortest paths" indicate how the distribution of the number of shortest paths for each shortest path length decreases logarithmically. The parameter α ≥ 0 defines the amount of focus on the number of shortest paths as compared to the distance. If this parameter is set to 1, then a path of length 4 with only 1 possible shortest path is assumed to be easier to find than a path of length 5 with 2000 different shortest paths. Using linear parameter tuning with steps of 0.25, we obtained the best result for α = 1.5, where we observe a strong correlation of c = −0.895 and rc = −0.876 with path difficulty. The results are depicted in Figure 7.6.
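A possible way to compute σ(u, v) and the resulting dsp value is sketched below: a single breadth-first search from u counts shortest paths, and the global maximum of log σ(w, z) is passed in as a precomputed constant. This is an illustrative Python sketch; the parameter name max_log_sigma and the early-exit detail are assumptions, not part of the original implementation.

```python
from collections import deque
from math import log

def shortest_path_count(graph, u, v):
    """Returns (d(u, v), sigma(u, v)) using a single BFS that counts shortest paths."""
    dist, sigma = {u: 0}, {u: 1}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if v in dist and dist[node] >= dist[v]:
            break  # all shortest paths to v have been counted (early exit, assumption)
        for succ in graph.get(node, []):
            if succ not in dist:
                dist[succ] = dist[node] + 1
                sigma[succ] = sigma[node]
                queue.append(succ)
            elif dist[succ] == dist[node] + 1:
                sigma[succ] += sigma[node]
    return dist.get(v, float("inf")), sigma.get(v, 0)

def dsp(graph, u, v, alpha=1.5, max_log_sigma=10.0):
    """Distance combined with the (log-)number of shortest paths (Section 7.5.2).

    max_log_sigma stands in for the maximum over all node pairs of log sigma(w, z),
    which would have to be precomputed; the value used here is a placeholder."""
    d, s = shortest_path_count(graph, u, v)
    if d == float("inf") or s == 0:
        return float("inf")
    return d + alpha * (1 - log(s) / max_log_sigma)
```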


Figure 7.5: Frequency (vertical axis) of the number of shortest paths and number of unique nodes on these paths (horizontal axis), for each distance.

7.5.3 Uniqueness of shortest paths

To further refine the measure from the previous section, we propose to look at the number of distinct nodes that occur within these shortest paths. This measure is based on the intuition that shortest paths quickly overlap, and that the extent to which paths overlap may influence the difficulty of a path finding task. For example, in Figure 7.1, the 3 shortest paths of length 3 from MP3 to United Kingdom run through a total of 4 different nodes: United States, Internet, Europe and Ice Hockey. The maximum number of unique nodes on 3 shortest paths of length 3 is 6 (3 times 2 unique intermediary nodes). Somewhat inspired by betweenness centrality, we propose to divide the number of nodes on the actual shortest paths by the maximum possible number of intermediary nodes, a measure which we will call shortest paths uniqueness. For the example, this results in a score of 4/6 ≈ 0.67. We will incorporate this measure along with the distance in the difficulty classifier defined as:

dusp(u,v) = d(u,v) + \beta \left( 1 - \frac{\log \psi(u,v)}{\log \left( d(u,v) \cdot \sigma(u,v) \right)} \right)

Here, ψ(u, v) is a function that returns the number of distinct nodes on the shortest paths between u and v.


Figure 7.6: Various path-based measures (horizontal axis) vs. percentage successful (vertical axis).

Difficulty classifier                  Complexity   q      c        rc
goal indegree                          O(1)         100    0.850    0.960
start outdegree                        O(1)         100    0.637    0.789
goal reversed 2-neighborhood size      O(m/n)       100    0.915    0.978
start 2-neighborhood size              O(m/n)       100    0.397    0.492
start-goal distance                    O(m)           6   −0.957   −1.000
distance + number of shortest paths    O(m)         100   −0.895   −0.876
distance + shortest paths uniqueness   O(m)         100   −0.924   −0.925

Table 7.3: Summary of correlation coefficients (c), rank correlation coefficients (rc) and complexity (per task) of the proposed difficulty classifiers for q difficulty classes.

The used values are again logarithmic as a result of the distribution of the number of unique nodes on the shortest paths, as depicted by the set of plots of "unique nodes" at various distances in Figure 7.5. The parameter β ≥ 0 indicates the amount of focus on the number of distinct nodes over all shortest paths, and linear tuning of this parameter in steps of 0.25 showed that the best results were obtained for β = 1.75. The performance of the measure is displayed by the dotted line in Figure 7.6. We note that there are some "hiccups" present in Figure 7.6, which might suggest that there is a better way of combining the two measures of distance and the uniqueness and number of shortest paths. The method nevertheless shows a correlation of c = −0.924 and rc = −0.925, demonstrating how shortest paths uniqueness does refine the path-based difficulty indicator from Section 7.5.1 based on the node-to-node distance.
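One way to obtain ψ(u, v) is to observe that a node w lies on some shortest path from u to v exactly when d(u, w) + d(w, v) = d(u, v), which can be checked with one forward and one backward breadth-first search. The sketch below follows that idea and, in line with the worked example above, counts intermediary nodes only; the guard for the degenerate case d(u, v) · σ(u, v) = 1 is an added assumption.

```python
from collections import defaultdict, deque
from math import log

def reverse_graph(graph):
    """graph: node -> list of successors. Returns the reversed adjacency lists."""
    rev = defaultdict(list)
    for v, succs in graph.items():
        for w in succs:
            rev[w].append(v)
    return rev

def bfs_distances(adj, source):
    """Distances from source over the given adjacency lists."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def psi(graph, u, v):
    """Number of distinct intermediary nodes on the shortest paths from u to v:
    w lies on such a path exactly when d(u, w) + d(w, v) = d(u, v)."""
    du = bfs_distances(graph, u)                 # forward distances from u
    dv = bfs_distances(reverse_graph(graph), v)  # distances *to* v
    if v not in du:
        return 0
    d = du[v]
    return sum(1 for w in du
               if w not in (u, v) and w in dv and du[w] + dv[w] == d)

def dusp(d, sigma, psi_value, beta=1.75):
    """Distance refined by shortest-path uniqueness (Section 7.5.3)."""
    if psi_value < 1 or d * sigma <= 1:
        return d  # guard for degenerate cases (an added assumption)
    return d + beta * (1 - log(psi_value) / log(d * sigma))
```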

7.6 Conclusion

Throughout this chapter we have proposed and analyzed the effectiveness of a range of techniques for classifying path traversal difficulty in information networks. The results are summarized in Table 7.3, in which the best-performing node-based and path-based measures are shown in bold. With respect to the effectiveness of the various measures, we can generally say that node-based measures related to the goal article, such as the reversed neighborhood size, appear to be most effective, whereas node-based properties of the source article appear to be of little influence on path difficulty. In line with related work, we found that a user generally tends to quickly find his way to a hub node, from where the actual search process starts.

As for the path-based measures considered in this work, the distance between two articles is a good measure of difficulty, although as a result of the small-world property of Wikipedia, the range of different distances and thus the range of difficulty levels is very limited. Incorporating the number of shortest paths and the percentage of unique nodes over all shortest paths results in a path-based classifier with slightly better performance. However, a clear downside of the proposed path-based methods based on the number of shortest paths and their uniqueness is that they require one parameter to be tuned. Furthermore, due to the higher complexity of path-based measures, one may favor the node-based classifiers in a practical application, such as in the Wiki Game. There, the difficulty classifiers outlined in this chapter could be used to improve the user experience by allowing users to select a desired difficulty level at which they want to play.

In future work we would like to improve our difficulty measures by including more article-specific information, such as the link density of the considered article. Furthermore, we want to analyze the frequent subpaths that exist both in the successful as well as in the failed paths created by humans. This information may help to obtain a better understanding of the search process of a certain user or group of similar users, possibly allowing personalization of the difficulty indicators.

Acknowledgment

We thank Alex Clemesha, creator of the Wiki Game, for providing the data.

8 Mining User-Generated Path Traversal Patterns in an Information Network

This chapter studies patterns occurring in user-generated clickpaths within the online encyclopedia Wikipedia. The clickpath data originates from over seven million goal-oriented clicks gathered from the Wiki Game, an online game in which the goal is to find a path between two given random Wikipedia articles. First we propose to use node-based path traversal patterns to derive a new measure of node centrality, arguing that a node is central if it proves useful in navigating through the network. A comparison with centrality measures from the literature is provided, showing that users generally "know" only a relatively small portion of the network, which they employ frequently in finding their goal, and that this set of nodes differs significantly from the set of central nodes according to various centrality measures. Next, we consider so-called frequent traversal graphs, i.e., graphs that arise from considering the nodes and edges of the top-k frequent path traversal patterns. We demonstrate how a small set of patterns is enough to obtain a subgraph with structural properties similar to that of the original graph, showing that users are able to identify a small yet efficient portion of the graph that is useful for successfully completing their navigation goals. This chapter is based on:

• F. W. Takes and W. A. Kosters. Mining user-generated path traversal patterns in an information network. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2013), pages 284–289, 2013

8.1 Introduction

A large part of the gigantic amount of information that is nowadays available is organized in some sort of network structure. Examples include the world wide web, an online social network or an information network such as Wikipedia. In these networks (or graphs), each node represents an entity or a piece of information, and each link represents a tie or relationship between two entities. An important task that human users perform on a daily basis is searching for a piece of content within such a network. Although search engines can often assist the user in performing such a search task, navigating to the desired page by means of clicking the links between the nodes in the network is still a common activity, as sometimes search engine performance does not exactly meet the user's needs [133]. In such cases, the user will have to reach the correct page by traversing hyperlinks that exist between the pages in the network, forming a path towards the correct piece of information. Throughout this chapter we consider the task of mining traversal patterns that occur within these types of clickpaths. The obtained patterns are useful for understanding pathfinding strategies in networks, and may even be useful in getting a better understanding of human search behavior in general [62].

The data used in this chapter originates from the Wiki Game, an online game in which the main task is to link two given random pages on Wikipedia. Employing his perception of the structure of the network, a user has to find his way to the goal article by clicking the directed links that exist between the various articles in the Wikipedia graph, essentially generating a goal-oriented clickpath. We will consider a newer version of the Wiki Game dataset introduced in Chapter 7, containing more than one million clickpaths, comprising a total of seven million goal-oriented clicks on Wikipedia pages. It is important to note that these clicks are fundamentally different from simply counting the number of visits to a certain page, as these counts would for example also include visits that immediately reach the desired goal page, for example via a search engine. Instead, the clickpaths that we will study consist of Wikipedia pages and links between pages that were actually considered useful, by the user, in traversing the network.

We will use node-based traversal patterns to address a problem within the field of network analysis called node centrality, defined as the importance of a node within the network. So-called centrality measures are widely used to assess this issue of node centrality, and examples include PageRank [107] as well as centrality measures that originate from the field of social network analysis such as degree centrality, closeness centrality and betweenness centrality [26]. While the aforementioned centrality measures all employ the structure of the network to assess the importance of a node, none of them incorporates the human perception of the information incorporated in the network. As it is ultimately the user who is going to assess whether or not a page is actually relevant, one could say that it is not the structure of the network which should serve as the basis of the centrality measure, but it should instead be the user's perception of the network that is going to determine the importance of a node.
It may very well be that certain structurally central nodes in the network are not considered important or useful by the user, and vice versa. Therefore we introduce a user-defined measure of centrality based on frequently traversed nodes, arguing that a page is important if it proves useful in navigating through the network. Especially in networks where the user perception of the data plays a central role, such as in the world wide web, or in an information network, we believe that a user-defined measure makes more sense than a conventional user-insensitive approach. Furthermore, we introduce the measure of subgraph centrality, which determines the centrality of a group of connected nodes with respect to the rest of the network, allowing an experimental verification of the quality, in terms of ease of navigation, of the user-perceived central nodes.

The rest of this chapter is organized as follows. In Section 8.2 we discuss some definitions and introduce our dataset. After discussing related work in Section 8.3, we introduce node-based patterns and our user-defined measure of centrality in Section 8.4. In Section 8.5 we analyze different types of subgraphs derived from frequent traversal patterns, and we perform experiments demonstrating the successful use of these subgraphs by humans. Section 8.6 concludes the chapter and provides suggestions for future work.

8.2 Preliminaries

This section starts with some basic definitions regarding graphs and paths that will later on allow us to precisely define our path traversal patterns and various derived measures. We also describe the clickpath dataset to which we will later on apply our path traversal pattern mining techniques.

8.2.1 Wikipedia graph

We will model the information network Wikipedia as a directed graph G = (V, E) with n = |V| nodes and m = |E| directed links between pairs of nodes. The indegree indeg(v) of a node v ∈ V is equal to the number of incoming links of v, and similarly outdeg(v) denotes the number of outgoing links. We define a path as a vector p of visited nodes, where for each subsequent node pair (v_i, v_{i+1}) ∈ p there exists a link e = (v_i, v_{i+1}) ∈ E in the original graph G. The path length is then equal to the number of links that was traversed to get from the first to the last node in the path. We define the distance d(u, v) as the length of the shortest path between nodes u and v, meaning the minimum number of links that has to be traversed to get from u to v. If there is no path between u and v, then d(u, v) = ∞. In such cases, the graph has multiple strongly connected components, meaning that some nodes are not reachable from every other node by considering the directed links between the nodes. This does not necessarily mean that there are multiple weakly connected components, which we define as maximal sets of nodes such that every node can reach every other node by means of traversing the links between the nodes, regardless of the direction of the link. For convenience in later definitions, we denote the number of shortest paths between two nodes by σ(u, v), and the number of shortest paths from u to v that run through node w as σ_w(u, v).

In this research, we use a Wikipedia graph consisting of the pagelinks from the English version of DBpedia version 3.7 and 3.8 (see [10] or http://dbpedia.org), which were mined from the original Wikipedia datasets in 2011 and 2012. We mention that by only considering actual pagelinks and ignoring links to special pages or external websites, each page represents an actual piece of information within the information network. Although the used Wikipedia graph is a bit newer than the version presented in Table 7.1 in Chapter 7, some more pruning of "special" Wikipedia pages was done, resulting in a graph with n = 3,416,126 nodes and m = 83,271,539 directed links, and further statistics similar to what we presented in the previous chapter.

8.2.2 The Wiki Game dataset

The clickpath data used in this chapter is based on clicks made by users of the Wiki Game (http://www.thewikigame.com). In this game, users are assigned the task of connecting two given random articles on Wikipedia by traversing the links that exist between Wikipedia articles. For additional information and an example of this game, the reader is referred to Section 7.2.3 of Chapter 7. Compared to the previous chapter, we study a newer version of the Wiki Game dataset. Furthermore, the focus is on the actual completed clickpaths, and failed paths are ignored.

Property                              Value
All user-generated paths              3,219,641
Clicks in all user-generated paths    17,151,824
Percentage successful                 35.3%
Successful paths                      1,137,337
Clicks in successful paths            7,135,060

Table 8.1: The Wiki Game dataset used in Chapter 8.

The dataset used in this chapter consists of clickpaths generated between 2009 and 2012, where one clickpath corresponds to a played game (or task), which is essentially a (start, goal) pair between which a path has been formed. In this chapter, we will only consider paths with a length between 3 and 20, thus filtering out non-serious attempts. A total of 3,219,641 paths was generated, consisting of 17,151,824 clicks in total. Of these tasks, little over one third was successfully completed, which is the part of the dataset that we consider in this chapter. This results in a dataset of 1,137,337 clickpaths consisting of a total of 7,135,060 clicks. The statistics discussed above are summed up in Table 8.1. Figure 8.1 shows the relative frequency of the lengths of all user-generated paths, as well as that of the computed shortest path lengths of the tasks.

8.3 Related work

Path traversal patterns in a hyperlinked environment have been a popular subject of study since the introduction of the web [30]. A lot of work has been done on mining the top-k frequent traversal patterns [93], often by using algorithms from the field of frequent itemset mining [53]. With the enormous amount of web traffic taking place these days, studying path traversal patterns from a stream has also become a useful task [92].


Figure 8.1: Relative frequency (vertical axis, logarithmic) of various path lengths (horizontal axis) of the filtered dataset.

Most of the research in which clickpaths are analyzed within a confined environment considers weblogs from a particular website [1]. This chapter differs from such studies in the sense that all clicks in our dataset are goal-oriented and clicks are identifiable as one unique topic, namely the subject of the Wikipedia page. An overview of additional related work, for example on analysis of Wikipedia itself, can be found in Section 7.3.

In Chapter 7, we have investigated the difficulty of forming a path between two given random pages, showing that in the Wiki Game, the indegree of the goal page as well as the reversed neighborhood, both local properties of the goal page, are good predictors of the difficulty of performing such a path traversal task. We have also demonstrated how the start page is of little influence, as the user just navigates away from it quickly in search of a hub. Whereas the previous chapter only considered path traversal success or failure, in this chapter we consider the patterns that arise from the actual clicks made by the users.

8.4 Path traversal patterns

In this section we will first introduce three types of path traversal patterns, after which we look in detail at node-based traversal patterns, and how these patterns can serve as a basis of a user-defined measure of centrality. We will compare this new measure with centrality measures from the literature, which we briefly describe in Section 8.4.2.

8.4.1 Patterns

Given a dataset P consisting of a large number of clickpaths, we are interested in patterns, i.e., observable phenomena that occur more frequently than expected. For our clickpath dataset, it is possible to distinguish between the following frequencies in order to define our patterns:

• Node traversal frequency: the number of times a node v occurs in all paths p from P.

• Edge traversal frequency: the number of times an ordered pair of subsequent nodes (v_1, v_2) occurs in all paths p from P.

• Subpath traversal frequency: the number of times an ordered sequence of three or more subsequent nodes (v_1, v_2, v_3, ...) appears in all paths p from P.

Obviously, relaxing the definition of subpath traversal frequency to length two or one yields the definitions of edge and node traversal frequency, respectively. Similar to the definitions often given in the area of frequent itemset mining [53], we call an observation frequent if it occurs more often than a certain threshold θ > 0 amongst all paths, allowing the definition of our patterns: frequent nodes, frequent edges and frequent subpaths. For a given threshold θ, every node within a frequent edge and every edge within a frequent subpath is also frequent. We define the set of top-k frequent patterns as the set of k ≥ 1 patterns with the highest frequency, allowing us to again define derived sets called top-k frequent nodes, top-k frequent edges and top-k frequent subpaths. The most simple patterns based on frequent nodes are further discussed in this section, whereas graphs derived from more complex traversal patterns are considered in Section 8.5.
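A straightforward way to extract these three frequency counts from a set of clickpaths is sketched below in Python; the maximum subpath length considered is an arbitrary illustrative choice, and the names are not taken from the thesis implementation.

```python
from collections import Counter

def traversal_frequencies(paths, max_subpath_len=3):
    """Node, edge, and subpath traversal frequencies over a set of clickpaths.

    paths: iterable of clickpaths, each a list of article identifiers in click order.
    max_subpath_len (illustrative choice) bounds the subpath length considered."""
    node_freq, edge_freq, subpath_freq = Counter(), Counter(), Counter()
    for p in paths:
        node_freq.update(p)
        edge_freq.update(zip(p, p[1:]))
        for length in range(3, max_subpath_len + 1):
            for i in range(len(p) - length + 1):
                subpath_freq[tuple(p[i:i + length])] += 1
    return node_freq, edge_freq, subpath_freq

# Example: the top-k frequent nodes are then simply node_freq.most_common(k).
```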

8.4.2 Centrality measures

Node centrality is defined as the importance of a certain node in the graph. A centrality measure M returns the centrality C_M(v) of a node v ∈ V. We consider the following (existing) centrality measures, somewhat ordered by their complexity in terms of computation time:

Indegree centrality:
C_{indeg}(v) = \frac{indeg(v)}{n - 1}

Closeness centrality:
C_c(v) = \frac{1}{\frac{1}{n-1} \sum_{w \in V} d(v, w)}

PageRank:
C_{PR}(v) = PR(v)

HITS:
C_{HITS}(v) = a(v)

Betweenness centrality:
C_b(u) = \sum_{\substack{v, w \in V \\ v \neq w,\ u \neq v,\ u \neq w}} \frac{\sigma_u(v, w)}{\sigma(v, w)}

A discussion and more elaborate definition of these measures is given in Section 5.4.1 and Section 6.3. Each of the centrality measures results in a number between 0 and 1, where a higher score indicates that the node is more central. For convenience, we normalize the centrality values such that the most central node has a centrality value of 1. Clearly, distance-based measures do not perform well when there is more than one connected component. Therefore we will only consider the largest strongly connected component of the Wikipedia graph. We believe that we have covered the most common and applicable measures in this subsection, using similar arguments regarding the type of measures as presented in Section 5.4.1.

8.4.3 User-defined node centrality

Recall from Section 8.4.1 that considering the top-k frequent nodes means that if we sort the list of nodes by their node frequency value, we consider the k nodes with the highest frequency. For our clickpath dataset, this means that we are looking at the k nodes that were most frequently used to traverse the graph. This list is actually quite interesting, as it indicates which k nodes are considered important, by the user, in navigating through the graph. We use this data as a basis for our user-defined measure of centrality, proposing to count the number of clicks that an article v received (denoted by clicks(v)) and divide it by the total number of clicks made in order to obtain our user-defined measure of centrality:

User-defined centrality:
C_{ud}(v) = \frac{clicks(v)}{\sum_{w \in V} clicks(w)}

To get an idea of the values returned by this function, Figure 8.2 shows the frequency of each node traversal count over all nodes in the graph. The distribution follows a clear power law, meaning that many nodes are visited only a few times, and a few nodes are visited quite often. We are obviously interested in the tail of the distribution: the set of nodes that is visited very frequently.
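The sketch below shows how this user-defined centrality could be derived from the clickpath dataset; treating every article in a path except the start article as one received click is an assumption made for the illustration, not a detail taken from the thesis implementation.

```python
from collections import Counter

def user_defined_centrality(paths):
    """C_ud(v) = clicks(v) / total number of clicks.

    Every article occurring in a path except the start article is counted as one
    received click; this interpretation is an assumption of this sketch."""
    clicks = Counter()
    for p in paths:
        clicks.update(p[1:])
    total = sum(clicks.values())
    return {v: c / total for v, c in clicks.items()}
```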

8.4.4 Measure evaluation

Assessing the quality of a centrality measure is not a trivial task, and often comes down to simply comparing one centrality measure with another centrality measure. An alternative would be to have a subjective evaluation done by a human, and then determine the extent to which the ranking produced by the centrality measures resembles the user's perception of the importance of these nodes. An example of a more authoritative ground truth for centrality is provided in Chapter 6 in the context of online social networks, where celebrities have a special labeling created by the administrators of the network, reflecting the celebrity status of the real-world person behind the profile. However, often researchers rely on manual inspection of the top-k most central nodes [106], or simply compare their measure with other existing centrality measures [22]. If we are only interested in the relative ranking of entities in two top-k lists, measures such as Kendall's tau and Spearman's weighted footrule can be used [79].

In our experiments, we will use two different ways of comparing centrality measures, as suggested in [22] (though in a somewhat different setting). Often, a centrality measure is used to find the top-k most central nodes, and we mention that the evaluation techniques that we discuss here are designed such that only the top-k nodes are evaluated. A rather basic technique is to compare the top-k nodes of two centrality measures and determine the percentage of nodes that overlap. For example, for k = 1, we simply verify whether the most central node is equal for both measures. We call this measure top-k precision, defined as follows:

\text{top-}k\text{ precision} = \frac{|A_k \cap B_k|}{k}

Here, A_k, B_k ⊆ V represent the sets of top-k nodes returned by centrality measures A and B. Alternatively, when the actual centrality value of the top-k nodes is also of importance, we can look at the correlation between the centrality values in two lists of nodes. We call this measure top-k correlation and define it as the Pearson correlation coefficient between the centrality values of the two methods. Important to note here is that measure A is considered as the ground truth: we compare the centrality values of the top-k nodes of measure A with those of measure B.
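Both evaluation measures are easy to compute once two centrality dictionaries are available, as the following illustrative Python sketch shows; nodes missing from measure B are given a centrality of zero here, which is a simplifying assumption.

```python
from math import sqrt

def _pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def top_k_nodes(centrality, k):
    """The k nodes with the highest centrality value (centrality: node -> value)."""
    return sorted(centrality, key=centrality.get, reverse=True)[:k]

def top_k_precision(cent_a, cent_b, k):
    """Fraction of overlap between the top-k node sets of measures A and B."""
    return len(set(top_k_nodes(cent_a, k)) & set(top_k_nodes(cent_b, k))) / k

def top_k_correlation(cent_a, cent_b, k):
    """Pearson correlation of the centrality values over the top-k nodes of A,
    with A acting as the ground truth; missing nodes in B default to 0.0."""
    nodes = top_k_nodes(cent_a, k)
    return _pearson([cent_a[v] for v in nodes],
                    [cent_b.get(v, 0.0) for v in nodes])
```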


Figure 8.2: The frequency (vertical axis, logarithmic) of different node and edge traversal counts (horizontal axis, logarithmic).

8.4.5 Experiments

In this section we use the user-defined measure of node centrality introduced in Section 8.4.3 as a ground truth for comparing the centrality measures listed in Section 8.4.2. We compare the different measures up to k = 250, based on an evaluation using both top-k precision (see Figure 8.3 and Table 8.2) and top-k correlation (see Table 8.2). We note that for small values of k, big deviations for the top-k precision measure can be observed, which is due to the fact that with a low value of k, one mismatch has a relatively high influence on the actual percentage. In our experiments we also found that it is important not to lose the directed aspect of the Wikipedia network, as otherwise overview pages containing listings of events or people will be ranked too high. This is also the reason why both outdegree centrality and the HITS algorithm using the hub score instead of the authority score did not produce meaningful results.

Looking at the performance of the different centrality measures in Table 8.2, we can generally conclude that PageRank not only gives the highest, but judging from Figure 8.3 also the most consistent results when top-k precision is considered. Indegree centrality is a good second choice if top-k correlation is important. We mention that for values greater than k = 250, a somewhat consistent precision is observed. Altogether, it appears that centrality measures are able to explain only roughly half of the nodes that are frequently used by humans to traverse the graph.


Figure 8.3: User-defined top-k precision (vertical axis) for different k (horizontal axis).

Measure        Top-k precision   Top-k correlation
User-defined   1.00              1.00
PageRank       0.51              0.76
Closeness      0.49              0.53
Indegree       0.37              0.83
Betweenness    0.32              0.71
HITS           0.28              0.62

Table 8.2: Comparison of centrality measures with user-defined centrality for k = 100.

This may lead us to believe that either humans are able to assess half of the central nodes in the graph, or that existing centrality measures are simply not able to produce the portion of nodes which is considered useful by the user. In the latter case, the only remaining question is then whether or not the set of nodes returned by the centrality measures is better or worse at ensuring that a large portion of the graph is easily reachable and thus useful for completing navigation goals. We will try to answer this question in the next section by looking at global properties of the path traversal patterns.

8.5 Global patterns

In this section we consider the global properties of the frequent patterns, creating subgraphs of the original network by considering the frequent node and edge traversal patterns.

8.5.1 Frequent traversal graphs

We define a frequent traversal graph as a graph consisting of frequent traversal patterns, distinguishing between two types of graphs:

• Node-based frequent traversal graph: the subgraph consisting of all nodes v ∈ V and their connecting edges for which it holds that v is traversed more often than a certain threshold θ > 0.

• Edge-based frequent traversal graph: the subgraph consisting of all links (u, v) ∈ E and their node endpoints for which it holds that (u, v) is traversed more often than a certain threshold θ > 0.

Analogously to the definitions given in Section 8.4.1, we can define top-k node-based and edge-based frequent traversal graphs, consisting of the top-k most frequently visited nodes and edges, respectively. Iterating over increasing values of k then yields node-based and edge-based evolving graphs.
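The following sketch illustrates how both types of frequent traversal graphs could be constructed from the clickpaths (and, for the node-based variant, the original graph); the names and the dictionary-of-lists representation are illustrative choices, not the thesis implementation.

```python
from collections import Counter, defaultdict

def edge_based_frequent_traversal_graph(paths, k):
    """Subgraph consisting of the top-k most frequently traversed edges."""
    edge_freq = Counter()
    for p in paths:
        edge_freq.update(zip(p, p[1:]))
    subgraph = defaultdict(list)
    for (u, v), _count in edge_freq.most_common(k):
        subgraph[u].append(v)
    return dict(subgraph)

def node_based_frequent_traversal_graph(paths, graph, k):
    """Subgraph on the top-k most frequently traversed nodes, keeping all edges of
    the original graph between them (hence denser than the edge-based variant)."""
    node_freq = Counter()
    for p in paths:
        node_freq.update(p)
    top = {v for v, _count in node_freq.most_common(k)}
    return {v: [w for w in graph.get(v, []) if w in top] for v in top}
```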

Some properties of these two types of subgraphs are shown in Figure 8.4 and Figure 8.5 for respectively the frequent nodes and frequent edges of our Wikipedia clickpath dataset. We note that when considering frequent edges, the average distance between two nodes quickly (from roughly k = 5,000 onwards) resembles that of the original Wikipedia graph (4.55), and then remains surprisingly stable as k increases. The node-based frequent traversal graph does not resemble the distance distribution of the original graph, as this type of subgraph also contains many edges that were not actually traversed, but are simply present between the frequent nodes in the original graph, creating many more connections between the nodes than were actually traversed. Indeed, considering only the edge-based frequent patterns might make more sense, as the user apparently "knew" these exact links, and not just the nodes. The findings presented here may indicate that the user is able to select a representative portion of the edges (and by that a portion of the nodes). In the next section, an attempt is made to measure the effectiveness of this central portion of nodes.


Figure 8.4: Values (vertical axes) of different properties of the node-based frequent traversal graph for different k (horizontal axis).

8.5.2 Subgraph centrality

The final question which we aim to answer in this chapter is whether or not the frequent traversal graphs are actually better or worse than graphs derived from traditional centrality measures in terms of being able to quickly reach a large portion of the original graph, and thus ensuring ease of navigation. To do this, we introduce the measure of subgraph centrality, which we define as the centrality (according to some existing measure, in our case closeness centrality) of a set of nodes, namely the set of top-k nodes obtained through a centrality measure. To determine the centrality of this set of nodes, we merge the set of top-k frequent nodes into one node, realizing the equivalent of setting the weight of all edges between frequent nodes to zero.

In Figure 8.6 we show for increasing k the subgraph centrality values derived from the frequent nodes in the user-defined measure and the PageRank centrality measure. We have chosen to provide a comparison with PageRank and indegree because they performed best in terms of precision and correlation according to our experiments in Section 8.4.5. We observe how the subgraph centrality of the user-defined frequent traversal graph compares quite well to that of the PageRank subgraph, which indicates that the user is able to select a portion of nodes which in terms of reachability is equal to that of a centrality measure. For k > 1,200, the quality of the user-defined centrality is even higher than that of PageRank, suggesting that users are able to select a portion of the nodes of the graph which is better for realizing a low node-to-node distance than a traditional measure such as PageRank.
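A possible implementation of subgraph centrality is a multi-source breadth-first search that starts from all nodes in the merged set at distance 0 and then takes the reciprocal of the average distance to the remaining nodes. The sketch below follows that idea; ignoring unreachable nodes is a simplifying assumption, and the names are illustrative.

```python
from collections import deque

def subgraph_centrality(graph, node_set):
    """Closeness centrality of a set of nodes merged into one super-node.

    A multi-source BFS from all members (distance 0) gives, for every reachable
    node, the distance to the nearest member; the result is the reciprocal of
    the average distance to the nodes outside the set (unreachable nodes are
    ignored here, which is a simplifying assumption)."""
    members = set(node_set)
    dist = {v: 0 for v in members}
    queue = deque(members)
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, []):
            if succ not in dist:
                dist[succ] = dist[node] + 1
                queue.append(succ)
    outside = [d for v, d in dist.items() if v not in members]
    if not outside:
        return 0.0
    return 1.0 / (sum(outside) / len(outside))
```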


Figure 8.5: Values (vertical axes) of different properties of the edge-based frequent traversal graph for different k (horizontal axis).


Figure 8.6: Comparison of subgraph centrality (vertical axis) of various centrality measures for different values of k (horizontal axis).

8.6 Conclusion

Throughout this chapter we have looked at mining path traversal patterns from the information network Wikipedia, aiming to understand and measure the quality in terms of navigation of user-generated traversal patterns. Using data gathered from over seven million clicks made in the Wiki Game, we have derived a new measure of node centrality based on frequently traversed nodes. It turns out that roughly half of the set of most frequently traversed nodes overlaps with the set of central nodes according to centrality measures such as PageRank. The additional nodes that are frequently visited by the users do appear to be useful, which we have demonstrated by using frequent traversal graphs and the notion of subgraph centrality. The subgraphs that can be derived from the frequently traversed nodes appear to be more central than the set of nodes derived from an existing centrality measure. This shows how users are apparently able to select an efficient portion of the graph that is useful in traversing the graph, specifically realizing a short distance to all other nodes in the graph. Although we have shown that the user is able to select an efficient subset of the graph for completing navigation goals, it remains an open question exactly how the user selected this subset. Clearly, a subset derived using a centrality measure or a random subset performs similarly or worse, so from an artificial intelligence point of view, the performance of the user is quite remarkable.

In future work, we want to see if we can extend the study of traversal patterns to more complex patterns based on frequent edges, possibly to study edge centrality [100], or based on frequent (interleaved) subpaths. Last but not least, the topics and techniques discussed in this chapter can possibly be extended to other types of graphs such as social networks, in which frequently traversed nodes and edges may indicate important actors and ties in the network.
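
The reported overlap between the frequently traversed nodes and the top-k nodes of a centrality measure can be quantified with a simple top-k overlap score, sketched below under the assumption that both rankings are available as lists ordered from most to least important; this is an illustrative formulation, not the evaluation code of Section 8.4.5.

    def topk_overlap(ranking_a, ranking_b, k):
        # Fraction of the top-k of ranking_a that also appears in the top-k of ranking_b.
        top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
        return len(top_a & top_b) / k

    # An overlap of roughly 0.5 would correspond to "roughly half" of the most
    # frequently traversed nodes also being top-k nodes according to PageRank, e.g.:
    # topk_overlap(frequent_nodes_ranking, pagerank_ranking, 1000)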

Acknowledgment

We thank Alex Clemesha, creator of the Wiki Game, for providing the data.

Bibliography

[1] R. Agarwal, K. Veer Arya, and S. Shekhar. An architectural framework for web information retrieval based on user’s navigational pattern. In Proceedings of the International Conference on Industrial and Information Systems (ICIIS), pages 195–200, 2010.

[2] Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social networking services. In Proceedings of the 16th International World Wide Web Conference (WWW 2007), pages 835–844, 2007.

[3] D. Aingworth, C. Chekuri, P. Indyk, and R. Motwani. Fast estimation of diameter and shortest paths (without matrix multiplication). SIAM Journal on Computing, 28(4):1167–1181, 1999.

[4] T. Akiba, Y. Iwata, and Y. Yoshida. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In Proceedings of the ACM International Conference on Management of Data (SIGMOD 2013), pages 349– 360, 2013.

[5] T. Akiba, C. Sommer, and K. Kawarabayashi. Shortest-path queries for complex networks: Exploiting low tree-width outside the core. In Proceedings of the 15th International Conference on Extending Database Technology (EDBT 2012), pages 144–155, 2012.

[6] R. Albert. Scale-free networks in cell biology. Journal of Cell Science, 118:4947–4957, 2005.

[7] R. Albert and A. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.

[8] R. Albert, H. Jeong, and A. Barabási. Internet: Diameter of the world-wide web. Nature, 401(6749):130–131, 1999.

[9] N. Alon, Z. Galil, and O. Margalit. On the exponent of the all pairs shortest path problem. Journal of Computer and System Sciences, 54(2):255–262, 1997.

[10] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, volume 4825 of Lecture Notes in Computer Science, pages 722–735. 2007.

[11] A. Banerjee and J. Ghosh. Clickstream clustering using weighted longest common subsequences. In Proceedings of the SIAM Workshop on Web Mining, pages 33–40, 2001.

[12] A. Barabási, R. Albert, and H. Jeong. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A: Statistical Mechanics and its Applications, 281(1):69–77, 2000.

[13] A. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications, 311(3-4):590–614, 2002.

[14] V. Batagelj and A. Mrvar. Pajek datasets. Accessed February 28, 2014. http://vlado.fmf.uni-lj.si/pub/networks/data.

[15] K. Batool and M. A. Niazi. Towards a methodology for validation of centrality measures in complex networks. PloS ONE, 9(4), article e90283, 2014.

[16] C. Bizer, T. Heath, and T. Berners-Lee. Linked data: The story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.

[17] K. Boitmanis, K. Freivalds, P. Ledi, and R. Opmanis. Fast and simple approximation of the diameter and radius of a graph. In Experimental Algorithms, volume 4007 of Lecture Notes in Computer Science, pages 98–108. 2006.

[18] P. Boldi, M. Rosa, and S. Vigna. HyperANF: Approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th International Conference on World Wide Web (WWW 2011), pages 625–634, 2011.

[19] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In Proceedings of the 13th International World Wide Web Conference (WWW 2004), pages 595–601, 2004.

[20] K. D. Bollacker, S. Lawrence, and C. L. Giles. Citeseer: An autonomous web agent for automatic retrieval and identification of interesting publications. In Proceedings of the 2nd International Conference on Autonomous Agents, pages 116–123, 1998.

[21] M. Borassi, P. Crescenzi, M. Habib, W. A. Kosters, A. Marino, and F. W. Takes. On the solvability of the six degrees of Kevin Bacon game — A faster graph diameter and radius computation method. In Proceedings of the 7th International Conference on Fun with Algorithms (FUN 2014), volume 8496 of Lecture Notes in Computer Science, pages 52–63. 2014.

[22] S. P. Borgatti, K. M. Carley, and D. Krackhardt. On the robustness of centrality measures under conditions of imperfect data. Social Networks, 28(2):124–136, 2006.

[23] S. P. Borgatti and M. G. Everett. A graph-theoretic perspective on centrality. Social Networks, 28(4):466–484, 2006.

[24] J. Borges and M. Levene. Data mining of user navigation patterns. In Web Usage Analysis and User Profiling, volume 1836 of Lecture Notes in Computer Science, pages 92–112. 2000.

[25] D. Boyd and N. Ellison. Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13(1):210–230, 2007.

[26] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.

[27] U. Brandes and C. Pich. Centrality estimation in large networks. International Journal of Bifurcation and Chaos, 17(7):2303–2318, 2007.

[28] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer networks, 33(1):309–320, 2000.

[29] F. Buckley and F. Harary. Distance in Graphs. Addison-Wesley, 1990.

[30] M.-S. Chen, J. S. Park, and P. S. Yu. Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2):209– 221, 1998.

[31] P. Chen, H. Xie, S. Maslov, and S. Redner. Finding scientific gems with Google's PageRank algorithm. Journal of Informetrics, 1(1):8–15, 2007.

[32] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: User movement in location-based social networks. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2011), pages 1082– 1090, 2011.

[33] D. J. Cook and L. B. Holder. Mining Graph Data. John Wiley & Sons, 2006.

[34] D. G. Corneil, F. F. Dragan, and E. Köhler. On the power of BFS to determine a graph's diameter. Networks, 42(4):209–222, 2003.

[35] R. Corten. Composition and structure of a large online social network in the Netherlands. PloS ONE, 7(4), article e34760, 2012.

[36] P. Crescenzi, R. Grossi, M. Habib, L. Lanzi, and A. Marino. On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 514:84–95, 2013.

[37] P. Crescenzi, R. Grossi, C. Imbrenda, L. Lanzi, and A. Marino. Finding the diameter in real-world graphs. In Proceedings of the 18th Annual European Symposium on Algorithms (ESA 2010), pages 302–313, 2010.

[38] P. Crescenzi, R. Grossi, L. Lanzi, and A. Marino. A comparison of three algorithms for approximating the distance distribution in real-world graphs. In Theory and Practice of Algorithms in (Computer) Systems, volume 6595 of Lecture Notes in Computer Science, pages 92–103. 2011.

[39] P. Crescenzi, R. Grossi, L. Lanzi, and A. Marino. On computing the diameter of real-world directed (weighted) graphs. In Experimental Algorithms, volume 7276 of Lecture Notes in Computer Science, pages 99–110. 2012.

[40] W. Croft, D. Metzler, and T. Strohman. Search Engines: Information Retrieval in Practice. Addison-Wesley, 2010.

[41] G. H. Dal, W. A. Kosters, and F. W. Takes. Fast diameter computation of large sparse graphs using GPUs. In Proceedings of the 22nd IEEE International Conference on Parallel, Distributed and Network-based Processing (PDP 2014), pages 632–639, 2014.

[42] P. Desikan and J. Srivastava. Mining temporally evolving graphs. In Proceedings of the 6th WEBKDD Workshop in conjunction with the 10th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2004), 10 pages, 2004.

[43] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, 1959.

[44] D. Dor, S. Halperin, and U. Zwick. All pairs almost shortest paths. SIAM Journal on Computing, 29(5):1740–1759, 2000.

[45] D. Eppstein and J. Wang. Fast approximation of centrality. In Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms (SODA 2001), pages 228–229, 2001.

[46] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. ACM SIGCOMM Computer Communication Review, 29:251– 262, 1999.

[47] L. Fu and J. Deng. Graph calculus: Scalable shortest path analytics for large social graphs through core net. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2013), pages 417–424, 2013.

[48] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), volume 7, pages 1606–1611, 2007.

[49] L. Getoor and C. P. Diehl. Link mining: A survey. ACM SIGKDD Explorations Newsletter, 7(2):3–12, 2005.

[50] J. Golbeck and J. Hendler. Accuracy of metrics for inferring trust and reputation in semantic web-based social networks. In Lecture Notes in Computer Science, volume 3257, pages 116–131, 2004.

[51] A. V. Goldberg, H. Kaplan, and R. F. Werneck. Reach for A*: Efficient point-to-point shortest path algorithms. In Proceedings of the SIAM Workshop on Algorithm Engineering and Experiments (ALENEX 2006), pages 129–143, 2006.

[52] V. Gómez, A. Kaltenbrunner, and V. López. Statistical analysis of the social network and discussion threads in Slashdot. In Proceedings of the 17th ACM International Conference on World Wide Web (WWW 2008), pages 645–654, 2008.

[53] G. Grahne and J. Zhu. Fast algorithms for frequent itemset mining using FP-trees. IEEE Transactions on Knowledge and Data Engineering, 17(10):1347–1362, 2005.

[54] A. Gubichev, S. Bedathur, S. Seufert, and G. Weikum. Fast and accurate estimation of shortest paths in large graphs. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010), pages 499–508, 2010.

[55] A. Gubichev and T. Neumann. Path query processing on very large RDF graphs. In Proceedings of the 14th International Workshop on the Web and Databases (WebDB 2011), 6 pages, 2011.

[56] P. Hage and F. Harary. Eccentricity and centrality in networks. Social Networks, 17:57–63, 1995.

[57] P. Harish and P. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In High Performance Computing (HiPC 2007), volume 4873 of Lecture Notes in Computer Science, pages 197–208. 2007.

[58] B. He, M. Patel, Z. Zhang, and K. Chang. Accessing the deep web. Communications of the ACM, 50(5):94–101, 2007.

[59] E. M. Heemskerk, F. Daolio, and M. Tomassini. The community structure of the European network of interlocking directorates 2005–2010. PLoS ONE, 8(7), article e68581, 2013.

[60] P. Hell and J. Nešetřil. Graphs and Homomorphisms. Oxford University Press, 2004.

[61] O. Hinz, B. Skiera, C. Barrot, and J. U. Becker. Seeding strategies for viral marketing: an empirical comparison. Journal of Marketing, 75(6):55–71, 2011.

[62] I. Hsieh-Yee. Research on web search behavior. Library & Information Science Research, 23(2):167–185, 2001.

[63] J. Hu, G. Wang, F. Lochovsky, J. Sun, and Z. Chen. Understanding user’s query intent with Wikipedia. In Proceedings of the 18th International Conference on World Wide Web (WWW 2009), pages 471–480, 2009.

[64] H. Jeong, S. P. Mason, A. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, 2001.

[65] R. Jin, N. Ruan, Y. Xiang, and V. Lee. A highway-centric labeling approach for answering distance queries on large sparse graphs. In Proceedings of the ACM International Conference on Management of Data (SIGMOD 2012), pages 445–456, 2012.

[66] S. Jin and A. Bestavros. Small-world characteristics of internet topologies and implications on multicast scaling. Computer Networks, 50(5):648–666, 2006.

[67] B. H. Junker and F. Schreiber. Analysis of Biological Networks. John Wiley & Sons, 2008.

[68] U. Kang, C. E. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. Hadi: Mining radii of large graphs. ACM Transactions on Knowledge Discovery from Data, 5(2):8, 2011.

[69] A. M. Kentsch, W. A. Kosters, P. van der Putten, and F. W. Takes. Exploratory recommendations using Wikipedia’s linking structure. In Proceedings of 20th Belgium Netherlands Conference on Machine Learning (Benelearn 2011), pages 61–68, 2011.

[70] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

[71] J. Kleinberg. The small-world phenomenon: An algorithmic perspective. In Proceedings of the 32nd ACM Symposium on Theory of Computing (STOC 2000), pages 163–170, 2000.

[72] J. Kleinberg, A. Slivkins, and T. Wexler. Triangulation and embedding using small sets of beacons. In Proceedings of the 45th IEEE Symposium on Foundations of Computer Science (FOCS 2004), pages 444–453, 2004.

[73] B. Klimt and Y. Yang. The Enron corpus: A new dataset for email classification research. In Machine Learning: ECML 2004, volume 3201 of Lecture Notes in Computer Science, pages 217–226. 2004.

[74] KONECT. Koblenz network collection. Accessed February 28, 2014. http://konect.uni-koblenz.de/networks/.

[75] G. Kossinets and D. Watts. Empirical analysis of an evolving social network. Science, 311(5757):88–90, 2006.

[76] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2006), 2006.

[77] P. D. Kusuma, D. Radosavljevik, F. W. Takes, and P. van der Putten. Combining customer attribute and social network mining for prepaid mobile churn prediction. In Proceedings of 22nd Belgian Netherlands Conference on Machine Learning (Benelearn 2013), pages 50–58, 2013.

[78] C. Lampe, N. Ellison, and C. Steinfield. A Face(book) in the crowd: Social searching vs. social browsing. In Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work, pages 167–170, 2006.

[79] A. Langville and C. Meyer. Who's #1? The Science of Rating and Ranking. Princeton University Press, 2012.

[80] A. N. Langville and C. D. Meyer. Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.

[81] G. Laporte. Fifty years of vehicle routing. Transportation Science, 43(4):408– 416, 2009.

[82] LASAGNE. Laboratory of algorithms, models and analysis of graphs and networks. Accessed February 28, 2014. http://piluc.dsi.unifi.it/lasagne.

[83] J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1):5, 2007.

[84] J. Leskovec and C. Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2006), pages 631–636, 2006.

[85] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In Proceedings of the 19th International Conference on World Wide Web (WWW 2010), pages 641–650, 2010.

[86] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM International Conference on Knowledge Discovery in Data Mining (KDD 2005), pages 177–187, 2005.

[87] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data, 1(1), article 2, 2007.

[88] J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.

[89] J. Leskovec, K. J. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th International World Wide Web Conference (WWW 2010), pages 631–640, 2010.

[90] L. Lesniak. Eccentric sequences in graphs. Periodica Mathematica Hungarica, 6:287–293, 1975.

[91] A. Levitin. Introduction to the Design and Analysis of Algorithms. Addison-Wesley, 3rd edition, 2011.

[92] H.-F. Li, S.-Y. Lee, and M.-K. Shan. On mining webclick streams for path traversal patterns. In Proceedings of the 13th ACM International World Wide Web Conference (WWW 2004), pages 404–405, 2004.

[93] H.-F. Li, S.-Y. Lee, and M.-K. Shan. DSM-TKP: Mining top-k path traversal patterns over web click-streams. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2005), pages 326–329, 2005.

[94] M. Luiten, W. A. Kosters, and F. W. Takes. Topical influence on Twitter: A feature construction approach. In Proceedings of 24th Benelux Conference on Artificial Intelligence (BNAIC 2012), pages 139–146, 2012.

[95] C. Magnien, M. Latapy, and M. Habib. Fast computation of empirically tight bounds for the diameter of massive graphs. ACM Journal of Experimental Algorithmics, 13, article 10, 2009.

[96] D. Magoni and J. Pansiot. Analysis of the autonomous system network topology. ACM SIGCOMM Computer Communication Review, 31(3):26–37, 2001.

[97] D. Magoni and J. Pansiot. Internet topology modeler based on map sampling. In Proceedings of the 7th International Symposium on Computers and Communications (ISCC 2002), pages 1021–1027, 2002.

[98] D. Magoni and J.-J. Pansiot. Analysis and comparison of internet topology generators. In NETWORKING 2002: Networking Technologies, Services, and Protocols; Performance of Computer and Communication Networks; Mobile and Wireless Communications, volume 2345 of Lecture Notes in Computer Science, pages 364–375. 2002.

[99] A. Marino. Algorithms for Biological Graphs: Analysis and Enumeration. PhD thesis, Università degli Studi di Firenze, 2012.

[100] P. D. Meo, E. Ferrara, G. Fiumara, and A. Ricciardello. A novel measure of edge centrality in social networks. Knowledge-Based Systems, 30:136–150, 2012.

[101] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. ACM SIGPLAN Notices, 47(8):117–128, 2012.

[102] D. Milne and I. H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM International Conference on Information and Knowledge Management (CIKM 2008), pages 509–518, 2008.

[103] A. Mislove, M. Marcon, K. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM International Conference on Internet Measurement, pages 29–42, 2007.

[104] M. E. Newman. Clustering and preferential attachment in growing networks. Physical Review E, 64(2):025102, 2001.

[105] T. O'Guinn, C. Allen, and R. Semenik. Advertising and Integrated Brand Promotion. Cengage Learning, 2011.

[106] T. Opsahl, F. Agneessens, and J. Skvoretz. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 32(3):245–251, 2010.

[107] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report, Stanford Digital Library Technolo- gies Project, 1998.

[108] G. Palla, I. J. Farkas, P. Pollner, I. Derenyi, and T. Vicsek. Directed network modules. New Journal of Physics, 9(6):186, 2007.

[109] C. Palmer, P. Gibbons, and C. Faloutsos. ANF: A fast and scalable tool for data mining in massive graphs. In Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2002), pages 81– 90, 2002.

[110] M. Parnas and D. Ron. Testing the diameter of graphs. Random Structures & Algorithms, 20(2):165–183, 2002.

[111] G. Pavlopoulos, M. Secrier, C. Moschopoulos, T. Soldatos, S. Kossida, J. Aerts, R. Schneider, and P. Bagos. Using graph theory to analyze biological networks. BioData Mining, 4, 10 pages, 2011.

[112] M. Potamias, F. Bonchi, C. Castillo, and A. Gionis. Fast shortest path distance estimation in large networks. In Proceedings of the 18th ACM International Conference on Information and Knowledge Management (CIKM 2009), pages 867–876, 2009.

[113] J. Pujol, R. Sangüesa, and J. Delgado. Extracting reputation in multi agent systems by means of social network topology. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, pages 467–474, 2002.

[114] M. Richardson, R. Agrawal, and P. Domingos. Trust management for the semantic web. In The Semantic Web - ISWC 2003, volume 2870 of Lecture Notes in Computer Science, pages 351–368. 2003.

[115] L. Roditty and V. Vassilevska Williams. Fast approximation algorithms for the diameter and radius of sparse graphs. In Proceedings of the 45th Annual ACM Symposium on the Theory of Computing (STOC 2013), pages 515–524, 2013.

[116] R. A. Rossi and N. Ahmed. Network repository. Accessed February 28, 2014. http://networkrepository.com.

[117] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2009.

[118] A. Sala, L. Cao, C. Wilson, R. Zablit, H. Zheng, and B. Zhao. Measurement- calibrated graph models for social network experiments. In Proceedings of the 19th International Conference on the World Wide Web (WWW 2010), pages 861– 870, 2010.

[119] R. Sarukkai. Link prediction and path analysis using Markov chains. Computer Networks, 33(1):377–386, 2000.

[120] J. Scott. Social Network Analysis: A Handbook. Sage Publications, Inc, 1991.

[121] C. Sommer. Shortest-path queries in static networks. ACM Computing Surveys, 46(4), article 45, 2014.

[122] L. Šubelj and M. Bajec. Model of complex networks based on citation dynamics. In Proceedings of the 22nd International Conference on World Wide Web (WWW 2013), pages 527–530, 2013.

[123] L. Takac and M. Zabovsky. Data analysis in public social networks. In Interna- tional Scientific Conference and International Workshop on Present Day Trends of Innovations, Lomza, Poland, 6 pages, 2012.

[124] F. W. Takes. Sokoban: Reversed solving. In Proceedings of 2nd NSVKI Student Conference, pages 31–36, 2008.

[125] F. W. Takes and W. A. Kosters. Solving Samegame and its chessboard variant. In Proceedings of 21st Benelux Conference on Artificial Intelligence (BNAIC 2009), pages 249–256, 2009.

[126] F. W. Takes and W. A. Kosters. Applying Monte Carlo techniques to the capacitated vehicle routing problem. In Proceedings of 22nd Benelux Conference on Artificial Intelligence (BNAIC 2010), 2010.

[127] F. W. Takes and W. A. Kosters. Determining the diameter of small world networks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), pages 1191–1196, 2011.

[128] F. W. Takes and W. A. Kosters. Identifying prominent actors in online social networks using biased random walks. In Proceedings of the 23rd Benelux Conference on Artificial Intelligence (BNAIC 2011), pages 215–222, 2011.

[129] F. W. Takes and W. A. Kosters. The difficulty of path traversal in information networks. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012), pages 138–144, 2012.

[130] F. W. Takes and W. A. Kosters. Computing the eccentricity distribution of large graphs. Algorithms, 6(1):100–118, 2013.

[131] F. W. Takes and W. A. Kosters. Mining user-generated path traversal patterns in an information network. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2013), pages 284–289, 2013.

[132] F. W. Takes and W. A. Kosters. Adaptive landmark selection strategies for fast shortest path computation in large real-world graphs. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2014), pages 27– 34, 2014.

[133] J. Teevan, C. Alvarado, M. Ackerman, and D. Karger. The perfect search engine is not enough: A study of orienteering behavior in directed search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 415–422, 2004.

[134] M. Thorup and U. Zwick. Approximate distance oracles. Journal of the ACM, 52(1):1–24, 2005.

[135] K. Thulasiraman and M. N. Swamy. Graphs: Theory and Algorithms. John Wiley & Sons, 2011.

[136] K. Tretyakov, A. Armas-Cervantes, L. García-Bañuelos, J. Vilo, and M. Dumas. Fast fully dynamic landmark-based estimation of shortest path distances in very large graphs. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), pages 1785–1794, 2011.

[137] C. Walshaw. The graph partitioning archive. Accessed February 28, 2014. http://staffweb.cms.gre.ac.uk/~c.walshaw/partition.

[138] P. Wang, J. Hu, H. Zeng, and Z. Chen. Using Wikipedia knowledge to improve text classification. Knowledge and Information Systems, 19(3):265–281, 2009.

[139] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

[140] D. Watts and S. Strogatz. Collective dynamics of small-world networks. Nature, 393(6684):440–442, 1998.

[141] R. West and J. Leskovec. Automatic versus human navigation in information networks. In Proceedings of the AAAI International Conference on Weblogs and Social Media (ICWSM 2012), pages 362–369, 2012.

[142] R. West and J. Leskovec. Human wayfinding in information networks. In Proceedings of the 21st ACM International World Wide Web Conference (WWW 2012), pages 619–628, 2012.

[143] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, 3rd edition, 2011.

[144] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of the KDD Workshop on Mining Data Semantics, article 3, 2012.

[145] J. Yu and J. Cheng. Graph reachability queries: A survey. In Managing and Mining Graph Data, volume 40 of Advances in Database Systems, pages 181– 215. 2010.

[146] J. Yu, J. Thom, and A. Tam. Ontology evaluation using Wikipedia categories for browsing. In Proceedings of the 16th ACM International Conference on Information and Knowledge Management (CIKM 2007), pages 223–232, 2007.

[147] R. Yuster and U. Zwick. Answering distance queries in directed graphs using fast matrix multiplication. In Proceedings of the 46th IEEE Symposium on Foundations of Computer Science (FOCS 2005), pages 389–396, 2005.

[148] J. Zhang, M. Ackerman, and L. Adamic. Expertise networks in online communities: Structure and algorithms. In Proceedings of the 16th International Conference on World Wide Web (WWW 2007), pages 221–230, 2007.

[149] V. Zlatić, M. Božičević, H. Štefančić, and M. Domazet. Wikipedias: Collaborative web-based encyclopedias as complex networks. Physical Review E, 74(1):016115, 2006.

[150] U. Zwick. Exact and approximate distances in graphs: A survey. In Algorithms - ESA 2001, volume 2161 of Lecture Notes in Computer Science, pages 33–48. 2001.

Samenvatting

This thesis is about algorithms for analyzing networks, in computer science often called graphs. A network consists of objects (nodes) that are connected to each other by links (edges). In contrast to synthetic graphs, which are usually the result of applying a particular mathematical model, the emphasis here is on "real-world" graphs, meaning that the graph in question is based on data collected from an existing domain. An example is an online social network such as Facebook, in which people are connected to each other through friendships, or a webgraph, a network in which internet pages refer to each other by means of hyperlinks. Another example is a network of scientists, in which the relations between the scientists are determined, for instance, on the basis of whether they cite each other or are co-authors of a paper (see for example Figure 1.2 in Chapter 1).

Although the graphs considered in this thesis differ from each other in terms of the data they represent, the structure of these graphs shows surprising similarities. Real-world graphs are usually "sparse", which means that the number of edges of the graph is small compared to the maximum possible number of edges. Nevertheless, such graphs often contain one giant connected component that typically holds the majority (more than 99%) of the nodes. The distribution of the degree (the number of neighbors of a node) over all nodes of the graph almost always follows a power law with a long tail, which means that there are many nodes with a relatively low degree and a few nodes with a very high degree, far above the average degree. These nodes usually act as so-called "hubs" which, even though the graph is sparse, ensure that the average distance between two nodes is small compared to the total number of nodes. This so-called "small-world" property is often associated with "six degrees of separation", a theory from sociology which states that two arbitrary people are usually only six handshakes away from each other.

For computer scientists, the challenge in analyzing graphs lies, among other things, in efficiently storing, searching and computing in these graphs by means of algorithms. Traditional graph algorithms are often not practically usable here. For example, to compute the distance between two nodes, traditionally Dijkstra's shortest path algorithm is used, or Breadth First Search when the graph is unweighted. Such algorithms are, however, too expensive in terms of time and space when the graph consists of millions of nodes and perhaps hundreds of millions or even billions of edges, and thousands of shortest paths have to be computed per second. A commonly used technique to nevertheless quickly determine the distance between two nodes in a large graph uses so-called "landmarks", for which the distances have been computed in advance and via which one can subsequently navigate when the distance between two arbitrary nodes has to be computed. Chapter 5 discusses ways of selecting such a set of landmarks, and it turns out that both a spread-out location in the graph and a high centrality are important.
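
The landmark technique mentioned above can be illustrated with a minimal sketch, assuming an undirected, unweighted graph stored as an adjacency-list dictionary; the function names are illustrative and this is not the Chapter 5 implementation.

    from collections import deque

    def bfs_distances(graph, source):
        # Distances from source to all reachable nodes in an unweighted graph,
        # stored as an adjacency-list dictionary {node: [neighbors]}.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return dist

    def precompute_landmarks(graph, landmarks):
        # One BFS per landmark, done once, offline.
        return {l: bfs_distances(graph, l) for l in landmarks}

    def estimate_distance(landmark_dists, u, v):
        # Upper bound on d(u, v) via the triangle inequality, taken over all
        # landmarks that reach both endpoints.
        return min(d[u] + d[v] for d in landmark_dists.values() if u in d and v in d)
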
The centrality of a node says something about how central a node is in the graph, based on the structure of the graph. The simplest measure is "degree centrality", a measure based on the degree of a node, which assumes that a node is central when it has many neighbors. This measure is easy to compute, not least because large graphs are usually stored not as a matrix but as a list of nodes and their neighbors (an "adjacency list"), so that the number of neighbors can be read off directly. More complex centrality measures such as "closeness centrality" and "betweenness centrality" consider, respectively, the average distance from a node to all other nodes and the normalized number of times a node occurs on a shortest path, but unlike degree centrality they are harder to compute. In webgraphs, PageRank is a common centrality measure. This technique assigns a higher centrality value to a page when a large number of other pages with a high centrality point to that page. In practice, this measure is used by the search engine Google in the form of a value between 0 (not important) and 10 (very important) for a web page. In Chapter 6, various centrality measures are applied to the friendship graph of a large Dutch online social network, and it turns out that for finding the prominent people within this network, both the degree of a person and the normalized number of triangle relations among a person's friends play a role.

Whereas shortest paths and centrality measures usually compute a property of one or a few nodes, there are also measures that say something about the graph as a whole. One such measure is the diameter: the maximum distance between two nodes, or in other words the length of a longest shortest path in the graph. The diameter can be seen as a worst-case measure of distance, and for example says something about how information spreads through a network in the worst case. A naive way to compute the diameter is by means of the "All Pairs Shortest Path" (APSP) algorithm, which computes the distance between every pair of nodes; the highest value found is then the diameter. This method is quadratic in the number of nodes and edges and therefore not practically usable for the large graphs studied in this thesis. Chapter 2 introduces a new algorithm that is able to determine the diameter of a given graph much faster, by cleverly using lower and upper bounds per node and for the graph as a whole. As a result, it is not necessary to perform a Breadth First Search computation for every node, as in the APSP-based method above; instead, with only a few dozen computations and some book-keeping, the diameter can be determined exactly. An alternative way to define the diameter is to say that the diameter is equal to the maximum eccentricity over all nodes, where the eccentricity of a node is the length of a longest shortest path from that node to any other node. Chapter 3 presents both an exact algorithm and a clever approximation method for computing the eccentricity of all nodes in a graph.
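
As a small illustration of these definitions (and not of the bound-based algorithms of Chapters 2 and 3), the eccentricity of a node and the resulting naive diameter can be written down as follows, again assuming an unweighted adjacency-list graph.

    from collections import deque

    def eccentricity(graph, source):
        # Largest BFS distance from source to any reachable node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return max(dist.values())

    def naive_diameter(graph):
        # Diameter as the maximum eccentricity: one BFS per node,
        # so only feasible for small graphs.
        return max(eccentricity(graph, v) for v in graph)
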
The bound-based method described above can also be used to compute other so-called extreme distance measures. Examples are the radius (the minimum eccentricity over all nodes), the center (the set of nodes with an eccentricity equal to the radius) and the periphery (the set of nodes with an eccentricity equal to the diameter) of a graph (see Chapter 4).

Besides the algorithms described above for computing properties of (the nodes of) a graph, large graphs also pose challenges for computer scientists in the area of data mining. Data mining generally aims to obtain knowledge or information from data, and the methods used can usually be divided into predictive and descriptive techniques. Predictive data mining techniques aim to predict certain attributes of (groups of) objects in the data. Chapter 7 considers a dataset of user-generated clickpaths in the information network of the online encyclopedia Wikipedia. The emphasis there is on determining how difficult it is for a user to find a particular target page by clicking on the links present in the various Wikipedia articles. An analysis of more than two million clickpaths shows that local properties of the target page are sufficient to determine with high precision how difficult it is for a user to find a particular page.

Descriptive data mining techniques generally try to bring out the information hidden in data, for example by searching for patterns that are not directly visible when the data is inspected manually. Chapter 8 considers the individual pages within the aforementioned collection of clickpaths in Wikipedia, and examines how the set of nodes that users frequently use to navigate through the network differs from a set selected using the previously mentioned centrality measures. It turns out that the set of pages selected by users helps more to navigate efficiently through the network than a set selected using a centrality measure.

The ever-increasing amount of data that is generated nowadays can often be modeled as a graph. For computer scientists, the challenge is then to analyze this data efficiently and to automatically extract information and knowledge from it. By looking cleverly at the properties of the graph and at the complexity of the various methods and techniques, it is often possible to achieve the desired result without an enormous amount of brute computing power or specialized hardware.

Curriculum Vitae

Frank Takes was born on Thursday May 15, 1986 in Leidschendam and grew up in Zoetermeer, where he obtained his VWO (pre-university) diploma at the Alfrink College in 2004. In 2008 Frank completed his bachelor's degree in Computer Science with a minor in Business Studies at Leiden University, followed in 2010 by his master's degree in Computer Science (cum laude), also in Leiden.

During his studies Frank participated in various programming contests, including the Northwestern European regional contest in Stockholm in 2008. He was a student assistant for several bachelor courses and, as a student ambassador, was involved in the outreach activities of both the study programme and the university. In addition, Frank was successively a member of the programme committee of the Computer Science programme, the faculty council of the Faculty of Mathematics and Natural Sciences, and the university council of Leiden University.

From 2010 to 2014 Frank carried out his PhD research at the Leiden Institute of Advanced Computer Science (LIACS), the computer science institute of Leiden University, under the supervision of dr. Walter Kosters. During his PhD, Frank was involved in the lab sessions of the courses Programming Methods and Artificial Intelligence, gave various guest lectures about his research, and taught several lectures as part of a number of courses of the Computer Science programme. During this period he was also a member of the institute council and of the committee organizing social activities for the employees of the institute. From 2007 onwards, Frank was additionally active part-time in industry as a freelance IT professional.

After his PhD, Frank will initially pursue a part-time academic career combined with a part-time career in industry.

Dankwoord

This page is only available in the printed version of this thesis.

Publication List

Below is a list of publications by the author up to June 2014, in reverse chronological order.

• F. W. Takes and W. A. Kosters. Adaptive landmark selection strategies for fast shortest path computation in large real-world graphs. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2014), pages 27–34, 2014

• M. Borassi, P. Crescenzi, M. Habib, W. A. Kosters, A. Marino, and F. W. Takes. On the solvability of the six degrees of Kevin Bacon game — A faster graph diameter and radius computation method. In Proceedings of the 7th International Conference on Fun with Algorithms (FUN 2014), volume 8496 of Lecture Notes in Computer Science, pages 52–63. 2014

• G. H. Dal, W. A. Kosters, and F. W. Takes. Fast diameter computation of large sparse graphs using GPUs. In Proceedings of the 22nd IEEE International Conference on Parallel, Distributed and Network-based Processing (PDP 2014), pages 632–639, 2014

• F. W. Takes and W. A. Kosters. Mining user-generated path traversal patterns in an information network. In Proceedings of the IEEE/ACM International Conference on Web Intelligence (WI 2013), pages 284–289, 2013

• P. D. Kusuma, D. Radosavljevik, F. W. Takes, and P. van der Putten. Combining customer attribute and social network mining for prepaid mobile churn prediction. In Proceedings of 22nd Belgian Netherlands Conference on Machine Learning (Benelearn 2013), pages 50–58, 2013

• F. W. Takes and W. A. Kosters. Computing the eccentricity distribution of large graphs. Algorithms, 6(1):100–118, 2013

• M. Luiten, W. A. Kosters, and F. W. Takes. Topical influence on Twitter: A feature construction approach. In Proceedings of 24th Benelux Conference on Artificial Intelligence (BNAIC 2012), pages 139–146, 2012

• F. W. Takes and W. A. Kosters. The difficulty of path traversal in information networks. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012), pages 138–144, 2012

• F. W. Takes and W. A. Kosters. Determining the diameter of small world networks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), pages 1191–1196, 2011

• F. W. Takes and W. A. Kosters. Identifying prominent actors in online social networks using biased random walks. In Proceedings of the 23rd Benelux Conference on Artificial Intelligence (BNAIC 2011), pages 215–222, 2011

• A. M. Kentsch, W. A. Kosters, P. van der Putten, and F. W. Takes. Exploratory recommendations using Wikipedia's linking structure. In Proceedings of 20th Belgium Netherlands Conference on Machine Learning (Benelearn 2011), pages 61–68, 2011

• F. W. Takes and W. A. Kosters. Applying Monte Carlo techniques to the capacitated vehicle routing problem. In Proceedings of 22nd Benelux Conference on Artificial Intelligence (BNAIC 2010), 2010 (based on the author's master thesis)

• F. W. Takes and W. A. Kosters. Solving Samegame and its chessboard variant. In Proceedings of 21st Benelux Conference on Artificial Intelligence (BNAIC 2009), pages 249–256, 2009 (based on the author's master project study)

• F. W. Takes. Sokoban: Reversed solving. In Proceedings of 2nd NSVKI Student Conference, pages 31–36, 2008 (based on the author's bachelor thesis)


J. Businge. Co-evolution of the Eclipse Framework C.-P. Bezemer. Performance Optimization of Multi- and its Third-party Plug-ins. Faculty of Mathemat- Tenant Software Systems. Faculty of Electrical En- ics and Computer Science, TU/e. 2013-09 gineering, Mathematics, and Computer Science, TUD. 2014-04 S. van der Burg. A Reference Architecture for Dis- tributed Software Deployment. Faculty of Electrical T.M. Ngo. Qualitative and Quantitative Information Engineering, Mathematics, and Computer Science, Flow Analysis for Multi-threaded Programs. Faculty TUD. 2013-10 of Electrical Engineering, Mathematics & Computer Science, UT. 2014-05 J.J.A. Keiren. Advanced Reduction Techniques for A.W. Laarman. Scalable Multi-Core Model Check- Model Checking. Faculty of Mathematics and Com- ing. Faculty of Electrical Engineering, Mathematics puter Science, TU/e. 2013-11 & Computer Science, UT. 2014-06

D.H.P. Gerrits. Pushing and Pulling: Computing J. Winter. Coalgebraic Characterizations of push plans for disk-shaped robots, and dynamic la- Automata-Theoretic Classes. Faculty of Science, belings for moving points. Faculty of Mathematics Mathematics and Computer Science, RU. 2014-07 and Computer Science, TU/e. 2013-12 W. Meulemans. Similarity Measures and Al- M. Timmer. Efficient Modelling, Generation and gorithms for Cartographic Schematization. Fac- Analysis of Markov Automata. Faculty of Electrical ulty of Mathematics and Computer Science, Engineering, Mathematics & Computer Science, TU/e. 2014-08 UT. 2013-13 A.F.E. Belinfante. JTorX: Exploring Model-Based M.J.M. Roeloffzen. Kinetic Data Structures in the Testing. Faculty of Electrical Engineering, Mathem- Black-Box Model. Faculty of Mathematics and Com- atics & Computer Science, UT. 2014-09 puter Science, TU/e. 2013-14 A.P. van der Meer. Domain Specific Languages and L. Lensink. Applying Formal Methods in Software their Type Systems. Faculty of Mathematics and Development. Faculty of Science, Mathematics and Computer Science, TU/e. 2014-10 Computer Science, RU. 2013-15 B.N. Vasilescu. Social Aspects of Collaboration in Online Software Communities. Faculty of Mathem- C. Tankink. Documentation and Formal Mathemat- atics and Computer Science, TU/e. 2014-11 ics — Web Technology meets Proof Assistants. Fac- ulty of Science, Mathematics and Computer Sci- F.D. Aarts. Tomte: Bridging the Gap between ence, RU. 2013-16 Active Learning and Real-World Systems. Faculty of Science, Mathematics and Computer Science, C. de Gouw. Combining Monitoring with Run-time RU. 2014-12 Assertion Checking. Faculty of Mathematics and Natural Sciences, UL. 2013-17 N. Noroozi. Improving Input-Output Conformance Testing Theories. Faculty of Mathematics and Com- J. van den Bos. Gathering Evidence: Model- puter Science, TU/e. 2014-13 Driven Software Engineering in Automated Digital M. Helvensteijn. Abstract Delta Modeling: Software Forensics. Faculty of Science, UvA. 2014-01 Product Lines and Beyond. Faculty of Mathematics D. Hadziosmanovic. The Process Matters: Cyber Se- and Natural Sciences, UL. 2014-14 curity in Industrial Control Systems. Faculty of Elec- P. Vullers. Efficient Implementations of Attribute- trical Engineering, Mathematics & Computer Sci- based Credentials on Smart Cards. Faculty of ence, UT. 2014-02 Science, Mathematics and Computer Science, RU. 2014-15 A.J.P. Jeckmans. Cryptographically-Enhanced Pri- vacy for Recommender Systems. Faculty of Elec- F.W. Takes. Algorithms for Analyzing and Mining trical Engineering, Mathematics & Computer Sci- Real-World Graphs. Faculty of Mathematics and ence, UT. 2014-03 Natural Sciences, UL. 2014-16