Scalable Community Detection

Dissertation approved by the KIT Department of Informatics of the Karlsruhe Institute of Technology (KIT) for the attainment of the academic degree of

Doctor of Natural Sciences (Doktor der Naturwissenschaften)

by

Michael Alexander Hamann

Date of oral examination: 23 July 2020
First referee: Prof. Dr. Dorothea Wagner
Second referee: Prof. Dr. Ulrik Brandes

Acknowledgements

First of all, I would like to thank Dorothea Wagner for offering me the opportunity to work as a doctoral researcher in her group. While leaving me a lot of freedom to pursue the research projects and collaborations of my choice, she also ensured that I stayed focused on finishing this thesis. Many thanks also to Ulrik Brandes for agreeing to be the second referee of my thesis.

Further, I would like to thank all my colleagues, all the students I supervised and all my co-authors for the many fruitful and interesting discussions we had. Many of the results in this thesis would not exist without these discussions. In particular, I would like to thank the colleagues with whom I shared my office, first Ben Strasser and later Lars Gottesbüren, with whom I could always discuss my research ideas and who provided many ideas, much feedback and guidance.

I would also like to thank my family: my parents for encouraging me to follow my interests and supporting my wish to study informatics, and my wife for always supporting me even when I worked long hours or traveled to attend conferences and summer schools, in particular by caring for our daughter so that I could concentrate on my work for this thesis. Last but not least, I would also like to thank my daughter – she is such a lovely and happy girl, and it is always a joy to be with her, in particular in these difficult times of the coronavirus pandemic.

My work on this thesis and all results presented in it have been funded by the German Research Foundation (DFG) under grants WA 654/22-1 and WA 654/22-2 within the Priority Programme 1736 "Algorithms for Big Data". For the experiments on distributed clustering (Chapter 6) and exact editing (Chapter 10), we further acknowledge support by the state of Baden-Württemberg through bwHPC.


Contents

1 Introduction
  1.1 Our Contribution
  1.2 Preliminaries
  1.3 Common Measures
    1.3.1 Average Local Clustering Coefficient
    1.3.2 Conductance
    1.3.3 Modularity
    1.3.4 Map Equation
    1.3.5 Additional Quality Measures
  1.4 Community Detection Algorithms
    1.4.1 Louvain
    1.4.2 Label Propagation
    1.4.3 Greedy Clique Expansion (GCE)
    1.4.4 Order Statistics Local Optimization Method (OSLOM)
    1.4.5 MOSES

I Benchmark Graphs

2 Introduction to Benchmarking Community Detection Algorithms
  2.1 Synthetic Benchmark Graphs
    2.1.1 LFR
    2.1.2 CKB
    2.1.3 A-BTER
    2.1.4 nPSO
  2.2 Benchmark Instances
    2.2.1 Web Graphs
    2.2.2 SNAP
    2.2.3 Facebook 100
  2.3 Comparing Communities
    2.3.1 Normalized Mutual Information
    2.3.2 Adjusted Rand Index
    2.3.3 F1 Score

3 I/O-Efficient Generation of Massive Graphs Following the LFR Benchmark
  3.1 Introduction
    3.1.1 Random Graphs from a prescribed Degree Sequence
    3.1.2 Our Contribution
  3.2 Preliminaries and Notation
    3.2.1 Notation
    3.2.2 External-Memory Model
    3.2.3 TFP: Time Forward Processing
  3.3 The LFR Benchmark
  3.4 EM-CA: Community Assignment
    3.4.1 A simple, iterative, but not yet complete algorithm
    3.4.2 Enforcing constraint on community size (R2)
    3.4.3 Assignment with overlapping communities
  3.5 EM-HH: Deterministic Edges from a Degree Sequence
    3.5.1 Data structure
    3.5.2 Algorithm
    3.5.3 Improving the I/O-complexity
  3.6 EM-ES: I/O-efficient Edge Switching
    3.6.1 EM-ES for Independent Swaps
    3.6.2 Inter-Swap Dependencies
    3.6.3 Complexity
  3.7 EM-CM/ES: Sampling of random graphs from prescribed degree sequence
    3.7.1 Configuration Model
    3.7.2 Edge rewiring for non-simple graphs
  3.8 EM-GER/EM-CER: Merging and repairing the intra- and inter-community graphs
    3.8.1 EM-GER: Global Edge Rewiring
    3.8.2 EM-CER: Community Edge Rewiring
  3.9 Implementation
  3.10 Experimental Results
    3.10.1 Notation and Setup
    3.10.2 EM-HH's state size
    3.10.3 Inter-Swap Dependencies
    3.10.4 Test systems
    3.10.5 Performance of EM-HH
    3.10.6 Performance of EM-ES
    3.10.7 Performance of EM-CM/ES and qualitative comparison with EM-ES
    3.10.8 Convergence of EM-ES
    3.10.9 Performance of EM-LFR
    3.10.10 Qualitative Comparison of EM-LFR
  3.11 Outlook and Conclusion
  3.12 Comparing LFR Implementations
  3.13 Comparing EM-ES and EM-CM/ES

4 Benchmark Generator for Dynamic Overlapping Communities in Networks
  4.1 Introduction
    4.1.1 Our Contribution
  4.2 Related Work
  4.3 Algorithm
    4.3.1 Initialization
    4.3.2 Community Events
    4.3.3 Node Events
    4.3.4 Node-Community Assignment
    4.3.5 Edges
    4.3.6 Event Stream Generation
    4.3.7 Time Complexity
  4.4 Experiments
    4.4.1 Setup
    4.4.2 Running Time
    4.4.3 Graph Properties
    4.4.4 Community Detection Algorithm
  4.5 Conclusion

II Disjoint Density-Based Communities

5 Introduction to Disjoint Community Detection

6 Distributed Graph Clustering using Modularity and Map Equation
  6.1 Related Work
  6.2 Our Contribution
  6.3 Preliminaries
    6.3.1 Thrill
  6.4 Algorithm
    6.4.1 Calculation of Scores
    6.4.2 Distributed Synchronous Local Moving (DSLM)
    6.4.3 Distributed Contraction and Unpacking
  6.5 Experiments
    6.5.1 Weak Scaling
    6.5.2 Quality
    6.5.3 Real-World Graphs
  6.6 Conclusion

7 Structure-Preserving Sparsification Methods for Social Networks
  7.1 Introduction
    7.1.1 Contribution
  7.2 Edge Sparsification
    7.2.1 Global and Local Filtering
    7.2.2 Limits of Edge Sparsification
  7.3 Sparsification Methods
    7.3.1 Random Edge (RE)
    7.3.2 Triangles
    7.3.3 (Local) Jaccard Similarity (JS, LJS)
    7.3.4 Simmelian Backbones (TS, QLS)
    7.3.5 Edge Forest Fire (EFF)
    7.3.6 Algebraic Distance (AD)
    7.3.7 Local Degree (LD)
  7.4 Implementation
  7.5 Experimental Study
    7.5.1 Setup
    7.5.2 Correlations between Edge Scores
    7.5.3 Similarity in Network Properties
    7.5.4 Running Time
  7.6 Conclusion

III Editing

8 Introduction to Editing
  8.1 Related Work
  8.2 Our Contribution

9 Fast Quasi-Threshold Editing
  9.1 Introduction
    9.1.1 Outline
    9.1.2 Preliminaries
  9.2 Lower Bounds
  9.3 The Algorithm proposed by Nastos and Gao
  9.4 Linear Recognition and Initial Editing
  9.5 The Quasi-Threshold Mover Algorithm
    9.5.1 Basic Idea
    9.5.2 Close Children
    9.5.3 Potential Parents
    9.5.4 Proof of Correctness
    9.5.5 Proof of the Running Time
  9.6 Experimental Evaluation
    9.6.1 Comparison with Nastos and Gao's Results
    9.6.2 Large Graphs
    9.6.3 Synthetic Graphs
    9.6.4 Case Study: Caltech
  9.7 Conclusion

10 Engineering Exact Quasi-Threshold Editing
  10.1 Introduction
    10.1.1 Our Contribution
    10.1.2 Outline
  10.2 Preliminaries
  10.3 Integer Linear Programming
    10.3.1 Row Generation
    10.3.2 Optimizing Constraints for {C4, P4}-free Editing
  10.4 The FPT Branch-and-Bound Algorithm
    10.4.1 Avoiding Redundancy
    10.4.2 Skip Forbidden Subgraph Conversion
    10.4.3 Existing Lower Bound Approaches
    10.4.4 Local Search Lower Bound
    10.4.5 Branch on Most Useful Node Pairs
    10.4.6 Prune Branches Early
    10.4.7 Parallelization
  10.5 Experimental Evaluation
    10.5.1 Variants of the FPT Algorithm
    10.5.2 Parallelization
    10.5.3 Variants of the ILP
    10.5.4 Comparison to QTM
    10.5.5 Social Network Instances
    10.5.6 Evaluation of the Found Solutions
  10.6 Conclusion
  10.7 Implementation Details
    10.7.1 Subgraph Listing
    10.7.2 Subgraph Counters
    10.7.3 Lower Bound Algorithms
    10.7.4 Branching Strategies
    10.7.5 Parallelization

IV Local Communities

11 Local Community Detection Based on Small Cliques
  11.1 Introduction
    11.1.1 Related Work
  11.2 Density-Based Local Community Detection Algorithms
    11.2.1 GCE M/L
    11.2.2 Two-Phase L
    11.2.3 LFM
    11.2.4 PageRank-Nibble
  11.3 Local Community Detection Algorithms Based on Small Cliques
    11.3.1 LTE
    11.3.2 Local T
    11.3.3 Clique Based Community Expansion
    11.3.4 Triangle Based Community Expansion
  11.4 Experiments
    11.4.1 Experimental Setup
    11.4.2 Scoring
    11.4.3 Synthetic Graphs
    11.4.4 Overlapping Communities
    11.4.5 Facebook Graphs
    11.4.6 Running Times
    11.4.7 Summary
  11.5 Conclusion

V Conclusion

12 Conclusion and Outlook

Bibliography

1 Introduction

A graph consists of objects, called nodes, that are connected by edges. These connections may have a certain strength or a direction. Graphs can be used in many contexts where objects have relations to each other [Fon+11]. A common example throughout this thesis is social networks, where the nodes represent humans that are, e.g., connected by friendships. Other examples are scientists as nodes that are connected by an edge if they co-authored a paper, or the World Wide Web, where nodes are websites and edges represent links between them. Graphs are also used in biology, e.g., to model proteins as nodes and interactions between them as edges.

Given a large graph, we might wonder how it is structured. While for small graphs a visualization might give important insights, this becomes increasingly difficult with the size of the graph. Frequently, graphs do not have edges distributed uniformly at random; instead, there are groups of nodes that are internally densely connected with few connections to other nodes. Such groups are commonly called communities or clusters. Detecting communities, a common task in network analysis, helps us understand the overall structure of graphs but also provides us with sets of nodes we should look at together. In a protein interaction network, a community might be a group of proteins with similar function, while in a social network it might be a group of friends.

Community detection is a well-researched topic [Sch07; For10; AP14; Har+14; FH16; HKW16; Cha+17; Cop+19]. While there is no universally accepted definition of a good community, it is commonly agreed that communities should be internally densely and externally sparsely connected. Many measures have been proposed to formalize this fuzzy concept [Cha+17]. We consider two formalizations for disjoint communities following this principle: the famous modularity [NG04] and the more recently proposed map equation [RAB09]. As modularity optimization is NP-hard [Bra+08], heuristics such as the well-known Louvain algorithm [Blo+08] are used in practice. Similarly, for the map equation we are also only aware of heuristics [RAB09; BH15; Bae+17; ZY18; Fun19; FA19]. Most of these heuristics are based on the idea of moving nodes between communities to improve the quality measure, combined with hierarchical contractions that combine several nodes into a single node and thus allow moving nodes together.

A different concept is to consider an ideal model for communities like a disjoint union of complete graphs, also called cliques. We are then looking for such a graph that is close to a given graph in the sense that we only need to remove and insert few edges. This is called cluster editing [BB13]. In biology, such models are used to detect communities, for example in protein interaction graphs, where the strength of interaction between proteins gives natural costs for edge insertions and deletions [Böc+09; Wit+11].

All of these approaches have in common that they partition a graph into disjoint communities such that every node belongs to exactly one community. While this model is quite well-understood, it cannot model real-world scenarios where nodes belong to more than a single community. For example, in a social network, a person might be part of a group of friends at work, a group of friends from school, a group of friends at a sports club and also a close family circle [ABL10; ML12]. Therefore, many models and algorithms for overlapping communities have been proposed. For overlapping communities, there is a much stronger focus on the quality of individual communities, as it is more challenging to come up with good global quality measures that can be optimized efficiently [XKS13; AP14; FH16]. As these scenarios are a lot more complex and we cannot expect communities to be clearly separated, such algorithms usually work only under very specific conditions or need a lot of running time [Buz+14; Gel19].

Many real-world networks evolve over time: for example, friendships in a social network change over time, new users join a social network and others leave it. Further, we can also define graphs based on temporal features, like a social network where an edge exists only if two people have interacted in a certain time frame. When detecting communities on such graphs, a natural request is that we do not want communities to change dramatically from one time step to another unless there was a dramatic change in the network structure [HKW16; CR19]. This means that community detection algorithms for evolving networks should take the community structure of previous time steps into account – not just to save running time, but also to produce better results.

An additional challenge is the ever-increasing amount of data that is available today. This makes it necessary to consider scenarios where the whole graph does not fit into the RAM of a single computer anymore. Scaling the computation across a cluster of compute servers is a natural way to handle such massive graphs. Another one is to use so-called external memory, nowadays typically SSDs, to store data and to use access patterns that suit them. Further, we might also simply discard parts of the graph [SPR11] and hope we still have enough left to accurately detect communities. We might also not be interested in communities of the whole graph but only in communities containing a certain node or a group of nodes. Here, we can use local algorithms that avoid visiting the whole graph.

An important part of the design of a community detection algorithm is to evaluate how well it actually works. Due to the lack of a commonly agreed-on definition of what constitutes a good community, this is a difficult problem in itself. For most graphs derived from real-world networks, we either do not know what communities we are looking for or whether they can even be detected [Bad+14; FH16]. As a remedy, synthetic benchmark graph generators have been designed. They provide a graph and a community structure that we can expect to be detectable due to the way the graph is generated. At the same time, such benchmark generators try to construct graphs that have properties typically found in real-world networks, such as a skewed degree distribution. LFR benchmark graphs [LFR08; LF09a] are the most widely used benchmark graphs for community detection [FH16; Emm+16]. They support disjoint and overlapping communities as well as optionally weighted and/or directed graphs. The more recently proposed CKB benchmark [Chy+14] considers only overlapping communities, with a skewed distribution of the number of communities per node.


1.1 Our Contribution

The first part of this thesis deals with benchmarking community detection algorithms. We give an overview of existing approaches and provide two new graph generators. The remaining parts then deal with different models of communities – disjoint density-based communities, editing, and local communities. For each of these models, we introduce new algorithms that improve the state of the art in terms of quality and/or scalability.

The first part starts with an overview of existing benchmark graph generators for community detection as well as networks that are frequently used as benchmark instances. We then present two novel graph generators. In Chapter 3, we present the first external memory graph generator that implements the LFR benchmark graph model and is able to generate graphs larger than the RAM of a single compute server. It outperforms the original implementation of the LFR benchmark already for graphs still fitting into RAM while producing graphs that closely follow the original model. In Chapter 4, we present our second graph generator, which deals with dynamically changing communities in a dynamic graph model based on the CKB model. We show that our generator is able to create graphs that have stable properties over time while featuring a significantly changing community structure with smooth changes.

In the second part, we explore two approaches to make existing disjoint community detection algorithms scalable. In Chapter 6, we present a distributed community detection algorithm that can optimize modularity and the map equation. It is based on the Thrill framework [Bin+16] and the idea of moving several nodes at once between communities. We show that it outperforms the state-of-the-art distributed algorithm for optimizing the map equation in terms of the quality of the found community structure while being faster and needing less memory. Chapter 7 explores how removing edges according to different rankings affects the performance of clustering algorithms. We show that while there are methods that very effectively select intra-cluster edges in LFR benchmark graphs, they lead to vastly different communities on real-world graphs. On the other hand, random sampling seems to be a promising technique that, up to certain rates, keeps the community structure mostly intact.

In the third part, we propose scalable algorithms for the problem of editing a given graph into a quasi-threshold graph, a model proposed for detecting communities in social networks [NG13]. In Chapter 9, we present a scalable algorithm that solves this problem heuristically. We show that our algorithm scales well to graphs with hundreds of millions of edges. In order to examine its quality, in Chapter 10, we also consider solving the problem exactly using a combinatorial branch-and-bound algorithm and an integer linear program – albeit on much smaller graphs. Our branch-and-bound algorithm is also able to efficiently enumerate all solutions, allowing an analysis of all found solutions – for some graphs, there are thousands that differ significantly. We show that our heuristic algorithm is frequently exact or close to a best solution. Further, our branch-and-bound algorithm is able to exploit parallelism more efficiently than the commercial Gurobi ILP solver, making it faster than the ILP even though it enumerates all solutions.


The last part deals with detecting local communities around a given seed node without exploring the whole graph. In Chapter 11, we conduct an extensive experimental study and show that starting from the largest clique in the neighborhood of the seed node significantly improves results on synthetic as well as real-world networks.

For all our algorithms, we also publish implementations; links to them are provided in the respective chapters. We implemented many of our algorithms in NetworKit [SSM16]. Some of them are already included in NetworKit; for others, we provide extended versions of NetworKit that we intend to contribute for future releases. NetworKit combines efficient graph algorithms implemented in C++ with a Python interface. This allows the interactive analysis of networks and makes it easy to write scripts that implement experimental workflows. We use this to also analyze the results of algorithms that were not implemented in NetworKit, like our external memory LFR generator. For some of these evaluations, we also contributed new algorithms to NetworKit, e.g., to analyze the difference between two graphs for our work on exact editing (Chapter 10).

The remainder of this introduction presents the notation and important concepts used throughout the thesis.

1.2 Preliminaries

In this section we introduce the notation as well as important concepts that we use throughout this thesis. The notation is mostly the same as the one we used in the papers the individual chapters are based on. This section is based on the corresponding section in our papers on external memory graph generation [Ham+18b] and on local community detection [HRW17]. The former is joint work with Ulrich Meyer, Manuel Penschuck, Hung Tran and Dorothea Wagner, while the latter is joint work with Eike Röhrs and Dorothea Wagner.

We define $[k] := \{1, \dots, k\}$ for $k \in \mathbb{N}_{>0}$. The notation $[x_i]_{i=a}^{b}$ denotes an ordered sequence $[x_a, x_{a+1}, \dots, x_b]$.

Graphs. A graph $G$ is a tuple $G = (V, E)$ of $n := |V|$ sequentially numbered nodes $V = \{v_1, \dots, v_n\}$ and $m := |E|$ edges. Unless stated differently, we assume that graphs are undirected. To obtain a unique representation of an undirected edge $\{u, v\} \in E$ as a tuple, we sometimes use ordered edges $[v_i, v_j] \in E$ implying $i \le j$. The edge is still undirected; the ordering is just used algorithmically and does not carry any meaning. An edge $\{u, v\}$ is incident to a node $v_i$ if $v_i \in \{u, v\}$. Two nodes $v_i, v_j$ are adjacent if they are connected by an edge, i.e., $\{v_i, v_j\} \in E$. In general, we assume that our input graphs are simple, i.e., without self-loops and multi-edges. In some algorithms, though, we deal with graphs with self-loops and multi-edges in intermediate steps. In some cases, we also consider weighted graphs $G = (V, E, w)$ with weights $w \colon E \to \mathbb{R}_{>0}$. We then treat unweighted graphs as weighted graphs with $w(\{v_i, v_j\}) = 1$ for every edge $\{v_i, v_j\} \in E$.

The neighbors $N(v_i) = \{v_j \mid \{v_i, v_j\} \in E\}$ of a node $v_i$ are the nodes that are adjacent to it. For a set of nodes $S \subseteq V$, we denote by $N(S)$ the union of all neighborhoods of the nodes in $S$ without $S$, i.e., $N(S) := \bigcup_{u \in S} N(u) \setminus S$.

The degree $\deg(v_i) := |N(v_i)|$ of a node $v_i$ is its number of neighbors. Let $\Delta$ denote the maximum degree over all nodes in $V$. For weighted graphs, the weighted degree $\deg_w(v_i) = \sum_{v_j \in N(v_i)} w(\{v_i, v_j\})$ is the sum of the edge weights of all edges incident to $v_i$. For (weighted) degrees, we count self-loops twice. $\mathcal{D} = [d_i]_{i=1}^{n}$ is a degree sequence of graph $G$ iff $\forall v_i \in V : \deg(v_i) = d_i$.

Communities. A community or cluster¹ $C \subseteq V$ is a node subset of size $n_C := |C|$. A clustering $\mathcal{C}$ is a set of clusters. A clustering $\mathcal{C}$ is disjoint if each node $v_i \in V$ is part of exactly one cluster. A disjoint clustering is sometimes also called a partition of the nodes, while a possibly overlapping clustering is frequently called a cover. The shell $S(C)$ is a synonym for the neighbors of the community $C$, i.e., $S(C) = N(C)$. The boundary $B(C)$ of a community $C$ are the nodes in $C$ that have neighbors in the shell $S(C)$, i.e., $B(C) = \{v_i \in C \mid N(v_i) \cap S(C) \neq \emptyset\}$.

Cut, Volume. The volume $\mathrm{vol}(S)$ of a node set $S$ is the sum of its nodes' degrees. The weighted volume $\mathrm{vol}_w(S)$ of a node set $S$ is analogously the sum of their weighted degrees. We define the weighted cut of two disjoint node sets $S, T \subseteq V$, $S \cap T = \emptyset$, as $\mathrm{cut}_w(S, T) := \sum_{u \in S, v \in T} w(\{u, v\})$. As a simplification, we write $\mathrm{cut}_w(v, S) := \mathrm{cut}_w(\{v\}, S)$ for the weighted cut between a node $v$ and a node set $S$, and $\mathrm{cut}_w(S) := \mathrm{cut}_w(S, V \setminus S)$ for the cut between $S$ and the rest of the graph.

Clique. A clique is a set of nodes that induces a complete subgraph in $G$. In other words, $A \subseteq V$ is a clique if and only if $\forall u \in A, \forall v \in A \setminus \{u\} : \{u, v\} \in E$. Then, a maximal clique $Q$ is a clique that cannot be enlarged by adding a node. A clique is called a triangle if it consists of exactly three nodes.

Randomization and Distributions. $\mathrm{Pld}([a, b), \gamma)$ denotes an integer powerlaw distribution with exponent $-\gamma \in \mathbb{R}$ for $\gamma \geq 1$ and values from the interval $[a, b)$; let $X$ be an integer random variable drawn from $\mathrm{Pld}([a, b), \gamma)$, then $\mathbb{P}[X = k] \propto k^{-\gamma}$ (proportional to) if $a \leq k < b$ and $\mathbb{P}[X = k] = 0$ otherwise. For $X = [x_i]_{i=1}^{n}$, we define the mean $\langle X \rangle := \sum_{i=1}^{n} x_i / n$ and the second moment $\langle X^2 \rangle := \sum_{i=1}^{n} x_i^2 / n$ of the sequence $X$. A statement depending on some $x > 0$ is said to hold with high probability if it is satisfied with probability at least $1 - 1/x^c$ for some constant $c \geq 1$.
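To make the $\mathrm{Pld}$ notation concrete, the following Python sketch draws integers with probabilities proportional to $k^{-\gamma}$ by explicitly normalizing the weights. It is a naive illustration of the definition, not the efficient sampling routines used by the generators discussed later in this thesis.

```python
import random

def sample_pld(a, b, gamma, rng=random):
    """Draw one integer from Pld([a, b), gamma): P[X = k] ~ k^(-gamma)
    for a <= k < b. Naive sketch via explicitly normalized weights."""
    ks = range(a, b)
    weights = [k ** -gamma for k in ks]
    return rng.choices(ks, weights=weights)[0]

# estimate the mean <X> and the second moment <X^2> of a sampled sequence
xs = [sample_pld(10, 100, 2.0) for _ in range(10_000)]
mean = sum(xs) / len(xs)
second_moment = sum(x * x for x in xs) / len(xs)
```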

1.3 Common Measures

In this section, we introduce common measures for properties of graphs and in particular of communities. We primarily cover those that are used in several parts of this thesis, but also mention some additional measures in Section 1.3.5.

¹In this thesis, we use both terms interchangeably.


1.3.1 Average Local Clustering Coefficient

The local clustering coefficient of a node is the density of its neighborhood, i.e., the number of its neighbors that are connected by an edge divided by the number of possible edges [Kai08; New10]. This only considers the graph and not a particular community structure. The average local clustering coefficient is the average of this score over all nodes of the graph. Every edge that connects two neighbors of a node spans a triangle, thus algorithms for listing triangles can be used to compute the clustering coefficient efficiently [OB14].

If the average local clustering coefficient is higher than we expect in a random graph with the same number of edges, we can assume that edges are not uniformly distributed but that there are denser groups. While this is not a measure of a community structure, it can certainly indicate that we can expect communities. A sufficiently large community can also be internally dense without having a high local clustering coefficient, so the converse is not necessarily true. In many graphs from applications like social networks, we expect that communities have many triangles and thus there should be no communities without triangles. In Chapter 11, we show how measures based on very similar ideas can be used to detect communities around a seed.
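The following Python sketch computes the average local clustering coefficient directly from the definition by counting, for each node, the edges among its neighbors (each such edge closes a triangle). It is a quadratic-per-node illustration; the triangle-listing algorithms cited above are far more efficient on large graphs.

```python
from itertools import combinations

def average_local_clustering(adj):
    """Average local clustering coefficient of a simple undirected graph.

    adj maps each node to the set of its neighbors.
    """
    total = 0.0
    for v, neighbors in adj.items():
        d = len(neighbors)
        if d < 2:
            continue  # nodes of degree < 2 contribute 0 by convention
        # count edges among the neighbors of v; each one closes a triangle
        links = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
        total += 2 * links / (d * (d - 1))
    return total / len(adj)

# toy example: a triangle with a pendant node
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_local_clustering(adj))  # (1 + 1 + 1/3 + 0) / 4
```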

1.3.2 Conductance

The conductance is a measure that formalizes the fuzzy concept of an internally densely and externally sparsely connected community. It is defined as the cut divided by the volume of the community or the remainder of the graph:

$$\mathrm{conductance}(C) = \frac{\mathrm{cut}_w(C)}{\min(\mathrm{vol}_w(C), \mathrm{vol}_w(V \setminus C))}$$

A low conductance thus means that a community is well-separated, i.e., only few of its edges leave it. This does not necessarily mean that the community is well-connected internally. For example, consider a graph consisting of several connected components. While every connected component has a conductance of zero, groups of connected components have a conductance of zero, too. Further, the community consisting of the whole graph always has a conductance of zero. As conductance measures only the quality of a single community, measures like the average conductance of all communities have been proposed to measure the quality of a whole clustering [GKW14].
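A minimal Python sketch of this definition for unweighted graphs (every edge weight is 1) follows; it simply counts the cut edges and compares the community's volume with the remaining volume.

```python
def conductance(adj, community):
    """Conductance of a node set in a simple unweighted graph,
    following the definition above."""
    community = set(community)
    cut = sum(1 for v in community for u in adj[v] if u not in community)
    vol_c = sum(len(adj[v]) for v in community)
    vol_rest = sum(len(adj[v]) for v in adj) - vol_c
    return cut / min(vol_c, vol_rest)

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(conductance(adj, {0, 1, 2}))  # cut 1, min(7, 3) -> 1/3
```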

1.3.3 Modularity

A simple measure for a clustering is the coverage, i.e., the fraction of edges that are inside a community. On its own, this measure is not useful as an optimization criterion, as the coverage is maximal when all edges are inside a single community consisting of all nodes of the graph. Newman and Girvan propose to subtract from the actual coverage the expected coverage in a random graph with the same node degrees [NG04]. This measure is called modularity and is the basis of many community detection algorithms. Using our terminology, it can be defined as follows:


$$Q(\mathcal{C}) := \sum_{C \in \mathcal{C}} \frac{\mathrm{vol}_w(C) - \mathrm{cut}_w(C)}{\mathrm{vol}_w(V)} - \sum_{C \in \mathcal{C}} \left(\frac{\mathrm{vol}_w(C)}{\mathrm{vol}_w(V)}\right)^2$$

Modularity suffers from a so-called resolution limit [FB07]. This means that in some cases, modularity scores can be improved by merging small but clearly distinct communities. The size of the merged communities depends on the size of the graph, i.e., on smaller graphs, smaller communities will be found. An effect of this is that community detection algorithms optimizing modularity tend to detect communities of relatively uniform size. A resolution parameter has been introduced that allows controlling the approximate size of the detected communities [RB06]. While this allows finding smaller (or larger) communities, both cannot be found at the same time [Kum+07]. Finding a clustering with maximum modularity is NP-hard [Bra+08]. Algorithms used in practice are thus heuristics such as the famous Louvain algorithm [Blo+08] that we describe in Section 1.4.1. In Chapter 6, we present a distributed algorithm for optimizing modularity.
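For illustration, the following Python sketch evaluates the volume/cut formulation of $Q$ given above for an unweighted graph and a disjoint clustering; it is a direct transcription of the formula, not an optimization algorithm.

```python
def modularity(adj, clustering):
    """Modularity Q of a disjoint clustering of a simple unweighted graph,
    using the volume/cut formulation given above."""
    vol_total = sum(len(adj[v]) for v in adj)  # vol_w(V) = 2m
    q = 0.0
    for community in clustering:
        community = set(community)
        vol = sum(len(adj[v]) for v in community)
        cut = sum(1 for v in community for u in adj[v] if u not in community)
        q += (vol - cut) / vol_total - (vol / vol_total) ** 2
    return q

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(modularity(adj, [{0, 1, 2}, {3, 4}]))  # 0.22
```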

1.3.4 Map Equation

The map equation [RAB09] is the quality measure optimized by the Infomap algorithm [RAB09]. It is based on the idea that a random walk on a graph that chooses the next node uniformly at random among the neighbors of the current node tends to stay in communities. Instead of actually simulating random walks, the map equation measures the expected code word length of the description of such a random walk using a two-level code. There is a code word for each community, and there are separate dictionaries of words for each community that contain a code word for every node inside the community as well as an exit code word. The description thus starts with a community code word, then continues with code words of that community until the exit code word is encountered, which is followed by another community code word and so on. A good community structure should lead to shorter code word lengths, as random walks are more likely to stay inside communities. Thus, lower scores of the map equation indicate better clusterings. While we use this two-level formulation for undirected graphs, the map equation allows for very natural extensions to hierarchical communities using multi-level codes and also to directed graphs. The expected code word length for two-level codes on undirected graphs can be expressed by the following formula. To simplify its definition, let $\mathrm{plogp}(x) := x \log x$.

$$L(\mathcal{C}) := \mathrm{plogp}\left(\sum_{C \in \mathcal{C}} \frac{\mathrm{cut}_w(C)}{\mathrm{vol}_w(V)}\right) - 2 \sum_{C \in \mathcal{C}} \mathrm{plogp}\left(\frac{\mathrm{cut}_w(C)}{\mathrm{vol}_w(V)}\right) + \sum_{C \in \mathcal{C}} \mathrm{plogp}\left(\frac{\mathrm{cut}_w(C) + \mathrm{vol}_w(C)}{\mathrm{vol}_w(V)}\right) - \sum_{v \in V} \mathrm{plogp}\left(\frac{\deg_w(v)}{\mathrm{vol}_w(V)}\right)$$


The last term of the map equation is independent of the actual clustering and can thus be omitted for optimization. The map equation has a resolution limit, too [KR15]; however, in practice it is much less pronounced than for modularity. We are not aware of any results regarding the complexity of optimizing the map equation. The heuristic Infomap algorithm [RAB09] has been proposed for optimizing the map equation; it is similar to the Louvain algorithm described in Section 1.4.1. In Chapter 6, we also present a distributed algorithm for optimizing the map equation.
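The two-level formula can be evaluated directly from the per-community cuts and volumes, as the following Python sketch shows for an unweighted graph. Note that it uses the natural logarithm for simplicity; code lengths in bits would use the base-2 logarithm, which only scales the score.

```python
from math import log

def plogp(x):
    return x * log(x) if x > 0 else 0.0

def map_equation(adj, clustering):
    """Two-level map equation L for a disjoint clustering of a simple
    unweighted, undirected graph, following the formula above."""
    vol_total = sum(len(adj[v]) for v in adj)
    cuts, vols = [], []
    for community in clustering:
        community = set(community)
        cuts.append(sum(1 for v in community for u in adj[v]
                        if u not in community))
        vols.append(sum(len(adj[v]) for v in community))
    return (plogp(sum(cuts) / vol_total)
            - 2 * sum(plogp(c / vol_total) for c in cuts)
            + sum(plogp((c + v) / vol_total) for c, v in zip(cuts, vols))
            - sum(plogp(len(adj[v]) / vol_total) for v in adj))

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(map_equation(adj, [{0, 1, 2}, {3, 4}]))  # lower is better
```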

1.3.5 Additional Quality Measures

Surprise [AMM05; AM11] measures how surprising, i.e., unlikely, it is that a random graph has at least as many intra-cluster edges as observed. Finding a clustering with optimum surprise is NP-hard [FKW14]. To allow for fast heuristic algorithms, an approximation has been suggested [TAD15]. While good results on synthetic and real-world graphs have been reported [AM11; TAD15], a recent study [Xia+18] suggests that there might be problems with splitting communities with heterogeneous degrees and detecting too many communities. In a master's thesis I supervised [Wie19], we also observed much smaller communities compared to communities found by optimizing the map equation or modularity.

Reichardt and Bornholdt [RB06] not only generalize modularity to include a resolution parameter, but also suggest using a simple random Erdős-Rényi graph [Gil59; ER59] as null model. The constant Potts model (CPM) [TDN11] is similar to such generalizations of modularity but does not have a resolution limit. Instead, it has an explicit parameter that determines the resolution of the found communities independently of the provided graph. When choosing this parameter correctly, its performance on synthetic benchmarks is excellent [TDN11]. Significance has been introduced [TKD13] as a way of determining this resolution parameter but can also be used as a quality measure on its own. It is based on a similar idea as Surprise, but instead of considering the probability of a fixed partition containing that many edges, it considers the probability of finding a partition with that many edges in a random graph.

Generative models consider the likelihood of the given graph under a certain model of communities, in particular stochastic block models. Each block represents a community, and the probability of an edge $\{u, v\}$ depends on the blocks of $u$ and $v$, i.e., for each pair of blocks there is an edge probability. While early models [SN97] do not consider the degrees of the nodes, models with a correction for node degrees show much better results [KN11]. Overlapping models have also been proposed [Pei15] that even allow comparing the fit of different models via the minimum description length.


1.4 Community Detection Algorithms

Already a survey in 2010 [For10] listed over 450 references, and since then many more algorithms have been published. Therefore, it is impossible to give a comprehensive overview of the field as part of this thesis. Instead, we highlight a few algorithms that have been shown to perform well in benchmarks and introduce their algorithmic principles in the following sections. In the individual chapters of this thesis, we introduce related work specific to the different scenarios, i.e., for distributed algorithms in Chapter 6, for editing-based algorithms in Chapter 8 and for local community detection algorithms in Chapter 11.

1.4.1 Louvain

One of the most popular community detection algorithms is the so-called Louvain algorithm [Blo+08], named after the university of its authors at the time. It is based on two main components: local moving and contraction, which are applied alternately. The algorithm starts by assigning each node to its own community. Then, it iterates over the nodes of the graph in a random order, and for each node u, it checks for each of its neighbors v how much the modularity would increase or decrease when moving u into v's community. If an improvement is possible, u is moved into the neighbor's community that gives the maximum improvement. This process is repeated in rounds until no node has been moved in a round or a stopping criterion is met. Typical stopping criteria include limiting the number of rounds or requiring a minimum modularity improvement. After this local moving phase, the graph is contracted such that each community becomes a node in the contracted graph. To ensure that the modularity is the same after the contraction as before, edges inside a community are represented as a self-loop whose weight is the sum of the edge weights in the contracted subgraph. Resulting multi-edges between communities are reduced to single edges by summing up their weights. On this contracted graph, the local moving phase is repeated. Local moving and contraction phases are repeated until, after a local moving phase, each node is still in its own community. In a final uncontraction step, the community assignment of each contracted node is prolonged to the original nodes.

Many variations of this basic algorithm have been proposed. A straightforward extension is refinement [RN11]: local moving is executed again to improve the community structure after every uncontraction level. The Leiden algorithm [TWE19] additionally selects more carefully which nodes shall be contracted together and uses a strategy where only certain nodes are active to accelerate local moving.

The Louvain algorithm has been adapted for shared-memory parallel as well as distributed scenarios [Que+15; Gho+18a]. NetworKit contains a parallel version of the Louvain method called PLM, optionally with additional refinement (PLMR) [SM16]. The implementation in NetworKit also allows disabling the parallelism and is then equivalent to the original Louvain algorithm. The Infomap algorithm that has been proposed for optimizing the map equation [RAB09] is also based on the Louvain algorithm, with additional steps for identifying sub-clusters before contraction and additional refinement steps.
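To make the local moving phase concrete, the following Python sketch implements modularity-based local moving on an unweighted graph; the contraction phase is omitted, and the gain expression drops constant factors (which does not change the argmax). This is an illustrative sketch, not NetworKit's PLM implementation.

```python
import random

def local_moving(adj, max_rounds=32):
    """Louvain-style local moving for simple unweighted graphs: repeatedly
    move each node to the neighboring community with the largest modularity
    gain until a round makes no move. Contraction is omitted."""
    vol_total = sum(len(adj[v]) for v in adj)   # = 2m
    community = {v: v for v in adj}             # start: singleton communities
    vol = {v: len(adj[v]) for v in adj}         # volume of each community
    for _ in range(max_rounds):
        moved = False
        nodes = list(adj)
        random.shuffle(nodes)
        for u in nodes:
            deg_u = len(adj[u])
            vol[community[u]] -= deg_u          # take u out of its community
            links = {}                          # edges from u per community
            for v in adj[u]:
                links[community[v]] = links.get(community[v], 0) + 1
            # modularity gain of joining community c, up to a constant factor
            def gain(c):
                return links.get(c, 0) - deg_u * vol[c] / vol_total
            best = max(links, key=gain, default=community[u])
            if gain(best) > gain(community[u]):
                community[u] = best
                moved = True
            vol[community[u]] += deg_u
        if not moved:
            break
    return community
```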

1.4.2 Label Propagation

The basic idea of label propagation is similar to local moving, but instead of optimizing a particular measure, each node simply joins the community most of its neighbors are in. Initially, each node is in its own community. Ties are broken uniformly at random; thus, the first steps are purely random until the first communities start to form. Label propagation is simple and fast, but the quality of the found communities depends a lot on the properties of the graph. In particular in dense graphs, it can easily happen that a single community emerges that contains all or most nodes. To compensate for this problem, the growth of a community can be dampened by reducing its weight every time it is adopted by a new node or by setting the number of neighbors in relation to the community's size [ŠB11]. Label propagation has also been adapted for overlapping communities by keeping several communities per node. SLPA [XSL11] is such an adaptation. While some benchmarks [XKS13] show that it performs well, other comparisons [Buz+14] cannot confirm this – the most likely explanation is the different parametrization of the used benchmarks. The distributed EgoLP algorithm is another example of overlapping label propagation [Buz+14]. It has 14 parameters; the authors propose two different parameter sets for synthetic benchmarks with three and six overlapping communities per node, again confirming the high sensitivity of label propagation to graph and parameter choices. Implementations that use external memory exist, too [ASS15].
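A minimal Python sketch of the basic (disjoint, undampened) variant follows: every node repeatedly adopts the majority label among its neighbors, with ties broken uniformly at random, until no label changes.

```python
import random
from collections import Counter

def label_propagation(adj, max_rounds=100):
    """Basic label propagation: each node adopts the label carried by the
    majority of its neighbors, ties broken uniformly at random."""
    labels = {v: v for v in adj}  # each node starts in its own community
    for _ in range(max_rounds):
        changed = False
        nodes = list(adj)
        random.shuffle(nodes)
        for v in nodes:
            if not adj[v]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            best = random.choice([l for l, c in counts.items() if c == top])
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels
```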

1.4.3 Greedy Clique Expansion (GCE)

Greedy Clique Expansion (GCE) [Lee+10] starts by detecting all maximal cliques of the graph. Each of them is used as the seed of a community that is greedily expanded by adding nodes that minimize a measure similar to the conductance of the community. The authors find that many cliques have large overlaps and that many expanded communities end up identical or very similar. Thus, a major part of GCE is filtering such duplicates.

1.4.4 Order Statistics Local Optimization Method (OSLOM)

OSLOM, the Order Statistics Local Optimization Method, is an overlapping community detection algorithm [Lan+11]. It uses a measure of statistical significance to evaluate single communities. For a community C, it measures whether a prefix of a list of candidates has significantly more connections to C than expected by chance in a random null model. Nodes already part of C are also repeatedly re-evaluated according to this criterion. For each community, OSLOM considers both adding neighboring nodes and removing existing nodes in a cleaning-up procedure. A community is significant if it is non-empty more than half of the times the cleaning-up procedure is applied. OSLOM detects communities by repeatedly expanding single nodes into communities. The procedure stops if it detects communities similar to existing communities again and again. OSLOM also checks if communities contain significant sub-communities or if merging multiple communities gives significant communities. Thus, OSLOM is also able to detect hierarchical communities. The authors further propose extensions to weighted and directed graphs. Finally, OSLOM can use an existing community structure as the starting point of its search, thus allowing the algorithm to be used for dynamic graphs.

1.4.5 MOSES

In contrast to the previously discussed overlapping community detection algorithms, MOSES [MH10] defines a quality measure for overlapping communities that is then optimized. This quality measure is based on a stochastic block model for overlapping communities; the algorithm optimizes the likelihood of the graph given that model. MOSES uses a combination of expanding single edges into communities, moving nodes between communities and deleting single communities. All of these steps greedily optimize the quality measure.


Part I

Benchmark Graphs


2 Introduction to Benchmarking Community Detection Algorithms

When developing or choosing a community detection algorithm, there are usually two questions: how fast is it, and how good are its results? Measuring running times in a reproducible way is not trivial, as they might depend on many factors, but given a certain compute system and input instance, it is quite straightforward. Measuring the quality of the results, on the other hand, is much less straightforward, even given a certain input graph. Depending on the use case, one might want to have smaller or larger, or more or less uniform community sizes, to name just a few parameters. Frequently, graphs from applications are also not available due to data protection or corporate interests. Graphs from applications where we know what communities we want to find are even rarer, and even for those that are frequently used, it is often not clear if those communities can actually be discovered from the graph structure alone [PLC17].

To make comparing community detection algorithms easier and less arbitrary (due to the choice of input graph), synthetic benchmarks for community detection have been introduced [GN02; LFR08]. Due to the structure of the benchmark graph, we expect that good community detection algorithms will be able to detect a certain known ground truth community structure. Over time, several benchmarks have been proposed; in the following, we will introduce some of them. The most popular one is certainly the LFR benchmark [LFR08; LF09a], which we also mainly use in this thesis and for which we propose a scalable external memory implementation in Chapter 3. By choosing parameters accordingly, benchmark graphs can for example show that certain community detection algorithms have problems detecting small and large communities at the same time.

Unfortunately, synthetic benchmark graphs also have their own set of problems. This starts with the problem of parameter choice. If, for example, a node has only one edge to nodes of its own community, we cannot reasonably expect community detection algorithms to recover this community. Nevertheless, such configurations of the LFR benchmark, with an average degree of ten and nodes that are part of up to eight communities, are used to conclude that "all of the algorithms do not yet achieve satisfying performance" for those graphs where half of the nodes are part of more than two communities [XKS13]. Apart from the obvious problem of choosing suitable parameters, the realism of benchmark graphs can also be questioned. Many models use graphs that are completely random apart from the modeled communities, thus featuring a very regular structure. As we will also see in this thesis in Chapters 6 and 7 on distributed clustering and sparsification, many algorithms behave completely differently on real-world and synthetic benchmark graphs. Therefore, while synthetic benchmark graphs provide a good way to examine the quality of a community detection algorithm, they cannot replace an evaluation on graphs from an actual application. It might even be that certain community detection algorithms work only on benchmark graphs or only on certain application graphs, as they exploit specific properties of these graphs. For this reason, we also include graphs from applications in our evaluations.

Evaluating detected communities using ground truth communities requires comparing them. Given two sets of communities, we want to know how well they match, ideally as a simple score in the range [0, 1], where 1 means they are identical. Even for disjoint communities, it is not clear when the score should reach zero and how to penalize, e.g., merging two communities compared to moving single nodes between communities. For overlapping communities, there are even more degrees of freedom, as community detection algorithms can also detect additional communities not present in the ground truth, up to the point of detecting every possible set of nodes as a community, which includes all ground truth communities but is certainly not a useful community detection algorithm [PSZ10; FH16; Jeb+18].

In the following sections, we introduce synthetic benchmark graph generators as well as a set of graphs derived from applications that are used throughout this thesis. Further, we discuss measures for comparing communities.

2.1 Synthetic Benchmark Graphs

One of the simplest graph models is Gilbert's $G(n, p)$ model of a graph with $n$ nodes where every edge exists with probability $p$ [Gil59]. The $G(n, m)$ model of a random graph with $n$ nodes and $m$ uniformly distributed edges is closely related [ER59], as $G(n, m)$ is equivalent to $G(n, p)$ when drawing $m$ from a binomial distribution with probability $p$ and $\binom{n}{2}$ trials for unweighted, simple graphs (for graphs with self-loops or directed graphs, the number of trials needs to be adjusted). Both models are commonly known as Erdős-Rényi graphs.

While Erdős-Rényi graphs themselves are not particularly interesting as benchmark graphs for community detection, apart from serving as a baseline for graphs where no communities should be found, they are the building block of the stochastic block model. Its simplest form is the planted partition model, where nodes are split into equally-sized groups. Two nodes inside the same group are connected with probability $p$, while two nodes that belong to different groups are connected with some probability $r < p$ [CK01] (see the sketch at the end of this section). This can be generalized to an arbitrary assignment of nodes to blocks and a matrix of connection probabilities for each combination of two blocks [SN97]. A popular version of this planted partition model is the benchmark proposed by Girvan and Newman [GN02]. It consists of 128 nodes with expected degree 16 that are divided into four groups of 32 nodes each.

As graphs with non-uniform community sizes and node degrees are much more challenging for community detection than such uniform communities, the LFR benchmark has been proposed, with community sizes and node degrees following power-law distributions [LFR08; LF09a]. Variants of it also provide directed and/or weighted edges and possibly overlapping communities [LF09a]. Today, the LFR benchmark is widely used for the evaluation of community detection algorithms [FH16; Emm+16]. We describe the LFR benchmark in more detail in Section 2.1.1.

For overlapping communities, the typical parametrization of the LFR benchmark uses a quite simple model, though: a parameter determines the number of nodes that belong to more than one community, and another parameter determines the number of communities each of these nodes is part of. Studies on networks with communities defined by metadata show, however, that the number of communities per node rather follows a power-law distribution than being uniform [YL12a; YL14]. They model each community as an Erdős-Rényi random graph and show that this AGM model [YL12a; YL14], fitted to a given graph, can capture many properties of the given graph like clustering coefficients. The CKB generator [Chy+14] is based on the AGM model, but instead of fitting the model to a given graph, it describes a random graph model where community sizes as well as the number of communities per node follow power-law distributions. Communities are generated as Erdős-Rényi random graphs with decreasing density for larger communities, as also observed with the AGM model [YL14]. In Section 2.1.2, we describe the CKB benchmark in more detail.

Recently, the distributed A-BTER generator has been proposed [Slo+19], which we discuss in Section 2.1.3; it has some similarities with LFR but is definitely not the same model. In Section 2.1.4, we discuss the nPSO model [MC18], which sounds interesting but has not received much attention so far.
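As a concrete illustration of the planted partition model described above, the following Python sketch generates its edge list. The probabilities p_in and p_out are illustrative choices roughly in the spirit of the Girvan-Newman setting, not prescribed values.

```python
import random

def planted_partition(n_groups, group_size, p_in, p_out, seed=0):
    """Planted partition model: nodes in the same group are connected with
    probability p_in, nodes in different groups with probability p_out."""
    rng = random.Random(seed)
    n = n_groups * group_size
    group = [v // group_size for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if group[u] == group[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, group

# Girvan-Newman-style setting: four groups of 32 nodes each
edges, group = planted_partition(4, 32, p_in=0.4, p_out=0.05)
```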

2.1.1 LFR

This description of the LFR benchmark is based on our description of the LFR benchmark in our work on local community detection [HRW17] and our survey on network generation [Pen+20]. The first is joint work with Eike Röhrs and Dorothea Wagner, while the second is joint work with Manuel Penschuck, Ulrik Brandes, Sebastian Lamm, Ulrich Meyer, Ilya Safro, Peter Sanders and Christian Schulz.

The LFR benchmark has first been introduced for unweighted, undirected graphs [LFR08] and later extended to directed and weighted graphs with (possibly) overlapping communities [LF09a]. The node degrees are drawn from a power-law distribution with user-supplied parameters; community sizes are drawn from an independent power-law distribution. The parameter $\mu_t$ determines the fraction of neighbors of every node $u$ that are not part of $u$'s communities. The remainder of a node's degree, its internal degree, is divided evenly among all communities it belongs to. For weighted graphs, the parameter $\mu_w$ determines the fraction of the weighted degree that is distributed among neighbors of $u$ that are not part of $u$'s communities, while the remainder is evenly divided among all communities $u$ belongs to. Hence, $\mu_t$ is referred to as the topological mixing parameter and $\mu_w$ as the weight mixing parameter. If $\mu_t = \mu_w$, the edge weight is on average 1.0 for intra- and inter-community edges. In the case of unweighted graphs, we will frequently just use $\mu$ to refer to $\mu_t$.

Nodes are assigned to communities at random such that each node's internal degree is smaller than the size of the community, as otherwise not enough neighbors can be found. For overlapping communities, $O_m$ specifies the number of communities every node is part of. For each community, LFR generates a random graph with the given internal degrees. It analogously samples the global inter-community graph with the remaining degrees. In an additional step, it ensures that no edges of the global graph are within a community and that no edge is part of multiple communities. In Chapter 3 we propose an algorithm to generate graphs following the LFR benchmark model that are larger than the RAM of a single machine, using external memory in the form of SSDs.
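The degree-splitting step just described can be sketched in Python as follows. The rounding and corner-case handling of the reference implementation are more involved, so this is only an illustration of the $\mu_t$ semantics; degrees and membership counts are assumed to be drawn beforehand, e.g., from Pld distributions as in Section 1.2.

```python
def lfr_split_degrees(degrees, mu_t, memberships):
    """Sketch of LFR's degree splitting: a mu_t fraction of each node's
    degree becomes inter-community stubs; the internal remainder is divided
    evenly among the node's communities. memberships[v] is the number of
    communities node v belongs to (O_m in the text)."""
    internal, external = [], []
    for d, om in zip(degrees, memberships):
        d_int = round((1 - mu_t) * d)
        external.append(d - d_int)
        # divide the internal degree as evenly as possible among om communities
        internal.append([d_int // om + (1 if i < d_int % om else 0)
                         for i in range(om)])
    return internal, external

# disjoint case: every node in exactly one community, mu_t = 0.3
internal, external = lfr_split_degrees([10, 20, 15], 0.3, [1, 1, 1])
```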

2.1.2 CKB

The CKB benchmark graphs are inspired by the AGM model [YL12a]. In the AGM model, a community is a $G(n, p)$ graph, i.e., a graph of $n$ nodes where every edge exists with probability $p$ [Gil59]. The authors show [YL12a] that overlapping communities of this type may explain many properties observed in real-world networks when the number of communities the nodes are part of follows a power-law distribution. In the CKB generator [Chy+14], the authors explicitly generate such a community structure. They draw community sizes as well as the number of communities per node from user-supplied power-law distributions. The density of the communities decreases with increasing size of the communities. A global epsilon community adds additional noise to the graph. The authors implemented the CKB benchmark as a distributed algorithm using Apache Spark [Zah+16]. In Chapter 4, we propose a dynamic version of the CKB model.
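The core edge-generation idea can be sketched as follows in Python. The node-community assignment (with power-law distributed sizes and memberships) is assumed to be given, and the parameter alpha and the exact density function are illustrative placeholders, not the formulas of the CKB paper.

```python
import random

def ckb_edges(communities, alpha=2.0, seed=0):
    """Sketch of CKB-style edge generation: each community (given as a list
    of its member nodes) becomes an Erdos-Renyi graph whose edge probability
    decreases with the community size."""
    rng = random.Random(seed)
    edges = set()
    for members in communities:
        if len(members) < 2:
            continue
        p = min(1.0, alpha / (len(members) - 1))  # larger -> sparser
        for i, u in enumerate(members):
            for v in members[i + 1:]:
                if rng.random() < p:
                    edges.add((min(u, v), max(u, v)))
    return edges

# two overlapping communities: node 4 belongs to both
print(len(ckb_edges([[0, 1, 2, 3, 4], [4, 5, 6, 7]])))
```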

2.1.3 A-BTER

This description of A-BTER has been taken from our survey on network generation [Pen+20], which is joint work with Manuel Penschuck, Ulrik Brandes, Sebastian Lamm, Ulrich Meyer, Ilya Safro, Peter Sanders and Christian Schulz.

A-BTER [Slo+19] (Adapted BTER) uses a specially configured version of BTER [Kol+14] to generate benchmark graphs for community detection similar to the LFR benchmark. At the same time, A-BTER replicates the degree and clustering coefficient distribution of a given graph, similarly to BTER. Using a heuristic scaling mechanism, A-BTER slightly adapts these distributions to match a prescribed average and maximum degree and clustering coefficient on a possibly larger graph. A-BTER takes a mixing parameter that, as in the LFR model, denotes the fraction of inter-cluster edges. Using a linear program, A-BTER further adjusts the degree and clustering coefficient distributions such that the resulting graph matches the mixing parameter while keeping the adjustments minimal. The assignment to affinity blocks and the edge generation closely follow the BTER model. Using the parallel edge-skipping technique [Ala+16], the actual intra- and inter-community edges are generated efficiently in a parallel, distributed implementation based on MPI and OpenMP. On 512 compute nodes, each with 128 GB RAM and two Marvell ThunderX2 ARM processors with 28 cores per processor, A-BTER generates a graph with 925 billion edges and 4.6 billion nodes in 76 seconds. While the resulting graphs do not perfectly match the original LFR model, the authors show that community detection algorithms behave similarly on them. Further, they show that both the degree and clustering coefficient distributions and the mixing parameter are as desired.


2.1.4 nPSO

The nonuniform popularity-similarity-optimization (nPSO) model [MC18] distributes points across the hyperbolic plane according to a mixture of Gaussians. Each node is connected to its k closest neighbors, where neighbors are selected with decreasing probability. The authors report running times of a minute for 1000 nodes, indicating that the current implementation is not very scalable. The authors present some experiments showing that the Louvain algorithm (see Section 1.4.1) is capable of detecting the communities, at least with some configurations. Further experiments would be necessary to determine its suitability as a benchmark.

Different models of hyperbolic random graphs have been shown to model many features observed in real-world networks [Kri+10]. Thus, the nPSO model might be an interesting candidate for a benchmark that possibly not only features a community structure but also more realistic, non-trivial structures within and between communities that deviate from simpler random models like the LFR or the CKB benchmark.

2.2 Benchmark Instances

One of the most famous benchmark instances is the karate graph of Zachary [Zac77], a small social network (34 nodes, 78 edges) of the members of a karate club that split into two. Other famous benchmark instances include the "lesmis", the dolphins and the football network. The nodes of the "lesmis" network are the characters of the novel "Les Misérables" by Victor Hugo; two nodes are connected if they appear in a chapter together [Knu93]. The dolphins network is a network of a bottlenose dolphin community living off Doubtful Sound, a fjord in New Zealand, where two nodes are connected if the two dolphins spent significant time together [Lus+04]. The football network contains football clubs and their matches in one season [GN02]. All of them are rather small; the football network is the largest with 115 nodes and 613 edges. Thus, they are not suitable for the evaluation of scalable community detection algorithms. We use these small graphs only for the evaluation of our quasi-threshold editing algorithms in Chapters 9 and 10, as we compare to results published for them in [NG13]. Here, we also include the less commonly used "grass_web" network that models the food chain of grassland species [DHC95], as it has also been included in [NG13]. In the remaining parts, we use web graphs, a set of larger social networks from the SNAP collection and the Facebook 100 data set. We introduce them in the following sections.

2.2.1 Web Graphs

As part of the 10th DIMACS implementation challenge on graph partitioning and graph clustering [Bad+13], a larger benchmark set has been published [Bad+14]. Apart from the aforementioned small instances and many other networks, it also contains a set of larger web graphs [Bol+04; Bol+11; BV04]. Each node corresponds to a URL that has been discovered during a web crawl that has been restricted to a certain top-level domain

that is indicated in the name of the graph. Edges represent links between the web pages at these URLs. For none of these graphs any reference structure is known. While the URL of each node is available, we are not aware that this data is commonly used to evaluate community detection algorithms. The largest web graph included in the 10th DIMACS challenge is the uk-2007-05 graph with about 100 million nodes and 3 billion edges. This makes these graphs interesting for our work on scalable community detection. For our scalable editing heuristic in Chapter 9, we use four of these web graphs with up to 260 million edges, while for our distributed community detection algorithm in Chapter 6, we also use the largest graph uk-2007-05.

2.2.2 SNAP

Another source of frequently used benchmark instances in community detection is the SNAP collection [LK14]. In particular, it contains a set of networks with reference community structure [YL12b]. We use the six undirected networks LiveJournal, Friendster, Orkut, Youtube, DBLP and Amazon. LiveJournal, Friendster, Orkut and Youtube are social networks where users can declare friendships, which are represented as edges. Each of them also has groups users can join; those are the reference communities. The DBLP network is a co-authorship network, i.e., each node represents an author and two nodes are connected if they published a paper together. Here, the reference communities are publication venues, i.e., each conference or journal forms a community. In the Amazon network, nodes are products and edges are between products that have been frequently purchased together. The reference communities are the product categories that are provided by Amazon. For all networks, reference communities consisting of multiple connected components have been split into their connected components, and reference communities with fewer than 3 nodes have been removed. While it is claimed that those communities are "ground-truth communities", most of these communities cannot be recovered from the structure of the networks [HDF14; Har+14; FH16]. Therefore, we do not compare to these reference communities in our experiments.

2.2.3 Facebook 100

This introduction has been adapted from our paper on local community detection [HRW17] that is joint work with Eike Röhrs and Dorothea Wagner. The Facebook 100 data set was first published by Traud et al. [TMP12]. It consists of 100 graphs, which are snapshots from September 2005. Each of these graphs describes the social network of Facebook at a university or college and is restricted to that university/college. The graph of a university or college consists of the persons who work, study, or studied there. Most of the nodes represent students. Unweighted edges between persons (nodes) exist if they are friends on Facebook. Further, the nodes have attributes, like graduation year, dormitory, major and high school. While the smallest graph has just 769 nodes and 16 thousand edges, the largest graphs have up to 41 thousand nodes and almost 1.6 million edges. It has been shown that at least for some of the graphs there are correlations between edges and the attributes dormitory and graduation year [TMP12].


Visualizations [NOB15] also suggest that these attributes are represented in the structure of some of the networks. The Facebook networks are therefore frequently used for testing and comparing algorithms on real-world data by looking at the correlations with their attributes [Lee+10; LC13; SMM14]. While it is clear that dormitory and graduation year certainly play a role in the social organization of students, they are also clearly not the only factors. In our experiments on local community detection in Chapter 11, we also use this dormitory structure as a comparison, while in our experiments on sparsification in Chapter 7, we only use the network structure. As our experiments show, for a few networks we can approximately recover this structure, while for others this is not the case.

2.3 Comparing Communities

To compare communities, we use three measures: the popular normalized mutual information, the adjusted Rand index and the F1 score. In the following, we introduce these three measures. For an overview of other proposed measures, we refer to surveys [WW07; Cha+17].

2.3.1 Normalized Mutual Information

Normalized mutual information (NMI) is the most frequently used comparison measure. The mutual information $\mathrm{MI}(\mathcal{C}, \mathcal{D})$ between two clusterings $\mathcal{C}$, $\mathcal{D}$ is:

$$\mathrm{MI}(\mathcal{C}, \mathcal{D}) := \sum_{C \in \mathcal{C}} \sum_{D \in \mathcal{D}} \frac{|C \cap D|}{n} \log \frac{n \cdot |C \cap D|}{|C| \cdot |D|}$$

For NMI, this mutual information is normalized by the entropy $H$ of $\mathcal{C}$ and $\mathcal{D}$. The entropy of a clustering $\mathcal{C}$ is

$$H(\mathcal{C}) = -\sum_{C \in \mathcal{C}} \frac{|C|}{n} \log \frac{|C|}{n}$$

The NMI can then be defined as:

$$\mathrm{NMI}(\mathcal{C}, \mathcal{D}) := \frac{\mathrm{MI}(\mathcal{C}, \mathcal{D})}{0.5 \cdot (H(\mathcal{C}) + H(\mathcal{D}))}$$

Note that there are several possibilities for the normalization; in particular, instead of the average also the maximum of the two entropy values can be used [Kva87]. A problem of NMI is that random clusterings with a high number of clusters can get non-zero scores, even above 0.5, when comparing to the ground truth [VEB09; AP15]. For overlapping communities, several generalizations of NMI have been proposed [LFK09; MGH11; VR12]. The first of them has been shown to have strange behavior: e.g., if one compared clustering consists of a single cluster that perfectly corresponds to a cluster in the other clustering, the score will be 0.5 [MGH11]. In this thesis, we use the one proposed by Esquivel and Rosvall [VR12], but only to compare clusterings found on

instances generated by several implementations of the LFR benchmark, not to actually evaluate community detection algorithms.
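To make the definition concrete, the following minimal Python sketch computes NMI with the average normalization for two disjoint clusterings; the representation as node-to-cluster dictionaries and all names are our own illustration, not code from any implementation evaluated in this thesis.

    from collections import Counter
    from math import log

    def nmi(part1, part2):
        """NMI with average normalization; partX maps node -> cluster id."""
        n = len(part1)
        c1 = Counter(part1.values())   # cluster sizes |C| of the first clustering
        c2 = Counter(part2.values())   # cluster sizes |D| of the second clustering
        joint = Counter((part1[v], part2[v]) for v in part1)  # |C intersect D|
        mi = sum(nij / n * log(n * nij / (c1[i] * c2[j]))
                 for (i, j), nij in joint.items())
        h1 = -sum(s / n * log(s / n) for s in c1.values())
        h2 = -sum(s / n * log(s / n) for s in c2.values())
        return mi / (0.5 * (h1 + h2))  # undefined if both clusterings are trivial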

2.3.2 Adjusted Rand Index

The similarity between two clusterings $\mathcal{C}$ and $\mathcal{D}$ can also be defined by comparing node pairs. Let $a$ be the number of pairs of nodes that are together in a cluster in both clusterings. Similarly, let $b$ be the number of pairs of nodes that are separated in both clusterings. The remaining node pairs are separated in one clustering and together in the other; let the number of these pairs be $c$. Then the Rand measure $R$ is defined as

$$R = \frac{a + b}{a + b + c} = \frac{a + b}{n(n-1)/2}$$

As the Rand index is not zero for two random clusterings, an adjustment for chance has been proposed [HA85]. The adjusted Rand index is the normalized difference of the Rand index and its expected value on clusterings that are drawn randomly with the same number of clusters and elements in each cluster as $\mathcal{C}$ and $\mathcal{D}$. It is defined as follows:

$$\mathrm{ARI} := \frac{\sum_{C \in \mathcal{C}} \sum_{D \in \mathcal{D}} \binom{|C \cap D|}{2} - t_3}{0.5(t_1 + t_2) - t_3}$$

where $t_1 = \sum_{C \in \mathcal{C}} \binom{|C|}{2}$, $t_2 = \sum_{D \in \mathcal{D}} \binom{|D|}{2}$ and $t_3 = t_1 t_2 / \binom{n}{2}$ [WW07]. Note that while the ARI is zero in expectation for random clusterings, it can also be negative. The maximum value of 1 indicates that both clusterings are identical.
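The pair-counting quantities $t_1$, $t_2$ and $t_3$ translate directly into code; the sketch below is again only an in-memory illustration with names of our choosing.

    from collections import Counter

    def adjusted_rand_index(part1, part2):
        """ARI via the pair-counting formula; partX maps node -> cluster id."""
        comb2 = lambda k: k * (k - 1) // 2   # binomial coefficient (k choose 2)
        n = len(part1)
        c1 = Counter(part1.values())
        c2 = Counter(part2.values())
        joint = Counter((part1[v], part2[v]) for v in part1)
        sum_pairs = sum(comb2(k) for k in joint.values())
        t1 = sum(comb2(k) for k in c1.values())
        t2 = sum(comb2(k) for k in c2.values())
        t3 = t1 * t2 / comb2(n)
        return (sum_pairs - t3) / (0.5 * (t1 + t2) - t3)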

2.3.3 F1 Score

When comparing just a single found community to a single ground truth community, we are basically evaluating a binary classifier. The F1 score, the harmonic mean of precision and recall, is a frequently used measure of such a classifier's accuracy. Given a found community $C$ and a ground truth community $D$, the precision is:

$$\mathrm{precision}(C, D) := \frac{|C \cap D|}{|C|}$$

while the recall is:

$$\mathrm{recall}(C, D) := \frac{|C \cap D|}{|D|}$$

The F1 score is then

$$F_1(C, D) := \frac{2 \cdot \mathrm{precision}(C, D) \cdot \mathrm{recall}(C, D)}{\mathrm{precision}(C, D) + \mathrm{recall}(C, D)}$$

If we compare a found community to a (possibly overlapping) set of ground truth communities, we compare it to the best-matching ground truth community, i.e., given a found community $C$ and a set of ground truth communities $\mathcal{D}$, we define:

32 2.3 Comparing Communities

$$F_1(C, \mathcal{D}) := \max_{D \in \mathcal{D}} F_1(C, D)$$

This can be extended to a set of found communities by using the weighted average:

$$F_w(\mathcal{C}, \mathcal{D}) = \frac{1}{\sum_{C \in \mathcal{C}} |C|} \sum_{C \in \mathcal{C}} |C| \cdot F_1(C, \mathcal{D})$$

This score only tells us if every found community has a well-matching ground truth community, but not if every ground truth community also has a well-matching found community. To measure both, we use the average weighted F1 score, which is the average of $F_w(\mathcal{C}_G, \mathcal{C}_D)$ and $F_w(\mathcal{C}_D, \mathcal{C}_G)$ for the ground truth communities $\mathcal{C}_G$ and the detected communities $\mathcal{C}_D$. A similar score has been used in previous work [YL12a], but they used the unweighted average. We instead use the weighted average to give larger communities, which are typically underrepresented due to the power law distributions used in benchmark generators, a larger influence on the score. This is similar to [LA99], though they only considered one direction and not the average of both.
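The three definitions combine into a few lines of Python; as before, this is a minimal sketch with our own names, operating on communities given as node sets.

    def f1(C, D):
        """F1 score of a found community C vs. a ground truth community D."""
        inter = len(C & D)
        if inter == 0:
            return 0.0
        precision, recall = inter / len(C), inter / len(D)
        return 2 * precision * recall / (precision + recall)

    def weighted_f1(found, truth):
        """F_w: size-weighted average of each community's best-matching counterpart."""
        total = sum(len(C) for C in found)
        return sum(len(C) * max(f1(C, D) for D in truth) for C in found) / total

    def average_weighted_f1(found, truth):
        """Average weighted F1 score: symmetric mean of both directions."""
        return 0.5 * (weighted_f1(found, truth) + weighted_f1(truth, found))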


3 I/O-Efficient Generation of Massive Graphs Following the LFR Benchmark

This chapter is based on joint work with Ulrich Meyer, Manuel Penschuck, Hung Tran and Dorothea Wagner [Ham+18b]. Parts of it have been published as [Ham+17]. Compared to the publication, this chapter has been slightly adapted to reference parts of this thesis where appropriate and we added a discussion of follow-up work on Curveball trades [Car+18] to the outlook. We thank the anonymous reviewers for their insightful comments and recommendations and Hannes Seiwert and Mark Ortmann for valuable discussions on EM-HH.

We present EM-LFR, the first external memory algorithm able to generate massive complex networks following the LFR benchmark. Its most expensive component is the generation of random graphs with prescribed degree sequences which can be divided into two steps: the graphs are first materialized deterministically using the Havel-Hakimi algorithm, and then randomized. Our main contributions are EM-HH and EM-ES, two I/O-efficient external memory algorithms for these two steps. We also propose EM-CM/ES, an alternative sampling scheme using the Configuration Model and rewiring steps to obtain a random simple graph. In an experimental evaluation we demonstrate their performance; our implementation is able to handle graphs with more than 37 billion edges on a single machine, is competitive with a massively parallel distributed algorithm, and is faster than a state-of-the-art internal memory implementation even on instances fitting in main memory. EM-LFR's implementation is capable of generating large graph instances orders of magnitude faster than the original implementation. We give evidence that both implementations yield graphs with matching properties by applying clustering algorithms to generated instances. Similarly, we analyze the evolution of graph properties as EM-ES is executed on networks obtained with EM-CM/ES and find that the alternative approach can accelerate the sampling process.

3.1 Introduction

With the emergence of massive networks that cannot be handled in the main memory of a single computer, new clustering schemes have been proposed for advanced models of computation [Buz+14; ZY16]. Since such algorithms typically use hierarchical input representations, quality results of small benchmarks may not be generalizable to larger instances [Emm+16; Ham+18a], see also Chapter 6. Often though, the quality is only evaluated on small benchmark graphs as currently available graph clustering benchmark generators are unable to generate the necessary graphs [BH15; Buz+14].

Instead, computationally inexpensive random graph models such as RMAT are used [Que+13] to generate huge graphs. Using those models, it is however not possible to evaluate whether the clustering algorithm is actually able to detect communities on such a large graph as there is no ground truth community structure to compare against. Filling this gap, we propose a generator in the external memory (EM) model following the LFR benchmark in order to produce clustering benchmark graph instances exceeding main memory. We implement the variants of the LFR benchmark for unweighted, undirected graphs with either overlapping or non-overlapping communities. Our proposed graph benchmark generator has already been used to evaluate the clustering quality of distributed clustering algorithms on graphs with up to 512 million nodes and 76.6 billion edges [Ham+18a], see also Chapter 6. The distributed CKB benchmark [Chy+14] is a step in a similar direction; however, it considers only overlapping clusters and uses a different model of communities. In contrast, our approach is a direct realization of the established LFR benchmark and supports both disjoint and overlapping clusters.

3.1.1 Random Graphs from a prescribed Degree Sequence

The LFR benchmark uses the fixed degree sequence model (FDSM), also known as edge switching Markov chain algorithm (e.g. [Mil+03]), to obtain a random graph following a previously computed degree sequence. In preliminary studies, we identified this task as the main issue when transferring the LFR benchmark into an EM setting; both in terms of algorithmic complexity and runtime. FDSM consists of two steps, namely (i) generating a deterministic graph from a prescribed degree sequence and (ii) randomizing this graph using random edge switches. For each edge switch, two edges are chosen uniformly at random and two of the endpoints are swapped if the resulting graph is still simple (cf. section 3.6). Each edge switch can be seen as a transition in a Markov chain. This Markov chain is irreducible [EH80], symmetric and aperiodic [GMZ03] and therefore converges to the uniform distribution. It also has been shown to converge in polynomial time if the maximum degree is not too large compared to the number of edges [GS18]. However, the current analytical bounds of the mixing time are impractically high even for small graphs. Experimental results on the occurrence of certain motifs in networks [Mil+03] suggest that 100m steps should be more than enough where m is the number of edges. Further results for random connected graphs [GMZ03] suggest that the average and maximum path length and link load converge between 2m and 8m swaps. More recently, further theoretical arguments and experiments showed that 10m to 30m steps are enough [RPS15]. A faster way to realize a given degree sequence is the Configuration Model which allows multi-edges and self-loops. In the Erased Configuration Model these illegal edges are deleted. Doing so, however, alters the graph properties and does not properly realize the skewed degree distributions required for LFR [SHZ15]. In this context the question arises whether edge switches starting from the Configuration Model can be used to uniformly sample simple graphs at random.


3.1.2 Our Contribution

We introduce EM-LFR¹, the first I/O-efficient LFR variant, and study the FDSM in the external memory model. After defining our notation, we summarize the original LFR benchmark in section 3.3. As illustrated in Figure 3.3, EM-LFR consists of several algorithmic building blocks which we discuss in sections 3.4 to 3.8. Here, the focus lies on FDSM consisting of (i) generating a deterministic graph from a prescribed degree sequence (cf. EM-HH, section 3.5) and (ii) randomizing this graph using random edge switches (cf. EM-ES, section 3.6). For EM-HH, we describe a streaming algorithm whose internal data structure only has an I/O complexity linear in the number of different degrees if a monotonous degree sequence is provided. To execute a number of edge switches proportional to the number $m$ of edges, EM-ES triggers $\mathcal{O}(\mathrm{sort}(m))$ I/Os. For EM-LFR, the I/O complexity is the same as it is dominated by the edge randomization step. In section 3.7, we additionally describe EM-CM/ES, an alternative to FDSM. It generates uniform random non-simple graphs using the Configuration Model in $\mathcal{O}(\mathrm{sort}(m))$ I/Os and then obtains a simple graph by applying edge rewiring steps. We conclude with an experimental evaluation of our algorithms and demonstrate that our EM version of the FDSM is faster than an existing state-of-the-art implementation even for instances still fitting into RAM. It scales well to large networks, as we demonstrate by handling a graph with 37 billion edges on a desktop computer, and is almost an order of magnitude more efficient than an existing distributed parallel algorithm. Further, we compare EM-LFR to the original LFR implementation and show that EM-LFR is significantly faster while producing equivalent networks in terms of community detection algorithm performance and graph properties. An LFR benchmark graph with more than $1 \times 10^{10}$ edges can be generated in 17 h on a single server with 64 GB RAM and 3 SSDs. We also investigate the mixing time of EM-ES and EM-CM/ES and give evidence that our alternative sampling scheme quickly yields uniform samples and that the number of swaps suggested by the original LFR implementation can be kept for EM-LFR.

3.2 Preliminaries and Notation

In this section, we briefly introduce the notation we use, and give an introduction to the external memory model as well as Time Forward Processing, a crucial design principle used in EM-ES.

3.2.1 Notation

We use the notation defined in Section 1.2. Unless stated differently, our EM algorithms represent a graph $G = (V, E)$ as a sequence containing for every ordered edge $[u, v] \in E$ only the entry $(u, v)$. Also refer to Table 3.1, which contains a summary of commonly used definitions.

¹ The implementation is freely available at https://massive-graphs.org/extmem-lfr. Amongst others, it contains encapsulated implementations of EM-ES and EM-CM/ES which can be easily reused for novel applications.


Figure 3.1: Left: Dependency graph of the Fibonacci sequence (ignoring base cases), with values $x_2 = 1$, $x_3 = 2$, $x_4 = 3$, $x_5 = 5$, $x_6 = 8$, $x_7 = 13$. Right: Pseudocode using Time Forward Processing to compute the sequence: in iteration $i$, the two messages addressed to $v_i$ (which always fetch exactly two elements) are removed from the priority queue, summed, reported as $x_i$, and sent on to the successors $v_{i+1}$ and $v_{i+2}$.

3.2.2 External-Memory Model

In contrast to classic models of computation, such as the unit-cost RAM, modern computers contain deep memory hierarchies ranging from fast registers, caches and main memory to solid state drives (SSDs) and hard disks. Algorithms unaware of these properties may face performance penalties of several orders of magnitude. We use the commonly accepted external memory (EM) model by Aggarwal and Vitter [AV88] to reason about the influence of data locality in memory hierarchies. The model features two memory types: fast internal memory (IM) which may hold up to $M$ data items, and a slow disk of unbounded size. The input and output of an algorithm are stored in EM while computation is only possible on values in IM. The measure of an algorithm's performance is the number of I/Os required. Each I/O transfers a block of $B$ consecutive items between memory levels. Reading or writing $n$ contiguous items from or to disk requires $\mathrm{scan}(n) = \Theta(n/B)$ I/Os. Sorting $n$ contiguous items uses $\mathrm{sort}(n) = \Theta((n/B) \cdot \log_{M/B}(n/B))$ I/Os. For realistic values of $n$, $B$ and $M$, $\mathrm{scan}(n) < \mathrm{sort}(n) \ll n$. Sorting complexity often constitutes a lower bound for intuitively non-trivial tasks [AV88; MSS03]. The snippet below illustrates these quantities with concrete numbers.
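To get a feeling for the magnitudes involved, the following toy computation evaluates scan(n) and sort(n) for one plausible parameter set; the concrete values of B, M and n are assumptions for illustration only and are not taken from the experiments in this thesis.

    from math import ceil, log

    # Illustrative parameters (assumptions): 8-byte items, 1 MiB blocks,
    # 1 GiB of internal memory, and an input of 10^10 items.
    B = 2**20 // 8   # items per block
    M = 2**30 // 8   # items fitting into internal memory
    n = 10**10       # input size in items

    scan_n = ceil(n / B)                        # Theta(n/B) I/Os
    sort_n = scan_n * ceil(log(n / B, M / B))   # Theta((n/B) * log_{M/B}(n/B)) I/Os
    print(f"scan(n) = {scan_n:,} I/Os, sort(n) = {sort_n:,} I/Os, n = {n:,}")

With these values, sorting costs only about twice as many I/Os as a single scan, while both are several orders of magnitude below $n$, matching $\mathrm{scan}(n) < \mathrm{sort}(n) \ll n$.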

3.2.3 TFP: Time Forward Processing

Time Forward Processing (TFP) is a generic technique to manage data dependencies of external memory algorithms [MZ03]. Consider an algorithm computing values $x_1, \ldots, x_n$ in which the calculation of $x_i$ requires previously computed values. One typically models these dependencies using a directed acyclic graph $G = (V, E)$. Every node $v_i \in V$ corresponds to the computation of $x_i$, and an edge $(v_i, v_j) \in E$ indicates that the value $x_i$ is necessary to compute $x_j$. As an example, consider the Fibonacci sequence $x_0 = 0$, $x_1 = 1$, $x_i = x_{i-1} + x_{i-2}$ $\forall i \geq 2$ in which each node $v_i$ with $i \geq 2$ depends on exactly its two predecessors (cf. Figure 3.1). In this case, a linear scan for increasing $i$ solves the dependencies. In general, an algorithm needs to traverse $G$ according to some topological order $\prec_T$ of the nodes $V$ and also has to ensure that each $v_j$ can access values from all $v_i$ with $(v_i, v_j) \in E$. The TFP technique achieves this as follows: as soon as $x_i$ has been calculated, messages of the form $\langle v_j, x_i \rangle$ are sent to all successors $(v_i, v_j) \in E$. These messages are kept in a minimum priority queue sorting the items by their recipients according to $\prec_T$. By definition, the algorithm only starts the computation of $v_i$ once all predecessors $v_j \prec_T v_i$ are completed. Since these predecessors already removed their messages from the PQ, items addressed to $v_i$ (if any) are currently the smallest elements in the data structure and can be dequeued. Using a suited EM PQ [Arg95; San00], TFP incurs $\mathcal{O}(\mathrm{sort}(k))$ I/Os, where $k$ is the number of messages sent.


Symbol              Description

$[k]$               $[k] := \{1, \ldots, k\}$ for $k \in \mathbb{N}^+$ (section 1.2)
$[u, v]$            Undirected edge with implication $u \leq v$ (section 1.2)
$\langle X \rangle$        The mean $\langle X \rangle := \sum_{i=1}^n x_i / n$
$\langle X^2 \rangle$      The second moment $\langle X^2 \rangle := \sum_{i=1}^n x_i^2 / n$
$B$                 Number of items in a block transferred between IM and EM (section 3.2.2)
$d_{min}$, $d_{max}$      Min/max degree of nodes in LFR benchmark (section 3.3)
$d_v^{in}$          $d_v^{in} = (1 - \mu) \cdot d_v$, intra-community degree of node $v$ (section 3.3)
$\mathcal{D}$       $\mathcal{D} = (d_1, \ldots, d_n)$ with $d_i \leq d_{i+1}$ $\forall i$, degree sequence of a graph (section 3.5)
$D(\mathcal{D})$    $D(\mathcal{D}) = |\{d_i : 1 \leq i \leq n\}|$ where $\mathcal{D} = (d_1, \ldots, d_n)$, degree support (section 3.5)
$n$                 Number of vertices in a graph (section 1.2)
$m$                 Number of edges in a graph (section 1.2)
$\mu$               Mixing parameter in LFR benchmark, i.e. ratio of neighbors that shall be in other communities (section 3.3)
$M$                 Number of items fitting into internal memory (section 3.2.2)
$\mathrm{Pld}([a, b), \gamma)$   Powerlaw distribution with exponent $-\gamma$ on the interval $[a, b)$ (section 1.2)
$s_{min}$, $s_{max}$      Min/max size of communities in LFR benchmark (section 3.3)
$\mathrm{scan}(n)$  $\mathrm{scan}(n) = \Theta(n/B)$ I/Os, scan complexity (section 3.2.2)
$\mathrm{sort}(n)$  $\mathrm{sort}(n) = \Theta((n/B) \cdot \log_{M/B}(n/B))$ I/Os, sort complexity (section 3.2.2)

Table 3.1: Definitions used in this chapter.

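As a concrete illustration of the technique, the following self-contained Python sketch computes Fibonacci numbers with a binary heap standing in for the EM priority queue; it is our own toy example matching Figure 3.1, not code from EM-LFR. Note that the base-case value $x_1$ must be sent to both $v_2$ and $v_3$ so that every node receives exactly two messages.

    import heapq

    def fibonacci_tfp(n):
        """Compute x_n via Time Forward Processing: values travel as messages
        (recipient, value) through a min-priority queue ordered by recipient."""
        pq = []
        heapq.heappush(pq, (2, 0))  # base case x_0 = 0, needed by v_2
        heapq.heappush(pq, (2, 1))  # base case x_1 = 1, needed by v_2 ...
        heapq.heappush(pq, (3, 1))  # ... and by v_3
        x = 1
        for i in range(2, n + 1):
            total = 0
            while pq and pq[0][0] == i:          # fetch exactly the two predecessors
                total += heapq.heappop(pq)[1]
            x = total
            heapq.heappush(pq, (i + 1, total))   # send x_i to successor v_{i+1}
            heapq.heappush(pq, (i + 2, total))   # and to successor v_{i+2}
        return x

    assert [fibonacci_tfp(i) for i in range(2, 8)] == [1, 2, 3, 5, 8, 13]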

3.3 The LFR Benchmark

In this section we introduce the general approach for generating LFR benchmark graphs, outline important algorithmic challenges, and address each of them by proposing a suited EM algorithm in the following sections (refer to Figure 3.3 for an overview). As introduced in Section 2.1.1, the LFR benchmark [LFR08] describes a generator for random graphs featuring node degrees and community sizes both following powerlaw distributions. The produced networks also contain a planted community structure against which the performance of detection algorithms is measured. A revised version [LF09a] additionally introduces weighted and directed graphs with overlapping communities and changes the sampling algorithm even for the original settings. We consider the modern generator, which is also used in the author's implementation, and focus on the most common variants for unweighted, undirected graphs and optionally overlapping communities. All its parameters are listed in Table 3.2 and are fully supported by EM-LFR.


Parameter                          Meaning

$n$                                Number of nodes to be produced
$\mathrm{Pld}([d_{min}, d_{max}), \gamma)$   Degree distribution of nodes, typically $\gamma = 2$
$0 \leq O_n \leq n$, $\nu \geq 1$  $O_n$ random nodes belong to $\nu$ communities; the remainder has one membership
$\mathrm{Pld}([s_{min}, s_{max}), \beta)$    Size distribution of communities, typically $\beta = 1$
$0 < \mu < 1$                      Mixing parameter: fraction of neighbors of every node $u$ that shall not share a community with $u$

Table 3.2: Parameters of overlapping LFR. The typical values follow suggestions by [LF09a].


Figure 3.2: Left: Sample node degrees and community sizes from two powerlaw distributions. The mixing parameter $\mu$ determines the fraction of the inter-community edges. Then, assign each node to sufficiently large communities. Center: Sample intra-community graphs and inter-community edges independently. This may lead to illegal intra-community edges in the global graph. Right: Lastly, remove illegal inter-community edges with respect to the global graph.

LFR starts by randomly sampling the degrees $\mathcal{D} = [d_i]_{i=1}^n$ of all nodes, the numbers $[\nu_i]_{i=1}^n$ of clusters they are members in, and community sizes $S = [s_\xi]_{\xi=1}^C$ such that $\sum_{\xi=1}^C s_\xi = \sum_{i=1}^n \nu_i$ according to the supplied parameters. During this process the number of communities $C$ follows endogenously and is bounded by $C = \mathcal{O}(n)$ even if nodes are members in $\nu = \mathcal{O}(1)$ communities.²

Depending on the mixing parameter $0 < \mu < 1$, every node $v_i \in V$ is incident to $d_i^{ext} = \mu \cdot d_i$ inter-community edges and $d_i^{in} = (1 - \mu) \cdot d_i$ edges within its communities. In the case of overlapping communities, the internal degree is evenly split among all

² Under the realistic assumption that the maximal community size grows with $s_{max} = \Omega(n^\epsilon)$ for some $\epsilon > 0$, the bound improves to $C = o(n)$ with high probability due to the powerlaw distributed community sizes.



Figure 3.3: The EM-LFR pipeline: After randomly sampling the node degrees and community sizes, nodes are assigned into suited communities by EM-CA (section 3.4). The global (inter-community) graph and each community graph is then generated independently by first materializing biased graphs using EM-HH (section 3.5) followed by a randomization using EM-ES or EM-CM/ES (sections 3.6 and 3.7). The global graph may contain edges between nodes of the same community which would decrease the mixing $\mu$ and are hence rewired using EM-GER (section 3.8.1). Similarly, two overlapping communities can have identical edges which are rewired by EM-CER (section 3.8.2).


communities the node is part of. Both the computation of $d_i^{in}$ and the division $d_i^{in}/\nu_i$ into several communities use non-deterministic rounding to avoid biases. LFR assigns every node $v_i$ to either exactly $\nu_i = 1$ or $\nu_i = \nu$ communities at random such that the requested community sizes and number of communities per node are realized. It further ensures that the desired internal degree $d_i^{in}/\nu_i$ is strictly smaller than the size $s_\xi$ of its community $\xi$. A sketch of the non-deterministic rounding is shown below.

As illustrated in Figure 3.2, the LFR benchmark then generates the inter-community graph using FDSM on the degree sequence $[d_i^{ext}]_{i=1}^n$. In order not to violate the mixing parameter $\mu$, rewiring steps are applied to the global inter-community graph to replace edges between two nodes sharing a community. Analogously, an intra-community graph is sampled for each community. In the overlapping case, rewiring steps may be necessary to remove edges that exist in multiple communities and would result in duplicate edges in the final graph.
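The non-deterministic rounding can be sketched as follows; this is a minimal illustration under the assumption that values are rounded up with probability equal to their fractional part (which keeps the expectation unbiased). The reference implementation may differ in details, and all names are ours.

    import random

    def stochastic_round(x):
        # Round x up with probability equal to its fractional part, so that
        # E[stochastic_round(x)] = x and no systematic bias is introduced.
        base = int(x)
        return base + (random.random() < x - base)

    def split_degree(d, mu, nu):
        # Sketch: external degree and per-community internal degrees of a
        # node with total degree d and nu community memberships.
        d_in = stochastic_round((1 - mu) * d)
        per_community = [stochastic_round(d_in / nu) for _ in range(nu)]
        return d - d_in, per_community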

3.4 EM-CA: Community Assignment

In the LFR benchmark, every node belongs to one (non-overlapping) or more (overlapping) communities. EM-CA finds such a random assignment subject to the two constraints that all communities get as many nodes as previously determined (see Figure 3.3) and that for a node $v_i$ all its assigned communities have enough other members to satisfy the node's intra-community degree $d_i^{in}/\nu_i$.

For the sake of simplicity, we first restrict ourselves to the non-overlapping case, in which every node belongs to exactly one community. Consider a sequence of community sizes $S = [s_\xi]_{\xi=1}^C$ with $n = \sum_{\xi=1}^C s_\xi$ and a sequence of intra-community degrees $\mathcal{D}^{in} = [d_i^{in}]_{i=1}^n$. Let $S$ and $\mathcal{D}^{in}$ be non-increasing and positive. The task is to find a random surjective assignment $\chi\colon V \to [C]$ with:

(R1) Every community $\xi$ is assigned $s_\xi$ nodes as requested, with $s_\xi := |\{v \mid v \in V \wedge \chi(v) = \xi\}|$.

(R2) Every node $v \in V$ becomes member of a sufficiently large community $\chi(v)$ with $s_{\chi(v)} > d_v^{in}$.

Observe that $\chi$ can be interpreted as a bipartite graph where the partition classes are given by the communities $[C]$ and the nodes $[v_i]_{i=1}^n$ respectively, and each edge corresponds to an assignment.

3.4.1 A simple, iterative, but not yet complete algorithm

To ease the description of the algorithm, let us also ignore (R2) for now and discuss the changes needed in section 3.4.2. Then the assignment graph can be sampled in the spirit of the Configuration Model (cf. section 3.7). To do so, we draw a permutation $\pi$ of nodes uniformly at random and assign the nodes $[v_{\pi(i)}]_{i=x_\xi+1}^{x_\xi+s_\xi}$ to community $\xi$ where $x_\xi := \sum_{i=1}^{\xi-1} s_i$ is the number of slots required for communities with indices smaller than $\xi$.


To ease later modifications, we prefer an equivalent iterative formulation: while there exists a yet unassigned node $u$, draw a community $X$ with probability proportional to the number of its remaining free slots (i.e. $\Pr[X = \xi] \propto s_\xi$). Assign node $u$ to $X$, reduce the community's probability mass by decreasing $s_X \leftarrow s_X - 1$ and repeat. By construction, the first scheme is unbiased and the equivalence of both approaches follows as a special case of Lemma 3.1 (see below).

We implement the random selection process efficiently based on a binary tree where each community corresponds to a leaf with a weight equal to the number of free slots in the community. Inner nodes store the total weight of their left subtree. In order to draw a community, we sample an integer $Y \in [0, W_C)$ uniformly at random where $W_C := \sum_{\xi=1}^C s_\xi$ is the tree's total weight. Following the tree according to $Y$ yields the leaf corresponding to community $X$. An I/O-efficient data structure [MP16] based on lazy evaluation for such dynamic probability distributions enables a fully external algorithm with $\mathcal{O}(n/B \cdot \log_{M/B}(C/B)) = \mathcal{O}(\mathrm{sort}(n))$ I/Os. However, if $C < M$, we can store the tree in IM, allowing a semi-external algorithm which only needs to scan through $\mathcal{D}$, triggering $\mathcal{O}(\mathrm{scan}(n))$ I/Os.
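A compact in-memory stand-in for such a tree is a Fenwick (binary indexed) tree over the free slots; the sketch below also supports the prefix-restricted draws needed for (R2) in section 3.4.2. Class and method names are our own, and the I/O-efficient external structure [MP16] works differently internally.

    import random

    class SlotTree:
        """Fenwick tree over free community slots; draws community X with
        P[X = c] proportional to s_c, optionally restricted to a prefix."""
        def __init__(self, sizes):
            self.n = len(sizes)
            self.bit = [0] * (self.n + 1)
            for c, s in enumerate(sizes):
                self.update(c, s)

        def update(self, c, delta):
            i = c + 1
            while i <= self.n:
                self.bit[i] += delta
                i += i & -i

        def prefix(self, k):
            """Total number of free slots of communities 0 .. k-1."""
            total, i = 0, k
            while i > 0:
                total += self.bit[i]
                i -= i & -i
            return total

        def sample(self, p=None):
            """Draw a community from {0, ..., p-1} (all if p is None) with
            probability proportional to its free slots; then occupy one slot."""
            p = self.n if p is None else p
            y = random.randrange(self.prefix(p))
            pos = 0   # descend to the largest pos with prefix(pos) <= y
            for k in range(self.n.bit_length(), -1, -1):
                nxt = pos + (1 << k)
                if nxt <= self.n and self.bit[nxt] <= y:
                    y -= self.bit[nxt]
                    pos = nxt
            self.update(pos, -1)
            return pos

For instance, SlotTree([4, 3, 3]).sample(p=2) draws only among the first two communities, mirroring the reduced sampling interval used for constrained nodes.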

3.4.2 Enforcing constraint on community size (R2)

To enforce (R2), we additionally ensure that all nodes are assigned to a sufficiently large community such that they find enough neighbors to connect to. We exploit that $S$ and $\mathcal{D}^{in}$ are non-increasing and define $p_v := \max\{\xi \mid s_\xi > d_v^{in}\}$ as the index of the smallest community node $v$ may be assigned to. Since $[p_v]_v$ is therefore monotonic itself, it can be computed online with $\mathcal{O}(1)$ additional IM and $\mathcal{O}(\mathrm{scan}(n))$ I/Os in the fully external setting by scanning through $S$ and $\mathcal{D}^{in}$ in parallel. In order to restrict the random sampling to the communities $\{1, \ldots, p_v\}$, we reduce the aforementioned random interval to $[0, W_v)$ where the partial sum $W_v := \sum_{\xi=1}^{p_v} s_\xi$ is available while computing $p_v$. We generalize the notion of uniformity to assignments subject to (R2) as follows:

Lemma 3.1. Given $S = [s_\xi]_{\xi=1}^C$ and $\mathcal{D}^{in}$, let $u, v \in V$ be two nodes with the same constraints $p_u = p_v$ and let $c$ be an arbitrary community. Further, let $\chi$ be an assignment generated by EM-CA. Then, $\Pr[\chi(u) = c] = \Pr[\chi(v) = c]$.

Proof. Without loss of generality, assume that $p_u = p_1$, i.e. $u$ is one of the nodes with the tightest constraints. If this is not the case, we just execute EM-CA until we reach a node $u'$ which has the same constraints as $u$ does (i.e. $p_{u'} = p_u$), and apply the Lemma inductively. This is legal since EM-CA streams through $\mathcal{D}^{in}$ in a single pass and is oblivious to any future values. In case $c > p_1$, neither $u$ nor $v$ can become a member of $c$. Therefore, $\Pr[\chi(u) = c] = \Pr[\chi(v) = c] = 0$ and the claim follows trivially.

Now consider the case $c \leq p_1$. Let $s_c^{(i)}$ be the number of free slots in community $c$ at the beginning of round $i \geq 1$ and $W^{(i)} = \sum_{j=1}^C s_j^{(i)}$ their sum at that time. By definition, EM-CA assigns node $u$ to community $c$ with probability $\Pr[\chi(u) = c] = s_c^{(u)} / W^{(u)}$. Further, the algorithm has to update the number of free slots. Thus, initially we have $s_c^{(1)} = s_c$

and for iteration $1 < i \leq n$ it holds that

$$s_c^{(i)} = \begin{cases} s_c^{(i-1)} - 1 & \text{if } v_{i-1} \text{ was assigned to } c \\ s_c^{(i-1)} & \text{otherwise.} \end{cases}$$

The number of free slots is reduced by one in each step: $W^{(i)} = W^{(1)} - i + 1 = \left(\sum_{j=1}^C s_j\right) - i + 1$. The claim follows by transitivity if we show $\Pr[\chi(u) = c] = s_c^{(u)}/W^{(u)} = s_c^{(1)}/W^{(1)}$. For $u = 1$ it holds by definition. Now, consider the induction step for $u > 1$:

$$\Pr[\chi(u) = c] = \frac{s_c^{(u)}}{W^{(u)}} = \Pr[\chi(u-1) = c] \cdot \frac{s_c^{(u-1)} - 1}{W^{(u)}} + \Pr[\chi(u-1) \neq c] \cdot \frac{s_c^{(u-1)}}{W^{(u)}}$$
$$= \frac{s_c^{(u-1)}}{W^{(u-1)}} \cdot \frac{s_c^{(u-1)} - 1}{W^{(u)}} + \left(1 - \frac{s_c^{(u-1)}}{W^{(u-1)}}\right) \cdot \frac{s_c^{(u-1)}}{W^{(u)}} = \frac{s_c^{(u-1)}}{W^{(u-1)}} \cdot \frac{W^{(u-1)} - 1}{W^{(u)}}$$
$$= \frac{s_c^{(u-1)} \cdot (W^{(u-1)} - 1)}{W^{(u-1)} \cdot (W^{(u-1)} - 1)} = \frac{s_c^{(u-1)}}{W^{(u-1)}} \overset{\text{Ind. Hyp.}}{=} \frac{s_c^{(1)}}{W^{(1)}}$$

3.4.3 Assignment with overlapping communities

In the overlapping case, the weight of $S$ increases to account for nodes with multiple memberships. There is further an additional input sequence $[\nu_i]_{i=1}^n$ corresponding to the number of memberships node $v_i$ shall have, each of which has $d_i^{in}/\nu_i$ intra-community neighbors. We then sample not only one community per node $v_i$, but $\nu_i$ different ones. Since the number of memberships $\nu_v \ll M$ is small, a duplication check during the repeated sampling is easy in the semi-external case and does not change the I/O complexity.

However, it is possible that near the end of the execution there are fewer free communities than memberships requested. We address this issue by switching to an offline strategy for the last $\Theta(M)$ assignments and keep them in IM. As $\nu = \mathcal{O}(1)$, there are $\Omega(\nu)$ communities with free slots for the last $\Theta(M)$ vertices and a legal assignment exists with high probability. The offline strategy proceeds as before until it is unable to find $\nu$ different communities for a node. In that case, it randomly picks earlier assignments until swapping the communities is possible.

In the fully external setting, the I/O complexity grows linearly in the number of samples taken and is thus bounded by $\mathcal{O}(\nu \cdot \mathrm{sort}(n))$. However, the community memberships are obtained lazily and out-of-order which may assign a node several times to the same community. This corresponds to a multi-edge in the bipartite assignment graph. It can be removed using the rewiring technique detailed in section 3.7.2.

3.5 EM-HH: Deterministic Edges from a Degree Sequence

In this section, we address the issue of generating a graph from prescribed degrees and introduce an EM variant of the well-known Havel-Hakimi scheme. It takes a positive non-decreasing degree sequence $\mathcal{D} = [d_i]_{i=1}^n$ and, if possible, outputs a graph $G_\mathcal{D}$ realizing



Figure 3.4: Materialization of the degree sequence $\mathcal{D}_k = (1, 1, 2, 2, \ldots, k, k)$ with $D(\mathcal{D}_k) = k = \Theta(n)$ which maximizes EM-HH's memory consumption asymptotically. A node's label corresponds to its degree.

these degrees.³ EM-LFR uses this algorithm (cf. Figure 3.3) to first obtain a legal but biased graph following $\mathcal{D}$ and then randomizes $G_\mathcal{D}$ in a subsequent step.

A sequence $\mathcal{D}$ is called graphical if a matching simple graph $G_\mathcal{D}$ exists. Havel and Hakimi independently gave inductive characterizations of graphical sequences which directly lead to a graph generator [Hav55; Hak62]: given $\mathcal{D}$, connect the first node $v_1$ with degree $d_1$ (minimal among all nodes) to $d_1$-many high-degree vertices by emitting the edges $\{v_1, v_{n-i}\}$ for $0 \leq i < d_1$, and recurse on the accordingly decremented residual sequence $\mathcal{D}'$.⁴

For an implementation, it is non-trivial to keep the sequence ordered after decrementing the neighbors' degrees. Internal memory solutions typically employ priority queues optimized for integer keys, such as bucket lists [SSM16; VL16]. This approach incurs $\Theta(\mathrm{sort}(n + m))$ I/Os using a naïve EM PQ since every edge triggers an update to the pending degree of at least one endpoint.

We hence propose the Havel-Hakimi variant EM-HH which, for virtually all realistic powerlaw degree distributions, avoids accesses to disk besides writing the result. The algorithm emits a stream of edges in lexicographical order which can be fed to any single-pass streaming algorithm without a round-trip to disk. Thus, we consider only internal I/Os and emphasize that storing the output – if necessary by the application – requires $\mathcal{O}(m)$ time and $\mathcal{O}(\mathrm{scan}(m))$ I/Os where $m$ is the number of edges produced. Additionally, EM-HH may be used to test in time $\mathcal{O}(n)$ whether a degree sequence $\mathcal{D}$ is graphical or to drop problematic edges yielding a graphical sequence (cf. section 3.7).


3.5.1 Data structure

Instead of maintaining the degree of every node in $\mathcal{D}$ individually, EM-HH compacts nodes with equal degrees into a group, yielding $D(\mathcal{D}) := |\{d_i : 1 \leq i \leq n\}|$ groups. Since $\mathcal{D}$ is monotonic, such nodes have consecutive ids and the compaction can be performed in a streaming fashion.⁵ The sequence is then stored as a doubly linked list $L = [g_j]_{1 \leq j \leq D(\mathcal{D})}$ where group $g_j = (b_j, n_j, \delta_j)$ encodes that the $n_j$ nodes $[v_{b_j+i}]_{i=0}^{n_j-1}$ have degree $\delta_j$. At the beginning of every iteration of EM-HH, $L$ satisfies the following invariants which guarantee a compact representation:

(I1) The groups contain strictly increasing degrees, i.e. $\delta_j < \delta_{j+1}$ $\forall\, 1 \leq j < |L|$.

(I2) There are no gaps in the node ids, i.e. $b_j + n_j = b_{j+1}$ $\forall\, 1 \leq j < |L|$.

These invariants allow us to bound the memory footprint in two steps: first observe that a list $L$ of size $D(\mathcal{D})$ describes a graph with at least $\sum_{i=1}^{D(\mathcal{D})} i/2$ edges due to (I1). Thus, materializing an arbitrary $L$ of the main memory size $|L| = \Theta(M)$ emits $\Omega(M^2)$ edges. With as little as 2 GB RAM, this amounts to an edge list exceeding 1 PB in size.⁶ Therefore, even in the worst case the whole data structure can be kept in IM for all practical scenarios. On top of this, a probabilistic argument applies: While there exist graphs with $D(\mathcal{D}) = \Theta(n)$ as illustrated in Figure 3.4, Lemma 3.2 gives a sub-linear bound on $D(\mathcal{D})$ if $\mathcal{D}$ is sampled from a powerlaw distribution.

Lemma 3.2. Let $\mathcal{D}$ be a degree sequence with $n$ nodes sampled from $\mathrm{Pld}([1, n), \gamma)$. Then, the number of unique degrees $D(\mathcal{D}) = |\{d_i : 1 \leq i \leq n\}|$ is bounded by $\mathcal{O}(n^{1/\gamma})$ with high probability.

Proof. Consider random variables $(X_1, \ldots, X_n)$ sampled i.i.d. from $\mathrm{Pld}([1, n), \gamma)$ as an unordered degree sequence. Fix an index $1 \leq j \leq n$. Due to the powerlaw distribution, $X_j$ is likely to have a small degree. Even if all degrees $1, \ldots, n^{1/\gamma}$ were realized, their occurrences would be covered by the claim. Thus, it suffices to bound the number of realized degrees larger than $n^{1/\gamma}$. We first show that their total probability mass is small. Then we can argue that $D(\mathcal{D})$ is asymptotically unaffected by their rare occurrences:


$$\Pr[X_j > n^{1/\gamma}] = \sum_{i=n^{1/\gamma}+1}^{n-1} \Pr[X_j = i] = \frac{\sum_{i=n^{1/\gamma}+1}^{n-1} i^{-\gamma}}{\sum_{i=1}^{n-1} i^{-\gamma}} \overset{(i)}{=} \frac{\sum_{i=n^{1/\gamma}+1}^{n-1} i^{-\gamma}}{\zeta(\gamma) - \sum_{i=n}^{\infty} i^{-\gamma}}$$
$$\overset{(ii)}{\leq} \frac{\int_{n^{1/\gamma}}^{n-1} x^{-\gamma}\,\mathrm{d}x}{\zeta(\gamma) - \int_{n-1}^{\infty} x^{-\gamma}\,\mathrm{d}x} = \frac{n^{1/\gamma}/n - (n-1)^{1-\gamma}}{(\gamma - 1)\zeta(\gamma) - (n-1)^{1-\gamma}} = \mathcal{O}(n^{1/\gamma}/n),$$

where (i) $\zeta(\gamma) = \sum_{i=1}^{\infty} i^{-\gamma}$ is the Riemann zeta function which satisfies $\zeta(\gamma) \geq 1$ for all $\gamma \in \mathbb{R}$, $\gamma \geq 1$. In step (ii), we exploit the series' monotonicity to bound it in between the two integrals $\int_{a}^{b+1} x^{-\gamma}\,\mathrm{d}x \leq \sum_{i=a}^{b} i^{-\gamma} \leq \int_{a-1}^{b} x^{-\gamma}\,\mathrm{d}x$.

In order to bound the number of occurrences, define Boolean indicator variables $Y_i$ with $Y_i = 1$ iff $X_i > n^{1/\gamma}$ and observe that they model Bernoulli trials $Y_i \in B(p)$ with $p = \mathcal{O}(n^{1/\gamma}/n)$. Thus, the expected number of high degrees is $\mathrm{E}[\sum_{i=1}^n Y_i] = \sum_{i=1}^n \Pr[X_i > n^{1/\gamma}] = \mathcal{O}(n^{1/\gamma})$. Chernoff's inequality gives an exponentially decreasing bound on the tail distribution of the sum which thus holds with high probability.

³ EM-LFR directly generates a monotonic degree sequence by first sampling a monotonic uniform sequence [BS80; Vit87] and then applying the inverse sampling technique (carrying over the monotonicity) for a powerlaw distribution. Thus, no additional sorting steps are necessary for the inter-community graph.
⁴ This variant is due to [Hak62]; in [Hav55], the node of maximal degree is picked and connected.
⁵ While direct sampling of the group's multinomial distribution is not beneficial in LFR, it may be used to omit the compaction phase for other applications.
⁶ A single item of $L$ can be naïvely represented by its three values and two pointers, i.e. a total of $5 \cdot 8 = 40$ bytes per item (assuming 64 bit integers and pointers). Just 2 GB of IM suffice for storing $5 \times 10^7$ items, which result in at least $6.25 \times 10^{14}$ edges, i.e. storing just two bytes per edge would require more than one Petabyte. Observe that standard tricks, such as exploiting the redundancy due to (I2), allow to reduce the memory footprint of $L$.

Remark. Experiments in section 3.10.2 indicate that the hidden constants in Lemma 3.2 are small for realistic $\gamma$.

Corollary 3.3. With high probability, graphs with $m = \mathcal{O}(M^{2\gamma})$ edges following a powerlaw degree distribution are processed without I/O.

Proof. Due to Lemma 3.2 the number of unique degrees $D(\mathcal{D})$ is bounded by $\mathcal{O}(n^{1/\gamma})$ with high probability. Consequently, a list of size $D(\mathcal{D})$ filling the whole IM supports $n = \mathcal{O}(M^\gamma)$ many nodes and thus $m = \mathcal{O}(M^{2\gamma})$ many edges with high probability.

3.5.2 Algorithm

EM-HH works in $n$ rounds, where every iteration corresponds to a recursion step of the original formulation. Each time it extracts node $v_{b_1}$ with the smallest available id and with minimal degree $\delta_1$. The extraction is achieved by incrementing the lowest node id ($b_1 \leftarrow b_1 + 1$) of group $g_1$ and decreasing its size ($n_1 \leftarrow n_1 - 1$). If the group becomes empty ($n_1 = 0$), it is removed from $L$ at the end of the iteration; Figure 3.5 illustrates this situation in step 2. We now connect node $v_{b_1}$ to $\delta_1$ nodes from the end of $L$. Let $g_j$ be the group of smallest index to which $v_{b_1}$ connects. Then there are two cases:

(C1) If node $v_{b_1}$ connects to all nodes in $g_j$, we directly emit the edges $\{[v_{b_1}, v_x] \mid n - \delta_1 < x \leq n\}$ and decrement the degrees of all groups $g_j, \ldots, g_{|L|}$ accordingly. Since degree $\delta_{j-1}$ remains unchanged, it may now match the decremented $\delta_j$. This violation of (I1) is resolved by merging both groups. Due to (I2), the union of $g_{j-1}$ and $g_j$ contains consecutive ids and it suffices to grow $n_{j-1} \leftarrow n_{j-1} + n_j$ and to delete group $g_j$ (see Figure 3.5 step 2, in which the degree of $g_3$ is reduced to $d_3 = 2$, triggering a merge with $g_2$).



Figure 3.5: Top: EM-HH on $\mathcal{D} = (1, 1, 2, 2, 3, 3)$. Values of $L$ and $\mathcal{D}$ in row $i$ correspond to the state at the beginning of the $i$-th iteration. Groups are visualized directly after extraction of the head node. The number next to an edge-to symbol indicates the new degree. After these updates, splitting and merging takes place. For instance, in the initial round the first node $v_1$ is extracted from $g_1$ and connected to the first node $v_5$ of the last group. Hence group $g_3$ of nodes with degree 3 is split into node $v_5$ with now $\deg(v_5) = 2$ and $v_6$ remaining at $\deg(v_6) = 3$. Since group $g_2$ of nodes $\{v_3, v_4\}$ also has degree 2, it is merged with the new group of $v_5$. Bottom: Consider two adjacent groups $g_i$, $g_j$ with degrees $d - 1$ and $d$. A split of $g_i$ (left) or $g_j$ (right) directly triggers a merge, so the number of groups remains the same.



(C2) If $v_{b_1}$ connects only to a number $a < n_j$ of nodes in group $g_j$, we split $g_j$ into two groups $g_j'$ and $g_j''$ containing the nodes $[v_{b_j+i}]_{i=0}^{a-1}$ and $[v_{b_j+i}]_{i=a}^{n_j-1}$ respectively. We then connect node $v_{b_1}$ to all $a$ nodes in the first fragment $g_j'$ and hence need to decrease its degree. Thus, a merge analogous to (C1) may be required if degree $\delta_{j-1}$ matches the decreased degree of group $g_j'$ (see Figure 3.5 step 1, in which group $g_3$ is split into two fragments with degrees $d_{3'} = 2$ and $d_3 = 3$ respectively, triggering a merge between group $g_2$ and fragment $g_{3'}$). Afterwards, the degrees of groups $g_{j+1}, \ldots, g_{|L|}$ are decreased wholly as in (C1).

If the requested degree $\delta_1$ cannot be met (i.e., $\delta_1 > \sum_{k=1}^{|L|} n_k$), the input is not graphical [Hak62]. However, a sufficiently large random powerlaw degree sequence contains at most very few nodes that cannot be materialized as requested since the vast majority of nodes have low degrees. Thus, we do not explicitly ensure that the sampled degree sequence is graphical and rather correct the negligible inconsistencies later on by ignoring the unsatisfiable requests.
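The scheme can be prototyped in a few lines of Python; this naive in-memory version processes individual nodes rather than compacted groups and re-sorts after every round, so it exhibits none of EM-HH's streaming behavior, but it illustrates the minimal-degree variant and the graphicality test. All names are our own.

    def havel_hakimi_edges(degrees):
        """Naive Havel-Hakimi (minimal-degree variant): connect the node of
        smallest pending degree to the nodes of highest pending degree."""
        pending = sorted((d, v) for v, d in enumerate(degrees) if d > 0)
        edges = []
        while pending:
            d, v = pending.pop(0)                # node with minimal pending degree
            if d > len(pending):
                raise ValueError("degree sequence is not graphical")
            partners = [pending.pop() for _ in range(d)]   # highest-degree nodes
            for dp, p in partners:
                edges.append((min(v, p), max(v, p)))
                if dp > 1:
                    pending.append((dp - 1, p))
            pending.sort()            # EM-HH avoids re-sorting via grouped degrees
        return edges

For $\mathcal{D} = (1, 1, 2, 2, 3, 3)$ it emits six edges realizing the degree sequence of Figure 3.5.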

3.5.3 Improving the I/O-complexity

In EM-HH's current formulation, it requires $\mathcal{O}(m)$ time which is already optimal in case edges have to be emitted. Testing whether $\mathcal{D}$ is graphical however is sub-optimal. We thus introduce a simple optimization, which also yields optimality for these tests, improves constant factors and gives I/O-optimal accesses. Observe that only groups in the vicinity of $g_j$ can be split or merged; we call these the active frontier. In contrast, the so-called stable groups $g_{j+1}, \ldots, g_{D(\mathcal{D})}$ keep their relative degree differences as the pending degrees of all their nodes are decremented by one in each iteration. Further, they will become neighbors to all subsequently extracted nodes until group $g_{j+1}$ eventually becomes an active merge candidate. Thus, we do not have to update the degrees of stable groups in every round, but rather maintain a single global iteration counter $I$ and count how many iterations a group remained stable: when a group $g_k$ becomes stable in iteration $I_0$, we annotate it with $I_0$ by adding $\delta_k \leftarrow \delta_k + I_0$. If $g_k$ has to be activated again in iteration $I > I_0$, its updated degree follows as $\delta_k \leftarrow \delta_k - I$. The degree $\delta_k$ remains positive since (I1) enforces a timely activation.

Lemma 3.4. The optimized variant of EM-HH requires $\mathcal{O}(\mathrm{scan}(D(\mathcal{D})))$ I/Os if $L$ is stored in an external memory list.

Proof. An external-memory list requires $\mathcal{O}(\mathrm{scan}(k))$ I/Os to execute any sequence of $k$ sequential read, insertion, and deletion requests to adjacent positions (i.e. if no seeking is necessary) [Pag03]. We will argue that EM-HH scans $L$ roughly twice, starting simultaneously from the front and back. Every iteration starts by extracting a node of minimal degree. Doing so corresponds to accessing and eventually deleting the list's first element $g_1$. If the list's head block is

cached, we only incur an I/O after deleting $\Theta(B)$ head groups, yielding $\mathcal{O}(\mathrm{scan}(D(\mathcal{D})))$ I/Os during the whole execution. The same is true for accesses to the back of the list: the minimal degree increases monotonically during the algorithm's execution until the extracted node has to be connected to all remaining vertices. In a graphical sequence, this implies that only one group remains and we can ignore the simple base case asymptotically. Neglecting splitting and merging, the distance between the list's head and the active frontier decreases monotonically, triggering $\mathcal{O}(\mathrm{scan}(D(\mathcal{D})))$ I/Os.

Merging. As described before, it may be necessary to reactivate stable groups, i.e. to reload the group behind the active frontier (towards $L$'s end). Thus, we not only keep the block $F$ containing the frontier cached, but also the block $G$ behind it. This does not incur additional I/O, since we are scanning backwards through $L$ and already read $G$ before $F$. The reactivation of stable groups hence only incurs an I/O when the whole block $G$ is consumed and deleted. Since this does not happen before $\Omega(B)$ merges take place, reactivations may trigger $\mathcal{O}(\mathrm{scan}(D(\mathcal{D})))$ I/Os in total.

Splitting. Splitting does not influence EM-HH's asymptotic I/O complexity: Only an active group of degree $d$ can be split, yielding two fragments of degrees $d - 1$ and $d$ respectively. A second split of one of these fragments does not increase the number of groups since two of the three involved fragments have to be merged (cf. Figure 3.5). As a result, splitting can at most double $L$'s size.

Figure 3.6: Any exchange of exactly one node between two edges in an undirected graph (left) yields one of two isomorphic results (middle and right). We encode a swap with the two edge ids (i.e. their rank in $E_L$) and a direction flag selecting one of the two possible swaps. In the example, the swap $\sigma_1$ (middle) is illegal, as it introduces the edge $\{a, c\}$ which already exists.

3.6 EM-ES: I/O-efficient Edge Switching

EM-ES implements an external memory edge switching algorithm to randomize networks. Following LFR's original usage of FDSM, EM-ES is crucial in EM-LFR to randomize the inter-community graph as well as all communities independently (cf. Figure 3.3), and additionally functions as a building block to rewire illegal edges (cf. sections 3.7 and 3.8). As discussed in section 3.10.6, the algorithm also has applications as a standalone tool in network analysis.

EM-ES applies a sequence $S = [\sigma_s]_{s=1}^k$ of edge swaps $\sigma_s$ to a simple graph $G = (V, E)$,


where the parameter $k$ is typically chosen as $k \in [1m, 100m]$. The graph is represented by a lexicographically ordered edge list $E_L = [e_i]_{i=1}^m$ which contains for every ordered edge $[u, v] \in E$ (i.e. $u < v$) only the entry $(u, v)$ and omits $(v, u)$. We encode a swap $\sigma(\langle a, b \rangle, d)$ as a three-tuple with a direction bit $d$ and the two indices $a$, $b$ of the edges $e_a, e_b \in E_L$ that are supposed to be swapped. As illustrated in Figure 3.6, a swap simply exchanges one of the two incident nodes of each edge where $d$ selects which one. More formally, we denote the two resulting edges as $\mathrm{fst}(\sigma(\langle a, b \rangle, d))$ and $\mathrm{snd}(\sigma(\langle a, b \rangle, d))$ with

$$\mathrm{fst}(\sigma(\langle a, b \rangle, d)) := \begin{cases} \{\alpha_1, \beta_1\} & \text{if } d = \text{false} \\ \{\alpha_1, \beta_2\} & \text{if } d = \text{true} \end{cases}$$

$$\mathrm{snd}(\sigma(\langle a, b \rangle, d)) := \begin{cases} \{\alpha_2, \beta_2\} & \text{if } d = \text{false} \\ \{\alpha_2, \beta_1\} & \text{if } d = \text{true,} \end{cases}$$

where $[\alpha_1, \alpha_2] = e_a$ and $[\beta_1, \beta_2] = e_b$ are the edges at ranks $a$ and $b$ in the edge list $E_L$. In unambiguous cases, we shorten the expressions to $\mathrm{fst}(\sigma)$ and $\mathrm{snd}(\sigma)$ respectively. The swap's constituents $a$ and $b$ are typically drawn independently and uniformly at random. Thus, the sequence can contain illegal swaps that would introduce either multi-edges or self-loops. Such illegal swaps are simply skipped. In order to do so, the following tasks have to be addressed for each $\sigma(\langle a, b \rangle, d)$:

(T1) Gather the nodes incident to edges $e_a$ and $e_b$.

(T2) Compute $\mathrm{fst}(\sigma)$ and $\mathrm{snd}(\sigma)$ and skip if a self-loop arises.

(T3) Verify that the graph remains simple, i.e. skip if edge $\mathrm{fst}(\sigma)$ or $\mathrm{snd}(\sigma)$ already exists in $E_L$.

(T4) Update the graph representation $E_L$.

If the whole graph fits in IM, a hash set per node storing all neighbors can be used for adjacency queries and updates in expected constant time (e.g., VL-ES [VL16]). Then, (T3) and (T4) can be executed for each swap in expected time $\mathcal{O}(1)$. However, in the EM model this approach incurs $\Omega(1)$ I/Os per swap with high probability for a graph with $m \geq cM$ and any constant $c > 1$.

We address this issue by processing the sequence of swaps $S$ batchwise in chunks of size $r = \Theta(m)$ which we call runs. As illustrated in Figure 3.7, EM-ES executes several phases for each run. While they roughly correspond to the four tasks outlined above, the algorithm is more involved as it has to explicitly track data dependencies between swaps within a batch. There are two types: A source edge dependency occurs if (at least) two swaps share the same edge id as source. In this case, successfully executing the first swap will replace the edge by another one. This update has to be communicated to all later swaps involving this edge id. Target edge dependencies exist because swaps must not introduce multi-edges. Therefore, each swap has to assert that none of its new edges (target edges) are already present in the graph. For this reason, EM-ES has to inform a swap about the creation or deletion of target edges that occurred earlier in the run.
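In internal memory, a single swap including the legality checks (T1)–(T4) can be sketched as follows; the single global edge set stands in for the per-node hash sets of [VL16], and all names are our own.

    def apply_swap(edge_list, edge_set, a, b, direction):
        """Try to apply swap sigma(<a, b>, direction) to the normalized edge
        list; returns True iff the swap was legal and has been performed."""
        (a1, a2), (b1, b2) = edge_list[a], edge_list[b]      # (T1) load constituents
        if direction:
            fst, snd = (a1, b2), (a2, b1)
        else:
            fst, snd = (a1, b1), (a2, b2)
        fst = (min(fst), max(fst))                           # normalize to u < v
        snd = (min(snd), max(snd))
        if fst[0] == fst[1] or snd[0] == snd[1]:             # (T2) skip self-loops
            return False
        if fst == snd or fst in edge_set or snd in edge_set: # (T3) keep graph simple
            return False
        edge_set.remove(edge_list[a])                        # (T4) update the
        edge_set.remove(edge_list[b])                        #      representation
        edge_set.add(fst); edge_set.add(snd)
        edge_list[a], edge_list[b] = fst, snd
        return True

Here edge_list holds normalized $(u, v)$ tuples with $u < v$, edge_set is a set of the same tuples, and the caller draws $a \neq b$ uniformly at random; EM-ES answers exactly these adjacency and existence queries with scans and sorts instead of a hash set.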



Figure 3.7: Data flow during an EM-ES run. Communication between phases is implemented via EM sorters; self-loops use a PQ-based TFP. Brackets within a phase represent the type of elements that are iterated over. If multiple input streams are used, they are joined with this key. Independent swaps as in section 3.6.1 require only communication via sorters as shown in the upper half.

3.6.1 EM-ES for Independent Swaps

For simplicity's sake, we first assume that all swaps are independent, i.e. that there are neither source edge nor target edge dependencies in a run. Section 3.6.2 contains the algorithmic modifications necessary to account for dependencies.

The design of EM-ES is driven by the intuition that there are three types of cross-referenced data, namely (i) the sequence of swaps ranked in the order they were issued, (ii) edges addressed by their indices (e.g., to load and store their incident nodes) and (iii) edges referenced by their constituents (in order to query their existence). To resolve these unstructured references, the algorithm is decomposed into several phases and iterates in each phase over one of these data types in order. There is no pipelining, so a new phase only starts processing when the previous is completed. Similarly to Time Forward Processing (cf. section 3.2.3), phases communicate by sending messages addressed to the key of the receiving phase. The messages are pushed into a sorter⁷ to later be processed in the order dictated by the data source of the receiving end. EM-ES uses the following phases:

Request nodes and load nodes

The goal of these two phases is to load the constituents of the edges referenced by the run's swaps. We iterate over the sequence $S$ of swaps. For the $s$-th swap $\sigma(\langle a, b \rangle, d)$, we push two messages edge_req(a, s, 0) and edge_req(b, s, 1) into the sorter EdgeReq. A

⁷ The term sorter refers to a container with two modes of operation: in the first phase, items are pushed into the write-only sorter in an arbitrary order by some algorithm. After an explicit switch, the filled data structure becomes read-only and the elements are provided as a lexicographically non-decreasing stream which can be rewound at any time. While a sorter is functionally equivalent to filling, sorting and reading back an EM vector, the restricted access model reduces constant factors in the implementation's runtime and I/O-complexity [BDS09].

52 3.6 EM-ES: I/O-efficient Edge Switching message’s third entry encodes whether the request is issued for the first or second edge of a swap. This information only becomes relevant when we allow dependencies. EM-ES then scans in parallel through the edge list EL and the requests EdgeReq, which are now sorted by edge ids. If there is a request edge_req(i, s, p) for an edge ei = [u, v], the edge’s node pair is sent to the requesting swap by pushing a message edge_msg(s, p, (u, v)) into the sorter EdgeMsg. Additionally, for every edge we push a bit into the sequence InvalidEdge, which is asserted iff an edge received a request. These edges are considered invalid and willbe deleted when updating the graph in section 3.6.1. Since both phases produce only a constant amount of data per input element, we obtain an I/O complexity of (sort(r) + O scan(m)).
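The footnote's sorter abstraction can be mimicked by a small in-memory stand-in; this toy version (our naming, not the STXXL interface) only illustrates the two modes of operation:

```cpp
// Toy "sorter": push items in arbitrary order, then switch explicitly to a
// sorted, rewindable read-only stream. In-memory stand-in, not STXXL.
#include <algorithm>
#include <cstddef>
#include <vector>

template <typename T>
class Sorter {
    std::vector<T> items_;
    std::size_t pos_ = 0;
public:
    void push(const T& item) { items_.push_back(item); }     // write phase
    void sort() { std::sort(items_.begin(), items_.end()); } // explicit switch
    bool empty() const { return pos_ >= items_.size(); }
    const T& top() const { return items_[pos_]; }            // sorted stream
    void next() { ++pos_; }
    void rewind() { pos_ = 0; }                              // rewindable
};
```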

Simulate swaps and load existence

The two phases gather all information required to decide whether a swap is legal. EM-ES scans through the sequence S of swaps and EdgeMsg in parallel: For the s-th swap σ(⟨a, b⟩, d), there are exactly two messages edge_msg(s, 0, e_a) and edge_msg(s, 1, e_b) in EdgeMsg. This information suffices to compute the switched edges fst(σ) and snd(σ), but not to test for multi-edges. It remains to check whether the switched edges already exist; we push the existence requests exist_req(fst(σ), s) and exist_req(snd(σ), s) into the sorter ExistReq. Observe that, unlike in the request nodes phase, we use the node pairs rather than edge ids, which are not well defined here. Afterwards, a parallel scan through the edge list EL and ExistReq is performed to answer the requests. Only if an edge e requested by swap id s is found, the message exist_msg(s, e) is pushed into the sorter ExistMsg. Both phases hence incur a total of O(sort(r) + scan(m)) I/Os.

Perform swaps

We rewind the EdgeMsg sorter and jointly scan through the sequence of swaps S and the sorters EdgeMsg and ExistMsg. As described in the simulation phase, EM-ES computes the switched edges fst(σ) and snd(σ) from the original state e_a and e_b. The swap is considered illegal if a switched edge is a self-loop or if an existence info is received via ExistMsg. If σ is legal, we push the switched edges fst(σ) and snd(σ) into the sorter EdgeUpdates; otherwise we propagate the unaltered source edges e_a and e_b. This phase requires O(sort(r)) I/Os.

Update edge list

The new edge list EL′ is obtained by merging the original lexicographically increasing list EL and the sorted updated edges EdgeUpdates, triggering O(scan(m)) I/Os. During this process, we skip all edges in EL that are flagged invalid in the bit stream InvalidEdge. The result is a sorted new edge list EL′ with |EL′| = m edges that can be fed into the next run.
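A compact in-memory sketch of this merge step (illustrative names; the EM version streams both inputs rather than holding them in vectors):

```cpp
// Merge the sorted old edge list with the sorted edge updates while
// skipping edges flagged invalid; every invalidated edge has exactly one
// replacement among the updates, so the edge count stays m.
#include <cstdint>
#include <utility>
#include <vector>

using Edge = std::pair<uint32_t, uint32_t>;

std::vector<Edge> updateEdgeList(const std::vector<Edge>& EL,
                                 const std::vector<bool>& invalid,
                                 const std::vector<Edge>& updates) {
    std::vector<Edge> out;
    out.reserve(EL.size());
    std::size_t j = 0;
    for (std::size_t i = 0; i < EL.size(); ++i) {
        if (invalid[i]) continue;                 // dropped, replaced by update
        while (j < updates.size() && updates[j] < EL[i])
            out.push_back(updates[j++]);          // merge step
        out.push_back(EL[i]);
    }
    while (j < updates.size()) out.push_back(updates[j++]);
    return out;                                   // sorted, |out| == |EL|
}
```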


3.6.2 Inter-Swap Dependencies

In this section, we introduce the modifications necessary due to dependencies between swaps within a run. In its final version, EM-ES produces the same result as a sequential processing of S. Source edge dependencies are detected during the load nodes phase since multiple requests for the same edge id arrive. We record these dependencies as an explicit dependency chain along which intermediate updates can be propagated. Target edge dependencies surface in the load existence phase since multiple existence requests and notifications arrive for the same edge. Again, an explicit dependency chain is computed. During the perform swaps phase, EM-ES forwards the source edge states and existence updates to successor swaps using information from both dependency chains.

Target edge dependencies

Consider the case where a swap σ_{s1}(⟨a, b⟩, d) changes the state of edges e_a and e_b to fst(σ_1) and snd(σ_1), respectively. Later, a second swap σ_2 inquires about the existence of either of the four edges, which has obviously changed compared to the initial state. We extend the simulation phase to track such edge modifications and not only push the messages exist_req(fst(σ_1), s_1) and exist_req(snd(σ_1), s_1) into the sorter ExistReq, but also report that the original edges may change (during the simulation phase it is unknown whether the swap has to be skipped). This is implemented by pushing the messages exist_req(e_a, s_1, may_change) and exist_req(e_b, s_1, may_change) into the same sorter. In case of dependencies, multiple messages are received for the same edge e during the load existence phase. If so, only the request of the first swap involved is answered as before. Also, every swap σ_{s1} is informed about its direct successor σ_{s2} (if any) by pushing the message exist_succ(s_1, e, s_2) into the sorter ExistSucc, yielding the aforementioned dependency chain. As an optimization, may_change requests at the end of a chain are discarded since no recipient exists. During the perform swaps phase, EM-ES executes the same steps as described earlier. The swap may receive a successor for every edge it sent an existence request to, and informs each successor about the state of the appropriate edge after the swap is processed.
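The chain construction can be illustrated by a short sketch: requests arrive sorted by edge, consecutive requests for the same edge are linked, and each swap learns its direct successor. Types and names are illustrative, and answering the first request of each chain is omitted:

```cpp
// Sketch of the dependency-chain construction in the load existence phase:
// existence requests arrive sorted by (edge, swap id); consecutive requests
// for the same edge link a swap to its direct successor.
#include <cstdint>
#include <utility>
#include <vector>

using Edge = std::pair<uint32_t, uint32_t>;
struct ExistSucc { uint64_t swapId; Edge edge; uint64_t successor; };

std::vector<ExistSucc>
buildChains(const std::vector<std::pair<Edge, uint64_t>>& requests) {
    std::vector<ExistSucc> chains;
    for (std::size_t i = 0; i + 1 < requests.size(); ++i)
        if (requests[i].first == requests[i + 1].first)  // same edge: link
            chains.push_back({requests[i].second, requests[i].first,
                              requests[i + 1].second});
    return chains;  // one exist_succ(s1, e, s2) message per chain link
}
```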

Source edge dependencies

Consider two swaps σ_{s1}(⟨a_1, b_1⟩, d_1) and σ_{s2}(⟨a_2, b_2⟩, d_2) with s_1 < s_2 that reference the same edge id. As with target edges, only the first request for the edge id is answered during the load nodes phase, and each swap is informed about its direct successor (via the sorter IdSucc) so that the updated edge state can be forwarded along the resulting dependency chain instead of being read from the edge list.


During the perform swaps phase, EM-ES operates as described in the independent case: it computes the swapped edges and determines whether the swap has to be skipped. If a successor exists, the new state is not pushed into the EdgeUpdates sorter but rather forwarded to the successor in a TFP fashion. This way, every invalidated edge id receives exactly one update in EdgeUpdates and the merging remains correct.

3.6.3 Complexity

Due to source edge dependencies, EM-ES's complexity increases with the number of swaps that share the same edge id. This number is low in case r = O(m): let X_i be a random variable expressing the number of swaps that reference edge e_i. Since every swap constitutes two independent Bernoulli trials towards e_i, X_i is binomially distributed with p = 1/m, yielding an expected chain length of 2r/m. Also, for r = m/2 swaps, max_{1≤i≤n}(X_i) = O(ln(m)/ln ln(m)) holds with high probability based on a balls-into-bins argument [MR95]. Thus, we can bound the largest number of edge states simulated with high probability by O(polylog(m)), assuming non-overlapping dependency chains. Further, observe that X_i converges towards an independent Poisson distribution for large m. Then the expected state space per edge is O(1). The experiments in section 3.10.3 suggest that this bound also holds for overlapping dependency chains. In order to keep the dependency chains short, EM-ES splits the sequence of swaps S into runs of equal size. Our experimental results show that a run size of r = m/8 is a suitable choice. For every run, the algorithm executes the six phases as described before. Each time the graph is updated, the mapping between an edge and its id may change. The switching probabilities, however, remain unaltered due to the initial assumption of uniformly distributed swaps. Thus, EM-ES triggers O(k/m · sort(m)) I/Os in total with high probability.
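As a worked instance of this bound, with the run size r = m/8 adopted above:
\[
X_i \sim \operatorname{Bin}\!\left(2r, \tfrac{1}{m}\right), \qquad
\mathbb{E}[X_i] = \frac{2r}{m} = \frac{2\,(m/8)}{m} = \frac{1}{4},
\]
i.e., on average an edge id is referenced by well under one additional swap per run.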

3.7 EM-CM/ES: Sampling of random graphs from prescribed degree sequence

In this section, we propose an alternative approach to generate a graph from a prescribed degree sequence. In contrast to EM-HH which generates a highly biased but simple graph, we use the Configuration Model to sample a random but in general non-simple graph. Thus, the resulting graph may contain self-loops and multi-edges which we then rewire to obtain a simple graph. As experimental data suggests (cf. section 3.7.2), this still results in a biased realization of the degree sequence requiring additional edge switching randomization steps.

3.7.1 Configuration Model

Let D = [d_i]_{i=1}^{n} be a degree sequence with n nodes. The Configuration Model builds a multiset of node ids which can be thought of as half-edges (or stubs). It produces a total of d_i half-edges labeled v_i for each node v_i. The algorithm then chooses two half-edges uniformly at random and creates an edge according to their labels. It repeats the last step with the remaining half-edges until all are paired. A naïve implementation of this algorithm requires with high probability Ω(m) I/Os if m ≥ cM and any constant c > 1. It is therefore impractical in the fully external setting. We rather materialize the multiset as a sequence in which each node appears d_i times, similar to the approach of [KWZ15]. Subsequently, the sequence is shuffled to obtain a random permutation with O(sort(m)) I/Os by sorting the sequence according to a uniform variate drawn for each half-edge⁸. Finally, we scan over the shuffled sequence and match pairs of adjacent half-edges into edges.
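An in-memory sketch of this shuffle-based variant (the EM version sorts by uniform variates instead of shuffling; names are ours):

```cpp
// In-memory sketch of the Configuration Model as described above.
// Assumes the total degree is even; the result may be non-simple.
#include <algorithm>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

std::vector<std::pair<uint32_t, uint32_t>>
configurationModel(const std::vector<uint32_t>& degrees, uint64_t seed) {
    std::vector<uint32_t> stubs;                   // multiset of half-edges
    for (uint32_t v = 0; v < degrees.size(); ++v)
        stubs.insert(stubs.end(), degrees[v], v);  // d_v copies of label v
    std::mt19937_64 rng(seed);
    std::shuffle(stubs.begin(), stubs.end(), rng);
    std::vector<std::pair<uint32_t, uint32_t>> edges;
    for (std::size_t i = 0; i + 1 < stubs.size(); i += 2)  // pair adjacent stubs
        edges.emplace_back(std::min(stubs[i], stubs[i + 1]),
                           std::max(stubs[i], stubs[i + 1]));
    return edges;   // may contain self-loops and multi-edges
}
```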


[Figure 3.8 steps: (1) input degree sequence D = (1, 1, 2, 2, 2, 4); (2) materialized multiset of stubs [1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6]; (3) shuffled sequence [6, 6, 4, 5, 4, 5, 6, 1, 3, 2, 3, 6]; (4) paired stubs forming edges [6, 6], [4, 5], [4, 5], [1, 6], [2, 3], [3, 6].]

Figure 3.8: A Configuration Model run on degree sequence D = (1, 1, 2, 2, 2, 4).

As illustrated in Figure 3.8, the Configuration Model gives rise to self-loops and multi-edges which then need to be rewired, cf. section 3.7.2. Consequently, the rewiring process depends on the number of introduced illegal edges. In the following lemma, we bound their number from above.

Lemma 3.5. Let D be drawn from Pld ([a, b), 2). The expected numbers of self-loops and multi-edges are bounded by
\[
\mathbb{E}[\#\text{self-loops}] \le \frac{1}{2}\,\frac{b-a+1}{\ln(b+1)-\ln(a)}
\quad\text{and}\quad
\mathbb{E}[\#\text{multi-edges}] \le \frac{1}{2}\left(\frac{b-a+1}{\ln(b+1)-\ln(a)}\right)^{\!2}.
\]

Proof. [AHH19] and [New10] derive the expectation values for an arbitrary degree sequence D in terms of its mean ⟨D⟩ and second moment ⟨D²⟩. In the limit of n → ∞, the authors show
\[
\mathbb{E}[\#\text{self-loops}(\mathcal{D})]
  = \frac{\langle \mathcal{D}^2 \rangle - \langle \mathcal{D} \rangle}{2(\langle \mathcal{D} \rangle - 1/n)}
  \;\longrightarrow\;
  \frac{\langle \mathcal{D}^2 \rangle - \langle \mathcal{D} \rangle}{2\langle \mathcal{D} \rangle},
  \tag{3.1}
\]
\[
\mathbb{E}[\#\text{multi-edges}(\mathcal{D})]
  \le \frac{1}{2}\,\frac{\left(\langle \mathcal{D}^2 \rangle - \langle \mathcal{D} \rangle\right)^2}{(\langle \mathcal{D} \rangle - 1/n)(\langle \mathcal{D} \rangle - 3/n)}
  \;\longrightarrow\;
  \frac{1}{2}\left(\frac{\langle \mathcal{D}^2 \rangle - \langle \mathcal{D} \rangle}{\langle \mathcal{D} \rangle}\right)^{\!2}.
  \tag{3.2}
\]
We now bound ⟨D⟩ and ⟨D²⟩ in the case that D is drawn from the powerlaw distribution Pld ([a, b), γ). Since each entry in D is independently drawn, it suffices to bound the expected value and the second moment of the underlying distribution. They are given by ⟨D⟩ = (∑_{i=a}^{b} i^{−γ+1})/C_D and ⟨D²⟩ = (∑_{i=a}^{b} i^{−γ+2})/C_D where C_D = ∑_{i=a}^{b} i^{−γ}. Both numerators are sandwiched between the two integrals ∫_a^{b+1} x^q dx ≤ ∑_{i=a}^{b} i^q ≤ ∫_{a−1}^{b} x^q dx where q = −γ+1 or q = −γ+2, respectively. In the case of γ = 2, the second moment hence simplifies to ⟨D²⟩ = (∑_{i=a}^{b} 1)/C_D = (b − a + 1)/C_D. Applying this identity and the lower bound ⟨D⟩ ≥ (∫_a^{b+1} x^{−1} dx)/C_D = (ln(b+1) − ln(a))/C_D to Eqs. 3.1 and 3.2 directly yields the claim.

⁸If M > mB(1 + o(1)) + O(B), this can be improved to O(scan(m)) I/Os [San98], which does however not affect the total complexity of our pipeline.

3.7.2 Edge rewiring for non-simple graphs

Graphs generated using the Configuration Model may contain multi-edges and self-loops. In order to obtain a simple graph we need to detect these illegal edges and rewire them. After sorting the edge list lexicographically, illegal edges can be detected in a single scan (cf. the sketch below). For each self-loop we issue a swap with a randomly selected partner edge. Similarly, for each group of parallel edges, we generate swaps with random partner edges for all but one multi-edge. Subsequently, we execute the provisioned swaps using a variant of EM-ES (see below). The process is repeated until all illegal edges have been removed. To accelerate the endgame, we double the number of swaps for each remaining illegal edge in every iteration.

Since EM-ES is employed to remove parallel edges based on targeted swaps, it needs to process non-simple graphs. Analogous to the initial formulation, we forbid swaps that introduce multi-edges even if they would reduce the multiplicity of another edge (cf. [Zha13]). Nevertheless, EM-ES requires slight modifications for non-simple graphs. Consider the case where the existence of a multi-edge is inquired several times. Since EL is sorted, the initial edge multiplicities can be counted while scanning EL during the load existence phase. In order to correctly process the dependency chain, we have to forward the (possibly updated) multiplicity information to successor swaps. We annotate the existence tokens exist_msg(s, e, #(e)) with these counters where #(e) is the multiplicity of edge e. More precisely, during the perform swaps phase, swap σ_1 = σ(⟨a, b⟩, d) is informed (among others) of the multiplicities of edges e_a, e_b, fst(σ_1) and snd(σ_1) by incoming existence messages. If σ_1 is legal, we send the requested edges and multiplicities of the swapped state to any successor σ_2 of σ_1 provided in ExistSucc. Otherwise, we forward the edges and multiplicities of the unchanged initial state. As an optimization, edges which have been removed (i.e. have multiplicity zero) are omitted.
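A sketch of the detection scan from the first paragraph (illustrative names; executing the provisioned swaps via EM-ES is omitted):

```cpp
// Single scan over the lexicographically sorted edge list: self-loops and
// all but one copy of each group of parallel edges receive a rewiring swap
// with a uniformly chosen partner edge id and a random direction.
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using Edge = std::pair<uint32_t, uint32_t>;
struct SwapRequest { std::size_t edgeId, partnerId; bool direction; };

std::vector<SwapRequest> findRewiringSwaps(const std::vector<Edge>& sortedEL,
                                           std::mt19937_64& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, sortedEL.size() - 1);
    std::bernoulli_distribution dir(0.5);
    std::vector<SwapRequest> swaps;
    for (std::size_t i = 0; i < sortedEL.size(); ++i) {
        const bool selfLoop = sortedEL[i].first == sortedEL[i].second;
        // in a sorted list parallel edges are adjacent; keep the first copy
        const bool duplicate = i > 0 && sortedEL[i] == sortedEL[i - 1];
        if (selfLoop || duplicate)
            swaps.push_back({i, pick(rng), dir(rng)});
    }
    return swaps;
}
```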

3.8 EM-GER/EM-CER: Merging and repairing the intra- and inter-community graphs

As illustrated in Figure 3.3, LFR samples the inter-community graph and all intra-community graphs independently. As a result, they may exhibit minor inconsistencies which EM-LFR resolves in accordance with the original version by applying additional rewiring steps which are discussed in this section.


3.8.1 EM-GER: Global Edge Rewiring

The global graph is materialized without taking the community structure into account. As illustrated in Figure 3.2 (center), it therefore can contain edges between nodes that share a community. Those edges have to be removed as they decrease the mixing parameter µ. We rewire these edges by performing an edge swap for each forbidden edge with a randomly selected partner. Since it is unlikely that such a random swap introduces another illegal edge (if sufficiently many communities exist), this probabilistic approach effectively removes forbidden edges. We apply this idea iteratively and perform multiple rounds until no forbidden edges remain.

To detect illegal edges, EM-GER considers the community assignment's output, which is a lexicographically ordered sequence χ of (v, ξ)-pairs containing the community ξ for each node v. For nodes that join multiple communities several such pairs exist. Based on this, we annotate every edge with the communities of both incident vertices by scanning through the edge list twice: once sorted by source nodes and once by target nodes. For each forbidden edge, a swap is generated by drawing a random partner edge id and a swap direction. Subsequently, all swaps are executed using EM-ES which now also emits the set of edges involved. It suffices to restrict the scan for illegal edges to this set since all edges not contained are legal by construction.

Complexity. Each round requires O(sort(m)) I/Os for selecting the edges and executing the swaps. The number of rounds is usually small but depends on the community size distribution: the probability that a randomly placed edge lies within a community increases with the size of the community.
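The forbidden-edge test at the heart of EM-GER reduces to an intersection of community id lists; the following in-memory sketch uses a per-node vector where the EM algorithm merges sorted streams (names are ours):

```cpp
// An edge is forbidden iff its endpoints share at least one community.
// chi[v] holds the sorted community ids of node v (the EM version derives
// this from the sorted (v, ξ) sequence χ via two merge scans).
#include <cstdint>
#include <utility>
#include <vector>

using Edge = std::pair<uint32_t, uint32_t>;

bool isForbidden(const Edge& e,
                 const std::vector<std::vector<uint32_t>>& chi) {
    const auto& a = chi[e.first];
    const auto& b = chi[e.second];
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {   // sorted-list intersection
        if (a[i] == b[j]) return true;       // shared community found
        (a[i] < b[j]) ? ++i : ++j;
    }
    return false;
}
```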

3.8.2 EM-CER: Community Edge Rewiring

In the case of overlapping communities, the same edge can be generated as part of multiple communities. We iteratively apply semi-random swaps to remove those parallel edges, similarly to sections 3.7.2 and 3.8. The selection of random partners is however more involved for EM-CER as it has to ensure that all swaps take place between two edges of the same community. This way, the rewired edges keep the same memberships as their sources and the community sizes do not change. The rewiring itself is easy to achieve by considering all communities independently. Unfortunately, EM-CER needs to process all communities conjointly to detect forbidden edges: we augment each edge [u_i, v_i] with its community id c_i and concatenate these lists into one annotated graph possibly containing multi-edges. During a scan through the lexicographically sorted and annotated edge list [(u_i, v_i, c_i)]_i, parallel edges are easily found as they appear next to each other. We select all but one from each group for rewiring. Each partner is selected by a uniform edge id e_b addressing the e_b-th edge of the community at hand. In a fully external setting, it suffices to sort the selected candidates, their partners and the edge list by community to gather all information required to invoke EM-ES.

EM-CER avoids the expensive step of sorting all edges if we can store O(1) items per illegal edge in IM (which is almost certainly the case since there are typically few

illegal edges). It then sorts the edge ids of partners for every community independently and keeps pointers to the smallest requested partner edge id of each community. While scanning through the concatenated edge list, we count for each community the number of edges seen so far. When the counter matches the smallest requested id of the current edge's community, we load the edge and advance the pointer to the next request.

Complexity. The fully external rewiring requires O(sort(m)) I/Os for the initial step and each following round. The semi-external variant triggers only O(scan(m)) I/Os per round. The number of rounds is usually small and the overall runtime spent on this step is insignificant. Nevertheless, the described scheme is a Las Vegas algorithm and there exist (unlikely) instances on which it will fail.⁹ To mitigate this issue, we allow a small fraction of edges (e.g., 10⁻³) to be removed if we detect a slow convergence. To speed up the endgame, we also draw additional swaps uniformly at random from communities which contain a multi-edge.

3.9 Implementation

We implemented the proposed algorithms in C++ based on the STXXL library [DKS08], providing implementations of EM data structures, a parallel EM sorter, and an EM priority queue. Among others, we applied the following optimizations for EM-ES:

• Most message types contain both a swap id and a flag indicating which of the swap's edges is targeted. We encode both of them in a single integer by using all but the least significant bit for the swap id and storing the flag in the remaining bit (see the first sketch after this list). This significantly reduces the memory volume and yields a simpler comparison operator since the standard integer comparison already ensures the correct lexicographic order.

• Instead of storing and reading the sequence of swaps several times, we exploit the implementation's pipeline structure and directly issue edge id requests for every arriving swap. Since this is the only time edge ids are read from a swap, only the remaining direction flag is stored in an efficient EM vector, which uses one bit per flag and supports I/O-efficient writing and reading. Both steps can be overlapped with an ongoing EM-ES run.

• Instead of storing each edge in the sorted external edge list as a pair of nodes, we only store each source node once and then list all targets of that node (see the second sketch after this list). This still supports sequential scan and merge operations, which are the only operations we need. This almost halves the I/O volume of scanning or updating the edge list.

• During the execution of several runs we can delay the updating of the edge list and combine it with the load nodes phase of the next run. This reduces the number of scans per additional run from three to two.

⁹Consider a node which is a member of two communities in which it is connected to all other nodes. If only one of its neighbors also appears in both communities, the multi-edge cannot be rewired.


• We use asynchronous stream adapters for tasks such as streaming from sorters or the generation of random numbers. These adapters run in parallel in the background to preprocess and buffer portions of the stream in advance and hand them over to the main thread.
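The two storage tricks above are easy to illustrate. First, a sketch of the packed message key from the first item (function names are ours); because the flag occupies the least significant bit, plain integer comparison orders messages by (swap id, flag):

```cpp
// Pack a swap id and an edge flag into one integer; the flag lives in the
// least significant bit so the standard integer order equals the
// lexicographic (swap id, flag) order. Illustrative helper names.
#include <cstdint>

inline uint64_t pack(uint64_t swapId, bool secondEdge) {
    return (swapId << 1) | static_cast<uint64_t>(secondEdge);
}
inline uint64_t swapIdOf(uint64_t packed) { return packed >> 1; }
inline bool     flagOf(uint64_t packed)   { return (packed & 1u) != 0; }
```

Second, a sketch of the compacted edge-list layout from the third item. The marker-bit convention used here is our own assumption for illustration; the actual on-disk encoding is not specified in this chapter:

```cpp
// Encode a lexicographically sorted edge list by writing each source node
// once (tagged with a marker bit) followed by all of its targets; decode
// reverses the transformation. Assumes node ids fit in 31 bits.
#include <cstdint>
#include <utility>
#include <vector>

using Edge = std::pair<uint32_t, uint32_t>;
constexpr uint32_t kSourceBit = 1u << 31;

std::vector<uint32_t> encode(const std::vector<Edge>& sortedEL) {
    std::vector<uint32_t> out;
    uint32_t last = UINT32_MAX;
    for (const auto& [u, v] : sortedEL) {
        if (u != last) { out.push_back(u | kSourceBit); last = u; }
        out.push_back(v);
    }
    return out;
}

std::vector<Edge> decode(const std::vector<uint32_t>& stream) {
    std::vector<Edge> edges;
    uint32_t source = 0;
    for (uint32_t x : stream) {
        if (x & kSourceBit) source = x & ~kSourceBit;
        else edges.emplace_back(source, x);
    }
    return edges;
}
```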

Besides parallel sorting and asynchronous pipeline stages, the current EM-LFR implementation facilitates parallelism during the generation and randomization of intra-community graphs which can be computed without any synchronization. While the algorithms themselves are sequential, this pipelining and parallelization of independent tasks within EM-LFR leads to a consistent utilization of available threads in our test system (cf. section 3.10).

3.10 Experimental Results

3.10.1 Notation and Setup

The number of repetitions per data point (with different random seeds) is denoted by S. Error bars correspond to the unbiased estimation of the standard deviation. For LFR we perform experiments based on two different scenarios:

lin The maximal degrees and community sizes scale linearly as a function of n. For a particular n and ν the parameters are chosen as: µ ∈ {0.2, 0.4, 0.6}, d_min = 10ν, d_max = nν/20, γ = 2, s_min = 20, s_max = n/10, β = 1, O = n.

const We keep the community sizes and the degrees constant and consider only non-overlapping communities. The parameters are chosen as: d_min = 50, d_max = 10 000, γ = 2, s_min = 50, s_max = 12 000, β = 1, O = n.

Real-world networks have been shown to have increasing average degrees as they become larger [LKF05]. Increasing the maximum degree as in our first setting lin increases the average degree. Having a maximum community size of n/10 means, however, that a significant proportion of the nodes belongs to huge communities which are not very tightly knit due to the large number of nodes of low degree. While a more limited growth is probably more realistic, the exact parameters depend on the network model. Our second parameter set const shows an example of much smaller maximum degrees and community sizes. We chose the parameters such that they approximate the degree distribution of the Facebook network in May 2011 when it consisted of 721 million active users as reported in [Uga+11]. The same study however found that strict powerlaw models are unable to accurately mimic Facebook's degree distribution. Further, the authors show that the degree distribution of the U.S. users (removing connections to non-U.S. users) is very similar to the one of the Facebook users of the whole world, supporting our use of just one parameter set for different graph sizes.

The minimum degree of the Facebook network is 1, but such small degrees are significantly less prevalent than a power law degree sequence would suggest, which is why we chose a value of 50. Our maximum degree of 10 000 is larger than the one reported


[Figure 3.9 plots omitted. Left panel fits: γ = 1: 0.379·x; γ = 2: 1.380·x^{1/2}; γ = 3: 1.270·x^{1/3} (number of unique elements vs. number n of samples). Right panel: run sizes 0.05n, 0.125n, 0.5n (relative frequency vs. number of edge configurations received by swap).]

Figure 3.9: Left: Number of distinct elements in n samples (i.e. node degrees in a degree sequence) taken from Pld ([1, n), γ); cf. section 3.10.2. Right: Overhead induced by tracing inter-swap dependencies; fraction of swaps as a function of the number of edge configurations they receive during the simulation phase (cf. section 3.10.3).

for Facebook (5000, which is an arbitrarily enforced limit by Facebook). The expected average degree of this degree sequence is 264, which is slightly higher than the reported 190 (world) or 214 (U.S. only). Our parameters are chosen such that the median degree is approximately 99, matching the worldwide Facebook network. Similar to the first parameter set, we chose the maximum community size slightly larger than the maximum degree.

3.10.2 EM-HH’s state size

In Lemma 3.2, we bound EM-HH's internal memory consumption by showing that a sequence of n numbers randomly sampled from Pld ([1, n), γ) contains only O(n^{1/γ}) distinct values with high probability. In order to support Lemma 3.2 and to estimate the hidden constants, samples of varying size between 10³ and 10⁸ are taken from distributions with exponents γ ∈ {1, 2, 3}. Each time, the number of unique elements is computed and averaged over S = 9 runs with identical configurations but different random seeds. The results illustrated in Figure 3.9 support the predictions with small constants and negligible deviations. For the commonly used exponent 2, we find 1.38√n distinct elements in a sequence of length n.
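A small program reproducing this experiment; the sampler below (cumulative weights plus binary search) is one straightforward choice, as the text does not prescribe a particular sampling method:

```cpp
// Count distinct values among n samples from Pld([1, n), gamma), i.e.
// P[X = i] proportional to i^(-gamma) for 1 <= i < n.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <unordered_set>
#include <vector>

std::size_t distinctSamples(std::size_t n, double gamma, std::uint64_t seed) {
    std::vector<double> cum(n - 1);
    double acc = 0.0;
    for (std::size_t i = 1; i < n; ++i)
        cum[i - 1] = (acc += std::pow(static_cast<double>(i), -gamma));
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, acc);
    std::unordered_set<std::size_t> seen;
    for (std::size_t s = 0; s < n; ++s) {
        // smallest index whose cumulative weight reaches the variate
        auto it = std::lower_bound(cum.begin(), cum.end(), uni(rng));
        seen.insert(static_cast<std::size_t>(it - cum.begin()) + 1);
    }
    return seen.size();
}

int main() {
    // For gamma = 2 the text reports about 1.38 * sqrt(n) distinct values.
    for (std::size_t n : {1000, 10000, 100000})
        std::cout << n << ": " << distinctSamples(n, 2.0, 1) << '\n';
}
```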

3.10.3 Inter-Swap Dependencies

Whenever multiple swaps target the same edge, EM-ES simulates all possible states to be able to retrieve conflicting edges. In section 3.6.3, we argue that the number of dependencies and the state size remains manageable if the sequence of swaps is split into sufficiently short runs. We found that for m edges and k swaps, 8k/m runs minimize the runtime for large instances of lin. As indicated in Figure 3.9, in this setting 78.7 % of swaps receive the two requested edge configurations with no additional overhead during the simulation phase. Less than 0.4 % consider more than four additional states (i.e.



Figure 3.10: Left: Runtime on SysB of IM VL-ES and EM-ES on a graph with m edges and average degree d̄ executing k = 10m swaps (cf. section 3.10.6). Right: Runtime on SysA of the original LFR implementation and EM-LFR for µ = 0.2 (cf. section 3.10.9).

more than six messages in total). Similarly, 78.6 % of existence requests remain without dependencies.

3.10.4 Test systems

Runtime measurements were conducted on the following systems:

SysA inexpensive compute server: Intel Xeon E5-2630 v3 (8 cores/16 threads, 2.40 GHz), 64 GB RAM, 3× Samsung 850 PRO SATA SSD (1 TB).

SysB commodity hardware: Intel Core i7 CPU 970 (6 cores/12 threads, 3.2 GHz), 12 GB RAM, 1× Samsung 850 PRO SATA SSD (1 TB).

Since edge switching scales linearly in the number of swaps (in case of EM-ES in the number of runs), some of the measurements beyond 3 h runtime are extrapolated from the progress until then. We verified that errors stay within the indicated margin using reference measurements without extrapolation.

3.10.5 Performance of EM-HH

Our implementation of EM-HH produces 180 ± 5 million edges per second on SysA up to at least 2 × 10¹⁰ edges. Here, we include the computation of the input degree sequence, EM-HH's compaction step, as well as the writing of the output to external memory.

3.10.6 Performance of EM-ES

Figure 3.10 presents the runtime required on SysB to process k = 10m swaps in an input graph with m edges and for the average degrees d̄ ∈ {100, 1000}. For reference, we include the performance of the existing internal memory edge swap algorithm VL-ES

based on the authors' implementation [VL16].¹⁰ VL-ES slows down by a factor of 25 if the data structure exceeds the available internal memory by less than 10 %. We observe an analogous behavior on machines with larger RAM. EM-ES is faster than VL-ES for all instances with m > 2.5 × 10⁸ edges; those graphs still fit into main memory.

FDSM has applications beyond synthetic graphs, and is for instance used on real data to assess the statistical significance of observations [SHZ15]. In that spirit, we execute EM-ES on an undirected version of the crawled ClueWeb12 graph's core [The13] which we obtain by deleting all nodes corresponding to uncrawled URLs.¹¹ Performing k = m swaps on this graph with n ≈ 9.8 × 10⁸ nodes and m ≈ 3.7 × 10¹⁰ edges is feasible in less than 19.1 h on SysB.

Bhuiyan et al. propose a distributed edge switching algorithm and evaluate it on a compute cluster with 64 nodes each equipped with two Intel Xeon E5-2670 2.60 GHz 8-core processors and 64 GB RAM [Bhu+14]. The authors report to perform k = 1.15 × 10¹¹ swaps on a graph with m = 10¹⁰ edges generated in a preferential attachment process in less than 3 h. We generate a preferential attachment graph using an EM generator [MP16] matching the aforementioned properties and carry out edge swaps using EM-ES on SysA. We observe a slow-down of only 8.3 on a machine with 1/128 the number of comparable cores and 1/64 of the internal memory.

3.10.7 Performance of EM-CM/ES and qualitative comparison with EM-ES

In section 3.7, we describe an alternative graph sampling method. Instead of seeding EM-ES with a highly biased graph using EM-HH, we employ the Configuration Model to generate a non-simple random graph and then obtain a simple graph using several EM-ES runs in a Las Vegas fashion. Since EM-ES scans through the edge list in each iteration, runs with very few swaps are inefficient. For this reason, we start the subsequent Markov chain to further randomize the graph early: First, identify all multi-edges and self-loops and generate swaps with random partners. In a second step, we then introduce additional random swaps until the run contains at least m/10 operations.¹²

For an experimental comparison between EM-ES and EM-CM/ES, we consider the runtime until both yield a sufficiently uniform random sample. Of course, the uniformity is hard to quantify; similarly to related studies (cf. section 3.1.1), we estimate the mixing times of both approaches as follows. Starting from a common seed graph G^(0), we generate an ensemble {G_1^(k), ..., G_S^(k)} of S ≫ 1 instances by applying independent random sequences of k ≫ m swaps each. During this process, we regularly export

¹⁰Here we report only on the edge swapping process excluding any precomputation. To achieve comparability, we removed connectivity tests, fixed memory management issues, and adopted the number of swaps. Further, we extended counters for edge ids and accumulated degrees to 64 bit integers in order to support experiments with more than 2³⁰ edges.
¹¹We consider such vertices atypically simple as they have degree 1 and account for ≈84 % of nodes in the original graph.
¹²We chose this number as it yields execution times similar to the m/8-setting of EM-ES on simple graphs.



Figure 3.11: Left: Number of triangles on const with n = 1 × 10⁵ and µ = 1.0. Right: Degree assortativity on const with n = 1 × 10⁷ and µ = 0.2. In order to factor in the increased runtime of EM-CM/ES compared to EM-HH, plots of EM-CM/ES are shifted by the runtime of this phase relative to the execution of EM-ES. As EM-CM/ES is a Las Vegas algorithm, this incurs an additional error along the x-axis.

snapshots G_i^(jm) of the intermediate instances j ∈ [k/m] of graph G_i. For EM-CM/ES, we start from the same seed graph, apply the algorithm and then carry out k swaps as described above. For each snapshot, we compute several metrics, such as the average local clustering coefficient (ACC), the number of triangles, and the degree assortativity.¹³ We then investigate how the distribution of these measures evolves within the ensemble as we carry out an increasing number of swaps. We omit results for ACC since they are less sensitive compared to the other measures (see section 3.10.8).

As illustrated in Figure 3.11 and Appendix 3.13, all proxy measures converge within 5m swaps with a very small variance. No statistically significant change can be observed compared to a Markov chain with 30m operations (which was only computed for a subset of each ensemble due to its computational cost). EM-HH generates biased instances with special properties, such as a high number of triangles and correlated node degrees, while the features of EM-CM/ES's output nearly match the converged ensemble. This suggests that the number of swaps to obtain a sufficiently uniform sample can be reduced for EM-CM/ES. Due to computational costs, the study was carried out on multiple machines executing several tasks in parallel. Hence, absolute running times are not meaningful, and we rather measure the computational costs in units of time required to carry out 1m swaps by the same process. This accounts for the offset of EM-CM/ES's first data point.

The number of rounds required to obtain a simple graph depends on the degree distribution. For const with n = 1 × 10⁵ and µ = 1, a fraction of 5.1 % of the edges

¹³In preliminary experiments, we also included spectral properties (such as extremal eigenvalues of the adjacency/Laplacian matrix) and the closeness centrality of fixed nodes. As these measurements are more expensive to compute and yield qualitatively similar results, we decided not to include them in the larger trials.


[Figure 3.12 plots omitted; curves show the convergence point (in swaps per edge k/m) of the local clustering coefficient, degree assortativity, and triangle count as a function of the number n of nodes.]

Figure 3.12: Number of swaps per edge after which ensembles of graphs with the following parameters converge: const, 1 × 10⁵ ≤ n ≤ 1 × 10⁷ and µ = 0.4 (left) and µ = 0.6 (right). Due to computational costs, the ensemble size is reduced from S > 100 to S > 10 for large graphs.

produced by the Configuration Model are illegal. EM-ES requires 18 ± 2 rewiring runs in case a single swap is used per round to rewire an illegal edge. In the default mode of operation, 5.0 ± 0.0 rounds suffice as the number of rewiring swaps per illegal edge is doubled in each round. For larger graphs with n = 1 × 10⁷, only 0.07 % of edges are illegal and need 2.25 ± 0.40 rewiring runs.

3.10.8 Convergence of EM-ES

In a similar spirit to the previous section, we indirectly investigate the Markov chain's mixing time as a function of the number of nodes n. To do so, we generate ensembles as before with 1 × 10⁵ ≤ n ≤ 1 × 10⁷ and compute the same graph metrics. For each group and measure, we then search for the first snapshot p in which the measure's mean is within an interval of half the standard deviation of the final values and subsequently remains there for at least three phases. We then interpret p as a proxy for the mixing time. As depicted in Figure 3.12, no measure shows a systematic increase over the two orders of magnitude considered. It hence seems plausible not to increase the number of swaps performed by EM-LFR compared to the original implementation.
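A sketch of this convergence-detection rule (our naming; the exact window handling in the evaluation scripts may differ):

```cpp
// Find the first snapshot whose ensemble mean lies within half a standard
// deviation of the final value and stays there for at least three
// consecutive snapshots; this index serves as a proxy for the mixing time.
#include <cmath>
#include <cstddef>
#include <vector>

std::size_t firstConverged(const std::vector<double>& means,
                           double finalMean, double finalStd) {
    const double tol = finalStd / 2.0;
    for (std::size_t p = 0; p + 3 <= means.size(); ++p) {
        bool stable = true;
        for (std::size_t j = p; j < p + 3; ++j)
            stable = stable && std::fabs(means[j] - finalMean) <= tol;
        if (stable) return p;
    }
    return means.size();   // no convergence detected within the snapshots
}
```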

3.10.9 Performance of EM-LFR

Figure 3.10 reports the runtime of the original LFR implementation and EM-LFR as a function of the number of nodes n for ν = 1. EM-LFR is faster for graphs with n ≥ 2.5 × 10⁴ nodes, which feature approximately 5 × 10⁵ edges and are well in the IM domain. Further, the implementation is capable of producing graphs with more than 1 × 10¹⁰ edges in 17 h.¹⁴ Using the same time budget, the original implementation generates graphs more than two orders of magnitude smaller.

¹⁴Roughly 1.5 h are spent in the endgame of the Global Rewiring (at that point, less than one edge out of 10⁶ is invalid). In this situation, an algorithm using random I/Os may yield a speed-up. Alternatively, we could simply discard the insignificant fraction of remaining invalid edges.



Figure 3.13: Adjusted rand measure of Infomap/Louvain and ground truth at µ = 0.6 with disjoint clusters, s_min = 10, s_max = n/20.


Figure 3.14: NMI of OSLOM and ground truth at µ = 0.4 with 2/4 overlapping clusters per node.

3.10.10 Qualitative Comparison of EM-LFR

When designing EM-LFR, we closely followed the LFR benchmark such that we can expect it to produce graphs following the same distribution as the original LFR generator. To confirm this experimentally, we generated graphs with identical parameters using the original LFR implementation and EM-LFR. For disjoint clusters we also compare it with the implementation of NetworKit [SSM16].

For disjoint clusters, we evaluate the results of the Infomap [RAB09] and the Louvain [Blo+08] algorithms; see Section 1.4.1 for an introduction to both. We chose them since both the Louvain algorithm and Infomap were found to achieve high quality results on LFR benchmark graphs while being fast [LF09b]. In particular, the Louvain method is also among the most frequently used community detection algorithms [FH16; Emm+16]. For overlapping clusters, we evaluate the results of OSLOM [Lan+11], as OSLOM is one of the best-performing algorithms for overlapping community detection [Buz+14; FH16]; see Section 1.4.4 for an introduction of OSLOM.

We compare the clusterings of the algorithms to the ground truth clusterings using the adjusted rand measure [HA85] for disjoint clusters (see also Section 2.3.2) and NMI [VR12] for both disjoint and overlapping clusters (see also Section 2.3.1). Further, we examine the average local clustering coefficient. It measures the fraction of closed triangles and thus shows the presence of locally denser areas as expected in



Figure 3.15: Average local clustering coefficient at µ = 0.6 with disjoint clusters.

communities [Kai08], see also Section 1.3.1. We report these measures for graphs ranging from 10³ to 10⁶ nodes and present a selection of results in Figures 3.13, 3.14 and 3.15; all of them can be found in Section 3.12 at the end of this chapter. There are only small differences within the range of random noise between the graphs generated by EM-LFR and the other two implementations. Note that due to the computational costs above 10⁵ edges, there is only one sample for the original implementation, which explains the outliers in Figure 3.13.

Similar to the results in [Emm+16], we also observe that the performance of clustering algorithms drops significantly as the graph's size grows. For Louvain, this is partially due to the resolution limit that prevents the detection of small communities in huge graphs. Due to the different power law exponents, the average community size grows much faster than the average degree as the size of the graphs is increased. Therefore, in particular the larger clusters become sparser and thus more difficult to detect with increasing graph size. On the other hand, small clusters become easier to detect as the graph size grows because outgoing edges are distributed among more nodes and are thus easier to distinguish from intra-cluster edges. This might explain why the performance of OSLOM first improves as the graph size grows. Apart from that, currently used heuristics might also just be unsuited for large graphs with nodes of very different degrees. Results on LFR graphs with one million nodes in [Ham+18a], see also Chapter 6, show that both Louvain and Infomap are unable to detect the ground truth on LFR graphs with higher values of µ even though the ground truth has a better modularity or map equation score than the found clustering. Such behavior clearly demonstrates the necessity of EM-LFR for being able to study this phenomenon on even larger graphs and develop algorithms that are able to handle such instances.

The quality of the community assignments used by LFR and EM-LFR is assessed in terms of the modularity scores Q_G(C) [NG04] (see also Section 1.3.3) achieved by the generated graph G and ground truth C. In general Q_G(C) takes values in [−1, 1], but for large n and bounded community sizes, the modularity of an LFR graph approaches Q → 1 − µ, as the coverage corresponds to 1 − µ while the expected coverage approaches 0. For each configuration n ∈ {10³, ..., 10⁶} and µ ∈ {0.2, 0.4, 0.6}, we generate S ≥ 10 networks for each generator and compute their mean modularity score. In all cases, the

relative differences between the two generators are below 10⁻² and for small µ typically another order of magnitude smaller.

3.11 Outlook and Conclusion

We propose the first I/O-efficient graph generator for the LFR benchmark and the FDSM, the most challenging step involved, which dominates the running time: EM-HH materializes a graph based on a prescribed degree distribution without I/O for virtually all realistic parameters. Including the generation of a powerlaw degree sequence and the writing of the output to disk, our implementation generates 1.8 × 10⁸ edges per second for graphs exceeding main memory. EM-ES randomizes graphs with m edges based on k edge switches using O(k/m · sort(m)) I/Os for k = Ω(m).

We demonstrate that EM-ES is faster than the internal memory implementation [VL16] even for large instances still fitting in main memory and scales well beyond the limited main memory. Compared to the distributed approach by [Bhu+14] on a cluster with 128 CPUs, EM-ES exhibits a slow-down of only 8.3 on one CPU and hence poses a viable and cost-efficient alternative. Our EM-LFR implementation is orders of magnitude faster than the original LFR implementation for large instances and scales well to graphs exceeding main memory while the generated graphs are equivalent. Graphs with more than 1 × 10¹⁰ edges can be generated in 17 h. We further give evidence that commonly accepted parameters to derive the length of the edge switching Markov chain remain valid for graph sizes approaching the external memory domain and that EM-CM/ES can be used to accelerate the process.

Curveball trades [CBS18] are an alternative to edge switching for randomizing the edges of a graph while maintaining the degree sequence. In each trade, the non-common neighbors of two randomly selected nodes are shuffled while keeping the degrees of the involved nodes. In a follow-up work to this research, we developed a parallel, external memory implementation of Curveball trades [Car+18]. As the main authors of this paper are Corrie Jacobien Carstens, Manuel Penschuck and Hung Tran, we only give a short summary here. As we show, power law degree distributions are challenging for Curveball as the randomization of high-degree nodes is limited when they are paired with low-degree nodes. We introduce global trades for undirected graphs where every node participates in exactly one trade. Our experiments indicate that two global trades yield a roughly similar randomization as m edge swaps. Over EM-ES, we achieve speedups of 5 to 14 depending on the parameter choice. To compensate for the slower convergence, the actual speedups might be half of that. Still, this should further speed up the generation of LFR benchmark graphs using external memory.

EM-LFR provides the basis for the development and evaluation of clustering algorithms for graphs that exceed main memory, such as our distributed algorithms for optimizing modularity and map equation [Ham+18a] that we introduce in the next part of this thesis. There, we show that the behavior of algorithms on large graphs is not necessarily the same as on small graphs even when cluster sizes do not change, which demonstrates the necessity of such evaluations. Comparison measures such as NMI or the adjusted rand

index typically do not consider the graph structure; therefore, they can usually still be computed in internal memory even for graphs that exceed main memory. However, for graphs where even the number of nodes exceeds the size of the internal memory, there is the need to develop memory-efficient algorithms also for comparing clusterings.


3.12 Comparing LFR Implementations


Comparison of the original LFR implementation, the NetworKit implementation and our EM solution for values of 10³ ≤ n ≤ 10⁶, µ ∈ {0.2, 0.4, 0.6}, γ = 2, β = 1, d_min = 10, d_max = n/20, s_min = 10, s_max = n/20. Clustering is performed using Infomap and Louvain and compared to the ground truth emitted by the generator using the Adjusted Rand Measure (AR) and Normalized Mutual Information (NMI); S ≥ 8. Due to the computational costs, graphs with n ≥ 10⁵ have a reduced multiplicity. In case of the original implementation it may be based on a single run, which accounts for the few outliers.



Comparison of the original LFR implementation and our EM solution for values of 10³ ≤ n ≤ 10⁶, µ ∈ {0.2, 0.4, 0.6}, ν ∈ {2, 3, 4}, O = n, γ = 2, β = 1, d_min = 10, d_max = n/20, s_min = 10ν, s_max = ν · n/20. Clustering is performed using OSLOM and compared to the ground truth emitted by the generator using a generalized Normalized Mutual Information (NMI); S ≥ 5.

3.13 Comparing EM-ES and EM-CM/ES

Triangle count and degree assortativity of a graph ensemble obtained by applying random swaps/the Configuration Model to a common seed graph; the plots (omitted) cover const instances with n ∈ {10⁵, 10⁷} and µ ∈ {0.2, 1.0}. Refer to section 3.10.7 for experimental details.


4 Benchmark Generator for Dynamic Overlapping Communities in Networks

This chapter is based on joint work with Neha Sengupta and Dorothea Wagner [SHW17]. While the model of the graph and the dynamic events have only been slightly changed compared to the publication, there have been significant changes to the algorithms used for realizing the graph. For this, large parts of the description of the algorithm as well as the experimental results have been rewritten and are based on a new implementation and therefore also new experiments.

We describe a dynamic graph generator with overlapping communities that is capable of simulating community-scale events while at the same time maintaining crucial graph properties. Such a benchmark generator is useful to measure and compare the responsiveness and efficiency of dynamic community detection algorithms. Since the generator allows the user to tune multiple parameters, it can also be used to test the robustness of a community detection algorithm across a spectrum of inputs. In an experimental evaluation, we demonstrate the generator's performance and show that graph properties are indeed maintained over time. Further, we show that standard community detection algorithms are able to find the generated community structure. To the best of our knowledge, this is the first time that all of the above have been combined into one benchmark generator, and this work constitutes an important building block for the development of efficient and reliable dynamic, overlapping community detection algorithms.

4.1 Introduction

A large portion of the existing literature on community detection overlooks at least one of two key aspects of a multitude of real-world graphs: (a) their dynamic nature, i.e. edges and nodes keep getting added and deleted, and (b) the often observed highly overlapping and complex structure of such networks [GDC10; YL14]. Recently proposed approaches have identified the above problems and provide solutions for detecting overlapping communities in temporal graphs [Dua+12; Ngu+11; CKU13; AP14]. However, it is difficult to empirically evaluate and compare these methods due to the lack of a realistic and fast benchmark network data generator or real-world data sets with reliably labeled, dynamic ground truth communities. As also discussed in Chapter 2, real-world data sets might provide important insights, but most of the time such data is either unavailable, e.g. for privacy reasons, or does not contain reliable ground truth data to compare the found communities against. Benchmark graph generators allow

to evaluate the behavior of community detection algorithms on graphs with different, predefined properties and thus test the robustness of the algorithm against a well-defined set of ground truth communities.

The requirements of a benchmark graph generator in such a scenario are diverse. Existing generators for static benchmark graphs such as LFR [LFR08] and CKB [Chy+14] replicate properties of real-world networks like that node degrees and community sizes follow power law distributions. CKB additionally ensures that the number of communities a node belongs to follows a power law distribution and has a positive correlation with the node degrees. Our goal is to extend the CKB model with a dynamic component that simulates changes in the community structure while maintaining these graph properties.

It is commonly agreed on [PBV07; BKT07; APU09; GDC10] that the evolution of communities can be characterized by the fundamental events birth, death, merge, split, expansion and contraction. The challenge is that such community-scale events also affect the properties of the graph. For example, when a new community appears, many edges need to be added to the nodes of the community, which might distort the degree distribution. An analogous problem comes up when communities cease to exist, two communities merge into a single community, or one community splits into two individual communities [GDC10]. Moreover, since nodes leave or join communities, it also has the potential to affect the community size distribution. Further, we want to allow fine-grained control over the rapidity with which events take place in the generated graph, as most communities in a real-world setting evolve gradually over time [Bac+06]. A community detection algorithm must thus be able to follow smaller changes in order to detect large-scale changes in the community structure. This can be used to evaluate the sensitivity of different community detection algorithms. It has been shown that all of these events can also be observed in real-world networks [GDC10]. Therefore, any dynamic community detection algorithm that is supposed to work on non-trivial real-world graphs needs to be able to detect at least these basic events.

4.1.1 Our Contribution

In this work, we extend the CKB benchmark graph generator for overlapping community structures to generate communities that evolve over time. Our generator simulates community events like birth, merge, split, death, expansion, and contraction, which result in node and edge insertions and deletions that gradually change the graph. A set of parameters allows controlling various properties of the graph as well as the speed of the dynamic process. In our model, at every time step, a configurable number of community events is triggered that is proportional to the number of communities. However, our generator can also be easily adapted to follow a different pattern of events. We show using empirical analysis that the graph generator is fast and produces graphs that actually maintain the properties of the CKB model over time, i.e., the node degrees, the number of communities per node and the community sizes follow power law distributions. Further, we show that we achieve a realistic average local clustering coefficient over time. Also, we show using standard, existing community detection algorithms that our generator produces graphs whose link structure reflects its ground truth community assignment over time and that it is capable of differentiating between different community detection algorithms.

A preliminary version of this work has been published as [SHW17]. Compared to this publication, we replaced the algorithms that realize the high-level community event sequence as actual graph changes. As a consequence, we can better preserve the properties of the benchmark graphs over time. We also changed the way community events are triggered, allowing multiple events to start at the same time. In particular for large graphs, this allows for much faster changes to the overall community structure. Section 4.2 describes the prior work. Section 4.3 describes the algorithm in detail, and Section 4.4 presents the experimental evaluation of the generator.

4.2 Related Work

For evolving networks, there is no widely used benchmark model available, and those that exist are limited to non-overlapping communities. It is widely agreed [PBV07; BKT07; APU09; GDC10], though, that the evolution of dynamic communities can be characterized by the following fundamental events: birth, death, merging, splitting, expansion, and contraction. The authors of [GDC10] describe an algorithm for tracking the evolution of a given community in a dynamic graph. For the empirical evaluation, they generate synthetic LFR graphs [LF09a] along with embedded community events of each of the above types that are applied as rapid changes of the graph. In [TB11], a different model is used for the evaluation of a dynamic community detection algorithm: nodes are initially randomly assigned to a set of k communities, and some nodes change their membership in each time step. Based on the planted partition model, [Gör+12] describes an algorithm for generating a dynamic graph with non-overlapping clusters that also features the above-mentioned event types; they are applied slowly using random changes over several time steps. In a more recent work, [Gra+15] introduces another benchmark model, again based on the planted partition model, but only considering grow-shrink and merge-split operations. They propose three benchmarks that feature either one or both of the operations.

4.3 Algorithm

Our dynamic graph generator generates a series of simple, unweighted graphs $G_0, \ldots, G_T$ with overlapping, evolving ground truth communities $\mathcal{C}_0, \ldots, \mathcal{C}_T$ over $T + 1$ discrete time steps. Each graph $G_t = (V_t, E_t)$ consists of $n_t := |V_t|$ nodes and $m_t := |E_t|$ edges. The first graph $G_0$ with reference communities $\mathcal{C}_0$ is directly generated according to the model of the CKB generator [Chy+14] with parameters as summarized in the first part of Table 4.1. In each of the $T$ discrete time steps we maintain its properties: The number of overlapping communities a node is part of follows a power law distribution $\mathrm{Pld}([x_1, x_2), \beta_1)$. Community sizes also follow a power law distribution $\mathrm{Pld}([n_{\min}, n_{\max}), \beta_2)$. Every community is modeled as a $G(n, p)$ graph, i.e., a graph where every edge has the same probability of existence [Gil59]. The edge probability is $\alpha / |C|^\gamma$ and thus decreases with increasing community size.

Figure 4.1: Merge of communities $C_1$, $C_2$ into $C$ by growing their overlap.

To add noise to the graph, a so-called epsilon community containing all nodes with edge probability $\epsilon$ is added. In the first part of Table 4.1, we list these parameters along with their recommended values. Except for the value of $n_{\min}$, the recommendations are the same as in [Chy+14].

As community sizes follow a power law distribution, the value of $n_{\min}$ is the size of many communities. In [Chy+14], $n_{\min} = 2$ is proposed. This would lead to a large number of communities of two or three nodes that should be detected by a community detection algorithm. Communities of size two are indistinguishable from other edges, and even three nodes can hardly be called a community in most contexts. In our paper [SHW17], we propose $n_{\min} = \min(n_0/100, 20)$, which leads to fewer than $n_0/10$ communities with the recommended parameters for larger graphs. As with these parameters a node can be part of up to $n_0/10$ communities, this made it impossible to realize the node-community assignment. While our inexact assignment proposed in [SHW17] simply assigns fewer communities to these nodes, our new assignment algorithm described in Section 4.3.4 tries harder to fulfill these requirements in every time step and thus caused slow-downs in practice. Therefore, we instead propose $n_{\min} = 6$. This is a compromise between not too small communities and a large enough number of communities. In expectation, this leads to more than $n_0/10$ communities, and we found that in practice the node-community assignment is exactly realizable in most cases.

Initially, in time step 0, such a graph with $n_0$ nodes is generated (see Section 4.3.1). In each time step $t \in [1, \ldots, T]$, community events may be triggered. Each community event is spread over $t_{\mathrm{effect}}$ time steps. In each time step, a community may only be part of a single community event. We consider four types of community events: birth, death, merge, and split. Other events considered in the literature, such as community expansion and contraction, are combined with them as described below. Nevertheless, it is straightforward to add these as separate community events in our model.

Figure 4.2: Split of a community $C$ into $C_1$, $C_2$ by duplicating $C$ and then reducing their overlap.

In a community birth event, a new community of size $n_{\min}$ is created that then grows to a size drawn from the community size distribution over $t_{\mathrm{effect}}$ time steps. Similarly, in a community death event, the community first shrinks to size $n_{\min}$ over $t_{\mathrm{effect}}$ time steps before it disappears. In a community merge event, two randomly chosen communities $C_1$, $C_2$ grow their overlap into a new community $C$ over $t_{\mathrm{effect}}$ time steps by adding nodes in $C_1 \setminus C_2$ to $C_2$ and vice versa. Figure 4.1 illustrates this. To avoid changing the community size distribution, we draw a new size for $C$ from the community size distribution and either gradually remove non-overlapping nodes to shrink the final size or add additional new nodes after the overlap reaches the union of both communities. In a community split event, a randomly chosen community $C$ is split into two communities $C_1$, $C_2$. We start $C_1$ and $C_2$ as exact copies of $C$ and then gradually decrease their overlap, such that in the end both keep disjoint fractions of the nodes of $C$. Again, we draw new sizes for $C_1$ and $C_2$. To achieve the desired size, we either remove nodes from both $C_1$ and $C_2$ to further shrink them, or we add additional nodes to them to increase their size. Figure 4.2 illustrates the process of splitting a community.

In every time step, we collect all communities that want new members and then assign nodes to them in an assignment step. If communities want more members than there are nodes that want to be part of additional communities, nodes may be assigned to more communities than originally desired. If nodes want to be part of more additional communities than communities want members, we try to reduce such over-assignments by removing nodes that are part of too many communities from some of their communities.

In addition to the community events, in each time step, node events happen, where individual nodes are added to or removed from the graph. When a node is removed, it leaves all its communities, and those communities get new members in the assignment step. New nodes simply participate in the assignment step to join communities.

The number of community and node events is proportional to the number of communities and nodes.


The number of community events is drawn from a binomial distribution with $|\mathcal{C}_t|$ trials and probability $p_c$. Similarly, the number of node events is drawn from a binomial distribution with $n_t$ trials and probability $p_n$. This ensures that with more nodes, and thus also more communities, every node and community will be part of an event after a similar number of time steps. We try to balance events that destroy communities (death and merge) against events that create communities (birth and split), and analogously for nodes, to avoid significant changes in the size of the graph and to avoid imbalances between the sum of the community sizes and the sum of the number of communities nodes want to be part of. Let $x$ be the ratio between the desired and the actual sum of community sizes/number of nodes. Then we set the probability of events that create communities/nodes to $x/(1 + x)$. The individual community events in each category (death vs. merge, birth vs. split) are equally likely.

Every time a node joins or leaves a community, we trigger edge events. Each community has an edge probability according to its size. When a node $u$ joins a community $C$, edges from $u$ to other nodes in $C$ are created according to this probability. When a node $u$ leaves a community $C$, edges incident to $u$ in $C$ are removed again. The probability is adjusted when the desired size of a community changes due to a community event. To adjust the actual edge density, we draw a desired number of edges from the binomial distribution of the number of edges in the $G(n, p)$ model of the community and remove or add edges uniformly at random accordingly. An edge $\{u, v\}$ may exist in several communities. It exists in $G_t$ if it exists in at least one community in $\mathcal{C}_t$ or the global epsilon community.

To smoothen community changes, edges that are created for a node $u$ that joins a community $C$ at time step $t$ are already created at time step $t - x$, where $x$ is drawn from a geometric distribution with probability $\lambda$. Similarly, edges of a node $u$ that leaves $C$ at time step $t$ still exist until time step $t + x$, where $x$ is drawn from a geometric distribution with probability $\lambda$. We recommend setting $\lambda$ to a relatively high value like 0.8 to ensure that most edges are created or removed at the time $t$ where $u$ joins or leaves $C$, as otherwise community detection algorithms might detect $u$ as part of $C$ even though it is not yet or no longer part of the ground truth community $C$.

To create further noise and avoid that all edge changes are due to changes in the community structure, random edge perturbations happen in every time step with probability $p_e$. More precisely, at every time step in every community $C$, the number of edge perturbations is drawn from a binomial distribution with $|C|$ trials and probability $p_e$. Instead of simply removing and inserting that number of edges, we additionally draw a new desired number of edges from the binomial distribution of the number of edges and bias edge perturbations towards reaching that number. Thus, as an additional challenge for community detection algorithms, the actual density of a community might change over time.

The parameters for the dynamic events with their recommended values are summarized in the second part of Table 4.1. The output of our algorithm is a stream of node, edge and community assignment events for each time step. From these events, a full graph $G_t$ and ground truth communities $\mathcal{C}_t$ can be obtained for every time step $t \in [0, \ldots, T]$.
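To make this concrete, the following minimal C++ sketch shows how such an event trigger could look: the number of community events is drawn from a binomial distribution, and the create/destroy decision uses the probability $x/(1+x)$ described above. All variable names and the hard-coded quantities are illustrative assumptions, not the generator's actual interface.

```cpp
#include <cstddef>
#include <iostream>
#include <random>

// Hypothetical sketch of the per-time-step event trigger: the number of
// community events is Binomial(|C_t|, p_c), and an event creates a community
// (birth/split) with probability x/(1+x), where x is the ratio of desired to
// actual total community size.
int main() {
    std::mt19937_64 gen(1);
    std::size_t numCommunities = 1200;
    double pc = 0.01;                      // community event probability
    double desiredMass = 18000, actualMass = 17000;

    std::binomial_distribution<std::size_t> numEvents(numCommunities, pc);
    double x = desiredMass / actualMass;
    std::bernoulli_distribution createEvent(x / (1.0 + x));
    std::bernoulli_distribution coin(0.5); // birth vs. split, death vs. merge

    std::size_t events = numEvents(gen);
    for (std::size_t i = 0; i < events; ++i) {
        if (createEvent(gen))
            std::cout << (coin(gen) ? "birth" : "split") << '\n';
        else
            std::cout << (coin(gen) ? "death" : "merge") << '\n';
    }
}
```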


Param.     Meaning                                           Recomm. Value
n_0        Number of nodes in G_0                            -
x_min      Minimum node membership                           1
x_max      Maximum node membership                           n_0/10
β_1        Community Membership Exponent                     2.5
n_min      Minimum size of community                         6
n_max      Maximum size of community                         n_0/10
β_2        Community Size Exponent                           2.5
α          Intra Community Edge Prob. p = α/n^γ              2
γ          Intra Community Edge Prob. p = α/n^γ              0.5
ϵ          Inter Community Edge Probability                  2·n_0^{-1}
T          Number of time steps                              -
p_c        Community Event Probability                       10^{-2}
p_n        Node Event Probability                            10^{-3}
p_e        Edge perturbation probability                     10^{-2}
λ          Community event sharpness                         0.8
t_effect   Time steps for community events to take effect    10

Table 4.1: Parameters used in the Graph Generator.

4.3.1 Initialization

Our initialization step resembles the distributed algorithm for the CKB generator described in [Chy+14]. As we only consider a sequential setting, we simplified some of the steps. First, node-community memberships and community sizes are drawn. As we draw node-community memberships, we also directly add the nodes to the epsilon community. Then nodes are assigned to communities in the node-community bigraph; see Section 4.3.4 for how we realize this graph. Whenever we add a node to a community, the corresponding edges are generated as described in Section 4.3.5.
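As an illustration of the drawing step, here is a minimal sketch of inverse-transform sampling from a bounded power law distribution $\mathrm{Pld}([a, b), \beta)$, as used for node-community memberships and community sizes. The function name and the continuous approximation (rounding a continuous power law sample down) are our own simplifications, not the exact discrete sampler of the implementation.

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>

// Draw an integer from a bounded power law Pld([a, b), beta), i.e., with
// P(X = k) roughly proportional to k^(-beta), by inverting the CDF of the
// continuous density x^(-beta) on [a, b) and rounding down.
std::uint64_t samplePowerLaw(std::uint64_t a, std::uint64_t b, double beta,
                             std::mt19937_64 &gen) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double e = 1.0 - beta; // exponent of the integrated density
    double lo = std::pow(static_cast<double>(a), e);
    double hi = std::pow(static_cast<double>(b), e);
    double x = std::pow(lo + unif(gen) * (hi - lo), 1.0 / e);
    return static_cast<std::uint64_t>(x);
}

int main() {
    std::mt19937_64 gen(42);
    // Draw a few community sizes with n_min = 6, n_max = 1000, beta_2 = 2.5.
    for (int i = 0; i < 5; ++i)
        std::cout << samplePowerLaw(6, 1000, 2.5, gen) << '\n';
}
```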

4.3.2 Community Events

In the following, we describe for each of the four community events – birth, death, merge and split – how it is implemented. In this section, we only deal with node-community memberships and changes of the desired size. Edges are always implicitly generated as explained in Section 4.3.5. Further, instead of directly adding new nodes to a community, we only increase its desired size. In the community assignment phase that is described in Section 4.3.4, nodes are assigned to communities such that their desired sizes are met.

Community Birth

To spawn a new community $C$ at time $t$, we first draw its desired final size $n_C$ from $\mathrm{Pld}([n_{\min}, n_{\max}), \beta_2)$. At time $t$, $C$ starts with $n_{\min}$ desired nodes.

Then, in each of the following time steps $t + t'$, $t' \in [0, t_{\mathrm{effect}})$, an expansion phase follows where the desired size is increased by
\[ \left\lceil \frac{n_C - |C_{t+t'-1}|}{t_{\mathrm{effect}} - t'} \right\rceil. \]
The birth event ends when the desired size reaches $n_C$ nodes, which happens at the latest at time step $t + t_{\mathrm{effect}} - 1$.

Community Death

Symmetrical to the expansion phase after the birth event, the community death event is preceded by a contraction phase, where nodes gradually leave the community until only a core of the community remains. In a death event starting at time step $t$, in each time step $t + t'$, $t' \in [0, t_{\mathrm{effect}} - 1)$, we sample nodes uniformly at random to leave the community.

The number of nodes to remove is
\[ \left\lceil \frac{|C_{t+t'-1}| - n_{\min}}{t_{\mathrm{effect}} - t' - 1} \right\rceil. \]
If the community size is equal to $n_{\min}$, which is the case at the latest at time $t + t_{\mathrm{effect}} - 1$, we remove all nodes and the community dies.
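Both the expansion and the contraction phase spread the remaining size change evenly over the remaining time steps via ceiling division. A small sketch of the birth schedule, under the assumption that the desired size is updated once per time step (the names are illustrative):

```cpp
#include <cstddef>
#include <iostream>

// Per-step size increase of a birth event: spread the remaining growth
// evenly over the remaining time steps using ceiling division. The death
// event uses the symmetric formula with (current - n_min) nodes to remove.
std::size_t growthStep(std::size_t current, std::size_t target,
                       std::size_t tEffect, std::size_t tp) {
    std::size_t remainingSteps = tEffect - tp;
    return (target - current + remainingSteps - 1) / remainingSteps; // ceil
}

int main() {
    std::size_t size = 6, target = 40, tEffect = 10;
    for (std::size_t tp = 0; tp < tEffect && size < target; ++tp) {
        size += growthStep(size, target, tEffect, tp);
        std::cout << "step " << tp << ": desired size " << size << '\n';
    }
}
```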

Community Split

A community split event starts by creating a duplicate of the input community $C$ such that we now have two identical communities $C_1$, $C_2$. Note that this duplicate also inherits an exact copy of the edges of $C$, i.e., there is no change in the graph due to this duplication.

Further, we draw desired final sizes $n_{C_1}$, $n_{C_2}$. Recall that over the course of the split event the nodes of $C$ shall be divided among $C_1$ and $C_2$, i.e., their overlap is gradually removed. For this, we initially determine lists of nodes $R_1$, $R_2$ to remove from $C_1$ and $C_2$ over the course of the $t_{\mathrm{effect}}$ time steps. If $n_{C_1} + n_{C_2} < |C|$, we first randomly select $|C| - n_{C_1} - n_{C_2}$ nodes, add them to $R_1$ and $R_2$, and then assign the remaining nodes to $C_1$ and $C_2$ by adding all other nodes to $R_2$ and $R_1$, respectively. Otherwise, we assign fractions of nodes proportional to $n_{C_1}/(n_{C_1} + n_{C_2})$ and $n_{C_2}/(n_{C_1} + n_{C_2})$ of $C$ to $R_2$ and $R_1$, respectively. We round randomly by drawing a value $r \in [0, 1)$ and rounding up if the fractional part is larger than $r$. Similar to community birth and death events, in each time step, we now both gradually remove nodes from the list of nodes to be removed and simultaneously increase the desired size to finally reach $n_{C_1}$ and $n_{C_2}$ nodes, respectively, after $t_{\mathrm{effect}}$ time steps. In time step $t + t'$, we first remove
\[ \left\lceil \frac{|R_i|}{t_{\mathrm{effect}} - t'} \right\rceil \]
nodes of $R_i$ from each community $C_i$, $i \in \{1, 2\}$, and set $R_i := R_i \cap C_i$. Then, we increase the desired size of $C_i$ to
\[ \max\left\{ |C_i| + \left\lceil \frac{n_{C_i} + |R_i| - |C_i|}{t_{\mathrm{effect}} - t'} \right\rceil,\; n_{\min} \right\} \]
for $i \in \{1, 2\}$.

Note that while $C_1$ and $C_2$ might be completely disjoint after the split event, this is not necessarily the case, as nodes that were previously part of $C$ might be re-added to $C_1$ or $C_2$ to achieve the desired size during the node-community assignment.
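The proportional division of $C$ among $R_1$ and $R_2$ relies on randomized rounding: a fractional share $x$ is rounded up with probability equal to its fractional part, so the rounded value is correct in expectation. A minimal sketch (the helper name is our own):

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <random>

// Randomized rounding as used when dividing the nodes of C proportionally:
// draw r in [0, 1) and round the fractional share x up iff frac(x) > r, so
// the expected value of the rounded result equals x.
std::size_t randomRound(double x, std::mt19937_64 &gen) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double integral;
    double frac = std::modf(x, &integral);
    return static_cast<std::size_t>(integral) + (frac > unif(gen) ? 1 : 0);
}

int main() {
    std::mt19937_64 gen(7);
    // Splitting 53 nodes proportionally to target sizes 30 and 20.
    double share = 53.0 * 30.0 / (30.0 + 20.0); // = 31.8
    std::cout << randomRound(share, gen) << '\n'; // prints 31 or 32
}
```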

Community Merge

The merge event follows a process that is almost the reverse of the split event. We gradually increase the overlap between two communities $C_1$, $C_2$ until they merge into a new community $C$. For this, the nodes of $C_1$ start joining $C_2$ while the nodes of $C_2$ join $C_1$. Recall that $C$ has a new size $n_C$ that might be larger or smaller than the union of $C_1$ and $C_2$. If the resulting community shall be smaller, we remove some of the nodes of $C_1$ and $C_2$ that are not yet in the overlap. Further, if the merged community does not yet have the target size $n_C$, we gradually increase or decrease its size to $n_C$.

The merge event first calculates the overlap between $C_1$ and $C_2$ and creates lists of nodes $A_1 := C_2 \setminus C_1$, $A_2 := C_1 \setminus C_2$ of nodes that shall be added to the respective communities. At each time step $t + t'$, $t' \in [0, t_{\mathrm{effect}})$, we add
\[ a := \max\left\{0, \left\lceil \frac{n_C - |C_1 \cap C_2|}{t_{\mathrm{effect}} - t'} \right\rceil\right\} \]
nodes to the overlap and remove
\[ r := \max\left\{0, \left\lceil \frac{|C_1 \cup C_2| - n_C}{t_{\mathrm{effect}} - t'} \right\rceil\right\} \]
nodes from the non-overlapping part.

We first remove $\min\{r, |A_1 \cup A_2|\}$ nodes from the non-overlapping parts of $C_1$ and $C_2$ and from $A_1$ and $A_2$. If $A_1 \cup A_2 = \emptyset$, the merge has finished, i.e., $C_1 = C_2$, and we simply remove $C_2$. If we have removed fewer than $r$ nodes so far, we are now past the merge and remove the remaining nodes from $C_1$. If we are past the merge and $a > 0$, we simply increase the desired size of $C_1$ by $a$, but at least to $n_{\min}$. If $C_2$ still exists but either $|C_1| + |A_1| \leq n_{\min}$ or $a \geq |A_1| + |A_2|$, we increase the desired size of $C_1$ to $|C_1| + a$, but at least $n_{\min}$, add all nodes in $A_1$ to $C_1$, and perform the merge by removing $C_2$. Otherwise, we select $a_1$ and $a_2$ nodes from $A_1$ and $A_2$ to add to $C_1$ and $C_2$, respectively. We choose $a_1$ and $a_2$ such that both communities reach at least $n_{\min}$ nodes. If we have still selected fewer than $a$ nodes, we select the remaining nodes proportional to the sizes of $A_1$ and $A_2$, respectively. We then increase the desired sizes of $C_1$ and $C_2$ by $a_1$ and $a_2$ and add the corresponding nodes from $A_1$ and $A_2$ to them.
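For illustration, the per-step quantities $a$ and $r$ of a merge event can be computed as in the following sketch; the struct and function names are hypothetical, and the clamping of the special cases discussed above is omitted.

```cpp
#include <algorithm>
#include <iostream>

// Per-step quantities of a merge event (notation as in the text): a nodes
// join the overlap and r nodes leave the non-overlapping parts, with
// ceiling division spreading the work over the remaining steps.
struct MergeStep { long a, r; };

MergeStep mergeStep(long targetSize, long overlapSize, long unionSize,
                    long tEffect, long tp) {
    long steps = tEffect - tp;
    long a = std::max(0L, (targetSize - overlapSize + steps - 1) / steps);
    long r = std::max(0L, (unionSize - targetSize + steps - 1) / steps);
    return {a, r};
}

int main() {
    // overlap = 5, union = 60, drawn target size n_C = 45, t' = 0.
    MergeStep s = mergeStep(45, 5, 60, 10, 0);
    std::cout << "add " << s.a << " to overlap, remove " << s.r << '\n';
}
```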


4.3.3 Node Events

In every time step, we possibly add or remove one or several nodes. Removed nodes immediately leave all communities they are part of. While in most cases they are simply replaced by another node during the node-community assignment phase, there are a few special cases if the community is part of a community event. For community birth events, nothing changes: the community continues to grow, and we simply assign more nodes during the node-community assignment phase. For community death events, we do not replace removed nodes. This means that the community shrinks faster. For community split events, we also do not immediately replace removed nodes; instead, we remove them from the corresponding set $R_i$ and, if necessary, let the community grow more to replace them. Community merge events are handled similarly: there, we also remove the node from the corresponding set $A_i$ and possibly let the community grow more after the merge has been completed.

4.3.4 Node-Community Assignment

After all active community events have been processed and all node events have happened, we ensure that every community has as many members as desired. For this, we build the node-community bigraph, a bipartite graph between the nodes $V_t$ and the communities in $\mathcal{C}_t$. An edge between a node $u \in V_t$ and a community $C \in \mathcal{C}_t$ indicates that we freshly assign $u$ to $C$ in time step $t$. We build this assignment in several steps. The first step is to reduce over-assignments, if there are any and we can possibly replace over-assigned nodes by other nodes. If necessary, we sample additional nodes or exclude nodes from the assignment to match the number of nodes wanted by communities. To assign the nodes, we first compute a greedy assignment of nodes to communities. Then, we shuffle those assignments. The node-community assignment ends by actually adding all freshly assigned nodes to their communities.

First, we calculate the sum $s_V$ of the additional communities nodes want to be in and the sum $s_C$ of the additional members all communities want. If $s_V > s_C$, we try to remove some nodes from communities to be able to assign more nodes to communities. For this, we keep track of over-assigned nodes, i.e., nodes that are part of more communities than they want to be part of. We iterate over these nodes in a random order. For each node $u$, we select communities uniformly at random and remove $u$ from them until $u$ is only part of its desired number of communities or $s_V = s_C$. Note that this only increases $s_C$ but not $s_V$, as we are only reducing over-assignments. To avoid disturbing the creation or removal of overlap in community split and merge events, we exclude all communities that are currently part of a split event, or of a merge event before the merge happened. If $s_V$ is still larger than $s_C$, we select a uniform sample of nodes that will get fewer communities than originally wanted. To avoid that a node has no community, we first select every node that has no community to get one community and only afterwards consider other nodes for getting more than one community. In the opposite case, if $s_V < s_C$, we sample $s_C - s_V$ additional nodes that will be assigned to communities. For this, we draw uniformly at random from all nodes, weighted by their desired number of communities. We do not consider how many communities a node already has or shall get; nodes may be sampled several times.
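The balance check at the start of the assignment can be summarized by the following sketch, which computes $s_V$ and $s_C$ and decides which of the correction steps applies; the data layout is invented for the example.

```cpp
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

// Balance check before the assignment: s_V is the total number of additional
// memberships nodes want, s_C the total number of additional members
// communities want. The vectors stand in for the generator's bookkeeping.
int main() {
    std::vector<std::size_t> nodeWants = {2, 0, 1, 3}; // missing memberships
    std::vector<std::size_t> commWants = {4, 1};       // missing members
    std::size_t sV = std::accumulate(nodeWants.begin(), nodeWants.end(),
                                     std::size_t{0});
    std::size_t sC = std::accumulate(commWants.begin(), commWants.end(),
                                     std::size_t{0});
    if (sV > sC)
        std::cout << "reduce over-assignments, then subsample nodes\n";
    else if (sV < sC)
        std::cout << "sample " << (sC - sV) << " additional node slots\n";
    else
        std::cout << "balanced\n";
}
```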


Assigning nodes to communities means generating a random bipartite graph from a prescribed degree sequence. The algorithms are similar to those for the fixed degree sequence model described in Section 3.1.1. For the greedy assignment, we use a bipartite variant of the Havel-Hakimi algorithm similar to the one described by Kleitman and Wang [KW73] that is based on the Gale-Ryser theorem [Gal57; Rys57]. Note that these algorithms originally assume that we are creating all edges from scratch. Apart from the initial assignment, this is not true. The problem of finding additional edges that satisfy certain additional degrees is the same as finding a subgraph with a given degree sequence, a so-called f-factor, in the complement graph. This problem can be solved in time $O(n^3)$ in general graphs [Ans85]. For bipartite graphs, this can be simplified to solving a maximum flow problem. We found that in practice, for our suggested parameters, our slightly adapted Havel-Hakimi algorithm (see below) usually still produces a valid assignment, and if not, only few assignments are missing, which we fulfill by sampling additional nodes. Therefore, we did not implement the exact assignment based on f-factors, which might easily dominate the running time.

We start the greedy assignment by sorting all selected nodes by the number of missing memberships and all communities that want new members by the number of missing members using bucket sort. We iterate over the selected nodes, starting with those that have the highest number of missing memberships. For each node, we iterate over all communities that want new members, starting with the ones that still want the most members. Whenever we find a community the node is not yet part of, we add an edge to the node-community bigraph, until the node is part of as many communities as desired or we have tried all communities. We store communities in a bucket priority queue that we update as nodes are assigned to communities. For the initial assignment, we could use the technique explained in Section 3.5 to avoid changing the order of elements in the priority queue, but for the later steps this is not possible anymore, as we possibly cannot add a node to a community. If we cannot find enough communities for a node, we simply continue with the next node. After finishing this initial assignment for all nodes, we sample additional nodes and assign them directly to the communities that still need members. If we initially excluded nodes due to $s_V > s_C$, we first try them in a random order. If there are no (more) such nodes, we sample from all nodes as for the over-assignment. This sampling continues until all communities have enough members.

For the shuffling of the assignment, we use edge switching [KTV99]. For each switch, we randomly select two node-community assignments. If we selected two different nodes and two different communities, we swap their nodes if neither of the nodes is already part of the other community or shall be assigned to it. We perform ten times as many switching attempts as there are assignments, as results on simple graphs with given degree sequences suggest that this is enough [RPS15].
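A minimal sketch of one such switching attempt on the bigraph is shown below, assuming memberships are stored in a set of (node, community) pairs; in the actual implementation, existing memberships and fresh assignments are of course tracked in more efficient structures.

```cpp
#include <cstddef>
#include <iostream>
#include <random>
#include <set>
#include <utility>
#include <vector>

using Assignment = std::pair<std::size_t, std::size_t>; // (node, community)

// One switching attempt on the fresh assignments: pick two assignments
// (u1, C1), (u2, C2); if nodes and communities differ and neither node is
// already in the other community, swap the nodes. "member" stands in for
// both existing memberships and fresh assignments.
bool trySwitch(std::vector<Assignment> &as, std::set<Assignment> &member,
               std::mt19937_64 &gen) {
    std::uniform_int_distribution<std::size_t> pick(0, as.size() - 1);
    std::size_t i = pick(gen), j = pick(gen);
    auto [u1, c1] = as[i];
    auto [u2, c2] = as[j];
    if (u1 == u2 || c1 == c2) return false;
    if (member.count({u1, c2}) || member.count({u2, c1})) return false;
    member.erase({u1, c1}); member.erase({u2, c2});
    member.insert({u1, c2}); member.insert({u2, c1});
    as[i] = {u1, c2}; as[j] = {u2, c1};
    return true;
}

int main() {
    std::vector<Assignment> as = {{0, 0}, {1, 1}, {2, 2}};
    std::set<Assignment> member(as.begin(), as.end());
    std::mt19937_64 gen(3);
    for (std::size_t k = 0; k < 10 * as.size(); ++k) trySwitch(as, member, gen);
    for (auto [u, c] : as) std::cout << u << " -> " << c << '\n';
}
```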
Note that this does not necessarily yield a random assignment, apart from the initial assignment step. Consider the two node-community assignments shown in Figure 4.3. While both are valid results for the same input, there is no switch possible and thus no way for our switching algorithm to get from one to the other. As many nodes are part of just a single community and there are many small communities, we expect such problems to be rare in practice.

Figure 4.3: Example for two different node-community assignments with existing fixed assignments (dashed) such that there are no two assignments that can be switched to get from one to the other.

4.3.5 Edges

In the CKB model, and thus also in our generator, every edge belongs to one or more communities. In this section, we describe how edges within a community are generated, without caring about possible duplicates. In the following section, we describe how we generate a stream of global edge events from them. The methodology described for the distributed CKB generator [Chy+14] for generating the edges within a community involves drawing the number of edges per node to insert from a binomial distribution with success probability $p_c = \alpha / n_c^\gamma$, where $n_c$ is the size of the community $C$, and thereafter using the configuration model. We instead use the method of Batagelj and Brandes [BB05] to more efficiently generate an Erdős–Rényi graph for each community. Every time we add a node to a community, we use this method to generate edges to existing nodes according to the current edge probability in the community. When a node is removed, its edges are removed, too.
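As a sketch of this technique, restricted to the edges of a single joining node $u$: instead of testing each of the $k$ existing members independently with probability $p$, geometrically distributed skips jump directly between successful trials, giving expected time proportional to $p \cdot k$. The function name and parameters are our own.

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Geometric-skip sampling in the style of Batagelj and Brandes, restricted
// to the potential edges of one joining node: instead of flipping a coin
// with probability p for each of the k existing members, jump between
// successes with geometrically distributed skips.
std::vector<std::size_t> sampleNeighbors(std::size_t k, double p,
                                         std::mt19937_64 &gen) {
    std::vector<std::size_t> hits;
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double logq = std::log(1.0 - p);
    long long i = -1;
    while (true) {
        double r = 1.0 - unif(gen); // r in (0, 1]
        i += 1 + static_cast<long long>(std::floor(std::log(r) / logq));
        if (i >= static_cast<long long>(k)) break;
        hits.push_back(static_cast<std::size_t>(i)); // index of a member of C
    }
    return hits;
}

int main() {
    std::mt19937_64 gen(5);
    for (std::size_t v : sampleNeighbors(100, 0.2, gen)) std::cout << v << ' ';
    std::cout << '\n';
}
```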

Changing Edge Probabilities. Whenever the desired size of a community changes, we also adjust its edge probability. To adjust the actual density of the community to the new edge probability, we first draw a new number of edges from a binomial distribution with the maximum number of edges as the number of trials and success probability equal to the new edge probability. We then add or remove random edges to reach the sampled number of edges, i.e., we simulate the $G(n, p)$ model using the $G(n, m)$ model. To remove edges, we simply select uniformly at random from the existing edges. To add edges, we sample random node pairs until we have found enough node pairs that did not exist yet. To ensure that we find a suitable node pair after a constant number of samples in expectation, we also store non-edges explicitly if, after adding edges, more than 75% of the node pairs exist. We stop storing them once, after removing edges, the number of edges decreases below 25% of the node pairs. This ensures that before we iterate over all node pairs, at least half of them have been added or removed. If non-edges are stored, we instead sample from them to add edges.
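A minimal sketch of this $G(n, p)$-via-$G(n, m)$ adjustment, assuming hypothetical bookkeeping of the current edge count:

```cpp
#include <cstddef>
#include <iostream>
#include <random>

// Simulating G(n, p) through G(n, m): after the edge probability of a
// community of size n changes to p, draw a target edge count m from
// Binomial(n*(n-1)/2, p) and add or remove uniformly random edges until the
// community has exactly m edges. The current count is a made-up example.
int main() {
    std::mt19937_64 gen(9);
    std::size_t n = 50, currentEdges = 230;
    double p = 0.25; // new edge probability
    std::size_t maxPairs = n * (n - 1) / 2;
    std::binomial_distribution<std::size_t> edgeCount(maxPairs, p);
    std::size_t m = edgeCount(gen);
    if (m > currentEdges)
        std::cout << "add " << (m - currentEdges) << " random non-edges\n";
    else
        std::cout << "remove " << (currentEdges - m) << " random edges\n";
}
```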


Edge Perturbations. At every time step, each community perturbs its edges with probability $p_e$. We draw the number of edges to perturb, $p$, from a binomial distribution with the current number of edges as the number of trials and probability $p_e$. In order to create even more noise, we do not simply replace $p$ edges by new edges, but instead also sample a new number of edges as described above, just without changing the edge probability. Let $x$ be the current number of edges and $y$ be the new number of edges. If $y \geq x$, we add $p$ edges and, if $p > y - x$, we remove $p - y + x$ different edges. Otherwise, we remove $p$ edges and, if $p > x - y$, we add $p - x + y$ different edges. This is only possible if the number of edges to add is at most the number of missing edges in the community, as otherwise we cannot add enough new edges that are different from the edges to remove. If the number of edges to add is $z$ edges larger than the number of missing edges, we add and remove $z$ fewer edges. We first select a random sample of edges to remove using a Fisher-Yates shuffle [FY48] on the edges of the community until we have shuffled enough edges. Then we add the desired number of edges and only afterwards remove our selected sample, to ensure that we do not re-add an already removed edge. The sampling for adding edges works as described above and is thus linear in the number of added and removed edges in expectation.
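The arithmetic of biasing the $p$ perturbations towards the new target $y$ can be sketched as follows; the clamping against the number of missing edges in the community described above is omitted for brevity.

```cpp
#include <algorithm>
#include <iostream>

// Biasing p perturbations towards the new edge count y: with x current
// edges, additions and removals are chosen so that the community ends up
// with y edges while p edges are touched (clamping omitted).
int main() {
    long x = 200, y = 190, p = 25;
    long add, remove;
    if (y >= x) {
        add = p;
        remove = std::max(0L, p - (y - x));
    } else {
        remove = p;
        add = std::max(0L, p - (x - y));
    }
    std::cout << "add " << add << ", remove " << remove
              << " (net " << (add - remove) << ")\n";
}
```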

Epsilon Community. Every node is added to the epsilon community, and its edges are created as described above. The only difference is that the edge probability is fixed and not tied to the community's size. Edge perturbations are also realized in exactly the same way as described above.

4.3.6 Event Stream Generation

Every time a node is added to or removed from the graph, a node joins or leaves a community, or an edge is added to or removed from a community, we generate an event for our global event stream. We first collect these events in buckets per time step and event type. Further, we store for every node its birth time. Note that we never revive nodes; every time a node is born, we assign a fresh node id. While most events are generated in order, this is not true for edge events. As described before, when a node joins or leaves a community, depending on the parameter $\lambda$, the edges that are added to or removed from the graph may already exist earlier or persist longer. For this, we simply add those events to an earlier or later time bucket.

As the last step, our generator creates the final event stream in a sweep over these collected events. This has several purposes. First, due to our smooth community changes, it might happen that we generated an event for adding an edge $\{u, v\}$ before $u$ and $v$ are born. When we encounter such an event, we move that event to the time step when both nodes exist. Also, when a node dies, we ensure that we immediately remove all incident edges. For this, during our sweep, we keep track of the incident edges of every node. Due to community overlaps, also with the epsilon community, edges might be created and removed multiple times. During our sweep, we keep a counter for every edge of how many communities it is part of. For our final event stream, we only generate a birth event if the edge did not exist before in any community, and a death event if it is no longer part of any community. It may happen that a node is removed from and re-added to

the same community within a time step, e.g., during a split event or while reducing over-assignments. For the final event stream, we only generate a community event if the membership of a node changes from before the time step to after the time step.
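The de-duplication during the sweep boils down to a multiplicity counter per edge, as in the following sketch; the callback structure is invented for the example, while the implementation processes bucketed events instead.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>

// Edge de-duplication in the final sweep: count in how many communities
// (including the epsilon community) each edge exists and emit a global
// birth event only on the 0 -> 1 transition, a death event on 1 -> 0.
int main() {
    std::map<std::pair<std::uint64_t, std::uint64_t>, int> multiplicity;
    auto key = [](std::uint64_t u, std::uint64_t v) {
        return std::make_pair(std::min(u, v), std::max(u, v));
    };
    auto onAdd = [&](std::uint64_t u, std::uint64_t v) {
        if (multiplicity[key(u, v)]++ == 0)
            std::cout << "edge birth " << u << "," << v << '\n';
    };
    auto onRemove = [&](std::uint64_t u, std::uint64_t v) {
        if (--multiplicity[key(u, v)] == 0)
            std::cout << "edge death " << u << "," << v << '\n';
    };
    onAdd(1, 2);    // edge enters its first community -> birth
    onAdd(2, 1);    // same edge in a second community -> no event
    onRemove(1, 2); // still in one community -> no event
    onRemove(1, 2); // removed from the last community -> death
}
```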

4.3.7 Time Complexity

Almost all operations in our generator need constant time in expectation per internally generated event. To achieve this, we use hash maps or hash sets in many places. The only exceptions are the node-community assignment step and the edge perturbations. For the node-community assignment step, a running time linear in the number of all node-community assignments can be necessary, either because many more nodes want to be part of communities than communities want nodes, so that we need to sample, or because of over-assignments that we try to reduce but cannot, since all affected communities are part of community split or merge events. For edge perturbations, we draw the desired number of perturbations for each community, which might be zero and thus not generate any events. By using clever sampling techniques and keeping track of more information, it might be possible to reduce this complexity. In practice, for our recommended parameters, we found that neither the node-community assignment step nor the edge perturbations are dominated by this overhead, though. The relation between internal events and the final event stream depends on many parameters, e.g., how dense communities are and how large their overlaps are, as we remove duplicate edges. For the proposed parameters, the majority of the nodes is only part of a single community and thus overlaps are minimal. Let $a$ be the maximum number of node-community assignments and $b$ the number of internal events. Then the expected running time is in $O(a \cdot T + b)$.

4.4 Experiments

We present the results of several experiments conducted with our dynamic graph generator. The empirical evaluation of the generator has been done along three main categories.

1. Running Time: We measure the efficiency of the generator itself in constructing graphs with increasing graph size and event probability. The goal of these experiments is to illustrate that the generator scales well when generating large, dense, or highly dynamic graphs.

2. Graph Properties: As the dynamic graph evolves over time due to the edges and nodes changed by the generator, properties of the graph are measured at different time steps. This category of experiments is targeted at showing that the generator is capable of maintaining graph properties while generating community scale events.

3. Community Detection Algorithms: Finally, we use existing community detection algorithms to detect communities on the generated datasets. We show using these experiments that the link structure of the generated dynamic graph follows the ground truth as the graph evolves.


Our primary contribution is that we add dynamic events to an existing static graph generator for overlapping communities. The goal of our experiments is, therefore, to show that generating community events on the initial graph (G0), generated by the static generator, does not adversely affect the graph properties that G0 exhibits. The static graph model used for generating G0 in this work, CKB, has been shown to resemble real-world graphs in terms of graph properties [Chy+14]. We show that, in spite of the dynamic events generated on the original static graph, the same graph properties are followed and the link structure indicated by the community assignment is maintained. This implies that the dynamic generator model we propose is able to generate a graph with evolving overlapping communities, realistic graph properties, and community events at each time step – forming a suitable input to a dynamic community detection algorithm.

4.4.1 Setup

We implemented the generator in C++ in NetworKit [SSM16]; the code is available at https://github.com/michitux/networkit/tree/ckbdynamic. In many places, we use hash maps to efficiently store nodes, edges or communities. For this, we use an implementation of robin hood hashing [Cel86] from https://github.com/martinus/robin-hood-hashing. Our experiments were conducted on a system with two 8-core Intel Xeon Skylake SP Gold 6144 processors and 196 GB of 2666 MHz DDR4 RAM. Our implementation is single-threaded; we thus restricted our experiments to use just a single CPU core. The parameters experimented with include n0, the starting number of nodes in the graph, pc, the probability of a community event occurring, α, which affects the intra-community edge density, and ϵ, which affects the density of inter-community or global edges. All other parameters are set to their values in Table 4.1.

4.4.2 Running Time

Figure 4.4 plots the time our generator needs, both as absolute running time and as running time per produced event, for 1000 time steps. Apart from the varied parameter, all parameters are set to the recommended values in Table 4.1. For each value, we run the generator with ten different seeds. We plot those ten runs with slightly varying x-values to separate the measurements.

Figure 4.4a shows the time taken for a varying number of nodes in the starting graph. As the number of nodes increases, the running time increases slightly more than linearly. This is because with n0, the maximum community size and the maximum number of memberships per node also increase. Larger communities contain more edges, and more memberships per node lead to more communities in general. Both lead to higher average degrees, and the latter also leads to more community events per node, which explains the more than linear increase in running time. If we consider the running time per event instead, the running time increases only slightly until 90k nodes and remains essentially constant for larger graphs. A possible explanation for this slightly increasing running time per event is the fact that while memory accesses are constant in theory, cache hierarchies lead to increasing memory access times for larger data structures.


Figure 4.4: Time taken to generate 1000 time steps, both as absolute running time and normalized by the number of events. We apply a slight jitter on the x-axis to separate measurements. Panels: (a) varying starting number of nodes, (b) varying intra cluster edge density, (c) varying community event probability.

These results show that a graph with 120k nodes and 1000 time steps can be computed in around 80 seconds.

Figure 4.4b shows the individual running times for varying values of α and 100k nodes. Here we can see that the running time increases slightly less than linearly, and the running time per event is even decreasing with increasing density. There are several explanations for this. With increasing density, each change in the node-community assignment produces more events, and edge perturbations also increase linearly. However, with larger densities, more and more of the small communities become cliques. Our smallest communities of size 6 are cliques starting with α = 3; communities with 100 nodes are also cliques at α = 10. Once a community is a clique, its density cannot increase further with increasing α, and also no more edge perturbations can happen. This explains part of the less than linear increase in running time. The other explanation is that there is a certain overhead, like the node-community assignment or drawing the number of edge perturbations per community, that is independent of the density. Tests on individual parameter choices show that neither of those dominates the running time, but still their relative weight decreases with increasing edge density.

Figure 4.4c shows the time taken for varying probabilities of community events and 100k nodes. Note that every event needs up to 10 time steps and half of the events also concern two communities. Thus, an event probability above 5% is likely to cause all communities to be part of an event. The increase in running time is again roughly linear in the event probability, as we can expect that for event probabilities above 1%, the majority of the events in the final event stream are triggered by community events. The running time per event is almost constant. The slight variations might be explained by the randomness of the generator. For small event probabilities, slightly higher running times can also be explained by the parts of the generator that are independent of community events and by the fact that the node-community assignment has certain steps that are independent of the number of nodes to be assigned.

4.4.3 Graph Properties

In this section, we examine the node degrees, community sizes and node-community memberships for a dynamic graph with 10k nodes and the other parameters chosen according to the values in Table 4.1. All results in this section are from the same run of the generator; thus, graphs can be compared across different plots. In Figures 4.5, 4.6, and 4.7, we plot the degree, community size and node membership distributions at times 0, 500 and 1000.

The community sizes in Figure 4.6 clearly follow power law distributions at every time step. There is some variation, which might also be explained by events that are currently in progress, but overall the distributions are very similar. The degree distribution in Figure 4.5 shows clear differences between the time steps and is also not a pure power law distribution. As node degrees are not explicitly drawn, this is also not expected. Still, as already shown for the CKB generator [Chy+14], the tail follows a power law distribution. There are some effects visible, though, that ask for further explanation, in particular the many nodes having a degree between 50 and 100 at T = 0 and T = 1000. The explanation for this is that, in particular at T = 1000, there are

two huge communities of almost a thousand nodes, as visible in Figure 4.6. The average degree in a community of a thousand nodes is 63, which explains the peaks. At time step 500, there are no such huge communities and thus no such peaks. The node-community membership distribution in Figure 4.7 shows almost identical power law distributions for T = 0 and T = 500. At T = 1000, we see a slightly shifted distribution – there are fewer nodes with a single community, but more nodes with more than one community. This is because of the over-sampling due to communities wanting more members than nodes wanting communities. As we can see, the over-sampling is quite uniform and still yields a power law distribution. Note that a community event probability of 1% as used in these experiments means that the probability of a community not being part of any community event is already below 1% after 500 time steps and below $10^{-5}$ after 1000 time steps. Therefore, after 1000 time steps, we can safely assume that almost all communities are the result of a community event. Further, the node event probability of 0.1% means that a large part of the nodes will also have been replaced. This shows that our community and node event model is able to produce highly dynamic graphs while preserving the distributions of key characteristics.

Figure 4.5: Degree distribution at T = 0, T = 500, and T = 1000.

Figure 4.6: Community size distribution at T = 0, T = 500, and T = 1000.

Figure 4.7: Node membership distribution at T = 0, T = 500, and T = 1000.

In the following, we also consider different properties of the graph over time. In Figure 4.8, we show how the total number of memberships as well as the number of nodes with 0 (orphans), 1 and 2 communities vary over time. There are no orphans at any point of time, showing that our approach of prioritizing nodes that are not part of any community during the node-community assignment is successful. On the other hand, with the node-community assignment proposed in [SHW17], we saw up to 10% orphan nodes over time. There are significant changes over time, but all values frequently return to their original values. If there are fewer memberships overall, more nodes are part of just a single community and slightly fewer nodes are part of two communities. If there are more memberships overall, the number of nodes that are part of a single community drops and the number of nodes that are part of two communities increases, as expected. As the number of memberships decreases again, those numbers quickly recover to their original values as our algorithm reduces over-assignments.

Figure 4.8: Community memberships over time.

In Figure 4.9, we show the development of the community sizes as well as the total number of communities (right y-axis) over time. As expected, the minimum community size stays stable at 6 nodes. The maximum community size varies significantly, as we could already expect from Figure 4.6. This is due to the large range of community sizes and the relatively high exponent of 2.5 of the power law distribution. The median community size is at 9; the average is at around 15 and varies over time. The number of communities varies a lot. This is most likely due to our adjusted event probabilities depending on the number of memberships. As the average community size decreases, we trigger more events that create communities to compensate for the loss of memberships, as can be seen during the first 100 time steps in Figure 4.8. Only around T = 250 does the number of communities stabilize, which is around the time when the number of memberships reaches the initial number of memberships again. Without these stabilizing measures, the

numbers would be varying much more.

Figure 4.9: Community sizes over time.

In Figure 4.10, we report the average local clustering coefficient over time. Compared to the clustering coefficient of around 0.18 reported in [SHW17], the clustering coefficient is higher at around 0.28, as our minimum community size is smaller, and it shows a lot more variance, as our graph is much more dynamic. A clustering coefficient of 0.28 has also been observed on the LiveJournal social network graph, for example [LK14].

Figure 4.10: Clustering coefficient over time.

4.4.4 Community Detection Algorithm

In this section, we compare how well community detection algorithms can detect the ground truth communities. For this comparison, we use the average weighted F1 score as introduced in Section 2.3.3. The most fitting algorithms to evaluate would be the ones that find overlapping communities in a dynamic graph. Unfortunately, we were unable to find implementations of fast enough dynamic overlapping community detection algorithms that are able to detect the communities in our benchmark graphs. In preliminary experiments using the earlier version of our generator [SHW17], we tried a dynamic version of OSLOM [Lan+11] but found that it did not work as intended, producing communities of decreasing quality over time, potentially due to not searching for new candidate communities. In further preliminary experiments, we also tried AFOCS [Ngu+11], but it was only able to process


50 time steps of a graph with 5 thousand nodes within 12 hours, and the found communities had average weighted F1 scores of less than 0.1, i.e., it failed to identify the ground truth community structure. Instead, we evaluate our dynamic generator using the static algorithms MOSES [MH10] and OSLOM [Lan+11] (see Sections 1.4.5 and 1.4.4 for introductions), as both were shown to have a good performance on the CKB benchmark [Buz+14]. From the stream of events of our generator, we generate snapshots of the graph and the ground truth community structure every 100 time steps. The static community detection algorithms are then run for each individual graph snapshot, and the average weighted F1 score for each of these snapshots is measured independently. As both community detection algorithms are randomized, we run them ten times with different random seeds.

Figure 4.11 shows the performance of the selected algorithms over time and for varying values of α with ϵ = 2. As the value of α increases, the intra-community edge probabilities increase, implying that communities are more easily distinguishable. Our experiments confirm the excellent performance of OSLOM; the average F1 scores are about 0.55 for α = 1, almost 0.8 for α = 2, and almost 0.9 for α = 4. MOSES performs even better for α = 1 and α = 2, but for α = 4, the performance is worse and on similar levels as for α = 1. Over the time series, the performance of the algorithms varies, which is expected, as the number of memberships and also the community sizes vary. However, one cannot see any clear trend, and similar scores keep repeating, indicating that we can indeed maintain the community structure over time.

Figure 4.12 shows the performance of the selected algorithms with varying values of ϵ and α = 2. It is surprising to see that even extreme values of ϵ = 100 have only a very slight effect on the performance of MOSES. On the other hand, OSLOM performs worse with increasing ϵ, as expected. The average weighted F1 score decreases from about 0.8 for ϵ = 1 to 0.7 for ϵ = 100.

Figure 4.11: Performance of OSLOM and MOSES with varying α and ϵ = 2. Panels: (a) α = 1, (b) α = 2, (c) α = 4.

Figure 4.12: Performance of OSLOM and MOSES with varying ϵ and α = 2. Panels: (a) ϵ = 1, (b) ϵ = 10, (c) ϵ = 100.

Figure 4.13: Community memberships and sizes for the ground truth and the detected communities for the default parameters of α = 2 and ϵ = 2. Panels: (a) community memberships, (b) community sizes.

To further examine the results of the community detection algorithms, we plot (binned) histograms of the distributions of community memberships and community sizes of both ground truth and detected communities for select values of α and ϵ. As bins, we chose powers of two, i.e., each bin contains all values from the interval $[2^i, 2^{i+1})$ for some $i \in \mathbb{N}$. This allows a direct comparison, for each range of community sizes or numbers of memberships, of how many communities/nodes should exist and how many were detected.

We start with our default configuration of α = 2 and ϵ = 2. In Figure 4.13 we see that OSLOM fails to detect nodes that are part of 16 or more communities and thus is unable to identify highly-overlapping nodes. MOSES, on the other hand, is not far off concerning the number of communities per node. For some nodes, though, it does

not identify any community. Note that in the configuration we used, OSLOM assigns at least one community to every node; otherwise, we might see similar effects for OSLOM, too. Concerning the community sizes, in particular OSLOM also detects fewer small communities than there are small communities in the ground truth. MOSES also identifies some communities that are too large, possibly merging several communities into one. Overall, those results confirm the average weighted F1 scores we saw before, where MOSES performs slightly better than OSLOM.

Figure 4.14: Community memberships and sizes for the ground truth and the detected communities for α = 4 and ϵ = 2. Panels: (a) community memberships, (b) community sizes.

Next, we look at α = 4 and ϵ = 2 in Figure 4.14 to identify possible reasons why OSLOM performs better here but MOSES performs worse. Concerning the number of memberships per node, the distributions for MOSES match almost perfectly now, but there are still a couple too few nodes with a high number of memberships. OSLOM is also better; it identified about 10 nodes that belong to more than 16 communities. For


OSLOM, the distribution of community sizes also matches quite well, though overall there seem to be fewer communities of every size and a few communities smaller than any ground truth community. This also reflects the good F1 scores of almost 0.9. For MOSES, we see that for all ranges of community sizes above 32 nodes, it detects more communities than the ground truth has. On the other hand, for smaller community sizes, it detects fewer communities than both OSLOM and the ground truth. This suggests that for high values of α, MOSES merges communities even though, as the results for OSLOM show, the communities are actually easier to detect.

Figure 4.15: Community memberships and sizes for the ground truth and the detected communities for α = 2 and ϵ = 100. Panels: (a) community memberships, (b) community sizes.

As a last example, we look at α = 2 and ϵ = 100 in Figure 4.15, i.e., a configuration with extreme background noise. In this configuration, the average weighted F1 score of MOSES is still at 0.8, while that of OSLOM is at 0.7. Looking at the distribution of memberships, we can see that for more than 10% of the nodes, MOSES was unable

to identify any community. Looking at the distribution of community sizes, we see in particular many fewer small communities. This suggests that small communities whose nodes have a low internal degree basically vanish in the background noise. On the other hand, both algorithms detect more large communities than the ground truth has, suggesting that they either merge several communities due to the additional edges in the epsilon community or add unrelated nodes that are connected by edges in the epsilon community to these communities.

All in all, we can conclude that small community sizes pose a significant problem, even though we have already increased their minimum size compared to the value of 2 suggested by the original authors of the CKB benchmark [Chy+14]. In particular in small communities with lower values of α, while the average number of edges of a node to its own community might be appropriate, it is quite likely that some communities contain nodes that have just one or even no edges to their own community. In the LFR model, due to the specified minimum degree and fixed mixing parameter, this is not possible. Future research should investigate whether such a minimum internal degree might be beneficial for the CKB model, too. Similarly, external degrees might need to be limited to avoid large deviations from the average. To realize this, different models for the graphs that define communities would need to be considered, most likely using the fixed degree sequence model (FDSM, see Section 3.1.1) similar to the LFR benchmark. This would certainly slow down the generation of graphs and make dynamic events more difficult to realize.

4.5 Conclusion

We have introduced the first benchmark graph generator for dynamic overlapping community detection algorithms. The node degrees, the community sizes, and the number of communities per node all follow power law distributions as commonly observed in real-world networks. Compared to our first version of it [SHW17], we made the generator more flexible and the generated graphs more dynamic while ensuring that the properties of the graphs are better preserved over time. In the experimental evaluation, we show that these properties and a realistic local clustering coefficient are not only present in the initial graph but also maintained over time. Further, we demonstrate that our generator is capable of generating graphs with 100,000 nodes and 1000 time steps in about a minute. Our generator is therefore the ideal starting point for an extensive evaluation of existing and novel dynamic community detection algorithms. Further, we show that the existing community detection algorithms MOSES and OSLOM are capable of finding a community structure similar to the ground truth structure. As we show, several parameters allow adjusting how challenging the benchmark is. In a detailed analysis, we show that while MOSES tends to detect communities that are too large, OSLOM has problems with nodes belonging to many communities. An obvious direction for future work is therefore the development of novel algorithms for detecting overlapping communities in dynamic networks. Our detailed analysis of the detected communities suggests that there might be some communities that are not well-connected due to the randomness of the CKB model. A more detailed

analysis of such communities and potentially modifying the model to ensure minimum degrees in communities should thus be considered for future work.


Part II

Disjoint Density-Based Communities


5 Introduction to Disjoint Community Detection

The task of disjoint community detection is to divide the input graph G into a disjoint union of communities. In this part, we consider in particular communities that follow the traditional definition of being internally dense and externally sparsely connected. As already discussed in Section 4.1 in the previous chapter, this completely ignores the highly overlapping structure that is observed, e.g., in social networks [YL14] with dense overlaps between communities. Nevertheless, there are several reasons why we consider disjoint, density-based communities an important subject of research.

First of all, in some cases overlapping communities might be a better model for a network, but they are too complex for the task. Consider for example the task of understanding huge networks through visualization. Disjoint communities can be visualized by coloring nodes, making it easy to understand which parts of the network belong together. While approaches for visualizing overlapping sets [SAA09; Meu+13] and also overlapping communities [Pei15] exist, they are hardly as easy to grasp as a disjoint structure. In these cases, if there is a disjoint structure that models the network somewhat reasonably, we definitely want to be able to detect it.

Second, in some cases we might explicitly want a disjoint structure even though it might be a bad model for the network. Consider the related problem of partitioning a graph into k blocks of roughly equal size while minimizing the cut between the blocks [BS11]. A successful heuristic is based on first contracting nodes together (similar to the contraction step in the Louvain algorithm, see Section 1.4.1), finding an initial partition on the contracted graph and then refining this initial partition while unpacking the contraction. It has been shown that a disjoint community structure can help the partitioning algorithm make better decisions about which nodes not to contract together [MSS14; HS17]. Here, an overlapping community structure would not help, as the final partition is non-overlapping, too.

Finally, disjoint community detection can be used as a building block for overlapping community detection. For example, the Ego-Splitting framework [ELL17] uses a disjoint community detection algorithm to first split the neighborhood of every node into disjoint communities, then creates a copy of each node for each of these communities – so-called personas – and then again detects disjoint communities in an expanded graph of these personas. Thus, scalable disjoint community detection algorithms directly lead to scalable overlapping community detection algorithms.

In this part, we present two very different directions of research towards scalable disjoint community detection. In Chapter 6, we present a community detection algorithm that distributes the data and the computation across a cluster of compute nodes. We show that this algorithm is not only scalable, but that the quality of the results is also sometimes

higher than with sequential algorithms. In Chapter 7, we compare different approaches for edge filtering with respect to the goal of obtaining a smaller graph that still yields similar disjoint communities. We show that while some approaches are good at identifying edges inside communities, the filtered graphs yield different communities. Filtering random edges better maintains the structure: on the tested graphs, we can remove 40 to 60% of the edges and still obtain similar communities. While our distributed community detection algorithm is the more scalable approach, being able to detect communities on a graph twice as large without needing more RAM or a compute cluster might make a difference in practice, too. Further, studying different ways to determine the importance of an edge is interesting not only for filtering, but also for prioritizing edges. In Chapter 11 on local community detection, we show that prioritizing certain edges helps to expand a community around a seed node.

6 Distributed Graph Clustering using Modularity and Map Equation

This chapter is based on joint work with Ben Strasser, Dorothea Wagner and Tim Zeitz [Ham+18a]. Compared to the publication, we re-added some parts that were omitted due to space constraints and reference parts of this thesis where appropriate. Further, we added more recent related work and a comparison to it based on published results. When dealing with data that does not fit into the RAM of a single machine, a natural approach is to instead distribute the data across a compute cluster. We therefore study distributed extensions of established single-machine clustering algorithms. In this chapter, we present two versions of a simple distributed graph clustering algorithm. They compute a disjoint clustering by optimizing the quality measures modularity [NG04] and map equation [RAB09], respectively. The algorithms for the two measures, DSLM-Mod and DSLM-Map, differ only slightly. Adapting them for similar quality measures is straightforward. They are based on Thrill [Bin+16], a distributed big data processing framework that implements an extended MapReduce [DG08] model. We conduct an extensive experimental study on real-world graphs and on synthetic benchmark graphs with up to 68 billion edges. Our algorithms are fast while detecting clusterings similar to those detected by other sequential, parallel and distributed clustering algorithms. Compared to the distributed GossipMap [BH15] algorithm, DSLM-Map needs less memory, is up to an order of magnitude faster and achieves better quality.

6.1 Related Work

Existing distributed algorithms follow one of two approaches. The first is to partition the graph into a subgraph per machine. Each subgraph is then clustered independently on one machine. Then, all nodes of each cluster are merged, summing up the weights of parallel edges. The resulting coarser graph is clustered on a single machine. This assumes that the coarsened graph fits into the memory of a single machine. In [ZY16], for the partitioning, the input node ID range is chunked into equally sized parts. This can work well, but is problematic if the input node IDs do not reflect the graph structure. In [Wic+14], the input graph is partitioned using the non-distributed, parallel graph partitioning algorithm ParMETIS [KK98]. While this is independent of node IDs, it requires that the graph fits into the memory of one machine for the partitioning. The second approach consists of distributing the clustering algorithm itself. Using MPI, [Que+15] have introduced a distributed extension of the Louvain algorithm [Blo+08].


Similarly, [Lin+16] have presented an algorithm that uses the GraphX framework. Another algorithm named GossipMap is presented in [BH15], which uses the GraphLab framework. InfoFlow [Fun19] is based on Apache Spark. Our algorithms also use this second approach. In later works [Gho+18b; ZY18; FA19], a combination of both approaches has been proposed: while the graph is partitioned, information about neighboring nodes is exchanged from time to time. Therefore, the quality still depends on the partition, but possibly not as heavily. While many of these related algorithms heuristically optimize modularity, GossipMap, InfoFlow and the last two approaches [Fun19; ZY18; FA19] optimize the map equation. Other community detection formalizations have been considered as well. For example, EgoLP [Buz+14] is a distributed algorithm to find overlapping clusters.

6.2 Our Contribution

We propose two distributed graph clustering algorithms, DSLM-Mod and DSLM-Map, that optimize modularity and map equation, respectively. Our algorithms are the first graph clustering algorithms based on Thrill [Bin+16], a distributed big data processing framework written in C++ that implements an extended MapReduce [DG08] model. Our algorithms are easy to extend for optimizing different density-based quality measures. To evaluate the clustering quality, we compare against ground truth communities on synthetic LFR [LFR08] graph clustering benchmark graphs with up to 68 billion edges. Even for these graphs, 32 hosts of a compute cluster are enough. Our results show that our algorithms scale well and that DSLM-Map is better at recovering the ground truth than the sequential Infomap algorithm [RAB09]. On real-world graphs, our algorithms perform similarly to non-distributed baseline algorithms in terms of the quality of the detected clusterings and the stability between different runs. We evaluate both similarities and quality scores, since small differences in quality scores can correspond to vastly different clusterings [GMC10]. Similar to most related work, we make implicit assumptions on the structure of the graph. We assume that all edges incident to nodes in a single cluster fit into the memory of a single compute node. In practice, this is only a limitation when a graph has huge clusters. In many scenarios like social networks or web graphs, this is no limitation as cluster sizes are not that large. Our algorithms can be modified to avoid these restrictions, but this would increase running times. Outline. In the following, we introduce our notation and present the quality measures we optimize. We also give a brief introduction to Thrill. In Section 6.4, we present our algorithms. In Section 6.5, we present our experimental results. We conclude in Section 6.6.

6.3 Preliminaries

We use the notation introduced in Section 1.2. While all our input graphs are unweighted, intermediate steps of our algorithms use weighted graphs. Therefore, we treat all graphs

as weighted. In the context of this chapter, a clustering 𝒞 is a disjoint set of clusters such that every node is part of exactly one cluster.

6.3.1 Thrill

Thrill [Bin+16] is a distributed C++ big data processing framework. It can distribute the program execution over multiple machines and threads within a machine. Each thread is called a worker. Every worker executes the same program. If there is not enough main memory available, Thrill can use local storage such as an SSD as external memory to store parts of the processed data. This makes it possible to work with datasets larger than the combined main memory of all hosts. Data is maintained in distributed immutable arrays (DIAs). Distributed Thrill operations are applied to the DIAs. For example, Thrill contains a sort operation whose input is a DIA and whose output is a new sorted DIA. Similarly, the zip operation combines two DIAs of the same length into one DIA where each element is a pair of the two original elements. Thrill also supports DIAs with elements of non-uniform size, as long as each element fits into the memory of a worker. This allows elements to be arrays. Apart from zip and sort, we use the following operations: The map operation applies a function to each element of a DIA. The return values are the elements of the new DIA. Flatmap is similar, but the function may emit 0, 1, or more elements. This is similar to the map operation in the original MapReduce model [DG08]. A DIA can be aggregated by a key component. All elements with the same key are combined and put into an array. This is similar to the reduce operation in the original MapReduce model [DG08]. An aggregation is much more efficient if the keys are consecutive integers. In that case, the result is also automatically sorted by the keys. We use this optimized variant for all aggregations that are based on node IDs.
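To make these semantics concrete, the following sketch emulates the operations sequentially on plain std::vector. This is not Thrill's actual API (the real operations are lazy and run distributed over all workers); it only illustrates what each operation computes.

#include <cstddef>
#include <unordered_map>
#include <utility>
#include <type_traits>
#include <vector>

// Sequential stand-ins for the DIA operations described above.
template <typename T, typename F>
auto Map(const std::vector<T>& dia, F f) {
    std::vector<std::decay_t<decltype(f(dia[0]))>> out;
    for (const auto& x : dia) out.push_back(f(x));
    return out;
}

// FlatMap: f may emit 0, 1 or more elements per input via the callback.
template <typename T, typename F>
std::vector<T> FlatMap(const std::vector<T>& dia, F f) {
    std::vector<T> out;
    for (const auto& x : dia) f(x, [&out](const T& y) { out.push_back(y); });
    return out;
}

// Zip: combine two equally long DIAs element-wise into pairs.
template <typename A, typename B>
std::vector<std::pair<A, B>> Zip(const std::vector<A>& a, const std::vector<B>& b) {
    std::vector<std::pair<A, B>> out;
    for (std::size_t i = 0; i < a.size(); ++i) out.emplace_back(a[i], b[i]);
    return out;
}

// Aggregate by key: all elements with the same key end up in one array.
template <typename T, typename Key>
auto AggregateByKey(const std::vector<T>& dia, Key key) {
    std::unordered_map<std::decay_t<decltype(key(dia[0]))>, std::vector<T>> groups;
    for (const auto& x : dia) groups[key(x)].push_back(x);
    return groups;
}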

6.4 Algorithm

The basis of our algorithm is the Louvain algorithm [Blo+08], a fast algorithm for modularity optimization that delivers high-quality results. The original Infomap algorithm proposed for optimizing the map equation [RAB09] is based on the same scheme, but introduces additional steps to improve the quality. For an introduction to modularity and map equation and formal definitions of both, we refer to Sections 1.3.3 and 1.3.4. While we have introduced the Louvain algorithm already in Section 1.4.1, we repeat it here to establish the notation that we use in the following sections to describe our distributed algorithm. Initially, every node is in its own cluster. This is called a singleton clustering. In the local moving phase, the Louvain algorithm works in rounds. In each round, it iterates in a random order over all nodes v. For every node v, it considers v's current cluster and all clusters C such that there is an edge from v to a node of C. For all these clusters, the difference in quality score ∆v,C if v were to be moved into C is computed. If

an improvement is possible, v is moved into a cluster with maximal ∆v,C, resolving ties uniformly at random. The local moving phase stops when no node is moved in a round or a maximum number of rounds is reached. After the local moving phase, the contraction phase starts. All nodes inside a cluster are merged into a single node. The weights of all edges from a cluster C to a cluster D are summed up and added as an edge from the node representing C to the node representing D. All edge weights within a cluster are summed up and added as a loop edge. The contraction does not change the quality score. On the contracted graph, the algorithm is applied recursively. It terminates when the contraction phase is entered and the clustering is still the singleton clustering.
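For reference, the following sketch shows one round of this sequential local moving in simplified form. The adjacency representation and the deltaQuality callback (standing in for ∆v,C as defined in Section 6.4.1) are assumptions made for the sketch, not our actual implementation, and random tie-breaking among equally good clusters is omitted for brevity.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <numeric>
#include <random>
#include <unordered_set>
#include <utility>
#include <vector>

// adj[v] holds pairs (u, w) of neighbor and edge weight.
using Graph = std::vector<std::vector<std::pair<uint64_t, double>>>;

// One round of sequential local moving; returns true iff some node moved.
bool localMovingRound(const Graph& adj, std::vector<uint64_t>& cluster,
                      const std::function<double(uint64_t v, uint64_t target)>& deltaQuality,
                      std::mt19937& rng) {
    std::vector<uint64_t> order(adj.size());
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), rng); // random node order

    bool moved = false;
    for (uint64_t v : order) {
        // Candidates: v's own cluster and the clusters of all neighbors.
        std::unordered_set<uint64_t> candidates{cluster[v]};
        for (const auto& [u, w] : adj[v]) candidates.insert(cluster[u]);

        uint64_t best = cluster[v];
        double bestGain = 0.0;
        for (uint64_t c : candidates) {
            const double gain = deltaQuality(v, c);
            if (gain > bestGain) { bestGain = gain; best = c; }
        }
        if (best != cluster[v]) {
            cluster[v] = best; // cached cluster cuts/volumes are updated here
            moved = true;
        }
    }
    return moved; // the phase stops once a round moves no node
}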

6.4.1 Calculation of Scores

To evaluate whether a node v shall be moved from its cluster C to a cluster D, we need to know the difference in the quality measure ∆v,D. For the calculation of ∆v,D, let C− denote the cluster C without node v and C+ the cluster C with node v. Both modularity and map equation allow computing ∆v,D using just information about C and D and, for the map equation, how cutw(𝒞) changes. The latter can also be obtained easily if we know the value before the move. This suffices as most terms in the modularity and the map equation formula are just sums over all clusters. When moving a node, only the parts for the affected clusters C and D change. Thus, all other terms cancel out when calculating the difference. Only the first term in the map equation is different as it is a sum of all cut values. However, this sum can be quickly updated, too. For modularity, we obtain the following formula:

\Delta Q_{v,D} := 2 \cdot \left( \frac{\mathrm{cut}_w(v, D^-) - \mathrm{cut}_w(v, C^-)}{\mathrm{vol}_w(V)} - \frac{\deg_w(v)}{\mathrm{vol}_w(V)} \cdot \frac{\mathrm{vol}_w(D^-) - \mathrm{vol}_w(C^-)}{\mathrm{vol}_w(V)} \right)

Since the absolute difference is not required to select the best cluster for a given node, this formula can be simplified further. Specifically, if the set of possible clusters is guaranteed to contain the current cluster, we can drop all parts of the formula referring to the current cluster C. This will allow optimizations later on. For the map equation, the formula contains more terms; due to the logarithms, fewer terms cancel out:

\begin{aligned}
\Delta L_{v,D} :={} & \mathrm{plogp}\left(\frac{\mathrm{cut}_w(\mathcal{C}) + \mathrm{cut}_w(v, C^-) - \mathrm{cut}_w(v, D^-)}{\mathrm{vol}_w(V)}\right) - \mathrm{plogp}\left(\frac{\mathrm{cut}_w(\mathcal{C})}{\mathrm{vol}_w(V)}\right) \\
& - 2\,\mathrm{plogp}\left(\frac{\mathrm{cut}_w(C^-)}{\mathrm{vol}_w(V)}\right) + \mathrm{plogp}\left(\frac{\mathrm{cut}_w(C^-) + \mathrm{vol}_w(C^-)}{\mathrm{vol}_w(V)}\right) + 2\,\mathrm{plogp}\left(\frac{\mathrm{cut}_w(C^+)}{\mathrm{vol}_w(V)}\right) \\
& - \mathrm{plogp}\left(\frac{\mathrm{cut}_w(C^+) + \mathrm{vol}_w(C^+)}{\mathrm{vol}_w(V)}\right) - 2\,\mathrm{plogp}\left(\frac{\mathrm{cut}_w(D^+)}{\mathrm{vol}_w(V)}\right) + 2\,\mathrm{plogp}\left(\frac{\mathrm{cut}_w(D^-)}{\mathrm{vol}_w(V)}\right) \\
& + \mathrm{plogp}\left(\frac{\mathrm{cut}_w(D^+) + \mathrm{vol}_w(D^+)}{\mathrm{vol}_w(V)}\right) - \mathrm{plogp}\left(\frac{\mathrm{cut}_w(D^-) + \mathrm{vol}_w(D^-)}{\mathrm{vol}_w(V)}\right)
\end{aligned}


For obtaining the updated volume and cut values of C and D, we only need to know the previous cut and volume values of C and D, the degree of v and the cut of v with C− and D−. The latter can easily be obtained by iterating over the neighbors of v. In the sequential algorithms, the cut and volume of each cluster as well as the global cut are cached and updated after every move.
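Both formulas translate directly into code. The following is a minimal sketch of the two score differences, assuming plogp(x) = x · log2(x) with plogp(0) = 0, as usual for the map equation, and assuming all cut and volume quantities are supplied from the cached values; the C+/D+ quantities follow from the cached ones and the two values named above, e.g., vol_w(C−) = vol_w(C+) − deg_w(v) (self-loops are ignored here for simplicity).

#include <cmath>

// plogp(x) = x * log2(x), with plogp(0) = 0.
double plogp(double x) { return x > 0.0 ? x * std::log2(x) : 0.0; }

// Delta-modularity for moving v from C to D; *_woV denotes C^- / D^-,
// i.e., the cluster without v; volV is vol_w(V).
double deltaModularity(double cutV_D_woV, double cutV_C_woV, double degV,
                       double volD_woV, double volC_woV, double volV) {
    return 2.0 * ((cutV_D_woV - cutV_C_woV) / volV
                  - degV / volV * (volD_woV - volC_woV) / volV);
}

// Delta of the map equation for moving v from C to D; totalCut is the
// total cut of the whole clustering, *_woV / *_withV denote the
// cluster without / with v (C^- / C^+ and D^- / D^+).
double deltaMapEquation(double totalCut, double cutV_C_woV, double cutV_D_woV,
                        double cutC_woV, double volC_woV,
                        double cutC_withV, double volC_withV,
                        double cutD_woV, double volD_woV,
                        double cutD_withV, double volD_withV, double volV) {
    return plogp((totalCut + cutV_C_woV - cutV_D_woV) / volV)
         - plogp(totalCut / volV)
         - 2.0 * plogp(cutC_woV / volV)
         + plogp((cutC_woV + volC_woV) / volV)
         + 2.0 * plogp(cutC_withV / volV)
         - plogp((cutC_withV + volC_withV) / volV)
         - 2.0 * plogp(cutD_withV / volV)
         + 2.0 * plogp(cutD_woV / volV)
         + plogp((cutD_withV + volD_withV) / volV)
         - plogp((cutD_woV + volD_woV) / volV);
}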

6.4.2 Distributed Synchronous Local Moving (DSLM)

In local moving, the i-th move depends on the clustering after the first i − 1 moves are executed. This data dependency makes the parallelization difficult. We therefore split a round into sub-rounds. Every node picks a random sub-round in which it is active. In the i-th sub-round, all active nodes are moved synchronously and in parallel with respect to the clustering after the (i − 1)-th sub-round. For the first sub-round, this is with respect to the initial clustering. We call this scheme synchronous local moving. For our distributed synchronous local moving, a global hash function h maps a tuple of a node ID v, the number of completed rounds and a global seed onto the sub-round rv in which v is active. Figure 6.1 illustrates the data flow of our algorithm. We represent a graph and its clustering as two DIAs. They have length n and are stored sorted by their first item, the node ID v. The graph DIA stores triples (v, ⟨ui⟩, ⟨wi⟩), where ⟨ui⟩ and ⟨wi⟩ are equally-sized arrays. For every i, there is an edge {v, ui} with weight wi. We store every edge twice such that it is accessible from both incident nodes. The clustering DIA of pairs (v, C) stores for every node v its cluster ID C. In DSLM, a sub-round is composed of a bidding and a compare step. In the bidding step, the clusters place bids for active nodes. In the compare step, every active node compares its bids and becomes part of the cluster with the best bid. To allow a node v to compute the change in modularity or map equation when joining the neighboring cluster C, each bid contains: (a) the volume volw(C \ v), (b) the cut weight cutw(v, C \ v) between C \ v and v, and (c) the cut weight cutw(C \ v) between C \ v and the remaining graph. The bidding step starts by zipping the clustering DIA and the graph DIA and aggregating this zipped DIA by cluster ID. The result is a DIA with one large element per cluster C. Each element contains all nodes in the cluster C and the neighborhoods of these nodes. This allows us to compute the measures (a), (b), and (c) using a non-distributed algorithm inside a worker. Using a flatmap operation, our algorithm emits for every cluster C bids for all active nodes inside C and adjacent to C. It can determine which nodes are active as the hash function h is globally known. The generated bid DIA consists of quintuples (C, v, volw(C \ v), cutw(v, C \ v), cutw(C \ v)). Each quintuple is a bid of cluster C for active node v. The compare step aggregates the bid DIA by node v. After the aggregation, the result is zipped with the graph DIA to obtain the nodes' degrees and loop edge weights. In a map operation, we use this information to compute for every active node the best bid and return the according cluster. We retrieve the old cluster ID for non-active nodes by zipping with the original clustering DIA. This yields the updated clustering DIA, which is the input of the next sub-round.
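A minimal sketch of the sub-round schedule: since h is a pure function of node ID, round number and seed, every worker can evaluate it independently, so all workers agree on which nodes are active without any communication. The concrete hash below is an arbitrary illustrative choice (a splitmix64-style finalizer), not the one used in our implementation.

#include <cstdint>

// Sub-round r_v in which node v is active in the given local moving
// round; numSubRounds is 4 in our experiments.
uint32_t subRound(uint64_t v, uint64_t completedRounds, uint64_t seed,
                  uint32_t numSubRounds) {
    // Mix the three inputs into a well-distributed 64-bit value.
    uint64_t x = v ^ (completedRounds * 0x9e3779b97f4a7c15ULL) ^ seed;
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return static_cast<uint32_t>(x % numSubRounds);
}

// In sub-round i, exactly the nodes v with subRound(v, r, seed) == i are
// moved, synchronously, relative to the clustering after sub-round i-1.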


Figure 6.1: DSLM data flow: the clustering DIA (v, C) and the graph DIA (v, ⟨ui⟩, ⟨wi⟩), both sorted by v, are zipped and aggregated by cluster ID C; a flatmap calculates and emits cluster data as bids (C, v, volw(C \ v), cutw(v, C \ v), cutw(C \ v)) for active neighbors; the bids are aggregated by v and zipped with the graph DIA, and a map selects the best cluster per active node, which is combined with the round input for the clusters of inactive nodes.

Implementation Details and Optimizations. If modularity is optimized, our algorithm can be improved in two ways. First, we can omit cutw(C \ v) as it is only needed for the map equation. Second, as mentioned above, we can compare bids without knowing the current cluster. This allows us to use a pairwise reduction instead of one that first waits for all elements. As we still need the node's degree, each worker stores the degrees of the nodes that are reduced on that worker in a plain array. This is possible because we know on which worker each node will end up.
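A sketch of such a pairwise bid reduction under the simplification above: because the simplified modularity gain of a bid can be evaluated without any information about v's current cluster, combining two bids into the better one is associative and commutative, so the reduction does not have to wait for all bids of a node. The struct layout is a hypothetical simplification.

#include <cstdint>

// A bid of cluster C for an active node v (modularity variant; the
// cut of C to the remaining graph is omitted, see above).
struct Bid {
    uint64_t cluster;    // C
    double volWithoutV;  // vol_w(C \ v)
    double cutToV;       // cut_w(v, C \ v)
};

// Simplified modularity gain of v joining the bidding cluster; terms
// referring to v's current cluster are dropped, so bids for the same
// node are directly comparable.
double simplifiedGain(const Bid& b, double degV, double volV) {
    return b.cutToV / volV - degV / volV * b.volWithoutV / volV;
}

// Pairwise reduction step: keep the better of two bids for node v.
Bid combineBids(const Bid& a, const Bid& b, double degV, double volV) {
    return simplifiedGain(a, degV, volV) >= simplifiedGain(b, degV, volV) ? a : b;
}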

6.4.3 Distributed Contraction and Unpacking

The contraction is performed in three steps: (a) obtain consecutive cluster IDs, (b) replace all node IDs by the cluster IDs, and (c) combine multi-edges. We first zip the graph and clustering DIAs and aggregate them by cluster ID. To get consecutive cluster IDs, we replace them with the element positions. From the result, which contains tuples (C, ⟨vi, ⟨uj⟩, ⟨wj⟩⟩), we derive two DIAs. The first DIA is a new clustering DIA with consecutive cluster IDs. To obtain it, we


first drop the neighborhood information. We store this intermediate DIA (C, ⟨vi⟩) for the unpacking phase. Then, we expand it into pairs (vi, C) of node and cluster ID using a flatmap operation and sort by node IDs. The second DIA is the contracted graph DIA. To obtain it, we first emit a triple (Cu, v, w) for every node v that is a neighbor of a node in Cu in a flatmap operation. The weight w is the sum of all edge weights from nodes in Cu to v. We aggregate this DIA by v, zip it with the new clustering DIA and replace v by v's cluster ID Cv. We then aggregate by Cv to obtain pairs (Cv, ⟨Cu,i, wi⟩) containing the neighboring clusters Cu,i for every Cv. Finally, we sum up the weights wi for the same neighboring cluster for every cluster Cv in a map operation. To unpack the clustering calculated on a level, we zip the clustering DIA (v, Cv) with the intermediate clustering DIA (v, ⟨vi⟩) of a cluster v and its nodes ⟨vi⟩ from the previous contraction phase. A flatmap operation assigns the cluster ID Cv of the contracted node to all original nodes u ∈ ⟨vi⟩, resulting in a clustering DIA (u, Cu). After sorting it by node, it is returned to the next level.
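Sequentially, the three contraction steps collapse into a few lines. The sketch below uses a hypothetical minimal edge type, not our distributed implementation; each undirected edge is assumed to appear once in the input, parallel edges between clusters are summed, and intra-cluster weight becomes a loop edge, just as described above.

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct WeightedEdge { uint64_t u, v; double w; };

// (a) relabel cluster IDs to 0..k-1, (b) replace node IDs by cluster
// IDs, (c) combine multi-edges by summing their weights.
std::vector<WeightedEdge> contract(const std::vector<WeightedEdge>& edges,
                                   std::vector<uint64_t>& cluster) {
    std::map<uint64_t, uint64_t> consecutive;                 // step (a)
    for (uint64_t c : cluster) consecutive.emplace(c, consecutive.size());
    for (uint64_t& c : cluster) c = consecutive[c];

    std::map<std::pair<uint64_t, uint64_t>, double> combined; // steps (b)+(c)
    for (const auto& e : edges) {
        uint64_t a = cluster[e.u], b = cluster[e.v];
        if (a > b) std::swap(a, b); // normalize; a == b is a loop edge
        combined[{a, b}] += e.w;
    }
    std::vector<WeightedEdge> out;
    for (const auto& [uv, w] : combined) out.push_back({uv.first, uv.second, w});
    return out;
}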

6.5 Experiments

In this section, we present an experimental evaluation of our algorithm DSLM1. The source code of our implementation is publicly available on GitHub2. We first describe our experimental setup. Then, we present weak scaling experiments to evaluate the running time, compare the quality on LFR benchmark graphs [LFR08] and evaluate the performance on established real-world benchmark data. All running time experiments were performed on a compute cluster. Each compute node has two 14-core Intel Xeon E5-2660 v4 processors (Broadwell) with a default frequency of 2 GHz, 128 GiB RAM and a 480 GiB SSD. They are connected by an InfiniBand 4X FDR interconnect. We use the TCP back-end of Thrill due to problems with the combination of multithreading and OpenMPI. We use Thrill's default parameters, except for the block size, which determines the size of the data packages sent between the hosts. Preliminary experiments found that a block size of 128 KiB instead of the default 2 MiB yields the best results. For our algorithms, we use four sub-rounds as suggested in a preliminary study [Zei17]. Using fewer sub-rounds leads to convergence problems; using more does not significantly improve quality but increases running time. In each local moving phase, we perform at most eight rounds. All experiments were performed with 16 threads per host. More threads do not improve the running times much further. Preliminary experiments indicate that the performance is RAM bound. Apart from DSLM-Mod and DSLM-Map that optimize modularity and map equation, we also evaluate a variant DSLM-Mod w/o Cont. that stops after the first local moving

1This section only covers parts of our experiments. Under https://github.com/kit-algo/ distributed_clustering_thrill_evaluation you can find additional analyses, links to our raw data and information on how to explore our data on your own. 2https://github.com/kit-algo/distributed_clustering_thrill

phase. This significantly decreases the running time and surprisingly also improves the quality on synthetic instances. We evaluate this behavior in more detail in Section 6.5.2. For modularity, we compare against our own implementation of the sequential Louvain algorithm [Blo+08] and the shared-memory parallel PLM [SM16]. For map equation, we compare against the sequential Infomap [RAB09], the shared-memory parallel RelaxMap [Bae+17] and the distributed GossipMap [BH15] implementations. In a preprocessing step, we remove degree-zero nodes, make the ID space consecutive and randomize the node order. This ensures that our algorithms are independent of the input order and improves load balancing. All experiments were repeated 10 times with different random seeds. We report averaged results and, where possible, the standard deviation as error bars. To compare clusterings, either with ground truth communities or with a clustering calculated using a different algorithm, we use the adjusted rand index (ARI) [HA85]; see Section 2.3.2 for an introduction. During the experiments, the Meltdown and Spectre vulnerabilities became public and performance-impacting patches were applied to the machines. Rerunning some experiments showed a slowdown of up to a factor of 1.6 for runs with 32 hosts but no significant slowdown for runs with a single host. We did not have the resources to rerun all experiments. Also, we expect the performance of the machines to change further in the future. Patches with less impact (Retpoline) are available but have not been rolled out yet. More vulnerabilities have been discovered in the meantime and it is unclear if fixes for them will have further performance implications3. At the point of initial patch distribution, most distributed algorithm runs were already done. About half of the GossipMap runs on the real-world graphs were performed afterwards and are excluded from the running time reports. All runs for non-distributed algorithms were performed with patches applied, as their performance should not have been affected significantly.

Synthetic Instance Generation. Our synthetic test data is generated using the established LFR benchmark generation scheme [LFR08] that we have described in Section 2.1.1. To generate graphs of up to 512 million nodes and 67.6 billion (undirected) edges in a reasonable time, we use the external-memory LFR generator implementation [Ham+18b] described in Chapter 3. We set a minimum degree of 50 and a maximum degree of 10 000 with a power law exponent of 2. This leads to an average degree of approximately 264. For the communities, we set 50 as the minimum and 12 000 as the maximum size with a power law exponent of 1. This corresponds to the const parameter set in Section 3.10. As explained there, these settings are inspired by the degree distribution of the Facebook network in May 2011 with 721 million active users [Uga+11]. Unless otherwise noted, we set the mixing parameter µ to 0.4. This means that, in contrast to previous experiments on the behavior of clustering algorithms on larger LFR benchmark instances [Emm+16], our cluster sizes are not scaled as we increase the graph size and the diameter of the clusters remains small. This avoids the field of view limit experienced in previous work [Emm+16].
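For reference, the ARI mentioned above can be computed from the contingency table of two clusterings; the following is a compact sketch of the standard formula from [HA85] (see Section 2.3.2 for the formal introduction), assuming clusterings are given as per-node cluster IDs.

#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Adjusted rand index of two clusterings over the same node set.
double adjustedRandIndex(const std::vector<uint64_t>& a,
                         const std::vector<uint64_t>& b) {
    auto choose2 = [](double x) { return x * (x - 1.0) / 2.0; };

    std::map<std::pair<uint64_t, uint64_t>, double> contingency;
    std::map<uint64_t, double> rowSums, colSums;
    for (std::size_t v = 0; v < a.size(); ++v) {
        contingency[{a[v], b[v]}] += 1.0;
        rowSums[a[v]] += 1.0;
        colSums[b[v]] += 1.0;
    }
    double index = 0.0, rows = 0.0, cols = 0.0;
    for (const auto& [cell, n] : contingency) index += choose2(n);
    for (const auto& [c, n] : rowSums) rows += choose2(n);
    for (const auto& [c, n] : colSums) cols += choose2(n);

    const double expected = rows * cols / choose2(double(a.size()));
    const double maximum = 0.5 * (rows + cols);
    return (index - expected) / (maximum - expected);
}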

3https://securityaffairs.co/wordpress/72158/hacking/spectre-ng-vulnerabilities.html


Figure 6.2: Weak scaling: running time in seconds of our distributed algorithms (left, over # hosts / graph size in million nodes) and ARI with the ground truth (right, over graph size in nodes). The DSLM-Mod w/o Cont. ARI line is hidden by the DSLM-Map line.

6.5.1 Weak Scaling

For the weak scaling experiments, we use LFR graphs with 16, 32, 64, 128, 256 and 512 million nodes. We cluster them on 1, 2, 4, 8, 16, and 32 hosts, respectively. The left part of Figure 6.2 shows the running time of our algorithms. Our algorithms utilize almost the entire available RAM. GossipMap is less memory-efficient; it was unable to cluster the graphs in these configurations and crashed. With a linear-time algorithm and perfect scaling, we would expect the running time to remain constant as we increase the graph size and the number of hosts. For the variant DSLM-Mod w/o Cont., the running time actually does not increase much. The running time of the full DSLM-Mod and DSLM-Map algorithms, however, increases approximately linearly as the number of hosts is scaled exponentially. The reason for this is that LFR graphs get very dense during contraction, and thus in particular larger graphs still have a significant number of edges after the contraction. Also, DSLM-Map is approximately a factor of two slower than DSLM-Mod. This is expected, as the optimizations described at the end of Section 6.4.2 are not applicable to DSLM-Map.

6.5.2 Quality

First, we evaluate the quality of the clusterings obtained in the weak scaling experiment. The right part of Figure 6.2 depicts the similarities of the clusterings found by our algorithms and the ground truth. From the plot, we observe that DSLM-Map finds a clustering very close to the ground truth. DSLM-Mod w/o Cont. achieves similar results. Unfortunately, DSLM-Mod fails to find a clustering similar to the ground truth on the larger instances. This shows that after the contraction, clusters are merged that should not be merged. To verify whether the worse results of DSLM-Mod are due to the resolution limit, we ran a sequential Louvain algorithm on a graph where we contracted the ground truth. This algorithm indeed merges clusters, showing that the resolution limit is relevant here. However, the clusters detected in this way are much more similar to the ground truth than those detected by DSLM-Mod, and even the ground truth alone has


higher modularity scores than those found by DSLM-Mod.

Figure 6.3: Adjusted rand index with the ground truth for µ ∈ [0.1, 0.9] for Seq. Louvain, Seq. Infomap, PLM, RelaxMap, DSLM-Mod, GossipMap, DSLM-Mod w/o Cont. and DSLM-Map.

We also use smaller LFR graphs with 1M nodes and varying mixing parameter to compare the quality of the communities found by all compared algorithms. Figure 6.3 shows the adjusted rand index of the detected clusterings with the ground truth. DSLM-Mod w/o Cont. and DSLM-Map outperform all other algorithms by a significant margin. On average, DSLM-Mod still outperforms the other modularity-optimizing algorithms. For all values of µ, the ground truth has a higher modularity score than the clusterings found by the modularity-optimizing algorithms. Merging clusters of the ground truth again improves the modularity score but leads to clusterings that still have an ARI of above 0.99 for µ < 0.9. With the algorithms optimizing the map equation, the situation is similar. For µ < 0.8, the ground truth, which DSLM-Map consistently finds, has a better map equation score than the clusterings found by all other algorithms. For µ ≥ 0.8, a singleton clustering yields a better map equation score than the ground truth clustering. GossipMap finds neither good map equation scores nor the ground truth for µ > 0.4. Overall, for these LFR benchmark graphs, DSLM seems to be superior to sequential local moving. Examining sequential local moving algorithms, we noticed that high-degree nodes attract many nodes in the first local moving round. After a few nodes join their cluster, many others follow. In contrast to that, with DSLM, 25% of the nodes can join the cluster of another node before any cluster sizes come into play. Apparently, this avoids such accumulation effects.

6.5.3 Real-World Graphs

To assess whether our results on LFR benchmark graphs also hold for real-world graphs, we performed experiments on a set of different real-world graphs. From the Stanford Large Network Dataset Collection [LK14], we include three social networks (com-LiveJournal, com-Orkut and com-Friendster); see Section 2.2.2 for an introduction. From the 10th DIMACS Implementation Challenge [Bad+14], we include two web graphs where nodes represent URLs and edges represent links between them (uk-2002 and uk-2007-05, see also Section 2.2.1).


Table 6.1: Average running time in seconds of the algorithms on the real-world graphs.

Graph        # Nodes  # Edges  # Hosts  Louvain  PLM   DSLM-Mod  DSLM-Mod w/o Cont.  Infomap  RelaxMap  GossipMap  DSLM-Map
LiveJournal  4M       34M      8        99       25    31        14                  1329     163       372        49
Orkut        3M       117M     8        170      53    47        34                  2405     415       700        84
uk-2002      18M      261M     8        572      142   46        22                  6656     240       682        52
Friendster   66M      1806M    16       6002     1755  1047      742                 oom      oom       13743      1161
uk-2007-05   105M     3302M    16       7993     2520  151       106                 oom      oom       4211       214

We clustered these graphs both with the sequential baseline algorithms and with our distributed algorithms. Table 6.1 depicts the sizes of the graphs, the number of hosts we used for the distributed algorithms and the running times in seconds. RelaxMap and GossipMap use the directed version of the map equation that includes a PageRank approximation as a preprocessing step. To allow for a fair comparison, we only report the running time of the actual clustering step after this preprocessing. As three GossipMap runs on uk-2007-05 crashed, there are fewer samples. Unfortunately, both the original Infomap implementation and RelaxMap were unable to cluster all instances. On the two largest graphs, 128 GiB of RAM were not enough memory (oom). With 8 or 16 hosts, our distributed algorithms are almost always faster than the sequential and shared-memory parallel algorithms. Note that due to the randomized node order, PLM is slower in our experiments than reported in [SM16]. DSLM-Map with 8 hosts is more than a factor of 5 faster than RelaxMap (for uk-2002 even a factor of 20) and also a factor of 10 faster than GossipMap. DSLM-Mod is faster than DSLM-Map, but the difference is less pronounced than in the weak scaling experiments. This shows the advantage of our algorithmic scheme in combination with the efficient Thrill framework. Table 6.2 shows the average modularity and map equation scores obtained by the algorithms. We observe that PLM on average finds the best modularity scores, with a minor exception on uk-2002 where Louvain finds better values. DSLM-Mod performs slightly worse, DSLM-Mod w/o Cont. significantly worse. This shows that DSLM-Mod w/o Cont., which performed really well on the LFR benchmark graphs, is unsuited for real-world graphs. The best map equation scores are found by the Infomap algorithm where it finished the computation. Since RelaxMap and GossipMap use the directed map equation, we also include the directed Infomap algorithm to evaluate whether using the directed map equation leads to different results. Surprisingly, in some cases Infomap optimizing the directed map equation finds better clusterings with respect to the undirected map equation than the undirected Infomap, though the differences are very small. On the two smallest graphs, RelaxMap finds better clusterings than the distributed algorithms. On uk-2002,


Table 6.2: Average modularity (first four algorithms) and undirected map equation (remaining algorithms) scores obtained by the respective algorithms.

Graph        Louvain  PLM    DSLM-Mod  DSLM-Mod w/o Cont.  Infomap  Dir. Infomap  RelaxMap  GossipMap  DSLM-Map
LiveJournal  0.752    0.752  0.749     0.591               9.899    9.900         9.943     9.963      9.981
Orkut        0.664    0.666  0.658     0.524               11.826   11.825        11.849    11.979     11.896
uk-2002      0.990    0.990  0.990     0.879               6.458    6.458         6.476     6.550      6.468
Friendster   0.622    0.627  0.616     0.553               oom      oom           oom       16.271     14.785
uk-2007-05   0.996    0.996  0.996     0.919               oom      oom           oom       9.034      8.057

Table 6.3: Average similarities in terms of ARI with the best clustering found according to the respective quality score. Underlined entries indicate the algorithm which found the clustering with the best score.

Graph        Louvain  PLM    DSLM-Mod  DSLM-Mod w/o Cont.  Infomap  Dir. Infomap  RelaxMap  GossipMap  DSLM-Map
LiveJournal  0.571    0.639  0.600     0.179               0.976    0.973         0.376     0.784      0.769
Orkut        0.632    0.625  0.659     0.220               0.919    0.925         0.807     0.491      0.819
uk-2002      0.730    0.724  0.674     0.047               0.986    0.985         0.928     0.698      0.970
Friendster   0.640    0.623  0.569     0.361               oom      oom           oom       0.013      0.748
uk-2007-05   0.873    0.877  0.816     0.279               oom      oom           oom       0.132      0.986

RelaxMap is outperformed by DSLM-Map. DSLM-Map finds better clusterings than GossipMap on all graphs except LiveJournal, on the two largest graphs by a significant margin. On no graph does DSLM-Map have a map equation score that is more than 0.1 worse than that of the best algorithm. Since modularity and map equation feature counterintuitive behavior like the resolution limit, quality scores on their own can be misleading. We therefore also compare the obtained clusterings in terms of ARI. Among all runs of all algorithms that optimize the same quality measure, we determine for each graph the detected clustering with the best score. For each graph and quality measure, we use this best clustering as the baseline to which we compare all other clusterings detected while optimizing the same quality measure. Table 6.3 shows the average similarity in terms of the adjusted rand index and highlights which algorithm detected the baseline clustering. In most cases, this is the sequential baseline. For modularity, this is in contrast to the results from Table 6.2 where, on average, PLM

outperforms Louvain on most graphs. Only on Friendster and uk-2002 does the modularity-optimizing algorithm that found the best clustering also have the highest average quality score. We observe that the modularity-optimizing algorithms do not consistently find the same clustering. Clusterings may vary vastly depending on the random seed. Further, on social networks the adjusted rand indices are smaller than on web graphs. This is probably due to web graphs having a more pronounced community structure. DSLM-Mod produces clusterings that are less similar, but still much more similar than those of DSLM-Mod w/o Cont., which produces vastly different clusterings. This confirms our observation from Table 6.2: omitting the contraction significantly decreases the quality of clusterings on real-world graphs. Infomap is in general much more stable than Louvain, with an adjusted rand index close to 1. DSLM-Map produces very similar clusterings on uk-2002. On the LiveJournal and Orkut graphs, the clusterings are slightly less similar. Quite interestingly, the parallel RelaxMap and the distributed GossipMap algorithms produce significantly different clusterings, in particular for the two social networks. As the results of the directed Infomap algorithm show, this is not due to optimizing the directed map equation. We conclude that RelaxMap and GossipMap indeed fail to reliably find similar clusterings.

Comparison to Recent Approaches. In this section, we provide a short comparison to some approaches that were developed at the same time as or after our algorithms [Ham+18a], based on data published in the respective papers. For modularity, we compare DSLM-Mod to Vite [Gho+18b; Gho+18a]. They publish results for three graphs we also consider: uk-2007-05, Friendster and Orkut. On similar numbers of processors, their approach is roughly 2 to 3 times faster than DSLM-Mod. Their modularity scores are better for Friendster and Orkut, but worse for uk-2007-05. We are aware of three more recent approaches [Fun19; FA19; ZY18] that optimize the map equation. InfoFlow [Fun19] is based on a simple agglomerative clustering scheme where clusters are merged repeatedly. Experiments for InfoFlow were only executed on a single desktop computer, thus no performance comparison is possible. No quality results for any of the graphs we included were published. For the graphs where they report the map equation score, the score for InfoFlow is frequently more than 0.3 and sometimes more than 0.5 higher (i.e., worse) than for their sequential implementation. The approach by Zeng and Yu [ZY18] reports no map equation scores as numbers, but plots show that their scores are slightly worse than the sequentially achieved scores. Concerning the running time, the scalability of their approach seems good, while the running time is comparable to or higher than for our approach. Faysal and Arifuzzaman [FA19] do not report results on any of the graphs we use. They report differences in map equation score compared to Infomap of 0.03 up to 0.39 depending on the graph, while achieving speedups of only up to 3.31 with up to 256 processors. For the Youtube social network, they report execution times above 100 seconds even with 256 processors. We also included this network in preliminary experiments and got running times of about 15 seconds using just two hosts, i.e., at least five times fewer cores. These results suggest that Vite [Gho+18b; Gho+18a] and the approach by Zeng and


Yu [ZY18] are the strongest of the more recent competitors and should be compared to our approach in future work. As both use a simple partitioning of the nodes based on node IDs, an important question is also what influence randomized node IDs have on their approaches.

6.6 Conclusion

We have introduced two distributed graph clustering algorithms, DSLM-Mod and DSLM-Map, that optimize modularity and map equation, respectively. They are based on the Thrill framework. In an extensive experimental evaluation, we have shown that on LFR benchmark graphs, DSLM-Map achieves excellent results, even better than the sequential Infomap algorithm. For DSLM-Mod, we also evaluate a variant without contraction which performs very well on LFR benchmark graphs. The full DSLM-Mod algorithm with contraction fails to recover the ground truth on LFR benchmark graphs – similar to the sequential Louvain algorithm – but significantly outperforms the variant without contraction on real-world graphs. On real-world graphs, both distributed algorithms find clusterings only slightly different from those of the sequential algorithms. Compared to GossipMap, the state-of-the-art distributed algorithm for optimizing the map equation, DSLM-Map is up to an order of magnitude faster while detecting clusterings that have similar or better map equation scores and are more similar to the clustering with the best map equation score. In the first local moving phase, synchronous local moving seems to be superior to sequential local moving. Further research is needed to see whether this is a phenomenon particular to the LFR graphs we studied or whether synchronous local moving could be a way to avoid local maxima when optimizing such quality functions. After the contraction, however, more careful local moving strategies should be developed to avoid the problems we observed, in particular on LFR graphs. Therefore, further research on different local moving strategies seems to be a promising direction.

7 Structure-Preserving Sparsification Methods for Social Networks

This chapter is based on joint work with Gerd Lindner, Henning Meyerhenke, Christian L. Staudt and Dorothea Wagner [Ham+16]. Parts of this chapter have been published in preliminary form in [Lin+15]. Compared to the publication [Ham+16], parts of the results not related to community detection have been removed and we extended the discussion of sparsification with respect to community detection. Further, we added a small proof concerning the running time of our parallel triangle listing algorithm. Sparsification reduces the size of networks while preserving structural and statistical properties of interest. Various sparsifying algorithms have been proposed in different contexts. In [Lin+15; Ham+16], we contribute the first systematic conceptual and experimental comparison of edge sparsification methods on a diverse set of network properties. In this chapter, we focus on properties concerning communities. Our goal is to identify methods that allow us to remove a significant number of edges while preserving the community structure such that community detection algorithms can still detect it. This would allow us to make community detection algorithms more scalable by running them on a subset of the edges. A requirement for this is of course that the sparsification method itself is scalable, an aspect we also examine in this chapter. Sparsification methods can be understood as methods for rating edges by importance and then filtering globally or locally by these scores. We show that applying a local filtering technique improves the preservation of all kinds of properties. In addition, we propose a new sparsification method (Local Degree) which preserves edges leading to local hub nodes. All methods are evaluated on a set of social networks from Facebook, Google+, Twitter and LiveJournal with respect to network properties including diameter, connected components and community structure. In order to assess the preservation of the community structure, we also include experiments on synthetically generated networks with ground truth communities. Experiments with our implementations of the sparsification methods show that many network properties can be preserved down to about 20% of the original set of edges for sparse graphs with a reasonable density. The experimental results allow us to differentiate the behavior of different methods and show which method is suitable with respect to which property. While our Local Degree method is best for preserving connectivity and short distances, other newly introduced local variants are best for preserving the community structure. While some methods manage to keep intra-community edges in synthetic benchmark graphs, they lead to very different communities on the social networks. There, simple random edge sampling better preserves the community structure.


7.1 Introduction

Most real-world complex networks, including social networks, are already sparse in the sense that for n nodes the edge count m is asymptotically in O(n). Nonetheless, typical densities lead to a computationally challenging number of edges. Here, we pursue the goal of further sparsifying such networks by retaining just a fraction of the edges (sometimes called a "backbone" of the network). In an experimental evaluation, we examine how this affects the community structure of the network as well as properties related to communities like the average clustering coefficient or connected components. Sparsification can be applied as an acceleration technique: By disregarding a large fraction of edges that are unimportant for the task, running times of network analysis algorithms like community detection algorithms can be reduced. We can also think of sparsification as lossy compression. Large networks can be strongly reduced in size if we are only interested in certain structural aspects that are preserved by the sparsification method. Other applications include information visualization: Even moderately sized networks turn into "hairballs" when drawn with standard techniques, as the number of edges is visually overwhelming. In contrast, showing only a fraction of edges can reveal network structures to the human eye if these edges are selected appropriately [NOB15]. From a network science perspective, sparsification can yield valuable insights into the importance of relationships and the participating nodes: Given that a sparsification method tends to preserve a certain property, the method can be used to rank or classify edges, discriminating between essential and redundant edges. The core idea of the research presented here is that not all edges are equally important with respect to properties of a network: For example, a relatively small fraction of long-range edges typically act as shortcuts and are responsible for the small-world phenomenon in complex networks. The importance of edges can be quantified, leading to edge scores, often also referred to as edge centrality values. In general, we subsume under these terms any measure that quantifies the importance of an edge depending on its position within the network structure. Sparsification can then be broken down into the stages of (i) edge scoring and (ii) filtering the edges using a global score threshold. Despite the similar terminology, our work is only weakly related to a line of research in theoretical computer science where graph sparsification is understood as the reduction of a dense graph (Θ(n²) edges) to a sparse (O(n) edges) or nearly-sparse graph while provably preserving properties such as spectral properties (e.g. [Bat+13]). The networks of our interest are already sparse in this sense. With the goal of reducing network data size while keeping important properties, our research is related to a body of work that considers sampling from networks (on which [ANK13] provides an extensive overview). Sampling is concerned with the design of algorithms that select edges and/or nodes from a network. Here, node and edge sampling methods must be distinguished: For node sampling, nodes and edges from the original network are discarded, while edge sampling preserves all nodes and reduces the number of edges only. The literature on node sampling is extensive, while pure edge sampling and filtering techniques have not been considered

as often. A seminal paper [LF06] concludes that node sampling techniques are preferable, but considers few edge sampling techniques. The study presented in [Ebb+08] looks at how well a sample of 5%-20% of the original network preserves certain properties, and is mainly focused on node sampling through graph exploration. It concludes that random walk-based node sampling works best on complex networks, but does so on the basis of experiments on synthetic graphs only and compares only with very simple edge sampling methods. Only edge sampling techniques are directly comparable to our edge scoring and filtering methods. In this work, we restrict ourselves to reducing the edge set, while keeping all nodes of the original graph. Preserving the nodes allows us to infer properties of each node of the original graph. This is important because in network analysis, the unit of analysis is often the individual node, e.g. when a score shall be computed for each user in an online social network scenario. With respect to the goal of accelerating the analysis, many relevant graph algorithms scale with m, so reducing m is more relevant. Also, for community detection, with node sparsification we would need a second step to infer the communities of the excluded nodes. In contrast, edge sparsification allows us to directly assign all nodes to communities. Another related approach is the Multiscale Backbone [SBV09], which is applicable to weighted graphs only and is therefore not included in our study. Instead of applying a global edge weight cutoff for edge filtering, which hides important structures at different scales, this approach aims at preserving them at all scales. A recent1 study [JS16] with a design similar to ours focuses in detail on algebraic distance for edge rating (see Section 7.3.6).

7.1.1 Contribution

We contribute the first systematic conceptual and experimental comparison of existing and novel edge scoring and filtering methods on a diverse set of network properties. Descriptions and literature references for the related methods which we reimplemented are given in Section 7.2; for some of them, we include descriptions of how we parallelized them. In Section 7.2 we also introduce our Local Degree sparsification method and Edge Forest Fire, an adaptation of the existing node sparsification technique to edges. Furthermore, we propose a local filtering step that has been introduced by [SPR11] for one specific sparsification technique as a generally applicable and beneficial post-processing step for preserving the connectivity of the network and most properties we consider. Our results illuminate which methods are suitable with respect to which properties of a network. We show that our Local Degree method is best for preserving connectivity and short distances, which results in a good preservation of the diameter of the network. In [Ham+16], we show that it also preserves some centrality measures and the behavior of epidemic spreading. Depending on the network, our Local Degree method can also preserve clustering coefficients. Concerning the preservation of the community structure, we show that some of the newly introduced variants with local filtering are best for

1“Recent” as of the publication of our papers [Lin+15; Ham+16].

preserving the community structure while the variants without local filtering do not preserve the community structure in our experiments. Furthermore, we have published efficient parallelized implementations and a framework for such methods as part of the NetworKit open-source tool suite [SSM16]. While our study covers various approaches from the literature, it is by no means exhaustive due to the vast number of potential sparsification techniques. With future methods in mind, we hope to contribute a framework for their implementation and evaluation.

7.2 Edge Sparsification

All edge sparsification methods we consider can be split up into two stages: (i) the calculation of a score for each of the edges in the input graph (where the score is high if the edge is important) and (ii) subsequent filtering according to such an edge score. In this section we define this framework of scoring and filtering edges more formally. Using this framework we introduce Local Filtering as an optionally applicable step. We conclude this section by addressing some limits of edge sparsification. Formally, we define an edge (centrality) score as follows:

Definition 7.1 (Edge Centrality Scores). Given a simple, undirected graph G = (V, E), an edge (centrality) score is a function c_G : E → 𝒜 which assigns to each edge e = {u, v} ∈ E an attribute value x ∈ 𝒜 of (at least) ordinal scale of measurement. The assigned value depends on the position of the edge within the network G as defined by a set of edges E′ ⊆ E. In the following we call such a value an edge score, write c(e) where the graph is implied from the context, and assume that 𝒜 = ℝ⁺.

7.2.1 Global and Local Filtering

The simplest way to filter by such an edge score is to apply a global threshold and keep all edges whose score is equal to or above the threshold. For the comparison of the different sparsification methods we need to be able to filter the network such that all methods keep an equal percentage of the edges of the network. As the methods produce scores with different ranges of values and different distributions of these values, we cannot simply define a threshold that is the same for all methods. Also, using a different threshold per method does not solve the problem, as it is possible that many edges have the same score and thus there is no threshold that leads to the desired ratio of kept edges. Therefore, we define global filtering not as applying a given threshold but as filtering the edges such that only a given ratio of the edges remains:

Definition 7.2 (Global Filtering). Given a graph G = (V, E) and an edge score c : E → ℝ⁺, a global filtering step by ratio r ∈ [0, 1] reduces the edges in the sparsified graph G′ = (V, E′) to the ⌊r · |E|⌋ edges with the highest values of c(e), i.e., E′ ⊆ E, |E′| = ⌊r · |E|⌋ and c(e′) ≥ c(e) ∀e′ ∈ E′ ∀e ∈ E \ E′.

For an implementation of global filtering, it is enough to sort all edges by the edge score in descending order and keep the first ⌊r · |E|⌋ edges.
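A sketch of global filtering as just defined, including the random tiebreaker discussed right below: edges are sorted descendingly by score, equal scores are ordered by a uniformly random secondary score, and the first ⌊r · |E|⌋ edges are kept. The edge type is a hypothetical placeholder, not taken from our implementation.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct ScoredEdge {
    uint64_t u, v;
    double score;      // c(e)
    double tiebreaker; // random secondary score for equal c(e)
};

// Keep the floor(r * |E|) edges with the highest scores.
std::vector<ScoredEdge> globalFilter(std::vector<ScoredEdge> edges, double r,
                                     std::mt19937& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    for (auto& e : edges) e.tiebreaker = unif(rng);
    std::sort(edges.begin(), edges.end(),
              [](const ScoredEdge& x, const ScoredEdge& y) {
                  if (x.score != y.score) return x.score > y.score;
                  return x.tiebreaker > y.tiebreaker; // random tie-breaking
              });
    edges.resize(static_cast<std::size_t>(std::floor(r * edges.size())));
    return edges;
}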


Figure 7.1: Drawing of the Jazz musicians collaboration network according to a variant of the Quadrilateral Simmelian Backbone with 15% of the edges (in black): (a) Quadrilateral Simmelian Backbone, (b) with UMST, (c) with local filtering.

A problem with global filtering as described above is that methods that are based on local measures, like the number of quadrangles an edge is part of, tend to assign different scores in different parts of the network, as e.g. some parts of the network are much denser than other parts. Sparsification techniques like (Quadrilateral) Simmelian Backbones [NOB14; NOB15] use different kinds of normalizations of quadrangles (see below for details) in order to compensate for such differences. Unfortunately, these normalizations still do not fully compensate for these differences. In Figure 7.1a we visualize the Jazz network [GD03] with 15% kept edges as an example. As one can see in the figure, many nodes are isolated or split into small components; the original structure of the network (shown with gray edges) is not preserved. Simmelian Backbones have been introduced for visualizing networks that are otherwise hard to lay out. For layouts, it is important to keep the connectivity of the network as otherwise nodes cannot be positioned relative to their neighbors. In order to preserve the connectivity, [NOB15] keep the union of all maximum spanning trees (UMST) in addition to the original edges. In Figure 7.1b we show the result when we keep the UMST. While the network is obviously connected, much of the local structure is lost in the areas between the dense parts – which is not surprising as we only added the union of some trees. [SPR11] face a similar problem, as with their sparsification technique based on Jaccard Similarity (see below for details) they want to preserve the community structure. They propose a different solution: each node $u$ keeps the top $\lfloor \deg(u)^\alpha \rfloor$ edges incident to $u$,

ranked according to their similarity, where $\deg(u)$ is the degree of $u$ and $\alpha \in [0, 1]$. An edge is kept when it is ranked high enough by at least one of the two nodes that are incident to it. This procedure ensures that at least one incident edge of each node is retained and thus prevents completely isolated nodes. In order to define this filtering step formally, we first introduce a formal definition of the rank of node $v$ in the neighborhood of node $u$:

Definition 7.3 (Neighborhood Rank). Given a graph $G = (V, E)$ and an edge score $c : E \to \mathbb{R}^+$, the neighborhood-rank $r_c(u, v)$ is the position of $v$ in the list of neighbors of $u$ sorted by descending score, i.e. $r_c(u, v) := |\{x \in N(u) : c(\{u, x\}) > c(\{u, v\})\}| + 1$.

Note that when two edges that are incident to $u$ have the same edge score, they also have the same rank. Again, we want to be able to keep an exact ratio of edges. In order to achieve this, we transform this local rank into a score that can be used for global filtering.

Definition 7.4 (Local Filtering Score). Given a graph $G = (V, E)$ and an edge score $c : E \to \mathbb{R}^+$, the directed local filtering score $l_c : V \times V \to [0, 1]$ is defined as
$$l_c(u, v) := \begin{cases} 1 & \text{if } \deg(u) = 1 \\ 1 - \log(r_c(u, v)) / \log(\deg(u)) & \text{otherwise.} \end{cases}$$

Then $l_c(\{u, v\}) := \max(l_c(u, v), l_c(v, u))$ is the local filtering score that belongs to $c$.

Lemma 7.5. Filtering locally according to an edge score $c$ with parameter $\alpha$ is equivalent to keeping all edges with $l_c(\{u, v\}) \geq 1 - \alpha$.

Proof. An edge $\{u, v\}$ is kept by local filtering iff $r_c(u, v) \leq \deg(u)^\alpha$ or $r_c(v, u) \leq \deg(v)^\alpha$, as the top $\deg(u)^\alpha$ (or $\deg(v)^\alpha$) edges are kept. If $\deg(u) = 1$, the only edge that is incident to $u$ is always kept as $1^\alpha = 1$. For $\deg(u) > 1$, it holds that

$$r_c(u, v) \leq \deg(u)^\alpha \Leftrightarrow \log(r_c(u, v)) \leq \alpha \cdot \log(\deg(u)) \Leftrightarrow \frac{\log(r_c(u, v))}{\log(\deg(u))} \leq \alpha \Leftrightarrow \underbrace{1 - \frac{\log(r_c(u, v))}{\log(\deg(u))}}_{= l_c(u, v)} \geq 1 - \alpha.$$

This shows the claim, as $l_c(\{u, v\})$ is exactly defined as the maximum of $l_c(u, v)$ and $l_c(v, u)$, which means that global filtering by $l_c(\{u, v\})$ will keep the edge iff one of the two directions would have been kept by local filtering.
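To make the correspondence concrete, consider a small illustrative example (the values are chosen for this text): for a node $u$ with $\deg(u) = 16$ and a neighbor $v$ of rank $r_c(u, v) = 4$, we get $l_c(u, v) = 1 - \log(4)/\log(16) = 0.5$. Filtering by $l_c(\{u, v\}) \geq 1 - \alpha$ thus keeps this edge exactly for $\alpha \geq 0.5$, which matches local filtering directly: for $\alpha = 0.5$, node $u$ keeps its top $\lfloor 16^{0.5} \rfloor = 4$ neighbors.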

In Algorithm 7.1 we show the parallel algorithm that we use to transform edge scores for local filtering. We iterate over all nodes in parallel, sort the neighbors, determine the rank for each neighbor and assign its score. As the two scores of an edge are possibly calculated at the same time, we use an atomic maximum for assigning the final score.


Input: Graph G = (V, E), edge score s : E → ℝ
Output: Edge score l : E → [0, 1]

l({u, v}) ← 0 ∀{u, v} ∈ E
foreach u ∈ V do in parallel
    r ← 0                      // rank
    k ← 1                      // number of equal scores in a row
    o ← −∞                     // previously seen score
    foreach v ∈ N(u) sorted by s(u, v) in descending order do
        if s(u, v) ≠ o then
            r ← r + k
            k ← 1
            o ← s(u, v)
        else
            k ← k + 1
        if deg(u) > 1 then
            e ← 1 − log(r) / log(deg(u))
        else
            e ← 1
        l({u, v}) ← atomic max of l({u, v}) and e

Algorithm 7.1: Transformation of global edge scores into edge scores that are equivalent to local filtering
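For readers who prefer executable code, the following is a minimal sequential Python sketch of this transformation; the naming is ours, and we assume the score dictionary is keyed by node pairs (u, v) in both directions:

import math
from collections import defaultdict

def local_filtering_scores(adj, score):
    # adj maps each node to the list of its neighbors; global filtering
    # by the returned score l is equivalent to local filtering.
    l = defaultdict(float)
    for u in adj:
        neighbors = sorted(adj[u], key=lambda v: score[(u, v)], reverse=True)
        deg = len(neighbors)
        rank, run, old = 0, 1, None
        for v in neighbors:
            if score[(u, v)] != old:   # new score value: advance the rank
                rank += run
                run = 1
                old = score[(u, v)]
            else:                      # equal scores share the same rank
                run += 1
            e = 1.0 if deg <= 1 else 1.0 - math.log(rank) / math.log(deg)
            key = (min(u, v), max(u, v))
            l[key] = max(l[key], e)    # sequential stand-in for the atomic max
    return l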

The total time complexity is $\mathcal{O}(m \cdot \log(d_{max}))$, where $d_{max}$ denotes the maximum degree in $G$, as every list of neighbors needs to be sorted. In Figure 7.1c we show the Jazz network, sparsified again to 15% of the edges with the Quadrilateral Simmelian Backbone method using local filtering. With the local filtering step, the network is almost fully connected and local structures are maintained, too. Additionally, as Satuluri et al. already observed when they applied local filtering to their Jaccard Similarity, the edges are much more evenly distributed among the different parts of the network. This means that we can still see the local structure in many parts of the network and do not only maintain very dense but disconnected parts. An explanation for this is that a node $u$ keeps at least $\lfloor \deg(u)^\alpha \rfloor$ incident edges in the sparsified graph, which means that high-degree nodes lose more incident edges than low-degree nodes but still keep more neighbors than low-degree nodes. Therefore, some properties are kept while the differences are smoothed in order to ensure that we retain structure in every part of the network. In our evaluation we confirm that many properties of the considered networks are indeed better preserved when the local filtering step is added. Furthermore, we show that the local filtering step leads to an almost perfect preservation of the connected components on all considered networks even though this is not inherent in the method. This suggests that local filtering is superior to preserving a UMST, as not only connectivity but also local structures are preserved.


As one of our contributions we therefore propose to apply this local filtering step to all sparsification methods and not only to Jaccard Similarity, for which it was originally introduced. In our evaluation we do not further consider the alternative of preserving a UMST, as preliminary experiments have shown that adding a UMST has no significant advantage over the local filtering step in terms of the preservation of network properties. With local filtering, our sparsification pipeline consists of the following stages: (i) calculation of an edge score, (ii) conversion of the edge score into a local edge score and (iii) global filtering. In the evaluation, we prefix the abbreviations of the local variants with “L”.

7.2.2 Limits of Edge Sparsification

While we show in our work that edge sparsification can be successfully applied to many social networks, we also want to mention the limits of these sparsification approaches. Let us consider a network that has twice as many edges as nodes, i.e. the average degree is four. If we remove 50% of the edges and the network is still connected, we are left with a tree plus one additional edge. This means that almost all triangles, which are characteristic of a social network, are destroyed. Also, a possibly existing community structure that is defined by being particularly dense compared to the rest of the network cannot be maintained in a tree: every community is either a tree itself or disconnected, and it is therefore not possible to discriminate between different connected communities based on density. Therefore, one cannot expect to maintain the properties of the network when the number of edges is equal to (or even less than) the number of nodes. Some of the sparsification methods we present rely on particular structures that are present in many real-world social networks. For our novel method, Local Degree, we exploit the presence of meaningful hubs that have a relatively high degree. Other methods assume the presence of triangles or quadrangles that describe the community structure of the network. While it is possible that networks do not have these structures, these structures are fundamental characteristics of social networks. Therefore, in the context of social networks, relying on these structures is no limitation. In terms of memory usage, some of the sparsification methods we present need additional memory, but at most a separate copy of the graph including edge ids and edge scores. Some of them also use additional memory per thread in the parallel computation, but then only one bit per node, the neighborhood of one node, or a queue that in the worst case contains all nodes. Therefore, memory usage can be a limitation if the considered graph is almost as large as the available memory – but then the memory needed for storing the edge scores itself might not be available either. It is possible to adapt some of these algorithms to use only a limited amount of internal memory and external memory like an HDD or an SSD instead. For example, for triangle counting, a building block of some of the sparsification methods we consider, efficient external memory algorithms exist [HTC14]. Developing sparsification algorithms for external memory or also distributed settings is therefore an interesting topic for future work in order to further push the limits of sparsification in terms of computational resources and scale them to graphs that do not fit into the RAM of a single computer anymore.


Table 7.1: Summary of the methods and the novel local variants we introduce

Method              | Novel | Novel local variant
--------------------|-------|--------------------
Random Edge         |   ◦   |   •
Jaccard Similarity  |   ◦   |   ◦
Simmelian Backbones |   ◦   |   •
Edge Forest Fire    |   •   |   •
Algebraic Distance  |   ◦   |   •
Local Degree        |   •   |   ◦

7.3 Sparsification Methods

In this section we describe the sparsification methods that we use in this work. We cover the existing sparsification methods random edge filtering, Jaccard Similarity, Simmelian Backbones and Algebraic Distance. Further, we introduce Edge Forest Fire and Local Degree as novel sparsification methods. For all methods where this had not been proposed before, we propose to use local filtering as a post-processing step. We also present parallel implementations for all methods, though some of them are only partial parallelizations. As most parallelizations are straightforward, we cannot exclude that they have already been used in other implementations. The details are described in the following sections, while Table 7.1 gives an overview of all considered methods.

7.3.1 Random Edge (RE)

When studying different sparsification algorithms, the performance of random edge selection is an important baseline. As we shall see, it also performs surprisingly well. The method selects edges uniformly at random from the original set such that the desired sparsification ratio is obtained. This is equivalent to scoring edges with values chosen uniformly at random. Naturally, this needs time linear in the number of edges and calculating the edge scores can be trivially parallelized. Further, if we know the ratio of edges to keep, random edge selection can also be performed in a streaming fashion while reading a graph. Thus, random edge sampling already scales to graphs that do not fit into the RAM of a single computer anymore.
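As a sketch (the naming is ours), the scoring stage of Random Edge is essentially a one-liner; combined with the global filtering step from Section 7.2.1 it yields uniform random edge sampling:

import random

def random_edge_scores(edges, seed=None):
    # Uniform random scores; global filtering by these scores keeps a
    # uniformly random subset of the edges.
    rng = random.Random(seed)
    return {e: rng.random() for e in edges}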

7.3.2 Triangles

Especially in social networks, triangles play an important role because the presence of a triangle indicates a certain quality of the relationship between the three involved nodes. The sociological theory of Simmel [SW50] states that “triads (sets of three actors) are fundamentally different from dyads (sets of two actors) by way of introducing mediating effects.” In a friendship network, it is likely for two actors with a high number of common friends to be friends as well. Filtering globally by triangle counts tends to destroy local structures, but several of the following sparsification methods are based on the triangles

edge score $T(u, v)$, which denotes for an edge $\{u, v\}$ the number of triangles it belongs to. The time needed for counting all triangles is $\mathcal{O}(m \cdot a)$ [OB14], where $a$ is the graph's arboricity [CN85].

Parallelization. We use a novel parallelized variant of the algorithm introduced by [OB14]. This variant differs from the parallel variant introduced in [ST15], as we avoid the additional overhead of sorting operations or atomic operations for storing local counters that they need. Algorithm 7.2 contains the pseudo-code of our algorithm. The algorithm needs a node ordering. While $N(u)$ denotes all neighbors of $u$, $N^+(u)$ denotes the neighbors of $u$ that are higher in the ordering. With a smallest-first ordering that is obtained by iteratively removing nodes of minimum degree, $|N^+(u)|$ is bounded by $2a - 1$ [ZN99], which directly gives the running time of $\mathcal{O}(m \cdot a)$. Simply ordering the nodes by degree is actually faster in practice, though, as noticed by [OB14]. Therefore, we use such a simple degree ordering. For this, we can still show the running time of $\mathcal{O}(m \cdot a)$. Lin, Soulignac and Szwarcfiter show [LSS12, Corollary 5] that
$$\sum_{v \in V} \sum_{w \in N(v)} h(w) \leq 8 \cdot a \cdot m$$
where $h(w)$ is defined as the number of neighbors of $w$ with a degree equal to or higher than that of $w$. Thus, $|N^+(u)| \leq h(u)$ and the total running time is $\mathcal{O}(m \cdot a)$ as claimed, even for a simple degree ordering. In contrast to [OB14], we count each triangle three times, which does not increase the asymptotic running time. In each iteration step of the outer loop we encounter each triangle $u$ is part of exactly once. Therefore, it is enough to count the triangle for the edges that are incident to $u$ and where $u$ has the higher node id. This avoids multiple accesses to the same edge by several threads; we therefore do not need any locks or atomic operations. In the same way we could also update triangle counters per node, e.g. for computing clustering coefficients, without additional work and without using locks or atomic operations. Note that node markers are thread-local.

foreach u ∈ V do in parallel
    Mark all v ∈ N(u)
    foreach v ∈ N(u) do
        foreach w ∈ N⁺(v) do
            if w is marked then
                Count triangle u, v, w
    Un-mark all v ∈ N(u)

Algorithm 7.2: Parallel triangle counting
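The following sequential Python sketch mirrors Algorithm 7.2 (assuming adj maps each node to a set of neighbors and using a degree ordering with node ids as tiebreaker); the parallel C++ version additionally uses thread-local markers:

from collections import defaultdict

def triangle_counts(adj):
    # Order nodes by (degree, id); N+(v) are neighbors later in this order.
    pos = {u: (len(adj[u]), u) for u in adj}
    T = defaultdict(int)
    for u in adj:
        marked = adj[u]  # "mark" all neighbors of u
        for v in adj[u]:
            for w in adj[v]:
                if pos[w] > pos[v] and w in marked:
                    # Triangle {u, v, w}: count it only for the incident
                    # edges on which u is the endpoint with the higher id,
                    # so each edge of the triangle is counted exactly once
                    # over the three iterations of the outer loop.
                    if u > v:
                        T[(v, u)] += 1
                    if u > w:
                        T[(w, u)] += 1
    return T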


7.3.3 (Local) Jaccard Similarity (JS, LJS)

[SPR11] propose a local graph sparsification method with the intention of speedup and quality improvement of community detection. They suggest reducing the edge set to 10–20% of the original graph and use the Jaccard measure to quantify the overlap between the node neighborhoods $N(u)$, $N(v)$ (i.e. the sets of nodes adjacent to $u$ or $v$) and thereby the (Jaccard) similarity of two given nodes:

$$JS(u, v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|} = \frac{T(u, v)}{\deg(u) + \deg(v) - T(u, v)}$$

The time needed for calculating the Jaccard Similarity is the time for counting all triangles. The authors also propose a fast approximation which runs in time $\mathcal{O}(m)$. For this Jaccard Similarity, [SPR11] propose the local filtering technique that we have already explained above and that we denote by LJS. The time needed for calculating this local edge score is the time for calculating the Jaccard Similarity plus the time for sorting the neighbors of all nodes, which can be done in $\mathcal{O}(m \cdot \log(d_{max}))$. This sparsification technique has also been adapted for accelerating collective classification, i.e. the task of inferring the labels of all nodes in a graph given a subset of labeled nodes [SRD13].
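Given the per-edge triangle counts from above, the Jaccard Similarity scores follow directly; this is a sketch with our own naming, where edges are assumed to be stored as ordered pairs (u, v) with u < v:

def jaccard_similarity_scores(adj, edges, T):
    # JS(u, v) = T(u, v) / (deg(u) + deg(v) - T(u, v)); the denominator
    # is always positive for edges of a simple graph.
    return {
        (u, v): T.get((u, v), 0) / (len(adj[u]) + len(adj[v]) - T.get((u, v), 0))
        for (u, v) in edges
    }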

Parallelization. With our parallel triangle counting variant and pre-calculated node degrees, the edge scores for Jaccard Similarity can be calculated in parallel. We are not aware of any prior work that calculated the Jaccard Similarity in parallel. As shown in Algorithm 7.1, the local filtering can be parallelized as well. As we will show in our experimental evaluation, the running times achieved by our parallel implementation are very good and not far from those of random edge filtering.

7.3.4 Simmelian Backbones (TS, QLS)

The Simmelian Backbones introduced by [Nic+13] aim at discriminating between edges that are placed within dense subgraphs and those between them. The original goal of these methods was to produce readable layouts of networks. To achieve a “local assessment of the level of actor neighborhoods” [Nic+13], the authors propose the following approach, which we adapt to our concept of edge scores. Given an edge scoring method $S$ and a node $u$, they introduce the notion of a rank-ordered neighborhood as the list of adjacent neighbors sorted by $S(u, \cdot)$ in descending order. The original (Triadic) Simmelian Backbone uses triangle counts $T$ for $S$. The newer Quadrilateral Simmelian Backbone by [NOB15] uses quadrilateral edge embeddedness, which they define as

$$Q(u, v) = \frac{q(u, v)}{\sqrt{q(u) \cdot q(v)}}$$

with $q(u, v)$ being the number of quadrangles containing edge $\{u, v\}$ and $q(u)$ being the sum of $q(u, v)$ over all neighbors $v$ of $u$. They argue that this modified version performs even better at discriminating edges within and between dense subgraphs.
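For illustration, here is a naive Python sketch of these quantities; it enumerates 4-cycles u–v–w–x–u directly and is far slower than the quadrangle counting algorithm of [CN85] used in the actual implementation (adj maps nodes to neighbor sets):

import math
from collections import defaultdict

def quadrilateral_embeddedness(adj):
    # q_edge[(u, v)] = number of 4-cycles through edge {u, v} with u < v.
    q_edge = defaultdict(int)
    for u in adj:
        for v in adj[u]:
            if u < v:
                count = 0
                for x in adj[u]:
                    if x == v:
                        continue
                    for w in adj[v]:
                        if w != u and w != x and w in adj[x]:
                            count += 1
                q_edge[(u, v)] = count
    # q_node[u] = sum of q_edge over all edges incident to u.
    q_node = defaultdict(int)
    for (u, v), c in q_edge.items():
        q_node[u] += c
        q_node[v] += c
    return {
        e: (c / math.sqrt(q_node[e[0]] * q_node[e[1]]) if c > 0 else 0.0)
        for e, c in q_edge.items()
    }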


On top of the rank-ordered neighborhood graph that is induced by the ranked neighborhoods of all nodes, Nick et al. [Nic+13] introduce two filtering techniques, a parametric one and a non-parametric one. Like [NOB15], we use only the non-parametric variant. By TS we denote the Triadic Simmelian Backbone and by QLS the Quadrilateral Simmelian Backbone. The non-parametric variant uses the Jaccard measure similar to Local Similarity but, instead of considering the whole neighborhood, it uses the maximum of the Jaccard measure of the top-$k$ neighborhoods over all possible values of $k$. While the time needed for quadrangle counting is equal to the time for triangle counting [CN85], the overlap and Jaccard measure calculation of prefixes needs time $\mathcal{O}(m \cdot d_{max})$ as it needs to be calculated separately for all edges.

Parallelization. Our implementation of the Triadic Simmelian Backbones is fully parallelized: we use our parallel triangle counting implementation and then sort all neighborhoods in parallel. Using this information we can compute the Triadic Simmelian Backbone scores in parallel for all edges. For the Quadrilateral Simmelian Backbones, the quadrangle counting step is sequential, as we are not aware of an efficient, parallelized quadrangle counting algorithm on the edge level, but the remaining part of the algorithm is parallelized as in the triadic case.

7.3.5 Edge Forest Fire (EFF)

The original Forest Fire node sampling algorithm [LF06] is based on the idea that nodes are “burned” during a fire that starts at a random node and may spread to the neighbors of a burning node. Contrary to random walks, a fire can spread to more than one neighbor, but already burned neighbors cannot be burned again by the same fire. Each fire thus corresponds to a tree. The basic intuition is that nodes and edges that get visited by more of these walks than others are more important. In order to filter edges instead of nodes, we introduce a variant of the algorithm in which we use the frequency of visits of each edge as a proxy for its relevance. Algorithm 7.3 shows the details of the algorithm we use to compute the edge score. The fire starts at a random node which is added to a queue. The fire always continues at the next node v extracted from the queue and spreads to neighboring unburned nodes until either all neighbors have been burned or a random probability we draw is above a given burning probability threshold p. The number of burned neighbors thus follows a geometric distribution with mean $p/(1-p)$. When the queue is empty, we reset the burned markers and start a new fire at a random node by adding it to the queue. As the total length of all walks is hard to estimate in advance, we cannot give a tight bound for the running time.

Parallelization. We use a very simple parallelization for our Edge Forest Fire algorithm. We burn several fires in parallel with separate burn markers per thread and atomic updates of the burn frequency. In order to avoid too frequent updates, we update the global counter for the number of edges burned only after a fire has stopped burning, before we start the next fire.


Input: targetBurnRatio ∈ ℝ, p ∈ [0, 1)

edgesBurned ← 0
while edgesBurned < m · targetBurnRatio do
    Add random node to queue
    while queue not empty do
        v ← node from queue
        while true do
            q ← random element from [0, 1)
            if q > p or v has no un-burned neighbors then
                break
            x ← random un-burned neighbor of v
            Mark x as burned
            Add x to queue
            Increase edgesBurned
            Increase burn counter of {v, x}
    Reset burned markers

Algorithm 7.3: Edge Forest Fire

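A sequential Python transcription of Algorithm 7.3 might look as follows; note that the default value for targetBurnRatio is only a placeholder of ours, not a recommendation:

import random
from collections import defaultdict, deque

def edge_forest_fire_scores(adj, m, p=0.5, target_burn_ratio=5.0, seed=None):
    # adj maps nodes to neighbor sets, m is the number of edges, p is the
    # burning probability. The burn counter per edge serves as edge score.
    rng = random.Random(seed)
    nodes = list(adj)
    burn = defaultdict(int)
    edges_burned = 0
    while edges_burned < m * target_burn_ratio:
        burned = set()               # reset burned markers for each fire
        queue = deque([rng.choice(nodes)])
        while queue:
            v = queue.popleft()
            while True:
                candidates = [x for x in adj[v] if x not in burned]
                if rng.random() > p or not candidates:
                    break            # this node stops spreading the fire
                x = rng.choice(candidates)
                burned.add(x)
                queue.append(x)
                edges_burned += 1
                burn[(min(v, x), max(v, x))] += 1
    return burn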

7.3.6 Algebraic Distance (AD)

Algebraic distance [CS11] ($\alpha$) is a method for quantifying the structural distance of two nodes $u$ and $v$ in a graph. Its essential property is that $\alpha(u, v)$ decreases with the number of paths connecting $u$ and $v$ as well as with decreasing lengths of those paths. Algebraic distance therefore measures the distance of nodes by taking into account more possible paths than e.g. shortest-path distance, and it is wider in scope than e.g. the Jaccard coefficient of two nodes' immediate neighborhoods. Nodes that are connected by many short paths have a low algebraic distance. It follows that nodes within the same dense subgraph of the network are close in terms of $\alpha$. Algebraic distance can be described in terms of random walks on graphs and, roughly speaking, $\alpha(u, v)$ is low if a random walk starting at $u$ has a high probability of reaching $v$ after few steps. In a straightforward way, algebraic distance can be used to quantify the “range” of edges, with short-range edges (low $\alpha(u, v)$ for an edge $\{u, v\}$) connecting nodes within the same dense subgraph, and long-range edges (high $\alpha(u, v)$ for an edge $\{u, v\}$) forming bridges between separate regions of the graph. Hence, $\alpha$ restricted to the set of connected node pairs is an edge score in our terms and can be used to filter out long- or short-range edges. We use $1 - \alpha(u, v)$ as edge score in order to treat short-range edges as important. $\alpha$ is computed by performing iterative local updates on $d$-dimensional “coordinates” of a node. The coordinates are initialized with random values. Then, in each iteration, the coordinates are set to a weighted average of the old coordinates and the average of


the old coordinates of all neighbors. The algebraic distance is then any distance between the two coordinate vectors; we choose the $\ell_2$-norm. As described in [CS11], $d$ can be set to a small constant (e.g. 10) and the distances stabilize after tens of iterations of $\mathcal{O}(m)$ running time each. We choose 20 systems and 20 iterations.

Figure 7.2: Drawing of the Jazz musicians collaboration network and the Local Degree sparsified version containing 15% of edges. Node size proportional to degree.
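A compact Python sketch of this computation; the damping factor omega = 0.5 is our assumption (the text above only fixes d = 20 systems and 20 iterations), and the resulting distances would still have to be normalized before using 1 − α(u, v) as edge score:

import math
import random

def algebraic_distances(adj, edges, d=20, iterations=20, omega=0.5, seed=None):
    rng = random.Random(seed)
    # One random initial coordinate per node and system.
    coords = {u: [rng.random() for _ in range(d)] for u in adj}
    for _ in range(iterations):
        new = {}
        for u in adj:
            if not adj[u]:            # isolated node: nothing to average
                new[u] = coords[u]
                continue
            avg = [sum(coords[v][i] for v in adj[u]) / len(adj[u]) for i in range(d)]
            new[u] = [omega * coords[u][i] + (1 - omega) * avg[i] for i in range(d)]
        coords = new
    # l2-distance between the endpoint coordinates of every edge.
    return {
        (u, v): math.sqrt(sum((coords[u][i] - coords[v][i]) ** 2 for i in range(d)))
        for (u, v) in edges
    }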

Parallelization. Updates of node coordinates can be easily executed in parallel for all nodes as the new values only depend on the values of the previous round.

7.3.7 Local Degree (LD)

Inspired by the notion of hub nodes, i.e. nodes with locally relatively high degree, and by local sparsification, we propose the following new sparsification method: for each node $v \in V$, we include the edges to the top $\lfloor \deg(v)^\alpha \rfloor$ neighbors, sorted by degree in descending order. Similar to the local filtering step explained above, we again use $1 - \alpha$ as edge score, where $\alpha$ is the minimum parameter such that the edge is still contained in the sparsified graph. The goal of this approach is to keep those edges in the sparsified graph that lead to nodes with high degree, i.e. the hubs that are crucial for a complex network's topology. The edges left after filtering form what can be considered a “hub backbone” of the network. In Figure 7.2 we visualize the Jazz network [GD03] as an example. As only the neighbors of each node need to be sorted, this can be done in $\mathcal{O}(m \cdot \log(d_{max}))$. Using linear-time sorting it is even possible in $\mathcal{O}(m)$ time. We have decided against the linear-time variant and instead parallelized the calculation:

Parallelization. The parallelization of the Local Degree score calculation is very similar to the parallelization of local filtering. We can handle all nodes in parallel. For each node, we independently sort the neighbors and assign the appropriate scores. Again, we need to use an atomic maximum calculation in order to make sure that parallel updates of edge scores are handled correctly.
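A sequential sketch of the Local Degree score (our naming; for brevity, ties between neighbors of equal degree are broken arbitrarily here instead of sharing ranks as in Algorithm 7.1):

import math
from collections import defaultdict

def local_degree_scores(adj):
    # For each node u, rank the neighbors by degree in descending order;
    # a neighbor of rank r receives 1 - log(r)/log(deg(u)), and each edge
    # keeps the maximum of the scores assigned by its two endpoints.
    score = defaultdict(float)
    for u in adj:
        deg_u = len(adj[u])
        neighbors = sorted(adj[u], key=lambda v: len(adj[v]), reverse=True)
        for r, v in enumerate(neighbors, start=1):
            e = 1.0 if deg_u <= 1 else 1.0 - math.log(r) / math.log(deg_u)
            key = (min(u, v), max(u, v))
            score[key] = max(score[key], e)
    return score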


7.4 Implementation

For this study, we have created efficient C++ implementations of all considered sparsification methods and have accelerated them using OpenMP parallelization. All methods, with the exception of the inherently sequential quadrangle counting algorithm [CN85], have been parallelized. We have implemented the algorithms in NetworKit [SSM16]; they are available as part of NetworKit at https://networkit.github.io. Gephi [BHJ09] is a graph visualization tool which we use not only for visualization purposes but also for interactive exploration of sparsified graphs. To achieve said interactivity, we implemented a client for the Gephi Streaming Plugin in NetworKit. It is designed to stream graph objects from and to Gephi utilizing the JSON format. Using our implementation in NetworKit, a few lines of Python code suffice to sparsify a graph, calculate various network properties, and export it to Gephi for drawing. The approach of separating sparsification into edge score calculation and filtering allows for a high level of interactivity by exporting edge scores from NetworKit to Gephi and dynamic filtering within Gephi. For the drawings of the Simmelian Backbones we use visone2.
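The intended workflow then looks roughly as follows; note that the exact NetworKit class and method names in this sketch are assumptions on our part and may differ between NetworKit versions:

import networkit as nk

# Read a graph (file name and format are placeholders).
G = nk.readGraph("jazz.graph", nk.Format.METIS)

# Sparsify to 15% of the edges with the Local Degree method.
sparsifier = nk.sparsification.LocalDegreeSparsifier()
G15 = sparsifier.getSparsifiedGraphOfSize(G, 0.15)

print(G.numberOfEdges(), G15.numberOfEdges())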

7.5 Experimental Study

Our experimental study consists of two parts. In the first part (Section 7.5.2) we compare correlations between the calculated edge scores on a set of networks. In the second part (Section 7.5.3) we compare how similar the sparsified networks are to the original network by comparing certain properties of the networks.

7.5.1 Setup

Our experiments have been performed on a compute server with 4 physical Intel Core i7 cores at 3.4 GHz, 8 threads, and 32 GB of memory. For this explorative study, we use the collection of 100 Facebook social friendship networks [TMP12] introduced in Section 2.2.3. They have between 10k and 1.6 million edges; the number of nodes and edges for each network is shown in Figure 7.3. Unless otherwise noted, we aggregate experimental results over this set of networks. The common origin and the high structural similarity among the networks allow us to get meaningful aggregated values. For the experiments on the preservation of properties we also use the Twitter and Google+ networks [ML12] and the LiveJournal (com-lj) network from [YL12b]. All of them are friendship networks; the Twitter and Google+ networks consist of the combined ego networks of 973 and 132 users, respectively. In Table 7.2 we provide the number of nodes and edges as well as the diameter and clustering coefficient averaged over all Facebook networks, and the individual values for the three other networks. Furthermore, we provide the number of edges divided by the number of nodes, which indicates how much redundancy there is in the network. As we already mentioned in Section 7.2.2, it is not realistic to expect that we can preserve the structure of the network (apart from the connectivity) if too few edges remain.

2 http://visone.info


Figure 7.3: Number of nodes and edges of the Facebook networks

The networks we selected have a varying degree of redundancy, but all of them are dense enough such that even if we remove 75% of the edges, more than twice as many edges as nodes remain. The characteristics of the Facebook networks are relatively similar. We plot the distributions of m/n, the clustering coefficient, and the diameter in Figures 7.4a, 7.4b and 7.4c. For the evaluation of the preservation of the community structure we also use networks with known ground truth communities. For this purpose we use the synthetic LFR benchmark [LFR08], see Section 2.1.1 for an introduction. It remains an open question to what extent results can be translated to other types of complex networks, since experience shows that the performance of network analysis algorithms depends strongly on the network structure.

7.5.2 Correlations between Edge Scores

Among our sparsification methods, some are more similar to others in the sense that they tend to preserve similar edges. Such similarities can be clarified by studying correlations between edge scores. We calculate edge score correlations for the set of 100 Facebook networks as follows: for each single network, edge scores are calculated with the various scoring methods and Spearman's rank correlation coefficient is applied. The coefficient is then averaged over all networks and plotted in the correlation matrix (Figure 7.5). There is one column for each method, and the column Mod represents edge scores that are 1 for intra-community edges and 0 for inter-community edges after running a modularity-maximizing Louvain community detection algorithm [Blo+08]; see Section 1.4.1 for an introduction to the Louvain algorithm. Positive correlations with these scores indicate that the respective rating method assigns high scores to edges within modularity-based communities. The column Tri simply represents the number of triangles an edge is part of. As some of the methods are normalizations of this score, this shows how similar the ranking still is to the original score. Interpretation of the results is challenging: the correlations we observe reflect intrinsic, mathematical similarities of the rating algorithms on the one hand, but on the other hand they are also caused by the structure of this specific set of social networks (e.g., it may be a characteristic of a given network that edges leading to high-degree nodes are also embedded in many triangles). Nonetheless, we note the following observations.

(a) Edge ratio   (b) Average local clustering coefficient   (c) Diameter

Figure 7.4: Distribution of properties of the Facebook networks. The vertical ticks at the bottom indicate the individual data points, except for the diameter, where a bar exists for all unique values.

Table 7.2: Number of nodes n, number of edges m, m/n, diameter D and average local clustering coefficient Cc of the used social networks (average values for the Facebook networks)

Network  | n         | m          | m/n   | D   | Cc
---------|-----------|------------|-------|-----|-----
Facebook | 12 083    | 469 845    | 38.4  | 7.8 | 0.25
com-lj   | 3 997 962 | 34 681 189 | 8.7   | 21  | 0.35
gplus    | 107 614   | 12 238 285 | 113.7 | 6   | 0.52
twitter  | 81 306    | 1 342 296  | 16.5  | 7   | 0.60


[Correlation matrix over the methods Mod, AD, LAD, JS, LJS, TS, LTS, QLS, LQLS, Tri, LD, EFF, LEFF, RE and LRE; positive entries are marked with + and negative entries with −.]

Figure 7.5: Edge score correlations (Spearman’s ρ, average over 100 Facebook networks)

There are several groups of methods. Simmelian Backbones, Jaccard Similarity and Triangles are highly positively correlated, which is not unexpected, as they are all based on triangles or quadrangles and are intended to preserve dense subgraphs. Algebraic distance is still positively correlated with these methods but not as strongly, even though it is also intended to prefer dense subgraphs. An explanation for this weaker correlation is that while both prefer dense regions, the order of the individual edges is different. All of the previously mentioned methods are also correlated with the modularity value; algebraic distance with local filtering has the highest score among all of these methods. Our experiments on the preservation of community structure (Section 7.5.3) confirm this relationship. The correlations of the modularity value with these methods are similar to the correlation between algebraic distance and the rest of the methods, which shows again that the lower correlation values are probably due to different orderings of the individual edges. Our new method Local Degree is slightly negatively correlated with all these methods but still positively correlated with the Triangles. It is also slightly negatively correlated with the modularity value; this is due to the method's preference for inter-cluster edges, which is also confirmed by our experiments below. The newly introduced Edge Forest Fire is also negatively correlated with Local Degree and even more negatively with Triangles. As expected, random edge filtering is not correlated with any other method. It is interesting to see that each method is also relatively strongly correlated with its local variant, apart from random edge filtering (we use different random values as basis of the local filtering process). Even the Edge Forest Fire method, which should also be relatively random, has a positive correlation with its local variant. This shows that it prefers a certain kind of edge and that this preference is kept when applying the local filtering. Among the variants of Simmelian Backbones and Jaccard Similarity, the local variants are more correlated to other local variants than to the non-local variants and also not as strongly correlated to Triangles. This shows that the local filtering indeed adds another level of normalization. Also, Jaccard Similarity seems to be more correlated to the Quadrilateral Simmelian Backbones than to the variant based on triangles, even though Jaccard Similarity is itself based on triangles. This is also interesting, as the Quadrilateral Simmelian Backbones are computationally more expensive than the Jaccard Similarity.

7.5.3 Similarity in Network Properties

Quantifying the similarity between a network and its sparsified version is an intricate problem. Ideally, a similarity measure should meet the following requirements:

1. Ignoring trivial differences: Consider, for example, the degree distribution: one cannot expect the distribution to remain identical after edges get removed during sparsification. It is clear, however, that the general shape of the distribution should remain “similar” and that high-degree nodes should remain high-degree nodes in order to consider the degrees as preserved.

2. Intuitive and normalized: Similarity values from a closed domain like [0, 1] allow for aggregation and comparability. A similarity value of 1 indicates that the property


under consideration is fully preserved, whereas a value of 0 indicates that similarity is entirely lost. In some cases we also use relative changes in the interval $[-1, 1]$, where 0 means unchanged, as they provide a more detailed view of the changes.

3. Revealing Method Behavior: A good similarity measure will clearly expose different behavior between sparsification methods.

4. Efficiently computable.

Following these requirements, we select measures that quantify relative changes of global properties like the diameter, the size of the largest connected component and the quality of a community structure. Further, we analyze the deviation of the average local clustering coefficient from the original value. In our papers [Lin+15; Ham+16], we further consider the ranking of nodes with respect to node centrality scores. As it is not obvious how differences in the ranking of nodes should affect the community structure, we exclude them here. In the following plots, the measures are shown on the y-axis for a given ratio of kept edges (m′/m) on the x-axis (e.g., a ratio of 0.2 means that 20% of the edges are still present). For each measure there are two rows of plots. The first contains averages over the 100 Facebook networks with error bars that indicate the standard deviation. The second row contains the values at 20%, 50% and 80% remaining edges for the three other networks. In each row, we show two plots: the left plot with the non-local methods and the right plot with the methods that use local filtering.

Connected Components. All nodes in a connected component are reachable from each other. A typical pattern in real-world complex networks is the emergence of a giant connected component which contains most of the nodes and is usually accompanied by some small components. As all of our networks have such a giant component that comprises most nodes, we track its change by dividing the size of the largest component in the sparsified network by the size of the largest component in the original network. As shown in Figure 7.6, out of the non-local methods Edge Forest Fire best preserves the connected components. Random edge deletion leads to a slow decrease in the size of the largest component, while Simmelian Backbones, Jaccard Similarity and algebraic distance lead to a separation very quickly. Below 20% of retained edges, the size of the largest component on the Facebook networks drops very quickly; here the networks seem to be decomposed into multiple smaller parts. On the other networks, this drop occurs at different ratios of kept edges, which reflects their different densities and probably also their different structures. Local filtering is able to maintain the connectivity. On the Facebook networks, all methods keep the largest component almost fully connected down to 20% of retained edges; only below that, small differences are visible. The results on the LiveJournal, Twitter and Google+ networks show that – as expected – with increasing density it is easier to preserve the connectivity of the network. Our Local Degree method best preserves the connected components of all networks, closely followed by the local variant of random edge deletion and Edge Forest Fire.

142 7.5 Experimental Study

Figure 7.6: The size of the largest component in the sparsified network divided by the size of the largest component in the original network, as a function of the ratio of kept edges (averages over the Facebook networks; values at 20%, 50% and 80% remaining edges for com-lj, twitter and gplus).


For community detection, splitting the graph into smaller connected components might or might not be good. On the one hand, it could be easier to detect communities if they are clearly separated, i.e., if different communities are also different connected components. On the other hand, if a community is separated into several connected components, it cannot be detected as a whole anymore. Examples as shown in Figure 7.1a suggest that we get far too many connected components, which most likely split communities. In the following experiments, we examine this in more detail.

Diameter. The diameter of a graph is the maximum length of a shortest path between any two nodes [New10]. The diameter of social networks is often surprisingly small; this is also known as the small-world phenomenon [WS98]. In the case of disconnected graphs, we consider the largest diameter of all connected components. In order to observe how the network diameter changes through sparsification, we plot the quotient of the original network diameter and the resulting diameter, which yields legible results since in practice the diameter mostly increases during sparsification. We compute the exact diameters using a variation of the ExactSumSweep algorithm [Bor+15] that we contributed to NetworKit. We motivate the Local Degree method with the idea that shortest paths commonly run through hub nodes in social networks. Therefore, preserving edges leading to high-degree nodes should preserve the small diameter. This is confirmed by our experiments (Figure 7.7a). In contrast, methods that prefer edges within dense regions clearly do not preserve the diameter. With Simmelian Backbones the diameter drops when only few edges are left; this can be explained by the fact that Simmelian Backbones do not maintain the connectivity and that in the end the graph is decomposed into multiple connected components which have a smaller diameter. Algebraic distance is even more extreme in this respect. Local filtering leads to a slightly better preservation of the diameter when applied to the other methods, but algebraic distance remains the worst method in this regard. Note that the LiveJournal network has a higher diameter than the other networks (see Table 7.2); this might explain why the diameter is better preserved there. With respect to communities, not preserving the diameter might actually be an advantage. In particular, edges connecting otherwise disconnected regions to hub nodes are likely edges between communities. Thus, it is good to see that local filtering does not preserve too many of these edges.

Clustering Coefficient. Clustering coefficients are key figures for the amount of transitivity in networks. The local clustering coefficient expresses how many of the possible connections between neighbors of a node exist [New10]; see Section 1.3.1 for an introduction. Figure 7.7b shows the deviation of the average local clustering coefficient from the value of the original network. Both for local and non-local methods we observe three classes of methods on the Facebook networks: methods that clearly decrease the clustering coefficient, methods that preserve it and methods that increase it. For both Random Edge and Edge Forest Fire, which are based on randomness, the clustering coefficient drops almost linearly with decreasing sparsification ratio. This can also be observed on the other three networks.


(a) Original network diameter divided by network diameter

(b) Deviation from original clustering coefficient

Figure 7.7: Preservation of global network properties

The additional local filtering step does not significantly change this. Simmelian Backbones and Jaccard Similarity keep mostly edges within dense regions, which results in increasing clustering coefficients on all networks. Triadic Simmelian Backbones show the weakest increase, on the Twitter network even a decrease of the clustering coefficients. Note that with 0.52 and 0.6 the clustering coefficients are already relatively high on the Google+ and Twitter networks; therefore the very small increase is not surprising. Local filtering slightly weakens this effect on the Facebook networks; on the other networks it is even reversed. Given the high clustering coefficients in the original networks, this is not very surprising, as we would need to retain very dense areas while local filtering leads to a more balanced distribution of the edges. From the previous results, especially concerning the connected components, one would expect that algebraic distance also increases the clustering coefficients. Interestingly though, filtering using algebraic distance leads to only a slight increase of the clustering coefficient on the Facebook networks, constant clustering coefficients on the LiveJournal network and even slightly decreasing clustering coefficients on the Twitter and Google+ networks. With the additional local filtering step, algebraic distance almost preserves the clustering coefficients on the Facebook networks, while on the other networks it is slightly decreased. Algebraic distance adds random noise to the individual edge scores; therefore it probably leads to a more random selection of edges that destroys more triangles than the selection of Simmelian Backbones and Jaccard Similarity. Our Local Degree method best preserves the clustering coefficient on the Facebook networks, though with some differences between the various networks in the dataset (note the error bars). On the LiveJournal network it leads to a decrease of the clustering coefficient, while on the Twitter and Google+ networks it leads to a slight increase. This is probably due to the special structure of ego networks. Increasing or constant local clustering coefficients indicate that dense local areas are preserved. While this should be beneficial for community detection, it is not clear whether this might not focus too much on very local structures and ignore the broader scope of a community.

Community Structure. For community detection, we use the implementation of the Louvain method with refinement that is part of NetworKit [SM16]; see Section 1.4.1 for an introduction. It detects communities of reasonable quality while being fast enough for the vast number of networks that we get due to the different sparsification methods and ratios of kept edges. In order to understand how the community structure of the networks is maintained, we consider for each network a fixed community structure that has been found by the Louvain method on the original network. We report two crucial properties of this community structure for each level of sparsification. The first is the conductance; as introduced in Section 1.3.2, it is a formalization of the fuzzy concept of internally dense and externally sparse communities. Low conductance values indicate clearly separable communities. We consider the average conductance of all communities. Second, we expect that communities are connected. In order to measure this, we introduce the fraction of

the nodes in a community that do not belong to the largest connected component of the community as partition fragmentation. We report the average fragmentation over all communities. Intuitively, this score tells us which fraction of a community on average cannot be detected on the sparsified graph due to not being connected to the remaining part. We plot the relative inter-cluster conductance change in Figure 7.8a. A value of 0 means that the conductance stays the same, a value of $-1$ indicates that the conductance became 0 (i.e. a decrease by 100%) and a value of 1 indicates that the conductance has been doubled (i.e. an increase of 100%). We can again see that there are three categories of algorithms: the first group, consisting of random edge sampling, preserves the conductance values on most networks. The second group contains only Local Degree and increases the conductance. Edge Forest Fire has no clear behavior: on the LiveJournal network it increases the conductance, while on Twitter and Google+ it rather decreases it. On the Facebook networks, Edge Forest Fire without local filtering preserves the conductance values, while with local filtering the conductance values are slightly increased. The third group, consisting of Jaccard Similarity, Simmelian Backbones and algebraic distance, strongly decreases the conductance. With the additional local filtering step the decrease in conductance is not as strong but still very significant. That the Local Degree method keeps inter-community edges, which also explains why it preserves the connectivity so well, can be explained as follows: consider a hub node $x$ within a community whose neighbors are for the most part also connected to a hub node $y$ with higher degree than $x$. Due to the way Local Degree scores edges, $x$ will lose many of its connections within the community and may be pulled into the community of a neighboring high-degree node $z$ that is not part of the original community of $x$. Jaccard Similarity, Simmelian Backbones and algebraic distance, on the other hand, focus – by design – on intra-community edges. Random edge sampling and Edge Forest Fire filter both types of edges almost equally, which is not surprising given their random nature. Depending on the network, Edge Forest Fire shows different behavior; this indicates that these networks have a different structure. In Figure 7.8b it becomes obvious that only local filtering allows methods to keep the intra-cluster connectivity up to very sparse graphs. On the Facebook networks, Simmelian Backbones and Jaccard Similarity without local filtering are actually the worst in this respect; they do not keep the connectivity even though they prefer intra-cluster edges, as we have seen before. On the other networks, algebraic distance is even more extreme in this regard. Random edge sampling and, on the non-Facebook networks, Edge Forest Fire are the only non-local methods where a slow increase of the fragmentation can be observed; all other methods lead to a steep increase of the fragmentation during the first 10% of edges that are removed. Given these observations, we expect that we should be able to find a community structure on the sparsified graph that is very similar to our reference community structure, at least if we use Simmelian Backbones, Jaccard Similarity or algebraic distance with local filtering.
In Figure 7.9a we compare the community structure that is found by the Louvain method on the sparsified network to the one found on the original network. For


(a) Relative conductance change of a fixed partition

(b) Average partition fragmentation

Figure 7.8: Preservation of community structure properties


(a) Adjusted rand measure between the partition into communities on the original network and on the sparsified network

Figure 7.9: Preservation of community structure

this comparison we use the adjusted rand index [HA85], as introduced in Section 2.3.2. Note that the frequently used normalized mutual information (NMI) measure reports higher similarity values when more communities are detected (see e.g. [VEB09]). This makes it unsuitable for comparing partitions on sparsified networks, as we have to expect many small communities when a lot of edges are removed. As [VEB09] also show in their experiments, the adjusted rand index does not have this problem, as it has an expected value of 0 for random partitions. As a first observation we need to note that even when all edges are still in the network (at the right boundary of the plot for the Facebook networks), the community structure found is already different. The Louvain method is randomized; therefore it is not unlikely that every found community structure is different, as we already observed in the experiments for our distributed community detection algorithms in Section 6.5.3. The amount of difference between the community structures even without filtering edges indicates that there is not a single, well-defined community structure in these graphs but many different ones. Preliminary tests with the (slightly slower) Infomap community detection algorithm [RAB09], which has an excellent performance on synthetic benchmark graphs for community detection [LF09b], showed a very similar variance, which indicates that this is not due to a weakness of the Louvain algorithm. Filtering the edges such that we can measure that the conductance of one of these many community structures is decreased most probably does not just make this structure clearer but also leads the algorithm into finding different community structures. Therefore, most sparsification methods lead to significantly different community structures. It is possible that some of the sparsification methods, especially the local variants of the Simmelian Backbones, Jaccard Similarity and algebraic distance, simply reveal a different community structure. On the contrary, if all edges are kept with the same probability, almost the same set of community structures can still be found up to a certain ratio of kept edges. Removing less than 40% of the edges at random using random edge sampling or Edge Forest Fire does not seem to lead to significantly more differences than what we observe due to the randomness of the Louvain method. Note that on the three other networks these results are hard to interpret, as they are from just a single run, but the general tendencies are similar. Local methods keep the connectivity and are thus slightly better at preserving the community structure. In order to verify the hypothesis that some differences are due to different community structures being found, we use synthetic networks with ground truth communities that are unambiguous. For this purpose we use the popular LFR generator [LFR08], see Section 2.1.1 for an introduction. As parameters, we choose the configuration with 1000 nodes and small communities from [LF09b]. This means our synthetic networks have a power-law degree distribution with exponent $-2$, average degree 20 and maximum degree 50. The communities have between 10 and 50 nodes; the community sizes also follow a power-law distribution, but with exponent $-1$. As mixing parameter $\mu$ we choose 0.5. This means that each node has as many neighbors in its own community as in all other communities together.
For smaller mixing parameters the differences between the different techniques are less obvious; for larger mixing parameters we reach the limits

where community detection algorithms are no longer able to identify the ground truth communities. For the plots in Figure 7.10 we use ten different random networks with the same configuration and report again the average and the standard deviation. As we have known ground truth communities for these networks, we use these ground truth communities instead of a community structure found by the Louvain algorithm for the following comparisons.

For the inter-cluster conductance in Figure 7.10a the results are similar to the results for the Facebook networks but much clearer. Random edge filtering almost perfectly preserves the inter-cluster conductance, both in the variant without and the one with local filtering. Edge Forest Fire and Local Degree lead to a clear increase of the conductance, again independent of the local filtering step. Compared to the Facebook networks, this increase for Edge Forest Fire is now much stronger and occurs earlier in the sparsification process. Simmelian Backbones, Jaccard Similarity and algebraic distance lead to a strong decrease of the inter-cluster conductance. With 40% remaining edges, the conductance reaches almost 0 for all of them. If a measure were able to perfectly distinguish between intra- and inter-cluster edges, the inter-cluster conductance could reach 0 when the ratio of kept edges reaches 50%. All methods are not far from that goal, but algebraic distance is the best method in this regard. With a local filtering post-processing step, algebraic distance is more similar to the other methods. For the other methods, only minor changes can be observed. It is visible, though, that some inter-cluster edges seem to remain.

On the LFR networks the connectivity in the communities seems to be preserved much better than on the Facebook networks, see Figure 7.10b. Up to 50% of removed edges, none of the methods leads to any noticeable fragmentation. Only when more edges are removed do the Simmelian Backbones, Jaccard Similarity and algebraic distance seem to disconnect parts of the communities. With the additional local filtering step the connectivity inside communities is almost perfectly preserved down to 15% remaining edges. As the networks have an average degree of 20, we also cannot expect that connectivity is preserved much further, as with 10% remaining edges only a tree could be preserved.

In Figure 7.10c we compare the ground truth communities to the community structure found by the Louvain algorithm (again with refinement). Without the sparsification step, the Louvain algorithm is unable to detect the ground truth communities; this is most likely due to the resolution limit of modularity [FB07]. On the generated networks with clear ground truth communities, our intuition that random edge, local degree and edge forest fire should not be able to preserve the community structure is verified. While removing less than 20% of the edges leads to similar detection rates, the differences increase as more and more edges are removed. Edge Forest Fire is worst at keeping the community structure on LFR networks. On the contrary, those methods that show positive results for the preservation of the community structure actually lead to sparsified networks where the Louvain algorithm is better able to find the community structure. As we could expect from the partition fragmentation, without local filtering the adjusted rand index decreases after 50% of the edges have been removed.
With local filtering, though, there is a range between 50% and 15% of remaining edges where the Louvain algorithm can almost exactly recover the ground truth communities on the sparsified


[Two plots: relative conductance change vs. ratio of kept edges]

(a) Relative conductance change of the ground truth communities

[Two plots: partition fragmentation vs. ratio of kept edges; legend: AD-20, RE, LAD-20, LJS, EFF, TS, LD, LTS, JS, QLS, LEFF, LQLS, LRE]

(b) Average partition fragmentation of the ground truth communities

[Two plots: adjusted rand index of the Louvain communities vs. ratio of kept edges]

(c) Adjusted rand measure between ground truth communities and found communities on the sparsified network

Figure 7.10: Preservation of the community structure on the generated LFR graphs

network. This shows that sparsification can even increase the quality of communities found by community detection algorithms. Algebraic distance with local filtering seems to work best in this regard. During the first 50% of removed edges, algebraic distance shows a strange behavior, though: especially with local filtering, the detection rate first drops a bit. This is surprising as algebraic distance leads to the strongest decrease of the inter-cluster conductance right at the beginning. A possible explanation is that the Louvain algorithm merges especially small clusters. If other methods filter edges between these small clusters first, this most probably helps the Louvain algorithm most.

With all these experiments we have seen that measuring the preservation of the community structure is a challenging task, especially when no ground truth communities are known. Our results suggest that the social networks either do not contain a clear community structure or that the Louvain algorithm is unable to identify this structure. Random edge deletion and edge forest fire seem to preserve this uncertainty in the sense that the Louvain algorithm still identifies relatively similar communities. Simmelian Backbones, Jaccard Similarity and algebraic distance on the other hand are designed to prefer intra-cluster edges, which can also be seen in our experiments. On the sparsified networks this has the effect that the Louvain algorithm detects communities that are different from the communities it detects on the original network. On synthetic networks, local filtering with these methods preserves and even enforces the community structure. They are able to preserve the ground truth communities down to a ratio of kept edges of 0.15. This suggests that Simmelian Backbones, Jaccard Similarity and algebraic distance with local filtering indeed keep and enforce some community structure, but that on networks without a clearly detectable community structure this is not necessarily the same structure as the one found by the Louvain algorithm.

7.5.4 Running Time

Measured running times are shown in Figure 7.11. Random Edge sparsification is clearly the fastest method, closely followed by Local Degree. Jaccard Similarity is also not much slower and scales very well, too. Therefore, these methods are well suited for large-scale networks in the range of millions to billions of edges. The efficiency of the Jaccard Similarity method shows that our parallel triangle counting implementation is indeed very scalable. The authors also proposed inexact Jaccard coefficient calculation [SPR11] for a further speedup, though given our numbers it is doubtful whether, given an efficient triangle counting algorithm, this is necessary or helpful at all. Algebraic distance is a bit slower but scales very well nevertheless. Using fewer systems or iterations could further speed up algebraic distance if speed is an issue. Both Simmelian methods are significantly slower than the other methods, but still efficient enough for the network sizes we consider. The visible difference between quadrilateral and triangular Simmelian Backbones can be explained by the difference between triangle and quadrangle counting; additionally, we did not parallelize the latter. While the time complexity in O-notation of Edge Forest Fire is difficult to assess, it seems to be slightly faster than Simmelian Backbones.


[Plot: running time of the edge score calculation in seconds vs. number of edges (×10^6); legend: AD-20, TS, EFF, QLS, JS, LD, RE]

Figure 7.11: Running times of various edge scoring methods on the Facebook networks


7.6 Conclusion

Our experimental study on networks from Facebook, Twitter, Google+ and LiveJournal as well as synthetically generated networks shows that several sparsification methods are capable of preserving a set of relevant properties of social networks when up to 80% of edges have been removed. Random edge deletion performs surprisingly well and retains the community structure on the studied social networks. We propose local filtering as a generally applicable and computationally cheap post-processing step for edge sparsification methods that improves the preservation of almost all properties, as it leads to a more equal rate of filtering across the network. Simmelian Backbones, Jaccard Similarity and algebraic distance prefer intra-cluster edges and thus do not keep global structures, but with the added local filtering step they are able to enforce and retain a community structure, as was already shown for Jaccard Similarity. However, the preserved community structure is not necessarily the same as the one the Louvain algorithm finds. Our novel method Local Degree, which is based on the notion that connections to hubs are highly important for the network's structure, in contrast preserves shortest paths and the overall connectivity of the network. This can be seen from the almost perfectly preserved diameter. For preserving the community structure, it is unsuitable, though. Our adaptation of the Forest Fire sampling algorithm to edge scoring depends strongly on the specific network's structure but seems no better than random edge sampling for preserving the community structure.

We hope that the conceptual framework of edge scoring and filtering as well as our evaluation methods are steps towards a more unified perspective on a variety of related methods that have been proposed in different contexts. Future developments can be easily carried out within this framework and based on our implementations, which are available as part of the open-source network analysis package NetworKit [SSM16].

Many real-world social networks are not static, but evolve over time. We are not aware of any work that has adapted any of the presented methods for dynamic networks. An interesting direction for future work would thus be to adapt and compare sparsification methods for dynamic networks.

In terms of scalability and networks that no longer fit into main memory, variants that work with external memory or in distributed settings could be developed. While basic building blocks such as triangle counting exist [HTC14], the actual sparsification algorithms would need to be adapted.

Regarding scalable community detection, random edge filtering offers an interesting possibility. While reading a graph, edges could be filtered directly and thus only memory for the sparsified graph would be required. As our study shows, 40 to 60% of the edges suffice on the tested instances to detect similar communities. While this might not be a huge reduction, being able to process a twice as large graph might make a difference in practice.
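A minimal sketch of this last idea, assuming a plain edge-list input file (one edge per line; names and format are illustrative, not NetworKit's actual reader). Each edge is kept independently with probability keep_ratio, so only the sparsified graph is ever materialized in memory:

import random

def read_sparsified(path, keep_ratio=0.5, seed=42):
    rng = random.Random(seed)
    adj = {}  # adjacency sets of the sparsified graph only
    with open(path) as f:
        for line in f:
            u, v = line.split()[:2]
            if rng.random() < keep_ratio:  # keep each edge independently
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
    return adj  # pass this to the community detection algorithm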


Part III

Editing


8 Introduction to Editing

Parts of this chapter are based on the introductions of our papers on quasi-threshold editing [Bra+15b; Got+20b]. The first is joint work with Ulrik Brandes, Ben Strasser and Dorothea Wagner while the second is joint work with Lars Gottesbüren, Philipp Schoch, Ben Strasser, Dorothea Wagner and Sven Zühlsdorf. As they address two variants of the same problem, this chapter introduces both Chapter 9 and Chapter 10 that are based on these papers.

The intuitive definition of a community as being internally densely and externally sparsely connected means that the best community would be a clique that is not connected to any other node. A graph with the ideal community structure is thus a disjoint union of cliques. If a graph is not a disjoint union of cliques, we can ask for a set of edge insertion and deletion operations that transforms a given graph into a disjoint union of cliques. Finding such a set of editing operations of minimum size is commonly called cluster editing [BB13]. This can be extended to settings where we are given a cost for each insertion and deletion operation. We then want to minimize the sum of the costs of the operations needed to transform a given graph into a disjoint union of cliques. This is useful, for example, if we assume that all edges between nodes inside a community should exist, but due to measurement errors edges might be missing in our data. We can then assign costs to express how certain we are that we correctly measured, for example, an interaction between two proteins or that there was no interaction. In this thesis, we only consider editing problems with uniform costs; an extension of our results to non-uniform costs is part of our ongoing research.

Such an ideal model of a community where each node is connected to every other node is not always realistic. Even in protein interaction networks, this is usually too strict and instead a model with a core and some attached nodes is considered to be more realistic [BHK15]. Nastos and Gao [NG13] consider the class of quasi-threshold graphs as a model for social networks. Quasi-threshold graphs, also known as trivially perfect graphs, can be described as the graphs that are the transitive closure of a rooted forest [Wol65], i.e., every branch from a leaf to the root becomes a clique as illustrated in Figure 8.1a. This means that every connected subgraph has a node that is connected to all other nodes. Such a forest can be seen as a skeleton of the graph. An inductive characterization exists, too: single nodes are quasi-threshold graphs, disjoint unions of quasi-threshold graphs are quasi-threshold graphs and adding a node to a quasi-threshold graph that is connected to all other nodes yields a quasi-threshold graph [YCC96]. Nastos and Gao also consider the quasi-threshold editing problem to transform a given graph into a closest quasi-threshold graph using edge insertions and deletions. Figure 8.1b shows an example. The connected components of these quasi-threshold graphs are then considered as communities. Each community thus consists of a set of overlapping cliques.
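As a small illustration of this ideal structure, the following sketch tests whether a graph is already a disjoint union of cliques; it assumes an adjacency-set representation and uses the fact that in such a graph any two adjacent nodes have identical closed neighborhoods:

def is_union_of_cliques(adj):
    # adj: dict mapping each node to the set of its neighbors.
    for u in adj:
        for v in adj[u]:
            # Adjacent nodes of a clique component share their closed
            # neighborhood; a mismatch witnesses an induced path on 3 nodes.
            if adj[u] | {u} != adj[v] | {v}:
                return False
    return True

If this returns False, at least one edit is necessary to reach the ideal community structure.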


(a) Quasi-threshold graph with thick skeleton, grey root and dashed transitive closure. (b) Edit example with solid input edges, dashed inserted edges, a crossed deleted edge, a thick skeleton with grey root.

Figure 8.1: Quasi-Threshold graph examples.

(a) A C4. (b) A P4. (c) A graph with a node-induced C4. (d) A {C4, P4}-free graph.

Figure 8.2: A C4, a P4 and examples for graphs that are (not) {C4, P4}-free.

Quasi-threshold graphs are also exactly the graphs where for every pair of connected nodes u, v, the closed neighborhood N(u) ∪ {u} of u is a subset of the closed neighborhood N(v) ∪ {v} of v, or vice-versa. This naturally defines a hierarchy within each community that might model a social hierarchy or a hierarchy in an organizational structure.

A graph H is an induced subgraph of a graph G if there exists an injective mapping π from the nodes VH of H to the nodes of G such that for two nodes ui, uj ∈ VH, {ui, uj} is an edge in H if and only if {π(ui), π(uj)} is an edge in G. If a graph does not contain a forbidden subgraph H as an induced subgraph, it is called H-free. This can be extended to a set F of forbidden subgraphs, where a graph is F-free if it is H-free for all H ∈ F.

Let Pℓ be a path graph with ℓ nodes. Similarly, let Cℓ denote a cycle graph with ℓ nodes. Figures 8.2a and 8.2b depict a C4 and a P4. The graph depicted in Figure 8.2c is not {C4, P4}-free as the nodes (a, b, d, e) form an induced C4. In contrast, the graph depicted in Figure 8.2d is {C4, P4}-free. The nodes (a, b, e, c) form a P4; however, as there is an edge between a and e, the subgraph is not an induced subgraph.
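This characterization directly yields a simple, though quadratic-time, recognition check; the sketch below assumes an adjacency-set representation, and a violating edge witnesses an induced P4 or C4:

def is_quasi_threshold(adj):
    # adj: dict mapping each node to the set of its neighbors.
    for u in adj:
        nu = adj[u] | {u}  # closed neighborhood of u
        for v in adj[u]:
            nv = adj[v] | {v}
            if not (nu <= nv or nv <= nu):  # neither is nested in the other
                return False
    return True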

Graphs that are the disjoint union of cliques are exactly the graphs that are P3-free. Similarly, quasi-threshold graphs are exactly the {C4, P4}-free graphs [YCC96]. Different generalizations like P4-free deletion [Jia+15] or P5-free editing [Boh15] have also been considered for community detection.


8.1 Related Work

For quasi-threshold graphs, linear-time algorithms have been proposed for the recognition problem, i.e., deciding whether a given graph is a quasi-threshold graph [YCC96; Chu08]. Both construct a skeleton forest if the graph is a quasi-threshold graph. Further, [Chu08] also finds a C4 or P4 if one exists, certifying that the graph is not a quasi-threshold graph.

For many choices of F, F-free editing is NP-hard, in particular for F = {C4, P4} [NG13]. The F-free edge editing problem for a finite set F of finite forbidden subgraphs is fixed-parameter tractable (FPT) in the number of edits k [Cai96]. The proof implies a simple branch-and-bound algorithm that we introduce in Section 10.4. For quasi-threshold graphs, it has a running time of O(6^k · (n + m)). For the special case of {C4, P4}-free edge deletion, where only edge deletion operations are allowed, optimized branching rules have been proposed that reduce the running time of the trivial algorithm from O(4^k · (n + m)) to O(2.42^k · (n + m)) [Liu+15]. To the best of our knowledge, for {C4, P4}-free editing, no improved branching rules have been proposed so far. Further, a polynomial kernel of size O(k^7) has been introduced for quasi-threshold graphs [DP17]. Unfortunately, as we show in our experiments in Section 9.6, many social networks require more than m/4 edits. Further, we show lower bounds of up to millions of edits. Thus, we cannot hope that exact algorithms that are superpolynomial in the number of edits will scale to these graphs. The only heuristic quasi-threshold editing algorithm we are aware of has been proposed by Nastos and Gao [NG13]. Its running time is at least quadratic in the number of nodes and thus still unsuitable for graphs with millions of nodes.

The F-free editing problem also admits a simple formulation as an integer linear program (ILP). For cluster editing, a linear programming formulation with cutting planes that are incrementally added (in batches of a few hundred constraints) has been proposed [GW89]. Later, exact algorithms based on integer linear programming as well as kernelization and more efficient FPT algorithms have been considered [BBK11]. In [HH15], the authors combine the FPT algorithm with kernelization as well as upper and lower bounds. Editing to {P4}-free graphs has been considered in phylogenomics [Hel+15] using a simple ILP-based approach.

In a bachelor thesis [Boh15], {P5}-free editing has been considered for community detection. They apply lower bounds, data reduction rules and rules for disallowing certain edits.
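For concreteness, the following is a sketch of the textbook cluster-editing ILP (edit variables with transitivity constraints), written with the PuLP modeling library; it is not the exact model of the cited works, and all names are illustrative. Integer node labels are assumed:

import itertools
import pulp

def cluster_editing_ilp(nodes, edges):
    E = {frozenset(e) for e in edges}
    pairs = [frozenset(p) for p in itertools.combinations(nodes, 2)]
    prob = pulp.LpProblem("cluster_editing", pulp.LpMinimize)
    # x[p] = 1 iff node pair p is an edge of the edited cluster graph.
    x = {p: pulp.LpVariable("x_%s_%s" % tuple(sorted(p)), cat="Binary")
         for p in pairs}
    # Minimize edits: deleted existing edges plus inserted non-edges.
    prob += pulp.lpSum((1 - x[p]) if p in E else x[p] for p in pairs)
    # Transitivity: u~v and v~w force u~w, so components become cliques.
    for u, v, w in itertools.permutations(nodes, 3):
        if u < w:  # add each unordered constraint only once
            prob += (x[frozenset((u, v))] + x[frozenset((v, w))]
                     - x[frozenset((u, w))]) <= 1
    prob.solve()
    return {p for p in pairs if x[p].value() > 0.5}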

8.2 Our Contribution

In this thesis we primarily consider the quasi-threshold editing problem, though some of our results can be generalized to other editing problems. Our goal is to make it possible to determine whether quasi-threshold editing is a useful community detection algorithm. In Chapter 9, we introduce quasi-threshold mover (QTM), a fast editing heuristic that is close to linear in practice and allows evaluating quasi-threshold editing on large graphs. In the following Chapter 10, we consider solving the quasi-threshold editing problem exactly. The purpose of this is two-fold: First, for small graphs, it allows us to show that

our heuristic QTM frequently computes exact solutions or is close to them. Second, the branch-and-bound algorithm we consider also allows listing all solutions exactly once. This allows analyzing the suitability of quasi-threshold editing for community detection in more detail. In a first analysis, we show that even for small social networks there are up to several thousands of solutions that also yield different communities.

9 Fast Quasi-Threshold Editing

This chapter is based on joint work with Ulrik Brandes, Ben Strasser and Dorothea Wagner [Bra+15b]. We thank James Nastos for helpful discussions. Compared to the publication, the introduction has been shortened, the appendix of the extended version [Bra+15a] has been revised and integrated, and the lower bounds have been replaced by an improved bound that we originally used in our work on exact editing in Chapter 10.

9.1 Introduction

Our contribution in this chapter are algorithms for the quasi-threshold editing problem that are applicable to graphs with millions of nodes and edges. We present Quasi-Threshold Mover (QTM), a scalable quasi-threshold editing algorithm. Given a graph, it computes a quasi-threshold graph which is close in terms of edit count, but not necessarily closest as this problem is NP-hard. We provide an extensive experimental evaluation on synthetic graphs as well as social network and web graphs. We further propose a simplified certifying quasi-threshold recognition algorithm. QTM works in two phases: an initial skeleton forest is constructed by a variant of our recognition algorithm, and then refined by moving one node at a time to reduce the number of edits required. The running time of the first phase is dominated by the time needed to count the number of triangles per edge. The currently best triangle counting algorithms run in O(m · α(G)) time [CN85; OB14], where α(G) is the arboricity. These algorithms are efficient and scalable in practice on the considered graphs. One round of the second phase needs O(n + m log ∆) time, where ∆ is the maximum degree. We show that four rounds are enough to achieve good results.

9.1.1 Outline

This chapter is organized as follows: We begin by describing how we compute lower bounds on the number of edits. We then describe the editing algorithm introduced by Nastos and Gao. In Section 9.4, we introduce the simplified recognition algorithm and the computation of the initial skeleton. The main algorithm is described in Section 9.5. The remainder of the chapter is dedicated to the experimental evaluation.

9.1.2 Preliminaries

We use the notation introduced in Section 1.2. For a skeleton forest, we denote by p(u) the parent of a node u and by Tu the subtree rooted at u.


9.2 Lower Bounds

We propose to compute a simple lower bound. This allows us to show that many of the graphs we consider actually require a large number of edits, meaning that exact algorithms that are superpolynomial in k are infeasible. If the lower bound matches the number of edits needed by the heuristic solution, this shows that the heuristic solution is in fact exact. If the bound is close to the heuristic solution, we can at least be sure that our heuristic solution is not far from optimal.

To edit a graph we must destroy all forbidden subgraphs H. For quasi-threshold editing, H is either a P4 or a C4. This leads to the following basic algorithm: Find a forbidden subgraph H, increase the lower bound, remove all nodes of H, repeat. This is correct as at least one edit incident to H is necessary. If multiple edits are needed, then accounting only for one is a lower bound. We can optimize this algorithm by considering node pairs instead of nodes. If we mark all node pairs of H and forbid further subgraphs from using them, the algorithm is still correct as editing one node pair does not influence other node pairs. Further, if H is a P4 of the form a−b−c−d, we do not need to mark {a, d}, as editing {a, d} will only transform it into a C4 and thus one of the other node pairs must be edited. Similarly, for a C4, we can omit one edge. Such lower bounds based on a packing have already been used for cluster editing [HH15] and P5-free editing [Boh15]. This lower bound corresponds to the basic bound that we describe in more detail in Section 10.4.3 in the following chapter on exact editing.

H can be found using the recognition algorithm. However, the resulting running time of O(k(n + m)) does not scale to the large graphs. Instead, we use a listing algorithm that lists all forbidden subgraphs. We iterate over all edges {u0, u1}. If {u0, u1} is not marked yet, for ui, i ∈ {0, 1}, we iterate over N(ui) until we find a neighbor vi such that vi is not a neighbor of u1−i and neither {u0, vi} nor {u1, vi} is marked. If such a neighbor exists for both u0 and u1, we found a new forbidden subgraph. We then mark the node pairs {u0, u1}, {u0, v0}, {u0, v1}, {u1, v0} and {u1, v1}, increase our counter by one and continue. We store the neighbors of every node as well as the marked node pairs in a hash set to ensure that for each edge we only spend O(max{deg(u0), deg(u1)}) time in expectation. The total running time is thus O(m · ∆) in expectation. To improve the running time in practice, we iterate first over the neighbors of the lower-degree node of the edge as we can stop if it has no suitable neighbor.
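A compact sketch of this listing-based bound, assuming an adjacency-set representation with comparable (e.g., integer) node labels; the exclusive neighbors v0 and v1 are necessarily distinct, and the five marked pairs correspond to the description above:

def packing_lower_bound(adj):
    marked = set()  # node pairs already used by a found subgraph
    pair = lambda a, b: (a, b) if a < b else (b, a)

    def exclusive_neighbor(ui, uj, u0, u1):
        # An unmarked neighbor of ui that is not adjacent to uj.
        for v in adj[ui]:
            if (v != uj and v not in adj[uj]
                    and pair(u0, v) not in marked
                    and pair(u1, v) not in marked):
                return v
        return None

    bound = 0
    for u0 in adj:
        for u1 in adj[u0]:
            if u0 > u1 or pair(u0, u1) in marked:
                continue
            v0 = exclusive_neighbor(u0, u1, u0, u1)
            v1 = exclusive_neighbor(u1, u0, u0, u1)
            if v0 is not None and v1 is not None:
                bound += 1  # v0-u0-u1-v1 induces a P4 or C4
                for p in ((u0, u1), (u0, v0), (u0, v1), (u1, v0), (u1, v1)):
                    marked.add(pair(*p))
    return bound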

9.3 The Algorithm proposed by Nastos and Gao

Nastos and Gao [NG13] describe that in their greedy algorithm they test each possible edge addition and deletion (i.e., all O(n^2) possibilities) in order to choose the edit that results in the largest improvement, i.e., the highest decrease of the number of induced P4 and C4. After executing this greedy heuristic they revert the last few edits and execute a bounded search tree algorithm. If this results in a solution with fewer edits, they repeat this last step until no improvement is possible anymore. We revert ten edits in each step. As they do not provide any algorithmic details, we describe how we implemented the


algorithm below. The main question is how to select the next edit. As far as we know, it is an open problem whether it is possible to determine the edit that destroys the most P4 and C4 in time o(n^2). Therefore, we use the obvious approach that was also implied by Nastos and Gao: execute each possible edit and see how the number of P4 and C4 changes. The main ingredient is thus a fast update algorithm for this counter.

As far as we are aware, the fastest update algorithm for counting node-induced P4 and C4 subgraphs needs amortized time O(h^2) for each update, where h is the h-index of the graph [Epp+12]. While the worst-case bound of the h-index is √m, it has been shown that many real-world social networks have a much lower h-index [ES09]. However, this algorithm requires storing many counts for node pairs and triples, which makes its practicability questionable. Instead, we use a simple algorithm that requires only O(n) additional memory during updates and needs to store only the counter between updates. In the worst case, it needs O(m) time for an update. The main idea is to examine the neighborhood structure of the edge to be deleted or inserted. We mark common and exclusive neighbors of the two incident nodes and for each of these neighbors, we iterate over its neighbors. Based on which of them are marked, we can determine how many P4 and C4 were destroyed or created by editing the edge.

While we could execute this algorithm m times for the initialization, we instead use a simpler algorithm that needs O(m^2) time, too. For each edge {u, v}, we determine the neighbors exclusive to one of them. The product of the sizes of the exclusive neighborhoods gives us the number of P4 where {u, v} is the central edge plus the number of C4 that {u, v} is part of. As we count each C4 four times, we need to determine how many C4 there are to correct for this. For this, we iterate over the neighbors of u and for each neighbor un ∈ N(u), we iterate over the neighbors of un and count how many of them are marked as exclusive neighbors of v. When the graph is a quasi-threshold graph, one of the two sets of exclusive neighbors is always empty. This allows skipping the iteration over the neighbors of un, reducing the running time to O(m · ∆).

Overall, the running time is O(m^2 + k · n^2 · m). The time needed for the initial counting (first term) is dominated by the time needed for each edit (second term).
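For cross-checking these counters on small instances, a brute-force reference over all 4-node subsets can be used; it assumes an adjacency-set representation and is far too slow beyond toy graphs:

from itertools import combinations

def count_p4_c4(adj):
    p4 = c4 = 0
    for quad in combinations(adj, 4):
        edges = [p for p in combinations(quad, 2) if p[1] in adj[p[0]]]
        degs = sorted(sum(1 for e in edges if v in e) for v in quad)
        if len(edges) == 3 and degs == [1, 1, 2, 2]:
            p4 += 1  # the induced subgraph is a path on four nodes
        elif len(edges) == 4 and degs == [2, 2, 2, 2]:
            c4 += 1  # the induced subgraph is a 4-cycle
    return p4, c4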

9.4 Linear Recognition and Initial Editing

The first linear time recognition algorithm for quasi-threshold graphs was proposed in [YCC96]. In [Chu08], a linear time certifying recognition algorithm based on lexicographic breadth first search was presented. However, as the authors note, sorted node partitions and linked lists are needed, which incur large constant factors in the running time. We simplify their algorithm to only require arrays but still provide negative and positive certificates. Further, we only need to sort the nodes once to iterate over them by decreasing degree. Our algorithm constructs the forest skeleton of a graph G. If it succeeds, G is a quasi-threshold graph and the algorithm outputs for each node v a parent node p(v). If it fails, it outputs a forbidden subgraph H.


To simplify our algorithm we start by adding a super node r to G that is connected to every node and obtain G′. G is a quasi-threshold graph if and only if G′ is one. As G′ is connected, its skeleton is a tree. A core observation is that higher nodes in the tree must have higher degrees, i.e., deg(v) ≤ deg(p(v)). We therefore know that r must be the root of the tree. Initially, we set p(u) = r for every node u. We process all remaining nodes ordered decreasingly by degree. Once a node is processed, its position in the tree is fixed. Denote by u the node that should be processed next. We iterate over all non-processed neighbors v of u and check whether p(u) = p(v) holds and afterwards set p(v) to u. If p(u) = p(v) never fails, then G is a quasi-threshold graph, as for every node x (except r) we have that by construction the neighborhood of x is a subset of the one of p(x).

If p(u) ≠ p(v) holds at some point, then a forbidden subgraph H exists. Either p(u) or p(v) was processed first. Assume without loss of generality that it was p(v). We know that no edge {v, p(u)} can exist, because otherwise p(u) would have assigned itself as parent of v when it was processed. Further, we know that p(u)'s degree can not be smaller than u's degree as p(u) was processed before u. As v is a neighbor of u, we know that another node x must exist that is a neighbor of p(u) but not of u, i.e., {u, x} does not exist. The subgraph H induced by the 4-chain v−u−p(u)−x is thus a P4 or C4 depending on whether the edge {v, x} exists. We have that u, v and p(u) are not r as p(v) was processed before them and r was processed first. As x has been chosen such that {u, x} does not exist but {u, r} exists, x ≠ r. H therefore does not use r and is contained in G.
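A minimal sketch of this recognition procedure, assuming an adjacency-set representation; it returns the parent map of the skeleton, or None if a forbidden subgraph exists (constructing the certificate subgraph is omitted):

def recognize_quasi_threshold(adj):
    r = object()  # virtual super node r, connected to every node
    parent = {u: r for u in adj}
    processed = set()
    # Process nodes by decreasing degree (ties broken arbitrarily).
    for u in sorted(adj, key=lambda v: len(adj[v]), reverse=True):
        for v in adj[u]:
            if v in processed:
                continue
            if parent[u] != parent[v]:
                return None  # a forbidden P4 or C4 exists
            parent[v] = u
        processed.add(u)
    return parent  # children of r are the roots of the skeleton forest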

From Recognition to Editing. We modify the recognition algorithm to construct a skeleton for arbitrary graphs. This skeleton induces a quasi-threshold graph Q. We want to minimize Q's distance to G. Note that all edits are performed implicitly; we do not actually modify the input graph, for efficiency reasons. The only difference between our recognition and our editing algorithm is what happens when we process a node u that has a non-processed neighbor v with p(u) ≠ p(v). The recognition algorithm constructs a forbidden subgraph H, while the editing algorithm tries to resolve the problem.

We have three options for resolving the problem: we ignore the edge {u, v}, we set p(v) to p(u), or we set p(u) to p(v). The last option differs from the first two as it affects all neighbors of u. The first two options are the decision whether we want to make v a child of u even though p(u) ≠ p(v), or whether we want to ignore this potential child. The first option implies deleting {u, v}. The second option implies that all edges from v to its ancestors are deleted and edges to u's ancestors are inserted. Parts of these ancestors might be shared and thus in fact not cause any edits. The third option means that all edges from u to its ancestors are deleted and new edges to v's ancestors are inserted. The same also applies to all neighbors that u keeps; for all other neighbors, the edge to the neighbor is deleted. Again, parts of the ancestors might be shared and thus not cause any edits.

We start by determining a preliminary set of children by deciding for each non-processed neighbor of u whether we want to keep or discard it. These preliminary children elect a new parent by majority. We set p(u) to this new parent. Changing u's parent can change which neighbors are kept. We therefore reevaluate all the decisions and obtain a final set of children for which we set u as parent. Then the algorithm simply continues with the


Input: G = (V, E)
Output: Parent assignment p for each node
1  Sort V by degree in descending order using bucket sort;
2  p : V → V ∪ {∅}, u ↦→ ∅;
3  depth : V → N0, u ↦→ 0;
4  Count triangles t({u, v}) and calculate pc({u, v});
5  foreach u ∈ V do // Process node u
6    N ← {v ∈ N(u) | v not processed and (p(u) = p(v) or (pc({u, v}) ≤ pc({v, p(v)}) and depth(v) ≤ t({u, v}) + 1))};
7    pn ← the most frequent value of p(x) for x ∈ N;
8    if pn ≠ p(u) then
9      p(u) ← pn;
10     depth(u) ← 0;
11     pc({u, pn}) ← ∞;
12   foreach v ∈ N(u) that has not been processed do
13     if p(u) = p(v) or (pc({u, v}) < pc({v, p(v)}) and depth(v) < t({u, v}) + 1) then
14       p(v) ← u;
15       depth(v) ← depth(v) + 1;

Algorithm 9.1: The Initialization Algorithm

next node. What remains to describe is when our algorithm keeps a potential child. It does this using two edge measures: the number of triangles t(e) in which an edge e participates, and a pseudo-C4-P4-counter pc(e), which is the sum of the number of C4 in which e participates and the number of P4 in which e participates as central edge. Computing pc({x, y}) is easy given the number of triangles and the degrees of x and y, as pc({x, y}) = (deg(x) − 1 − t({x, y})) · (deg(y) − 1 − t({x, y})) holds. Having a high pc(e) makes it likely that e should be deleted.

We keep a potential child only if two conditions hold. The first is based on triangles. We know by construction that both u and v have many edges in G towards their current ancestors. Keeping v is thus only useful if u and v share a large number of ancestors, as otherwise the number of induced edits is too high. Each common ancestor of u and v results in a triangle involving the edge {u, v} in Q. Many of these triangles should also be contained in G. We therefore count the triangles of {u, v} in G and check whether there are at least as many triangles as v has ancestors. The other condition uses pc(e). The decision whether we keep v is in essence the question of whether {u, v} or {v, p(v)} should be in Q. We only keep v if pc({u, v}) is not higher than pc({v, p(v)}).

In Algorithm 9.1 we provide the full initialization heuristic as pseudo code. Note that while for the parent calculation we use ≤ for comparisons, we use < for the final selection

167 9 Fast Quasi-Threshold Editing


Figure 9.1: Moving vm example. The drawn edges are in the skeleton and directed towards the root r. By moving vm, crossed edges are removed and thick blue edges are inserted. The old parent node a becomes the parent of the old children b and c. From the new parent node u, the child x is not adopted while y is.

9.5 The Quasi-Threshold Mover Algorithm

The Quasi-Threshold Mover (QTM) algorithm iteratively increases the quality of a skeleton T using an algorithm based on local moving. Local moving is a technique that is successfully employed in many heuristic community detection algorithms [Blo+08; GKW14; RN11]. As in most algorithms based on this principle, our algorithm works in rounds. In each round it iterates over all nodes vm in random order and tries to move vm. In the context of community detection, a node is moved to a neighboring community such that a certain objective function is increased. In our setting we want to minimize the number of edits needed to transform the input graph G into the quasi-threshold graph Q implicitly defined by T.

We need to define the set of allowed moves for vm in our setting. Moving vm consists of moving vm to a different position within T and is illustrated in Figure 9.1. We need to choose a new parent u for vm. The new parent of vm's old children is vm's old parent. Besides choosing the new parent u, we select a set of children of u that are adopted by vm, i.e., their new parent becomes vm. Among all allowed moves for vm we choose the move that reduces the number of edits as much as possible. Doing this in sub-quadratic running time is difficult as vm might be moved anywhere in G. By only considering the neighbors of vm in G and a constant number of nodes per neighbor in a bottom-up scan in the skeleton, our algorithm has a running time in O(n + m log ∆) per round. While our algorithm is not guaranteed to be optimal as a whole, we can prove that for each node vm we choose a move that reduces the number of edits as much as possible. Our experiments show that, given the result of the initialization heuristic, our moving algorithm performs well in practice. They further show that in practice four rounds are good enough, which results in a near-linear total running time.


9.5.1 Basic Idea

Our algorithm starts by isolating vm, i.e., removing all incident edges in Q. It then finds a position at which vm should be inserted in T. If vm's original position was optimal, then it will find this position again. For simplicity, we assume again that we have a virtual root r that is connected to all nodes. Isolating vm thus means that we move vm below the root r and do not adopt any children. Choosing u as parent of vm requires Q to contain edges from all ancestors of u to vm. Further, if vm adopts a child c of u, then Q must have an edge from every descendant of c to vm. How good a move is depends on how many of these edges already exist in G and how many edges incident to vm in G are not in Q. To simplify notation we will refer to the nodes incident to vm in G as vm-neighbors.

We start by identifying which children a node should adopt. For this we define the child closeness childclose(c) of c as the number of vm-neighbors in the subtree of c minus the non-vm-neighbors. A node c is a close child if childclose(c) > 0. If vm chooses a node u as new parent, then it should adopt all close children of u, as each of them reduces the number of required edits. A node can only be a close child if it is a neighbor of vm or if it has a close child. Our algorithm starts by computing all close children and their closeness using many short DFS searches in a bottom-up fashion.

Knowing which nodes are close children, we can identify which nodes are good parents for vm. A potential parent must have a close child or must be a neighbor of vm. Using the set of close children we can easily derive a set of parent candidates and an optimal selection of adopted children for every potential parent. We need to determine the candidate with the fewest edits. We do this in a bottom-up fashion.

With the techniques described in the following, we can implement this algorithm in O(n + m log ∆) per round. For every move, we insert O(dG(vm)) elements into a priority queue and explore a limited part of the skeleton forest. The running time for this is amortized O(dG(vm) log dG(vm)) per move. We analyze the running time complexity using tokens. Initially only the vm-neighbors have tokens. The tokens are consumed by the short DFS searches and the processing of parent nodes. The details of the analysis are described in Section 9.5.5.

9.5.2 Close Children

To find all close children we start a DFS from every potential close child u that explores u's subtree. A node u is a close child if this DFS finds more vm-neighbors than non-vm-neighbors. Unfortunately we cannot fully run all these searches as this requires too much running time. Therefore, a DFS is aborted if it finds more non-vm-neighbors than vm-neighbors. For every aborted DFS, we store a pointer DFSnext that points to the last node visited and allows continuing the DFS. To avoid exploring subtrees twice, a DFS skips already explored subtrees using these pointers. By using a priority queue, we ensure that nodes are explored in a bottom-up fashion and no DFS will be started in an already explored subtree. We exploit that close children are vm-neighbors or themselves have close children. Initially we fill the priority queue of potential close children



Figure 9.2: Example of two states of the bottom-up search for determining childclose, visualized using the skeleton forest, each after processing the node marked in red. Blue nodes are neighbors of vm, gray nodes have been visited by a DFS. The values in the nodes denote the values of childclose. The dashed arrows denote the pointers to the end of the DFS. The two blue nodes in the bottom left have been processed first; both have a childclose of 1. The DFS starting from the red node in the left figure does not descend into the left subtree but directly continues with the right subtree. The DFS from the red node in the right figure follows the dashed arrow from its child to resume its DFS and then visits only one additional node before it stops again and records its state with the second dashed arrow.

with the neighbors of vm, and when a new close child is found we add its parent to the queue. Figure 9.2 shows two examples for the nodes explored by these searches.

Let u denote the current node removed from the queue. We start a DFS from u and if it explores the whole subtree, then u is a close child. If not, it stops when childclose(u) reaches −1 and then stores a pointer to the last-visited node. If the DFS of u starts by first inspecting the wrong children, then it can get stuck because it would see the vm-neighbors too late. The DFS must thus first consider the close children of u, as close children are exactly the children whose subtrees contain more vm-neighbors than non-vm-neighbors. To ensure that u knows which children are close, every close child reports itself to its parent. As all children of u have a greater depth than u, the priority queue ensures that they are processed before the DFS of u starts.

Algorithm 9.2 shows this DFS as pseudo code. We start by initializing childclose(u) to the sum of the close children, i.e., those that reported to u, and mark u as processed. Then we check if u is a vm-neighbor and update the score accordingly. We only actually start a DFS if childclose(u) ≥ 0, as during this DFS the score can only decrease, since only subtrees of close children contain more neighbors than non-neighbors and we already considered the subtrees of close children. We only continue the DFS into nodes x that have not been processed or whose childclose(x) < 0, as all other children are close and have thus already been considered in the score of the parent node. For every visited node, we decrease the


1  childclose(u) ← Σ childclose over close u-children;
2  mark u as processed;
3  if u is vm-neighbor then
4    childclose(u) ← childclose(u) + 1;
5  else
6    childclose(u) ← childclose(u) − 1;
7  if childclose(u) ≥ 0 and u has children then // Start a DFS from u
8    x ← first child of u;
9    while x ≠ u do
10     if x not processed or childclose(x) < 0 then
11       childclose(u) ← childclose(u) − 1;
12       x ← DFSnext(x);
13       if childclose(u) < 0 then
14         DFSnext(u) ← x;
15         break;
16       x ← next node in DFS order after x below u;
17     else // skip subtree of x
18       x ← next node in DFS order after the subtree of x below u;

Algorithm 9.2: The algorithm for determining childclose(u).

score of u by one. Every node x stores a DFSnext(x) pointer that initially points to itself and is set to the last visited node when the DFS is aborted, thus indicating the next node that should be visited by a DFS that reaches x. We follow the DFSnext-pointer to potentially skip the part of the tree already explored from x and then continue with the next node in DFS order, unless the score of u is now below 0, i.e., -1, in which case we abort the DFS and set DFSnext(u) to x. This ensures that when another node encounters u, it continues its DFS with x. After moving vm, we reset all DFSnext pointers, all markers for processed nodes and all childclose values. As only values of processed nodes are modified, it suffices to initialize the values once and then only reset the values of processed nodes.

9.5.3 Potential Parents

Let Xw be the set of nodes consisting of w, the ancestors of w, the close children of w and the descendants of the close children of w. Moving vm below w, i.e., choosing w as parent, requires us to insert an edge from vm to every non-vm-neighbor in Xw. Likewise, not including vm-neighbors in Xw requires us to delete an edge for each of them. We therefore want Xw to maximize the number of vm-neighbors minus the number of non-vm-neighbors. We compute this score recursively. For this, we restrict the graph to the subtree Tu of a node u and first only consider potential parents and edits within that subgraph. If we


1  foreach vm-neighbor u do
2    push u;
3  while queue not empty do
4    u ← pop;
5    determine childclose(u) by DFS;
6    x ← max over scoremax of reported u-children;
7    y ← Σ childclose over close u-children;
8    if u is vm-neighbor then
9      scoremax(u) ← max{x, y} + 1;
10   else
11     scoremax(u) ← max{x, y} − 1;
12   if childclose(u) > 0 or scoremax(u) > 0 then
13     report u to p(u);
14     push p(u);
15 Best vm-parent corresponds to scoremax(r);

15 Best vm-parent corresponds to scoremax(r); Algorithm 9.3: Pseudo-Code for moving vm

later consider Ta for a node a that is an ancestor of u, each of these parents will incur the same number of edits in Ta \ Tu. Thus, we can recursively first choose the best candidate in Tu and then continue with the subtrees of parents of u until we reach Tr and obtain the best candidate for the whole graph.

We denote by scoremax(u) the maximum number of vm-neighbors minus non-vm-neighbors in Xw restricted to Tu over all potential parents w in Tu. We determine in a bottom-up fashion all scoremax(u) that are greater than 0. The value scoremax(r) of the root r then yields the best choice for the whole graph as its "subtree" encompasses the whole graph. As isolating vm, i.e., choosing r as parent and not adopting any children, gives a score of 0, it suffices to consider nodes with scoremax(u) > 0 to determine the best choice. If u itself is the best parent in Tu, then the value of scoremax(u) is the sum over the closenesses of all of u's close children ± 1. If the subtree Tc of a child c of u contains the best parent, then scoremax(u) = scoremax(c) ± 1. The ±1 depends on whether u is a vm-neighbor.

Unfortunately, not only potential parents u have a scoremax(u) > 0. However, we know that every node u with scoremax(u) > 0 is a vm-neighbor or has a child w with scoremax(w) > 0. We can therefore process all scoremax values in a similar bottom-up way using a tree-depth ordered priority queue, as we used to compute childclose. As both bottom-up procedures have the same structure, we can interweave them as an optimization and use only a single queue. The algorithm is illustrated in Algorithm 9.3 in pseudo-code form. Whenever we propagate scoremax to a parent node, we can also propagate which node is the best parent.

After we have determined the best parent, we need to determine which children should be attached to vm. For this, we can simply visit all previously visited nodes (we can store them) and for each node c of them determine whether it is a close child of the best parent by


testing if childclose(c) > 0 and if the parent of c is the best parent.
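Again for reference, the objective optimized here can be stated naively; the children and parent maps describe the skeleton, child_close is the sketch above, and the virtual root is excluded (the actual algorithm computes this bottom-up in a single pass):

def score_of_parent(children, parent, w, vm_neighbors):
    # vm-neighbors minus non-vm-neighbors in Xw: w, its ancestors and the
    # subtrees of its close children, which vm would adopt.
    score, a = 0, w
    while a is not None:  # w and all its ancestors below the virtual root
        score += 1 if a in vm_neighbors else -1
        a = parent.get(a)
    for c in children.get(w, ()):
        cc = child_close(children, c, vm_neighbors)
        if cc > 0:  # adopt exactly the close children
            score += cc
    return score

# The best move maximizes this over all w, compared against 0 (isolating vm).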

9.5.4 Proof of Correctness

In this section, we prove that the algorithm presented in the previous sections is actually correct, i.e., computes the correct parent and children to attach. We start by showing that we process all close children and correctly set their scores.

Proposition 9.1. Either childclose(u) is the number of neighbors of vm in the subtree of u minus the number of non-neighbors, or there are more non-neighbors than neighbors in the subtree of u. In the latter case, if u has been processed by Algorithm 9.3, then childclose(u) = −1 and DFSnext(u) points to a node in the subtree of u such that −1 is the number of neighbors minus the number of non-neighbors of vm among all nodes in DFS order between u and DFSnext(u), both included, as well as all nodes in subtrees of close children of u that are in the DFS order after DFSnext(u).

Proof. First, we show that all nodes u with childclose(u) ≥ 0 are part of the queue and thus processed by Algorithm 9.3. Only neighbors of vm and their ancestors can have childclose(u) ≥ 0. All neighbors of vm are processed (line 2). For non-neighbors u, childclose(u) ≥ 0 means that one of their children c is close, i.e., has childclose(c) > 0. As in this case c inserts p(c) into the queue (line 14), also in this case u will be processed. Therefore, when the algorithm terminates, all nodes u with childclose(u) ≥ 0 have been processed.

To prove the claim it thus suffices to show that for a processed node u the value of childclose(u) is correctly determined by Algorithm 9.2. We do this by structural induction. We first show that the claim is true if no node below u has been processed. This means that u is a neighbor of vm, as only neighbors of vm and parents of processed nodes are inserted into the queue. Then u has no close children, thus initially childclose(u) = 1 after line 4. If u is a leaf, we are finished and childclose(u) is correctly set to 1. Otherwise, we start the DFS. We are always in the first case of the if-condition as no node below u has been processed. For every node we visit in line 9, we decrease childclose(u) by 1 in line 11. We only set the DFSnext pointer of a node that is processed in line 14, thus all DFSnext pointers below u point to themselves and line 12 does nothing. If there is only one node below u, our DFS terminates and childclose(u) is correctly set to 0. Otherwise, we stop the DFS if childclose(u) < 0, i.e., as we always decrement by 1, if childclose(u) = −1, and set DFSnext(u) to the currently considered node x. As there are only non-neighbors below u, we correctly decremented the counter for all of them including x, as claimed.

We now assume that the claim holds for all processed nodes below u. All nodes inserted into the queue have a smaller depth than already processed nodes. Thus, as we showed before, all nodes x below u with childclose(x) ≥ 0 have already been processed. In the first lines of Algorithm 9.2, we already consider all scores of close children as well as u itself. If now childclose(u) < 0, it must be −1 as we only subtracted 1 and close children can only contribute positive values. We then correctly skip the DFS, as only close children contain more neighbors than non-neighbors; children that are not close can thus never increase the score. The DFS pointer DFSnext(u) correctly remains at u as this score only

considers nodes in the subtrees of close children and u itself.

Now consider the case that we start a DFS. Whenever we visit a node x, there are three possible cases:

1. Node x has been processed and childclose(x) ≥ 0. In this case, we know that its score is correct by our induction hypothesis. Further, childclose(x) is either 0 and does not contribute to the score, or x is a close child and childclose(x) has already been considered in the score of its parent, either by our induction hypothesis or because of line 1. Therefore, these nodes are correctly skipped in line 18.

2. Node x has not been processed yet. This means x is not a neighbor of vm (otherwise it would have been processed). Thus, we correctly decrease childclose(u) to account for this non-neighbor. Then we can continue the DFS.

3. Node x has been processed and childclose(x) < 0. In this case we know from the induction hypothesis that childclose(x) = −1 and that this is exactly the number of neighbors minus the number of non-neighbors of vm of all nodes in DFS order from x up to DFSnext(x), including both, and all nodes in subtrees of children c of x with childclose(c) > −1, which we ignore anyway in the DFS. We decrease childclose(u) by 1, which correctly considers the nodes between x and DFSnext(x) in DFS order, both included. Then we jump to DFSnext(x) and do not visit DFSnext(x) but the next node in DFS order, which is correct as DFSnext(x) has already been considered.

When the DFS ends, either we have visited and thus computed the correct score for all descendants of u, or the DFS ended with childclose(u) = −1 and we store the location of the last visited node in DFSnext(u). In the latter case, all nodes up to this point have been considered as we have outlined before. Therefore, the claim is now also true for u.

In order not to need to evaluate all nodes as potential parents, we make use of the following observation:

Proposition 9.2. Only nodes that have a close child or are neighbors of vm need to be considered as parents of vm.

Proof. We show this by contradiction. Assume that the best parent u has no close children and is not a neighbor of vm. Thus attaching children of u to vm cannot decrease the number of needed edits, so we can assume that no children will be attached. Then choosing p(u) as parent of vm will save one edit, as u is not a neighbor of vm. This is a contradiction to the assumption that u is the best parent.

So far we have only evaluated edits below nodes and identified all possible parents, which are also processed, as we have established before. If we want to know for a potential parent u how many edits we can save by moving some of its children to vm, this is the sum of childclose(c) over all close children c of u. The part that is still missing is the evaluation of the edits above a potential parent u.


Theorem 9.3. Consider the subtree Tu of a node u. Then for the subgraph of Tu, either scoremax(u) < 0 or u has been processed and scoremax(u) is correctly computed by Algorithm 9.3.

Proof. First, we show that u is processed if scoremax(u) ≥ 0. There are two possibilities: Either u is the best parent, or a node below u is the best parent. In the first case, by Proposition 9.2, either u is a neighbor of vm or u has a close child, which means that u is processed. In the second case, let b be the best parent in Tu; more precisely, b is a node that has scoremax(u) vm-neighbors minus non-vm-neighbors in Xb restricted to Tu and is closest to u, i.e., there is no ancestor of b in Tu that has the same score. Again, by Proposition 9.2, we know that b is processed. Then the parent p(b) is processed and scoremax(p(b)) ≥ scoremax(b) − 1 as p(b) has at most one non-vm-neighbor more than b. This continues until we either reach u and we are done, or scoremax(a) = 0 for some node a on the path between b and u, a ≠ b and a ≠ u. However, then p(a) is at least as good a parent as b, as Xb contains as many vm-neighbors as non-vm-neighbors in Ta, as indicated by scoremax(a) = 0, and thus the score of p(a) is at least as good. This is a contradiction to b being the best parent closest to u and thus shows that there is no node with scoremax(a) = 0 on the path between b and u. This means all nodes between b and u are processed.

The proof for the correctness of the calculation of scoremax is given by structural induction on the processed nodes of the tree skeleton. We start with the initial step, which is a node u below which no node has been processed. This means u is a neighbor of vm. Further, no nodes reported to u and u has no close children, thus x and y are 0. Thus, scoremax(u) = 1, which is correct as u, as the sole neighbor of vm in Tu, is the best parent and saves one edit.

For the induction we can assume that the theorem holds for all processed children of u. Whether u is a vm-neighbor or not has the same influence for all nodes we could choose in Tu. Therefore, we do not need to reconsider any decisions that were made below u. The only decision is thus whether u is the best parent or whether the best parent in the subtree of a child c of u is the best parent. In the first case, the sum of childclose of all close children of u is the number of neighbors minus non-neighbors of vm we need to consider apart from u. In the second case, the maximum over all scoremax(c) for all reported children c of u gives us the number of neighbors minus non-neighbors apart from u. Thus, picking the maximum of x and y in lines 9 and 11 is correct. In the second case we choose the candidate of the child that had the maximum score. We then correctly add or subtract 1 from this score depending on whether u is a vm-neighbor or not, as this saves one edit or requires an additional edit.

As Tr is the whole graph and r is processed as a neighbor of vm, the best solution of the whole graph is available at r. Therefore, the QTM algorithm optimally solves the problem of finding a new parent and a set of its children that shall be adopted.
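The combination step used in the proofs above can be made concrete with a minimal C++ sketch. The names (NodeState, combineScoreMax) are hypothetical and the actual Algorithm 9.3 performs more bookkeeping (candidate tracking, DFSnext); this only illustrates how x, y and the final plus/minus one interact.

    #include <algorithm>
    #include <vector>

    // Hypothetical sketch of the per-node combination in Algorithm 9.3.
    struct NodeState {
        int childClose = 0; // edits saved by adopting this close child
        int scoreMax = 0;   // best parent-candidate score in this subtree
    };

    int combineScoreMax(const std::vector<NodeState>& reportedChildren,
                        bool uIsNeighborOfVm) {
        int x = 0; // best candidate strictly below u
        int y = 0; // score of choosing u itself as the new parent
        for (const NodeState& c : reportedChildren) {
            x = std::max(x, c.scoreMax);
            if (c.childClose > 0) y += c.childClose; // only close children help
        }
        // u being a vm-neighbor saves one edit; otherwise one edit is needed.
        return std::max(x, y) + (uIsNeighborOfVm ? 1 : -1);
    }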


9.5.5 Proof of the Running Time

After showing the correctness of the algorithm, we now show that the running time is indeed O(m log(∆)) per iteration and amortized O(d log(d)) per node.

During the whole algorithm we maintain a depth value for each node that specifies the depth in the forest at which the node is located. Whenever we move a node, we update these depth values. This involves decreasing the depth values of all descendants of the node in its original position and increasing the depth values of all descendants of the node at the new position. Unfortunately, it is not obvious that this is possible in the claimed running time, as a node vm might have more than O(deg(vm)) descendants.

Note that a node is adjacent to all its descendants and ancestors in the edited graph. This means that every ancestor or descendant that is not adjacent to the node causes an insert. Therefore, vm must be a neighbor of at least half of the ancestors and descendants at the target position, as otherwise choosing r as parent would have been better. This means that updating the depth values at the destination is possible in O(d) time.

For the update of the values in the original position we need a different, more complicated argument. First of all, we assume that initially the total number of edits never exceeds the number of edges, as otherwise we could simply delete all edges and get fewer edits¹. To amortize the update costs of nodes that have more descendants and ancestors than their degree, we give each node a token for each of its neighbors in the edited graph. As the number of edits is at most m, the number of initially distributed tokens is in O(m). Whenever we move a node vm, it generates tokens for all its new neighbors and itself, i.e., in total at most 2 · deg(vm) tokens. Therefore, a node always has a token for each of its ancestors and descendants and can use these tokens to account for updating the depth of its previous descendants. In each round only O(m) tokens are generated, therefore updating the depth values of a node takes amortized time O(d) per node and O(m) per iteration.

Using the same argument, we can also account for the time that is needed for updating the pointers of each node to its parents and children and for counting the number of initially needed or saved edits.

What we have shown so far means that once we know the best destination, we can move a node and update all depth values in time O(d) amortized over an iteration in which all nodes are moved. The remaining claim is that we can determine the new parent and the new children in time O(d log(d)) per node. More precisely, we will show that only O(d) nodes are inserted into the queue and that we need amortized constant time for processing a node. A standard max-heap that needs O(log(n)) time per operation can be used to implement the queue.

We maintain scores per node, in particular childclose, scoremax, DFSnext and a flag indicating whether a node was processed. As we only set them for processed nodes, it suffices to initialize them once and then reset them for processed nodes. Thus, they do not require any additional running time.

¹ Our experiments show that this is not true when using the initialization heuristic; the running time in practice might thus be higher in the first round when the initialization heuristic is used.


The basic idea of the main proof is that each neighbor of vm gets two tokens that it can use to visit descendants or distribute to ancestors, which can use those tokens themselves. This is reflected by the fact that we increase childclose(u) and scoremax(u) by 1 if u is a neighbor of vm. All nodes that are processed are neighbors of vm or have a child c with childclose(c) > 0 or scoremax(c) > 0. When we process a node u that is not a vm-neighbor, we consume tokens that are passed to u from close children and from children with scoremax(c) > 0. At the end, the remaining tokens are passed to the parent.

First, we consider Algorithm 9.3 and ignore the computation of childclose(u). For all processed nodes u, childclose(u) ≤ scoremax(u). This is because in lines 9 and 11, scoremax(u) is set to the maximum of x and y plus/minus 1. However, y plus/minus 1 is also the initialization of childclose(u), and after this initialization, childclose(u) is only decreased. Thus, the sum of scoremax(c) over all reported u-children is an upper bound for y. Regardless of whether x > y or not, we can therefore always attribute the tokens in scoremax(u) and the token consumed in line 11 to tokens that have been passed to u as tokens from scoremax(c) for some children c of u.

Next, we consider the computation of childclose(u) in Algorithm 9.2. If u has no close children and is not a vm-neighbor, we do not start a DFS and attribute the constant amount of work done in Algorithm 9.2 to the token consumed in Algorithm 9.3. If u is not a vm-neighbor, we consume a token for processing u in line 6. Apart from the DFS, only constant work is done per node, so consuming one token is enough for that. For each node visited in the DFS, only a constant amount of work is needed for traversing the tree, i.e., we attribute possibly traversing a node multiple times in order to find the next node to the first visit. Without keeping a stack, this requires a tree structure in which we can determine the next child c′ after a child c of a node u in constant time. This can be implemented by storing in node c the position of c in the array (or list) of children of p(c). This also allows deleting entries in the children list in constant time (in an array, deletion can be implemented as a swap with the last child).
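The constant-time navigation and deletion in the children lists can be sketched as follows; this is a minimal illustration of ours, not the actual implementation.

    #include <cstddef>
    #include <vector>

    // Each node stores its position in the children array of its parent,
    // giving constant-time access to the next sibling and constant-time
    // deletion by swapping with the last child.
    struct SkeletonForest {
        std::vector<std::vector<int>> children; // children[u]: child array of u
        std::vector<std::size_t> posInParent;   // position of u in children[p(u)]

        // The child of u following c, or -1 if c is the last child.
        int nextChild(int u, int c) const {
            std::size_t i = posInParent[c] + 1;
            return i < children[u].size() ? children[u][i] : -1;
        }

        // Remove c from u's child array in constant time.
        void removeChild(int u, int c) {
            std::size_t i = posInParent[c];
            children[u][i] = children[u].back();
            posInParent[children[u][i]] = i; // fix the moved child's pointer
            children[u].pop_back();
        }
    };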

Whenever we visit a node x that has not been processed yet or that has childclose(x) < 0, we consume one token of childclose(u). When this is not the case, i.e., childclose(x) > −1 and we thus skip x, x has been processed already and we account our visit of x to the processing of x. This is okay as we skip each node only once during a DFS: after the DFS starting at u has finished, either childclose(u) > −1 and an upcoming DFS will not descend into the subtree of u anymore, or the DFS stopped in line 13 and thus DFSnext(u) has been set to the last visited node (which is not a skipped node). In the latter case, when we visit u in an upcoming DFS, this DFS will directly jump to DFSnext(u) after visiting u and thus no skipped node of the DFS of u will be visited again. Note that by decreasing childclose(u) to −1 we actually consume one more token than we had. However, for this last step we only need a constant amount of work, which can be accounted for by the processing time of u. This means that in total we only process O(d) nodes and do amortized constant work per node, as claimed.


9.6 Experimental Evaluation

We evaluate our QTM algorithm on the small instances used by Nastos and Gao [NG13], on larger synthetic graphs, and on large real-world social networks and web graphs. We measured both the number of edits needed and the required running time. For each graph we also report the lower bound b on the number of necessary edits that we obtained using our lower bound algorithm. We implemented the algorithms in C++ using NetworKit [SSM16]; our implementation is available online². The implementation of the lower bound uses robin hood hashing³ [Cel86]. All experiments were performed on an Intel Core i7-2600K CPU with 32GB RAM. We ran all algorithms ten times with ten different random node id permutations.

9.6.1 Comparison with Nastos and Gao’s Results

We compare QTM to the algorithm by Nastos and Gao on the five social networks they already considered [NG13], namely karate [Zac77], grass_web [DHC95], lesmis [Knu93], dolphins [Lus+04], and football [GN02] (see Section 2.2 for an introduction). Nastos and Gao [NG13] did not report any running times. We therefore re-implemented their algorithm as described in Section 9.3. Similar to their implementation, we use a simple exact bounded search tree (BST) algorithm for the last 10 edits. This bounded search tree algorithm uses none of the optimizations described in Chapter 10. In Table 9.1 we report the minimum and average number of edits over ten runs. Our implementation of their algorithm never needs more edits than they reported⁴. For two of the graphs (dolphins and lesmis) our implementation needs slightly fewer edits due to different tie-breaking rules. For all but one graph, QTM is at least as good as the algorithm of Nastos and Gao in terms of edits; QTM needs only one more edit than Nastos and Gao for the grass_web graph. The QTM algorithm is much faster than their algorithm: it needs at most 2.5 milliseconds, while the heuristic of Nastos and Gao needs up to 6 seconds without bounded search tree and almost 17 seconds with bounded search tree. The number of iterations necessary is at most 5. As the last round only checks whether we are finished, four iterations would have been enough.

9.6.2 Large Graphs

We use a set of seven social networks and four web graphs to examine the scalability and the quality of QTM. Our benchmark set consists of two Facebook graphs [TMP12] (see also Section 2.2.3) and five SNAP graphs [LK14] as social networks (see also Section 2.2.2) and four web graphs from the 10th DIMACS Implementation Challenge [Bad+13; Bol+04; Bol+11; BV04] (see also Section 2.2.1).

² https://github.com/michitux/networkit/tree/upstream/qtm
³ https://github.com/martinus/robin-hood-hashing
⁴ Except on karate, where they report 20 edits due to a typo; they also need 21 edits [Nas15].


Table 9.1: Comparison of QTM and [NG13]. We report n and m, the lower bound b, the number of edits (minimum, mean and standard deviation), the mean and maximum number of QTM iterations, and running times in ms.

Name       n    m    b    Algorithm    Edits (min / mean / std)   It. (mean / max)   Time [ms] (mean / std)
dolphins   62   159  46   QTM           72 /  74.1 /  1.1         2.7 / 4.0                 0.6 /     0.1
                          NG w/ BST     73 /  74.7 /  0.9         --                   15 594.0 / 2 019.0
                          NG w/o BST    73 /  74.8 /  0.8         --                      301.3 /     4.0
football   115  613  146  QTM          251 / 254.3 /  2.7         3.5 / 4.0                 2.5 /     0.4
                          NG w/ BST    255 / 255.0 /  0.0         --                   16 623.3 / 3 640.6
                          NG w/o BST   255 / 255.0 /  0.0         --                    6 234.6 /    37.7
grass_web  86   113  21   QTM           35 /  35.2 /  0.4         2.0 / 2.0                 0.5 /     0.1
                          NG w/ BST     34 /  34.6 /  0.5         --                   13 020.0 / 3 909.8
                          NG w/o BST    38 /  38.0 /  0.0         --                      184.6 /     1.2
karate     34   78   17   QTM           21 /  21.2 /  0.4         2.0 / 2.0                 0.4 /     0.1
                          NG w/ BST     21 /  21.0 /  0.0         --                    9 676.6 /   607.4
                          NG w/o BST    21 /  21.0 /  0.0         --                       28.1 /     0.3
lesmis     77   254  43   QTM           60 /  60.5 /  0.5         3.3 / 5.0                 1.4 /     0.3
                          NG w/ BST     60 /  60.8 /  1.0         --                   16 919.1 / 3 487.7
                          NG w/o BST    60 /  77.1 / 32.4         --                      625.0 /   226.4

First, we consider the lower bounds in Table 9.2. Apart from the size of the graphs, we report the lower bound and the running time of the lower bound calculation. For both running times and bound sizes, we report the average and maximum over ten different node id permutations. The graphs are sorted by the number of edges m. The running times clearly show that the running time does not only depend on m but also on the degrees, i.e., graphs with fewer nodes but a comparable number of edges have a higher running time. The average and the maximum do not differ significantly. It is interesting to see that, while for the social networks the bounds are almost always close to the maximum of m/3, for the web graphs the bounds are much smaller, frequently less than m/10.

In Table 9.3, we evaluate two variants of QTM. The first is the standard variant, which starts with a non-trivial skeleton obtained by the heuristic described in Section 9.4. The second variant starts with a trivial skeleton where every node is a root. We compare these two variants to determine which part of our algorithm has which influence on the final result. For the standard variant we report the number of edits needed before any node is moved. With a trivial skeleton this number is meaningless, and thus we report the number of edits after one round instead. All other measures are straightforward and are explained in the table's caption.

Even though for some of the graphs the mover needs more than 20 iterations to


Table 9.2: Size and running time of the lower bounds on our benchmark set of social networks and web graphs.

Name                 n            m   Bound (mean)   Bound (max)   Time mean [s]   Time max [s]
Social networks
Caltech            769       16 656        5 323.3         5 362             0.0            0.0
amazon         334 863      925 872      228 062.3       228 231             0.4            0.4
dblp           317 080    1 049 866      230 119.3       230 354             0.5            0.5
Penn            41 554    1 362 229      449 398.5       449 570             2.9            2.9
youtube      1 134 890    2 987 624      883 955.4       886 687            16.5           18.3
lj           3 997 962   34 681 189   10 703 423.6    10 704 794            82.6           83.1
orkut        3 072 441  117 185 083   38 757 018.6    38 759 648           917.1        1 150.0
Web graphs
cnr-2000       325 557    2 738 969      228 084.2       229 161             2.8            2.9
in-2004      1 382 908   13 591 473      817 068.3       818 550            20.4           21.2
eu-2005        862 664   16 138 468    2 281 512.0     2 285 855            52.6           55.1
uk-2002     18 520 486  261 787 258   18 633 980.6    18 641 948           409.0          437.7

terminate, the results do not change significantly compared to the results after round 4. In practice, we can thus stop after 4 rounds without incurring a significant quality penalty. It is interesting to see that for some of the social networks the initialization algorithm produces a skeleton that induces more than m edits (e.g., in the case of the “Penn” graph), but the results are still always slightly better than with a trivial initial skeleton. This remains true when we do not abort moving after 4 rounds. For some of the web graphs, the non-trivial initial skeleton does not seem to be useful: not only is the initial number of edits much higher than the number of edits finally needed, for two of the four web graphs the number of edits needed in the end is also slightly higher than if a trivial initial skeleton is used.

While the QTM algorithm needs to edit between approximately 50 and 80% of the edges of the social networks, the edits of the web graphs amount to only between 10 and 25% of the edges. This suggests that quasi-threshold graphs might be a good model for web graphs, while for social networks they represent only a core of the graph that is hidden by a lot of noise. For the web graphs, the heuristic never needs more than twice as many edits as the lower bound indicates. As most of the bounds on the social networks are close to the limit of m/3, it is not surprising that, in particular on the larger graphs, they are not as close to the number of edits actually needed. While our bounds thus confirm that these graphs are far from quasi-threshold graphs, there is still the possibility that quasi-threshold graphs exist that are significantly closer than those found by QTM.

Concerning the running time, one can clearly see that QTM is scalable and suitable for large real-world networks.


Table 9.3: Results for large real-world graphs. The number of nodes n, the number of edges m, the lower bound b and the number of edits are reported in thousands. Column “I” indicates whether we start with a trivial skeleton or not: • indicates an initial skeleton as described in Section 9.4 and ◦ indicates a trivial skeleton. Edits and running time are reported for a maximum number of 0 (respectively 1 for a trivial initial skeleton), 4 and ∞ iterations. For the latter, the number of actually needed iterations is reported as “It”. Edits, iterations and running time are averages over the ten runs.

Name       n [K]    m [K]     b [K]   I    Edits 0/1 [K]   Edits 4 [K]   Edits ∞ [K]     It   Time 0/1 [s]   Time 4 [s]   Time ∞ [s]
Social networks
Caltech     0.77     16.66      5.4   •             15.8          11.6          11.6    8.5            0.0          0.0          0.1
                                      ◦             12.6          11.7          11.6    9.4            0.0          0.0          0.1
amazon       335       926      228   •              495           392           392    7.2            0.3          5.5          9.3
                                      ◦              433           403           403    8.9            1.3          4.9         10.7
dblp         317     1 050      230   •              478           415           415    7.2            0.4          5.8          9.9
                                      ◦              444           424           423    9.0            1.4          5.2         11.5
Penn        41.6     1 362      450   •            1 499         1 129         1 127   14.4            0.6          4.2         13.5
                                      ◦            1 174         1 133         1 129   16.2            1.0          3.7         14.4
youtube    1 135     2 988      887   •            2 169         1 961         1 961    9.8            1.4         31.3         73.6
                                      ◦            2 007         1 983         1 983   10.0            7.1         28.9         72.7
lj         3 998    34 681   10 705   •           32 451        25 607        25 577   18.8           23.5        241.9      1 036.0
                                      ◦           26 794        25 803        25 749   19.9           58.3        225.9      1 101.3
orkut      3 072   117 185   38 760   •          133 086       103 426       103 278   24.2          115.2        866.4      4 601.3
                                      ◦          106 367       103 786       103 507   30.2          187.9        738.4      5 538.5
Web graphs
cnr-2000     326     2 739      229   •            1 028           409           407   11.2            0.8         12.8         33.8
                                      ◦              502           410           409   10.7            3.2         11.8         30.8
in-2004    1 383    13 591      819   •            2 700         1 402         1 401   11.0            7.9         72.4        182.3
                                      ◦            1 909         1 392         1 389   13.5           16.6         65.0        217.6
eu-2005      863    16 139    2 286   •            7 613         3 917         3 906   13.7            6.9         90.7        287.7
                                      ◦            4 690         3 919         3 910   14.5           22.6         85.6        303.5
uk-2002   18 520   261 787   18 642   •           68 969        31 218        31 178   19.1          200.6      1 638.0      6 875.5
                                      ◦           42 193        31 092        31 042   22.3          399.8      1 609.6      8 651.8


9.6.3 Synthetic Graphs

As we cannot show for our real-world networks that the edit distance we obtain is close to the optimum, we generated synthetic graphs by generating quasi-threshold graphs and applying random edits to them. We generate each connected component of the quasi-threshold graph as the reachability graph of a rooted tree. For generating a tree, we set v0 as the root node and each node vi ∈ {v1, . . . , vn−1} chooses a parent in {v0, . . . , vi−1} uniformly at random.

Many real-world networks, including social networks, exhibit a community size distribution that is similar to a power law distribution [Lan+10]. Therefore, we use a power law distribution Pld([10, 0.2 · n), 1) for the community sizes. For this, we generate trees of the respective sizes.

For k edits, we insert 0.8 · k new edges and delete 0.2 · k old edges of the quasi-threshold graph, chosen uniformly at random. Thus, with these edits applied, the maximum editing distance to the original graph is k. We use more insertions than deletions, as preliminary experiments on real-world networks showed that during editing a lot more edges are deleted than inserted.

In Table 9.4, we show the results for the generated graphs. The first column shows the number of random edits we performed. For all generated graphs, our QTM algorithm finds a quasi-threshold graph that is at least as close as the original one. Omitting the initialization gives much worse results for low numbers of edits and slightly worse results for higher numbers of edits. The lower bound is close to the generated and found number of edits for low numbers of edits; for very high numbers of edits, it is close to its theoretical maximum of m/3. This shows that the initialization algorithm from Section 9.4 is necessary to achieve good quality on graphs that need only few edits. As it seems to be beneficial for most graphs and not very harmful for the rest, we suggest using the initialization algorithm for all graphs. All in all, this shows that the QTM algorithm finds reasonable edits but depends on a good initialization heuristic. An interesting direction for future work would be to make QTM more robust in this regard by implementing strategies to escape such local minima.
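A minimal sketch of the generation of a single component is given below; it is our own illustration (the full generator additionally samples the power law community sizes and applies the random insertions and deletions described above).

    #include <random>
    #include <utility>
    #include <vector>

    // Generate one quasi-threshold component of size n: build a random
    // recursive tree and return the edges of its reachability graph, i.e.,
    // connect every node to all of its ancestors.
    std::vector<std::pair<int, int>> randomQtComponent(int n, std::mt19937& rng) {
        std::vector<int> parent(n, -1); // v0 is the root
        for (int v = 1; v < n; ++v) {
            std::uniform_int_distribution<int> pick(0, v - 1);
            parent[v] = pick(rng); // v_i picks its parent among v_0, ..., v_{i-1}
        }
        std::vector<std::pair<int, int>> edges;
        for (int v = 1; v < n; ++v)
            for (int a = parent[v]; a != -1; a = parent[a])
                edges.emplace_back(a, v); // v is adjacent to every ancestor
        return edges;
    }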

9.6.4 Case Study: Caltech

The main application of our work is community detection. While a thorough experimental evaluation of its usefulness in this context is future work, we want to give a promising outlook. Figure 9.3 depicts the edited Caltech university Facebook network from [TMP12]. As described in Section 2.2.3, nodes are students and edges are friendships on Facebook. We color the nodes by the dormitories of the students; black indicates that the dormitory is not known. The picture shows that our algorithm succeeds at identifying some of this structure. While some dormitories are clearly separated, in some cases nodes of several dormitories are joined, indicating some kind of overlapping structure.


Table 9.4: Results for the generated graphs. “Rand. Ed.” denotes the number of random edits applied; the remaining columns are as in Table 9.3, with absolute values instead of thousands.

Rand. Ed.   n          m           b          I   Edits 0/1    Edits 4      Edits ∞        It    Time 0/1 [s]   Time 4 [s]   Time ∞ [s]
20          100        269         20         •        34.4         20.0         20.0     2.4             0.0          0.0          0.0
                                              ◦        36.7         21.4         21.4     3.9             0.0          0.0          0.0
400         100        497         159        •       420.5        352.0        351.9     3.9             0.0          0.0          0.0
                                              ◦       372.0        363.5        363.5     3.9             0.0          0.0          0.0
20          1 000      4 030       19         •        38.0         19.0         19.0     2.0             0.0          0.0          0.0
                                              ◦       166.4         21.7         21.7     4.1             0.0          0.0          0.0
400         1 000      4 258       377        •       585.3        391.2        391.2     3.4             0.0          0.0          0.0
                                              ◦       594.4        393.6        393.6     4.4             0.0          0.0          0.0
8K          1 000      8 818       2 876      •       8 268        7 219      7 218.5     5.2             0.0          0.0          0.0
                                              ◦       7 647        7 511      7 490.6     8.3             0.0          0.0          0.0
20          10 000     66 081      20         •        47.8         20.0         20.0     2.0             0.0          0.1          0.1
                                              ◦       1 669         69.8         69.8     4.6             0.1          0.2          0.2
400         10 000     66 309      390        •       849.3        390.6        390.6     3.2             0.0          0.2          0.2
                                              ◦       2 143        440.7        440.7     4.8             0.1          0.2          0.2
8K          10 000     70 869      7 526      •      11 626        7 902        7 902     3.9             0.0          0.2          0.2
                                              ◦      10 184        7 912        7 911     5.1             0.1          0.2          0.3
160K        10 000     162 069     53 451     •     157 114      144 885      144 880     5.8             0.0          0.6          0.8
                                              ◦     150 227      148 206      147 892     9.7             0.1          0.5          1.2
20          100 000    833 565     20         •        64.3         20.0         20.0     2.0             0.2          1.7          1.7
                                              ◦      17 785        529.9        529.6     5.3             0.8          2.8          3.7
400         100 000    833 793     390        •       1 047        391.3        391.3     3.2             0.2          2.6          2.6
                                              ◦      18 319        900.0        899.4     5.5             0.8          2.9          3.9
8K          100 000    838 353     7 851      •      18 550        7 889        7 889     3.4             0.2          2.8          2.8
                                              ◦      26 144        8 381        8 381     5.4             0.8          2.8          3.8
160K        100 000    929 553     146 268    •     199 558      158 021      158 021     4.6             0.2          3.5          4.1
                                              ◦     193 071      158 031      158 025     6.1             1.0          3.3          4.9
3.2M        100 000    2 753 553   912 228    •   2 728 804    2 647 566    2 647 564     5.8             1.1         12.5         16.8
                                              ◦   2 655 538    2 654 738    2 654 736     5.7             3.0         11.9         16.9
20          1 000 000  10 648 647  20         •        68.9         20.0         20.0     2.2             3.6         32.5         32.3
                                              ◦     181 540        5 116        5 111     6.0            16.4         54.3         79.7
400         1 000 000  10 648 875  395        •       1 161        395.1        395.1     3.0             3.3         43.8         43.8
                                              ◦     182 248        5 523        5 518     6.1            15.9         52.9         78.8
8K          1 000 000  10 653 435  7 906      •      25 085        7 912        7 912     3.5             3.4         50.1         50.0
                                              ◦     189 504       13 006       13 001     6.0            16.6         53.4         78.0
160K        1 000 000  10 744 635  157 291    •     369 501      158 808      158 808     4.1             3.5         57.5         58.8
                                              ◦     346 462      163 337      163 330     6.4            17.4         54.8         84.5
3.2M        1 000 000  12 568 635  2 793 894  •   3 747 793    3 163 277    3 163 273     5.8             4.7         71.8        101.2
                                              ◦   3 820 935    3 164 175    3 163 848     7.5            26.2         78.2        134.7


Figure 9.3: Edited Caltech network, nodes colored by dormitories. Black indicates that the dormitory information has not been specified.


9.7 Conclusion

We have introduced Quasi-Threshold Mover (QTM), the first heuristic algorithm to solve the quasi-threshold editing problem in practice for large graphs. As a side result, we have presented a simple certifying linear-time algorithm for the quasi-threshold recognition problem. A variant of our recognition algorithm is also used as initialization for the QTM algorithm. In an extensive experimental study with large real-world networks we have shown that it scales very well in practice. We generated graphs by applying random edits to quasi-threshold graphs. QTM succeeds on these random graphs and often even finds other quasi-threshold graphs that are closer to the edited graph than the original quasi-threshold graph. A surprising result is that web graphs are much closer to quasi-threshold graphs than social networks, for which quasi-threshold graphs were introduced as a community detection method.

Our QTM algorithm can be adapted to consider edit costs per node pair. The actual running time then depends on the actual costs and might be quadratic per round in the worst case. It is the subject of future work to determine whether the running time and editing distance of this variant are reasonable in practice. Recently, for the class of P4-free graphs, finding an inclusion-minimal set of edits has been proposed as a fast alternative to exact solutions [Cre20]. As their approach is conceptually similar to QTM, we are investigating whether a variant of QTM could be used for inclusion-minimal quasi-threshold editing. Further, our QTM algorithm might be adapted for the more restricted problem of threshold editing, which is NP-hard as well [Dra+15].

Concerning the running time, we are currently investigating whether replacing the priority queue by a bucket priority queue is possible in order to reduce the running time per round to O(n + m). Keys in the priority queue are depths in the skeleton forest and might thus be as high as n, while we can only spend O(deg(vm)) time for moving node vm, which is why we did not previously consider this option. However, it turns out that for a node vm, we only need to consider neighbors up to a depth of 2 · deg(vm). This is because if we want to keep a node u as neighbor, vm must be connected in the edited graph to all nodes above u in the skeleton forest. If the depth of u is more than 2 · deg(vm), this incurs more than deg(vm) edits, as there are more than deg(vm) non-vm-neighbors above u, and thus the option of deleting all edges to vm is better.


10 Engineering Exact Quasi-Threshold Editing

This chapter is based on joint work with Lars Gottesbüren, Philipp Schoch, Ben Strasser, Dorothea Wagner and Sven Zühlsdorf [Got+20b; Got+20a]. We thank James Nastos and Mark Ortmann for helpful discussions. Compared to the publication, we have shortened the introduction.

10.1 Introduction

In this chapter we study algorithms to solve editing problems exactly. The distance between two graphs G and H with the same node set is the minimum number of edge insertions or deletions needed to transform G into H. Given a graph class C and a graph G, the editing problem asks for a graph H ∈ C closest to G.

While the theoretical part of our study considers any F-free edge editing problem for a finite set of subgraphs F, our experimental study considers {C4, P4}-free graphs. Our goal is to improve the running time of exact {C4, P4}-free editing in practice in order to make it feasible at least for small networks. This allows us to study exact solutions of the community detection problem and to verify the quality of heuristics.

10.1.1 Our Contribution

In this chapter, we compare two different methods for solving F-free editing problems. The first is a branch-and-bound FPT algorithm, the second is an ILP. For the FPT algorithm, we propose a novel lower bound algorithm based on local search heuristics for independent sets as well as an improved branching strategy. Additionally, we parallelize our implementation. For the ILP, we engineer several variants of row generation. We assess the running time improvements of the different optimizations for quasi-threshold editing on a large benchmark set of 716 graphs that are connected components of a protein similarity graph. This benchmark set has previously been used to evaluate cluster editing algorithms [Rah+07; Böc+08]. On 75% of the instances, our improved bounds and optimized branching choices yield speedups of one to three orders of magnitude for the FPT algorithm. For the ILP, which we solve using Gurobi [Gur20], we are only able to achieve small speedups. With all optimizations, in the median, the FPT algorithm is twice as fast as the ILP, even when enumerating all possible optimal solutions exactly once. Compared with the parallel execution of Gurobi [Gur20], the FPT algorithm achieves better speedups. Additionally, we evaluate an LP relaxation as lower bound. We prove

that its bounds are at least as good as our local search bounds. In our experiments, however, it is too slow to be competitive. Further, we compare our exact solutions with heuristic solutions found by QTM [Bra+15b], see Chapter 9. It turns out that many heuristic solutions are exact and all but one of them are close to the exact solution. Additionally, we are able to solve four out of the five social networks considered in [NG13], of which only one was solved previously [Nas15]. For those social networks, we provide a detailed analysis of the solution space. We show that for none of them there is just a single solution. For one network, we show that quasi-threshold editing gives over 2000 different community structures.

10.1.2 Outline

We start by introducing the preliminaries in Section 10.2. We describe the ILP formulation and the optimizations we apply to it in Section 10.3. In Section 10.4, we then introduce the branch-and-bound FPT algorithm including existing and novel optimizations. In Section 10.5, we present our experimental setup and evaluation. We conclude in Section 10.6.

10.2 Preliminaries

All graphs in this chapter are undirected, unweighted, and finite. Further, no graph has self-loops or multi-edges. A graph G = (VG, EG) consists of n := |VG| nodes and m := |EG| undirected edges. By E̅G, we denote the complement of the edges. In the following, k denotes the maximum number of edits.

10.3 Integer Linear Programming

In this section, we describe an ILP formulation for F-free editing that is based on an existing formulation for cluster editing [GW89]. Further, we introduce our optimizations based on row generation and modified constraints to make the ILP practical for small instances.

For every node pair {u, v} of nodes in VG we introduce a variable x_uv ∈ {0, 1} which is 1 if the node pair is an edge in the edited graph and 0 otherwise. We add constraints to ensure that no forbidden subgraph H ∈ F can be induced in G via an injective node mapping π:

\forall H \in \mathcal{F},\ \forall \pi : V_H \hookrightarrow V_G :\quad \sum_{\{u,v\} \in E_H} \left(1 - x_{\pi(u)\pi(v)}\right) + \sum_{\{u,v\} \in \overline{E_H}} x_{\pi(u)\pi(v)} \;\ge\; 1 \qquad (10.1)

The objective minimizes the number of edits:

\min \; \sum_{\{u,v\} \in E_G} \left(1 - x_{uv}\right) + \sum_{\{u,v\} \in \overline{E_G}} x_{uv} \qquad (10.2)


10.3.1 Row Generation

Generating all of the above-mentioned constraints is infeasible, even for small instances. Row generation (also called lazy constraints) aims to speed up ILP solvers by starting with a small subset of the constraints and subsequently adding constraints that are violated in intermediate solutions. We start with constraints for forbidden subgraphs in the input graph. In our experiments, we consider two options to add constraints violated in an intermediate solution: adding either all violated constraints or only one.

The ILP solver uses LP relaxations to prune its search. These can be strengthened by adding constraints from Equation 10.1 that are violated by the LP relaxation. We generate constraints in three steps. First, we consider each node pair {u, v} for which the relaxation solution has a value different from the input graph. We edit it, then enumerate the forbidden subgraph embeddings containing u and v, add the constraint that is most violated (i.e., whose left side is furthest below 1) and then revert the edit. Ties are broken uniformly at random. Second, we apply the same procedure to the best heuristic solution found so far. Third, we round the LP solution, i.e., an edge exists iff the corresponding variable is greater than 0.5. We then list forbidden subgraph embeddings in this rounded solution and add the corresponding most violated constraint if there is any. The listing skips forbidden subgraphs for which the corresponding constraint has already been added.
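As an illustration, the lazy addition of constraints for intermediate integer solutions could be wired up via Gurobi's C++ callback interface roughly as follows. This is a sketch under our own assumptions: findViolatedLhs is a hypothetical placeholder for the subgraph enumeration described above, not part of the Gurobi API.

    #include <vector>
    #include "gurobi_c++.h"

    class ForbiddenSubgraphCallback : public GRBCallback {
    protected:
        void callback() override {
            if (where != GRB_CB_MIPSOL) return; // new intermediate solution
            for (const GRBLinExpr& lhs : findViolatedLhs())
                addLazy(lhs >= 1.0); // add a violated constraint of form (10.1)
        }
    private:
        // Hypothetical: query variable values via getSolution(), enumerate
        // forbidden subgraph embeddings and return the left-hand sides of
        // Equation 10.1 that evaluate to less than 1.
        std::vector<GRBLinExpr> findViolatedLhs() { return {}; }
    };

For lazy constraints to take effect, the LazyConstraints parameter has to be set on the model and the callback registered via setCallback; both are part of the regular Gurobi C++ interface.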

10.3.2 Optimizing Constraints for {C4, P4}-free Editing

If one forbidden subgraph can be transformed into another by a single edit, we can omit a node pair from the constraint for this subgraph. This is similar to the optimization described in Section 10.4.2. For a P4, this is the node pair consisting of the two degree-one nodes. For a C4, we can omit any one of its four edges. We always consider all four possibilities, and in the initial constraint generation as well as in the basic row generation variant we add all of them. With this optimization, the constraints for C4s and P4s are identical. We can also formulate a constraint for a C4 that explicitly models that two deletions or one insertion are required:

\forall (u_1, u_2, u_3, u_4) \in V_G^4 :\quad 0.5 \cdot \sum_{i=1}^{4} \left(1 - x_{u_i u_{i+1}}\right) + x_{u_1 u_3} + x_{u_2 u_4} \;\ge\; 1 \qquad (10.3)

where indices are taken cyclically, i.e., u_5 = u_1.

10.4 The FPT Branch-and-Bound Algorithm

The FPT algorithm [Cai96] is a branch-and-bound algorithm. For a given maximum number of edits k, it either reports that no solution exists or returns a set of k edits. It works as follows: find a forbidden subgraph H and branch on all possible edits in H. As H is induced, only edits in H can destroy it and thus one of these edits must be part of the solution. The algorithm is then recursively called for each branch with k − 1 remaining edits.
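Schematically, the search can be sketched as follows; Graph, toggleEdge and findForbiddenSubgraph are hypothetical placeholders, and all optimizations discussed below are omitted.

    #include <vector>

    struct Edit { int u, v; };

    // Returns true and leaves the performed edits in `solution` if g can be
    // made F-free with at most k edits.
    bool editSearch(Graph& g, int k, std::vector<Edit>& solution) {
        std::vector<Edit> h = findForbiddenSubgraph(g); // node pairs of one H
        if (h.empty()) return true; // no forbidden subgraph left: done
        if (k == 0) return false;   // H remains, but no edits are left
        for (const Edit& e : h) {   // branch on every node pair of H
            g.toggleEdge(e.u, e.v); // insert or delete the edge
            solution.push_back(e);
            if (editSearch(g, k - 1, solution)) return true;
            solution.pop_back();    // revert and try the next branch
            g.toggleEdge(e.u, e.v);
        }
        return false;
    }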


Denote by p the maximum number of nodes in a forbidden subgraph. Finding H can be done trivially in time O(n^p) by enumerating all subgraphs of the required size. For specific sets of forbidden subgraphs, such as {C4, P4}, this can be improved to O(n + m) [Chu08; Bra+15b], see also Section 9.4.

Every pair of nodes in H is a valid edit. The branching factor is therefore p · (p − 1)/2. The depth of the recursion is bounded by the maximum number of edits k. The total running time is therefore in O(p^{2k} · n^p) for general families of forbidden subgraphs. For quasi-threshold editing the running time is O(6^k · (n + m)). This can be improved to O(5^k · (n + m)) by applying the optimization described in Section 10.4.2.

For finding the minimum number of edits kopt, the algorithm needs to be executed for increasing values of k until a solution is returned. For a branching factor of 2 or larger, the running time of all k < kopt together is at most the running time for kopt. Thus, the total running time is dominated by the running time for kopt.

In the following, we describe several optimizations to reduce the number of explored branches in practice. We describe existing techniques for avoiding redundant exploration of branches (Section 10.4.1) and for skipping certain branches (Section 10.4.2) as well as lower bounds (Section 10.4.3). We introduce a novel local search lower bound (Section 10.4.4), optimized branching choices (Section 10.4.5), early pruning of branches (Section 10.4.6) and a simple parallelization (Section 10.4.7). In Appendix 10.7 we provide in-depth implementation details.

10.4.1 Avoiding Redundancy

Damaschke [Dam08] proposes to block node pairs in order to list every solution exactly once. When spawning a search tree node x through editing a node pair, it is neither useful to undo that edit in the sub-search-tree rooted at x, nor is it useful to perform the edit in sibling search trees. While this has been introduced for cluster editing, the technique can be applied to arbitrary F-free editing problems.

Our algorithm maintains a global symmetric n × n bit matrix. The entries in the matrix correspond to node pairs. If a bit is set, the corresponding node pair must not be edited anymore. We refer to these node pairs as blocked. In the following, we describe in detail how these optimizations that were introduced in [Dam08] work.

No Undo. Every solution that edits a node pair twice can be improved by not editing the node pair at all. We exploit this observation by setting the bit corresponding to the performed edit when recursing. When ascending from the recursion, we reset the bit. The same optimization has also been used in [Boh15] and is the basis of efficient branching rules for cluster editing [Böc+09].

No Redundancy. A possible recursion tree is depicted in Figure 10.1. Note how multiple branches contain the same set of edits, but the edits appear in a different order. We ensure that every branch enumerates a different set of edits by unblocking a node pair only after all sibling edits have been explored. In particular, this ensures that every solution is enumerated exactly once.



Figure 10.1: Example recursion tree. Children are explored from left to right. The “No Redundancy” optimization prunes the red part.

Consider the first recursion level of the example in Figure 10.1. The edits are explored in the following order: A, then B, and finally C. Before descending into the branch of A, we set A's bit. After ascending from A's branch and reverting the corresponding edit, the corresponding bit is not reset. In addition to A's bit, we set B's bit and descend into B's branch. Finally, we ascend from B's branch, leave its bit set, set C's bit and descend into C's branch. After all edits in a recursion level are explored, all bits set in this level are reset, i.e., we reset A's, B's and C's bits.
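A simplified sketch of one branching level with both optimizations combined (NodePair, BlockedMatrix, Graph and recurse are hypothetical placeholders):

    #include <vector>

    // Bits set for a branch stay set for its siblings ("no redundancy") and
    // are cleared only after the whole level has been explored; inside a
    // branch the edited pair itself stays blocked ("no undo").
    void branchOnSubgraph(Graph& g, const std::vector<NodePair>& pairs,
                          int k, BlockedMatrix& blocked) {
        std::vector<NodePair> setAtThisLevel;
        for (const NodePair& p : pairs) {
            if (blocked.test(p)) continue; // pair must not be edited anymore
            g.toggleEdge(p);
            blocked.set(p);
            setAtThisLevel.push_back(p);
            recurse(g, k - 1, blocked);    // explore the branch
            g.toggleEdge(p);               // revert the edit, keep the bit set
        }
        for (const NodePair& p : setAtThisLevel) blocked.clear(p);
    }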

10.4.2 Skip Forbidden Subgraph Conversion

Lemma 10.1. If each forbidden subgraph A ∈ F can be transformed into another one B ∈ F by one edit, the branching factor of the FPT algorithm can be reduced from p · (p − 1)/2 to p · (p − 1)/2 − 1.

There is an edit that transforms a P4 into a C4. Clearly, this edit can be skipped. Further, there are four edge deletions that transform a C4 into a P4. One of these can be skipped [NG10]. We can choose which one, but as any pair of two edge deletions eliminates the forbidden subgraph, skipping more than one of them might eliminate a necessary branch. Since the branching factor is reduced, this decreases the worst-case running time from O(6^k · (n + m)) to O(5^k · (n + m)) for quasi-threshold editing.

10.4.3 Existing Lower Bound Approaches

At each branching node, we have a certain number k of edits left. If we can show that the graph needs at least k + 1 edits, we do not need to explore further branches below that node. Lower bounds have been used for cluster editing [BBK11; HH15] and {P5}-free editing [Boh15]. Commonly, they are based on an LP relaxation of the ILP [HH15], or on a disjoint packing argument [Boh15; HH15].

Subgraph Packing. A node-pair disjoint subgraph packing P is a set of induced forbidden subgraphs that do not share a node pair. As no edit can eliminate more than one subgraph, |P| is a lower bound on the number of edits required. Taking the previously mentioned optimizations into account, we can include more subgraphs in P by allowing subgraphs to share blocked node pairs, as these cannot be edited. Further, for each forbidden subgraph, a node pair that transforms it into another forbidden subgraph may be shared. In the case of F = {C4, P4}, the pair of degree-1 nodes of an induced P4 can be shared.


For a C4, we can choose any edge to share, but it remains the same as long as the C4 is in the packing.

Finding such a packing can be modeled as an independent set problem [HH15]. The forbidden subgraphs are nodes, and every pair of forbidden subgraphs that shares a non-shareable node pair is connected by an edge. A natural greedy heuristic for independent sets is to iteratively add the node that has the smallest degree and then remove all its neighbors from the graph; a sketch of this follows below. This can be implemented in linear time by splitting nodes into buckets according to their degree (see e.g. [ARW12]). This heuristic has also been used to calculate lower bounds for cluster editing [HH15]. We are not aware of complexity results for the independent set problem on this special graph class.

In our experiments, we evaluate three bounds based on subgraph packing: 1) A basic bound that iteratively adds subgraphs to the packing as they are found. 2) An incremental version of 1) that updates the packing as the graph is modified in the branch-and-bound algorithm. After applying an edit, we remove the subgraph that contains the edited node pair. After both editing and blocking, we enumerate and add subgraphs to the bound until it is maximal. 3) A greedy bound based on the minimum-degree heuristic. In contrast to the first two, this requires storing all forbidden subgraphs. To avoid this in trivial cases, we first apply 2) to the previous bound and only compute a new packing if this fails to prune the branch.
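A self-contained sketch of the min-degree greedy on the conflict graph, with lazily updated degree buckets, could look as follows (an illustration of ours, not the actual implementation):

    #include <algorithm>
    #include <vector>

    // Repeatedly take a node of minimum degree into the packing and delete
    // all its neighbors. Stale bucket entries are recognized by comparing
    // the stored degree with the bucket index.
    std::vector<int> greedyMinDegreePacking(const std::vector<std::vector<int>>& adj) {
        const int n = static_cast<int>(adj.size());
        std::vector<int> degree(n);
        int maxDeg = 0;
        for (int v = 0; v < n; ++v) {
            degree[v] = static_cast<int>(adj[v].size());
            maxDeg = std::max(maxDeg, degree[v]);
        }
        std::vector<std::vector<int>> bucket(maxDeg + 1);
        for (int v = 0; v < n; ++v) bucket[degree[v]].push_back(v);

        std::vector<bool> removed(n, false);
        std::vector<int> packing;
        for (int d = 0; d <= maxDeg; ++d) {
            while (!bucket[d].empty()) {
                int v = bucket[d].back();
                bucket[d].pop_back();
                if (removed[v] || degree[v] != d) continue; // stale entry
                packing.push_back(v);
                removed[v] = true;
                for (int w : adj[v]) { // delete all conflicting neighbors
                    if (removed[w]) continue;
                    removed[w] = true;
                    for (int x : adj[w]) { // their neighbors lose one degree
                        if (removed[x]) continue;
                        bucket[--degree[x]].push_back(x);
                        if (degree[x] < d) d = degree[x]; // revisit lower bucket
                    }
                }
            }
        }
        return packing;
    }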

LP relaxation. The optimal solution of the LP relaxation provided in Section 10.3 is an upper bound for the node-pair-disjoint packing problem. This can be shown by considering an LP with just the constraints that correspond to the subgraphs in a packing. Each subgraph in the packing is a node-induced subgraph of G. Therefore, the terms on the left side of its corresponding constraint appear in the objective function exactly as they appear in the constraint, confer Equations 10.1 and 10.2. Each term in the objective function is at least 0, and each group of terms corresponding to a fulfilled constraint sums to at least 1. Since the packing is node-pair disjoint, the constraints do not share any variables and thus groups do not overlap. Therefore, the objective value is at least the number of subgraphs in the packing. Adding more constraints can only increase the objective and thus improve the bound. We can also model blocked node pairs by replacing the corresponding variable by its value. The variables in the constraints are then disjoint again and thus the same argument applies.

10.4.4 Local Search Lower Bound

We propose a lower bound based on a subgraph packing that is computed using an adaptation of the 2-improvements local search heuristic [ARW12] for independent sets. Our local search starts with an initial packing and works in rounds. In each round, it iterates over all forbidden subgraphs in the packing and tries to replace one by two forbidden subgraphs. If this is not possible, it tries to replace one by one. Preliminary experiments have shown that choosing this replacement from those candidates which cover the fewest other forbidden subgraphs leads to significantly higher bounds than considering


all candidates. We also found that using this strategy only 70% of the time and otherwise choosing a random replacement is even better (a sketch of this choice follows below). We repeat this procedure until only one-by-one replacements have been found in five consecutive rounds. We also terminate the search if the packing remains completely unchanged in a round, or if the packing is large enough to prune the current branch in the search tree. To make this efficient, we approximate the number of forbidden subgraphs that are covered by a certain forbidden subgraph H by adding up the number of forbidden subgraphs each node pair of H is part of. For the latter, we can efficiently maintain counters. The initial packing is computed with the basic greedy bound. For recursive calls, we update the previous bound as discussed above before employing local search.
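The randomized candidate choice inside a replacement step might look as follows (Subgraph and Counters are hypothetical types; candidates is assumed to be non-empty):

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    Subgraph chooseReplacement(const std::vector<Subgraph>& candidates,
                               const Counters& cover, std::mt19937& rng) {
        std::bernoulli_distribution preferRarelyCovering(0.7);
        if (preferRarelyCovering(rng)) {
            // Prefer a candidate covering the fewest other forbidden
            // subgraphs, approximated by summed per-node-pair counters.
            return *std::min_element(candidates.begin(), candidates.end(),
                [&](const Subgraph& a, const Subgraph& b) {
                    return cover.approxCovered(a) < cover.approxCovered(b);
                });
        }
        std::uniform_int_distribution<std::size_t> pick(0, candidates.size() - 1);
        return candidates[pick(rng)]; // otherwise: a uniformly random one
    }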

10.4.5 Branch on Most Useful Node Pairs

We can choose any forbidden subgraph for branching on its possible edits, e.g., the first we find. If there is a forbidden subgraph with only one non-blocked node pair, we choose it, as this will lead to just one recursive call. Otherwise, the first node pair we try to edit should ideally lead to a solution, or blocking the edit should prune the search. We propose to prefer forbidden subgraphs whose non-blocked node pairs are part of many other forbidden subgraphs. Then, a single edit can eliminate many forbidden subgraphs (possibly leading to a solution) and blocking the node pairs allows adding many subgraphs to the lower bound. For each forbidden subgraph, we sort its non-blocked node pairs in decreasing order by the number of forbidden subgraphs that contain the respective node pair. The edits of the selected forbidden subgraph are also tried in this order. We select the subgraph to branch on using a lexicographical ordering on these counts (a comparison sketch follows below). The last node pair is excluded, as there are no branches left to prune. Additionally, if two subgraphs have identical count sequences (up to the length of the shorter one), we prefer the subgraph with the shorter sequence.
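The comparison of two candidates by their count sequences can be sketched as follows (a minimal illustration; both sequences are assumed to be sorted decreasingly with the last entry already dropped):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Returns true if candidate a should be branched on rather than b.
    bool preferOver(const std::vector<int>& a, const std::vector<int>& b) {
        std::size_t common = std::min(a.size(), b.size());
        for (std::size_t i = 0; i < common; ++i)
            if (a[i] != b[i]) return a[i] > b[i]; // larger counts first
        return a.size() < b.size(); // identical prefix: prefer the shorter one
    }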

10.4.6 Prune Branches Early

Normally, we attempt to prune a branch after applying an edit and descending into recursion. With the optimization from Section 10.4.1, the edited node pair of a recursive call remains blocked after returning from recursion. We update the lower bound to consider this blocked node pair. If the new lower bound already exceeds the remaining number of edits, we can directly prune all subsequent recursive calls, instead of pruning them individually. There are two cases for which we skip the bound update to save running time: If there is only one subsequent recursive call, as we would only prune a single branch, and if the blocked node pair is only part of a single forbidden subgraph, as it cannot yield a better lower bound.

10.4.7 Parallelization

The algorithm can be parallelized by letting different cores explore different branches. Due to our optimizations, not every branch needs the same running time. Therefore, just

executing the first branches in parallel is not scalable. Instead, we use a simple work stealing algorithm. Whenever a thread has fully explored its branch, it steals a branch on the highest available level from another thread and further explores it.
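A conceptual sketch of a per-thread branch store (Branch is a hypothetical type holding a subproblem; synchronization is deliberately kept simple): the owning thread expands its deepest open branch, while idle threads steal the branch closest to the root, since branches on high levels tend to contain the most remaining work.

    #include <deque>
    #include <mutex>
    #include <utility>

    struct BranchStore {
        std::mutex mtx;
        std::deque<Branch> open; // front = branch on the highest level

        void push(Branch b) {
            std::lock_guard<std::mutex> lock(mtx);
            open.push_back(std::move(b));
        }
        bool popOwn(Branch& out) { // owner works depth-first
            std::lock_guard<std::mutex> lock(mtx);
            if (open.empty()) return false;
            out = std::move(open.back());
            open.pop_back();
            return true;
        }
        bool steal(Branch& out) { // thieves take the shallowest branch
            std::lock_guard<std::mutex> lock(mtx);
            if (open.empty()) return false;
            out = std::move(open.front());
            open.pop_front();
            return true;
        }
    };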

10.5 Experimental Evaluation

In Section 10.7 at the end of this chapter, we discuss implementation details. The C++ source code¹ of all discussed variants is available online. We use the C++ interface of Gurobi [Gur20] to solve ILPs and LPs. We evaluate our algorithms on a set of 3964 graphs that are connected components of the COG protein similarity data² that has already been used for the evaluation of cluster editing algorithms [Rah+07; Böc+08]. The dataset consists of a similarity matrix for each graph. We treat all non-negative scores as edges. Unless stated otherwise, we restrict our evaluation to the 716 graphs that require at least 20 edits. On the 3248 excluded graphs, the maximum running time is less than 0.43 seconds for the FPT algorithm using our local search lower bound. Of these graphs, 1666 require no edits at all. Further, we evaluate our algorithms on a set of 5 small social networks that were already considered by Nastos and Gao [NG13], namely karate [Zac77], grass_web [DHC95], lesmis [Knu93], dolphins [Lus+04], and football [GN02], see Section 2.2 for an introduction.

All experiments were performed on systems with two 8-core Intel Xeon E5-2670 (Sandy Bridge) processors and 64 GB RAM. We set a global time limit of 1000 seconds. Experiments comparing just FPT variants were executed on 16 different node orders, running 16 node orders in parallel. Due to the memory requirements of Gurobi, this is not feasible for the ILP and the LP bound. For these variants, we run just one instance at a time. For experiments involving ILP variants, we also limit the experiments to 4 node orders, and, for better comparability, we run one instance at a time also for the FPT comparison runs in Figure 10.4. By default, all algorithms terminate at the first found solution, as the ILP is unable to enumerate solutions. Variants with the suffix -All enumerate all solutions. Further, variants with the suffix -MT are parallelized using 16 cores.

10.5.1 Variants of the FPT Algorithm

The baseline branching strategy -F uses the first found forbidden subgraph. Our Most branching strategy from Section 10.4.5 is denoted by -M, additional early pruning by -MP. The basic greedy bound is denoted by -G, the incremental update bound by -U, the min-degree heuristic by -MD, our local search lower bound by -LS, and LP relaxations by -LP. The comparison includes the nine variants FPT-G-F-All, FPT-G-MP-All, FPT-U-MP-All, FPT-MD-F-All, FPT-MD-MP-All, FPT-LP-MP-All, FPT-LS-F-All, FPT-LS-M-All and FPT-LS-MP-All.

Figure 10.2 shows how many of the COG dataset instances can be solved within a certain time limit and with a certain number of recursive calls – added over all k's.

¹ https://github.com/kit-algo/fpt-editing
² https://bio.informatik.uni-jena.de/data/#cluster_editing_data



Figure 10.2: Number of permutations of graphs of the COG dataset that require at least 20 edits and can be solved within a certain total running time / with a certain number of recursive calls (and extra lower bound updates for -MP). The horizontal black line indicates the total number of graphs and node permutations that require 20 or more edits, including unsolved instances.

Additional lower bound calls due to -MP count extra. An instance is a single node id permutation of a graph, i.e., every graph is counted 16 times. Of the 716 graphs, we are able to solve 547 within the 1000-second time limit. Below, we also compare calls and running times per instance.

For comparing branching strategies, we fix the local search algorithm -LS as the lower bound. The median factor of additional calls needed by -M over -MP is 1.9 and by -F over -MP is 3.36, restricted to instances solved by both algorithms. While the median speedup of -MP over -F is 3.11, it is just 1.06 over -M. On 5% of the instances, the speedup is at least 56.62 and 1.24, respectively. This shows that for -M the improvement in the number of calls directly leads to similar running time improvements, while early pruning just reduces calls.

For comparing lower bound algorithms, we fix -MP as the branching strategy. There is an inherent trade-off between the number of recursive calls and the time spent per call, with a sweet spot that gives the best overall running time. The basic greedy bounds need 10 to 24 times as many calls as the other bounds in the median. The recomputed greedy bound -G is slightly better than the updated one -U; -LP is the best, followed by -LS and -MD. Nonetheless, for very small time limits, -U solves the highest number of instances. For larger time limits, reducing the number of calls pays off, though not at any cost. The median speedup of min-degree over the LP is 2.16 while needing 47% more calls in the median. Local search avoids their substantial memory overhead and spends significantly less time per call than both. It needs 83% of the calls of -MD while being a factor of


FPT-LS-MP-All 16 ILP-B Threads 2500 14 ILP-S 1 ILP-S-R 12 2 2000 ILP-S-R-C4 4 10 8 1500

8 16 Node Id Permutations) × Speedup 6 1000

4 500 2

0 0 Solved (Graphs − − 101 102 103 104 105 106 10 2 10 1 100 101 102 103 Calls Total Time [s]

Figure 10.3: Speedup of FPT-LS-M-All-MT and comparison of the different ILP variants on 16 (left) and 4 (right) node id permutations of the 716 COG graphs that require at least 20 edits.

12.36 faster in the median. It is never slower, and on 5% of the instances it is even more than 137 times faster than -MD.

Comparing the state-of-the-art FPT-MD-F-All algorithm to our FPT-LS-MP-All algorithm, we need 4.33 times fewer calls and are 46.06 times faster in the median. We are never slower, on 75% of the instances more than 16 times faster and on 5% of the instances more than 1044 times faster. In conclusion, our local search lower bound gives high-quality bounds while being fast. Our branching rules reduce the running time by another small factor, while early pruning mainly reduces the number of calls. Overall, we achieve a speedup of one to three orders of magnitude over the state of the art.

10.5.2 Parallelization

The left part of Figure 10.3 reports the speedup of FPT-LS-MP-All-MT over its sequential counterpart FPT-LS-MP-All, the sequentially fastest variant on the COG dataset. We show the speedup with 1, 2, 4, 8 and 16 cores in comparison to the number of recursive calls and lower bound calculations. For each graph and permutation, we plot the speedup on the last value of k for which the sequential version of the algorithm terminated within the time limit. With only few recursive calls we cannot expect a good speedup. For a high number of recursive calls, FPT-LS-MP-All-MT achieves almost perfect speedup for all numbers of cores on many graphs. As the algorithm is executed with increasing values of k, for some graphs only the last value of k needs a high number of calls and thus the overall speedup is not perfect even though in sum the number of calls is high.

10.5.3 Variants of the ILP

Figure 10.3 (right) shows the impact of the different optimizations on the ILP, when enabled one after another. We denote adding only one violated constraint by -S, adding



Figure 10.4: Comparison of the ILP to the FPT algorithm on 4 node id permutations of the 716 COG graphs that require at least 20 edits.

constraints during relaxations by -R, and specialized C4 constraints by -C4. The baseline is ILP-B, where row generation always adds all violated constraints for intermediate solutions. In the median, ILP-S is just 5% faster than ILP-B. While on 95% of the instances it is at most 20% slower, it is more than 44 times faster on 5% of them, which explains the gap in Figure 10.3. ILP-S-R is not faster in the median, but on 95% of the instances it is at most 12% slower and on 5% it is at least 73% faster. The C4 constraints make the ILP 12% faster in the median, at most 26% slower on 95% and at least 95% faster on 5% of the instances. With all optimizations, the ILP solves 568 graphs. We also tried providing a heuristic solution from QTM [Bra+15b] to Gurobi, but the improvement was even smaller and disappeared in parallel.

Figure 10.4 compares the best ILP and FPT algorithms with and without -MT in terms of running time and recursive calls. For the FPT algorithm, stopping at the first solution is not slower on 95%, more than 52% faster on 50% and more than 3 times faster on 5% of the instances. Multi-threading incurs a measurable overhead. Compared to FPT-LS-MP-All, FPT-LS-MP-All-MT is at most 16% slower on 95%, 78% faster in the median and more than 12 times faster on 5% of the instances. When stopping at the first solution, this decreases to 24% slower, 1% faster and 10 times faster, as more branches that do not lead to a solution are explored in multi-threaded mode. FPT-LS-MP-MT is still 4% faster than FPT-LS-MP-All-MT in the median, at most 3% slower on 95% and at least 68% faster on 5% of the instances. The parallel ILP is at most 5% slower on 95%, as fast in the median and more than 52% faster on 5% of the instances than the sequential ILP. Thus, the parallelization helps the FPT algorithm more than the ILP. A likely cause is that Gurobi needs far fewer search nodes than the FPT algorithm, which offers less potential for parallelism – on 50% of the instances at least 185 times fewer, and on many graphs even just one or two, see Figure 10.4.



Figure 10.5: Comparison of heuristic solutions of QTM and the exact number of edits k for solved graphs (left) or the best lower bound for unsolved graphs (right) achieved by the FPT algorithm or the ILP. For readability, we exclude one solved graph at k = 64, where QTM needed 202 edits.

The speedup of FPT-LS-MP over ILP-S-R-C4 is at least 0.59 on 95%, 3.25 in the median and at least 10.72 on 5% of the instances. For FPT-LS-MP-All, this decreases to 0.29, 2.10 and 7.02. In parallel, the speedups are 1.09, 3.41 and 16.45 for all solutions, and 1.34, 3.67 and 18.14 for the first solution. Single-threaded, the ILP solves more instances within 1000 seconds than the FPT algorithm, indicating that for difficult instances better bounds are more important. Overall, the FPT algorithm is often faster than the ILP, in particular in parallel and even when listing all solutions.

10.5.4 Comparison to QTM

Figure 10.5 compares the results of our heuristic Quasi-Threshold Mover (QTM) [Bra+15b] (see Section 9.5) with exact results for solved graphs and the best lower bounds for unsolved graphs. We use the maximum value of k achieved for any permutation by FPT-LS-MP-MT and by ILP-S-R-C4 with and without -MT. If any of the algorithms solved the graph, we list it in the left part, otherwise in the right part. For QTM, we report the minimum k that QTM found over 16 runs. Again, the plot excludes 3248 graphs that require less than 20 edits. Of those, QTM solved 3172 exactly, 56 with offset 1, 15 with offset 2 and 5 with offset 3. Of the remaining graphs, 588 are solved and 128 are unsolved. Of the solved graphs, QTM solved 319 exactly. For none of the unsolved graphs does QTM match the lower bound. Apart from one outlier, QTM never needs more than 1.4 times the number of edits of the exact solution and never more than 1.8 times the number of edits indicated by the best lower bound. For 95% of the 716 graphs, QTM needs at most 1.22 times the edits of the exact solution or the lower bound.


Table 10.1: Overview of the social network graphs. Using the algorithms FPT-LS-MP and ILP-S-R-C4 with 1 and 16 cores, we report the maximum k that finished within 1000 seconds, and the minimum time (T) over all permutations that is needed to find the first solution. In the case of football, we report the time needed to show that there is no solution with that k.

                          FPT, 1 core     FPT, 16 cores     ILP, 1 core      ILP, 16 cores
Graph       n    m        k      T [s]    k        T [s]    k       T [s]    k        T [s]
karate      34   78       21      0.01    21        0.01    21       0.02    21        0.03
lesmis      77   254      60      0.17    60        0.13    60       0.96    60        0.97
grass_web   75   113      34      1.81    34        0.21    34       2.91    34        2.83
dolphins    62   159      70    126.54    70       18.57    70      23.81    70       12.10
football    115  613     223    929.55    228     649.94    235   1000.01    237    1000.05

10.5.5 Social Network Instances

Table 10.1 shows an overview of the social networks with results for FPT-LS-MP and ILP-S-R-C4. Both solve karate and lesmis in less than a second, and grass_web within 3 seconds, with the FPT algorithm being faster. Even though lesmis is both larger than grass_web and requires 60 edits instead of 34, both algorithms are significantly slower on grass_web. This shows that their performance depends on the specific structure of the graph and not just on the graph size and k. For dolphins, the ILP is faster than the FPT algorithm. For all graphs, the FPT algorithm scales better with the number of cores. None of the algorithms can solve the football network. We show that there is no solution for k ≤ 223 (k ≤ 228) using the FPT algorithm with 1 (16) cores, and none for k ≤ 235 (k ≤ 237) using the ILP with 1 (16) cores. The previously best known upper bound was 251, computed with QTM [Bra+15b] in 2.5 ms, see also Section 9.6.1. In 1000 seconds, the ILP shows a new upper bound of 250. For the smallest three social networks, we verify that the best heuristic solutions from Section 9.6.1 [Bra+15b] are exact. QTM needs 72 edits on dolphins, whereas 70 edits are optimal.

10.5.6 Evaluation of the Found Solutions

The FPT algorithm offers the possibility to list all solutions exactly once. In this section, we show the number of solutions found for the COG dataset. Further, we examine in detail the solutions that are found on the four solved social network instances, comparing them with respect to the community detection application of quasi-threshold editing. Figure 10.6 shows the number of found solutions for all solved graphs of the COG dataset. We plot the number of solutions for the graphs grouped by k to see if there is any correlation between k and the number of solutions. For some graphs with k > 100, over a million solutions are found. However, there seems to be no strong correlation between k and the number of solutions found.



Figure 10.6: Number of solutions for all solved graphs of the COG dataset with k > 0, sorted by k.

Nevertheless, this shows that in many cases there is not one clear solution for a graph. To examine how this affects the graph clustering application suggested by [NG13], we take a closer look at the solutions for the four solved social networks using NetworKit [SSM16]. Table 10.2 summarizes the found solutions. The number of found solutions ranges from 24 on dolphins up to 3006 on grass_web. Nastos and Gao [NG13] propose to use the connected components of the edited graphs as clusters. While two closest quasi-threshold graphs might use different edits, they can still induce the same clustering, as some edits only affect edges inside clusters. Therefore, we also examine the number of different clusterings found. For karate, the 896 solutions induce only 12 different clusterings, while for grass_web, there are 2250 different clusterings induced by 3006 solutions. This shows that there can be considerable variance in the found clusterings even between exact solutions. The number of clusters, on the other hand, remains rather stable. Figure 10.7 shows two solutions of grass_web with two different clusterings. The brown cluster in the right solution is split into the brown, bright-green and dark-blue clusters in the left solution. Further, two nodes that are marked by a blue circle in the figures switch their cluster assignment. To see if there is something all solutions can agree on, we examine the edits that are common to all solutions. Out of the 21 edits necessary for karate, there are 11 common edge deletions. These induce a cut into two clusters, which is also the cut found in [Zac77]. For grass_web, there are 1 edge insertion and 11 edge deletions common to all solutions, which also only split the graph into two parts – compared with the necessary 34 edits and the up to 14 clusters in each solution. For lesmis, there are 4 edge insertions and 45 edge deletions common to all solutions, which induce 6 clusters – this shows a much more stable structure. On dolphins, there are 5 edge insertions and 56 edge deletions common to all solutions, i.e., each solution only adds 9 further edits.


Table 10.2: Summary of the solutions found. For each graph, we report the number of different solutions, the number of different induced clusterings, the minimum and maximum number of clusters in the different solutions, the number of insertions and deletions common to all solutions, the number of clusters obtained when just applying the common edits, the total number of different insertions and deletions, and the number of clusters obtained when intersecting all found clusterings.

                                         #Clusters       Common          Union
Graph       k  #Solutions  #Clusterings  Min   Max   Ins. Del. Cl.  Ins. Del. Cl.
karate     21         896            12    2     4      0   11   2    13   27    7
grass_web  34        3006          2250   11    14      1   11   2    11   45   22
lesmis     60         384           192    8    12      4   45   6    10   63   16
dolphins   70          24             8   12    13      5   56   9    11   71   16

These common edits already induce 9 clusters, which is close to the 12 or 13 clusters found in the individual solutions. Additionally, we look at the number of edits in the union of solutions. For all graphs there are more edge deletions than insertions. Even if all edge insertions were contained in a single solution, all graphs but karate would still have more than twice as many edge deletions as insertions. Further, we calculate the intersection of all found clusterings to obtain the largest clusters that are not split in any solution. For karate, this gives us 7 clusters that split both of the two parts into further parts. For grass_web, we even obtain 22 clusters (compared with at most 14 clusters in an individual solution). For lesmis and dolphins, we obtain 16 clusters, i.e., a value relatively close to the up to 12 or 13 clusters that are found in individual solutions. This analysis shows the power of being able to enumerate all solutions. We can not only determine how stable the clustering structure of a graph is, but also obtain the smallest components on which all different solutions agree – or large clusters where all solutions agree that they should be split. This could also be used to obtain overlapping clusters by assigning nodes that frequently change between clusters to several clusters.

10.6 Conclusion

We have introduced optimizations for two different approaches to solving any F-free edge editing problem. We evaluate our optimizations for the special case of quasi-threshold editing on a set of 716 protein interaction graphs. For the first approach, the FPT algorithm, we show that the combination of good lower bounds with a careful selection of branches reduces the running time by one to three orders of magnitude on 75% of the instances. For the second approach, an ILP, we evaluate several variants of row generation and show that they achieve small speedups. We show that the FPT algorithm is slightly faster than the ILP, with a larger margin in parallel, and that it can easily enumerate all optimal solutions.


Figure 10.7: Two solutions of grass_web. Red edges have been deleted, green edges have been inserted. Nodes are colored by connected component in the edited graph. The two blue circles denote two nodes that changed clusters.

For our heuristic editing algorithm QTM, we show that on 95% of the instances, it needs at most 22% more edits than our exact solutions or lower bounds indicate. Comparing the structure of exact and heuristic solutions might give further insights into how to improve heuristics. Exact FPT algorithms could be further improved by better bounds, possibly based on LP relaxations. As the COG benchmark set actually contains edit costs, an extension of our optimizations to the weighted editing problem could be investigated.

10.7 Implementation Details

In this section, we document various details of our implementation. We first describe our graph data structure and how we iterate over it. In Section 10.7.1, we describe how we list forbidden subgraphs. We maintain subgraph counters that we describe in Section 10.7.2. In Section 10.7.3, we describe for each lower bound algorithm how it is implemented using the aforementioned subgraph listing algorithms. In Section 10.7.4, we describe the implementation of our branching strategy. Finally, in Section 10.7.5, we describe our parallelization in detail. We store our graph as an adjacency matrix with 1 bit per node pair. To enumerate edges or neighbors, we use special CPU instructions to count leading zeros in a copy of a 64 bit block of this matrix. We then remove the found 1-bit from the 64 bit block and count again. If the current 64 bit block contains only zeros, we move to the next one. All but the largest 24 graphs in our benchmark set have at most 320 nodes, thus requiring at most five 64 bit blocks per row of the matrix. The largest graph has 8836 nodes and thus requires 139 64 bit blocks per row, but its average degree is also 64, therefore on average almost every second block contains a 1-bit. Thus, for almost all of the graphs we consider, bit matrices seem an appropriate choice in terms of memory usage and enumeration efficiency.

For the larger graphs, adjacency arrays might be a better choice, but as we are far from solving them, we did not further explore this. Adjacency matrices have the advantage that we can easily combine multiple rows to list common neighbors or exclude neighbors of another node, a feature that we use for subgraph listing as described in the following. Let A[i, j] denote the entry of the adjacency matrix in row i and column j. We use matrix slice notation A[:, j] to denote column j, and A[i, :] to denote row i, i.e., the neighbors of node i.
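To make the enumeration scheme concrete, the following is a minimal sketch of iterating over the 1-bits of one matrix row; the helper name forEachSetBit and the use of the GCC/Clang builtin __builtin_ctzll (count trailing zeros) are our illustrative choices here, not necessarily the exact instruction used in the actual implementation.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Enumerate the 1-bits (e.g., the neighbors) in one matrix row stored as
// 64-bit blocks: repeatedly locate the lowest set bit in a copy of the
// block, report its position, and clear it.
void forEachSetBit(const std::vector<std::uint64_t>& row,
                   const std::function<void(int)>& callback) {
    for (std::size_t block = 0; block < row.size(); ++block) {
        std::uint64_t bits = row[block];      // copy; destroyed below
        while (bits != 0) {
            int bit = __builtin_ctzll(bits);  // index of the lowest 1-bit
            callback(static_cast<int>(block * 64 + bit));
            bits &= bits - 1;                 // clear that 1-bit
        }
    }
}
```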

10.7.1 Subgraph Listing

To select subgraphs for branching and calculating lower bounds, we need to enumerate all forbidden subgraphs. In preliminary experiments, we found that enumerating forbidden subgraphs on demand not only requires much less memory than storing them, but is also much faster. Our implementation provides two methods for this: a global one that lists all forbidden subgraphs and a local one that lists all subgraphs containing a certain node pair. The latter is required to efficiently implement our local search lower bound and the branching on most useful node pairs. For simplicity, our descriptions focus on {C4, P4}-listing, but our source code works for arbitrary {P_l, C_l}, l ≥ 4, and {P_l}, l ≥ 2.

Global Listing. For the listing of all forbidden subgraphs, we enumerate all edges. We consider each edge {u2, u3} as the central edge and then enumerate edges {u1, u2} in the outer loop, and {u3, u4} in the inner loop, to complete the P4 or C4. For listing candidates u1, we directly exclude neighbors of u3 by only iterating over A[u2, :] ∧ (¬A[u3, :]). We list candidates u4 analogously. Fixing the central edge ensures that each induced P4 is listed exactly once. We list each C4 four times, which we use for trying different shareable node pairs for the packing lower bounds, as each edge deletion transforms the C4 into a P4.

Local Listing. For the listing of forbidden subgraphs that contain a certain node pair {u, v}, we need to consider all positions of {u, v} in the forbidden subgraph. If {u, v} is an edge, this means that apart from the case where {u, v} is the central edge, we also need to consider the case where we extend the path twice on each side. If {u, v} is not an edge, the case where {u, v} consists of the two degree-1 nodes of the P4 can be omitted due to the optimizations discussed in Section 10.4.2. We only need to find common neighbors x ∈ A[u, :] ∧ A[v, :] of u and v. These are part of the central edge. We try extending the path by one edge from u and v separately, i.e., iterate over A[u, :] ∧ (¬A[x, :]) and A[v, :] ∧ (¬A[x, :]).

Listing For Lower Bounds. For lower bounds, we are only interested in forbidden subgraphs that do not contain any node pair that is already used in the lower bound. We maintain a bit matrix L where all node pairs that are already used in the bound are set to 1. By using A[u, :] ∧ (¬L[u, :]) instead of A[u, :] for neighbors of u, and (¬A[u, :]) ∧ (¬L[u, :]) for non-neighbors, we can directly exclude these node pairs from listing.
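Concretely, such a filtered listing boils down to block-wise bitwise operations on the rows before enumerating; a hypothetical sketch (reusing forEachSetBit from above, with rows stored as 64-bit block vectors):

```cpp
#include <cstdint>
#include <vector>

// Block-wise combination of rows: neighbors of u2 that are non-neighbors of
// u3 and whose node pair with u2 is not yet used in the bound (row of L).
std::vector<std::uint64_t> candidateRow(
    const std::vector<std::uint64_t>& adjU2,
    const std::vector<std::uint64_t>& adjU3,
    const std::vector<std::uint64_t>& usedU2) {
    std::vector<std::uint64_t> out(adjU2.size());
    for (std::size_t b = 0; b < out.size(); ++b)
        out[b] = adjU2[b] & ~adjU3[b] & ~usedU2[b];
    return out;  // enumerate the candidates with forEachSetBit(out, ...)
}
```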


Excluding Specific Node Pairs From Listing. To branch on its node pairs or to check if a subgraph can be added to a lower bound, we need to enumerate its node pairs. For this, we implicitly exclude blocked node pairs as well as {u1, u4}, which is the node pair of the two degree-one nodes in a P4. As mentioned before, we always list a C4 four times, and thus omit a different edge {u1, u4} in each enumeration. This lets the lower bound algorithms select the best node pair to share, or the branching strategy select the best node pair to exclude.

10.7.2 Subgraph Counters

In our Most and Most Pruned branching strategies and the local search lower bound, we want to select the subgraph whose node pairs cover the most or least other forbidden subgraphs. For this, we maintain for each node pair a counter of how many forbidden subgraphs it is contained in. Whenever a node pair is edited or blocked/unblocked, we update the counters. When blocking a node pair, we store its previous counter on a stack so that it can be easily restored when unblocking, and set the current counter to zero. Note that our counters count a C4 three times for edges and four times for non-edges, due to listing the C4 four times and omitting one of the edges each time. We also maintain the sum of the subgraph counters to be able to quickly check if there are any forbidden subgraphs at all.
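A minimal sketch of this counter bookkeeping follows; it assumes that the counter increments and decrements caused by subgraphs appearing or disappearing after edits are handled elsewhere, and it relies on the strictly nested block/unblock order of the recursion, which is what makes a single stack sufficient. All names are illustrative.

```cpp
#include <cstdint>
#include <stack>
#include <vector>

struct SubgraphCounters {
    std::vector<std::vector<std::uint32_t>> count;  // per node pair
    std::uint64_t total = 0;           // sum over all (unordered) node pairs
    std::stack<std::uint32_t> saved;   // counters of currently blocked pairs

    // Blocking zeroes the counter; the old value goes on the stack so that
    // the nested recursion can restore it on unblock.
    void block(int u, int v) {
        saved.push(count[u][v]);
        total -= count[u][v];
        count[u][v] = count[v][u] = 0;
    }

    void unblock(int u, int v) {
        count[u][v] = count[v][u] = saved.top();
        saved.pop();
        total += count[u][v];
    }

    bool anySubgraphLeft() const { return total > 0; }
};
```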

10.7.3 Lower Bound Algorithms

Each of our lower bound algorithms has both a thread-state that is maintained once per thread and a call-state that is copied for every recursive call. We compute an initial lower bound on the input graph and start our search for k_opt from this bound instead of 0. The call-state is initialized once during this initial lower bound calculation and then used as the initialization for all values of k that we try. For most algorithms, the call-state contains the previously calculated lower bound as an array of subgraphs (node tuples) in the packing. We pass the call-state down into recursive calls, but not back up. The rationale behind this is that we need to remove at most one forbidden subgraph from the bound when descending into recursion, whereas no longer blocked node pairs would force a lot of subgraphs to be removed when returning from a recursive call.

Basic Bound. For the basic bound, we globally enumerate forbidden subgraphs H and add H to the packing P if none of its node pairs are used by another graph in the packing. This is done in each recursive call with an initially empty packing. After the one pass, P is inclusion-maximal. We maintain a bit matrix C for node pairs covered by the packing in the thread-state. When adding H to P, we mark its node pairs in C. We also supply a reference to C to the listing algorithm to skip subgraphs we cannot use. As additional bits in C are set during the listing, it does not skip all subgraphs we cannot use. For example, the listing does not check the central node pair again. In preliminary experiments, this still gave a small speedup.
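The greedy pass can be sketched as follows; forEachForbiddenSubgraph stands in for the global listing described above, and the pair enumeration mirrors the exclusion of {u1, u4}, but all names are illustrative rather than our actual interfaces.

```cpp
#include <array>
#include <functional>
#include <utility>
#include <vector>

struct Subgraph { std::array<int, 4> nodes; };  // a forbidden P4 or C4

// Node pairs of H on which an edit can destroy it; for a P4 (nodes in path
// order) the pair {u1, u4} of the two degree-one nodes is excluded.
static std::vector<std::pair<int, int>> editablePairs(const Subgraph& H) {
    std::vector<std::pair<int, int>> ps;
    for (int i = 0; i < 4; ++i)
        for (int j = i + 1; j < 4; ++j)
            if (!(i == 0 && j == 3))
                ps.emplace_back(H.nodes[i], H.nodes[j]);
    return ps;
}

// One greedy pass: add H to the packing if none of its pairs is covered yet.
int basicLowerBound(
    int n,
    const std::function<void(const std::function<void(const Subgraph&)>&)>&
        forEachForbiddenSubgraph) {
    std::vector<std::vector<bool>> covered(n, std::vector<bool>(n, false));
    int bound = 0;
    forEachForbiddenSubgraph([&](const Subgraph& H) {
        for (auto [u, v] : editablePairs(H))
            if (covered[u][v]) return;  // overlaps the packing: skip H
        for (auto [u, v] : editablePairs(H))
            covered[u][v] = covered[v][u] = true;
        ++bound;  // at least one edit is needed to destroy H
    });
    return bound;
}
```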


Updates. For the basic bound with updates, we pass the packing P through the search tree. Before descending into recursion, we remove the subgraph H that contains the edited node pair {u, v} from P, if it exists. If possible, we add subgraphs H′ to P that share a node pair with H or contain {u, v}, to make P inclusion-maximal. This is done using the local listing. Similar to before, the local listing skips some subgraphs touched by C. We store a list of subgraphs in the packing in the call-state. For memory efficiency, we do not store C in the call-state but instead recompute it from scratch when modifying the bound.

Local Search. The local search lower bound also just maintains the list of subgraphs used in the bound in its call-state. The initial update works as described above. To find candidates for replacing one subgraph by one or more subgraphs, we use the local listing. In each round, we try to replace each subgraph H in the packing once. For this, we first remove H from P and then use the local listing on all node pairs of H to obtain the set R of subgraphs that could replace H. From R, we obtain candidates that can be inserted together. For each subgraph H′ ∈ R, we first insert it into P and then iterate over the rest of R, trying to insert further subgraphs. If at least one additional candidate was found, we keep them in the packing. Otherwise, we can only replace H by H′. With 70% probability, we take the H′ that covers the fewest other forbidden subgraphs, and with 30% probability, we choose a random one from R. We apply several optimizations to speed up the search for additional candidates. For each node pair of H, we store a separate list of candidates. This allows us to skip candidates that use a node pair that is used by an already included candidate, without considering each candidate separately. To avoid trying the same candidate twice, we also list candidates only for the first node pair they contain, by excluding the already considered node pairs from the candidate search for subsequent node pairs.

Min-Degree Heuristic. The min-degree heuristic is based on the independent set formulation, where a subgraph is a node and two nodes are connected by an edge if the corresponding subgraphs share a node pair. A good lower bound then corresponds to a large independent set. For independent sets, the min-degree heuristic iteratively adds the node with the smallest remaining degree to the independent set and then deletes it and its neighbors from the graph. Instead of explicitly constructing this graph model, we translate this formulation back to forbidden subgraphs. We iteratively add subgraphs to the bound whose node pairs are shared with the smallest number of subgraphs that can still be added to the bound. To implement this, we need to explicitly maintain the "degree" of every subgraph in a priority queue, as it changes over time while more and more subgraphs are added to the bound. For this, we (temporarily) store an explicit list of all forbidden subgraphs that we obtain through global listing. Further, for every node pair, we store a list of the subgraphs it is part of, by storing their indices in the list of subgraphs. Similarly, we store these list indices in the priority queue. This allows us to efficiently identify the elements that need to be updated or removed from the priority queue. For a subgraph H, we initially use as key the sum, over all node pairs of H, of the number of subgraphs that contain the node pair.


This might count the same subgraph several times, but is more efficient to calculate. Preliminary experiments showed that this is faster than calculating the actual number of subgraphs with which H shares a node pair. Whenever we take a subgraph H from the priority queue and add it to the bound, we need to remove its neighbors from the priority queue and update their neighbors' degrees accordingly. To obtain the neighbors N of H, we iterate over H's node pairs and list all subgraphs they are part of. We remove each subgraph n ∈ N from the priority queue, and for each of n's neighbors, we decrement its key in the priority queue by one. For the priority queue, we use a bucket priority queue. As nodes only need to be moved between adjacent buckets, we can maintain all buckets in one large array and move elements between buckets by swapping them to the boundary and then adjusting the boundary.
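The bucket-boundary trick can be sketched as follows, assuming keys only ever decrease by one and are positive when decremented; all names are illustrative:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// All elements live in one array `elems`, grouped by ascending key;
// bucketStart[k] is the index of the first element with key k, and pos[e]
// is the current array position of element e.
struct BucketPQ {
    std::vector<int> elems;
    std::vector<int> key;             // key[e] = current key of element e
    std::vector<std::size_t> pos;     // pos[e] = index of e in elems
    std::vector<std::size_t> bucketStart;

    // Decrease the key of element e by one (assumes key[e] > 0): swap e to
    // the front of its bucket, then let the previous bucket grow by one slot.
    void decrementKey(int e) {
        int k = key[e];
        std::size_t front = bucketStart[k];
        int other = elems[front];
        std::swap(elems[pos[e]], elems[front]);
        pos[other] = pos[e];
        pos[e] = front;
        ++bucketStart[k];  // slot `front` now belongs to bucket k - 1
        --key[e];
    }
};
```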

LP Relaxation. The LP for our LP bound corresponds exactly to the ILP formulation shown in Section 10.3, with the optimization of omitting one node pair as described in Section 10.3.2. Our main goal for the implementation of the LP bound was to have a comparison with a lower bound algorithm that is guaranteed to prune at least as well as the packing-based lower bounds. For this reason, we ensure that the LP always contains the constraints that correspond to forbidden subgraphs that could also be used in the packing-based lower bound. Similar to the ILP, we initialize the LP with all constraints that correspond to forbidden subgraphs in the input graph. Whenever we edit or block a node pair, we fix its corresponding variable to 1 or 0, depending on whether it is now connected by an edge or not. When a node pair is edited, it is always blocked, and thus all constraints that correspond to forbidden subgraphs that no longer exist in the edited graph are trivially fulfilled; there is thus no need to remove them explicitly. After each edit, we add constraints that correspond to forbidden subgraphs that contain the edited node pair. After undoing an edit, we remove them again, as the LP solver slows down when the LP contains a lot of constraints.

10.7.4 Branching Strategies

As described in Section 10.4.5, we want to prefer subgraphs that contain at most one non-blocked node pair, as we know this edit has to be applied. In a call-state like those of the lower bound algorithms, we store both this list of subgraphs and a flag indicating whether the branch can be pruned. Such a subgraph can only appear when blocking or editing a node pair {u, v}. Every time this happens, we enumerate all subgraphs H containing {u, v}. If H contains exactly one non-blocked node pair, we store it. If H contains only blocked node pairs, the current branch can be pruned immediately because H cannot be destroyed. Before the branching strategy selects a forbidden subgraph, we first check if the flag is set and, if so, return an empty list of node pairs. Otherwise, we iterate over the list of subgraphs in the call-state and return the only non-blocked node pair of the first subgraph of the list where this node pair has not been edited yet. Note that two of these subgraphs might contain the same non-blocked node pair; thus, after the first of them has been selected, the second becomes invalid.

Only if the flag is not set and there is no subgraph with exactly one non-blocked node pair do we apply the actual branching strategy. In our Most and Most Pruned branching strategies, we avoid listing all forbidden subgraphs. Instead, we first identify those node pairs that are part of the maximum number of subgraphs, using the subgraph counters introduced in Section 10.7.2. Due to our lexicographical ordering, we are only interested in those subgraphs that contain these node pairs. We use our local listing to enumerate them and select the maximum as described in Section 10.4.5. The output of the branching strategy is a sorted list of node pairs and a flag indicating whether the graph is solved. We set this flag if the sum of the subgraph counters is zero.

10.7.5 Parallelization

In our parallelization, different threads explore different branches of the search tree. We maintain a global queue of work packages that represent roots of unexplored branches. To achieve a scalable parallelization, we want to generate few work packages that have a lot of recursive calls left. Due to our optimizations, we cannot know in advance how many calls are left for a certain branch. Even branches that start at the root of our recursion tree might be pruned after a single call. Therefore, we need to generate work packages as we explore the search tree, i.e., employ work-stealing. Each work package contains the number of remaining edits, the graph, the blocked node pairs, the subgraph counters and the call-states of the lower bound and the branching strategy (which can be empty). Hence, creating a work package for every call is too expensive, as we would need to copy these data structures, whose memory consumption is quadratic in the number of nodes. Passing them through the search tree and updating them on-the-fly is fast, but a work package constitutes the root of a new search tree and thus requires a copy. Therefore, we only create work packages when the global work queue contains fewer work packages than the number of threads. When a worker finishes one recursion tree, it takes another work package from the queue. If there is none left, it waits until either work becomes available or the algorithm is finished. The latter is indicated either by the fact that no thread has a work package anymore (we keep a counter of how many threads are currently working), or by a global flag that is set when the first solution has been found and not all solutions shall be listed. This flag is also checked in each recursive call to ensure that if one thread finds a solution, all other threads terminate. At the beginning of every recursive call, we check if work packages shall be generated. A simple approach would be to split the recursive calls of the current search tree node into work packages. Unfortunately, this does not scale well, as we would predominantly create work packages on deeper recursion levels where only few edits remain. Instead, we split off unexplored branches from the top of the current recursion tree, where we hope the most work is left. For this, we explicitly maintain the current recursion path of each worker thread. Each element of the path contains the node pairs to branch on, the call-states of the bound and the branching strategy for each of these branches, and an index that indicates the next branch to be explored.

After potentially generating work packages, we invoke the lower bound calculation. We then check if the recursive call can be pruned because of the bound, because there are no more edits left, or because a solution has been found. If not, we create its element in the path. For this, we obtain the node pairs for the next recursion level from the branching strategy. We then create copies of the call-states for all of them and update them such that each of them can be directly used to create a work package. This ensures that work package generation, which happens inside a global lock, is quick and does not need to update call-states. For the early pruning (see Section 10.4.6), we also directly check if calls can be pruned, and if so, we directly remove them from the node pairs. For the actual recursive calls, we iterate over the node pairs in the element of the path. We advance the index that indicates the next call and execute an actual recursive call with the call-state for that node pair. After all recursive calls have finished, we remove the element from the path. It is possible that, during a recursive call on a lower level, work packages have been generated for the remaining node pairs. In this case, the path will be empty when the recursive call returns. We check for this, and then return directly instead of continuing with the remaining node pairs. For our recursive calls, we also update a copy of the graph, the blocked node pairs and the subgraph counters that are used for calculating lower bounds and the branching strategy. In addition to this copy that represents the state at the bottom of our path, we also maintain a copy that corresponds to the state at the top of the path, which is used for generating work packages. To generate work packages, we first advance the top state to the next node pair that has not been used for a recursive call. For all remaining node pairs of the top element of the path, we generate a separate work package using the top state and the call-state that is stored in the path's element. Then, we remove the top of the path. This continues with the new top of the path, until either the recursion path is empty or a sufficient number of work packages (twice the number of threads) are in the queue. Note that we generate work packages before creating the element in the path that corresponds to the current recursive call. Hence, the generating thread still has work left, even if the recursion path becomes empty. This avoids a thread immediately needing to get another work package after putting work into the global queue.
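To illustrate the queue logic only (this is not our actual implementation), here is a strongly simplified sketch of a work-package queue with the two termination conditions described above; it assumes the queue is seeded with a root package before the workers start, and all names are illustrative.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// A work package is the root of an unexplored branch: remaining edits, a copy
// of the graph, the blocked node pairs, the subgraph counters and call-states.
struct WorkPackage { /* ... */ };

class WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<WorkPackage> q;
    int working = 0;  // number of threads currently holding a package

public:
    // Workers consult this before a recursive call to decide whether to split.
    bool shouldSplit(unsigned numThreads) {
        std::lock_guard<std::mutex> l(m);
        return q.size() < static_cast<std::size_t>(numThreads);
    }

    void push(WorkPackage wp) {
        { std::lock_guard<std::mutex> l(m); q.push_back(std::move(wp)); }
        cv.notify_one();
    }

    // Returns false once the algorithm is finished: the queue is empty and no
    // thread is working anymore, so no new packages can appear.
    bool pop(WorkPackage& out) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty() || working == 0; });
        if (q.empty()) return false;
        out = std::move(q.front());
        q.pop_front();
        ++working;
        return true;
    }

    // Called by a worker when it has fully explored its current package.
    void finished() {
        std::lock_guard<std::mutex> l(m);
        if (--working == 0 && q.empty()) cv.notify_all();
    }
};
```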

Part IV

Local Communities


11 Local Community Detection Based on Small Cliques

This chapter is based on joint work with Eike Röhrs and Dorothea Wagner [HRW17]. Compared to the publication, this chapter has been slightly adapted to reference parts of this thesis instead of repeating introductions. Further, experiments have been repeated due to several bugs in the original implementation. While the new results support our conclusions, there are slight differences that are discussed in this chapter. The idea of local community detection is to detect a community around a seed node without considering the whole graph. If only communities of a few nodes are required, this allows these community detection algorithms to scale to arbitrarily large graphs while being potentially much faster. Further, many overlapping community detection algorithms use local community detection algorithms as a basic building block. We consider the problem of finding a community locally around a seed node both in unweighted and weighted networks. We provide a broad comparison of different existing strategies of expanding a seed node greedily into a community. For this, we conduct an extensive experimental evaluation both on synthetic benchmark graphs and real-world networks. We show that results both on synthetic and real-world networks can be significantly improved by starting from the largest clique in the neighborhood of the seed node. Further, our experiments indicate that algorithms using scores based on triangles outperform other algorithms in most cases. We provide theoretical descriptions as well as open source implementations of all algorithms used.

11.1 Introduction

In this chapter, we consider local community detection, where a single community around a given seed node or a set of seed nodes shall be detected. Local methods can be applied when the user is, e.g., just interested in identifying a group of people around a given person in a social network or the functional subset a molecule belongs to within a biochemical network, and thus a clustering of the whole network is not needed. The advantage of local methods is that their performance usually depends only on the size of the detected community and not on the size of the whole network, which might be very large or even unavailable. An example of such a network is the network defined by the pages of the web and the links between them. Accessing this network by starting at a single node is easy and computationally feasible. Having access to the whole network requires huge computational resources and is practically impossible, so local community detection algorithms provide a viable alternative here. Various local methods have been proposed

that use the structure of the network to detect communities locally [SMM14; Sch07]. The study in [SMM14], however, shows that local algorithms still perform significantly worse than global, disjoint community detection algorithms on synthetic benchmark graphs. Expanding single nodes (LFM [LFK09], OSLOM [Lan+11]), single edges (MOSES [MH10]) or maximal cliques (GCE [Lee+10]) into communities is nevertheless a successful scheme for overlapping community detection. In this chapter, we compare different measures and algorithms for expanding communities by greedily adding (and possibly removing) single nodes. Greedy expansion is both a common and usually also a scalable approach for finding communities. In our extensive experimental evaluation, we show that in many cases the seed node is not embedded in the detected community. The reason for this is that many algorithms fail to correctly select the first nodes to add. To avoid this problem, we examine starting the expansion of the community from a maximum clique that contains the seed node, as proposed by [Fan+14]. Further, we introduce Triangle Based Community Expansion as an alternative strategy for greedy community expansion. It exploits the fact that edges inside communities are usually embedded in triangles [Rad+04] and uses an edge score based on triangles for deciding which node to try next. We compare the performance of the algorithms on synthetic benchmark graphs generated using the LFR benchmark [LF09b; LF09a] with unweighted and weighted graphs with disjoint ground truth communities as well as unweighted graphs with overlapping ground truth communities. Further, we also perform experiments on real-world social networks. We implemented all examined algorithms in the network analysis toolkit NetworKit [SSM16] and make them publicly available¹; we aim to make them part of the official NetworKit code base. We show that starting from a clique dramatically improves the accuracy of the detected communities both on synthetic and on real-world networks. In the experimental evaluation, we show that Triangle Based Community Expansion has a solid performance, and even without starting from a clique it still often finds the correct community. The computationally more expensive LTE algorithm [Hua+11], whose scoring is based on triangles, is among the best-performing algorithms on most tested graphs and has the best performance on the tested real-world social networks. This indicates that exploiting local structures like triangles is important for finding communities without global knowledge and that exploiting such structures might lead to the development of better overlapping clustering algorithms. To the best of our knowledge, we are the first to conduct such a systematic and broad study that shows that starting with cliques does not only improve the detected community, but also makes sure that the chosen seed node is actually embedded in the detected community, an aspect frequently ignored in existing evaluations. In this chapter, we treat all graphs as weighted graphs. We use the notation as introduced in Section 1.2. In the next section, we provide a brief overview of the related work. In Sections 11.2 and 11.3, we present the algorithms we compare in more detail. Thereafter, we show our experimental results. Lastly, we conclude with a summary and outlook in Section 11.5.

¹ https://github.com/kit-algo/LCD-cliques-networkit


11.1.1 Related Work

Numerous local community detection algorithms have been proposed [SMM14; Cla05; CZG09; Bag08; FZB14]. Many of these algorithms can be grouped as greedy community expansion [SMM14], which also provides the basis of other (more complex) algorithms. It can be described best as greedily adding the node to the community that currently has the maximum value of some function that assigns a value to each node. This greedy community expansion usually stops once the value of every node that could still be added is negative. Often, measures based on the ratio of internal edges to edges that are connected to the rest of the graph are used. BME1 [NTV12] additionally uses the distance to the seed node to assign less importance to edges from nodes further away from the seed node. Other algorithms like iGMAC [Ma+14] also consider a wider neighborhood of the nodes to be added and are thus computationally more expensive. However, there are also algorithms that work with a different strategy that is not based on greedy community expansion. One of these is PageRank-Nibble, which works by locally approximating PageRank vectors [ACL06]. It first computes a ranking of the nodes and only then finds a community based on this ranking. Flow-Pro [PPF15] starts a network flow from the seed node and adds nodes to the community when a sufficient part of the flow arrives at them. LEMON is an approach based on local spectral optimization [Li+15]. From these algorithms, we only include PageRank-Nibble in our evaluation, as no efficient implementation is available for the other algorithms. Many of the above-mentioned algorithms can also start from a set of seed nodes instead of a single seed node. This can be used to identify a community more precisely, as discussed in [DGG13]. Starting from a maximum clique as proposed in [Fan+14] is a similar idea that helps identify the community more precisely. Triangles are not only an important part of our Triangle Based Community Expansion algorithm and the LTE algorithm [Hua+11], but are also used in other community detection algorithms [FZB14; Jia+14; Rad+04]. Triangles have also been popularized with respect to social networks by the social sciences under the term triadic closure and are the basis of the clustering coefficient [WS98; EK10].

11.2 Density-Based Local Community Detection Algorithms

In this section, we introduce the existing community detection algorithms based on the density of internal and/or external edges of the community. We will compare these algorithms later in the experimental evaluation. For some of them we describe how we adapted them for weighted graphs and for most of them we describe how their running time can be optimized, an aspect often not included in the original publications. If the algorithms have any parameters, we also mention the values we choose for the experiments. We use the same values throughout all experiments as tweaking the parameters for each graph (or set of graphs) requires knowledge about the expected communities which is not always available in practice. For this chapter, we only selected algorithms that are local in the sense that they use only information about nodes located around the seed

node, and, except for PageRank-Nibble, do not assume knowledge about the whole graph. In particular, this also means that these algorithms do not know the number of nodes or edges, and their running time does not depend on them.

11.2.1 GCE M/L

One of the simplest schemes of local community detection is expanding a community by adding in each step the node of the shell that leads to the highest quality improvement of the community. In each step, the next node is selected by iterating over all nodes in the shell of the community. We examine the two quality measures proposed for this algorithm by [SMM14]: The M measure [LWP08] and the L measure [CZG09]. The M measure is defined as the number of internal edges of the community divided by the cut of the community. This can be trivially generalized to weighted graphs by taking the sum of the edge weights instead of the number of edges:

M(C) := (vol_w(C) − cut_w(C)) / (2 · cut_w(C))

Large values of the M measure indicate internally dense and well-separated communities. It was shown [SMM14] that maximizing the M measure is actually equivalent to minimizing the conductance of the community as long as the volume of the community is smaller than the volume of its complement. The conductance of a community C is defined as:

conductance(C) = cut_w(C) / min(vol_w(C), vol_w(V \ C)).  (11.1)

The L measure is defined using two terms, an internal and an external part. The internal part L_in is defined as the internal edge weights divided by the size of the community:

L_in := (vol_w(C) − cut_w(C)) / |C|

The external part L_ex is defined as the cut of the community divided by the size of the boundary:

L_ex := cut_w(C) / |B(C)|

The L measure is then the internal divided by the external part, i.e., L := L_in / L_ex. Note that if a community has no internal nodes, i.e., all nodes are adjacent to at least one node outside the community, this is actually equivalent to the M measure, as then |C| = |B(C)|. Therefore, large values of the L measure again indicate internally dense and externally well-separated communities.


Implementation Details

To expand the community, we need to determine which node of the shell will lead to the maximum improvement of the quality measure. In order to compute this in constant time per node in the shell, we store for every node in the shell the sum of the edge weights towards the community and the sum of the remaining edge weights. This allows us to determine the best candidate for the M measure in a single scan over the shell. Whenever we add a node to the community, we iterate over its neighbors and update their values. When a node is not yet in the shell, we also need to determine its initial external degree, which is its degree minus the weight of the edge through which we discovered the node. For unweighted graphs, our graph data structure already stores the degree, so this is possible in constant time; for weighted graphs, we need to iterate over the neighbors to determine the weighted degree. Therefore, for unweighted graphs we get a total running time of O(vol(C) + |C| · |S|), where C is the final community and S is the largest shell encountered during the execution of the algorithm. For weighted graphs, we get an additional running time of O(vol(S(C))) if the weighted degrees are not pre-computed. For the L measure, we additionally need to be able to tell how much the boundary will change when we add a node u of the shell to the community C. This depends on two factors: (a) if u has any external neighbors, the boundary size is increased by 1; (b) if u has any neighbors in C whose only external neighbor is u, then these nodes will no longer be part of the boundary. The first factor can be easily determined by looking at the external degree of u. For the second factor, we maintain two counters: for every node u in the boundary, we store how many external neighbors ex(u) it has; for every node v in the shell, we store the counter ex_in(v) that denotes how many neighbors of v in C have only v as external neighbor, i.e., the number of nodes x ∈ N(v) ∩ C with ex(x) = 1. The counter ex_in allows us to determine the change in the boundary size in constant time. We update these counter values as follows: whenever we add a node u to the community, we decrement the number of external neighbors ex(v) of every neighbor v of u by 1. When the number of external neighbors ex(v) of a node v in the boundary drops to 1, we need to notify this external neighbor x ∈ N(v) \ C, as its counter ex_in(x) needs to be increased. To determine x, we iterate over all neighbors of v. We then increment ex_in(x) by 1. Note that the number of external neighbors ex(v) only decreases as more nodes are added to the community. Therefore, this will happen only once for every node in the boundary. The asymptotic running time is thus the same as for optimizing the M measure.
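A minimal sketch of this shell bookkeeping for the M measure on weighted graphs follows; the data layout and names are illustrative, not the NetworKit implementation.

```cpp
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using WeightedAdj = std::vector<std::vector<std::pair<int, double>>>;

struct ShellInfo {
    double inC = 0.0;  // edge weight towards the community
    double ext = 0.0;  // remaining (external) edge weight
};

// Move u from the shell into the community and update its neighbors.
void addToCommunity(int u, const WeightedAdj& adj,
                    std::unordered_set<int>& community,
                    std::unordered_map<int, ShellInfo>& shell) {
    community.insert(u);
    shell.erase(u);
    for (auto [v, w] : adj[u]) {
        if (community.count(v)) continue;
        auto it = shell.find(v);
        if (it == shell.end()) {
            // New shell node: its external degree is its weighted degree
            // minus the weight of the edge through which we discovered it.
            double degw = 0.0;
            for (auto [x, wx] : adj[v]) degw += wx;
            shell[v] = ShellInfo{w, degw - w};
        } else {
            it->second.inC += w;  // this edge now leads into the community
            it->second.ext -= w;
        }
    }
}
```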

11.2.2 Two-Phase L

The originally proposed algorithm for L-based community detection [CZG09] is slightly more complicated than GCE L. It considers three cases for the updated measures L′_in and L′_ex that may lead to a higher value L′:

1. L′_in > L_in and L′_ex < L_ex

2. L′_in < L_in and L′_ex < L_ex

3. L′_in > L_in and L′_ex > L_ex

In a first discovery phase, they iteratively add the node that leads to the maximum L′, but only if it falls into case 1 or 3; otherwise, the node is not considered again. In a second examination phase, they re-consider every node and keep it only when it falls into the first case. It may happen that the original seed node is removed from the community in the second phase. In that case, they say that there is no local community for that seed node.

Implementation Details

For the first discovery phase, we can use the same data structures as described for GCE L. In the second examination phase, we can simply iterate over the neighborhood of every node again to determine the change in L′_in and L′_ex its removal would cause. In total, we thus get the same running time as GCE L, though the running time depends on the community size after the first phase, not on the final size.

11.2.3 LFM

The LFM algorithm [LFK09] maximizes the fitness function

f(C) := (vol_w(C) − cut_w(C)) / vol_w(C)^α

where α is a real-valued, positive parameter. In each iteration, it adds the node that leads to the highest improvement of f(C) to the community C and then removes every node whose removal increases f(C). The parameter α can be used to influence the size of the detected communities. Large values of α correspond to small communities, while small values correspond to larger communities. The authors report [LFK09] that values below 0.5 usually lead to just one community, while values larger than 2 lead to the smallest communities. Further, they recommend α = 1 as a natural choice, which is also the parameter we use in our experiments. Their proposed LFM algorithm is actually an algorithm for detecting overlapping communities that cover the whole graph, by iteratively expanding a random seed node that has not yet been assigned to a community into a new community. We only consider the basic building block of local expansion here.

Implementation Details For LFM, we use the same data structure as GCE M for adding nodes to the community. In order to allow removals by simply scanning over the nodes of the community once, we also store internal and external (weighted) degrees of all nodes in the community. The running time is not as easy to state in terms of the community size, as nodes may be added and removed again. Each addition of a node u needs time O(deg(u) + |C| + |S(C)|), as we first scan the shell to find u and then the community to find a possible candidate for removal. Each removal of a node u needs time


O(deg(u) + |C|), as we need to scan C again to find other nodes that possibly need to be removed. Further, every time we add a node to the shell, we might also need to calculate its total weighted degree unless it has been pre-computed.

11.2.4 PageRank-Nibble

PageRank-Nibble [ACL06] is a two-step algorithm: first, it approximates personalized PageRank vectors around the seed node and sorts all nodes with a positive score according to that score in decreasing order. Then it considers all communities that are a prefix of this sorted list and returns the prefix with minimum conductance as community. A larger study of using different measures with the same ordering criterion shows that conductance is actually a good choice [YL15]. The approximation of the personalized PageRank vectors needs two parameters that determine the accuracy of the approximation (ϵ) and the loop probability of a random walk (α). We use the values ϵ = 0.0001 and α = 0.1; these are the values that performed best in a parameter study in [SMM14]. The running time of the algorithm is in O((ϵ · α)^(−1)), i.e., for given parameter values, it is constant, and therefore the maximum detected community size is actually also constant. Finding the prefix with the best conductance is possible in time linear in the volume of the found nodes if the total volume of the graph is known; otherwise, this needs to be computed to be able to optimize the true value of conductance, and the algorithm is not actually a local community detection algorithm.
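For illustration, the sweep over the PageRank ordering can be sketched as follows, assuming an unweighted graph and that the total volume is known; all names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Given nodes sorted by PageRank score (descending), return the length of
// the prefix with minimum conductance. totalVol is the volume of the graph.
std::size_t bestSweepPrefix(const std::vector<std::vector<int>>& adj,
                            const std::vector<int>& order, double totalVol) {
    std::unordered_set<int> prefix;
    double vol = 0.0, cut = 0.0, bestCond = 2.0;  // conductance is <= 1
    std::size_t bestLen = 0;
    for (std::size_t i = 0; i < order.size(); ++i) {
        int u = order[i];
        double deg = adj[u].size(), intoPrefix = 0.0;
        for (int v : adj[u])
            if (prefix.count(v)) ++intoPrefix;
        prefix.insert(u);
        vol += deg;
        cut += deg - 2.0 * intoPrefix;  // new boundary edges minus closed ones
        double denom = std::min(vol, totalVol - vol);
        if (denom <= 0.0) break;  // prefix covers at least half the volume
        double cond = cut / denom;
        if (cond < bestCond) { bestCond = cond; bestLen = i + 1; }
    }
    return bestLen;
}
```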

11.3 Local Community Detection Algorithms Based on Small Cliques

When the GCE or LFM algorithm starts, the community consists only of a single node. Regardless of which neighbor is added first, the internal density of the resulting community of two nodes is the same. Therefore, regardless of the exact measure used, the neighbor with the lowest degree will be preferred. If the two nodes do not have a common neighbor, again the neighbor with the lowest degree is chosen. This means that at the beginning of the expansion phase, nodes of low degree are preferred regardless of whether they actually belong to the community of the seed node. Therefore, it is possible that the community grows into a community in which the seed node is not strongly embedded. Experiments (which are reported at the end of Section 11.4.3) show that this is actually a problem. A possibility to avoid such problems is to consider not only the density of the community but also its internal structure and the structure of its neighborhood. Not all edges are equally embedded in their neighborhood; some of them have more common neighbors than others. The common neighbors of the two nodes of an edge form triangles with it, i.e., cliques of size three. In Chapter 7, we already explored the use of scores based on triangles to filter edges that are not contained in communities and showed that this can amplify the community structure. Here, the same idea is used to identify edges that are likely inside a community. The local tightness expansion algorithm (LTE) [Hua+11] uses an edge similarity score based on triangles both for the decision which node to add next

and for the quality of the community. LocalT [FZB14] uses a quality measure for the community expansion process that is directly based on the number of triangles in and outside of the community. A different approach is to start directly with the maximum clique in the subgraph induced by the neighbors, as proposed in [Fan+14]. This also avoids the first error-prone steps. Our own approach, Triangle Based Community Expansion (TCE), is similar to LTE; we also use an edge score based on triangles. However, for the quality function of the community, we use conductance. In the following, we present all four approaches.

11.3.1 LTE

The local tightness expansion algorithm (LTE) [Hua+11] defines its similarity score s of two connected nodes u, v as follows:

s(u, v) := (w(u, v) + Σ_{x ∈ N(u) ∩ N(v)} w(u, x) · w(v, x)) / √((1 + Σ_{x ∈ N(u)} w²(u, x)) · (1 + Σ_{x ∈ N(v)} w²(v, x)))

They then define the tightness T(C) of a community as

T(C) := S_in^C / (S_in^C + S_out^C),

where S_in^C is two times the sum of all internal similarities, i.e., using s as weight function, S_in^C = vol_s(C) − cut_s(C), and S_out^C is the sum of similarities between adjacent nodes inside and outside the community, i.e., S_out^C = cut_s(C). LTE always tries to add the node with "the largest similarity with nodes in C" [Hua+11]. We interpret this as the sum of the similarities of the node and all adjacent nodes in C. This set of candidates can be efficiently maintained in a priority queue. A node is only added to the community if its tunable tightness gain is positive. This tunable tightness gain of a node a is defined as

τ_C^α(a) := S_out^C / S_in^C − (α · S_out^a − S_in^a) / (2 · S_in^a),

where S_in^a is the sum of the similarities of a and all adjacent nodes in C, and S_out^a is the sum of the similarities of a and all adjacent nodes outside of C. When τ_C^α(a) is not greater than 0, a is removed from the set of candidates. If a is added to the community, the set of candidates is updated with the neighbors of a; therefore, a node may be re-added to the queue later if one of its neighbors is added to the community.
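For illustration, the similarity s(u, v) can be computed by intersecting the two neighborhoods with a hash map; the following is a sketch under the assumption that adjacency lists store (neighbor, weight) pairs, with illustrative names.

```cpp
#include <cmath>
#include <unordered_map>
#include <utility>
#include <vector>

using WeightedAdj = std::vector<std::vector<std::pair<int, double>>>;

// Similarity s(u, v) for an edge {u, v} as defined above.
double similarity(int u, int v, const WeightedAdj& adj) {
    std::unordered_map<int, double> wu;  // neighbor x -> w(u, x)
    double normU = 1.0, normV = 1.0, num = 0.0;
    for (auto [x, w] : adj[u]) { wu[x] = w; normU += w * w; }
    for (auto [x, w] : adj[v]) {
        normV += w * w;
        if (x == u) {
            num += w;  // the edge {u, v} itself
        } else {
            auto it = wu.find(x);
            if (it != wu.end()) num += it->second * w;  // common neighbor x
        }
    }
    return num / std::sqrt(normU * normV);
}
```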

Implementation Details The original paper [Hua+11] does not further address the calculation of the edge similarity scores but simply assumes they are available in constant time. We do not assume this but instead use an efficient triangle listing algorithm to calculate them. To allow fast triangle listing, we build and maintain a local graph data structure. For processing the whole queue, we need to know the external similarity for


every node in the shell, and therefore we need to be able to list all triangles a node in the shell is part of. To allow this, our local graph H is the subgraph induced by the community C, its shell S(C) and the shell of the shell S(S(C)). Using a hash table, we map global node ids to local, small integer node ids. Note that when a new node u is added to the community, only scores of neighbors of u are affected. Therefore, we need to either add u's neighbors that are not in C to S or update their scores. The former means that the score of node v only consists of s(u, v)/deg_w(v). The latter implies that we need an increaseKey operation with the updated score: score_new(v) = score_old(v) + s(u, v)/deg_w(v). The total running time of all priority queue operations is therefore O(vol(C) · log(n_H)), where n_H denotes the number of nodes in H. Using the h-graph data structure presented in [LSS12], the triangle listing is possible in total time O(m_H · α(H)) for all nodes, where m_H is the number of edges in H and α(H) is the arboricity of H. As we do not need all operations, we only maintain the list of higher-degree neighbors of every node and therefore only need a simplified version of the h-graph data structure. Note, however, that in order to add a node to H, we need to check for all of its incident edges if the endpoint is in H. Therefore, in our situation we get an additional expected running time of O(vol(C ∪ S(C) ∪ S(S(C)))), where vol denotes the volume in the whole graph. Whenever we add a node u to the community C, we list all triangles u is part of and update the scores of its neighbors. We then also add all its neighbors to the shell, and for each of them we list all their triangles to compute the initial external score values. For storing the shell, we use a 4-ary (max-)heap that supports increasing keys. This allows us to update scores whenever we add a node to the community C. In total, we need O(vol(C)) priority queue operations, yielding total costs of O(vol(C) · log(vol(C))). The total running time is composed of building the local graph H, computing the scores by listing triangles and updating the priority queue, which is in O(vol(C ∪ S(C) ∪ S(S(C))) + m_H · α(H) + vol(C) · log(vol(C))). Studies on real-world networks [LSS12] show that the arboricity is small in practice, which means that the running time is close to linear in the volume of the nodes of H in the original graph. If the whole input graph shall be covered by communities, the scores can be pre-computed. The running time is then reduced to the time required for the priority queue operations.

11.3.2 Local T

The local T algorithm [FZB14] optimizes the T-measure, which is based on internal and external triangles of the community. They define T_in as the number of triangles where all three nodes are in C. The number of external triangles T_ex is then the number of triangles with exactly one node in C. Note that triangles with exactly two nodes in C are neither internal nor external according to this definition. The T metric is then defined as follows:


T := T_in · (T_in − T_ex)  if T_in ≥ T_ex, and T := 0 otherwise.

Their approach is also iterative; in each iteration, the node that maximizes the T-measure is added to the community. If there are ties, the node with the lowest T_ex score is preferred. Additionally, they detect outliers, i.e., nodes only weakly connected to a community, and hubs. We do not evaluate these steps here, as identifying outliers or hubs is not the goal of this work.

Implementation Details Similar to the data structure for LTE, we also maintain a local graph for Local T. The main algorithmic difference is that no priority queue is used; instead, for every node to be added to the community, the whole shell is scanned. Apart from that, the algorithm is very similar: we list triangles and update scores at exactly the same places. Therefore, the total running time is O(vol(C ∪ S(C) ∪ S(S(C))) + m_H · α(H) + |C| · |S|), where |S| is the maximum size of the shell, which is in O(vol(C)). Note that for LocalT, each triangle needs to be considered in the context of the community, and therefore no scores can be pre-computed to make the algorithm faster.

11.3.3 Clique Based Community Expansion

The basic idea of Clique Based Community Expansion is to start with the maximum clique in the subgraph induced by the neighbors of the seed node. For weighted graphs, we use the clique that has the maximum sum of all edge weights inside it and towards s. This is an extension that can be applied to all algorithms we consider. For algorithms based on simple greedy addition of nodes, we simply first add all nodes of the clique; for PageRank-Nibble, the PageRank approximation simply starts with the initial weight equally distributed over the nodes in the clique. For overlapping community detection, this was already considered by the Greedy Clique Expansion algorithm [Lee+10]. It first lists all maximal cliques in the graph and then expands a subset of them into communities. For local community detection, this was proposed by [Fan+14]. While they state that this can be combined with any greedy expansion scheme, they only evaluate it in combination with GCE M.

Implementation Details Our Clique Based Community Expansion implementation first creates the subgraph induced by the neighbors N(s) of the seed node s. Then we detect cliques using the simple ELS-bare algorithm of Eppstein et al. [ELS13], which runs in time O(d² · n · 3^(d/3)) on a graph with n nodes and degeneracy d. While they also present a faster algorithm, this algorithm still achieves good running times in practice while being much simpler to implement [ELS13]. In our usage of the algorithm, n = deg(s), since we only look at the neighborhood of a seed node s. For unweighted graphs we simply select the largest clique; for weighted graphs we select the clique of maximum edge weight. For the weighted case, we simply iterate over all neighbors of the clique nodes and sum up the weights of intra-clique edges. As a graph with degeneracy d has at most d(n − (d+1)/2) edges and the number of maximal cliques is at most (n − d) · 3^(d/3) [ELS10], this step needs at most O(d · n² · 3^(d/3)) time asymptotically.

Therefore, the total running time of the initialization algorithm for unweighted graphs is the same as the running time for finding the cliques on the graph induced by the neighbors of the seed node, while for weighted graphs the clique selection step dominates the asymptotic worst-case running time.
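The weighted clique-selection step can be sketched as follows. This is an illustration with assumed names and data layout, not the thesis implementation; it expects the maximal cliques of the subgraph induced by N(s) as produced by the preceding clique-listing step.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Node = std::uint64_t;
using Clique = std::vector<Node>;
// w(u, v) is assumed to return the edge weight, or 0 if {u, v} is not an
// edge. For unweighted graphs with w == 1 this degenerates to picking a
// largest clique.
using WeightFn = std::function<double(Node, Node)>;

Clique selectSeedClique(const std::vector<Clique>& maximalCliques,
                        Node seed, const WeightFn& w) {
    const Clique* best = nullptr;
    double bestWeight = -1.0;
    for (const Clique& clique : maximalCliques) {
        double total = 0.0;
        for (std::size_t i = 0; i < clique.size(); ++i) {
            total += w(seed, clique[i]);          // edge towards the seed
            for (std::size_t j = i + 1; j < clique.size(); ++j)
                total += w(clique[i], clique[j]); // intra-clique edge
        }
        if (total > bestWeight) { bestWeight = total; best = &clique; }
    }
    Clique result = best != nullptr ? *best : Clique{};
    result.push_back(seed); // the expansion then starts from clique plus seed
    return result;
}
```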

11.3.4 Triangle Based Community Expansion

Our Triangle Based Community Expansion (TCE) algorithm maintains the shell of the community in a priority queue. The priority queue is sorted by a node score that determines how well a node is connected to the community, taking the neighborhood of the node and its neighbors into account. Our algorithm works iteratively by extracting a node from the queue and, if it improves the quality of the community, adding it to the community. If it is not added, we proceed with the node that has the next highest score and continue until no node from the current shell meets the condition.

Our node score uses an edge score based on triangles. We chose triangles as they were shown to be a good indicator of community structure and are successfully used for community detection [YL15; Jia+14; FZB14]. Specifically, we found the edge score described by Radicchi et al. [Rad+04] to perform quite well for unweighted graphs. The score is given by the following equation:

\[
\frac{|N(u) \cap N(v)| + 1}{\min(\deg(u) - 1, \deg(v) - 1)} \tag{11.2}
\]

We adapt this score to weighted graphs by replacing the +1-term by the weight of the edge and the size of the intersection by the minimum weight of the two involved edges. Further, we removed the −1 in the denominator as it is otherwise not clear what the score should be when one of the nodes has only one neighbor. Therefore, our edge score is given by the following equation:

\[
\omega(u, v) = \frac{w(u, v) + \sum_{x \in N(u) \cap N(v)} \min(w(u, x), w(v, x))}{\min(\deg_w(u), \deg_w(v))} \tag{11.3}
\]

It can be understood as the ratio between the number/weight of actual triangles and the maximum possible number/weight of triangles. Note that for strictly positive edge weights, this edge score is also always strictly positive and at most 1. The node score is computed for a node u as

\[
\mathrm{score}(u) = \frac{1}{\deg(u)} \sum_{v \in N(u) \cap C} \omega(u, v), \tag{11.4}
\]

which is the sum of all edge scores of edges connecting the node u to the community C, divided by its unweighted degree. We normalize the score by the node's unweighted degree so that nodes with a high degree do not get high scores simply by virtue of having many edges connecting them with the community, regardless of the edges that connect them to other parts of the graph. Due to this normalization, node scores are also at most 1.
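Sketched in code, the two scores look as follows. The adjacency layout (nested hash maps from node to neighbor to weight) is illustrative, not the NetworKit data structure.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

using Node = std::uint64_t;
using WeightedAdj = std::unordered_map<Node, std::unordered_map<Node, double>>;

// omega(u, v) of Equation (11.3): weight of the triangles around {u, v}
// relative to the maximum possible triangle weight.
double edgeScore(const WeightedAdj& adj, Node u, Node v) {
    const auto& nu = adj.at(u);
    const auto& nv = adj.at(v);
    double numerator = nu.at(v); // w(u, v)
    // Intersect N(u) and N(v), iterating over the smaller neighborhood.
    const auto& smaller = nu.size() <= nv.size() ? nu : nv;
    const auto& larger  = nu.size() <= nv.size() ? nv : nu;
    for (const auto& [x, wx] : smaller) {
        if (x == u || x == v) continue;
        auto it = larger.find(x);
        if (it != larger.end())
            numerator += std::min(wx, it->second); // min weight of the two triangle edges
    }
    double degWU = 0.0, degWV = 0.0;
    for (const auto& [x, wx] : nu) degWU += wx; // weighted degree of u
    for (const auto& [x, wx] : nv) degWV += wx; // weighted degree of v
    return numerator / std::min(degWU, degWV);
}

// score(u) of Equation (11.4): edge scores towards C, normalized by the
// unweighted degree of u.
double nodeScore(const WeightedAdj& adj, const std::unordered_set<Node>& C, Node u) {
    double sum = 0.0;
    for (const auto& [v, wv] : adj.at(u))
        if (C.count(v) > 0) sum += edgeScore(adj, u, v);
    return sum / double(adj.at(u).size());
}
```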


Input: Graph G = (V, E), edge weights w(·, ·), seed node s.
Output: A community C of the seed node s.

1  C ← {s}
2  S ← N(s)
3  while S ≠ ∅ do
4      u_max ← arg max_{u ∈ S} score(u; G, C, w)
5      S ← S \ {u_max}
6      if conductance(C ∪ {u_max}) < conductance(C) then
7          C ← C ∪ {u_max}
8          S ← S ∪ (N(u_max) \ C)

Algorithm 11.1: Triangle Based Community Expansion (TCE) detects a community around a given node. Uses a node scoring function based on triangles to add nodes to the community.

By always removing the node with the maximum score from the shell, we consider adding every node in the shell to C at some point. However, using the node scores we prioritize some nodes over others, which leads to community growth in the direction of nodes with higher node scores.

We add a node to the community when adding it improves the conductance of the community. We choose conductance as it intuitively captures the definition of a community and was shown to perform better than other scores in a comparative study [YL15]. To use the original conductance score as shown in Equation (11.1), we would need to know vol(V \ C) in order to properly compute the conductance of a community C, which requires knowledge of the entire graph's volume. As we do not want to assume this, we always use vol(C) in the denominator of the conductance in Equation (11.1). This approach mostly yields the same values as the original equation, since the community is usually small in relation to the whole graph.

We show the pseudo-code of our algorithm TCE in Algorithm 11.1. It takes a graph G with edge weights w and a seed node s. For simplicity, we assume access to the whole graph here; however, the algorithm does not actually access the whole graph but only a local neighborhood of the nodes added to the community. It begins by initializing the community C with s and the shell S as N(s), the neighbors of s (lines 1–2). Then the loop is executed, taking the node u_max ∈ S with the highest node score (line 4), which is given by Equation (11.4). This node is then removed from S (line 5) and we check whether it would improve the conductance of C. If it improves the conductance, we add the node to C and update S by inserting all neighbors of u_max that are not in C (lines 6–8).
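A compact, self-contained sketch of this loop is shown below. It recomputes the vol(C)-normalized conductance from scratch and scans the whole shell for the maximum score, whereas the actual implementation maintains a heap and updates cut and volume incrementally; nodeScore refers to the Equation (11.4) sketch above, and all names are illustrative.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <utility>

using Node = std::uint64_t;
using WeightedAdj = std::unordered_map<Node, std::unordered_map<Node, double>>;

// Sketched above for Equation (11.4).
double nodeScore(const WeightedAdj& adj, const std::unordered_set<Node>& C, Node u);

// Conductance with vol(C) in the denominator: cut(C, V \ C) / vol(C).
static double localConductance(const WeightedAdj& adj,
                               const std::unordered_set<Node>& C) {
    double cut = 0.0, vol = 0.0;
    for (Node u : C) {
        for (const auto& [v, w] : adj.at(u)) {
            vol += w;                      // vol(C): sum of weighted degrees
            if (C.count(v) == 0) cut += w; // boundary edge, counted once
        }
    }
    return vol > 0.0 ? cut / vol : 1.0;
}

std::unordered_set<Node> tce(const WeightedAdj& adj, Node seed) {
    std::unordered_set<Node> C = {seed};
    std::unordered_set<Node> shell;
    for (const auto& [v, w] : adj.at(seed)) shell.insert(v);

    while (!shell.empty()) {
        // Take the shell node with maximum score (a 4-ary heap in practice).
        Node best = *shell.begin();
        double bestScore = -1.0;
        for (Node u : shell) {
            double s = nodeScore(adj, C, u);
            if (s > bestScore) { bestScore = s; best = u; }
        }
        shell.erase(best);

        std::unordered_set<Node> candidate = C;
        candidate.insert(best);
        if (localConductance(adj, candidate) < localConductance(adj, C)) {
            C = std::move(candidate);
            for (const auto& [v, w] : adj.at(best))
                if (C.count(v) == 0) shell.insert(v);
        }
    }
    return C;
}
```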

Implementation Details As for LTE and Local T, we build and maintain a local graph data structure for counting triangles. For TCE, we only need the subgraph H induced by the community C and its shell S(C), as we are only interested in triangles where at least one node is in C. Like for LTE, we use a 4-ary (max-)heap for storing S that supports increasing keys.


When a new node u is added to the community, only the scores of neighbors of u are affected. Therefore, we only need to either add u's neighbors that are not in C to S or update their scores (Algorithm 11.1, line 8). The former means that the score of node v initially consists only of ω(u, v)/deg(v). The latter implies that we need an increaseKey operation with the updated score: score_new(v) = score_old(v) + ω(u, v)/deg(v). The total running time of all priority queue operations is therefore O(vol(C) · log(vol(C))).

The total running time is composed of building the local graph H, computing the scores by listing triangles and updating the priority queue. In total, this is O(vol(C ∪ S(C)) + m_H · α(H) + vol(C) · log(vol(C))). Note that compared to LTE, the subgraph H is smaller as it does not contain the neighbors of the shell.

11.4 Experiments

We compare the previously presented algorithms, including the initialization using a clique, on weighted and unweighted graphs. First, we present our experimental results on synthetic graphs from the LFR benchmark [LF09b; LF09a]. Second, we evaluate our algorithms on 100 Facebook friendship networks [TMP12]. We start by describing our experimental setup and the scoring used, and then continue with the results on the synthetic and the real-world networks. At the end of this section, we also evaluate the running times of the different algorithms.

11.4.1 Experimental Setup

For all of these local community detection algorithms, we provide novel or improved implementations in C++ as part of the open source network analysis tool suite NetworKit [SSM16]. The exact implementations used and the scripts used to generate graphs, evaluate the detected communities and generate the plots are available online². We aim to contribute the implementations to NetworKit. We also compare against the global community detection algorithm Infomap [RAB09] (see Section 1.3.4 for an introduction), which has been shown to perform excellently on the synthetic benchmark graphs we use [LF09b]. This allows us to see how a state-of-the-art global community detection algorithm performs in comparison to the local community detection algorithms. For Infomap we use the implementation provided by the authors to optimize a two-level partition, i.e., a disjoint clustering of the whole graph. To get the community of a seed node, we simply use the cluster detected by Infomap that contains the seed node.

All results are obtained on a computer with an Intel® Core™ i7 2600K processor, run at 3.40 GHz with 4 cores, activated hyper-threading, and 32 GB RAM.

We generate 20 random realizations for each synthetic graph and select a set of 20 random seed nodes for each of them. In our experiments on real-world networks we use 100 randomly chosen nodes per graph as seeds, as there we have only one realization per graph.

² https://github.com/kit-algo/LCD-cliques-experiments



11.4.2 Scoring

We compare the detected communities to the ground truth communities generated in the synthetic benchmark graphs and to the communities induced by the attribute “dormitory” in the Facebook 100 dataset. For simplicity, we will also call the latter “ground truth” communities in the following. For the comparison, we match each detected community with a ground truth community. We use two variants of this: one where we compare the detected community to the best-matching ground truth community that contains the seed node, and one where we choose the best-matching ground truth community irrespective of the seed.

As score, we use the standard F1 score as introduced in Section 2.3.3. The F1 score is between 0 and 1, where 1 is the best score; it indicates a perfect match of ground truth and detected community. The F1 score and the similar Jaccard index are often-used measures in the area of community detection and graph clustering [YL14; SMM14; Ma+14; CZG09]. Both work well when comparing a single community that an algorithm finds to one or several ground truth communities by taking into account nodes correctly and incorrectly identified. Many other comparison measures typically used for comparing detected communities to ground truth communities, like NMI, assume a whole clustering consisting of many communities covering the whole graph. They are thus not suitable for our evaluation, where we explicitly want to compare individual detected communities to (possibly only some) ground truth communities.

In order to give an idea of how good a given F1 score is, we implemented a simple baseline algorithm. The simplest idea would be to return a random set of nodes; however, any connected subgraph containing the seed node could easily outperform this. To get a more realistic, yet basically random community, we instead perform a breadth-first search starting at the seed node. As input, it takes not only the seed node but also the size k of a random ground truth community the seed node is part of. It returns the first k nodes it finds during the breadth-first search. Therefore, the size of the detected community always perfectly matches the size of one of the ground truth communities the seed node is part of. In the experiments, we denote this algorithm by “BFS”.
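A minimal sketch of this baseline, together with the F1 computation it is scored with (illustrative names; the harmonic-mean formula is the standard one from Section 2.3.3):

```cpp
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Node = std::uint64_t;
using Adjacency = std::unordered_map<Node, std::vector<Node>>;

// "BFS" baseline: the first k nodes found by a breadth-first search from
// the seed, where k is the size of a ground-truth community of the seed.
std::unordered_set<Node> bfsBaseline(const Adjacency& adj, Node seed, std::size_t k) {
    std::unordered_set<Node> community = {seed};
    std::queue<Node> frontier;
    frontier.push(seed);
    while (!frontier.empty() && community.size() < k) {
        Node u = frontier.front();
        frontier.pop();
        for (Node v : adj.at(u)) {
            if (community.count(v) > 0) continue;
            community.insert(v);
            frontier.push(v);
            if (community.size() == k) break;
        }
    }
    return community;
}

// F1 score of a detected community against one ground-truth community:
// the harmonic mean of precision and recall over the node sets.
double f1Score(const std::unordered_set<Node>& detected,
               const std::unordered_set<Node>& truth) {
    std::size_t common = 0;
    for (Node u : detected) common += truth.count(u);
    if (common == 0) return 0.0;
    double precision = double(common) / double(detected.size());
    double recall = double(common) / double(truth.size());
    return 2.0 * precision * recall / (precision + recall);
}
```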

11.4.3 Synthetic Graphs

We use the LFR synthetic benchmark graphs as described in the introduction in Section 2.1.1. We use the variants with unweighted graphs with disjoint clusters as well as overlapping clusters, and weighted graphs with non-overlapping clusters. We summarize the parameters and our chosen values in Table 11.1. For unweighted, disjoint communities we use the implementation of the LFR benchmark in NetworKit [SSM16]. For the weighted and overlapping variants we use the implementations provided by the authors of the LFR benchmark.


Table 11.1: Parameters of the LFR benchmark that we use in our experiments.

Name   Description               Unweighted      Overlapping                     Weighted
n      number of nodes           5000            2000                            5000
k      average degree            20              39.5, 61.5, 78.1, 91.8, 103.5   20
kmax   maximum degree            50              120                             50
τ1     degree exponent           −2              −2                              −2
Cmin   minimum community size    10, 20          60                              20
Cmax   maximum community size    50, 100         120                             100
τ2     community size exponent   −1              −2                              −1
µt     topological mixing        0.1, ..., 0.9   0.2                             0.3, 0.5, 0.8
β      weight exponent           –               –                               1.5
µw     weight mixing             –               –                               0.1, ..., 0.9
Om     communities per node      1               1, ..., 5                       1

Unweighted Graphs

For the first experiment, we run the algorithms on the parameter set Unweighted in Table 11.1, which are the large (5000 node) variants used by Lancichinetti et al. [LF09b]. This parameter set consists of two different community size ranges [10, 50] and [20, 100]. Therefore, we generate two different sets of graphs for which we vary µt, the topological mixing parameter. We generate 20 instances of each graph and let the algorithms run from 20 randomly picked seed nodes per graph.

Figure 11.1 shows the average F1 scores both for the variants starting from a single seed node (left column) and for the variants starting from the maximum clique (right column). The “Cl” algorithm in the left column is just the maximum clique of the seed node, i.e., the starting point of the algorithms in the right column.

The performance of the plain Local T algorithm is well below the performance of the baseline BFS. Even the simple clique algorithm outperforms it in terms of F1 score. With clique initialization it performs better, but it is still one of the lowest-scoring algorithms. Note that due to a bug in the handling of the initialization with multiple seed nodes, Local T performed worse in our original evaluation [HRW17]. The density-based simple greedy expansion algorithms GCE M, GCE L and LFM all perform very similarly, which is not unexpected as they are based on very similar quality functions. With a clique as initialization their performance is significantly improved; in particular on the smaller communities they return almost perfect communities even past µ = 0.5. The original L-based community detection algorithm, Two-Phase L, performs worse than these simple expansion algorithms. One explanation for this behavior is that if the seed node is no longer part of the detected community, the algorithm simply returns an empty community, giving a score of 0. Only the triangle-based algorithms LTE and TCE perform well when initialized with a single seed node. They also profit from the initialization with a clique, though.


[Figure 11.1: four panels of average F1 scores (y-axis “F1-Score - seed”, 0.0–1.0) over the mixing parameter µ (x-axis, 0.1–0.9): (a) n = 5 000, small communities, single seed node; (b) n = 5 000, small communities, clique initialization; (c) n = 5 000, big communities, single seed node; (d) n = 5 000, big communities, clique initialization. Legend: GCE M, GCE L, TwoPhaseL, LFMLocal, PRN, LTE, LocalT, TCE, Cl, BFS and the corresponding Cl+ variants plus Infomap.]

Figure 11.1: Avg. F1-scores on the LFR benchmark with the parameter set Unweighted, which we specify in Table 11.1. The left column shows results when starting with a single seed node, the right column shows results for starting with the maximum clique as well as Infomap for comparison.

In particular on the smaller communities, their performance is not far from the Infomap algorithm, which is able to perfectly recover the community structure up to µ = 0.7. The PageRank-Nibble algorithm (PRN) performs well only on the graphs with low values of µ; for higher values its performance degrades very quickly.

To explore the performance of the different algorithms in more detail and get some explanations for their behavior, we show more detailed results in Figure 11.2. In Figure 11.2a,b, we see the conductance of the detected communities. Due to the definition of conductance and the ground truth communities in LFR, µ corresponds exactly to the conductance of the ground truth communities. We can see here that the triangle-based algorithms which almost perfectly recover the community structure (TCE and LTE) also yield good, though not perfect, conductance values. For the methods based on density and simple greedy expansion, GCE M, GCE L and LFMLocal, we can see that the conductance of the detected communities is improved by starting from a clique.

PageRank-Nibble shows an interesting behavior here: starting with µ = 0.4, the conductance stays almost the same, i.e., it is better than what the other algorithms find. Looking at Figure 11.2c,d, which show the sizes of the detected communities, we can see that PageRank-Nibble finds really large communities that, starting with µ = 0.5, contain almost half of the nodes of the whole graph. The explanation for this is that if you randomly cut a graph into two equally-sized parts, in expectation you also split the neighbors of every node into two equally-sized parts. Therefore, for every node u, in expectation about half of u's neighbors are in the same part as u and the other half is in the other part. This yields a conductance of 0.5. Therefore, if the initial local PageRank approximation visits enough nodes, a large enough prefix of the sorted nodes will have a lower conductance than the actual community. Further, if the actual community is not perfectly detected, its conductance is higher and therefore the effect already starts around µ = 0.4.

The Local T algorithm consistently finds too large communities that also have high conductance values, except for some range around µ = 0.6, where probably the community size is just right to get low conductance values regardless of the chosen cut. Only when starting with a clique on graphs with low values of µ are the size and conductance correct, as expected from the good F1 scores.

Concerning the sizes, the density-based algorithms GCE M, GCE L and LFM as well as TCE and LTE find approximately the right size, though for larger communities they tend to discover too few nodes. Starting with a clique improves this; the detected community sizes then correspond more closely to the correct size, which is always returned by the BFS baseline by design. In our previous experiments, LTE tended to detect too large communities for µ ≥ 0.8; this was due to a similar bug with the clique initialization as in Local T. For Two-Phase L we can clearly see that above µ = 0.5 the algorithm finds smaller and smaller communities on average, meaning that more and more often it decides that the detected community is no community and discards it.
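The random-cut argument for PageRank-Nibble's behavior can be made explicit with a short back-of-the-envelope calculation; here A denotes one side of a uniformly random balanced cut of a graph with m edges:

\[
\mathbb{E}[\operatorname{cut}(A, V \setminus A)] \approx \frac{m}{2},
\qquad
\operatorname{vol}(A) \approx \frac{\operatorname{vol}(V)}{2} = m,
\]
\[
\text{so} \quad
\phi(A) = \frac{\operatorname{cut}(A, V \setminus A)}{\operatorname{vol}(A)} \approx \frac{m/2}{m} = \frac{1}{2}.
\]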


[Figure 11.2: six panels over µ (x-axis, 0.1–0.9) for the n = 5 000 graphs with big communities, same legend as Figure 11.1: (a, b) conductance of the detected communities; (c, d) community sizes (logarithmic scale); (e, f) “F1-Score - noseed”. Left column: single seed node; right column: clique initialization.]

Figure 11.2: Different measures for the 5000 node unweighted graph with disjoint big communities.


In Figure 11.2e,f we report the average F1 score of every detected community with the best-matching ground truth community, regardless of whether that community contains the seed node. For GCE M, GCE L and LFM we can clearly see that they get much better scores compared to the comparison considering only the ground truth community of the seed node. This shows that these algorithms find communities in which the seed node is not strongly embedded. For the algorithms starting with a clique, this effect is much weaker, though for higher values of µ it is visible for all algorithms. Existing work frequently ignores this, e.g., by just considering the cover that is detected by iteratively applying the algorithm to nodes that do not yet belong to a community [LFK09; Hua+11].

11.4.4 Overlapping Communities

Our parameter set for overlapping communities is inspired by the highly overlapping parameter set used in [Lee+10]. All nodes belong to the same number of communities Om, which is in the range [1, 5]. We modified the parameters slightly by not setting the average degree but the minimum degree to 18 · Om. This has the effect that the minimum internal degree of a community stays the same with increasing values of Om. Using NetworKit, we calculated the expected average degree for this minimum degree and used it as a parameter for the LFR generator (a continuous approximation of this conversion is shown below).

Figure 11.3 shows the results for these networks. Note that here we do not evaluate whether an algorithm is able to discover all or a specific community of a node; we compare to the community of the seed node that is best matched. The performance of Local T is extremely poor: it cannot detect any overlapping communities and, without a clique as initialization, it cannot even detect the non-overlapping communities. PageRank-Nibble is also unable to discover any communities for Om ≥ 2, which is not really surprising as the PageRank approximation has no way to decide for one of the communities of the seed node. The other algorithms perform better, LTE being best when starting without a clique. The largest clique of the seed node as initialization again improves the performance, and, quite interestingly, when starting with a clique, the density-based algorithms GCE M, GCE L and LFM perform better than the triangle-based algorithms LTE and TCE. A possible explanation for this could be that triangles are less helpful in the case of overlapping communities, as a node can have triangles with neighbors of any of its communities. Infomap completely fails to discover the overlapping community structure, but this is expected as we only used the variant for disjoint communities.
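For reference, the conversion from minimum to expected average degree can be approximated in the continuous power-law model with degree exponent −2 on [kmin, kmax]; this is a sketch of one way to compute it, not necessarily the exact NetworKit routine:

\[
\mathbb{E}[k]
= \frac{\int_{k_{\min}}^{k_{\max}} k \cdot k^{-2}\,\mathrm{d}k}{\int_{k_{\min}}^{k_{\max}} k^{-2}\,\mathrm{d}k}
= \frac{k_{\min}\, k_{\max}}{k_{\max} - k_{\min}} \ln\frac{k_{\max}}{k_{\min}}.
\]

For Om = 1, i.e., kmin = 18 and kmax = 120, this gives E[k] ≈ 40.2, close to the value 39.5 listed in Table 11.1; the remaining difference stems from the discrete degree sequence.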

Weighted Graphs

For the weighted graphs, we use the parameter set “Weighted” in Table 11.1; these are the parameters also used by Lancichinetti et al. [LF09b]. This means that we have a single community size range, [20, 100]. We further choose three values for the topological mixing parameter, µt = 0.3, 0.5, 0.8. The weight mixing parameter is varied from 0.1 to 0.9 in each of these sets. For each µw we generate 20 instances and run the algorithms from 20 randomly chosen seed nodes.

For the weighted graphs, we do not compare Local T as it only works on unweighted graphs and it is not clear how to generalize it to weighted graphs; the authors discuss this as future work. In Figure 11.4 we show the results for the other algorithms.


[Figure 11.3: two panels of average F1 scores (y-axis “F1-Score - seed”, 0.0–1.0) over the number of communities per node Om (x-axis, 1–5): (a) plain algorithms; (b) clique-based algorithms and Infomap.]

Figure 11.3: Results for overlapping communities with LFR graphs with parameter set “Overlapping” as specified in Table 11.1.

For each value of the topological mixing parameter µt, we plot the weight mixing parameter µw on the x-axis and the resulting average F1 score on the y-axis. Comparing the two plots at µt = 0.3 and µt = 0.5, we can see that while most algorithms even improve their performance at µt = 0.5, LTE performs worse. A possible explanation for this is that its score of the community is based on the presence of triangles. While TCE is also based on triangles for the decision which node to add next, its score of the community is solely based on the density and should thus be independent of the structure of the graph as long as the weights are correctly distributed. This can be seen in its performance, which is even better for µt = 0.5. When initialized with a clique, LTE has fewer problems, though, and even outperforms Infomap for high values of µw. This is probably due to its community score being based on the structure of the graph and thus taking weights less into account. Two-Phase L performs similarly well as or worse than our baseline BFS; apparently the two-phase process is not well suited for this kind of weighted graph. The performance of the other density-based algorithms GCE M, GCE L and LFM is similar to their performance on unweighted graphs.

For µt = 0.8, none of the algorithms is able to accurately detect the community structure anymore. This is most likely due to the low intra-community degree. Note that the chosen values of k = 20 and kmax = 50 result in a minimum degree of 10, which means that at µt = 0.8, the low-degree nodes have only two neighbors in their own community. PageRank-Nibble seems to be least affected by this, though its performance is also significantly worse than for µt < 0.8. LTE performs worse than the other algorithms as it is most affected by the lack of triangles. Cliques also lead to only small improvements; this is most likely also due to the low internal degree. In the previous version of these experiments, due to a bug, no improvements were visible for cliques at all, as the code did not consider the edge weights from the seed node to the clique when determining the maximum-weight clique. The still good performance of Infomap shows, though, that in principle the community structure is still detectable.


[Figure 11.4: six panels of average F1 scores (y-axis “F1-Score - seed”, 0.0–1.0) over the weight mixing parameter µw (x-axis, 0.1–0.9): (a, b) µt = 0.3; (c, d) µt = 0.5; (e, f) µt = 0.8. Left column: plain algorithms; right column: clique-initialized variants and Infomap.]

Figure 11.4: Avg. F1-scores on the LFR benchmark with the parameter set Weighted, as specified in Table 11.1.


11.4.5 Facebook Graphs

Synthetic benchmark graphs, like those of the LFR benchmark, are an excellent way to compare algorithms and can show their quality. However, as discussed in Chapter 2, good performance on synthetic graphs does not necessarily translate into good performance on real-world graphs. Good performance on real-world graphs is also an important quality of algorithms; after all, we design our algorithms with the goal in mind to be able to find communities on real graphs. Therefore, we test and evaluate the algorithms here on real-world social networks, more specifically the Facebook 100 data set that we have already discussed in Section 2.2.3. It consists of 100 Facebook social networks from 100 universities and colleges from September 2005.

In our evaluation, we compare against the community structure induced by the dormitories. These are not ground truth communities, though. Therefore, our results only show how well the communities detected by the algorithms and the attribute correlate. This has many flaws, one of them being that the attribute is missing for some nodes. Another is that the subgraph induced by one of the attribute values is not necessarily connected; in preliminary tests we found that they often even consist of more than two connected components. As these flaws affect all algorithms equally, we still consider this a valid evaluation. We pick 100 random nodes as seed nodes on each graph and execute the algorithms for each of them. When picking the seed nodes, we ensure that each of them has the dormitory attribute set.

Figure 11.5 shows the resulting F1 scores for all 100 graphs, which we sorted by the F1 score of LTE. Clearly, LTE is the best-performing algorithm, in particular when starting with a single seed node. There, on most graphs, none of the other algorithms even exceeds the performance of the baseline BFS. When starting from the maximum clique, TCE, GCE M, GCE L and LFM perform much better and are now better than the BFS baseline on many graphs, but still worse than LTE. LTE also profits from starting from a clique, but only slightly. PageRank-Nibble does not seem to profit from the clique at all; rather, it appears that results become worse when starting from a clique. The Two-Phase L algorithm often decides that no community was discovered; on some of the graphs it even decided that for all 100 seed nodes. While Infomap is among the better algorithms, it does not outperform LTE on most of the graphs.

11.4.6 Running Times

In Table 11.2 we present the running times (wall time) and the detected community sizes of the local community detection algorithms, averaged over all 100 Facebook graphs and over the overlapping LFR benchmark graphs. While they show different rankings, one can still find some similarities. First of all, Local T is the slowest algorithm by a large margin and returns the largest communities. LTE is second-slowest, followed by TCE on the LFR graphs. A main difference between the Facebook and the LFR graphs is the time required for listing all maximal cliques.


[Figure 11.5: two panels of average F1 scores per network (y-axis “F1-Score - seed”, 0.0–1.0; x-axis: the 100 Facebook networks, from MIT8 to Johns Hopkins55). Top panel: plain algorithms (GCE M, GCE L, TwoPhaseL, LFMLocal, PRN, LTE, LocalT, TCE, Cl, BFS); bottom panel: the clique-initialized variants and Infomap.]

Figure 11.5: Average F1-scores of the algorithms on all 100 Facebook graphs. The scores are calculated by treating the dormitory attribute as communities. The graphs are sorted by the F1-score of LTE.


Table 11.2: Average times (ms) per seed node and average community sizes of the algorithms.

(a) Overlapping LFR Benchmark          (b) Facebook Networks

               Size   Time (ms)                       Size   Time (ms)
Cl                7         0.4        TwoPhaseL        46         1.8
Cl+PRN          420         1.7        PRN             976         3.7
PRN             607         2.5        GCE M           292         6.6
Cl+GCE M        322         3.6        GCE L           287         8.2
GCE M           420         4.1        TCE             313        11.3
Cl+GCE L        315         4.3        LFMLocal        240        12.5
Cl+TwoPhaseL    233         5.0        Cl               14        16.8
GCE L           419         5.2        Cl+TwoPhaseL     45        20.9
LFMLocal        277         6.0        Cl+PRN         1079        21.5
Cl+LFM          266         6.0        Cl+TCE          924        50.3
TwoPhaseL       337         7.0        Cl+GCE M       1012        51.8
TCE             347        11.2        Cl+GCE L        924        54.5
Cl+TCE          327        11.2        Cl+LFM          914        87.3
LTE             311        29.0        LTE             263       106.5
Cl+LTE          303        29.1        Cl+LTE          378       150.8
LocalT         1616        56.6        LocalT         8063      1063.6
Cl+LocalT      1616        57.3        Cl+LocalT      8333      1098.5

While clique listing is really fast on the LFR graphs, it is slower than executing any local community detection algorithm apart from LTE and Local T on the Facebook networks. A possible explanation could be the different structure of the Facebook networks with larger cliques and higher degrees, though the average clique size on the Facebook networks (14) is only twice as large as the average on the LFR graphs (7). PageRank-Nibble is quite fast on both types of networks, closely followed by the simple expansion strategies GCE M and GCE L. LFM with the more advanced removal strategy is slightly slower. The Two-Phase L algorithm is the fastest algorithm on the Facebook networks, but there it also hardly finds any communities; so it probably not only finds communities without the seed node, as we saw in earlier experiments, but the expansion also stops early on these networks. On the LFR benchmark, where its performance was similar to the simpler density-based algorithms, Two-Phase L is slightly slower than these algorithms but still faster than the triangle-based algorithms.

11.4.7 Summary

In Figure 11.6 we summarize the experimental results concerning the quality of the found communities. For each type of LFR graph with non-overlapping communities, we show the result for µ = 0.5, as this is for all graphs a value where many algorithms still find reasonable results but differences between the algorithms are already visible.

For LFR graphs with overlapping communities we picked 4 communities per node for the same reason. For the set of 100 Facebook graphs we report the average over the 10 graphs where LTE starting with a clique (the best-performing algorithm according to our results) found the best-matching communities.

On all considered types of graphs, the algorithms not starting with a clique perform worse than those starting with a clique, though LTE and TCE still show reasonable results for many graphs. The algorithms Two-Phase L, PageRank-Nibble and Local T perform worse than the other algorithms on all types of graphs. On unweighted LFR graphs with small communities, all other algorithms find the ground truth (almost) perfectly when starting with a clique. With big communities, TCE starting with a clique wins by a small margin. Note that in our previous experiments, where LTE had a bug, LTE starting with a clique found the ground truth even better, while now it performs slightly worse. On the Facebook graphs, LTE starting with a clique is the best-performing algorithm. However, with overlapping communities, and with edge weights and a not so clear structure (i.e., µt ≥ 0.5), its triangle-based score seems to be more easily confused and is outperformed by the simpler density-based algorithms like LFM or GCE with clique initialization. Our newly proposed TCE algorithm is between them; on weighted LFR graphs it is even among the best-performing algorithms. Compared to the global, non-overlapping Infomap algorithm, the results are quite competitive; only on the weighted graphs does Infomap perform significantly better. On the Facebook networks, the performance of Infomap is even inferior to some local algorithms. For the overlapping LFR networks, Infomap correctly finds that there is no good non-overlapping community structure.

Overall, these results show that while using a clique as initialization is beneficial on all types of graphs, the choice of the expansion algorithm should always depend on the type of the graph. For weighted graphs, it can be dangerous to depend too much on the structure of the graph, as LTE does. For LFR graphs with a highly overlapping community structure, methods solely based on density outperform those based on triangles, which is a bit surprising as one would still expect intra-cluster edges to be structurally more embedded. This could mean that scores based on triangles are possibly not as sophisticated as those based on density and should be further improved.

11.5 Conclusion

We have compared a set of 8 local community detection algorithms, one of them, Triangle Based Community Expansion, being a novel algorithm. For the evaluation we used both synthetically generated benchmark graphs and real-world social networks. We have shown that triangle-based algorithms are superior to solely density-based algorithms, in particular on real-world networks. Further, we have shown that all algorithms benefit from not starting from a single seed node but from the maximum clique in the neighborhood of the seed node.

Algorithms based on triangles could also be used to improve global overlapping community detection algorithms such as GCE [Lee+10]. In the case of global algorithms, the pre-computation of edge scores and the use of more efficient arrays instead of hash tables lead to significant speedups, as preliminary experiments indicate. Further, our approach could also be extended to dynamic graphs, possibly using an approach similar to [ZB15]. Data structures for maintaining triangle counts [ES09; LSS12] could thereby be used to update edge and node scores.


[Figure 11.6 is a heatmap of the “F1-Score - seed” values; its data is reproduced in the following table. Columns: unweighted small (µ = 0.5), unweighted big (µ = 0.5), overlapping (Om = 4), weighted (µt = 0.3, µw = 0.5), weighted (µt = 0.5, µw = 0.5), weighted (µt = 0.8, µw = 0.5), Facebook top 10.

GCE M          0.47    0.35    0.65    0.056   0.37   0.31    0.13
GCE L          0.5     0.34    0.65    0.056   0.37   0.31    0.13
TwoPhaseL      0.41    0.24    0.69    0.012   0.17   0.072   0.088
LFMLocal       0.49    0.33    0.68    0.057   0.37   0.3     0.12
PRN            0.079   0.092   0.13    0.38    0.27   0.17    0.19
LTE            0.96    0.83    0.74    0.91    0.76   0.16    0.37
LocalT         0.27    0.039   0.087   –       –      –       0.069
TCE            0.98    0.8     0.71    0.76    0.92   0.33    0.16
Cl+GCE M       0.99    0.87    0.87    0.8     0.94   0.42    0.37
Cl+GCE L       1       0.88    0.88    0.8     0.94   0.42    0.37
Cl+TwoPhaseL   0.8     0.6     0.82    0.19    0.14   0.04    0.015
Cl+LFM         0.99    0.9     0.91    0.82    0.94   0.41    0.37
Cl+PRN         0.25    0.1     0.23    0.52    0.44   0.29    0.16
Cl+LTE         0.99    0.9     0.79    1       0.9    0.21    0.4
Cl+LocalT      0.92    0.35    0.087   –       –      –       0.092
Cl+TCE         1       0.93    0.8     0.96    0.96   0.44    0.36
BFS            0.39    0.28    0.25    0.41    0.28   0.12    0.21
Cl             0.35    0.15    0.13    0.2     0.15   0.094   0.17
Infomap        1       1       0.087   0.99    1      0.92    0.36]

Figure 11.6: Summary of F1-Score - seed values of all types of networks considered. The results for the Facebook networks are averages over the 10 networks where “Cl+LTE” had the best scores. LocalT does not support edge weights and is therefore omitted for weighted graphs.

236 11.5 Conclusion pre-computation of edge scores and using more efficient arrays instead of hash tables lead to significant speedups as preliminary experiments indicate. Further, our approach could also be extended to dynamic graphs, possibly using a similar approach to [ZB15]. Data structures for maintaining triangle counts [ES09; LSS12] could thereby be used to update edge and node scores.


Part V

Conclusion


12 Conclusion and Outlook

In this thesis we considered different aspects of scalable community detection. We started with the question of how to evaluate community detection algorithms. With EM-LFR, we contribute an important building block that allows evaluating community detection algorithms on graphs with tens of billions of edges using the exact model of the established LFR benchmark. Our dynamic benchmark generator for overlapping communities, on the other hand, proposes a very challenging benchmark for which scalable algorithms need to be developed.

For disjoint, density-based communities we propose a scalable distributed community detection algorithm that optimizes the established measures modularity and map equation. We show that it is faster and offers better quality than a state-of-the-art algorithm. Further, we compared different methods to filter edges with respect to the preservation of community structure. Our results show that while some methods are effective at selecting intra-community edges, community detection algorithms find different communities on the filtered graphs. On the other hand, simple random edge selection maintains the structure up to about 50% of the edges on the tested graphs.

For the quasi-threshold editing problem, which has been proposed as a possible community detection approach, we developed the first scalable editing heuristic, which allows editing graphs with millions of nodes and edges. To allow an evaluation of such heuristics with respect to their quality, as well as to allow a more detailed evaluation of the quasi-threshold editing approach itself, we further developed and evaluated exact algorithms. Due to the high complexity of the problem, they are unable to find solutions for graphs with hundreds of nodes. For some small social networks, we analyze the found solutions and show that there can be thousands of them.

In the last part, we considered the problem of detecting a community around a seed node. Here, our focus was on the quality of the detected communities and a detailed experimental comparison of different approaches. We show that starting with the largest clique in the neighborhood of the seed node significantly improves the results. Further, approaches that use edge scores similar to those used in our work on sparsification perform better both on synthetic benchmark graphs and on the Facebook networks.

Even after working more than five years on the research presented in this thesis¹, there are still many ways to continue this research. For our distributed community detection algorithm, integrating approaches recently proposed as parts of the Leiden algorithm [TWE19] should help to improve the quality, in particular for modularity. For quasi-threshold editing, we are currently pursuing two directions of research. First, we believe that inclusion-minimal editing, as proposed for cographs [Cre20], should be possible with small modifications that might also further boost the quality of our heuristic.

¹ I started my work as a doctoral researcher in July 2014.

Minimal editing might also be useful as an initialization heuristic. Second, both heuristic and exact approaches could be adapted for editing costs. While a first attempt with an adaptation of our exact FPT-based branch-and-bound algorithm was much inferior to an ILP [Spi19], we are currently investigating new ideas for lower bounds.

Community detection is a vast field and there are many variants. While countless community detection algorithms have already been proposed, there are still areas where we are not aware of any scalable, high-quality algorithms. This in particular concerns overlapping communities. Already the experiments on our dynamic generator in Chapter 4 reveal that there is still a lot of room for improvement. In a master's thesis I supervised, we considered further extending the Ego-Splitting framework [ELL17] as, while it is fast, the quality of the communities is not perfect, as our experiments show [Wie19]. While we improved the quality significantly, there are still open questions concerning in particular the scalability of the extensions and the improvements for detecting communities in real-world networks, which are hard to measure. If we combine dynamic networks with overlapping communities, the problem becomes even more challenging. As a continuation of our work on local community detection, further local community detection algorithms could be adjusted for dynamic graphs, similar to [ZB15].

Apart from distributed algorithms, external memory algorithms are also a way to scale community detection to larger graphs. Based on existing work on label propagation [ASS15] and a bachelor's thesis I supervised [Pla16], in particular semi-external algorithms seem promising. In this model, we keep O(log(n)) bits for each node in the internal memory while storing the edges on disk. This should allow community detection on significantly larger graphs without requiring a compute cluster.

Hypergraphs are a generalization of graphs consisting of nodes and hyperedges, where a hyperedge connects an arbitrary set of nodes instead of just two nodes. This allows a more direct modeling of certain settings. For example, consider a co-authorship network where nodes are authors that are connected if they co-authored a publication. Here, a hyperedge could directly represent a publication instead of connecting all authors of a publication by a clique. While some approaches for community detection in hypergraphs have already been considered [PF11; Kam+19], further generalizations of quality measures and algorithms could be developed.

All measures and algorithms for detecting communities that we considered in this thesis only use the structure of a graph and ignore additional information like node attributes. Those could for example indicate that two people work at the same company or attended the same school. One approach to consider attributes is to add additional edges to the graph that represent attribute similarity [RFP13]. Other approaches directly combine attributes and graph structure into a single quality measure [YML13; Sán+15]. For the first approach, all of the community detection algorithms we present in this thesis could be used. In a bachelor's thesis I co-supervised [Kra15], we considered such an approach for local community detection, where we extended local community detection algorithms to consider attributes.
We found that there are significant challenges in how to evaluate the quality of such algorithms. Further, this was before our work on local

community detection and was only based on algorithms that did not perform well in our comparison. Adjusting the proposed algorithms for graphs with node attributes to use a better base algorithm might thus be promising. For the second case, first understanding the community detection problem alone certainly helps in later developing scalable methods that combine both. Further, scalable algorithms like our distributed community detection algorithms could be adjusted to optimize a quality measure like the one used in [Sán+15].


Bibliography

[ABL10] Yong-Yeol Ahn, James Bagrow, and Sune Lehmann. “Link communities reveal multiscale complexity in networks.” In: Nature 466 (2010), pp. 761–764. doi: 10.1038/nature09182.
[ACL06] Reid Andersen, Fan Chung, and Kevin Lang. “Local Graph Partitioning using PageRank Vectors.” In: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). IEEE Computer Society Press, 2006, pp. 475–486. doi: 10.1109/FOCS.2006.44.
[AHH19] Omer Angel, Remco van der Hofstad, and Cecilia Holmgren. “Limit laws for self-loops and multiple edges in the configuration model.” In: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques 55.3 (2019), pp. 1509–1530. doi: 10.1214/18-AIHP926.
[Ala+16] Maksudul Alam, Maleq Khan, Anil Vullikanti, and Madhav V. Marathe. “An Efficient and Scalable Algorithmic Method for Generating Large-Scale Random Graphs.” In: SC ’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2016, pp. 372–383. doi: 10.1109/SC.2016.31.
[AM11] Rodrigo Aldecoa and Ignacio Marín. “Deciphering Network Community Structure by Surprise.” In: PLoS ONE 6 (Sept. 2011), e24195. doi: 10.1371/journal.pone.0024195.
[AMM05] Vicente Arnau, Sergio Mars, and Ignacio Marín. “Iterative Cluster Analysis of Protein Interaction Data.” In: Bioinformatics 21.3 (2005), pp. 364–378. doi: 10.1093/bioinformatics/bti021.
[ANK13] Nesreen K. Ahmed, Jennifer Neville, and Ramana Rao Kompella. “Network Sampling: From Static to Streaming Graphs.” In: ACM Transactions on Knowledge Discovery from Data 8.2 (2013), 7:1–7:56. doi: 10.1145/2601438.
[Ans85] Richard P. Anstee. “An Algorithmic Proof of Tutte’s f-Factor Theorem.” In: Journal of Algorithms 6.1 (1985), pp. 112–131. doi: 10.1016/0196-6774(85)90022-7.
[AP14] Alessia Amelio and Clara Pizzuti. “Overlapping Community Discovery Methods: A Survey.” In: Social Networks: Analysis and Case Studies. Ed. by Şule Gündüz-Öğüdücü and A. Şima Etaner-Uyar. Lecture Notes in Social Networks. Springer, 2014, pp. 105–125. doi: 10.1007/978-3-7091-1797-2_6.


[AP15] Alessia Amelio and Clara Pizzuti. “Is Normalized Mutual Information a Fair Measure for Comparing Community Detection Methods?” In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE, 2015, pp. 1584–1585. doi: 10.1145/2808797.2809344.
[APU09] Sitaram Asur, Srinivasan Parthasarathy, and Duygu Ucar. “An event-based framework for characterizing the evolutionary behavior of interaction graphs.” In: ACM Transactions on Knowledge Discovery from Data 3.4 (Dec. 2009), 16:1–16:36. doi: 10.1145/1631162.1631164.
[Arg95] Lars Arge. “The Buffer Tree: A New Technique for Optimal I/O-Algorithms.” In: Proceedings of the 4th International Workshop on Algorithms and Data Structures (WADS’95). Lecture Notes in Computer Science. Springer, 1995, pp. 334–345. doi: 10.1007/3-540-60220-8_74.
[ARW12] Diogo V. Andrade, Mauricio G. C. Resende, and Renato F. Werneck. “Fast local search for the maximum independent set problem.” In: Journal of Heuristics 18.4 (2012), pp. 525–547. doi: 10.1007/s10732-012-9196-4.
[ASS15] Yaroslav Akhremtsev, Peter Sanders, and Christian Schulz. “(Semi-)External Algorithms for Graph Partitioning and Clustering.” In: Proceedings of the 17th Meeting on Algorithm Engineering and Experiments (ALENEX’15). SIAM, 2015, pp. 33–43. doi: 10.1137/1.9781611973754.4.
[AV88] Alok Aggarwal and Jeffrey S. Vitter. “The input/output complexity of sorting and related problems.” In: Communications of the ACM 31.9 (July 1988), pp. 1116–1127. doi: 10.1145/48529.48535.
[Bac+06] Lars Backstrom, Dan Huttenlocher, Jon M. Kleinberg, and Xiangyang Lan. “Group formation in large social networks: membership, growth, and evolution.” In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2006, pp. 44–54. doi: 10.1145/1150402.1150412.
[Bad+13] David A. Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner. Graph Partitioning and Graph Clustering: 10th DIMACS Implementation Challenge. Vol. 588. American Mathematical Society, 2013.
[Bad+14] David A. Bader, Henning Meyerhenke, Peter Sanders, Christian Schulz, Andrea Kappes, and Dorothea Wagner. “Benchmarking for graph clustering and partitioning.” In: Encyclopedia of Social Network Analysis and Mining. Ed. by Jon Rokne and Reda Alhajj. Springer, 2014, pp. 73–82. doi: 10.1007/978-1-4614-6170-8_23.
[Bae+17] Seung-Hee Bae, Daniel Halperin, Jevin D. West, Martin Rosvall, and Bill Howe. “Scalable and Efficient Flow-Based Community Detection for Large-Scale Graph Analysis.” In: ACM Transactions on Knowledge Discovery from Data 11.3 (2017), 32:1–32:30. doi: 10.1145/2992785.

[Bag08] James Bagrow. “Evaluating local community methods in networks.” In: Journal of Statistical Mechanics: Theory and Experiment (2008), P05001. doi: 10.1088/1742-5468/2008/05/P05001.
[Bat+13] Joshua D. Batson, Daniel A. Spielman, Nikhil Srivastava, and Shang-Hua Teng. “Spectral sparsification of graphs: theory and algorithms.” In: Communications of the ACM 56.8 (2013), pp. 87–94. doi: 10.1145/2492007.2492029.
[BB05] Vladimir Batagelj and Ulrik Brandes. “Efficient Generation of Large Random Networks.” In: Physical Review E 71 (2005), p. 036113. doi: 10.1103/PhysRevE.71.036113.
[BB13] Sebastian Böcker and Jan Baumbach. “Cluster Editing.” In: Proceedings of the 9th Conference on Computability in Europe (CiE’13). Vol. 7921. Lecture Notes in Computer Science. Springer, 2013, pp. 33–44. doi: 10.1007/978-3-642-39053-1_5.
[BBK11] Sebastian Böcker, Sebastian Briesemeister, and Gunnar W. Klau. “Exact Algorithms for Cluster Editing: Evaluation and Experiments.” In: Algorithmica 60.2 (2011), pp. 316–334. doi: 10.1007/s00453-009-9339-7.
[BDS09] Andreas Beckmann, Roman Dementiev, and Johannes Singler. “Building a Parallel Pipelined External Memory Algorithm Library.” In: 23rd International Parallel and Distributed Processing Symposium (IPDPS’09). IEEE Computer Society, 2009, pp. 1–10. doi: 10.1109/IPDPS.2009.5161001.
[BH15] Seung-Hee Bae and Bill Howe. “GossipMap: a distributed community detection algorithm for billion-edge directed graphs.” In: SC’15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM Press, 2015, 27:1–27:12. doi: 10.1145/2807591.2807668.
[BHJ09] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. “Gephi: An Open Source Software for Exploring and Manipulating Networks.” In: Proceedings of the Third International AAAI Conference on Weblogs and Social Media (ICWSM 2009). AAAI Press, 2009, pp. 361–362. url: http://aaai.org/ocs/index.php/ICWSM/09/paper/view/154.
[BHK15] Sharon Bruckner, Falk Hüffner, and Christian Komusiewicz. “A graph modification approach for finding core-periphery structures in protein interaction networks.” In: Algorithms for Molecular Biology 10 (2015). doi: 10.1186/s13015-015-0043-7.
[Bhu+14] Hasanuzzaman Bhuiyan, Jiangzhuo Chen, Maleq Khan, and Madhav V. Marathe. “Fast Parallel Algorithms for Edge-Switching to Achieve a Target Visit Rate in Heterogeneous Graphs.” In: Proceedings of the 43rd International Conference on Parallel Processing ICPP 2014. IEEE Computer Society, 2014, pp. 60–69. doi: 10.1109/ICPP.2014.15.


[Bin+16] Timo Bingmann, Michael Axtmann, Emanuel Jöbstel, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Sebastian Schlag, Matthias Stumpp, Tobias Sturm, and Peter Sanders. Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++. 2016. arXiv: 1608.05634.
[BKT07] Tanya Berger-Wolf, David Kempe, and Chayant Tantipathananandth. “A Framework For Community Identification in Dynamic Social Networks.” In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2007, pp. 717–726. doi: 10.1145/1281192.1281269.
[Blo+08] Vincent Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. “Fast unfolding of communities in large networks.” In: Journal of Statistical Mechanics: Theory and Experiment 2008.10 (2008). doi: 10.1088/1742-5468/2008/10/P10008.
[Böc+08] Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui, and Anke Truß. “A fixed-parameter approach for Weighted Cluster Editing.” In: Proceedings of the 6th Asia-Pacific Bioinformatics Conference (APBC 2008). Vol. 6. 2008, pp. 211–220. doi: 10.1142/9781848161092_0023.
[Böc+09] Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui, and Anke Truß. “Going weighted: Parameterized algorithms for cluster editing.” In: Theoretical Computer Science 410.52 (Dec. 2009), pp. 5467–5480. doi: 10.1016/j.tcs.2009.05.006.
[Boh15] Felix Bohlmann. “Graphclustern durch Zerstören langer induzierter Pfade.” Bachelor’s thesis. TU Berlin, 2015. url: http://fpt.akt.tu-berlin.de/publications/theses/BA-felix-bohlmann.pdf.
[Bol+04] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. “UbiCrawler: A Scalable Fully Distributed Web Crawler.” In: Software - Practice and Experience 34.8 (2004), pp. 711–726. doi: 10.1002/spe.587.
[Bol+11] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. “Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks.” In: Proceedings of the 20th International Conference on World Wide Web (WWW’11). ACM Press, 2011, pp. 587–596. doi: 10.1145/1963405.1963488.
[Bor+15] Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter A. Kosters, Andrea Marino, and Frank W. Takes. “Fast diameter and radius BFS-based computation in (weakly connected) real-world graphs: With an application to the six degrees of separation games.” In: Theoretical Computer Science 586 (2015), pp. 59–80. doi: 10.1016/j.tcs.2015.02.033.
[Bra+08] Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Görke, Martin Höfer, Zoran Nikoloski, and Dorothea Wagner. “On Modularity Clustering.” In: IEEE Transactions on Knowledge and Data Engineering 20.2 (Feb. 2008), pp. 172–188. doi: 10.1109/TKDE.2007.190689.

[Bra+15a] Ulrik Brandes, Michael Hamann, Ben Strasser, and Dorothea Wagner. Fast Quasi-Threshold Editing. 2015. arXiv: 1504.07379 [cs.DS].
[Bra+15b] Ulrik Brandes, Michael Hamann, Ben Strasser, and Dorothea Wagner. “Fast Quasi-Threshold Editing.” In: Proceedings of the 23rd Annual European Symposium on Algorithms (ESA’15). Vol. 9294. Lecture Notes in Computer Science. Springer, 2015, pp. 251–262. doi: 10.1007/978-3-662-48350-3_22.
[BS11] Charles-Edmond Bichot and Patrick Siarry, eds. Graph Partitioning. Wiley, 2011. doi: 10.1002/9781118601181.
[BS80] Jon Louis Bentley and James B. Saxe. “Generating Sorted Lists of Random Numbers.” In: ACM Transactions on Mathematical Software 6.3 (1980), pp. 359–364. doi: 10.1145/355900.355907.
[Buz+14] Nazar Buzun, Anton Korshunov, Valeriy Avanesov, Ilya Filonenko, Ilya Kozlov, Denis Turdakov, and Hangkyu Kim. “EgoLP: Fast and Distributed Community Detection in Billion-Node Social Networks.” In: Proceedings of the 2014 IEEE International Conference on Data Mining Workshops. IEEE Computer Society, 2014, pp. 533–540. doi: 10.1109/ICDMW.2014.158.
[BV04] Paolo Boldi and Sebastiano Vigna. “The WebGraph Framework I: Compression Techniques.” In: Proceedings of the 13th International Conference on World Wide Web (WWW2004). ACM Press, 2004, pp. 595–602. doi: 10.1145/988672.988752.
[Cai96] Leizhen Cai. “Fixed-parameter tractability of graph modification problems for hereditary properties.” In: Information Processing Letters 58.4 (May 1996), pp. 171–176. doi: 10.1016/0020-0190(96)00050-6.
[Car+18] Corrie Jacobien Carstens, Michael Hamann, Ulrich Meyer, Manuel Penschuck, Hung Tran, and Dorothea Wagner. “Parallel and I/O-efficient Randomisation of Massive Networks using Global Curveball Trades.” In: Proceedings of the 26th Annual European Symposium on Algorithms (ESA’18). Leibniz International Proceedings in Informatics. 2018, 11:1–11:15. doi: 10.4230/LIPIcs.ESA.2018.11.
[CBS18] Corrie Jacobien Carstens, Annabell Berger, and Giovanni Strona. “A unifying framework for fast randomization of ecological networks with fixed (node) degrees.” In: MethodsX 5 (2018), pp. 773–780. doi: 10.1016/j.mex.2018.06.018.
[Cel86] Pedro Celis. “Robin Hood Hashing.” PhD thesis. Computer Science Department, University of Waterloo, 1986. url: https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf.
[Cha+17] Tanmoy Chakraborty, Ayushi Dalmia, Animesh Mukherjee, and Niloy Ganguly. “Metrics for Community Analysis: A Survey.” In: ACM Computing Surveys 50.4 (2017). doi: 10.1145/3091106.


[Chu08] Frank Pok Man Chu. “A simple linear time certifying LBFS-based algorithm for recognizing trivially perfect graphs and their complements.” In: Information Processing Letters 107.1 (June 2008), pp. 7–12. doi: 10.1016/j.ipl.2007.12.009.

[Chy+14] Kyrylo Chykhradze, Anton Korshunov, Nazar Buzun, Roman Pastukhov, Nikolay Kuzyurin, Denis Turdakov, and Hangkyu Kim. “Distributed Generation of Billion-node Social Graphs with Overlapping Community Structure.” In: Proceedings of the 5th Workshop on Complex Networks CompleNet 2014. Vol. 549. Studies in Computational Intelligence. Springer, 2014, pp. 199–208. doi: 10.1007/978-3-319-05401-8_19.

[CK01] Anne Condon and Richard M. Karp. “Algorithms for Graph Partitioning on the Planted Partition Model.” In: Random Structures and Algorithms 18.2 (2001), pp. 116–140. doi: 10.1002/1098-2418(200103)18:2<116::AID-RSA1001>3.0.CO;2-2.

[CKU13] Yudong Chen, Vikas Kawadia, and Rahul Urgaonkar. Detecting Overlapping Temporal Community Structure in Time-Evolving Networks. 2013. arXiv: 1303.7226.

[Cla05] Aaron Clauset. “Finding local community structure in networks.” In: Physical Review E 72.2 (Aug. 2005), p. 026132. doi: 10.1103/PhysRevE.72.026132.

[CN85] Norishige Chiba and Takao Nishizeki. “Arboricity and Subgraph Listing Algorithms.” In: SIAM Journal on Computing 14.1 (Feb. 1985), pp. 210–223. doi: 10.1137/0214017.

[Cop+19] Lauranne Coppens, Jonathan De Venter, Sandra Mitrovic, and Jochen De Weerdt. “A comparative study of community detection techniques for large evolving graphs.” In: LEG@ECML: The third International Workshop on Advances in Managing and Mining Large Evolving Graphs collocated with ECML-PKDD. Springer. 2019. url: https://leg-ecmlpkdd19.loria.fr/articles/LEGECML-PKDD_2019_paper_5.pdf.

[CR19] Rémy Cazabet and Giulio Rossetti. “Challenges in Community Discovery on Temporal Networks.” In: Temporal Network Theory. Ed. by Petter Holme and Jari Saramäki. Computational Social Sciences. Springer, 2019, pp. 181–197. doi: 10.1007/978-3-030-23495-9_10.

[Cre20] Christophe Crespelle. “Linear-Time Minimal Cograph Editing.” 2020. url: https://perso.ens-lyon.fr/christophe.crespelle/publications/SUB_minimal-cograph-editing.pdf.

[CS11] Jie Chen and Ilya Safro. “Algebraic distance on graphs.” In: SIAM Journal on Scientific Computing 33.6 (2011), pp. 3468–3490. doi: 10.1137/090775087.

[CZG09] Jiyang Chen, Osmar R. Zaïane, and Randy Goebel. “Local Community Identification in Social Networks.” In: Proceedings of the 2009 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE Computer Society, 2009, pp. 237–242. doi: 10.1109/ASONAM.2009.14.

[Dam08] Peter Damaschke. “Fixed-Parameter Enumerability of Cluster Editing and Related Problems.” In: Theory of Computing Systems 46.2 (2008), pp. 261–283. doi: 10.1007/s00224-008-9130-1.

[DG08] Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” In: Communications of the ACM 51.1 (2008), pp. 107–113. doi: 10.1145/1327452.1327492.

[DGG13] Maximilien Danisch, Jean-Loup Guillaume, and Bénédicte Le Grand. “Towards multi-ego-centred communities: a node similarity approach.” In: International Journal of Web Based Communities 9.3 (2013), pp. 299–322. doi: 10.1504/IJWBC.2013.054906.

[DHC95] Hassan Ali Dawah, Bradford A. Hawkins, and Michael F. Claridge. “Structure of the Parasitoid Communities of Grass-Feeding Chalcid Wasps.” In: Journal of Animal Ecology 64.6 (1995), pp. 708–720. doi: 10.2307/5850.

[DKS08] Roman Dementiev, Lutz Kettner, and Peter Sanders. “STXXL: standard template library for XXL data sets.” In: Software - Practice and Experience 38.6 (Apr. 2008), pp. 589–637. doi: 10.1002/spe.844.

[DP17] Pål Grønås Drange and Michał Pilipczuk. “A Polynomial Kernel for Trivially Perfect Editing.” In: Algorithmica 80.12 (Dec. 2017), pp. 3481–3524. doi: 10.1007/s00453-017-0401-6.

[Dra+15] Pål Grønås Drange, Markus Sortland Dregi, Daniel Lokshtanov, and Blair D. Sullivan. “On the Threshold of Intractability.” In: Proceedings of the 23rd Annual European Symposium on Algorithms (ESA’15). Vol. 9294. Lecture Notes in Computer Science. Springer, 2015. doi: 10.1007/978-3-662-48350-3_35.

[Dua+12] Dongsheng Duan, Yuhua Li, Ruixuan Li, and Zhengding Lu. “Incremental K-clique clustering in dynamic social networks.” In: Artificial Intelligence Review 38.2 (Aug. 2012), pp. 129–147. doi: 10.1007/s10462-011-9250-x.

[Ebb+08] Peter Ebbes, Zan Huang, Arvind Rangaswamy, and Hari P. Thadakamalla. “Sampling large-scale social networks: Insights from simulated networks.” In: Proceedings of the 18th Annual Workshop on Information Technologies and Systems (WITS 2008). 2008, pp. 49–54. url: https://pennstate.pure.elsevier.com/en/publications/sampling-large-scale-social-networks-insights-from-simulated-netw.


[EH80] Roger B. Eggleton and Derek Allan Holton. “Simple and multigraphic realizations of degree sequences.” In: Proceedings of the Eighth Australian Conference on Combinatorial Mathematics. Lecture Notes in Mathematics. Springer, 1980, pp. 155–172. doi: 10.1007/BFb0091817.

[EK10] David Easley and Jon M. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press, July 2010. url: https://www.cs.cornell.edu/home/kleinber/networks-book/.

[ELL17] Alessandro Epasto, Silvio Lattanzi, and Renato Paes Leme. “Ego-Splitting Framework: from Non-Overlapping to Overlapping Clusters.” In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2017, pp. 145–154. doi: 10.1145/3097983.3098054.

[ELS10] David Eppstein, Maarten Löffler, and Darren Strash. “Listing All Maximal Cliques in Sparse Graphs in Near-Optimal Time.” In: Proceedings of the 21st International Symposium on Algorithms and Computation (ISAAC’10). Lecture Notes in Computer Science. Springer, 2010, pp. 403–414. doi: 10.1007/978-3-642-17517-6_36.

[ELS13] David Eppstein, Maarten Löffler, and Darren Strash. “Listing All Maximal Cliques in Large Sparse Real-World Graphs.” In: ACM Journal of Experimental Algorithmics 18 (Nov. 2013), 3.1:1–3.1:21. doi: 10.1145/2543629.

[Emm+16] Scott Emmons, Stephen G. Kobourov, Mike Gallant, and Katy Börner. “Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale.” In: PLoS ONE 11.7 (July 2016), pp. 1–18. doi: 10.1371/journal.pone.0159161.

[Epp+12] David Eppstein, Michael T. Goodrich, Darren Strash, and Lowell Trott. “Extended dynamic subgraph statistics using h-index parameterized data structures.” In: Theoretical Computer Science 447 (Aug. 2012), pp. 44–52. doi: 10.1016/j.tcs.2011.11.034.

[ER59] Paul Erdős and Alfréd Rényi. “On Random Graphs I.” In: Publicationes Mathematicae Debrecen 6 (1959), pp. 290–297.

[ES09] David Eppstein and Emma S. Spiro. “The h-Index of a Graph and Its Application to Dynamic Subgraph Statistics.” In: Algorithms and Data Structures, 11th International Symposium (WADS’09). Ed. by Frank Dehne, Marina Gavrilova, Jörg-Rüdiger Sack, and Csaba D. Tóth. Vol. 5664. Lecture Notes in Computer Science. Springer, Aug. 2009, pp. 278–289. doi: 10.1007/978-3-642-03367-4_25.

[FA19] Md Abdul Motaleb Faysal and Shaikh Arifuzzaman. “Distributed Community Detection in Large Networks using An Information-Theoretic Approach.” In: 2019 IEEE International Conference on Big Data. IEEE Computer Society, 2019, pp. 4773–4782. doi: 10.1109/BigData47090.2019.9005562.

[Fan+14] Meng Fanrong, Zhu Mu, Zhou Yong, and Zhou Ranran. “Local Community Detection in Complex Networks Based on Maximum Cliques Extension.” In: Mathematical Problems in Engineering 2014 (2014). Article ID 653670, 12 pages. doi: 10.1155/2014/653670.

[FB07] Santo Fortunato and Marc Barthélemy. “Resolution limit in community detection.” In: Proceedings of the National Academy of Science of the United States of America 104.1 (2007), pp. 36–41. doi: 10.1073/pnas.0605965104.

[FH16] Santo Fortunato and Darko Hric. “Community detection in networks: A user guide.” In: Physics Reports 659 (2016), pp. 1–44. doi: 10.1016/j.physrep.2016.09.002.

[FKW14] Tobias Fleck, Andrea Kappes, and Dorothea Wagner. “Graph Clustering with Surprise: Complexity and Exact Solutions.” In: Proceedings of the 40th International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’14). Vol. 8327. Lecture Notes in Computer Science. Springer, 2014, pp. 223–234. doi: 10.1007/978-3-319-04298-5_20.

[Fon+11] Luciano da Fontoura Costa, Osvaldo N. Oliveira Jr., Gonzalo Travieso, Francisco Aparecido Rodrigues, Paulino Ribeiro Villas Boas, Lucas Antiqueira, Matheus Palhares Viana, and Luis Enrique Correa Rocha. “Analyzing and modeling real-world phenomena with complex networks: a survey of applications.” In: Advances in Physics 60.3 (2011), pp. 329–412. doi: 10.1080/00018732.2011.572452.

[For10] Santo Fortunato. “Community detection in graphs.” In: Physics Reports 486.3–5 (2010), pp. 75–174. doi: 10.1016/j.physrep.2009.11.002.

[Fun19] Park K. Fung. “InfoFlow: A Distributed Algorithm to Detect Communities According to the Map Equation.” In: Big Data and Cognitive Computing 3.3 (2019), p. 42. doi: 10.3390/bdcc3030042.

[FY48] Ronald A. Fisher and Frank Yates. Statistical Tables for Biological, Agricultural and Medical Research. Oliver and Boyd, London, 1948. url: http://www.worldcat.org/oclc/14222135.

[FZB14] Justin Fagnan, Osmar R. Zaïane, and Denilson Barbosa. “Using triads to identify local community structure in social networks.” In: Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE Computer Society, 2014, pp. 108–112. doi: 10.1109/ASONAM.2014.6921568.

[Gal57] David Gale. “A theorem on flows in networks.” In: Pacific Journal of Mathematics 7.2 (1957), pp. 1073–1082. doi: 10.2140/pjm.1957.7.1073.

[GD03] Pablo Gleiser and Leon Danon. “Community Structure in Jazz.” In: Advances in Complex Systems 6.4 (2003), pp. 565–573. doi: 10.1142/S0219525903001067.


[GDC10] Derek Greene, Dónal Doyle, and Pádraig Cunningham. “Tracking the Evolution of Communities in Dynamic Social Networks.” In: Proceedings of the 2010 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE Computer Society, 2010, pp. 176–183. doi: 10.1109/ASONAM.2010.17.

[Gel19] John Gelhausen. “A Comparative Study of Overlapping Community Detection Algorithms.” Bachelor’s thesis. Department of Informatics, Karlsruhe Institute of Technology (KIT), 2019. url: https://i11www.iti.kit.edu/_media/teaching/theses/ba-gelhausen-19.pdf.

[Gho+18a] Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, and Assefaw H. Gebremedhin. “Scalable Distributed Memory Community Detection Using Vite.” In: Proceedings of the 2018 IEEE High Performance Extreme Computing Conference. IEEE, 2018. doi: 10.1109/HPEC.2018.8547534.

[Gho+18b] Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, Hao Lu, Daniel Chavarría-Miranda, Arif Khan, and Assefaw H. Gebremedhin. “Distributed Louvain Algorithm for Graph Community Detection.” In: 32nd International Parallel and Distributed Processing Symposium (IPDPS’18). IEEE Computer Society, 2018, pp. 885–895. doi: 10.1109/IPDPS.2018.00098.

[Gil59] Edgar Nelson Gilbert. “Random Graphs.” In: The Annals of Mathematical Statistics 30.4 (1959), pp. 1141–1144. doi: 10.1214/aoms/1177706098.

[GKW14] Robert Görke, Andrea Kappes, and Dorothea Wagner. “Experiments on Density-Constrained Graph Clustering.” In: ACM Journal of Experimental Algorithmics 19 (Sept. 2014), 1.6:1.1–1.6:1.31. doi: 10.1145/2638551.

[GMC10] Benjamin H. Good, Yves-Alexandre de Montjoye, and Aaron Clauset. “Performance of modularity maximization in practical contexts.” In: Physical Review E 81.046106 (2010). doi: 10.1103/PhysRevE.81.046106.

[GMZ03] Christos Gkantsidis, Milena Mihail, and Ellen W. Zegura. “The Markov Chain Simulation Method for Generating Connected Power Law Random Graphs.” In: Proceedings of the 5th Workshop on Algorithm Engineering and Experiments (ALENEX’03). SIAM, 2003, pp. 16–25.

[GN02] Michelle Girvan and Mark E. J. Newman. “Community structure in social and biological networks.” In: Proceedings of the National Academy of Science of the United States of America 99.12 (2002), pp. 7821–7826. doi: 10.1073/pnas.122653799.

[Gör+12] Robert Görke, Roland Kluge, Andrea Schumm, Christian L. Staudt, and Dorothea Wagner. An Efficient Generator for Clustered Dynamic Random Networks. Tech. rep. Karlsruhe Reports in Informatics 2012,17. ITI Wagner, Department of Informatics, Karlsruhe Institute of Technology (KIT), 2012. url: http://digbib.ubka.uni-karlsruhe.de/volltexte/1000029825.

[Got+20a] Lars Gottesbüren, Michael Hamann, Philipp Schoch, Ben Strasser, Dorothea Wagner, and Sven Zühlsdorf. Engineering Exact Quasi-Threshold Editing. 2020. arXiv: 2003.14317 [cs.DS].

[Got+20b] Lars Gottesbüren, Michael Hamann, Philipp Schoch, Ben Strasser, Dorothea Wagner, and Sven Zühlsdorf. “Engineering Exact Quasi-Threshold Editing.” In: Proceedings of the 18th International Symposium on Experimental Algorithms (SEA’20). Vol. 160. Leibniz International Proceedings in Informatics. 2020, 10:1–10:14. doi: 10.4230/LIPIcs.SEA.2020.10.

[Gra+15] Clara Granell, Richard K. Darst, Alex Arenas, Santo Fortunato, and Sergio Gomez. “Benchmark model to assess community structure in evolving networks.” In: Physical Review E 92.1 (July 2015). doi: 10.1103/PhysRevE.92.012805.

[GS18] Catherine Greenhill and Matteo Sfragara. “The switch Markov chain for sampling irregular graphs and digraphs.” In: Theoretical Computer Science 719 (2018), pp. 1–20. doi: 10.1016/j.tcs.2017.11.010.

[Gur20] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual. 2020. url: http://www.gurobi.com.

[GW89] Martin Grötschel and Yoshiko Wakabayashi. “A cutting plane algorithm for a clustering problem.” In: Mathematical Programming 45.1-3 (1989), pp. 59–96. doi: 10.1007/BF01589097.

[HA85] Lawrence Hubert and Phipps Arabie. “Comparing partitions.” In: Journal of Classification 2.1 (Dec. 1985), pp. 193–218. doi: 10.1007/BF01908075.

[Hak62] S. Louis Hakimi. “On Realizability of a Set of Integers as Degrees of the Vertices of a Linear Graph. I.” In: Journal of the Society for Industrial and Applied Mathematics 10.3 (July 1962), pp. 496–506. doi: 10.1137/0110037.

[Ham+16] Michael Hamann, Gerd Lindner, Henning Meyerhenke, Christian L. Staudt, and Dorothea Wagner. “Structure-preserving sparsification methods for social networks.” In: Social Network Analysis and Mining 6.1 (2016), 22:1–22:22. doi: 10.1007/s13278-016-0332-2.

[Ham+17] Michael Hamann, Ulrich Meyer, Manuel Penschuck, and Dorothea Wagner. “I/O-efficient Generation of Massive Graphs Following the LFR Benchmark.” In: Proceedings of the 19th Meeting on Algorithm Engineering and Experiments (ALENEX’17). SIAM, 2017, pp. 58–72. doi: 10.1137/1.9781611974768.5.

[Ham+18a] Michael Hamann, Ben Strasser, Dorothea Wagner, and Tim Zeitz. “Distributed Graph Clustering Using Modularity and Map Equation.” In: Proceedings of the 24th International Conference on Parallel Processing (Euro-Par 2018). Vol. 11014. Lecture Notes in Computer Science. Springer, 2018, pp. 688–702. doi: 10.1007/978-3-319-96983-1_49.


[Ham+18b] Michael Hamann, Ulrich Meyer, Manuel Penschuck, Hung Tran, and Dorothea Wagner. “I/O-Efficient Generation of Massive Graphs Following the LFR Benchmark.” In: ACM Journal of Experimental Algorithmics 23 (2018), 2.5:1–2.5:33. doi: 10.1145/3230743.

[Har+14] Steve Harenberg, Gonzalo Bello, L. Gjeltema, Stephen Ranshous, Jitendra Harlalka, Ramona Seay, Kanchana Padmanabhan, and Nagiza Samatova. “Community detection in large-scale networks: a survey and empirical evaluation.” In: Wiley Interdisciplinary Reviews: Computational Statistics 6.6 (2014), pp. 426–439. doi: 10.1002/wics.1319.

[Hav55] Václav Havel. “A Remark on the Existence of Finite Graphs (in Czech).” In: Československá Akademie Věd. Časopis Pro Pěstování Matematiky 80 (1955), pp. 477–480.

[HDF14] Darko Hric, Richard K. Darst, and Santo Fortunato. “Community detection in networks: Structural communities versus ground truth.” In: Physical Review E 90.6 (2014), p. 062805. doi: 10.1103/PhysRevE.90.062805.

[Hel+15] Marc Hellmuth, Nicolas Wieseke, Marcus Lechner, Hans-Peter Lenhof, Martin Middendorf, and Peter F. Stadler. “Phylogenomics with paralogs.” In: Proceedings of the National Academy of Science of the United States of America 112.7 (2015), pp. 2058–2063. doi: 10.1073/pnas.1412770112.

[HH15] Sepp Hartung and Holger H. Hoos. “Programming by Optimisation Meets Parameterised Algorithmics: A Case Study for Cluster Editing.” In: Proceedings of the 9th International Conference on Learning and Intelligent Optimization. Lecture Notes in Computer Science. Springer, 2015, pp. 43–58. doi: 10.1007/978-3-319-19084-6_5.

[HKW16] Tanja Hartmann, Andrea Kappes, and Dorothea Wagner. “Clustering Evolving Networks.” In: Algorithm Engineering - Selected Results and Surveys. Ed. by Lasse Kliemann and Peter Sanders. Vol. 9220. Lecture Notes in Computer Science. Springer, 2016, pp. 280–329. doi: 10.1007/978-3-319-49487-6_9.

[HRW17] Michael Hamann, Eike Röhrs, and Dorothea Wagner. “Local Community Detection Based on Small Cliques.” In: Algorithms 10.3 (2017), p. 90. doi: 10.3390/a10030090.

[HS17] Tobias Heuer and Sebastian Schlag. “Improving Coarsening Schemes for Hypergraph Partitioning by Exploiting Community Structure.” In: Proceedings of the 16th International Symposium on Experimental Algorithms (SEA’17). Ed. by Costas S. Iliopoulos, Solon P. Pissis, Simon J. Puglisi, and Rajeev Raman. Vol. 75. Leibniz International Proceedings in Informatics. 2017. doi: 10.4230/LIPIcs.SEA.2017.21.

[HTC14] Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung. “I/O-Efficient Algorithms on Triangle Listing and Counting.” In: ACM Transactions on Database Systems 39.4 (2014), 27:1–27:30. doi: 10.1145/2691190.2691193.

[Hua+11] Jianbin Huang, Heli Sun, Yaguang Liu, Qinbao Song, and Tim Weninger. “Towards Online Multiresolution Community Detection in Large-Scale Networks.” In: PLoS ONE 6.8 (Aug. 2011), pp. 1–11. doi: 10.1371/journal.pone.0023829.

[Jeb+18] Malek Jebabli, Hocine Cherifi, Chantal Cherifi, and Atef Hamouda. “Community detection algorithm evaluation with ground-truth data.” In: Physica A: Statistical Mechanics and its Applications 492 (2018), pp. 651–706. doi: 10.1016/j.physa.2017.10.018.

[Jia+14] Songwei Jia, Lin Gao, Yong Gao, and Haiyang Wang. “Anti-triangle centrality-based community detection in complex networks.” In: IET Systems Biology 8.3 (2014), pp. 116–125. doi: 10.1049/iet-syb.2013.0039.

[Jia+15] Songwei Jia, Lin Gao, Yong Gao, James Nastos, Yijie Wang, Xindong Zhang, and Haiyang Wang. “Defining and identifying cograph communities in complex networks.” In: New Journal of Physics 17.1 (2015), p. 013044. doi: 10.1088/1367-2630/17/1/013044.

[JS16] Emmanuel John and Ilya Safro. “Single- and multi-level network sparsification by algebraic distance.” In: Journal of Complex Networks 5.3 (Oct. 2016), pp. 352–388. doi: 10.1093/comnet/cnw025.

[Kai08] Marcus Kaiser. “Mean clustering coefficients: the role of isolated nodes and leafs on clustering measures for small-world networks.” In: New Journal of Physics 10.8 (2008). doi: 10.1088/1367-2630/10/8/083042.

[Kam+19] Bogumił Kamiński, Valérie Poulin, Paweł Prałat, Przemysław Szufel, and François Théberge. “Clustering via hypergraph modularity.” In: PLoS ONE 14.11 (2019), pp. 1–15. doi: 10.1371/journal.pone.0224307.

[KK98] George Karypis and Vipin Kumar. “A Parallel Algorithm for Multilevel Graph Partitioning and Sparse Matrix Ordering.” In: Journal of Parallel and Distributed Computing 48 (Jan. 1998), pp. 71–95. doi: 10.1006/jpdc.1997.1403.

[KN11] Brian Karrer and Mark E. J. Newman. “Stochastic blockmodels and community structure in networks.” In: Physical Review E 83.1 (2011), p. 016107. doi: 10.1103/PhysRevE.83.016107.

[Knu93] Donald E. Knuth. The Stanford GraphBase: a platform for combinatorial computing. Addison-Wesley, 1993.

[Kol+14] Tamara G. Kolda, Ali Pinar, Todd D. Plantenga, and C. Seshadhri. “A Scalable Generative Graph Model with Community Structure.” In: SIAM Journal on Scientific Computing 36.5 (2014), pp. C424–C452. doi: 10.1137/130914218.

[KR15] Tatsuro Kawamoto and Martin Rosvall. “Estimating the resolution limit of the map equation in community detection.” In: Physical Review E 91 (2015), p. 012809. doi: 10.1103/PhysRevE.91.012809.


[Kra15] Jonas Krautter. “Iterative Local Community Detection Combining Graph Structure and Attribute Similarity.” Bachelor’s thesis. Department of Informatics, Karlsruhe Institute of Technology (KIT), 2015. url: https://i11www.iti.kit.edu/_media/teaching/theses/ba-krautter-15.pdf.

[Kri+10] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. “Hyperbolic geometry of complex networks.” In: Physical Review E 82.3 (2010), p. 036106. doi: 10.1103/PhysRevE.82.036106.

[KTV99] Ravi Kannan, Prasad Tetali, and Santosh Vempala. “Simple Markov-chain algorithms for generating bipartite graphs and tournaments.” In: Random Structures and Algorithms 14.4 (1999), pp. 293–308. doi: 10.1002/(SICI)1098-2418(199907)14:4<293::AID-RSA1>3.0.CO;2-G.

[Kum+07] Jussi M. Kumpula, Jari Saramäki, Kimmo Kaski, and János Kertész. “Limited resolution in complex network community detection with Potts model approach.” In: The European Physical Journal B 56.1 (Mar. 2007), pp. 41–45. doi: 10.1140/epjb/e2007-00088-4.

[Kva87] Tarald O. Kvalseth. “Entropy and Correlation: Some Comments.” In: IEEE Transactions on Systems, Man and Cybernetics 17.3 (1987), pp. 517–519. doi: 10.1109/TSMC.1987.4309069.

[KW73] Daniel J. Kleitman and D. L. Wang. “Algorithms for constructing graphs and digraphs with given valences and factors.” In: Discrete Mathematics 6.1 (1973), pp. 79–88. doi: 10.1016/0012-365X(73)90037-X.

[KWZ15] P. R. Kumar, Martin J. Wainwright, and Riccardo Zecchina. Mathematical Foundations of Complex Networked Information Systems. Vol. 2141. Lecture Notes in Mathematics. Springer, 2015. doi: 10.1007/978-3-319-16967-5.

[LA99] Bjornar Larsen and Chinatsu Aone. “Fast and effective text mining using linear-time document clustering.” In: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and Data Mining. ACM Press, 1999, pp. 16–22. doi: 10.1145/312129.312186.

[Lan+10] Andrea Lancichinetti, Mikko Kivelä, Jari Saramäki, and Santo Fortunato. “Characterizing the Community Structure of Complex Networks.” In: PLoS ONE 5.8 (Aug. 2010). doi: 10.1371/journal.pone.0011976.

[Lan+11] Andrea Lancichinetti, Filippo Radicchi, José J. Ramasco, and Santo Fortunato. “Finding Statistically Significant Communities in Networks.” In: PLoS ONE 6.4 (Apr. 2011), pp. 1–18. doi: 10.1371/journal.pone.0018961.

[LC13] Conrad Lee and Pádraig Cunningham. Benchmarking community detection methods on social media data. 2013. arXiv: 1302.0739.

[Lee+10] Conrad Lee, Fergal Reid, Aaron McDaid, and Neil Hurley. Detecting highly overlapping community structure by greedy clique expansion. 2010. arXiv: 1002.1827.

[LF06] Jure Leskovec and Christos Faloutsos. “Sampling from large graphs.” In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2006, pp. 631–636. doi: 10.1145/1150402.1150479.

[LF09a] Andrea Lancichinetti and Santo Fortunato. “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.” In: Physical Review E 80.1 (2009), p. 016118. doi: 10.1103/PhysRevE.80.016118.

[LF09b] Andrea Lancichinetti and Santo Fortunato. “Community detection algorithms: A comparative analysis.” In: Physical Review E 80.5 (Nov. 2009). doi: 10.1103/PhysRevE.80.056117.

[LFK09] Andrea Lancichinetti, Santo Fortunato, and János Kertész. “Detecting the overlapping and hierarchical community structure of complex networks.” In: New Journal of Physics 11.033015 (2009). url: http://www.iop.org/EJ/njp.

[LFR08] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. “Benchmark graphs for testing community detection algorithms.” In: Physical Review E 78.4 (Oct. 2008), p. 046110. doi: 10.1103/PhysRevE.78.046110.

[Li+15] Yixuan Li, Kun He, David Bindel, and John E. Hopcroft. “Uncovering the Small Community Structure in Large Networks: A Local Spectral Approach.” In: Proceedings of the 24th International Conference on World Wide Web (WWW 2015). ACM Press, 2015, pp. 658–668. doi: 10.1145/2736277.2741676.

[Lin+15] Gerd Lindner, Christian L. Staudt, Michael Hamann, Henning Meyerhenke, and Dorothea Wagner. “Structure-Preserving Sparsification of Social Networks.” In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE, 2015. doi: 10.1145/2808797.2809313.

[Lin+16] Xiao Ling, Jiahai Yang, Dan Wang, Jinfeng Chen, and Liyao Li. “Fast Community Detection in Large Weighted Networks Using GraphX in the Cloud.” In: 18th IEEE International Conference on High Performance Computing and Communications; 14th IEEE International Conference on Smart City; 2nd IEEE International Conference on Data Science and Systems. IEEE, 2016, pp. 1–8. doi: 10.1109/HPCC-SmartCity-DSS.2016.0012.

[Liu+15] Yunlong Liu, Jianxin Wang, Jie You, Jianer Chen, and Yixin Cao. “Edge deletion problems: Branching facilitated by modular decomposition.” In: Theoretical Computer Science 573 (2015), pp. 63–70. doi: 10.1016/j.tcs.2015.01.049.

[LK14] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. June 2014. url: http://snap.stanford.edu/data.


[LKF05] Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. “Graphs Over Time: Densification Laws, Shrinking Diameters and Possible Explanations.” In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2005, pp. 177–187. doi: 10.1145/1081870.1081893.

[LSS12] Min Chih Lin, Francisco J. Soulignac, and Jayme Luiz Szwarcfiter. “Arboricity, h-index, and dynamic algorithms.” In: Theoretical Computer Science 426 (2012), pp. 75–90. doi: 10.1016/j.tcs.2011.12.006.

[Lus+04] David Lusseau, Karsten Schneider, Oliver Boisseau, Patti Haase, Elisabeth Slooten, and Steve Dawson. “The Bottlenose Dolphin Community of Doubtful Sound Features a Large Proportion of Long-Lasting Associations.” In: Behavioral Ecology and Sociobiology 54.4 (Sept. 2004), pp. 396–405. doi: 10.1007/s00265-003-0651-y.

[LWP08] Feng Luo, James Zijun Wang, and Eric Promislow. “Exploring local community structures in large networks.” In: Web Intelligence and Agent Systems: An International Journal 6.4 (2008), pp. 387–400. doi: 10.3233/WIA-2008-0147.

[Ma+14] Lianhang Ma, Hao Huang, Qinming He, Kevin Chiew, and Zhenguang Liu. “Toward seed-insensitive solutions to local community detection.” In: Journal of Intelligent Information Systems 43.1 (2014), pp. 183–203. doi: 10.1007/s10844-014-0315-6.

[MC18] Alessandro Muscoloni and Carlo Vittorio Cannistraci. “A nonuniform popularity-similarity optimization (nPSO) model to efficiently generate realistic complex networks with communities.” In: New Journal of Physics 20 (2018), p. 052002. doi: 10.1088/1367-2630/aac06f.

[Meu+13] Wouter Meulemans, Nathalie Henry Riche, Bettina Speckmann, Basak Alper, and Tim Dwyer. “KelpFusion: A Hybrid Set Visualization Technique.” In: IEEE Transactions on Visualization and Computer Graphics 19.11 (2013), pp. 1846–1858. doi: 10.1109/TVCG.2013.76.

[MGH11] Aaron McDaid, Derek Greene, and Neil Hurley. Normalized Mutual Information to evaluate overlapping community finding algorithms. 2011. arXiv: 1110.2515 [physics.soc-ph].

[MH10] Aaron McDaid and Neil Hurley. “Detecting Highly Overlapping Communities with Model-Based Overlapping Seed Expansion.” In: Proceedings of the 2010 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE Computer Society, 2010, pp. 112–119. doi: 10.1109/ASONAM.2010.77.

[Mil+03] Ron Milo, Nadav Kashtan, Shalev Itzkovitz, Mark E. J. Newman, and Uri Alon. On the uniform generation of random graphs with prescribed degree sequences. 2003. arXiv: cond-mat/0312028.

[ML12] Julian J. McAuley and Jure Leskovec. “Learning to Discover Social Circles in Ego Networks.” In: Advances in Neural Information Processing Systems 25 (NIPS 2012). 2012, pp. 548–556. url: http://papers.nips.cc/paper/4532-learning-to-discover-social-circles-in-ego-networks.

[MP16] Ulrich Meyer and Manuel Penschuck. “Generating Massive Scale-Free Networks under Resource Constraints.” In: Proceedings of the 18th Meeting on Algorithm Engineering and Experiments (ALENEX’16). SIAM, 2016, pp. 39–52. doi: 10.1137/1.9781611974317.4.

[MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge University Press, 1995.

[MSS03] Ulrich Meyer, Peter Sanders, and Jop F. Sibeyn, eds. Algorithms for Memory Hierarchies, Advanced Lectures. Vol. 2625. Lecture Notes in Computer Science. Springer, 2003. doi: 10.1007/3-540-36574-5.

[MSS14] Henning Meyerhenke, Peter Sanders, and Christian Schulz. “Partitioning of Complex Networks via Size-constrained Clustering.” In: Proceedings of the 13th International Symposium on Experimental Algorithms (SEA’14). Ed. by Joachim Gudmundsson and Jyrki Katajainen. Vol. 8504. Lecture Notes in Computer Science. Springer, 2014, pp. 351–363. doi: 10.1007/978-3-319-07959-2_30.

[MZ03] Anil Maheshwari and Norbert Zeh. “A Survey of Techniques for Designing I/O-Efficient Algorithms.” In: Algorithms for Memory Hierarchies, Advanced Lectures. Ed. by Ulrich Meyer, Peter Sanders, and Jop F. Sibeyn. Vol. 2625. Lecture Notes in Computer Science. Springer, 2003, pp. 36–61. doi: 10.1007/3-540-36574-5_3.

[Nas15] James Nastos. “Utilizing graph classes for community detection in social and complex networks.” PhD thesis. University of British Columbia, 2015. doi: 10.14288/1.0074429.

[New10] Mark E. J. Newman. Networks: An Introduction. Oxford University Press, 2010.

[NG04] Mark E. J. Newman and Michelle Girvan. “Finding and evaluating community structure in networks.” In: Physical Review E 69.026113 (2004), pp. 1–16. doi: 10.1103/PhysRevE.69.026113.

[NG10] James Nastos and Yong Gao. “A Novel Branching Strategy for Parameterized Graph Modification Problems.” In: Proceedings of the 4th International Conference on Combinatorial Optimization and Applications. Vol. 2. Lecture Notes in Computer Science. Springer, 2010, pp. 332–346. doi: 10.1007/978-3-642-17461-2_27.

[NG13] James Nastos and Yong Gao. “Familial groups in social networks.” In: Social Networks 35.3 (July 2013), pp. 439–450. doi: 10.1016/j.socnet.2013.05.001.


[Ngu+11] Nam P. Nguyen, Thang N. Dinh, Sindhura Tokala, and My T. Thai. “Overlapping communities in dynamic networks: their detection and mobile applications.” In: Proceedings of the 17th Annual International Conference on Mobile Computing and Networking (MOBICOM 2011). ACM Press, 2011, pp. 85–96. doi: 10.1145/2030613.2030624.

[Nic+13] Bobo Nick, Conrad Lee, Pádraig Cunningham, and Ulrik Brandes. “Simmelian backbones: amplifying hidden homophily in Facebook networks.” In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE Computer Society, 2013, pp. 525–532. doi: 10.1145/2492517.2492569.

[NOB14] Arlind Nocaj, Mark Ortmann, and Ulrik Brandes. “Untangling hairballs - from 3 to 14 degrees of separation.” In: Proceedings of the 22nd International Symposium on Graph Drawing (GD’14). Ed. by Christian A. Duncan and Antonios Symvonis. Vol. 8871. Lecture Notes in Computer Science. Springer, 2014, pp. 101–112. doi: 10.1007/978-3-662-45803-7_9.

[NOB15] Arlind Nocaj, Mark Ortmann, and Ulrik Brandes. “Untangling the Hairballs of Multi-Centered, Small-World Online Social Media Networks.” In: Journal of Graph Algorithms and Applications 19.2 (2015), pp. 595–618. doi: 10.7155/jgaa.00370.

[NTV12] Blaise Ngonmang, Maurice Tchuente, and Emmanuel Viennet. “Local Community Identification in Social Networks.” In: Parallel Processing Letters 22.1 (2012), p. 1240004. doi: 10.1142/S012962641240004X.

[OB14] Mark Ortmann and Ulrik Brandes. “Triangle Listing Algorithms: Back from the Diversion.” In: Proceedings of the 16th Meeting on Algorithm Engineering and Experiments (ALENEX’14). Ed. by Catherine C. McGeoch and Ulrich Meyer. SIAM, 2014, pp. 1–8. doi: 10.1137/1.9781611973198.1.

[Pag03] Rasmus Pagh. “Basic External Memory Data Structures.” In: Algorithms for Memory Hierarchies, Advanced Lectures. Ed. by Ulrich Meyer, Peter Sanders, and Jop F. Sibeyn. Vol. 2625. Lecture Notes in Computer Science. Springer, 2003, pp. 14–35. doi: 10.1007/3-540-36574-5_2.

[PBV07] Gergely Palla, Albert-László Barabási, and Tamás Vicsek. “Quantifying social group evolution.” In: Nature 446 (Apr. 2007), pp. 664–667. doi: 10.1038/nature05670.

[Pei15] Tiago P. Peixoto. “Model Selection and Hypothesis Testing for Large-Scale Network Models with Overlapping Groups.” In: Physical Review X 5.1 (2015), p. 011033. doi: 10.1103/PhysRevX.5.011033.

[Pen+20] Manuel Penschuck, Ulrik Brandes, Michael Hamann, Sebastian Lamm, Ulrich Meyer, Ilya Safro, Peter Sanders, and Christian Schulz. Recent Advances in Scalable Network Generation. 2020. arXiv: 2003.00736 [cs.DS].

[PF11] Li Pu and Boi Faltings. “Hypergraph Clustering for Better Network Traffic Inspection.” In: Proceedings of the 3rd Workshop on Intelligent Security at IJCAI. 2011, pp. 18–25. url: https://infoscience.epfl.ch/record/197784.

[Pla16] Oliver Plate. “Semi-externes Clustern von Graphen mit der Louvain-Methode.” Bachelor’s thesis. Department of Informatics, Karlsruhe Institute of Technology (KIT), 2016. url: https://i11www.iti.kit.edu/_media/teaching/theses/ba-plate-16.pdf.

[PLC17] Leto Peel, Daniel B. Larremore, and Aaron Clauset. “The Ground Truth About Metadata and Community Detection in Networks.” In: Science Advances 3.5 (2017), e1602548. doi: 10.1126/sciadv.1602548.

[PPF15] Costas Panagiotakis, Harris Papadakis, and Paraskevi Fragopoulou. “Local Community Detection via Flow Propagation.” In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE, 2015, pp. 81–88. doi: 10.1145/2808797.2808892.

[PSZ10] Dhruv Parthasarathy, Devavrat Shah, and Tauhid Zaman. Leaders, Followers, and Community Detection. 2010. arXiv: 1011.0774 [stat.ML].

[Que+13] Xinyu Que, Fabio Checconi, Fabrizio Petrini, Teng Wang, and Weikuan Yu. Lightning-fast community detection in social media: A scalable implementation of the Louvain algorithm. Tech. rep. AU-CSSE-PASL/13-TR01. Department of Computer Science and Software Engineering, Auburn University, 2013.

[Que+15] Xinyu Que, Fabio Checconi, Fabrizio Petrini, and John A. Gunnels. “Scalable Community Detection with the Louvain Algorithm.” In: 29th International Parallel and Distributed Processing Symposium (IPDPS’15). IEEE Computer Society, 2015, pp. 28–37. doi: 10.1109/IPDPS.2015.59.

[RAB09] Martin Rosvall, Daniel Axelsson, and Carl T. Bergstrom. “The map equation.” In: The European Physical Journal Special Topics 178.1 (2009), pp. 13–23. doi: 10.1140/epjst/e2010-01179-1.

[Rad+04] Filippo Radicchi, Claudio Castellano, Federico Cecconi, Vittorio Loreto, and Domenico Parisi. “Defining and identifying communities in networks.” In: Proceedings of the National Academy of Science of the United States of America 101.9 (2004), pp. 2658–2663. doi: 10.1073/pnas.0400054101.

[Rah+07] Sven Rahmann, Tobias Wittkop, Jan Baumbach, Marcel Martin, Anke Truß, and Sebastian Böcker. “Exact and Heuristic Algorithms for Weighted Cluster Editing.” In: Proceedings of the 6th Annual International Conference on Computational Systems Bioinformatics (CSB 2007). Vol. 6. 2007, pp. 391–401. doi: 10.1142/9781860948732_0040.

[RB06] Jörg Reichardt and Stefan Bornholdt. “Statistical Mechanics of Community Detection.” In: Physical Review E 74.016110 (2006), pp. 1–16. doi: 10.1103/PhysRevE.74.016110.


[RFP13] Yiye Ruan, David Fuhry, and Srinivasan Parthasarathy. “Efficient Community Detection in Large Networks Using Content and Links.” In: Proceedings of the 22nd International Conference on World Wide Web (WWW’13). ACM Press, 2013, pp. 1089–1098. doi: 10.1145/2488388.2488483.

[RN11] Randolf Rotta and Andreas Noack. “Multilevel local search algorithms for modularity clustering.” In: ACM Journal of Experimental Algorithmics 16 (July 2011), 2.3:2.1–2.3:2.27. doi: 10.1145/1963190.1970376.

[RPS15] Jaideep Ray, Ali Pinar, and C. Seshadhri. “A stopping criterion for Markov chains when generating independent random graphs.” In: Journal of Complex Networks 3.2 (2015), pp. 204–220. doi: 10.1093/comnet/cnu041.

[Rys57] Herbert J. Ryser. “Combinatorial Properties of Matrices of Zeros and Ones.” In: Canadian Journal of Mathematics 9 (1957), pp. 371–377. doi: 10.4153/CJM-1957-044-3.

[SAA09] Paolo Simonetto, David Auber, and Daniel Archambault. “Fully Automatic Visualisation of Overlapping Sets.” In: Computer Graphics Forum 28.3 (2009), pp. 967–974. doi: 10.1111/j.1467-8659.2009.01452.x.

[Sán+15] Patricia Iglesias Sánchez, Emmanuel Müller, Uwe Leo Korn, Klemens Böhm, Andrea Kappes, Tanja Hartmann, and Dorothea Wagner. “Efficient Algorithms for a Robust Modularity-Driven Clustering of Attributed Graphs.” In: Proceedings of the 2015 SIAM International Conference on Data Mining. SIAM, 2015, pp. 100–108. doi: 10.1137/1.9781611974010.12.

[San00] Peter Sanders. “Fast Priority Queues for Cached Memory.” In: ACM Journal of Experimental Algorithmics 5 (2000), p. 7. doi: 10.1145/351827.384249.

[San98] Peter Sanders. “Random Permutations on Distributed, External and Hierarchical Memory.” In: Information Processing Letters 67.6 (1998), pp. 305–309. doi: 10.1016/S0020-0190(98)00127-6.

[ŠB11] Lovro Šubelj and Marko Bajec. “Unfolding communities in large complex networks: Combining defensive and offensive label propagation for core extraction.” In: Physical Review E 83.3 (2011), p. 036103. doi: 10.1103/PhysRevE.83.036103.

[SBV09] M. Ángeles Serrano, Marián Boguñá, and Alessandro Vespignani. “Extracting the multiscale backbone of complex weighted networks.” In: Proceedings of the National Academy of Science of the United States of America 106.16 (2009), pp. 6483–6488. doi: 10.1073/pnas.0808904106.

[Sch07] Satu Elisa Schaeffer. “Graph Clustering.” In: Computer Science Review 1.1 (Aug. 2007), pp. 27–64. doi: 10.1016/j.cosrev.2007.05.001.

[SHW17] Neha Sengupta, Michael Hamann, and Dorothea Wagner. “Benchmark Generator for Dynamic Overlapping Communities in Networks.” In: Proceedings of the 2017 IEEE International Conference on Data Mining. IEEE Computer Society, 2017, pp. 415–424. doi: 10.1109/ICDM.2017.51.

[SHZ15] Wolfgang E. Schlauch, Emőke Ágnes Horvát, and Katharina A. Zweig. “Different flavors of randomness: comparing random graph models with fixed degree sequences.” In: Social Network Analysis and Mining 5.1 (2015), 36:1–36:14. doi: 10.1007/s13278-015-0267-z.

[Slo+19] George M. Slota, Jonathan W. Berry, Simon D. Hammond, Stephen L. Olivier, Cynthia A. Phillips, and Sivasankaran Rajamanickam. “Scalable Generation of Graphs for Benchmarking HPC Community-Detection Algorithms.” In: SC’19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM Press, 2019. doi: 10.1145/3295500.3356206.

[SM16] Christian L. Staudt and Henning Meyerhenke. “Engineering Parallel Algorithms for Community Detection in Massive Networks.” In: IEEE Transactions on Parallel and Distributed Systems 27.1 (2016), pp. 171–184. doi: 10.1109/TPDS.2015.2390633.

[SMM14] Christian L. Staudt, Yassine Marrakchi, and Henning Meyerhenke. “Detecting communities around seed nodes in complex networks.” In: IEEE International Conference on Big Data. Ed. by Jimmy J. Lin, Jian Pei, Xiaohua Hu, Chang Wo, Raghunath Nambiar, Charu Aggarwal, Nick Cercone, Vasant Honavar, Jun Huan, Bamshad Mobasher, and Saumyadipta Pyne. IEEE Computer Society, 2014, pp. 62–69. doi: 10.1109/BigData.2014.7004373.

[SN97] Tom A.B. Snijders and Krzysztof Nowicki. “Estimation and Prediction of Stochastic Blockmodels for Graphs with Latent Block Structure.” In: Journal of Classification 14 (1997), pp. 75–100. doi: 10.1007/s003579900004.

[Spi19] Jonas Spinner. “Weighted F-free Edge Editing.” Bachelor’s thesis. Department of Informatics, Karlsruhe Institute of Technology (KIT), 2019. url: https://i11www.iti.kit.edu/_media/teaching/theses/ba-spinner-19.pdf.

[SPR11] Venu Satuluri, Srinivasan Parthasarathy, and Yiye Ruan. “Local Graph Sparsification for Scalable Clustering.” In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD’11). ACM Press, 2011, pp. 721–732. doi: 10.1145/1989323.1989399.

[SRD13] Tanwistha Saha, Huzefa Rangwala, and Carlotta Domeniconi. “Sparsification and Sampling of Networks for Collective Classification.” In: Proceedings of the 6th International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction (SBP 2013). Vol. 7812. Lecture Notes in Computer Science. Springer, 2013, pp. 293–302. doi: 10.1007/978-3-642-37210-0_32.

[SSM16] Christian L. Staudt, Aleksejs Sazonovs, and Henning Meyerhenke. “NetworKit: A tool suite for large-scale complex network analysis.” In: Network Science 4.4 (Dec. 2016), pp. 508–530. doi: 10.1017/nws.2016.20.


[ST15] Julian Shun and Kanat Tangwongsan. “Multicore triangle computations without tuning.” In: Proceedings of the 31st IEEE International Conference on Data Engineering, ICDE 2015. IEEE Computer Society, 2015, pp. 149–160. doi: 10.1109/ICDE.2015.7113280.

[SW50] Georg Simmel and Kurt H. Wolff. The Sociology of Georg Simmel. Simon and Schuster, 1950.

[TAD15] Vincent A. Traag, Rodrigo Aldecoa, and Jean-Charles Delvenne. “Detecting Communities Using Asymptotical Surprise.” In: Physical Review E 92.2 (2015), p. 022816. doi: 10.1103/PhysRevE.92.022816.

[TB11] Chayant Tantipathananandh and Tanya Berger-Wolf. “Finding Communities in Dynamic Social Networks.” In: Proceedings of the 2011 IEEE International Conference on Data Mining. IEEE Computer Society, 2011, pp. 1236–1241. doi: 10.1109/ICDM.2011.67.

[TDN11] Vincent A. Traag, P. Van Dooren, and Y. Nesterov. “Narrow scope for resolution-limit-free community detection.” In: Physical Review E 84.1 (July 2011), p. 016114. doi: 10.1103/PhysRevE.84.016114.

[The13] The Lemur Project. ClueWeb12 Web Graph. http://www.lemurproject.org/clueweb12/webgraph.php. Nov. 2013.

[TKD13] Vincent A. Traag, G. Krings, and P. Van Dooren. “Significant Scales in Community Structure.” In: Nature Scientific Reports 3.2930 (2013). doi: 10.1038/srep02930.

[TMP12] Amanda L. Traud, Peter J. Mucha, and Mason A. Porter. “Social structure of Facebook networks.” In: Physica A: Statistical Mechanics and its Applications 391.16 (Aug. 2012), pp. 4165–4180. doi: 10.1016/j.physa.2011.12.021.

[TWE19] Vincent A. Traag, Ludo Waltman, and Nees Jan van Eck. “From Louvain to Leiden: guaranteeing well-connected communities.” In: Nature Scientific Reports 9 (2019), p. 5233. doi: 10.1038/s41598-019-41695-z.

[Uga+11] Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The Anatomy of the Facebook Social Graph. 2011. arXiv: 1111.4503.

[VEB09] Nguyen Xuan Vinh, Julien Epps, and James Bailey. “Information theoretic measures for clusterings comparison: is a correction for chance necessary?” In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM Press, 2009, pp. 1073–1080. doi: 10.1145/1553374.1553511.

[Vit87] Jeffrey S. Vitter. “An efficient algorithm for sequential random sampling.” In: ACM Transactions on Mathematical Software 13.1 (1987), pp. 58–67. doi: 10.1145/23002.23003.

[VL16] Fabien Viger and Matthieu Latapy. “Efficient and simple generation of random simple connected graphs with prescribed degree sequence.” In: Journal of Complex Networks 4.1 (Mar. 2016). Source code available at https://www-complexnetworks.lip6.fr/~latapy/FV/generation.html, pp. 15–37. doi: 10.1093/comnet/cnv013.

[VR12] Alcides Viamontes Esquivel and Martin Rosvall. Comparing network covers using mutual information. Feb. 2012. arXiv: 1202.0425 [math-ph].

[Wic+14] Charith Wickramaarachchi, Marc Frincu, Patrick Small, and Viktor K. Prasanna. “Fast parallel algorithm for unfolding of communities in large graphs.” In: Proceedings of the 2014 IEEE High Performance Extreme Computing Conference. IEEE, 2014, pp. 1–6. doi: 10.1109/HPEC.2014.7040973.

[Wie19] Armin Wiebigke. “Engineering Overlapping Community Detection based on the Ego-Splitting Framework.” MA thesis. Department of Informatics, Karlsruhe Institute of Technology (KIT), 2019. url: https://i11www.iti.kit.edu/_media/teaching/theses/ma-wiebigke-19.pdf.

[Wit+11] Tobias Wittkop, Dorothea Emig, Anke Truß, Mario Albrecht, Sebastian Böcker, and Jan Baumbach. “Comprehensive cluster analysis with Transitivity Clustering.” In: Nature Protocols 6 (2011), pp. 285–295. doi: 10.1038/nprot.2010.197.

[Wol65] E. S. Wolk. “A Note on “The Comparability Graph of a Tree”.” In: Proceedings of the American Mathematical Society 16.1 (Feb. 1965), pp. 17–20. doi: 10.2307/2033992.

[WS98] Duncan J. Watts and Steven H. Strogatz. “Collective dynamics of ’small-world’ networks.” In: Nature 393.6684 (June 1998), pp. 440–442. doi: 10.1038/30918.

[WW07] Silke Wagner and Dorothea Wagner. Comparing Clusterings – An Overview. Tech. rep. 2006-04. Universität Karlsruhe (TH), 2007. url: http://digbib.ubka.uni-karlsruhe.de/volltexte/1000011477.

[Xia+18] Ju Xiang, Hui-Jia Li, Zhan Bu, Zhen Wang, Mei-Hua Bao, Liang Tang, and Jian-Ming Li. “Critical analysis of (Quasi-)Surprise for community detection in complex networks.” In: Nature Scientific Reports 8 (2018), p. 14459. doi: 10.1038/s41598-018-32582-0.

[XKS13] Jierui Xie, Stephen Kelley, and Boleslaw K. Szymanski. “Overlapping community detection in networks: The state-of-the-art and comparative study.” In: ACM Computing Surveys 45.4 (2013), Article 43, 35 pages. doi: 10.1145/2501654.2501657.


[XSL11] Jierui Xie, Boleslaw K. Szymanski, and Xiaoming Liu. “SLPA: Uncovering Overlapping Communities in Social Networks via A Speaker-listener Interaction Dynamic Process.” In: Proceedings of the 11th IEEE International Conference on Data Mining - Workshops. IEEE Computer Society, 2011, pp. 344–349. doi: 10.1109/ICDMW.2011.154.

[YCC96] Jing-Ho Yan, Jer-Jeong Chen, and Gerard J. Chang. “Quasi-threshold graphs.” In: Discrete Applied Mathematics 69.3 (Aug. 1996), pp. 247–255. doi: 10.1016/0166-218X(96)00094-7.

[YL12a] Jaewon Yang and Jure Leskovec. “Community-Affiliation Graph Model for Overlapping Network Community Detection.” In: Proceedings of the 2012 IEEE International Conference on Data Mining. IEEE Computer Society, 2012, pp. 1170–1175. doi: 10.1109/ICDM.2012.139.

[YL12b] Jaewon Yang and Jure Leskovec. “Defining and Evaluating Network Communities Based on Ground-Truth.” In: Proceedings of the 2012 IEEE International Conference on Data Mining. IEEE Computer Society, 2012, pp. 745–754. doi: 10.1109/ICDM.2012.138.

[YL14] Jaewon Yang and Jure Leskovec. “Structure and Overlaps of Ground-Truth Communities in Networks.” In: ACM Transactions on Intelligent Systems and Technology 5.2 (Apr. 2014), 26:1–26:35. doi: 10.1145/2594454.

[YL15] Jaewon Yang and Jure Leskovec. “Defining and evaluating network communities based on ground-truth.” In: Knowledge and Information Systems 42.1 (2015), pp. 181–213. doi: 10.1007/s10115-013-0693-z.

[YML13] Jaewon Yang, Julian McAuley, and Jure Leskovec. “Community Detection in Networks with Node Attributes.” In: Proceedings of the 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 1151–1156. doi: 10.1109/ICDM.2013.167.

[Zac77] Wayne W. Zachary. “An Information Flow Model for Conflict and Fission in Small Groups.” In: Journal of Anthropological Research 33 (1977), pp. 452–473. doi: 10.1086/jar.33.4.3629752.

[Zah+16] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. “Apache Spark: a unified engine for big data processing.” In: Communications of the ACM 59.11 (2016), pp. 56–65. doi: 10.1145/2934664.

[ZB15] Anita Zakrzewska and David A. Bader. “A Dynamic Algorithm for Local Community Detection in Graphs.” In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE, 2015, pp. 559–564. doi: 10.1145/2808797.2809375.

[Zei17] Tim Zeitz. “Engineering Distributed Graph Clustering using MapReduce.” MA thesis. Department of Informatics, Karlsruhe Institute of Technology (KIT), 2017. url: https://i11www.iti.kit.edu/_media/teaching/theses/ma-zeitz-17.pdf.

[Zha13] James Y. Zhao. Expand and Contract: Sampling graphs with given degrees and other combinatorial families. 2013. arXiv: 1308.6627.

[ZN99] Xiao Zhou and Takao Nishizeki. “Edge-Coloring and f-Coloring for Various Classes of Graphs.” In: Journal of Graph Algorithms and Applications 3.1 (1999), pp. 1–18. doi: 10.1007/3-540-58325-4_182.

[ZY16] Jianping Zeng and Hongfeng Yu. “A study of graph partitioning schemes for parallel graph community detection.” In: Parallel Computing 58 (2016), pp. 131–139. doi: 10.1016/j.parco.2016.05.008.

[ZY18] Jianping Zeng and Hongfeng Yu. “A Distributed Infomap Algorithm for Scalable and High-Quality Community Detection.” In: Proceedings of the 47th International Conference on Parallel Processing ICPP 2018. ACM Press, 2018. doi: 10.1145/3225058.3225137.


List of Publications

Book Chapters

Ulrik Brandes, Michael Hamann, Mark Ortmann, and Dorothea Wagner. “On the Persistence of Strongly Embedded Ties.” In: Computational Social Science in the Age of Big Data: Concepts, Methodologies, Tools, and Applications. Ed. by Cathleen M. Stützer, Martin Welker, and Marc Egger. Vol. 15. Neue Schriften zur Online-Forschung. Herbert von Halem Verlag, 2018, pp. 213–234. url: http://www.halem-verlag.de/computational-social-science-in-the-age-of-big-data/.

Journal Articles

Gregor Betz, Michael Hamann, Tamara Mchedlidze, and Sophie von Schmettow. “Applying argumentation to structure and visualize multi-dimensional opinion spaces.” In: Argument & Computation 10.1 (2019), pp. 23–40. doi: 10.3233/AAC-181004.

Michael Hamann and Ben Strasser. “Correspondence between Multilevel Graph Partitions and Tree Decompositions.” In: Algorithms 12.9 (2019), p. 198. doi: 10.3390/a12090198.

Lars Gottesbüren, Michael Hamann, Tim Niklas Uhl, and Dorothea Wagner. “Faster and Better Nested Dissection Orders for Customizable Contraction Hierarchies.” In: Algorithms 12.9 (2019), p. 196. doi: 10.3390/a12090196.

Michael Hamann and Ben Strasser. “Graph Bisection with Pareto Optimization.” In: ACM Journal of Experimental Algorithmics 23.1 (2018), 1.2:1–1.2:34. doi: 10.1145/3173045.

Michael Hamann, Ulrich Meyer, Manuel Penschuck, Hung Tran, and Dorothea Wagner. “I/O-Efficient Generation of Massive Graphs Following the LFR Benchmark.” In: ACM Journal of Experimental Algorithmics 23 (2018), 2.5:1–2.5:33. doi: 10.1145/3230743.

Rudolf Biczok, Peter Bozsoky, Peter Eisenmann, Johannes Ernst, Tobias Ribizel, Fedor Scholz, Axel Trefzer, Florian Weber, Michael Hamann, and Alexandros Stamatakis. “Two C++ libraries for counting trees on a phylogenetic terrace.” In: Bioinformatics 34.19 (2018), pp. 3399–3401. doi: 10.1093/bioinformatics/bty384.

Christian L. Staudt, Michael Hamann, Alexander Gutfraind, Ilya Safro, and Henning Meyerhenke. “Generating realistic scaled complex networks.” In: Applied Network Science 2.1 (2017), p. 36. doi: 10.1007/s41109-017-0054-z.

Michael Hamann, Eike Röhrs, and Dorothea Wagner. “Local Community Detection Based on Small Cliques.” In: Algorithms 10.3 (2017), p. 90. doi: 10.3390/a10030090.

Michael Hamann, Gerd Lindner, Henning Meyerhenke, Christian L. Staudt, and Dorothea Wagner. “Structure-preserving sparsification methods for social networks.” In: Social Network Analysis and Mining 6.1 (2016), 22:1–22:22. doi: 10.1007/s13278-016-0332-2.

Conference Articles

Lars Gottesbüren, Michael Hamann, Sebastian Schlag, and Dorothea Wagner. “Advanced Flow-Based Multilevel Hypergraph Partitioning.” In: Proceedings of the 18th International Symposium on Experimental Algorithms (SEA’20). Vol. 160. Leibniz International Proceedings in Informatics. 2020, 11:1–11:15. doi: 10.4230/LIPIcs.SEA.2020.11.

Lars Gottesbüren, Michael Hamann, Philipp Schoch, Ben Strasser, Dorothea Wagner, and Sven Zühlsdorf. “Engineering Exact Quasi-Threshold Editing.” In: Proceedings of the 18th International Symposium on Experimental Algorithms (SEA’20). Vol. 160. Leibniz International Proceedings in Informatics. 2020, 10:1–10:14. doi: 10.4230/LIPIcs.SEA.2020.10.

Lars Gottesbüren, Michael Hamann, and Dorothea Wagner. “Evaluation of a Flow-Based Hypergraph Bipartitioning Algorithm.” In: Proceedings of the 27th Annual European Symposium on Algorithms (ESA’19). Leibniz International Proceedings in Informatics. 2019, 52:1–52:17. doi: 10.4230/LIPIcs.ESA.2019.52.

Michael Hamann, Ben Strasser, Dorothea Wagner, and Tim Zeitz. “Distributed Graph Clustering Using Modularity and Map Equation.” In: Proceedings of the 24th International Conference on Parallel Processing (Euro-Par 2018). Vol. 11014. Lecture Notes in Computer Science. Springer, 2018, pp. 688–702. doi: 10.1007/978-3-319-96983-1_49.

Corrie Jacobien Carstens, Michael Hamann, Ulrich Meyer, Manuel Penschuck, Hung Tran, and Dorothea Wagner. “Parallel and I/O-efficient Randomisation of Massive Networks using Global Curveball Trades.” In: Proceedings of the 26th Annual European Symposium on Algorithms (ESA’18). Leibniz International Proceedings in Informatics. 2018, 11:1–11:15. doi: 10.4230/LIPIcs.ESA.2018.11.

Neha Sengupta, Michael Hamann, and Dorothea Wagner. “Benchmark Generator for Dynamic Overlapping Communities in Networks.” In: Proceedings of the 2017 IEEE International Conference on Data Mining. IEEE Computer Society, 2017, pp. 415–424. doi: 10.1109/ICDM.2017.51.

Michael Hamann, Ulrich Meyer, Manuel Penschuck, and Dorothea Wagner. “I/O-efficient Generation of Massive Graphs Following the LFR Benchmark.” In: Proceedings of the 19th Meeting on Algorithm Engineering and Experiments (ALENEX’17). SIAM, 2017, pp. 58–72. doi: 10.1137/1.9781611974768.5.


Christian L. Staudt, Michael Hamann, Ilya Safro, Alexander Gutfraind, and Henning Meyerhenke. “Generating Scaled Replicas of Real-World Complex Networks.” In: Proceedings of the 5th International Workshop on Complex Networks and their Applications (COMPLEX NETWORKS 2016). Vol. 693. Studies in Computational Intelligence. Springer, 2016, pp. 17–28. doi: 10.1007/978-3-319-50901-3_2.

Michael Hamann and Ben Strasser. “Graph Bisection with Pareto-Optimization.” In: Proceedings of the 18th Meeting on Algorithm Engineering and Experiments (ALENEX’16). SIAM, 2016, pp. 90–102. doi: 10.1137/1.9781611974317.8.

Ulrik Brandes, Michael Hamann, Ben Strasser, and Dorothea Wagner. “Fast Quasi-Threshold Editing.” In: Proceedings of the 23rd Annual European Symposium on Algorithms (ESA’15). Vol. 9294. Lecture Notes in Computer Science. Springer, 2015, pp. 251–262. doi: 10.1007/978-3-662-48350-3_22.

Gerd Lindner, Christian L. Staudt, Michael Hamann, Henning Meyerhenke, and Dorothea Wagner. “Structure-Preserving Sparsification of Social Networks.” In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE, 2015. doi: 10.1145/2808797.2809313.

Michael Hamann, Tanja Hartmann, and Dorothea Wagner. “Complete Hierarchical Cut-Clustering: A Case Study on Expansion and Modularity.” In: Graph Partitioning and Graph Clustering: Tenth DIMACS Implementation Challenge. Ed. by David A. Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner. Vol. 588. DIMACS Book. American Mathematical Society, 2013, pp. 157–170. doi: 10.1090/conm/588.

Michael Hamann, Tanja Hartmann, and Dorothea Wagner. “Hierarchies of Predominantly Connected Communities.” In: Algorithms and Data Structures, 13th International Symposium (WADS’13). Vol. 8037. Lecture Notes in Computer Science. Springer, 2013, pp. 365–377. doi: 10.1007/978-3-642-40104-6_32.

Technical Reports

Lars Gottesbüren, Michael Hamann, Sebastian Schlag, and Dorothea Wagner. Advanced Flow-Based Multilevel Hypergraph Partitioning. 2020. arXiv: 2003.12110 [cs.DS].

Lars Gottesbüren, Michael Hamann, Philipp Schoch, Ben Strasser, Dorothea Wagner, and Sven Zühlsdorf. Engineering Exact Quasi-Threshold Editing. 2020. arXiv: 2003.14317 [cs.DS].

Manuel Penschuck, Ulrik Brandes, Michael Hamann, Sebastian Lamm, Ulrich Meyer, Ilya Safro, Peter Sanders, and Christian Schulz. Recent Advances in Scalable Network Generation. 2020. arXiv: 2003.00736 [cs.DS].

Lars Gottesbüren, Michael Hamann, and Dorothea Wagner. Evaluation of a Flow-Based Hypergraph Bipartitioning Algorithm. 2019. arXiv: 1907.02053 [cs.DS].

Lars Gottesbüren, Michael Hamann, Tim Niklas Uhl, and Dorothea Wagner. Faster and Better Nested Dissection Orders for Customizable Contraction Hierarchies. 2019. arXiv: 1906.11811 [cs.DS].

Corrie Jacobien Carstens, Michael Hamann, Ulrich Meyer, Manuel Penschuck, Hung Tran, and Dorothea Wagner. Parallel and I/O-efficient Randomisation of Massive Networks using Global Curveball Trades. 2018. arXiv: 1804.08487 [cs.DS].

Michael Hamann, Ben Strasser, Dorothea Wagner, and Tim Zeitz. Distributed Graph Clustering using Modularity and Map Equation. 2017. arXiv: 1710.09605 [cs.DS].

Christian L. Staudt, Michael Hamann, Alexander Gutfraind, Ilya Safro, and Henning Meyerhenke. Generating realistic scaled complex networks. 2016. arXiv: 1609.02121 [cs.SI].

Michael Hamann, Ulrich Meyer, Manuel Penschuck, Hung Tran, and Dorothea Wagner. I/O-Efficient Generation of Massive Graphs Following the LFR Benchmark. 2016. arXiv: 1604.08738 [cs.DS].

Michael Hamann, Gerd Lindner, Henning Meyerhenke, Christian L. Staudt, and Dorothea Wagner. Structure-Preserving Sparsification Methods for Social Networks. 2016. arXiv: 1601.00286 [cs.SI].

Ulrik Brandes, Michael Hamann, Ben Strasser, and Dorothea Wagner. Fast Quasi-Threshold Editing. 2015. arXiv: 1504.07379 [cs.DS].

Michael Hamann and Ben Strasser. Graph Bisection with Pareto-Optimization. 2015. arXiv: 1504.03812 [cs.DS].

Gerd Lindner, Christian L. Staudt, Michael Hamann, Henning Meyerhenke, and Dorothea Wagner. Structure-Preserving Sparsification of Social Networks. 2015. arXiv: 1505.00564 [cs.SI].

Michael Hamann, Tanja Hartmann, and Dorothea Wagner. Hierarchies of Predominantly Connected Communities. 2013. arXiv: 1305.0757 [cs.DS].
