Thèse De Doctorat
Total Page:16
File Type:pdf, Size:1020Kb
THÈSE DE DOCTORAT de l’Université de recherche Paris Sciences et Lettres PSL Research University Préparée à l’École Normale Supérieure Nouvelles approches pour le partitionnement de grands graphes Novel Approaches to the Clustering of Large Graphs Ecole doctorale n°386 SCIENCES MATHÉMATIQUES DE PARIS CENTRE Spécialité MATHÉMATIQUES COMPOSITION DU JURY : M. CHANG Cheng-Shang National Tsing Hua University, Rapporteur M. LAMBIOTTE Renaud University of Oxford, Rapporteur M. VAYATIS Nicolas ENS Paris-Saclay, Président du jury Soutenue par ALEXANDRE HOLLOCOU le 19 décembre 2018 M. BONALD Thomas h Télécom Paristech, Membre du jury M. D’ALCHÉ-BUC Florence Dirigée par Marc LELARGE Télécom Paristech, Membre du jury et Thomas BONALD M. LELARGE Marc INRIA Paris, Membre du jury h M. VIENNOT Laurent INRIA Paris, Membre du jury 2 Abstract Graphs are ubiquitous in many fields of research ranging from sociology to biology. A graph is a very simple mathematical structure that consists of a set of elements, called nodes, connected to each other by edges. It is yet able to represent complex systems such as protein-protein interaction or scientific collaborations. Graph clustering is a central problem in the analysis of graphs whose objective is to identify dense groups of nodes that are sparsely connected to the rest of the graph. These groups of nodes, called clusters, are fundamental to an in-depth understanding of graph structures. There is no universal definition of what a good cluster is, and different approaches might be best suited for different applications. Whereas most of classic methods focus on finding node partitions, i.e. on coloring graph nodes so that each node has one and only one color, more elaborate approaches are often necessary to model the complex structure of real-life graphs and to address sophisticated applications. In particular, in many cases, we must consider that a given node can belong to more than one cluster. Besides, many real-world systems exhibit multi-scale structures and one much seek for hierarchies of clusters rather than flat clusterings. Furthermore, graphs often evolve over time and are too massive to be handled in one batch so that one must be able to process stream of edges. Finally, in many applications, processing entire graphs is irrelevant or expensive, and it can be more appropriate to recover local clusters in the neighborhood of nodes of interest rather than color all graph nodes. In this work, we study alternative approaches and design novel algorithms to tackle these different problems. First, we study soft clustering algorithms that aim at finding a degree of membership for each node to each cluster, allowing a given node to be a member of multiple clusters. Then, we explore the subject of edge clustering whose objective is to find partitions of edges instead of node partitions. Third, we study hierarchical clustering techniques to find multi-scale clusters in graphs. We then examine streaming methods for graph clustering, able to handle graphs with billions of edges. Finally, we focus on local graph clustering approaches, whose objective is to find clusters in the neighborhood of so-called seed nodes. The novel methods that we propose to address these different problems are mostly inspired by variants of modularity, a classic measure that accesses the quality of a node partition, and by random walks, stochastic processes whose properties are closely related to the graph structure. We provide analyses that give theoretical guarantees for the different proposed techniques, and endeavour to evaluate these algorithms on real-world datasets and use cases. 3 4 List of publications Multiple local community detection. Hollocou, A., Bonald, T., and Lelarge, M. (2017). ACM SIGMETRICS Performance Evaluation Review, 45(2):76–83. Reference [Hollocou et al., 2017a]. A streaming algorithm for graph clustering. Hollocou, A., Maudet, J., Bonald, T., and Lelarge, M. (2017). NIPS Workshop. Reference [Hollocou et al., 2017b]. Hierarchical graph clustering by node pair sampling. Bonald, T., Charpentier, B., Galland, A., and Hollocou, A. (2018). KDD Workshop. Reference [Bonald et al., 2018a]. Weighted spectral embedding of graphs. Bonald, T., Hollocou, A., and Lelarge, M. (2018). In Communication, Control, and Computing (Allerton), 2018 56th Annual Allerton Conference on. Reference [Bonald et al., 2018c]. Modularity-based sparse soft graph clustering. Hollocou, A., Bonald, T., and Lelarge, M. (2019). AISTATS. Reference [Hollocou et al., 2019]. A fast agglomerative algorithm for edge clustering. Hollocou, A., Lutz, Q., and Bonald, T. (2018). Preprint. Reference [Hollocou et al., 2018]. Forward and backward embedding of directed graphs. Bonald, T., De Lara, N., and Hollocou, A. (2018). Preprint. Reference [Bonald et al., 2018b]. 5 6 Acknowledgements First, I would like to express my sincere gratitude to my advisors Marc Lelarge and Thomas Bonald for their support during the last three years. I would like to thank you for encouraging my research and for your invaluable contribution to this work. It has been a truly great pleasure to work with you. I hope that we will have the opportunity to continue discussions in the future. I am also extremely grateful to my colleagues and friends at the Ministère des Armées. It was an amazing experience working with you, and I will never forget these moments. Special thanks to Fabrice, without whom my Ph.D. would not have been possible, and Pierre, for his extensive knowledge and his always helpful insights. I would also like to thank my committee members, professor Cheng-Shang Chang, professor Renaud Lambiotte, professor Florence D’Alche-Buc, professor Nicolas Vayatis and professor Laurent Viennot for serving as my committee members even at hardship and busy schedules. I am also grateful to all my friends for their encouragements. Special thanks to the Joko team, Xavier, Nicolas and Arthur, for your patience and your support. Lastly, I would like to thank my family and Nathalie for their unwavering love and for supporting me for everything. I cannot thank you enough for encouraging me throughout this experience. 7 8 Contents 1 Introduction 13 1.1 Motivations . 13 1.1.1 Graphs . 13 1.1.2 Graph clustering . 16 1.2 Contributions . 18 1.3 Notations . 19 1.4 Classic graph clustering methods . 20 1.4.1 Modularity . 20 1.4.2 Spectral methods . 27 1.4.3 Some other graph clustering approaches . 31 1.5 Evaluation of graph clustering algorithms . 32 1.5.1 Synthetic graph models . 32 1.5.2 Real-life graphs and ground-truth clusters . 34 2 Soft clustering 37 2.1 Introduction . 37 2.2 Fuzzy c-means: an illustration of soft clustering . 38 2.2.1 The k-means problem . 38 2.2.2 Fuzzy c-means . 39 2.3 Soft modularity . 41 2.4 Related work . 42 2.4.1 The softmax approach . 42 2.4.2 Other approaches . 44 2.5 Alternating projected gradient descent . 45 2.6 Soft clustering algorithm . 48 2.6.1 Local gradient descent step . 48 2.6.2 Cluster membership representation . 48 2.6.3 Projection onto the probability simplex . 49 2.6.4 MODSOFT . 49 2.7 Link with the Louvain algorithm . 49 2.8 Algorithm pseudo-code . 51 2.9 Experimental results . 53 2.9.1 Synthetic data . 53 2.9.2 Real data . 53 3 Edge clustering 57 3.1 Introduction . 57 3.2 Related work . 58 3.3 A modularity function for edge partitions . 58 3.4 The approach of Evans and Lambiotte . 60 3.4.1 Incidence graph and bipartite graph projections . 60 3.4.2 Line graph . 61 3.4.3 Weighted line graph . 61 3.4.4 Random-walk-based projection . 61 9 3.4.5 Limitation . 62 3.5 An agglomerative algorithm for maximizing edge modularity . 62 3.5.1 Agglomerative step . 63 3.5.2 Greedy maximization step . 64 3.5.3 Algorithm . 64 3.6 Experimental results . 64 3.6.1 Synthetic data . 64 3.6.2 Real-life datasets . 65 4 Hierarchical clustering 69 4.1 Introduction . 69 4.2 Related work . 69 4.3 Agglomerative clustering and nearest-neighbor chain . 70 4.3.1 Agglomerative algorithms . 70 4.3.2 Classic distances . 71 4.3.3 Ward’s method . 72 4.3.4 The Lance-Williams family . 72 4.3.5 Nearest-neighbor chain algorithm . 72 4.4 Node pair sampling . 73 4.5 Clustering algorithm . 75 4.6 Link with modularity . 76 4.7 Experiments . 77 5 Streaming approaches 85 5.1 Introduction . 85 5.2 Related work . 86 5.2.1 Edge streaming models . 86 5.2.2 Connected components and pairwise distances . 86 5.2.3 Minimum cut . 87 5.3 A streaming algorithm for graph clustering . 88 5.3.1 Intuition . ..