Multi-Scale Clustering in Graphs Using Modularity
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Multi-scale clustering in graphs using modularity

BERTRAND CHARPENTIER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning
Date: January 15, 2019
Supervisor: Pawel Herman (KTH), Thomas Bonald (Télécom ParisTech)
Examiner: Johan Håstad
Swedish title: Multiskal-klustring i grafer med moduläritet
School of Computer Science and Communication

Abstract

This thesis provides a new hierarchical clustering algorithm for graphs, named Paris, which can be interpreted through the modularity score and its resolution parameter. The algorithm is agglomerative and based on a simple distance between clusters induced by the probability of sampling node pairs. It aims to approximate the optimal partitions with respect to the modularity score at any resolution in a single run.

In addition to the Paris hierarchical algorithm, this thesis proposes four algorithms that compute rankings of the sharpest clusters, clusterings and resolutions by processing the hierarchy output by Paris. These algorithms are based on a new measure of stability for clusterings, named sharp-score. Key outcomes of these four algorithms are the ability to rank clusters, to detect the sharpest clustering scales, to go beyond the resolution limit and to detect relevant resolutions.

All these algorithms have been tested on both synthetic and real datasets to illustrate the efficiency of their approaches.

Keywords: Hierarchical clustering, Multi-scale clustering, Graph, Modularity, Resolution, Dendrogram

Sammanfattning

This thesis presents a new hierarchical clustering algorithm for graphs, named Paris, which can be interpreted through the modularity score and its resolution parameter. The algorithm is agglomerative and based on a single distance between clusters induced by the probability of sampling node pairs. It attempts to approximate the optimal partitions at any resolution in one run.

In addition to the hierarchical algorithm, this thesis proposes four algorithms that compute rankings of the sharpest clusters, clusterings and resolutions by processing the hierarchy produced by Paris. These algorithms build on a new concept of clustering stability, called sharp-score. Key outcomes of these four algorithms are the ability to rank clusters, to detect the sharpest clustering scales, to go beyond the resolution limit and to detect the most relevant resolutions.

All these algorithms have been tested on both synthetic and real datasets to illustrate the efficiency of their approaches.

Acknowledgements

This work was performed at Télécom ParisTech in the LINCS laboratory, in the framework of a double diploma between KTH and Ensimag.

I would first like to thank my supervisor Thomas Bonald at Télécom ParisTech, who guided me during my master thesis project, and all the researchers from the LINCS who welcomed me. I would also like to thank my thesis advisors Pawel Herman of the CSC department at KTH and Sylvain Bouveret at Ensimag, who accompanied me during my master thesis, read it carefully and gave me valuable comments on my work. I would also like to thank my examiners Johan Håstad at KTH and Stephanie Hahmann at Ensimag, who were involved in the validation survey for this research project.
I must also express my very profound gratitude to my parents and to my friends for providing me with continuous encouragement throughout my years of study and throughout the process of researching and writing this thesis.

Finally, this master's thesis has been very interesting for many reasons. First of all, I exchanged ideas with researchers and PhD students from many different domains, which allowed me to learn a lot and to imagine new ideas across disciplines; the results I obtained gave me the feeling of improving the state of the art in some way. I am also particularly happy to have submitted a paper and to have started collaborations with other researchers. For all these reasons, I plan to continue in this domain by doing a PhD.

Contents

1 Introduction
  1.1 Challenges & objectives
  1.2 Thesis outline
2 Background
  2.1 Relevant Theory
    2.1.1 Graph definitions
    2.1.2 Modularity
    2.1.3 Louvain algorithm
    2.1.4 Spectral clustering algorithm
  2.2 Related work
3 Theoretical work
  3.1 Association coefficient and distance
  3.2 Resolution as a stability measure
  3.3 A multi-scale block model: Hierarchical Stochastic Block Model
    3.3.1 Stochastic Block Model
    3.3.2 Hierarchical Stochastic Block Model
4 Methods
  4.1 Hierarchical clustering
  4.2 Rankings from the hierarchy
    4.2.1 Clusters
    4.2.2 Homogeneous and heterogeneous clusterings
    4.2.3 Resolutions
5 Results
  5.1 Hierarchical Stochastic Block Model
  5.2 Stochastic Block Model
  5.3 Real data
  5.4 Vector datasets
6 Discussion
  6.1 Contributions and key findings
  6.2 Limitations and future work
  6.3 Ethical and sustainable impact
7 Conclusion
Bibliography
Appendices
  A HSBM
  B OpenFlight
  C OpenStreet
  D EnglishDico
  E HumansWikipedia

Chapter 1

Introduction

Data clustering is a well-known field of study in machine learning. The task consists in partitioning a dataset into communities of objects that are similar to each other. Clustering algorithms are useful for many types of data analysis (data compression, data classification, image analysis, information retrieval...). The results of clustering algorithms are highly dependent on the definition of a community. Different approaches can be based on concepts of connectivity, centrality, density, distribution fitting or seed expansion. Still, they agree on two features that a good clustering must satisfy:

• Objects within a community are similar: communities must have strong internal connections.
• Objects in different communities are different: communities must have weak external connections.

Because of these concepts of similarity and difference, clustering algorithms often use distances. For graph data, the modularity score proposed by Newman et al. [33] captures these concepts in another way: the quality of a partition is assessed with the modularity score, in such a way that a high score corresponds to a good partition. This score has a parameter, called the resolution, which impacts the size of the clusters in the optimal partition. A class of clustering algorithms has emerged from the notion of modularity. In particular, the Louvain algorithm [2] is a greedy algorithm aiming at maximizing the modularity score at a given resolution.
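To make the role of the resolution concrete, for a weighted graph with adjacency matrix $A$, node degrees $d_i$ and total edge weight $m$, one common parameterization of the modularity of a partition $C$ at resolution $\gamma$ is

$$Q_\gamma(C) = \frac{1}{2m} \sum_{i,j} \Big[ A_{ij} - \gamma \, \frac{d_i d_j}{2m} \Big] \, \delta_{C(i),\,C(j)},$$

so larger values of $\gamma$ penalize large clusters and push the optimal partition towards finer scales. Below is a minimal sketch, not the implementation developed in this thesis, of running Louvain and scoring its partition at a fixed resolution; it assumes the networkx library (version 2.8 or later for louvain_communities):

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Zachary's karate club graph, a standard small benchmark.
G = nx.karate_club_graph()

# Louvain greedily maximizes modularity at a fixed resolution.
partition = louvain_communities(G, resolution=1.0, seed=0)

# Modularity score of the partition it found, at the same resolution.
print(len(partition), "clusters, Q =", modularity(G, partition, resolution=1.0))
```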
In practice, data clustering is important because datasets often exhibit community structures. A particular dataset may have one single relevant partition in which each community is well separated from the others. However, the clusters often have a more complex organization. Each object can belong to more than one community with different degrees of membership, which leads to overlapping clusterings or soft clusterings. Datasets can also exhibit more than one resolution of clustering, which leads to hierarchical and multi-scale clusterings. Hence, there are three main approaches to clustering:

• Hard clustering: each object belongs to only one cluster.
• Soft clustering: each object belongs to multiple clusters with different degrees of membership.
• Multi-scale clustering: each object belongs to multiple clusters at different scales. In this case, the clusters often form a hierarchy.

In exploratory data analysis, it is important to take into account the natural representation of the data. Each object in a dataset can be represented in two different ways:

• the features of the object are represented by a vector of numbers (Figure 1.1a);
• the similarities between objects are represented by a set of weighted edges (Figure 1.1b).

Thus, the available knowledge lies either in the objects themselves or in the links between them. In the latter case, the data can be represented as a graph. Since the definition of a community is directly related to the connections between objects, the graph representation is a natural choice for clustering: the edges allow the similarity between objects to be compared efficiently. Sometimes, graphs emerge naturally from datasets (social networks, links between web pages, neuronal connections in the brain, road maps, etc.). If this is not the case, the data are represented by vectors of features, and it is still possible to build a graph representation using well-chosen similarity measures (similar movies, similar pixels in images, similar authors, etc.); a minimal construction of this kind is sketched at the end of this section.

Figure 1.1: Data representations. (a) Vector representation; (b) graph representation.

As shown in Figure 1.2, graphs are ubiquitous in real life and arise at very different scales. Moreover, graphs encode the similarities between objects, and the definition of a community is based on the notion of similarity. This makes them particularly well suited to community detection. Consequently, many clustering algorithms (including modularity-based algorithms) use the graph representation to detect community structures in data.

Figure 1.2: Examples of graphs in real life: a brain network (top left), a social network (top right), city-scale networks such as IoT deployments or city maps (bottom left), and world-scale networks such as the Internet or goods transport (bottom right).

The previous paragraphs present the broad context of clustering, with different approaches (distribution fitting, seed expansion, modularity...) on different types of data (vector data, graphs, or a combination of the two) and different problems (hard, soft and multi-scale clustering). This thesis does not aim to address all of these clustering tasks: it focuses only on multi-scale clustering in graphs using modularity methods.

1.1 Challenges & objectives

Most classic clustering algorithms produce one single partition, which depends on the setting of a hyper-parameter (number of clusters, neighborhood size, resolution...).
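To illustrate this dependence, a modularity-based method such as Louvain must be re-run for each value of its resolution hyper-parameter, and each run can return a partition at a different scale; the Paris algorithm presented in this thesis instead aims to capture all these scales in one run. A minimal sketch, again assuming networkx:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.karate_club_graph()

# One separate run per resolution value: low resolutions favour a few
# large clusters, high resolutions favour many small ones.
for gamma in (0.2, 0.5, 1.0, 2.0, 4.0):
    partition = louvain_communities(G, resolution=gamma, seed=0)
    print(f"resolution {gamma}: {len(partition)} clusters")
```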
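As for the graph construction from vector data mentioned earlier, a standard choice is a k-nearest-neighbour similarity graph. A minimal sketch, assuming scikit-learn and networkx (the dataset, the similarity measure and the value of k are illustrative choices, not those of the thesis):

```python
import networkx as nx
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

# Vector data: 100 points in the plane grouped into 3 blobs.
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Link every point to its 5 nearest neighbours; the resulting sparse
# adjacency matrix turns the vector dataset into a graph.
A = kneighbors_graph(X, n_neighbors=5, mode='connectivity')
G = nx.from_scipy_sparse_array(A)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```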