DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, 2019

Multi-scale clustering in graphs using modularity

BERTRAND CHARPENTIER

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Multi-scale clustering in graphs using modularity

BERTRAND CHARPENTIER

Master in Machine Learning Date: January 15, 2019 Supervisor: Pawel Herman (KTH), Thomas Bonald (Télécom ParisTech) Examiner: Johan Håstad Swedish title: Multiskal-klustring i grafer med moduläritet School of Computer Science and Communication


Abstract

This thesis provides a new hierarchical clustering algorithm for graphs, named Paris, which can be interpreted through the modularity score and its resolution parameter. The algorithm is agglomerative and based on a simple distance between clusters induced by the probability of sampling node pairs. It tries to approximate the optimal partitions with respect to the modularity score at any resolution in one run. In addition to the Paris hierarchical algorithm, this thesis proposes four algorithms that compute rankings of the sharpest clusters, clusterings and resolutions by processing the hierarchy output by Paris. These algorithms are based on a new measure of stability for clusterings, named sharp-score. Key outcomes of these four algorithms are the possibility to rank clusters, detect the sharpest clustering scales, go beyond the resolution limit and detect relevant resolutions. All these algorithms have been tested on both synthetic and real datasets to illustrate the efficiency of their approaches.

Keywords: Hierarchical clustering, Multi-scale clustering, Graph, Modularity, Resolution, Dendrogram

Sammanfattning

Denna avhandling ger en ny hierarkisk klusteralgoritm för grafer, som heter Paris, vilket kan tolkas av modularitetsresultatet och dess upplösningsparameter. Algoritmen är agglomerativ och är baserad på ett enda avstånd mellan kluster som induceras av sannolikheten för sampling av nodpar. Det försöker att approximera de optimala partitionerna vid vilken upplösning som helst i en körning. Förutom en hierarkisk algoritm föreslår denna avhandling fyra algoritmer som beräknar rankningar av de bästa grupperna, kluster och resolutioner genom att bearbeta hierarkiproduktionen i Paris. Dessa algoritmer bygger på ett nytt koncept av klusterstabilitet, kallad sharp-score. Viktiga resultat av dessa fyra algoritmer är förmågan att rangordna kluster, upptäcka bästa klusterskala, gå utöver upplösningsgränsen och upptäcka de mest relevanta resolutionerna. Alla dessa algoritmer har testats på både syntetiska och verkliga datamängder för att illustrera effektiviteten i deras metoder.

This work was performed at Télécom ParisTech in the LINCS laboratory and has been carried out in the framework of a double diploma between KTH and Ensimag. I would first like to thank my supervisor Thomas Bonald at Télécom ParisTech, who guided me during my master thesis project, and all the researchers from the LINCS who welcomed me. I would also like to thank my thesis advisors Pawel Herman of the CSC department at KTH and Sylvain Bouveret at Ensimag, who accompanied me during my master thesis, read it carefully and gave me precious comments on my work. I would also like to thank my examiners Johan Håstad at KTH and Stephanie Hahmann at Ensimag, who were involved in the validation survey for this research project. I must also express my very profound gratitude to my parents and to my friends for providing me with continuous encouragement throughout my years of study and through the process of researching and writing this thesis. Finally, this Master's thesis has been very interesting for many reasons. First of all, I exchanged ideas with researchers and PhD students from many different domains, which allowed me to learn a lot and to imagine new ideas crossing disciplines, and the results I obtained gave me the feeling of improving the state of the art in some way. I am also particularly happy to have submitted a paper and started collaborations with other researchers. For all these reasons, I plan to continue in this domain by doing a PhD. Contents

1 Introduction 1 1.1 Challenges & objectives ...... 4 1.2 Thesis outline ...... 5

2 Background 7 2.1 Relevant Theory ...... 7 2.1.1 Graph definitions ...... 7 2.1.2 Modularity ...... 9 2.1.3 Louvain algorithm ...... 11 2.1.4 Spectral clustering algorithm ...... 12 2.2 Related work ...... 12

3 Theoretical work 16 3.1 Association coefficient and distance ...... 16 3.2 Resolution as a stability measure ...... 19 3.3 A multi-scale block model: Hierarchical Stochastic Block Model ...... 21 3.3.1 Stochastic Block Model ...... 21 3.3.2 Hierarchical Stochastic Block Model ...... 22

4 Methods 25 4.1 Hierarchical clustering ...... 25 4.2 Rankings from the hierarchy ...... 31 4.2.1 Clusters ...... 31 4.2.2 Homogeneous and heterogeneous clusterings . . 33 4.2.3 Resolutions ...... 36

5 Results 38 5.1 Hierarchical Stochastic Block Model ...... 39 5.2 Stochastic Block Model ...... 41


5.3 Real data ...... 55 5.4 Vector datasets ...... 58

6 Discussion 62 6.1 Contributions and key findings ...... 62 6.2 Limitations and future work ...... 63 6.3 Ethical and sustainable impact...... 65

7 Conclusion 67

Bibliography 69

Appendices 75 A HSBM ...... 76 B OpenFlight ...... 80 C OpenStreet ...... 83 D EnglishDico ...... 87 E HumansWikipedia ...... 91

Chapter 1

Introduction

Data clustering is a well-known field of study in Machine Learning. This task consists in partitioning a dataset into communities of objects which are similar to each other. Clustering algorithms are useful for many types of data analysis (data compression, data classification, image analysis, information retrieval...). The results of clustering algorithms are highly dependent on the definition of a community. Different approaches can be based on concepts of connectivity, centrality, density, distribution fitting or seed expansion. Still, they agree on two features that a good clustering must satisfy:

• Objects within a community are similar. Communities must have strong internal connections

• Objects in different communities are different. Communities must have weak external connections

Because of the concepts of similarity and difference, clustering algorithms often use distances. For graph data, the modularity score proposed by Newman et al. [33] tries to capture these concepts in another way. The quality of a partition is assessed with the modularity score in such a way that a high score corresponds to a good partition. This score has a parameter, called resolution, which impacts the size of clusters in the optimal partition. A class of clustering algorithms has emerged from the notion of modularity. In particular, the Louvain algorithm [2] is a greedy algorithm aiming at maximizing the modularity score at a given resolution. In practice, data clustering is important because datasets often exhibit community structures. A particular dataset may have only one


single relevant partition where each community is well separated from the others. However, the clusters often have more complex organizations. Each object can belong to more than one community with different degrees of membership, which leads to overlapping clusterings or soft clusterings. Datasets can also exhibit more than one resolution of clustering, which leads to hierarchical and multi-scale clusterings. Hence, there are three main approaches for clustering:

• Hard clustering: Each object belongs to only one cluster.

• Soft clustering: Each object belongs to multiple clusters with different degrees.

• Multi-scale clustering: Each object belongs to multiple clusters at different scales. In this case, clusters often build a hierarchy.

In explicative data analysis, it is important to take into account the natural representation of data. Hence, each object in a dataset can be represented in two different ways:

• the features of the objects are represented by a vector of numbers (Figure 1.1a).

• the similarities between objects are represented by a set of weighted edges (Figure 1.1b).

Thus, the available knowledge is either placed in the objects themselves or in the links between them. In the latter case, data can be represented as a graph. Since the definition of a community is directly related to the connections between objects, the graph representation is a natural choice for clustering. The edges make it possible to compare the similarity between objects efficiently. Sometimes, graphs emerge naturally from datasets (social networks, links between web pages, neuronal connections in the brain, road maps, etc.). If it is not the case, the data are represented by vectors of features and it is still possible to build a graph representation thanks to well chosen similarity measures (similar movies, similar pixels in images, similar authors, etc.).

Figure 1.1: Data representations: (a) vector representation, (b) graph representation.

As shown in Figure 1.2, graphs are ubiquitous in real life and can have different scales. Moreover, graphs contain the similarities between objects, and the definition of a community is based on the notion of similarity. This makes them particularly adapted for community detection. Consequently, many clustering algorithms (including modularity-based algorithms) use the graph representation to detect community structures in data.

Figure 1.2: Examples of graphs in real life: brain network (top left), social network (top right), city networks such as IoT or city maps (bottom left), world networks such as the Internet or the transport of people and goods (bottom right)

The previous paragraphs present the broad context of clustering, with different approaches (distribution fitting, seed expansion, modularity...), on different types of data (vector data, graphs or a combination of the two) and with different problems (hard, soft, multi-scale clustering). This thesis is not aimed at dealing with all these clustering tasks. It focuses only on multi-scale clustering in graphs using modularity methods.

1.1 Challenges & objectives

Most classic clustering algorithms produce one single partition depending on the setting of a hyper-parameter (number of clusters, neighborhood size, resolution...). As a parameter value generally has a strong impact on the algorithm output, an arbitrary choice of parameters is likely to lead to irrelevant solutions with no meaningful cluster. Thus, parameter tuning is important for data clustering. In the case of modularity-based algorithms, the tuning needs to be done on the resolution parameter, which determines the scale of the optimal clustering. Hence, the first objective of this thesis is to address the following question: How to estimate suitable values for the resolution parameter to ensure a meaningful clustering? In addition to the tuning of the resolution parameter, the modularity approach meets another problem known as the resolution limit ([17], [25]): the optimal partition for a given resolution tends to have clusters with too homogeneous sizes. This "limit", in the sense of a "constraint" or a "limitation", is difficult to address with straightforward modularity approaches. It is an obstacle to the discovery of communities of fairly different sizes. Hence, communities might be excessively merged or excessively split. This limit is intractable with classic modularity maximization approaches. The second objective tackled in this thesis is therefore to examine the following question: "Is it possible to build a clustering algorithm based on modularity which does not suffer from the resolution limit?" Another problem is that datasets often have more than one relevant clustering. Hierarchical and multi-scale clustering algorithms try to solve this issue by proposing multiple levels of clusterings. An interesting question is then how to assess and compare the quality of the

partitions. Intuitively, a good partition will have compact and isolated clusters. This problem is closely related to the definition of the robustness of a partition, since a good partition should be robust with respect to different parameter and graph inputs. Finally, this thesis tries to examine a last objective: "How to identify the sharpest clustering levels and the most robust partitions in a graph?" More technical details about these objectives are given in chapter 2. In particular, the related work and the state of the art on these topics are presented in section 2.2 and in reviews on community detection in networks [16], [20]. It is also important to mention that the meanings of the words "sharp" and "robust" clustering are clarified in chapters 3, 4 and 5.

1.2 Thesis outline

The main purpose of this thesis is to propose and test a new algorithm for graph clustering using the modularity approach. The thesis is organized in the following chapters. Chapter 3 proposes a new similarity coefficient between nodes and demonstrates some key results about it. In particular, it shows properties of the resolution parameter as a stability measure for clusters and clusterings. The proposed stability measure evaluates the compactness of a cluster and its isolation from the other elements in the graph. This chapter also introduces a hierarchical stochastic block model (HSBM) with a new notion of a priori on the hierarchy. Chapter 4 proposes new algorithms for multi-scale clustering. The first algorithm, introduced in section 4.1, is a hierarchical algorithm which relies on a simple distance between clusters induced by probability distributions over nodes and edges. This distance is directly related to the new similarity coefficient proposed in this work and satisfies key properties which enable an efficient implementation of the hierarchical algorithm and can be linked to the resolution parameter of the modularity score. Then, two different methods are proposed in section 4.2 to process a hierarchy and detect the most relevant scales. These methods are based on a new score which describes the sharpness of cluster borders. One method keeps clusters at the same scale while the other is much more resistant to cluster size heterogeneity. These procedures

make the whole algorithm parameter-free, which avoids the risk of an arbitrary parameter tuning. Finally, this section describes a last method which provides a list of relevant values for the resolution parameter of the modularity score of a graph. This is a key outcome of this thesis since adjusting the resolution is commonly considered an important issue in modularity-based algorithms. In chapter 5, numerical experiments on both synthetic and real datasets illustrate the quality of the hierarchical and flat clusterings returned by these methods. Chapter 2

Background

2.1 Relevant Theory

The aim of this chapter is to present the concepts which are useful for the understanding of the thesis. It first recalls basic graph definitions, then presents the modularity score and its interpretations. Finally, well-known models for synthetic graphs with community structures are explained.

2.1.1 Graph definitions

Consider a graph $G = (V, E)$. $V = \{1, ..., n\}$ describes the set of $n$ vertices or nodes, and $E$ describes the set of $m$ edges. If $\{i, j\} \in E$, nodes $i, j$ are linked by an edge and are said to be neighbors. The degree of the node $i$ is equal to the number of neighbors of $i$.

Figure 2.1: Graph notations

In this thesis, G is considered weighted (numbers are assigned to the edges) and undirected (the edges {i, j} and {j, i} are identical). Let A


be the adjacency matrix of the graph. This is a symmetric, non-negative matrix such that for each $i, j \in V$, $A_{ij} > 0$ if and only if there is an edge between $i$ and $j$ ($A_{ij} = 0$ otherwise). In this case, $A_{ij}$ denotes the weight of the edge $\{i, j\} \in E$. The weight $w_i$ of the node $i$ can then be naturally defined in the following way:

$$w_i = \sum_{j \in V} A_{ij} \qquad (2.1)$$

Remark that for unit weights, $w_i$ is the degree of $i$. The total weight of the graph is:

$$w = \sum_{i \in V} w_i = \sum_{i,j \in V} A_{ij} \qquad (2.2)$$

The edge weights induce a joint probability distribution on the pairs of nodes:

$$\forall i, j \in V, \quad p(i, j) = \frac{A_{ij}}{w} \qquad (2.3)$$

$p(i, j)$ is the probability to sample the edge $\{i, j\}$ with respect to its weight. The node weights induce a probability distribution on the nodes. This probability distribution is the marginal distribution of $p(i, j)$:

$$\forall i \in V, \quad p(i) = \frac{w_i}{w} = \sum_{j \in V} p(i, j) \qquad (2.4)$$

$p(i)$ is the probability to sample the node $i$ with respect to its weight. The conditional probability distribution follows:

$$\forall i, j \in V, \quad p(i|j) = \frac{A_{ij}}{w_j} = \frac{p(i, j)}{p(j)} \qquad (2.5)$$

It is possible to define a random walk from this conditional probability distribution, that is to say a stochastic process which randomly jumps from vertex to vertex. In this case, $p(i|j)$ is the probability to go to the node $i$ from the node $j$. It turns out that $p(i, j)$ and $p(i)$ are the stationary distributions over edges and nodes associated with this random walk. A clustering $P$ is a partition of $V$. More formally, $P = \{C_1, ..., C_K\}$ such that $\cup_{k \in [\![1, K]\!]} C_k = V$ and $\forall k \neq k' \in [\![1, K]\!], C_k \cap C_{k'} = \emptyset$. Each element $C_k$ of a clustering is called a cluster. In soft clustering however,

vertices can belong to multiple clusters. A last type of clustering is hierarchical clustering, which tries to detect a good hierarchy $H = \{P^1, ..., P^L\}$. The hierarchical clustering $H$ is thus a list of clusterings verifying:

$$\forall l \in [\![0, L]\!], \forall C_k \in P^l, \exists C_{k'} \in P^{l+1}, C_k \subset C_{k'} \qquad (2.6)$$

Probability distributions are easily extendable to clusters by defining:

$$p(C_k) = \sum_{i \in C_k} p(i) \qquad (2.7)$$

$$p(C_k, C_{k'}) = \sum_{i \in C_k, j \in C_{k'}} p(i, j) \qquad (2.8)$$

$p(C_k)$ is the probability to sample one node from the cluster $C_k$. $p(C_k, C_{k'})$ is the probability to sample one node from the cluster $C_k$ and one from the cluster $C_{k'}$ or, in other words, to sample an edge linking $C_k$ and $C_{k'}$.
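To make these definitions concrete, here is a minimal Python sketch (the small adjacency matrix and all variable names are purely illustrative and not part of the thesis implementation) that computes the node weights and the sampling distributions of equations (2.1)-(2.5).

import numpy as np

# Toy undirected weighted graph given by its dense adjacency matrix A.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

w_i = A.sum(axis=1)               # node weights, equation (2.1)
w = A.sum()                       # total weight, equation (2.2)
p_joint = A / w                   # p(i, j), equation (2.3)
p_node = w_i / w                  # p(i), equation (2.4)
p_cond = A / w_i[np.newaxis, :]   # p(i|j) = A_ij / w_j, equation (2.5)

# p(i) is the marginal of p(i, j):
assert np.allclose(p_joint.sum(axis=1), p_node)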

2.1.2 Modularity

A natural definition of a good clustering is a clustering with dense connections between vertices within the same cluster and sparse connections between vertices in different clusters. A node belongs to a community if it has stronger connections with members of this community than with members of another community. The modularity score introduced by Newman [29] tries to stick to this notion of a good community. The modularity is denoted by $Q$ and applies to a graph $G$ and a clustering $P$. A high modularity means a good clustering whereas a low modularity means a poor clustering. Then:

$$Q(P) = \sum_{C_k \in P} \sum_{i,j \in C_k} \left( \frac{A_{ij}}{w} - \frac{w_i}{w} \frac{w_j}{w} \right) \qquad (2.9)$$

A first remark is that the modularity is modular with respect to the clusters:

$$Q(P) = \sum_{C_k \in P} Q(C_k) \qquad (2.10)$$

It is possible to interpret the modularity in different ways. A first way to understand the modularity is to re-write it in terms of probability distributions:

$$Q(P) = \sum_{C_k \in P} \sum_{i,j \in C_k} \left( p(i, j) - p(i)p(j) \right) \qquad (2.11)$$

Thus, the modularity is the difference between the probabilities of sampling two nodes of the same cluster under the joint distribution $p(i, j)$ and under the product distribution $p(i)p(j)$. In other words, it compares the actual weight of the edge $\{i, j\}$ with the expected value of this weight under a null model. In this case, the null model is a graph with nodes having the same degrees as in the original graph but with edges distributed at random. The nodes then keep the same weights but are considered independent. Other choices of null model have been studied for the modularity [35], [6] but (2.11) is the most common choice in community detection. A second way to understand the modularity is to decompose it into two terms:

$$Q(P) = W(P) - S(P) \qquad (2.12)$$

$$\text{where} \quad W(P) = \sum_{C_k \in P} \sum_{i,j \in C_k} p(i, j) \qquad (2.13)$$

$$S(P) = \sum_{C_k \in P} \sum_{i,j \in C_k} p(i)p(j) \qquad (2.14)$$

The first term $W(P)$ is the sum of the cluster weights. The denser the clusters are, the larger the term $W(P)$ is. The second term $S(P)$, also known as the Simpson index [43], penalizes clusterings which are not diversified enough. It is the probability to sample two vertices belonging to the same cluster independently. Hence, the more diversified the clustering is, the lower the term $S(P)$ is. Yet, modularity maximization has some limits when it comes to detecting good clusterings. The first one is that small communities may not be identified in very large networks. This issue is fully explained in [17]. To overcome this limit, a solution is to introduce the resolution parameter denoted $\gamma$:

$$Q_\gamma(P) = \sum_{C_k \in P} \sum_{i,j \in C_k} \left( p(i, j) - \gamma \, p(i)p(j) \right) \qquad (2.15)$$

The resolution parameter balances the weight term against the diversity term. Hence, large values of $\gamma$ enable the detection of smaller clusters and small values of $\gamma$ enable the detection of larger clusters. For instance, the limit cases where $\gamma = 0$ or $\gamma \to +\infty$ lead to a clustering with one cluster containing all the nodes (large $W(P)$) or to singleton clusters each containing one node (small $S(P)$). However, there is a second limit. For heterogeneous distributions of cluster sizes, modularity is not capable of recovering all the communities [25]. Some communities are necessarily over-merged or over-split.
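As a concrete illustration, the following minimal Python sketch (the helper name and its arguments are illustrative, not taken from any particular library) evaluates $Q_\gamma(P)$ of equation (2.15) for a dense adjacency matrix and a node-to-cluster assignment.

import numpy as np

def modularity(adj, labels, resolution=1.0):
    # adj: symmetric weight matrix, labels: array mapping node -> cluster id
    labels = np.asarray(labels)
    w = adj.sum()
    p_joint = adj / w                       # p(i, j)
    p_node = adj.sum(axis=1) / w            # p(i)
    q = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        q += p_joint[np.ix_(idx, idx)].sum()                        # W(P) term
        q -= resolution * np.outer(p_node[idx], p_node[idx]).sum()  # gamma * S(P) term
    return q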

2.1.3 Louvain algorithm

The Louvain algorithm proposed by Blondel et al. [2] is a very popular clustering algorithm based on modularity. The Louvain algorithm is a greedy algorithm that maximizes the modularity at a given resolution $\gamma$ in two phases. At the beginning, each node corresponds to a separate community. During the first phase, nodes are considered in a cyclic way (random or not). The modularity increase caused by moving a node from its current community to one of its neighboring communities is computed. The node at issue is then moved to the community bringing the largest modularity increase. When no more nodes are inclined to change community, the second phase starts. During this second phase, also known as the aggregation phase, all the nodes belonging to the same community are merged in such a way that the total weight of each community is constant. Hence, any links within communities are represented by self-loops on the new community nodes whereas edge weights between different communities are summed to create the edges between the new community nodes. After the second phase, each new cluster is considered as a node in the remainder of the algorithm. These two phases are iterated until the number of nodes remains the same after a merge. The inventors of this algorithm have observed that the first phase may be slow when it tries to change the communities of nodes for only a small modularity increase. The algorithm can then be sped up by adding a threshold and only considering modularity increases which are above this threshold.
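For reference, the python-louvain package (the Community library [11] cited in section 2.2) ships an implementation of this algorithm. The short usage sketch below assumes that package is installed and that its best_partition signature matches recent versions; keyword names may differ across versions.

import networkx as nx
import community as community_louvain  # the python-louvain package

G = nx.karate_club_graph()
# Louvain maximization of Q_gamma at a given resolution gamma.
partition = community_louvain.best_partition(G, resolution=1.0)  # node -> community id
print(len(set(partition.values())), "communities found")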

2.1.4 Spectral clustering algorithm

The spectral algorithm described in [26] is also a very popular clustering algorithm, based on the spectral decomposition of the adjacency matrix. It takes as input $k$, the expected number of clusters, and can be decomposed in two main phases. The first phase consists in computing the first $k$ eigenvectors $u_1, ..., u_k$ of the Laplacian matrix $L$. As a reminder, the Laplacian matrix is:

$$L = D - A$$

with $A$ the adjacency matrix and $D$ the diagonal matrix of degrees. Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, ..., u_k$ as columns. During the second phase, each node is represented in the new spectral basis by $y_i$, $i \in [\![1, n]\!]$, the rows of the matrix $U$. A $k$-means clustering is then performed on the $y_i$ in order to assign a cluster to each node. The spectral clustering algorithm has different variants depending on the Laplacian matrix used (symmetric, random walk...). A common interpretation of this algorithm is that it tries to compute a clustering that minimizes its cut [26].
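A minimal sketch of these two phases (unnormalized Laplacian, dense matrices; the function name is illustrative and the normalized variants would differ slightly) could look as follows.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(adj, k):
    degrees = adj.sum(axis=1)
    laplacian = np.diag(degrees) - adj            # L = D - A
    # eigenvectors associated with the k smallest eigenvalues of L
    _, eigenvectors = np.linalg.eigh(laplacian)
    U = eigenvectors[:, :k]
    # each node i is represented by the row y_i of U; k-means on the rows
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)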

2.2 Related work

Many algorithms have been designed to partition datasets into communities. The libraries Scikit-Learn [42] and Community [11] propose a large range of algorithms used nowadays for clustering. These algorithms are explained on the libraries' web pages. Figure 2.2 presents the results of ten algorithms on six synthetic datasets. Each row corresponds to the results obtained by the algorithms on the same 2-dimensional dataset taken from Scikit-Learn.

Figure 2.2: Overview of 10 classic clustering algorithms applied on 6 synthetic datasets

In addition to these algorithms, there are other recent algorithms in the literature that deserve to be mentioned even if they are not yet implemented in libraries. Since this thesis is mainly focused on graph clustering methods based on modularity, a second paragraph presents references to this family of algorithms.

Clustering algorithms for vector data. Usual clustering techniques apply to vector data. They do not directly apply to graphs, unless the graph is embedded in some metric space, through spectral techniques for instance (Spectral Clustering [26]). The choice of the distance is important for clustering vector data. Moreover, most clustering algorithms only output one single clustering and rely on some parameter (number of clusters, neighborhood size, ...) that allows one to adapt the clustering to the dataset and to the intended purpose. Some algorithms try to find good representatives for clusters (K-means, Affinity Propagation [18], Mean Shift [5] [10]). These representative points are sometimes called centroids of the clusters and are often updated iteratively through the run of the algorithm. Other algorithms are based on density (DBSCAN [15], Optics [41]). The same cluster is assigned to points belonging to the same dense neighborhood. These methods need parameters to define the significant neighborhood of a point. Eventually, there are also

algorithms such as Gaussian Mixture which try to directly fit a given type of probability distribution to the data and estimate its parameters. Hierarchical clustering algorithms return a different structure of solution, with a sequence of clusterings which are nested within each other (Ward [47], Birch [49]). This solution is richer than a single partition since it contains multiple relevant levels of clustering. For instance, in the case of the agglomerative hierarchical clustering algorithms, the idea is to iteratively merge the two nearest points with respect to a given distance. Agglomerative methods can be implemented efficiently through the Nearest-Neighbor Chain scheme [28] if the distance is proved to be reducible. However, hierarchical methods lack a way of identifying, in some sense, the best clustering levels in the clustering hierarchy.

Clustering algorithms for graphs. A number of clustering algorithms have also been specifically developed for graphs. Since it is common to transform vector data into graphs, these graph algorithms can be used on vector data as well. In this case, the choice of the transformation [26], [12] is very important and generally involves a distance or a similarity measure. For instance, it is possible to limit the number of edges of a node to the k nearest neighbors and assign weights thanks to a Radial Basis Function. A first famous algorithm for clustering in graphs is the spectral clustering applied to the adjacency matrix of the graph. Other approaches include the divisive approach of the Girvan-Newman algorithm [34], based on the notion of edge betweenness, the iterative approaches of [40] and [23], looking for local maxima of modularity or of some fitness function, and other approaches based on statistical inference [8], replica correlations [39] and graph wavelets [45]. Within the graph clustering algorithms, there is an important subset of algorithms based on the maximization of the modularity score. The Louvain algorithm [2] detailed in section 2.1.3 proposes a greedy approach which is widely used because of its high speed and its efficiency on different sizes of networks. It tries to iteratively maximize the modularity score $Q_\gamma$ at a given resolution $\gamma$. There are other methods based on the modularity score, such as the greedy algorithm of Clauset et al. [7] or the spectral decomposition of the modularity matrix [32]. Some of them propose multi-resolution approaches [1], [22], [38] or try to estimate relevant resolutions [30]. Nevertheless, none

of them proposes an effective way of selecting the appropriate value of the resolution $\gamma$. Indeed this parameter is hard to adjust in practice. The modularity optimization at a given resolution is known to have some limits which prevent it from detecting clusters in too large graphs or clusters with too heterogeneous sizes [17], [25]. These limitations remain even with multi-resolution approaches [48]. Graphs also have a set of dedicated hierarchical algorithms. In the case of the agglomerative algorithms, the distance used may be based on the modularity increase [31], on a random walk in the graph [37], on some notion of structural similarity involving the neighborhood of each node [19], or on a correlation measure between clusters [4]. None of these distances has been proved to be reducible. Chapter 3

Theoretical work

This chapter presents the main theoretical contributions of this thesis. It first introduces a new association coefficient and distance for graphs. Secondly, it analyzes the resolution parameter as a stability measure for clusterings and clusters. Finally, it presents a new graph model having hierarchical structures.

3.1 Association coefficient and distance

Association coefficient. This thesis introduces a new association coefficient between nodes $i$ and $j$, denoted $\gamma_{ij}$:

$$\gamma_{ij} = \frac{p(i, j)}{p(i)p(j)} \qquad (3.1)$$

According to $\gamma_{ij}$, nodes $i, j$ are associated or close if these nodes are sampled more frequently from the joint distribution $p(i, j)$ than from the product distribution $p(i)p(j)$. The term $p(i)p(j)$ behaves as a normalization factor such that $\gamma_{ij}$ is fair with respect to the degrees of the nodes. Another way of writing this coefficient is:

$$\gamma_{ij} = \frac{p(i|j)}{p(i)} = \frac{p(j|i)}{p(j)} \qquad (3.2)$$

It makes clear that nodes $i, j$ are associated if $i$ (respectively $j$) is sampled more frequently given $j$ (respectively $i$). This coefficient is closely linked with the modularity $Q_\gamma$. Let $P_0$ be the partition where each node forms a singleton cluster. The modularity increase due to merging two


nodes $i, j$ in $P_0$ is:

$$\Delta Q_\gamma = p(i, j) - \gamma \, p(i)p(j) \qquad (3.3)$$

$$\Delta Q_\gamma = p(i)p(j)(\gamma_{ij} - \gamma) \qquad (3.4)$$

∆Qγ = p(Ck,Ck0 ) − γp(Ck)p(Ck0 ) (3.5)

The association coefficient γCkCk0 derives in the same way:

p(Ck,Ck0 ) γCkCk0 = (3.6) p(Ck)p(Ck0 ) This association coefficient verifies two following key properties: Proposition 1. p(A) p(B) γA∪B,C = γA,C + γB,C (3.7) p(A ∪ B) p(A ∪ B) Proof. p(A ∪ B,C) γ = A∪B,C p(A ∪ B)p(C) p(A, C) p(B,C) = + p(A ∪ B)p(C) p(A ∪ B)p(C) p(A)p(C) p(A, C) p(B)p(C) p(B,C) = + p(A ∪ B)p(C) p(A)p(C) p(A ∪ B)p(C) p(B)p(C) p(A) p(B) = γ + γ p(A ∪ B) A,C p(A ∪ B) B,C

Proposition 2. For any disjoint clusters $A, B, C$ in the partition $P$:

$$\gamma_{A \cup B, C} \in [\min(\gamma_{A,C}, \gamma_{B,C}), \max(\gamma_{A,C}, \gamma_{B,C})] \qquad (3.8)$$

Proof. It follows immediately from proposition 1, since $\gamma_{A \cup B, C}$ is a weighted mean of $\gamma_{A,C}$ and $\gamma_{B,C}$.

Distance. Taking the inverse of this association coefficient defines a distance between two nodes $i, j$ or two clusters $C_k, C_{k'}$:

$$d(i, j) = \frac{p(i)p(j)}{p(i, j)} \qquad (3.9)$$

$$d(C_k, C_{k'}) = \frac{p(C_k)p(C_{k'})}{p(C_k, C_{k'})} \qquad (3.10)$$

If $i, j$ are not connected, the distance $d(i, j)$ is set to $\infty$. This distance is symmetric and non-negative but does not necessarily verify the triangle inequality. Hence, it is not a metric in general. It verifies the two following properties:

Proposition 3. For any disjoint clusters $A, B, C$ in the partition $P$:

$$d(A \cup B, C)^{-1} = \frac{p(A)}{p(A \cup B)} d(A, C)^{-1} + \frac{p(B)}{p(A \cup B)} d(B, C)^{-1} \qquad (3.11)$$

Proof. It is clear from proposition 1 that $d(A \cup B, C)$ is the weighted harmonic mean of $d(A, C)$ and $d(B, C)$.

Proposition 4.
$$d(A \cup B, C) \in [\min(d(A, C), d(B, C)), \max(d(A, C), d(B, C))] \qquad (3.12)$$

Proof. It follows immediately from proposition 3, since $d(A \cup B, C)$ is a weighted harmonic mean of $d(A, C)$ and $d(B, C)$.

This distance will be used in the agglomerative algorithm to merge the closest clusters. Moreover, as $d(A \cup B, C) \geq \min(d(A, C), d(B, C))$, the minimum distance of $A$ and $B$ to any other cluster $C$ cannot decrease. Merging clusters other than $A$ or $B$ will not change the nearest neighbor of $A$ or $B$. Hence, the fact that the minimum distance is a local property will lead to a local clustering algorithm (Nearest-Neighbor Chain).
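The following minimal Python sketch (names illustrative, dense adjacency matrix assumed) computes the association coefficients of equation (3.1) and the distances of equation (3.9), together with the merge update of proposition 3.

import numpy as np

def association_and_distance(adj):
    w = adj.sum()
    p_joint = adj / w                          # p(i, j)
    p_node = adj.sum(axis=1) / w               # p(i)
    prod = np.outer(p_node, p_node)            # p(i) p(j)
    gamma = np.where(prod > 0, p_joint / np.where(prod > 0, prod, 1.0), 0.0)      # eq (3.1)
    dist = np.where(p_joint > 0, prod / np.where(p_joint > 0, p_joint, 1.0), np.inf)  # eq (3.9)
    return gamma, dist

def merged_distance(d_ac, d_bc, p_a, p_b):
    # Proposition 3: d(A u B, C) is the weighted harmonic mean of d(A, C)
    # and d(B, C), with weights p(A) / p(A u B) and p(B) / p(A u B).
    inv = (p_a / (p_a + p_b)) / d_ac + (p_b / (p_a + p_b)) / d_bc
    return np.inf if inv == 0 else 1.0 / inv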

Radon-Nikodym derivative. It is interesting to remark that the distance $d(i, j)$ can be interpreted as a Radon-Nikodym derivative. Let us first recall the definition of a Radon-Nikodym derivative [44]. If a measure $\lambda$ is absolutely continuous with respect to a positive measure $\mu$, then the Radon-Nikodym derivative of $\lambda$ with respect to $\mu$ is the function $f$ such that for any measurable set $E$:

$$\lambda(E) = \int_E f \, d\mu \qquad (3.13)$$

The Radon-Nikodym derivative $f$ is often denoted $\frac{d\lambda}{d\mu}$. It is called a derivative since it describes the rate of change of density of the measure $\lambda$ with respect to the measure $\mu$. As regards $p(i)p(j)$ and $p(i, j)$, it is clear that the joint distribution is absolutely continuous with respect to the product distribution ($p(i)p(j) = 0 \Rightarrow p(i, j) = 0$). The Radon-Nikodym derivative of $p(i)p(j)$ with respect to $p(i, j)$ is $d(i, j) = \frac{p(i)p(j)}{p(i, j)}$. Indeed, for any set of nodes $E$:

$$p(E) = \sum_{i,j \in E} d(i, j) \, p(i, j) \qquad (3.14)$$

The distance d(i, j) can then be interpreted as a rate of distortion of the product distribution with respect to the joint distribution.

3.2 Resolution as a stability measure

This section introduces new definitions of stability for clusterings and clusters and relates them to the association coefficient and the distance presented in section 3.1.

Clustering. Given a clustering $P$, the resolution parameter has a strong influence on the modularity score. Taking the modularity at resolution $\gamma = 0$ as a reference gives:

$$Q_\gamma(P) = Q_0(P) - \gamma S(P) \qquad (3.15)$$

Hence, $Q_\gamma(P)$ is linear with respect to $\gamma$. The objective is the maximization of the modularity at any resolution. It leads to consider:

$$\max_P Q_\gamma(P) = \max_P \left( Q_0(P) - \gamma S(P) \right) \qquad (3.16)$$

The function $f : \gamma \to \max_P Q_\gamma(P)$ is then a convex piecewise-linear function since it is a maximum over the finite set of linear functions $\gamma \to Q_0(P) - \gamma S(P)$. A key property emerges from this remark:

Proposition 5. If $P$ is a partition of a graph $G$ then either $Q_\gamma(P) \neq \max_{P'} Q_\gamma(P')$ for all resolutions $\gamma$, or $Q_\gamma(P) = \max_{P'} Q_\gamma(P')$ on a range $[\gamma_{min}, \gamma_{max}]$. The range $[\gamma_{min}, \gamma_{max}]$ is called the stable zone of the partition $P$.

Proof. Let $P$ be a partition of $G$, and $\gamma_{min}, \gamma_{max}$ be the minimum and maximum resolutions where the modularity of $P$ is optimal. Let $P'$ be any other partition of $G$. The function $g : \gamma \to Q_\gamma(P) - Q_\gamma(P')$ is linear and verifies $g(\gamma_{min}) \geq 0$, $g(\gamma_{max}) \geq 0$. Consequently, $g(\gamma) \geq 0$ for any $\gamma \in [\gamma_{min}, \gamma_{max}]$ and $P$ is optimal on the range $[\gamma_{min}, \gamma_{max}]$.

Proposition 5 explains that clusterings are only stable on intervals of resolutions. A large stable zone for a clustering means that it is resistant to variations of the resolution.

Cluster. If $C$ is a cluster of nodes in $G$, its stable range $[\gamma_{min}, \gamma_{max}]$ is defined as follows:

$$\gamma_{min} = \max_{C' \subset V \setminus C} \gamma_{CC'} \qquad (3.17)$$

$$\gamma_{max} = \min_{C', C'' \subset C} \gamma_{C'C''} \qquad (3.18)$$

If $C$ contains all the nodes ($C = V$), the resolution $\gamma_{min}$ is set to 0. If $C$ contains one single node ($C = \{i\}$), the resolution $\gamma_{max}$ is set to $+\infty$. The resolution $\gamma_{min}$ is the largest resolution below which it is worth merging $C$ with another cluster $C'$. The resolution $\gamma_{max}$ is the smallest resolution above which it is worth splitting $C$ into two sub-clusters $C'$ and $C''$. It is interesting to rewrite formulas 3.17 and 3.18 in terms of distances:

$$d_{out} = \min_{C' \subset V \setminus C} d_{CC'} = 1/\gamma_{min} \qquad (3.19)$$

$$d_{in} = \max_{C', C'' \subset C} d_{C'C''} = 1/\gamma_{max} \qquad (3.20)$$

Thus, $d_{out}$ characterizes the distance of $C$ to the rest of the graph since it is the distance between $C$ and its nearest neighbor $C'$. In contrast, $d_{in}$ characterizes the maximum distance between one sub-cluster $C'$ and its nearest neighbor $C''$ within $C$. As a consequence, the length of $[\gamma_{min}, \gamma_{max}]$ is a relevant quantity to assess the quality of a cluster. If $C$ has a large stable range, it means that the characteristic length within $C$ is significantly smaller than the characteristic length between $C$ and the rest of the graph. It measures somehow the sharpness of the cluster border. The stable resolution range of a cluster is closely related to the definition of the stable resolution range of a clustering:

Proposition 6. If $P$ is a clustering stable on $[\gamma^P_{min}, \gamma^P_{max}]$ and $C \in P$ a cluster stable on $[\gamma^C_{min}, \gamma^C_{max}]$, then:

$$[\gamma^P_{min}, \gamma^P_{max}] \subset [\gamma^C_{min}, \gamma^C_{max}] \qquad (3.21)$$

In particular:

$$\gamma^P_{min} \geq \max_{C \in P} \gamma^C_{min}, \qquad \gamma^P_{max} \leq \min_{C \in P} \gamma^C_{max}$$

Proof. It is clear from the definitions of $\gamma^C_{min}$ and $\gamma^C_{max}$ that any partition containing $C$ cannot be stable on a range larger than $[\gamma^C_{min}, \gamma^C_{max}]$.

The resolution stable range of $P$ is thus very sensitive to the stability of each cluster $C$ in $P$. One bad cluster is sufficient to drastically reduce the resolution stable range of a partition, even if all the other clusters have a large stable range. Therefore, the resolution stable range of a partition evaluates the minimum stability of the clusters that compose it rather than the average stability of all the clusters. A better quantity to evaluate the quality of a partition is the average stable range of all clusters. All these remarks motivate the design of the clustering algorithms in chapter 4.

3.3 A multi-scale block model: Hierarchical Stochastic Block Model

This section proposes a new presentation of a Hierarchical Stochastic Block Model with an a priori on the hierarchical structure. Beforehand, it first recalls the definition of the classic Stochastic Block Model.

3.3.1 Stochastic Block Model

The Stochastic Block Model (SBM) is a generative model creating graphs containing communities. The SBM generates graphs with $n$ nodes $1, ..., n$ organized in $k$ communities $\{C_1, ..., C_k\}$. These graphs can be generated from the matrix $P = \{P_{rs}\}_{1 \leq r,s \leq k}$ of edge probabilities, which is generally symmetric. In the generated graph, the nodes $i \in C_r$ and $j \in C_s$ will be connected by an edge of unit weight with probability $P_{rs}$. It is also possible to interpret the $P$ coefficients as parameters of distributions other than the Bernoulli distribution. For instance, $P_{rs}$ can be the mean of a Poisson law, which generates graphs with weighted edges.

A sub-category of the SBM is the planted partition model (PPM), which corresponds to:

$$P_{rs} = \begin{cases} p & \text{if } r = s \\ q & \text{otherwise} \end{cases} \qquad (3.22)$$

Only the case where $p > q$ fits with the classic definition of a community: connections are denser within communities than between communities. In this situation, the model is said to be assortative. Conversely, if $p < q$ the model is said to be disassortative. The special case where $p = q$ corresponds to the Erdős–Rényi model. These last two models do not generate graphs with clear clusters in general. Consequently, only assortative SBMs are used as benchmarks for testing clustering algorithms. Remark that a graph can be assortative while each node has fewer edges with the nodes of its own community than with the other nodes. This happens when $P_{rr} < \sum_{s \neq r} P_{rs}$. This case does not contradict the classic intuition about communities.
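As an illustration, NetworkX (assumed available; the generator names below are those exposed by recent versions of that library) provides ready-made generators for both the planted partition model and the general SBM.

import networkx as nx

# Assortative planted partition model: 4 communities of 25 nodes,
# intra-community edge probability p = 0.3, inter-community probability q = 0.02.
G_ppm = nx.planted_partition_graph(l=4, k=25, p_in=0.3, p_out=0.02, seed=0)

# General SBM: arbitrary block sizes and a full edge-probability matrix.
sizes = [30, 20, 10]
probs = [[0.25, 0.05, 0.02],
         [0.05, 0.35, 0.07],
         [0.02, 0.07, 0.40]]
G_sbm = nx.stochastic_block_model(sizes, probs, seed=0)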

3.3.2 Hierarchical Stochastic Block Model

The SBM lacks a clear way to generate graphs with a hierarchical structure although it is efficient at generating communities. Moreover, real graphs often appear to have more than one relevant clustering scale, which motivates the definition of a hierarchical version of the SBM. This section proposes a Hierarchical Stochastic Block Model (HSBM) composed of $L$ levels. The $n$ nodes are first distributed in $k_1$ blocks of the same size. Each vertex of a block is then connected to each node of another block with probability $\mu_1$ of having weight 1 and $1 - \mu_1$ of having weight 0. Every block is then split into $k_2$ sub-blocks where edges between sub-blocks are sampled with probability $\mu_2$. This procedure is repeated recursively for $l$ in $[1, L]$ and generates levels from the lowest to the deepest. The sequence $\mu_1, ..., \mu_L$ is generally taken increasing in order to have a clear hierarchical structure. At the end, blocks at level $L - 1$ are comprised of $k_L$ vertices which are connected with probability $\mu_L$. Thus, $k_l$ is the division factor of a block at level $l - 1$ into blocks at level $l$, and $\mu_l$ is a parameter to sample edges at level $l$. It is natural to set $k_0 = 1$ and $\mu_0 = 0$, since the whole graph can be seen as one block with probability 0 of being connected to other nodes. The HSBM is then composed of $L + 1$ levels where level 0 consists in one cluster with

all nodes and level $L$ consists in $n$ clusters with one single node. Remark that this HSBM is equivalent to an SBM with $\prod_{l=0}^{L-1} k_l$ blocks of $k_L$ nodes and a hierarchical probability matrix. For instance, the HSBM with $k = [1, 2, 3, 4]$ and $\mu = [0., .2, .4, .8]$ is equivalent to the SBM with 6 blocks of 4 nodes and a probability matrix equal to:

.8 .4 .4 .2 .2 .2 .4 .8 .4 .2 .2 .2   .4 .4 .8 .2 .2 .2   .2 .2 .2 .8 .4 .4   .2 .2 .2 .4 .8 .4 .2 .2 .2 .4 .4 .8

Edge distribution. The HSBM described above only generates graphs with unit edge weights. The edges are drawn from Bernoulli distributions with parameters $\mu_l$. However, it is possible to sample the edges from other distributions. The edge weights can be sampled from a Poisson law with mean $\mu_l$, or simply be set equal to $\mu_l$, which produces a deterministic HSBM.

Unbalanced hierarchy. In the previous definition of the HSBM, the hierarchy is well balanced. The division factor and the edge sampling parameter are constant at each level and the number of levels is the same for each branch of the hierarchy. This definition can be naturally extended to unbalanced hierarchies. Every block of every level would in this case have its own size and its own probability matrix describing how to connect sub-blocks to each other within this block. For example, the coefficient $P_{kk'}$ of the matrix indicates the probability that the nodes in the $k$th and $k'$th sub-blocks are connected by an edge. Examples of balanced and unbalanced HSBMs are shown in figure 3.1. An alternative formulation of the HSBM is proposed in [27].

Figure 3.1: Balanced HSBM (left), unbalanced HSBM (right)

A priori on hierarchy. The edge weights can take different forms depending on the hierarchical structure. The function $g : l \to \mu_l$ is the a priori on the hierarchical structure of the weights. The a priori is usually taken increasing because vertices in communities at deeper levels are supposed to be more similar. According to the context, the a priori function may increase in different ways:

• Linear a priori: $g(l) = l$. This a priori is adapted to describe networks where edge weights increase with an additive factor:

$$\mu_l = \begin{cases} a_0 & \text{if } l = 0 \\ \mu_{l-1} + a_l & \text{otherwise} \end{cases} \qquad (3.23)$$

where the $a_l$ are taken positive.

• Exponential a priori: $g(l) = e^l$. This a priori is adapted to describe networks where edge weights increase with a multiplicative factor:

$$\mu_l = \begin{cases} m_0 & \text{if } l = 0 \\ \mu_{l-1} m_l & \text{otherwise} \end{cases} \qquad (3.24)$$

where the $m_l$ are taken larger than 1.

These two a priori functions are obviously equivalent and it is possible to pass from a model with the linear a priori to a model with the exponential a priori. However, the parametrization does matter since the additive factors may be more interesting than the multiplicative factors for some purposes, or conversely. Chapter 4

Methods

This chapter presents the new clustering algorithms proposed in this thesis. On the one hand, it introduces a new hierarchical clustering algorithm for graphs based on the modularity score. On the other hand, it proposes several ways of processing the resulting hierarchies in order to rank the best clusters, clusterings and resolutions.

4.1 Hierarchical clustering

The algorithm for hierarchical clustering is agglomerative. It starts from singleton clusters (i.e. each node is in its own cluster) and then recursively merges clusters. At each step of the algorithm, the two closest clusters are merged and the graph is updated accordingly. The full algorithm is given below (algorithm 1):


Algorithm 1 Hierarchical clustering

Require: $G = (V_0, E_0)$ with $V_0 = \{1, ..., n\}$
# Initialisation of the dendrogram
$D = \emptyset$
# Recursive agglomeration
for $t \in [\![1, n-1]\!]$ do
    # Detection of the closest clusters
    $i, j = \arg\min_{i', j' \in V_{t-1},\, i' \neq j'} d(i', j')$
    $D = D \cup \{\{i, j, d(i, j), n_i + n_j\}\}$
    # Updating of the graph
    $V_t = V_{t-1} \setminus \{i, j\} \cup \{n + t\}$
    $p(n + t) = p(i) + p(j)$, $n_{n+t} = n_i + n_j$
    for $u \in \text{neighbors}(i) \cup \text{neighbors}(j)$ do
        $E_t = E_{t-1} \setminus \{\{i, u\}, \{j, u\}\} \cup \{\{n + t, u\}\}$
        $p(n + t, u) = p(i, u) + p(j, u)$
    end for
end for
return $D$
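For concreteness, the following minimal Python translation of algorithm 1 is a sketch only (illustrative names, dense adjacency matrix, quadratic search for the closest pair rather than the efficient scheme described later in this section); it returns the dendrogram as a list of merges $(i, j, d_t, n_i + n_j)$.

import numpy as np

def hierarchical_clustering(adj):
    n = adj.shape[0]
    w = adj.sum()
    # aggregated graph stored as edge and node sampling probabilities
    p_edge = {i: {j: adj[i, j] / w for j in range(n) if j != i and adj[i, j] > 0}
              for i in range(n)}
    p_node = {i: adj[i].sum() / w for i in range(n)}
    size = {i: 1 for i in range(n)}
    alive = set(range(n))
    dendrogram = []
    for t in range(1, n):
        # detection of the closest pair of clusters, d(i, j) = p(i)p(j)/p(i, j)
        best, best_d = None, np.inf
        for i in alive:
            for j, pij in p_edge[i].items():
                if j > i:
                    d = p_node[i] * p_node[j] / pij
                    if d < best_d:
                        best, best_d = (i, j), d
        if best is None:                      # disconnected remainder: arbitrary merge
            i, j = sorted(alive)[:2]
        else:
            i, j = best
        new = n + t                           # label of the new cluster (n + t convention)
        dendrogram.append((i, j, best_d, size[i] + size[j]))
        # updating of the aggregated graph
        p_node[new] = p_node[i] + p_node[j]
        size[new] = size[i] + size[j]
        p_edge[new] = {}
        for u in (set(p_edge[i]) | set(p_edge[j])) - {i, j}:
            puv = p_edge[i].get(u, 0.0) + p_edge[j].get(u, 0.0)
            p_edge[new][u] = puv
            p_edge[u].pop(i, None)
            p_edge[u].pop(j, None)
            p_edge[u][new] = puv
        alive -= {i, j}
        alive.add(new)
    return dendrogram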

Dendrogram. The dendrogram $D$ contains the pairs of nodes merged through the run of the algorithm. All the partitions $P_0, ..., P_{n-1}$ can be recovered by browsing the final dendrogram bottom up. Remark that the partition $P_t$ is composed of $n - t$ clusters, which proves to be particularly helpful if the number of clusters aimed for is known in advance. In addition to the merged nodes, the dendrogram contains the distance $d_t = d(i, j)$ and the number of nodes $n_{n+t} = n_i + n_j$ within the cluster. In the traditional dendrogram representation, each branch is plotted at height $d_t$, which requires the sequence of distances associated with each merge to be non-decreasing. Proposition 3 ensures that the sequence of distances $d_1, ..., d_{n-1}$ is non-decreasing in the present context. Two examples of dendrograms are given in figure 4.1.

Figure 4.1: Two examples of dendrogram

Agglomeration. The graph aggregation step is similar to the aggregation step in the Louvain algorithm. The index of the cluster generated from the merge of $i$ and $j$ at step $t$ is classically set to $n + t$. In this way, the new index is different from the other cluster indices already in the graph. If $u$ is a neighbor of $i$ or $j$, then the aggregated graph will have the edge $\{n + t, u\}$ with weight $\sum_{i' \in i} A_{ui'} + \sum_{j' \in j} A_{uj'}$. This operation also creates a self-edge on the new cluster. This is equivalent to updating the probabilities as presented in algorithm 1 or using the update formula stated in proposition 3.

(a) Before merging of $\{i, j\}$, (b) after merging of $\{i, j\}$

Sliding resolution. The hierarchical algorithm 1 can be seen as a modularity-maximizing algorithm with a sliding resolution by defining $\gamma_t = 1/d_t$. The hierarchical algorithm starts from the partition $P_0 = \{\{1\}, ..., \{n\}\}$ since $P_0$ is optimal for a large enough resolution. The hierarchical algorithm first merges the nodes $i$ and $j$ with the highest similarity coefficient $\gamma_{ij} = \gamma_1$. By definition, $\gamma_{ij}$ is also the largest resolution from which merging nodes $i, j$ increases the modularity (equation (3.4)). Consequently, it is worth merging the nodes $i$ and $j$ from resolution $\gamma_1$. In the same way, the algorithm iteratively computes a sequence of resolutions $\gamma_1, ..., \gamma_{n-1}$ which trigger the merges of pairs of nodes with respect to the modularity score. The resolution parameter thus slides from large values to low values along the non-increasing sequence $\gamma_1, ..., \gamma_{n-1}$. From the resolution perspective, it is natural to set $\gamma_0 = +\infty$ and $\gamma_n = 0$, since the partition $P_0$ with $n$ clusters is optimal for any resolution larger than $\gamma_1$ and the partition $P_{n-1}$ with one single cluster is optimal for any positive resolution lower than $\gamma_{n-1}$.

Clusterings stability. The hierarchical algorithm 1 tries to approximate the optimal clusterings and their stable zones. As a reminder, the optimal partition $P^*$ and its stable zone $[\gamma_{min}, \gamma_{max}]$ verify:

$$\forall \gamma \in [\gamma_{min}, \gamma_{max}], \quad Q_\gamma(P^*) = \max_{P'} Q_\gamma(P')$$

In a similar way, the partitions $P_0, P_1, ..., P_{n-1}$ and the sequence of resolutions $\gamma_0, \gamma_1, ..., \gamma_n$ verify the following property:

Proposition 7. If $P_0, ..., P_{n-1}$ and $[\gamma_1, \gamma_0], ..., [\gamma_n, \gamma_{n-1}]$ are the partitions and resolution ranges returned by algorithm 1, then:

$$\forall \gamma \in [\gamma_{t+1}, \gamma_t], \quad Q_\gamma(P_t) = \max_{P' \in \{P_0, ..., P_{n-1}\}} Q_\gamma(P') \qquad (4.1)$$

Proof. The functions $\gamma \to Q_\gamma(P_t)$ and $\gamma \to Q_\gamma(P_{t-1})$ are linear. Thus, by definition of $\gamma_t$, it is clear that $Q_\gamma(P_t) > Q_\gamma(P_{t-1})$ if $\gamma < \gamma_t$ and $Q_\gamma(P_t) < Q_\gamma(P_{t-1})$ if $\gamma > \gamma_t$. Since the relation is true for all $t$ and the sequence $\gamma_0, ..., \gamma_{n-1}$ is non-increasing, it implies that:

$$\forall s < t, \forall \gamma \in [0, \gamma_t], \quad Q_\gamma(P_t) \geq Q_\gamma(P_s)$$

In a similar way, we have:

$$\forall t < s, \forall \gamma \in [\gamma_{t+1}, \infty], \quad Q_\gamma(P_t) \geq Q_\gamma(P_s)$$

This property shows that, by restricting the possible partitions to $P_0, ..., P_{n-1}$, the hierarchical algorithm recovers the optimal partitions and computes exactly their stable zones $[\gamma_1, \gamma_0], [\gamma_2, \gamma_1], ..., [\gamma_n, \gamma_{n-1}]$. Note that if $P_0, ..., P_{n-1}$ contains all the true optimal partitions, then $[\gamma_1, \gamma_0], [\gamma_2, \gamma_1], ..., [\gamma_n, \gamma_{n-1}]$ are exactly the true stable zones. The algorithm is based on the belief that if the partition $P_t$ is almost optimal on $[\gamma_{t+1}, \gamma_t]$ then $P_{t+1}$ is almost optimal on $[\gamma_{t+2}, \gamma_{t+1}]$.

Clusters stability. In the same way, the hierarchical algorithm also approximates the stable resolution range of clusters. As a reminder, the stable range $[\gamma_{min}, \gamma_{max}]$ of a cluster $C$ satisfies:

$$\gamma_{min} = \max_{C' \subset V \setminus C} \gamma_{CC'}$$

$$\gamma_{max} = \min_{C', C'' \subset C} \gamma_{C'C''}$$

In the case of the clusters computed by the hierarchical algorithm, this gives:

Proposition 8. If $C$ is a cluster created by algorithm 1 at step $t$ and merged with another cluster at step $s$, then:

$$\gamma_s = \max_{C' \in P_{s'},\ s' \geq s-1} \gamma_{CC'} \qquad (4.2)$$

$$\gamma_t = \min_{C', C'' \subset C,\ C', C'' \in P_t} \gamma_{C'C''} \qquad (4.3)$$

Proof. The first result derives from the definition of $\gamma_s$ and the reducibility property (proposition 2):

$$\gamma_{C, C'' \cup C'''} \leq \max(\gamma_{CC''}, \gamma_{CC'''}) \leq \max_{C' \in P_{s-1}} \gamma_{CC'} = \gamma_s$$

Since $C', C'' \subset C$ and belong to $P_t$, there exists $t' < t$ such that $\gamma_{C'C''} = \gamma_{t'}$. The second result then simply derives from the fact that $\gamma_0 \geq ... \geq \gamma_{n-1}$.

The stable range computed by the hierarchical algorithm is then an optimistic approximation of the true stability of C.

Connected components. It is worth analyzing the run of the algorithm on a disconnected graph. If the graph $G$ consists of $k$ connected components, then the partition $P_{n-k}$ will be composed of these $k$ connected components, whose respective distances are infinite; the $k - 1$ last merges can then be done in an arbitrary order. Moreover, the hierarchies associated with these connected components are independent of one another (i.e., the algorithm successively applied to the corresponding subgraphs would produce exactly the same hierarchy). Similarly, we expect the clusterings of weakly connected subgraphs to be approximately independent of one another. This is not the case for the Louvain algorithm, whose clustering depends on the whole graph through the total weight $w$, a shortcoming related to the resolution limit of modularity.

Efficient implementation. The implementation presented in algorithm 1 is not optimal. By the reducibility property of the distance, the algorithm can be implemented through the Nearest-Neighbor Chain scheme [28]. Starting from an arbitrary node, a chain of nearest neighbors is formed. Whenever two nodes of the chain are mutual nearest neighbors, these two nodes are merged and the chain is updated recursively, until the initial node is eventually merged. This scheme reduces the search for a global minimum (the pair of nodes $i, j$ that minimizes $d(i, j)$) to that of a local minimum (any pair of nodes $i, j$ such that $d(i, j) = \min_{j'} d(i, j') = \min_{i'} d(i', j)$). This scheme significantly speeds up the algorithm while returning exactly the same hierarchy. It only requires a consistent tie-breaking rule for equal distances (e.g., any node at equal distance from $i$ and $j$ is considered as closer to $i$ if and only if $i < j$).

A Python implementation of the algorithm, available in the on-line supplementary material 1, is based on this Nearest-Neighbor Chain scheme. Observe that the space complexity of the algorithm is in $O(m)$. The time complexity is, in the worst case (complete graph), $O(n^2)$. In practice, the graphs are generally sparse, which makes the algorithm much faster.
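The skeleton below is a hedged, self-contained sketch of how the Nearest-Neighbor Chain scheme can drive the same merges (it is not the released implementation; all names are illustrative and the tie-breaking rule is simplified to a comparison on labels).

import numpy as np

def hierarchical_clustering_nn_chain(adj):
    n = adj.shape[0]
    w = adj.sum()
    p_edge = {i: {j: adj[i, j] / w for j in range(n) if j != i and adj[i, j] > 0}
              for i in range(n)}
    p_node = {i: adj[i].sum() / w for i in range(n)}
    size = {i: 1 for i in range(n)}
    alive = set(range(n))
    dendrogram = []
    next_label = n + 1                        # labels follow the n + t convention

    def dist(a, b):
        pab = p_edge[a].get(b, 0.0)
        return np.inf if pab == 0 else p_node[a] * p_node[b] / pab

    chain = []
    while len(alive) > 1:
        if not chain:
            chain.append(min(alive))          # start a new chain from any cluster
        a = chain[-1]
        # nearest neighbor of a, with a consistent tie-breaking rule on labels
        b = min((c for c in alive if c != a), key=lambda c: (dist(a, c), c))
        if len(chain) >= 2 and b == chain[-2]:
            # a and b are mutual nearest neighbors: merge them
            chain.pop(); chain.pop()
            d = dist(a, b)
            new = next_label
            next_label += 1
            p_node[new] = p_node[a] + p_node[b]
            size[new] = size[a] + size[b]
            p_edge[new] = {}
            for u in (set(p_edge[a]) | set(p_edge[b])) - {a, b}:
                puv = p_edge[a].get(u, 0.0) + p_edge[b].get(u, 0.0)
                p_edge[new][u] = puv
                p_edge[u].pop(a, None)
                p_edge[u].pop(b, None)
                p_edge[u][new] = puv
            alive -= {a, b}
            alive.add(new)
            dendrogram.append((a, b, d, size[new]))
        else:
            chain.append(b)
    # merges may be recorded out of distance order; sort by d to recover
    # the non-decreasing ordering of algorithm 1 if needed
    return dendrogram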

4.2 Rankings from the hierarchy

The hierarchical algorithm returns a dendrogram that has to be processed in order to identify the relevant clusters and partitions. This section proposes methods aiming at extracting the top-ranked clusters from the dendrogram through multiple kinds of rankings. Although these methods have been designed for algorithm 1, they can be adapted to other hierarchical algorithms.

4.2.1 Clusters

The dendrogram is composed of many branches which induce clusters. It contains exactly $2n - 1$ clusters that can be labeled chronologically: $n$ clusters composed of one of the original nodes and $n - 1$ clusters created by a merge at each iteration of the algorithm. Thus, the cluster created at time $t$ is labeled $k = n + t$. The variable $s$ denotes the time at which it is merged with another cluster. The quality of a cluster $C$ is then assessed with the score:

$$\text{sharp-score}(C) = f(\gamma_t) - f(\gamma_s) \qquad (4.4)$$

$$\text{sharp-score}(C) = f(1/d_t) - f(1/d_s) \qquad (4.5)$$

The function $f$ is an increasing function which encodes the a priori on the hierarchy. Typical choices for this function are $f(\gamma) = \gamma$ or $f(\gamma) = \log(\gamma)$. One bottom-up pass over the dendrogram is then sufficient to rank all the clusters.
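A minimal Python sketch of this pass is given below (illustrative names; singleton clusters are skipped, the root, which is never merged, gets $\gamma_s = 0$, and the default a priori is the linear one $f(\gamma) = \gamma$).

import numpy as np

def sharp_scores(dendrogram, n, f=lambda g: g):
    # dendrogram: list of merges (i, j, d_t, size) as returned by algorithm 1,
    # the cluster created at step t (1-indexed) being labeled n + t.
    gamma_created = {}                       # label -> gamma_t at creation
    gamma_merged = {}                        # label -> gamma_s when merged again
    for t, (i, j, d, _) in enumerate(dendrogram, start=1):
        gamma = 0.0 if np.isinf(d) else 1.0 / d
        gamma_created[n + t] = gamma
        gamma_merged[i] = gamma
        gamma_merged[j] = gamma
    scores = {}
    for label, gamma_t in gamma_created.items():
        gamma_s = gamma_merged.get(label, 0.0)   # the root is never merged
        scores[label] = f(gamma_t) - f(gamma_s)  # equation (4.4)
    return scores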

Cluster isolation and compactness. The quantity $f(1/d_t) - f(1/d_s)$ measures the gap between $d_s$ and $d_t$. By definition, the distance between a cluster and its nearest neighbor is $d_s$. It indicates how isolated the cluster is from the others. On the other hand, the largest

1https://github.com/Charpenb/Graph_Clustering

distance between two nodes which have been merged within a cluster is $d_t$. It indicates how compact the nodes within a cluster are. According to this score, a good cluster is a cluster which is far from the others and compact, which sounds reasonable. It is important to compare the two distances $d_s$ and $d_t$ since it gives an idea of the scaling difference between the inside and the outside of the cluster. If $d_s$ and $d_t$ were both large (resp. small), the cluster would be isolated (resp. compact) but the nodes inside (resp. outside) the cluster would be isolated (resp. compact) as well. The sharp-score thus evaluates the sharpness of the border of the cluster, hence its name. The conception of this score is of course motivated by the theory developed in section 3.2.

Choice of $f$. The choice of the function $f$ highly depends on the a priori $g$ on the hierarchical structure of the weights. In general, a relevant choice for detecting levels is to consider the function $f = g^{-1}$, which fully defines the sharp-score. To illustrate this property, let us consider the deterministic HSBM with $L$ levels. The division factor and the edge weights at level $l$ are still denoted $k_l$ and $\mu_l$. Under this model, all the nodes have the same connections at each level: $k_L - 1$ edges with weight $\mu_L$ at level $L$, $(k_{L-1} - 1) k_L$ edges with weight $\mu_{L-1}$ at level $L - 1$, etc. Hence, all the nodes have the same degree $d$ and the similarity coefficient of the edge $i, j$ at level $l$ is easy to compute:

$$\gamma_{ij} = \frac{w}{d^2} \mu_l \qquad (4.6)$$

The hierarchical algorithm 1 applied to the HSBM thus first merges a pair of nodes $i, j$ in a same block at the deepest level. Their similarity coefficient is $\gamma_{ij} = \frac{w}{d^2} \mu_L$. This merge creates a new cluster labeled $n + 1$ and the new similarity coefficients with a node $k$ can be computed from equation (3.7):

$$\gamma_{k, n+1} = \frac{1}{2} \gamma_{ki} + \frac{1}{2} \gamma_{kj} = \gamma_{ki} \quad \text{because } \gamma_{ki} = \gamma_{kj}$$

Thus, all nodes are first gathered in blocks at level $L$. The algorithm then merges the pairs of blocks $k, k'$ with similarity coefficient $\gamma_{kk'} = \frac{w}{d^2} \mu_{L-1}$. Finally, it recovers all the levels of the HSBM. If the edge weights fit exactly with the a priori ($g(l) = \mu_l$), the sharp-score of a

cluster $C$ created at time step $t$ is:

$$f(\gamma_t) - f(\gamma_s) = \begin{cases} 1 & \text{if } C \text{ is a block of the original HSBM} \\ 0 & \text{otherwise} \end{cases} \qquad (4.7)$$

An interesting remark is that, if μ_l < g(l) (resp. μ_l > g(l)), then f(γ_t) − f(γ_s) < 1 (resp. f(γ_t) − f(γ_s) > 1). It is therefore meaningful to rank the clusters according to their sharp-score, since a larger score means a cluster sharper than expected. The two typical a priori lead to the following scores:

• The linear a priori (g(l) = l, f(γ) = γ) is adapted to edge weights with an additive factor (3.23). The sharp-score is in this case:
\[
f(\gamma_t) - f(\gamma_s) =
\begin{cases}
a_l & \text{if } C \text{ is a block at level } l\\
0 & \text{otherwise}
\end{cases} \tag{4.8}
\]

• The exponential a priori (g(l) = e^l, f(γ) = log(γ)) is adapted to edge weights with a multiplicative factor (3.24). The sharp-score is in this case:
\[
f(\gamma_t) - f(\gamma_s) =
\begin{cases}
\log(m_l) & \text{if } C \text{ is a block at level } l\\
0 & \text{otherwise}
\end{cases} \tag{4.9}
\]

4.2.2 Homogeneous and heterogeneous clusterings

Apart from the clusters, the dendrogram also contains many clusterings. This paragraph describes two ways to extract clusterings from the dendrogram. The first method consists in cutting the dendrogram homogeneously at one time step t, which returns the partition P_t. The second method consists in cutting the dendrogram heterogeneously at several time steps t_0, ..., t_k and taking the clusters C_{t_0}, ..., C_{t_k} created at these time steps to form a partition of the whole graph. In both cases, the quality of a partition is assessed by the weighted mean of the scores of its clusters:
\[
\text{sharp-index}(P) = \frac{1}{n} \sum_{C \in P} n_C \times \text{sharp-score}(C) \tag{4.10}
\]

This index quantifies the sharpness of a clustering by looking at the sharpness of each of its clusters. The weighted mean makes it less sensitive to outliers than the Dunn index [14]. Moreover, it offers an alternative to the Davies-Bouldin index [13] that is not based on centroids.
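As a small illustration of equation (4.10), the sharp-index of a partition can be computed directly from the cluster scores and sizes; the helper below is a sketch, where partition is a hypothetical list of cluster labels and scores and sizes are arrays indexed by cluster label (for instance the output of sharp_scores above and sizes read from the dendrogram).

def sharp_index(partition, scores, sizes, n):
    # Weighted mean of the sharp-scores of the clusters of a partition (sketch).
    return sum(sizes[c] * scores[c] for c in partition) / n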

Homogeneous clusterings. Homogeneous slicing ranks all the partitions P_0, ..., P_{n−1}. Recall that each partition P_t is approximately optimal on its stable zone [γ_{t+1}, γ_t]. Thus, all the clusters of P_t are observed at the same scale, defined by the resolution range [γ_{t+1}, γ_t], and ranking partitions is equivalent to ranking scales (Figure 4.3). Nevertheless, this type of partition tends to contain clusters of homogeneous size, in the same way as modularity maximization algorithms. Even if homogeneous cluster sizes are one aspect of the resolution limit, observing clusters at the same scale is also valuable for a partition. This drawback and this advantage are inextricable, since the homogeneous size problem cannot be overcome by optimizing a partition at only one given resolution [25]. It is important to remark that the top 1 clustering P_t is not necessarily the clustering with the largest stable zone [γ_{t+1}, γ_t]. A large stable zone requires that all the clusters have a high sharp-score; the partition index would in that case be min_{C∈P}(sharp-score(C)), which is much more sensitive to a single bad cluster.


Figure 4.3: Representation of a homogeneous cut of a 10-node dendrogram. The partition represented by the dashed line has two clusters observed at the same scale.

The ranking of homogeneous partitions can be done in one bottom-up pass over the dendrogram (algorithm 2) plus one sort of the partitions with respect to their scores. The dendrogram traversal is linear, in O(n), whereas the sorting is in O(n log(n)). The final complexity of the full ranking is then O(n log(n)). Remark that the time complexity of finding only the top 1 partition is O(n).

Algorithm 2 Best homogeneous clustering
Require: D dendrogram, c_scores cluster scores, c_sizes cluster sizes
  # Initialization of the partition scores
  p_scores[0] = 0
  for t ∈ ⟦1, n − 1⟧ do
    # Computation of partition scores
    i = D[t, 0]
    j = D[t, 1]
    p_scores[t] = p_scores[t − 1] − c_sizes[i] × c_scores[i] − c_sizes[j] × c_scores[j] + c_sizes[n + t] × c_scores[n + t]
  end for
  return argmax(p_scores)
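A minimal Python transcription of Algorithm 2 is sketched below, under the same dendrogram convention as above (scipy-linkage rows, 0-indexed nodes); c_scores and c_sizes are assumed to be arrays indexed by cluster label (leaves have size 1, internal sizes can be read from the fourth column of the dendrogram). It illustrates the incremental update, it is not the reference implementation.

import numpy as np

def best_homogeneous_clustering(D, c_scores, c_sizes):
    # Sketch of Algorithm 2: the score of P_t is obtained from P_{t-1} by
    # removing the two merged children and adding the new cluster
    # (the constant 1/n factor is dropped since it does not change the argmax).
    n = D.shape[0] + 1
    p_scores = np.zeros(n)                      # p_scores[t] ~ sharp-index(P_t)
    for t in range(1, n):
        i, j = int(D[t - 1, 0]), int(D[t - 1, 1])   # clusters merged at step t
        k = n + t - 1                               # label of the new cluster
        p_scores[t] = (p_scores[t - 1]
                       - c_sizes[i] * c_scores[i]
                       - c_sizes[j] * c_scores[j]
                       + c_sizes[k] * c_scores[k])
    return int(np.argmax(p_scores)), p_scores

The returned index t identifies the homogeneous cut P_t with the highest (unnormalized) sharp-index.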

Heterogeneous clusterings. Heterogeneous slicing is less constrained than homogeneous slicing. In contrast with the homogeneous method, clusters can be extracted at different levels of the hierarchy. As a consequence, each cluster is related to its own scale, defined by [γ_s, γ_t] (Figure 4.4). This method can thus truly be qualified as a multi-resolution method, since the top 1 partition is related to several different resolutions. In practice, this method overcomes the resolution limit. On the other hand, the clusters of the top ranked partitions are not guaranteed to be observed at the same scale.


Figure 4.4: Representation of a heterogeneous cut of a 10-node dendrogram. The partition represented by the dashed lines has three clusters, each observed at its own scale. In this example, scale 1 < scale 2 = scale 3.

There are many more possible heterogeneous partitions than homogeneous partitions, which makes the full ranking task impossible in reasonable time. However, it is still possible to find the top 1 heterogeneous clustering in one pass. During this pass, the score of each cluster is compared to the best score of the heterogeneous cuts of the sub-dendrogram induced by this cluster (algorithm 3). In order to obtain a ranking of partitions with different clusters, the clusters present in the first partition are not evaluated during the second pass. Thus, two partitions of the ranking have no clusters in common and finding the k first partitions takes k passes.

Algorithm 3 Sharpest heterogeneous clustering
Require: D dendrogram, c_scores cluster scores, c_sizes cluster sizes
  # Initialization of partitions and scores
  for t ∈ ⟦1, n⟧ do
    best_p[t] = [[t]]
    best_p_scores[t] = 0
  end for
  for t ∈ ⟦1, n − 1⟧ do
    # Computation of partitions and scores
    i = D[t, 0]
    j = D[t, 1]
    if best_p_scores[i] + best_p_scores[j] < c_sizes[n + t] × c_scores[n + t] then
      best_p_scores[n + t] = c_sizes[n + t] × c_scores[n + t]
      best_p[n + t] = [[n + t]]
    else
      best_p_scores[n + t] = best_p_scores[i] + best_p_scores[j]
      best_p[n + t] = best_p[i] + best_p[j]
    end if
  end for
  return best_p[2n − 1]
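Under the same convention (0-indexed nodes, scipy-linkage rows), a Python sketch of Algorithm 3 follows; the returned list contains the labels of the clusters forming the best heterogeneous cut, the root being labeled 2n − 2 in this indexing.

def best_heterogeneous_clustering(D, c_scores, c_sizes):
    # Sketch of Algorithm 3: each cluster is compared with the best cut of
    # the sub-dendrogram it induces, in a single bottom-up pass.
    n = D.shape[0] + 1
    best_p = {k: [k] for k in range(n)}           # leaves: singleton partitions
    best_scores = {k: 0.0 for k in range(n)}
    for t in range(n - 1):
        i, j = int(D[t, 0]), int(D[t, 1])
        k = n + t                                  # cluster created at step t
        merged_score = c_sizes[k] * c_scores[k]
        if best_scores[i] + best_scores[j] < merged_score:
            best_scores[k] = merged_score
            best_p[k] = [k]                        # keep the cluster itself
        else:
            best_scores[k] = best_scores[i] + best_scores[j]
            best_p[k] = best_p[i] + best_p[j]      # keep the best sub-cuts
    return best_p[2 * n - 2]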

4.2.3 Resolutions

In the case of the hierarchical algorithm 1, the dendrogram also contains the resolutions γ_0, ..., γ_n. A resolution range [γ_{t+1}, γ_t] is associated to each partition P_t. Instead of returning the partition P_t, an alternative is to run the Louvain algorithm (or any other modularity maximization algorithm) at a resolution in [γ_{t+1}, γ_t]. A good choice is to take γ*_t = mean(γ_t, γ_{t+1}). The resolution γ*_t can then be ranked like the partition P_t in algorithm 2.

Scale ranking. A risk with the homogeneous ranking is that P_t might be a bad approximation of the true optimal partition on its resolution range [γ_{t+1}, γ_t], since the partitions are constrained to form a hierarchy. However, the homogeneous ranking orders more than the partitions: it orders the resolution ranges as well. This order can be used to perform a ranking of the resolutions. The resolution ranking then describes the top ranked scales at which the graph can be observed.

Resolution estimation. The resolution ranges [γ_{t+1}, γ_t] are approximations of the optimal stable zones. The choice of γ in [γ_{t+1}, γ_t] is important for finding back the true optimal partition with the Louvain algorithm. The two endpoints γ_t and γ_{t+1} are bad choices, since other partitions such as P_{t−1} and P_{t+1} might also be optimal at these resolutions. In order to be as far as possible from these two values, a natural choice is to take the mean of γ_t and γ_{t+1}. The choice of the mean matters and depends on the a priori on the hierarchy. If the a priori is linear (3.23), the arithmetic mean is adapted. However, if the a priori is exponential (3.24), the geometric mean seems more adapted to the log scale (log(√(ab)) = (log(a) + log(b))/2).
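For instance, with the exponential a priori, candidate resolutions can be obtained as the geometric means of consecutive values γ_t read from the dendrogram; a minimal sketch (swap in the arithmetic mean for a linear a priori):

import numpy as np

def resolution_estimates(D):
    # Sketch: gamma_t = 1 / d_t for each merge of the dendrogram, and the
    # geometric mean of consecutive resolutions as candidate resolutions
    # to feed a modularity maximization algorithm such as Louvain.
    gammas = 1.0 / D[:, 2]                  # decreasing along the merges
    return np.sqrt(gammas[1:] * gammas[:-1])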

Number of clusters. It is also interesting to remark that the resolution ranges and the number of clusters are closely linked. The partition P_t always has n − t clusters and is stable on [γ_{t+1}, γ_t]. Each top resolution range is therefore associated to a number of clusters. This number can be seen as an estimation of the number of true clusters in the graph.

Chapter 5

Results

This chapter aims at testing the performance of the hierarchical algorithm 1 and of the four different rankings (Clusters, Homogeneous, Heterogeneous, Resolution). The hierarchical algorithm is denoted Paris (Pairwise Agglomeration using Resolution Incremental Sliding) and can be combined with any ranking. Throughout the chapter, the combination of Paris and the cluster ranking algorithm is denoted Paris+Clusters, the combinations of Paris with the homogeneous and heterogeneous clustering ranking algorithms are denoted Paris+Homogeneous and Paris+Heterogeneous, and the combination of Paris with the resolution ranking algorithm and Louvain is denoted Paris+Louvain. Finally, the number of clusters obtained by the Paris+Louvain algorithm can also be used as an input to the spectral clustering algorithm; the combination of the three algorithms is denoted Paris+Louvain+Spectral. All these algorithms are compared to the Louvain algorithm, which is one of the state-of-the-art algorithms for modularity maximization, and to the spectral clustering algorithm, which is widely used for clustering in general. No other algorithms are added, to avoid overcrowding the figures. The tests are first performed on synthetic datasets (SBM and HSBM), which allows a quantitative assessment of the quality of the hierarchy proposed by Paris on graphs built from different settings. Tests on real datasets are then presented to assess the scalability and the capacity to generalize to real data. Finally, the algorithms are also compared with scikit-learn algorithms on toy vector datasets.


5.1 Hierarchical Stochastic Block Model

The objective of this section is to test the behavior of the algorithms on hierarchical graphs. For this purpose, the algorithms are run on an HSBM with Poisson distributed edge weights. The first experiment gives a first insight into the HSBM through a toy example: it describes the behavior of the different processings of the Paris hierarchy on a simple HSBM. The second experiment proposes a first evaluation of the quality of the partitions contained in the Paris hierarchy by comparing the modularity scores of the Louvain and Paris results. In addition, this experiment evaluates the sensitivity of Paris with respect to the sharpness of the levels (i.e. the multiplicative factor of the HSBM), which is an important parameter to test. Creating graphs with larger division factors or numbers of levels is very computationally demanding, which makes testing on other HSBMs harder, although it would obviously be interesting.

Experiment 1. In the first experiment, the HSBM is balanced and generated from the division factors k = [1, 2, 2, 2, 100] and the edge weights μ_4 = 4., μ_3 = .05 × μ_4, μ_2 = .1 × μ_3, μ_1 = .15 × μ_2, μ_0 = 0. It contains 800 nodes. The a priori function and the mean used are the exponential function and the harmonic mean. This section does not compare the Paris hierarchy with other hierarchical algorithms since the literature does not provide benchmarks for such an analysis. The results of Paris+Clusters are plotted in Figure A.1 in the appendix. Paris+Clusters detects the clusters of each level of the balanced HSBM. The 2 clusters of the highest level are ranked first, then the 4 clusters of the second level, before the 8 clusters of the last natural level. This order is consistent with the multiplicative decay of the weights (.05, .1 and .15) at the three levels. All the other clusters are ranked after these. The results of Paris+Homogeneous are plotted in Figure A.2 in the appendix. It is interesting to remark that the three natural levels of the balanced HSBM are detected and ranked at positions 1, 3 and 7. This order is again consistent with the multiplicative decay of the weights. In addition to these levels, intermediary partitions mixing clusters of two successive levels appear in the ranking. It is relevant to find these clusterings since they also have sharp clusters.

The results of Paris+Heterogeneous are plotted in Figure A.3 in the appendix. It detects the three levels of the balanced HSBM in the same coherent order. There are no intermediary partitions, unlike for Paris+Homogeneous, since such partitions are excluded at each pass. The results of Paris+Louvain are plotted in Figure A.4 in the appendix. The ranking of Paris+Louvain is the same as the ranking proposed by Paris+Homogeneous. It confirms that Paris+Homogeneous gives a good approximation of the optimal partitions. The top resolutions in the ranking given by Paris are therefore good estimations of relevant resolutions for this graph.

Experiment 2. In order to assess quantitatively the quality of the partitions and resolutions proposed by Paris, Paris and Louvain are run on four HSBMs. The HSBMs contain 200 nodes and are balanced. They are generated from the division factors k = [1, 2, 2, 50] and edge weights of the form μ_{l−1} = decay × μ_l, where the decay takes the values .5, .6, .7 and .8. Figures 5.1, 5.2, 5.3 and 5.4 present the number of clusters and the modularity with respect to the resolution. Concerning the modularity, the goal is to have a modularity curve as high as possible. The red points are obtained by running the Louvain algorithm for each resolution γ. In contrast, the blue line is obtained after only one run of Paris. Paris starts with the highest γ value on the right of the plots and moves step by step towards the left, decreasing γ, increasing the modularity Q_γ and decreasing the number of clusters. The black vertical lines indicate the resolutions γ_t computed during the run of the algorithm. These resolutions delimit the stable zones of the Paris partitions. For decay = .5 and decay = .6, Paris and Louvain seem to detect the two levels of the HSBMs. Indeed, the figures show that the number of clusters clearly has two steps, at 2 and 4 clusters. Moreover, the modularity curves are composed of two straight lines. This might suggest that the clusterings computed by the two algorithms are very close to the optimal ones in view of property 5. Remark that the x axis is kept linear in order to observe this property. When the decay increases (.7, .8), the Louvain partitions still seem optimal since the modularity of its partitions still looks piecewise affine. However, the modularity of the partitions proposed by Paris becomes a bit lower than for Louvain, even if the Paris partitions remain fairly accurate. Another important point is that Paris still detects relevant resolutions. The largest gaps between two successive γ_t are indeed located around the stable zones found by Louvain. Thus, even for large decay, it is possible to retrieve the two levels by running Louvain with the resolutions calculated by Paris. These resolutions are not necessarily equal to 1.

Conclusions. The two experiments performed in this section demonstrate the good performance of the Paris algorithms, which provide relevant hierarchies for HSBMs together with relevant rankings of clusterings and clusters:

• Experiment 1 shows that the four Paris algorithms detect relevant partitions in HSBMs and rank them well with respect to the sharpness of their clusters.

• Experiment 2 shows that Paris provides a good approximation of the optimal partitions at any resolution in only one run. It approximates at the same time the most stable resolution ranges.

5.2 Stochastic Block Model

This section aims at assessing the quality of the top 1 partitions in the rankings proposed by the different processings of the Paris hierarchy. These clusterings are compared to the Louvain algorithm and to spectral clustering with the number of clusters set to 25, 50 and 75. Six experiments are performed on SBMs with unitary weights to test the robustness of the algorithms. The design of the experiments is inspired by the article [24] and aims at testing the sensitivity of the algorithms with respect to the most natural parameters on a subset of models. As for the HSBM, it is not possible to randomize over all possible SBMs due to computational limits. Instead, the experiments focus on natural and easy-to-understand parameters and vary them as long as it is computationally tractable. The results of the algorithms are assessed with the Adjusted Mutual Information (AMI), which is a common choice to quantify the similarity between two partitions in clustering [46].


Figure 5.1: Number of clusters and modularity Q_γ of the partitions returned by Paris (1 blue line = 1 run) and Louvain (1 red point = 1 run) with respect to the resolution γ on 2-level HSBMs. The edge weights of the HSBMs have a decay factor equal to .5 from one level to the next. Black vertical lines indicate the resolutions γ_t computed by Paris. Paris starts with large resolutions (right of the plots) and moves step by step towards smaller resolutions (left of the plots).


Figure 5.2: Number of clusters and modularity Q_γ of the partitions returned by Paris (1 blue line = 1 run) and Louvain (1 red point = 1 run) with respect to the resolution γ on 2-level HSBMs. The edge weights of the HSBMs have a decay factor equal to .6 from one level to the next. Black vertical lines indicate the resolutions γ_t computed by Paris. Paris starts with large resolutions (right of the plots) and moves step by step towards smaller resolutions (left of the plots).


Figure 5.3: Number of clusters and modularity Q_γ of the partitions returned by Paris (1 blue line = 1 run) and Louvain (1 red point = 1 run) with respect to the resolution γ on 2-level HSBMs. The edge weights of the HSBMs have a decay factor equal to .7 from one level to the next. Black vertical lines indicate the resolutions γ_t computed by Paris. Paris starts with large resolutions (right of the plots) and moves step by step towards smaller resolutions (left of the plots).


Figure 5.4: Number of clusters and modularity Q_γ of the partitions returned by Paris (1 blue line = 1 run) and Louvain (1 red point = 1 run) with respect to the resolution γ on 2-level HSBMs. The edge weights of the HSBMs have a decay factor equal to .8 from one level to the next. Black vertical lines indicate the resolutions γ_t computed by Paris. Paris starts with large resolutions (right of the plots) and moves step by step towards smaller resolutions (left of the plots).

Let us first recall the definition of the Mutual Information between two partitions P_1 = {C_1, ..., C_k} and P_2 = {G_1, ..., G_l}:

\[
\mathrm{MI}(P_1, P_2) = \sum_{i=1}^{|P_1|} \sum_{j=1}^{|P_2|} p(i, j) \log \frac{p(i, j)}{p_1(i)\, p_2(j)}
\]

with
\[
p(i, j) = \frac{|C_i \cap G_j|}{n}, \quad p_1(i) = \frac{|C_i|}{n}, \quad p_2(j) = \frac{|G_j|}{n}
\]
The AMI is a corrected version of the MI shared by the true and the predicted partitions. It tries not to overestimate a high mutual information obtained by chance. For this purpose, it compares the MI of the partitions to the expected MI under a hypergeometric model of randomness:

\[
\mathrm{AMI}(P_1, P_2) = \frac{\mathrm{MI}(P_1, P_2) - \mathbb{E}[\mathrm{MI}(P_1, P_2)]}{\max(H(P_1), H(P_2)) - \mathbb{E}[\mathrm{MI}(P_1, P_2)]}
\]

The AMI is equal to 1 when P_1 and P_2 are identical, since the MI is then equal to the entropy of the partitions. Conversely, the AMI is equal to 0 when MI(P_1, P_2) is equal to the expected value due to chance. In addition to the AMI, the number of detected clusters is also plotted. All the results are averaged over 10 samples for each SBM setting and plotted as continuous lines. The variance of each sequence of 10 results is represented by the shaded zones. The a priori function and the mean used are the exponential a priori and the harmonic mean. In this section, the Paris algorithms are only compared with the classic version of the Louvain algorithm (γ = 1) for two reasons. The first reason is that the Louvain algorithm is the state of the art for community detection in networks. The second reason is that the Paris algorithms can be seen as improved versions of Louvain.
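In practice, the AMI between a ground-truth partition and a predicted partition can be computed directly from node label vectors, for instance with scikit-learn; the label arrays below are hypothetical example values.

from sklearn.metrics import adjusted_mutual_info_score

# Node-indexed cluster labels (hypothetical example values)
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]
print(adjusted_mutual_info_score(labels_true, labels_pred))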

Experiment 3. In Figure 5.5, the number of blocks ranges from 10 to 100. The block size is fixed to 10 and the average internal and external degrees are 5 and 1. Paris+Homogeneous, Paris+Heterogeneous and Paris+Louvain resist well to variations in the number of blocks; they also detect the right number of blocks. In contrast, the performance of Louvain with the resolution fixed to 1 worsens as the number of blocks increases. In the same manner, the spectral clustering algorithm only performs well when the expected number of clusters is close to the true number of blocks.

Figure 5.5: AMI (top) and number of clusters detected (bottom) by the algorithms for SBMs with different numbers of blocks. The number of blocks ranges from 10 to 100.

Experiment 4. In Figure 5.6, the average external degree ranges from 1 to 20 while the average internal degree is equal to 5. The number of blocks is fixed to 50 and the block size is set to 10. Paris+Homogeneous and Paris+Heterogeneous have good performance as long as the average external degree is lower than 7.5, and tend to overestimate the number of blocks for larger average external degrees. Louvain tends to underestimate the number of blocks in general, even if it also has correct performance up to an average external degree of 7.5. The best results are obtained by Paris+Louvain and Paris+Louvain+Spectral, for which the AMI and the estimated number of blocks are very close to the ground truth. The performance of spectral clustering is very dependent on the expected number of clusters given as input.

Experiment 5. In Figure 5.7, the block size ranges from 10 to 50. The number of blocks is fixed to 50 and the average internal and external degrees are 5 and 1. The only algorithm that appears to depend on the block size is Louvain. The three other algorithms appear robust to variations of the block size.

Experiment 6. In Figure 5.8, the block sizes are sampled from the range [10, 100] with a power law distribution whose parameter ranges from 1 to 3. The number of blocks is fixed to 50 and the internal and external edge probabilities are .5 and .01. This experiment is inspired by real networks, where community sizes seem to follow a power law distribution with parameter in [1, 3] [24]. All the algorithms derived from Paris have good performance, although Paris+Louvain has the best results. Moreover, Paris+Heterogeneous slightly overestimates the number of clusters. As regards Louvain, it still underestimates the number of clusters.

Experiment 7. In Figure 5.9, the graphs contain two types of blocks: large blocks and small blocks. The size ratio between large and small blocks ranges from 1 to 10. The total graph size is 600, with half of the nodes in large blocks of size 100. The internal and external edge probabilities are .5 and .01. Paris+Homogeneous, Paris+Louvain and Paris+Louvain+Spectral are more robust than Louvain to differences in the size ratio. As regards Paris+Heterogeneous, it is much more robust to differences in the size ratio than all the other algorithms.

Figure 5.6: AMI (top) and number of clusters detected (bottom) by the algorithms for SBMs with different ratios between external and internal degree. The average external degree ranges from 1 to 20 while the average internal degree is 5.

Figure 5.7: AMI (top) and number of clusters detected (bottom) by the algorithms for SBMs with different block sizes. The size of the blocks ranges from 10 to 100.

Figure 5.8: AMI (top) and number of clusters detected (bottom) by the algorithms for SBMs with a power law block size distribution. The power law parameter ranges from 1 to 3 and block sizes are between 10 and 100.

Again, the performance of classic spectral clustering highly depends on the input parameter (expected number of clusters).

Experiment 8. In Table 5.1, the experiments are performed on Erdős–Rényi graphs. These experiments test in practice the complexity of the algorithms with respect to the size and the density of the graphs. The impact of the graph size on the running time is evaluated by varying the number of nodes from 100 to 1000 with the density fixed to .1. The impact of the graph density on the running time is evaluated by varying the probability for an edge to exist from .01 to .99 with the number of nodes fixed to 500. As the table shows, Louvain and Paris have similar running times and are faster than the spectral clustering algorithms. The running times of the different hierarchy processings are not measured since they are negligible compared to the computation of the Paris hierarchy.

Conclusions. In view of all these experiments, the different versions of Paris bring significant benefits compared to the original version of Louvain:

• The three versions of Paris seem to provide a good estimation of the number of clusters in general. In comparison, Louvain fails at finding the right number of clusters. The estimated number of clusters can be used as an input to the spectral clustering algorithm to improve its performance.

• Paris turns out to be a very efficient method to detect relevant resolutions. Its top 1 resolution estimate makes Louvain much more robust to several factors (degree heterogeneity, size heterogeneity, number of blocks).

• Paris+Heterogeneous also achieves very good results, especially concerning differences in the size ratio. A large difference in cluster sizes is the main cause of the resolution limit in modularity maximization. Paris+Heterogeneous is a new clustering algorithm, also based on modularity, which does not seem to suffer from the resolution limit.

• Paris and Louvain have similar running times and are faster than spectral clustering.

Figure 5.9: AMI (top) and number of clusters detected (bottom) by the algorithms for SBMs with two different block sizes. The size ratio between large and small blocks ranges from 1 to 10.

Table 5.1: The running times of the clustering algorithms in seconds with respect to the number of nodes (top) and the probability for an edge to exist (bottom). The graphs are random Erdős–Rényi graphs.

5.3 Real data

The experiments on real networks are performed on four datasets with various sizes and sparsities (Table 5.2).

Dataset            #nodes    #edges      Average degree
OpenFlight         3,097     18,193      11.74
OpenStreet         5,993     6,957       2.32
EnglishDico        94,300    817,661     17.34
HumansWikipedia    702,782   3,247,884   9.24

Table 5.2: Summary of the 4 datasets (number of nodes, number of edges and average degree)

Experiment 9. The 3 Paris clustering algorithms are run on each dataset and the top 8 results of each run are commented in this section. All the figures are reported in the appendices since there are many of them. In general, it is bad practice to consider meta-data as ground truth [36]. This remark makes it difficult to carry out a quantitative analysis of the communities found by the algorithms on real datasets. However, even if real datasets have no clear ground truth, it is still interesting to observe intuitive and meaningful clusters emerging from the algorithms. Hence, in the same way as in the paper of Clauset et al. [7], the results on real datasets are presented as tables and maps in the appendices (5.1a, 5.1b, 5.2a, 5.2b, 5.3a, 5.3b, 5.4a, 5.4b) without quantitative comparison with other algorithms. In the case of Paris+Homogeneous and Paris+Louvain, the best clusterings with respect to the sharp-index are often small variations of the same clustering level. In other words, the clusterings with the highest sharp-index are often composed of the same clusters except for a few. This is logical since two similar clusterings have similar scores. In order to avoid this effect, all the partitions corresponding to a distance d_t within a 10% range around that of the top partition are filtered out. The following is a short presentation of each dataset and its results:

• OpenFlight is a weighted graph whose nodes represent airports around the world. The edge weights are the numbers of flights between the two airports concerned. The results are plotted in Figures B.1, B.2 and B.3 in the appendix. They exhibit different levels of regions of the world, including natural visual clusters such as continents (America, Africa, ...) and countries (USA, Canada, Australia).

• OpenStreet is an unweighted graph whose nodes are intersections between streets in Paris. Two intersections are connected if there is a street between them. The results are plotted in Figures C.1, C.2 and C.3 in the appendix. They exhibit different levels of districts in Paris, including natural visual clusters such as the north and south sides of the Seine.

• EnglishDico is an unweighted graph whose nodes represent English words. Two nodes are connected if one word uses the other in its definition. The results are plotted in Figures D.1, D.2 and D.3 in the appendix. They exhibit different levels of concepts such as "sleep", "theft", "school" or "slavery".

• HumansWikipedia is an unweighted graph whose nodes represent Wikipedia pages of persons. Two people are connected if the Wikipedia page of one links to the page of the other. The results are plotted in Figures E.1, E.2 and E.3 in the appendix. They exhibit different levels of communities of figures such as Nazis, Russian politicians, footballers or scientists.

The 8 top results with respect to the sharp-index on OpenFlight and OpenStreet are directly plotted on a map of the world and a map of Paris. As regards EnglishDico and HumansWikipedia, the nodes with the highest degree in the 8 and 4 largest clusters of the 4 top clusterings are reported in table format. Remark that the rankings proposed by the Paris algorithms are much richer than these first 8 results, since they order the n − 1 partitions extracted from the dendrogram. However, it is not an easy task to summarize such rankings briefly. The code is therefore available on-line1 in order to reproduce the results and observe the complete rankings.
It is interesting to observe that the clusters returned by the Paris algorithms can be interpreted in a natural way. For instance, it is possible to distinguish continents and countries on OpenFlight, the North and South sides of the Seine on OpenStreet, groups of words concerning "sleep", "islam" and "cold" on EnglishDico, and groups of people like "Cricketers", "Nazis" and "Scientists" on HumansWikipedia. Paris provides different interesting scales of clusterings with smaller or larger clusters. Another interesting remark is that Paris+Homogeneous and Paris+Louvain exhibit similar clusterings. It confirms that Paris gives good approximations of the Louvain clusterings.

1 https://github.com/sharpenb/Graph_Clustering

Experiment 10. In addition to the clusterings, the running times of Paris, Louvain (the version of the Community package [11]) and spectral clustering have been measured and are detailed in Figure 5.10. These experiments have been run on a machine with a 2.8GHz Intel Core i7 CPU and 16GB of RAM, using the simple time Python package.

Dataset            Spectral (50)       Louvain             Paris
OpenFlight         9.096 ± 0.206 s     0.370 ± .0001 s     0.470 ± .0001 s
OpenStreet         56.95 ± 0.328 s     0.388 ± .0001 s     0.433 ± .0001 s
EnglishDico        −                   47.20 ± 0.511 s     41.77 ± 0.125 s
HumansWikipedia    −                   196.0 ± 2.795 s     323.9 ± 7.488 s

Figure 5.10: Mean and variance of the running times of Paris and Louvain on the 4 datasets over 10 runs (in seconds). "−" means that the run did not finish.

Hence, Paris is able to process very large networks (more than 3M edges in about 5 minutes on a standard computer) and has running times similar to Louvain, even if it seems to be slightly slower. The running times of the hierarchy processings are negligible compared to Paris. In view of this experiment and of experiment 2, it is interesting to observe that Louvain computes an approximately optimal clustering at only one given resolution, whereas Paris computes an approximation of the optimal clusterings at any resolution in an equivalent amount of time. Compared to Paris and Louvain, spectral clustering is very slow and does not finish on the large networks because of memory issues.

Conclusions. The Paris algorithms lead to interesting results on real datasets:

• Paris algorithms return different scales of meaningful clusterings.

• Paris gives a good approximation of the partitions returned by Louvain at a given resolution.

• Paris scales to large graphs. The running times of Paris and Louvain have the same order of magnitude, while Paris returns a richer structure (a hierarchy).

5.4 Vector datasets

In this section, graphs are generated from vector datasets. This allows Paris to be compared with clustering algorithms that are not designed for graphs and underlines the power of the graph representation. The experiments performed in this section are obviously not exhaustive; they rather illustrate the results obtained by Louvain and Paris on graphs generated from vector datasets, compared to algorithms directly designed for vectors.

Experiment 11. The algorithms are tested on the 6 toy datasets proposed by scikit-learn [42]. These datasets are sampled from various distributions and have different shapes (circles, moons, blobs, square). Each dataset contains 1500 points. The results of the 9 scikit-learn algorithms are presented in Figure 5.11. Ideally, a perfect clustering algorithm should detect the different shapes of the point clouds (2 circles, 2 moons, 3 blobs, 3 lines, 3 blobs, nothing). It is important to mention that the parameters of these algorithms are given in advance, which brings them a substantial advantage. For instance, the number of clusters to detect is known by the K-means algorithm, whereas the Paris algorithms do not need parameter tuning.

[Figure 5.11 panel titles: MiniBatchKMeans, AffinityProp, MeanShift, SpectralClustering, Ward, AggloClustering, DBSCAN, Birch, GaussianMixture]

Figure 5.11: Results of the 9 scikit-learn algorithms on 6 datasets. Good parameter settings are given in advance to the algorithms.

The method used to construct the graph is particularly important and the similarity between two points can be measured in many ways. In this case, the edge weight between two points u, v is classically built from the following similarity measure, also called Gaussian kernel:

\[
\text{weight}(u, v) =
\begin{cases}
e^{-\frac{\|u - v\|^2}{\sigma^2}} & \text{if } e^{-\frac{\|u - v\|^2}{\sigma^2}} > \text{threshold}\\
0 & \text{otherwise}
\end{cases} \tag{5.1}
\]
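A minimal NumPy sketch of this graph construction is given below; σ and the threshold are the parameters used in the experiment, and removing self-loops is an extra convention of this sketch.

import numpy as np

def gaussian_kernel_graph(X, sigma=1.0, threshold=0.8):
    # Weighted adjacency matrix from vector data: Gaussian kernel on
    # pairwise squared distances, thresholded as in equation (5.1).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    weights = np.exp(-sq_dists / sigma ** 2)
    weights[weights <= threshold] = 0.0
    np.fill_diagonal(weights, 0.0)   # no self-loops (convention of this sketch)
    return weights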

The results of Louvain and Paris with σ = 1 and threshold = .8 are presented in Figure 5.12. Given the graphs, the Louvain algorithm is run at resolution γ = 1 and the Paris algorithms have no parameter.

[Figure 5.12 panel titles: Louvain, Paris+homo, Paris+hetero, Paris+Louvain]

Figure 5.12: Results of the Louvain and Paris algorithms on the 6 datasets. No parameter tuning is performed in advance.

In Figure 5.12, the Paris algorithms have almost perfect results on the first 5 datasets since they detect the point clouds exactly. No other algorithm achieves equivalent results on these datasets in Figure 5.11. Again, it is interesting to remark that the performance of Louvain is improved by the resolution estimation of Paris. For instance, the classic Louvain algorithm does not detect the 2 circles or the 2 moons, whereas Paris+Louvain does. The last dataset has no communities, so it is hard to expect a particular solution. The clustering composed of one cluster, as proposed by MeanShift and DBSCAN in Figure 5.11, seems to be a good solution. However, many clustering algorithms, including Paris, are forced to split the dataset into more than one cluster; they then detect communities created by chance in the homogeneous square, which is also an acceptable solution.

Conclusions. The comparison of the Paris algorithms with vector data clustering algorithms is not straightforward and implies a transformation of the vector data into a graph. However, the results obtained in this section can be summed up in two remarks.

• Paris has no parameters. It does not need any preliminary parameter tuning to propose a clustering of the data.

• Paris algorithms seem to provide better results than classical clustering algorithms on simple vector datasets after a graph transformation.

Chapter 6

Discussion

6.1 Contributions and key findings

Resolution as a stability measure. This thesis first proposed a new point of view on the resolution parameter of the modularity score in chapter 3. In particular, this thesis proves that clusterings and clusters have at most one stable range of resolution values where they are optimal with respect to the modularity score. This is in agreement with implicit assumptions present in many papers [1], [21], [22]. This thesis indeed shows that the resolution stable range of a clustering is a first interesting indicator of the quality of a partition. However, it turns out that the average stable range of the clusters within a clustering is much less sensitive to outliers and better adapted to evaluate the quality of the whole partition.

A hierarchical algorithm and a new distance. The hierarchical clustering algorithm, Paris, proposed in this thesis is based on a new distance that can be interpreted through the resolution parameter of the modularity score. The properties of this new distance are presented in chapter 3 and the hierarchical algorithm is described in chapter 4. The Paris algorithm returns a dendrogram which contains approximations of the optimal partitions with respect to the modularity at any resolution, in one single run. In comparison, the Louvain algorithm [2] only approximates the optimal partition for one given resolution per run. The running time of Paris is comparable to that of the Louvain algorithm, which makes it usable even for large graphs.


Four methods based on new clustering scores to process hierarchies. This thesis introduces four different methods to post-process the hierarchy output by the Paris algorithm (chapter 4). The first method evaluates the quality and the robustness of the clusters of the hierarchy thanks to the new sharp-score. The second and third methods order the partitions of the hierarchy with respect to the new sharp-index. They return partitions with clusters observed at the same resolution (homogeneous cluster sizes), or partitions with clusters observed at different resolutions (heterogeneous cluster sizes). The latter method actually overcomes the resolution limit [17], [25]. Finally, the last method is able to detect the most relevant resolutions, which can then be used by any modularity maximization algorithm. Contrary to the resolution estimation proposed in [30], this last method proposes more than one suitable resolution value and has proved to reach good results on both synthetic and real datasets.

Multi-scale structure and robustness to graph parameters. All these algorithms are able to extract the multi-scale structure of a graph, and the top ranked clustering levels have been shown to be more robust to many parameters (number of clusters, degree heterogeneity, size heterogeneity) than Louvain. These results are presented extensively in chapter 5 and rely on an HSBM and on a benchmark of SBMs presented in this thesis. The HSBM introduced in this thesis is similar to the one presented in [27], whereas the benchmark of SBMs is aimed at testing graph parameters in a similar fashion as Lancichinetti et al. in [24]. Beyond the results of this thesis, the algorithms have been implemented in the Python package "python_paris"1 in order to be easily accessible to any user.

6.2 Limitations and future work

The methods proposed in this Master's thesis present points that deserve to be discussed and open paths for further improvements.

Paris algorithm and theoretical guarantees. Each homogeneous clustering within the Paris dendrogram is supposed to be an approximation of the optimal partition, with respect to the modularity score, on a resolution range. This thesis shows that the Paris algorithm reaches good results in practice. However, in the same way as the Louvain algorithm [2], the Paris algorithm provides no theoretical guarantee that a partition is close to the optimal one for a given resolution. A future work could investigate the theoretical optimality of the partitions returned by the Paris algorithm.

1 https://pypi.org/project/python_paris/

Nested hierarchies. Paris has so far produced hard hierarchies where partitions are nested in each other. It can then extract, from these hierarchies, approximately optimal clusterings at each resolution with respect to the modularity score. However, the nestedness constraint is strong and may lower the quality of the clusterings. One improvement would be to adapt the Paris algorithm in order to return soft hierarchies where partitions may not be nested, and then evaluate how much this can improve the approximation quality of the partitions at each resolution.

Evaluation of hierarchy quality. The hierarchy returned in the form of a dendrogram by the Paris algorithm is another key outcome of the thesis. The dendrogram turns out to be an efficient structure to represent the multi-scale structure of a graph and relates each level to a modularity resolution. However, whereas many methods exist to evaluate the quality of flat clusterings, evaluating the quality of a hierarchy is much more difficult. Experiment 2 proposes a way to analyze the quality of a hierarchy by looking at the modularity score with respect to the resolution parameter. However, it might be interesting to have a score for evaluating hierarchies that does not involve modularity. This is an ongoing work that has been analyzed in [9] and that we analyze in [3]. The paper [3] notably proposes to measure how well the information is compressed in the dendrogram structure.

Comparison with other clustering algorithms. Other outcomes of this Master's thesis are the different processings of the Paris hierarchy, which return rankings of flat clusterings without parameter tuning. The performance of the best flat clustering with respect to the sharp-index is mainly compared with the classical Louvain algorithm for two main reasons: the Louvain algorithm is known as the state of the art for graph clustering, and the processings of the Paris hierarchy can be seen as improvements of the Louvain algorithm. Of course, other clustering algorithms exist, but they generally require the setting of parameters. Even if it is not entirely fair to the processings of Paris, experiment 11 compares them with some of these algorithms given good values for their parameters; even in this case, the Paris algorithms reach better results than the competitors. The experiments presented in this work are obviously not exhaustive, and many other experiments could be done to enrich the analysis of the Paris algorithm and the processing of its hierarchy (with different quality scores [16] or datasets [24] for instance).

Other applications for the sharp-score and the hierarchical a priori. Moreover, the new distance proposed in this thesis and the sharp-score seem to have many possible applications for processing any type of hierarchy, in the same way as the Dunn and Davies-Bouldin indices [13], [14]. This thesis also introduces the new concept of an a priori on the hierarchy, which can take different forms but seems to be exponential in real-world networks. Potential future works could in particular investigate local clustering algorithms or overlapping clustering algorithms based on the notion of sharp-score, and look closer at good forms of the a priori function for real-world networks.

Code optimization. Finally, another possible improvement of Paris is to optimize its implementation, which is available on-line2. The optimizations could be algorithmic, or could simply consist in translating the Python code into a faster language like C or C++.

6.3 Ethical and sustainable impact

Although it is not an easy task to evaluate the impact of a given discovery on the real world, researchers must question the possible use of their findings by the industry. The most direct repercussion is obviously to improve the global understanding of the world. Knowledge is an ethical value shared by many people and research contributes positively to it. In the present case of graph clustering applied to social networks, it can for instance give an interesting insight into how human communities are organized and what their dynamics are. From this point of view, this thesis has a beneficial impact since it proposes new solutions and knowledge concerning graph clustering problems. As with any piece of knowledge, algorithms must be well mastered by their users so as not to lead to erroneous results or immoral usages.

However, research can have more indirect repercussions depending on its applications. Some of the possible applications of graph methods are mentioned here to give an overview of positive and negative effects. Graph clustering techniques help to answer biological questions; brain networks and the tree of life are examples of graphs that can be used in this domain. Graph algorithms can also be used for more controversial applications in marketing. On the one hand, they allow services to be better adapted to individuals and undesirable events to be predicted. On the other hand, they can infringe on privacy and freedom by using large amounts of data and suggesting actions in our place. For instance, graph clustering can be used on social networks to guess the interests or the potential friends of a given user, or even to detect the most influential persons within a community. However, graph techniques can also have a positive ecological impact. For example, the graph representation of a city combined with clustering techniques can help pool resources. During this master thesis, I had the opportunity to discuss with two companies, Deezer and nam.R. The objective of Deezer was to improve their music suggestions and create meaningful play-lists, whereas nam.R clearly focuses on projects with sustainable development goals approved by the UN.

2 https://github.com/sharpenb/Graph_Clustering

Chapter 7

Conclusion

The purpose of this thesis was to develop and test a new algorithm, called Paris (Chapter 4), based on modularity for hierarchical clustering of graphs. The development of the algorithm was made bearing the following three questions in mind:

1. How to estimate suitable values of the resolution parameter to ensure a meaningful clustering? Clustering algorithms based on modularity require the proper tuning of a parameter called resolution. The combination of the Paris algorithm with the sharp-score proposed in chapter 4 provides a ranking of the most suitable resolutions. The algorithm Paris+Louvain is the first method that has proved its efficiency on synthetic datasets (experiments 3, 4, 5, 6, 7, 8) and its reliability on real datasets (experiments 9, 10) for resolution estimation. Other recent attempts such as [30] are not guaranteed to converge and subsequently give poor and unstable results.

2. Is it possible to build a clustering algorithm based on modularity which does not suffer from the resolution limit? The classical algorithms based on modularity generally suffer from the resolution limit [17], [25]. In contrast, the algorithm Paris+Heterogeneous produces clusterings that overcome the resolution limit (experiment 7). This algorithm keeps a modularity interpretation in which each cluster is observed at its own resolution instead of one unique resolution. To the best of our knowledge, this is the first method with a clear modularity interpretation which does not suffer from the resolution limit.


3. How to identify the sharpest clustering levels and the most robust partitions in a graph? The sharp-score and the sharp-index have proved to be good criteria to identify the best clustering levels in a hierarchy. The meaning of the "best clustering" is determined by the two new scores, which are relevant since they give high scores to clusters that are compact and isolated from the rest of the graph. By construction, these scores are robust to resolution variations and have been shown to be robust to other parameters (experiments 3, 4, 5, 6, 7).

To conclude, the Paris algorithm can mainly be used for three different purposes:

• As an alternative to the Louvain algorithm to approximate the optimal clusterings with respect to the modularity score at any resolution in a single run (Paris+Homogeneous)

• In combination with the Louvain algorithm to estimate the most suitable resolution (Paris+Louvain)

• As a hierarchical modularity-based algorithm capable of overcoming the resolution limit (Paris+Heterogeneous)

Bibliography

[1] A. Arenas, A. Fernández, and S. Gómez. "Analysis of the structure of complex networks at different resolution levels". In: New Journal of Physics 10.5 (May 2008), p. 053039. DOI: 10.1088/1367-2630/10/5/053039. eprint: physics/0703218.
[2] Vincent D. Blondel et al. "Fast unfolding of communities in large networks". In: Journal of Statistical Mechanics: Theory and Experiment 2008.10 (2008), P10008. URL: http://stacks.iop.org/1742-5468/2008/i=10/a=P10008.
[3] T. Bonald and B. Charpentier. "Learning Graph Representations by Dendrograms". In: ArXiv e-prints (July 2018). arXiv: 1807.05087.
[4] Cheng-Shang Chang et al. "A general probabilistic framework for detecting community structure in networks". In: 2011 Proceedings IEEE INFOCOM (2011), pp. 730–738.
[5] Yizong Cheng. "Mean Shift, Mode Seeking, and Clustering". In: IEEE Trans. Pattern Anal. Mach. Intell. 17.8 (1995), pp. 790–799. URL: http://dblp.uni-trier.de/db/journals/pami/pami17.html#Cheng95.
[6] Fan Chung and Linyuan Lu. "Connected Components in Random Graphs with Given Expected Degree Sequences". In: Annals of Combinatorics 6.2 (Nov. 2002), pp. 125–145. ISSN: 0219-3094. DOI: 10.1007/PL00012580. URL: https://doi.org/10.1007/PL00012580.
[7] A. Clauset, M. E. J. Newman, and C. Moore. "Finding community structure in very large networks". In: 70.6, 066111 (Dec. 2004), p. 066111. DOI: 10.1103/PhysRevE.70.066111. eprint: cond-mat/0408187.


[8] Aaron Clauset, Cristopher Moore, and M. E. J. Newman. "Hierarchical structure and the prediction of missing links in networks". In: Nature 453 (2008), pp. 98–101. URL: http://dx.doi.org/10.1038/nature06830.
[9] V. Cohen-Addad et al. Hierarchical Clustering: Objective Functions and Algorithms. Apr. 2017. arXiv: 1704.02147 [cs.DS].
[10] Dorin Comaniciu and Peter Meer. "Mean shift: A robust approach toward feature space analysis". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), pp. 603–619.
[11] Community. Community Louvain. 2018. URL: https://perso.crans.org/aynaud/communities/api.html (visited on 01/01/2018).
[12] Samuel I. Daitch, Jonathan A. Kelner, and Daniel A. Spielman. "Fitting a Graph to Vector Data". In: ICML '09 (2009), pp. 201–208. DOI: 10.1145/1553374.1553400. URL: http://doi.acm.org/10.1145/1553374.1553400.
[13] David L. Davies and Donald W. Bouldin. "A Cluster Separation Measure". In: IEEE Trans. Pattern Anal. Mach. Intell. 1.2 (Feb. 1979), pp. 224–227. ISSN: 0162-8828. DOI: 10.1109/TPAMI.1979.4766909. URL: http://dx.doi.org/10.1109/TPAMI.1979.4766909.
[14] J. C. Dunn. "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters". In: Journal of Cybernetics 3.3 (1973), pp. 32–57. DOI: 10.1080/01969727308546046. URL: https://doi.org/10.1080/01969727308546046.
[15] Martin Ester et al. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". In: KDD'96 (1996), pp. 226–231. URL: http://dl.acm.org/citation.cfm?id=3001460.3001507.
[16] S. Fortunato and D. Hric. "Community detection in networks: A user guide". In: Physics Reports 659 (Nov. 2016), pp. 1–44. DOI: 10.1016/j.physrep.2016.09.002. arXiv: 1608.00163 [physics.soc-ph].

[17] Santo Fortunato and M. Barthelemy. "Resolution limit in community detection". In: Proceedings of the National Academy of Sciences (Jan. 2007). URL: http://www.pnas.org/cgi/content/abstract/104/1/36.
[18] Brendan J. J. Frey and Delbert Dueck. "Clustering by Passing Messages Between Data Points". In: Science (Jan. 2007). ISSN: 1095-9203. DOI: 10.1126/science.1136800. URL: http://dx.doi.org/10.1126/science.1136800.
[19] Jianbin Huang et al. "SHRINK: A structural clustering algorithm for detecting hierarchical communities in networks". In: (Jan. 2010), pp. 219–228.
[20] B. S. Khan and M. A. Niazi. "Network Community Detection: A Review and Visual Survey". In: ArXiv e-prints (Aug. 2017). arXiv: 1708.00977.
[21] R. Lambiotte. "Multi-scale Modularity in Complex Networks". In: ArXiv e-prints (Apr. 2010). arXiv: 1004.4268 [physics.soc-ph].
[22] Renaud Lambiotte, Jean-Charles Delvenne, and Mauricio Barahona. "Random Walks, Markov Processes and the Multiscale Modular Organization of Complex Networks". In: CoRR abs/1502.04381 (2015).
[23] A. Lancichinetti, S. Fortunato, and J. Kertesz. "Detecting the overlapping and hierarchical community structure of complex networks". In: ArXiv e-prints (Feb. 2008). arXiv: 0802.1218 [physics.soc-ph].
[24] A. Lancichinetti, S. Fortunato, and F. Radicchi. "Benchmark graphs for testing community detection algorithms". In: 78.4 (Oct. 2008), p. 046110.
[25] Andrea Lancichinetti and Santo Fortunato. "Limits of modularity maximization in community detection". In: Phys. Rev. E 84.6 (Dec. 2011), p. 066122. DOI: 10.1103/PhysRevE.84.066122. URL: https://link.aps.org/doi/10.1103/PhysRevE.84.066122.
[26] Ulrike von Luxburg. "A Tutorial on Spectral Clustering". In: CoRR abs/0711.0189 (2007). URL: http://dblp.uni-trier.de/db/journals/corr/corr0711.html#abs-0711-0189.

[27] V. Lyzinski et al. "Community Detection and Classification in Hierarchical Stochastic Blockmodels". In: ArXiv e-prints (Mar. 2015). arXiv: 1503.02115 [stat.ML].
[28] Fionn Murtagh and Pedro Contreras. "Algorithms for hierarchical clustering: an overview, II". In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7.6, e1219. DOI: 10.1002/widm.1219. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1219.
[29] M. E. Newman. "Modularity and community structure in networks". In: Proc Natl Acad Sci U S A 103.23 (June 2006), pp. 8577–8582. DOI: 10.1073/pnas.0601602103. URL: http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=pubmed&list_uids=16723398&dopt=AbstractPlus.
[30] M. E. J. Newman. "Community detection in networks: Modularity optimization and maximum likelihood are equivalent". In: CoRR abs/1606.02319 (2016). arXiv: 1606.02319. URL: http://arxiv.org/abs/1606.02319.
[31] M. E. J. Newman. "Fast algorithm for detecting community structure in networks". In: Phys. Rev. E 69.6 (June 2004), p. 066133. DOI: 10.1103/PhysRevE.69.066133. URL: https://link.aps.org/doi/10.1103/PhysRevE.69.066133.
[32] M. E. J. Newman. "From the Cover: Modularity and community structure in networks". In: Proceedings of the National Academy of Science 103 (June 2006), pp. 8577–8582. DOI: 10.1073/pnas.0601602103. eprint: physics/0602124.
[33] M. E. J. Newman and M. Girvan. "Finding and evaluating community structure in networks". In: Phys. Rev. E 69.2 (Feb. 2004), p. 026113. DOI: 10.1103/PhysRevE.69.026113. URL: http://link.aps.org/doi/10.1103/PhysRevE.69.026113.
[34] M. E. J. Newman and M. Girvan. "Finding and evaluating community structure in networks". In: Physical Review E 69 (2004).
[35] M. E. J. Newman, S. H. Strogatz, and D. J. Watts. "Random graphs with arbitrary degree distributions and their applications". In: 64.2, 026118 (Aug. 2001), p. 026118. DOI: 10.1103/PhysRevE.64.026118. eprint: cond-mat/0007235.

[36] Leto Peel, Daniel B. Larremore, and Aaron Clauset. "The ground truth about metadata and community detection in networks". In: CoRR abs/1608.05878 (2016). arXiv: 1608.05878. URL: http://arxiv.org/abs/1608.05878.
[37] Pascal Pons and Matthieu Latapy. "Computing communities in large networks using random walks (long version)". In: Computer and Information Sciences – ISCIS 2005 (2005), pp. 284–293. arXiv: physics/0512106v1.
[38] Joerg Reichardt and Stefan Bornholdt. "Statistical Mechanics of Community Detection". In: Physical Review E 74 (2006), p. 016110. URL: http://www.citebase.org/abstract?id=oai:arXiv.org:cond-mat/0603718.
[39] P. Ronhovde and Z. Nussinov. "Multiresolution community detection for megascale networks by information-based replica correlations". In: ArXiv e-prints (Dec. 2008). arXiv: 0812.1072 [physics.soc-ph].
[40] Marta Sales-Pardo et al. "Extracting the hierarchical organization of complex systems". In: Proceedings of the National Academy of Sciences 104.39 (2007), pp. 15224–15229. eprint: http://www.pnas.org/content/104/39/15224.full.pdf. URL: http://www.pnas.org/content/104/39/15224.
[41] Erich Schubert et al. "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN". In: ACM Trans. Database Syst. 42.3 (2017), 19:1–19:21.
[42] Scikit-Learn. Scikit-Learn Clustering Algorithms. 2018. URL: http://scikit-learn.org/stable/modules/clustering.html (visited on 01/01/2018).
[43] E. H. Simpson. "Measurement of diversity". In: Nature 163.4148 (1949), p. 688.
[44] Gerald Teschl. Topics in Real and Functional Analysis. 2010. URL: http://www.mat.univie.ac.at/~gerald/ftp/book-fa/ (visited on 01/01/2018).
[45] N. Tremblay and P. Borgnat. "Graph Wavelets for Multiscale Community Mining". In: IEEE Transactions on Signal Processing 62.20 (Oct. 2014), pp. 5227–5239.

[46] Nguyen Xuan Vinh, Julien Epps, and James Bailey. “Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance”. In: J. Mach. Learn. Res. 11 (Dec. 2010), pp. 2837–2854. ISSN: 1532-4435. URL: http://dl.acm.org/citation.cfm?id=1756006.1953024.
[47] Joe H. Ward. “Hierarchical Grouping to Optimize an Objective Function”. In: Journal of the American Statistical Association 58.301 (1963), pp. 236–244. URL: http://www.jstor.org/stable/2282967.
[48] Ju Xiang et al. “Multi-resolution modularity methods and their limitations in community detection”. In: 85 (Oct. 2012), pp. 1–10.
[49] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. “BIRCH: An Efficient Data Clustering Method for Very Large Databases”. In: SIGMOD Rec. 25.2 (June 1996), pp. 103–114. ISSN: 0163-5808. DOI: 10.1145/235968.233324. URL: http://doi.acm.org/10.1145/235968.233324.
Appendices


A HSBM
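The figures in this appendix were obtained on a multi-level hierarchical stochastic block model (HSBM). For readers unfamiliar with this benchmark, the sketch below shows how a two-level HSBM graph of this kind can be generated with networkx; the block counts, block sizes and edge probabilities are illustrative assumptions and are not the parameters used for the experiments reported here.

    # Minimal sketch of a two-level hierarchical stochastic block model (HSBM).
    # All sizes and probabilities are illustrative assumptions.
    import numpy as np
    import networkx as nx

    n_super, n_sub, block_size = 4, 4, 25   # 4 super-blocks, each split into 4 sub-blocks of 25 nodes
    p_in, p_mid, p_out = 0.5, 0.05, 0.01    # intra sub-block > intra super-block > inter super-block

    k = n_super * n_sub                     # number of bottom-level blocks
    sizes = [block_size] * k
    probs = np.full((k, k), p_out)
    for s in range(n_super):
        lo, hi = s * n_sub, (s + 1) * n_sub
        probs[lo:hi, lo:hi] = p_mid         # sub-blocks sharing the same super-block
    np.fill_diagonal(probs, p_in)           # edges inside each bottom-level block

    G = nx.stochastic_block_model(sizes, probs.tolist(), seed=0)
    print(G.number_of_nodes(), G.number_of_edges())

The three edge-probability levels create two nested scales of communities, which is exactly the setting in which the rankings below are evaluated.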

Figure A.1: The 7 top clusters according to Paris+Clusters on a multi-level HSBM (left to right, top to bottom in descending order of quality).

Figure A.2: The 7 top clusterings according to Paris+Homogeneous on a multi-level HSBM (left to right, top to bottom in descending order of quality).

Figure A.3: The 3 top clusterings according to Paris+Heterogeneous on a multi-level HSBM (left to right, top to bottom in descending order of quality).

Figure A.4: The 7 top clusterings according to Paris+Louvain on a multi-level HSBM (left to right, top to bottom in descending order of quality).

B OpenFlight

Figure B.1: The 8 top clusterings according to Paris+Homogeneous on OpenFlight (left to right, top to bottom in decreasing order of quality).

Figure B.2: The 8 top clusterings according to Paris+Heterogeneous on OpenFlight (left to right, top to bottom in decreasing order of quality).

Figure B.3: The 8 top clusterings according to Paris+Louvain on OpenFlight (left to right, top to bottom in decreasing order of quality).

C OpenStreet

Figure C.1: The 8 top clusterings according to Paris+Homogeneous on OpenStreet (left to right, top to bottom in decreasing order of quality).

Figure C.2: The 8 top clusterings according to Paris+Heterogeneous on OpenStreet (left to right, top to bottom in decreasing order of quality).

Figure C.3: The 8 top clusterings according to Paris+Louvain on OpenStreet (left to right, top to bottom in decreasing order of quality).

D EnglishDico

Cluster size   Main nodes
35             revenge, vengeance, avenge, wreak, wanion, ...
34             blockhead, dunce, dolt, mooncalf, numskull, ...
34             beforehand, predestinate, predetermine, foreordain, foredoom, ...
32             dung, excrement, stallage, ordure, dropping, ...
32             miser, sordid, covetous, niggard, penurious, ...
32             childbirth, parturition, midwife, obstetrics, midwifery, ...
32             comment, scholium, annotation, commentator, skave, ...
32             drunken, bacchanal, orgies, bacchus, orgy, ...

Cluster size   Main nodes
62             manner, yawningly, worryingly, tortiously, telarly, tauntingly, ...
60             act, withholdment, tripudiation, surculation, subministration, ...
55             weep, lament, lamentation, mourn, wail, ...
50             foretell, prophecy, prophesy, foreshow, prediction, ...
50             sleep, sleepy, drowzy, slumber, doze, ...
49             hydroid, zooid, sessile, gonophore, hydroidea, ...
48             praise, commendation, extol, panegyric, eulogy, ...
48             mohammedan, islam, imaum, moslem, mohammedanism, ...

Cluster size   Main nodes
31             wrongly, misincline, misworkship, misenter, mistime, ...
28             panegyric, eulogy, commendatory, laudatory, eulogistical, ...
26             sleepy, drowsy, drowsiness, sleepiness, dozy, ...
24             divination, gyromancy, hyrdomancy, hieromancy, dactyliomancy, ...
23             musty, moldy, mustiness, fusiness, fusty, ...
22             scapula, infraspinous, mesoscpula, metacromion, acromion, ...
22             entrails, disembowel, eviscerate, umbles, exenterate, ...
22             inexpressible, ineffable, nameless, unspeakable, unutterable, ...

Cluster size   Main nodes
128            omen, augur, presage, foretell, prophecy, ...
89             boast, bully, swash, bluster, huff, ...
86             see, yellowammer, weddahs, utis, uniclinal, ...
82             deny, reject, renounce, abdicate, recant, ...
80             grief, sorrow, weep, mourning, lament, ...
73             slave, servitude, slavery, villain, bondage, ...
72             stamen, pistil, linnaean, cineraria, heterogony, ...
69             embryo, cotyledon, albumen, seminal, radicle, ...

Figure D.1: The 4 top clusterings according to Paris+Homogeneous on EnglishDico (top to bottom in decreasing order of quality): clustering 1 (1,321 clusters), clustering 2 (8,686 clusters), clustering 3 (20,033 clusters), clustering 4 (5,368 clusters).

Cluster size   Main nodes
134            dead, grave, bury, tomb, atrium, ...
105            see, syphon, calygraphy, capella, cherif, ...
103            cold, ice, cool, chill, frost, ...
85             act, superinduction, infoldment, transcolation, disfurnishment, ...
84             concave, vertebra, socket, crocodile, alligator, ...
83             hag, deformity, ugly, mar, deform, ...
83             mix, affair, mingle, interfere, busy, ...
78             serpent, snake, fang, venomous, rattlesnake, ...

Cluster size   Main nodes
631            piece, separate, break, tear, mental, ...
359            school, teach, college, learn, university, ...
306            lay, waste, domestic, manage, destruction, ...
275            current, pole, electricity, needle, battery, ...
242            single, secret, private, conceal, retire, ...
237            morbid, sick, disgust, dislike, deserving, ...
213            soft, abound, flexible, willow, basket, ...
205            protect, enemy, strengthen, ditch, castle, ...

Cluster size   Main nodes
976            light, object, eye, black, dark, ...
751            tend, passion, temper, excite, noise, ...
405            very, size, type, vice, wicked, monster, ...
344            spirit, influence, charm, spell, devil, ...
320            day, year, honor, once, hundred, seven, ...
307            first, originally, original, origin, begin, ...
299            dull, stupid, fool, foolish, silly, blunt, ...
279            back, drive, thrust, push, reflect, expel, ...

Cluster size   Main nodes
1,002          matter, mark, spot, soil, pure, ...
783            life, pleasure, sport, humor, please, joy, ...
656            process, art, draw, plate, picture, ...
433            authority, rule, government, king, control, ...
422            estate, father, descend, descent, son, ...
419            catch, knot, involve, difficulty, drag, ...
418            follow, reason, argument, proof, proposition, ...
410            treat, science, study, master, skilled, ...

Figure D.2: The 4 top clusterings according to Paris+Heterogeneous on EnglishDico (top to bottom in decreasing order of quality): clustering 1 (27,373 clusters), clustering 2 (14,372 clusters), clustering 3 (11,608 clusters), clustering 4 (10,787 clusters).

Cluster size   Main nodes
40             servitude, slavery, bondage, enslave, thrall, ...
38             beforehand, predestinate, predetermine, foreordain, foredoom, ...
34             dung, excrement, alvine, stallage, ordure, ...
32             sleepy, drowsy, slumber, doze, drowse, ...
31             hydroid, gonophore, hydroidea, athecata, hydractinian, ...
31             lament, mourn, deplorable, deplore, bewail, ...
31             leaflet, pinnate, jugum, paripinnate, pecopteris, ...
29             cleanse, detergent, abstergent, deterge, absterge, ...

Cluster size   Main nodes
62             manner, astonishedly, besottingly, clammily, condescendingly, ...
60             act, affeerment, ambulation, attemperation, braggardism, ...
49             anatomy, dissection, anatomical, anatomism, osteotomy, ...
48             hydroid, zooid, gonophore, sessile, hydroidea, ...
47             mohammedan, imaum, islam, mameluke, moslem, ...
46             medusa, jellyfish, calycozoa, dsicoplora, acalephae, ...
43             sponge, spicule, siliceous, keratose, prorifera, ...
43             marsupial, marsupialia, opossum, callosum, wombat, ...

Cluster size   Main nodes
28             wrongly, misincline, misworship, miscenter, miscalculate, ...
23             entrails, disembowel, eventration, eviscerate, exenterate, ...
22             divination, gyromancy, dactyliomancy, hieromancy, hydromancy, ...
22             lusty, moldy, mustiness, fusiness, fust, ...
21             contradict, impugn, disaffirm, gainsay, disputer, ...
21             stupid, doltish, blockish, bufflehead, chuckkehead, ...
20             inhabitant, aragonese, belgian, biscayan, californian, gaditanian, ...
20             eulogy, panegyric, commendatory, laudatory, eulogistical, ...

Cluster size   Main nodes
113            sorrow, weep, grieve, mournful, sorrowful, ...
93             omen, augur, presage, foretell, predict, ...
86             see, amebean, antecians, attone, bedphere, ...
83             censure, deserving, condemn, blame, doom, ...
82             servile, fawn, obsequious, submissive, meanly, ...
81             steal, petty, thief, thieft, slink, ...
77             pride, proud, disdain, arrogant, haughty, ...
75             hydroid, zooid, medusa, gonophore, sessile, polyp, ...

Figure D.3: The 4 top clusterings according to Paris+Louvain on EnglishDico (top to bottom in decreasing order of quality): clustering 1 (12,711 clusters), clustering 2 (8,414 clusters), clustering 3 (19,401 clusters), clustering 4 (5,075 clusters).

E HumansWikipedia

Cluster size   Main nodes
422            Walther-Peer Fellgiebel, Fritz Berger, Hans Bauer (SS officer), Horst Weber, Rudolf Heynsen, Kurt Blasberg, ...
172            Arthur Haygarth, Horace Bates, Timothy Duke (cricketer), W. Beeston (Middlesex cricketer), Edward Morant, ...
120            , , , , (cricketer), , ...
93             John Beazley, Euphronios, Nikosthenes, Amasis Painter, Oltos, Andokides painter, ...

Cluster size   Main nodes
158            Arthur Haygarth, G. Dupuis (Essex cricketer), James Dale (cricketer), Hyde (Sussex cricketer), Hudson (Sussex cricketer), Thomas Bache, ...
83             Attorneys in the United States, Karen Tallian, Don Caruth, Pat Lindsey, Tommy Dickerson, William Joel Blass, ...
77             Vladimir Putin, Vladimir Yevseyevich Zuev, Gennady Khazanov, Mikhail Yevdokimov, Valery Shumakov, Sergey Vakhrukov, ...
69             All-time Rochester Rhinos roster, Kwame Sarkodie, Lenin Steenkamp, Juan Pablo Reyes, Bill Sedgewick, Kofi Sarkodie, ...

Cluster size   Main nodes
535            Walther-Peer Fellgiebel, Fritz Berger, Hans Bauer (SS officer), Horst Weber, Rudolf Heynsen, Kurt Blasberg, ...
183            Arthur Haygarth, James Burt (cricketer), Thomas Nordish, Edward Winter (cricketer), John Evans (Kent cricketer), ...
176            Leigh Adams, , Bruce Penhall, Ole Olsen (speedway rider), , , ...
173            , Linet Masai, , , Sharon Cherop, ...

Cluster size   Main nodes
586            Walther-Peer Fellgiebel, Günther Tonne, Hans-Gotthard Pestke, Ernst Mengersen, Günther Sachs, Siegfried Lüdden, ...
285            Laisenia Qarase, Mahendra Chaudhry, Frank Bainimarama, George Speight, Kamisese Mara, Sitiveni Rabuka, ...
284            Arthur Haygarth, William Clarke (cricketer), Jem Broadbridge, Fuller Pilch, John Woodcock ( writer), ...
282            Lee Kuan Yew, Lee Hsien Loong, Goh Chok Tong, S. R. Nathan, Ong Teng Cheong, George Yeo, ...

Figure E.1: The 4 top clusterings according to Paris+Homogeneous on HumansWikipedia (top to bottom in decreasing order of quality): clustering 1 (69,913 clusters), clustering 2 (104,868 clusters), clustering 3 (44,416 clusters), clustering 4 (26,046 clusters).

Cluster size   Main nodes
34,669         , , Pelé, Diego Maradona, José Mourinho, Kevin Keegan, ...
18,851         Jawaharlal Nehru, Rabindranath Tagore, Indira Gandhi, Gautama Buddha, Pervez Musharraf, Amitabh Bachchan, ...
13,493         Pierre Trudeau, Stephen Harper, Jean Chrétien, Brian Mulroney, William Lyon Mackenzie King, ...
13,439         Babe Ruth, Jackie Robinson, Ty Cobb, Ted Williams, Barry Bonds, Joe DiMaggio, ...

Cluster size   Main nodes
17,777         Mao Zedong, Chiang Kai-shek, Henry John Temple, Tokugawa Hideyoshi, Wu Zetian, Zhou Enlai, ...
14,601         Henry II of , Charlemagne, John King of England, William the Conqueror, Louis IX of , ...
13,051         Fidel Castro, Francisco Franco, Jorge Luis Borges, Che Guevara, Federico Garcia Lorca, Juan Carlos I of Spain, ...
12,071         Muhammad Ali, Hulk Hogan, Ric Flair, Vince McMahon, Mick Foley, Mike Tyson, ...

Cluster size   Main nodes
9,522          Pope John Paul II, Jesus, Pope Benedict XVI, Pope Paul VI, Pope Pius XII, Pope Leo XIII, ...
6,261          Michael Jordan, Shaquille O’Neal, Magic Johnson, Kareem Abdul-Jabbar, LeBron James, Wilt Chamberlain, ...
4,296          Michael Schumacher, Ayrton Senna, Jeff Gordon, Dale Earnhardt, Mario Andretti, Tony Stewart, ...
4,109          Augustine of Hippo, Paul the Apostle, Saint Peter, Jerome, John the Baptist, Eusebius, ...

Cluster size   Main nodes
28,118         Charles Darwin, Albert Einstein, Thomas Edison, Carl Linnaeus, Charles Lindbergh, Alexander von Humboldt, ...
21,063         Pierre Trudeau, Stephen Harper, Jean Chrétien, Brian Mulroney, William Lyon Mackenzie King, ...
14,054         Jack Kemp, Brett Favre, Peyton Manning, Walter Camp, Tom Brady, Fielding H. Yost, ...
11,967         Louis XIV of France, Charles V, Philip II of Spain, Louis XV of France, Henry IV of France, Rembrandt, ...

Figure E.2: The 4 top clusterings according to Paris+Heterogeneous on HumansWikipedia (top to bottom in decreasing order of quality): clustering 1 (140,208 clusters), clustering 2 (108,268 clusters), clustering 3 (102,007 clusters), clustering 4 (81,479 clusters).

Cluster size   Main nodes
422            Walther-Peer Fellgiebel, Fritz Berger, Hans Bauer (SS officer), Horst Weber, Rudolf Heynsen, Kurt Blasberg, ...
173            Arthur Haygarth, J. Hampton (Surrey cricketer), Edward Morant, W. Beeston (Middlesex cricketer), William Warsop, ...
107            Denis Sassou Nguesso, Pascal Lissouba, Bernard Kolélas, Marien Ngouabi, André Milongo, Claude-Ernest Ndalla, ...
104            John Beazley, Euphronios, Amasis Painter, Nikosthenes, Oltos, Andokides painter, ...

Cluster size   Main nodes
158            Arthur Haygarth, G. Dupuis (Essex cricketer), James Dale (cricketer), Hyde (Sussex cricketer), Hudson (Sussex cricketer), ...
83             Attorneys in the United States, Douglas S. Jackson, Ronald Saunders, Shannon Robinson, Cisco McSorley, ...
77             Vladimir Putin, Andrew Kuchins, Mountaga Diallo, Eugène-Richard Gasana, Bernadette Sebage Rathedi, ...
48             All-time Rochester Rhinos roster, Bill Sedgewick, Lenin Steenkamp, Kwame Sarkodie, Juan Pablo Reyes, Joe Mercik, ...

Cluster size   Main nodes
535            Walther-Peer Fellgiebel, Fritz Berger, Hans Bauer (SS officer), Horst Weber, Rudolf Heynsen, Kurt Blasberg, ...
196            Arthur Haygarth, Joey Ring, Jacob White, Knowles (Middlesex cricketer), W. White (Middlesex cricketer), ...
163            Laisenia Qarase, Frank Bainimarama, Kamisese Mara, Sitiveni Rabuka, Josefa Iloilo, Joni Madraiwiwi, ...
152            Gyanendra of , , Sher Bahadur Deuba, Bishweshwar Prasad Koirala, Mahendra of Nepal, , ...

Cluster size   Main nodes
609            Walther-Peer Fellgiebel, Adalbert von Blanc, Franz Kieslich, Walther Krause, Franz Griesbach, Rolf Johannesson, ...
296            Arthur Haygarth, Fuller Pilch, Ned Wenman, Tom Marsden, William Bedle, ...
290            , Zlatko Kranjčar, Afshin Ghotbi, , , , ...
278            Doyle Brunson, Phil Hellmuth, Daniel Negreanu, Phil Ivey, Annie Duke, Erik Seidel, ...

Figure E.3: The 4 top clusterings according to Paris+Louvain on HumansWikipedia (top to bottom in decreasing order of quality): clustering 1 (67,474 clusters), clustering 2 (101,368 clusters), clustering 3 (42,218 clusters), clustering 4 (24,190 clusters).
