Clustering at Large Scales
Veronika Strnadová, CSC Lab

What is clustering?

Finding groups of highly similar objects in unlabeled datasets or: Finding groups of highly connected vertices in a graph

[Figure: three example clusters, $C_1$, $C_2$, and $C_3$]

Applications

• Image Segmentation • Normalized Cuts and Image Segmentation. Jianbo Shi and Jitendra Malik, 2001

• Load Balancing in Parallel Computing • Multilevel k-way Partitioning Scheme for Irregular Graphs. George Karypis and Vipin Kumar. 1998

• Genetic Mapping • Computational approaches and software tools for genetic linkage map estimation in plants. Jitendra Cheema and Jo Dicks. 2009

• Community Detection • Modularity and community structure in networks. Mark E.J. Newman. 2006

Clustering in Genetic Mapping

“The problem of genetic mapping can essentially be divided into three parts: grouping, ordering, and spacing” Computational approaches and software tools for genetic linkage map estimation in plants. Cheema 2009

Challenges in the large scale setting

Big Data: Many popular clustering algorithms are inherently difficult to parallelize, and inefficient at large scales.

“The computational issue becomes critical when the application involves large-scale data” Jain, Data Clustering: 50 Years Beyond k-means

“There is no clustering algorithm that can be universally used to solve all problems” Xu & Wunsch, A Comprehensive Overview of Basic Clustering Algorithms

Popular Clustering Techniques
• k-means: k-means/k-medians/k-medoids; better, bigger k-means
• Spectral: optimizing Ncut; PIC
• Hierarchical: single linkage; BIRCH; CURE
• Density-based: DBSCAN; parallel DBSCAN
• Multilevel: Metis; ParMetis; multithreaded Metis
• Modularity: community detection

k-means

• “…simpler methods such as k-means remain the preferred choice in many large-scale applications.” Kulis and Jordan, Revisiting k-means: New Algorithms via Bayesian Nonparametrics. 2012

• “The k-means algorithm is very simple and can be easily implemented in solving many practical problems.” Xu and Wunsch, Survey of Clustering Algorithms. 2005

• “In spite of the fact that k-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, k-means is still widely used.” Jain, Data Clustering: 50 Years Beyond k-means.

k-means

Given: $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and $k$
Objective: minimize $\sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$, where $r_{ij}$ is an indicator function

Initialize $\mu_1, \mu_2, \dots, \mu_k$
Repeat:
1. Find the optimal assignment of each $x_i$ to a center $\mu_j$
2. Re-evaluate each $\mu_j$ given the assignments of the $x_i$
Until $\sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$ converges to a local minimum (a minimal code sketch follows the example figure below)

Example: $k = 2$

[Figure: k-means with $k = 2$: initialization + step 1, then alternating step 2 and step 1]
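To make the two alternating steps concrete, here is a minimal NumPy sketch of Lloyd's iteration; the function name, the random initialization, and the convergence tolerance are illustrative choices rather than anything prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal Lloyd-style k-means: alternate assignments and mean updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_obj = np.inf
    for _ in range(n_iter):
        # Step 1: assign each x_i to its closest center mu_j
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Step 2: re-evaluate each mu_j as the mean of its assigned points
        for j in range(k):
            if np.any(assign == j):
                mu[j] = X[assign == j].mean(axis=0)
        obj = d2[np.arange(len(X)), assign].sum()  # the k-means objective
        if prev_obj - obj < tol:                   # converged to a local minimum
            break
        prev_obj = obj
    return assign, mu
```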

Least Squares Quantization in PCM. Stuart P. Lloyd. 1982

Convergence of k-means

Let $J = \sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$

• Minimize $J$ with respect to $r$ (a discrete optimization): simply assigning each $x_i$ to its closest $\mu_j$ ensures that $\|x_i - \mu_j\|^2$ is minimized for each $x_i$, so let
$r_{ij} = 1$ if $j = \arg\min_{l=1,\dots,k} \|x_i - \mu_l\|_2^2$, and $r_{ij} = 0$ otherwise

• Minimize $J$ with respect to $\mu$ (set the derivative of $J$ with respect to $\mu_j$ to 0):
$-2\sum_{i=1}^{n} r_{ij}(x_i - \mu_j) = 0 \;\Rightarrow\; \mu_j = \frac{1}{n_j}\sum_{i=1}^{n} r_{ij}\,x_i$, where $n_j$ is the number of points assigned to cluster $j$

Distributed k-means

Given: $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and $k$
Objective: minimize $\sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$, where $r_{ij}$ is an indicator function

Initialize $\mu_1, \mu_2, \dots, \mu_k$ and keep them as a global variable
Repeat:
1. Map: find the optimal assignment of each $x_i$ to a center $\mu_j$ (the mapper outputs key = $j$, value = $x_i$)
2. Reduce: re-evaluate each $\mu_j$ given the assignments of the $x_i$ (a combiner emits partial sums of the $x_i$ sharing a key; the reducer takes a key $j$ and its list of values $x_i$ and computes a new $\mu_j$)
Until $\sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$ converges to a local minimum (a toy map/combine/reduce sketch follows)
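Below is a toy, single-process illustration of one map/combine/reduce round. This is not the implementation from the cited paper; the split layout, helper names, and use of plain Python dictionaries are assumptions made purely for illustration.

```python
import numpy as np
from collections import defaultdict

def map_phase(points, centers):
    """Mapper: emit (index of the closest center j, point x_i) pairs."""
    for x in points:
        j = int(np.argmin(((centers - x) ** 2).sum(axis=1)))
        yield j, x

def combine_phase(pairs):
    """Combiner: emit (j, (partial sum, count)) per key to cut communication."""
    acc = defaultdict(lambda: [0.0, 0])
    for j, x in pairs:
        acc[j][0] = acc[j][0] + x
        acc[j][1] += 1
    return {j: (s, c) for j, (s, c) in acc.items()}

def reduce_phase(partials_per_split):
    """Reducer: merge partial sums per key j and compute the new mu_j."""
    total = defaultdict(lambda: [0.0, 0])
    for partials in partials_per_split:
        for j, (s, c) in partials.items():
            total[j][0] = total[j][0] + s
            total[j][1] += c
    return {j: s / c for j, (s, c) in total.items()}

# one iteration over two data splits, k = 2 (toy data)
rng = np.random.default_rng(0)
splits = [rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))]
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
partials = [combine_phase(map_phase(split, centers)) for split in splits]
new_centers = reduce_phase(partials)  # dict: cluster index -> updated mean
```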

Parallel k-Means Using MapReduce. Zhao. 2009

Divide-and-conquer k-medians

Divide the $x_i$ into $l$ groups $\chi_1, \dots, \chi_l$
For each group $\chi_i$, find $O(k)$ centers
Assign each $x_i$ to its closest center
Weight each center by the number of points assigned to it
Let $\chi'$ be the set of weighted centers
Cluster $\chi'$ to find exactly $k$ centers (a small sketch follows)
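A rough sketch of the divide-and-conquer idea, assuming a Lloyd-style subroutine in place of a true k-medians solver; the chunking, iteration counts, and function names are illustrative choices, not the algorithm from the cited paper.

```python
import numpy as np

def local_centers(chunk, k, seed=0):
    """Cluster one chunk with a few Lloyd-style iterations; return the centers
    and a weight for each center (the number of points assigned to it)."""
    rng = np.random.default_rng(seed)
    mu = chunk[rng.choice(len(chunk), size=k, replace=False)].astype(float)
    for _ in range(10):
        assign = ((chunk[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
        mu = np.array([chunk[assign == j].mean(axis=0) if np.any(assign == j)
                       else mu[j] for j in range(k)])
    return mu, np.bincount(assign, minlength=k)

def divide_and_conquer(X, k, l=4):
    """Divide X into l groups, find O(k) weighted centers per group, then
    cluster the weighted centers down to exactly k final centers."""
    chunks = np.array_split(X, l)
    out = [local_centers(c, k, seed=i) for i, c in enumerate(chunks)]
    C = np.vstack([centers for centers, _ in out])
    w = np.concatenate([weights for _, weights in out]).astype(float)
    mu = C[:k].copy()
    for _ in range(10):   # weighted Lloyd pass on the set of weighted centers
        assign = ((C[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
        mu = np.array([np.average(C[assign == j], axis=0, weights=w[assign == j])
                       if np.any(assign == j) else mu[j] for j in range(k)])
    return mu
```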

Clustering Data Streams: Theory and Practice. Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani. 2001

Disadvantages of k-means
• Assumes clusters are spherical in shape
• Assumes clusters are of approximately equal size
• Assumes we know k ahead of time
• Sensitivity to outliers
• Converges to a local optimum of the objective

Addressing the k problem

• Assuming we know $k$ ahead of time

A parametric, probabilistic view of k-means

Assume that the points 푥푖 were generated from a mixture of 푘 (multivariate) Gaussian distributions:

$N(x \mid \mu, \Sigma) = \dfrac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\; e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}$

$p(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \Sigma)$, where $\sum_{k=1}^{K} \pi_k = 1$

[Figure: two Gaussian components with means $\mu_1$ and $\mu_2$]

Pattern Recognition and Machine Learning. Christopher M. Bishop. 2006

A parametric, probabilistic view of k-means

• What is the most likely set of parameters θ = {{휋푘}, {휇푘}} that generated the points 푥푖 ?

We want to maximize the likelihood 푝(푥|휃) => Solve by Expectation Maximization


Xu and Wunsch, Survey of Clustering Algorithms

DP-Means
• A probabilistic, nonparametric view of k-means
• Assumes a “countably infinite” mixture of Gaussians and attempts to estimate the parameters of these Gaussians and the way they are combined
• Leads to an intuitive, k-means-like algorithm

[Figure: data drawn from a countably infinite mixture; six components with means $\mu_1, \dots, \mu_6$ are shown]

Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis and Michael I. Jordan. 2012

DP-Means

Given: $x_1, \dots, x_n$ and a cluster penalty $\lambda$
Goal: minimize $\sum_{j=1}^{k}\sum_{i} r_{ij}\,\|x_i - \mu_j\|_2^2 + \lambda k$

Algorithm:
Initialize $k = 1$ and $\mu_1$ = the global mean; initialize $r_{i1} = 1$ for all $i = 1, \dots, n$
Repeat:
• For each $x_i$:
  • compute $d_{ij} = \|x_i - \mu_j\|_2^2$ for $j = 1, \dots, k$
  • if $\min_j d_{ij} > \lambda$, start a new cluster: set $k = k + 1$ and $\mu_k = x_i$
  • otherwise, assign $x_i$ to the cluster $\arg\min_j d_{ij}$
• Generate clusters $c_1, \dots, c_k$ from the assignments $r_{ij}$
• For each cluster, compute its mean $\mu_j$
(a minimal sketch follows)
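A minimal sketch of the loop above; the iteration count and the handling of clusters that become empty are illustrative choices, not part of Kulis and Jordan's pseudocode.

```python
import numpy as np

def dp_means(X, lam, n_iter=20):
    """DP-means sketch: Lloyd-style passes, but any point whose squared
    distance to every current mean exceeds lambda starts a new cluster."""
    X = np.asarray(X, dtype=float)
    mu = [X.mean(axis=0)]                       # k = 1, global mean
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            d = [((x - m) ** 2).sum() for m in mu]
            j = int(np.argmin(d))
            if d[j] > lam:                      # min_j d_ij > lambda: new cluster
                mu.append(x.copy())
                j = len(mu) - 1
            assign[i] = j
        # re-estimate the mean of each cluster (keep the old mean if it emptied)
        mu = [X[assign == j].mean(axis=0) if np.any(assign == j) else mu[j]
              for j in range(len(mu))]
    return assign, np.array(mu)
```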

Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis and Michael I. Jordan. 2012

Optimizing a different Global Objective: Spectral Clustering and Graph Cuts

Spectral Clustering

Key idea: use the eigenvectors of a graph Laplacian matrix 퐿(퐺) to cluster a graph

$L(G) = D - W = \begin{pmatrix} 3 & -1 & -1 & -1 & 0 \\ -1 & 2 & 0 & -1 & 0 \\ -1 & 0 & 3 & -1 & -1 \\ -1 & -1 & -1 & 4 & -1 \\ 0 & 0 & -1 & -1 & 2 \end{pmatrix}$ for the 5-vertex example graph $G$ shown on the slide.

Partitioning Sparse Matrices with Eigenvectors of Graphs. Pothen, Alex, Horst D. Simon, and Kang-Pu Liu. 1990
Normalized Cuts and Image Segmentation. Jianbo Shi and Jitendra Malik. 2001
A Tutorial on Spectral Clustering. Ulrike von Luxburg. 2007

Spectral Clustering Algorithm

Given: data points $x_i$, a similarity function $s(x_i, x_j)$ between them, and $k$
Objective: minimize the ratio cut (or normalized cut)

1. Construct the (normalized) graph Laplacian $L(G(V,E)) = D - W$
2. Find the $k$ eigenvectors corresponding to the $k$ smallest eigenvalues of $L$
3. Let $U$ be the $n \times k$ matrix of those eigenvectors
4. Treat the rows of $U$ as points $x_i'$ and use k-means to find $k$ clusters $C_1', \dots, C_k'$
5. Assign data point $x_i$ to the $j$th cluster if $x_i'$ was assigned to cluster $j$
(a sketch follows)
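A small sketch of the unnormalized variant, assuming a dense affinity matrix and reusing a tiny Lloyd-style loop for step 4; a production implementation would use a sparse eigensolver and typically the normalized Laplacian.

```python
import numpy as np

def spectral_clustering(W, k, n_iter=50, seed=0):
    """Embed vertices with the eigenvectors of L = D - W for the k smallest
    eigenvalues, then run a small k-means on the rows of the embedding."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                       # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigh: ascending order, L symmetric
    U = eigvecs[:, :k]                       # n x k spectral embedding
    rng = np.random.default_rng(seed)
    mu = U[rng.choice(len(U), size=k, replace=False)]
    for _ in range(n_iter):
        assign = ((U[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
        mu = np.array([U[assign == j].mean(axis=0) if np.any(assign == j)
                       else mu[j] for j in range(k)])
    return assign

# the 5-vertex example graph from the Laplacian slide (unit edge weights)
W = np.array([[0, 1, 1, 1, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 1, 1],
              [1, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
labels = spectral_clustering(W, k=2)
```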


A Tutorial on Spectral Clustering. Ulrike von Luxburg. 2007

Graph Cuts

A graph cut of $G(V, E)$ is the sum of the weights on a set of edges $S_E \subseteq E$ that cross between partitions $A_i$ of the graph:

$W(A_i, \bar{A_i}) = \sum_{x_i \in A_i,\, x_j \in \bar{A_i}} w_{ij}$, and $cut(A_1, \dots, A_k) = \frac{1}{2}\sum_{i=1,\dots,k} W(A_i, \bar{A_i})$

Examples (see figure): two different partitions of the same graph give $cut(A_1, A_2) = 7$ and $cut(A_1, A_2) = 2$ (a small sketch follows)
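The sketch below just evaluates these quantities for a hand-picked 2-way partition of the 5-vertex example graph; the partition choice and helper names are arbitrary illustrations.

```python
import numpy as np

def cut_value(W, A):
    """Weight of edges crossing between vertex set A and its complement."""
    mask = np.zeros(len(W), dtype=bool)
    mask[np.asarray(A)] = True
    return W[mask][:, ~mask].sum()

def ratio_cut(W, parts):
    return sum(cut_value(W, A) / len(A) for A in parts)

def ncut(W, parts):
    d = W.sum(axis=1)                      # vol(A) = sum of degrees in A
    return sum(cut_value(W, A) / d[list(A)].sum() for A in parts)

# the same 5-vertex example graph, partitioned as {1,2} vs {3,4,5} (0-indexed)
W = np.array([[0, 1, 1, 1, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 1, 1],
              [1, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
parts = [[0, 1], [2, 3, 4]]
print(cut_value(W, parts[0]), ratio_cut(W, parts), ncut(W, parts))
```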

Normalized Cuts

$Ratiocut(A_1, A_2, \dots, A_k) = \sum_{i=1,\dots,k} \frac{cut(A_i, \bar{A_i})}{|A_i|}$, $\quad Ncut(A_1, A_2, \dots, A_k) = \sum_{i=1,\dots,k} \frac{cut(A_i, \bar{A_i})}{vol(A_i)}$

Example (see figure): $Ratiocut(A_1, A_2) = 0.226$ and $Ncut(A_1, A_2) = 0.063$

Relationship between Eigenvectors and Graph Cuts

For some subset of vertices $A$, let $f = (f_1, \dots, f_n)^T \in \mathbb{R}^n$, such that:

$f_i = \sqrt{|\bar{A}|/|A|}$ if $v_i \in A$, and $f_i = -\sqrt{|A|/|\bar{A}|}$ if $v_i \in \bar{A}$

Then $f^T L f = |V| \cdot RatioCut(A, \bar{A})$, $f \perp \mathbf{1}$ (i.e., $f \cdot \mathbf{1} = 0$), and $\|f\| = \sqrt{n}$

Therefore, minimizing RatioCut is equivalent to minimizing $f^T L f$ subject to $f \perp \mathbf{1}$, $\|f\| = \sqrt{n}$, and $f_i$ as defined above.

A Tutorial on Spectral Clustering. Ulrike von Luxburg. 2007

Discrete optimization:

$\min_{f \in \mathbb{R}^n} f^T L f$ subject to $f \perp \mathbf{1}$, $\|f\| = \sqrt{n}$, and $f_i$ as defined previously

Relaxed optimization:

$\min_{f \in \mathbb{R}^n} f^T L f$ subject to $f \perp \mathbf{1}$, $\|f\| = \sqrt{n}$

Since $\mathbf{1}$ is the first eigenvector of $L$, the solution is given by the Rayleigh-Ritz theorem, and the optimal solution to the relaxed optimization problem is the second eigenvector of $L$.

But this solution is real-valued, so we need to transform the values of 푓 back into discrete values

A Tutorial on Spectral Clustering. Ulrike von Luxburg. 2007

Spectral Clustering: key ideas

• The 푘 real eigenvectors of the (normalized) graph Laplacian are the optimal solutions to a relaxed version of the ratio (or normalized) cut problem • An approximate solution is found by mapping the values of the indices in the 푘 real eigenvectors back to the discrete set (1, … , 푘)

• Advantage: works well in practice
• Disadvantages:
  • computationally expensive (need to solve for eigenvectors)
  • the approximate solution is not guaranteed to be close to the true solution

Power Iteration Clustering (PIC)
• A large-scale extension of spectral clustering
• Key idea: use power iteration on $I - D^{-1}L = D^{-1}W$ until the iterate is (approximately) a linear combination of the eigenvectors associated with the $k$ smallest eigenvalues of $D^{-1}L$
• Applicable to large-scale text-document classification; works well for small values of $k$: if $W = FF^T$, then $(D^{-1}W)v = D^{-1}F(F^T v)$ can be applied without ever forming $W$
(a sketch follows)
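A compact sketch of the power-iteration embedding, following the stopping rule in the backup slide (stop when the change in δ flattens out); the starting vector, the normalization, and the simple thresholding step at the end are common choices rather than the authors' exact implementation.

```python
import numpy as np

def pic_embedding(W, n_iter=50, eps=1e-5):
    """Power iteration clustering sketch: repeatedly apply the row-normalized
    affinity matrix D^{-1} W and stop early; the early-stopped iterate serves
    as a one-dimensional spectral embedding of the vertices."""
    d = W.sum(axis=1)
    P = W / d[:, None]                      # D^{-1} W
    v = d / d.sum()                         # degree-based starting vector
    prev_delta = None
    for _ in range(n_iter):
        v_new = P @ v
        v_new = v_new / np.abs(v_new).sum() # keep the iterate normalized
        delta = np.abs(v_new - v).sum()
        if prev_delta is not None and abs(prev_delta - delta) < eps:
            break                           # delta_{t+1} - delta_t ~ 0
        prev_delta, v = delta, v_new
    return v

# cluster the embedding (here: a simple 1-D threshold standing in for k-means)
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
v = pic_embedding(W)
labels = (v > np.median(v)).astype(int)
```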

Power Iteration Clustering. Lin and Cohen. 2010; A Very Fast Method for Clustering Big Text Datasets. Lin and Cohen. 2010

Addressing the shape problem

Assuming clusters are spherical in shape does not fit many datasets (see figure).

Hierarchical Clustering
• Produces a hierarchy of clusterings
• Agglomerative or divisive

Density-based Clustering
• Key idea: clusters are sets of data points in regions of high density

Hierarchical Clustering: Single Linkage

• Single Linkage [Sibson72], [Gower69]

[Figure: single-linkage dendrogram over 10 points, with merge heights $h_1 = 1.2$, $h_2 = 1.4$, $h_3 = 1.5$, $h_4 = 1.9$, $h_5 = 2.1$, $h_6 = 2.6$, $h_7 = 2.9$, $h_8 = 3.4$, $h_9 = 3.6$]
(a sketch follows)
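For completeness, single linkage is readily available in SciPy; the toy data below is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

# single linkage: cluster distance = smallest pairwise distance between members
Z = linkage(pdist(X), method="single")           # (n-1) x 4 merge table with heights
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
```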

Density-Based Clustering

[Figure: examples of density-based clusters, labeled A, B, and D]

DBSCAN – Algorithm

Given: points $x_i$ in a dataset $\mathcal{X}$, a radius $\varepsilon$, and MinPts
For each $x_i$ in $\mathcal{X}$:
  if $x_i$ is not yet assigned to a cluster, build a cluster around $x_i$ using $\varepsilon$ and MinPts

Building clusters:

All points that are density-reachable from $x_i$ are added to the cluster containing $x_i$ (a minimal sketch follows)
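A brute-force sketch of the cluster-growing step described above; real implementations use a spatial index (e.g., an R*-tree) for the ε-neighborhood queries, and the noise handling here is simplified.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: grow a cluster around each unvisited core point
    by repeatedly adding all points that are density-reachable from it.
    Returns labels; -1 marks noise. (Brute-force neighborhoods, O(n^2).)"""
    n = len(X)
    labels = np.full(n, -1)
    d2 = ((X[:, None] - X[None]) ** 2).sum(axis=2)
    neighbors = [np.flatnonzero(d2[i] <= eps ** 2) for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                       # already assigned, or not a core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:   # j is a core point: expand
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels
```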

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Ester et al. KDD 1996

[Figure: example of density reachability: the $\varepsilon$-neighborhoods of $x_i$ and $x_j$, with MinPts = 4]

Parallel DBSCAN
• DBSCAN using the disjoint-set data structure:
  • Initially, each point is its own disjoint set
  • For each point not yet assigned to a cluster, merge its disjoint set with the disjoint sets of all clusters in its $\varepsilon$-neighborhood
• In parallel:
  • Merge all local disjoint sets that satisfy density reachability, keeping a list of nonlocal vertices that should be merged with local vertices
  • Then merge across threads using a union-lock (shared memory)

A New Scalable Parallel DBSCAN Algorithm Using the Disjoint-Set Data Structure. Patwary et al. 2012

BIRCH Algorithm

• Goal: cluster large datasets efficiently and handle outliers
• Dynamically build a tree in which leaf nodes are clusters with all points within distance t, and all non-leaf nodes are unions of their child nodes
• Then cluster the leaf nodes using an agglomerative clustering algorithm

BIRCH

• Key idea: store important cluster information in cluster feature vectors 퐶퐹푖 for each cluster 푖

$CF_i = (N_i, LS_i, SS_i)$

where $N_i$ is the number of points in cluster $i$, $LS_i = \sum_{j=1}^{N_i} x_j$ is their linear sum, and $SS_i = \sum_{j=1}^{N_i} x_j^2$ is their squared sum (a sketch follows)
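A small sketch of CF vectors and their additivity (used again in the backup slide on CF utility); storing SS as a single scalar sum of squares is one common convention and is assumed here.

```python
import numpy as np

def cf(points):
    """Cluster feature of a set of points: (N, linear sum, sum of squares)."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum()

def cf_merge(cf_a, cf_b):
    """CF vectors are additive, so clusters can be merged without the raw points."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

def cf_centroid(cf_v):
    n, ls, _ = cf_v
    return ls / n

def centroid_distance(cf_a, cf_b):
    """Distance between two cluster centroids, computed from the CFs alone."""
    return np.linalg.norm(cf_centroid(cf_a) - cf_centroid(cf_b))

rng = np.random.default_rng(0)
a, b = cf(rng.normal(0, 1, (100, 2))), cf(rng.normal(3, 1, (80, 2)))
merged = cf_merge(a, b)            # no need to revisit the original points
print(centroid_distance(a, b), cf_centroid(merged))
```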

BIRCH: An Efficient Data Clustering Method for Very Large Databases. Zhang et al. 1996

The CF-tree

[Figure: a CF-tree. The root stores entries $[\mathbf{CF}_1, child_1], [\mathbf{CF}_2, child_2]$, where each root entry's CF is the sum of the CFs in its child node ($\mathbf{CF}_i = CF_1 + CF_2 + CF_3 + CF_4$); non-leaf nodes store entries $[CF_1, child_1], \dots, [CF_4, child_4]$; leaf nodes store entries $[CF_1], [CF_2], \dots, [CF_L]$]

CURE

• Attempts to address the inability of k-means to find clusters of arbitrary shape and size and the computational complexity of hierarchical algorithms • Key idea: store a “well-scattered” set of points to capture the size and shape of the cluster

Cure: An Efficient Clustering Algorithm for Large Databases. Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. 2001

CURE Algorithm

• Input: $k$, the number of clusters
• Draw a random sample of the points
• Each point is its own cluster initially
• Each cluster stores a set of representative points and a mean
• Build a kd-tree of all clusters
• Use the tree to build a heap that stores $u$.closest for each cluster $u$
• While size(heap) > $k$:
  • Merge the two closest clusters in the heap
  • Update the representative points in each cluster
  • Update the tree and the heap
• Merging step:
  • The new mean is the mean of the means of the clusters being merged
  • Select c well-scattered points based on their distance from the new mean
  • Shrink each representative point toward the mean
(a sketch of the representative-point selection follows)
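A sketch of the representative-point part of the merge step: pick c well-scattered points and shrink them toward the mean. The farthest-point heuristic, the defaults c = 4 and α = 0.3, and the use of the merged points' mean (rather than the mean of the two cluster means) are simplifications for illustration.

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    """Pick c well-scattered points for one cluster (farthest-point heuristic
    starting from the point farthest from the mean), then shrink them toward
    the mean by a factor alpha."""
    P = np.asarray(points, dtype=float)
    mean = P.mean(axis=0)
    reps = [P[np.argmax(np.linalg.norm(P - mean, axis=1))]]
    while len(reps) < min(c, len(P)):
        # next representative: the point farthest from the current set
        d = np.min([np.linalg.norm(P - r, axis=1) for r in reps], axis=0)
        reps.append(P[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (mean - reps)      # shrink toward the mean

def merge_clusters(A, B, c=4, alpha=0.3):
    """Merge-step sketch: the new cluster's points, mean, and representatives."""
    merged = np.vstack([A, B])
    return merged, merged.mean(axis=0), cure_representatives(merged, c, alpha)
```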

[Figures: CURE on example data: the initial data, a random sample, nearest neighbors among the sampled points, the nearest-neighbor heap keyed by $d(u, u.\text{closest})$, and the result of a merge]

Advantages & Disadvantages of CURE

• Advantages:
  • Finds arbitrarily shaped clusters
  • The sample size needed relative to the dataset size is small if clusters are well-separated
  • Reduces sensitivity to outliers due to the shrinkage of representative points
• Disadvantages:
  • $O(n^2 \log n)$ worst-case time
  • “a middle ground between centroid-based and all-point extremes”

Highly Application-Specific Clustering Methods

• Multilevel Schemes for load balancing

• Modularity Clustering for community detection Multilevel 푘-way partitioning for load balancing

• Goal: Partition a graph into 푘 clusters of approximately equal size • Key Idea: First, condense the graph into a smaller graph, partition/cluster the condensed graph, then expand the small graph back out and refine it

A Multilevel Algorithm for Partitioning Graphs. Hendrickson and Leland. 1995

Multilevel partitioning example

[Figures: multilevel partitioning example: a small weighted graph is coarsened by finding a maximal matching and collapsing matched vertices, the coarsest graph is partitioned, and the partition is refined and expanded back to the original graph]

Distributed Multilevel Partitioning
• Coarsening stage:
  • First, compute a graph coloring using Luby's algorithm
  • Next, for all vertices of the same color in parallel: each vertex selects a match using the heavy-edge heuristic
  • After all vertices select a neighbor, synchronize
  • Coarsening ends when there are O(p) vertices
• Partitioning stage:
  • Recursive bisection, based on nested dissection and greedy partitioning
• Refinement stage:
  • Greedy refinement is used on all vertices of the same color in parallel, based on the gain of each vertex

Parallel Multilevel k-Way Partitioning Scheme for Irregular Graphs. George Karypis and Vipin Kumar. 1999

Multi-Threaded Multilevel Partitioning

• Each thread owns a subset of vertices • Coarsening Stage • “unprotected matching”: let M be a shared vector of “matchings” • Each thread populates M as it finds a match for its local vertex, writes are done unprotected • After all threads finish, each thread corrects the piece of M corresponding to its slice of vertices • Partitioning Stage • “parallel k-sectioning”: each thread computes a 푘-partitioning and the one with best edge cut is selected • Refinement Stage • “coarse-grained refinement”: each thread keeps a local queue of vertices sorted by gain values • After each thread has moved a fixed number of vertices from the priority queue, all threads communicate and moves are undone until the potential partition weights would result in a balanced partitioning

Multi-Threaded Graph Partitioning. Dominique LaSalle and George Karypis. 2013

Modularity Clustering

• Key idea: find groups of vertices in a graph that have more edges than expected

Modularity and community structure in networks. M.E.J. Newman. 2006

Modularity

• The expected number of edges between vertices $v_i$ and $v_j$ with degrees $d_i$ and $d_j$ is $\frac{d_i d_j}{2m}$, where $m = \frac{1}{2}\sum_i d_i$
• Suppose we have divided the network into clusters $C_1, \dots, C_l$
• The modularity can be expressed as $Q = \sum_{v_i, v_j \in C_k} \left( A_{ij} - \frac{d_i d_j}{2m} \right)$, where $A$ is the adjacency matrix and the sum is over all $v_i$ and $v_j$ that fall in the same cluster $C_k$ (a sketch follows)
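A direct evaluation of Q for a toy graph is sketched below; it includes the conventional 1/(2m) prefactor from Newman's definition, which the slide's expression omits but which does not change which clustering maximizes Q.

```python
import numpy as np

def modularity(A, labels):
    """Q = (1/2m) * sum over same-cluster pairs of (A_ij - d_i d_j / 2m)."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    two_m = d.sum()
    same = np.equal.outer(labels, labels)      # pairs in the same cluster
    return ((A - np.outer(d, d) / two_m)[same]).sum() / two_m

# two triangles joined by a single edge: a clear two-community graph
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, [0, 0, 0, 1, 1, 1]))   # ~0.357, higher than a random split
```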

Modularity Clustering Algorithm
• Maximizing modularity for two clusters is equivalent to finding the principal eigenvalue/eigenvector pair of a “modularity matrix”
• Proceed by recursive bisection:
  • First, find the principal eigenvalue and eigenvector by power iteration
  • Next, in each successive split, find the principal eigenvector of the “generalized” modularity matrix
• Stop when the modularity is no longer positive

Modularity and community structure in networks. M.E.J. Newman. 2006

Advantages of Modularity Clustering

• A “natural” stopping criterion: we don’t need to know 푘 ahead of time

• Seems to be a suitable objective for community detection

Empirical Comparison of Algorithms for Network Community Detection. Jure Leskovec, Kevin J. Lang, Michael W. Mahoney

Parallel Modularity Clustering

• Agglomerative clustering scheme: • Each vertex starts as its own ``community” • Communities are merged based on their potential contribution to the modularity score • Merging is done in parallel using a maximal matching on potential contributions of merging particular edges

Parallel community detection for massive graphs. Riedy, E. Jason, Henning Meyerhenke, David Ediger, and David A. Bader. 2012

Frameworks
• CombBLAS
  • The Combinatorial BLAS: Design, Implementation, and Applications. Aydin Buluc and John Gilbert. 2011
  • A set of primitives for graph computations based on linear algebra
  • Applicable to: k-means clustering, power iteration clustering, Markov clustering
• GraphLab
  • GraphLab: A New Parallel Framework for Machine Learning. Low, Yucheng, et al. 2010
  • A large-scale library for machine learning
  • Smart scheduling and relaxed data/computational consistency
  • A graph-based data model that simultaneously represents data and computational dependencies
  • Good for: Gibbs sampling, and hence a “fuzzy” DP-means
• ParMetis
  • ParMetis: Parallel Graph Partitioning and Sparse Matrix Ordering Library. Karypis, George, Kirk Schloegel, and Vipin Kumar. 2003
  • An MPI-based parallel library that implements partitioning algorithms for unstructured graphs and meshes
  • Multilevel partitioning schemes

Summary: Popular Clustering Techniques
• k-means: k-means/k-medians/k-medoids; better, bigger k-means
• Spectral: optimizing Ncut; PIC
• Hierarchical: single linkage; BIRCH; CURE
• Density-based: DBSCAN; parallel DBSCAN
• Multilevel: Metis; ParMetis; multithreaded Metis
• Modularity: community detection

Conclusion and Future work

“The tradeoff among different criteria and methods is still dependent on the applications themselves.” A Comprehensive Overview of Basic Clustering Algorithms. Xu & Wunsch. 2005

Future Work:
• Large-scale nonparametric methods – don't specify $k$ to start with
• Decomposing the similarity function in genetic mapping into an inner product in order to use it for PIC
• Parallel DP-Means

Thank you

John Gilbert, Linda Petzold, Xifeng Yan

Aydin Buluc, Stefanie Jegelka, Joseph Gonzalez

Adam Lugowski, Kevin Deweese

References

Lloyd, Stuart P. “Least Squares Quantization in PCM.“IEEE Transactions on Information Theory. Vol. 28(2). 1982

Pothen, Alex, Horst D. Simon, and Kang-Pu Liu. “Partitioning Sparse Matrices with Eigenvectors of Graphs.” SIAM Journal on Matrix Analysis and Applications 11, no. 3: pp.430-452. 1990

Hendrickson, B., & Leland, R. W. (1995). “A Multi-Level Algorithm For Partitioning Graphs.” SC, 95, 28.

Karypis, G., Schloegel, K., & Kumar, V. (2003). “ParMetis: Parallel graph partitioning and sparse matrix ordering library.” Manual, Version 3.1.

Kleinberg, J. (2003). “An impossibility theorem for clustering.” Advances in neural information processing systems, 463-470.

Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol. 1, p. 740). Chapters 1, 2, 9. New York: Springer.

Newman, M. E. (2006). “Modularity and community structure in networks.” Proceedings of the National Academy of Sciences, 103(23), 8577-8582.

Ben-David, S., & Ackerman, M. (2008). “Measures of clustering quality: A working set of axioms for clustering.” In Advances in neural information processing systems (pp. 121-128).

Von Luxburg, U. (2007). “A tutorial on spectral clustering.” Statistics and computing, 17(4), 395-416.

Jain, A. K. (2010). “Data clustering: 50 years beyond K-means.” Pattern Recognition Letters, 31(8), 651-666.

Leskovec, J., Lang, K. J., & Mahoney, M. (2010). Empirical comparison of algorithms for network community detection. In Proceedings of the 19th international conference on World wide web (pp. 631-640). ACM.

Lin, F., & Cohen, W. W. (2010). “Power iteration clustering.” In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 655-662).

Lin, F., & Cohen, W. W. (2010). “A Very Fast Method for Clustering Big Text Datasets.” In ECAI (pp. 303-308).

Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., & Hellerstein, J. M. (2010). Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990.

Buluç, A., & Gilbert, J. R. (2011). “The Combinatorial BLAS: Design, implementation, and applications.” International Journal of High Performance Computing Applications, 25(4), 496-509.

Riedy, E. Jason, Henning Meyerhenke, David Ediger, and David A. Bader. “Parallel community detection for massive graphs.” In Parallel Processing and Applied Mathematics, pp. 286-296. Springer Berlin Heidelberg, 2012.

Backup Slides

Power Iteration Clustering (PIC)

• Large-scale extension to spectral clustering
• Key ideas: power iteration and the spectral gap
• Power iteration: since $D^{-1}L = I - D^{-1}S$, iterate $v^{t+1} = \dfrac{(I - D^{-1}L)\,v^t}{\|(I - D^{-1}L)\,v^t\|}$
• Let $\delta^{t+1} = |v^{t+1} - v^t|$ and stop when $\delta^{t+1} - \delta^t \approx 0$
• Applicable to large-scale text-document classification; works well for small values of $k$: if $S = FF^T$, then $(I - D^{-1}S)v = v - D^{-1}F(F^T v)$

Power Iteration Clustering. Lin and Cohen. 2010; A Very Fast Method for Clustering Big Text Datasets. Lin and Cohen. 2010

Utility of CF vectors in BIRCH

Because we have the additivity property of CF vectors:

$CF_i + CF_j = (N_i + N_j,\; LS_i + LS_j,\; SS_i + SS_j)$

We can easily compute updates to the mean, radius, or diameter of a CF vector, and we can use CF vectors to compute distances between centroids and incoming points.

Example: the distance between the centroid $\mu_i$ of cluster $i$ and the centroid $\mu_j$ of cluster $j$ is
$\|\mu_i - \mu_j\| = \left[ \left( \frac{LS_i}{N_i} - \frac{LS_j}{N_j} \right)^T \left( \frac{LS_i}{N_i} - \frac{LS_j}{N_j} \right) \right]^{1/2}$

Multilevel k-Way: Matching & Coarsening Step

Randomly select a vertex and match it with the neighbor that shares the heaviest edge (“heavy edge matching”) (a sketch follows)
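A serial sketch of heavy-edge matching and one coarsening step on a dense weight matrix; the random visiting order and the O(n^2) contraction loop are simplifications of what Metis actually does.

```python
import numpy as np

def heavy_edge_matching(W, seed=0):
    """Visit vertices in random order and match each unmatched vertex with its
    unmatched neighbor of heaviest edge weight; -1 means left unmatched."""
    n = len(W)
    match = -np.ones(n, dtype=int)
    rng = np.random.default_rng(seed)
    for u in rng.permutation(n):
        if match[u] != -1:
            continue
        nbrs = [v for v in range(n) if v != u and W[u, v] > 0 and match[v] == -1]
        if nbrs:
            v = max(nbrs, key=lambda v: W[u, v])   # heaviest incident edge
            match[u], match[v] = v, u
    return match

def coarsen(W, match):
    """Collapse matched pairs into super-vertices; crossing edge weights are summed."""
    n = len(W)
    group = -np.ones(n, dtype=int)
    g = 0
    for u in range(n):
        if group[u] == -1:
            group[u] = g
            if match[u] != -1:
                group[match[u]] = g
            g += 1
    Wc = np.zeros((g, g))
    for u in range(n):
        for v in range(n):
            if group[u] != group[v]:
                Wc[group[u], group[v]] += W[u, v]
    return Wc, group
```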

[Figure: heavy-edge matching on a small weighted example graph and the resulting coarsened graph]

Multilevel k-way Partitioning Scheme for Irregular Graphs. George Karypis and Vipin Kumar. 1998

Multilevel k-Way: Partition Step

Use spectral clustering to find 푘 clusters

[Figure: the coarsened graph partitioned into k parts]

Multilevel k-Way: Uncoarsening and Refinement Step
• Expand the vertices and edges back out to the finer graph
• Use the gain heuristic to move vertices one at a time, and keep the configuration that gives the best balanced partitioning
• The gain of a vertex $v$ is the net reduction in the weight of cut edges that would result from moving $v$ from partition A to partition B
• Example (see figure): the gain of moving vertex 3 to partition B is $gain(v_3) = 2 - 4 = -2$

Generative Process specification
• A convention for specifying a probability distribution in a more compact way, by leaving out certain conditional independence properties in the specification; these properties are implied
• At each step, the distribution depends only on the previously defined random variables
• Example of a generative process specification:
  $X_1, X_2 \sim Bernoulli(\tfrac{1}{2})$
  $X_3 \sim N(X_1 + X_2, \sigma^2)$
  $X_4 \sim N(aX_2 + b, 1)$
  $X_5 = 1$ if $X_4 \ge 0$, else $X_5 = 0$

This specification means that if $X$ is the vector $X = (X_1, \dots, X_5)$, then it respects the graphical model in which $X_3$ depends on $X_1$ and $X_2$, $X_4$ depends on $X_2$, and $X_5$ depends on $X_4$. In other words, the multivariate distribution on $X$ respects the following conditional independence properties:

$p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\,p(x_2)\,p(x_3 \mid x_1, x_2)\,p(x_4 \mid x_2)\,p(x_5 \mid x_4)$

A Gaussian Process

We can equivalently look at a mixture of Gaussians as a generative model: choose a cluster, then choose a point from that cluster.

A real-valued stochastic process $\{X_t,\, t \in T\}$, where $T$ is an index set, is a Gaussian process if for any choice of distinct values $t_1, \dots, t_k \in T$, the random vector $X = (X_{t_1}, \dots, X_{t_k})'$ has a multivariate normal distribution with mean $\mu = E[X]$ and covariance matrix $\Sigma = cov(X, X)$.

Example: $T = \{1,2,3,4,5,6\}$, $X = (X_1, X_2, X_3, X_4)$ with $X_1, X_2 \sim N(\tfrac{1}{2}, \sigma^2)$, $X_3 \sim N(X_1 + X_2, \sigma^2)$, and $X_4 \sim N(aX_2 + b, 1)$. Then
$p(X = x) = p(x_1, x_2, x_3, x_4) = p(x_1)\,p(x_2)\,p(x_3 \mid x_1, x_2)\,p(x_4 \mid x_2) = N(E[X], cov(X, X))$

A Dirichlet Process

• In Bayesian modelling, we assume we have an underlying and unknown distribution that we wish to infer given some observed data. We can do this by assuming a prior, and then computing the posterior given some data using that: posterior ∝ likelihood × prior. But constraining the prior to be parametric limits the types of inferences we can make. The nonparametric approach solves this problem, but we have to choose a prior that makes the posterior calculation tractable. • The Dirichlet Process is a stochastic process used in Bayesian nonparametric models of data. Each draw from a Dirichlet process is itself a distribution. The Dirichlet process has Dirichlet distributed finite dimensional marginal distributions. Distributions drawn from a Dirichlet process are discrete, but cannot be described using a finite number of parameters, “hence the classification as a nonparametric model.” A Dirichlet Mixture

Assume $G_0$ is a prior distribution over the means of the Gaussians: $\mu_1, \mu_2, \dots, \mu_k \sim G_0$
The mixing coefficients are distributed according to a Dirichlet distribution: $\pi \sim Dir(k, \pi_0)$
We choose one of the clusters from a discrete distribution over $\pi$, such as a multinomial distribution: $z_1, \dots, z_n \sim Discrete(\pi)$
Finally, we choose each $x_i$ according to the Gaussian with mean $\mu_{z_i}$: $x_1, \dots, x_n \sim N(\mu_{z_i}, \sigma I)$
“A way to view the DP mixture model is to take the limit of the above model as $k \to \infty$, when choosing $\pi_0$ to be $\frac{\alpha}{k}\mathbf{1}$.”
• The Dirichlet distribution is commonly used as a prior distribution over $\theta$ when $p(x \mid \theta)$ is a multinomial distribution over $k$ values (it is the conjugate prior to the multinomial)
• Computing $E(X)$ from $p$ is difficult/intractable/inefficient
• Instead, approximate $E(f(X))$ using an MCMC method such as Gibbs sampling
  • Construct a Markov chain that has $p$ as its stationary distribution
  • The ergodic theorem says: if we run the chain long enough, it converges to the stationary distribution, so the end state is approximately a draw from the true distribution
  • Ergodic theorem + irreducibility: use the sample mean over all states the Markov chain visits to approximate the expectation

Pattern Recognition and Machine Learning. Christopher Bishop; Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Kulis and Jordan

A clustering function

A clustering function takes a distance function d on a set 푆 of 푛 ≥ 2 points and outputs a clustering Γ

Example: $S = \{1,2,3,4,5,6,7\}$, $f(d) = \Gamma = \{\{1,2,3\}, \{4,5,6,7\}\}$

[Figure: the seven points with distance $d(1,2)$ marked, and the clustering returned by $f$]

An Impossibility Theorem for Clustering. Jon Kleinberg. 2003

Scale Invariance: $f(d) = f(\alpha \cdot d)$

$f(d_1) = \Gamma = \{\{1,2,3\}, \{4,5,6,7\}\}$ and $f(\alpha \cdot d_1) = \Gamma = \{\{1,2,3\}, \{4,5,6,7\}\}$

[Figure: the same seven points under the distances $d_1$ and $d_2 = \alpha \cdot d_1$]

Consistency

$f(d) = \Gamma = \{\{1,2,3\}, \{4,5,6,7\}\}$ and $f(d') = \Gamma = \{\{1,2,3\}, \{4,5,6,7\}\}$

[Figure: the seven points under $d$ (with $d(1,2)$ and $d(1,5)$ marked) and under the transformed distance $d'$]

Richness:

Range(푓) is equal to the set of all partitions of S.

[Figure: several different partitions of the same seven points, illustrating that every partition of $S$ is in the range of $f$]

Impossibility Theorem for Clustering

A clustering function should have the three following properties:

scale invariance, consistency, and richness

Theorem: For each 푛 ≥ 2, there is no clustering function that satisfies scale invariance, consistency, and richness

“…there is no solution that simultaneously satisfies a small collection of simple properties” – Jon Kleinberg, An Impossibility Theorem for Clustering. 2003

Possibility of Clustering

Instead of thinking about clustering functions, let's think about clustering results: what makes a good clustering?

There exist clustering quality measures that satisfy scale invariance, consistency, and richness, e.g. a measure $m$ with $m(\{\{1,2,3\}, \{4,5,6,7\}\}) = \alpha \in \mathbb{R}$

[Figure: the seven-point example clustering being assigned a quality score]

Measures of Clustering Quality: A Working Set of Axioms for Clustering. Margareta Ackerman and Shai Ben-David. 2009

Modularity

• Suppose we have divided the network into two groups
• The expected number of edges between vertices $v_i$ and $v_j$ with degrees $d_i$ and $d_j$ is $\frac{d_i d_j}{2m}$, where $m = \frac{1}{2}\sum_i d_i$
• The modularity is defined as $Q = \sum_{i,j} \left( A_{ij} - \frac{d_i d_j}{2m} \right)$, where $A$ is the adjacency matrix and the sum is over all $v_i$ and $v_j$ that fall in the same group
• Equivalently, $Q = \frac{1}{4m}\, s^T B s$, where $s_i = 1$ if $v_i$ is in group 1, $s_i = -1$ if $v_i$ is in group 2, and $B_{ij} = A_{ij} - \frac{d_i d_j}{2m}$
• We can therefore write the modularity as $Q = \frac{1}{4m}\sum_{i=1,\dots,n} (u_i^T s)^2 \lambda_i$, where the $u_i$ and $\lambda_i$ are the eigenvectors and eigenvalues of $B$

• An MCMC method:
  • a tool for sampling from, and computing expectations with respect to, very complicated and high-dimensional probability distributions
  • MCMC constructs a Markov chain whose stationary distribution is the complicated distribution
• Can be used in DP-means for expectation maximization