Large Scale Clustering
Clustering at Large Scales
Veronika Strnadová, CSC Lab

What is clustering?
Finding groups of highly similar objects in unlabeled datasets, or: finding groups of highly connected vertices in a graph.
(Slide figure: three example clusters C1, C2, C3.)

Applications
• Image Segmentation: Normalized Cuts and Image Segmentation. Jianbo Shi and Jitendra Malik. 2001
• Load Balancing in Parallel Computing: Multilevel k-way Partitioning Scheme for Irregular Graphs. George Karypis and Vipin Kumar. 1998
• Genetic Mapping (slide figure: two linkage groups LG1, LG2): Computational approaches and software tools for genetic linkage map estimation in plants. Jitendra Cheema and Jo Dicks. 2009
• Community Detection: Modularity and community structure in networks. Mark E.J. Newman. 2006

Clustering in Genetic Mapping
"The problem of genetic mapping can essentially be divided into three parts: grouping, ordering, and spacing" (Computational approaches and software tools for genetic linkage map estimation in plants. Cheema. 2009)

Challenges in the large scale setting
Big Data: many popular clustering algorithms are inherently difficult to parallelize, and inefficient at large scales.
• "The computational issue becomes critical when the application involves large-scale data" (Jain, Data Clustering: 50 Years Beyond k-means)
• "There is no clustering algorithm that can be universally used to solve all problems" (Xu & Wunsch, A Comprehensive Overview of Basic Clustering Algorithms)

Popular Clustering Techniques
• k-means: k-means / k-medians / k-medoids; better, bigger k-means
• Spectral: optimizing NCut; PIC
• Hierarchical: single linkage; BIRCH; CURE
• Density-based: DBSCAN; parallel DBSCAN
• Multilevel: Metis; ParMetis; multithreaded Metis
• Modularity: community detection

k-means
• "…simpler methods such as k-means remain the preferred choice in many large-scale applications." (Kulis and Jordan, Revisiting k-means: New Algorithms via Bayesian Nonparametrics. 2012)
• "The k-means algorithm is very simple and can be easily implemented in solving many practical problems." (Xu and Wunsch, Survey of Clustering Algorithms. 2005)
• "In spite of the fact that k-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, k-means is still widely used." (Jain, Data Clustering: 50 Years Beyond k-means)

k-means
Given: $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and $k$
Objective: minimize $\sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$, where $r_{ij}$ is an indicator function
Initialize $\mu_1, \mu_2, \dots, \mu_k$
Repeat:
1. Find the optimal assignment of each $x_i$ to a mean $\mu_j$
2. Re-evaluate each $\mu_j$ given the assignments of the $x_i$
Until $\sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$ converges to a local minimum
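The following is a minimal NumPy sketch of the k-means loop above (Lloyd's algorithm). The random-sample initialization, the iteration cap, and the convergence tolerance are choices made for illustration and are not prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's algorithm: alternate the assignment and mean-update steps."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialization: sample k distinct data points as the initial means.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    prev_obj = np.inf
    for _ in range(max_iter):
        # Step 1: assign each x_i to its closest mean (the indicator r_ij).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # n x k squared distances
        assign = d2.argmin(axis=1)
        # Step 2: re-evaluate each mu_j as the mean of its assigned points.
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                mu[j] = members.mean(axis=0)
        # Stop once the objective stops decreasing (a local minimum).
        obj = d2[np.arange(len(X)), assign].sum()
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return assign, mu
```

For example, `assign, mu = kmeans(np.random.randn(1000, 2), k=3)` partitions 1000 random 2-D points into 3 clusters.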
Example: k = 2
(Slide figures: an initialization followed by alternating applications of Step 1 and Step 2.)
Least Squares Quantization in PCM. Stuart P. Lloyd. 1982

Convergence of k-means
Let $J = \sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$.
• Minimize $J$ with respect to $r$ (a discrete optimization): simply assigning each $x_i$ to its closest mean ensures that $\|x_i - \mu_j\|_2^2$ is minimized for each $x_i$, so let $r_{ij} = 1$ if $j = \arg\min_{j' = 1,\dots,k} \|x_i - \mu_{j'}\|_2^2$, and $r_{ij} = 0$ otherwise.
• Minimize $J$ with respect to $\mu$ (set the derivative of $J$ with respect to $\mu_j$ to 0): $-2\sum_{i=1}^{n} r_{ij}(x_i - \mu_j) = 0 \;\Rightarrow\; \mu_j = \frac{1}{n_j}\sum_{i=1}^{n} r_{ij}\,x_i$, where $n_j = \sum_{i=1}^{n} r_{ij}$ is the number of points assigned to cluster $j$.

Distributed k-means
Given: $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and $k$
Objective: minimize $\sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$, where $r_{ij}$ is an indicator function
Initialize $\mu_1, \mu_2, \dots, \mu_k$ and keep them as a global variable
Repeat:
1. Map: find the optimal assignment of each $x_i$ to a mean $\mu_j$ (the map outputs key = $j$, value = $x_i$)
2. Reduce: re-evaluate $\mu_j$ given the assignments of the $x_i$ (a combiner emits partial sums of the $x_i$ sharing a key; the reducer takes a key $j$ and its list of values $x_i$ and computes a new $\mu_j$)
Until $\sum_{j=1}^{k}\sum_{i=1}^{n} r_{ij}\,\|x_i - \mu_j\|_2^2$ converges to a local minimum
Parallel k-Means Using MapReduce. Zhao. 2009

Divide-and-conquer k-medians
• Divide the points $x_i$ into $l$ groups $\chi_1, \dots, \chi_l$
• For each group $\chi_i$, find $O(k)$ centers
• Assign each $x_i$ to its closest center
• Weight each center by the number of points assigned to it
• Let $\chi'$ be the set of weighted centers
• Cluster $\chi'$ to find exactly $k$ centers
Clustering Data Streams: Theory and Practice. Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani. 2001

Disadvantages of k-means
• Assumes clusters are spherical in shape
• Assumes clusters are of approximately equal size
• Assumes we know $k$ ahead of time
• Sensitive to outliers
• Converges to a local optimum of the objective
(On the slides, each assumption is paired with a counterexample figure.)

Addressing the k problem
The assumption that we know $k$ ahead of time can be relaxed by taking a probabilistic view of k-means.

A parametric, probabilistic view of k-means
Assume that the points $x_i$ were generated from a mixture of $K$ (multivariate) Gaussian distributions:
$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
$p(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \Sigma)$, where $\sum_{k=1}^{K} \pi_k = 1$
(Slide figure: two Gaussian components with means $\mu_1$ and $\mu_2$.)
Pattern Recognition and Machine Learning. Christopher M. Bishop. 2006
• What is the most likely set of parameters $\theta = \{\{\pi_k\}, \{\mu_k\}\}$ that generated the points $x_i$? We want to maximize the likelihood $p(x \mid \theta)$, which can be solved by Expectation Maximization.
Xu and Wunsch, Survey of Clustering Algorithms

DP-Means
• A probabilistic, nonparametric view of k-means: assume a "countably infinite" mixture of Gaussians and attempt to estimate the parameters of these Gaussians and the way they are combined.
• This leads to an intuitive, k-means-like algorithm.
(Slide figure: six cluster means $\mu_1, \dots, \mu_6$.)
Given: $x_1, \dots, x_n$ and a penalty parameter $\lambda$
Goal: minimize $\sum_{j=1}^{k}\sum_{i} r_{ij}\,\|x_i - \mu_j\|_2^2 + \lambda k$
Algorithm:
Initialize $k = 1$, $\mu_1 =$ the global mean, and $r_{i1} = 1$ for all $i = 1, \dots, n$
Repeat:
• For each $x_i$:
  • compute $d_{ij} = \|x_i - \mu_j\|_2^2$ for $j = 1, \dots, k$
  • if $\min_j d_{ij} > \lambda$, start a new cluster: set $k = k + 1$ and $\mu_k = x_i$
  • otherwise, assign $x_i$ to its closest cluster: $r_{ij} = 1$ for $j = \arg\min_j d_{ij}$
• Generate clusters $c_1, \dots, c_k$ based on $r_{ij}$
• For each cluster $c_j$, compute $\mu_j$ as the mean of its assigned points
Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis and Michael I. Jordan. 2012
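Below is a minimal NumPy sketch of the DP-Means loop just described. The function name `dp_means`, the fixed number of outer passes, and the choice to leave an emptied cluster's mean unchanged are illustration-level assumptions rather than details given on the slides.

```python
import numpy as np

def dp_means(X, lam, n_passes=20):
    """DP-Means: k-means-like updates where any point farther than lam from
    every current mean starts a new cluster (Kulis & Jordan, 2012)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    mu = [X.mean(axis=0)]               # k = 1, the global mean
    assign = np.zeros(n, dtype=int)     # r_ij stored as one cluster index per point
    for _ in range(n_passes):
        for i in range(n):
            d = np.array([np.sum((X[i] - m) ** 2) for m in mu])
            if d.min() > lam:
                mu.append(X[i].copy())  # start a new cluster with mu_k = x_i
                assign[i] = len(mu) - 1
            else:
                assign[i] = int(d.argmin())
        # Recompute each mean from the points currently assigned to it.
        for j, _ in enumerate(mu):
            members = X[assign == j]
            if len(members) > 0:        # assumption: keep an emptied cluster's old mean
                mu[j] = members.mean(axis=0)
    return assign, np.array(mu)
```

Note that the number of clusters is never specified; it is controlled indirectly by the penalty $\lambda$.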
Optimizing a different Global Objective: Spectral Clustering and Graph Cuts

Spectral Clustering
Key idea: use the eigenvectors of a graph Laplacian matrix $L(G)$ to cluster a graph.
Example (the 5-vertex graph $G$ on the slide, with edges $\{1,2\}, \{1,3\}, \{1,4\}, \{2,4\}, \{3,4\}, \{3,5\}, \{4,5\}$):
$L(G) = D - W = \begin{pmatrix} 3 & -1 & -1 & -1 & 0 \\ -1 & 2 & 0 & -1 & 0 \\ -1 & 0 & 3 & -1 & -1 \\ -1 & -1 & -1 & 4 & -1 \\ 0 & 0 & -1 & -1 & 2 \end{pmatrix}$
Partitioning Sparse Matrices with Eigenvectors of Graphs. Alex Pothen, Horst D. Simon, and Kang-Pu Liu. 1990
Normalized Cuts and Image Segmentation. Jianbo Shi and Jitendra Malik. 2001
A Tutorial on Spectral Clustering. Ulrike von Luxburg. 2007

Spectral Clustering Algorithm
Given: data points $x_i$, a similarity function $s(x_i, x_j)$ between them, and $k$
Objective: minimize the ratio cut (or normalized cut)
1. Construct the (normalized) graph Laplacian $L(G(V, E)) = D - W$
2. Find the $k$ eigenvectors corresponding to the $k$ smallest eigenvalues of $L$
3. Let $U$ be the $n \times k$ matrix of these eigenvectors
4. Use k-means to find $k$ clusters $C'$, letting the rows $x_i'$ of $U$ be the points
5. Assign data point $x_i$ to the $j$th cluster if $x_i'$ was assigned to cluster $j$
(A code sketch of this pipeline on the example graph appears at the end of this section.)
A Tutorial on Spectral Clustering. Ulrike von Luxburg. 2007

Graph Cuts
A graph cut of $G(V, E)$ is the sum of the weights on a set of edges $S_E \subseteq E$ that cross partitions $A_i$ of the graph:
$W(A_i, \bar{A_i}) = \sum_{x_i \in A_i,\; x_j \in \bar{A_i}} w_{ij}$
$\mathrm{cut}(A_1, \dots, A_k) = \frac{1}{2} \sum_{i=1,\dots,k} W(A_i, \bar{A_i})$
(Slide figures: two example bipartitions with $\mathrm{cut}(A_1, A_2) = 7$ and $\mathrm{cut}(A_1, A_2) = 2$.)

Normalized Cuts
$\mathrm{RatioCut}(A_1, A_2, \dots, A_k) = \sum_{i=1,\dots,k} \frac{\mathrm{cut}(A_i, \bar{A_i})}{|A_i|}$
$\mathrm{Ncut}(A_1, A_2, \dots, A_k) = \sum_{i=1,\dots,k} \frac{\mathrm{cut}(A_i, \bar{A_i})}{\mathrm{vol}(A_i)}$
(Slide figure: an example bipartition with $\mathrm{cut}(A_1, A_2) = 2$, $\mathrm{RatioCut}(A_1, A_2) = 0.226$, and $\mathrm{Ncut}(A_1, A_2) = 0.063$.)

Relationship between Eigenvectors and Graph Cuts
For some subset of vertices $A$, let $f = (f_1, \dots, f_n)^T \in \mathbb{R}^n$, such that:
$f_i = \sqrt{|\bar{A}| / |A|}$ if $v_i \in A$, and $f_i = -\sqrt{|A| / |\bar{A}|}$ if $v_i \in \bar{A}$.
Then $f^T L f = |V| \cdot \mathrm{RatioCut}(A, \bar{A})$, $f \perp \mathbf{1}$ (that is, $\sum_i f_i = 0$), and $\|f\| = \sqrt{n}$.
Therefore, minimizing RatioCut is equivalent to minimizing $f^T L f$, subject to $f \perp \mathbf{1}$, $\|f\| = \sqrt{n}$, and $f_i$ as defined above.
A Tutorial on Spectral Clustering. Ulrike von Luxburg. 2007
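As a concrete illustration (not from the slides), the sketch below builds the 5-vertex example Laplacian, numerically checks the identity $f^T L f = |V| \cdot \mathrm{RatioCut}(A, \bar{A})$ for one bipartition, and then runs the unnormalized spectral clustering steps with $k = 2$. The particular choice of $A$ and the sign-split shortcut used in place of the k-means step are assumptions made for this small example.

```python
import numpy as np

# Unweighted 5-vertex example graph from the slide, so that L = D - W
# matches the matrix shown above (vertices relabeled 0..4).
edges = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (2, 4), (3, 4)]
n = 5
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

# Check f^T L f = |V| * RatioCut(A, A-bar) for the bipartition A = {0, 1}.
A = np.array([True, True, False, False, False])
cut = W[A][:, ~A].sum()                          # total weight of crossing edges
ratiocut = cut / A.sum() + cut / (~A).sum()
f = np.where(A, np.sqrt((~A).sum() / A.sum()), -np.sqrt(A.sum() / (~A).sum()))
assert np.isclose(f @ L @ f, n * ratiocut)       # the identity above
assert np.isclose(f.sum(), 0.0) and np.isclose(f @ f, n)

# Unnormalized spectral clustering with k = 2: the eigenvectors of the two
# smallest eigenvalues of L form U (steps 2-3). Step 4 would run k-means on
# the rows of U; for k = 2 a sign split of the second eigenvector (the
# Fiedler vector) already yields the bipartition, which is what we do here.
eigvals, eigvecs = np.linalg.eigh(L)
U = eigvecs[:, :2]
labels = (U[:, 1] > 0).astype(int)
print(labels)
```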