Robust Clustering Algorithms
ROBUST CLUSTERING ALGORITHMS

A Thesis
Presented to
The Academic Faculty

by

Pramod Gupta

In Partial Fulfillment of the Requirements for the Degree of
Masters in Computer Science in the School of Computer Science

Georgia Institute of Technology
May 2011

Copyright © 2011 by Pramod Gupta


ROBUST CLUSTERING ALGORITHMS

Approved by:

Prof. Maria-Florina Balcan, Advisor
School of Computer Science
Georgia Institute of Technology

Prof. Santosh Vempala
School of Computer Science
Georgia Institute of Technology

Prof. Alexander Gray
School of Computer Science
Georgia Institute of Technology

Date Approved: March 29, 2011


ACKNOWLEDGEMENTS

This thesis would not have been possible without the guidance and help of several individuals who, in one way or another, contributed their valuable assistance to the preparation and completion of this work.

First and foremost, my utmost gratitude goes to Prof. Nina Balcan, whose encouragement, guidance and support from the initial to the final stage enabled me to develop an understanding of the subject. She has been an immense inspiration, and I thank her for her patience, motivation, enthusiasm, immense knowledge and steadfast encouragement as I hurdled all the obstacles in the completion of this work.

I would like to thank the rest of my thesis committee, Prof. Santosh Vempala and Prof. Alex Gray, for their encouragement and insightful comments throughout the course of this work. I am extremely grateful to Prof. Dana Randall, an insightful and amazing teacher, who helped me better understand the theoretical thought process. I also thank the Algorithms & Randomness Center for allowing us to use their machines to run our experiments; James Kriigel at the Technology Services Organization, College of Computing, for his help in setting us up with the necessary technical resources; and my friends and colleagues in the Theory Group and the School of Computer Science for some very insightful discussions and suggestions.

Lastly, I offer my sincerest gratitude to all of those who supported me in any respect during the completion of this project.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY

CHAPTERS

I    CLUSTERING ANALYSIS: AN INTRODUCTION
     1.1 Introduction
     1.2 Formal Definition
     1.3 Types of Clusterings
         1.3.1 Hierarchical vs Partitional
         1.3.2 Hard (Exclusive or Overlapping) vs Fuzzy
         1.3.3 Complete vs Partial
         1.3.4 Agglomerative vs Divisive
         1.3.5 Deterministic vs Stochastic
         1.3.6 Incremental vs Non-Incremental
     1.4 Types of Data
         1.4.1 Numerical Data
         1.4.2 Binary Data
         1.4.3 Categorical Data
         1.4.4 Transaction Data
         1.4.5 Time Series Data
     1.5 Similarity Measure
         1.5.1 Covariance Matrix
         1.5.2 Euclidean Distance
         1.5.3 Manhattan Distance
         1.5.4 Maximum Distance
         1.5.5 Minkowski Distance
         1.5.6 Mahalanobis Distance
         1.5.7 Cosine Similarity
         1.5.8 A General Similarity Coefficient
     1.6 Discussion

II   HIERARCHICAL CLUSTERING
     2.1 Strengths and Weaknesses
     2.2 Representation
         2.2.1 n-Tree
         2.2.2 Dendrogram
     2.3 Standard Agglomerative Linkage Algorithms
         2.3.1 Single Linkage
         2.3.2 Complete Linkage
         2.3.3 Group Average Linkage
         2.3.4 Centroid Linkage
         2.3.5 Median Linkage
         2.3.6 Ward's Linkage
     2.4 (Generalized) Wishart's Method
     2.5 BIRCH
     2.6 CURE
     2.7 Divisive Hierarchical Algorithms
     2.8 EigenCluster
     2.9 Discussion

III  ROBUSTNESS OF HIERARCHICAL ALGORITHMS
     3.1 Framework
         3.1.1 Formal Setup
     3.2 Properties of Similarity Function K
         3.2.1 Strict Separation
         3.2.2 Max Stability
     3.3 Average Stability
         3.3.1 Good Neighborhood
     3.4 Robustness of Standard Linkage Algorithms
     3.5 Robustness of Algorithms w.r.t. Structure of Data
     3.6 Discussion

IV   ROBUST HIERARCHICAL LINKAGE (RHL)
     4.1 Generating an Interesting Starting Point
     4.2 Ranked Linkage
     4.3 The Main Result
     4.4 Run Time Analysis
     4.5 Discussion

V    WEIGHTED NEIGHBORHOOD LINKAGE
     5.1 Algorithm
     5.2 Correctness Analysis
         5.2.1 The Main Result
     5.3 Run Time Analysis
     5.4 Discussion

VI   THE INDUCTIVE SETTING
     6.1 Formal Definition
     6.2 Robust Hierarchical Linkage
     6.3 Weighted Neighborhood Linkage
     6.4 Discussion

VII  EXPERIMENTS
     7.1 Data Sets
         7.1.1 Real-World Data Sets
         7.1.2 Synthetic Data Sets
     7.2 Experimental Setup
     7.3 Results
         7.3.1 Transductive Setting
         7.3.2 Inductive Setting
         7.3.3 Variation in Input Parameters
     7.4 Discussion

VIII CONCLUSION
     8.1 Future Work

APPENDICES

APPENDIX A — CORRECTNESS RESULTS FOR STANDARD LINKAGE ALGORITHMS
APPENDIX B — CLUSTERING VALIDITY
APPENDIX C — CONCENTRATION BOUNDS
APPENDIX D — MATLAB IMPLEMENTATION

REFERENCES


LIST OF TABLES

1   Data Set: Iris
2   Data Set: Wine
3   Data Set: Digits
4   Data Set: Breast Cancer Wisconsin
5   Data Set: Breast Cancer Wisconsin (Diagnostic)
6   Data Set: Ionosphere
7   Data Set: Spambase
8   Data Set: Mushroom
9   Data Set: Syn1
10  Data Set: Syn2
11  Data Set: Syn3


LIST OF FIGURES

1   Different clusterings of the same data set
2   A clustering
3   Types of Clusterings
4   Agglomerative and Divisive hierarchical clustering
5   A 5-tree (n-tree of 5 points)
6   A dendrogram
7   Loop Plot
8   Single Linkage: The similarity of two clusters is defined as the maximum similarity between any two points in the two different clusters.
9   Complete Linkage: The similarity of two clusters is defined as the minimum similarity between any two points in the two different clusters.
10  Average Linkage: The similarity of two clusters is defined as the average pairwise similarity among all pairs of points in the two different clusters. (A code sketch of these three linkage rules follows this list.)
11  Centroid Linkage: The distance between two clusters is the distance between the centroids of the two clusters. The symbol × marks the centroids of the two clusters.
12  BIRCH: The idea of the CF Tree
13  Shrinking representatives towards the center of the cluster
14  CURE Example
15  The Divide-and-Merge methodology
16  Consider a document clustering problem. Assume that the data lies in multiple regions: Algorithms, Complexity, Learning, Planning, Squash, Billiards, Football, Baseball. Suppose that K(x, y) = 0.999 if x and y belong to the same inner region; K(x, y) = 3/4 if x ∈ Algorithms and y ∈ Complexity, or if x ∈ Learning and y ∈ Planning, or if x ∈ Squash and y ∈ Billiards, or if x ∈ Football and y ∈ Baseball; K(x, y) = 1/2 if x is in (Algorithms
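
The linkage rules summarized in the captions of Figures 8-10 can be stated compactly in code. The following MATLAB sketch is purely illustrative: the function and variable names are my own and are not taken from the thesis or from the implementation in Appendix D. It computes the similarity between two clusters under single, complete and group average linkage, assuming a precomputed pairwise similarity matrix S.

    % Illustrative sketch (not the thesis code): similarity between two clusters
    % under the standard agglomerative linkage rules of Figures 8-10.
    % S is an n-by-n pairwise similarity matrix; A and B are index vectors
    % listing the points of the two clusters.
    function sim = linkage_similarity(S, A, B, method)
        block = S(A, B);                  % cross-cluster pairwise similarities
        switch lower(method)
            case 'single'                 % maximum similarity over all cross pairs
                sim = max(block(:));
            case 'complete'               % minimum similarity over all cross pairs
                sim = min(block(:));
            case 'average'                % mean similarity over all cross pairs
                sim = mean(block(:));
            otherwise
                error('Unknown linkage method: %s', method);
        end
    end

For example, linkage_similarity(S, [1 2 3], [4 5], 'single') returns the largest similarity between any point of the first cluster and any point of the second.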