KERNEL-BASED CLUSTERING OF BIG DATA

By

Radha Chitta

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science – Doctor of Philosophy

2015

ABSTRACT

KERNEL-BASED CLUSTERING OF BIG DATA

By Radha Chitta

There has been a rapid increase in the volume of digital data over recent years. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (10^21 bytes) of digital data by the year 2020. Analysis of such massive amounts of data, popularly known as big data, necessitates highly scalable data analysis techniques. Clustering is an exploratory data analysis tool used to discover the underlying groups in the data. The state-of-the-art algorithms for clustering big data sets are linear clustering algorithms, which assume that the data is linearly separable in the input space, and use measures such as the Euclidean distance to define inter-point similarities. Though efficient, linear clustering algorithms do not achieve high cluster quality on real-world data sets, which are not linearly separable. Kernel-based clustering algorithms employ non-linear similarity measures to define inter-point similarities. As a result, they are able to identify clusters of arbitrary shapes and densities. However, kernel-based clustering techniques suffer from two major limitations: (i) their running time and memory complexity increase quadratically with the size of the data set, so they cannot scale up to data sets containing billions of data points; and (ii) their performance is highly sensitive to the choice of the kernel similarity function. Ad hoc approaches, relying on prior domain knowledge, are currently employed to choose the kernel function, and it is difficult to determine the appropriate kernel similarity function for a given data set.
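The kernel k-means formulation underlying this line of work never materializes feature-space coordinates: distances to cluster centroids are computed directly from kernel entries. The following is only a minimal illustrative sketch of that idea, not the algorithms developed in this thesis; the RBF kernel choice, the farthest-point seeding, and all parameter values are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans(K, k, n_iter=100):
    """Lloyd-style kernel k-means on a precomputed n x n kernel matrix K.

    The squared feature-space distance from phi(x_i) to the centroid of
    cluster c is  K_ii - (2/|c|) sum_{j in c} K_ij + (1/|c|^2) sum_{j,l in c} K_jl,
    so no explicit feature-space coordinates are ever needed.
    """
    n = K.shape[0]
    diag = np.diag(K)
    # Farthest-point seeding in the induced feature space (for determinism).
    landmarks = [0]
    d2 = diag + K[0, 0] - 2.0 * K[0]
    for _ in range(1, k):
        j = int(np.argmax(d2))
        landmarks.append(j)
        d2 = np.minimum(d2, diag + K[j, j] - 2.0 * K[j])
    labels = (diag[:, None] + diag[landmarks][None, :]
              - 2.0 * K[:, landmarks]).argmin(1)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            if mask.any():
                dist[:, c] = (diag - 2.0 * K[:, mask].mean(1)
                              + K[np.ix_(mask, mask)].mean())
        new_labels = dist.argmin(1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

The quadratic cost the abstract refers to is visible here: the full n x n matrix K must be stored, and every Lloyd iteration touches all of its entries.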
In this thesis, we develop scalable approximate kernel-based clustering algorithms using random sampling and matrix approximation techniques. They can cluster big data sets containing billions of high-dimensional points not only as efficiently as linear clustering algorithms but also as accurately as classical kernel-based clustering algorithms.

Our first contribution is based on the premise that the similarity matrices corresponding to big data sets can usually be well-approximated by low-rank matrices built from a subset of the data. We develop an approximate kernel-based clustering algorithm, which uses a low-rank approximate kernel matrix, constructed from a uniformly sampled small subset of the data, to perform clustering. We show that the proposed algorithm has linear running time complexity and low memory requirements, and also achieves high cluster quality when provided with a sufficient number of data samples. We also demonstrate that the proposed algorithm can be easily parallelized to handle distributed data sets. We then employ non-linear random feature maps to approximate the kernel similarity function, and design clustering algorithms which enhance the efficiency of kernel-based clustering, as well as label assignment for previously unseen data points.

Our next contribution is an online kernel-based clustering algorithm that can cluster potentially unbounded data streams in real time. It intelligently samples the data stream and finds the cluster labels using these sampled points. The proposed scheme is more effective than the current kernel-based and linear stream clustering techniques, both in terms of efficiency and cluster quality.

We finally address the issues of high dimensionality and scalability to data sets containing a large number of clusters.
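The low-rank premise behind the first contribution can be sketched with a Nyström-style approximation: sample m landmark points, build the n x m cross-kernel against them, and cluster in the resulting m-dimensional feature space with ordinary k-means, avoiding the full n x n kernel matrix. This is a hedged sketch under assumed choices (RBF kernel, uniform landmark sampling, farthest-point seeding), not the thesis's own algorithm.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel block between two point sets A and B."""
    d2 = (np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def nystrom_embedding(X, m, gamma=0.5, seed=0):
    """Rank-m Nystrom feature map E with E @ E.T ~= K (the full kernel)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), m, replace=False)   # uniform landmark sample
    S = X[idx]
    K_nm = rbf(X, S, gamma)                      # n x m cross-kernel
    K_mm = rbf(S, S, gamma)                      # m x m landmark kernel
    w, V = np.linalg.eigh(K_mm)
    w = np.maximum(w, 1e-12)                     # guard tiny eigenvalues
    # E = K_nm V diag(w^{-1/2}), so E E^T = K_nm K_mm^{-1} K_mn
    return K_nm @ V / np.sqrt(w)

def lloyd_kmeans(E, k, n_iter=100):
    """Plain k-means on the m-dimensional Nystrom features.

    Farthest-point seeding keeps this sketch deterministic; k-means++
    would be a common alternative.
    """
    centers = [E[0].copy()]
    d2 = ((E - centers[0]) ** 2).sum(1)
    for _ in range(1, k):
        j = int(np.argmax(d2))
        centers.append(E[j].copy())
        d2 = np.minimum(d2, ((E - E[j]) ** 2).sum(1))
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = E[labels == c].mean(0)
    return labels
```

With m fixed, the cost per iteration is O(nmk) rather than O(n^2), which is the source of the linear scaling in n described above.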
Under the assumption that the kernel matrix is sparse when the number of clusters is large, we modify the above online kernel-based clustering scheme to perform clustering in a low-dimensional space spanned by the top eigenvectors of the sparse kernel matrix. The combination of sampling and sparsity further reduces the running time and memory complexity.

The proposed clustering algorithms can be applied in a number of real-world applications. We demonstrate the efficacy of our algorithms using several large benchmark text and image data sets. For instance, the proposed batch kernel clustering algorithms were used to cluster large image data sets (e.g., Tiny) containing up to 80 million images. The proposed stream kernel clustering algorithm was used to cluster over a billion tweets from Twitter, for hashtag recommendation.

To My Family

ACKNOWLEDGMENTS

“Life is a continuous learning process. Each day presents an opportunity for learning.”
- Lailah Gifty Akita, Think Great: Be Great

Every day during my PhD studies has been a great opportunity for learning, thanks to my advisors, colleagues, friends, and family.

I am very grateful to my thesis advisor Prof. Anil K. Jain, who has been a wonderful mentor. His ability to identify good research problems has always been my inspiration. I am motivated by his energy, discipline, meticulousness and passion for research. He has taught me to plan and prioritize my work, and present it in a convincing manner. I am also very thankful to Prof. Rong Jin, with whom I had the privilege of working closely. Under his guidance, I have learnt how to formalize a problem, and develop coherent solutions to the problem, using different machine learning tools. I am inspired by his extensive knowledge and hard-working nature.

I would like to thank my PhD committee members, Prof. Pang-Ning Tan, Prof. Shantanu Chakrabartty, and Prof. Selin Aviyente for their valuable comments and suggestions. Prof.
Pang-Ning Tan was always available when I needed help, and provided very useful suggestions.

I am grateful to several other researchers who have mentored me at various stages of my research. I have had the privilege of working with Dr. Suvrit Sra and Dr. Francesco Dinuzzo, at the Max Planck Institute for Intelligent Systems, Germany. I would like to thank them for giving me an insight into several emerging problems in machine learning. I thank Dr. Ganesh Ramesh from Edmodo for providing me the opportunity to learn more about natural language processing, and building scalable solutions. Dr. Timothy Havens was very helpful when we were working together during the first year of my PhD.

I would like to thank my lab mates and friends: Shalini, Soweon, Serhat, Zheyun, Jinfeng, Mehrdad, Kien, Alessandra, Abhishek, Brendan, Jung-Eun, Sunpreet, Inci, Scott, Lacey, Charles, and Keyur. They made my life at MSU very memorable. I would like to especially thank Serhat for all the helpful discussions, and Soweon for her support and encouragement. I am thankful to Linda Moore, Cathy Davison, Norma Teague, Katie Trinklein, Courtney Kosloski and Debbie Kruch for their administrative support. Many thanks to the CSE and HPCC administrators, especially Kelly Climer, Adam Pitcher, Dr. Dirk Colbry, and Dr. Benjamin Ong.

Last but not least, I would like to thank my family. I am deeply indebted to my husband Praveen, without whose support and motivation I would not have been able to pursue and complete my PhD. My parents, my sister and my parents-in-law have been very supportive throughout the past five years. I was inspired by my father Ramamurthy to pursue higher studies, and strive to make him proud. I would like to especially mention my mother Sudha Lakshmi, who has been my role model and inspiration. I can always count on her to encourage me and uplift my spirits.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  Introduction
1.1 Data Analysis
    1.1.1 Data Representation
    1.1.2 Learning
    1.1.3 Inference
1.2 Clustering
    1.2.1 Clustering Algorithms
    1.2.2 Challenges in Data Clustering
1.3 Clustering Big Data
    1.3.1 Clustering with k-means
1.4 Kernel Based Clustering
    1.4.1 Kernel k-means
    1.4.2 Challenges
        1.4.2.1 Scalability
        1.4.2.2 Choice of kernel
1.5 Thesis Contributions
1.6 Datasets and Evaluation Metrics
    1.6.1 Datasets
    1.6.2 Evaluation Metrics
1.7 Thesis Overview

Chapter 2  Approximate Kernel-based Clustering
2.1 Introduction
2.2 Related Work
    2.2.1 Low-rank Matrix Approximation
        2.2.1.1 CUR matrix approximation
        2.2.1.2 Nystrom matrix approximation
    2.2.2 Kernel-based Clustering for Large Datasets
2.3 Approximate Kernel k-means
    2.3.1 Parameters
        2.3.1.1 Sample size
        2.3.1.2 Sampling strategies
    2.3.2 Analysis
        2.3.2.1 Computational complexity
        2.3.2.2 Approximation error