EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING

by

JAIM AHMED

(Under the Direction of Maria Hybinette)

ABSTRACT

We introduce a new algorithm for K-nearest neighbor queries that uses clustering and caching to improve performance. The main idea is to reduce the cost of distance computations between the query point and the data points in the data set. We use a divide-and-conquer approach. First, we divide the training data into clusters based on similarity between the data points in terms of Euclidean distance. Next we use linearization for faster lookup. The data points in a cluster can be sorted based on their similarity (measured by Euclidean distance) to the center of the cluster. Fast search data structures such as the B-tree can be utilized to store data points based on their distance from the cluster center and perform fast data search. The B-tree is also well suited to range search. We achieve a further performance boost by using B-tree based data caching. In this work we provide details of the algorithm, an implementation, and experimental results in a robot navigation task.

INDEX WORDS: K-Nearest Neighbors, Execution, Caching.

EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING

by

JAIM AHMED

B.S., Southern Polytechnic State University, 1997

A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment

of the Requirements for the Degree

MASTER OF SCIENCE

ATHENS, GEORGIA

2009

© 2009

Jaim Ahmed

All Rights Reserved

EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING

by

JAIM AHMED

Major Professor: Maria Hybinette

Committee: Eileen T. Kraemer, Khaled Rasheed

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
May 2009

DEDICATION

First of all, my dedication goes out to my wife Jennifer for her support and inspiration, especially when the going got tough. My dedication also goes to my parents for their unconditional love and motivation. My final dedication goes to my sister and brother-in-law for their genuine friendship and kindness.


ACKNOWLEDGEMENTS

First of all, I express my sincere gratitude to my Major Advisor Dr. Maria Hybinette for her constant support and encouragement. Dr. Hybinette has been very kind with her time and wisdom. She has been a shining example of hard work and dedication and will remain a source of inspiration for me forever.

I would also like to thank my committee members Dr. Eileen Kraemer and Dr. Khaled Rasheed for their time and consideration. Special thanks to Dr. Tucker Balch for his helpful suggestions and consultations. Also, thanks to the Borg Lab for access to example data and their helpful suggestions.


TABLE OF CONTENTS

Page

ACKNOWLEDGEMENTS...... v

LIST OF TABLES...... viii

LIST OF FIGURES ...... ix

CHAPTER

1 Introduction...... 1

1.1 Overview...... 1

1.2 Problem Domain...... 3

1.3 What is K-nearest Neighbor Search?...... 6

1.4 Contributions...... 8

2 Related Work...... 10

3 Background ...... 15

3.1 Data Clustering...... 15

3.2 Data Caching ...... 19

3.3 Basic KNN Search...... 20

3.4 KD-tree Data Structure ...... 22

4 System Architecture...... 24

4.1 Pre-processing ...... 25

4.2 ckSearch Runtime Queries...... 31

5 Experiments & Results ...... 45


5.1 Setup Information...... 45

5.2 The effect of the size of the data set ...... 46

5.3 The effect of data dimension on the performance ...... 51

5.4 The effect of search radius on the performance ...... 54

5.5 The effect of search radius on accuracy...... 57

5.6 The effect of the number of clusters...... 58

6 Conclusion...... 62

REFERENCES ...... 64

APPENDICES ...... 67

A Notation Table...... 67

B Implementation Pseudocode ...... 68


LIST OF TABLES

Page

Table 5.1: The effect of data size on performance (k=1)...... 47

Table 5.2: Effect of data size on performance (k=3) ...... 48

Table 5.3: Effect of data size on performance (k=10) ...... 49

Table 5.4: ckSearch speedup over linear search...... 50

Table 5.5: The effect of data dimension on performance (N=50K) ...... 52

Table 5.6: The effect of data dimension on performance (N=100K) ...... 52

Table 5.7: ckSearch speedup over linear search for various dimensions...... 53

Table 5.8: The effect of search radius on performance (k = 3) ...... 55

Table 5.9: The effect of search radius on performance (k = 10) ...... 56

Table 5.10: The effect of the search radius on query accuracy ...... 57

Table 5.11: The effect of the number of clusters on performance (k=1) ...... 59

Table 5.12: The effect of the number of clusters on performance (k=5) ...... 60

Table A.1: List of various notations used in this thesis ...... 67


LIST OF FIGURES

Page

Figure 1.1: Autonomous robot being trained to navigate through obstacles...... 4

Figure 1.2: Autonomous robot navigation sensors input ...... 5

Figure 1.3: Pictorial representations of KNN search ...... 6

Figure 3.1: Data clustering in 2-dimensional space...... 16

Figure 3.2: Stages in data clustering ...... 17

Figure 3.3: Typical application cache structure...... 18

Figure 3.4: Basic KNN search process represented in 2-dimensional space ...... 20

Figure 3.5: Basic KNN Search Algorithm ...... 21

Figure 3.6: KD-tree data structure ...... 22

Figure 4.1: Cluster data linearization...... 26

Figure 4.2: B-tree data structure ...... 29

Figure 4.3: Data cluster to B-tree correlation...... 34

Figure 4.4: ckSearch algorithm data caching scheme...... 36

Figure 4.5: Cluster search rule 1 (Cluster exclusion rule)...... 39

Figure 4.6: Cluster search rule 2 (Cluster search region rule)...... 40

Figure 4.7: Cluster search rule 3 (Cluster contains query sphere)...... 42

Figure 4.8: Cluster search rule 4 (Cluster intersects query sphere) ...... 43

Figure 5.1: Performance vs. data set size chart (k = 1)...... 47

Figure 5.2: Performance vs. data set size chart (k = 3)...... 48


Figure 5.3: Performance vs. data set size chart (k = 10)...... 49

Figure 5.4: Chart showing ckSearch speedup over the linear search ...... 50

Figure 5.5: Data dimension vs. performance chart (N = 50K)...... 52

Figure 5.6: Data dimension vs. performance chart (N = 100K) ...... 53

Figure 5.7: Search radius vs. performance for 10000 data records ...... 55

Figure 5.8: Search radius vs. performance chart for 10,000 data records (k = 10) ...... 56

Figure 5.9: Search radius vs. query accuracy chart ...... 58

Figure 5.10: The number of clusters vs. performance chart for 50000 data records (k = 1)...... 59

Figure 5.11: The number of clusters vs. performance chart (k = 5) ...... 60

Figure B.1: ckSearch KNN algorithm...... 68

Figure B.2: SearchClusters(q) pseudocode...... 69

Figure B.3: The SearchCache(q) algorithm pseudocode ...... 70

Figure B.4: The SearchLeftNodes(leafNode i, key left ) pseudocode...... 71

Figure B.5: The SearchRightNodes(leafNode i, key right ) pseudocode ...... 72


CHAPTER 1

INTRODUCTION

In this research, we introduce an efficient algorithm for K-nearest neighbor queries that uses clustering, a pruning of the search space, and caching to improve performance. We call our algorithm ckSearch. The main goal of this work is to improve the performance of queries in a k-nearest neighbor (KNN) system.

In this chapter we provide an overview of the KNN algorithm, and brief coverage of the performance challenges facing KNN implementations. We describe our application and experimental domain, and then provide details on our approach.

1.1 Overview

The K-nearest neighbor algorithm (KNN) is a well-known statistical search or learning method used in a wide range of problem-solving domains, e.g., navigation [32], data mining [33], and image processing [11]. In robotic navigation, KNN is used to select an appropriate action for a robot by evaluating the K most similar instances from the nearest neighbor feature set in the training data. In forestry, KNN is used to map satellite image data to inventory forest resources [34], and in wine evaluation, KNN is used to classify wines, where the feature space includes alcohol level, hue, and wine opacity [35]. More formally, KNN finds the K closest (or most similar) points to a query point among N points in a d-dimensional attribute (or feature) space. K is the number of neighbors that are considered from a training data set and typically ranges from 1 to 20.

Advantages of the KNN algorithm include that it is fairly simple to implement and well suited for multi-modal classes [36]. However, a major disadvantage of KNN implementations is their high computational cost, especially when coupled with a large amount of data. The high cost is partly due to computing Euclidean distances between the N neighboring data points and the query point. Further, many KNN implementations degrade in performance as the data becomes higher dimensional (i.e., they suffer from the “curse of dimensionality”); typically, performance starts to degrade when the number of features is 20 or more [10]. Another drawback of KNN concerns its significant memory requirements, especially for Locality Sensitive Hashing (LSH) based KNN systems [6].

A key idea of our ckSearch algorithm is to improve performance by avoiding costly distance computations for the KNN search. We use a divide-and-conquer approach. First, we divide the training data into clusters based on similarity between the data points in terms of Euclidean distance. Next we perform a linearization of data points in each cluster for faster lookup. The data points in a cluster can be sorted based on their similarity (measured by Euclidean distance) to the center of the cluster. Our data linearization process takes advantage of this similarity and produces indexes for each data point in a cluster. Fast search data structures such as the B-tree can be utilized to store data points based on their metric indexes. Next we load the data points into a memory-aware B-tree data structure. We achieve a further performance boost using B-tree based data caching.

The ckSearch cache policy pre-fetches clusters closer (or more similar) to the query point into the cache in anticipation of what may be needed next, and it avoids checking the cache if the needed cluster has not been put in the cache. This policy avoids some cache misses. At runtime, the ckSearch system first evaluates the cache upon receiving the query point and searches for the k closest points in the cache. The cache is organized hierarchically in a B-tree structure and thereby reduces the number of distance computations. In the case of a cache miss, the ckSearch algorithm searches the main B-tree for the k nearest neighbors using our new method.

1.2 Problem Domain

A focus of this research is to improve the performance of the KNN approach and to demonstrate its performance on a real-world problem. We assessed our approach using data from an autonomous robot navigation experiment. The existing solution for this system uses the KD-tree algorithm, which partitions the training data set recursively (KD-trees are specialized BSP trees). A KD-tree based algorithm provided direction and speed commands for a robot based on learned perception examples. One of our objectives is to improve the performance of the existing KD-tree approach. In order to improve data processing speed, we introduce our novel ckSearch algorithm that utilizes data clustering and data caching. In addition, our ckSearch system utilizes several rules to further reduce or avoid costly distance calculations. Even though our system has been assessed for efficient execution of the KNN algorithm in a robotics domain, it is expected to perform well in any domain that utilizes a KNN algorithm. One such domain could be image processing, where a KNN algorithm is used to classify comparable image pixels.


Figure 1.1: An autonomous robot being trained to navigate through obstacles.

Figure 1.1 shows an autonomous robot being trained to navigate through obstacles. The green lines show the sensors and the yellow arrow shows the direction. These sensor readings are used as training data to classify speed and direction during an autonomous run. This image is used by permission from the Borg Lab at the Georgia Institute of Technology.

Autonomous robot navigation in unstructured outdoor environments is a challenging area of active research. At the core of this navigation task, identifying obstacles and traversing around these obstacles plays a vital role in reaching the robot’s target destination.

There is a recent trend of using KNN-based approaches in autonomous robotics research [32]. Autonomous robots have the ability to function and perform desired tasks in unstructured environments without continuous human guidance, but they rely on algorithms such as KNN for learned data classification. Typically, sensors collect obstacle data and the decision making system must decide which action to take based on previously learned behavior [1].


Figure 1.2: Autonomous robot navigation sensors input.

Figure 1.2 shows a representation of a robot’s sensor input for navigation. Each green line represents an estimate of free space from the robot to an obstacle. At each time step there are 60 such inputs, which make up a 60-dimensional data point. The yellow arrow shows the direction input by the robot trainer, and the blue arrowhead shows the original direction of the target path. Later, the robot uses the 60-dimensional sensor data and the direction taken by the trainer as the training data set to decide speed and direction during an autonomous run.

In this manner, the robot has the ability to move through its operating environment without human assistance, using the KNN algorithm to dictate which direction to move and what the speed should be based on previously learned data. Needless to say, this decision making process must be efficient, accurate, and swift to enable the robot to cope with its environment and avoid obstacles. Most of the current KNN algorithms (such as the KD-tree) are too slow for the task. As defined, this is our problem domain. It was determined that the existing KD-tree based nearest neighbor search algorithm suffered performance degradation from the “curse of dimensionality” and that its performance needed improvement. In this research we worked to come up with an apt algorithm to speed up the classification of such robots’ direction and speed data.


Figure 1.3: Pictorial representations of KNN search.

1.3 What is K-nearest Neighbor Search?

The k-nearest neighbor (KNN) search is a variation of the nearest neighbor algorithm in which it is required to find the k points closest to the query point. The nearest neighbor search algorithm and its variations are frequently used to solve problems in areas such as robotics, data mining, multi-key retrieval, and pattern classification. Discovering a way to reduce the computational complexity of nearest neighbor search is of considerable interest in these areas.

The KNN search can be expressed as an optimization problem for finding the closest points in a metric space [2]. Given a set of N points in a metric space M and a query point q where q ∈ M, the problem is to find the k points in the set closest to the query point q. Usually, M is considered to be a d-dimensional Euclidean space, and distance is measured by the Euclidean distance or the Manhattan distance.


A significant cost of the KNN approach is due to the computation of the O(l) distance function (where l is the dimensionality of the vectors), especially when an application uses vectors with high dimensionality, such as sensor data from an autonomous robot [3]. A full search solution involves calculating the distance between the target vector q and every vector pi in order to find the k vectors closest to q. Although a full search ensures the best possible search results, this solution is often infeasible due to its O(nl) cost. Autonomous robot decision making applications often involve searching a large database for the closest match to a query case [4].

A simple solution to the KNN search problem is to compute the distance from the query point to every other point in the database, keeping track of the data points with the smallest distances calculated so far [5]. This sequential full search finds the k nearest neighbors by progressively updating the current nearest neighbor pj whenever a data point is found that is closer to the query point than the current nearest neighbor. With each update, the current KNN search radius shrinks toward the actual k-th nearest neighbor distance. The final nearest neighbor is one of the data points inside the current nearest neighbor search radius. Thus, in the sequential full search, the distances of all N data points to the query point are computed, and the search complexity is N distance computations per query point.
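To make the sequential full search concrete, the following is a minimal illustrative sketch in Python (not our actual implementation; the function and variable names are our own). It computes the Euclidean distance from the query point to every data point and keeps the k smallest, which is exactly the N-distance-computation baseline described above.

import heapq
import math

def linear_knn(data, q, k):
    # Sequential full search: compute the distance from q to every point
    # and keep the k smallest seen so far (the current k nearest neighbors).
    best = []  # max-heap via negated distances, holding at most k entries
    for i, p in enumerate(data):
        d = math.dist(p, q)               # Euclidean distance
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:             # closer than the current k-th neighbor
            heapq.heapreplace(best, (-d, i))
    # return (distance, index) pairs sorted by increasing distance
    return sorted((-nd, i) for nd, i in best)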

The number of distance calculations for any KNN algorithm grows with the number of data points in the data set, and the “curse of dimensionality” increases the number of calculations tremendously. One approach to reducing the complexity of the nearest neighbor search is to reduce the number of data points to be searched. Our approach to KNN search focuses on an inexpensive way of eliminating data points from consideration using computationally inexpensive rules, thereby avoiding more expensive distance computations.


The rules determine those data points which cannot be nearer to the query point than the current nearest neighbor.

The computational demands of KNN queries have increased in recent years. Moreover, the advent of new research areas using learning algorithms, such as autonomous robotics and other artificial intelligence domains, has drawn interest back to nearest neighbor search. Currently, the use of large databases containing millions of image records for a vision based navigational system is quite common [1]. Naturally, these new challenges have prompted a fresh look at nearest neighbor search and the ways it can help solve new problems.

As mentioned above, we apply our cluster-based KNN search method to the task of making steering and speed decisions for an autonomous robot based on training data. In addition, our approach utilizes a data caching strategy to improve performance. The ckSearch algorithm is also general enough to produce good performance in problem domains such as image processing, information extraction in data mining, and text classification.

1.4 Contributions

Results of our research will be of interest to those investigating high performance memory-based learning methods. In particular, we have implemented a system that supports fast and exact KNN queries without scanning the entire data set. Our novel contributions include:


• A geometry-based method for pruning the search space at query time. Some existing approaches (e.g., Approximate Nearest Neighbor) also prune, but are not able to provide exact responses to queries.

• Further improved performance using caching.

Our solution is based on a framework consisting of three major components: (1) pre-processing of data points into clusters; (2) data point mapping to a metric data structure; and (3) implementation of smart caching. We have designed our caching strategy based on the assumption that a data cache can boost performance in repeated-calculation algorithms such as KNN. The approach takes advantage of an algorithm that balances the cost and performance of each component in order to achieve an overall reduction in cost and improve performance [4]. Using the above-mentioned techniques along with rules to avoid unnecessary computation, our algorithm achieves a performance improvement over linear search and KD-tree based KNN algorithms. The performance evaluation section details these experiments and results.

The rest of the thesis is organized as follows: Chapter 2 discusses related work done by various other researchers in this area. Chapter 3 presents background information and various concepts used in this project. Chapter 4 describes in detail our proposed approach and all the related information. The experiments are discussed and the results are presented in Chapter 5. Finally, Chapter 6 presents the conclusions of this thesis and describes future work.


CHAPTER 2

RELATED WORK

This chapter reviews recent research on autonomous robot navigation as well as on the KNN algorithm. Navigation is one of the most challenging skills required of a mobile robot, and there has been a recent trend among researchers of using the KNN algorithm to classify learned data. In this section, we present some of the related work done in this area.

The 6D SLAM (Simultaneous Localization and Mapping) system is based on a scan matching technique, where scan matching uses the well-known iterative closest point (ICP) algorithm [3]. This system employs a cached KD-tree to improve the performance of the iterative closest point algorithm. Since the KD-tree itself suffers from a performance breakdown with high-dimensional data points, we believe 6D SLAM will suffer performance deterioration with high-dimensional navigation data [17].

Another approach taken by researchers to solve the navigation problem is based on the stereo vision of the robot system. Binary classifiers were used to augment stereo vision for enhanced autonomous robot navigation. However, this system does not use any one optimized binary classifier; instead, it suggests using several generic classifiers such as SVM, the Simple Fisher Algorithm, and Fisher LDA. This approach also suggests creating and storing learned models of traversable and non-traversable terrain. We believe generic binary classifiers are prone to performance degradation, which can affect the performance of this system [1].


Some researchers applied memory-based robot learning to solve similar problems. Memory-based neural networks were used to learn the task to be performed [22], whether identifying navigational hot spots or making decisions. These researchers also augmented a nearest neighbor network with a local model network.

Next, we present several related works in the KNN search area. There has been a long line of research on solving the nearest neighbor search problem, and a large number of solutions have been proposed to reduce the cost of the nearest neighbor search. The quality and usefulness of these proposed solutions are determined by the time complexity of the queries as well as the space complexity of any search data structures that must be maintained.

The current KNN techniques can be divided into five major approaches: the data partitioning approach, the dimensionality reduction approach, locality sensitive hashing (LSH), the scanning based approach, and the linearization approach.

The most prominent is the data partitioning approach, also known as space partitioning, spatial indexing, or the spatial access method. Data partitioning techniques such as the KD-tree [22] or the Grid-file [25] iteratively bisect the search space into regions containing a fraction of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. One of the main drawbacks of this concept is the “curse of dimensionality,” a problem caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space. Data partitioning techniques perform comparably well with low-dimensional data points; with high-dimensional data, however, their performance quickly degrades because of the exponential increase in volume associated with iterative partitioning of the high-dimensional Euclidean search space. Multi-dimensional indexes such as R-trees [46] have been shown to be inefficient for supporting range queries in high-dimensional databases [19].

Dimensionality reduction approaches first apply dimensionality reduction techniques to the data and then insert the reduced data into indexing trees. Dimensionality reduction is the process of reducing the number of random variables or attributes being considered, and it is divided into feature selection and feature extraction. There are costs associated with performing dimensionality reduction and the subsequent data indexing, which is why this technique performs well on low-dimensional data sets but suffers when the data dimensionality increases.

Locality sensitive hashing (LSH) is a comparatively new nearest neighbor search approach. It is a technique for grouping points into buckets based on a distance metric defined on the points: points that are close to each other under the chosen metric are mapped to the same bucket with high probability. Theoretically, for a database of n vectors of d dimensions, the time complexity of finding the nearest neighbor of an object using locality sensitive hashing is sub-linear in n and only polynomial in d. A key requirement for applying LSH to a particular space and distance measure is to identify a family of locality sensitive functions satisfying the required properties [26]. Thus, locality sensitive hashing is only applicable for specific spaces and distance measures where such families of functions have been identified, such as real vector spaces with distance measures, or bit vectors with the Hamming distance [28]. Also, because locality sensitive hashing techniques are based on hashing, they have a large memory footprint; a large amount of memory must be allocated to apply LSH, which is a major drawback [27].
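As an illustration only (not part of our system), one well-known family of locality sensitive functions for real vector spaces under angular distance is random-hyperplane hashing. The Python sketch below, with hypothetical names, shows how nearby points tend to share a bucket.

import numpy as np

def lsh_buckets(data, num_bits=16, seed=0):
    # Random-hyperplane LSH sketch: each point receives a num_bits-bit
    # signature; points at a small angle tend to collide in the same bucket.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(num_bits, data.shape[1]))   # random hyperplanes
    signs = data @ planes.T > 0                           # which side of each plane
    keys = np.packbits(signs, axis=1)                     # pack bits into bucket keys
    buckets = {}
    for i, row in enumerate(keys):
        buckets.setdefault(row.tobytes(), []).append(i)   # group point indices by key
    return buckets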


Scanning based approaches such as the VA-file [17] divide the data space into 2^b rectangular cells, where b denotes a user specified number of bits. Each cell is allocated a bit-string of length b that approximates the data points falling into that cell. The VA-file is based on the idea of object approximation and approximates object shapes by their minimum bounding box; the VA-file itself is simply an array of these compact, geometric approximations. Instead of hierarchically organizing the cells as in grid-files or R-trees, the nearest neighbor search starts by scanning the entire file of approximations and filtering out the irrelevant points based on their approximations.

Linearization approaches, such as space-filling curve methods (e.g., the Z-order curve), map d-dimensional points into a one-dimensional space (curve). As a result, one can issue a range query along the curve to find the k nearest neighbors.
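As a small illustration of the linearization idea (again, not taken from our system), a Z-order key for a low-dimensional integer point can be computed by interleaving coordinate bits; sorting by the resulting one-dimensional keys then approximates a spatial ordering.

def morton_key_2d(x, y, bits=16):
    # Interleave the bits of two non-negative integer coordinates to produce
    # a Z-order (Morton) key; nearby points tend to receive nearby keys.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bit positions come from x
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions come from y
    return key

# e.g., sorted(points, key=lambda p: morton_key_2d(*p)) linearizes a 2-D point set.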

As is evident from the discussion so far, most of the conventional approaches to the KNN search suffer from drawbacks related either to performance or to memory space complexity. Our proposed ckSearch approach, described in detail later in this thesis, is a novel approach to the KNN search problem. It utilizes clustering to achieve data partitioning. It makes smart but balanced use of a data caching technique to boost performance. It avoids the curse of dimensionality by mapping d-dimensional points into a one-dimensional space using a linearization approach. It uses an indexing tree as its data structure to evade large memory requirements. Moreover, it introduces metric index caching to the KNN algorithm.

As described in this chapter, there have been metric-based KNN systems, but our proposed data clustering along with smart use of a data cache is a unique and novel approach. Our proposed solution has been carefully designed to overcome the disadvantages many of these conventional approaches suffer while retaining the benefits the above-mentioned techniques enjoy.


CHAPTER 3

BACKGROUND

This chapter provides comprehensive background information for our project. It is important to remind the reader that the main goal of this project is to design a fast cluster-based KNN algorithm. In addition, this KNN algorithm must be able to process an autonomous robot's navigation (sensor) data quickly so that the robot can decide on direction and speed without stalling or running into obstacles. As mentioned above, the actual algorithm is described in the next chapter; the necessary background information is explained here. For ease of exposition, this chapter is divided into four subsections: data clustering, data caching, basic KNN search, and the KD-tree data structure.

3.1 Data Clustering

Data clustering is an essential component of our ckSearch algorithm and is considered part of the pre-processing step. A large portion of the cost of the KNN search is due to the computation of the O(l) distance function, especially when an application contains points with a large number of dimensions, such as the navigation sensor readings of an autonomous robot. The central strategy for reducing these repeated, and in some cases unnecessary, distance computations is to partition the data space, and data clustering is one of several ways to achieve this goal.


Figure 3.1: Data clustering in 2-dimensional space.

Cluster analysis is the organization of a collection of patterns, each usually a vector of measurements or a point in a multidimensional space, into clusters based on similarity [9]. Ideally, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. Since data points in a large database or data set are often clustered or correlated, data clustering as a data partitioning technique seems ideal. The diversity of techniques for representing data, measuring similarity between data elements, and categorizing data elements has generated a range of clustering methods.

Typical pattern clustering activity involves the following steps [9]:

(1) pattern representation

(2) definition of a pattern proximity measure appropriate to the data domain


(3) clustering or grouping

(4) data abstraction

(5) assessment of output if needed

Figure 3.2: Stages in data clustering

Pattern representation refers to the number of classes, the number of available patterns, and the features available to a clustering algorithm. It is divided into the feature selection and feature extraction processes. Feature selection is the process of identifying the most effective of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new, more prominent features. Either or both of these techniques can be used to obtain an appropriate set of features for clustering.

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance functions are used depending on the data domain. Typically, the Euclidean distance function is the most popular of these and is often used to measure the similarity between two patterns. Other similarity measures can be used to capture the conceptual similarity between patterns.

The clustering step can be performed in a variety of ways. There are several major clustering techniques available, such as hierarchical, partitional, fuzzy, probabilistic, and graph theoretic, to name a few. K-means clustering, a partition-based clustering technique, was used in this project. K-means clustering is simple and a good fit for the data partitioning required by a nearest neighbor search algorithm. There are several other clustering schemes in the literature, such as BIRCH [30], CLARANS, and DBSCAN [31].
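For concreteness, a minimal Lloyd's-iteration K-means sketch is given below (illustrative Python with NumPy; our system relies on an existing K-means implementation rather than this code).

import numpy as np

def kmeans(data, num_clusters, iters=50, seed=0):
    # Minimal Lloyd's algorithm: alternate between assigning each point to its
    # nearest center and recomputing each center as the mean of its points.
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), num_clusters, replace=False)]
    for _ in range(iters):
        # squared Euclidean distance from every point to every center
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            data[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(num_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # final assignment of each point to its nearest center
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)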

Data abstraction is the next step in the clustering process (followed only by the optional output assessment step). It is the process of extracting a simple representation of the data set. A typical data abstraction is a compact description of each cluster, usually in terms of a cluster prototype or representative patterns such as the centroid.

In this project, data indexing is not dependent on the underlying clustering method. However, it is expected that the clustering strategy will have an influence on data retrieval performance.

Figure 3.3: Typical application cache structure


3.2 Data Caching

Data caching is a general technique used to enhance the performance of data access where the original data is expensive to compute compared to the cost of reading the cache. In a KNN search, a large set of high-dimensional data points is repeatedly accessed with each query, and a data cache can prove extremely effective in such a process. When data is cached, the most recently accessed data from the high-dimensional data set is stored in a memory buffer. Thus, this data cache is a temporary storage area where frequently accessed data can be stored for rapid access. When our ckSearch algorithm needs to access data, it first checks the cache to see if the data is there. If it finds what it is looking for in the cache, it uses the data from the cache instead of going to the data source to find it. Thus, using the data cache, our proposed algorithm can achieve shorter access times and boost performance. Even though a data cache is favorable, there are computational costs associated with data caching, primarily an accumulation of data retrieval cost, data maintenance cost, and cache miss cost. Thus, our proposed algorithm implements a comprehensive caching strategy to keep cache cost from offsetting the performance gains.
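The check-then-fallback pattern described above can be pictured with the following sketch (plain Python with hypothetical names; the actual ckSearch cache is B-tree based and is described in Chapter 4).

class QueryCache:
    # Tiny cache keyed by cluster id: check the cache first, fall back to the
    # main data store on a miss, then store the fetched result for next time.
    def __init__(self):
        self._store = {}

    def lookup(self, cluster_id):
        return self._store.get(cluster_id)       # None signals a cache miss

    def update(self, cluster_id, leaf_points):
        self._store[cluster_id] = leaf_points    # cache the fetched data

def fetch_cluster(cache, cluster_id, load_from_source):
    cached = cache.lookup(cluster_id)
    if cached is not None:                        # cache hit: skip the data source
        return cached
    points = load_from_source(cluster_id)         # cache miss: go to the source
    cache.update(cluster_id, points)
    return points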


Figure 3.4: Basic KNN search process represented in 2-dimensional space

3.3 Basic KNN Search

In order to search for the k nearest neighbors of a query point q, the distance of the k-th nearest neighbor to q defines the minimum radius rmin required for retrieving the complete answer set. It is not possible to calculate this distance preemptively because, without further scanning, we are unaware of the points surrounding the query point q. Thus, iteratively increasing the search radius and examining the neighbors within that search sphere is a viable approach.

To describe this algorithm, the query point in question is q, and the task is to find the k nearest neighbors of this query point. The search process starts with a query sphere defined by a relatively small radius r about the query point q, SearchSphere(q, r). Naturally, all data spaces the query sphere intersects have to be searched for potential k nearest neighbors. Iteratively, the search sphere is expanded until all k nearest neighbor points are found. In this process, all the data subspaces intersecting the current query sphere are checked. If enlargement of the query sphere does not introduce new nearest neighbor points, the current KNN result set R is considered to contain the nearest neighbors (assuming the size of the current result set is k). The search query is started with a small initial radius, which in turn keeps the search space small and avoids unwanted calculations; the goal is to minimize unnecessary search cost. Arguably, a search sphere with a larger radius may contain all the k nearest points, but the cost of going through all the data points outweighs the benefit.

Basic KNN Search(k):
1   R = empty;                      // the result set
2   Search sphere radius, r = as small as possible;
3   Find all data spaces intersecting the current query sphere;
4   Check all intersecting data spaces for k nearest neighbors;
5
6   if R.Size() == k                // k nearest neighbors are found
7       exit;
8   else
9       increase search radius;
10      goto line 3;                // start the search process again
END;

Figure 3.5: Basic KNN Search Algorithm
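A hedged, runnable Python rendering of the expanding-radius procedure in Figure 3.5 is sketched below. The helper logic scans the raw points rather than data subspaces, so it illustrates only the control flow; r_init, r_increment, and r_max are assumed parameters.

import math

def expanding_radius_knn(data, q, k, r_init, r_increment, r_max):
    # Grow the query sphere until it contains at least k points. Since every
    # point inside the sphere is examined, the k closest of them are exact.
    r = r_init
    while True:
        in_sphere = []
        for p in data:
            d = math.dist(p, q)
            if d <= r:                      # candidate lies inside the query sphere
                in_sphere.append((d, p))
        in_sphere.sort(key=lambda t: t[0])
        if len(in_sphere) >= k:
            return in_sphere[:k]            # the k-th neighbor lies inside the sphere
        if r >= r_max:
            return in_sphere                # stop criterion: maximum radius reached
        r += r_increment                    # enlarge the search sphere and retry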


We performed several performance comparisons between the ckSearch algorithm and the KD-tree based multi-dimensional indexing structure; these are detailed in the experiments chapter of this thesis. We believe it is important to understand the KD-tree algorithm in order to understand those comparisons, so a comprehensive account of the KD-tree is included in the following section.

Figure 3.6: KD-tree data structure

3.4 KD-tree data structure

K-dimensional search trees, i.e., KD-trees, are a generalization of binary search trees designed to handle multidimensional records. In a KD-tree, a multidimensional record is identified with its corresponding multidimensional key x = (x(1), x(2), ..., x(K)), where each x(n), 1 ≤ n ≤ K, refers to the value of the n-th attribute of the key x. Each x(n) belongs to some totally ordered domain Dn, and x is an element of D = D1 × D2 × ... × DK. Therefore, each multidimensional key may be viewed as a point in a K-dimensional space, and its n-th attribute can be viewed as the n-th coordinate of such a point. Without loss of generality, we assume that Dn = [0,1] for all 1 ≤ n ≤ K, and hence that D is the hypercube [0,1]^K [10]. A KD-tree for a set of K-dimensional records is a binary tree such that:

(1) Each node contains a K-dimensional record and has an associated discriminant n ∈ {1, 2, ..., K}.

(2) For every node with key x and discriminant n, any record in the left sub-tree with key y satisfies y(n) < x(n), and any record in the right sub-tree with key y satisfies y(n) > x(n).

(3) The root node has depth 0 and discriminant 1. All nodes at depth d have discriminant (d mod K) + 1.

There are many implementations of KD-trees, both homogeneous and non-homogeneous. Non-homogeneous KD-trees contain only one value in each internal node together with pointers to its left and right sub-trees; all records are stored in external nodes. The expected cost of a single insertion in a random KD-tree is O(log n), while the expected cost of building the whole tree is O(n log n). On average, deletions in KD-trees also have an expected cost of O(log n), and nearest neighbor queries are supported in O(log n) time [10].
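For reference, the sketch below shows how a standard KD-tree nearest neighbor query can be issued with SciPy. It illustrates the data structure discussed above and is not our own KD-tree implementation; the random 60-dimensional data is only a stand-in for the robot sensor vectors.

import numpy as np
from scipy.spatial import cKDTree

# Build a KD-tree over N random 60-dimensional points and query it.
rng = np.random.default_rng(0)
data = rng.random((10_000, 60))
query = rng.random(60)

tree = cKDTree(data)                            # expected O(n log n) construction
distances, indices = tree.query(query, k=3)     # 3 nearest neighbors of the query
print(distances, indices)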


CHAPTER 4

SYSTEM ARCHITECTURE

In this section, we describe the system architecture of ckSearch, which includes our scalable and efficient KNN search mechanism.

A number of solutions have been introduced to reduce the cost of the KNN search. The quality and usefulness of these solutions are limited by the computational time complexity of the queries as well as the space complexity of the relevant search data structures. As mentioned in the Related Work chapter, solutions face tradeoffs that affect performance and are prone to the curse of dimensionality phenomenon as the number of attributes increases. When the number of attributes is large, KNN implementations either require large memory allocations (due to space complexity) or fall victim to time complexity. A rule of thumb is that KNN algorithms work well for 20 or fewer attributes [10].

Our ckSearch technique balances both time and space complexities to achieve an overall reduction in both. Our cluster-based approach uses caching to minimize the cost of searching high-dimensional data. Our solution, detailed in this section, includes two phases: (1) pre-processing of the data points; and (2) runtime queries. In the pre-processing step the d-dimensional data set is partitioned into data clusters based on similarity between the data points. We discuss both phases in detail in the pre-processing and runtime query sections below. The following observations influenced the design of the ckSearch system:


Observation 1 (Data partitioning):

Data space partitioning can reduce redundant distance computations while searching for k nearest neighbors in a high-dimensional data domain. Simple clustering algorithms such as K-means clustering can reduce computational cost by separating high-dimensional data points into clusters based on similarity.

Observation 2 (Data reference):

Reference to a cluster centroid may expose similarity or dissimilarity between data points within a cluster and data points across different clusters. Moreover, data points in a cluster can be sorted based on their distance from a reference point such as the cluster centroid.

Observation 3 (Data caching):

Data caching can substantially reduce search time by pre-fetching and can reduce the cost of distance calculation for the KNN search. Cache miss expenditure must be kept in check by using smart cache strategies and rules to predict cache-miss scenarios.

4.1 Pre-Processing

Step 1: Data Partitioning – K-means Clustering

Data clustering is an essential component of our algorithm. By clustering as a pre-processing step, we are able to improve the performance of queries at runtime. A direct approach to reducing the complexity of the nearest neighbor search is to reduce the number of data points investigated, and the central strategy for reducing repeated, and in some cases unnecessary, distance computations is to partition the data space. ckSearch splits the data space into partitions and uses data clustering to avoid examining unnecessary data points in multidimensional data by clustering based on data similarities (Observation 1). The first step is to cluster the data set using an existing K-means clustering algorithm.

K-means clustering is a simple, partition-based clustering technique and a good fit for the data partitioning required for nearest neighbor search. It is important to mention that even though our approach uses K-means clustering, it does not depend on this particular clustering technique; we could just as easily have selected another clustering algorithm such as DBSCAN [31], CLARANS, or BIRCH [30]. In our algorithm, the number of clusters is selected based on the number of records present in the data set: we choose 5 clusters for up to 10,000 records and go up by 2 clusters for every 5,000 records.

Figure 4.1: Cluster data linearization


Once we have selected a number of cluster centers, we can use them to index our data. Figure 4.1 shows the cluster data linearization based on the distance between the center and each individual data point in that cluster. The cluster center is the starting point of the segment, and the cluster boundary is the maximum (end point) of the segment.

Step 2: Data index construction & Index structure

After the clustering phase, our algorithm constructs the data index. This data index is a single dimensional value based on the distance between the data point and a reference point in a data partition. During this part of the process, each high-dimensional point is transformed into a point in a single dimensional space.

This conversion is commonly known as data linearization or data mapping.

Linearization is achieved by selecting a reference point and then ordering all partitions according to their distances from the selected reference point. This fits well with Observation 2, which states that reference to a cluster center may expose similarity or dissimilarity of data points in a cluster; this similarity or dissimilarity is exposed by linearization in the form of a data mapping. There are several types of reference points that can be used for the linearization process. Typically the center of a cluster is used as the reference point, but some linearization techniques use either a boundary (edge) point or a random point as the reference. Ad hoc linearization approaches, such as space-filling curve methods like the Z-order curve [15], map d-dimensional points into a one-dimensional space (curve). For further cost reduction, we use a three-step data linearization algorithm, described below.


First, a reference point is identified for each partition or data cluster; the center of each partition or cluster is selected as the reference point. In the second step, the Euclidean distance between the data point p_i and the reference cluster center C_i is computed. In the final step, the following simple linear function is used to complete the conversion (i.e., the data mapping), transforming each high-dimensional data point into a key, key_i, in a single dimensional space.

key_i = distance(p_i, C_i) + m × µ        (4.1)

In the above function (4.1), the term key_i represents the single dimensional index value for a data point after the linearization process [11]. According to the research work on data partitioning by Agbhari & Makinouchi [11], data points in a cluster can be referenced and mapped by a fixed data point such as the cluster center; we utilized this concept to perform data linearization in this project. The distance function distance(p_i, C_i) is a Euclidean function that returns the single dimensional distance value between the data point p_i and the cluster center reference point C_i. The next parameter, m, is the number of the data cluster being processed. If there are M clusters in total, then the value of m is between 0 and M − 1, such that 0 ≤ m ≤ M − 1. If there are 10 clusters, for example, then m takes one of the values in the range [0, 1, 2, ..., 9].

The last parameter, µ, is a constant used to stretch the data ranges. The constant µ serves as a multiplier to the parameter m so that all points in a partition or cluster are mapped to a region between m × µ and (m + 1) × µ. Because of the µ multiplier, the function (4.1) correctly maps the cluster center as the minimum boundary (starting index) of this region and the furthest data point in the cluster as the maximum boundary (index) of this region. Moreover, all the data points in the cluster map appropriately between the minimum and maximum indices. As a result, one can issue a range query to find the nearest neighbors, enabling the use of an efficient single dimensional index structure such as the B-tree.

Figure 4.2: B-tree data structure

Figure 4.2 above shows a B-tree data structure. The leaf nodes contain the data points. The B-tree is especially optimized for search operations.

Step 3: Data structure & data loading

The selection of appropriate data structures is an integral part of any efficient search algorithm design. For a fast data retrieval algorithm such as ckSearch, it is vital to use a speedy data structure. In the ckSearch system, we used three different data structures. The core structure is the B-tree, which was used as the main data storage for our system. We also utilized one-dimensional and two-dimensional arrays; the two-dimensional array was used to store the minimum and maximum data distance for each cluster. Any balanced tree such as a B-tree works well as a fast cache data structure because of its rapid data retrieval time. Accordingly, an instance of the B-tree is used for the data caching implementation as well.

The B-tree is a data structure that keeps data sorted and allows searches, insertions, and deletions in logarithmic time. It is optimized for systems that read and write segments of data, such as data clusters, databases, and file-systems. In B-trees, non-leaf nodes can have a variable number of child nodes and are used as guides to the leaf nodes. Search operations with in-memory B-trees are significantly faster than with in-memory red-black trees and AVL trees [28]. The B-tree fits our ckSearch algorithm well because the costly B-tree insertion operations are only performed during the pre-processing index loading time; during the actual ckSearch runtime, only inexpensive search operations are performed on the B-tree to locate the k nearest neighbors. This strategy further aids our algorithm in improving overall processing time.

After the data linearization process, as described in the previous section, the mapped points are loaded into the B-tree. The transformed data point indexes serve as keys for our data structure, where only the leaf nodes store the actual data points. The conventional B-tree was modified so that each leaf node is linked to its neighboring leaf nodes on both sides. This modification assists in the speedy retrieval of nearest neighbor points.

In our algorithm, a two-dimensional array is used to store the maximum distance, distMax_i, between each cluster center C_i and the furthest data point in that cluster. Similarly, the minimum distances distMin_i are stored in this two-dimensional array as well. Our algorithm uses the distMax_i and distMin_i distance values to eliminate unnecessary out-of-boundary (data space) computations. A separate single dimensional array is used to store the cluster centers.


4.2 ckSearch Runtime Queries

In this section, we describe the ckSearch query. After loading the indexes into the tree-based data structure, the pre-processing part of the algorithm concludes. At this point, our algorithm performs the fast KNN search.

How ckSearch Works

In this section, we describe the search process of ckSearch. The overall technique is to solve the KNN problem iteratively. It begins by selecting a small radius r_i defining a small area around the query point and then iteratively increases the radius up to a maximum radius r_max. The search space is iteratively increased until all k nearest neighbor values are found or the stop criterion has been met (when r reaches r_max).

As explained above, during pre-processing the data points are clustered (using K-means clustering), reference points are selected (the cluster centers), data linearization is completed, and the data points are loaded into a B-tree data structure. The actual search begins by consulting the cache hit-miss strategy and determining the outcome based on the cache rules described below in the cache strategy section. Regardless of the outcome of the cache strategy, the ckSearch algorithm next inspects the following two stopping criteria:

• The search radius r_i has reached its maximum threshold value r_max and the k nearest neighbors still have not been found.

• The value distance(p_max, q), the distance between the query point q and the furthest data point p_max in the result set R, is less than or equal to the current search radius r_i, and the size of the result set is k. In this case, we can be sure that the algorithm has found all the k nearest neighbors of query point q, and any further increase of the query area (i.e., the search radius r_i) would only result in redundant computational cost.

Next, if the outcome of the cache hit-miss strategy is a hit, the algorithm enters the SearchCache(q) sub-routine (see Appendix B, Figure B.3). The data cache is a B-tree index structure modified to access left and right leaf nodes. At this point, the algorithm iteratively runs the SearchCache(q) sub-routine until stopped by the stopping criteria mentioned in the cache strategy section below. For each iteration, it increases the search radius r_i by an increment amount, r_increment, to widen the search space. If, instead of a cache hit, a cache miss occurs at the beginning of the search, our ckSearch algorithm enters a loop where it first checks the stopping criteria and then enters the SearchClusters(q) routine.

The SearchClusters(q) routine (see Appendix B, Figure B.2) is an important part of our algorithm because it applies the cluster search rules to significantly reduce computation cost. It checks every cluster iteratively and takes one of the following three actions:

• Exclude the cluster from the search: If the cluster in question does not contain or intersect the search sphere of the query point q and falls under the cluster exclusion rule (Rule 1), the cluster is exempted from the KNN search, and a significant reduction of computation cost occurs.

• Call SearchLeftNodes(), searching the cluster inwards and ignoring nodes to the right: If the cluster in question intersects the query search sphere according to the cluster-intersects-query-sphere rule (Rule 4), the data space inward toward the cluster center must be searched. In this case, only nodes to the left of the query node in the B-tree need to be searched; nodes to the right (in the B-tree) are ignored because they reside outside of this cluster boundary. Thus, our algorithm only calls the SearchLeftNodes(leafNode_i, key_left) sub-routine (Appendix B, Figure B.4) in the next step to search for the k nearest neighbors.

• Perform an exhaustive search: If the data cluster contains the query sphere of q, as determined by the cluster-contains-query-sphere rule (Rule 3), then an exhaustive search of the cluster must be completed to find the k nearest neighbors. The data space is traversed sufficiently to complete the search by searching inward and outward of the cluster center, because potential nearest neighbors can be to the left or right of the query node in the B-tree. The search routines SearchLeftNodes(leafNode_i, key_left) and SearchRightNodes(leafNode_i, key_right) are used for searching inward and outward of the cluster center.

Next, our ckSearch algorithm locates the leaf node, leafNode_i, in the B-tree where a point with the query index key_query would be stored. Intuitively, this leafNode_i has a high probability of holding the nearest neighbors of the query point, because the data points stored in leafNode_i have a similar distance from the cluster center as the query point q and therefore reside in the same region of the data space as the query point. The sub-routine getQueryLeaf(btree, key_query) returns this leaf node.

Next, based on the cluster search rules (as described in the "Cluster Search Rules" section), the ckSearch algorithm either calls only SearchLeftNodes(leafNode_i, key_left) for Rule 4, or calls both SearchLeftNodes(leafNode_i, key_left) and SearchRightNodes(leafNode_i, key_right) for Rule 3. Each of these sub-routines has built-in loops to check for the k nearest neighbors in leafNode_i. Moreover, these routines check left and right leaf nodes based on the inward or outward data search (Rule 3 or Rule 4).

Figure 4.3: Data cluster to B-tree correlation

Figure 4.3 above shows the data cluster to B-tree correlation: how the data points in a cluster are stored in the B-tree leaf nodes (bottom level). The data points are sorted based on their 1-dimensional, linearly transformed distance from the cluster center (used as keys).

It is important to mention that the actual discovery of the nearest neighbors happens in the SearchLeftNodes(leafNode_i, key_left) and SearchRightNodes(leafNode_i, key_right) sub-routines, because each of these two search routines iteratively calculates the distance between each data point in leafNode_i and the query point q. The k data points with the shortest distance to the query point are returned as the result set.


If the query sphere contains the first element of a node, then it is likely that its predecessor with respect to distance from the cluster center may also be close to q; thus, SearchLeftNodes(leafNode_i, key_left) also examines the left sibling leaf node for nearest neighbors. On the other hand, if the query sphere contains the last element of a node, then for the same reason the SearchRightNodes(leafNode_i, key_right) routine examines the right sibling leaf node for nearest neighbors.
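The distance work done inside these routines can be sketched as a scan over one leaf's points that maintains the k best candidates (illustrative Python with hypothetical names; the real routines also follow the linked sibling leaves as described above, and points are assumed to be tuples of numbers so that heap ties compare cleanly).

import heapq
import math

def scan_leaf(leaf_points, q, k, best):
    # Update the running k-best set with every point stored in one leaf.
    # `best` is a max-heap of (-distance, point) holding at most k entries.
    for p in leaf_points:
        d = math.dist(p, q)
        if len(best) < k:
            heapq.heappush(best, (-d, p))
        elif d < -best[0][0]:                 # closer than the current k-th neighbor
            heapq.heapreplace(best, (-d, p))
    return best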

At the end of these phases, the algorithm re-examines the two stopping criteria mentioned above. It checks the KNN result set R and stops if the k nearest neighbors have been identified; moreover, it ensures that further enlargement of the search sphere would not change the KNN result set. The search process only stops if the distance of the furthest data point in the answer set R from the query point q is less than or equal to the current search radius r_i. Otherwise, it increases the search radius and repeats the entire process. Figures B.1–B.5 in Appendix B give the pseudocode of the sub-routines mentioned above.


Figure 4.4: ckSearch algorithm data caching scheme

Figure 4.4 shows the ckSearch algorithm data caching scheme. In this example, the query point q and data point A reside in the same leaf node of the ckSearch cache. This is a cache hit scenario.

Cache Strategy

Data caching is an important component of our KNN search algorithm. A data cache can prove extremely effective in a KNN search process where a large set of high-dimensional data points is repeatedly accessed. A fast cache implementation can dramatically reduce the number of distance computations by simply storing frequently accessed data in a data cache. On the other hand, expensive cache misses can degrade performance. Thus, we have developed a cache strategy to reduce redundant computation while avoiding expensive cache misses (and therefore reducing costly B-tree insertion operations). This cache strategy is comprised of the following rules:

• Reduce the cost of insertion operations as much as possible by reducing frequent cache updates. The underlying data structure of our cache strategy is a B-tree, which is well suited to fast cache implementations; inserting a record into a B-tree requires O(log n) operations in the worst case.

• Conduct preliminary checks before performing costly cache searches, to reduce the cost of cache misses. We take this conservative approach to make sure that cache hits remain a performance boost for the ckSearch system and are not overwhelmed by too many cache misses. For a given query point, we find the closest cluster by calculating the distance between the query point and the cluster centers. We then check whether the closest cluster to the query point is the same as the cluster stored in the data cache (a B-tree structure). Our assumption here is that two consecutive query points will fall in the same cluster, and probably in the same region of that cluster, so their k nearest neighbors will also lie in that region.

• Perform an additional check by matching the query point's leaf node from the data-cache B-tree with the corresponding leaf node from the main data-storage B-tree. These two leaf nodes essentially indicate the same region of the same data cluster. If they turn out to be the same, the current query point falls in the same data region as the previous query point, because our data structure keeps the leaf nodes sorted by distance from the cluster center; for two leaf nodes to be the same, the data points stored in them must be located in the same region of a cluster. In that case, the ckSearch algorithm proceeds to retrieve the k nearest neighbors from the data cache.

• If the above checks indicate a cache miss, our algorithm skips the data cache and performs the search on the main data-storage B-tree. At the end of the query search, the leaf nodes containing the nearest neighbors are loaded into the data-cache B-tree for the next query iteration by the CacheUpdate process.
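As referenced above, the following is a minimal sketch of how the cache-hit test implied by these rules could be organized. The class CkCache and its fields and methods are illustrative assumptions, not identifiers from the thesis implementation; leaf nodes are represented here simply by their B-tree keys.

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Cheap checks first: is the query's closest cluster the cached cluster, and
// does the query fall into a leaf region that the cache currently holds?
final class CkCache {
    private int cachedClusterId = -1;                   // cluster the cache currently serves
    private final Set<Long> cachedLeafKeys = new HashSet<>();

    boolean isCacheHit(int closestClusterId, long queryLeafKey) {
        if (closestClusterId != cachedClusterId) return false;  // different cluster: miss
        return cachedLeafKeys.contains(queryLeafKey);           // same leaf region: hit
    }

    // After a miss, load the leaves that held the answers (the CacheUpdate step).
    void update(int clusterId, Collection<Long> answerLeafKeys) {
        cachedClusterId = clusterId;
        cachedLeafKeys.clear();
        cachedLeafKeys.addAll(answerLeafKeys);
    }
}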

Cluster Search Rules

Our online search strategy depends critically on a search radius parameter r_i. We initially select r_i to be conservatively small; if a query does not return enough points, we gradually increase the value of r_i. In this section we describe several cluster search rules based on the query radius, the cluster boundary, and the location of the query point. Using these parameters and simple geometric calculations, it is possible to determine with certainty that some clusters cannot contain any of the k nearest neighbors. These clusters can be completely excluded from the computation, avoiding a significant amount of computational cost. The following rules are applied at query time (runtime) by ckSearch.


Figure 4.5: Cluster search rule 1 (Cluster exclusion rule)

Figure 4.5 illustrates cluster search rule 1 (the cluster exclusion rule). In this example, the query point lies outside the cluster M_1, so this cluster can be excluded from the KNN search operations, eliminating expensive distance computations.

Rule 1: The cluster exclusion rule

A cluster can be excluded from nearest neighbor search if the following condition is true,

distance(C_i, q) - r_i > distMax_i        (4.2)

Employing an exclusion strategy, it is possible to exclude a cluster and its data points from KNN search. Naturally, by excluding a cluster from distance computations, computation cost can be reduced.


Let C_i be the reference point (cluster center) of cluster M_i, and let the query point q have a search radius r_i. As described above, r_i is the radius of the search area within which the ckSearch system looks for possible nearest points. The distance between the cluster center and the query point q is denoted by distance(C_i, q), and the distance between C_i and the furthest data point in cluster M_i is denoted by distMax_i. Given the condition distance(C_i, q) > distMax_i, the cluster M_i can be excluded from the KNN search whenever the query point q and its query sphere rest entirely outside the cluster boundary.
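A direct translation of condition (4.2) into code might look like the following sketch; the class and parameter names are illustrative assumptions, with clusterDist standing for distance(C_i, q) and distMax for distMax_i.

// Rule 1: the whole query sphere lies outside the cluster boundary,
// so the cluster can be skipped entirely.
final class Rule1 {
    static boolean excludeCluster(double clusterDist, double r, double distMax) {
        return clusterDist - r > distMax;
    }
}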

Figure 4.6: Cluster search rule 2 (Cluster search region rule)

Figure 4.6 shows cluster search rule 2 (the cluster search region rule). This rule describes the valid search region for a query point within a cluster; it restricts the search computations to that region and avoids unnecessary iterations over invalid regions.


Rule 2: Cluster search region rule

When a cluster is searched for nearest neighbor points, the effective search range is

dist_min = max{distMin_i, distance(C_i, q) - r_i}

dist_max = min{distMax_i, distance(C_i, q) + r_i}

and the effective search region is the interval [dist_min, dist_max].        (4.3)

A carefully selected search region can further reduce the cost of the nearest neighbor search. Moreover, a range query can be performed using this search range within an affected cluster. Most importantly, search termination rules can be set up based on this range while the leaf nodes of the B-tree index structure are scanned for nearest neighbors, which speeds up data retrieval from the B-tree.

Let the distance between the cluster center C_i and the query point q be denoted by distance(C_i, q), and let the query point q have a search radius r_i. Moreover, let the distances from the cluster center C_i to the furthest and the closest data points in cluster M_i be denoted by distMax_i and distMin_i, respectively. From these quantities we can deduce the effective search region of a cluster, because no data point relevant to the query lies beyond this region.
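A small sketch of this range computation follows; the class and parameter names are illustrative assumptions, not the thesis code, and the parameter names mirror the notation used above.

// Rule 2: the only distances from Ci that can hold points inside the query
// sphere lie in the intersection of [distMin, distMax] (where the cluster's
// points live) and [clusterDist - r, clusterDist + r] (where the sphere reaches).
final class Rule2 {
    static double[] effectiveRange(double clusterDist, double r,
                                   double distMin, double distMax) {
        double lo = Math.max(distMin, clusterDist - r);   // no cluster point is closer to Ci
        double hi = Math.min(distMax, clusterDist + r);   // points farther out cannot be in the sphere
        return new double[] { lo, hi };
    }
}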


Figure 4.7: Cluster search rule 3 (Cluster contains query sphere)

Figure 4.7 illustrates cluster search rule 3 (cluster contains query sphere). In this example, the query point q and its search region of radius r_1 lie completely inside the cluster M_1; thus, cluster M_1 contains q's query sphere.

Rule 3: Cluster contains query sphere rule

The query sphere with radius r_i is completely contained in the affected cluster M_i if the following condition is true:

distance(C_i, q) + r_i ≤ distMax_i        (4.4)

It is important to know whether the query point q and its query sphere are completely contained in the partition (cluster), because this information can be used to formulate a smarter nearest neighbor search and, in turn, reduce search-related computation cost.


Let distance(C_i, q) be the distance between the cluster center C_i and the query point q, and let distMax_i be the radius of the cluster M_i. Given distance(C_i, q) ≤ distMax_i, if the condition distance(C_i, q) + r_i ≤ distMax_i is also true, then the cluster M_i completely contains the query sphere (see figure 4.7).

Figure 4.8: Cluster search rule 4 (Cluster intersects query sphere)

Figure 4.8 shows cluster search rule 4 (cluster intersects query sphere). In this example, the query sphere of q only intersects the cluster M_1; thus, it is possible that some of the k nearest neighbors are not in cluster M_1.

Rule 4: Cluster intersects query sphere rule

The query sphere with radius r_i intersects the affected cluster M_i if the following condition is true:

distance(C_i, q) - r_i ≤ distMax_i        (4.5)


As in the previous section, it is important to know whether a cluster intersects the search sphere of the query point q. In this case the nearest neighbors may lie in the cluster in question, but it is also possible that they are located in another cluster, so the iterative search process may continue.

Let distance(C_i, q) be the distance between the cluster center C_i and the query point q, and let distMax_i be the radius of the cluster M_i. Assuming distance(C_i, q) > distMax_i, i.e., the query point lies outside the affected cluster, if the condition distance(C_i, q) - r_i ≤ distMax_i is true then the cluster M_i partially intersects the query sphere (see figure 4.8).
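The containment and intersection tests of Rules 3 and 4 reduce to two one-line comparisons, sketched below for illustration. The class and parameter names are assumptions, not the thesis code; clusterDist stands for distance(C_i, q) and distMax for distMax_i.

// Rule 3: the cluster completely contains the query sphere (equation 4.4).
// Rule 4: the query sphere at least intersects the cluster (equation 4.5).
final class Rule3And4 {
    static boolean containsQuerySphere(double clusterDist, double r, double distMax) {
        return clusterDist + r <= distMax;
    }

    static boolean intersectsQuerySphere(double clusterDist, double r, double distMax) {
        return clusterDist - r <= distMax;
    }
}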


CHAPTER 5

EXPERIMENTS & RESULTS

In this section we detail the experimental setup, describe our experiments, and present their results. The main objective of the experiments is to evaluate the performance of our ckSearch system. The indexing strategies of ckSearch are tested on data sets of varying size, dimensionality, and distribution.

We use the KD-tree algorithm as a benchmark for comparison. The KD-tree algorithm is an effective and commonly used KNN method based on a multi-dimensional indexing structure, and it is especially appealing for comparison because it is similar to our ckSearch multi-dimensional indexing tree structure. The focus of our research is to speed up the learned-data classification process, and it is especially applicable to an existing autonomous robot that currently uses a KD-tree based KNN system. According to Arya and Silverman [2], linear search serves as an effective KNN search technique, so for completeness we also compare the performance of ckSearch with the linear search KNN technique.

5.1 Setup information

The ckSearch search algorithm and the related K-means clustering technique were implemented in Java. A tree-based indexing structure was used as the primary data structure, along with a two-dimensional array to store cluster information. The linear search and KD-tree implementations are also Java based; the KD-tree code was obtained from our colleagues at the Georgia Institute of Technology [37]. Experiments were performed on a 1.5 GHz PC with 512 MB of main memory, running Microsoft Windows XP (version 2002, SP3).

For our training set we used training data generated by an autonomous robot guided by a human. We also created synthetic test data sets ranging from 10,000 to 100,000 records with various dimensionalities (9, 18, 36, 50, and 60). Each query uses a d-dimensional point. One hundred query trials were used for each experiment, and we averaged the total query time to even out the I/O cost.

5.2 The effect of the size of the data set

The size of the data set can play a significant role in the performance of a KNN algorithm whose searches are O(n) in the number of stored items. To evaluate our system against this criterion, we conducted a series of experiments using a 60-dimensional data set with k set to 1, 3, and 10. During these experiments we gradually increased the number of data points, starting with 5000 and increasing to 10000, 20000, 40000, 50000, 60000, 80000, and 100000. For each data set we recorded the execution time of ckSearch and compared it with the execution time of the linear search implementation. The results are tabulated in table 5.1.

The following tables show the effect of the size of the data set on the ckSearch and linear search implementations of the KNN algorithm. The size of the data set was gradually increased and the execution times were recorded.


Data Size   Dimension   Linear search (ms)   ckSearch (ms)   k
5000        60          44.010621            3.574           1
10000       60          79.029039            5.663           1
20000       60          159.794052           15.7322         1
40000       60          219.094885           12.4334         1
50000       60          264.959094           16.3343         1
80000       60          264.959094           26.3441         1
100000      60          721.357044           34.59425        1

Table 5.1: The effect of data size on performance (k=1)

[Chart: execution time (ms) vs. data set size for linear search and ckSearch, k = 1]

Figure 5.1: Performance vs. data set size chart (k = 1)


Data Size   Dimension   Linear search (ms)   ckSearch (ms)   k
5000        60          73.031907            13.91388        3
10000       60          98.910615            29.22268        3
20000       60          174.237228           40.72744        3
40000       60          278.986854           87.66623        3
50000       60          325.127635           68.59792        3
80000       60          520.718415           155.5077        3
100000      60          873.28166            122.8424        3

Table 5.2: Effect of data size on performance (k=3)

[Chart: execution time (ms) vs. data set size for linear search and ckSearch, k = 3]

Figure 5.2: Performance vs. data set size chart (k = 3)


Data Size   Dimension   Linear search (ms)   ckSearch (ms)   k
5000        60          73.031907            13.91388        10
10000       60          98.910615            29.22268        10
20000       60          174.237228           40.72744        10
40000       60          278.986854           87.66623        10
50000       60          325.127635           68.59792        10
80000       60          520.718415           155.5077        10
100000      60          873.28166            122.8424        10

Table 5.3: Effect of data size on performance (k=10)

[Chart: execution time (ms) vs. data set size for linear search and ckSearch, k = 10]

Figure 5.3: Performance vs. data set size chart (k = 10)


Data Size   Dimension   k = 1      k = 3          k = 10
5000        60          12.32791   5.248852694    3.226432
10000       60          13.96273   3.384720805    2.27282
20000       60          10.17797   4.278128799    2.999871
40000       60          17.66894   3.182375559    3.516738
50000       60          16.55994   4.739613768    2.510961
80000       60          10.19073   3.348506537    4.066482
100000      60          20.85194   7.10895797     3.356904

Table 5.4: ckSearch speedup over linear search

[Chart: ckSearch speedup over linear search vs. data set size, for k = 1, 3, and 10]

Figure 5.4: Chart showing ckSearch speedup over the linear search

Tables 5.1, 5.2, and 5.3 show the results of the "effect of the size of the data set" experiment, in which we evaluated the performance of the ckSearch algorithm against an implementation of the linear search KNN algorithm. The results clearly show that ckSearch performed far better than linear search. The speedup chart (figure 5.4) confirms that ckSearch achieves and maintains a steady speedup over the linear search method for several values of k, and ckSearch copes much better than linear search as larger data sets increase the number of required computations.
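For reference, the speedup values in Table 5.4 are simply the ratio of the linear search time to the ckSearch time from Tables 5.1 through 5.3; for example, with 100,000 records and k = 1, 721.357 ms / 34.594 ms ≈ 20.85, matching the last row of Table 5.4.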

5.3 The effect of data dimension on the performance

The number of dimensions of a data set can influence the performance of a KNN algorithm because of the increased cost of Euclidean distance computations on high-dimensional data. An autonomous robot's navigational data set can be high-dimensional; for example, an autonomous robot may use high-dimensional sensor arrays or high-dimensional image processing for navigation. Thus, we focused on evaluating ckSearch performance on high-dimensional data.

In this experiment we used large data sets with 50000 and 100000 records. For the first experiment the k value was set to 1 and the data set size to 50000; for the second, the k value was set to 3 and the data set with 100000 records was used. Each experiment started with a 9-dimensional data set, and the number of dimensions was then gradually increased to 18, 36, 60, and 75, with the execution time recorded each time. To enable comparisons, the same experiments were performed with a KD-tree implementation and a linear search technique. The following tables show the effect of data dimensionality on the ckSearch, KD-tree, and linear search implementations of the KNN algorithm.


Dimension   Data Set Size   KD-tree (ms)   Linear Search (ms)   ckSearch (ms)
9           50000           125.3342       268.728288           22.26529
18          50000           112.7768       291.717802           19.05041
36          50000           140.987        324.284511           18.37775
60          50000           157.7129       345.627796           30.71787
75          50000           222.6564       561.185849           27.38437

Table 5.5: The effect of data dimension on performance (N=50K)

[Chart: execution time (ms) vs. data dimension for linear search, KD-tree, and ckSearch, N = 50000, k = 1]

Figure 5.5: Data dimension vs. performance chart (N = 50K)

Dimension   Data Set Size   k   Linear Search (ms)   ckSearch (ms)
9           100000          3   419.812183           35.07446
18          100000          3   451.023915           92.92075
36          100000          3   477.775919           158.4776
60          100000          3   571.532918           122.8424
75          100000          3   660.196533           140.9969

Table 5.6: The effect of data dimension on performance (N=100K)


[Chart: execution time (ms) vs. data dimension for linear search and ckSearch, N = 100000, k = 3]

Figure 5.6: Data dimension vs. performance chart (N = 100K)

Dimension   Data Set Size   Linear Search (ms)   ckSearch (ms)   Speedup
9           100000          419.812183           35.07446        11.96917
18          100000          451.023915           92.92075        4.853856
36          100000          477.775919           158.4776        3.014785
60          100000          571.532918           122.8424        4.652569
75          100000          660.196533           140.9969        4.682349

Table 5.7: ckSearch speedup over linear search for various dimensions

In this experiment we compared the performance of ckSearch with the KD-tree and linear search implementations. The results clearly show that the ckSearch system performed better than both the KD-tree and the linear search method, and ckSearch achieved considerable speedup over linear search (see table 5.7). The results also show that, as the number of dimensions increases, the KD-tree and linear search performance gradually degrades; with the larger data set (100000 records) the linear search performance degrades at a higher rate. ckSearch, on the other hand, shows robustness to the increase in dimension: its execution time grows at a much slower rate than that of the KD-tree and linear search systems (see figure 5.6).

5.4 The effect of search radius on the performance

The search radius is an important factor for the ckSearch system, which uses an incremental, radius-based search: a small search sphere is used initially and enlarged when the search condition cannot be met. Since ckSearch relies on the search sphere to minimize repeated, costly distance calculations, it is important to study the effect of the search radius on the performance of the system.

In this experiment we used a data set with 10000 records. We used a k value of 1 for the first part of the experiment and a k value of 3 for the second part. The radius value was gradually increased from 1.0 meter to 10.0 meters, and the execution time of each run was recorded.

The following table shows the effect of the search radius on ckSearch query performance. Even though the KD-tree and linear search methods do not use a search radius, we list their performance results for comparison with the ckSearch system.


Radius (m)   Data Set Size   KD-tree (ms)   Linear Search (ms)   ckSearch (ms)
1.0          10000           293.99938      109.38541            5.122073
2.0          10000           293.99938      109.38541            13.31026
3.0          10000           293.99938      109.38541            19.89829
4.0          10000           293.99938      109.38541            26.42299
5.0          10000           293.99938      109.38541            34.06723
6.0          10000           293.99938      109.38541            39.26128
7.0          10000           293.99938      109.38541            42.53429
8.0          10000           293.99938      109.38541            46.6953
9.0          10000           293.99938      109.38541            49.8422
10.0         10000           293.99938      109.38541            51.81746

Table 5.8: The effect of search radius on performance (k = 3)

[Chart: execution time (ms) vs. search radius for linear scan, KD-tree, and ckSearch, k = 3]

Figure 5.7: Search radius vs. performance for 10000 data records


Radius (m)   Data Set Size   Linear Search (ms)   ckSearch (ms)
1.0          10000           169.24721            11.324462
2.0          10000           169.24721            27.077553
3.0          10000           169.24721            43.68484
4.0          10000           169.24721            63.019417
5.0          10000           169.24721            76.492185
6.0          10000           169.24721            85.420351
7.0          10000           169.24721            94.545402
8.0          10000           169.24721            102.19342
9.0          10000           169.24721            112.9639
10.0         10000           169.24721            114.11205

Table 5.9: The effect of search radius on performance (k = 10)

[Chart: execution time (ms) vs. search radius for linear scan and ckSearch, k = 10]

Figure 5.8: Search radius vs. performance chart for 10,000 data records (k = 10)

Considering the experimental results listed in tables 5.8 and 5.9, the search radius has a significant impact on ckSearch performance: we observe a sharp increase in execution time as the search radius grows. We believe this is due to an increase in the number of redundant distance computations. As will be shown in the accuracy experiments, the ckSearch algorithm finds the results well before reaching the maximum radius of 10.0 meters used in these experiments.

5.5 The effect of search radius on accuracy

This experiment is similar to the one above on the effect of the search radius on execution time; here we evaluate the effect of the search radius on accuracy. A small search sphere is used as the starting radius and is enlarged when the search condition cannot be met; in this experiment we started with a radius of 1.0 meter and increased it to 10.0 meters. Since the ckSearch system relies on the search sphere to minimize repeated, costly distance calculations, it is important to study the effect of the search radius on the accuracy of the ckSearch system.

For each radius value used during the query, we recorded the percentage of correct nearest neighbors found by the ckSearch algorithm. The results of this experiment are shown in the table below.

Radius (m)   Data Set Size   k = 3 (%)   k = 10 (%)
1.0          10000           0.0         0.0
2.0          10000           85.423      76.667
3.0          10000           97.355      97.702
4.0          10000           99.856      99.113
5.0          10000           100.0       99.822
6.0          10000           100.0       100.0
7.0          10000           100.0       100.0
8.0          10000           100.0       100.0
9.0          10000           100.0       100.0
10.0         10000           100.0       100.0

Table 5.10: The effect of the search radius on query accuracy


[Chart: percentage of correct nearest neighbors found vs. search radius, for k = 3 and k = 10, N = 10000, d = 60]

Figure 5.9: Search radius vs. query accuracy chart

The experimental results in the table show that the accuracy of the query search improves as the search radius increases from 1.0 meter to 10.0 meters: a larger search radius allows the ckSearch algorithm to assess more candidate neighbors, so accuracy increases with the radius. It is also important to notice that the ckSearch algorithm achieves 100% accuracy well before the maximum radius value of 10.0 meters, which indicates that selecting a proper radius is important for the performance of the ckSearch system.

5.6 The effect of the number of clusters

The number of clusters can affect the performance of a cluster-based algorithm. Even though clustering for ckSearch is part of the pre-processing stage and does not directly contribute to query time, it can indirectly influence the ckSearch running time. To find out the effect of the number of clusters, we performed several experiments investigating their effect on the ckSearch system. As the number of clusters increases, it is plausible that computational complexity, and in turn computation time, increases.

In this experiment the number of clusters was gradually increased and the resulting performance recorded. We used 5, 10, 20, 30, and 50 clusters with a data set of 50,000 records, and we conducted two separate experiments with different numbers of nearest neighbors, k = 1 and k = 5. The results are tabulated below.

Clusters   k   KD-tree (ms)   Linear Search (ms)   ckSearch (ms)
5          1   266.2874       246.287447           29.59359
10         1   266.2874       246.287447           27.03791
20         1   266.2874       246.287447           29.7673
30         1   266.2874       246.287447           17.83305
50         1   266.2874       246.287447           21.50379

Table 5.11: The effect of the number of clusters on performance (k=1)

[Chart: execution time (ms) vs. number of clusters for linear search, KD-tree, and ckSearch, N = 50000, k = 1]

Figure 5.10: The number of clusters vs. performance chart for 50000 data records (k = 1)


Clusters   k   Linear Search (ms)   ckSearch (ms)
5          5   363.641468           116.6476
10         5   363.641468           121.0026
20         5   363.641468           104.7671
30         5   363.641468           86.15084
50         5   363.641468           114.5369

Table 5.12: The effect of the number of clusters on performance (k=5)

[Chart: execution time (ms) vs. number of clusters for linear search and ckSearch, N = 50000, k = 5]

Figure 5.11: The number of clusters vs. performance chart (k = 5)

Tables 5.11 and 5.12 show the results of our experiments with the number of clusters. Our initial hypothesis was that performance would decrease as the number of clusters increases, because more clusters take longer to search. Interestingly, according to our results, ckSearch performance times remain roughly the same or increase only very slightly. We hypothesize that this is because the data records are spread over a larger number of clusters and most of the cluster search is eliminated by the cluster search rules, which prevent the ckSearch system from unnecessary searching.


CHAPTER 6

CONCLUSION

In this thesis we introduced a new algorithm for K-nearest neighbor queries that uses clustering and caching to improve performance. The main idea is to reduce the cost of distance computations between the query point and the data points in the data set. We used a divide-and-conquer approach: first, we divide the training data into clusters based on the similarity between data points, measured by Euclidean distance; next, we use linearization for faster lookup. The data points in a cluster can be sorted by their similarity (Euclidean distance) to the center of the cluster, and fast search data structures such as the B-tree can then store the points by their distance from the cluster center and support fast search, including range search. We achieved a further performance boost by using B-tree based data caching. In this work we provided details of the algorithm, an implementation, and experimental results in a robot navigation task.

We conducted extensive experiments on the performance and accuracy of the ckSearch algorithm. To confirm the performance improvement for KNN queries, we evaluated the ckSearch system on both large and small data sets. Several of our experiments focused on the performance of ckSearch with high-dimensional data sets, since many KNN search algorithms perform poorly on high-dimensional data. The results show that our algorithm is both effective and efficient; in fact, the ckSearch algorithm achieves a performance improvement over both the KD-tree and the linear scan KNN algorithms.

In the future we will further improve the system by adding an analysis to select the best possible initial search radius for the ckSearch algorithm. Selecting too small a search radius can lead to many unnecessary iterations; we intend to remedy this weakness of the system by adding such a radius-selection analysis.


REFERENCES

[1] M. Procopio, T. Strohmann, A. Bates, G. Grudic, J. Mulligan. Using Binary Classifiers to Augment Stereo Vision for Enhanced Autonomous Robot Navigation. April 2007.

[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, A. Y. Wu. An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions. Journal of the ACM, Vol. 45, No. 6, 1998, pp. 891-923.

[3] V. Ramasubramanian, Kuldip K. Paliwal. Fast nearest-neighbor search algorithms based on approximation-elimination search. January 1999.

[4] J. Chua, P. Tischer. A Framework for the Construction of Fast Nearest Neighbour Search Algorithms. Monash University, Australia.

[5] J. Chua, P. Tischer. Minimal Cost Spanning Trees for Nearest-Neighbour Matching. Monash University, Australia.

[6] V. Athitsos, M. Potamias, P. Papapetrou, G. Kollios. Nearest Neighbor Retrieval Using Distance-Based Hashing. In Proc. IEEE International Conference on Data Engineering (ICDE), April 2008.

[7] Y. Hsueh, R. Zimmermann, M. Yang. Approximate Continuous K Nearest Neighbor Queries for Continuous Moving Objects with Pre-Defined Paths. Department of Computer Science, University of Southern California.

[8] W. Shang, H. Huang, H. Zhu, Y. Lin, Z. Wang, Y. Qu. An Improved kNN – Fuzzy kNN Algorithm. School of Computer and Information Technology, Beijing Jiaotong University, China.

[9] A. Jain, M. Murty, P. Flynn. Data Clustering: A Review. Michigan State University, U.S.A.

[10] A. Duch, V. Castro, C. Martinez. Randomized K-Dimensional Binary Search Trees. September, 1998.

[11] Z. Aghbari, A. Makinouchi. Linearization Approach for Efficient KNN Search of High-Dimensional Data. University of Sharjah, Sharjah, UAE.


[12] R. Weber, H. Schek, S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. ETH Zentrum, Zurich.

[13] A. Thomasian, L. Zhang. The Stepwise Dimensionality Increasing (SDI) Index for High-Dimensional Data. May, 2006.

[14] B. Zheng, W. Lee, D. Lee. Search K Nearest Neighbors on Air. Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.

[15] H. Zhang, A. Berg, M. Maire, J. Malik. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. University of California, Berkeley, California.

[16] C. Yu, B. Ooi, K. Tan, H. Jagadish. Indexing the Distance: An Efficient Method to KNN Processing. In Proc. of the 27th VLDB Conference, Roma, Italy, 2001.

[17] A. Nuchter, K. Lingemann, J. Hertzberg. 6D SLAM with Cached kd-tree Search. University of Osnabruck, Osnabruck, Germany.

[18] G. Neto, H. Costelha, P. Lima. Topological Navigation in Configuration Space Applied to Soccer Robots. Instituto Superior Tecnico, Portugal.

[19] C. Yu, S. Wang. Efficient Index based KNN join processing for high dimensional data. Information and Software Technology. May 2006.

[20] G. DeSouza, A. Kak. Vision for Mobile Robot Navigation: A Survey. IEEE Transactions on pattern analysis and machine intelligence, vol. 24, no. 2, February, 2002.

[21] E. Plaku, L. Kavraki. Distributed Computation of the knn Graph for Large High-Dimensional Point Sets. Journal of Parallel and Distributed Computing, 2007, vol. 67(3), pp. 346-359.

[22] J. L. Bentley. Multidimensional Binary Search Trees in Database Applications. IEEE Trans. on Software Engineering, SE-5(4):333-340, July 1979.

[23] N. Ripperda, C. Brenner. Marker-Free Registration of Terrestrial Laser Scans Using the Normal Distribution Transform. University of Hannover, Germany.

[24] C. Atkeson, S. Schaal. Memory-Based Neural Networks For Robot Learning. GIT, Atlanta, Georgia.

[25] J. Nievergelt, H. Hinterberger, K. Sevcik. The Grid File: An Adaptable, Symmetric Multikey File Structure. ACM Trans. on Database Systems, 9(1):38-71, 1984.


[26] A. Gionis, P. Indyk, R. Motwani. Similarity search in high dimensions via Hashing. In International Conference on Very Large Databases (VLDB), 1999 pp. 518-529.

[27] V. Athitsos, M. Potamias, P. Papapetrou, G. Kollios. Nearest Neighbor Retrieval Using Distance-Based Hashing.

[28] A. Andoni, P. Indyk. Efficient algorithms for substring nearest neighbor Problem. In ACM-SIAM Symposium on Discrete Algorithms (SODA). 2006, pp. 1203 – 1212.

[29] T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery.

[30] G. Grizaite, R. Oberperfler. DBSCAN Clustering Algorithm. January 31, 2005.

[31] T. Bingmann. "STX B+ Tree Template Classes: Speed Test Results." 2008. Idlebox. Accessed 4th April, 2009. <http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html>.

[32] D. Bentivegna. Learning from Observation Using Primitives. Doctoral Dissertation, Georgia Institute of Technology, 2004.

[33] L. Xiong, S. Chitti. Mining multiple private databases using a kNN classifier. In Proceedings of the 2007 ACM symposium on Applied computing. 2007, pp. 435 - 440.

[34] H. Franco-Lopez, A. Ek, M. Bauer. Estimation and mapping of forest stand density, volume, and cover type using the k-nearest neighbors method. Remote Sensing of Environment, Vol. 77, No. 3, 2001, pp. 251-274.

[35] H. Maarse, P. Slump, A. Tas, J. Schaefer. Classification of Wines According to Type. Zeitschrift für Lebensmitteluntersuchung und -Forschung A, Vol. 184, No. 3, March 1987, pp. 198-203.

[36] A. Sohail, P. Bhattacharya. Classification of Facial Expressions Using K-Nearest Neighbor Classifier. Computer Graphics Collaboration Techniques, Vol. 4418, June 2007, pp. 555-566.

[37] S. Arya, D. Mount. "ANN: A Library for Approximate Nearest Neighbor Searching." August 4, 2006. Accessed 14th April, 2009. <http://www.cs.umd.edu/~mount/ANN/>.


APPENDIX A

NOTATION TABLE

Notation

Table A.1 lists the symbols, functions, and parameters used in this thesis. The following terms and notations are used throughout, especially in the pseudocode of Appendix B.

d                      Number of dimensions
N                      Number of data points
D ∈ Ω                  Data set
Ω = [0,1]^d            Data space
R                      Result set containing the k nearest neighbors
C_i                    Cluster center reference point
r                      Radius of a search sphere
r_increment            Radius increment value
r_max                  Maximum radius value for the STOP criterion
p_i                    A data point p in the ith cluster
distMax_i              Maximum radius of a partition M_i
distMin_i              Distance between C_i and the closest point to C_i
p_max                  The furthest data point from q in the KNN result set R
FurthestPoint(R, q)    Furthest point from query point q in set R
SearchRadius(q)        Search radius of query point q
SearchSphere(q, r)     Sphere with query point q at its center and radius r
distNearest_q          Nearest distance to query point q
distance(p_i, C_i)     Distance between point p_i and cluster center C_i
key_i                  B-tree index of nodes and data entries in a leaf node
data_i                 Data entries in a leaf node of the B-tree
dist_Center            Distance from query point q to cluster center C_i
GetNearest(q)          Nearest neighbor to query point q

Table A.1: List of various notations used in this thesis


APPENDIX B

IMPLEMENTATION PSEUDOCODE

ckSearch_KNN(q):
    initialize();
    loadBTree();
    r_increment = increment value;
    R = empty;

    if (IsCacheHit(q) == true):
        while (r < r_max):
            if (distance(p_max, q) < r and R.Size() == k):
                STOP;
                return;
            r = r + r_increment;
            SearchCache(q);

    else if (IsCacheHit(q) == false):
        while (r < r_max):
            if (distance(p_max, q) < r and R.Size() == k):
                STOP;
                return;
            r = r + r_increment;
            SearchClusters(q);
            UpdateCache();

End ckSearch_KNN;

Figure B.1: ckSearch KNN algorithm

The figure above shows the pseudocode of the ckSearch KNN query algorithm. This is one of several methods used to implement the ckSearch algorithm.

68

SearchClusters(q):
    for i = 0 to (M - 1):
        dist_Center = distance(C_i, q);

        if (exclude(i, q) == true):                 // Rule 1: exclude the cluster
            SKIP CLUSTER i;

        else if (intersects(i, q) == true):         // Rule 4: cluster intersects the query sphere
            key_query = i * µ + dist_Center;
            leafNode_i = getQueryLeaf(btree, key_query);
            key_left = i * µ + (dist_Center - r);
            SearchLeftNodes(leafNode_i, key_left);

        else if (contains(i, q) == true):           // Rule 3: cluster contains the query sphere
            key_query = i * µ + dist_Center;
            leafNode_i = getQueryLeaf(btree, key_query);
            key_left = i * µ + (dist_Center - r);
            SearchLeftNodes(leafNode_i, key_left);
            key_right = i * µ + (dist_Center + r);
            SearchRightNodes(leafNode_i, key_right);
    // end of for loop
END;

Figure B.2: SearchClusters(q) pseudocode

Figure B.2 above shows the pseudocode of the main cluster search algorithm. This SearchClusters(q) routine is part of our proposed ckSearch KNN search algorithm.


SearchCache(q):
    index = index of the cached cluster;
    for i = 0 to (M - 1):                           // searching all cached clusters
        dist_Center = distance(C_i, q);

        if (exclude(i, q) == true):                 // Cluster Rule 1: exclude the cluster
            SKIP CLUSTER i;

        else if (intersects(i, q) == true):         // Cluster Rule 4: cluster intersects the query sphere
            key_query = i * µ + dist_Center;
            leafNode_i = getQueryLeaf(btree, key_query);
            key_left = i * µ + (dist_Center - r);
            SearchLeftNodes(leafNode_i, key_left);

        else if (contains(i, q) == true):           // Cluster Rule 3: cluster contains the query sphere
            key_query = i * µ + dist_Center;
            leafNode_i = getQueryLeaf(btree, key_query);
            key_left = i * µ + (dist_Center - r);
            SearchLeftNodes(leafNode_i, key_left);
            key_right = i * µ + (dist_Center + r);
            SearchRightNodes(leafNode_i, key_right);

END;

Figure B.3: The “SearchCache(q)” algorithm pseudocode

Figure B.3 above shows the pseudocode of the cache search algorithm. This SearchCache(q) routine is part of our proposed ckSearch KNN search system.


SearchLeftNodes(leafNode_i, key_left):
    for (i = 0; i < leafNodeSize(); i++):           // searching leafNode_i for nearest neighbors
        if R.Size() == k:
            if (distance(p_max, q) > distance(data_i, q)):
                Remove p_max from R;
                Add data_i to R;
        else if R.Size() ≠ k:
            Add data_i to R;
    // end of for loop

    dist_left = dist_Center - r;
    leftLeafNode = GetLeftLeafNode(leafNode_i);

    while (true):
        leftLeafNode = GetLeftLeafNode(leftLeafNode);
        SearchLeafNode(leftLeafNode);               // searching leftLeafNode for nearest neighbors

        keyOfMinRecord = key value of the left-most entry of leftLeafNode;

        if keyOfMinRecord < dist_left OR the cluster boundary is reached:
            break;                                  // reached the search sphere limit, no need to search further
END;

Figure B.4: The SearchLeftNodes(leafNode i, key left ) pseudocode

Figure B.4 above shows the “SearchLeftNodes(leafNode i, key left )” function pseudocode. This function searches the leaf nodes to the left in the data structure for nearest neighbor points. It is considered one of the most important functions in the ckSearch implementation.


SearchRightNodes(leafNode_i, key_right):
    for (i = 0; i < leafNodeSize(); i++):           // searching leafNode_i for nearest neighbors
        if R.Size() == k:
            if (distance(p_max, q) > distance(data_i, q)):
                Remove p_max from R;
                Add data_i to R;
        else if R.Size() ≠ k:
            Add data_i to R;
    // end of for loop

    dist_right = dist_Center + r;
    rightLeafNode = GetRightLeafNode(leafNode_i);

    while (true):
        rightLeafNode = GetRightLeafNode(rightLeafNode);
        SearchLeafNode(rightLeafNode);              // searching rightLeafNode for nearest neighbors

        keyOfMaxRecord = key value of the right-most entry of rightLeafNode;

        if keyOfMaxRecord > dist_right OR the cluster boundary is reached:
            break;                                  // reached the search sphere limit, no need to search further
END;

Figure B.5: The SearchRightNodes(leafNode_i, key_right) pseudocode

Figure B.5 above shows the “SearchRightNodes(leafNode i, key right )” function pseudocode. This function searches the leaf nodes to the right in the data structure for nearest neighbor points. It is considered one of the most important functions in the ckSearch implementation.
