EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING

by

JAIM AHMED

(Under the Direction of Maria Hybinette)

ABSTRACT

We introduce a new algorithm for K-nearest neighbor queries that uses clustering and caching to improve performance. The main idea is to reduce the cost of distance computations between the query point and the data points in the data set. We use a divide-and-conquer approach. First, we divide the training data into clusters based on similarity between the data points in terms of Euclidean distance. Next we use linearization for faster lookup. The data points in a cluster can be sorted based on their similarity (measured by Euclidean distance) to the center of the cluster. Fast search data structures such as the B-tree can be utilized to store data points based on their distance from the cluster center and perform fast data search. The B-tree is also well suited to range search. We achieve a further performance boost by using B-tree based data caching. In this work we provide details of the algorithm, an implementation, and experimental results in a robot navigation task.

INDEX WORDS: K-Nearest Neighbors, Execution, Caching.

EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING

by

JAIM AHMED

B.S., Southern Polytechnic State University, 1997

A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment

of the Requirements for the Degree

MASTER OF SCIENCE

ATHENS, GEORGIA

2009

© 2009

Jaim Ahmed

All Rights Reserved

EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING

by

JAIM AHMED

Major Professor: Maria Hybinette

Committee: Eileen T. Kraemer, Khaled Rasheed

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
May 2009

DEDICATION

First of all, my dedication goes out to my wife Jennifer for her support and inspiration, especially when the going got tough. My dedication also goes to my parents for their unconditional love and motivation. My final dedication goes to my sister and brother-in-law for their genuine friendship and kindness.


ACKNOWLEDGEMENTS

First of all, I express my sincere gratitude to my Major Advisor Dr. Maria Hybinette for her constant support and encouragement. Dr. Hybinette has been very kind with her time and wisdom. She has been a shining example of hard work and dedication and will remain a source of inspiration for me forever.

I would also like to thank my committee members Dr. Eileen Kraemer and Dr. Khaled Rasheed for their time and consideration. Special thanks to Dr. Tucker Balch for his helpful suggestions and consultations. Also, thanks to the Borg Lab for access to example data and their helpful suggestions.


TABLE OF CONTENTS

Page

ACKNOWLEDGEMENTS...... v

LIST OF TABLES...... viii

LIST OF FIGURES ...... ix

CHAPTER

1 Introduction...... 1

1.1 Overview...... 1

1.2 Problem Domain...... 3

1.3 What is K-nearest Neighbor Search?...... 6

1.4 Contributions...... 8

2 Related Work...... 10

3 Background ...... 15

3.1 Data Clustering...... 15

3.2 Data Caching ...... 19

3.3 Basic KNN Search...... 20

3.4 KD-tree Data Structure ...... 22

4 System Architecture...... 24

4.1 Pre-processing ...... 25

4.2 ckSearch Runtime Queries...... 31

5 Experiments & Results ...... 45


5.1 Setup Information...... 45

5.2 The effect of the size of the data set ...... 46

5.3 The effect of data dimension on the performance ...... 51

5.4 The effect of search radius on the performance ...... 54

5.5 The effect of search radius on accuracy...... 57

5.6 The effect of the number of clusters...... 58

6 Conclusion...... 62

REFERENCES ...... 64

APPENDICES ...... 67

A Notation Table...... 67

B Implementation Pseudocode ...... 68


LIST OF TABLES

Page

Table 5.1: The effect of data size on performance (k=1)...... 47

Table 5.2: Effect of data size on performance (k=3) ...... 48

Table 5.3: Effect of data size on performance (k=10) ...... 49

Table 5.4: ckSearch speedup over linear search...... 50

Table 5.5: The effect of data dimension on performance (N=50K) ...... 52

Table 5.6: The effect of data dimension on performance (N=100K) ...... 52

Table 5.7: ckSearch speedup over linear search for various dimensions...... 53

Table 5.8: The effect of search radius on performance (k = 3) ...... 55

Table 5.9: The effect of search radius on performance (k = 10) ...... 56

Table 5.10: The effect of the search radius on query accuracy ...... 57

Table 5.11: The effect of the number of clusters on performance (k=1) ...... 59

Table 5.12: The effect of the number of clusters on performance (k=5) ...... 60

Table A.1: List of various notations used in this thesis ...... 67


LIST OF FIGURES

Page

Figure 1.1: Autonomous robot being trained to navigate through obstacles...... 4

Figure 1.2: Autonomous robot navigation sensors input ...... 5

Figure 1.3: Pictorial representations of KNN search ...... 6

Figure 3.1: Data clustering in 2-dimensional space...... 16

Figure 3.2: Stages in data clustering ...... 17

Figure 3.3: Typical application cache structure...... 18

Figure 3.4: Basic KNN search process represented in 2-dimensional space ...... 20

Figure 3.5: Basic KNN Search Algorithm ...... 21

Figure 3.6: KD-tree data structure ...... 22

Figure 4.1: Cluster data linearization...... 26

Figure 4.2: B-tree data structure ...... 29

Figure 4.3: Data cluster to B-tree correlation...... 34

Figure 4.4: ckSearch algorithm data caching scheme...... 36

Figure 4.5: Cluster search rule 1 (Cluster exclusion rule)...... 39

Figure 4.6: Cluster search rule 2 (Cluster search region rule)...... 40

Figure 4.7: Cluster search rule 3 (Cluster contains query sphere)...... 42

Figure 4.8: Cluster search rule 4 (Cluster intersects query sphere) ...... 43

Figure 5.1: Performance vs. data set size chart (k = 1)...... 47

Figure 5.2: Performance vs. data set size chart (k = 3)...... 48


Figure 5.3: Performance vs. data set size chart (k = 10)...... 49

Figure 5.4: Chart showing ckSearch speedup over the linear search ...... 50

Figure 5.5: Data dimension vs. performance chart (N = 50K)...... 52

Figure 5.6: Data dimension vs. performance chart (N = 100K) ...... 53

Figure 5.7: Search radius vs. performance for 10000 data records ...... 55

Figure 5.8: Search radius vs. performance chart for 10,000 data records (k = 10) ...... 56

Figure 5.9: Search radius vs. query accuracy chart ...... 58

Figure 5.10: The number of clusters vs. performance chart for 50000 data records (k = 1)...... 59

Figure 5.11: The number of clusters vs. performance chart (k = 5) ...... 60

Figure B.1: ckSearch KNN algorithm...... 68

Figure B.2: SearchClusters(q) pseudocode...... 69

Figure B.3: The SearchCache(q) algorithm pseudocode ...... 70

Figure B.4: The SearchLeftNodes(leafNode i, key left ) pseudocode...... 71

Figure B.5: The SearchRightNodes(leafNode i, key right ) pseudocode ...... 72


CHAPTER 1

INTRODUCTION

In this research, we introduce an efficient algorithm for K-nearest neighbor queries that uses clustering, a pruning of the search space, and caching to improve performance. We call our algorithm ckSearch. The main goal of this work is to improve the performance of queries in a k-nearest neighbor (KNN) system.

In this chapter we provide an overview of the KNN algorithm, and brief coverage of the performance challenges facing KNN implementations. We describe our application and experimental domain, and then provide details on our approach.

1.1 Overview

The K-nearest neighbor algorithm (KNN) is a well-known statistical search or learning method used in a wide range of problem-solving domains, e.g., navigation [32], data mining [33], and image processing [11]. In robotic navigation, KNN is used to select an appropriate action for a robot by evaluating the K most similar instances from the nearest neighbor feature set in the training data. In forestry, KNN is used to map satellite image data to inventory forest resources [34], and in wine evaluation, KNN is used to classify wines, where the feature space includes alcohol level, hue, and wine opacity [35]. More formally, KNN finds the K closest (or most similar) points to a query point among N points in a d-dimensional attribute (or feature) space. K is the number of neighbors that are considered from a training data set and typically ranges from 1 to 20.

Advantages of the KNN algorithm include that it is fairly simple to implement and well suited for multi-modal classes [36]. However, a major disadvantage of KNN implementations is their high computational cost, especially when coupled with a large amount of data. The high cost is partly due to computing Euclidean distances between the N neighboring data points and the query point. Further, many KNN implementations degrade in performance as the data becomes higher dimensional (i.e., they suffer from the “curse of dimensionality”); typically, performance starts to degrade when the number of features is 20 or more [10]. Another drawback of KNN concerns its significant memory requirements, especially for Locality Sensitive Hashing (LSH) based KNN systems [6].

A key idea of our ckSearch algorithm is to improve performance by avoiding costly distance computations for the KNN search. We use a divide-and-conquer approach. First, we divide the training data into clusters based on similarity between the data points in terms of Euclidean distance. Next we perform a linearization of data points in each cluster for faster lookup. The data points in a cluster can be sorted based on their similarity (measured by Euclidean distance) to the center of the cluster. Our data linearization process takes advantage of this similarity and produces indexes for each data point in a cluster. Fast search data structures such as the B-tree can be utilized to store data points based on their metric indexes. Next we load the data points into a memory-aware B-tree data structure. We achieve a further performance boost using B-tree based data caching.

The ckSearch cache policy pre-fetches clusters closer (or more similar) to the query point into the cache in anticipation of what may be needed next, and it avoids checking the cache if the needed cluster has not been put in the cache. This policy avoids some cache misses. At runtime, the ckSearch system first evaluates the cache upon receiving the query point and searches for the k closest points in the cache. The cache is organized hierarchically in a B-tree structure and thereby reduces the number of distance computations. In the case of a cache miss, the ckSearch algorithm searches the main B-tree for the k nearest neighbors using our new method.

1.2 Problem Domain

A focus of this research is to improve the performance of the KNN approach and to demonstrate its performance on a real-world problem. We assessed our approach using data from an autonomous robot navigation experiment. The existing solution for this system uses the KD-tree algorithm, which partitions the training data set recursively (KD-trees are specialized BSP trees). A KD-tree based algorithm provided direction and speed commands for a robot based on learned perception examples. One of our objectives is to improve the performance of the existing KD-tree approach. In order to improve data processing speed, we introduce our novel ckSearch algorithm that utilizes data clustering and data caching. In addition, our ckSearch system utilizes several rules to further reduce or avoid costly distance calculations. Even though our system has been assessed for efficient execution of the KNN algorithm in a robotics domain, it is expected to perform well in any domain that utilizes a KNN algorithm. One such domain could be image processing, where a KNN algorithm is used to classify comparable image pixels.


Figure 1.1: An autonomous robot being trained to navigate through obstacles.

Figure 1.1 shows an autonomous robot being trained to navigate through obstacles. The green lines show the sensors and the yellow arrow shows the direction. These sensor readings are used as training data to classify speed and direction during an autonomous run. This image is used by permission from the Borg Lab at the Georgia Institute of Technology.

Autonomous robot navigation in unstructured outdoor environments is a challenging area of active research. At the core of this navigation task, identifying obstacles and traversing around these obstacles plays a vital role in reaching the robot’s target destination.

There is a recent trend of using KNN-based approaches in autonomous robotics research [32]. Autonomous robots have the ability to function and perform desired tasks in unstructured environments without continuous human guidance, but they rely on algorithms such as KNN for learned data classification. Typically, sensors collect obstacle data and the decision making system must decide which action to take based on previously learned behavior [1].


Figure 1.2: Autonomous robot navigation sensors input.

Figure 1.2 shows a representation of a robot’s sensor input for navigation. Each green line represents an estimate of free space from the robot to an obstacle. At each time step there are 60 such inputs, which make up a 60-dimensional data point. The yellow arrow shows the direction input by the robot trainer, and the blue arrowhead shows the original direction of the target path. Later, the robot uses the 60-dimensional sensor data and the direction taken by the trainer as the training data set to decide speed and direction during an autonomous run.

In this manner, the robot has the ability to move through its operating environment without human assistance, using the KNN algorithm to dictate which direction to move and what the speed should be based on previously learned data. Needless to say, this decision making process must be efficient, accurate, and swift to enable the robot to cope with its environment and avoid obstacles. Most of the current KNN algorithms (such as the KD-tree) are too slow for the task. As defined, this is our problem domain. It was determined that the existing KD-tree based nearest neighbor search algorithm suffered performance degradation from the “curse of dimensionality” and that its performance needed improvement. In this research we worked to come up with an apt algorithm to speed up the classification of such robots’ direction and speed data.


Figure 1.3: Pictorial representations of KNN search.

1.3 What is K-nearest Neighbor Search?

The k-nearest neighbor (KNN) search is a variation of the nearest neighbor algorithm in which it is required to find the k points closest to the query point. The nearest neighbor search algorithm and its variations are frequently used to solve problems in areas such as robotics, data mining, multi-key retrieval, and pattern classification. Discovering a way to reduce the computational complexity of nearest neighbor search is of considerable interest in these areas.

The KNN search can be expressed as an optimization problem for finding the closest points in a metric space [2]. Given a set of N points in a metric space M and a query point q where q ∈ M, the problem is to find the k points in the set closest to the query point q. Usually, M is considered to be a d-dimensional Euclidean space, and distance is measured by the Euclidean distance or the Manhattan distance.


A significant cost of the KNN approach is due to the computation of the O(l) distance function (where l is the dimensionality of the vectors), especially when an application uses vectors with high dimensionality, such as sensor data from an autonomous robot [3]. A full search solution involves calculating the distance between the target vector q and every vector pi in order to find the k vectors closest to q. Although a full search ensures the best possible search results, this solution is often infeasible due to its O(nl) cost. Autonomous robot decision making applications often involve searching a large database for the closest match to a query case [4].

A simple solution to the KNN search problem is to compute the distance from the query point to every other point in the database, keeping track of the data points with the smallest distances calculated so far [5]. This sequential full search finds the k nearest neighbors by progressively updating the current nearest neighbor pj whenever a data point is found that is closer to the query point than the current nearest neighbor. With each update, the current KNN search radius shrinks toward the actual k-th nearest neighbor distance. The final nearest neighbor is one of the data points inside the current nearest neighbor search radius. Thus, in the sequential full search, the distances of all N data points to the query point are computed, and the search complexity is N distance computations per query point.
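To make the sequential full search concrete, the following is a minimal illustrative sketch in Python (not our actual implementation; the function and variable names are our own). It computes the Euclidean distance from the query point to every data point and keeps the k smallest, which is exactly the N-distance-computation baseline described above.

import heapq
import math

def linear_knn(data, q, k):
    # Sequential full search: compute the distance from q to every point
    # and keep the k smallest seen so far (the current k nearest neighbors).
    best = []  # max-heap via negated distances, holding at most k entries
    for i, p in enumerate(data):
        d = math.dist(p, q)               # Euclidean distance
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:             # closer than the current k-th neighbor
            heapq.heapreplace(best, (-d, i))
    # return (distance, index) pairs sorted by increasing distance
    return sorted((-nd, i) for nd, i in best)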

The number of distance calculations for any KNN algorithm grows with the number of data points in the data set, and the “curse of dimensionality” increases the number of calculations tremendously. One approach to reducing the complexity of the nearest neighbor search is to reduce the number of data points to be searched. Our approach to KNN search focuses on an inexpensive way of eliminating data points from consideration using computationally inexpensive rules, thereby avoiding more expensive distance computations.


The rules determine those data points which cannot be nearer to the query point than the current nearest neighbor.

The computational demands of KNN queries have increased in recent years. Moreover, the advent of new research areas using learning algorithms, such as autonomous robotics and other artificial intelligence domains, has drawn interest back to nearest neighbor search. Currently, the use of large databases containing millions of image records for a vision based navigational system is quite common [1]. Naturally, these new challenges have prompted a fresh look at nearest neighbor search and the ways it can help solve new problems.

As mentioned above, we apply our cluster-based KNN search method to the task of making steering and speed decisions for an autonomous robot based on training data. In addition, our approach utilizes a data caching strategy to improve performance. The ckSearch algorithm is also general enough to produce good performance in problem domains such as image processing, information extraction in data mining, and text classification.

1.4 Contributions

Results of our research will be of interest to those investigating high performance memory-based learning methods. In particular, we have implemented a system that supports fast and exact KNN queries without scanning the entire data set. Our novel contributions include:


• A geometry-based method for pruning the search space at query time. Some existing approaches (e.g., Approximate Nearest Neighbor) also prune, but are not able to provide exact responses to queries.

• Further improved performance using caching.

Our solution is based on a framework consisting of three major components: (1) pre-processing of data points into clusters; (2) data point mapping to a metric data structure; and (3) implementation of smart caching. We have designed our caching strategy based on the assumption that a data cache can boost performance in repeated-calculation algorithms such as KNN. The approach takes advantage of an algorithm that balances the cost and performance of each component in order to achieve an overall reduction in cost and improve performance [4]. Using the above-mentioned techniques along with rules to avoid unnecessary computation, our algorithm achieves a performance improvement over linear search and KD-tree based KNN algorithms. The performance evaluation section details these experiments and results.

The rest of the thesis is organized as follows: Chapter 2 discusses related work done by various other researchers in this area. Chapter 3 presents background information and various concepts used in this project. Chapter 4 describes in detail our proposed approach and all the related information. The experiments are discussed and the results are presented in Chapter 5. Finally, Chapter 6 presents the conclusions of this thesis and describes future work.


CHAPTER 2

RELATED WORK

This chapter reviews recent research on autonomous robot navigation as well as on the KNN algorithm. Navigation is one of the most challenging skills required of a mobile robot, and there has been a recent trend among researchers of using the KNN algorithm to classify learned data. In this section, we present some of the related work done in this area.

The 6D SLAM (Simultaneous Localization and Mapping) system is based on a scan matching technique, where scan matching uses the well-known iterative closest point (ICP) algorithm [3]. This system employs a cached KD-tree to improve the performance of the iterative closest point algorithm. Since the KD-tree itself suffers from a performance breakdown with high-dimensional data points, we believe 6D SLAM will suffer performance deterioration with high-dimensional navigation data [17].

Another approach taken by researchers to solve the navigation problem is based on the stereo vision of the robot system. Binary classifiers were used to augment stereo vision for enhanced autonomous robot navigation. However, this system does not use any one optimized binary classifier; instead, it suggests using several generic classifiers such as SVM, the Simple Fisher Algorithm, and Fisher LDA. This approach also suggests creating and storing learned models of traversable and non-traversable terrain. We believe generic binary classifiers are prone to performance degradation, which can affect the performance of this system [1].


Some researchers applied memory-based robot learning to solve similar problems. Memory-based neural networks were used to learn the task to be performed [22], whether identifying navigational hot spots or making decisions. These researchers also augmented a nearest neighbor network with a local model network.

Next, we present several related works in the KNN search area. There has been a long line of research on solving the nearest neighbor search problem, and a large number of solutions have been proposed to reduce the cost of the nearest neighbor search. The quality and usefulness of these proposed solutions are determined by the time complexity of the queries as well as the space complexity of any search data structures that must be maintained.

The current KNN techniques can be divided into five major approaches: the data partitioning approach, the dimensionality reduction approach, locality sensitive hashing (LSH), the scanning based approach, and the linearization approach.

The most prominent is the data partitioning approach, also known as space partitioning, spatial indexing, or the spatial access method. Data partitioning techniques such as the KD-tree [22] or the Grid-file [25] iteratively bisect the search space into regions containing a fraction of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. One of the main drawbacks of this concept is the “curse of dimensionality,” a problem caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space. Data partitioning techniques perform comparably well with low-dimensional data points; with high-dimensional data, however, their performance quickly degrades because of the exponential increase in volume associated with iterative partitioning of the high-dimensional Euclidean search space. Multi-dimensional indexes such as R-trees [46] have been shown to be inefficient for supporting range queries in high-dimensional databases [19].

Dimensionality reduction approaches first apply dimensionality reduction techniques to the data and then insert the reduced data into indexing trees. Dimensionality reduction is the process of reducing the number of random variables or attributes being considered, and it is divided into feature selection and feature extraction. There are costs associated with performing dimensionality reduction and the subsequent data indexing, which is why this technique performs well on low-dimensional data sets but suffers when the data dimensionality increases.

Locality sensitive hashing (LSH) is a comparatively new nearest neighbor search approach. It is a technique for grouping points into buckets based on a distance metric defined on the points: points that are close to each other under the chosen metric are mapped to the same bucket with high probability. Theoretically, for a database of n vectors of d dimensions, the time complexity of finding the nearest neighbor of an object using locality sensitive hashing is sub-linear in n and only polynomial in d. A key requirement for applying LSH to a particular space and distance measure is to identify a family of locality sensitive functions satisfying the required properties [26]. Thus, locality sensitive hashing is only applicable for specific spaces and distance measures where such families of functions have been identified, such as real vector spaces with distance measures, or bit vectors with the Hamming distance [28]. Also, because locality sensitive hashing techniques are based on hashing, they have a large memory footprint; a large amount of memory must be allocated to apply LSH, which is a major drawback [27].
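As an illustration only (not part of our system), one well-known family of locality sensitive functions for real vector spaces under angular distance is random-hyperplane hashing. The Python sketch below, with hypothetical names, shows how nearby points tend to share a bucket.

import numpy as np

def lsh_buckets(data, num_bits=16, seed=0):
    # Random-hyperplane LSH sketch: each point receives a num_bits-bit
    # signature; points at a small angle tend to collide in the same bucket.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(num_bits, data.shape[1]))   # random hyperplanes
    signs = data @ planes.T > 0                           # which side of each plane
    keys = np.packbits(signs, axis=1)                     # pack bits into bucket keys
    buckets = {}
    for i, row in enumerate(keys):
        buckets.setdefault(row.tobytes(), []).append(i)   # group point indices by key
    return buckets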


Scanning based approaches such as the VA-file [17] divide the data space into 2^b rectangular cells, where b denotes a user specified number of bits. Each cell is allocated a bit-string of length b that approximates the data points falling into that cell. The VA-file is based on the idea of object approximation and approximates object shapes by their minimum bounding box; the VA-file itself is simply an array of these compact, geometric approximations. Instead of hierarchically organizing the cells as in grid-files or R-trees, the nearest neighbor search starts by scanning the entire file of approximations and filtering out the irrelevant points based on their approximations.

Linearization approaches, such as space-filling curve methods (e.g., the Z-order curve), map d-dimensional points into a one-dimensional space (curve). As a result, one can issue a range query along the curve to find the k nearest neighbors.
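As a small illustration of the linearization idea (again, not taken from our system), a Z-order key for a low-dimensional integer point can be computed by interleaving coordinate bits; sorting by the resulting one-dimensional keys then approximates a spatial ordering.

def morton_key_2d(x, y, bits=16):
    # Interleave the bits of two non-negative integer coordinates to produce
    # a Z-order (Morton) key; nearby points tend to receive nearby keys.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bit positions come from x
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions come from y
    return key

# e.g., sorted(points, key=lambda p: morton_key_2d(*p)) linearizes a 2-D point set.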

As is evident from the discussion so far, most of the conventional approaches to the KNN search suffer from drawbacks related either to performance or to memory space complexity. Our proposed ckSearch approach, described in detail later in this thesis, is a novel approach to the KNN search problem. It utilizes clustering to achieve data partitioning. It makes smart but balanced use of a data caching technique to boost performance. It avoids the curse of dimensionality by mapping d-dimensional points into a one-dimensional space using a linearization approach. It uses an indexing tree as its data structure to evade large memory requirements. Moreover, it introduces metric index caching to the KNN algorithm.

As described in this chapter, there have been metric-based KNN systems, but our proposed data clustering along with smart use of a data cache is a unique and novel approach. Our proposed solution has been carefully designed to overcome the disadvantages many of these conventional approaches suffer while retaining the benefits the above-mentioned techniques enjoy.


CHAPTER 3

BACKGROUND

This chapter provides comprehensive background information for our project. It is important to remind the reader that the main goal of this project is to design a fast cluster-based KNN algorithm. In addition, this KNN algorithm must be able to process an autonomous robot's navigation (sensor) data quickly so that the robot can decide on direction and speed without stalling or running into obstacles. As mentioned above, the actual algorithm is described in the next chapter; the necessary background information is explained here. For ease of exposition, this chapter is divided into four subsections: data clustering, data caching, basic KNN search, and the KD-tree data structure.

3.1 Data Clustering

Data clustering is an essential component of our ckSearch algorithm and is considered part of the pre-processing step. A large portion of the cost of the KNN search is due to the computation of the O(l) distance function, especially when an application contains points with a large number of dimensions, such as the navigation sensor readings of an autonomous robot. The central strategy for reducing these repeated, and in some cases unnecessary, distance computations is to partition the data space, and data clustering is one of several ways to achieve this goal.


Figure 3.1: Data clustering in 2-dimensional space.

Cluster analysis is the organization of a collection of patterns, each usually a vector of measurements or a point in a multidimensional space, into clusters based on similarity [9]. Ideally, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. Since data points in a large database or data set are often clustered or correlated, data clustering as a data partitioning technique seems ideal. The diversity of techniques for representing data, measuring similarity between data elements, and categorizing data elements has generated a range of clustering methods.

Typical pattern clustering activity involves the following steps [9]:

(1) pattern representation

(2) definition of a pattern proximity measure appropriate to the data domain


(3) clustering or grouping

(4) data abstraction

(5) assessment of output if needed

Figure 3.2: Stages in data clustering

Pattern representation refers to the number of classes, the number of available patterns, and the features available to a clustering algorithm. It is divided into the feature selection and feature extraction processes. Feature selection is the process of identifying the most effective of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new, more prominent features. Either or both of these techniques can be used to obtain an appropriate set of features for clustering.

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance functions are used depending on the data domain. Typically, the Euclidean distance function is the most popular of these and is often used to measure the similarity between two patterns. Other similarity measures can be used to capture the conceptual similarity between patterns.

The clustering step can be performed in a variety of ways. There are several major clustering techniques available, such as hierarchical, partitional, fuzzy, probabilistic, and graph theoretic, to name a few. K-means clustering, a partition-based clustering technique, was used in this project. K-means clustering is simple and a good fit for the data partitioning required by a nearest neighbor search algorithm. There are several other clustering schemes in the literature, such as BIRCH [30], CLARANS, and DBSCAN [31].
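For concreteness, a minimal Lloyd's-iteration K-means sketch is given below (illustrative Python with NumPy; our system relies on an existing K-means implementation rather than this code).

import numpy as np

def kmeans(data, num_clusters, iters=50, seed=0):
    # Minimal Lloyd's algorithm: alternate between assigning each point to its
    # nearest center and recomputing each center as the mean of its points.
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), num_clusters, replace=False)]
    for _ in range(iters):
        # squared Euclidean distance from every point to every center
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            data[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(num_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # final assignment of each point to its nearest center
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)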

Data abstraction is the next step in the clustering process (followed only by the optional output assessment step). It is the process of extracting a simple representation of the data set. A typical data abstraction is a compact description of each cluster, usually in terms of a cluster prototype or representative patterns such as the centroid.

In this project, data indexing is not dependent on the underlying clustering method. However, it is expected that the clustering strategy will have an influence on data retrieval performance.

Figure 3.3: Typical application cache structure


3.2 Data Caching

Data caching is a general technique used to enhance the performance of data access where the original data is expensive to compute compared to the cost of reading the cache. In a KNN search, a large set of high-dimensional data points is repeatedly accessed with each query, and a data cache can prove extremely effective in such a process. When data is cached, the most recently accessed data from the high-dimensional data set is stored in a memory buffer. Thus, this data cache is a temporary storage area where frequently accessed data can be stored for rapid access. When our ckSearch algorithm needs to access data, it first checks the cache to see if the data is there. If it finds what it is looking for in the cache, it uses the data from the cache instead of going to the data source to find it. Thus, using the data cache, our proposed algorithm can achieve shorter access times and boost performance. Even though a data cache is favorable, there are computational costs associated with data caching, primarily an accumulation of data retrieval cost, data maintenance cost, and cache miss cost. Thus, our proposed algorithm implements a comprehensive caching strategy to keep cache cost from offsetting the performance gains.
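The check-then-fallback pattern described above can be pictured with the following sketch (plain Python with hypothetical names; the actual ckSearch cache is B-tree based and is described in Chapter 4).

class QueryCache:
    # Tiny cache keyed by cluster id: check the cache first, fall back to the
    # main data store on a miss, then store the fetched result for next time.
    def __init__(self):
        self._store = {}

    def lookup(self, cluster_id):
        return self._store.get(cluster_id)       # None signals a cache miss

    def update(self, cluster_id, leaf_points):
        self._store[cluster_id] = leaf_points    # cache the fetched data

def fetch_cluster(cache, cluster_id, load_from_source):
    cached = cache.lookup(cluster_id)
    if cached is not None:                        # cache hit: skip the data source
        return cached
    points = load_from_source(cluster_id)         # cache miss: go to the source
    cache.update(cluster_id, points)
    return points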


Figure 3.4: Basic KNN search process represented in 2-dimensional space

3.3 Basic KNN Search

In order to search for the k nearest neighbors of a query point q, the distance of the k-th nearest neighbor to q defines the minimum radius rmin required for retrieving the complete answer set. It is not possible to calculate this distance preemptively because, without further scanning, we are unaware of the points surrounding the query point q. Thus, iteratively increasing the search radius and examining the neighbors within that search sphere is a viable approach.

To describe this algorithm, the query point in question is q, and the task is to find the k nearest neighbors of this query point. The search process starts with a query sphere defined by a relatively small radius r about the query point q, SearchSphere(q, r). Naturally, all data spaces the query sphere intersects have to be searched for potential k nearest neighbors. Iteratively, the search sphere is expanded until all k nearest neighbor points are found. In this process, all the data subspaces intersecting the current query sphere are checked. If enlargement of the query sphere does not introduce new nearest neighbor points, the current KNN result set R is considered to contain the nearest neighbors (assuming the size of the current result set is k). The search query is started with a small initial radius, which in turn keeps the search space small and avoids unwanted calculations; the goal is to minimize unnecessary search cost. Arguably, a search sphere with a larger radius may contain all the k nearest points, but the cost of going through all the data points outweighs the benefit.

Basic KNN Search(k):
1   R = empty;                      // the result set
2   Search sphere radius, r = as small as possible;
3   Find all data spaces intersecting the current query sphere;
4   Check all intersecting data spaces for k nearest neighbors;
5
6   if R.Size() == k                // k nearest neighbors are found
7       exit;
8   else
9       increase search radius;
10      goto line 3;                // start the search process again
END;

Figure 3.5: Basic KNN Search Algorithm
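A hedged, runnable Python rendering of the expanding-radius procedure in Figure 3.5 is sketched below. The helper logic scans the raw points rather than data subspaces, so it illustrates only the control flow; r_init, r_increment, and r_max are assumed parameters.

import math

def expanding_radius_knn(data, q, k, r_init, r_increment, r_max):
    # Grow the query sphere until it contains at least k points. Since every
    # point inside the sphere is examined, the k closest of them are exact.
    r = r_init
    while True:
        in_sphere = []
        for p in data:
            d = math.dist(p, q)
            if d <= r:                      # candidate lies inside the query sphere
                in_sphere.append((d, p))
        in_sphere.sort(key=lambda t: t[0])
        if len(in_sphere) >= k:
            return in_sphere[:k]            # the k-th neighbor lies inside the sphere
        if r >= r_max:
            return in_sphere                # stop criterion: maximum radius reached
        r += r_increment                    # enlarge the search sphere and retry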


We performed several performance comparisons between the ckSearch algorithm and the KD-tree based multi-dimensional indexing structure; these are detailed in the experiments chapter of this thesis. We believe it is important to understand the KD-tree algorithm in order to understand those comparisons, so a comprehensive account of the KD-tree is included in the following section.

Figure 3.6: KD-tree data structure

3.4 KD-tree data structure

K-dimensional search trees, i.e., KD-trees, are a generalization of binary search trees designed to handle multidimensional records. In a KD-tree, a multidimensional record is identified with its corresponding multidimensional key x = (x(1), x(2), ..., x(K)), where each x(n), 1 ≤ n ≤ K, refers to the value of the n-th attribute of the key x. Each x(n) belongs to some totally ordered domain Dn, and x is an element of D = D1 × D2 × ... × DK. Therefore, each multidimensional key may be viewed as a point in a K-dimensional space, and its n-th attribute can be viewed as the n-th coordinate of such a point. Without loss of generality, we assume that Dn = [0,1] for all 1 ≤ n ≤ K, and hence that D is the hypercube [0,1]^K [10]. A KD-tree for a set of K-dimensional records is a binary tree such that:

(1) Each node contains a K-dimensional record and has an associated discriminant n ∈ {1, 2, ..., K}.

(2) For every node with key x and discriminant n, any record in the left sub-tree with key y satisfies y(n) < x(n), and any record in the right sub-tree with key y satisfies y(n) > x(n).

(3) The root node has depth 0 and discriminant 1. All nodes at depth d have discriminant (d mod K) + 1.

There are many implementations of KD-trees, both homogeneous and non-homogeneous. Non-homogeneous KD-trees contain only one value in each internal node together with pointers to its left and right sub-trees; all records are stored in external nodes. The expected cost of a single insertion in a random KD-tree is O(log n), while the expected cost of building the whole tree is O(n log n). On average, deletions in KD-trees also have an expected cost of O(log n), and nearest neighbor queries are supported in O(log n) time [10].
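For reference, the sketch below shows how a standard KD-tree nearest neighbor query can be issued with SciPy. It illustrates the data structure discussed above and is not our own KD-tree implementation; the random 60-dimensional data is only a stand-in for the robot sensor vectors.

import numpy as np
from scipy.spatial import cKDTree

# Build a KD-tree over N random 60-dimensional points and query it.
rng = np.random.default_rng(0)
data = rng.random((10_000, 60))
query = rng.random(60)

tree = cKDTree(data)                            # expected O(n log n) construction
distances, indices = tree.query(query, k=3)     # 3 nearest neighbors of the query
print(distances, indices)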


CHAPTER 4

SYSTEM ARCHITECTURE

In this section, we describe the system architecture of ckSearch, which includes our scalable and efficient KNN search mechanism.

A number of solutions have been introduced to reduce the cost of the KNN search. The quality and usefulness of these solutions are limited by the computational time complexity of the queries as well as the space complexity of the relevant search data structures. As mentioned in the Related Work chapter, solutions face tradeoffs that affect performance and are prone to the curse of dimensionality phenomenon as the number of attributes increases. When the number of attributes is large, KNN implementations either require large memory allocations (due to space complexity) or fall victim to time complexity. A rule of thumb is that KNN algorithms work well for 20 or fewer attributes [10].

Our ckSearch technique balances both time and space complexities to achieve an overall reduction in both. Our cluster-based approach uses caching to minimize the cost of searching high-dimensional data. Our solution, detailed in this section, includes two phases: (1) pre-processing of the data points; and (2) runtime queries. In the pre-processing step the d-dimensional data set is partitioned into data clusters based on similarity between the data points. We discuss both phases in detail in the pre-processing and runtime query sections below. The following observations influenced the design of the ckSearch system:


Observation 1 (Data partitioning):

Data space partitioning can reduce redundant distance computations while searching for k nearest neighbors in a high-dimensional data domain. Simple clustering algorithms such as K-means clustering can reduce computational cost by separating high-dimensional data points into clusters based on similarity.

Observation 2 (Data reference):

Reference to a cluster centroid may expose similarity or dissimilarity between data points within a cluster and data points across different clusters. Moreover, data points in a cluster can be sorted based on their distance from a reference point such as the cluster centroid.

Observation 3 (Data caching):

Data caching can substantially reduce search time by pre-fetching and can reduce the cost of distance calculation for the KNN search. Cache miss expenditure must be kept in check by using smart cache strategies and rules to predict cache-miss scenarios.

4.1 Pre-Processing

Step 1: Data Partitioning – K-means Clustering

Data clustering is an essential component of our algorithm. By clustering as a pre-processing step, we are able to improve the performance of queries at runtime. A direct approach to reducing the complexity of the nearest neighbor search is to reduce the number of data points investigated, and the central strategy for reducing repeated, and in some cases unnecessary, distance computations is to partition the data space. ckSearch splits the data space into partitions and uses data clustering to avoid examining unnecessary data points in multidimensional data by clustering based on data similarities (Observation 1). The first step is to cluster the data set using an existing K-means clustering algorithm.

K-means clustering is a simple, partition-based clustering technique and a good fit for the data partitioning required for nearest neighbor search. It is important to mention that even though our approach uses K-means clustering, it does not depend on this particular clustering technique; we could just as easily have selected another clustering algorithm such as DBSCAN [31], CLARANS, or BIRCH [30]. In our algorithm, the number of clusters is selected based on the number of records present in the data set: we choose 5 clusters for up to 10,000 records and go up by 2 clusters for every 5,000 records.

Figure 4.1: Cluster data linearization


Once we have selected a number of cluster centers, we can use them to index our data. Figure 4.1 shows the cluster data linearization based on the distance between the center and each individual data point in that cluster. The cluster center is the starting point of the segment, and the cluster boundary is the maximum (end point) of the segment.

Step 2: Data index construction & Index structure

After the clustering phase, our algorithm constructs the data index. This data index is a single dimensional value based on the distance between the data point and a reference point in a data partition. During this part of the process, each high-dimensional point is transformed into a point in a single dimensional space.

This conversion is commonly known as data linearization or data mapping.

Linearization is achieved by selecting a reference point and then ordering all partitions according to their distances from the selected reference point. This fits well with Observation 2, which states that reference to a cluster center may expose similarity or dissimilarity of data points in a cluster; this similarity or dissimilarity is exposed by linearization in the form of a data mapping. There are several types of reference points that can be used for the linearization process. Typically the center of a cluster is used as the reference point, but some linearization techniques use either a boundary (edge) point or a random point as the reference. Ad hoc linearization approaches, such as space-filling curve methods like the Z-order curve [15], map d-dimensional points into a one-dimensional space (curve). For further cost reduction, we use a three-step data linearization algorithm, described below.


First, a reference point is identified for each partition or data cluster; the center of each partition or cluster is selected as the reference point. In the second step, the Euclidean distance between the data point p_i and the reference cluster center C_i is computed. In the final step, the following simple linear function is used to complete the conversion (i.e., the data mapping), transforming each high-dimensional data point into a key, key_i, in a single dimensional space.

key_i = distance(p_i, C_i) + m × µ        (4.1)

In the above function (4.1), the term key_i represents the single dimensional index value for a data point after the linearization process [11]. According to the research work on data partitioning by Agbhari & Makinouchi [11], data points in a cluster can be referenced and mapped by a fixed data point such as the cluster center; we utilized this concept to perform data linearization in this project. The distance function distance(p_i, C_i) is a Euclidean function that returns the single dimensional distance value between the data point p_i and the cluster center reference point C_i. The next parameter, m, is the number of the data cluster being processed. If there are M clusters in total, then the value of m is between 0 and M − 1, such that 0 ≤ m ≤ M − 1. If there are 10 clusters, for example, then m takes one of the values in the range [0, 1, 2, ..., 9].

The last parameter, µ, is a constant used to stretch the data ranges. The constant µ serves as a multiplier to the parameter m so that all points in a partition or cluster are mapped to a region between m × µ and (m + 1) × µ. Because of the µ multiplier, the function (4.1) correctly maps the cluster center as the minimum boundary (starting index) of this region and the furthest data point in the cluster as the maximum boundary (index) of this region. Moreover, all the data points in the cluster map appropriately between the minimum and maximum indices. As a result, one can issue a range query to find the nearest neighbors, enabling the use of an efficient single dimensional index structure such as the B-tree.

Figure 4.2: B-tree data structure

Figure 4.2 above shows a B-tree data structure. The leaf nodes contain the data points. The B-tree is especially optimized for search operations.

Step 3: Data structure & data loading

The selection of appropriate data structures is an integral part of any efficient search algorithm design. For a fast data retrieval algorithm such as ckSearch, it is vital to use a speedy data structure. In the ckSearch system, we used three different data structures. The core structure is the B-tree, which was used as the main data storage for our system. We also utilized one-dimensional and two-dimensional arrays; the two-dimensional array was used to store the minimum and maximum data distance for each cluster. Any balanced tree such as a B-tree works well as a fast cache data structure because of its rapid data retrieval time. Accordingly, an instance of the B-tree is used for the data caching implementation as well.

The B-tree is a data structure that keeps data sorted and allows searches, insertions, and deletions in logarithmic time. It is optimized for systems that read and write segments of data, such as data clusters, databases, and file-systems. In B-trees, non-leaf nodes can have a variable number of child nodes and are used as guides to the leaf nodes. Search operations with in-memory B-trees are significantly faster than with in-memory red-black trees and AVL trees [28]. The B-tree fits our ckSearch algorithm well because the costly B-tree insertion operations are only performed during the pre-processing index loading time; during the actual ckSearch runtime, only inexpensive search operations are performed on the B-tree to locate the k nearest neighbors. This strategy further aids our algorithm in improving overall processing time.

After the data linearization process, as described in the previous section, the mapped points are loaded into the B-tree. The transformed data point indexes serve as keys for our data structure, where only the leaf nodes store the actual data points. The conventional B-tree was modified so that each leaf node is linked to its neighboring leaf nodes on both sides. This modification assists in the speedy retrieval of nearest neighbor points.

In our algorithm, a two-dimensional array is used to store the maximum distance, distMax_i, between each cluster center C_i and the furthest data point in that cluster. Similarly, the minimum distances distMin_i are stored in this two-dimensional array as well. Our algorithm uses the distMax_i and distMin_i distance values to eliminate unnecessary out-of-boundary (data space) computations. A separate single dimensional array is used to store the cluster centers.


4.2 ckSearch Runtime Queries

In this section, we describe the ckSearch query. After loading the indexes into the tree-based data structure, the pre-processing part of the algorithm concludes. At this point, our algorithm performs the fast KNN search.

How ckSearch Works

In this section, we describe the search process of ckSearch. The overall technique is to solve the KNN problem iteratively. It begins by selecting a small radius r_i defining a small area around the query point and then iteratively increases the radius up to a maximum radius r_max. The search space is iteratively increased until all k nearest neighbor values are found or the stop criterion has been met (when r reaches r_max).

As explained above, during pre-processing the data points are clustered (using K-means clustering), reference points are selected (the cluster centers), data linearization is completed, and the data points are loaded into a B-tree data structure. The actual search begins by consulting the cache hit-miss strategy and determining the outcome based on the cache rules described below in the cache strategy section. Regardless of the outcome of the cache strategy, the ckSearch algorithm next inspects the following two stopping criteria:

• The search radius r_i has reached its maximum threshold value r_max and the k nearest neighbors still have not been found.

• The value distance(p_max, q), the distance between the query point q and the furthest data point p_max in the result set R, is less than or equal to the current search radius r_i, and the size of the result set is k. In this case, we can be sure that the algorithm has found all the k nearest neighbors of query point q, and any further increase of the query area (i.e., the search radius r_i) would only result in redundant computational cost.

Next, if the outcome of the cache hit-miss strategy is a hit, the algorithm enters the SearchCache(q) sub-routine (see Appendix B, Figure B.3). The data cache is a B-tree index structure modified to access left and right leaf nodes. At this point, the algorithm iteratively runs the SearchCache(q) sub-routine until stopped by the stopping criteria mentioned in the cache strategy section below. For each iteration, it increases the search radius r_i by an increment amount, r_increment, to widen the search space. If, instead of a cache hit, a cache miss occurs at the beginning of the search, our ckSearch algorithm enters a loop where it first checks the stopping criteria and then enters the SearchClusters(q) routine.

The SearchClusters(q) routine (see Appendix B, Figure B.2) is an important part of our algorithm because it applies the cluster search rules to significantly reduce computation cost. It checks every cluster iteratively and takes one of the following three actions:

• Exclude the cluster from the search: If the cluster in question does not contain or intersect the search sphere of the query point q and falls under the cluster exclusion rule (Rule 1), the cluster is exempted from the KNN search, and a significant reduction of computation cost occurs.

• Call SearchLeftNodes(), searching the cluster inwards and ignoring nodes to the right: If the cluster in question intersects the query search sphere according to the cluster-intersects-query-sphere rule (Rule 4), the data space inward toward the cluster center must be searched. In this case, only nodes to the left of the query node in the B-tree need to be searched; nodes to the right (in the B-tree) are ignored because they reside outside of this cluster boundary. Thus, our algorithm only calls the SearchLeftNodes(leafNode_i, key_left) sub-routine (Appendix B, Figure B.4) in the next step to search for the k nearest neighbors.

• Perform an exhaustive search: If the data cluster contains the query sphere of q, as determined by the cluster-contains-query-sphere rule (Rule 3), then an exhaustive search of the cluster must be completed to find the k nearest neighbors. The data space is traversed sufficiently to complete the search by searching inward and outward of the cluster center, because potential nearest neighbors can be to the left or right of the query node in the B-tree. The search routines SearchLeftNodes(leafNode_i, key_left) and SearchRightNodes(leafNode_i, key_right) are used for searching inward and outward of the cluster center.

Next, our ckSearch algorithm locates the leaf node, leafNode_i, in the B-tree where a point with the query index key_query would be stored. Intuitively, this leafNode_i has a high probability of holding the nearest neighbors of the query point, because the data points stored in leafNode_i have a similar distance from the cluster center as the query point q and therefore reside in the same region of the data space as the query point. The sub-routine getQueryLeaf(btree, key_query) returns this leaf node.

Next, based on the cluster search rules (as described in the "Cluster Search Rules" section), the ckSearch algorithm either calls only SearchLeftNodes(leafNode_i, key_left) for Rule 4, or calls both SearchLeftNodes(leafNode_i, key_left) and SearchRightNodes(leafNode_i, key_right) for Rule 3. Each of these sub-routines has built-in loops to check for the k nearest neighbors in leafNode_i. Moreover, these routines check left and right leaf nodes based on the inward or outward data search (Rule 3 or Rule 4).

Figure 4.3: Data cluster to B-tree correlation

Figure 4.3 above shows the data cluster to B-tree correlation: how the data points in a cluster are stored in the B-tree leaf nodes (bottom level). The data points are sorted based on their 1-dimensional, linearly transformed distance from the cluster center (used as keys).

It is important to mention that the actual discovery of the nearest neighbors happens in the SearchLeftNodes(leafNode_i, key_left) and SearchRightNodes(leafNode_i, key_right) sub-routines, because each of these two search routines iteratively calculates the distance between each data point in leafNode_i and the query point q. The k data points with the shortest distance to the query point are returned as the result set.


If the query sphere contains the first element of a node, then it is likely that its predecessor with respect to distance from the cluster center may also be close to q; thus, SearchLeftNodes(leafNode_i, key_left) also examines the left sibling leaf node for nearest neighbors. On the other hand, if the query sphere contains the last element of a node, then for the same reason the SearchRightNodes(leafNode_i, key_right) routine examines the right sibling leaf node for nearest neighbors.
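The distance work done inside these routines can be sketched as a scan over one leaf's points that maintains the k best candidates (illustrative Python with hypothetical names; the real routines also follow the linked sibling leaves as described above, and points are assumed to be tuples of numbers so that heap ties compare cleanly).

import heapq
import math

def scan_leaf(leaf_points, q, k, best):
    # Update the running k-best set with every point stored in one leaf.
    # `best` is a max-heap of (-distance, point) holding at most k entries.
    for p in leaf_points:
        d = math.dist(p, q)
        if len(best) < k:
            heapq.heappush(best, (-d, p))
        elif d < -best[0][0]:                 # closer than the current k-th neighbor
            heapq.heapreplace(best, (-d, p))
    return best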

At the end of these phases, the algorithm re-examines the two stopping criteria mentioned above. It checks the KNN result set R and stops if the k nearest neighbors have been identified; moreover, it ensures that further enlargement of the search sphere would not change the KNN result set. The search process only stops if the distance of the furthest data point in the answer set R from the query point q is less than or equal to the current search radius r_i. Otherwise, it increases the search radius and repeats the entire process. Figures B.1–B.5 in Appendix B give the pseudocode of the sub-routines mentioned above.


Figure 4.4: ckSearch algorithm data caching scheme

Figure 4.4 shows the ckSearch algorithm data caching scheme. In this example, the query point q and data point A reside in the same leaf node of the ckSearch cache. This is a cache hit scenario.

Cache Strategy

Data caching is an important component of our KNN search algorithm. A data cache can prove extremely effective in a KNN search process where a large set of high-dimensional data points is repeatedly accessed. A fast cache implementation can dramatically reduce the number of distance computations by simply storing frequently accessed data in a data cache. On the other hand, expensive cache misses can degrade performance. Thus, we have developed a cache strategy to reduce redundant computation while avoiding expensive cache misses (and therefore reducing costly B-tree insertion operations). This cache strategy is comprised of the following rules:

• Reduce the cost of insertion operations as much as possible by reducing frequent cache updates. The underlying data structure of our cache strategy is a B-tree, which is well suited to fast cache implementations; inserting a record into a B-tree requires O(log n) operations in the worst case.

• Conduct preliminary checks before performing costly cache searches, to reduce the cost of cache misses. We take this conservative approach to make sure that cache hits remain a performance boost for the ckSearch system and are not overwhelmed by too many cache misses. For a given query point, we find the closest cluster by calculating the distance between the query point and the cluster centers. We then check whether the closest cluster to the query point is the same as the cluster stored in the data cache (a B-tree structure). Our assumption here is that two consecutive query points will fall in the same cluster, and probably in the same region of that cluster, so their k nearest neighbors will also lie in that region.

• Perform an additional check by matching the query point's leaf node from the data-cache B-tree with the corresponding leaf node from the main data-storage B-tree. These two leaf nodes essentially indicate the same region of the same data cluster. If they turn out to be the same, the current query point falls in the same data region as the previous query point, because our data structure keeps the leaf nodes sorted by distance from the cluster center; for two leaf nodes to be the same, the data points stored in them must be located in the same region of a cluster. In that case, the ckSearch algorithm proceeds to retrieve the k nearest neighbors from the data cache.

• If the above checks indicate a cache miss, our algorithm skips the data cache and performs the search on the main data-storage B-tree. At the end of the query search, the leaf nodes containing the nearest neighbors are loaded into the data-cache B-tree for the next query iteration by the CacheUpdate process.
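As referenced above, the following is a minimal sketch of how the cache-hit test implied by these rules could be organized. The class CkCache and its fields and methods are illustrative assumptions, not identifiers from the thesis implementation; leaf nodes are represented here simply by their B-tree keys.

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Cheap checks first: is the query's closest cluster the cached cluster, and
// does the query fall into a leaf region that the cache currently holds?
final class CkCache {
    private int cachedClusterId = -1;                   // cluster the cache currently serves
    private final Set<Long> cachedLeafKeys = new HashSet<>();

    boolean isCacheHit(int closestClusterId, long queryLeafKey) {
        if (closestClusterId != cachedClusterId) return false;  // different cluster: miss
        return cachedLeafKeys.contains(queryLeafKey);           // same leaf region: hit
    }

    // After a miss, load the leaves that held the answers (the CacheUpdate step).
    void update(int clusterId, Collection<Long> answerLeafKeys) {
        cachedClusterId = clusterId;
        cachedLeafKeys.clear();
        cachedLeafKeys.addAll(answerLeafKeys);
    }
}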

Cluster Search Rules

Our online search strategy depends critically on a search radius parameter r_i. We initially select r_i to be conservatively small; if a query does not return enough points, we gradually increase the value of r_i. In this section we describe several cluster search rules based on the query radius, the cluster boundary, and the location of the query point. Using these parameters and simple geometric calculations, it is possible to determine with certainty that some clusters cannot contain any of the k nearest neighbors. These clusters can be completely excluded from the computation, avoiding a significant amount of computational cost. The following rules are applied at query time (runtime) by ckSearch.


Figure 4.5: Cluster search rule 1 (Cluster exclusion rule)

Figure 4.5 illustrates cluster search rule 1 (the cluster exclusion rule). In this example, the query point lies outside the cluster M_1, so this cluster can be excluded from the KNN search operations, eliminating expensive distance computations.

Rule 1: The cluster exclusion rule

A cluster can be excluded from nearest neighbor search if the following condition is true,

distance(C_i, q) - r_i > distMax_i        (4.2)

Employing an exclusion strategy, it is possible to exclude a cluster and its data points from KNN search. Naturally, by excluding a cluster from distance computations, computation cost can be reduced.


Let C_i be the reference point (cluster center) of cluster M_i, and let the query point q have a search radius r_i. As described above, r_i is the radius of the search area within which the ckSearch system looks for possible nearest points. The distance between the cluster center and the query point q is denoted by distance(C_i, q), and the distance between C_i and the furthest data point in cluster M_i is denoted by distMax_i. Given the condition distance(C_i, q) > distMax_i, the cluster M_i can be excluded from the KNN search whenever the query point q and its query sphere rest entirely outside the cluster boundary.
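A direct translation of condition (4.2) into code might look like the following sketch; the class and parameter names are illustrative assumptions, with clusterDist standing for distance(C_i, q) and distMax for distMax_i.

// Rule 1: the whole query sphere lies outside the cluster boundary,
// so the cluster can be skipped entirely.
final class Rule1 {
    static boolean excludeCluster(double clusterDist, double r, double distMax) {
        return clusterDist - r > distMax;
    }
}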

Figure 4.6: Cluster search rule 2 (Cluster search region rule)

Figure 4.6 shows cluster search rule 2 (the cluster search region rule). This rule describes the valid search region for a query point within a cluster; it restricts the search computations to that region and avoids unnecessary iterations over invalid regions.


Rule 2: Cluster search region rule

When a cluster is searched for nearest neighbor points, the effective search range is

dist_min = max{distMin_i, distance(C_i, q) - r_i}

dist_max = min{distMax_i, distance(C_i, q) + r_i}

and the effective search region is the interval [dist_min, dist_max].        (4.3)

A carefully selected search region can further reduce the cost of the nearest neighbor search. Moreover, a range query can be performed using this search range within an affected cluster. Most importantly, search termination rules can be set up based on this range while the leaf nodes of the B-tree index structure are scanned for nearest neighbors, which speeds up data retrieval from the B-tree.

Let the distance between the cluster center C_i and the query point q be denoted by distance(C_i, q), and let the query point q have a search radius r_i. Moreover, let the distances from the cluster center C_i to the furthest and the closest data points in cluster M_i be denoted by distMax_i and distMin_i, respectively. From these quantities we can deduce the effective search region of a cluster, because no data point relevant to the query lies beyond this region.
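A small sketch of this range computation follows; the class and parameter names are illustrative assumptions, not the thesis code, and the parameter names mirror the notation used above.

// Rule 2: the only distances from Ci that can hold points inside the query
// sphere lie in the intersection of [distMin, distMax] (where the cluster's
// points live) and [clusterDist - r, clusterDist + r] (where the sphere reaches).
final class Rule2 {
    static double[] effectiveRange(double clusterDist, double r,
                                   double distMin, double distMax) {
        double lo = Math.max(distMin, clusterDist - r);   // no cluster point is closer to Ci
        double hi = Math.min(distMax, clusterDist + r);   // points farther out cannot be in the sphere
        return new double[] { lo, hi };
    }
}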


Figure 4.7: Cluster search rule 3 (Cluster contains query sphere)

Figure 4.7 illustrates cluster search rule 3 (cluster contains query sphere). In this example, the query point q and its search region of radius r_1 lie completely inside the cluster M_1; thus, cluster M_1 contains q's query sphere.

Rule 3: Cluster contains query sphere rule

The query sphere with radius r_i is completely contained in the affected cluster M_i if the following condition is true:

distance(C_i, q) + r_i ≤ distMax_i        (4.4)

It is important to know whether the query point q and its query sphere are completely contained in the partition (cluster), because this information can be used to formulate a smarter nearest neighbor search and, in turn, reduce search-related computation cost.


Let distance(C_i, q) be the distance between the cluster center C_i and the query point q, and let distMax_i be the radius of the cluster M_i. Given distance(C_i, q) ≤ distMax_i, if the condition distance(C_i, q) + r_i ≤ distMax_i is also true, then the cluster M_i completely contains the query sphere (see figure 4.7).

Figure 4.8: Cluster search rule 4 (Cluster intersects query sphere)

Figure 4.8 shows cluster search rule 4 (cluster intersects query sphere). In this example, the query sphere of q only intersects the cluster M_1; thus, it is possible that some of the k nearest neighbors are not in cluster M_1.

Rule 4: Cluster intersects query sphere rule

The query sphere with radius r_i intersects the affected cluster M_i if the following condition is true:

distance(C_i, q) - r_i ≤ distMax_i        (4.5)


As in the previous section, it is important to know whether a cluster intersects the search sphere of the query point q. In this case the nearest neighbors may lie in the cluster in question, but it is also possible that they are located in another cluster, so the iterative search process may continue.

Let distance(C_i, q) be the distance between the cluster center C_i and the query point q, and let distMax_i be the radius of the cluster M_i. Assuming distance(C_i, q) > distMax_i, i.e., the query point lies outside the affected cluster, if the condition distance(C_i, q) - r_i ≤ distMax_i is true then the cluster M_i partially intersects the query sphere (see figure 4.8).
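The containment and intersection tests of Rules 3 and 4 reduce to two one-line comparisons, sketched below for illustration. The class and parameter names are assumptions, not the thesis code; clusterDist stands for distance(C_i, q) and distMax for distMax_i.

// Rule 3: the cluster completely contains the query sphere (equation 4.4).
// Rule 4: the query sphere at least intersects the cluster (equation 4.5).
final class Rule3And4 {
    static boolean containsQuerySphere(double clusterDist, double r, double distMax) {
        return clusterDist + r <= distMax;
    }

    static boolean intersectsQuerySphere(double clusterDist, double r, double distMax) {
        return clusterDist - r <= distMax;
    }
}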


CHAPTER 5

EXPERIMENTS & RESULTS

In this section we detail the experimental setup, describe our experiments, and present their results. The main objective of the experiments is to evaluate the performance of our ckSearch system. The indexing strategies of ckSearch are tested on data sets of varying size, dimensionality, and distribution.

We use the KD-tree algorithm as a benchmark for comparison. The KD-tree algorithm is an effective and commonly used KNN method based on a multi-dimensional indexing structure, and it is especially appealing for comparison because it is similar to our ckSearch multi-dimensional indexing tree structure. The focus of our research is to speed up the learned-data classification process, and it is especially applicable to an existing autonomous robot that currently uses a KD-tree based KNN system. According to Arya and Silverman [2], linear search serves as an effective KNN search technique, so for completeness we also compare the performance of ckSearch with the linear search KNN technique.

5.1 Setup information

The ckSearch search algorithm and the related K-means clustering technique were implemented in Java. A tree-based indexing structure was used as the primary data structure, along with a two-dimensional array to store cluster information. The linear search and KD-tree implementations are also Java based; the KD-tree code was obtained from our colleagues at the Georgia Institute of Technology [37]. Experiments were performed on a 1.5 GHz PC with 512 MB of main memory, running Microsoft Windows XP (version 2002, SP3).

For our training set we used training data generated by an autonomous robot guided by a human. We also created synthetic test data sets ranging from 10,000 to 100,000 records with various dimensionalities (9, 18, 36, 50, and 60). Each query uses a d-dimensional point. One hundred query trials were used for each experiment, and we averaged the total query time to even out the I/O cost.

5.2 The effect of the size of the data set

The size of the data set can play a significant role in the performance of a KNN algorithm whose searches are O(n) in the number of stored items. To evaluate our system against this criterion, we conducted a series of experiments using a 60-dimensional data set with k set to 1, 3, and 10. During these experiments we gradually increased the number of data points, starting with 5000 and increasing to 10000, 20000, 40000, 50000, 60000, 80000, and 100000. For each data set we recorded the execution time of ckSearch and compared it with the execution time of the linear search implementation. The results are tabulated in table 5.1.

The following tables show the effect of the size of the data set on the ckSearch and linear search implementations of the KNN algorithm. The size of the data set was gradually increased and the execution times were recorded.


Data Size   Dimension   Linear search (ms)   ckSearch (ms)   k
5000        60          44.010621            3.574           1
10000       60          79.029039            5.663           1
20000       60          159.794052           15.7322         1
40000       60          219.094885           12.4334         1
50000       60          264.959094           16.3343         1
80000       60          264.959094           26.3441         1
100000      60          721.357044           34.59425        1

Table 5.1: The effect of data size on performance (k=1)

[Chart: execution time (ms) vs. data set size for linear search and ckSearch, k = 1]

Figure 5.1: Performance vs. data set size chart (k = 1)


Data Size   Dimension   Linear search (ms)   ckSearch (ms)   k
5000        60          73.031907            13.91388        3
10000       60          98.910615            29.22268        3
20000       60          174.237228           40.72744        3
40000       60          278.986854           87.66623        3
50000       60          325.127635           68.59792        3
80000       60          520.718415           155.5077        3
100000      60          873.28166            122.8424        3

Table 5.2: Effect of data size on performance (k=3)

[Chart: execution time (ms) vs. data set size for linear search and ckSearch, k = 3]

Figure 5.2: Performance vs. data set size chart (k = 3)


Data Size   Dimension   Linear search (ms)   ckSearch (ms)   k
5000        60          73.031907            13.91388        10
10000       60          98.910615            29.22268        10
20000       60          174.237228           40.72744        10
40000       60          278.986854           87.66623        10
50000       60          325.127635           68.59792        10
80000       60          520.718415           155.5077        10
100000      60          873.28166            122.8424        10

Table 5.3: Effect of data size on performance (k=10)

[Chart: execution time (ms) vs. data set size for linear search and ckSearch, k = 10]

Figure 5.3: Performance vs. data set size chart (k = 10)


Data Size   Dimension   k = 1      k = 3          k = 10
5000        60          12.32791   5.248852694    3.226432
10000       60          13.96273   3.384720805    2.27282
20000       60          10.17797   4.278128799    2.999871
40000       60          17.66894   3.182375559    3.516738
50000       60          16.55994   4.739613768    2.510961
80000       60          10.19073   3.348506537    4.066482
100000      60          20.85194   7.10895797     3.356904

Table 5.4: ckSearch speedup over linear search

[Chart: ckSearch speedup over linear search vs. data set size, for k = 1, 3, and 10]

Figure 5.4: Chart showing ckSearch speedup over the linear search

Tables 5.1, 5.2, and 5.3 show the results of the "effect of the size of the data set" experiment, in which we evaluated the performance of the ckSearch algorithm against an implementation of the linear search KNN algorithm. The results clearly show that ckSearch performed far better than linear search. The speedup chart (figure 5.4) confirms that ckSearch achieves and maintains a steady speedup over the linear search method for several values of k, and ckSearch copes much better than linear search as larger data sets increase the number of required computations.
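For reference, the speedup values in Table 5.4 are simply the ratio of the linear search time to the ckSearch time from Tables 5.1 through 5.3; for example, with 100,000 records and k = 1, 721.357 ms / 34.594 ms ≈ 20.85, matching the last row of Table 5.4.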

5.3 The effect of data dimension on the performance

The number of dimensions of a data set can influence the performance of a KNN algorithm because of the increased cost of Euclidean distance computations on high-dimensional data. An autonomous robot's navigational data set can be high-dimensional; for example, an autonomous robot may use high-dimensional sensor arrays or high-dimensional image processing for navigation. Thus, we focused on evaluating ckSearch performance on high-dimensional data.

In this experiment we used large data sets with 50000 and 100000 records. For the first experiment the k value was set to 1 and the data set size to 50000; for the second, the k value was set to 3 and the data set with 100000 records was used. Each experiment started with a 9-dimensional data set, and the number of dimensions was then gradually increased to 18, 36, 60, and 75, with the execution time recorded each time. To enable comparisons, the same experiments were performed with a KD-tree implementation and a linear search technique. The following tables show the effect of data dimensionality on the ckSearch, KD-tree, and linear search implementations of the KNN algorithm.


Dimension   Data Set Size   KD-tree (ms)   Linear Search (ms)   ckSearch (ms)
9           50000           125.3342       268.728288           22.26529
18          50000           112.7768       291.717802           19.05041
36          50000           140.987        324.284511           18.37775
60          50000           157.7129       345.627796           30.71787
75          50000           222.6564       561.185849           27.38437

Table 5.5: The effect of data dimension on performance (N=50K)

[Chart: execution time (ms) vs. data dimension for linear search, KD-tree, and ckSearch, N = 50000, k = 1]

Figure 5.5: Data dimension vs. performance chart (N = 50K)

Dimension   Data Set Size   k   Linear Search (ms)   ckSearch (ms)
9           100000          3   419.812183           35.07446
18          100000          3   451.023915           92.92075
36          100000          3   477.775919           158.4776
60          100000          3   571.532918           122.8424
75          100000          3   660.196533           140.9969

Table 5.6: The effect of data dimension on performance (N=100K)


[Chart: execution time (ms) vs. data dimension for linear search and ckSearch, N = 100000, k = 3]

Figure 5.6: Data dimension vs. performance chart (N = 100K)

Dimension   Data Set Size   Linear Search (ms)   ckSearch (ms)   Speedup
9           100000          419.812183           35.07446        11.96917
18          100000          451.023915           92.92075        4.853856
36          100000          477.775919           158.4776        3.014785
60          100000          571.532918           122.8424        4.652569
75          100000          660.196533           140.9969        4.682349

Table 5.7: ckSearch speedup over linear search for various dimensions

In this experiment we compared the performance of ckSearch with the KD-tree and linear search implementations. The results clearly show that the ckSearch system performed better than both the KD-tree and the linear search method, and ckSearch achieved considerable speedup over linear search (see table 5.7). The results also show that, as the number of dimensions increases, the KD-tree and linear search performance gradually degrades; with the larger data set (100000 records) the linear search performance degrades at a higher rate. ckSearch, on the other hand, shows robustness to the increase in dimension: its execution time grows at a much slower rate than that of the KD-tree and linear search systems (see figure 5.6).

5.4 The effect of search radius on the performance

The search radius is an important factor for the ckSearch system, which uses an incremental, radius-based search: a small search sphere is used initially and enlarged when the search condition cannot be met. Since ckSearch relies on the search sphere to minimize repeated, costly distance calculations, it is important to study the effect of the search radius on the performance of the system.

In this experiment we used a data set with 10000 records. We used a k value of 1 for the first part of the experiment and a k value of 3 for the second part. The radius value was gradually increased from 1.0 meter to 10.0 meters, and the execution time of each run was recorded.

The following table shows the effect of the search radius on ckSearch query performance. Even though the KD-tree and linear search methods do not use a search radius, we list their performance results for comparison with the ckSearch system.


Radius (m)   Data Set Size   KD-tree (ms)   Linear Search (ms)   ckSearch (ms)
1.0          10000           293.99938      109.38541            5.122073
2.0          10000           293.99938      109.38541            13.31026
3.0          10000           293.99938      109.38541            19.89829
4.0          10000           293.99938      109.38541            26.42299
5.0          10000           293.99938      109.38541            34.06723
6.0          10000           293.99938      109.38541            39.26128
7.0          10000           293.99938      109.38541            42.53429
8.0          10000           293.99938      109.38541            46.6953
9.0          10000           293.99938      109.38541            49.8422
10.0         10000           293.99938      109.38541            51.81746

Table 5.8: The effect of search radius on performance (k = 3)

[Chart: execution time (ms) vs. search radius for linear scan, KD-tree, and ckSearch, k = 3]

Figure 5.7: Search radius vs. performance for 10000 data records


Radius (m)   Data Set Size   Linear Search (ms)   ckSearch (ms)
1.0          10000           169.24721            11.324462
2.0          10000           169.24721            27.077553
3.0          10000           169.24721            43.68484
4.0          10000           169.24721            63.019417
5.0          10000           169.24721            76.492185
6.0          10000           169.24721            85.420351
7.0          10000           169.24721            94.545402
8.0          10000           169.24721            102.19342
9.0          10000           169.24721            112.9639
10.0         10000           169.24721            114.11205

Table 5.9: The effect of search radius on performance (k = 10)

[Chart: execution time (ms) vs. search radius for linear scan and ckSearch, k = 10]

Figure 5.8: Search radius vs. performance chart for 10,000 data records (k = 10)

Considering the experimental results listed in tables 5.8 and 5.9, the search radius has a significant impact on ckSearch performance: we observe a sharp increase in execution time as the search radius grows. We believe this is due to an increase in the number of redundant distance computations. As will be shown in the accuracy experiments, the ckSearch algorithm finds the results well before reaching the maximum radius of 10.0 meters used in these experiments.

5.5 The effect of search radius on accuracy

This experiment is similar to the one above on the effect of the search radius on execution time; here we evaluate the effect of the search radius on accuracy. A small search sphere is used as the starting radius and is enlarged when the search condition cannot be met; in this experiment we started with a radius of 1.0 meter and increased it to 10.0 meters. Since the ckSearch system relies on the search sphere to minimize repeated, costly distance calculations, it is important to study the effect of the search radius on the accuracy of the ckSearch system.

For each radius value used during the query, we recorded the percentage of correct nearest neighbors found by the ckSearch algorithm. The results of this experiment are shown in the table below.

Radius (m)   Data Set Size   k = 3 (%)   k = 10 (%)
1.0          10000           0.0         0.0
2.0          10000           85.423      76.667
3.0          10000           97.355      97.702
4.0          10000           99.856      99.113
5.0          10000           100.0       99.822
6.0          10000           100.0       100.0
7.0          10000           100.0       100.0
8.0          10000           100.0       100.0
9.0          10000           100.0       100.0
10.0         10000           100.0       100.0

Table 5.10: The effect of the search radius on query accuracy


[Chart: percentage of correct nearest neighbors found vs. search radius, for k = 3 and k = 10, N = 10000, d = 60]

Figure 5.9: Search radius vs. query accuracy chart

The experimental results in the table show that the accuracy of the query search improves as the search radius increases from 1.0 meter to 10.0 meters: a larger search radius allows the ckSearch algorithm to assess more candidate neighbors, so accuracy increases with the radius. It is also important to notice that the ckSearch algorithm achieves 100% accuracy well before the maximum radius value of 10.0 meters, which indicates that selecting a proper radius is important for the performance of the ckSearch system.

5.6 The effect of the number of clusters

The number of clusters can affect the performance of a cluster-based algorithm. Even though clustering for ckSearch is part of the pre-processing stage and does not directly contribute to query time, it can indirectly influence the ckSearch running time. To find out the effect of the number of clusters, we performed several experiments investigating their effect on the ckSearch system. As the number of clusters increases, it is plausible that computational complexity, and in turn computation time, increases.

In this experiment the number of clusters was gradually increased and the resulting performance recorded. We used 5, 10, 20, 30, and 50 clusters with a data set of 50,000 records, and we conducted two separate experiments with different numbers of nearest neighbors, k = 1 and k = 5. The results are tabulated below.

Clusters   k   KD-tree (ms)   Linear Search (ms)   ckSearch (ms)
5          1   266.2874       246.287447           29.59359
10         1   266.2874       246.287447           27.03791
20         1   266.2874       246.287447           29.7673
30         1   266.2874       246.287447           17.83305
50         1   266.2874       246.287447           21.50379

Table 5.11: The effect of the number of clusters on performance (k=1)

[Chart: execution time (ms) vs. number of clusters for linear search, KD-tree, and ckSearch, N = 50000, k = 1]

Figure 5.10: The number of clusters vs. performance chart for 50000 data records (k = 1)


Clusters   k   Linear Search (ms)   ckSearch (ms)
5          5   363.641468           116.6476
10         5   363.641468           121.0026
20         5   363.641468           104.7671
30         5   363.641468           86.15084
50         5   363.641468           114.5369

Table 5.12: The effect of the number of clusters on performance (k=5)

[Chart: execution time (ms) vs. number of clusters for linear search and ckSearch, N = 50000, k = 5]

Figure 5.11: The number of clusters vs. performance chart (k = 5)

Tables 5.11 and 5.12 show the results of our experiments with the number of clusters. Our initial hypothesis was that performance would decrease as the number of clusters increases, because more clusters take longer to search. Interestingly, according to our results, ckSearch performance times remain roughly the same or increase only very slightly. We hypothesize that this is because the data records are spread over a larger number of clusters and most of the cluster search is eliminated by the cluster search rules, which prevent the ckSearch system from unnecessary searching.


CHAPTER 6

CONCLUSION

In this thesis we introduced a new algorithm for K-nearest neighbor queries that uses clustering and caching to improve performance. The main idea is to reduce the cost of distance computations between the query point and the data points in the data set. We used a divide-and-conquer approach: first, we divide the training data into clusters based on the similarity between data points, measured by Euclidean distance; next, we use linearization for faster lookup. The data points in a cluster can be sorted by their similarity (Euclidean distance) to the center of the cluster, and fast search data structures such as the B-tree can then store the points by their distance from the cluster center and support fast search, including range search. We achieved a further performance boost by using B-tree based data caching. In this work we provided details of the algorithm, an implementation, and experimental results in a robot navigation task.

We conducted extensive experiments on the performance and accuracy of the ckSearch algorithm. To confirm the performance improvement for KNN queries, we evaluated the ckSearch system on both large and small data sets. Several of our experiments focused on the performance of ckSearch with high-dimensional data sets, since many KNN search algorithms perform poorly on high-dimensional data. The results show that our algorithm is both effective and efficient; in fact, the ckSearch algorithm achieves a performance improvement over both the KD-tree and the linear scan KNN algorithms.

In the future we will further improve the system by adding an analysis to select the best possible initial search radius for the ckSearch algorithm. Selecting too small a search radius can lead to many unnecessary iterations; we intend to remedy this weakness of the system by adding such a radius-selection analysis.


REFERENCES

[1] M. Procopio, T. Strohmann, A. Bates, G. Grudic, J. Mulligan. Using Binary Classifiers to Augment Stereo Vision for Enhanced Autonomous Robot Navigation. April 2007.

[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, A. Y. Wu. An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions. Journal of the ACM, Vol. 45, No. 6, 1998, pp. 891-923.

[3] V. Ramasubramanian, Kuldip K. Paliwal. Fast nearest-neighbor search algorithms based on approximation-elimination search. January 1999.

[4] J. Chua, P. Tischer. A Framework for the Construction of Fast Nearest Neighbour Search Algorithms. Monash University, Australia.

[5] J. Chua, P. Tischer. Minimal Cost Spanning Trees for Nearest-Neighbour Matching. Monash University, Australia.

[6] V. Athitsos, M. Potamias, P. Papapetrou, G. Kollios. Nearest Neighbor Retrieval Using Distance-Based Hashing. In Proc. IEEE International Conference on Data Engineering (ICDE), April 2008.

[7] Y. Hsueh, R. Zimmermann, M. Yang. Approximate Continuous K Nearest Neighbor Queries for Continuous Moving Objects with Pre-Defined Paths. Department of Computer Science, University of Southern California.

[8] W. Shang, H. Huang, H. Zhu, Y. Lin, Z. Wang, Y. Qu. An Improved kNN – Fuzzy kNN Algorithm. School of Computer and Information Technology, Beijing Jiaotong University, China.

[9] A. Jain, M. Murty, P. Flynn. Data Clustering: A Review. Michigan State University, U.S.A.

[10] A. Duch, V. Castro, C. Martinez. Randomized K-Dimensional Binary Search Trees. September, 1998.

[11] Z. Aghbari, A. Makinouchi. Linearization Approach for Efficient KNN Search of High-Dimensional Data. University of Sharjah, Sharjah, UAE.


[12] R. Weber, H. Schek, S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. ETH Zentrum, Zurich.

[13] A. Thomasian, L. Zhang. The Stepwise Dimensionality Increasing (SDI) Index for High-Dimensional Data. May, 2006.

[14] B. Zheng, W. Lee, D. Lee. Search K Nearest Neighbors on Air. Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.

[15] H. Zhang, A. Berg, M. Maire, J. Malik. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. University of California, Berkeley, California.

[16] C. Yu, B. Ooi, K. Tan, H. Jagadish. Indexing the Distance: An Efficient Method to KNN Processing. In Proc. of the 27th VLDB Conference, Roma, Italy, 2001.

[17] A. Nuchter, K. Lingemann, J. Hertzberg. 6D SLAM with Cached kd-tree Search. University of Osnabruck, Osnabruck, Germany.

[18] G. Neto, H. Costelha, P. Lima. Topological Navigation in Configuration Space Applied to Soccer Robots. Instituto Superior Tecnico, Portugal.

[19] C. Yu, S. Wang. Efficient Index based KNN join processing for high dimensional data. Information and Software Technology. May 2006.

[20] G. DeSouza, A. Kak. Vision for Mobile Robot Navigation: A Survey. IEEE Transactions on pattern analysis and machine intelligence, vol. 24, no. 2, February, 2002.

[21] E. Plaku, L. Kavraki. Distributed Computation of the knn Graph for Large High-Dimensional Point Sets. Journal of Parallel and Distributed Computing, 2007, vol. 67(3), pp. 346-359.

[22] J. L. Bentley. Multidimensional Binary Search Trees in Database Applications. IEEE Trans. on Software Engineering, SE-5(4):333-340, July 1979.

[23] N. Ripperda, C. Brenner. Marker-Free Registration of Terrestrial Laser Scans Using the Normal Distribution Transform. University of Hannover, Germany.

[24] C. Atkeson, S. Schaal. Memory-Based Neural Networks For Robot Learning. GIT, Atlanta, Georgia.

[25] J. Nievergelt, H. Hinterberger, K. Sevcik. The Grid File: An Adaptable, Symmetric Multikey File Structure. ACM Trans. on Database Systems, 9(1):38-71, 1984.


[26] A. Gionis, P. Indyk, R. Motwani. Similarity search in high dimensions via Hashing. In International Conference on Very Large Databases (VLDB), 1999 pp. 518-529.

[27] V. Athitsos, M. Potamias, P. Papapetrou, G. Kollios. Nearest Neighbor Retrieval Using Distance-Based Hashing.

[28] A. Andoni, P. Indyk. Efficient algorithms for substring nearest neighbor Problem. In ACM-SIAM Symposium on Discrete Algorithms (SODA). 2006, pp. 1203 – 1212.

[29] T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery.

[30] G. Grizaite, R. Oberperfler. DBSCAN Clustering Algorithm. January 31, 2005.

[31] T. Bingmann. "STX B+ Tree Template Classes: Speed Test Results." 2008. Idlebox. Accessed 4th April, 2009. <http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html>.

[32] D. Bentivegna. Learning from Observation Using Primitives. Doctoral Dissertation, Georgia Institute of Technology, 2004.

[33] L. Xiong, S. Chitti. Mining multiple private databases using a kNN classifier. In Proceedings of the 2007 ACM symposium on Applied computing. 2007, pp. 435 - 440.

[34] H. Franco-Lopez, A. Ek, M. Bauer. Estimation and mapping of forest stand density, volume, and cover type using the k-nearest neighbors method. Remote Sensing of Environment, Vol. 77, No. 3, 2001, pp. 251-274.

[35] H. Maarse, P. Slump, A. Tas, J. Schaefer. Classification of Wines According to Type. Zeitschrift für Lebensmitteluntersuchung und -Forschung A, Vol. 184, No. 3, March 1987, pp. 198-203.

[36] A. Sohail, P. Bhattacharya. Classification of Facial Expressions Using K-Nearest Neighbor Classifier. Computer Graphics Collaboration Techniques, Vol. 4418, June 2007, pp. 555-566.

[37] S. Arya, D. Mount. "ANN: A Library for Approximate Nearest Neighbor Searching." August 4, 2006. Accessed 14th April, 2009. <http://www.cs.umd.edu/~mount/ANN/>.


APPENDIX A

NOTATION TABLE

Notation

Table A.1 lists the symbols, functions, and parameters used in this thesis. The following terms and notations are used throughout, especially in the pseudocode of Appendix B.

d                      Number of dimensions
N                      Number of data points
D ∈ Ω                  Data set
Ω = [0,1]^d            Data space
R                      Result set containing the k nearest neighbors
C_i                    Cluster center reference point
r                      Radius of a search sphere
r_increment            Radius increment value
r_max                  Maximum radius value for the STOP criterion
p_i                    A data point p in the ith cluster
distMax_i              Maximum radius of a partition M_i
distMin_i              Distance between C_i and the closest point to C_i
p_max                  The furthest data point from q in the KNN result set R
FurthestPoint(R, q)    Furthest point from query point q in set R
SearchRadius(q)        Search radius of query point q
SearchSphere(q, r)     Sphere with query point q at its center and radius r
distNearest_q          Nearest distance to query point q
distance(p_i, C_i)     Distance between point p_i and cluster center C_i
key_i                  B-tree index of nodes and data entries in a leaf node
data_i                 Data entries in a leaf node of the B-tree
dist_Center            Distance from query point q to cluster center C_i
GetNearest(q)          Nearest neighbor to query point q

Table A.1: List of various notations used in this thesis


APPENDIX B

IMPLEMENTATION PSEUDOCODE

ckSearch_KNN(q):
    initialize();
    loadBTree();
    r_increment = increment value;
    R = empty;

    if (IsCacheHit(q) == true):
        while (r < r_max):
            if (distance(p_max, q) < r and R.Size() == k):
                STOP;
                return;
            r = r + r_increment;
            SearchCache(q);

    else if (IsCacheHit(q) == false):
        while (r < r_max):
            if (distance(p_max, q) < r and R.Size() == k):
                STOP;
                return;
            r = r + r_increment;
            SearchClusters(q);
            UpdateCache();

End ckSearch_KNN;

Figure B.1: ckSearch KNN algorithm

The figure above shows the pseudocode of the ckSearch KNN query algorithm. This is one of several methods used to implement the ckSearch algorithm.

68

SearchClusters(q):
    for i = 0 to (M - 1):
        dist_Center = distance(C_i, q);

        if (exclude(i, q) == true):                 // Rule 1: exclude the cluster
            SKIP CLUSTER i;

        else if (intersects(i, q) == true):         // Rule 4: cluster intersects the query sphere
            key_query = i * µ + dist_Center;
            leafNode_i = getQueryLeaf(btree, key_query);
            key_left = i * µ + (dist_Center - r);
            SearchLeftNodes(leafNode_i, key_left);

        else if (contains(i, q) == true):           // Rule 3: cluster contains the query sphere
            key_query = i * µ + dist_Center;
            leafNode_i = getQueryLeaf(btree, key_query);
            key_left = i * µ + (dist_Center - r);
            SearchLeftNodes(leafNode_i, key_left);
            key_right = i * µ + (dist_Center + r);
            SearchRightNodes(leafNode_i, key_right);
    // end of for loop
END;

Figure B.2: SearchClusters(q) pseudocode

Figure B.2 above shows the pseudocode of the main cluster search algorithm. This SearchClusters(q) routine is part of our proposed ckSearch KNN search algorithm.


SearchCache(q):
    index = index of the cached cluster;
    for i = 0 to (M - 1):                           // searching all cached clusters
        dist_Center = distance(C_i, q);

        if (exclude(i, q) == true):                 // Cluster Rule 1: exclude the cluster
            SKIP CLUSTER i;

        else if (intersects(i, q) == true):         // Cluster Rule 4: cluster intersects the query sphere
            key_query = i * µ + dist_Center;
            leafNode_i = getQueryLeaf(btree, key_query);
            key_left = i * µ + (dist_Center - r);
            SearchLeftNodes(leafNode_i, key_left);

        else if (contains(i, q) == true):           // Cluster Rule 3: cluster contains the query sphere
            key_query = i * µ + dist_Center;
            leafNode_i = getQueryLeaf(btree, key_query);
            key_left = i * µ + (dist_Center - r);
            SearchLeftNodes(leafNode_i, key_left);
            key_right = i * µ + (dist_Center + r);
            SearchRightNodes(leafNode_i, key_right);

END;

Figure B.3: The “SearchCache(q)” algorithm pseudocode

Figure B.3 above shows the pseudocode of the cache search algorithm. This SearchCache(q) routine is part of our proposed ckSearch KNN search system.


SearchLeftNodes(leafNode_i, key_left):
    for (i = 0; i < leafNodeSize(); i++):           // searching leafNode_i for nearest neighbors
        if R.Size() == k:
            if (distance(p_max, q) > distance(data_i, q)):
                Remove p_max from R;
                Add data_i to R;
        else if R.Size() ≠ k:
            Add data_i to R;
    // end of for loop

    dist_left = dist_Center - r;
    leftLeafNode = GetLeftLeafNode(leafNode_i);

    while (true):
        leftLeafNode = GetLeftLeafNode(leftLeafNode);
        SearchLeafNode(leftLeafNode);               // searching leftLeafNode for nearest neighbors

        keyOfMinRecord = key value of the left-most entry of leftLeafNode;

        if keyOfMinRecord < dist_left OR the cluster boundary is reached:
            break;                                  // reached the search sphere limit, no need to search further
END;

Figure B.4: The SearchLeftNodes(leafNode i, key left ) pseudocode

Figure B.4 above shows the “SearchLeftNodes(leafNode i, key left )” function pseudocode. This function searches the leaf nodes to the left in the data structure for nearest neighbor points. It is considered one of the most important functions in the ckSearch implementation.


SearchRightNodes(leafNode_i, key_right):
    for (i = 0; i < leafNodeSize(); i++):           // searching leafNode_i for nearest neighbors
        if R.Size() == k:
            if (distance(p_max, q) > distance(data_i, q)):
                Remove p_max from R;
                Add data_i to R;
        else if R.Size() ≠ k:
            Add data_i to R;
    // end of for loop

    dist_right = dist_Center + r;
    rightLeafNode = GetRightLeafNode(leafNode_i);

    while (true):
        rightLeafNode = GetRightLeafNode(rightLeafNode);
        SearchLeafNode(rightLeafNode);              // searching rightLeafNode for nearest neighbors

        keyOfMaxRecord = key value of the right-most entry of rightLeafNode;

        if keyOfMaxRecord > dist_right OR the cluster boundary is reached:
            break;                                  // reached the search sphere limit, no need to search further
END;

Figure B.5: The SearchRightNodes(leafNode_i, key_right) pseudocode

Figure B.5 above shows the “SearchRightNodes(leafNode i, key right )” function pseudocode. This function searches the leaf nodes to the right in the data structure for nearest neighbor points. It is considered one of the most important functions in the ckSearch implementation.
