PCAF: Scalable, High Precision k-NN Search using Principal Component Analysis based Filtering

Huan Feng∗, David Eyers†, Steven Mills†, Yongwei Wu∗, Zhiyi Huang†

∗ Department of Computer Science and Technology; Tsinghua National Laboratory for Information Science and Technology (TNLIST), Tsinghua University, Beijing 100084; Technology Innovation Center at Yinzhou, Yangtze Delta Region Institute of Tsinghua University, Ningbo 315000, Zhejiang; Research Institute of Tsinghua University in Shenzhen, Shenzhen 518057, China
† Department of Computer Science, University of Otago, New Zealand

Abstract—Approximate k Nearest Neighbours (AkNN) search is widely used in domains such as computer vision and machine learning. However, AkNN search in high dimensional datasets does not work well on multicore platforms: it scales poorly due to its large memory footprint. Current parallel AkNN search using space subdivision for filtering helps reduce the memory footprint, but leads to loss of precision. We propose a new data filtering method—PCAF—for parallel AkNN search based on principal components analysis. PCAF improves on previous methods by demonstrating sustained, high scalability for a wide range of high dimensional datasets on both Intel and AMD multicore platforms. Moreover, PCAF maintains high precision in terms of the AkNN search results.

I. INTRODUCTION

Wide use of k Nearest Neighbours (k-NN) search is made in domains such as bioinformatics [6], [9], machine learning [37], computer vision [41] and handwriting recognition [42]. Given query data points, k-NN finds the k data items within a database (i.e., a set of features) that are most similar to the query data, where the similarity is often measured by Euclidean distance. In general, a feature f can be defined as a D dimensional vector: f = [e1, e2, .., eD]. The database DB is defined as a set of N features: DB = {f1, f2, .., fN}. We call the feature that is used to query the database DB the query feature and the features in DB the reference features. Based on these definitions, the k-NN problem can be formally described as: given a query feature q, find the k reference features in DB that have the shortest (Euclidean) distances to q.

To address the challenge of rapidly increasing amounts of data being included for processing, many Approximate k Nearest Neighbours (AkNN) algorithms [5], [7], [21], [32], [34] have been proposed. Instead of returning the actual k-NN, they return k results that are highly likely to be the k-NN. Although AkNN algorithms have better performance, their searching precision is of great concern [17], [23], [36].

There are two main strategies in AkNN for finding approximate nearest neighbours: data selection and data filtering. The data selection strategy tries to find candidate features that are most likely to be the precise k nearest neighbours. Most AkNN algorithms adopt this strategy [1], [7], [29]. However, this strategy incurs a large memory footprint, and the AkNN algorithms have poor scalability on multicore systems because they perform a large number of random memory accesses and thus cause cache misses [8], [34], [35].

The data filtering strategy [35] instead excludes unlikely features based on distance estimation between the query feature and the reference features. If the filtering rate is high, much computation and many memory accesses can be avoided. Typically, the scalability of AkNN can be greatly improved on multicore systems by using a filtering strategy. Subspace Clustering for Filtering (SCF) [34] was the state-of-the-art approach using the data filtering strategy in AkNN. It greatly improves the scalability of AkNN algorithms. However, its search precision is unstable and depends on the nature of the reference features. We will discuss this challenge in detail in the next section.

In this paper, we propose a parallel AkNN algorithm called PCAF which uses Principal Components Analysis (PCA) [18] to estimate the rank of distances between the query feature and the reference features. PCAF uses data filtering to exclude those reference features that are not likely to be k-NN features according to the PCA estimation. It has high scalability on multicore systems with stable, high search precision on high-dimensional datasets (e.g., 561 dimensions).

The remainder of this paper is organised as follows. Section II describes the motivation of the idea. Section III presents our PCAF algorithm. Section IV demonstrates the detailed technical implementation of PCAF. Section V provides experimental results and an evaluation compared with four widely-used k-NN algorithms. Section VI discusses the most related work. Finally, Section VII summarises the contributions of this paper.
II. MOTIVATION

As discussed previously, data filtering in AkNN greatly improves its parallel performance on multicore systems. We have previously proposed a parallel AkNN algorithm called SCF that uses data filtering to exclude unlikely k-NN features. Before searching, SCF needs to build an index for the reference features, as needed in all AkNN algorithms. SCF divides the reference features into a number of subspaces with low dimensionality in order to alleviate the problem of the curse of dimensionality [4], [9]. Then in each subspace, SCF uses the k-means [26], [29] clustering method to divide the reference features into clusters. The centre of each cluster is used to estimate the distance between the query feature and the reference features of the cluster in the subspace. Finally, the distance between the query feature and any reference feature in the original high-dimensional space is estimated by summing up the distances between the query feature and the reference feature in each subspace.

Table Ia gives an example of k-NN search with one query feature and four reference features in a 4-dimensional space. The Real Euclidean Distance (RED) and the rank based on it are also listed in the table. Though we use only one query feature and four reference features here, it is worth pointing out that in real applications such as image processing, there will typically be tens of thousands of query features and reference features in each pair of images, and tens of thousands of images to be searched and matched pair by pair. This domain thus demands high-performance parallel computing if results are to be calculated rapidly.

TABLE I: The rank estimation using SCF with different subspaces and PCAF in an example

(a) A 4-dimensional case with one query feature, q, and four reference features, A, B, C, and D. RED and the exact rank of the reference features are listed.

         e1  e2  e3  e4    RED   Rank
    q     1   1   1   1      -      -
    A     1   1  14  15  19.10      4
    B     2   3   7  11  11.87      2
    C     4   5   5   5   7.55      1
    D     5   6  12  10  14.90      3

(b) EED and rank using SCF with two different formations of subspaces: the sequential manner ([e1 e2][e3 e4]) and the interleaved manner ([e1 e3][e2 e4]).

        [e1 e2][e3 e4]    [e1 e3][e2 e4]
         EED    Rank       EED    Rank
    A   16.66     3       17.12     4
    B    8.67     1       13.19     2
    C   10.32     2        9.57     1
    D   17.57     4       14.51     3

(c) EED and rank using PCAF with only 1 principal component for projection. p1 lists the projections of each feature.

           p1     EED   Rank
    q  -10.77       -      -
    A    7.29   18.06      4
    B   -0.73   10.04      2
    C   -7.05    3.72      1
    D    0.48   11.25      3
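To make the RED column in Table Ia concrete, the entry for the nearest reference feature C is simply the Euclidean distance between q = (1, 1, 1, 1) and C = (4, 5, 5, 5):

    RED(q, C) = √((4−1)² + (5−1)² + (5−1)² + (5−1)²) = √57 ≈ 7.55,

which is the smallest of the four RED values, so C is the exact nearest neighbour of q.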

We use the aforementioned example to demonstrate how SCF works. Suppose SCF divides the feature space into two subspaces in a sequential manner: the dimensions [e1, e2] form one subspace and [e3, e4] form the other. In each subspace, two clusters are formed and the centre of each cluster is used to estimate the distance between the query feature and the reference features of the cluster, as shown in Figure 1a. Based on the subspaces in Figure 1a, the Estimated Euclidean Distance (EED), which is calculated by summing up the distances from the query to each group centre, is shown in Table Ib (for more details about how the EED is calculated refer to [34]). From the table, we know the rank of the reference features is B, C, A, D according to the EED. However, according to the RED in Table Ia, the rank is C, B, D, A. Based on the estimation of SCF, if only one nearest neighbour is requested in this case, the precision of the 1-NN result using SCF is 0—the matched feature is not actually the closest.

However, if we change the partitioning of dimensions for the subspaces as in Figure 1b, SCF can produce the correct 1-NN result. In the figure, we instead choose [e1, e3] to form the first subspace and [e2, e4] to form the second subspace. Likewise we create two clusters in each subspace. According to the EED calculated by SCF, the rank of the reference features is C, B, D, A, which is exactly the same as their rank in RED. Therefore, the precision of k-NN for each k ∈ 1..4 using SCF would be 100%.

Fig. 1: Using SCF to divide space into two subspaces with two groups, each formed in a different way. (a) Forming subspaces in a sequential manner, so that features in subspace [e1 e2] are divided into 2 groups with centres g11 and g12, while features in subspace [e3 e4] are divided into 2 groups with centres g21 and g22. (b) Forming subspaces in an interleaved manner, so that features in subspace [e1 e3] are divided into 2 groups with centres g11 and g12, while features in subspace [e2 e4] are divided into 2 groups with centres g21 and g22.

From the above example, we can see that the precision of SCF is seriously affected by how the subspaces are formed. Forming the best subspaces to achieve the highest precision is data dependent, and difficult to determine in SCF [10], [34]—it depends on the nature of the reference features, which are different in different applications.

In the following section, we propose a PCA based filtering method, PCAF, for AkNN search. It has the same advantages as other methods using the data filtering strategy: high scalability, a small memory footprint, and reduced computational overhead. More importantly, compared with SCF, PCAF gives stable, higher precision k-NN results due to its accurate rank estimation of Euclidean distance using the principal components of the reference features.
III. PCA-BASED FILTERING (PCAF)

Principal Components Analysis (PCA) is a popular algorithm used to reduce dimensionality, and it can alleviate the curse of dimensionality in some contexts. It uses an orthogonal transformation to convert a set of data values of possibly correlated variables into a set of data values of linearly uncorrelated variables called principal components.

For a high dimensional dataset, if the correlation between the dimensions is strong, the information of the dataset can be represented by a small number of principal components, which taken together have lower dimensionality. By using these principal components, the dataset can be projected into a space that has the same dimensionality as the number of principal components while containing most of its original information.

For example, the reference features in Table Ia can be represented by the projection p1 when one principal component is used after PCA is applied to the set of four reference features, as shown in Table Ic. The space spanned by the principal components is called the projected space of PCA.

PCAF uses the projections under the principal components to estimate the rank of distances between the query feature and the reference features. Note that we use the projections to estimate the rank of distances instead of the real distances, since the distances calculated with projections are in a different, transformed low-dimensional space, not the original, high-dimensional feature space. As the principal components are the main factors deciding the relationships among the reference features, the rank estimated from the corresponding projections should be similar to the rank of distances in the original feature space, provided the dimension reduction has not lost too much information. That is, if the distance between the query feature and a reference feature is small in the original feature space, the distance of that feature from the query in the projected space of PCA should be proportionally small. From Table Ic, we can see that the rank of distances between the query feature and the reference features in the projected 1-dimensional space is C, B, D, A, which is exactly the same as the rank of those in the original feature space.
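For instance, using only the projected component p1 in Table Ic, the estimated distance from q (p1 = −10.77) to C is |−7.05 − (−10.77)| = 3.72, and to B it is |−0.73 − (−10.77)| = 10.04. These one-dimensional distances already reproduce the true order C, B, D, A.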
In many real-world domains, the feature dimensions are more or less correlated [28]. For example, SIFT features, which are popular in computer vision applications, have many dimensions that are correlated with each other [10], [38]. From our experiments, we often need only around 10 dimensions in the PCA space projected from spaces of hundreds of dimensions. This shows that we can use very low-dimensional PCA-converted features to accurately estimate the rank of distances between the query feature and the reference features. This approach saves a large amount of computation by calculating distances in a very low-dimensional, surrogate space.

In PCAF, before searching, PCA is applied to the reference features to find the principal components, which define the PCA space. That is, the reference features are projected into the surrogate PCA space. Then the query feature is projected into the same PCA space. PCAF maintains two heaps for the current k nearest neighbours. The main heap contains the current k nearest neighbours ranked by their distance in the original feature space. The assistant heap contains the current k nearest neighbours ranked by their distance in the PCA space. During searching, the distance between a reference feature and the query feature in the PCA space is calculated first. If this distance is larger than the largest distance in the assistant heap, we simply drop the reference feature as it is unlikely to be one of the k nearest neighbours. In this way, we can filter out a large number of reference features without much computational overhead. However, if the distance is smaller than the largest distance in the assistant heap, PCAF calculates the real distance between the reference feature and the query feature in the original feature space, which has much higher computational overhead. If that distance is larger than the largest distance in the main heap, we drop the reference feature; otherwise, the distance of the reference feature is inserted into the main heap and the corresponding distance in the PCA space is inserted into the assistant heap. After each reference feature is processed as described above, the k nearest neighbours are in both heaps.

In PCAF, there are two overheads related to PCA. The first is the PCA transformation applied to the reference features, which projects the reference features into the PCA space. Since the projected features will be used by a quarter of a billion queries from tens of thousands of images, this one-off overhead is negligible and, according to our experiments, equivalent to the overhead of building indices in other AkNN algorithms. Moreover, if necessary, this PCA process could be parallelised to further reduce the overhead, as many parallel PCA algorithms have been proposed already [22], [24], [43]. The second overhead is the projection of the query feature into the PCA space. It involves a multiplication between a D×d matrix and a vector of size D, where D is the dimensionality of the original feature space and d is the dimensionality of the PCA space. This one-off overhead is amortised by the distance computations against tens of thousands of reference features, as the same projected query feature is reused for each of those reference features. Our experimental results will show this one-off overhead is negligible. It is worth noting that, if necessary, this overhead could also be easily parallelised across multiple cores or with the SIMD floating point unit, e.g., SSE, though it is a relatively small overhead.

The advantages of PCAF can be summarised as follows. First, in most cases, it replaces the distance calculations between features in a high-dimensional space with calculations in a surrogate low-dimensional space, so the computational overhead is substantially reduced. Second, the memory footprint is greatly reduced as only the low-dimensional projections are accessed most of the time. This is extremely helpful for multicore systems with limited memory bandwidth. Third, the precision of the k-NN results is significantly improved compared with other AkNN algorithms due to the use of principal components for distance estimation.
IV. IMPLEMENTATION OF PCAF

The main idea of PCAF is to use the distance rank in the surrogate PCA space to filter out the reference features that are unlikely to be k-NN. In this section, we will discuss: (a) the rank estimation in the PCA space, (b) the filtering algorithm in detail, and (c) our fine-grained data parallelism in PCAF.

Note that, like other AkNN algorithms, to reduce computational overhead we use the following Squared Euclidean Distance (SED), instead of the Euclidean Distance, to measure the distance between two features fA and fB in a D dimensional space in the rest of the paper:

    SED(fA, fB) = |fA − fB|² = Σ_{i=1}^{D} (fA[i] − fB[i])²

A. Rank Estimation with PCA

In our implementation, we use Singular Value Decomposition (SVD) [11] to find the principal components, which are represented by a D × d matrix denoted as W, where D is the dimensionality of the feature space and d is the dimensionality of the PCA space, i.e., the number of principal components, which is much smaller than D. Specifically, the two-sided Jacobi R-SVD decomposition provided by the Eigen library [12] is used to ensure optimal reliability and accuracy of PCA.

With the principal components matrix W, the reference feature set R = (f1, f2, ..., fN) and the query feature q are projected into the PCA space using the following matrix-matrix and vector-matrix multiplications:

    ri = fi − fmean,   i = 1, ..., N
    Rm = (r1, r2, ..., rN)
    Rpca = Rm × W
    qpca = (q − fmean) × W

where fmean is the mean of the reference features, which is available after PCA is done. This process can be accelerated through parallel computing, since both the parallelisation of SVD and structured matrix-vector multiplication have been well studied [14], [22], [40].
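As an illustration of this preprocessing step, the following C++ sketch (assuming an Eigen-based build, with hypothetical names such as PcaIndex and buildPcaIndex; it is not the exact experimental code) centres the reference matrix, takes the leading d right singular vectors as W, and projects the references and a query into the PCA space:

```cpp
#include <Eigen/Dense>

// One-off preprocessing sketch: centre the reference features, take the
// top-d right singular vectors of the centred matrix as the principal
// components W (D x d), and project references and query into PCA space.
struct PcaIndex {
    Eigen::RowVectorXf mean;   // f_mean, 1 x D
    Eigen::MatrixXf W;         // principal components, D x d
    Eigen::MatrixXf Rpca;      // projected reference features, N x d
};

PcaIndex buildPcaIndex(const Eigen::MatrixXf& R /* N x D */, int d) {
    PcaIndex idx;
    idx.mean = R.colwise().mean();
    Eigen::MatrixXf centred = R.rowwise() - idx.mean;              // R_m
    Eigen::JacobiSVD<Eigen::MatrixXf> svd(centred, Eigen::ComputeThinV);
    idx.W = svd.matrixV().leftCols(d);                             // D x d
    idx.Rpca = centred * idx.W;                                    // N x d
    return idx;
}

// Per-query projection: q_pca = (q - f_mean) * W, a (1 x D) x (D x d) product.
Eigen::RowVectorXf projectQuery(const PcaIndex& idx, const Eigen::RowVectorXf& q) {
    return (q - idx.mean) * idx.W;
}
```

A thin SVD of the N × D centred matrix suffices here, since only the leading d columns of V are needed.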

As W and Rpca are used by all queries, this computation is one-off preprocessing. Although the complexity of this overhead is O(D³), other AkNN algorithms have similar overhead for building search indices. According to our tests, using a single core of the Intel Xeon processor in our experimental environment, this one-off time overhead of PCAF is between 0.01s and 4.5s, which can be further shortened if we only compute the necessary components during the SVD decomposition process [13], while the index building overhead of other AkNN algorithms is between 0.01s and 70.5s depending on precision¹. This overhead is negligible as it is one-off and is reused by tens of thousands of queries.

¹ See code and raw data of experimental results available on GitHub: https://github.com/c30268056/PCAF

After the projection, the distance in the PCA space between the query feature qpca and each reference feature fpca in the set Rpca is calculated. This distance is used to rank the reference features in the following filtering method.

B. Filtering Method

Algorithm 1: PCAF data filtering
  Input: R: set of reference features
  Input: Rpca: set of projected reference features
  Input: q: query feature
  Input: qpca: query projection
  Input: k: number of nearest features required
  Input: m: heap size scaling
  Output: heap: contains the k-NN results
   1: Initialise max heap heap with size k
   2: heap.max ← ∞
   3: Initialise temporary max heap heappca with size k × m
   4: heappca.max ← ∞
   5: for fpca ∈ Rpca do
   6:     distpca ← SED(qpca, fpca)
   7:     if distpca < heappca.max then
   8:         dist ← SED(q, f)
   9:         if dist < heap.max then
  10:             heappca.replace(distpca)
  11:             heap.replace(dist)
  12: return heap with the k nearest features

Algorithm 1 gives the detailed description of data filtering in PCAF. For each reference feature f, PCAF first calculates the distance between the projected query qpca and the projected reference feature fpca. If this distance distpca is smaller than the maximum distance in heappca, the distance dist between q and f is calculated. If dist is smaller than the maximum distance in heap, then the maximum in heap is replaced by dist, and distpca replaces the current maximum distance in heappca. The final k-NN results are in heap after the above process is repeated for each reference feature. Note that, in Algorithm 1, m is used to adjust the size of heappca to accommodate fluctuations in the rank estimation. A value of 2 is enough for most cases, and it incurs little overhead.
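To make the control flow of Algorithm 1 concrete, here is a small C++ sketch of the two-heap filtering loop for one query. It is an illustrative reading of the algorithm under assumed types (features as std::vector<float>, heaps as std::priority_queue), with hypothetical names such as pcafFilter; it is not the exact experimental code.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Feature = std::vector<float>;
using Candidate = std::pair<float, int>;   // (squared distance, reference index)

// Squared Euclidean Distance (SED) between two feature vectors.
static float sed(const Feature& a, const Feature& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float diff = a[i] - b[i];
        s += diff * diff;
    }
    return s;
}

// Two-heap filtering for one query (cf. Algorithm 1).
// R / Rpca: reference features and their PCA projections (same order);
// q / qpca: the query feature and its projection; m: heap size scaling.
std::vector<Candidate> pcafFilter(const std::vector<Feature>& R,
                                  const std::vector<Feature>& Rpca,
                                  const Feature& q, const Feature& qpca,
                                  std::size_t k, std::size_t m) {
    std::priority_queue<Candidate> heap;      // k best exact distances (max-heap)
    std::priority_queue<float> heapPca;       // k*m best projected distances
    for (std::size_t i = 0; i < R.size(); ++i) {
        float distPca = sed(qpca, Rpca[i]);
        // Filtering step: only compute the exact distance if the projected
        // distance is competitive with the current assistant-heap maximum.
        if (heapPca.size() < k * m || distPca < heapPca.top()) {
            float dist = sed(q, R[i]);
            if (heap.size() < k || dist < heap.top().first) {
                if (heapPca.size() == k * m) heapPca.pop();
                heapPca.push(distPca);
                if (heap.size() == k) heap.pop();
                heap.emplace(dist, static_cast<int>(i));
            }
        }
    }
    std::vector<Candidate> result;            // k nearest (distance, index) pairs
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;
}
```

Here heapPca is allowed to hold k × m entries, mirroring the heap size scaling m in Algorithm 1 that absorbs fluctuations in the estimated rank.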

C. Data parallelism

PCAF is particularly suitable for parallel implementation since there is no dependence in the search for the k-NN within each query. This is different from other AkNN algorithms, which require sequential execution when the same query is retrieving the index of reference features.

In PCAF, the reference features in the PCA space are divided into S subsets which are searched in parallel by threads using the same query, as shown in Figure 2. In the figure, each parallel processable task works on a subset of reference features and maintains its own heaps. As the size of the heaps is very small (2 for most image processing applications), the extra space overhead has no noticeable impact on the performance, while the fine-grained parallelism supports the high performance of PCAF.

Fig. 2: Fine-grained parallel implementation based on data partition. Each task searches a set of N/S features and maintains its own k-NN results in a heap.

After the k-NN results are obtained for each subset, the final k-NN results are computed using a simple selection algorithm among those results; a k sized max heap structure is used for accumulation in our experiments.
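As a rough sketch of this partition-and-merge scheme (not the authors' implementation), the following self-contained C++/OpenMP fragment splits the reference set into S subsets, searches each in parallel for one query, and merges the per-subset candidates with a simple k-selection. For brevity the per-subset search is plain SED; in PCAF it would be the filtering loop of Algorithm 1 applied to that subset's projections.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Feature = std::vector<float>;
using Candidate = std::pair<float, int>;   // (squared distance, reference index)

static float sed(const Feature& a, const Feature& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Each OpenMP task searches its own subset of the references and keeps its
// own candidate list; the partial results are merged by a final k-selection.
std::vector<Candidate> parallelKnn(const std::vector<Feature>& refs,
                                   const Feature& q, std::size_t k, int S) {
    std::vector<std::vector<Candidate>> partial(S);
    #pragma omp parallel for
    for (int s = 0; s < S; ++s) {
        std::size_t begin = refs.size() * s / S;
        std::size_t end   = refs.size() * (s + 1) / S;
        std::vector<Candidate>& mine = partial[s];
        for (std::size_t i = begin; i < end; ++i)
            mine.emplace_back(sed(q, refs[i]), static_cast<int>(i));
        std::partial_sort(mine.begin(),
                          mine.begin() + std::min(mine.size(), k), mine.end());
        mine.resize(std::min(mine.size(), k));       // keep the k best per subset
    }
    std::vector<Candidate> all;                       // final k-selection
    for (auto& c : partial) all.insert(all.end(), c.begin(), c.end());
    std::partial_sort(all.begin(), all.begin() + std::min(all.size(), k), all.end());
    all.resize(std::min(all.size(), k));
    return all;
}
```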

D. Time and Space Complexity

We now analyse the complexity of our algorithm. First, PCAF takes little space to store the D × d transformation matrix W and the D sized fmean vector. During runtime, it costs O(Nd) space to store the projections of the reference features, and O(d) for each query projection. As the magnitude of N dominates the other parameters, the space complexity of PCAF can be simplified to O(N).

PCAF saves time by reducing the number of SED calculations. Supposing the time for computing the SED against all reference features for one query is denoted as TD, and the filtering rate of excluded reference features is represented as FR, then theoretically PCAF takes (1 − FR) × TD + (d/D) × TD time for SED computation. According to our experimental results, d is much smaller than D, and FR reaches above 95% for most cases. Besides, the extra cost for the query projection during searching is only O(dD). Thus, PCAF is very efficient because it avoids many unnecessary distance computations.
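As a hypothetical illustration of this formula (the numbers below are chosen only to be representative of the ranges stated above, not measured values): with a filtering rate FR = 95% and a projection of d = 10 dimensions from a D = 500 dimensional feature space, the SED cost per query becomes

    (1 − 0.95) × TD + (10/500) × TD = 0.05 TD + 0.02 TD = 0.07 TD,

i.e., roughly a 14× reduction in distance computation compared with the brute-force cost TD.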
We will illustrate a detailed performance comparison with other k-NN algorithms in Section V. Also note that by adjusting d, m (the heap scale) and S (the number of data subsets) in PCAF, the performance and precision change. Usually when d, m and S increase, the rank estimation accuracy improves at a higher cost in time.

V. EVALUATION

In this section, we evaluate the performance of our method against a brute-force k-NN algorithm, two state-of-the-art data selection AkNN algorithms and a data filtering AkNN algorithm on six real-world and synthetic datasets. The performance improvement that they attain on two multicore platforms is analysed.

A. Experimental Setup

1) Multicore platforms: Two multicore platforms are used in our evaluations:
a) Intel16: Intel(R) Xeon(R) CPU E5-2665, 8 cores × 2 sockets @ 2.40GHz, 20 MiB L3 shared cache, 128GiB DDR3 (1600 MHz) memory;
b) AMD64: AMD Opteron Processor 6276, 16 cores × 4 sockets @ 2.3 GHz, 16 MiB L3 shared cache, 512GiB DDR3 (1600 MHz) memory.
Intel16 and AMD64 are two typical multicore platforms. The icc-14.0 compiler is used on the Intel16 platform while gcc-4.8 is used on the AMD64 platform.

2) Algorithms: We compare our proposed algorithm with the four algorithms listed below:
a) Brute-force (BF): searches k-NN exhaustively in the whole database and gives the exact k-NN results.
b) Randomized kd-Trees (RKD): an efficient variant of the popular kd-tree [29] algorithm. Multiple trees are built as its index structure. During searching, it traverses these kd-trees and puts promising candidate nodes in a queue for distance calculation in the next step. The k-NN results found within these nodes are considered to be the approximate k-NN results for the dataset.
c) Random Ball Cover (RBC): the state-of-the-art scalable k-NN algorithm on multicore platforms [7]. First, it randomly chooses several representative features to represent a number of reference subsets with size s, each of which contains the s reference features that are nearest to one representative. When searching for approximate k-NN results, it finds the nearest representatives and then searches for the k-NN within the subsets of those representatives using BF.
d) Subspace Clustering for Filtering (SCF): the latest algorithm implementing the data filtering strategy for the k-NN search problem [34], as discussed before.
We implement BF and RKD by using the FLANN library [30], which provides fast approximate k-NN search functionality for computer vision related tasks. The implementations of RBC and SCF are taken from the open sourced code mentioned in the previous work. The parallelisation is carried out using OpenMP.
3) Datasets: The datasets listed in Table II are used to evaluate the performance of the aforementioned algorithms. Though practical query datasets are much larger, we use small query datasets to save the time of experimental runs, which does not affect the validity of our results. As a matter of fact, the benefit of PCAF will be amplified using larger query datasets since PCAF reduces the computational and memory costs for each query.

TABLE II: Overview of test datasets

    Name     Size (MiB)  Dimensionality  Reference features  Query features
    Digits      0.93           64               3823              1797
    Random     12.21          128              25000              7500
    SIFT       27.38          128              56074              9929
    Madelon     3.81          500               2000              1800
    GIST        5.86          512               3000              1000
    HAR        15.73          561               7352              2947

a) "Digits" is the smallest dataset in both size and dimensionality. It is a real-world dataset [3] that contains handwritten digits that have been size-normalised and centred into a fixed 8 × 8 image.
b) "Random" is a synthetic dataset, which contains features that are chosen from a uniform distribution.
c) "Madelon" is an artificial dataset [15] containing data points grouped in 32 clusters placed on the vertices of a five dimensional hypercube.
d) "SIFT" is the largest dataset and is generated by extracting SIFT descriptors [25] from real images. SIFT is one of the most widely-used image feature detectors in computer vision [20].
e) "GIST" contains features collected from 4000 images by using GIST descriptors [33]. GIST summarises the gradient information of different parts of an image and converts it into one high dimensional feature. It is also a frequently used algorithm in computer vision.
f) "HAR" [2] has the highest dimensionality of the datasets that we use. It contains sensor readings related to 30 real-world subjects performing activities of daily living, recorded by the sensors embedded in a waist-mounted smart phone.

4) Parameter Settings: In order to find the performance bounds of each algorithm, we conduct as many scenarios as possible by adjusting parameters. For simplicity, we denote by a parameter X ∈ [start..end, ∆step] that the value of X is set from start to end with an increment of step.

The RKD algorithm has two important parameters. The number of trees to build for indices greatly affects both indexing time and searching precision. The checks parameter represents the number of neighbouring buckets that should be checked during the search. As these parameters are increased, search time increases, but the results will be more precise [29]. We set trees to 4, 8, 16, 32, 64 or 128 and tested checks ∈ [128..5120, ∆128].

The number of randomly chosen representatives in RBC positively affects the precision and negatively affects the searching time [7]. Suppose n = ⌊√N⌋ where N is the number of reference features; then the number of representatives is chosen from [n..N, ∆n].

For SCF, the numbers of subspaces and clusters are important parameters [34] and are set within [4..64, ∆4] and [8..32, ∆8] respectively.

The bounds of d in PCAF are decided by the percentage of information/variance in the reference dataset that is retained in the PCA space. The lower and upper bounds of d are set using the cases when 50% and 90% of the variance is retained, respectively. The heap scale m is set from 1 to 5, while the reference features are partitioned into either 1, 2, 4, 8, 16 or 32 parts.
5) Evaluation Metrics: We consider three common metrics to evaluate the parallel performance, and we also use two more metrics to evaluate the filtering effectiveness of each of the data filtering algorithms:
a) Time: the total time used for searching.
b) Improvement: defined as the searching time of the exact (BF) solution divided by the searching time after applying an AkNN algorithm.
c) Speedup: defined as the sequential searching time of an algorithm divided by its parallel searching time. It is used to measure the scalability of each algorithm.
d) Precision: defined as the percentage of k-NN results that are correctly found.
e) Filtering Rate: the percentage of reference features that are excluded. It is used to help measure the effectiveness of the filtering method.

B. Comparison with Brute-force

Brute-force (BF) is the traditional and straightforward implementation of a k-NN algorithm that provides exact results. PCAF shows a great enhancement over BF in both speed improvement and scalability when exact results (100% precision) are retrieved.

1) Improvement: In this section, we evaluate the parallel performance improvement when using all available cores. The improvements that PCAF can achieve over BF on the various datasets are listed in Table III.

TABLE III: Performance improvement compared with BF runtime after applying PCAF to each dataset on each platform using all available cores without losing precision

    Platform  Digits  Random  SIFT   Madelon  GIST  HAR
    Intel16    2.47    1.07    3.66    1.99   5.85  2.25
    AMD64      2.54    3.51   10.29    1.95   6.76  7.95

TABLE IV: Filtering Rate (FR) and dimensionality in PCA space (d) for PCAF to produce exact k-NN results for each dataset

            Digits  Random  SIFT   Madelon  GIST   HAR
    FR (%)   95.27   94.70  98.60   87.53   96.81  62.93
    d            5      90     15      53       8     24

The improvement mainly comes from the saved distance computation. The distance computation complexity of BF is O(N × D); in PCAF, however, it is O((1 − FR) × N × D) + O(N × d), where FR is the filtering rate of reference features excluded from distance computation, and d is the dimensionality in the PCA space. From Table IV, we can observe a high FR and a small d for most of the datasets, which results in a large amount of computation reduction by PCAF.

The improvement varies between datasets because the amount of computation reduction varies. Take the datasets "SIFT" and "Random", which have the same dimensionality of 128, as an example. The dimensionality is greatly reduced for "SIFT": a 15-dimensional PCA space is enough to represent most of the information from the original 128-dimensional reference features. However, for the "Random" dataset, 90 principal components have to be used for the projections. As the filtering rates of "SIFT" (reaching 98.60%) and "Random" (94.70%) are close to each other, the searching time is highly affected by the cost of the rank estimation, which is based on the SED distances of the projections.

2) Scalability: PCAF has outstanding scalability for very high-precision k-NN search on multicore platforms. Compared with BF, PCAF frequently accesses only d/D of the memory during computation, which requires significantly fewer memory accesses—d is much smaller than D. Due to space restrictions we only use the largest dataset ("SIFT") as the example here to compare the scalability with BF for highly precise (in this case exact) k-NN searching.

As shown in Figure 3a, both PCAF and BF have considerable scalability on the Intel platform. Note that, from examining hardware performance monitoring counters, PCAF has a 20% reduction in L2 cache misses compared to BF, but the scalability improvement due to fewer memory accesses is minor. This is because the Intel platform has a high memory bandwidth, so that the memory wall problem is not obvious. However, as the trend of parallel computing development engages more and more cores, the memory bandwidth will eventually become an unavoidable issue.

Fig. 3: Scalability of PCAF and BF for the "SIFT" dataset to find exact k-NN results on the (a) Intel16 and (b) AMD64 platforms.

Figure 3b shows the scalability on the AMD platform, where PCAF provides significant scalability while the curve of BF is quite flat. This is because, unlike the Intel platform, the 64-core AMD platform suffers from serious memory latency issues. The statistics from the performance monitoring counters shown in Figure 4 help explain the results: the improvement in retired Instructions Per Cycle (IPC), which indicates the computational efficiency of the algorithm, is highly related to the reduction in last-level cache Misses Per (1000) Instructions (MPKI). That is, when MPKI decreases significantly, IPC increases accordingly, which leads to better scalability. Since the MPKI of PCAF is really small in all the cases, PCAF can be seen to be cache-friendly, and thus achieves substantial scalability.

Fig. 4: Performance monitoring counter statistics (IPC and MPKI for PCAF and BF on each dataset) on the AMD64 platform.

C. Compared with Data Selection Algorithms

Randomized kd-Trees (RKD) and Random Ball Cover (RBC) are two typical data selection algorithms for AkNN search. RKD is highly efficient but its tree-based index structure becomes a barrier to achieving high parallel performance. RBC was developed as a state-of-the-art k-NN algorithm for multicore platforms; however, it involves a large amount of unnecessary distance computation. Compared with RKD and RBC where approximate k-NN results are required, PCAF produces higher precision results within shorter searching times, and shows large improvements in scalability.

1) Improvement: Figure 5 shows the searching time of each algorithm on the Intel16 platform when using all available cores for the different datasets, for precision above 90%. We also mark the searching time of BF in the figure for reference. Due to space constraints, we do not present the AMD64 results as they show the same general pattern as the Intel16 platform.

Fig. 5: Searching time under different precision (> 90%) for each dataset on the Intel16 platform using all available cores: (a) Digits, (b) Random, (c) SIFT, (d) Madelon, (e) GIST, (f) HAR. Note that the black line shows the time cost for BF. The legend in (a) also applies to (b)–(f).

As we can see from Figure 5, PCAF is the only method that can provide exact results for every dataset. It is also the quickest to produce high precision k-NN results. As the majority of the unnecessary distance computation between original features is filtered out, the advantage becomes more obvious as the dimensionality of the dataset increases.

RKD produces k-NN results with good precision in a very short time only for lower-dimensional datasets such as "Digits". This is because RKD divides the reference space into bins along the axes of the dimensions, which makes it efficient for searching in low dimensional space [29]. When it comes to high-dimensional space where the curse of dimensionality arises, RKD needs to spend much more time on visiting many more branches to achieve high precision [31]. Moreover, the precision of the k-NN results found by RKD converges at a certain precision (< 100%).

RBC apparently performs better than RKD in most of the cases, but it still produces high precision results more slowly than PCAF. The only exceptions to this occur for lower precision (< 99%) k-NN results on the "Digits" and "SIFT" datasets, where RBC is slightly faster than PCAF. However, RBC cannot sustain results at high precision: above 99.55% ("Digits") and 99.99% ("SIFT") respectively, RBC takes 5.18× and 2.91× the searching time of PCAF.

The improvement mainly comes from computation being avoided in PCAF, as shown in Figure 6a. Based on the performance monitoring counters, in the figure, the ratio of total retired instructions compared with BF is lowest for PCAF. The large proportion of computation avoided, compared to RKD and RBC, brings PCAF the substantial improvement observed in the experiment. Note that the reason why the number of retired instructions in RBC is more than that in BF is that RBC requires redundant distance computation to achieve high precision.

Fig. 6: Ratio of performance monitoring counters compared with BF for each algorithm on AMD64: (a) computation reduction (retired instructions ratio), (b) memory reduction (L3 cache misses ratio). Bars below 1.0 represent desirable reductions; bars above 1.0 show undesired overhead.

2) Scalability: In this section, we evaluate the scalability of our algorithm compared with RKD and RBC. Given the space available, we only choose the "Madelon", "SIFT" and "HAR" datasets as examples. As similar characteristics and scalability patterns are shared across all datasets, these three are sufficient to be representative. "Madelon" is quite a small dataset, as are "Digits" and "GIST". "HAR" is much larger than "Madelon", and its dimensionality is the highest of all the datasets. "SIFT" is the biggest dataset in our experiments, but its dimensionality is only a quarter of that of "HAR", and "Random" is similar. To consider the scalability of each algorithm under a reasonable search precision, we plot the average speedup of the selected cases that produce results that are more than 95% accurate. We also exclude results that have longer search times than brute force.

The difference in scalability between the algorithms running on the Intel platform is not significant, as shown in Figure 7. But we can still observe a clearly inferior ranking for RBC and RKD, especially when the dataset becomes larger. (The reason has been explained in Section V-B2.)

Fig. 7: Scalability of each algorithm on the Intel16 platform, for precision above 95%: (a) Madelon, (b) SIFT, (c) HAR. The y-axis represents the speedup over the sequential algorithm.

Unlike the Intel platform, the AMD platform showed significant variance within repeated runs of our experiments. We thus add the maximum speedup that we observed in each case as a dotted line on the graphs in Figure 8. As "Madelon" is small enough to fit in the last-level cache, similarly to the Intel platform, the scalability difference is not obvious. However, in the cases of "SIFT" and "HAR", a disparity is seen. In all cases, PCAF reaches the best maximum scalability, and provides the best average speedup.

Fig. 8: Scalability of each algorithm on the AMD64 platform, for results with a precision above 95%: (a) Madelon, (b) SIFT, (c) HAR. The y-axis represents the speedup over the sequential algorithm. The dotted lines represent the maximum speedup that was observed.

The last-level cache misses recorded in the performance monitoring counters indicate the size of the memory footprint. From the ratio of misses compared with BF shown in Figure 6b, PCAF can be seen to gain memory performance in almost all cases except "Digits", which is a small dataset. For very small datasets like "Digits", the benefit of the reduced memory footprint is overshadowed by the non-contiguous, interleaved memory accesses to features and PCA-projected features, though the benefit of reduced computation still dominates the overall performance of PCAF.

D. Compared with Data Filtering Algorithms

As far as we know, SCF is the only other published approach to AkNN searching that adopts a data filtering strategy. Compared to SCF, PCAF produces stable and higher precision results with less searching time, while exhibiting similar scalability. In this section, we evaluate both the parallel performance and the effectiveness of the filtering.

The frequently accessed index in SCF has an O(NS) space complexity, where S is the number of subspaces (usually S < 64) (for a detailed analysis refer to [34]), while PCAF spends O(Nd) to store the counterpart, the projected reference set denoted as Rpca. Supposing each element in Rpca is a 32-bit floating point number in PCAF, while the index for each feature in SCF requires one byte of storage, then in the cases when d < S/4, PCAF will have a smaller memory footprint. As the dimensionality of the projections in the PCA space depends on the properties of the dataset, the d in our implementation for "Random" and "Madelon" is quite large (when 50% of the variance is retained, d is 61 and 53 respectively), while for the other datasets, d is relatively small, less than the dimensionality needed in the case of exact searching shown in Table IV. Still, the difference between d and S/4 is not significant, which results in comparable scalability. The speedup curves shown in Figure 9 support our analysis. As we can see, PCAF is only slightly better in scalability compared with SCF.

Fig. 9: Scalability of SCF and PCAF for the "SIFT" dataset on the (a) Intel16 and (b) AMD64 platforms.

However, while achieving comparable performance in terms of scalability, PCAF greatly improves the speed of high precision AkNN searching. This is because the filtering strategy of PCAF is more effective than SCF's, so that PCAF has a higher FR in most cases, and thus avoids computation more effectively. Likewise, when achieving the same filtering rate, PCAF produces more precise results. For a data filtering algorithm, the k-NN results found come from the non-filtered features; thus we compare the filtering effectiveness by plotting the precision achieved under the same filtering level, shown in Figure 10. Due to limited space, we only present the high FR levels in the ranges (95, 97] and (97, 100], which are also the key regions of concern for data filtering algorithms. Also note that the blank entry for "Madelon" under SCF is because the highest FR that SCF can reach for "Madelon" is only 82.74%. As we can see, PCAF produces nearly 100% precise results when FR > 95, which is higher than SCF on all datasets. Thus there is a higher probability of the PCAF filtering method discovering the true k-NN features during searching, while false filtering occurs in SCF.

Fig. 10: Precision of PCAF and SCF for the same filtering level: (a) FR between (95, 97], (b) FR between (97, 100].
VI. RELATED WORK

The focus of our work is speeding up high dimensional, high precision k-NN search on multicore architectures. We are not aware of many similar efforts to optimise AkNN algorithms for performance and precision in this way. Still, the idea behind our method stems from related research.

The algorithm previously proposed by Tang [34], [35] is the key work defining the notion of data filtering for AkNN. The rationale is to use a low-cost method to filter away many expensive distance computations, and to maintain a small memory footprint at the same time to achieve high parallel performance. We adopt the notion of data filtering in our work. However, the precision of Tang's SCF algorithm is not stable, and is highly dependent on the nature of the dataset used.

Many real-world high dimensional datasets are actually governed by a small number of dominant dimensions, as discussed in [10], which inspires our data filtering method using lower dimensional projections. The idea of using the low-dimensional intrinsic structure in a high dimensional space is popular in computer vision [4], [16], [27]. Many solutions proposed for k-NN search in computer vision use dimensionality reduction techniques to either form lower dimensional feature descriptors [19] or sort features in advance [39]. However, these techniques either reduce the dimensionality of the feature space directly, which results in a loss of feature information, or incur extra overhead, e.g., sorting. In PCAF we use the low-dimensional projections only for the filtering process, without changing the original feature space used for searching, which retains the high precision of the k-NN search and reduces computation.

VII. CONCLUSIONS

Data selection AkNN algorithms have serious memory bottlenecks on multicore systems. Using a data filtering strategy that reduces computation and memory footprint can improve scalability. However, it can be challenging to maintain the accuracy of the result set with reduced computation. In this paper, a novel filtering method, PCAF, is proposed to exclude unlikely k-NN features with an accurate estimation of similarity in high-dimensional space. Experimental results show that PCAF achieves substantial speedups and good scalability on multicore platforms compared with many data selection AkNN algorithms, and provides higher precision than an existing data filtering algorithm.

VIII. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their valuable comments. Huan Feng would like to thank the University of Otago and the China Scholarship Council for offering and sponsoring her internship during the course of this research. We are grateful to Xiaoxin Tang for his generous provision of source code and support for the evaluation of the SCF algorithm. This work is supported by the Natural Science Foundation of China (61433008, 61373145, 61170210, U1435216), the National Basic Research (973) Program of China (2014CB340402), the National High-Tech R&D (863) Program of China (2013AA01A213), and the Chinese Special Project of Science and Technology (2013zx01039-002-002).

REFERENCES

[1] M. Al Hasan, H. Yildirim, and A. Chakraborty. SONNET: Efficient approximate nearest neighbor using multi-core. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 719–724. IEEE, 2010.
[2] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In Ambient Assisted Living and Home Care, pages 216–223. Springer, 2012.
[3] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[4] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In Database Theory—ICDT'99, pages 217–235. Springer, 1999.
[5] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, pages 97–104. ACM, 2006.
[6] J. Buhler. Provably sensitive indexing strategies for biosequence similarity search. Journal of Computational Biology, 10(3-4):399–417, 2003.
[7] L. Cayton. Accelerating nearest neighbor search on manycore systems. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 402–413. IEEE, 2012.
[8] G. Cong and K. Makarychev. Optimizing large-scale graph analysis on multi-threaded, multicore platforms. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 414–425. IEEE, 2012.
[9] D. L. Donoho et al. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, pages 1–32, 2000.
[10] H. Feng, S. Mills, D. Eyers, X. Shen, and Z. Huang. Optimal space subdivision for parallel approximate nearest neighbour determination. In Proceedings of the 30th International Conference on Image and Vision Computing New Zealand. IEEE, 2015.
[11] G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. Numerische Mathematik, 14(5):403–420, 1970.
[12] G. Guennebaud, B. Jacob, et al. Eigen: a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms, 2012.
[13] G. Guennebaud, B. Jacob, et al. Eigen v3.2.8 JacobiSVD class template reference. http://eigen.tuxfamily.org/dox/classEigen_1_1JacobiSVD.html/, 2016.
[14] J. A. Gunnels, G. M. Henry, and R. A. Van De Geijn. A family of high-performance matrix multiplication algorithms. In Computational Science—ICCS 2001, pages 51–60. Springer, 2001.
[15] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545–552, 2004.
[16] G. Hua, M. Brown, and S. Winder. Discriminant embedding for local image descriptors. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[17] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(1):117–128, 2011.
[18] I. T. Jolliffe. Principal Components Analysis, Second Edition. Springer, 2002.
[19] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–506. IEEE, 2004.
[20] N. Khan, B. McCane, and S. Mills. Better than SIFT? Machine Vision and Applications, 26(6):819–836, 2015.
[21] J. M. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, pages 599–608. ACM, 1997.
[22] S. Lahabar and P. Narayanan. Singular value decomposition on GPU using CUDA. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–10. IEEE, 2009.
[23] T. Liu, A. W. Moore, K. Yang, and A. G. Gray. An investigation of practical approximate nearest neighbor algorithms. In Advances in Neural Information Processing Systems, pages 825–832, 2004.
[24] W. Liu, H. Zhang, D. Tao, Y. Wang, and K. Lu. Large-scale paralleled sparse principal component analysis. Multimedia Tools and Applications, pages 1–13, 2014.
[25] D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999.
[26] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. Oakland, CA, USA, 1967.
[27] K. Mikolajczyk and J. Matas. Improving descriptors for fast tree matching by optimal linear projection. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[28] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(10):1615–1630, 2005.
[29] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2:331–340, 2009.
[30] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(11):2227–2240, 2014.
[31] S. A. Nene and S. K. Nayar. A simple algorithm for nearest neighbor search in high dimensions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 19(9):989–1003, 1997.
[32] S. O'Hara and B. A. Draper. Are you using the right approximate nearest neighbor algorithm? In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 9–14. IEEE, 2013.
[33] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[34] X. Tang, Z. Huang, D. Eyers, S. Mills, and M. Guo. Scalable multicore k-NN search via Subspace Clustering for Filtering. Parallel and Distributed Systems, IEEE Transactions on, 26(12):3449–3460, 2015.
[35] X. Tang, S. Mills, D. Eyers, K.-C. Leung, Z. Huang, and M. Guo. Data filtering for scalable high-dimensional k-NN search on multicore systems. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC 2014, pages 305–310. ACM, 2014.
[36] Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 563–576. ACM, 2009.
[37] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11):1958–1970, 2008.
[38] G. Treen and A. Whitehead. Efficient SIFT matching from keypoint descriptor properties. In Applications of Computer Vision (WACV), 2009 Workshop on, pages 1–7. IEEE, 2009.
[39] G. Treen and A. Whitehead. A PCA-based binning approach for matching to large SIFT database. In Computer and Robot Vision (CRV), 2010 Canadian Conference on, pages 9–16. IEEE, 2010.
[40] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix–vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009.
[41] Z. Wu, Q. Ke, M. Isard, and J. Sun. Bundling features for large scale partial-duplicate web image search. In Computer Vision and Pattern Recognition, 2009. CVPR 2009, pages 25–32. IEEE, 2009.
[42] C. Zanchettin, B. L. D. Bezerra, and W. W. Azevedo. A KNN-SVM hybrid model for cursive handwriting recognition. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1–8. IEEE, 2012.
[43] J. Zhang and K. H. Lim. Implementation of a covariance-based principal component analysis algorithm for hyperspectral imaging applications with multi-threading in both CPU and GPU. In 2012 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 4264–4266. IEEE, 2012.