IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 37, NO. 1, JANUARY 2015, p. 136

Rank-Based Similarity Search: Reducing the Dimensional Dependence
Michael E. Houle and Michael Nett
Abstract—This paper introduces a data structure for k-NN search, the Rank Cover Tree (RCT), whose pruning tests rely solely on the comparison of similarity values; other properties of the underlying space, such as the triangle inequality, are not employed. Objects are selected according to their ranks with respect to the query object, allowing much tighter control on the overall execution costs. A formal theoretical analysis shows that with very high probability, the RCT returns a correct query result in time that depends very competitively on a measure of the intrinsic dimensionality of the data set. The experimental results for the RCT show that non-metric pruning strategies for similarity search can be practical even when the representational dimension of the data is extremely high. They also show that the RCT is capable of meeting or exceeding the level of performance of state-of-the-art methods that make use of metric pruning or other selection tests involving numerical constraints on distance values.
Index Terms—Nearest neighbor search, intrinsic dimensionality, rank-based search
M. E. Houle is with the National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan. E-mail: [email protected].
M. Nett is with Google Japan, 6-10-1 Roppongi, Minato-ku, Tokyo 106-6126, Japan. E-mail: [email protected].
Manuscript received 13 Sept. 2012; revised 14 Apr. 2014; accepted 27 May 2014. Date of publication 25 July 2014; date of current version 5 Dec. 2014. Recommended for acceptance by G. Lanckriet. Digital Object Identifier no. 10.1109/TPAMI.2014.2343223.

1 INTRODUCTION

Of the fundamental operations employed in data mining tasks such as classification, cluster analysis, and anomaly detection, perhaps the most widely-encountered is that of similarity search. Similarity search is the foundation of k-nearest-neighbor (k-NN) classification, which often produces competitively-low error rates in practice, particularly when the number of classes is large [26]. The error rate of nearest-neighbor classification has been shown to be 'asymptotically optimal' as the training set size increases [14], [47]. For clustering, many of the most effective and popular strategies require the determination of neighbor sets based at a substantial proportion of the data set objects [26]: examples include hierarchical (agglomerative) methods such as ROCK [22] and CURE [23]; density-based methods such as DBSCAN [17], OPTICS [3], and SNN [16]; and non-agglomerative shared-neighbor clustering [27]. Content-based filtering methods for recommender systems [45] and anomaly detection methods [11] commonly make use of k-NN techniques, either through the direct use of k-NN search, or by means of k-NN cluster analysis. A very popular density-based measure, the Local Outlier Factor (LOF), relies heavily on k-NN set computation to determine the relative density of the data in the vicinity of the test point [8].

For data mining applications based on similarity search, data objects are typically modeled as feature vectors of attributes for which some measure of similarity is defined. Often, the data can be modeled as a subset S ⊆ U belonging to a metric space M = (U, d) over some domain U, with distance measure d : U × U → R⁺ satisfying the metric postulates. Given a query point q ∈ U, similarity queries over S are of two general types:

- k-nearest neighbor queries report a set U ⊆ S of size k whose elements satisfy d(q, u) ≤ d(q, v) for all u ∈ U and v ∈ S \ U.
- Given a real value r ≥ 0, range queries report the set {v ∈ S | d(q, v) ≤ r}.

While a k-NN query result is not necessarily unique, the range query result clearly is.

Motivated at least in part by the impact of similarity search on problems in data mining, machine learning, pattern recognition, and statistics, the design and analysis of scalable and effective similarity search structures has been the subject of intensive research for many decades. Until relatively recently, most data structures for similarity search targeted low-dimensional real vector space representations and the euclidean or other L_p distance metrics [44]. However, many public and commercial data sets available today are more naturally represented as vectors spanning many hundreds or thousands of feature attributes, that can be real or integer-valued, ordinal or categorical, or even a mixture of these types. This has spurred the development of search structures for more general metric spaces, such as the Multi-Vantage-Point Tree [7], the Geometric Near-neighbor Access Tree (GNAT) [9], the Spatial Approximation Tree (SAT) [40], the M-tree [13], and (more recently) the Cover Tree (CT) [6].

Despite their various advantages, spatial and metric search structures are both limited by an effect often referred to as the curse of dimensionality. One way in which the curse may manifest itself is in a tendency of distances to concentrate strongly around their mean values as the dimension increases. Consequently, most pairwise distances become difficult to distinguish, and the triangle inequality can no longer be effectively used to eliminate candidates from consideration along search paths. Evidence suggests that when the representational dimension of feature vectors is high
(roughly 20 or more [5], [49]), traditional similarity search accesses an unacceptably-high proportion of the data elements, unless the underlying data distribution has special properties [6], [12], [41]. Even though the local neighborhood information employed by data mining applications is both useful and meaningful, high data dimensionality tends to make this local information very expensive to obtain.

The performance of similarity search indices depends crucially on the way in which they use similarity information for the identification and selection of objects relevant to the query. Virtually all existing indices make use of numerical constraints for pruning and selection. Such constraints include the triangle inequality (a linear constraint on three distance values), other bounding surfaces defined in terms of distance (such as hypercubes or hyperspheres) [25], [32], range queries involving approximation factors as in Locality-Sensitive Hashing (LSH) [19], [30], or absolute quantities as additive distance terms [6]. One serious drawback of operations based on numerical constraints such as the triangle inequality or distance ranges is that the number of objects actually examined can be highly variable, so much so that the overall execution time cannot be easily predicted.

In an attempt to improve the scalability of applications that depend upon similarity search, researchers and practitioners have investigated practical methods for speeding up the computation of neighborhood information at the expense of accuracy. For data mining applications, the approaches considered have included feature sampling for local outlier detection [15], data sampling for clustering [50], and approximate similarity search for k-NN classification (as well as in its own right). Examples of fast approximate similarity search indices include the BD-Tree, a widely-recognized benchmark for approximate k-NN search; it makes use of splitting rules and early termination to improve upon the performance of the basic KD-Tree. One of the most popular methods for indexing, Locality-Sensitive Hashing [19], [30], can also achieve good practical search performance for range queries by managing parameters that influence a tradeoff between accuracy and time. The spatial approximation sample hierarchy (SASH) similarity search index [29] has had practical success in accelerating the performance of a shared-neighbor clustering algorithm [27], for a variety of data types.

In this paper, we propose a new similarity search structure, the Rank Cover Tree (RCT), whose internal operations completely avoid the use of numerical constraints involving similarity values, such as distance bounds and the triangle inequality. Instead, all internal selection operations of the RCT can be regarded as ordinal or rank-based, in that objects are selected or pruned solely according to their rank with respect to the sorted order of distance to the query object. Rank thresholds precisely determine the number of objects to be selected, thereby avoiding a major source of variation in the overall query execution time. This precision makes ordinal pruning particularly well-suited to those data mining applications, such as k-NN classification and LOF outlier detection, in which the desired size of the neighborhood sets is limited. As ordinal pruning involves only direct pairwise comparisons between similarity values, the RCT is also an example of a combinatorial similarity search algorithm [21].

The main contributions of this paper are as follows:

- A similarity search index in which only ordinal pruning is used for node selection—no use is made of metric pruning or of other constraints involving distance values.
- Experimental evidence indicating that for practical k-NN search applications, our rank-based method is very competitive with approaches that make explicit use of similarity constraints. In particular, it comprehensively outperforms state-of-the-art implementations of both LSH and the BD-Tree, and is capable of achieving practical speedups even for data sets of extremely high representational dimensionality.
- A formal theoretical analysis of performance showing that RCT k-NN queries efficiently produce correct results with very high probability. The performance bounds are expressed in terms of a measure of intrinsic dimensionality (the expansion rate [31]), independently of the full representational dimension of the data set. The analysis shows that by accepting a polynomial sublinear dependence of query cost in terms of the number of data objects n, the dependence on the intrinsic dimensionality is lower than that of any other known search index achieving sublinear performance in n, while still achieving very high accuracy.

To the best of our knowledge, the RCT is the first practical similarity search index that both depends solely on ordinal pruning, and admits a formal theoretical analysis of correctness and performance. A preliminary version of this work has appeared in [28].

The remainder of this paper is organized as follows. Section 2 briefly introduces two search structures whose designs are most-closely related to that of the RCT: the rank-based SASH approximate similarity search structure [29], and the distance-based Cover Tree for exact similarity search [6]. Section 3 introduces basic concepts and tools needed to describe the RCT. Section 4 presents the algorithmic details of the RCT index. In Section 5 we provide a formal theoretical analysis of the performance guarantees of the RCT. Section 6 compares the practical performance of the RCT index with that of other popular methods for similarity search. Concluding remarks are made in Section 7.

2 RELATED WORK

This paper will be concerned with two recently-proposed approaches that on the surface seem quite dissimilar: the SASH heuristic for approximate similarity search [29], and the Cover Tree for exact similarity search [6]. Of the two, the SASH can be regarded as combinatorial, whereas the Cover Tree makes use of numerical constraints. Before formally stating the new results, we first provide an overview of both the SASH and the Cover Tree.

2.1 SASH

A SASH is a multi-level structure recursively constructed by building a SASH on a half-sized random sample S′ ⊆ S of the object set S, and then connecting each object remaining outside S′ to several of its approximate nearest neighbors from within S′. Queries are processed by first locating approximate neighbors within sample S′, and then using the pre-established connections to discover neighbors within the remainder of the data set. The SASH index relies on a pairwise distance measure, but otherwise makes no assumptions regarding the representation of the data, and does not use the triangle inequality for pruning of search paths.

SASH construction is done in batch fashion, with points inserted in level order. Each node v appears at only one level of the structure: if the leaf level is level 1, the probability of v being assigned to level j is 1/2^j. Each node v is attached to at most p parent nodes, for some constant p ≥ 1, chosen as approximate nearest neighbors from among the nodes one level higher than v. The SASH guarantees a constant degree for each node by ensuring that each can serve as the parent of at most c = 4p children; any attempt to attach more than c children to an individual parent w is resolved by accepting only the c closest children to w, and reassigning rejected children to nearby surrogate parents whenever necessary.

Similarity queries are performed by establishing an upper limit k_j on the number of neighbor candidates to be retained from level j of the SASH, dependent on both j and the number of desired neighbors k (see Fig. 1). The search starts from the root and progresses by successively visiting all children of the retained set for the current level, and then reducing the set of children to meet the quota for the new level, by selecting the k_j elements closest to the query point. In the case of k-NN search, when the quota values are chosen as

    k_j = max{ k^(1 − j/log₂ n), pc/2 },

the total number of nodes visited is bounded by

    k^(1 + 1/log₂ n) / (k^(1/log₂ n) − 1) + (pc²/2)·log₂ n = Õ(k + log n).

Fig. 1. SASH routine for finding approximate k-nearest neighbors.

SASH construction can be performed in O(pcn log n) time and requires O(cn) space. The SASH was proposed as a heuristic structure with no formal analysis of query accuracy; however, its lack of dependence on the representational dimension of the data, together with tight control on the execution time, allow it to achieve very substantial speed-ups (typically 1-3 orders of magnitude) for real data sets of sizes and representational dimensions extending into the millions, while still consistently achieving high accuracy rates [27], [29].

Very few other efficient similarity search methods are known to use rank information as the sole basis for access or pruning. The recent algorithm of [48] proposes a rank-based hashing scheme, in which similarity computation relies on rank averaging and other arithmetic operations. No experimental results for this method have appeared in the research literature as yet. Another algorithm we considered, the combinatorial random connection graph search method RanWalk [21], is mainly of theoretical interest, since the preprocessing time and space required is quadratic in the data set size. Due to these impracticalities, both methods are excluded from the experimentation presented in Section 6.

2.2 Cover Trees and the Expansion Rate

In [31], Karger and Ruhl introduced a measure of intrinsic dimensionality as a means of analyzing the performance of a local search strategy for handling nearest neighbor queries. In their method, a randomized structure resembling a skip list [42] is used to retrieve pre-computed samples of elements in the vicinity of points of interest. Eventual navigation to the query is then possible by repeatedly shifting the focus to those sample elements closest to the query, and retrieving new samples in the vicinity of the new points of interest. The complexity of their method depends heavily on the rate at which the number of visited elements grows as the search expands. Accordingly, they limited their attention to sets which satisfied the following smooth-growth property. Let

    B_S(v, r) = { w ∈ S | d(v, w) ≤ r }

be the set of elements of S contained in the closed ball of radius r centered at v ∈ S. Given a query set U, S is said to have (b, δ)-expansion if for all q ∈ U and r > 0,

    |B_S(q, r)| ≥ b  ⟹  |B_S(q, 2r)| ≤ δ·|B_S(q, r)|.

The expansion rate of S is the minimum value of δ such that the above condition holds, subject to the choice of minimum ball set size b (in their analysis, Karger and Ruhl chose b = O(log |S|)). Karger and Ruhl's expansion rates have since been applied to the design and analysis of routing algorithms and distance labeling schemes [1], [10], [46]. One can consider the value log₂ δ to be a measure of the intrinsic dimension, by observing that for the euclidean distance metric in R^m, doubling the radius of a sphere increases its volume by a factor of 2^m. When sampling R^m by a uniformly distributed point set, the expanded sphere would contain proportionally as many points. However, as pointed out by the authors themselves, low-dimensional subsets in very high-dimensional spaces can
have very low expansion rates, whereas even for one-dimensional data the expansion rate can be linear in the size of S. The expansion dimension log₂ δ is also not a robust measure of intrinsic dimensionality, in that the insertion or deletion of even a single point can cause an arbitrarily-large increase or decrease.

Subsequently, Krauthgamer and Lee [34] proposed a structure called a Navigating Net, consisting of a hierarchy of progressively finer ε-nets of S, with pointers that allow navigation from one level of the hierarchy to the next. Their analysis of the structure involved a closely-related alternative to the expansion dimension that does not depend on the distribution of S. The doubling constant δ* is defined as the minimum value such that every ball B_S(q, 2r) can be entirely covered by δ* balls of radius r. Analogously, the doubling dimension is then defined as log₂ δ*. Although Gupta et al. [24] have shown that the doubling dimension is more general than the expansion dimension, the latter is more sensitive to the actual distribution of the point set S. In [33], Krauthgamer and Lee furthermore showed that without additional assumptions, nearest neighbor queries cannot be approximated within a factor of less than 7/5, unless log δ* ∈ O(log log n).

For real-world data sets, the values of the expansion rate can be very large, typically greatly in excess of log₂ n [6]. In terms of δ, the execution costs associated with Navigating Nets are prohibitively high. Beygelzimer et al. [6] have proposed an improvement upon the Navigating Net with execution costs depending only on a constant number of factors of δ. They showed that the densely-interconnected graph structure of a Navigating Net could be thinned into a tree satisfying the following invariant covering tree condition: for every node u at level l − 1 of the structure, its parent v at level l is such that d(u, v) < 2^l. Nearest-neighbor search in the resulting structure, called a Cover Tree, progresses by identifying a cover set of nodes C_l at every tree level l whose descendants are guaranteed to include all nearest neighbors. Fig. 2 shows the method by which the cover sets are generated, adapted from the original description in [6]. It assumes that the closest pair of points of S have unit distance.

Fig. 2. Cover Tree routine for finding the nearest neighbor of q.

The asymptotic complexities of Cover Trees and Navigating Nets are summarized in Fig. 3. Both methods have complexities that are optimal in terms of n = |S|; however, in the case of Navigating Nets, the large (polynomial) dependence on δ is impractically high. Experimental results for the Cover Tree show good practical performance for many real data sets [6], but the formal analysis still depends very heavily on δ.

The Cover Tree can also be used to answer approximate similarity queries using an early termination strategy. However, the approximation guarantees only that the resulting neighbors are within a factor of 1 + ε of the optimal distances, for some error value ε > 0. The dependence on distance values does not scale well to higher dimensional settings, as small variations in the value of ε may lead to great variations in the number of neighbors having distances within the range.

3 PRELIMINARIES

In this section, we introduce some of the basic concepts needed for the description of the RCT structure, and prove a technical lemma needed for the analysis.

3.1 Level Sets

The organization of nodes in a Rank Cover Tree for S is similar to that of the skip list data structure [37], [42]. Each node
of the bottom level of the RCT (L₀) is associated with a unique element of S.

Fig. 3. Asymptotic complexities of Rank Cover Tree, Cover Tree, Navigating Nets (NavNet), RanWalk, and LSH, stated in terms of n = |S|, neighbor set size k, and 2-expansion rate δ. The complexity of NavNet is reported in terms of the doubling constant. For the RCT, we show the k-NN complexity bounds derived in Section 5 for several sample rates, both constant and sublinear. φ is the golden ratio (1 + √5)/2. For the stated bounds, the RCT results are correct with probability at least 1 − 1/n^c. The query complexities stated for the CT and NavNet algorithms are for single nearest-neighbor (1-NN) search, in accordance with the published descriptions and analyses for these methods [6], [34]. Although a scheme exists for applying LSH to handle k-NN queries (see Section 6), LSH in fact performs (1 + ε)-approximate range search. Accordingly, the complexities stated for LSH are for range search; only the approximate dependence on n is shown, in terms of ρ, a positive-valued parameter that depends on the sensitivity of the family of hash functions employed. The dimensional dependence is hidden in the cost of evaluating distances and hash functions (not shown), as well as the value of ρ. The query bound for RanWalk is the expected time required to return the exact 1-NN.

Definition 1. A random leveling of a set S is a sequence L = (L₀, L₁, ...) of subsets of S, where for any integer j ≥ 0 the membership of L_{j+1} is determined by independently selecting each node v ∈ L_j for inclusion with probability 1/Δ, for some real-valued constant Δ > 1.

The level λ(v) of an element v ∈ S is defined as the maximum index j such that v ∈ L_j; a copy of v appears at every level in the range L₀, ..., L_j. The smallest index h such that L_h = ∅ is the height of the random leveling. Other properties that will prove useful in analyzing the performance of Rank Cover Trees include:

- λ(v) is a geometrically-distributed random variable with expected value E[λ(v)] = 1/(Δ − 1).
- The probability that v ∈ S belongs to the non-empty level set L_j is Pr[λ(v) ≥ j] = 1/Δ^j.
- The size of a random leveling, that is, the sum of the cardinalities of its level sets, has expected value

    E[ Σ_{v ∈ S} (λ(v) + 1) ] = (Δ/(Δ − 1))·|S|.

- The expected height of a random leveling is logarithmic in |S|; moreover, the probability of h greatly exceeding its expected value is vanishingly small (see [37]):

    Pr[ h > (c + 1)·E[h] ] ≤ 1/|S|^c.

3.2 Rank Function

Tree-based strategies for proximity search typically use a distance metric in two different ways: as a numerical (linear) constraint on the distances among three data objects (or the query object and two data objects), as exemplified by the triangle inequality, or as a numerical (absolute) constraint on the distance of candidates from a reference point. The proposed Rank Cover Tree differs from most other search structures in that it makes use of the distance metric solely for ordinal pruning, thereby avoiding many of the difficulties associated with traditional approaches in high-dimensional settings, such as the loss of effectiveness of the triangle inequality for pruning search paths [12].

Let U be some domain containing the point set S and the set of all possible queries. We assume the existence of an oracle which, given a query point and two objects in S, determines the object most similar to the query. Note that this ordering is assumed to be consistent with some underlying total order of the data objects with respect to the query. Based on such an oracle we provide an abstract formulation of ranks.

Definition 2. Let q ∈ U be a query. The rank function r_S : U × S → N yields r_S(q, v) = i if and only if (v₁, ..., v_n) is the ordering provided by the oracle for q and v = v_i. The k-nearest neighbor (k-NN) set NN(q, k) of q is defined as {v ∈ S | r_S(q, v) ≤ k}.

To facilitate the analysis of the RCT, we assume that the oracle rankings are induced by a distance metric on the items of S relative to any given query point q ∈ U. In this paper, for the sake of simplicity, we will assume that no pair of items of S have identical distances to any point in U. When desired, the uniqueness of distance values can be achieved by means of a tie-breaking perturbation scheme [20]. Furthermore, the total ranking of S can also serve to rank any subset of S. For each level set L_j ∈ L, we define the rank function r_{L_j} : U × L_j → N as r_{L_j}(q, v) = |{ u ∈ L_j | d(q, u) ≤ d(q, v) }|, and henceforth we take r_j and r to refer to r_{L_j} and r_S, respectively.

3.3 Expansion Rate

As with the Cover Tree, the RCT is analyzed in terms of Karger and Ruhl's expansion rate. With respect to a query set U, a random leveling L that, for all q ∈ U and L_j ∈ L, satisfies

    |B_{L_j}(q, ir)| ≤ δ_i·|B_{L_j}(q, r)|

for some real value δ_i > 1 and some integer i > 1, is said to have an i-expansion of δ_i. We consider random levelings with 2- and 3-expansion rates δ₂ and δ₃, respectively. For simplicity, in this version of the paper we will place no constraints on the minimum ball size, although such constraints can easily be accommodated in the analysis if desired. Henceforth, we shall take B_j(q, r) to refer to B_{L_j}(q, r). Note also that the expansion rates of random levelings may be higher than those based only on the data set itself (level 0).

One of the difficulties in attempting to analyze similarity search performance in terms of ranks is the fact that rank information, unlike distance, does not satisfy the triangle inequality in general. To this end, Goyal et al. [21] introduced the disorder inequality, which can be seen as a relaxed, combinatorial generalization of the triangle inequality. A point set S has real-valued disorder constant D if all triples of points x, y, z ∈ S satisfy

    r(x, y) ≤ D·(r(z, x) + r(z, y)),

and D is minimal with this property. We can derive a similar relationship in terms of the expansion rates of a set S.

Lemma 1 (Rank Triangle Inequality). Let δ₂ and δ₃ be the 2- and 3-expansion rates for U and the random leveling L of S. Then for any level set L_l ∈ L, any query object q ∈ U, and any elements u, v ∈ S,

    r_l(q, v) ≤ max{ δ₂·r_l(q, u), min{δ₂², δ₃}·r_l(u, v) }.

Proof. From the triangle inequality, we know that

    d(q, v) ≤ d(q, u) + d(u, v).

There are two cases to consider, depending on the relationship between d(q, u) and d(u, v). First, let us suppose that d(q, u) ≥ d(u, v). In this case, d(q, v) ≤ 2d(q, u). Since B_l(q, d(q, v)) ⊆ B_l(q, 2d(q, u)), and since r_l(q, v) = |B_l(q, d(q, v))|, we have

    r_l(q, v) ≤ |B_l(q, 2d(q, u))| ≤ δ₂·|B_l(q, d(q, u))| = δ₂·r_l(q, u).
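The ball sets and rank values manipulated in this section are easy to illustrate concretely. The sketch below is a minimal illustration of our own, not part of the paper's implementation; the helper names `ball` and `rank` are hypothetical. It computes B_l(q, r) and r_l(q, v) = |B_l(q, d(q, v))| for a small one-dimensional set under the absolute-difference metric, along with one 2-expansion ratio |B(q, 2r)| / |B(q, r)|:

```python
def ball(q, r, level_set, dist):
    # B_l(q, r): elements of the level set within distance r of q
    return [u for u in level_set if dist(q, u) <= r]

def rank(q, v, level_set, dist):
    # r_l(q, v) = |B_l(q, d(q, v))|, assuming distinct distances to q
    return len(ball(q, dist(q, v), level_set, dist))

# One-dimensional example set under the absolute-difference metric.
L = [1, 2, 3, 5, 8, 13, 21, 34]
d = lambda a, b: abs(a - b)

r_v = rank(0, 5, L, d)      # -> 4: the elements 1, 2, 3, 5 lie within d(0, 5) = 5
ratio = len(ball(0, 10, L, d)) / len(ball(0, 5, L, d))   # -> 5/4 = 1.25
```

The 2-expansion rate δ₂ is the largest such ratio over the relevant queries and radii (subject to the minimum ball size b); ordinal pruning reasons entirely in terms of these set sizes and comparisons, never the raw distance values themselves.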
Fig. 4. Offline construction routine for the Rank Cover Tree.

Fig. 5. k-nearest neighbor search for the Rank Cover Tree.
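The random leveling of Definition 1, on which the offline construction routine of Fig. 4 is based, is straightforward to sketch. The following is a minimal illustration under our own naming, not the authors' code; each node of L_j survives into L_{j+1} independently with probability 1/Δ:

```python
import random

def random_leveling(S, delta=2.0, rng=random):
    # L_0 = S; each node of L_j is kept in L_{j+1} with probability 1/delta
    levels = [list(S)]
    while levels[-1]:
        levels.append([v for v in levels[-1] if rng.random() < 1.0 / delta])
    return levels[:-1]  # drop the trailing empty set; the height h is the number of levels kept

levels = random_leveling(range(1000), delta=2.0)
total_size = sum(len(level) for level in levels)  # equals the sum over v of (lambda(v) + 1)
```

For Δ = 2, each level set is in expectation half the size of the one below it, E[λ(v)] = 1/(Δ − 1) = 1, and the total size Σ_v (λ(v) + 1) concentrates near (Δ/(Δ − 1))·|S| = 2|S|, in line with the properties listed in Section 3.1.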
Now assume that d(q, u) ≤ d(u, v). Let w be any point
T for level l. Now, for any u ∈ L_l and any choice of l