Rank Cover Trees for Nearest Neighbor Search

Michael E. HOULE¹ and Michael NETT¹,²

¹ National Institute of Informatics, Japan
² The University of Tokyo, Japan

Summary

Virtually all known distance-based similarity search indexes use some form of numerical constraint (triangle inequality, additive distance bounds, ...) on similarity values for pruning and selection. Such constraints, however, often lead to large variations in the number of objects examined during a query, making execution costs difficult to control. We introduce a probabilistic data structure for similarity search, the Rank Cover Tree (RCT), that entirely avoids the use of numerical constraints. All internal selections are made according to the ranks of the objects with respect to the query, allowing much tighter control of the overall execution costs. A rank-based probabilistic analysis shows that, with very high probability, the RCT returns a correct query result in time that depends competitively on a measure of the intrinsic dimensionality of the data set.

Idea of Sampling

• We try to find items similar to a query object q with respect to some data set X ⊆ Ω.
• Suppose we have found a point x similar to q with respect to a (small) subset X′ ⊆ X, for example by means of a sequential scan.
• We are likely to observe transitivity: an item y ∈ X \ X′ that is similar to x is also similar to q.
• The probability of observing this kind of transitivity can be bounded.
• The expansion rate δ measures intrinsic dimensionality.

[Figure: illustration of the query q and the items x and y.]

K-Nearest Neighbor Search

• Maintain level-wise sets C_i covering the query results. Start with C_h containing the artificial root.
• C_i is constructed from C_{i+1} by keeping the k_i children of the elements of C_{i+1} that are most similar to the query q.
• The set C_0 contains the query result.
• We choose k_i = ω · max{k / |X|^{i/h}, 1}, where ω is a parameter that trades off accuracy against query time.
• Our analysis shows that if ω is chosen greater than a threshold determined by δ, h and |X|, then the approximation is free of error with very high probability.

[Figure, Recall Rates: Amsterdam Library of Object Images. Average recall [%] against average query time [ms] for Brute Force, SASH, ANN KD-Tree, EELSH, RCT (h=3), RCT (h=4) and Cover Tree.]
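The level-wise descent described under K-Nearest Neighbor Search can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `coverage`, `rct_knn` and the `children` dictionary are hypothetical, the coverage is read as k_i = ω · max{k / |X|^(i/h), 1}, and the toy tree at the end is built by hand rather than by the construction procedure.

```python
import math

def coverage(i, k, n, h, omega):
    """Candidates kept at level i, assuming k_i = omega * max(k / n**(i/h), 1)."""
    return math.ceil(omega * max(k / n ** (i / h), 1.0))

def rct_knn(query, root, children, dist, k, n, h, omega):
    """Rank-based descent from the artificial root (level h) to level 0.

    children maps (level, item) to that item's children one level below.
    dist is used only to rank candidates; no distance bound prunes the search.
    """
    cover = [root]                                    # C_h
    for i in range(h - 1, -1, -1):
        pool = {c for parent in cover for c in children[(i + 1, parent)]}
        ranked = sorted(pool, key=lambda x: (dist(query, x), x))
        cover = ranked[:coverage(i, k, n, h, omega)]  # C_i
    return cover[:k]                                  # k best items of C_0

# Hand-built toy tree of height h = 2 over the points 0..15 (hypothetical data).
points = list(range(16))
level1 = [0, 5, 10, 15]
children = {(2, "root"): level1}
for p in level1:
    children[(1, p)] = [x for x in points
                        if min(level1, key=lambda s: abs(s - x)) == p]

result = rct_knn(6.2, "root", children, lambda a, b: abs(a - b),
                 k=2, n=len(points), h=2, omega=2)
print(result)  # [6, 7]
```

In this sketch a larger ω widens every candidate set, which costs more distance computations per level but makes it less likely that a true neighbor is dropped before level 0.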

Motivation

Text, images, market data, biological data, scientific data, and many other forms of information are currently being accumulated in large data repositories at a rate that greatly outstrips our ability to analyze and to interpret. Together with this explosion of information, the demand for effective methods for searching, clustering, categorizing, summarizing and matching within data sets continues to grow. For such applications, solutions based on similarity search are among the earliest (and most effective) proposed in statistics, pattern recognition, and machine learning. The design and analysis of effective similarity search structures has consequently been the subject of intensive research for many decades.

Construction

• For each item x ∈ X, introduce x into levels 0, ..., λ_x. For a tree of height h, λ_x follows a geometric distribution with p = |X|^{-1/h}.
• Build a partial RCT on the highest level by connecting items in that level to an artificial root.
• Connect the next level by using approximate nearest neighbors found in the partial RCT.
• Well-formed with high probability.

[Figure, Recall Rates: MNIST Database of Hand-Written Digits. Average recall [%] against average query time [ms] for Brute Force, SASH, ANN KD-Tree, EELSH, RCT (h=3), RCT (h=4) and Cover Tree.]
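The random level assignment in the first construction step can be sketched as follows. This is only an illustration under assumptions: `sample_levels` is a hypothetical name, p = |X|^(-1/h) is read as the probability that an item is promoted one level further, and λ_x is capped at the top level h − 1. Connecting each level to its approximate nearest neighbors in the partial tree is not shown.

```python
import random

def sample_levels(items, h, rng):
    """Assign every item to levels 0..lambda_x of a tree with h item levels.

    Assumption: each item is promoted to the next level with probability
    p = n**(-1/h), so lambda_x is geometric, truncated at level h - 1.
    """
    n = len(items)
    p = n ** (-1.0 / h)
    levels = [[] for _ in range(h)]
    for x in items:
        lam = 0
        while lam < h - 1 and rng.random() < p:
            lam += 1
        for i in range(lam + 1):      # x appears on every level up to lambda_x
            levels[i].append(x)
    return levels

rng = random.Random(0)                # fixed seed so the run is repeatable
levels = sample_levels(list(range(10_000)), h=4, rng=rng)
print([len(level) for level in levels])
```

Under this reading each level is expected to be smaller than the one below it by a factor of |X|^(1/h), so with 10,000 items and h = 4 the expected level sizes are 10,000, 1,000, 100 and 10.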

Curse of Dimensionality

One can make the following observations with respect to very high-dimensional data:

• The performance of classical data structures for similarity search converges towards that of a sequential scan.
• Point-to-point distances become indistinguishable as they concentrate heavily around their mean value.
• Individual search paths within a similarity search data structure can no longer be effectively excluded from consideration.

[Figure, Recall Rates: Reuters Corpus. Average recall [%] against average query time [ms] for Brute Force, SASH, RCT (h=3), RCT (h=4) and Cover Tree.]
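The concentration of distances noted under Curse of Dimensionality can be observed with a small experiment. The sketch below is an illustration only and assumes uniformly random points in the unit cube, which is not one of the data sets used on this poster; `relative_contrast` is a hypothetical helper that reports (max − min) / min over the distances from a random query to a random sample.

```python
import math
import random

def relative_contrast(dim, n_points, rng):
    """(max - min) / min of the distances from one random query to n_points points."""
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)                          # fixed seed so the run is repeatable
low_dim = relative_contrast(2, 200, rng)        # nearest point is far closer than the farthest
high_dim = relative_contrast(1000, 200, rng)    # all points lie at nearly the same distance
print(low_dim, high_dim)
```

A small relative contrast means that a distance bound can rule out very few candidates, which is the situation the rank-based selection of the RCT is meant to sidestep.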

Additional Material

• Technical Report
• Poster
• Implementation
• Documentation

Contact: Michael E. HOULE, Visiting Professor, National Institute of Informatics. Tel: 03-4212-2538, Fax: 03-3556-1916, Email: meh@nii.ac.jp