Rank Cover Trees for

Michael E. HOULE (1), Michael NETT (1, 2)

(1) National Institute of Informatics, Japan; (2) The University of Tokyo, Japan

Summary

Virtually all known distance-based similarity search indexes make use of some form of numerical constraints (triangle inequality, additive distance bounds, . . . ) on similarity values for pruning and selection. The use of such numerical constraints, however, often leads to large variations in the numbers of objects examined in the execution of a query, making it difficult to control the execution costs. We introduce a probabilistic data structure for similarity search, the Rank Cover Tree (RCT), that entirely avoids the use of numerical constraints. All internal selections are made according to the ranks of the objects with respect to the query, allowing much tighter control on the overall execution costs. A rank-based probabilistic analysis shows that, with very high probability, the RCT returns a correct query result in time that depends competitively on a measure of the intrinsic dimensionality of the data set.

Motivation

Text, images, market data, biological data, scientific data, and many other forms of information are currently being accumulated in large data repositories at a rate that greatly outstrips our ability to analyze and to interpret. Together with this explosion of information, the demand for effective methods for searching, clustering, categorizing, summarizing and matching within data sets continues to grow. For such applications, solutions based on similarity search are among the earliest (and most effective) proposed in statistics, pattern recognition, and machine learning. The design and analysis of effective similarity search structures has consequently been the subject of intensive research for many decades.

Curse of Dimensionality

One can make the following observations with respect to very high-dimensional data:
• The performance of classical data structures for similarity search converges towards that of a sequential scan.
• Point-to-point distances become indistinguishable as they concentrate heavily around their mean value.
• Individual search paths within a similarity search data structure can no longer be effectively excluded from consideration.

Idea of Sampling

• We try to find items similar to a query object q with respect to some data X ⊆ Ω.
• Suppose we found a similar point x with respect to a (small) subset X′ ⊆ X, for example, by means of a sequential scan.
• We are likely to observe transitivity: an item y ∈ X \ X′ which is similar to the item x is also similar to q.
• The probability of observing this kind of transitivity can be bounded!

[Figure: illustration of the query q and the items x and y.]

Construction

• For each item x ∈ X, introduce x into levels 0, . . . , $\lambda_x$. For a tree of height h, $\lambda_x$ follows a geometric distribution with $p = |X|^{-1/h}$.
• Build a partial RCT on the highest level by connecting items in that level to an artificial root.
• Connect the next level by using approximate nearest neighbors found in the partial RCT.
• The resulting tree is well-formed with high probability.

K-Nearest Neighbor Search

• Maintain level-wise sets $C_i$ covering the query results. Start with $C_h$ containing the artificial root.
• $C_i$ is constructed from the set $C_{i+1}$ by keeping the $k_i$ children of all elements in $C_{i+1}$ which are most similar to the query q.
• The set $C_0$ contains the query result.
• We choose $k_i = \omega \cdot \max\{k / (\sqrt[h]{|X|})^{i},\ 1\}$, where ω is a parameter that trades off accuracy against query time.
• Our analysis shows: if ω is chosen greater than a threshold involving $\delta^{\log_\varphi 5}$, h, e, and |X| [exact expression garbled in extraction], then the approximation is free of error with very high probability.
• The expansion rate δ measures intrinsic dimensionality.

Recall Rates

[Figure: Amsterdam Library of Object Images. Average Recall [%] against Average Query Time [ms] for Brute Force, SASH, ANN KD-Tree, EELSH, RCT (h=3), RCT (h=4), and Cover Tree.]

[Figure: MNIST Database of Hand-Written Digits. Average Recall [%] against Average Query Time [ms] for Brute Force, SASH, ANN KD-Tree, EELSH, RCT (h=3), RCT (h=4), and Cover Tree.]

[Figure: Reuters Corpus. Average Recall [%] against Average Query Time [ms] for Brute Force, SASH, RCT (h=3), RCT (h=4), and Cover Tree.]

Additional Material

• Technical Report
• Poster
• Implementation
• Documentation
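The Construction and K-Nearest Neighbor Search steps can be illustrated with a small sketch. This is not the authors' implementation. It assumes that an item is promoted to the next level with probability |X|^(-1/h), that items occupy levels 0 to h-1 below the artificial root, and that the coverage is k_i = ω · max{k / |X|^(i/h), 1} rounded up; the exponent i/h is our reading of the formula. Parents are found by an exact nearest-neighbor scan rather than the approximate search in the partial tree that the poster describes. All function names (`top_level`, `coverage`, `build`, `search`) are hypothetical.

```python
import math
import random

ROOT = -1  # id of the artificial root (sketch convention)


def top_level(n, h, rng):
    """Sample lambda_x: promote with probability n**(-1/h), capped at h - 1."""
    p = n ** (-1.0 / h)
    level = 0
    while level < h - 1 and rng.random() < p:
        level += 1
    return level


def coverage(k, n, h, i, omega):
    """k_i = omega * max(k / n**(i/h), 1), rounded up to an integer."""
    return math.ceil(omega * max(k / n ** (i / h), 1.0))


def build(points, h, rng):
    """Map (level, parent id) -> ids of the parent's children on that level."""
    n = len(points)
    lam = [top_level(n, h, rng) for _ in range(n)]
    lam[0] = h - 1  # keep the top level non-empty (sketch convenience)
    levels = [[x for x in range(n) if lam[x] >= i] for i in range(h)]
    children = {(h - 1, ROOT): list(levels[h - 1])}
    for i in range(h - 2, -1, -1):
        for x in levels[i]:
            # Exact nearest item one level up; the poster instead uses
            # approximate neighbors found in the partial tree.
            parent = min(levels[i + 1],
                         key=lambda y: math.dist(points[x], points[y]))
            children.setdefault((i, parent), []).append(x)
    return children


def search(points, children, q, k, omega, h):
    """Level-wise descent keeping the k_i best-ranked candidates per level."""
    n = len(points)
    cover = [ROOT]  # C_h
    for i in range(h - 1, -1, -1):
        cand = [c for parent in cover for c in children.get((i, parent), [])]
        # Only the ordering of the candidates with respect to q is used.
        cand.sort(key=lambda x: math.dist(points[x], q))
        cover = cand[:coverage(k, n, h, i, omega)]  # C_i
    return cover[:k]
```

Calling `build(points, h, random.Random(0))` followed by `search(points, children, q, k, omega, h)` returns up to k item ids ordered by similarity to q. Small values of ω examine fewer candidates per level at the risk of missing true neighbors, while a value large enough that k_i exceeds every level size reduces the search to a full scan.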

Contact: Michael E. HOULE, Visiting Professor, National Institute of Informatics. Tel: 03-4212-2538, Fax: 03-3556-1916, Email: meh@nii.ac.jp