Cover Trees for Nearest Neighbor
Alina Beygelzimer
IBM Thomas J. Watson Research Center, Hawthorne, NY 10532

Sham Kakade
TTI-Chicago, 1427 E 60th Street, Chicago, IL 60637

John Langford
TTI-Chicago, 1427 E 60th Street, Chicago, IL 60637

Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).

Abstract

We present a tree data structure for fast nearest neighbor operations in general $n$-point metric spaces (where the data set consists of $n$ points). The data structure requires $O(n)$ space regardless of the metric's structure yet maintains all performance properties of a navigating net [KL04a]. If the point set has a bounded expansion constant $c$, which is a measure of the intrinsic dimensionality (as defined in [KR02]), the cover tree data structure can be constructed in $O(c^6 n \log n)$ time. Furthermore, nearest neighbor queries require time only logarithmic in $n$, in particular $O(c^{12} \log n)$ time. Our experimental results show speedups over the brute force search varying between one and several orders of magnitude on natural machine learning datasets.

1. Introduction

Problem. Nearest neighbor search is a basic computational tool that is particularly relevant to machine learning, where it is often believed that high-dimensional datasets have low-dimensional intrinsic structure. Here we study how one can exploit potential structure in the dataset to speed up nearest neighbor computations. Such speedups could benefit a number of machine learning algorithms, including dimensionality reduction algorithms (which are inherently based on this belief of low-dimensional structure) and classification algorithms that rely on nearest neighbor operations (for example, [LMS05]).

The basic nearest neighbor problem is as follows: Given a set $S$ of $n$ points in some metric space $(X, d)$, the problem is to preprocess $S$ so that given a query point $p \in X$, one can efficiently find a point $q \in S$ which minimizes $d(p, q)$.

Context. For general metrics, finding (or even approximating) the nearest neighbor of a point requires $\Omega(n)$ time. The classical example is a uniform metric where every pair of points is near the same distance, so there is no structure to take advantage of. However, the metrics of practical interest typically do have some structure which can be exploited to yield significant computational speedups. Motivated by this observation, several notions of metric structure and algorithms exploiting this structure have been proposed [Cla99, KR02, KL04a].

Denote the closed ball of radius $r$ around $p$ in $S \subset X$ by $B_S(p, r) = \{q \in S : d(p, q) \le r\}$. When clear from the context, we drop the subscript $S$. Karger and Ruhl [KR02] considered the following notion of dimension based on point expansion, and described a randomized algorithm for metrics in which this dimension is small. The expansion constant of $S$ is defined as the smallest value $c \ge 2$ such that $|B_S(p, 2r)| \le c\,|B_S(p, r)|$ for every $p \in X$ and $r > 0$. If $S$ is arranged uniformly on some surface of dimension $d$, then $c \sim 2^d$, which suggests defining the expansion dimension of $S$ (also referred to as KR-dimension) as $\dim_{KR}(S) = \log c$. However, as previously observed in [KR02, KL04a], some metrics that should intuitively be considered low-dimensional turn out to have large growth constants. For example, adding a single point in a Euclidean space may make the KR-dimension grow arbitrarily (though such examples may be pathological in practice).

A more robust alternative is given by the doubling constant [Cla99, KL04a], which is the minimum value $c$ such that every ball in $X$ can be covered by $c$ balls in $X$ of half the radius. The doubling dimension of $S$ is then defined as $\dim_{KL}(S) = \log c$. This notion is strictly more general than the KR-dimension, as shown in [GKL03]. A drawback (so far) of working with the doubling dimension is that only weaker results have been provable, and even those apply only to approximate nearest neighbors.

The aforementioned algorithms have query time guarantees which are only logarithmic in $n$ (while being exponential in their respective notion of intrinsic dimensionality). Unfortunately, in machine learning applications, most of these theoretically appealing algorithms are still not used in practice. When the Euclidean dimension is small, one typical approach is to use KD-trees (see [FBL77]). If the metric is non-Euclidean, or the Euclidean dimension is large, ball trees [Uhl91, Omo87] provide compelling performance in many practical applications [GM00]. These methods currently have only trivial query time guarantees of $O(n)$, although improved performance may be provable given some form of structure.

The focus of this paper is to make these theoretically appealing algorithms more practically applicable. One significant drawback of these algorithms (based on intrinsic dimensionality notions) is that their space requirements are exponential in the dimension. As we observe experimentally (see Section 5), it is common for the dimension to grow with the dataset size, so space consumption is a reasonable concern. This drawback is precisely what the cover tree addresses.

New Results. We propose a simple data structure, a cover tree, for exact and approximate nearest neighbor operations. The data structure improves over other results [KR02, KL04a, Cla99, HM04] by making the space requirement linear in the dataset size, independent of any dimensionality assumptions. The cover tree is simple since the data structure being manipulated is a tree; in fact, a cover tree (as a graph) can be viewed as a subgraph of a navigating net [KL04a]. The cover tree throws away most of the edges of the navigating net while maintaining all dimension-dependent guarantees. The algorithms and proofs needed for this structure are inherently different because (for example) a greedy traversal of the tree is not guaranteed to answer a query correctly.

We also provide experiments (see Section 5) and public code, suggesting this approach is competitive with current practical approaches.

In our analysis, we focus primarily on the expansion constant, because this permits results on exact nearest neighbor queries. If $c$ is the expansion constant of $S$, we can state the dependence on $c$ explicitly:

                 Cover Tree            Nav. Net              [KR02]
Constr. Space    $O(n)$                $c^{O(1)} n$          $c^{O(1)} n \ln n$
Constr. Time     $O(c^6 n \ln n)$      $c^{O(1)} n \ln n$    $c^{O(1)} n \ln n$
Insert/Remove    $O(c^6 \ln n)$        $c^{O(1)} \ln n$      $c^{O(1)} \ln n$
Query            $O(c^{12} \ln n)$     $c^{O(1)} \ln n$      $c^{O(1)} \ln n$

It is important to note that the algorithms here (as in [KL04a] but not in [KR02]) work without knowledge of the structure; only the analysis is done with respect to the assumptions. Comparison of time complexity in terms of $c$ can be subtle (see the discussion in Section 4). Also, such a comparison is somewhat unfair since past work did not explicitly try to optimize the $c$ dependence.

The algorithms easily extend to approximate nearest neighbor queries for sets with a bounded doubling dimension, as in [KL04a]. The algorithm of [KL04a] depends on the aspect ratio $\Delta$ defined as the ratio of the largest to the smallest interpoint distance.¹ The query times of our algorithm are the same as those in [KL04a], namely $O(\log \Delta) + (1/\epsilon)^{O(1)}$, where $\epsilon$ is the approximation parameter.

¹ The results in [Cla99] also depend on this ratio and rely on some additional stronger assumptions about the distribution of queries. The algorithms in [KL04b] and [HM04] eliminate the dependence on the aspect ratio but do not achieve linear space.

In an extended version [BKL06], we provide several algorithms of practical interest. These include a lazy construction (which amortizes the construction cost over queries), a batch construction (which is empirically superior to a sequence of single point insertions), and a batch query (which amortizes the query time over multiple queries).

Organization. The rest of the paper is organized as follows. Sections 2 and 3 specify the algorithms and prove their correctness, with no assumptions about any structure present in the data set. Section 4 provides the runtime analysis in terms of dimensionality. Section 5 presents experimental results.

2. The Cover Tree Datastructure

A cover tree $T$ on a data set $S$ is a leveled tree where each level is a "cover" for the level beneath it. Each level is indexed by an integer scale $i$ which decreases as the tree is descended. Every node in the tree is associated with a point in $S$. Each point in $S$ may be associated with multiple nodes in the tree; however, we require that any point appears at most once in every level. Let $C_i$ denote the set of points in $S$ associated with the nodes at level $i$.

Algorithm 1 Find-Nearest (cover tree $T$, query point $p$)
1. Set $Q_\infty = C_\infty$, where $C_\infty$ is the root level of $T$.
2. for $i$ from $\infty$ down to $-\infty$
   (a) Set $Q = \{\mathrm{Children}(q) : q \in Q_i\}$.
   (b) Form cover set $Q_{i-1} = \{q \in Q : d(p, q) \le d(p, Q) + 2^i\}$.
3. return $\arg\min_{q \in Q_{-\infty}} d(p, q)$.

The cover tree obeys the following invariants for all $i$:

1. (Nesting) $C_i \subset C_{i-1}$. This implies that once a point $p \in S$ appears in $C_i$ then every lower level in the tree has a node associated with $p$.
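As a small illustration of the nesting invariant, the following Python sketch checks that each level's point set is contained in the level below it. It assumes the levels are stored as a finite dictionary mapping an integer scale i to the set C_i; that representation and the name `satisfies_nesting` are hypothetical choices for this example and are not taken from the authors' code.

```python
from typing import Dict, Hashable, Set


def satisfies_nesting(levels: Dict[int, Set[Hashable]]) -> bool:
    """Check C_i is a subset of C_{i-1} for every adjacent pair of stored scales.

    `levels` is an assumed representation: scale i -> set of points C_i.
    Only a finite window of scales is stored, so a scale whose lower
    neighbour i-1 is absent from the dictionary is simply not checked.
    """
    for i in sorted(levels):
        lower = levels.get(i - 1)
        if lower is not None and not levels[i] <= lower:
            return False
    return True


# A point that appears at scale i also appears at every lower stored scale.
nested = {2: {"a"}, 1: {"a", "b"}, 0: {"a", "b", "c"}}
# Point "b" appears at scale 1 but is missing from scale 0.
broken = {1: {"a", "b"}, 0: {"a", "c"}}

print(satisfies_nesting(nested))  # True
print(satisfies_nesting(broken))  # False
```

Storing each level as a set also reflects the requirement that a point appears at most once in every level.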