Fast Nearest Neighbor Retrieval for Bregman Divergences

Lawrence Cayton [email protected] Department of Computer Science and Engineering, University of California, San Diego, CA 92093

Abstract

We present a data structure enabling efficient nearest neighbor (NN) retrieval for bregman divergences. The family of bregman divergences includes many popular dissimilarity measures including KL-divergence (relative entropy), Mahalanobis distance, and Itakura-Saito divergence. These divergences present a challenge for efficient NN retrieval because they are not, in general, metrics, for which most NN data structures are designed. The data structure introduced in this work shares the same basic structure as the popular metric ball tree, but employs convexity properties of bregman divergences in place of the triangle inequality. Experiments demonstrate speedups over brute-force search of up to several orders of magnitude.

1. Introduction

Nearest neighbor (NN) search is a core primitive in machine learning, vision, signal processing, and elsewhere. Given a database X, a dissimilarity measure d, and a query q, the goal is to find the x ∈ X minimizing d(x, q). Brute-force search is often impractical given the size and dimensionality of modern data sets, so many data structures have been developed to accelerate NN retrieval.

Most retrieval data structures are for the ℓ2 norm and, more generally, metrics. Though many dissimilarity measures are metrics, many are not. For example, the natural notion of dissimilarity between probability distributions is the KL-divergence (relative entropy), which is not a metric. It has been used to compare histograms in a wide variety of applications, including text analysis, image classification, and content-based image retrieval (Pereira et al., 1993; Puzicha et al., 1999; Rasiwasia et al., 2007). Because the KL-divergence does not satisfy the triangle inequality, very little of the research on NN retrieval structures applies.

The KL-divergence belongs to a broad family of dissimilarities called bregman divergences. Other examples include the Mahalanobis distance, used e.g. in classification (Weinberger et al., 2006); the Itakura-Saito divergence, used in sound processing (Gray et al., 1980); and the ℓ2² distance. Bregman divergences present a challenge for fast NN retrieval since they need not be symmetric or satisfy the triangle inequality.

This paper introduces bregman ball trees (bbtrees), the first NN retrieval data structure for general bregman divergences. The data structure is a relative of the popular metric ball tree (Omohundro, 1989; Uhlmann, 1991; Moore, 2000). Since that data structure is built on the triangle inequality, the extension to bregman divergences is non-trivial.

A bbtree defines a hierarchical space decomposition based on bregman balls; retrieving a NN with the tree requires computing bounds on the bregman divergence from a query to these balls. We show that this divergence can be computed exactly with a simple bisection search that is very efficient. Since only bounds on the divergence are needed, we can often stop the search early using primal and dual function evaluations.

In the experiments, we show that the bbtree provides a substantial speedup—often orders of magnitude—over brute-force search.

2. Background

This section provides background on bregman divergences and NN search.

2.1. Bregman Divergences

First we briefly overview bregman divergences.

Definition 1 (Bregman, 1967). Let f be a strictly convex differentiable function.¹ The bregman divergence based on f is

    d_f(x, y) ≡ f(x) − f(y) − ⟨∇f(y), x − y⟩.

¹Additional technical restrictions are typically put on f. In particular, f is assumed to be Legendre.

Figure 1. The bregman divergence between x and y.

Table 1. Some standard bregman divergences.

                      f(x)                      d_f(x, y)
    ℓ2²               (1/2)‖x‖²                 (1/2)‖x − y‖²
    KL                Σ_i x_i log x_i           Σ_i x_i log(x_i/y_i)
    Mahalanobis       (1/2) x^T Q x             (1/2)(x − y)^T Q (x − y)
    Itakura-Saito     −Σ_i log x_i              Σ_i (x_i/y_i − log(x_i/y_i) − 1)
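To make the definition concrete, the following is a small Python/NumPy sketch (an illustration of mine, not code from the paper) that evaluates d_f directly from a base function and its gradient, and checks the result against the closed forms in table 1 for the KL and Itakura-Saito divergences; the function names are invented for this example.

    import numpy as np

    def bregman(f, grad_f, x, y):
        # d_f(x, y) = f(x) - f(y) - <grad f(y), x - y>
        return f(x) - f(y) - np.dot(grad_f(y), x - y)

    # Base functions from table 1.
    f_kl    = lambda x: np.sum(x * np.log(x))      # negative entropy
    grad_kl = lambda x: np.log(x) + 1.0
    f_is    = lambda x: -np.sum(np.log(x))         # Itakura-Saito base function
    grad_is = lambda x: -1.0 / x

    x = np.array([0.2, 0.3, 0.5])
    y = np.array([0.1, 0.6, 0.3])

    # Closed forms from table 1 (x and y are normalized, so the KL forms agree).
    print(np.isclose(bregman(f_kl, grad_kl, x, y), np.sum(x * np.log(x / y))))          # True
    print(np.isclose(bregman(f_is, grad_is, x, y), np.sum(x / y - np.log(x / y) - 1)))  # True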

One can interpret the bregman divergence as the distance between a function and its first-order taylor expansion. In particular, d_f(x, y) is the difference between f(x) and the linear approximation of f(x) centered at y; see figure 1. Since f is convex, d_f(x, y) is always nonnegative.

Some standard bregman divergences and their base functions are listed in table 1.

A bregman divergence is typically used to assess similarity between two objects, much like a metric. But though metrics and bregman divergences are both used for similarity assessment, they do not share the same fundamental properties. Metrics satisfy three basic properties: non-negativity: d(x, y) ≥ 0; symmetry: d(x, y) = d(y, x); and, perhaps most importantly, the triangle inequality: d(x, z) ≤ d(x, y) + d(y, z). Bregman divergences are nonnegative; however, they do not satisfy the triangle inequality (in general) and can be asymmetric.

Bregman divergences do satisfy a variety of geometric properties, a couple of which we will need later. The bregman divergence d_f(x, y) is convex in x, but not necessarily in y. Define the bregman ball of radius R around µ as

    B(µ, R) ≡ {x : d_f(x, µ) ≤ R}.

Since d_f(x, µ) is convex in x, B(µ, R) is a convex set.

Another interesting property concerns means. For a set of points X, the mean under a bregman divergence is well defined and, interestingly, is independent of the choice of divergence:

    µ_X ≡ argmin_µ Σ_{x∈X} d_f(x, µ) = (1/|X|) Σ_{x∈X} x.

This fact can be used to extend k-means to the family of bregman divergences (Banerjee et al., 2005).
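As a quick numerical sanity check of the mean property (again a sketch of mine, not from the paper), the arithmetic mean of a point set beats nearby competitors under two quite different divergences:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.dirichlet(np.ones(5), size=20)        # 20 points on the probability simplex
    mu = X.mean(axis=0)                           # arithmetic mean

    kl            = lambda x, y: np.sum(x * np.log(x / y))
    itakura_saito = lambda x, y: np.sum(x / y - np.log(x / y) - 1.0)
    cost          = lambda d, c: sum(d(x, c) for x in X)

    for d in (kl, itakura_saito):
        z = np.abs(mu + 0.01 * rng.normal(size=mu.shape))
        z /= z.sum()                              # keep the competitor on the simplex
        print(cost(d, mu) < cost(d, z))           # True for both divergences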
2.2. NN Search

Because of the tremendous practical and theoretical importance of nearest neighbor search in machine learning, computational geometry, databases, and elsewhere, many retrieval schemes have been developed to reduce the computational cost of finding NNs.

KD-trees (Friedman et al., 1977) are one of the earliest and most popular data structures for NN retrieval. The data structure and accompanying search algorithm provide a blueprint for a huge body of future work (including the present one). The tree defines a hierarchical space partition where each node defines an axis-aligned rectangle. The search algorithm is a simple branch and bound exploration of the tree. Though KD-trees are useful in many applications, their performance has been widely observed to degrade badly with the dimensionality of the database.

Metric ball trees (Omohundro, 1989; Uhlmann, 1991; Yianilos, 1993; Moore, 2000) extend the basic methodology behind KD-trees to metric spaces by using metric balls in place of rectangles. The search algorithm uses the triangle inequality to prune out nodes. They seem to scale with dimensionality better than KD-trees (Moore, 2000), though high-dimensional data remains very challenging. Some high-dimensional datasets are intrinsically low-dimensional; various retrieval schemes have been developed that scale with a notion of intrinsic dimensionality (Beygelzimer et al., 2006).

In many applications, an exact NN is not required; something nearby is good enough. This is especially true in machine learning applications, where there is typically a lot of noise and uncertainty. Thus many researchers have switched to the problem of approximate NN search. This relaxation led to some significant breakthroughs, perhaps the most important being locality sensitive hashing (Datar et al., 2004). Spill trees (Liu et al., 2004) are another data structure for approximate NN search and have exhibited very strong performance empirically.

The present paper appears to be the first to describe a general method for efficiently finding bregman NNs; however, some related problems have been examined. (Nielsen et al., 2007) explores the geometric properties of bregman voronoi diagrams. Voronoi diagrams are of course closely related to NN search, but do not lead to an efficient NN data structure beyond dimension 2. (Guha et al., 2007) contains results on sketching bregman (and other) divergences. Sketching is related to dimensionality reduction, which is the basis for many NN schemes.

We are aware of only one NN speedup scheme for KL-divergences (Spellman & Vemuri, 2005). The results in that paper are quite limited: experiments were conducted on only one dataset and the speedup is less than 3x. Moreover, there appears to be a significant technical flaw in the derivation of their data structure. In particular, they cite a projection property for arbitrary convex sets as an equality, whereas it actually holds only as an inequality.

3. Bregman Ball Trees

This section describes the bregman ball tree data structure. The data structure and search algorithms follow the same basic program used in KD-trees and metric trees; in place of rectangular cells or metric balls, the fundamental geometric object is a bregman ball.

A bbtree defines a hierarchical space partition based on bregman balls. The data structure is a binary tree where each node i is associated with a subset of the database X_i ⊂ X. Node i additionally defines a bregman ball B(µ_i, R_i) with center µ_i and radius R_i such that X_i ⊂ B(µ_i, R_i). Interior (non-leaf) nodes of the tree have two child nodes l and r. The database points belonging to node i are split between children l and r; each point in X_i appears in exactly one of X_l or X_r.² Though X_l and X_r are disjoint, the balls B(µ_l, R_l) and B(µ_r, R_r) may overlap. The root node of the tree encapsulates the entire database. Each leaf covers a small fraction of the database; the set of all leaves covers the entirety.

²The disjointedness of the two point sets is not essential.

3.1. Searching

This subsection describes how to retrieve a query's nearest neighbor with a bbtree. Throughout, X = {x_1, . . . , x_n} is the database, q is a query, and d_f(·, ·) is a (fixed) bregman divergence. The point we are searching for is the left NN

    x_q ≡ argmin_{x∈X} d_f(x, q).

Finding the right NN (argmin_{x∈X} d_f(q, x)) is considered in section 5.

Branch and bound search locates x_q in the bbtree. First, the tree is descended; at each node, the search algorithm chooses the child for which d_f(µ, q) is smallest and ignores the sibling node (temporarily). Upon arriving at a leaf node i, the algorithm calculates d_f(x, q) for all x ∈ X_i. The closest point is the candidate NN; call it x_c. Now the algorithm must traverse back up the tree and consider the previously ignored siblings. An ignored sibling j must be explored if

    d_f(x_c, q) > min_{x ∈ B(µ_j, R_j)} d_f(x, q).    (1)

The algorithm computes the right side of (1); we come back to that in a moment. If (1) fails to hold, then node j and all of its children can be ignored, since the NN cannot be found in that subtree. Otherwise, the subtree rooted at j must be explored. This algorithm is easily adjusted to return the k-nearest neighbors.

The algorithm hinges on the computation of (1)—the bregman projection onto a bregman ball. In the ℓ2² (or arbitrary metric) case, the projection can be computed analytically with the triangle inequality. Since general bregman divergences do not satisfy this inequality, we need a different way to compute—or at least bound—the right side of (1). Computing this projection is the main technical contribution of this paper, so we discuss it separately in section 4.
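To make the structure and the branch-and-bound traversal concrete, here is a minimal Python sketch (mine, not the author's implementation). It assumes a divergence function d(x, q) and a routine min_div_to_ball(mu, R, q) returning (or lower-bounding) the right side of (1); computing that quantity is exactly what section 4 is about.

    import numpy as np

    class BBTreeNode:
        # A bbtree node: a bregman ball B(mu, R) plus either two children or a leaf's points.
        def __init__(self, mu, radius, points=None, left=None, right=None):
            self.mu, self.radius = mu, radius
            self.points = points                  # only leaf nodes store points
            self.left, self.right = left, right

    def search(node, q, d, min_div_to_ball, best=(np.inf, None)):
        # Branch-and-bound retrieval of the left NN, as described in section 3.1.
        if node.points is not None:               # leaf: scan its candidates
            for x in node.points:
                dist = d(x, q)
                if dist < best[0]:
                    best = (dist, x)
            return best
        # Visit the child whose center is closer to q first; revisit the sibling
        # only if test (1) says its ball could still contain something closer.
        for child in sorted([node.left, node.right], key=lambda c: d(c.mu, q)):
            if best[0] > min_div_to_ball(child.mu, child.radius, q):
                best = search(child, q, d, min_div_to_ball, best)
        return best

The approximate variant of section 3.2 amounts to adding a leaf budget: decrement a counter in the leaf branch and return the current best once it reaches zero.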
3.2. Approximate Search

As we mentioned in section 2.2, many practical applications do not require an exact NN. This is especially true in machine learning applications, where there is typically a lot of noise and even the representation of points used is heuristic (e.g. selecting an appropriate kernel for an SVM often involves guesswork). This flexibility is fortunate, since exact NN retrieval methods rarely work well on high-dimensional data.

Following (Liu et al., 2004), a simple way to speed up the retrieval time of the bbtree is to simply stop after only a few leaves have been examined. This idea originates from the empirical observation that metric and KD-trees often locate a point very close to the NN quickly, then spend most of the execution time backtracking. We show empirically that the quality of the NN degrades gracefully as the number of leaves examined decreases. Even when the search procedure is stopped very early, it returns a solution that is among the nearest neighbors.

3.3. Building

The performance of the search algorithm depends on how many nodes can be pruned; the more, the better. Intuitively, the balls of two siblings should be well-separated and compact. If the balls are well-separated, a query is likely to be much closer to one than the other. If the balls are compact, then the distance from a query to a ball will be a good approximation to the distance from a query to the nearest point within the ball. Thus at each level, we'd like to divide the points into two well-separated sets, each of which is compact. A natural way to do this is to use k-means, which has already been extended to bregman divergences (Banerjee et al., 2005).

The build algorithm proceeds top-down. Starting at the top, the algorithm runs k-means to partition the points into two clusters. This process is repeated recursively. The total build time is O(n log n). Clustering from the bottom up might yield better results, but the O(n² log n) build time is impractical for large datasets.
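A sketch of this top-down construction (mine again, reusing the BBTreeNode class from the previous sketch): a bare-bones two-cluster Lloyd-style bregman k-means (Banerjee et al., 2005) splits each node, with the arithmetic mean as each cluster's center, as the mean property of section 2.1 guarantees. The leaf_size parameter and helper names are choices made for this illustration.

    import numpy as np

    def kmeans_bregman(points, d, iters=10):
        # Two-cluster k-means under a bregman divergence; centers are arithmetic means.
        centers = points[np.random.choice(len(points), 2, replace=False)].copy()
        for _ in range(iters):
            assign = np.array([np.argmin([d(x, c) for c in centers]) for x in points])
            for j in (0, 1):
                if np.any(assign == j):
                    centers[j] = points[assign == j].mean(axis=0)
        return assign

    def build(points, d, leaf_size=32):
        mu = points.mean(axis=0)
        radius = max(d(x, mu) for x in points)             # smallest R with X_i in B(mu, R)
        if len(points) <= leaf_size:
            return BBTreeNode(mu, radius, points=points)
        assign = kmeans_bregman(points, d)
        if assign.min() == assign.max():                   # degenerate split: make a leaf
            return BBTreeNode(mu, radius, points=points)
        return BBTreeNode(mu, radius,
                          left=build(points[assign == 0], d, leaf_size),
                          right=build(points[assign == 1], d, leaf_size))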

4. Computing the Bound

Recall that the search procedure needs to determine if the bound

    d_f(x_c, q) > min_{x ∈ B(µ,R)} d_f(x, q)    (2)

holds, where x_c is the current candidate NN. We first show that the right side can be computed to accuracy ε in only O(log 1/ε) steps with a simple bisection search. Since we only actually need upper and lower bounds on the quantity, we then present a procedure that augments the bisection search with primal and dual bounds so that it can stop early.

The right side of (2) is a convex program:

    min_x d_f(x, q)   subject to:  d_f(x, µ) ≤ R.    (P)

The search algorithm will need to solve (P) many times in the course of locating q's NN, so we need to be able to compute a solution very quickly.

Before considering the general case, let us pause to examine the ℓ2² case. In this case, we can compute the projection x_p analytically:

    x_p = θµ + (1 − θ)q,   where 1 − θ = √(2R) / ‖q − µ‖.

What properties of this projection might extend to all bregman divergences?

1. First, x_p lies on the line between q and µ; this drastically reduces the search space from a D-dimensional convex set to a one-dimensional line.

2. Second, x_p lies on the boundary of B(µ, R)—i.e. d_f(x_p, µ) = R. Combined with property 1, this fact completely determines x_p: it is the point where the line between µ and q intersects the shell of B(µ, R).

3. Finally, since the ℓ2² ball is spherically symmetric, we can compute this intersection analytically.

We prove that the first property is a special case of a fact that holds for all bregman divergences. Additionally, the second property generalizes to bregman divergences without change. The final property does not go through, so we will not be able to find a solution to (P) analytically.

Throughout, we use q' ≡ ∇f(q), µ' ≡ ∇f(µ), etc. to simplify notation. x_p denotes the optimal solution to (P).

Claim 2. x_p' lies on the line between q' and µ'.

Proof. The lagrange dual function of (P) is

    inf_x  d_f(x, q) + λ(d_f(x, µ) − R),    (3)

where λ ≥ 0. Differentiating (3) with respect to x and setting it equal to 0, we get

    ∇f(x_p) − ∇f(q) + λ∇f(x_p) − λ∇f(µ) = 0.

We use the change of variable θ ≡ λ/(1 + λ) and rearrange to arrive at

    ∇f(x_p) = θµ' + (1 − θ)q',

where θ ∈ [0, 1).

Thus we see that property 1 of the ℓ2² projection is a special case of a relationship between the gradients; it follows from claim 2 because ∇f(x) = x for the ℓ2² divergence.

Since f is strictly convex, the gradient mapping is one-to-one. Moreover, the inverse mapping is given by the gradient of the convex conjugate, defined as

    f*(y) ≡ sup_x {⟨x, y⟩ − f(x)}.    (4)

Symbolically: ∇f maps x to x', and ∇f* maps x' back to x.

Thus to solve (P), we can look for the optimal x' along θµ' + (1 − θ)q', and then apply ∇f* to recover x_p.³ To keep notation simple, we define

    x'_θ ≡ θµ' + (1 − θ)q'   and    (5)
    x_θ ≡ ∇f*(x'_θ).    (6)

³All of the base functions in table 1 have closed form conjugates.

Now onto the second property.

Claim 3. d_f(x_p, µ) = R—i.e. the projection lies on the boundary of B(µ, R).

The claim follows from complementary slackness applied to (3). Claims 2 and 3 imply that finding the projection of q onto B(µ, R) is equivalent to

    find         θ ∈ (0, 1]
    subject to:  d_f(x_θ, µ) = R,
                 x_θ = ∇f*(θµ' + (1 − θ)q').

Fortunately, solving this program is simple.

Claim 4. d_f(x_θ, µ) is monotonic in θ.

This claim follows from the convexity of f*. Since d_f(x_θ, µ) is monotonic, we can efficiently search for θ_p satisfying d_f(x_θ_p, µ) = R using bisection search on θ. We summarize the result in the following theorem.

Theorem 5. Suppose ‖∇²f*‖ is bounded around x'_p. Then a point x satisfying

    |d_f(x, q) − d_f(x_p, q)| ≤ ε + O(ε²)

can be found in O(log 1/ε) iterations. Each iteration requires one divergence evaluation and one gradient evaluation.
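As a concrete illustration of claims 2–4, the following sketch (mine, not the paper's implementation) carries out the bisection for the KL-divergence, for which ∇f(x) = log x + 1 and ∇f*(y) = exp(y − 1) elementwise:

    import numpy as np

    def kl(x, y):
        # Bregman divergence for f(x) = sum_i x_i log x_i; the -x + y terms vanish
        # for normalized histograms, recovering the form in table 1.
        return np.sum(x * np.log(x / y) - x + y)

    grad_f     = lambda x: np.log(x) + 1.0      # nabla f
    grad_fstar = lambda y: np.exp(y - 1.0)      # nabla f*, the inverse mapping

    def project_to_ball(q, mu, R, eps=1e-8):
        # Bisection search for x_p = argmin over B(mu, R) of d_f(x, q)  (section 4).
        if kl(q, mu) <= R:
            return q                             # q is already inside the ball
        q_prime, mu_prime = grad_f(q), grad_f(mu)
        lo, hi = 0.0, 1.0                        # theta = 0 gives q, theta = 1 gives mu
        while hi - lo > eps:
            theta = (lo + hi) / 2.0
            x_theta = grad_fstar(theta * mu_prime + (1.0 - theta) * q_prime)
            if kl(x_theta, mu) > R:
                lo = theta                       # still outside the ball: move toward mu
            else:
                hi = theta                       # inside: back off toward q
        return grad_fstar(hi * mu_prime + (1.0 - hi) * q_prime)

    q, mu = np.array([0.7, 0.2, 0.1]), np.array([0.2, 0.5, 0.3])
    xp = project_to_ball(q, mu, R=0.05)
    print(kl(xp, mu))   # ~0.05: the projection lies on the boundary (claim 3)
    print(kl(xp, q))    # the exact right-hand side of (2) for this ball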

4.1. Stopping Early

Recall that the point of all this analysis is to evaluate whether

    d_f(x_c, q) > min_{x ∈ B(µ,R)} d_f(x, q),    (7)

where x_c is the current candidate NN. If (7) holds, the node in question must be searched; otherwise it can be pruned. We can evaluate the right side of (7) exactly using the bisection method described previously, but an exact solution is not needed. Suppose we have bounds a and A satisfying

    A ≥ min_{x ∈ B(µ,R)} d_f(x, q) ≥ a.

If d_f(x_c, q) < a, the node can be pruned; if d_f(x_c, q) > A, the node must be explored. We now describe upper and lower bounds that are computed at each step of the bisection search; the search proceeds until one of the two stopping conditions is met.

A lower bound is given by weak duality. The lagrange dual function is

    L(θ) ≡ d_f(x_θ, q) + (θ/(1 − θ)) (d_f(x_θ, µ) − R).    (8)

By weak duality, for any θ ∈ [0, 1),

    L(θ) ≤ min_{x ∈ B(µ,R)} d_f(x, q).    (9)

For the upper bound, we use the primal. At any θ satisfying d_f(x_θ, µ) ≤ R, we have

    d_f(x_θ, q) ≥ min_{x ∈ B(µ,R)} d_f(x, q).    (10)

Let us now put all of the pieces together. We wish to evaluate whether (7) holds. The algorithm performs bisection search on θ, attempting to locate the θ satisfying d_f(x_θ, µ) = R. At step i the algorithm evaluates θ_i on two functions. First, it checks the lower bound given by the dual function L(θ_i) defined in (8). If L(θ_i) > d_f(x_c, q), then the node can be pruned. Otherwise, if x_θ_i ∈ B(µ, R), we can update the upper bound. If d_f(x_θ_i, q) < d_f(x_c, q), then the node must be searched. Otherwise, neither bound holds, so the bisection search continues. See Algorithm 1 for pseudocode.

Algorithm 1 CanPrune
    Input: θ_l, θ_r ∈ (0, 1], q, x_c, µ ∈ R^D, R ∈ R.
    Set θ = (θ_l + θ_r)/2. Set x_θ = ∇f*(θµ' + (1 − θ)q').
    if L(θ) > d_f(x_c, q) then
        return Yes
    else if x_θ ∈ B(µ, R) and d_f(x_θ, q) < d_f(x_c, q) then
        return No
    else if d_f(x_θ, µ) > R then
        return CanPrune(θ_l, θ, q, x_c, µ, R)
    else if d_f(x_θ, µ) < R then
        return CanPrune(θ, θ_r, q, x_c, µ, R)
    end if
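For completeness, here is the same test written as an iterative loop rather than Algorithm 1's recursion (a sketch of mine, reusing kl, grad_f, and grad_fstar from the previous block); the lower bound is the dual function L(θ) of (8) and the upper bound is the primal value of (10).

    def can_prune(q, mu, R, d_xc_q, max_iters=60):
        # Early-stopping test of section 4.1: True means the node's ball cannot contain
        # anything closer than the current candidate (at divergence d_xc_q from q);
        # False means the node must be searched.
        q_prime, mu_prime = grad_f(q), grad_f(mu)
        lo, hi = 0.0, 1.0
        for _ in range(max_iters):
            theta = (lo + hi) / 2.0
            x_theta = grad_fstar(theta * mu_prime + (1.0 - theta) * q_prime)
            # Dual (weak-duality) lower bound, equation (8).
            lower = kl(x_theta, q) + theta / (1.0 - theta) * (kl(x_theta, mu) - R)
            if lower > d_xc_q:
                return True                      # even the lower bound beats the candidate
            # Primal upper bound, equation (10): a feasible point already closer than x_c.
            if kl(x_theta, mu) <= R and kl(x_theta, q) < d_xc_q:
                return False
            if kl(x_theta, mu) > R:
                lo = theta
            else:
                hi = theta
        return kl(x_theta, q) >= d_xc_q          # converged: compare the projection itself

Plugged into the search sketch of section 3, this test would replace the comparison against min_div_to_ball: a child is skipped whenever can_prune(q, child.mu, child.radius, best[0]) returns True.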

5. Left and Right NN

Since a bregman divergence can be asymmetric, it defines two NN problems:

• (lNN) return argmin_{x∈X} d_f(x, q), and
• (rNN) return argmin_{x∈X} d_f(q, x).

The bbtree data structure finds the left NN. We show that it can also be used to find the right NN.

Recall that the convex conjugate of f is defined as f*(y) ≡ sup_x {⟨x, y⟩ − f(x)}. The supremum is realized at a point x satisfying ∇f(x) = y; thus

    f*(y') = ⟨y, y'⟩ − f(y).

We use this identity to rewrite d_f(·, ·):

    d_f(x, y) = f(x) − f(y) − ⟨y', x − y⟩
              = f(x) + f*(y') − ⟨y', x⟩
              = d_f*(y', x').

This relationship provides a simple prescription for adapting the bbtree to the rNN problem: build a bbtree for the divergence d_f* and the database X' ≡ {∇f(x_1), . . . , ∇f(x_n)}. On query q, q' ≡ ∇f(q) is computed and the bbtree finds x' ∈ X' minimizing d_f*(x', q'). The point x whose gradient is x' is then the rNN to q.
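A sketch of this reduction for the KL base function (mine, reusing grad_f and grad_fstar from the section 4 sketch and build and search from the section 3 sketches; the constant-zero bound below simply disables pruning to keep the example self-contained, whereas an actual implementation would plug in the section 4 machinery for d_f* instead):

    import numpy as np

    def kl_conjugate(u, v):
        # d_{f*}(u, v) for f(x) = sum_i x_i log x_i, whose conjugate is f*(y) = sum_i exp(y_i - 1).
        return np.sum(np.exp(u - 1.0) - np.exp(v - 1.0) - np.exp(v - 1.0) * (u - v))

    def right_nn(X, q, leaf_size=32):
        X_prime = np.array([grad_f(x) for x in X])        # database mapped through nabla f
        tree = build(X_prime, kl_conjugate, leaf_size)    # bbtree under the conjugate divergence
        _, best_prime = search(tree, grad_f(q), kl_conjugate,
                               min_div_to_ball=lambda mu, R, qq: 0.0)   # placeholder bound
        return grad_fstar(best_prime)                     # map back: argmin_x d_f(q, x)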
6. Experiments

We examine the performance benefit of using bbtrees for approximate and exact NN search. All experiments were conducted with a simple C implementation that is available from the author's website.

The results are for the KL-divergence. We chose to evaluate the bbtree for the KL-divergence because it is used widely in machine learning, text mining, and computer vision; moreover, very little is known about efficient NN retrieval for it. In contrast, there has been a tremendous amount of work for speeding up the ℓ2² and Mahalanobis divergences—they both may be handled by standard metric trees and many other methods. Other bregman divergences appear much less often in applications. Still, examining the practical performance of bbtrees for these other bregman divergences is an interesting direction for future work.

We ran experiments on several challenging datasets.

• rcv-D. We used latent dirichlet allocation (LDA) (Blei et al., 2003) to generate topic histograms for 500k documents in the rcv1 corpus (Lewis et al., 2004). These histograms were generated by building a LDA model on a training set and then performing inference on 500k documents to generate their posterior dirichlet parameters. Suitably scaled, these parameters give a representation of the documents in the topic simplex (Blei et al., 2003). We generated data using this process for D = 8, 16, . . . , 256 topics.

• Corel histograms. This dataset contains 60k color histograms generated from the Corel image dataset. Each histogram is 64-dimensional.

• Semantic space. This dataset is a 371-dimensional representation of 5000 images from the Corel Stock photo collection. Each image is represented as a distribution over 371 description keywords (Rasiwasia et al., 2007).

• SIFT signatures. This dataset contains 1111-dimensional representations of 10k images from the PASCAL 2007 dataset (Everingham et al., 2007). Each point is a histogram of quantized SIFT features as suggested in (Nowak et al., 2006).

Notice that most of these datasets are fairly high-dimensional.

We are mostly interested in approximate NN retrieval, since that is likely sufficient for machine learning applications. If the bbtree is stopped early, it is not guaranteed to return an exact NN, so we need a way to evaluate the quality of the point it returns. One natural evaluation metric is this: How many points from the database are closer to the query than the returned point? Call this value NC for "number closer". If NC is small compared to the size of the database, say 10 versus 100k, then it will likely share many properties with the true NN (e.g. class label).⁴

⁴A different evaluation criterion is the approximation ratio ε satisfying d_f(x, q) ≤ (1 + ε) d_f(x_q, q), where x_q is q's true NN. We did not use this measure because it is difficult to interpret. For example, suppose we find ε = .3 approximate NNs from two different databases A and B. It could easily be the case that all points in A are 1.3-approximate NNs, whereas only the exact NN in database B is 1.3-approximate.

The results are shown in figure 2. These are strong results: the bbtree is often orders of magnitude faster than brute-force search without a substantial degradation of quality. More analysis appears in the caption.

Figure 2. Log-log plots (base 10) for the nine datasets (rcv-8, rcv-16, rcv-32, rcv-64, rcv-128, rcv-256, SIFT signatures, Semantic space, Corel histograms). The y-axis is the exponent of the speedup over brute force search; the x-axis is the exponent of the number of database points closer to the query than the reported NN. The y-axis ranges from 10^0 (no speedup) to 10^5. The x-axis ranges from 10^-2 to 10^2. All results are averages over queries not in the database.

Consider the plot for rcv-128 (center). At x = 10^0, the bbtree is returning one of the two nearest neighbors (on average) out of 500k points at a 100x speedup over brute force search. At x = 10^1, the bbtree is returning one of the eleven nearest neighbors (again, out of 500k points) and yields three orders of magnitude speedup over brute force search.

The best results are achieved on the rcv-D datasets and the Corel histogram dataset. The improvements are less pronounced for the SIFT signature and Semantic space data, which may be a result of both the high dimensionality and small size of these two datasets. Even so, we are getting useful speedups on the semantic space dataset (10-100x speedup with small error). For the SIFT signatures, we are getting a 10x speedup while receiving NNs in the top 1%.

Finally, we consider exact NN retrieval. It is well known that finding a (guaranteed) exact NN in moderate to high-dimensional databases is very challenging. In particular, metric trees, KD-trees, and relatives typically afford a reasonable speedup in moderate dimensions, but the speedup diminishes with increasing dimensionality (Moore, 2000; Liu et al., 2004). When used for exact search, the bbtree reflects this basic pattern. Table 2 shows the results. The bbtree provides a substantial speedup on the moderate-dimensional databases (up through D = 256), but no speedup on the two databases of highest dimensionality.

Table 2. Exact search speedup over brute-force search.

    dataset             dimensionality    speedup
    rcv-8               8                 64.5
    rcv-16              16                36.7
    rcv-32              32                21.9
    rcv-64              64                12.0
    corel histograms    64                2.4
    rcv-128             128               5.3
    rcv-256             256               3.3
    semantic space      371               1.0
    SIFT signatures     1111              0.9

7. Conclusion

In this paper, we introduced bregman ball trees and demonstrated their efficacy in NN search. The experiments demonstrated that bbtrees can speed up approximate NN retrieval for the KL-divergence by orders of magnitude over brute force search. There are many possible directions for future research. On the practical side, which ideas behind the many variants of metric trees might be useful for bbtrees? On the theoretical side, what is a good notion of intrinsic dimensionality for bregman divergences and can a practical data structure be designed around it?

Acknowledgements

Thanks to Serge Belongie, Sanjoy Dasgupta, Charles Elkan, Carolina Galleguillos, Daniel Hsu, Nikhil Rasiwasia, and Lawrence Saul. Support was provided by the NSF under grants IIS-0347646 and IIS-0713540.

References

Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with bregman divergences. JMLR.

Beygelzimer, A., Kakade, S., & Langford, J. (2006). Cover trees for nearest neighbor. ICML.

Blei, D., Ng, A., & Jordan, M. (2003). Latent dirichlet allocation. JMLR.

Bregman, L. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational and Mathematical Physics, 7, 200–217.

Datar, M., Immorlica, N., Indyk, P., & Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. SCG 2004.

Everingham, M., Gool, L. V., Williams, C. K., Winn, J., & Zisserman, A. (2007). The PASCAL Visual Object Classes Challenge 2007 Results.

Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3), 209–226.

Gray, R. M., Buzo, A., Gray, A. H., & Matsuyama, Y. (1980). Distortion measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing.

Guha, S., Indyk, P., & McGregor, A. (2007). Sketching information divergences. COLT.

Lewis, D. D., Yang, Y., Rose, T., & Li, F. (2004). Rcv1: A new benchmark collection for text categorization research. JMLR.

Liu, T., Moore, A. W., Gray, A., & Yang, K. (2004). An investigation of practical approximate nearest neighbor algorithms. NIPS.

Moore, A. W. (2000). Using the triangle inequality to survive high-dimensional data. UAI.

Nielsen, F., Boissonnat, J.-D., & Nock, R. (2007). On bregman voronoi diagrams. SODA (pp. 746–755).

Nowak, E., Jurie, F., & Triggs, B. (2006). Sampling strategies for bag-of-features image classification. ECCV.

Omohundro, S. (1989). Five balltree construction algorithms (Technical Report). ICSI.

Pereira, F., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. 31st Annual Meeting of the ACL (pp. 183–190).

Puzicha, J., Buhmann, J., Rubner, Y., & Tomasi, C. (1999). Empirical evaluation of dissimilarity measures for color and texture. ICCV.

Rasiwasia, N., Moreno, P., & Vasconcelos, N. (2007). Bridging the gap: query by semantic example. IEEE Transactions on Multimedia.

Spellman, E., & Vemuri, B. (2005). Efficient shape indexing using an information theoretic representation. International Conference on Image and Video Retrieval.

Uhlmann, J. K. (1991). Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40, 175–179.

Weinberger, K., Blitzer, J., & Saul, L. (2006). Distance metric learning for large margin nearest neighbor classification. NIPS.

Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. SODA.