Fast Nearest Neighbor Retrieval for Bregman Divergences

Lawrence Cayton [email protected] Department of Computer Science and Engineering, University of California, San Diego, CA 92093

Abstract

We present a data structure enabling efficient nearest neighbor (NN) retrieval for bregman divergences. The family of bregman divergences includes many popular dissimilarity measures including KL-divergence (relative entropy), Mahalanobis distance, and Itakura-Saito divergence. These divergences present a challenge for efficient NN retrieval because they are not, in general, metrics, for which most NN data structures are designed. The data structure introduced in this work shares the same basic structure as the popular metric ball tree, but employs convexity properties of bregman divergences in place of the triangle inequality. Experiments demonstrate speedups over brute-force search of up to several orders of magnitude.

1. Introduction

Nearest neighbor (NN) search is a core primitive in machine learning, vision, signal processing, and elsewhere. Given a database X, a dissimilarity measure d, and a query q, the goal is to find the x ∈ X minimizing d(x, q). Brute-force search is often impractical given the size and dimensionality of modern data sets, so many data structures have been developed to accelerate NN retrieval.

Most retrieval data structures are for the ℓ2 norm and, more generally, metrics. Though many dissimilarity measures are metrics, many are not. For example, the natural notion of dissimilarity between probability distributions is the KL-divergence (relative entropy), which is not a metric. It has been used to compare histograms in a wide variety of applications, including text analysis, image classification, and content-based image retrieval (Pereira et al., 1993; Puzicha et al., 1999; Rasiwasia et al., 2007). Because the KL-divergence does not satisfy the triangle inequality, very little of the research on NN retrieval structures applies.

The KL-divergence belongs to a broad family of dissimilarities called bregman divergences. Other examples include the Mahalanobis distance, used e.g. in classification (Weinberger et al., 2006); the Itakura-Saito divergence, used in sound processing (Gray et al., 1980); and the ℓ2² distance. Bregman divergences present a challenge for fast NN retrieval since they need not be symmetric or satisfy the triangle inequality.

This paper introduces bregman ball trees (bbtrees), the first NN retrieval data structure for general bregman divergences. The data structure is a relative of the popular metric ball tree (Omohundro, 1989; Uhlmann, 1991; Moore, 2000). Since that data structure is built on the triangle inequality, the extension to bregman divergences is non-trivial.

A bbtree defines a hierarchical space decomposition based on bregman balls; retrieving a NN with the tree requires computing bounds on the bregman divergence from a query to these balls. We show that this divergence can be computed exactly with a simple bisection search that is very efficient. Since only bounds on the divergence are needed, we can often stop the search early using primal and dual function evaluations.

In the experiments, we show that the bbtree provides a substantial speedup—often orders of magnitude—over brute-force search.

2. Background

This section provides background on bregman divergences and NN search.

2.1. Bregman Divergences

First we briefly overview bregman divergences.

Definition 1 (Bregman, 1967). Let f be a strictly convex differentiable function.¹ The bregman divergence based on f is

    d_f(x, y) ≡ f(x) − f(y) − ⟨∇f(y), x − y⟩.

¹Additional technical restrictions are typically put on f. In particular, f is assumed to be Legendre.

Figure 1. The bregman divergence between x and y.

Table 1. Some standard bregman divergences.

                      f(x)                      d_f(x, y)
    ℓ2²               (1/2)‖x‖²                 (1/2)‖x − y‖²
    KL                Σ_i x_i log x_i           Σ_i x_i log(x_i/y_i)
    Mahalanobis       (1/2) x^T Q x             (1/2)(x − y)^T Q (x − y)
    Itakura-Saito     −Σ_i log x_i              Σ_i (x_i/y_i − log(x_i/y_i) − 1)
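To make the definition concrete, the following is a small Python/NumPy sketch (an illustration of mine, not code from the paper) that evaluates d_f directly from a base function and its gradient, and checks the result against the closed forms in table 1 for the KL and Itakura-Saito divergences; the function names are invented for this example.

    import numpy as np

    def bregman(f, grad_f, x, y):
        # d_f(x, y) = f(x) - f(y) - <grad f(y), x - y>
        return f(x) - f(y) - np.dot(grad_f(y), x - y)

    # Base functions from table 1.
    f_kl    = lambda x: np.sum(x * np.log(x))      # negative entropy
    grad_kl = lambda x: np.log(x) + 1.0
    f_is    = lambda x: -np.sum(np.log(x))         # Itakura-Saito base function
    grad_is = lambda x: -1.0 / x

    x = np.array([0.2, 0.3, 0.5])
    y = np.array([0.1, 0.6, 0.3])

    # Closed forms from table 1 (x and y are normalized, so the KL forms agree).
    print(np.isclose(bregman(f_kl, grad_kl, x, y), np.sum(x * np.log(x / y))))          # True
    print(np.isclose(bregman(f_is, grad_is, x, y), np.sum(x / y - np.log(x / y) - 1)))  # True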

One can interpret the bregman divergence as the distance between a function and its first-order taylor expansion. In particular, d_f(x, y) is the difference between f(x) and the linear approximation of f(x) centered at y; see figure 1. Since f is convex, d_f(x, y) is always nonnegative.

Some standard bregman divergences and their base functions are listed in table 1.

A bregman divergence is typically used to assess similarity between two objects, much like a metric. But though metrics and bregman divergences are both used for similarity assessment, they do not share the same fundamental properties. Metrics satisfy three basic properties: non-negativity: d(x, y) ≥ 0; symmetry: d(x, y) = d(y, x); and, perhaps most importantly, the triangle inequality: d(x, z) ≤ d(x, y) + d(y, z). Bregman divergences are nonnegative; however, they do not satisfy the triangle inequality (in general) and can be asymmetric.

Bregman divergences do satisfy a variety of geometric properties, a couple of which we will need later. The bregman divergence d_f(x, y) is convex in x, but not necessarily in y. Define the bregman ball of radius R around µ as

    B(µ, R) ≡ {x : d_f(x, µ) ≤ R}.

Since d_f(x, µ) is convex in x, B(µ, R) is a convex set.

Another interesting property concerns means. For a set of points X, the mean under a bregman divergence is well defined and, interestingly, is independent of the choice of divergence:

    µ_X ≡ argmin_µ Σ_{x∈X} d_f(x, µ) = (1/|X|) Σ_{x∈X} x.

This fact can be used to extend k-means to the family of bregman divergences (Banerjee et al., 2005).
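As a quick numerical sanity check of the mean property (again a sketch of mine, not from the paper), the arithmetic mean of a point set beats nearby competitors under two quite different divergences:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.dirichlet(np.ones(5), size=20)        # 20 points on the probability simplex
    mu = X.mean(axis=0)                           # arithmetic mean

    kl            = lambda x, y: np.sum(x * np.log(x / y))
    itakura_saito = lambda x, y: np.sum(x / y - np.log(x / y) - 1.0)
    cost          = lambda d, c: sum(d(x, c) for x in X)

    for d in (kl, itakura_saito):
        z = np.abs(mu + 0.01 * rng.normal(size=mu.shape))
        z /= z.sum()                              # keep the competitor on the simplex
        print(cost(d, mu) < cost(d, z))           # True for both divergences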
2.2. NN Search

Because of the tremendous practical and theoretical importance of nearest neighbor search in machine learning, computational geometry, databases, and elsewhere, many retrieval schemes have been developed to reduce the computational cost of finding NNs.

KD-trees (Friedman et al., 1977) are one of the earliest and most popular data structures for NN retrieval. The data structure and accompanying search algorithm provide a blueprint for a huge body of future work (including the present one). The tree defines a hierarchical space partition where each node defines an axis-aligned rectangle. The search algorithm is a simple branch and bound exploration of the tree. Though KD-trees are useful in many applications, their performance has been widely observed to degrade badly with the dimensionality of the database.

Metric ball trees (Omohundro, 1989; Uhlmann, 1991; Yianilos, 1993; Moore, 2000) extend the basic methodology behind KD-trees to metric spaces by using metric balls in place of rectangles. The search algorithm uses the triangle inequality to prune out nodes. They seem to scale with dimensionality better than KD-trees (Moore, 2000), though high-dimensional data remains very challenging. Some high-dimensional datasets are intrinsically low-dimensional; various retrieval schemes have been developed that scale with a notion of intrinsic dimensionality (Beygelzimer et al., 2006).

In many applications, an exact NN is not required; something nearby is good enough. This is especially true in machine learning applications, where there is typically a lot of noise and uncertainty. Thus many researchers have switched to the problem of approximate NN search. This relaxation led to some significant breakthroughs, perhaps the most important being locality sensitive hashing (Datar et al., 2004). Spill trees (Liu et al., 2004) are another data structure for approximate NN search and have exhibited very strong performance empirically.

The present paper appears to be the first to describe a general method for efficiently finding bregman NNs; however, some related problems have been examined. (Nielsen et al., 2007) explores the geometric properties of bregman voronoi diagrams. Voronoi diagrams are of course closely related to NN search, but do not lead to an efficient NN data structure beyond dimension 2. (Guha et al., 2007) contains results on sketching bregman (and other) divergences. Sketching is related to dimensionality reduction, which is the basis for many NN schemes.

We are aware of only one NN speedup scheme for KL-divergences (Spellman & Vemuri, 2005). The results in that paper are quite limited: experiments were conducted on only one dataset and the speedup is less than 3x. Moreover, there appears to be a significant technical flaw in the derivation of their data structure. In particular, they cite a projection property for arbitrary convex sets as an equality, whereas it actually holds only as an inequality.

3. Bregman Ball Trees

This section describes the bregman ball tree data structure. The data structure and search algorithms follow the same basic program used in KD-trees and metric trees; in place of rectangular cells or metric balls, the fundamental geometric object is a bregman ball.

A bbtree defines a hierarchical space partition based on bregman balls. The data structure is a binary tree where each node i is associated with a subset of the database X_i ⊂ X. Node i additionally defines a bregman ball B(µ_i, R_i) with center µ_i and radius R_i such that X_i ⊂ B(µ_i, R_i). Interior (non-leaf) nodes of the tree have two child nodes l and r. The database points belonging to node i are split between children l and r; each point in X_i appears in exactly one of X_l or X_r.² Though X_l and X_r are disjoint, the balls B(µ_l, R_l) and B(µ_r, R_r) may overlap. The root node of the tree encapsulates the entire database. Each leaf covers a small fraction of the database; the set of all leaves covers the entirety.

²The disjointedness of the two point sets is not essential.

3.1. Searching

This subsection describes how to retrieve a query's nearest neighbor with a bbtree. Throughout, X = {x_1, . . . , x_n} is the database, q is a query, and d_f(·, ·) is a (fixed) bregman divergence. The point we are searching for is the left NN

    x_q ≡ argmin_{x∈X} d_f(x, q).

Finding the right NN (argmin_{x∈X} d_f(q, x)) is considered in section 5.

Branch and bound search locates x_q in the bbtree. First, the tree is descended; at each node, the search algorithm chooses the child for which d_f(µ, q) is smallest and ignores the sibling node (temporarily). Upon arriving at a leaf node i, the algorithm calculates d_f(x, q) for all x ∈ X_i. The closest point is the candidate NN; call it x_c. Now the algorithm must traverse back up the tree and consider the previously ignored siblings. An ignored sibling j must be explored if

    d_f(x_c, q) > min_{x ∈ B(µ_j, R_j)} d_f(x, q).    (1)

The algorithm computes the right side of (1); we come back to that in a moment. If (1) fails to hold, then node j and all of its children can be ignored, since the NN cannot be found in that subtree. Otherwise, the subtree rooted at j must be explored. This algorithm is easily adjusted to return the k-nearest neighbors.

The algorithm hinges on the computation of (1)—the bregman projection onto a bregman ball. In the ℓ2² (or arbitrary metric) case, the projection can be computed analytically with the triangle inequality. Since general bregman divergences do not satisfy this inequality, we need a different way to compute—or at least bound—the right side of (1). Computing this projection is the main technical contribution of this paper, so we discuss it separately in section 4.
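To make the structure and the branch-and-bound traversal concrete, here is a minimal Python sketch (mine, not the author's implementation). It assumes a divergence function d(x, q) and a routine min_div_to_ball(mu, R, q) returning (or lower-bounding) the right side of (1); computing that quantity is exactly what section 4 is about.

    import numpy as np

    class BBTreeNode:
        # A bbtree node: a bregman ball B(mu, R) plus either two children or a leaf's points.
        def __init__(self, mu, radius, points=None, left=None, right=None):
            self.mu, self.radius = mu, radius
            self.points = points                  # only leaf nodes store points
            self.left, self.right = left, right

    def search(node, q, d, min_div_to_ball, best=(np.inf, None)):
        # Branch-and-bound retrieval of the left NN, as described in section 3.1.
        if node.points is not None:               # leaf: scan its candidates
            for x in node.points:
                dist = d(x, q)
                if dist < best[0]:
                    best = (dist, x)
            return best
        # Visit the child whose center is closer to q first; revisit the sibling
        # only if test (1) says its ball could still contain something closer.
        for child in sorted([node.left, node.right], key=lambda c: d(c.mu, q)):
            if best[0] > min_div_to_ball(child.mu, child.radius, q):
                best = search(child, q, d, min_div_to_ball, best)
        return best

The approximate variant of section 3.2 amounts to adding a leaf budget: decrement a counter in the leaf branch and return the current best once it reaches zero.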
3.2. Approximate Search

As we mentioned in section 2.2, many practical applications do not require an exact NN. This is especially true in machine learning applications, where there is typically a lot of noise and even the representation of points used is heuristic (e.g. selecting an appropriate kernel for an SVM often involves guesswork). This flexibility is fortunate, since exact NN retrieval methods rarely work well on high-dimensional data.

Following (Liu et al., 2004), a simple way to speed up the retrieval time of the bbtree is to simply stop after only a few leaves have been examined. This idea originates from the empirical observation that metric and KD-trees often locate a point very close to the NN quickly, then spend most of the execution time backtracking. We show empirically that the quality of the NN degrades gracefully as the number of leaves examined decreases. Even when the search procedure is stopped very early, it returns a solution that is among the nearest neighbors.

3.3. Building

The performance of the search algorithm depends on how many nodes can be pruned; the more, the better. Intuitively, the balls of two siblings should be well-separated and compact. If the balls are well-separated, a query is likely to be much closer to one than the other. If the balls are compact, then the distance from a query to a ball will be a good approximation to the distance from a query to the nearest point within the ball. Thus at each level, we'd like to divide the points into two well-separated sets, each of which is compact. A natural way to do this is to use k-means, which has already been extended to bregman divergences (Banerjee et al., 2005).

The build algorithm proceeds top-down. Starting at the top, the algorithm runs k-means to partition the points into two clusters. This process is repeated recursively. The total build time is O(n log n). Clustering from the bottom up might yield better results, but the O(n² log n) build time is impractical for large datasets.
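A sketch of this top-down construction (mine again, reusing the BBTreeNode class from the previous sketch): a bare-bones two-cluster Lloyd-style bregman k-means (Banerjee et al., 2005) splits each node, with the arithmetic mean as each cluster's center, as the mean property of section 2.1 guarantees. The leaf_size parameter and helper names are choices made for this illustration.

    import numpy as np

    def kmeans_bregman(points, d, iters=10):
        # Two-cluster k-means under a bregman divergence; centers are arithmetic means.
        centers = points[np.random.choice(len(points), 2, replace=False)].copy()
        for _ in range(iters):
            assign = np.array([np.argmin([d(x, c) for c in centers]) for x in points])
            for j in (0, 1):
                if np.any(assign == j):
                    centers[j] = points[assign == j].mean(axis=0)
        return assign

    def build(points, d, leaf_size=32):
        mu = points.mean(axis=0)
        radius = max(d(x, mu) for x in points)             # smallest R with X_i in B(mu, R)
        if len(points) <= leaf_size:
            return BBTreeNode(mu, radius, points=points)
        assign = kmeans_bregman(points, d)
        if assign.min() == assign.max():                   # degenerate split: make a leaf
            return BBTreeNode(mu, radius, points=points)
        return BBTreeNode(mu, radius,
                          left=build(points[assign == 0], d, leaf_size),
                          right=build(points[assign == 1], d, leaf_size))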

4. Computing the Bound

Recall that the search procedure needs to determine if the bound

    d_f(x_c, q) > min_{x ∈ B(µ,R)} d_f(x, q)    (2)

holds, where x_c is the current candidate NN. We first show that the right side can be computed to accuracy ε in only O(log 1/ε) steps with a simple bisection search. Since we only actually need upper and lower bounds on the quantity, we then present a procedure that augments the bisection search with primal and dual bounds so that it can stop early.

The right side of (2) is a convex program:

    min_x d_f(x, q)   subject to:  d_f(x, µ) ≤ R.    (P)

The search algorithm will need to solve (P) many times in the course of locating q's NN, so we need to be able to compute a solution very quickly.

Before considering the general case, let us pause to examine the ℓ2² case. In this case, we can compute the projection x_p analytically:

    x_p = θµ + (1 − θ)q,   where 1 − θ = √(2R) / ‖q − µ‖.

What properties of this projection might extend to all bregman divergences?

1. First, x_p lies on the line between q and µ; this drastically reduces the search space from a D-dimensional convex set to a one-dimensional line.

2. Second, x_p lies on the boundary of B(µ, R)—i.e. d_f(x_p, µ) = R. Combined with property 1, this fact completely determines x_p: it is the point where the line between µ and q intersects the shell of B(µ, R).

3. Finally, since the ℓ2² ball is spherically symmetric, we can compute this intersection analytically.

We prove that the first property is a special case of a fact that holds for all bregman divergences. Additionally, the second property generalizes to bregman divergences without change. The final property does not go through, so we will not be able to find a solution to (P) analytically.

Throughout, we use q' ≡ ∇f(q), µ' ≡ ∇f(µ), etc. to simplify notation. x_p denotes the optimal solution to (P).

Claim 2. x_p' lies on the line between q' and µ'.

Proof. The lagrange dual function of (P) is

    inf_x  d_f(x, q) + λ(d_f(x, µ) − R),    (3)

where λ ≥ 0. Differentiating (3) with respect to x and setting it equal to 0, we get

    ∇f(x_p) − ∇f(q) + λ∇f(x_p) − λ∇f(µ) = 0.

We use the change of variable θ ≡ λ/(1 + λ) and rearrange to arrive at

    ∇f(x_p) = θµ' + (1 − θ)q',

where θ ∈ [0, 1).

Thus we see that property 1 of the ℓ2² projection is a special case of a relationship between the gradients; it follows from claim 2 because ∇f(x) = x for the ℓ2² divergence.

Since f is strictly convex, the gradient mapping is one-to-one. Moreover, the inverse mapping is given by the gradient of the convex conjugate, defined as

    f*(y) ≡ sup_x {⟨x, y⟩ − f(x)}.    (4)

Symbolically: ∇f maps x to x', and ∇f* maps x' back to x.

Thus to solve (P), we can look for the optimal x' along θµ' + (1 − θ)q', and then apply ∇f* to recover x_p.³ To keep notation simple, we define

    x'_θ ≡ θµ' + (1 − θ)q'   and    (5)
    x_θ ≡ ∇f*(x'_θ).    (6)

³All of the base functions in table 1 have closed form conjugates.

Now onto the second property.

Claim 3. d_f(x_p, µ) = R—i.e. the projection lies on the boundary of B(µ, R).

The claim follows from complementary slackness applied to (3). Claims 2 and 3 imply that finding the projection of q onto B(µ, R) is equivalent to

    find         θ ∈ (0, 1]
    subject to:  d_f(x_θ, µ) = R,
                 x_θ = ∇f*(θµ' + (1 − θ)q').

Fortunately, solving this program is simple.

Claim 4. d_f(x_θ, µ) is monotonic in θ.

This claim follows from the convexity of f*. Since d_f(x_θ, µ) is monotonic, we can efficiently search for θ_p satisfying d_f(x_θ_p, µ) = R using bisection search on θ. We summarize the result in the following theorem.

Theorem 5. Suppose ‖∇²f*‖ is bounded around x'_p. Then a point x satisfying

    |d_f(x, q) − d_f(x_p, q)| ≤ ε + O(ε²)

can be found in O(log 1/ε) iterations. Each iteration requires one divergence evaluation and one gradient evaluation.
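As a concrete illustration of claims 2–4, the following sketch (mine, not the paper's implementation) carries out the bisection for the KL-divergence, for which ∇f(x) = log x + 1 and ∇f*(y) = exp(y − 1) elementwise:

    import numpy as np

    def kl(x, y):
        # Bregman divergence for f(x) = sum_i x_i log x_i; the -x + y terms vanish
        # for normalized histograms, recovering the form in table 1.
        return np.sum(x * np.log(x / y) - x + y)

    grad_f     = lambda x: np.log(x) + 1.0      # nabla f
    grad_fstar = lambda y: np.exp(y - 1.0)      # nabla f*, the inverse mapping

    def project_to_ball(q, mu, R, eps=1e-8):
        # Bisection search for x_p = argmin over B(mu, R) of d_f(x, q)  (section 4).
        if kl(q, mu) <= R:
            return q                             # q is already inside the ball
        q_prime, mu_prime = grad_f(q), grad_f(mu)
        lo, hi = 0.0, 1.0                        # theta = 0 gives q, theta = 1 gives mu
        while hi - lo > eps:
            theta = (lo + hi) / 2.0
            x_theta = grad_fstar(theta * mu_prime + (1.0 - theta) * q_prime)
            if kl(x_theta, mu) > R:
                lo = theta                       # still outside the ball: move toward mu
            else:
                hi = theta                       # inside: back off toward q
        return grad_fstar(hi * mu_prime + (1.0 - hi) * q_prime)

    q, mu = np.array([0.7, 0.2, 0.1]), np.array([0.2, 0.5, 0.3])
    xp = project_to_ball(q, mu, R=0.05)
    print(kl(xp, mu))   # ~0.05: the projection lies on the boundary (claim 3)
    print(kl(xp, q))    # the exact right-hand side of (2) for this ball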

4.1. Stopping Early

Recall that the point of all this analysis is to evaluate whether

    d_f(x_c, q) > min_{x ∈ B(µ,R)} d_f(x, q),    (7)

where x_c is the current candidate NN. If (7) holds, the node in question must be searched; otherwise it can be pruned. We can evaluate the right side of (7) exactly using the bisection method described previously, but an exact solution is not needed. Suppose we have bounds a and A satisfying

    A ≥ min_{x ∈ B(µ,R)} d_f(x, q) ≥ a.

If d_f(x_c, q) < a, the node can be pruned; if d_f(x_c, q) > A, the node must be explored. We now describe upper and lower bounds that are computed at each step of the bisection search; the search proceeds until one of the two stopping conditions is met.

A lower bound is given by weak duality. The lagrange dual function is

    L(θ) ≡ d_f(x_θ, q) + (θ/(1 − θ)) (d_f(x_θ, µ) − R).    (8)

By weak duality, for any θ ∈ [0, 1),

    L(θ) ≤ min_{x ∈ B(µ,R)} d_f(x, q).    (9)

For the upper bound, we use the primal. At any θ satisfying d_f(x_θ, µ) ≤ R, we have

    d_f(x_θ, q) ≥ min_{x ∈ B(µ,R)} d_f(x, q).    (10)

Let us now put all of the pieces together. We wish to evaluate whether (7) holds. The algorithm performs bisection search on θ, attempting to locate the θ satisfying d_f(x_θ, µ) = R. At step i the algorithm evaluates θ_i on two functions. First, it checks the lower bound given by the dual function L(θ_i) defined in (8). If L(θ_i) > d_f(x_c, q), then the node can be pruned. Otherwise, if x_θ_i ∈ B(µ, R), we can update the upper bound. If d_f(x_θ_i, q) < d_f(x_c, q), then the node must be searched. Otherwise, neither bound holds, so the bisection search continues. See Algorithm 1 for pseudocode.

Algorithm 1 CanPrune
    Input: θ_l, θ_r ∈ (0, 1], q, x_c, µ ∈ R^D, R ∈ R.
    Set θ = (θ_l + θ_r)/2. Set x_θ = ∇f*(θµ' + (1 − θ)q').
    if L(θ) > d_f(x_c, q) then
        return Yes
    else if x_θ ∈ B(µ, R) and d_f(x_θ, q) < d_f(x_c, q) then
        return No
    else if d_f(x_θ, µ) > R then
        return CanPrune(θ_l, θ, q, x_c, µ, R)
    else if d_f(x_θ, µ) < R then
        return CanPrune(θ, θ_r, q, x_c, µ, R)
    end if
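For completeness, here is the same test written as an iterative loop rather than Algorithm 1's recursion (a sketch of mine, reusing kl, grad_f, and grad_fstar from the previous block); the lower bound is the dual function L(θ) of (8) and the upper bound is the primal value of (10).

    def can_prune(q, mu, R, d_xc_q, max_iters=60):
        # Early-stopping test of section 4.1: True means the node's ball cannot contain
        # anything closer than the current candidate (at divergence d_xc_q from q);
        # False means the node must be searched.
        q_prime, mu_prime = grad_f(q), grad_f(mu)
        lo, hi = 0.0, 1.0
        for _ in range(max_iters):
            theta = (lo + hi) / 2.0
            x_theta = grad_fstar(theta * mu_prime + (1.0 - theta) * q_prime)
            # Dual (weak-duality) lower bound, equation (8).
            lower = kl(x_theta, q) + theta / (1.0 - theta) * (kl(x_theta, mu) - R)
            if lower > d_xc_q:
                return True                      # even the lower bound beats the candidate
            # Primal upper bound, equation (10): a feasible point already closer than x_c.
            if kl(x_theta, mu) <= R and kl(x_theta, q) < d_xc_q:
                return False
            if kl(x_theta, mu) > R:
                lo = theta
            else:
                hi = theta
        return kl(x_theta, q) >= d_xc_q          # converged: compare the projection itself

Plugged into the search sketch of section 3, this test would replace the comparison against min_div_to_ball: a child is skipped whenever can_prune(q, child.mu, child.radius, best[0]) returns True.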

5. Left and Right NN

Since a bregman divergence can be asymmetric, it defines two NN problems:

• (lNN) return argmin_{x∈X} d_f(x, q), and
• (rNN) return argmin_{x∈X} d_f(q, x).

The bbtree data structure finds the left NN. We show that it can also be used to find the right NN.

Recall that the convex conjugate of f is defined as f*(y) ≡ sup_x {⟨x, y⟩ − f(x)}. The supremum is realized at a point x satisfying ∇f(x) = y; thus

    f*(y') = ⟨y, y'⟩ − f(y).

We use this identity to rewrite d_f(·, ·):

    d_f(x, y) = f(x) − f(y) − ⟨y', x − y⟩
              = f(x) + f*(y') − ⟨y', x⟩
              = d_f*(y', x').

This relationship provides a simple prescription for adapting the bbtree to the rNN problem: build a bbtree for the divergence d_f* and the database X' ≡ {∇f(x_1), . . . , ∇f(x_n)}. On query q, q' ≡ ∇f(q) is computed and the bbtree finds x' ∈ X' minimizing d_f*(x', q'). The point x whose gradient is x' is then the rNN to q.
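A sketch of this reduction for the KL base function (mine, reusing grad_f and grad_fstar from the section 4 sketch and build and search from the section 3 sketches; the constant-zero bound below simply disables pruning to keep the example self-contained, whereas an actual implementation would plug in the section 4 machinery for d_f* instead):

    import numpy as np

    def kl_conjugate(u, v):
        # d_{f*}(u, v) for f(x) = sum_i x_i log x_i, whose conjugate is f*(y) = sum_i exp(y_i - 1).
        return np.sum(np.exp(u - 1.0) - np.exp(v - 1.0) - np.exp(v - 1.0) * (u - v))

    def right_nn(X, q, leaf_size=32):
        X_prime = np.array([grad_f(x) for x in X])        # database mapped through nabla f
        tree = build(X_prime, kl_conjugate, leaf_size)    # bbtree under the conjugate divergence
        _, best_prime = search(tree, grad_f(q), kl_conjugate,
                               min_div_to_ball=lambda mu, R, qq: 0.0)   # placeholder bound
        return grad_fstar(best_prime)                     # map back: argmin_x d_f(q, x)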
6. Experiments

We examine the performance benefit of using bbtrees for approximate and exact NN search. All experiments were conducted with a simple C implementation that is available from the author's website.

The results are for the KL-divergence. We chose to evaluate the bbtree for the KL-divergence because it is used widely in machine learning, text mining, and computer vision; moreover, very little is known about efficient NN retrieval for it. In contrast, there has been a tremendous amount of work for speeding up the ℓ2² and Mahalanobis divergences—they both may be handled by standard metric trees and many other methods. Other bregman divergences appear much less often in applications. Still, examining the practical performance of bbtrees for these other bregman divergences is an interesting direction for future work.

We ran experiments on several challenging datasets.

• rcv-D. We used latent dirichlet allocation (LDA) (Blei et al., 2003) to generate topic histograms for 500k documents in the rcv1 corpus (Lewis et al., 2004). These histograms were generated by building a LDA model on a training set and then performing inference on 500k documents to generate their posterior dirichlet parameters. Suitably scaled, these parameters give a representation of the documents in the topic simplex (Blei et al., 2003). We generated data using this process for D = 8, 16, . . . , 256 topics.

• Corel histograms. This dataset contains 60k color histograms generated from the Corel image dataset. Each histogram is 64-dimensional.

• Semantic space. This dataset is a 371-dimensional representation of 5000 images from the Corel Stock photo collection. Each image is represented as a distribution over 371 description keywords (Rasiwasia et al., 2007).

• SIFT signatures. This dataset contains 1111-dimensional representations of 10k images from the PASCAL 2007 dataset (Everingham et al., 2007). Each point is a histogram of quantized SIFT features as suggested in (Nowak et al., 2006).

Notice that most of these datasets are fairly high-dimensional.

We are mostly interested in approximate NN retrieval, since that is likely sufficient for machine learning applications. If the bbtree is stopped early, it is not guaranteed to return an exact NN, so we need a way to evaluate the quality of the point it returns. One natural evaluation metric is this: How many points from the database are closer to the query than the returned point? Call this value NC for "number closer". If NC is small compared to the size of the database, say 10 versus 100k, then it will likely share many properties with the true NN (e.g. class label).⁴

⁴A different evaluation criterion is the approximation ratio ε satisfying d_f(x, q) ≤ (1 + ε) d_f(x_q, q), where x_q is q's true NN. We did not use this measure because it is difficult to interpret. For example, suppose we find ε = .3 approximate NNs from two different databases A and B. It could easily be the case that all points in A are 1.3-approximate NNs, whereas only the exact NN in database B is 1.3-approximate.

The results are shown in figure 2. These are strong results: the bbtree is often orders of magnitude faster than brute-force search without a substantial degradation of quality. More analysis appears in the caption.

Figure 2. Log-log plots (base 10) for the nine datasets (rcv-8, rcv-16, rcv-32, rcv-64, rcv-128, rcv-256, SIFT signatures, Semantic space, Corel histograms). The y-axis is the exponent of the speedup over brute force search; the x-axis is the exponent of the number of database points closer to the query than the reported NN. The y-axis ranges from 10^0 (no speedup) to 10^5. The x-axis ranges from 10^-2 to 10^2. All results are averages over queries not in the database.

Consider the plot for rcv-128 (center). At x = 10^0, the bbtree is returning one of the two nearest neighbors (on average) out of 500k points at a 100x speedup over brute force search. At x = 10^1, the bbtree is returning one of the eleven nearest neighbors (again, out of 500k points) and yields three orders of magnitude speedup over brute force search.

The best results are achieved on the rcv-D datasets and the Corel histogram dataset. The improvements are less pronounced for the SIFT signature and Semantic space data, which may be a result of both the high dimensionality and small size of these two datasets. Even so, we are getting useful speedups on the semantic space dataset (10-100x speedup with small error). For the SIFT signatures, we are getting a 10x speedup while receiving NNs in the top 1%.

Finally, we consider exact NN retrieval. It is well known that finding a (guaranteed) exact NN in moderate to high-dimensional databases is very challenging. In particular, metric trees, KD-trees, and relatives typically afford a reasonable speedup in moderate dimensions, but the speedup diminishes with increasing dimensionality (Moore, 2000; Liu et al., 2004). When used for exact search, the bbtree reflects this basic pattern. Table 2 shows the results. The bbtree provides a substantial speedup on the moderate-dimensional databases (up through D = 256), but no speedup on the two databases of highest dimensionality.

Table 2. Exact search speedup over brute-force search.

    dataset             dimensionality    speedup
    rcv-8               8                 64.5
    rcv-16              16                36.7
    rcv-32              32                21.9
    rcv-64              64                12.0
    corel histograms    64                2.4
    rcv-128             128               5.3
    rcv-256             256               3.3
    semantic space      371               1.0
    SIFT signatures     1111              0.9

7. Conclusion

In this paper, we introduced bregman ball trees and demonstrated their efficacy in NN search. The experiments demonstrated that bbtrees can speed up approximate NN retrieval for the KL-divergence by orders of magnitude over brute force search. There are many possible directions for future research. On the practical side, which ideas behind the many variants of metric trees might be useful for bbtrees? On the theoretical side, what is a good notion of intrinsic dimensionality for bregman divergences and can a practical data structure be designed around it?

Acknowledgements

Thanks to Serge Belongie, Sanjoy Dasgupta, Charles Elkan, Carolina Galleguillos, Daniel Hsu, Nikhil Rasiwasia, and Lawrence Saul. Support was provided by the NSF under grants IIS-0347646 and IIS-0713540.

References

Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with bregman divergences. JMLR.

Beygelzimer, A., Kakade, S., & Langford, J. (2006). Cover trees for nearest neighbor. ICML.

Blei, D., Ng, A., & Jordan, M. (2003). Latent dirichlet allocation. JMLR.

Bregman, L. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational and Mathematical Physics, 7, 200–217.

Datar, M., Immorlica, N., Indyk, P., & Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. SCG 2004.

Everingham, M., Gool, L. V., Williams, C. K., Winn, J., & Zisserman, A. (2007). The PASCAL Visual Object Classes Challenge 2007 Results.

Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3), 209–226.

Gray, R. M., Buzo, A., Gray, A. H., & Matsuyama, Y. (1980). Distortion measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing.

Guha, S., Indyk, P., & McGregor, A. (2007). Sketching information divergences. COLT.

Lewis, D. D., Yang, Y., Rose, T., & Li, F. (2004). Rcv1: A new benchmark collection for text categorization research. JMLR.

Liu, T., Moore, A. W., Gray, A., & Yang, K. (2004). An investigation of practical approximate nearest neighbor algorithms. NIPS.

Moore, A. W. (2000). Using the triangle inequality to survive high-dimensional data. UAI.

Nielsen, F., Boissonnat, J.-D., & Nock, R. (2007). On bregman voronoi diagrams. SODA (pp. 746–755).

Nowak, E., Jurie, F., & Triggs, B. (2006). Sampling strategies for bag-of-features image classification. ECCV.

Omohundro, S. (1989). Five balltree construction algorithms (Technical Report). ICSI.

Pereira, F., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. 31st Annual Meeting of the ACL (pp. 183–190).

Puzicha, J., Buhmann, J., Rubner, Y., & Tomasi, C. (1999). Empirical evaluation of dissimilarity measures for color and texture. ICCV.

Rasiwasia, N., Moreno, P., & Vasconcelos, N. (2007). Bridging the gap: query by semantic example. IEEE Transactions on Multimedia.

Spellman, E., & Vemuri, B. (2005). Efficient shape indexing using an information theoretic representation. International Conference on Image and Video Retrieval.

Uhlmann, J. K. (1991). Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40, 175–179.

Weinberger, K., Blitzer, J., & Saul, L. (2006). Distance metric learning for large margin nearest neighbor classification. NIPS.

Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. SODA.