Tree-Independent Dual-Tree Algorithms

Ryan R. Curtin [email protected] William B. March [email protected] Parikshit Ram [email protected] David V. Anderson [email protected] Alexander G. Gray [email protected] Charles L. Isbell, Jr. [email protected] Georgia Institute of Technology, 266 Ferst Drive NW, Atlanta, GA 30332 USA

Abstract

Dual-tree algorithms are a widely used class of branch-and-bound algorithms. Unfortunately, developing dual-tree algorithms for use with different trees and problems is often complex and burdensome. We introduce a four-part logical split: the tree, the traversal, the point-to-point base case, and the pruning rule. We provide a meta-algorithm which allows development of dual-tree algorithms in a tree-independent manner and easy extension to entirely new types of trees. Representations are provided for five common algorithms; for k-nearest neighbor search, this leads to a novel, tighter pruning bound. The meta-algorithm also allows straightforward extensions to massively parallel settings.

1. Introduction

In large-scale machine learning applications, algorithmic scalability is paramount. Hence, much study has been put into fast algorithms for machine learning tasks. One commonly-used approach is to build trees on data and then use branch-and-bound algorithms to minimize runtime. A popular example of a branch-and-bound algorithm is the use of the kd-tree for nearest neighbor search, pioneered by Bentley (1975), and subsequently modified to use two trees ("dual-tree") (Gray & Moore, 2001). Later, an optimized tree structure, the cover tree, was designed (Beygelzimer et al., 2006), giving provably linear scaling in the number of queries (Ram et al., 2009)—a significant improvement over the quadratically-scaling brute-force algorithm.

Asymptotic speed gains as dramatic as described above are common for dual-tree branch-and-bound algorithms. These types of algorithms can be applied to a class of problems referred to as 'n-body problems' (Gray & Moore, 2001). The n-point correlation, important in astrophysics, is an n-body problem and can be solved quickly with trees (March et al., 2012). In addition, Euclidean minimum spanning trees can be found quickly using tree-based algorithms (March et al., 2010). Other dual-tree algorithms include kernel density estimation (Gray & Moore, 2003a), mean shift (Wang et al., 2007), Gaussian summation (Lee & Gray, 2006), fast singular value decomposition (Holmes et al., 2008), range search, furthest-neighbor search, and many others.

The dual-tree algorithms referenced above are each quite similar, but no formal connections between the algorithms have been established. The types of trees used to solve each problem may differ, and in addition, the manner in which the trees are traversed can differ (depending on the problem or the tree). In practice, a researcher may have to implement entirely separate algorithms to solve the same problems with different trees; this is time-consuming and makes it difficult to explore the properties of tree types that make them more suited for particular problems. Worse yet, parallel dual-tree algorithms are difficult to develop and appear to be far more complex than serial implementations; yet, both solve the same problem. We make these contributions to address these shortcomings:

• A representation of dual-tree algorithms as four separate components: a space tree, a traversal, a base case, and a pruning rule.

• A meta-algorithm that produces dual-tree algorithms, given those four separate components.

• Base cases and pruning rules for a variety of dual-tree algorithms, which can be used with any space tree and any traversal.

• A theoretical framework, used to prove the correctness of these meta-algorithms and develop a new, tighter bound for k-nearest neighbor search.

• Implications of our representation, including easy creation of large-scale distributed dual-tree algorithms via our meta-algorithm.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).
2. Overview of Meta-Algorithm

In other works, dual-tree algorithms are described as standalone algorithms that operate on a query dataset Sq and a reference dataset Sr. By observing commonalities in these algorithms, we propose the following logical split of any dual-tree algorithm into four parts:

• A space tree (a type of data structure).

• A pruning dual-tree traversal, which visits nodes in two space trees, and is parameterized by a BaseCase() and a Score() function.

• A BaseCase() function that defines the action to take on a combination of points.

• A Score() function that determines if a subtree should be visited during the traversal.

We can use this to define a meta-algorithm:

Given a type of space tree, a pruning dual-tree traversal, a BaseCase() function, and a Score() function, use the pruning dual-tree traversal with the given BaseCase() and Score() functions on two space trees Tq (built on Sq) and Tr (built on Sr).

In Sections 3 and 4, space trees, traversals, and related quantities are rigorously defined. Then, Sections 5–8 define BaseCase() and Score() for various dual-tree algorithms. Section 9 discusses implications and future possibilities, including large-scale parallelism.

3. Space Trees

To develop a framework for understanding dual-tree algorithms, we must introduce some terminology.

Definition 1. A space tree on a dataset S ∈

• The set of points held in a node Ni is denoted P(Ni) or Pi.

• The convex subset represented by node Ni is denoted S(Ni) or Si.

• The set of descendant nodes of a node Ni, denoted D^n(Ni) or Di^n, is the set of nodes C(Ni) ∪ C(C(Ni)) ∪ ... .

• The set of descendant points of a node Ni, denoted D^p(Ni) or Di^p, is the set of points { p : p ∈ P(D^n(Ni)) ∪ P(Ni) }.

• The parent of a node Ni is denoted Par(Ni).

Figure 1. An example space tree: (a) abstract representation; (b) representation in R^2.

An abstract representation of an example space tree on a five-point dataset in R^2 is shown in Figure 1(a). In this illustration, Nr is the root node of the tree; it has no parent and it contains the points x1 and x3 (that is, Pr = {x1, x3}). The node Np contains points x1 and x5 and has children Nc and Nd (which each have no children and contain points x2 and x4, respectively). In Figure 1(b), the points in the tree and the subsets Sr (darker rectangle) and Sp (lighter rectangle) are shown.

Definition 5. The maximum descendant distance of a node Ni is defined as the maximum distance between the centroid Ci of Si and points in Di^p:

λ(Ni) = max_{p ∈ Di^p} ||Ci − p||.

It is straightforward to show that kd-trees, octrees, metric trees, ball trees, cover trees (Beygelzimer et al., 2006), R-trees, and vantage-point trees all satisfy the conditions of a space tree. The quantities dmin(Nq, Nr), dmax(Nq, Nr), λ(Ni), and ρ(Ni) are easily derived (or bounded, which in many cases is sufficient) for each of these types of trees.

For a kd-tree, dmin(Ni, Nj) is bounded below by the minimum distance between Si and Sj; dmax(Ni, Nj) is bounded above similarly by the maximum distance between Si and Sj. Both ρ(Ni) and λ(Ni) are bounded above by dmax(Ci, Ni).

For the cover tree (Beygelzimer et al., 2006), each node Ni contains only one point pi and has 'scale' si. dmin(Ni, Nj) is bounded below by d(pi, pj) − 2^(si+1) − 2^(sj+1), and dmax(Ni, Nj) is bounded above by d(pi, pj) + 2^(si+1) + 2^(sj+1). Because pi is the centroid of Si, ρ(Ni) = 0. λ(Ni) is simply 2^(si+1).

4. Tree Traversals

In general, the nodes of each space tree can be traversed in a number of ways. However, there have been no attempts to formalize tree traversal. Therefore, we introduce several definitions which will be useful later.

Definition 6. A single-tree traversal is a process that, given a space tree, will visit each node in that tree once and perform a computation on any points contained within the node that is being visited.

As an example, the standard depth-first traversal or breadth-first traversal are single-tree traversals. From a programming perspective, the computation in the single-tree traversal can be implemented with a simple callback BaseCase(point) function. This allows the computation to be entirely independent of the single-tree traversal itself. As an example, a simple single-tree algorithm to count the number of points in a given tree would increment a counter variable each time BaseCase(point) was called. However, this concept by itself is not very useful; without pruning branches, no computations can be avoided.

Definition 7. A pruning single-tree traversal is a process that, given a space tree, will visit nodes in the tree and perform a computation to assign a score to that node. If the score is above some bound, the node is "pruned" and none of its descendants will be visited; otherwise, a computation is performed on any points contained within that node. If no nodes are pruned, then the traversal will visit each node in the tree once.

Clearly, a pruning single-tree traversal that does not prune any nodes is a single-tree traversal. A pruning single-tree traversal can be implemented with two callbacks: BaseCase(point) and Score(node). This allows both the point-to-point computation and the scoring to be entirely independent of the traversal. Thus, single-tree branch-and-bound algorithms can be expressed in a tree-independent manner. Extensions to the dual-tree case are given below.

Definition 8. A dual-tree traversal is a process that, given two space trees Tq (query tree) and Tr (reference tree), will visit every combination of nodes (Nq, Nr) once, where Nq ∈ Tq and Nr ∈ Tr. At each visit (Nq, Nr), a computation is performed between each point in Nq and each point in Nr.

Definition 9. A pruning dual-tree traversal is a process which, given two space trees Tq (query tree) and Tr (reference tree), will visit combinations of nodes (Nq, Nr) such that Nq ∈ Tq and Nr ∈ Tr no more than once, and perform a computation to assign a score to that combination. If the score is above some bound, the combination is pruned and no combinations (Nqc, Nrc) such that Nqc ∈ Dq^n and Nrc ∈ Dr^n will be visited; otherwise, a computation is performed between each point in Nq and each point in Nr.

Similar to the pruning single-tree traversal, a pruning dual-tree algorithm can use two callback functions BaseCase(pq, pr) and Score(Nq, Nr). An example implementation of a depth-first pruning dual-tree traversal is given in Algorithm 1. The traversal is started on the root of Tq and the root of Tr.

Algorithm 1 DepthFirstTraversal(Nq, Nr).
  if Score(Nq, Nr) = ∞ then return
  for each pq ∈ Pq, pr ∈ Pr do
    BaseCase(pq, pr)
  for each Nqc ∈ Cq, Nrc ∈ Cr do
    DepthFirstTraversal(Nqc, Nrc)

Algorithm 1 provides only one example of a commonly-used pruning dual-tree traversal. Other possibilities not explicitly documented here include breadth-first traversals and the unique cover tree dual-tree traversal described by Beygelzimer et al. (2006), which can be adapted to the generic space tree case.

The rest of this work is devoted to using these concepts to represent existing dual-tree algorithms in the four parts described in Section 2.

5. k-Nearest Neighbor Search

k-nearest neighbor search is a well-studied problem with a plethora of algorithms and results. The problem can be stated as follows:

Given a query dataset Sq ∈ R^(n×d), a reference dataset Sr ∈ R^(m×d), and an integer k : 0 < k < m, for each point pq ∈ Sq, find the k nearest neighbors in Sr and their distances from pq. The list of nearest neighbors for a point pq can be referred to as Npq, and the distances to nearest neighbors for pq can be referred to as Dpq. Thus, the k-th nearest neighbor to point pq is Npq[k], and Dpq[k] = ||pq − Npq[k]||.

This can be solved using a brute-force approach: compare every possible point combination and store the k smallest distance results for each pq. However, this scales poorly – O(nm); hence the importance of fast algorithms to solve the problem. Many existing algorithms employ tree-based branch-and-bound strategies (Beygelzimer et al., 2006; Cover & Hart, 1967; Friedman et al., 1977; Fukunaga & Narendra, 1975; Gray & Moore, 2001; Ram et al., 2009).

We unify all of these branch-and-bound strategies by defining methods BaseCase(pq, pr) and Score(Nq, Nr) for use with a pruning dual-tree traversal.

At the initialization of the tree traversal, the lists Npq and Dpq are empty lists for each query point pq. After the traversal is complete, for a query point pq, the set {Npq[1], ..., Npq[k]} is the ordered set of k nearest neighbors of the point pq, and each Dpq[i] = ||pq − Npq[i]||. If we assume that Dpq[i] = ∞ if i is greater than the length of Dpq, we can formulate BaseCase() as given in Algorithm 2. (In practice, k-nearest-neighbors is often run with identical reference and query sets. In that situation it may be useful to modify this implementation of BaseCase() so that a point does not return itself as the nearest neighbor, with distance 0.)

Algorithm 2 k-nearest-neighbors BaseCase()
  Input: query point pq, reference point pr, list of k nearest candidate points Npq and k candidate distances Dpq (both ordered by ascending distance)
  Output: distance d between pq and pr
  d ← ||pq − pr||
  if d < Dpq[k] and BaseCase(pq, pr) not yet called then
    insert d into ordered list Dpq and truncate list to length k
    insert pr into Npq such that Npq is ordered by distance and truncate list to length k
  return d

With the base case established, only the pruning rule remains. A valid pruning rule will, for a given query node Nq and reference node Nr, prune the reference subtree rooted at Nr if and only if it is known that there are no points in Dr^p that are in the set of k nearest neighbors of any points in Dq^p. Thus, at any point in the traversal, we can prune the combination (Nq, Nr) if and only if dmin(Nq, Nr) ≥ B1(Nq), where

B1(Nq) = max_{p ∈ Dq^p} Dp[k].

Now, we can describe this bound recursively. This is important for implementation; a recursive function can cache previous calculations for large speedups.

B1(Nq) = max( max_{p ∈ Pq} Dp[k], max_{p ∈ Dq^p, p ∉ Pq} Dp[k] )
       = max( max_{p ∈ Pq} Dp[k], max_{Nc ∈ Cq} max_{p ∈ Dc^p} Dp[k] )
       = max( max_{p ∈ Pq} Dp[k], max_{Nc ∈ Cq} B1(Nc) ).

Suppose we have, at some point in the traversal, two points p0, p1 ∈ Dq^p for some node Nq, with Dp0[k] = ∞ and Dp1[k] < ∞. This means there exist k points {pr^1, ..., pr^k} in Sr such that d(p1, pr^i) ≤ Dp1[k] for i = {1, ..., k}. Because p0, p1 ∈ Dq^p, we can apply the triangle inequality to see that d(p0, p1) ≤ 2λ(Nq). Therefore, d(p0, pr^i) ≤ Dp1[k] + 2λ(Nq) for i = {1, ..., k}. Using this observation we can construct an alternate bound function B2(Nq):

B2(Nq) = min_{p ∈ Dq^p} Dp[k] + 2λ(Nq),

which can, like B1(Nq), be rearranged to provide a recursive definition. In addition, if p0 ∈ Pq and p1 ∈ Dq^p, we can bound d(p0, p1) more tightly with ρ(Nq) + λ(Nq) instead of 2λ(Nq). These observations yield

B2(Nq) = min( min_{p ∈ Pq} (Dp[k] + ρ(Nq) + λ(Nq)), min_{Nc ∈ Cq} (B2(Nc) + 2(λ(Nq) − λ(Nc))) ).

Both B1(Nq) and B2(Nq) provide valid pruning rules. We can combine both to get a tighter pruning rule by taking the tighter of the two bounds. In addition, B1(Nq) ≥ B1(Nc) and B2(Nq) ≥ B2(Nc) for all Nc ∈ Cq. Therefore, we can prune (Nq, Nr) if dmin(Nq, Nr) ≥ min{B1(Par(Nq)), B2(Par(Nq))}.

These observations are combined for a better bound:

B(Nq) = min{ max( max_{p ∈ Pq} Dp[k], max_{Nc ∈ Cq} B(Nc) ),
             min_{p ∈ Pq} Dp[k] + ρ(Nq) + λ(Nq),
             min_{Nc ∈ Cq} ( B(Nc) + 2(λ(Nq) − λ(Nc)) ),
             B(Par(Nq)) }.

As a result of this bound function being expressed recursively, previous bounds can be cached and used to calculate the bound B(Nq) quickly. We can use this to structure Score() as given in Algorithm 3.

Algorithm 3 k-nearest-neighbors Score()
  Input: query node Nq, reference node Nr
  Output: a score for the node combination (Nq, Nr), or ∞ if the combination should be pruned
  if dmin(Nq, Nr) < B(Nq) then
    return dmin(Nq, Nr)
  return ∞

Applying the meta-algorithm in Section 2 with any tree type and any pruning dual-tree traversal gives a correct implementation of k-nearest neighbor search. Proving the correctness is straightforward: first, a (non-pruning) dual-tree traversal which uses BaseCase() as given in Algorithm 2 will give correct results for any space tree. Then, we already know that B(Nq) is a bounding function that, at any point in the traversal, will not prune any subtrees which could contain better nearest neighbor candidates than the current candidates. Thus, the true nearest neighbors for each query point will always be visited, and the results will be correct.

We now show that this algorithm is a generalization of the standard kd-tree k-NN search, which uses a pruning dual-tree depth-first traversal. The archetypal algorithm for all-nearest-neighbor search (k-nearest neighbor search with k = 1) given for kd-trees in Alex Gray's Ph.D. thesis (2003) is shown here in Algorithm 4 with converted notation. δq^nn is the bound for a node Nq and is initialized to ∞; Dpq represents the nearest distance for a query point pq, and Npq represents the nearest neighbor for a query point pq. Nq.left represents the left child of Nq and is defined to be Nq if Nq has no children; Nq.right is similarly defined.

Algorithm 4 AllNN(Nq, Nr) (Gray, 2003)
  if dmin(Nq, Nr) ≥ δq^nn then return
  if Nq is leaf and Nr is leaf then
    for each pq ∈ Pq, pr ∈ Pr do
      dqr ← ||pq − pr||
      if dqr < Dpq then Dpq = dqr; Npq = pr
      if dqr < δq^nn then δq^nn ← dqr
  AllNN(Nq.left, closer-of(Nr.left, Nr.right))
  AllNN(Nq.left, farther-of(Nr.left, Nr.right))
  AllNN(Nq.right, closer-of(Nr.left, Nr.right))
  AllNN(Nq.right, farther-of(Nr.left, Nr.right))
  δq^nn = min(δq^nn, max(δq.left^nn, δq.right^nn))

The structure of the algorithm matches Algorithm 1; it is a dual-tree depth-first recursion. Because this is a depth-first recursion, δq^nn = ∞ for a node Nq if no descendants of Nq have been recursed into. Otherwise, δq^nn is the maximum of Dpq for all pq ∈ Dq^p. That is, δq^nn = B1(Nq). Thus, the comparison in the first line of Algorithm 4 is equivalent to Algorithm 3 with B1(Nq) instead of B(Nq).

The first two lines of the inside of the for each loop (the base case) are equivalent to Algorithm 2 with k = 1. kd-trees only hold points in leaves; therefore, the base case is called for all combinations of points in each node combination, identical to the depth-first traverser (Algorithm 1).

A kd-tree is a space tree and the dual depth-first recursion is a pruning dual-tree traversal. Also, we showed the equivalency of the pruning rule (that is, the Score() function) and the equivalency of the base case. So, it is clear that Algorithm 4 is produced using our meta-algorithm with these parameters. In addition, because B(Nq) is always less than B1(Nq), Algorithm 3 provides a tighter bound than the pruning rule in Algorithm 4.

This algorithm is also a generalization of the standard cover tree k-NN search (Beygelzimer et al., 2006). The cover tree search is a pruning dual-tree traversal where the query tree is traversed depth-first while the reference tree is simultaneously traversed breadth-first. The pruning rule (after simple adaptation to the k-nearest-neighbor search problem instead of the nearest-neighbor search problem) is equivalent to

dmin(Nq, Nr) ≥ Dpq[k] + λ(Nq),

where pq is the point contained in Nq (remember, each node of a cover tree contains one point). This is equivalent to B2(Nq) because ρ(Nq) = 0 for cover trees. The transformation from the algorithm given by Beygelzimer et al. (2006) to our representation is made clear in Appendix A (supplementary material) and in the k-nearest neighbor search implementation of the C++ library MLPACK (Curtin et al., 2011); this is implemented in terms of our meta-algorithm.

Specific algorithms for ball trees, metric trees, VP trees, octrees, and other space trees are trivial to create using the BaseCase() and Score() implementation given here (and in MLPACK). Note also that this implementation will work in any metric space.

An extension to k-furthest neighbor search is straightforward. The bound function must be 'inverted' by changing 'max' to 'min' (and vice versa); in addition, the distances Dpq[i] must be initialized to 0 instead of ∞, and the lists Dpq and Npq must be sorted by descending distance instead of ascending distance. Lastly, the comparison d < Dpq[k] must be changed to d > Dpq[k]. With these simple changes, we have easily solved an entirely different problem using our meta-algorithm. An implementation using our meta-algorithm for both kd-trees and cover trees is also available in MLPACK.

6. Range Search

Range search is another popular neighbor searching problem related to k-nearest neighbor search. In addition to being a fairly standard machine learning task, it has numerous uses in applications such as databases and geographic information systems (GIS). A treatise on the history of the problem and solutions is given by Agarwal & Erickson (1999). The problem is:

Given query and reference datasets Sq, Sr and a range [δ1, δ2], for each point pq ∈ Sq, find all points pr ∈ Sr such that δ1 ≤ ||pq − pr|| ≤ δ2. As with k-nearest neighbor search, refer to the list of neighbors for each query point pq as Npq and the corresponding distances as Dpq. These lists are not sorted in any particular order, and at initialization time, they are empty.

In different settings, the problem of range search may not be stated identically; however, our results are easily adaptable. A BaseCase() implementation is given in Algorithm 5, and a Score() implementation is given in Algorithm 6. The only bounds to consider are [δ1, δ2], so no complex bound handling is necessary.

Algorithm 5 Range search BaseCase().
  Input: query point pq, reference point pr, neighbor list Npq, distance list Dpq
  Output: distance d between pq and pr
  d ← ||pq − pr||
  if δ1 ≤ d ≤ δ2 and BaseCase(pq, pr) not yet called then
    Npq ← Npq ∪ {pr}
    Dpq ← Dpq ∪ {d}
  return d

Algorithm 6 Range search Score().
  Input: query node Nq, reference node Nr
  Output: a score for (Nq, Nr), or ∞ if the combination should be pruned
  if δ1 ≤ dmin(Nq, Nr) ≤ δ2 then
    return dmin(Nq, Nr)
  return ∞

While range search is sometimes mentioned in the context of dual-tree algorithms (Gray & Moore, 2001), the focus is usually on k-nearest neighbor search. As a result, we cannot find any explicitly published dual-tree algorithms to generalize; however, a single-tree algorithm was proposed by Bentley and Friedman (1979). Thus, the BaseCase() and Score() proposed here can be used with our meta-algorithm to produce entirely novel range search implementations; MLPACK has kd-tree and cover tree implementations.

7. Borůvka's Algorithm

Finding a Euclidean minimum spanning tree has been a relevant problem since Borůvka's algorithm was proposed in 1926. Recently, a dual-tree version of Borůvka's algorithm was developed (March et al., 2010) for kd-trees and cover trees. We unify these two algorithms and generalize to other types of space tree by formulating BaseCase() and Score() functions.

For a dataset Sr ∈ R^(N×D), Borůvka's algorithm connects each point to its nearest neighbor, giving many 'components'. For each component c, the nearest point in Sr to any point of c that is not part of c is found. The points are connected, combining those components. This process repeats until only one component—the minimum spanning tree—remains.

During the algorithm, we maintain a list F made up of i components Fi : {Ei, Vi}, where Ei is the list of edges and Vi is the list of vertices in the component Fi (these are points in Sr). Each point in Sr belongs to only one Fi. At initialization, |F| = |Sr| and Fi = {∅, {pi}} for i = {1, ..., |Sr|}, where pi is the i-th point in Sr. For p ∈ Sr we define F(p) = Fi if Fi is the component containing p. During the algorithm, we maintain N(Fi) as the candidate nearest neighbor of component Fi and pc(Fi) as the point in component Fi nearest to N(Fi). Then, D(Fi) = ||pc(Fi) − N(Fi)||. Remember that F(N(Fi)) ≠ Fi.

To run Borůvka's algorithm with a space tree Tr built on the set of points Sr, a pruning dual-tree traversal is run with BaseCase() as Algorithm 7, Score() as Algorithm 8, Tr as both of the trees, and F as initialized before. Note that Score() uses B(Nq) from Section 5 with k = 1. Upon traversal completion, we have a list N(Fi) of nearest neighbors of each component Fi.

Algorithm 7 Borůvka's algorithm BaseCase().
  Input: query point pq, reference point pr, nearest candidate point N(F(pq)) and distance D(F(pq))
  Output: distance d between pq and pr
  if pq = pr then
    return 0
  if F(pq) ≠ F(pr) and ||pq − pr|| < D(F(pq)) then
    D(F(pq)) ← ||pq − pr||
    N(F(pq)) ← pr; pc(F(pq)) ← pq
  return ||pq − pr||

Algorithm 8 Borůvka's algorithm Score().
  Input: query node Nq, reference node Nr
  Output: a score for the node combination (Nq, Nr), or ∞ if the combination should be pruned
  if dmin(Nq, Nr) < B(Nq) then
    if F(pq) = F(pr) ∀ pq ∈ Dq^p, pr ∈ Dr^p then
      return ∞
    return dmin(Nq, Nr)
  return ∞

The edge (N(Fi), pc(Fi)) is added to Fi for each Fi. Then, any components in F with shared edges are merged into a new list F′ where |F′| < |F|. The pruning dual-tree traversal is then run again with F = F′, and the traversal-merge process repeats until |F| = 1. When |F| = 1, then F1 is the minimum spanning tree of Sr.

To prove the correctness of the meta-algorithm, see Theorem 4.1 in March et al. (2010). That proof can be adapted from kd-trees to general space trees. Our representation is a generalization of their algorithms; our meta-algorithm produces their kd-tree and cover tree implementations with a tighter distance bound B(Nq). Our meta-algorithm produces a provably correct dual-tree algorithm with any type of space tree.

8. Kernel Density Estimation

Much work has been produced regarding the use of dual-tree algorithms for kernel density estimation (KDE), including by Gray & Moore (2001; 2003b) and later by Lee et al. (2006; 2008). KDE is an important machine learning problem with a vast range of applications, from signal processing to econometrics. The problem is, given query and reference sets Sq, Sr, to estimate a probability density fq at each point pq ∈ Sq using each point pr ∈ Sr and a kernel function Kh. The exact probability density at a point pq is the sum of K(||pq − pr||) for all pr ∈ Sr.

In general, the kernel function is some zero-centered probability density function, such as a Gaussian. This means that when ||pq − pr|| is very large, the contribution of K to fq is very small. Therefore, we can approximate small values using a dual-tree algorithm to avoid unnecessary computation; this is the idea set forth by Gray & Moore (2001). Because K is a function which is decreasing with distance, the maximum difference between K values for a given combination (Nq, Nr) can be bounded above with

BK(Nq, Nr) = K(dmin(Nq, Nr)) − K(dmax(Nq, Nr)).

The algorithm takes a parameter ε; when BK(Nq, Nr) is less than ε/|Sr|, the kernel values are approximated using the kernel value of the centroid Cr of the reference node. The division by |Sr| ensures that the total approximation error is bounded above by ε. The base case on pq and pr merely needs to add K(||pq − pr||) to the existing density estimate fq. When the algorithm is initialized, fq = 0 for all query points. BaseCase() is Algorithm 10 and Score() is Algorithm 9.

Algorithm 9 KDE Score(Nq, Nr).
  Input: query node Nq, reference node Nr
  Output: a score for (Nq, Nr), or ∞ if the combination should be pruned
  if BK(Nq, Nr) < ε/|Sr| then
    for each pq ∈ Dq^p do
      fq ← fq + |Dr^p| K(||pq − Cr||)
    return ∞
  return dmin(Nq, Nr)

Algorithm 10 KDE BaseCase(pq, pr).
  Input: query point pq, reference point pr, density estimate fq
  Output: distance between pq and pr
  if BaseCase(pq, pr) already called then return
  fq ← fq + Kh(||pq − pr||)
  return ||pq − pr||

Again we emphasize the flexibility of our meta-algorithm. To our knowledge, cover trees, octrees, and ball trees have never been used to perform KDE in this manner. Our meta-algorithm can produce these implementations with ease.
9. Discussions

We have now shown five separate algorithms for which we have taken existing dual-tree algorithms and constructed a BaseCase() and Score() function that can be used with any space tree and any dual-tree traversal. Single-tree extensions of these methods are straightforward simplifications.

This modular way of viewing tree-based algorithms has several useful immediate applications. The first is implementation. Given a tree implementation and a dual-tree traversal implementation, all that is required is BaseCase() and Score() functions. Thus, code reuse can be maximized, and new algorithms can be implemented simply by writing two new functions. More importantly, the code is now modular. MLPACK (Curtin et al., 2011), written in C++, uses templates for this. One example is the DualTreeBoruvka class, which implements the meta-algorithm discussed in Section 7, and has the following arguments:

template<typename TreeType, typename TraversalType>
class DualTreeBoruvka;

This means that any class satisfying the constraints of the TreeType template parameter can be designed without any consideration or knowledge of the DualTreeBoruvka class or of the TraversalType class; it is entirely independent. Then, assuming a TreeType and TraversalType without bugs, the dual-tree Borůvka's algorithm is guaranteed to work. An immediate example of the advantage of this is that cover trees were implemented for MLPACK for use with k-nearest neighbor search. This cover tree implementation could, without any additional work, be used with DualTreeBoruvka—which was never an intended goal during the cover tree implementation but still a particularly valuable result!

Of course, the utility of these abstractions is not limited to implementation details. Each of the papers cited in the previous sections describes algorithms in terms of one specific tree type. March et al. (2010) discuss implementations of Borůvka's Algorithm on both kd-trees and cover trees and give algorithms for both. Each algorithm given is quite different and it is not easy to see their similarities. Using our meta-algorithm, any of these tree-based algorithms can be expressed with less effort—especially for more complex trees like the cover tree—and in a more general sense.

In addition, correctness proofs for our algorithms tend to be quite simple. The proofs for each algorithm here can be given in two simple sub-proofs: (1) prove the correctness of BaseCase() when no prunes are made, and (2) prove that Score() does not prune any subtrees on which the algorithm's correctness depends.

The logical split of base case, pruning rule, tree type, and traversal can also be advantageous. A strong example of this is the function B(Nq) devised in Section 5, which is a novel, tighter bound. When not considering a particular tree, the path to a superior algorithm can often be simpler (as in that case).

9.1. Parallelism

Nowhere in this paper has parallelism been discussed in any detail. In fact, all of the given algorithms seem to be suited to serial implementation. However, the pruning dual-tree traversal is entirely separate from the rest of the dual-tree algorithm; therefore, a parallel pruning dual-tree traversal can be used without modifying the rest of the algorithm.

For instance, consider k-nearest neighbor search. Most large-scale parallel implementations of k-NN do not use a dual-tree approach, and no available software exists that implements distributed dual-tree k-nearest neighbor search.

As a simple (and not necessarily efficient) proof-of-concept idea for a distributed traversal, suppose we have t^2 machines and a "master" machine for some t > 0. Then, for a query tree Tq and a reference tree Tr, we can split Tq into t subtrees and one "master tree" Tqm. The reference tree Tr is split the same way. Each possible combination of query and reference subtrees is stored on one of the t^2 machines, and the master trees are stored on the master machine. The lists D and N can be stored on the master machine and can be updated or queried by other machines.

The traversal starts at the roots of the query tree and reference tree and proceeds serially on the master. When a combination in two subtrees is reached, Score() and BaseCase() are performed on the machine containing those two subtrees, and that subtree traversal continues in parallel. This idea satisfies the conditions of a pruning dual-tree traversal; thus, we can use it to make any dual-tree algorithm parallel.

Recently, a distributed dual-tree algorithm was developed for kernel summation (Lee et al., 2012); this work could be adapted to a generalized distributed pruning dual-tree traversal for use with our meta-algorithm.

10. Conclusion

We have proposed a tree-independent representation of dual-tree algorithms and a meta-algorithm which can be used to create these algorithms. A dual-tree algorithm is represented as four parts: a space tree, a pruning dual-tree traversal, a BaseCase() function, and a Score() function. We applied this representation to generalize and extend five example dual-tree algorithms to different types of trees and traversals. During this process, we also devised a novel bound for k-nearest neighbor search that is tighter than existing bounds. Importantly, this abstraction can be applied to help approach the problem of parallel dual-tree algorithms, which currently is not well researched.

References

Agarwal, P.K. and Erickson, J. Geometric range searching and its relatives. Contemporary Mathematics, 223:1–56, 1999.

Bentley, J.L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

Bentley, J.L. and Friedman, J.H. Data structures for range searching. ACM Computing Surveys (CSUR), 11(4):397–409, 1979.

Beygelzimer, A., Kakade, S.M., and Langford, J. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), volume 23, pp. 97, 2006.

Cover, T.M. and Hart, P.E. Nearest neighbor pattern classification. Information Theory, IEEE Transactions on, 13(1):21–27, 1967.

Curtin, R.R., Cline, J.R., Slagle, N.P., Amidon, M.L., and Gray, A.G. MLPACK: a scalable C++ machine learning library. In BigLearning: Algorithms, Systems, and Tools for Learning at Scale, 2011.

Friedman, J.H., Bentley, J.L., and Finkel, R.A. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3(3):209–226, 1977.

Fukunaga, K. and Narendra, P.M. A branch and bound algorithm for computing k-nearest neighbors. Computers, IEEE Transactions on, 100(7):750–753, 1975.

Gray, A.G. Bringing Tractability to Generalized N-Body Problems in Statistical and Scientific Computation. PhD thesis, Carnegie Mellon University, 2003.

Gray, A.G. and Moore, A.W. 'N-Body' problems in statistical learning. In Advances in Neural Information Processing Systems 14 (NIPS 2001), volume 4, pp. 521–527, 2001.

Gray, A.G. and Moore, A.W. Rapid evaluation of multiple density models. In Advances in Neural Information Processing Systems 16 (NIPS 2003), 2003a.

Gray, A.G. and Moore, A.W. Nonparametric density estimation: Toward computational tractability. In SIAM International Conference on Data Mining (SDM), 2003b.

Holmes, M.P., Gray, A.G., and Isbell Jr., C.L. QUIC-SVD: Fast SVD using cosine trees. Advances in Neural Information Processing Systems (NIPS), 21, 2008.

Lee, D. and Gray, A.G. Faster Gaussian summation: Theory and experiment. In Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence (UAI), 2006.

Lee, D. and Gray, A.G. Fast high-dimensional kernel summations using the Monte Carlo multipole method. Advances in Neural Information Processing Systems (NIPS), 21, 2008.

Lee, D., Gray, A.G., and Moore, A.W. Dual-tree fast Gauss transforms. In Advances in Neural Information Processing Systems 18, pp. 747–754. MIT Press, Cambridge, MA, 2006.

Lee, D., Vuduc, R., and Gray, A.G. A distributed kernel summation framework for general-dimension machine learning. In SIAM International Conference on Data Mining, 2012.

March, W.B., Ram, P., and Gray, A.G. Fast Euclidean minimum spanning tree: algorithm, analysis, and applications. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 603–612, 2010.

March, W.B., Connolly, A.J., and Gray, A.G. Fast algorithms for comprehensive n-point correlation estimates. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12), pp. 1478–1486. ACM, 2012.

Ram, P., Lee, D., March, W.B., and Gray, A.G. Linear-time algorithms for pairwise statistical problems. Advances in Neural Information Processing Systems 22 (NIPS 2009), 2009.

Wang, P., Lee, D., Gray, A.G., and Rehg, J.M. Fast mean shift with accurate and stable convergence. In Workshop on Artificial Intelligence and Statistics (AISTATS), 2007.