Introduction to Cover

Javen Qinfeng Shi SML of NICTA RSISE of ANU 2006 Oct. 15 Outline

™ Goal ™ Related works ™ definition ™ Optimized Implement in C++ ™ Complexity ™ Further work Outline

™ Goal ™ Related works ™ Cover tree definition ™ Efficiently Implement in C++ ™ Advantages and disadvantages ™ Further work Goal

™ Speed up Nearest-neighbour search[John Langford etc. 2006] ZPreprocess a dataset S of n points in some metric space X so that given a query point p ∈ X , a point q ∈ S which minimises d(p,q) can be efficiently found ™ Also Kmeans Outline

™ Goal ™ Related works ™ Cover tree definition ™ Example ™ Optimized Implement in C++ ™ Advantages and disadvantages ™ Further work Relevant works

™ Kd tree [Friedman, Bentley & Finkel 1977] ™ family: ZBall tree [Omohundro,1991] ZMetric tree [Uhlmann, 1991] ZAnchors [Moore 2000] Z(Improved) Anchors metric tree [Charles2003] ™ Navigating net [R.Krauthgamer, J.Lee 2004] ? Lots of discussions and proofs on its computational bound Kd tree

™ Univariate axis-aligned splits ™ Split on widest dimension ™ O(N log N) to build, O(N) space Kd tree Kd tree y

x (axis) Ball tree/metric tree

A of points in R2 [Uhlmann 1991], [Omohundro 1991] A ball-tree: level 1

[Uhlmann 1991], [Omohundro 1991] A ball-tree: level 2 A ball-tree: level 3 A ball-tree: level 4 A ball-tree: level 5 Outline

™ Goal ™ Related works ™ Cover tree definition ZBasic definition ZHow to build a cover tree ZQuery a point/points ™ Optimized Implement in C ™ Advantages and disadvantages ™ Further work Basic definition

™ A cover tree T on a dataset S is a leveled tree where each level is indexed by an integer scale i which decreases as the tree is descended

™ Ci denotes the set of nodes at level i ™ d(p,q) denotes the distance between poitns p and q ™ A valid tree satisfies the following properties

Z Nesting: Ci ⊂Ci−1

Z Covering tree: For every node p ∈ C i − 1 , there exists a node q∈Ci satisfying d ( p , q ) ≤ 2 i and exactly one such q is a parent of p i Z Separation: For all nodes p , q ∈ C i , d(p,q) > 2 Nesting

™Ci ⊂ Ci−1 ZEach node in set

Ci has a self- child ZAll nodes in set

Ci are also nodes in sets Cj where j

ZSet C-∞ contains all the nodes in dataset S Cover tree

™ For every

node p ∈ C i− 1 , there exists a

node q∈Ci satisfying d( p,q) ≤ 2i and exactly one such q is a parent of p Separation

i ™ For all nodes p , q ∈ C i d, (p,q) > 2 Cover tree struture

C∞ / Ca Root

C∞−1

C∞−2

> 2i Ci < 2i Ci−1

C−∞ Cover tree struture

9

8 11 3 5 7 10

6

5 1 y 4

3 7 2 4 9

2 6 8 1

0 02 46 810 x How to build a Cover tree

A set of points in R2 How to build a Cover tree

A set of points in R2 How to build a Cover tree

Ca Root d_max a=?

A set of points in R2 How to build a Cover tree

Ca Root d_max a= ceil(log 2 (d _ max))

A set of points in R2 How to build a Cover tree

Ca Root

2^a Ca−1 2^(a-1)

A set of points in R2 How to build a Cover tree

Ca Root

2^a Ca−1 2^(a-1)

A set of points in R2 How to build a Cover tree

Ca Root

2^a Ca−1 2^(a-1)

A set of points in R2 How to build a Cover tree

Ca Root

2^a Ca−1 2^(a-1)

A set of points in R2 How to build a Cover tree

Ca Root

2^a Ca−1 2^(a-1)

A set of points in R2 How to build a Cover tree

Ca Root

2^a Ca−1 2^(a-1)

A set of points in R2 How to build a Cover tree

Ca Root

2^a Ca−1 2^(a-1)

Continue till every circle contain only one point Query for a single point

Ci+1 set Q∞ = C∞ for i from ∞ down to - ∞ Ci

consider the set of children of Qi : Ci−1 set = {Children(q) : q ∈Qi} form next cover set :

Qi−1 = {q ∈ set : d( p,q) ≤ d( p, set) + b} return arg min d( p,q) q∈Q−∞ b=? Query for a single point

i−1 b = ∑ 2k = 2i −∞ Query Examples

9

8 11 3 5 7 10

6

5 1 y 4

3 7 2 4 9

2 6 8 1

0 02 46 810 x Query for Node 1

9

8 11 3 5 7 10

6

5 1 y 4

3 7 2 4 9

2 6 8 1

0 02 46 810 x Query for Node 4

9

8 11 3 5 7 10

6

5 1 y 4

3 7 2 4 9

2 6 8 1

0 02 46 810 x Query for Node 11

9

8 11 3 5 7 10

6

5 1 y 4

3 7 2 4 9

2 6 8 1

0 02 46 810 x Query for many points

2^(j+1) Optimized implement in C++

™ Alignment memory (faster) ™ exploit the four-stage fully pipelined floating- point adder (faster) ™ Explicit cover tree (less space) ™ Avoid declaring new variable (less space) ™ Single-precision float constants (less space) ™ Function pointers (instead of switch) ™ Some differences from the paper Alignment memory

™ pad to make their sizes a multiple of a word, doubleword or quadword. ™ Cpu can fetch the data (with multiple size of a word) from memory only one time, otherwise have to fetch twice. Alignment memory Alignment memory

™ Cover tree code:

™ Result: exploit the four-stage fully pipelined floating-point adder exploit the four-stage fully pipelined floating-point adder ™ Cover tree code: Explicit cover tree

™ Theory is based on an implicit implementation, but tree is built with a condensed explicit implementation to preserve O(n) space bound Avoid declaring new variable

™ Use the same variable Zpoint_set here is the same one Some differences from the paper

™ Paper: ™ Code: Z2^i Z1.3^i ZScale i reduces (from ZScale i top to bottom) increases(0~100) n.scale = top_scale - cur_scale; Complexity

™ Expasion constant C Advantages and disadvantages

™ Advantage: ZQuery many points at one time are faster (instead of query each by each) ZLess required space ™ Disadvantage: ZBuilding cover tree is still time consuming. ZExpansion constant C(which is related to bound) is hard to be accurately determined ZRecursion might be slower than iteration, and may overflow the recursion stack in both C++ or python. Further work

™ Provide python interface (swig, pyrex) ™ Apply it into other classification methods (k-means, crf,…)to speed them up. ™Thanks for coming!