An introductory tutorial on kd-trees

Andrew W. Moore

Carnegie Mellon University

[email protected]

Extract from Andrew Moore's PhD Thesis: Efficient Memory-based Learning for Robot Control

PhD Thesis; Technical Report No. 209, Computer Laboratory, University of Cambridge. 1991.

Chapter 6

Kd-trees for Cheap Learning

This chapter gives a specification of the nearest neighbour algorithm. It also gives both an informal and formal introduction to the kd-tree data structure. Then there is an explicit, detailed account of how the nearest neighbour search algorithm is implemented efficiently, which is followed by an empirical investigation into the algorithm's performance. Finally, there is discussion of some other algorithms related to the nearest neighbour search.

6.1 Nearest Neighbour Specification

Given two multi-dimensional spaces Domain = ℜ^{k_d} and Range = ℜ^{k_r}, let an exemplar be a member of Domain × Range and let an exemplar-set be a finite set of exemplars. Given an exemplar-set E, and a target domain vector d, then a nearest neighbour of d is any exemplar (d', r') ∈ E such that None-nearer(E, d, d'). Notice that there might be more than one suitable exemplar. This ambiguity captures the requirement that any nearest neighbour is adequate. None-nearer is defined thus:

    None-nearer(E, d, d') ⇔ ∀(d'', r'') ∈ E . |d − d'| ≤ |d − d''|        (6.1)

In Equation 6.1 the distance metric is Euclidean, though any other L_p-norm could have been used:

    |d − d'| = √( Σ_{i=1}^{k_d} (d_i − d'_i)² )        (6.2)

where d_i is the ith component of vector d.

In the following sections I describe some algorithms to realize this abstract specification with the additional informal requirement that the computation time should be relatively short.

Algorithm: Nearest Neighbour by Scanning.

Data Structures:
  domain-vector    A vector of k_d floating point numbers.
  range-vector     A vector of k_r floating point numbers.
  exemplar         A pair: (domain-vector, range-vector)

Input: exlist, of type list of exemplar
       dom, of type domain-vector
Output: nearest, of type exemplar

Preconditions: exlist is not empty
Postconditions: if nearest represents the exemplar (d', r'),
  and exlist represents the exemplar set E,
  and dom represents the vector d,
  then (d', r') ∈ E and None-nearer(E, d, d').

Code:
1. nearest-dist := infinity
2. nearest := undefined
3. for ex := each exemplar in exlist
   3.1 dist := distance between dom and the domain of ex
   3.2 if dist < nearest-dist then
       3.2.1 nearest-dist := dist
       3.2.2 nearest := ex

Table 6.1: Finding Nearest Neighbour by scanning a list.

6.2 Naive Nearest Neighb our

This op eration could b e achieved by representing the exemplar-set as a list of exemplars. In Ta-

ble 6.1, I give the trivial nearest neighb our algorithm which scans the entire list. This algorithm has

time complexity O N  where N is the size of E. By structuring the exemplar-set more intelligently,

it is p ossible to avoid making a distance computation for every memb er.
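As a concrete illustration, the scanning algorithm of Table 6.1 might be rendered in Python roughly as follows (a sketch, not the thesis implementation; exemplars are assumed here to be (domain-vector, range-vector) tuples):

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors (Equation 6.2)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_by_scanning(exlist, dom):
    """Table 6.1: scan every exemplar, keeping the closest seen so far.
    exlist must be a non-empty list of (domain_vector, range_vector) pairs."""
    nearest, nearest_dist = None, math.inf
    for ex in exlist:
        dist = distance(dom, ex[0])          # Step 3.1
        if dist < nearest_dist:              # Step 3.2
            nearest_dist, nearest = dist, ex
    return nearest
```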

6.3 Introduction to kd-trees

A kd-tree is a data structure for storing a finite set of points from a k-dimensional space. It was examined in detail by J. Bentley [Bentley, 1980; Friedman et al., 1977]. Recently, S. Omohundro has recommended it in a survey of possible techniques to increase the speed of neural network learning [Omohundro, 1987].

A kd-tree is a binary tree. The contents of each node are depicted in Table 6.2. Here I provide an informal description of the structure and meaning of the tree, and in the following subsection I give a formal definition of the invariants and semantics.

Field Name    Field Type       Description
dom-elt       domain-vector    A point from k_d-dimensional space
range-elt     range-vector     A point from k_r-dimensional space
split         integer          The splitting dimension
left          kd-tree          A kd-tree representing those points to the left of the splitting plane
right         kd-tree          A kd-tree representing those points to the right of the splitting plane

Table 6.2: The fields of a kd-tree node

The exemplar-set E is represented by the set of nodes in the kd-tree, each node representing one exemplar. The dom-elt field represents the domain-vector of the exemplar and the range-elt field represents the range-vector. The dom-elt component is the index for the node. It splits the space into two subspaces according to the splitting hyperplane of the node. All the points in the "left" subspace are represented by the left subtree, and the points in the "right" subspace by the right subtree. The splitting hyperplane is a plane which passes through dom-elt and which is perpendicular to the direction specified by the split field. Let i be the value of the split field. Then a point is to the left of dom-elt if and only if its ith component is less than the ith component of dom-elt. The complementary definition holds for the right field. If a node has no children, then the splitting hyperplane is not required.
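One way to render the node of Table 6.2 in code is the following sketch (the representation of the empty tree by None is an assumption of this rendering, not of the thesis):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vector = Tuple[float, ...]

@dataclass
class KDNode:
    """A kd-tree node with the fields of Table 6.2."""
    dom_elt: Vector                   # the exemplar's domain vector; the node's index point
    range_elt: Vector                 # the exemplar's range vector
    split: int                        # splitting dimension (irrelevant for childless nodes)
    left: Optional["KDNode"] = None   # points with v[split] <= dom_elt[split]
    right: Optional["KDNode"] = None  # points with v[split] > dom_elt[split]
```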

Figure 6.1 demonstrates a kd-tree representation of the four dom-elt points (2,5), (3,8), (6,3) and (8,9). The root node, with dom-elt (2,5), splits the plane in the y-axis into two subspaces. The point (6,3) lies in the lower subspace, that is {(x,y) | y < 5}, and so is in the left subtree. Figure 6.2 shows how the nodes partition the plane.

6.3.1 Formal Specification of a kd-tree

The reader who is content with the informal description above can omit this section. I define a mapping

    exset-rep : kd-tree → exemplar-set        (6.3)

which maps the tree to the exemplar-set it represents:

    exset-rep(empty) = {}
    exset-rep(⟨d, r, −, empty, empty⟩) = {(d, r)}                              (6.4)
    exset-rep(⟨d, r, split, tree_left, tree_right⟩) =
        exset-rep(tree_left) ∪ {(d, r)} ∪ exset-rep(tree_right)

[Figure 6.1: A 2d-tree of four elements. The splitting planes are not indicated. The [2,5] node splits along the y = 5 plane and the [3,8] node splits along the x = 3 plane.]

[Figure 6.2: How the tree of Figure 6.1 splits up the x,y plane.]

The invariant is that subtrees only ever contain dom-elts which are on the correct side of all their ancestors' splitting planes:

    Is-legal-kdtree(empty).
    Is-legal-kdtree(⟨d, r, −, empty, empty⟩).
    Is-legal-kdtree(⟨d, r, split, tree_left, tree_right⟩) ⇔
        (∀(d', r') ∈ exset-rep(tree_left) . d'_split ≤ d_split) ∧
        (∀(d', r') ∈ exset-rep(tree_right) . d'_split > d_split) ∧            (6.5)
        Is-legal-kdtree(tree_left) ∧
        Is-legal-kdtree(tree_right)

6.3.2 Constructing a kd-tree

Given an exemplar-set E, a kd-tree can be constructed by the algorithm in Table 6.3. The pivot-choosing procedure of Step 2 inspects the set and chooses a "good" domain vector from this set to use as the tree's root. The discussion of how such a root is chosen is deferred to Section 6.7. Whichever exemplar is chosen as root will not affect the correctness of the kd-tree, though the tree's maximum depth and the shape of the hyperregions will be affected.

6.4 Nearest Neighb our Search

In this section, I describ e the nearest neighb our algorithm which op erates on kd-trees. I b egin with

an informal description and worked example, and then give the precise algorithm.

A rst approximation is initially found at the leaf no de which contains the target p oint. In

Figure 6.3 the target p oint is marked X and the leaf no de of the region containing the target is

coloured black. As is exempli ed in this case, this rst approximation is not necessarily the nearest

neighb our, but at least we knowany p otential nearer neighb our must lie closer, and so must lie

within the circle centred on X and passing through the leaf no de. Wenow back up to the parent

of the current no de. In Figure 6.4 this parent is the black no de. We compute whether it is possible

for a closer solution to that so far found to exist in this parent's other child. Here it is not p ossible,

b ecause the circle do es not intersect with the shaded space o ccupied by the parent's other child.

If no closer neighb our can exist in the other child, the algorithm can immediately move up a further

level, else it must recursively explore the other child. In this example, the next parent whichis

checked will need to b e explored, b ecause the area it covers i.e. everywhere north of the central

horizontal line does intersect with the b est circle so far.

Table 6.4 describes my actual implementation of the nearest neighbour algorithm. It is called with four parameters: the kd-tree, the target domain vector, a representation of a hyperrectangle in Domain, and a value indicating the maximum distance from the target which is worth searching. The search will only take place within those portions of the kd-tree which lie both in the hyperrectangle and within the maximum distance to the target.

Algorithm: Constructing a kd-tree

Input: exset, of type exemplar-set
Output: kd, of type kdtree
Pre: None
Post: exset = exset-rep(kd) ∧ Is-legal-kdtree(kd)

Code:
1. If exset is empty then return the empty kdtree
2. Call pivot-choosing procedure, which returns two values:
     ex := a member of exset
     split := the splitting dimension
3. d := domain vector of ex
4. exset' := exset with ex removed
5. r := range vector of ex
6. exsetleft := {(d', r') ∈ exset' | d'_split ≤ d_split}
7. exsetright := {(d', r') ∈ exset' | d'_split > d_split}
8. kdleft := recursively construct kd-tree from exsetleft
9. kdright := recursively construct kd-tree from exsetright
10. kd := ⟨d, r, split, kdleft, kdright⟩

Proof: By induction on the length of exset and the definitions of exset-rep and Is-legal-kdtree.

Table 6.3: Constructing a kd-tree from a set of exemplars.
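In Python, the construction of Table 6.3 might be sketched as follows (assuming the KDNode class from the earlier sketch; choose_pivot stands for the pivot-choosing procedure of Step 2, one candidate for which appears in Section 6.7.2):

```python
def build_kdtree(exset, choose_pivot):
    """Table 6.3: build a kd-tree from a list of (domain, range) exemplars.
    choose_pivot(exset) must return (exemplar, split_dimension)."""
    if not exset:                                              # Step 1
        return None
    ex, split = choose_pivot(exset)                            # Step 2
    d, r = ex                                                  # Steps 3 and 5
    rest = [e for e in exset if e is not ex]                   # Step 4
    left_set  = [e for e in rest if e[0][split] <= d[split]]   # Step 6
    right_set = [e for e in rest if e[0][split] >  d[split]]   # Step 7
    return KDNode(d, r, split,                                 # Step 10
                  build_kdtree(left_set, choose_pivot),        # Step 8
                  build_kdtree(right_set, choose_pivot))       # Step 9
```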

[Figure 6.3: The black dot is the dot which owns the leaf node containing the target (the cross). Any nearer neighbour must lie inside this circle.]

[Figure 6.4: The black dot is the parent of the closest found so far. In this case the black dot's other child (shaded grey) need not be searched.]

The caller of the routine will generally specify the infinite hyperrectangle which covers the whole of Domain, and the infinite maximum distance.

Before discussing its execution, I will explain how the operations on the hyperrectangles can be implemented. A hyperrectangle is represented by two arrays: one of its minimum coordinates, the other of its maximum coordinates. To cut the hyperrectangle, so that one of its edges is moved closer to its centre, the appropriate array component is altered. To check whether a hyperrectangle hr intersects with a hypersphere of radius r centered at point t, we find the point p in hr which is closest to t. Write hr_i^min as the minimum extreme of hr in the ith dimension and hr_i^max as the maximum extreme. p_i, the ith component of this closest point, is computed thus:

    p_i = { hr_i^min   if t_i ≤ hr_i^min
          { t_i        if hr_i^min < t_i < hr_i^max        (6.6)
          { hr_i^max   if t_i ≥ hr_i^max

The objects intersect only if the distance between p and t is less than or equal to r.
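Equation 6.6 is a per-coordinate clamp of the target into the hyperrectangle, so the intersection test can be sketched as follows (reusing distance from the scanning example; the (min-array, max-array) hyperrectangle representation matches the description above):

```python
def closest_point_in_hr(hr_min, hr_max, t):
    """Equation 6.6: the point p of hyperrectangle hr closest to target t.
    Each coordinate is clamped into [hr_min[i], hr_max[i]] independently."""
    return tuple(min(max(ti, lo), hi) for ti, lo, hi in zip(t, hr_min, hr_max))

def hr_intersects_sphere(hr_min, hr_max, t, radius):
    """True iff hyperrectangle hr and the sphere of the given radius
    centred at t share at least one point."""
    return distance(closest_point_in_hr(hr_min, hr_max, t), t) <= radius
```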

The search is depth first, and uses the heuristic of searching first the child node which contains the target. Step 1 deals with the trivial empty tree case, and Steps 2 and 3 assign two important local variables. Step 4 cuts the current hyperrectangle into the two hyperrectangles covering the space occupied by the child nodes. Steps 5-7 determine which child contains the target. After Step 8, when this initial child is searched, it may be possible to prove that there cannot be any closer point in the hyperrectangle of the further child. In particular, the point at the current node must be out of range. The test is made in Steps 9 and 10. Step 9 restricts the maximum radius in which any possible closer point could lie, and then the test in Step 10 checks whether there is any space in the hyperrectangle of the further child which lies within this radius.

Algorithm: Nearest Neighbour in a kd-tree

Input: kd, of type kdtree
       target, of type domain vector
       hr, of type hyperrectangle
       max-dist-sqd, of type float
Output: nearest, of type exemplar
        dist-sqd, of type float

Pre: Is-legal-kdtree(kd)
Post: Informally, the postcondition is that nearest is a nearest exemplar to target which also lies both within the hyperrectangle hr and within distance √max-dist-sqd of target. √dist-sqd is the distance of this nearest point. If there is no such point then dist-sqd contains infinity.

Code:
1. if kd is empty then set dist-sqd to infinity and exit.
2. s := split field of kd
3. pivot := dom-elt field of kd
4. Cut hr into two sub-hyperrectangles left-hr and right-hr.
   The cut plane is through pivot and perpendicular to the s dimension.
5. target-in-left := target_s ≤ pivot_s
6. if target-in-left then
   6.1 nearer-kd := left field of kd and nearer-hr := left-hr
   6.2 further-kd := right field of kd and further-hr := right-hr
7. if not target-in-left then
   7.1 nearer-kd := right field of kd and nearer-hr := right-hr
   7.2 further-kd := left field of kd and further-hr := left-hr
8. Recursively call Nearest Neighbour with parameters
   (nearer-kd, target, nearer-hr, max-dist-sqd), storing the results in nearest and dist-sqd
9. max-dist-sqd := minimum of max-dist-sqd and dist-sqd
10. A nearer point could only lie in further-kd if there were some part of further-hr
    within distance √max-dist-sqd of target. If this is the case then
    10.1 if (pivot − target)² < dist-sqd then
         10.1.1 nearest := (pivot, range-elt field of kd)
         10.1.2 dist-sqd := (pivot − target)²
         10.1.3 max-dist-sqd := dist-sqd
    10.2 Recursively call Nearest Neighbour with parameters
         (further-kd, target, further-hr, max-dist-sqd),
         storing the results in temp-nearest and temp-dist-sqd
    10.3 If temp-dist-sqd < dist-sqd then
         10.3.1 nearest := temp-nearest and dist-sqd := temp-dist-sqd

Proof: Outlined in text.

Table 6.4: The Nearest Neighbour Algorithm

If it is not possible then no further search is necessary. If it is possible, then Step 10.1 checks if the point associated with the current node of the tree is closer than the closest yet. Then, in Step 10.2, the further child is recursively searched. The maximum distance worth examining in this further search is the distance to the closest point yet discovered.
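Putting the pieces together, a Python sketch of Table 6.4 might look like this (it reuses KDNode and hr_intersects_sphere from the earlier sketches; hyperrectangles are passed as separate min and max coordinate sequences, an assumption of this rendering):

```python
def nn_search(kd, target, hr_min, hr_max, max_dist_sqd):
    """Table 6.4: return (nearest_exemplar, dist_sqd) for the nearest point
    to target lying in the hyperrectangle and within sqrt(max_dist_sqd);
    returns (None, inf) when no such point exists."""
    if kd is None:                                        # Step 1
        return None, math.inf
    s, pivot = kd.split, kd.dom_elt                       # Steps 2-3
    # Step 4: cut hr through pivot, perpendicular to dimension s.
    left_max  = list(hr_max); left_max[s]  = pivot[s]
    right_min = list(hr_min); right_min[s] = pivot[s]
    if target[s] <= pivot[s]:                             # Steps 5-7
        nearer,  n_min, n_max = kd.left,  hr_min,    left_max
        further, f_min, f_max = kd.right, right_min, hr_max
    else:
        nearer,  n_min, n_max = kd.right, right_min, hr_max
        further, f_min, f_max = kd.left,  hr_min,    left_max
    # Step 8: search the child whose region contains the target first.
    nearest, dist_sqd = nn_search(nearer, target, n_min, n_max, max_dist_sqd)
    max_dist_sqd = min(max_dist_sqd, dist_sqd)            # Step 9
    # Step 10: the further child is worth searching only if its region
    # comes within sqrt(max_dist_sqd) of the target.
    if hr_intersects_sphere(f_min, f_max, target, math.sqrt(max_dist_sqd)):
        d2 = sum((p - t) ** 2 for p, t in zip(pivot, target))
        if d2 < dist_sqd:                                 # Step 10.1
            nearest, dist_sqd = (pivot, kd.range_elt), d2
            max_dist_sqd = dist_sqd
        tmp, tmp_d2 = nn_search(further, target, f_min, f_max, max_dist_sqd)
        if tmp_d2 < dist_sqd:                             # Steps 10.2-10.3
            nearest, dist_sqd = tmp, tmp_d2
    return nearest, dist_sqd
```

A caller would typically invoke nn_search(root, target, [-math.inf] * k, [math.inf] * k, math.inf), mirroring the infinite hyperrectangle and infinite maximum distance mentioned above.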

The proof that this will find the nearest neighbour within the constraints is by induction on the size of the kd-tree. If the cutoff were not made in Step 10, then the proof would be straightforward: the point returned is the closest out of (i) the closest point in the nearer child, (ii) the point at the current node and (iii) the closest point in the further child. If the cutoff were made in Step 10, then the point returned is the closest point in the nearer child, and we can show that neither the current point, nor any point in the further child, can possibly be closer.

Many local optimizations are possible which, while not altering the asymptotic performance of the algorithm, will multiply the speed by a constant factor. In particular, it is in practice possible to hold almost all of the search state globally, instead of passing it as recursive parameters.

6.5 Theoretical Behaviour

Given a kd-tree with N nodes, how many nodes need to be inspected in order to find the proven nearest neighbour using the algorithm in Section 6.4? It is clear at once that on average, at least O(log N) inspections are necessary, because any nearest neighbour search requires traversal to at least one leaf of the tree. It is also clear that no more than N nodes are searched: the algorithm visits each node at most once.

Figure 6.5 graphically shows why we might expect considerably fewer than N nodes to be visited: the shaded areas correspond to areas of the kd-tree which were cut off.

The important values are (i) the worst case number of inspections and (ii) the expected number of inspections. It is actually easy to construct worst case distributions of points which will force nearly all the nodes to be inspected. In Figure 6.6, the tree is two-dimensional, and the points are scattered along the circumference of a circle. If we request the nearest neighbour with the target close to the centre of the circle, it will therefore be necessary for each rectangle, and hence each leaf, to be inspected (this is in order to ensure that there is no point lying inside the circle in any rectangle).

Calculation of the expected number of inspections is difficult, because the analysis depends critically on the expected distribution of the points in the kd-tree, and the expected distribution of the target points presented to the nearest neighbour algorithm.

The analysis is performed in [Friedman et al., 1977]. This paper considers the expected number of hyperrectangles corresponding to leaf nodes which will provably need to be searched. Such hyperrectangles intersect the volume enclosed by a hypersphere centered on the query point whose surface passes through the nearest neighbour. For example, in Figure 6.5 the hypersphere (in this case a circle) is shown, and the number of intersecting hyperrectangles is two.

[Figure 6.5: Generally during a nearest neighbour search only a few leaf nodes need to be inspected.]

[Figure 6.6: A bad distribution which forces almost all nodes to be inspected.]


The paper shows that the expected number of intersecting hyperrectangles is independent of N, the number of exemplars. The asymptotic search time is thus logarithmic, because the time to descend from the root of the tree to the leaves is logarithmic (in a balanced tree), and then an expected constant amount of backtracking is required.

However, this reasoning was based on the assumption that the hyperrectangles in the tree tend to be hypercubic in shape. Empirical evidence in my investigations has shown that this is not generally the case for their tree building strategy. This is discussed and demonstrated in Section 6.7.

A second danger is that the cost, while independent of N, is exponentially dependent on k, the dimensionality of the domain vectors.¹

Thus theoretical analysis provides some insight into the cost, but here, empirical investigation will be used to examine the expense of nearest neighbour in practice.

6.6 Empirical Behaviour

In this section I investigate the empirical behaviour of the nearest neighbour searching algorithm. We expect that the number of nodes inspected in the tree varies according to the following properties of the tree:

• N, the size of the tree.

• k_dom, the dimensionality of the domain vectors in the tree. This value is the k in kd-tree.

• d_distrib, the distribution of the domain vectors. This can be quantified as the "true" dimensionality of the vectors. For example, if the vectors had three components, but all lay on the surface of a sphere, then the underlying dimensionality would be 2. In general, discovery of the underlying dimensionality of a given sample of points is extremely difficult, but for these tests it is a straightforward matter to generate such points. To make a kd-tree with underlying dimensionality d_distrib, I use randomly generated k_dom-dimensional domain vectors which lie on a d_distrib-dimensional hyperelliptical surface. The random vector generation algorithm is as follows: generate d_distrib random angles θ_i ∈ [0, 2π) where 0 ≤ i < d_distrib. Then let the jth component of the vector be ∏_{i=0}^{d_distrib−1} sin(θ_i + φ_ij). The phase angles φ_ij are defined as φ_ij = π/2 if the jth bit of the binary representation of i is 1, and zero otherwise. (A sketch of this generator follows the list.)

• d_target, the probability distribution from which the search target vector will be selected. I shall assume that this distribution is the same as that which determines the domain vectors. This is indeed what will happen when the kd-tree is used for learning control.
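A sketch of that generator in Python (my reading of the scheme above; in particular, the bit-indexing convention used for φ_ij is an assumption):

```python
import math
import random

def random_surface_point(k_dom, d_distrib):
    """One k_dom-dimensional vector lying on a d_distrib-dimensional
    hyperelliptical surface, following the scheme described above."""
    thetas = [random.uniform(0.0, 2.0 * math.pi) for _ in range(d_distrib)]
    def phase(i, j):
        # phi_ij = pi/2 when the jth bit of i's binary representation is 1
        return 0.5 * math.pi if (i >> j) & 1 else 0.0
    return tuple(
        math.prod(math.sin(thetas[i] + phase(i, j)) for i in range(d_distrib))
        for j in range(k_dom)
    )
```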

¹ This was pointed out to the author by N. Maclaren.

[Figure 6.7: Number of inspections required during a nearest neighbour search against the size of the kd-tree. In this experiment the tree was four-dimensional and the underlying distribution of the points was three-dimensional.]

In the following sections I investigate how performance depends on each of these properties.

6.6.1 Performance against Tree Size

Figures 6.7 and 6.8 graph the number of nodes inspected against the number of nodes in the entire kd-tree. Each value was derived by generating a random kd-tree, and then requesting 500 random nearest neighbour searches on the kd-tree. The average number of inspected nodes was recorded. A node was counted as being inspected if the distance between it and the target was computed. Figure 6.7 was obtained from a 4d-tree with an underlying distribution d_distrib = 3. Figure 6.8 used an 8d-tree with an underlying distribution d_distrib = 8.

It is immediately clear that after a certain point, the expense of a nearest neighbour search has no detectable increase with the size of the tree. This agrees with the proposed model of search cost: logarithmic with a large additive constant term.

6.6.2 Performance against the "k" in kd-tree

Figure 6.9 graphs the number of nodes inspected against k_dom, the number of components in the kd-tree's domain vectors, for a 10,000 node tree. The underlying dimensionality was also k_dom. The number of inspections per search rises very quickly, possibly exponentially, with k_dom. This behaviour, the massive increase in cost with dimension, is familiar in computational geometry.

b ehaviour, the massive increase in cost with dimension, is familiar in computational . 6-12 80

70 Figure 6.8

60

ber of insp ections

50 Num against kd-tree size for an

40

eight-dimensional tree with

30

an eight-dimensional un- derlying distribution. 20

10

0 0 2000 4000 6000 8000 1000

[Figure 6.9: Number of inspections graphed against tree dimension. In these experiments the points had an underlying distribution with the same dimensionality as the tree.]

[Figure 6.10: Number of inspections graphed against underlying dimensionality for a fourteen-dimensional tree.]

6.6.3 Performance against the Distribution Dimensionality

This experiment confirms that it is d_distrib, the distribution dimension from which the points were selected, rather than k_dom, which critically affects the search performance. The trials for Figure 6.10 used a 10,000 node kd-tree with domain dimension of 14, for various values of d_distrib. The important observation is that for 14d-trees, the performance does improve greatly if the underlying distribution-dimension is relatively low. Conversely, Figure 6.11 shows that for a fixed 4-d underlying dimensionality, the search expense does not seem to increase any worse than linearly with k_dom.

6.6.4 When the Target is not Chosen from the kd-tree's Distribution

In this experiment the points were distributed on a three-dimensional elliptical surface in ten-dimensional space. The target vector was, however, chosen at random from a ten-dimensional distribution. The kd-tree contained 10,000 points. The average number of inspections over 50 searches was found to be 8,396. This compares with another experiment in which both points and target were distributed in ten dimensions and the average number of inspections was only 248. The reason for the appalling performance was exemplified in Figure 6.6: if the target is very far from its nearest neighbour then very many leaf nodes must be checked.

6.6.5 Conclusion

The speed of the search (measured as the number of distance computations required) seems to vary:

[Figure 6.11: Number of inspections graphed against tree dimension, given a constant four-dimensional underlying distribution.]

• ...only marginally with tree size. If the tree is sufficiently large with respect to the number of dimensions, it is essentially constant.

• ...very quickly with the dimensionality of the distribution of the datapoints, d_distrib.

• ...linearly with the number of components in the kd-tree's domain (k_dom), given a fixed distribution dimension d_distrib.

There is also evidence to suggest that unless the target vector is drawn from the same distribution as the kd-tree points, performance can be greatly worsened.

These results support the belief that real time searching for nearest neighbours is practical in a robotic system, where we can expect the underlying dimensionality of the data points to be low, roughly less than 10. This need not mean that the vectors in the input space should have less than ten components. For data points obtained from robotic systems it will not be easy to decide what the underlying dimensionality is. However, Chapter 10 will show that the data does tend to lie within a number of low dimensional subspaces.

6.7 Further kd-tree Operations

In this section I discuss some other operations on kd-trees which are required for use in the SAB learning system. These include incrementally adding a point to a kd-tree, range searching, and selecting a pivot point.

6.7.1 Range Searching a kd-tree

    range-search : exemplar-set × Domain × ℜ → exemplar-set

The abstract range search operation on an exemplar-set finds all exemplars whose domain vectors are within a given distance of a target point:

    range-search(E, d, r) = {(d', r') ∈ E | (d − d')² ≤ r²}

This is implemented by a modified nearest neighbour search. The modifications are that (i) the initial distance is not reduced as closer points are discovered and (ii) all discovered points within the distance are returned, not just the nearest. The complexity of this operation is shown, in [Preparata and Shamos, 1985], to still be logarithmic in N (the size of E) for a fixed range size.
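A sketch of the modified search, under the same conventions as the nearest neighbour sketch of Section 6.4 (the pruning test is the hyperrectangle-sphere intersection of Equation 6.6):

```python
def range_search(kd, target, radius, hr_min, hr_max, found=None):
    """Collect every exemplar whose domain vector lies within `radius`
    of `target`, pruning subtrees whose regions miss the sphere."""
    if found is None:
        found = []
    if kd is None or not hr_intersects_sphere(hr_min, hr_max, target, radius):
        return found
    if distance(kd.dom_elt, target) <= radius:
        found.append((kd.dom_elt, kd.range_elt))
    s = kd.split
    left_max  = list(hr_max); left_max[s]  = kd.dom_elt[s]
    right_min = list(hr_min); right_min[s] = kd.dom_elt[s]
    range_search(kd.left,  target, radius, hr_min,    left_max, found)
    range_search(kd.right, target, radius, right_min, hr_max,   found)
    return found
```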

6.7.2 Choosing a Pivot from an Exemplar Set

The tree building algorithm of Section 6.3.2 requires that a pivot and a splitting plane be selected from which to build the root of a kd-tree. It is desirable for the tree to be reasonably balanced, and also for the shapes of the hyperregions corresponding to leaf nodes to be fairly equally proportioned. The first criterion is important because a badly unbalanced tree would perhaps have O(N) accessing behaviour instead of O(log N). The second criterion is in order to maximize cutoff opportunities for the nearest neighbour search. This is difficult to formalize, but can be motivated by an illustration. In Figure 6.12 is a perfectly balanced kd-tree in which the leaf regions are very non-square. Figure 6.13 illustrates a kd-tree representing the same set of points, but which promotes squareness at the expense of some balance.

One pivoting strategy which would lead to a perfectly balanced tree, and which is suggested in [Omohundro, 1987], is to pick the splitting dimension as that with maximum variance, and let the pivot be the point with the median split component. This will, it is hoped, tend to promote square regions because having split in one dimension, the next level in the tree is unlikely to find that the same dimension has maximum spread, and so will choose a different dimension. For uniform distributions this tends to perform reasonably well, but for badly skewed distributions the hyperregions tend to take long thin shapes. This is exemplified in Figure 6.12, which has been balanced using this standard median pivot choice.

To avoid this bad case, I choose a pivot which splits the exemplar set in the middle of the range of the most spread dimension. As can be seen in Figure 6.13, this tends to favour squarer regions at the expense of a slight imbalance in the kd-tree. This means that large empty areas of space are filled with only a few hyperrectangles which are themselves large. Thus, the number of leaf nodes which need to be inspected in case they contain a nearer neighbour is smaller than for the original case, which had many small thin hyperrectangles.

[Figure 6.12: A 2d tree balanced using the `median of the most spread dimension' pivoting strategy.]

[Figure 6.13: A 2d tree balanced using the `closest to the centre of the widest dimension' pivoting strategy.]

My pivot choice algorithm is to firstly choose the splitting dimension as the longest dimension of the current hyperrectangle, and then choose the pivot as the point closest to the middle of the hyperrectangle along this dimension. Occasionally, this pivot may even be an extreme point along its dimension, leading to an entirely unbalanced node. This is worth it, because it creates a large empty leaf node. It is possible (but extremely unlikely) that the points could be distributed in such a way as to cause the tree to have one empty child at every level. This would be unacceptable, and so above a certain depth threshold, the pivots are chosen using the standard median technique. Selecting the median as the split and selecting the closest to the centre of the range are both O(N) operations, and so either way a tree rebalance is O(N log N).
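As a sketch (usable as the choose_pivot argument of the earlier construction sketch), with the simplifying assumption that the spread is measured on the bounding box of the exemplar set itself rather than on the enclosing hyperrectangle:

```python
def choose_pivot_widest_middle(exset):
    """Pick the split dimension as the most spread one, and the pivot as
    the exemplar whose coordinate is closest to the middle of that range."""
    k = len(exset[0][0])
    lo = [min(e[0][i] for e in exset) for i in range(k)]
    hi = [max(e[0][i] for e in exset) for i in range(k)]
    split = max(range(k), key=lambda i: hi[i] - lo[i])
    middle = 0.5 * (lo[split] + hi[split])
    ex = min(exset, key=lambda e: abs(e[0][split] - middle))
    return ex, split
```

The depth-threshold fallback to the standard median pivot described above is omitted from this sketch for brevity.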

6.7.3 Incrementally Adding a Point to a kd-tree

Firstly, the leaf node which contains the new point is computed. The hyperrectangle corresponding to this leaf is also obtained (see Section 6.4 for hyperrectangle implementation). When the leaf node is found it may either (i) be empty, in which case it is simply replaced by a new singleton node, or (ii) contain a singleton node. In case (ii) the singleton node must be given a child, and so its previously irrelevant split field must be defined. The split field should be chosen to preserve the squareness of the new subhyperrectangles. A simple heuristic is used: the split dimension is chosen as the dimension in which the hyperrectangle is longest. This heuristic is motivated by the same requirement as for tree balancing: that the regions should be as square as possible, even if this means some loss of balance.

This splitting choice is just a heuristic, and there is no guarantee that a series of points added in this way will preserve the balance of the kd-tree, nor that the hyperrectangles will be well shaped for nearest neighbour search. Thus, on occasion (such as when the depth exceeds a small multiple of the best possible depth) the tree is rebuilt. Incremental addition costs O(log N).
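A sketch of the insertion, with one deviation from the description above: each node's split field is fixed when the node is created, using the hyperrectangle of the leaf position it fills, which selects the same dimension as deferring the decision until a child arrives. A finite enclosing hyperrectangle is assumed at the root for the heuristic to be meaningful.

```python
def kd_insert(kd, d, r, hr_min, hr_max):
    """Insert exemplar (d, r), choosing each new node's split field as the
    longest dimension of the hyperrectangle of the leaf it replaces."""
    if kd is None:   # case (i): an empty leaf; make a new singleton node
        split = max(range(len(d)), key=lambda i: hr_max[i] - hr_min[i])
        return KDNode(tuple(d), tuple(r), split)
    s = kd.split     # case (ii) and below: descend, narrowing the hyperrectangle
    if d[s] <= kd.dom_elt[s]:
        new_max = list(hr_max); new_max[s] = kd.dom_elt[s]
        kd.left = kd_insert(kd.left, d, r, hr_min, new_max)
    else:
        new_min = list(hr_min); new_min[s] = kd.dom_elt[s]
        kd.right = kd_insert(kd.right, d, r, new_min, hr_max)
    return kd
```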

6.7.4 Q Nearest Neighbours

This uses a modified version of the nearest neighbour search. Instead of only searching within a sphere whose radius is the closest distance yet found, the search is within a sphere whose radius is the Qth closest yet found. Until Q points have been found, this distance is infinity.
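The only extra bookkeeping the modification needs is a record of the Q best points seen so far; a bounded max-heap serves. The sketch below shows that bookkeeping on the scanning search of Table 6.1 rather than the tree search (the kd-tree version would use the heap's top, the Qth-closest distance so far, as its search radius):

```python
import heapq

def q_nearest_by_scanning(exlist, dom, q):
    """Return the q exemplars nearest to dom, closest first.
    heap holds (-dist, counter, exemplar); its top is the Qth closest."""
    heap = []
    for idx, ex in enumerate(exlist):
        d = distance(dom, ex[0])
        if len(heap) < q:
            heapq.heappush(heap, (-d, idx, ex))
        elif d < -heap[0][0]:           # closer than the Qth closest so far
            heapq.heapreplace(heap, (-d, idx, ex))
    return [ex for _, _, ex in sorted(heap, reverse=True)]
```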

6.7.5 Deleting a Point from a kd-tree

If the point is at a leaf, this is straightforward. Otherwise, it is difficult, because the structure of both trees below this node is pivoted around the point we wish to remove. One solution would be to rebuild the tree below the deleted point, but on occasion this would be very expensive. My solution is to mark the point as deleted with an extra field in the kd-tree node, and to ignore deletion nodes in nearest neighbour and similar searches. When the tree is next rebuilt, all deletion nodes are removed.

Bibliography

[Bentley, 1980] J. L. Bentley. Multidimensional Divide and Conquer. Communications of the ACM, 23(4):214-229, 1980.

[Friedman et al., 1977] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 3(3):209-226, September 1977.

[Omohundro, 1987] S. M. Omohundro. Efficient Algorithms with Neural Network Behaviour. Journal of Complex Systems, 1(2):273-347, 1987.

[Preparata and Shamos, 1985] F. P. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, 1985.