An Introductory Tutorial on kd-trees
Andrew W. Moore
Carnegie Mellon University
Extract from Andrew Moore's PhD Thesis: Efficient Memory-based Learning for Robot Control
PhD Thesis; Technical Report No. 209, Computer Laboratory, University of Cambridge. 1991.
Chapter 6
Kd-trees for Cheap Learning
This chapter gives a specification of the nearest neighbour algorithm. It also gives
both an informal and formal introduction to the kd-tree data structure. Then there
is an explicit, detailed account of how the nearest neighbour search algorithm is
implemented efficiently, which is followed by an empirical investigation into the al-
gorithm's performance. Finally, there is discussion of some other algorithms related
to the nearest neighbour search.
6.1 Nearest Neighbour Specification
Given two multi-dimensional spaces Domain = ℜ^{k_d} and Range = ℜ^{k_r}, let an exemplar be a
member of Domain × Range and let an exemplar-set be a finite set of exemplars. Given an
exemplar-set E, and a target domain vector d, then a nearest neighbour of d is any exemplar
(d', r') ∈ E such that None-nearer(E, d, d'). Notice that there might be more than one suit-
able exemplar. This ambiguity captures the requirement that any nearest neighbour is adequate.
None-nearer is defined thus:

    None-nearer(E, d, d')  ⇔  ∀ (d'', r'') ∈ E,  | d − d' | ≤ | d − d'' |        (6.1)
In Equation 6.1 the distance metric is Euclidean, though any other L_p-norm could have been used.

    | d − d' | = √( Σ_{i=1}^{k_d} (d_i − d'_i)² )        (6.2)

where d_i is the ith component of vector d.
In the following sections I describe some algorithms to realize this abstract specification with
the additional informal requirement that the computation time should be relatively short.
Algorithm: Nearest Neighbour by Scanning.
Data Structures:
  domain-vector   A vector of k_d floating point numbers.
  range-vector    A vector of k_r floating point numbers.
  exemplar        A pair: (domain-vector, range-vector)
Input: exlist, of type list of exemplar
       dom, of type domain-vector
Output: nearest, of type exemplar
Preconditions: exlist is not empty
Postconditions: if nearest represents the exemplar (d', r'),
  and exlist represents the exemplar set E,
  and dom represents the vector d,
  then (d', r') ∈ E and None-nearer(E, d, d').
Code:
1. nearest-dist := infinity
2. nearest := undefined
3. for ex := each exemplar in exlist
   3.1 dist := distance between dom and the domain of ex
   3.2 if dist < nearest-dist then
       3.2.1 nearest-dist := dist
       3.2.2 nearest := ex

Table 6.1: Finding Nearest Neighbour by scanning a list.
6.2 Naive Nearest Neighbour

This operation could be achieved by representing the exemplar-set as a list of exemplars. In Ta-
ble 6.1, I give the trivial nearest neighbour algorithm which scans the entire list. This algorithm has
time complexity O(N) where N is the size of E. By structuring the exemplar-set more intelligently,
it is possible to avoid making a distance computation for every member.
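The scanning algorithm of Table 6.1 is short enough to render directly. Here is a minimal Python sketch (the function and variable names are mine, not from the thesis); the distance function implements Equation 6.2:

```python
import math

def distance(a, b):
    # Euclidean distance between two equal-length vectors (Equation 6.2).
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_by_scanning(exlist, dom):
    # exlist: non-empty list of (domain_vector, range_vector) exemplars.
    # Returns a nearest exemplar to dom, scanning the whole list: O(N).
    nearest_dist = math.inf
    nearest = None
    for ex in exlist:
        dist = distance(dom, ex[0])
        if dist < nearest_dist:
            nearest_dist = dist
            nearest = ex
    return nearest

# The four domain points of Figure 6.1, with made-up range values.
exemplars = [((2, 5), 'a'), ((3, 8), 'b'), ((6, 3), 'c'), ((8, 9), 'd')]
print(nearest_by_scanning(exemplars, (7, 8)))  # → ((8, 9), 'd')
```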
6.3 Introduction to kd-trees

A kd-tree is a data structure for storing a finite set of points from a k-dimensional space. It was
examined in detail by J. Bentley [Bentley, 1980; Friedman et al., 1977]. Recently, S. Omohundro
has recommended it in a survey of possible techniques to increase the speed of neural network
learning [Omohundro, 1987].

A kd-tree is a binary tree. The contents of each node are depicted in Table 6.2. Here I provide
an informal description of the structure and meaning of the tree, and in the following subsection I
Field Name   Field Type      Description
dom-elt      domain-vector   A point from k_d-dimensional space
range-elt    range-vector    A point from k_r-dimensional space
split        integer         The splitting dimension
left         kd-tree         A kd-tree representing those points
                             to the left of the splitting plane
right        kd-tree         A kd-tree representing those points
                             to the right of the splitting plane

Table 6.2: The fields of a kd-tree node
give a formal definition of the invariants and semantics.

The exemplar-set E is represented by the set of nodes in the kd-tree, each node representing
one exemplar. The dom-elt field represents the domain-vector of the exemplar and the range-elt
field represents the range-vector. The dom-elt component is the index for the node. It splits the
space into two subspaces according to the splitting hyperplane of the node. All the points in the
"left" subspace are represented by the left subtree, and the points in the "right" subspace by the
right subtree. The splitting hyperplane is a plane which passes through dom-elt and which is
perpendicular to the direction specified by the split field. Let i be the value of the split field. Then
a point is to the left of dom-elt if and only if its ith component is less than the ith component of
dom-elt. The complementary definition holds for the right field. If a node has no children, then
the splitting hyperplane is not required.
Figure 6.1 demonstrates a kd-tree representation of the four dom-elt points (2, 5), (3, 8), (6, 3)
and (8, 9). The root node, with dom-elt (2, 5), splits the plane along the y-axis into two subspaces.
The point (6, 3) lies in the lower subspace, that is {(x, y) | y < 5}, and so is in the left subtree.
Figure 6.2 shows how the nodes partition the plane.
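The node structure of Table 6.2 translates naturally into code. Below is a minimal Python sketch (the field names follow the table; the class name and constructor defaults are mine), populated with the four-point example of Figure 6.1:

```python
class KdNode:
    """One node of a kd-tree, holding the fields of Table 6.2."""
    def __init__(self, dom_elt, range_elt, split, left=None, right=None):
        self.dom_elt = dom_elt      # a point from k_d-dimensional space
        self.range_elt = range_elt  # the associated range vector
        self.split = split          # index of the splitting dimension
        self.left = left            # points p with p[split] <= dom_elt[split]
        self.right = right          # points p with p[split] > dom_elt[split]

# The tree of Figure 6.1: the root (2, 5) splits on y (dimension 1);
# (6, 3) lies below y = 5 (left), (3, 8) above (right); under (3, 8),
# which splits on x (dimension 0), (8, 9) has x = 8 > 3 and so goes right.
tree = KdNode((2, 5), None, 1,
              left=KdNode((6, 3), None, 0),
              right=KdNode((3, 8), None, 0,
                           right=KdNode((8, 9), None, 0)))
```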
6.3.1 Formal Specification of a kd-tree

The reader who is content with the informal description above can omit this section. I define a
mapping

    exset-rep : kd-tree → exemplar-set        (6.3)

which maps the tree to the exemplar-set it represents:

    exset-rep(empty) = ∅
    exset-rep(< d, r, −, empty, empty >) = {(d, r)}                              (6.4)
    exset-rep(< d, r, split, tree_left, tree_right >) =
        exset-rep(tree_left) ∪ {(d, r)} ∪ exset-rep(tree_right)
Figure 6.1: A 2d-tree of four elements. The splitting planes are not indicated. The [2,5] node
splits along the y = 5 plane and the [3,8] node splits along the x = 3 plane.

Figure 6.2: How the tree of Figure 6.1 splits up the x,y plane.
The invariant is that subtrees only ever contain dom-elts which are on the correct side of all their
ancestors' splitting planes.

    Is-legal-kdtree(empty).
    Is-legal-kdtree(< d, r, −, empty, empty >).
    Is-legal-kdtree(< d, r, split, tree_left, tree_right >) ⇔
        (∀ (d', r') ∈ exset-rep(tree_left), d'_split ≤ d_split) ∧
        (∀ (d', r') ∈ exset-rep(tree_right), d'_split > d_split) ∧               (6.5)
        Is-legal-kdtree(tree_left) ∧ Is-legal-kdtree(tree_right)
6.3.2 Constructing a kd-tree

Given an exemplar-set E, a kd-tree can be constructed by the algorithm in Table 6.3. The pivot-
choosing procedure of Step 2 inspects the set and chooses a "good" domain vector from this set
to use as the tree's root. The discussion of how such a root is chosen is deferred to Section 6.7.
Whichever exemplar is chosen as root will not affect the correctness of the kd-tree, though the
tree's maximum depth and the shape of the hyperregions will be affected.
6.4 Nearest Neighbour Search

In this section, I describe the nearest neighbour algorithm which operates on kd-trees. I begin with
an informal description and worked example, and then give the precise algorithm.

A first approximation is initially found at the leaf node which contains the target point. In
Figure 6.3 the target point is marked X and the leaf node of the region containing the target is
coloured black. As is exemplified in this case, this first approximation is not necessarily the nearest
neighbour, but at least we know any potential nearer neighbour must lie closer, and so must lie
within the circle centred on X and passing through the leaf node. We now back up to the parent
of the current node. In Figure 6.4 this parent is the black node. We compute whether it is possible
for a closer solution to that so far found to exist in this parent's other child. Here it is not possible,
because the circle does not intersect with the shaded space occupied by the parent's other child.
If no closer neighbour can exist in the other child, the algorithm can immediately move up a further
level, else it must recursively explore the other child. In this example, the next parent which is
checked will need to be explored, because the area it covers (i.e. everywhere north of the central
horizontal line) does intersect with the best circle so far.

Table 6.4 describes my actual implementation of the nearest neighbour algorithm. It is called
with four parameters: the kd-tree, the target domain vector, a representation of a hyperrectangle in
Domain, and a value indicating the maximum distance from the target which is worth searching.
The search will only take place within those portions of the kd-tree which lie both in the hyper-
Algorithm: Constructing a kd-tree
Input: exset, of type exemplar-set
Output: kd, of type kdtree
Pre: None
Post: exset = exset-rep(kd) ∧ Is-legal-kdtree(kd)
Code:
1. If exset is empty then return the empty kdtree
2. Call pivot-choosing procedure, which returns two values:
   ex := a member of exset
   split := the splitting dimension
3. d := domain vector of ex
4. exset' := exset with ex removed
5. r := range vector of ex
6. exset-left := {(d', r') ∈ exset' | d'_split ≤ d_split}
7. exset-right := {(d', r') ∈ exset' | d'_split > d_split}
8. kd-left := recursively construct kd-tree from exset-left
9. kd-right := recursively construct kd-tree from exset-right
10. kd := < d, r, split, kd-left, kd-right >
Proof: By induction on the length of exset and the definitions
of exset-rep and Is-legal-kdtree.

Table 6.3: Constructing a kd-tree from a set of exemplars.
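Table 6.3 maps directly onto a recursive function. Below is a Python sketch; since the pivot-choosing procedure of Step 2 is left open here (it is discussed in Section 6.7), I substitute one plausible rule, the median of the most spread dimension, and that substitution plus all names are my own:

```python
def build_kdtree(exset):
    # exset: list of (domain_vector, range_vector) exemplars.
    if not exset:
        return None  # the empty kd-tree
    # Stand-in pivot rule: split on the dimension of maximum spread,
    # pivoting on the exemplar with the median component there.
    k = len(exset[0][0])
    split = max(range(k), key=lambda i: max(e[0][i] for e in exset)
                                        - min(e[0][i] for e in exset))
    exset = sorted(exset, key=lambda e: e[0][split])
    d, r = exset[len(exset) // 2]                      # Steps 3 and 5
    rest = exset[:len(exset) // 2] + exset[len(exset) // 2 + 1:]  # Step 4
    left = build_kdtree([e for e in rest if e[0][split] <= d[split]])
    right = build_kdtree([e for e in rest if e[0][split] > d[split]])
    return (d, r, split, left, right)                  # Step 10

tree = build_kdtree([((2, 5), 'a'), ((3, 8), 'b'), ((6, 3), 'c'), ((8, 9), 'd')])
```

With this particular pivot rule the resulting tree differs from Figure 6.1 (here the root pivot is (6, 3), splitting on x), but it satisfies the same Is-legal-kdtree invariant.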
Figure 6.3: The black dot is the dot which owns the leaf node containing the target (the cross).
Any nearer neighbour must lie inside this circle.
Figure 6.4: The black dot is the parent of the closest found so far. In this case the black dot's
other child (shaded grey) need not be searched.
rectangle, and within the maximum distance to the target. The caller of the routine will generally
specify the infinite hyperrectangle which covers the whole of Domain, and the infinite maximum
distance.

Before discussing its execution, I will explain how the operations on the hyperrectangles can
be implemented. A hyperrectangle is represented by two arrays: one of its minimum coordinates,
the other of its maximum coordinates. To cut the hyperrectangle, so that one of its edges is moved
closer to its centre, the appropriate array component is altered. To check to see if a hyperrectangle
hr intersects with a hypersphere of radius r centered at point t, we find the point p in hr which is
closest to t. Write hr_i^min as the minimum extreme of hr in the ith dimension and hr_i^max as the
maximum extreme. p_i, the ith component of this closest point, is computed thus:

    p_i = hr_i^min   if t_i ≤ hr_i^min
          t_i        if hr_i^min < t_i < hr_i^max        (6.6)
          hr_i^max   if t_i ≥ hr_i^max

The objects intersect only if the distance between p and t is less than or equal to r.

The search is depth first, and uses the heuristic of searching first the child node which contains
the target. Step 1 deals with the trivial empty tree case, and Steps 2 and 3 assign two important
local variables. Step 4 cuts the current hyperrectangle into the two hyperrectangles covering the
space occupied by the child nodes. Steps 5-7 determine which child contains the target. After
Step 8, when this initial child is searched, it may be possible to prove that there cannot be any
closer point in the hyperrectangle of the further child. In particular, the point at the current node
must be out of range. The test is made in Steps 9 and 10. Step 9 restricts the maximum radius
in which any possible closer point could lie, and then the test in Step 10 checks whether there is
any space in the hyperrectangle of the further child which lies within this radius. If it is not
possible then no further search is necessary. If it is possible, then Step 10.1 checks if the point
associated with the current node of the tree is closer than the closest yet. Then, in Step 10.2, the
further child is recursively searched. The maximum distance worth examining in this further search
is the distance to the closest point yet discovered.

Algorithm: Nearest Neighbour in a kd-tree
Input: kd, of type kdtree
       target, of type domain vector
       hr, of type hyperrectangle
       max-dist-sqd, of type float
Output: nearest, of type exemplar
        dist-sqd, of type float
Pre: Is-legal-kdtree(kd)
Post: Informally, the postcondition is that nearest is a nearest
  exemplar to target which also lies both within the hyperrectangle hr
  and within distance √max-dist-sqd of target. √dist-sqd is the distance
  of this nearest point. If there is no such point then dist-sqd contains
  infinity.
Code:
1. if kd is empty then set dist-sqd to infinity and exit.
2. s := split field of kd
3. pivot := dom-elt field of kd
4. Cut hr into two sub-hyperrectangles left-hr and right-hr.
   The cut plane is through pivot and perpendicular to the s dimension.
5. target-in-left := target_s ≤ pivot_s
6. if target-in-left then
   6.1 nearer-kd := left field of kd and nearer-hr := left-hr
   6.2 further-kd := right field of kd and further-hr := right-hr
7. if not target-in-left then
   7.1 nearer-kd := right field of kd and nearer-hr := right-hr
   7.2 further-kd := left field of kd and further-hr := left-hr
8. Recursively call Nearest Neighbour with parameters
   (nearer-kd, target, nearer-hr, max-dist-sqd),
   storing the results in nearest and dist-sqd
9. max-dist-sqd := minimum of max-dist-sqd and dist-sqd
10. A nearer point could only lie in further-kd if there were some
    part of further-hr within distance √max-dist-sqd of target.
    if this is the case then
    10.1 if | pivot − target |² < dist-sqd then
         10.1.1 nearest := (pivot, range-elt field of kd)
         10.1.2 dist-sqd := | pivot − target |²
         10.1.3 max-dist-sqd := dist-sqd
    10.2 Recursively call Nearest Neighbour with parameters
         (further-kd, target, further-hr, max-dist-sqd),
         storing the results in temp-nearest and temp-dist-sqd
    10.3 If temp-dist-sqd < dist-sqd then
         10.3.1 nearest := temp-nearest and dist-sqd := temp-dist-sqd
Proof: Outlined in text.

Table 6.4: The Nearest Neighbour Algorithm

The proof that this will find the nearest neighbour within the constraints is by induction on
the size of the kd-tree. If the cutoff were not made in Step 10, then the proof would be
straightforward: the point returned is the closest out of (i) the closest point in the nearer child,
(ii) the point at the current node and (iii) the closest point in the further child.
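The whole procedure of Table 6.4, including the closest-point-in-hyperrectangle test of Equation 6.6, can be sketched in Python. The representation choices are mine: a node is a (d, r, split, left, right) tuple as in Table 6.3, and a hyperrectangle is a (mins, maxs) pair of lists:

```python
import math

def closest_point_in_hr(hr, t):
    # Equation 6.6: the point of hyperrectangle hr closest to target t.
    mins, maxs = hr
    return [min(max(ti, lo), hi) for ti, lo, hi in zip(t, mins, maxs)]

def dist_sqd(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nn_search(kd, target, hr, max_dist_sqd):
    # Returns (nearest_exemplar, dist_sqd); (None, inf) if nothing in range.
    if kd is None:                                  # Step 1
        return None, math.inf
    d, r, s, left, right = kd                       # Steps 2-3
    # Step 4: cut hr at the pivot, perpendicular to dimension s.
    left_hr = (hr[0], hr[1][:s] + [d[s]] + hr[1][s + 1:])
    right_hr = (hr[0][:s] + [d[s]] + hr[0][s + 1:], hr[1])
    if target[s] <= d[s]:                           # Steps 5-7
        nearer, nearer_hr, further, further_hr = left, left_hr, right, right_hr
    else:
        nearer, nearer_hr, further, further_hr = right, right_hr, left, left_hr
    nearest, best = nn_search(nearer, target, nearer_hr, max_dist_sqd)  # Step 8
    max_dist_sqd = min(max_dist_sqd, best)          # Step 9
    # Step 10: search further child only if part of its hyperrectangle
    # lies within sqrt(max_dist_sqd) of the target.
    if dist_sqd(closest_point_in_hr(further_hr, target), target) <= max_dist_sqd:
        pd = dist_sqd(d, target)                    # Step 10.1
        if pd < best:
            nearest, best = (d, r), pd
            max_dist_sqd = best
        temp, temp_d = nn_search(further, target, further_hr, max_dist_sqd)
        if temp_d < best:                           # Step 10.3
            nearest, best = temp, temp_d
    return nearest, best

# The tree of Figure 6.1, searched with the infinite initial hyperrectangle.
tree = ((2, 5), 'a', 1,
        ((6, 3), 'c', 0, None, None),
        ((3, 8), 'b', 0, None, ((8, 9), 'd', 0, None, None)))
hr = ([-math.inf, -math.inf], [math.inf, math.inf])
print(nn_search(tree, (7, 8), hr, math.inf))  # → (((8, 9), 'd'), 2)
```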
If the cutoff were made in Step 10, then the point returned is the closest point in the nearer
child, and we can show that neither the current point, nor any point in the further child, can
possibly be closer.

Many local optimizations are possible which, while not altering the asymptotic performance of
the algorithm, will multiply the speed by a constant factor. In particular, it is in practice possible
to hold almost all of the search state globally, instead of passing it as recursive parameters.

6.5 Theoretical Behaviour

Given a kd-tree with N nodes, how many nodes need to be inspected in order to find the proven
nearest neighbour using the algorithm in Section 6.4? It is clear at once that on average, at least
O(log N) inspections are necessary, because any nearest neighbour search requires traversal to at
least one leaf of the tree. It is also clear that no more than N nodes are searched: the algorithm
visits each node at most once. Figure 6.5 graphically shows why we might expect considerably
fewer than N nodes to be visited: the shaded areas correspond to areas of the kd-tree which were
cut off.

The important values are (i) the worst case number of inspections and (ii) the expected number
of inspections. It is actually easy to construct worst case distributions of the points which will force
nearly all the nodes to be inspected. In Figure 6.6, the tree is two-dimensional, and the points are
scattered along the circumference of a circle. If we request the nearest neighbour with the target
close to the centre of the circle, it will therefore be necessary for each rectangle, and hence each
leaf, to be inspected (this is in order to ensure that there is no point lying inside the circle in any
rectangle).
Calculation of the expected number of inspections is difficult, because the analysis depends
critically on the expected distribution of the points in the kd-tree, and the expected distribution
of the target points presented to the nearest neighbour algorithm.

The analysis is performed in [Friedman et al., 1977]. This paper considers the expected number
of hyperrectangles corresponding to leaf nodes which will provably need to be searched. Such
hyperrectangles intersect the volume enclosed by a hypersphere centered on the query point whose
surface passes through the nearest neighbour. For example, in Figure 6.5 the hypersphere (in this
case a circle) is shown, and the number of intersecting hyperrectangles is two.

Figure 6.5: Generally during a nearest neighbour search only a few leaf nodes need to be inspected.

Figure 6.6: A bad distribution which forces almost all nodes to be inspected.

The paper shows that the expected number of intersecting hyperrectangles is independent of N,
the number of exemplars. The asymptotic search time is thus logarithmic, because the time to
descend from the root of the tree to the leaves is logarithmic (in a balanced tree), and then an
expected constant amount of backtracking is required.

However, this reasoning was based on the assumption that the hyperrectangles in the tree tend
to be hypercubic in shape. Empirical evidence in my investigations has shown that this is not
generally the case for their tree building strategy. This is discussed and demonstrated in Section 6.7.

A second danger is that the cost, while independent of N, is exponentially dependent on k, the
dimensionality of the domain vectors.[1] Thus theoretical analysis provides some insight into the
cost, but here, empirical investigation will be used to examine the expense of nearest neighbour in
practice.
6.6 Empirical Behaviour

In this section I investigate the empirical behaviour of the nearest neighbour searching algorithm.
We expect that the number of nodes inspected in the tree varies according to the following
properties of the tree:

- N, the size of the tree.

- k_dom, the dimensionality of the domain vectors in the tree. This value is the k in kd-tree.

- d_distrib, the distribution of the domain vectors. This can be quantified as the "true" dimen-
  sionality of the vectors. For example, if the vectors had three components, but all lay on the
  surface of a sphere, then the underlying dimensionality would be 2. In general, discovery of
  the underlying dimensionality of a given sample of points is extremely difficult, but for these
  tests it is a straightforward matter to generate such points. To make a kd-tree with underly-
  ing dimensionality d_distrib, I use randomly generated k_dom-dimensional domain vectors which
  lie on a d_distrib-dimensional hyperelliptical surface. The random vector generation algorithm
  is as follows: Generate d_distrib random angles θ_i ∈ [0, 2π) where 0 ≤ i < d_distrib. Then let
  the jth component of the vector be Π_{i=0}^{d_distrib−1} sin(θ_i + φ_ij). The phase angles φ_ij
  are defined as π/2 if the jth bit of the binary representation of i is 1, and zero otherwise.

- d_target, the probability distribution from which the search target vector will be selected. I
  shall assume that this distribution is the same as that which determines the domain vectors.
  This is indeed what will happen when the kd-tree is used for learning control.

[1] This was pointed out to the author by N. Maclaren.

Figure 6.7: Number of inspections required during a nearest neighbour search against the size of
the kd-tree. In this experiment the tree was four-dimensional and the underlying distribution of
the points was three-dimensional.

In the following sections I investigate how performance depends on each of these properties.
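The generation scheme just described can be sketched as follows. This is my reconstruction of the product-of-sines construction (the function name is mine), so treat the details as an illustration rather than the thesis's exact code:

```python
import math
import random

def random_point(k_dom, d_distrib):
    # d_distrib random angles theta_i in [0, 2*pi).
    theta = [random.uniform(0, 2 * math.pi) for _ in range(d_distrib)]
    # jth component: product over i of sin(theta_i + phi_ij), where
    # phi_ij = pi/2 if the jth bit of i is 1, else 0.
    point = []
    for j in range(k_dom):
        c = 1.0
        for i in range(d_distrib):
            phi = math.pi / 2 if (i >> j) & 1 else 0.0
            c *= math.sin(theta[i] + phi)
        point.append(c)
    return point
```

Each generated vector has k_dom components but only d_distrib underlying degrees of freedom, which is the property the experiments below vary.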
6.6.1 Performance against Tree Size

Figures 6.7 and 6.8 graph the number of nodes inspected against the number of nodes in the entire
kd-tree. Each value was derived by generating a random kd-tree, and then requesting 500 random
nearest neighbour searches on the kd-tree. The average number of inspected nodes was recorded.
A node was counted as being inspected if the distance between it and the target was computed.
Figure 6.7 was obtained from a 4d-tree with an underlying distribution d_distrib = 3. Figure 6.8
used an 8d-tree with an underlying distribution d_distrib = 8.

It is immediately clear that after a certain point, the expense of a nearest neighbour search has
no detectable increase with the size of the tree. This agrees with the proposed model of search
cost: logarithmic with a large additive constant term.

Figure 6.8: Number of inspections against kd-tree size for an eight-dimensional tree with an
eight-dimensional underlying distribution.

6.6.2 Performance against the "k" in kd-tree

Figure 6.9 graphs the number of nodes inspected against k_dom, the number of components in the
kd-tree's domain vectors, for a 10,000 node tree. The underlying dimensionality was also k_dom.
The number of inspections per search rises very quickly, possibly exponentially, with k_dom. This
behaviour, the massive increase in cost with dimension, is familiar in computational geometry.

Figure 6.9: Number of inspections graphed against tree dimension. In these experiments the
points had an underlying distribution with the same dimensionality as the tree.

Figure 6.10: Number of inspections graphed against underlying dimensionality for a fourteen-
dimensional tree.

6.6.3 Performance against the Distribution Dimensionality

This experiment confirms that it is d_distrib, the distribution dimension from which the points were
selected, rather than k_dom, which critically affects the search performance. The trials for Figure 6.10
used a 10,000 node kd-tree with a domain dimension of 14, for various values of d_distrib. The im-
portant observation is that for 14d-trees, the performance does improve greatly if the underlying
distribution-dimension is relatively low. Conversely, Figure 6.11 shows that for a fixed 4-d un-
derlying dimensionality, the search expense does not seem to increase any worse than linearly
with k_dom.

6.6.4 When the Target is not Chosen from the kd-tree's Distribution

In this experiment the points were distributed on a three-dimensional elliptical surface in ten-
dimensional space. The target vector was, however, chosen at random from a ten-dimensional
distribution. The kd-tree contained 10,000 points. The average number of inspections over 50
searches was found to be 8,396. This compares with another experiment in which both points
and target were distributed in ten dimensions and the average number of inspections was only
248. The reason for the appalling performance was exemplified in Figure 6.6: if the target is very
far from its nearest neighbour then very many leaf nodes must be checked.

6.6.5 Conclusion

The speed of the search (measured as the number of distance computations required) seems to
vary:

- only marginally with tree size. If the tree is sufficiently large with respect to the number of
  dimensions, it is essentially constant.

- very quickly with the dimensionality of the distribution of the datapoints, d_distrib.

- linearly with the number of components in the kd-tree's domain, k_dom, given a fixed
  distribution dimension d_distrib.

Figure 6.11: Number of inspections graphed against tree dimension, given a constant four-
dimensional underlying distribution.

There is also evidence to suggest that unless the target vector is drawn from the same distri-
bution as the kd-tree points, performance can be greatly worsened.

These results support the belief that real time searching for nearest neighbours is practical in
a robotic system, where we can expect the underlying dimensionality of the data points to be low,
roughly less than 10. This need not mean that the vectors in the input space should have less than
ten components. For data points obtained from robotic systems it will not be easy to decide what
the underlying dimensionality is. However, Chapter 10 will show that the data does tend to lie
within a number of low dimensional subspaces.

6.7 Further kd-tree Operations

In this section I discuss some other operations on kd-trees which are required for use in the SAB
learning system. These include incrementally adding a point to a kd-tree, range searching, and
selecting a pivot point.

6.7.1 Range Searching a kd-tree

    range-search : exemplar-set × Domain × ℜ → exemplar-set

The abstract range search operation on an exemplar-set finds all exemplars whose domain vectors
are within a given distance of a target point:

    range-search(E, d, r) = {(d', r') ∈ E | | d − d' |² ≤ r²}

This is implemented by a modified nearest neighbour search. The modifications are that (i) the
initial distance is not reduced as closer points are discovered and (ii) all discovered points within
the distance are returned, not just the nearest. The complexity of this operation is shown, in
[Preparata and Shamos, 1985], to still be logarithmic in N (the size of E) for a fixed range size.

6.7.2 Choosing a Pivot from an Exemplar Set

The tree building algorithm of Section 6.3 requires that a pivot and a splitting plane be selected
from which to build the root of a kd-tree.
It is desirable for the tree to be reasonably balanced, and also for the shapes of the hyperregions
corresponding to leaf nodes to be fairly equally proportioned. The first criterion is important
because a badly unbalanced tree would perhaps have O(N) accessing behaviour instead of
O(log N). The second criterion is in order to maximize cutoff opportunities for the nearest
neighbour search. This is difficult to formalize, but can be motivated by an illustration. In
Figure 6.12 is a perfectly balanced kd-tree in which the leaf regions are very non-square.
Figure 6.13 illustrates a kd-tree representing the same set of points, but which promotes
squareness at the expense of some balance.

One pivoting strategy which would lead to a perfectly balanced tree, and which is suggested
in [Omohundro, 1987], is to pick the splitting dimension as that with maximum variance, and let
the pivot be the point with the median split component. This will, it is hoped, tend to promote
square regions because having split in one dimension, the next level in the tree is unlikely to find
that the same dimension has maximum spread, and so will choose a different dimension. For
uniform distributions this tends to perform reasonably well, but for badly skewed distributions
the hyperregions tend to take long thin shapes. This is exemplified in Figure 6.12, which has been
balanced using this standard median pivot choice.

To avoid this bad case, I choose a pivot which splits the exemplar set in the middle of the range
of the most spread dimension. As can be seen in Figure 6.13, this tends to favour squarer regions
at the expense of a slight imbalance in the kd-tree. This means that large empty areas of space
are filled with only a few hyperrectangles which are themselves large. Thus, the number of leaf
nodes which need to be inspected in case they contain a nearer neighbour is smaller than for the
original case, which had many small thin hyperrectangles.
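The "middle of the range of the most spread dimension" rule can be sketched as follows (a minimal illustration; the function name and exemplar representation are mine):

```python
def choose_pivot(exset):
    # Split on the most spread dimension; pivot on the exemplar whose
    # component in that dimension is closest to the middle of its range.
    k = len(exset[0][0])
    spreads = [(max(e[0][i] for e in exset) - min(e[0][i] for e in exset), i)
               for i in range(k)]
    _, split = max(spreads)
    lo = min(e[0][split] for e in exset)
    hi = max(e[0][split] for e in exset)
    mid = (lo + hi) / 2
    pivot = min(exset, key=lambda e: abs(e[0][split] - mid))
    return pivot, split

# Dimension 1 is by far the most spread here, so the rule splits on it
# and picks the exemplar nearest the middle of that range.
exset = [((0, 0), None), ((1, 9), None), ((2, 1), None)]
```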
Figure 6.12: A 2d tree balanced using the 'median of the most spread dimension' pivoting strategy.

Figure 6.13: A 2d tree balanced using the 'closest to the centre of the widest dimension' pivoting
strategy.

My pivot choice algorithm is to firstly choose the splitting dimension as the longest dimension
of the current hyperrectangle, and then choose the pivot as the point closest to the middle of the
hyperrectangle along this dimension. Occasionally, this pivot may even be an extreme point along
its dimension, leading to an entirely unbalanced node. This is worth it, because it creates a large
empty leaf node. It is possible, but extremely unlikely, that the points could be distributed in
such a way as to cause the tree to have one empty child at every level. This would be unacceptable,
and so above a certain depth threshold, the pivots are chosen using the standard median technique.
Selecting the median as the split and selecting the closest to the centre of the range are both O(N)
operations, and so either way a tree rebalance is O(N log N).

6.7.3 Incrementally Adding a Point to a kd-tree

Firstly, the leaf node which contains the new point is computed. The hyperrectangle corresponding
to this leaf is also obtained (see Section 6.4 for hyperrectangle implementation). When the leaf
node is found it may either (i) be empty, in which case it is simply replaced by a new singleton
node, or (ii) contain a singleton node. In case (ii) the singleton node must be given a child, and
so its previously irrelevant split field must be defined. The split field should be chosen to preserve
the squareness of the new subhyperrectangles. A simple heuristic is used: the split dimension is
chosen as the dimension in which the hyperrectangle is longest. This heuristic is motivated by the
same requirement as for tree balancing: that the regions should be as square as possible, even if
this means some loss of balance.
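The insertion procedure just described might be sketched as follows. Here I use mutable dicts for nodes so children can be attached in place, and pass the current hyperrectangle down explicitly; all representation choices are my own:

```python
def insert(node, d, r, hr_mins, hr_maxs):
    # Insert exemplar (d, r) into the kd-tree rooted at node; hr_mins and
    # hr_maxs bound the hyperrectangle of the current subtree.
    if node is None:
        # Case (i): empty leaf, replaced by a new singleton node. Its split
        # field is irrelevant until a child arrives; 0 is a placeholder.
        return {'d': d, 'r': r, 'split': 0, 'left': None, 'right': None}
    if node['left'] is None and node['right'] is None:
        # Case (ii): singleton leaf gains a child, so define its split as
        # the longest dimension of the leaf's hyperrectangle.
        node['split'] = max(range(len(d)),
                            key=lambda i: hr_maxs[i] - hr_mins[i])
    s = node['split']
    if d[s] <= node['d'][s]:
        new_maxs = hr_maxs[:s] + [node['d'][s]] + hr_maxs[s + 1:]
        node['left'] = insert(node['left'], d, r, hr_mins, new_maxs)
    else:
        new_mins = hr_mins[:s] + [node['d'][s]] + hr_mins[s + 1:]
        node['right'] = insert(node['right'], d, r, new_mins, hr_maxs)
    return node

root = insert(None, (2, 5), 'a', [0, 0], [10, 10])
root = insert(root, (6, 3), 'b', [0, 0], [10, 10])
```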
This splitting choice is just a heuristic, and there is no guarantee that a series of points added
in this way will preserve the balance of the kd-tree, nor that the hyperrectangles will be well
shaped for nearest neighbour search. Thus, on occasion (such as when the depth exceeds a small
multiple of the best possible depth) the tree is rebuilt. Incremental addition costs O(log N).

6.7.4 Q Nearest Neighbours

This uses a modified version of the nearest neighbour search. Instead of only searching within a
sphere whose radius is the closest distance yet found, the search is within a sphere whose radius
is the Qth closest yet found. Until Q points have been found, this distance is infinity.

6.7.5 Deleting a Point from a kd-tree

If the point is at a leaf, this is straightforward. Otherwise, it is difficult, because the structure of
both subtrees below this node is pivoted around the point we wish to remove. One solution would
be to rebuild the tree below the deleted point, but on occasion this would be very expensive. My
solution is to mark the point as deleted with an extra field in the kd-tree node, and to ignore
deletion nodes in nearest neighbour and similar searches. When the tree is next rebuilt, all deletion
nodes are removed.

Bibliography

[Bentley, 1980] J. L. Bentley. Multidimensional Divide and Conquer. Communications of the
ACM, 23(4):214-229, 1980.

[Friedman et al., 1977] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding
Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software,
3(3):209-226, September 1977.

[Omohundro, 1987] S. M. Omohundro. Efficient Algorithms with Neural Network Behaviour.
Journal of Complex Systems, 1(2):273-347, 1987.

[Preparata and Shamos, 1985] F. P. Preparata and M. Shamos. Computational Geometry.
Springer-Verlag, 1985.