CSE 326: Binary Search Trees

Today's Outline
• Dictionary ADT / Search ADT
• Quick Review
• Binary Search Trees


ADTs Seen So Far
• Stack
  – Push
  – Pop
• Queue
  – Enqueue
  – Dequeue
• Priority Queue
  – Insert
  – DeleteMin
Then there is decreaseKey… it needs a pointer to the node. Why? Because find is not efficient in a heap.

The Dictionary ADT
• Data: a set of (key, value) pairs, e.g.
  – jfogarty: (James Fogarty, CSE 666)
  – phenry: (Peter Henry, CSE 002)
  – boqin: (Bo Qin, CSE 002)
• Operations:
  – Insert (key, value), e.g. insert(jfogarty, …)
  – Find (key), e.g. find(boqin)
  – Remove (key)
The Dictionary ADT is also called the "Map ADT".

A Modest Few Uses
• Sets
• Dictionaries
• Networks: router tables
• Operating systems: page tables
• Compilers: symbol tables
Probably the most widely used ADT!

Implementations
                        insert        find        delete
• Unsorted linked list  Θ(1)          Θ(n)        Θ(n)
• Unsorted array        Θ(1)          Θ(n)        Θ(n)
• Sorted array          Θ(log n + n)  Θ(log n)    Θ(log n + n)
The sorted array is SO CLOSE! What limits its performance? The time to move elements. Can we mimic binary search with a BST?

Tree Calculations
Recall: the height of a tree is the max number of edges from the root to a leaf.
Find the height of the tree recursively:
    height(t) = 1 + max{ height(t.left), height(t.right) }
Runtime: Θ(N) (constant time at each node; each node is visited twice, once on the way down and once returning).

Tree Calculations Example
How high is this tree? In the example (root A, children B and C, down through D-L to leaves M and N):
    height(B) = 1, height(C) = 4, so height(A) = 5.
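That recurrence maps directly into code. A minimal Java sketch, using the convention adopted later in these slides that an empty (null) tree has height -1; BNode and its left/right fields are the same illustrative names as in the traversal code below.

    // Height = max # of edges from this node down to a leaf; an empty tree
    // counts as -1, so a single leaf gets height 1 + max(-1, -1) = 0.
    int height(BNode t) {
        if (t == null) return -1;
        return 1 + Math.max(height(t.left), height(t.right));
    }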

Tree Traversals
A traversal is an order for visiting all the nodes of a tree. Three types:
• Pre-order: root, left subtree, right subtree
• In-order: left subtree, root, right subtree
• Post-order: left subtree, right subtree, root
(Example: for the expression tree with * at the root, + and 5 below it, and 2 and 4 under the +, an in-order traversal yields the infix expression.)

More Recursive Tree Calculations: Inorder Traversal
void traverse(BNode t) {
    if (t != NULL) {
        traverse(t.left);
        process(t.element);
        traverse(t.right);
    }
}


Binary Trees
• A binary tree is
  – a root
  – a left subtree (maybe empty)
  – a right subtree (maybe empty)

Representation
• Each node is a record holding its data plus left and right pointers.
(The example tree has root A with children B and C, then D, E, F, G, H, I, J below; in the diagram each node shows its data field between a left pointer and a right pointer, with null pointers for missing children.)

Binary Tree: Special Cases
(Examples: a complete tree, a perfect tree, and a full tree.)

Binary Tree: Some Numbers!
For a binary tree of height h:
– max # of leaves: 2^h, for a perfect tree
– max # of nodes: 2^(h+1) − 1, for a perfect tree
– min # of leaves: 1, for a "list" tree
– min # of nodes: h + 1, for a "list" tree
Average depth for N nodes?


Binary Search Trees
• Structural property
  – each node has ≤ 2 children
  – result: storage is small, operations are simple, average depth is small
• Order property
  – all keys in the left subtree are smaller than the root's key
  – all keys in the right subtree are larger than the root's key
  – all children must obey the order
  – result: easy to find any given key
• What must I know about what I store? Comparison and equality testing.

Example and Counter-Example
Which of these are binary search trees? (In the examples, the tree rooted at 8 with 5 and 11 below obeys the order property at every node; the counter-examples each place a key on the wrong side of some ancestor.)

Find in BST, Recursive
(Example tree: 10 at the root; 5, 15; 2, 9, 20; 7, 17, 30.)
Node Find(Object key, Node root) {
    if (root == NULL)
        return NULL;
    if (key < root.key)
        return Find(key, root.left);
    else if (key > root.key)
        return Find(key, root.right);
    else
        return root;
}
Runtime: O(depth of the key in the tree); worst case O(n) for a degenerate, list-like tree.

Find in BST, Iterative
Node Find(Object key, Node root) {
    while (root != NULL && root.key != key) {
        if (key < root.key)
            root = root.left;
        else
            root = root.right;
    }
    return root;
}
Runtime: the same O(depth), with no recursion overhead.

Insert in BST
Insert(13), Insert(8), Insert(31) into the example tree (10; 5, 15; 2, 9, 20; 7, 17, 30): each new key goes exactly where Find would look for it.
Insertions happen only at the leaves. Easy!
Runtime: same as Find, O(depth).

BuildTree for BST
• Suppose keys 1, 2, 3, 4, 5, 6, 7, 8, 9 are inserted into an initially empty BST. The runtime depends on the order!
  – in the given order: Θ(n²)
  – in reverse order: Θ(n²)
  – median first, then the left median, right median, and so on (5, 3, 7, 2, 1, 6, 8, 9): better, Θ(n log n)
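As a hedged sketch, the insert-at-the-leaf idea in the same Java-like style as the Find code above (the int key and the Node(key) constructor are assumptions for illustration):

    // Returns the (possibly new) root of the subtree after inserting key.
    Node insert(int key, Node root) {
        if (root == null) return new Node(key);   // fell off the tree: new leaf goes here
        if (key < root.key)
            root.left = insert(key, root.left);
        else if (key > root.key)
            root.right = insert(key, root.right); // equal key: a duplicate, do nothing
        return root;
    }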


Bonus: FindMin/FindMax
• Find minimum: follow left children until there is no left child.
• Find maximum: follow right children until there is no right child.
(In the example tree, the minimum is 2 and the maximum is 30.)

Deletion in BST
Why might deletion be harder than insertion?


Non-lazy Deletion
• Removing an item disrupts the tree structure.
• Basic idea: find the node that is to be removed, then "fix" the tree so that it is still a binary search tree.
• Three cases:
  – node has no children (leaf node)
  – node has one child
  – node has two children

Lazy Deletion
Instead of physically deleting nodes, just mark them as deleted
+ simpler
+ physical deletions done in batches
+ some adds just flip the deleted flag
– extra memory for the "deleted" flag
– many lazy deletions = slow finds
– some operations may have to be modified (e.g., min and max)

Non-lazy Deletion – The Leaf Case
Delete(17) from the example tree (10; 5, 15; 2, 9, 20; 7, 17, 30): 17 is a leaf, so just remove it.

Deletion – The One Child Case
Delete(15): 15 has one child (20), so splice 15 out by linking its parent (10) directly to 20.


Deletion – The Two Child Case
Delete(5) from the tree (10; 5, 20; 2, 9, 30; 7): what can we replace 5 with?
Idea: replace the deleted node with a value guaranteed to be between its two child subtrees. Options:
• succ, from the right subtree: findMin(t.right)
• pred, from the left subtree: findMax(t.left)
Now delete the original node containing succ or pred: that node is a leaf or has one child, the easy cases! (See the sketch below.)
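A sketch of the whole deletion in the same style, using the successor option (findMin of the right subtree) for the two-child case; copying the successor's value into the node and then deleting the successor is one standard way to realize the idea above.

    Node delete(int key, Node root) {
        if (root == null) return null;                    // key not found
        if (key < root.key)
            root.left = delete(key, root.left);
        else if (key > root.key)
            root.right = delete(key, root.right);
        else if (root.left == null)
            return root.right;                            // leaf or one-child case:
        else if (root.right == null)
            return root.left;                             // splice the node out
        else {
            Node succ = root.right;                       // two children:
            while (succ.left != null) succ = succ.left;   // succ = findMin(root.right)
            root.key = succ.key;                          // overwrite with succ's key,
            root.right = delete(succ.key, root.right);    // then delete the old succ
        }
        return root;
    }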


Finally…
7 (the predecessor) replaces 5, and the original node containing 7 gets deleted.

Balanced BST
Observation
• BST: the shallower the better!
• For a BST with n nodes
  – the average height is O(log n)
  – the worst-case height is O(n)
• Simple cases, such as insert(1, 2, 3, ..., n), lead to the worst case
Solution: require a Balance Condition that
1. ensures depth is O(log n) – strong enough!
2. is easy to maintain – not too strong!

Potential Balance Conditions
1. Left and right subtrees of the root have an equal number of nodes
   Too weak! (Height-mismatch example: each side can be an arbitrarily deep chain.)
2. Left and right subtrees of the root have equal height
   Too weak! (Example: a left chain and a right chain, with no other nodes.)
3. Left and right subtrees of every node have an equal number of nodes
   Too strong! Only perfect trees satisfy it.
4. Left and right subtrees of every node have equal height
   Too strong! Only perfect trees satisfy it.

CSE 326: Data Structures
AVL Trees


The AVL Balance Condition (Adelson-Velskii and Landis)
AVL balance property: the left and right subtrees of every node have heights differing by at most 1.
(This sits between the potential balance conditions above: stronger than conditions on the root alone, weaker than equal heights everywhere.)
• Ensures small depth
  – We will prove this by showing that an AVL tree of height h must have exponentially many nodes (the count grows like the Fibonacci numbers)
• Easy to maintain
  – Using single and double rotations

AVL trees or not?
(Check each example tree: at every node, do the two subtree heights differ by at most 1, and does the ordering hold?)

The AVL Tree
Structural properties
1. Binary tree property (0, 1, or 2 children)
2. Heights of the left and right subtrees of every node differ by at most 1
Ordering property
– Same as for a BST
Result: the worst-case depth of any node is O(log n).

Proving the Shallowness Bound
Let S(h) be the minimum number of nodes in an AVL tree of height h.
(Example: an AVL tree of height h = 4 with the minimum number of nodes, 12.)
Claim: S(h) = S(h-1) + S(h-2) + 1
This recurrence grows like the Fibonacci numbers, i.e. exponentially in h: since S(h-1) ≥ S(h-2), we get S(h) ≥ 2·S(h-2), so S(h) ≥ 2^(h/2). Hence an AVL tree with n nodes has height O(log n).

Testing the Balance Property
We need to be able to:
1. Track balance
2. Detect imbalance
3. Restore balance
Convention: NULL subtrees have height -1.

AVL trees: find, insert
• AVL find: same as BST find.
• AVL insert: same as BST insert, except we may need to "fix" the AVL tree after inserting the new value.

An AVL Tree
Each node records its height along with its data and children (in the example, the root 10 has height 3; 5 has height 2; 20 has height 1; the leaves have height 0).
Track height at all times. Why? So an imbalance can be detected, and fixed, locally as the insertion returns back up the tree.
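One way to realize "track balance", as a minimal sketch (the field and helper names are illustrative, not the course's official code):

    class AVLNode {
        int key;
        int height;          // height of the subtree rooted here; empty subtree = -1
        AVLNode left, right;
    }

    int height(AVLNode t) { return (t == null) ? -1 : t.height; }

    // Recompute a node's height from its children (call after changing a subtree).
    void update(AVLNode t) {
        t.height = 1 + Math.max(height(t.left), height(t.right));
    }

    // The AVL condition at a single node.
    boolean balanced(AVLNode t) {
        return Math.abs(height(t.left) - height(t.right)) <= 1;
    }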

AVL tree insert
Let x be the node where an imbalance occurs. Four cases to consider; the insertion is in the
1. left subtree of the left child of x
2. right subtree of the left child of x
3. left subtree of the right child of x
4. right subtree of the right child of x
Idea: cases 1 & 4 are solved by a single rotation; cases 2 & 3 are solved by a double rotation.

Bad Case #1
Insert(6), Insert(3), Insert(1). Where is the AVL property violated?


Fix: Apply Single Rotation
After inserting 6, 3, 1, the AVL property is violated at the root (6). A single rotation makes 3 the root, with children 1 and 6.
Single Rotation:
1. Rotate between x and its child

Single rotation in general
The AVL property is violated at node a (= x), whose left child is b; the subtrees satisfy X < b < Y < a < Z, with heights h (h ≥ -1 allows empty subtrees), and an insertion into X has just made it one taller. Rotating between a and b makes b the root of this subtree, with X as its left subtree and a as its right child; a keeps Y and Z.
Height of the tree before the insert? Height after the rotation? The same, so no ancestors are affected and one rotation suffices.

Bad Case #2
Insert(1), Insert(6), Insert(3). Where is the AVL property violated? At the root (1), but this time the insertion is in the zig-zag direction, so a single rotation will not fix it.

Fix: Apply Double Rotation
Intuition: 3 must become the root, with 1 and 6 as its children.
Double Rotation:
1. Rotate between x's child and grandchild
2. Rotate between x and x's new child

Double rotation in general (h ≥ 0)
The AVL property is violated at node a; let b = a.left and c = b.right, with W < b < X < c < Y < a < Z (W of height h, X and Y of height h−1, one of which has just grown, and Z of height h). After the two rotations, c is the root of this subtree with left child b (subtrees W and X) and right child a (subtrees Y and Z).
Height of the tree before the insert? Height after the double rotation? Again the same, so the ancestors are unaffected.

Double rotation, step 1
(Example: in the tree rooted at 15, with 8 and 17 below, the imbalance after inserting 5 is at 8. Rotate between 8's child (4) and grandchild (6), lifting 6; 4 keeps 3 and 5.)

Double rotation, step 2
Now rotate between x (8) and its new child (6): the tree becomes 15 at the root with 6 and 17 below; 6 has children 4 and 8; 4 keeps 3 and 5; 8 keeps 10; 17 keeps 16.

Imbalance at node x, in summary:
Single Rotation
1. Rotate between x and its child
Double Rotation
1. Rotate between x's child and grandchild
2. Rotate between x and x's new child
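The rotations themselves are a few pointer swaps plus height updates. A sketch built on the AVLNode fields above (the method names loosely follow the textbook's, but treat the code as illustrative):

    // Case 1 (insertion in the left subtree of the left child): single rotation.
    AVLNode rotateWithLeftChild(AVLNode a) {
        AVLNode b = a.left;
        a.left = b.right;          // subtree Y moves from b over to a
        b.right = a;
        update(a); update(b);      // recompute heights bottom-up
        return b;                  // b is the new root of this subtree
    }

    // Case 4 is the mirror image.
    AVLNode rotateWithRightChild(AVLNode a) {
        AVLNode b = a.right;
        a.right = b.left;
        b.left = a;
        update(a); update(b);
        return b;
    }

    // Case 2: rotate between a's child and grandchild, then between a and
    // its new child. (Case 3 mirrors this.)
    AVLNode doubleWithLeftChild(AVLNode a) {
        a.left = rotateWithRightChild(a.left);
        return rotateWithLeftChild(a);
    }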


Single and Double Rotations:
In the example AVL tree (9 at the root; 5 and 11; 2, 7, 13; 0, 3), inserting what integer values would cause the tree to need a:
1. single rotation?
2. double rotation?
3. no rotation?

Insertion into an AVL tree
1. Find the spot for the new key
2. Hang a new node there with this key
3. Search back up the path for an imbalance
4. If there is an imbalance:
   case #1 (zig-zig): perform a single rotation and exit
   case #2 (zig-zag): perform a double rotation and exit
Both rotations keep the subtree height unchanged. Hence only one rotation is sufficient!

Easy Insert
Insert(3) into the example tree (10 at the root; 5 and 15; 2, 9, 12, 20; 17, 30): 3 hangs under 2. Unbalanced? No.

Hard Insert (Bad Case #1)
Insert(33): it hangs under 30. Unbalanced? Yes, at 15. How to fix? A single rotation (this is a zig-zig case).

Single Rotation
Rotate between 15 and 20: 20 moves up under 10, with children 15 (keeping 12 and 17) and 30 (keeping 33). Balanced again.

Hard Insert (Bad Case #2)
Insert(18) into the original tree: it hangs under 17. Unbalanced? Yes, at 15. How to fix? A double rotation (this is a zig-zag case).

Single Rotation (oops!)
A single rotation at 15 does not fix a zig-zag case: the subtree is still unbalanced. But it now looks like a zig-zig tree!

Double Rotation (Step #1)
Rotate between x's child (20) and grandchild (17): 17 moves up under 15, with 20 keeping 18 and 30.

Double Rotation (Step #2)
Rotate between x (15) and its new child (17): 17 replaces 15, with children 15 (keeping 12) and 20 (keeping 18 and 30). Balanced.

Exercise: insert 5, 8, 9, 4, 2, 7, 3, 1, in that order, into an initially empty AVL tree.


AVL Trees Revisited
• Balance condition: the left and right subtrees of every node have heights differing by at most 1
  – Strong enough: worst-case depth is O(log n)
  – Easy to maintain: one single or double rotation
• Guaranteed O(log n) running time for Find, Insert, and Delete; buildTree is Θ(n log n)

CSE 326: Data Structures
Splay Trees


AVL Trees Revisited
• What extra info did we maintain in each node? Its height.
• Where were rotations performed? Only along the insertion path, at the deepest imbalanced node.
• How did we locate this node? By updating and checking heights while returning back up the path.

Other Possibilities?
• Could use different balance conditions, different ways to maintain balance, different guarantees on running time, …
• Why aren't AVL trees perfect? Extra info in every node, complex logic to detect imbalance, and a recursive bottom-up implementation.
• Many other balanced BST data structures:
  – Red-Black trees
  – AA trees
  – Splay trees
  – 2-3 trees
  – B-Trees
  – …

Splay Trees
• Blind adjusting version of AVL trees
  – Why worry about balance? Just rotate anyway!
• Amortized time per operation is O(log n)
• Worst-case time per operation is O(n)
  – But this is guaranteed to happen rarely
Insert/Find always rotates the node to the root!
SAT/GRE analogy question: AVL is to splay trees as leftist heaps are to skew heaps.

Recall: Amortized Complexity
If a sequence of M operations takes O(M f(n)) time, we say the amortized runtime is O(f(n)).
• Worst-case time per operation can still be large, say O(n)
• Worst-case time for any sequence of M operations is O(M f(n))
• Average time per operation over any sequence is O(f(n))
Amortized complexity is a worst-case guarantee over sequences of operations.

• Is an amortized guarantee any weaker than worst-case? Yes, it holds only over sequences.
• Is an amortized guarantee any stronger than average-case? Yes, it guarantees there are no bad sequences.
• Is an average-case guarantee good enough in practice? No: adversarial input, a bad day, …
• Is an amortized guarantee good enough in practice? Yes: again, no bad sequences.


The Idea
If you're forced to make a really deep access: since you're down there anyway, fix up a lot of deep nodes!

Find/Insert in Splay Trees
1. Find or insert a node k
2. Splay k to the root using zig-zag, zig-zig, or plain old zig rotations
Why could this be good?
1. It helps the new root, k: great if k is accessed again
2. And it helps many others: great if many other nodes on the path are accessed

Splaying node k to the root: need to be careful!
One option (that we won't use) is to repeatedly use an AVL single rotation until k becomes the root (see Section 4.5.1 for details).
What's bad about this process? The node r that was just below the root is pushed almost as low as k was.
Bad sequence: find(k), find(r), find(…), …

Splay: Zig-Zag*
g-p-k in a zig-zag: k rotates up two levels to become the subtree root, with p and g as its children (the subtrees W, X, Y, Z hang off in order).
*Just like an… AVL double rotation. Which nodes improve in depth? k and its original children.

Splay: Zig-Zig*
g-p-k in a zig-zig: k becomes the subtree root, with p below it and g below p.
*Is this just two AVL single rotations in a row? Not quite: we rotate g and p, then p and k.
Why does this help? It helps the nodes in the deep subtrees (and hurts those near the top); the same number of nodes are helped as hurt, but later rotations help the whole subtree.

Special Case for Root: Zig
If k's parent p is the root, do a plain single rotation between p and k.
Relative depth of p, Y, Z? Down 1 level. Relative depth of everyone else? Much better.
Why not drop zig-zig and just zig all the way up? A zig only helps one child!

Splaying Example: Find(6)
Think of the tree as if created by inserting 6, 5, 4, 3, 2, 1: each insert splays the new key to the root in constant time, leaving 1 at the root and a chain down to 6 (a LOT of depth). Find(6) starts splaying with a zig-zig at the bottom.
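All three cases fit in a short loop if nodes carry parent pointers. A hedged sketch, assuming a rotate(x) helper that rotates x above its parent and repairs the grandparent's child link (both the parent field and rotate are assumptions for illustration):

    // Splay x all the way to the root using zig, zig-zig, and zig-zag steps.
    void splay(Node x) {
        while (x.parent != null) {
            Node p = x.parent, g = p.parent;
            if (g == null) {
                rotate(x);                    // zig: p is the root
            } else if ((g.left == p) == (p.left == x)) {
                rotate(p); rotate(x);         // zig-zig: rotate g-p first, then p-x
            } else {
                rotate(x); rotate(x);         // zig-zag: rotate x up twice
            }
        }
    }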

Still Splaying 6
Two more steps (a zig-zig, then a zig) carry 6 the rest of the way up.

Finally…
6 is the root, and the nodes along the old access path are now at roughly half their former depths.

Another Splay: Find(4)
Find(4) in the resulting tree: two zig-zags bring 4 to the root.

Example Splayed Out
After the two finds, the original depth-5 chain has become a bushy tree only a few levels deep: both deep accesses left the tree much better off.

But Wait…
What happened here? Didn't two find operations take linear time instead of logarithmic? What about the amortized O(log n) guarantee?
That still holds, though we must take into account the previous steps used to create this tree. In fact a splay tree, by construction, will never look like the chain we started with.

Why Splaying Helps
• If a node n on the access path is at depth d before the splay, it's at about depth d/2 after the splay
• Overall, nodes which are low on the access path tend to move closer to the root
• Splaying gets amortized O(log n) performance (maybe not now, but soon, and for the rest of the operations)

Practical Benefit of Splaying
• No heights to maintain, no imbalance to check for
  – less storage per node, easier to code
• Data accessed once is often soon accessed again
  – splaying does implicit caching by bringing it to the root

Splay Operations: Find
• Find the node in the normal BST manner
• Splay the node to the root
  – if the node is not found, splay what would have been its parent
What if we didn't splay? The amortized guarantee fails!
Bad sequence: find(leaf k), find(k), find(k), …

Splay Operations: Insert
• Insert the node in the normal BST manner
• Splay the node to the root
What if we didn't splay? The amortized guarantee fails!
Bad sequence: insert(k), find(k), find(k), …

Splay Operations: Remove
Everything else splays, so we'd better splay for remove too:
find(k) splays k to the root; delete the root, leaving subtrees L (everything < k) and R (everything > k). Now what?

Join
Join(L, R): given two trees such that (stuff in L) < (stuff in R), merge them:
splay the maximum element of L to L's root (find max = walk right, similar to BST delete; this element has no right child), then attach R as its right subtree.
Does this work to join any two trees? No: we need everything in L to be less than everything in R.

Delete Example
Delete(4): find(4) splays 4 to the root; remove it, leaving L = {1, 2} and R = {6, 7, 9}; then join them by splaying 2 (the max of L) to L's root and hanging R off its right.
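A sketch of remove built from find-and-splay plus join, under stated assumptions: find(k, root) splays k (or, if absent, what would have been k's parent) to the root, and splayToRoot(m, L) is a hypothetical helper that splays node m within subtree L.

    Node remove(int k, Node root) {
        root = find(k, root);                     // splays k to the root if present
        if (root == null || root.key != k) return root;   // k wasn't there
        Node L = root.left, R = root.right;       // everything in L < k < everything in R
        if (L == null) return R;                  // join(L, R):
        Node m = L;
        while (m.right != null) m = m.right;      //   locate the max of L
        L = splayToRoot(m, L);                    //   splay it to L's root;
        L.right = R;                              //   it has no right child,
        return L;                                 //   so attach R there
    }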

Splay Tree Summary
• All operations are in amortized O(log n) time
• Splaying can be done top-down; this may be better because:
  – only one pass
  – no recursion or parent pointers necessary
  – we didn't cover top-down in class
  (Like what? Skew heaps! You don't need to wait.)
• Splay trees are very effective search trees
  – relatively simple
  – no extra fields required
  – excellent locality properties: frequently accessed keys are cheap to find
What happens to nodes that never get accessed? They tend to drop to the bottom.

Splay E
(Exercise: splay E to the root of the example tree; the nodes on E's long access path end up at roughly half their former depths.)

CSE 326: Data Structures
B-Trees

B-Trees (Weiss Sec. 4.7)

Time to access (conservative):
• CPU (has registers): 1 ns per instruction
• SRAM cache (8 KB - 4 MB): 2-10 ns
• DRAM main memory (up to 10 GB): 40-100 ns
• Disk (mechanical device; many GB): access time = seek time + transfer time, a few milliseconds (5-10 million ns)
Seek time is expensive; once you have seeked, transfer a lot.

Summary: accesses to memory are painful, especially at the slower levels.

Trees so far, for the Dictionary ADT (Find, Insert, Remove):
• BST: up to 2 children; height can be N−1; operations worst case O(N)
• AVL: up to 2 children; height O(log N); operations worst case O(log N)
• Splay: up to 2 children; height can be N−1; operations worst case O(N), amortized O(log N); simpler implementation than AVL
Each pointer followed can be an access to disk (picture the tree spread across disk blocks), so we want to make the number of accesses smaller: use trees with a larger branching factor M.

Example from the book: N = 10 million drivers in WA state.
• BST: up to 10 million disk accesses in the worst case
• AVL: about 23 disk accesses (log_2 10,000,000 ≈ 23)
• M-ary tree: with M = 5, 10 accesses; M = 10, 7 accesses; M = 100, 3-4 accesses
Suppose instead we have 100 million items (100,000,000):
• Depth of an AVL tree: log_2 100,000,000 = 26.6
• Number of disk accesses with branching factor 128: log_128 100,000,000 = 3.8. Wow!

M-ary Search Tree
• Maximum branching factor of M (like we did with d-heaps: a bigger fan-out keeps things shallower and more balanced)
• A complete tree has height O(log_M n)
• Each node has (up to) M−1 keys; the subtree between two adjacent keys x and y contains leaves with values v such that x ≤ v < y
  (Example with M = 7 and keys 3, 7, 12, 21: the subtrees hold x < 3, 3 ≤ x < 7, 7 ≤ x < 12, 12 ≤ x < 21, and 21 ≤ x.)
• # disk accesses for find: best O(log_M n); worst O(n), since the tree might be unbalanced or might effectively be binary even though the max branching factor is M
• Runtime of find: best O(log_M n · log_2 M) = O(log_2 n), doing a binary search at each level; worst O(n)

Solution: B-Trees
• Specialized M-ary search trees
• Pick the branching factor M such that each node takes one full {page, block} of memory

B-Trees
What makes them disk-friendly?
1. Many keys stored in a node
   • All brought into memory/cache in one disk access!
2. Internal nodes contain only keys; only leaf nodes contain keys and actual data
   • The index of internal nodes can be loaded into memory irrespective of data object size
   • Data actually resides on disk
   (Data objects are ignored in these slides.)

B-Tree: Example
A B-Tree with M = 4 (# pointers in an internal node) and L = 4 (# data items in a leaf):
root keys 10, 40; second-level keys 3 | 15, 20, 30 | 50; leaves [1 2], [3 5 6 9], [10 11 12], [15 17], [20 25 26], [30 32 33 36], [40 42], [50 60 70].
Note: all leaves are at the same depth!
With N = 24 entries this tree is only 2 deep; an AVL tree would be 4 to 5 deep (a binary tree of height h holds 2^h to 2^(h+1) − 1 nodes).

B-Tree Properties‡
– Data is stored at the leaves
– All leaves are at the same depth and contain between ⎡L/2⎤ and L data items
– Internal nodes store up to M−1 keys
– Internal nodes have between ⎡M/2⎤ and M children
– The root (special case) has between 2 and M children (or the root could be a leaf)
The tree is about log n deep; actually log_{M/2} n.
‡These are technically B+-Trees.

Example, Again
The same B-Tree with M = 4 and L = 4: 24 entries, but only 2 deep (a BST would be 4). Leaves are at least ½ full, kind of like a complete tree: mostly full, with some inactive entries.
Try: Find(33), Find(2), Find(16); Insert(18), Insert(16), Insert(71) [put 71 at the top]; Insert(80) (splits a leaf into 2 leaves, each with the minimum # of values).
(Only keys are shown, but the leaves also hold data!)
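To make the find path concrete, here is a hedged sketch of find in a B+-tree like the one above; the node structure (sorted keys[] and children[] in internal nodes, items[] only in leaves) is an assumption for illustration:

    // Descend from the root to the unique leaf whose key range contains key,
    // then search within that leaf. Each level costs one disk access.
    LeafEntry find(int key, BTreeNode node) {
        while (!node.isLeaf) {
            int i = 0;                                  // choose the child whose
            while (i < node.numKeys && key >= node.keys[i])
                i++;                                    // key range contains key
            node = node.children[i];
        }
        for (int i = 0; i < node.numItems; i++)         // linear (or binary) scan
            if (node.items[i].key == key)               // inside the leaf
                return node.items[i];
        return null;                                    // not present
    }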

Building a B-Tree (M = 3, L = 2)
Start with the empty B-Tree. Insert(3): one leaf, [3]. Insert(14): still one leaf, [3 14]. Now, Insert(1)?

Splitting the Root
Too many keys in a leaf! How to solve? Definitely need to split the leaf, into [1 3] and [14]. Then what? Create a new root, with key 14, above the two leaves.
This is how B-Trees grow deeper.

Overflowing Leaves (M = 3, L = 2)
Insert(59): it fits, giving leaves [1 3] and [14 59]. Insert(26): too many keys in a leaf!
So, split the leaf and add the new child to the parent: the root becomes [14 59] over leaves [1 3], [14 26], [59]. The parent had room; what if it hadn't?

Propagating Splits
Insert(5): too many keys in leaf [1 3 5]! Split it into [1 3] and [5] and add the new child, but there is no space in the parent. Same problem as before, so the internal node must split too; create a new root: root [14], with internal node [5] over leaves [1 3] and [5], and internal node [59] over leaves [14 26] and [59].

Insertion Algorithm
1. Insert the key in its leaf
2. If the leaf ends up with L+1 items, overflow!
   – Split the leaf into two nodes: the original with ⎡(L+1)/2⎤ items, a new one with ⎣(L+1)/2⎦ items
   – Add the new child to the parent
   – If the parent ends up with M+1 children, overflow!
3. If an internal node ends up with M+1 children, overflow!
   – Split the node into two nodes: the original with ⎡(M+1)/2⎤ children, a new one with ⎣(M+1)/2⎦ children
   – Add the new child to the parent
   – If the parent ends up with M+1 children, overflow!
4. Split an overflowed root in two and hang the new nodes under a new root. This makes the tree deeper!
Note that new leaves and internal nodes are guaranteed to have enough items after a split. Why? ⎣(L+1)/2⎦ ≥ ⎡L/2⎤ and ⎣(M+1)/2⎦ ≥ ⎡M/2⎤. How about deletion? (A sketch of the leaf split follows.)

After More Routine Inserts (M = 3, L = 2)
Insert(89): it fits in a leaf. Insert(79): a leaf overflows and splits, and its parent has room for the new child; the tree is now root [14]; internal nodes [5] and [59 89]; leaves [1 3], [5], [14 26], [59 79], [89].
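Step 2's leaf split, as a hedged sketch (the Leaf/InternalNode structure and addChild are illustrative; the propagation when the parent overflows is analogous):

    // A leaf holding L+1 items splits: the original keeps ceil((L+1)/2) items,
    // the new sibling takes floor((L+1)/2), and the parent gains a child.
    void splitLeaf(Leaf leaf, InternalNode parent) {
        int keep = (leaf.numItems + 1) / 2;             // ceil((L+1)/2)
        Leaf sibling = new Leaf();
        for (int i = keep; i < leaf.numItems; i++)
            sibling.add(leaf.items[i]);                 // move the upper half over
        leaf.numItems = keep;
        parent.addChild(sibling.items[0].key, sibling); // may overflow the parent,
    }                                                   // which then splits in turn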

Deletion (M = 3, L = 2)
1. Delete the item from its leaf
2. Update the keys of ancestors if necessary
Example: Delete(59): the leaf [59 79] becomes [79], and the ancestor key 59 is updated to 79.
What could go wrong? The deletion might cause a leaf to drop below ⎡L/2⎤ items.

Deletion and Adoption
Delete(5): now a leaf has too few keys! So, borrow from a sibling: leaf [1 3] gives up 3, and the parent's key is updated from 5 to 3.
(Sibling pointers, and parent pointers, make the neighbors easy to reach.)
What if there is not enough in the sibling to borrow from?

Does Adoption Always Work?
• What if the sibling doesn't have enough for you to borrow from?
  e.g. you have ⎡L/2⎤ − 1 items and the sibling has exactly ⎡L/2⎤
• And no sibling has a surplus! What if L were 100? You would have 49 keys left over in a node that must be deleted!
  Solution: give them to the neighbors (no surplus means there must be room).

Deletion and Merging
Delete(3): a leaf has too few keys, and no sibling with a surplus. So, delete the leaf and merge its contents into a neighbor. But now an internal node has too few subtrees!

Deletion with Propagation (More Adoption)
The internal node with too few subtrees adopts from a neighbor (adopts a sibling), same as before: borrow from a rich neighbor!

A Bit More Adoption
(This is just setup for the next slide.) Delete(1): the leaf borrows from a sibling, and the parent keys are updated accordingly (14 becomes 26).

Pulling out the Root (M = 3, L = 2)
Delete(26): a leaf has too few keys, and no sibling with a surplus, so delete the leaf and merge. But now a node has too few subtrees and no neighbor with a surplus, so delete the node and merge as well. Now the root has just one subtree!

Pulling out the Root (continued)
The root has just one subtree: simply make the one child the new root!

Deletion Algorithm
1. Remove the key from its leaf
2. If the leaf ends up with fewer than ⎡L/2⎤ items, underflow!
   – Adopt data from a sibling; update the parent
   – If adopting won't work, delete the node and merge with a neighbor
   – If the parent ends up with fewer than ⎡M/2⎤ children, underflow!

Deletion Slide Two
3. If an internal node ends up with fewer than ⎡M/2⎤ children, underflow!
   – Adopt from a neighbor; update the parent
   – If adoption won't work, merge with a neighbor
   – If the parent ends up with fewer than ⎡M/2⎤ children, underflow!
4. If the root ends up with only one child, make the child the new root of the tree. This reduces the height of the tree!

Thinking about B-Trees
• B-Tree insertion can cause (expensive) splitting and propagation
• B-Tree deletion can cause (cheap) adoption or (expensive) deletion, merging, and propagation
• Propagation is rare if M and L are large: only about 1/L of inserts cause a split, and only about 1/(ML) of inserts propagate up at all. (Why? A leaf must fill before it splits, and a parent must fill with M children before it splits.)
• If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items; height 5: 2 billion

Tree Names You Might Encounter
FYI:
– B-Trees with M = 3, L = x are called 2-3 trees
  • internal nodes can have 2 or 3 children
– B-Trees with M = 4, L = x are called 2-3-4 trees
  • internal nodes can have 2, 3, or 4 children

K-D Trees and Quad Trees

Range Queries
• Think of a range query:
  – "Give me all customers aged 45-55."
  – "Give me all accounts worth $5m to $15m."
• Each of these alone can be done in time O(log n + k) with a balanced BST on that one attribute, where k is the number of items reported.
• What if we want both:
  – "Give me all customers aged 45-55 with accounts worth between $5m and $15m."

Geometric Data Structures
• Organization of points, lines, planes, etc. in support of faster processing
• Applications
  – Map information
  – Graphics: computing object intersections
  – Data compression: nearest neighbor search
  – Decision trees: machine learning

k-d Trees
• Jon Bentley, 1975, while an undergraduate
• A tree used to store spatial data
  – Nearest neighbor search
  – Range queries
  – Fast look-up
• k-d trees are guaranteed log_2 n depth, where n is the number of points in the set
  – Traditionally, k-d trees store points in d-dimensional space, which are equivalent to vectors in d-dimensional space

Range Queries
(Figure: the same point set a-i queried two ways: a rectangular query returns the points inside an axis-aligned rectangle, and a circular query returns the points within a given radius.)

Nearest Neighbor Search
(Figure: for the query point shown, the nearest neighbor is e.)

k-d Tree Construction
• If there is just one point, form a leaf with that point.
• Otherwise, divide the points in half by a line perpendicular to one of the axes.
• Recursively construct k-d trees for the two sets of points.
• Division strategies:
  – divide the points perpendicular to the axis with the widest spread
  – divide in a round-robin fashion (the book does it this way)
(Figure: for the points a-i, the first division is perpendicular to the axis with the widest spread.)

k-d Tree Construction / 2-d Tree Decomposition
(Figure: the finished construction over points a-i. The splits s1-s8 alternate between x- and y-perpendicular lines; internal tree nodes are labeled with their splitting axis, each leaf holds one point, and each leaf corresponds to one rectangular k-d tree cell.)

k-d Tree Splitting
• Keep the points sorted in each dimension, e.g. for the nine points a-i:
    x: a d g b e i c h f
    y: a c b d f e h g i
• The max spread is the larger of (f − a) in x and (i − a) in y; split perpendicular to that axis.
• In the selected dimension, the middle point in the sorted list splits the data.
• To build the sorted lists for the other dimensions, scan each sorted list, adding each point to one of two sorted lists according to an indicator for each set:
    point:     a b c d e f g h i
    indicator: 0 0 1 0 0 1 0 1 1
  Scanning the y-sorted points and routing by indicator yields the y-lists (a b d e g) and (c f h i).

k-d Tree Construction Complexity
• First sort the points in each dimension: O(dn log n) time and dn storage. These are stored in A[1..d, 1..n].
• Finding the widest spread and equally dividing into two subsets can be done in O(dn) time.
• This gives the recurrence T(n,d) ≤ 2 T(n/2,d) + O(dn).
• Constructing the k-d tree can be done in O(dn log n) time and dn storage.

Node Structure for k-d Trees
• A node has 5 fields (see the sketch below)
  – axis (splitting axis)
  – value (splitting value)
  – left (left subtree)
  – right (right subtree)
  – point (holds a point if left and right children are null)
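The 5-field node as a minimal Java sketch (the names are illustrative):

    class KDNode {
        int axis;             // splitting axis, e.g. 0 = x, 1 = y
        double value;         // splitting value along that axis
        KDNode left, right;   // points on either side of the splitting line
        double[] point;       // holds a point only when left and right are null
    }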

Rectangular Range Query
• Recursively search every cell that intersects the rectangle.
(Figure: the query rectangle overlaps several cells of the decomposition; only those subtrees are visited.)

void printRange(int xlow, int xhigh, int ylow, int yhigh, KDNode root) {
    if (root == null) return;
    if (root.left == null && root.right == null) {
        // leaf: report the point if it lies inside the rectangle
        if (xlow < root.point[0] && root.point[0] < xhigh &&
            ylow < root.point[1] && root.point[1] < yhigh)
            print(root);
    } else {
        // recurse into each side of the split that the rectangle reaches;
        // note the right-subtree tests use the high end of the query range
        if ((root.axis == 0 && xlow < root.value) ||
            (root.axis == 1 && ylow < root.value))
            printRange(xlow, xhigh, ylow, yhigh, root.left);
        if ((root.axis == 0 && xhigh > root.value) ||
            (root.axis == 1 && yhigh > root.value))
            printRange(xlow, xhigh, ylow, yhigh, root.right);
    }
}

k-d Tree Nearest Neighbor Search
• Search recursively to find the point in the same cell as the query.
• On the return, search each subtree where a point closer than the one you already know about might be found (see the sketch below).
(Figure: the initial descent reaches the leaf cell containing the query point; backtracking then checks neighboring cells whose boundaries are closer than the current best point.)

Notes on k-d Tree NNS
• Has been shown to run in O(log n) average time per search, in a reasonable model.
• Storage for the k-d tree is O(n).
• Preprocessing time is O(n log n), assuming d is a constant.
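A hedged sketch of the recursive search just described, using the KDNode sketch from earlier (the best/bestDist bookkeeping and the dist helper are assumptions for illustration):

    double[] best;                                   // nearest point found so far
    double bestDist = Double.POSITIVE_INFINITY;

    void nearest(KDNode t, double[] q) {
        if (t == null) return;
        if (t.left == null && t.right == null) {     // leaf: candidate point
            double d = dist(q, t.point);             // e.g. Euclidean distance
            if (d < bestDist) { bestDist = d; best = t.point; }
            return;
        }
        KDNode near = (q[t.axis] < t.value) ? t.left : t.right;
        KDNode far  = (near == t.left) ? t.right : t.left;
        nearest(near, q);                            // search the query's cell first
        if (Math.abs(q[t.axis] - t.value) < bestDist)
            nearest(far, q);                         // a closer point may lie across
    }                                                // the splitting line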

Worst Case for Nearest Neighbor Search
(Figure: a query point positioned so that half of the points must be visited.)
• Worst case: O(n), with half of the points visited for a single query
• But on average (and in practice) nearest neighbor queries are O(log n)

Quad Trees
• Space partitioning: each region is split into four quadrants.
(Figure: the points a-g partitioned by a quad tree.)

A Bad Case
(Figure: two points a and b that are very close together; the cells must keep splitting until the two points separate, so the tree gets deep.)

Notes on Quad Trees
• The number of nodes is O(n(1 + log(Δ/n))), where n is the number of points and Δ is the ratio of the width (or height) of the key space to the smallest distance between two points
• The height of the tree is O(log n + log Δ)

• k-D Trees – Density balanced trees • Geometric data structures are common. – Height of the tree is O(log n) with batch insertion – Good choice for high dimension • The k-d tree is one of the simplest. – Supports insert, find, nearest neighbor, range queries – Nearest neighbor search • Quad Trees – Space partitioning tree – Range queries – May not be balanced – Not a good choice for high dimension • Other data structures used for – Supports insert, delete, find, nearest neighbor, range queries – 3-d graphics models – Physical simulations