Binary Search Tree

• A BST converts a static binary search into a dynamic binary search, allowing data items to be inserted and deleted efficiently
• Compare the left-to-right ordering in a BST to the bottom-up ordering in a heap, where the key of each parent node is greater than or equal to the key of any child node
• Left-to-right ordering in a BST: for every node x, the keys k_left of all nodes in the left subtree are smaller than the key k_parent in x, and the keys k_right of all nodes in the right subtree are larger than the key in x:

    k_left < k_parent < k_right

Binary Search Tree (BST): find / insert operations

• No duplicates! (attach them all to a single item)
• Basic operations:
– find: find a given search key or detect that it is not present in the tree (find is the successful binary search)
– insert: insert a node with a given key into the tree if it is not found (insert creates a new node at the point at which the unsuccessful search stops)
– findMin: find the minimum key
– findMax: find the maximum key
– remove: remove a node with a given key and restore the tree if necessary
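A minimal sketch of find and insert in Python (illustrative code, not from the lecture; the Node class and function names are assumptions):

    # Minimal BST sketch: find is a successful binary search; insert
    # creates a new node where the unsuccessful search stops.
    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def find(root, key):
        while root is not None:
            if key < root.key:
                root = root.left
            elif key > root.key:
                root = root.right
            else:
                return root        # successful search
        return None                # key not present

    def insert(root, key):
        if root is None:
            return Node(key)       # the point where the search stopped
        if key < root.key:
            root.left = insert(root.left, key)
        elif key > root.key:
            root.right = insert(root.right, key)
        # equal key: no duplicates are stored
        return root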


Binary Search Trees: findMin / findMax / sort

• Extremely simple: starting at the root, branch repeatedly left (findMin) or right (findMax) as long as a corresponding child exists
• As in QuickSort, the recursive traversal of the tree can sort the items:
– First visit the left subtree
– Then visit the root
– Then visit the right subtree

Binary Search Tree: running time

• Running time for find, insert, findMin, findMax, sort: O(log n) average-case and O(n) worst-case complexity (just as in QuickSort)
• The root of the tree plays the role of the pivot in QuickSort
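A sketch of findMin, findMax, and the in-order traversal that yields the keys in sorted order (Python; reuses the illustrative Node/insert sketch above):

    # findMin / findMax: branch repeatedly left or right while possible.
    def find_min(root):
        while root.left is not None:
            root = root.left
        return root

    def find_max(root):
        while root.right is not None:
            root = root.right
        return root

    # In-order traversal: left subtree, root, right subtree -> sorted keys.
    def in_order(root, out):
        if root is not None:
            in_order(root.left, out)
            out.append(root.key)
            in_order(root.right, out)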

[Figures: the same seven keys 1, 3, 4, 5, 8, 10, 15 arranged as a BST of depth about log n and as a BST of depth about n]

Binary Search Tree: node removal

• remove is the most complex operation:
– The removal may disconnect parts of the tree
– The reattachment of the tree must maintain the binary search tree property
– The reattachment should not make the tree unnecessarily deeper, as the depth specifies the running time of the tree operations

BST: how to remove a node

• If the node k to be removed is a leaf, delete it
• If the node k has only one child, remove it after linking its child to its parent node
• Thus, removeMin and removeMax are not complex, because the affected nodes are either leaves or have only one child
• If the node k to be removed has two children, then replace the item in this node with the item with the smallest key in the right subtree, and remove this latter node from the right subtree (Exercise: if possible, how can the nodes in the left subtree be used instead?)
• The second removal is very simple, as the node with the smallest key does not have a left child
• The smallest node is easily found as in findMin
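A removal sketch covering the three cases above (Python; reuses the illustrative find_min from earlier):

    def remove(root, key):
        if root is None:
            return None                        # key not found
        if key < root.key:
            root.left = remove(root.left, key)
        elif key > root.key:
            root.right = remove(root.right, key)
        elif root.left is not None and root.right is not None:
            # Two children: copy the smallest key of the right subtree
            # here, then remove that node (it has no left child).
            smallest = find_min(root.right)
            root.key = smallest.key
            root.right = remove(root.right, smallest.key)
        else:
            # Leaf or one child: link the child (or None) to the parent.
            root = root.left if root.left is not None else root.right
        return root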


BST: an Example of Node Removal

[Figure: a step-by-step example of removing a node from a BST]

Average-Case Performance of Binary Search Tree Operations

• The internal path length (IPL) of a binary tree is the sum of the depths of its nodes; for example, for an 8-node tree with nodes at depths 0, 1, 1, 2, 2, 3, 3, 3:
IPL = 0 + 1 + 1 + 2 + 2 + 3 + 3 + 3 = 15
• The average internal path length T(n) over the binary search trees with n nodes is O(n log n)


Average-Case Performance of Binary Search Tree Operations (continued)

• If the n-node tree contains the root, the i-node left subtree, and the (n−i−1)-node right subtree, then
T(n) = n − 1 + T(i) + T(n−i−1)
because the root contributes 1 to the path length of each of the other n − 1 nodes
• Averaging over all i, 0 ≤ i < n, yields the same recurrence as for QuickSort:
T(n) = (n − 1) + (2/n)·(T(0) + T(1) + … + T(n−1))
so that T(n) is O(n log n)
• Therefore, the average complexity of the find or insert operations is T(n)/n = O(log n)
• For n pairs of random insert / remove operations, the expected depth is O(n^0.5)
• In practice, for random input, all operations are about O(log n), but the worst-case performance can be O(n)!
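A quick numeric check of the recurrence (a Python sketch; the exact solution behaves like 2n·ln n, so the printed ratios drift slowly toward 2):

    # T(n) = (n - 1) + (2/n) * (T(0) + ... + T(n-1)), with T(0) = 0.
    import math

    T, total = [0.0], 0.0
    for n in range(1, 2001):
        total += T[n - 1]
        T.append((n - 1) + 2.0 * total / n)
    for n in (500, 1000, 2000):
        print(n, T[n] / (n * math.log(n)))   # slowly approaches 2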


Balanced Trees

• Balancing ensures that the internal path lengths are close to the optimal n log n
• The average-case and the worst-case complexity are about O(log n) due to the balanced structure
• But insert and remove operations take more time on average than for the standard binary search trees
• Balanced trees:
– AVL tree (1962: Adelson-Velskii, Landis)
– Red-black and AA-trees
– B-tree (1972: Bayer, McCreight)

AVL Tree

• An AVL tree is a binary search tree with the following additional balance property:
– for any node in the tree, the heights of the left and right subtrees can differ by at most 1
– the height of an empty subtree is −1
• The AVL balance guarantees that an AVL tree of height h has at least c^h nodes, c > 1, and that the maximum depth of an n-item tree is about log_c n

[Figure: minimal AVL trees of heights 0, 1, 2, 3, of sizes S_0 = 1, S_1 = 2, S_2 = 4, S_3 = 7]

Lemma 3.19: The height of an AVL tree is O(log n).

Proof:
1) An AVL tree of height h has at most 2^(h+1) − 1 nodes.
2) We calculate the maximum height of an AVL tree with n nodes. Let S_h be the size of the smallest AVL tree of height h (it is obvious that S_0 = 1, S_1 = 2). We can set up and solve the following recurrence relation:
S_h = S_(h−1) + S_(h−2) + 1, which gives S_h = F_(h+3) − 1,
where F_i is the i-th Fibonacci number.

[Figures: a binary search tree of height 6 with 7 nodes (keys 1, 5, 10, 15, 20, 25, 30), and the minimal complete binary search tree of height 3]

Definition (Fibonacci numbers): F(n) = F(n−1) + F(n−2), where F_i is the i-th Fibonacci number; F_1 = 1, F_2 = 1, F_3 = 2, F_4 = F_2 + F_3 = 3.

Proof of S_h = F_(h+3) − 1 by mathematical induction:
1) Base case: S_0 = F_3 − 1 = 1, S_1 = F_4 − 1 = 2
2) Assumption: S_i = F_(i+3) − 1 holds up to height i
3) Inductive step: S_(i+1) = S_i + S_(i−1) + 1 = (F_(i+3) − 1) + (F_(i+2) − 1) + 1 = F_(i+4) − 1

AVL Tree

• Therefore, for each n-node AVL tree:
n ≥ S_h = F_(h+3) − 1 ≈ φ^(h+3)/√5 − 1, where φ = (1 + √5)/2 ≅ 1.618, or
h ≤ 1.44·log₂(n + 1) − 1.328
• Thus, the worst-case height is at most 44% more than the minimum height of a complete binary tree
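A short sanity check of S_h = F_(h+3) − 1 (a Python sketch; Fibonacci indexing follows the slide's F_1 = F_2 = 1):

    # Smallest-AVL-tree sizes: S_0 = 1, S_1 = 2, S_h = S_{h-1} + S_{h-2} + 1.
    def fib(i):                    # F_1 = F_2 = 1, F_3 = 2, ...
        a, b = 1, 1
        for _ in range(i - 1):
            a, b = b, a + b
        return a

    S = [1, 2]
    for h in range(2, 12):
        S.append(S[h - 1] + S[h - 2] + 1)
    assert all(S[h] == fib(h + 3) - 1 for h in range(12))  # the lemma holds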


Balancing an AVL Tree

• There are two mirror-symmetric pairs of cases to rebalance the tree if, after the insertion of a new key at the bottom, the AVL property is invalidated
• Only one single or double rotation is sufficient
• Deletions are more complicated: O(log n) rotations can be required to restore the balance
• AVL balancing is not computationally efficient; better balanced search trees: red-black trees, AA-trees, B-trees

Single Rotation

• The inserted new key invalidates the AVL property
• To restore the AVL tree, the root of the unbalanced subtree is moved to the node B, and the rest of the tree is reorganised as a BST
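A sketch of one of the two mirror-symmetric single rotations (Python; node fields as in the earlier illustrative Node class):

    # Single rotation for the left-left case: the unbalanced root A is
    # replaced by its left child B; B's former right subtree becomes
    # A's new left subtree, preserving the BST ordering.
    def rotate_with_left_child(a):
        b = a.left
        a.left = b.right
        b.right = a
        return b          # the caller reattaches b as the subtree root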


Example of inserting a new key and rebalancing the AVL tree:

[Figure: inserting key 12 unbalances the tree rooted at 40 (children 15 and 60; 15's children 10 and 20; 12 below 10); a single rotation makes 15 the root, with children 10 (child 12) and 40 (children 20 and 60)]

Insertion algorithm (informal):
1) Ordinary binary search tree insertion
2) Rebalance the tree if necessary

Red-Black Tree

• A red-black tree is a binary search tree with the following ordering properties:
– Every node is coloured either red or black; the root is black
– Red Rule: If a node is red, its children must be black
– Path Rule: Every path from the root to a leaf or to a node with one child must contain the same number of black nodes


[Figure: a red-black tree with 8 nodes]
1) The Red Rule is satisfied.
2) There are 2 black nodes in each of the 5 paths from the root to a leaf or to a node with 1 child, so the Path Rule is satisfied.

[Figure: a red-black tree that is not balanced]
A new node below 10 is not possible! This shows that red-black trees are somewhat balanced.

Red-Black Tree

Statement: Because every path from the root to a leaf contains b black nodes, there are at least 2^b − 1 nodes in the tree.
• Proof (mathematical induction):
– Base case: it is valid for b = 1 (the tree is only the black root, possibly with 1 or 2 red children)
– Assumption: let it be valid for all red-black trees with b black nodes per path
– Inductive step: if a tree contains b+1 black nodes per path and the root has 2 black children, then it contains at least 2·(2^b − 1) + 1 = 2^(b+1) − 1 nodes

Red-Black Tree

If a red-black tree is complete, with all black nodes except for red leaves at the lowest level, the height of that tree is minimal, approximately log₂ n.
To get the maximum height for a given n, we would have as many red nodes as possible on one path, and all other nodes black. The path with all the red nodes would be about twice as long as the paths with no red nodes.
This leads us to the hypothesis that the maximum height of a red-black tree is less than 2·log₂ n.


Red-Black Tree

Proof of the hypothesis that the height h of any red-black tree with n nodes is O(log n):
By the Red Rule, at most half of the nodes in a root-to-leaf path can be red, so at least half of the nodes must be black. That means: b ≥ h/2.
Previous proof: n ≥ 2^b − 1
We replace b: n ≥ 2^(h/2) − 1
We get: h ≤ 2·log₂(n + 1)

Summary: AVL, Red-Black Trees

1) Red-black trees never get far out of balance.
2) Maximum height of an AVL tree:
h ≤ 1.44·log₂(n + 1) − 1.328
Maximum height of a red-black tree:
h ≤ 2·log₂(n + 1)


AA-Trees

• The software implementation of the insert and remove operations for red-black trees is a rather tricky process
• A balanced search AA-tree is a method of choice if deletions are needed
• The AA-tree adds one extra condition to the red-black tree: left children may not be red
• This condition greatly simplifies the red-black tree remove operation

B-Trees: Efficient External Search

• For very big databases, even log₂ n search steps may be unacceptable
• To reduce the number of disk accesses: an optimal m-ary search tree of height about log_m n
• m-way branching lowers the optimal tree height by the factor log₂ m (i.e., by 3.3 if m = 10)

n             10^5   10^6   10^7   10^8   10^9
⌈log₂ n⌉       17     20     24     27     30
⌈log₁₀ n⌉       5      6      7      8      9
⌈log₁₀₀ n⌉      3      3      4      4      5
⌈log₁₀₀₀ n⌉     2      2      3      3      3

Multiway Search Tree of Order m = 4

• In an m-ary search tree, at most m−1 keys are used to decide which branch to take
• The data records are associated only with leaves, so the worst-case and average-case searches involve the tree height and the average leaf depth, respectively

B-Tree Definition

• A B-tree of order m is defined as an m-ary balanced tree with the following properties:
– The root is either a leaf or it has between 2 and m children inclusive
– Every nonleaf node (except possibly the root) has between ⌈m/2⌉ and m children inclusive


B-Tree Definition (continued)

• A nonleaf node with μ children has μ−1 keys (key_i: i = 1, …, μ−1) to guide the search
• key_i represents the smallest key in subtree i+1
• All leaves are at the same depth
• The data items are stored at leaves, and every leaf contains between ⌈l/2⌉ and l data items, for some l that may be chosen independently of the tree order m
• Assuming that each node represents a disk block, the choice is based on the size of the items that are being stored

Naming the B-Trees

• B-trees are named after their branching factors, that is, a ⌈m/2⌉−m-tree
• A 4−7-tree is the B-tree of order m = 7
– 2..7 children per root
– 4..7 children per each non-root node
• A 2−4-tree is the B-tree of order m = 4
– 2..4 children per root
– 2..4 children per each non-root node


2−4-Tree

[Figure: a 2−4-tree; the root has between 2 and m children (here 3), the nonleaf nodes have between 2 and 4 children and 1 to 3 keys, and the leaves have a maximum of 7 records]

Example of choosing m, l

• Let one disk block hold 8192 bytes, and let each key use 32 bytes
• A branch is the number of another disk block, so let a branch be 4 bytes
• A B-tree of order m has m−1 keys per node plus m branches
• So the largest order m such that 1 node fits in 1 disk block: 32(m−1) + 4m = 36m − 32 ≤ 8192 → m = 228
• Let there be 10^7 data records, 256 bytes per record
• Because the block size is 8192 bytes, l = 32 records of size 256 bytes fit in a single block
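The slide's block-size arithmetic as a Python sketch (the constants are the example's assumptions):

    # One 8192-byte block per node: m-1 keys of 32 bytes + m 4-byte branches.
    BLOCK, KEY, BRANCH, RECORD = 8192, 32, 4, 256
    m = (BLOCK + KEY) // (KEY + BRANCH)   # 36m - 32 <= 8192  ->  m = 228
    l = BLOCK // RECORD                   # 256-byte records per leaf block
    print(m, l)                           # 228 32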

Example of choosing m, l (continued)

• Each leaf has between 16 and 32 data records inclusive, and each internal node, except the root, branches in 114 to 228 ways
• 10^7 records can be stored in 312,500 to 625,000 leaves (= 10^7 / (16..32))
• In the worst case the leaves would be on level 4, since 114² = 12,996 < 625,000 < 114³ = 1,481,544

Analysis of B-Trees

• A search or an insertion in a B-tree of order m with n data items requires fewer than ⌈log_m n⌉ disk accesses
• In practice, ⌈log_m n⌉ is almost constant as long as m is not small
• Data insertion is simple unless the corresponding leaf is already full; then it must be split into 2 leaves, and the parent(s) should be updated


Analysis of B-Trees (continued)

• Additional disk writes for data insertion and deletion are extremely rare
• An algorithm analysis beyond the scope of this course shows that insertions, deletions, and retrievals of data need only ⌈log_(⌈m/2⌉) n⌉ disk accesses in the worst case (e.g., ⌈log₁₁₄ 625,000⌉ = 3 in the above example)

Symbol Table and Hashing

• A (symbol) table is a set of table entries, (K, V)
• Each entry contains:
– a unique key, K, and
– a value (information), V
• Each key uniquely identifies its entry


Example table (key K = 3-letter code with its integer code k; associated value V = city, country, state):

Key K    k      Associated value V
AKL      271    Auckland, NZ
DCA      2080   Washington, USA, DC
FRA      3822   Frankfurt, Germany
GLA      4342   Glasgow, UK, Scotland
HKG      4998   Hong Kong, China
LAX      7459   Los Angeles, USA, California

Here k = 26²·c₀ + 26·c₁ + c₂, where c₀, c₁, c₂ are the integer codes of the letters in the English alphabet: A=0, B=1, C=2, …
A table is a mapping of keys to values.

Example: We have a table with 1000 elements. Each element has a security number as a key, and we need to transform the key into an index in our array. To allow fast access, we can take the rightmost 3 digits of the number. For example, 214303261 would be stored at index 261, and 033518000 would be stored at index 0. As you can see, two distinct keys can have the same 3 digits at the end: a collision!
Hashing is the process of transforming a key into a table index. A hash function performs an easily computable operation on the key and returns the hash value.
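The 3-letter code from the table, as a Python sketch (the function name is illustrative):

    # k = 26^2*c0 + 26*c1 + c2, with A=0, B=1, ..., Z=25.
    def letter_code(code):
        c0, c1, c2 = (ord(ch) - ord('A') for ch in code)
        return 26 * 26 * c0 + 26 * c1 + c2

    print(letter_code("AKL"), letter_code("DCA"))   # 271 2080, as in the table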


Symbol Table and Hashing (continued)

• Once the entry (K, V) is found, its value V may be updated, it may be retrieved, or the entire entry (K, V) may be removed from the table
• If no entry with key K exists in the table, a new table entry having K as its key may be inserted in the table
• Hashing is a technique for storing values in tables and searching for them in linear, O(n), worst-case and extremely fast, O(1), average-case time

Basic Features of Hashing

• Hashing computes an integer, called the hash code, for each object (key)
• The computation, called the hash function, h(K), maps objects (e.g., keys K) to array indices (e.g., 0, 1, …, i_max)
• An object having a key value K should be stored at location h(K), and the hash function must always return a valid index for the array


Basic Features of Hashing (continued)

• A perfect hash function produces a different index value for every key, but such a function cannot always be found
• Collision: two distinct keys, K₁ ≠ K₂, map to the same table address, h(K₁) = h(K₂)
• Collision resolution policy: how to find additional storage in which to store one of the collided table entries

How Common Are Collisions?

• Von Mises Birthday Paradox: if there are more than 23 people in a room, the chance is greater than 50% that two or more of them will have the same birthday
• Thus, in a table that is only 6.3% full (since 23/365 = 0.063) there is a "good" chance of a collision!


How Common Are Collisions? Probability P_N(n) of One or More Collisions

• Probability Q_N(n) of no collision (that is, that none of the n items collides when randomly tossed into a table of size N):

Q_N(1) = 1
Q_N(2) = Q_N(1)·(N−1)/N = N(N−1)/N²
Q_N(3) = Q_N(2)·(N−2)/N = N(N−1)(N−2)/N³
…
Q_N(n) = Q_N(n−1)·(N−n+1)/N = N(N−1)…(N−n+1)/Nⁿ = N! / ((N−n)!·Nⁿ)

• Probability of one or more collisions:
P_N(n) = 1 − Q_N(n) = 1 − N! / (Nⁿ·(N−n)!)

For N = 365:

n     n/N (%)   P_365(n)
10     2.7      0.1169
20     5.5      0.4114
30     8.2      0.7063
40    11.0      0.8912
50    13.7      0.9704
60    16.4      0.9941
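The product recurrence for Q_N(n), computed directly (a Python sketch):

    # P_N(n) = 1 - Q_N(n), where Q_N(n) = prod_{i=0}^{n-1} (N - i) / N.
    def collision_probability(n, N=365):
        q = 1.0
        for i in range(n):
            q *= (N - i) / N
        return 1.0 - q

    for n in (10, 20, 30, 40, 50, 60):
        print(n, round(collision_probability(n), 4))   # reproduces the table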


Open Addressing with Linear Probing (OALP)

• The simplest collision resolution policy:
– successively search for the first empty entry at a lower location
– if there is no such entry, "wrap around" the table
• Drawbacks: clustering of keys in the table

[Figure: OALP example with N = 10 and h(k) = int(k/10)]
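An OALP sketch in Python (probing toward lower locations and h(k) = int(k/10) follow the slide's example; names are illustrative):

    # Probe successively lower locations, wrapping around the table.
    def oalp_insert(table, key, h):
        i = h(key)
        while table[i] is not None:      # assumes at least one free slot
            i = (i - 1) % len(table)
        table[i] = key

    h = lambda k: k // 10                # N = 10, h(k) = int(k/10)
    table = [None] * 10
    for k in (42, 45, 41):               # all three hash to index 4
        oalp_insert(table, k, h)
    print(table)                         # 41, 45, 42 cluster at indices 2, 3, 4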


Open Addressing with Double Hashing (OADH)

• A better collision resolution policy, reducing the likelihood of clustering:
– hash the collided key again using a different hash function, and
– use the result of the second hashing as an increment for probing table locations (including wraparound)

[Figure: OADH example with N = 10, h(k) = int(k/10), and d(k) = (h(k) + k) mod 10]
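An OADH sketch (Python; the slide's h and d formulas are partly garbled in the source, so the versions below are assumptions in the same spirit):

    # The second hash value serves as the probe step.
    def oadh_insert(table, key):
        N = len(table)
        i = (key // 10) % N              # primary hash h(k) = int(k/10)
        d = max(1, (i + key) % N)        # slide's d(k) = (h(k)+k) mod 10,
                                         # guarded against a zero step
        while table[i] is not None:      # assumes at least one free slot
            i = (i - d) % N              # key-dependent step, with wraparound
        table[i] = key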


Two More Collision Resolution Techniques

• Open addressing has a problem when a significant number of items needs to be deleted, as logically deleted items must remain in the table until the table can be reorganised
• Two techniques to attenuate this drawback:
– Chaining
– Hash bucket

Chaining and Hash Bucket

• Chaining: all keys collided at a single hash address are placed on a linked list, or chain, started at that address
• Hash bucket: a big hash table is divided into a number of small sub-tables, or buckets
– the hash function maps a key into one of the buckets
– the keys are stored in each bucket sequentially in increasing order
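A chaining sketch (Python; bucket-local ordering is omitted for brevity):

    # Each table slot holds the chain of all keys that hashed to it.
    def chain_insert(table, key, h):
        table[h(key)].append(key)

    def chain_find(table, key, h):
        return key in table[h(key)]      # search only one chain

    h = lambda k: k % 10
    table = [[] for _ in range(10)]
    for k in (42, 52, 62):               # all collide at address 2
        chain_insert(table, k, h)
    print(chain_find(table, 52, h))      # True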


Choosing a hash function

• Four basic methods: division, folding, middle-squaring, and truncation
• Division:
– choose a prime number as the table size N
– convert keys, K, into integers
– use the remainder h(K) = K mod N as the hash value of the key K
– find a double hashing decrement using the quotient, ΔK = max{1, (K/N) mod N}

Choosing a hash function (continued)

• Folding:
– divide the integer key, K, into sections
– add, subtract, and/or multiply them together for combining into the final value, h(K)
• Example: K = 013402122 → sections 013, 402, 122 → h(K) = 013 + 402 + 122 = 537
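Division and folding sketches (Python; the example key is the slide's 013402122, and function names are illustrative):

    # Division: h(K) = K mod N (N prime), with the double-hashing
    # decrement taken from the quotient.
    def h_division(K, N):
        return K % N

    def delta_division(K, N):
        return max(1, (K // N) % N)

    # Folding: split the 9-digit key into three sections and add them.
    def h_folding(K):
        s = f"{K:09d}"
        return int(s[0:3]) + int(s[3:6]) + int(s[6:9])

    print(h_folding(13402122))           # 13 + 402 + 122 = 537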


Choosing a hash function (continued)

• Middle-squaring:
– choose a middle section of the integer key, K
– square the chosen section
– use a middle section of the result as h(K)
• Example: K = 013402122 → middle section: 402 → 402² = 161604 → middle: h(K) = 6160
• Truncation:
– delete part of the key, K
– use the remaining digits (bits, characters) as h(K)
• Example: K = 013402122 → last 3 digits: h(K) = 122
• Notice that truncation does not spread keys uniformly into the table; thus it is often used in conjunction with other methods
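Middle-squaring and truncation sketches (Python; the digit positions follow the slide's example):

    # Middle-squaring: square the middle section, keep the middle digits.
    def h_middle_square(K):
        mid = int(f"{K:09d}"[3:6])       # 013402122 -> 402
        return int(str(mid * mid)[1:5])  # 402^2 = 161604 -> 6160

    # Truncation: keep only the last 3 digits.
    def h_truncate(K):
        return K % 1000                  # 013402122 -> 122

    print(h_middle_square(13402122), h_truncate(13402122))   # 6160 122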


Universal Class by Division

• Theorem (universal class of hash functions by division):
– Let the size of a key set, K, be a prime number: |K| = M
– Let the members of K be regarded as the integers {0, …, M−1}
– For any numbers a ∈ {1, …, M−1} and b ∈ {0, …, M−1}, let h_(a,b)(k) = ((a·k + b) mod M) mod N

Table ADT Representations: Comparative Performance

Operation     Sorted array   AVL tree    Hash table
Initialize    O(N)           O(1)        O(N)
Is full?      O(1)           O(1)        O(1)
Search*       O(log N)       O(log N)    O(1)
Insert        O(N)           O(log N)    O(1)
Delete        O(N)           O(log N)    O(1)
Enumerate     O(N)           O(N)        O(N log N)**

* also: Retrieve, Update
** To enumerate a hash table, the entries must first be sorted in ascending order of keys, which takes O(N log N) time
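A sketch of the universal class h_(a,b) (Python; the prime M below is an arbitrary illustration, not from the slides):

    # h_{a,b}(k) = ((a*k + b) mod M) mod N, with a, b drawn at random.
    import random

    def make_universal_hash(M, N):
        a = random.randrange(1, M)       # a in {1, ..., M-1}
        b = random.randrange(0, M)       # b in {0, ..., M-1}
        return lambda k: ((a * k + b) % M) % N

    h = make_universal_hash(M=1_000_000_007, N=1000)   # 10^9 + 7 is prime
    print(h(214303261))                  # an index in 0..999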
