CS240E: Data Structures and Data Management

Helena S. Ven

08 Jan. 2019 Class: T,Th at 0830 – 0950 Instructors: Therese Biedl

Office: DC 2341, (1000 - 1100) Topics: Tutorial:

Midterm: The class on the date of the midterm (28 Feb.): , Hash for strings, Compressed tries. Class start at 0910 instead of 0830. Final: Arithmetic compression, Cache oblivious trees, and External sorting are not on the final. No deletion in dictionaries (lazy deletion is fine)

By convention, log is base 2 unless stated otherwise. The distribution of this document is prohibited unless given permission from the author and Prof. Biedl Index

1 Runtime and Asymptotic bounds 4 1.1 Objective of this course ...... 4 1.1.1 Computational Problems ...... 4 1.2 Asymptotic Analysis ...... 5 1.3 Analysis of Algorithms ...... 7 1.4 Runtime of Randomised Algorithms ...... 8 1.5 Potential Method of Amortised Analysis ...... 10

2 Comparison-Based Data Structures 11 2.1 Array-based Data types ...... 11 2.2 ADT: Priority Queues ...... 11 2.3 ...... 13 2.3.1 Operations in heaps ...... 14 2.3.2 Improvements of the Heap ...... 17 2.4 Heap Merging ...... 19 2.4.1 Method 1, Determinstic ...... 19 2.4.2 Method 2, Randomised ...... 20 2.4.3 Method 3, Modified Heap ...... 21 2.4.4 Almost-heaps ...... 23 2.5 ADT: Dictionaries ...... 23 2.6 Binary Search Trees ...... 24 2.7 -based Implementations of the Dictionary ...... 25 2.7.1 ...... 25 2.7.2 AVL Trees ...... 26 2.7.3 ...... 30 2.8 Skip Lists ...... 33 2.9 Dictionary with Biased Search Requests ...... 35 2.9.1 MTF Array and Transpose Array ...... 36 2.10 Splay Trees ...... 36

3 Hashing and Spatial Data 39 3.1 Hash Tables ...... 39 3.1.1 Probe Hashing ...... 39 3.1.2 Double Hashing ...... 41 3.1.3 Cuckoo Hashing ...... 41 3.1.4 Complexity of Probing Methods ...... 42 3.1.5 Universal Hashing ...... 42 3.2 Hash of Multi-dimensional data ...... 43 3.3 Tries ...... 44 3.3.1 Variation of : No leaves ...... 45 3.3.2 Variation of Trie: Compressed labels ...... 45 3.3.3 Variation of Trie: Allow Prefixes ...... 45 3.3.4 Compressed Tries ...... 46 3.4 ADT: Dictionary with Range Search ...... 47

1 3.5 Spatial Data Structures ...... 48 3.6 Quad-Trees ...... 48 3.7 KD-Tree ...... 50 3.8 ...... 52 3.8.1 Problem of Duplicates and Generalisations ...... 54 3.8.2 3-Sided Ranged Queries ...... 54

4 Sorting and Searching Algorithms 55 4.1 Problem: Selection and Sorting ...... 55 4.1.1 The Lower Bound of Comparison Sorting ...... 55 4.2 Quick-Select ...... 56 4.2.1 Randomised Pivoting ...... 57 4.3 Partitioning ...... 58 4.4 ...... 58 4.4.1 Choice of the Pivot ...... 61 4.5 Sorting Integers ...... 61 4.5.1 Bucket Sort ...... 62 4.5.2 ...... 63 4.6 Problem: Search ...... 64 4.7 Interpolation Search ...... 64

5 String Algorithms 66 5.1 Problem: ...... 66 5.2 Pattern Pre-processing ...... 67 5.2.1 Karp-Rabin Fingerprint Algorithm ...... 67 5.2.2 Boyer-Moore Algorithm ...... 68 5.2.3 Finite Automaton and Knuth-Morris-Pratt Method ...... 71 5.3 Text Pre-processing ...... 75 5.3.1 Trie of Suffixes, Suffix Trees ...... 75 5.3.2 Suffix Array ...... 75 5.4 Comparison of Pattern Matching Algorithms ...... 77 5.5 Problem: Compression ...... 79 5.5.1 Prefix-Free Encoding ...... 79 5.6 Huffman Tree ...... 80 5.6.1 Huffman Tree with Different Base ...... 82 5.7 Run-Length Encoding ...... 83 5.8 Lempel-Ziv-Welch ...... 84 5.8.1 Decoding Lempel-Ziv-Welch ...... 85 5.9 BZip2 ...... 86 5.9.1 Burrows-Wheeler Transform ...... 87 5.9.2 Move-to-front Transform ...... 89 5.10 Arithmetic Compression ...... 90 5.11 Comparison of Compression Algorithms ...... 91

6 External Memory Model 92 6.1 Sorting in External Memory ...... 92 6.2 Dictionary and B-Trees ...... 93 6.2.1 Insertion and Deletion in a-b-Trees ...... 94 6.2.2 Height of a a-b-Tree ...... 98 6.2.3 B+-tree ...... 99 6.3 Red-Black Tree ...... 100 6.4 Extendible Hashing ...... 101

2 A Tables 103 A.1 List of common recurrences ...... 103 A.1.1 Comparison of Hashings ...... 103 A.1.2 Comparison of Sorting Algorithms ...... 103 A.1.3 Comparison of Pattern Matching Algorithms ...... 104 A.1.4 Comparison of Compression Algorithms ...... 104 A.2 Information Theory ...... 104 A.3 Course sumary ...... 105

3 Caput 1

Runtime and Asymptotic bounds

1.1 Objective of this course

This course considers designing algorithms for the handling huge data, on the order of millions or billions of bytes. Important concepts: • Correctness: Does program output expected results? • Efficiency: Is the program using the computer’s resources efficiently?

• Storing, accessing, and operations on large collections of data.

1.1.1 Computational Problems “Mergesort is a recursive algorithm that solves the sorting problem in O(n log n) worst time”. Explanation of terminologies: 1.A algorithm: Turing machine. Informally a finite list of instructions that modifies input, a step-by-step process for carrying out a series of computations, given an arbitrary problem instance I. A program is a specific implementation of a algorithm.

2.A recursive algorithm uses itself on a smaller instance of the problem. 3.A problem is: A language specification of given input (instance) and a target output. The sorting problem: Given numbers, put them in sorted order. 4. Solving: The algorithm produces correct output on all instances.

5. Run-Time: Number of computing steps. This leads to the problem of what a computer can do in one step. A rigorous treatment requires a Turing machine but a Turing machine leads to long runtimes. We will use a idealised RAM (see below) model. 6. Worst case: Worst over all instances.

7. n: Size of the input. Usually the number of memory cells.

Definition

A problem is a computational task. A problem instance is an input for a problem. A problem solution is the correct answer (output) to a problem. The size of an instance is a positive integer.

4 Below are some common problems: • Sorting problem: Given A[0 . . . n − 1] of elements in a ordered , find permutation π of {0 . . . n − 1} such that A ◦ π is ascending. The size of an instance is n. • The structured search problem: Given items with keys, search by the key. • The unstructured search problem: Search for a string in a text. To communicate an algorithm, we use pseudocode, which is an abstraction of programming that omits obvious details (variable declarations), has limited error detection, and sometimes use natural language or mathematical descriptions. The pseudo-code runs on an idealised Random Access Machine (RAM) model. We will use the RAM model: A computer uses memory cells (words). With the assumption that a word can hold a big number (usually 232 or 264). We do not place a limit on this and we assume it is big enough to hold all numbers. Primitive operations, such as add or multiply, require one unit of time. We are primarily concerned with the amount of time (running time) and amount of memory (space) a program requires to run. These will depend on the size of a given problem instance.

1.2 Asymptotic Analysis

Definition

Let f, g be functions. • (f ∈ O(g)) g is a asymptotic upper bound of f if there exists > 0 such that |f| ≤ c |g| eventually.

• (f ∈ Ω(g)) g is a asymptotic lower bound of f if there exists c > 0 such that |f| ≥ c |g| eventually. • (f ∈ Θ(g)) g is a asymptotic tight bound of f if f ∈ O(g) ∩ Ω(g). • (f ∈ o(g)) g is asymptotic dominant of f if for all constants c, |f| < c |g| eventually. • (f ∈ ω(g)) g is asymptotic insignificant of f if for all constants c, |f| > c |g| eventually.

The bound is usually easy to prove if g is known. Note. Notice the strict less than symbol. 1.2.1 Limit Rule. Let f, g be non-zero functions, a a point. If limx→a f(x)/g(x) = L exists, then

o(g) if L = 0  O(g) if L < ∞ f ∈ Ω(g) if L > 0  ω(g) if L = ∞

Hence if L ∈ R+, then f ∈ Θ(g). Proof.

• Suppose L < ∞. Then there exists a neighbourhood U of a such that |f/g − L| < 1, so

f(x) − L ≤ 1, (x ∈ U) g(x)

this implies f(x) ≤ (L + 1)g(x) for such x. This gives f ∈ O(g).

5 1.2.2 Corollary. 1. A polynomial of degree is in Θ(nd) (n → ∞)

2. For any bases b1, b2 > 1, log (n) ∈ Θ(log (n)) b1 b2 3. log n ∈ o(np) for any p > 0.

The conditions for o(g), ω(g) are equivalent to the limit rule, but this is not the case for O(g) and Ω(g). 1.2.3 Proposition. If f ∈ o(g), then f(x) lim = 0 x→a g(x) Proof. Let  > 0. Then there exists a neighbourhood U 3 a such that f(x) < g(x) on U, so

f(x) <  g(x)

Since  is arbitrary, f(x)/g(x) → 0. 1.2.4 Corollary. 1. o(g) ⊆ O(g) 2. ω(g) ⊆ Ω(g)

3. o(g) ∩ Ω(g) = ∅ From this we can obtain (log n)r ∈ o(n) for all r > 0.

Example: Existence of Limits

Let f(n) := (2 + sin n) n Then f(n) ∈ Θn since 2n ≤ f(n) ≤ 3n, but the limit

f(n) lim = lim (2 + sin n) n→∞ n n→∞ does not exist. Let f(n) := (1 + sin n) n + 0.01 now f(n) ∈ O(n).

Exam question: Given f, g, determine which asymptotic category does f fall into.

1.2.5 Proposition.

log(n!) ∈ Θ(n log n) Proof.

• n n X X log(n!) = log(i) ≤ log(n) = n log n i=1 i=1 so log(n!) ∈ O(n log n)

6 • When n ≥ 4, 2 log 2 ≤ log n, so

n n X X log(n!) ≥ log(i) = log(n/2) i=n/2 i=n/2 n n = log 2 2 1 1 = n log n − n log 2 2 2 1 1 ≥ n log n − n log n 2 4 1 = n log n 4 Hence log(n!) ∈ Ω(n log n). Combining the above, log(n!) ∈ Θ(n log n).

1.3 Analysis of Algorithms

Designing an algorithm consists of the following stages: 1. Start with the idea of the algorithm

2. Pseudocode 3. Prove correctness 4. Analyse runtime

• Worst case analysis: The worst case runtime of algorithm A is a function f : Z+ → R mapping n (size of input) to the longest running time for any input instance of size n:

TA(n) = max{TA(I): |I| = n}

Note: It is important not to try comparing algorithms using O(·) notation. Since O(·) can always be made worst! • Average case analysis: avg 1 X T (n) = T (I) A |{I : |I| = n}| A I:|I|=n • Best case analysis (tie-breaker) • Expected runtime • Amortised runtime 5. Analyse space requirements The auxiliary space is the additional space taken by an algorithm, not considering the space taken by the instance, so if we can sort an array in-place, this would be O(1) space. There are two ways to analyse the run-time in Θ(g). One is to establish a Θ(·) bound at every step. Another way is to prove O(g(n)), which allows for upper bounds, and Ω(g(n)), which allows for lower bounds. Recursive algorithms runtime may be expressed as a recursive function e.g.

T (n) = 2T (bn/2c) + n + c

We shall see that in most cases, floors and ceilings can be ignored.

7 Example: Mergesort

Mergesort is an algorithm to solve the sorting problem, its runtime T (n) is given by the recurrence ( T (d n e) + T (b n c) + Θ(n) if n > 1 T (n) = 2 2 Θ(1) if n = 1

It suffices to consider the following exact recurrence, which can bound the above function from below or above, depending on c. ( T (d n e) + T (b n c) + cn if n > 1 T (n) = 2 2 c if n = 1 The following is the corresponding sloppy recurrence, removing floors and ceilings. ( 2T ( n ) + cn if n > 1 T (n) = 2 c if n = 1

Convention: A tree with 1 node’s height is 0, an empty has height −1.

1.4 Runtime of Randomised Algorithms

In a randomised algorithm, the algorithm relies on a feed of random numbers in addition to the input. The cost will depend on the random numbers used and the input. The goal is to convert uncontrollable instance variations to controllable random number generation.

Definition

Let T (I, σ) be the runtime of a random algorithm on instance I with random sequence σ. Then the expected runtime on I is exp T (I) := E[T (I, σ)|I] = T (I, σ) dσ ˆΣ where Σ is the set of all random sequences. The (worst case) expected runtime corresponding to input size n is

exp exp Tworst(n) := max T (I) |I|=n

The (average case) expected runtime corresponding to input size n is

exp 1 X exp T (n) := T (I) avg |I : |I| = n| |I|=n

The goal for a randomised algorithm is to design the randomisation such that all instances of size n have equal runtime. Utility: random(n) randomly draws an integer from {0, . . . , n − 1}.

8 Example: Random Walk

Given a binary tree, consider the algorithm: Randomly walk downwards to a child until reaching a leaf. What is the runtime of this algorithm? The worst case is O(h) ≤ O(n), where h is the height of the tree. The expected runtime T (n) is O(log n). Proof. Suppose the time to move from a node to a randomly chosen child is ≤ c. We shall prove T (n) ≤ c log(n + 1). This can be done by induction. • When n = 1, the time is ≤ c = c log(n + 1)

• Let nL, nR be sizes of the left/right subtrees. Observe,

(nL + 1) + (nR + 1) = n + 1

1 The probability of going to either child is 2 . 1 1 T (n) ≤ c + T (n ) + T (n ) 2 L 2 R 1 1 ≤ c + c log(n + 1) + c log(n + 1) 2 L 2 R Since log is a concave function,

1 n + 1 + n + 1 n + 1 (log(n + 1) + log(n + 1)) ≤ log L R = log 2 L R 2 2 so n + 1 ≤ c + c log 2 ≤ c log(n + 1)

as required.

Example: Bogosort

What are the (expected/absolute) best case, (expected/absolute) worst case runtime for bogosort? The absolute best case is when the shuffle is correct everytime or the input is constant, so O(n). The absolute worse case is when the algorithm never teminates. The expected best case is when the input is constant, so

T exp(n) = cn ∈ O(n)]

The expected worse case is when every input is different, and cn n! − 1 T exp(n) = + (T exp(n) + cn) n! n! Solving for T exp gives T exp(n) = cn · n! ∈ O(n!)

9 1.5 Potential Method of Amortised Analysis

Some data structures require amortised analysis. For example, inserting in a costs Θ(1) most of the time, but when the array needs to allocate more memory, the becomes Θ(n). Amortised analysis is a method to obtain the average time.

Definition

A potential function is a φ(t) that depends on the state of a at time t, such that

• φ ≥ 0 • φ(0) = 0 The amortised runtime of step t → t + 1 is

T amo := T + φ(t + 1) − φ(t)

A good choice of φ leads to a amortised runtime that does not depend on t. For dynamic arrays, we have φ(t) := c(2n − M + 1) where • n: Is the number of items in the array

• M: Capacity of the array (M/2 ≤ n ≤ M) • c: Constant, sometimes assumed to be 1. Copying n items will take time at most cn. In a dynamic array has two types of steps t → t + 1. • Insert when the array had space: M 0 := M and n0 = n + 1, so

φ(t + 1) − φ(t) = 2c

Amortised runtime is O(1) + 2c ∈ O(1). • Insert where array was full: n = M,M 0 = 2M, n0 = n + 1 φ(t + 1) − φ(t) = 2c − cM = 2c − cn Amortised runtime is O(1) + cn + 2c − cn ∈ O(1).

10 Caput 2

Comparison-Based Data Structures

Definition

A (ADT) is a description of information and collection of operations on that information. The information is accessed only through the operations. A realisation of ADT specifies:

1. How information is stored (data structure) 2. How the operations are performed.

2.1 Array-based Data types

Definition

A stack is an ADT consisting of a collection of items with operations

• Push: Insert and item • Pop: Remove the most recently inserted item. • Size: # Items in the stack • isEmpty: If the stack is empty (i.e. Size = 0)

• Top: Most recently inserted item Items are removed in LIFO (last-in first-out) order. A queue is a FIFO (first-in first-out) data structure.

• Enqueue: Insert an item

• Dequeue: Remove the oldest item

Stacks and queues are realised with arrays/linked lists.

2.2 ADT: Priority Queues

11 Definition

A (max-oriented) stores items that has a priority, with operations:

1. PQ-insert: Give item and priority, attach to the data structure.

2. PQ-deleteMax/PQ-pop: Remove and return the highest priority item. 3. PQ-getMax: Like PQ-deleteMax but doesn’t remove 4. PQ-merge: Merge two queues.

Note. There is also a min-oriented priority queue for which the minimal element is extracted. Usually the priority is a key by which the items are sorted. In the pseudocode, we do not explicitly show the items associated with each key. Applications of Priority Queues:

1. ToDo list 2. Discrete event simulation 3. Sorting

A priority queue can be used to solve the sorting problem. Below is the Priority queue sort.

Algorithm 1 Priority Queue Sort 1: procedure PQ-Sort(A[n]) 2: p ← PriorityQueue . Create New queue 3: for i ← (0, . . . , n − 1) 4: PQ-insert(p, A[i]) 5: end for 6: for i ← (n − 1,..., 0) 7: A[i] ← PQ-deleteMax(p) 8: end for 9: end procedure

The runtime of this depends on the implementation of the priority queue, which is

Θ(n + n · TPQ-insert + n · TPQ-deleteMax)

We can implement the priority queue using simple data structures.

1. Unsorted array (amortised dynamic array) or linked lists:

• PQ-insert: Append the element to the end of the list. O(1) • PQ-deleteMax: Search the maximum element. O(n) This realisation, when used for sorting, yields the .

2. Sorted list

• PQ-insert: O(n) • PQ-deleteMax: Remove the maximum item, which is stored either at beginning or end of the array, O(1)

This yields . 3. Heap: Binary tree that stores items with priorities

12 Descriptio 2.1: A Heap of Integers. Notice that the nodes marked red are sorted 50

29 34

27 15 8 10

23 26

2.3 Heap

Definition

A binary tree is 1. Empty 2. A node with two binary tree children.

Terminology: • Root: The base of the tree • Leaf: The nodes without any children

• Parent: The node directly above a given node • Child: The nodes directly below a given node • Level: A set of nodes whose distance to the root is constant. • Sibling: Another child of the parent of a given node.

• Ancestor: The set of parents, and parents of parents, etc. of a given node. • descendant: Like ancestor but with children. The height of a binary tree is the maximal distance from a leaf node to a root node. The height of an singleton (i.e. one node) tree is 0. The height of a empty tree is −1.

Definition

A (max) heap is a binary tree that stores ordered items, such that

1. Structural Property: All levels of heap are completely filed, except possibly for the last level. The filled items in the last level are left-justified. 2. Heap-order Property: For any node i, the key of parent of I is larger than or equal to key of i.

2.3.1 Lemma. Any binary tree with n nodes has height at least log(n + 1) − 1 ∈ Ω(log n). The height of a heap with n nodes is Θ(log n). A heap is usually stored as array, level-by-level:

13 A[0]

A[1] A[2]

A[3] A[4] A[5] A[6]

Navigating this array: • The root is A[0];

• The left child of A[i] is A[2i + 1]; • The right child of A[i] is A[2i + 2];

i−1 • The parent of A[i], i 6= 0 is A[b 2 c] • The last node is A[n − 1].

We can hide these details using helper functions, root, parent, last, etc. Ph−1 i h A single node tree has height 0. If a heap has hight h, it has n = r+ i=0 2 elements, where r ∈ {1,..., 2 }, so the weight of a heap is at least n ≥ 2h. Reversing the inequality gives h ≥ log n ∈ O(log n).

2.3.1 Operations in heaps To add an element to a heap, we can extend the array to have n + 1 elements, and then insert an element at A[n]. This breaks the heap ordering property, so we need a function Fix-up to fix it. The idea is to successively exchange a node with its parent until the heap property is restored.

Algorithm 2 Fix up 1: procedure Fix-up(A[n], k) 2: while k > 0 ∧ A[parent(k)] < A[k] . Exchange k with its parents to fix heap 3: A[k] ↔ A[parent(k)] 4: k ← parent(k) 5: end while 6: end procedure

14 Descriptio 2.2: A example of insertion followed by a fix up. The red node is inserted. A fix-up moves the yellow nodes down. The green node is an ancestor of the red node but is not moved 50

29 34 2. 27 15 8 10 1. 23 26 48

50

48 34

27 29 8 10

23 26 15

deleteMax is the opposite operation of insert. It removes the maximum element (i.e. root) of the heap. This breaks the structural property, so we can move the last element to the root, but this breaks the heap-ordering properto, so we execute Fix-down to move the root to the right place.

Algorithm 3 Remove Top of Heap 1: procedure Fix-down(A[n], k) 2: while k < n . When k is not a leaf 3: j ← left-child(k) . Find larger child of k 4: if j 6= last(n) ∧ A[j + 1] > A[j] 5: j ← j + 1 . j ← right-child(k) 6: end if 7: if A[k] ≥ A[j] 8: break 9: end if 10: A[j] ↔ A[k] 11: k ← j 12: end while 13: end procedure

15 Descriptio 2.3: Removing the maximum element of the heap. The last node is moved to the place of the root and then Fix-down is executed 50

29 34

27 15 8 10

23 26

26

29 34

27 15 8 10

23

34

29 26

27 15 8 10

23

16 These two operations suffice to realise the priority queue.

Algorithm 4 Heap-based Priority Queue 1: procedure PQ-insert(A[n], x) 2: increase-size(A[n], 1) 3: l ← last(n) 4: A[l] ← x 5: Fix-up(A[n], l) 6: end procedure 7: procedure PQ-deleteMax(A[n]) 8: l ← last(n) 9: A[root] ↔ A[l] 10: decrease-size(A[n], 1) 11: Fix-down(A[n], root) 12: return A[l] 13: end procedure

Runtime: • PQ-insert: O(log n) (Note: The array resizing associated with an insertion can be done in amortised O(1)) • PQ-deleteMax: O(log n) • PQ-max: Θ(1) A PQ-sort realised using this heap will take O(n log n) time. A heap priority queue does not have enough structure to find an element in o(n), but to delete a particular item in the heap, we can change its priority to ∞ (or a number greater than all other priorities). Then we can remove the maximum of the heap. The Change-Priority function modifies the value of an item in the heap and fixes the heap. Notice that depending on the target priority, we may need to Fix-up or Fix-down 1: procedure Change-Priority(A, n, i, p) 2: if A[i] < p . Priority increased 3: A[i] ← p 4: Fix-up(A, n, i) 5: else if A[i] > p . Priority decreased 6: A[i] ← p 7: Fix-down(A, n, i) 8: end if 9: end procedure The comparisons are redundant since Fix-Up/Fix-Down will stop immediately if there is nothing to be done, so we can have the simplification:

1: procedure Change-Priority(A, n, i, p) 2: A[i] ← p 3: Fix-up(A, n, i) 4: Fix-down(A, n, i) 5: end procedure

2.3.2 Improvements of the Heap We can build the heap faster if we know all inputs in advance. Although this will not be asymptotically faster. Heapify Problem. Given n items A[n], build a heap containing all of them.

17 Descriptio 2.4: A binary with already-built subheaps. This binary tree can be converted to a heap by 1 fix-down operation on its root 12

29 34

27 15 11 10

23 26

A naive approach is to insert them all 1: procedure Heapify(A[n]) 2: H ← heap() 3: for i ← (0, . . . , n − 1) 4: insert(H,A[i]) 5: end for 6: return H 7: end procedure Another way is to use Fix-down. This approach actually uses Θ(n) time. Below is a formulation that is easier to analyse. 1: procedure Heapify(A[n]) 2: for i ← (parent(last(n)),..., 0) 3: fix-down(A[n], i) 4: end for 5: end procedure The worst case running time is Θ(n log n), but we can do better. By directly treating the array as a heap, we can fix its subtrees so they become subheaps. ( 1 if n = 1 T (n) := T (n/2) + T (n/2) + Θ(log n) if n > 1

Algorithm 5 Heapify 1: procedure Heapify(A, n, i = 0) 2: if Has-left(i) 3: Heapify(A, n, left-child(i)) 4: end if 5: if Has-right(i) 6: Heapify(A, n, right-child(i)) 7: end if 8: Fix-down(A, n, i) 9: end procedure

With this established we can create a sorting algorithm, using the PQ-sort algorithm. Heapsort sorts an array using Θ(n log n) time and Θ(1) space.

18 Algorithm 6 Heapsort 1: procedure Heapsort(A[n]) 2: Heapify(A[n]) 3: while n > 1 . Extract maximum repetitively 4: A[root] ↔ A[last(n)] 5: fix-down(A, n, root) 6: n ← n − 1 7: end while 8: end procedure

2.4 Heap Merging

Definition

If H1,H2 are heaps, the join of H1,H2 is a heap that contains all items of H1,H2.

Note. If the heap contains more than integers, it is allowed for two items to have the same key.

This is easy to do for binary heaps in O(|H2| log |H1|), assuming |H2| ≤ |H1|. The join function can be na¨ıvely done by inserting every element in H2 into H1. This implementation may work well for small H2. There are 3 more sophisticated methods.

3 1. In O((log |H1|) ), which can be optimised to O(log |H2| · log |H1|)

2. Randomised, destroys structural property, but the expected runtime is O(log |H1|).

3. Use a different properties, but runs in amortised O(log |H1|).

2.4.1 Method 1, Determinstic There are several cases for heap merging.

1. H1,H2 are both full and have the same height h: h H1,H2 will each have 2 − 1 elements. In this case we can build a new heap H, whose parent node is ∞ and has H1,H2 as children.

H1 H2

Now we can execute Heap-Delete on the root node. This generates a join in O(min{log |H1| , log |H2|}) time.

2. H1,H2 are full, but the height of H2 is lesser than H1:

In which case we can merge H2 into the leftmost subtree of H1 that has the same height as H2 using the case (1) algorithm.

19 H1[0]

∞ ...

H3 H2

2 In order to delete the ∞, we need to fix-down on all ancestors of the ∞ node. This takes O((log |H1|) ) time.

2 3. H2 is full but H1 is not. Omitted. O((log |H1|) )

4. H2 is not full. We can split H2 into O(log |H2|) heaps that are full, and insert using (3). This will generate 2 a lot of single node heaps. The runtime is O(log |H2| · (log |H1|) ).

2.4.2 Method 2, Randomised In this case we present a randomised method based on exchanging heap elements. We assume that the heaps are stored as binary trees. Unfortunately, we break the structural property. The heap with smaller root will be merged into the heap with the bigger root. Then the algorithm randomly decides the child at which to merge.

50 30

29 34 15 14

3 4 2 1

1

The merged heap usually does not have the heap structure except for that the keys of the children and lesser than that of the parent. The Heap-Merge walks downwards from r1, r2 until one of them hits a leaf, so the time is at most twice of the random walk in one tree, which we have shown to be O(log |H1|).

20 1: procedure Heap-Merge(r1, r2) . Roots of the heaps 2: if r1 = ∅ 3: return r2 4: else if r2 = ∅ 5: return r1 6: end if 7: if key(r1) > key(r2) 8: c ← Random-Child(r1) 9: c ← Heap-Merge(c, r2) 10: return r1 11: else 12: c ← Random-Child(r2) 13: c ← Heap-Merge(c, r1) 14: return r2 15: end if 16: end procedure

2.4.3 Method 3, Modified Heap If we modify the heap property, we can obtain amortised O(log n) time for merging. A modification to this algorithm can get to worst case O(log n). The idea is merging is easy for lists and can be done in O(1).

Definition

A is a list of binary trees such that

1. The roots of the binary trees are allowed to have arbitrary finite degrees. They are the orders of the binary trees. 2. The orders are all distinct. 3. Each binary tree has the heap-order property.

Since binomial heaps are lists, merging them is simply trivially merging the lists (O(1)). Trees with same root degree are merged to maintain the binomial heap property. i.e. If T1,T2 have the same root degree, the one with smaller root is attached to the other as a child. To do this we keep an array C of possible root degrees d, pointing to the first instance of trees with degree d. When the clean-up function iterates through R, collisions with the array are merged. Runtime of BH-Cleanup: O(|R| + maxT dT ). 1: procedure BH-Cleanup(R, n) .R is a stack of trees of a binomial heap and has size n 2: C ← array(blog nc + 1) . Array of nil 3: while R 6= nil 4: T ← pop(R) 5: d ← root-degree(T ) 6: if C[d] = nil 7: C[d] ← T 8: else . Merge trees of same degree and push them back onto R 0 9: T ← C[d] 10: C[d] ← nil 0 11: if root(T ) > root(T ) 0 12: add-child(root(T ), root(T )) 13: push(R,T ) 14: else 0 15: add-child(root(T ), root(T )) 0 16: push(R,T )

21 17: end if 18: end if 19: end while 20: for i ← (0,..., blog nc) 21: if C[i] 6= nil 22: push(R,C[i]) 23: end if 24: end for 25: end procedure To remove the maximum from a binomial heap (BH-Delete-Max): 1. If R is a binomial heap, Find-Max on R will take O(|R|) time (searching the roots). 2. BH-Remove-Max: Once the tree containing the maximal element is known, we can join its children (which is just a list) to and remove the root element.

O(dT ) is the degree of root of the tree whose root is the max element. 3. BH-Clean-up.

Total runtime: O(|R| + maxT dT ). 2.4.1 Theorem. If a binomial heap is created with only operations BH-Delete, BH-Merge,BH-Singleton, the binomial heap’s maximal root degree is dT ≤ log n + 1. Proof. Suppose the root degrees are 0, . . . , d, the minimal binomial heap with this structure will have 1 node at the 0-tree, 2 node at 1-tree, and 4 (not 3 since this tree can only be formed by joining 2 1-tree’s) nodes at the 2-tree. In general the i-tree will have 2i − 2 nodes, so n = 2i. Rule of thumb: When we have a costly operation, clean up the data struture to reduce its entropy. The potential function φ of a binomial heap is φ(t) := (number of trees) · c c is a constant that is bigger than all constants in O-terms. The merge of R1,R2 will run is amortised time T = actual − ∆potential ≤ c + φ(t + 1) − φ(t)

= c + (|R1| + |R2|) − (|R1| + |R2|) ∈ c ∈ O(1)

Delete-Max has amortised run-time (maxT dT is taken over all tree that could be in a binomial tree with size n) 0 ≤ c(|R| + max dT ) +c R − c|R| T | {z } actual

= c · max dT + c · max dT T T ≤ 2c log n ∈ O(log n)

If we call Clean-up after every operation, it takes time O(|R| + maxT dT ) (this becomes |R| ∈ O(log n) if we do clean up every time). With this we can insert, merge, and delete max in O(log n) worst case. The trade-off is we cannot insert in constant time anymore. Note. How to store trees with high degrees? We can store a high-degree tree T as a binary tree T 0. At node v: • Left child in T 0 is left child in T • Right child in T 0 is right sibling in T . This creates a leftist heap.

22 2.4.4 Almost-heaps (See tut01) To fix the almost-heap, we swap the violating node (45) with its largest ancestor that is smaller than 45. This places the 45 into the correct position, but now 40 is below 30. Then we can fix down the 40. In this case, preprocessing the data structure and building a stack reduces the runtime from logarithmic to linear.

2.5 ADT: Dictionaries

Definition

A Dictionary (also symbol-table, relation, map) is an ADT, stores key-valueu pairs (KVPs) Every item has a key k and a value v. The keys can be compared and are unique. Keys can be stored in O(1) space and compared in O(1) time. Operations:

1. Dict-Insert(k, v) (pre: k is not in the dictionary) 2. Dict-Search(k) (find an item. If key does not exist, Search must handle)

3. Dict-Delete(k) 4. Optional: Dict-ClosestKeyBefore 5. Optional: Dict-Join 6. Optional: Dict-isEmpty

7. Optional: Dict-Size

Easy implementations: 1. Unsorted list/array • Insert: O(1) (Amortised for arrays) • Search/Delete: O(n). 2. • Search: O(log n) via binary search • Insert/Delete: O(n) 3. Binary : • Insert, search, delete: O(height) • Height: O(log n) best case, O(n) worst case. There are several more sophisticated implementations, which achieve higher efficiency. We want a dictionary to achieve: O(log n) for search, insert, delete. 1. O(log n) expected, independent of input odrder (Treaps) 2. O(log n) worst-case (AVL trees) 3. O(log n) amortised, no rotation (Scapegoat trees) 4. O(log n) expected, little space () 5. O(log n) amortised, little space () Note. We don’t study deletion. We could use lazy-deletion instead: Search the key to be deleted and mark its place as “deleted”, and we keep track of d (# dummy items). if d > n, we then rebuild the tree with n valid items. This uses O(n + d) = O(n) time. This gives amortised O(1). (i.e. potential φ(t) = d)

23 2.6 Binary Search Trees

Definition

A is a tree where for every node, the left subtree is less than the node and right subtree is larger than the node. This property holds for all subtrees.

x

< x > x

We can perform search on the binary search tree

1: procedure BST-Search(T, k) 2: if T = ∅ 3: return notfound 4: else if root(T ) = k 5: return root(T ) 6: else if root(T ) > k 7: return BST-Search(left(T ), k) 8: else if root(T ) < k 9: return BST-Search(right(T ), k) 10: end if 11: end procedure

Definition

In a binary search tree, the predecessor/successor of a node is the node is the maximum of the left/minimum of the right subtree.

BST-Delete can be implemented by 1. Search for node x that contains the key. 2. If x is leaf, delete 3. If x has one non-empty subtree, move child up 4. Otherwise, swap x with its predecessor/successor and remove the swapped node. BST-Search and BST-Delete run in O(h), where h is the height of the tree. In the worst case h = n − 1, where n is the number of nodes. In the best case, h ≥ log(n + 1) − 1. 2.6.1 Theorem. The average height of a BST with n nodes is Θ(log n). Proof. The average height of a BST with n nodes is the expected height of BST built by inserting {0, . . . , n − 1} in a random number. Let X(n) be the height of a random BST with n nodes. Observe that the first inserted item is uniformly randomly chosen from {0, . . . , n − 1} and is placed in the root, so X(n) = 1 + max{X(i),X(n − i − 1)} i has equal probability for being {0, . . . , n − 1}, so

n−1 1 X X(n) = 1 + max{X(i),X(n − i − 1)} n i=0

24 1 3 3 1 • If i lies within 4 n to 4 n, the maximum is at most 1 + X( 4 n). This constitutes 2 of all cases. • Otherwise, the maximum is bounded by X(n).

Hence 1 3 1 X(n) ≤ 1 + (1 + X( n)) + X(n) 2 4 2 Rearranging, 3 X(n) ≤ 3 + X( n) 4 so 3 [X(n)] ≤ 3 + [X( n)] E E 4 Using the table of common recurrences, we have that the height is Θ(log n).

Definition

The balance of a binary search tree T with subtrees L, R is

balance(T ) := height(R) − height(L)

BST-Construct If we have a sorted array A[n], we can build a BST containing elements of A such that |balance| ≤ 1 for all subtrees. This is usually done by extracting the of A[n] (i.e. A[bn/2c]) repetitively and takes Θ(n) time.

2.7 Tree-based Implementations of the Dictionary

In this section we examine Treaps, AVL Trees, and Scapegoat Trees for implementing dictionaries.

2.7.1 Treaps

Definition

A is a binary search tree whose nodes also store a priority, such that • The tree is a BST w.r.t. keys. • The tree is a max-heap (but without structural property) w.r.t. priorities.

4 key 0 Each node is displayed as priority :

Searching in a treap is the same as a BST, which is O(height). Insertion is more compilicated. We pick a priority randomly in {0, . . . , n − 1}, insert the key k, and fix the priority using tree rotations. 1: procedure Treap-Insert(T, k) 2: p ← random(size(T )) 3: x ← BST-Insert(T, k) . Max-heap property broken 4: Correct heap ordering using tree rotations 5: end procedure Below is an example with 8 being inserted with a priority of 2. Nodes that breaks the heap ordering property are coloured red.

25 4 4 8 1 1 2

6 8 4 0 2 1

8 6 6 2 0 0

Definition

A is

x y Right rotation on x y C A x Left rotation on y

A B B C

In a left (resp. right) rotation, we bring the right (resp. left) child up and arrange the remaining subtrees using the only possible arrangement. Since the height of the treap is log n we can conclude that: 2.7.1 Theorem. Treap insertion is expected O(log n).

2.7.2 AVL Trees

Definition

An AVL (Adelson-Velsky, Landis) tree is a BST such that the balance is always in {−1, 0, +1}. If • balance = −1: Left-heavy • balance = 0: Balanced

• balance = +1: Right-heavy We can either store height of subtree, or balance factors in the tree and make adjustments when necessary.

2.7.2 Proposition. If N(h) is the minimal number of nodes a AVL tree of height h can have, then

2h/2 ≤ N(h) ≤ 2h Proof. Observe that N(0) = 1,N(1) = 2. Moreover, the minimal AVL tree with height h will have a h − 1 height and a h − 2 height AVL subtree, so N(h) = 1 + N(h − 1) + N(h − 2), (h ≥ 2) | {z } | {z } Left Right

• 2h/2 ≤ N(h): The bases case is √ N(0) = 1 ≥ 20/2,N(1) = 2 ≥ 2

26 Step:

N(h) = 1 + N(h − 1) + N(h − 2) ≥ 1 + 2(h−1)/2 + 2(h−2)/2 2h/2 2h/2 ≥ 1 + + 2 2 ≥ 2h/2

so the proof is complete by induction • N(h) ≤ 2h: N(0) ≤ 20,N(1) ≤ 21, and

N(h) = 1 + N(h − 1) + N(h − 2) ≤ 1 + 2h−1 + 2h−2 ≤ 2h−1 + 2h−1 = 2h

and the proof is complete by induction.

2.7.3 Theorem. The height of a AVL tree is height ∈ Θ(log n) where n is the number of nodes. Proof. Observe: n ≥ N(h) ≥ 2h/2 Solving for h: h ≤ 2 log n so h ∈ O(log n). h ∈ Θ(log n) for all binary search trees.

27 Definition

A left double rotation at x is 1. A right rotation at y 2. A left rotation at z Mnemonic: The middle node is raised to the root. x z

A y Left2 rotation x y

z D A B C D

B C

x z

2 y A Right rotation y x

D z D C B A

C B

AVL-Insert is similar to BST-Insert but requires the tree’s AVL property to be maintained after the insertion. Note. AVL Tree rotations always bring up the middle element.

28 1: procedure AVL-Insert(T, (k, v)) 2: z ← BST-Insert(T, (k, v)) . Insertion returns leaf node 3: z.height ← 0 4: while z 6= root(T ) . Move up the tree to maintain balance 5: z ← parent(z) 6: if |balance(z)| > 1 7: y ← taller-child(z) 8: x ← taller-child(y) 9: z ← AVL-Restructure(x, y, z) 10: break . Exit 11: end if 12: z.height ← 1 + max{z.left.height, z.right.height} . Recalculate height 13: end while 14: end procedure 15: procedure AVL-Restructure(x, y, z) . z is parent of y is parent of x 16: switch (x, y, z) z y x 17: case . Right rotate 18: return rotate-right(z) z y x 2 19: case . Right rotate 20: z.left ← rotate-left(y) 21: return rotate-right(z) z y x 2 22: case . Left rotate 23: z.right ← rotate-right(y) 24: return rotate-left(z) z y x 25: case . Left rotate 26: return rotate-left(z) 27: end procedure

AVL-Insert runs in Θ(log n). 1. Add a new node according to rules of BST insertion 2. Update height information of each node.

3. Fix the tree using rotations. If subtree z is unbalanced, let y, x be its descendent on the inserted path. Then we apply rotation to z, y, x. (Only 1 rotation!) The rotation will make z balanced and reduce its height. The height reduction ensures that the tree overall is still correct.

AVL-Delete is analogous. Notice that there is no break after the restructure anymore, since tree rotations cannot increase the height of a subtree without breaking the AVL property. The AVL-Delete function runs in Θ(height) and may call AVL-Restructure Θ(height) times. Therefore, the worst case runtimes for AVL-Insert and AVL-Delete are both Θ(height) = Θ(log n).

29 1: procedure AVL-Delete(T, k) 2: z ← BST-Delete(T, (k, v)) . Deletion returns the child of BST node that was removed 3: z.height ← 1 + max{z.left.height, z.right.height} . Recalculate height 4: while z 6= root(T ) . Move up the tree to maintain balance 5: z ← parent(z) 6: if |balance(z)| > 1 7: y ← taller-child(z) 8: x ← taller-child(y) 9: z ← AVL-Restructure(x, y, z) 10: end if 11: z.height ← 1 + max{z.left.height, z.right.height} . Recalculate height 12: end while 13: end procedure

Descriptio 2.5: A scapegoat tree

30 0/5

20 60 0/2 0/2

10 40 0/1 0/1

2.7.3 Scapegoat Tree Scapegoat tree is a realisation of the dictionary using no tree rotations. Instead, we insert like a regular binary search tree and reconstruct the tree when it becomes too unbalanced. We shall see that the insertion time is amortised O(log n).

Definition

 1  Let α ∈ 2 , 1 .A scapegoat α-tree is a binary search tree such that

height ≤ log1/α n + 1

Every node v in a scapegoat α-tree stores nv which is the weight of the subtree at v and (optional) a integer known as “tokens” and is initialised at 0.

1 The closer α is to 2 , the more regular the tree. Tokens are only a method to analyse the data structure and are not stored in practice.

• SGT-Rebuild at p: (Special operation) 1. Find highest ancestor v (known as the scapegoat) of p such that

nv > α · nparent(v)

2. Extract the descendants of p in nodes, sorted, to be A[n]. 3. Rebuild the subtree with BST-Construct(A[n]) 4. After complete rebuild, remove all tokens at p and its descendants.

We can show that this releases ≥ (2α + 1)nv tokens.

30 This takes Θ(nv) time. • SGT-Insert: 1. Insert as for a BST tree

2. On insert path, update weights nv. 3. On each node of insert-path, deposite a token.

4. If now the height (determined from the length of insertion path) is > log1/α n, we do a SGT-Rebuild at p.

Below is a lemma ensuring the existence of v.

2.7.4 Lemma. If insertion path has length > log1/α n, then some node v on that path satisfies nv > αnparent(v)

Proof. Let the path be (v0, . . . , vk)(v0 is the root node). Assume nv ≤ αnparent(v) for all vi. Observe

n ≤ αn vi+1 vi so n ≤ αkn = αkn vk v0 where n is the size of the tree. Since the insertion path is the only place where the Scapegoat structural property can be violated,

h 1 ≤ nleaf ≤ α · n

From which we have  1 h ≤ n α so h ≤ log1/α n but this contradicts the fact that the path has length greater than h. 2.7.5 Proposition. Amortised time for SGT-Insert is O(log n). Proof. Suppose c > 0 is large enough such that

• Build takes time ≤ c · np • The height is ≤ c · log n • Insert without rebuild takes time ≤ c log n. Define (choice of this K is because it allows for amortised analysis) c K := 2α − 1 Define the potential function φ(t) := K · (#tokens) Amortised time of insert without rebuilding is

≤ c log n +K (#new tokens) ∈ O(log n) | {z } | {z } TBST-Insert ∆φ≤h

31 Amortised time of insert with rebuild:

= TBST-Insert + TBST-Construct + ∆φ

≤ c log n + c · np + Kc log n − K(#old tokens) | {z } ∆φ 1 = O(log n) + c · n − c (#tokens below p) p 2α − 1

The number of tokens at p is at least (2α − 1) · np, because if v is inserted to the left,

nL > α · np

Observe that thet right subtree is smaller: 1 n = n − n − 1 ≤ ( − 1)n − 1 ≤ n R p L α L L

nL − nR = 2nL + 1 − (nL + nR + 1) | {z } np

> 2αnp + 1 − np

= (2α − 1)np + 1

When the tree was balanced last time, |nL − nR| ≤ 1, so we must have had at least (2α − 1)np insertions to get here. This generated (2α − 1)np tokens in the tree for which upon the rebuild are all free’d. Therefore, the amortised time of insert with rebuild is bounded is 1 ≤ O(log n) + c · n − c (#tokens below p) = O(log n) p 2α − 1 | {z } ≤0

32 Descriptio 2.6: A skip list. A tower of height 1 is highlighted in blue

S3 −∞ +∞

S2 −∞ 25 +∞

S1 −∞ 12 25 31 +∞

S0 −∞ 12 20 25 30 31 +∞

2.8 Skip Lists

Binary search repetitively obtain the median of a list’s sublists. We can extend this idea to produce a structure that is easy to search, providing another implementation of dictionary.

Definition

A skip list is a hierarchy S of ordered linked lists (levels) S0,...,Sh of key-value pairs.

• Each list Si contains spsecial keys ±∞ (sentinels)

• List S0 contains the KVPs of S in non-decreasign order.

• Each list is a subsequence of the previous one: S0 ⊇ S1 ⊇ · · · ⊇ Sh.

• Sh only contains sentinels.

The skip list contains a reference to the topmost (i.e. in Sh) left node. Each node p has references to after(p) and below(p). A tower of nodes is the set of all nodes above a node in S0, including itself.

The advantage of a skip list over a treap is that the skip list allows for range search and behaves like a list instead of a tree. In some sense a skip list is a tree (“below” is left child and “after” is right child) In practice the top most layer carries no information and is thus not stored.

Algorithm 7 Searching in Skip lists 1: procedure Skip-Search(L, k) 2: p ← topmost left node of L 3: P ← stack(p) . Creates a stack containing p 4: while below(p) 6= ∅ 5: p ← below(p) 6: while key(after(p)) < k 7: p ← after(p) 8: end while 9: push(P, p) 10: end while 11: return P 12: end procedure

Skip-Insert does:

1 1. Randomly toss a coin until obtaining a tail (i.e. geometric distribution with ratio 2 )

33 Descriptio 2.7: Skip-Insert, where the height of this tower is randomised to be 1

S3 −∞ +∞

S2 −∞ 25 +∞

S1 −∞ 12 22 25 31 +∞

S0 −∞ 12 20 22 25 30 31 +∞

2. Let i be the number of heads. This will be the height of the tower, so

1l P (h ≥ l) = k 2

Notice that a single node has height 0. 3. Add more sentinel layers if needed (i > h)

4. Search for k with Skip-Search to get stack P . The top i items of P are the predecessors p0, . . . , pi.

5. Insert (k, v) after p0 in S0 and k after pj in Sj for 1 ≤ j ≤ i. Skip-Delete: 1. Search for key k with Skip-Search. 2. Remove the tower at k. 3. Remove extra sentinel layers (layers that only contain ±∞)

Definition

A event En has high probability if P (En) → 1 when n → ∞.

2.8.1 Proposition. The height of a skip list is below 3 log n (where n is number of nodes in S0) with high probability. Proof. O(h) = max{height of tower at key k} = Xk k 1 where Xk is a geometric with ratio 2 , so E[Xk] = 1. The height of the skiplist is the max of its towers

P (max{Xk} ≥ i) = 1 − P (max{Xk} < i) k k Y = 1 − P (Xk < i) k 1 = 1 − (1 − )n 2i if there are n keys. Unfortunately, this cannot be evaluated analytically. Instead we can have a bound: X n P (max{Xk} ≥ i) ≤ P (Xk ≥ i) = k i k 2

34 with this we have n n 1 P (height > 3 log n) ≤ = = 23 log n n3 n2 Rearranging, 1 P (height ≤ 3 log n) = 1 − n2 Since log n  n, this is practically 1. Hence the height of a skip list is O(log n). We can interpret the above theorem as expected height ≤ 3 log n. 2.8.2 Proposition. The expected space of a skip list is linear P Proof. Space is proportional to n + k Xk, so X X E[n + Xk] = n + E[Xk] = 2n k k | {z } 1

What is the efficiency of a Skip List? • Downward steps is less than the height (O(log n))

• What is the number of forward steps? We can take a different aproach. What is the length of all search paths reaching to a node. Defin e C(j) to be the expected # of steps to get to some node on π on layer h − j (h is height of top layer).

• C(0) = 0, we begin at the top left. • To obtain C(j): If the last move is “after”, it means the node above does not exist, so

C(j) = P (htower = h − j) · C(from left) + P (htower > h − j) · (C(from above) + 1) 1 1 = C(j) + C(j − 1) + 1 2 2

Therefore we have ( 0 if j = 0 C(j) = 2 + C(j − 1) if j > 0 and so C(j) = 2j. Since the height is O(log n), the time for search is

O(C(h)) = O(log n)

This also implies O(log n) time for Skip-Insert and Skip-Delete. Skip lists are fast and simple to implement in practice.

2.9 Dictionary with Biased Search Requests

In practice 80% of the search request come from 20% of the keys. Some requests are much more frequent and we want to tailor the dictionary such that these requests are fast. Two scenario: We know access-probabilities of all items, or we do not. For example, if we have keys A, B, C, D, E and they have access probability 5 8 1 10 12 P := ,P := ,P := ,P := ,P := A 26 B 26 C 26 D 26 E 26

35 Then the total access cost (in the number of comparisons) is the first moment: X 1 + Pxix x∈{A,B,C,D,E} where ix is the index of x. This is optimised when items of higher probability are placed before the lower probability ones. If we do not know the access probabilities, we need a adaptive data structure, which changes upon being used.

2.9.1 MTF Array and Transpose Array A rule of thumb is the assumption of temporal locality: A recently accessed item is likely to be used soon again, so this gives the idea of a MTF array:

Definition

A Move-To-Front Array is an array such that

• New insertions are always in the front (i.e. [0]) • (Move-to-Front) If an item is accessed, the item is moved to [0].

This adds some overhead to searching but the overhead is constant. (Time is 1 + idx k for node k). 2.9.1 Theorem. MTF Let S be a set of keys and a = (a1, . . . , am) be a sequence of key accesses. Let T be the number of comparisons for a MTF array storing S when queried using. T opt be the optimal static order time. Then T MTF ≤ 2T opt opt Proof. Let x, y ∈ S be distinct. Let tx,y be the number of comparisons of x with y. Let Cx be the number of occurrences of x in a. In the optimal search sequence, x is placed before y or y is placed before x, so opt tx,y = min{Cx,Cy} Now consider the MTF array case. We can split the search sequence a into maximal blocks a 1,..., a k, where j j each block starts with x or y and only contains x or y. Within block a , a1, which is either x or y, is compared with y or x only once, so MTF opt tx,y = k ≤ 2tx,y

This heuristic will echo in the section for compression algorithms. Another heuristic is the Transpose array: Upon successful search, swap the accessed item with the item immediately preceding it. The worst case of the Transpose array happens when two items are accessed alternatively. If x, y are at the end of the array and are directly adjacent, the access sequence xyxyxyxy . . . will require Θ(n) comparisons per access.

2.10 Splay Trees

The Move-to-Front method does not work well with binary search trees, since rotating leaf elements causes the binary search tree to degrade into linked lists. We can solve this problem using more compilicated tree rotations that do not unbalance the tree.

Definition

Splay Trees are binary search trees with a splay operation that allows recently accessed elements to be accessed again.

36 Definition

A Zig-zig rotation is

z x

y D Right zig-zig rotation A y Left zig-zig rotation x C B z

A B C D

Double left/right rotations are also called zig-zag rotations. The insertion and search function in Splay trees are identical to that of binary search trees, but they require a Splay step to move the elements to the top.

Algorithm 8 Insertion in Splay Trees 1: procedure Splay-Insert(T, (k, v)) 2: x ← BST-Insert(T, (k, v)) 3: Splay(T, x) 4: end procedure 5: procedure Splay(T, x) . x is a node in T 6: while x 6= root(T ) 7: p ← parent(x) 8: if x = left(p) 9: if p = root(T ) 10: rotate-right(p) 11: else 12: g ← parent(p) 13: switch (x, p, g) g p x 14: case . Right zig-zig rotation 15: rotate-right(g) 16: rotate-right(p) g p x 17: case . Right zig-zag rotation 18: rotate-right(p) 19: rotate-left(g) 20: end if 21: else . Symmetric case with x = right(p) 22: ... 23: end if 24: end while 25: end procedure

Splay trees need no extra space compared to a BST but is much more efficient.

37 2.10.1 Theorem. A Splay step in a splay tree costs amortised O(log n) time. Proof. Define r(v) := log(nv)

(where nv is the size of subtree at node v) Potential function: X φ(t) := log(nv) v∈Tree The time is the number of rotations. Let x be the node where we apply the Splay step. We wish to prove the following amortised time for the number of rotations. ≤ 3r0(x) − 3r(x) where r0 is r after rotation. To prove this, we are interested in φ0 − φ. g p x x g p Suppose the subtree is changed by a zig zag rotation to . Then r0(x) = r(g), so

0 X 0 X φ − φ = r (v) − r(v) v v = r0(g) + r0(p) + r0(x) − r(g) −r(p) − r(x) | {z } =0 = r0(g) + r0(p) − r(x) − r(p) Observe that r(p) ≥ r(x) since parent(x) = p is a subtree. Moreover, 0 0 0 ng + np ≤ nx Using the concavity of log, 1 1 1 n0 + n0 n0 (r0 + r0 ) = log(n0 ) + log(n0 ) ≤ 2 log( g p ) ≤ 2 log( x ) = 2(r0 − 1) 2 g p 2 g 2 p 2 2 x so φ0 − φ ≤ r0(g) + r0(p) − 2r(x) ≤ 2(r0(x) − 1) − 2r(x) ≤ 2r0(x) − 2r(x) − 2 + r0(x) − r(x) | {z } ≥0 = 3r0(x) − 3r(x) − 2 Hence amortised time for zig-zag is at most 2 + 3r0(x) − 3r(x) − 2 = 3(r0(x) − r(x)) where 2 came from the time of 2 rotations. For single rotations, φ0 − φ ≤ 3(r0(x) − r(x)). Hence amortised time for Splay is: X = T (q) rotation q X 0 ≤ (3r (x) − 3r(x)) + 1 |{z} x Last rotation, Single rotation ≤ 3r0(root) − 3r(init-pos(x)) + 1 ∈ O(log n)

38 Caput 3

Hashing and Spatial Data

3.1 Hash Tables

Consider a special situation: In a dictionary, we know that all keys are from the integer range {0,...,M − 1}. A dictionary of such keys can be easily implemented as an array:

1. Dict-Search: Check if A[k] is empty. Θ(1) 2. Dict-Insert: A[k] ← v. Θ(1) 3. Dict-Delete: A[k] ← nil Unfortunately this is not always the case, but we can define a special function to map keys to {0,...,M −1}. This is known as hashing and the function is hash function. Note. Hash functions are usually assumed to be evaluable in O(1). We have a hash-table T [0,...,M − 1] that is chosen by the implementation. e.g. If U = N,M = 11, we can define the modular hash function h(k) := k mod 11

The difficulty is to choose M. If M is a power of 10 or 2, it is very easy to get a collision. Primes are favoured. Ideal hash functions do not throw away information about the key. Multiple distinct keys may occupy the same slot. If we have keys {45, 13, 7} and we are inserting 18, 18 collides with 7 and needs to occupy the same space as 18. A simple solution is chaining: Build a linked-list and linear search on the linked-list. If we have a collision when inserting, the new keys are moved to the front (MTF heuristic) The runtime of search is O(length of T [i]). If we use a uniform hash function that takes each of {0,...,M− n 1} equally likely, the expect length of list is O( M ). M 0 A heuristic we can use is rehashing: Whenever n > 2 , we create a new hashtable with M := 2M and totally rebuild the hash structure. n n 1 The amortised insertion time is still O( M ). The number α := M is the load factor and is always ≤ 2 .

3.1.1 Probe Hashing Suppose we have keys {7, 41} and we are trying to insert 84 (with M = 11 and modular hash) The slot with residue 7 ≡ 84 is occupied, so we try the next slot with residue 8, which is also occupied by 41, so 84 is inserted at residue 9:

39 Descriptio 3.1: Inserting a new element into a chain-based with M = 7 0 nil

1

2

3

4

5 nil

6

k ≡ 7 7

8

9 nil

This method is known as linear probing. The problem is that sequences tend to get long and Dict-Insert, Dict-Search are slow in practice. We have even more problem when searches are unsuccessful: The sequence must be entirely scanned to find out that a key does not exist. The probe sequence is

h(k), h(k) + 1, h(k) + 2,...

3.1.1 Triangular Number Probing. If we use the probe sequence  i(i + 1) h(k, i) := h(k) + mod M 2 and M = 2p for integer p, then the probe sequence

h(k, 0), h(k, 1), . . . , h(k, M − 1) are all different, so it must hit an empty slot. Proof. Suppose two entries in the probe sequence are equal: (0 ≤ i < j < M)

i(i + 1) j(j + 1) ≡ mod M 2 2 so that i2 + i − j2 − j ≡ 0 mod 2M = 2p+1 (i − j)(i + j − 1) ≡ 0 mod 2p+1 Only one of (i − j) and (i + j − 1) can be even, so one of then is divisible by 2p+1. This violates the condition i, j ∈ {0,...,M − 1}, so all entries in the probe sequence are different.

40 3.1.2 Double Hashing Suppose there are 2 uncorrelated hash functions that maps keys into {0,...,M − 1}. The probe sequence is (modulo M) h1(k), h1(k) + h2(k), h1(k) + 2h2(k),...

Warning: h2 should never be 0, otherwise there is infinite loop. Also the output of h2 should not divide M or have a common divisor.

3.1.3 Cuckoo Hashing

Main idea is to have two hash functions h0, h1 and two hash tables T0,T1. We maintain the invariant:

Key k is always at T0[h0(k)] or T1[h1(k)]. This obviously fails if collision happens on both tables, but assuming that the Dict-Search and Dict-Delete takes O(1). Dict-Insert has to pay for the price of this efficiency.

Detour

Below is a hash function. h(k) := bM(ϕk − bϕkc)c This is an example of multiplication method to get hash functions. Notice that ϕk − bϕkc extracts the fractional component of ϕk. If ϕ is irrational, there is no period of h. ϕ is the golden ratio and works very well. (We will not see hash functions this complicated on the midterm)

To insert with Cuckoo hashing,

1. If T0[h0(k)] is empty, put k there and exit. 0 0 2. If T1[h1(k)] is empty, put the original k := T0[h0(k)] into T1[h1(k)]. (Kick k out of its nest like a cuckoo) Exit.

0 00 0 3. If T0[h0(k )] is empty, move k := T1[h1(k)] to T0[h0(k )]. 4. ...

T0 T1

nil

41 1: procedure Cuckoo-Insert(k, v) 2: i ← 0 . Nest id 3: for j ← (0,..., 2M − 1) 4: if Ti[hi(k)] = nil 5: Ti[hi(k)] ← (k, v) 6: return . Success 7: end if 8: (k, v) ↔ Ti[hi(k)] 9: i ← 1 − i 10: end for 11: return “failed, requires rehash” 12: end procedure

3.1.4 Complexity of Probing Methods α is the load factor. Below is a table of average costs, assuming uniform hashing. T avg Search (Fail) Insert Search (Success) Linear 1 1 1 (1−α)2 (1−α)2 1−α

1 1 1  1  Double 1−α 1−α α log 1−α Cuckoo 1 (worst) α 1 (worst) (1−2α)2 If we can keep load factor low, Cuckoo hashing is amortised O(1) insert.

3.1.5 Universal Hashing Every hash function must fail for some sequences of inputs. If everything hashes to same value, this is a terrible worst case. The solution is to use randomisation. When initialising or re-hashing, choose a prime number p > M and random numbers a, b ∈ Zp, a 6= 0. Use the hash function h(k) := ((ak + b) mod p) mod M

3.1.2 Proposition. 1 For any fixed numbers x 6= y, the probability of a collision using this random function h is at most M . 3.1.3 Corollary. The expected run-time for insert is O(1) if α is sufficiently small. Hash vs BST: Binary search tree has: • O(log n) worst case operation cost • No need for assumptions, special functions, or known properties of input distribution • Predictable space usage • Never need to rebuild entire structure • Supports ordered dictionary operations (rank, select) Hash has: • Expected/Amortised O(1) operations • Choose space-time tradeoff using load factor • Cuckoo achieves O(1) worst-case for search and delete.

42 3.2 Hash of Multi-dimensional data

A word is a string. |w| is length of w, and the natural ordering on words is the lexicographic order, so

b

The lexicographic order sometimes is different from numerical ordering. For example

999 < 1000, 999 >lex 1000

If we build a dictionary whose keys are words, a word uses ω(1) time. Comparison also takes ω(1) time. A simple approach is to have multi-dimensional data. The standard approach is to use a base R:

A · P · P · L · E → (60, 80, 80, 76, 69) → 65R4 + 80R3 + 80R2 + 76R + 69

For ASCII string, R = 128. To prevent treating gigantic numbers, if we use a modular hashing the modulo should be applied in every step. 1: procedure Hash-string(s) 2: h ← 0 3: for c ← s 4: h ← (hR + c) mod M 5: end for 6: end procedure 7: return h This hashing function takes time Θ(|w|).

43 Descriptio 3.2: A binary trie storing {00, 0001, 010, 011, 110, 111, 11} 0 1

0 1 1

$ 0 0 1 0 $ 1 00$ 11$ 1 $ $ $ $ 010$ 011$ 110$ 111$ $ 0001$

3.3 Tries

Definition

A trie is a tree such that: • Items are only stored in the leaf nodes. • Edge to child is labeled with corresponding character or a sentinel $. If the alphabet is {0, 1, $}, the structure is a binary trie and is used for storing bit strings. $ is the end of word symbol.

Note. Keys in a trie can have different number of bits.

Definition

A prefix of a string S 0 . . n − 1 is a S 0 . . i for some 0 ≤ i ≤ n − 1. A dictionary is prefix-freeJ ifK there is no pair ofJ stringsK in the dictionary such that one is the prefix of the other.

A bit string is of the form 0100$ (the end-of-word symbol is optional) We store a set of words that is prefix free: No word is a prefix of another word. (001 is a prefix of 00101) This can be ensured by adding the special character $ to the “end” of a word. To search in a Trie, 1. Start from the root and most significant bit of x 2. Follow link that corresponds to current bit in x.

3. Return success if we reach a leaf. The leaf must store x. 4. Otherwise, recur on the new node.

1: procedure Trie-Search(v ← root, d ← 0, x) 2: if is-leaf(v) 3: return v 4: else 5: c ← child-of(v, x[d]) 6: if c = nil 7: return not found 8: else 9: return Trie-Search(c, d + 1, x)

44 10: end if 11: end if 12: end procedure Trie-Search forms the backbone for two other operations • Trie-Inesrt:

1. Trie-Search (should be unsuccessful) 2. Add new nodes to the “missing child” place, adding necessary extra bits of x as intermediate nodes in the process.

• Trie-Delete: 1. Search for x 2. Let v be the leaf where x is found. 3. Delete v and all ancestors of v until we reach an ancestor that has two children. The operations take time Θ(|x|), In case for which the alphabet is large, hashing can be used. If the alphabet Σ is not {0, 1} or {0, 1, $}, we could store arrays of children at each node corresponding to each letter in the alphabet.

3.3.1 Variation of Trie: No leaves Since the keys are stored implicitly through each path, we do not need to store the keys. This halves the amount of space needed.

0 1 0 1

$ $ $ $

0$ 1$

3.3.2 Variation of Trie: Compressed labels In this variant, a node has a child only if it has at least two descendants. This saves space if there are long bitstrings.

0 1

0 1

0 $ $ 0 0 1 1 00$ 0001$ 010$ 011$ 110$ 111$ 11$

3.3.3 Variation of Trie: Allow Prefixes If we allow prefixes, e.g. storing 011 and 01, we can use a special flag to indicate a node stores a value. This is more space efficient:

45 1

$ 0

1$ $ 10$

3.3.4 Compressed Tries Morrison (1968) invented this structure and named it Patricia (Practical Algorithm To Retrieve Information Coded In Alphanumeric) tries. Nodes with only one path are compressed. Each node stores an index: The next bit to be tested during a search (0 is first bit, 1 second bit, etc.) A compressed trie storing n keys always has at most n − 1 internal nodes.

0 0 1

1 2 0 1 0 1

2 2 3 111$ $ 1 0 1 $ 1 00$ 0001$ 01001$ 3 110$ 1101$ $ 0 011$ 01101$

CTrie-Search: Start from the root and bit indicated at node. Follow the link to the current bit until we reach a leaf. We need to explicitly check whether word stored at leaf is x. 1: procedure -Search(v ← root, x) 2: if is-leaf(v) 3: return x = key(v) 4: else 5: d ← index(v) 6: c ← child-of(v, x[d]) 7: if c = nil 8: return not found 9: else 10: return CTrie-Search(c, x) 11: end if 12: end if 13: end procedure Insert and Delete:

• CTrie-Insert: 1. CTrie-Search 2. Let v be the node where search ended. 3. Add new branch node at v to for new key x.

• CTrie-Delete: 1. CTrie-Search

46 Descriptio 3.3: Example of BST-RangeSearch(T, 28, 47) on a binary tree. 52

35 74

15 42 65 94

9 27 39 46 60 69 86 99

22 29 37 41 49

2. Remove node v stored key x 3. Compress along path to v when possible. These operations are O(|x|).

3.4 ADT: Dictionary with Range Search

We add a new operation to the dictionary,

Dict-Range-Search: Given keys x1, x2, return all keys between x1 and x2. ([x1, x2])

In a sorted array, we can search for x1, x2 and return every item in between, so the runtime is O(log n + s), where s is the size of the output. Notice that this cannot be done with hashing unless we use special hash functions. This can already be implemented in a binary search tree.

1: procedure BST-RangeSearch(T, x1, x2) 2: if T = nil 3: return 4: else if x1 ≤ key(T ) ≤ x2 5: L ← BST-RangeSearch(T.left, x1, x2) 6: R ← BST-RangeSearch(T.right, x1, x2) 7: return L ∪ {key(T )} ∪ R 8: else if key(T ) < x1 9: return BST-RangeSearch(T.left, x1, x2) 10: else if key(T ) > x2 11: return BST-RangeSearch(T.right, x1, x2) 12: end if 13: end procedure Idea

1. Search for left, right boundaries x1, x2, giving search paths P1,P2. 2. Partition nodes of T into different trees:

• Boundary: Nodes on P1 or P2.

• Inside: Nodes on right of P1, left of P2.

• Outside: Nodes on left of P1, or right of P2. 3. All inside nodes are returned.

47 4. Test if boundary nodes satisfy condition and return. The runtime is the time of searching two paths and the number of inside nodes:

T = 2O(log n) + O(s) = O(log n + s) where s is the number of items found.

3.5 Spatial Data Structures

2 Can we design a dictionary for points in R ? • Each item has d aspects (dimensions) • Aspect values are numbers

• Each item corresponds to a point in d-dimensional space. • Range-search query: Specify a range for dimensions, and find all the items whose dimensions fall within given range. e.g. Search for points on a plane within the rectangle [1, 3] × [5, 12].

We concentrate on d = 2, the plane, and create a function that allows for searching all points in a rectangle.

3.6 Quad-Trees

For this data structure, we can assume all points are within a square R. This square can be found by computing minimum and maximum x, y values. Ideally the width/height of R is a power of 2. To build the , 1. Root of quadtree is R

2. If R contains 0 or 1 points, the root r is a leaf that stores point. 3. Otherwise split, partition R into four equal subsquares (quadrants) 4. Root has four subtrees associated with the four quadrants.

5. Repeat process at subtree. Convention: (0, 0) (or minimal co¨ordinate)starts at the bottom left of page. The subtrees are in the order NE, NW, SW, SE, which are ++, −+, −−, +−.

NE NW SW SE

48 Descriptio 3.4: Left: A Quadtree. Right: A particularly bad configuration that leads to a tall tree with only 3 points

Convention: Points on the quadrant split line goes to the right/top. Operations in quadtree: • QTree-Search: Analogous to binary search and tries • QTree-Insert: Search for the point, split the leaf while there are two points in one region. • QTree-Delete: Search for point, remove. Remove ancestors until an ancestor has at least two points. The runtime is O(height). Sometimes this may be better. Unfortunately, this is no bound of the height better than n relative to n. For a range search: 1: procedure QTree-RangeSearch(T,A) 2: R ← region-of(T ) 3: if R ⊆ A 4: return all-points(T ) 5: else if R ∩ A = ∅ 6: return () 7: else if is-single-point(T ) 8: p ← point(T ) 9: if p ∈ A 10: return p 11: else 12: return () 13: end if 14: else 15: r ← () 16: for v ∈ T 17: append(r, QTree-RangeSearch(v, A)) 18: end for 19: end if 20: end procedure What is the height of a quadtree?

Definition

The spread factor of a set of points S with bounding box radius r is r β(S) := minx,y∈S kx − yk

49 Descriptio 3.5: A KD-Tree of 10 points. Level is shown with thickness

p0

p7 p8 p3 p6

p5

p1 p4

p9

p2

• The height of the quadtree is Θ(log β(S)). • Complexity to build initial tree is Θ(nh) worst case.

• Complexity of QTree-RangeSearch is Θ(nh) worst case, even when the answer is ∅. In practice (graphics scenes, etc.), are much faster. Note. The quadtree is a tree for 2D data. The case for d = 3 is an . The case for d = 1 is a trie. Variation: We could stop splitting once the number of points in each quadrant falls below a threshold.

3.7 KD-Tree

n−1 Suppose we have n points h(xi, yi)ii=0 . The KD-Tree splits the plane such that troughly half the point are in each subtree. Each node of the KD-tree keeps track of a splitting hyperplane (a line in 2D). The D in KD stands for dimension, which can be any Z+. If a point is right on the splitting hyperplane, the point belongs to the half-space with higher co¨ordinate. e.g. A line at x = 2 cuts the space into half spaces x < 2 and x ≥ 2. To halve the points, we find the median of coordinate axis along one axis. To construct a KD-tree: n−1 1: procedure KDTree-Build(h(xi, yi)ii=0 , d ← x) 2: if n ≤ 1 3: return leaf(h(xi, yi)i) . Leaf node 4: end if 5: if d = x n 6: p ← Select-nth(hxii, b 2 c + 1) 7: (L, R) ← partition(hxi, yii, p) . Partition in x direction into x < p and x ≥ p 8: TL ← KDTree-Build(L, y) 9: TR ← KDTree-Build(R, y) 10: return branch(TL,TR, x) 11: else . d = y, splitting on y 12: ... 13: end if

50 14: end procedure Runtime:

• Finding the partition point and partitioning can be done with QuickSelect (see Algorithms chapter). • Θ(n) expected runtime on each level in the tree. • Total runtime is Θ(hn). By pre-sorting, this can be reduced to Θ(n log n + hn). (no details) If we assume that the points are in general position (i.e. no points share the same x or y co¨ordinate),each n n split puts b 2 c points on one side and d 2 e points on the other, so the recursion height h(n) satisfies lnm h(n) ≤ h + 1 2 This resolves to h(n) ≤ dlog ne ∈ Θ(log n). In this case, building the tree takes Θ(n log n) time and O(n) space. Below is an example for which the KD-tree has infinite height, and points share co¨ordinates:

We could do: Tiebreak by x coordinate when x coordinate are the same. Input regularity assumption: All coordinates are integers No two x or y coordinates are equal. This ensure height to be O(log n) since the number of points is halved at each level. Building a KD-tree (KD-Build): • Find the median (or bn/2c + 1). O(n). • Split the points (O(n)) • Recur Runtime is n T (n) = O(n) + 2T ∈ O(n log n) 2 The KD-Tree has dictionary operations:

• KDTree-Search: Binary tree search. • KDTree-Insert: Insert as new leaf. • KDTree-Delete: Delete leaf and any ancestor which has only one point until there is only one such ancestor left. Problem: After insert/delete, the split might no longer be at exact median and the height cannot be guaranteed to be O(log n). This can be done by allowing certain imbalance and re-building the tree when necessary. KDTree-RangeSearch: We can query with a rectangle A: 1: procedure KDTree-RangeSearch(T,A) 2: R ← region-of(T ) 3: if R ⊆ A . Case Red

51 4: return points(T ) 5: else if R ∩ A = ∅ . Case Blue 6: return () 7: else if is-leaf(T ) 8: p ← point(T ) 9: if p ∈ A . Case Red 10: return (p) 11: else . Case Blue 12: return () 13: end if 14: else if is-split-x(T ) . Case Green 15: L ← KDTree-RangeSearch(T.left,A) 16: R ← KDTree-RangeSearch(T.right,A) 17: return L ∪ R 18: else if is-split-y(T ) . Case Green 19: ... . Analogous 20: end if 21: end procedure The complexity of this ranged query is O(s + Q(n)), where • s is the number of points in A • Q(n) is the number of green nodes (Case Green). We might na¨ıvely think Q(n) = 1 + Q(n/2) because we are splitting half way, but this is false since we have both horizontal and vertical splits. • We can show that for general positions (no details),

Q(n) ≤ 2Q(n/4) + O(1) √ which resolves to Q(n) ∈ O( n) √ Hence the complexity is O(s + n). B. In some problems, the orientation (horizontal or vertical) of the initial split can matter. In some problems they don’t. For example, finding the point of minimal x. Higher (d) dimensional analog for KD-trees exist, in which if we assume each point is stored in Θ(1) space,

• Storage: O(n) • Construction: O(n log n) We split by axis 0 in the first layer, axis 1 in the second, etc. and when we reach axis d − 1 the axis is wrapped around to be 0 again.

• Range query: O(s + n1−1/d)

3.8 Range Tree

Both Quadtrees and KD-trees are intuitive, but may be slow for large range searches. Quadtree is potentially wasteful in space. There is a much faster data structure that has much faster ranged query, but needs ω(n) linear space. In graphics scenes with millions of points, we sometimes cannot afford this. The Range Tree is a associated data structure and the underlying structure is a BST/Scapegoat Tree: • There is a binary tree T , sorted by x-co¨ordinate,at each point.

52 • Each node v has an auxiliary structure Tv, which is a binary tree sorted by y-co¨ordinate.This tree uses O(|Tv|) space. Important thing: Each point is in the associated tree of all of its ancestors. Using the invariant of scapegoat tree, each point is in O(log n) associated trees. Hence the space is at most O(n log n) if the heights of the binary trees are kept below O(log n).

• RTree-Search: Search for x in primary tree, y in associated tree. • RTree-Insert: Insert x into primary tree T . For each of the O(log n) nodes, insert into the subtree. This gives O((log n)2). Since we do not rotate, there is no additional overhead introduced.

• RTree-Delete is analogous to RTree-Insert.

To perform ranged search of rectangle [x1, x2] × [y1, y2] on a Range Tree:

1. Binary tree range search on [x1, x2] in the primary tree. This generates three sets of nodes, B,O,A (Boundary, Outside, Allocation). 2. • For each p ∈ A, perform range search on associated y-tree of p.

• For each p ∈ B, test if p.y ∈ [y1, y2].

10

4 14

Associated Tree 6 12 16

2 5 8 11 13 15

1 3 7 9

(15, 16)

(6, 15) 15 (12, 14)

(5, 13) 13 (10, 12)

(7, 11) 11 (8, 10) 10 A (14, 9)

(11, 8)

(2, 7)

(9, 6) 6 (1, 5)

(4, 4)

(16, 3)

(13, 2)

(3, 1)

• The time to find boundary and allocation nodes in the primary tree is O(log n).

53 • There are O(log n) allocation nodes.

• O(log n + sv) for each allocatio node v, where sv is the number of points in Tv that are in A. • If two allocation nodes have no common point in their trees, every node is returned in at most one auxiliary P structure, so v sv ≤ s.

Hence the time for RTree-RangeSearch is O(s+(log n)2). This can be reduced to O(s+log n) (no further details)

3.8.1 Problem of Duplicates and Generalisations The Range Tree’s primary and secondary trees may contain duplicates. This requires a tie-breaker (perhaps using the other co¨ordinates). To generalise into d dimensions, • Space: O(n(log n)d−1)

• Construction time: O(n(log n)d−1)

• Range search time: O(s + (log n)d). Comparing to KD-trees for general co¨ordinates:

• Space: O(n) • Construction time: O(n log n)

• Range search time: O(s + n1−1/d)

This is a trade-off between space and time.

3.8.2 3-Sided Ranged Queries Usually we are interested in searching a rectangle (4-sided query), which has four edges. A 3-sided query is a product of an open ray and a interval, and so only has three edges, allowing one axis to go to ±∞:

[x0, x1] × [y0, →]

1. In a max-heap, we can find all elements with y ≥ y0 in time O(1 + s), so we can use a max-heap in y at each node. Note. This data structure is tailored towards three-sided ranged queries and will not work well for rectangular queries. In this case the run-time is O(log n + s). 2. Treaps: Unique if there are no duplicates, O(height + s)

3. Priority Search Trees: Separate where the point is stored from the x co-¨ordinateused for split.

54 Caput 4

Sorting and Searching Algorithms

4.1 Problem: Selection and Sorting

Selection Problem. Given array of ordered item A[n], find the A˜[k] where A˜ is the sorted A. The median finding problem is the selection problem with k = bn/2c. Some simple solutions: 1. Make k passes through the array, deleting the maximum number each time. Θ(kn)

2. Sort the array, then return the kth largest. Θ(n log n) 3. Scan the array and maintain the k largest number so far in a min-heap. Θ(n log k)

4. Make a max-heap by calling Heapify. Call deleteMax k times: Θ(n + k log n). Sorting Problem. Given A[0 . . . n − 1] of elements in a ordered set, find permutation π of {0 . . . n − 1} such that A ◦ π is ascending. The size of an instance is n. A solution to the sorting problem is stable if equal items stay in their unsorted order. To simplify the problem and allow for average-case analysis, we make the assumption:

• All input elements are distinct • Inputs whose relative orders are equal are identified as equal. • All input permutations are equally likely. • The comparisons dominate the time of any other constant time operation. Thus, runtime is Θ(#comparisons).

Another formulation: We want to extract all informations about a permutation of {0, . . . , n − 1} using only sorting.

4.1.1 The Lower Bound of Comparison Sorting There are many (entirely comparison based) sorting algorithms with worst case runtime O(n log n). Is it possible to do better? 4.1.1 Theorem. Any key comparison based sorting algorithm uses Ω(n log n) comparisons in the worst case.

Proof. When only comparison is available, a unsorted array with distinct has entropy log n! = Θ(n log n). A sorted array has 0 entropy, so a sorting algorithms must extract all Θ(n log n) bits of entropy. Since each comparison yields at most 1 bit of entropy, the sorting algorithm must take at least n log n steps.

55 (Slow proof): Let T be the decision tree of a compairson-based sorting algorithm on some input. Each leaf of the tree correspond to a possible output, for which there are at most n!. Since each branch generates a new leaf, the tree’s height is at least log n! ∈ Θ(n log n)

4.1.2 Proposition. n Let A[n] be an array of n unsorted items. To find a ascending subarray A[i, . . . , i + 4] of 5 items, 8 comparisons are needed in the worst case. Proof. Consider A without 5-element ascending arrays. Any algorithm probing A must not leave four consecutive elements un-probed. i.e. There cannot exist i such that A[i, . . . , i + 3] are not visited by the algorithm. n n Hence the algorithm must probe at least 4 items. Since each comparison involves 2 items, at least 8 is necessary.

4.2 Quick-Select

The QuickSelect and related algorithm QuickSort rely on two subroutines: • Choose-Pivot: Choose an index p. We will use the pivot value v ← A[p] to rearrange the array. Simple way of choosing the pivot is to use the last element: 1: procedure Choose-Pivot1(A[n]) 2: return n − 1 3: end procedure

• Partition: Rearrange A and return a pivot index i so that 1. A[i] = v (pivot value) 2. All items in A[0, . . . , i − 1] are ≤ v 3. All items in A[i + 1, . . . , n − 1] are ≥ v

It turns out this can be done in Θ(n).

With these two algorithms established, the QuickSelect can be implemented with

1: procedure QuickSelect(A[n], k) 2: p ← Choose-Pivot(A[n]) 3: i ← Partition(A[n], p) 4: if i = k 5: return A[i] 6: else if i > k 7: return QuickSelect(A[0, . . . , i − 1], k) 8: else if i < k 9: return QuickSelect(A[i + 1, . . . , n − 1], k − i − 1) 10: end if 11: end procedure

Note. A tail recursive algorithm like QuickSelect can be modified to use O(1) memory by converting the recursion to a loop. The Quickselect algorithm also can be modified (in the most obvious way) to produce a list of all elements before k and after k.

56 The key to QuickSelect is the two subalgorithms that it critically relies on. In the worst case, the size of recursion instance is reduced by 1 at each level, producing T (n) ∈ Θ(n2) because ( T (n − 1) + cn if n ≥ 2 T (n) = c if n = 1 4.2.1 Theorem. Let T (n) to be the average time of QuickSelect from a array A[n], using the Choose-Pivot1 algorithm. Then T (n) ∈ Θ(n) Proof. Fix i ∈ {0, . . . , n−1}. There are (n−1)! permutations for which the pivot value v is the ith smallest item. Let T (I) be runtime for instance I. Then 1 X T (n) = T (I) n! I:|I|=n n−1 1 X X = T (I) n! i=0 I:|I|=n pivot-index(I)=i n−1 1 X ≤ (n − 1)!(cn + max{T (i),T (n − i − 1)}) n! i=0 n−1 1 X = cn + max{T (i),T (n − i − 1)} n i=0 Hence n−1 1 X T (n) ≤ cn + max{T (i),T (n − i − 1)} n i=0 Now observe: 1 3 • If the pivot index i is between 4 n and 4 n, the maximum is at most n + T (3/4n). Half of the cases fall in this category.

1 • The probability of pivot not in this range is also 2 , in which case the runtime is bounded by T (n). Half of the cases fall in this category. Hence 1 3 1 T (n) ≤ n + T ( n) + T (n) 2 4 2 Rearranging, 3 T (n) ≤ 2n + T ( n) 4 so T (n) ∈ O(n) from the table of common recurrences. The lower bound can be found by noticing that QuickSelect needs to scan through the array via Partition at least once to find the kth element.

4.2.1 Randomised Pivoting We can leverage the power of randomness to choose the pivot. A simple idea for optimising QuickSelect is to shuffle the input before partitioning. Then T exp becomes T avg, which is Θ(n). We can also use 1: procedure Choose-Pivot2(A[n]) 2: return random(n) 3: end procedure

57 4.3 Partitioning

A conceptually easy linear time partition is 1: procedure Partition(A[n], p) 2: S, L ← ∅ 3: v ← A[p] 4: for x ← A[0, . . . , p − 1] ∪ A[p + 1, . . . , n − 1] 5: if x < v 6: append(S, x) 7: else 8: append(L, x) 9: end if 10: end for 11: i ← |S| 12: A[0, . . . , i − 1] ← S 13: A[i] ← v 14: A[i + 1, . . . , n − 1] ← L 15: end procedure Unfortunately, this costs O(n) space. A much more efficient solution is with Hoare’s Partition. Two indices, i, j, move from the front (resp. back) of the array forward (resp. backward) until they collide. When they collide, the original element v is inserted. 1: procedure Partition(A[n], p) 2: A[n − 1] ↔ A[p] 3: i ← −1, j ← n − 1, v ← A[n − 1] 4: loop 5: do 6: i ← i + 1 7: while i < n, A[i] < v 8: do 9: j ← j − 1 10: while j > 0,A[j] > v 11: if i ≥ j 12: break 13: else 14: A[i] ↔ A[j] 15: end if 16: end loop 17: A[n − 1] ↔ A[p] 18: return i. 19: end procedure Note. Hoare’s Partitioning does not put elements equal to v into one block.

4.4 Quicksort

The Choose-Pivot and Partition functions together can be used to implement a sorting algorithm, known as Quicksort: Note. Do not implement this algorithm in a production setting. There are much better versions of the Quicksort compared to the one above. We could implement Quicksort using a stack, without recursions. Avoid recursion if possible! In the case of Quicksort, two routes of recursions are needed. We implement one using a stack and the other using a while 1 loop. The “bigger” recursion instance is pushed onto the stack, so each item on the stack is 2 of the item below it, ensuring O(log n) space. (Think about what happens in the worst case) When the instance is small enough (e.g. 20), the constant in the Insertion sort takes over and Insertion sort becomes faster than quick sort.

58 1: procedure Quicksort(A, n) 2: if n ≤ 1 3: return 4: end if 5: p ← pivot(A) 6: i ← Partition(A, p) 7: Quicksort(A[0, . . . , i − 1], i) 8: Quicksort(A[i + 1, . . . , n − 1], n − i − 1) 9: end procedure

1: procedure QuickSort(A[n]) 2: S ← stack((0, n − 1)) 3: while ¬isEmpty(S) 4: (l, r) ← pop(S) 5: while r − l + 1 > θ 6: p ← Choose-Pivot(A, l, r) 7: i ← Partition(A, l, r, p) 8: if i − l > r − i 9: push(S, (l, i − 1)) 10: l ← i + 1 11: else 12: push(S, (i + 1, r)) 13: r ← i − 1 14: end if 15: end while 16: end while 17: InsertionSort(A[n]) 18: end procedure In a realistic implementation, the A[0, . . . , i − 1] should consist of A and the boundary indices 0 and i. The choice p is critical. A bad pivot choice leads to a worst case runtime of O(n2). i (the pivot index) is the index for which the pivot would be if A is sorted. 4.4.1 Proposition. The worst case runtime of Quicksort is T worst(n) ∈ Θ(n2) The best case runtime of Quicksort is T best(n) ∈ Θ(n log n) Proof. In the worst case, at each level the size of recursion instance shrinks by 1, so

T worst(n) = T worst(n − 1) + Θ(n) so T worst(n) ∈ Θ(n2) 1 In the best case the size of recursion instance shrinks by 2 : n − 1 n − 1 T best(n) = T best(b c) + T best(d e) + Θ(n) 2 2 so T best(n) = Θ(n log n)

The auxiliary space is bounded below by the recursion depth. This is Θ(n) in the worst case, and can be reduced to Θ(log n) by recursing the larger instance using a while loop.

59 4.4.2 Theorem. The average case runtime of Quicksort is T avg(n) ∈ Θ(n log n)

1 Proof. Similar to Theorem 4.2.1, n of permutations have pivot index i, so ( c if n = 1 T avg(n) = 1 Pn−1 avg avg cn + n i=0 (T (i) + T (n − i − 1)) if n ≥ 2 We can analyse the recursion tree T (n)

T (i) T (n − i − 1)

......

The number of comparisons in each level is at most n. Define Cn to the expected number of comparisons of instance of size n. Then n−1 X 1 C(n) = 1 + max{C(i),C(n − i − 1)} n i=0 n 2 X = 1 + C(i) n i=n/2

We can show that C(n) ≤ logC n for some constant c. If n = 1, then C(1) = 0 = logC 1. Otherwise,

n−1 2 X C(n) ≤ 1 + C(i) n i=n/2 2 n 34  ≤ 1 + log (n) + log ( ) n 4 C C n 1 1 1 3 ≤ 1 + log n + log n + log 2 C 2 C 2 C 4 1 3 = log n + 1 + log C 2 C 4 when C ≥ p4/3, ≤ logC

Alternatively, we guess a bound and prove it explicitly. Proof. We want to show T (n) ≤ 2n · loge(n). (Notice the natural ) The base case n = 1 evidently holds. Then n−1 1 X T (n)n + (T (i) + T (n − i − 1)) n i=0 n−1 2 X ≤ n + 2i log (i) n e i=1 n−1 4 X ≤ n + i log (i) n e i=1

60 Observe: n−1 n X 1 2 1 2 i log i ≤ x log x dx = n log n − n − 2 log 2 + 1 ˆ e 2 e 4 e i=1 2 so 4 1 1  T (n) ≤ n + n2 log n − n2 ' 2n log n n 2 e 4 e

4.4.1 Choice of the Pivot Quicksort critically relies on the choice of pivot.. We are given a array A[n] and k ∈ {0, . . . , n − 1}. We want the item that would be A[k] if A were sorted. Below are some options: 1. Random choice works well, see below. 2. Simple: p = n − 1 (worst case for already sorted arrays)

3. Middle: p = bn/2c 4. Rule of three: Median of A[0],A[bn/2c],A[n − 1]. 5. Unusual: Take the median of bn/5c elements. This will give O(n log n) worst case runtime but is very slow in practice.

6. Recent improvement (2009): Change partition to use multiple pivots. e.g. p0 = A[0], p1 = A[n/2], p2 = 24 A[n − 1]. This supresses quicksort since it takes at most 15 n loge n comparisons. In 1960’s quicksort, the constant is 2.

With this we can implement QuickSelect, which uses Partition.

• In the best case, i = k and there are n comparisons. • In the worst case, Θ(n2) due to the poor choice of pivots. This can happen if the input is already sorted and the pivot is chosen at the first element.

If

1. All instances are equally likely to be the input. 2. We choose pivot uniformly

Then the randomised Quick-Select has the same expected and average runtime.

4.5 Sorting Integers

In a permutation-based sorting algorithm, the runtime is Ω(n log n) due to Information theoretic constraits. We can circumvent this bound is we know more information about what we are sorting. If we are sorting integers, the task becomes a lot easier.

Integer Sorting Problem. Given A[0 . . . n − 1] of integers with m digits that are in base r, find permutation π of {0 . . . n − 1} such that A ◦ π is ascending. The size of an instance is n.

61 4.5.1 Bucket Sort In a radix sort, we express the numbers in a base R and then sort the digits. The BucketSort algorithm sorts a single digit. Example: (base R = 4)

123 230 021 320 210 232 101

Then we have B[0] 230 320 210 B[1] 021 101 B[2] 232 B[3] 123 Now we put the buckets back into the array

A := [B[0],B[1],B[2],B[3]] = [230, 320, 210, 021, 101, 232, 123]

The pseudocode for BucketSort is ([Rd]x is dth digit of x) 1: procedure BucketSort(A[n], d, R) 2: B[0,...,R − 1] ← empty-list 3: for i ← (0, . . . , n − 1) d 4: append(B[[R ]A[i]],A[i]) 5: end for 6: i ← 0 7: for j ← (0,...,R − 1) 8: while B[j] 6= ∅ 9: A[i] ← pop-front(B[j]) 10: i ← i + 1 11: end while 12: end for 13: end procedure The BucketSort is a stable algorithm, requiring Θ(n + R) time and Θ(n) space. This is a tremendous waste of space since it uses linked lists, and can be improved by sorting the array in-place, producing CountSort. CountSort encodes the size of each in a array C and use C to decide the boundary between each key.

1: procedure CountSort(A[n], d, R) 2: C ← array(R) . Array of size R filled with 0 3: for i ← (0, . . . , n − 1) . Pass 1: Find number of each type d 4: C[[R ]A[i]] += 1 5: end for 6: I ← array(R) . Array of size R filled with 0 7: for i ← (0,...,R − 2) . Pass 2: Find boundary of each type 8: I[i + 1] ← I[i] + C[i] 9: end for 10: B ← array(n) . Auxiliary array 11: for i ← (0, . . . , n − 1) d 12: k ← [R ]A[i] 13: B[I[k]] ← A[i] 14: I[k] += 1 15: end for 16: A ← B 17: end procedure

62 4.5.2 Radix Sort The MSD (Most significant digit) Radix sort sorts the array from the highest digit to the lowest digit. It uses a recursion on the {l, . . . , r}th digits. The input array is assumed to have a maximum of m digits:

1: procedure RadixSort-MSD(A[n], l, r, d) 2: if l < r 3: CountSort(A[l, . . . , r], d, R) 4: if d > 0 5: for i ← (0,...,R − 1) 6: (li, ri) ← boundaries-of-bin(i) 7: RadixSort-MSD(A, li, ri, d − 1) 8: end for 9: end if 10: end if 11: end procedure

RadixSort-MSD has many recursions. We can improve the RadixSort-MSD by sorting from the lowest digit, forming the LSD (Least Siginificant Digit) Radix sort. Same as above, we assume A’s element have at most m digits.

1: procedure RadixSort-LSD(A[n], d, R) 2: for d ← (1, . . . , m) 3: CountSort(A[n], d, R) 4: end for 5: end procedure

The time cost is Θ(m(n + r)), and the auxiliary space is Θ(n + r).

63 4.6 Problem: Search

Search Problem. Find the key-value pair of a corresponding key in a (comparison-based) dictionary.

4.6.1 Theorem. Ω(log n) comparisons are required to search a size-n dictionary. Proof. Each comparison can yield at most 1 bit of information, so producing n requires dlog ne bits of information. If the dictionary is implemented as a sorted array, we can use binary search: 1: procedure Binary-Search(A[n], k) .A sorted 2: l ← 0, r ← n − 1 3: while l < r 4: m ← b(l + r)/2c 5: if A[m] < k 6: l ← m + 1 7: else if A[m] > k 8: r ← m − 1 9: else 10: return m 11: end if 12: end while 13: if k = A[l] 14: return l 15: else 16: return notfound . Between A[l − 1] and A[l] 17: end if 18: end procedure

4.7 Interpolation Search

Often the key provides more information than order. If the keys are numerical in nature and we can assume them to be uniformly distributed between A[0] and A[n − 1], we can produce a optimised version of Binary-Search which has faster average case. Suppose we know that two keys 40 and 120 exists. Then we expect the key right in the middle to be 80, so 3 it is more appropriate to search at 4 if we are looking for 100. 1: procedure Interpolation-Search(A[n], k) .A sorted 2: l ← 0 3: r ← n − 1 4: while l < r ∧ A[r] 6= A[l] ∧ A[l] ≤ k ≤ A[r] j k−A[l] k 5: m ← l + A[r]−A[l] (r − l) 6: if A[m] < k 7: l ← m + 1 8: else if A[m] > k 9: r ← m − 1 10: else 11: return m 12: end if 13: end while 14: if k = A[l] 15: return l 16: else 17: return notfound . Between A[l − 1] and A[l] 18: end if

64 19: end procedure The key is  k − A[l]  m ← l (r − l) A[r] − A[l] √ Runtime is T (n) = 1 + T ( n). This resolves to expected time of O(log log n), and the worst case is Θ(n). A worst case can be produced by considering 1, 2, 4, 8,... , which is highly atypical for a uniformly random array. Binary-Search and Interpolation-Search in parallel limit the worst case time to Θ(log n).

65 Caput 5

String Algorithms

5.1 Problem: Pattern Matching

Problem. Given a text T [n] (usually as array of characters) on a alphabet Σ and a pattern P [m]. Determine if P occur as a substring of T , and if true return the first i such that

P [j] = T [i + j], (0 ≤ j < m)

(This is the first occurrence of P ) T is usually huge and P much smaller. Applications: • Information Retrieval • Data Mining • (Find DNA sequences)

Definition

• A substring of T [n] is T i . . j such that 0 ≤ i ≤ j < n. J K • A prefix of T [n] is T 0 . . i . J K • A suffix of T [n] is T i . . n − 1 . J K In general a matching algorithm consists of guesses and checks: • A guess/shift is a position i such that P might start at T [i] • A check is a single position j with 0 ≤ j < m such that the condition T [i + j] = P [j] is tested. Na¨ıve brute force algorithm that checks every character at every shift can run in O(nm) in worst case: 1: procedure PM-BruteForce(T [n],P [m]) 2: for i ← (0, . . . , n − m) 3: if Strcmp(T [i, . . . , i + m − 1],P ) = 0 4: return i 5: end if 6: end for 7: return nil 8: end procedure 9: procedure Strcmp(A[m],B[m])

66 10: for j ← (0, . . . , m − 1) 11: if A[j] < P [j] 12: return −1 13: else if A[j] > P [j] 14: return +1 15: end if 16: end for 17: return 0 18: end procedure The worst possible input is P = am−1b and T = an.

a b b b a b a b b a b

a b c a

a

a

a

a b b

a

a b b a

To improve this we use pre-processing (a common theme in algorithmic design). We break the problem into two parts: 1. Pre-processing: Build Data structures/Extract information that makes query easier. 2. Query: Solve the problem on the improved data structure In the case of pattern matching, we can either preprocess P or preprocess T . • Pattern preprocessing: Karp-Rabin, Boyer-Moore, DFA (Deterministic Finite Automaton), KMP (Knuth- Morris-Pratt) • Text preprocessing: Suffix trees 5.1.1 Information Theoretic Lower bound for Pattern Matching. n Any pattern matching algorith must use at least b m c character comparisons in the worst case, where n = |T | , m = |P |. Proof. Consider the worst case when P does not occur in T . Suppose a algorithm does not probe m consecutive characters. Then changing this part of T with P generates another pattern, T 0, for which the algorithm fails to return the occurrence of P . This violates correctness. Hence the algorithm must visit one out n of m adjacent elements, giving b m c character comparisons.

5.2 Pattern Pre-processing 5.2.1 Karp-Rabin Fingerprint Algorithm Idea: Eliminate guesses that do not fit a hash function. For example, if we want to search P = 59265 in T = 31415926535, we can use a hash function

h(x0, . . . , x4) = (x0x1x2x3x4)10 mod 97 The hash of the pattern is 95, so we can reject with hash different than 95:

67 3 1 4 1 5 9 2 6 5 3 5 h = 84

h = 94

h = 76

h = 18

h = 95

1: procedure PM-Karp-Rabin1(T [n],P [m]) 2: hP ← h(P ) 3: for i ← (0, . . . , n − m) 4: hT ← h(T [i, . . . , i + m − 1]) 5: if hT = hP 6: if Strcmp(T [i, . . . , i + m − 1],P ) = 0 7: return i 8: end if 9: end if 10: end for 11: return nil 12: end procedure Na¨ıve computation takes Θ(m) time per guess, but we can improve! Adjacent window hashes (fingerprints) are not independent and can be updated in constant time via a corkscrew mechanism. If we have a prime p and radix r, m (x1x2 . . . xm)r mod p = (((x0x1 . . . xm−1)r mod p − x0 · r mod p) · r + xm) mod p Hence the improved algorithm is 1: procedure PM-Karp-Rabin-Rolling(T [n],P [m]) 2: hP ← h(P ) 3: p ← random-prime m−1 4: s ← r mod p . Corkscrew constant 5: hT ← h(T [0, . . . , m − 1]) 6: for i ← (0, . . . , n − m) 7: if i > 0 hT ← ((hT − T [i] · s) · r + T [i + m]) mod p 8: end if 9: if hT = hP 10: if Strcmp(T [i, . . . , i + m − 1],P ) = 0 11: return i 12: end if 13: end if 14: end for 15: return nil 16: end procedure The expected running time is O(m + n) and the worst case is Θ(nm), but this is very unlikely.

5.2.2 Boyer-Moore Algorithm This algorithm is brute force with two changes:

• Reverse-order searching: Compare P with a guess moving backwards. • Bad character heuristic: When a mismatch occurs, eliminate guesses where P does not agree with this character.

68 The bad character heuristic requires a last-occurrence function L mapping Σ (the alphabet) to integers, which is defined as:

• L(c) = i for the largest i such that P [i] = c • L(c) = −1 if c does not exist in P .

1: procedure PM-BoyerMoore(T [n],P [m]) 2: L ← Last-Occur-Func(P ) 3: i ← 0 . Current shfit 4: while i ≤ n − m 5: for j ← (m − 1,..., 0) 6: if T [i + j] 6= P [j] 7: break 8: end if 9: end for 10: if j = −1 11: return i . Found substring 12: else 13: i ← i + max{1, j − L[T [i + j]]} 14: end if 15: end while 16: return nil 17: end procedure Worst case is Θ(nm), but this is unlikely. This algorithm is very fast in practice. Below is searching pattern cbba in a text. Notice cbba’s last occurrence function is

L = (a 7→ 3, b 7→ 2, c 7→ 0)

A shift of −1 corresponds to completely non-overlapping patterns.

a b a d b a a b b a b c b b a

i = 0 a −1 i = 4 a 2

i = 5 a 2

i = 6 c b b a

i = 7 3 a 2

i = 8 a 0 i = 11 c b b a

• In a average English text T the algorithm probes approximately 25% of charcaters in T . This is the fastest in practice. • Worst case run-time with only bad character heuristic is Θ(mn + |Σ|). For example, when searching baaaaa in aaaaaaaaaaa. • In practice bad-character heuristic is good enough.

Worst case run-time can be reduced to Θ(n + m + |Σ|) with good-suffix heuristic. That is, if the tail P [k, . . . , m − 1] fits the text, we shift the text so the tail is matched again: (searching P = lyoyo)

69 p u l y o y o

o y o

(y) (o)

Idea: Ensure match characters in the suffix still match, and non-match characters do not match.

1. Try fit the pattern:

i T P

2. Mismatch found at a position:

i Diff Match

The green part is the good suffix. 3. If the good suffix exists in the pattern such that the character preceding the suffix is not the same as the current character, shift: (Cross represents anything different from red)

i

In the case of multiple occurrences, the last occurrence is used. 4. If the good suffix has a suffix that is a prefix of the pattern:

i

To do this, we can pre-compute a suffix skip array of amounts of shifting such that S[j] is the maximum l such that (one of the following) • P [j + 1 . . . m − 1] is a suffix of P [0 . . . l] and P [j] 6= P [l − m + j + 1]. • P [0 . . . l] is a suffix of P [j + 1 . . . m − 1]

• l = −1 if the above cannot be satisfied. The guess is updated by i ← i+(m−1−S[j]). Good-suffix and bad-character should be combined in practice and the heuristic giving the larger offset should be chosen. This is similar to Knuth-Morris-Pratt method’s failure function and can be computed in Θ(m) time.

70 5.2.3 Finite Automaton and Knuth-Morris-Pratt Method Consider the automaton for the pattern P = ababaca:

Σ Σ

a b a b a c a start 0 1 2 3 4 5 6 7

The above is an NFA. However, evaluating an NFA is very slow, since a NFA of m states is equivalent to a DFA of 2m states. We can show that for every NFA, there exists an equivalent DFA. For example, the above NFA is equivalent to

a

b,c a Σ a b a b a b a c a start 0 1 2 3 4 5 6 7

c b,c c b,c

b,c

We will not cover the conversion method since there is a even better method, the Knuth-Morris-Pratt Algorithm. The core idea is that previous matching failures provide information about the position of the next feasible matching.

Σ \{a} Σ × ×

a b a b a c a start 0 1 2 3 4 5 6 7 × × ×

×

× is a failure transition, which is used only if no other transition path fits. The failure transition does not consume a character (similar to the  transition in NFA). Under this rule, the automaton is deterministic, but not a DFA. Since the failure arcs only lead backwards, we can store a failure function F [0, . . . , m − 1] so the failure arc from state j leads to F [j − 1].

F [j] is the length of the longest prefix of P that is a suffix of P [1, . . . , j]. Since F [j] is defined via pattern matching in P [1, . . . , j], we can use already built parts of F to build the rest of F . Below is the algorithmic implementation: 1: procedure KMP-Match(T [n],P [m])

71 Descriptio 5.1: KMP-Match on ababaca a b a b a b a b a c a

a b a b a ×

(a) (b) (a) b a ×

(a) (b) (a) b a c a State 1 2 3 4 5 3, 4 5 3, 4 5 6 7

2: F ← KMT-Failure-Array(P ) 3: i, j ← 0 . Cursor in T,P 4: while i < n 5: if P [j] = T [i] 6: if j = m − 1 7: return match at i − m − 1 8: else 9: i ← i + 1 10: j ← j + 1 11: end if 12: else .P [j] = T [i] is matching failure. Trigger failure arc F [j − 1] 13: if j > 0 14: j ← F [j − 1] 15: else . Failure Arc is a loop 16: i ← i + 1 17: end if 18: end if 19: end while 20: return nil 21: end procedure The failure array represents the part of the pattern for which the prefix can be matched. i.e. Previous matching failures provide information about where the next match should begin.

F [j] is length of longest prefix of P that is a suffix of P [1, . . . , j]. This ensures every possible match is tried. In DFA terms, F [j] is the failure arc from state j + 1. This array can be efficiently built in O(m). 1: procedure KMP-Failure-Array(P [m]) 2: F [0] ← 0 3: i ← 1, j ← 0 4: while i < m 5: if P [i] = P [j] 6: j ← j + 1 7: F [i] ← j 8: i ← i + 1 9: else if j > 0 10: j ← F [j − 1] 11: else 12: F [i] ← 0 13: i ← i + 1 14: end if 15: end while 16: return P 17: end procedure

72 1. Initially there is a a: Since all proper prefixes of a are empty, the failure arc on state 1 must lead to 0.

Σ \{a}

a start 0 1 ×

2. No prefix of P is a suffix of b, so failure on 2 leds to 0.

Σ \{a}

a b start 0 1 2 × ×

3. To find longest prefix of P which is a suffix of ba, we feed a into the already built DFA and arrives in state 1. This suggest the failure arc on 3 leads to 1.

Σ \{a} ×

a b a start 0 1 2 3 × ×

4. abab’s suffix ab can be salvaged, generating another arc.

Σ \{a} ×

a b a b start 0 1 2 3 4 × × ×

5. Final state

Σ \{a} Σ × ×

a b a b a c a start 0 1 2 3 4 5 6 7 × × ×

×

It is not clear that this function will terminate: Define the “potential function” (this is a variant of the main loop)

ψ(t) := 2i − j

• Initially ψ(t) = 0

73 • In one while loop execution: – Found item → Finish, happen only once. – P [j] = T [i] and j < m − 1: ψ increases by 1 – Follow failure arc F [j]: j decreases, so ψ increases by at least 1 – At leftmost state: ψ increases by 2. • Therefore ψ always increases. The number of iterations in the while loop is thus at most 2i ∈ O(n). so the runtime of KMP-Failure-Array is Θ(m). The main KMP-Match function is Θ(n + m).

74 Descriptio 5.2: Trie of suffixes of bananaban $

a n $ $ b aban$ n $ an$ a n $ a a b anaban$ n a b a n $ ananaban$ b $ ban$ a n a n a b a n $ n bananaban$

$ n$ a n $ a b naban$ n a b a n $ nanaban$

5.3 Text Pre-processing

Instead of pre-processing the pattern, we could also pre-process the text.

5.3.1 Trie of Suffixes, Suffix Trees Idea: If P is in T , then P is a prefix of a suffix of T . The set of suffixes of T can be built in a Trie. This allows for O(m) time searching of any individual pattern, but the pre-processing stage takes. This is a Trie of Suffixes. The suffixes (leaf nodes) are stored via indices, so T [5,..., 9] is stored as “5”. A Suffix Tree is a compressed trie (see Subsection 3.3.4) of suffixes. Additionally, each internal node stores a reference to a leaf. Building this trie can be done in:

• Insert n items into compressed trie: O(n2) • Build in O(n) directly. This is Ukkonen’s Algorithm (no details, but covered in CS482) Searchi in a suffix tree:

1. Follow the edges until run out of characters. 2. • If land on internal node, follow the internal node reference to a leaf node. Compare the pattern string to decide if the pattern is in the original string. • If land on leaf node, compare pattern string with key.

5.3.2 Suffix Array Goal: Easy pattern matching as fast as suffix trees, but with less space.

75 Descriptio 5.3: Suffix Tree of T [10] = bananaban 9 .. 9 J K

$ b 5 .. 9 J K 1 n $ 7 .. 9 J K a 2 a b 3 .. 9 J K 0 3 n b a 0 .. 9 1 .. 9 J K J K 3 $ n 6 .. 9 J K

$ 8 .. 9 J K 1 a b 4 .. 9 J K 2 n 2 .. 9 J K

For this we can create a suffix array: A array of indices sorted by lexicographical order on T i . . n − 1 . For example, on tarantula. J K

i Suffix i Suffix 0 tarantula 8 a 1 arantula 3 antula 2 rantula 1 arantula 3 antula → 7 la 4 ntula 4 ntula 5 tula 2 rantula 6 ula 0 tarantula 7 la 5 tula 8 a 6 ula

The suffix array As is the sorting permutation of T i . . n − 1 . To search this we can use binary search. J K s 1: procedure SuffixArray-Search(A [n],P [m]) 2: l ← 0 3: r ← n − 1 4: while l ≤ r  r−l  5: m ← l + 2 s 6: j ← A [m] . Suffix is T j . . n − 1 7: c ← Strcmp(T [j, . . . , j + m − 1],P ) J K 8: if c = −1 9: l ← m + 1 10: else if c = +1 11: r ← m − 1 12: else 13: return j . Match at T j . . j + m − 1 14: end if J K 15: end while 16: return nil 17: end procedure

76 To build the suffix array, we can use MSD/LSD radix sort on the characters (with finite alphabet). This is

O(d(n + R)) = O(n2) where d is the number of “digits” and R the radix ≤ |Σ|. (This can be done in O(n) using a suffix tree but we are trying to avoid using a tree) If we know the pattern length beforehand, we can assume d ≤ m. It is not clear how we can create the suffix array efficiently. Due to the highly structured nature of the suffixes, we can using the following procedure: 1. Begin with array and sort by first character:

i Suffix i Suffix 0 bananaban 1 ananaban 1 ananaban 3 anaban 2 nanaban 5 aban 3 anaban → 7 an 4 naban 0 bananaban 5 aban 6 ban 6 ban 2 nanaban 7 an 4 naban 8 n 8 n

2. For each item in each group (i.e. share the same prefix), find the complementary index that correspond to the substring with this prefix removed. Sort by this index.

i Suffix Index i Suffix 1 ananaban 6 5 aban 3 anaban 7 1 ananaban 5 aban 5 3 anaban 7 an 8 → 7 an 0 bananaban 0 bananaban 6 ban 6 ban 2 nanaban 2 nanaban 4 naban 4 naban 8 n 8 n

Runtime: • Search: O(m log n) because there are m characters and so m times of binary search (each takes O(log n))

• Construct of Suffix Array: O(n log n), since the number of groups of a common prefix is doubled in each iteration.

5.4 Comparison of Pattern Matching Algorithms

Brute-Force KR BM DFA KMP Suffix Tree Preproc. – O(m) O(m + |Σ|) O(m |Σ|) O(m) O(n2)/O(n) Search O(nm) O(n + m) (exp.) O(n) O(n) O(n) O(m) Space – O(1) O(m + |Σ|) O(m |Σ|) O(m) O(n)

Most of the searching algorithms can be adapted to find all occurrences with the same worst-case run-time. For example (requires moving the pre-processing stage outside), 1: procedure Match-All(T [n],P [m]) 2: Preprocess

77 3: i ← 0 4: M ← () 5: while i ≤ n − m 6: i ← Match(T [i, . . . , n − 1],P ) 7: if i = nil 8: return M 9: else 10: append(M, i) 11: i ← i + 1 12: end if 13: end while 14: return M 15: end procedure

78 5.5 Problem: Compression

This section is dedicated to the mechanisms of gzip and bzip. Compression Problem. Given a source text S[n], produce a compressed text C[n0] such that n0 ≤ n. The compression ratio is the ratio of entropy in the two strings. H(C) |C| log |Σ | ρ := = C H(S) |S| log |ΣS| and there is a deterministic algorithm, given C, produces S.

• Logical Compression uses the meaning of the data and only applies to a certain domain (e.g. sound) • Physical Compression only knows the physical bits in the data and not their meanings.

• Lossy Compression achieves better compression ratio at the expense of data, so the source text can only be recovered approximately. • Lossless Compression always decode the source text exactly. Due to information theoretic constraints, lossless compression can not achieve a worst-case ratio of < 1.

Note. We are interested only in lossless compression. The source text can be any data, not always “text”. The encoded alphabet ΣC is usually binary. The goal of an encoding or decoding algorithm could be processing speed, reliability (error-correction), security (encryption), or size (compression). Examples:

• Simplest method: Find a bijection ΣS → ΣC and map S on C. This includes, for example, the Caesar cipher.

• ASCII. A 7→ 65, B 7→ 66, etc. ASCII unfortunately does not work so well for languages with ´l¨ot´sˆof d˜ıˆa¸cr¨ıt`ıcs,or those with large alphabets. This is a Fixed-length encoding: All the codewords have the same length. In the case of ASCII, each character is mapped to 7 bits.

• Variable-Length encoding: More frequent characters get shorter code words. Example: Morse code, UTF-8 Encoding of Unicode (each Unicode character use 1 – 4 bytes). All the examples above are char-by-char encodings. The image of each character is the codeword of the char. In general such an algorithm can be described as: (E is encoding dictionary) 1: procedure Text-Encode(E,S[n]) 2: C ← () 3: for i ∈ (0, . . . , n − 1) 4: x ← search(S[i]) 5: append(C, x) 6: end for 7: end procedure

In a average sample, the letters in the alphabet ΣS have different occurrence rates. ∗ ∗ The decoding algorithm maps the encoded text ΣC to plain text ΣS. A code must be uniquely decodable.

5.5.1 Prefix-Free Encoding 0 0 The coding dictionary E is prefix-free when E(c) is not a prefix of E(c ) for any distinct c, c ∈ ΣS. Such a dictionary can be represented by a trie.

79 0 1

0 1 0 1

A T

0 1 0 1

N D E

Any prefix-free code is uniquely decodable. • Encode: (T is trie) 1. Find leaf nodes corresponding to each letter in alphabet. Let this array be L. 2. Initialise empty string C 3. For each character x in the source S, trace the parent path L[x] and append this string to C

Runtime: O(|T | + |C|) = O(|ΣS| + |C|) • Decode: (T is trie) 1: procedure PrefixFreeDecode(T,C[n]) 2: S ← () 3: i ← 0 4: while i < n 5: r ← root(T ) 6: while ¬isLeaf(r) 7: if i = n 8: return “invalid encoding” 9: end if 10: c ← child(r, C[i]) 11: i ← i + 1 12: r ← c 13: end while 14: append(S, r) 15: end while 16: end procedure Runtime: O(|C|).

5.6 Huffman Tree

Which prefix-code is the best? Example: If LOSSLESS is the input string, the encoding length is X (frequency of c) · |E(c)| c∈{L,O,E,S} where E(c) is encoding of c.

0 1

0 1 0 1 L O E S

80 The tree above produces a string 0001111100101111, with 16 characters. We could do better with Huffman encoding. Heuristic: Frequent characters should be closer to the root. This produces a . 1. Take 2 least frequent characters a, a0 (Not unique). 2. Make the nodes a, a0 low in the tree (they will be siblings).

0 1

a a0

3. Pretend that the there exists a char c that is “either a or a0”, with frequency

f(c) = f(a) + f(a0)

0 4. Recur on ΣS \{a, a } ∪ {c}. i.e. Insert c as a phantom with frequency f(c). This reduces the size of ΣS by 1. This method is Huffman encoding. The same string above produces the Huffman tree

0 1 S 0 1 L 0 1 E O and gives an encoding of LOSSLESS 7→ 01001110100011, which has length 14 and compression ratio ρ = 88%. 1: procedure Huffman-Encoding(S[n]) 2: f ← map(ΣS, 0) . Generate Frequency Table 3: for i ∈ (0, . . . , n − 1) 4: f[S[i]] += 1 5: end for 6: Q ← priority-queue-min . Min-oriented priority queue that store tries 7: for c ∈ ΣS : f[c] > 0 8: Q.insert(c, f[c]) . Insert single-node trie 9: end for 10: while size(Q) > 1 11: T1 ← Q.deleteMin 12: T2 ← Q.deleteMin 13: Q.insert((T1,T2), freq(T1) + freq(T2)) 14: end while 15: T ← min(Q) . Top element of Q is now the encoding trie 16: C ← PrefixFreeEncode(T,S) 17: return C 18: end procedure Huffmann encoding is optimal encoding, assuming that: • The encoding trie needs to be send in plain text. • This assumes only character-by-character encoding is available.

81 • This assumes we can only use a binary trie. Sometimes encoding in another base is better (requires lower entropy).

5.6.1 Proposition. Huffman encoding is a binary trie prefix-free encoding that minimises the moment (cost) X µ(T ) = (frequency of c) · depthT (c)

c∈ΣS

0 0 Proof. Let T be the Huffman Trie. We want to show that µ(T ) ≤ µ(T ) for any trie T that encodes ΣS. We use induction on |ΣS|.

• Base case: |ΣS| = 2. Huffman encodes one bit per character. This is optimal. • Inductive Step: Let a, a0 be the chars with lowest frequency. Huffman-Encoding places a, a0 as siblings. Consider b, b0 are characters that are siblings in T 0, in the lowest level. Cases:

0 0 0 0 0 – If {a, a } = {b, b }, T,T induces tries T0,T0 (resp.) for alphabet Σ\{a, a }∪{c}, where c is a phantom alphabet with f(c) = f(a) + f(a0). 0 Since T0 is a Huffman tree, µ(T0) ≤ µ(T0). Hence

0 0 0 0 µ(T ) = µ(T0) + f(a) + f(a ) ≤ µ(T0) + f(a) + f(a ) = µ(T )

– Switch b, b0 with a, a0 in T 0, produces a tree with lower moment.

Hence to prove that a tree is not a Huffman tree, we only need to produce another tree with lower moment µ(T ).

5.6.1 Huffman Tree with Different Base Sometimes we achieve higher compression ratio by using a different base. Consider nobanana$. The frequencies are {$ 7→ 1, b 7→ 1, o 7→ 1, a 7→ 3, n 7→ 3} Compare the Huffman trees in base 2 and 3:

0 1 n 0 1 a 0 1 0 1 2 o n a 0 1 0 1 2 $ b $ b o

The two encodings generate the strings

H(1, 001, 0001, 01, 1, 01, 1, 01, 0000) = 20 H(1, 02, 01, 2, 1, 2, 1, 2, 00) ' 19.01 which saves less than 1 bit of entropy.

82 5.7 Run-Length Encoding

Idea: A maximal sequence of identical characters is compressed by converting it to a number. This simple algorithm produces optimal compression rates for long repeating strings of 0’s and 1’s. Assume: S is a bitstring. For example:

S = 000001110000 7→ 0, 5, 3, 4

The 0 in the front is the parity and signals that S starts with a 0 rather than a 1. The string 0, 5, 3, 4 is then converted using Elias γ code. An integer k is converted to •b log kc 0’s.

• Binary representation of k (which always start with 1)

k blog kc Binary Encoding 1 0 1 1 2 1 10010 3 1 11011 4 2 100 00100 5 2 101 00101 6 2 110 00110

The numbers excluding the parity bit are transferred with this code.

• A sequence of n identical characters is compressed down to 2blog nc + 2 ∈ Θ(log n) bits. • Works badly if there are runs of length 2 or 4. This effectively converts each 00 or 0000 to 010 or 00100, which makes the string even longer.

• RLE is used as a step in some other algorithms.

1: procedure RLE-Encode(S[n]) 2: C ← S[0] . Output is parity bit 3: i ← 0 4: while i < n 5: k ← 1 6: while i + k < n ∧ S[i + k] = S[i] 7: k += 1 8: end while 9: i ← i + k 10: K ← () . Compute Elias γ code 11: while k > 1 12: C.append(0) 13: K.prepend(k mod 2) 14: k ← bk/2c 15: end while 16: K.prepend(1) 17: C.append(K) . Append generated length 18: end while 19: return C 20: end procedure 1: procedure RLE-Decode(C) .C is a steram of bits 2: S ← () 3: b ← C.pop 4: while ¬empty(C)

83 5: l ← 0 6: while C.pop = 0 7: l += 1 8: end while 9: k ← 1 10: for j ∈ (1, . . . , l) 11: k ← 2k + C.pop . If C is empty at this stage encoding is invalid 12: end for 13: S.append(b, k) . Append b repeat k times 14: b ← 1 − b 15: end while 16: return S 17: end procedure

5.8 Lempel-Ziv-Welch

The Lempel-Ziv-Welch algorithm is used for GIF. Its patent has expired. Idea: Certain substrings are much more frequent than the others. The Lempel-Ziv-Welch is a adaptive encoding algorithm:

• There is a fixed initial dictionary D0 (usually ASCII)

• For i ≥ 0, Di is used to determine the ith output character

• After writing the ith character to the output, both encoder and decoder update Di to Di+1. • Each item in the dictionary is one of

– A single character in ΣS – A substring of S that the encoder and decoder has already seen. • The output is usually converted to a bit-string with fixed-width encoding using 12 bits, which limits the number of codes to 4096.

1: procedure LZW-Encode(S) 2: D ← trie-ascii() . Build a Trie with all ASCII elements as leafs 3: C ← () . Result 4: i ← |D| . Next id is 128 5: while ¬isEmpty(S) 6: v ← root(D) 7: x ← peek(S) 8: while v.has-child(x) 9: c ← v.child(x) 10: v ← c 11: pop(S) 12: if isEmpty(S) 13: break 14: end if 15: x ← peek(S) 16: end while 17: C.append(v.code()) 18: if ¬isEmpty(S) 19: v.add-child(x, i) 20: i ← i + 1 21: end if 22: end while 23: return C 24: end procedure

84 Descriptio 5.4: Lempel-Ziv-Welch on ANANASANNA produces 65, 78, 128, 65, 83, 128, 129. The grey part of each pattern represents a previously seen substring in the trie. Encoding Substring 65 78 128 65 83 128 129

A

128 N

129 A N 130 A

131 130 S A 128

132 N N A 65 133 S 131 N

133 A N N 78 A 129 A S Time 83 A 132

5.8.1 Decoding Lempel-Ziv-Welch The decoding algorithm uses the same idea. Notice that

• Each string is stored as code of prefix plus one more character at the end. In the example for ANANASANNA, 130 is mapped to (128, A) • When we decode to a nonexistent code 133 in the dictionary, we have to lookback Below is an example on 67, 65, 78, 95, 66, 129, 133, and the encoding operation which produced this string. Notice that the blue A is simultaneously being added as a new leaf into the encoding trie, and being encoded in the leaf. Input Decode Code String String (readable) 67 C 65 A 128 (67, A) CA 78 N 129 (65, N) AN 95 130 (78, ) N 66 B 131 (95, B) B 129 AN 132 (66, A) BA 133 ANA 133 (129, A) ABA


Notice that the last character of entry 133 is determined by its first character. This is the special case in LZW-Decode.

procedure LZW-Decode(C)
    D ← trie-ascii()                    ▷ Build a trie with all ASCII characters as leaves
    i ← |D|                             ▷ Next id is 128
    c ← C[0]                            ▷ Current code
    s ← D[c]
    S ← (s)                             ▷ Result
    while ¬isEmpty(C)
        sprev ← s
        c ← next(C)
        if c ≠ i
            s ← D[c]                    ▷ Decode using the regular method
        else                            ▷ c = i is the special case
            s ← sprev + sprev[0]
        end if
        S.append(s)
        D.insert(i, sprev + s[0])
        i ← i + 1
    end while
    return S
end procedure

Runtime:
• Encode: O(|S|).
• Decode: O(1) dictionary work per code; writing out the decoded substrings costs O(length of output) in total.
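A matching Python sketch of the decoder (my own illustration), again using a plain dict instead of the trie; it reproduces both examples from this section.

def lzw_decode(codes):
    d = {i: chr(i) for i in range(128)}   # initial ASCII dictionary
    nxt = 128
    s = d[codes[0]]
    out = [s]
    for c in codes[1:]:
        prev = s
        if c in d:
            s = d[c]                      # regular case
        elif c == nxt:
            s = prev + prev[0]            # special case: code not in the dictionary yet
        else:
            raise ValueError("invalid code")
        out.append(s)
        d[nxt] = prev + s[0]              # new entry: previous string + first char of current
        nxt += 1
    return "".join(out)

print(lzw_decode([67, 65, 78, 95, 66, 129, 133]))   # CAN_BANANA
print(lzw_decode([65, 78, 128, 65, 83, 128, 129]))  # ANANASANNA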

5.9 BZip2

This algorithm achieves a higher compression ratio than the previous ones. It chains four compression passes.

Stages:

1. Burrows-Wheeler Transform: converts repeated long substrings into long runs of characters.
2. Move-to-front Transform: converts long runs of characters into long runs of 0's.
3. Modified RLE: compresses long runs of 0's. See Wikipedia for details.
4. Huffman encoding.

BZip2 tends to be slower than other methods, but gives better compression.

5.9.1 Burrows-Wheeler Transform

• If T has repeated substrings, this transform generates long sequences of identical characters.
• Idea: permute the characters such that there are a lot of repeating characters.

• The coded text must end with an end-of-word character $ which occurs nowhere else. In comparisons, $ is less than all other characters.

Definition

A cyclic shift of X[0 . . . n − 1] is the concatenation of X[i + 1 . . . n − 1] and X[0 . . . i], for some 0 ≤ i ≤ n − 1.

BWT-Encode:

• List all cyclic shifts. (In real implementations, the kth character of the ith cyclic shift can be obtained directly from S by querying S[(i + k) mod n], so storing i suffices.)

• Sort the cyclic shifts using MSD radix sort, starting from the first character.

• Extract the last column.

This pass generates runs of consecutive identical characters: if ab is a substring that occurs many times, there will be many cyclic shifts of the form b . . . a, which end up adjacent after sorting and contribute consecutive a's. The runtime of BWT-Encode is O(n²). If we notice that sorting the cyclic shifts of S is equivalent to sorting the suffixes of S + S of length > n, it can be done in O(n) using a suffix tree of S + S. The encoding uses O(n) space.

procedure BWT-Encode(S)
    A ← suffix-array(S)
    for i ∈ (0, . . . , |S| − 1)
        if A[i] = 0
            C.append($)
        else
            C.append(S[A[i] − 1])
        end if
    end for
    return C
end procedure
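For intuition, here is a naive Python sketch that sorts the cyclic shifts directly (my own illustration; it does not use a suffix array, so it is slower than the version above).

def bwt_encode(s):
    # Sort all cyclic shifts (identified by their start index) and take the last column.
    assert s.endswith("$")
    n = len(s)
    shifts = sorted(range(n), key=lambda i: s[i:] + s[:i])
    return "".join(s[(i - 1) % n] for i in shifts)   # last char of the shift starting at i is s[i-1]

print(bwt_encode("abracadabra$"))   # ard$rcaaaabb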

Descriptio 5.5: Burrows-Wheeler Transform on alf eats alfalfa$: all cyclic shifts of the text are listed, sorted, and the last column of the sorted list is output. (The matrix of cyclic shifts is not reproduced here.)

Descriptio 5.6: Inverse Burrows-Wheeler Transform from ard$rcaaaabb (the BWT of abracadabra$). Each character of the coded text is paired with its index; sorting these pairs gives the first column, and the input text is recovered by tracing from the node containing $ (drawn in red in the original figure).

It is not apparent how such a construction allows the original string to be recovered; indeed, the first column alone does not contain enough information. Given the last column, we can sort it to obtain the first column. Meanwhile, a row index is attached to each character to disambiguate repeated characters. This allows complete disambiguation, and tracing through the row indices recovers the input. Conceptually, we reconstruct T via the matrix of cyclic shifts: from the encoding stage, C is the last column of the matrix, and since C is a permutation of T we can recover the first column by sorting C. The starting point is the row which ends with $.

procedure BWT-Decode(C[n])
    A ← array(n)
    for i ∈ {0, . . . , n − 1}
        A[i] ← (C[i], i)
    end for
    stable-sort(A)                      ▷ Stable sort A by first entry
    for j ∈ {0, . . . , n − 1}          ▷ Find $
        if C[j] = $
            break
        end if
    end for
    S ← ()
    do
        j ← A[j]₂                       ▷ The index stored in A[j]
        S.append(C[j])
    while C[j] ≠ $
    return S
end procedure

This takes O(n) time (characters can be stable-sorted with a counting sort). The Burrows-Wheeler Transform is equivalent to computing the suffix array, and one can be converted into the other by an O(n)-time algorithm.

procedure SuffixArray-from-BWT(C[n])    ▷ C[n] is the BWT of some text
    C′ ← array(n)
    for i ∈ (0, . . . , n − 1)
        C′[i] ← (C[i], i)
    end for
    A ← stable-sort(C′)                 ▷ Sort C′ by first character
    c ← 0
    i ← 0
    while c < n
        k ← A[i]₂                       ▷ The index associated with A[i]
        C[k] ← (C[k], c)
        i ← k
        c ← c + 1
    end while
    S ← array(n)                        ▷ Compute the suffix array
    for i ∈ (0, . . . , n − 1)
        S[i] ← C[i]₂
    end for
    return S
end procedure
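A direct Python transcription of BWT-Decode (my own sketch); Python's sorted() is stable, which is what the algorithm relies on.

def bwt_decode(c):
    # Stable-sort (character, index) pairs to recover the first column,
    # then trace the cycle starting from the row that ends with $.
    n = len(c)
    a = sorted(range(n), key=lambda i: c[i])   # stable sort of the indices by character
    j = c.index("$")
    out = []
    while True:
        j = a[j]
        out.append(c[j])
        if c[j] == "$":
            break
    return "".join(out)

print(bwt_decode("ard$rcaaaabb"))   # abracadabra$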

5.9.2 Move-to-front Transform

The encoder and decoder agree on an initial alphabet Σ. The characters of Σ are listed in an array L. Notice that encoding and decoding are completely symmetric.

procedure MTF-Encode(S, ΣS)
    L ← array(ΣS)
    C ← ()
    for x ∈ S
        i ← L⁻¹[x]                      ▷ Position of x in L
        C.append(i)
        for j ∈ {i − 1, . . . , 0}      ▷ Move x to the front of L
            L[j] ↔ L[j + 1]
        end for
    end for
    return C
end procedure

procedure MTF-Decode(C, ΣS)
    L ← array(ΣS)
    S ← ()
    for i ∈ C
        S.append(L[i])
        for j ∈ {i − 1, . . . , 0}      ▷ Move the decoded character to the front of L
            L[j] ↔ L[j + 1]
        end for
    end for
    return S
end procedure
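A Python sketch of both transforms (my own illustration); it replaces the swap loop by a pop/insert pair, which has the same effect of moving the accessed character to the front.

def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for x in s:
        i = L.index(x)        # position of x in L
        out.append(i)
        L.pop(i)              # move x to the front
        L.insert(0, x)
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        x = L[i]
        out.append(x)
        L.pop(i)              # same move-to-front update as the encoder
        L.insert(0, x)
    return "".join(out)

c = mtf_encode("aaabbbba", "abc")
print(c)                      # [0, 0, 0, 1, 0, 0, 0, 1]
print(mtf_decode(c, "abc"))   # aaabbbba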

5.10 Arithmetic Compression

This unorthodox algorithm encodes an entire string as a real interval. Using the frequency table {$ ↦ 1, b ↦ 1, o ↦ 1, a ↦ 3, n ↦ 3}, we can subdivide the unit interval into sub-intervals that are proportional to the frequencies:

a (0.33) | n (0.33) | b (0.11) | o (0.11) | $ (0.11)

Then the interval corresponding to each character is chosen and the process is repeated:

(Figure: the chosen interval is subdivided again in the same proportions at each step; the nested intervals shown in the original figure trace out the encoding of the string no$.)

The algorithm is:

procedure Arithmetic-Encode(S)
    F ← frequencies(S)                  ▷ Generate a map from the alphabet to frequencies
    I ← [0, 1]
    for x ∈ S
        I ← subdivide(I, F)             ▷ Compute subintervals of I using the frequencies
        I ← I_x                         ▷ Subinterval that corresponds to the current character
    end for
    C ← a number in I that needs the fewest bits
    return C
end procedure

procedure Arithmetic-Decode(F, C)       ▷ F is the frequency array
    S ← ()
    I ← [0, 1]
    do
        I ← subdivide(I, F)
        x ← the character with C ∈ I_x
        S.append(x)
        I ← I_x
    while x ≠ $
    return S
end procedure
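A toy Python sketch using exact rational arithmetic (my own illustration; it returns the midpoint of the final interval rather than the number with the fewest bits, and a practical coder would work with finite precision).

from fractions import Fraction

def subdivide(lo, hi, freq):
    # Split [lo, hi) into one subinterval per character, proportional to its frequency.
    total = sum(freq.values())
    intervals, cur = {}, lo
    for ch, f in freq.items():
        nxt = cur + (hi - lo) * Fraction(f, total)
        intervals[ch] = (cur, nxt)
        cur = nxt
    return intervals

def arithmetic_encode(s, freq):
    lo, hi = Fraction(0), Fraction(1)
    for ch in s:
        lo, hi = subdivide(lo, hi, freq)[ch]
    return (lo + hi) / 2       # any number inside the final interval identifies s

def arithmetic_decode(code, freq):
    lo, hi = Fraction(0), Fraction(1)
    out = []
    while True:
        for ch, (l, h) in subdivide(lo, hi, freq).items():
            if l <= code < h:
                out.append(ch)
                lo, hi = l, h
                break
        if out[-1] == "$":
            return "".join(out)

freq = {"b": 1, "a": 3, "n": 3, "o": 1, "$": 1}
x = arithmetic_encode("no$", freq)
print(arithmetic_decode(x, freq))   # no$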

5.11 Comparison of Compression Algorithms

In the worst case, any compression algorithm has ratio at least 1 due to information theoretic constraints.

                     Huffman                     Run-Length Encoding        Lempel-Ziv-Welch
Length of code       Variable                    Variable                   Fixed
Type                 Single-character            Multi-character            Multi-character
Number of passes     2                           1                          1
Ratio in practice    60%                         Bad on real text           45%
Strengths            Optimal 01-encoding         Good on long runs          Good on text
Auxiliary data       Must send dictionary        None                       None
Drawbacks            –                           Can be worse than ASCII    Can be worse than ASCII
Application          Part of pkzip, JPEG, mp3    Fax machines               GIF, Unix compress

Caput 6

External Memory Model

There are three memory sections in a computer:

• External memory (Internet, disk): size is unbounded.
• Internal memory: size M is several GB.
• The CPU's own memory.

(Diagram: CPU ↔ internal memory is fast; internal memory ↔ external memory is slow and happens in blocks.)

Access to external memory from internal memory is very slow. It happens via block transfers: reads and writes of blocks of size B (usually megabytes). Compared to block transfers, the time taken in internal memory and for computation is generally negligible. Goal: design algorithms and data structures that use very few block transfers between internal and external memory. Note: contents from this section are relevant in CS 348.

6.1 Sorting in External Memory

We are given an array of n numbers in external memory and want to put them in sorted order. The most memory-efficient algorithm is Heapsort, but Heapsort accesses array elements that are far apart and likely requires one block transfer per array access. This is prohibitively expensive. An improvement is to use d-way merge sort.

procedure d-Way-MergeSort(S1, . . . , Sd)
    P ← min-priority-queue
    S ← ()                              ▷ Output sequence
    for i ∈ {1, . . . , d}
        P.insert((Si[0], i))            ▷ Insert a tuple of the first element of Si and i
    end for
    while ¬isEmpty(P)
        (x, i) ← deleteMin(P)
        Si.remove(Si[0])
        S.append(x)
        if ¬isEmpty(Si)
            P.insert((Si[0], i))
        end if
    end while
    return S
end procedure

The total number of block transfers, if each Si has the size of one block, is O(log_d(n) · n/B). Since Ω(n/B) block transfers are required just to scan n elements, one can prove:

6.1.1 Theorem. Any comparison-based sorting algorithm on external memory requires

    Ω((n/B) · log_{M/B}(n/B))

block transfers.

Hence d-Way-MergeSort with d ≈ M/B is optimal up to constant factors.
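A small Python sketch of the merge step using a binary heap as the priority queue (my own illustration; in a genuine external-memory setting each run Si would be read block by block).

import heapq

def d_way_merge(runs):
    # runs: list of d sorted lists; merge them with a min-priority queue of size d.
    out = []
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    while heap:
        x, i, j = heapq.heappop(heap)     # smallest remaining head among all runs
        out.append(x)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(d_way_merge([[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]))   # [1, 2, ..., 10]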

6.2 Dictionary and B-Trees

Note. The insights in this section are related to database implementation. When searching a BST implemented in the “standard” way, potentially log n block transfers are required to reach a leaf node. This is very inefficient. A heuristic to improve this is to store groups of nodes in one block.

If we put b − 1 KVPs into one block, the number of block transfers becomes Θ(log n / log b) = Θ(log_b n). This heuristic converts the binary tree into a tree of blocks.

(Diagram: a root block r with subtrees T0, T1, T2, T3 hanging below it.)

Notice that every key in the leftmost block is smaller than the leftmost key of the root block. This suggests the following structure: select a branching number d. If a node r has keys k1, . . . , kd and subtrees T0, . . . , Td, then T0 < k1 < T1 < k2 < · · · < kd < Td. The value d can vary from node to node.

Definition

An a-b-tree is a tree such that each node has between a and b children (and hence between a − 1 and b − 1 keys), with the exception:

• the number of children of the root can be anywhere in {2, . . . , b}.

All empty subtrees must be on the same level. A tree has order b if every node has at most b children.

Each node in a 2-4-tree thus has 2, 3, or 4 children. Search in an a-b-tree:

• At each node, find the position of k among k1, . . . , kd, which takes O(log b) with binary search.
• If k is not at this node, descend into the child corresponding to that position.

procedure ABTree-Search(k, v ← root(T), p ← empty-subtree())
    if is-empty(v)
        return not found                ▷ k would belong in p
    end if
    (T0, . . . , Td), (k1, . . . , kd) ← keys-subtrees(v)   ▷ Obtain the keys and subtrees at v
    if k < k1
        return ABTree-Search(k, T0, v)
    end if
    i ← max{i : ki ≤ k}
    if ki = k
        return (v, i)                   ▷ Found
    else
        return ABTree-Search(k, Ti, v)
    end if
end procedure

The total runtime of ABTree-Search is O(log b · height). In the external memory model, assuming any node fits into one block, the number of block transfers is O(height).
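A Python sketch of the search (my own illustration) with a node represented as a pair (keys, children); the position scan is linear here, whereas the analysis above assumes binary search within a node.

# A node is a pair (keys, children) with len(children) == len(keys) + 1;
# a leaf has children that are all None.
def abtree_search(node, k):
    if node is None:
        return None                       # not found (k would belong in this empty subtree)
    keys, children = node
    i = 0
    while i < len(keys) and keys[i] <= k: # position of k among the node's keys
        i += 1
    if i > 0 and keys[i - 1] == k:
        return (node, i - 1)              # found at key position i - 1
    return abtree_search(children[i], k)

# The 2-4-tree 5|10|12 with leaves 3|4, 6|8, 11, 13|14|15:
leaf = lambda *ks: (list(ks), [None] * (len(ks) + 1))
root = ([5, 10, 12], [leaf(3, 4), leaf(6, 8), leaf(11), leaf(13, 14, 15)])
found = abtree_search(root, 14)
print(found is not None, found[1])   # True 1: key position 1 of the leaf 13|14|15
print(abtree_search(root, 7))        # None: 7 is not in the tree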

6.2.1 Insertion and Deletion in a-b-Trees

These two operations require a ≤ ⌈b/2⌉. Insertion is similar to a BST, but instead of restructuring downwards, the restructuring moves up the tree.

procedure ABTree-Insert(T, k)
    v ← ABTree-Search(k)                ▷ Find the leaf where k should be
    add(v, k)
    while nKeys(v) ≥ b                  ▷ Keys overflow
        (T0, . . . , Td), (k1, . . . , kd) ← keys-subtrees(v)   ▷ Obtain the keys and subtrees at v
        if ¬hasParent(v)
            create-parent(v)
        end if
        p ← parent(v)
        j ← ⌊d/2⌋ + 1                   ▷ Compute the cut point
        v′ ← node((k1, . . . , k_{j−1}), (T0, . . . , T_{j−1}))
        v″ ← node((k_{j+1}, . . . , kd), (Tj, . . . , Td))
        p.replace((v) ← (v′, kj, v″))   ▷ Replace v in the key–subtree list of p by this composition
        v ← p
    end while
end procedure

(Diagram: an overfull node with keys k1, . . . , k4 and subtrees T0, . . . , T4 is split; the middle key k3 moves up into the parent between the keys k′ and k″, and the remaining keys form two new nodes.)

Example of inserting key 17 into a 2-4-tree: 1. The original tree; the block where 17 will be inserted is highlighted.

5|10|12

3|4 6|8 11 13|14|15

2. After insertion, the node is overfull, since this node will have 5 empty subtree children.

5|10|12

3|4 6|8 11 13|14|15|17

3. Cleave the node locally, splitting half of the keys on the left and half on the right.

15

13|14 17

The subnodes have ≥ a − 1 KVPs as long as a ≤ ⌈b/2⌉.

4. The middle key (15) is transferred upwards.

5|10|12|15

3|4 6|8 11 13|14 17

5. Now the root node has too many children, so it is likewise cleaved.

12

5|10 15

3|4 6|8 11 13|14 17

Notice that all the empty subtrees are still on the same level. Time:

• Search: O(log b · height)
• Splitting: O(height)

Overall, the time is O(log b · height). The number of block transfers is O(height). Finally, we have the deletion operation, which handles underflow instead of overflow.

procedure ABTree-Delete(T, k)
    w ← ABTree-Search(k, T)
    v ← order-neighbour(w, k)           ▷ Leaf node that contains an immediate predecessor/successor k′ of k
    w.replace(k ← k′)
    v.delete(k′)                        ▷ Also delete empty subtrees
    while nKeys(v) = 0                  ▷ Keys underflow
        if v = root(T)                  ▷ Root removal
            T.delete(v)                 ▷ v only has one child; replace v by this child
            break
        end if
        p ← parent(v)
        u ← sibling-of-max-keys(v)      ▷ Sibling of v with the maximum number of keys
        if nKeys(u) < a                 ▷ Merge
            if u = sibling-right(v)
                v′ ← node(v.T0, k, u.T0, u.k1, u.T1)   ▷ k is the key of p between v and u
                p.replace((v, k, u) ← (v′))
                v ← p                   ▷ Repeat this process
            else                        ▷ Symmetric for the left sibling
                . . .
            end if
        else                            ▷ Transfer/rotate
            if u = sibling-right(v)
                p.replace(k ← u.k1)     ▷ Replace the key in p
                v.append(k, u.T0)       ▷ Transfer the leftmost subtree and key of u to v
                u.remove(u.T0, u.k1)
            else                        ▷ Symmetric for the left sibling
                . . .
            end if
        end if
    end while
end procedure

Below is an example in a 2-4-tree.

• (Transfer/Rotate): ABTree-Delete(T, 43)

1.  36
    25  43
    18|21  31  41  51
    12  19  24  28  33  39  42  48  56|62

2.  36
    25  48
    18|21  31  41  51
    12  19  24  28  33  39  42  56|62
    (43 is replaced by its successor 48; the leaf that held 48 is now empty.)

3.  36
    25  48
    18|21  31  41  56
    12  19  24  28  33  39  42  51  62
    (The key 51 rotates down into the empty leaf and 56 moves up from the sibling 56|62.)

• (Merge) ABTree-Delete(T, 21):

1.  36
    25  48
    18|21  31  41  56
    12  19  24  28  33  39  42  51  62

2.  36
    25  48
    18  31  41  56
    12  19|24  28  33  39  42  51  62
    (The deletion empties a leaf under 18|21; since both siblings have the minimum number of keys, the empty leaf is merged with its sibling and the key coming down from the parent, giving 19|24, and the parent shrinks to 18.)

• (Root remove) ABTree-Delete(T, 41):

1.  36
    25  48
    18  31  41  56
    12  19|24  28  33  39  42  51  62

2.  36
    25  48
    18  31  56
    12  19|24  28  33  39|42  51  62
    (The deletion empties a leaf, which is merged with its sibling 39 and the separating key, giving 39|42; the internal node above it becomes empty.)

3.  36
    25
    18  31  48|56
    12  19|24  28  33  39|42  51  62
    (The empty internal node is merged with its sibling 56 and the key 48 from its parent, which in turn becomes empty.)

4.  25|36
    18  31  48|56
    12  19|24  28  33  39|42  51  62
    (The underflow propagates: the empty node merges with 25 and the root key 36. The old root now has a single child and is removed, so 25|36 becomes the new root and the height decreases by one.)

6.2.2 Height of an a-b-Tree

6.2.1 Proposition. The height of an a-b-tree with n KVPs satisfies

    h ≤ log_a((n + 1)/2) ∈ O(log_a n)

Proof. Let N(h) be the minimum number of KVPs in an a-b-tree of height h. The root has at least 2 children, every other internal node has at least a children, and every node below the root contains at least a − 1 keys, so

    N(h) ≥ 1 + Σ_{i=0}^{h−1} 2a^i (a − 1)

Therefore,

    N(h) ≥ 1 + 2(a − 1) · (a^h − 1)/(a − 1) = 2a^h − 1

Solving n ≥ 2a^h − 1 for h gives the required expression.

Definition

A B-tree of order m is a ⌈m/2⌉-m-tree.

The block size B and m should be chosen so that a block of size B holds m − 1 KVPs.

• Search/Insert/Delete visit Θ(height) nodes.
• The height is O(log n / log(m/2)) = O(log n / log m).
• The work at a node can be done in O(log m) time.

• The total cost is O(log_m n) = O(log_B n). This is a huge saving of block transfers.

Some optimisations include pre-emptive splitting/merging, i.e. split any node close to overflow and merge any node close to underflow. There are two issues with a B-tree:

• What if we do not know B? A tree which does not know the size of one block is called cache-oblivious; such a tree can be obtained by building a hierarchy of binary trees: each node v of a binary tree T stores a binary tree T′ of size Θ(√n) that is recursively cache-oblivious. This achieves O(log_B n) block transfers without knowing B.

• What if the values are much bigger than the keys? This can be solved with B+-trees.

6.2.3 B+-tree

The B+-tree is a method to optimise the B-tree when KVPs are large. There are two types of trees:

• Storage-variant trees: every node stores a KVP, e.g. heap, binary search tree.
• Decision-variant trees: all KVPs are at the leaves, e.g. trie, kd-tree.

We can always convert a storage-variant tree into a decision-variant tree.

Definition

A B+-tree is the decision-variant of a B-tree: all KVPs are stored in the leaves, and each key in an internal node is the minimum key of the subtree to its right.

Hence each internal node is a comparison of the form < x?

6.2.2 Lemma. A B+-tree on n KVPs stores < 2n keys.

Proof. The bottom layer stores n keys. The layer above stores ≤ n/a keys, the layer above that ≤ n/a² keys, and so on. The total number of keys is therefore bounded by

    Σ_{i=0}^{∞} n/a^i = n · a/(a − 1) ≤ 2n

The advantage of using a B+-tree is that the internal nodes do not have to store the values alongside the keys. Since the size of an internal node is bounded by the block size, an internal node can hold many more keys than a leaf node. This reduces the tree's height.

Descriptio 6.1: A B-tree and an equivalent B+-tree of order 5. KVP nodes are in green, key-only nodes are in blue.

B-tree (levels from top to bottom):
    25
    14|20|26    38|44|50|56
    10|12  16|18  22|24  28|30|32  34|36  40|42  46|48  52|54  58|60

B+-tree (levels from top to bottom):
    34
    16|22|28    40|46|52|58
    10|12|14  16|18|20  22|24|25|26  28|30|32  34|36|38  40|42|44  46|48|50  52|54|56  58|60

6.3 Red-Black Tree

Properties:

• A d-node becomes a black node with d − 1 red children.
• Subtrees of red nodes are empty or have black roots.

Definition

A red-black tree is a tree such that:

• every node is red or black,
• subtrees of red nodes are empty or have black roots,
• all empty subtrees have the same black-depth.

The red-black tree outperforms the AVL tree in practical applications. The black-depth of a node equals its depth in the corresponding 2-4-tree.

Descriptio 6.2: A 2-4-tree and an equivalent red-black tree.

2-4-tree: root 5|10|12 with children 3|4, 6|8, 11, 13|14|15.

Red-black tree (levels from top to bottom):
    10
    5  12
    4  8  11  14
    3  6  13  15
(The nodes 5, 12, 3, 6, 13, 15 are red; the rest are black.)

6.4 Extendible Hashing

Under sufficiently low load factors, operations on a hash table can be done in amortised O(1) time. However, re-hashing is the bottleneck if we build a hash table in external memory: it takes between Θ(n/B) and Θ(n) block transfers, which is prohibitively expensive for real applications. To circumvent this difficulty we can use a trie of blocks:

• Keys are integers in {0, . . . , 2^L − 1}.
• Interpret all integers as bitstrings.
• Build a trie D (the directory) over these bitstrings in internal memory.
• Stop splitting in the trie when the remaining items fit in one block.
• Each leaf of D refers to a block in external memory.
• The non-leaf nodes are stored in internal memory, and the leaves (blocks) are in external memory.

Operations:

• ToB-Search: follow the bits of x down the trie until we reach a leaf. Load the block at that leaf and search in it.

• ToB-Insert: search for x and load the block, then insert x. If this exceeds the block capacity, split the trie node and split the block (possibly repeatedly). This usually finishes within 2 to 4 block transfers; the worst case has Θ(log n) trie splits and Θ(log n) block splits.

• ToB-Delete: Search for x and load block. Then mark as deleted (lazy deletion). Alternatively, we can combine underfull blocks.

procedure ExtHash-Insert(D, k)
    s ← binary-string(k)
    l ← Trie-Search(D, s)
    d ← depth(D, l)
    P ← transfer-block(l)               ▷ Load the block that l refers to
    while isFull(P)
        (P0, P1) ← split(P, d + 1)      ▷ Split P according to bit d + 1 of each key
        (l0, l1) ← create-children(l, (P0, P1))   ▷ Create children of l that link to P0, P1
        d ← d + 1
        l ← l_{s[d]}
        P ← P_{s[d]}
    end while
    insert(P, s)
end procedure

Descriptio 6.3: A trie of blocks. The directory (a binary trie kept in internal memory) has leaves that refer to the external-memory blocks 00∗, 010∗, 011∗, and 1∗.

• Usually O(1) block transfers.
• We never do a full re-hash.
• Assumption: the trie fits into internal memory.

• There may be empty or underfilled blocks.
• The directory is much smaller than the total number of stored keys and usually fits into internal memory. If it does not, a B-tree could be used.
• Space usage is not too inefficient: under a uniform-distribution assumption, each block is expected to be 69% full.

A slight optimisation to save memory is to expand the trie until all leaves have the same depth, and then store only the leaf level as an array indexed by the leading bits of the key:

Leading bits    000    001    010     011     100    101    110    111
Block           00∗    00∗    010∗    011∗    1∗     1∗     1∗     1∗
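A tiny Python sketch of this array-based directory (my own illustration; the block labels come from the example above, and a real implementation would store block references rather than strings).

# Directory stored as an array indexed by the first d bits of the key (here d = 3).
# Several directory entries may share one external-memory block.
directory = ["00*", "00*", "010*", "011*", "1*", "1*", "1*", "1*"]

def block_for(key, L=8, d=3):
    # Use the d leading bits of the L-bit key as the directory index.
    index = key >> (L - d)
    return directory[index]

print(block_for(0b01100000))   # 011*
print(block_for(0b11010101))   # 1*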

Additamentum A

Tables

A.1 List of common recurrences

T(n)                                    Θ              Example
T(n) = T(n/2) + Θ(1)                    log n          Binary search
T(n) = 2T(n/2) + Θ(n)                   n log n        Mergesort
T(n) = 2T(n/2) + Θ(log n)               n              Heapify
T(n) = T(cn) + Θ(n), (0 < c < 1)        n              Selection
T(n) = 2T(n/4) + Θ(1)                   √n             Range search
T(n) = T(√n) + Θ(1)                     log log n      Interpolation search

Notice that O(n + k log n) ⊆ O(n + k log k) where k ∈ {0, . . . , n − 1}, since

1. if k ≥ √n, then n + k log n ≤ n + k log k² = n + 2k log k ∈ O(n + k log k);
2. if k ≤ √n, then n + k log n ≤ n + √n log n ∈ O(n).

A.1.1 Comparison of Hashings

α is the load factor. Below is a table of average costs, assuming uniform hashing.

T^avg        Search (fail)      Insert            Search (success)
Linear       1/(1 − α)²         1/(1 − α)²        1/(1 − α)
Double       1/(1 − α)          1/(1 − α)         (1/α) log(1/(1 − α))
Cuckoo       1 (worst case)     α/(1 − 2α)²       1 (worst case)

If we can keep the load factor low, Cuckoo hashing has amortised O(1) insertion.

A.1.2 Comparison of Sorting Algorithms

Algorithm        T^worst (T^avg)      Space
HeapSort         n log n              1
MergeSort        n log n              n
SelectionSort    n²                   1
InsertionSort    n²                   1
QuickSort        n² (n log n)         log n

A.1.3 Comparison of Pattern Matching Algorithms

            Brute-Force    KR                 BM             DFA         KMP     Suffix Tree
Preproc.    –              O(m)               O(m + |Σ|)     O(m|Σ|)     O(m)    O(n²) / O(n)
Search      O(nm)          O(n + m) (exp.)    O(n)           O(n)        O(n)    O(m)
Space       –              O(1)               O(m + |Σ|)     O(m|Σ|)     O(m)    O(n)

A.1.4 Comparison of Compression Algorithms

In the worst case, any compression algorithm has ratio at least 1 due to information-theoretic constraints.

                     Huffman                     Run-Length Encoding        Lempel-Ziv-Welch
Length of code       Variable                    Variable                   Fixed
Type                 Single-character            Multi-character            Multi-character
Number of passes     2                           1                          1
Ratio in practice    60%                         Bad on real text           45%
Strengths            Optimal 01-encoding         Good on long runs          Good on text
Auxiliary data       Must send dictionary        None                       None
Drawbacks            –                           Can be worse than ASCII    Can be worse than ASCII
Application          Part of pkzip, JPEG, mp3    Fax machines               GIF, Unix compress

A.2 Information Theory

Although knowledge of information theory is not required in this course, some proofs are made succinct using information-theoretic terms.

Definition

Let X be a discrete random variable with probabilities {p1, . . . , pn}. The entropy of X is

    H(X) := E[− log P(X)] = − Σ_{i=1}^{n} p_i log p_i

N.B. This definition does not extend analogously to continuous random variables (the analogue is the differential entropy). H(X) measures the amount of information revealed by a realisation of X. For example, if X is a fair coin flip, then X contains 1 bit of entropy. If X has n outcomes, each with uniform probability 1/n, the entropy of X is

    H(X) = − Σ_{i=1}^{n} (1/n) log(1/n) = log n

For example, a uniformly random permutation of n distinct elements contains

H(X) = log(n!) ∈ Θ(n log n) bits of entropy.

A.2.1 Fundamental Properties of Information. H has the following properties:

1. H is decreasing w.r.t. the individual probabilities pi.
2. H ≥ 0.
3. H(1) = 0 (an event with only one outcome provides no information).
4. If X, Y are independent, then H(XY) = H(X) + H(Y).
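A quick numeric check of these formulas (my own sketch):

from math import log2, factorial

def entropy(p):
    # H(X) = -sum p_i log2 p_i, ignoring zero-probability outcomes
    return -sum(q * log2(q) for q in p if q > 0)

print(entropy([0.5, 0.5]))       # 1.0 bit: a fair coin flip
print(entropy([1.0]))            # 0.0: a single-outcome event carries no information
n = 8
print(entropy([1 / n] * n))      # log2(8) = 3.0 bits for a uniform 8-way choice
print(log2(factorial(n)))        # ~15.3 bits for a random permutation of 8 elements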

A.3 Course summary

1. How to re-organise data
   • Sorting algorithms
   • ADT Priority Queue, finding the maximum, selection
   • Lower bounds for a problem, decision trees

2. How to manipulate structured data (KVPs)
   • Balanced trees, hashing, tries
   • Special keys: words, integers, points
   • Special situations: biased search requests, range queries

3. How to manipulate unstructured data
   • Searching
   • Compression

4. General-purpose algorithmic design techniques:
   • Randomisation: shift average-case requirements from the instance to the oracle
   • Pre-processing: initial work pays off in faster queries
   • External memory: huge data
