Fundamental Data Structures

Contents

1 Introduction
  1.1 Abstract data type: Examples; Introduction; Defining an abstract data type; Advantages of abstract data typing; Typical operations; Examples; Implementation; See also; Notes; References; Further reading; External links
  1.2 Data structure: Overview; Examples; Language support; See also; References; Further reading; External links
  1.3 Analysis of algorithms: Cost models; Run-time analysis; Relevance; Constant factors; See also; Notes; References
  1.4 Amortized analysis: History; Method; Examples; Common use; References
  1.5 Accounting method: The method; Examples; References
  1.6 Potential method: Definition of amortized time; Relation between amortized and actual time; Amortized analysis of worst-case inputs; Examples; Applications; References

2 Sequences
  2.1 Array data type: History; Abstract arrays; Implementations; Language support; See also; References; External links
  2.2 Array data structure: History; Applications; Element identifier and addressing formulas; Efficiency; Dimension; See also; References
  2.3 Dynamic array: Bounded-size dynamic arrays and capacity; Geometric expansion and amortized cost; Growth factor; Performance; Variants; Language support; References; External links
  2.4 Linked list: Advantages; Disadvantages; History; Basic concepts and nomenclature; Tradeoffs; Linked list operations; Linked lists using arrays of nodes; Language support; Internal and external storage; Related data structures; Notes; Footnotes; References; External links
  2.5 Doubly linked list: Nomenclature and implementation; Basic algorithms; Advanced concepts; See also; References
  2.6 Stack (abstract data type): History; Non-essential operations; Software stacks; Hardware stacks; Applications; Security; See also; References; Further reading; External links
  2.7 Queue (abstract data type): Queue implementation; Purely functional implementation; See also; References; External links
  2.8 Double-ended queue: Naming conventions; Distinctions and sub-types; Operations; Implementations; Language support; Complexity; Applications; See also; References; External links
  2.9 Circular buffer: Uses; How it works; Circular buffer mechanics; Optimization; Fixed-length-element and contiguous-block circular buffer; External links

3 Dictionaries
  3.1 Associative array: Operations; Example; Implementation; Language support; Permanent storage; See also; References; External links
  3.2 Association list: Operation; Performance; Applications and software libraries; See also; References
  3.3 Hash table: Hashing; Key statistics; Collision resolution; Dynamic resizing; Performance analysis; Features; Uses; Implementations; History; See also; References; Further reading; External links
  3.4 Linear probing: Operations; Properties; Analysis; Choice of hash function; History; References
  3.5 Quadratic probing: Quadratic function; Quadratic probing insertion; Quadratic probing search; Limitations; See also; References; External links
  3.6 Double hashing: Classical applied data structure; Implementation details for caching; See also; Notes; External links
  3.7 Cuckoo hashing: History; Operation; Theory; Example; Variations; Comparison with related structures; See also; References; External links
  3.8 Hopscotch hashing: See also; References; External links
  3.9 Hash function: Uses; Properties; Hash function algorithms; Locality-sensitive hashing; Origins of the term; List of hash functions; See also; References; External links
  3.10 Perfect hash function: Application; Construction; Space lower bounds; Extensions; Related constructions; References; Further reading; External links
  3.11 Universal hashing: Introduction; Mathematical guarantees; Constructions; See also; References; Further reading; External links
  3.12 K-independent hashing: Background; Definitions; Techniques; Independence needed by different hashing methods; References; Further reading
  3.13 Tabulation hashing: Method; History; Universality; Application; Extensions; Notes; References
  3.14 Cryptographic hash function: Properties; Illustration; Applications; Hash functions based on block ciphers; Merkle–Damgård construction; Use in building other cryptographic primitives; Concatenation; Cryptographic hash algorithms; See also; References; External links

4 Sets
  4.1 Set (abstract data type): Type theory; Operations; Implementations; Language support; Multiset; See also; Notes; References
  4.2 Bit array: Definition; Basic operations; More complex operations; Compression; Advantages and disadvantages; Applications; Language support; See also; References; External links
  4.3 Bloom filter: Algorithm description; Space and time advantages; Probability of false positives; Approximating the number of items in a Bloom filter; The union and intersection of sets; Interesting properties; Examples; Alternatives; Extensions and applications; See also; Notes; References; External links
  4.4 MinHash: Jaccard similarity and minimum hash values; Algorithm; Min-wise independent permutations; Applications; Other uses; Evaluation and benchmarks; See also; References; External links
  4.5 Disjoint-set data structure: Disjoint-set linked lists; Disjoint-set forests; Applications; History; See also; References; External links
  4.6 Partition refinement: Data structure; Applications; See also; References

5 Priority queues
  5.1 Priority queue: Operations; Similarity to queues; Implementation; Equivalence of priority queues and sorting algorithms; Libraries; Applications; See also; References; Further reading; External links
  5.2 Bucket queue: Basic data structure; Optimizations; Applications; References
  5.3 Heap (data structure): Operations; Implementation; Variants; Comparison of theoretic bounds for variants; Applications; Implementations; See also; References; External links
  5.4 Binary heap: Heap operations; Building a heap; Heap implementation; Derivation of index equations; Related structures; Summary of running times; See also; References; External links
  5.5 d-ary heap: Data structure; Analysis; Applications; References; External links
  5.6 Binomial heap: Binomial heap; Structure of a binomial heap; Implementation; Summary of running times; Applications; See also; References; External links
  5.7 Fibonacci heap: Structure; Implementation of operations; Proof of bounds; Worst case; Summary of running times; Practical considerations; References; External links
  5.8 Pairing heap: Structure; Operations; Summary of running times; References; External links
  5.9 Double-ended priority queue: Operations; Implementation; Applications; See also; References
  5.10 Soft heap: Applications; References

6 Successors and neighbors
  6.1 Binary search algorithm: Algorithm; Performance; Binary search versus other schemes; Variations; History; Implementation issues; Library support; See also; Notes and references; External links
  6.2 Binary search tree: Definition; Operations; Examples of applications; Types; See also; Notes; References; Further reading; External links
  6.3 Random binary tree: Binary trees from random permutations; Uniformly random binary trees; Random split trees; Notes; References; External links
  6.4 Tree rotation: Illustration; Detailed illustration; Inorder invariance; Rotations for rebalancing; Rotation distance; See also; References; External links
  6.5 Self-balancing binary search tree: Overview; Implementations; Applications; See also; References; External links
  6.6 Treap: Description; Operations; Randomized binary search tree; Comparison; See also; References; External links
  6.7 AVL tree: Definition; Operations; Comparison to other structures; See also; References; Further reading; External links
  6.8 Red–black tree: History; Terminology; Properties; Analogy to B-trees of order 4; Applications and related data structures; Operations; Proof of asymptotic bounds; Parallel algorithms; Popular culture; See also; References; Further reading; External links
  6.9 WAVL tree: Definition; Operations; Computational complexity; Related structures; References
  6.10 Scapegoat tree: Theory; Operations; See also; References; External links
  6.11 Splay tree: Advantages; Disadvantages; Operations; Implementation and variants; Analysis; Performance theorems; Dynamic optimality conjecture; Variants; See also; Notes; References; External links
  6.12 Tango tree: Structure; Algorithm; Analysis; See also; References
  6.13 Skip list: Description; History; Usages; See also; References; External links
  6.14 B-tree: Overview; B-tree usage in databases; Technical description; Best case and worst case heights; Algorithms; In filesystems; Variations; See also; Notes; References; External links
  6.15 B+ tree: Overview; Algorithms; Characteristics; Implementation; History; See also; References; External links

7 Integer and string searching
  7.1 Trie: History and etymology; Applications; Algorithms; Implementation strategies; See also; References; External links
  7.2 Radix tree: Applications; Operations; History; Comparison to other data structures; Variants; See also; References; External links
  7.3 Suffix tree: History; Definition; Generalized suffix tree; Functionality; Applications; Implementation; Parallel construction; External construction; See also; Notes; References; External links
  7.4 Suffix array: Definition; Example; Correspondence to suffix trees; Space efficiency; Construction algorithms; Applications; Notes; References; External links
  7.5 Suffix automaton: See also; References; Additional reading
  7.6 Van Emde Boas tree: Supported operations; How it works; References
  7.7 Fusion tree: How it works; Fusion hashing; References; External links

8 Text and image sources, contributors, and licenses: Text; Images; Content license

Chapter 1

Introduction

1.1 Abstract data type

Not to be confused with Algebraic data type.

In computer science, an abstract data type (ADT) is a mathematical model for data types where a data type is defined by its behavior (semantics) from the point of view of a user of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user.

Formally, an ADT may be defined as a "class of objects whose logical behavior is defined by a set of values and a set of operations";[1] this is analogous to an algebraic structure in mathematics. What is meant by "behavior" varies by author, with the two main types of formal specifications for behavior being axiomatic (algebraic) specification and an abstract model;[2] these correspond to axiomatic semantics and operational semantics of an abstract machine, respectively. Some authors also include the computational complexity ("cost"), both in terms of time (for computing operations) and space (for representing values). In practice many common data types are not ADTs, as the abstraction is not perfect, and users must be aware of issues like arithmetic overflow that are due to the representation. For example, integers are often stored as fixed-width values (32-bit or 64-bit binary numbers), and thus experience integer overflow if the maximum is exceeded.

ADTs are a theoretical concept in computer science, used in the design and analysis of algorithms, data structures, and software systems, and do not correspond to specific features of computer languages—mainstream computer languages do not directly support formally specified ADTs. However, various language features correspond to certain aspects of ADTs, and are easily confused with ADTs proper; these include abstract types, opaque data types, protocols, and design by contract. ADTs were first proposed by Barbara Liskov and Stephen N. Zilles in 1974, as part of the development of the CLU language.[3]

1.1.1 Examples

For example, integers are an ADT, defined as the values ..., −2, −1, 0, 1, 2, ..., and by the operations of addition, subtraction, multiplication, and division, together with greater than, less than, etc., which behave according to familiar mathematics (with care for integer division), independently of how the integers are represented by the computer.[lower-alpha 1] Explicitly, "behavior" includes obeying various axioms (associativity and commutativity of addition, etc.) and preconditions on operations (cannot divide by zero). Typically integers are represented in a data structure as binary numbers, most often as two's complement, but might be binary-coded decimal or in ones' complement; the user is abstracted from the concrete choice of representation, and can simply use the data as integers.

An ADT consists not only of operations, but also of values of the underlying data and of constraints on the operations. An "interface" typically refers only to the operations, and perhaps some of the constraints on the operations, notably pre-conditions and post-conditions, but not other constraints, such as relations between the operations.

For example, an abstract stack, which is a last-in-first-out structure, could be defined by three operations: push, that inserts a data item onto the stack; pop, that removes a data item from it; and peek or top, that accesses a data item on top of the stack without removal. An abstract queue, which is a first-in-first-out structure, would also have three operations: enqueue, that inserts a data item into the queue; dequeue, that removes the first data item from it; and front, that accesses and serves the first data item in the queue. There would be no way of differentiating these two data types, unless a mathematical constraint is introduced that for a stack specifies that each pop always returns the most recently pushed item that has not been popped yet. When analyzing the efficiency of algorithms that use stacks, one may also specify that all operations take the same time no matter how many data items have been pushed into the stack, and that the stack uses a constant amount of storage for each element.
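To make the contrast concrete, the following C sketch (not part of the original article) realizes both ADTs with small fixed-capacity arrays; the capacity, the names, and the omission of error handling are illustrative assumptions. The assertions encode the distinguishing constraint just described:

    #include <assert.h>

    // Sketch only: fixed-capacity realizations of the abstract stack and
    // abstract queue described above. Capacity, names and the omission of
    // error handling are illustrative assumptions, not part of the ADTs.
    #define CAP 16

    typedef struct { int data[CAP]; int top; } Stack;           // last-in-first-out
    typedef struct { int data[CAP]; int head, tail; } Queue;    // first-in-first-out

    void push(Stack *s, int x)    { s->data[s->top++] = x; }    // insert at the top
    int  pop(Stack *s)            { return s->data[--s->top]; } // remove from the top

    void enqueue(Queue *q, int x) { q->data[q->tail++] = x; }   // insert at the back
    int  dequeue(Queue *q)        { return q->data[q->head++]; } // remove from the front

    int main(void) {
        Stack s = { .top = 0 };
        Queue q = { .head = 0, .tail = 0 };
        for (int x = 1; x <= 3; x++) { push(&s, x); enqueue(&q, x); }

        // The constraint that tells the two ADTs apart: pop returns the most
        // recently pushed item, dequeue the least recently enqueued one.
        assert(pop(&s) == 3);
        assert(dequeue(&q) == 1);
        return 0;
    }

Any implementation satisfying the stated axioms would pass the same assertions; the arrays here are just one convenient representation.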


1.1.2 Introduction

Abstract data types are purely theoretical entities, used (among other things) to simplify the description of abstract algorithms, to classify and evaluate data structures, and to formally describe the type systems of programming languages. However, an ADT may be implemented by specific data types or data structures, in many ways and in many programming languages; or described in a formal specification language. ADTs are often implemented as modules: the module's interface declares procedures that correspond to the ADT operations, sometimes with comments that describe the constraints. This information-hiding strategy allows the implementation of the module to be changed without disturbing the client programs.

The term abstract data type can also be regarded as a generalised approach of a number of algebraic structures, such as lattices, groups, and rings.[4] The notion of abstract data types is related to the concept of data abstraction, important in object-oriented programming and design by contract methodologies for software development.

1.1.3 Defining an abstract data type

An abstract data type is defined as a mathematical model of the data objects that make up a data type as well as the functions that operate on these objects. There are no standard conventions for defining them. A broad division may be drawn between "imperative" and "functional" definition styles.

Imperative-style definition

In the philosophy of imperative programming languages, an abstract data structure is conceived as an entity that is mutable—meaning that it may be in different states at different times. Some operations may change the state of the ADT; therefore, the order in which operations are evaluated is important, and the same operation on the same entities may have different effects if executed at different times—just like the instructions of a computer, or the commands and procedures of an imperative language. To underscore this view, it is customary to say that the operations are executed or applied, rather than evaluated. The imperative style is often used when describing abstract algorithms. (See The Art of Computer Programming by Donald Knuth for more details.)

Abstract variable

Imperative-style definitions of an ADT often depend on the concept of an abstract variable, which may be regarded as the simplest non-trivial ADT. An abstract variable V is a mutable entity that admits two operations:

• store(V, x), where x is a value of unspecified nature;
• fetch(V), that yields a value,

with the constraint that

• fetch(V) always returns the value x used in the most recent store(V, x) operation on the same variable V.

As in so many programming languages, the operation store(V, x) is often written V ← x (or some similar notation), and fetch(V) is implied whenever a variable V is used in a context where a value is required. Thus, for example, V ← V + 1 is commonly understood to be a shorthand for store(V, fetch(V) + 1).

In this definition, it is implicitly assumed that storing a value into a variable U has no effect on the state of a distinct variable V. To make this assumption explicit, one could add the constraint that

• if U and V are distinct variables, the sequence { store(U, x); store(V, y) } is equivalent to { store(V, y); store(U, x) }.

More generally, ADT definitions often assume that any operation that changes the state of one ADT instance has no effect on the state of any other instance (including other instances of the same ADT)—unless the ADT axioms imply that the two instances are connected (aliased) in that sense. For example, when extending the definition of an abstract variable to include abstract records, the operation that selects a field from a record variable R must yield a variable V that is aliased to that part of R.

The definition of an abstract variable V may also restrict the stored values x to members of a specific set X, called the range or type of V. As in programming languages, such restrictions may simplify the description and analysis of algorithms, and improve their readability.

Note that this definition does not imply anything about the result of evaluating fetch(V) when V is un-initialized, that is, before performing any store operation on V. An algorithm that does so is usually considered invalid, because its effect is not defined. (However, there are some important algorithms whose efficiency strongly depends on the assumption that such a fetch is legal, and returns some arbitrary value in the variable's range.)

Instance creation

Some algorithms need to create new instances of some ADT (such as new variables, or new stacks). To describe such algorithms, one usually includes in the ADT definition a create() operation that yields an instance of the ADT, usually with axioms equivalent to

• the result of create() is distinct from any instance in use by the algorithm.

This axiom may be strengthened to exclude also partial aliasing with other instances. On the other hand, this axiom still allows implementations of create() to yield a previously created instance that has become inaccessible to the program.

Example: abstract stack (imperative)

As another example, an imperative-style definition of an abstract stack could specify that the state of a stack S can be modified only by the operations

• push(S, x), where x is some value of unspecified nature;
• pop(S), that yields a value as a result,

with the constraint that

• For any value x and any abstract variable V, the sequence of operations { push(S, x); V ← pop(S) } is equivalent to V ← x.

Since the assignment V ← x, by definition, cannot change the state of S, this condition implies that V ← pop(S) restores S to the state it had before the push(S, x). From this condition and from the properties of abstract variables, it follows, for example, that the sequence

{ push(S, x); push(S, y); U ← pop(S); push(S, z); V ← pop(S); W ← pop(S) }

where x, y, and z are any values, and U, V, W are pairwise distinct variables, is equivalent to

{ U ← y; V ← z; W ← x }

Here it is implicitly assumed that operations on a stack instance do not modify the state of any other ADT instance, including other stacks; that is,

• For any values x, y, and any distinct stacks S and T, the sequence { push(S, x); push(T, y) } is equivalent to { push(T, y); push(S, x) }.

An abstract stack definition usually includes also a Boolean-valued function empty(S) and a create() operation that returns a stack instance, with axioms equivalent to

• create() ≠ S for any stack S (a newly created stack is distinct from all previous stacks);
• empty(create()) (a newly created stack is empty);
• not empty(push(S, x)) (pushing something into a stack makes it non-empty).

Single-instance style

Sometimes an ADT is defined as if only one instance of it existed during the execution of the algorithm, and all operations were applied to that instance, which is not explicitly notated. For example, the abstract stack above could have been defined with operations push(x) and pop(), that operate on the only existing stack. ADT definitions in this style can be easily rewritten to admit multiple coexisting instances of the ADT, by adding an explicit instance parameter (like S in the previous example) to every operation that uses or modifies the implicit instance.

On the other hand, some ADTs cannot be meaningfully defined without assuming multiple instances. This is the case when a single operation takes two distinct instances of the ADT as parameters. For an example, consider augmenting the definition of the abstract stack with an operation compare(S, T) that checks whether the stacks S and T contain the same items in the same order.

Functional-style definition

Another way to define an ADT, closer to the spirit of functional programming, is to consider each state of the structure as a separate entity. In this view, any operation that modifies the ADT is modeled as a mathematical function that takes the old state as an argument and returns the new state as part of the result. Unlike the imperative operations, these functions have no side effects. Therefore, the order in which they are evaluated is immaterial, and the same operation applied to the same arguments (including the same input states) will always return the same results (and output states).

In the functional view, in particular, there is no way (or need) to define an "abstract variable" with the semantics of imperative variables (namely, with fetch and store operations). Instead of storing values into variables, one passes them as arguments to functions.

Example: abstract stack (functional)

For example, a complete functional-style definition of an abstract stack could use the three operations:

• push: takes a stack state and an arbitrary value, returns a stack state;
• top: takes a stack state, returns a value;
• pop: takes a stack state, returns a stack state.

In a functional-style definition there is no need for a create operation. Indeed, there is no notion of "stack instance". The stack states can be thought of as potential states of a single stack structure, and two stack states that contain the same values in the same order are considered to be identical states. This view actually mirrors the behavior of some concrete implementations, such as linked lists with hash consing.
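To make that remark concrete, here is a minimal C sketch (not from the original article) in which each stack state is an immutable, structure-sharing linked list and NULL plays the role of the empty state; all names are illustrative assumptions:

    #include <stdio.h>
    #include <stdlib.h>

    // Sketch only: stack states as immutable, structure-sharing linked lists.
    // NULL stands in for the empty state (the text's Lambda); all names are
    // illustrative. Allocations are never freed, which anticipates the
    // garbage-collection caveat raised for this style later in the article.
    typedef struct node { int item; const struct node *rest; } node;
    typedef const node *state; // a stack state

    state push(state s, int x) {
        node *n = malloc(sizeof *n);
        n->item = x;
        n->rest = s;
        return n;              // a new state; s itself is left unchanged
    }
    state pop(state s) { return s->rest; } // undefined for the empty state
    int   top(state s) { return s->item; } // undefined for the empty state

    int main(void) {
        state s1 = push(NULL, 1);      // push onto the empty state
        state s2 = push(s1, 2);
        printf("%d\n", top(s2));       // 2
        printf("%d\n", top(pop(s2)));  // 1: pop(s2) does not modify s2
        printf("%d\n", pop(s2) == s1); // 1: states are shared between values
        return 0;
    }

Because push never mutates its argument, pop(push(s, x)) can simply return the old state s, and distinct states share their common tails; hash consing goes further by ensuring that equal states are represented by the same pointer.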

Instead of create(), a functional-style definition of an abstract stack may assume the existence of a special stack state, the empty stack, designated by a special symbol like Λ or "()"; or define a bottom() operation that takes no arguments and returns this special stack state. Note that the axioms imply that

• push(Λ, x) ≠ Λ.

In a functional-style definition of a stack one does not need an empty predicate: instead, one can test whether a stack is empty by testing whether it is equal to Λ.

Note that these axioms do not define the effect of top(s) or pop(s), unless s is a stack state returned by a push. Since push leaves the stack non-empty, those two operations are undefined (hence invalid) when s = Λ. On the other hand, the axioms (and the lack of side effects) imply that push(s, x) = push(t, y) if and only if x = y and s = t.

As in some other branches of mathematics, it is customary to assume also that the stack states are only those whose existence can be proved from the axioms in a finite number of steps. In the abstract stack example above, this rule means that every stack is a finite sequence of values, that becomes the empty stack (Λ) after a finite number of pops. By themselves, the axioms above do not exclude the existence of infinite stacks (that can be popped forever, each time yielding a different state) or circular stacks (that return to the same state after a finite number of pops). In particular, they do not exclude states s such that pop(s) = s or push(s, x) = s for some x. However, since one cannot obtain such stack states with the given operations, they are assumed "not to exist".

Whether to include complexity

Aside from the behavior in terms of axioms, it is also possible to include, in the definition of an ADT operation, their algorithmic complexity. Alexander Stepanov, designer of the C++ Standard Template Library, included complexity guarantees in the STL specification, arguing:

    The reason for introducing the notion of abstract data types was to allow interchangeable software modules. You cannot have interchangeable modules unless these modules share similar complexity behavior. If I replace one module with another module with the same functional behavior but with different complexity tradeoffs, the user of this code will be unpleasantly surprised. I could tell him anything I like about data abstraction, and he still would not want to use the code. Complexity assertions have to be part of the interface.
    — Alexander Stepanov[5]

1.1.4 Advantages of abstract data typing

Encapsulation

Abstraction provides a promise that any implementation of the ADT has certain properties and abilities; knowing these is all that is required to make use of an ADT object. The user does not need any technical knowledge of how the implementation works to use the ADT. In this way, the implementation may be complex but will be encapsulated in a simple interface when it is actually used.

Localization of change

Code that uses an ADT object will not need to be edited if the implementation of the ADT is changed. Since any changes to the implementation must still comply with the interface, and since code using an ADT object may only refer to properties and abilities specified in the interface, changes may be made to the implementation without requiring any changes in code where the ADT is used.

Flexibility

Different implementations of the ADT, having all the same properties and abilities, are equivalent and may be used somewhat interchangeably in code that uses the ADT. This gives a great deal of flexibility when using ADT objects in different situations. For example, different implementations of the ADT may be more efficient in different situations; it is possible to use each in the situation where they are preferable, thus increasing overall efficiency.

1.1.5 Typical operations

Some operations that are often specified for ADTs (possibly under other names) are

• compare(s, t), that tests whether two instances' states are equivalent in some sense;
• hash(s), that computes some standard hash function from the instance's state;
• print(s) or show(s), that produces a human-readable representation of the instance's state.

In imperative-style ADT definitions, one often finds also

• create(), that yields a new instance of the ADT;
• initialize(s), that prepares a newly created instance s for further operations, or resets it to some "initial state";
• copy(s, t), that puts instance s in a state equivalent to that of t;

• clone(t), that performs s ← create(), copy(s, t), and returns s;
• free(s) or destroy(s), that reclaims the memory and other resources used by s.

The free operation is not normally relevant or meaningful, since ADTs are theoretical entities that do not "use memory". However, it may be necessary when one needs to analyze the storage used by an algorithm that uses the ADT. In that case one needs additional axioms that specify how much memory each ADT instance uses, as a function of its state, and how much of it is returned to the pool by free.

1.1.6 Examples

Some common ADTs, which have proved useful in a great variety of applications, are

• Container
• List
• Set
• Multiset
• Map
• Multimap
• Graph
• Stack
• Queue
• Priority queue
• Double-ended queue
• Double-ended priority queue

Each of these ADTs may be defined in many ways and variants, not necessarily equivalent. For example, an abstract stack may or may not have a count operation that tells how many items have been pushed and not yet popped. This choice makes a difference not only for its clients but also for the implementation.

Abstract graphical data type

An extension of ADT for computer graphics was proposed in 1979:[6] an abstract graphical data type (AGDT). It was introduced by Nadia Magnenat Thalmann and Daniel Thalmann. AGDTs provide the advantages of ADTs with facilities to build graphical objects in a structured way.

1.1.7 Implementation

Further information: Opaque data type

Implementing an ADT means providing one procedure or function for each abstract operation. The ADT instances are represented by some concrete data structure that is manipulated by those procedures, according to the ADT's specifications.

Usually there are many ways to implement the same ADT, using several different concrete data structures. Thus, for example, an abstract stack can be implemented by a linked list or by an array. In order to prevent clients from depending on the implementation, an ADT is often packaged as an opaque data type in one or more modules, whose interface contains only the signature (number and types of the parameters and results) of the operations. The implementation of the module—namely, the bodies of the procedures and the concrete data structure used—can then be hidden from most clients of the module. This makes it possible to change the implementation without affecting the clients. If the implementation is exposed, it is known instead as a transparent data type.

When implementing an ADT, each instance (in imperative-style definitions) or each state (in functional-style definitions) is usually represented by a handle of some sort.[7]

Modern object-oriented languages, such as C++ and Java, support a form of abstract data types. When a class is used as a type, it is an abstract type that refers to a hidden representation. In this model an ADT is typically implemented as a class, and each instance of the ADT is usually an object of that class. The module's interface typically declares the constructors as ordinary procedures, and most of the other ADT operations as methods of that class. However, such an approach does not easily encapsulate multiple representational variants found in an ADT. It also can undermine the extensibility of object-oriented programs. In a pure object-oriented program that uses interfaces as types, types refer to behaviors, not representations.

Example: implementation of the abstract stack

As an example, here is an implementation of the abstract stack above in the C programming language.

Imperative-style interface

An imperative-style interface might be:

    #include <stdbool.h>                         // for bool

    typedef struct stack_Rep stack_Rep;          // type: stack instance representation (opaque record)
    typedef stack_Rep* stack_T;                  // type: handle to a stack instance (opaque pointer)
    typedef void* stack_Item;                    // type: value stored in stack instance (arbitrary address)

    stack_T stack_create(void);                  // creates a new empty stack instance
    void stack_push(stack_T s, stack_Item x);    // adds an item at the top of the stack
    stack_Item stack_pop(stack_T s);             // removes the top item from the stack and returns it
    bool stack_empty(stack_T s);                 // checks whether stack is empty

This interface could be used in the following manner:

    #include <stack.h>              // includes the stack interface

    stack_T s = stack_create();     // creates a new empty stack instance
    int x = 17;
    stack_push(s, &x);              // adds the address of x at the top of the stack
    void* y = stack_pop(s);         // removes the address of x from the stack and returns it
    if (stack_empty(s)) { }         // does something if stack is empty

This interface can be implemented in many ways. The implementation may be arbitrarily inefficient, since the formal definition of the ADT, above, does not specify how much space the stack may use, nor how long each operation should take. It also does not specify whether the stack state s continues to exist after a call x ← pop(s). In practice the formal definition should specify that the space is proportional to the number of items pushed and not yet popped; and that every one of the operations above must finish in a constant amount of time, independently of that number. To comply with these additional specifications, the implementation could use a linked list, or an array (with dynamic resizing) together with two integers (an item count and the array size).

Functional-style interface

Functional-style ADT definitions are more appropriate for functional programming languages, and vice versa. However, one can provide a functional-style interface even in an imperative language like C. For example:

    typedef struct stack_Rep stack_Rep;          // type: stack state representation (opaque record)
    typedef stack_Rep* stack_T;                  // type: handle to a stack state (opaque pointer)
    typedef void* stack_Item;                    // type: value of a stack state (arbitrary address)

    stack_T stack_empty(void);                   // returns the empty stack state
    stack_T stack_push(stack_T s, stack_Item x); // adds an item at the top of the stack state and returns the resulting stack state
    stack_T stack_pop(stack_T s);                // removes the top item from the stack state and returns the resulting stack state
    stack_Item stack_top(stack_T s);             // returns the top item of the stack state

The main problem is that C lacks garbage collection, and this makes this style of programming impractical; moreover, memory allocation routines in C are slower than allocation in a typical garbage collector, thus the performance impact of so many allocations is even greater.

ADT libraries

Many modern programming languages, such as C++ and Java, come with standard libraries that implement several common ADTs, such as those listed above.

Built-in abstract data types

The specification of some programming languages is intentionally vague about the representation of certain built-in data types, defining only the operations that can be done on them. Therefore, those types can be viewed as "built-in ADTs". Examples are the arrays in many scripting languages, such as Awk, Lua, and Perl, which can be regarded as an implementation of the abstract list.

1.1.8 See also

• Concept (generic programming)
• Formal methods
• Functional specification
• Generalized algebraic data type
• Initial algebra
• Liskov substitution principle
• Type theory
• Walls and Mirrors

1.1.9 Notes

[lower-alpha 1] Compare to the characterization of integers in abstract algebra.

1.1.10 References

[1] Dale & Walker 1996, p. 3.
[2] Dale & Walker 1996, p. 4.
[3] Liskov & Zilles 1974.
[4] Rudolf Lidl (2004). Abstract Algebra. Springer. ISBN 81-8128-149-7. Chapter 7, section 40.
[5] Stevens, Al (March 1995). "Al Stevens Interviews Alex Stepanov". Dr. Dobb's Journal. Retrieved 31 January 2015.
[6] D. Thalmann, N. Magnenat Thalmann (1979). Design and Implementation of Abstract Graphical Data Types (PDF). Proc. 3rd International Computer Software and Applications Conference (COMPSAC'79), IEEE, Chicago, USA, pp. 519–524.
[7] Robert Sedgewick (1998). Algorithms in C. Addison/Wesley. ISBN 0-201-31452-5. Definition 4.4.

• Liskov, Barbara; Zilles, Stephen (1974). "Programming with abstract data types". Proceedings of the ACM SIGPLAN symposium on Very high level languages. pp. 50–59. doi:10.1145/800233.807045.
• Dale, Nell; Walker, Henry M. (1996). Abstract Data Types: Specifications, Implementations, and Applications. Jones & Bartlett Learning. ISBN 978-0-66940000-7.

1.1.11 Further reading

• Mitchell, John C.; Plotkin, Gordon (July 1988). "Abstract Types Have Existential Type" (PDF). ACM Transactions on Programming Languages and Systems. 10 (3). doi:10.1145/44501.45065.

1.1.12 External links

• Abstract data type in NIST Dictionary of Algorithms and Data Structures

1.2 Data structure

[Figure: keys mapped by a hash function into buckets. Caption: A hash table.]

Not to be confused with data type.

In computer science, a data structure is a particular way of organizing data in a computer so that it can be used efficiently.[1][2] Data structures can implement one or more particular abstract data types (ADT), which specify the operations that can be performed on a data structure and the computational complexity of those operations. In comparison, a data structure is a concrete implementation of the specification provided by an ADT.

Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, relational databases commonly use B-tree indexes for data retrieval,[3] while compiler implementations usually use hash tables to look up identifiers.

Data structures provide a means to manage large amounts of data efficiently for uses such as large databases and internet indexing services. Usually, efficient data structures are key to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design. Data structures can be used to organize the storage and retrieval of information stored in both main memory and secondary memory.

1.2.1 Overview

Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by a pointer—a bit string, representing a memory address, that can be itself stored in memory and manipulated by the program. Thus, the array and record data structures are based on computing the addresses of data items with arithmetic operations; while the linked data structures are based on storing addresses of data items within the structure itself. Many data structures use both principles, sometimes combined in non-trivial ways (as in XOR linking).

The implementation of a data structure usually requires writing a set of procedures that create and manipulate instances of that structure. The efficiency of a data structure cannot be analyzed separately from those operations. This observation motivates the theoretical concept of an abstract data type, a data structure that is defined indirectly by the operations that may be performed on it, and the mathematical properties of those operations (including their space and time cost).

1.2.2 Examples

Main article: List of data structures

There are numerous types of data structures, generally built upon simpler primitive data types:

• An array is a number of elements in a specific order, typically all of the same type. Elements are accessed using an integer index to specify which element is required (depending on the language, individual elements may either all be forced to be the same type, or may be of almost any type). Typical implementations allocate contiguous memory words for the elements of arrays (but this is not always a necessity). Arrays may be fixed-length or resizable.
• A linked list (also just called list) is a linear collection of data elements of any type, called nodes, where each node has itself a value, and points to the next node in the linked list. The principal advantage of a linked list over an array is that values can always be efficiently inserted and removed without relocating the rest of the list. Certain other operations, such as random access to a certain element, are however slower on lists than on arrays.
• A record (also called tuple or struct) is an aggregate data structure. A record is a value that contains other values, typically in fixed number and sequence and typically indexed by names. The elements of records are usually called fields or members.
• A union is a data structure that specifies which of a number of permitted primitive types may be stored in its instances, e.g. float or long integer. Contrast with a record, which could be defined to contain a float and an integer; whereas in a union, there is only one value at a time. Enough space is allocated to contain the widest member datatype.
• A tagged union (also called variant, variant record, discriminated union, or disjoint union) contains an additional field indicating its current type, for enhanced type safety.
• A class is a data structure that contains data fields, like a record, as well as various methods which operate on the contents of the record. In the context of object-oriented programming, records are known as plain old data structures to distinguish them from classes.

1.2.3 Language support

Most assembly languages and some low-level languages, such as BCPL (Basic Combined Programming Language), lack built-in support for data structures. On the other hand, many high-level programming languages and some higher-level assembly languages, such as MASM, have special syntax or other built-in support for certain data structures, such as records and arrays. For example, the C and Pascal languages support structs and records, respectively, in addition to vectors (one-dimensional arrays) and multi-dimensional arrays.[4][5]

Most programming languages feature some sort of library mechanism that allows data structure implementations to be reused by different programs. Modern languages usually come with standard libraries that implement the most common data structures. Examples are the C++ Standard Template Library, the Java Collections Framework, and Microsoft's .NET Framework.

Modern languages also generally support modular programming, the separation between the interface of a library module and its implementation. Some provide opaque data types that allow clients to hide implementation details. Object-oriented programming languages, such as C++, Java and Smalltalk may use classes for this purpose.

Many known data structures have concurrent versions that allow multiple computing threads to access the data structure simultaneously.

1.2.4 See also

• Abstract data type
• Concurrent data structure
• Data model
• Dynamization
• Linked data structure
• List of data structures
• Persistent data structure
• Plain old data structure

1.2.5 References

[1] Paul E. Black (ed.), entry for data structure in Dictionary of Algorithms and Data Structures. U.S. National Institute of Standards and Technology. 15 December 2004. Online version accessed May 21, 2009.
[2] Entry data structure in the Encyclopædia Britannica (2009). Online entry accessed on May 21, 2009.
[3] Gavin Powell (2006). "Chapter 8: Building Fast-Performing Database Models". Beginning Database Design. ISBN 978-0-7645-7490-0. Wrox Publishing.
[4] "The GNU C Manual". Free Software Foundation. Retrieved 15 October 2014.
[5] "Free Pascal: Reference Guide". Free Pascal. Retrieved 15 October 2014.

1.2.6 Further reading

• Peter Brass, Advanced Data Structures, Cambridge University Press, 2008.
• Donald Knuth, The Art of Computer Programming, vol. 1. Addison-Wesley, 3rd edition, 1997.
• Dinesh Mehta and Sartaj Sahni, Handbook of Data Structures and Applications, Chapman and Hall/CRC Press, 2007.
• Niklaus Wirth, Algorithms and Data Structures, Prentice Hall, 1985.

1.2.7 External links to estimate the complexity function for arbitrarily large input. , Big-omega notation and Big- • course on data structures theta notation are used to this end. For instance, binary search is said to run in a number of steps proportional • Data structures Programs Examples in c,java to the logarithm of the length of the sorted list being • UC Berkeley video course on data structures searched, or in O(log(n)), colloquially “in logarithmic time". Usually asymptotic estimates are used because • Descriptions from the Dictionary of Algorithms and different implementations of the same algorithm may dif- Data Structures fer in efficiency. However the efficiencies of any two • Data structures course “reasonable” implementations of a given algorithm are related by a constant multiplicative factor called a hidden • An Examination of Data Structures from .NET per- constant. spective Exact (not asymptotic) measures of efficiency can some- • Schaffer, C. Data Structures and Algorithm Analysis times be computed but they usually require certain as- sumptions concerning the particular implementation of the algorithm, called model of computation. A model of 1.3 Analysis of algorithms computation may be defined in terms of an abstract com- puter, e.g., Turing machine, and/or by postulating that certain operations are executed in unit time. For exam- n!2ⁿn² n log₂n n ple, if the sorted list to which we apply binary search has 100 n elements, and we can guarantee that each lookup of an 90 element in the list can be done in unit time, then at most log n + 1 time units are needed to return an answer. 80 2 70 1.3.1 Cost models 60 N 50 Time efficiency estimates depend on what we define to be a step. For the analysis to correspond usefully to the actual 40 execution time, the time required to perform a step must 30 be guaranteed to be bounded above by a constant. One must be careful here; for instance, some analyses count 20 an addition of two numbers as one step. This assumption √n 10 may not be warranted in certain contexts. For example, if 1 log₂n the numbers involved in a computation may be arbitrarily 0 0 10 20 30 40 50 60 70 80 90 100 large, the time required by a single addition can no longer n be assumed to be constant. [2][3][4][5][6] Graphs of number of operations, N vs input size, n for common Two cost models are generally used: complexities, assuming a coefficient of 1 • the uniform cost model, also called uniform-cost In computer science, the analysis of algorithms is the measurement (and similar variations), assigns a determination of the amount of resources (such as time constant cost to every machine operation, regardless and storage) necessary to execute them. Most algorithms of the size of the numbers involved are designed to work with inputs of arbitrary length. Usu- ally, the efficiency or running time of an algorithm is • the logarithmic cost model, also called stated as a function relating the input length to the num- logarithmic-cost measurement (and variations ber of steps (time complexity) or storage locations (space thereof), assigns a cost to every machine operation complexity). 
proportional to the number of involved The term “analysis of algorithms” was coined by Donald Knuth.[1] Algorithm analysis is an important part of a The latter is more cumbersome to use, so it’s only em- broader computational complexity theory, which pro- ployed when necessary, for example in the analysis of vides theoretical estimates for the resources needed by arbitrary-precision arithmetic algorithms, like those used any algorithm which solves a given computational prob- in . lem. These estimates provide an insight into reasonable A key point which is often overlooked is that published directions of search for efficient algorithms. lower bounds for problems are often given for a model of In theoretical analysis of algorithms it is common to computation that is more restricted than the set of oper- estimate their complexity in the asymptotic sense, i.e., ations that you could use in practice and therefore there 10 CHAPTER 1. INTRODUCTION

are algorithms that are faster than what would naively be Informally, an algorithm can be said to exhibit a growth thought possible.[7] rate on the order of a mathematical function if beyond a certain input size n, the function times a positive con- stant provides an upper bound or limit for the run-time 1.3.2 Run-time analysis of that algorithm. In other words, for a given input size n greater than some n0 and a constant c, the running time of Run-time analysis is a theoretical classification that es- that algorithm will never be larger than . This concept is timates and anticipates the increase in running time (or frequently expressed using Big O notation. For example, run-time) of an algorithm as its input size (usually denoted since the run-time of grows quadratically as as n) increases. Run-time efficiency is a topic of great its input size increases, insertion sort can be said to be of 2 interest in computer science:A program can take sec- order O(n ). onds, hours or even years to finish executing, depending Big O notation is a convenient way to express the worst- on which algorithm it implements (see also performance case scenario for a given algorithm, although it can also analysis, which is the analysis of an algorithm’s run-time be used to express the average-case — for example, in practice). the worst-case scenario for is O(n2), but the average-case run-time is O(n log n).

Shortcomings of empirical metrics

Since algorithms are platform-independent (i.e. a given algorithm can be implemented in an arbitrary programming language on an arbitrary computer running an arbitrary operating system), there are significant drawbacks to using an empirical approach to gauge the comparative performance of a given set of algorithms.

Take as an example a program that looks up a specific entry in a sorted list of size n. Suppose this program were implemented on Computer A, a state-of-the-art machine, using a linear search algorithm, and on Computer B, a much slower machine, using a binary search algorithm. Benchmark testing on the two computers running their respective programs might look something like the following:

[Table elided: benchmark run-times for small list sizes]

Based on these metrics, it would be easy to jump to the conclusion that Computer A is running an algorithm that is far superior in efficiency to that of Computer B. However, if the size of the input list is increased to a sufficient number, that conclusion is dramatically demonstrated to be in error:

[Table elided: benchmark run-times for much larger list sizes]

Computer A, running the linear search program, exhibits a linear growth rate. The program's run-time is directly proportional to its input size. Doubling the input size doubles the run-time, quadrupling the input size quadruples the run-time, and so forth. On the other hand, Computer B, running the binary search program, exhibits a logarithmic growth rate. Quadrupling the input size only increases the run-time by a constant amount (in this example, 50,000 ns). Even though Computer A is ostensibly a faster machine, Computer B will inevitably surpass Computer A in run-time because it is running an algorithm with a much slower growth rate.

Orders of growth

Main article: Big O notation

Informally, an algorithm can be said to exhibit a growth rate on the order of a mathematical function f if, beyond a certain input size n, the function times a positive constant provides an upper bound or limit for the run-time of that algorithm. In other words, for a given input size n greater than some n₀ and a constant c, the running time of that algorithm will never be larger than c × f(n). This concept is frequently expressed using Big O notation. For example, since the run-time of insertion sort grows quadratically as its input size increases, insertion sort can be said to be of order O(n²).

Big O notation is a convenient way to express the worst-case scenario for a given algorithm, although it can also be used to express the average-case; for example, the worst-case scenario for quicksort is O(n²), but the average-case run-time is O(n log n).

Empirical orders of growth

Assuming the execution time follows a power rule, t ≈ k·nᵃ, the coefficient a can be found[8] by taking empirical measurements of run time {t1, t2} at some problem-size points {n1, n2}, and calculating t2/t1 = (n2/n1)ᵃ, so that a = log(t2/t1)/log(n2/n1). In other words, this measures the slope of the empirical line on the log–log plot of execution time vs. problem size, at some size point. If the order of growth indeed follows the power rule (and so the line on the log–log plot is indeed a straight line), the empirical value of a will stay constant at different ranges, and if not, it will change (and the line is a curved line), but it could still serve for comparison of any two given algorithms as to their empirical local orders of growth behaviour. Applied to the above table:

[Table elided: empirical local orders of growth for the two programs]

It is clearly seen that the first algorithm exhibits a linear order of growth, indeed following the power rule. The empirical values for the second one are diminishing rapidly, suggesting it follows another rule of growth and in any case has much lower local orders of growth (and improving further still), empirically, than the first one.
Evaluating run-time complexity

The run-time complexity for the worst-case scenario of a given algorithm can sometimes be evaluated by examining the structure of the algorithm and making some simplifying assumptions. Consider the following pseudocode:

1 get a positive integer from input
2 if n > 10
3    print "This might take a while..."
4 for i = 1 to n
5    for j = 1 to i
6       print i * j
7 print "Done!"

A given computer will take a discrete amount of time to execute each of the instructions involved with carrying out this algorithm. The specific amount of time to carry out a given instruction will vary depending on which instruction is being executed and which computer is executing it, but on a conventional computer, this amount will be deterministic.[9]

Say that the actions carried out in step 1 are considered to consume time T1, step 2 uses time T2, and so forth.

In the algorithm above, steps 1, 2 and 7 will only be run once. For a worst-case evaluation, it should be assumed that step 3 will be run as well. Thus the total amount of time to run steps 1–3 and step 7 is:

T1 + T2 + T3 + T7.

The loops in steps 4, 5 and 6 are trickier to evaluate. The outer loop test in step 4 will execute (n + 1) times (note that an extra step is required to terminate the for loop, hence n + 1 and not n executions), which will consume T4(n + 1) time. The inner loop, on the other hand, is governed by the value of i, which iterates from 1 to i. On the first pass through the outer loop, j iterates from 1 to 1: the inner loop makes one pass, so running the inner loop body (step 6) consumes T6 time, and the inner loop test (step 5) consumes 2T5 time. During the next pass through the outer loop, j iterates from 1 to 2: the inner loop makes two passes, so running the inner loop body (step 6) consumes 2T6 time, and the inner loop test (step 5) consumes 3T5 time.

Altogether, the total time required to run the inner loop body can be expressed as an arithmetic progression:

T6 + 2T6 + 3T6 + ··· + (n − 1)T6 + nT6

which can be factored[10] as

T6 [1 + 2 + 3 + ··· + (n − 1) + n] = T6 · ½(n² + n)

The total time required to run the outer loop test can be evaluated similarly:

2T5 + 3T5 + 4T5 + ··· + (n − 1)T5 + nT5 + (n + 1)T5
= T5 + 2T5 + 3T5 + 4T5 + ··· + (n − 1)T5 + nT5 + (n + 1)T5 − T5

which can be factored as

T5 [1 + 2 + 3 + ··· + (n − 1) + n + (n + 1)] − T5
= ½(n² + n) T5 + (n + 1)T5 − T5
= ½(n² + n) T5 + nT5
= ½(n² + 3n) T5

Therefore, the total running time for this algorithm is:

f(n) = T1 + T2 + T3 + T7 + (n + 1)T4 + ½(n² + n) T6 + ½(n² + 3n) T5

which reduces to

f(n) = ½(n² + n) T6 + ½(n² + 3n) T5 + (n + 1)T4 + T1 + T2 + T3 + T7

As a rule-of-thumb, one can assume that the highest-order term in any given function dominates its rate of growth and thus defines its run-time order. In this example, n² is the highest-order term, so one can conclude that f(n) = O(n²). Formally this can be proven as follows:

Prove that ½(n² + n) T6 + ½(n² + 3n) T5 + (n + 1)T4 + T1 + T2 + T3 + T7 ≤ cn², for n ≥ n₀:

½(n² + n) T6 + ½(n² + 3n) T5 + (n + 1)T4 + T1 + T2 + T3 + T7
≤ (n² + n)T6 + (n² + 3n)T5 + (n + 1)T4 + T1 + T2 + T3 + T7   (for n ≥ 0)

Let k be a constant greater than or equal to each of T1, ..., T7. Then

T6(n² + n) + T5(n² + 3n) + (n + 1)T4 + T1 + T2 + T3 + T7 ≤ k(n² + n) + k(n² + 3n) + kn + 5k
= 2kn² + 5kn + 5k ≤ 2kn² + 5kn² + 5kn²   (for n ≥ 1)
= 12kn²

Therefore ½(n² + n) T6 + ½(n² + 3n) T5 + (n + 1)T4 + T1 + T2 + T3 + T7 ≤ cn², for n ≥ n₀, with c = 12k and n₀ = 1.

A more elegant approach to analyzing this algorithm would be to declare that T1, ..., T7 are all equal to one unit of time, in a system of units chosen so that one unit is greater than or equal to the actual times for these steps. This would mean that the algorithm's running time breaks down as follows:[11]

4 + Σ_{i=1..n} i ≤ 4 + Σ_{i=1..n} n = 4 + n² ≤ 5n²   (for n ≥ 1) = O(n²).
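The arithmetic-progression count is easy to confirm empirically; this sketch (an instrumented Python rendering of the pseudocode's loops, with our own names) counts how many times the inner loop body actually runs:

def count_inner_loop_body(n):
    # Mirrors steps 4-6 of the pseudocode, counting executions of step 6.
    executions = 0
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            executions += 1      # the body "print i * j" would run here
    return executions

n = 50
print(count_inner_loop_body(n) == (n * n + n) // 2)  # True: ½(n² + n) runs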

Growth rate analysis of other resources 2T5 + 3T5 + 4T5 + ··· + (n − 1)T5 + nT5 + (n + 1)T5 The methodology of run-time analysis can also be utilized = T5 + 2T5 + 3T5 + 4T5 + ··· + (n − 1)T5 + nT5 + (n + 1)T5 − T5 for predicting other growth rates, such as consumption of which can be factored as memory space. As an example, consider the following pseudocode which manages and reallocates memory us- age by a program based on the size of a file which that T [1 + 2 + 3 + ··· + (n − 1) + n + (n + 1)] − T program manages: [5 ] 5 1 while (file still open) let n = size of file for every 100,000 = (n2 + n) T + (n + 1)T − T 2 5 5 5 kilobytes of increase in file size double the amount of mem- [ ] 1 ory reserved =T (n2 + n) + nT 5 2 5 [ ] In this instance, as the file size n increases, memory will be consumed at an exponential growth rate, which is or- 1 2 = (n + 3n) T n 2 5 der O(2 ). This is an extremely rapid and most likely unmanageable growth rate for consumption of memory Therefore, the total running time for this algorithm is: resources. 12 CHAPTER 1. INTRODUCTION

1.3.3 Relevance

Algorithm analysis is important in practice because the accidental or unintentional use of an inefficient algorithm can significantly impact system performance. In time-sensitive applications, an algorithm taking too long to run can render its results outdated or useless. An inefficient algorithm can also end up requiring an uneconomical amount of computing power or storage in order to run, again rendering it practically useless.

1.3.4 Constant factors

Analysis of algorithms typically focuses on the asymptotic performance, particularly at the elementary level, but in practical applications constant factors are important, and real-world data is in practice always limited in size. The limit is typically the size of addressable memory, so on 32-bit machines 2³² = 4 GiB (greater if segmented memory is used) and on 64-bit machines 2⁶⁴ = 16 EiB. Thus given a limited size, an order of growth (time or space) can be replaced by a constant factor, and in this sense all practical algorithms are O(1) for a large enough constant, or for small enough data.

This interpretation is primarily useful for functions that grow extremely slowly: (binary) iterated logarithm (log*) is less than 5 for all practical data (2⁶⁵⁵³⁶ bits); (binary) log-log (log log n) is less than 6 for virtually all practical data (2⁶⁴ bits); and binary log (log n) is less than 64 for virtually all practical data (2⁶⁴ bits). An algorithm with non-constant complexity may nonetheless be more efficient than an algorithm with constant complexity on practical data if the overhead of the constant-time algorithm results in a larger constant factor; e.g., one may have K > k log log n so long as K/k > 6 and n < 2^(2⁶) = 2⁶⁴.

For large data linear or quadratic factors cannot be ignored, but for small data an asymptotically inefficient algorithm may be more efficient. This is particularly used in hybrid algorithms, like Timsort, which use an asymptotically efficient algorithm (here merge sort, with time complexity n log n), but switch to an asymptotically inefficient algorithm (here insertion sort, with time complexity n²) for small data, as the simpler algorithm is faster on small data.
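A minimal sketch of such a hybrid (the cutoff of 32 and all names are our own choices; Timsort itself is considerably more sophisticated):

def insertion_sort_range(a, lo, hi):
    # In-place insertion sort of a[lo..hi]; O(n²) but fast on short ranges.
    for i in range(lo + 1, hi + 1):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def hybrid_sort(a, lo=0, hi=None, cutoff=32):
    # Merge sort asymptotically, but below the cutoff the simpler
    # insertion sort wins on constant factors.
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= cutoff:
        insertion_sort_range(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_sort(a, lo, mid, cutoff)
    hybrid_sort(a, mid + 1, hi, cutoff)
    merged, i, j = [], lo, mid + 1
    while i <= mid and j <= hi:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged += a[i:mid + 1] + a[j:hi + 1]
    a[lo:hi + 1] = merged

import random
data = [random.randrange(10_000) for _ in range(1_000)]
hybrid_sort(data)
print(data == sorted(data))  # True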

1.3.5 See also

• Amortized analysis
• Analysis of parallel algorithms
• Asymptotic computational complexity
• Best, worst and average case
• Big O notation
• Computational complexity theory
• Master theorem
• NP-Complete
• Numerical analysis
• Polynomial time
• Program optimization
• Profiling (computer programming)
• Scalability
• Smoothed analysis
• Termination analysis — the subproblem of checking whether a program will terminate at all
• Time complexity — includes table of orders of growth for common algorithms

1.3.6 Notes

[1] Donald Knuth, Recent News
[2] Alfred V. Aho; John E. Hopcroft; Jeffrey D. Ullman (1974). The design and analysis of computer algorithms. Addison-Wesley Pub. Co., section 1.3
[3] Juraj Hromkovič (2004). Theoretical computer science: introduction to Automata, computability, complexity, algorithmics, randomization, communication, and cryptography. Springer. pp. 177–178. ISBN 978-3-540-14015-3.
[4] Giorgio Ausiello (1999). Complexity and approximation: combinatorial optimization problems and their approximability properties. Springer. pp. 3–8. ISBN 978-3-540-65431-5.
[5] Wegener, Ingo (2005), Complexity theory: exploring the limits of efficient algorithms, Berlin, New York: Springer-Verlag, p. 20, ISBN 978-3-540-21045-0
[6] Robert Endre Tarjan (1983). Data structures and network algorithms. SIAM. pp. 3–7. ISBN 978-0-89871-187-5.
[7] Examples of the price of abstraction?, cstheory.stackexchange.com
[8] How To Avoid O-Abuse and Bribes, at the blog "Gödel's Lost Letter and P=NP" by R. J. Lipton, professor of Computer Science at Georgia Tech, recounting an idea by Robert Sedgewick
[9] However, this is not the case with a quantum computer
[10] It can be proven by induction that 1 + 2 + 3 + ··· + (n − 1) + n = n(n + 1)/2
[11] This approach, unlike the above approach, neglects the constant time consumed by the loop tests which terminate their respective loops, but it is trivial to prove that such omission does not affect the final result

1.3.7 References

• Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. & Stein, Clifford (2001). Introduction to Algorithms. Chapter 1: Foundations (Second ed.). Cambridge, MA: MIT Press and McGraw-Hill. pp. 3–122. ISBN 0-262-03293-7.
• Sedgewick, Robert (1998). Algorithms in C, Parts 1-4: Fundamentals, Data Structures, Sorting, Searching (3rd ed.). Reading, MA: Addison-Wesley Professional. ISBN 978-0-201-31452-6.
• Knuth, Donald. The Art of Computer Programming. Addison-Wesley.
• Greene, Daniel A.; Knuth, Donald E. (1982). Mathematics for the Analysis of Algorithms (Second ed.). Birkhäuser. ISBN 3-7643-3102-X.
• Goldreich, Oded (2010). Computational Complexity: A Conceptual Perspective. Cambridge University Press. ISBN 978-0-521-88473-0.

1.4 Amortized analysis

"Amortized" redirects here. For other uses, see Amortization.

In computer science, amortized analysis is a method for analyzing a given algorithm's time complexity, or how much of a resource, especially time or memory in the context of computer programs, it takes to execute. The motivation for amortized analysis is that looking at the worst-case run time per operation can be too pessimistic.[1]

While certain operations for a given algorithm may have a significant cost in resources, other operations may not be as costly. Amortized analysis considers both the costly and less costly operations together over the whole series of operations of the algorithm. This may include accounting for different types of input, length of the input, and other factors that affect its performance.[2]

1.4.1 History

Amortized analysis initially emerged from a method called aggregate analysis, which is now subsumed by amortized analysis. However, the technique was first formally introduced by Robert Tarjan in his 1985 paper Amortized Computational Complexity, which addressed the need for a more useful form of analysis than the common probabilistic methods used. Amortization was initially used for very specific types of algorithms, particularly those involving binary trees and union operations. However, it is now ubiquitous and comes into play when analyzing many other algorithms as well.[2]

1.4.2 Method

The method requires knowledge of which series of operations are possible. This is most commonly the case with data structures, which have state that persists between operations. The basic idea is that a worst-case operation can alter the state in such a way that the worst case cannot occur again for a long time, thus "amortizing" its cost.

There are generally three methods for performing amortized analysis: the aggregate method, the accounting method, and the potential method. All of these give the same answers, and their usage difference is primarily circumstantial and due to individual preference.[3]

• Aggregate analysis determines the upper bound T(n) on the total cost of a sequence of n operations, then calculates the amortized cost to be T(n)/n.[3]
• The accounting method determines the individual cost of each operation, combining its immediate execution time and its influence on the running time of future operations. Usually, many short-running operations accumulate a "debt" of unfavorable state in small increments, while rare long-running operations decrease it drastically.[3]
• The potential method is like the accounting method, but overcharges operations early to compensate for undercharges later.[3]

1.4.3 Examples

Dynamic Array

[Figure: Amortized analysis of the Push operation for a dynamic array]

Consider a dynamic array that grows in size as more elements are added to it, such as an ArrayList in Java. If we started out with a dynamic array of size 4, it would take constant time to push four elements onto it. Yet pushing a fifth element onto that array would take longer, as the array would have to create a new array of double the current size (8), copy the old elements onto the new array, and then add the new element. The next four push operations would similarly take constant time, and then the subsequent addition would require another slow doubling of the array size.

In general, if we consider an arbitrary number of pushes n to an array of size n, we notice that push operations take constant time except for the last one, which takes O(n) time to perform the size-doubling operation. Since there were n operations total, we can take the average of this and find that pushing elements onto the dynamic array takes O(n/n) = O(1), constant time.[3]
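The aggregate argument can be checked with a short simulation (Python; the cost model, one unit per element written or copied, and all names are our own):

def total_push_cost(n_pushes, initial_size=4):
    # Cost model: writing one element costs 1; a resize copies every
    # element already stored into the doubled array.
    capacity, size, cost = initial_size, 0, 0
    for _ in range(n_pushes):
        if size == capacity:
            cost += size          # copy the old elements
            capacity *= 2
        cost += 1                 # write the new element
        size += 1
    return cost

for n in (4, 100, 10_000, 1_000_000):
    print(n, total_push_cost(n) / n)   # average cost per push stays below 3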

Queue

Let's look at a Ruby implementation of a Queue, a FIFO data structure:

class Queue
  def initialize
    @input = []
    @output = []
  end

  def enqueue(element)
    @input << element
  end

  def dequeue
    if @output.empty?
      while @input.any?
        @output << @input.pop
      end
    end
    @output.pop
  end
end

The enqueue operation just pushes an element onto the input array; this operation does not depend on the lengths of either input or output and therefore runs in constant time.

However, the dequeue operation is more complicated. If the output array already has some elements in it, then dequeue runs in constant time; otherwise, dequeue takes O(n) time to add all the elements onto the output array from the input array, where n is the current length of the input array. After copying n elements from input, we can perform n dequeue operations, each taking constant time, before the output array is empty again. Thus, we can perform a sequence of n dequeue operations in only O(n) time, which implies that the amortized time of each dequeue operation is O(1).[4]

Alternatively, we can charge the cost of copying any item from the input array to the output array to the earlier enqueue operation for that item. This charging scheme doubles the amortized time for enqueue, but reduces the amortized time for dequeue to O(1).

1.4.4 Common use

• In common usage, an "amortized algorithm" is one that an amortized analysis has shown to perform well.
• Online algorithms commonly use amortized analysis.

1.4.5 References

• Allan Borodin and Ran El-Yaniv (1998). Online Computation and Competitive Analysis. Cambridge University Press. pp. 20, 141.

[1] "Lecture 7: Amortized Analysis" (PDF). https://www.cs.cmu.edu/. Retrieved 14 March 2015.
[2] Rebecca Fiebrink (2007), Amortized Analysis Explained (PDF), retrieved 2011-05-03
[3] "Lecture 20: Amortized Analysis". http://www.cs.cornell.edu/. Cornell University. Retrieved 14 March 2015.
[4] Grossman, Dan. "CSE332: Data Abstractions" (PDF). cs.washington.edu. Retrieved 14 March 2015.

1.5 Accounting method

For accounting methods in business and financial reporting, see accounting methods.

In the field of analysis of algorithms in computer science, the accounting method is a method of amortized analysis based on accounting. The accounting method often gives a more intuitive account of the amortized cost of an operation than either aggregate analysis or the potential method. Note, however, that this does not guarantee such analysis will be immediately obvious; often, choosing the correct parameters for the accounting method requires as much knowledge of the problem and the complexity bounds one is attempting to prove as the other two methods.

The accounting method is most naturally suited for proving an O(1) bound on time. The method as explained here is for proving such a bound.

1.5.1 The method

A set of elementary operations which will be used in the algorithm is chosen and their costs are arbitrarily set to 1. The fact that the costs of these operations may differ in reality presents no difficulty in principle. What is important is that each elementary operation has a constant cost.

Each aggregate operation is assigned a "payment". The payment is intended to cover the cost of elementary operations needed to complete this particular operation, with some of the payment left over, placed in a pool to be used later.

The difficulty with problems that require amortized analysis is that, in general, some of the operations will require greater than constant cost. This means that no constant payment will be enough to cover the worst-case cost of an operation, in and of itself. With proper selection of payment, however, this is no longer a difficulty; the expensive operations will only occur when there is sufficient payment in the pool to cover their costs.
1.5.2 Examples

A few examples will help to illustrate the use of the accounting method.

Table expansion

It is often necessary to create a table before it is known how much space is needed. One possible strategy is to double the size of the table when it is full. Here we will use the accounting method to show that the amortized cost of an insertion operation in such a table is O(1).

Before looking at the procedure in detail, we need some definitions. Let T be a table, E an element to insert, num(T) the number of elements in T, and size(T) the allocated size of T. We assume the existence of operations create_table(n), which creates an empty table of size n, for now assumed to be free, and elementary_insert(T,E), which inserts element E into a table T that already has space allocated, with a cost of 1.

The following pseudocode illustrates the table insertion procedure:

function table_insert(T, E)
    if num(T) = size(T)
        U := create_table(2 × size(T))
        for each F in T
            elementary_insert(U, F)
        T := U
    elementary_insert(T, E)

Without amortized analysis, the best bound we can show for n insert operations is O(n²); this is due to the loop at line 4 that performs num(T) elementary insertions.

For analysis using the accounting method, we assign a payment of 3 to each table insertion. Although the reason for this is not clear now, it will become clear during the course of the analysis.

Assume that initially the table is empty, with size(T) = m. The first m insertions therefore do not require reallocation and only have cost 1 (for the elementary insert). Therefore, when num(T) = m, the pool has (3 − 1)×m = 2m.

Inserting element m + 1 requires reallocation of the table. Creating the new table on line 3 is free (for now). The loop on line 4 requires m elementary insertions, for a cost of m. Including the insertion on the last line, the total cost for this operation is m + 1. After this operation, the pool therefore has 2m + 3 − (m + 1) = m + 2.

Next, we add another m − 1 elements to the table. At this point the pool has m + 2 + 2×(m − 1) = 3m. Inserting an additional element (that is, element 2m + 1) can be seen to have cost 2m + 1 and a payment of 3. After this operation, the pool has 3m + 3 − (2m + 1) = m + 2. Note that this is the same amount as after inserting element m + 1. In fact, we can show that this will be the case for any number of reallocations.

It can now be made clear why the payment for an insertion is 3. 1 pays for the first insertion of the element, 1 pays for moving the element the next time the table is expanded, and 1 pays for moving an older element the next time the table is expanded. Intuitively, this explains why an element's contribution never "runs out" regardless of how many times the table is expanded: since the table is always doubled, the newest half always covers the cost of moving the oldest half.

We initially assumed that creating a table was free. In reality, creating a table of size n may be as expensive as O(n). Let us say that the cost of creating a table of size n is n. Does this new cost present a difficulty? Not really; it turns out we use the same method to show the amortized O(1) bounds. All we have to do is change the payment.

When a new table is created, there is an old table with m entries. The new table will be of size 2m. As long as the entries currently in the table have added enough to the pool to pay for creating the new table, we will be all right. We cannot expect the first m/2 entries to help pay for the new table; those entries already paid for the current table. We must then rely on the last m/2 entries to pay the cost 2m. This means we must add 2m/(m/2) = 4 to the payment for each entry, for a total payment of 3 + 4 = 7.
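The pool argument is easy to verify mechanically. A small Python sketch (payment of 3 per insertion, table creation still assumed free; names are our own):

def accounting_pool(n_inserts, payment=3, initial_size=4):
    # Deposit `payment` per insert; spend 1 per elementary_insert, plus
    # num(T) elementary inserts whenever the table doubles.
    size, num, pool = initial_size, 0, 0
    for _ in range(n_inserts):
        pool += payment
        if num == size:
            pool -= num           # re-insert every existing element
            size *= 2
        pool -= 1                 # insert the new element
        assert pool >= 0          # the pool covers every expansion
        num += 1
    return pool

print(accounting_pool(10_000))    # stays non-negative: amortized O(1)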
1.5.3 References

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 17.2: The accounting method, pp. 410–412.

1.6 Potential method

In computational complexity theory, the potential method is a method used to analyze the amortized time and space complexity of a data structure, a measure of its performance over sequences of operations that smooths out the cost of infrequent but expensive operations.[1][2]

1.6.1 Definition of amortized time

In the potential method, a function Φ is chosen that maps states of the data structure to non-negative numbers. If S is a state of the data structure, Φ(S) may be thought of intuitively as an amount of potential energy stored in that state;[1][2] alternatively, Φ(S) may be thought of as representing the amount of disorder in state S or its distance from an ideal state. The potential value prior to the operation of initializing a data structure is defined to be zero.

Let o be any individual operation within a sequence of operations on some data structure, with Sbefore denoting the state of the data structure prior to operation o and Safter denoting its state after operation o has completed. Then, once Φ has been chosen, the amortized time for operation o is defined to be

Tamortized(o) = Tactual(o) + C · (Φ(Safter) − Φ(Sbefore)),

where C is a non-negative constant of proportionality (in units of time) that must remain fixed throughout the analysis. That is, the amortized time is defined to be the actual time taken by the operation plus C times the difference in potential caused by the operation.[1][2]

1.6.2 Relation between amortized and actual time

Despite its artificial appearance, the total amortized time of a sequence of operations provides a valid upper bound on the actual time for the same sequence of operations.

For any sequence of operations O = o1, o2, ..., define:

• The total amortized time: Tamortized(O) = Σi Tamortized(oi)
• The total actual time: Tactual(O) = Σi Tactual(oi)

Then:

Tamortized(O) = Σi (Tactual(oi) + C · (Φ(Si+1) − Φ(Si))) = Tactual(O) + C · (Φ(Sfinal) − Φ(Sinitial))

where the sequence of potential function values forms a telescoping series in which all terms other than the initial and final potential function values cancel in pairs. Hence:

Tactual(O) = Tamortized(O) + C · (Φ(Sinitial) − Φ(Sfinal))

In case Φ(Sfinal) ≥ 0 and Φ(Sinitial) = 0, Tactual(O) ≤ Tamortized(O), so the amortized time can be used to provide accurate predictions about the actual time of sequences of operations, even though the amortized time for an individual operation may vary widely from its actual time.

1.6.3 Amortized analysis of worst-case inputs

Typically, amortized analysis is used in combination with a worst-case assumption about the input sequence. With this assumption, if X is a type of operation that may be performed by the data structure, and n is an integer defining the size of the given data structure (for instance, the number of items that it contains), then the amortized time for operations of type X is defined to be the maximum, among all possible sequences of operations on data structures of size n and all operations oi of type X within the sequence, of the amortized time for operation oi.

With this definition, the time to perform a sequence of operations may be estimated by multiplying the amortized time for each type of operation in the sequence by the number of operations of that type.

1.6.4 Examples

Dynamic array

A dynamic array is a data structure for maintaining an array of items, allowing both random access to positions within the array and the ability to increase the array size by one. It is available in Java as the "ArrayList" type and in Python as the "list" type.

A dynamic array may be implemented by a data structure consisting of an array A of items, of some length N, together with a number n ≤ N representing the positions within the array that have been used so far. With this structure, random accesses to the dynamic array may be implemented by accessing the same cell of the internal array A, and when n < N an operation that increases the dynamic array size may be implemented simply by incrementing n. However, when n = N, it is necessary to resize A, and a common strategy for doing so is to double its size, replacing A by a new array of length 2n.[3]

This structure may be analyzed using the potential function:

Φ = 2n − N

Since the resizing strategy always causes A to be at least half-full, this potential function is always non-negative, as desired.

When an increase-size operation does not lead to a resize operation, Φ increases by 2, a constant. Therefore, the constant actual time of the operation and the constant increase in potential combine to give a constant amortized time for an operation of this type.

However, when an increase-size operation causes a resize, the potential Φ, which equals n just before the resize, decreases to zero after the resize. Allocating a new internal array A and copying all of the values from the old internal array to the new one takes O(n) actual time, but (with an appropriate choice of the constant of proportionality C) this is entirely cancelled by the decrease in the potential function, leaving again a constant total amortized time for the operation.

The other operations of the data structure (reading and writing array cells without changing the array size) do not cause the potential function to change and have the same constant amortized time as their actual time.[2]

Therefore, with this choice of resizing strategy and potential function, the potential method shows that all dynamic array operations take constant amortized time. Combining this with the inequality relating amortized time and actual time over sequences of operations, this shows that any sequence of n dynamic array operations takes O(n) actual time in the worst case, despite the fact that some of the individual operations may themselves take a linear amount of time.[2]
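The potential argument can be checked numerically; this sketch (our own names; C = 1 and a cost of one unit per element written or copied) tracks Φ = 2n − N across a run of pushes and confirms that every push has constant amortized time:

def dynamic_array_pushes(n_pushes):
    n, N = 0, 1                   # one free slot to start (illustrative)
    for _ in range(n_pushes):
        phi_before = 2 * n - N
        actual = 1                # write the new element
        if n == N:                # full: double and copy n elements
            actual += n
            N *= 2
        n += 1
        amortized = actual + ((2 * n - N) - phi_before)
        assert amortized <= 3     # constant, resize or not
    return n, N

print(dynamic_array_pushes(1_000_000))  # (1000000, 1048576)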

Multi-Pop Stack

Consider a stack which supports the following operations:

• Initialize - create an empty stack.
• Push - add a single element on top of the stack.

• Pop(k) - remove k elements from the top of the stack.

This structure may be analyzed using the potential function:

Φ = number-of-elements-in-stack

This number is always non-negative, as required.

A Push operation takes constant time and increases Φ by 1, so its amortized time is constant.

A Pop operation takes time O(k) but also reduces Φ by k, so its amortized time is also constant.

This proves that any sequence of m operations takes O(m) actual time in the worst case.
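A randomized check of the multi-pop bound (a sketch; the operation mix and names are our own): with Φ = number of elements, the total actual time never exceeds twice the number of operations.

import random

def multipop_trace(n_ops, seed=1):
    rng = random.Random(seed)
    stack, actual_total = [], 0
    for _ in range(n_ops):
        if stack and rng.random() < 0.4:
            k = rng.randint(1, len(stack))
            del stack[-k:]        # Pop(k): actual cost k, Φ drops by k
            actual_total += k
        else:
            stack.append(0)       # Push: actual cost 1, Φ rises by 1
            actual_total += 1
    # Amortized costs are 2 per Push and 0 per Pop, and Φ ≥ 0 throughout,
    # so the total actual time is at most 2·n_ops.
    assert actual_total <= 2 * n_ops
    return actual_total

print(multipop_trace(100_000))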

Binary counter

Consider a counter represented as a binary number and supporting the following operations:

• Initialize - create a counter with value 0.

• Inc - add 1 to the counter.

• Read - return the current counter value.

This structure may be analyzed using the potential func- tion:

Φ = number-of-bits-equal-to-1

This number is always non-negative and starts with 0, as required.

An Inc operation flips the least significant bit. Then, if the LSB were flipped from 1 to 0, then the next bit should be flipped. This goes on until finally a bit is flipped from 0 to 1, in which case the flipping stops. If the number of bits flipped from 1 to 0 is k, then the actual time is k + 1 and the potential is reduced by k − 1, so the amortized time is 2. Hence, the actual time for running m Inc operations is O(m).

1.6.5 Applications

The potential function method is commonly used to analyze Fibonacci heaps, a form of priority queue in which removing an item takes logarithmic amortized time, and all other operations take constant amortized time.[4] It may also be used to analyze splay trees, a self-adjusting form of binary search tree with logarithmic amortized time per operation.[5]

1.6.6 References

[1] Goodrich, Michael T.; Tamassia, Roberto (2002), "1.5.1 Amortization Techniques", Algorithm Design: Foundations, Analysis and Internet Examples, Wiley, pp. 36–38.
[2] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001) [1990]. "17.3 The potential method". Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 412–416. ISBN 0-262-03293-7.
[3] Goodrich and Tamassia, 1.5.2 Analyzing an Extendable Array Implementation, pp. 139–141; Cormen et al., 17.4 Dynamic tables, pp. 416–424.
[4] Cormen et al., Chapter 20, "Fibonacci Heaps", pp. 476–497.
[5] Goodrich and Tamassia, Section 3.4, "Splay Trees", pp. 185–194.

Chapter 2

Sequences

2.1 Array data type

This article is about the abstract data type. For the byte-layout-level structure, see Array data structure. For other uses, see Array.

In computer science, an array type is a data type that is meant to describe a collection of elements (values or variables), each selected by one or more indices (identifying keys) that can be computed at run time by the program. Such a collection is usually called an array variable, array value, or simply array.[1] By analogy with the mathematical concepts of vector and matrix, array types with one and two indices are often called vector type and matrix type, respectively.

Language support for array types may include certain built-in array data types, some syntactic constructions (array type constructors) that the programmer may use to define such types and declare array variables, and special notation for indexing array elements.[1] For example, in the Pascal programming language, the declaration type MyTable = array [1..4,1..2] of integer, defines a new array data type called MyTable. The declaration var A: MyTable then defines a variable A of that type, which is an aggregate of eight elements, each being an integer variable identified by two indices. In the Pascal program, those elements are denoted A[1,1], A[1,2], A[2,1], ..., A[4,2].[2] Special array types are often defined by the language's standard libraries.

Dynamic lists are also more common and easier to implement than dynamic arrays. Array types are distinguished from record types mainly because they allow the element indices to be computed at run time, as in the Pascal assignment A[I,J] := A[N-I,2*J]. Among other things, this feature allows a single iterative statement to process arbitrarily many elements of an array variable.

In more theoretical contexts, especially in type theory and in the description of abstract algorithms, the terms "array" and "array type" sometimes refer to an abstract data type (ADT), also called an abstract array, or may refer to an associative array, a mathematical model with the basic operations and behavior of a typical array type in most languages: basically, a collection of elements that are selected by indices computed at run-time.

Depending on the language, array types may overlap (or be identified with) other data types that describe aggregates of values, such as lists and strings. Array types are often implemented by array data structures, but sometimes by other means, such as hash tables, linked lists, or search trees.

2.1.1 History

Heinz Rutishauser's programming language Superplan (1949–1951) included multi-dimensional arrays. Rutishauser, however, although describing how a compiler for his language should be built, did not implement one.

Assembly languages and low-level languages like BCPL[3] generally have no syntactic support for arrays.

Because of the importance of array structures for efficient computation, the earliest high-level programming languages, including FORTRAN (1957), COBOL (1960), and Algol 60 (1960), provided support for multi-dimensional arrays.

2.1.2 Abstract arrays

An array data structure can be mathematically modeled as an abstract data structure (an abstract array) with two operations

get(A, I): the data stored in the element of the array A whose indices are the integer tuple I.
set(A, I, V): the array that results by setting the value of that element to V.

These operations are required to satisfy the axioms[4]

get(set(A, I, V), I) = V
get(set(A, I, V), J) = get(A, J) if I ≠ J

for any array state A, any value V, and any tuples I, J for which the operations are defined.


The first axiom means that each element behaves like a variable. The second axiom means that elements with distinct indices behave as disjoint variables, so that storing a value in one element does not affect the value of any other element.

These axioms do not place any constraints on the set of valid index tuples I, therefore this abstract model can be used for triangular matrices and other oddly-shaped arrays.
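A toy model of the abstract array makes the axioms testable (a sketch; set is spelled set_ to avoid shadowing Python's built-in, and the dict-based state is purely illustrative):

def get(A, I):
    # get(A, I): the value stored at index tuple I.
    return A[I]

def set_(A, I, V):
    # set(A, I, V): a new array state with element I set to V,
    # leaving the original state A untouched.
    B = dict(A)
    B[I] = V
    return B

A = {(1, 1): 10, (1, 2): 20}
assert get(set_(A, (1, 1), 99), (1, 1)) == 99              # first axiom
assert get(set_(A, (1, 1), 99), (1, 2)) == get(A, (1, 2))  # second axiom
print("axioms hold for this model")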

2.1.3 Implementations

In order to effectively implement variables of such types as array structures (with indexing done by pointer arithmetic), many languages restrict the indices to integer data types (or other types that can be interpreted as integers, such as enumerated types), and require that all elements have the same data type and storage size. Most of those languages also restrict each index to a finite interval of integers that remains fixed throughout the lifetime of the array variable. In some compiled languages, in fact, the index ranges may have to be known at compile time.

On the other hand, some programming languages provide more liberal array types that allow indexing by arbitrary values, such as floating-point numbers, strings, objects, references, etc. Such index values cannot be restricted to an interval, much less a fixed interval, so these languages usually allow arbitrary new elements to be created at any time. This choice precludes the implementation of array types as array data structures. That is, those languages use array-like syntax to implement a more general associative array semantics, and must therefore be implemented by a hash table or some other data structure.

2.1.4 Language support

Multi-dimensional arrays

The number of indices needed to specify an element is called the dimension, dimensionality, or rank of the array type. (This nomenclature conflicts with the concept of dimension in linear algebra,[5] where it is the number of elements. Thus, an array of numbers with 5 rows and 4 columns, hence 20 elements, is said to have dimension 2 in computing contexts, but represents a matrix with dimension 4-by-5 or 20 in mathematics. Also, the computer science meaning of "rank" is similar to its meaning in tensor algebra but not to the linear algebra concept of rank of a matrix.)

Many languages support only one-dimensional arrays. In those languages, a multi-dimensional array is typically represented by an Iliffe vector, a one-dimensional array of references to arrays of one dimension less. A two-dimensional array, in particular, would be implemented as a vector of pointers to its rows. Thus an element in row i and column j of an array A would be accessed by double indexing (A[i][j] in typical notation). This way of emulating multi-dimensional arrays allows the creation of jagged arrays, where each row may have a different size, or, in general, where the valid range of each index depends on the values of all preceding indices.

[Figure: A two-dimensional array stored as a one-dimensional array of one-dimensional arrays (rows).]

This representation for multi-dimensional arrays is quite prevalent in C and C++ software. However, C and C++ will use a linear indexing formula for multi-dimensional arrays that are declared with constant size, e.g. by int A[10][20] or int A[m][n], instead of the traditional int **A.[6]:p.81

Indexing notation

Most programming languages that support arrays support the store and select operations, and have special syntax for indexing. Early languages used parentheses, e.g. A(i,j), as in FORTRAN; others choose square brackets, e.g. A[i,j] or A[i][j], as in Algol 60 and Pascal (to distinguish from the use of parentheses for function calls).

Index types

Array data types are most often implemented as array structures: with the indices restricted to integer (or totally ordered) values, index ranges fixed at array creation time, and multilinear element addressing. This was the case in most "third generation" languages, and is still the case in most systems programming languages such as Ada, C, and C++. In some languages, however, array data types have the semantics of associative arrays, with indices of arbitrary type and dynamic element creation. This is the case in some scripting languages such as Awk and Lua, and of some array types provided by standard C++ libraries.
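Python's lists of lists behave like the Iliffe vectors described above, including jagged shapes (a quick illustration):

# A two-dimensional array as a vector of references to its rows.
matrix = [[1, 2, 3],
          [4, 5, 6]]
print(matrix[1][2])                    # double indexing: 6

# Rows are independent arrays, so jagged shapes are allowed:
jagged = [[1], [2, 3], [4, 5, 6]]
print([len(row) for row in jagged])    # [1, 2, 3]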

Bounds checking

Some languages (like Pascal and Modula) perform bounds checking on every access, raising an exception or aborting the program when any index is out of its valid range. Compilers may allow these checks to be turned off to trade safety for speed. Other languages (like FORTRAN and C) trust the programmer and perform no checks. Good compilers may also analyze the program to determine the range of possible values that the index may have, and this analysis may lead to bounds-checking elimination.

Index origin

Some languages, such as C, provide only zero-based array types, for which the minimum valid value for any index is 0. This choice is convenient for array implementation and address computations. With a language such as C, a pointer to the interior of any array can be defined that will symbolically act as a pseudo-array that accommodates negative indices. This works only because C does not check an index against bounds when used.

Other languages provide only one-based array types, where each index starts at 1; this is the traditional convention in mathematics for matrices and mathematical sequences. A few languages, such as Pascal, support n-based array types, whose minimum legal indices are chosen by the programmer. The relative merits of each choice have been the subject of heated debate. Zero-based indexing has a natural advantage over one-based indexing in avoiding off-by-one or fencepost errors.[7]

See comparison of programming languages (array) for the base indices used by various languages.

Highest index

The relation between numbers appearing in an array declaration and the index of that array's last element also varies by language. In many languages (such as C), one should specify the number of elements contained in the array; whereas in others (such as Pascal and Visual Basic .NET) one should specify the numeric value of the index of the last element. Needless to say, this distinction is immaterial in languages where the indices start at 1.

Array algebra

Some programming languages support array programming, where operations and functions defined for certain data types are implicitly extended to arrays of elements of those types. Thus one can write A+B to add corresponding elements of two arrays A and B. Usually these languages provide both the element-by-element multiplication and the standard matrix product of linear algebra, and which of these is represented by the * operator varies by language.

Languages providing array programming capabilities have proliferated since the innovations in this area of APL. These are core capabilities of domain-specific languages such as GAUSS, IDL, Matlab, and Mathematica. They are a core facility in newer languages, such as Julia and recent versions of Fortran. These capabilities are also provided via standard extension libraries for other general purpose programming languages (such as the widely used NumPy library for Python).

String types and arrays

Many languages provide a built-in string data type, with specialized notation ("string literals") to build values of that type. In some languages (such as C), a string is just an array of characters, or is handled in much the same way. Other languages, like Pascal, may provide vastly different operations for strings and arrays.

Array index range queries

Some programming languages provide operations that return the size (number of elements) of a vector, or, more generally, the range of each index of an array. In C and C++, arrays do not support the size function, so programmers often have to declare a separate variable to hold the size, and pass it to procedures as a separate parameter.

Elements of a newly created array may have undefined values (as in C), or may be defined to have a specific "default" value such as 0 or a null pointer (as in Java).

In C++ a std::vector object supports the store, select, and append operations with the performance characteristics discussed above. Vectors can be queried for their size and can be resized. Slower operations like inserting an element in the middle are also supported.

Slicing

An array slicing operation takes a subset of the elements of an array-typed entity (value or variable) and then assembles them as another array-typed entity, possibly with other indices. If array types are implemented as array structures, many useful slicing operations (such as selecting a sub-array, swapping indices, or reversing the direction of the indices) can be performed very efficiently by manipulating the dope vector of the structure. The possible slicings depend on the implementation details: for example, FORTRAN allows slicing off one column of a matrix variable, but not a row, and treats it as a vector; whereas C allows slicing off a row from a matrix, but not a column.

On the other hand, other slicing operations are possible when array types are implemented in other ways.
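Python slices illustrate several of these operations on a row-of-rows representation (a sketch):

grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

row = grid[1]                        # slicing off a row is a single step
column = [r[1] for r in grid]        # a column requires an explicit gather
print(row, column)                   # [4, 5, 6] [2, 5, 8]

print(row[0:2])                      # selecting a sub-array: [4, 5]
print(row[::-1])                     # reversing an index direction: [6, 5, 4]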

Resizing

Some languages allow dynamic arrays (also called resizable, growable, or extensible): array variables whose index ranges may be expanded at any time after creation, without changing the values of the current elements.

For one-dimensional arrays, this facility may be provided as an operation "append(A,x)" that increases the size of the array A by one and then sets the value of the last element to x. Other array types (such as Pascal strings) provide a concatenation operator, which can be used together with slicing to achieve that effect and more. In some languages, assigning a value to an element of an array automatically extends the array, if necessary, to include that element. In other array types, a slice can be replaced by an array of different size, with subsequent elements being renumbered accordingly, as in Python's list assignment "A[5:5] = [10,20,30]", which inserts three new elements (10, 20, and 30) before element "A[5]". Resizable arrays are conceptually similar to lists, and the two concepts are synonymous in some languages.

An extensible array can be implemented as a fixed-size array, with a counter that records how many elements are actually in use. The append operation merely increments the counter until the whole array is used, at which point the append operation may be defined to fail. This is an implementation of a dynamic array with a fixed capacity, as in the string type of Pascal. Alternatively, the append operation may re-allocate the underlying array with a larger size, and copy the old elements to the new area.

2.1.5 See also

• Array access analysis
• Array programming
• Array slicing
• Bounds checking and index checking
• Bounds checking elimination
• Delimiter-separated values
• Comparison of programming languages (array)
• Parallel array

Related types

• Variable-length array
• Dynamic array
• Sparse array

2.1.6 References

[1] Robert W. Sebesta (2001) Concepts of Programming Languages. Addison-Wesley. 4th edition (1998), 5th edition (2001), ISBN 9780201385960
[2] K. Jensen and Niklaus Wirth, PASCAL User Manual and Report. Springer. Paperback edition (2007) 184 pages, ISBN 978-3540069508
[3] John Mitchell, Concepts of Programming Languages. Cambridge University Press.
[4] Lukham, Suzuki (1979), "Verification of array, record, and pointer operations in Pascal". ACM Transactions on Programming Languages and Systems 1(2), 226–244.
[5] see the definition of a matrix
[6] Brian W. Kernighan and Dennis M. Ritchie (1988), The C Programming Language. Prentice-Hall, 205 pages.
[7] Edsger W. Dijkstra, Why numbering should start at zero

2.1.7 External links

• NIST's Dictionary of Algorithms and Data Structures: Array

2.2 Array data structure

This article is about the byte-layout-level structure. For the abstract data type, see Array data type. For other uses, see Array.

In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key. An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.[1][2][3] The simplest type of data structure is a linear array, also called a one-dimensional array.

For example, an array of 10 32-bit integer variables, with indices 0 through 9, may be stored as 10 words at memory addresses 2000, 2004, 2008, ..., 2036, so that the element with index i has the address 2000 + 4 × i.[4]

The memory address of the first element of an array is called the first address or foundation address.

Because the mathematical concept of a matrix can be represented as a two-dimensional grid, two-dimensional arrays are also sometimes called matrices. In some cases the term "vector" is used in computing to refer to an array, although tuples rather than vectors are more correctly the mathematical equivalent. Arrays are often used to implement tables, especially lookup tables; the word table is sometimes used as a synonym of array.
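The address formula from the example can be spelled out directly (a sketch; the base address and element size are those of the example above):

def element_address(base, element_size, index):
    # Linear array: the element with index i lives at base + size·i.
    return base + element_size * index

# Ten 32-bit (4-byte) integers stored starting at address 2000:
print([element_address(2000, 4, i) for i in range(10)])
# [2000, 2004, 2008, 2012, 2016, 2020, 2024, 2028, 2032, 2036]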

Arrays are among the oldest and most important data structures, and are used by almost every program. They are also used to implement many other data structures, such as lists and strings. They effectively exploit the addressing logic of computers. In most modern computers and many external storage devices, the memory is a one-dimensional array of words, whose indices are their addresses. Processors, especially vector processors, are often optimized for array operations.

Arrays are useful mostly because the element indices can be computed at run time. Among other things, this feature allows a single iterative statement to process arbitrarily many elements of an array. For that reason, the elements of an array data structure are required to have the same size and should use the same data representation. The set of valid index tuples and the addresses of the elements (and hence the element addressing formula) are usually,[3][5] but not always,[2] fixed while the array is in use.

The term array is often used to mean array data type, a kind of data type provided by most high-level programming languages that consists of a collection of values or variables that can be selected by one or more indices computed at run-time. Array types are often implemented by array structures; however, in some languages they may be implemented by hash tables, linked lists, search trees, or other data structures.

The term is also used, especially in the description of algorithms, to mean associative array or "abstract array", a theoretical computer science model (an abstract data type or ADT) intended to capture the essential properties of arrays.

2.2.1 History

The first digital computers used machine-language programming to set up and access array structures for data tables, vector and matrix computations, and for many other purposes. John von Neumann wrote the first array-sorting program (merge sort) in 1945, during the building of the first stored-program computer.[6]p. 159 Array indexing was originally done by self-modifying code, and later using index registers and indirect addressing. Some mainframes designed in the 1960s, such as the Burroughs B5000 and its successors, used memory segmentation to perform index-bounds checking in hardware.[7]

Assembly languages generally have no special support for arrays, other than what the machine itself provides. The earliest high-level programming languages, including FORTRAN (1957), COBOL (1960), and ALGOL 60 (1960), had support for multi-dimensional arrays, and so has C (1972). In C++ (1983), class templates exist for multi-dimensional arrays whose dimension is fixed at runtime[3][5] as well as for runtime-flexible arrays.[2]

2.2.2 Applications

Arrays are used to implement mathematical vectors and matrices, as well as other kinds of rectangular tables. Many databases, small and large, consist of (or include) one-dimensional arrays whose elements are records.

Arrays are used to implement other data structures, such as heaps, hash tables, deques, queues, stacks, strings, and VLists.

One or more large arrays are sometimes used to emulate in-program dynamic memory allocation, particularly memory pool allocation. Historically, this has sometimes been the only way to allocate "dynamic memory" portably.

Arrays can be used to determine partial or complete control flow in programs, as a compact alternative to (otherwise repetitive) multiple IF statements. They are known in this context as control tables and are used in conjunction with a purpose-built interpreter whose control flow is altered according to values contained in the array. The array may contain subroutine pointers (or relative subroutine numbers that can be acted upon by SWITCH statements) that direct the path of the execution.

2.2.3 Element identifier and addressing formulas

When data objects are stored in an array, individual objects are selected by an index that is usually a non-negative scalar integer. Indexes are also called subscripts. An index maps the array value to a stored object.

There are three ways in which the elements of an array can be indexed:

• 0 (zero-based indexing): The first element of the array is indexed by subscript of 0.[8]
• 1 (one-based indexing): The first element of the array is indexed by subscript of 1.[9]
• n (n-based indexing): The base index of an array can be freely chosen. Usually programming languages allowing n-based indexing also allow negative index values, and other scalar data types like enumerations or characters may be used as an array index.

Arrays can have multiple dimensions, thus it is not uncommon to access an array using multiple indices. For example, a two-dimensional array A with three rows and four columns might provide access to the element at the 2nd row and 4th column by the expression A[1, 3] (in a row-major language) or A[3, 1] (in a column-major language) in the case of a zero-based indexing system. Thus two indices are used for a two-dimensional array, three for a three-dimensional array, and n for an n-dimensional array.

for a three-dimensional array, and n for an n-dimensional This means that array a has 2 rows and 3 columns, and array. the array is of integer type. Here we can store 6 elements The number of indices needed to specify an element is they are stored linearly but starting from first row linear called the dimension, dimensionality, or rank of the array. then continuing with second row. The above array will be stored as a11, a12, a13, a21, a22, a23. In standard arrays, each index is restricted to a certain range of consecutive integers (or consecutive values of This formula requires only k multiplications and k addi- tions, for any array that can fit in memory. Moreover, if some ), and the address of an element is computed by a “linear” formula on the indices. any coefficient is a fixed power of 2, the multiplication can be replaced by bit shifting. The coefficients ck must be chosen so that every valid in- One-dimensional arrays dex tuple maps to the address of a distinct element. If the minimum legal value for every index is 0, then B is A one-dimensional array (or single dimension array) is a the address of the element whose indices are all zero. As type of linear array. Accessing its elements involves a sin- in the one-dimensional case, the element indices may be gle subscript which can either represent a row or column changed by changing the base address B. Thus, if a two- index. dimensional array has rows and columns indexed from 1 As an example consider the C declaration int anArray- to 10 and 1 to 20, respectively, then replacing B by B + c1 - Name[10]; − 3 c1 will cause them to be renumbered from 0 through Syntax : datatype anArrayname[sizeofArray]; 9 and 4 through 23, respectively. Taking advantage of this feature, some languages (like FORTRAN 77) specify In the given example the array can contain 10 elements of that array indices begin at 1, as in mathematical tradition any value available to the int type. In C, the array element while other languages (like Fortran 90, Pascal and Algol) indices are 0-9 inclusive in this case. For example, the ex- let the user choose the minimum value for each index. pressions anArrayName[0] and anArrayName[9] are the first and last elements respectively. Dope vectors For a vector with linear addressing, the element with in- dex i is located at the address B + c × i, where B is a fixed The addressing formula is completely defined by the di- base address and c a fixed constant, sometimes called the mension d, the base address B, and the increments c1, c2, address increment or stride. ..., ck. It is often useful to pack these parameters into a If the valid element indices begin at 0, the constant B is record called the array’s descriptor or stride vector or dope simply the address of the first element of the array. For vector.[2][3] The size of each element, and the minimum this reason, the C programming language specifies that and maximum values allowed for each index may also be array indices always begin at 0; and many programmers included in the dope vector. The dope vector is a com- will call that element "zeroth" rather than “first”. plete handle for the array, and is a convenient way to pass arrays as arguments to procedures. Many useful array However, one can choose the index of the first element by slicing operations (such as selecting a sub-array, swap- an appropriate choice of the base address B. 
For example, if the array has five elements, indexed 1 through 5, and the base address B is replaced by B + 30c, then the indices of those same elements will be 31 to 35. If the numbering does not start at 0, the constant B may not be the address of any element.
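As a concrete illustration of the linear addressing formula, the following C fragment computes element addresses by hand from a base address B and a stride c. This is only a sketch with names of our choosing; in practice the compiler performs exactly this computation for the expression a[i].

    #include <stdio.h>

    int main(void) {
        int a[5] = {10, 20, 30, 40, 50};
        char  *B = (char *)a;    /* base address: the address of a[0] */
        size_t c = sizeof a[0];  /* address increment (stride) in bytes */

        for (int i = 0; i < 5; i++) {
            int *element = (int *)(B + c * i);  /* address B + c * i */
            printf("index %d -> address %p, value %d\n",
                   i, (void *)element, *element);
        }
        return 0;
    }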

Multidimensional arrays

For a multidimensional array, the element with indices i, j would have address B + c · i + d · j, where the coefficients c and d are the row and column address increments, respectively.
More generally, in a k-dimensional array, the address of an element with indices i1, i2, ..., ik is

B + c1 · i1 + c2 · i2 + ... + ck · ik.

For example: int a[2][3];
This means that array a has 2 rows and 3 columns, and the array is of integer type. Here we can store 6 elements; they are stored linearly, starting with the first row and continuing with the second row. The above array will be stored as a11, a12, a13, a21, a22, a23.
This formula requires only k multiplications and k additions, for any array that can fit in memory. Moreover, if any coefficient is a fixed power of 2, the multiplication can be replaced by bit shifting.
The coefficients ck must be chosen so that every valid index tuple maps to the address of a distinct element. If the minimum legal value for every index is 0, then B is the address of the element whose indices are all zero. As in the one-dimensional case, the element indices may be changed by changing the base address B. Thus, if a two-dimensional array has rows and columns indexed from 1 to 10 and 1 to 20, respectively, then replacing B by B + c1 − 3c2 will cause them to be renumbered from 0 through 9 and 4 through 23, respectively. Taking advantage of this feature, some languages (like FORTRAN 77) specify that array indices begin at 1, as in mathematical tradition, while other languages (like Fortran 90, Pascal and Algol) let the user choose the minimum value for each index.
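As a quick check of the renumbering above, substitute the shifted indices i1 = i1' + 1 and i2 = i2' − 3 into the addressing formula:

B + c1 · (i1' + 1) + c2 · (i2' − 3) = (B + c1 − 3c2) + c1 · i1' + c2 · i2'

so adopting B' = B + c1 − 3c2 as the new base leaves every element at its old address while the row indices run from 0 through 9 and the column indices from 4 through 23.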

Dope vectors

The addressing formula is completely defined by the dimension k, the base address B, and the increments c1, c2, ..., ck. It is often useful to pack these parameters into a record called the array's descriptor or stride vector or dope vector.[2][3] The size of each element, and the minimum and maximum values allowed for each index, may also be included in the dope vector. The dope vector is a complete handle for the array, and is a convenient way to pass arrays as arguments to procedures. Many useful array slicing operations (such as selecting a sub-array, swapping indices, or reversing the direction of the indices) can be performed very efficiently by manipulating the dope vector.[2]
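To make the description concrete, here is one possible C rendering of such a descriptor and of the addressing formula it encodes. This is a sketch under our own names (dope_vector, element_addr), not a standard interface; real implementations differ in detail.

    #include <stddef.h>

    enum { MAX_RANK = 8 };

    struct dope_vector {
        char  *base;              /* address B of the all-zero-index element */
        size_t elem_size;         /* size of one element in bytes */
        int    rank;              /* number of dimensions k */
        long   stride[MAX_RANK];  /* address increments c1 ... ck, in elements */
        long   min[MAX_RANK];     /* minimum legal value for each index */
        long   max[MAX_RANK];     /* maximum legal value for each index */
    };

    /* Address of the element with the given index tuple:
       B + c1*i1 + ... + ck*ik, as in the formula above. */
    static void *element_addr(const struct dope_vector *dv, const long idx[]) {
        long offset = 0;
        for (int d = 0; d < dv->rank; d++)
            offset += dv->stride[d] * idx[d];
        return dv->base + offset * (long)dv->elem_size;
    }

Because element_addr consults only the descriptor, operations such as transposing a view (swapping two stride/bound pairs) or reversing an index (negating a stride and adjusting the base) need not touch the elements at all, which is the efficiency claim made above.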

Compact layouts

Often the coefficients are chosen so that the elements occupy a contiguous area of memory. However, that is not necessary. Even if arrays are always created with contiguous elements, some array slicing operations may create non-contiguous sub-arrays from them.
There are two systematic compact layouts for a two-dimensional array. For example, consider the matrix

        1 2 3
    A = 4 5 6
        7 8 9

In the row-major order layout (adopted by C for statically declared arrays), the elements in each row are stored in consecutive positions, and all of the elements of a row have a lower address than any of the elements of a consecutive row. In column-major order (traditionally used by Fortran), the elements in each column are consecutive in memory, and all of the elements of a column have a lower address than any of the elements of a consecutive column.
For arrays with three or more indices, "row major order" puts in consecutive positions any two elements whose index tuples differ only by one in the last index. "Column major order" is analogous with respect to the first index.
In systems which use processor cache or virtual memory, scanning an array is much faster if successive elements are stored in consecutive positions in memory, rather than sparsely scattered. Many algorithms that use multidimensional arrays will scan them in a predictable order. A programmer (or a sophisticated compiler) may use this information to choose between row- or column-major layout for each array. For example, when computing the product A·B of two matrices, it would be best to have A stored in row-major order, and B in column-major order.

Resizing

Main article: Dynamic array
Static arrays have a size that is fixed when they are created and consequently do not allow elements to be inserted or removed. However, by allocating a new array and copying the contents of the old array to it, it is possible to effectively implement a dynamic version of an array; see dynamic array. If this operation is done infrequently, insertions at the end of the array require only amortized constant time.
Some array data structures do not reallocate storage, but do store a count of the number of elements of the array in use, called the count or size. This effectively makes the array a dynamic array with a fixed maximum size or capacity; Pascal strings are examples of this.

Non-linear formulas

More complicated (non-linear) formulas are occasionally used. For a compact two-dimensional triangular array, for instance, the addressing formula is a polynomial of degree 2.

2.2.4 Efficiency

Both store and select take (deterministic worst case) constant time. Arrays take linear (O(n)) space in the number of elements n that they hold.
In an array with element size k and on a machine with a cache line size of B bytes, iterating through an array of n elements requires the minimum of ceiling(nk/B) cache misses, because its elements occupy contiguous memory locations. This is roughly a factor of B/k better than the number of cache misses needed to access n elements at random memory locations. As a consequence, sequential iteration over an array is noticeably faster in practice than iteration over many other data structures, a property called locality of reference (this does not mean, however, that using a perfect hash or trivial hash within the same (local) array will not be even faster, and achievable in constant time). Libraries provide low-level optimized facilities for copying ranges of memory (such as memcpy) which can be used to move contiguous blocks of array elements significantly faster than can be achieved through individual element access. The speedup of such optimized routines varies by array element size, architecture, and implementation.
Memory-wise, arrays are compact data structures with no per-element overhead. There may be a per-array overhead, e.g. to store index bounds, but this is language-dependent. It can also happen that elements stored in an array require less memory than the same elements stored in individual variables, because several array elements can be stored in a single word; such arrays are often called packed arrays. An extreme (but commonly used) case is the bit array, where every bit represents a single element. A single octet can thus hold up to 256 different combinations of up to 8 different conditions, in the most compact form.
Array accesses with statically predictable access patterns are a major source of data parallelism.
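The locality effect described above is easy to reproduce. The two loop nests below compute the same sum, but the first follows C's row-major layout while the second strides through memory; once the array exceeds the cache, the first is typically much faster. (A sketch; the dimensions are arbitrary.)

    #include <stdio.h>

    #define N 512

    static int a[N][N];  /* static storage, zero-initialized */

    int main(void) {
        long sum = 0;

        /* Row-major traversal: the inner loop visits consecutive addresses. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Column-major traversal of the same array: the inner loop jumps
           N * sizeof(int) bytes per step, touching a new cache line on
           almost every iteration once the array no longer fits in cache. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%ld\n", sum);
        return 0;
    }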

Comparison with other data structures

Growable arrays are similar to arrays but add the ability to insert and delete elements; adding and deleting at the end is particularly efficient. However, they reserve linear (Θ(n)) additional storage, whereas arrays do not reserve additional storage.
Associative arrays provide a mechanism for array-like functionality without huge storage overheads when the index values are sparse. For example, an array that contains values only at indexes 1 and 2 billion may benefit from using such a structure. Specialized associative arrays with integer keys include Patricia tries, Judy arrays, and van Emde Boas trees.
Balanced trees require O(log n) time for indexed access, but also permit inserting or deleting elements in O(log n) time,[15] whereas growable arrays require linear (Θ(n)) time to insert or delete elements at an arbitrary position.
Linked lists allow constant time removal and insertion in the middle but take linear time for indexed access. Their memory use is typically worse than arrays, but is still linear.

[Figure: A two-dimensional array stored as a one-dimensional array of one-dimensional arrays (rows).]

An Iliffe vector is an alternative to a multidimensional array structure. It uses a one-dimensional array of references to arrays of one dimension less. For two dimensions, in particular, this alternative structure would be a vector of pointers to vectors, one for each row. Thus an element in row i and column j of an array A would be accessed by double indexing (A[i][j] in typical notation). This alternative structure allows jagged arrays, where each row may have a different size, or, in general, where the valid range of each index depends on the values of all preceding indices. It also saves one multiplication (by the column address increment), replacing it by a bit shift (to index the vector of row pointers) and one extra memory access (fetching the row address), which may be worthwhile in some architectures.

2.2.5 Dimension

The dimension of an array is the number of indices needed to select an element. Thus, if the array is seen as a function on a set of possible index combinations, it is the dimension of the space of which its domain is a discrete subset. Thus a one-dimensional array is a list of data, a two-dimensional array a rectangle of data, a three-dimensional array a block of data, etc.
This should not be confused with the dimension of the set of all matrices with a given domain, that is, the number of elements in the array. For example, an array with 5 rows and 4 columns is two-dimensional, but such matrices form a 20-dimensional space. Similarly, a three-dimensional vector can be represented by a one-dimensional array of size three.

2.2.6 See also

• Dynamic array
• Parallel array
• Variable-length array
• Bit array
• Array slicing
• Offset (computer science)
• Row-major order
• Stride of an array

2.2.7 References

[1] Black, Paul E. (13 November 2008). "array". Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology. Retrieved 22 August 2010.
[2] Bjoern Andres; Ullrich Koethe; Thorben Kroeger; Hamprecht (2010). "Runtime-Flexible Multi-dimensional Arrays and Views for C++98 and C++0x". arXiv:1008.2909 [cs.DS].
[3] Garcia, Ronald; Lumsdaine, Andrew (2005). "MultiArray: a C++ library for generic programming with arrays". Software: Practice and Experience. 35 (2): 159–188. doi:10.1002/spe.630. ISSN 0038-0644.
[4] David R. Richardson (2002), The Book on Data Structures. iUniverse, 112 pages. ISBN 0-595-24039-9, ISBN 978-0-595-24039-5.
[5] Veldhuizen, Todd L. (December 1998). Arrays in Blitz++ (PDF). Computing in Object-Oriented Parallel Environments. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 223–230. doi:10.1007/3-540-49372-7_24. ISBN 978-3-540-65387-5.
[6] Donald Knuth, The Art of Computer Programming, vol. 3. Addison-Wesley.
[7] Levy, Henry M. (1984), Capability-based Computer Systems, Digital Press, p. 22, ISBN 9780932376220.
[8] "Array Code Examples - PHP Array Functions - PHP code". configure-all.com: Computer Programming Web programming Tips. Retrieved 8 April 2011. "In most computer languages array index (counting) starts from 0, not from 1. Index of the first element of the array is 0, index of the second element of the array is 1, and so on. In array of names below you can see indexes and values."
[9] "Chapter 6 - Arrays, Types, and Constants". Modula-2 Tutorial. modula2.org. Retrieved 8 April 2011. "The names of the twelve variables are given by Automobiles[1], Automobiles[2], ... Automobiles[12]. The variable name is "Automobiles" and the array subscripts are the numbers 1 through 12. [i.e.
in Modula-2, the index starts by one!]

[10] Chris Okasaki (1995). "Purely Functional Random-Access Lists". Proceedings of the Seventh International Conference on Functional Programming Languages and Computer Architecture: 86–95. doi:10.1145/224164.224187.

[11] Gerald Kruse. CS 240 Lecture Notes: Linked Lists Plus: Complexity Trade-offs. Juniata College. Spring 2008.

[12] Day 1 Keynote - Bjarne Stroustrup: C++11 Style at GoingNative 2012 on channel9.msdn.com from minute 45 or foil 44.
[13] Number crunching: Why you should never, ever, EVER use linked-list in your code again at kjellkod.wordpress.com.
[14] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (1999), Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (PDF), Department of Computer Science, University of Waterloo.
[15] Counted B-Tree.

2.3 Dynamic array

In computer science, a dynamic array, growable array, resizable array, dynamic table, mutable array, or array list is a random access, variable-size list data structure that allows elements to be added or removed. It is supplied with standard libraries in many modern mainstream programming languages.
A dynamic array is not the same thing as a dynamically allocated array, which is an array whose size is fixed when the array is allocated, although a dynamic array may use such a fixed-size array as a back end.[1]

2.3.1 Bounded-size dynamic arrays and capacity

A simple dynamic array can be constructed by allocating an array of fixed size, typically larger than the number of elements immediately required. The elements of the dynamic array are stored contiguously at the start of the underlying array, and the remaining positions towards the end of the underlying array are reserved, or unused. Elements can be added at the end of a dynamic array in constant time by using the reserved space, until this space is completely consumed. When all space is consumed, and an additional element is to be added, then the underlying fixed-size array needs to be increased in size. Typically resizing is expensive because it involves allocating a new underlying array and copying each element from the original array. Elements can be removed from the end of a dynamic array in constant time, as no resizing is required. The number of elements used by the dynamic array contents is its logical size or size, while the size of the underlying array is called the dynamic array's capacity or physical size, which is the maximum possible size without relocating data.[2] (A minimal C sketch of such an array appears below.)
A fixed-size array will suffice in applications where the maximum logical size is fixed (e.g. by specification), or can be calculated before the array is allocated. A dynamic array might be preferred if
• the maximum logical size is unknown, or difficult to calculate, before the array is allocated
• it is considered that a maximum logical size given by a specification is likely to change
• the amortized cost of resizing a dynamic array does not significantly affect performance or responsiveness
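Here is the sketch promised above: a minimal dynamic array in C with the doubling growth policy discussed in the next section. The type and function names (dynarray, da_push) are ours, and error handling is reduced to a return code.

    #include <stdlib.h>

    typedef struct {
        int   *data;
        size_t size;      /* logical size */
        size_t capacity;  /* physical size of the underlying array */
    } dynarray;

    int da_push(dynarray *a, int e) {
        if (a->size == a->capacity) {          /* reserved space exhausted */
            size_t ncap = a->capacity ? a->capacity * 2 : 1;
            int *p = realloc(a->data, ncap * sizeof *p);
            if (!p) return -1;                 /* allocation failure */
            a->data = p;
            a->capacity = ncap;
        }
        a->data[a->size++] = e;                /* constant time otherwise */
        return 0;
    }

Removing from the end is just a->size--; no resizing is needed, which is why removal runs in constant time.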

2.3.2 Geometric expansion and amortized cost

[Figure: Several values are inserted at the end of a dynamic array using geometric expansion. Grey cells indicate space reserved for expansion. Most insertions are fast (constant time), while some are slow due to the need for reallocation (Θ(n) time, labelled with turtles). The logical size and capacity of the final array are shown.]

To avoid incurring the cost of resizing many times, dynamic arrays resize by a large amount, such as doubling in size, and use the reserved space for future expansion. The operation of adding an element to the end might work as follows:

function insertEnd(dynarray a, element e)
    if (a.size = a.capacity)
        // resize a to twice its current capacity:
        a.capacity ← a.capacity * 2
        // (copy the contents to the new memory location here)
    a[a.size] ← e
    a.size ← a.size + 1

As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion a ensures that inserting n elements takes
amortized constant time. Many dynamic arrays also deallocate some of the underlying storage if its size drops below a certain threshold, such as 30% of the capacity. This threshold must be strictly smaller than 1/a in order to provide hysteresis (a stable band that avoids repeatedly growing and shrinking) and support mixed sequences of insertions and removals with amortized constant cost.
Dynamic arrays are a common example when teaching amortized analysis.[3][4]

2.3.3 Growth factor

The growth factor for the dynamic array depends on several factors including a space-time trade-off and algorithms used in the memory allocator itself. For growth factor a, the average time per insertion operation is about a/(a−1), while the number of wasted cells is bounded above by (a−1)n: each element is written once when it is appended, and the resize copies form a geometric series, n/a + n/a² + ... < n/(a−1), which gives the a/(a−1) figure (about two writes per insertion for a = 2, about three for a = 1.5). If the memory allocator uses a first-fit allocation algorithm, then growth factor values such as a = 2 can cause dynamic array expansion to run out of memory even though a significant amount of memory may still be available.[5] There have been various discussions on ideal growth factor values, including proposals for the golden ratio as well as the value 1.5.[6] Many textbooks, however, use a = 2 for simplicity and analysis purposes.[3][4] Below are growth factors used by several popular implementations:

[Table: growth factors of several popular dynamic array implementations.]

2.3.4 Performance

The dynamic array has performance similar to an array, with the addition of new operations to add and remove elements:
• Getting or setting the value at a particular index (constant time)
• Iterating over the elements in order (linear time, good cache performance)
• Inserting or deleting an element in the middle of the array (linear time)
• Inserting or deleting an element at the end of the array (constant amortized time)
Dynamic arrays benefit from many of the advantages of arrays, including good locality of reference and data cache utilization, compactness (low memory use), and random access. They usually have only a small fixed additional overhead for storing information about the size and capacity. This makes dynamic arrays an attractive tool for building cache-friendly data structures. However, in languages like Python or Java that enforce reference semantics, the dynamic array generally will not store the actual data, but rather it will store references to the data that resides in other areas of memory. In this case, accessing items in the array sequentially will actually involve accessing multiple non-contiguous areas of memory, so the many advantages of the cache-friendliness of this data structure are lost.
Compared to linked lists, dynamic arrays have faster indexing (constant time versus linear time) and typically faster iteration due to improved locality of reference; however, dynamic arrays require linear time to insert or delete at an arbitrary location, since all following elements must be moved, while linked lists can do this in constant time. This disadvantage is mitigated by the gap buffer and tiered vector variants discussed under Variants below. Also, in a highly fragmented memory region, it may be expensive or impossible to find contiguous space for a large dynamic array, whereas linked lists do not require the whole data structure to be stored contiguously.
A balanced tree can store a list while providing all operations of both dynamic arrays and linked lists reasonably efficiently, but both insertion at the end and iteration over the list are slower than for a dynamic array, in theory and in practice, due to non-contiguous storage and tree traversal/manipulation overhead.

2.3.5 Variants

Gap buffers are similar to dynamic arrays but allow efficient insertion and deletion operations clustered near the same arbitrary location. Some deque implementations use array deques, which allow amortized constant time insertion/removal at both ends, instead of just one end.
Goodrich[15] presented a dynamic array algorithm called tiered vectors that provides O(√n) performance for order-preserving insertions or deletions from the middle of the array.
The hashed array tree (HAT) is a dynamic array algorithm published by Sitarski in 1996.[16] A hashed array tree wastes order √n amount of storage space, where n is the number of elements in the array. The algorithm has O(1) amortized performance when appending a series of objects to the end of a hashed array tree.
In a 1999 paper,[14] Brodnik et al. describe a tiered dynamic array data structure, which wastes only √n space for n elements at any point in time, and they prove a lower bound showing that any dynamic array must waste this much space if the operations are to remain amortized constant time. Additionally, they present a variant where growing and shrinking the buffer has not only amortized but worst-case constant time.
Bagwell (2002)[17] presented the VList algorithm, which can be adapted to implement a dynamic array.

2.3.6 Language support

C++'s std::vector is an implementation of dynamic arrays, as are the ArrayList[18] class supplied with the Java API and the .NET Framework.[19] The generic List<> class supplied with version 2.0 of the .NET Framework is also implemented with dynamic arrays. Smalltalk's OrderedCollection is a dynamic array with dynamic start and end-index, making the removal of the first element also O(1). Python's list datatype implementation is a dynamic array. Delphi and D implement dynamic arrays at the language's core. Ada's Ada.Containers.Vectors generic package provides a dynamic array implementation for a given subtype. Many scripting languages such as Perl and Ruby offer dynamic arrays as a built-in primitive data type. Several cross-platform frameworks provide dynamic array implementations for C, including CFArray and CFMutableArray in Core Foundation, and GArray and GPtrArray in GLib.

2.3.7 References

[1] See, for example, the source code of the java.util.ArrayList class from OpenJDK 6.
[2] Lambert, Kenneth Alfred (2009), "Physical size and logical size", Fundamentals of Python: From First Programs Through Data Structures, Cengage Learning, p. 510, ISBN 1423902181.
[3] Goodrich, Michael T.; Tamassia, Roberto (2002), "1.5.2 Analyzing an Extendable Array Implementation", Algorithm Design: Foundations, Analysis and Internet Examples, Wiley, pp. 39–41.
[4] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001) [1990]. "17.4 Dynamic tables". Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 416–424. ISBN 0-262-03293-7.
[5] "C++ STL vector: definition, growth factor, member functions". Retrieved 2015-08-05.

[6] "vector growth factor of 1.5". comp.lang.c++.moderated. Google Groups.
[7] List object implementation from python.org, retrieved 2011-09-27.
[8] Brais, Hadi. "Dissecting the C++ STL Vector: Part 3 - Capacity & Size". Micromysteries. Retrieved 2015-08-05.
[9] "facebook/folly". GitHub. Retrieved 2015-08-05.
[10] Chris Okasaki (1995). "Purely Functional Random-Access Lists". Proceedings of the Seventh International Conference on Functional Programming Languages and Computer Architecture: 86–95. doi:10.1145/224164.224187.
[11] Gerald Kruse. CS 240 Lecture Notes: Linked Lists Plus: Complexity Trade-offs. Juniata College. Spring 2008.

[12] Day 1 Keynote - Bjarne Stroustrup: C++11 Style at GoingNative 2012 on channel9.msdn.com from minute 45 or foil 44.
[13] Number crunching: Why you should never, ever, EVER use linked-list in your code again at kjellkod.wordpress.com.
[14] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (1999), Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (PDF), Department of Computer Science, University of Waterloo.
[15] Goodrich, Michael T.; Kloss II, John G. (1999), "Tiered Vectors: Efficient Dynamic Arrays for Rank-Based Sequences", Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, 1663: 205–216, doi:10.1007/3-540-48447-7_21, ISBN 978-3-540-66279-2.
[16] Sitarski, Edward (September 1996), "HATs: Hashed array trees", Algorithm Alley, Dr. Dobb's Journal, 21 (11).
[17] Bagwell, Phil (2002), Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays, EPFL.
[18] Javadoc on ArrayList.
[19] ArrayList Class.

2.3.8 External links

• NIST Dictionary of Algorithms and Data Structures: Dynamic array
• VPOOL - C language implementation of dynamic array.
• CollectionSpy - A Java profiler with explicit support for debugging ArrayList- and Vector-related issues.
• Open Data Structures - Chapter 2 - Array-Based Lists

2.4 Linked list

In computer science, a linked list is a linear collection of data elements, called nodes, each pointing to the next node by means of a pointer. It is a data structure consisting of a group of nodes which together represent a sequence. Under the simplest form, each node is composed of data and a reference (in other words, a link) to the next node in the sequence. This structure allows for efficient insertion or removal of elements from any position in the sequence during iteration. More complex variants add additional links, allowing efficient insertion or removal from arbitrary element references.

[Figure: A linked list whose nodes contain two fields: an integer value and a link to the next node. The last node is linked to a terminator used to signify the end of the list.]

Linked lists are among the simplest and most common data structures. They can be used to implement several other common abstract data types, including lists (the abstract data type), stacks, queues, associative arrays, and S-expressions, though it is not uncommon to implement the other data structures directly without using a list as the basis of implementation.
The principal benefit of a linked list over a conventional array is that the list elements can easily be inserted or removed without reallocation or reorganization of the entire structure, because the data items need not be stored contiguously in memory or on disk, while an array has to be declared in the source code, before compiling and running the program. Linked lists allow insertion and removal of nodes at any point in the list, and can do so with a constant number of operations if the link previous to the link being added or removed is maintained during list traversal.
On the other hand, simple linked lists by themselves do not allow random access to the data, or any form of efficient indexing. Thus, many basic operations (such as obtaining the last node of the list, assuming that the last node is not maintained as a separate node reference in the list structure; finding a node that contains a given datum; or locating the place where a new node should be inserted) may require sequential scanning of most or all of the list elements. The advantages and disadvantages of using linked lists are given below.

2.4.1 Advantages

• Linked lists are a dynamic data structure, which can grow and be pruned, allocating and deallocating memory while the program is running.
• Insertion and deletion node operations are easily implemented in a linked list.
• Dynamic data structures such as stacks and queues can be implemented using a linked list.
• There is no need to define an initial size for a linked list.
• Items can be added or removed from the middle of the list.

2.4.2 Disadvantages

• They use more memory than arrays because of the storage used by their pointers.
• Nodes in a linked list must be read in order from the beginning, as linked lists are inherently sequential access.
• Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
• Difficulties arise in linked lists when it comes to reverse traversing. For instance, singly linked lists are cumbersome to navigate backwards,[1] and while doubly linked lists are somewhat easier to read, memory is wasted in allocating space for a back-pointer.

2.4.3 History

Linked lists were developed in 1955–1956 by Allen Newell, Cliff Shaw and Herbert A. Simon at RAND Corporation as the primary data structure for their Information Processing Language. IPL was used by the authors to develop several early artificial intelligence programs, including the Logic Theory Machine, the General Problem Solver, and a computer chess program. Reports on their work appeared in IRE Transactions on Information Theory in 1956, and in several conference proceedings from 1957 to 1959, including Proceedings of the Western Joint Computer Conference in 1957 and 1958, and Information Processing (Proceedings of the first UNESCO International Conference on Information Processing) in 1959. The now-classic diagram consisting of blocks representing list nodes with arrows pointing to successive list nodes appears in "Programming the Logic Theory Machine" by Newell and Shaw in Proc. WJCC, February 1957. Newell and Simon were recognized with the ACM Turing Award in 1975 for having "made basic contributions to artificial intelligence, the psychology of human cognition, and list processing". The problem of machine translation for natural language processing led Victor Yngve at the Massachusetts Institute of Technology (MIT) to use linked lists as data structures in his COMIT programming language for computer research in the field of linguistics. A report on this language entitled "A programming language for mechanical translation" appeared in Mechanical Translation in 1958.
LISP, standing for list processor, was created by John McCarthy in 1958 while he was at MIT, and in 1960 he published its design in a paper in the Communications of the ACM, entitled "Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I". One of LISP's major data structures is the linked list.
By the early 1960s, the utility of both linked lists and languages which use these structures as their primary data representation was well established. Bert Green of the MIT Lincoln Laboratory published a review article entitled "Computer languages for symbol manipulation" in IRE Transactions on Human Factors in Electronics in March 1961 which summarized the advantages of the linked list approach. A later review article, "A Comparison of list-processing computer languages" by Bobrow and Raphael, appeared in Communications of the ACM in April 1964.
Several operating systems developed by Technical Systems Consultants (originally of West Lafayette, Indiana, and later of Chapel Hill, North Carolina) used singly linked lists as file structures. A directory entry pointed to the first sector of a file, and succeeding portions of the file were located by traversing pointers. Systems using this technique included Flex (for the Motorola 6800 CPU), mini-Flex (same CPU), and Flex9 (for the Motorola 6809 CPU). A variant developed by TSC for, and marketed by, Smoke Signal Broadcasting in California used doubly linked lists in the same manner.
The TSS/360 operating system, developed by IBM for the System 360/370 machines, used a double linked list for their file system catalog. The directory structure was similar to Unix, where a directory could contain files and other directories and extend to any depth.

2.4.4 Basic concepts and nomenclature

Each record of a linked list is often called an 'element' or 'node'.
The field of each node that contains the address of the next node is usually called the 'next link' or 'next pointer'. The remaining fields are known as the 'data', 'information', 'value', 'cargo', or 'payload' fields.
The 'head' of a list is its first node. The 'tail' of a list may refer either to the rest of the list after the head, or to the last node in the list. In Lisp and some derived languages, the next node may be called the 'cdr' (pronounced could-er) of the list, while the payload of the head node may be called the 'car'.
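In C, this nomenclature maps onto a pair of struct declarations like the following (a sketch with an int payload; real code would choose the payload type to suit):

    struct node {
        int          data;  /* the payload or 'cargo' field */
        struct node *next;  /* the next link; NULL when there is no next node */
    };

    struct list {
        struct node *head;  /* the list handle: its first node, or NULL */
    };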

In the last node of a list, the link field often contains a null reference, a special value used to indicate the lack of further nodes. A less common convention is to make it point to the first node of the list; in that case the list is said to be 'circular' or 'circularly linked'; otherwise it is said to be 'open' or 'linear'.

Singly linked list

Singly linked lists contain nodes which have a data field as well as a 'next' field, which points to the next node in the line of nodes. Operations that can be performed on singly linked lists include insertion, deletion and traversal.

[Figure: A singly linked list whose nodes contain two fields: an integer value and a link to the next node.]

Doubly linked list

Main article: Doubly linked list
In a 'doubly linked list', each node contains, besides the next-node link, a second link field pointing to the 'previous' node in the sequence. The two links may be called 'forward(s)' and 'backwards', or 'next' and 'prev' ('previous').

[Figure: A doubly linked list whose nodes contain three fields: an integer value, the link forward to the next node, and the link backward to the previous node.]

A technique known as XOR-linking allows a doubly linked list to be implemented using a single link field in each node. However, this technique requires the ability to do bit operations on addresses, and therefore may not be available in some high-level languages.
Many modern operating systems use doubly linked lists to maintain references to active processes, threads, and other dynamic objects.[2] A common strategy for rootkits to evade detection is to unlink themselves from these lists.[3]

Multiply linked list

In a 'multiply linked list', each node contains two or more link fields, each field being used to connect the same set of data records in a different order (e.g., by name, by department, by date of birth, etc.). While doubly linked lists can be seen as special cases of multiply linked lists, the fact that the two orders are opposite to each other leads to simpler and more efficient algorithms, so they are usually treated as a separate case.

Circular linked list

[Figure: A circular linked list.]

In the case of a circular doubly linked list, the only change that occurs is that the end, or "tail", of the said list is linked back to the front, or "head", of the list, and vice versa.

Sentinel nodes

Main article: Sentinel node

In some implementations an extra 'sentinel' or 'dummy' node may be added before the first data record or after the last one. This convention simplifies and accelerates some list-handling algorithms, by ensuring that all links can be safely dereferenced and that every list (even one that contains no data elements) always has a "first" and "last" node.

Empty lists

An empty list is a list that contains no data records. This is usually the same as saying that it has zero nodes. If sentinel nodes are being used, the list is usually said to be empty when it has only sentinel nodes.

Hash linking

The link fields need not be physically part of the nodes. If the data records are stored in an array and referenced by their indices, the link field may be stored in a separate array with the same indices as the data records.

List handles

Since a reference to the first node gives access to the whole list, that reference is often called the 'address', 'pointer', or 'handle' of the list. Algorithms that manipulate linked lists usually get such handles to the input lists and return the handles to the resulting lists. In fact, in the context of such algorithms, the word "list" often means "list handle". In some situations, however, it may be convenient to refer to a list by a handle that consists of two links, pointing to its first and last nodes.

Combining alternatives
The alternatives listed above may be arbitrarily combined in almost every way, so one may have circular doubly linked lists without sentinels, circular singly linked lists with sentinels, etc.

2.4.5 Tradeoffs

As with most choices in computer programming and design, no method is well suited to all circumstances. A linked list data structure might work well in one case, but cause problems in another. This is a list of some of the common tradeoffs involving linked list structures.

Linked lists vs. dynamic arrays

A dynamic array is a data structure that allocates all elements contiguously in memory, and keeps a count of the current number of elements. If the space reserved for the dynamic array is exceeded, it is reallocated and (possibly) copied, which is an expensive operation.
Linked lists have several advantages over dynamic arrays. Insertion or deletion of an element at a specific point of a list, assuming that we have already indexed a pointer to the node (before the one to be removed, or before the insertion point), is a constant-time operation (otherwise, without this reference, it is O(n)), whereas insertion in a dynamic array at random locations will require moving half of the elements on average, and all the elements in the worst case. While one can "delete" an element from an array in constant time by somehow marking its slot as "vacant", this causes fragmentation that impedes the performance of iteration.
Moreover, arbitrarily many elements may be inserted into a linked list, limited only by the total memory available; while a dynamic array will eventually fill up its underlying array data structure and will have to reallocate, an expensive operation, one that may not even be possible if memory is fragmented, although the cost of reallocation can be averaged over insertions, and the cost of an insertion due to reallocation would still be amortized O(1). This helps with appending elements at the array's end, but inserting into (or removing from) middle positions still carries prohibitive costs due to data moving to maintain contiguity. An array from which many elements are removed may also have to be resized in order to avoid wasting too much space.
On the other hand, dynamic arrays (as well as fixed-size array data structures) allow constant-time random access, while linked lists allow only sequential access to elements. Singly linked lists, in fact, can be easily traversed in only one direction. This makes linked lists unsuitable for applications where it's useful to look up an element by its index quickly, such as heapsort. Sequential access on arrays and dynamic arrays is also faster than on linked lists on many machines, because they have optimal locality of reference and thus make good use of data caching.
Another disadvantage of linked lists is the extra storage needed for references, which often makes them impractical for lists of small data items such as characters or boolean values, because the storage overhead for the links may exceed by a factor of two or more the size of the data. In contrast, a dynamic array requires only the space for the data itself (and a very small amount of control data).[note 1] It can also be slow, and with a naïve allocator, wasteful, to allocate memory separately for each new element, a problem generally solved using memory pools.
Some hybrid solutions try to combine the advantages of the two representations. Unrolled linked lists store several elements in each list node, increasing cache performance while decreasing memory overhead for references. CDR coding does both these as well, by replacing references with the actual data referenced, which extends off the end of the referencing record.
A good example that highlights the pros and cons of

using dynamic arrays vs. linked lists is implementing a program that resolves the Josephus problem. The Josephus problem is an election method that works by having a group of people stand in a circle. Starting at a predetermined person, you count around the circle n times. Once you reach the nth person, take them out of the circle and have the members close the circle. Then count around the circle the same n times and repeat the process, until only one person is left. That person wins the election. This shows the strengths and weaknesses of a linked list vs. a dynamic array, because if you view the people as connected nodes in a circular linked list, then it shows how easily the linked list is able to delete nodes (as it only has to rearrange the links to the different nodes). However, the linked list will be poor at finding the next person to remove and will need to search through the list until it finds that person. A dynamic array, on the other hand, will be poor at deleting nodes (or elements) as it cannot remove one node without individually shifting all the elements up the list by one. However, it is exceptionally easy to find the nth person in the circle by directly referencing them by their position in the array.
The list ranking problem concerns the efficient conversion of a linked list representation into an array. Although trivial for a conventional computer, solving this problem by a parallel algorithm is complicated and has been the subject of much research.
A balanced tree has similar memory access patterns and space overhead to a linked list while permitting much more efficient indexing, taking O(log n) time instead of O(n) for a random access. However, insertion and deletion operations are more expensive due to the overhead of tree manipulations to maintain balance. Schemes exist for trees to automatically maintain themselves in a balanced state: AVL trees or red-black trees.

Doubly linked vs. singly linked

Double-linked lists require more space per node (unless one uses XOR-linking), and their elementary operations are more expensive; but they are often easier to manipulate because they allow fast and easy sequential access to the list in both directions. In a doubly linked list, one can insert or delete a node in a constant number of operations given only that node's address. To do the same in a singly linked list, one must have the address of the pointer to that node, which is either the handle for the whole list (in case of the first node) or the link field in the previous node. Some algorithms require access in both directions. On the other hand, doubly linked lists do not allow tail-sharing and cannot be used as persistent data structures.

Circularly linked vs. linearly linked

A circularly linked list may be a natural option to represent arrays that are naturally circular, e.g. the corners of a polygon, a pool of buffers that are used and released in FIFO ("first in, first out") order, or a set of processes that should be time-shared in round-robin order. In these applications, a pointer to any node serves as a handle to the whole list.
With a circular list, a pointer to the last node gives easy access also to the first node, by following one link. Thus, in applications that require access to both ends of the list (e.g., in the implementation of a queue), a circular structure allows one to handle the structure by a single pointer, instead of two.
A circular list can be split into two circular lists, in constant time, by giving the addresses of the last node of each piece. The operation consists in swapping the contents of the link fields of those two nodes. Applying the same operation to any two nodes in two distinct lists joins the two lists into one. This property greatly simplifies some algorithms and data structures, such as the quad-edge and face-edge.

Singly linked linear lists vs. other lists

While doubly linked and circular lists have advantages over singly linked linear lists, linear lists offer some advantages that make them preferable in some situations.
A singly linked linear list is a recursive data structure, because it contains a pointer to a smaller object of the same type. For that reason, many operations on singly linked linear lists (such as merging two lists, or enumerating the elements in reverse order) often have very simple recursive algorithms, much simpler than any solution using iterative commands. While those recursive solutions can be adapted for doubly linked and circularly linked lists, the procedures generally need extra arguments and more complicated base cases. (A sketch of two such recursive routines appears after this section.)
Linear singly linked lists also allow tail-sharing, the use of a common final portion of a sub-list as the terminal portion of two different lists. In particular, if a new node is added at the beginning of a list, the former list remains available as the tail of the new one, a simple example of a persistent data structure. Again, this is not true with the other variants: a node may never belong to two different circular or doubly linked lists.
In particular, end-sentinel nodes can be shared among singly linked non-circular lists. The same end-sentinel node may be used for every such list. In Lisp, for example, every proper list ends with a link to a special node, denoted by nil or (), whose CAR and CDR links point to itself. Thus a Lisp procedure can safely take the CAR or CDR of any list.
The advantages of the fancy variants are often limited to the complexity of the algorithms, not their efficiency. A circular list, in particular, can usually be emulated by a linear list together with two variables that point to the first and last nodes, at no extra cost.
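For instance, under a node declaration like the one used earlier in this chapter, the following C routines compute a list's length and print its elements in reverse order, recursing before visiting, with the empty list as the base case. This is a sketch of the recursive style the section describes, not a library.

    #include <stdio.h>

    struct node { int data; struct node *next; };

    /* Number of nodes in the list: the empty list has length 0. */
    static int length(const struct node *n) {
        return n == NULL ? 0 : 1 + length(n->next);
    }

    /* Enumerate the elements in reverse order by recursing first. */
    static void print_reversed(const struct node *n) {
        if (n == NULL) return;
        print_reversed(n->next);
        printf("%d\n", n->data);
    }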

The simplest representation for an empty circular list (when such a thing makes sense) is a null pointer, indicating that the list has no nodes. Without this choice, many algorithms have to test for this special case, and handle it separately. By contrast, the use of null to denote an empty linear list is more natural and often creates fewer special cases.

Using sentinel nodes

Sentinel nodes may simplify certain list operations, by ensuring that the next or previous nodes exist for every element, and that even empty lists have at least one node. One may also use a sentinel node at the end of the list, with an appropriate data field, to eliminate some end-of-list tests. For example, when scanning the list looking for a node with a given value x, setting the sentinel's data field to x makes it unnecessary to test for end-of-list inside the loop. Another example is merging two sorted lists: if their sentinels have data fields set to +∞, the choice of the next output node does not need special handling for empty lists.
However, sentinel nodes use up extra space (especially in applications that use many short lists), and they may complicate other operations (such as the creation of a new empty list).
However, if the circular list is used merely to simulate a linear list, one may avoid some of this complexity by adding a single sentinel node to every list, between the last and the first data nodes. With this convention, an empty list consists of the sentinel node alone, pointing to itself via the next-node link. The list handle should then be a pointer to the last data node, before the sentinel, if the list is not empty; or to the sentinel itself, if the list is empty.
The same trick can be used to simplify the handling of a doubly linked linear list, by turning it into a circular doubly linked list with a single sentinel node. However, in this case, the handle should be a single pointer to the dummy node itself.[9]
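One way to realize this single-sentinel convention in C (a sketch; the field and function names are ours):

    struct node { int data; struct node *next; };

    struct list {
        struct node  sentinel;  /* sits between the last and the first data node */
        struct node *last;      /* handle: the last data node, or the sentinel */
    };

    static void list_init(struct list *l) {
        l->sentinel.next = &l->sentinel;  /* empty list: sentinel points to itself */
        l->last = &l->sentinel;
    }

    static int list_is_empty(const struct list *l) {
        return l->last == &l->sentinel;
    }

With this layout every traversal can dereference next unconditionally, and the empty list needs no null test beyond the one in list_is_empty.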

2.4.6 Linked list operations

When manipulating linked lists in-place, care must be taken to not use values that you have invalidated in previous assignments. This makes algorithms for inserting or deleting linked list nodes somewhat subtle. This section gives pseudocode for adding or removing nodes from singly, doubly, and circularly linked lists in-place. Throughout we will use null to refer to an end-of-list marker or sentinel, which may be implemented in a number of ways.

Linearly linked lists

Singly linked lists

Our node data structure will have two fields. We also keep a variable firstNode which always points to the first node in the list, or is null for an empty list.

    record Node {
        data            // The data being stored in the node
        Node next       // A reference to the next node, null for last node
    }
    record List {
        Node firstNode  // points to first node of list; null for empty list
    }

Traversal of a singly linked list is simple, beginning at the first node and following each next link until we come to the end:

    node := list.firstNode
    while node not null
        (do something with node.data)
        node := node.next

The following code inserts a node after an existing node in a singly linked list. Inserting a node before an existing one cannot be done directly; instead, one must keep track of the previous node and insert a node after it.

[Diagram: inserting newNode after node.]

    function insertAfter(Node node, Node newNode)  // insert newNode after node
        newNode.next := node.next
        node.next := newNode

Inserting at the beginning of the list requires a separate function. This requires updating firstNode.

    function insertBeginning(List list, Node newNode)  // insert node before current first node
        newNode.next := list.firstNode
        list.firstNode := newNode

Similarly, we have functions for removing the node after a given node, and for removing a node from the beginning of the list. To find and remove a particular node, one must again keep track of the previous element.

[Diagram: removing the node after node.]

    function removeAfter(Node node)  // remove node past this one
        obsoleteNode := node.next
        node.next := node.next.next
        destroy obsoleteNode

    function removeBeginning(List list)  // remove first node
        obsoleteNode := list.firstNode
        list.firstNode := list.firstNode.next  // point past deleted node
        destroy obsoleteNode

Notice that removeBeginning() sets list.firstNode to null when removing the last node in the list.
Since we can't iterate backwards, efficient insertBefore or removeBefore operations are not possible. Inserting into a list before a specific node requires traversing the list, which would have a worst-case running time of O(n).
Appending one linked list to another can be inefficient unless a reference to the tail is kept as part of the List structure, because we must traverse the entire first list in order to find the tail, and then append the second list to this. Thus, if two linearly linked lists are each of length n, list appending has asymptotic time complexity of O(n). In the Lisp family of languages, list appending is provided by the append procedure.
Many of the special cases of linked list operations can be eliminated by including a dummy element at the front of the list. This ensures that there are no special cases for the beginning of the list and renders both insertBeginning() and removeBeginning() unnecessary. In this case, the first useful data in the list will be found at list.firstNode.next.

Circularly linked list

In a circularly linked list, all nodes are linked in a continuous circle, without using null. For lists with a front and a back (such as a queue), one stores a reference to the last node in the list. The next node after the last node is the first node. Elements can be added to the back of the list and removed from the front in constant time.
Circularly linked lists can be either singly or doubly linked.
Both types of circularly linked lists benefit from the ability to traverse the full list beginning at any given node. This often allows us to avoid storing firstNode and lastNode, although if the list may be empty we need a special representation for the empty list, such as a lastNode variable which points to some node in the list or is null if it's empty; we use such a lastNode here.
This representation significantly simplifies adding and removing nodes with a non-empty list, but empty lists are then a special case.

Algorithms

Assuming that someNode is some node in a non-empty circular singly linked list, this code iterates through that list starting with someNode:

    function iterate(someNode)
        if someNode ≠ null
            node := someNode
            do
                do something with node.value
                node := node.next
            while node ≠ someNode

Notice that the test "while node ≠ someNode" must be at the end of the loop. If the test was moved to the beginning of the loop, the procedure would fail whenever the list had only one node.
This function inserts a node "newNode" into a circular linked list after a given node "node". If "node" is null, it assumes that the list is empty.

    function insertAfter(Node node, Node newNode)
        if node = null
            newNode.next := newNode
        else
            newNode.next := node.next
            node.next := newNode

Suppose that "L" is a variable pointing to the last node of a circular linked list (or null if the list is empty). To append "newNode" to the end of the list, one may do

    insertAfter(L, newNode)
    L := newNode

To insert "newNode" at the beginning of the list, one may do

    insertAfter(L, newNode)
    if L = null
        L := newNode

2.4.7 Linked lists using arrays of nodes

Languages that do not support any type of reference can still create links by replacing pointers with array indices. The approach is to keep an array of records, where each record has integer fields indicating the index of the next (and possibly previous) node in the array. Not all nodes in the array need be used. If records are also not supported, parallel arrays can often be used instead.
As an example, consider the following linked list record that uses arrays instead of pointers:

    record Entry {
        integer next   // index of next entry in array
        integer prev   // previous entry (if double-linked)
        string  name
        real    balance
    }

A linked list can be built by creating an array of these structures, and an integer variable to store the index of the first element.

    integer listHead
    Entry Records[1000]

Links between elements are formed by placing the array index of the next (or previous) cell into the Next or Prev field within a given element. For example:

[Table: example entries with their Next and Prev indices.]

In the above example, ListHead would be set to 2, the location of the first entry in the list. Notice that entries 3 and 5 through 7 are not part of the list. These cells are available for any additions to the list. By creating a ListFree integer variable, a free list could be created to keep track of what cells are available. If all entries are in use, the size of the array would have to be increased or some elements would have to be deleted before new entries could be stored in the list.
The following code would traverse the list and display names and account balances:

    i := listHead
    while i ≥ 0  // loop through the list
        print i, Records[i].name, Records[i].balance  // print entry
        i := Records[i].next

When faced with a choice, the advantages of this approach include:
• The linked list is relocatable, meaning it can be moved about in memory at will, and it can also be quickly and directly serialized for storage on disk or transfer over a network.

• Especially for a small list, array indexes can occupy significantly less space than a full pointer on many architectures.
• Locality of reference can be improved by keeping the nodes together in memory and by periodically rearranging them, although this can also be done in a general store.
• Naïve dynamic memory allocators can produce an excessive amount of overhead storage for each node allocated; almost no allocation overhead is incurred per node in this approach.
• Seizing an entry from a pre-allocated array is faster than using dynamic memory allocation for each node, since dynamic memory allocation typically requires a search for a free memory block of the desired size.
This approach has one main disadvantage, however: it creates and manages a private memory space for its nodes. This leads to the following issues:
• It increases complexity of the implementation.
• Growing a large array when it is full may be difficult or impossible, whereas finding space for a new linked list node in a large, general memory pool may be easier.
• Adding elements to a dynamic array will occasionally (when it is full) unexpectedly take linear (O(n)) instead of constant time (although it's still an amortized constant).
• Using a general memory pool leaves more memory for other data if the list is smaller than expected or if many nodes are freed.
For these reasons, this approach is mainly used for languages that do not support dynamic memory allocation. These disadvantages are also mitigated if the maximum size of the list is known at the time the array is created.

2.4.8 Language support

Many programming languages such as Lisp and Scheme have singly linked lists built in. In many functional languages, these lists are constructed from nodes, each called a cons or cons cell. The cons has two fields: the car, a reference to the data for that node, and the cdr, a reference to the next node. Although cons cells can be used to build other data structures, this is their primary purpose.
In languages that support abstract data types or templates, linked list ADTs or templates are available for building linked lists. In other languages, linked lists are typically built using references together with records.

2.4.9 Internal and external storage

When constructing a linked list, one is faced with the choice of whether to store the data of the list directly in the linked list nodes, called internal storage, or merely to store a reference to the data, called external storage. Internal storage has the advantage of making access to the data more efficient, requiring less storage overall, having better locality of reference, and simplifying memory management for the list (its data is allocated and deallocated at the same time as the list nodes).
External storage, on the other hand, has the advantage of being more generic, in that the same data structure and machine code can be used for a linked list no matter what the size of the data is. It also makes it easy to place the same data in multiple linked lists. Although with internal storage the same data can be placed in multiple lists by including multiple next references in the node data structure, it would then be necessary to create separate routines to add or delete cells based on each field. It is possible to create additional linked lists of elements that use internal storage by using external storage, and having the cells of the additional linked lists store references to the nodes of the linked list containing the data.
In general, if a set of data structures needs to be included in linked lists, external storage is the best approach. If a set of data structures need to be included in only one linked list, then internal storage is slightly better, unless a generic linked list package using external storage is available. Likewise, if different sets of data that can be stored in the same data structure are to be included in a single linked list, then internal storage would be fine.
Another approach that can be used with some languages involves having different data structures, but all have the initial fields, including the next (and prev if double linked list) references in the same location. After defining separate structures for each type of data, a generic structure can be defined that contains the minimum amount of data shared by all the other structures and contained at the top (beginning) of the structures. Then generic routines can be created that use the minimal structure to perform linked list type operations, but separate routines can then handle the specific data. This approach is often used in message parsing routines, where several types of messages are received, but all start with the same set of fields, usually including a field for message type. The generic routines are used to add new messages to a queue when they are received, and remove them from the queue in order to process the message. The message type field is then used to call the correct routine to process the specific type of message.

Example of internal and external storage

Suppose you wanted to create a linked list of families and their members.

Example of internal and external storage

Suppose you wanted to create a linked list of families and their members. Using internal storage, the structure might look like the following:

record member {  // member of a family
    member next;
    string firstName;
    integer age;
}
record family {  // the family itself
    family next;
    string lastName;
    string address;
    member members  // head of list of members of this family
}

To print a complete list of families and their members using internal storage, we could write:

aFamily := Families  // start at head of families list
while aFamily ≠ null  // loop through list of families
    print information about family
    aMember := aFamily.members  // get head of list of this family's members
    while aMember ≠ null  // loop through list of members
        print information about member
        aMember := aMember.next
    aFamily := aFamily.next

Using external storage, we would create the following structures:

record node {  // generic link structure
    node next;
    pointer data  // generic pointer for data at node
}
record member {  // structure for family member
    string firstName;
    integer age
}
record family {  // structure for family
    string lastName;
    string address;
    node members  // head of list of members of this family
}

To print a complete list of families and their members using external storage, we could write:

famNode := Families  // start at head of families list
while famNode ≠ null  // loop through list of families
    aFamily := (family) famNode.data  // extract family from node
    print information about family
    memNode := aFamily.members  // get list of family members
    while memNode ≠ null  // loop through list of members
        aMember := (member) memNode.data  // extract member from node
        print information about member
        memNode := memNode.next
    famNode := famNode.next

Notice that when using external storage, an extra step is needed to extract the record from the node and cast it into the proper data type. This is because both the list of families and the list of members within the family are stored in two linked lists using the same data structure (node), and this language does not have parametric types.

As long as the number of families that a member can belong to is known at compile time, internal storage works fine. If, however, a member needed to be included in an arbitrary number of families, with the specific number known only at run time, external storage would be necessary.

Speeding up search

Finding a specific element in a linked list, even if it is sorted, normally requires O(n) time (linear search). This is one of the primary disadvantages of linked lists over other data structures. In addition to the variants discussed above, below are two simple ways to improve search time.

In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches, ensures that the most recently used items are also the quickest to find again.
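For concreteness, here is a small C sketch of the move-to-front heuristic on a singly linked list; the node layout is illustrative.

#include <stddef.h>

struct node {
    int key;
    struct node *next;
};

/* Searches for key; if found, unlinks the node and relinks it at the
 * head, so that repeated lookups of hot keys become cheap. */
struct node *find_mtf(struct node **head, int key) {
    struct node *prev = NULL, *cur = *head;
    while (cur != NULL && cur->key != key) {
        prev = cur;
        cur = cur->next;
    }
    if (cur != NULL && prev != NULL) {  /* found, and not already first */
        prev->next = cur->next;         /* unlink from current position */
        cur->next = *head;              /* relink at the front          */
        *head = cur;
    }
    return cur;                         /* NULL if key is absent */
}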
Another common approach is to "index" a linked list using a more efficient external data structure. For example, one can build a red-black tree or hash table whose elements are references to the linked list nodes. Multiple such indexes can be built on a single list. The disadvantage is that these indexes may need to be updated each time a node is added or removed (or at least, before that index is used again).

Random access lists

A random access list is a list with support for fast random access to read or modify any element in the list.[10] One possible implementation is a skew binary random access list using the skew binary number system, which involves a list of trees with special properties; this allows worst-case constant time head/cons operations, and worst-case logarithmic time random access to an element by index.[10] Random access lists can be implemented as persistent data structures.[10]

Random access lists can be viewed as immutable linked lists in that they likewise support the same O(1) head and tail operations.[10]

A simple extension to random access lists is the min-list, which provides an additional operation that yields the minimum element in the entire list in constant time (without mutation complexities).[10]

2.4.10 Related data structures

Both stacks and queues are often implemented using linked lists, and simply restrict the type of operations which are supported.

The skip list is a linked list augmented with layers of pointers for quickly jumping over large numbers of elements, and then descending to the next layer. This process continues down to the bottom layer, which is the actual list.

A binary tree can be seen as a type of linked list where the elements are themselves linked lists of the same nature. The result is that each node may include a reference to the first node of one or two other linked lists, which, together with their contents, form the subtrees below that node.

An unrolled linked list is a linked list in which each node contains an array of data values. This leads to improved cache performance, since more list elements are contiguous in memory, and reduced memory overhead, because less metadata needs to be stored for each element of the list.
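A possible node layout for an unrolled linked list, sketched in C; the capacity of 16 is an arbitrary illustrative choice.

enum { UNROLL_CAP = 16 };

struct unode {
    struct unode *next;
    int count;               /* number of slots of values[] in use */
    int values[UNROLL_CAP];  /* several elements share one node    */
};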

A hash table may use linked lists to store the chains of items that hash to the same position in the hash table.

A heap shares some of the ordering properties of a linked list, but is almost always implemented using an array. Instead of references from node to node, the next and previous data indexes are calculated using the current data's index.

A self-organizing list rearranges its nodes based on some heuristic which reduces search times for data retrieval by keeping commonly accessed nodes at the head of the list.

2.4.11 Notes

[1] The amount of control data required for a dynamic array is usually of the form K + B·n, where K is a per-array constant, B is a per-dimension constant, and n is the number of dimensions. K and B are typically on the order of 10 bytes.

2.4.12 Footnotes

[1] Skiena, Steven S. (2009). The Algorithm Design Manual (2nd ed.). Springer. p. 76. ISBN 9781848000704. "We can do nothing without this list predecessor, and so must spend linear time searching for it on a singly-linked list."

[2] http://www.osronline.com/article.cfm?article=499

[3] http://www.cs.dartmouth.edu/~sergey/me/cs/cs108/rootkits/bh-us-04-butler.pdf

[4] Okasaki, Chris (1995). "Purely Functional Random-Access Lists". Proceedings of the Seventh International Conference on Functional Programming Languages and Computer Architecture: 86–95. doi:10.1145/224164.224187.

[5] Kruse, Gerald. CS 240 Lecture Notes: Linked Lists Plus: Complexity Trade-offs. Juniata College. Spring 2008.

[6] Day 1 Keynote – Bjarne Stroustrup: C++11 Style, at GoingNative 2012 on channel9.msdn.com, from minute 45 or foil 44.

[7] Number crunching: Why you should never, ever, EVER use linked-list in your code again, at kjellkod.wordpress.com.

[8] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, J. I.; Demaine, E. D. (1999). Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (PDF). Department of Computer Science, University of Waterloo.

[9] Ford, William; Topp, William (2002). Data Structures with C++ using STL (2nd ed.). Prentice-Hall. pp. 466–467. ISBN 0-13-085850-1.

[10] Okasaki, Chris (1995). Purely Functional Random-Access Lists (PS). In Functional Programming Languages and Computer Architecture. ACM Press. pp. 86–95. Retrieved May 7, 2015.

2.4.14 External links

• Description from the Dictionary of Algorithms and Data Structures
• Introduction to Linked Lists, Stanford University Computer Science Library
• Linked List Problems, Stanford University Computer Science Library
• Open Data Structures – Chapter 3 – Linked Lists
• Patent for the idea of having nodes which are in several linked lists simultaneously (note that this technique was widely used for many decades before the patent was granted)

2.5 Doubly linked list

In computer science, a doubly linked list is a linked data structure that consists of a set of sequentially linked records called nodes. Each node contains two fields, called links, that are references to the previous and to the next node in the sequence of nodes. The beginning and ending nodes' previous and next links, respectively, point to some kind of terminator, typically a sentinel node or null, to facilitate traversal of the list. If there is only one sentinel node, then the list is circularly linked via the sentinel node. It can be conceptualized as two singly linked lists formed from the same data items, but in opposite sequential orders.

A doubly linked list whose nodes contain three fields: an integer value, the link to the next node, and the link to the previous node.

The two node links allow traversal of the list in either direction. While adding or removing a node in a doubly linked list requires changing more links than the same operations on a singly linked list, the operations are simpler and potentially more efficient (for nodes other than first nodes) because there is no need to keep track of the previous node during traversal, or to traverse the list to find the previous node, so that its link can be modified. The concept is also the basis for the mnemonic link system memorization technique.

2.5.1 Nomenclature and implementation

The first and last nodes of a doubly linked list are immediately accessible (i.e., accessible without traversal, and usually called head and tail) and therefore allow traversal of the list from the beginning or end of the list, respectively: e.g., traversing the list from beginning to end, or from end to beginning, in a search of the list for a node with a specific data value. Any node of a doubly linked list, once obtained, can be used to begin a new traversal of the list, in either direction (towards beginning or end), from the given node.

The link fields of a doubly linked list node are often called next and previous or forward and backward. The references stored in the link fields are usually implemented as pointers, but (as in any linked data structure) they may also be address offsets or indices into an array where the nodes live.

2.5.2 Basic algorithms

Consider the following basic algorithms, written in pseudocode:

Open doubly linked lists

record DoublyLinkedNode {
    prev  // A reference to the previous node
    next  // A reference to the next node
    data  // Data or a reference to data
}

record DoublyLinkedList {
    DoublyLinkedNode firstNode  // points to first node of list
    DoublyLinkedNode lastNode   // points to last node of list
}

Traversing the list

Traversal of a doubly linked list can be in either direction. In fact, the direction of traversal can change many times, if desired. Traversal is often called iteration, but that choice of terminology is unfortunate, for iteration has well-defined semantics (e.g., in mathematics) which are not analogous to traversal.

Forwards

node := list.firstNode
while node ≠ null
    <do something with node.data>
    node := node.next

Backwards

node := list.lastNode
while node ≠ null
    <do something with node.data>
    node := node.prev

Inserting a node

These symmetric functions insert a node either after or before a given node:

function insertAfter(List list, Node node, Node newNode)
    newNode.prev := node
    newNode.next := node.next
    if node.next == null
        list.lastNode := newNode
    else
        node.next.prev := newNode
    node.next := newNode

function insertBefore(List list, Node node, Node newNode)
    newNode.prev := node.prev
    newNode.next := node
    if node.prev == null
        list.firstNode := newNode
    else
        node.prev.next := newNode
    node.prev := newNode

We also need a function to insert a node at the beginning of a possibly empty list:

function insertBeginning(List list, Node newNode)
    if list.firstNode == null
        list.firstNode := newNode
        list.lastNode := newNode
        newNode.prev := null
        newNode.next := null
    else
        insertBefore(list, list.firstNode, newNode)

A symmetric function inserts at the end:

function insertEnd(List list, Node newNode)
    if list.lastNode == null
        insertBeginning(list, newNode)
    else
        insertAfter(list, list.lastNode, newNode)

Removing a node

Removal of a node is easier than insertion, but requires special handling if the node to be removed is the firstNode or lastNode:

function remove(List list, Node node)
    if node.prev == null
        list.firstNode := node.next
    else
        node.prev.next := node.next
    if node.next == null
        list.lastNode := node.prev
    else
        node.next.prev := node.prev

One subtle consequence of the above procedure is that deleting the last node of a list sets both firstNode and lastNode to null, and so it handles removing the last node from a one-element list correctly. Notice that we also don't need separate "removeBefore" or "removeAfter" methods, because in a doubly linked list we can just use "remove(node.prev)" or "remove(node.next)" where these are valid. This also assumes that the node being removed is guaranteed to exist. If the node does not exist in this list, then some error handling would be required.

Circular doubly linked lists

Traversing the list

Assuming that someNode is some node in a non-empty list, this code traverses through that list starting with someNode (any node will do):

Forwards

node := someNode
do
    do something with node.value
    node := node.next
while node ≠ someNode

Backwards

node := someNode
do
    do something with node.value
    node := node.prev
while node ≠ someNode

Notice the postponing of the test to the end of the loop. This is important for the case where the list contains only the single node someNode.

Inserting a node

This simple function inserts a node into a doubly linked circularly linked list after a given element:

function insertAfter(Node node, Node newNode)
    newNode.next := node.next
    newNode.prev := node
    node.next.prev := newNode
    node.next := newNode

To do an "insertBefore", we can simply "insertAfter(node.prev, newNode)".

Inserting an element in a possibly empty list requires a special function:

function insertEnd(List list, Node node)
    if list.lastNode == null
        node.prev := node
        node.next := node
    else
        insertAfter(list.lastNode, node)
    list.lastNode := node

To insert at the beginning we simply "insertAfter(list.lastNode, node)".

Finally, removing a node must deal with the case where the list empties:

function remove(List list, Node node)
    if node.next == node
        list.lastNode := null
    else
        node.next.prev := node.prev
        node.prev.next := node.next
        if node == list.lastNode
            list.lastNode := node.prev
    destroy node

Deleting a node

As in doubly linked lists, "removeAfter" and "removeBefore" can be implemented with "remove(list, node.prev)" and "remove(list, node.next)".

Doubly linked list implementation

The following program illustrates implementation of doubly linked list functionality in the C programming language.

/* Description: Double linked list header file
   License: GNU GPL v3 */
#ifndef DOUBLELINKEDLIST_H
#define DOUBLELINKEDLIST_H

/* Codes for various errors */
#define NOERROR       0x0
#define MEMALLOCERROR 0x01
#define LISTEMPTY     0x03
#define NODENOTFOUND  0x4

/* True or false */
#define TRUE  0x1
#define FALSE 0x0

/* DoubleLinkedList definition */
typedef struct DoubleLinkedList {
    int number;
    struct DoubleLinkedList* pPrevious;
    struct DoubleLinkedList* pNext;
} DoubleLinkedList;

/* Get data for each node */
extern DoubleLinkedList* GetNodeData(DoubleLinkedList* pNode);
/* Add a new node forward */
extern void AddNodeForward(void);
/* Add a new node in the reverse direction */
extern void AddNodeReverse(void);
/* Display nodes in forward direction */
extern void DisplayNodeForward(void);
/* Display nodes in reverse direction */
extern void DisplayNodeReverse(void);
/* Delete nodes in the DoubleLinkedList by searching for a node */
extern void DeleteNode(const int number);
/* Function to detect cycle in a DoubleLinkedList */
extern unsigned int DetectCycleinList(void);
/* Function to reverse nodes */
extern void ReverseNodes(void);
/* Function to display error message that DoubleLinkedList is empty */
void ErrorMessage(int Error);
/* Sort nodes (declared but not used by this example) */
extern void SortNodes(void);

#endif

/*****************************************************
Name: DoubledLinked.c  Version: 0.1
Description: Implementation of a DoubleLinkedList.
These functions provide the functionality of a doubly linked list.
Change history: 0.1 Initial version
License: GNU GPL v3
*****************************************************/
#include "DoubleLinkedList.h"
#include "stdlib.h"
#include "stdio.h"

/* Declare pHead */
DoubleLinkedList* pHead = NULL;
/* Variable for storing error status */
unsigned int Error = NOERROR;

DoubleLinkedList* GetNodeData(DoubleLinkedList* pNode)
{
    if (!(pNode)) {
        Error = MEMALLOCERROR;
        return NULL;
    }
    else {
        printf("\nEnter a number: ");
        scanf("%d", &pNode->number);
        return pNode;
    }
}

/* Add a node forward */
void AddNodeForward(void)
{
    DoubleLinkedList* pNode = malloc(sizeof(DoubleLinkedList));
    pNode = GetNodeData(pNode);
    if (pNode) {
        DoubleLinkedList* pCurrent = pHead;
        if (pHead == NULL) {
            pNode->pNext = NULL;
            pNode->pPrevious = NULL;
            pHead = pNode;
        }
        else {
            while (pCurrent->pNext != NULL) {
                pCurrent = pCurrent->pNext;
            }
            pCurrent->pNext = pNode;
            pNode->pNext = NULL;
            pNode->pPrevious = pCurrent;
        }
    }
    else {
        Error = MEMALLOCERROR;
    }
}

/* Function to add nodes in reverse direction.
   Arguments: none (node data is read in). Returns: nothing. */
void AddNodeReverse(void)
{
    DoubleLinkedList* pNode = malloc(sizeof(DoubleLinkedList));
    pNode = GetNodeData(pNode);
    if (pNode) {
        DoubleLinkedList* pCurrent = pHead;
        if (pHead == NULL) {
            pNode->pPrevious = NULL;
            pNode->pNext = NULL;
            pHead = pNode;
        }
        else {
            while (pCurrent->pPrevious != NULL) {
                pCurrent = pCurrent->pPrevious;
            }
            pNode->pPrevious = NULL;
            pNode->pNext = pCurrent;
            pCurrent->pPrevious = pNode;
            pHead = pNode;
        }
    }
    else {
        Error = MEMALLOCERROR;
    }
}

/* Display doubly linked list data in forward direction */
void DisplayNodeForward(void)
{
    DoubleLinkedList* pCurrent = pHead;
    if (pCurrent) {
        while (pCurrent != NULL) {
            printf("\nNumber in forward direction is %d ", pCurrent->number);
            pCurrent = pCurrent->pNext;
        }
    }
    else {
        Error = LISTEMPTY;
        ErrorMessage(Error);
    }
}

/* Display doubly linked list data in reverse direction */
void DisplayNodeReverse(void)
{
    DoubleLinkedList* pCurrent = pHead;
    if (pCurrent) {
        while (pCurrent->pNext != NULL) {
            pCurrent = pCurrent->pNext;
        }
        while (pCurrent) {
            printf("\nNumber in Reverse direction is %d ", pCurrent->number);
            pCurrent = pCurrent->pPrevious;
        }
    }
    else {
        Error = LISTEMPTY;
        ErrorMessage(Error);
    }
}

/* Delete nodes in a doubly linked list */
void DeleteNode(const int SearchNumber)
{
    unsigned int Nodefound = FALSE;
    DoubleLinkedList* pCurrent = pHead;
    if (pCurrent != NULL) {
        DoubleLinkedList* pNextNode = pCurrent->pNext;
        DoubleLinkedList* pTemp = NULL;
        if (pNextNode != NULL) {
            while ((pNextNode != NULL) && (Nodefound == FALSE)) {
                /* If search entry is at the beginning */
                if (pHead->number == SearchNumber) {
                    pCurrent = pHead->pNext;
                    pHead = pCurrent;
                    pHead->pPrevious = NULL;
                    Nodefound = TRUE;
                }
                /* If the search entry is somewhere in the list or at the end */
                else if (pNextNode->number == SearchNumber) {
                    Nodefound = TRUE;
                    pTemp = pNextNode->pNext;
                    pCurrent->pNext = pTemp;
                    /* If the successor is not NULL, point it back to pCurrent */
                    if (pTemp) {
                        pTemp->pPrevious = pCurrent;
                    }
                    free(pNextNode);
                }
                /* Advance only while still searching (avoids touching a freed node) */
                if (Nodefound == FALSE) {
                    pNextNode = pNextNode->pNext;
                    pCurrent = pCurrent->pNext;
                }
            }
        }
        else if (pCurrent->number == SearchNumber) {
            /* Single-node list: delete the only node */
            Nodefound = TRUE;
            free(pCurrent);
            pCurrent = NULL;
            pHead = pCurrent;
        }
    }
    else if (pCurrent == NULL) {
        Error = LISTEMPTY;
        ErrorMessage(Error);
    }
    if (Nodefound == FALSE && pCurrent != NULL) {
        Error = NODENOTFOUND;
        ErrorMessage(Error);
    }
}

/* Function to detect a cycle in the doubly linked list
   (Floyd's two-speed traversal) */
unsigned int DetectCycleinList(void)
{
    DoubleLinkedList* pCurrent = pHead;
    DoubleLinkedList* pFast = pCurrent;
    unsigned int cycle = FALSE;
    if (pCurrent == NULL) {       /* guard: empty list has no cycle */
        Error = LISTEMPTY;
        ErrorMessage(Error);
        return FALSE;
    }
    while ((cycle == FALSE) && pCurrent->pNext != NULL) {
        if (!(pFast = pFast->pNext)) {
            cycle = FALSE;
            break;
        }
        else if (pFast == pCurrent) {
            cycle = TRUE;
            break;
        }
        else if (!(pFast = pFast->pNext)) {
            cycle = FALSE;
            break;
        }
        else if (pFast == pCurrent) {
            cycle = TRUE;
            break;
        }
        pCurrent = pCurrent->pNext;
    }
    if (cycle) {
        printf("\nDouble Linked list is cyclic");
    }
    else {
        printf("\nDouble Linked list is not cyclic");
    }
    return cycle;
}

/* Function to reverse nodes in the doubly linked list */
void ReverseNodes(void)
{
    DoubleLinkedList *pCurrent = NULL, *pNextNode = NULL;
    pCurrent = pHead;
    if (pCurrent) {
        pHead = NULL;
        while (pCurrent != NULL) {
            pNextNode = pCurrent->pNext;
            pCurrent->pNext = pHead;
            pCurrent->pPrevious = pNextNode;
            pHead = pCurrent;
            pCurrent = pNextNode;
        }
    }
    else {
        Error = LISTEMPTY;
        ErrorMessage(Error);
    }
}

/* Function to display diagnostic errors */
void ErrorMessage(int Error)
{
    switch (Error) {
    case LISTEMPTY:
        printf("\nError: Double linked list is empty!");
        break;
    case MEMALLOCERROR:
        printf("\nMemory allocation error ");
        break;
    case NODENOTFOUND:
        printf("\nThe searched node is not found ");
        break;
    default:
        printf("\nError code missing\n");
        break;
    }
}

/* main.h header file */
#ifndef MAIN_H
#define MAIN_H
#include "DoubleLinkedList.h"
/* Error code */
extern unsigned int Error;
#endif

/*****************************************************
Name: main.c  Version: 0.1
Description: Implementation of a doubly linked list
Change history: 0.1 Initial version
License: GNU GPL v3
*****************************************************/
#include <stdio.h>
#include <stdlib.h>
#include "main.h"

int main(void)
{
    int choice = 0;
    int InputNumber = 0;
    printf("\nThis program creates a double linked list");
    printf("\nYou can add nodes in forward and reverse directions");
    do {
        printf("\n1.Create Node Forward");
        printf("\n2.Create Node Reverse");
        printf("\n3.Delete Node");
        printf("\n4.Display Nodes in forward direction");
        printf("\n5.Display Nodes in reverse direction");
        printf("\n6.Reverse nodes");
        printf("\n7.Exit\n");
        printf("\nEnter your choice: ");
        scanf("%d", &choice);
        switch (choice) {
        case 1:
            AddNodeForward();
            break;
        case 2:
            AddNodeReverse();
            break;
        case 3:
            printf("\nEnter the node you want to delete: ");
            scanf("%d", &InputNumber);
            DeleteNode(InputNumber);
            break;
        case 4:
            printf("\nDisplaying node data in forward direction \n");
            DisplayNodeForward();
            break;
        case 5:
            printf("\nDisplaying node data in reverse direction\n");
            DisplayNodeReverse();
            break;
        case 6:
            ReverseNodes();
            break;
        case 7:
            printf("Exiting program");
            break;
        default:
            printf("\nIncorrect choice\n");
        }
    } while (choice != 7);
    return 0;
}

2.5.3 Advanced concepts

Asymmetric doubly linked list

An asymmetric doubly linked list is somewhere between the singly linked list and the regular doubly linked list. It shares some features with the singly linked list (single-direction traversal) and others from the doubly linked list (ease of modification).

It is a list where each node's previous link points not to the previous node, but to the link to itself. While this makes little difference between nodes (it just points to an offset within the previous node), it changes the head of the list: it allows the first node to modify the firstNode link easily.[1][2]

As long as a node is in a list, its previous link is never null.

Inserting a node

To insert a node before another, we change the link that pointed to the old node, using the prev link; then set the new node's next link to point to the old node, and change that node's prev link accordingly.

function insertBefore(Node node, Node newNode)
    if node.prev == null
        error "The node is not in a list"
    newNode.prev := node.prev
    atAddress(newNode.prev) := newNode
    newNode.next := node
    node.prev := addressOf(newNode.next)

function insertAfter(Node node, Node newNode)
    newNode.next := node.next
    if newNode.next != null
        newNode.next.prev := addressOf(newNode.next)
    node.next := newNode
    newNode.prev := addressOf(node.next)

Deleting a node

To remove a node, we simply modify the link pointed to by prev, regardless of whether the node was the first one of the list.

function remove(Node node)
    atAddress(node.prev) := node.next
    if node.next != null
        node.next.prev := node.prev
    destroy node

2.5.4 See also

• XOR linked list
• SLIP (programming language)

2.5.5 References

[1] http://www.codeofhonor.com/blog/avoiding-game-crashes-related-to-linked-lists

[2] https://github.com/webcoyote/coho/blob/master/Base/List.h

2.6 Stack (abstract data type)

For the use of the term LIFO in accounting, see LIFO (accounting).

Simple representation of a stack runtime with push and pop operations.

In computer science, a stack is an abstract data type that serves as a collection of elements, with two principal operations: push, which adds an element to the collection, and pop, which removes the most recently added element that was not yet removed. The order in which elements come off a stack gives rise to its alternative name, LIFO (for last in, first out).

Additionally, a peek operation may give access to the top without modifying the stack.

The name "stack" for this type of structure comes from the analogy to a set of physical items stacked on top of each other, which makes it easy to take an item off the top of the stack, while getting to an item deeper in the stack may require taking off multiple other items first.[1]

Considered as a linear data structure, or more abstractly a sequential collection, the push and pop operations occur only at one end of the structure, referred to as the top of the stack. This makes it possible to implement a stack as a singly linked list with a pointer to the top element.

A stack may be implemented to have a bounded capacity. If the stack is full and does not contain enough space to accept an entity to be pushed, the stack is then considered to be in an overflow state. The pop operation removes an item from the top of the stack.

2.6.1 History

Stacks entered the computer science literature in 1946, in the computer design of Alan M. Turing (who used the terms "bury" and "unbury") as a means of calling and returning from subroutines.[2] Subroutines had already been implemented in Konrad Zuse's Z4 in 1945. Klaus Samelson and Friedrich L. Bauer of Technical University Munich proposed the idea in 1955 and filed a patent in 1957.[3] The same concept was developed, independently, by the Australian Charles Leonard Hamblin in the first half of 1957.[4]

Stacks are often described by analogy to a spring-loaded stack of plates in a cafeteria.[5][1][6] Clean plates are placed on top of the stack, pushing down any already there. When a plate is removed from the stack, the one below it pops up to become the new top.

2.6.2 Non-essential operations

In many implementations, a stack has more operations than "push" and "pop". An example is "top of stack", or "peek", which observes the top-most element without removing it from the stack.[7] Since this can be done with a "pop" and a "push" with the same data, it is not essential. An underflow condition can occur in the "stack top" operation if the stack is empty, the same as "pop". Also, implementations often have a function which just returns whether the stack is empty.

2.6.3 Software stacks

Implementation

A stack can be easily implemented either through an array or a linked list. What identifies the data structure as a stack in either case is not the implementation but the interface: the user is only allowed to pop or push items onto the array or linked list, with few other helper operations. The following will demonstrate both implementations, using pseudocode.

Array

An array can be used to implement a (bounded) stack, as follows. The first element (usually at the zero offset) is the bottom, resulting in array[0] being the first element pushed onto the stack and the last element popped off. The program must keep track of the size (length) of the stack, using a variable top that records the number of items pushed so far, therefore pointing to the place in the array where the next element is to be inserted (assuming a zero-based index convention). Thus, the stack itself can be effectively implemented as a three-element structure:

structure stack:
    maxsize : integer
    top : integer
    items : array of item

procedure initialize(stk : stack, size : integer):
    stk.items ← new array of size items, initially empty
    stk.maxsize ← size
    stk.top ← 0

The push operation adds an element and increments the top index, after checking for overflow:

procedure push(stk : stack, x : item):
    if stk.top = stk.maxsize:
        report overflow error
    else:
        stk.items[stk.top] ← x
        stk.top ← stk.top + 1

Similarly, pop decrements the top index after checking for underflow, and returns the item that was previously the top one:

procedure pop(stk : stack):
    if stk.top = 0:
        report underflow error
    else:
        stk.top ← stk.top − 1
        r ← stk.items[stk.top]
        return r

Using a dynamic array, it is possible to implement a stack that can grow or shrink as much as needed. The size of the stack is simply the size of the dynamic array, which is a very efficient implementation of a stack since adding items to or removing items from the end of a dynamic array requires amortized O(1) time.
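The same bounded stack, sketched in C; error handling through boolean return values is one choice among several.

#include <stdbool.h>
#include <stdlib.h>

struct stack {
    int *items;   /* storage for the elements         */
    int maxsize;  /* capacity fixed at initialization */
    int top;      /* number of items pushed so far    */
};

bool stack_init(struct stack *stk, int size) {
    stk->items = malloc((size_t)size * sizeof *stk->items);
    stk->maxsize = size;
    stk->top = 0;
    return stk->items != NULL;
}

bool push(struct stack *stk, int x) {
    if (stk->top == stk->maxsize)
        return false;               /* overflow */
    stk->items[stk->top++] = x;
    return true;
}

bool pop(struct stack *stk, int *out) {
    if (stk->top == 0)
        return false;               /* underflow */
    *out = stk->items[--stk->top];
    return true;
}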

Linked list

Another option for implementing stacks is to use a singly linked list. A stack is then a pointer to the "head" of the list, with perhaps a counter to keep track of the size of the list:

structure frame:
    data : item
    next : frame or nil

structure stack:
    head : frame or nil
    size : integer

procedure initialize(stk : stack):
    stk.head ← nil
    stk.size ← 0

Pushing and popping items happens at the head of the list; overflow is not possible in this implementation (unless memory is exhausted):

procedure push(stk : stack, x : item):
    newhead ← new frame
    newhead.data ← x
    newhead.next ← stk.head
    stk.head ← newhead
    stk.size ← stk.size + 1

procedure pop(stk : stack):
    if stk.head = nil:
        report underflow error
    r ← stk.head.data
    stk.head ← stk.head.next
    stk.size ← stk.size − 1
    return r

Stacks and programming languages

Some languages, such as Perl, LISP and Python, make the stack operations push and pop available on their standard list/array types. Some languages, notably those in the Forth family (including PostScript), are designed around language-defined stacks that are directly visible to and manipulated by the programmer. The following is an example of manipulating a stack in Common Lisp (">" is the Lisp interpreter's prompt; lines not starting with ">" are the interpreter's responses to expressions):

> (setf stack (list 'a 'b 'c))  ;; set the variable "stack"
(A B C)
> (pop stack)  ;; get top (leftmost) element, should modify the stack
A
> stack  ;; check the value of stack
(B C)
> (push 'new stack)  ;; push a new top onto the stack
(NEW B C)

Several of the C++ container types have push_back and pop_back operations with LIFO semantics; additionally, the stack template class adapts existing containers to provide a restricted API with only push/pop operations. PHP has an SplStack class. Java's library contains a Stack class that is a specialization of Vector. Following is an example program in the Java language, using that class.

import java.util.Stack;

class StackDemo {
    public static void main(String[] args) {
        Stack<String> stack = new Stack<String>();
        stack.push("A");                  // Insert "A" in the stack
        stack.push("B");                  // Insert "B" in the stack
        stack.push("C");                  // Insert "C" in the stack
        stack.push("D");                  // Insert "D" in the stack
        System.out.println(stack.peek()); // Prints the top of the stack ("D")
        stack.pop();                      // removing the top ("D")
        stack.pop();                      // removing the next top ("C")
    }
}

2.6.4 Hardware stacks

A common use of stacks at the architecture level is as a means of allocating and accessing memory.

A typical stack, storing local data and call information for nested procedure calls (not necessarily nested procedures!). This stack grows downward from its origin. The stack pointer points to the current topmost datum on the stack. A push operation decrements the pointer and copies the data to the stack; a pop operation copies data from the stack and then increments the pointer. Each procedure called in the program stores procedure return information (in yellow) and local data (in other colors) by pushing them onto the stack. This type of stack implementation is extremely common, but it is vulnerable to buffer overflow attacks (see the text).

Basic architecture of a stack

A typical stack is an area of computer memory with a fixed origin and a variable size. Initially the size of the stack is zero. A stack pointer, usually in the form of a hardware register, points to the most recently referenced location on the stack; when the stack has a size of zero, the stack pointer points to the origin of the stack.

The two operations applicable to all stacks are:

• a push operation, in which a data item is placed at the location pointed to by the stack pointer, and the address in the stack pointer is adjusted by the size of the data item;
• a pop or pull operation: a data item at the current location pointed to by the stack pointer is removed, and the stack pointer is adjusted by the size of the data item.

There are many variations on the basic principle of stack operations. Every stack has a fixed location in memory at which it begins. As data items are added to the stack, the stack pointer is displaced to indicate the current extent of the stack, which expands away from the origin.

Stack pointers may point to the origin of a stack or to a limited range of addresses either above or below the origin (depending on the direction in which the stack grows); however, the stack pointer cannot cross the origin of the stack. In other words, if the origin of the stack is at address 1000 and the stack grows downwards (towards addresses 999, 998, and so on), the stack pointer must never be incremented beyond 1000 (to 1001, 1002, etc.). If a pop operation on the stack causes the stack pointer to move past the origin of the stack, a stack underflow occurs. If a push operation causes the stack pointer to increment or decrement beyond the maximum extent of the stack, a stack overflow occurs.

Some environments that rely heavily on stacks may provide additional operations, for example (a short C sketch of two of these follows the list):

• Duplicate: the top item is popped, and then pushed again (twice), so that an additional copy of the former top item is now on top, with the original below it.
• Peek: the topmost item is inspected (or returned), but the stack pointer is not changed, and the stack size does not change (meaning that the item remains on the stack). This is also called top operation in many articles.
• Swap or exchange: the two topmost items on the stack exchange places.
• Rotate (or Roll): the n topmost items are moved on the stack in a rotating fashion. For example, if n = 3, items 1, 2, and 3 on the stack are moved to positions 2, 3, and 1 on the stack, respectively. Many variants of this operation are possible, with the most common being called left rotate and right rotate.
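For example, duplicate and swap can be built from push and pop alone. This sketch assumes the bounded C stack from the earlier sketch (push and pop returning success flags); it is illustrative rather than how any particular hardware implements these operations.

/* duplicate: pop the top item, then push it back twice
 * (on a completely full stack the second push fails) */
bool dup(struct stack *stk) {
    int x;
    if (!pop(stk, &x))
        return false;
    return push(stk, x) && push(stk, x);
}

/* swap: pop the two topmost items, push them back in reverse order */
bool swap(struct stack *stk) {
    int a, b;
    if (!pop(stk, &a))
        return false;
    if (!pop(stk, &b)) {
        push(stk, a);   /* restore the stack on failure */
        return false;
    }
    return push(stk, a) && push(stk, b);
}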

Stacks are often visualized growing from the bottom up (like real-world stacks). They may also be visualized growing from left to right, so that "topmost" becomes "rightmost", or even growing from top to bottom. The important feature is that the bottom of the stack is in a fixed position. The illustration in this section is an example of a top-to-bottom growth visualization: the top (28) is the stack "bottom", since the stack "top" is where items are pushed or popped from.

A right rotate will move the first element to the third position, the second to the first and the third to the second. Here are two equivalent visualizations of this process:

apple                        banana
banana   ===right rotate==>  cucumber
cucumber                     apple

apple                        cucumber
banana   ===left rotate==>   apple
cucumber                     banana

A stack is usually represented in computers by a block of memory cells, with the "bottom" at a fixed location, and the stack pointer holding the address of the current "top" cell in the stack. The top and bottom terminology is used irrespective of whether the stack actually grows towards lower memory addresses or towards higher memory addresses.

Pushing an item on to the stack adjusts the stack pointer by the size of the item (either decrementing or incrementing, depending on the direction in which the stack grows in memory), pointing it to the next cell, and copies the new top item to the stack area. Depending again on the exact implementation, at the end of a push operation, the stack pointer may point to the next unused location in the stack, or it may point to the topmost item in the stack. If the stack points to the current topmost item, the stack pointer will be updated before a new item is pushed onto the stack; if it points to the next available location in the stack, it will be updated after the new item is pushed onto the stack.

Popping the stack is simply the inverse of pushing. The topmost item in the stack is removed and the stack pointer is updated, in the opposite order of that used in the push operation.

Hardware support

Stack in main memory

Many CPU families, including the x86, Z80 and 6502, have a dedicated register reserved for use as (call) stack pointers and special push and pop instructions that manipulate this specific register, conserving opcode space. Some processors, like the PDP-11 and the 68000, also have special addressing modes for implementation of stacks, typically with a semi-dedicated stack pointer as well (such as A7 in the 68000). However, in most processors, several different registers may be used as additional stack pointers as needed (whether updated via addressing modes or via add/sub instructions).

Stack in registers or dedicated memory

Main article: Stack machine

The x87 floating point architecture is an example of a set of registers organised as a stack where direct access to individual registers (relative to the current top) is also possible. As with stack-based machines in general, having the top-of-stack as an implicit argument allows for a small machine code footprint with a good usage of bus bandwidth and code caches, but it also prevents some types of optimizations possible on processors permitting random access to the register file for all (two or three) operands. A stack structure also makes superscalar implementations with register renaming (for speculative execution) somewhat more complex to implement, although it is still feasible, as exemplified by modern x87 implementations.

Sun SPARC, AMD Am29000, and Intel i960 are all examples of architectures using register windows within a register-stack as another strategy to avoid the use of slow main memory for function arguments and return values.

There are also a number of small microprocessors that implement a stack directly in hardware, and some microcontrollers have a fixed-depth stack that is not directly accessible. Examples are the PIC microcontrollers, the Computer Cowboys MuP21, the Harris RTX line, and the Novix NC4016. Many stack-based microprocessors were used to implement the programming language Forth at the microcode level. Stacks were also used as a basis of a number of mainframes and minicomputers. Such machines were called stack machines, the most famous being the Burroughs B5000.

2.6.5 Applications

Expression evaluation and syntax parsing

Calculators employing reverse Polish notation use a stack structure to hold values. Expressions can be represented in prefix, postfix or infix notations, and conversion from one form to another may be accomplished using a stack. Many compilers use a stack for parsing the syntax of expressions, program blocks etc. before translating into low-level code. Most programming languages are context-free languages, allowing them to be parsed with stack-based machines.
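As a concrete illustration, the following C sketch evaluates a postfix (RPN) expression using a stack. To stay short it supports only single-digit operands and the operators +, - and *; a real evaluator would also parse multi-digit numbers and report errors.

#include <ctype.h>
#include <stdio.h>

int eval_rpn(const char *expr) {
    int stack[64], top = 0;                /* small fixed stack for the sketch */
    for (; *expr; expr++) {
        if (isdigit((unsigned char)*expr)) {
            stack[top++] = *expr - '0';    /* push operand */
        } else if (*expr == '+' || *expr == '-' || *expr == '*') {
            int b = stack[--top];          /* pop the two most recent operands */
            int a = stack[--top];
            int r = (*expr == '+') ? a + b
                  : (*expr == '-') ? a - b
                  : a * b;
            stack[top++] = r;              /* push the result back */
        }                                  /* spaces and other characters are skipped */
    }
    return stack[top - 1];                 /* final result is on top */
}

int main(void) {
    printf("%d\n", eval_rpn("3 4 + 2 *"));  /* prints 14 */
    return 0;
}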

Backtracking

Main article: Backtracking

Another important application of stacks is backtracking. Consider a simple example of finding the correct path in a maze. There are a series of points, from the starting point to the destination. We start from one point. To reach the final destination, there are several paths. Suppose we choose a random path. After following a certain path, we realise that the path we have chosen is wrong. So we need to find a way by which we can return to the beginning of that path. This can be done with the use of stacks. With the help of stacks, we remember the point where we have reached. This is done by pushing that point into the stack. In case we end up on the wrong path, we can pop the last point from the stack and thus return to the last point and continue our quest to find the right path. This is called backtracking.

Runtime memory management

Main articles: Stack-based memory allocation and Stack machine

A number of programming languages are stack-oriented, meaning they define most basic operations (adding two numbers, printing a character) as taking their arguments from the stack, and placing any return values back on the stack. For example, PostScript has a return stack and an operand stack, and also has a graphics state stack and a dictionary stack. Many virtual machines are also stack-oriented, including the p-code machine and the Java Virtual Machine.

Almost all calling conventions—the ways in which subroutines receive their parameters and return results—use a special stack (the "call stack") to hold information about procedure/function calling and nesting in order to switch to the context of the called function and restore to the caller function when the calling finishes. The functions follow a runtime protocol between caller and callee to save arguments and return values on the stack. Stacks are an important way of supporting nested or recursive function calls. This type of stack is used implicitly by the compiler to support CALL and RETURN statements (or their equivalents) and is not manipulated directly by the programmer.

Some programming languages use the stack to store data that is local to a procedure. Space for local data items is allocated from the stack when the procedure is entered, and is deallocated when the procedure exits. The C programming language is typically implemented in this way. Using the same stack for both data and procedure calls has important security implications (see below) of which a programmer must be aware in order to avoid introducing serious security bugs into a program.

2.6.6 Security

Some computing environments use stacks in ways that may make them vulnerable to security breaches and attacks. Programmers working in such environments must take special care to avoid the pitfalls of these implementations.

For example, some programming languages use a common stack to store both data local to a called procedure and the linking information that allows the procedure to return to its caller. This means that the program moves data into and out of the same stack that contains critical return addresses for the procedure calls. If data is moved to the wrong location on the stack, or an oversized data item is moved to a stack location that is not large enough to contain it, return information for procedure calls may be corrupted, causing the program to fail.

Malicious parties may attempt a stack smashing attack that takes advantage of this type of implementation by providing oversized data input to a program that does not check the length of input. Such a program may copy the data in its entirety to a location on the stack, and in so doing it may change the return addresses for procedures that have called it. An attacker can experiment to find a specific type of data that can be provided to such a program such that the return address of the current procedure is reset to point to an area within the stack itself (and within the data provided by the attacker), which in turn contains instructions that carry out unauthorized operations.

This type of attack is a variation on the buffer overflow attack and is an extremely frequent source of security breaches in software, mainly because some of the most popular compilers use a shared stack for both data and procedure calls, and do not verify the length of data items. Frequently programmers do not write code to verify the size of data items, either, and when an oversized or undersized data item is copied to the stack, a security breach may occur.

2.6.7 See also

• List of data structures
• Queue
• Double-ended queue
• Call stack
• FIFO (computing and electronics)
• Stack-based memory allocation
• Stack overflow
• Stack-oriented programming language

2.6.8 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.

[2] Newton, David E. (2003). Alan Turing: a study in light and shadow. Philadelphia: Xlibris. p. 82. ISBN 9781401090791. Retrieved 28 January 2015.

[3] Dr. Friedrich Ludwig Bauer and Dr. Klaus Samelson (30 March 1957). "Verfahren zur automatischen Verarbeitung von kodierten Daten und Rechenmaschine zur Ausübung des Verfahrens" (in German). Munich, Germany: Deutsches Patentamt. Retrieved 2010-10-01.

[4] C. L. Hamblin, "An Addressless Coding Scheme based on Mathematical Notation", N.S.W University of Technology, May 1957 (typescript).

[5] Ball, John A. (1978). Algorithms for RPN calculators (1st ed.). Cambridge, Massachusetts, USA: Wiley-Interscience, John Wiley & Sons, Inc. ISBN 0-471-03070-8.

[6] Godse, A. P.; Godse, D. A. (2010-01-01). Computer Architecture. Technical Publications. pp. 1–56. ISBN 9788184315349. Retrieved 2015-01-30.

[7] Horowitz, Ellis (1984). Fundamentals of Data Structures in Pascal. Computer Science Press. p. 67.

2.6.9 Further reading

• Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.

2.6.10 External links

• Stacks and its Applications
• Stack Machines – the new wave
• Bounding stack depth
• Stack Size Analysis for Interrupt-driven Programs (322 KB)
• This article incorporates public domain material from the NIST document: Black, Paul E. "Bounded stack". Dictionary of Algorithms and Data Structures.

2.7 Queue (abstract data type)

Representation of a FIFO (first in, first out) queue, with new elements enqueued at the back and elements dequeued from the front.

In computer science, a queue (/ˈkjuː/ KYEW) is a particular kind of abstract data type or collection in which the entities in the collection are kept in order and the principal (or only) operations on the collection are the addition of entities to the rear terminal position, known as enqueue, and removal of entities from the front terminal position, known as dequeue. This makes the queue a First-In-First-Out (FIFO) data structure. In a FIFO data structure, the first element added to the queue will be the first one to be removed. This is equivalent to the requirement that once a new element is added, all elements that were added before have to be removed before the new element can be removed. Often a peek or front operation is also provided, returning the value of the front element without dequeuing it. A queue is an example of a linear data structure, or more abstractly a sequential collection.

Queues provide services in computer science, transport, and operations research where various entities such as data, objects, persons, or events are stored and held to be processed later. In these contexts, the queue performs the function of a buffer.

Queues are common in computer programs, where they are implemented as data structures coupled with access routines, as an abstract data structure or in object-oriented languages as classes. Common implementations are circular buffers and linked lists.

2.7.1 Queue implementation

Theoretically, one characteristic of a queue is that it does not have a specific capacity. Regardless of how many elements are already contained, a new element can always be added. It can also be empty, at which point removing an element will be impossible until a new element has been added again.
Fixed-length arrays are limited in capacity, but it is not true that items need to be copied towards the head of the queue. The simple trick of turning the array into a closed circle and letting the head and tail drift around endlessly in that circle makes it unnecessary to ever move items stored in the array. If n is the size of the array, then computing indices modulo n will turn the array into a circle. This is still the conceptually simplest way to construct a queue in a high-level language, though it does slow things down a little, because the array indices must be compared to zero and to the array size, which is comparable to the time taken to check whether an array index is out of bounds (a check some languages perform anyway). Even so, it is the method of choice for a quick-and-dirty implementation, or for any high-level language that does not have pointer syntax. The array size must be declared ahead of time, but some implementations simply double the declared array size when overflow occurs. Most modern languages with objects or pointers can implement, or come with libraries for, dynamic lists; such data structures may have no specified capacity limit besides memory constraints. Queue overflow results from trying to add an element onto a full queue, and queue underflow happens when trying to remove an element from an empty queue.

A bounded queue is a queue limited to a fixed number of items.[1]

There are several efficient implementations of FIFO queues. An efficient implementation is one that can perform the operations—enqueuing and dequeuing—in O(1) time.

• Linked list
  • A doubly linked list has O(1) insertion and deletion at both ends, so it is a natural choice for queues.
  • A regular singly linked list only has efficient insertion and deletion at one end. However, a small modification—keeping a pointer to the last node in addition to the first one—will enable it to implement an efficient queue.
• A deque implemented using a modified dynamic array

Queues and programming languages

Queues may be implemented as a separate data type, or may be considered a special case of a double-ended queue (deque) and not implemented separately. For example, Perl and Ruby allow pushing and popping an array from both ends, so one can use push and shift functions to enqueue and dequeue a list (or, in reverse, one can use unshift and pop), although in some cases these operations are not efficient.

C++'s Standard Template Library provides a "queue" templated class which is restricted to only push/pop operations. Since J2SE5.0, Java's library contains a Queue interface that specifies queue operations; implementing classes include LinkedList and (since J2SE 1.6) ArrayDeque. PHP has an SplQueue class and third party libraries like beanstalk'd and Gearman.

Examples

A simple queue implemented in Ruby:

class Queue
  def initialize
    @list = Array.new
  end

  def enqueue(element)
    @list << element
  end

  def dequeue
    @list.shift
  end
end
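And a minimal C sketch of a bounded queue over a fixed-size array, using the modulo trick described above; the capacity and names are illustrative.

#include <stdbool.h>

#define QCAP 8                    /* illustrative fixed capacity */

struct queue {
    int items[QCAP];
    int head;                     /* index of the front element */
    int count;                    /* number of stored elements  */
};

bool enqueue(struct queue *q, int x) {
    if (q->count == QCAP)
        return false;                              /* overflow */
    q->items[(q->head + q->count) % QCAP] = x;     /* wrap with modulo */
    q->count++;
    return true;
}

bool dequeue(struct queue *q, int *out) {
    if (q->count == 0)
        return false;                              /* underflow */
    *out = q->items[q->head];
    q->head = (q->head + 1) % QCAP;                /* advance, wrapping */
    q->count--;
    return true;
}

For example, after struct queue q = {0}; enqueue(&q, 1); enqueue(&q, 2);, a dequeue yields 1 first, preserving FIFO order.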

• Linked list Real-time queue

• A doubly linked list has O(1) insertion and The data structure used to implements our queues con- deletion at both ends, so is a natural choice for sists of three linked lists (f, r, s) where f is the front queues. of the queue, r is the rear of the queue in reverse or- • A regular singly linked list only has efficient der. The invariant of the structure is that s is the rear insertion and deletion at one end. However, a of f without its |r| first elements, that is |s| = |f| − |r| . small modification—keeping a pointer to the The tail of the queue (CONS(x, f), r, s) is then almost last node in addition to the first one—will en- (f, r, s) and inserting an element x to (f, r, s) is almost able it to implement an efficient queue. (f, CONS(x, r), s) . It is said almost, because in both of those results, |s| = |f| − |r| + 1 . An auxiliary function • A deque implemented using a modified dynamic ar- aux must the be called for the invariant to be satisfied. ray Two cases must be considered, depending on whether s 48 CHAPTER 2. SEQUENCES

is the empty list, in which case |r| = |f|+1 , or not. The [3] Hood, Robert; Melville, Robert (November 1981.). formal definition is aux(f, r, Cons(_, s)) = (f, r, s) “Real-time queue operations in pure Lisp”. Information and aux(f, r, NIL) = (f ′, NIL, f ′) where f ′ is f fol- Processing Letters,. 13 (2). Check date values in: |date= lowed by r reversed. (help) Let us call reverse(f, r) the function which returns f followed by r reversed. Let us furthermore assume • Donald Knuth. The Art of Computer Programming, that |r| = |f| + 1 , since it is the case when this Volume 1: Fundamental Algorithms, Third Edition. function is called. More precisely, we define a lazy Addison-Wesley, 1997. ISBN 0-201-89683-4. Sec- function rotate(f, r, a) which takes as input three list tion 2.2.1: Stacks, Queues, and Deques, pp. 238– such that |r| = |f| + 1 , and return the concatenation 243. of f, of r reversed and of a. Then reverse(f, r) = • rotate(f, r, NIL) . The inductive definition of rotate Thomas H. Cormen, Charles E. Leiserson, Ronald is rotate(NIL, Cons(y, NIL), a) = Cons(y, a) L. Rivest, and Clifford Stein. Introduction to Algo- and rotate(CONS(x, f),CONS(y, r), a) = rithms, Second Edition. MIT Press and McGraw- Cons(x, rotate(f, r, CONS(y, a))) . Its running Hill, 2001. ISBN 0-262-03293-7. Section 10.1: time is O(r) , but, since lazy evaluation is used, the Stacks and queues, pp. 200–204. computation is delayed until the results is forced by the • computation. William Ford, William Topp. Data Structures with C++ and STL, Second Edition. Prentice Hall, 2002. The list s in the data structure has two purposes. This list ISBN 0-13-085850-1. Chapter 8: Queues and Pri- serves as a counter for |f| − |r| , indeed, |f| = |r| if and ority Queues, pp. 386–390. only if s is the empty list. This counter allows us to ensure that the rear is never longer than the front list. Further- • Adam Drozdek. Data Structures and Algorithms in more, using s, which is a tail of f, forces the computation C++, Third Edition. Thomson Course Technology, of a part of the (lazy) list f during each tail and insert op- 2005. ISBN 0-534-49182-0. Chapter 4: Stacks and eration. Therefore, when |f| = |r| , the list f is totally Queues, pp. 137–169. forced. If it wast not the case, the intern representation of f could be some append of append of... of append, and forcing would not be a constant time operation anymore. 2.7.5 External links

Amortized queue

Note that, without the lazy part of the implementation, the real-time queue would be a non-persistent implementation of a queue in O(1) amortized time. In this case, the list s can be replaced by the integer |f| − |r|, and the reverse function would be called when s is 0.
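A sketch of this non-lazy, amortized variant in Python (rather than a functional language; the names are illustrative and the code is not from the original text) keeps two plain lists and reverses the rear onto the front whenever the rear grows longer, mirroring the |f| − |r| counter:

# Two-list queue: the head of the queue is front[-1]; the newest
# element is rear[-1]. Not persistent, but O(1) amortized.
class AmortizedQueue:
    def __init__(self):
        self.front = []
        self.rear = []

    def enqueue(self, item):
        self.rear.append(item)
        self._balance()

    def dequeue(self):
        if not self.front:
            raise IndexError("dequeue from empty queue")
        item = self.front.pop()
        self._balance()
        return item

    def _balance(self):
        # Plays the role of the reverse call when the counter reaches 0:
        # the rear may never become longer than the front.
        if len(self.rear) > len(self.front):
            self.front = list(reversed(self.rear)) + self.front
            self.rear = []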

2.7.3 See also

• Circular buffer

• Deque

• Priority queue 2.8 Double-ended queue

• Queueing theory

• Stack – the “opposite” of a queue: LIFO (Last In First Out)

2.7.4 References

[1] “Queue (Java Platform SE 7)”. Docs.oracle.com. 2014-03-26. Retrieved 2014-05-22.

[2] Okasaki, Chris. “Purely Functional Data Structures” (PDF).

[3] Hood, Robert; Melville, Robert (November 1981). “Real-time queue operations in pure Lisp”. Information Processing Letters. 13 (2).

• Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 10.1: Stacks and queues, pp. 200–204.

• William Ford, William Topp. Data Structures with C++ and STL, Second Edition. Prentice Hall, 2002. ISBN 0-13-085850-1. Chapter 8: Queues and Priority Queues, pp. 386–390.

• Adam Drozdek. Data Structures and Algorithms in C++, Third Edition. Thomson Course Technology, 2005. ISBN 0-534-49182-0. Chapter 4: Stacks and Queues, pp. 137–169.

2.7.5 External links

• Queue Data Structure and Algorithm
• Queues with algo and 'c' programme
• STL Quick Reference
• VBScript implementation of stack, queue, deque, and Red-Black Tree

This article incorporates public domain material from the NIST document: Black, Paul E. “Bounded queue”. Dictionary of Algorithms and Data Structures.

2.8 Double-ended queue

“Deque” redirects here. It is not to be confused with dequeueing, a queue operation.
Not to be confused with Double-ended priority queue.

In computer science, a double-ended queue (dequeue, often abbreviated to deque, pronounced deck) is an abstract data type that generalizes a queue, for which elements can be added to or removed from either the front (head) or back (tail).[1] It is also often called a head-tail linked list, though properly this refers to a specific data structure implementation (see below).

2.8.1 Naming conventions

Deque is sometimes written dequeue, but this use is generally deprecated in technical literature or technical writing because dequeue is also a verb meaning “to remove from a queue”. Nevertheless, several libraries and some writers, such as Aho, Hopcroft, and Ullman in their textbook Data Structures and Algorithms, spell it dequeue. John Mitchell, author of Concepts in Programming Languages, also uses this terminology.

2.8.2 Distinctions and sub-types

This differs from the queue abstract data type or First-In-First-Out List (FIFO), where elements can only be added to one end and removed from the other. This general data class has some possible sub-types:

• An input-restricted deque is one where deletion can be made from both ends, but insertion can be made at one end only.

• An output-restricted deque is one where insertion can be made at both ends, but deletion can be made from one end only.

Both the basic and most common list types in computing, queues and stacks, can be considered specializations of deques, and can be implemented using deques.

2.8.3 Operations

The basic operations on a deque are enqueue and dequeue on either end. Also generally implemented are peek operations, which return the value at that end without dequeuing it. Names vary between languages; see the language support section below for major implementations.

2.8.4 Implementations

There are at least two common ways to efficiently implement a deque: with a modified dynamic array or with a doubly linked list.

The dynamic array approach uses a variant of a dynamic array that can grow from both ends, sometimes called array deques. These array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends, instead of just one end. Three common implementations include:

• Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings.

• Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.

• Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.
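For a quick feel of the interface, independent of which implementation backs it, Python's collections.deque (discussed under Language support below) exposes constant-time operations at both ends; this short demonstration is illustrative and not from the original text:

from collections import deque

d = deque([2, 3])
d.appendleft(1)      # insert at the front -> deque([1, 2, 3])
d.append(4)          # insert at the back  -> deque([1, 2, 3, 4])
first = d.popleft()  # remove from the front -> 1
last = d.pop()       # remove from the back  -> 4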

Purely functional implementation

Double-ended queues can also be implemented as a purely functional data structure.[2] Two versions of the implementation exist. The first one, called real-time deque, is presented below. It allows the queue to be persistent with operations in O(1) worst-case time, but requires lazy lists with memoization. The second one, with no lazy lists nor memoization, is presented at the end of the section. Its amortized time is O(1) if the persistency is not used, but the worst-case time complexity of an operation is O(n), where n is the number of elements in the double-ended queue.

Let us recall that, for a list l, |l| denotes its length, NIL represents the empty list, and CONS(h, t) represents the list whose head is h and whose tail is t. The functions drop(i, l) and take(i, l) return the list l without its first i elements, and the first i elements of l, respectively; or, if |l| < i, they return the empty list and l, respectively.

A double-ended queue is represented as a sixtuple (lenf, f, sf, lenr, r, sr), where f is a linked list which contains the front of the queue, of length lenf. Similarly, r is a linked list which represents the reverse of the rear of the queue, of length lenr. Furthermore, it is assured that |f| ≤ 2|r| + 1 and |r| ≤ 2|f| + 1 — intuitively, this means that neither the front nor the rear contains more than a third of the list plus one element. Finally, sf and sr are tails of f and of r; they allow scheduling the moments at which some lazy operations are forced. Note that, when a double-ended queue contains n elements in the front list and n elements in the rear list, then the inequality invariant remains satisfied after i insertions and d deletions when i + 2d ≤ n; in particular, at least n/2 operations can happen between two rebalancings.

Intuitively, inserting an element x in front of the double-ended queue (lenf, f, sf, lenr, r, sr) leads almost to the double-ended queue (lenf + 1, CONS(x, f), drop(2, sf), lenr, r, drop(2, sr)); the head and the tail of the double-ended queue (lenf, CONS(x, f), sf, lenr, r, sr) are x and almost (lenf − 1, f, drop(2, sf), lenr, r, drop(2, sr)) respectively; and the head and the tail of (lenf, NIL, NIL, lenr, CONS(x, NIL), sr) are x and (0, NIL, NIL, 0, NIL, NIL) respectively. The functions to insert an element in the rear, or to drop the last element of the double-ended queue, are similar to the above functions, which deal with the front of the double-ended queue. It is said “almost” because, after an insertion and after an application of tail, the invariant |r| ≤ 2|f| + 1 may not be satisfied anymore. In this case it is required to rebalance the double-ended queue.

In order to avoid an operation with O(n) cost, the algorithm uses laziness with memoization, and forces the rebalancing to be partly done during the following (|f| + |r|)/2 operations, that is, before the following rebalancing. In order to create the scheduling, some auxiliary lazy functions are required. The function rotateRev(f, r, a) returns the list f, followed by the list r reversed, followed by the list a. It is required in this function that |r| − 2|f| is 2 or 3. This function is defined by induction as rotateRev(NIL, r, a) = reverse(r) ++ a, where ++ is the concatenation operation, and by rotateRev(CONS(x, f), r, a) = CONS(x, rotateRev(f, drop(2, r), reverse(take(2, r)) ++ a)). Note that rotateRev(f, r, NIL) returns the list f followed by the list r reversed. The function rotateDrop(f, j, r), which returns f followed by (r without its first j elements) reversed, is also required, for j < |f|. It is defined by rotateDrop(f, 0, r) = rotateRev(f, r, NIL), rotateDrop(f, 1, r) = rotateRev(f, drop(1, r), NIL), and rotateDrop(CONS(x, f), j, r) = CONS(x, rotateDrop(f, j − 2, drop(2, r))).
The balancing function can now be defined:

fun balance(q as (lenf, f, sf, lenr, r, sr)) =
  if lenf > 2*lenr+1 then
    let val i = (lenf+lenr) div 2
        val j = lenf + lenr - i
        val f' = take(i, f)
        val r' = rotateDrop(r, i, f)
    in (i, f', f', j, r', r') end
  else if lenr > 2*lenf+1 then
    let val j = (lenf+lenr) div 2
        val i = lenf + lenr - j
        val r' = take(j, r)
        val f' = rotateDrop(f, j, r)
    in (i, f', f', j, r', r') end
  else q

Note that, without the lazy part of the implementation, this would be a non-persistent implementation of a queue in O(1) amortized time. In this case, the lists sf and sr can be removed from the representation of the double-ended queue.

2.8.5 Language support

Ada’s containers provide the generic packages Ada.Containers.Vectors and Ada.Containers.Doubly_Linked_Lists, for the dynamic array and linked list implementations, respectively.

C++'s Standard Template Library provides the class templates std::deque and std::list, for the multiple-array and linked list implementations, respectively.

As of Java 6, Java’s Collections Framework provides a new Deque interface that provides the functionality of insertion and removal at both ends. It is implemented by classes such as ArrayDeque (also new in Java 6) and LinkedList, providing the dynamic array and linked list implementations, respectively. However, the ArrayDeque, contrary to its name, does not support random access.

Perl’s arrays have native support for both removing (shift and pop) and adding (unshift and push) elements on both ends.

Python 2.4 introduced the collections module with support for deque objects. It is implemented using a doubly linked list of fixed-length subarrays.

As of PHP 5.3, PHP’s SPL extension contains the 'SplDoublyLinkedList' class that can be used to implement deque data structures. Previously, to make a deque structure the array functions array_shift/unshift/pop/push had to be used instead.

GHC's Data.Sequence module implements an efficient, functional deque structure in Haskell. The implementation uses 2–3 finger trees annotated with sizes. There are other (fast) possibilities to implement purely functional (thus also persistent) double queues (most using heavily lazy evaluation).[3][4] Kaplan and Tarjan were the first to implement optimal confluently persistent catenable deques.[5] Their implementation was strictly purely functional in the sense that it did not use lazy evaluation. Okasaki simplified the data structure by using lazy evaluation with a bootstrapped data structure and degrading the performance bounds from worst-case to amortized. Kaplan, Okasaki, and Tarjan produced a simpler, non-bootstrapped, amortized version that can be implemented either using lazy evaluation or more efficiently using mutation in a broader but still restricted fashion. Mihaesau and Tarjan created a simpler (but still highly complex) strictly purely functional implementation of catenable deques, and also a much simpler implementation of strictly purely functional non-catenable deques, both of which have optimal worst-case bounds.

2.8.6 Complexity

• In a doubly linked list implementation and assuming no allocation/deallocation overhead, the time complexity of all deque operations is O(1). Additionally, the time complexity of insertion or deletion in the middle, given an iterator, is O(1); however, the time complexity of random access by index is O(n).

• In a growing array, the amortized time complexity of all deque operations is O(1). Additionally, the time complexity of random access by index is O(1); but the time complexity of insertion or deletion in the middle is O(n).

2.8.7 Applications

One example where a deque can be used is the A-Steal job scheduling algorithm.[6] This algorithm implements task scheduling for several processors. A separate deque with threads to be executed is maintained for each processor. To execute the next thread, the processor gets the first element from the deque (using the “remove first element” deque operation). If the current thread forks, it is put back to the front of the deque (“insert element at front”) and a new thread is executed. When one of the processors finishes execution of its own threads (i.e. its deque is empty), it can “steal” a thread from another processor: it gets the last element from the deque of another processor (“remove last element”) and executes it. The steal-job scheduling algorithm is used by Intel’s Threading Building Blocks (TBB) library for parallel programming.

2.8.8 See also

• Pipe
• Queue
• Priority queue

2.8.9 References

[1] Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.

[2] Okasaki, Chris. “Purely Functional Data Structures” (PDF).

[3] C. Okasaki, “Purely Functional Data Structures”, September 1996. http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf

[4] Adam L. Buchsbaum and Robert E. Tarjan. Confluently persistent deques via data structural bootstrapping. Journal of Algorithms, 18(3):513–547, May 1995. (pp. 58, 101, 125)

[5] Haim Kaplan and Robert E. Tarjan. Purely functional representations of catenable sorted lists. In ACM Symposium on Theory of Computing, pages 202–211, May 1996. (pp. 4, 82, 84, 124)

[6] Eitan Frachtenberg, Uwe Schwiegelshohn (2007). Job Scheduling Strategies for Parallel Processing: 12th International Workshop, JSSPP 2006. Springer. ISBN 3-540-71034-5. See p. 22.

2.8.10 External links

• Type-safe open source deque implementation at Comprehensive C Archive Network
• SGI STL Documentation: deque
• Code Project: An In-Depth Study of the STL Deque Container
• Deque implementation in C
• VBScript implementation of stack, queue, deque, and Red-Black Tree
• Multiple implementations of non-catenable deques in Haskell

2.9 Circular buffer

A ring showing, conceptually, a circular buffer. This visually shows that the buffer has no real end and it can loop around the buffer. However, since memory is never physically created as a ring, a linear representation is generally used, as is done below.

A circular buffer, circular queue, cyclic buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. This structure lends itself easily to buffering data streams.

2.9.1 Uses

The useful property of a circular buffer is that it does not need to have its elements shuffled around when one is consumed. (If a non-circular buffer were used then it would be necessary to shift all elements when one is consumed.) In other words, the circular buffer is well-suited as a FIFO buffer, while a standard, non-circular buffer is well suited as a LIFO buffer.

Circular buffering makes a good implementation strategy for a queue that has a fixed maximum size. Should a maximum size be adopted for a queue, then a circular buffer is a completely ideal implementation; all queue operations are constant time. However, expanding a circular buffer requires shifting memory, which is comparatively costly.

For arbitrarily expanding queues, a linked list approach may be preferred instead.

In some situations, an overwriting circular buffer can be used, e.g. in multimedia. If the buffer is used as the bounded buffer in the producer-consumer problem, then it is probably desired for the producer (e.g., an audio generator) to overwrite old data if the consumer (e.g., the sound card) is momentarily unable to keep up. Also, the LZ77 family of lossless data compression algorithms operates on the assumption that strings seen more recently in a data stream are more likely to occur soon in the stream. Implementations store the most recent data in a circular buffer.

2.9.2 How it works

A 24-byte keyboard circular buffer. When the write pointer is about to reach the read pointer - because the microprocessor is not responding - the buffer will stop recording keystrokes and, in some computers, a beep will be played.

A circular buffer first starts empty and of some predefined length. For example, this is a 7-element buffer:

[ ][ ][ ][ ][ ][ ][ ]

Assume that a 1 is written into the middle of the buffer (exact starting location does not matter in a circular buffer):

[ ][ ][1][ ][ ][ ][ ]

Then assume that two more elements are added — 2 & 3 — which get appended after the 1:

[ ][ ][1][2][3][ ][ ]

If two elements are then removed from the buffer, the oldest values inside the buffer are removed. The two elements removed, in this case, are 1 & 2, leaving the buffer with just a 3:

[ ][ ][ ][ ][3][ ][ ]

If the buffer has 7 elements then it is completely full:

[6][7][8][9][3][4][5]

A consequence of the circular buffer is that when it is full and a subsequent write is performed, then it starts overwriting the oldest data. In this case, two more elements — A & B — are added and they overwrite the 3 & 4:

[6][7][8][9][A][B][5]

Alternatively, the routines that manage the buffer could prevent overwriting the data and return an error or raise an exception. Whether or not data is overwritten is up to the semantics of the buffer routines or the application using the circular buffer.

Finally, if two elements are now removed then what would be returned is not 3 & 4 but 5 & 6, because A & B overwrote the 3 & the 4, yielding the buffer with:

[ ][7][8][9][A][B][ ]

2.9.3 Circular buffer mechanics

A circular buffer can be implemented using four pointers, or two pointers and two integers:

• buffer start in memory
• buffer end in memory, or buffer capacity
• start of valid data (index or pointer)
• end of valid data (index or pointer), or amount of data currently in the buffer (integer)

This image shows a partially full buffer:

[ ][ ][1][2][3][ ][ ]
(START at the 1; END just past the 3)

This image shows a full buffer with four elements (numbers 1 through 4) having been overwritten:

[6][7][8][9][A][B][5]
(END just past the B; START at the 5)

When an element is overwritten, the start pointer is incremented to the next element.

In the pointer-based implementation strategy, the buffer’s full or empty state can be resolved from the start and end indexes. When they are equal, the buffer is empty, and when the start is one greater than the end, the buffer is full.[1] When the buffer is instead designed to track the number of inserted elements n, checking for emptiness means checking n = 0 and checking for fullness means checking whether n equals the capacity.[2]
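A minimal Python sketch (not from the original text; names are illustrative) of the “two indices plus a count” variant just described, letting the caller choose between overwriting and raising an error:

class RingBuffer:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.start = 0   # start of valid data
        self.count = 0   # amount of data currently in the buffer

    def is_empty(self):
        return self.count == 0

    def is_full(self):
        return self.count == len(self.buf)

    def write(self, item, overwrite=True):
        end = (self.start + self.count) % len(self.buf)
        if self.is_full():
            if not overwrite:
                raise BufferError("buffer full")
            self.buf[end] = item
            # Overwriting the oldest element: advance the start pointer.
            self.start = (self.start + 1) % len(self.buf)
        else:
            self.buf[end] = item
            self.count += 1

    def read(self):
        if self.is_empty():
            raise BufferError("buffer empty")
        item = self.buf[self.start]
        self.start = (self.start + 1) % len(self.buf)
        self.count -= 1
        return item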

2.9.4 Optimization

A circular-buffer implementation may be optimized by mapping the underlying buffer to two contiguous regions of virtual memory. (Naturally, the underlying buffer’s length must then equal some multiple of the system’s page size.) Reading from and writing to the circular buffer may then be carried out with greater efficiency by means of direct memory access; those accesses which fall beyond the end of the first virtual-memory region will automatically wrap around to the beginning of the underlying buffer. When the read offset is advanced into the second virtual-memory region, both offsets—read and write—are decremented by the length of the underlying buffer.[1]

2.9.5 Fixed-length-element and contiguous-block circular buffer

Perhaps the most common version of the circular buffer uses 8-bit bytes as elements.

Some implementations of the circular buffer use fixed-length elements that are bigger than 8-bit bytes—16-bit integers for audio buffers, 53-byte ATM cells for telecom buffers, etc. Each item is contiguous and has the correct data alignment, so software reading and writing these values can be faster than software that handles non-contiguous and non-aligned values.

Ping-pong buffering can be considered a very specialized circular buffer with exactly two large fixed-length elements.

The Bip Buffer (bipartite buffer) is very similar to a circular buffer, except it always returns contiguous blocks, which can be variable length. This offers nearly all the efficiency advantages of a circular buffer while maintaining the ability for the buffer to be used in APIs that only accept contiguous blocks.[1]

Fixed-sized compressed circular buffers use an alternative indexing strategy based on elementary number theory to maintain a fixed-sized compressed representation of the entire data sequence.[3]

2.9.6 External links

[1] Simon Cooke (2003), “The Bip Buffer - The Circular Buffer with a Twist”

[2] Morin, Pat. “ArrayQueue: An Array-Based Queue”. Open Data Structures (in pseudocode). Retrieved 7 November 2015.

[3] John C. Gunther. 2014. Algorithm 938: Compressing circular buffers. ACM Trans. Math. Softw. 40, 2, Article 17 (March 2014)

• CircularBuffer at the Portland Pattern Repository
• Boost: Templated Circular Buffer Container
• http://www.dspguide.com/ch28/2.htm

Chapter 3

Dictionaries

3.1 Associative array

“Dictionary (data structure)” redirects here. It is not to be confused with data dictionary.
“Associative container” redirects here. For the implementation of ordered associative arrays in the standard library of the C++ programming language, see associative containers.

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.

Operations associated with this data type allow:[1][2]

• the addition of a pair to the collection
• the removal of a pair from the collection
• the modification of an existing pair
• the lookup of a value associated with a particular key

The dictionary problem is a classic computer science problem: the task of designing a data structure that maintains a set of data during 'search', 'delete', and 'insert' operations.[3] The two major solutions to the dictionary problem are a hash table or a search tree.[1][2][4][5] In some cases it is also possible to solve the problem using directly addressed arrays, binary search trees, or other more specialized structures.

Many programming languages include associative arrays as primitive data types, and they are available in software libraries for many others. Content-addressable memory is a form of direct hardware-level support for associative arrays.

Associative arrays have many applications including such fundamental programming patterns as memoization and the decorator pattern.[6]

3.1.1 Operations

In an associative array, the association between a key and a value is often known as a “binding”, and the same word “binding” may also be used to refer to the process of creating a new association.

The operations that are usually defined for an associative array are:[1][2]

• Add or insert: add a new (key, value) pair to the collection, binding the new key to its new value. The arguments to this operation are the key and the value.

• Reassign: replace the value in one of the (key, value) pairs that are already in the collection, binding an old key to a new value. As with an insertion, the arguments to this operation are the key and the value.

• Remove or delete: remove a (key, value) pair from the collection, unbinding a given key from its value. The argument to this operation is the key.

• Lookup: find the value (if any) that is bound to a given key. The argument to this operation is the key, and the value is returned from the operation. If no value is found, some associative array implementations raise an exception.
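In Python, whose built-in dict is an associative array (see Language support below), these operations look as follows; the keys and values here are purely illustrative:

loans = {}
loans["Pride and Prejudice"] = "Alice"   # add or insert
loans["Pride and Prejudice"] = "Bob"     # reassign
del loans["Pride and Prejudice"]         # remove or delete
patron = loans.get("Wuthering Heights")  # lookup; None if not bound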

Often, instead of add or reassign, there is a single set operation that adds a new (key, value) pair if one does not already exist, and otherwise reassigns it.

In addition, associative arrays may also include other operations such as determining the number of bindings or constructing an iterator to loop over all the bindings. Usually, for such an operation, the order in which the bindings are returned may be arbitrary.

A multimap generalizes an associative array by allowing multiple values to be associated with a single key.[7] A bidirectional map is a related abstract data type in which the bindings operate in both directions: each value must be associated with a unique key, and a second lookup operation takes a value as argument and looks up the key associated with that value.


3.1.2 Example

Suppose that the set of loans made by a library is represented in a data structure. Each book in a library may be checked out only by a single library patron at a time. However, a single patron may be able to check out multiple books. Therefore, the information about which books are checked out to which patrons may be represented by an associative array, in which the books are the keys and the patrons are the values. Using notation from Python or JSON, the data structure would be:

{ “Pride and Prejudice”: “Alice”, “Wuthering Heights”: “Alice”, “Great Expectations”: “John” }

A lookup operation on the key “Great Expectations” would return “John”. If John returns his book, that would cause a deletion operation, and if Pat checks out a book, that would cause an insertion operation, leading to a different state:

{ “Pride and Prejudice”: “Alice”, “The Brothers Karamazov”: “Pat”, “Wuthering Heights”: “Alice” }

3.1.3 Implementation

For dictionaries with very small numbers of bindings, it may make sense to implement the dictionary using an association list, a linked list of bindings. With this implementation, the time to perform the basic dictionary operations is linear in the total number of bindings; however, it is easy to implement and the constant factors in its running time are small.[1][8]

Another very simple implementation technique, usable when the keys are restricted to a narrow range of integers, is direct addressing into an array: the value for a given key k is stored at the array cell A[k], or if there is no binding for k then the cell stores a special sentinel value that indicates the absence of a binding. As well as being simple, this technique is fast: each dictionary operation takes constant time. However, the space requirement for this structure is the size of the entire keyspace, making it impractical unless the keyspace is small.[4]

The two major approaches to implementing dictionaries are a hash table or a search tree.[1][2][4][5]

Hash table implementations

The most frequently used general purpose implementation of an associative array is with a hash table: an array of bindings, together with a hash function that maps each possible key into an array index. The basic idea of a hash table is that the binding for a given key is stored at the position given by applying the hash function to that key, and that lookup operations are performed by looking at that cell of the array and using the binding found there. The great advantage of a hash table over a straight address is that there does not have to be a search for the key to find the address: the hash is the address of the correct key, and the value is immediately available. However, hash-table-based dictionaries must be prepared to handle collisions that occur when two keys are mapped by the hash function to the same index, and many different collision resolution strategies have been developed for dealing with this situation, often based either on open addressing (looking at a sequence of hash table indices instead of a single index, until finding either the given key or an empty cell) or on hash chaining (storing a small association list instead of a single binding in each hash table cell).[1][2][4][9]

Search tree implementations

Main article: search tree

Another common approach is to implement an associative array with a (self-balancing) red-black tree.[10]

Dictionaries may also be stored in binary search trees or in data structures specialized to a particular type of keys, such as radix trees, tries, Judy arrays, or van Emde Boas trees, but these implementation methods are less efficient than hash tables as well as placing greater restrictions on the types of data that they can handle. The advantages of these alternative structures come from their ability to handle operations beyond the basic ones of an associative array, such as finding the binding whose key is the closest to a queried key, when the query is not itself present in the set of bindings.

3.1.4 Language support

Main article: Comparison of programming languages (mapping)

Associative arrays can be implemented in any programming language as a package, and many language systems provide them as part of their standard library. In some languages, they are not only built into the standard system, but have special syntax, often using array-like subscripting.

Built-in syntactic support for associative arrays was introduced by SNOBOL4, under the name “table”. MUMPS made multi-dimensional associative arrays, optionally persistent, its key data structure. SETL supported them as one possible implementation of sets and maps. Most modern scripting languages, starting with AWK and including Rexx, Perl, Tcl, JavaScript, Wolfram Language, Python, Ruby, and Lua, support associative arrays as a primary container type. In many more languages, they are available as library functions without special syntax.

In Smalltalk, Objective-C, .NET,[11] Python, REALbasic, Swift, and VBA they are called dictionaries; in Perl, Ruby and Seed7 they are called hashes; in C++, Java, Go, Scala, OCaml, Haskell they are called maps (see map (C++), unordered_map (C++), and Map); in Common Lisp and Windows PowerShell, they are called hash tables (since both typically use this implementation). In PHP, all arrays can be associative, except that the keys are limited to integers and strings. In JavaScript (see also JSON), all objects behave as associative arrays with string-valued keys, while the Map and WeakMap types take arbitrary objects as keys. In Lua, they are called tables, and are used as the primitive building block for all data structures. In Visual FoxPro, they are called Collections. The D language also has support for associative arrays.[12]

3.1.5 Permanent storage

Main article: Key-value store

Most programs using associative arrays will at some point need to store that data in a more permanent form, like in a computer file. A common solution to this problem is a generalized concept known as archiving or serialization, which produces a text or binary representation of the original objects that can be written directly to a file. This is most commonly implemented in the underlying object model, like .NET or Cocoa, which include standard functions that convert the internal data into text form. The program can create a complete text representation of any group of objects by calling these methods, which are almost always already implemented in the base associative array class.[13]

For programs that use very large data sets, this sort of individual file storage is not appropriate, and a database management system (DB) is required. Some DB systems natively store associative arrays by serializing the data and then storing that serialized data and the key. Individual arrays can then be loaded or saved from the database using the key to refer to them. These key-value stores have been used for many years and have a history as long as that of the more common relational database (RDB), but a lack of standardization, among other reasons, limited their use to certain niche roles. RDBs were used for these roles in most cases, although saving objects to an RDB can be complicated, a problem known as object-relational impedance mismatch.

After c. 2010, the need for high performance databases suitable for cloud computing and more closely matching the internal structure of the programs using them led to a renaissance in the key-value store market. These systems can store and retrieve associative arrays in a native fashion, which can greatly improve performance in common web-related workflows.

3.1.6 See also

• Key-value database
• Tuple
• Function (mathematics)
• JSON

3.1.7 References

[1] Goodrich, Michael T.; Tamassia, Roberto (2006), “9.1 The Map Abstract Data Type”, Data Structures & Algorithms in Java (4th ed.), Wiley, pp. 368–371.

[2] Mehlhorn, Kurt; Sanders, Peter (2008), “4 Hash Tables and Associative Arrays”, Algorithms and Data Structures: The Basic Toolbox (PDF), Springer, pp. 81–98.

[3] Anderson, Arne (1989). “Optimal Bounds on the Dictionary Problem”. Proc. Symposium on Optimal Algorithms. Springer Verlag: 106–114.

[4] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), “11 Hash Tables”, Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, pp. 221–252, ISBN 0-262-03293-7.

[5] Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. “Dynamic Perfect Hashing: Upper and Lower Bounds”. SIAM J. Comput. 23, 4 (Aug. 1994), 738–761. http://portal.acm.org/citation.cfm?id=182370 doi:10.1137/S0097539791194094.

[6] Goodrich & Tamassia (2006), pp. 597–599.

[7] Goodrich & Tamassia (2006), pp. 389–397.

[8] “When should I use a hash table instead of an association list?”. lisp-faq/part2. 1996-02-20.

[9] Klammer, F.; Mazzolini, L. (2006), “Pathfinders for associative maps”, Ext. Abstracts GIS-l 2006, GIS-I, pp. 71–74.

[10] Joel Adams and Larry Nyhoff. “Trees in STL”. Quote: “The Standard Template library ... some of its containers -- the set, map, multiset, and multimap templates -- are generally built using a special kind of self-balancing binary search tree called a red-black tree.”

[11] “Dictionary Class”. MSDN.

[12] “Associative Arrays, the D programming language”. Digital Mars.

[13] “Archives and Serializations Programming Guide”, Apple Inc., 2012.

3.1.8 External links

• NIST’s Dictionary of Algorithms and Data Structures: Associative Array

3.2 Association list

In computer programming and particularly in Lisp, an association list, often referred to as an alist, is a linked list in which each list element (or node) comprises a key and a value. The association list is said to associate the value with the key. In order to find the value associated with a given key, a sequential search is used: each element of the list is searched in turn, starting at the head, until the key is found. Association lists provide a simple way of implementing an associative array, but are efficient only when the number of keys is very small.

3.2.1 Operation

An associative array is an abstract data type that can be used to maintain a collection of key–value pairs and look up the value associated with a given key. The association list provides a simple way of implementing this data type.

To test whether a key is associated with a value in a given association list, search the list starting at its first node and continue either until a node containing the key has been found or until the search reaches the end of the list (in which case the key is not present). To add a new key–value pair to an association list, create a new node for that key–value pair, set the node’s link to be the previous first element of the association list, and replace the first element of the association list with the new node.[1] Although some implementations of association lists disallow having multiple nodes with the same keys as each other, such duplications are not problematic for this search algorithm: duplicate keys that appear later in the list are ignored.[2]

It is also possible to delete a key from an association list, by scanning the list to find each occurrence of the key and splicing the nodes containing the key out of the list.[1] The scan should continue to the end of the list, even when the key is found, in case the same key may have been inserted multiple times.
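A Python sketch of these operations (not from the original text; names are illustrative), with each node represented as a (key, value, rest) triple and NIL as None:

NIL = None

def alist_add(alist, key, value):
    # The new node becomes the head; an older binding for the same key
    # is shadowed rather than erased.
    return (key, value, alist)

def alist_lookup(alist, key):
    while alist is not NIL:
        k, v, rest = alist
        if k == key:
            return v      # the first (most recent) binding wins
        alist = rest
    return None

def alist_delete(alist, key):
    # Scan to the very end, splicing out every node that carries the
    # key, in case it was inserted multiple times.
    kept = []
    while alist is not NIL:
        k, v, rest = alist
        if k != key:
            kept.append((k, v))
        alist = rest
    result = NIL
    for k, v in reversed(kept):
        result = (k, v, result)
    return result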
[5] McCarthy, John; Abrahams, Paul W.; Edwards, Daniel The disadvantage of association lists is that the time to J.; Hart, Timothy P.; Levin, Michael I. (1985). LISP 1.5 [3] search is O(n), where n is the length of the list. For large Programmer’s Manual (PDF). MIT Press. ISBN 0-262- lists, this may be much slower than the times that can be 13011-4. See in particular p. 12 for functions that search obtained by representing an associative array as a binary an association list and use it to substitute symbols in an- search tree or as a hash table. Additionally, unless the other expression, and p. 103 for the application of asso- list is regularly pruned to remove elements with duplicate ciation lists in maintaining variable bindings. keys, multiple values associated with the same key will [6] van de Snepscheut, Jan L. A. (1993). What Computing Is increase the size of the list, and thus the time to search, All About. Monographs in Computer Science. Springer. without providing any compensatory advantage. p. 201. ISBN 9781461227106. One advantage of association lists is that a new element [7] Scott, Michael Lee (2000). “3.3.4 Association Lists can be added in constant time. Additionally, when the and Central Reference Tables”. Programming Language number of keys is very small, searching an association list Pragmatics. Morgan Kaufmann. p. 137. ISBN may be more efficient than searching a binary search tree 9781558604421. 58 CHAPTER 3. DICTIONARIES

[8] Pearce, Jon (2012). Programming and Meta-Programming in Scheme. Undergraduate Texts in Computer Science. Springer. p. 214. ISBN 9781461216827.

[9] Minsky, Yaron; Madhavapeddy, Anil; Hickey, Jason (2013). Real World OCaml: Functional Programming for the Masses. O'Reilly Media. p. 253. ISBN 9781449324766.

[10] O'Sullivan, Bryan; Goerzen, John; Stewart, Donald Bruce (2008). Real World Haskell: Code You Can Believe In. O'Reilly Media. p. 299. ISBN 9780596554309.

3.3 Hash table

Not to be confused with Hash list or Hash tree.
“Rehash” redirects here. For the South Park episode, see Rehash (South Park). For the IRC command, see List of Internet Relay Chat commands § REHASH.

In computing, a hash table (hash map) is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

A small phone book as a hash table.

Ideally, the hash function will assign each key to a unique bucket, but it is possible that two keys will generate an identical hash, causing both keys to point to the same bucket. Instead, most hash table designs assume that hash collisions—different keys that are assigned by the hash function to the same bucket—will occur and must be accommodated in some way.

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at (amortized[2]) constant average cost per operation.[3][4]

In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

3.3.1 Hashing

Main article: Hash function

The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found:

index = f(key, array_size)

Often this is done in two steps:

hash = hashfunc(key)
index = hash % array_size

In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size − 1) using the modulo operator (%).

In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves speed, but can increase problems with a poor hash function.
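In Python, the two-step computation might look like this (Python's built-in hash() stands in for hashfunc, purely for illustration):

def bucket_index(key, array_size):
    h = hash(key)          # step 1: hash = hashfunc(key)
    return h % array_size  # step 2: index = hash % array_size

# When array_size is a power of two, the remainder reduces to
# masking off the low bits:
def bucket_index_pow2(key, array_size):
    return hash(key) & (array_size - 1)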

Choosing a good hash function

A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve.

A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson’s chi-squared test for discrete uniform distributions.[5][6]

The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size s, then the hash function needs to be uniform only when s is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have s be a prime number.[7] The modulus operation may provide some additional mixing; this is especially useful with a poor hash function.

For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[7]

Cryptographic hash functions are believed to provide good hash functions for any table size s, either by modulo reduction or by bit masking. They may also be appropriate if there is a risk of malicious users trying to sabotage a network service by submitting requests designed to generate a large number of collisions in the server’s hash tables. However, the risk of sabotage can also be avoided by cheaper methods (such as applying a secret to the data, or using a universal hash function). A drawback of cryptographic hashing functions is that they are often slower to compute, which means that in cases where the uniformity for any s is not necessary, a non-cryptographic hashing function might be preferable.

Perfect hash function

If all keys are known ahead of time, a perfect hash function can be used to create a perfect hash table that has no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.

Perfect hashing allows for constant-time lookups in all cases. This is in contrast to most chaining and open addressing methods, where the time for lookup is low on average, but may be very large, O(n), for some sets of keys.

3.3.2 Key statistics

A critical statistic for a hash table is the load factor, defined as

load factor = n / k,

where:

• n is the number of entries;
• k is the number of buckets.

As the load factor grows larger, the hash table becomes slower, and it may even fail to work (depending on the method used). The expected constant-time property of a hash table assumes that the load factor is kept below some bound. For a fixed number of buckets, the time for a lookup grows with the number of entries, and therefore the desired constant time is not achieved.

Second, one can examine the variance of the number of entries per bucket. For example, two tables both have 1,000 entries and 1,000 buckets; one has exactly one entry in each bucket, the other has all entries in the same bucket. Clearly the hashing is not working in the second one.

A low load factor is not especially beneficial. As the load factor approaches 0, the proportion of unused areas in the hash table increases, but there is not necessarily any reduction in search cost. This results in wasted memory.

3.3.3 Collision resolution

Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2,450 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is approximately a 95% chance of at least two of the keys being hashed to the same slot.

Therefore, almost all hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.

Separate chaining

Hash collision resolved by separate chaining.

In the method known as separate chaining, each bucket is independent, and has some sort of list of entries with the same index. The time for hash table operations is the time to find the bucket (which is constant) plus the time for the list operation.

In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely more than that. Therefore, structures that are efficient in time and space for these cases are preferred. Structures that are efficient for a fairly large number of entries per bucket are not needed or desirable. If these cases happen often, the hashing function needs to be fixed.

Separate chaining with linked lists

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket—that is, it is roughly proportional to the load factor.

For this reason, chained hash tables remain effective even when the number of table entries n is much higher than the number of slots. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1), but still 1000 times faster than a plain sequential list.

For separate-chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table.

The bucket chains are often searched sequentially using the order the entries were added to the bucket. If the load factor is large and some keys are more likely to come up than others, then rearranging the chain with a move-to-front heuristic may be effective. More sophisticated data structures, such as balanced search trees, are worth considering only if the load factor is large (about 10 or more), or if the hash distribution is likely to be very non-uniform, or if one must guarantee good performance even in a worst-case scenario. However, using a larger table and/or a better hash function may be even more effective in those cases.

Chained hash tables also inherit the disadvantages of linked lists. When storing small keys and values, the space overhead of the next pointer in each entry record can be significant. An additional disadvantage is that traversing a linked list has poor cache performance, making the processor cache ineffective.

Separate chaining with list head cells

Hash collision by separate chaining with head records in the bucket array.

Some chaining implementations store the first record of each chain in the slot array itself.[4] The number of pointer traversals is decreased by one for most cases. The purpose is to increase cache efficiency of hash table access.

The disadvantage is that an empty bucket takes the same space as a bucket with one entry. To save space, such hash tables often have about as many slots as stored entries, meaning that many slots have two or more entries.

Separate chaining with other structures

Instead of a list, one can use any other data structure that supports the required operations. For example, by using a self-balancing tree, the theoretical worst-case time of common hash table operations (insertion, deletion, lookup) can be brought down to O(log n) rather than O(n). However, this approach is only worth the trouble and extra memory cost if long delays must be avoided at all costs (e.g., in a real-time application), or if one must guard against many entries hashed to the same slot (e.g., if one expects extremely non-uniform distributions, or in the case of web sites or other publicly accessible services, which are vulnerable to malicious key distributions in requests).

The variant called array hash table uses a dynamic array to store all the entries that hash to the same slot.[8][9][10] Each newly inserted entry gets appended to the end of the dynamic array that is assigned to the slot. The dynamic array is resized in an exact-fit manner, meaning it is grown only by as many bytes as needed. Alternative techniques such as growing the array by block sizes or pages were found to improve insertion performance, but at a cost in space. This variation makes more efficient use of CPU caching and the translation lookaside buffer (TLB), because slot entries are stored in sequential memory positions. It also dispenses with the next pointers that are required by linked lists, which saves space. Despite frequent array resizing, space overheads incurred by the operating system such as memory fragmentation were found to be small.

An elaboration on this approach is the so-called dynamic perfect hashing,[11] where a bucket that contains k entries is organized as a perfect hash table with k² slots. While it uses more memory (n² slots for n entries, in the worst case and n × k slots in the average case), this variant has guaranteed constant worst-case lookup time, and low amortized time for insertion. It is also possible to use a fusion tree for each bucket, achieving constant time for all operations with high probability.[12]

Open addressing

Main article: Open addressing

Hash collision resolved by open addressing with linear probing (interval=1). Note that “Ted Baker” has a unique hash, but nevertheless collided with “Sandra Dee”, which had previously collided with “John Smith”.

In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.[13] The name “open addressing” refers to the fact that the location (“address”) of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with “open hashing” or “closed addressing”, which usually mean separate chaining.)

Well-known probe sequences include:

• Linear probing, in which the interval between probes is fixed (usually 1)

• Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation

• Double hashing, in which the interval between probes is computed by a second hash function

A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. In fact, even with good hash functions, their performance dramatically degrades when the load factor grows beyond 0.7 or so. For many applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.

Open addressing schemes also put more stringent requirements on the hash function: besides distributing the keys more uniformly over the buckets, the function must also minimize the clustering of hash values that are consecutive in the probe order. Using separate chaining, the only concern is that too many objects map to the same hash value; whether they are adjacent or nearby is completely irrelevant.
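A Python sketch (not from the original text) of open addressing with linear probing, the first probe sequence above; deletion is omitted because it requires tombstones or re-insertion of the trailing cluster, and the table is assumed never to fill completely:

class LinearProbingTable:
    _EMPTY = object()   # sentinel marking an unused slot

    def __init__(self, capacity=16):
        self.slots = [self._EMPTY] * capacity

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        # Probe with a fixed interval of 1 until we find the key
        # or an unoccupied slot.
        while self.slots[i] is not self._EMPTY and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not self._EMPTY:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)   # reached an unused slot: key absent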

Open addressing only saves memory if the entries are small (less than four times the size of a pointer) and the load factor is not too small. If the load factor is close to zero (that is, there are far more buckets than stored entries), open addressing is wasteful even if each entry is just two words.

This graph compares the average number of cache misses required to look up elements in tables with chaining and linear probing. As the table passes the 80%-full mark, linear probing’s performance drastically degrades.

Open addressing avoids the time overhead of allocating each new entry record, and can be implemented even in the absence of a memory allocator. It also avoids the extra indirection required to access the first entry of each bucket (that is, usually the only one). It also has better locality of reference, particularly with linear probing. With small record sizes, these factors can yield better performance than chaining, particularly for lookups.

On the other hand, normal open addressing is a poor choice for large elements, because these elements fill entire CPU cache lines (negating the cache advantage), and a large amount of space is wasted on large empty table slots. If the open addressing table only stores references to elements (external storage), it uses space comparable to chaining even for large records but loses its speed advantage.

Generally speaking, open addressing is better used for hash tables with small records that can be stored within the table (internal storage) and fit in a cache line. They are particularly suitable for elements of one word or less. If the table is expected to have a high load factor, the records are large, or the data is variable-sized, chained hash tables often perform as well or better.

Ultimately, used sensibly, any kind of hash table algorithm is usually fast enough; and the percentage of a calculation spent in hash table code is low. Memory usage is rarely considered excessive. Therefore, in most cases the differences between these algorithms are marginal, and other considerations typically come into play.

Coalesced hashing

A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself.[13] Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.

Cuckoo hashing Another alternative open-addressing search times in the table. This is similar to ordered hash solution is cuckoo hashing, which ensures constant tables[17] except that the criterion for bumping a key does lookup time in the worst case, and constant amortized not depend on a direct relationship between the keys. time for insertions and deletions. It uses two or more hash Since both the worst case and the variation in the num- functions, which means any key/value pair could be in two ber of probes is reduced dramatically, an interesting vari- or more locations. For lookup, the first hash function is ation is to probe the table starting at the expected suc- used; if the key/value is not found, then the second hash cessful probe value and then expand from that position function is used, and so on. If a collision happens during in both directions.[18] External Robin Hood hashing is an insertion, then the key is re-hashed with the second hash extension of this algorithm where the table is stored in function to map it to another bucket. If all hash func- an external file and each table position corresponds to a tions are used and there is still a collision, then the key it fixed-sized page or bucket with B records.[19] collided with is removed to make space for the new key, and the old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that lo- 2-choice hashing cation also results in a collision, then the process repeats until there is no collision or the process traverses all the 2-choice hashing employs two different hash functions, buckets, at which point the table is resized. By combin- h1(x) and h2(x), for the hash table. Both hash functions ing multiple hash functions with multiple cells per bucket, are used to compute two table locations. When an object very high space utilization can be achieved. is inserted in the table, then it is placed in the table loca- tion that contains fewer objects (with the default being the h1(x) table location if there is equality in bucket size). 2- Hopscotch hashing Another alternative open- choice hashing employs the principle of the power of two [14] addressing solution is hopscotch hashing, which choices.[20] combines the approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations. In particular it works well even when the load factor 3.3.4 Dynamic resizing grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table. The good functioning of a hash table depends on the fact The hopscotch hashing algorithm works by defining a that the table size is proportional to the number of entries. neighborhood of buckets near the original hashed bucket, With a fixed size, and the common structures, it is simi- where a given entry is always found. Thus, search is lim- lar to linear search, except with a better constant factor. ited to the number of entries in this neighborhood, which In some cases, the number of entries may be definitely is logarithmic in the worst case, constant on average, and known in advance, for example keywords in a language. with proper alignment of the neighborhood typically re- More commonly, this is not known for sure, if only due quires one cache miss. When inserting an entry, one first to later changes in code and data. It is one serious, al- attempts to add it to a bucket in the neighborhood. 
3.3.4 Dynamic resizing

The good functioning of a hash table depends on the fact that the table size is proportional to the number of entries. With a fixed size, and the common structures, it is similar to linear search, except with a better constant factor. In some cases, the number of entries may be definitely known in advance, for example keywords in a language. More commonly, this is not known for sure, if only due to later changes in code and data. It is a serious, although common, mistake not to provide any way for the table to resize. A general-purpose hash table “class” will almost always have some way to resize, and it is good practice even for simple “custom” tables. An implementation should check the load factor, and do something if it becomes too large (this needs to be done only on inserts, since that is the only thing that would increase it).

To keep the load factor under a certain limit, e.g., under 3/4, many table implementations expand the table when items are inserted. For example, in Java’s HashMap class the default load factor threshold for table expansion is 3/4, and in Python’s dict the table is resized when the load factor is greater than 2/3.

Since buckets are usually implemented on top of a dynamic array and any constant proportion for resizing greater than 1 will keep the load factor under the desired limit, the exact choice of the constant is determined by the same space-time tradeoff as for dynamic arrays.

Resizing is accompanied by a full or incremental table rehash whereby existing items are mapped to new bucket locations.
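As a concrete sketch of that policy (check the load factor on every insert, and rehash into a geometrically larger bucket array when it would pass 3/4), consider the following minimal chained table; all structure and function names are invented for the example:

#include <stdlib.h>

struct node { int key; struct node *next; };

struct hashtable {
    struct node **buckets;    /* allocated and zeroed by the caller */
    size_t nbuckets;
    size_t nentries;
};

static size_t slot(int key, size_t nbuckets)
{
    return (unsigned)key % nbuckets;
}

/* Double the bucket array and re-insert every existing entry: the "full
   table rehash" that accompanies a resize. */
static void grow(struct hashtable *t)
{
    size_t newsize = 2 * t->nbuckets;
    struct node **nb = calloc(newsize, sizeof *nb);
    for (size_t i = 0; i < t->nbuckets; i++) {
        struct node *n = t->buckets[i];
        while (n) {
            struct node *next = n->next;
            size_t s = slot(n->key, newsize);
            n->next = nb[s];
            nb[s] = n;
            n = next;
        }
    }
    free(t->buckets);
    t->buckets = nb;
    t->nbuckets = newsize;
}

void insert(struct hashtable *t, int key)
{
    /* grow first if this insert would push the load factor over 3/4;
       the comparison is done in integers to avoid floating point */
    if (4 * (t->nentries + 1) > 3 * t->nbuckets)
        grow(t);
    struct node *n = malloc(sizeof *n);
    n->key = key;
    size_t s = slot(key, t->nbuckets);
    n->next = t->buckets[s];
    t->buckets[s] = n;
    t->nentries++;
}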

To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the table—followed by a rehash—when items are deleted. From the point of view of space-time tradeoffs, this operation is similar to the deallocation in dynamic arrays.

Resizing by copying all entries

A common approach is to automatically trigger a complete resizing when the load factor exceeds some threshold r_max. Then a new larger table is allocated, all the entries of the old table are removed and inserted into this new table, and the old table is returned to the free storage pool. Symmetrically, when the load factor falls below a second threshold r_min, all entries are moved to a new smaller table.

For hash tables that shrink and grow frequently, the resizing downward can be skipped entirely. In this case, the table size is proportional to the maximum number of entries that ever were in the hash table at one time, rather than the current number. The disadvantage is that memory usage will be higher, and thus cache behavior may be worse. For best control, a “shrink-to-fit” operation can be provided that does this only on request.

If the table size increases or decreases by a fixed percentage at each expansion, the total cost of these resizings, amortized over all insert and delete operations, is still a constant, independent of the number of entries n and of the number m of operations performed.

For example, consider a table that was created with the minimum possible size and is doubled each time the load ratio exceeds some threshold. If m elements are inserted into that table, the total number of extra re-insertions that occur in all dynamic resizings of the table is at most m − 1. In other words, dynamic resizing roughly doubles the cost of each insert or delete operation.

Incremental resizing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually, as in the sketch after this list:

• During the resize, allocate the new hash table, but keep the old table unchanged.
• In each lookup or delete operation, check both tables.
• Perform insertion operations only in the new table.
• At each insertion also move r elements from the old table to the new table.
• When all elements are removed from the old table, deallocate it.

To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.

Disk-based hash tables almost always use some scheme of incremental resizing, since the cost of rebuilding the entire table on disk would be too high.
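The steps above amount to a small amount of bookkeeping. The sketch below is illustrative only (the structure names and the constant r = 4 are the example's own choices, and the code that starts a resize by swapping in a larger new array is omitted): during a resize both bucket arrays stay live, inserts go to the new one, and each insert also drains a few entries from the old one.

#include <stdlib.h>

#define MIGRATE_PER_INSERT 4   /* the constant r from the list above */

struct node { int key; struct node *next; };

struct itable {
    struct node **oldb;  size_t oldsize;   /* oldb is NULL when no resize is running */
    struct node **newb;  size_t newsize;
    size_t scan;                            /* next old bucket to drain */
};

static size_t slot(int key, size_t n) { return (unsigned)key % n; }

/* Move up to MIGRATE_PER_INSERT entries from the old table to the new one. */
static void migrate(struct itable *t)
{
    int moved = 0;
    while (t->oldb && moved < MIGRATE_PER_INSERT) {
        if (t->scan == t->oldsize) {        /* old table fully drained */
            free(t->oldb);
            t->oldb = NULL;
            return;
        }
        struct node *n = t->oldb[t->scan];
        if (!n) { t->scan++; continue; }
        t->oldb[t->scan] = n->next;
        size_t s = slot(n->key, t->newsize);
        n->next = t->newb[s];
        t->newb[s] = n;
        moved++;
    }
}

/* Insertions go only to the new table, then pay for a little migration. */
void insert(struct itable *t, int key)
{
    struct node *n = malloc(sizeof *n);
    n->key = key;
    size_t s = slot(key, t->newsize);
    n->next = t->newb[s];
    t->newb[s] = n;
    migrate(t);
}

/* Lookups must check both tables while a resize is in progress. */
struct node *find(struct itable *t, int key)
{
    for (struct node *n = t->newb[slot(key, t->newsize)]; n; n = n->next)
        if (n->key == key) return n;
    if (t->oldb)
        for (struct node *n = t->oldb[slot(key, t->oldsize)]; n; n = n->next)
            if (n->key == key) return n;
    return NULL;
}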

Monotonic keys

If it is known that key values will always increase (or decrease) monotonically, then a variation of consistent hashing can be achieved by keeping a list of the single most recent key value at each hash table resize operation. Upon lookup, keys that fall in the ranges defined by these list entries are directed to the appropriate hash function—and indeed hash table—both of which can be different for each range. Since it is common to grow the overall number of entries by doubling, there will only be O(log(N)) ranges to check, and binary search time for the redirection would be O(log(log(N))). As with consistent hashing, this approach guarantees that any key’s hash, once issued, will never change, even when the hash table is later grown.

Other solutions

Linear hashing[21] is a hash table algorithm that permits incremental hash table expansion. It is implemented using a single hash table, but with two possible lookup functions.

Another way to decrease the cost of table resizing is to choose a hash function in such a way that the hashes of most values do not change when the table is resized. Such hash functions are prevalent in disk-based and distributed hash tables, where rehashing is prohibitively costly. The problem of designing a hash such that most values do not change when the table is resized is known as the distributed hash table problem. The four most popular approaches are rendezvous hashing, consistent hashing, the content addressable network algorithm, and Kademlia distance.

3.3.5 Performance analysis

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size k with open addressing has no collisions and holds up to k elements, with a single comparison for successful lookup, and a table of size k with chaining and n keys has the minimum max(0, n − k) collisions and O(1 + n/k) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(n) amortized comparisons per insertion and up to n comparisons for a successful lookup.

Adding rehashing to this model is straightforward. As in a dynamic array, geometric resizing by a factor of b implies that only n/b^i keys are inserted i or more times, so that the total number of insertions is bounded above by bn/(b − 1), which is O(n). By using rehashing to maintain n < k, tables using both chaining and open addressing can have unlimited elements and perform successful lookup in a single comparison for the best choice of hash function.

In more realistic models, the hash function is a random variable over a probability distribution of hash functions, and performance is computed on average over the choice of hash function. When this distribution is uniform, the assumption is called “simple uniform hashing” and it can be shown that hashing with chaining requires Θ(1 + n/k) comparisons on average for an unsuccessful lookup, and hashing with open addressing requires Θ(1/(1 − n/k)).[22] Both of these bounds are constant if we maintain n/k < c using table resizing, where c is a fixed constant less than 1.

3.3.6 Features

Advantages

The main advantage of hash tables over other table data structures is speed. This advantage is more apparent when the number of entries is large. Hash tables are particularly efficient when the maximum number of entries can be predicted in advance, so that the bucket array can be allocated once with the optimum size and never resized.

If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect (see below). In this case the keys need not be stored in the table.
Drawbacks

Although operations on a hash table take constant time on average, the cost of a good hash function can be significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree. Thus hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)

For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if there are not too many possible keys to store—that is, if each key can be represented by a small enough number of bits—then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.

The entries stored in a hash table can be enumerated efficiently (at constant cost per entry), but only in some pseudo-random order. Therefore, there is no efficient way to locate an entry whose key is nearest to a given key. Listing all n entries in some specific order generally requires a separate sorting step, whose cost is proportional to log(n) per entry. In comparison, ordered search trees have lookup and insertion cost proportional to log(n), but allow finding the nearest key at about the same cost, and ordered enumeration of all entries at constant cost per entry.

If the keys are not stored (because the hash function is collision-free), there may be no easy way to enumerate the keys that are present in the table at any given moment.

Although the average cost per operation is constant and fairly small, the cost of a single operation may be quite high. In particular, if the hash table uses dynamic resizing, an insertion or deletion operation may occasionally take time proportional to the number of entries. This may be a serious drawback in real-time or interactive applications.

Hash tables in general exhibit poor locality of reference—that is, the data to be accessed is distributed seemingly at random in memory. Because hash tables cause access patterns that jump around, this can trigger microprocessor cache misses that cause long delays. Compact data structures such as arrays searched with linear search may be faster, if the table is relatively small and keys are compact. The optimal performance point varies from system to system.

Hash tables become quite inefficient when there are many collisions. While extremely uneven hash distributions are extremely unlikely to arise by chance, a malicious adversary with knowledge of the hash function may be able to supply information to a hash that creates worst-case behavior by causing excessive collisions, resulting in very poor performance, e.g., a denial of service attack.[23][24][25] In critical applications, a data structure with better worst-case guarantees can be used; however, universal hashing—a technique that prevents the attacker from predicting which inputs cause worst-case behavior—may be preferable.[26] The hash function used by the hash table in the Linux routing table cache was changed with Linux version 2.4.2 as a countermeasure against such attacks.[27]

3.3.7 Uses

Associative arrays

Main article: associative array

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like Perl, Ruby, Python, and PHP.

When storing a new item into a multimap and a hash collision occurs, the multimap unconditionally stores both items. When storing a new item into a typical associative array and a hash collision occurs, but the actual keys themselves are different, the associative array likewise stores both items. However, if the key of the new item exactly matches the key of an old item, the associative array typically erases the old item and overwrites it with the new item, so every item in the table has a unique key.

Database indexing

Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications.
In multi-node database systems, hash tables are commonly used to distribute rows amongst nodes, reducing network traffic for hash joins.

Caches

Main article: cache (computing)

Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.

Sets

Besides recovering the entry that has a given key, many hash table implementations can also tell whether such an entry exists or not. Those structures can therefore be used to implement a set data structure, which merely records whether a given key belongs to a specified set of keys. In this case, the structure can be simplified by eliminating all parts that have to do with the entry values. Hashing can be used to implement both static and dynamic sets.

Object representation

Several dynamic languages, such as Perl, Python, JavaScript, Lua, and Ruby, use hash tables to implement objects. In this representation, the keys are the names of the members and methods of the object, and the values are pointers to the corresponding member or method.

Unique data representation

Main article: String interning

Hash tables can be used by some programs to avoid creating multiple character strings with the same contents. For that purpose, all strings in use by the program are stored in a single string pool implemented as a hash table, which is checked whenever a new string has to be created. This technique was introduced in Lisp interpreters under the name hash consing, and can be used with many other kinds of data (expression trees in a symbolic algebra system, records in a database, files in a file system, binary decision diagrams, etc.).

Transposition table

Main article: Transposition table

3.3.8 Implementations

In programming languages

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules. In C++11, for example, the unordered_map class provides hash tables for keys and values of arbitrary type.

The Java programming language (including the variant which is used on Android) includes the HashSet, HashMap, LinkedHashSet, and LinkedHashMap generic collections.[28]

In PHP 5, the Zend 2 engine uses one of the hash functions from Daniel J. Bernstein to generate the hash values used in managing the mappings of data pointers stored in a hash table. In the PHP source code, it is labelled as DJBX33A (Daniel J. Bernstein, Times 33 with Addition).

Python’s built-in hash table implementation, in the form of the dict type, as well as Perl’s hash type (%) are used internally to implement namespaces and therefore need to pay more attention to security, i.e., collision attacks. Python sets also use hashes internally, for fast lookup (though they store only keys, not values).[29]

In the .NET Framework, support for hash tables is provided via the non-generic Hashtable and generic Dictionary classes, which store key-value pairs, and the generic HashSet class, which stores only values.

In Rust’s standard library, the generic HashMap and HashSet structs use linear probing with Robin Hood bucket stealing.

Independent packages

• SparseHash (formerly Google SparseHash) An extremely memory-efficient hash_map implementation, with only 2 bits/entry of overhead. The SparseHash library has several C++ hash map implementations with different performance characteristics, including one that optimizes for memory use and another that optimizes for speed.

• SunriseDD An open source C library for hash table storage of arbitrary data objects with lock-free lookups, built-in reference counting and guaranteed order iteration. The library can participate in external reference counting systems or use its own built-in reference counting. It comes with a variety of hash functions and allows the use of run-time supplied hash functions via a callback mechanism. Source code is well documented.

• uthash This is an easy-to-use hash table for C structures.

3.3.9 History

The idea of hashing arose independently in different places. In January 1953, H. P. Luhn wrote an internal IBM memorandum that used hashing with chaining.[30] Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel implemented a program using hashing at about the same time. Open addressing with linear probing (relatively prime stepping) is credited to Amdahl, but Ershov (in Russia) had the same idea.[30]

3.3.10 See also

• Rabin–Karp string search algorithm
• Stable hashing
• Consistent hashing
• Extendible hashing
• Lazy deletion
• Pearson hashing
• PhotoDNA
• Search data structure

Related data structures

There are several data structures that use hash functions but cannot be considered special cases of hash tables:

• Bloom filter, a memory-efficient data structure designed for constant-time approximate lookups; uses hash function(s) and can be seen as an approximate hash table.
• Distributed hash table (DHT), a resilient dynamic table spread over several nodes of a network.
• Hash array mapped trie, a trie structure, similar to the array mapped trie, but where each key is hashed first.

3.3.11 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009). Introduction to Algorithms (3rd ed.). Massachusetts Institute of Technology. pp. 253–280. ISBN 978-0-262-03384-8.
[2] Charles E. Leiserson, Amortized Algorithms, Table Doubling, Potential Method. Lecture 13, course MIT 6.046J/18.410J Introduction to Algorithms—Fall 2005.
[3] Knuth, Donald (1998). The Art of Computer Programming. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
[4] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). “Chapter 11: Hash Tables”. Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 221–252. ISBN 978-0-262-53196-2.
[5] Pearson, Karl (1900). “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling”. Philosophical Magazine, Series 5. 50 (302). pp. 157–175. doi:10.1080/14786440009463897.
[6] Plackett, Robin (1983). “Karl Pearson and the Chi-Squared Test”. International Statistical Review (International Statistical Institute (ISI)). 51 (1). pp. 59–72. doi:10.2307/1402731.
[7] Wang, Thomas (March 1997). “Prime Double Hash Table”. Archived from the original on 1999-09-03. Retrieved 2015-05-10.
[8] Askitis, Nikolas; Zobel, Justin (October 2005). Cache-conscious Collision Resolution in String Hash Tables. Proceedings of the 12th International Conference, String Processing and Information Retrieval (SPIRE 2005). 3772/2005. pp. 91–102. doi:10.1007/11575832_11. ISBN 978-3-540-29740-6.
[9] Askitis, Nikolas; Sinha, Ranjan (2010). “Engineering scalable, cache and space efficient tries for strings”. The VLDB Journal. 17 (5): 633–660. doi:10.1007/s00778-010-0183-9. ISSN 1066-8888.
[10] Askitis, Nikolas (2009). Fast and Compact Hash Tables for Integer Keys (PDF). Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009). 91. pp. 113–122. ISBN 978-1-920682-72-9.
[11] Erik Demaine, Jeff Lind. 6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory. Spring 2003. http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf
[12] Willard, Dan E. (2000). “Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree”. SIAM Journal on Computing. 29 (3): 1030–1049. doi:10.1137/S0097539797322425. MR 1740562.
[13] Tenenbaum, Aaron M.; Langsam, Yedidyah; Augenstein, Moshe J. (1990). Data Structures Using C. Prentice Hall. pp. 456–461, p. 472. ISBN 0-13-199746-7.
[14] Herlihy, Maurice; Shavit, Nir; Tzafrir, Moran (2008). “Hopscotch Hashing”. DISC '08: Proceedings of the 22nd international symposium on Distributed Computing. Berlin, Heidelberg: Springer-Verlag. pp. 350–364.
[15] Celis, Pedro (1986). Robin Hood hashing (PDF) (Technical report). Computer Science Department, University of Waterloo. CS-86-14.
[16] Goossaert, Emmanuel (2013). “Robin Hood hashing”.
[17] Amble, Ole; Knuth, Don (1974). “Ordered hash tables”. Computer Journal. 17 (2): 135. doi:10.1093/comjnl/17.2.135.
[18] Viola, Alfredo (October 2005). “Exact distribution of individual displacements in linear probing hashing”. Transactions on Algorithms (TALG). ACM. 1 (2): 214–242. doi:10.1145/1103963.1103965.
[19] Celis, Pedro (March 1988). External Robin Hood Hashing (Technical report). Computer Science Department, Indiana University. TR246.
[20] http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf
[21] Litwin, Witold (1980). “Linear hashing: A new tool for file and table addressing”. Proc. 6th Conference on Very Large Databases. pp. 212–223.
[22] Doug Dunham. CS 4521 Lecture Notes. University of Minnesota Duluth. Theorems 11.2, 11.6. Last modified April 21, 2009.
[23] Alexander Klink and Julian Wälde’s Efficient Denial of Service Attacks on Web Application Platforms, December 28, 2011, 28th Chaos Communication Congress. Berlin, Germany.
[24] Mike Lennon. “Hash Table Vulnerability Enables Wide-Scale DDoS Attacks”. 2011.
[25] “Hardening Perl’s Hash Function”. November 6, 2013.
[26] Crosby and Wallach. Denial of Service via Algorithmic Complexity Attacks. quote: “modern universal hashing techniques can yield performance comparable to commonplace hash functions while being provably secure against these attacks.” “Universal hash functions ... are ... a solution suitable for adversarial environments. ... in production systems.”
[27] Bar-Yosef, Noa; Wool, Avishai (2007). Remote algorithmic complexity attacks against randomized hash tables. Proc. International Conference on Security and Cryptography (SECRYPT) (PDF). p. 124.
[28] https://docs.oracle.com/javase/tutorial/collections/implementations/index.html
[29] https://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table
[30] Mehta, Dinesh P.; Sahni, Sartaj. Handbook of Data Structures and Applications. p. 9-15. ISBN 1-58488-435-5.

3.3.12 Further reading

• Tamassia, Roberto; Goodrich, Michael T. (2006). “Chapter Nine: Maps and Dictionaries”. Data structures and algorithms in Java : [updated for Java 5.0] (4th ed.). Hoboken, NJ: Wiley. pp. 369–418. ISBN 0-471-73884-0.
• McKenzie, B. J.; Harries, R.; Bell, T. (Feb 1990). “Selecting a hashing algorithm”. Software Practice & Experience. 20 (2): 209–224. doi:10.1002/spe.4380200207.

3.3.13 External links

• A Hash Function for Hash Table Lookup by Bob Jenkins.
• Hash Tables by SparkNotes—explanation using C
• Hash functions by Paul Hsieh
• Design of Compact and Efficient Hash Tables for Java
• Libhashish hash library
• NIST entry on hash tables
• Open addressing hash table removal algorithm from ICI programming language, ici_set_unassign in set.c (and other occurrences, with permission).
• A basic explanation of how the hash table works by Reliable Software
• Lecture on Hash Tables
• Hash-tables in C—two simple and clear examples of hash tables implementation in C with linear probing and chaining, by Daniel Graziotin
• Open Data Structures – Chapter 5 – Hash Tables
• MIT’s Introduction to Algorithms: Hashing 1 MIT OCW lecture Video
• MIT’s Introduction to Algorithms: Hashing 2 MIT OCW lecture Video
• How to sort a HashMap (Java) and keep the duplicate entries
• How python dictionary works

3.4 Linear probing

[Figure: a hash table in which the collision between John Smith and Sandra Dee (both hashing to cell 873) is resolved by placing Sandra Dee at the next free location, cell 874.]

Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene Amdahl, Elaine M. McGraw, and Arthur Samuel and first analyzed in 1963 by Donald Knuth.

Along with quadratic probing and double hashing, linear probing is a form of open addressing. In these schemes, each cell of a hash table stores a single key–value pair. When the hash function causes a collision by mapping a new key to a cell of the hash table that is already occupied by another key, linear probing searches the table for the closest following free location and inserts the new key there. Lookups are performed in the same way, by searching the table sequentially starting at the position given by the hash function, until finding a cell with a matching key or an empty cell.

As Thorup & Zhang (2012) write, “Hash tables are the most commonly used nontrivial data structures, and the most popular implementation on standard hardware uses linear probing, which is both fast and simple.”[1] Linear probing can provide high performance because of its good locality of reference, but is more sensitive to the quality of its hash function than some other collision resolution schemes. It takes constant expected time per search, insertion, or deletion when implemented using a random hash function, a 5-independent hash function, or tabulation hashing. However, good results can be achieved in practice with other hash functions such as MurmurHash.[2]

3.4.1 Operations

Linear probing is a component of open addressing schemes for using a hash table to solve the dictionary problem. In the dictionary problem, a data structure should maintain a collection of key–value pairs subject to operations that insert or delete pairs from the collection or that search for the value associated with a given key. In open addressing solutions to this problem, the data structure is an array T (the hash table) whose cells T[i] (when nonempty) each store a single key–value pair. A hash function is used to map each key into the cell of T where that key should be stored, typically scrambling the keys so that keys with similar values are not placed near each other in the table. A hash collision occurs when the hash function maps a key into a cell that is already occupied by a different key. Linear probing is a strategy for resolving collisions, by placing the new key into the closest following empty cell.[3][4]

Search

To search for a given key x, the cells of T are examined, beginning with the cell at index h(x) (where h is the hash function) and continuing to the adjacent cells h(x) + 1, h(x) + 2, ..., until finding either an empty cell or a cell whose stored key is x. If a cell containing the key is found, the search returns the value from that cell. Otherwise, if an empty cell is found, the key cannot be in the table, because it would have been placed in that cell in preference to any later cell that has not yet been searched. In this case, the search returns as its result that the key is not present in the dictionary.[3][4]

Insertion

To insert a key–value pair (x,v) into the table (possibly replacing any existing pair with the same key), the insertion algorithm follows the same sequence of cells that would be followed for a search, until finding either an empty cell or a cell whose stored key is x. The new key–value pair is then placed into that cell.[3][4]

If the insertion would cause the load factor of the table (its fraction of occupied cells) to grow above some preset threshold, the whole table may be replaced by a new table, larger by a constant factor, with a new hash function, as in a dynamic array. Setting this threshold close to zero and using a high growth rate for the table size leads to faster hash table operations but greater memory usage than threshold values close to one and low growth rates. A common choice would be to double the table size when the load factor would exceed 1/2, causing the load factor to stay between 1/4 and 1/2.[5]
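The search and insertion procedures above translate almost line for line into code. The following is a minimal sketch rather than a tuned implementation; the table size, the EMPTY sentinel, and the hash function are all assumptions of the example:

#include <stddef.h>

#define TSIZE 1024            /* table size; an assumption of this sketch */
#define EMPTY (-1)            /* sentinel marking an unoccupied cell */

static int keys[TSIZE];       /* initialize every cell to EMPTY before use */
static int vals[TSIZE];

static size_t h(int key) { return (unsigned)key * 2654435761u % TSIZE; }

/* Probe h(x), h(x)+1, h(x)+2, ... until the key or an empty cell is found;
   returns the index of the matching or empty cell. The caller must keep
   the load factor strictly below one, or this loop will not terminate. */
static size_t probe(int key)
{
    size_t i = h(key);
    while (keys[i] != EMPTY && keys[i] != key)
        i = (i + 1) % TSIZE;          /* wrap around the end of the table */
    return i;
}

int lookup(int key, int *out)
{
    size_t i = probe(key);
    if (keys[i] == EMPTY)
        return 0;                     /* empty cell reached: key not present */
    *out = vals[i];
    return 1;
}

/* Insert, or replace the value if the key is already present. */
void insert(int key, int val)
{
    size_t i = probe(key);
    keys[i] = key;
    vals[i] = val;
}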

Deletion

[Figure: When a key–value pair is deleted, it may be necessary to move another pair backwards into its cell, to prevent searches for the moved key from finding an empty cell.]

It is also possible to remove a key–value pair from the dictionary. However, it is not sufficient to do so by simply emptying its cell. This would affect searches for other keys that have a hash value earlier than the emptied cell, but that are stored in a position later than the emptied cell. The emptied cell would cause those searches to incorrectly report that the key is not present.

Instead, when a cell i is emptied, it is necessary to search forward through the following cells of the table until finding either another empty cell or a key that can be moved to cell i (that is, a key whose hash value is equal to or earlier than i). When an empty cell is found, then emptying cell i is safe and the deletion process terminates. But, when the search finds a key that can be moved to cell i, it performs this move. This has the effect of speeding up later searches for the moved key, but it also empties out another cell, later in the same block of occupied cells. The search for a movable key continues for the new emptied cell, in the same way, until it terminates by reaching a cell that was already empty. In this process of moving keys to earlier cells, each key is examined only once. Therefore, the time to complete the whole process is proportional to the length of the block of occupied cells containing the deleted key, matching the running time of the other hash table operations.[3]

Alternatively, it is possible to use a lazy deletion strategy in which a key–value pair is removed by replacing the value by a special flag value indicating a deleted key. However, these flag values will contribute to the load factor of the hash table. With this strategy, it may become necessary to clean the flag values out of the array and rehash all the remaining key–value pairs once too large a fraction of the array becomes occupied by deleted keys.[3][4]
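The move-backwards repair just described can be sketched as follows, continuing the arrays and helper functions of the previous example; again, this is an illustration rather than a definitive implementation:

/* Remove a key, then repair the probe sequence by moving later entries
   backwards where necessary, so no search can stop early at the hole. */
void delete_key(int key)
{
    size_t i = probe(key);            /* probe() from the sketch above */
    if (keys[i] == EMPTY)
        return;                       /* key was not present */
    keys[i] = EMPTY;

    size_t j = i;
    for (;;) {
        j = (j + 1) % TSIZE;
        if (keys[j] == EMPTY)
            return;                   /* reached the end of the occupied block */
        size_t home = h(keys[j]);
        /* keys[j] must stay put if its home position lies cyclically in
           (i, j]; otherwise a search for it would hit the hole at i first,
           so it is moved back into cell i and the repair continues from j. */
        int stays = (i < j) ? (home > i && home <= j)
                            : (home > i || home <= j);
        if (!stays) {
            keys[i] = keys[j];
            vals[i] = vals[j];
            keys[j] = EMPTY;
            i = j;
        }
    }
}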

3.4.2 Properties

Linear probing provides good locality of reference, which causes it to require few uncached memory accesses per operation. Because of this, for low to moderate load factors, it can provide very high performance. However, compared to some other open addressing strategies, its performance degrades more quickly at high load factors because of primary clustering, a tendency for one collision to cause more nearby collisions.[3] Additionally, achieving good performance with this method requires a higher-quality hash function than for some other collision resolution schemes.[6] When used with low-quality hash functions that fail to eliminate nonuniformities in the input distribution, linear probing can be slower than other open-addressing strategies such as double hashing, which probes a sequence of cells whose separation is determined by a second hash function, or quadratic probing, where the size of each step varies depending on its position within the probe sequence.[7]

3.4.3 Analysis

Using linear probing, dictionary operations can be implemented in constant expected time. In other words, insert, remove and search operations can be implemented in O(1), as long as the load factor of the hash table is a constant strictly less than one.[8]

In more detail, the time for any particular operation (a search, insertion, or deletion) is proportional to the length of the contiguous block of occupied cells at which the operation starts. If all starting cells are equally likely, in a hash table with N cells, then a maximal block of k occupied cells will have probability k/N of containing the starting location of a search, and will take time O(k) whenever it is the starting location. Therefore, the expected time for an operation can be calculated as the product of these two terms, O(k²/N), summed over all of the maximal blocks of contiguous cells in the table. A similar sum of squared block lengths gives the expected time bound for a random hash function (rather than for a random starting location into a specific state of the hash table), by summing over all the blocks that could exist (rather than the ones that actually exist in a given state of the table), and multiplying the term for each potential block by the probability that the block is actually occupied. That is, defining Block(i,k) to be the event that there is a maximal contiguous block of occupied cells of length k beginning at index i, the expected time per operation is

E[T] = O(1) + ∑_{i=1}^{N} ∑_{k=1}^{n} O(k²/N) Pr[Block(i,k)].

This formula can be simplified by replacing Block(i,k) by a simpler necessary condition Full(k), the event that at least k elements have hash values that lie within a block of cells of length k. After this replacement, the value within the sum no longer depends on i, and the 1/N factor cancels the N terms of the outer summation. These simplifications lead to the bound

E[T] ≤ O(1) + ∑_{k=1}^{n} O(k²) Pr[Full(k)].

But by the multiplicative form of the Chernoff bound, when the load factor is bounded away from one, the probability that a block of length k contains at least k hashed values is exponentially small as a function of k, causing this sum to be bounded by a constant independent of n.[3] It is also possible to perform the same analysis using Stirling’s approximation instead of the Chernoff bound to estimate the probability that a block contains exactly k hashed values.[4][9]

In terms of the load factor α, the expected time for a successful search is O(1 + 1/(1 − α)), and the expected time for an unsuccessful search (or the insertion of a new key) is O(1 + 1/(1 − α)²).[10] For constant load factors, with high probability, the longest probe sequence (among the probe sequences for all keys stored in the table) has logarithmic length.[11]

3.4.4 Choice of hash function

Because linear probing is especially sensitive to unevenly distributed hash values,[7] it is important to combine it with a high-quality hash function that does not produce such irregularities.

The analysis above assumes that each key’s hash is a random number independent of the hashes of all the other keys. This assumption is unrealistic for most applications of hashing. However, random or pseudorandom hash values may be used when hashing objects by their identity rather than by their value. For instance, this is done using linear probing by the IdentityHashMap class of the Java collections framework.[12] The hash value that this class associates with each object, its identityHashCode, is guaranteed to remain fixed for the lifetime of an object but is otherwise arbitrary.[13] Because the identityHashCode is constructed only once per object, and is not required to be related to the object’s address or value, its construction may involve slower computations such as the call to a random or pseudorandom number generator. For instance, Java 8 uses an Xorshift pseudorandom number generator to construct these values.[14]

For most applications of hashing, it is necessary to compute the hash function for each value every time that it is hashed, rather than once when its object is created. In such applications, random or pseudorandom numbers cannot be used as hash values, because then different objects with the same value would have different hashes. And cryptographic hash functions (which are designed to be computationally indistinguishable from truly random functions) are usually too slow to be used in hash tables.[15] Instead, other methods for constructing hash functions have been devised. These methods compute the hash function quickly, and can be proven to work well with linear probing. In particular, linear probing has been analyzed from the framework of k-independent hashing, a class of hash functions that are initialized from a small random seed and that are equally likely to map any k-tuple of distinct keys to any k-tuple of indexes. The parameter k can be thought of as a measure of hash function quality: the larger k is, the more time it will take to compute the hash function, but it will behave more similarly to completely random functions. For linear probing, 5-independence is enough to guarantee constant expected time per operation,[16] while some 4-independent hash functions perform badly, taking up to logarithmic time per operation.[6]

Another method of constructing hash functions with both high quality and practical speed is tabulation hashing. In this method, the hash value for a key is computed by using each byte of the key as an index into a table of random numbers (with a different table for each byte position). The numbers from those table cells are then combined by a bitwise exclusive or operation. Hash functions constructed this way are only 3-independent. Nevertheless, linear probing using these hash functions takes constant expected time per operation.[4][17]
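For concreteness, here is a minimal sketch of simple tabulation hashing for 32-bit keys as just described: one table of random words per key byte, combined with exclusive or. The seeding code is an assumption of the example; any source of random words would do, and rand() is used only for brevity:

#include <stdint.h>
#include <stdlib.h>

/* One table of random words per key byte; filled once at startup. */
static uint32_t T[4][256];

void tab_init(void)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 256; j++)
            T[i][j] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

/* Split the key into bytes, look each byte up in its own table,
   and combine the results with bitwise exclusive or. */
uint32_t tab_hash(uint32_t x)
{
    return T[0][x & 0xff]
         ^ T[1][(x >> 8) & 0xff]
         ^ T[2][(x >> 16) & 0xff]
         ^ T[3][(x >> 24) & 0xff];
}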
Both tabulation hashing and standard methods for generating 5-independent hash functions are limited to keys that have a fixed number of bits. To handle strings or other types of variable-length keys, it is possible to compose a simpler universal hashing technique that maps the keys to intermediate values and a higher quality (5-independent or tabulation) hash function that maps the intermediate values to hash table indices.[1][18]

In an experimental comparison, Richter et al. found that the Multiply-Shift family of hash functions (defined as h_z(x) = (x · z mod 2^w) ÷ 2^(w−d)) was “the fastest hash function when integrated with all hashing schemes, i.e., producing the highest throughputs and also of good quality” whereas tabulation hashing produced “the lowest throughput”.[2] They point out that each table lookup requires several cycles, making it more expensive than simple arithmetic operations. They also found MurmurHash superior to tabulation hashing: “By studying the results provided by Mult and Murmur, we think that the trade-off for by tabulation (...) is less attractive in practice”.

3.4.5 History

The idea of an associative array that allows data to be accessed by its value rather than by its address dates back to the mid-1940s in the work of Konrad Zuse and Vannevar Bush,[19] but hash tables were not described until 1953, in an IBM memorandum by Hans Peter Luhn. Luhn used a different collision resolution method, chaining, rather than linear probing.[20]

Knuth (1963) summarizes the early history of linear probing. It was the first open addressing method, and was originally synonymous with open addressing. According to Knuth, it was first used by Gene Amdahl, Elaine M. McGraw (née Boehme), and Arthur Samuel in 1954, in an assembler program for the IBM 701 computer.[8]

The first published description of linear probing is by Peterson (1957),[8] who also credits Samuel, Amdahl, and Boehme but adds that “the system is so natural, that it very likely may have been conceived independently by others either before or since that time”.[21] Another early publication of this method was by Soviet researcher Andrey Ershov, in 1958.[22]

The first theoretical analysis of linear probing, showing that it takes constant expected time per operation with random hash functions, was given by Knuth.[8] Sedgewick calls Knuth’s work “a landmark in the analysis of algorithms”.[10] Significant later developments include a more detailed analysis of the probability distribution of the running time,[23][24] and the proof that linear probing runs in constant time per operation with practically usable hash functions rather than with the idealized random functions assumed by earlier analysis.[16][17]

3.4.6 References

[1] Thorup, Mikkel; Zhang, Yin (2012), “Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation”, SIAM Journal on Computing, 41 (2): 293–331, doi:10.1137/100800774, MR 2914329.
[2] Richter, Stefan; Alvarez, Victor; Dittrich, Jens (2015), “A seven-dimensional analysis of hashing methods and its implications on query processing”, Proceedings of the VLDB Endowment, 9 (3): 293–331.
[3] Goodrich, Michael T.; Tamassia, Roberto (2015), “Section 6.3.3: Linear Probing”, Algorithm Design and Applications, Wiley, pp. 200–203.
[4] Morin, Pat (February 22, 2014), “Section 5.2: LinearHashTable: Linear Probing”, Open Data Structures (in pseudocode) (0.1Gβ ed.), pp. 108–116, retrieved 2016-01-15.
[5] Sedgewick, Robert; Wayne, Kevin (2011), Algorithms (4th ed.), Addison-Wesley Professional, p. 471, ISBN 9780321573513. Sedgewick and Wayne also halve the table size when a deletion would cause the load factor to become too low, causing them to use a wider range [1/8, 1/2] in the possible values of the load factor.
[6] Pătraşcu, Mihai; Thorup, Mikkel (2010), “On the k-independence required by linear probing and minwise independence” (PDF), Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6–10, 2010, Proceedings, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60.
[7] Heileman, Gregory L.; Luo, Wenbin (2005), “How caching affects hashing” (PDF), Seventh Workshop on Algorithm Engineering and Experiments (ALENEX 2005), pp. 141–154.
[8] Knuth, Donald (1963), Notes on “Open” Addressing.
[9] Eppstein, David (October 13, 2011), “Linear probing made easy”, 0xDE.
[10] Sedgewick, Robert (2003), “Section 14.3: Linear Probing”, Algorithms in Java, Parts 1–4: Fundamentals, Data Structures, Sorting, Searching (3rd ed.), Addison Wesley, pp. 615–620, ISBN 9780321623973.
[11] Pittel, B. (1987), “Linear probing: the probable largest search time grows logarithmically with the number of records”, Journal of Algorithms, 8 (2): 236–249, doi:10.1016/0196-6774(87)90040-X, MR 890874.
[12] “IdentityHashMap”, Java SE 7 Documentation, Oracle, retrieved 2016-01-15.
[13] Friesen, Jeff (2012), Beginning Java 7, Expert’s voice in Java, Apress, p. 376, ISBN 9781430239109.
[14] Kabutz, Heinz M. (September 9, 2014), “Identity Crisis”, The Java Specialists’ Newsletter, 222.
[15] Weiss, Mark Allen (2014), “Chapter 3: Data Structures”, in Gonzalez, Teofilo; Diaz-Herrera, Jorge; Tucker, Allen, Computing Handbook, 1 (3rd ed.), CRC Press, p. 3-11, ISBN 9781439898536.
[16] Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), “Linear probing with constant independence”, SIAM Journal on Computing, 39 (3): 1107–1120, doi:10.1137/070702278, MR 2538852.
[17] Pătraşcu, Mihai; Thorup, Mikkel (2011), “The power of simple tabulation hashing”, Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11), pp. 1–10, arXiv:1011.5200, doi:10.1145/1993636.1993638.
[18] Thorup, Mikkel (2009), “String hashing for linear probing”, Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA: SIAM, pp. 655–664, doi:10.1137/1.9781611973068.72, MR 2809270.
[19] Parhami, Behrooz (2006), Introduction to Parallel Processing: Algorithms and Architectures, Series in Computer Science, Springer, 4.1 Development of early models, p. 67, ISBN 9780306469640.
[20] Morin, Pat (2004), “Hash tables”, in Mehta, Dinesh P.; Sahni, Sartaj, Handbook of Data Structures and Applications, Chapman & Hall / CRC, p. 9-15, ISBN 9781420035179.
[21] Peterson, W. W. (April 1957), “Addressing for random-access storage”, IBM Journal of Research and Development, Riverton, NJ, USA: IBM Corp., 1 (2): 130–146, doi:10.1147/rd.12.0130.
[22] Ershov, A. P. (1958), “On Programming of Arithmetic Operations”, Communications of the ACM, 1 (8): 3–6, doi:10.1145/368892.368907. Translated from Doklady AN USSR 118 (3): 427–430, 1958, by Morris D. Friedman. Linear probing is described as algorithm A2.
[23] Flajolet, P.; Poblete, P.; Viola, A. (1998), “On the analysis of linear probing hashing”, Algorithmica, 22 (4): 490–515, doi:10.1007/PL00009236, MR 1701625.
[24] Knuth, D. E. (1998), “Linear probing and graphs”, Algorithmica, 22 (4): 561–568, doi:10.1007/PL00009240, MR 1701629.

3.5 Quadratic probing

Quadratic probing is an open addressing scheme in computer programming for resolving collisions in hash tables—when an incoming data’s hash value indicates it should be stored in an already-occupied slot or bucket. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial until an open slot is found.

For a given hash value, the indices generated by linear probing are as follows:

H + 1, H + 2, H + 3, H + 4, ..., H + k

This method results in primary clustering, and as the cluster grows larger, the search for those items hashing within the cluster becomes less efficient.

An example sequence using quadratic probing is:

H + 1², H + 2², H + 3², H + 4², ..., H + k²

Quadratic probing can be a more efficient algorithm in a closed hash table, since it better avoids the clustering problem that can occur with linear probing, although it is not immune. It also provides good memory caching because it preserves some locality of reference; however, linear probing has greater locality and, thus, better cache performance.

Quadratic probing is used in the Berkeley Fast File System to allocate free blocks. The allocation routine chooses a new cylinder group when the current one is nearly full using quadratic probing, because of the speed it shows in finding unused cylinder groups.

3.5.1 Quadratic function

Let h(k) be a hash function that maps an element k to an integer in [0, m−1], where m is the size of the table. Let the i-th probe position for a value k be given by the function

h(k, i) = (h(k) + c1·i + c2·i²) mod m

where c2 ≠ 0. If c2 = 0, then h(k,i) degrades to a linear probe. For a given hash table, the values of c1 and c2 remain constant.

Examples:

• If h(k, i) = (h(k) + i + i²) mod m, then the probe sequence will be h(k), h(k) + 2, h(k) + 6, ...
• For m = 2^n, a good choice for the constants are c1 = c2 = 1/2, as the values of h(k,i) for i in [0, m−1] are all distinct. This leads to a probe sequence of h(k), h(k) + 1, h(k) + 3, h(k) + 6, ... (the triangular numbers) where the values increase by 1, 2, 3, ...
• For prime m > 2, most choices of c1 and c2 will make h(k,i) distinct for i in [0, (m−1)/2]. Such choices include c1 = c2 = 1/2, c1 = c2 = 1, and c1 = 0, c2 = 1. Because there are only about m/2 distinct probes for a given element, it is difficult to guarantee that insertions will succeed when the load factor is > 1/2.

3.5.2 Quadratic probing insertion

The problem, here, is to insert a key at an available key space in a given hash table using quadratic probing.[1]

Algorithm to insert key in hash table

1. Get the key k
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If hashtable[h[k]] is empty
   (4.1) Insert key k at hashtable[h[k]]
   (4.2) Stop
   Else
   (4.3) The key space at hashtable[h[k]] is occupied, so we need to find the next available key space
   (4.4) Increment j
   (4.5) Compute new hash function h[k] = (k + j * j) % SIZE
   (4.6) Repeat Step 4 till j is equal to the SIZE of hash table
5. The hash table is full
6. Stop

C function for key insertion

#define SIZE 7   /* table size; a prime, as the limitations section below requires */

/* hashtable[] is an integer hash table; empty[] is another array which
   indicates whether the key space is occupied. If an empty key space is
   found, the function returns the index of the bucket where the key is
   inserted; otherwise it returns -1 if no empty key space is found. */
int quadratic_probing_insert(int *hashtable, int key, int *empty)
{
    int i, index;
    for (i = 0; i < SIZE; i++) {
        index = (key + i * i) % SIZE;   /* quadratic probe: h(k) + i*i */
        if (empty[index]) {
            hashtable[index] = key;
            empty[index] = 0;
            return index;
        }
    }
    return -1;
}

3.5.3 Quadratic probing search

Algorithm to search element in hash table

1. Get the key k to be searched
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If the key space at hashtable[h[k]] is occupied
   (4.1) Compare the element at hashtable[h[k]] with the key k.
   (4.2) If they are equal
      (4.2.1) The key is found at the bucket h[k]
      (4.2.2) Stop
   Else
   (4.3) The element might be placed at the next location given by the quadratic function
   (4.4) Increment j
   (4.5) Set h[k] = (k + (j * j)) % SIZE, so that we can probe the bucket at a new slot, h[k].
   (4.6) Repeat Step 4 till j is greater than SIZE of hash table
5. The key was not found in the hash table
6. Stop

C function for key searching

/* If the key is found in the hash table, the function returns the index
   of the hashtable where the key is inserted; otherwise it returns -1 if
   the key is not found. */
int quadratic_probing_search(int *hashtable, int key, int *empty)
{
    int i, index;
    for (i = 0; i < SIZE; i++) {
        index = (key + i * i) % SIZE;   /* same probe sequence as insertion */
        if (!empty[index] && hashtable[index] == key)
            return index;
    }
    return -1;
}
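For reference, a short driver showing how the two functions above fit together; it assumes both snippets are compiled in the same file, sharing the SIZE definition:

#include <stdio.h>

int main(void)
{
    int hashtable[SIZE];
    int empty[SIZE];
    for (int i = 0; i < SIZE; i++)
        empty[i] = 1;                               /* mark every slot free */

    quadratic_probing_insert(hashtable, 10, empty);
    quadratic_probing_insert(hashtable, 17, empty); /* 17 % 7 == 10 % 7: collides,
                                                       lands at (17 + 1) % 7 = 4 */
    printf("17 stored at index %d\n",
           quadratic_probing_search(hashtable, 17, empty));
    return 0;
}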

3.5.4 Limitations

For linear probing it is a bad idea to let the hash table get nearly full, because performance is degraded as the hash table gets filled.[2] In the case of quadratic probing, the situation is even more drastic. With the exception of the triangular number case for a power-of-two-sized hash table, there is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime. This is because at most half of the table can be used as alternative locations to resolve collisions.

If the hash table size is b (a prime greater than 3), it can be proven that the first b/2 alternative locations including the initial location h(k) are all distinct and unique. Suppose we assume two of the alternative locations to be given by h(k) + x² (mod b) and h(k) + y² (mod b), where 0 ≤ x, y ≤ (b / 2). If these two locations point to the same key space, but x ≠ y, then the following would have to be true:

h(k) + x² = h(k) + y² (mod b)
x² = y² (mod b)
x² − y² = 0 (mod b)
(x − y)(x + y) = 0 (mod b)

As b (the table size) is a prime greater than 3, either (x − y) or (x + y) has to be equal to zero. Since x and y are unique, (x − y) cannot be zero. Also, since 0 ≤ x, y ≤ (b / 2), (x + y) cannot be zero.

Thus, by contradiction, it can be said that the first (b / 2) alternative locations after h(k) are unique. So an empty key space can always be found as long as at most (b / 2) locations are filled, i.e., the hash table is not more than half full.

Alternating sign

If the sign of the offset is alternated (e.g. +1, −4, +9, −16 etc.), and if the number of buckets is a prime number p congruent to 3 modulo 4 (i.e. one of 3, 7, 11, 19, 23, 31 and so on), then the first p offsets will be unique modulo p. In other words, a permutation of 0 through p−1 is obtained, and, consequently, a free bucket will always be found as long as at least one exists.

The insertion algorithm only receives a minor modification (but do note that SIZE has to be a suitable prime number as explained above):

1. Get the key k
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If hashtable[h[k]] is empty
   (4.1) Insert key k at hashtable[h[k]]
   (4.2) Stop
   Else
   (4.3) The key space at hashtable[h[k]] is occupied, so we need to find the next available key space
   (4.4) Increment j
   (4.5) Compute new hash function h[k]. If j is odd, then h[k] = (k + j * j) % SIZE, else h[k] = (k − j * j) % SIZE
   (4.6) Repeat Step 4 till j is equal to the SIZE of hash table
5. The hash table is full
6. Stop

The search algorithm is modified likewise.

3.5.5 See also

• Hash tables
• Hash collision
• Double hashing
• Linear probing
• Hash function

3.5.6 References

[1] Horowitz, Sahni, Anderson-Freed (2011). Fundamentals of Data Structures in C. University Press. ISBN 978-81-7371-605-8.
[2] Weiss, Mark Allen (2009). Data Structures and Algorithm Analysis in C++. Pearson Education. ISBN 978-81-317-1474-4.

3.5.7 External links

• Tutorial/quadratic probing

3.6 Double hashing

Double hashing is a computer programming technique used in hash tables to resolve hash collisions, in cases when two different values to be searched for produce the same hash key. It is a popular collision-resolution technique in open-addressed hash tables. Double hashing is implemented in many popular libraries.

Like linear probing, it uses one hash value as a starting point and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is decided using a second, independent hash function (hence the name double hashing). Unlike linear probing and quadratic probing, the interval depends on the data, so that even values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering.

Given two randomly, uniformly, and independently selected hash functions h1 and h2, the i-th location in the bucket sequence for value k in a hash table T is:

h(i, k) = (h1(k) + i · h2(k)) mod |T|

Generally, h1 and h2 are selected from a set of universal hash functions.
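As a small illustration of this probe sequence, the lookup loop below steps by h2(k) on each collision. It is a sketch under its own assumptions; the table size, the sentinel, and the two hash functions are the example's choices:

#include <stddef.h>

#define TSIZE 1021            /* a prime table size keeps every step full-cycle */
#define EMPTY (-1)

static int keys[TSIZE];       /* initialize every cell to EMPTY before use */

static size_t hash1(int k) { return (unsigned)k % TSIZE; }
/* Secondary hash: 1 + (k mod (TSIZE - 1)) can never be zero, avoiding the
   stuck sequence discussed later in this section. */
static size_t hash2(int k) { return 1 + (unsigned)k % (TSIZE - 1); }

/* Probe h1(k), h1(k)+h2(k), h1(k)+2*h2(k), ... (all mod TSIZE). */
int dh_lookup(int k)
{
    size_t step = hash2(k);
    size_t i = hash1(k);
    for (size_t n = 0; n < TSIZE; n++) {
        if (keys[i] == EMPTY) return -1;    /* key not present */
        if (keys[i] == k) return (int)i;    /* found: return its index */
        i = (i + step) % TSIZE;
    }
    return -1;                               /* entire table searched */
}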

3.6.1 Classical applied data structure

Double hashing with open addressing is a classical data structure on a table T. Let n be the number of elements stored in T; then T’s load factor is α = n/|T|.

Double hashing approximates uniform open address hashing. That is, start by randomly, uniformly and independently selecting two universal hash functions h1 and h2 to build a double hashing table T. All elements are put in T by double hashing using h1 and h2. Given a key k, determining the (i + 1)-st hash location is computed by:

h(i, k) = (h1(k) + i · h2(k)) mod |T|

Let T have a fixed load factor α, with 1 > α > 0. Bradford and Katehakis[1] showed the expected number of probes for an unsuccessful search in T, still using these initially chosen hash functions, is 1/(1 − α) regardless of the distribution of the inputs. More precisely, these two uniformly, randomly and independently chosen hash functions are chosen from a set of universal hash functions where pairwise independence suffices.

Previous results include: Guibas and Szemerédi[2] showed that 1/(1 − α) holds for unsuccessful search for load factors α < 0.319. Also, Lueker and Molodowitch[3] showed that this held assuming ideal randomized functions. Schmidt and Siegel[4] showed this with k-wise independent and uniform functions (for k = c log n, and suitable constant c).
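To make the 1/(1 − α) bound concrete: at load factor α = 1/2 an unsuccessful search is expected to take 1/(1 − 1/2) = 2 probes, at α = 3/4 it takes 4 probes, and at α = 0.9 it takes 10 probes, which is why implementations keep the load factor bounded away from one.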

3.6.2 Implementation details for caching

Linear probing and, to a lesser extent, quadratic probing are able to take advantage of the data cache by accessing locations that are close together. Double hashing has, on average, larger intervals and is not able to achieve this advantage.

Like all other forms of open addressing, double hashing becomes linear as the hash table approaches maximum capacity. The only solution to this is to rehash to a larger size, as with all other open addressing schemes.

On top of that, it is possible for the secondary hash function to evaluate to zero. For example, if we choose k = 5 with the following function:

h2(k) = 5 − (k mod 7)

the resulting sequence will always remain at the initial hash value. One possible solution is to change the secondary hash function to:

h2(k) = (k mod 7) + 1

This ensures that the secondary hash function will always be non-zero.

3.6.3 See also

• Collision resolution in hash tables
• Hash function
• Linear probing
• Cuckoo hashing

3.6.4 Notes

[1] Bradford, Phillip G.; Katehakis, Michael N. (2007), “A probabilistic study on combinatorial expanders and hashing” (PDF), SIAM Journal on Computing, 37 (1): 83–111, doi:10.1137/S009753970444630X, MR 2306284.
[2] L. Guibas and E. Szemerédi: The Analysis of Double Hashing, Journal of Computer and System Sciences, 1978, 16, 226-274.
[3] G. S. Lueker and M. Molodowitch: More Analysis of Double Hashing, Combinatorica, 1993, 13(1), 83-96.
[4] J. P. Schmidt and A. Siegel: Double Hashing is Computable and Randomizable with Universal Hash Functions, manuscript.

3.6.3 See also

• Collision resolution in hash tables
• Hash function
• Linear probing
• Cuckoo hashing

3.6.4 Notes

[1] Bradford, Phillip G.; Katehakis, Michael N. (2007), “A probabilistic study on combinatorial expanders and hashing” (PDF), SIAM Journal on Computing, 37 (1): 83–111, doi:10.1137/S009753970444630X, MR 2306284.

[2] L. Guibas and E. Szemerédi: The Analysis of Double Hashing, Journal of Computer and System Sciences, 1978, 16, 226–274.

[3] G. S. Lueker and M. Molodowitch: More Analysis of Double Hashing, Combinatorica, 1993, 13(1), 83–96.

[4] J. P. Schmidt and A. Siegel: Double Hashing is Computable and Randomizable with Universal Hash Functions, manuscript.

3.6.5 External links

• How Caching Affects Hashing by Gregory L. Heileman and Wenbin Luo, 2005.
• Hash Table Animation
• klib, a C library that includes double hashing functionality.

3.7 Cuckoo hashing

Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

3.7.1 History

Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche Rodler in 2001.[1]

3.7.2 Operation

Cuckoo hashing is a form of open addressing in which each non-empty cell of a hash table contains a key or key–value pair. A hash function is used to determine the location for each key, and its presence in the table (or the value associated with it) can be found by examining that cell of the table. However, open addressing suffers from collisions, which happen when more than one key is mapped to the same cell. The basic idea of cuckoo hashing is to resolve collisions by using two hash functions instead of only one. This provides two possible locations in the hash table for each key. In one of the commonly used variants of the algorithm, the hash table is split into two smaller tables of equal size, and each hash function provides an index into one of these two tables. It is also possible for both hash functions to provide indexes into a single table.

Lookup requires inspection of just two locations in the hash table, which takes constant time in the worst case (see Big O notation). This is in contrast to many other hash table algorithms, which may not have a constant worst-case bound on the time to do a lookup. Deletions, also, may be performed by blanking the cell containing a key, in constant worst-case time, more simply than some other schemes such as linear probing.

When a new key is inserted, and one of its two cells is empty, it may be placed in that cell. However, when both cells are already full, it will be necessary to move other keys to their second locations (or back to their first locations) to make room for the new key. A greedy algorithm is used: the new key is inserted in one of its two possible locations, “kicking out”, that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there. The process continues in the same way until an empty position is found, completing the algorithm. However, it is possible for this insertion process to fail, by entering an infinite loop or by finding a very long chain (longer than a preset threshold that is logarithmic in the table size). In this case, the hash table is rebuilt in-place using new hash functions:

    There is no need to allocate new tables for the rehashing: We may simply run through the tables to delete and perform the usual insertion procedure on all keys found not to be at their intended position in the table.
    — Pagh & Rodler, “Cuckoo Hashing”[1]
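A minimal single-table sketch of this greedy displacement loop follows; the displacement threshold is an illustrative assumption, and the two hash functions are the ones used in the example below. A real implementation would rebuild the table with new hash functions when insertion fails.

    # Cuckoo insertion into a single table with two hash functions (sketch).
    N = 11

    def h1(k): return k % N
    def h2(k): return (k // N) % N

    table = [None] * N
    MAX_KICKS = 8  # threshold; logarithmic in the table size in practice

    def insert(key):
        pos = h1(key)
        for _ in range(MAX_KICKS):
            if table[pos] is None:
                table[pos] = key
                return True
            table[pos], key = key, table[pos]   # kick out the resident key
            # Move the displaced key to its other possible location.
            pos = h2(key) if pos == h1(key) else h1(key)
        return False  # failure: rebuild with new hash functions is needed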

[Figure: Cuckoo hashing example. The arrows show the alternative location of each key. A new item would be inserted in the location of A by moving A to its alternative location, currently occupied by B, and moving B to its alternative location which is currently vacant. Insertion of a new item in the location of H would not succeed: since H is part of a cycle (together with W), the new item would get kicked out again.]

3.7.3 Theory

Insertions succeed in expected constant time,[1] even considering the possibility of having to rebuild the table, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%.

One method of proving this uses the theory of random graphs: one may form an undirected graph called the “cuckoo graph” that has a vertex for each hash table location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the cuckoo graph for this set of values is a pseudoforest, a graph with at most one cycle in each of its connected components. Any vertex-induced subgraph with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the hash table. When the hash function is chosen randomly, the cuckoo graph is a random graph in the Erdős–Rényi model. With high probability, for a random graph in which the ratio of the number of edges to the number of vertices is bounded below 1/2, the graph is a pseudoforest and the cuckoo hashing algorithm succeeds in placing all keys. Moreover, the same theory also proves that the expected size of a connected component of the cuckoo graph is small, ensuring that each insertion takes constant expected time.[2]

3.7.4 Example

The following hash functions are given:

    h(k) = k mod 11
    h′(k) = ⌊k/11⌋ mod 11

[Table: columns show the state of the two hash tables over time as the elements are inserted.]

Cycle

If you now wish to insert the element 6, then you get into a cycle. In the last row of the table we find the same initial situation as at the beginning again:

    h(6) = 6 mod 11 = 6
    h′(6) = ⌊6/11⌋ mod 11 = 0
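The two candidate locations for the element 6 can be verified directly from the functions above (a trivial sketch):

    def h(k):  return k % 11
    def hp(k): return (k // 11) % 11   # h'(k)

    print(h(6), hp(6))  # 6 0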

3.7.5 Variations

Several variations of cuckoo hashing have been studied, primarily with the aim of improving its space usage by increasing the load factor that it can tolerate to a number greater than the 50% threshold of the basic algorithm. Some of these methods can also be used to reduce the failure rate of cuckoo hashing, causing rebuilds of the data structure to be much less frequent.

Generalizations of cuckoo hashing that use more than two alternative hash functions can be expected to utilize a larger part of the capacity of the hash table efficiently while sacrificing some lookup and insertion speed. Using just three hash functions increases the load to 91%.[3] Another generalization of cuckoo hashing, called blocked cuckoo hashing, consists in using more than one key per bucket. Using just 2 keys per bucket permits a load factor above 80%.[4]

Another variation of cuckoo hashing that has been studied is cuckoo hashing with a stash. The stash, in this data structure, is an array of a constant number of keys, used to store keys that cannot successfully be inserted into the main hash table of the structure. This modification reduces the failure rate of cuckoo hashing to an inverse-polynomial function with an exponent that can be made arbitrarily large by increasing the stash size. However, larger stashes also mean slower searches for keys that are not present or are in the stash. A stash can be used in combination with more than two hash functions or with blocked cuckoo hashing to achieve both high load factors and small failure rates.[5] The analysis of cuckoo hashing with a stash extends to practical hash functions, not just to the random hash function model commonly used in theoretical analysis of hashing.[6]

A simplified generalization of cuckoo hashing, known as a skewed-associative cache, is used in some CPU caches.[7]

3.7.6 Comparison with related structures

Other algorithms that use multiple hash functions include the Bloom filter, a memory-efficient data structure for inexact sets. An alternative data structure for the same inexact-set problem, based on cuckoo hashing and called the cuckoo filter, uses even less memory and (unlike classical Bloom filters) allows element deletions as well as insertions and membership tests; however, its theoretical analysis is much less developed than the analysis of Bloom filters.[8]

A study by Zukowski et al.[9] has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross[10] has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high. The performance of the bucketized cuckoo hash table was investigated further by Askitis,[11] with its performance compared against alternative hashing schemes.

A survey by Mitzenmacher[3] presents open problems related to cuckoo hashing as of 2009.

3.7.7 See also

• Perfect hashing
• Linear probing
• Double hashing
• Hash collision
• Hash function
• Quadratic probing
• Hopscotch hashing

3.7.8 References

[1] Pagh, Rasmus; Rodler, Flemming Friche (2001). “Cuckoo Hashing”. Algorithms — ESA 2001. Lecture Notes in Computer Science. 2161. pp. 121–133. doi:10.1007/3-540-44676-1_10. ISBN 978-3-540-42493-2.

[2] Kutzelnigg, Reinhard (2006). Bipartite random graphs and cuckoo hashing. Fourth Colloquium on Mathematics and Computer Science. Discrete Mathematics and Theoretical Computer Science. pp. 403–406.

[3] Mitzenmacher, Michael (2009-09-09). “Some Open Questions Related to Cuckoo Hashing | Proceedings of ESA 2009” (PDF). Retrieved 2010-11-10.

[4] Dietzfelbinger, Martin; Weidling, Christoph (2007), “Balanced allocation and dictionaries with tightly packed constant size bins”, Theoret. Comput. Sci., 380 (1–2): 47–68, doi:10.1016/j.tcs.2007.02.054, MR 2330641.

[5] Kirsch, Adam; Mitzenmacher, Michael D.; Wieder, Udi (2010), “More robust hashing: cuckoo hashing with a stash”, SIAM J. Comput., 39 (4): 1543–1561, doi:10.1137/080728743, MR 2580539.

[6] Aumüller, Martin; Dietzfelbinger, Martin; Woelfel, Philipp (2014), “Explicit and efficient hash families suffice for cuckoo hashing with a stash”, Algorithmica, 70 (3): 428–456, doi:10.1007/s00453-013-9840-x, MR 3247374.

[7] “Micro-Architecture”.

[8] Fan, Bin; Kaminsky, Michael; Andersen, David (August 2013). “Cuckoo Filter: Better Than Bloom” (PDF). ;login:. USENIX. 38 (4): 36–40. Retrieved 12 June 2014.

[9] Zukowski, Marcin; Heman, Sandor; Boncz, Peter (June 2006). “Architecture-Conscious Hashing” (PDF). Proceedings of the International Workshop on Data Management on New Hardware (DaMoN). Retrieved 2008-10-16.

[10] Ross, Kenneth (2006-11-08). “Efficient Hash Probes on Modern Processors” (PDF). IBM Research Report RC24100. RC24100. Retrieved 2008-10-16.

[11] Askitis, Nikolas (2009). Fast and Compact Hash Tables for Integer Keys (PDF). Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009). 91. pp. 113–122. ISBN 978-1-920682-72-9.

3.7.9 External links

• A cool and practical alternative to traditional hash tables, U. Erlingsson, M. Manasse, F. Mcsherry, 2006.
• Cuckoo Hashing for Undergraduates, R. Pagh, 2006.
• Cuckoo Hashing, Theory and Practice (Part 1, Part 2 and Part 3), Michael Mitzenmacher, 2007.
• Naor, Moni; Segev, Gil; Wieder, Udi (2008). “History-Independent Cuckoo Hashing”. International Colloquium on Automata, Languages and Programming (ICALP). Reykjavik, Iceland. Retrieved 2008-07-21.
• Algorithmic Improvements for Fast Concurrent Cuckoo Hashing, X. Li, D. Andersen, M. Kaminsky, M. Freedman. EuroSys 2014.

Examples

• Concurrent high-performance Cuckoo hashtable written in C++
• Cuckoo hash map written in C++
• Static cuckoo hashtable generator for C/C++
• Cuckoo hashtable written in Java
• Generic Cuckoo hashmap in Java
• Cuckoo hash table written in Haskell
• Cuckoo hashing for Go

3.8 Hopscotch hashing

[Figure: Hopscotch hashing. Here, H is 4. Gray entries are occupied. In part (a), the item x is added with a hash value of 6. A linear probe finds that entry 13 is empty. Because 13 is more than 4 entries away from 6, the algorithm looks for an earlier entry to swap with 13. The first place to look is H − 1 = 3 entries before, at entry 10. That entry's hop information bit-map indicates that d, the item at entry 11, can be displaced to 13. After displacing d, entry 11 is still too far from entry 6, so the algorithm examines entry 8. The hop information bit-map indicates that item c at entry 9 can be moved to entry 11. Finally, a is moved to entry 9. Part (b) shows the table state just before adding x.]

Hopscotch hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table using open addressing. It is also well suited for implementing a concurrent hash table. Hopscotch hashing was introduced by Maurice Herlihy, Nir Shavit and Moran Tzafrir in 2008.[1] The name is derived from the sequence of hops that characterize the table's insertion algorithm.

The algorithm uses a single array of n buckets. For each bucket, its neighborhood is a small collection of nearby consecutive buckets (i.e. ones with close indices to the original hashed bucket). The desired property of the neighborhood is that the cost of finding an item in the buckets of the neighborhood is close to the cost of finding it in the bucket itself (for example, by having buckets in the neighborhood fall within the same cache line). The size of the neighborhood must be sufficient to accommodate a logarithmic number of items in the worst case (i.e. it must accommodate log(n) items), but only a constant number on average. If some bucket's neighborhood is filled, the table is resized.

In hopscotch hashing, as in cuckoo hashing, and unlike in linear probing, a given item will always be inserted into and found in the neighborhood of its hashed bucket. In other words, it will always be found either in its original hashed array entry, or in one of the next H − 1 neighboring entries. H could, for example, be 32, a common machine word size. The neighborhood is thus a “virtual” bucket that has fixed size and overlaps with the next H − 1 buckets. To speed the search, each bucket (array entry) includes a “hop-information” word, an H-bit bitmap that indicates which of the next H − 1 entries contain items that hashed to the current entry's virtual bucket. In this way, an item can be found quickly by looking at the word to see which entries belong to the bucket, and then scanning through the constant number of entries (most modern processors support special bit manipulation operations that make the lookup in the “hop-information” bitmap very fast).

Here is how to add item x which was hashed to bucket i:

1. If the entry i is empty, add x to i and return.

2. Starting at entry i, use a linear probe to find an empty entry at index j.

3. If the empty entry's index j is within H − 1 of entry i, place x there and return. Otherwise, entry j is too far from i. To create an empty entry closer to i, find an item y whose hash value lies between i and j, but within H − 1 of j. Displacing y to j creates a new empty slot closer to i. Repeat until the empty entry is within H − 1 of entry i, place x there and return. If no such item y exists, or if the bucket i already contains H items, resize and rehash the table.
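The following simplified sketch follows these three steps; it assumes integer keys, a stand-in hash function, a table that does not wrap around at the end, and it omits the hop-information bitmaps that a real implementation maintains.

    N = 16   # number of buckets (assumption)
    H = 4    # neighborhood size (assumption)

    table = [None] * N

    def home(key):
        return key % N  # stand-in hash function

    def insert(key):
        i = home(key)
        # Steps 1-2: linear probe from the home bucket for an empty entry.
        j = next((p for p in range(i, N) if table[p] is None), None)
        if j is None:
            return False                   # table full: resize and rehash
        # Step 3: while the empty entry is too far from i, hop it backward
        # by displacing an item whose own neighborhood still covers j.
        while j - i >= H:
            for y in range(j - (H - 1), j):
                if table[y] is not None and j - home(table[y]) < H:
                    table[j] = table[y]    # move the item forward into the hole
                    table[y] = None        # the empty slot hops backward to y
                    j = y
                    break
            else:
                return False               # no movable item: resize and rehash
        table[j] = key                     # j is now within H-1 of i
        return True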
The idea is that hopscotch hashing “moves the empty slot towards the desired bucket”. This distinguishes it from linear probing, which leaves the empty slot where it was found, possibly far away from the original bucket, and from cuckoo hashing, which, in order to create a free bucket, moves an item out of one of the desired buckets in the target arrays, and only then tries to find a new place for the displaced item.

To remove an item from the table, one simply removes it from the table entry. If the neighborhood buckets are cache aligned, then one could apply a reorganization operation in which items are moved into the now vacant location in order to improve alignment.

One advantage of hopscotch hashing is that it provides good performance at very high table load factors, even ones exceeding 0.9. Part of this efficiency is due to using a linear probe only to find an empty slot during insertion, not for every lookup as in the original linear probing hash table algorithm. Another advantage is that one can use any hash function, in particular simple ones that are close-to-universal.

3.8.1 See also

• Cuckoo hashing
• Hash collision
• Hash function
• Linear probing
• Open addressing
• Perfect hashing
• Quadratic probing

3.8.2 References

[1] Herlihy, Maurice; Shavit, Nir; Tzafrir, Moran (2008). “Hopscotch Hashing” (PDF). DISC '08: Proceedings of the 22nd international symposium on Distributed Computing. Arcachon, France: Springer-Verlag. pp. 350–364.

3.8.3 External links

• libhhash – a C hopscotch hashing implementation
• hopscotch-map – a C++ implementation of a hash map using hopscotch hashing

3.9 Hash function

This article is about a programming concept. For other meanings of “hash” and “hashing”, see Hash (disambiguation).

A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes. One use is a data structure called a hash table, widely used in computer software for rapid data lookup. Hash functions accelerate table or database lookup by detecting duplicated records in a large file. An example is finding similar stretches in DNA sequences. They are also useful in cryptography.

[Figure: A hash function that maps names to integers from 0 to 15. There is a collision between keys “John Smith” and “Sandra Dee”.]

A cryptographic hash function allows one to easily verify that some input data maps to a given hash value, but if the input data is unknown, it is deliberately difficult to reconstruct it (or equivalent alternatives) by knowing the stored hash value. This is used for assuring integrity of transmitted data, and is the building block for HMACs, which provide message authentication.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization functions, error-correcting codes, and ciphers. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently. The HashKeeper database maintained by the American National Drug Intelligence Center, for instance, is more aptly described as a catalogue of file fingerprints than of hash values.

3.9.1 Uses

Hash tables

Hash functions are primarily used in hash tables,[1] to quickly locate a data record (e.g., a dictionary definition) given its search key (the headword). Specifically, the hash function is used to map the search key to an index; the index gives the place in the hash table where the corresponding record should be stored. Hash tables, in turn, are used to implement associative arrays and dynamic sets.

Typically, the domain of a hash function (the set of possible keys) is larger than its range (the number of different table indices), and so it will map several different keys to the same index. Therefore, each slot of a hash table is associated with (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices.

Thus, the hash function only hints at the record's location — it tells where one should start looking for it. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.

Caches

Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two colliding items. This is also used in file comparison.

Bloom filters

Main article: Bloom filter

Hash functions are an essential ingredient of the Bloom filter, a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.

Finding duplicate records

Main article: Hash table

When storing records in a large unsorted file, one may use a hash function to map each record to an index into a table T, and to collect in each bucket T[i] a list of the numbers of all records with the same hash value i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket T[i] which contains two or more members, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs).
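As a sketch of this procedure, assuming in-memory records and an illustrative table size:

    # Duplicate detection by bucketing record numbers on a hash value.
    from collections import defaultdict

    records = ["alpha", "beta", "gamma", "beta", "delta", "alpha"]

    buckets = defaultdict(list)        # T[i]: record numbers with hash value i
    for number, record in enumerate(records):
        buckets[hash(record) % 16].append(number)

    # Only buckets holding two or more members can contain duplicates.
    for members in buckets.values():
        if len(members) >= 2:
            for a in members:
                for b in members:
                    if a < b and records[a] == records[b]:
                        print("duplicate records:", a, b)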

Main article: Bloom filter A cryptographic hash function allows one to easily ver- ify that some input data maps to a given hash value, but Hash functions are an essential ingredient of the Bloom if the input data is unknown, it is deliberately difficult to filter, a space-efficient probabilistic data structure that is reconstruct it (or equivalent alternatives) by knowing the used to test whether an element is a member of a set. stored hash value. This is used for assuring integrity of transmitted data, and is the building block for , which provide . Finding duplicate records Hash functions are related to (and often confused with) , check digits, fingerprints, randomization functions, error-correcting codes, and ciphers. Although Main article: Hash table these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized When storing records in a large unsorted file, one may differently. The Hash Keeper database maintained by use a hash function to map each record to an index into the American National Drug Intelligence Center, for in- a table T, and to collect in each bucket T[i] a list of the stance, is more aptly described as a catalogue of file fin- numbers of all records with the same hash value i. Once gerprints than of hash values. the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket T[i] which contains two or more 3.9.1 Uses members, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be Hash tables much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs). Hash functions are primarily used in hash tables,[1] to quickly locate a data record (e.g., a dictionary definition) given its search key (the headword). Specifically, the Protecting data hash function is used to map the search key to an index; the index gives the place in the hash table where the cor- Main article: Security of cryptographic hash functions responding record should be stored. Hash tables, in turn, are used to implement associative arrays and dynamic sets. A hash value can be used to uniquely identify secret infor- mation. This requires that the hash function is collision- Typically, the domain of a hash function (the set of possi- resistant, which means that it is very hard to find data that ble keys) is larger than its range (the number of different will generate the same hash value. These functions are table indices), and so it will map several different keys to categorized into cryptographic hash functions and prov- the same index. Therefore, each slot of a hash table is ably secure hash functions. Functions in the second cate- associated with (implicitly or explicitly) a set of records, gory are the most secure but also too slow for most practi- rather than a single record. For this reason, each slot of cal purposes. Collision resistance is accomplished in part a hash table is often called a bucket, and hash values are by generating very large hash values. For example, SHA- also called bucket indices. 1, one of the most widely used cryptographic hash func- Thus, the hash function only hints at the record’s location tions, generates 160 bit values. 80 CHAPTER 3. DICTIONARIES

Finding similar records

Main article: Locality sensitive hashing

Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in nearby buckets. Then one need only check the records in each bucket T[i] against those in buckets T[i+k] where k ranges between −m and m.

This class includes the so-called acoustic fingerprint algorithms, which are used to locate similar-sounding entries in large collections of audio files. For this application, the hash function must be as insensitive as possible to data capture or transmission errors, and to trivial changes such as timing and volume changes, compression, etc.[2]

Finding similar substrings

The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a document repository or a genomic database. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above.

The Rabin–Karp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.

Geometric hashing

This principle is widely used in computer graphics, computational geometry and many other disciplines, to solve many proximity problems in the plane or in three-dimensional space, such as finding closest pairs in a set of points, similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file, grid index, bucket grid, and similar names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or the grid method. Geometric hashing is also used in telecommunications (usually under the name vector quantization) to encode and compress multi-dimensional signals.

Standard uses of hashing in cryptography

Main article: Cryptographic hash function

Some standard applications that employ hash functions include authentication, message integrity (using an HMAC (Hashed MAC)), message fingerprinting, data corruption detection, and digital signature efficiency.

3.9.2 Properties

Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below. The exact requirements are dependent on the application; for example, a hash function well suited to indexing data will probably be a poor choice for a cryptographic hash function.

Determinism

A hash procedure must be deterministic—meaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the data to be hashed, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day. It also excludes functions that depend on the memory address of the object being hashed, in cases where the address may change during execution (as may happen on systems that use certain methods of garbage collection), although sometimes rehashing of the item is possible.

The determinism is in the context of the reuse of the function. For example, Python adds the feature that hash functions make use of a randomized seed that is generated once when the Python process starts, in addition to the input to be hashed. The Python hash is still a valid hash function when used within a single run. But if the values are persisted (for example, written to disk), they can no longer be treated as valid hash values, since in the next run the random value might differ.

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions—pairs of inputs that are mapped to the same hash value—increases. If some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.

Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.

In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (In an ideal "perfect hash function", no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m – see the birthday paradox.)

When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-squared test.

Defined range

It is often desirable that the output of a hash function have fixed size (but see below). If, for example, the output is constrained to 32-bit integer values, the hash values can be used to index into an array. Such hashing is commonly used to accelerate data searches.[3] On the other hand, cryptographic hash functions produce much larger hash values, in order to ensure the computational complexity of brute-force inversion.[4] For example, SHA-1, one of the most widely used cryptographic hash functions, produces a 160-bit value.

Producing fixed-length output from variable-length input can be accomplished by breaking the input data into chunks of specific size. Hash functions used for data searches use some arithmetic expression which iteratively processes chunks of the input (such as the characters in a string) to produce the hash value.[3] In cryptographic hash functions, these chunks are processed by a one-way compression function, with the last chunk being padded if necessary. In this case, their size, which is called the block size, is much bigger than the size of the hash value.[4] For example, in SHA-1, the hash value is 160 bits and the block size is 512 bits.

Variable range

In many applications, the range of hash values may be different for each run of the program, or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters—the input data z, and the number n of allowed hash values.

A common solution is to compute a fixed hash function with a very large range (say, 0 to 2³² − 1), divide the result by n, and use the division's remainder. If n is itself a power of 2, this can be done by bit masking and bit shifting. When this approach is used, the hash function must be chosen so that the result has fairly uniform distribution between 0 and n − 1, for any value of n that may occur in the application. Depending on the function, the remainder may be uniform only for certain values of n, e.g. odd or prime numbers.

We can allow the table size n to not be a power of 2 and still not have to perform any remainder or division operation, as these computations are sometimes costly. For example, let n be significantly less than 2^b. Consider a pseudorandom number generator (PRNG) function P(key) that is uniform on the interval [0, 2^b − 1]. A hash function uniform on the interval [0, n − 1] is n · P(key) / 2^b. We can replace the division by a (possibly faster) right bit shift: (n · P(key)) >> b.

Variable range with minimal movement (dynamic hash function)

When the hash function is used to store values in a hash table that outlives the run of the program, and the hash table needs to be expanded or shrunk, the hash table is referred to as a dynamic hash table. A hash function that relocates the minimum number of records when the table is resized is desirable—where z is the key being hashed and n is the number of allowed hash values—such that H(z, n + 1) = H(z, n) with probability close to n/(n + 1).

Linear hashing and spiral storage are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property. Extendible hashing uses a dynamic hash function that requires space proportional to n to compute the hash function, and it becomes a function of the previous keys that have been inserted. Several algorithms that preserve the uniformity property but require time proportional to n to compute the value of H(z, n) have been invented.
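The remainder method and the division-free method can be compared in a few lines; the mixing function standing in for P(key) is an assumption.

    # Two ways to reduce a b-bit hash value to a table index in [0, n-1].
    b = 32
    n = 100                      # table size, not a power of 2

    def P(key):
        # Stand-in mixer, roughly uniform on [0, 2^b - 1] (Knuth-style
        # multiplicative constant, an illustrative choice).
        return (key * 2654435761) & 0xFFFFFFFF

    key = 123456
    print(P(key) % n)            # remainder method
    print((n * P(key)) >> b)     # division-free: n*P(key)/2^b via bit shift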

casing all letters.

The same technique can be used to map two-letter country codes like “us” or “za” to country names (26² = 676 table entries), 5-digit codes like 13083 to city names (100000 entries), etc. Invalid data values (such as the country code “xx” or the zip code 00000) may be left undefined in the table or mapped to some appropriate “null” value.

Continuity

“A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.”[5]

Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables used in nearest neighbor search.

Non-invertible

In cryptographic applications, hash functions are typically expected to be practically non-invertible, meaning that it is not realistic to reconstruct the input datum x from its hash value h(x) alone without spending great amounts of computing time (see also One-way function).

3.9.3 Hash function algorithms

For most types of hashing functions, the choice of the function depends strongly on the nature of the input data, and their probability distribution in the intended application.

Trivial hash function

If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this “trivial” (identity) hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value.

The meaning of “small enough” depends on the size of the type that is used as the hashed value. For example, in Java, the hash code is a 32-bit integer. Thus the 32-bit integer Integer and 32-bit floating-point Float objects can simply use the value directly; whereas the 64-bit integer Long and 64-bit floating-point Double cannot use this method.

Other types of data can also use this perfect hashing scheme. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character (“A” for “a”, “8” for “8”, etc.). If each character is stored in 8 bits (as in extended ASCII[6] or ISO Latin 1), the table has only 2⁸ = 256 entries; in the case of Unicode characters, the table would have 17×2¹⁶ = 1114112 entries.

Perfect hashing

Main article: Perfect hash function

[Figure: A perfect hash function for the four names shown.]

A hash function that is injective—that is, maps each valid input to a different hash value—is said to be perfect. With such a function one can directly locate the desired entry in a hash table, without any additional searching.

Minimal perfect hashing

[Figure: A minimal perfect hash function for the four names shown.]

A perfect hash function for n keys is said to be minimal if its range consists of n consecutive integers, usually from 0 to n−1. Besides providing single-step lookup, a minimal perfect hash function also yields a compact hash table, without any vacant slots. Minimal perfect hash functions are much harder to find than perfect ones with a wider range.

Hashing uniformly distributed data

If the inputs are bounded-length strings and each input may independently occur with uniform probability (such as telephone numbers, car license plates, invoice numbers, etc.), then a hash function needs to map roughly the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0 to N−1, and the output must be an integer h in the range 0 to n−1, where N is much larger than n. Then the hash function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷ N (the value z scaled down by n/N and truncated to an integer), or many other formulas.

Hashing data with other distributions

These simple formulas will not do if the input values are not equally likely, or are not independent. For instance, most patrons of a supermarket will live in the same geographic area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that case, if m is 10000 or so, the division formula (z × m) ÷ M, which depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod m, which is quite sensitive to the trailing digits, may still yield a fairly even distribution.
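The two formulas can be compared directly (the universe size, table size, and sample keys are illustrative):

    N = 10**6        # size of the input universe
    n = 100          # number of hash values

    for z in (13083, 13084, 999999):
        print(z % n, (z * n) // N)   # remainder method vs. scaled division
    # 83 1 / 84 1 / 99 99: the remainder tracks trailing digits, the
    # scaled division tracks leading digits.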

Hashing variable-length data

When the data values are long (or variable-length) character strings—such as personal names, web page addresses, or mail messages—their distribution is usually very uneven, with complicated dependencies. For example, text in any natural language has highly non-uniform distributions of characters, and character pairs, very characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string—and depends on each character in a different way. In cryptographic hash functions, a Merkle–Damgård construction is usually used.

In general, the scheme for hashing such data is to break the input into a sequence of small units (bits, bytes, words, etc.) and combine all the units b[1], b[2], …, b[m] sequentially, as follows:

    S ← S0                    // Initialize the state.
    for k in 1, 2, ..., m do  // Scan the input data units:
        S ← F(S, b[k])        // Combine data unit k into the state.
    return G(S, n)            // Extract the hash value from the state.

This schema is also used in many text checksum and fingerprint algorithms. The state variable S may be a 32- or 64-bit unsigned integer; in that case, S0 can be 0, and G(S, n) can be just S mod n. The best choice of F is a complex issue and depends on the nature of the data. If the units b[k] are single bits, then F(S, b) could be, for instance:

    if highbit(S) = 0 then return 2 * S + b
    else return (2 * S + b) ^ P

Here highbit(S) denotes the most significant bit of S; the '*' operator denotes unsigned integer multiplication with lost overflow; '^' is the bitwise exclusive-or operation applied to words; and P is a suitable fixed word.[7]

Special-purpose hash functions

In many cases, one can design a special-purpose (heuristic) hash function that yields many fewer collisions than a good general-purpose hash function. For example, suppose that the input data are file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part k of the file name and returns k mod n would be nearly optimal. Needless to say, a function that is exceptionally good for a specific kind of data may have dismal performance on data with a different distribution.

Rolling hash

Main article: Rolling hash

In some applications, such as substring search, one must compute a hash function h for every k-character substring of a given n-character string t, where k is a fixed integer and n is greater than k. The straightforward solution, which is to extract every such substring s of t and compute h(s) separately, requires a number of operations proportional to k·n. However, with the proper choice of h, one can use the technique of rolling hash to compute all those hashes with an effort proportional to k + n.
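A minimal polynomial rolling hash illustrates the O(k + n) computation; the base and modulus are illustrative choices, not the article's.

    t, k = "abracadabra", 3
    B, M = 256, 1_000_003          # base and modulus (assumptions)

    h = 0
    for c in t[:k]:                # hash of the first window: O(k)
        h = (h * B + ord(c)) % M
    print(t[0:k], h)

    top = pow(B, k - 1, M)         # weight of the outgoing character
    for i in range(1, len(t) - k + 1):   # each update is O(1), so O(n) total
        h = ((h - ord(t[i - 1]) * top) * B + ord(t[i + k - 1])) % M
        print(t[i:i + k], h)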

Universal hashing

A universal hashing scheme is a randomized algorithm that selects a hashing function h among a family of such functions, in such a way that the probability of a collision of any two distinct keys is 1/n, where n is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function. See also unique permutation hashing.[8]

Hashing with checksum functions

One can adapt certain checksum or fingerprinting algorithms for use as hash functions. Some of those algorithms will map arbitrarily long string data z, with any typical real-world distribution—no matter how non-uniform and dependent—to a 32-bit or 64-bit string, from which one can extract a hash value in 0 through n − 1.

This method may produce a sufficiently uniform distribution of hash values, as long as the hash range size n is small compared to the range of the checksum or fingerprint function. However, some checksums fare poorly in the avalanche test, which may be a concern in some applications. In particular, the popular CRC32 checksum provides only 16 bits (the higher half of the result) that are usable for hashing. Moreover, each bit of the input has a deterministic effect on each bit of the CRC32—that is, one can tell, without looking at the rest of the input, which bits of the output will flip if the input bit is flipped—so care must be taken to use all 32 bits when computing the hash from the checksum.[9]
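As a sketch of that caveat, the following derives a table index from a CRC32 checksum while folding the upper half into the lower half; the folding step is one illustrative way to involve all 32 bits.

    import zlib

    def index_from_crc(data: bytes, n: int) -> int:
        crc = zlib.crc32(data) & 0xFFFFFFFF
        mixed = crc ^ (crc >> 16)   # fold the upper 16 bits into the lower 16
        return mixed % n

    print(index_from_crc(b"FILE0001.CHK", 101))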

Multiplicative hashing

Multiplicative hashing is a simple type of hash function often used by teachers introducing students to hash tables.[10] Multiplicative hash functions are simple and fast, but have higher collision rates in hash tables than more sophisticated hash functions.[11]

In many applications, such as hash tables, collisions make the system a little slower but are otherwise harmless. In such systems, it is often better to use hash functions based on multiplication—such as MurmurHash and the SBoxHash—or even simpler hash functions such as CRC32—and tolerate more collisions, rather than use a more complex hash function that avoids many of those collisions but takes longer to compute.[11] Multiplicative hashing is susceptible to a “common mistake” that leads to poor diffusion—higher-value input bits do not affect lower-value output bits.[12]

Hashing with cryptographic hash functions

Some cryptographic hash functions, such as SHA-1, have even stronger uniformity guarantees than checksums or fingerprints, and thus can provide very good general-purpose hashing functions. In ordinary applications, this advantage may be too small to offset their much higher cost.[13] However, this method can provide uniformly distributed hashes even when the keys are chosen by a malicious agent. This feature may help to protect services against denial-of-service attacks.

Hashing by nonlinear table lookup

Main article: Tabulation hashing

Tables of random numbers (such as 256 random 32-bit integers) can provide high-quality nonlinear functions to be used as hash functions or for other purposes such as cryptography. The key to be hashed is split into 8-bit (one-byte) parts, and each part is used as an index into the nonlinear table. The table values are then added by arithmetic or XOR addition to the hash output value. Because the table is just 1024 bytes in size, it fits into the cache of modern microprocessors and allows very fast execution of the hashing algorithm. As the table value is on average much longer than 8 bits, one bit of input affects nearly all output bits.

This algorithm has proven to be very fast and of high quality for hashing purposes (especially hashing of integer-number keys).

Efficient hashing of strings

See also: Universal hashing § Hashing strings

Modern microprocessors will allow for much faster processing if 8-bit character strings are not hashed by processing one character at a time, but by interpreting the string as an array of 32-bit or 64-bit integers and hashing/accumulating these “wide word” integer values by means of arithmetic operations (e.g. multiplication by a constant and bit-shifting). The remaining characters of the string which are smaller than the word length of the CPU must be handled differently (e.g. being processed one character at a time). This approach has proven to speed up hash code generation by a factor of five or more on modern microprocessors of a word size of 64 bits.

Another approach[14] is to convert strings to a 32- or 64-bit numeric value and then apply a hash function. One method that avoids the problem of strings having great similarity (“Aaaaaaaaaa” and “Aaaaaaaaab”) is to use a cyclic redundancy check (CRC) of the string to compute a 32- or 64-bit value. While it is possible that two different strings will have the same CRC, the likelihood is very small and only requires that one check the actual string found to determine whether one has an exact match. CRCs will be different for strings such as “Aaaaaaaaaa” and “Aaaaaaaaab”. Although CRC codes can be used as hash values,[15] they are not cryptographically secure since they are not collision-resistant.[16]
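A word-wise accumulation loop might look as follows; the multiplier (the FNV-64 prime) and the byte order are assumptions chosen only to make the sketch concrete.

    import struct

    def hash_string(s: str) -> int:
        data = s.encode()
        h, mask = 0, (1 << 64) - 1
        n_words = len(data) // 8
        # Consume the string eight bytes (one 64-bit word) at a time.
        for (w,) in struct.iter_unpack("<Q", data[:n_words * 8]):
            h = ((h ^ w) * 0x100000001B3) & mask
        for c in data[n_words * 8:]:        # leftover tail, byte by byte
            h = ((h ^ c) * 0x100000001B3) & mask
        return h

    print(hex(hash_string("Aaaaaaaaaa")), hex(hash_string("Aaaaaaaaab")))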

3.9.4 Locality-sensitive hashing

Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). This is different from the conventional hash functions, such as those used in cryptography, as in this case the goal is to maximize the probability of “collision” of similar items rather than to avoid collisions.[17]

One example of LSH is the MinHash algorithm used for finding similar documents (such as web pages): Let h be a hash function that maps the members of A and B to distinct integers, and for any set S define hmin(S) to be the member x of S with the minimum value of h(x). Then hmin(A) = hmin(B) exactly when the minimum hash value of the union A ∪ B lies in the intersection A ∩ B. Therefore,

    Pr[hmin(A) = hmin(B)] = J(A, B),

where J is the Jaccard index. In other words, if r is a random variable that is one when hmin(A) = hmin(B) and zero otherwise, then r is an unbiased estimator of J(A, B), although it has too high a variance to be useful on its own. The idea of the MinHash scheme is to reduce the variance by averaging together several variables constructed in the same way.
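A small sketch of this averaging scheme follows; the linear hash family and the number of repetitions are illustrative assumptions.

    import random

    def make_h(seed, p=2_147_483_647):
        rng = random.Random(seed)
        a, b = rng.randrange(1, p), rng.randrange(p)
        return lambda x: (a * hash(x) + b) % p

    A = {"the", "quick", "brown", "fox"}
    B = {"the", "quick", "red", "fox"}

    hs = [make_h(i) for i in range(200)]
    estimate = sum(min(map(h, A)) == min(map(h, B)) for h in hs) / len(hs)
    true_jaccard = len(A & B) / len(A | B)
    print(round(estimate, 2), true_jaccard)  # estimate should be near 0.6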

3.9.5 Origins of the term

The term “hash” comes by way of analogy with its non-technical meaning, to “chop and mix”; see hash (food). Indeed, typical hash functions, like the mod operation, “chop” the input domain into many sub-domains that get “mixed” into the output range to improve the uniformity of the key distribution.

Donald Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953, and that Robert Morris used the term in a survey paper in CACM which elevated the term from technical jargon to formal terminology.[18]

3.9.6 List of hash functions

Main article: List of hash functions

• NIST hash function competition
• Bernstein hash[19]
• Fowler–Noll–Vo hash function (32, 64, 128, 256, 512, or 1024 bits)
• Jenkins hash function (32 bits)
• Pearson hashing (64 bits)

3.9.7 See also

• Bloom filter
• Coalesced hashing
• Cuckoo hashing
• Hopscotch hashing
• Cryptographic hash function
• Distributed hash table
• Geometric hashing
• Hash Code cracker
• Hash table
• HMAC
• Identicon
• Linear hash
• List of hash functions
• Locality sensitive hashing
• MD5
• Perfect hash function
• PhotoDNA
• Rabin–Karp string search algorithm
• Rolling hash
• Transposition table
• Universal hashing
• MinHash
• Low-discrepancy sequence

3.9.8 References

[1] Konheim, Alan (2010). “7. HASHING FOR STORAGE: DATA MANAGEMENT”. Hashing in Computer Science: Fifty Years of Slicing and Dicing. Wiley-Interscience. ISBN 9780470344736.

[2] “Robust Audio Hashing for Content Identification” by Jaap Haitsma, Ton Kalker and Job Oostveen.

[3] Sedgewick, Robert (2002). “14. Hashing”. Algorithms in Java (3 ed.). Addison Wesley. ISBN 978-0201361209.

[4] Menezes, Alfred J.; van Oorschot, Paul C.; Vanstone, Scott A. (1996). Handbook of Applied Cryptography. CRC Press. ISBN 0849385237.

[5] “Fundamental Data Structures – Josiang p.132”. Retrieved May 19, 2014.

[6] Plain ASCII is a 7-bit character encoding, although it is often stored in 8-bit bytes with the highest-order bit always clear (zero). Therefore, for plain ASCII, the bytes have only 2⁷ = 128 valid values, and the character translation table has only this many entries.

[7] Broder, A. Z. (1993). “Some applications of Rabin's fingerprinting method”. Sequences II: Methods in Communications, Security, and Computer Science. Springer-Verlag. pp. 143–152.

[8] Shlomi Dolev, Limor Lahiani, Yinnon Haviv, “Unique permutation hashing”, Theoretical Computer Science, Volume 475, 4 March 2013, Pages 59–65.

[9] Bret Mulvey, Evaluation of CRC32 for Hash Tables, in Hash Functions. Accessed April 10, 2009.

[10] Knuth. “The Art of Computer Programming”. Volume 3: “Sorting and Searching”. Section “6.4. Hashing”.

[11] Peter Kankowski. “Hash functions: An empirical comparison”.

[12] “CS 3110 Lecture 21: Hash functions”. Section “Multiplicative hashing”.

[13] Bret Mulvey, Evaluation of SHA-1 for Hash Tables, in Hash Functions. Accessed April 10, 2009.

[14] Performance in Practice of String Hashing Functions. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.7520

[15] Peter Kankowski. “Hash functions: An empirical comparison”.

[16] Cam-Winget, Nancy; Housley, Russ; Wagner, David; Walker, Jesse (May 2003). “Security Flaws in 802.11 Data Link Protocols”. Communications of the ACM. 46 (5): 35–39. doi:10.1145/769800.769823.

[17] A. Rajaraman and J. Ullman (2010). “Mining of Massive Datasets, Ch. 3”.

[18] Knuth, Donald (1973). The Art of Computer Programming, volume 3, Sorting and Searching. pp. 506–542.

[19] “Hash Functions”. cse.yorku.ca. September 22, 2003. Retrieved November 1, 2012. The djb2 algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c.

3.9.9 External links

• Calculate hash of a given value by Timo Denk
• Hash Functions and Block Ciphers by Bob Jenkins
• The Goulburn Hashing Function (PDF) by Mayur Patel
• Hash Function Construction for Textual and Geometrical Data Retrieval, Latest Trends on Computers, Vol. 2, pp. 483–489, CSCC conference, Corfu, 2010

3.10 Perfect hash function

In computer science, a perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. In mathematical terms, it is a total injective function.

Perfect hash functions may be used to implement a lookup table with constant worst-case access time. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented.

3.10.1 Application

A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from S (or other associated values) in a lookup table indexed by the output of the function. One can then test whether a key is present in S, or look up a value associated with that key, by looking for it at its cell of the table. Each such lookup takes constant time in the worst case.[1]

3.10.2 Construction

A perfect hash function for a specific set S that can be evaluated in constant time, and with values in a small range, can be found by a randomized algorithm in a number of operations that is proportional to the size of S. The original construction of Fredman, Komlós & Szemerédi (1984) uses a two-level scheme to map a set S of n elements to a range of O(n) indices, and then map each index to a range of hash values. The first level of their construction chooses a large prime p (larger than the size of the universe from which S is drawn) and a parameter k, and maps each element x of S to the index

    g(x) = (kx mod p) mod n.

If k is chosen randomly, this step is likely to have collisions, but the number of elements nᵢ that are simultaneously mapped to the same index i is likely to be small. The second level of their construction assigns disjoint ranges of O(nᵢ²) integers to each index i. It uses a second set of linear modular functions, one for each index i, to map each member x of S into the range associated with g(x).

As Fredman, Komlós & Szemerédi (1984) show, there exists a choice of the parameter k such that the sum of the lengths of the ranges for the n different values of g(x) is O(n). Additionally, for each value of g(x), there exists a linear modular function that maps the corresponding subset of S into the range associated with that value. Both k, and the second-level functions for each value of g(x), can be found in polynomial time by choosing values randomly until finding one that works.[1]

The hash function itself requires storage space O(n) to store k, p, and all of the second-level linear modular functions. Computing the hash value of a given key x may be performed in constant time by computing g(x), looking up the second-level function associated with g(x), and applying this function to x. A modified version of this two-level scheme with a larger number of values at the top level can be used to construct a perfect hash function that maps S into a smaller range of length n + o(n).[1]
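A simplified sketch of this two-level construction follows; the prime, the retry criterion (total second-level size below 4n), and the use of Python lists are illustrative choices rather than the paper's exact parameters.

    import random

    def build_fks(keys, p=10007):        # p: prime larger than the key universe
        n = len(keys)
        while True:                      # first level: retry k until small buckets
            k = random.randrange(1, p)
            buckets = [[] for _ in range(n)]
            for x in keys:
                buckets[(k * x % p) % n].append(x)
            if sum(len(b) ** 2 for b in buckets) < 4 * n:
                break
        tables, params = [], []
        for b in buckets:                # second level: collision-free maps
            m = len(b) ** 2              # bucket of size n_i gets n_i^2 slots
            while True:
                k2 = random.randrange(1, p)
                slots, ok = [None] * m, True
                for x in b:
                    j = (k2 * x % p) % m
                    if slots[j] is not None:
                        ok = False
                        break
                    slots[j] = x
                if ok:
                    break
            tables.append(slots)
            params.append(k2)
        return k, params, tables

    def lookup(x, k, params, tables, p=10007):
        i = (k * x % p) % len(tables)
        t = tables[i]
        return bool(t) and t[(params[i] * x % p) % len(t)] == x

    keys = [53, 20, 75, 31, 99, 404]
    k, params, tables = build_fks(keys)
    print(all(lookup(x, k, params, tables) for x in keys),
          lookup(7, k, params, tables))   # True False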

3.10.3 Space lower bounds

The use of O(n) words of information to store the function of Fredman, Komlós & Szemerédi (1984) is near-optimal: any perfect hash function that can be calculated in constant time requires at least a number of bits that is proportional to the size of S.[2]

3.10.4 Extensions

Dynamic perfect hashing

Main article: Dynamic perfect hashing

Using a perfect hash function is best in situations where there is a frequently queried large set, S, which is seldom updated. This is because any modification of the set S may cause the hash function to no longer be perfect for the modified set. Solutions which update the hash function any time the set is modified are known as dynamic perfect hashing,[3] but these methods are relatively complicated to implement.

Minimal perfect hash function

A minimal perfect hash function is a perfect hash function that maps n keys to n consecutive integers – usually the numbers from 0 to n − 1 or from 1 to n. A more formal way of expressing this is: Let j and k be elements of some finite set S. F is a minimal perfect hash function if and only if F(j) = F(k) implies j = k (injectivity) and there exists an integer a such that the range of F is a..a + |S| − 1. It has been proven that a general-purpose minimal perfect hash scheme requires at least 1.44 bits/key.[4] The best currently known minimal perfect hashing schemes can be represented using approximately 2.6 bits per key.[5]

Order preservation

A minimal perfect hash function F is order preserving if keys are given in some order a1, a2, ..., an and for any keys aj and ak, j < k implies F(aj) < F(ak).[6] In this case, the function value is just the position of each key in the sorted ordering of all of the keys. A simple implementation of order-preserving minimal perfect hash functions with constant access time is to use an (ordinary) perfect hash function or cuckoo hashing to store a lookup table of the positions of each key. If the keys to be hashed are themselves stored in a sorted array, it is possible to store a small number of additional bits per key in a data structure that can be used to compute hash values quickly.[7] Order-preserving minimal perfect hash functions require necessarily Ω(n log n) bits to be represented.[8]

3.10.5 Related constructions

A simple alternative to perfect hashing, which also allows dynamic updates, is cuckoo hashing. This scheme maps keys to two or more locations within a range (unlike perfect hashing, which maps each key to a single location) but does so in such a way that the keys can be assigned one-to-one to locations to which they have been mapped. Lookups with this scheme are slower, because multiple locations must be checked, but nevertheless take constant worst-case time.[9]

3.10.6 References

[1] Fredman, Michael L.; Komlós, János; Szemerédi, Endre (1984), “Storing a Sparse Table with O(1) Worst Case Access Time”, Journal of the ACM, 31 (3): 538, doi:10.1145/828.1884, MR 0819156.

[2] Fredman, Michael L.; Komlós, János (1984), “On the size of separating systems and families of perfect hash functions”, SIAM Journal on Algebraic and Discrete Methods, 5 (1): 61–68, doi:10.1137/0605009, MR 731857.

[3] Dietzfelbinger, Martin; Karlin, Anna; Mehlhorn, Kurt; Meyer auf der Heide, Friedhelm; Rohnert, Hans; Tarjan, Robert E. (1994), “Dynamic perfect hashing: upper and lower bounds”, SIAM Journal on Computing, 23 (4): 738–761, doi:10.1137/S0097539791194094, MR 1283572.

[4] Belazzougui, Djamal; Botelho, Fabiano C.; Dietzfelbinger, Martin (2009), “Hash, displace, and compress” (PDF), Algorithms—ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7–9, 2009, Proceedings, Lecture Notes in Computer Science, 5757, Berlin: Springer, pp. 682–693, doi:10.1007/978-3-642-04128-0_61, MR 2557794.

[5] Baeza-Yates, Ricardo; Poblete, Patricio V. (2010), “Searching”, in Atallah, Mikhail J.; Blanton, Marina, Algorithms and Theory of Computation Handbook: General Concepts and Techniques (2nd ed.), CRC Press, ISBN 9781584888239. See in particular p. 2-10.

[6] Jenkins, Bob (14 April 2009), “order-preserving minimal perfect hashing”, in Black, Paul E., Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology, retrieved 2013-03-05.

[7] Belazzougui, Djamal; Boldi, Paolo; Pagh, Rasmus; Vigna, Sebastiano (November 2008), “Theory and practice of monotone minimal perfect hashing”, Journal of
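In the simplest reading of this idea, the function value for a fixed key set is just the key's rank (a sketch; a real construction would use a compact perfect-hash table rather than a dictionary):

    keys = [53, 20, 75, 31]
    rank = {k: i for i, k in enumerate(sorted(keys))}
    print(rank[20], rank[31], rank[53], rank[75])  # 0 1 2 3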

[8] Fox, Edward A.; Chen, Qi Fan; Daoud, Amjad M.; Heath, Lenwood S. (July 1991), “Order-preserving minimal perfect hash functions and information retrieval”, ACM Transactions on Information Systems, New York, NY, USA: ACM, 9 (3): 281–308, doi:10.1145/125187.125200.

[9] Pagh, Rasmus; Rodler, Flemming Friche (2004), “Cuckoo hashing”, Journal of Algorithms, 51 (2): 122–144, doi:10.1016/j.jalgor.2003.12.002, MR 2050140.

3.10.7 Further reading

• Richard J. Cichelli. Minimal Perfect Hash Functions Made Simple, Communications of the ACM, Vol. 23, Number 1, January 1980.

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 11.5: Perfect hashing, pp. 245–249.

• Fabiano C. Botelho, Rasmus Pagh and Nivio Ziviani. “Perfect Hashing for Data Management Applications”.

• Fabiano C. Botelho and Nivio Ziviani. “External perfect hashing for very large key sets”. 16th ACM Conference on Information and Knowledge Management (CIKM07), Lisbon, Portugal, November 2007.

• Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. “Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses”. In Proceedings of the 20th Annual ACM-SIAM Symposium On Discrete Mathematics (SODA), New York, 2009. ACM Press.

• Douglas C. Schmidt, GPERF: A Perfect Hash Function Generator, C++ Report, SIGS, Vol. 10, No. 10, November/December, 1998.
3.10.8 External links

• Minimal Perfect Hashing by Bob Jenkins

• gperf is an Open Source C and C++ perfect hash generator

• cmph is Open Source implementing many perfect hashing methods

• Sux4J is Open Source implementing perfect hashing, including monotone minimal perfect hashing, in Java

• MPHSharp is Open Source implementing many perfect hashing methods in C#

3.11 Universal hashing

In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

3.11.1 Introduction

See also: Hash function

Assume we want to map keys from some universe U into m bins (labelled [m] = {0, . . . , m − 1}). The algorithm will have to handle some data set S ⊆ U of |S| = n keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from S that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if the size of U is greater than m · n, since the adversary may choose S to be precisely the preimage of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions H = {h : U → [m]} is called a universal family if

∀x, y ∈ U, x ≠ y: Pr_{h∈H}[h(x) = h(y)] ≤ 1/m.

In other words, any two keys of the universe collide with probability at most 1/m when the hash function h is drawn randomly from H. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key. Sometimes, the definition is relaxed to allow collision probability O(1/m). This concept was introduced by Carter and Wegman[1] in 1977, and has found numerous applications in computer science (see, for example, [2]). If we have an upper bound of ε < 1 on the collision probability, we say that we have ε-almost universality.

Many, but not all, universal families have the following stronger uniform difference property:

∀x, y ∈ U, x ≠ y, when h is drawn randomly from the family H, the difference h(x) − h(y) mod m is uniformly distributed in [m].

Note that the definition of universality is only concerned with whether h(x) − h(y) = 0, which counts collisions. The uniform difference property is stronger. (Similarly, a universal family can be XOR universal if ∀x, y ∈ U, x ≠ y, the value h(x) ⊕ h(y) mod m is uniformly distributed in [m], where ⊕ is the bitwise exclusive or operation. This is only possible if m is a power of two.)

An even stronger condition is pairwise independence: we have this property when ∀x, y ∈ U, x ≠ y, the probability that x, y will hash to any pair of hash values z_1, z_2 is as if they were perfectly random: P(h(x) = z_1 ∧ h(y) = z_2) = 1/m². Pairwise independence is sometimes called strong universality.

Another property is uniformity. We say that a family is uniform if all hash values are equally likely: P(h(x) = z) = 1/m for any hash value z. Universality does not imply uniformity. However, strong universality does imply uniformity.

Given a family with the uniform distance property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in [m] to the hash functions. (Similarly, if m is a power of two, we can achieve pairwise independence from an XOR universal hash family by doing an exclusive or with a uniformly distributed random constant.) Since a shift by a constant is sometimes irrelevant in applications (e.g. hash tables), a careful distinction between the uniform distance property and pairwise independence is sometimes not made.[3]

For some applications (such as hash tables), it is important for the least significant bits of the hash values to be also universal. When a family is strongly universal, this is guaranteed: if H is a strongly universal family with m = 2^L, then the family made of the functions h mod 2^L′ for all h ∈ H is also strongly universal for L′ ≤ L. Unfortunately, the same is not true of (merely) universal families. For example, the family made of the identity function h(x) = x is clearly universal, but the family made of the functions h(x) = x mod 2^L′ fails to be universal.

UMAC and Poly1305-AES and several other message authentication code algorithms are based on universal hashing.[4][5] In such applications, the software chooses a new hash function for every message, based on a unique nonce for that message.

Several hash table implementations are based on universal hashing. In such applications, typically the software chooses a new hash function only after it notices that “too many” keys have collided; until then, the same hash function continues to be used over and over. (Some collision resolution schemes, such as dynamic perfect hashing, pick a new hash function every time there is a collision. Other collision resolution schemes, such as cuckoo hashing and 2-choice hashing, allow a number of collisions before picking a new hash function.)

3.11.2 Mathematical guarantees

For any fixed set S of n keys, using a universal family guarantees the following properties.

1. For any fixed x in S, the expected number of keys in the bin h(x) is n/m. When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the key x (for example a query, insertion or deletion).

2. The expected number of pairs of keys x, y in S with x ≠ y that collide (h(x) = h(y)) is bounded above by n(n − 1)/2m, which is of order O(n²/m). When the number of bins, m, is O(n), the expected number of collisions is O(n). When hashing into n² bins, there are no collisions at all with probability at least a half.

3. The expected number of keys in bins with at least t keys in them is bounded above by 2n/(t − 2(n/m) + 1).[6] Thus, if the capacity of each bin is capped to three times the average size (t = 3n/m), the total number of keys in overflowing bins is at most O(m). This only holds with a hash family whose collision probability is bounded above by 1/m. If a weaker definition is used, bounding it by O(1/m), this result is no longer true.[6]

As the above guarantees hold for any fixed set S, they hold if the data set is chosen by an adversary. However, the adversary has to make this choice before (or independent of) the algorithm’s random choice of a hash function. If the adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the same as deterministic hashing.

The second and third guarantee are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some O(n) number of collisions. If it observes too many collisions, it chooses another random h from the family and repeats. Universality guarantees that the number of repetitions is a geometric random variable.

3.11.3 Constructions

Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words (“integers”); fixed-length vectors of machine words; and variable-length vectors (“strings”).

Hashing integers

This section refers to the case of hashing integers that fit in machine words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be U = {0, . . . , |U| − 1}.

The original proposal of Carter and Wegman[1] was to pick a prime p ≥ |U| and define

h_{a,b}(x) = ((ax + b) mod p) mod m

where a, b are randomly chosen integers modulo p with a ≠ 0. (This is a single iteration of a linear congruential generator.)
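A direct C rendering of this definition (a sketch: p is assumed to be a prime at least |U|, a drawn uniformly from {1, ..., p − 1} and b from {0, ..., p − 1}; the 128-bit intermediate avoids overflow and assumes a GCC/Clang-style unsigned __int128):

#include <stdint.h>

/* h_{a,b}(x) = ((a*x + b) mod p) mod m */
uint64_t cw_hash(uint64_t x, uint64_t a, uint64_t b,
                 uint64_t p, uint64_t m) {
    unsigned __int128 t = (unsigned __int128)a * x + b;
    return (uint64_t)((t % p) % m);
}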

To see that H = {h_{a,b}} is a universal family, note that h(x) = h(y) only holds when

ax + b ≡ ay + b + i · m (mod p)

for some integer i between 0 and (p − 1)/m. If x ≠ y, their difference x − y is nonzero and has an inverse modulo p. Solving for a yields

a ≡ i · m · (x − y)^−1 (mod p)

There are p − 1 possible choices for a (since a = 0 is excluded) and, varying i in the allowed range, ⌊(p − 1)/m⌋ possible non-zero values for the right hand side. Thus the collision probability is

⌊(p − 1)/m⌋/(p − 1) ≤ ((p − 1)/m)/(p − 1) = 1/m

Another way to see that H is a universal family is via the notion of statistical distance. Write the difference

h(x) − h(y) ≡ (a(x − y) mod p) (mod m)

Since x − y is nonzero and a is uniformly distributed in {1, . . . , p}, it follows that a(x − y) modulo p is also uniformly distributed in {1, . . . , p}. The distribution of (h(x) − h(y)) mod m is thus almost uniform, up to a difference in probability of 1/p between the samples. As a result, the statistical distance to a uniform family is O(m/p), which becomes negligible when p ≫ m.

The family of simpler hash functions

h_a(x) = (ax mod p) mod m

is only approximately universal: Pr{h_a(x) = h_a(y)} ≤ 2/m for all x ≠ y.[1] Moreover, this analysis is nearly tight; Carter and Wegman[1] show that Pr{h_a(1) = h_a(m + 1)} ≥ 2/(m − 1) whenever (p − 1) mod m = 1.

Avoiding modular arithmetic

The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997.[7] By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least a factor of four[8]). The scheme assumes the number of bins is a power of two, m = 2^M. Let w be the number of bits in a machine word. Then the hash functions are parametrised over odd positive integers a < 2^w (that fit in a word of w bits). To evaluate h_a(x), multiply x by a modulo 2^w and then keep the high order M bits as the hash code. In mathematical notation, this is

h_a(x) = (a · x mod 2^w) div 2^(w−M)

and it can be implemented in C-like programming languages by

h_a(x) = (unsigned) (a*x) >> (w-M)

This scheme does not satisfy the uniform difference property and is only 2/m-almost-universal; for any x ≠ y, Pr{h_a(x) = h_a(y)} ≤ 2/m.

To understand the behavior of the hash function, notice that, if ax mod 2^w and ay mod 2^w have the same highest-order M bits, then a(x − y) mod 2^w has either all 1’s or all 0’s as its highest order M bits (depending on whether ax mod 2^w or ay mod 2^w is larger). Assume that the least significant set bit of x − y appears on position w − c. Since a is a random odd integer and odd integers have inverses in the ring Z_{2^w}, it follows that a(x − y) mod 2^w will be uniformly distributed among w-bit integers with the least significant set bit on position w − c. The probability that these bits are all 0’s or all 1’s is therefore at most 2/2^M = 2/m. On the other hand, if c < M, then the higher-order M bits of a(x − y) mod 2^w contain both 0’s and 1’s, so it is certain that h(x) ≠ h(y). Finally, if c = M then bit w − M of a(x − y) mod 2^w is 1 and h_a(x) = h_a(y) if and only if bits w − 1, . . . , w − M + 1 are also 1, which happens with probability 1/2^(M−1) = 2/m.

This analysis is tight, as can be shown with the example x = 2^(w−M−2) and y = 3x. To obtain a truly “universal” hash function, one can use the multiply-add-shift scheme

h_{a,b}(x) = ((ax + b) mod 2^w) div 2^(w−M)

which can be implemented in C-like programming languages by

h_{a,b}(x) = (unsigned) (a*x+b) >> (w-M)

where a is a random odd positive integer with a < 2^w and b is a random non-negative integer with b < 2^(w−M). With these choices of a and b, Pr{h_{a,b}(x) = h_{a,b}(y)} ≤ 1/m for all x ≢ y (mod 2^w).[9] This differs slightly but importantly from the mistranslation in the English paper.[10]
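Both schemes written out for w = 64 (a sketch; a must be drawn as a random odd 64-bit integer, b as a random integer below 2^(64−M), and C's well-defined unsigned overflow supplies the mod 2^64 for free):

#include <stdint.h>

/* Multiply-shift: m = 2^M bins, a odd; 2/m-almost-universal. */
static inline uint64_t hash_ms(uint64_t x, uint64_t a, int M) {
    return (a * x) >> (64 - M);      /* keep the high-order M bits */
}

/* Multiply-add-shift: truly universal; b < 2^(64-M). */
static inline uint64_t hash_mas(uint64_t x, uint64_t a, uint64_t b, int M) {
    return (a * x + b) >> (64 - M);
}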

Hashing vectors

This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector x̄ = (x_0, . . . , x_{k−1}) of k machine words (integers of w bits each). If H is a universal family with the uniform difference property, the following family (dating back to Carter and Wegman[1]) also has the uniform difference property (and hence is universal):

h(x̄) = (∑_{i=0}^{k−1} h_i(x_i)) mod m, where each h_i ∈ H is chosen independently at random.

If m is a power of two, one may replace summation by exclusive or.[11]

In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family of Dietzfelbinger et al.[12] Initialize the hash function with a vector ā = (a_0, . . . , a_{k−1}) of random odd integers on 2w bits each. Then if the number of bins is m = 2^M for M ≤ w:

h_ā(x̄) = ((∑_{i=0}^{k−1} x_i · a_i) mod 2^(2w)) div 2^(2w−M)

It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.[11] Initialize the hash function with a vector ā = (a_0, . . . , a_{k−1}) of random odd integers on 2w bits each. The following hash family is universal:[13]

h_ā(x̄) = ((∑_{i=0}^{⌈k/2⌉} (x_{2i} + a_{2i}) · (x_{2i+1} + a_{2i+1})) mod 2^(2w)) div 2^(2w−M)

If double-precision operations are not available, one can interpret the input as a vector of half-words (w/2-bit integers). The algorithm will then use ⌈k/2⌉ multiplications, where k was the number of half-words in the vector. Thus, the algorithm runs at a “rate” of one multiplication per word of input.

The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.[14]

Strong universality at high speed is also possible.[15] Initialize the hash function with a vector ā = (a_0, . . . , a_k) of random integers on 2w bits. Compute

h_ā^strong(x̄) = ((a_0 + ∑_{i=0}^{k−1} a_{i+1} x_i) mod 2^(2w)) div 2^w

The result is strongly universal on w bits. Experimentally, it was found to run at 0.2 CPU cycle per byte on recent Intel processors for w = 32.
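A sketch of the pair-multiply family above for w = 32, so that 2w-bit arithmetic is ordinary uint64_t arithmetic (k is assumed even, and a[0..k-1] are assumed to be random odd 64-bit integers):

#include <stdint.h>

/* h(x) = ((sum of (x[2i]+a[2i])*(x[2i+1]+a[2i+1])) mod 2^64) div 2^(64-M) */
uint64_t vector_hash(const uint32_t x[], const uint64_t a[], int k, int M) {
    uint64_t sum = 0;                       /* wraps, i.e. mod 2^64 = 2^(2w) */
    for (int i = 0; i < k; i += 2)
        sum += (x[i] + a[i]) * (x[i + 1] + a[i + 1]);
    return sum >> (64 - M);
}

This performs one multiplication per pair of input words, the halving described above.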
Hashing strings

This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate h(s) is just the length of s. As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality.[11] Note that if zeroes are allowed in the string, then it might be best to append a fictitious non-zero (e.g., 1) character to all strings prior to padding: this will ensure that universality is not affected.[15]

Now assume we want to hash x̄ = (x_0, . . . , x_ℓ), where a good bound on ℓ is not known a priori. A universal family proposed by Dietzfelbinger et al.[12] treats the string x̄ as the coefficients of a polynomial modulo a large prime. If x_i ∈ [u], let p ≥ max{u, m} be a prime and define:

h_a(x̄) = h_int((∑_{i=0}^{ℓ} x_i · a^i) mod p), where a ∈ [p] is uniformly random and h_int is chosen randomly from a universal family mapping the integer domain [p] ↦ [m].

Using properties of modular arithmetic, the above can be computed without producing large numbers for large strings as follows:[16]

uint hash(String x, int a, int p)
    uint h = INITIAL_VALUE
    for (uint i = 0; i < x.length; ++i)
        h = ((h * a) + x[i]) mod p
    return h

This Rabin-Karp rolling hash is based on a linear congruential generator.[17] The above algorithm is also known as a multiplicative hash function.[18] In practice, the mod operator and the parameter p can be avoided altogether by simply allowing the integer to overflow, because this is equivalent to mod (Max-Int-Value + 1) in many programming languages. The table below shows values chosen to initialize h and a for some of the popular implementations:

Implementation                      INITIAL_VALUE   a
Bernstein's hash djb2[19]           5381            33
STLPort 4.6.2                       0               5
Kernighan and Ritchie's hash[20]    0               31
java.lang.String.hashCode()[21]     0               31

Consider two strings x̄, ȳ and let ℓ be the length of the longer one; for the analysis, the shorter string is conceptually padded with zeros up to length ℓ. A collision before applying h_int implies that a is a root of the polynomial with coefficients x̄ − ȳ. This polynomial has at most ℓ roots modulo p, so the collision probability is at most ℓ/p. The probability of collision through the random h_int brings the total collision probability to 1/m + ℓ/p. Thus, if the prime p is sufficiently large compared to the length of the strings hashed, the family is very close to universal (in statistical distance).

Other universal families of hash functions used to hash unknown-length strings to fixed-length hash values include the Rabin fingerprint and the Buzhash.
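For illustration, the overflow variant in C with the values used by Kernighan & Ritchie's hash and java.lang.String.hashCode() from the table above (h starts at 0, a = 31); the wrap-around of 32-bit unsigned arithmetic plays the role of the omitted mod p:

#include <stdint.h>

uint32_t mult_hash(const char *s) {
    uint32_t h = 0;                       /* INITIAL_VALUE */
    for (; *s != '\0'; s++)
        h = h * 31 + (unsigned char)*s;   /* wraps, i.e. mod 2^32 */
    return h;
}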

Avoiding modular arithmetic

To mitigate the computational penalty of modular arithmetic, three tricks are used in practice:[11]

1. One chooses the prime p to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulo p to be implemented without division, using faster operations like addition and shifts (see the sketch after this list). For instance, on modern architectures one can work with p = 2^61 − 1, while the x_i's are 32-bit values.

2. One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the ⌈k/16⌉ results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing.

3. One chooses a power-of-two as the divisor, allowing arithmetic modulo 2^w to be implemented without division (using faster operations of bit masking). The NH hash-function family takes this approach.
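A sketch of the first trick for p = 2^61 − 1: because 2^61 ≡ 1 (mod p), a full 64-bit value can be reduced with a shift, a mask and at most one subtraction, with no division instruction:

#include <stdint.h>

#define P61 ((1ULL << 61) - 1)   /* Mersenne prime 2^61 - 1 */

/* x mod (2^61 - 1), for any 64-bit x, without division. */
static inline uint64_t mod_mersenne61(uint64_t x) {
    uint64_t r = (x & P61) + (x >> 61);   /* uses 2^61 ≡ 1 (mod p) */
    return r >= P61 ? r - P61 : r;
}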

3.11.4 See also

• K-independent hashing
• Rolling hashing
• Tabulation hashing
• Min-wise independence
• Universal one-way hash function
• Low-discrepancy sequence
• Perfect hashing

3.11.5 References

[1] Carter, Larry; Wegman, Mark N. (1979). “Universal Classes of Hash Functions”. Journal of Computer and System Sciences. 18 (2): 143–154. doi:10.1016/0022-0000(79)90044-8. Conference version in STOC'77.

[2] Miltersen, Peter Bro. “Universal Hashing”. Archived from the original (PDF) on 24 June 2009.

[3] Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.

[4] David Wagner, ed. “Advances in Cryptology - CRYPTO 2008”. p. 145.

[5] Jean-Philippe Aumasson, Willi Meier, Raphael Phan, Luca Henzen. “The Hash Function BLAKE”. 2014. p. 10.

[6] Baran, Ilya; Demaine, Erik D.; Pătraşcu, Mihai (2008). “Subquadratic Algorithms for 3SUM” (PDF). Algorithmica. 50 (4): 584–596. doi:10.1007/s00453-007-9036-3.

[7] Dietzfelbinger, Martin; Hagerup, Torben; Katajainen, Jyrki; Penttonen, Martti (1997). “A Reliable Randomized Algorithm for the Closest-Pair Problem” (Postscript). Journal of Algorithms. 25 (1): 19–51. doi:10.1006/jagm.1997.0873. Retrieved 10 February 2011.

[8] Thorup, Mikkel. “Text-book algorithms at SODA”.

[9] Woelfel, Philipp (2003). Über die Komplexität der Multiplikation in eingeschränkten Branchingprogrammmodellen (PDF) (Ph.D. thesis). Universität Dortmund. Retrieved 18 September 2012.

[10] Woelfel, Philipp (1999). Efficient Strongly Universal and Optimally Universal Hashing (PDF). Mathematical Foundations of Computer Science 1999. LNCS. pp. 262–272. doi:10.1007/3-540-48340-3_24. Retrieved 17 May 2011.

[11] Thorup, Mikkel (2009). String hashing for linear probing. Proc. 20th ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 655–664. doi:10.1137/1.9781611973068.72. Archived (PDF) from the original on 2013-10-12. See section 5.3.

[12] Dietzfelbinger, Martin; Gil, Joseph; Matias, Yossi; Pippenger, Nicholas (1992). Polynomial Hash Functions Are Reliable (Extended Abstract). Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP). pp. 235–246.

[13] Black, J.; Halevi, S.; Krawczyk, H.; Krovetz, T. (1999). UMAC: Fast and Secure Message Authentication (PDF). Advances in Cryptology (CRYPTO '99). See Equation 1.

[14] Pătraşcu, Mihai; Thorup, Mikkel (2011). The power of simple tabulation hashing. Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11). pp. 1–10. arXiv:1011.5200. doi:10.1145/1993636.1993638.

[15] Kaser, Owen; Lemire, Daniel (2013). “Strongly universal string hashing is fast”. Computer Journal. Oxford University Press. arXiv:1202.4961. doi:10.1093/comjnl/bxt070.

[16] “Hebrew University Course Slides” (PDF).

[17] Robert Uzgalis. “Library Hash Functions”. 1996.

[18] Kankowski, Peter. “Hash functions: An empirical comparison”.

[19] Yigit, Ozan. “String hash functions”.

[20] Kernighan; Ritchie (1988). “6”. The C Programming Language (2nd ed.). p. 118. ISBN 0-13-110362-8.

[21] “String (Java Platform SE 6)”. docs.oracle.com. Retrieved 2015-06-10.

3.11.6 Further reading

• Knuth, Donald Ervin (1998). The Art of Computer Programming, Vol. III: Sorting and Searching (3rd ed.). Reading, Mass; London: Addison-Wesley. ISBN 0-201-89685-0.

3.11.7 External links

• Open Data Structures - Section 5.1.1 - Multiplicative Hashing

3.12 K-independent hashing

A family of hash functions is said to be k-independent or k-universal[1] if selecting a hash function at random from the family guarantees that the hash codes of any designated k keys are independent random variables (see precise mathematical definitions below). Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many k-independent families have been proposed.

3.12.1 Background

See also: Hash function

The goal of hashing is usually to map keys from some large domain (universe) U into a smaller range, such as m bins (labelled [m] = {0, . . . , m − 1}). In the analysis of randomized algorithms and data structures, it is often desirable for the hash codes of various keys to “behave randomly”. For instance, if the hash code of each key were an independent random choice in [m], the number of keys per bin could be analyzed using the Chernoff bound. A deterministic hash function cannot offer any such guarantee in an adversarial setting, as the adversary may choose the keys to be precisely the preimage of a bin. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a large family of hash functions. The randomness in choosing the hash function can be used to guarantee some desired random behavior of the hash codes of any keys of interest. The first definition along these lines was universal hashing, which guarantees a low collision probability for any two designated keys. The concept of k-independent hashing, introduced by Wegman and Carter in 1981,[2] strengthens the guarantees of random behavior to families of k designated keys, and adds a guarantee on the uniform distribution of hash codes.

3.12.2 Definitions

The strictest definition, introduced by Wegman and Carter[2] under the name “strongly universal k hash family”, is the following. A family of hash functions H = {h : U → [m]} is k-independent if for any k distinct keys (x_1, . . . , x_k) ∈ U^k and any k hash codes (not necessarily distinct) (y_1, . . . , y_k) ∈ [m]^k, we have:

Pr_{h∈H}[h(x_1) = y_1 ∧ · · · ∧ h(x_k) = y_k] = m^−k

This definition is equivalent to the following two conditions:

1. for any fixed x ∈ U, as h is drawn randomly from H, h(x) is uniformly distributed in [m].

2. for any fixed, distinct keys x_1, . . . , x_k ∈ U, as h is drawn randomly from H, h(x_1), . . . , h(x_k) are independent random variables.

Often it is inconvenient to achieve the perfect joint probability of m^−k due to rounding issues. Following [3], one may define a (µ, k)-independent family to satisfy:

∀ distinct (x_1, . . . , x_k) ∈ U^k and ∀(y_1, . . . , y_k) ∈ [m]^k, Pr_{h∈H}[h(x_1) = y_1 ∧ · · · ∧ h(x_k) = y_k] ≤ µ/m^k

Observe that, even if µ is close to 1, the h(x_i) are no longer independent random variables, which is often a problem in the analysis of randomized algorithms. Therefore, a more common alternative to dealing with rounding issues is to prove that the hash family is close in statistical distance to a k-independent family, which allows black-box use of the independence properties.

3.12.3 Techniques

Polynomials with random coefficients

The original technique for constructing k-independent hash functions, given by Carter and Wegman, was to select a large prime number p, choose k random numbers modulo p, and use these numbers as the coefficients of a polynomial of degree k whose values modulo p are used as the value of the hash function. All polynomials of the given degree modulo p are equally likely, and any polynomial is uniquely determined by any k-tuple of argument-value pairs with distinct arguments, from which it follows that any k-tuple of distinct arguments is equally likely to be mapped to any k-tuple of hash values.[2]
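A hedged C sketch of this construction for word-sized keys, using the Mersenne prime p = 2^61 − 1 so that products fit in 128-bit arithmetic (assumes a compiler providing unsigned __int128; the k coefficients c[0..k-1] are assumed to have been drawn uniformly at random modulo p):

#include <stdint.h>

#define P ((1ULL << 61) - 1)   /* prime modulus */

/* Evaluate the polynomial with k random coefficients at the key x,
   modulo p, by Horner's rule. */
uint64_t poly_khash(uint64_t x, const uint64_t c[], int k) {
    uint64_t xr = x % P;
    uint64_t h = 0;
    for (int i = k - 1; i >= 0; i--) {
        unsigned __int128 t = (unsigned __int128)h * xr + c[i];
        h = (uint64_t)(t % P);
    }
    return h;   /* reduce further, e.g. mod m, to obtain a bin index */
}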
Tabulation hashing

Main article: Tabulation hashing

Tabulation hashing is a technique for mapping keys to hash values by partitioning each key into bytes, using each byte as the index into a table of random numbers (with a different table for each byte position), and combining the results of these table lookups by a bitwise exclusive or operation. Thus, it requires more randomness in its initialization than the polynomial method, but avoids possibly-slow multiplication operations. It is 3-independent but not 4-independent.[4] Variations of tabulation hashing can achieve higher degrees of independence by performing table lookups based on overlapping combinations of bits from the input key, or by applying simple tabulation hashing iteratively.[5][6]

3.12.4 Independence needed by different hashing methods

The notion of k-independence can be used to differentiate between different hashing methods, according to the level of independence required to guarantee constant expected time per operation.

For instance, hash chaining takes constant expected time even with a 2-independent hash function, because the expected time to perform a search for a given key is bounded by the expected number of collisions that key is involved in. By linearity of expectation, this expected number equals the sum, over all other keys in the hash table, of the probability that the given key and the other key collide. Because the terms of this sum only involve probabilistic events involving two keys, 2-independence is sufficient to ensure that this sum has the same value that it would for a truly random hash function.[2]

Double hashing is another method of hashing that requires a low degree of independence. It is a form of open addressing that uses two hash functions: one to determine the start of a probe sequence, and the other to determine the step size between positions in the probe sequence. As long as both of these are 2-independent, this method gives constant expected time per operation.[7]

On the other hand, linear probing, a simpler form of open addressing where the step size is always one, requires 5-independence. It can be guaranteed to work in constant expected time per operation with a 5-independent hash function,[8] and there exist 4-independent hash functions for which it takes logarithmic time per operation.[9]

3.12.5 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.

[2] Wegman, Mark N.; Carter, J. Lawrence (1981). “New hash functions and their use in authentication and set equality” (PDF). Journal of Computer and System Sciences. 22 (3): 265–279. doi:10.1016/0022-0000(81)90033-7. Conference version in FOCS'79. Retrieved 9 February 2011.

[3] Siegel, Alan (2004). “On universal classes of extremely random constant-time hash functions and their time-space tradeoff” (PDF). SIAM Journal on Computing. 33 (3): 505–543. doi:10.1137/S0097539701386216. Conference version in FOCS'89.

[4] Pătraşcu, Mihai; Thorup, Mikkel (2012), “The power of simple tabulation hashing”, Journal of the ACM, 59 (3): Art. 14, arXiv:1011.5200, doi:10.1145/2220357.2220361, MR 2946218.

[5] Siegel, Alan (2004), “On universal classes of extremely random constant-time hash functions”, SIAM Journal on Computing, 33 (3): 505–543, doi:10.1137/S0097539701386216, MR 2066640.

[6] Thorup, M. (2013), “Simple tabulation, fast expanders, double tabulation, and high independence”, Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2013), pp. 90–99, doi:10.1109/FOCS.2013.18, MR 3246210.

[7] Bradford, Phillip G.; Katehakis, Michael N. (2007), “A probabilistic study on combinatorial expanders and hashing” (PDF), SIAM Journal on Computing, 37 (1): 83–111, doi:10.1137/S009753970444630X, MR 2306284.

[8] Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), “Linear probing with constant independence”, SIAM Journal on Computing, 39 (3): 1107–1120, doi:10.1137/070702278, MR 2538852

[9] Pătraşcu, Mihai; Thorup, Mikkel (2010), “On the k-independence required by linear probing and minwise independence” (PDF), Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60

3.12.6 Further reading

• Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.

3.13 Tabulation hashing

In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; later work by Carter and Wegman extended this method to arbitrary fixed-length keys. Generalizations of tabulation hashing have also been developed that can handle variable-length keys such as text strings.

Despite its simplicity, tabulation hashing has strong theoretical properties that distinguish it from some other hash functions. In particular, it is 3-independent: every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values. However, it is not 4-independent. More sophisticated but slower variants of tabulation hashing extend the method to higher degrees of independence.

Because of its high degree of independence, tabulation hashing is usable with hashing methods that require a high-quality hash function, including linear probing, cuckoo hashing, and the MinHash technique for estimating the size of set intersections.

3.13.1 Method

Let p denote the number of bits in a key to be hashed, and q denote the number of bits desired in an output hash value. Choose another number r, less than or equal to p; this choice is arbitrary, and controls the tradeoff between time and memory usage of the hashing method: smaller values of r use less memory but cause the hash function to be slower. Compute t by rounding p/r up to the next larger integer; this gives the number of r-bit blocks needed to represent a key. For instance, if r = 8, then an r-bit number is a byte, and t is the number of bytes per key. The key idea of tabulation hashing is to view a key as a vector of t r-bit numbers, use a lookup table filled with random values to compute a hash value for each of the r-bit numbers representing a given key, and combine these values with the bitwise binary exclusive or operation.[1] The choice of r should be made in such a way that this table is not too large; e.g., so that it fits into the computer’s cache memory.[2]

The initialization phase of the algorithm creates a two-dimensional array T of dimensions 2^r by t, and fills the array with random q-bit numbers. Once the array T is initialized, it can be used to compute the hash value h(x) of any given key x. To do so, partition x into r-bit values, where x_0 consists of the low order r bits of x, x_1 consists of the next r bits, etc. For example, with the choice r = 8, x_i is just the ith byte of x. Then, use these values as indices into T and combine them with the exclusive or operation:[1]

h(x) = T[0][x_0] ⊕ T[1][x_1] ⊕ T[2][x_2] ⊕ ...
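A minimal C sketch of the method for 64-bit keys (p = 64), r = 8 (so t = 8 byte positions) and q = 64-bit hash values; filling T with random numbers at initialization time is assumed to have been done already:

#include <stdint.h>

#define T_BLOCKS 8                 /* t = p/r = 64/8 byte positions */

static uint64_t T[T_BLOCKS][256];  /* filled once with random q-bit values */

uint64_t tabulation_hash(uint64_t x) {
    uint64_t h = 0;
    for (int i = 0; i < T_BLOCKS; i++) {
        h ^= T[i][x & 0xFF];       /* table lookup for the i-th byte */
        x >>= 8;
    }
    return h;
}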

3.13.2 History

The first instance of tabulation hashing is Zobrist hashing, a method for hashing positions in abstract board games such as chess, named after Albert Lindsey Zobrist, who published it in 1970.[3] In this method, a random bitstring is generated for each game feature, such as a combination of a chess piece and a square of the chessboard. Then, to hash any game position, the bitstrings for the features of that position are combined by a bitwise exclusive or. The resulting hash value can then be used as an index into a transposition table. Because each move typically changes only a small number of game features, the Zobrist value of the position after a move can be updated quickly from the value of the position before the move, without needing to loop over all of the features of the position.[4]

Tabulation hashing in greater generality, for arbitrary binary values, was later rediscovered by Carter & Wegman (1979) and studied in more detail by Pătraşcu & Thorup (2012).

3.13.3 Universality

Carter & Wegman (1979) define a randomized scheme for generating hash functions to be universal if, for any two keys, the probability that they collide (that is, they are mapped to the same value as each other) is 1/m, where m is the number of values that the keys can take on. They defined a stronger property in the subsequent paper Wegman & Carter (1981): a randomized scheme for generating hash functions is k-independent if, for every k-tuple of keys, and each possible k-tuple of values, the probability that those keys are mapped to those values is 1/m^k. 2-independent hashing schemes are automatically universal, and any universal hashing scheme can be converted into a 2-independent scheme by storing a random number x as part of the initialization phase of the algorithm and adding x to each hash value. Thus, universality is essentially the same as 2-independence. However, k-independence for larger values of k is a stronger property, held by fewer hashing algorithms.

As Pătraşcu & Thorup (2012) observe, tabulation hashing is 3-independent but not 4-independent. For any single key x, T[x_0, 0] is equally likely to take on any hash value, and the exclusive or of T[x_0, 0] with the remaining table values does not change this property. For any two keys x and y, x is equally likely to be mapped to any hash value as before, and there is at least one position i where x_i ≠ y_i; the table value T[y_i, i] is used in the calculation of h(y) but not in the calculation of h(x), so even after the value of h(x) has been determined, h(y) is equally likely to be any valid hash value. Similarly, for any three keys x, y, and z, at least one of the three keys has a position i where its value z_i differs from the other two, so that even after the values of h(x) and h(y) are determined, h(z) is equally likely to be any valid hash value.[5]
However, this reasoning breaks down for four keys because there are sets of keys w, x, y, and z where none of the four has a byte value that it does not share with at least one of the other keys. For instance, if the keys have two bytes each, and w, x, y, and z are the four keys that have either zero or one as their byte values, then each byte value in each position is shared by exactly two of the four keys. For these four keys, the hash values computed by tabulation hashing will always satisfy the equation h(w) ⊕ h(x) ⊕ h(y) ⊕ h(z) = 0, whereas for a 4-independent hashing scheme the same equation would only be satisfied with probability 1/m. Therefore, tabulation hashing is not 4-independent.[5]

3.13.4 Application

Because tabulation hashing is a universal hashing scheme, it can be used in any hashing-based algorithm in which universality is sufficient. For instance, in hash chaining, the expected time per operation is proportional to the sum of collision probabilities, which is the same for any universal scheme as it would be for truly random hash functions, and is constant whenever the load factor of the hash table is constant. Therefore, tabulation hashing can be used to compute hash functions for hash chaining with a theoretical guarantee of constant expected time per operation.[6]

However, universal hashing is not strong enough to guarantee the performance of some other hashing algorithms. For instance, for linear probing, 5-independent hash functions are strong enough to guarantee constant time operation, but there are 4-independent hash functions that fail.[7] Nevertheless, despite only being 3-independent, tabulation hashing provides the same constant-time guarantee for linear probing.[8]

Cuckoo hashing, another technique for implementing hash tables, guarantees constant time per lookup (regardless of the hash function). Insertions into a cuckoo hash table may fail, causing the entire table to be rebuilt, but such failures are sufficiently unlikely that the expected time per insertion (using either a truly random hash function or a hash function with logarithmic independence) is constant. With tabulation hashing, on the other hand, the best bound known on the failure probability is higher, high enough that insertions cannot be guaranteed to take constant expected time. Nevertheless, tabulation hashing is adequate to ensure the linear-expected-time construction of a cuckoo hash table for a static set of keys that does not change as the table is used.[8]

3.13.5 Extensions

Although tabulation hashing as described above (“simple tabulation hashing”) is only 3-independent, variations of this method can be used to obtain hash functions with much higher degrees of independence. Siegel (2004) uses the same idea of using exclusive or operations to combine random values from a table, with a more complicated algorithm based on expander graphs for transforming the key bits into table indices, to define hashing schemes that are k-independent for any constant or even logarithmic value of k. However, the number of table lookups needed to compute each hash value using Siegel’s variation of tabulation hashing, while constant, is still too large to be practical, and the use of expanders in Siegel’s technique also makes it not fully constructive. Thorup (2013) provides a scheme based on tabulation hashing that reaches high degrees of independence more quickly, in a more constructive way. He observes that using one round of simple tabulation hashing to expand the input keys to six times their original length, and then a second round of simple tabulation hashing on the expanded keys, results in a hashing scheme whose independence number is exponential in the parameter r, the number of bits per block in the partition of the keys into blocks.

Simple tabulation is limited to keys of a fixed length, because a different table of random values needs to be initialized for each position of a block in the keys. Lemire (2012) studies variations of tabulation hashing suitable for variable-length keys such as character strings. The general type of hashing scheme studied by Lemire uses a single table T indexed by the value of a block, regardless of its position within the key. However, the values from this table may be combined by a more complicated function than bitwise exclusive or. Lemire shows that no scheme of this type can be 3-independent. Nevertheless, he shows that it is still possible to achieve 2-independence. In particular, a tabulation scheme that interprets the values T[x_i] (where x_i is, as before, the ith block of the input) as the coefficients of a polynomial over a finite field and then takes the remainder of the resulting polynomial modulo another polynomial gives a 2-independent hash function.

3.13.6 Notes

[1] Morin (2014); Mitzenmacher & Upfal (2014).

[2] Mitzenmacher & Upfal (2014).

[3] Thorup (2013).

[4] Zobrist (1970).

[5] Pătraşcu & Thorup (2012); Mitzenmacher & Upfal (2014).

[6] Carter & Wegman (1979).

[7] For the sufficiency of 5-independent hashing for linear probing, see Pagh, Pagh & Ružić (2009). For examples of weaker hashing schemes that fail, see Pătraşcu & Thorup (2010).

[8] Pătraşcu & Thorup (2012).

3.13.7 References

Secondary sources

• Morin, Pat (February 22, 2014), “Section 5.2.3: Tabulation hashing”, Open Data Structures (in pseudocode) (0.1Gβ ed.), pp. 115–116, retrieved 2016-01-08.

• Mitzenmacher, Michael; Upfal, Eli (2014), “Some practical randomized algorithms and data structures”, in Tucker, Allen; Gonzalez, Teofilo; Diaz-Herrera, Jorge, Computing Handbook: Computer Science and Software Engineering (3rd ed.), CRC Press, pp. 11-1 – 11-23, ISBN 9781439898529. See in particular Section 11.1.1: Tabulation hashing, pp. 11-3 – 11-4.

Primary sources

• Carter, J. Lawrence; Wegman, Mark N. (1979), “Universal classes of hash functions”, Journal of Computer and System Sciences, 18 (2): 143–154, doi:10.1016/0022-0000(79)90044-8, MR 532173.

• Lemire, Daniel (2012), “The universality of iterated hashing over variable-length strings”, Discrete Applied Mathematics, 160: 604–617, arXiv:1008.1715, doi:10.1016/j.dam.2011.11.009, MR 2876344.

• Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), “Linear probing with constant independence”, SIAM Journal on Computing, 39 (3): 1107–1120, doi:10.1137/070702278, MR 2538852.

• Pătraşcu, Mihai; Thorup, Mikkel (2010), “On the k-independence required by linear probing and minwise independence” (PDF), Proceedings of the 37th International Colloquium on Automata, Languages and Programming (ICALP 2010), Bordeaux, France, July 6-10, 2010, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60, MR 2734626.

• Pătraşcu, Mihai; Thorup, Mikkel (2012), “The power of simple tabulation hashing”, Journal of the ACM, 59 (3): Art. 14, arXiv:1011.5200, doi:10.1145/2220357.2220361, MR 2946218.

• Siegel, Alan (2004), “On universal classes of extremely random constant-time hash functions”, SIAM Journal on Computing, 33 (3): 505–543, doi:10.1137/S0097539701386216, MR 2066640.

• Thorup, M. (2013), “Simple tabulation, fast expanders, double tabulation, and high independence”, Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2013), pp. 90–99, doi:10.1109/FOCS.2013.18, MR 3246210.

• Wegman, Mark N.; Carter, J. Lawrence (1981), “New hash functions and their use in authentication and set equality”, Journal of Computer and System Sciences, 22 (3): 265–279, doi:10.1016/0022-0000(81)90033-7, MR 633535.

• Zobrist, Albert L. (April 1970), A New Hashing Method with Application for Game Playing (PDF), Tech. Rep. 88, Madison, Wisconsin: Computer Sciences Department, University of Wisconsin.

3.14 Cryptographic hash function

[Figure: A cryptographic hash function (specifically SHA-1) at work. A small change in the input (in the word “over”) drastically changes the output (digest). This is the so-called avalanche effect.]

A cryptographic hash function is a special class of hash function that has certain properties which make it suitable for use in cryptography. It is a mathematical algorithm that maps data of arbitrary size to a bit string of a fixed size (a hash function) which is designed to also be a one-way function, that is, a function which is infeasible to invert. The only way to recreate the input data from an ideal cryptographic hash function’s output is to attempt a brute-force search of possible inputs to see if they produce a match. Bruce Schneier has called one-way hash functions “the workhorses of modern cryptography”.[1] The input data is often called the message, and the output (the hash value or hash) is often called the message digest or simply the digest.

The ideal cryptographic hash function has four main properties:

• it is quick to compute the hash value for any given message

• it is infeasible to generate a message from its hash value except by trying all possible messages

• a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value

• it is infeasible to find two different messages with the same hash value

Cryptographic hash functions have many information-security applications, notably in digital signatures, message authentication codes (MACs), and other forms of authentication. They can also be used as ordinary hash functions, to index data in hash tables, for fingerprinting, to detect duplicate data or uniquely identify files, and as checksums to detect accidental data corruption. Indeed, in information-security contexts, cryptographic hash values are sometimes called (digital) fingerprints, checksums, or just hash values, even though all these terms stand for more general functions with rather different properties and purposes.
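The avalanche effect from the figure can be reproduced directly; a small sketch assuming OpenSSL's SHA1() is available (compile with -lcrypto):

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

static void print_digest(const char *msg) {
    unsigned char d[SHA_DIGEST_LENGTH];
    SHA1((const unsigned char *)msg, strlen(msg), d);
    printf("%-40s", msg);
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        printf("%02X", d[i]);
    printf("\n");
}

int main(void) {
    /* One changed letter yields an unrelated-looking digest. */
    print_digest("The red fox jumps over the blue dog");
    print_digest("The red fox jumps ouer the blue dog");
    return 0;
}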
3.14.1 Properties

Most cryptographic hash functions are designed to take a string of any length as input and produce a fixed-length hash value.

A cryptographic hash function must be able to withstand all known types of cryptanalytic attack. In theoretical cryptography, the security level of a cryptographic hash function has been defined using the following properties:

• Pre-image resistance: Given a hash value h it should be difficult to find any message m such that h = hash(m). This concept is related to that of a one-way function. Functions that lack this property are vulnerable to preimage attacks.

• Second pre-image resistance: Given an input m1 it should be difficult to find a different input m2 such that hash(m1) = hash(m2). Functions that lack this property are vulnerable to second-preimage attacks.

• Collision resistance: It should be difficult to find two different messages m1 and m2 such that hash(m1) = hash(m2). Such a pair is called a cryptographic hash collision. This property is sometimes referred to as strong collision resistance. It requires a hash value at least twice as long as that required for preimage-resistance; otherwise collisions may be found by a birthday attack.[2]

These properties form a hierarchy, in that collision resistance implies second pre-image resistance, which in turn implies pre-image resistance, while the converse is not true in general.[3] The weaker assumption is always preferred in theoretical cryptography, but in practice, a hash function which is only second pre-image resistant is considered insecure and is therefore not recommended for real applications.

Informally, these properties mean that a malicious adversary cannot replace or modify the input data without changing its digest. Thus, if two strings have the same digest, one can be very confident that they are identical. A function meeting these criteria may still have undesirable properties. Currently popular cryptographic hash functions are vulnerable to length-extension attacks: given hash(m) and len(m) but not m, by choosing a suitable m′ an attacker can calculate hash(m || m′), where || denotes concatenation.[4] This property can be used to break naive authentication schemes based on hash functions. The HMAC construction works around these problems.

In practice, collision resistance is insufficient for many practical uses. In addition to collision resistance, it should be impossible for an adversary to find two messages with substantially similar digests, or to infer any useful information about the data given only its digest. In particular, a hash function should behave as much as possible like a random function (often called a random oracle in proofs of security) while still being deterministic and efficiently computable. This rules out functions like the SWIFFT function, which can be rigorously proven to be collision resistant assuming that certain problems on ideal lattices are computationally difficult, but which, as a linear function, does not satisfy these additional properties.[5]

Checksum algorithms, such as CRC32 and other cyclic redundancy checks, are designed to meet much weaker requirements, and are generally unsuitable as cryptographic hash functions. For example, a CRC was used for message integrity in the WEP encryption standard, but an attack was readily discovered which exploited the linearity of the checksum.

Degree of difficulty

In cryptographic practice, “difficult” generally means “almost certainly beyond the reach of any adversary who must be prevented from breaking the system for as long as the security of the system is deemed important”. The meaning of the term is therefore somewhat dependent on the application, since the effort that a malicious agent may put into the task is usually proportional to his expected gain. However, since the needed effort usually grows very quickly with the digest length, even a thousand-fold advantage in processing power can be neutralized by adding a few dozen bits to the latter.

For messages selected from a limited set of messages, for example passwords or other short messages, it can be feasible to invert a hash by trying all possible messages in the set. Because cryptographic hash functions are typically designed to be computed quickly, special key derivation functions that require greater computing resources have been developed that make such brute force attacks more difficult.

In some theoretical analyses “difficult” has a specific mathematical meaning, such as “not solvable in asymptotic polynomial time”. Such interpretations of difficulty are important in the study of provably secure cryptographic hash functions but do not usually have a strong connection to practical security. For example, an exponential time algorithm can sometimes still be fast enough to make a feasible attack. Conversely, a polynomial time algorithm (e.g., one that requires n^20 steps for n-digit keys) may be too slow for any practical use.

3.14.2 Illustration

An illustration of the potential use of a cryptographic hash is as follows: Alice poses a tough math problem to Bob and claims she has solved it. Bob would like to try it himself, but would yet like to be sure that Alice is not bluffing. Therefore, Alice writes down her solution, computes its hash and tells Bob the hash value (whilst keeping the solution secret). Then, when Bob comes up with the solution himself a few days later, Alice can prove that she had the solution earlier by revealing it and having Bob hash it and check that it matches the hash value given to him before. (This is an example of a simple commitment scheme; in actual practice, Alice and Bob will often be computer programs, and the secret would be something less easily spoofed than a claimed puzzle solution.)

3.14.3 Applications

Verifying the integrity of files or messages

Main article: File verification
An important application of secure hashes is verification of message integrity. Determining whether any changes have been made to a message (or a file), for example, can be accomplished by comparing message digests calculated before, and after, transmission (or any other event). For this reason, most digital signature algorithms only confirm the authenticity of a hashed digest of the message to be “signed”. Verifying the authenticity of a hashed digest of the message is considered proof that the message itself is authentic.

MD5, SHA1, or SHA2 hashes are sometimes posted along with files on websites or forums to allow verification of integrity.[6] This practice establishes a chain of trust so long as the hashes are posted on a site authenticated by HTTPS.

Password verification

Main article: password hashing

A related application is password verification (first invented by Roger Needham). Storing all user passwords as cleartext can result in a massive security breach if the password file is compromised. One way to reduce this danger is to only store the hash digest of each password. To authenticate a user, the password presented by the user is hashed and compared with the stored hash. (Note that this approach prevents the original passwords from being retrieved if forgotten or lost, and they have to be replaced with new ones.) The password is often concatenated with a random, non-secret salt value before the hash function is applied. The salt is stored with the password hash. Because users have different salts, it is not feasible to store tables of precomputed hash values for common passwords. Key stretching functions, such as PBKDF2, bcrypt or scrypt, typically use repeated invocations of a cryptographic hash to increase the time required to perform brute force attacks on stored password digests.

In 2013 a long-term Password Hashing Competition was announced to choose a new, standard algorithm for password hashing.[7]

Proof-of-work

Main article: Proof-of-work system

A proof-of-work system (or protocol, or function) is an economic measure to deter denial-of-service attacks and other service abuses such as spam on a network by requiring some work from the service requester, usually meaning processing time by a computer. A key feature of these schemes is their asymmetry: the work must be moderately hard (but feasible) on the requester side but easy to check for the service provider. One popular system, used in Bitcoin mining and Hashcash, uses partial hash inversions to prove that work was done, as a good-will token to send an e-mail. The sender is required to find a message whose hash value begins with a number of zero bits. The average work that the sender needs to perform in order to find a valid message is exponential in the number of zero bits required in the hash value, while the recipient can verify the validity of the message by executing a single hash function. For instance, in Hashcash, a sender is asked to generate a header whose 160-bit SHA-1 hash value has the first 20 bits as zeros. The sender will on average have to try 2^19 times to find a valid header.
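A hedged sketch of this Hashcash-style search, again assuming OpenSSL's SHA1(); the "header:nonce" message format is only illustrative, and "first 20 bits zero" means the first two digest bytes are zero and the top four bits of the third byte are zero:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

/* Search for a counter whose SHA-1 digest starts with 20 zero bits. */
uint64_t find_pow(const char *header) {
    unsigned char d[SHA_DIGEST_LENGTH];
    char buf[256];
    for (uint64_t nonce = 0; ; nonce++) {
        int n = snprintf(buf, sizeof buf, "%s:%llu",
                         header, (unsigned long long)nonce);
        SHA1((const unsigned char *)buf, (size_t)n, d);
        if (d[0] == 0 && d[1] == 0 && (d[2] & 0xF0) == 0)
            return nonce;   /* the verifier re-checks with one hash */
    }
}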
Proof-of-work

Main article: Proof-of-work system

A proof-of-work system (or protocol, or function) is an economic measure to deter denial-of-service attacks and other service abuses such as spam on a network by requiring some work from the service requester, usually meaning processing time by a computer. A key feature of these schemes is their asymmetry: the work must be moderately hard (but feasible) on the requester side but easy to check for the service provider. One popular system, used in Bitcoin mining and Hashcash, uses partial hash inversions to prove that work was done, as a good-will token to send an e-mail. The sender is required to find a message whose hash value begins with a number of zero bits. The average work that the sender needs to perform in order to find a valid message is exponential in the number of zero bits required in the hash value, while the recipient can verify the validity of the message by executing a single hash function. For instance, in Hashcash, a sender is asked to generate a header whose 160-bit SHA-1 hash value has the first 20 bits as zeros. The sender will on average have to try 2^19 times to find a valid header.
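A toy sketch of such a partial hash inversion, assuming SHA-256 in place of Hashcash's SHA-1 and a small difficulty so that it terminates quickly (about 2^16 attempts on average here):

    import hashlib
    from itertools import count

    def leading_zero_bits(digest):
        bits = int.from_bytes(digest, "big")
        return len(digest) * 8 - bits.bit_length()

    def mine(header, difficulty):
        # Expected work is about 2**difficulty hash evaluations;
        # the recipient verifies the result with a single hash.
        for nonce in count():
            digest = hashlib.sha256(f"{header}:{nonce}".encode()).digest()
            if leading_zero_bits(digest) >= difficulty:
                return nonce

    nonce = mine("alice@example.com", 16)
    digest = hashlib.sha256(f"alice@example.com:{nonce}".encode()).digest()
    assert leading_zero_bits(digest) >= 16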
File or data identifier

A message digest can also serve as a means of reliably identifying a file; several source code management systems, including Git, Mercurial and Monotone, use the sha1sum of various types of content (file content, directory trees, ancestry information, etc.) to uniquely identify them. Hashes are used to identify files on peer-to-peer filesharing networks. For example, in an ed2k link, an MD4-variant hash is combined with the file size, providing sufficient information for locating file sources, downloading the file and verifying its contents. Magnet links are another example. Such file hashes are often the top hash of a hash list or a hash tree, which allows for additional benefits.

One of the main applications of a hash function is to allow the fast look-up of data in a hash table. Being hash functions of a particular kind, cryptographic hash functions lend themselves well to this application too. However, compared with standard hash functions, cryptographic hash functions tend to be much more expensive computationally. For this reason, they tend to be used in contexts where it is necessary for users to protect themselves against the possibility of forgery (the creation of data with the same digest as the expected data) by potentially malicious participants.

Pseudorandom generation and key derivation

Hash functions can also be used in the generation of pseudorandom bits, or to derive new keys or passwords from a single secure key or password.

3.14.4 Hash functions based on block ciphers

There are several methods to use a block cipher to build a cryptographic hash function, specifically a one-way compression function. The methods resemble the block cipher modes of operation usually used for encryption. Many well-known hash functions, including MD4, MD5, SHA-1 and SHA-2, are built from block-cipher-like components designed for the purpose, with feedback to ensure that the resulting function is not invertible. SHA-3 finalists included functions with block-cipher-like components (e.g., Skein, BLAKE), though the function finally selected, Keccak, was built on a cryptographic sponge instead.

A standard block cipher such as AES can be used in place of these custom block ciphers; that might be useful when an embedded system needs to implement both encryption and hashing with minimal code size or hardware area. However, that approach can have costs in efficiency and security. The ciphers in hash functions are built for hashing: they use large keys and blocks, can efficiently change keys every block, and have been designed and vetted for resistance to related-key attacks. General-purpose ciphers tend to have different design goals. In particular, AES has key and block sizes that make it nontrivial to use to generate long hash values; AES encryption becomes less efficient when the key changes each block; and related-key attacks make it potentially less secure for use in a hash function than for encryption.
3.14.5 Merkle–Damgård construction

Main article: Merkle–Damgård construction

[Figure: The Merkle–Damgård hash construction. Message blocks 1 through n, followed by length padding, are fed one at a time through a compression function f, starting from an initialization vector IV and ending with a finalisation step that produces the hash.]

A hash function must be able to process an arbitrary-length message into a fixed-length output. This can be achieved by breaking the input up into a series of equal-sized blocks, and operating on them in sequence using a one-way compression function. The compression function can either be specially designed for hashing or be built from a block cipher. A hash function built with the Merkle–Damgård construction is as resistant to collisions as is its compression function; any collision for the full hash function can be traced back to a collision in the compression function.

The last block processed should also be unambiguously length padded; this is crucial to the security of this construction. This construction is called the Merkle–Damgård construction. Most widely used hash functions, including SHA-1 and MD5, take this form.

The construction has certain inherent flaws, including length-extension and generate-and-paste attacks, and cannot be parallelized. As a result, many entrants in the recent NIST hash function competition were built on different, sometimes novel, constructions.
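The iteration can be sketched as follows. The compression function below is a stand-in built from SHA-256 purely to show the structure of the loop; the block size, IV, and padding details are simplified and do not match any real hash design:

    import hashlib

    BLOCK = 16  # illustrative block size in bytes

    def compress(chaining, block):
        # Stand-in one-way compression function f(IV, block) -> new IV.
        return hashlib.sha256(chaining + block).digest()[:16]

    def md_hash(message, iv=b"\x00" * 16):
        # Length padding (Merkle-Damgard strengthening): append a 1 bit,
        # then zeros, then the message length, so the padded input is
        # unambiguous and a multiple of the block size.
        length = len(message).to_bytes(8, "big")
        padded = message + b"\x80"
        padded += b"\x00" * (-(len(padded) + 8) % BLOCK) + length
        state = iv
        for i in range(0, len(padded), BLOCK):
            state = compress(state, padded[i:i + BLOCK])
        return state

    digest = md_hash(b"hello")  # 16-byte output, one block at a time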
3.14.6 Use in building other cryptographic primitives

Hash functions can be used to build other cryptographic primitives. For these other primitives to be cryptographically secure, care must be taken to build them correctly.

Message authentication codes (MACs) (also called keyed hash functions) are often built from hash functions. HMAC is such a MAC.

Just as block ciphers can be used to build hash functions, hash functions can be used to build block ciphers. Luby–Rackoff constructions using hash functions can be provably secure if the underlying hash function is secure. Also, many hash functions (including SHA-1 and SHA-2) are built by using a special-purpose block cipher in a Davies–Meyer or other construction. That cipher can also be used in a conventional mode of operation, without the same security guarantees. See SHACAL, BEAR and LION.

Pseudorandom number generators (PRNGs) can be built using hash functions. This is done by combining a (secret) random seed with a counter and hashing it.

Some hash functions, such as Skein, Keccak, and RadioGatún, output an arbitrarily long stream and can be used as a stream cipher, and stream ciphers can also be built from fixed-length digest hash functions. Often this is done by first building a cryptographically secure pseudorandom number generator and then using its stream of random bytes as keystream. SEAL is a stream cipher that uses SHA-1 to generate internal tables, which are then used in a keystream generator more or less unrelated to the hash algorithm. SEAL is not guaranteed to be as strong (or weak) as SHA-1. Similarly, the key expansion of the HC-128 and HC-256 stream ciphers makes heavy use of the SHA-256 hash function.

3.14.7 Concatenation

Concatenating outputs from multiple hash functions provides collision resistance as good as the strongest of the algorithms included in the concatenated result. For example, older versions of Transport Layer Security (TLS) and Secure Sockets Layer (SSL) use concatenated MD5 and SHA-1 sums. This ensures that a method to find collisions in one of the hash functions does not defeat data protected by both hash functions.

For Merkle–Damgård construction hash functions, the concatenated function is as collision-resistant as its strongest component, but not more collision-resistant. Antoine Joux observed that 2-collisions lead to n-collisions: if it is feasible for an attacker to find two messages with the same MD5 hash, the attacker can find as many messages as the attacker desires with identical MD5 hashes with no greater difficulty.[8] Among the n messages with the same MD5 hash, there is likely to be a collision in SHA-1. The additional work needed to find the SHA-1 collision (beyond the exponential birthday search) requires only polynomial time.[9][10]
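For illustration, the concatenated MD5 and SHA-1 construction is a one-liner; per the observation above, its collision resistance is essentially that of SHA-1 alone:

    import hashlib

    def concat_digest(data):
        # 16-byte MD5 digest followed by a 20-byte SHA-1 digest.
        return hashlib.md5(data).digest() + hashlib.sha1(data).digest()

    assert len(concat_digest(b"hello")) == 36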

3.14.8 Cryptographic hash algorithms

There is a long list of cryptographic hash functions, although many have been found to be vulnerable and should not be used. Even if a hash function has never been broken, a successful attack against a weakened variant may undermine the experts' confidence and lead to its abandonment. For instance, in August 2004 weaknesses were found in several then-popular hash functions, including SHA-0, RIPEMD, and MD5. These weaknesses called into question the security of stronger algorithms derived from the weak hash functions; in particular, SHA-1 (a strengthened version of SHA-0), RIPEMD-128, and RIPEMD-160 (both strengthened versions of RIPEMD). Neither SHA-0 nor RIPEMD are widely used since they were replaced by their strengthened versions.

As of 2009, the two most commonly used cryptographic hash functions were MD5 and SHA-1. However, a successful attack on MD5 broke Transport Layer Security in 2008.[11]

The United States National Security Agency (NSA) developed SHA-0 and SHA-1.

On 12 August 2004, Joux, Carribault, Lemuet, and Jalby announced a collision for the full SHA-0 algorithm. Joux et al. accomplished this using a generalization of the Chabaud and Joux attack. They found that the collision had complexity 2^51 and took about 80,000 CPU hours on a supercomputer with 256 Itanium 2 processors, equivalent to 13 days of full-time use of the supercomputer.

In February 2005, an attack on SHA-1 was reported that would find collisions in about 2^69 hashing operations, rather than the 2^80 expected for a 160-bit hash function. In August 2005, another attack on SHA-1 was reported that would find collisions in 2^63 operations. Though theoretical weaknesses of SHA-1 exist,[12][13] no collision (or near-collision) has yet been found. Nonetheless, it is often suggested that it may be practical to break within years, and that new applications can avoid these problems by using later members of the SHA family, such as SHA-2, or using techniques such as randomized hashing[14][15] that do not require collision resistance.

However, to ensure the long-term robustness of applications that use hash functions, there was a competition to design a replacement for SHA-2. On October 2, 2012, Keccak was selected as the winner of the NIST hash function competition. A version of this algorithm became a FIPS standard on August 5, 2015 under the name SHA-3.[16]

Another finalist from the NIST hash function competition, BLAKE, was optimized to produce BLAKE2, which is notable for being faster than SHA-3, SHA-2, SHA-1, or MD5, and is used in numerous applications and libraries.

3.14.9 See also

3.14.10 References

[1] Schneier, Bruce. "Cryptanalysis of MD5 and SHA: Time for a New Standard". Computerworld. Retrieved 2016-04-20. "Much more than encryption algorithms, one-way hash functions are the workhorses of modern cryptography."

[2] Katz, Jonathan; Lindell, Yehuda (2008). Introduction to Modern Cryptography. Chapman & Hall/CRC.

[3] Rogaway & Shrimpton 2004, in Sec. 5. Implications.

[4] "Flickr's API Signature Forgery Vulnerability". Thai Duong and Juliano Rizzo.

[5] Lyubashevsky, Vadim; Micciancio, Daniele; Peikert, Chris; Rosen, Alon. "SWIFFT: A Modest Proposal for FFT Hashing". Springer. Retrieved 29 August 2016.

[6] Perrin, Chad (December 5, 2007). "Use MD5 hashes to verify software downloads". TechRepublic. Retrieved March 2, 2013.

[7] “Password Hashing Competition”. Retrieved March 3, 2013.

[8] Antoine Joux. Multicollisions in Iterated Hash Functions. Application to Cascaded Constructions. LNCS 3152/2004, pages 306–316. Full text.

[9] Finney, Hal (August 20, 2004). "More Problems with Hash Functions". The Cryptography Mailing List. Retrieved May 25, 2016.

[10] Hoch, Jonathan J.; Shamir, Adi (2008). “On the Strength of the Concatenated Hash Combiner when All the Hash Functions Are Weak” (PDF). Retrieved May 25, 2016.

[11] Alexander Sotirov, Marc Stevens, Jacob Appelbaum, Arjen Lenstra, David Molnar, Dag Arne Osvik, Benne de Weger, MD5 considered harmful today: Creating a rogue CA certificate, accessed March 29, 2009.

[12] Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu, Finding Collisions in the Full SHA-1

[13] Bruce Schneier, Cryptanalysis of SHA-1 (summarizes Wang et al. results and their implications)

[14] Shai Halevi, Hugo Krawczyk, Update on Randomized Hashing

[15] Shai Halevi and Hugo Krawczyk, Randomized Hashing and Digital Signatures

[16] NIST.gov – Computer Security Division – Computer Security Resource Center

3.14.11 External links

• Paar, Christof; Pelzl, Jan (2009). "11: Hash Functions". Understanding Cryptography, A Textbook for Students and Practitioners. Springer. (companion web site contains online cryptography course that covers hash functions)

• "The ECRYPT Hash Function Website".

• Buldas, A. (2011). "Series of mini-lectures about cryptographic hash functions".

• Rogaway, P.; Shrimpton, T. (2004). "Cryptographic Hash-Function Basics: Definitions, Implications, and Separations for Preimage Resistance, Second-Preimage Resistance, and Collision Resistance". CiteSeerX: 10.1.1.3.6200.

Chapter 4

Sets

4.1 Set (abstract data type)

In computer science, a set is an abstract data type that can store certain values, without any particular order, and no repeated values. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.

Some set data structures are designed for static or frozen sets that do not change after they are constructed. Static sets allow only query operations on their elements, such as checking whether a given value is in the set, or enumerating the values in some arbitrary order. Other variants, called dynamic or mutable sets, allow also the insertion and deletion of elements from the set.

An abstract data structure is a collection, or aggregate, of data. The data may be booleans, numbers, characters, or other data structures. If one considers the structure yielded by packaging[note 1] or indexing,[note 2] there are four basic data structures:[1][2]

1. unpackaged, unindexed: bunch
2. packaged, unindexed: set
3. unpackaged, indexed: string (sequence)
4. packaged, indexed: list (array)

In this view, the contents of a set are a bunch, and isolated data items are elementary bunches (elements). Whereas sets contain elements, bunches consist of elements. Further structuring may be achieved by considering the multiplicity of elements (sets become multisets, bunches become hyperbunches)[3] or their homogeneity (a record is a set of fields, not necessarily all of the same type).

4.1.1 Type theory

In type theory, sets are generally identified with their indicator function (characteristic function): accordingly, a set of values of type A may be denoted by 2^A or P(A). (Subtypes and subsets may be modeled by refinement types, and quotient sets may be replaced by setoids.) The characteristic function F of a set S is defined as:

    F(x) = 1 if x ∈ S,  F(x) = 0 if x ∉ S.

In theory, many other abstract data structures can be viewed as set structures with additional operations and/or additional axioms imposed on the standard operations. For example, an abstract heap can be viewed as a set structure with a min(S) operation that returns the element of smallest value.

4.1.2 Operations

Core set-theoretical operations

One may define the operations of the algebra of sets:

• union(S,T): returns the union of sets S and T.

• intersection(S,T): returns the intersection of sets S and T.

• difference(S,T): returns the difference of sets S and T.

• subset(S,T): a predicate that tests whether the set S is a subset of set T.

Static sets

Typical operations that may be provided by a static set structure S are:

• is_element_of(x,S): checks whether the value x is in the set S.

• is_empty(S): checks whether the set S is empty.

• size(S) or cardinality(S): returns the number of elements in S.

• iterate(S): returns a function that returns one more value of S at each call, in some arbitrary order.


• enumerate(S): returns a list containing the elements of S in some arbitrary order.

• build(x1,x2,…,xn): creates a set structure with values x1,x2,…,xn.

• create_from(collection): creates a new set structure containing all the elements of the given collection or all the elements returned by the given iterator.

Dynamic sets

Dynamic set structures typically add:

• create(): creates a new, initially empty set structure.

• create_with_capacity(n): creates a new set structure, initially empty but capable of holding up to n elements.

• add(S,x): adds the element x to S, if it is not present already.

• remove(S, x): removes the element x from S, if it is present.

• capacity(S): returns the maximum number of values that S can hold.

Additional operations

There are many other operations that can (in principle) be defined in terms of the above, such as:

• pop(S): returns an arbitrary element of S, deleting it from S.[4]

• pick(S): returns an arbitrary element of S.[5][6][7] Functionally, the mutator pop can be interpreted as the pair of selectors (pick, rest), where rest returns the set consisting of all elements except for the arbitrary element.[8] Can be interpreted in terms of iterate.[note 3]

• map(F,S): returns the set of distinct values resulting from applying function F to each element of S.

• filter(P,S): returns the subset containing all elements of S that satisfy a given predicate P.

• fold(A0,F,S): returns the value A|S| after applying Ai+1 := F(Ai, e) for each element e of S, for some binary operation F. F must be associative and commutative for this to be well-defined.

• clear(S): delete all elements of S.

• equal(S1, S2): checks whether the two given sets are equal (i.e. contain all and only the same elements).

• hash(S): returns a hash value for the static set S such that if equal(S1, S2) then hash(S1) = hash(S2).

Other operations can be defined for sets with elements of a special type:

• sum(S): returns the sum of all elements of S for some definition of "sum". For example, over integers or reals, it may be defined as fold(0, add, S).

• collapse(S): given a set of sets, return the union.[9] For example, collapse({{1}, {2, 3}}) == {1, 2, 3}. May be considered a kind of sum.

• flatten(S): given a set consisting of sets and atomic elements (elements that are not sets), returns a set whose elements are the atomic elements of the original top-level set or elements of the sets it contains. In other words, remove a level of nesting, like collapse, but allow atoms. This can be done a single time, or recursively flattening to obtain a set of only atomic elements.[10] For example, flatten({1, {2, 3}}) == {1, 2, 3}.

• nearest(S,x): returns the element of S that is closest in value to x (by some metric).

• min(S), max(S): returns the minimum/maximum element of S.

Some set structures may allow only some of these operations. The cost of each operation will depend on the implementation, and possibly also on the particular values stored in the set, and the order in which they are inserted.
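As a concrete reading of these definitions, Python's functools.reduce can play the role of fold. A small sketch showing sum(S) as fold(0, add, S) and collapse as a fold of unions (both operations are associative and commutative, as required):

    from functools import reduce
    from operator import add, or_

    S = {1, 2, 3, 4}
    assert reduce(add, S, 0) == 10          # sum(S) = fold(0, add, S)

    sets = {frozenset({1}), frozenset({2, 3})}
    assert reduce(or_, sets, frozenset()) == {1, 2, 3}   # collapse(S)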
4.1.3 Implementations

Sets can be implemented using various data structures, which provide different time and space trade-offs for various operations. Some implementations are designed to improve the efficiency of very specialized operations, such as nearest or union. Implementations described as "general use" typically strive to optimize the element_of, add, and delete operations. A simple implementation is to use a list, ignoring the order of the elements and taking care to avoid repeated values. This is simple but inefficient, as operations like set membership or element deletion are O(n), as they require scanning the entire list.[note 4] Sets are often instead implemented using more efficient data structures, particularly various flavors of trees, tries, or hash tables.

As sets can be interpreted as a kind of map (by the indicator function), sets are commonly implemented in the same way as (partial) maps (associative arrays), in this case with the value of each key-value pair being the unit type or a sentinel value (like 1): namely, a self-balancing binary search tree for sorted sets (which has O(log n) for most operations), or a hash table for unsorted sets (which has O(1) average-case, but O(n) worst-case, for most operations). A sorted linear hash table[11] may be used to provide deterministically ordered sets.

Further, in languages that support maps but not sets, sets can be implemented in terms of maps. For example, a common programming idiom in Perl that converts an array to a hash whose values are the sentinel value 1, for use as a set, is:

    my %elements = map { $_ => 1 } @elements;
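The same idiom carries over to any language with maps. A Python rendering (purely illustrative, since Python has a native set type) uses a dict whose keys are the elements and whose values are the sentinel 1:

    elements = ["a", "b", "a", "c"]
    as_set = {x: 1 for x in elements}   # dict used as a set

    assert "a" in as_set                # membership test via key lookup
    assert len(as_set) == 3             # duplicates collapse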

Other popular methods include arrays. In particular, a subset of the integers 1..n can be implemented efficiently as an n-bit bit array, which also supports very efficient union and intersection operations. A Bloom map implements a set probabilistically, using a very compact representation but risking a small chance of false positives on queries.

The Boolean set operations can be implemented in terms of more elementary operations (pop, clear, and add), but specialized algorithms may yield lower asymptotic time bounds. If sets are implemented as sorted lists, for example, the naive algorithm for union(S,T) will take time proportional to the length m of S times the length n of T; whereas a variant of the list merging algorithm will do the job in time proportional to m+n. Moreover, there are specialized set data structures (such as the union-find data structure) that are optimized for one or more of these operations, at the expense of others.

4.1.4 Language support

One of the earliest languages to support sets was Pascal; many languages now include it, whether in the core language or in a standard library. As noted in the previous section, in languages which do not directly support sets but do support associative arrays, sets can be emulated using associative arrays, by using the elements as keys, and using a dummy value as the values, which are ignored.

• In C++, the Standard Template Library (STL) provides the set template class, which is typically implemented using a binary search tree (e.g. red-black tree); SGI's STL also provides the hash_set template class, which implements a set using a hash table. C++11 has support for the unordered_set template class, which is implemented using a hash table. In sets, the elements themselves are the keys, in contrast to sequenced containers, where elements are accessed using their (relative or absolute) position. Set elements must have a strict weak ordering.

• Java offers the Set interface to support sets (with the HashSet class implementing it using a hash table), and the SortedSet sub-interface to support sorted sets (with the TreeSet class implementing it using a binary search tree).

• Apple's Foundation framework (part of Cocoa) provides the Objective-C classes NSSet, NSMutableSet, NSCountedSet, NSOrderedSet, and NSMutableOrderedSet. The CoreFoundation APIs provide the CFSet and CFMutableSet types for use in C.

• Python has built-in set and frozenset types since 2.4, and since Python 3.0 and 2.7, supports non-empty set literals using a curly-bracket syntax, e.g.: {x, y, z}.

• The .NET Framework provides the generic HashSet and SortedSet classes that implement the generic ISet interface.

• Smalltalk's class library includes Set and IdentitySet, using equality and identity for inclusion test respectively. Many dialects provide variations for compressed storage (NumberSet, CharacterSet), for ordering (OrderedSet, SortedSet, etc.) or for weak references (WeakIdentitySet).

• Ruby's standard library includes a set module which contains Set and SortedSet classes that implement sets using hash tables, the latter allowing iteration in sorted order.

• OCaml's standard library contains a Set module, which implements a functional set data structure using binary search trees.

• The GHC implementation of Haskell provides a Data.Set module, which implements immutable sets using binary search trees.[12]

• The Tcl Tcllib package provides a set module which implements a set data structure based upon TCL lists.

• The Swift standard library contains a Set type, since Swift 1.2.

4.1.5 Multiset

A generalization of the notion of a set is that of a multiset or bag, which is similar to a set but allows repeated ("equal") values (duplicates). This is used in two distinct senses: either equal values are considered identical, and are simply counted, or equal values are considered equivalent, and are stored as distinct items. For example, given a list of people (by name) and ages (in years), one could construct a multiset of ages, which simply counts the number of people of a given age. Alternatively, one can construct a multiset of people, where two people are considered equivalent if their ages are the same (but may be different people and have different names), in which case each pair (name, age) must be stored, and selecting on a given age gives all the people of a given age.

Formally, it is possible for objects in computer science to be considered "equal" under some equivalence relation

but still distinct under another relation. Some types of multiset implementations will store distinct equal objects as separate items in the data structure, while others will collapse it down to one version (the first one encountered) and keep a positive integer count of the multiplicity of the element.

As with sets, multisets can naturally be implemented using hash tables or trees, which yield different performance characteristics.

• C++'s Standard Template Library implements both sorted and unsorted multisets. It provides the multiset class for the sorted multiset, as a kind of associative container, which implements this multiset using a self-balancing binary search tree. It provides the unordered_multiset class for the unsorted multiset, as a kind of unordered associative container, which implements this multiset using a hash table. The unsorted multiset is standard as of C++11; previously SGI's STL provided the hash_multiset class, which was copied and eventually standardized.

• For Java, third-party libraries provide multiset functionality:

  • Apache Commons Collections provides the Bag and SortedBag interfaces, with implementing classes like HashBag and TreeBag.

  • Google Guava provides the Multiset interface, with implementing classes like HashMultiset and TreeMultiset.

• Apple provides the NSCountedSet class as part of Cocoa, and the CFBag and CFMutableBag types as part of CoreFoundation.

• Python's standard library includes collections.Counter, which is similar to a multiset.

• Smalltalk includes the Bag class, which can be instantiated to use either identity or equality as predicate for inclusion test.

Where a multiset data structure is not available, a workaround is to use a regular set, but override the equality predicate of its items to always return "not equal" on distinct objects (however, such a set will still not be able to store multiple occurrences of the same object), or use an associative array mapping the values to their integer multiplicities (this will not be able to distinguish between equal elements at all).

The set of all bags over type T is given by the expression bag T. If by multiset one considers equal items identical and simply counts them, then a multiset can be interpreted as a function from the input domain to the non-negative integers (natural numbers), generalizing the identification of a set with its indicator function. In some cases a multiset in this counting sense may be generalized to allow negative values, as in Python.

Typical operations on bags:

• contains(B, x): checks whether the element x is present (at least once) in the bag B.

• is_sub_bag(B1, B2): checks whether each element occurs in the bag B1 no more often than it occurs in the bag B2; sometimes denoted as B1 ⊑ B2.

• count(B, x): returns the number of times that the element x occurs in the bag B; sometimes denoted as B # x.

• scaled_by(B, n): given a natural number n, returns a bag which contains the same elements as the bag B, except that every element that occurs m times in B occurs n * m times in the resulting bag; sometimes denoted as n ⊗ B.

• union(B1, B2): returns a bag containing just those values that occur in either the bag B1 or the bag B2, except that the number of times a value x occurs in the resulting bag is equal to (B1 # x) + (B2 # x); sometimes denoted as B1 ⊎ B2.
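Several of these operations map directly onto Python's collections.Counter mentioned above, which treats equal items as identical and counts them. A sketch (is_sub_bag here is a helper written for this example, not a Counter method):

    from collections import Counter

    B1 = Counter("aab")   # {a: 2, b: 1}
    B2 = Counter("abc")   # {a: 1, b: 1, c: 1}

    assert B1["a"] == 2                   # count(B1, a), i.e. B1 # a
    assert B1 + B2 == Counter("aaabbc")   # union: counts add, B1 ⊎ B2
    assert Counter({k: 3 * v for k, v in B1.items()}) == Counter("aaaaaabbb")  # 3 ⊗ B1

    def is_sub_bag(b1, b2):
        # b1 ⊑ b2: Counter subtraction drops non-positive counts,
        # so the difference is empty exactly when b1 is a sub-bag.
        return not (b1 - b2)

    assert is_sub_bag(Counter("ab"), B2)
    assert not is_sub_bag(B1, B2)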
Multisets in SQL

In relational databases, a table can be a (mathematical) set or a multiset, depending on the presence of unicity constraints on some columns (which turns it into a candidate key).

SQL allows the selection of rows from a relational table: this operation will in general yield a multiset, unless the keyword DISTINCT is used to force the rows to be all different, or the selection includes the primary (or a candidate) key.

In ANSI SQL the MULTISET keyword can be used to transform a subquery into a collection expression:

    SELECT expression1, expression2... FROM table_name...

is a general select that can be used as subquery expression of another more general query, while

    MULTISET(SELECT expression1, expression2... FROM table_name...)

transforms the subquery into a collection expression that can be used in another query, or in assignment to a column of appropriate collection type.

4.1.6 See also

• Bloom filter

• Disjoint set

4.1.7 Notes

[note 1] "Packaging" consists in supplying a container for an aggregation of objects in order to turn them into a single object. Consider a function call: without packaging, a function can be called to act upon a bunch only by passing each bunch element as a separate argument, which complicates the function's signature considerably (and is just not possible in some programming languages). By packaging the bunch's elements into a set, the function may now be called upon a single, elementary argument: the set object (the bunch's package).

[note 2] Indexing is possible when the elements being considered are totally ordered. Being without order, the elements of a multiset (for example) do not have lesser/greater or preceding/succeeding relationships: they can only be compared in absolute terms (same/different).

[note 3] For example, in Python pick can be implemented on a derived class of the built-in set as follows:

    class Set(set):
        def pick(self):
            return next(iter(self))

[note 4] Element insertion can be done in O(1) time by simply inserting at an end, but if one avoids duplicates this takes O(n) time.

4.1.8 References

[1] Hehner, Eric C. R. (1981), "Bunch Theory: A Simple Set Theory for Computer Science", Information Processing Letters, 12 (1): 26, doi:10.1016/0020-0190(81)90071-5

[2] Hehner, Eric C. R. (2004), A Practical Theory of Programming, second edition

[3] Hehner, Eric C. R. (2012), A Practical Theory of Programming, 2012-3-30 edition

[4] Python: pop()

[5] Management and Processing of Complex Data Structures: Third Workshop on Information Systems and Artificial Intelligence, Hamburg, Germany, February 28 – March 2, 1994. Proceedings, ed. Kai v. Luck, Heinz Marburger, p. 76

[6] Python Issue7212: Retrieve an arbitrary element from a set without removing it; see msg106593 regarding standard name

[7] Ruby Feature #4553: Add Set#pick and Set#pop

[8] Inductive Synthesis of Functional Programs: Universal Planning, Folding of Finite Programs, and Schema Abstraction by Analogical Reasoning, Ute Schmid, Springer, Aug 21, 2003, p. 240

[9] Recent Trends in Data Type Specification: 10th Workshop on Specification of Abstract Data Types Joint with the 5th COMPASS Workshop, S. Margherita, Italy, May 30 – June 3, 1994. Selected Papers, Volume 10, ed. Egidio Astesiano, Gianna Reggio, Andrzej Tarlecki, p. 38

[10] Ruby: flatten()

[11] Wang, Thomas (1997), Sorted Linear Hash Table

[12] Stephen Adams, "Efficient sets: a balancing act", Journal of Functional Programming 3(4):553–562, October 1993. Retrieved on 2015-03-11.

4.2 Bit array

A bit array (also known as bitmap, bitset, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to be stored, some space is wasted due to internal fragmentation.

4.2.1 Definition

A bit array is a mapping from some domain (almost always a range of integers) to values in the set {0, 1}. The values can be interpreted as dark/light, absent/present, locked/unlocked, valid/invalid, et cetera. The point is that there are only two possible values, so they can be stored in one bit. As with other arrays, the access to a single bit can be managed by applying an index to the array. Assuming its size (or length) to be n bits, the array can be used to specify a subset of the domain (e.g. {0, 1, 2, ..., n−1}), where a 1-bit indicates the presence and a 0-bit the absence of a number in the set. This set data structure uses about n/w words of space, where w is the number of bits in each machine word. Whether the least significant bit (of the word) or the most significant bit indicates the smallest-index number is largely irrelevant, but the former tends to be preferred (on little-endian machines).

4.2.2 Basic operations

Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated using bitwise operations. In particular:

• OR can be used to set a bit to one: 11101010 OR 00000100 = 11101110

• AND can be used to set a bit to zero: 11101010 AND 11111101 = 11101000

• AND together with zero-testing can be used to determine if a bit is set:

    11101010 AND 00000001 = 00000000 = 0
    11101010 AND 00000010 = 00000010 ≠ 0

• XOR can be used to invert or toggle a bit:

    11101010 XOR 00000100 = 11101110
    11101110 XOR 00000100 = 11101010

• NOT can be used to invert all bits:

    NOT 10110010 = 01001101

To obtain the bit mask needed for these operations, we can use a bit shift operator to shift the number 1 to the left by the appropriate number of places, as well as bitwise NOT if necessary.

Given two bit arrays of the same size representing sets, we can compute their union, intersection, and set-theoretic difference using n/w simple bit operations each (2n/w for difference), as well as the complement of either:

    for i from 0 to n/w-1
        complement_a[i] := not a[i]
        union[i]        := a[i] or b[i]
        intersection[i] := a[i] and b[i]
        difference[i]   := a[i] and (not b[i])

If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly nested loop that loops through each word, one at a time. Only n/w memory accesses are required:

    index := 0                         // if needed
    for i from 0 to n/w-1
        word := a[i]
        for b from 0 to w-1
            value := word and 1 ≠ 0
            word  := word shift right 1
            // do something with value
            index := index + 1         // if needed

Both of these code samples exhibit ideal locality of reference, which will subsequently receive a large performance boost from a data cache. If a cache line is k words, only about n/wk cache misses will occur.

4.2.3 More complex operations

As with character strings it is straightforward to define length, substring, lexicographical compare, concatenation, and reverse operations. The implementation of some of these operations is sensitive to endianness.

Population / Hamming weight

If we wish to find the number of 1 bits in a bit array, sometimes called the population count or Hamming weight, there are efficient branch-free algorithms that can compute the number of bits in a word using a series of simple bit operations. We simply run such an algorithm on each word and keep a running total. Counting zeros is similar. See the Hamming weight article for examples of an efficient implementation.

Inversion

Vertical flipping of a one-bit-per-pixel image, or some FFT algorithms, requires flipping the bits of individual words (so b31 b30 ... b0 becomes b0 ... b30 b31). When this operation is not available on the processor, it is still possible to proceed by successive passes, in this example on 32 bits:

    exchange two 16-bit halfwords
    exchange bytes by pairs (0xddccbbaa -> 0xccddaabb)
    ...
    swap bits by pairs
    swap bits (b31 b30 ... b1 b0 -> b30 b31 ... b0 b1)

The last operation can be written ((x & 0x55555555) << 1) | ((x & 0xaaaaaaaa) >> 1).

Find first one

The find first set or find first one operation identifies the index or position of the 1-bit with the smallest index in an array, and has widespread hardware support (for arrays not larger than a word) and efficient algorithms for its computation. When a priority queue is stored in a bit array, find first one can be used to identify the highest priority element in the queue. To expand a word-size find first one to longer arrays, one can find the first nonzero word and then run find first one on that word. The related operations find first zero, count leading zeros, count leading ones, count trailing zeros, count trailing ones, and log base 2 (see find first set) can also be extended to a bit array in a straightforward manner.
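Python integers can serve as bit arrays of arbitrary length, and the word-level operations above have direct analogues. A sketch (int.bit_count requires Python 3.10; bin(x).count("1") works on older versions):

    x = 0b11101010

    # Population count / Hamming weight.
    assert x.bit_count() == 5

    # Find first one: x & -x isolates the lowest set bit,
    # and bit_length gives its position.
    assert (x & -x).bit_length() - 1 == 1

    # Bitwise set operations on whole "arrays" at once.
    a, b = 0b1100, 0b1010
    assert a | b == 0b1110      # union
    assert a & b == 0b1000      # intersection
    assert a & ~b == 0b0100     # difference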
4.2.4 Compression

A bit array is the densest storage for "random" bits, that is, where each bit is equally likely to be 0 or 1, and each one is independent. But most data is not random, so it may be possible to store it more compactly. For example, the data of a typical fax image is not random and can be compressed. Run-length encoding is commonly used to compress these long streams. However, most compressed data formats are not so easy to access randomly; also, by compressing bit arrays too aggressively we run the risk of losing the benefits due to bit-level parallelism (vectorization). Thus, instead of compressing bit arrays as streams of bits, we might compress them as streams of bytes or words (see Bitmap index (compression)).

4.2.5 Advantages and disadvantages

Bit arrays, despite their simplicity, have a number of marked advantages over other data structures for the same problems:

• Accessing individual elements can be expensive and 4.2.7 Language support difficult to express in some languages. If random ac- cess is more common than sequential and the array The APL programming language fully supports bit arrays is relatively small, a byte array may be preferable on of arbitrary shape and size as a Boolean datatype distinct a machine with byte addressing. A word array, how- from integers. All major implementations (Dyalog APL, ever, is probably not justified due to the huge space APL2, APL Next, NARS2000, Gnu APL, etc.) pack the overhead and additional cache misses it causes, un- bits densely into whatever size the machine word is. Bits less the machine only has word addressing. may be accessed individually via the usual indexing no- tation (A[3]) as well as through all of the usual primitive functions and operators where they are often operated on 4.2.6 Applications using a special case algorithm such as summing the bits via a table lookup of bytes. Because of their compactness, bit arrays have a number of applications in areas where space or efficiency is at a The C programming language's bitfields, pseudo-objects premium. Most commonly, they are used to represent a found in structs with size equal to some number of bits, simple group of boolean flags or an ordered sequence of are in fact small bit arrays; they are limited in that they boolean values. cannot span words. Although they give a convenient syn- tax, the bits are still accessed using bitwise operators on Bit arrays are used for priority queues, where the bit at most machines, and they can only be defined statically index k is set if and only if k is in the queue; this data (like C’s static arrays, their sizes are fixed at compile- structure is used, for example, by the Linux kernel, and time). It is also a common idiom for C programmers to benefits strongly from a find-first-zero operation in hard- use words as small bit arrays and access bits of them us- ware. ing bit operators. A widely available header file included Bit arrays can be used for the allocation of memory pages, in the X11 system, xtrapbits.h, is “a portable way for sys- inodes, disk sectors, etc. In such cases, the term bitmap tems to define bit field manipulation of arrays of bits.” may be used. However, this term is frequently used to A more explanatory description of aforementioned ap- refer to raster images, which may use multiple bits per proach can be found in the comp.lang.c faq. pixel. In C++, although individual bools typically occupy the Another application of bit arrays is the Bloom filter, a same space as a byte or an integer, the STL type vec- probabilistic set data structure that can store large sets tor is a partial template specialization in which in a small space in exchange for a small probability of bits are packed as a space efficiency optimization. Since error. It is also possible to build probabilistic hash tables bytes (and not bits) are the smallest addressable unit in 110 CHAPTER 4. SETS

In C++, although individual bools typically occupy the same space as a byte or an integer, the STL type vector<bool> is a partial template specialization in which bits are packed as a space efficiency optimization. Since bytes (and not bits) are the smallest addressable unit in C++, the [] operator does not return a reference to an element, but instead returns a proxy reference. This might seem a minor point, but it means that vector<bool> is not a standard STL container, which is why the use of vector<bool> is generally discouraged. Another unique STL class, bitset,[1] creates a vector of bits fixed at a particular size at compile-time, and in its interface and syntax more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as the ability to efficiently count the number of bits that are set. The Boost C++ Libraries provide a dynamic_bitset class[2] whose size is specified at run-time.

The D programming language provides bit arrays in its standard library, Phobos, in std.bitmanip. As in C++, the [] operator does not return a reference, since individual bits are not directly addressable on most hardware, but instead returns a bool.

In Java, the class BitSet creates a bit array that is then manipulated with functions named after bitwise operators familiar to C programmers. Unlike the bitset in C++, the Java BitSet does not have a "size" state (it has an effectively infinite size, initialized with 0 bits); a bit can be set or tested at any index. In addition, there is a class EnumSet, which represents a Set of values of an enumerated type internally as a bit vector, as a safer alternative to bit fields.

The .NET Framework supplies a BitArray collection class. It stores boolean values, supports random access and bitwise operators, can be iterated over, and its Length property can be changed to grow or truncate it.

Although Standard ML has no support for bit arrays, Standard ML of New Jersey has an extension, the BitArray structure, in its SML/NJ Library. It is not fixed in size and supports set operations and bit operations, including, unusually, shift operations.

Haskell likewise currently lacks standard support for bitwise operations, but both GHC and Hugs provide a Data.Bits module with assorted bitwise functions and operators, including shift and rotate operations, and an "unboxed" array over boolean values may be used to model a bit array, although this lacks support from the former module.

In Perl, strings can be used as expandable bit arrays. They can be manipulated using the usual bitwise operators (~ | & ^),[3] and individual bits can be tested and set using the vec function.[4]

In Ruby, you can access (but not set) a bit of an integer (Fixnum or Bignum) using the bracket operator ([]), as if it were an array of bits.

Apple's Core Foundation library contains CFBitVector and CFMutableBitVector structures.

PL/I supports arrays of bit strings of arbitrary length, which may be either fixed-length or varying. The array elements may be aligned (each element begins on a byte or word boundary) or unaligned (elements immediately follow each other with no padding).

Hardware description languages such as VHDL, Verilog, and SystemVerilog natively support bit vectors, as these are used to model storage elements like flip-flops, hardware busses and hardware signals in general. In hardware verification languages such as OpenVera, e and SystemVerilog, bit vectors are used to sample values from the hardware models, and to represent data that is transferred to hardware during simulations.

4.2.8 See also

• Bit field

• Arithmetic logic unit

• Bitboard (chess and similar games)

• Bitmap index

• Binary numeral system

• Bitstream

• Judy array

4.2.9 References

[1] std::bitset

[2] boost::dynamic_bitset

[3] http://perldoc.perl.org/perlop.html#Bitwise-String-Operators

[4] http://perldoc.perl.org/functions/vec.html

4.2.10 External links

• mathematical bases by Pr. D.E. Knuth

• vector<bool> Is Nonconforming, and Forces Optimization Choice

• vector<bool>: More Problems, Better Solutions

4.3 Bloom filter

Not to be confused with the Bloom shader effect.

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; thus a Bloom filter has a 100% recall rate. In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the

set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.

Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses, an 85–15 form of the Pareto principle.[1]

More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.[2]

4.3.1 Algorithm description

[Figure: An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.]

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. Typically, k is a constant, much smaller than m, which is proportional to the number of elements to be added; the precise choice of k and the constant of proportionality of m are determined by the intended false positive rate of the filter.

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set; if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value, or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate.[3] Specifically, Dillinger & Manolios (2004b) show the effectiveness of deriving the k indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values.

Removing an element from this simple Bloom filter is impossible because false negatives are not permitted. An element maps to k bits, and although setting any one of those k bits to zero suffices to remove the element, it also results in removing any other elements that happen to map onto that bit. Since there is no way of determining whether any other elements have been added that affect the bits for the element to be removed, clearing any of the bits would introduce the possibility of false negatives.

One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter.

It is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
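A minimal sketch of the structure just described; it derives the k indices from a single SHA-256 digest in the double-hashing style mentioned above. The class, parameter choices, and hash derivation are illustrative, not a production design:

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = 0  # m-bit array packed into a Python int

        def _positions(self, item):
            # Derive k indices from the two halves of one digest
            # (double hashing: index_i = h1 + i*h2 mod m).
            digest = hashlib.sha256(item).digest()
            h1 = int.from_bytes(digest[:16], "big")
            h2 = int.from_bytes(digest[16:], "big")
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= 1 << pos

        def might_contain(self, item):
            # False means "definitely not in the set";
            # True means "possibly in the set".
            return all(self.bits >> pos & 1 for pos in self._positions(item))

    f = BloomFilter(m=1024, k=7)
    f.add(b"x")
    assert f.might_contain(b"x")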

4.3.2 Space and time advantages

[Figure: A Bloom filter used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times. Bloom filter decisions are much faster. However, some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without it; use of a Bloom filter for this purpose, however, does increase memory usage.]

While risking false positives, Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). However, Bloom filters do not store the data items at all, and a separate solution must be provided for the actual storage. Linked structures incur an additional linear space overhead for pointers. A Bloom filter with a 1% error rate and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element.

However, if the number of potential values is small and many of them can be in the set, the Bloom filter is easily surpassed by the deterministic bit array, which requires only one bit for each potential element. Note also that hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with k = 1.[4]

Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse hash tables can make them faster in practice than some Bloom filters. In a hardware implementation, however, the Bloom filter shines because its k lookups are independent and can be parallelized.

To understand its space efficiency, it is instructive to compare the general Bloom filter with its special case when k = 1. If k = 1, then in order to keep the false positive rate sufficiently low, a small fraction of bits should be set, which means the array must be very large and contain long runs of zeros. The information content of the array relative to its size is low. The generalized Bloom filter (k greater than 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters (k and m) are chosen well, about half of the bits will be set,[5] and these will be apparently random, minimizing redundancy and maximizing information content.

4.3.3 Probability of false positives

[Figure: The false positive probability p as a function of the number of elements n in the filter and the filter size m. An optimal number of hash functions k = (m/n) ln 2 has been assumed.]

Assume that a hash function selects each array position with equal probability. If m is the number of bits in the array, the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is

    1 − 1/m.

If k is the number of hash functions, the probability that the bit is not set to 1 by any of the hash functions is

    (1 − 1/m)^k.

If we have inserted n elements, the probability that a certain bit is still 0 is

    (1 − 1/m)^{kn};

the probability that it is 1 is therefore

    1 − (1 − 1/m)^{kn}.

Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is often given as

    (1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k.

This is not strictly correct, as it assumes independence for the probabilities of each bit being set. However, assuming it is a close approximation we have that the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases.

An alternative analysis arriving at the same approximation without the assumption of independence is given by Mitzenmacher and Upfal.[6] After all n items have been added to the Bloom filter, let q be the fraction of the m bits that are set to 0. (That is, the number of bits still set to 0 is qm.) Then, when testing membership of an element not in the set, for the array position given by any of the k hash functions, the probability that the bit is found set to 1 is 1 − q. So the probability that all k hash functions find their bit set to 1 is (1 − q)^k. Further, the expected value of q is the probability that a given array position is left untouched by each of the k hash functions for each of the n items, which is (as above)

    E[q] = (1 − 1/m)^{kn}.

It is possible to prove, without the independence assumption, that q is very strongly concentrated around its expected value. In particular, from the Azuma–Hoeffding inequality, they prove that[7]

    Pr(|q − E[q]| ≥ λ/m) ≤ 2 exp(−2λ²/m).

Because of this, we can say that the exact probability of false positives is

    Σ_t Pr(q = t) (1 − t)^k ≈ (1 − E[q])^k = (1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k,

as before.

Optimal number of hash functions

For a given m and n, the value of k (the number of hash functions) that minimizes the false positive probability is

    k = (m/n) ln 2,

which gives

    2^{−k} ≈ 0.6185^{m/n}.

The required number of bits m, given n (the number of inserted elements) and a desired false positive probability p (and assuming the optimal value of k is used), can be computed by substituting the optimal value of k in the probability expression above:

    p = (1 − e^{−((m/n) ln 2)(n/m)})^{(m/n) ln 2},

which can be simplified to:

    ln p = −(m/n) (ln 2)².

This results in:

    m = −(n ln p)/(ln 2)².

This means that for a given false positive probability p, the length of a Bloom filter m is proportionate to the number of elements being filtered n.[8] While the above formula is asymptotic (i.e. applicable as m, n → ∞), the agreement with finite values of m, n is also quite good; the false positive probability for a finite Bloom filter with m bits, n elements, and k hash functions is at most

    (1 − e^{−k(n+0.5)/(m−1)})^k.

So we can use the asymptotic formula if we pay a penalty for at most half an extra element and at most one fewer bit.[9]
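These formulas translate directly into a sizing helper. A sketch (real implementations typically round k to a nearby integer and m up to a word boundary):

    from math import ceil, log

    def bloom_parameters(n, p):
        # m = -n ln p / (ln 2)^2 bits, k = (m/n) ln 2 hash functions.
        m = ceil(-n * log(p) / log(2) ** 2)
        k = max(1, round(m / n * log(2)))
        return m, k

    m, k = bloom_parameters(n=1_000_000, p=0.01)
    # About 9.6 bits per element and k = 7, matching the 1% figures above.
    assert 9 < m / 1_000_000 < 10 and k == 7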

( )Swamidass & Baldi (2007) showed that the number of [ ] k ( ) ∑ 1 kn items in a Bloom filterk can be approximated with the fol- Pr(q = t)(1−t)k ≈ (1−E[q])k = 1 − 1 − ≈ 1 − e−kn/m m lowing formula, t as before. [ ] m X n∗ = − ln 1 − , k m Optimal number of hash functions where n∗ is an estimate of the number of items in the For a given m and n, the value of k (the number of hash filter, m is the length (size) of the filter, k is the number functions) that minimizes the false positive probability is of hash functions, and X is the number of bits set to one. 114 CHAPTER 4. SETS
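These formulas are directly computable. The following minimal Python sketch is illustrative only; the class and function names, the SHA-256 double-hashing construction of the k indices, and the one-byte-per-bit array are assumptions of this sketch, not part of the original presentation:

    import math
    import hashlib

    def optimal_params(n, p):
        # m = -n ln p / (ln 2)^2 and k = (m/n) ln 2, rounded to integers.
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        k = max(1, round((m / n) * math.log(2)))
        return m, k

    class BloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = bytearray(m)        # one byte per bit, for simplicity

        def _indices(self, item):
            # Derive k indices from one digest by double hashing
            # (Kirsch & Mitzenmacher); any k independent hashes would do.
            d = hashlib.sha256(str(item).encode()).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big") | 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item):
            for i in self._indices(item):
                self.bits[i] = 1

        def __contains__(self, item):       # O(k); false positives possible
            return all(self.bits[i] for i in self._indices(item))

        def estimated_count(self):
            # Swamidass & Baldi: n* = -(m/k) ln(1 - X/m); undefined if X = m.
            x = sum(self.bits)              # X = number of bits set to one
            return -(self.m / self.k) * math.log(1 - x / self.m)

For example, optimal_params(1000, 0.01) gives m = 9586 and k = 7, i.e. about 9.6 bits per element, matching the figure quoted above.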

4.3.5 The union and intersection of sets

Bloom filters are a way of compactly representing a set of items. It is common to try to compute the size of the intersection or union between two sets. Bloom filters can be used to approximate the size of the intersection and union of two sets. Swamidass & Baldi (2007) showed that, for two Bloom filters of length m, their counts can be estimated, respectively, as

    n(A*) = −(m/k) ln(1 − n(A)/m)

and

    n(B*) = −(m/k) ln(1 − n(B)/m).

The size of their union can be estimated as

    n(A* ∪ B*) = −(m/k) ln(1 − n(A ∪ B)/m),

where n(A ∪ B) is the number of bits set to one in either of the two Bloom filters. Finally, the intersection can be estimated as

    n(A* ∩ B*) = n(A*) + n(B*) − n(A* ∪ B*),

using the three formulas together.
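The three estimates combine mechanically. A small Python sketch follows (the function names are illustrative, and both filters are assumed to share the same length m and the same hash functions):

    import math

    def estimate_size(bits_set, m, k):
        # Swamidass & Baldi count estimate from the number of set bits.
        return -(m / k) * math.log(1 - bits_set / m)

    def estimate_union_and_intersection(bits_a, bits_b, m, k):
        # bits_a, bits_b: equal-length 0/1 sequences of the two filters.
        n_a = estimate_size(sum(bits_a), m, k)
        n_b = estimate_size(sum(bits_b), m, k)
        # A bit is set in the filter of A ∪ B iff it is set in either filter.
        union_bits = sum(a | b for a, b in zip(bits_a, bits_b))
        n_union = estimate_size(union_bits, m, k)
        n_inter = n_a + n_b - n_union       # inclusion-exclusion
        return n_union, n_inter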

4.3.6 Interesting properties

• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements; adding an element never fails due to the data structure “filling up”. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.

• Union and intersection of Bloom filters with the same size and set of hash functions can be implemented with bitwise OR and AND operations, respectively. The union operation on Bloom filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom filter created from scratch using the union of the two sets. The intersection operation satisfies a weaker property: the false positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false positive probability in the Bloom filter created from scratch using the intersection of the two sets.

• Some kinds of superimposed code can be seen as a Bloom filter implemented with physical edge-notched cards. An example is Zatocoding, invented by Calvin Mooers in 1947, in which the set of categories associated with a piece of information is represented by notches on a card, with a random pattern of four notches for each category.

4.3.7 Examples

• Akamai's web servers use Bloom filters to prevent “one-hit-wonders” from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.[10]

• Google BigTable, Apache HBase, Apache Cassandra, and PostgreSQL[11] use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.[12]

• The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).[13][14]

• The Squid Web Proxy Cache uses Bloom filters for cache digests.[15]

• Bitcoin uses Bloom filters to speed up wallet synchronization.[16][17]

• The Venti archival storage system uses Bloom filters to detect previously stored data.[18]

• The SPIN model checker uses Bloom filters to track the reachable state space for large verification problems.[19]

• The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called a Bloom join in the database literature).[20]

• The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature.[21]

• Medium uses Bloom filters to avoid recommending articles a user has previously read.[22]

4.3.8 Alternatives

Classic Bloom filters use 1.44 log₂(1/ε) bits of space per inserted key, where ε is the false positive rate of the Bloom filter. However, the space that is strictly necessary for any data structure playing the same role as a Bloom filter is only log₂(1/ε) per key.[23] Hence Bloom filters use 44% more space than an equivalent optimal data structure. Instead, Pagh et al. provide an optimal-space data structure. Moreover, their data structure has constant locality of reference, independent of the false positive rate, unlike Bloom filters, where a smaller false positive rate ε leads to a greater number of memory accesses per query, log(1/ε). Also, it allows elements to be deleted without a space penalty, unlike Bloom filters. The same improved properties of optimal space usage, constant locality of reference, and the ability to delete elements are also provided by the cuckoo filter of Fan et al. (2014), an open source implementation of which is available.

Stern & Dill (1996) describe a probabilistic structure based on hash tables, hash compaction, which Dillinger & Manolios (2004b) identify as significantly more accurate than a Bloom filter when each is configured optimally. Dillinger and Manolios, however, point out that the reasonable accuracy of any given Bloom filter over a wide range of numbers of additions makes it attractive for probabilistic enumeration of state spaces of unknown size. Hash compaction is, therefore, attractive when the number of additions can be predicted accurately; however, despite being very fast in software, hash compaction is poorly suited for hardware because of worst-case linear access time.

Putze, Sanders & Singler (2007) have studied some variants of Bloom filters that are either faster or use less space than classic Bloom filters. The basic idea of the fast variant is to locate the k hash values associated with each key into one or two blocks having the same size as the processor's memory cache blocks (usually 64 bytes). This will presumably improve performance by reducing the number of potential memory cache misses. The proposed variants have, however, the drawback of using about 32% more space than classic Bloom filters.

The space-efficient variant relies on using a single hash function that generates, for each key, a value in the range [0, n/ε], where ε is the requested false positive rate. The sequence of values is then sorted and compressed using Golomb coding (or some other compression technique) to occupy a space close to n log₂(1/ε) bits. To query the Bloom filter for a given key, it suffices to check whether its corresponding value is stored in the Bloom filter. Decompressing the whole Bloom filter for each query would make this variant totally unusable. To overcome this problem, the sequence of values is divided into small blocks of equal size that are compressed separately. At query time, only half a block will need to be decompressed on average. Because of decompression overhead, this variant may be slower than classic Bloom filters, but this may be compensated by the fact that only a single hash function needs to be computed.

Another alternative to the classic Bloom filter is the one based on space-efficient variants of cuckoo hashing. In this case, once the hash table is constructed, the keys stored in the hash table are replaced with short signatures of the keys. Those signatures are strings of bits computed using a hash function applied on the keys.

4.3.9 Extensions and applications

Cache filtering

[Figure: Using a Bloom filter to prevent one-hit-wonders from being stored in a web cache decreased the rate of disk writes by nearly one half, reducing the load on the disks and potentially increasing disk performance.[10]]

Content delivery networks deploy web caches around the world to cache and serve web content to users with greater performance and reliability. A key application of Bloom filters is their use in efficiently determining which web objects to store in these web caches. Nearly three-quarters of the URLs accessed from a typical web cache are “one-hit-wonders” that are accessed by users only once and never again. It is clearly wasteful of disk resources to store one-hit-wonders in a web cache, since they will never be accessed again. To prevent caching one-hit-wonders, a Bloom filter is used to keep track of all URLs that are accessed by users. A web object is cached only when it has been accessed at least once before, i.e., the object is cached on its second request. The use of a Bloom filter in this fashion significantly reduces the disk write workload, since one-hit-wonders are never written to the disk cache. Further, filtering out the one-hit-wonders also saves cache space on disk, increasing the cache hit rates.[10]
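A sketch of this admission policy follows (hypothetical component names: bloom is any Bloom filter supporting add and membership tests, while disk_cache and fetch stand in for the real cache and origin server):

    def on_request(url, bloom, disk_cache, fetch):
        obj = disk_cache.get(url)
        if obj is not None:
            return obj                  # cache hit
        obj = fetch(url)                # serve from the origin
        if url in bloom:                # seen before: this is (at least)
            disk_cache.put(url, obj)    # the second request, so cache it
        else:
            bloom.add(url)              # first sighting: remember only
        return obj

False positives in the filter merely cause an occasional first-time object to be cached, so correctness is unaffected.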

Counting filters

Counting filters provide a way to implement a delete operation on a Bloom filter without recreating the filter afresh. In a counting filter, the array positions (buckets) are extended from being a single bit to being an n-bit counter. In fact, regular Bloom filters can be considered as counting filters with a bucket size of one bit. Counting filters were introduced by Fan et al. (2000).

The insert operation is extended to increment the value of the buckets, and the lookup operation checks that each of the required buckets is non-zero. The delete operation then consists of decrementing the value of each of the respective buckets.

Arithmetic overflow of the buckets is a problem, and the buckets should be sufficiently large to make this case rare. If it does occur, then the increment and decrement operations must leave the bucket set to the maximum possible value in order to retain the properties of a Bloom filter.

The size of counters is usually 3 or 4 bits. Hence counting Bloom filters use 3 to 4 times more space than static Bloom filters. In contrast, the data structures of Pagh, Pagh & Rao (2005) and Fan et al. (2014) also allow deletions but use less space than a static Bloom filter.

Another issue with counting filters is limited scalability. Because the counting Bloom filter table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false positive rate will grow rapidly as more keys are inserted.

Bonomi et al. (2006) introduced a data structure based on d-left hashing that is functionally equivalent but uses approximately half as much space as counting Bloom filters. The scalability issue does not occur in this data structure. Once the designed capacity is exceeded, the keys could be reinserted in a new hash table of double size.

The space-efficient variant by Putze, Sanders & Singler (2007) could also be used to implement counting filters by supporting insertions and deletions.

Rottenstreich, Kanizo & Keslassy (2012) introduced a new general method based on variable increments that significantly improves the false positive probability of counting Bloom filters and their variants, while still supporting deletions. Unlike counting Bloom filters, at each element insertion the hashed counters are incremented by a hashed variable increment instead of a unit increment. To query an element, the exact values of the counters are considered, and not just their positiveness. If a sum represented by a counter value cannot be composed of the corresponding variable increment for the queried element, a negative answer can be returned to the query.
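A minimal counting-filter sketch along these lines (illustrative; it assumes 4-bit counters and takes the bucket-index function as a parameter). Note that counters pinned at the maximum are never incremented or decremented again, as required above:

    class CountingBloomFilter:
        MAX = 15                            # 4-bit counter ceiling

        def __init__(self, m, indices):
            self.counters = [0] * m
            self._indices = indices         # item -> list of k bucket indices

        def add(self, item):
            for i in self._indices(item):
                if self.counters[i] < self.MAX:
                    self.counters[i] += 1   # overflowed buckets stay pinned

        def remove(self, item):             # only for items known to be present
            for i in self._indices(item):
                if 0 < self.counters[i] < self.MAX:
                    self.counters[i] -= 1

        def __contains__(self, item):
            return all(self.counters[i] > 0 for i in self._indices(item))

The variable-increment scheme of Rottenstreich, Kanizo & Keslassy would replace the unit increments with per-element hashed increments and query the exact counter values.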
Data synchronization

Bloom filters can be used for approximate data synchronization, as in Byers et al. (2004). Counting Bloom filters can be used to approximate the number of differences between two sets; this approach is described in Agarwal & Trachtenberg (2006).

Bloomier filters

Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of “Bloomier filters”, a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.

Compact approximators

Boldi & Vigna (2005) proposed a lattice-based generalization of Bloom filters. A compact approximator associates to each key an element of a lattice (the standard Bloom filters being the case of the Boolean two-element lattice). Instead of a bit array, they have an array of lattice elements. When adding a new association between a key and an element of the lattice, they compute the maximum of the current contents of the k array locations associated to the key with the lattice element. When reading the value associated to a key, they compute the minimum of the values found in the k locations associated to the key. The resulting value approximates from above the original value.

Stable Bloom filters

Deng & Rafiei (2006) proposed Stable Bloom filters as a variant of Bloom filters for streaming data. The idea is that, since there is no way to store the entire history of a stream (which can be infinite), Stable Bloom filters continuously evict stale information to make room for more recent elements. Since stale information is evicted, the Stable Bloom filter introduces false negatives, which do not appear in traditional Bloom filters. The authors show that a tight upper bound of false positive rates is guaranteed, and that the method is superior to standard Bloom filters in terms of false positive rates and time efficiency when a small space and an acceptable false positive rate are given.

Scalable Bloom filters

Almeida et al. (2007) proposed a variant of Bloom filters that can adapt dynamically to the number of elements stored, while assuring a minimum false positive probability. The technique is based on sequences of standard Bloom filters with increasing capacity and tighter false positive probabilities, so as to ensure that a maximum false positive probability can be set beforehand, regardless of the number of elements to be inserted.

Decentralized aggregation

Bloom filters can be organized in distributed data structures to perform fully decentralized computations of aggregate functions. Decentralized aggregation makes collective measurements locally available in every node of a distributed network without involving a centralized computational entity for this purpose.[24]

Layered Bloom filters

A layered Bloom filter consists of multiple Bloom filter layers. Layered Bloom filters allow keeping track of how many times an item was added to the Bloom filter by checking how many layers contain the item. With a layered Bloom filter, a check operation will normally return the deepest layer number the item was found in.[25]

Attenuated Bloom filters

[Figure: Attenuated Bloom Filter Example: Search for pattern 11010, starting from node n1.]

An attenuated Bloom filter of depth D can be viewed as an array of D normal Bloom filters. In the context of service discovery in a network, each node stores regular and attenuated Bloom filters locally. The regular or local Bloom filter indicates which services are offered by the node itself. The attenuated filter of level i indicates which services can be found on nodes that are i hops away from the current node. The i-th value is constructed by taking a union of local Bloom filters for nodes i hops away from the node.[26]

Let's take the small network shown in the figure as an example. Say we are searching for a service A whose id hashes to bits 0, 1, and 3 (pattern 11010). Let the n1 node be the starting point. First, we check whether service A is offered by n1 by checking its local filter. Since the patterns don't match, we check the attenuated Bloom filter in order to determine which node should be the next hop. We see that n2 doesn't offer service A but lies on the path to nodes that do. Hence, we move to n2 and repeat the same procedure. We quickly find that n3 offers the service, and hence the destination is located.[27]

By using attenuated Bloom filters consisting of multiple layers, services at more than one hop distance can be discovered while avoiding saturation of the Bloom filter by attenuating (shifting out) bits set by sources further away.[26]

Chemical structure searching

Bloom filters are often used to search large chemical structure databases (see chemical similarity). In the simplest case, the elements added to the filter (called a fingerprint in this field) are just the atomic numbers present in the molecule, or a hash based on the atomic number of each atom and the number and type of its bonds. This case is too simple to be useful. More advanced filters also encode atom counts, larger substructure features like carboxyl groups, and graph properties like the number of rings. In hash-based fingerprints, a hash function based on atom and bond properties is used to turn a subgraph into a PRNG seed, and the first output values are used to set bits in the Bloom filter.

Molecular fingerprints started in the late 1940s as a way to search for chemical structures on punched cards. However, it wasn't until around 1990 that Daylight introduced a hash-based method to generate the bits, rather than use a precomputed table. Unlike the dictionary approach, the hash method can assign bits for substructures which hadn't previously been seen. In the early 1990s, the term “fingerprint” was considered different from “structural keys”, but the term has since grown to encompass most molecular characteristics which can be used for a similarity comparison, including structural keys, sparse count fingerprints, and 3D fingerprints. Unlike Bloom filters, the Daylight hash method allows the number of bits assigned per feature to be a function of the feature size, but most implementations of Daylight-like fingerprints use a fixed number of bits per feature, which makes them a Bloom filter. The original Daylight fingerprints could be used for both similarity and screening purposes. Many other fingerprint types, like the popular ECFP2, can be used for similarity but not for screening, because they include local environmental characteristics that introduce false negatives when used as a screen. Even if these are constructed with the same mechanism, these are not Bloom filters because they cannot be used to filter.

4.3.10 See also

• Count–min sketch
• Feature hashing
• MinHash
• Quotient filter
• Skip list

4.3.11 Notes

[1] Bloom (1970).

[2] Bonomi et al. (2006).
[3] Dillinger & Manolios (2004a); Kirsch & Mitzenmacher (2006).
[4] Mitzenmacher & Upfal (2005).
[5] Blustein & El-Maazawi (2002), pp. 21–22.
[6] Mitzenmacher & Upfal (2005), pp. 109–111, 308.
[7] Mitzenmacher & Upfal (2005), p. 308.
[8] Starobinski, Trachtenberg & Agarwal (2003).
[9] Goel & Gupta (2010).
[10] Maggs & Sitaraman (2015).
[11] “Bloom index contrib module”. Postgresql.org. 2016-04-01. Retrieved 2016-06-18.
[12] Chang et al. (2006); Apache Software Foundation (2012).
[13] Yakunin, Alex (2010-03-25). “Alex Yakunin’s blog: Nice Bloom filter application”. Blog.alexyakunin.com. Retrieved 2014-05-31.
[14] “Issue 10896048: Transition safe browsing from bloom filter to prefix set. - Code Review”. Chromiumcodereview.appspot.com. Retrieved 2014-07-03.
[15] Wessels (2004).
[16] Bitcoin 0.8.0.
[17] “The Bitcoin Foundation - Supporting the development of Bitcoin”. bitcoinfoundation.org.
[18] “Plan 9 /sys/man/8/venti”. Plan9.bell-labs.com. Retrieved 2014-05-31.
[19] http://spinroot.com/
[20] Mullin (1990).
[21] “Exim source code”. github. Retrieved 2014-03-03.
[22] “What are Bloom filters?”. Medium. Retrieved 2015-11-01.
[23] Pagh, Pagh & Rao (2005).
[24] Pournaras, Warnier & Brazier (2013).
[25] Zhiwang, Jungang & Jian (2010).
[26] Koucheryavy et al. (2009).
[27] Kubiatowicz et al. (2000).

4.3.12 References

• Agarwal, Sachin; Trachtenberg, Ari (2006), “Approximating the number of differences between remote sets” (PDF), IEEE Information Theory Workshop, Punta del Este, Uruguay: 217, doi:10.1109/ITW.2006.1633815, ISBN 1-4244-0035-X

• Ahmadi, Mahmood; Wong, Stephan (2007), “A Cache Architecture for Counting Bloom Filters”, 15th International Conference on Networks (ICON-2007), p. 218, doi:10.1109/ICON.2007.4444089, ISBN 978-1-4244-1229-7

• Almeida, Paulo; Baquero, Carlos; Preguica, Nuno; Hutchison, David (2007), “Scalable Bloom Filters” (PDF), Information Processing Letters, 101 (6): 255–261, doi:10.1016/j.ipl.2006.10.007

• Apache Software Foundation (2012), “11.6. Schema Design”, The Apache HBase Reference Guide, Revision 0.94.27

• Bloom, Burton H. (1970), “Space/Time Trade-offs in Hash Coding with Allowable Errors”, Communications of the ACM, 13 (7): 422–426, doi:10.1145/362686.362692

• Blustein, James; El-Maazawi, Amal (2002), “optimal case for general Bloom filters”, Bloom Filters — A Tutorial, Analysis, and Survey, Dalhousie University Faculty of Computer Science, pp. 1–31

• Boldi, Paolo; Vigna, Sebastiano (2005), “Mutable strings in Java: design, implementation and lightweight text-search algorithms”, Science of Computer Programming, 54 (1): 3–23, doi:10.1016/j.scico.2004.05.003

• Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), “An Improved Construction for Counting Bloom Filters”, Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science, 4168, pp. 684–695, doi:10.1007/11841036_61, ISBN 978-3-540-38875-3

• Broder, Andrei; Mitzenmacher, Michael (2005), “Network Applications of Bloom Filters: A Survey” (PDF), Internet Mathematics, 1 (4): 485–509, doi:10.1080/15427951.2004.10129096

• Byers, John W.; Considine, Jeffrey; Mitzenmacher, Michael; Rost, Stanislav (2004), “Informed content delivery across adaptive overlay networks”, IEEE/ACM Transactions on Networking, 12 (5): 767, doi:10.1109/TNET.2004.836103

• Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson; Wallach, Deborah; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; Gruber, Robert (2006), “Bigtable: A Distributed Storage System for Structured Data”, Seventh Symposium on Operating System Design and Implementation

• Charles, Denis; Chellapilla, Kumar (2008), “Bloomier Filters: A second look”, The Computing Research Repository (CoRR), arXiv:0807.0928

• Chazelle, Bernard; Kilian, Joe; Rubinfeld, Ronitt; Tal, Ayellet (2004), “The Bloomier filter: an efficient data structure for static support lookup tables”, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 30–39

• Cohen, Saar; Matias, Yossi (2003), “Spectral Bloom Filters”, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (PDF), pp. 241–252, doi:10.1145/872757.872787, ISBN 158113634X

• Deng, Fan; Rafiei, Davood (2006), “Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters”, Proceedings of the ACM SIGMOD Conference (PDF), pp. 25–36

• Dharmapurikar, Sarang; Song, Haoyu; Turner, Jonathan; Lockwood, John (2006), “Fast packet classification using Bloom filters”, Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems (PDF), pp. 61–70, doi:10.1145/1185347.1185356, ISBN 1595935800

• Dietzfelbinger, Martin; Pagh, Rasmus (2008), “Succinct Data Structures for Retrieval and Approximate Membership”, The Computing Research Repository (CoRR), arXiv:0803.3693

• Dillinger, Peter C.; Manolios, Panagiotis (2004a), “Fast and Accurate Bitstate Verification for SPIN”, Proceedings of the 11th International Spin Workshop on Model Checking Software, Springer-Verlag, Lecture Notes in Computer Science 2989

• Dillinger, Peter C.; Manolios, Panagiotis (2004b), “Bloom Filters in Probabilistic Verification”, Proceedings of the 5th International Conference on Formal Methods in Computer-Aided Design, Springer-Verlag, Lecture Notes in Computer Science 3312

• Donnet, Benoit; Baynat, Bruno; Friedman, Timur (2006), “Retouched Bloom Filters: Allowing Networked Applications to Flexibly Trade Off False Positives Against False Negatives”, CoNEXT 06 – 2nd Conference on Future Networking Technologies

• Eppstein, David; Goodrich, Michael T. (2007), “Space-efficient straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters”, Algorithms and Data Structures, 10th International Workshop, WADS 2007, Springer-Verlag, Lecture Notes in Computer Science 4619, pp. 637–648, arXiv:0704.3313

• Fan, Bin; Andersen, Dave G.; Kaminsky, Michael; Mitzenmacher, Michael D. (2014), “Cuckoo filter: Practically better than Bloom”, Proc. 10th ACM Int. Conf. Emerging Networking Experiments and Technologies (CoNEXT '14), pp. 75–88, doi:10.1145/2674005.2674994. Open source implementation available on github.

• Fan, Li; Cao, Pei; Almeida, Jussara; Broder, Andrei (2000), “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol”, IEEE/ACM Transactions on Networking, 8 (3): 281–293, doi:10.1109/90.851975. A preliminary version appeared at SIGCOMM '98.

• Goel, Ashish; Gupta, Pankaj (2010), “Small subset queries and bloom filters using ternary associative memories, with applications”, ACM Sigmetrics 2010, 38: 143, doi:10.1145/1811099.1811056

• Haghighat, Mohammad Hashem; Tavakoli, Mehdi; Kharrazi, Mehdi (2013), “Payload Attribution via Character Dependent Multi-Bloom Filters”, Transaction on Information Forensics and Security, IEEE, 99 (5): 705, doi:10.1109/TIFS.2013.2252341

• Kirsch, Adam; Mitzenmacher, Michael (2006), “Less Hashing, Same Performance: Building a Better Bloom Filter”, in Azar, Yossi; Erlebach, Thomas, Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science, 4168, Springer-Verlag, pp. 456–467, doi:10.1007/11841036, ISBN 978-3-540-38875-3

• Koucheryavy, Y.; Giambene, G.; Staehle, D.; Barcelo-Arroyo, F.; Braun, T.; Siris, V. (2009), “Traffic and QoS Management in Wireless Multimedia Networks”, COST 290 Final Report, USA: 111

• Kubiatowicz, J.; Bindel, D.; Czerwinski, Y.; Geels, S.; Eaton, D.; Gummadi, R.; Rhea, S.; Weatherspoon, H.; et al. (2000), “Oceanstore: An architecture for global-scale persistent storage” (PDF), ACM SIGPLAN Notices, USA: 190–201

• Maggs, Bruce M.; Sitaraman, Ramesh K. (July 2015), “Algorithmic nuggets in content delivery”, SIGCOMM Computer Communication Review, New York, NY, USA: ACM, 45 (3): 52–66, doi:10.1145/2805789.2805800

• Mitzenmacher, Michael; Upfal, Eli (2005), Probability and computing: Randomized algorithms and probabilistic analysis, Cambridge University Press, pp. 107–112, ISBN 9780521835404

• Mortensen, Christian Worm; Pagh, Rasmus; Pătraşcu, Mihai (2005), “On dynamic range reporting in one dimension”, Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing, pp. 104–111, doi:10.1145/1060590.1060606, ISBN 1581139608

• Mullin, James K. (1990), “Optimal semijoins for distributed database systems”, Software Engineering, IEEE Transactions on, 16 (5): 558–560, doi:10.1109/32.52778

• Pagh, Anna; Pagh, Rasmus; Rao, S. Srinivasa (2005), “An optimal Bloom filter replacement”, Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 823–829

• Porat, Ely (2008), “An Optimal Bloom Filter Replacement Based on Matrix Solving”, The Computing Research Repository (CoRR), arXiv:0804.1845

• Pournaras, E.; Warnier, M.; Brazier, F. M. T. (2013), “A generic and adaptive aggregation service for large-scale decentralized networks”, Complex Adaptive Systems Modeling, 1:19, doi:10.1186/2194-3206-1-19. Prototype implementation available on github.

• Putze, F.; Sanders, P.; Singler, J. (2007), “Cache-, Hash- and Space-Efficient Bloom Filters”, in Demetrescu, Camil, Experimental Algorithms, 6th International Workshop, WEA 2007 (PDF), Lecture Notes in Computer Science, 4525, Springer-Verlag, pp. 108–121, doi:10.1007/978-3-540-72845-0, ISBN 978-3-540-72844-3

• Rottenstreich, Ori; Kanizo, Yossi; Keslassy, Isaac (2012), “The Variable-Increment Counting Bloom Filter”, 31st Annual IEEE International Conference on Computer Communications, Infocom 2012 (PDF), pp. 1880–1888, doi:10.1109/INFCOM.2012.6195563, ISBN 978-1-4673-0773-4

• Sethumadhavan, Simha; Desikan, Rajagopalan; Burger, Doug; Moore, Charles R.; Keckler, Stephen W. (2003), “Scalable hardware memory disambiguation for high ILP processors”, 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, MICRO-36 (PDF), pp. 399–410, doi:10.1109/MICRO.2003.1253244, ISBN 0-7695-2043-X

• Shanmugasundaram, Kulesh; Brönnimann, Hervé; Memon, Nasir (2004), “Payload attribution via hierarchical Bloom filters”, Proceedings of the 11th ACM Conference on Computer and Communications Security, pp. 31–41, doi:10.1145/1030083.1030089, ISBN 1581139616

• Starobinski, David; Trachtenberg, Ari; Agarwal, Sachin (2003), “Efficient PDA Synchronization”, IEEE Transactions on Mobile Computing, 2 (1): 40, doi:10.1109/TMC.2003.1195150

• Stern, Ulrich; Dill, David L. (1996), “A New Scheme for Memory-Efficient Probabilistic Verification”, Proceedings of Formal Description Techniques for Distributed Systems and Communication Protocols, and Protocol Specification, Testing, and Verification: IFIP TC6/WG6.1 Joint International Conference, Chapman & Hall, IFIP Conference Proceedings, pp. 333–348, CiteSeerX: 10.1.1.47.4101

• Swamidass, S. Joshua; Baldi, Pierre (2007), “Mathematical correction for fingerprint similarity measures to improve chemical retrieval”, Journal of Chemical Information and Modeling, ACS Publications, 47 (3): 952–964, doi:10.1021/ci600526a, PMID 17444629

• Wessels, Duane (January 2004), “10.7 Cache Digests”, Squid: The Definitive Guide (1st ed.), O'Reilly Media, p. 172, ISBN 0-596-00162-2. Cache Digests are based on a technique first published by Pei Cao, called Summary Cache. The fundamental idea is to use a Bloom filter to represent the cache contents.

• Zhiwang, Cen; Jungang, Xu; Jian, Sun (2010), “A multi-layer Bloom filter for duplicated URL detection”, Proc. 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE 2010), 1, pp. V1-586–V1-591, doi:10.1109/ICACTE.2010.5578947

4.3.13 External links

• Why Bloom filters work the way they do (Michael Nielsen, 2012)

• Bloom Filters — A Tutorial, Analysis, and Survey (Blustein & El-Maazawi, 2002) at Dalhousie University

• Table of false-positive rates for different configurations from a University of Wisconsin–Madison website

• Interactive Processing demonstration from ashcan.org

• “More Optimal Bloom Filters,” Ely Porat (Nov/2007) Google TechTalk video on YouTube

• “Using Bloom Filters” Detailed Bloom Filter explanation using Perl

• “A Garden Variety of Bloom Filters” - Explanation and Analysis of Bloom filter variants

• “Bloom filters, fast and simple” - Explanation and example implementation in Python

4.4 MinHash

In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997),[1] and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results.[2] It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.[1]

4.4.1 Jaccard similarity and minimum hash values

The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. For sets A and B it is defined to be the ratio of the number of elements of their intersection and the number of elements of their union:

    J(A, B) = |A ∩ B| / |A ∪ B|.

This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate J(A, B) quickly, without explicitly computing the intersection and union.

Let h be a hash function that maps the members of A and B to distinct integers, and for any set S define h_min(S) to be the minimal member of S with respect to h—that is, the member x of S with the minimum value of h(x). Now, if we apply h_min to both A and B, we will get the same value exactly when the element of the union A ∪ B with minimum hash value lies in the intersection A ∩ B. The probability of this being true is the ratio above, and therefore:

    Pr[h_min(A) = h_min(B)] = J(A, B).

That is, the probability that h_min(A) = h_min(B) is true is equal to the similarity J(A, B), assuming randomly chosen sets A and B. In other words, if r is the random variable that is one when h_min(A) = h_min(B) and zero otherwise, then r is an unbiased estimator of J(A, B). r has too high a variance to be a useful estimator for the Jaccard similarity on its own—it is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.

4.4.2 Algorithm

Variant with many hash functions

The simplest version of the scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of h_min(S) for these k functions.

To estimate J(A, B) using this version of the scheme, let y be the number of hash functions for which h_min(A) = h_min(B), and use y/k as the estimate. This estimate is the average of k different 0-1 random variables, each of which is one when h_min(A) = h_min(B) and zero otherwise, and each of which is an unbiased estimator of J(A, B). Therefore, their average is also an unbiased estimator, and by standard Chernoff bounds for sums of 0-1 random variables, its expected error is O(1/√k).[3]

Therefore, for any constant ε > 0 there is a constant k = O(1/ε²) such that the expected error of the estimate is at most ε. For example, 400 hashes would be required to estimate J(A, B) with an expected error less than or equal to .05.

Variant with a single hash function

It may be computationally expensive to compute multiple hash functions, but a related version of the MinHash scheme avoids this penalty by using only a single hash function and uses it to select multiple values from each set rather than selecting only a single minimum value per hash function. Let h be a hash function, and let k be a fixed integer. If S is any set of k or more values in the domain of h, define h_(k)(S) to be the subset of the k members of S that have the smallest values of h. This subset h_(k)(S) is used as a signature for the set S, and the similarity of any two sets is estimated by comparing their signatures.

Specifically, let A and B be any two sets. Then X = h_(k)(h_(k)(A) ∪ h_(k)(B)) = h_(k)(A ∪ B) is a set of k elements of A ∪ B, and if h is a random function then any subset of k elements is equally likely to be chosen; that is, X is a simple random sample of A ∪ B. The subset Y = X ∩ h_(k)(A) ∩ h_(k)(B) is the set of members of X that belong to the intersection A ∩ B. Therefore, |Y|/k is an unbiased estimator of J(A, B). The difference between this estimator and the estimator produced by multiple hash functions is that X always has exactly k members, whereas the multiple hash functions may lead to a smaller number of sampled elements, due to the possibility that two different hash functions may have the same minima. However, when k is small relative to the sizes of the sets, this difference is negligible.
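Both variants are short to state in code. In the Python sketch below (illustrative only; the SHA-256-seeded family of hash functions is an arbitrary stand-in for k random hash functions, and the names are our own), signature implements the many-hash variant and bottom_k the single-hash variant:

    import hashlib

    def _hash(seed, x):
        d = hashlib.sha256(f"{seed}:{x}".encode()).digest()
        return int.from_bytes(d[:8], "big")

    def signature(s, k):
        # Many-hash variant: the k values h_min(S), one per hash function.
        return [min(_hash(i, x) for x in s) for i in range(k)]

    def estimate_jaccard(sig_a, sig_b):
        # y/k: the fraction of hash functions whose minima agree.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    def bottom_k(s, k):
        # Single-hash variant: h_(k)(S), the k smallest hash values of S.
        return set(sorted(_hash(0, x) for x in s)[:k])

    def estimate_jaccard_bottom_k(sig_a, sig_b, k):
        x = set(sorted(sig_a | sig_b)[:k])  # X = h_(k)(A ∪ B)
        y = x & sig_a & sig_b               # Y = X ∩ h_(k)(A) ∩ h_(k)(B)
        return len(y) / k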

By standard Chernoff bounds for sampling without replacement, this estimator has expected error O(1/√k), matching the performance of the multiple-hash-function scheme.

Time analysis

The estimator |Y|/k can be computed in time O(k) from the two signatures of the given sets, in either variant of the scheme. Therefore, when ε and k are constants, the time to compute the estimated similarity from the signatures is also constant. The signature of each set can be computed in linear time on the size of the set, so when many pairwise similarities need to be estimated this method can lead to substantial savings in running time compared to doing a full comparison of the members of each set. Specifically, for set size n the many-hash variant takes O(nk) time. The single-hash variant is generally faster, requiring O(n) time to maintain the queue of minimum hash values, assuming n >> k.[1]

4.4.3 Min-wise independent permutations

In order to implement the MinHash scheme as described above, one needs the hash function h to define a random permutation on n elements, where n is the total number of distinct elements in the union of all of the sets to be compared. But because there are n! different permutations, it would require Ω(n log n) bits just to specify a truly random permutation, an infeasibly large number for even moderate values of n. Because of this fact, by analogy to the theory of universal hashing, there has been significant work on finding a family of permutations that is “min-wise independent”, meaning that for any subset of the domain, any element is equally likely to be the minimum. It has been established that a min-wise independent family of permutations must include at least

    lcm(1, 2, ..., n) ≥ e^{n − o(n)}

different permutations, and therefore that it needs Ω(n) bits to specify a single permutation, still infeasibly large.[2]

Because of this impracticality, two variant notions of min-wise independence have been introduced: restricted min-wise independent permutation families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k.[4] Approximate min-wise independence has at most a fixed probability ε of varying from full independence.[5]

4.4.4 Applications

The original applications for MinHash involved clustering and eliminating near-duplicates among web documents, represented as sets of the words occurring in those documents.[1][2] Similar techniques have also been used for clustering and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.[6]

In data mining, Cohen et al. (2001) use MinHash as a tool for association rule learning. Given a database in which each entry has multiple attributes (viewed as a 0–1 matrix with a row per database entry and a column per attribute), they use MinHash-based approximations to the Jaccard index to identify candidate pairs of attributes that frequently co-occur, and then compute the exact value of the index for only those pairs to determine the ones whose frequencies of co-occurrence are below a given strict threshold.[7]

4.4.5 Other uses

The MinHash scheme may be seen as an instance of locality sensitive hashing, a collection of techniques for using hash functions to map large sets of objects down to smaller hash values in such a way that, when two objects have a small distance from each other, their hash values are likely to be the same. In this instance, the signature of a set may be seen as its hash value. Other locality sensitive hashing techniques exist for Hamming distance between sets and cosine distance between vectors; locality sensitive hashing has important applications in nearest neighbor search algorithms.[8] For large distributed systems, and in particular MapReduce, there exist modified versions of MinHash to help compute similarities with no dependence on the point dimension.[9]

4.4.6 Evaluation and benchmarks

A large-scale evaluation was conducted by Google in 2006[10] to compare the performance of the Minhash and Simhash[11] algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling[12] and using Minhash and LSH for Google News personalization.[13]

4.4.7 See also

• w-shingling
• Count–min sketch

4.4.8 References

[1] Broder, Andrei Z. (1997), “On the resemblance and containment of documents”, Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997 (PDF), IEEE, pp. 21–29, doi:10.1109/SEQUEN.1997.666900.

[2] Broder, Andrei Z.; Charikar, Moses; Frieze, Alan M.; Mitzenmacher, Michael (1998), “Min-wise independent permutations”, Proc. 30th ACM Symposium on Theory of Computing (STOC '98), New York, NY, USA: Association for Computing Machinery, pp. 327–336, doi:10.1145/276698.276781.

[3] Vassilvitskii, Sergey (2011), COMS 6998-12: Dealing with Massive Data (lecture notes, Columbia University) (PDF).

[4] Matoušek, Jiří; Stojaković, Miloš (2003), “On restricted min-wise independence of permutations”, Random Structures and Algorithms, 23 (4): 397–408, doi:10.1002/rsa.10101.

[5] Saks, M.; Srinivasan, A.; Zhou, S.; Zuckerman, D. (2000), “Low discrepancy sets yield approximate min-wise independent permutation families”, Information Processing Letters, 73 (1–2): 29–32, doi:10.1016/S0020-0190(99)00163-5.

[6] Chum, Ondřej; Philbin, James; Isard, Michael; Zisserman, Andrew (2007), “Scalable near identical image and shot detection”, Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR'07), doi:10.1145/1282280.1282359; Chum, Ondřej; Philbin, James; Zisserman, Andrew (2008), “Near duplicate image detection: min-hash and tf-idf weighting”, Proceedings of the British Machine Vision Conference (PDF), 3, p. 4.

[7] Cohen, E.; Datar, M.; Fujiwara, S.; Gionis, A.; Indyk, P.; Motwani, R.; Ullman, J. D.; Yang, C. (2001), “Finding interesting associations without support pruning”, IEEE Transactions on Knowledge and Data Engineering, 13 (1): 64–78, doi:10.1109/69.908981.

[8] Andoni, Alexandr; Indyk, Piotr (2008), “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions”, Communications of the ACM, 51 (1): 117–122, doi:10.1145/1327452.1327494.

[9] Zadeh, Reza; Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082.

[10] Henzinger, Monika (2006), “Finding near-duplicate web pages: a large-scale evaluation of algorithms”, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (PDF), doi:10.1145/1148170.1148222.

[11] Charikar, Moses S. (2002), “Similarity estimation techniques from rounding algorithms”, Proceedings of the 34th Annual ACM Symposium on Theory of Computing, doi:10.1145/509907.509965.

[12] Gurmeet Singh, Manku; Jain, Arvind; Das Sarma, Anish (2007), “Detecting near-duplicates for web crawling”, Proceedings of the 16th International Conference on World Wide Web (PDF), doi:10.1145/1242572.1242592.

[13] Das, Abhinandan S.; Datar, Mayur; Garg, Ashutosh; Rajaram, Shyam; et al. (2007), “Google news personalization: scalable online collaborative filtering”, Proceedings of the 16th International Conference on World Wide Web, doi:10.1145/1242572.1242610.

4.4.9 External links

• Mining of Massive Datasets, Ch. 3. Finding similar Items
• Simple Simhashing
• Set Similarity & MinHash - C# implementation
• Minhash with LSH for all-pair search (C# implementation)
• MinHash – Java implementation
• MinHash – Scala implementation and a duplicate detection tool
• All pairs similarity search (Google Research)
• Distance and Similarity Measures (Wolfram Alpha)
• Nilsimsa hash (Python implementation)
• Simhash

4.5 Disjoint-set data structure

[Figure: MakeSet creates 8 singletons.]

[Figure: After some operations of Union, some sets are grouped together.]

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that keeps track of a set of elements partitioned into a number of disjoint (nonoverlapping) subsets. It supports two useful operations:

• Find: Determine which subset a particular element is in. Find typically returns an item from this set that serves as its “representative”; by comparing the result of two Find operations, one can determine whether two elements are in the same subset.

• Union: Join two subsets into a single subset.

The other important operation, MakeSet, which makes a set containing only a given element (a singleton), is generally trivial. With these three operations, many practical partitioning problems can be solved (see the Applications section).

In order to define these operations more precisely, some way of representing the sets is needed. One common approach is to select a fixed element of each set, called its representative, to represent the set as a whole. Then, Find(x) returns the representative of the set that x belongs to, and Union takes two set representatives as its arguments.

4.5.1 Disjoint-set linked lists

A simple disjoint-set data structure uses a linked list for each set. The element at the head of each list is chosen as its representative.

MakeSet creates a list of one element. Union appends the two lists, a constant-time operation if the list carries a pointer to its tail. The drawback of this implementation is that Find requires O(n), or linear, time to traverse the list backwards from a given element to the head of the list.

This can be avoided by including in each linked list node a pointer to the head of the list; then Find takes constant time, since this pointer refers directly to the set representative. However, Union now has to update each element of the list being appended to make it point to the head of the new combined list, requiring O(n) time.

When the length of each list is tracked, the required time can be improved by always appending the smaller list to the longer. Using this weighted-union heuristic, a sequence of m MakeSet, Union, and Find operations on n elements requires O(m + n log n) time.[1] For asymptotically faster operations, a different data structure is needed.

Analysis of the naive approach

We now explain the bound O(n log(n)) above. Suppose you have a collection of lists, and each node of each list contains an object, the name of the list to which it belongs, and the number of elements in that list. Also assume that the total number of elements in all lists is n (i.e. there are n elements overall). We wish to be able to merge any two of these lists, and update all of their nodes so that they still contain the name of the list to which they belong. The rule for merging the lists A and B is that if A is larger than B then merge the elements of B into A and update the elements that used to belong to B, and vice versa.

Choose an arbitrary element of list L, say x. We wish to count how many times, in the worst case, x will need to have the name of the list to which it belongs updated. The element x will only have its name updated when the list it belongs to is merged with another list of the same size or of greater size. Each time that happens, the size of the list to which x belongs at least doubles. So finally, the question is “how many times can a number double before it is the size of n?” (then the list containing x will contain all n elements). The answer is exactly log₂(n). So for any given element of any given list in the structure described, it will need to be updated log₂(n) times in the worst case. Therefore, updating a list of n elements stored in this way takes O(n log(n)) time in the worst case. A find operation can be done in O(1) for this structure because each node contains the name of the list to which it belongs.

A similar argument holds for merging the trees in the data structures discussed below. Additionally, it helps explain the time analysis of some operations in the binomial heap and Fibonacci heap data structures.

4.5.2 Disjoint-set forests

Disjoint-set forests are data structures where each set is represented by a tree data structure, in which each node holds a reference to its parent node (see Parent pointer tree). They were first described by Bernard A. Galler and Michael J. Fischer in 1964,[2] although their precise analysis took years.

In a disjoint-set forest, the representative of each set is the root of that set's tree. Find follows parent nodes until it reaches the root. Union combines two trees into one by attaching the root of one to the root of the other. One way of implementing these might be:

    function MakeSet(x)
        x.parent := x

    function Find(x)
        if x.parent == x
            return x
        else
            return Find(x.parent)

    function Union(x, y)
        xRoot := Find(x)
        yRoot := Find(y)
        xRoot.parent := yRoot

In this naive form, this approach is no better than the linked-list approach, because the tree it creates can be highly unbalanced; however, it can be enhanced in two ways.

The first way, called union by rank, is to always attach the smaller tree to the root of the larger tree. Since it is the depth of the tree that affects the running time, the tree with smaller depth gets added under the root of the deeper tree, which only increases the depth if the depths were equal. In the context of this algorithm, the term rank is used instead of depth, since it stops being equal to the depth if path compression (described below) is also used. One-element trees are defined to have a rank of zero, and whenever two trees of the same rank r are united, the rank of the result is r + 1. Just applying this technique alone yields a worst-case running time of O(log n) for the Union or Find operation. Pseudocode for the improved MakeSet and Union:

    function MakeSet(x)
        x.parent := x
        x.rank := 0

    function Union(x, y)
        xRoot := Find(x)
        yRoot := Find(y)
        if xRoot == yRoot
            return
        // x and y are not already in same set. Merge them.
        if xRoot.rank < yRoot.rank
            xRoot.parent := yRoot
        else if xRoot.rank > yRoot.rank
            yRoot.parent := xRoot
        else
            yRoot.parent := xRoot
            xRoot.rank := xRoot.rank + 1
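Rendered concretely, the pseudocode above, together with the path-compression refinement introduced next (which rewrites parent pointers inside Find), becomes a short array-based Python class; treating the elements as the integers 0..n−1 is a choice of this illustrative sketch:

    class DisjointSet:
        def __init__(self, n):
            self.parent = list(range(n))    # MakeSet for elements 0..n-1
            self.rank = [0] * n

        def find(self, x):
            if self.parent[x] != x:
                # Path compression: point x directly at the root.
                self.parent[x] = self.find(self.parent[x])
            return self.parent[x]

        def union(self, x, y):
            xr, yr = self.find(x), self.find(y)
            if xr == yr:
                return                      # already in the same set
            if self.rank[xr] < self.rank[yr]:
                xr, yr = yr, xr             # attach the lower-rank root below
            self.parent[yr] = xr
            if self.rank[xr] == self.rank[yr]:
                self.rank[xr] += 1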

The second improvement, called path compression, is a way of flattening the structure of the tree whenever Find is used on it. The idea is that each node visited on the way to a root node may as well be attached directly to the root node; they all share the same representative. To effect this, as Find recursively traverses up the tree, it changes each node's parent reference to point to the root that it found. The resulting tree is much flatter, speeding up future operations not only on these elements but on those referencing them, directly or indirectly. Here is the improved Find:

    function Find(x)
        if x.parent != x
            x.parent := Find(x.parent)
        return x.parent

These two techniques complement each other; applied together, the amortized time per operation is only O(α(n)), where α(n) is the inverse of the function n = f(x) = A(x, x), and A is the extremely fast-growing Ackermann function. Since α(n) is the inverse of this function, α(n) is less than 5 for all remotely practical values of n. Thus, the amortized running time per operation is effectively a small constant.

In fact, this is asymptotically optimal: Fredman and Saks showed in 1989 that Ω(α(n)) words must be accessed by any disjoint-set data structure per operation on average.[3]

4.5.3 Applications

Disjoint-set data structures model the partitioning of a set, for example to keep track of the connected components of an undirected graph. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle. The Union–Find algorithm is used in high-performance implementations of unification.[4]

This data structure is used by the Boost Graph Library to implement its Incremental Connected Components functionality. It is also used for implementing Kruskal's algorithm to find the minimum spanning tree of a graph.

Note that the implementation as disjoint-set forests doesn't allow deletion of edges—even without path compression or the rank heuristic.

4.5.4 History

While the ideas used in disjoint-set forests have long been familiar, Robert Tarjan was the first to prove the upper bound (and a restricted version of the lower bound) in terms of the inverse Ackermann function, in 1975.[5] Until this time the best bound on the time per operation, proven by Hopcroft and Ullman,[6] was O(log* n), the iterated logarithm of n, another slowly growing function (but not quite as slow as the inverse Ackermann function).

Tarjan and Van Leeuwen also developed one-pass Find algorithms that are more efficient in practice while retaining the same worst-case complexity.[7]

In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a persistent version of the disjoint-set forest data structure, allowing previous versions of the structure to be efficiently retained, and formalized its correctness using the proof assistant Coq.[8] However, the implementation is only asymptotically efficient if used ephemerally or if the same version of the structure is repeatedly used with limited backtracking.

4.5.5 See also

• Partition refinement, a different data structure for maintaining disjoint sets, with updates that split sets apart rather than merging them together

• Dynamic connectivity

4.5.6 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), “Chapter 21: Data structures for Disjoint Sets”, Introduction to Algorithms (Second ed.), MIT Press, pp. 498–524, ISBN 0-262-03293-7.

[2] Galler, Bernard A.; Fischer, Michael J. (May 1964), “An improved equivalence algorithm”, Communications of the ACM, 7 (5): 301–303, doi:10.1145/364099.364331. The paper originating disjoint-set forests.

[3] Fredman, M.; Saks, M. (May 1989), “The cell probe complexity of dynamic data structures”, Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing: 345–354. Theorem 5: Any CPROBE(log n) implementation of the set union problem requires Ω(m α(m, n)) time to execute m Find's and n−1 Union's, beginning with n singleton sets.

[4] Knight, Kevin (1989), “Unification: A multidisciplinary survey”, ACM Computing Surveys, 21: 93–124, doi:10.1145/62029.62030.

[5] Tarjan, Robert Endre (1975), “Efficiency of a Good But Not Linear Set Union Algorithm”, Journal of the ACM, 22 (2): 215–225, doi:10.1145/321879.321884.

[6] Hopcroft, J. E.; Ullman, J. D. (1973), “Set Merging Algorithms”, SIAM Journal on Computing, 2 (4): 294–303, doi:10.1137/0202024.

[7] Tarjan, Robert E.; van Leeuwen, Jan (1984), “Worst-case analysis of set union algorithms”, Journal of the ACM, 31 (2): 245–281, doi:10.1145/62.2160.

[8] Conchon, Sylvain; Filliâtre, Jean-Christophe (October 2007), “A Persistent Union-Find Data Structure”, ACM SIGPLAN Workshop on ML, Freiburg, Germany.

4.5.7 External links

• C++ implementation, part of the Boost C++ libraries

• A Java implementation with an application to color image segmentation, Statistical Region Merging (SRM), IEEE Trans. Pattern Anal. Mach. Intell. 26(11): 1452–1458 (2004)

• Java applet: A Graphical Union–Find Implementation, by Rory L. P. McGuire

• Wait-free Parallel Algorithms for the Union–Find Problem, a 1994 paper by Richard J. Anderson and Heather Woll describing a parallelized version of Union–Find that never needs to block

• Python implementation

• Visual explanation and C# code

4.6 Partition refinement

In the design of algorithms, partition refinement is a technique for representing a partition of a set as a data structure that allows the partition to be refined by splitting its sets into a larger number of smaller sets. In that sense it is dual to the union–find data structure, which also maintains a partition into disjoint sets but in which the operations merge pairs of sets together.

Partition refinement forms a key component of several efficient algorithms on graphs and finite automata, including DFA minimization, the Coffman–Graham algorithm for parallel scheduling, and lexicographic breadth-first search of graphs.[1][2][3]

4.6.1 Data structure

A partition refinement algorithm maintains a family of disjoint sets Si. At the start of the algorithm, this family contains a single set of all the elements in the data structure. At each step of the algorithm, a set X is presented to the algorithm, and each set Si in the family that contains members of X is split into two sets, the intersection Si ∩ X and the difference Si \ X.

Such an algorithm may be implemented efficiently by maintaining data structures representing the following information:[4][5]

• The ordered sequence of the sets Si in the family, in a form such as a doubly linked list that allows new sets to be inserted into the middle of the sequence.

• Associated with each set Si, a collection of its elements, in a form such as a doubly linked list or array data structure that allows for rapid deletion of individual elements from the collection. Alternatively, this component of the data structure may be represented by storing all of the elements of all of the sets in a single array, sorted by the identity of the set they belong to, and by representing the collection of elements in any set Si by its starting and ending positions in this array.

• Associated with each element, the set it belongs to.

To perform a refinement operation, the algorithm loops through the elements of the given set X. For each such element x, it finds the set Si that contains x, and checks whether a second set for Si ∩ X has already been started. If not, it creates the second set and adds Si to a list L of the sets that are split by the operation. Then, regardless of whether a new set was formed, the algorithm removes x from Si and adds it to Si ∩ X. In the representation in which all elements are stored in a single array, moving x from one set to another may be performed by swapping x with the final element of Si and then decrementing the end index of Si and the start index of the new set. Finally, after all elements of X have been processed in this way, the algorithm loops through L, separating each current set Si from the second set that has been split from it, and reports both of these sets as being newly formed by the refinement operation.

The time to perform a single refinement operation in this way is O(|X|), independent of the number of elements in the family of sets and also independent of the total number of sets in the data structure. Thus, the time for a sequence of refinements is proportional to the total size of the sets given to the algorithm in each refinement step.
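The refinement operation can be sketched compactly in Python. This illustrative version uses dictionaries and hash sets instead of the doubly linked lists or index-range representation described above (so its O(|X|) bound holds in the expected sense of hashing); refine returns the (old, new) pairs of set identifiers produced by the split:

    class PartitionRefinement:
        def __init__(self, elements):
            self.sets = {0: set(elements)}     # initially one set holds everything
            self.set_of = {x: 0 for x in elements}
            self.next_id = 1

        def refine(self, x_set):
            split = {}                         # id of S_i -> id of S_i ∩ X
            for x in x_set:
                i = self.set_of.get(x)
                if i is None:
                    continue                   # x is not in the structure
                if i not in split:             # start the second set for S_i
                    split[i] = self.next_id
                    self.sets[self.next_id] = set()
                    self.next_id += 1
                self.sets[i].remove(x)         # move x from S_i to S_i ∩ X
                self.sets[split[i]].add(x)
                self.set_of[x] = split[i]
            result = []
            for i, j in split.items():
                if not self.sets[i]:           # S_i ⊆ X: nothing was split,
                    del self.sets[i]           # S_i ∩ X is just S_i renamed
                else:
                    result.append((i, j))
            return result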

4.6.2 Applications

An early application of partition refinement was in an algorithm by Hopcroft (1971) for DFA minimization. In this problem, one is given as input a deterministic finite automaton, and must find an equivalent automaton with as few states as possible. Hopcroft's algorithm maintains a partition of the states of the input automaton into subsets, with the property that any two states in different subsets must be mapped to different states of the output automaton. Initially, there are two subsets, one containing all the accepting states of the automaton and one containing the remaining states. At each step one of the subsets Si and one of the input symbols x of the automaton are chosen, and the subsets of states are refined into states for which a transition labeled x would lead to Si, and states for which an x-transition would lead somewhere else. When a set Si that has already been chosen is split by a refinement, only one of the two resulting sets (the smaller of the two) needs to be chosen again; in this way, each state participates in the sets X for O(s log n) refinement steps and the overall algorithm takes time O(ns log n), where n is the number of initial states and s is the size of the alphabet.[6]

Partition refinement was applied by Sethi (1976) in an efficient implementation of the Coffman–Graham algorithm for parallel scheduling. Sethi showed that it could be used to construct a lexicographically ordered topological sort of a given directed acyclic graph in linear time; this lexicographic topological ordering is one of the key steps of the Coffman–Graham algorithm. In this application, the elements of the disjoint sets are vertices of the input graph and the sets X used to refine the partition are sets of neighbors of vertices. Since the total number of neighbors of all vertices is just the number of edges in the graph, the algorithm takes time linear in the number of edges, its input size.[7]

Partition refinement also forms a key step in lexicographic breadth-first search, a graph search algorithm with applications in the recognition of chordal graphs and several other important classes of graphs. Again, the disjoint set elements are vertices and the sets X represent sets of neighbors, so the algorithm takes linear time.[8][9]

4.6.3 See also

• Refinement (sigma algebra)

4.6.4 References

[1] Paige, Robert; Tarjan, Robert E. (1987), “Three partition refinement algorithms”, SIAM Journal on Computing, 16 (6): 973–989, doi:10.1137/0216062, MR 917035.

[2] Habib, Michel; Paul, Christophe; Viennot, Laurent (1999), “Partition refinement techniques: an inter- esting algorithmic tool kit”, International Journal of Foundations of Computer Science, 10 (2): 147–170, doi:10.1142/S0129054199000125, MR 1759929.

[3] Habib, Michel; Paul, Christophe; Viennot, Laurent (1998), “A synthesis on partition refinement: a use- ful routine for strings, graphs, Boolean matrices and automata”, STACS 98 (Paris, 1998), Lecture Notes in Computer Science, 1373, Springer-Verlag, pp. 25–38, doi:10.1007/BFb0028546, MR 1650757.

[4] Valmari, Antti; Lehtinen, Petri (2008). “Efficient minimization of DFAs with partial transition functions”. In Albers, Susanne; Weil, Pascal. 25th International Symposium on Theoretical Aspects of Computer Science (STACS 2008). Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl: Leibniz-Zentrum fuer Informatik. pp. 645–656. doi:10.4230/LIPIcs.STACS.2008.1328. ISBN 978-3-939897-06-4. ISSN 1868-8969.

[5] Knuutila, Timo (2001). “Re-describing an algorithm by Hopcroft”. Theoretical Computer Science. 250 (1–2): 333–363. doi:10.1016/S0304-3975(99)00150-4. ISSN 0304-3975.

[6] Hopcroft, John (1971), “An n log n algorithm for minimizing states in a finite automaton”, Theory of machines and computations (Proc. Internat. Sympos., Technion, Haifa, 1971), New York: Academic Press, pp. 189–196, MR 0403320.

[7] Sethi, Ravi (1976), “Scheduling graphs on two processors”, SIAM Journal on Computing, 5 (1): 73–82, doi:10.1137/0205005, MR 0398156.

[8] Rose, D. J.; Tarjan, R. E.; Lueker, G. S. (1976), “Algorithmic aspects of vertex elimination on graphs”, SIAM Journal on Computing, 5 (2): 266–283, doi:10.1137/0205021.

[9] Corneil, Derek G. (2004), “Lexicographic breadth first search – a survey”, Graph-Theoretic Methods in Computer Science, Lecture Notes in Computer Science, 3353, Springer-Verlag, pp. 1–19.

Chapter 5

Priority queues

5.1 Priority queue

In computer science, a priority queue is an abstract data type which is like a regular queue or stack data structure, but where additionally each element has a “priority” associated with it. In a priority queue, an element with high priority is served before an element with low priority. If two elements have the same priority, they are served according to their order in the queue.

While priority queues are often implemented with heaps, they are conceptually distinct from heaps. A priority queue is an abstract concept like “a list” or “a map”; just as a list can be implemented with a linked list or an array, a priority queue can be implemented with a heap or a variety of other methods such as an unordered array.

5.1.1 Operations

A priority queue must at least support the following operations:

• insert_with_priority: add an element to the queue with an associated priority.

• pull_highest_priority_element: remove the element from the queue that has the highest priority, and return it. This is also known as "pop_element(Off)", "get_maximum_element" or "get_front(most)_element". Some conventions reverse the order of priorities, considering lower values to be higher priority, so this may also be known as "get_minimum_element", and is often referred to as "get-min" in the literature. This may instead be specified as separate "peek_at_highest_priority_element" and "delete_element" functions, which can be combined to produce "pull_highest_priority_element".

In addition, peek (in this context often called find-max or find-min), which returns the highest-priority element but does not modify the queue, is very frequently implemented, and nearly always executes in O(1) time. This operation and its O(1) performance is crucial to many applications of priority queues.

More advanced implementations may support more complicated operations, such as pull_lowest_priority_element, inspecting the first few highest- or lowest-priority elements, clearing the queue, clearing subsets of the queue, performing a batch insert, merging two or more queues into one, incrementing the priority of any element, etc.

5.1.2 Similarity to queues

One can imagine a priority queue as a modified queue, but when one would get the next element off the queue, the highest-priority element is retrieved first.

Stacks and queues may be modeled as particular kinds of priority queues. As a reminder, here is how stacks and queues behave:

• stack – elements are pulled in last-in first-out order (e.g., a stack of papers)

• queue – elements are pulled in first-in first-out order (e.g., a line in a cafeteria)

In a stack, the priority of each inserted element is monotonically increasing; thus, the last element inserted is always the first retrieved. In a queue, the priority of each inserted element is monotonically decreasing; thus, the first element inserted is always the first retrieved.

5.1.3 Implementation

Naive implementations

There is a variety of simple, usually inefficient, ways to implement a priority queue. They provide an analogy to help one understand what a priority queue is. For instance, one can keep all the elements in an unsorted list. Whenever the highest-priority element is requested, search through all elements for the one with the highest priority. (In big O notation: O(1) insertion time, O(n) pull time due to search.)

Usual implementation

To improve performance, priority queues typically use a heap as their backbone, giving O(log n) performance for inserts and removals, and O(n) to build initially. Variants of the basic heap data structure such as pairing heaps or Fibonacci heaps can provide better bounds for some operations.[1]

Alternatively, when a self-balancing binary search tree is used, insertion and removal also take O(log n) time, although building trees from existing sequences of elements takes O(n log n) time; this is typical where one might already have access to these data structures, such as with third-party or standard libraries.

From a computational-complexity standpoint, priority queues are congruent to sorting algorithms. See the next section for how efficient sorting algorithms can create efficient priority queues.

Specialized heaps

There are several specialized heap data structures that either supply additional operations or outperform heap-based implementations for specific types of keys, specifically integer keys.

• When the set of keys is {1, 2, ..., C}, and only insert, find-min and extract-min are needed, a bucket queue can be constructed as an array of C linked lists plus a pointer top, initially C. Inserting an item with key k appends the item to the k'th list, and updates top ← min(top, k), both in constant time. Extract-min deletes and returns one item from the list with index top, then increments top if needed until it again points to a non-empty list; this takes O(C) time in the worst case. These queues are useful for sorting the vertices of a graph by their degree.[2]:374

• For the set of keys {1, 2, ..., C}, a van Emde Boas tree would support the minimum, maximum, insert, delete, search, extract-min, extract-max, predecessor and successor operations in O(log log C) time, but has a space cost for small queues of about O(2^(m/2)), where m is the number of bits in the priority value.[3]

• The Fusion tree algorithm by Fredman and Willard implements the minimum operation in O(1) time and insert and extract-min operations in O(√(log n)) time; however it is stated by the author that, “Our algorithms have theoretical interest only; the constant factors involved in the execution times preclude practicality.”[4]

For applications that do many "peek" operations for every “extract-min” operation, the time complexity for peek actions can be reduced to O(1) in all tree and heap implementations by caching the highest-priority element after every insertion and removal. For insertion, this adds at most a constant cost, since the newly inserted element is compared only to the previously cached minimum element. For deletion, this at most adds an additional “peek” cost, which is typically cheaper than the deletion cost, so overall time complexity is not significantly impacted.

Monotone priority queues are specialized queues that are optimized for the case where no item is ever inserted that has a lower priority (in the case of a min-heap) than any item previously extracted. This restriction is met by several practical applications of priority queues.

Summary of running times

In the following time complexities[5] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[8]

[2] Amortized time.

[3] Bounded by Ω(log log n), O(2^(2√(log log n))).[11][12]

[4] n is the size of the larger heap.

5.1.4 Equivalence of priority queues and sorting algorithms

Using a priority queue to sort

The semantics of priority queues naturally suggest a sorting method: insert all the elements to be sorted into a priority queue, and sequentially remove them; they will come out in sorted order. This is actually the procedure used by several sorting algorithms, once the layer of abstraction provided by the priority queue is removed. This sorting method is equivalent to the following sorting algorithms: heapsort (when the priority queue is a binary heap), smoothsort (a Leonardo heap), selection sort (an unordered array), insertion sort (an ordered array), and tree sort (a self-balancing binary search tree).
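This sorting direction is short enough to state in code. The following is a minimal sketch (the function name pq_sort is ours) using Python's heapq module as the priority queue, which makes it a form of heapsort:

    # Minimal sketch of "using a priority queue to sort": insert everything
    # into a binary-heap priority queue, then pull elements back out in
    # sorted order. The function name pq_sort is ours.
    import heapq

    def pq_sort(items):
        heap = []
        for x in items:
            heapq.heappush(heap, x)   # insert_with_priority, O(log n) each
        return [heapq.heappop(heap) for _ in range(len(heap))]  # pull-min

    print(pq_sort([5, 1, 4, 2, 3]))  # -> [1, 2, 3, 4, 5]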

Using a sorting algorithm to make a priority queue

A sorting algorithm can also be used to implement a priority queue. Specifically, Thorup says:[13]

We present a general deterministic linear space reduction from priority queues to sorting implying that if we can sort up to n keys in S(n) time per key, then there is a priority queue supporting delete and insert in O(S(n)) time and find-min in constant time.

That is, if there is a sorting algorithm which can sort in O(S) time per key, where S is some function of n and word size,[14] then one can use the given procedure to create a priority queue where pulling the highest-priority element is O(1) time, and inserting new elements (and deleting elements) is O(S) time. For example, if one has an O(n log log n) sort algorithm, one can create a priority queue with O(1) pulling and O(log log n) insertion.

5.1.5 Libraries

A priority queue is often considered to be a “container data structure”.

The Standard Template Library (STL), and the C++ 1998 standard, specifies priority_queue as one of the STL container adaptor class templates. However, it does not specify how two elements with the same priority should be served, and indeed, common implementations will not return them according to their order in the queue. It implements a max-priority-queue, and has three parameters: a comparison object for sorting such as a function object (defaults to less if unspecified), the underlying container for storing the data structures (defaults to std::vector), and two iterators to the beginning and end of a sequence. Unlike actual STL containers, it does not allow iteration of its elements (it strictly adheres to its abstract data type definition). STL also has utility functions for manipulating another random-access container as a binary max-heap. The Boost C++ libraries also have an implementation in the library heap.

Python's heapq module implements a binary min-heap on top of a list.

Java's library contains a PriorityQueue class, which implements a min-priority-queue.

Go's library contains a container/heap module, which implements a min-heap on top of any compatible data structure.

The Standard PHP Library extension contains the class SplPriorityQueue.

Apple's Core Foundation framework contains a CFBinaryHeap structure, which implements a min-heap.
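As a brief, self-contained illustration of one of these libraries, here is Python's heapq in use (the (priority, item) pairing is a common convention, not part of the module itself):

    # Brief usage example of Python's heapq module; storing (priority, item)
    # pairs is a common convention, not part of the module.
    import heapq

    queue = []
    heapq.heappush(queue, (2, "medium"))
    heapq.heappush(queue, (1, "urgent"))
    heapq.heappush(queue, (3, "low"))

    print(queue[0])              # peek at the minimum: (1, 'urgent')
    print(heapq.heappop(queue))  # extract-min: (1, 'urgent')
    print(heapq.heappop(queue))  # (2, 'medium')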

5.1.6 Applications

Bandwidth management

Priority queuing can be used to manage limited resources such as bandwidth on a transmission line from a network router. In the event of outgoing traffic queuing due to insufficient bandwidth, all other queues can be halted to send the traffic from the highest-priority queue upon arrival. This ensures that the prioritized traffic (such as real-time traffic, e.g. an RTP stream of a VoIP connection) is forwarded with the least delay and the least likelihood of being rejected due to a queue reaching its maximum capacity. All other traffic can be handled when the highest-priority queue is empty. Another approach used is to send disproportionately more traffic from higher-priority queues.

Many modern protocols for local area networks also include the concept of priority queues at the media access control (MAC) sub-layer to ensure that high-priority applications (such as VoIP or IPTV) experience lower latency than other applications which can be served with best-effort service. Examples include IEEE 802.11e (an amendment to IEEE 802.11 which provides quality of service) and ITU-T G.hn (a standard for high-speed local area networks using existing home wiring (power lines, phone lines and coaxial cables)).

Usually a limitation (policer) is set to limit the bandwidth that traffic from the highest-priority queue can take, in order to prevent high-priority packets from choking off all other traffic. This limit is usually never reached due to high-level control instances such as the Cisco Callmanager, which can be programmed to inhibit calls which would exceed the programmed bandwidth limit.

Discrete event simulation

Another use of a priority queue is to manage the events in a discrete event simulation. The events are added to the queue with their simulation time used as the priority. The execution of the simulation proceeds by repeatedly pulling the top of the queue and executing the event thereon.

See also: Scheduling (computing), queueing theory

Huffman coding

Huffman coding requires one to repeatedly obtain the two lowest-frequency trees. A priority queue is one method of doing this.

Dijkstra's algorithm

When the graph is stored in the form of an adjacency list or matrix, a priority queue can be used to extract the minimum efficiently when implementing Dijkstra's algorithm, although one also needs the ability to alter the priority of a particular vertex in the priority queue efficiently.
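A common workaround for the priority-update requirement just mentioned is to push a fresh entry instead of decreasing a key, and to skip stale entries on extraction. A minimal sketch of Dijkstra's algorithm in that style (not from the article), using Python's heapq:

    # Sketch of Dijkstra's algorithm with a binary-heap priority queue.
    # Instead of decrease-key, it pushes duplicate entries and skips the
    # stale ones on extraction ("lazy deletion").
    import heapq

    def dijkstra(graph, source):
        """graph: {vertex: [(neighbor, weight), ...]}; returns distances."""
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue             # stale entry; a shorter path was found
            for v, w in graph[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return dist

    g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
    print(dijkstra(g, "a"))  # -> {'a': 0, 'b': 1, 'c': 3}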

[9] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. Prim’s algorithm for minimum spanning tree (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485. Using min heap priority queue in Prim’s algorithm to find the minimum spanning tree of a connected and [10] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of undirected graph, one can achieve a good running time. the 44th symposium on Theory of Computing - STOC This min heap priority queue uses the min heap data '12. p. 1177. doi:10.1145/2213977.2214082. ISBN structure which supports operations such as insert, min- 9781450312455. imum, extract-min, decrease-key.[15] In this implementa- tion, the weight of the edges is used to decide the priority [11] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). of the vertices. Lower the weight, higher the priority and “Fibonacci heaps and their uses in improved network higher the weight, lower the priority.[16] optimization algorithms” (PDF). Journal of the Asso- ciation for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874. 5.1.7 See also [12] Pettie, Seth (2005). “Towards a Final Analysis of Pairing • Batch queue Heaps” (PDF). Max Planck Institut für Informatik.

• Command queue

• Job scheduler

5.1.8 References

[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 20: Fibonacci Heaps, pp. 476–497. Third edition p. 518.

[2] Skiena, Steven (2010). The Algorithm Design Manual (2nd ed.). Springer Science+Business Media. ISBN 1-849-96720-2.

[3] P. van Emde Boas. Preserving order in a forest in less than logarithmic time. In Proceedings of the 16th Annual Symposium on Foundations of Computer Science, pages 75–84. IEEE Computer Society, 1975.

[4] Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. Journal of Computer and System Sciences, 48(3):533–551, 1994.

[5] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.

[6] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5.

[7] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58.

[8] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.

[9] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.

[10] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing – STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.

[11] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.

[12] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.

[13] Thorup, Mikkel (2007). “Equivalence between priority queues and sorting”. Journal of the ACM. 54 (6). doi:10.1145/1314690.1314692.

[14] http://courses.csail.mit.edu/6.851/spring07/scribe/lec17.pdf

[15] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein (2009). Introduction to Algorithms (3rd ed.). MIT Press. p. 634. ISBN 978-81-203-4007-7. “In order to implement Prim's algorithm efficiently, we need a fast way to select a new edge to add to the tree formed by the edges in A.”

[16] “Prim's Algorithm”. GeeksforGeeks. Retrieved 12 September 2014.

5.1.9 Further reading

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 6.5: Priority queues, pp. 138–142.

5.1.10 External links

• C++ reference for std::priority_queue

• Descriptions by Lee Killough

• PQlib – Open source Priority Queue library for C

• libpqueue is a generic priority queue (heap) implementation (in C) used by the Apache HTTP Server project.

• Survey of known priority queue structures by Stefan Xenos

• UC Berkeley – Computer Science 61B – Lecture 24: Priority Queues (video) – introduction to priority queues using binary heap

5.2 Bucket queue

In the design and analysis of data structures, a bucket queue[1] (also called a bucket priority queue[2] or bounded-height priority queue[3]) is a priority queue for prioritizing elements whose priorities are small integers. It has the form of an array of buckets: an array data structure, indexed by the priorities, whose cells contain buckets of items with the same priority as each other.

The bucket queue is the priority-queue analogue of pigeonhole sort (also called bucket sort), a sorting algorithm that places elements into buckets indexed by their priorities and then concatenates the buckets. Using a bucket queue as the priority queue in a selection sort gives a form of the pigeonhole sort algorithm.

Applications of the bucket queue include computation of the degeneracy of a graph as well as fast algorithms for shortest paths and widest paths for graphs with weights that are small integers or are already sorted. Its first use[2] was in a shortest path algorithm by Dial (1969).[4]

5.2.1 Basic data structure

This structure can handle the insertions and deletions of elements with integer priorities in the range from 0 to some known bound C, as well as operations that find the element with minimum (or maximum) priority. It consists of an array A of container data structures, where array cell A[p] stores the collection of elements with priority p. It can handle the following operations:

• To insert an element x with priority p, add x to the container at A[p].

• To remove an element x with priority p, remove x from the container at A[p].

• To find an element with the minimum priority, perform a sequential search to find the first non-empty container, and then choose an arbitrary element from this container.

In this way, insertions and deletions take constant time, while finding the minimum-priority element takes time O(C).[1][3]

5.2.2 Optimizations

As an optimization, the data structure can also maintain an index L that lower-bounds the minimum priority of an element. When inserting a new element, L should be updated to the minimum of its old value and the new element's priority. When searching for the minimum-priority element, the search can start at L instead of at zero, and after the search L should be left equal to the priority that was found in the search.[3] In this way the time for a search is reduced to the difference between the previous lower bound and its next value; this difference could be significantly smaller than C. For applications of monotone priority queues such as Dijkstra's algorithm, in which the minimum priorities form a monotonic sequence, the sum of these differences is at most C, so the total time for a sequence of n operations is O(n + C), rather than the slower O(nC) time bound that would result without this optimization.

Another optimization (already given by Dial 1969) can be used to save space when the priorities are monotonic and, at any point in time, fall within a range of r values rather than extending over the whole range from 0 to C. In this case, one can index the array by the priorities modulo r rather than by their actual values. The search for the minimum-priority element should always begin at the previous minimum, to avoid priorities that are higher than the minimum but have lower moduli.[1]

5.2.3 Applications

A bucket queue can be used to maintain the vertices of an undirected graph, prioritized by their degrees, and repeatedly find and remove the vertex of minimum degree.[3] This greedy algorithm can be used to calculate the degeneracy of a given graph. It takes linear time, with or without the optimization that maintains a lower bound on the minimum priority, because each vertex is found in time proportional to its degree and the sum of all vertex degrees is linear in the number of edges of the graph.[5]
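A minimal Python sketch of the basic structure, including the lower-bound index L from the optimizations section (class, container choice and method names are ours):

    # Minimal sketch of a bucket queue for integer priorities in 0..C,
    # with the lower-bound index L described above.
    class BucketQueue:
        def __init__(self, C):
            self.buckets = [set() for _ in range(C + 1)]  # A[p]: items of priority p
            self.L = C + 1   # lower bound on the minimum occupied priority

        def insert(self, x, p):
            self.buckets[p].add(x)
            self.L = min(self.L, p)          # constant time

        def remove(self, x, p):
            self.buckets[p].discard(x)       # constant expected time

        def extract_min(self):
            # Scan upward from L; over a monotone sequence of operations
            # the total scanning cost is O(C).
            while self.L < len(self.buckets) and not self.buckets[self.L]:
                self.L += 1
            if self.L == len(self.buckets):
                raise IndexError("queue is empty")
            return self.buckets[self.L].pop()

    q = BucketQueue(C=9)
    q.insert("a", 5)
    q.insert("b", 2)
    print(q.extract_min())  # -> 'b' (priority 2)
    print(q.extract_min())  # -> 'a' (priority 5)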

In Dijkstra's algorithm for shortest paths in positively-weighted directed graphs, a bucket queue can be used to obtain a time bound of O(n + m + dc), where n is the number of vertices, m is the number of edges, d is the diameter of the network, and c is the maximum (integer) link cost.[6] In this algorithm, the priorities will only span a range of width c + 1, so the modular optimization can be used to reduce the space to O(n + c).[1] A variant of the same algorithm can be used for the widest path problem, and (in combination with methods for quickly partitioning non-integer edge weights) leads to near-linear-time solutions to the single-source single-destination version of this problem.[7]

5.2.4 References

[1] Mehlhorn, Kurt; Sanders, Peter (2008), “10.5.1 Bucket Queues”, Algorithms and Data Structures: The Basic Toolbox, Springer, p. 201, ISBN 9783540779773.

[2] Edelkamp, Stefan; Schroedl, Stefan (2011), “3.1.1 Bucket Data Structures”, Heuristic Search: Theory and Applications, Elsevier, pp. 90–92, ISBN 9780080919737. See also p. 157 for the history and naming of this structure.

[3] Skiena, Steven S. (1998), The Algorithm Design Manual, Springer, p. 181, ISBN 9780387948607.

[4] Dial, Robert B. (1969), “Algorithm 360: Shortest-path forest with topological ordering [H]”, Communications of the ACM, 12 (11): 632–633, doi:10.1145/363269.363610.

[5] Matula, D. W.; Beck, L. L. (1983), “Smallest-last ordering and clustering and graph coloring algorithms”, Journal of the ACM, 30 (3): 417–427, doi:10.1145/2402.322385, MR 0709826.

[6] Varghese, George (2005), Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices, Morgan Kaufmann, ISBN 9780120884773.

[7] Gabow, Harold N.; Tarjan, Robert E. (1988), “Algorithms for two bottleneck optimization problems”, Journal of Algorithms, 9 (3): 411–417, doi:10.1016/0196-6774(88)90031-4, MR 955149.

5.3 Heap (data structure)

This article is about the programming data structure. For the dynamic memory area, see Dynamic memory allocation.

Example of a complete binary max-heap with node keys being integers from 1 to 100:

             100
            /    \
          19      36
         /  \    /  \
       17    3  25   1
       / \
      2   7

In computer science, a heap is a specialized tree-based data structure that satisfies the heap property: if A is a parent node of B then the key (the value) of node A is ordered with respect to the key of node B, with the same ordering applying across the heap. A heap can be classified further as either a “max heap” or a “min heap”. In a max heap, the keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node. In a min heap, the keys of parent nodes are less than or equal to those of the children and the lowest key is in the root node. Heaps are crucial in several efficient graph algorithms such as Dijkstra's algorithm, and in the sorting algorithm heapsort. A common implementation of a heap is the binary heap, in which the tree is a complete binary tree (see figure).

In a heap, the highest (or lowest) priority element is always stored at the root. A heap is not a sorted structure and can be regarded as partially ordered. As visible from the heap diagram, there is no particular relationship among nodes on any given level, even among the siblings. When a heap is a complete binary tree, it has a smallest possible height: a heap with N nodes always has O(log N) height. A heap is a useful data structure when you need to remove the object with the highest (or lowest) priority.

Note that, as shown in the graphic, there is no implied ordering between siblings or cousins and no implied sequence for an in-order traversal (as there would be in, e.g., a binary search tree). The heap relation mentioned above applies only between nodes and their parents, grandparents, etc. The maximum number of children each node can have depends on the type of heap, but in many types it is at most two, which is known as a binary heap.

The heap is one maximally efficient implementation of an abstract data type called a priority queue, and in fact priority queues are often referred to as “heaps”, regardless of how they may be implemented.

A heap data structure should not be confused with the heap, which is a common name for the pool of memory from which dynamically allocated memory is allocated. The term was originally used only for the data structure.

5.3.1 Operations

The common operations involving heaps are:

Basic

• find-max or find-min: find the maximum item of a max-heap or a minimum item of a min-heap (a.k.a. peek)

• insert: adding a new key to the heap (a.k.a., push[1])

• extract-min [or extract-max]: returns the node of minimum value from a min heap [or maximum value from a max heap] after removing it from the heap (a.k.a., pop[2])

• delete-max or delete-min: removing the root node of a max- or min-heap, respectively

• replace: pop root and push a new key. More efficient than pop followed by push, since it only needs to balance once, not twice, and appropriate for fixed-size heaps.[3]

Creation

• create-heap: create an empty heap

• heapify: create a heap out of a given array of elements

• merge (union): joining two heaps to form a valid new heap containing all the elements of both, preserving the original heaps.

• meld: joining two heaps to form a valid new heap containing all the elements of both, destroying the original heaps.

Inspection

• size: return the number of items in the heap.

• is-empty: return true if the heap is empty, false otherwise.

Internal

• increase-key or decrease-key: updating a key within a max- or min-heap, respectively

• delete: delete an arbitrary node (followed by moving the last node and sifting to maintain the heap)

• shift-up: move a node up in the tree, as long as needed; used to restore the heap condition after insertion. Called “sift” because the node moves up the tree until it reaches the correct level, as in a sieve.

• shift-down: move a node down in the tree, similar to shift-up; used to restore the heap condition after deletion or replacement.

5.3.2 Implementation

Heaps are usually implemented in an array (fixed size or dynamic array), and do not require pointers between elements. After an element is inserted into or deleted from a heap, the heap property may be violated and the heap must be balanced by internal operations.

Full and almost full binary heaps may be represented in a very space-efficient way (as an implicit data structure) using an array alone. The first (or last) element will contain the root. The next two elements of the array contain its children. The next four contain the four children of the two child nodes, etc. Thus the children of the node at position n would be at positions 2n and 2n + 1 in a one-based array, or 2n + 1 and 2n + 2 in a zero-based array. This allows moving up or down the tree by doing simple index computations. Balancing a heap is done by shift-up or shift-down operations (swapping elements which are out of order). As we can build a heap from an array without requiring extra memory (for the nodes, for example), heapsort can be used to sort an array in-place.

Different types of heaps implement the operations in different ways, but notably, insertion is often done by adding the new element at the end of the heap in the first available free space. This will generally violate the heap property, and so the elements are then shifted up until the heap property has been reestablished. Similarly, deleting the root is done by removing the root and then putting the last element in the root and shifting down to rebalance. Thus replacing is done by deleting the root and putting the new element in the root and shifting down, avoiding a shift-up step compared to pop (shift down of last element) followed by push (shift up of new element).

Construction of a binary (or d-ary) heap out of a given array of elements may be performed in linear time using the classic Floyd algorithm, with the worst-case number of comparisons equal to 2N − 2s2(N) − e2(N) (for a binary heap), where s2(N) is the sum of all digits of the binary representation of N and e2(N) is the exponent of 2 in the prime factorization of N.[4] This is faster than a sequence of consecutive insertions into an originally empty heap, which is log-linear (or linearithmic).[lower-alpha 1]
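A short sketch of the zero-based array layout and of the replace operation described above, in Python (helper names are ours; this is not a complete heap implementation):

    # Sketch of the zero-based implicit array layout and the "replace"
    # operation (pop root and push a new key with a single shift-down).
    def parent(i): return (i - 1) // 2
    def left(i):   return 2 * i + 1
    def right(i):  return 2 * i + 2

    def shift_down(heap, i):
        """Restore the min-heap property below index i by repeated swaps."""
        n = len(heap)
        while True:
            smallest = i
            for child in (left(i), right(i)):
                if child < n and heap[child] < heap[smallest]:
                    smallest = child
            if smallest == i:
                return
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest

    def replace(heap, key):
        """Return the minimum and insert key, balancing only once."""
        top, heap[0] = heap[0], key
        shift_down(heap, 0)
        return top

    h = [1, 3, 2, 7, 4]          # a valid min-heap
    print(replace(h, 5), h)      # -> 1 [2, 3, 5, 7, 4]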

5.3.3 Variants

• 2–3 heap
• B-heap
• Beap
• Binary heap
• Binomial heap
• Brodal queue
• d-ary heap
• Fibonacci heap
• Leftist heap
• Pairing heap
• Skew heap
• Soft heap
• Weak heap
• Leaf heap
• Radix heap
• Randomized meldable heap
• Ternary heap
• Treap

5.3.4 Comparison of theoretic bounds for variants

In the following time complexities[5] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[1] Each insertion takes O(log k) in the existing size k of the heap, thus the total is ∑_{k=1}^{n} O(log k). Since log(n/2) = (log n) − 1, a constant factor (half) of these insertions are within a constant factor of the maximum, so asymptotically we can assume k = n; formally the time is nO(log n) − O(n) = O(n log n). This can also be readily seen from Stirling's approximation.

[2] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[8]

[3] Amortized time.

[4] Bounded by Ω(log log n), O(2^(2√(log log n))).[11][12]

[5] n is the size of the larger heap.

5.3.5 Applications

The heap data structure has many applications.

• Heapsort: One of the best sorting methods being in-place and with no quadratic worst-case scenarios.

• Selection algorithms: A heap allows access to the min or max element in constant time, and other selections (such as median or kth-element) can be done in sub-linear time on data that is in a heap.[13]

• Graph algorithms: By using heaps as internal traversal data structures, run time will be reduced by polynomial order. Examples of such problems are Prim's minimal-spanning-tree algorithm and Dijkstra's shortest-path algorithm.

• Priority queue: A priority queue is an abstract concept like “a list” or “a map”; just as a list can be implemented with a linked list or an array, a priority queue can be implemented with a heap or a variety of other methods.

• Order statistics: The heap data structure can be used to efficiently find the kth smallest (or largest) element in an array.
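As a small illustration of the selection and order-statistics applications above, Python's heapq module exposes heap-based selection directly:

    # Heap-based selection of the k smallest/largest elements, per the
    # order-statistics application above.
    import heapq

    data = [9, 4, 7, 1, 8, 2, 6]
    print(heapq.nsmallest(3, data))      # -> [1, 2, 4]
    print(heapq.nlargest(2, data))       # -> [9, 8]
    print(heapq.nsmallest(3, data)[-1])  # kth smallest, k = 3 -> 4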

5.3.6 Implementations

• The C++ Standard Library provides the make_heap, push_heap and pop_heap algorithms for heaps (usually implemented as binary heaps), which operate on arbitrary random-access iterators. It treats the iterators as a reference to an array, and uses the array-to-heap conversion. It also provides the container adaptor priority_queue, which wraps these facilities in a container-like class. However, there is no standard support for the decrease/increase-key operation.

• The Boost C++ libraries include a heaps library. Unlike the STL it supports decrease and increase operations, and supports additional types of heap: specifically, it supports d-ary, binomial, Fibonacci, pairing and skew heaps.

• There is a generic heap implementation for C and C++ with d-ary heap and B-heap support. It provides an STL-like API.

• The Java platform (since version 1.5) provides a binary heap implementation with the class java.util.PriorityQueue in the Java Collections Framework. This class implements by default a min-heap; to implement a max-heap, the programmer should write a custom comparator. There is no support for the decrease/increase-key operation.

• Python has a heapq module that implements a priority queue using a binary heap.

• PHP has both max-heap (SplMaxHeap) and min-heap (SplMinHeap) as of version 5.3 in the Standard PHP Library.

• Perl has implementations of binary, binomial, and Fibonacci heaps in the Heap distribution available on CPAN.

• The Go language contains a heap package with heap algorithms that operate on an arbitrary type that satisfies a given interface.

• Apple's Core Foundation library contains a CFBinaryHeap structure.

• Pharo has an implementation in the Collections-Sequenceable package along with a set of test cases. A heap is used in the implementation of the timer event loop.

• The Rust programming language has a binary max-heap implementation, BinaryHeap, in the collections module of its standard library.

5.3.7 See also

• Sorting algorithm

• Search data structure

• Stack (abstract data type)

• Queue (abstract data type)

• Tree (data structure)

• Treap, a form of binary search tree based on heap-ordered trees

5.3.8 References

[1] The Python Standard Library, 8.4. heapq — Heap queue algorithm, heapq.heappush

[2] The Python Standard Library, 8.4. heapq — Heap queue algorithm, heapq.heappop

[3] The Python Standard Library, 8.4. heapq — Heap queue algorithm, heapq.heapreplace

[4] Suchenek, Marek A. (2012), “Elementary Yet Precise Worst-Case Analysis of Floyd's Heap-Construction Program”, Fundamenta Informaticae, IOS Press, 120 (1): 75–92, doi:10.3233/FI-2012-751.

[5] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.

[6] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5.

[7] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58.

[8] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.

[9] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.

[10] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing – STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.

[11] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.

[12] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.

[13] Frederickson, Greg N. (1993), “An Optimal Algorithm for Selection in a Min-Heap”, Information and Computation (PDF), 104 (2), Academic Press, pp. 197–214, doi:10.1006/inco.1993.1030.

5.3.9 External links

• Heap at Wolfram MathWorld

• Explanation of how the basic heap algorithms work

5.4 Binary heap

[Figure: Example of a complete binary max heap, with keys 100; 19, 36; 17, 3, 25, 1; 2, 7 by level]

[Figure: Example of a complete binary min heap]

A binary heap is a heap data structure that takes the form of a binary tree. Binary heaps are a common way of implementing priority queues.[1]:162–163

A binary heap is defined as a binary tree with two additional constraints:[2]

• Shape property: a binary heap is a complete binary tree; that is, all levels of the tree, except possibly the last one (deepest), are fully filled, and, if the last level of the tree is not complete, the nodes of that level are filled from left to right.

• Heap property: the key stored in each node is either greater than or equal to or less than or equal to the keys in the node's children, according to some total order.

Heaps where the parent key is greater than or equal to (≥) the child keys are called max-heaps; those where it is less than or equal to (≤) are called min-heaps. Efficient (logarithmic time) algorithms are known for the two operations needed to implement a priority queue on a binary heap: inserting an element, and removing the smallest (largest) element from a min-heap (max-heap). Binary heaps are also commonly employed in the heapsort sorting algorithm, which is an in-place algorithm owing to the fact that binary heaps can be implemented as an implicit data structure, storing keys in an array and using their relative positions within that array to represent child-parent relationships.

5.4.1 Heap operations

Both the insert and remove operations modify the heap to conform to the shape property first, by adding or removing from the end of the heap. Then the heap property is restored by traversing up or down the heap. Both operations take O(log n) time.

Insert

To add an element to a heap we must perform an up-heap operation (also known as bubble-up, percolate-up, sift-up, trickle-up, heapify-up, or cascade-up), by following this algorithm:

1. Add the element to the bottom level of the heap.

2. Compare the added element with its parent; if they are in the correct order, stop.

3. If not, swap the element with its parent and return to the previous step.

The number of operations required depends only on the number of levels the new element must rise to satisfy the heap property, thus the insertion operation has a worst-case time complexity of O(log n) but an average-case complexity of O(1).[3]

As an example of binary heap insertion, say we have a max-heap

        11
       /  \
      5    8
     / \  /
    3  4 X

and we want to add the number 15 to the heap. We first place the 15 in the position marked by the X. However, the heap property is violated since 15 > 8, so we need to swap the 15 and the 8. So, we have the heap looking as follows after the first swap:

        11
       /  \
      5   15
     / \  /
    3  4 8

However the heap property is still violated since 15 > 11, so we need to swap again:

        15
       /  \
      5   11
     / \  /
    3  4 8

which is a valid max-heap. There is no need to check the left child after this final step: at the start, the max-heap was valid, meaning 11 > 5; if 15 > 11, and 11 > 5, then 15 > 5, because of the transitive relation.
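The up-heap algorithm above translates directly into code; a minimal zero-based Python sketch (the function name is ours):

    # Minimal sketch of the up-heap (sift-up) insertion algorithm above,
    # for a zero-based array-backed max-heap.
    def insert(heap, key):
        heap.append(key)                 # step 1: add at the bottom level
        i = len(heap) - 1
        while i > 0:
            p = (i - 1) // 2
            if heap[p] >= heap[i]:       # step 2: correct order, stop
                break
            heap[i], heap[p] = heap[p], heap[i]  # step 3: swap and repeat
            i = p

    h = [11, 5, 8, 3, 4]
    insert(h, 15)
    print(h)  # -> [15, 5, 11, 3, 4, 8], matching the worked example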

Extract

The procedure for deleting the root from the heap (effectively extracting the maximum element in a max-heap or the minimum element in a min-heap) and restoring the properties is called down-heap (also known as bubble-down, percolate-down, sift-down, trickle-down, heapify-down, cascade-down, and extract-min/max).

1. Replace the root of the heap with the last element on the last level.

2. Compare the new root with its children; if they are in the correct order, stop.

3. If not, swap the element with one of its children and return to the previous step. (Swap with its smaller child in a min-heap and its larger child in a max-heap.)

So, if we have the same max-heap as before

        11
       /  \
      5    8
     / \
    3   4

We remove the 11 and replace it with the 4.

        4
       / \
      5   8
     /
    3

Now the heap property is violated since 8 is greater than 4. In this case, swapping the two elements, 4 and 8, is enough to restore the heap property and we need not swap elements further:

        8
       / \
      5   4
     /
    3

The downward-moving node is swapped with the larger of its children in a max-heap (in a min-heap it would be swapped with its smaller child), until it satisfies the heap property in its new position. This functionality is achieved by the Max-Heapify function as defined below in pseudocode for an array-backed heap A of length heap_length[A]. Note that A is an array (or list) indexed starting at 1, not 0 as is common in many real programming languages, and that what starts with // is a comment.

    Max-Heapify (A, i):
        left ← 2*i          // ← means “assignment”
        right ← 2*i + 1
        largest ← i
        if left ≤ heap_length[A] and A[left] > A[largest] then:
            largest ← left
        if right ≤ heap_length[A] and A[right] > A[largest] then:
            largest ← right
        if largest ≠ i then:
            swap A[i] and A[largest]
            Max-Heapify(A, largest)

For the above algorithm to correctly re-heapify the array, the node at index i and its two direct children must violate the heap property. If they do not, the algorithm will fall through with no change to the array. The down-heap operation (without the preceding swap) can also be used to modify the value of the root, even when an element is not being deleted.

In the worst case, the new root has to be swapped with its child on each level until it reaches the bottom level of the heap, meaning that the delete operation has a time complexity relative to the height of the tree, or O(log n).

5.4.2 Building a heap

Building a heap from an array of n input elements can be done by starting with an empty heap, then successively inserting each element. This approach, called Williams' method after the inventor of binary heaps, is easily seen to run in O(n log n) time: it performs n insertions at O(log n) cost each.[lower-alpha 1]

However, Williams' method is suboptimal. A faster method (due to Floyd[4]) starts by arbitrarily putting the elements on a binary tree, respecting the shape property (the tree could be represented by an array, see below). Then starting from the lowest level and moving upwards, shift the root of each subtree downward as in the deletion algorithm until the heap property is restored. More specifically, if all the subtrees starting at some height h (measured from the bottom) have already been “heapified”, the trees at height h + 1 can be heapified by sending their root down along the path of maximum valued children when building a max-heap, or minimum valued children when building a min-heap. This process takes O(h) operations (swaps) per node. In this method most of the heapification takes place in the lower levels. Since the height of the heap is ⌊log n⌋, the number of nodes at height h is at most ⌈n / 2^(h+1)⌉. Therefore, the cost of heapifying all subtrees is:

    ∑_{h=0}^{⌈log n⌉} ⌈n / 2^(h+1)⌉ O(h) = O( n ∑_{h=0}^{⌈log n⌉} h / 2^(h+1) ) = O( n ∑_{h=0}^{∞} h / 2^h ) = O(2n) = O(n)

This uses the fact that the infinite series ∑ h / 2^h converges to 2.
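Floyd's bottom-up construction is a short loop of sift-downs; a self-contained, zero-based Python sketch of the max-heap variant (function names are ours):

    # Sketch of Floyd's linear-time, bottom-up heap construction
    # (zero-based max-heap variant).
    def sift_down(a, i, n):
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and a[c] > a[largest]:
                    largest = c
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest

    def build_max_heap(a):
        # Nodes at indices n//2 .. n-1 are leaves, i.e., already heaps;
        # sift down each internal node, from the lowest level upward.
        for i in range(len(a) // 2 - 1, -1, -1):
            sift_down(a, i, len(a))

    data = [3, 1, 4, 1, 5, 9, 2, 6]
    build_max_heap(data)
    print(data[0])  # -> 9, the maximum is now at the root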

The exact value of the above (the worst-case number of comparisons during the heap construction) is known to be equal to

    2n − 2s2(n) − e2(n),

where s2(n) is the sum of all digits of the binary representation of n and e2(n) is the exponent of 2 in the prime factorization of n.[5]

The Build-Max-Heap function that follows converts an array A which stores a complete binary tree with n nodes to a max-heap by repeatedly using Max-Heapify in a bottom-up manner. It is based on the observation that the array elements indexed by floor(n/2) + 1, floor(n/2) + 2, ..., n are all leaves of the tree (assuming, as before, that indices start at 1), thus each is a one-element heap. Build-Max-Heap runs Max-Heapify on each of the remaining tree nodes.

    Build-Max-Heap (A):
        heap_length[A] ← length[A]
        for each index i from floor(length[A]/2) downto 1 do:
            Max-Heapify(A, i)

5.4.3 Heap implementation

[Figure: A small complete binary tree stored in an array with cells indexed 0 through 6. Caption: Comparison between a binary heap and an array implementation.]

Heaps are commonly implemented with an array. Any binary tree can be stored in an array, but because a binary heap is always a complete binary tree, it can be stored compactly. No space is required for pointers; instead, the parent and children of each node can be found by arithmetic on array indices. These properties make this heap implementation a simple example of an implicit data structure or Ahnentafel list. Details depend on the root position, which in turn may depend on constraints of a programming language used for implementation, or programmer preference. Specifically, sometimes the root is placed at index 1, sacrificing space in order to simplify arithmetic.

Let n be the number of elements in the heap and i be an arbitrary valid index of the array storing the heap. If the tree root is at index 0, with valid indices 0 through n − 1, then each element a at index i has

• children at indices 2i + 1 and 2i + 2

• its parent at index floor((i − 1) / 2).

Alternatively, if the tree root is at index 1, with valid indices 1 through n, then each element a at index i has

• children at indices 2i and 2i + 1

• its parent at index floor(i / 2).

This implementation is used in the heapsort algorithm, where it allows the space in the input array to be reused to store the heap (i.e. the algorithm is done in-place). The implementation is also useful for use as a priority queue, where use of a dynamic array allows insertion of an unbounded number of items.

The upheap/downheap operations can then be stated in terms of an array as follows: suppose that the heap property holds for the indices b, b+1, ..., e. The sift-down function extends the heap property to b−1, b, b+1, ..., e. Only index i = b−1 can violate the heap property. Let j be the index of the largest child of a[i] (for a max-heap, or the smallest child for a min-heap) within the range b, ..., e. (If no such index exists because 2i > e then the heap property holds for the newly extended range and nothing needs to be done.) By swapping the values a[i] and a[j] the heap property for position i is established. At this point, the only problem is that the heap property might not hold for index j. The sift-down function is applied tail-recursively to index j until the heap property is established for all elements.

The sift-down function is fast. In each step it only needs two comparisons and one swap. The index value where it is working doubles in each iteration, so that at most log2 e steps are required.

For big heaps and using virtual memory, storing elements in an array according to the above scheme is inefficient: (almost) every level is in a different page. B-heaps are binary heaps that keep subtrees in a single page, reducing the number of pages accessed by up to a factor of ten.[6]

The operation of merging two binary heaps takes Θ(n) for equal-sized heaps. The best you can do is (in case of an array implementation) simply concatenating the two heap arrays and building a heap of the result.[7] A heap on n elements can be merged with a heap on k elements using O(log n log k) key comparisons, or, in case of a pointer-based implementation, in O(log n log k) time.[8] An algorithm for splitting a heap on n elements into two heaps on k and n−k elements, respectively, based on a new view of heaps as an ordered collection of subheaps was presented in [9]; the algorithm requires O(log n * log n) comparisons. The view also presents a new and conceptually simple algorithm for merging heaps. When merging is a common task, a different heap implementation is recommended, such as binomial heaps, which can be merged in O(log n).

with finding the adjacent element on the last level on the binary heap when adding an element. This element can − be determined algorithmically or by adding extra data to right = 1) + last(L 2j the nodes, called “threading” the tree—instead of merely = (2L+2 − 2) − 2j storing references to the children, we store the inorder = 2(2L+1 − 2 − j) + 2 successor of the node as well. = 2i + 2 It is possible to modify the heap structure to allow extrac- tion of both the smallest and largest element in O (log n) time.[10] To do this, the rows alternate between min heap As required. and max heap. The algorithms are roughly the same, but, Noting that the left child of any node is always 1 place in each step, one must consider the alternating rows with before its right child, we get left = 2i + 1 . alternating comparisons. The performance is roughly the If the root is located at index 1 instead of 0, the last node same as a normal single direction heap. This idea can be in each level is instead at index 2l+1 − 1 . Using this generalised to a min-max-median heap. throughout yields left = 2i and right = 2i + 1 for heaps with their root at 1. 5.4.4 Derivation of index equations Parent node In an array-based heap, the children and parent of a node can be located via simple arithmetic on the node’s index. Every node is either the left or right child of its parent, so This section derives the relevant equations for heaps with we know that either of the following is true. their root at index 0, with additional notes on heaps with their root at index 1. 1. i = 2 × (parent) + 1 To avoid confusion, we'll define the level of a node as its distance from the root, such that the root itself occupies 2. i = 2 × (parent) + 2 level 0.

Hence, Child nodes

For a general node located at index i (beginning from 0), we will first derive the index of its right child, right = i − 1 i − 2 parent = or 2i + 2 . 2 2 Let node i be located in level L , and note that any level ⌊ ⌋ l contains exactly 2l nodes. Furthermore, there are ex- i − 1 Now consider the expression . actly 2l+1 − 1 nodes contained in the layers up to and 2 including layer l (think of binary arithmetic; 0111...111 If node i is a left child, this gives the result immediately, = 1000...000 - 1). Because the root is stored at 0, the k however, it also gives the correct result if node i is a right th node will be stored at index (k − 1) . Putting these child. In this case, (i−2) must be even, and hence (i−1) observations together yields the following expression for must be odd. the index of the last node in layer l.

⌊ ⌋ ⌊ ⌋ i − 1 i − 2 1 l+1 − − l+1 − = + last(l) = (2 1) 1 = 2 2 2 2 2 i − 2 = Let there be j nodes after node i in layer L, such that 2 = parent

i = last(L) − j Therefore, irrespective of whether a node is a left or right child, its parent can be found by the expression: = (2L+1 − 2) − j

Each of these j nodes must have exactly 2 children, so ⌊ ⌋ there must be 2j nodes separating i 's right child from i − 1 parent = the end of its layer ( L + 1 ). 2 5.4. BINARY HEAP 141

5.4.5 Related structures

Since the ordering of siblings in a heap is not specified by the heap property, a single node’s two children can be freely interchanged unless doing so violates the shape property (compare with treap). Note, however, that in the common array-based heap, simply swapping the children might also necessitate moving the children’s sub-tree nodes to retain the heap property.

The binary heap is a special case of the d-ary heap in which d = 2.

5.4.6 Summary of running times

In the following time complexities[11] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[1] In fact, this procedure can be shown to take Θ(n log n) time in the worst case, meaning that n log n is also an asymptotic lower bound on the complexity.[1]:167 In the average case (averaging over all permutations of n inputs), though, the method takes linear time.[4]
[2] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[14]
[3] Amortized time.
[4] Bounded by Ω(log log n), O(2^{2√(log log n)}).[17][18]
[5] n is the size of the larger heap.

5.4.7 See also

• Heap
• Heapsort

5.4.8 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.
[2] eEL, CSA Dept., IISc, Bangalore, “Binary Heaps”, Data Structures and Algorithms
[3] http://wcipeg.com/wiki/Binary_heap
[4] Hayward, Ryan; McDiarmid, Colin (1991). “Average Case Analysis of Heap Building by Repeated Insertion” (PDF). J. Algorithms. 12: 126–153.
[5] Suchenek, Marek A. (2012), “Elementary Yet Precise Worst-Case Analysis of Floyd’s Heap-Construction Program”, Fundamenta Informaticae, IOS Press, 120 (1): 75–92, doi:10.3233/FI-2012-751.
[6] Poul-Henning Kamp. “You’re Doing It Wrong”. ACM Queue. June 11, 2010.
[7] Chris L. Kuszmaul. “binary heap”. Dictionary of Algorithms and Data Structures, Paul E. Black, ed., U.S. National Institute of Standards and Technology. 16 November 2009.
[8] J.-R. Sack and T. Strothotte, “An Algorithm for Merging Heaps”, Acta Informatica 22, 171–186 (1985).
[9] J.-R. Sack and T. Strothotte, “A characterization of heaps and its applications”, Information and Computation, Volume 86, Issue 1, May 1990, Pages 69–86.
[10] Atkinson, M. D.; J.-R. Sack; N. Santoro; T. Strothotte (1 October 1986). “Min-max heaps and generalized priority queues” (PDF). Programming techniques and Data structures. Comm. ACM, 29(10): 996–1000.
[11] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.
[12] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5
[13] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58
[14] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.
[15] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.
[16] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC ’12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[17] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.
[18] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.

5.4.9 External links

• Binary Heap Applet by Kubo Kovac
• Open Data Structures - Section 10.1 - BinaryHeap: An Implicit Binary Tree
• Implementation of binary max heap in C by Robin Thomas
• Implementation of binary min heap in C by Robin Thomas

5.5 d-ary heap

The d-ary heap or d-heap is a priority queue data structure, a generalization of the binary heap in which the nodes have d children instead of 2.[1][2][3] Thus, a binary heap is a 2-heap, and a ternary heap is a 3-heap. According to Tarjan[2] and Jensen et al.,[4] d-ary heaps were invented by Donald B. Johnson in 1975.[1]

This data structure allows decrease priority operations to be performed more quickly than binary heaps, at the expense of slower delete minimum operations. This trade-off leads to better running times for algorithms such as Dijkstra’s algorithm in which decrease priority operations are more common than delete min operations.[1][5] Additionally, d-ary heaps have better memory cache behavior than a binary heap, allowing them to run more quickly in practice despite having a theoretically larger worst-case running time.[6][7] Like binary heaps, d-ary heaps are an in-place data structure that uses no additional storage beyond that needed to store the array of items in the heap.[2][8]

5.5.1 Data structure

The d-ary heap consists of an array of n items, each of which has a priority associated with it. These items may be viewed as the nodes in a complete d-ary tree, listed in breadth first traversal order: the item at position 0 of the array forms the root of the tree, the items at positions 1 through d are its children, the next d² items are its grandchildren, etc. Thus, the parent of the item at position i (for any i > 0) is the item at position ⌊(i − 1)/d⌋ and its children are the items at positions di + 1 through di + d. According to the heap property, in a min-heap, each item has a priority that is at least as large as its parent; in a max-heap, each item has a priority that is no larger than its parent.[2][3]

The minimum priority item in a min-heap (or the maximum priority item in a max-heap) may always be found at position 0 of the array. To remove this item from the priority queue, the last item x in the array is moved into its place, and the length of the array is decreased by one. Then, while item x and its children do not satisfy the heap property, item x is swapped with one of its children (the one with the smallest priority in a min-heap, or the one with the largest priority in a max-heap), moving it downward in the tree and later in the array, until eventually the heap property is satisfied. The same downward swapping procedure may be used to increase the priority of an item in a min-heap, or to decrease the priority of an item in a max-heap.[2][3]

To insert a new item into the heap, the item is appended to the end of the array, and then while the heap property is violated it is swapped with its parent, moving it upward in the tree and earlier in the array, until eventually the heap property is satisfied. The same upward-swapping procedure may be used to decrease the priority of an item in a min-heap, or to increase the priority of an item in a max-heap.[2][3]

To create a new heap from an array of n items, one may loop over the items in reverse order, starting from the item at position n − 1 and ending at the item at position 0, applying the downward-swapping procedure for each item.[2][3]
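To make the layout and the two swapping procedures concrete, here is a minimal Python sketch of a d-ary min-heap under the conventions above; all names (parent, children, sift_up, sift_down, make_heap) are illustrative, not taken from any particular library:

    def parent(i, d):
        return (i - 1) // d

    def children(i, d, n):
        return range(d * i + 1, min(d * i + d + 1, n))

    def sift_up(a, i, d):
        # One comparison per swap, so at most log_d(n) comparisons.
        while i > 0 and a[i] < a[parent(i, d)]:
            p = parent(i, d)
            a[i], a[p] = a[p], a[i]
            i = p

    def sift_down(a, i, d):
        # Up to d comparisons per level: d - 1 to find the smallest
        # child, plus one against the current item.
        n = len(a)
        while True:
            kids = list(children(i, d, n))
            if not kids:
                return
            c = min(kids, key=lambda j: a[j])
            if a[i] <= a[c]:
                return
            a[i], a[c] = a[c], a[i]
            i = c

    def make_heap(a, d):
        # Loop over the items in reverse order, sifting each one down.
        for i in reversed(range(len(a))):
            sift_down(a, i, d)

    a = [9, 4, 7, 1, 3, 8, 2]
    make_heap(a, d=3)
    assert a[0] == min(a)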

5.5.2 Analysis

In a d-ary heap with n items in it, both the upward-swapping procedure and the downward-swapping procedure may perform as many as log_d n = log n / log d swaps. In the upward-swapping procedure, each swap involves a single comparison of an item with its parent, and takes constant time. Therefore, the time to insert a new item into the heap, to decrease the priority of an item in a min-heap, or to increase the priority of an item in a max-heap, is O(log n / log d). In the downward-swapping procedure, each swap involves d comparisons and takes O(d) time: it takes d − 1 comparisons to determine the minimum or maximum of the children and then one more comparison against the parent to determine whether a swap is needed. Therefore, the time to delete the root item, to increase the priority of an item in a min-heap, or to decrease the priority of an item in a max-heap, is O(d log n / log d).[2][3]

When creating a d-ary heap from a set of n items, most of the items are in positions that will eventually hold leaves of the d-ary tree, and no downward swapping is performed for those items. At most n/d + 1 items are non-leaves, and may be swapped downwards at least once, at a cost of O(d) time to find the child to swap them with. At most n/d² + 1 nodes may be swapped downward two times, incurring an additional O(d) cost for the second swap beyond the cost already counted in the first term, etc. Therefore, the total amount of time to create a heap in this way is

    ∑_{i=1}^{log_d n} (n/dⁱ + 1) O(d) = O(n).

The exact value of the above (the worst-case number of comparisons during the construction of a d-ary heap) is known to be equal to

    (d/(d − 1))(n − s_d(n)) − (d − 1 − (n mod d))(e_d(⌊n/d⌋) + 1),[9]

where s_d(n) is the sum of all digits of the standard base-d representation of n and e_d(n) is the exponent of d in the factorization of n. This reduces to

    2n − 2s₂(n) − e₂(n)[9]

for d = 2, and to

    (3/2)(n − s₃(n)) − 2e₃(n) − e₃(n − 1)[9]

for d = 3.
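As a quick sanity check of the d = 2 formula above (a worked example, not from the original text): for n = 13 we have s₂(13) = 3 (13 is 1101 in binary) and e₂(13) = 0, so the worst-case number of comparisons is 2·13 − 2·3 − 0 = 20.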

The space usage of the d-ary heap, with insert and delete-min operations, is linear, as it uses no extra storage other than an array containing a list of the items in the heap.[2][8] If changes to the priorities of existing items need to be supported, then one must also maintain pointers from the items to their positions in the heap, which again uses only linear storage.[2]

5.5.3 Applications

Dijkstra’s algorithm for shortest paths in graphs and Prim’s algorithm for minimum spanning trees both use a min-heap in which there are n delete-min operations and as many as m decrease-priority operations, where n is the number of vertices in the graph and m is the number of edges. By using a d-ary heap with d = m/n, the total times for these two types of operations may be balanced against each other, leading to a total time of O(m log_{m/n} n) for the algorithm, an improvement over the O(m log n) running time of binary heap versions of these algorithms whenever the number of edges is significantly larger than the number of vertices.[1][5] An alternative priority queue data structure, the Fibonacci heap, gives an even better theoretical running time of O(m + n log n), but in practice d-ary heaps are generally at least as fast, and often faster, than Fibonacci heaps for this application.[10]

4-heaps may perform better than binary heaps in practice, even for delete-min operations.[2][3] Additionally, a d-ary heap typically runs much faster than a binary heap for heap sizes that exceed the size of the computer’s cache memory: a binary heap typically requires more cache misses and virtual memory page faults than a d-ary heap, each one taking far more time than the extra work incurred by the additional comparisons a d-ary heap makes compared to a binary heap.[6][7]

5.5.4 References

[1] Johnson, D. B. (1975), “Priority queues with update and finding minimum spanning trees”, Information Processing Letters, 4 (3): 53–57, doi:10.1016/0020-0190(75)90001-0.
[2] Tarjan, R. E. (1983), “3.2. d-heaps”, Data Structures and Network Algorithms, CBMS-NSF Regional Conference Series in Applied Mathematics, 44, Society for Industrial and Applied Mathematics, pp. 34–38.
[3] Weiss, M. A. (2007), “d-heaps”, Data Structures and Algorithm Analysis (2nd ed.), Addison-Wesley, p. 216, ISBN 0-321-37013-9.
[4] Jensen, C.; Katajainen, J.; Vitale, F. (2004), An extended truth about heaps (PDF).
[5] Tarjan (1983), pp. 77 and 91.
[6] Naor, D.; Martel, C. U.; Matloff, N. S. (1991), “Performance of priority queue structures in a virtual memory environment”, Computer Journal, 34 (5): 428–437, doi:10.1093/comjnl/34.5.428.
[7] Kamp, Poul-Henning (2010), “You’re doing it wrong”, ACM Queue, 8 (6).
[8] Mortensen, C. W.; Pettie, S. (2005), “The complexity of implicit and space efficient priority queues”, Algorithms and Data Structures: 9th International Workshop, WADS 2005, Waterloo, Canada, August 15–17, 2005, Proceedings, Lecture Notes in Computer Science, 3608, Springer-Verlag, pp. 49–60, doi:10.1007/11534273_6, ISBN 978-3-540-28101-6.
[9] Suchenek, Marek A. (2012), “Elementary Yet Precise Worst-Case Analysis of Floyd’s Heap-Construction Program”, Fundamenta Informaticae, IOS Press, 120 (1): 75–92, doi:10.3233/FI-2012-751.
[10] Cherkassky, B. V.; Goldberg, A. V.; Radzik, T. (1996), “Shortest paths algorithms: Theory and experimental evaluation”, Mathematical Programming, 73 (2): 129–174, doi:10.1007/BF02592101.

5.5.5 External links

• C++ implementation of generalized heap with D-Heap support

5.6 Binomial heap

For binomial price trees, see Binomial options pricing model.

In computer science, a binomial heap is a heap similar to a binary heap but also supports quick merging of two heaps. This is achieved by using a special tree structure. It is important as an implementation of the mergeable heap abstract data type (also called meldable heap), which is a priority queue supporting merge operation.

5.6.1 Binomial heap

A binomial heap is implemented as a collection of binomial trees (compare with a binary heap, which has a shape of a single binary tree), which are defined recursively as follows:

• A binomial tree of order 0 is a single node
• A binomial tree of order k has a root node whose children are roots of binomial trees of orders k−1, k−2, ..., 2, 1, 0 (in this order).

A binomial tree of order k has 2^k nodes and height k.

[Figure: Binomial trees of order 0 to 3. Each tree has a root node with subtrees of all lower ordered binomial trees, which have been highlighted. For example, the order 3 binomial tree is connected to an order 2, 1, and 0 (highlighted as blue, green and red respectively) binomial tree.]

Because of its unique structure, a binomial tree of order k can be constructed from two trees of order k−1 trivially by attaching one of them as the leftmost child of the root of the other tree. This feature is central to the merge operation of a binomial heap, which is its major advantage over other conventional heaps.

The name comes from the shape: a binomial tree of order n has (n choose d) nodes at depth d. (See Binomial coefficient.)

5.6.2 Structure of a binomial heap

A binomial heap is implemented as a set of binomial trees that satisfy the binomial heap properties:

• Each binomial tree in a heap obeys the minimum-heap property: the key of a node is greater than or equal to the key of its parent.
• There can only be either one or zero binomial trees for each order, including zero order.

The first property ensures that the root of each binomial tree contains the smallest key in the tree, which applies to the entire heap.

The second property implies that a binomial heap with n nodes consists of at most log n + 1 binomial trees. In fact, the number and orders of these trees are uniquely determined by the number of nodes n: each binomial tree corresponds to one digit in the binary representation of the number n. For example, the number 13 is 1101 in binary, 2³ + 2² + 2⁰, and thus a binomial heap with 13 nodes will consist of three binomial trees of orders 3, 2, and 0 (see figure below).

[Figure: Example of a binomial heap containing 13 nodes with distinct keys. The heap consists of three binomial trees with orders 0, 2, and 3.]

5.6.3 Implementation

Because no operation requires random access to the root nodes of the binomial trees, the roots of the binomial trees can be stored in a linked list, ordered by increasing order of the tree.

Merge

As mentioned above, the simplest and most important operation is the merging of two binomial trees of the same order within a binomial heap. Due to the structure of binomial trees, they can be merged trivially. As their root node is the smallest element within the tree, by comparing the two keys, the smaller of them is the minimum key, and becomes the new root node. Then the other tree becomes a subtree of the combined tree. This operation is basic to the complete merging of two binomial heaps.

    function mergeTree(p, q)
        if p.root.key <= q.root.key
            return p.addSubTree(q)
        else
            return q.addSubTree(p)

[Figure: To merge two binomial trees of the same order, first compare the root key. Since 7 > 3, the black tree on the left (with root node 7) is attached to the grey tree on the right (with root node 3) as a subtree. The result is a tree of order 3.]

The operation of merging two heaps is perhaps the most interesting and can be used as a subroutine in most other operations. The lists of roots of both heaps are traversed simultaneously in a manner similar to that of the merge algorithm. If only one of the heaps contains a tree of order j, this tree is moved to the merged heap. If both heaps contain a tree of order j, the two trees are merged to one tree of order j+1 so that the minimum-heap property is satisfied. Note that it may later be necessary to merge this tree with some other tree of order j+1 present in one of the heaps. In the course of the algorithm, we need to examine at most three trees of any order (two from the two heaps we merge and one composed of two smaller trees).

[Figure: This shows the merger of two binomial heaps, accomplished by merging two binomial trees of the same order one by one. If the resulting merged tree has the same order as one binomial tree in one of the two heaps, then those two are merged again.]

Because each binomial tree in a binomial heap corresponds to a bit in the binary representation of its size, there is an analogy between the merging of two heaps and the binary addition of the sizes of the two heaps, from right-to-left. Whenever a carry occurs during addition, this corresponds to a merging of two binomial trees during the merge.

Each tree has order at most log n and therefore the running time is O(log n).

    function merge(p, q)
        while not (p.end() and q.end())
            tree = mergeTree(p.currentTree(), q.currentTree())
            if not heap.currentTree().empty()
                tree = mergeTree(tree, heap.currentTree())
            heap.addTree(tree)
            heap.next(); p.next(); q.next()
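As an illustration, here is a minimal Python sketch of binomial min-heap union; it follows the binary-addition analogy described above, and all names (Node, merge_tree, merge_heaps) are hypothetical, not from a standard library:

    class Node:
        def __init__(self, key):
            self.key = key
            self.order = 0
            self.children = []   # subtrees of orders order-1, ..., 1, 0

    def merge_tree(p, q):
        # Merge two binomial trees of the same order: the smaller root
        # wins and the other tree becomes its leftmost child.
        if p.key <= q.key:
            p.children.insert(0, q)
            p.order += 1
            return p
        q.children.insert(0, p)
        q.order += 1
        return q

    def merge_heaps(h1, h2):
        # h1, h2: root lists sorted by increasing order, at most one
        # tree per order -- merged like binary addition, LSB first.
        result, carry = [], None
        i = j = 0
        while i < len(h1) or j < len(h2) or carry is not None:
            orders = [t.order for t in
                      ([h1[i]] if i < len(h1) else []) +
                      ([h2[j]] if j < len(h2) else []) +
                      ([carry] if carry is not None else [])]
            order = min(orders)
            same = []
            if carry is not None and carry.order == order:
                same.append(carry); carry = None
            if i < len(h1) and h1[i].order == order:
                same.append(h1[i]); i += 1
            if j < len(h2) and h2[j].order == order:
                same.append(h2[j]); j += 1
            if len(same) == 1:
                result.append(same[0])               # no carry
            elif len(same) == 2:
                carry = merge_tree(same[0], same[1]) # 1 + 1 = carry
            else:
                result.append(same[0])               # keep one tree,
                carry = merge_tree(same[1], same[2]) # carry the merge
        return result

    # Example: insert = merge with a one-node heap.
    h = []
    for k in (5, 2, 9, 1, 7):
        h = merge_heaps(h, [Node(k)])
    print(min(t.key for t in h))   # minimum among the roots: 1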

Insert

Inserting a new element to a heap can be done by simply creating a new heap containing only this element and then merging it with the original heap. Due to the merge, insert takes O(log n) time. However, across a series of n consecutive insertions, insert has an amortized time of O(1) (i.e. constant).

Find minimum

To find the minimum element of the heap, find the minimum among the roots of the binomial trees. This can again be done easily in O(log n) time, as there are just O(log n) trees and hence roots to examine.

By using a pointer to the binomial tree that contains the minimum element, the time for this operation can be reduced to O(1). The pointer must be updated when performing any operation other than find minimum. This can be done in O(log n) without raising the running time of any operation.

Delete minimum

To delete the minimum element from the heap, first find this element, remove it from its binomial tree, and obtain a list of its subtrees. Then transform this list of subtrees into a separate binomial heap by reordering them from smallest to largest order. Then merge this heap with the original heap. Since each root has at most log n children, creating this new heap is O(log n). Merging heaps is O(log n), so the entire delete minimum operation is O(log n).

    function deleteMin(heap)
        min = heap.trees().first()
        for each current in heap.trees()
            if current.root < min.root then min = current
        for each tree in min.subTrees()
            tmp.addTree(tree)
        heap.removeTree(min)
        merge(heap, tmp)
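Continuing the Python sketch from the Merge section above, delete minimum can be expressed directly on top of merge_heaps (again an illustrative sketch, not the text’s own code):

    def delete_min(heap):
        # Find the root with minimum key, remove its tree, and merge
        # its subtrees (reversed into increasing order) back in.
        tree = min(heap, key=lambda t: t.key)
        heap.remove(tree)
        subheap = list(reversed(tree.children))   # orders 0, 1, ..., k-1
        return tree.key, merge_heaps(heap, subheap)

    key, h = delete_min(h)   # with h from the example above: key == 1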

Decrease key

After decreasing the key of an element, it may become smaller than the key of its parent, violating the minimum-heap property. If this is the case, exchange the element with its parent, and possibly also with its grandparent, and so on, until the minimum-heap property is no longer violated. Each binomial tree has height at most log n, so this takes O(log n) time.

Delete

To delete an element from the heap, decrease its key to negative infinity (that is, some value lower than any element in the heap) and then delete the minimum in the heap.

5.6.4 Summary of running times

In the following time complexities[1] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[4]
[2] Amortized time.
[3] Bounded by Ω(log log n), O(2^{2√(log log n)}).[7][8]
[4] n is the size of the larger heap.

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.
[2] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5
[3] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58
[4] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.
[5] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.
[6] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC ’12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[7] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.
[8] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.

5.6.5 Applications

• Discrete event simulation
• Priority queues

5.6.6 See also

• Fibonacci heap
• Soft heap
• Skew binomial heap

5.6.7 References

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 19: Binomial Heaps, pp. 455–475.
• Vuillemin, J. (1978). A data structure for manipulating priority queues. Communications of the ACM 21, 309–314.

5.6.8 External links

• Java applet simulation of binomial heap
• Python implementation of binomial heap
• Two C implementations of binomial heap (a generic one and one optimized for integer keys)
• Haskell implementation of binomial heap
• Common Lisp implementation of binomial heap

5.7 Fibonacci heap

In computer science, a Fibonacci heap is a data structure for priority queue operations, consisting of a collection of heap-ordered trees. It has a better amortized running time than many other priority queue data structures including the binary heap and binomial heap. Michael L. Fredman and Robert E. Tarjan developed Fibonacci heaps in 1984 and published them in a scientific journal in 1987. They named Fibonacci heaps after the Fibonacci numbers, which are used in their running time analysis.

For the Fibonacci heap, the find-minimum operation takes constant (O(1)) amortized time.[1] The insert and decrease key operations also work in constant amortized time.[2] Deleting an element (most often used in the special case of deleting the minimum element) works in O(log n) amortized time, where n is the size of the heap.[2] This means that starting from an empty data structure, any sequence of a insert and decrease key operations and b delete operations would take O(a + b log n) worst case time, where n is the maximum heap size.
In a binary or binomial heap such a sequence of operations would take O((a + b) log n) time. A Fibonacci heap is thus better than a binary or binomial heap when b is smaller than a by a non-constant factor. It is also possible to merge two Fibonacci heaps in constant amortized time, improving on the logarithmic merge time of a binomial heap, and improving on binary heaps which cannot handle merges efficiently.

Using Fibonacci heaps for priority queues improves the asymptotic running time of important algorithms, such as Dijkstra’s algorithm for computing the shortest path between two nodes in a graph, compared to the same algorithm using other slower priority queue data structures.

5.7.1 Structure

[Figure 1: Example of a Fibonacci heap. It has three trees of degrees 0, 1 and 3. Three vertices are marked (shown in blue). Therefore, the potential of the heap is 9 (3 trees + 2 × 3 marked vertices).]

A Fibonacci heap is a collection of trees satisfying the minimum-heap property, that is, the key of a child is always greater than or equal to the key of the parent. This implies that the minimum key is always at the root of one of the trees. Compared with binomial heaps, the structure of a Fibonacci heap is more flexible. The trees do not have a prescribed shape and in the extreme case the heap can have every element in a separate tree. This flexibility allows some operations to be executed in a lazy manner, postponing the work for later operations. For example, merging heaps is done simply by concatenating the two lists of trees, and operation decrease key sometimes cuts a node from its parent and forms a new tree.

However at some point some order needs to be introduced to the heap to achieve the desired running time. In particular, degrees of nodes (here degree means the number of children) are kept quite low: every node has degree at most O(log n) and the size of a subtree rooted in a node of degree k is at least F_{k+2}, where F_k is the kth Fibonacci number. This is achieved by the rule that we can cut at most one child of each non-root node. When a second child is cut, the node itself needs to be cut from its parent and becomes the root of a new tree (see Proof of degree bounds, below). The number of trees is decreased in the operation delete minimum, where trees are linked together.

As a result of a relaxed structure, some operations can take a long time while others are done very quickly. For the amortized running time analysis we use the potential method, in that we pretend that very fast operations take a little bit longer than they actually do. This additional time is then later combined and subtracted from the actual running time of slow operations. The amount of time saved for later use is measured at any given moment by a potential function. The potential of a Fibonacci heap is given by

    Potential = t + 2m

where t is the number of trees in the Fibonacci heap, and m is the number of marked nodes. A node is marked if at least one of its children was cut since this node was made a child of another node (all roots are unmarked). The amortized time for an operation is given by the sum of the actual time and c times the difference in potential, where c is a constant (chosen to match the constant factors in the O notation for the actual time).

Thus, the root of each tree in a heap has one unit of time stored. This unit of time can be used later to link this tree with another tree at amortized time 0. Also, each marked node has two units of time stored. One can be used to cut the node from its parent. If this happens, the node becomes a root and the second unit of time will remain stored in it as in any other root.

5.7.2 Implementation of operations

To allow fast deletion and concatenation, the roots of all trees are linked using a circular, doubly linked list. The children of each node are also linked using such a list. For each node, we maintain its number of children and whether the node is marked. Moreover, we maintain a pointer to the root containing the minimum key.

Operation find minimum is now trivial because we keep the pointer to the node containing it. It does not change the potential of the heap, therefore both actual and amortized cost are constant.

As mentioned above, merge is implemented simply by concatenating the lists of tree roots of the two heaps. This can be done in constant time and the potential does not change, leading again to constant amortized time.

Operation insert works by creating a new heap with one element and doing merge. This takes constant time, and the potential increases by one, because the number of trees increases. The amortized cost is thus still constant.
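A minimal Python sketch of this lazy structure is shown below: roots sit in a circular doubly linked list, so insert, merge and find minimum are all constant-time pointer operations. The names (FibNode, FibHeap, splice) are illustrative assumptions, not from the text:

    class FibNode:
        def __init__(self, key):
            self.key = key
            self.degree = 0
            self.mark = False
            self.parent = None
            self.child = None
            self.left = self.right = self  # circular doubly linked list

    def splice(a, b):
        # Concatenate two circular doubly linked lists in O(1).
        a_next, b_prev = a.right, b.left
        a.right, b.left = b, a
        b_prev.right, a_next.left = a_next, b_prev

    class FibHeap:
        def __init__(self):
            self.min = None    # pointer to the root with minimum key
            self.n = 0

        def insert(self, key):
            # O(1): the new node simply becomes another root.
            node = FibNode(key)
            if self.min is None:
                self.min = node
            else:
                splice(self.min, node)
                if key < self.min.key:
                    self.min = node
            self.n += 1
            return node

        def merge(self, other):
            # O(1): concatenate the two root lists.
            if other.min is not None:
                if self.min is None:
                    self.min = other.min
                else:
                    splice(self.min, other.min)
                    if other.min.key < self.min.key:
                        self.min = other.min
            self.n += other.n
            return self

    h = FibHeap()
    for k in (3, 7, 1):
        h.insert(k)
    print(h.min.key)   # find minimum is a pointer read: prints 1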

Operation extract minimum (same as delete minimum) operates in three phases. First we take the root containing the minimum element and remove it. Its children will become roots of new trees. If the number of children was d, it takes time O(d) to process all new roots and the potential increases by d−1. Therefore, the amortized running time of this phase is O(d) = O(log n).

[Figure: Fibonacci heap from Figure 1 after first phase of extract minimum. Node with key 1 (the minimum) was deleted and its children were added as separate trees.]

However to complete the extract minimum operation, we need to update the pointer to the root with minimum key. Unfortunately there may be up to n roots we need to check. In the second phase we therefore decrease the number of roots by successively linking together roots of the same degree. When two roots u and v have the same degree, we make one of them a child of the other so that the one with the smaller key remains the root. Its degree will increase by one. This is repeated until every root has a different degree. To find trees of the same degree efficiently we use an array of length O(log n) in which we keep a pointer to one root of each degree. When a second root is found of the same degree, the two are linked and the array is updated. The actual running time is O(log n + m) where m is the number of roots at the beginning of the second phase. At the end we will have at most O(log n) roots (because each has a different degree). Therefore, the difference in the potential function from before this phase to after it is O(log n) − m, and the amortized running time is then at most O(log n + m) + c(O(log n) − m). With a sufficiently large choice of c, this simplifies to O(log n).

In the third phase we check each of the remaining roots and find the minimum. This takes O(log n) time and the potential does not change. The overall amortized running time of extract minimum is therefore O(log n).

[Figure: Fibonacci heap from Figure 1 after extract minimum is completed. First, nodes 3 and 6 are linked together. Then the result is linked with the tree rooted at node 2. Finally, the new minimum is found.]

[Figure: Fibonacci heap from Figure 1 after decreasing key of node 9 to 0. This node as well as its two marked ancestors are cut from the tree rooted at 1 and placed as new roots.]

Operation decrease key will take the node, decrease the key and if the heap property becomes violated (the new key is smaller than the key of the parent), the node is cut from its parent. If the parent is not a root, it is marked. If it has been marked already, it is cut as well and its parent is marked. We continue upwards until we reach either the root or an unmarked node. Now we set the minimum pointer to the decreased value if it is the new minimum. In the process we create some number, say k, of new trees. Each of these new trees except possibly the first one was marked originally but as a root it will become unmarked. One node can become marked. Therefore, the number of marked nodes changes by −(k − 1) + 1 = −k + 2. Combining these two changes, the potential changes by 2(−k + 2) + k = −k + 4. The actual time to perform the cutting was O(k), therefore (again with a sufficiently large choice of c) the amortized running time is constant.

Finally, operation delete can be implemented simply by decreasing the key of the element to be deleted to minus infinity, thus turning it into the minimum of the whole heap. Then we call extract minimum to remove it. The amortized running time of this operation is O(log n).
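The decrease key operation described above can be sketched in Python, continuing the illustration from the Structure section (cut and cascading_cut are hypothetical helper names, not a library API):

    def cut(heap, node, parent):
        # Remove node from parent's child list and make it a root.
        node.left.right, node.right.left = node.right, node.left
        parent.degree -= 1
        if parent.child is node:
            parent.child = node.right if node.right is not node else None
        node.left = node.right = node
        node.parent = None
        node.mark = False
        splice(heap.min, node)

    def cascading_cut(heap, node):
        parent = node.parent
        if parent is not None:
            if not node.mark:
                node.mark = True          # first lost child: only mark
            else:
                cut(heap, node, parent)   # second lost child: cut node too
                cascading_cut(heap, parent)

    def decrease_key(heap, node, new_key):
        assert new_key <= node.key
        node.key = new_key
        parent = node.parent
        if parent is not None and node.key < parent.key:
            cut(heap, node, parent)
            cascading_cut(heap, parent)
        if node.key < heap.min.key:
            heap.min = node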

5.7.3 Proof of degree bounds

The amortized performance of a Fibonacci heap depends on the degree (number of children) of any tree root being O(log n), where n is the size of the heap. Here we show that the (sub)tree rooted at any node x of degree d in the heap must have size at least F_{d+2}, where F_k is the kth Fibonacci number. The degree bound follows from this and the fact (easily proved by induction) that F_{d+2} ≥ φ^d for all integers d ≥ 0, where φ = (1 + √5)/2 ≈ 1.618 is the golden ratio. (We then have n ≥ F_{d+2} ≥ φ^d, and taking the log to base φ of both sides gives d ≤ log_φ n as required.)

Consider any node x somewhere in the heap (x need not be the root of one of the main trees). Define size(x) to be the size of the tree rooted at x (the number of descendants of x, including x itself). We prove by induction on the height of x (the length of a longest simple path from x to a descendant leaf) that size(x) ≥ F_{d+2}, where d is the degree of x.

Base case: If x has height 0, then d = 0, and size(x) = 1 = F₂.

Inductive case: Suppose x has positive height and degree d > 0. Let y₁, y₂, ..., y_d be the children of x, indexed in order of the times they were most recently made children of x (y₁ being the earliest and y_d the latest), and let c₁, c₂, ..., c_d be their respective degrees. We claim that cᵢ ≥ i − 2 for each i with 2 ≤ i ≤ d: Just before yᵢ was made a child of x, y₁, ..., yᵢ₋₁ were already children of x, and so x had degree at least i − 1 at that time. Since trees are combined only when the degrees of their roots are equal, it must have been that yᵢ also had degree at least i − 1 at the time it became a child of x. From that time to the present, yᵢ can only have lost at most one child (as guaranteed by the marking process), and so its current degree cᵢ is at least i − 2. This proves the claim.

Since the heights of all the yᵢ are strictly less than that of x, we can apply the inductive hypothesis to them to get size(yᵢ) ≥ F_{cᵢ+2} ≥ F_{(i−2)+2} = Fᵢ. The nodes x and y₁ each contribute at least 1 to size(x), and so we have

    size(x) ≥ 2 + ∑_{i=2}^{d} size(yᵢ) ≥ 2 + ∑_{i=2}^{d} Fᵢ = 1 + ∑_{i=0}^{d} Fᵢ.

A routine induction proves that 1 + ∑_{i=0}^{d} Fᵢ = F_{d+2} for any d ≥ 0, which gives the desired lower bound on size(x).

5.7.4 Worst case

Although Fibonacci heaps look very efficient, they have the following two drawbacks (as mentioned in the paper “The Pairing Heap: A new form of Self Adjusting Heap”): “They are complicated when it comes to coding them. Also they are not as efficient in practice when compared with the theoretically less efficient forms of heaps, since in their simplest version they require storage and manipulation of four pointers per node, compared to the two or three pointers per node needed for other structures”.[3] These other structures are the binary heap, binomial heap, pairing heap, Brodal heap and rank-pairing heap.

Although the total running time of a sequence of operations starting with an empty structure is bounded by the bounds given above, some (very few) operations in the sequence can take very long to complete (in particular delete and delete minimum have linear running time in the worst case). For this reason Fibonacci heaps and other amortized data structures may not be appropriate for real-time systems. It is possible to create a data structure which has the same worst-case performance as the Fibonacci heap has amortized performance. One such structure, the Brodal queue, is, in the words of the creator, “quite complicated” and “[not] applicable in practice.”[4][5] Created in 2012, the strict Fibonacci heap is a simpler (compared to Brodal’s) structure with the same worst-case bounds. It is unknown whether the strict Fibonacci heap is efficient in practice. The run-relaxed heaps of Driscoll et al. give good worst-case performance for all Fibonacci heap operations except merge.

5.7.5 Summary of running times

In the following time complexities[6] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[9]
[2] Amortized time.
[3] Bounded by Ω(log log n), O(2^{2√(log log n)}).[2][12]
[4] n is the size of the larger heap.

5.7.6 Practical considerations

Fibonacci heaps have a reputation for being slow in practice[13] due to large memory consumption per node and high constant factors on all operations.[14] Recent experimental results suggest that Fibonacci heaps are more efficient in practice than most of their later derivatives, including quake heaps, violation heaps, strict Fibonacci heaps and rank-pairing heaps, but less efficient than either pairing heaps or array-based heaps.[15]

5.7.7 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001) [1990]. “Chapter 20: Fibonacci Heaps”. Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 476–497. ISBN 0-262-03293-7. Third edition p. 518.
[2] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.

[3] Fredman, Michael L.; Sedgewick, Robert; Sleator, Daniel D.; Tarjan, Robert E. (1986). “The pairing heap: a new form of self-adjusting heap” (PDF). Algorithmica. 1 (1): 111–129. doi:10.1007/BF01840439.
[4] Gerth Stølting Brodal (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics: 52–58, doi:10.1145/313852.313883, ISBN 0-89871-366-8, CiteSeerX: 10.1.1.43.8133
[5] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC ’12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[6] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.
[7] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5
[8] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58
[9] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.
[10] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.
[11] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC ’12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[12] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.
[13] http://www.cs.princeton.edu/~wayne/kleinberg-tardos/pdf/FibonacciHeaps.pdf, p. 79
[14] http://web.stanford.edu/class/cs166/lectures/07/Small07.pdf, p. 72
[15] Larkin, Daniel; Sen, Siddhartha; Tarjan, Robert (2014). “A Back-to-Basics Empirical Study of Priority Queues”. Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments: 61–72. arXiv:1403.0252. doi:10.1137/1.9781611973198.7.

5.7.8 External links

• Java applet simulation of a Fibonacci heap
• MATLAB implementation of Fibonacci heap
• De-recursived and memory efficient C implementation of Fibonacci heap (free/libre software, CeCILL-B license)
• Ruby implementation of the Fibonacci heap (with tests)
• Pseudocode of the Fibonacci heap algorithm
• Various Java Implementations for Fibonacci heap

5.8 Pairing heap

A pairing heap is a type of heap data structure with relatively simple implementation and excellent practical amortized performance, introduced by Michael Fredman, Robert Sedgewick, Daniel Sleator, and Robert Tarjan in 1986.[1] Pairing heaps are heap-ordered multiway tree structures, and can be considered simplified Fibonacci heaps. They are considered a “robust choice” for implementing such algorithms as Prim’s MST algorithm,[2] and support the following operations (assuming a min-heap):

• find-min: simply return the top element of the heap.
• merge: compare the two root elements, the smaller remains the root of the result, the larger element and its subtree is appended as a child of this root.
• insert: create a new heap for the inserted element and merge into the original heap.
• decrease-key (optional): remove the subtree rooted at the key to be decreased, replace the key with a smaller key, then merge the result back into the heap.
• delete-min: remove the root and merge its subtrees. Various strategies are employed.

The analysis of pairing heaps’ time complexity was initially inspired by that of splay trees.[1] The amortized time per delete-min is O(log n), and the operations find-min, merge, and insert run in O(1) amortized time.[3]

Determining the precise asymptotic running time of pairing heaps when a decrease-key operation is needed has turned out to be difficult. Initially, the time complexity of this operation was conjectured on empirical grounds to be O(1),[4] but Fredman proved that the amortized time per decrease-key is at least Ω(log log n) for some sequences of operations.[5] Using a different amortization argument, Pettie then proved that insert, meld, and decrease-key all run in O(2^{2√(log log n)}) amortized time, which is o(log n).[6]

Elmasry later introduced a variant of pairing heaps for which decrease-key runs in O(log log n) amortized time and with all other operations matching Fibonacci heaps,[7] but no tight Θ(log log n) bound is known for the original data structure.[6][3] Moreover, it is an open question whether a o(log n) amortized time bound for decrease-key and a O(1) amortized time bound for insert can be achieved simultaneously.[8]

Although this is worse than other priority queue algorithms such as Fibonacci heaps, which perform decrease-key in O(1) amortized time, the performance in practice is excellent. Stasko and Vitter,[4] Moret and Shapiro,[9] and Larkin, Sen, and Tarjan[8] conducted experiments on pairing heaps and other heap data structures. They concluded that pairing heaps are often faster in practice than array-based binary heaps and d-ary heaps, and almost always faster in practice than other pointer-based heaps, including data structures like Fibonacci heaps that are theoretically more efficient.

5.8.1 Structure

A pairing heap is either an empty heap, or a pair consisting of a root element and a possibly empty list of pairing heaps. The heap ordering property requires that all the root elements of the subheaps in the list are not smaller than the root element of the heap. The following description assumes a purely functional heap that does not support the decrease-key operation.

    type PairingHeap[Elem] = Empty | Heap(elem: Elem, subheaps: List[PairingHeap[Elem]])

A pointer-based implementation for RAM machines, supporting decrease-key, can be achieved using three pointers per node, by representing the children of a node by a singly-linked list: a pointer to the node’s first child, one to its next sibling, and one to the parent. Alternatively, the parent pointer can be omitted by letting the last child point back to the parent, if a single boolean flag is added to indicate “end of list”. This achieves a more compact structure at the expense of a constant overhead factor per operation.[1]

5.8.2 Operations

find-min

The function find-min simply returns the root element of the heap:

    function find-min(heap)
        if heap == Empty
            error
        else
            return heap.elem

merge

Merging with an empty heap returns the other heap, otherwise a new heap is returned that has the minimum of the two root elements as its root element and just adds the heap with the larger root to the list of subheaps:

    function merge(heap1, heap2)
        if heap1 == Empty
            return heap2
        elsif heap2 == Empty
            return heap1
        elsif heap1.elem < heap2.elem
            return Heap(heap1.elem, heap2 :: heap1.subheaps)
        else
            return Heap(heap2.elem, heap1 :: heap2.subheaps)

insert

The easiest way to insert an element into a heap is to merge the heap with a new heap containing just this element and an empty list of subheaps:

    function insert(elem, heap)
        return merge(Heap(elem, []), heap)

delete-min

The only non-trivial fundamental operation is the deletion of the minimum element from the heap. The standard strategy first merges the subheaps in pairs (this is the step that gave this data structure its name) from left to right and then merges the resulting list of heaps from right to left:

    function delete-min(heap)
        if heap == Empty
            error
        else
            return merge-pairs(heap.subheaps)

This uses the auxiliary function merge-pairs:

    function merge-pairs(l)
        if length(l) == 0
            return Empty
        elsif length(l) == 1
            return l[0]
        else
            return merge(merge(l[0], l[1]), merge-pairs(l[2..]))

That this does indeed implement the described two-pass left-to-right then right-to-left merging strategy can be seen from this reduction:

    merge-pairs([H1, H2, H3, H4, H5, H6, H7])
    => merge(merge(H1, H2), merge-pairs([H3, H4, H5, H6, H7]))
       # merge H1 and H2 to H12, then the rest of the list
    => merge(H12, merge(merge(H3, H4), merge-pairs([H5, H6, H7])))
       # merge H3 and H4 to H34, then the rest of the list
    => merge(H12, merge(H34, merge(merge(H5, H6), merge-pairs([H7]))))
       # merge H5 and H6 to H56, then the rest of the list
    => merge(H12, merge(H34, merge(H56, H7)))
       # switch direction, merge the last two resulting heaps, giving H567
    => merge(H12, merge(H34, H567))
       # merge the last two resulting heaps, giving H34567
    => merge(H12, H34567)
       # finally, merge the first merged pair with the result of merging the rest
    => H1234567
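Since the pseudocode above is purely functional, it transcribes almost line-for-line into Python. The following hedged sketch represents a heap as an (elem, subheaps) tuple with None standing in for Empty:

    def merge(h1, h2):
        if h1 is None: return h2
        if h2 is None: return h1
        if h1[0] < h2[0]:
            return (h1[0], [h2] + h1[1])   # h2 becomes a subheap of h1
        return (h2[0], [h1] + h2[1])

    def insert(elem, heap):
        return merge((elem, []), heap)

    def find_min(heap):
        if heap is None: raise IndexError("empty heap")
        return heap[0]

    def merge_pairs(l):
        if not l: return None
        if len(l) == 1: return l[0]
        # pair up left to right, then merge the results right to left
        return merge(merge(l[0], l[1]), merge_pairs(l[2:]))

    def delete_min(heap):
        if heap is None: raise IndexError("empty heap")
        return merge_pairs(heap[1])

    # Example: heap-sort a few numbers.
    h = None
    for x in [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]:
        h = insert(x, h)
    out = []
    while h is not None:
        out.append(find_min(h))
        h = delete_min(h)
    print(out)   # [0, 1, 2, ..., 9]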

5.8.3 Summary of running times

In the following time complexities[10] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[13]
[2] Amortized time.
[3] Bounded by Ω(log log n), O(2^{2√(log log n)}).[16][17]
[4] n is the size of the larger heap.

5.8.4 References

[1] Fredman, Michael L.; Sedgewick, Robert; Sleator, Daniel D.; Tarjan, Robert E. (1986). “The pairing heap: a new form of self-adjusting heap” (PDF). Algorithmica. 1 (1): 111–129. doi:10.1007/BF01840439.
[2] Mehlhorn, Kurt; Sanders, Peter (2008). Algorithms and Data Structures: The Basic Toolbox (PDF). Springer. p. 231.
[3] Iacono, John (2000). Improved upper bounds for pairing heaps (PDF). Proc. 7th Scandinavian Workshop on Algorithm Theory. Lecture Notes in Computer Science. Springer-Verlag. pp. 63–77. arXiv:1110.4428. doi:10.1007/3-540-44985-X_5. ISBN 978-3-540-67690-4.
[4] Stasko, John T.; Vitter, Jeffrey S. (1987), “Pairing heaps: experiments and analysis”, Communications of the ACM, 30 (3): 234–249, doi:10.1145/214748.214759, CiteSeerX: 10.1.1.106.2988
[5] Fredman, Michael L. (1999). “On the efficiency of pairing heaps and related data structures” (PDF). Journal of the ACM. 46 (4): 473–501. doi:10.1145/320211.320214.
[6] Pettie, Seth (2005), “Towards a final analysis of pairing heaps” (PDF), Proc. 46th Annual IEEE Symposium on Foundations of Computer Science, pp. 174–183, doi:10.1109/SFCS.2005.75, ISBN 0-7695-2468-0
[7] Elmasry, Amr (2009), “Pairing heaps with O(log log n) decrease cost” (PDF), Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 471–476, doi:10.1137/1.9781611973068.52
[8] Larkin, Daniel H.; Sen, Siddhartha; Tarjan, Robert E. (2014), “A back-to-basics empirical study of priority queues”, Proceedings of the 16th Workshop on Algorithm Engineering and Experiments, pp. 61–72, arXiv:1403.0252, doi:10.1137/1.9781611973198.7
[9] Moret, Bernard M. E.; Shapiro, Henry D. (1991), “An empirical analysis of algorithms for constructing a minimum spanning tree”, Proc. 2nd Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, 519, Springer-Verlag, pp. 400–411, doi:10.1007/BFb0028279, ISBN 3-540-54343-0, CiteSeerX: 10.1.1.53.5960
[10] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.
[11] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5
[12] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58
[13] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.
[14] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.
[15] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC ’12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[16] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.
[17] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.

5.8.5 External links

• Louis Wasserman discusses pairing heaps and their implementation in Haskell in The Monad Reader, Issue 16 (pp. 37–52).
• pairing heaps, Sartaj Sahni

5.9 Double-ended priority queue

Not to be confused with Double-ended queue.

In computer science, a double-ended priority queue (DEPQ)[1] or double-ended heap[2] is a data structure similar to a priority queue or heap, but allows for efficient removal of both the maximum and minimum, according to some ordering on the keys (items) stored in the structure. Every element in a DEPQ has a priority or value. In a DEPQ, it is possible to remove the elements in both ascending as well as descending order.[3]

5.9.1 Operations

A double-ended priority queue features the following operations:

isEmpty() Checks if the DEPQ is empty and returns true if empty.
size() Returns the total number of elements present in the DEPQ.
getMin() Returns the element having least priority.
getMax() Returns the element having highest priority.
put(x) Inserts the element x in the DEPQ.
removeMin() Removes an element with minimum priority and returns this element.
removeMax() Removes an element with maximum priority and returns this element.

If an operation is to be performed on two elements having the same priority, then the element inserted first is chosen. Also, the priority of any element can be changed once it has been inserted in the DEPQ.[4]

5.9.2 Implementation

Double-ended priority queues can be built from balanced binary search trees (where the minimum and maximum elements are the leftmost and rightmost leaves, respectively), or using specialized data structures like min-max heap and pairing heap.

Generic methods of arriving at double-ended priority queues from normal priority queues are:[5]

Dual structure method

[Figure: A dual structure with 14, 12, 4, 10, 8 as the members of the DEPQ.[1]]

In this method two different priority queues for min and max are maintained. The same elements in both the PQs are shown with the help of correspondence pointers. Here, the minimum and maximum elements are values contained in the root nodes of min heap and max heap respectively.

• Removing the min element: Perform removemin() on the min heap and remove(node value) on the max heap, where node value is the value in the corresponding node in the max heap.
• Removing the max element: Perform removemax() on the max heap and remove(node value) on the min heap, where node value is the value in the corresponding node in the min heap.

Total correspondence

[Figure: A total correspondence heap for the elements 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11 with element 11 as buffer.[1]]

Half the elements are in the min PQ and the other half in the max PQ. Each element in the min PQ has a one to one correspondence with an element in the max PQ. If the number of elements in the DEPQ is odd, one of the elements is retained in a buffer.[1] The priority of every element in the min PQ will be less than or equal to the corresponding element in the max PQ.

Leaf correspondence

[Figure: A leaf correspondence heap for the same elements as above.[1]]

In this method only the leaf elements of the min and max PQ form corresponding one to one pairs. It is not necessary for non-leaf elements to be in a one to one correspondence pair.[1]
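The following Python sketch is in the spirit of the dual structure method, but replaces the correspondence pointers with a shared “alive” flag and lazy pruning, which is simpler to write with the standard heapq module; the class and helper names are illustrative assumptions:

    import heapq
    import itertools

    class SimpleDEPQ:
        # Each entry lives in both a min-heap and a (negated) max-heap;
        # removing it from one side marks it dead, and dead entries are
        # skipped lazily on the other side.
        def __init__(self):
            self.min_heap, self.max_heap = [], []
            self.counter = itertools.count()  # ties: earliest insert wins

        def put(self, x):
            entry = [x, next(self.counter), True]   # [key, id, alive]
            heapq.heappush(self.min_heap, (x, entry[1], entry))
            heapq.heappush(self.max_heap, (-x, entry[1], entry))

        def _prune(self, heap):
            while heap and not heap[0][2][2]:       # drop dead entries
                heapq.heappop(heap)

        def remove_min(self):
            self._prune(self.min_heap)
            key, _, entry = heapq.heappop(self.min_heap)
            entry[2] = False
            return key

        def remove_max(self):
            self._prune(self.max_heap)
            key, _, entry = heapq.heappop(self.max_heap)
            entry[2] = False
            return -key

    q = SimpleDEPQ()
    for v in (14, 12, 4, 10, 8):
        q.put(v)
    print(q.remove_min(), q.remove_max())   # 4 14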

Interval heaps

[Figure: Implementing a DEPQ using an interval heap.]

Apart from the above mentioned correspondence methods, DEPQs can be obtained efficiently using interval heaps.[6] An interval heap is like an embedded min-max heap in which each node contains two elements. It is a complete binary tree in which:[6]

• The left element is less than or equal to the right element.
• Both the elements define a closed interval.
• The interval represented by any node except the root is a sub-interval of the parent node.
• Elements on the left hand side define a min heap.
• Elements on the right hand side define a max heap.

Depending on the number of elements, two cases are possible:[6]

1. Even number of elements: In this case, each node contains two elements, say p and q, with p ≤ q. Every node is then represented by the interval [p, q].
2. Odd number of elements: In this case, each node except the last contains two elements represented by the interval [p, q], whereas the last node contains a single element and is represented by the interval [p, p].

Inserting an element

Depending on the number of elements already present in the interval heap, the following cases are possible:

• Odd number of elements: If the number of elements in the interval heap is odd, the new element is first inserted in the last node. Then, it is successively compared with the elements of the nodes above it and tested to satisfy the criteria essential for an interval heap as stated above. If the element does not satisfy any of the criteria, it is moved from the last node towards the root until all the conditions are satisfied.[6]
• Even number of elements: If the number of elements is even, then for the insertion of a new element an additional node is created. If the element falls to the left of the parent interval, it is considered to be in the min heap, and if the element falls to the right of the parent interval, it is considered to be in the max heap. Further, it is compared successively and moved from the last node towards the root until all the conditions for the interval heap are satisfied. If the element lies within the interval of the parent node itself, the process stops immediately and no elements are moved.[6]

The time required for inserting an element depends on the number of movements required to meet all the conditions and is O(log n).

Deleting an element

• Min element: In an interval heap, the minimum element is the element on the left hand side of the root node. This element is removed and returned. To fill the vacancy created on the left hand side of the root node, an element from the last node is removed and reinserted into the root node. This element is then compared successively with all the left hand elements of the descending nodes, and the process stops when all the conditions for an interval heap are satisfied. If the left hand side element in a node becomes greater than the right side element at any stage, the two elements are swapped[6] and then further comparisons are done. Finally, the root node will again contain the minimum element on the left hand side.
• Max element: In an interval heap, the maximum element is the element on the right hand side of the root node. This element is removed and returned. To fill the vacancy created on the right hand side of the root node, an element from the last node is removed and reinserted into the root node. Further comparisons are carried out on a similar basis as discussed above. Finally, the root node will again contain the max element on the right hand side.

Thus, with interval heaps, both the minimum and maximum elements can be removed efficiently by traversing from root to leaf. Thus, a DEPQ can be obtained[6] from an interval heap where the elements of the interval heap are the priorities of elements in the DEPQ.
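As a small illustration (the node layout below is invented for this example, reusing the eleven elements from the figures above), both getMin and getMax are O(1) reads of the root interval:

    # Level-order list of (left, right) intervals; the last node holds
    # a single element because the total count (11) is odd.
    heap = [(3, 11), (4, 10), (5, 9), (5, 8), (6, 7), (6,)]
    get_min = heap[0][0]    # left end of the root interval: 3
    get_max = heap[0][-1]   # right end of the root interval: 11
    assert get_min == min(x for node in heap for x in node)
    assert get_max == max(x for node in heap for x in node)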

5.9.3 Time Complexity

Interval Heaps

When DEPQs are implemented using interval heaps consisting of n elements, the time complexities for the various functions are formulated in the table below.[1]

Pairing heaps

When DEPQs are implemented using heaps or pairing heaps consisting of n elements, the time complexities for the various functions are formulated in the table below.[1] For pairing heaps, it is an amortized complexity.

5.9.4 Applications

External sorting

One example application of the double-ended priority queue is external sorting. In an external sort, there are more elements than can be held in the computer’s memory. The elements to be sorted are initially on a disk and the sorted sequence is to be left on the disk. The external quick sort is implemented using the DEPQ as follows:

1. Read in as many elements as will fit into an internal DEPQ. The elements in the DEPQ will eventually be the middle group (pivot) of elements.
2. Read in the remaining elements. If the next element is ≤ the smallest element in the DEPQ, output this next element as part of the left group. If the next element is ≥ the largest element in the DEPQ, output this next element as part of the right group. Otherwise, remove either the max or min element from the DEPQ (the choice may be made randomly or alternately); if the max element is removed, output it as part of the right group; otherwise, output the removed element as part of the left group; insert the newly input element into the DEPQ.
3. Output the elements in the DEPQ, in sorted order, as the middle group.
4. Sort the left and right groups recursively.

5.9.5 See also

• Queue (abstract data type)
• Priority queue
• Double-ended queue

5.9.6 References

[1] Data Structures, Algorithms, & Applications in Java: Double-Ended Priority Queues, Sartaj Sahni, 1999.
[2] Brass, Peter (2008). Advanced Data Structures. Cambridge University Press. p. 211. ISBN 9780521880374.
[3] “Depq - Double-Ended Priority Queue”.
[4] “depq”.
[5] Fundamentals of Data Structures in C++ - Ellis Horowitz, Sartaj Sahni and Dinesh Mehta.
[6] http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf

5.10 Soft heap

For the scene band, see Soft Heap.

In computer science, a soft heap is a variant on the simple heap data structure that has constant amortized time for 5 types of operations. This is achieved by carefully “corrupting” (increasing) the keys of at most a certain number of values in the heap. The constant time operations are:

• create(S): Create a new soft heap
• insert(S, x): Insert an element into a soft heap
• meld(S, S′): Combine the contents of two soft heaps into one, destroying both
• delete(S, x): Delete an element from a soft heap
• findmin(S): Get the element with minimum key in the soft heap

Other heaps such as Fibonacci heaps achieve most of these bounds without any corruption, but cannot provide a constant-time bound on the critical delete operation. The amount of corruption can be controlled by the choice of a parameter ε, but the lower this is set, the more time insertions require (O(log 1/ε) for an error rate of ε).

More precisely, the guarantee offered by the soft heap is the following: for a fixed value ε between 0 and 1/2, at any point in time there will be at most ε·n corrupted keys in the heap, where n is the number of elements inserted so far. Note that this does not guarantee that only a fixed percentage of the keys currently in the heap are corrupted: in an unlucky sequence of insertions and deletions, it can happen that all elements in the heap will have corrupted keys. Similarly, we have no guarantee that in a sequence of elements extracted from the heap with findmin and delete, only a fixed percentage will have corrupted keys: in an unlucky scenario only corrupted elements are extracted from the heap.

The soft heap was designed by Bernard Chazelle in 2000. The term “corruption” in the structure is the result of what Chazelle called “carpooling” in a soft heap. Each node in the soft heap contains a linked-list of keys and one common key. The common key is an upper bound on the values of the keys in the linked-list. Once a key is added to the linked-list, it is considered corrupted because its value is never again relevant in any of the soft heap operations: only the common keys are compared.

The soft heap was designed by Bernard Chazelle in 2000. The term “corruption” in the structure is the result of what Chazelle called “carpooling” in a soft heap. Each node in the soft heap contains a linked list of keys and one common key. The common key is an upper bound on the values of the keys in the linked list. Once a key is added to the linked list, it is considered corrupted because its value is never again relevant in any of the soft heap operations: only the common keys are compared. This is what makes soft heaps “soft”: you cannot be sure whether or not any particular value you put into it will be corrupted. The purpose of these corruptions is effectively to lower the information entropy of the data, enabling the data structure to break through information-theoretic barriers regarding heaps.

5.10.1 Applications

Despite their limitations and unpredictable nature, soft heaps are useful in the design of deterministic algorithms. They were used to achieve the best complexity to date for finding a minimum spanning tree. They can also be used to easily build an optimal selection algorithm, as well as near-sorting algorithms, which are algorithms that place every element near its final position, a situation in which insertion sort is fast.

One of the simplest examples is the selection algorithm. Say we want to find the kth largest of a group of n numbers. First, we choose an error rate of 1/3; that is, at most about 33% of the keys we insert will be corrupted. Now, we insert all n elements into the heap; we call the original values the “correct” keys, and the values stored in the heap the “stored” keys. At this point, at most n/3 keys are corrupted, that is, for at most n/3 keys is the “stored” key larger than the “correct” key; for all the others, the stored key equals the correct key.

Next, we delete the minimum element from the heap n/3 times (this is done according to the “stored” key). As the total number of insertions we have made so far is still n, there are still at most n/3 corrupted keys in the heap. Accordingly, at least 2n/3 − n/3 = n/3 of the keys remaining in the heap are not corrupted.

Let L be the element with the largest correct key among the elements we removed. The stored key of L is possibly larger than its correct key (if L was corrupted), and even this larger value is smaller than all the stored keys of the remaining elements in the heap (as we were removing minimums). Therefore, the correct key of L is smaller than the remaining n/3 uncorrupted elements in the soft heap. Thus, L divides the elements somewhere between 33%/66% and 66%/33%. We then partition the set about L using the partition algorithm from quicksort and apply the same algorithm again to either the set of numbers less than L or the set of numbers greater than L, neither of which can exceed 2n/3 elements. Since each insertion and deletion requires O(1) amortized time, the total deterministic time is T(n) = T(2n/3) + O(n). Using case 3 of the master theorem (with ε = 1 and c = 2/3), we know that T(n) = Θ(n).

The final algorithm looks like this:

    function softHeapSelect(a[1..n], k)
        if k = 1 then return minimum(a[1..n])
        create(S)
        for i from 1 to n
            insert(S, a[i])
        for i from 1 to n/3
            x := findmin(S)
            delete(S, x)
        xIndex := partition(a, x)   // Returns new index of pivot x
        if k < xIndex
            softHeapSelect(a[1..xIndex-1], k)
        else
            softHeapSelect(a[xIndex..n], k-xIndex+1)

5.10.2 References

• Chazelle, B. 2000. The soft heap: an approximate priority queue with optimal error rate. J. ACM 47, 6 (Nov. 2000), 1012–1027.

• Kaplan, H. and Zwick, U. 2009. A simpler implementation and analysis of Chazelle’s soft heaps. In Proceedings of the Nineteenth Annual ACM–SIAM Symposium on Discrete Algorithms (New York, New York, January 4–6, 2009). Society for Industrial and Applied Mathematics, Philadelphia, PA, 477–485.

Chapter 6

Successors and neighbors

6.1 Binary search algorithm

This article is about searching a finite sorted array. For searching continuous function values, see bisection method.

In computer science, binary search, also known as half-interval search[1] or logarithmic search,[2] is a search algorithm that finds the position of a target value within a sorted array.[3][4] It compares the target value to the middle element of the array; if they are unequal, the half in which the target cannot lie is eliminated and the search continues on the remaining half until it is successful.

Binary search runs in at worst logarithmic time, making O(log n) comparisons, where n is the number of elements in the array and log is the binary logarithm, and it uses only constant (O(1)) space.[5] Although specialized data structures designed for fast searching, such as hash tables, can be searched more efficiently, binary search applies to a wider range of search problems.

Although the idea is simple, implementing binary search correctly requires attention to some subtleties about its exit conditions and midpoint calculation.

There exist numerous variations of binary search. One variation in particular (fractional cascading) speeds up binary searches for the same value in multiple arrays.

6.1.1 Algorithm

Binary search works on sorted arrays. A binary search begins by comparing the middle element of the array with the target value. If the target value matches the middle element, its position in the array is returned. If the target value is less than or greater than the middle element, the search continues in the lower or upper half of the array, respectively, eliminating the other half from consideration.[6]

Procedure

Given an array A of n elements with values or records A0 ... An−1, sorted such that A0 ≤ ... ≤ An−1, and target value T, the following subroutine uses binary search to find the index of T in A.[6]

1. Set L to 0 and R to n − 1.

2. If L > R, the search terminates as unsuccessful.

3. Set m (the position of the middle element) to the floor of (L + R) / 2.

4. If Am < T, set L to m + 1 and go to step 2.

5. If Am > T, set R to m − 1 and go to step 2.

6. Now Am = T; the search is done; return m.

This iterative procedure keeps track of the search boundaries via two variables. Some implementations may place the comparison for equality at the end of the algorithm, resulting in a faster comparison loop but costing one more iteration on average.[7]
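The six steps translate directly into code. The following Python function is a minimal sketch of the procedure above (the name and the convention of returning -1 on an unsuccessful search are illustrative choices, not part of the original description):

    def binary_search(A, T):
        L, R = 0, len(A) - 1
        while L <= R:              # step 2: stop (unsuccessfully) once L > R
            m = (L + R) // 2       # step 3: floor of (L + R) / 2
            if A[m] < T:
                L = m + 1          # step 4
            elif A[m] > T:
                R = m - 1          # step 5
            else:
                return m           # step 6: A[m] = T
        return -1                  # unsuccessful search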

Approximate matches

The above procedure only performs exact matches, finding the position of a target value. However, due to the ordered nature of sorted arrays, it is trivial to extend binary search to perform approximate matches. In particular, binary search can be used to compute, relative to a value, its rank (the number of smaller elements), predecessor (next-smallest element), successor (next-largest element), and nearest neighbor. Range queries seeking the number of elements between two values can be performed with two rank queries.[8]

• Rank queries can be performed using a modified version of binary search. By returning m on a successful search, and L on an unsuccessful search, the number of elements less than the target value is returned instead.[8]

• Predecessor and successor queries can be performed with rank queries. Once the rank of the target value is known, its predecessor is the element at the position given by its rank (as it is the largest element that is smaller than the target value). Its successor is the element after it (if it is present in the array) or at the next position after the predecessor (otherwise).[9] The nearest neighbor of the target value is either its predecessor or successor, whichever is closer.

• Range queries are also straightforward. Once the ranks of the two values are known, the number of elements greater than or equal to the first value and less than the second is the difference of the two ranks. This count can be adjusted up or down by one according to whether the endpoints of the range should be considered to be part of the range and whether the array contains keys matching those endpoints.[10]
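All four query types reduce to the two rank primitives. A small Python sketch using the standard bisect module, whose bisect_left and bisect_right functions perform exactly these rank queries, is shown below; the helper names are illustrative.

    import bisect

    def rank(A, T):
        # Number of elements smaller than T.
        return bisect.bisect_left(A, T)

    def predecessor(A, T):
        # Largest element smaller than T, or None if there is none.
        i = bisect.bisect_left(A, T)
        return A[i - 1] if i > 0 else None

    def successor(A, T):
        # Smallest element larger than T, or None if there is none.
        i = bisect.bisect_right(A, T)
        return A[i] if i < len(A) else None

    def count_in_range(A, lo, hi):
        # Number of elements x with lo <= x <= hi: a difference of two ranks.
        return bisect.bisect_right(A, hi) - bisect.bisect_left(A, lo)

    A = [20, 30, 40, 50, 90, 100]
    print(rank(A, 40), predecessor(A, 40), successor(A, 40))   # 2 30 50
    print(count_in_range(A, 30, 90))                           # 4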
6.1.2 Performance

Figure: A tree representing binary search. The array being searched here is [20, 30, 40, 50, 90, 100], and the target value is 40.

The performance of binary search can be analyzed by reducing the procedure to a binary comparison tree, where the root node is the middle element of the array; the middle element of the lower half is the left child of the root and the middle element of the upper half is the right child. The rest of the tree is built in a similar fashion. This model represents binary search: starting from the root node, the left or right subtrees are traversed depending on whether the target value is less or more than the node under consideration, representing the successive elimination of elements.[5][11]

The worst case is ⌊log n + 1⌋ iterations (of the comparison loop), where the ⌊ ⌋ notation denotes the floor function that rounds its argument down to an integer. This is reached when the search reaches the deepest level of the tree, equivalent to a binary search that has reduced to one element and, in each iteration, always eliminates the smaller subarray out of the two if they are not of equal size.[lower-alpha 1][11]

On average, assuming that each element is equally likely to be searched, by the time the search completes, the target value will most likely be found at the second-deepest level of the tree. This is equivalent to a binary search that completes one iteration before the worst case, reached after log n − 1 iterations. However, the tree may be unbalanced, with the deepest level partially filled, and equivalently, the array may not be divided perfectly by the search in some iterations, half of the time resulting in the smaller subarray being eliminated. The actual average number of iterations is slightly higher, at log n − (n − log n − 1)/n iterations.[5] In the best case, where the first middle element selected is equal to the target value, its position is returned after one iteration.[12] In terms of iterations, no search algorithm that is based solely on comparisons can exhibit better average and worst-case performance than binary search.[11]

Each iteration of the binary search algorithm defined above makes one or two comparisons, checking if the middle element is equal to the target value in each iteration. Again assuming that each element is equally likely to be searched, each iteration makes 1.5 comparisons on average. A variation of the algorithm instead checks for equality at the very end of the search, eliminating on average half a comparison from each iteration. This decreases the time taken per iteration very slightly on most computers, while guaranteeing that the search takes the maximum number of iterations, on average adding one iteration to the search. Because the comparison loop is performed only ⌊log n + 1⌋ times in the worst case, for all but enormous n, the slight increase in comparison-loop efficiency does not compensate for the extra iteration. Knuth 1998 gives a value of 2^66 (more than 73 quintillion)[13] elements for this variation to be faster.[lower-alpha 2][14][15]

Fractional cascading can be used to speed up searches of the same value in multiple arrays. Where k is the number of arrays, searching each array for the target value takes O(k log n) time; fractional cascading reduces this to O(k + log n).[16]

6.1.3 Binary search versus other schemes

Sorted arrays with binary search are a very inefficient solution when insertion and deletion operations are interleaved with retrieval, taking O(n) time for each such operation, and complicating memory use. Other algorithms support much more efficient insertion and deletion, and also fast exact matching.

Hashing

For implementing associative arrays, hash tables, a data structure that maps keys to records using a hash function, are generally faster than binary search on a sorted array of records;[17] most implementations require only amortized constant time on average.[lower-alpha 3][19] However, hashing is not useful for approximate matches, such as computing the next-smallest, next-largest, and nearest key, as the only information given on a failed search is that the target is not present in any record.[20] Binary search is ideal for such matches, performing them in logarithmic time. In addition, all operations possible on a sorted array can be performed, such as finding the smallest and largest key and performing range searches.[21]

Trees

A binary search tree is a binary tree data structure that works based on the principle of binary search: the records of the tree are arranged in sorted order, and traversal of the tree is performed using a logarithmic-time binary-search-like algorithm. Insertion and deletion also require logarithmic time in binary search trees. This is faster than the linear-time insertion and deletion of sorted arrays, and binary trees retain the ability to perform all the operations possible on a sorted array, including range and approximate queries.[22]

However, binary search is usually more efficient for searching, as binary search trees will most likely be imperfectly balanced, resulting in slightly worse performance than binary search. This applies even to balanced binary search trees, binary search trees that balance their own nodes, as they rarely produce optimally balanced trees, but to a lesser extent. Although unlikely, the tree may be severely imbalanced, with few internal nodes having two children, resulting in the average and worst-case search time approaching n comparisons.[lower-alpha 4] Binary search trees take more space than sorted arrays.[24]

Binary search trees lend themselves to fast searching in external memory stored in hard disks, where data needs to be sought and paged into main memory. By splitting the tree into pages of some number of elements, each storing in turn a section of the tree, searching in a binary search tree costs fewer disk seeks, improving its overall performance. Notice that this effectively creates a multiway tree, as each page is connected to each other. The B-tree generalizes this method of tree organization; B-trees are frequently used to organize long-term storage such as databases and filesystems.[25][26]

Linear search

Linear search is a simple search algorithm that checks every record until it finds the target value. Linear search can be done on a linked list, which allows for faster insertion and deletion than an array. Binary search is faster than linear search for sorted arrays except if the array is short.[lower-alpha 5][28] If the array must first be sorted, that cost must be amortized (spread) over any searches. Sorting the array also enables efficient approximate matches and other operations.[29]

Mixed approaches

The Judy array uses a combination of approaches to provide a highly efficient solution.

Set membership algorithms

A related problem to search is set membership. Any algorithm that does lookup, like binary search, can also be used for set membership, but there are other algorithms that are more specifically suited to it. A bit array is the simplest such structure and is useful when the range of keys is limited; membership queries on it are very fast, taking only O(1) time. The Judy1 type of Judy array handles 64-bit keys efficiently.

For approximate results, Bloom filters, another probabilistic data structure based on hashing, store a set of keys by encoding the keys using a bit array and multiple hash functions. Bloom filters are much more space-efficient than bit arrays in most cases and not much slower: with k hash functions, membership queries require only O(k) time. However, Bloom filters suffer from false positives.[lower-alpha 6][lower-alpha 7][31]
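A tiny Bloom filter along these lines is sketched below in Python. The construction of the k hash functions (salted SHA-256 digests) and the sizes are illustrative choices; a production filter would size m and k from the expected number of keys and the tolerated false-positive rate.

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1024, k=3):
            self.m, self.k = m_bits, k
            self.bits = 0                       # bit array packed into an int
        def _positions(self, key):
            # k salted hashes mapped onto bit positions.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest, "big") % self.m
        def add(self, key):
            for p in self._positions(key):
                self.bits |= 1 << p
        def __contains__(self, key):
            # May report false positives, but never false negatives.
            return all((self.bits >> p) & 1 for p in self._positions(key))

    f = BloomFilter()
    f.add("alice")
    f.add("bob")
    print("alice" in f)   # True
    print("carol" in f)   # False (with high probability)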

Other data structures

There exist data structures that may improve on binary search in some cases for both searching and other operations available for sorted arrays. For example, searches, approximate matches, and the operations available to sorted arrays can be performed more efficiently than binary search on specialized data structures such as van Emde Boas trees, fusion trees, tries, and bit arrays. However, while these operations can always be done at least efficiently on a sorted array regardless of the keys, such data structures are usually only faster because they exploit the properties of keys with a certain attribute (usually keys that are small integers), and thus will be time or space consuming for keys that do not have that attribute.[21]

6.1.4 Variations

Uniform binary search

Uniform binary search stores, instead of the lower and upper bounds, the index of the middle element and the number of elements around the middle element that were not eliminated yet. Each step reduces the width by about half. This variation is uniform because the difference between the indices of middle elements and the preceding middle elements chosen remains constant between searches of arrays of the same length.[32]

Fibonacci search

Main article: Fibonacci search technique

Fibonacci search is a method similar to binary search that successively shortens the interval in which the maximum of a unimodal function lies. Given a finite interval, a unimodal function, and the maximum length of the resulting interval, Fibonacci search finds a Fibonacci number such that if the interval is divided equally into that many subintervals, the subintervals would be shorter than the maximum length. After dividing the interval, it eliminates the subintervals in which the maximum cannot lie until one or more contiguous subintervals remain.[33][34]

Figure: Fibonacci search on the function f(x) = sin((x + 1/10)π) on the unit interval [0, 1]. The algorithm finds an interval containing the maximum of f with a length less than or equal to 1/10; in three iterations, it returns the interval [5/13, 6/13], which is of length 1/13.

Exponential search

Main article: Exponential search

Exponential search is an algorithm for searching primarily in infinite lists, but it can also be applied to select the upper bound for binary search. It starts by finding the first element with an index that is both a power of two and greater than the target value. Afterwards, it sets that index as the upper bound and switches to binary search. A search takes ⌊log x + 1⌋ iterations of the exponential search and at most ⌊log n⌋ iterations of the binary search, where x is the position of the target value. Only if the target value is near the beginning of the array is this variation more efficient than selecting the highest element as the upper bound.[35]
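A minimal Python sketch of the two stages (doubling the bound, then a bounded binary search) follows; the function name and the -1 failure convention are illustrative.

    def exponential_search(A, T):
        n = len(A)
        if n == 0:
            return -1
        # Stage 1: double the bound until it passes the target or the array end.
        bound = 1
        while bound < n and A[bound] < T:
            bound *= 2
        # Stage 2: binary search between the previous bound and the current one.
        L, R = bound // 2, min(bound, n - 1)
        while L <= R:
            m = (L + R) // 2
            if A[m] < T:
                L = m + 1
            elif A[m] > T:
                R = m - 1
            else:
                return m
        return -1

    print(exponential_search([2, 3, 5, 7, 11, 13, 17], 11))   # 4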

Interpolation search

Main article: Interpolation search

Instead of merely calculating the midpoint, interpolation search attempts to calculate the position of the target value, taking into account the lowest and highest elements in the array and the length of the array. This is only possible if the array elements are numbers. It works on the basis that the midpoint is not the best guess in many cases; for example, if the target value is close to the highest element in the array, it is likely to be located near the end of the array.[36] When the distribution of the array elements is uniform or near uniform, it makes O(log log n) comparisons.[36][37][38]

In practice, interpolation search is slower than binary search for small arrays, as interpolation search requires extra computation, and the slower growth rate of its time complexity compensates for this only for large arrays.[36]
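The position estimate described above can be written as a single linear interpolation. A Python sketch (names illustrative; assumes numeric, sorted input) is:

    def interpolation_search(A, T):
        L, R = 0, len(A) - 1
        while L <= R and A[L] <= T <= A[R]:
            if A[L] == A[R]:                  # avoid division by zero
                m = L
            else:
                # Estimate the position from the value, not just the index range.
                m = L + (T - A[L]) * (R - L) // (A[R] - A[L])
            if A[m] < T:
                L = m + 1
            elif A[m] > T:
                R = m - 1
            else:
                return m
        return -1

    print(interpolation_search([10, 20, 30, 40, 50], 40))   # 3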

Fractional cascading

Main article: Fractional cascading

Fractional cascading is a technique that speeds up binary searches for the same element, for both exact and approximate matching, in “catalogs” (arrays of sorted elements) associated with the vertices of a graph. Searching each catalog separately requires O(k log n) time, where k is the number of catalogs. Fractional cascading reduces this to O(k + log n) by storing specific information in each catalog about other catalogs.[16]

Fractional cascading was originally developed to efficiently solve various computational geometry problems, but it has also been applied elsewhere, in domains such as data mining and Internet Protocol routing.[16]

6.1.5 History

In 1946, John Mauchly made the first mention of binary search as part of the Moore School Lectures, the first ever set of lectures regarding any computer-related topic.[39] Every published binary search algorithm worked only for arrays whose length is one less than a power of two[lower-alpha 8] until 1960, when Derrick Henry Lehmer published a binary search algorithm that worked on all arrays.[41] In 1962, Hermann Bottenbruch presented an ALGOL 60 implementation of binary search that placed the comparison for equality at the end, increasing the average number of iterations by one but reducing to one the number of comparisons per iteration.[7] The uniform binary search was presented to Donald Knuth in 1971 by A. K. Chandra of Stanford University and published in Knuth’s The Art of Computer Programming.[39] In 1986, Bernard Chazelle and Leonidas J. Guibas introduced fractional cascading, a technique used to speed up binary searches in multiple arrays.[16][42][43]

6.1.6 Implementation issues

    Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky ... – Donald Knuth[2]

When Jon Bentley assigned binary search as a problem in a course for professional programmers, he found that ninety percent failed to provide a correct solution after several hours of working on it,[44] and another study published in 1988 shows that accurate code for it is only found in five out of twenty textbooks.[45] Furthermore, Bentley’s own implementation of binary search, published in his 1986 book Programming Pearls, contained an overflow error that remained undetected for over twenty years.

The Java programming language library implementation of binary search had the same overflow bug for more than nine years.[46]

In a practical implementation, the variables used to represent the indices will often be of fixed size, and this can result in an arithmetic overflow for very large arrays. If the midpoint of the span is calculated as (L + R) / 2, then the value of L + R may exceed the range of integers of the data type used to store the midpoint, even if L and R are within the range. If L and R are nonnegative, this can be avoided by calculating the midpoint as L + (R − L) / 2.[47]

If the target value is greater than the greatest value in the array, and the last index of the array is the maximum representable value of L, the value of L will eventually become too large and overflow. A similar problem will occur if the target value is smaller than the least value in the array and the first index of the array is the smallest representable value of R. In particular, this means that R must not be an unsigned type if the array starts with index 0.

An infinite loop may occur if the exit conditions for the loop are not defined correctly. Once L exceeds R, the search has failed and must convey the failure of the search. In addition, the loop must be exited when the target element is found, or in the case of an implementation where this check is moved to the end, checks for whether the search was successful or failed at the end must be in place. Bentley found that, in his assignment of binary search, this error was made by most of the programmers who failed to implement a binary search correctly.[7][48]
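The midpoint overflow is easy to reproduce. Python’s integers are arbitrary-precision and therefore immune, so the sketch below (illustrative values only) emulates 32-bit signed arithmetic explicitly to show the wrap-around next to the safe formula:

    def to_int32(x):
        # Emulate two's-complement 32-bit signed arithmetic.
        x &= 0xFFFFFFFF
        return x - 0x100000000 if x >= 0x80000000 else x

    L, R = 2_000_000_000, 2_100_000_000   # both fit in a signed 32-bit int
    print(to_int32(L + R) // 2)           # -97483648: the sum wrapped around
    print(L + to_int32(R - L) // 2)       # 2050000000: the correct midpoint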

6.1.7 Library support

Many languages’ standard libraries include binary search routines:

• C provides the function bsearch() in its standard library, which is typically implemented via binary search (although the official standard does not require it so).[49]

• C++'s STL provides the functions binary_search(), lower_bound(), upper_bound() and equal_range().[50]

• Java offers a set of overloaded binarySearch() static methods in the classes Arrays and Collections in the standard java.util package for performing binary searches on Java arrays and on Lists, respectively.[51][52]

• Microsoft’s .NET Framework 2.0 offers static generic versions of the binary search algorithm in its collection base classes. An example would be System.Array’s method BinarySearch<T>(T[] array, T value).[53]

• Python provides the bisect module.[54]

• Ruby’s Array class includes a bsearch method with built-in approximate matching.[55]

• Go’s sort standard library package contains the functions Search, SearchInts, SearchFloat64s, and SearchStrings, which implement general binary search, as well as specific implementations for searching slices of integers, floating-point numbers, and strings, respectively.[56]

• For Objective-C, the Cocoa framework provides the NSArray -indexOfObject:inSortedRange:options:usingComparator: method in Mac OS X 10.6+.[57] Apple’s Core Foundation C framework also contains a CFArrayBSearchValues() function.[58]

6.1.8 See also

• Bisection method – the same idea used to solve equations in the real numbers

6.1.9 Notes and references

Notes

[1] This happens as binary search will not always divide the array perfectly. Take for example the array [1, 2 ... 16]. The first iteration will select the midpoint of 8. On the left subarray are eight elements, but on the right are nine. If the search takes the right path, there is a higher chance that the search will make the maximum number of comparisons.[11]

[2] A formal time performance analysis by Knuth showed that the average running time of this variation for a successful search is 17.5 log n + 17 units of time, compared to 18 log n − 16 units for regular binary search. The time complexity for this variation grows slightly more slowly, but at the cost of higher initial complexity.[14]

[3] It is possible to perform hashing in guaranteed constant time.[18]

[4] The worst binary search tree for searching can be produced by inserting the values in sorted or near-sorted order, or in an alternating lowest–highest record pattern.[23]

[5] Knuth 1998 performed a formal time performance analysis of both of these search algorithms. On Knuth’s hypothetical MIX computer, intended to represent an ordinary computer, binary search takes on average 18 log n − 16 units of time for a successful search, while linear search with a sentinel node at the end of the list takes 1.75n + 8.5 − (n mod 2)/(4n) units. Linear search has lower initial complexity because it requires minimal computation, but it quickly outgrows binary search in complexity. On the MIX computer, binary search only outperforms linear search with a sentinel if n > 44.[11][27]

[6] As simply setting all of the bits which the hash functions point to for a specific key can affect queries for other keys which have a common hash location for one or more of the functions.[30]

[7] There exist improvements of the Bloom filter which improve on its complexity or support deletion; for example, the cuckoo filter exploits cuckoo hashing to gain these advantages.[30]

[8] That is, arrays of length 1, 3, 7, 15, 31 ...[40]

Citations

[1] Willams, Jr., Louis F. (1975). A modification to the half-interval search (binary search) method. Proceedings of the 14th ACM Southeast Conference. pp. 95–101. doi:10.1145/503561.503582.

[2] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Binary search”.

[3] Cormen et al. 2009, p. 39.

[4] Weisstein, Eric W. “Binary Search”. MathWorld.

[5] Flores, Ivan; Madpis, George (1971). “Average binary search length for dense ordered lists”. CACM. 14 (9): 602–603. doi:10.1145/362663.362752.

[6] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Algorithm B”.

[7] Bottenbruch, Hermann (1962). “Structure and Use of ALGOL 60”. Journal of the ACM. 9 (2): 161–211. Procedure is described at p. 214 (§43), titled “Program for Binary Search”.

[8] Sedgewick & Wayne 2011, §3.1, subsection “Rank and selection”.

[9] Goldman & Goldman 2008, pp. 461–463.

[10] Sedgewick & Wayne 2011, §3.1, subsection “Range queries”.

[11] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Further analysis of binary search”.

[12] Chang 2003, p. 169.

[13] Sloane, Neil. Table of n, 2^n for n = 0..1000. Part of OEIS A000079. Retrieved 30 April 2016.

[14] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Exercise 23”.

[15] Rolfe, Timothy J. (1997). “Analytic derivation of comparisons in binary search”. ACM SIGNUM Newsletter. 32 (4): 15–19. doi:10.1145/289251.289255.

[16] Chazelle, Bernard; Liu, Ding (2001). Lower bounds for intersection searching and fractional cascading in higher dimension. 33rd ACM Symposium on Theory of Computing. pp. 322–329. doi:10.1145/380752.380818.

[17] Knuth 1998, §6.4 (“Hashing”).

[18] Knuth 1998, §6.4 (“Hashing”), subsection “History”.

[19] Dietzfelbinger, Martin; Karlin, Anna; Mehlhorn, Kurt; Meyer auf der Heide, Friedhelm; Rohnert, Hans; Tarjan, Robert E. (August 1994). “Dynamic Perfect Hashing: Upper and Lower Bounds”. SIAM Journal on Computing. 23 (4): 738–761. doi:10.1137/S0097539791194094.

[20] Morin, Pat. “Hash Tables” (PDF). p. 1. Retrieved 28 March 2016.

[21] Beame, Paul; Fich, Faith E. (2001). “Optimal Bounds for the Predecessor Problem and Related Problems”. Journal of Computer and System Sciences. 65 (1): 38–72. doi:10.1006/jcss.2002.1822.

[22] Sedgewick & Wayne 2011, §3.2 (“Binary Search Trees”), subsection “Order-based methods and deletion”.

[23] Knuth 1998, §6.2.2 (“Binary tree searching”), subsection “But what about the worst case?”.

[24] Sedgewick & Wayne 2011, §3.5 (“Applications”), “Which symbol-table implementation should I use?”.

[25] Knuth 1998, §5.4.9 (“Disks and Drums”).

[26] Knuth 1998, §6.2.4 (“Multiway trees”).

[27] Knuth 1998, Answers to Exercises (§6.2.1) for “Exercise 5”.

[28] Knuth 1998, §6.2.1 (“Searching an ordered table”).

[29] Sedgewick & Wayne 2011, §3.2 (“Ordered symbol tables”).

[30] Fan, Bin; Andersen, Dave G.; Kaminsky, Michael; Mitzenmacher, Michael D. (2014). Cuckoo Filter: Practically Better Than Bloom. Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies. pp. 75–88. doi:10.1145/2674005.2674994.

[31] Bloom, Burton H. (1970). “Space/time Trade-offs in Hash Coding with Allowable Errors”. CACM. 13 (7): 422–426. doi:10.1145/362686.362692.

[32] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “An important variation”.

[33] Kiefer, J. (1953). “Sequential Search for a Maximum”. Proceedings of the American Mathematical Society. 4 (3): 502–506. doi:10.2307/2032161. JSTOR 2032161.

[34] Hassin, Refael (1981). “On Maximizing Functions by Fibonacci Search”. Fibonacci Quarterly. 19: 347–351.

[35] Moffat & Turpin 2002, p. 33.

[36] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Interpolation search”.

[37] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Exercise 22”.

[38] Perl, Yehoshua; Itai, Alon; Avni, Haim (1978). “Interpolation search—a log log n search”. CACM. 21 (7): 550–553. doi:10.1145/359545.359557.

[39] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “History and bibliography”.

[40] “2^n − 1”. OEIS A000225. Retrieved 7 May 2016.

[41] Lehmer, Derrick (1960). Teaching combinatorial tricks to a computer. Proceedings of Symposia in Applied Mathematics. 10. pp. 180–181. doi:10.1090/psapm/010.

[42] Chazelle, Bernard; Guibas, Leonidas J. (1986). “Fractional cascading: I. A data structuring technique” (PDF). Algorithmica. 1 (1): 133–162. doi:10.1007/BF01840440.

[43] Chazelle, Bernard; Guibas, Leonidas J. (1986). “Fractional cascading: II. Applications” (PDF). Algorithmica. 1 (1): 163–191. doi:10.1007/BF01840441.

[44] Bentley 2000, §4.1 (“The Challenge of Binary Search”).

[45] Pattis, Richard E. (1988). “Textbook errors in binary searching”. SIGCSE Bulletin. 20: 190–194. doi:10.1145/52965.53012.

[46] Bloch, Joshua (2 June 2006). “Extra, Extra – Read All About It: Nearly All Binary Searches and Mergesorts are Broken”. Google Research Blog. Retrieved 21 April 2016.

[47] Ruggieri, Salvatore (2003). “On computing the semi-sum of two integers” (PDF). Information Processing Letters. 87 (2): 67–71. doi:10.1016/S0020-0190(03)00263-1.

[48] Bentley 2000, §4.4 (“Principles”).

[49] “bsearch – binary search a sorted table”. The Open Group Base Specifications (7th ed.). The Open Group. 2013. Retrieved 28 March 2016.

[50] Stroustrup 2013, §32.6.1 (“Binary Search”).

[51] “java.util.Arrays”. Java Platform Standard Edition 8 Documentation. Oracle Corporation. Retrieved 1 May 2016.

[52] “java.util.Collections”. Java Platform Standard Edition 8 Documentation. Oracle Corporation. Retrieved 1 May 2016.

[53] “List<T>.BinarySearch Method (T)”. Microsoft Developer Network. Retrieved 10 April 2016.

[54] “8.5. bisect — Array bisection algorithm”. The Python Standard Library. Python Software Foundation. Retrieved 10 April 2016.

[55] Fitzgerald 2007, p. 152.

[56] “Package sort”. The Go Programming Language. Retrieved 28 April 2016.

[57] “NSArray”. Mac Developer Library. Apple Inc. Retrieved 1 May 2016.

[58] “CFArray”. Mac Developer Library. Apple Inc. Retrieved 1 May 2016.

Works

• Alexandrescu, Andrei (2010). The D Programming Language. Upper Saddle River, NJ: Addison-Wesley Professional. ISBN 0-321-63536-1.

• Bentley, Jon (2000) [1986]. Programming Pearls (2nd ed.). Addison-Wesley. ISBN 0-201-65788-0.

• Chang, Shi-Kuo (2003). Data Structures and Algorithms. Software Engineering and Knowledge Engineering. 13. Singapore: World Scientific. ISBN 978-981-238-348-8.

• Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.

• Fitzgerald, Michael (2007). Ruby Pocket Reference. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-2601-7.

• Goldman, Sally A.; Goldman, Kenneth J. (2008). A Practical Guide to Data Structures and Algorithms using Java. Boca Raton: CRC Press. ISBN 978-1-58488-455-2.

• Knuth, Donald (1998). Sorting and Searching. The Art of Computer Programming. 3 (2nd ed.). Reading, MA: Addison-Wesley Professional.

• Leiss, Ernst (2007). A Programmer’s Companion to Algorithm Analysis. Boca Raton, FL: CRC Press. ISBN 1-58488-673-0.

• Moffat, Alistair; Turpin, Andrew (2002). Compression and Coding Algorithms. Hamburg, Germany: Kluwer Academic Publishers. doi:10.1007/978-1-4615-0935-6. ISBN 978-0-7923-7668-2.

• Sedgewick, Robert; Wayne, Kevin (2011). Algorithms (4th ed.). Upper Saddle River, NJ: Addison-Wesley Professional. ISBN 978-0-321-57351-3.

• Stroustrup, Bjarne (2013). The C++ Programming Language (4th ed.). Upper Saddle River, NJ: Addison-Wesley Professional. ISBN 978-0-321-56384-2.

6.1.10 External links

• NIST Dictionary of Algorithms and Data Structures: binary search

6.2 Binary search tree

In computer science, binary search trees (BST), sometimes called ordered or sorted binary trees, are a particular type of container: data structures that store “items” (such as numbers, names, etc.) in memory. They allow fast lookup, addition and removal of items, and can be used to implement either dynamic sets of items, or lookup tables that allow finding an item by its key (e.g., finding the phone number of a person by name).

Binary search trees keep their keys in sorted order, so that lookup and other operations can use the principle of binary search: when looking for a key in a tree (or a place to insert a new key), they traverse the tree from root to leaf, making comparisons to keys stored in the nodes of the tree and deciding, based on the comparison, to continue searching in the left or right subtree. On average, this means that each comparison allows the operations to skip about half of the tree, so that each lookup, insertion or deletion takes time proportional to the logarithm of the number of items stored in the tree. This is much better than the linear time required to find items by key in an (unsorted) array, but slower than the corresponding operations on hash tables.

Several variants of the binary search tree have been studied in computer science; this article deals primarily with the basic type, making references to more advanced types when appropriate.

Figure: A binary search tree of size 9 and depth 3, with 8 at the root. The leaves are not drawn.

6.2.1 Definition

A binary search tree is a rooted binary tree, whose internal nodes each store a key (and optionally, an associated value) and each have two distinguished sub-trees, commonly denoted left and right. The tree additionally satisfies the binary search tree property, which states that the key in each node must be greater than all keys stored in the left sub-tree, and smaller than all keys in the right sub-tree.[1]:287 (The leaves (final nodes) of the tree contain no key and have no structure to distinguish them from one another. Leaves are commonly represented by a special leaf or nil symbol, a NULL pointer, etc.)

Generally, the information represented by each node is a record rather than a single data element. However, for sequencing purposes, nodes are compared according to their keys rather than any part of their associated records.

The major advantage of binary search trees over other data structures is that the related sorting algorithms and search algorithms such as in-order traversal can be very efficient; they are also easy to code.

Binary search trees are a fundamental data structure used to construct more abstract data structures such as sets, multisets, and associative arrays. Some of their disadvantages are as follows:

• The shape of the binary search tree depends entirely on the order of insertions and deletions, and can become degenerate.

• When inserting or searching for an element in a binary search tree, the key of each visited node has to be compared with the key of the element to be inserted or found.

• The keys in the binary search tree may be long and the run time may increase.

• After a long intermixed sequence of random insertion and deletion, the expected height of the tree approaches the square root of the number of keys, √n, which grows much faster than log n.

Order relation

Binary search requires an order relation by which every element (item) can be compared with every other element in the sense of a total preorder. The part of the element which effectively takes part in the comparison is called its key. Whether duplicates (different elements with the same key) are allowed in the tree does not depend on the order relation, but on the application only.

In the context of binary search trees, a total preorder is realized most flexibly by means of a three-way comparison subroutine.

6.2.2 Operations

Binary search trees support three main operations: insertion of elements, deletion of elements, and lookup (checking whether a key is present).

Searching

Searching a binary search tree for a specific key can be programmed recursively or iteratively.

We begin by examining the root node. If the tree is null, the key we are searching for does not exist in the tree. Otherwise, if the key equals that of the root, the search is successful and we return the node. If the key is less than that of the root, we search the left subtree. Similarly, if the key is greater than that of the root, we search the right subtree. This process is repeated until the key is found or the remaining subtree is null. If the searched key is not found after a null subtree is reached, then the key is not present in the tree. This is easily expressed as a recursive algorithm (implemented in Python):

    def search_recursively(key, node):
        if node is None or node.key == key:
            return node
        elif key < node.key:
            return search_recursively(key, node.left)
        else:  # key > node.key
            return search_recursively(key, node.right)

The same algorithm can be implemented iteratively:

    def search_iteratively(key, node):
        current_node = node
        while current_node is not None:
            if key == current_node.key:
                return current_node
            elif key < current_node.key:
                current_node = current_node.left
            else:  # key > current_node.key
                current_node = current_node.right
        return None

These two examples rely on the order relation being a total order.

If the order relation is only a total preorder, a reasonable extension of the functionality is the following: also in case of equality, search down to the leaves in a direction specifiable by the user. A binary search tree equipped with such a comparison function becomes stable.

Because in the worst case this algorithm must search from the root of the tree to the leaf farthest from the root, the search operation takes time proportional to the tree’s height (see tree terminology). On average, binary search trees with n nodes have O(log n) height.[lower-alpha 1] However, in the worst case, binary search trees can have O(n) height, when the unbalanced tree resembles a linked list (degenerate tree).

Insertion

Insertion begins as a search would begin; if the key is not equal to that of the root, we search the left or right subtrees as before. Eventually, we will reach an external node and add the new key-value pair (here encoded as a record 'newNode') as its right or left child, depending on the node’s key. In other words, we examine the root and recursively insert the new node into the left subtree if its key is less than that of the root, or into the right subtree if its key is greater than or equal to the root.

Here’s how a typical binary search tree insertion might be performed in C++:

    void insert(Node*& root, int key, int value) {
        if (!root)
            root = new Node(key, value);
        else if (key < root->key)
            insert(root->left, key, value);
        else  // key >= root->key
            insert(root->right, key, value);
    }

The above destructive procedural variant modifies the tree in place. It uses only constant heap space (and the iterative version uses constant stack space as well), but the prior version of the tree is lost. Alternatively, as in the following Python example, we can reconstruct all ancestors of the inserted node; any reference to the original tree root remains valid, making the tree a persistent data structure:

    def binary_tree_insert(node, key, value):
        if node is None:
            return NodeTree(None, key, value, None)
        if key == node.key:
            return NodeTree(node.left, key, value, node.right)
        if key < node.key:
            return NodeTree(binary_tree_insert(node.left, key, value), node.key, node.value, node.right)
        else:
            return NodeTree(node.left, node.key, node.value, binary_tree_insert(node.right, key, value))

The part that is rebuilt uses O(log n) space in the average case and O(n) in the worst case.

In either version, this operation requires time proportional to the height of the tree in the worst case, which is O(log n) time in the average case over all trees, but O(n) time in the worst case.

Another way to explain insertion is that in order to insert a new node in the tree, its key is first compared with that of the root. If its key is less than the root’s, it is then compared with the key of the root’s left child. If its key is greater, it is compared with the root’s right child. This process continues until the new node is compared with a leaf node, and then it is added as this node’s right or left child, depending on its key: if the key is less than the leaf’s key, then it is inserted as the leaf’s left child, otherwise as the leaf’s right child.

There are other ways of inserting nodes into a binary tree, but this is the only way of inserting nodes at the leaves while at the same time preserving the BST structure.

Deletion

There are three possible cases to consider:

• Deleting a node with no children: simply remove the node from the tree.

• Deleting a node with one child: remove the node and replace it with its child.

• Deleting a node with two children: call the node to be deleted N. Do not delete N. Instead, choose either its in-order successor node or its in-order predecessor node, R. Copy the value of R to N, then recursively call delete on R until reaching one of the first two cases. If the in-order successor is chosen, then (because N has two children) its right subtree is not NIL, so the in-order successor is the node with the least value in that right subtree; it has at most one subtree of its own, so deleting it falls into one of the first two cases.

Broadly speaking, nodes with children are harder to delete. As with all binary trees, a node’s in-order successor is its right subtree’s left-most child, and a node’s in-order predecessor is the left subtree’s right-most child. In either case, this node will have zero or one children. Delete it according to one of the two simpler cases above.

Figure: Deleting a node with two children from a binary search tree. First the rightmost node in the left subtree, the in-order predecessor (6), is identified. Its value is copied into the node being deleted. The in-order predecessor can then be easily deleted because it has at most one child. The same method works symmetrically using the in-order successor (9).

Consistently using the in-order successor or the in-order predecessor for every instance of the two-child case can lead to an unbalanced tree, so some implementations select one or the other at different times.

Runtime analysis: although this operation does not always traverse the tree down to a leaf, this is always a possibility; thus in the worst case it requires time proportional to the height of the tree. It does not require more even when the node has two children, since it still follows a single path and does not visit any node twice.

The following Python code implements deletion (as methods on a node class with key, parent, left_child and right_child attributes):

    def find_min(self):  # Gets minimum node in a subtree
        current_node = self
        while current_node.left_child:
            current_node = current_node.left_child
        return current_node

    def replace_node_in_parent(self, new_value=None):
        if self.parent:
            if self == self.parent.left_child:
                self.parent.left_child = new_value
            else:
                self.parent.right_child = new_value
        if new_value:
            new_value.parent = self.parent

    def binary_tree_delete(self, key):
        if key < self.key:
            self.left_child.binary_tree_delete(key)
        elif key > self.key:
            self.right_child.binary_tree_delete(key)
        else:  # delete the key here
            if self.left_child and self.right_child:  # if both children are present
                successor = self.right_child.find_min()
                self.key = successor.key
                successor.binary_tree_delete(successor.key)
            elif self.left_child:  # if the node has only a *left* child
                self.replace_node_in_parent(self.left_child)
            elif self.right_child:  # if the node has only a *right* child
                self.replace_node_in_parent(self.right_child)
            else:  # this node has no children
                self.replace_node_in_parent(None)

Traversal

Main article: Tree traversal

Once the binary search tree has been created, its elements can be retrieved in-order by recursively traversing the left subtree of the root node, accessing the node itself, then recursively traversing the right subtree of the node, continuing this pattern with each node in the tree as it is recursively accessed. As with all binary trees, one may conduct a pre-order traversal or a post-order traversal, but neither is likely to be useful for binary search trees. An in-order traversal of a binary search tree will always result in a sorted list of node items (numbers, strings or other comparable items).

The code for in-order traversal in Python is given below. It will call callback (some function the programmer wishes to call on the node’s value, such as printing to the screen) for every node in the tree.

    def traverse_binary_tree(node, callback):
        if node is None:
            return
        traverse_binary_tree(node.leftChild, callback)
        callback(node.value)
        traverse_binary_tree(node.rightChild, callback)

Traversal requires O(n) time, since it must visit every node; since any traversal must visit every node, this is asymptotically optimal.

Traversal can also be implemented iteratively. For certain applications, e.g. greater-equal search or approximative search, an operation for single-step (iterative) traversal can be very useful. This is, of course, implemented without the callback construct and takes O(1) time on average and O(log n) in the worst case.

Verification

Sometimes we already have a binary tree, and we need to determine whether it is a BST. This problem has a simple recursive solution.

The BST property, that every node in the right subtree has to be larger than the current node and every node in the left subtree has to be smaller than (or equal to, though this should not occur if only unique values are stored; it also raises the question of whether such nodes should be to the left or right of the parent) the current node, is the key to figuring out whether a tree is a BST or not. The greedy algorithm, simply traversing the tree and at every node checking whether the node contains a value larger than the value at the left child and smaller than the value at the right child, does not work for all cases. Consider the following tree:

        20
       /  \
     10    30
          /  \
         5    40

In the tree above, each node meets the condition that it contains a value larger than its left child and smaller than its right child, and yet it is not a BST: the value 5 is in the right subtree of the node containing 20, a violation of the BST property.

Instead of making a decision based solely on the values of a node and its children, we also need information flowing down from the parent as well. In the case of the tree above, if we could remember about the node containing the value 20, we would see that the node with value 5 is violating the BST property contract.

So the condition we need to check at each node is:

• if the node is the left child of its parent, then it must be smaller than (or equal to) the parent, and it must pass down the value from its parent to its right subtree to make sure none of the nodes in that subtree is greater than the parent, and

• if the node is the right child of its parent, then it must be larger than the parent, and it must pass down the value from its parent to its left subtree to make sure none of the nodes in that subtree is lesser than the parent.

A recursive solution in C++ can explain this further:

    struct TreeNode {
        int key;
        int value;
        struct TreeNode *left;
        struct TreeNode *right;
    };

    bool isBST(struct TreeNode *node, int minKey, int maxKey) {
        if (node == NULL) return true;
        if (node->key < minKey || node->key > maxKey) return false;
        return isBST(node->left, minKey, node->key) && isBST(node->right, node->key, maxKey);
    }

The initial call to this function can be something like this:

    if (isBST(root, INT_MIN, INT_MAX)) {
        puts("This is a BST.");
    } else {
        puts("This is NOT a BST!");
    }

Essentially we keep creating a valid range (starting from [MIN_VALUE, MAX_VALUE]) and keep shrinking it down for each node as we go down recursively.

As pointed out in section #Traversal, an in-order traversal of a binary search tree returns the nodes sorted. Thus we only need to keep the last visited node while traversing the tree and check whether its key is smaller (or smaller/equal, if duplicates are to be allowed in the tree) compared to the current key.

6.2.3 Examples of applications

Some examples shall illustrate the use of the above basic building blocks.

Sort

Main article: Tree sort

A binary search tree can be used to implement a simple sorting algorithm. Similar to heapsort, we insert all the values we wish to sort into a new ordered data structure, in this case a binary search tree, and then traverse it in order.

The worst-case time of build_binary_tree is O(n²): if you feed it a sorted list of values, it chains them into a linked list with no left subtrees. For example, build_binary_tree([1, 2, 3, 4, 5]) yields the tree (1 (2 (3 (4 (5))))).

There are several schemes for overcoming this flaw with simple binary trees; the most common is the self-balancing binary search tree. If this same procedure is done using such a tree, the overall worst-case time is O(n log n), which is asymptotically optimal for a comparison sort. In practice, the added overhead in time and space for a tree-based sort (particularly for node allocation) makes it inferior to other asymptotically optimal sorts such as heapsort for static list sorting. On the other hand, it is one of the most efficient methods of incremental sorting, adding items to a list over time while keeping the list sorted at all times.
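The build_binary_tree routine referred to above is not reproduced in this section, so here is one hedged Python sketch of the whole tree sort (the Node class and function names are illustrative reconstructions, not the original code):

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def insert(root, key):
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def build_binary_tree(values):
        root = None
        for v in values:
            root = insert(root, v)
        return root

    def in_order(node, out):
        # In-order traversal appends keys in sorted order.
        if node is not None:
            in_order(node.left, out)
            out.append(node.key)
            in_order(node.right, out)

    def tree_sort(values):
        out = []
        in_order(build_binary_tree(values), out)
        return out

    print(tree_sort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]

Feeding build_binary_tree already-sorted input makes every insertion walk the full right spine, which is exactly the O(n²) degenerate case described above.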

Priority queue operations

Binary search trees can serve as priority queues: structures that allow insertion of an arbitrary key as well as lookup and deletion of the minimum (or maximum) key. Insertion works as previously explained. Find-min walks the tree, following left pointers as far as it can without hitting a leaf:

    // Precondition: T is not a leaf
    function find-min(T):
        while hasLeft(T):
            T ← left(T)
        return key(T)

Find-max is analogous: follow right pointers as far as possible. Delete-min (max) can simply look up the minimum (maximum), then delete it. This way, insertion and deletion both take logarithmic time, just as they do in a binary heap, but unlike a binary heap and most other priority queue implementations, a single tree can support all of find-min, find-max, delete-min and delete-max at the same time, making binary search trees suitable as double-ended priority queues.[2]:156
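As a quick Python illustration, reusing the hypothetical Node/build_binary_tree helpers from the tree sort sketch above, the two walks are:

    def find_min(t):
        # Follow left pointers from the root; t must not be None.
        while t.left is not None:
            t = t.left
        return t.key

    def find_max(t):
        # Symmetric: follow right pointers as far as possible.
        while t.right is not None:
            t = t.right
        return t.key

    root = build_binary_tree([5, 2, 8, 1, 9])
    print(find_min(root), find_max(root))   # 1 9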

6.2.4 Types

There are many types of binary search trees. AVL trees and red–black trees are both forms of self-balancing binary search trees. A splay tree is a binary search tree that automatically moves frequently accessed elements nearer to the root. In a treap (tree heap), each node also holds a (randomly chosen) priority, and the parent node has higher priority than its children. Tango trees are trees optimized for fast searches. T-trees are binary search trees optimized to reduce storage space overhead, widely used for in-memory databases.

A degenerate tree is a tree where, for each parent node, there is only one associated child node. It is unbalanced and, in the worst case, performance degrades to that of a linked list. If your add-node function does not handle re-balancing, then you can easily construct a degenerate tree by feeding it data that is already sorted. What this means is that in a performance measurement, the tree will essentially behave like a linked list data structure.

Performance comparisons

D. A. Heger (2004)[3] presented a performance comparison of binary search trees. Treap was found to have the best average performance, while red–black tree was found to have the smallest amount of variation in performance.

Optimal binary search trees

Main article: Optimal binary search tree

Figure: Tree rotations are very common internal operations in binary trees to keep perfect, or near-to-perfect, internal balance in the tree.

If we do not plan on modifying a search tree, and we know exactly how often each item will be accessed, we can construct[4] an optimal binary search tree, which is a search tree where the average cost of looking up an item (the expected search cost) is minimized.

Even if we only have estimates of the search costs, such a system can considerably speed up lookups on average. For example, if you have a BST of English words used in a spell checker, you might balance the tree based on word frequency in text corpora, placing words like "the" near the root and words like "agerasia" near the leaves. Such a tree might be compared with Huffman trees, which similarly seek to place frequently used items near the root in order to produce a dense information encoding; however, Huffman trees store data elements only in leaves, and these elements need not be ordered.

If we do not know the sequence in which the elements in the tree will be accessed in advance, we can use splay trees, which are asymptotically as good as any static search tree we can construct for any particular sequence of lookup operations.

Alphabetic trees are Huffman trees with the additional constraint on order, or, equivalently, search trees with the modification that all elements are stored in the leaves. Faster algorithms exist for optimal alphabetic binary trees (OABTs).

6.2.5 See also

• Search tree
• Binary search algorithm
• Randomized binary search tree
• Tango trees
• Self-balancing binary search tree
• Geometry of binary search trees
• Red–black tree
• AVL trees
• Day–Stout–Warren algorithm

6.2.6 Notes

[1] The notion of an average BST is made precise as follows. Let a random BST be one built using only insertions out of a sequence of unique elements in random order (all permutations equally likely); then the expected height of the tree is O(log n). If deletions are allowed as well as insertions, “little is known about the average height of a binary search tree”.[1]:300

6.2.7 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.

[2] Mehlhorn, Kurt; Sanders, Peter (2008). Algorithms and Data Structures: The Basic Toolbox (PDF). Springer.

[3] Heger, Dominique A. (2004), “A Disquisition on The Performance Behavior of Binary Search Tree Data Structures” (PDF), European Journal for the Informatics Professional, 5 (5): 67–75.

[4] Gonnet, Gaston. “Optimal Binary Search Trees”. Scientific Computation. ETH Zürich. Retrieved 1 December 2013.

6.2.8 Further reading

• This article incorporates public domain material from the NIST document: Black, Paul E. “Binary Search Tree”. Dictionary of Algorithms and Data Structures.

• Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). “12: Binary search trees, 15.5: Optimal binary search trees”. Introduction to Algorithms (2nd ed.). MIT Press & McGraw-Hill. pp. 253–272, 356–363. ISBN 0-262-03293-7.

• Jarc, Duane J. (3 December 2005). “Binary Tree Traversals”. Interactive Data Structure Visualizations. University of Maryland.

• Knuth, Donald (1997). “6.2.2: Binary Tree Searching”. The Art of Computer Programming. 3: “Sorting and Searching” (3rd ed.). Addison-Wesley. pp. 426–458. ISBN 0-201-89685-0.

• Long, Sean. “Binary Search Tree” (PPT). Data Structures and Algorithms Visualization - A PowerPoint Slides Based Approach. SUNY Oneonta.

• Parlante, Nick (2001). “Binary Trees”. CS Education Library. Stanford University.

6.2.9 External links

• Literate implementations of binary search trees in various languages on LiteratePrograms
• Binary Tree Visualizer (JavaScript animation of various BT-based data structures)
• Kovac, Kubo. “Binary Search Trees” (Java applet). Korešpondenčný seminár z programovania.
• Madru, Justin (18 August 2009). “Binary Search Tree”. JDServer. C++ implementation.
• Binary Search Tree Example in Python
• “References to Pointers (C++)”. MSDN. Microsoft. 2005. Gives an example binary tree implementation.

6.3 Random binary tree

In computer science and probability theory, a random binary tree refers to a binary tree selected at random from some probability distribution on binary trees. Two different distributions are commonly used: binary trees formed by inserting nodes one at a time according to a random permutation, and binary trees chosen from a uniform discrete distribution in which all distinct trees are equally likely. It is also possible to form other distributions, for instance by repeated splitting. Adding and removing nodes directly in a random binary tree will in general disrupt its random structure, but the treap and related randomized binary search tree data structures use the principle of binary trees formed from a random permutation in order to maintain a balanced binary search tree dynamically as nodes are inserted and deleted.

For random trees that are not necessarily binary, see random tree.

6.3.1 Binary trees from random permutations

For any set of numbers (or, more generally, values from some total order), one may form a binary search tree in which each number is inserted in sequence as a leaf of the tree, without changing the structure of the previously inserted numbers. The position into which each number should be inserted is uniquely determined by a binary search in the tree formed by the previous numbers. For instance, if the three numbers (1,3,2) are inserted into a tree in that sequence, the number 1 will sit at the root of the tree, the number 3 will be placed as its right child, and the number 2 as the left child of the number 3. There are six different permutations of the numbers (1,2,3), but only five trees may be constructed from them. That is because the permutations (2,1,3) and (2,3,1) form the same tree.
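To make the construction concrete, here is a minimal Python sketch that inserts each value as a new leaf without any restructuring, reproducing the (1,3,2) example above; the dictionary representation and the function names are illustrative choices of ours, not part of the original text.

    def insert(root, key):
        """Insert key as a new leaf of the BST rooted at root, leaving the
        structure of previously inserted keys unchanged."""
        if root is None:
            return {"key": key, "left": None, "right": None}
        side = "left" if key < root["key"] else "right"
        root[side] = insert(root[side], key)
        return root

    def tree_from_permutation(values):
        root = None
        for v in values:
            root = insert(root, v)
        return root

    # Inserting (1, 3, 2): 1 becomes the root, 3 its right child,
    # and 2 the left child of 3, as described above.
    t = tree_from_permutation([1, 3, 2])
    assert t["key"] == 1
    assert t["right"]["key"] == 3
    assert t["right"]["left"]["key"] == 2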
Expected depth of a node

For any fixed choice of a value x in a given set of n numbers, if one randomly permutes the numbers and forms a binary tree from them as described above, the expected value of the length of the path from the root of the tree to x is at most 2 log n + O(1), where “log” denotes the natural logarithm function and the O introduces big O notation. For, the expected number of ancestors of x is by linearity of expectation equal to the sum, over all other values y in the set, of the probability that y is an ancestor of x. And a value y is an ancestor of x exactly when y is the first element to be inserted from the elements in the interval [x,y]. Thus, the values that are adjacent to x in the sorted sequence of values have probability 1/2 of being an ancestor of x, the values one step away have probability 1/3, etc. Adding these probabilities for all positions in the sorted sequence gives twice a Harmonic number, leading to the bound above. A bound of this form holds also for the expected search length of a path to a fixed value x that is not part of the given set.[1]

The longest path

Although not as easy to analyze as the average path length, there has also been much research on determining the expectation (or high probability bounds) of the length of the longest path in a binary search tree generated from a random insertion order. It is now known that this length, for a tree with n nodes, is almost surely
(1/β) log n ≈ 4.311 log n,

where β is the unique number in the range 0 < β < 1 satisfying the equation

2β e^(1−β) = 1.[2]

Expected number of leaves

In the random permutation model, each of the numbers from the set of numbers used to form the tree, except for the smallest and largest of the numbers, has probability 1/3 of being a leaf in the tree, for it is a leaf when it is inserted after its two neighbors, and any of the six permutations of these two neighbors and it are equally likely. By similar reasoning, the smallest and largest of the numbers have probability 1/2 of being a leaf. Therefore, the expected number of leaves is the sum of these probabilities, which for n ≥ 2 is exactly (n + 1)/3.

Treaps and randomized binary search trees

In applications of binary search tree data structures, it is rare for the values in the tree to be inserted without deletion in a random order, limiting the direct applications of random binary trees. However, algorithm designers have devised data structures that allow insertions and deletions to be performed in a binary search tree, at each step maintaining as an invariant the property that the shape of the tree is a random variable with the same distribution as a random binary search tree.

If a given set of ordered numbers is assigned numeric priorities (distinct numbers unrelated to their values), these priorities may be used to construct a Cartesian tree for the numbers, a binary tree that has as its inorder traversal sequence the sorted sequence of the numbers and that is heap-ordered by priorities. Although more efficient construction algorithms are known, it is helpful to think of a Cartesian tree as being constructed by inserting the given numbers into a binary search tree in priority order. Thus, by choosing the priorities either to be a set of independent random real numbers in the unit interval, or by choosing them to be a random permutation of the numbers from 1 to n (where n is the number of nodes in the tree), and by maintaining the heap ordering property using tree rotations after any insertion or deletion of a node, it is possible to maintain a data structure that behaves like a random binary search tree. Such a data structure is known as a treap or a randomized binary search tree.[3]

6.3.2 Uniformly random binary trees

The number of binary trees with n nodes is a Catalan number: for n = 1, 2, 3, ... these numbers of trees are

1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, … (sequence A000108 in the OEIS).

Thus, if one of these trees is selected uniformly at random, its probability is the reciprocal of a Catalan number. Trees in this model have expected depth proportional to the square root of n, rather than to the logarithm;[4] however, the Strahler number of a uniformly random binary tree, a more sensitive measure of the distance from a leaf in which a node has Strahler number i whenever it has either a child with that number or two children with number i − 1, is with high probability logarithmic.[5]

Due to their large heights, this model of equiprobable random trees is not generally used for binary search trees, but it has been applied to problems of modeling the parse trees of algebraic expressions in compiler design[6] (where the above-mentioned bound on Strahler number translates into the number of registers needed to evaluate an expression[7]) and for modeling evolutionary trees.[8] In some cases the analysis of random binary trees under the random permutation model can be automatically transferred to the uniform model.[9]
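As a quick check of the counting claim, here is a small Python sketch (the function name is ours) that computes Catalan numbers with the closed form C(2n, n)/(n + 1) and compares them against the sequence quoted above.

    from math import comb

    def catalan(n):
        # Number of distinct binary trees with n nodes.
        return comb(2 * n, n) // (n + 1)

    assert [catalan(n) for n in range(1, 11)] == [
        1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796]

    # In the uniform model, any particular n-node shape is drawn
    # with probability 1 / catalan(n).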

6.3.3 Random split trees

Devroye & Kruszewski (1996) generate random binary trees with n nodes by generating a real-valued random variable x in the unit interval (0,1), assigning the first xn nodes (rounded down to an integer number of nodes) to the left subtree, the next node to the root, and the remaining nodes to the right subtree, and continuing recursively in each subtree. If x is chosen uniformly at random in the interval, the result is the same as the random binary search tree generated by a random permutation of the nodes, as any node is equally likely to be chosen as root; however, this formulation allows other distributions to be used instead. For instance, in the uniformly random binary tree model, once a root is fixed each of its two subtrees must also be uniformly random, so the uniformly random model may also be generated by a different choice of distribution for x. As Devroye and Kruszewski show, by choosing a beta distribution on x and by using an appropriate choice of shape to draw each of the branches, the mathematical trees generated by this process can be used to create realistic-looking botanical trees.
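A minimal sketch of this splitting process in Python, using our own dictionary representation (not prescribed by the text); passing a different draw function, e.g. a beta-distributed one, yields the other distributions mentioned above.

    import random

    def random_split_tree(n, draw=random.random):
        """Build a random n-node binary tree by repeated splitting: draw x in
        (0, 1), send floor(x * n) nodes to the left subtree, one node to the
        root, and the rest to the right subtree, then recurse."""
        if n == 0:
            return None
        x = draw()
        left_size = min(int(x * n), n - 1)  # clamp in case draw() returns 1.0
        return {"left": random_split_tree(left_size, draw),
                "right": random_split_tree(n - 1 - left_size, draw)}

    # With the uniform draw, each of the n nodes is equally likely to become
    # the root, matching the random permutation model described above.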

6.3.4 Notes

[1] Hibbard (1962); Knuth (1973); Mahmoud (1992), p. 75.

[2] Robson (1979); Pittel (1985); Devroye (1986); Mahmoud (1992), pp. 91–99; Reed (2003).

[3] Martinez & Roura (1998); Seidel & Aragon (1996).

[4] Knuth (2005), p. 15.

[5] Devroye & Kruszewski (1995). That it is at most logarithmic is trivial, because the Strahler number of every tree is bounded by the logarithm of the number of its nodes.

[6] Mahmoud (1992), p. 63.

[7] Flajolet, Raoult & Vuillemin (1979).

[8] Aldous (1996).

[9] Mahmoud (1992), p. 70.

6.3.5 References

• Aldous, David (1996), “Probability distributions on cladograms”, in Aldous, David; Pemantle, Robin, Random Discrete Structures, The IMA Volumes in Mathematics and its Applications, 76, Springer-Verlag, pp. 1–18.

• Devroye, Luc (1986), “A note on the height of binary search trees”, Journal of the ACM, 33 (3): 489–498, doi:10.1145/5925.5930.

• Devroye, Luc; Kruszewski, Paul (1995), “A note on the Horton-Strahler number for random trees”, Information Processing Letters, 56 (2): 95–99, doi:10.1016/0020-0190(95)00114-R.

• Devroye, Luc; Kruszewski, Paul (1996), “The botanical beauty of random binary trees”, in Brandenburg, Franz J., Graph Drawing: 3rd Int. Symp., GD'95, Passau, Germany, September 20-22, 1995, Lecture Notes in Computer Science, 1027, Springer-Verlag, pp. 166–177, doi:10.1007/BFb0021801, ISBN 3-540-60723-4.

• Drmota, Michael (2009), Random Trees: An Interplay between Combinatorics and Probability, Springer-Verlag, ISBN 978-3-211-75355-2.

• Flajolet, P.; Raoult, J. C.; Vuillemin, J. (1979), “The number of registers required for evaluating arithmetic expressions”, Theoretical Computer Science, 9 (1): 99–125, doi:10.1016/0304-3975(79)90009-4.

• Hibbard, Thomas N. (1962), “Some combinatorial properties of certain trees with applications to searching and sorting”, Journal of the ACM, 9 (1): 13–28, doi:10.1145/321105.321108.

• Knuth, Donald M. (1973), “6.2.2 Binary Tree Searching”, The Art of Computer Programming, III, Addison-Wesley, pp. 422–451.

• Knuth, Donald M. (2005), “Draft of Section 7.2.1.6: Generating All Trees”, The Art of Computer Programming, IV.

• Mahmoud, Hosam M. (1992), Evolution of Random Search Trees, John Wiley & Sons.

• Martinez, Conrado; Roura, Salvador (1998), “Randomized binary search trees”, Journal of the ACM, ACM Press, 45 (2): 288–323, doi:10.1145/274787.274812.

• Pittel, B. (1985), “Asymptotical growth of a class of random trees”, Annals of Probability, 13 (2): 414–427, doi:10.1214/aop/1176993000.

• Reed, Bruce (2003), “The height of a random binary search tree”, Journal of the ACM, 50 (3): 306–332, doi:10.1145/765568.765571.

• Robson, J. M. (1979), “The height of binary search trees”, Australian Computer Journal, 11: 151–153.

• Seidel, Raimund; Aragon, Cecilia R. (1996), “Randomized Search Trees”, Algorithmica, 16 (4/5): 464–497, doi:10.1007/s004539900061.

6.3.6 External links

• Open Data Structures - Chapter 7 - Random Binary Search Trees

6.4 Tree rotation

Generic tree rotations.

In discrete mathematics, tree rotation is an operation on a binary tree that changes the structure without interfering with the order of the elements. A tree rotation moves one node up in the tree and one node down. It is used to change the shape of the tree, and in particular to decrease its height by moving smaller subtrees down and larger subtrees up, resulting in improved performance of many tree operations.

There exists an inconsistency in different descriptions as to the definition of the direction of rotations. Some say that the direction of rotation reflects the direction that a node is moving upon rotation (a left child rotating into its parent’s location is a right rotation) while others say that the direction of rotation reflects which subtree is rotating (a left subtree rotating into its parent’s location is a left rotation, the opposite of the former). This article takes the approach of the directional movement of the rotating node.
6.4.1 Illustration

Animation of tree rotations taking place.

The right rotation operation as shown in the image to the left is performed with Q as the root and hence is a right rotation on, or rooted at, Q. This operation results in a rotation of the tree in the clockwise direction. The inverse operation is the left rotation, which results in a movement in a counter-clockwise direction (the left rotation shown above is rooted at P). The key to understanding how a rotation functions is to understand its constraints. In particular, the order of the leaves of the tree (when read left to right, for example) cannot change (another way to think of it is that the order in which the leaves would be visited in an in-order traversal must be the same after the operation as before). Another constraint is the main property of a binary search tree, namely that the right child is greater than the parent and the left child is less than the parent. Notice that the right child of a left child of the root of a sub-tree (for example node B in the diagram for the tree rooted at Q) can become the left child of the root, that itself becomes the right child of the “new” root in the rotated sub-tree, without violating either of those constraints. As you can see in the diagram, the order of the leaves doesn't change. The opposite operation also preserves the order and is the second kind of rotation.

Assuming this is a binary search tree, as stated above, the elements must be interpreted as variables that can be compared to each other. The alphabetic characters to the left are used as placeholders for these variables. In the animation to the right, capital alphabetic characters are used as variable placeholders while lowercase Greek letters are placeholders for an entire set of variables. The circles represent individual nodes and the triangles represent subtrees. Each subtree could be empty, consist of a single node, or consist of any number of nodes.

6.4.2 Detailed illustration

Pictorial description of how rotations are made.

When a subtree is rotated, the subtree side upon which it is rotated increases its height by one node while the other subtree decreases its height. This makes tree rotations useful for rebalancing a tree.

Use the terminology Root for the parent node of the subtrees to rotate, Pivot for the node which will become the new parent node, RS for the side of rotation and OS for the opposite side of the rotation. In the above diagram for the root Q, the RS is C and the OS is P. The pseudo code for the rotation is:

    Pivot = Root.OS
    Root.OS = Pivot.RS
    Pivot.RS = Root
    Root = Pivot

This is a constant time operation.

The programmer must also make sure that the root's parent points to the pivot after the rotation. Also, the programmer should note that this operation may result in a new root for the entire tree and take care to update pointers accordingly.
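The pseudo code above can be fleshed out as follows; this is a sketch assuming a simple Node class with left, right and parent fields (our own layout, not prescribed by the text), performing exactly the parent-pointer bookkeeping just described.

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = self.right = self.parent = None

    def rotate(root, rs):
        """Rotate the subtree at `root` toward side rs ('left' or 'right')
        and return the new subtree root (the pivot)."""
        os = "left" if rs == "right" else "right"   # opposite side
        pivot = getattr(root, os)                   # Pivot = Root.OS
        inner = getattr(pivot, rs)
        setattr(root, os, inner)                    # Root.OS = Pivot.RS
        if inner is not None:
            inner.parent = root
        pivot.parent = root.parent                  # re-attach where root hung
        if root.parent is not None:
            if root.parent.left is root:
                root.parent.left = pivot
            else:
                root.parent.right = pivot
        setattr(pivot, rs, root)                    # Pivot.RS = Root
        root.parent = pivot
        return pivot                                # Root = Pivot

When root.parent was None, the returned pivot is the new root of the entire tree, and the caller must update its own root pointer — exactly the caveat noted above.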

6.4.3 Inorder invariance

The tree rotation renders the inorder traversal of the binary tree invariant. This implies that the order of the elements is not affected when a rotation is performed in any part of the tree. Here are the inorder traversals of the trees shown above:

Left tree: ((A, P, B), Q, C)
Right tree: (A, P, (B, Q, C))

Computing one from the other is very simple. The following is example Python code that performs that computation:

    def right_rotation(treenode):
        left, Q, C = treenode
        A, P, B = left
        return (A, P, (B, Q, C))

Another way of looking at it is:

Right rotation of node Q:
Let P be Q's left child. Set Q's left child to be P's right child. [Set P's right-child's parent to Q.] Set P's right child to be Q. [Set Q's parent to P.]

Left rotation of node P:
Let Q be P's right child. Set P's right child to be Q's left child. [Set Q's left-child's parent to P.] Set Q's left child to be P. [Set P's parent to Q.]

All other connections are left as-is.

There are also double rotations, which are combinations of left and right rotations. A double left rotation at X can be defined to be a right rotation at the right child of X followed by a left rotation at X; similarly, a double right rotation at X can be defined to be a left rotation at the left child of X followed by a right rotation at X.

Tree rotations are used in a number of tree data structures such as AVL trees, red-black trees, splay trees, and treaps. They require only constant time because they are local transformations: they only operate on 5 nodes, and need not examine the rest of the tree.
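For symmetry, the inverse operation can be sketched in the same tuple style (the left_rotation name is ours); the round trip below checks that the leaf order, and hence the inorder traversal, is preserved.

    def right_rotation(treenode):          # as defined above
        left, Q, C = treenode
        A, P, B = left
        return (A, P, (B, Q, C))

    def left_rotation(treenode):
        # Inverse operation: (A, P, (B, Q, C)) -> ((A, P, B), Q, C)
        A, P, right = treenode
        B, Q, C = right
        return ((A, P, B), Q, C)

    # Round trip: the tuple of leaves is unchanged.
    tree = (("A", "P", "B"), "Q", "C")
    assert left_rotation(right_rotation(tree)) == tree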
6.4.4 Rotations for rebalancing

Pictorial description of how rotations cause rebalancing in an AVL tree.

A tree can be rebalanced using rotations. After a rotation, the side of the rotation increases its height by 1 whilst the side opposite the rotation decreases its height similarly. Therefore, one can strategically apply rotations to nodes whose left child and right child differ in height by more than 1. Self-balancing binary search trees apply this operation automatically. A type of tree which uses this rebalancing technique is the AVL tree.

6.4.5 Rotation distance

The rotation distance between any two binary trees with the same number of nodes is the minimum number of rotations needed to transform one into the other. With this distance, the set of n-node binary trees becomes a metric space: the distance is symmetric, positive when given two different trees, and satisfies the triangle inequality.

It is an open problem whether there exists a polynomial time algorithm for calculating rotation distance.

Daniel Sleator, Robert Tarjan and William Thurston showed that the rotation distance between any two n-node trees (for n ≥ 11) is at most 2n − 6, and that some pairs of trees are this far apart as soon as n is sufficiently large.[1] Lionel Pournin showed that, in fact, such pairs exist whenever n ≥ 11.[2]

6.4.6 See also

• AVL tree, red-black tree, and splay tree, kinds of binary search tree data structures that use rotations to maintain balance.
• Associativity of a binary operation means that performing a tree rotation on it does not change the final result.
• The Day–Stout–Warren algorithm balances an unbalanced BST.
• Tamari lattice, a partially ordered set in which the elements can be defined as binary trees and the ordering between elements is defined by tree rotation.

6.4.7 References

[1] Sleator, Daniel D.; Tarjan, Robert E.; Thurston, William P. (1988), “Rotation distance, triangulations, and hyperbolic geometry”, Journal of the American Mathematical Society, 1 (3): 647–681, doi:10.2307/1990951, JSTOR 1990951, MR 928904.

[2] Pournin, Lionel (2014), “The diameter of associahedra”, Advances in Mathematics, 259: 13–42, arXiv:1207.6296, doi:10.1016/j.aim.2014.02.035, MR 3197650.

6.4.8 External links

• Java applets demonstrating tree rotations
• The AVL Tree Rotations Tutorial (RTF) by John Hargrove

6.5 Self-balancing binary search tree

Tree rotations are very common internal operations on self-balancing binary trees to keep perfect or near-to-perfect balance.

In computer science, a self-balancing (or height-balanced) binary search tree is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.[1]

These structures provide efficient implementations for mutable ordered lists, and can be used for other abstract data structures such as associative arrays, priority queues and sets.

The red–black tree, which is a type of self-balancing binary search tree, was called symmetric binary B-tree[2] and was renamed but can still be confused with the generic concept of self-balancing binary search tree because of the initials.

6.5.1 Overview

An example of an unbalanced tree; following the path from the root to a node takes an average of 3.27 node accesses.

The same tree after being height-balanced; the average path effort decreased to 3.00 node accesses.

Most operations on a binary search tree (BST) take time directly proportional to the height of the tree, so it is desirable to keep the height small. A binary tree with height h can contain at most 2^0 + 2^1 + ··· + 2^h = 2^(h+1) − 1 nodes. It follows that for a tree with n nodes and height h:

n ≤ 2^(h+1) − 1

and that implies:

h ≥ ⌈log₂(n + 1) − 1⌉ ≥ ⌊log₂ n⌋.

In other words, the minimum height of a tree with n nodes is log₂(n), rounded down; that is, ⌊log₂ n⌋.[1]

However, the simplest algorithms for BST item insertion may yield a tree with height n in rather common situations. For example, when the items are inserted in sorted key order, the tree degenerates into a linked list with n nodes. The difference in performance between the two situations may be enormous: for n = 1,000,000, for example, the minimum height is ⌊log₂(1,000,000)⌋ = 19.

If the data items are known ahead of time, the height can be kept small, in the average sense, by adding values in a random order, resulting in a random binary search tree. However, there are many situations (such as online algorithms) where this randomization is not viable.

Self-balancing binary trees solve this problem by performing transformations on the tree (such as tree rotations) at key insertion times, in order to keep the height proportional to log₂(n). Although a certain overhead is involved, it may be justified in the long run by ensuring fast execution of later operations.

Maintaining the height always at its minimum value ⌊log₂ n⌋ is not always viable; it can be proven that any insertion algorithm which did so would have an excessive overhead. Therefore, most self-balanced BST algorithms keep the height within a constant factor of this lower bound.

In the asymptotic (“Big-O”) sense, a self-balancing BST structure containing n items allows the lookup, insertion, and removal of an item in O(log n) worst-case time, and ordered enumeration of all items in O(n) time. For some implementations these are per-operation time bounds, while for others they are amortized bounds over a sequence of operations. These times are asymptotically optimal among all data structures that manipulate the key only through comparisons.

6.5.2 Implementations

Popular data structures implementing this type of tree include:

• 2-3 tree
• AA tree
• AVL tree
• Red-black tree
• Scapegoat tree
• Splay tree
• Treap

6.5.3 Applications

Self-balancing binary search trees can be used in a natural way to construct and maintain ordered lists, such as priority queues. They can also be used for associative arrays; key-value pairs are simply inserted with an ordering based on the key alone. In this capacity, self-balancing BSTs have a number of advantages and disadvantages over their main competitor, hash tables. One advantage of self-balancing BSTs is that they allow fast (indeed, asymptotically optimal) enumeration of the items in key order, which hash tables do not provide. One disadvantage is that their lookup algorithms get more complicated when there may be multiple items with the same key. Self-balancing BSTs have better worst-case lookup performance than hash tables (O(log n) compared to O(n)), but have worse average-case performance (O(log n) compared to O(1)).

Self-balancing BSTs can be used to implement any algorithm that requires mutable ordered lists, to achieve optimal worst-case asymptotic performance. For example, if binary tree sort is implemented with a self-balanced BST, we have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment intersection problem and the point location problem efficiently. (For average-case performance, however, self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower than merge sort, quicksort, or heapsort, because of the tree-balancing overhead as well as cache access patterns.)

Self-balancing BSTs are flexible data structures, in that it's easy to extend them to efficiently record additional information or perform new operations. For example, one can record the number of nodes in each subtree having a certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n) time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.

6.5.4 See also

• Search data structure
• Day–Stout–Warren algorithm
• Fusion tree
• Skip list
• Sorting

6.5.5 References

[1] Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998. ISBN 0-201-89685-0. Section 6.2.3: Balanced Trees, pp. 458–481.

[2] Paul E. Black, “red-black tree”, in Dictionary of Algorithms and Data Structures [online], Vreda Pieterse and Paul E. Black, eds. 13 April 2015. (accessed 03 October 2016) Available from: http://www.nist.gov/dads/HTML/redblack.html

6.5.6 External links

• Dictionary of Algorithms and Data Structures: Height-balanced binary search tree
• GNU libavl, a LGPL-licensed library of binary tree implementations in C, with documentation

6.6 Treap

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

6.6.1 Description

A treap with alphabetic key and numeric max heap order

The treap was first described by Cecilia R. Aragon and Raimund Seidel in 1989;[1][2] its name is a portmanteau of tree and heap. It is a Cartesian tree in which each key is given a (randomly chosen) numeric priority. As with any binary search tree, the inorder traversal order of the nodes is the same as the sorted order of the keys. The structure of the tree is determined by the requirement that it be heap-ordered: that is, the priority number for any non-leaf node must be greater than or equal to the priority of its children. Thus, as with Cartesian trees more generally, the root node is the maximum-priority node, and its left and right subtrees are formed in the same manner from the subsequences of the sorted order to the left and right of that node.

An equivalent way of describing the treap is that it could be formed by inserting the nodes highest-priority-first into a binary search tree without doing any rebalancing. Therefore, if the priorities are independent random numbers (from a distribution over a large enough space of possible priorities to ensure that two nodes are very unlikely to have the same priority) then the shape of a treap has the same probability distribution as the shape of a random binary search tree, a search tree formed by inserting the nodes without rebalancing in a randomly chosen insertion order. Because random binary search trees are known to have logarithmic height with high probability, the same is true for treaps.

Aragon and Seidel also suggest assigning higher priorities to frequently accessed nodes, for instance by a process that, on each access, chooses a random number and replaces the priority of the node with that number if it is higher than the previous priority. This modification would cause the tree to lose its random shape; instead, frequently accessed nodes would be more likely to be near the root of the tree, causing searches for them to be faster.

Naor and Nissim[3] describe an application in maintaining authorization certificates in public-key cryptosystems.

6.6.2 Operations

Treaps support the following basic operations:

• To search for a given key value, apply a standard binary search algorithm in a binary search tree, ignoring the priorities.

• To insert a new key x into the treap, generate a random priority y for x. Binary search for x in the tree, and create a new node at the leaf position where the binary search determines a node for x should exist. Then, as long as x is not the root of the tree and has a larger priority number than its parent z, perform a tree rotation that reverses the parent-child relation between x and z (a sketch of this procedure follows the list).

• To delete a node x from the treap, if x is a leaf of the tree, simply remove it. If x has a single child z, remove x from the tree and make z be the child of the parent of x (or make z the root of the tree if x had no parent). Finally, if x has two children, swap its position in the tree with the position of its immediate successor z in the sorted order, resulting in one of the previous cases. In this final case, the swap may violate the heap-ordering property for z, so additional rotations may need to be performed to restore this property.
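Here is a compact Python sketch of the insertion just described — binary-search down, then rotate the new node up while it outranks its parent. The node class and helper names are our own, and the rotations happen on the way back up the recursion rather than via explicit parent pointers.

    import random

    class TreapNode:
        def __init__(self, key):
            self.key = key
            self.priority = random.random()   # the randomly generated priority y
            self.left = self.right = None

    def rotate_up(node, side):
        """Bring node's child on the side opposite `side` up into node's place."""
        opposite = "left" if side == "right" else "right"
        child = getattr(node, opposite)
        setattr(node, opposite, getattr(child, side))
        setattr(child, side, node)
        return child

    def treap_insert(root, key):
        if root is None:
            return TreapNode(key)
        if key < root.key:
            root.left = treap_insert(root.left, key)
            if root.left.priority > root.priority:    # heap order violated
                root = rotate_up(root, "right")       # rotate the new node up
        else:
            root.right = treap_insert(root.right, key)
            if root.right.priority > root.priority:
                root = rotate_up(root, "left")
        return root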

Bulk operations

In addition to the single-element insert, delete and lookup operations, several fast “bulk” operations have been defined on treaps: union, intersection and set difference. These rely on two helper operations, split and merge.

• To split a treap into two smaller treaps, those smaller than key x, and those larger than key x, insert x into the treap with maximum priority—larger than the priority of any node in the treap. After this insertion, x will be the root node of the treap, all values less than x will be found in the left subtreap, and all values greater than x will be found in the right subtreap. This costs as much as a single insertion into the treap.

• Merging two treaps that are the product of a former split, one can safely assume that the greatest value in the first treap is less than the smallest value in the second treap. Create a new node with value x, such that x is larger than this max-value in the first treap, and smaller than the min-value in the second treap, assign it the minimum priority, then set its left child to the first heap and its right child to the second heap. Rotate as necessary to fix the heap order. After that it will be a leaf node, and can easily be deleted. The result is one treap merged from the two original treaps. This is effectively “undoing” a split, and costs the same.

The union of two treaps t1 and t2, representing sets A and B, is a treap t that represents A ∪ B. The following recursive algorithm computes the union:

    function union(t1, t2):
        if t1 = nil: return t2
        if t2 = nil: return t1
        if priority(t1) < priority(t2): swap t1 and t2
        t<, t> ← split t2 on key(t1)
        return new node(key(t1), union(left(t1), t<), union(right(t1), t>))

Here, split is presumed to return two trees: one holding the keys less than its input key, one holding the greater keys. (The algorithm is non-destructive, but an in-place destructive version exists as well.)

The algorithm for intersection is similar, but requires the join helper routine. The complexity of each of union, intersection and difference is O(m log(n/m)) for treaps of sizes m and n, with m ≤ n. Moreover, since the recursive calls to union are independent of each other, they can be executed in parallel.[4]

6.6.3 Randomized binary search tree

The randomized binary search tree, introduced by Martínez and Roura subsequently to the work of Aragon and Seidel on treaps,[5] stores the same nodes with the same random distribution of tree shape, but maintains different information within the nodes of the tree in order to maintain its randomized structure.

Rather than storing random priorities on each node, the randomized binary search tree stores a small integer at each node, the number of its descendants (counting itself as one); these numbers may be maintained during tree rotation operations at only a constant additional amount of time per rotation. When a key x is to be inserted into a tree that already has n nodes, the insertion algorithm chooses with probability 1/(n + 1) to place x as the new root of the tree, and otherwise it calls the insertion procedure recursively to insert x within the left or right subtree (depending on whether its key is less than or greater than the root). The numbers of descendants are used by the algorithm to calculate the necessary probabilities for the random choices at each step. Placing x at the root of a subtree may be performed either as in the treap by inserting it at a leaf and then rotating it upwards, or by an alternative algorithm described by Martínez and Roura that splits the subtree into two pieces to be used as the left and right children of the new node.

The deletion procedure for a randomized binary search tree uses the same information per node as the insertion procedure, and like the insertion procedure it makes a sequence of O(log n) random decisions in order to join the two subtrees descending from the left and right children of the deleted node into a single tree. If the left or right subtree of the node to be deleted is empty, the join operation is trivial; otherwise, the left or right child of the deleted node is selected as the new subtree root with probability proportional to its number of descendants, and the join proceeds recursively.
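The root-insertion rule lends itself to a short sketch in the split-based style attributed to Martínez and Roura above; the dictionary layout, size bookkeeping and names are our assumptions.

    import random

    def size(t):
        return t["size"] if t else 0

    def node(key, left, right):
        return {"key": key, "left": left, "right": right,
                "size": 1 + size(left) + size(right)}

    def split(t, key):
        """Split t into trees holding keys smaller and larger than `key`."""
        if t is None:
            return None, None
        if key < t["key"]:
            smaller, larger = split(t["left"], key)
            return smaller, node(t["key"], larger, t["right"])
        smaller, larger = split(t["right"], key)
        return node(t["key"], t["left"], smaller), larger

    def rbst_insert(t, key):
        if random.randrange(size(t) + 1) == 0:        # probability 1/(n + 1)
            smaller, larger = split(t, key)
            return node(key, smaller, larger)         # key becomes subtree root
        if key < t["key"]:
            return node(t["key"], rbst_insert(t["left"], key), t["right"])
        return node(t["key"], t["left"], rbst_insert(t["right"], key))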
6.6.4 Comparison

The information stored per node in the randomized binary tree is simpler than in a treap (a small integer rather than a high-precision random number), but it makes a greater number of calls to the random number generator (O(log n) calls per insertion or deletion rather than one call per insertion) and the insertion procedure is slightly more complicated due to the need to update the numbers of descendants per node. A minor technical difference is that, in a treap, there is a small probability of a collision (two keys getting the same priority), and in both cases there will be statistical differences between a true random number generator and the pseudo-random number generator typically used on digital computers. However, in any case the differences between the theoretical model of perfect random choices used to design the algorithm and the capabilities of actual random number generators are vanishingly small.

Although the treap and the randomized binary search tree both have the same random distribution of tree shapes after each update, the history of modifications to the trees performed by these two data structures over a sequence of insertion and deletion operations may be different. For instance, in a treap, if the three numbers 1, 2, and 3 are inserted in the order 1, 3, 2, and then the number 2 is deleted, the remaining two nodes will have the same parent-child relationship that they did prior to the insertion of the middle number. In a randomized binary search tree, the tree after the deletion is equally likely to be either of the two possible trees on its two nodes, independently of what the tree looked like prior to the insertion of the middle number.

6.6.5 See also

• Finger search

6.6.6 References

[1] Aragon, Cecilia R.; Seidel, Raimund (1989), “Randomized Search Trees” (PDF), Proc. 30th Symp. Foundations of Computer Science (FOCS 1989), Washington, D.C.: IEEE Computer Society Press, pp. 540–545, doi:10.1109/SFCS.1989.63531, ISBN 0-8186-1982-1

[2] Seidel, Raimund; Aragon, Cecilia R. (1996), “Randomized Search Trees”, Algorithmica, 16 (4/5): 464–497, doi:10.1007/s004539900061

[3] Naor, M.; Nissim, K. (April 2000), “Certificate revocation and certificate update” (PDF), IEEE Journal on Selected Areas in Communications, 18 (4): 561–570, doi:10.1109/49.839932.

[4] Blelloch, Guy E.; Reid-Miller, Margaret (1998), “Fast set operations using treaps”, Proc. 10th ACM Symp. Parallel Algorithms and Architectures (SPAA 1998), New York, NY, USA: ACM, pp. 16–26, doi:10.1145/277651.277660, ISBN 0-89791-989-0.

[5] Martínez, Conrado; Roura, Salvador (1997), “Randomized binary search trees”, Journal of the ACM, 45 (2): 288–323, doi:10.1145/274787.274812

6.6.7 External links

• Collection of treap references and info by Cecilia Aragon
• Open Data Structures - Section 7.2 - Treap: A Randomized Binary Search Tree
• Treap Applet by Kubo Kovac
• Animated treap
• Randomized binary search trees. Lecture notes from a course by Jeff Erickson at UIUC. Despite the title, this is primarily about treaps and skip lists; randomized binary search trees are mentioned only briefly.
• A high performance key-value store based on treap by Junyi Sun
• VB6 implementation of treaps. Visual Basic 6 implementation of treaps as a COM object.
• ActionScript3 implementation of a treap
• Pure Python and Cython in-memory treap and duptreap
• Treaps in C#. By Roy Clemmons
• Pure Go in-memory, immutable treaps
• Pure Go persistent treap key-value storage library

6.7 AVL tree

Fig. 1: AVL tree with balance factors (green)

In computer science, an AVL tree is a self-balancing binary search tree. It was the first such data structure to be invented.[2] In an AVL tree, the heights of the two child subtrees of any node differ by at most one; if at any time they differ by more than one, rebalancing is done to restore this property. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where n is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.

The AVL tree is named after its two Soviet inventors, Georgy Adelson-Velsky and Evgenii Landis, who published it in their 1962 paper “An algorithm for the organization of information”.[3]

AVL trees are often compared with red–black trees because both support the same set of operations and take O(log n) time for the basic operations. For lookup-intensive applications, AVL trees are faster than red–black trees because they are more strictly balanced.[4] Similar to red–black trees, AVL trees are height-balanced. Both are in general not weight-balanced nor μ-balanced for any μ ≤ ½;[5] that is, sibling nodes can have hugely differing numbers of descendants.

6.7.1 Definition

Balance factor

In a binary tree the balance factor of a node N is defined to be the height difference

BalanceFactor(N) := –Height(LeftSubtree(N)) + Height(RightSubtree(N))[6]

of its two child subtrees. A binary tree is called an AVL tree if the invariant

BalanceFactor(N) ∈ {–1, 0, +1}

holds for every node N in the tree.
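As an illustration of the definition, here is a naive Python sketch (dictionary nodes; the names are our own) that recomputes heights from scratch; practical AVL implementations instead store and update the balance information, as discussed below.

    def height(t):
        # Height of a subtree; the empty subtree is given height -1.
        if t is None:
            return -1
        return 1 + max(height(t["left"]), height(t["right"]))

    def balance_factor(t):
        # BalanceFactor(N) = -Height(LeftSubtree(N)) + Height(RightSubtree(N))
        return height(t["right"]) - height(t["left"])

    def is_avl(t):
        """Check the AVL invariant BalanceFactor(N) in {-1, 0, +1} at every node."""
        if t is None:
            return True
        return (balance_factor(t) in (-1, 0, 1)
                and is_avl(t["left"]) and is_avl(t["right"]))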

A node N with BalanceFactor(N) < 0 is called “left-heavy”, one with BalanceFactor(N) > 0 is called “right-heavy”, and one with BalanceFactor(N) = 0 is sometimes simply called “balanced”.

Remark

In the sequel, because there is a one-to-one correspondence between nodes and the subtrees rooted by them, we sometimes leave it to the context whether the name of an object stands for the node or the subtree.

Properties

Balance factors can be kept up-to-date by knowing the previous balance factors and the change in height – it is not necessary to know the absolute height. For holding the AVL balance information, two bits per node are sufficient.[7]

The height h of an AVL tree with n nodes lies in the interval:[8]

log₂(n + 1) ≤ h < c log₂(n + 2) + b

with the golden ratio φ := (1 + √5)/2 ≈ 1.618, c := 1/log₂ φ ≈ 1.44, and b := (c/2) log₂ 5 − 2 ≈ −0.328. This is because an AVL tree of height h contains at least F(h+2) − 1 nodes, where {F(h)} is the Fibonacci sequence with the seed values F(1) = 1, F(2) = 1.

Data structure

According to the original paper “An algorithm for the organization of information”, AVL trees were invented as binary search trees. In that sense they are a data structure together with its major associated operations, namely search, insert and delete, which rely on and maintain the AVL property. In that sense, the AVL tree is a “self-balancing binary search tree”.

6.7.2 Operations

Read-only operations of an AVL tree involve carrying out the same actions as would be carried out on an unbalanced binary search tree, but modifications have to observe and restore the height balance of the subtrees.

Searching

Searching for a specific key in an AVL tree can be done the same way as that of a normal unbalanced binary search tree. In order for search to work effectively it has to employ a comparison function which establishes a total order (or at least a total preorder) on the set of keys. The number of comparisons required for successful search is limited by the height h and for unsuccessful search is very close to h, so both are in O(log n).

Traversal

Once a node has been found in an AVL tree, the next or previous node can be accessed in amortized constant time. Some instances of exploring these “nearby” nodes require traversing up to h ∝ log(n) links (particularly when navigating from the rightmost leaf of the root’s left subtree to the root or from the root to the leftmost leaf of the root’s right subtree; in the AVL tree of figure 1, moving from node P to the next but one node Q takes 3 steps). However, exploring all n nodes of the tree in this manner would visit each link exactly twice: one downward visit to enter the subtree rooted by that node, another visit upward to leave that node’s subtree after having explored it. And since there are n−1 links in any tree, the amortized cost is found to be 2×(n−1)/n, or approximately 2.

6.7.3 Comparison to other structures

Both AVL trees and red–black trees are self-balancing binary search trees and they are related mathematically. Indeed, every AVL tree can be colored red–black. The operations to balance the trees are different; both AVL trees and red–black trees require O(1) rotations in the worst case, while both also require O(log n) other updates (to colors or heights) in the worst case (though only O(1) amortized). AVL trees require storing 2 bits (or one trit) of information in each node, while red–black trees require just one bit per node. The bigger difference between the two data structures is their height limit.

For a tree of size n ≥ 1:

• an AVL tree’s height is at most

h ≤ c log₂(n + d) + b < c log₂(n + 2) + b

where φ := (1 + √5)/2 ≈ 1.618 is the golden ratio, c := 1/log₂ φ ≈ 1.44, b := (c/2) log₂ 5 − 2 ≈ −0.328, and d := 1 + 1/(φ⁴√5) ≈ 1.07;

• a red–black tree’s height is at most

h ≤ 2 log₂(n + 1).[9]

AVL trees are more rigidly balanced than red–black trees, leading to faster retrieval but slower insertion and deletion.
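The Fibonacci relationship behind these height bounds can be checked numerically; the sketch below (our own naming) computes the fewest nodes an AVL tree of a given height can have, via the recurrence N(h) = 1 + N(h−1) + N(h−2).

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def min_nodes(h):
        """Fewest nodes in an AVL tree of height h (h counted in levels)."""
        if h <= 0:
            return 0
        if h == 1:
            return 1
        return 1 + min_nodes(h - 1) + min_nodes(h - 2)

    def fib(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    # Matches the claim that an AVL tree of height h has at least F(h+2) - 1
    # nodes (with F(1) = F(2) = 1).
    assert all(min_nodes(h) == fib(h + 2) - 1 for h in range(1, 25))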

6.7.4 See also

• Trees
• Tree rotation
• Red–black tree
• Splay tree
• Scapegoat tree
• B-tree
• T-tree
• List of data structures

6.7.5 References

[1] Eric Alexander. “AVL Trees”.

[2] Robert Sedgewick, Algorithms, Addison-Wesley, 1983, ISBN 0-201-06672-6, page 199, chapter 15: Balanced Trees.

[3] Adelson-Velsky, Georgy; Landis, Evgenii (1962). “An algorithm for the organization of information”. Proceedings of the USSR Academy of Sciences (in Russian). 146: 263–266. English translation by Myron J. Ricci in Soviet Math. Doklady, 3:1259–1263, 1962.

[4] Pfaff, Ben (June 2004). “Performance Analysis of BSTs in System Software” (PDF). Stanford University.

[5] AVL trees are not weight-balanced? (meaning: AVL trees are not μ-balanced?) Thereby: a binary tree is called μ-balanced, with 0 ≤ μ ≤ ½, if for every node N the inequality

½ − μ ≤ (|Nl| + 1)/(|N| + 1) ≤ ½ + μ

holds and μ is minimal with this property. |N| is the number of nodes below the tree with N as root (including the root) and Nl is the left child node of N.

[6] Knuth, Donald E. (2000). Sorting and searching (2. ed., 6. printing, newly updated and rev. ed.). Boston [u.a.]: Addison-Wesley. p. 459. ISBN 0-201-89685-0.

[7] More precisely: if the AVL balance information is kept in the child nodes – with the meaning “when going upward there is an additional increment in height” – this can be done with one bit. Nevertheless, the modifying operations can be programmed more efficiently if the balance information can be checked with one test.

[8] Knuth, Donald E. (2000). Sorting and searching (2. ed., 6. printing, newly updated and rev. ed.). Boston [u.a.]: Addison-Wesley. p. 460. ISBN 0-201-89685-0.

[9] Red–black tree#Proof of asymptotic bounds

6.7.6 Further reading

• Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0. Pages 458–475 of section 6.2.3: Balanced Trees.

6.7.7 External links

• This article incorporates public domain material from the NIST document: Black, Paul E. “AVL Tree”. Dictionary of Algorithms and Data Structures.
• AVL tree demonstration (HTML5/Canvas)
• AVL tree demonstration (requires Flash)
• AVL tree demonstration (requires Java)

6.8 Red–black tree

A red–black tree is a kind of self-balancing binary search tree. Each node of the binary tree has an extra bit, and that bit is often interpreted as the color (red or black) of the node. These color bits are used to ensure the tree remains approximately balanced during insertions and deletions.[2]

Balance is preserved by painting each node of the tree with one of two colors (typically called 'red' and 'black') in a way that satisfies certain properties, which collectively constrain how unbalanced the tree can become in the worst case. When the tree is modified, the new tree is subsequently rearranged and repainted to restore the coloring properties. The properties are designed in such a way that this rearranging and recoloring can be performed efficiently.

The balancing of the tree is not perfect, but it is good enough to allow it to guarantee searching in O(log n) time, where n is the total number of elements in the tree. The insertion and deletion operations, along with the tree rearrangement and recoloring, are also performed in O(log n) time.[3]

Tracking the color of each node requires only 1 bit of information per node because there are only two colors. The tree does not contain any other data specific to its being a red–black tree so its memory footprint is almost identical to a classic (uncolored) binary search tree. In many cases the additional bit of information can be stored at no additional memory cost.

6.8.1 History

In 1972 Rudolf Bayer[4] invented a data structure that was a special order-4 case of a B-tree. These trees maintained all paths from root to leaf with the same number of nodes, creating perfectly balanced trees.

However, they were not binary search trees. Bayer called them a “symmetric binary B-tree” in his paper and later they became popular as 2-3-4 trees or just 2-4 trees.[5]

In a 1978 paper, “A Dichromatic Framework for Balanced Trees”,[6] Leonidas J. Guibas and Robert Sedgewick derived the red-black tree from the symmetric binary B-tree.[7] The color “red” was chosen because it was the best-looking color produced by the color laser printer available to the authors while working at Xerox PARC.[8] Another response from professor Guibas states that it was because of the red and black pens available to them to draw the trees.[9]

In 1993, Arne Andersson introduced the idea of a right-leaning tree to simplify insert and delete operations.[10]

In 1999, Chris Okasaki showed how to make the insert operation purely functional. Its balance function needed to take care of only 4 unbalanced cases and one default balanced case.[11]

The original algorithm used 8 unbalanced cases, but Cormen et al. (2001) reduced that to 6 unbalanced cases.[2] Sedgewick showed that the insert operation can be implemented in just 46 lines of Java code.[12][13] In 2008, Sedgewick proposed the left-leaning red–black tree, leveraging Andersson’s idea that simplified algorithms. Sedgewick originally allowed nodes whose two children are red, making his trees more like 2-3-4 trees, but later this restriction was added, making new trees more like 2-3 trees. Sedgewick implemented the insert algorithm in just 33 lines, significantly shortening his original 46 lines of code.[14][15]

6.8.2 Terminology

A red–black tree is a special type of binary tree, used in computer science to organize pieces of comparable data, such as text fragments or numbers.

The leaf nodes of red–black trees do not contain data. These leaves need not be explicit in computer memory—a null child pointer can encode the fact that this child is a leaf—but it simplifies some algorithms for operating on red–black trees if the leaves really are explicit nodes. To save memory, sometimes a single sentinel node performs the role of all leaf nodes; all references from internal nodes to leaf nodes then point to the sentinel node.

Red–black trees, like all binary search trees, allow efficient in-order traversal (that is: in the order Left–Root–Right) of their elements. The search-time results from the traversal from root to leaf, and therefore a balanced tree of n nodes, having the least possible tree height, results in O(log n) search time.

6.8.3 Properties

An example of a red–black tree

In addition to the requirements imposed on a binary search tree, the following must be satisfied by a red–black tree:[16]

1. A node is either red or black.

2. The root is black. This rule is sometimes omitted. Since the root can always be changed from red to black, but not necessarily vice versa, this rule has little effect on analysis.

3. All leaves (NIL) are black.

4. If a node is red, then both its children are black.

5. Every path from a given node to any of its descendant NIL nodes contains the same number of black nodes. Some definitions: the number of black nodes from the root to a node is the node’s black depth; the uniform number of black nodes in all paths from root to the leaves is called the black-height of the red–black tree.[17]

These constraints enforce a critical property of red–black trees: the path from the root to the farthest leaf is no more than twice as long as the path from the root to the nearest leaf. The result is that the tree is roughly height-balanced. Since operations such as inserting, deleting, and finding values require worst-case time proportional to the height of the tree, this theoretical upper bound on the height allows red–black trees to be efficient in the worst case, unlike ordinary binary search trees.

To see why this is guaranteed, it suffices to consider the effect of properties 4 and 5 together. For a red–black tree T, let B be the number of black nodes in property 5. Let the shortest possible path from the root of T to any leaf consist of B black nodes. Longer possible paths may be constructed by inserting red nodes. However, property 4 makes it impossible to insert more than one consecutive red node. Therefore, ignoring any black NIL leaves, the longest possible path consists of 2*B nodes, alternating black and red (this is the worst case). Counting the black NIL leaves, the longest possible path consists of 2*B−1 nodes.

The shortest possible path has all black nodes, and the longest possible path alternates between red and black nodes. Since all maximal paths have the same number of black nodes, by property 5, this shows that no path is more than twice as long as any other path.
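The argument above can be turned into a checker: the following Python sketch (dictionary nodes with a 'color' field; layout and names are our assumptions) verifies properties 1–5 and returns the black-height, treating None children as the black NIL leaves.

    def check_red_black(t, is_root=True):
        """Return the black-height of t if the red-black properties hold,
        otherwise raise ValueError.  None stands for a (black) NIL leaf."""
        if t is None:
            return 1                                   # property 3
        if t["color"] not in ("red", "black"):
            raise ValueError("property 1 violated")
        if is_root and t["color"] != "black":
            raise ValueError("property 2 violated")
        if t["color"] == "red":
            for child in (t["left"], t["right"]):
                if child is not None and child["color"] == "red":
                    raise ValueError("property 4 violated")
        left = check_red_black(t["left"], is_root=False)
        right = check_red_black(t["right"], is_root=False)
        if left != right:
            raise ValueError("property 5 violated")
        return left + (1 if t["color"] == "black" else 0)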
6.8.4 Analogy to B-trees of order 4

The same red–black tree as in the example above, seen as a B-tree.

A red–black tree is similar in structure to a B-tree of order[note 1] 4, where each node can contain between 1 and 3 values and (accordingly) between 2 and 4 child pointers. In such a B-tree, each node will contain only one value matching the value in a black node of the red–black tree, with an optional value before and/or after it in the same node, both matching an equivalent red node of the red–black tree.

One way to see this equivalence is to “move up” the red nodes in a graphical representation of the red–black tree, so that they align horizontally with their parent black node, by creating together a horizontal cluster. In the B-tree, or in the modified graphical representation of the red–black tree, all leaf nodes are at the same depth.

The red–black tree is then structurally equivalent to a B-tree of order 4, with a minimum fill factor of 33% of values per cluster and a maximum capacity of 3 values.

This B-tree type is still more general than a red–black tree though, as it allows ambiguity in a red–black tree conversion—multiple red–black trees can be produced from an equivalent B-tree of order 4. If a B-tree cluster contains only 1 value, it is the minimum, black, and has two child pointers. If a cluster contains 3 values, then the central value will be black and each value stored on its sides will be red. If the cluster contains two values, however, either one can become the black node in the red–black tree (and the other one will be red).

So the order-4 B-tree does not maintain which of the values contained in each cluster is the root black tree for the whole cluster and the parent of the other values in the same cluster. Despite this, the operations on red–black trees are more economical in time because you don't have to maintain the vector of values. It may be costly if values are stored directly in each node rather than being stored by reference. B-tree nodes, however, are more economical in space because you don't need to store the color attribute for each node. Instead, you have to know which slot in the cluster vector is used. If values are stored by reference, e.g. objects, null references can be used and so the cluster can be represented by a vector containing 3 slots for value pointers plus 4 slots for child references in the tree. In that case, the B-tree can be more compact in memory, improving data locality.

The same analogy can be made with B-trees with larger orders that can be structurally equivalent to a colored binary tree: you just need more colors. Suppose that you add blue; then the blue–red–black tree, defined like red–black trees but with the additional constraint that no two successive nodes in the hierarchy will be blue and all blue nodes will be children of a red node, becomes equivalent to a B-tree whose clusters will have at most 7 values in the following colors: blue, red, blue, black, blue, red, blue (for each cluster, there will be at most 1 black node, 2 red nodes, and 4 blue nodes).

For moderate volumes of values, insertions and deletions in a colored binary tree are faster compared to B-trees because colored trees don't attempt to maximize the fill factor of each horizontal cluster of nodes (only the minimum fill factor is guaranteed in colored binary trees, limiting the number of splits or junctions of clusters). B-trees will be faster for performing rotations (because rotations will frequently occur within the same cluster rather than with multiple separate nodes in a colored binary tree). For storing large volumes, however, B-trees will be much faster as they will be more compact by grouping several children in the same cluster where they can be accessed locally.

All optimizations possible in B-trees to increase the average fill factors of clusters are possible in the equivalent multicolored binary tree. Notably, maximizing the average fill factor in a structurally equivalent B-tree is the same as reducing the total height of the multicolored tree, by increasing the number of non-black nodes. The worst case occurs when all nodes in a colored binary tree are black; the best case occurs when only a third of them are black (and the other two thirds are red nodes).

Notes

[1] Using Knuth’s definition of order: the maximum number of children.

6.8.5 Applications and related data structures

Red–black trees offer worst-case guarantees for insertion time, deletion time, and search time. Not only does this make them valuable in time-sensitive applications such as real-time applications, but it makes them valuable building blocks in other data structures which provide worst-case guarantees; for example, many data structures used in computational geometry can be based on red–black trees, and the Completely Fair Scheduler used in current Linux kernels uses red–black trees.

6.8.5 Applications and related data structures

Red–black trees offer worst-case guarantees for insertion time, deletion time, and search time. Not only does this make them valuable in time-sensitive applications such as real-time applications, but it makes them valuable building blocks in other data structures which provide worst-case guarantees; for example, many data structures used in computational geometry can be based on red–black trees, and the Completely Fair Scheduler used in current Linux kernels uses red–black trees.

The AVL tree is another structure supporting O(log n) search, insertion, and removal. It is more rigidly balanced than red–black trees, leading to slower insertion and removal but faster retrieval. This makes it attractive for data structures that may be built once and loaded without reconstruction, such as language dictionaries (or program dictionaries, such as the opcodes of an assembler or interpreter).

Red–black trees are also particularly valuable in functional programming, where they are one of the most common persistent data structures, used to construct associative arrays and sets which can retain previous versions after mutations. The persistent version of red–black trees requires O(log n) space for each insertion or deletion, in addition to time.

For every 2-4 tree, there are corresponding red–black trees with data elements in the same order. The insertion and deletion operations on 2-4 trees are also equivalent to color-flipping and rotations in red–black trees. This makes 2-4 trees an important tool for understanding the logic behind red–black trees, and this is why many introductory algorithm texts introduce 2-4 trees just before red–black trees, even though 2-4 trees are not often used in practice.

In 2008, Sedgewick introduced a simpler version of the red–black tree called the left-leaning red–black tree[18] by eliminating a previously unspecified degree of freedom in the implementation. The LLRB maintains an additional invariant that all red links must lean left except during inserts and deletes. Red–black trees can be made isometric to either 2-3 trees[19] or 2-4 trees,[18] for any sequence of operations. The 2-4 tree isometry was described in 1978 by Sedgewick. With 2-4 trees, the isometry is resolved by a “color flip,” corresponding to a split, in which the red color of two children nodes leaves the children and moves to the parent node. The tango tree, a type of tree optimized for fast searches, usually uses red–black trees as part of its data structure.

In version 8 of Java, the Collection HashMap has been modified such that instead of using a LinkedList to store different elements with identical hashcodes, a red–black tree is used. This results in the improvement of time complexity of searching such an element from O(n) to O(log n).[20]

6.8.6 Operations

Read-only operations on a red–black tree require no modification from those used for binary search trees, because every red–black tree is a special case of a simple binary search tree. However, the immediate result of an insertion or removal may violate the properties of a red–black tree. Restoring the red–black properties requires a small number (O(log n) or amortized O(1)) of color changes (which are very quick in practice) and no more than three tree rotations (two for insertion). Although insert and delete operations are complicated, their times remain O(log n).

Insertion

Insertion begins by adding the node as any binary search tree insertion does and by coloring it red. Whereas in the binary search tree, we always add a leaf, in the red–black tree, leaves contain no information, so instead we add a red interior node, with two black leaves, in place of an existing black leaf.

What happens next depends on the color of other nearby nodes. The term uncle node will be used to refer to the sibling of a node’s parent, as in human family trees. Note that:

• property 3 (all leaves are black) always holds.

• property 4 (both children of every red node are black) is threatened only by adding a red node, repainting a black node red, or a rotation.

• property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is threatened only by adding a black node, repainting a red node black (or vice versa), or a rotation.

Notes

1. The label N will be used to denote the current node (colored red). In the diagrams N carries a blue contour. At the beginning, this is the new node being inserted, but the entire procedure may also be applied recursively to other nodes (see case 3). P will denote N's parent node, G will denote N's grandparent, and U will denote N's uncle. In between some cases, the roles and labels of the nodes are exchanged, but in each case, every label continues to represent the same node it represented at the beginning of the case.

2. If a node in the right (target) half of a diagram carries a blue contour it will become the current node in the next iteration and there the other nodes will be newly assigned relative to it. Any color shown in the diagram is either assumed in its case or implied by those assumptions.

3. A numbered triangle represents a subtree of unspecified depth. A black circle atop a triangle means that the black-height of that subtree is greater by one compared to a subtree without this circle.

There are several cases of red–black tree insertion to handle:

• N is the root node, i.e., first node of red–black tree

• N's parent (P) is black

• N's parent (P) and uncle (U) are red

• N is added to right of left child of grandparent, or N is added to left of right child of grandparent (P is red and U is black)

• N is added to left of left child of grandparent, or N is added to right of right child of grandparent (P is red and U is black)

Each case will be demonstrated with example C code. The uncle and grandparent nodes can be found by these functions:

struct node *grandparent(struct node *n)
{
    if ((n != NULL) && (n->parent != NULL))
        return n->parent->parent;
    else
        return NULL;
}

struct node *uncle(struct node *n)
{
    struct node *g = grandparent(n);
    if (g == NULL)
        return NULL; // No grandparent means no uncle
    if (n->parent == g->left)
        return g->right;
    else
        return g->left;
}

Case 1: The current node N is at the root of the tree. In this case, it is repainted black to satisfy property 2 (the root is black). Since this adds one black node to every path at once, property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is not violated.

void insert_case1(struct node *n)
{
    if (n->parent == NULL)
        n->color = BLACK;
    else
        insert_case2(n);
}

Case 2: The current node’s parent P is black, so property 4 (both children of every red node are black) is not invalidated. In this case, the tree is still valid. Property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is not threatened, because the current node N has two black leaf children; but because N is red, the paths through each of its children have the same number of black nodes as the path through the leaf it replaced, which was black, and so this property remains satisfied.

void insert_case2(struct node *n)
{
    if (n->parent->color == BLACK)
        return; /* Tree is still valid */
    else
        insert_case3(n);
}

Note: In the following cases it can be assumed that N has a grandparent node G, because its parent P is red, and if it were the root, it would be black. Thus, N also has an uncle node U, although it may be a leaf in cases 4 and 5.

/* Case 3: both the parent P and the uncle U are red. Repaint them
   black, repaint the grandparent G red, and repeat from case 1 at G. */
void insert_case3(struct node *n)
{
    struct node *u = uncle(n), *g;

    if ((u != NULL) && (u->color == RED)) {
        n->parent->color = BLACK;
        u->color = BLACK;
        g = grandparent(n);
        g->color = RED;
        insert_case1(g);
    } else {
        insert_case4(n);
    }
}

Note: In the remaining cases, it is assumed that the parent node P is the left child of its parent. If it is the right child, left and right should be reversed throughout cases 4 and 5. The code samples take care of this.

/* Case 4: P is red and U is black, and N is an "inner" grandchild
   (e.g. the right child of a left child). A rotation moves N into its
   parent's place, reducing the problem to case 5. */
void insert_case4(struct node *n)
{
    struct node *g = grandparent(n);

    if ((n == n->parent->right) && (n->parent == g->left)) {
        rotate_left(n->parent);
        /*
         * rotate_left can be the below because of already having
         * *g = grandparent(n):
         *
         * struct node *saved_p = g->left, *saved_left_n = n->left;
         * g->left = n;
         * n->left = saved_p;
         * saved_p->right = saved_left_n;
         *
         * and modify the parent's nodes properly.
         */
        n = n->left;
    } else if ((n == n->parent->left) && (n->parent == g->right)) {
        rotate_right(n->parent);
        /*
         * rotate_right can be the below to take advantage of already
         * having *g = grandparent(n):
         *
         * struct node *saved_p = g->right, *saved_right_n = n->right;
         * g->right = n;
         * n->right = saved_p;
         * saved_p->left = saved_right_n;
         */
        n = n->right;
    }
    insert_case5(n);
}

/* Case 5: P is red, U is black, and N is an "outer" grandchild
   (e.g. the left child of a left child). Rotate at the grandparent
   and exchange the colors of P and G. */
void insert_case5(struct node *n)
{
    struct node *g = grandparent(n);

    n->parent->color = BLACK;
    g->color = RED;
    if (n == n->parent->left)
        rotate_right(g);
    else
        rotate_left(g);
}

Note that inserting is actually in-place, since all the calls above use tail recursion.

In the algorithm above, all cases are chained in order, except in insert case 3 where it can recurse to case 1 back to the grandparent node: this is the only case where an iterative implementation will effectively loop. Because the problem of repair is escalated to the next higher level but one, it takes maximally h/2 iterations to repair the tree (where h is the height of the tree). Because the probability for escalation decreases exponentially with each iteration, the average insertion cost is constant.

Mehlhorn & Sanders (2008) point out: “AVL trees do not support constant amortized update costs”, but red–black trees do.[21]

Removal

In a regular binary search tree when deleting a node with two non-leaf children, we find either the maximum element in its left subtree (which is the in-order predecessor) or the minimum element in its right subtree (which is the in-order successor) and move its value into the node being deleted (as shown here). We then delete the node we copied the value from, which must have fewer than two non-leaf children. (Non-leaf children, rather than all children, are specified here because unlike normal binary search trees, red–black trees can have leaf nodes anywhere, so that all nodes are either internal nodes with two children or leaf nodes with, by definition, zero children. In effect, internal nodes having two leaf children in a red–black tree are like the leaf nodes in a regular binary search tree.) Because merely copying a value does not violate any red–black properties, this reduces to the problem of deleting a node with at most one non-leaf child. Once we have solved that problem, the solution applies equally to the case where the node we originally want to delete has at most one non-leaf child as to the case just considered where it has two non-leaf children.

Therefore, for the remainder of this discussion we address the deletion of a node with at most one non-leaf child. We use the label M to denote the node to be deleted; C will denote a selected child of M, which we will also call “its child”. If M does have a non-leaf child, call that its child, C; otherwise, choose either leaf as its child, C.

If M is a red node, we simply replace it with its child C, which must be black by property 4. (This can only occur when M has two leaf children, because if the red node M had a black non-leaf child on one side but just a leaf child on the other side, then the count of black nodes on both sides would be different, thus the tree would violate property 5.) All paths through the deleted node will simply pass through one fewer red node, and both the deleted node’s parent and child must be black, so property 3 (all leaves are black) and property 4 (both children of every red node are black) still hold.

Another simple case is when M is black and C is red. Simply removing a black node could break properties 4 (“Both children of every red node are black”) and 5 (“All paths from any given node to its leaf nodes contain the same number of black nodes”), but if we repaint C black, both of these properties are preserved.

The complex case is when both M and C are black. (This can only occur when deleting a black node which has two leaf children, because if the black node M had a black non-leaf child on one side but just a leaf child on the other side, then the count of black nodes on both sides would be different, thus the tree would have been an invalid red–black tree by violation of property 5.) We begin by replacing M with its child C. We will relabel this child C (in its new position) N, and its sibling (its new parent’s other child) S. (S was previously the sibling of M.) In the diagrams below, we will also use P for N's new parent (M's old parent), SL for S's left child, and SR for S's right child (S cannot be a leaf because if M and C were black, then P's one subtree which included M counted two black-height and thus P's other subtree which includes S must also count two black-height, which cannot be the case if S is a leaf node).

Notes

1. The label N will be used to denote the current node (colored black). In the diagrams N carries a blue contour. At the beginning, this is the replacement node and a leaf, but the entire procedure may also be applied recursively to other nodes (see case 3). In between some cases, the roles and labels of the nodes are exchanged, but in each case, every label continues to represent the same node it represented at the beginning of the case.

2. If a node in the right (target) half of a diagram carries a blue contour it will become the current node in the next iteration and there the other nodes will be newly assigned relative to it. Any color shown in the diagram is either assumed in its case or implied by those assumptions. White represents an arbitrary color (either red or black), but the same in both halves of the diagram.

3. A numbered triangle represents a subtree of unspecified depth. A black circle atop a triangle means that the black-height of that subtree is greater by one compared to a subtree without this circle.

We will find the sibling using this function:

struct node *sibling(struct node *n)
{
    if ((n == NULL) || (n->parent == NULL))
        return NULL; // no parent means no sibling
    if (n == n->parent->left)
        return n->parent->right;
    else
        return n->parent->left;
}

Note: In order that the tree remains well-defined, we need that every null leaf remains a leaf after all transformations (that it will not have any children). If the node we are deleting has a non-leaf (non-null) child N, it is easy to see that the property is satisfied. If, on the other hand, N would be a null leaf, it can be verified from the diagrams (or code) for all the cases that the property is satisfied as well.

We can perform the steps outlined above with the following code, where the function replace_node substitutes child into n’s place in the tree. For convenience, code in this section will assume that null leaves are represented by actual node objects rather than NULL (the code in the Insertion section works with either representation).

void delete_one_child(struct node *n)
{
    /*
     * Precondition: n has at most one non-leaf child.
     */
    struct node *child = is_leaf(n->right) ? n->left : n->right;

    replace_node(n, child);
    if (n->color == BLACK) {
        if (child->color == RED)
            child->color = BLACK;
        else
            delete_case1(child);
    }
    free(n);
}

Note: If N is a null leaf and we do not want to represent null leaves as actual node objects, we can modify the algorithm by first calling delete_case1() on its parent (the node that we delete, n in the code above) and deleting it afterwards. We do this if the parent is black (red is trivial), so it behaves in the same way as a null leaf (and is sometimes called a 'phantom' leaf). And we can safely delete it at the end as n will remain a leaf after all operations, as shown above. In addition, the sibling tests in cases 2 and 3 require updating as it is no longer true that the sibling will have children represented as objects.
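The code above relies on two helpers, is_leaf and replace_node, that are referenced but not listed in the text. The following is a minimal sketch of what they could look like under this section's convention that null leaves are real node objects; the flag field leaf is an assumption, not part of the original code.

/* Sketch only: helpers referenced but not defined in the text. */
static int is_leaf(struct node *n)
{
    return n->leaf;   /* assumed: set when the node is a null leaf */
}

/* Detach n from the tree and put child in its place, fixing the
   parent pointers in both directions. */
static void replace_node(struct node *n, struct node *child)
{
    child->parent = n->parent;
    if (n->parent != NULL) {
        if (n == n->parent->left)
            n->parent->left = child;
        else
            n->parent->right = child;
    }
}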

If both N and its original parent are black, then deleting this original parent causes paths which proceed through N to have one fewer black node than paths that do not. As this violates property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes), the tree must be rebalanced. There are several cases to consider:

Case 1: N is the new root. In this case, we are done. We removed one black node from every path, and the new root is black, so the properties are preserved.

void delete_case1(struct node *n)
{
    if (n->parent != NULL)
        delete_case2(n);
}

Note: In cases 2, 5, and 6, we assume N is the left child of its parent P. If it is the right child, left and right should be reversed throughout these three cases. Again, the code examples take both cases into account.

/* Case 2: N's sibling S is red. Exchange the colors of P and S and
   rotate at P, so that N gains a black sibling and a red parent. */
void delete_case2(struct node *n)
{
    struct node *s = sibling(n);

    if (s->color == RED) {
        n->parent->color = RED;
        s->color = BLACK;
        if (n == n->parent->left)
            rotate_left(n->parent);
        else
            rotate_right(n->parent);
    }
    delete_case3(n);
}

/* Case 3: P, S, and S's children are all black. Repaint S red and
   repeat the rebalancing one level up, from case 1 at P. */
void delete_case3(struct node *n)
{
    struct node *s = sibling(n);

    if ((n->parent->color == BLACK) &&
        (s->color == BLACK) &&
        (s->left->color == BLACK) &&
        (s->right->color == BLACK)) {
        s->color = RED;
        delete_case1(n->parent);
    } else
        delete_case4(n);
}

/* Case 4: P is red but S and S's children are black. Exchanging the
   colors of P and S restores the missing black node on N's paths. */
void delete_case4(struct node *n)
{
    struct node *s = sibling(n);

    if ((n->parent->color == RED) &&
        (s->color == BLACK) &&
        (s->left->color == BLACK) &&
        (s->right->color == BLACK)) {
        s->color = RED;
        n->parent->color = BLACK;
    } else
        delete_case5(n);
}

void delete_case5(struct node *n)
{
    struct node *s = sibling(n);

    if (s->color == BLACK) {
        /* this if statement is trivial, due to case 2 (even though
           case 2 changed the sibling to a sibling's child, the
           sibling's child can't be red, since no red parent can have
           a red child). */
        /* the following statements just force the red to be on the
           left of the left of the parent, or right of the right, so
           case six will rotate correctly. */
        if ((n == n->parent->left) &&
            (s->right->color == BLACK) &&
            (s->left->color == RED)) {
            /* this last test is trivial too due to cases 2-4. */
            s->color = RED;
            s->left->color = BLACK;
            rotate_right(s);
        } else if ((n == n->parent->right) &&
                   (s->left->color == BLACK) &&
                   (s->right->color == RED)) {
            /* this last test is trivial too due to cases 2-4. */
            s->color = RED;
            s->right->color = BLACK;
            rotate_left(s);
        }
    }
    delete_case6(n);
}

void delete_case6(struct node *n)
{
    struct node *s = sibling(n);

    s->color = n->parent->color;
    n->parent->color = BLACK;
    if (n == n->parent->left) {
        s->right->color = BLACK;
        rotate_left(n->parent);
    } else {
        s->left->color = BLACK;
        rotate_right(n->parent);
    }
}

Again, the function calls all use tail recursion, so the algorithm is in-place.

In the algorithm above, all cases are chained in order, except in delete case 3 where it can recurse to case 1 back to the parent node: this is the only case where an iterative implementation will effectively loop. No more than h loops back to case 1 will occur (where h is the height of the tree). And because the probability for escalation decreases exponentially with each iteration, the average removal cost is constant.

Additionally, no tail recursion ever occurs on a child node, so the tail recursion loop can only move from a child back to its successive ancestors. If a rotation occurs in case 2 (which is the only possibility of rotation within the loop of cases 1–3), then the parent of the node N becomes red after the rotation and we will exit the loop. Therefore, at most one rotation will occur within this loop. Since no more than two additional rotations will occur after exiting the loop, at most three rotations occur in total.

6.8.7 Proof of asymptotic bounds

A red–black tree which contains n internal nodes has a height of O(log n).

Definitions:

• h(v) = height of subtree rooted at node v

• bh(v) = the number of black nodes from v to any leaf in the subtree, not counting v if it is black - called the black-height

Lemma: A subtree rooted at node v has at least 2^bh(v) − 1 internal nodes.

Proof of Lemma (by induction on height):

Basis: h(v) = 0. If v has a height of zero then it must be null, therefore bh(v) = 0. So:

2^bh(v) − 1 = 2^0 − 1 = 1 − 1 = 0

Inductive step: v such that h(v) = k having at least 2^bh(v) − 1 internal nodes implies that v′ such that h(v′) = k + 1 has at least 2^bh(v′) − 1 internal nodes.

Since v′ has h(v′) > 0 it is an internal node. As such it has two children, each of which has a black-height of either bh(v′) or bh(v′) − 1 (depending on whether the child is red or black, respectively). By the inductive hypothesis each child has at least 2^(bh(v′)−1) − 1 internal nodes, so v′ has at least:

2^(bh(v′)−1) − 1 + 2^(bh(v′)−1) − 1 + 1 = 2^bh(v′) − 1

internal nodes.

Using this lemma we can now show that the height of the tree is logarithmic. Since at least half of the nodes on any path from the root to a leaf are black (property 4 of a red–black tree), the black-height of the root is at least h(root)/2. By the lemma we get:

n ≥ 2^(h(root)/2) − 1  ⟺  log₂(n + 1) ≥ h(root)/2  ⟺  h(root) ≤ 2 log₂(n + 1)

Therefore, the height of the root is O(log n).
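As a way of exercising the invariants this proof relies on, the following is a small C sketch, not from the original text, that recursively verifies properties 4 and 5 and returns the black-height of a subtree (or −1 on a violation); treating NULL as a black leaf is an assumption of the sketch.

/* Sketch (not from the original text): verify properties 4 and 5.
   Returns the black-height of the subtree rooted at n, or -1 if a
   red–black property is violated. NULL is treated as a black leaf. */
int check_rb(struct node *n)
{
    int lh, rh;

    if (n == NULL)
        return 0;                       /* leaves count as black */
    if (n->color == RED) {
        /* property 4: both children of every red node are black */
        if ((n->left  != NULL && n->left->color  == RED) ||
            (n->right != NULL && n->right->color == RED))
            return -1;
    }
    lh = check_rb(n->left);
    rh = check_rb(n->right);
    if (lh < 0 || rh < 0 || lh != rh)   /* property 5: equal black counts */
        return -1;
    return lh + (n->color == BLACK ? 1 : 0);
}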

6.8.8 Parallel algorithms

Parallel algorithms for constructing red–black trees from sorted lists of items can run in constant time or O(log log n) time, depending on the computer model, if the number of processors available is asymptotically proportional to the number n of items, where n → ∞. Fast search, insertion, and deletion parallel algorithms are also known.[22]

6.8.9 Popular Culture

A red–black tree was referenced correctly in an episode of Missing (Canadian TV series),[23] as noted by Robert Sedgewick in one of his lectures:[24]

Jess: “It was the red door again.”
Pollock: “I thought the red door was the storage container.”
Jess: “But it wasn't red anymore, it was black.”
Antonio: “So red turning to black means what?”
Pollock: “Budget deficits, red ink, black ink.”
Antonio: “It could be from a binary search tree. The red–black tree tracks every simple path from a node to a descendant leaf that has the same number of black nodes.”
Jess: “Does that help you with the ladies?”

6.8.10 See also

• List of data structures

• Tree data structure

• Tree rotation

• AA tree, a variation of the red-black tree

• AVL tree

• B-tree (2-3 tree, 2-3-4 tree, B+ tree, B*-tree, UB-tree)

• Scapegoat tree

• Splay tree

• T-tree

• WAVL tree

6.8.11 References

[1] James Paton. “Red-Black Trees”.

[2] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). “Red–Black Trees”. Introduction to Algorithms (second ed.). MIT Press. pp. 273–301. ISBN 0-262-03293-7.

[3] John Morris. “Red–Black Trees”.

[4] Rudolf Bayer (1972). “Symmetric binary B-Trees: Data structure and maintenance algorithms”. Acta Informatica. 1 (4): 290–306. doi:10.1007/BF00289509.

[5] Drozdek, Adam. Data Structures and Algorithms in Java (2 ed.). Sams Publishing. p. 323. ISBN 0534376681.

[6] Leonidas J. Guibas and Robert Sedgewick (1978). “A Dichromatic Framework for Balanced Trees”. Proceedings of the 19th Annual Symposium on Foundations of Computer Science. pp. 8–21. doi:10.1109/SFCS.1978.3.

[7] “Red Black Trees”. eternallyconfuzzled.com. Retrieved 2015-09-02.

[8] Robert Sedgewick (2012). Red-Black BSTs. Coursera. “A lot of people ask why did we use the name red–black. Well, we invented this data structure, this way of looking at balanced trees, at Xerox PARC which was the home of the personal computer and many other innovations that we live with today entering[sic] graphic user interfaces, ethernet and object-oriented programmings[sic] and many other things. But one of the things that was invented there was laser printing and we were very excited to have nearby color laser printer that could print things out in color and out of the colors the red looked the best. So, that’s why we picked the color red to distinguish red links, the types of links, in three nodes. So, that’s an answer to the question for people that have been asking.”

[9] “Where does the term 'Red/Black Tree' come from?”. programmers.stackexchange.com. Retrieved 2015-09-02.

[10] Andersson, Arne (1993-08-11). Dehne, Frank; Sack, Jörg-Rüdiger; Santoro, Nicola; Whitesides, Sue, eds. “Balanced search trees made simple” (PDF). Algorithms and Data Structures (Proceedings). Lecture Notes in Computer Science. Springer-Verlag Berlin Heidelberg. 709: 60–71. doi:10.1007/3-540-57155-8_236. ISBN 978-3-540-57155-1. Archived from the original on 2000-03-17.

[11] Okasaki, Chris (1999-01-01). “Red-black trees in a functional setting” (PS). Journal of Functional Programming. 9 (4): 471–477. doi:10.1017/S0956796899003494. ISSN 1469-7653.

[12] Sedgewick, Robert (1983). Algorithms (1st ed.). Addison-Wesley. ISBN 0-201-06672-6.

[13] RedBlackBST code in Java

[14] Sedgewick, Robert (2008). “Left-leaning Red-Black Trees” (PDF).

[15] Sedgewick, Robert; Wayne, Kevin (2011). Algorithms (4th ed.). Addison-Wesley Professional. ISBN 978-0-321-57351-3.

[16] Cormen, Thomas; Leiserson, Charles; Rivest, Ronald; Stein, Clifford (2009). “13”. Introduction to Algorithms (3rd ed.). MIT Press. pp. 308–309. ISBN 978-0-262-03384-8.

[17] Mehlhorn, Kurt; Sanders, Peter (2008). Algorithms and Data Structures: The Basic Toolbox (PDF). Springer, Berlin/Heidelberg. pp. 154–165. doi:10.1007/978-3-540-77978-0. ISBN 978-3-540-77977-3. p. 155.

[18] http://www.cs.princeton.edu/~rs/talks/LLRB/RedBlack.pdf

[19] http://www.cs.princeton.edu/courses/archive/fall08/cos226/lectures/10BalancedTrees-2x2.pdf

[20] “How does a HashMap work in JAVA”. coding-geek.com.

[21] Mehlhorn & Sanders 2008, pp. 165, 158.

[22] Park, Heejin; Park, Kunsoo (2001). “Parallel algorithms for red–black trees”. Theoretical Computer Science. Elsevier. 262 (1–2): 415–435. doi:10.1016/S0304-3975(00)00287-5. “Our parallel algorithm for constructing a red–black tree from a sorted list of n items runs in O(1) time with n processors on the CRCW PRAM and runs in O(log log n) time with n / log log n processors on the EREW PRAM.”

[23] Missing (Canadian TV series). A, W Network (Canada); Lifetime (United States).

[24] Robert Sedgewick (2012). B-Trees. Coursera. 10:07 minutes in. “So not only is there some excitement in that dialogue but it’s also technically correct, which you don't often find with math in popular culture of computer science. A red black tree tracks every simple path from a node to a descendant leaf with the same number of black nodes; they got that right.”

6.8.12 Further reading

• Mathworld: Red–Black Tree

• San Diego State University: CS 660: Red–Black tree notes, by Roger Whitney

• Pfaff, Ben (June 2004). “Performance Analysis of BSTs in System Software” (PDF). Stanford University.

6.8.13 External links

• A complete and working implementation in C

• Red–Black Tree Demonstration

• OCW MIT Lecture by Prof. Erik Demaine on Red Black Trees

• Binary Search Tree Insertion Visualization on YouTube – Visualization of random and pre-sorted data insertions, in elementary binary search trees, and left-leaning red–black trees

• An intrusive red-black tree written in C++

• Red-black BSTs in 3.3 Balanced Search Trees

• Red–black BST Demo

6.9 WAVL tree

In computer science, a WAVL tree or weak AVL tree is a self-balancing binary search tree. WAVL trees are named after AVL trees, another type of balanced search tree, and are closely related both to AVL trees and red–black trees, which all fall into a common framework of rank balanced trees. Like other balanced binary search trees, WAVL trees can handle insertion, deletion, and search operations in time O(log n) per operation.[1][2]

WAVL trees are designed to combine some of the best properties of both AVL trees and red–black trees. One advantage of AVL trees over red–black trees is that they are more balanced: they have height at most log_φ n ≈ 1.44 log₂ n (for a tree with n data items, where φ is the golden ratio), while red–black trees have larger maximum height, 2 log₂ n. If a WAVL tree is created using only insertions, without deletions, then it has the same small height bound that an AVL tree has. On the other hand, red–black trees have the advantage over AVL trees that they perform less restructuring of their trees. In AVL trees, each deletion may require a logarithmic number of tree rotation operations, while red–black trees have simpler deletion operations that use only a constant number of tree rotations. WAVL trees, like red–black trees, use only a constant number of tree rotations, and the constant is even better than for red–black trees.[1][2]

WAVL trees were introduced by Haeupler, Sen & Tarjan (2015). The same authors also provided a common view of AVL trees, WAVL trees, and red–black trees as all being a type of rank-balanced tree.[2]

6.9.1 Definition

As with binary search trees more generally, a WAVL tree consists of a collection of nodes, of two types: internal nodes and external nodes. An internal node stores a data item, and is linked to its parent (except for a designated root node that has no parent) and to exactly two children in the tree, the left child and the right child.
An external node carries no data, and has a link only to its parent in the tree. These nodes are arranged to form a binary tree, so that for any internal node x the parent of the left and right children of x is x itself. The external nodes form the leaves of the tree.[3] The data items are arranged in the tree in such a way that an inorder traversal of the tree lists the data items in sorted order.[4]

What distinguishes WAVL trees from other types of binary search tree is their use of ranks. These are numbers, stored with each node, that provide an approximation to the distance from the node to its farthest leaf descendant. The ranks are required to obey the following properties:[1][2]

• Every external node has rank 0.[5]

• If a non-root node has rank r, then the rank of its parent must be either r + 1 or r + 2.

• An internal node with two external children must have rank exactly 1.
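A minimal C sketch, not from the original text, of how these rank rules might be checked; representing external nodes as NULL with rank 0, and the struct and field names used here, are assumptions of the sketch.

/* Sketch (not from the original text): check the WAVL rank rules for
   the subtree rooted at n. External nodes are represented as NULL and
   have rank 0 by convention. Returns 1 if the rules hold. */
struct wavl_node {
    int key;       /* assumed payload */
    int rank;
    struct wavl_node *left, *right;
};

static int wavl_rank(struct wavl_node *n) { return n ? n->rank : 0; }

int wavl_check(struct wavl_node *n)
{
    int dl, dr;

    if (n == NULL)
        return 1;                       /* external node, rank 0 */
    dl = n->rank - wavl_rank(n->left);  /* rank difference to children */
    dr = n->rank - wavl_rank(n->right);
    if (dl < 1 || dl > 2 || dr < 1 || dr > 2)
        return 0;                       /* child rank must be r-1 or r-2 */
    if (n->left == NULL && n->right == NULL && n->rank != 1)
        return 0;                       /* node with two external children */
    return wavl_check(n->left) && wavl_check(n->right);
}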

6.9.2 Operations

Searching

Searching for a key k in a WAVL tree is much the same as in any balanced binary search tree data structure. One begins at the root of the tree, and then repeatedly compares k with the data item stored at each node on a path from the root, following the path to the left child of a node when k is smaller than the value at the node or instead following the path to the right child when k is larger than the value at the node. When a node with value equal to k is reached, or an external node is reached, the search stops.[6]

If the search stops at an internal node, the key k has been found. If instead, the search stops at an external node, then the position where k would be inserted (if it were inserted) has been found.[6]
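The search just described translates directly to C. The following sketch is not from the original text and reuses the wavl_node struct from the sketch above, assuming integer keys and NULL for external nodes.

/* Sketch (not from the original text) of WAVL search. */
struct wavl_node *wavl_find(struct wavl_node *root, int k)
{
    struct wavl_node *cur = root;

    while (cur != NULL) {
        if (k < cur->key)
            cur = cur->left;        /* k smaller: go left */
        else if (k > cur->key)
            cur = cur->right;       /* k larger: go right */
        else
            return cur;             /* key found at an internal node */
    }
    return NULL;  /* reached an external node: k's insertion position */
}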

Insertion

Insertion of a key k into a WAVL tree is performed by performing a search for the external node where the key should be added, replacing that node by an internal node with data item k and two external-node children, and then rebalancing the tree. The rebalancing step can be performed either top-down or bottom-up,[2] but the bottom-up version of rebalancing is the one that most closely matches AVL trees.[1][2]

In this rebalancing step, one assigns rank 1 to the newly created internal node, and then follows a path upward from each node to its parent, incrementing the rank of each parent node if necessary to make it greater than the new rank of its child, until one of three stopping conditions is reached.

• If the path of incremented ranks reaches the root of the tree, then the rebalancing procedure stops, without changing the structure of the tree.

• If the path of incremented ranks reaches a node whose parent’s rank previously differed by two, and (after incrementing the rank of the node) still differs by one, then again the rebalancing procedure stops without changing the structure of the tree.

• If the procedure increases the rank of a node x, so that it becomes equal to the rank of the parent y of x, but the other child of y has a rank that is smaller by two (so that the rank of y cannot be increased), then again the rebalancing procedure stops. In this case, by performing at most two tree rotations, it is always possible to rearrange the tree nodes near x and y in such a way that the ranks obey the constraints of a WAVL tree, leaving the rank of the root of the rotated subtree unchanged.

Thus, overall, the insertion procedure consists of a search, the creation of a constant number of new nodes, a logarithmic number of rank changes, and a constant number of tree rotations.[1][2]

Deletion

As with binary search trees more broadly, deletion operations on an internal node x that has at least one external-node child may be performed directly, by removing x from the tree and reconnecting the other child of x to the parent of x. If, however, both children of a node x are internal nodes, then we may follow a path downward in the tree from x to the leftmost descendant of its right child, a node y that immediately follows x in the sorted ordering of the tree nodes. Then y has an external-node child (its left child). We may delete x by performing the same reconnection procedure at node y (effectively, deleting y instead of x) and then replacing the data item stored at x with the one that had been stored at y.[7]

In either case, after making this change to the tree structure, it is necessary to rebalance the tree and update its ranks. As in the case of an insertion, this may be done by following a path upwards in the tree and changing the ranks of the nodes along this path until one of three things happens: the root is reached and the tree is balanced; a node is reached whose rank does not need to be changed, and again the tree is balanced; or a node is reached whose rank cannot be changed. In this last case a constant number of tree rotations completes the rebalancing stage of the deletion process.[1][2]

Overall, as with the insertion procedure, a deletion consists of a search downward through the tree (to find the node to be deleted), a continuation of the search farther downward (to find a node with an external child), the removal of a constant number of nodes, a logarithmic number of rank changes, and a constant number of tree rotations.[1][2]

6.9.3 Computational complexity

Each search, insertion, or deletion in a WAVL tree involves following a single path in the tree and performing a constant number of steps for each node in the path. In a WAVL tree with n items that has only undergone insertions, the maximum path length is log_φ n ≈ 1.44 log₂ n. If both insertions and deletions may have happened, the maximum path length is 2 log₂ n. Therefore, in either case, the worst-case time for each search, insertion, or deletion in a WAVL tree with n data items is O(log n).

6.9.4 Related structures

WAVL trees are closely related to both AVL trees and red–black trees. Every AVL tree can have ranks assigned to its nodes in a way that makes it into a WAVL tree. And every WAVL tree can have its nodes colored red and black (and its ranks reassigned) in a way that makes it into a red–black tree. However, some WAVL trees do not come from AVL trees in this way and some red–black trees do not come from WAVL trees in this way.

AVL trees

An AVL tree is a kind of balanced binary search tree in which the two children of each internal node must have heights that differ by at most one.[8] The height of an external node is zero, and the height of any internal node is always one plus the maximum of the heights of its two children. Thus, the height function of an AVL tree obeys the constraints of a WAVL tree, and we may convert any AVL tree into a WAVL tree by using the height of each node as its rank.[1][2]

The key difference between an AVL tree and a WAVL tree arises when a node has two children with the same rank or height. In an AVL tree, if a node x has two children of the same height h as each other, then the height of x must be exactly h + 1. In contrast, in a WAVL tree, if a node x has two children of the same rank r as each other, then the rank of x can be either r + 1 or r + 2. This greater flexibility in ranks also leads to a greater flexibility in structures: some WAVL trees cannot be made into AVL trees even by modifying their ranks, because they include nodes whose children’s heights differ by more than one.[2]

If a WAVL tree is created only using insertion operations, then its structure will be the same as the structure of an AVL tree created by the same insertion sequence, and its ranks will be the same as the ranks of the corresponding AVL tree. It is only through deletion operations that a WAVL tree can become different from an AVL tree. In particular this implies that a WAVL tree created only through insertions has height at most log_φ n ≈ 1.44 log₂ n.[2]

Red–black trees

A red–black tree is a balanced binary search tree in which each node has a color (red or black), satisfying the following properties:

• External nodes are black.

• If an internal node is red, its two children are both black.

• All paths from the root to an external node have equal numbers of black nodes.

Red–black trees can equivalently be defined in terms of a system of ranks, stored at the nodes, satisfying the following requirements (different than the requirements for ranks in WAVL trees):

• The rank of an external node is always 0 and its parent’s rank is always 1.

• The rank of any non-root node equals either its parent’s rank or its parent’s rank minus 1.

• No two consecutive edges on any root-leaf path have rank difference 0.

The equivalence between the color-based and rank-based definitions can be seen, in one direction, by coloring a node black if its parent has greater rank and red if its parent has equal rank. In the other direction, colors can be converted to ranks by making the rank of a black node equal to the number of black nodes on any path to an external node, and by making the rank of a red node equal to its parent’s rank.[9]

The ranks of the nodes in a WAVL tree can be converted to a system of ranks of nodes, obeying the requirements for red–black trees, by dividing each rank by two and rounding up to the nearest integer.[10] Because of this conversion, for every WAVL tree there exists a valid red–black tree with the same structure. Because red–black trees have maximum height 2 log₂ n, the same is true for WAVL trees.[1][2] However, there exist red–black trees that cannot be given a valid WAVL tree rank function.[2]

Despite the fact that, in terms of their tree structures, WAVL trees are special cases of red–black trees, their update operations are different. The tree rotations used in WAVL tree update operations may make changes that would not be permitted in a red–black tree, because they would in effect cause the recoloring of large subtrees of the red–black tree rather than making color changes only on a single path in the tree.[2] This allows WAVL trees to perform fewer tree rotations per deletion, in the worst case, than red–black trees.[1][2]
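As a worked example of the conversion just described, here is a one-line C sketch, not from the original text: each WAVL rank is halved and rounded up to produce a red–black rank.

/* Sketch (not from the original text): WAVL rank to red–black rank. */
int rb_rank_from_wavl(int wavl_rank)
{
    return (wavl_rank + 1) / 2;  /* ceil(wavl_rank / 2) for wavl_rank >= 0 */
}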

6.9.5 References A binary search tree is said to be weight-balanced if half the nodes are on the left of the root, and half on the right. [1] Goodrich, Michael T.; Tamassia, Roberto (2015), “4.4 Weak AVL Trees”, Algorithm Design and Applications, An α-weight-balanced node is defined as meeting a re- Wiley, pp. 130–138. laxed weight balance criterion: size(left) <= α*size(node) size(right) <= α*size(node) [2] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2015), “Rank-balanced trees” (PDF), ACM Transactions Where size can be defined recursively as: on Algorithms, 11 (4): Art. 30, 26, doi:10.1145/2689412, MR 3361215. function size(node) if node = nil return 0 else return size(node->left) + size(node->right) + 1 end [3] Goodrich & Tamassia (2015), Section 2.3 Trees, pp. 68– An α of 1 therefore would describe a linked list as bal- 83. anced, whereas an α of 0.5 would only match almost com- [4] Goodrich & Tamassia (2015), Chapter 3 Binary Search plete binary trees. Trees, pp. 89–114. A binary search tree that is α-weight-balanced must also [5] In this we follow Goodrich & Tamassia (2015). In the be α-height-balanced, that is version described by Haeupler, Sen & Tarjan (2015), the height(tree) <= log₁/α(NodeCount) + 1 external nodes have rank −1. This variation makes very little difference in the operations of WAVL trees, but it Scapegoat trees are not guaranteed to keep α-weight- causes some minor changes to the formula for converting balance at all times, but are always loosely α-height- WAVL trees to red–black trees. balanced in that

6.10 Scapegoat tree

In computer science, a scapegoat tree is a self-balancing binary search tree, invented by Arne Andersson[1] and again by Igal Galperin and Ronald L. Rivest.[2] It provides worst-case O(log n) lookup time, and O(log n) amortized insertion and deletion time.

Unlike most other self-balancing binary search trees that provide worst case O(log n) lookup time, scapegoat trees have no additional per-node memory overhead compared to a regular binary search tree: a node stores only a key and two pointers to the child nodes. This makes scapegoat trees easier to implement and, due to data structure alignment, can reduce node overhead by up to one-third.

6.10.1 Theory

A binary search tree is said to be weight-balanced if half the nodes are on the left of the root, and half on the right. An α-weight-balanced node is defined as meeting a relaxed weight balance criterion:

size(left) <= α*size(node)
size(right) <= α*size(node)

Where size can be defined recursively as:

function size(node)
    if node = nil
        return 0
    else
        return size(node->left) + size(node->right) + 1
end

An α of 1 therefore would describe a linked list as balanced, whereas an α of 0.5 would only match almost complete binary trees.

A binary search tree that is α-weight-balanced must also be α-height-balanced, that is

height(tree) <= log₁/α(NodeCount) + 1

Scapegoat trees are not guaranteed to keep α-weight-balance at all times, but are always loosely α-height-balanced in that

height(scapegoat tree) <= log₁/α(NodeCount) + 1

This makes scapegoat trees similar to red-black trees in that they both have restrictions on their height. They differ greatly though in their implementations of determining where the rotations (or in the case of scapegoat trees, rebalances) take place. Whereas red-black trees store additional 'color' information in each node to determine the location, scapegoat trees find a scapegoat which isn't α-weight-balanced to perform the rebalance operation on. This is loosely similar to AVL trees, in that the actual rotations depend on 'balances' of nodes, but the means of determining the balance differs greatly. Since AVL trees check the balance value on every insertion/deletion, it is typically stored in each node; scapegoat trees are able to calculate it only as needed, which is only when a scapegoat needs to be found.

Unlike most other self-balancing search trees, scapegoat trees are entirely flexible as to their balancing. They support any α such that 0.5 < α < 1. A high α value results in fewer balances, making insertion quicker but lookups and deletions slower, and vice versa for a low α. Therefore in practical applications, an α can be chosen depending on how frequently these actions should be performed.
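The pseudocode above translates directly to C. The following sketch is not from the original text; the struct layout and function names are illustrative assumptions, with α passed as a double.

#include <stddef.h>

struct sg_node {
    int key;
    struct sg_node *left, *right;
};

/* Recursive subtree size, as in the pseudocode above. */
size_t sg_size(const struct sg_node *node)
{
    if (node == NULL)
        return 0;
    return sg_size(node->left) + sg_size(node->right) + 1;
}

/* Is this node α-weight-balanced? */
int sg_is_balanced(const struct sg_node *node, double alpha)
{
    size_t n = sg_size(node);
    return sg_size(node->left)  <= alpha * n &&
           sg_size(node->right) <= alpha * n;
}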

6.10.2 Operations

Insertion

Insertion is implemented with the same basic ideas as an unbalanced binary search tree, however with a few significant changes.

When finding the insertion point, the depth of the new node must also be recorded. This is implemented via a simple counter that gets incremented during each iteration of the lookup, effectively counting the number of edges between the root and the inserted node. If this node violates the α-height-balance property (defined above), a rebalance is required.

To rebalance, an entire subtree rooted at a scapegoat undergoes a balancing operation. The scapegoat is defined as being an ancestor of the inserted node which isn't α-weight-balanced. There will always be at least one such ancestor. Rebalancing any of them will restore the α-height-balanced property.

One way of finding a scapegoat is to climb from the new node back up to the root and select the first node that isn't α-weight-balanced.

Climbing back up to the root requires O(log n) storage space, usually allocated on the stack, or parent pointers. This can actually be avoided by pointing each child at its parent as you go down, and repairing on the walk back up.

To determine whether a potential node is a viable scapegoat, we need to check its α-weight-balanced property. To do this we can go back to the definition:

size(left) <= α*size(node)
size(right) <= α*size(node)

However a large optimisation can be made by realising that we already know two of the three sizes, leaving only the third having to be calculated.

Consider the following example to demonstrate this. Assuming that we're climbing back up to the root:

size(parent) = size(node) + size(sibling) + 1

But as:

size(inserted node) = 1,

the case is trivialized down to:

size[x+1] = size[x] + size(sibling) + 1

where x = this node, x + 1 = parent, and size(sibling) is the only function call actually required.

Once the scapegoat is found, the subtree rooted at the scapegoat is completely rebuilt to be perfectly balanced.[2] This can be done in O(n) time by traversing the nodes of the subtree to find their values in sorted order and recursively choosing the median as the root of the subtree.

As rebalance operations take O(n) time (dependent on the number of nodes of the subtree), insertion has a worst-case performance of O(n) time. However, because these worst-case scenarios are spread out, insertion takes O(log n) amortized time.

Sketch of proof for cost of insertion

Define the Imbalance of a node v to be the absolute value of the difference in size between its left node and right node minus 1, or 0, whichever is greater. In other words:

I(v) = max(|left(v)| − |right(v)| − 1, 0)

Immediately after rebuilding a subtree rooted at v, I(v) = 0.

Lemma: Immediately before rebuilding the subtree rooted at v, I(v) = Ω(|v|) (Ω is big-Omega notation).

Proof of lemma: Let v₀ be the root of a subtree immediately after rebuilding. h(v₀) = log(|v₀| + 1). If there are Ω(|v₀|) degenerate insertions (that is, where each inserted node increases the height by 1), then I(v) = Ω(|v₀|), h(v) = h(v₀) + Ω(|v₀|), and log(|v|) ≤ log(|v₀| + 1) + 1.

Since I(v) = Ω(|v|) before rebuilding, there were Ω(|v|) insertions into the subtree rooted at v that did not result in rebuilding. Each of these insertions can be performed in O(log n) time. The final insertion that causes rebuilding costs O(|v|). Using aggregate analysis it becomes clear that the amortized cost of an insertion is O(log n):

(Ω(|v|)·O(log n) + O(|v|)) / Ω(|v|) = O(log n)
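A condensed C sketch of the insertion procedure just described, not from the original text: a plain BST insertion counts the new node's depth, and on an α-height violation a scapegoat is found and rebuilt. It reuses sg_node, sg_size and sg_is_balanced from the sketch above, assumes distinct keys, and locates the scapegoat by walking down from the root (the section notes that rebuilding any unbalanced ancestor restores balance); the helper names are illustrative.

#include <stdlib.h>
#include <math.h>

static size_t flatten(struct sg_node *n, struct sg_node **a, size_t i)
{
    if (n == NULL)
        return i;
    i = flatten(n->left, a, i);       /* in-order: left, node, right */
    a[i++] = n;
    return flatten(n->right, a, i);
}

static struct sg_node *build_balanced(struct sg_node **a, long lo, long hi)
{
    long mid;
    if (lo > hi)
        return NULL;
    mid = lo + (hi - lo) / 2;         /* median of the range becomes root */
    a[mid]->left  = build_balanced(a, lo, mid - 1);
    a[mid]->right = build_balanced(a, mid + 1, hi);
    return a[mid];
}

/* Rebuild the subtree rooted at *slot to be perfectly balanced. */
static void rebuild(struct sg_node **slot)
{
    size_t n = sg_size(*slot);
    struct sg_node **a = malloc(n * sizeof *a);
    if (a == NULL)
        return;                       /* out of memory: keep the old shape */
    flatten(*slot, a, 0);
    *slot = build_balanced(a, 0, (long)n - 1);
    free(a);
}

/* Insert n; tree_size_after is the node count including n. */
void sg_insert(struct sg_node **root, struct sg_node *n,
               double alpha, size_t tree_size_after)
{
    struct sg_node **slot = root;
    int depth = 0;

    n->left = n->right = NULL;
    while (*slot != NULL) {           /* 1. plain BST insertion */
        slot = (n->key < (*slot)->key) ? &(*slot)->left : &(*slot)->right;
        depth++;
    }
    *slot = n;

    /* 2. check the loose α-height balance bound */
    if (depth > floor(log((double)tree_size_after) / log(1.0 / alpha)) + 1) {
        slot = root;                  /* 3. find a scapegoat and rebuild */
        while (*slot != n) {
            if (!sg_is_balanced(*slot, alpha)) {
                rebuild(slot);
                return;
            }
            slot = (n->key < (*slot)->key) ? &(*slot)->left : &(*slot)->right;
        }
    }
}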
Deletion

Scapegoat trees are unusual in that deletion is easier than insertion. To enable deletion, scapegoat trees need to store an additional value with the tree data structure. This property, which we will call MaxNodeCount, simply represents the highest achieved NodeCount. It is set to NodeCount whenever the entire tree is rebalanced, and after insertion is set to max(MaxNodeCount, NodeCount).

To perform a deletion, we simply remove the node as we would in a simple binary search tree, but if

NodeCount <= α*MaxNodeCount

then we rebalance the entire tree about the root, remembering to set MaxNodeCount to NodeCount.

This gives deletion its worst-case performance of O(n) time; however, it is amortized to O(log n) average time.

Sketch of proof for cost of deletion

Suppose the scapegoat tree has n elements and has just been rebuilt (in other words, it is a complete binary tree). At most n/2 − 1 deletions can be performed before the tree must be rebuilt. Each of these deletions takes O(log n) time (the amount of time to search for the element and flag it as deleted).
The (n/2)th deletion causes the tree to be rebuilt and takes O(log n) + O(n) (or just O(n)) time. Using aggregate analysis it becomes clear that the amortized cost of a deletion is O(log n):

(Σᵢ₌₁^{n/2} O(log n) + O(n)) / (n/2) = ((n/2)·O(log n) + O(n)) / (n/2) = O(log n)

Lookup

Lookup is not modified from a standard binary search tree, and has a worst-case time of O(log n). This is in contrast to splay trees which have a worst-case time of O(n). The reduced node memory overhead compared to other self-balancing binary search trees can further improve locality of reference and caching.

6.10.3 See also

• Trees

• Tree rotation

• AVL tree

• B-tree

• T-tree

• List of data structures

6.10.4 References

[1] Andersson, Arne (1989). Improving partial rebuilding by using simple balance criteria. Proc. Workshop on Algorithms and Data Structures. Journal of Algorithms. Springer-Verlag. pp. 393–402. doi:10.1007/3-540-51542-9_33.

[2] Galperin, Igal; Rivest, Ronald L. (1993). “Scapegoat trees”. Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms: 165–174.

6.10.5 External links

• Scapegoat Tree Applet by Kubo Kovac

• Scapegoat Trees: Galperin and Rivest’s paper describing scapegoat trees

• On Consulting a Set of Experts and Searching (full version paper)

• Open Data Structures - Chapter 8 - Scapegoat Trees

6.11 Splay tree

A splay tree is a self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again. It performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For many sequences of non-random operations, splay trees perform better than other search trees, even when the specific pattern of the sequence is unknown. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.[1]

All normal operations on a binary search tree are combined with one basic operation, called splaying. Splaying the tree for a certain element rearranges the tree so that the element is placed at the root of the tree. One way to do this is to first perform a standard binary tree search for the element in question, and then use tree rotations in a specific fashion to bring the element to the top. Alternatively, a top-down algorithm can combine the search and the tree reorganization into a single phase.

6.11.1 Advantages

Good performance for a splay tree depends on the fact that it is self-optimizing, in that frequently accessed nodes will move nearer to the root where they can be accessed more quickly. The worst-case height (though unlikely) is O(n), with the average being O(log n). Having frequently used nodes near the root is an advantage for many practical applications (also see Locality of reference), and is particularly useful for implementing caches and garbage collection algorithms.

Advantages include:

• Comparable performance: Average-case performance is as efficient as other trees.[2]

• Small memory footprint: Splay trees do not need to store any bookkeeping data.

6.11.2 Disadvantages

The most significant disadvantage of splay trees is that the height of a splay tree can be linear. For example, this will be the case after accessing all n elements in non-decreasing order. Since the height of a tree corresponds to the worst-case access time, this means that the actual cost of an operation can be high. However the amortized access cost of this worst case is logarithmic, O(log n). Also, the expected access cost can be reduced to O(log n) by using a randomized variant.[3]

The representation of splay trees can change even when they are accessed in a 'read-only' manner (i.e. by find operations). This complicates the use of such splay trees in a multi-threaded environment.

Specifically, extra management is needed if multiple threads are allowed to perform find operations concurrently. This also makes them unsuitable for general use in purely functional programming, although they can be used in limited ways to implement priority queues even there.

6.11.3 Operations

Splaying

When a node x is accessed, a splay operation is performed on x to move it to the root. To perform a splay operation we carry out a sequence of splay steps, each of which moves x closer to the root. By performing a splay operation on the node of interest after every access, the recently accessed nodes are kept near the root and the tree remains roughly balanced, so that we achieve the desired amortized time bounds. Each particular step depends on three factors:

• whether x is the left or right child of its parent node, p,

• whether p is the root or not, and if not

• whether p is the left or right child of its parent, g (the grandparent of x).

It is important to remember to set gg (the great-grandparent of x) to now point to x after any splay operation. If gg is null, then x obviously is now the root and must be updated as such.

There are three types of splay steps, each of which has a left- and right-handed case. For the sake of brevity, only one of these two is shown for each type. These three types are:

Zig step: this step is done when p is the root. The tree is rotated on the edge between x and p. Zig steps exist to deal with the parity issue and will be done only as the last step in a splay operation and only when x has odd depth at the beginning of the operation.

(Figure: a zig step rotates x above its parent p; the subtrees A, B and C keep their left-to-right order.)

Zig-zig step: this step is done when p is not the root and x and p are either both right children or are both left children. The picture below shows the case where x and p are both left children. The tree is rotated on the edge joining p with its parent g, then rotated on the edge joining x with p. Note that zig-zig steps are the only thing that differentiate splay trees from the rotate to root method introduced by Allen and Munro[4] prior to the introduction of splay trees.

Zig-zag step: this step is done when p is not the root and x is a right child and p is a left child or vice versa. The tree is rotated on the edge between p and x, and then rotated on the resulting edge between x and g.
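The three step types can be expressed compactly in code. The following is a minimal C sketch, not from the original text: rotate_up performs a single rotation of x above its parent, and splay applies zig, zig-zig, or zig-zag steps until x is the root. The node layout with parent pointers is an assumption.

struct splay_node {
    struct splay_node *left, *right, *parent;
};

/* Rotate x above its parent, updating the grandparent ("gg") link. */
static void rotate_up(struct splay_node *x)
{
    struct splay_node *p = x->parent, *g = p->parent;

    if (x == p->left) {
        p->left = x->right;
        if (x->right) x->right->parent = p;
        x->right = p;
    } else {
        p->right = x->left;
        if (x->left) x->left->parent = p;
        x->left = p;
    }
    p->parent = x;
    x->parent = g;
    if (g) {
        if (g->left == p) g->left = x;
        else              g->right = x;
    }
}

/* Bottom-up splay: move x to the root with zig, zig-zig and zig-zag steps. */
void splay(struct splay_node *x)
{
    while (x->parent) {
        struct splay_node *p = x->parent, *g = p->parent;

        if (g == NULL) {
            rotate_up(x);                       /* zig */
        } else if ((x == p->left) == (p == g->left)) {
            rotate_up(p);                       /* zig-zig: rotate p first */
            rotate_up(x);
        } else {
            rotate_up(x);                       /* zig-zag: rotate x twice */
            rotate_up(x);
        }
    }
}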

Join

Given two trees S and T such that all elements of S are smaller than the elements of T, the following steps can be used to join them to a single tree:

• Splay the largest item in S. Now this item is in the root of S and has a null right child.

• Set the right child of the new root to T.

Split

Given a tree and an element x, return two new trees: one containing all elements less than or equal to x and the other containing all elements greater than x. This can be done in the following way:

• Splay x. Now it is in the root, so the tree to its left contains all elements smaller than x and the tree to its right contains all elements larger than x.

• Split the right subtree from the rest of the tree.

Insertion

To insert a value x into a splay tree:

• Insert x as with a normal binary search tree.
• When an item is inserted, a splay is performed.
• As a result, the newly inserted node x becomes the root of the tree.

ALTERNATIVE:

• Use the split operation to split the tree at the value of x into two sub-trees: S and T.
• Create a new tree in which x is the root, S is its left sub-tree and T its right sub-tree.

Deletion

To delete a node x, use the same method as with a binary search tree: if x has two children, swap its value with that of either the rightmost node of its left subtree (its in-order predecessor) or the leftmost node of its right subtree (its in-order successor). Then remove that node instead. In this way, deletion is reduced to the problem of removing a node with 0 or 1 children. Unlike a binary search tree, in a splay tree after deletion, we splay the parent of the removed node to the top of the tree.

ALTERNATIVE:

• The node to be deleted is first splayed, i.e. brought to the root of the tree, and then deleted. This leaves the tree with two sub-trees.
• The two sub-trees are then joined using a “join” operation.

6.11.4 Implementation and variants

Splaying, as mentioned above, is performed during a second, bottom-up pass over the access path of a node. It is possible to record the access path during the first pass for use during the second, but that requires extra space during the access operation. Another alternative is to keep a parent pointer in every node, which avoids the need for extra space during access operations but may reduce overall time efficiency because of the need to update those pointers.[1]

Another method which can be used is based on the argument that we can restructure the tree on our way down the access path instead of making a second pass. This top-down splaying routine uses three sets of nodes: left tree, right tree and middle tree. The first two contain all items of the original tree known to be less than or greater than the current item, respectively. The middle tree consists of the sub-tree rooted at the current node. These three sets are updated down the access path while keeping the splay operations in check. Another method, semisplaying, modifies the zig-zig case to reduce the amount of restructuring done in all operations.[1][5]

Below is an implementation of splay trees in C++, which uses pointers to represent each node of the tree. This implementation is based on the bottom-up splaying version and uses the second method of deletion on a splay tree. Also, unlike the above definition, this C++ version does not splay the tree on finds - it only splays on insertions and deletions.

    #include <functional>

    #ifndef SPLAY_TREE
    #define SPLAY_TREE

    template<typename T, typename Comp = std::less<T>>
    class splay_tree {
    private:
        Comp comp;
        unsigned long p_size;

        struct node {
            node *left, *right;
            node *parent;
            T key;
            node(const T& init = T()) : left(0), right(0), parent(0), key(init) { }
            ~node() { delete left; delete right; }   // delete children only; the parent owns this node
        } *root;

        void left_rotate(node *x) {
            node *y = x->right;
            if (y) {
                x->right = y->left;
                if (y->left) y->left->parent = x;
                y->parent = x->parent;
            }
            if (!x->parent) root = y;
            else if (x == x->parent->left) x->parent->left = y;
            else x->parent->right = y;
            if (y) y->left = x;
            x->parent = y;
        }

        void right_rotate(node *x) {
            node *y = x->left;
            if (y) {
                x->left = y->right;
                if (y->right) y->right->parent = x;
                y->parent = x->parent;
            }
            if (!x->parent) root = y;
            else if (x == x->parent->left) x->parent->left = y;
            else x->parent->right = y;
            if (y) y->right = x;
            x->parent = y;
        }

        // Bottom-up splay: move x to the root with zig, zig-zig and zig-zag steps.
        void splay(node *x) {
            while (x->parent) {
                if (!x->parent->parent) {                                              // zig
                    if (x->parent->left == x) right_rotate(x->parent);
                    else left_rotate(x->parent);
                } else if (x->parent->left == x && x->parent->parent->left == x->parent) {   // zig-zig
                    right_rotate(x->parent->parent);
                    right_rotate(x->parent);
                } else if (x->parent->right == x && x->parent->parent->right == x->parent) { // zig-zig
                    left_rotate(x->parent->parent);
                    left_rotate(x->parent);
                } else if (x->parent->left == x && x->parent->parent->right == x->parent) {  // zig-zag
                    right_rotate(x->parent);
                    left_rotate(x->parent);
                } else {                                                                     // zig-zag
                    left_rotate(x->parent);
                    right_rotate(x->parent);
                }
            }
        }

        // Replace the subtree rooted at u with the subtree rooted at v.
        void replace(node *u, node *v) {
            if (!u->parent) root = v;
            else if (u == u->parent->left) u->parent->left = v;
            else u->parent->right = v;
            if (v) v->parent = u->parent;
        }

        node* subtree_minimum(node *u) { while (u->left) u = u->left; return u; }
        node* subtree_maximum(node *u) { while (u->right) u = u->right; return u; }

    public:
        splay_tree() : root(0), p_size(0) { }
        ~splay_tree() { delete root; }

        void insert(const T &key) {
            node *z = root;
            node *p = 0;
            while (z) {
                p = z;
                if (comp(z->key, key)) z = z->right;
                else z = z->left;
            }
            z = new node(key);
            z->parent = p;
            if (!p) root = z;
            else if (comp(p->key, z->key)) p->right = z;
            else p->left = z;
            splay(z);
            p_size++;
        }

        node* find(const T &key) {
            node *z = root;
            while (z) {
                if (comp(z->key, key)) z = z->right;
                else if (comp(key, z->key)) z = z->left;
                else return z;
            }
            return 0;
        }

        void erase(const T &key) {
            node *z = find(key);
            if (!z) return;
            splay(z);                      // bring the node to delete to the root
            if (!z->left) replace(z, z->right);
            else if (!z->right) replace(z, z->left);
            else {
                node *y = subtree_minimum(z->right);   // in-order successor
                if (y->parent != z) {
                    replace(y, y->right);
                    y->right = z->right;
                    y->right->parent = y;
                }
                replace(z, y);
                y->left = z->left;
                y->left->parent = y;
            }
            z->left = z->right = 0;        // detach children before deleting z
            delete z;
            p_size--;
        }

        const T& minimum() { return subtree_minimum(root)->key; }
        const T& maximum() { return subtree_maximum(root)->key; }

        bool empty() const { return root == 0; }
        unsigned long size() const { return p_size; }
    };

    #endif // SPLAY_TREE
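As a rough usage sketch (a hypothetical main(), not part of the original listing; the printed values are what the calls above would produce), the class might be exercised like this:

    #include <iostream>
    // assumes the splay_tree template above is in scope

    int main() {
        splay_tree<int> t;
        for (int k : {5, 1, 9, 3}) t.insert(k);   // each insert splays the new key to the root
        std::cout << t.minimum() << " " << t.maximum() << "\n";                    // prints: 1 9
        t.erase(5);
        std::cout << (t.find(5) ? "found" : "absent") << " " << t.size() << "\n";  // prints: absent 3
        return 0;
    }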

6.11.5 Analysis

A simple amortized analysis of static splay trees can be carried out using the potential method. Define:

• size(r) = the number of nodes in the sub-tree rooted at node r (including r).
• rank(r) = log2(size(r)).
• Φ = the sum of the ranks of all the nodes in the tree.

Φ will tend to be high for poorly balanced trees and low for well-balanced trees.

To apply the potential method, we first calculate ΔΦ, the change in the potential caused by a splay operation. We check each case separately. Denote by rank' the rank function after the operation. x, p and g are the nodes affected by the rotation operation (see figures above).

Zig step:

    ΔΦ = rank'(p) - rank(p) + rank'(x) - rank(x)   [since only p and x change ranks]
       = rank'(p) - rank(x)                        [since rank'(x) = rank(p)]
       ≤ rank'(x) - rank(x)                        [since rank'(p) < rank'(x)]

Zig-zig step:

    ΔΦ = rank'(g) - rank(g) + rank'(p) - rank(p) + rank'(x) - rank(x)
       = rank'(g) + rank'(p) - rank(p) - rank(x)   [since rank'(x) = rank(g)]
       ≤ rank'(g) + rank'(x) - 2 rank(x)           [since rank(x) < rank(p) and rank'(x) > rank'(p)]
       ≤ 3(rank'(x) - rank(x)) - 2                 [due to the concavity of the log function]

Zig-zag step:

    ΔΦ = rank'(g) - rank(g) + rank'(p) - rank(p) + rank'(x) - rank(x)
       ≤ rank'(g) + rank'(p) - 2 rank(x)           [since rank'(x) = rank(g) and rank(x) < rank(p)]
       ≤ 3(rank'(x) - rank(x)) - 2                 [due to the concavity of the log function]

The amortized cost of any operation is ΔΦ plus the actual cost. The actual cost of any zig-zig or zig-zag operation is 2, since there are two rotations to make. Hence:

    amortized-cost = cost + ΔΦ ≤ 3(rank'(x) - rank(x))

When summed over the entire splay operation, this telescopes to 3(rank(root) - rank(x)), which is O(log n). The zig operation adds an amortized cost of 1, but there is at most one such operation.

So now we know that the total amortized time for a sequence of m operations is:

    T_amortized(m) = O(m log n)

To go from the amortized time to the actual time, we must add the decrease in potential from the initial state before any operation is done (Φi) to the final state after all operations are completed (Φf):

    Φi - Φf = Σ_x (rank_i(x) - rank_f(x)) = O(n log n)

where the last bound comes from the fact that for every node x, the minimum rank is 0 and the maximum rank is log(n).

Now we can finally bound the actual time:

    T_actual(m) = O(m log n + n log n)

The above analysis can be generalized in the following way:

• Assign to each node r a weight w(r), and let W be the sum of all weights.
• Define size(r) = the sum of the weights of the nodes in the sub-tree rooted at node r (including r).
• Define rank(r) and Φ exactly as above.

The same analysis applies and the amortized cost of a splaying operation is again:

    rank(root) - rank(x) = O(log W - log w(x)) = O(log(W/w(x)))

The net drop in potential is bounded by:

    Φi - Φf ≤ Σ_{x∈tree} log(W/w(x))

since the maximum size of any single node is W and the minimum is w(x).

Hence the actual time is bounded by:

    O( Σ_{x∈sequence} log(W/w(x)) + Σ_{x∈tree} log(W/w(x)) )

6.11.6 Performance theorems

There are several theorems and conjectures regarding the worst-case runtime for performing a sequence S of m accesses in a splay tree containing n elements.

Balance Theorem The cost of performing the sequence S is O(m log n + n log n). (Proof: take a constant weight, e.g. w(x) = 1 for every node x. Then W = n.) This theorem implies that splay trees perform as well as static balanced binary search trees on sequences of at least n accesses.[1]

Static Optimality Theorem Let q_x be the number of times element x is accessed in S. If every element is accessed at least once, then the cost of performing S is O(m + Σ_{x∈tree} q_x log(m/q_x)). (Proof: let w(x) = q_x. Then W = m.) This theorem implies that splay trees perform as well as an optimum static binary search tree on sequences of at least n accesses. They spend less time on the more frequent items.[1]

Static Finger Theorem Assume that the items are numbered from 1 through n in ascending order. Let f be any fixed element (the 'finger'). Then the cost of performing S is O(m + n log n + Σ_{x∈sequence} log(|x - f| + 1)). (Proof: let w(x) = 1/(|x - f| + 1)^2. Then W = O(1). The net potential drop is O(n log n) since the weight of any item is at least 1/n^2.)[1]

Dynamic Finger Theorem Assume that the 'finger' for each step accessing an element y is the element accessed in the previous step, x. The cost of performing S is O(m + n + Σ_{x,y∈sequence} log(|y - x| + 1)).[6][7]

Working Set Theorem At any time during the sequence, let t(x) be the number of distinct elements accessed before the previous time element x was accessed. The cost of performing S is O(m + n log n + Σ_{x∈sequence} log(t(x) + 1)). (Proof: let w(x) = 1/(t(x) + 1)^2. Note that here the weights change during the sequence. However, the sequence of weights is still a permutation of 1, 1/4, 1/9, ..., 1/n^2, so as before W = O(1). The net potential drop is O(n log n).) This theorem is equivalent to splay trees having key-independent optimality.[1]

Scanning Theorem Also known as the Sequential Access Theorem or the Queue Theorem. Accessing the n elements of a splay tree in symmetric order takes O(n) time, regardless of the initial structure of the splay tree.[8] The tightest upper bound proven so far is 4.5n.[9]

6.11.7 Dynamic optimality conjecture

Main article: Optimal binary search tree

In addition to the proven performance guarantees for splay trees there is an unproven conjecture of great interest from the original Sleator and Tarjan paper. This conjecture is known as the dynamic optimality conjecture and it basically claims that splay trees perform as well as any other binary search tree algorithm up to a constant factor.

Dynamic Optimality Conjecture:[1] Let A be any binary search tree algorithm that accesses an element x by traversing the path from the root to x at a cost of d(x) + 1, and that between accesses can make any rotations in the tree at a cost of 1 per rotation. Let A(S) be the cost for A to perform the sequence S of accesses. Then the cost for a splay tree to perform the same accesses is O(n + A(S)).

There are several corollaries of the dynamic optimality conjecture that remain unproven:

Traversal Conjecture:[1] Let T1 and T2 be two splay trees containing the same elements. Let S be the sequence obtained by visiting the elements in T2 in preorder (i.e., depth first search order). The total cost of performing the sequence S of accesses on T1 is O(n).

Deque Conjecture:[8][10][11] Let S be a sequence of m double-ended queue operations (push, pop, inject, eject). Then the cost of performing S on a splay tree is O(m + n).

Split Conjecture:[5] Let S be any permutation of the elements of the splay tree. Then the cost of deleting the elements in the order S is O(n).

6.11.8 Variants

In order to reduce the number of restructuring operations, it is possible to replace the splaying with semi-splaying, in which an element is splayed only halfway towards the root.[1]

Another way to reduce restructuring is to do full splaying, but only in some of the access operations: only when the access path is longer than a threshold, or only in the first m access operations.[1]

6.11.9 See also

• Finger tree
• Link/cut tree
• Scapegoat tree
• Zipper (data structure)
• Trees
• Tree rotation
• AVL tree
• B-tree
• T-tree
• List of data structures
• Iacono’s working set structure
• Geometry of binary search trees
• Splaysort, a sorting algorithm using splay trees

6.11.10 Notes

[1] Sleator & Tarjan 1985.
[2] Goodrich, Tamassia & Goldwasser 2014.
[3] Albers & Karpinski 2002.
[4] Allen & Munro 1978.
[5] Lucas 1991.
[6] Cole et al. 2000.
[7] Cole 2000.
[8] Tarjan 1985.
[9] Elmasry 2004.
[10] Pettie 2008.
[11] Sundar 1992.

6.11.11 References

• Albers, Susanne; Karpinski, Marek (2002). “Randomized Splay Trees: Theoretical and Experimental Results” (PDF). Information Processing Letters. 81: 213–221. doi:10.1016/s0020-0190(01)00230-7.
• Allen, Brian; Munro, Ian (1978). “Self-organizing search trees”. Journal of the ACM. 25 (4): 526–535. doi:10.1145/322092.322094.
• Cole, Richard; Mishra, Bud; Schmidt, Jeanette; Siegel, Alan (2000). “On the Dynamic Finger Conjecture for Splay Trees. Part I: Splay Sorting log n-Block Sequences”. SIAM Journal on Computing. 30: 1–43. doi:10.1137/s0097539797326988.
• Cole, Richard (2000). “On the Dynamic Finger Conjecture for Splay Trees. Part II: The Proof”. SIAM Journal on Computing. 30: 44–85. doi:10.1137/S009753979732699X.
• Elmasry, Amr (2004). “On the sequential access theorem and Deque conjecture for splay trees”. Theoretical Computer Science. 314 (3): 459–466. doi:10.1016/j.tcs.2004.01.019.
• Goodrich, Michael; Tamassia, Roberto; Goldwasser, Michael (2014). Data Structures and Algorithms in Java (6 ed.). Wiley. p. 506. ISBN 978-1-118-77133-4.
• Knuth, Donald (1997). The Art of Computer Programming. 3: Sorting and Searching (3rd ed.). Addison-Wesley. p. 478. ISBN 0-201-89685-0.
• Lucas, Joan M. (1991). “On the Competitiveness of Splay Trees: Relations to the Union-Find Problem”. Online Algorithms. Series in Discrete Mathematics and Theoretical Computer Science. Center for Discrete Mathematics and Theoretical Computer Science. 7: 95–124.
• Pettie, Seth (2008). “Splay Trees, Davenport-Schinzel Sequences, and the Deque Conjecture”. Proc. 19th ACM-SIAM Symposium on Discrete Algorithms. pp. 1115–1124. arXiv:0707.2160. Bibcode:2007arXiv0707.2160P.
• Sleator, Daniel D.; Tarjan, Robert E. (1985). “Self-Adjusting Binary Search Trees” (PDF). Journal of the ACM. 32 (3): 652–686. doi:10.1145/3828.3835.
• Sundar, Rajamani (1992). “On the Deque conjecture for the splay algorithm”. Combinatorica. 12 (1): 95–124. doi:10.1007/BF01191208.
• Tarjan, Robert E. (1985). “Sequential access in splay trees takes linear time”. Combinatorica. 5 (4): 367–378. doi:10.1007/BF02579253.

6.11.12 External links

• NIST’s Dictionary of Algorithms and Data Structures: Splay Tree
• Implementations in C and Java (by Daniel Sleator)
• Pointers to splay tree visualizations
• Fast and efficient implementation of Splay trees
• Top-Down Splay Tree Java implementation
• Zipper Trees
• splay tree video

6.12 Tango tree

A tango tree is a type of binary search tree proposed by Erik D. Demaine, Dion Harmon, John Iacono, and Mihai Patrascu in 2004.[1]

It is an online binary search tree that achieves an O(log log n) competitive ratio relative to the optimal offline binary search tree, while only using O(log log n) additional bits of memory per node. This improved upon the previous best known competitive ratio, which was O(log n).

6.12.1 Structure

Tango trees work by partitioning a binary search tree into a set of preferred paths, which are themselves stored in auxiliary trees (so the tango tree is represented as a tree of trees).

Reference Tree

To construct a tango tree, we simulate a complete binary search tree called the reference tree, which is simply a traditional binary search tree containing all the elements. This tree never shows up in the actual implementation, but is the conceptual basis behind the following pieces of a tango tree.

Preferred Paths

First, we define for each node its preferred child, which informally is the most-recently touched child by a traditional binary search tree lookup. More formally, consider a subtree T, rooted at p, with children l (left) and r (right). We set r as the preferred child of p if the most recently accessed node in T is in the subtree rooted at r, and l as the preferred child otherwise. Note that if the most recently accessed node of T is p itself, then l is the preferred child by definition.

A preferred path is defined by starting at the root and following the preferred children until reaching a leaf node. Removing the nodes on this path partitions the remainder of the tree into a number of subtrees, and we recurse on each subtree (forming a preferred path from its root, which partitions the subtree into more subtrees).

Auxiliary Trees

To represent a preferred path, we store its nodes in a balanced binary search tree, specifically a red-black tree. For each non-leaf node n in a preferred path P, it has a non-preferred child c, which is the root of a new auxiliary tree. We attach this other auxiliary tree’s root (c) to n in P, thus linking the auxiliary trees together. We also augment the auxiliary tree by storing at each node the minimum and maximum depth (depth in the reference tree, that is) of nodes in the subtree under that node.

6.12.2 Algorithm

Searching

To search for an element in the tango tree, we simply simulate searching the reference tree. We start by searching the preferred path connected to the root, which is simulated by searching the auxiliary tree corresponding to that preferred path. If the auxiliary tree doesn't contain the desired element, the search terminates on the parent of the root of the subtree containing the desired element (the beginning of another preferred path), so we simply proceed by searching the auxiliary tree for that preferred path, and so forth.

Updating

In order to maintain the structure of the tango tree (auxiliary trees correspond to preferred paths), we must do some updating work whenever preferred children change as a result of searches. When a preferred child changes, the top part of a preferred path becomes detached from the bottom part (which becomes its own preferred path) and reattached to another preferred path (which becomes the new bottom part). In order to do this efficiently, we'll define cut and join operations on our auxiliary trees.

Join Our join operation will combine two auxiliary trees as long as they have the property that the top node of one (in the reference tree) is a child of the bottom node of the other (essentially, that the corresponding preferred paths can be concatenated). This will work based on the concatenate operation of red-black trees, which combines two trees as long as they have the property that all elements of one are less than all elements of the other, and split, which does the reverse. In the reference tree, note that there exist two nodes in the top path such that a node is in the bottom path if and only if its key-value is between them. Now, to join the bottom path to the top path, we simply split the top path between those two nodes, then concatenate the two resulting auxiliary trees on either side of the bottom path’s auxiliary tree, and we have our final, joined auxiliary tree.

Cut Our cut operation will break a preferred path into two parts at a given node, a top part and a bottom part. More formally, it'll partition an auxiliary tree into two auxiliary trees, such that one contains all nodes at or above a certain depth in the reference tree, and the other contains all nodes below that depth. As in join, note that the top part has two nodes that bracket the bottom part. Thus, we can simply split on each of these two nodes to divide the path into three parts, then concatenate the two outer ones so we end up with two parts, the top and bottom, as desired.

6.12.3 Analysis

In order to bound the competitive ratio for tango trees, we must find a lower bound on the performance of the optimal offline tree that we use as a benchmark. Once we find an upper bound on the performance of the tango tree, we can divide them to bound the competitive ratio.

Interleave Bound

Main article: Interleave lower bound

To find a lower bound on the work done by the optimal offline binary search tree, we again use the notion of preferred children. When considering an access sequence (a sequence of searches), we keep track of how many times a reference tree node’s preferred child switches. The total number of switches (summed over all nodes) gives an asymptotic lower bound on the work done by any binary search tree algorithm on the given access sequence. This is called the interleave lower bound.[1]

Tango Tree

In order to connect this to tango trees, we will find an upper bound on the work done by the tango tree for a given access sequence. Our upper bound will be (k + 1)O(log log n), where k is the number of interleaves. The total cost is divided into two parts, searching for the element, and updating the structure of the tango tree to maintain the proper invariants (switching preferred children and re-arranging preferred paths).

Searching To see that the searching (not updating) fits in this bound, simply note that every time an auxiliary tree search is unsuccessful and we have to move to the next auxiliary tree, that results in a preferred child switch (since the parent preferred path now switches directions to join the child preferred path). Since all auxiliary tree searches are unsuccessful except the last one (we stop once a search is successful, naturally), we search k + 1 auxiliary trees. Each search takes O(log log n), because an auxiliary tree’s size is bounded by log n, the height of the reference tree.

Updating The update cost fits within this bound as well, because we only have to perform one cut and one join for every visited auxiliary tree. A single cut or join operation takes only a constant number of searches, splits, and concatenates, each of which takes logarithmic time in the size of the auxiliary tree, so our update cost is (k + 1)O(log log n).

Competitive Ratio

Tango trees are O(log log n)-competitive, because the work done by the optimal offline binary search tree is at least linear in k (the total number of preferred child switches), and the work done by the tango tree is at most (k + 1)O(log log n).

6.12.4 See also

• Splay tree
• Optimal binary search tree
• Red-black tree
• Tree (data structure)

6.12.5 References

[1] Demaine, E. D.; Harmon, D.; Iacono, J.; Pătraşcu, M. (2007). “Dynamic Optimality—Almost”. SIAM Journal on Computing. 37 (1): 240. doi:10.1137/S0097539705447347.

6.13 Skip list

In computer science, a skip list is a data structure that allows fast search within an ordered sequence of elements. Fast search is made possible by maintaining a linked hierarchy of subsequences, with each successive subsequence skipping over fewer elements than the previous one. Searching starts in the sparsest subsequence until two consecutive elements have been found, one smaller and one larger than or equal to the element searched for. Via the linked hierarchy, these two elements link to elements of the next sparsest subsequence, where searching is continued until finally we are searching in the full sequence. The elements that are skipped over may be chosen probabilistically[2] or deterministically,[3] with the former being more common.

[Figure: A schematic picture of the skip list data structure. Each box with an arrow represents a pointer and a row is a linked list giving a sparse subsequence; the numbered boxes at the bottom represent the ordered data sequence. Searching proceeds downwards from the sparsest subsequence at the top until consecutive elements bracketing the search element are found.]

6.13.1 Description

A skip list is built in layers. The bottom layer is an ordinary ordered linked list. Each higher layer acts as an “express lane” for the lists below, where an element in layer i appears in layer i+1 with some fixed probability p (two commonly used values for p are 1/2 or 1/4). On average, each element appears in 1/(1-p) lists, and the tallest element (usually a special head element at the front of the skip list) in all the lists, log_{1/p} n of them.

A search for a target element begins at the head element in the top list, and proceeds horizontally until the current element is greater than or equal to the target. If the current element is equal to the target, it has been found. If the current element is greater than the target, or the search reaches the end of the linked list, the procedure is repeated after returning to the previous element and dropping down vertically to the next lower list. The expected number of steps in each linked list is at most 1/p, which can be seen by tracing the search path backwards from the target until reaching an element that appears in the next higher list or reaching the beginning of the current list. Therefore, the total expected cost of a search is (log_{1/p} n)/p, which is O(log n) when p is a constant. By choosing different values of p, it is possible to trade search costs against storage costs.

Implementation details

[Figure: Inserting elements to skip list.]

The elements used for a skip list can contain more than one pointer since they can participate in more than one list.

Insertions and deletions are implemented much like the corresponding linked-list operations, except that “tall” elements must be inserted into or deleted from more than one linked list.

O(n) operations, which force us to visit every node in ascending order (such as printing the entire list), provide the opportunity to perform a behind-the-scenes derandomization of the level structure of the skip-list in an optimal way, bringing the skip list to O(log n) search time. (Choose the level of the i'th finite node to be 1 plus the number of times we can repeatedly divide i by 2 before it becomes odd. Also, i = 0 for the negative infinity header as we have the usual special case of choosing the highest possible level for negative and/or positive infinite nodes.) However this also allows someone to know where all of the higher-than-level-1 nodes are and delete them.

Alternatively, we could make the level structure quasi-random in the following way:

    make all nodes level 1
    j ← 1
    while the number of nodes at level j > 1 do
        for each i'th node at level j do
            if i is odd
                if i is not the last node at level j
                    randomly choose whether to promote it to level j+1
                else
                    do not promote
                end if
            else if i is even and node i-1 was not promoted
                promote it to level j+1
            end if
        repeat
        j ← j + 1
    repeat

Like the derandomized version, quasi-randomization is only done when there is some other reason to be running an O(n) operation (which visits every node).

The advantage of this quasi-randomness is that it doesn't give away nearly as much level-structure related information to an adversarial user as the de-randomized one. This is desirable because an adversarial user who is able to tell which nodes are not at the lowest level can pessimize performance by simply deleting higher-level nodes. (Bethea and Reiter however argue that nonetheless an adversary can use probabilistic and timing methods to force performance degradation.[4]) The search performance is still guaranteed to be logarithmic.

It would be tempting to make the following “optimization": in the part which says “Next, for each i'th...”, forget about doing a coin-flip for each even-odd pair. Just flip a coin once to decide whether to promote only the even ones or only the odd ones. Instead of O(n log n) coin flips, there would only be O(log n) of them. Unfortunately, this gives the adversarial user a 50/50 chance of being correct upon guessing that all of the even numbered nodes (among the ones at level 1 or higher) are higher than level one. This is despite the property that he has a very low probability of guessing that a particular node is at level N for some integer N.
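As a concrete sketch of the structure and search just described (the SkipNode layout, the MAX_LEVEL cap, and the use of rand() are illustrative assumptions, with p = 1/2):

    #include <cstdlib>
    #include <vector>

    const int MAX_LEVEL = 16;   // illustrative cap on list height

    struct SkipNode {
        int key;
        std::vector<SkipNode*> forward;   // forward[k] = successor in the level-k list
        SkipNode(int k, int height) : key(k), forward(height, nullptr) {}
    };

    // An element appears in level i+1 with probability p = 1/2.
    int random_level() {
        int level = 1;
        while (level < MAX_LEVEL && std::rand() % 2 == 0) ++level;
        return level;
    }

    // Search: start in the sparsest list, move right, drop down, repeat.
    SkipNode* skip_search(SkipNode* head, int target) {
        SkipNode* x = head;
        for (int level = (int)head->forward.size() - 1; level >= 0; --level)
            while (x->forward[level] && x->forward[level]->key < target)
                x = x->forward[level];        // proceed horizontally
        x = x->forward[0];                    // candidate in the full bottom list
        return (x && x->key == target) ? x : nullptr;
    }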

A skip list does not provide the same absolute worst-case performance guarantees as more traditional balanced tree data structures, because it is always possible (though with very low probability) that the coin-flips used to build the skip list will produce a badly balanced structure. However, they work well in practice, and the randomized balancing scheme has been argued to be easier to implement than the deterministic balancing schemes used in balanced binary search trees. Skip lists are also useful in parallel computing, where insertions can be done in different parts of the skip list in parallel without any global rebalancing of the data structure. Such parallelism can be especially advantageous for resource discovery in an ad-hoc wireless network because a randomized skip list can be made robust to the loss of any single node.[5]

Indexable skiplist

As described above, a skiplist is capable of fast O(log n) insertion and removal of values from a sorted sequence, but it has only slow O(n) lookups of values at a given position in the sequence (i.e. return the 500th value); however, with a minor modification the speed of random access indexed lookups can be improved to O(log n).

For every link, also store the width of the link. The width is defined as the number of bottom layer links being traversed by each of the higher layer “express lane” links. For example, here are the widths of the links in the example at the top of the page:

     1                                                          10
    o---> o-------------------------------------------------------> o    Top level
     1            3                  2                    5
    o---> o----------------> o-------------> o--------------------> o    Level 3
     1        2        1         2                 3           2
    o---> o-------> o---> o-------> o----------------> o----------> o    Level 2
     1     1     1     1     1     1     1     1     1     1     1
    o---> o---> o---> o---> o---> o---> o---> o---> o---> o---> o---> o  Bottom level
    Head  1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th  NIL
          Node  Node  Node  Node  Node  Node  Node  Node  Node  Node

Notice that the width of a higher level link is the sum of the component links below it (i.e. the width 10 link spans the links of widths 3, 2 and 5 immediately below it). Consequently, the sum of all widths is the same on every level (10 + 1 = 1 + 3 + 2 + 5 = 1 + 2 + 1 + 2 + 3 + 2).

To index the skiplist and find the i'th value, traverse the skiplist while counting down the widths of each traversed link. Descend a level whenever the upcoming width would be too large.

For example, to find the node in the fifth position (Node 5), traverse a link of width 1 at the top level. Now four more steps are needed but the next width on this level is ten which is too large, so drop one level. Traverse one link of width 3. Since another step of width 2 would be too far, drop down to the bottom level. Now traverse the final link of width 1 to reach the target running total of 5 (1 + 3 + 1).

    function lookupByPositionIndex(i)
        node ← head
        i ← i + 1                            # don't count the head as a step
        for level from top to bottom do
            while i ≥ node.width[level] do   # if next step is not too far
                i ← i - node.width[level]    # subtract the current width
                node ← node.next[level]      # traverse forward at the current level
            repeat
        repeat
        return node.value
    end function

This method of implementing indexing is detailed in Section 3.4 Linear List Operations in “A skip list cookbook” by William Pugh.

6.13.2 History

Skip lists were first described in 1989 by William Pugh.[6] To quote the author:

    Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.

6.13.3 Usages

List of applications and frameworks that use skip lists:

• MemSQL uses skiplists as its prime indexing structure for its database technology.
• Cyrus IMAP server offers a “skiplist” backend DB implementation (source file)
• Lucene uses skip lists to search delta-encoded posting lists in logarithmic time.
• QMap (up to Qt 4) template class of Qt that provides a dictionary.
• Redis, an ANSI-C open-source persistent key/value store for Posix systems, uses skip lists in its implementation of ordered sets.[7]
• nessDB, a very fast key-value embedded Database Storage Engine (Using log-structured-merge (LSM) trees), uses skip lists for its memtable.
• skipdb is an open-source database format using ordered key/value pairs.
• ConcurrentSkipListSet and ConcurrentSkipListMap in the Java 1.6 API.
• leveldb, a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values
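The pseudocode above translates almost line for line into C++; the sketch below assumes a node layout whose next[k] and width[k] fields mirror the hypothetical node.next and node.width used above:

    #include <vector>

    // Illustrative indexable node: width[k] counts bottom-level links
    // spanned by the level-k link next[k].
    struct IndexableNode {
        int value;
        std::vector<IndexableNode*> next;
        std::vector<int> width;
    };

    // Returns the value at position i (0-based), following the pseudocode above.
    int lookup_by_position(IndexableNode* head, int i) {
        IndexableNode* node = head;
        i = i + 1;                                   // don't count the head as a step
        for (int level = (int)head->next.size() - 1; level >= 0; --level) {
            while (node->next[level] && i >= node->width[level]) {  // next step not too far
                i -= node->width[level];             // subtract the current width
                node = node->next[level];            // traverse forward at this level
            }
        }
        return node->value;
    }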

Skip lists are used for efficient statistical computations of running medians (also known as moving medians). Skip lists are also used in distributed applications (where the nodes represent physical computers, and pointers represent network connections) and for implementing highly scalable concurrent priority queues with less lock contention,[8] or even without locking,[9][10][11] as well as lockless concurrent dictionaries.[12] There are also several US patents for using skip lists to implement (lockless) priority queues and concurrent dictionaries.[13]

6.13.4 See also

• Bloom filter
• Skip graph

6.13.5 References

[1] http://www.cs.uwaterloo.ca/research/tr/1993/28/root2side.pdf
[2] Pugh, W. (1990). “Skip lists: A probabilistic alternative to balanced trees” (PDF). Communications of the ACM. 33 (6): 668. doi:10.1145/78973.78977.
[3] Munro, J. Ian; Papadakis, Thomas; Sedgewick, Robert (1992). “Deterministic skip lists”. Proceedings of the third annual ACM-SIAM symposium on Discrete algorithms (SODA '92). Orlando, Florida, USA: Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 367–375.
[4] Darrell Bethea and Michael K. Reiter, Data Structures with Unpredictable Timing, https://www.cs.unc.edu/~djb/papers/2009-ESORICS.pdf, section 4 “Skip Lists”.
[5] Shah, Gauri (2003). Distributed Data Structures for Peer-to-Peer Systems (PDF) (Ph.D. thesis). Yale University.
[6] William Pugh (April 1989). “Concurrent Maintenance of Skip Lists”, Tech. Report CS-TR-2222, Dept. of Computer Science, U. Maryland.
[7] “Redis ordered set implementation”.
[8] Shavit, N.; Lotan, I. (2000). “Skiplist-based concurrent priority queues”. Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000 (PDF). p. 263. doi:10.1109/IPDPS.2000.845994. ISBN 0-7695-0574-0.
[9] Sundell, H.; Tsigas, P. (2003). “Fast and lock-free concurrent priority queues for multi-thread systems”. Proceedings International Parallel and Distributed Processing Symposium. p. 11. doi:10.1109/IPDPS.2003.1213189. ISBN 0-7695-1926-1.
[10] Fomitchev, Mikhail; Ruppert, Eric (2004). Lock-free linked lists and skip lists (PDF). Proc. Annual ACM Symp. on Principles of Distributed Computing (PODC). pp. 50–59. doi:10.1145/1011767.1011776. ISBN 1581138024.
[11] Bajpai, R.; Dhara, K. K.; Krishnaswamy, V. (2008). “QPID: A Distributed Priority Queue with Item Locality”. 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications. p. 215. doi:10.1109/ISPA.2008.90. ISBN 978-0-7695-3471-8.
[12] Sundell, H. K.; Tsigas, P. (2004). “Scalable and lock-free concurrent dictionaries”. Proceedings of the 2004 ACM symposium on Applied computing - SAC '04 (PDF). p. 1438. doi:10.1145/967900.968188. ISBN 1581138121.
[13] US patent 7937378.

6.13.6 External links

• “Skip list” entry in the Dictionary of Algorithms and Data Structures
• Skip Lists: A Linked List with Self-Balancing BST-Like Properties on MSDN in C# 2.0
• Skip Lists lecture (MIT OpenCourseWare: Introduction to Algorithms)
• Open Data Structures - Chapter 4 - Skiplists
• Skip trees, an alternative data structure to skip lists in a concurrent approach
• Skip tree graphs, a distributed version of skip trees
• More on skip tree graphs, a distributed version of skip trees

Demo applets

• Skip List Applet by Kubo Kovac
• Thomas Wenger’s demo applet on skiplists

Implementations

• Algorithm::SkipList, implementation in Perl on CPAN
• Raymond Hettinger’s implementation in Python
• ConcurrentSkipListSet documentation for Java 6 (and sourcecode)

6.14 B-tree

Not to be confused with Binary tree.

In computer science, a B-tree is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children (Comer 1979, p. 123).

Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. B-trees are a good example of a data structure for external memory. It is commonly used in databases and filesystems.

6.14.1 Overview

[Figure: A B-tree (Bayer & McCreight 1972) of order 5 (Knuth 1998).]

In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation. For example, in a 2-3 B-tree (often simply referred to as a 2-3 tree), each internal node may have only 2 or 3 child nodes.

Each internal node of a B-tree will contain a number of keys. The keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a1 and a2. All values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than a2.

Usually, the number of keys is chosen to vary between d and 2d, where d is the minimum number of keys, and d + 1 is the minimum degree or branching factor of the tree. In practice, the keys take up the most space in a node. The factor of 2 will guarantee that nodes can be split or combined. If an internal node has 2d keys, then adding a key to that node can be accomplished by splitting the hypothetical 2d+1 key node into two d key nodes and moving the key that would have been in the middle to the parent node. Each split node has the required minimum number of keys. Similarly, if an internal node and its neighbor each have d keys, then a key may be deleted from the internal node by combining it with its neighbor. Deleting the key would make the internal node have d - 1 keys; joining the neighbor would add d keys plus one more key brought down from the neighbor's parent. The result is an entirely full node of 2d keys.

The number of branches (or child nodes) from a node will be one more than the number of keys stored in the node. In a 2-3 B-tree, the internal nodes will store either one key (with two child nodes) or two keys (with three child nodes). A B-tree is sometimes described with the parameters (d+1) – (2d+1) or simply with the highest branching order, (2d+1).

A B-tree is kept balanced by requiring that all leaf nodes be at the same depth. This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more node farther away from the root.

B-trees have substantial advantages over alternative implementations when the time to access the data of a node greatly exceeds the time spent processing that data, because then the cost of accessing the node may be amortized over multiple operations within the node. This usually occurs when the node data are in secondary storage such as disk drives. By maximizing the number of keys within each internal node, the height of the tree decreases and the number of expensive node accesses is reduced. In addition, rebalancing of the tree occurs less often. The maximum number of child nodes depends on the information that must be stored for each child node and the size of a full disk block or an analogous size in secondary storage. While 2-3 B-trees are easier to explain, practical B-trees using secondary storage need a large number of child nodes to improve performance.

Variants

The term B-tree may refer to a specific design or it may refer to a general class of designs. In the narrow sense, a B-tree stores keys in its internal nodes but need not store those keys in the records at the leaves. The general class includes variations such as the B+ tree and the B* tree.

• In the B+ tree, copies of the keys are stored in the internal nodes; the keys and records are stored in leaves; in addition, a leaf node may include a pointer to the next leaf node to speed sequential access (Comer 1979, p. 129).
• The B* tree balances more neighboring internal nodes to keep the internal nodes more densely packed (Comer 1979, p. 129). This variant requires non-root nodes to be at least 2/3 full instead of 1/2 (Knuth 1998, p. 488). To maintain this, instead of immediately splitting up a node when it gets full, its keys are shared with a node next to it. When both nodes are full, then the two nodes are split into three. Deleting nodes is somewhat more complex than inserting, however.
• B-trees can be turned into order statistic trees to allow rapid searches for the Nth record in key order, or counting the number of records between any two records, and various other related operations.[1]

Etymology

Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs in 1971 (Bayer & McCreight 1972), but they did not explain what, if anything, the B stands for. Douglas Comer explains:

    The origin of “B-tree” has never been explained by the authors. As we shall see, “balanced,” “broad,” or “bushy” might apply. Others suggest that the “B” stands for Boeing. Because of his contributions, however, it seems appropriate to think of B-trees as “Bayer"-trees. (Comer 1979, p. 123 footnote 1)

Donald Knuth speculates on the etymology of B-trees in his May, 1980 lecture on the topic “CS144C classroom lecture about disk storage and B-trees”, suggesting the “B” may have originated from Boeing or from Bayer’s name.[2]

After a talk at CPM 2013 (24th Annual Symposium on Combinatorial Pattern Matching, Bad Herrenalb, Germany, June 17–19, 2013), Ed McCreight answered a question on the B-tree’s name by Martin Farach-Colton saying: “Bayer and I were in a lunch time where we get to think a name. And we were, so, B, we were thinking… B is, you know… We were working for Boeing at the time, we couldn't use the name without talking to lawyers. So, there is a B. It has to do with balance, another B. Bayer was the senior author, who did have several years older than I am and had many more publications than I did. So there is another B. And so, at the lunch table we never did resolve whether there was one of those that made more sense than the rest. What really lives to say is: the more you think about what the B in B-trees means, the better you understand B-trees.”[3]

6.14.2 B-tree usage in databases

Time to search a sorted file

Usually, sorting and searching algorithms have been characterized by the number of comparison operations that must be performed using order notation. A binary search of a sorted table with N records, for example, can be done in roughly ⌈log2 N⌉ comparisons. If the table had 1,000,000 records, then a specific record could be located with at most 20 comparisons: ⌈log2(1,000,000)⌉ = 20.

Large databases have historically been kept on disk drives. The time to read a record on a disk drive far exceeds the time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds.[4] For simplicity, assume reading from disk takes about 10 milliseconds.

Naively, then, the time to locate one record out of a million would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 seconds.

The time won't be that bad because individual records are grouped together in a disk block. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons don't need to do any disk reads; the comparisons are all within the last disk block read.

To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.

An index speeds the search

A significant improvement can be made with an index. In the example above, initial disk reads narrowed the search range by a factor of two. That can be improved substantially by creating an auxiliary index that contains the first record in each disk block (sometimes called a sparse index). This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly. Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database, at a cost of one more disk read. The index would hold 10,000 entries, so it would take at most 14 comparisons. Like the main database, the last 6 or so comparisons in the aux index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.

The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an aux-aux index that would need only 100 entries and would fit in one disk block.

Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the aux-aux index identifies the relevant block in the aux-index. Reading and searching that aux-index block identifies the relevant block in the main database. Instead of 150 milliseconds, we need only 30 milliseconds to get the record.

The auxiliary indices have turned the search problem from a binary search requiring roughly log2 N disk reads to one requiring only log_b N disk reads, where b is the blocking factor (the number of entries per block: b = 100 entries per block; log_b 1,000,000 = 3 reads).

In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.

Insertions and deletions

If the database does not change, then compiling the index is simple to do, and the index need never be changed. If there are changes, then managing the database and its index becomes more complicated.

Deleting records from a database is relatively easy. The index can stay the same, and the record can just be marked as deleted. The database stays in sorted order. If there is a large number of deletions, then the searching and storage become less efficient.

Insertions can be very slow in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record in the file requires shifting all of the records down one. Such an operation is just too expensive to be practical. One solution is to leave some space available to be used for insertions. Instead of densely storing all the records in a block, the block can have some free space to allow for subsequent insertions. Those records would be marked as if they were “deleted” records.

Both insertions and deletions are fast as long as space is available on a block. If an insertion won't fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The hope is that enough space is nearby such that a lot of blocks do not need to be reorganized. Alternatively, some out-of-sequence disk blocks may be used.

Advantages of B-tree usage for databases

The B-tree uses all of the ideas described above. In particular, a B-tree:

• keeps keys in sorted order for sequential traversing
• uses a hierarchical index to minimize the number of disk reads
• uses partially full blocks to speed insertions and deletions
• keeps the index balanced with an elegant recursive algorithm

In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an arbitrary number of insertions and deletions.

Disadvantages of B-trees

• maximum key length cannot be changed without completely rebuilding the database. This led to many database systems truncating full human names to 70 characters. (Other implementations of associative array, such as a separate-chaining hash table, dynamically adapt to arbitrarily long key lengths.)

6.14.3 Technical description

Terminology

The literature on B-trees is not uniform in its terminology (Folk & Zoellick 1992, p. 362).

Bayer & McCreight (1972), Comer (1979), and others define the order of a B-tree as the minimum number of keys in a non-root node. Folk & Zoellick (1992) points out that this terminology is ambiguous because the maximum number of keys is not clear. An order 3 B-tree might hold a maximum of 6 keys or a maximum of 7 keys. Knuth (1998, p. 483) avoids the problem by defining the order to be the maximum number of children (which is one more than the maximum number of keys).

The term leaf is also inconsistent. Bayer & McCreight (1972) considered the leaf level to be the lowest level of keys, but Knuth considered the leaf level to be one level below the lowest keys (Folk & Zoellick 1992, p. 363). There are many possible implementation choices. In some designs, the leaves may hold the entire data record; in other designs, the leaves may only hold pointers to the data record. Those choices are not fundamental to the idea of a B-tree.[5]

There are also unfortunate choices like using the variable k to represent the number of children when k could be confused with the number of keys.

For simplicity, most authors assume there are a fixed number of keys that fit in a node. The basic assumption is that the key size is fixed and the node size is fixed. In practice, variable length keys may be employed (Folk & Zoellick 1992, p. 379).

Definition

According to Knuth's definition, a B-tree of order m is a tree which satisfies the following properties:

1. Every node has at most m children.
2. Every non-leaf node (except root) has at least ⌈m/2⌉ children.
3. The root has at least two children if it is not a leaf node.
4. A non-leaf node with k children contains k-1 keys.
5. All leaves appear in the same level.

Each internal node's keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a1 and a2. All values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than a2.

Internal nodes Internal nodes are all nodes except for leaf nodes and the root node. They are usually represented as an ordered set of elements and child pointers. Every internal node contains a maximum of U children and a minimum of L children. Thus, the number of elements is always 1 less than the number of child pointers (the number of elements is between L-1 and U-1). U must be either 2L or 2L-1; therefore each internal node is at least half full. The relationship between U and L implies that two half-full nodes can be joined to make a legal node, and one full node can be split into two legal nodes (if there's room to push one element up into the parent). These properties make it possible to delete and insert new values into a B-tree and adjust the tree to preserve the B-tree properties.

The root node The root node's number of children has the same upper limit as internal nodes, but has no lower limit. For example, when there are fewer than L-1 elements in the entire tree, the root will be the only node in the tree with no children at all.

Leaf nodes Leaf nodes have the same restriction on the number of elements, but have no children, and no child pointers.

A B-tree of depth n+1 can hold about U times as many items as a B-tree of depth n, but the cost of search, insert, and delete operations grows with the depth of the tree. As with any balanced tree, the cost grows much more slowly than the number of elements.

Some balanced trees store values only at leaf nodes, and use different kinds of nodes for leaf nodes and internal nodes. B-trees keep values in every node in the tree, and may use the same structure for all nodes. However, since leaf nodes never have children, the B-trees benefit from improved performance if they use a specialized structure.

6.14.4 Best case and worst case heights

Let h be the height of the classic B-tree. Let n > 0 be the number of entries in the tree.[6] Let m be the maximum number of children a node can have. Each node can have at most m-1 keys.

It can be shown (by induction for example) that a B-tree of height h with all its nodes completely filled has n = m^(h+1) - 1 entries. Hence, the best case height of a B-tree is:

    ⌈log_m(n + 1)⌉ - 1

Let d be the minimum number of children an internal (non-root) node can have. For an ordinary B-tree, d = ⌈m/2⌉.

Comer (1979, p. 127) and Cormen et al. (2001, pp. 383–384) give the worst case height of a B-tree (where the root node is considered to have height 0) as:

    h ≤ ⌊log_d((n + 1)/2)⌋

6.14.5 Algorithms

Search

Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to bottom. At each level, the search reduces its field of view to the child pointer (subtree) whose range includes the search value. A subtree's range is defined by the values, or keys, contained in its parent node. These limiting values are also known as separation values.

Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.
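As a rough sketch of this search, assuming a simple in-memory node layout (the BNode struct and its field names are illustrative, not from the original text):

    #include <algorithm>
    #include <vector>

    // Illustrative node: keys are sorted; children.size() == keys.size() + 1
    // for internal nodes, and children is empty for leaves.
    struct BNode {
        std::vector<int> keys;
        std::vector<BNode*> children;
        bool leaf() const { return children.empty(); }
    };

    // Returns true if 'key' occurs in the subtree rooted at x.
    bool btree_search(const BNode* x, int key) {
        while (true) {
            // Binary search within the node for the first separation value >= key.
            auto it = std::lower_bound(x->keys.begin(), x->keys.end(), key);
            if (it != x->keys.end() && *it == key) return true;  // found
            if (x->leaf()) return false;                         // nowhere to descend
            // Descend into the child whose range brackets the key.
            x = x->children[it - x->keys.begin()];
        }
    }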

A B-tree of depth n+1 can hold about U times as many All insertions start at a leaf node. To insert a new ele- items as a B-tree of depth n, but the cost of search, insert, ment, search the tree to find the leaf node where the new and delete operations grows with the depth of the tree. As element should be added. Insert the new element into that with any balanced tree, the cost grows much more slowly node with the following steps: than the number of elements. Some balanced trees store values only at leaf nodes, and 1. If the node contains fewer than the maximum legal use different kinds of nodes for leaf nodes and internal number of elements, then there is room for the new nodes. B-trees keep values in every node in the tree, and element. Insert the new element in the node, keeping may use the same structure for all nodes. However, since the node’s elements ordered. leaf nodes never have children, the B-trees benefit from 2. Otherwise the node is full, evenly split it into two improved performance if they use a specialized structure. nodes so:

6.14.4 Best case and worst case heights (a) A single median is chosen from among the leaf’s elements and the new element. Let h be the height of the classic B-tree. Let n > 0 be the (b) Values less than the median are put in the new number of entries in the tree.[6] Let m be the maximum left node and values greater than the median number of children a node can have. Each node can have are put in the new right node, with the median at most m−1 keys. acting as a separation value. 208 CHAPTER 6. SUCCESSORS AND NEIGHBORS

tains one more element, and hence it is legal too. If U−1 is even, then U=2L−1, so there are 2L−2 elements in the node. Half of this number is L−1, which is the minimum number of elements allowed per node. An improved algorithm supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this improved al- gorithm, we must be able to send one element to the par- ent and split the remaining U−2 elements into two legal nodes, without adding a new element. This requires U = 2L rather than U = 2L−1, which accounts for why some textbooks impose this requirement in defining B-trees.
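Step 2 can be sketched in C++ on a flat, sorted key vector (split_leaf and its calling convention are illustrative assumptions; child-pointer bookkeeping for internal nodes is omitted):

    #include <vector>

    // Split an over-full sorted key vector around its median (step 2 above).
    // 'left' keeps the smaller keys; 'right' receives the larger ones;
    // the returned median is pushed into the parent by the caller.
    int split_leaf(std::vector<int>& left, std::vector<int>& right) {
        std::size_t mid = left.size() / 2;                  // (a) choose the median
        int median = left[mid];
        right.assign(left.begin() + mid + 1, left.end());   // (b) larger keys to the right node
        left.resize(mid);                                   //     smaller keys stay on the left
        return median;                                      // (c) separator for the parent
    }

Splitting {1, 2, 3, 4, 5} this way leaves {1, 2} in the left node, returns 3 as the separator, and fills the right node with {4, 5}.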

Deletion

There are two popular strategies for deletion from a B-tree.

1. Locate and delete the item, then restructure the tree to retain its invariants, OR

2. Do a single pass down the tree, but before entering (visiting) a node, restructure the tree so that once the key to be deleted is encountered, it can be deleted without triggering the need for any further restructuring

The algorithm below uses the former strategy. There are two special cases to consider when deleting an element:

1. The element in an internal node is a separator for its child nodes
2. Deleting an element may put its node under the minimum number of elements and children

The procedures for these cases are in order below.

Deletion from a leaf node

1. Search for the value to delete.
2. If the value is in a leaf node, simply delete it from the node.
3. If underflow happens, rebalance the tree as described in section “Rebalancing after deletion” below.
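A minimal C++ sketch of these three steps on a leaf's sorted key vector (min_keys and the boolean return convention are illustrative assumptions; the caller performs the rebalancing described next):

    #include <algorithm>
    #include <vector>

    // Erase 'key' from a leaf's sorted key vector.
    // Returns false if the leaf underflows and must be rebalanced.
    bool leaf_erase(std::vector<int>& keys, int key, std::size_t min_keys) {
        auto it = std::lower_bound(keys.begin(), keys.end(), key);
        if (it != keys.end() && *it == key)
            keys.erase(it);                 // step 2: delete from the node
        return keys.size() >= min_keys;    // step 3: signal underflow to the caller
    }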

Deletion from an internal node

Each element in an internal node acts as a separation value for two subtrees, therefore we need to find a replacement for separation. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest element in the right subtree is still greater than the separator. Both of those elements are in leaf nodes, and either one can be the new separator for the two subtrees. Algorithmically described below:

1. Choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree), remove it from the leaf node it is in, and replace the element to be deleted with the new separator.
2. The previous step deleted an element (the new separator) from a leaf node. If that leaf node is now deficient (has fewer than the required number of nodes), then rebalance the tree starting from the leaf node.

Rebalancing after deletion

Rebalancing starts from a leaf and proceeds toward the root until the tree is balanced. If deleting an element from a node has brought it under the minimum size, then some elements must be redistributed to bring all nodes up to the minimum. Usually, the redistribution involves moving an element from a sibling node that has more than the minimum number of nodes. That redistribution operation is called a rotation. If no sibling can spare an element, then the deficient node must be merged with a sibling. The merge causes the parent to lose a separator element, so the parent may become deficient and need rebalancing. The merging and rebalancing may continue all the way to the root. Since the minimum element count doesn't apply to the root, making the root be the only deficient node is not a problem. The algorithm to rebalance the tree is as follows (a sketch of the two rotation cases appears after this list):

• If the deficient node's right sibling exists and has more than the minimum number of elements, then rotate left:
    1. Copy the separator from the parent to the end of the deficient node (the separator moves down; the deficient node now has the minimum number of elements).
    2. Replace the separator in the parent with the first element of the right sibling (the right sibling loses one node but still has at least the minimum number of elements).
    3. The tree is now balanced.
• Otherwise, if the deficient node's left sibling exists and has more than the minimum number of elements, then rotate right:
    1. Copy the separator from the parent to the start of the deficient node (the separator moves down; the deficient node now has the minimum number of elements).
    2. Replace the separator in the parent with the last element of the left sibling (the left sibling loses one node but still has at least the minimum number of elements).
    3. The tree is now balanced.
• Otherwise, if both immediate siblings have only the minimum number of elements, then merge with a sibling sandwiching their separator taken off from their parent:
    1. Copy the separator to the end of the left node (the left node may be the deficient node or it may be the sibling with the minimum number of elements).
    2. Move all elements from the right node to the left node (the left node now has the maximum number of elements, and the right node is empty).
    3. Remove the separator from the parent along with its empty right child (the parent loses an element).
        ◦ If the parent is the root and now has no elements, then free it and make the merged node the new root (the tree becomes shallower).
        ◦ Otherwise, if the parent has fewer than the required number of elements, then rebalance the parent.

Note: The rebalancing operations are different for B+ trees (e.g., rotation is different because the parent has a copy of the key) and B*-trees (e.g., three siblings are merged into two siblings).
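The two rotation cases might be sketched as follows, again on flat key vectors (the separator_index parameter and the omission of child-pointer updates are simplifying assumptions):

    #include <vector>

    // Borrow from the right sibling through the parent ("rotate left").
    void rotate_left(std::vector<int>& deficient, std::vector<int>& parent,
                     std::size_t separator_index, std::vector<int>& right_sibling) {
        deficient.push_back(parent[separator_index]);       // separator moves down
        parent[separator_index] = right_sibling.front();    // first sibling key moves up
        right_sibling.erase(right_sibling.begin());
    }

    // Borrow from the left sibling through the parent ("rotate right").
    void rotate_right(std::vector<int>& deficient, std::vector<int>& parent,
                      std::size_t separator_index, std::vector<int>& left_sibling) {
        deficient.insert(deficient.begin(), parent[separator_index]);  // separator moves down
        parent[separator_index] = left_sibling.back();                 // last sibling key moves up
        left_sibling.pop_back();
    }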

Sequential access

While freshly loaded databases tend to have good sequential behavior, this behavior becomes increasingly difficult to maintain as a database grows, resulting in more random I/O and performance challenges.[7]

Initial construction

In applications, it is frequently useful to build a B-tree to represent a large existing collection of data and then update it incrementally using standard B-tree operations. In this case, the most efficient way to construct the initial B-tree is not to insert every element in the initial collection successively, but instead to construct the initial set of leaf nodes directly from the input, then build the internal nodes from these. This approach to B-tree construction is called bulkloading.

Initially, every leaf but the last one has one extra element, which will be used to build the internal nodes.

For example, if the leaf nodes have maximum size 4 and the initial collection is the integers 1 through 24, we would initially construct 4 leaf nodes containing 5 values each and 1 which contains 4 values. We build the next level up from the leaves by taking the last element from each leaf node except the last one. Again, each node except the last will contain one extra value. In the example, suppose the internal nodes contain at most 2 values (3 child pointers). This process is continued until we reach a level with only one node and it is not overfilled; in the example, only the root level remains. (A sketch of this procedure in code follows.)
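Below is a Python sketch of this bulkloading procedure. It is illustrative, not the book's code; for clarity it assumes the counts work out as in the example and omits the redistribution a production implementation performs when the key count does not divide evenly at a level's right edge.

    # Build a B-tree bottom-up from sorted keys: construct the leaf level,
    # then repeatedly promote the extra last key of every node except the
    # last to form the next level, until a single (root) node remains.
    def chunk(seq, size):
        return [seq[i:i + size] for i in range(0, len(seq), size)]

    def bulkload(sorted_keys, leaf_max, node_max):
        # Every leaf except the last holds one extra key (leaf_max + 1).
        level = [{"keys": g, "children": None} for g in chunk(sorted_keys, leaf_max + 1)]
        while len(level) > 1:
            # Promote the extra (last) key of every node except the last.
            seps = [n["keys"].pop() for n in level[:-1]]
            parents, pos = [], 0
            for keys in chunk(seps, node_max + 1):
                # A parent owns one more child than it has "real" keys;
                # a non-final parent keeps one promoted key as its extra.
                width = min(len(keys), node_max) + 1
                parents.append({"keys": keys, "children": level[pos:pos + width]})
                pos += width
            parents[-1]["children"] += level[pos:]  # leftovers join the last parent
            level = parents
        return level[0]

    # The example from the text: integers 1..24, leaves of at most 4 keys,
    # internal nodes of at most 2 keys (3 children).
    root = bulkload(list(range(1, 25)), leaf_max=4, node_max=2)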
6.14.6 In filesystems

Most modern filesystems use B-trees (or § Variants); alternatives such as extendible hashing are less common.[8]

In addition to its use in databases, the B-tree is also used in filesystems to allow quick random access to an arbitrary block in a particular file. The basic problem is turning the file block i address into a disk block (or perhaps to a cylinder-head-sector) address.

Some operating systems require the user to allocate the maximum size of the file when the file is created. The file can then be allocated as contiguous disk blocks. When converting to a disk block, the operating system just adds the file block address to the starting disk block of the file. The scheme is simple, but the file cannot exceed its created size.

Other operating systems allow a file to grow. The resulting disk blocks may not be contiguous, so mapping logical blocks to physical blocks is more involved.

MS-DOS, for example, used a simple File Allocation Table (FAT). The FAT has an entry for each disk block,[note 1] and that entry identifies whether its block is used by a file and if so, which block (if any) is the next disk block of the same file. So, the allocation of each file is represented as a linked list in the table. In order to find the disk address of file block i, the operating system (or disk utility) must sequentially follow the file's linked list in the FAT. Worse, to find a free disk block, it must sequentially scan the FAT. For MS-DOS, that was not a huge penalty because the disks and files were small and the FAT had few entries and relatively short file chains. In the FAT12 filesystem (used on floppy disks and early hard disks), there were no more than 4,080[note 2] entries, and the FAT would usually be resident in memory. As disks got bigger, the FAT architecture began to confront penalties. On a large disk using FAT, it may be necessary to perform disk reads to learn the disk location of a file block to be read or written.

TOPS-20 (and possibly TENEX) used a 0 to 2 level tree that has similarities to a B-tree. A disk block was 512 36-bit words. If the file fit in a 512 (2⁹) word block, then the file directory would point to that physical disk block. If the file fit in 2¹⁸ words, then the directory would point to an aux index; the 512 words of that index would either be NULL (the block isn't allocated) or point to the physical address of the block. If the file fit in 2²⁷ words, then the directory would point to a block holding an aux-aux index; each entry would either be NULL or point to an aux index. Consequently, the physical disk block for a 2²⁷ word file could be located in two disk reads and read on the third.

Apple's filesystem HFS+, Microsoft's NTFS,[9] AIX (jfs2) and some Linux filesystems, such as btrfs and Ext4, use B-trees. B*-trees are used in the HFS and Reiser4 file systems.

6.14.7 Variations

Access concurrency

Lehman and Yao[10] showed that all the read locks could be avoided (and thus concurrent access greatly improved) by linking the tree blocks at each level together with a "next" pointer. This results in a tree structure where both insertion and search operations descend from the root to the leaf. Write locks are only required as a tree block is modified. This maximizes access concurrency by multiple users, an important consideration for databases and/or other B-tree based ISAM storage methods. The cost associated with this improvement is that empty pages cannot be removed from the btree during normal operations. (However, see [11] for various strategies to implement node merging, and source code at [12].)

United States Patent 5283894, granted in 1994, appears to show a way to use a 'Meta Access Method'[13] to allow concurrent B+ tree access and modification without locks. The technique accesses the tree 'upwards' for both searches and updates by means of additional in-memory indexes that point at the blocks in each level in the block cache. No reorganization for deletes is needed and there are no 'next' pointers in each block as in Lehman and Yao.

6.14.8 See also

• B+ tree
• R-tree
• 2–3 tree
• 2–3–4 tree

6.14.9 Notes

[1] For FAT, what is called a "disk block" here is what the FAT documentation calls a "cluster", which is a fixed-size group of one or more contiguous whole physical disk sectors. For the purposes of this discussion, a cluster has no significant difference from a physical sector.

[2] Two of these were reserved for special purposes, so only 4,078 could actually represent disk blocks (clusters).

6.14.10 References

[1] Counted B-Trees, retrieved 2010-01-25.

[2] Knuth's video lectures from Stanford.

[3] Talk's video, retrieved 2014-01-17.

[4] Seagate Technology LLC, Product Manual: Barracuda ES.2 Serial ATA, Rev. F., publication 100468393, 2008, page 6.

[5] Bayer & McCreight (1972) avoided the issue by saying an index element is a (physically adjacent) pair of (x, a) where x is the key, and a is some associated information. The associated information might be a pointer to a record or records in a random access, but what it was didn't really matter. Bayer & McCreight (1972) states, "For this paper the associated information is of no further interest."

[6] If n is zero, then no root node is needed, so the height of an empty tree is not well defined.

[7] "Cache Oblivious B-trees". State University of New York (SUNY) at Stony Brook. Retrieved 2011-01-17.

[8] Mikuláš Patocka. "Design and Implementation of the Spad Filesystem". "Table 4.1: Directory organization in filesystems". 2006.

[9] Mark Russinovich. "Inside Win2K NTFS, Part 1". Microsoft Developer Network. Archived from the original on 13 April 2008. Retrieved 2008-04-18.

[10] "Efficient locking for concurrent operations on B-trees". Portal.acm.org. doi:10.1145/319628.319663. Retrieved 2012-06-28.

[11] http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf

[12] "Downloads - high-concurrency-btree - High Concurrency B-Tree code in C - GitHub Project Hosting". Retrieved 2014-01-27.

[13] Lockless Concurrent B+Tree.

General

• Bayer, R.; McCreight, E. (1972), "Organization and Maintenance of Large Ordered Indexes" (PDF), Acta Informatica, 1 (3): 173–189, doi:10.1007/bf00288683.
• Comer, Douglas (June 1979), "The Ubiquitous B-Tree", Computing Surveys, 11 (2): 123–137, doi:10.1145/356770.356776, ISSN 0360-0300.
• Cormen, Thomas; Leiserson, Charles; Rivest, Ronald; Stein, Clifford (2001), Introduction to Algorithms (Second ed.), MIT Press and McGraw-Hill, pp. 434–454, ISBN 0-262-03293-7. Chapter 18: B-Trees.
• Folk, Michael J.; Zoellick, Bill (1992), File Structures (2nd ed.), Addison-Wesley, ISBN 0-201-55713-4.
• Knuth, Donald (1998), Sorting and Searching, The Art of Computer Programming, Volume 3 (Second ed.), Addison-Wesley, ISBN 0-201-89685-0. Section 6.2.4: Multiway Trees, pp. 481–491. Also, pp. 476–477 of section 6.2.3 (Balanced Trees) discusses 2-3 trees.

Original papers

• Bayer, Rudolf; McCreight, E. (July 1970), Organization and Maintenance of Large Ordered Indices, Mathematical and Information Sciences Report No. 20, Boeing Scientific Research Laboratories.
• Bayer, Rudolf (1971), Binary B-Trees for Virtual Memory, Proceedings of 1971 ACM-SIGFIDET Workshop on Data Description, Access and Control, San Diego, California.

6.14.11 External links

• B-tree lecture by David Scot Taylor, SJSU
• B-Tree animation applet by slady
• B-tree and UB-tree on Scholarpedia Curator: Dr Rudolf Bayer
• B-Trees: Balanced Tree Data Structures
• NIST's Dictionary of Algorithms and Data Structures: B-tree
• B-Tree Tutorial
• The InfinityDB BTree implementation
• Cache Oblivious B(+)-trees
• Dictionary of Algorithms and Data Structures entry for B*-tree
• Open Data Structures - Section 14.2 - B-Trees
• Counted B-Trees
• B-Tree .Net, a modern, virtualized RAM & Disk implementation

6.15 B+ tree

[Figure: A simple B+ tree example linking the keys 1–7 to data values d1–d7. The linked list (red) allows rapid in-order traversal. This particular tree's branching factor is b=4.]

A B+ tree is an n-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves.[1] The root may be either a leaf or a node with two or more children.[2]

A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves.

The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, filesystems. This is primarily because unlike binary search trees, B+ trees have very high fanout (number of pointers to child nodes in a node,[1] typically on the order of 100 or more), which reduces the number of I/O operations required to find an element in the tree.

The ReiserFS, NSS, XFS, JFS, ReFS, and BFS filesystems all use this type of tree for metadata indexing; BFS also uses B+ trees for storing directories. NTFS uses B+ trees for directory indexing. EXT4 uses extent trees (a modified B+ tree data structure) for file extent indexing.[3] Relational database management systems such as IBM DB2,[4] Informix,[4] Microsoft SQL Server,[4] Oracle 8,[4] Sybase ASE,[4] and SQLite[5] support this type of tree for table indices. Key-value database management systems such as CouchDB[6] and Tokyo Cabinet[7] support this type of tree for data access.

6.15.1 Overview

The order, or branching factor, b of a B+ tree measures the capacity of nodes (i.e., the number of child nodes) for internal nodes in the tree. The actual number of children for a node, referred to here as m, is constrained for internal nodes so that ⌈b/2⌉ ≤ m ≤ b. The root is an exception: it is allowed to have as few as two children.[1] For example, if the order of a B+ tree is 7, each internal node (except for the root) may have between 4 and 7 children; the root may have between 2 and 7. Leaf nodes have no children, but are constrained so that the number of keys must be at least ⌈b/2⌉−1 and at most b−1. In the situation where a B+ tree is nearly empty, it only contains one node, which is a leaf node. (The root is also the single leaf, in this case.) This node is permitted to have as little as one key if necessary, and at most b.

6.15.2 Algorithms

Search

The root of a B+ Tree represents the whole range of values in the tree, where every internal node is a subinterval. We are looking for a value k in the B+ Tree. Starting from the root, we are looking for the leaf which may contain the value k. At each node, we determine which internal pointer we should follow. An internal B+ Tree node has at most d ≤ b children, where every one of them represents a different sub-interval. We select the corresponding child node by searching on the key values of the node.

    function search(k):
        return tree_search(k, root)

    function tree_search(k, node):
        if node is a leaf:
            return node
        switch k:
            case k < k_0:            return tree_search(k, p_0)
            case k_i ≤ k < k_{i+1}:  return tree_search(k, p_{i+1})
            case k_d ≤ k:            return tree_search(k, p_{d+1})

This pseudocode assumes that no duplicates are allowed. (A runnable rendering of this search appears after the insertion algorithm below.)

Prefix key compression

• It is important to increase fan-out, as this allows searches to be directed to the leaf level more efficiently.
• Index entries only serve to 'direct traffic'; thus we can compress them.

Insertion

Perform a search to determine which bucket the new record should go into.

• If the bucket is not full (at most b − 1 entries after the insertion), add the record.
• Otherwise, split the bucket.
  • Allocate a new leaf and move half the bucket's elements to the new bucket.
  • Insert the new leaf's smallest key and address into the parent.
  • If the parent is full, split it too.
    • Add the middle key to the parent node.
  • Repeat until a parent is found that need not split.
• If the root splits, create a new root which has one key and two pointers. (That is, the value that gets pushed to the new root gets removed from the original node.)

B-trees grow at the root and not at the leaves.[1]
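To make the search procedure concrete, here is a minimal Python sketch. It is an illustration under assumed names: the Node class and its keys/children/values lists are not from the text, and internal nodes hold sorted keys with len(children) == len(keys) + 1, mirroring the tree_search pseudocode above (a key equal to a separator descends into the right subtree).

    import bisect

    class Node:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys          # sorted keys
            self.children = children  # child nodes (internal nodes only)
            self.values = values      # data values (leaves only)

        def is_leaf(self):
            return self.children is None

    def tree_search(k, node):
        """Descend to the leaf whose key range may contain k."""
        while not node.is_leaf():
            # index of the first key > k selects the correct subinterval
            i = bisect.bisect_right(node.keys, k)
            node = node.children[i]
        return node

    def lookup(k, root):
        leaf = tree_search(k, root)
        i = bisect.bisect_left(leaf.keys, k)
        if i < len(leaf.keys) and leaf.keys[i] == k:
            return leaf.values[i]
        return None

    # Example: a two-level tree over keys 1..7.
    leaf1 = Node(keys=[1, 2, 3], values=["d1", "d2", "d3"])
    leaf2 = Node(keys=[4, 5, 6, 7], values=["d4", "d5", "d6", "d7"])
    root = Node(keys=[4], children=[leaf1, leaf2])
    assert lookup(5, root) == "d5"
    assert lookup(8, root) is None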

Deletion

• Start at the root, find the leaf L where the entry belongs.
• Remove the entry.
  • If L is at least half-full, done!
  • If L has fewer entries than it should:
    • If a sibling (adjacent node with the same parent as L) is more than half-full, re-distribute, borrowing an entry from it.
    • Otherwise, the sibling is exactly half-full, so we can merge L and the sibling.
• If a merge occurred, we must delete the entry (pointing to L or the sibling) from the parent of L.
• The merge could propagate to the root, decreasing the height.

Bulk-loading

Given a collection of data records, we want to create a B+ tree index on some key field. One approach is to insert each record into an empty tree. However, it is quite expensive, because each entry requires us to start from the root and go down to the appropriate leaf page. An efficient alternative is to use bulk-loading.

• The first step is to sort the data entries according to a search key in ascending order.
• We allocate an empty page to serve as the root, and insert a pointer to the first page of entries into it.
• When the root is full, we split the root, and create a new root page.
• Keep inserting entries to the right-most index page just above the leaf level, until all entries are indexed.

Note that (1) when the right-most index page above the leaf level fills up, it is split; (2) this action may, in turn, cause a split of the right-most index page one step closer to the root; and (3) splits only occur on the right-most path from the root to the leaf level.
6.15.3 Characteristics

For a b-order B+ tree with h levels of index:

• The maximum number of records stored is n_max = b^h − b^(h−1)
• The minimum number of records stored is n_min = 2⌈b/2⌉^(h−1)
• The minimum number of keys is nk_min = 2⌈b/2⌉^(h−1) − 1
• The maximum number of keys is nk_max = b^h
• The space required to store the tree is O(n)
• Inserting a record requires O(log_b n) operations
• Finding a record requires O(log_b n) operations
• Removing a (previously located) record requires O(log_b n) operations
• Performing a range query with k elements occurring within the range requires O(log_b n + k) operations

6.15.4 Implementation

The leaves (the bottom-most index blocks) of the B+ tree are often linked to one another in a linked list; this makes range queries or an (ordered) iteration through the blocks simpler and more efficient (though the aforementioned upper bound can be achieved even without this addition). This does not substantially increase space consumption or maintenance on the tree. This illustrates one of the significant advantages of a B+ tree over a B-tree; in a B-tree, since not all keys are present in the leaves, such an ordered linked list cannot be constructed. A B+ tree is thus particularly useful as a database system index, where the data typically resides on disk, as it allows the B+ tree to actually provide an efficient structure for housing the data itself (this is described in [4]:238 as index structure "Alternative 1").

If a storage system has a block size of B bytes, and the keys to be stored have a size of k, arguably the most efficient B+ tree is one where b = (B/k) − 1 (for example, B = 4096 and k = 8 give b = 511). Although theoretically the one-off is unnecessary, in practice there is often a little extra space taken up by the index blocks (for example, the linked list references in the leaf blocks). Having an index block which is slightly larger than the storage system's actual block represents a significant performance decrease; therefore erring on the side of caution is preferable.

If nodes of the B+ tree are organized as arrays of elements, then it may take a considerable time to insert or delete an element, as half of the array will need to be shifted on average. To overcome this problem, elements inside a node can be organized in a binary tree or a B+ tree instead of an array.

B+ trees can also be used for data stored in RAM. In this case a reasonable choice for block size would be the size of the processor's cache line.

Space efficiency of B+ trees can be improved by using some compression techniques. One possibility is to use delta encoding to compress keys stored into each block. For internal blocks, space saving can be achieved by either compressing keys or pointers. For string keys, space can be saved by using the following technique:

Normally the i-th entry of an internal block contains the first key of block i+1. Instead of storing the full key, we could store the shortest prefix of the first key of block i+1 that is strictly greater (in lexicographic order) than the last key of block i. (For example, if the last key of block i is "Smith" and the first key of block i+1 is "Solomon", the two-character prefix "So" suffices.) There is also a simple way to compress pointers: if we suppose that some consecutive blocks i, i+1, ..., i+k are stored contiguously, then it will suffice to store only a pointer to the first block and the count of consecutive blocks.

All the above compression techniques have some drawbacks. First, a full block must be decompressed to extract a single element. One technique to overcome this problem is to divide each block into sub-blocks and compress them separately. In this case searching or inserting an element will only need to decompress or compress a sub-block instead of a full block. Another drawback of compression techniques is that the number of stored elements may vary considerably from one block to another depending on how well the elements are compressed inside each block.

6.15.5 History

The B tree was first described in the paper Organization and Maintenance of Large Ordered Indices. Acta Informatica 1: 173–189 (1972) by Rudolf Bayer and Edward M. McCreight. There is no single paper introducing the B+ tree concept. Instead, the notion of maintaining all data in leaf nodes is repeatedly brought up as an interesting variant. An early survey of B trees also covering B+ trees is Douglas Comer: "The Ubiquitous B-Tree", ACM Computing Surveys 11(2): 121–137 (1979). Comer notes that the B+ tree was used in IBM's VSAM data access software and he refers to an IBM published article from 1973.

6.15.6 See also

• Binary search tree
• B-tree
• Divide and conquer algorithm

6.15.7 References

[1] Navathe, Ramez Elmasri, Shamkant B. (2010). Fundamentals of database systems (6th ed.). Upper Saddle River, N.J.: Pearson Education. pp. 652–660. ISBN 9780136086208.

[2] http://www.seanster.com/BplusTree/BplusTree.html

[3] Giampaolo, Dominic (1999). Practical File System Design with the Be File System (PDF). Morgan Kaufmann. ISBN 1-55860-497-9.

[4] Ramakrishnan Raghu, Gehrke Johannes - Database Management Systems, McGraw-Hill Higher Education (2000), 2nd edition (en), page 267.

[5] SQLite Version 3 Overview.

[6] CouchDB Guide (see note after 3rd paragraph).

[7] Tokyo Cabinet reference. Archived September 12, 2009, at the Wayback Machine.

6.15.8 External links

• B+ tree in Python, used to implement a list
• Dr. Monge's B+ Tree index notes
• Evaluating the performance of CSB+-trees on Multithreaded Architectures
• Effect of node size on the performance of cache conscious B+-trees
• Fractal Prefetching B+-trees
• Towards pB+-trees in the field: implementation choices and performance
• Cache-Conscious Index Structures for Main-Memory Databases
• Cache Oblivious B(+)-trees
• The Power of B-Trees: CouchDB B+ Tree Implementation
• B+ Tree Visualization

Implementations

• Interactive B+ Tree Implementation in C
• Interactive B+ Tree Implementation in C++
• Memory based B+ tree implementation as C++ template library
• Stream based B+ tree implementation as C++ template library
• Open Source JavaScript B+ Tree Implementation
• Perl implementation of B+ trees
• Java/C#/Python implementations of B+ trees
• C# B+Tree implementation, MIT License
• File based B+Tree in C# with threading and MVCC support
• JavaScript B+ Tree, MIT License
• JavaScript B+ Tree, Interactive and Open Source

Chapter 7

Integer and string searching

7.1 Trie

This article is about a tree data structure. For the French commune, see Trie-sur-Baïse.

[Figure: A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn".]

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are not necessarily associated with every node. Rather, values tend only to be associated with leaves, and with some inner nodes that correspond to keys of interest. For the space-optimized presentation of prefix tree, see compact prefix tree.

In the example shown, keys are listed in the nodes and values below them. Each complete English word has an arbitrary integer value associated with it. A trie can be seen as a tree-shaped deterministic finite automaton. Each finite language is generated by a trie automaton, and each trie can be compressed into a deterministic acyclic finite state automaton.

Though tries are usually keyed by character strings, they need not be. The same algorithms can be adapted to serve similar functions of ordered lists of any construct, e.g. permutations on a list of digits or shapes. In particular, a bitwise trie is keyed on the individual bits making up any fixed-length binary datum, such as an integer or memory address.

7.1.1 History and etymology

Tries were first described by R. de la Briandais in 1959.[1][2]:336 The term trie was coined two years later by Edward Fredkin, who pronounces it /ˈtriː/ (as "tree"), after the middle syllable of retrieval.[3][4] However, other authors pronounce it /ˈtraɪ/ (as "try"), in an attempt to distinguish it verbally from "tree".[3][4][5]

7.1.2 Applications

As a replacement for other data structures

As discussed below, a trie has a number of advantages over binary search trees.[6] A trie can also be used to replace a hash table, over which it has the following advantages:

• Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
• There are no collisions of different keys in a trie.
• Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.


• There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
• A trie can provide an alphabetical ordering of the entries by key.

Tries do have some drawbacks as well:

• Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.[7]
• Some keys, such as floating point numbers, can lead to long chains and prefixes that are not particularly meaningful. Nevertheless, a bitwise trie can handle standard IEEE single and double format floating point numbers.
• Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.

Dictionary representation

A common application of a trie is storing a predictive text or autocomplete dictionary, such as found on a mobile telephone. Such applications take advantage of a trie's ability to quickly search for, insert, and delete entries; however, if storing dictionary words is all that is required (i.e., storage of information auxiliary to each word is not required), a minimal deterministic acyclic finite state automaton (DAFSA) would use less space than a trie. This is because a DAFSA can compress identical branches from the trie which correspond to the same suffixes (or parts) of different words being stored.

Tries are also well suited for implementing approximate matching algorithms,[8] including those used in spell checking and hyphenation[4] software.

Term indexing

A discrimination tree term index stores its information in a trie data structure.[9]

7.1.3 Algorithms

Lookup and membership are easily described. The listing below implements a recursive trie node as a Haskell data type. It stores an optional value and a collection of children tries, indexed by the next character:

    import Data.Map

    data Trie a = Trie { value    :: Maybe a
                       , children :: Map Char (Trie a) }

We can look up a value in the trie as follows:

    find :: String -> Trie a -> Maybe a
    find []     t = value t
    find (k:ks) t = do
        ct <- Data.Map.lookup k (children t)
        find ks ct

In an imperative style, and assuming an appropriate data type in place, we can describe the same algorithm in Python (here, specifically for testing membership). Note that children is a dictionary mapping characters to a node's children; and we say that a "terminal" node is one which contains a valid word.

    def find(node, key):
        for char in key:
            if char in node.children:
                node = node.children[char]
            else:
                return None
        return node

Insertion proceeds by walking the trie according to the string to be inserted, then appending new nodes for the suffix of the string that is not contained in the trie. In imperative pseudocode:

    algorithm insert(root : node, s : string, value : any):
        node = root
        i = 0
        n = length(s)
        while i < n:
            if node.child(s[i]) != nil:
                node = node.child(s[i])
                i = i + 1
            else:
                break
        (* append new nodes, if necessary *)
        while i < n:
            node.child(s[i]) = new node
            node = node.child(s[i])
            i = i + 1
        node.value = value

Sorting

Lexicographic sorting of a set of keys can be accomplished with a simple trie-based algorithm as follows (a runnable version is sketched after this section):

• Insert all keys in a trie.
• Output all keys in the trie by means of pre-order traversal, which results in output that is in lexicographically increasing order. Pre-order traversal is a kind of depth-first traversal.

This algorithm is a form of radix sort. A trie forms the fundamental data structure of Burstsort, which (in 2007) was the fastest known string sorting algorithm.[10] However, there are now faster string sorting algorithms.[11]
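The following runnable Python sketch combines the find/insert routines above with the trie-based sort. It is an illustrative companion, not the book's code; the TrieNode class and function names are assumptions.

    class TrieNode:
        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.value = None    # payload; non-None marks a stored key

    def insert(root, key, value):
        node = root
        for char in key:
            node = node.children.setdefault(char, TrieNode())
        node.value = value

    def find(root, key):
        node = root
        for char in key:
            if char not in node.children:
                return None
            node = node.children[char]
        return node.value

    def sorted_keys(node, prefix=""):
        """Pre-order traversal yields stored keys in lexicographic order."""
        if node.value is not None:
            yield prefix
        for char in sorted(node.children):      # visit children in order
            yield from sorted_keys(node.children[char], prefix + char)

    root = TrieNode()
    for i, word in enumerate(["to", "tea", "ted", "ten", "A", "i", "in", "inn"]):
        insert(root, word, i)
    print(list(sorted_keys(root)))
    # ['A', 'i', 'in', 'inn', 'tea', 'ted', 'ten', 'to']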

Full text search

A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.

7.1.4 Implementation strategies

There are several ways to represent tries, corresponding to different trade-offs between memory use and speed of the operations. The basic form is that of a linked set of nodes, where each node contains an array of child pointers, one for each symbol in the alphabet (so for the English alphabet, one would store 26 child pointers and for the alphabet of bytes, 256 pointers). This is simple but wasteful in terms of memory: using the alphabet of bytes (size 256) and four-byte pointers, each node requires a kilobyte of storage, and when there is little overlap in the strings' prefixes, the number of required nodes is roughly the combined length of the stored strings.[2]:341 Put another way, the nodes near the bottom of the tree tend to have few children and there are many of them, so the structure wastes space storing null pointers.[12]

The storage problem can be alleviated by an implementation technique called alphabet reduction, whereby the original strings are reinterpreted as longer strings over a smaller alphabet. E.g., a string of n bytes can alternatively be regarded as a string of 2n four-bit units and stored in a trie with sixteen pointers per node. Lookups need to visit twice as many nodes in the worst case, but the storage requirements go down by a factor of eight.[2]:347–352

An alternative implementation represents a node as a triple (symbol, child, next) and links the children of a node together as a singly linked list: child points to the node's first child, next to the parent node's next child.[12][13] The set of children can also be represented as a binary search tree; one instance of this idea is the ternary search tree developed by Bentley and Sedgewick.[2]:353

[Figure: A trie implemented as a doubly chained tree: vertical arrows are child pointers, dashed horizontal arrows are next pointers. The set of strings stored in this trie is {baby, bad, bank, box, dad, dance}. The lists are sorted to allow traversal in lexicographic order.]

Another alternative in order to avoid the use of an array of 256 pointers (ASCII), as suggested before, is to store the alphabet array as a bitmap of 256 bits representing the ASCII alphabet, reducing dramatically the size of the nodes.[14]

Bitwise tries

Bitwise tries are much the same as a normal character-based trie except that individual bits are used to traverse what effectively becomes a form of binary tree. Generally, implementations use a special CPU instruction to very quickly find the first set bit in a fixed-length key (e.g., GCC's __builtin_clz() intrinsic). This value is then used to index a 32- or 64-entry table which points to the first item in the bitwise trie with that number of leading zero bits. The search then proceeds by testing each subsequent bit in the key and choosing child[0] or child[1] appropriately until the item is found.

Although this process might sound slow, it is very cache-local and highly parallelizable due to the lack of register dependencies and therefore in fact has excellent performance on modern out-of-order execution CPUs. A red-black tree, for example, performs much better on paper, but is highly cache-unfriendly and causes multiple pipeline and TLB stalls on modern CPUs, which makes that algorithm bound by memory latency rather than CPU speed. In comparison, a bitwise trie rarely accesses memory, and when it does, it does so only to read, thus avoiding SMP cache coherency overhead. Hence, it is increasingly becoming the algorithm of choice for code that performs many rapid insertions and deletions, such as memory allocators (e.g., recent versions of the famous Doug Lea's allocator (dlmalloc) and its descendants). A simplified sketch follows.
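The following minimal Python sketch illustrates only the core idea of a bitwise trie over fixed-width integer keys: each step tests one bit and follows child[0] or child[1]. It deliberately omits the leading-zero-count table optimization described above; all names are illustrative assumptions, not from the text.

    WIDTH = 32  # fixed key width in bits

    class BitNode:
        __slots__ = ("child", "value")
        def __init__(self):
            self.child = [None, None]
            self.value = None

    def bit_insert(root, key, value):
        node = root
        for shift in range(WIDTH - 1, -1, -1):
            b = (key >> shift) & 1      # test one bit, most significant first
            if node.child[b] is None:
                node.child[b] = BitNode()
            node = node.child[b]
        node.value = value

    def bit_find(root, key):
        node = root
        for shift in range(WIDTH - 1, -1, -1):
            node = node.child[(key >> shift) & 1]
            if node is None:
                return None
        return node.value

    root = BitNode()
    bit_insert(root, 42, "answer")
    assert bit_find(root, 42) == "answer" and bit_find(root, 7) is None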
Compressing tries

Compressing the trie and merging the common branches can sometimes yield large performance gains. This works best under the following conditions:

• The trie is mostly static (key insertions to or deletions from a pre-filled trie are disabled).
• Only lookups are needed.
• The trie nodes are not keyed by node-specific data, or the nodes' data are common.[15]
• The total set of stored keys is very sparse within their representation space.

For example, it may be used to represent sparse bitsets, i.e., subsets of a much larger, fixed enumerable set. In such a case, the trie is keyed by the bit element position within the full set. The key is created from the string of bits needed to encode the integral position of each element. Such tries have a very degenerate form with many missing branches. After detecting the repetition of common patterns or filling the unused gaps, the unique leaf nodes (bit strings) can be stored and compressed easily, reducing the overall size of the trie.

Such compression is also used in the implementation of the various fast lookup tables for retrieving Unicode character properties. These could include case-mapping tables (e.g. for the Greek letter pi, from Π to π), or lookup tables normalizing the combination of base and combining characters (like the a-umlaut in German, ä, or the dalet-patah-dagesh-ole in Biblical Hebrew, ַּ֫ד). For such applications, the representation is similar to transforming a very large, unidimensional, sparse table (e.g. Unicode code points) into a multidimensional matrix of their combinations, and then using the coordinates in the hyper-matrix as the string key of an uncompressed trie to represent the resulting character. The compression will then consist of detecting and merging the common columns within the hyper-matrix to compress the last dimension in the key. For example, to avoid storing the full, multi-byte Unicode code point of each element forming a matrix column, the groupings of similar code points can be exploited. Each dimension of the hyper-matrix stores the start position of the next dimension, so that only the offset (typically a single byte) need be stored. The resulting vector is itself compressible when it is also sparse, so each dimension (associated to a layer level in the trie) can be compressed separately.

Some implementations do support such data compression within dynamic sparse tries and allow insertions and deletions in compressed tries. However, this usually has a significant cost when compressed segments need to be split or merged. Some tradeoff has to be made between data compression and update speed. A typical strategy is to limit the range of global lookups for comparing the common branches in the sparse trie.

The result of such compression may look similar to trying to transform the trie into a directed acyclic graph (DAG), because the reverse transform from a DAG to a trie is obvious and always possible. However, the shape of the DAG is determined by the form of the key chosen to index the nodes, in turn constraining the compression possible.

Another compression strategy is to "unravel" the data structure into a single byte array.[16] This approach eliminates the need for node pointers, substantially reducing the memory requirements. This in turn permits memory mapping and the use of virtual memory to efficiently load the data from disk.

One more approach is to "pack" the trie.[4] Liang describes a space-efficient implementation of a sparse packed trie applied to automatic hyphenation, in which the descendants of each node may be interleaved in memory.

External memory tries

Several trie variants are suitable for maintaining sets of strings in external memory, including suffix trees. A trie/B-tree combination called the B-trie has also been suggested for this task; compared to suffix trees, they are limited in the supported operations but also more compact, while performing update operations faster.[17]

7.1.5 See also

• Suffix tree
• Radix tree
• Directed acyclic word graph (aka DAWG)
• Acyclic deterministic finite automata
• Hash trie
• Deterministic finite automata
• Judy array
• Search algorithm
• Extendible hashing
• Hash array mapped trie
• Prefix Hash Tree
• Burstsort
• Luleå algorithm
• Huffman coding
• HAT-trie

7.1.6 References

[1] de la Briandais, René (1959). File searching using variable length keys. Proc. Western J. Computer Conf. pp. 295–298. Cited by Brass.

[2] Brass, Peter (2008). Advanced Data Structures. Cambridge University Press.

[3] Black, Paul E. (2009-11-16). "trie". Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology. Archived from the original on 2010-05-19.

[4] Franklin Mark Liang (1983). Word Hy-phen-a-tion By Com-put-er (Doctor of Philosophy thesis). Stanford University. Archived from the original (PDF) on 2010-05-19. Retrieved 2010-03-28.

[5] Knuth, Donald (1997). "6.3: Digital Searching". The Art of Computer Programming Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley. p. 492. ISBN 0-201-89685-0.

[6] Bentley, Jon; Sedgewick, Robert (1998-04-01). "Ternary Search Trees". Dr. Dobb's Journal. Dr Dobb's. Archived from the original on 2008-06-23.

[7] Edward Fredkin (1960). "Trie Memory". Communications of the ACM. 3 (9): 490–499. doi:10.1145/367390.367400.

[8] Aho, Alfred V.; Corasick, Margaret J. (Jun 1975). "Efficient String Matching: An Aid to Bibliographic Search" (PDF). Communications of the ACM. 18 (6): 333–340. doi:10.1145/360825.360855.

[9] John W. Wheeler; Guarionex Jordan. "An Empirical Study of Term Indexing in the Darwin Implementation of the Model Evolution Calculus". 2004. p. 5.

[10] "Cache-Efficient String Sorting Using Copying" (PDF). Retrieved 2008-11-15.

[11] "Engineering Radix Sort for Strings." (PDF). Retrieved 2013-03-11.

[12] Allison, Lloyd. "Tries". Retrieved 18 February 2014.

[13] Sahni, Sartaj. "Tries". Data Structures, Algorithms, & Applications in Java. University of Florida. Retrieved 18 February 2014.

[14] Bellekens, Xavier (2014). A Highly-Efficient Memory-Compression Scheme for GPU-Accelerated Intrusion Detection Systems. Glasgow, Scotland, UK: ACM. pp. 302:302–302:309. ISBN 978-1-4503-3033-6. Retrieved 21 October 2015.

[15] Jan Daciuk; Stoyan Mihov; Bruce W. Watson; Richard E. Watson (2000). "Incremental Construction of Minimal Acyclic Finite-State Automata". Computational Linguistics. Association for Computational Linguistics. 26: 3. doi:10.1162/089120100561601. Archived from the original on 2006-03-13. Retrieved 2009-05-28. This paper presents a method for direct building of minimal acyclic finite states automaton which recognizes a given finite list of words in lexicographical order. Our approach is to construct a minimal automaton in a single phase by adding new strings one by one and minimizing the resulting automaton on-the-fly.

[16] Ulrich Germann; Eric Joanis; Samuel Larkin (2009). "Tightly packed tries: how to fit large models into memory, and make them load fast, too" (PDF). ACL Workshops: Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing. Association for Computational Linguistics. pp. 31–39. We present Tightly Packed Tries (TPTs), a compact implementation of read-only, compressed trie structures with fast on-demand paging and short load times. We demonstrate the benefits of TPTs for storing n-gram back-off language models and phrase tables for statistical machine translation. Encoded as TPTs, these databases require less space than flat text file representations of the same data compressed with the gzip utility. At the same time, they can be mapped into memory quickly and be searched directly in time linear in the length of the key, without the need to decompress the entire file. The overhead for local decompression during search is marginal.

[17] Askitis, Nikolas; Zobel, Justin (2008). "B-tries for Disk-based String Management" (PDF). VLDB Journal: 1–26. ISSN 1066-8888.

7.1.7 External links

• NIST's Dictionary of Algorithms and Data Structures: Trie
• Tries by Lloyd Allison
• Comparison and Analysis
• Java reference implementation Simple with prefix compression and deletions.

7.2 Radix tree

[Figure: An example of a radix tree.]

In computer science, a radix tree (also radix trie or compact prefix tree) is a data structure that represents a space-optimized trie in which each node that is the only child is merged with its parent. The result is that the number of children of every internal node is at least the radix r of the radix tree, where r is a positive integer power of 2 (r = 2^x with x ≥ 1). Unlike in regular tries, edges can be labeled with sequences of elements as well as single elements. This makes radix trees much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes.

Unlike regular trees (where whole keys are compared en masse from their beginning up to the point of inequality), the key at each node is compared chunk-of-bits by chunk-of-bits, where the quantity of bits in that chunk at that node is the radix r of the radix trie. When r is 2, the radix trie is binary (i.e., compare that node's 1-bit portion of the key), which minimizes sparseness at the expense of maximizing trie depth—i.e., maximizing up to conflation of nondiverging bit-strings in the key. When r is an integer power of 2 greater than or equal to 4, then the radix trie is an r-ary trie, which lessens the depth of the radix trie at the expense of potential sparseness.

As an optimization, edge labels can be stored in constant size by using two pointers to a string (for the first and last elements).[1]

Note that although the examples in this article show strings as sequences of characters, the type of the string elements can be chosen arbitrarily; for example, as a bit or byte of the string representation when using multibyte character encodings or Unicode.

7.2.1 Applications

Radix trees are useful for constructing associative arrays with keys that can be expressed as strings. They find particular application in the area of IP routing, where the ability to contain large ranges of values with a few exceptions is particularly suited to the hierarchical organization of IP addresses.[2] They are also used for inverted indexes of text documents in information retrieval.

7.2.2 Operations

Radix trees support insertion, deletion, and searching operations. Insertion adds a new string to the trie while trying to minimize the amount of data stored. Deletion removes a string from the trie. Searching operations include (but are not necessarily limited to) exact lookup, find predecessor, find successor, and find all strings with a prefix. All of these operations are O(k) where k is the maximum length of all strings in the set, where length is measured in the quantity of bits equal to the radix of the radix trie.

Lookup

[Figure: Finding a string in a Patricia trie.]

The lookup operation determines if a string exists in a trie. Most operations modify this approach in some way to handle their specific tasks. For instance, the node where a string terminates may be of importance. This operation is similar to tries except that some edges consume multiple elements.

The following pseudocode assumes that these classes exist (a Python rendering follows):

Edge
• Node targetNode
• string label

Node
• Array of Edges edges
• function isLeaf()

    function lookup(string x):
        // Begin at the root with no elements found
        Node traverseNode := root
        int elementsFound := 0
        // Traverse until a leaf is found or it is not possible to continue
        while (traverseNode != null && !traverseNode.isLeaf() && elementsFound < x.length):
            // Get the next edge to explore based on the elements not yet found in x;
            // x.suffix(elementsFound) returns the last (x.length - elementsFound) elements of x
            Edge nextEdge := select edge from traverseNode.edges
                             where edge.label is a prefix of x.suffix(elementsFound)
            // Was an edge found?
            if (nextEdge != null):
                // Set the next node to explore
                traverseNode := nextEdge.targetNode
                // Increment elements found based on the label stored at the edge
                elementsFound += nextEdge.label.length
            else:
                // Terminate loop
                traverseNode := null
        // A match is found if we arrive at a leaf node and
        // have used up exactly x.length elements
        return (traverseNode != null && traverseNode.isLeaf()
                && elementsFound == x.length)
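Here is an illustrative Python rendering of that lookup pseudocode. The node representation (a dict of outgoing edges keyed by their full label) and all names are assumptions made for the sketch, not the book's code.

    class RadixNode:
        def __init__(self):
            self.edges = {}          # edge label (str) -> RadixNode

        def is_leaf(self):
            return not self.edges

    def lookup(root, x):
        node, found = root, 0
        while node is not None and not node.is_leaf() and found < len(x):
            suffix = x[found:]
            # select the edge whose label is a prefix of the remaining input
            label = next((l for l in node.edges if suffix.startswith(l)), None)
            if label is None:
                return False
            node = node.edges[label]
            found += len(label)
        return node is not None and node.is_leaf() and found == len(x)

    # Example: a tree storing "test" and "toaster".
    leaf_a, leaf_b, mid, root = RadixNode(), RadixNode(), RadixNode(), RadixNode()
    mid.edges = {"est": leaf_a, "oaster": leaf_b}
    root.edges = {"t": mid}
    assert lookup(root, "test") and lookup(root, "toaster")
    assert not lookup(root, "te")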

Insertion

To insert a string, we search the tree until we can make no further progress. At this point we either add a new outgoing edge labeled with all remaining elements in the input string, or if there is already an outgoing edge sharing a prefix with the remaining input string, we split it into two edges (the first labeled with the common prefix) and proceed. This splitting step ensures that no node has more children than there are possible string elements.

Several cases of insertion are shown below, though more may exist. Note that r simply represents the root. It is assumed that edges can be labelled with empty strings to terminate strings where necessary and that the root has no incoming edge. (The lookup algorithm described above will not work when using empty-string edges.)

• Insert 'water' at the root
• Insert 'slower' while keeping 'slow'
• Insert 'test' which is a prefix of 'tester'
• Insert 'team' while splitting 'test' and creating a new edge label 'st'
• Insert 'toast' while splitting 'te' and moving previous strings a level lower

Deletion

To delete a string x from a tree, we first locate the leaf representing x. Then, assuming x exists, we remove the corresponding leaf node. If the parent of our leaf node has only one other child, then that child's incoming label is appended to the parent's incoming label and the child is removed.

Additional operations

• Find all strings with common prefix: Returns an array of strings which begin with the same prefix.
• Find predecessor: Locates the largest string less than a given string, by lexicographic order.
• Find successor: Locates the smallest string greater than a given string, by lexicographic order.

7.2.3 History

Donald R. Morrison first described what he called "Patricia trees" in 1968;[3] the name comes from the acronym PATRICIA, which stands for "Practical Algorithm To Retrieve Information Coded In Alphanumeric". Gernot Gwehenberger independently invented and described the data structure at about the same time.[4] PATRICIA tries are radix tries with radix equal to 2, which means that each bit of the key is compared individually and each node is a two-way (i.e., left versus right) branch.

7.2.4 Comparison to other data structures

(In the following comparisons, it is assumed that the keys are of length k and the data structure contains n members.)

Unlike balanced trees, radix trees permit lookup, insertion, and deletion in O(k) time rather than O(log n). This does not seem like an advantage, since normally k ≥ log n, but in a balanced tree every comparison is a string comparison requiring O(k) worst-case time, many of which are slow in practice due to long common prefixes (in the case where comparisons begin at the start of the string). In a trie, all comparisons require constant time, but it takes m comparisons to look up a string of length m. Radix trees can perform these operations with fewer comparisons, and require many fewer nodes.

Radix trees also share the disadvantages of tries, however: as they can only be applied to strings of elements or elements with an efficiently reversible mapping to strings, they lack the full generality of balanced search trees, which apply to any data type with a total ordering. A reversible mapping to strings can be used to produce the required total ordering for balanced search trees, but not the other way around. This can also be problematic if a data type only provides a comparison operation, but not a (de)serialization operation.

Hash tables are commonly said to have expected O(1) insertion and deletion times, but this is only true when considering computation of the hash of the key to be a constant-time operation. When hashing the key is taken into account, hash tables have expected O(k) insertion and deletion times, but may take longer in the worst case depending on how collisions are handled. Radix trees have worst-case O(k) insertion and deletion. The successor/predecessor operations of radix trees are also not implemented by hash tables.

7.2.5 Variants

A common extension of radix trees uses two colors of nodes, 'black' and 'white'. To check if a given string is stored in the tree, the search starts from the top and follows the edges of the input string until no further progress can be made. If the search string is consumed and the final node is a black node, the search has failed; if it is white, the search has succeeded. This enables us to add a large range of strings with a common prefix to the tree using white nodes, then remove a small set of "exceptions" in a space-efficient manner by inserting them using black nodes.

The HAT-trie is a cache-conscious data structure based on radix trees that offers efficient string storage and retrieval, and ordered iterations. Performance, with respect to both time and space, is comparable to the cache-conscious hashtable.[5][6]

The adaptive radix tree is a radix tree variant that integrates adaptive node sizes into the radix tree. One major drawback of the usual radix trees is the use of space, because a constant node size is used at every level. The major difference between the radix tree and the adaptive radix tree is the latter's variable size for each node, based on the number of child elements, which grows while adding new entries. Hence, the adaptive radix tree leads to a better use of space without reducing its speed.[7][8][9]

7.2.6 See also

• Prefix tree (also known as a Trie)
• Deterministic acyclic finite state automaton (DAFSA)

• Ternary search tries
• Acyclic deterministic finite automata
• Hash trie
• Deterministic finite automata
• Judy array
• Search algorithm
• Extendible hashing
• Hash array mapped trie
• Prefix hash tree
• Burstsort
• Luleå algorithm
• Huffman coding

7.2.7 References

[1] Morin, Patrick. "Data Structures for Strings" (PDF). Retrieved 15 April 2012.

[2] Knizhnik, Konstantin. "Patricia Tries: A Better Index For Prefix Searches", Dr. Dobb's Journal, June, 2008.

[3] Morrison, Donald R. Practical Algorithm to Retrieve Information Coded in Alphanumeric.

[4] G. Gwehenberger, Anwendung einer binären Verweiskettenmethode beim Aufbau von Listen (Use of a binary chained-reference method when building lists). Elektronische Rechenanlagen 10 (1968), pp. 223–226.

[5] Askitis, Nikolas; Sinha, Ranjan (2007). HAT-trie: A Cache-conscious Trie-based Data Structure for Strings. Proceedings of the 30th Australasian Conference on Computer science. 62. pp. 97–105. ISBN 1-920682-43-0.

[6] Askitis, Nikolas; Sinha, Ranjan (October 2010). "Engineering scalable, cache and space efficient tries for strings". The VLDB Journal. 19 (5): 633–660. doi:10.1007/s00778-010-0183-9. ISSN 1066-8888. ISSN 0949-877X (online).

[7] Kemper, Alfons; Eickler, André (2013). Datenbanksysteme, Eine Einführung. 9. pp. 604–605. ISBN 978-3-486-72139-3.

[8] "armon/libart · GitHub". GitHub. Retrieved 17 September 2014.

[9] http://www-db.in.tum.de/~leis/papers/ART.pdf

7.2.8 External links

• Algorithms and Data Structures Research & Reference Material: PATRICIA, by Lloyd Allison, Monash University
• Patricia Tree, NIST Dictionary of Algorithms and Data Structures
• Crit-bit trees, by Daniel J. Bernstein
• Radix Tree API in the Linux Kernel, by Jonathan Corbet
• Kart (key alteration radix tree), by Paul Jarc

Implementations

• Linux Kernel Implementation, used for the page cache, among other things.
• GNU C++ Standard library has a trie implementation
• Java implementation of Concurrent Radix Tree, by Niall Gallagher
• C# implementation of a Radix Tree
• Practical Algorithm Template Library, a C++ library on PATRICIA tries (VC++ >=2003, GCC G++ 3.x), by Roman S. Klyujkov
• Patricia Trie C++ template class implementation, by Radu Gruian
• Haskell standard library implementation "based on big-endian patricia trees". Web-browsable source code.
• Patricia Trie implementation in Java, by Roger Kapsi and Sam Berlin
• Crit-bit trees forked from C code by Daniel J. Bernstein
• Patricia Trie implementation in C, in libcprops
• Patricia Trees: efficient sets and maps over integers in OCaml, by Jean-Christophe Filliâtre
• Radix DB (Patricia trie) implementation in C, by G. B. Versiani

7.3 Suffix tree

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern, etc. Suffix trees also provide one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.

7.3.1 History

The concept was first introduced by Weiner (1973), which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight (1976), and also by Ukkonen (1995).[1] Ukkonen provided the first online-construction of suffix trees, now known as Ukkonen's algorithm, with running time that matched the then fastest algorithms. These algorithms are all linear-time for a constant-size alphabet, and have worst-case running time of O(n log n) in general.

Farach (1997) gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach's algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example, in external memory, compressed, succinct, etc.

7.3.2 Definition

[Figure: Suffix tree for the text BANANA. Each substring is terminated with the special character $. The six paths from the root to the leaves (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links, drawn dashed, are used during construction.]

The suffix tree for the string S of length n is defined as a tree such that:[2]

• The tree has exactly n leaves numbered from 1 to n.
• Except for the root, every internal node has at least two children.
• Each edge is labeled with a non-empty substring of S.
• No two edges starting out of a node can have string-labels beginning with the same character.
• The string obtained by concatenating all the string-labels found on the path from the root to leaf i spells out suffix S[i..n], for i from 1 to n.

Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most n − 1 such nodes, and n + (n − 1) + 1 = 2n nodes in total (n leaves, n − 1 internal non-root nodes, 1 root).

Suffix links are a key feature for older linear-time construction algorithms, although most newer algorithms, which are based on Farach's algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a string (possibly empty), it has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.
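To make the definition concrete, here is a naive quadratic-time Python sketch: it inserts every suffix of S$ into an uncompressed suffix trie and answers substring queries by walking it. This only illustrates the structure; it is not one of the linear-time constructions named above (McCreight, Ukkonen, Farach), which are far more involved, and the dict-based representation is an assumption for the example.

    def build_suffix_trie(s):
        s += "$"                      # unique terminal symbol
        root = {}
        for i in range(len(s)):       # insert suffix s[i:]
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
            node["#"] = i             # leaf records the suffix's start position
        return root

    def is_substring(root, p):
        """p occurs in s iff p is a prefix of some suffix of s."""
        node = root
        for ch in p:
            if ch not in node:
                return False
            node = node[ch]
        return True

    trie = build_suffix_trie("BANANA")
    assert is_substring(trie, "ANAN") and not is_substring(trie, "NAB")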

7.3.3 Generalized suffix tree

A generalized suffix tree is a suffix tree made for a set of words instead of a single word. It represents all suffixes from this set of words. Each word must be terminated by a different termination symbol or word.

7.3.4 Functionality

A suffix tree for a string S of length n can be built in Θ(n) time, if the letters come from an alphabet of integers in a polynomial range (in particular, this is true for constant-sized alphabets).[3] For larger alphabets, the running time is dominated by first sorting the letters to bring them into a range of size O(n); in general, this takes O(n log n) time. The costs below are given under the assumption that the alphabet is constant.

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S1, S2, ..., SK} of total length n = |S1| + |S2| + ··· + |SK|. You can:

• Search for strings:
  • Check if a string P of length m is a substring in O(m) time.[4]
  • Find the first occurrence of the patterns P1, ..., Pq of total length m as substrings in O(m) time.
  • Find all z occurrences of the patterns P1, ..., Pq of total length m as substrings in O(m + z) time.[5]
  • Search for a regular expression P in time expected sublinear in n.[6]
  • Find, for each suffix of a pattern P, the length of the longest match between a prefix of P[i...m] and a substring in D in Θ(m) time.[7] This is termed the matching statistics for P.

• Find properties of the strings:
  • Find the longest common substrings of the strings Si and Sj in Θ(ni + nj) time.[8]
  • Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time.[9]
  • Find the Lempel–Ziv decomposition in Θ(n) time.[10]
  • Find the longest repeated substrings in Θ(n) time.
  • Find the most frequently occurring substrings of a minimum length in Θ(n) time.
  • Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
  • Find the shortest substrings occurring only once in Θ(n) time.
  • Find, for each i, the shortest substrings of Si not occurring elsewhere in D in Θ(n) time.

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time.[11] One can then also:

• Find the longest common prefix between the suffixes Si[p..ni] and Sj[q..nj] in Θ(1).[12]
• Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits.[13]
• Find all z maximal palindromes in Θ(n),[14] or Θ(gn) time if gaps of length g are allowed, or Θ(kn) if k mismatches are allowed.[15]
• Find all z tandem repeats in O(n log n + z), and k-mismatch tandem repeats in O(kn log(n/k) + z).[16]
• Find the longest common substrings to at least k strings in D for k = 2, ..., K in Θ(n) time.[17]
• Find the longest palindromic substring of a given string (using the generalized suffix tree of the string and its reverse) in linear time.[18]

7.3.5 Applications

Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology and other application areas.[19] Primary applications include:[19]

• String search, in O(m) complexity, where m is the length of the sub-string (but with initial O(n) time required to build the suffix tree for the string)
• Finding the longest repeated substring
• Finding the longest common substring
• Finding the longest palindrome in a string

Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZW compression schemes use suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines.[20]

7.3.6 Implementation

If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of all the strings on all of the edges in the tree is O(n^2), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a Fibonacci word, giving the full 2n nodes.
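The Θ(n)-word bound above is what a real, edge-compressed suffix tree achieves. As a contrast, the following Python sketch (ours) builds an uncompressed suffix trie: it supports the same O(m) substring walk, but can need O(n^2) nodes, exactly the blow-up that storing edge labels as (position, length) pairs avoids.

    def build_suffix_trie(s):
        # Naive suffix trie: one node per distinct substring prefix.
        # O(n^2) nodes in the worst case, versus Theta(n) for a suffix tree.
        root = {}
        for i in range(len(s)):
            node = root
            for ch in s[i:] + "$":
                node = node.setdefault(ch, {})
        return root

    def occurs(trie, pattern):
        # O(m) walk, the same query cost a suffix tree offers.
        node = trie
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True

    trie = build_suffix_trie("BANANA")
    print(occurs(trie, "NAN"), occurs(trie, "NAB"))   # True False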

An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Other implementations with efficient running time properties use hash maps, sorted or unsorted arrays (with array doubling), or balanced search trees. We are interested in:

• The cost of finding the child on a given character.
• The cost of inserting a child.
• The cost of enlisting all children of a node (divided by the number of children in the table below).

Let σ be the size of the alphabet. Then you have the following costs:

                                    Lookup      Insertion   Traversal
  Sibling lists / unsorted arrays   O(σ)        Θ(1)        Θ(1)
  Bitwise sibling trees             O(log σ)    Θ(1)        Θ(1)
  Hash maps                         Θ(1)        Θ(1)        O(σ)
  Balanced search tree              O(log σ)    O(log σ)    O(1)
  Sorted arrays                     O(log σ)    O(σ)        O(1)
  Hash maps + sibling lists         O(1)        O(1)        O(1)

The insertion cost is amortised, and the costs for hashing are given for perfect hashing.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to 20 times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of 8 (for an array including LCP values built within 32-bit address space and 8-bit characters). This factor depends on the properties and may reach 2 with usage of 4-byte wide characters (needed to contain any symbol in some UNIX-like systems, see wchar_t) on 32-bit systems. Researchers have continued to find smaller indexing structures.
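Returning to the parent-child representations compared in the table above, here is a minimal sketch (our own naming) of the sibling-list representation: first-child/next-sibling pointers give Θ(1) insertion and O(σ) child lookup, matching the first row of the table.

    class Node:
        """Suffix tree node using a sibling list: each node stores its
        first child and the next sibling in its parent's child list."""
        __slots__ = ("first_child", "next_sibling", "edge", "char")

        def __init__(self, char, edge=None):
            self.char = char            # first character of the incoming edge
            self.edge = edge            # e.g. (start, length) into S
            self.first_child = None
            self.next_sibling = None

    def find_child(node, c):
        """O(sigma) lookup: scan the sibling list for the child whose edge starts with c."""
        child = node.first_child
        while child is not None and child.char != c:
            child = child.next_sibling
        return child

    def insert_child(node, new):
        """Theta(1) insertion at the head of the sibling list."""
        new.next_sibling = node.first_child
        node.first_child = new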

[1] Giegerich & Kurtz (1997).

7.3.7 Parallel construction [2] http://www.cs.uoi.gr/~{}kblekas/courses/ bioinformatics/Suffix_Trees1.pdf Various parallel algorithms to speed up suffix tree con- struction have been proposed.[21][22][23][24][25] Recently, [3] Farach (1997). a practical parallel algorithm for suffix tree construction [4] Gusfield (1999), p.92. with O(n) work (sequential time) and O(log2 n) span has been developed. The algorithm achieves good paral- [5] Gusfield (1999), p.123. lel scalability on shared-memory multicore machines and can index the 3GB human genome in under 3 minutes [6] Baeza-Yates & Gonnet (1996). [26] using a 40-core machine. [7] Gusfield (1999), p.132.

[8] Gusfield (1999), p.125.

7.3.8 External construction [9] Gusfield (1999), p.144.

Though linear, the memory usage of a suffix tree is signif- [10] Gusfield (1999), p.166. icantly higher than the actual size of the sequence collec- [11] Gusfield (1999), Chapter 8. tion. For a large text, construction may require external memory approaches. [12] Gusfield (1999), p.196.

There are theoretical results for constructing suffix trees [13] Gusfield (1999), p.200. in external memory. The algorithm by Farach-Colton, Ferragina & Muthukrishnan (2000) is theoretically op- [14] Gusfield (1999), p.198. timal, with an I/O complexity equal to that of sorting. [15] Gusfield (1999), p.201. However the overall intricacy of this algorithm has pre- vented, so far, its practical implementation.[27] [16] Gusfield (1999), p.204. 226 CHAPTER 7. INTEGER AND STRING SEARCHING

[17] Gusfield (1999), p.205. • Farach-Colton, Martin; Ferragina, Paolo; Muthukr- ishnan, S. (2000), “On the sorting-complexity of [18] Gusfield (1999), pp.197–199. suffix tree construction.”, Journal of the ACM, 47 [19] Allison, L. “Suffix Trees”. Retrieved 2008-10-14. (6): 987–1011, doi:10.1145/355541.355547.

[20] First introduced by Zamir & Etzioni (1998). • Giegerich, R.; Kurtz, S. (1997), “From Ukko- [21] Apostolico et al. (Vishkin). nen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construc- [22] Hariharan (1994). tion” (PDF), Algorithmica, 19 (3): 331–353, [23] Sahinalp & Vishkin (1994). doi:10.1007/PL00009177. [24] Farach & Muthukrishnan (1996). • Gusfield, Dan (1999), Algorithms on Strings, Trees and Sequences: Computer Science and Computa- [25] Iliopoulos & Rytter (2004). tional Biology, Cambridge University Press, ISBN [26] Shun & Blelloch (2014). 0-521-58519-8.

[27] Smyth (2003). • Hariharan, Ramesh (1994), “Optimal Parallel Suffix [28] Tata, Hankins & Patel (2003). Tree Construction”, ACM Symposium on Theory of Computing. [29] Phoophakdee & Zaki (2007). • [30] Barsky et al. (2008). Iliopoulos, Costas; Rytter, Wojciech (2004), “On Parallel Transformations of Suffix Arrays into Suf- [31] Barsky et al. (2009). fix Trees”, 15th Australasian Workshop on Combi- natorial Algorithms. [32] Mansour et al. (2011). • Mansour, Essam; Allam, Amin; Skiadopoulos, 7.3.11 References Spiros; Kalnis, Panos (2011), “ERA: Efficient Serial and Parallel Suffix Tree Construction for • Apostolico, A.; Iliopoulos, C.; Landau, G. M.; Very Long Strings” (PDF), PVLDB, 5 (1): 49–60, Schieber, B.; Vishkin, U. (1988), “Parallel construc- doi:10.14778/2047485.2047490. tion of a suffix tree with applications”, Algorithmica, • 3. McCreight, Edward M. (1976), “A Space- Economical Suffix Tree Construction Algo- • Baeza-Yates, Ricardo A.; Gonnet, Gaston rithm”, Journal of the ACM, 23 (2): 262– H. (1996), “Fast text searching for regu- 272, doi:10.1145/321941.321946, CiteSeerX: lar expressions or automaton searching on 10 .1 .1 .130 .8022. tries”, Journal of the ACM, 43 (6): 915–936, doi:10.1145/235809.235810. • Phoophakdee, Benjarath; Zaki, Mohammed J. (2007), “Genome-scale disk-based suffix tree in- • Barsky, Marina; Stege, Ulrike; Thomo, Alex; Up- dexing”, SIGMOD '07: Proceedings of the ACM SIG- ton, Chris (2008), “A new method for indexing MOD International Conference on Management of genomes using on-disk suffix trees”, CIKM '08: Pro- Data, New York, NY, USA: ACM, pp. 833–844. ceedings of the 17th ACM Conference on Informa- tion and Knowledge Management, New York, NY, • Sahinalp, Cenk; Vishkin, Uzi (1994), “Symmetry USA: ACM, pp. 649–658. breaking for suffix tree construction”, ACM Sympo- • Barsky, Marina; Stege, Ulrike; Thomo, Alex; Up- sium on Theory of Computing ton, Chris (2009), “Suffix trees for very large ge- • Smyth, William (2003), Computing Patterns in nomic sequences”, CIKM '09: Proceedings of the Strings, Addison-Wesley. 18th ACM Conference on Information and Knowl- edge Management, New York, NY, USA: ACM. • Shun, Julian; Blelloch, Guy E. (2014), “A Simple • Farach, Martin (1997), “Optimal Suffix Tree Con- Parallel Cartesian Tree Algorithm and its Appli- struction with Large Alphabets” (PDF), 38th IEEE cation to Parallel Suffix Tree Construction”, ACM Symposium on Foundations of Computer Science Transactions on Parallel Computing. (FOCS '97), pp. 137–143. • Tata, Sandeep; Hankins, Richard A.; Patel, Jig- • Farach, Martin; Muthukrishnan, S. (1996), “Op- nesh M. (2003), “Practical Suffix Tree Construc- timal Logarithmic Time Randomized Suffix Tree tion”, VLDB '03: Proceedings of the 30th Interna- Construction”, International Colloquium on Au- tional Conference on Very Large Data Bases, Mor- tomata Languages and Programming. gan Kaufmann, pp. 36–47. 7.4. 227

• Ukkonen, E. (1995), “On-line construction of suffix trees” (PDF), Algorithmica, 14 (3): 249–260, doi:10.1007/BF01206331.
• Weiner, P. (1973), “Linear pattern matching algorithms” (PDF), 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11, doi:10.1109/SWAT.1973.13.
• Zamir, Oren; Etzioni, Oren (1998), “Web document clustering: a feasibility demonstration”, SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA: ACM, pp. 46–54.

7.3.12 External links

• Suffix Trees by Sartaj Sahni
• NIST's Dictionary of Algorithms and Data Structures: Suffix Tree
• Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice, application of suffix trees in the BWT
• Theory and Practice of Succinct Data Structures, C++ implementation of a compressed suffix tree
• Ukkonen's Suffix Tree Implementation in C Part 1 Part 2 Part 3 Part 4 Part 5 Part 6

7.4 Suffix array

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used, among others, in full-text indices, data compression algorithms and within the field of bioinformatics.[1]

Suffix arrays were introduced by Manber & Myers (1990) as a simple, space-efficient alternative to suffix trees. They had independently been discovered by Gaston Gonnet in 1987 under the name PAT array (Gonnet, Baeza-Yates & Snider 1992).

7.4.1 Definition

Let S = S[1]S[2]...S[n] be a string and let S[i, j] denote the substring of S ranging from i to j. The suffix array A of S is now defined to be an array of integers providing the starting positions of suffixes of S in lexicographical order. This means that an entry A[i] contains the starting position of the i-th smallest suffix in S, and thus for all 1 < i ≤ n: S[A[i − 1], n] < S[A[i], n].

7.4.2 Example

Consider the text S = banana$ to be indexed. The text ends with the special sentinel letter $ that is unique and lexicographically smaller than any other character. The text has the following suffixes:

  i   suffix
  1   banana$
  2   anana$
  3   nana$
  4   ana$
  5   na$
  6   a$
  7   $

These suffixes can be sorted in ascending order:

  $ < a$ < ana$ < anana$ < banana$ < na$ < nana$

The suffix array A contains the starting positions of these sorted suffixes:

  i     1  2  3  4  5  6  7
  A[i]  7  6  4  2  1  5  3

So, for example, A[3] contains the value 4, and therefore refers to the suffix starting at position 4 within S, which is the suffix ana$.

7.4.3 Correspondence to suffix trees

Suffix arrays are closely related to suffix trees:

• Suffix arrays can be constructed by performing a depth-first traversal of a suffix tree. The suffix array corresponds to the leaf-labels given in the order in which these are visited during the traversal, if edges are visited in the lexicographical order of their first character.
• A suffix tree can be constructed in linear time by using a combination of suffix array and LCP array. For a description of the algorithm, see the corresponding section in the LCP array article.

It has been shown that every suffix tree algorithm can be systematically replaced with an algorithm that uses a suffix array enhanced with additional information (such as the LCP array) and solves the same problem in the same time complexity.[2] Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear-time construction algorithms (e.g., compared to Ukkonen's algorithm) and improved cache locality.[1]
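A deliberately naive Python sketch (ours) that reproduces the banana$ example above, using 1-based positions as in the definition; the construction algorithms discussed below do much better than its O(n^2 log n) worst case.

    def suffix_array(s):
        """Naive construction: sort suffix start positions by the suffixes
        themselves (fine for illustration, not for large inputs)."""
        return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

    print(suffix_array("banana$"))   # [7, 6, 4, 2, 1, 5, 3], so A[3] = 4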
7.4.4 Space Efficiency

Suffix arrays were introduced by Manber & Myers (1990) in order to improve over the space requirements of suffix trees: Suffix arrays store n integers. Assuming an integer requires 4 bytes, a suffix array requires 4n bytes in total. This is significantly less than the 20n bytes which are required by a careful suffix tree implementation.[3]

However, in certain applications, the space requirements of suffix arrays may still be prohibitive. Analyzed in bits, a suffix array requires O(n log n) space, whereas the original text over an alphabet of size σ only requires O(n log σ) bits. For a human genome with σ = 4 and n = 3.4 × 10^9 the suffix array would therefore occupy about 16 times more memory than the genome itself.

Such discrepancies motivated a trend towards compressed suffix arrays and BWT-based compressed full-text indices such as the FM-index. These data structures require only space within the size of the text or even less.

7.4.5 Construction Algorithms

A suffix tree can be built in O(n) and can be converted into a suffix array by traversing the tree depth-first, also in O(n), so there exist algorithms that can build a suffix array in O(n).

A naive approach to construct a suffix array is to use a comparison-based sorting algorithm. These algorithms require O(n log n) suffix comparisons, but a suffix comparison runs in O(n) time, so the overall runtime of this approach is O(n^2 log n).

More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related to each other. These algorithms strive to achieve the following goals:[4]

• minimal asymptotic complexity Θ(n)
• lightweight in space, meaning little or no working memory beside the text and the suffix array itself is needed
• fast in practice

One of the first algorithms to achieve all goals is the SA-IS algorithm of Nong, Zhang & Chan (2009). The algorithm is also rather simple (< 100 LOC) and can be enhanced to simultaneously construct the LCP array.[5] The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A careful implementation by Yuta Mori outperforms most other linear or super-linear construction approaches.

Beside time and space requirements, suffix array construction algorithms are also differentiated by their supported alphabet: constant alphabets where the alphabet size is bound by a constant, integer alphabets where characters are integers in a range depending on n, and general alphabets where only character comparisons are allowed.[6]

Most suffix array construction algorithms are based on one of the following approaches:[4]

• Prefix doubling algorithms are based on a strategy of Karp, Miller & Rosenberg (1972). The idea is to find prefixes that honor the lexicographic ordering of suffixes. The assessed prefix length doubles in each iteration of the algorithm until a prefix is unique and provides the rank of the associated suffix (a minimal sketch of this idea follows the list).
• Recursive algorithms follow the approach of the suffix tree construction algorithm by Farach (1997) to recursively sort a subset of suffixes. This subset is then used to infer a suffix array of the remaining suffixes. Both of these suffix arrays are then merged to compute the final suffix array.
• Induced copying algorithms are similar to recursive algorithms in the sense that they use an already sorted subset to induce a fast sort of the remaining suffixes. The difference is that these algorithms favor iteration over recursion to sort the selected suffix subset. A survey of this diverse group of algorithms has been put together by Puglisi, Smyth & Turpin (2007).
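As promised above, here is a compact Python sketch of the prefix-doubling idea (ours; 0-based, and simplified to O(n log^2 n) by re-sorting in every round rather than using radix sort):

    def suffix_array_prefix_doubling(s):
        """Rank suffixes by their first 2^j characters, doubling j
        until all ranks are distinct."""
        n = len(s)
        rank = [ord(c) for c in s]
        sa = list(range(n))
        j = 1
        while True:
            key = lambda i: (rank[i], rank[i + j] if i + j < n else -1)
            sa.sort(key=key)
            new_rank = [0] * n
            for t in range(1, n):
                new_rank[sa[t]] = new_rank[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
            rank = new_rank
            if rank[sa[-1]] == n - 1:    # all ranks distinct: done
                return sa
            j *= 2

    print(suffix_array_prefix_doubling("banana$"))   # [6, 5, 3, 1, 0, 4, 2]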

A well-known recursive algorithm for integer alphabets is the DC3 / skew algorithm of Kärkkäinen & Sanders (2003). It runs in linear time and has successfully been used as the basis for parallel[7] and external memory[8] suffix array construction algorithms.

Recent work by Salson et al. (2009) proposes an algorithm for updating the suffix array of a text that has been edited, instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is O(n log n), it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text.

7.4.6 Applications

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring pattern P within the string S. Finding every occurrence of the pattern is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and can be found efficiently with two binary searches. The first search locates the starting position of the interval, and the second one determines the end position:

    def search(P):
        l = 0; r = n
        while l < r:
            mid = (l + r) / 2
            if P > suffixAt(A[mid]):
                l = mid + 1
            else:
                r = mid
        s = l; r = n
        while l < r:
            mid = (l + r) / 2
            if P < suffixAt(A[mid]):
                r = mid
            else:
                l = mid + 1
        return (s, r)

Finding the substring pattern P of length m in the string S of length n takes O(m log n) time, given that a single suffix comparison needs to compare m characters. Manber & Myers (1990) describe how this bound can be improved to O(m + log n) time using LCP information. The idea is that a pattern comparison does not need to re-compare certain characters, when it is already known that these are part of the longest common prefix of the pattern and the current search interval. Abouelhoda, Kurtz & Ohlebusch (2004) improve the bound even further and achieve a search time of O(m) as known from suffix trees.

Suffix sorting algorithms can be used to compute the Burrows–Wheeler transform (BWT). The BWT requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other characters (i.e., $), then the order of the sorted rotated BWT matrix corresponds to the order of suffixes in a suffix array. The BWT can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the BWT string: BWT[i] = S[A[i] − 1].

Suffix arrays can also be used to look up substrings in Example-Based Machine Translation, demanding much less storage than a full phrase table as used in statistical machine translation.

Many additional applications of the suffix array require the LCP array. Some of these are detailed in the application section of the latter.
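To make the BWT derivation above concrete, a short Python sketch (ours) using the formula BWT[i] = S[A[i] − 1] with 0-based indices; the preceding character of the first suffix wraps around to the sentinel:

    def bwt_from_suffix_array(s, sa):
        # BWT[i] = S[A[i] - 1]; s[-1] is the final sentinel character.
        return "".join(s[i - 1] for i in sa)

    s = "banana$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive 0-based suffix array
    print(bwt_from_suffix_array(s, sa))               # annb$aa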
7.4.7 Notes

[1] Abouelhoda, Kurtz & Ohlebusch 2002.
[2] Abouelhoda, Kurtz & Ohlebusch 2004.
[3] Kurtz 1999.
[4] Puglisi, Smyth & Turpin 2007.
[5] Fischer 2011.
[6] Burkhardt & Kärkkäinen 2003.
[7] Kulla & Sanders 2007.
[8] Dementiev et al. 2008.

7.4.8 References

• Abouelhoda, Mohamed Ibrahim; Kurtz, Stefan; Ohlebusch, Enno (2002). The Enhanced Suffix Array and Its Applications to Genome Analysis. Algorithms in Bioinformatics. Lecture Notes in Computer Science. p. 449. doi:10.1007/3-540-45784-4_35. ISBN 978-3-540-44211-0.
• Abouelhoda, Mohamed Ibrahim; Kurtz, Stefan; Ohlebusch, Enno (2004). “Replacing suffix trees with enhanced suffix arrays”. Journal of Discrete Algorithms. 2 (1): 53–86. doi:10.1016/S1570-8667(03)00065-0.
• Manber, Udi; Myers, Gene (1990). Suffix arrays: a new method for on-line string searches. First Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 319–327.
• Manber, Udi; Myers, Gene (1993). “Suffix arrays: a new method for on-line string searches”. SIAM Journal on Computing. 22: 935–948. doi:10.1137/0222058.
• Gonnet, G.H; Baeza-Yates, R.A; Snider, T (1992). “New indices for text: PAT trees and PAT arrays”. Information retrieval: data structures and algorithms.
• Kurtz, S (1999). “Reducing the space requirement of suffix trees”. Software-Practice and Experience. 29 (13): 1149. doi:10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O.
• Puglisi, Simon J.; Smyth, W. F.; Turpin, Andrew H. (2007). “A taxonomy of suffix array construction algorithms”. ACM Computing Surveys. 39 (2): 4. doi:10.1145/1242471.1242472.
• Nong, Ge; Zhang, Sen; Chan, Wai Hong (2009). Linear Suffix Array Construction by Almost Pure Induced-Sorting. 2009 Data Compression Conference. p. 193. doi:10.1109/DCC.2009.42. ISBN 978-0-7695-3592-0.
• Fischer, Johannes (2011). Inducing the LCP-Array. Algorithms and Data Structures. Lecture Notes in Computer Science. p. 374. doi:10.1007/978-3-642-22300-6_32. ISBN 978-3-642-22299-3.
• Salson, M.; Lecroq, T.; Léonard, M.; Mouchard, L. (2010). “Dynamic extended suffix arrays”. Journal of Discrete Algorithms. 8 (2): 241. doi:10.1016/j.jda.2009.02.007.
• Burkhardt, Stefan; Kärkkäinen, Juha (2003). Fast Lightweight Suffix Array Construction and Checking. Combinatorial Pattern Matching. Lecture Notes in Computer Science. p. 55. doi:10.1007/3-540-44888-8_5. ISBN 978-3-540-40311-1.
• Karp, Richard M.; Miller, Raymond E.; Rosenberg, Arnold L. (1972). Rapid identification of repeated patterns in strings, trees and arrays. Proceedings of the fourth annual ACM symposium on Theory of computing - STOC '72. p. 125. doi:10.1145/800152.804905.
• Farach, M. (1997). Optimal suffix tree construction with large alphabets. Proceedings 38th Annual Symposium on Foundations of Computer Science. p. 137. doi:10.1109/SFCS.1997.646102. ISBN 0-8186-8197-7.
• Kärkkäinen, Juha; Sanders, Peter (2003). Simple Linear Work Suffix Array Construction. Automata, Languages and Programming. Lecture Notes in Computer Science. p. 943. doi:10.1007/3-540-45061-0_73. ISBN 978-3-540-40493-4.
• Dementiev, Roman; Kärkkäinen, Juha; Mehnert, Jens; Sanders, Peter (2008). “Better external memory suffix array construction”. Journal of Experimental Algorithmics. 12: 1. doi:10.1145/1227161.1402296.
• Kulla, Fabian; Sanders, Peter (2007). “Scalable parallel suffix array construction”. Parallel Computing. 33 (9): 605. doi:10.1016/j.parco.2007.06.004.

7.4.9 External links

• Suffix Array in Java
• Suffix sorting module for BWT in C code
• Suffix Array Implementation in Ruby
• Suffix array library and tools
• Project containing various Suffix Array c/c++ Implementations with a unified interface
• A fast, lightweight, and robust C API library to construct the suffix array
• Suffix Array implementation in Python
• Linear Time Suffix Array implementation in C using suffix tree

7.5 Suffix automaton

Non-deterministic suffix automaton for the word “suffix”. Epsilon transitions are shown grey.

In computer science, a suffix automaton or directed acyclic word graph is a finite automaton that recognizes the set of suffixes of a given string. It can be thought of as a compressed form of the suffix tree, a data structure that efficiently represents the suffixes of the string. For example, a suffix automaton for the string “suffix” can be queried for other strings; it will report “true” for any of the strings “suffix”, “uffix”, “ffix”, “fix”, “ix” and “x”, and “false” for any other string.[1]

The suffix automaton of a set of strings U has at most 2Q − 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U.[2]

Suffix automata have applications in approximate string matching.[1]
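The query behaviour described above can be reproduced with the standard online (Blumer et al. style) construction; the Python sketch below is ours and keeps only what is needed to answer suffix queries. The accepting states are exactly the states reachable by suffix-link walking from the state of the whole string.

    class State:
        def __init__(self):
            self.len = 0          # length of the longest string in this state's class
            self.link = -1        # suffix link
            self.next = {}        # outgoing transitions

    def build_suffix_automaton(s):
        sa, last = [State()], 0
        for ch in s:
            cur = len(sa)
            sa.append(State())
            sa[cur].len = sa[last].len + 1
            p = last
            while p != -1 and ch not in sa[p].next:
                sa[p].next[ch] = cur
                p = sa[p].link
            if p == -1:
                sa[cur].link = 0
            else:
                q = sa[p].next[ch]
                if sa[p].len + 1 == sa[q].len:
                    sa[cur].link = q
                else:
                    clone = len(sa)                 # split state q
                    sa.append(State())
                    sa[clone].len = sa[p].len + 1
                    sa[clone].next = dict(sa[q].next)
                    sa[clone].link = sa[q].link
                    while p != -1 and sa[p].next.get(ch) == q:
                        sa[p].next[ch] = clone
                        p = sa[p].link
                    sa[q].link = clone
                    sa[cur].link = clone
            last = cur
        accepting, p = set(), last                  # states on last's suffix-link chain
        while p != -1:
            accepting.add(p)
            p = sa[p].link
        return sa, accepting

    def is_suffix(sa, accepting, t):
        state = 0
        for ch in t:
            if ch not in sa[state].next:
                return False
            state = sa[state].next[ch]
        return state in accepting

    sa, acc = build_suffix_automaton("suffix")
    print(is_suffix(sa, acc, "ffix"), is_suffix(sa, acc, "ff"))   # True False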

7.5.1 See also

• GADDAG
• Suffix array

7.5.2 References

[1] Navarro, Gonzalo (2001), “A guided tour to approximate string matching” (PDF), ACM Computing Surveys, 33 (1): 31–88, doi:10.1145/375360.375365

[2] Mohri, Mehryar; Moreno, Pedro; Weinstein, Eugene (September 2009), “General suffix automaton construction algorithm and space bounds”, Theoretical Computer Science, 410 (37): 3553–3562, doi:10.1016/j.tcs.2009.03.034

7.5.3 Additional reading

• Inenaga, S.; Hoshino, H.; Shinohara, A.; Takeda, M.; Arikawa, S. (2001), “On-line construction of symmetric compact directed acyclic word graphs”, Proc. 8th Int. Symp. String Processing and Information Retrieval, 2001. SPIRE 2001, pp. 96–110, doi:10.1109/SPIRE.2001.989743, ISBN 0-7695-1192-9.
• Crochemore, Maxime; Vérin, Renaud (1997), “Direct construction of compact directed acyclic word graphs”, Combinatorial Pattern Matching, Lecture Notes in Computer Science, Springer-Verlag, pp. 116–129, doi:10.1007/3-540-63220-4_55.
• Epifanio, Chiara; Mignosi, Filippo; Shallit, Jeffrey; Venturini, Ilaria (2004), “Sturmian graphs and a conjecture of Moser”, in Calude, Cristian S.; Calude, Elena; Dineen, Michael J., Developments in language theory. Proceedings, 8th international conference (DLT 2004), Auckland, New Zealand, December 2004, Lecture Notes in Computer Science, 3340, Springer-Verlag, pp. 175–187, ISBN 3-540-24014-4, Zbl 1117.68454
• Do, H.H.; Sung, W.K. (2011), “Compressed Directed Acyclic Word Graph with Application in Local Alignment”, Computing and Combinatorics, Lecture Notes in Computer Science, 6842, Springer-Verlag, pp. 503–518, doi:10.1007/978-3-642-22685-4_44, ISBN 978-3-642-22684-7

7.6 Van Emde Boas tree

A Van Emde Boas tree (or Van Emde Boas priority queue; Dutch pronunciation: [vɑn 'ɛmdə 'boːɑs]), also known as a vEB tree, is a tree data structure which implements an associative array with m-bit integer keys. It performs all operations in O(log m) time, or equivalently in O(log log M) time, where M = 2^m is the maximum number of elements that can be stored in the tree. M is not to be confused with the actual number of elements stored in the tree, by which the performance of other tree data structures is often measured. The vEB tree has good space efficiency when it contains a large number of elements, as discussed below. It was invented by a team led by Dutch computer scientist Peter van Emde Boas in 1975.[1]

7.6.1 Supported operations

A vEB supports the operations of an ordered associative array, which includes the usual associative array operations along with two more order operations, FindNext and FindPrevious:[2]

• Insert: insert a key/value pair with an m-bit key
• Delete: remove the key/value pair with a given key
• Lookup: find the value associated with a given key
• FindNext: find the key/value pair with the smallest key that is at least a given k
• FindPrevious: find the key/value pair with the largest key that is at most a given k

A vEB tree also supports the operations Minimum and Maximum, which return the minimum and maximum element stored in the tree respectively.[3] These both run in O(1) time, since the minimum and maximum element are stored as attributes in each tree.

7.6.2 How it works

An example Van Emde Boas tree with dimension 5 and the root's aux structure.

A vEB tree T over the universe {0, ..., M−1} has a root node that stores an array T.children of length √M. T.children[i] is a pointer to a vEB tree that is responsible for the values {i√M, ..., (i+1)√M−1}. Additionally, T stores two values T.min and T.max as well as an auxiliary vEB tree T.aux.

Data is stored in a vEB tree as follows: The smallest value currently in the tree is stored in T.min and the largest value is stored in T.max. Note that T.min is not stored anywhere else in the vEB tree, while T.max is. If T is empty then we use the convention that T.max = −1 and T.min = M. Any other value x is stored in the subtree T.children[i] where i = ⌊x/√M⌋. The auxiliary tree T.aux keeps track of which children are non-empty, so T.aux contains the value j if and only if T.children[j] is non-empty.

FindNext

The operation FindNext(T, x) that searches for the successor of an element x in a vEB tree proceeds as follows: If x ≤ T.min then the search is complete, and the answer is T.min. If x > T.max then the next element does not exist; return M. Otherwise, let i = ⌊x/√M⌋. If x ≤ T.children[i].max then the value being searched for is contained in T.children[i], so the search proceeds recursively in T.children[i]. Otherwise, we search for the value i in T.aux. This gives us the index j of the first subtree that contains an element larger than x. The algorithm then returns T.children[j].min. The element found on the children level needs to be composed with the high bits to form a complete next element.

    function FindNext(T, x)
        if x ≤ T.min then
            return T.min
        if x > T.max then // no next element
            return M
        i = floor(x/√M)
        lo = x mod √M
        hi = x − lo
        if lo ≤ T.children[i].max then
            return hi + FindNext(T.children[i], lo)
        return hi + T.children[FindNext(T.aux, i)].min
    end

Note that, in any case, the algorithm performs O(1) work and then possibly recurses on a subtree over a universe of size M^(1/2) (an m/2-bit universe). This gives the running time recurrence f(m) = f(m/2) + O(1), which resolves to O(log m) = O(log log M).
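A worked instance (ours) of the high/low decomposition used by FindNext above, in plain arithmetic: with M = 256 the root has √M = 16 children.

    M = 256                      # universe size (m = 8 bits), √M = 16
    x = 0x4A                     # 74
    i, lo = divmod(x, 16)        # i = floor(x/√M) = 4, lo = x mod √M = 10
    hi = x - lo                  # 0x40: the high bits contributed by cluster i
    # FindNext either recurses as hi + FindNext(children[i], lo), or, if
    # cluster i holds nothing > lo, jumps via FindNext(aux, i) to the next
    # non-empty cluster j and returns j*16 + children[j].min.
    assert hi + lo == x and i * 16 + lo == x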

Insert

The call Insert(T, x) that inserts a value x into a vEB tree T operates as follows:

1. If T is empty then we set T.min = T.max = x and we are done.

2. Otherwise, if x < T.min then we insert T.min into the subtree i responsible for T.min and then set T.min = x. If T.children[i] was previously empty, then we also insert i into T.aux.

3. Otherwise, if x > T.max then we insert x into the subtree i responsible for x and then set T.max = x. If T.children[i] was previously empty, then we also insert i into T.aux.

4. Otherwise, T.min < x < T.max, so we insert x into the subtree i responsible for x. If T.children[i] was previously empty, then we also insert i into T.aux.

In code:

    function Insert(T, x)
        if T.min > T.max then // T is empty
            T.min = T.max = x; return
        if T.min == T.max then
            if x < T.min then T.min = x
            if x > T.max then T.max = x
        if x < T.min then
            swap(x, T.min)
        if x > T.max then
            T.max = x
        i = floor(x / √M)
        Insert(T.children[i], x mod √M)
        if T.children[i].min == T.children[i].max then
            Insert(T.aux, i)
    end

The key to the efficiency of this procedure is that inserting an element into an empty vEB tree takes O(1) time. So, even though the algorithm sometimes makes two recursive calls, this only occurs when the first recursive call was into an empty subtree. This gives the same running time recurrence f(m) = f(m/2) + O(1) as before.

Delete

Deletion from vEB trees is the trickiest of the operations. The call Delete(T, x) that deletes a value x from a vEB tree T operates as follows:

1. If T.min = T.max = x then x is the only element stored in the tree and we set T.min = M and T.max = −1 to indicate that the tree is empty.

2. Otherwise, if x == T.min then we need to find the second-smallest value y in the vEB tree, delete it from its current location, and set T.min = y. The second-smallest value y is either T.max or T.children[T.aux.min].min, so it can be found in O(1) time. In the latter case we delete y from the subtree that contains it.

3. Similarly, if x == T.max then we need to find the second-largest value y in the vEB tree and set T.max = y. The second-largest value y is either T.min or T.children[T.aux.max].max, so it can be found in O(1) time. We also delete x from the subtree that contains it.

4. In the case where x is not T.min or T.max, and T has no other elements, we know x is not in T and return without further operations.

5. Otherwise, we have the typical case where x ≠ T.min and x ≠ T.max. In this case we delete x from the subtree T.children[i] that contains x.

6. In any of the above cases, if we delete the last element x or y from any subtree T.children[i] then we also delete i from T.aux.

In code:

    function Delete(T, x)
        if T.min == T.max == x then
            T.min = M
            T.max = −1
            return
        if x == T.min then
            if T.aux is empty then
                T.min = T.max
                return
            else
                x = T.children[T.aux.min].min
                T.min = x
        if x == T.max then
            if T.aux is empty then
                T.max = T.min
                return
            else
                T.max = T.children[T.aux.max].max
        if T.aux is empty then
            return
        i = floor(x / √M)
        Delete(T.children[i], x mod √M)
        if T.children[i] is empty then
            Delete(T.aux, i)
    end

Again, the efficiency of this procedure hinges on the fact that deleting from a vEB tree that contains only one element takes only constant time. In particular, the last line of code only executes if x was the only element in T.children[i] prior to the deletion.

Discussion

The assumption that log m is an integer is unnecessary. The operations x/√M and x mod √M can be replaced by taking only the higher-order ⌈m/2⌉ and the lower-order ⌊m/2⌋ bits of x, respectively. On any existing machine, this is more efficient than division or remainder computations.

The implementation described above uses pointers and occupies a total space of O(M) = O(2^m). This can be seen as follows. The recurrence is S(M) = O(√M) + (√M + 1) · S(√M). Resolving that would lead to S(M) ∈ (1 + √M)^(log log M) + log log M · O(√M). One can, fortunately, also show that S(M) = M − 2 by induction.[4]

In practical implementations, especially on machines with shift-by-k and find-first-zero instructions, performance can further be improved by switching to a bit array once m equals the word size (or a small multiple thereof). Since all operations on a single word are constant time, this does not affect the asymptotic performance, but it does avoid the majority of the pointer storage and several pointer dereferences, achieving a significant practical saving in time and space with this trick.

An obvious optimization of vEB trees is to discard empty subtrees. This makes vEB trees quite compact when they contain many elements, because no subtrees are created until something needs to be added to them. Initially, each element added creates about log(m) new trees containing about m/2 pointers all together. As the tree grows, more and more subtrees are reused, especially the larger ones. In a full tree of 2^m elements, only O(2^m) space is used. Moreover, unlike a binary search tree, most of this space is being used to store data: even for billions of elements, the pointers in a full vEB tree number in the thousands.

However, for small trees the overhead associated with vEB trees is enormous: on the order of √M. This is one reason why they are not popular in practice. One way of addressing this limitation is to use only a fixed number of bits per level, which results in a trie. Alternatively, each table may be replaced by a hash table, reducing the space to O(n) (where n is the number of elements stored in the data structure) at the expense of making the data structure randomized. Other structures, including y-fast tries and x-fast tries, have been proposed that have comparable update and query times and also use randomized hash tables to reduce the space to O(n) or O(n log M).
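Putting the pieces together, here is a compact Python sketch (ours) of a vEB set supporting insert, member and successor (deletion omitted for brevity). It follows the recursive layout above and uses the lazily created subtrees described in this section; M is a power of two and all names are illustrative.

    class VEB:
        def __init__(self, m):
            # Set over the universe {0, ..., 2**m - 1}; None encodes "empty".
            self.m = m
            self.min = None
            self.max = None
            if m > 1:
                self.lo_bits = m // 2        # bits resolved inside a cluster
                self.aux = None              # summary tree over cluster indices
                self.children = {}           # lazily created clusters

        def _split(self, x):
            return x >> self.lo_bits, x & ((1 << self.lo_bits) - 1)

        def insert(self, x):
            if self.min is None:
                self.min = self.max = x      # O(1) insert into an empty tree
                return
            if x == self.min or x == self.max:
                return
            if x < self.min:
                self.min, x = x, self.min    # the minimum lives only at this level
            if x > self.max:
                self.max = x
            if self.m > 1:
                i, lo = self._split(x)
                if i not in self.children:
                    self.children[i] = VEB(self.lo_bits)
                    if self.aux is None:
                        self.aux = VEB(self.m - self.lo_bits)
                    self.aux.insert(i)       # cluster i just became non-empty
                self.children[i].insert(lo)

        def member(self, x):
            if x == self.min or x == self.max:
                return True
            if self.m == 1 or self.min is None:
                return False
            i, lo = self._split(x)
            return i in self.children and self.children[i].member(lo)

        def successor(self, x):
            # Smallest stored element strictly greater than x, or None.
            if self.min is not None and x < self.min:
                return self.min
            if self.max is None or x >= self.max:
                return None
            if self.m == 1:
                return self.max              # here min <= x < max
            i, lo = self._split(x)
            child = self.children.get(i)
            if child is not None and child.max is not None and lo < child.max:
                return (i << self.lo_bits) | child.successor(lo)
            j = self.aux.successor(i)        # next non-empty cluster
            return (j << self.lo_bits) | self.children[j].min

    v = VEB(16)                              # universe {0, ..., 65535}
    for key in (3, 10, 7, 42):
        v.insert(key)
    print(v.member(7), v.member(8))          # True False
    print(v.successor(7), v.successor(42))   # 10 None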

7.6.3 References

[1] Peter van Emde Boas: Preserving order in a forest in less than logarithmic time (Proceedings of the 16th Annual Symposium on Foundations of Computer Science 10: 75-84, 1975)

[2] Gudmund Skovbjerg Frandsen: Dynamic algorithms: Course notes on van Emde Boas trees (PDF) (University of Aarhus, Department of Computer Science)

[3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. MIT Press, 2009. ISBN 978-0-262-53305-8. Chapter 20: The van Emde Boas tree, pp. 531–560.

[4] Rex, A. “Determining the space complexity of van Emde Boas trees”. Retrieved 2011-05-27.

Further reading

• Erik Demaine, Shantonu Sen, and Jeff Lindy. Massachusetts Institute of Technology. 6.897: Advanced Data Structures (Spring 2003). Lecture 1 notes: Fixed-universe successor problem, van Emde Boas. Lecture 2 notes: More van Emde Boas, ....
• Van Emde Boas, P.; Kaas, R.; Zijlstra, E. (1976). “Design and implementation of an efficient priority queue”. Mathematical Systems Theory. 10: 99–127. doi:10.1007/BF01683268.

7.7 Fusion tree

In computer science, a fusion tree is a type of tree data structure that implements an associative array on w-bit integers. When operating on a collection of n key–value pairs, it uses O(n) space and performs searches in O(log_w n) time, which is asymptotically faster than a traditional self-balancing binary search tree, and also better than the van Emde Boas tree for large values of w. It achieves this speed by exploiting certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael Fredman and Dan Willard.[1]

Several advances have been made since Fredman and Willard's original 1990 paper. In 1999[2] it was shown how to implement fusion trees under a model of computation in which all of the underlying operations of the algorithm belong to AC0, a model of circuit complexity that allows addition and bitwise Boolean operations but disallows the multiplication operations used in the original fusion tree algorithm. A dynamic version of fusion trees using hash tables was proposed in 1996[3] which matched the original structure's O(log_w n) runtime in expectation. Another dynamic version using exponential search trees was proposed in 2007[4] which yields worst-case runtimes of O(log_w n + log log u) per operation, where u is the size of the largest key. It remains open whether dynamic fusion trees can achieve O(log_w n) per operation with high probability.

7.7.1 How it works

A fusion tree is essentially a B-tree with branching factor of w^(1/5) (any small exponent is also possible), which gives it a height of O(log_w n). To achieve the desired runtimes for updates and queries, the fusion tree must be able to search a node containing up to w^(1/5) keys in constant time. This is done by compressing (“sketching”) the keys so that all can fit into one machine word, which in turn allows comparisons to be done in parallel.

Sketching

Sketching is the method by which each w-bit key at a node containing k keys is compressed into only k − 1 bits. Each key x may be thought of as a path in the full binary tree of height w starting at the root and ending at the leaf corresponding to x. To distinguish two paths, it suffices to look at their branching point (the first bit where the two keys differ). All k paths together have k − 1 branching points, so at most k − 1 bits are needed to distinguish any two of the k keys.

Visualization of the sketch function.

If the locations of the sketch bits are b_1 < b_2 < ··· < b_r, then the sketch of the key x_(w−1)···x_1x_0 is the r-bit integer x_(b_r) x_(b_(r−1)) ··· x_(b_1).

An important property of the sketch function is that it preserves the order of the keys. That is, sketch(x) < sketch(y) for any two keys x < y.

Approximating the sketch

With only standard word operations, such as those of the C programming language, it is difficult to directly compute the sketch of a key in constant time. Instead, the sketch bits can be packed into a range of size at most r^4, using bitwise AND and multiplication.

The bitwise AND operation serves to clear all non-sketch bits from the key, while the multiplication shifts the sketch bits into a small range. Like the “perfect” sketch, the approximate sketch preserves the order of the keys.

Some preprocessing is needed to determine the correct multiplication constant. Each sketch bit in location b_i will get shifted to b_i + m_i via a multiplication by m = Σ_(i=1..r) 2^(m_i). For the approximate sketch to work, the following three properties must hold:

1. b_i + m_j are distinct for all pairs (i, j). This will ensure that the sketch bits are uncorrupted by the multiplication.
2. b_i + m_i is a strictly increasing function of i. That is, the order of the sketch bits is preserved.
3. (b_r + m_r) − (b_1 + m_1) ≤ r^4. That is, the sketch bits are packed into a range of size at most r^4.

An inductive argument shows how the m_i can be constructed. Let m_1 = w − b_1. Suppose that 1 < t ≤ r and that m_1, m_2, ..., m_(t−1) have already been chosen. Then pick the smallest integer m_t such that both properties (1) and (2) are satisfied. Property (1) requires that m_t ≠ b_i − b_j + m_l for all 1 ≤ i, j ≤ r and 1 ≤ l ≤ t−1. Thus, there are less than tr^2 ≤ r^3 values that m_t must avoid. Since m_t is chosen to be minimal, (b_t + m_t) ≤ (b_(t−1) + m_(t−1)) + r^3. This implies Property (3).

The approximate sketch is thus computed as follows:

1. Mask out all but the sketch bits with a bitwise AND.
2. Multiply the key by the predetermined constant m. This operation actually requires two machine words, but this can still be done in constant time.
3. Mask out all but the shifted sketch bits. These are now contained in a contiguous block of at most r^4 < w^(4/5) bits.

Parallel comparison

The purpose of the compression achieved by sketching is to allow all of the keys to be stored in one w-bit word. Let the node sketch of a node be the bit string

1sketch(x_1)1sketch(x_2)...1sketch(x_k)

We can assume that the sketch function uses exactly b ≤ r^4 bits. Then each block uses 1 + b ≤ w^(4/5) bits, and since k ≤ w^(1/5), the total number of bits in the node sketch is at most w.

A brief notational aside: for a bit string s and nonnegative integer m, let s^m denote the concatenation of s to itself m times. If t is also a bit string, st denotes the concatenation of t to s.

The node sketch makes it possible to search the keys for any b-bit integer y. Let z = (0y)^k, which can be computed in constant time (multiply y by the constant (0^b 1)^k). Note that 1sketch(x_i) − 0y is always positive, but preserves its leading 1 iff sketch(x_i) ≥ y. We can thus compute the smallest index i such that sketch(x_i) ≥ y as follows:

1. Subtract z from the node sketch.
2. Take the bitwise AND of the difference and the constant (10^b)^k. This clears all but the leading bit of each block.
3. Find the most significant bit of the result.
4. Compute i, using the fact that the leading bit of the i-th block has index i(b+1).

Desketching

For an arbitrary query q, parallel comparison computes the index i such that

sketch(x_(i−1)) ≤ sketch(q) ≤ sketch(x_i)

Unfortunately, the sketch function is not in general order-preserving outside the set of keys, so it is not necessarily the case that x_(i−1) ≤ q ≤ x_i. What is true is that, among all of the keys, either x_(i−1) or x_i has the longest common prefix with q. This is because any key y with a longer common prefix with q would also have more sketch bits in common with q, and thus sketch(y) would be closer to sketch(q) than any sketch(x_j).

The length of the longest common prefix between two w-bit integers a and b can be computed in constant time by finding the most significant bit of the bitwise XOR between a and b. This can then be used to mask out all but the longest common prefix.

Note that p identifies exactly where q branches off from the set of keys. If the next bit of q is 0, then the successor of q is contained in the p1 subtree, and if the next bit of q is 1, then the predecessor of q is contained in the p0 subtree. This suggests the following algorithm:

1. Use parallel comparison to find the index i such that sketch(x_(i−1)) ≤ sketch(q) ≤ sketch(x_i).
2. Compute the longest common prefix p of q and either x_(i−1) or x_i (taking the longer of the two).
3. Let l−1 be the length of the longest common prefix p.
   (a) If the l-th bit of q is 0, let e = p10^(w−l). Use parallel comparison to search for the successor of sketch(e). This is the actual predecessor of q.
   (b) If the l-th bit of q is 1, let e = p01^(w−l). Use parallel comparison to search for the predecessor of sketch(e). This is the actual successor of q.
4. Once either the predecessor or successor of q is found, the exact position of q among the set of keys is determined.
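A small Python illustration (ours) of the parallel comparison above. Python integers are unbounded, so nothing here is truly constant-time, but the node sketch, the (0y)^k subtraction, and the block-leading-bit mask behave exactly as described.

    def parallel_compare(sketches, y, b):
        # sketches: sorted b-bit sketches s1 <= ... <= sk, each < 2**b.
        # Returns the smallest 0-based index i with sketches[i] >= y, or k.
        k = len(sketches)
        node = 0
        for s in sketches:                       # node sketch 1s1 1s2 ... 1sk
            node = (node << (b + 1)) | (1 << b) | s
        z = mask = 0
        for _ in range(k):
            z = (z << (b + 1)) | y               # z = (0y)^k
            mask = (mask << (b + 1)) | (1 << b)  # (1 0^b)^k
        diff = (node - z) & mask                 # block i keeps its leading 1 iff si >= y
        if diff == 0:
            return k
        msb = diff.bit_length() - 1              # most significant surviving bit
        return k - (msb + 1) // (b + 1)

    print(parallel_compare([1, 2, 3], 2, b=2))   # 1: sketch x2 is the first one >= 2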

7.7.2 Fusion hashing

An application of fusion trees to hash tables was given by Willard, who describes a data structure for hashing in which an outer-level hash table with hash chaining is combined with a fusion tree representing each hash chain. In hash chaining, in a hash table with a constant load factor, the average size of a chain is constant, but additionally with high probability all chains have size O(log n / log log n), where n is the number of hashed items. This chain size is small enough that a fusion tree can handle searches and updates within it in constant time per operation. Therefore, the time for all operations in the data structure is constant with high probability. More precisely, with this data structure, for every inverse-quasipolynomial probability p(n) = exp(−(log n)^O(1)), there is a constant C such that the probability that there exists an operation that exceeds time C is at most p(n).[5]

7.7.3 References

[1] Fredman, M. L.; Willard, D. E. (1990), “BLASTING Through the Information Theoretic Barrier with FUSION TREES”, Proceedings of the Twenty-second Annual ACM Symposium on Theory of Computing (STOC '90), New York, NY, USA: ACM, pp. 1–7, doi:10.1145/100216.100217, ISBN 0-89791-361-2.

[2] Andersson, Arne; Miltersen, Peter Bro; Thorup, Mikkel (1999), “Fusion trees can be implemented with AC0 instructions only”, Theoretical Computer Science, 215 (1–2): 337–344, doi:10.1016/S0304-3975(98)00172-8, MR 1678804.

[3] Raman, Rajeev (1996), “Priority queues: small, monotone and trans-dichotomous”, Fourth Annual European Symposium on Algorithms (ESA '96), Barcelona, Spain, September 25–27, 1996, Lecture Notes in Computer Science, 1136, Berlin: Springer-Verlag, pp. 121–137, doi:10.1007/3-540-61680-2_51, MR 1469229.

[4] Andersson, Arne; Thorup, Mikkel (2007), “Dynamic ordered sets with exponential search trees”, Journal of the ACM, 54 (3): A13, doi:10.1145/1236457.1236460, MR 2314255.

[5] Willard, Dan E. (2000), “Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree”, SIAM Journal on Computing, 29 (3): 1030–1049, doi:10.1137/S0097539797322425, MR 1740562.

7.7.4 External links

• MIT CS 6.897: Advanced Data Structures: Lecture 4, Fusion Trees, Prof. Erik Demaine (Spring 2003)
• MIT CS 6.897: Advanced Data Structures: Lecture 5, More fusion trees; self-organizing data structures, move-to-front, static optimality, Prof. Erik Demaine (Spring 2003)
• MIT CS 6.851: Advanced Data Structures: Lecture 13, Fusion Tree notes, Prof. Erik Demaine (Spring 2007)
• MIT CS 6.851: Advanced Data Structures: Lecture 12, Fusion Tree notes, Prof. Erik Demaine (Spring 2012)


Clayhalliwell, LeonardoRob0t, Bluezy, Katieh5584, Tyomitch, Willemo, RichF, robot, SmackBot, Waltercruz~enwiki, FlashSheri- dan, Rōnin, Sam Pointon, Gilliam, Leafboat, Rmosler2100, NewName, Chris the speller, TimBentley, Stevage, Nbarth, Colonies Chris, Deshraj, JonHarder, Cybercobra, IE, MegaHasher, Lasindi, Atkinson 291, Dreslough, Jan.Smolik, NJZombie, Minna Sora no Shita, 16@r, Hvn0413, Beetstra, ATren, Noah Salzman, Koweja, Hu12, Iridescent, PavelY, Aeons, Tawkerbot2, Ahy1, Penbat, VTBassMatt, Ntsimp, Mblumber, JFreeman, Xenochria, HappyInGeneral, Headbomb, Marek69, Neil916, Dark knight, Nick Number, Danarmstrong, Tha- dius856, AntiVandalBot, Ste4k, Darklilac, Wizmo, JAnDbot, XyBot, MER-C, PhilKnight, SiobhanHansa, Wikilolo, VoABot II, Twsx, Japo, David Eppstein, Philg88, Gwern, Moggie2002, Tgeairn, Trusilver, Javawizard, Dillesca, Daniel5Ko, Nwbeeson, Cobi, KylieTas- tic, Ja 62, Brvman, Meiskam, Larryisgood, Oshwah, Vipinhari, Mantipula, Don4of4, Amog, BotKung, BigDunc, Wolfrock, Celticeric, B4upradeep, Tomaxer, Albertus Aditya, Clowd81, Sprocter, Kbrose, Arjun024, J0hn7r0n, Wjl2, SieBot, Tiddly Tom, Yintan, Ham Pas- trami, Pi is 3.14159, Keilana, Flyer22 Reborn, TechTony, Redmarkviolinist, Beejaye, Bughunter2, Mygerardromance, NHSKR, Hariva, Denisarona, Thorncrag, VanishedUser sdu9aya9fs787sads, Scarlettwharton, ClueBot, Ravek, Justin W Smith, The Thing That Should Not Be, Raghaven, ImperfectlyInformed, Garyzx, Arakunem, Mild Bill Hiccup, Rob Bednark, Lindsayfgilmour, TobiasPersson, SchreiberBike, Dixie91, Nasty psycho, XLinkBot, Marc van Leeuwen, Avoided, G7mcluvn, Hook43113, Kurniasan, Wolkykim, Addbot, Anandvachhani, MrOllie, Freqsh0, Zorrobot, Jarble, Quantumobserver, Yobot, Fraggle81, KamikazeBot, Shadoninja, AnomieBOT, Jim1138, Materialsci- entist, Mwe 001, Citation bot, Quantran202, SPTWriter, Mtasic, Binaryedit, Miym, Etienne Lehnart, Sophus Bie, Apollo2991, Construc- tive editor, Afromayun, Prari, FrescoBot, Meshari alnaim, Ijsf, Mark Renier, Citation bot 1, I dream of horses, Apeculiaz, Patmorin, Carloseow, Vrenator, Zvn, BZRatfink, Arjitmalviya, Vhcomptech, WillNess, Minimac, Jfmantis, RjwilmsiBot, Agrammenos, EmausBot, KralSS, Super48paul, Simply.ari1201, Eniagrom, MaGa, Donner60, Carmichael, Peter Karlsen, 28bot, Sjoerddebruin, ClueBot NG, Jack Greenmaven, Millermk, Rezabot, Widr, MerlIwBot, Helpful Pixie Bot, HMSSolent, BG19bot, WikiPhoenix, Tango4567, Dekai Wu, Computersagar, SaurabhKB, Klilidiplomus, Singlaive, IgushevEdward, Electricmuffin11, TalhaIrfanKhan, Jmendeth, Frosty, Smortypi, RossMMooney, Gauravxpress, Noyster, Suzrocksyu, Bryce archer, Melcous, Monkbot, Azx0987, Mahesh Dheravath, Vikas bhatnager, Aswincweety, RationalBlasphemist, TaqPol, Ishanalgorithm, KasparBot, Pythagorean Aditya Guha Roy, Fmadd, Ushkin N, SimoneBrig- ante and Anonymous: 667 • Doubly linked list Source: https://en.wikipedia.org/wiki/Doubly_linked_list?oldid=739258133 Contributors: McKay, Tea2min, Jorge Stolfi, Andreas Kaufmann, CanisRufus, Velella, Mindmatrix, Ruud Koot, Ewlyahoocom, Daverocks, Jeremy Visser, Chris the speller, Nick Levine, Cybercobra, Myasuda, Medinoc, Fetchcomms, Crazytonyi, Kehrbykid, TechTony, Fishnet37222, The Thing That Should Not Be, Addbot, Happyrabbit, Vandtekor, Amaury, Sae1962, Tyriar, Ryanz1123, Pinethicket, Jeffrd10, Suffusion of Yellow, Jfmantis, John of Reading, Wikipelli, Usb10, ManU0710, ClueBot NG, Widr, Tlefebvre, Prashantgonarkar, Closeyes2, Asaifm, Comp.arch and Anonymous: 66 • Stack (abstract data type) Source: 
https://en.wikipedia.org/wiki/Stack_(abstract_data_type)?oldid=741559563 Contributors: The Anome, Andre Engels, Arvindn, Christian List, Edward, Patrick, RTC, Michael Hardy, Modster, MartinHarper, Ixfd64, TakuyaMurata, Mbessey, Stw, Stan Shebs, Notheruser, Dcoetzee, Jake Nelson, Traroth, JensMueller, Finlay McWalter, Robbot, Noldoaran, Murray Lang- ton, Fredrik, Wlievens, Guy Peters, Tea2min, Adam78, Giftlite, BenFrantzDale, WiseWoman, Gonzen, Macrakis, VampWillow, Hgfernan, Maximaximax, Marc Mongenet, Karl-Henner, Andreas Kaufmann, RevRagnarok, Corti, Poccil, Andrejj, CanisRufus, Spoon!, Bobo192, Grue, Shenme, R. S. Shaw, Vystrix Nexoth, Physicistjedi, James Foster, Obradovic Goran, Mdd, Musiphil, Liao, Hackwrench, Pion, ReyBrujo, 2mcm, Netkinetic, Postrach, Mindmatrix, MattGiuca, Ruud Koot, Mandarax, Slgrandson, Graham87, Qwertyus, Kbdank71, Angusmclellan, Maxim Razin, Vlad Patryshev, FlaBot, Dinoen, Mahlon, Chobot, Bgwhite, Gwernol, Whosasking, NoirNoir, Roboto de Ajvol, YurikBot, Borgx, Michael Slone, Ahluka, Stephenb, ENeville, Mipadi, Reikon, Vanished user 1029384756, Xdenizen, Scs, Epipelagic, Caerwine, Boivie, Rwxrwxrwx, Fragment~enwiki, Cedar101, TuukkaH, KnightRider~enwiki, SmackBot, Adam majewski, Hftf, Incnis Mrsi, BiT, Edgar181, Fernandopabon, Gilliam, Chris the speller, Agateller, RDBrown, Jprg1966, Thumperward, Oli Filth, Nbarth, DHN-bot~enwiki, Cybercobra, Funky Monkey, PeterJeremy, MarkPritchard, Mlpkr, Vasiliy Faronov, MegaHasher, SashatoBot, Zchenyu, Vanished user 9i39j3, F15x28, Ultranaut, SpyMagician, CoolKoon, Loadmaster, Tasc, Mr Stephen, Iridescent, Nutster, Tsf, Jesse Viviano, IntrigueBlue, Penbat, VTBassMatt, Myasuda, FrontLine~enwiki, Simenheg, Jzalae, Pheasantplucker, Bsmntbombdood, Seth Man- apio, Thijs!bot, Al Lemos, Headbomb, Davidhorman, Mentifisto, AntiVandalBot, Seaphoto, Stevenbird, CosineKitty, Arch dude, IanOs- good, Jheiv, SiobhanHansa, Wikilolo, Magioladitis, VoABot II, Gamkiller, Individual X, David Eppstein, Gwern, R'n'B, Pomte, Adavidb, Ianboggs, Dillesca, Sanjay742, Bookmaker~enwiki, Cjhoyle, Manassehkatz, David.Federman, Funandtrvl, Jeff G., Cheusov, Maxtremus, TXiKiBoT, Hqb, Klower, JhsBot, Aaron Rotenberg, BotKung, Wikidan829, !dea4u, SieBot, Calliopejen1, BotMultichill, Raviemani, Ham Pastrami, Keilana, Aillema, Ctxppc, OKBot, Hariva, Mr. 
Stradivarius, Fsmoura, ClueBot, Clx321, Melizg, Robert impey, Mahue, Rustamabd, LobStoR, Aitias, Johnuniq, XLinkBot, Jyotiswaroopr123321, Ceriak, Hook43113, MystBot, Dsimic, Gggh, Addbot, Ghet- toblaster, CanadianLinuxUser, Numbo3-bot, OlEnglish, Jarble, Aavviof, Luckas-bot, KamikazeBot, Peter Flass, AnomieBOT, 1exec1, Unara, Materialscientist, Citation bot, Xqbot, Quantran202, TechBot, GrouchoBot, RibotBOT, Cmccormick8, In fact, Rpv.imcc, Mark Renier, D'ohBot, Ionutzmovie, Alxeedo, Colin meier, Salocin-yel, I dream of horses, Tom.Reding, Xcvista, ElNuevoEinstein, Tapkeer- rambo007, Trappist the monk, Tbhotch, IITManojit, Yammesicka, Jfmantis, Faysol037, RjwilmsiBot, Ripchip Bot, Mohinib27, Emaus- Bot, WikitanvirBot, Dreamkxd, Luciform, Gecg, Maashatra11, RA0808, ZéroBot, Shuipzv3, Arkahot, Paul Kube, Thine Antique Pen, BlizzmasterPilch, L Kensington, ChuispastonBot, ClueBot NG, Matthiaspaul, StanBally, Dhardik007, Strcat, Ztothefifth, Zakblade2000, Robin400, Widr, Nakarumaka, KLBot2, Spieren, Vishal G.Dhavale., Nipunbayas, PranavAmbhore, Solomon7968, Proxyma, BattyBot, David.moreno72, Abhidesai, Nova2358, ChrisGualtieri, Flaqueleto, Chengshuotian, Kushalbiswas777, Mogism, Makecat-bot, Benton- jimmy, Icarot, Tiberius6996, Maeganm, Gauravxpress, Yasir72.multan, Tranzenic, Rajavenu.iitm, Jacektomas, Monkbot, Opencooper, Pre8y, Flayneorange, KasparBot, Benaboy01, Fmadd, Azurnwiki and Anonymous: 336 • Queue (abstract data type) Source: https://en.wikipedia.org/wiki/Queue_(abstract_data_type)?oldid=736676847 Contributors: Blck- Knght, Andre Engels, DavidLevinson, LapoLuchini, Edward, Patrick, Michael Hardy, Ixfd64, Ahoerstemeier, Nanshu, Glenn, Emper- orbma, Dcoetzee, Furrykef, Traroth, Metasquares, Jusjih, PuzzletChung, Robbot, Noldoaran, Fredrik, JosephBarillari, Rasmus Faber, Tea2min, Giftlite, Massysett, BenFrantzDale, MingMecca, Zoney, Rdsmith4, Tsemii, Andreas Kaufmann, Corti, Discospinster, Wrp103, Mecanismo, Mehrenberg, Indil, Kwamikagami, Chairboy, Spoon!, Robotje, Helix84, Zachlipton, Alansohn, Liao, Conan, Gunslinger47, Mc6809e, Caesura, Jguk, Kenyon, Woohookitty, Mindmatrix, Peng~enwiki, MattGiuca, Ruud Koot, Graham87, Rachel1, Qwertyus, De- Piep, Olivier Teuliere, Bruce1ee, W3bbo, Margosbot~enwiki, Wouter.oet, Ewlyahoocom, Jrtayloriv, Zotel, Roboto de Ajvol, PhilipR, RussBot, J. M., SpuriousQ, Stephenb, Stassats, Howcheng, JohJak2, Caerwine, Mike1024, Carlosguitar, SmackBot, Honza Záruba, M2MM4M, Dabear~enwiki, Skizzik, Chris the speller, Oli Filth, Wikibarista, Nbarth, DHN-bot~enwiki, OrphanBot, Zvar, Cybercobra, Pissant, Mlpkr, Cdills, Kflorence, Almkglor, PseudoSudo, Ckatz, 16@r, Sharcho, Nutster, Penbat, VTBassMatt, Banditlord, Simenheg, Tawkerbot4, Christian75, X96lee15, Uruiamme, Thadius856, Hires an editor, Lperez2029, Egerhart, Deflective, SiobhanHansa, Wikilolo, MikeDunlavey, David Eppstein, Gwern, GrahamDavies, Sanjay742, Contactbanish, NewEnglandYankee, Nwbeeson, Bobo2000, Alnok- taBOT, JhsBot, Broadbot, Atiflz, BotKung, Jesin, Calliopejen1, BotMultichill, Ham Pastrami, Keilana, Thesuperslacker, Hariva, Arsenic99, 8.1. TEXT 239

Chelseafan528, WikiBotas, ClueBot, Ggia, Vanmaple, Alexbot, Ksulli10, Jotterbot, TobiasPersson, SensuiShinobu1234, DumZiBoT, Kle- ,ماني ,tos, XLinkBot, SilvonenBot, Marry314113, Dsimic, Addbot, Some jerk on the Internet, OliverTwisted, MrOllie, SoSaysChappy Loupeter, Yobot, Vanished user rt41as76lk, KamikazeBot, Materialscientist, LilHelpa, Xqbot, Vegpuff, Joseph.w.s~enwiki, DSisyphBot, Ruby.red.roses, FrescoBot, Mark Renier, Miklcct, Arthur MILCHIOR, Gbduende, PrometheeFeu~enwiki, Maxwellterry, John lindgren, Garfieldnate, EmausBot, Jasonanaggie, Akerans, Redhanker, Sorancio, Donner60, Clehner~enwiki, Gralfca, ClueBot NG, Detonadorado, MahlerFive, Ztothefifth, Rahulghose, Iprathik, Zanaferx, Tlefebvre, Vasuakeel, PhuksyWiki, Solomon7968, Fswangke, Dmitrysobolev, David.moreno72, Nemo Kartikeyan, Kushalbiswas777, DavidLeighEllis, Sam Sailor, Tranzenic, ScottDNelson, Ishanalgorithm, Inter- netArchiveBot, Cakedy, NgYShung, GreenC bot and Anonymous: 204 • Double-ended queue Source: https://en.wikipedia.org/wiki/Double-ended_queue?oldid=734510471 Contributors: The Anome, Freckle- foot, Edward, Axlrosen, CesarB, Dcoetzee, Dfeuer, Zoicon5, Furrykef, Fredrik, Merovingian, Rasmus Faber, Tea2min, Smjg, Sj, Ben- FrantzDale, Esrogs, Chowbok, Rosen, Andreas Kaufmann, Pt, Spoon!, Mindmatrix, Ruud Koot, Mandarax, Wikibofh, Drrngrvy, Naraht, Ffaarr, YurikBot, Fabartus, Jengelh, SpuriousQ, Fbergo, Schellhammer, Ripper234, Sneftel, Bcbell, SmackBot, Cparker, Psiphiorg, Chris the speller, Kurykh, TimBentley, Oli Filth, Silly rabbit, Nbarth, Luder, Puetzk, Cybercobra, Offby1, Dicklyon, CmdrObot, Penbat, Funny- farmofdoom, Mwhitlock, Omicronpersei8, Headbomb, VictorAnyakin, Felix C. Stegerman, David Eppstein, MartinBot, Huzzlet the bot, KILNA, VolkovBot, Anonymous Dissident, BotKung, Ramiromagalhaes, Kbrose, Hawk777, Ham Pastrami, Krishna.91, Rdhettinger, Foxj, Alexbot, Rhododendrites, XLinkBot, Dekart, Wolkykim, Matěj Grabovský, Rrmsjp, Legobot, Yobot, Sae1962, Arthur MILCHIOR, Lit- tleWink, Woodlot, EmausBot, WikitanvirBot, Aamirlang, E Nocress, ClueBot NG, Ztothefifth, Shire Reeve, Helpful Pixie Bot, BG19bot, Gauravi123, Mtnorthpoplar, RippleSax, Vorhalas, Zdim wiki and Anonymous: 91 • Circular buffer Source: https://en.wikipedia.org/wiki/Circular_buffer?oldid=738847724 Contributors: Damian Yerrick, Julesd, Malco- hol, Chocolateboy, Tea2min, DavidCary, Andreas Kaufmann, Astronouth7303, Foobaz, Shabble, Cburnett, Qwertyus, Bgwhite, Pok148, Cedar101, Mhi, WolfWings, SmackBot, Ohnoitsjamie, Chris the speller, KiloByte, Silly rabbit, Antonrojo, Rrelf, Frap, Cybercobra, Zoxc, Mike65535, Anonymi, Joeyadams, Mark Giannullo, Headbomb, Llloic, ForrestVoight, Marokwitz, Hosamaly, Parthashome, Magi- oladitis, Indubitably, Amikake3, Strategist333, Billinghurst, Rhanekom, Calliopejen1, SiegeLord, BrightRoundCircle, , OlivierEM, DrZoomEN, Para15000, Niceguyedc, Lucius Annaeus Seneca, Apparition11, Dekart, Dsimic, Addbot, Shervinemami, MrOllie, Or- linKolev, Matěj Grabovský, Yobot, Ptbotgourou, Tennenrishin, AnomieBOT, BastianVenthur, Materialscientist, ChrisCPearson, Serkan Kenar, Shirik, 78.26, Mayukh iitbombay 2008, Hoo man, Sysabod, Ybungalobill, Paulitex, Lipsio, Eight40, ZéroBot, Bloodust, Pokbot, ClueBot NG, Asimsalam, Shengliangsong, Lemtronix, Exfuent, Tectu, Msoltyspl, MuhannadAjjan, Cerabot~enwiki, ScotXW, Jijubin, Hailu143, EUROCALYPTUSTREE, Agustinothadeus and Anonymous: 102 • Associative array Source: https://en.wikipedia.org/wiki/Associative_array?oldid=741225367 Contributors: Damian Yerrick, Robert Merkel, 
Fubar Obfusco, Maury Markowitz, Hirzel, B4hand, Paul Ebermann, Edward, Patrick, Michael Hardy, Shellreef, Graue, Minesweeper, Brianiac, Samuelsen, Bart Massey, Hashar, Dcoetzee, Dysprosia, Silvonen, Bevo, Robbot, Noldoaran, Fredrik, Alten- mann, Wlievens, Catbar, Wikibot, Ruakh, EvanED, Jleedev, Tea2min, Ancheta Wis, Jpo, DavidCary, Mintleaf~enwiki, Inter, Wolfkeeper, Jorge Stolfi, Macrakis, Pne, Neilc, Kusunose, Karol Langner, Bosmon, Int19h, Andreas Kaufmann, RevRagnarok, Ericamick, LeeHunter, PP Jewel, Kwamikagami, James b crocker, Spoon!, Bobo192, TommyG, Minghong, Alansohn, Mt~enwiki, Krischik, Sligocki, Rtmy- ers, Kdau, Tony Sidaway, RainbowOfLight, Forderud, TShilo12, Boothy443, Mindmatrix, RzR~enwiki, Apokrif, Kglavin, Bluemoose, ObsidianOrder, Pfunk42, Qwertyus, Yurik, Swmcd, Scandum, Koavf, Agorf, Jeff02, RexNL, Alvin-cs, Wavelength, Fdb, Maerk, Dg- goldst, Cedar101, JLaTondre, Ffangs, TuukkaH, SmackBot, KnowledgeOfSelf, MeiStone, Mirzabah, TheDoctor10, Sam Pointon, Brianski, Hugo-cs, Jdh30, Zven, Cfallin, CheesyPuffs144, Malbrain, Nick Levine, Vegard, Radagast83, Cybercobra, Decltype, Paddy3118, YeMer- ryPeasant, AvramYU, Doug Bell, AmiDaniel, Antonielly, EdC~enwiki, Tobe2199, Hans Bauer, Dreftymac, Pimlottc, George100, JForget, Jokes Free4Me, Pgr94, MrSteve, Countchoc, Ajo Mama, WinBot, Oddity-, Alphachimpbot, Maslin, JonathanCross, Pfast, PhiLho, Wm- bolle, Magioladitis, David Eppstein, Gwern, Doc aberdeen, Signalhead, VolkovBot, Chaos5023, Kyle the bot, TXiKiBoT, Anna Lincoln, BotKung, Comet--berkeley, Jesdisciple, PanagosTheOther, Nemo20000, Jerryobject, CultureDrone, Anchor Link Bot, ClueBot, Copyed- itor42, Irishjugg~enwiki, XLinkBot, Orbnauticus, Frostus, Dsimic, Deineka, Addbot, Debresser, Jarble, Bartledan, Davidwhite544, Mar- gin1522, Legobot, Luckas-bot, Yobot, TaBOT-zerem, Pcap, Peter Flass, AnomieBOT, RibotBOT, January2009, Sae1962, Efadae, Neil Schipper, Floatingdecimal, Tushar858, EmausBot, WikitanvirBot, Marcos canbeiro, AvicBot, ClueBot NG, JannuBl22t, Helpful Pixie Bot, Shuisman, DoctorRad, Crh23, Mithrasgregoriae, JYBot, Dcsaba70, LTWoods, Comp.arch, Suelru, Bad Dryer, Alonsoguillenv, EDicken- son and Anonymous: 194 • Association list Source: https://en.wikipedia.org/wiki/Association_list?oldid=728838162 Contributors: SJK, Dcoetzee, Dremora, Tony Sidaway, Pmcjones, SMcCandlish, David Eppstein, Yobot, Helpful Pixie Bot and Anonymous: 2 • Hash table Source: https://en.wikipedia.org/wiki/Hash_table?oldid=742446127 Contributors: Damian Yerrick, AxelBoldt, Zundark, The Anome, BlckKnght, Sandos, Rgamble, LapoLuchini, AdamRetchless, Imran, Mrwojo, Frecklefoot, Michael Hardy, Nixdorf, Pnm, Axl- rosen, TakuyaMurata, Ahoerstemeier, Nanshu, Dcoetzee, Dysprosia, Furrykef, Omegatron, Wernher, Bevo, Tjdw, Pakaran, Secretlondon, Robbot, Fredrik, Tomchiukc, R3m0t, Altenmann, Ashwin, UtherSRG, Miles, Giftlite, DavidCary, Wolfkeeper, BenFrantzDale, Everyk- ing, Waltpohl, Jorge Stolfi, Wmahan, Neilc, Pgan002, CryptoDerk, Knutux, Bug~enwiki, Sonjaaa, Teacup, Beland, Watcher, DNewhall, ReiniUrban, Sam Hocevar, Derek Parnell, Askewchan, Kogorman, Andreas Kaufmann, Kaustuv, Shuchung~enwiki, T Long, Hydrox, Cfailde, Luqui, Wrp103, Antaeus Feldspar, Bender235, Khalid, Raph Levien, JustinWick, CanisRufus, Shanes, Iron Wallaby, Krakhan, Bobo192, Davidgothberg, Larryv, Sleske, Helix84, Mdd, Varuna, Baka toroi, Anthony Appleyard, Sligocki, Drbreznjev, DSatz, Akuch- ling, TShilo12, Nuno Tavares, Woohookitty, LOL, Linguica, Paul Mackay~enwiki, Davidfstr, GregorB, Meneth, Graham87, Kbdank71, 
Tostie14, Rjwilmsi, Scandum, Koavf, Kinu, Pleiotrop3, Filu~enwiki, Nneonneo, FlaBot, Ecb29, Fragglet, Intgr, Fresheneesz, Antaeus FeIdspar, YurikBot, Wavelength, RobotE, Mongol, RussBot, Me and, CesarB’s unpriviledged account, Lavenderbunny, Gustavb, Mi- padi, Cryptoid, Mike.aizatsky, Gareth Jones, Piccolomomo~enwiki, CecilWard, Nethgirb, Gadget850, Bota47, Sebleblanc, Deeday-UK, Sycomonkey, Ninly, Gulliveig, Th1rt3en, CWenger, JLaTondre, ASchmoo, Kungfuadam, Daivox, SmackBot, Apanag, Obakeneko, Pizza- Margherita, Alksub, Eskimbot, RobotJcb, C4chandu, Gilliam, Arpitm, Neurodivergent, EncMstr, Cribe, Deshraj, Tackline, Frap, Mayrel, Radagast83, Cybercobra, Decltype, HFuruseth, Rich.lewis, Esb, Acdx, MegaHasher, Doug Bell, Derek farn, IronGargoyle, Josephsieh, Peter Horn, Pagh, Saxton, Tawkerbot2, Ouishoebean, CRGreathouse, Ahy1, MaxEnt, Seizethedave, Cgma, Not-just-yeti, Thermon, Ot- terSmith, Ajo Mama, Stannered, AntiVandalBot, Hosamaly, Thailyn, Pixor, JAnDbot, MER-C, Epeefleche, Dmbstudio, SiobhanHansa, Wikilolo, Bongwarrior, QrczakMK, Josephskeller, Tedickey, Schwarzbichler, Cic, Allstarecho, David Eppstein, Oravec, Gwern, Mag- nus Bakken, Glrx, Narendrak, Tikiwont, Mike.lifeguard, Luxem, NewEnglandYankee, Cobi, Cometstyles, Winecellar, VolkovBot, Sim- ulationelson, Floodyberry, Anurmi~enwiki, BotKung, Collin Stocks, JimJJewett, Nightkhaos, Spinningspark, Abatishchev, Helios2k6, Kehrbykid, Kbrose, PeterCanthropus, Gerakibot, Triwbe, Digwuren, Svick, JL-Bot, ObfuscatePenguin, ClueBot, Justin W Smith, Imper- fectlyInformed, Adrianwn, Mild Bill Hiccup, Niceguyedc, JJuran, Groxx, Berean Hunter, Eddof13, Johnuniq, Arlolra, XLinkBot, Het- ori, Pichpich, Paulsheer, TheTraveler3, MystBot, Karuthedam, Wolkykim, Addbot, Gremel123, Scientus, CanadianLinuxUser, MrOl- 240 CHAPTER 8. TEXT AND IMAGE SOURCES, CONTRIBUTORS, AND LICENSES

lie, Numbo3-bot, Om Sao, Zorrobot, Jarble, Frehley, Legobot, Luckas-bot, Yobot, Denispir, KamikazeBot, Peter Flass, Dmcomer, AnomieBOT, Erel Segal, Jim1138, Sz-iwbot, Materialscientist, Citation bot, ArthurBot, Baliame, Drilnoth, Arbalest Mike, Ched, Shad- owjams, Kracekumar, FrescoBot, Gbutler69, W Nowicki, X7q, Sae1962, Citation bot 1, Velociostrich, Simonsarris, Maggyero, Iekpo, Trappist the monk, SchreyP, Grapesoda22, Patmorin, Cutelyaware, JeepdaySock, Shafigoldwasser, Kastchei, DuineSidhe, EmausBot, Su- per48paul, Ibbn, DanielWaterworth, GoingBatty, Mousehousemd, ZéroBot, Purplie, Ticklemepink42, Paul Kube, Demonkoryu, Donner60, Carmichael, Pheartheceal, Aberdeen01, Neil P. Quinn, Teapeat, Rememberway, ClueBot NG, Iiii I I I, Incompetence, Rawafmail, Friet- jes, Cntras, Rezabot, Jk2q3jrklse, Helpful Pixie Bot, BG19bot, Jan Spousta, MusikAnimal, SanAnMan, Pbruneau, AdventurousSquir- rel, Triston J. Taylor, CitationCleanerBot, Happyuk, FeralOink, Spacemanaki, Aloksukhwani, Emimull, Deveedutta, Shmageggy, Igu- shevEdward, AlecTaylor, Mcom320, Thomas J. S. Greenfield, Razibot, Djszapi, QuantifiedElf, Myconix, Chip Wildon Forster, Tmfer- rara, Tuketu7, Whacks, Monkbot, Iokevins, Kjerish, Oleaster, MediKate, Micahsaint, Tourorist, Mtnorthpoplar, Dazappa, Gou7214309, Dwemthy, Ushkin N and Anonymous: 457 • Linear probing Source: https://en.wikipedia.org/wiki/Linear_probing?oldid=737979319 Contributors: Ubiquity, Bearcat, Enochlau, An- dreas Kaufmann, Gazpacho, Discospinster, RJFJR, Linas, Tas50, The Rambling Man, CesarB’s unpriviledged account, SpuriousQ, Chris the speller, JonHarder, MichaelBillington, Sbluen, Jeberle, Negrulio, Cryptic C62, Jngnyc, Alaibot, Thijs!bot, A3nm, David Eppstein, STBot, Themania, OliviaGuest, Arjunaraoc, C. A. Russell, Addbot, Legobot, Yobot, Tedzdog, Patmorin, Infinity ive, Dixtosa, Danmoberly, Dzf1992 and Anonymous: 16 • Quadratic probing Source: https://en.wikipedia.org/wiki/Quadratic_probing?oldid=735366636 Contributors: Aragorn2, Dcoetzee, Enochlau, Andreas Kaufmann, Rich Farmbrough, ZeroOne, Oleg Alexandrov, Ryk, Eubot, CesarB’s unpriviledged account, Robertvan1, Mikeblas, SmackBot, InverseHypercube, Ian1000, Cybercobra, Wizardman, Jdanz, Simeon, Magioladitis, David Eppstein, R'n'B, Philip Trueman, Hatmatbbat10, C. A. 
Russell, Addbot, Yobot, Bavla, Kmgpratyush, Donner60, ClueBot NG, Helpful Pixie Bot, Yashykt, Vaib- hav1992, AndiPersti, Danielcamiel, EapenZhan and Anonymous: 38 • Double hashing Source: https://en.wikipedia.org/wiki/Double_hashing?oldid=722897281 Contributors: AxelBoldt, CesarB, Angela, Dcoetzee, Usrnme h8er, Stesmo, RJFJR, Zawersh, Pfunk42, Gurch, CesarB’s unpriviledged account, Momeara, DasBrose~enwiki, Cob- blet, SmackBot, Bluebot, Hashbrowncipher, JForget, Only2sea, Alaibot, Thijs!bot, David Eppstein, WonderPhil, Philip Trueman, Ox- fordwang, Extensive~enwiki, Mild Bill Hiccup, Addbot, Tcl16, Smallman12q, Amiceli, Imposing, Jesse V., ClueBot NG, Exercisephys, Bdawson1982, Kevin12xd and Anonymous: 36 • Cuckoo hashing Source: https://en.wikipedia.org/wiki/Cuckoo_hashing?oldid=738697882 Contributors: Arvindn, Dcoetzee, McKay, Phil Boswell, Nyh, Pps, DavidCary, Neilc, Bender235, Unquietwiki, Zawersh, Ej, Nihiltres, Bgwhite, CesarB’s unpriviledged account, Zr2d2, Zerodamage, Aaron Will, SmackBot, Mandyhan, Thumperward, Cybercobra, Pagh, Jafet, CRGreathouse, Alaibot, Headbomb, Hermel, David Eppstein, S3000, Themania, Wjaguar, Mark cummins, LiranKatzir, Svick, Justin W Smith, Hetori, Addbot, Alquantor, Lmonson26, Luckas-bot, Yobot, Valentas.Kurauskas, Thore Husfeldt, W Nowicki, Citation bot 1, Userask, Trappist the monk, EmausBot, BuZZdEE.BuzZ, Rcsprinter123, Bomazi, Yoavt, BattyBot, Andrew Helwer, Dexbot, Usernameasdf, Monkbot, Cyberboys91, Harvi004 and Anonymous: 45 • Hopscotch hashing Source: https://en.wikipedia.org/wiki/Hopscotch_hashing?oldid=742861865 Contributors: Cybercobra, Svick, Im- ageRemovalBot, Shafigoldwasser, QinglaiXiao, BG19bot, Alxradz and Anonymous: 9 • Hash function Source: https://en.wikipedia.org/wiki/Hash_function?oldid=742381939 Contributors: Damian Yerrick, Derek Ross, Taw, BlckKnght, PierreAbbat, Miguel~enwiki, Imran, David spector, Dwheeler, Hfastedge, Michael Hardy, EddEdmondson, Ixfd64, Mde- bets, Nanshu, J-Wiki, Jc~enwiki, Vanis~enwiki, Dcoetzee, Ww, The Anomebot, Doradus, Robbot, Noldoaran, Altenmann, Mikepel- ley, Tea2min, Connelly, Giftlite, Paul Richter, DavidCary, KelvSYC, Wolfkeeper, Obli, Everyking, TomViza, Brona, Malyctenar, Jorge Stolfi, Matt Crypto, Utcursch, Knutux, OverlordQ, Kusunose, Watcher, Karl-Henner, Talrias, Peter bertok, Quota, Eisnel, Mormegil, Jonmcauliffe, Rich Farmbrough, Antaeus Feldspar, Bender235, Chalst, Evand, PhilHibbs, Haxwell, Bobo192, Sklender, Davidgothberg, Boredzo, Helix84, CyberSkull, Atlant, Jeltz, Mmmready, Apoc2400, InShaneee, Velella, Jopxton, ShawnVW, Kurivaim, MIT Trekkie, Redvers, Blaxthos, Kazvorpal, Brookie, Linas, Mindmatrix, GVOLTT, LOL, TheNightFly, Drostie, Pfunk42, Graham87, Qwertyus, Toolan, Rjwilmsi, Seraphimblade, Pabix, LjL, Ttwaring, Utuado, Nguyen Thanh Quang, FlaBot, Harmil, Gurch, Thenowhereman, Math- rick, Intgr, M7bot, Chobot, Roboto de Ajvol, YurikBot, Wavelength, RattusMaximus, RobotE, CesarB’s unpriviledged account, Stephenb, Pseudomonas, Andipi, Zeno of Elea, EngineerScotty, Mikeblas, Fender123, Bota47, Tachyon01, Ms2ger, Eurosong, Dfinkel, Lt-wiki- bot, Ninly, Gulliveig, StealthFox, Claygate, Snoops~enwiki, QmunkE, Emc2, Appleseed, Tobi Kellner, That Guy, From That Show!, Jbalint, SmackBot, InverseHypercube, Bomac, KocjoBot~enwiki, BiT, Yamaguchi, Gilliam, Raghaw, Schmiteye, Mnbf9rca, JesseS- tone, Oli Filth, EncMstr, Octahedron80, Nbarth, Kmag~enwiki, Malbrain, Chlewbot, Shingra, Midnightcomm, Lansey, Andrei Stroe, MegaHasher, Lambiam, Kuru, Alexcollins, Paulschou, RomanSpa, Chuck Simmons, 
KHAAAAAAAAAAN, Erwin, Peyre, Vstarre, Pagh, MathStuf, ShakingSpirit, Iridescent, Agent X2, BrianRice, Courcelles, Owen214, Juhachi, Neelix, Mblumber, SavantEdge, Adolphus79, Sytelus, Epbr123, Ultimus, Leedeth, Stualden, Folic Acid, AntiVandalBot, Xenophon (bot), JakeD409, Davorian, Powerdesi, JAnDbot, Epeefleche, Hamsterlopithecus, Kirrages, Stangaa, Steveprutz, Wikilolo, Coffee2theorems, Magioladitis, Pndfam05, Patelm, Nyttend, Kgfleischmann, Dappawit, Applrpn, STBot, GimliDotNet, R'n'B, Jfroelich, Francis Tyers, Demosta, Tgeairn, J.delanoy, Maurice Car- bonaro, Svnsvn, Wjaguar, L337 kybldmstr, Globbet, Ontarioboy, Doug4, Meiskam, Jrmcdaniel, VolkovBot, Sjones23, Boute, TXiKi- BoT, Christofpaar, GroveGuy, A4bot, Nxavar, Noformation, Cuddlyable3, Crashthatch, Wiae, Jediknil, Tastyllama, Skarz, LittleBenW, SieBot, WereSpielChequers, Laoris, KrizzyB, Xelgen, Flyer22 Reborn, Iamhigh, Dhb101, BrightRoundCircle, OKBot, Svick, Fusion- Now, BitCrazed, ClueBot, Cab.jones, Ggia, Unbuttered Parsnip, Garyzx, Mild Bill Hiccup, शिव, Dkf11, SamHartman, Alexbot, Erebus Morgaine, Diaa abdelmoneim, Wordsputtogether, Tonysan, Rishi.bedi, XLinkBot, Kotha arun2005, Dthomsen8, MystBot, Karuthedam, SteveJothen, Addbot, Butterwell, TutterMouse, Dranorter, MrOllie, CarsracBot, AndersBot, Jeaise, Lightbot, Luckas-bot, Fraggle81, AnomieBOT, Erel Segal, Materialscientist, Citation bot, Twri, ArthurBot, Xqbot, Capricorn42, Matttoothman, M2millenium, Theclapp, RibotBOT, Alvin Seville, MerlLinkBot, FrescoBot, Nageh, MichealH, TruthIIPower, Haeinous, Geoffreybernardo, Pinethicket, 10metreh, Cnwilliams, Mghgtg, Dinamik-bot, Vrenator, Keith Cascio, Phil Spectre, Jeffrd10, Updatehelper, Kastchei, EmausBot, Timtempleton, Gfoley4, Mayazcherquoi, Timde, MarkWegman, Dewritech, Jachto, John Cline, White Trillium, Fæ, Akerans, Paul Kube, Music Sorter, Donner60, Senator2029, Teapeat, Sven Manguard, Shi Hou, Mikhail Ryazanov, Rememberway, ClueBot NG, Incompetence, Neuroneu- tron, Monchoman45, Cntras, Widr, Mtking, Bluechimera0, HMSSolent, Wikisian, BG19bot, JamesNZ, GarbledLecture933, Harpreet Osahan, Glacialfox, Winston Chuen-Shih Yang, ChrisGualtieri, Tech77, Jeff Erickson, Jonahugh, Lindsaywinkler, Tmferrara, Catty- cat95, Tolcso, Frogger48, Eddiearin123, Philnap, Kanterme, Laberinto15, MatthewBuchwalder, Wkudrle, Computilizer, Mark22207, GlennLawyer, Gcarvelli, BlueFenixReborn, Some Gadget Geek, Siddharthgondhi, GSS-1987, Entranced98 and Anonymous: 490 • Perfect hash function Source: https://en.wikipedia.org/wiki/Perfect_hash_function?oldid=737620332 Contributors: Edward, Cimon Avaro, Dcoetzee, Fredrik, Giftlite, Neilc, E David Moyer, Burschik, Bender235, LOL, Ruud Koot, JMCorey, ScottJ, Mathbot, Spl, Ce- 8.1. TEXT 241

sarB’s unpriviledged account, Dtrebbien, Długosz, Gareth Jones, Salrizvy, Johndburger, SmackBot, Nbarth, Srchvrs, Otus, 4hodmt, Mega- Hasher, Pagh, Mudd1, Krauss, Headbomb, Wikilolo, David Eppstein, Glrx, Cobi, Drkarger, Gajeam, PixelBot, Addbot, G121, Bbb23, AnomieBOT, FrescoBot, Daoudamjad, John of Reading, Prvák, Maysak, Voomoo, Arka sett, BG19bot, SteveT84, Mcichelli, Dexbot, Latin.ufmg and Anonymous: 34 • Universal hashing Source: https://en.wikipedia.org/wiki/Universal_hashing?oldid=734555002 Contributors: Mattflaschen, DavidCary, Neilc, ArnoldReinhold, EmilJ, Pol098, Rjwilmsi, Sdornan, SeanMack, Chobot, Dmharvey, Gareth Jones, Guruparan18, Johndburger, Twintop, CharlesHBennett, SmackBot, Cybercobra, Copysan, DanielLemire, Pagh, Dwmalone, Winxa, Jafet, Arnstein87, Marc W. Abel, Sytelus, Francois.boutines, Headbomb, Golgofrinchian, David Eppstein, Copland Stalker, Danadocus, Cyberjoac, Ulamgamer, Ben- der2k14, Rswarbrick, Addbot, RPHv, Yobot, Mpatrascu, Citation bot, LilHelpa, Citation bot 1, TPReal, Patmorin, RjwilmsiBot, Emaus- Bot, Dewritech, ClueBot NG, Helpful Pixie Bot, Cleo, BG19bot, Walrus068, BattyBot, ChrisGualtieri, Zolgharnein, Jeff Erickson, Zen- guine and Anonymous: 42 • K-independent hashing Source: https://en.wikipedia.org/wiki/K-independent_hashing?oldid=700552394 Contributors: Nandhp, Rjwilmsi, CBM, David Eppstein, Iohannes Animosus, Mpatrascu, Mr Sheep Measham, BattyBot and Anonymous: 3 • Tabulation hashing Source: https://en.wikipedia.org/wiki/Tabulation_hashing?oldid=719749726 Contributors: RJFJR, DanielLemire, David Eppstein, Thomasda, Oranav, Thore Husfeldt, Tom.Reding, BG19bot, Cleanelephant, Eehcyl, Kbulgakov and Anonymous: 3 • Cryptographic hash function Source: https://en.wikipedia.org/wiki/Cryptographic_hash_function?oldid=743091536 Contributors: Damian Yerrick, Bryan Derksen, Zundark, Arvindn, Imran, Paul Ebermann, Michael Hardy, Dan Koehl, Vacilandois, Dcljr, CesarB, Ciphergoth, Feedmecereal, Charles Matthews, Ww, Amol kulkarni, Mrand, Taxman, Phil Boswell, Chuunen Baka, Robbot, Paranoid, As- tronautics~enwiki, Fredrik, Lowellian, Pingveno, Aetheling, Mattflaschen, Javidjamae, Giftlite, Lunkwill, DavidCary, ShaunMacPherson, Inkling, BenFrantzDale, Ianhowlett, Leonard G., Jorge Stolfi, Cloud200, Matt Crypto, Utcursch, CryptoDerk, Lightst, Antandrus, Tjwood, Anirvan, Imjustmatthew, Rich Farmbrough, FT2, ArnoldReinhold, YUL89YYZ, Samboy, Mykhal, Chalst, Kyz, Sietse Snel, Schneier, Bobo192, Myria, VBGFscJUn3, Davidgothberg, Boredzo, Quintus~enwiki, Sligocki, Ciphergoth2, Danhash, Pgimeno~enwiki, H2g2bob, BDD, MIT Trekkie, PseudonympH, Simetrical, CygnusPius, Mindmatrix, Apokrif, Jok2000, Mandarax, Alienus, Ej, SMC, AndyKali, Ruptor, Mathbot, Harmil, Maxal, Intgr, Fresheneesz, Wolfmankurd, Wigie, FrenchIsAwesome, CesarB’s unpriviledged account, Teddyb, Gaius Cornelius, Rsrikanth05, Bachrach44, Froth, Guruparan18, Dbfirs, Ott2, Analoguedragon, Appleseed, Finell, DaishiHarada, Smack- Bot, Mmernex, Tom Lougheed, Michaelfavor, Mdd4696, C4chandu, BiT, Ohnoitsjamie, Oli Filth, Nbarth, DHN-bot~enwiki, Colonies Chris, Zsinj, Kotra, Deeb, Fuzzypeg, Lambiam, Twotwotwo, Twredfish, Brian Gunderson, Oswald Glinkmeyer, Lee Carre, OnBeyondZe- brax, Paul Foxworthy, Fils du Soleil, MoleculeUpload, Jafet, Chris55, Mellery, CmdrObot, Jesse Viviano, Penbat, NormHardy, Cydebot, ST47, Optimist on the run, Bsmntbombdood, Bdragon, Jm3, N5iln, Strongriley, Dawnseeker2000, AntiVandalBot, Nipisiquit, JAnDbot, BenjaminGittins, Instinct, Jimbobl, Coolhandscot, Gavia immer, Extropian314, 
VoABot II, NoDepositNoReturn, Twsx, Firealwaysworks, David Eppstein, Vssun, WLU, Ratsbane, Gwern, JensAlfke, Maurice Carbonaro, Eliz81, Cpiral, Osndok, 83d40m, Robertgreer, SmallPota- toes, TreasuryTag, Sroc, TooTallSid, Oconnor663, Nxavar, Wordsmith, Jamelan, Enviroboy, Fltnsplr, AP61, Arjun024, SieBot, Tehsha, Caltas, JuanPechiar, ArchiSchmedes, Jasonsewall, Wahrmund, Bpeps, ClueBot, JWilk, Ggia, Arakunem, Avinava, CounterVandalism- Bot, Niceguyedc, DragonBot, Infomade, Cenarium, Leobold1, Erodium, Thinking Stone, DumZiBoT, Cmcqueen1975, Pierzz, Mitch Ames, SteveJothen, Addbot, Non-dropframe, Laurinavicius, Cube444, Leszek Jańczuk, Wikipedian314, Download, Maslen, Yobot, Mar- ioS, Wurfmaul, Doctorhook, SwisterTwister, AnomieBOT, DemocraticLuntz, Materialscientist, Are you ready for IPv6?, Xvsn, Clark89, Rabbler, Capricorn42, Oxwil, Marios.agathangelou, Sylletka, BrianWren, Daemorris, Amit 9b, Tsihonglau, Hymek, MerlLinkBot, Maxi- ,!Haeinous, Doremo, Blotowij, Jandalhandler, RobinK, Salvidrim ,.תומר א ,wheat, Bonev, FreeKnowledgeCreator, FrescoBot, Jsaenznoval -Lotje, ATBS, Wedgefish, January, Eatnumber1, Plfernandez, Whisky drinker, Patriot8790, Tracerneo, RistoLaanoja, Emaus ,קול ציון Bot, WikitanvirBot, AvicBot, ZéroBot, Quelrod, A930913, Erianna, Bomazi, ClueBot NG, MelbourneStar, Champloo11, Rezabot, Widr, Danwix, BG19bot, Lichtspiel, Garsd, ZipoBibrok5x10^8, Manoguru, RavelTwig, Luzmcosta, Darts123, Basisplan0815, JYBot, Curi- Connorr89, Epicgenius, Tentinator, Jianhui67, Musko47, TAKUMI YAMAWAKI, Claw of Slime, Maciej ,مونا بشيري ,ousMind01 Czyżewski, Maths314, Chouhartem, Tourorist, Onlinetvnet, TheExceptionCloaker, Axlesoft and Anonymous: 296 • Set (abstract data type) Source: https://en.wikipedia.org/wiki/Set_(abstract_data_type)?oldid=741086073 Contributors: Damian Yer- rick, William Avery, Mintguy, Patrick, Modster, TakuyaMurata, EdH, Mxn, Dcoetzee, Fredrik, Jorge Stolfi, Lvr, Urhixidur, Andreas Kaufmann, CanisRufus, Spoon!, RJFJR, Ruud Koot, Pfunk42, Bgwhite, Roboto de Ajvol, Mlc, Cedar101, QmunkE, Incnis Mrsi, Blue- bot, MartinPoulter, Nbarth, Gracenotes, Otus, Cybercobra, Dreadstar, Wizardman, MegaHasher, Hetar, Amniarix, CBM, Polaris408, Peterdjones, Hosamaly, Hut 8.5, Wikilolo, Lt basketball, Gwern, Raise exception, Fylwind, Davecrosby uk, BotKung, Rhanekom, SieBot, Oxymoron83, Casablanca2000in, Classicalecon, Linforest, Niceguyedc, UKoch, Quinntaylor, Addbot, SoSaysChappy, Loupeter, Legobot, Luckas-bot, Denispir, Pcap, AnomieBOT, Citation bot, Twri, DSisyphBot, GrouchoBot, FrescoBot, Spindocter123, Tyamath, EmausBot, Wikipelli, Elaz85, Mentibot, Nullzero, Helpful Pixie Bot, Poonam7393, Umasoni30, Vimalwatwani, Chmarkine, Irene31, Mark viking, FriendlyCaribou, Brandon.heck, Aristiden7o, Bender the Bot and Anonymous: 45 • Bit array Source: https://en.wikipedia.org/wiki/Bit_array?oldid=738371255 Contributors: Awaterl, Boud, Pnm, Dcoetzee, Furrykef, JesseW, AJim, Bovlb, Vadmium, Karol Langner, Sam Hocevar, Andreas Kaufmann, Notinasnaid, Paul August, CanisRufus, Spoon!, R. S. 
Shaw, Rgrig, Forderud, Jacobolus, Bluemoose, Qwertyus, Hack-Man, StuartBrady, Intgr, RussBot, Cedar101, TomJF, JLaTondre, Chris the speller, Bluebot, Doug Bell, Archimerged, DanielLemire, Glen Pepicelli, CRGreathouse, Gyopi, Neelix, Davnor, Kubanczyk, Izyt, Gwern, Themania, R'n'B, Sudleyplace, TheChrisD, Cobi, Pcordes, Bvds, RomainThibaux, Psychless, Skwa, Onomou, MystBot, Ad- dbot, IOLJeff, Tide rolls, Bluebusy, Peter Flass, AnomieBOT, Rubinbot, JnRouvignac, ZéroBot, Nomen4Omen, Cocciasik, ClueBot NG, Snotbot, Minakshinajardhane, Chmarkine, Chip123456, BattyBot, Mogism, Thajdog10, User85734, François Robere, Carlos R Castro G, Chadha.varun, Francisco Bajumuzi, Ushkin N and Anonymous: 53 • Bloom filter Source: https://en.wikipedia.org/wiki/Bloom_filter?oldid=741394302 Contributors: Damian Yerrick, The Anome, Edward, Michael Hardy, Pnm, Wwwwolf, Thebramp, Charles Matthews, Dcoetzee, Furrykef, Phil Boswell, Fredrik, Chocolateboy, Babbage, Alan Liefting, Giftlite, DavidCary, ShaunMacPherson, Rchandra, Macrakis, Neilc, EvilGrin, James A. Donald, Mahemoff, Two Bananas, Urhixidur, Andreas Kaufmann, Anders94, Subrabbit, Smyth, Agl~enwiki, CanisRufus, Susvolans, Giraffedata, Drangon, Terrycojones, Mbloore, Yinotaurus, Dzhim, GiovanniS, Galaxiaad, Mindmatrix, Shreevatsa, RzR~enwiki, Tabletop, Payrard, Ryan Reich, Pfunk42, Qw- ertyus, Ses4j, Rjwilmsi, Sdornan, Brighterorange, Vsriram, Quuxplusone, Chobot, Wavelength, Argav, Taejo, CesarB’s unpriviledged ac- count, Msikma, E123, Dtrebbien, Wirthi, Cconnett, Cedar101, HereToHelp, Rubicantoto, Sbassi, Daivox, SmackBot, Stev0, MalafayaBot, Nbarth, Cybercobra, Xiphoris, Wikidrone, Drae, Galaad2, Jeremy Banks, Shakeelmahate, Requestion, Krauss, Farzaneh, Hilgerdenaar, Lindsay658, Hanche, Headbomb, NavenduJain, QuiteUnusual, Marokwitz, Labongo, Bblfish, Igodard, ARSHA, Alexmadon, David Epp- stein, STBot, Flexdream, Willpwillp, Osndok, Coolg49964, Jjldj, VolkovBot, Ferzkopp, LokiClock, Trachten, Rlaufer, SieBot, Emorrissey, 242 CHAPTER 8. TEXT AND IMAGE SOURCES, CONTRIBUTORS, AND LICENSES

Sswamida, Nhahmada, Abbasgadhia, Svick, Justin W Smith, Gtoal, HowardBGolden, Rhubbarb, Quanstro, Pointillist, Shabbychef, Ben- der2k14, Sun Creator, AndreasBWagner, Sharma337, Dsimic, SteveJothen, Addbot, Mortense, Jerz4835, FrankAndProust, MrOllie, Light- bot, Legobot, Russianspy3, Luckas-bot, Yobot, Ptbotgourou, Amirobot, Gharb, AnomieBOT, Materialscientist, Citation bot, Naufraghi, Krj373, Osloom, X7q, Citation bot 1, Chenopodiaceous, HRoestBot, Jonesey95, Kronos04, Trappist the monk, Chronulator, Mavam, Bud- deyp, RjwilmsiBot, Liorithiel, Lesshaste, John of Reading, Drafiei, GoingBatty, HiW-Bot, ZéroBot, Meng6, AManWithNoPlan, Ashish goel public, Jar354, Mikhail Ryazanov, ClueBot NG, Gareth Griffith-Jones, Bpodgursky, Rezabot, Helpful Pixie Bot, BG19bot, Divine- Traube, ErikDubbelboer, Solomon7968, Exercisephys, Chmarkine, Williamdemeo, Akryzhn, Faizan, Lsmll, Everymorning, BloomFil- terEditor, OriRottenstreich, Monkbot, Reddishmariposa, Queelius, Epournaras, Ushkin N, Satokoala, Bender the Bot and Anonymous: 194 • MinHash Source: https://en.wikipedia.org/wiki/MinHash?oldid=723852847 Contributors: AxelBoldt, Kku, Qwertyus, Rjwilmsi, Gareth Jones, Johndburger, Cedar101, Ma8thew, Ebrahim, David Eppstein, SchreiberBike, XLinkBot, Yobot, Citation bot, JonDePlume, Foo- barnix, Trappist the monk, EmausBot, Chire, Chirag101192, Frietjes, Leopd, RWMajeed, Xmutangzk, Linuxjava, NickGrattan, Srednuas Lenoroc, ElizaLepine and Anonymous: 25 • Disjoint-set data structure Source: https://en.wikipedia.org/wiki/Disjoint-set_data_structure?oldid=738969800 Contributors: The Anome, Michael Hardy, Dominus, LittleDan, Charles Matthews, Dcoetzee, Grendelkhan, Pakaran, Giftlite, Pgan002, Jonel, Deewiant, Andreas Kaufmann, Qutezuce, SamRushing, Nyenyec, Beige Tangerine, Msh210, Bigaln2, ReyBrujo, LOL, Bkkbrad, Ruud Koot, Qw- ertyus, Kasei-jin~enwiki, Rjwilmsi, Salix alba, Intgr, Fresheneesz, Wavelength, Sceptre, NawlinWiki, Spike Wilbury, Kevtrice, Spirko, Ripper234, Cedar101, Tevildo, SmackBot, Isaac Dupree, Oli Filth, Nikaustr, Lambiam, Archimerged, SpyMagician, IanLiu, Dr Greg, Superjoe30, Edward Vielmetti, Gfonsecabr, Headbomb, Kenahoo, Stellmach, David Eppstein, Chkno, Glrx, Rbrewer42, Kyle the bot, Oshwah, Jamelan, Oaf2, MasterAchilles, Boydski, Alksentrs, Adrianwn, Vanisheduser12a67, DumZiBoT, XLinkBot, Cldoyle, Dekart, Addbot, Shmilymann, Lightbot, Chipchap, Tonycao, Yobot, Erel Segal, Rubinbot, Sz-iwbot, Citation bot, Fantasticfears, Backpackadam, Williacb, HRoestBot, MathijsM, Akim Demaille, Rednas1234, EmausBot, Zhouji2010, ZéroBot, Wmayner, ChuispastonBot, Mankarse, Nullzero, Aleskotnik, Andreschulz, FutureTrillionaire, Qunwangcs157, Andyhowlett, Faizan, Simonemainardi, William Di Luigi, Kimi91, Sharma.illusion, Kbhat95, Kennysong, R.J.C.vanHaaften, Ahg simon, Michaelovertolli and Anonymous: 80 • Partition refinement Source: https://en.wikipedia.org/wiki/Partition_refinement?oldid=740678932 Contributors: Tea2min, Linas, Qw- ertyus, Matt Cook, Chris the speller, Headbomb, David Eppstein, Watchduck, Noamz, RjwilmsiBot, Xsoameix, David N. 
Jansen and Anonymous: 1 • Priority queue Source: https://en.wikipedia.org/wiki/Priority_queue?oldid=743559387 Contributors: Frecklefoot, Michael Hardy, Nix- dorf, Bdonlan, Strebe, Dcoetzee, Sanxiyn, Robbot, Fredrik, Kowey, Bkell, Tea2min, Decrypt3, Giftlite, Zigger, Vadmium, Andreas Kauf- mann, Byrial, BACbKA, El C, Spoon!, Bobo192, Nyenyec, Dbeardsl, Jeltz, Mbloore, Forderud, RyanGerbil10, Kenyon, Woohookitty, Oliphaunt, Ruud Koot, Hdante, Pete142, Graham87, Qwertyus, Pdelong, Ckelloug, Vegaswikian, StuartBrady, Jeff02, Spl, Anders.Warga, Stephenb, Gareth Jones, Lt-wiki-bot, PaulWright, SmackBot, Emeraldemon, Stux, Gilliam, Riedl, Oli Filth, Silly rabbit, Nbarth, Kostmo, Zvar, Calbaer, Cybercobra, BlackFingolfin, A5b, Clicketyclack, Ninjagecko, Robbins, Rory O'Kane, Sabik, John Reed Riley, ShelfSkewed, Chrisahn, Corpx, Omicronpersei8, Thijs!bot, LeeG, Mentifisto, AntiVandalBot, Wayiran, CosineKitty, Ilingod, VoABot II, David Eppstein, Jutiphan, Umpteee, Squids and Chips, TXiKiBoT, Coder Dan, Red Act, RHaden, Rhanekom, SieBot, ThomasTenCate, EnOreg, Volkan YAZICI, ClueBot, Niceguyedc, Thejoshwolfe, SchreiberBike, BOTarate, Krungie factor, DumZiBoT, XLinkBot, Ghettoblaster, Vield, Jncraton, Lightbot, Legobot, Yobot, FUZxxl, Bestiasonica, AnomieBOT, 1exec1, Kimsey0, Xqbot, Redroof, Thore Husfeldt, FrescoBot, Hobsonlane, Itusg15q4user, Arthur MILCHIOR, Orangeroof, ElNuevoEinstein, HenryAyoola, EmausBot, LastKingpin, Moswento, Arken- flame, Meng6, GabKBel, ChuispastonBot, Highway Hitchhiker, ClueBot NG, Carandraug, Ztothefifth, Widr, FutureTrillionaire, Happyuk, Chmarkine, J.C. Labbrev, Dexbot, Kushalbiswas777, MeekMelange, Lone boatman, Sriharsh1234, Theemathas, Dough34, Mydog333, Luckysud4, Sammydre, Bladeshade2, Mtnorthpoplar, Kdhanas, GreenC bot, Bender the Bot and Anonymous: 149 • Bucket queue Source: https://en.wikipedia.org/wiki/Bucket_queue?oldid=728330531 Contributors: David Eppstein • Heap (data structure) Source: https://en.wikipedia.org/wiki/Heap_(data_structure)?oldid=741560251 Contributors: Derek Ross, LC~enwiki, Christian List, Boleslav Bobcik, DrBob, B4hand, Frecklefoot, Paddu, Jimfbleak, Notheruser, Kragen, Jll, Aragorn2, Charles Matthews, Timwi, Dcoetzee, Dfeuer, Dysprosia, Doradus, Jogloran, Shizhao, Cannona, Robbot, Noldoaran, Fredrik, Sbisolo, Vikingstad, Giftlite, DavidCary, Wolfkeeper, Mellum, Tristanreid, Pgan002, Beland, Two Bananas, Pinguin.tk~enwiki, Andreas Kaufmann, Abdull, Oskar Sigvardsson, Wiesmann, Yuval madar, Qutezuce, Tristan Schmelcher, Ascánder, Mwm126, Iron Wallaby, Spoon!, Mdd, Musiphil, Guy Harris, Sligocki, Suruena, Derbeth, Wsloand, Oleg Alexandrov, Mahanga, Mindmatrix, LOL, Prophile, Daira Hopwood, Ruud Koot, Apokrif, Tom W.M., Graham87, Qwertyus, Drpaule, Psyphen, Mathbot, Quuxplusone, Krun, Fresheneesz, Chobot, YurikBot, Wave- length, RobotE, Vecter, NawlinWiki, DarkPhoenix, B7j0c, Moe Epsilon, Mlouns, LeoNerd, Bota47, Schellhammer, Lt-wiki-bot, Abu adam~enwiki, Ketil3, HereToHelp, Daivox, SmackBot, Reedy, Tgdwyer, Eskimbot, Took, Thumperward, Oli Filth, Silly rabbit, Nbarth, Ilyathemuromets, Jmnbatista, Cybercobra, Mlpkr, Prasi90, Itmozart, Atkinson 291, Ninjagecko, SabbeRubbish, Loadmaster, Hiiiiiiiiiiiii- iiiiiiii, Jurohi, Jafet, Ahy1, Eric Le Bigot, Flamholz, Cydebot, Max sang, Christian75, Grubbiv, Thijs!bot, OverLeg, Ablonus, Anka.213, BMB, Plaga701, Jirka6, Magioladitis, 28421u2232nfenfcenc, David Eppstein, Inhumandecency, Kibiru, Bradgib, Andre.holzner, Jfroelich, Public Menace, Cobi, STBotD, Cool 1 love, VolkovBot, JhsBot, 
Wingedsubmariner, Billinghurst, Rhanekom, Quietbritishjim, SieBot, Ham Pastrami, Flyer22 Reborn, Svick, Jonlandrum, Ken123BOT, AncientPC, VanishedUser sdu9aya9fs787sads, ClueBot, Garyzx, Uncle Milty, Bender2k14, Kukolar, Xcez-be, Addbot, Psyced, Nate Wessel, Chzz, Jasper Deng, Numbo3-bot, Konryd, Chipchap, Bluebusy, Luckas-bot, Timeroot, KamikazeBot, DavidHarkness, AnomieBOT, Alwar.sumit, Jim1138, Burakov, ArthurBot, DannyAsher, Xqbot, Control.valve, GrouchoBot, Лев Дубовой, Mcmlxxxi, Kxx, C7protal, Mark Renier, Wikitamino, Sae1962, Gruntler, AaronEmi, ImPerfection, Patmorin, CobraBot, Akim Demaille, Stryder29, RjwilmsiBot, EmausBot, John of Reading, Tuankiet65, WikitanvirBot, Sergio91pt, Hari6389, Er- mishin, Jaseemabid, Chris857, ClueBot NG, Manizzle23, Incompetence, Softsundude, Joel B. Lewis, Samuel Marks, Mediator Scien- tiae, BG19bot, Racerdogjack, Chmarkine, Hadi Payami, PatheticCopyEditor, Hupili, ChrisGualtieri, Rarkenin, Frosty, DJB3.14, Clevera, FenixFeather, P.t-the.g, Theemathas, Sunny1304, Tim.sebring, Ginsuloft, Azx0987, Chaticramos, Evohunz, Nbro, Sequoia 42, Ougarcia, Danmejia1, CLCStudent and Anonymous: 200 • Binary heap Source: https://en.wikipedia.org/wiki/Binary_heap?oldid=743537755 Contributors: Derek Ross, Taw, Shd~enwiki, B4hand, Pit~enwiki, Nixdorf, Snoyes, Notheruser, Kragen, Kyokpae~enwiki, Dcoetzee, Dfeuer, Dysprosia, Kbk, Espertus, Fredrik, Altenmann, DHN, Vikingstad, Tea2min, DavidCary, Laurens~enwiki, Levin, Alexf, Bryanlharris, Sam Hocevar, Andreas Kaufmann, Rich Farm- brough, Sladen, Hydrox, Antaeus Feldspar, CanisRufus, Iron Wallaby, Liao, Wsloand, Bsdlogical, Kenyon, Oleg Alexandrov, Mahanga, LOL, Ruud Koot, Qwertyus, Pdelong, Brighterorange, Drpaule, Platyk, VKokielov, Fresheneesz, Mdouze, Tofergregg, CiaPan, Daev, MonoNexo, Htonl, Schellhammer, HereToHelp, Ilmari Karonen, DomQ, Theone256, Oli Filth, Nbarth, Matt77, Cybercobra, Djcmackay, Danielcer, Ohconfucius, Doug Bell, J Crow, Catphive, Dicklyon, Inquisitus, Hu12, Velle~enwiki, Cydebot, Codetiger, Headbomb, WinBot, 8.1. TEXT 243

Kba, Alfchung~enwiki, JAnDbot, MSBOT, R27182818, Magioladitis, Seshu pv, Jessicapierce, Japo, David Eppstein, Scott tucker, Pgn674, Applegrew, Foober, Phishman3579, Funandtrvl, Rozmichelle, Vektor330, Tdeoras, Nuttycoconut, Lourakis, Ctxppc, Cpflames, Anchor Link Bot, ClueBot, Miquelmartin, Jaded-view, Kukolar, Amossin, XLinkBot, Addbot, Bluebusy, Luckas-bot, Yobot, Amirobot, David- shen84, AnomieBOT, DemocraticLuntz, Jim1138, Baliame, Xqbot, Surturpain, Smk65536, GrouchoBot, Speakus, Okras, FrescoBot, Trappist the monk, Indy256, Patmorin, Duoduoduo, Loftpo, Tim-J.Swan, Superlaza, Racerx11, Dcirovic, Chris857, EdoBot, Dakaminski, Rezabot, Ciro.santilli, O12, Helpful Pixie Bot, BG19bot, Crocodilesareforwimps, Chmarkine, MiquelMartin, IgushevEdward, Harsh 2580, Drjackstraw, 22990atinesh, Msproul, Billyisyoung, Cbcomp, Aswincweety, Errohitagg, Nbro, Missingdays, Stevenxiaoxiong, Dilettantest, Wattitude, PhilipWelch and Anonymous: 173 • D-ary heap Source: https://en.wikipedia.org/wiki/D-ary_heap?oldid=728415272 Contributors: Derek Ross, Greenrd, Phil Boswell, Rich Farmbrough, Qwertyus, Fresheneesz, SmackBot, Shalom Yechiel, Cydebot, Alaibot, David Eppstein, Skier Dude, Slemm, M2Ys4U, LeaW, Addbot, DOI bot, Yobot, Miyagawa, Citation bot 1, JanniePieters, DrilBot, Dude1818, RjwilmsiBot, ChuispastonBot, Helpful Pixie Bot, Fragapanagos, Angelababy00 and Anonymous: 18 • Binomial heap Source: https://en.wikipedia.org/wiki/Binomial_heap?oldid=740862033 Contributors: Michael Hardy, Poor Yorick, Dcoetzee, Dysprosia, Doradus, Maximus Rex, Cdang, Fredrik, Brona, MarkSweep, TonyW, Creidieki, Klemen Kocjancic, Martin TB, Lemontea, Bo Lindbergh, Karlheg, Arthena, Wsloand, Oleg Alexandrov, LOL, Qwertyus, NeonMerlin, Fragglet, Fresheneesz, CiaPan, YurikBot, Hairy Dude, Vecter, Googl, SmackBot, Theone256, Peterwhy, Yuide, Nviladkar, Stebulus, Cydebot, Marqueed, Thijs!bot, Magioladitis, Matt.smart, Gwern, Funandtrvl, VolkovBot, Wingedsubmariner, Biscuittin, YonaBot, Volkan YAZICI, OOo.Rax, Alexbot, Aham1234, Materialscientist, Vmanor, DARTH ,ماني ,Npansare, Addbot, Alquantor, Alex.mccarthy, Download, Sapeur, LinkFA-Bot SIDIOUS 2, Josve05a, Templatetypedef, ClueBot NG, BG19bot, Dexbot, Mark L MacDonald, Boza s6, Oleksandr Shturmov and Anony- mous: 63 • Fibonacci heap Source: https://en.wikipedia.org/wiki/Fibonacci_heap?oldid=731467315 Contributors: Michael Hardy, Zeno Gantner, Poor Yorick, Charles Matthews, Dcoetzee, Dysprosia, Wik, Hao2lian, Phil Boswell, Fredrik, Eliashedberg, P0nc, Brona, Creidieki, Qutezuce, Bender235, Aquiel~enwiki, Mkorpela, Wsloand, Oleg Alexandrov, Japanese Searobin, LOL, Ruud Koot, Rjwilmsi, Ravik, Fresh- eneesz, Antiuser, YurikBot, SmackBot, Arkitus, Droll, MrBananaGrabber, Ninjagecko, Jrouquie, Hiiiiiiiiiiiiiiiiiiiii, Vanisaac, Myasuda, AnnedeKoning, Cydebot, Gimmetrow, Headbomb, DekuDekuplex, Jirka6, JAnDbot, David Eppstein, The Real Marauder, DerHexer, Adam Zivner, Yecril, Funandtrvl, Aaron Rotenberg, Wingedsubmariner, Wbrenna36, Crashie, Bporopat, Arjun024, Thw1309, Clue- Bot, Gene91, Mild Bill Hiccup, Nanobear~enwiki, RobinMessage, Peatar, Kaba3, Safenner1, Addbot, LatitudeBot, Mdk wiki~enwiki, Luckas-bot, Yobot, Vonehrenheim, AnomieBOT, Erel Segal, Citation bot, Miym, Kxx, Novamo, Arthur MILCHIOR, MorganGreen, Pinethicket, Lars Washington, Ereiniona, EmausBot, Coliso, Wikipelli, Trimutius, Lexusuns, Templatetypedef, ClueBot NG, Softsundude, O.Koslowski, BG19bot, PatheticCopyEditor, ChrisGualtieri, Martin.carames, Dexbot, Jochen Burghardt, Faizan, Theemathas, Nvmbs, Oleksandr 
Shturmov, Mtnorthpoplar, Aayushdhir and Anonymous: 107


8.3 Content license

• Creative Commons Attribution-Share Alike 3.0