Design and Analysis of Data Structures
Contents

1 Introduction and Review
  1.1 Welcome!
  1.2 Tick Tock, Tick Tock
  1.3 Classes of Computational Complexity
  1.4 The Fuss of C++
  1.5 Random Numbers
  1.6 Bit-by-Bit
  1.7 The Terminal-ator
  1.8 Git, the "Undo" Button of Software Development

2 Introductory Data Structures
  2.1 Array Lists
  2.2 Linked Lists
  2.3 Skip Lists
  2.4 Circular Arrays
  2.5 Abstract Data Types
  2.6 Deques
  2.7 Queues
  2.8 Stacks
  2.9 And the Iterators Gonna Iterate-ate-ate

3 Tree Structures
  3.1 Lost in a Forest of Trees
  3.2 Heaps
  3.3 Binary Search Trees
  3.4 BST Average-Case Time Complexity
  3.5 Randomized Search Trees
  3.6 AVL Trees
  3.7 Red-Black Trees
  3.8 B-Trees
  3.9 B+ Trees

4 Introduction to Graphs
  4.1 Introduction to Graphs
  4.2 Graph Representations
  4.3 Algorithms on Graphs: Breadth First Search
  4.4 Algorithms on Graphs: Depth First Search
  4.5 Dijkstra's Algorithm
  4.6 Minimum Spanning Trees
  4.7 Disjoint Sets

5 Hashing
  5.1 The Unquenched Need for Speed
  5.2 Hash Functions
  5.3 Introduction to Hash Tables
  5.4 Probability of Collisions
  5.5 Collision Resolution: Open Addressing
  5.6 Collision Resolution: Closed Addressing
  5.7 Collision Resolution: Cuckoo Hashing
  5.8 Hash Maps

6 Implementing a Lexicon
  6.1 Creating a Lexicon
  6.2 Using Linked Lists
  6.3 Using Arrays
  6.4 Using Binary Search Trees
  6.5 Using Hash Tables and Hash Maps
  6.6 Using Multiway Tries
  6.7 Using Ternary Search Trees

7 Coding and Compression
  7.1 Return of the (Coding) Trees
  7.2 Entropy and Information Theory
  7.3 Honey, I Shrunk the File
  7.4 Bitwise I/O

8 Summary
  8.1 Array List (Dynamic Array)
    8.1.1 Unsorted Array List
    8.1.2 Sorted Array List
  8.2 Linked List
  8.3 Skip List
  8.4 Heap
  8.5 Binary Search Tree
    8.5.1 Regular Binary Search Tree
    8.5.2 Randomized Search Tree
    8.5.3 AVL Tree
    8.5.4 Red-Black Tree
  8.6 B-Tree
  8.7 B+ Tree
  8.8 Hashing
    8.8.1 Linear Probing
    8.8.2 Double Hashing
    8.8.3 Random Hashing
    8.8.4 Separate Chaining
    8.8.5 Cuckoo Hashing
  8.9 Multiway Trie
  8.10 Ternary Search Tree
  8.11 Disjoint Set (Up-Tree)
  8.12 Graph Representation
    8.12.1 Adjacency Matrix
    8.12.2 Adjacency List
  8.13 Graph Traversal
    8.13.1 Breadth First Search (BFS)
    8.13.2 Depth First Search (DFS)
    8.13.3 Dijkstra's Algorithm
  8.14 Minimum Spanning Tree
    8.14.1 Prim's Algorithm
    8.14.2 Kruskal's Algorithm

Chapter 1: Introduction and Review

1.1 Welcome!

This is a textbook companion to the Massive Open Online Course (MOOC), Data Structures: An Active Learning Approach, which has various activities embedded throughout to help stimulate your learning and improve your understanding of the materials we will cover. While this textbook contains all STOP and Think questions, which will help you reflect on the material, and all Exercise Breaks, which will test your knowledge and understanding of the concepts discussed, we recommend utilizing the MOOC for all Code Challenges, which will allow you to actually implement some of the algorithms we will cover.

About the Authors

Niema Moshiri (B.S.'15 UC San Diego) is a Ph.D.
candidate in the Bioinformatics and Systems Biology (BISB) program at the University of California, San Diego. His educational background is in Bioinformatics, which spans Computer Science (with a focus on Algorithm Design) and Biology (with a focus on Molecular Biology). He is co-advised by Siavash Mirarab and Pavel Pevzner, with whom he performs research in Computational Biology with a focus on HIV phylogenetics and epidemiology.

Liz Izhikevich (B.S.'17 UC San Diego, M.S.'18 UC San Diego) is an NSF and Stanford Graduate Fellow in the Computer Science Department at Stanford University. She has spent years working with and developing the content of multiple advanced data structures courses. Her research interests lie at the intersection of serverless distributed systems and security.

Acknowledgements

We would like to thank Christine Alvarado for providing the motivation for the creation of this textbook by incorporating it into her Advanced Data Structures course at UC San Diego, as well as for reviewing the content and providing useful feedback for revisions. We would also like to thank Paul Kube, whose detailed lecture slides served as a great reference for many of the topics discussed (especially the lesson walking through the proof for the Binary Search Tree average-case time complexity). Lastly, we would like to thank Phillip Compeau and Pavel Pevzner, whose textbook (Bioinformatics Algorithms) served as inspiration for writing style and instructional approach.

1.2 Tick Tock, Tick Tock

During World War II, Cambridge mathematics alumnus Alan Turing, the "Father of Computer Science," was recruited by British intelligence to crack the German naval Enigma, a rotor cipher machine that encrypted messages in a way that was previously impossible to decrypt by the Allied forces. In 2014, a film titled The Imitation Game was released to pay homage to Turing and his critical work in the field of cryptography during World War II. In the film, Turing's leader, Commander Denniston, angrily critiques the Enigma-decrypting machine for not working, to which Turing replies: "It works... It was just... Still working."

In the world of mathematics, a solution is typically either correct or incorrect, and all correct solutions are typically equally "good." In the world of algorithms, however, this is certainly not the case: even if two algorithms produce identical solutions for a given problem, there is the possibility that one algorithm is "better" than the other. In the aforementioned film, the initial version of Turing's Enigma-cracking machine would indeed produce the correct solution if given enough time to run, but because the Nazis would change the encryption cipher every 24 hours, the machine would only be useful if it were able to complete the decryption process far quicker than this. However, at the end of the movie, Turing updates the machine to omit possible solutions based on knowledge of the unchanging greeting appearing at the end of each message. After this update, the machine is able to successfully find a solution fast enough to be useful to the British intelligence agency.

What Turing experienced can be described by time complexity, a way of quantifying the execution time of an algorithm as a function of its input size by effectively counting the number of elementary operations performed by the algorithm. Essentially, the time complexity of a given algorithm tells us roughly how fast the algorithm performs as the input size grows.
There are three main notations to describe time complexity: Big-O ("Big-Oh"), Big-Ω ("Big-Omega"), and Big-Θ ("Big-Theta"). Big-O notation provides an upper bound on the number of operations performed, Big-Ω provides a lower bound, and Big-Θ provides both an upper bound and a lower bound. In other words, Big-Θ implies both Big-O and Big-Ω. For our purposes, we will only consider Big-O notation because, in general, we only care about upper bounds on our algorithms.

Let f(n) and g(n) be two functions defined for the real numbers. One would write f(n) = O(g(n)) if and only if there exists some constant c and some real n0 such that the absolute value of f(n) is at most c multiplied by the absolute value of g(n) for all values of n greater than or equal to n0. In other words, such that |f(n)| ≤ c|g(n)| for all n ≥ n0.

Also, note that, for all three notations, we drop all lower-order terms when we write out the complexity. Mathematically, say we have two functions g(n) and h(n), where |g(n)| ≥ |h(n)| for all n ≥ n0. If we have a function f(n) that is O(g(n) + h(n)), we would just write that f(n) is O(g(n)) to simplify. Note that we only care about time complexities with regard to large values of n, so when you think of algorithm performance in terms of any of these three notations, you should really only think about how the algorithm scales as n approaches infinity.

For example, say we have a list of n numbers and, for each number, we want to print out the number and its opposite. Assuming the "print" operation is a single operation and the "opposite" operation is also a single operation, we will be performing 3 operations for each number (print the number itself, compute its opposite, and print the opposite).
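To make the counting concrete, here is a minimal C++ sketch of this example (C++ being the language this book uses, per Section 1.4). The function name printWithOpposites and the use of std::vector are our own illustrative choices, not code from the text. The loop performs 3 operations per number, or 3n operations in total; taking f(n) = 3n and g(n) = n, we have |f(n)| ≤ 3|g(n)| for all n ≥ 1, so c = 3 and n0 = 1 satisfy the definition above, and the algorithm is O(n).

    #include <iostream>
    #include <vector>

    // Prints each number and its opposite. Each iteration performs
    // 3 elementary operations (print, negate, print), so a list of
    // n numbers costs 3n operations total, which is O(n).
    void printWithOpposites(const std::vector<int>& numbers) {
        for (int x : numbers) {
            std::cout << x << ' ';          // operation 1: print the number
            int opposite = -x;              // operation 2: compute its opposite
            std::cout << opposite << '\n';  // operation 3: print the opposite
        }
    }

    int main() {
        printWithOpposites({1, -2, 3});     // prints: 1 -1, -2 2, 3 -3
        return 0;
    }

Doubling the length of the list doubles the number of operations performed, which is exactly the linear scaling that the O(n) classification captures.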