CS240E: Data Structures and Data Management



Helena S. Ven
08 Jan. 2019

Class: T, Th at 0830 - 0950
Instructors: Therese Biedl
Office: DC 2341, (1000 - 1100)
Topics:
Tutorial:
Midterm: The class on the date of the midterm (28 Feb.): Tries, Hash for strings, Compressed tries. Class starts at 0910 instead of 0830.
Final: Arithmetic compression, Cache oblivious trees, and External sorting are not on the final. No deletion in dictionaries (lazy deletion is fine).

By convention, log is base 2 unless stated otherwise.
The distribution of this document is prohibited unless given permission from the author and Prof. Biedl.

Index
1 Runtime and Asymptotic bounds
  1.1 Objective of this course
    1.1.1 Computational Problems
  1.2 Asymptotic Analysis
  1.3 Analysis of Algorithms
  1.4 Runtime of Randomised Algorithms
  1.5 Potential Method of Amortised Analysis
2 Comparison-Based Data Structures
  2.1 Array-based Data types
  2.2 ADT: Priority Queues
  2.3 Heap
    2.3.1 Operations in heaps
    2.3.2 Improvements of the Heap
  2.4 Heap Merging
    2.4.1 Method 1, Deterministic
    2.4.2 Method 2, Randomised
    2.4.3 Method 3, Modified Heap
    2.4.4 Almost-heaps
  2.5 ADT: Dictionaries
  2.6 Binary Search Trees
  2.7 Tree-based Implementations of the Dictionary
    2.7.1 Treaps
    2.7.2 AVL Trees
    2.7.3 Scapegoat Tree
  2.8 Skip Lists
  2.9 Dictionary with Biased Search Requests
    2.9.1 MTF Array and Transpose Array
  2.10 Splay Trees
3 Hashing and Spatial Data
  3.1 Hash Tables
    3.1.1 Probe Hashing
    3.1.2 Double Hashing
    3.1.3 Cuckoo Hashing
    3.1.4 Complexity of Probing Methods
    3.1.5 Universal Hashing
  3.2 Hash of Multi-dimensional data
  3.3 Tries
    3.3.1 Variation of Trie: No leaves
    3.3.2 Variation of Trie: Compressed labels
    3.3.3 Variation of Trie: Allow Prefixes
    3.3.4 Compressed Tries
  3.4 ADT: Dictionary with Range Search
  3.5 Spatial Data Structures
  3.6 Quad-Trees
  3.7 KD-Tree
  3.8 Range Tree
    3.8.1 Problem of Duplicates and Generalisations
    3.8.2 3-Sided Ranged Queries
4 Sorting and Searching Algorithms
  4.1 Problem: Selection and Sorting
    4.1.1 The Lower Bound of Comparison Sorting
  4.2 Quick-Select
    4.2.1 Randomised Pivoting
  4.3 Partitioning
  4.4 Quicksort
    4.4.1 Choice of the Pivot
  4.5 Sorting Integers
    4.5.1 Bucket Sort
    4.5.2 Radix Sort
  4.6 Problem: Search
  4.7 Interpolation Search
5 String Algorithms
  5.1 Problem: Pattern Matching
  5.2 Pattern Pre-processing
    5.2.1 Karp-Rabin Fingerprint Algorithm
    5.2.2 Boyer-Moore Algorithm
    5.2.3 Finite Automaton and Knuth-Morris-Pratt Method
  5.3 Text Pre-processing
    5.3.1 Trie of Suffixes, Suffix Trees
    5.3.2 Suffix Array
  5.4 Comparison of Pattern Matching Algorithms
  5.5 Problem: Compression
    5.5.1 Prefix-Free Encoding
  5.6 Huffman Tree
    5.6.1 Huffman Tree with Different Base
  5.7 Run-Length Encoding
  5.8 Lempel-Ziv-Welch
    5.8.1 Decoding Lempel-Ziv-Welch
  5.9 BZip2
    5.9.1 Burrows-Wheeler Transform
    5.9.2 Move-to-front Transform
  5.10 Arithmetic Compression
  5.11 Comparison of Compression Algorithms
  ...
Recommended publications
  • Compressed Suffix Trees with Full Functionality
    Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka 812-8581, Japan

    Abstract
    We introduce new data structures for compressed suffix trees whose size is linear in the text size. The size is measured in bits; thus they occupy only O(n log |A|) bits for a text of length n on an alphabet A. This is a remarkable improvement on current suffix trees, which require O(n log n) bits. Though some components of suffix trees have been compressed, there is no linear-size data structure for suffix trees with full functionality such as computing suffix links, string-depths and lowest common ancestors. The data structure proposed in this paper is the first one that has linear size and supports all operations efficiently. Any algorithm running on a suffix tree can also be executed on our compressed suffix trees with a slight slowdown of a factor of polylog(n).

    1 Introduction
    Suffix trees are basic data structures for string algorithms [13]. A pattern can be found in time proportional to the pattern length from a text by constructing the suffix tree of the text in advance. The suffix tree can also be used for more complicated problems, for example finding the longest repeated substring in linear time. Many efficient string algorithms are based on the use of suffix trees because this does not increase the asymptotic time complexity. A suffix tree of a string can be constructed in linear time in the string length [28, 21, 27, 5].
  • Interval Trees Storing and Searching Intervals
    Storing and Searching Intervals
    • Instead of points, suppose you want to keep track of axis-aligned segments.
    • Range queries: return all segments that have any part of them inside the rectangle.
    • Motivation: wiring diagrams, genes on genomes.

    Simpler Problem: 1-d intervals
    • Segments with at least one endpoint in the rectangle can be found by building a 2d range tree on the 2n endpoints.
      - Keep pointer from each endpoint stored in tree to the segments
      - Mark segments as you output them, so that you don't output contained segments twice.
    • Segments with no endpoints in range are the harder part.
      - Consider just horizontal segments
      - They must cross a vertical side of the region
      - Leads to subproblem: Given a vertical line, find segments that it crosses.
      - (y-coords become irrelevant for this subproblem)

    Interval Trees
    [Figure: a query line crossing a set of intervals]
    Recursively build tree on interval set S as follows:
    • Sort the 2n endpoints
    • Let xmid be the median point
    • Store intervals that cross xmid in node N
    • Store intervals that are completely to the left of xmid in Nleft, and intervals that are completely to the right of xmid in Nright
    [Figure: another view of interval trees]

    Interval Trees, continued
    • Will be approximately balanced because by choosing the median, we split the set of end points up in half each time
      - Depth is O(log n)
    • Have to store xmid with each node
    • Uses O(n) storage
      - each interval stored once, plus
      - fewer than n nodes (each node contains at least one interval)
    • Can be built in O(n log n) time.
    • Can be searched in O(log n + k) time [k = #
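The recursive construction and the vertical-line query described in this excerpt can be sketched as follows. This is a minimal illustration in Python, not the slides' own code: the names `IntervalNode`, `by_left`, `by_right` and `stab` are assumptions introduced here, intervals are taken to be closed `(lo, hi)` tuples, and each node keeps its crossing intervals in two sorted orders, which is one common way to reach the stated O(log n + k) query time on a balanced tree.

```python
class IntervalNode:
    """One node of an interval tree over closed intervals (lo, hi)."""

    def __init__(self, intervals):
        # Sort the 2n endpoints and take the median as xmid.
        endpoints = sorted(p for iv in intervals for p in iv)
        self.xmid = endpoints[len(endpoints) // 2]
        # Intervals that cross xmid stay in this node. xmid is an endpoint
        # of some interval, so this list is never empty.
        crossing = [iv for iv in intervals if iv[0] <= self.xmid <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.xmid]
        right = [iv for iv in intervals if iv[0] > self.xmid]
        # Two sorted copies let a query stop scanning at the first miss.
        self.by_left = sorted(crossing)
        self.by_right = sorted(crossing, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None


def stab(node, q):
    """Return all stored intervals that contain the query coordinate q."""
    out = []
    while node is not None:
        if q < node.xmid:
            # Crossing intervals end at or after xmid > q, so only the
            # left endpoint has to be checked.
            for iv in node.by_left:
                if iv[0] > q:
                    break
                out.append(iv)
            node = node.left
        else:
            # Crossing intervals start at or before xmid <= q, so only the
            # right endpoint has to be checked.
            for iv in node.by_right:
                if iv[1] < q:
                    break
                out.append(iv)
            node = node.right
    return out


tree = IntervalNode([(1, 5), (2, 3), (4, 8), (6, 7), (9, 10)])
```

Each node visited costs time proportional to the intervals it reports plus one extra comparison, and only one root-to-leaf path is followed.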
  • KP-Trie Algorithm for Update and Search Operations
    The International Arab Journal of Information Technology, Vol. 13, No. 6, November 2016 722

    Feras Hanandeh (1), Izzat Alsmadi (2), Mohammed Akour (3), and Essam Al Daoud (4)
    (1) Department of Computer Information Systems, Hashemite University, Jordan
    (2, 3) Department of Computer Information Systems, Yarmouk University, Jordan
    (4) Computer Science Department, Zarqa University, Jordan

    Abstract: Radix-Tree is a space-optimized data structure that performs data compression by means of cluster nodes that share the same branch. Each node with only one child is merged with its child and is considered space optimized. Nevertheless, it can't be considered speed optimized because the root is associated with the empty string. Moreover, values are not normally associated with every node; they are associated only with leaves and some inner nodes that correspond to keys of interest. Therefore, it takes time to move bit by bit to reach the desired word. In this paper we propose the KP-Trie, which is considered a speed- and space-optimized data structure that results from both horizontal and vertical compression.

    Keywords: Trie, radix tree, data structure, branch factor, indexing, tree structure, information retrieval.

    Received January 14, 2015; accepted March 23, 2015; published online December 23, 2015

    1. Introduction
    Data structures are a specialized format for efficiently organizing, retrieving, saving and storing data. They are efficient with large amounts of data, such as large databases. ... the exception of leaf nodes, nodes in the trie work merely as pointers to words. A trie, also called digital tree, is an ordered multi-way tree data structure that is useful to store an associative array where the keys are usually strings,
  • Lecture 26 Fall 2019 Instructors: B&S Administrative Details
    CSCI 136 Data Structures & Advanced Programming
    Lecture 26, Fall 2019
    Instructors: B&S

    Administrative Details
    • Lab 9: Super Lexicon is online
    • Partners are permitted this week!
    • Please fill out the form by tonight at midnight
    • Lab 6 back

    Today
    • Lab 9
    • Efficient Binary search trees (Ch 14)
    • AVL Trees
      - Height is O(log n), so all operations are O(log n)
    • Red-Black Trees
      - Different height-balancing idea: height is O(log n)
      - All operations are O(log n)

    2 Implementing the Lexicon as a trie
    There are several different data structures you could use to implement a lexicon: a sorted array, a linked list, a binary search tree, a hashtable, and many others. Each of these offers tradeoffs between the speed of word and prefix lookup, amount of memory required to store the data structure, the ease of writing and debugging the code, performance of add/remove, and so on. The implementation we will use is a special kind of tree called a trie (pronounced "try"), designed for just this purpose. A trie is a letter-tree that efficiently stores strings. A node in a trie represents a letter. A path through the trie traces out a sequence of letters that represent a prefix or word in the lexicon. Instead of just two children as in a binary tree, each trie node has potentially 26 child pointers (one for each letter of the alphabet). Whereas searching a binary search tree eliminates half the words with a left or right turn, a search in a trie follows the child pointer for the next letter, which narrows the search to just words starting with that letter.

    Lab 9: Lexicon
    • Goal: Build a data structure
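The letter-tree described in this excerpt can be sketched in a few lines. This is an illustrative Python version under stated assumptions, not the lab's required interface: the names `TrieNode`, `Lexicon`, `add`, `contains_word` and `contains_prefix` are hypothetical, and a dictionary stands in for the 26 child pointers.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode (at most 26 entries for a-z)
        self.is_word = False  # True if the path from the root spells a word


class Lexicon:
    def __init__(self):
        self.root = TrieNode()  # the root corresponds to the empty prefix

    def add(self, word):
        node = self.root
        for letter in word:
            # Follow the child for this letter, creating it if missing.
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True

    def _walk(self, letters):
        """Follow one child pointer per letter; None if the path breaks."""
        node = self.root
        for letter in letters:
            node = node.children.get(letter)
            if node is None:
                return None
        return node

    def contains_prefix(self, prefix):
        return self._walk(prefix) is not None

    def contains_word(self, word):
        node = self._walk(word)
        return node is not None and node.is_word


lex = Lexicon()
for w in ["art", "arts", "are"]:
    lex.add(w)
```

Both lookups take time proportional to the length of the query string, independent of how many words are stored.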
  • Lecture 04 Linear Structures Sort
    Algorithmics (6EAP) MTAT.03.238
    Linear structures, sorting, searching, etc.
    Jaak Vilo, 2018 Fall

    Big-Oh notation classes

    Class               Informal                        Intuition        Analogy
    f(n) ∈ o(g(n))      f is dominated by g             Strictly below   <
    f(n) ∈ O(g(n))      Bounded from above              Upper bound      ≤
    f(n) ∈ Θ(g(n))      Bounded from above and below    "equal to"       =
    f(n) ∈ Ω(g(n))      Bounded from below              Lower bound      ≥
    f(n) ∈ ω(g(n))      f dominates g                   Strictly above   >

    Conclusions
    • Algorithm complexity deals with the behavior in the long-term
      - worst case -- typical
      - average case -- quite hard
      - best case -- bogus, cheating
    • In practice, long-term sometimes not necessary
      - E.g. for sorting 20 elements, you don't need fancy algorithms…

    Linear, sequential, ordered, list …
    Memory, disk, tape etc. is an ordered, sequentially addressed medium.

    Physical ordered list ~ array
    • Memory /address/
      - Garbage collection
    • Files (character/byte list/lines in text file,…)
    • Disk
      - Disk fragmentation

    Linear data structures: Arrays
    • Array
    • Hashed array tree
    • Bidirectional map
    • Heightmap
    • Bit array
    • Lookup table
    • Bit field
    • Matrix
    • Bitboard
    • Parallel array
    • Bitmap
    • Sorted array
    • Circular buffer
    • Sparse array
    • Control table
    • Sparse matrix
    • Image
    • Iliffe vector
    • Dynamic array
    • Variable-length array
    • Gap buffer

    Linear data structures: Lists
    • Doubly linked list
    • Array list
    • Xor linked list
    • Linked list
    • Zipper
    • Self-organizing list
    • Doubly connected edge list
    • Skip list
    • Unrolled linked list
    • Difference list
    • VList

    Lists: Array
    [Figure: array L = int[MAX_SIZE] with positions 0, 1, ..., size, ..., MAX_SIZE-1, holding the values 3 6 7 5 2]
  • Finding Neighbors in a Forest: a B-Tree for Smoothed Particle Hydrodynamics Simulations
    Aurélien Cavelan, Rubén M. Cabezón, Jonas H. M. Korndorfer, Florina M. Ciorba (all University of Basel, Switzerland)
    May 19, 2020
    arXiv:1910.02639v2 [cs.DC] 18 May 2020

    Abstract
    Finding the exact close neighbors of each fluid element in mesh-free computational hydrodynamical methods, such as the Smoothed Particle Hydrodynamics (SPH), often becomes a main bottleneck for scaling their performance beyond a few million fluid elements per computing node. Tree structures are particularly suitable for SPH simulation codes, which rely on finding the exact close neighbors of each fluid element (or SPH particle). In this work we present a novel tree structure, named b-tree, which features an adaptive branching factor to reduce the depth of the neighbor search. Depending on the particle spatial distribution, finding neighbors using b-tree has an asymptotic best case complexity of O(n), as opposed to O(n log n) for other classical tree structures such as octrees and quadtrees. We also present the proposed tree structure as well as the algorithms to build it and to find the exact close neighbors of all particles. We assess the scalability of the proposed tree-based algorithms through an extensive set of performance experiments in a shared-memory system. Results show that b-tree is up to 12× faster for building the tree and up to 1.6× faster for finding the exact neighbors of all particles when compared to its octree form.
  • Game Trees, Quad Trees and Heaps
    CS 61B, Fall 2014: Game Trees, Quad Trees and Heaps

    1 Heaps of fun
    (a) Assume that we have a binary min-heap (smallest value on top) data structure called Heap that stores integers and has properly implemented insert and removeMin methods. Draw the heap and its corresponding array representation after each of the operations below. (Trees are written level by level, with "/" separating levels.)

    Operation                 Array        Tree
    Heap h = new Heap(5);     5            5           //Creates a min-heap with 5 as the root
    h.insert(7);              5,7          5 / 7
    h.insert(3);              3,7,5        3 / 7 5
    h.insert(1);              1,3,5,7      1 / 3 5 / 7
    h.insert(2);              1,2,5,7,3    1 / 2 5 / 7 3
    h.removeMin();            2,3,5,7      2 / 3 5 / 7
    h.removeMin();            3,7,5        3 / 7 5

    (b) Consider an array based min-heap with N elements. What is the worst case running time of each of the following operations if we ignore resizing? What is the worst case running time if we take into account resizing? What are the advantages of using an array based heap vs. using a BST-based heap?

    Ignoring resizing:       Insert O(log N), Find Min O(1), Remove Min O(log N)
    Accounting for resizing: Insert O(N), Find Min O(1), Remove Min O(N)
    Using a BST is not space-efficient.

    (c) Your friend Alyssa P. Hacker challenges you to quickly implement a max-heap data structure - "Hah! I'll just use my min-heap implementation as a template", you think to yourself.
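The sequence of operations in part (a) can be replayed with a small array-based min-heap. The sketch below is in Python rather than the worksheet's Java, and the class `MinHeap` with methods `insert` and `remove_min` is a hypothetical stand-in for the worksheet's `Heap`, `insert` and `removeMin`; it ignores resizing, as Python lists grow on their own.

```python
class MinHeap:
    """Array-based binary min-heap; children of index i are 2i+1 and 2i+2."""

    def __init__(self, root):
        self.a = [root]

    def insert(self, x):
        a = self.a
        a.append(x)                    # place at the next free leaf
        i = len(a) - 1
        while i > 0 and a[(i - 1) // 2] > a[i]:
            parent = (i - 1) // 2      # bubble up while smaller than parent
            a[parent], a[i] = a[i], a[parent]
            i = parent

    def remove_min(self):
        a = self.a
        smallest = a[0]
        last = a.pop()                 # detach the last leaf
        if a:
            a[0] = last                # move it to the root, then sink it
            i, n = 0, len(a)
            while True:
                child = 2 * i + 1
                if child >= n:
                    break
                if child + 1 < n and a[child + 1] < a[child]:
                    child += 1         # pick the smaller child
                if a[i] <= a[child]:
                    break
                a[i], a[child] = a[child], a[i]
                i = child
        return smallest


h = MinHeap(5)
states = [list(h.a)]
for x in (7, 3, 1, 2):
    h.insert(x)
    states.append(list(h.a))
for _ in range(2):
    h.remove_min()
    states.append(list(h.a))
```

After running, `states` holds the array after each step, which can be compared against the arrays listed in part (a).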
  • Suffix Trees
    JASS 2008: Trees - the ubiquitous structure in computer science and mathematics
    Suffix Trees
    Caroline Löbhard
    St. Petersburg, 9.3. - 19.3. 2008

    Contents
    1 Introduction to Suffix Trees
      1.1 Basics
      1.2 Getting a first feeling for the nice structure of suffix trees
      1.3 A historical overview of algorithms
    2 Ukkonen's on-line space-economic linear-time algorithm
      2.1 High-level description
      2.2 Using suffix links
      2.3 Edge-label compression and the skip/count trick
      2.4 Two more observations
    3 Generalised Suffix Trees
    4 Applications of Suffix Trees
    References

    1 Introduction to Suffix Trees
    A suffix tree is a tree-like data structure for strings that supports fast algorithms for finding all occurrences of substrings. A given string S is preprocessed in O(|S|) time. Afterwards, for any other string P, one can decide in O(|P|) time whether P occurs in S, and report all its exact positions in S in additional time proportional to the number of occurrences. This linear worst-case bound, which depends only on the length |P| of the (shorter) string, is what makes suffix trees special and important, since many string-processing applications have to deal with large strings S.

    1.1 Basics
    In this paper, we will denote the fixed alphabet with Σ, single characters with lower-case letters x, y, ..., strings over Σ with upper-case or Greek letters P, S, ..., α, σ, τ, ..., trees with script letters T, ... and inner nodes of trees (that is, all nodes except the root and leaves) with lower-case letters u, v, ...
  • Augmentation: Range Trees (PDF)
    Lecture 9: Augmentation (6.046J, Spring 2015)

    This lecture covers augmentation of data structures, including
    • easy tree augmentation
    • order-statistics trees
    • finger search trees, and
    • range trees
    The main idea is to modify "off-the-shelf" common data structures to store (and update) additional information.

    Easy Tree Augmentation
    The goal here is to store x.f at each node x, which is a function of the node, namely f(subtree rooted at x). Suppose x.f can be computed (updated) in O(1) time from x, children and children.f. Then modifying a set S of nodes costs O(# of ancestors of S) to update x.f, because we need to walk up the tree to the root. Two examples of O(lg n) updates are
    • AVL trees: after rotating two nodes, first update the new bottom node and then update the new top node
    • 2-3 trees: after splitting a node, update the two new nodes.
    In both cases, then update up the tree.

    Order-Statistics Trees (from 6.006)
    The goal of order-statistics trees is to design an Abstract Data Type (ADT) interface that supports the following operations
    • insert(x), delete(x), successor(x),
    • rank(x): find x's index in the sorted order, i.e., # of elements < x,
    • select(i): find the element with rank i.
    We can implement the above ADT using easy tree augmentation on AVL trees (or 2-3 trees) to store subtree size: f(subtree) = # of nodes in it. Then we also have x.size = 1 + Σ c.size for c in x.children.
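The subtree-size augmentation and the rank/select operations can be sketched as follows. To keep the example short it uses a plain, unbalanced binary search tree in Python rather than the AVL or 2-3 trees named in the lecture, so the O(lg n) bounds are not guaranteed here; the names `Node`, `size`, `insert`, `rank` and `select` are illustrative, and keys are assumed distinct.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1          # augmentation: # of nodes in this subtree


def size(node):
    return node.size if node else 0


def insert(node, key):
    """Insert key and refresh x.size on every ancestor on the way back up."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node


def rank(node, key):
    """Number of stored elements strictly smaller than key."""
    r = 0
    while node:
        if key <= node.key:
            node = node.left
        else:
            r += size(node.left) + 1   # node and its whole left subtree
            node = node.right
    return r


def select(node, i):
    """Element with rank i (0-based), using subtree sizes to steer."""
    while node:
        left = size(node.left)
        if i < left:
            node = node.left
        elif i == left:
            return node.key
        else:
            i -= left + 1
            node = node.right
    raise IndexError("rank out of range")


root = None
for k in (50, 30, 70, 20, 40, 60, 80):
    root = insert(root, k)
```

Swapping in a balanced tree would only require recomputing `size` for the nodes touched by each rotation or split, in the order the lecture describes.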
  • Heaps a Heap Is a Complete Binary Tree. a Max-Heap Is A
    Heaps
    A heap is a complete binary tree. A max-heap is a complete binary tree in which the value in each internal node is greater than or equal to the values in the children of that node. A min-heap is defined similarly.

    Mapping the elements of a heap into an array is trivial: if a node is stored at index k, then its left child is stored at index 2k+1 and its right child at index 2k+2.

    [Figure: a max-heap with levels 97 / 93 84 / 90 79 83 81 / 42 55 73 21 83]

    Index:   0   1   2   3   4   5   6   7   8   9  10  11
    Value:  97  93  84  90  79  83  81  42  55  73  21  83

    CS@VT Data Structures & Algorithms ©2000-2009 McQuain

    Building a Heap
    The fact that a heap is a complete binary tree allows it to be efficiently represented using a simple array. Given an array of N values, a heap containing those values can be built, in situ, by simply "sifting" each internal node down to its proper location:
    - start with the last internal node
    - swap the current internal node with its larger child, if necessary
    - then follow the swapped node down
    - continue until all internal nodes are done

    [Figure: successive states of the tree 73 / 74 81 / 79 90 93 as each internal node, marked *, is sifted down]

    Heap Class Interface
    We will consider a somewhat minimal maxheap class:

    public class BinaryHeap<T extends Comparable<? super T>> {
        private static final int DEFCAP = 10; // default array size
        private int size;                     // # elems in array
        private T [] elems;                   // array of elems
        public BinaryHeap() { .
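The in-place construction listed under "Building a Heap" can be sketched as follows. This is a Python illustration rather than the slides' Java class; the function names `sift_down` and `build_max_heap` are assumptions, and the sample array is the starting tree from the slide's figure read level by level.

```python
def sift_down(a, i, n):
    """Move a[i] down until neither child (within the first n slots) is larger."""
    while True:
        child = 2 * i + 1              # left child of index i
        if child >= n:
            return                     # i is a leaf
        if child + 1 < n and a[child + 1] > a[child]:
            child += 1                 # right child is the larger one
        if a[i] >= a[child]:
            return                     # heap property already holds here
        a[i], a[child] = a[child], a[i]
        i = child                      # follow the swapped node down


def build_max_heap(a):
    """Turn list a into a max-heap in place and return it."""
    n = len(a)
    # The last internal node sits at index n // 2 - 1; work back to the root.
    for i in range(n // 2 - 1, -1, -1):
        sift_down(a, i, n)
    return a


heap = build_max_heap([73, 74, 81, 79, 90, 93])
```

Starting from the last internal node means every subtree below the current node is already a heap, so a single downward pass per node suffices.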
  • Advanced Data Structures
    PETER BRASS, City College of New York

    CAMBRIDGE UNIVERSITY PRESS
    Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
    Cambridge University Press, The Edinburgh Building, Cambridge CB2 8RU, UK
    Published in the United States of America by Cambridge University Press, New York
    www.cambridge.org
    Information on this title: www.cambridge.org/9780521880374
    © Peter Brass 2008
    This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
    First published in print format 2008
    ISBN-13 978-0-511-43685-7 eBook (EBL)
    ISBN-13 978-0-521-88037-4 hardback
    Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

    Contents
    Preface
    1 Elementary Structures
      1.1 Stack
      1.2 Queue
      1.3 Double-Ended Queue
      1.4 Dynamical Allocation of Nodes
      1.5 Shadow Copies of Array-Based Structures
    2 Search Trees
      2.1 Two Models of Search Trees
      2.2 General Properties and Transformations
      2.3 Height of a Search Tree
      2.4 Basic Find, Insert, and Delete
      2.5 Returning from Leaf to Root
      2.6 Dealing with Nonunique Keys
      2.7 Queries for the Keys in an Interval
      2.8 Building Optimal Search Trees
      2.9 Converting Trees into Lists
      2.10
  • L11: Quadtrees CSE373, Winter 2020
    L11: Quadtrees, CSE373, Winter 2020

    Quadtrees
    CSE 373 Winter 2020
    Instructor: Hannah C. Tang
    Teaching Assistants: Aaron Johnston, Ethan Knutson, Nathan Lipiarski, Amanda Park, Farrell Fileas, Sam Long, Anish Velagapudi, Howard Xiao, Yifan Bai, Brian Chan, Jade Watkins, Yuma Tou, Elena Spasova, Lea Quan

    Announcements
    ❖ Homework 4: Heap is released and due Wednesday
      ▪ Hint: you will need an additional data structure to improve the runtime for changePriority(). It does not affect the correctness of your PQ at all. Please use a built-in Java collection instead of implementing your own.
      ▪ Hint: If you implemented a unittest that tested the exact thing the autograder described, you could run the autograder's test in the debugger (and also not have to use your tokens).
    ❖ Please look at posted QuickCheck; we had a few corrections!

    Lecture Outline
    ❖ Heaps, cont.: Floyd's buildHeap
    ❖ Review: Set/Map data structures and logarithmic runtimes
    ❖ Multi-dimensional Data
    ❖ Uniform and Recursive Partitioning
    ❖ Quadtrees

    Other Priority Queue Operations
    ❖ The two "primary" PQ operations are:
      ▪ removeMax()
      ▪ add()
    ❖ However, because PQs are used in so many algorithms there are three common-but-nonstandard operations:
      ▪ merge(): merge two PQs into a single PQ
      ▪ buildHeap(): reorder the elements of an array so that its contents can be interpreted as a valid binary heap
      ▪ changePriority(): change the priority of an item already in the heap