To Trie or Not to Trie? Realizing Space-Partitioning Trees inside PostgreSQL: Challenges, Experiences and Performance

Total pages: 16

File type: PDF, Size: 1020 KB

Purdue University, Purdue e-Pubs
Department of Computer Science Technical Reports, 2005

Mohamed Y. Eltabakh, Ramy Eltarras, Walid G. Aref
Purdue University, [email protected]
Department of Computer Sciences, Purdue University, West Lafayette, IN 47907-1398, USA
{meltabak, rhassan, aref}@cs.purdue.edu

Report Number: 05-008 (CSD TR #05-008), April 2005

Eltabakh, Mohamed Y.; Eltarras, Ramy; and Aref, Walid G., "To Trie or Not to Trie? Realizing Space-partitioning Trees inside PostgreSQL: Challenges, Experiences and Performance" (2005). Department of Computer Science Technical Reports. Paper 1623. https://docs.lib.purdue.edu/cstech/1623

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005.

Abstract

Many evolving database applications warrant the use of non-traditional indexing mechanisms beyond B+-trees and hash tables. SP-GiST is an extensible indexing framework that broadens the class of supported indexes to include disk-based versions of a wide variety of space-partitioning trees, e.g., disk-based trie variants, quadtree variants, and kd-trees. This paper presents a serious attempt at implementing and realizing SP-GiST-based indexes inside PostgreSQL. Several index types are realized inside PostgreSQL facilitated by rapid SP-GiST instantiations. Challenges, experiences, and performance issues are addressed in the paper. Performance comparisons are conducted from within PostgreSQL to compare update and search performances of SP-GiST-based indexes against the B+-tree and the R-tree for text string and point data sets. Interesting performance results are presented in the paper. Results highlight the potential performance gains of SP-GiST-based indexes as well as several strengths and weaknesses of SP-GiST-based indexes that need to be addressed in future research.

1 Introduction

Many emerging database applications warrant the use of non-traditional indexing mechanisms beyond B+-trees and hash tables. Database vendors have realized this need and have initiated efforts to support several non-traditional indexes, e.g., Oracle [35] and IBM DB2 [1].

One of the major hurdles in implementing non-traditional indexes inside a database engine is the very wide variety of such indexes. Moreover, there is tremendous overhead that is associated with realizing and integrating any of these indexes inside the engine. Generalized search trees (e.g., GiST [21] and SP-GiST [4, 3]) are designed to address this problem.

Generalized search trees (GiST [21]) and Space-partitioning Generalized search trees (SP-GiST [4, 3]) are software engineering frameworks for rapid prototyping of indexes inside a database engine. GiST supports the class of balanced trees (B+-tree-like trees), e.g., R-trees [7, 20, 32], SR-trees [24], and RD-trees [22], while SP-GiST supports the class of space-partitioning trees, e.g., tries, quadtrees, and k-d trees. Both frameworks have internal methods that furnish general database functionalities, e.g., generalized search and insert algorithms, as well as user-defined external methods and parameters that tailor the generalized index into one instance index from the corresponding index class. GiST has been tested in prototype systems, e.g., in Predator [34] and in PostgreSQL [37], and hence is not the focus of this study.

The purpose of this study is to demonstrate feasibility and performance issues of SP-GiST-based indexes. This work presents a serious attempt at implementing and realizing SP-GiST-based indexes inside PostgreSQL. Using rapid SP-GiST instantiations, several index types are realized inside PostgreSQL that index string and point data types. Performance comparisons are conducted from within PostgreSQL to compare update and search performances of one disk-based trie variant against the B+-tree for a variety of string dataset collections, and one disk-based kd-tree variant against the R-tree for two-dimensional point dataset collections. Other more sophisticated index types can be instantiated using the same SP-GiST platform. We plan to release the PostgreSQL version of SP-GiST around the same time we publish this paper.

The contributions of this research can be summarized as follows:

1. We realized SP-GiST inside PostgreSQL to extend the available access methods to include the class of space-partitioning trees. Our implementation methodology makes SP-GiST portable. That is, SP-GiST can be realized inside PostgreSQL without recompiling PostgreSQL code.

2. We extended the indexing operations of the space-partitioning trees, e.g., the trie, to include more challenging search operations, e.g., the prefix match search and the regular expression match search.

3. We conducted extensive experiments from within PostgreSQL to compare the performance of the trie and the kd-tree against the B+-tree and the R-tree, respectively. Our results illustrate that the trie has more than 150% search performance improvement over the B+-tree performance in the case of the exact match search. Also, the trie has more than 2 orders of magnitude search performance improvement in the case of the regular expression match search. Our experiments also demonstrate that the kd-tree has more than 300% search performance improvement over the R-tree in the case of the point match search.

4. We realized the suffix tree index structure using the SP-GiST framework to support the substring match searching. Our experiments demonstrate that the suffix tree improves the search performance by more than 3 orders of magnitude over the sequential scan. The other index structures, e.g., B+-tree and R-tree, do not support the substring match search.

5. We identified weaknesses of SP-GiST-based indexes compared to the B+-tree and the R-tree indexes that need to be addressed in future research. Basically, the insertion time and the index size of the SP-GiST-based indexes involve higher overhead than those of the B+-tree and the R-tree indexes.

The rest of this paper proceeds as follows. In Section 2, we highlight related work in the area of extensible indexes and generalized indexing frameworks. In Section 3, we overview space-partitioning trees, the challenges they have from database indexing point of view, and how these challenges are addressed in SP-GiST.

2 Related Work

Multidimensional searching is a fundamental operation for many database applications. Several index structures beyond B-trees [6, 11] and hash tables [14, 29] have been proposed to manage the update and search operations over spatial and multidimensional data, e.g., [17, 27, 31, 33]. These index structures include the R-tree and its variants [7, 20, 32], the quadtree and its variants [15, 18, 25, 39], the kd-tree [8], the disk-based kd-tree variants, e.g., [9, 30], and the trie structure and its variants [2, 10, 16]. Extensions to the B-tree have also been proposed to address indexing multidimensional attributes, such as the universal B-tree [5] and the hB-tree [13].

Extensible indexing frameworks have been proposed to instantiate a variety of index structures in an efficient way and without modifying the database engine. Extensible indexing frameworks are first proposed in [36]. GiST (Generalized Search Trees) is an extensible framework for B-tree-like indexes [21]. SP-GiST (Space Partitioning Generalized Search Trees) is an extensible framework for the family of space-partitioning trees [4, 3]. An extensible bulk loading algorithm for SP-GiST is proposed in [19]. Extensible indexing structures gain more importance in the context of the object relational database management systems (ORDBMS) that are designed primarily to support new types of data and applications. The implementation of GiST in Informix Dynamic Server with Universal Data Option (IDS/UDO) is presented in [26]. Current commercial databases have also supported the extensible indexing frameworks, e.g., IBM DB2 [1] and Oracle [35].

The performance of the various index structures has been studied extensively. For example, a model for the R-tree performance is proposed in [38]. A comparison between quadtree and R-tree indexes in Oracle Spatial is presented in [28]. Another comparative study among the R-tree variants and the PMR quadtree is presented in [23].

3 Space-partitioning Trees: Overview, Challenges, and SP-GiST

The main characteristic of space-partitioning trees is that they partition the multi-dimensional space into disjoint (non-overlapping) regions. Refer to Figures 1, 2, and 3 for a few examples of space-partitioning trees. Partitioning can be either (1) space-driven (e.g., Figure 2), where we decompose the space into equal-sized partitions regardless of the data distribution, or (2) data-driven (e.g., Figure 3),
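The difference between space-driven and data-driven partitioning can be illustrated with a single cut along the x-axis. The following is a minimal sketch, not code from the paper or from SP-GiST: the function names `space_driven_split` and `data_driven_split` are hypothetical, and using the midpoint and the median as the two cut rules is an assumption chosen for illustration.

```python
from statistics import median


def space_driven_split(points, lo, hi):
    """Cut the x-range [lo, hi) at its midpoint, ignoring where the data lies."""
    mid = (lo + hi) / 2
    left = [p for p in points if p[0] < mid]
    right = [p for p in points if p[0] >= mid]
    return mid, left, right


def data_driven_split(points):
    """Cut at the median x-coordinate, so the cut follows the data."""
    cut = median(p[0] for p in points)
    left = [p for p in points if p[0] < cut]
    right = [p for p in points if p[0] >= cut]
    return cut, left, right


# Skewed sample data (assumed): four points near the origin, one far away.
points = [(1, 5), (2, 3), (3, 8), (4, 1), (90, 7)]

mid, left, right = space_driven_split(points, 0, 100)
print(mid, len(left), len(right))   # 50.0 4 1

cut, left, right = data_driven_split(points)
print(cut, len(left), len(right))   # 3 2 3
```

On skewed data the midpoint cut leaves the two sides unbalanced, while the median cut keeps them close in size; this is the trade-off that the two partitioning styles represent.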
Recommended publications
  • Application of TRIE Data Structure and Corresponding Associative Algorithms for Process Optimization in GRID Environment
    Application of TRIE data structure and corresponding associative algorithms for process optimization in GRID environment. V. V. Kashansky (a), I. L. Kaftannikov (b). South Ural State University (National Research University), 76, Lenin prospekt, Chelyabinsk, 454080, Russia. E-mail: (a) [email protected], (b) [email protected]. Growing interest in BOINC-powered projects has made the volunteer GRID model widely used in recent years, arranging many computational resources in different environments. Big Data and horizontally scalable multiuser systems have revealed many problems. This paper provides an analysis of the TRIE data structure and possibilities of its application in contemporary volunteer GRID networks, including routing (L3 OSI) and a specialized key-value storage engine (L7 OSI). The main goal is to show how TRIE mechanisms can influence the delivery process of the corresponding GRID environment resources and services at different layers of networking abstraction. The relevance of the optimization topic is ensured by the fact that with increasing data flow intensity, the latency of the various linear-algorithm-based subsystems increases as well. This leads to general effects such as unacceptably high transmission time and processing instability. Logically, the paper can be divided into three parts with a summary. The first part provides a definition of TRIE and asymptotic estimates of the corresponding algorithms (searching, deletion, insertion). The second part is devoted to the problem of routing time reduction by applying the TRIE data structure. In particular, we analyze Cisco IOS switching services based on Bitwise TRIE and 256-way TRIE data structures. The third part contains a general BOINC architecture review and recommendations for highly-loaded projects.
  • Compressed Suffix Trees with Full Functionality
    Compressed Suffix Trees with Full Functionality. Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka 812-8581, Japan. [email protected]. Abstract: We introduce new data structures for compressed suffix trees whose size is linear in the text size. The size is measured in bits; thus they occupy only O(n log |A|) bits for a text of length n on an alphabet A. This is a remarkable improvement on current suffix trees, which require O(n log n) bits. Though some components of suffix trees have been compressed, there is no linear-size data structure for suffix trees with full functionality such as computing suffix links, string-depths and lowest common ancestors. The data structure proposed in this paper is the first one that has linear size and supports all operations efficiently. Any algorithm running on a suffix tree can also be executed on our compressed suffix trees with a slight slowdown of a factor of polylog(n). 1 Introduction: Suffix trees are basic data structures for string algorithms [13]. A pattern can be found in time proportional to the pattern length from a text by constructing the suffix tree of the text in advance. The suffix tree can also be used for more complicated problems, for example finding the longest repeated substring in linear time. Many efficient string algorithms are based on the use of suffix trees because this does not increase the asymptotic time complexity. A suffix tree of a string can be constructed in linear time in the string length [28, 21, 27, 5].
  • KP-Trie Algorithm for Update and Search Operations
    The International Arab Journal of Information Technology, Vol. 13, No. 6, November 2016, p. 722. KP-Trie Algorithm for Update and Search Operations. Feras Hanandeh (1), Izzat Alsmadi (2), Mohammed Akour (3), and Essam Al Daoud (4). (1) Department of Computer Information Systems, Hashemite University, Jordan; (2, 3) Department of Computer Information Systems, Yarmouk University, Jordan; (4) Computer Science Department, Zarqa University, Jordan. Abstract: Radix-Tree is a space-optimized data structure that performs data compression by means of cluster nodes that share the same branch. Each node with only one child is merged with its child and is considered space optimized. Nevertheless, it cannot be considered speed optimized because the root is associated with the empty string. Moreover, values are not normally associated with every node; they are associated only with leaves and some inner nodes that correspond to keys of interest. Therefore, it takes time to move bit by bit to reach the desired word. In this paper we propose the KP-Trie, which is considered a speed- and space-optimized data structure that results from both horizontal and vertical compression. Keywords: Trie, radix tree, data structure, branch factor, indexing, tree structure, information retrieval. Received January 14, 2015; accepted March 23, 2015; published online December 23, 2015. 1. Introduction: Data structures are a specialized format for efficiently organizing, retrieving, saving and storing data. They are efficient with large amounts of data, such as large databases. A trie, also called a digital tree, is an ordered multi-way tree data structure that is useful for storing an associative array where the keys are usually strings. With the exception of leaf nodes, nodes in the trie work merely as pointers to words.
  • Lecture 26 Fall 2019 Instructors: B&S Administrative Details
    CSCI 136 Data Structures & Advanced Programming, Lecture 26, Fall 2019. Instructors: B&S. Administrative Details • Lab 9: Super Lexicon is online • Partners are permitted this week! • Please fill out the form by tonight at midnight • Lab 6 back. Today • Lab 9 • Efficient Binary search trees (Ch 14) • AVL Trees • Height is O(log n), so all operations are O(log n) • Red-Black Trees • Different height-balancing idea: height is O(log n) • All operations are O(log n). Implementing the Lexicon as a trie: There are several different data structures you could use to implement a lexicon: a sorted array, a linked list, a binary search tree, a hashtable, and many others. Each of these offers tradeoffs between the speed of word and prefix lookup, amount of memory required to store the data structure, the ease of writing and debugging the code, performance of add/remove, and so on. The implementation we will use is a special kind of tree called a trie (pronounced "try"), designed for just this purpose. A trie is a letter-tree that efficiently stores strings. A node in a trie represents a letter. A path through the trie traces out a sequence of letters that represent a prefix or word in the lexicon. Instead of just two children as in a binary tree, each trie node has potentially 26 child pointers (one for each letter of the alphabet). Whereas searching a binary search tree eliminates half the words with a left or right turn, a search in a trie follows the child pointer for the next letter, which narrows the search to just words starting with that letter. Lab 9: Lexicon • Goal: Build a data structure
  • Suffix Trees
    JASS 2008: Trees - the ubiquitous structure in computer science and mathematics. Suffix Trees. Caroline Löbhard. St. Petersburg, 9.3. - 19.3.2008. Contents: 1 Introduction to Suffix Trees; 1.1 Basics; 1.2 Getting a first feeling for the nice structure of suffix trees; 1.3 A historical overview of algorithms; 2 Ukkonen's on-line space-economic linear-time algorithm; 2.1 High-level description; 2.2 Using suffix links; 2.3 Edge-label compression and the skip/count trick; 2.4 Two more observations; 3 Generalised Suffix Trees; 4 Applications of Suffix Trees; References. 1 Introduction to Suffix Trees: A suffix tree is a tree-like data structure for strings that affords fast algorithms to find all occurrences of substrings. A given string S is preprocessed in O(|S|) time. Afterwards, for any other string P, one can decide in O(|P|) time whether P can be found in S and report all its exact positions in S. This linear worst-case time bound, depending only on the length |P| of the (shorter) string, is special and important for suffix trees, since many string-processing applications have to deal with large strings S. 1.1 Basics: In this paper, we will denote the fixed alphabet with Σ, single characters with lower-case letters x, y, ..., strings over Σ with upper-case or Greek letters P, S, ..., α, σ, τ, ..., trees with script letters T, ... and inner nodes of trees (that is, all nodes except the root and leaves) with lower-case letters u, v, ...
  • Efficient In-Memory Indexing with Generalized Prefix Trees
    Efficient In-Memory Indexing with Generalized Prefix Trees. Matthias Boehm, Benjamin Schlegel, Peter Benjamin Volk, Ulrike Fischer, Dirk Habich, Wolfgang Lehner. TU Dresden; Database Technology Group; Dresden, Germany. Abstract: Efficient data structures for in-memory indexing gain in importance due to (1) the exponentially increasing amount of data, (2) the growing main-memory capacity, and (3) the gap between main-memory and CPU speed. In consequence, there are high performance demands for in-memory data structures. Such index structures are used, with minor changes, as primary or secondary indices in almost every DBMS. Typically, tree-based or hash-based structures are used, while structures based on prefix-trees (tries) are neglected in this context. For tree-based and hash-based structures, the major disadvantages are inherently caused by the need for reorganization and key comparisons. In contrast, the major disadvantage of trie-based structures in terms of high memory consumption (created and accessed nodes) could be improved. In this paper, we argue for reconsidering prefix trees as in-memory index structures and we present the generalized trie, which is a prefix tree with variable prefix length for indexing arbitrary data types of fixed or variable length. The variable prefix length enables the adjustment of the trie height and its memory consumption. Further, we introduce concepts for reducing the number of created and accessed trie levels. This trie is order-preserving and has deterministic trie paths for keys, and hence, it does not require any dynamic reorganization or key comparisons. Finally, the generalized trie yields improvements compared to existing in-memory index structures, especially for skewed data.
  • Search Trees
    Lecture III. "Trees are the earth's endless effort to speak to the listening heaven." – Rabindranath Tagore, Fireflies, 1928. Alice was walking beside the White Knight in Looking Glass Land. "You are sad." the Knight said in an anxious tone: "let me sing you a song to comfort you." "Is it very long?" Alice asked, for she had heard a good deal of poetry that day. "It's long." said the Knight, "but it's very, very beautiful. Everybody that hears me sing it - either it brings tears to their eyes, or else -" "Or else what?" said Alice, for the Knight had made a sudden pause. "Or else it doesn't, you know. The name of the song is called 'Haddocks' Eyes.'" "Oh, that's the name of the song, is it?" Alice said, trying to feel interested. "No, you don't understand," the Knight said, looking a little vexed. "That's what the name is called. The name really is 'The Aged, Aged Man.'" "Then I ought to have said 'That's what the song is called'?" Alice corrected herself. "No you oughtn't: that's another thing. The song is called 'Ways and Means' but that's only what it's called, you know!" "Well, what is the song then?" said Alice, who was by this time completely bewildered. "I was coming to that," the Knight said. "The song really is 'A-sitting On a Gate': and the tune's my own invention." So saying, he stopped his horse and let the reins fall on its neck: then slowly beating time with one hand, and with a faint smile lighting up his gentle, foolish face, he began..
  • Lecture 1: Introduction
    Lecture 1: Introduction. Agenda: • Welcome to CS 238: Algorithmic Techniques in Computational Biology • Official course information • Course description • Announcements • Basic concepts in molecular biology. Official course information • Grading weights: 50% assignments (3-4); 50% participation, presentation, and course report. • Text and reference books: T. Jiang, Y. Xu and M. Zhang (co-eds), Current Topics in Computational Biology, MIT Press, 2002 (co-published by Tsinghua University Press in China); D. Gusfield, Algorithms for Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge Press, 1997; D. Krane and M. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2003; P. Pevzner, Computational Molecular Biology: An Algorithmic Approach, 2000, the MIT Press; M. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman and Hall, 1995; these notes (typeset by Guohui Lin and Tao Jiang). • Instructor: Tao Jiang, Surge Building 330, x82991, [email protected]. Office hours: Tuesday & Thursday 3-4pm. Course overview • Topics covered: Biological background introduction; My research topics; Sequence homology search and comparison; Sequence structural comparison; String matching algorithms and suffix tree; Genome rearrangement; Protein structure and function prediction; Phylogenetic reconstruction; * DNA sequencing, sequence assembly; * Physical/restriction mapping; * Prediction of regulatory elements. • Announcements:
  • Search Trees for Strings a Balanced Binary Search Tree Is a Powerful Data Structure That Stores a Set of Objects and Supports Many Operations Including
    Search Trees for Strings. A balanced binary search tree is a powerful data structure that stores a set of objects and supports many operations including: Insert and Delete. Lookup: Find if a given object is in the set, and if it is, possibly return some data associated with the object. Range query: Find all objects in a given range. The time complexity of the operations for a set of size n is O(log n) (plus the size of the result) assuming constant time comparisons. There are also alternative data structures, particularly if we do not need to support all operations: • A hash table supports operations in constant time but does not support range queries. • An ordered array is simpler, faster and more space efficient in practice, but does not support insertions and deletions. A data structure is called dynamic if it supports insertions and deletions and static if not. When the objects are strings, operations slow down: • Comparisons are slower. For example, the average case time complexity is O(log n log_σ n) for operations in a binary search tree storing a random set of strings. • Computing a hash function is slower too. For a string set R, there are also new types of queries: Lcp query: What is the length of the longest prefix of the query string S that is also a prefix of some string in R? Prefix query: Find all strings in R that have S as a prefix. The prefix query is a special type of range query. Trie: A trie is a rooted tree with the following properties: • Edges are labelled with symbols from an alphabet Σ.
  • Suffix Trees for Fast Sensor Data Forwarding
    Suffix Trees for Fast Sensor Data Forwarding. Jui-Chieh Wu (b), Hsueh-I Lu (b, c), Polly Huang (a, c). (a) Department of Electrical Engineering, (b) Department of Computer Science and Information Engineering, (c) Graduate Institute of Networking and Multimedia, National Taiwan University. [email protected], [email protected], [email protected]. Abstract: In data-centric wireless sensor networks, data are no longer sent by the sink node's address. Instead, the sink node sends an explicit interest for a particular type of data. The source and the intermediate nodes then forward the data according to the routing states set by the corresponding interest. This data-centric style of communication is promising in that it alleviates the effort of node addressing and address reconfiguration. However, when the number of interests from different sinks increases, the size of the interest table grows. It could take 10s to 100s milliseconds to match an incoming data to a particular interest, which is orders of magnitude higher than the typical transmission and propagation delay in a wireless sensor network. The interest table lookup process is the bottleneck of packet [...] alleviates the effort of node addressing and address reconfiguration in large-scale mobile sensor networks. Forwarding in data-centric sensor networks is particularly challenging. It involves matching of the data content, i.e., string-based attributes and values, instead of numeric addresses. This content-based forwarding problem is well studied in the domain of publish-subscribe systems. It is estimated in [3] that the time it takes to match an incoming data to a particular interest ranges from 10s to 100s milliseconds. This processing delay is several orders of magnitude higher than the propagation and transmission delay.
  • X-Fast and Y-Fast Tries
    x-Fast and y-Fast Tries. Outline for Today ● Bitwise Tries ● A simple ordered dictionary for integers. ● x-Fast Tries ● Tries + Hashing ● y-Fast Tries ● Tries + Hashing + Subdivision + Balanced Trees + Amortization. Recap from Last Time: Ordered Dictionaries ● An ordered dictionary is a data structure that maintains a set S of elements drawn from an ordered universe 𝒰 and supports these operations: ● insert(x), which adds x to S. ● is-empty(), which returns whether S = Ø. ● lookup(x), which returns whether x ∈ S. ● delete(x), which removes x from S. ● max() / min(), which returns the maximum or minimum element of S. ● successor(x), which returns the smallest element of S greater than x, and ● predecessor(x), which returns the largest element of S smaller than x. Integer Ordered Dictionaries ● Suppose that 𝒰 = [U] = {0, 1, …, U – 1}. ● A van Emde Boas tree is an ordered dictionary for [U] where ● min, max, and is-empty run in time O(1). ● All other operations run in time O(log log U). ● Space usage is Θ(U) if implemented deterministically, and O(n) if implemented using hash tables. ● Question: Is there a simpler data structure meeting these bounds? The Machine Model ● We assume a transdichotomous machine model: ● Memory is composed of words of w bits each. ● Basic arithmetic and bitwise operations on words take time O(1) each. ● w = Ω(log n). A Start: Bitwise Tries. Tries Revisited ● Recall: A trie is a simple data structure for storing strings. ● Integers can be thought of as strings of bits. ● Idea: Store integers in a bitwise trie.
  • Tries and String Matching
    Tries and String Matching. Where We've Been ● Fundamental Data Structures ● Red/black trees, B-trees, RMQ, etc. ● Isometries ● Red/black trees ≡ 2-3-4 trees, binomial heaps ≡ binary numbers, etc. ● Amortized Analysis ● Aggregate, banker's, and potential methods. Where We're Going ● String Data Structures ● Data structures for storing and manipulating text. ● Randomized Data Structures ● Using randomness as a building block. ● Integer Data Structures ● Breaking the Ω(n log n) sorting barrier. ● Dynamic Connectivity ● Maintaining connectivity in a changing world. String Data Structures: Text Processing ● String processing shows up everywhere: ● Computational biology: Manipulating DNA sequences. ● NLP: Storing and organizing huge text databases. ● Computer security: Building antivirus databases. ● Many problems have polynomial-time solutions. ● Goal: Design theoretically and practically efficient algorithms that outperform brute-force approaches. Outline for Today ● Tries ● A fundamental building block in string processing algorithms. ● Aho-Corasick String Matching ● A fast and elegant algorithm for searching large texts for known substrings. Tries: Ordered Dictionaries ● Suppose we want to store a set of elements supporting the following operations: ● Insertion of new elements. ● Deletion of old elements. ● Membership queries. ● Successor queries. ● Predecessor queries. ● Min/max queries. ● Can use a standard red/black tree or splay tree to get (worst-case or expected) O(log n) implementations of each. A Catch ● Suppose we want to store a set of strings. ● Comparing two strings of lengths r and s takes time O(min{r, s}). ● Operations on a balanced BST or splay tree now take time O(M log n), where M is the length of the longest string in the tree.
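Several entries in this list, as well as the exact-match and prefix-match searches discussed in the report, rest on the same idea: a trie answers a query by following one child pointer per character, so the cost depends on the query length rather than on the number of stored strings. The following in-memory sketch is an illustration only; it is not the disk-based SP-GiST trie, and the names `TrieNode`, `Trie`, `contains`, `lcp`, and `with_prefix` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps one character to the child node
        self.is_key = False  # True if a stored string ends at this node


class Trie:
    def __init__(self, keys=()):
        self.root = TrieNode()
        for key in keys:
            self.insert(key)

    def insert(self, key):
        node = self.root
        for ch in key:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_key = True

    def contains(self, key):
        """Exact match: follow the key's characters, then check the end marker."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_key

    def lcp(self, query):
        """Length of the longest prefix of query that prefixes some stored string."""
        node, depth = self.root, 0
        for ch in query:
            node = node.children.get(ch)
            if node is None:
                break
            depth += 1
        return depth

    def with_prefix(self, prefix):
        """Prefix match: all stored strings that start with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:  # iterative depth-first walk of the subtree below prefix
            current, word = stack.pop()
            if current.is_key:
                found.append(word)
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        return sorted(found)


trie = Trie(["pot", "potato", "pottery", "tattoo", "tempo"])
print(trie.contains("pot"))     # True
print(trie.lcp("potion"))       # 3
print(trie.with_prefix("pot"))  # ['pot', 'potato', 'pottery']
```

Each query first walks at most one node per query character; the prefix search then visits only the subtree below that point, which is why it behaves like a range query restricted to one branch.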
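The substring match search that the report attributes to suffix trees, and that the suffix tree entries above describe, can be sketched with a deliberately naive structure. This is an uncompressed suffix trie built in quadratic time and space, not a true suffix tree and not Ukkonen's linear-time construction; the names `SuffixTrieNode`, `build_suffix_trie`, and `find_all` are hypothetical and chosen for illustration.

```python
class SuffixTrieNode:
    def __init__(self):
        self.children = {}  # maps one character to the child node
        self.starts = []    # start positions of all suffixes passing through


def build_suffix_trie(text):
    """Insert every suffix of text, recording its start position along the path."""
    root = SuffixTrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = SuffixTrieNode()
            node = node.children[ch]
            node.starts.append(start)
    return root


def find_all(root, pattern):
    """Return the start positions of a non-empty pattern, in increasing order."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.starts


index = build_suffix_trie("banana")
print(find_all(index, "ana"))  # [1, 3]
print(find_all(index, "nan"))  # [2]
print(find_all(index, "xyz"))  # []
```

A lookup follows one node per pattern character, so its cost depends on the pattern length and not on the length of the indexed text. A real suffix tree reaches the same lookup behaviour with linear space by compressing chains of single-child nodes.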