Implicit Data Structures, Sorting, and Text Indexing Jesper Sindahl Nielsen


PhD Dissertation
Department of Computer Science
Aarhus University
Denmark

A Dissertation Presented to the Faculty of Science and Technology of Aarhus University in Partial Fulfillment of the Requirements for the PhD Degree by Jesper Sindahl Nielsen, July 31, 2015.

Abstract

This thesis on data structures is in three parts. The first part deals with two fundamental space-efficient data structures: finger search trees and priority queues. The data structures are implicit, i.e. they consist only of the n input elements stored in an array of length n. We consider the problems in the strict implicit model, which allows no information to be stored except the array and its length. Furthermore, the n elements are comparable and indivisible: we cannot inspect their bits, but we can compare any pair of elements. A finger search tree is a data structure that allows for efficient lookups of the elements stored. We present a strictly implicit dynamic finger search structure with operations Search, Change-Finger, Insert, and Delete, running in O(log t), O(n^ε), O(log n), and O(log n) time, respectively, where t is the rank distance between the current finger and the query element. We also prove this structure is optimal in the strict implicit model. Next we present two strictly implicit priority queues supporting Insert and ExtractMin in O(1) and O(log n) time. The first priority queue has amortized bounds and the second has worst-case bounds; however, the first structure additionally performs only O(1) element moves amortized per operation.

The second part of the thesis deals with another fundamental problem: sorting integers. Here the model is a word-RAM with word size w = Ω(log² n log log n) bits, and the input is n integers of w bits each.
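The query-time behaviour of finger search can be illustrated on a plain sorted array: an exponential (galloping) search outward from the finger position brackets an element at rank distance t in O(log t) steps, after which a binary search on the bracket finishes the query. The sketch below illustrates only this O(log t) search bound; it is not the thesis's strict implicit structure, which also supports Change-Finger and updates.

```python
import bisect

def finger_search(arr, finger, target):
    """Return the leftmost index i with arr[i] >= target, searching
    outward from index `finger` in the sorted list `arr`.

    The step size doubles until the answer is bracketed, so the
    bracket has size O(t) and the total cost is O(log t), where t is
    the rank distance between the finger and the answer."""
    n = len(arr)
    if target > arr[finger]:
        # Gallop right: maintain arr[lo] < target, widen until bracketed.
        lo, step = finger, 1
        while lo + step < n and arr[lo + step] < target:
            lo += step
            step *= 2
        hi = min(lo + step, n)
        return bisect.bisect_left(arr, target, lo, hi)
    else:
        # Gallop left: maintain arr[hi] >= target, widen until bracketed.
        hi, step = finger, 1
        while hi - step >= 0 and arr[hi - step] >= target:
            hi -= step
            step *= 2
        lo = max(hi - step, 0)
        return bisect.bisect_left(arr, target, lo, hi + 1)
```

For example, with `arr = [1, 3, 5, 7, 9, 11]` and the finger at index 0, `finger_search(arr, 0, 7)` gallops over only the prefix near the finger before returning index 3.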
We give a randomized algorithm that sorts such integers in expected O(n) time. In arriving at our result we also present a randomized algorithm for sorting smaller integers that are packed in words. Letting b be the number of integers per word, we give a packed sorting algorithm running in time O(n/b (log n + log² b)).

The topic of the third part is text indexing. The problems considered are term proximity in succinct space, two-pattern document retrieval, and wild card indexing. In all of these problems we are given a collection of documents with total length n. For term proximity we must store the documents using the information-theoretic lower bound space (succinctly). A query is then a pattern and a value k, and the answer is the top-k documents matching the pattern. The top-k is determined by the term proximity scoring function, where the score of a document is the distance between the closest pair of occurrences of the query pattern in the document (lower is better). For this problem we show it is possible to answer queries in O(|P| + k polylog n) time, where |P| is the pattern length and n the total length of the documents. In the two-pattern problems queries are two patterns, and we must return all documents matching both patterns (Two-Pattern, 2P) or matching one pattern but not the other (Forbidden Pattern, FP). For these problems we give a solution with space O(n) words and query time O(√(nk) log^{1/2+ε} n). We also reduce boolean matrix multiplication to both 2P and FP, giving evidence that high query times are likely necessary. Furthermore we give concrete lower bounds for 2P and FP in the pointer machine model that prove near optimality of all known data structures. In the Wild Card Indexing (WCI) problem queries are patterns with up to κ wild cards, where a wild card matches any character. We give pointer machine lower bounds for WCI, proving near optimality of known solutions.
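The term proximity scoring function itself is simple to state in code. The following is a naive sketch of the ranking semantics only, not the succinct-space index: it assumes each document's (hypothetical) sorted list of occurrence positions of the pattern is already available, and exploits that in one dimension the closest pair of occurrences is always an adjacent pair.

```python
def term_proximity_score(positions):
    """Term proximity score of one document: the distance between the
    closest pair of occurrences of the query pattern (lower is better).
    `positions` are the sorted occurrence positions of the pattern; a
    document with fewer than two occurrences scores infinity."""
    if len(positions) < 2:
        return float("inf")
    # In sorted order the closest pair is an adjacent pair.
    return min(b - a for a, b in zip(positions, positions[1:]))

def top_k(docs_positions, k):
    """Rank (doc_id, positions) pairs by term proximity and keep the
    k best-scoring documents."""
    ranked = sorted(docs_positions,
                    key=lambda item: term_proximity_score(item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]
```

For instance, a document with occurrences at positions 3, 6, and 7 scores 1 and outranks a document with occurrences at 0 and 100, which scores 100. The point of the succinct structure in the thesis is to achieve this ranking in O(|P| + k polylog n) time without materializing all occurrence lists.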
Resumé

This dissertation is about data structures and consists of three parts. The first part concerns two fundamental space-efficient data structures: finger search trees and priority queues. The data structures are implicit, i.e. they consist of the n input elements stored in an array of length n. We study the problems in the strict implicit model, where only the array and the number n may be stored. Furthermore, the elements can only be compared pairwise and are otherwise indivisible, i.e. we cannot access their bits. A finger search tree is a data structure for efficiently searching among the stored elements. We describe a strictly implicit dynamic finger search structure with the operations Search, Change-Finger, Insert, and Delete, taking O(log t), O(n^ε), O(log n), and O(log n) time, respectively. Here t is the rank distance between a special element, the finger, and the query element among the n elements. We also prove that these times are optimal for strictly implicit finger search structures. We then present two strictly implicit priority queues supporting Insert and ExtractMin in O(1) and O(log n) time. The first priority queue has amortized bounds and the second has worst-case bounds; on the other hand, the first performs only O(1) element moves amortized per operation.

The second part focuses on another fundamental problem: sorting integers. We study the problem in the Random Access Machine model, where the number of bits per word (the word size) is w = Ω(log² n log log n) and the input is n integers of w bits each. We give a randomized algorithm that sorts the integers in expected linear time. Along the way we develop a randomized algorithm for sorting smaller integers packed in words. Letting b be the number of integers packed in each word, the packed sorting algorithm takes O(n/b (log n + log² b)) time.

The subject of the third part is text indexing.
The problems considered are term proximity in succinct space, two-pattern document retrieval, and document retrieval with wild cards in the query. In all of the problems we have a collection of text documents with total length n. For term proximity the data structure may only use the information-theoretic lower bound space (succinct). A query is a string P and a value k, and the answer is the top-k documents that contain P as a substring, ranked by score. The top-k documents are determined by the proximity criterion: a document's score is the distance between the two closest occurrences of P (a shorter distance is better). We show that such queries can be answered in O(|P| + k polylog n) time, where |P| is the length of P. In the two-pattern problems queries are two strings P1 and P2. In one variant we must return all documents containing both P1 and P2; in the other, all documents containing P1 but not P2. For these problems we give a data structure with O(n) space and O(√(nk) log^{1/2+ε} n) query time. We also reduce boolean matrix multiplication to both problems, which is evidence that the queries must take long. Furthermore we give pointer machine lower bounds for the two-pattern problems, showing that all known data structures are nearly optimal. In the wild card problem queries are strings with wild cards, and the result is all documents in which the query occurs when each wild card may stand for any character. For this problem we also show lower bounds in the pointer machine model, establishing that the known data structures are nearly optimal.

Preface

I have always been fond of programming, even from an early age when I played around with web pages, which later turned into server-side scripting and landed me a job at a local web development company.
In the first year of my undergraduate studies I was introduced to algorithms and data structures, which soon became my primary interest. I met Mark Greve, who arranged programming competitions and practice sessions, mostly about solving problems reminiscent of the ones we encountered at the algorithms and data structures exams. I enjoyed these programming contests, and as my abilities grew I became more and more interested in the theory of algorithms and data structures. When I started my graduate studies, one of the first courses I took was Computational Geometry, taught by Gerth Brodal. He came to learn of my interest in algorithmics and soon suggested I pursue a PhD degree with him as my advisor, and this thesis is the end result. My PhD position has been partially at Aarhus University and partially at the State and University Library. The time at the university has been mostly spent focusing on theory, whereas at the library we have focused on the practical side. At the library we have developed several tools for quality assurance in their digital audio archives, which are frequently used by their digital archiving teams. During my PhD studies I have been a co-author on 6 published papers and 2 papers under submission. In this thesis I have only included 6 of these papers (one unpublished), but the remaining two still deserve to be mentioned. The first is part of the work I did at the State and University Library on quality assurance of their digital sound archives. The paper is about a tool for checking whether a migration from one audio file format to another succeeded, where succeeded means that the content of the migrated file sounds the same as the original. This is particularly useful for libraries, since there are standards on how to store digital sound archives, and typically they receive audio formats different from their standard. Bolette Ammitzbøll Jurik and Jesper Sindahl Nielsen.
Audio quality assurance: An application of cross correlation.