Implicit Data Structures, Sorting, and Text Indexing Jesper Sindahl Nielsen
Total Page:16
File Type:pdf, Size:1020Kb
Implicit Data Structures, Sorting, and Text Indexing Jesper Sindahl Nielsen PhD Dissertation Department of Computer Science Aarhus University Denmark Implicit Data Structures, Sorting, and Text Indexing A Dissertation Presented to the Faculty of Science and Technology of Aarhus University in Partial Fulfillment of the Requirements for the PhD Degree by Jesper Sindahl Nielsen July 31, 2015 Abstract This thesis on data structures is in three parts. The first part deals with two fundamental space efficient data structures: finger search trees and priority queues. The data structures are implicit, i.e. they only consist of n input elements stored in an array of length n. We consider the problem in the strict implicit model which allows no information to be stored except the array and its length. Furthermore the n elements are comparable and indivisible, i.e. we cannot inspect their bits, but we can compare any pair of elements. A finger search tree is a data structure that allows for efficient lookups of the elements stored. We present a strict implicit dynamic finger search strucutre with operations Search, Change-Finger, Insert, and Delete, with times O(log t), O(nε), O(log n), O(log n), respectively, where t is the rank distance between the current finger and the query element. We also prove this structure is optimal in the strict implicit model. Next we present two strictly implicit priority queues supporting Insert and ExtractMin in times O(1) and O(log n). The first priority queue has amortized bounds, and the second structure’s bounds are worst case, however the first structure also has O(1) moves amortized for the operations. The second part of the thesis deals with another fundamental problem: sorting integers. In this problem the model is a word-RAM with word size w = Ω(log2 n log log n) bits, and our input is n integers of w bits each. We give a randomized algorithm that sorts such integers in expected O(n) time. In arriving at our result we also present a randomized algorithm for sorting smaller integers that are packed in words. Letting b be the number of n 2 integers per word, we give a packed sorting algorithm running in time O( b (log n + log b)). The topic of the third part is text indexing. The problems considered are term proximity in succinct space, two pattern document retrieval problems and wild card indexing. In all of the problems we are given a collection of documents with total length n. For term proximity we must store the documents using the information theoretic lower bound space (succinct). The query is then a pattern and a value k, and the answer is the top-k documents matching the pattern. The top-k is determined by the Term Proximity scoring function, where the score of a document is the distance between the closest pair of occurrences of the query pattern in the document (lower is better). For this problem we show it is possible to answer queries in O(|P |+k polylog(n)) time, where |P | is the pattern length, and n the total length of the documents. In the two pattern problem queries are two patterns, and we must return all documents matching both patterns (Two-Pattern – 2P), or matching one pattern but not the other (Forbidden Pattern√ – FP). For these problems we give a solution with space O(n) words and query time O( nk log1/2+ε n). We also reduce boolean matrix multiplication to both 2P and FP, giving evidence that high query times are likely necessary. Furthermore we give concrete lower bounds for 2P and FP in the pointer machine model that prove near optimality of all known data structures. In the Wild Card Indexing (WCI) problem queries are patterns with up to κ wild cards, where a wild card matches any character. We give pointer machine lower bounds for WCI, proving near optimality of known solutions. i Resumé Denne afhandling omhandler datastrukturer og består af tre dele. Den første del er om to grundlæggende pladseffektive datastrukturer: fingersøgningstræer og prioritetskøer. Datas- trukturerne er implicit, dvs. består af de n input elementer gemt i en tabel af længde n. Vi studerer problemerne i den stærke implicitte model, hvor det kun er tilladt at gemme tabellen og tallet n. Ydermere er det kun muligt at sammenligne alle par af de n elementer, og ellers er elementerne udelelige, dvs. vi kan ikke tilgå deres bits. Et fingersøgningstræ er en datastruktur, hvor man effektivt kan søge efter de opbevarede elementer. Vi beskriver en stærkt implicit dynamisk fingersøgningsstruktur med operationerne Search, Change- Finger, Insert og Delete, som tager henholdsvis O(log t), O(nε), O(log n) og O(log n) tid. Her er t rangafstanden mellem et specielt element, fingeren, og det efterspurgte element blandt de n elementer. Vi beviser også, at disse tider er optimale for strengt implicitte fingersøgningsstrukturer. Bagefter præsenterer vi to stærkt implicitte prioritetskøer, der understøtter Insert og ExtractMin i O(1) og O(log n) tid. Den første prioritetskø har amortiserede grænser, og den anden har værste-falds-grænser (worst case), tilgengæld har den første kun O(1) flytninger amortiseret per operation. Den anden del fokuserer på en anden grundlæggende problemstilling: sortering af heltal. Vi studerer problemet i Random Access Machine modellen hvor antallet af bits per ord (ordstørelsen) er w = Ω(log2 n log log n), og inputtet er n heltal hver med w bits. Vi giver en randomiseret algoritme, der sorterer heltallene i forventet linær tid. Undervejs udvikler vi en randomiseret algoritme til at sortere mindre heltal, der er pakket ind i ord. Lad b være antallet af heltal pakket i hvert ord, så tager algoritmen for at sortere pakkede heltal n 2 O( b (log n + log b)) tid. Emnet i tredje del er tekstindeksering. Problemstillingerne der betragtes er Udtryks- tæthed med koncis plads, To-Mønstret dokumenthentning og dokumenthentning med jokere i forespørgslen. I alle problemstillingerne har vi en samling af tekstdokumenter med to- tallængde n. For udtrykstæthed må datastrukturen kun bruge den informationsteoretiske nedre grænse i plads (koncis/succinct). Forespørgslen er en tekststreng P og en værdi k, og svaret er de top-k dokumenter, der har P som delstreng, med de bedste vurderingstal. De top-k dokumenter afgøres udfra tæthedskriteriet: et dokuments vurderingstal er afstanden mellem de to tætteste forekomster af P (kortere afstand er bedre). Vi viser, at det er muligt at besvare den slags forespørgsler i O(|P | + k polylog n) tid, hvor |P | er længden P . I To Mønstre problemerne er forespørgsler to tekststrenge P1 og P2. I den ene variant skal vi returnere alle dokumenter, der indeholder både P1 og P2, i den anden variant skal vi re- turnere alle dokumenter, der indeholder√ P1 men ikke P2. For disse problemer giver vi en datastruktur med O(n) plads og O( nk log1/2+ε n) forespørgselstid. Vi reducerer desuden boolsk matrix multiplikation til begge problemer, hvilket er belæg for at forespørgslerne må tage lang tid. Ydermere giver vi pointermaskine nedre grænser for To Mønstre problemerne, der viser at alle kendte datastrukturer er næsten optimale. I joker problemet er forespørgsler tekststrenge med jokere, og resultatet er alle dokumenter, hvor forespørgslen forekommer, når man lader jokere være et hvilket som helst bogstav. For dette problem viser vi også ne- dre grænser i pointermaskine modellen, der påviser, at de kendte datastrukturer er næsten optimale. iii Preface I have always been fond of programming, even from an early age where I played around with web pages, which later turned into server side scripting, and landed me a job at a local web development company. In the first year of my undergraduate I was introduced to algorithms and data structures, which soon became my primary interest. I met Mark Greve who arranged programming competitions and practice sessions, which were mostly about solving problems reminiscent of the ones we encountered at the algorithms and data structure exams. I enjoyed these programming contests, and as my abilities grew I became more and more interested in the theory of algorithms and data structures. When I started my graduate studies, one of the first courses I took was Computational Geometry with Gerth Brodal teaching it. He came to learn of my interest in algorithmics and soon suggested I pursue a PhD degree with him as my advisor, and this thesis is the end result. My PhD position has been partially at Aarhus University and partially at the State and University Library. The time at university has been mostly spent focusing on the theory, where as at the library we have focused on the practical side. At the library we have developed several tools for quality assurance in their digital audio archives, which are frequently used by their digital archiving teams. During my PhD studies I have been a co-author on 6 published papers, and 2 papers under submission. In this thesis I have only included 6 of these papers (one unpublished), but the remaining two still deserve to be mentioned. The first I will mention is part of the work I did at the State and University Library with quality assurance of their digital sound archives. The paper is about a tool used for checking if a migration from one audio file format to another succeeded. Here succeeded means that the content of the migrated file sounds the same as the original. This is particularly useful for libraries since there are standards on how to store digital sound archives and typically they receive audio formats different from their standard. 1 Bolette Ammitzbøll Jurik and Jesper Sindahl Nielsen. Audio quality assurance: An application of cross correlation.