Succinct Data Structures
Total Page:16
File Type:pdf, Size:1020Kb
SUCCINCT DATA STRUCTURES by Ankur Gupta Department of Computer Science Duke University Date: Approved: Jeffrey Scott Vitter, Supervisor Pankaj Agarwal Roberto Grossi Xiaobai Sun Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University 2010 ABSTRACT SUCCINCT DATA STRUCTURES by Ankur Gupta Department of Computer Science Duke University Date: Approved: Jeffrey Scott Vitter, Supervisor Pankaj Agarwal Roberto Grossi Xiaobai Sun An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University 2010 Copyright c 2010 by Ankur Gupta All rights reserved Abstract The world is drowning in data. The recent explosion of web publishing, XML data, bioinformation, scientific data, image data, geographical map data, and even email communications has put a strain on our ability to manage the information contained there. The influx of massive data sets with all kinds of features presents a number of difficulties with efficient management of storage space, organization of informa- tion, and data accessibility. A primary computing challenge in these cases is how to compress the data but still allow them to be queried quickly. This thesis addresses theoretical and algorithmic issues arising from these practical concerns for the prob- lem of compressed text indexing, where we want to maintain efficient data storage and rapid response to queries on data. The premise of data compression comes from many real-life situations, where data are often highly compressible. This compressibility constitutes a major opportunity for saving space and data query latency, and is a critical bottleneck for many applica- tions. In mobile applications, for instance, space and the power to access information are at a premium. In a streaming environment, where new data are being generated constantly, compression can also aid in prediction of upcoming trends. In the case of bioinformatics, analyzing succinct representations of DNA sequences could lead to a deeper understanding of nature, perhaps even giving hints on secondary and tertiary structure, gene evolution, and other important topics. We use text data as the subject of this particular study. We introduce a num- ber of compressed data structures for compressed text indexing that enable arbitrary searching for patterns in the provably best possible time. The methodology is distinct in that the process of searching also encompasses decoding; therefore, the original document is no longer needed. Together, these data structures can be used at mul- iv tiple levels of a compression-retrieval hierarchy to arrive at an overall text indexing solution. Some structures can be used individually as well, within or beyond the scope of text indexing. For each data structure, we provide a theoretical estimate of its space usage and query performance on a suite of operations crucial to access the stored data. In each case, we relate its space usage to the compressed size of the original data and show that the supported operations function in near-optimal or optimal time. We also present a number of experimental results using our methodology. These experiments validate our theoretical findings, and we establish that our methodology is competitive with the state-of-the-art. v Acknowledgements First and foremost, I would like to thank my advisor Jeffrey Scott Vitter. I'm not sure where I would be without his continued support and guidance. Jeff's insistence on clarity and precision is a necessary foundation for any serious graduate student, and I am grateful to have benefited from such a firm vision. I would also like to thank my committee members Roberto Grossi, Xiaobai Sun, and Pankaj Agarwal for providing careful comments on my doctoral work. Special thanks go to Roberto Grossi, who served as a collaborator and co-advisor throughout my graduate career and helped shape who I have become. I would also like to thank Rahul Shah and Wing-Kai Hon, both of with whom I enjoyed working and socializing. Finally, I would like to thank Jon Sorenson, who offered a sounding board in the final stages of writing. I would like to thank my family for providing love and encouragement. My par- ents, Umesh and Manju Gupta, and my brother Parag Gupta, were always there when I most needed someone. I could not have completed this work without them. I cannot begin to express in words the impact my wife Diksha had on me during the final stages of my studies; her concern for and patience with long hours and demanding schedules is truly amazing. Finally, my grandfather Ramswaroop Gupta has always been a quiet strength in my life, with a deep calm and a focus on the simple things. I hope one day to reach that pedestal. I have a long list of friends whose companionship has broadened my life: Matt Taylor, Rex Robinson, Sharlotte Greer, Tylan Watts, Andrew Strack, Priya Mahade- van, Justin Moore, Kristina Killgrove, Patrick Reynolds, David Cherryholmes, and Aaron Miller to name just a few. I am glad to have met them. I would like to offer thanks to Michael E. Durbin, who advised me while I was vi at the University of Texas at Dallas. His mentorship played a big part in fueling my enthusiasm towards Computer Science. I would also like to thank Diane Riggs, in the Department of Computer Science at Duke University. She was always there to offer help to students, whether it be paperwork, scheduling, or just a sympathetic ear. vii Notice of Revision Owing to a mistake on the part of the author, the thesis document submitted for graduation on December 14, 2007 contained errors in the results and analysis of Section 2.4.4 regarding t-subset encoding, as pointed out by the thesis committee. As of the date of this revision, the author had not fixed these errors. This revised document removed the section in question, as well as all references to this material. No results in the revised document depend on the removed section. Some fixes were made in other parts of Section 2.4 as well. Discussion of related work was not updated and hence may not reflect new advances after December 2007. This version of the thesis supersedes all previous versions of the thesis. The author regrets the mistake and apologizes for the introduction of errors in technical results. The author also apologizes for any inconvenience to interested readers, who may have been led astray by the earlier incorrect results. The revised version of this thesis and this notice were approved on 2 July 2010. viii Contents Abstract iv Acknowledgements vi Notice of Revision viii List of Figures xiv List of Tables xv 1 Introduction 1 1.1 Text Compression and Text Indexing . 3 1.2 Dictionaries and Data-Aware Measures For Set Data . 5 1.3 Dynamizing Succinct Data Structures . 6 2 An Algorithmic Framework for Compression and Text Indexing 8 2.1 Introduction . 8 2.1.1 Text Compression . 8 2.1.2 Compressed Text Indexing . 11 2.1.3 Outline of Chapter . 16 2.2 High-Order Empirical Entropy . 16 2.2.1 Empirical Probabilistic High-Order Entropy . 16 2.2.2 Finite Set High-Order Entropy . 19 2.3 The Unified Algorithmic Framework: Tighter Analysis for the BWT . 22 2.3.1 The BWT and (Compressed) Suffix Arrays . 23 2.3.2 Context-Based Partitioning of the BWT . 25 2.4 Encoding Sublists in High-Order Entropy . 29 ix 2.4.1 Individually Encoded Sublists . 30 2.4.2 The Space Redundancy of Encoding Multiple Sublists . 33 2.4.3 The Wavelet Tree . 38 2.5 Encoding the Empirical Statistical Model . 42 2.5.1 Definitions and a Simple Bound . 43 2.5.2 Nearly Tight Upper Bound on M(T; Σ; h) . 44 2.6 Nearly Tight Lower Bounds for the BWT . 49 2.6.1 Constructing δ-resilient Texts . 51 2.6.2 Encoding a δ-resilient Text . 55 2.7 Random Access to the Compressed Representation of LF and Φ . 57 2.7.1 Wavelet Trees as Succinct Dictionaries . 58 2.7.2 Random Access to the Compressed Representation of Φ . 62 2.7.3 Random Access to the Compressed Representation of LF . 66 2.8 Using the Framework for Compressed Suffix Arrays . 67 2.8.1 Compressed Suffix Arrays (CSAs) . 67 2.8.2 High-Order Entropy-Compressed Suffix Arrays . 71 2.9 Applications to Text Indexing . 78 2.9.1 High-Order Entropy-Compressed Text Indexing . 79 2.9.2 A Pattern Matching Tool . 82 2.10 Conclusions . 85 3 When Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications 88 3.1 Introduction . 88 3.1.1 Our Results . 89 x 3.1.2 Outline of Chapter . 91 3.2 A Simple Yet Powerful Dictionary . 91 3.2.1 Practical Dictionaries . 92 3.2.2 Empirical Distribution of RLE Values and γ Codes . 96 3.2.3 Statistical Evidence Justifying γ Codes . 99 3.2.4 Fast Access of Experimental-Analysis-Driven Dictionaries . 102 3.3 Review of Wavelet Trees . 104 3.3.1 Efficient Construction of the Wavelet Tree . 108 3.3.2 Compression with bwt2wzip . 109 3.3.3 Decompression with wzip2bwt . 112 3.3.4 Performance and Experiments for wzip . 113 3.4 Practical Suffix Arrays: Indexing Equals Compression . 115 3.4.1 Compressed Suffix Arrays (CSA) . 115 3.4.2 Practical Considerations for Compressed Suffix Arrays . 117 3.4.3 Suffix Array Compression . 120 3.4.4 Suffix Array Functionalities . 122 3.5 Space-Efficient Suffix Trees . 123 3.6 Conclusions . 126 4 Compressed Dictionaries and Data-Aware Measures 128 4.1 Introduction .