In-Memory Storage for Labeled Tree-Structured Data

by

Gelin Zhou

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science

Waterloo, Ontario, Canada, 2017

© Gelin Zhou 2017

Examining Committee Membership

The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote.

External Examiner: Johannes Fischer, Professor
Supervisor: J. Ian Munro, Professor
Co-Supervisor: Meng He, Assistant Professor
Internal Examiner: Gordon V. Cormack, Professor
Internal Examiner: Therese Biedl, Professor
Internal-external Examiner: Gordon B. Agnew, Associate Professor

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.

Abstract

In this thesis, we design in-memory data structures for labeled and weighted trees, so that various types of path queries or operations can be supported with efficient query time. We assume the word RAM model with word size w, which permits random accesses to w-bit memory cells. Our data structures are space-efficient and many of them are even succinct. These succinct data structures occupy space close to the information-theoretic lower bounds of the input trees within lower order terms.

First, we study the problems of supporting various path queries over weighted trees. A path counting query asks for the number of nodes on a query path whose weights lie within a query range, while a path reporting query asks for these nodes to be reported. A path median query asks for the median weight on a path between two given nodes, and a path selection query returns the k-th smallest weight.
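To make the four query types concrete, the following is a minimal brute-force sketch that walks the query path explicitly. It is a reference baseline for the definitions only, not one of the data structures designed in this thesis: each query takes time proportional to the path length. The class `WeightedTree`, the parent-pointer layout, the function names, and the choice of the lower median (the ⌈m/2⌉-th smallest of m weights) are all assumptions made for illustration.

```python
from typing import List, Optional


class WeightedTree:
    """Rooted tree given by parent pointers; nodes are 0..n-1 (hypothetical layout)."""

    def __init__(self, parent: List[Optional[int]], weight: List[int]) -> None:
        self.parent = parent
        self.weight = weight
        self.depth = [self._depth(v) for v in range(len(parent))]

    def _depth(self, v: int) -> int:
        # Count edges from v up to the root (the node whose parent is None).
        d = 0
        while self.parent[v] is not None:
            v = self.parent[v]
            d += 1
        return d

    def path(self, u: int, v: int) -> List[int]:
        """Nodes on the path from u to v, both endpoints included."""
        left: List[int] = []
        right: List[int] = []
        # Lift the deeper endpoint until both are at the same depth.
        while self.depth[u] > self.depth[v]:
            left.append(u)
            u = self.parent[u]
        while self.depth[v] > self.depth[u]:
            right.append(v)
            v = self.parent[v]
        # Lift both endpoints together until they meet at the lowest common ancestor.
        while u != v:
            left.append(u)
            u = self.parent[u]
            right.append(v)
            v = self.parent[v]
        return left + [u] + right[::-1]


def path_report(t: WeightedTree, u: int, v: int, lo: int, hi: int) -> List[int]:
    """Nodes on the u-v path whose weights lie in [lo, hi]."""
    return [x for x in t.path(u, v) if lo <= t.weight[x] <= hi]


def path_count(t: WeightedTree, u: int, v: int, lo: int, hi: int) -> int:
    """Number of nodes on the u-v path whose weights lie in [lo, hi]."""
    return len(path_report(t, u, v, lo, hi))


def path_select(t: WeightedTree, u: int, v: int, k: int) -> int:
    """The k-th smallest weight on the u-v path (k is 1-based)."""
    weights = sorted(t.weight[x] for x in t.path(u, v))
    if not 1 <= k <= len(weights):
        raise ValueError("k must be between 1 and the number of nodes on the path")
    return weights[k - 1]


def path_median(t: WeightedTree, u: int, v: int) -> int:
    """Lower median weight on the u-v path (assumed convention)."""
    m = len(t.path(u, v))
    return path_select(t, u, v, (m + 1) // 2)
```

For example, with `t = WeightedTree([None, 0, 0, 1, 1, 2], [5, 3, 8, 1, 9, 4])`, the path from node 3 to node 5 visits nodes 3, 1, 0, 2, 5 with weights 1, 3, 5, 8, 4, so `path_count(t, 3, 5, 3, 5)` evaluates to 3 and `path_median(t, 3, 5)` evaluates to 4.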
We design succinct data structures to support path counting queries in $O(\lg\sigma/\lg\lg n + 1)$ time, path reporting queries in $O((\mathit{occ} + 1)(\lg\sigma/\lg\lg n + 1))$ time, and path median and path selection queries in $O(\lg\sigma/\lg\lg\sigma)$ time, where $n$ is the size of the input tree, the weights of nodes are drawn from $[1..\sigma]$ and $\mathit{occ}$ is the size of the output. Our results not only greatly improve the best known data structures [31, 75, 65], but also match the lower bounds for path counting, median and selection queries [86, 87, 71] when $\sigma = \Omega(n/\mathrm{polylog}(n))$.

Second, we study the problem of representing labeled ordinal trees succinctly. Our new representations support a much broader collection of operations than previous work. In our approach, labels of nodes are stored in a preorder label sequence, which can be compressed using any succinct representation of strings that supports access, rank and select operations. Thus, we present a framework for succinct representations of labeled ordinal trees that is able to handle large alphabets. This answers an open problem presented by Geary et al. [54], which asks for representations of labeled ordinal trees that remain space-efficient for large alphabets. We further extend our work and present the first succinct representations for dynamic labeled ordinal trees that support several label-based operations including finding the level ancestor with a given label.

Third, we study the problems of supporting path minimum and semigroup path sum queries. In the path minimum problem, we preprocess a tree on $n$ weighted nodes, such that given an arbitrary path, the node with the smallest weight along this path can be located. We design novel succinct indices for this problem under the indexing model, for which weights of nodes are read-only and can be accessed with ranks of nodes in the preorder traversal sequence of the input tree.
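The indexing model described above can be illustrated with a small sketch in which the index stores only the tree topology and reads weights through a read-only accessor keyed by preorder rank. This is a naive baseline that walks the path node by node, not one of the succinct indices of this thesis. The class name `NaivePathMinIndex`, the 0-based preorder ranks, the parent-array encoding, and the `weight_at` callback are all assumptions made for illustration.

```python
from typing import Callable, List


class NaivePathMinIndex:
    """Topology-only index; nodes are identified by 0-based preorder rank (assumed)."""

    def __init__(self, parent: List[int]) -> None:
        # parent[i] is the preorder rank of the parent of node i; the root is
        # node 0 and has parent -1. No weights are stored in the index.
        self.parent = parent
        self.depth = [0] * len(parent)
        for i in range(1, len(parent)):
            # In preorder a parent precedes its children, so its depth is known.
            self.depth[i] = self.depth[parent[i]] + 1

    def path_min(self, u: int, v: int, weight_at: Callable[[int], int]) -> int:
        """Preorder rank of a minimum-weight node on the u-v path.

        weight_at(i) is the read-only accessor for the weight of the node
        with preorder rank i; it is supplied by the caller, not the index.
        """
        best = u if weight_at(u) <= weight_at(v) else v
        while u != v:
            # Always lift the deeper endpoint (either one if depths are equal).
            if self.depth[u] < self.depth[v]:
                u, v = v, u
            u = self.parent[u]
            if weight_at(u) < weight_at(best):
                best = u
        return best
```

As a usage example, take `index = NaivePathMinIndex([-1, 0, 1, 1, 0, 4])` and a weight array `weights = [5, 3, 1, 9, 8, 4]` held outside the index. The call `index.path_min(3, 5, weights.__getitem__)` examines the path 3, 1, 0, 4, 5 with weights 9, 3, 5, 8, 4 and returns 1. Keeping the weights behind an accessor mirrors the separation between the index and the read-only input.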
One of our index structures supports queries in $O(\alpha(m, n))$ time, and occupies $O(m)$ bits of space in addition to the space required for the input tree, where $m$ is an integer greater than or equal to $n$ and $\alpha(m, n)$ is the inverse-Ackermann function. Following the same approach, we also develop succinct data structures for semigroup path sum queries, for which a query asks for the sum of weights along a given query path. Then, using the succinct indices for path minimum queries, we achieve three different time-space tradeoffs for path reporting queries.

Finally, we study the problems of supporting various path queries in dynamic settings. We propose the first non-trivial linear-space solution that supports path reporting in $O((\lg n/\lg\lg n)^2 + \mathit{occ} \cdot \lg n/\lg\lg n)$ query time, where $n$ is the size of the input tree and $\mathit{occ}$ is the output size, and the insertion and deletion of a node of an arbitrary degree in $O(\lg^{2+\epsilon} n)$ amortized time, for any constant $\epsilon \in (0, 1)$. Obvious solutions based on directly dynamizing solutions to the static version of this problem all require $\Omega((\lg n/\lg\lg n)^2)$ time for each node reported. We also design data structures that support path counting and path reporting queries in $O((\lg n/\lg\lg n)^2)$ time, and insertions and deletions in $O((\lg n/\lg\lg n)^2)$ amortized time. This matches the best known results for dynamic two-dimensional range counting [62] and range selection [63], which can be viewed as special cases of path counting and path selection.

Acknowledgements

First of all, I would like to express my sincere gratitude to my supervisor, Professor J. Ian Munro, for his continuous guidance and support throughout my graduate programs. It has been an honor and a great experience to work with and learn from Professor Munro during my master's and doctoral studies. I decided to take an industry job and move to San Francisco 3.5 years ago.
Professor Munro fully supported my choice and suggested that I continue my PhD program as a part-time student. This thesis would not have been completed without his support and encouragement.

My sincere thanks also go to my co-supervisor, Professor Meng He, for the great effort he has spent on proofreading my thesis and the papers we collaborated on. He has provided a lot of helpful and insightful feedback that greatly improved the quality of my thesis. I can hardly thank him enough for doing that.

Besides my supervisor and co-supervisor, I would like to thank the rest of my thesis committee, Professor Gordon V. Cormack, Professor Therese Biedl, Professor Gordon B. Agnew, and Professor Johannes Fischer, for spending time to review my thesis and attend my defense. The comments they provided have further improved the presentation of my thesis. I also want to thank Professor Joseph Cheriyan for serving as the defense chair. Big thanks to our department's administrative coordinators, Wendy Rush and Margaret Towell, for all the help they have provided me.

During my graduate studies, Professor Timothy M. Chan taught me a lot about geometric data structures, and the discussions and the collaborations with him were always interesting and informative. Before moving to San Francisco, I was fortunate to work with Professor Venkatesh Raman and Professor Moshe Lewenstein. I learned a lot from the discussions with them.

When I was preparing for the thesis defense, I was told that Professor Alejandro López-Ortiz is fighting a fatal disease. He is a reader of my Master's thesis and a committee member of my PhD Comprehensive-II exam. He is such a nice person and probably a friend of every graduate student of the Algorithms and Complexity Group. I would like to thank him for all his help and I wish him well.
I thank my office mates for the fruitful discussions and for all the fun we have had during the years I have spent at Waterloo: Francisco Claude, Reza Dorrigiv, Robert Fraser, Shahin Kamali, Daniela Maftulea, Patrick K. Nicholson, Alejandro Salinger, and Diego Seco. I also would like to thank Eric Y. Chen, Arash Farzan, Alexander Golynski, and Derek Phillips for sharing their industrial experience with me.

Special thanks to Marilyn Miller, Professor Munro's wife, for her kindness, her help, and the delicious desserts she made. I still remember the advice for buying suits she gave to me before my Master's convocation.

Last but not least, I would like to thank my wife, Corona Luo, for her love and her support.

Dedication

This is dedicated to all the bit manipulations that make this world a better place.

Table of Contents

List of Tables
List of Figures

1 Introduction
  1.1 Organization of the Thesis
2 Preliminaries
  2.1 Models of Computation
  2.2 Notation
  2.3 Bit Vectors
  2.4 Sequences
    2.4.1 Wavelet Trees and the Ball-Inheritance Problem
  2.5 Ordinal Trees
    2.5.1 Balanced Parentheses
    2.5.2 Tree Covering
    2.5.3 Tree Extraction
    2.5.4 Restricted Topological Partitions
3 Static Succinct Data Structures for Path Queries
  3.1 Introduction
    3.1.1 Previous Work
    3.1.2 Our Contributions
    3.1.3 The Organization of This Chapter
  3.2 Applying Tree Extraction to Path Queries
  3.3 Data Structures under the Pointer Machine Model
  3.4 Word RAM Data Structures with Reduced Space Cost
  3.5 Succinct Data Structures with Improved Query Time
    3.5.1 Succinct Ordinal Trees over an Alphabet of Size $O(\lg n)$
    3.5.2 Path Counting and Reporting