Exploiting Non-Uniform Query Distributions in Data Structuring Problems

by

John Howat

A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science

Carleton University Ottawa, Ontario

© 2012 John Howat

Abstract

This thesis examines data structures with query times that are a function of the distribution of queries made to them. When a query distribution exhibits non-uniformity, as is often the case in many applications, the sequence of queries can often be executed faster. Several such problems are considered. For the dictionary problem, this thesis presents the first binary search tree to achieve the working-set property for individual queries in the worst case, a data structure that supports searching from arbitrary temporal fingers, and a data structure that supports a stronger version of the unified property. For the predecessor search problem in bounded universes, this thesis presents a data structure that answers queries in time that is a function of the distance between the query and its answer (which leads to several applications in the areas of nearest neighbour search and range searching), as well as several data structures that answer queries in time that is a function of the entropy of the query distribution and various space requirements.

Acknowledgements

A PhD thesis does not happen in isolation. I owe my supervisors, Prosenjit Bose and Pat Morin, a large debt of gratitude for their assistance during my studies. Their willingness to work with me and read countless drafts of papers has been incredible. I cannot imagine how many times I've knocked on their doors over the years and asked, "Got a second?"

My thanks are also due to the other members of my thesis committee: Paola Flocchini, Ian Munro, Michiel Smid, and Brett Stevens. Their feedback was invaluable. I also gratefully acknowledge the generous funding I have received from Carleton University, the School of Computer Science, the Computational Geometry Laboratory, and the Natural Sciences and Engineering Research Council of Canada.

It has been a pleasure to be a part of the Computational Geometry Laboratory at Carleton University for the past five years. This lab would not be possible without Prosenjit Bose, Anil Maheshwari, Pat Morin, and Michiel Smid, as well as our many students and postdoctoral fellows (past and present). Our frequent seminars, open problem sessions, and social activities have all resulted in a happy and productive environment. The laboratory was lucky enough to host a number of visitors during my time here. In particular, discussions with Rolf Fagerberg and John Iacono contributed to some parts of this thesis.

Like any good graduate student at Carleton, I spent a completely reasonable amount of time at Mike's Place. My sincere thanks are owed to my fellow graduate students in other departments for joining me there. In particular, I thank Matthew Meier, Tara Ogaick, and Elise Vist for helping me unwind after long days of research.

I am fortunate to have an incredibly supportive family. I can only begin to offer my thanks: my father, Robert Howat, for starting my journey with an old 386; my late mother, Judy Howat, for being my biggest advocate; my sister, Laurie Howat, for being my number one fan; my stepmother, Debbie Howat, for letting me talk through everything; and more extended family than I can enumerate. I could not have asked for more support.

Finally, I thank my wonderful wife, Bianca Howat, for helping me through graduate school in more ways than I suspect either of us realize. I would not have been able to complete this thesis without her love and support (and occasional interest in the Fibonacci numbers).

Contents

Abstract

Acknowledgements

1 Introduction
1.1 Motivation
1.2 Problem Statements
1.3 Summary of Contributions
1.4 Bibliographic Notes
1.5 Organization of the Thesis

2 Background
2.1 Optimum Binary Search Trees
2.2 Defining Non-Uniformity
2.2.1 Static and Dynamic Optimality
2.2.2 Key-Independent Optimality
2.2.3 The Working-Set Property
2.2.4 The Queueish Property
2.2.5 The Static and Dynamic Finger Properties
2.2.6 The Unified Property
2.3 Splay Trees
2.4 Some Distribution-Sensitive Data Structures

3 The Working-Set Property in Binary Search Trees
3.1 Problem Definition
3.1.1 Background
3.1.2 The Working-Set Structure
3.1.3 Model
3.2 Definition of Layered Working-Set Trees
3.2.1 Tree Decomposition
3.2.2 Encoding the Linked Lists
3.3 Operations on Layered Working-Set Trees
3.3.1 Intra-Layer Operations
3.3.2 Inter-Layer Operations
3.3.3 Tree Operations
3.4 Conclusion and Open Problems

4 Searching with Temporal Fingers
4.1 Problem Definition
4.1.1 Defining Temporal Distance
4.1.2 Background
4.2 The Data Structure
4.2.1 The Old Data Structure
4.2.2 The Young Data Structure
4.2.3 Performing a Query
4.2.4 Access Cost
4.3 Conclusion and Open Problems

5 The Strong Unified Property
5.1 Problem Definition
5.1.1 Defining the Strong Unified Property
5.1.2 Background
5.2 Towards the Strong Unified Property
5.2.1 The Data Structure
5.2.2 Analysis
5.3 Conclusion and Open Problems

6 Distance-Sensitive Predecessor Search in Bounded Universes
6.1 Problem Definition
6.1.1 Background
6.1.2 x- and y-fast Tries
6.2 A Preliminary Data Structure
6.3 Reducing Space and Supporting Updates
6.4 Applications
6.4.1 Approximate Nearest Neighbour Queries
6.4.2 Range Searching
6.5 Conclusion and Open Problems

7 Biased Predecessor Search in Bounded Universes
7.1 Problem Definition
7.1.1 Background
7.2 Supporting Logarithmic Query Time
7.2.1 Using $O(n + U^{\varepsilon})$ Space
7.2.2 Using $O(n + 2^{\log^{1/c} U})$ Space
7.2.3 Alternate Solution
7.3 Supporting Linear Space
7.4 Conclusion and Open Problems

8 Conclusion
8.1 Summary of Contributions
8.2 Open Problems

Bibliography

List of Figures

3.1 The working-set structure
3.2 The decomposition of a layered working-set tree into layers
3.3 Encoding linked lists into a layered working-set tree
3.4 The MoveUp(x) operation in a layered working-set tree
3.5 The MoveDown(x) operation in a layered working-set tree
4.1 A schematic of the data structure for temporal finger searching
5.1 A schematic of the data structure for the strong unified property
6.1 Searching for the predecessor of a query

Chapter 1

Introduction

In this thesis, we present methods to exploit non-uniform query distributions in data structuring problems. Data structures that exhibit behaviour of this kind are typically termed distribution-sensitive. The idea of a query distribution being non-uniform is a deliberately vague notion; distribution-sensitive data structures can take advantage of many different types of underlying patterns in a query distribution, and we will elaborate on the specific types of patterns that are generally studied in Chapter 2.

Most distribution-sensitive data structures operate using the same high-level principle. For a given problem, a data structure that does not take advantage of the query distribution has a query time that is usually a function of the size of the data structure, and a (possibly matching) lower bound may also exist. Some quantity related to a property of the query distribution is defined and a new data structure is created such that its query time is a function of this quantity as opposed to the size of the data structure. Typically, this quantity is, in the worst case, no larger than the size of the data structure. As a result, any existing lower bounds are not violated. However, in many cases, it is possible to beat known lower bounds, at least for some queries (or sequences of queries).

In this chapter we motivate the use of distribution-sensitive data structures in Section 1.1. We then describe the problems to be addressed in this thesis in Section 1.2. The contributions of this thesis are outlined in Section 1.3. Most of these contributions have been or are in the process of being published; bibliographic details appear in Section 1.4. We conclude the introduction by outlining the organization of the remainder of the thesis in Section 1.5.

1.1 Motivation

Distribution-sensitive data structures are often motivated by practical concerns, since the types of queries handled in real-world applications are generally not uniformly random. For example, imagine a file system on a server that handles a large number of users.

One possible source of non-uniformity is that some files get accessed much more frequently than others (e.g., programs essential to the operating system or commonly-accessed documents, such as a thesis), and an efficient file system may wish to speed up access to these files. The same could be said of users, as well. Users who log in to the system infrequently (such as guest users) might have their files retrieved more slowly in order to speed up the access of more frequent users.

Another source of non-uniformity could be the relative location of files. During a directory traversal, for example, a user might be likely to access files that are, in some sense, close together (either physically on disk or abstractly in the directory structure). If directory traversal is a common operation, speeding up such accesses would be beneficial.

Perhaps the clearest source of non-uniformity would be an explicit description of the distribution of queries to files. Each file would have some specified probability of being requested by a user. Naturally, we would expect frequently requested files to be accessed quickly, while files that are seldom requested could be accessed relatively slowly. Having complete information about the distribution of queries is, in general, a somewhat unrealistic assumption. Nevertheless, this information can be approximated empirically.

1.2 Problem Statements

This thesis addresses the following problems.

The Working-Set Property in Binary Search Trees. The working-set property roughly states that queries are fast when their answer has been reported recently. Several data structures have this property (as we shall see in Chapter 2), but those that have it even in the worst case (as opposed to the expected sense or amortized sense) do not belong to the binary search tree model of computation. Our goal in this line of research is to obtain the working-set property in the worst case in the binary search tree model. Since some binary search trees have the working-set property in the amortized sense, one can also view this problem as de-amortizing the working-set property in the binary search tree model.

Searching with Temporal Fingers. While the working-set property states that queries are fast when their answer has been reported recently, the queueish property conversely states that queries are fast when their answer has not been reported recently. We propose a generalization of these two properties: queries are fast when they are close to the element accessed a certain number of distinct queries ago (i.e., the temporal finger). In this line of research, we desire data structures with running times that satisfy both the working-set and queueish properties, depending on the choice of temporal finger.

The Strong Unified Property. The unified property states that a query is fast if it is close to a recent query. Here, "close" refers to the rank distance among all elements stored in the data structure. We propose a strengthened version of this property that allows us to ignore less-recent elements when measuring rank distance. Essentially, the distance we consider is the distance among recently-accessed elements instead of among all elements. In this line of research, our goal is to achieve data structures that satisfy this property.

Distance-Sensitive Predecessor Search in Bounded Universes. Searching for predecessors (or successors) is a fundamental problem in computer science. Given a set of elements drawn from a large, totally-ordered universe, we wish to know the largest element of the set that is less than or equal to (i.e., the predecessor of) a given query drawn from the universe. In the case of bounded universes, there are some (known) bounds on the query time that are based on the size of the universe. This line of research seeks data structures that use the distance (measured as the difference between elements) between the query and the predecessor to bound the time needed to perform a predecessor search, as opposed to the size of the universe.

Biased Predecessor Search in Bounded Universes. This line of research returns to the problem of predecessor search, but instead seeks query times that are related to the query distribution itself: each element of the universe is associated with a probability that it is the query. The data structure is then tasked with performing a predecessor search in time that is a function of the probability of receiving that query.

1.3 Summary of Contributions

The following is a list of the contributions of this thesis. Each result is categorized into one of the five problems described in Section 1.2. Relevant theorems are included.

The Working-Set Property in Binary Search Trees

In Theorem 3.4, we establish the existence of the first binary search tree that has the working-set property on a per-query basis in the worst case (instead of the amortized sense). This closes the gap between data structures that are binary search trees but only have the working-set property in the amortized sense and data structures that have this version of the working-set property but are not binary search trees.

Searching with Temporal Fingers

In Theorem 4.1, we describe a data structure that supports queries in time that is logarithmic in the distance between the query element and a pre-specified temporal finger, plus a small additive term. Despite this additional query overhead, it should be noted that by choosing the temporal finger that corresponds to the working-set property or the queueish property, we match the bounds of known data structures (i.e., without this overhead). This is the first dictionary data structure to provide this temporal finger property.

The Strong Unified Property

In Theorem 5.1, we establish a data structure that comes to within a small additive term of achieving the strong unified property, a property we introduce in this thesis. The property roughly states that accesses are fast if they are close (among certain recently accessed elements) to a recently accessed element. This is the first data structure to strengthen the unified property in this manner.

Distance-Sensitive Predecessor Search in Bounded Universes

In Theorem 6.8, we describe a data structure for predecessor search in a bounded universe with a query time that is a function of the distance between the query and the element returned by the data structure, where the distance is the difference between the query element and its predecessor. The data structure supports insertions and deletions of elements in the same time bound.

Theorem 6.9 gives a data structure for the dynamic approximate nearest neighbour problem in bounded universes. The query time of this data structure is a function of the Euclidean distance between the query point and the point returned. Insertions and deletions are also supported in the same time bound.

Theorems 6.10, 6.11 and 6.12 describe data structures for range searching in a bounded, two-dimensional universe. Supported query regions include quadrants (i.e., dominance reporting), half-infinite regions (unbounded in one direction), and orthogonal range queries. The query times are a function of the dimensions of the region queried (plus the number of points contained in the region).

Biased Predecessor Search in Bounded Universes

In Theorem 7.2, we give a data structure for the predecessor search problem in a bounded universe whose expected query time is logarithmic in the entropy of the query distribution and whose space is asymptotically less than any fractional power of the universe size, but still super-linear in the number of elements stored.

Theorem 7.4 extends the previous result (at the cost of a small increase in the space) to provide a data structure whose expected query time is a function of an element's weight, where the weight of an element is a non-negative number specified at the time the data structure is constructed.

Theorem 7.5 achieves linear space for the same problem at the cost of a query time that is at most the square root of the entropy.

1.4 Bibliographic Notes

Most of the results in this thesis have been or are in the process of being published.

The results of Chapter 3 appeared in the Proceedings of the 9th Latin American Theoretical Informatics Symposium [14]. My coauthors were Prosenjit Bose, Karim Douieb, and Vida Dujmovic. This work has subsequently been published in Algorithmica [16].

The results of Chapter 6 appeared in the Proceedings of the 22nd Canadian Conference on Computational Geometry [15]. My coauthors were Prosenjit Bose, Karim Douieb, Vida Dujmovic, and Pat Morin. This work has subsequently been published in a special issue of Computational Geometry: Theory and Applications [17].

The results of Chapter 4 arose from discussions with Prosenjit Bose and Pat Morin. The results of Chapter 5 arose from discussions with Prosenjit Bose, John Iacono, and Pat Morin. The results of Chapter 7 arose from discussions with Prosenjit Bose, Rolf Fagerberg, and Pat Morin.

1.5 Organization of the Thesis

The remainder of the thesis is organized in the following way.

Chapter 2 presents a review of the relevant literature and background used in the remainder of the thesis. We will describe the history of distribution-sensitive data structures, present properties of distributions that will be studied and survey several data structures. Note that problem-specific background is deferred until the relevant chapter.

Chapter 3 shows how to obtain a data structure that fits into the binary search tree model and has the working-set property even in the worst case. This unifies previous results that obtain the working-set property in the worst case but are not in the binary search tree model with results that are in the binary search tree model but only guarantee the working-set property in the amortized sense.

Chapter 4 defines the notion of temporal fingers for the dictionary problem and proceeds to establish a data structure that achieves a query time that is logarithmic in this temporal distance to within a small additive term.

Chapter 5 defines a stronger version of the unified property and presents a data structure that achieves this property to within a small additive term.

Chapter 6 shows how to solve the predecessor search problem in bounded universes in time proportional to the distance between the query and the predecessor of the query.

Chapter 7 shows how to solve the predecessor search problem in bounded universes using a different metric: each element of the universe is assigned a probability and the query time is a function of the probability that that element is the query.

The thesis concludes with Chapter 8, which summarizes the results and open problems contained in the thesis.

Chapter 2

Background

In this chapter, we present a summary of the literature relevant to distribution-sensitive data structures. As mentioned in the previous chapter, we defer problem-specific background until the relevant chapter. We begin with a review of optimum binary search trees in Section 2.1. The notion of a "non-uniform" query distribution is then comprehensively addressed in Section 2.2. Armed with these definitions, we turn our attention to splay trees (which demonstrate several of these properties) in Section 2.3. We conclude our review of the literature with a survey of several other distribution-sensitive data structures in Section 2.4.

2.1 Optimum Binary Search Trees

Perhaps the earliest data structuring result that can be considered distribution-sensitive is that of optimum binary search trees, introduced in 1971 by Knuth [48].

An optimum binary search tree is a binary search tree on $n$ keys that has average search cost that is no larger than that of any other binary search tree built on those $n$ keys. Note that finding such a tree is not trivial, and an exhaustive approach is hopeless since it is well-known that there are an exponential number of possible binary search trees on a given number of keys.

To construct such a tree, one defines a query probability for each key.¹ The construction algorithm uses dynamic programming to construct the tree in $O(n^2)$ time. The average search cost for a key drawn from the specified probability distribution is the sum (over all keys) of the probability associated with a key multiplied by the cost to search for that key in the tree (i.e., the depth of the node containing that key).

It is important to note that the optimum binary search tree has the minimum average search cost over all binary search trees built on the given keys: there is no asymptotic notation required. This is, in general, a fairly lofty goal (as evidenced by the large amount of time required to construct the tree), and we typically settle for such results to within constant factors.

One important limitation of optimum binary search trees is that they require a priori knowledge of the query distribution, which (in many applications) may not be available.

¹One may also wish to consider the case when not all queries are stored in the tree (i.e., some queries are unsuccessful). In this case, one also defines the probability that a query lies between two adjacent keys for each such pair.
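As an illustration of the kind of computation involved, the following sketch evaluates the textbook dynamic program for the minimum expected search cost over keys with given query probabilities (successful searches only). It is not Knuth's exact formulation: the function name is mine, and the straightforward recurrence below runs in $O(n^3)$ time, whereas Knuth's monotonicity observation on the optimal roots yields the $O(n^2)$ bound cited above.

```python
def optimum_bst_cost(p):
    """Minimum expected search cost of a BST over keys 0..n-1, where p[i] is
    the probability that key i is queried (successful searches only).

    cost[i][j] = minimum expected cost of a subtree over keys i..j."""
    n = len(p)
    # Prefix sums of probabilities, so that weight(i, j) is O(1).
    pref = [0.0] * (n + 1)
    for i, pi in enumerate(p):
        pref[i + 1] = pref[i] + pi
    weight = lambda i, j: pref[j + 1] - pref[i]

    cost = [[0.0] * n for _ in range(n)]
    for length in range(1, n + 1):            # subtree sizes, smallest first
        for i in range(n - length + 1):
            j = i + length - 1
            if i == j:
                cost[i][j] = p[i]
                continue
            best = float("inf")
            for r in range(i, j + 1):         # try every key r as the root
                left = cost[i][r - 1] if r > i else 0.0
                right = cost[r + 1][j] if r < j else 0.0
                best = min(best, left + right)
            # Every key in this subtree incurs one more comparison at the root.
            cost[i][j] = best + weight(i, j)
    return cost[0][n - 1]

# A heavily skewed distribution favours placing key 0 near the root.
print(optimum_bst_cost([0.6, 0.2, 0.1, 0.1]))
```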

2.2 Defining Non-Uniformity

Optimum binary search trees exploit the most obvious form of non-uniformity: explicit non-uniformity. In this section, we discuss other patterns and properties of queries that are commonly observed in data structuring problems. We defer examples of data structures with these properties until Section 2.4. For simplicity, we describe these properties as they are used for static dictionary data structures. Such data structures maintain a static set $S = \{s_1, s_2, \ldots, s_n\}$ of $n$ keys under the search operation, where we are given a key and must return the corresponding element in the data structure. The sequence of queries is given online and is denoted $A = (a_1, a_2, \ldots, a_m)$, where $a_t \in S$ (for $1 \le t \le m$) and $m$ is the length of the query sequence. We also assume a sufficiently long query sequence, so that $m$ is $\Omega(n \log n)$. Throughout this thesis, we use $\log x$ to denote $\max\{1, \log_2 x\}$.

2.2.1 Static and Dynamic Optimality

Consider the frequency $f(x)$ of a query $x \in S$, i.e., the number of times that query is made in $A$: $f(x) = |\{t \mid a_t = x\}|$. The static optimality property states that the time to execute $A$ is

\[ O\!\left(m + \sum_{i=1}^{n} f(x_i) \log \frac{m}{f(x_i)}\right). \]

It may not be clear from where the “optimality” in “static optimality” is derived.

To see this, we note that the quantity $f(x_i)/m$ is essentially the probability $p(x_i)$ that $x_i$ is queried. The sum can therefore be re-written as

\[ \sum_{i=1}^{n} f(x_i) \log \frac{m}{f(x_i)} = \sum_{i=1}^{n} m\,p(x_i) \log \frac{1}{p(x_i)} = mH, \]

where $H$ is the empirical entropy of the query distribution.² The average query time is therefore $mH/m = H$. Since the entropy of the query distribution is a lower bound on the average query time (to within lower-order terms) [59], it follows that a data structure with the static optimality property is optimal in this sense. The optimum binary search trees of Knuth [48] match this average query time. Note that, although the query time is stated in terms of query frequencies, it is not necessarily required that the data structure be given the frequencies. While optimum binary search trees require the frequencies to be specified in advance, not all data structures require this.

²For this reason, the static optimality property is also known as the entropy bound.

The dynamic optimality property enforces a much stronger sense of "optimality." Given some specific class of data structure (e.g., binary search trees), let $OPT(A)$ denote the total query time of the fastest possible data structure in that class to execute the sequence $A$, given that it may perform arbitrary restructuring after each query (e.g., rotations in a binary search tree), which are included in the access time. Another data structure of that class is dynamically optimal if it can execute $A$ in time $O(OPT(A))$, assuming $A$ is sufficiently long. Note that the optimum binary search trees of Knuth [48] do not necessarily fall into this category, because they do not take advantage of the possibility of restructuring after a query.

It may be helpful to consider some lower bound on $OPT(A)$. In binary search trees, for example, Wilber [66] presented two lower bounds on the time to execute an access sequence. Such results will not be required for this thesis, however, as we concentrate on upper bounds.
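To make the entropy bound concrete, the following small sketch computes the empirical entropy $H$ and the total bound $mH$ for a given query sequence; the function name is mine and this is only an illustration of the definitions above, not part of any data structure.

```python
import math
from collections import Counter

def static_optimality_bound(queries):
    """Return (empirical entropy H, total bound m*H) for a query sequence.

    H = sum over distinct x of p(x) * log2(1/p(x)), where p(x) = f(x)/m.
    A statically optimal structure executes the sequence in O(m + m*H) time."""
    m = len(queries)
    freq = Counter(queries)
    H = sum((f / m) * math.log2(m / f) for f in freq.values())
    return H, m * H

# A highly repetitive sequence has low entropy, so the bound is small.
print(static_optimality_bound([1] * 90 + [2] * 10))
# A sequence cycling through 16 distinct keys has entropy log2(16) = 4.
print(static_optimality_bound(list(range(16)) * 10))
```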

2.2.2 Key-Independent Optimality

Another kind of optimality is that of key-independent optimality, which was introduced by Iacono [42]. Let $b : S \to S$ be a random bijection and let $b(A)$ denote the query sequence $(b(a_1), b(a_2), \ldots, b(a_m))$. Recall that $OPT(A)$ represents the fastest possible time that a data structure in a given class can execute $A$. Now, consider the quantity $E[OPT(b(A))]$: the expected value of the fastest time a data structure in a given class can execute $b(A)$. We say that a data structure is key-independently optimal if it can execute $A$ in time $O(E[OPT(b(A))])$. Intuitively, a key-independently optimal data structure executes sequences with random key values as fast as any other data structure in that class.

2.2.3 The Working-Set Property

In the query sequence $A = (a_1, a_2, \ldots, a_m)$, we say that $a_t$ is queried at time $t$ for $1 \le t \le m$. We define the working-set number of $x$ at time $t$, denoted $w_t(x)$, to be the number of distinct elements queried since the last time $x$ was queried (i.e., the last time $x$ appeared in $A$). If $x$ has not yet appeared in $A$ (prior to time $t$), then we set $w_t(x) = n$. To be more precise, we slightly modify the notation of Iacono [42]: let $i_t(x) = \min\left(\{\infty\} \cup \{t' > 0 \mid a_{t-t'} = x\}\right)$ and define

\[ w_t(x) = \begin{cases} n & \text{if } i_t(x) = \infty \\ |\{a_{t-i_t(x)+1}, \ldots, a_t\}| & \text{otherwise.} \end{cases} \]

The working-set property states that the time required to execute $A$ is

\[ O\!\left(m + \sum_{t=1}^{m} \log w_t(a_t)\right). \]

Intuitively, query sequences that repeat particular queries frequently are executed faster in data structures with the working-set property. Sleator and Tarjan [61] indicate one possible motivation for why this is a desirable property. Suppose that, while the set $S$ is quite large, there exists a subset $S' \subset S$ such that all queries to the data structure belong to $S'$. In this case, the working-set property allows queries to run in $O(\log |S'|)$ time instead of $O(\log |S|)$ time, since the working-set number of any query at any time is at most $|S'|$. This essentially means that the elements of $S \setminus S'$ can be ignored, since they are never queried. Put another way, it is as if those elements are not present in the data structure.

This property was first discussed in this context by Sleator and Tarjan [61], although the general idea is similar to that of the move-to-front heuristic [60]. Interestingly, Iacono [40] proved that any data structure with the working-set property is also statically optimal. Therefore, ensuring the working-set property is one way of circumventing the need for a priori knowledge of the query distribution. Furthermore, Iacono [42] also showed that the bound provided by the working-set property is asymptotically equivalent to that provided by key-independent optimality. Therefore, any data structure that has the working-set property is also key-independently optimal and any data structure that is key-independently optimal has the working-set property.
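As a concrete illustration of the definition, the following sketch computes $w_t(a_t)$ for every query in a sequence directly from the definition; the function name is mine, the quadratic scan is for clarity only, and charging $n$ for first accesses follows the convention above.

```python
def working_set_numbers(queries, n):
    """w_t(a_t) for each query: the number of distinct elements seen since
    the previous occurrence of a_t, or n if a_t has not been queried before."""
    last_seen = {}           # element -> index of its most recent occurrence
    result = []
    for t, x in enumerate(queries):
        if x not in last_seen:
            result.append(n)
        else:
            # Distinct elements queried after the previous occurrence of x,
            # up to and including the current query.
            result.append(len(set(queries[last_seen[x] + 1 : t + 1])))
        last_seen[x] = t
    return result

# Repeating a small working set keeps the numbers small even when n is large.
print(working_set_numbers([1, 2, 3, 1, 2, 3, 1], n=1000))
# -> [1000, 1000, 1000, 3, 3, 3, 3]
```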

2.2.4 The Queueish Property

In some sense, the queueish property is the opposite of the working-set property. The queue number of the element $x \in S$ at time $t$ is denoted $q_t(x)$ and is defined to be $n - w_t(x) - 1$, i.e., the number of distinct elements of $S$ that have not been queried since the last time $x$ was queried. The queueish property states that the time required to execute $A$ is

\[ O\!\left(m + \sum_{t=1}^{m} \log q_t(a_t)\right). \]

Iacono and Langerman [45] proved that no binary search tree has the queueish property. As a result, this particular property has yet to be explored in significantly more detail, at least for the dictionary problem. A weaker version of this property, called the weakly queueish property, exists in at least one dictionary [45]. The weak version of the property states that a query $a_t$ should execute in time $O(\log q_t(a_t)) + o(\log n)$.

2.2.5 The Static and Dynamic Finger Properties

While the working-set and queueish properties consider the time between queries, the static and dynamic finger properties consider the distance between queries. Let $d(x, y)$ denote the rank distance between $x$ and $y$, i.e., the number of elements in $S$ between $x$ and $y$. We first consider the static finger property. Fix some particular element $f \in S$, which we shall call a finger. The static finger property states that the time required to execute $A$ is

\[ O\!\left(m + \sum_{t=1}^{m} \log d(f, a_t)\right). \]

The "static" in "static finger" arises from the fact that the finger $f$ is fixed. Iacono [41] showed that any data structure that has the static optimality property also has the static finger property, for any choice of finger $f$. The dynamic finger property allows the finger to dynamically change over time: the finger is always the previous query. Therefore, the dynamic finger property states that the time required to execute the sequence $A$ is

\[ O\!\left(m + \sum_{t=2}^{m} \log d(a_{t-1}, a_t)\right). \]

2.2.6 The Unified Property

The unified property³ is a combination of the working-set and dynamic finger properties and was introduced by Badoiu et al. [20]. The unified bound $u_t(x)$ for a query $x$ at time $t$ is defined as follows:

\[ u_t(x) = \min_{y \in S}\bigl(w_t(y) + d(y, x)\bigr). \]

Intuitively, $u_t(x)$ is small whenever a recently queried element is close to $x$. More formally, the unified property states that the time required to execute $A$ is

\[ O\!\left(m + \sum_{t=1}^{m} \log u_t(a_t)\right). \]

It is important to note that having the unified property is not the same as simultaneously having both the working-set and dynamic finger properties. To illustrate this, we consider an example of a query sequence due to Badoiu et al. [20]. Suppose $S = \{1, 2, \ldots, n\}$ for some even $n$ and define $A$ as follows:

\[ A = (1,\ n/2+1,\ 2,\ n/2+2,\ 3,\ n/2+3,\ \ldots,\ n/2,\ n,\ 1,\ n/2+1,\ \ldots). \]

³As noted by Badoiu et al. [20], a "unified bound" appears in the original paper on splay trees by Sleator and Tarjan [61]; it is simply the minimum of the bounds provided by static optimality, the working-set property and the static finger property. The property we discuss here is distinct.

In $A$, every query (after the first $n$ queries) will have working-set number $n$, and so the working-set property gives a bound of $O(m \log n)$ for the sequence. Similarly, every query is at distance $n/2$ from the previous query, and so the dynamic finger property gives a bound of $O(m \log n)$ for the sequence as well. However, observe that after the first $n$ queries, query $a_{t-2}$ is at distance 1 from $a_t$. The unified property therefore gives a bound of only $O(m)$ for the sequence, since both the working-set number and the distance are constant, which is a factor of $O(\log n)$ better than either of the other two bounds.
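To see how the unified bound behaves on this example, the following brute-force sketch evaluates $u_t(a_t)$ directly from the definition. The function names are mine, rank distance is taken as the absolute difference in ranks, and the $O(n)$ scan per query is only meant to illustrate the definition, not to be efficient.

```python
def unified_bounds(queries, keys):
    """u_t(a_t) for each query: min over y in S of (w_t(y) + d(y, a_t))."""
    n = len(keys)
    rank = {k: i for i, k in enumerate(sorted(keys))}
    last_seen = {}
    bounds = []
    for t, x in enumerate(queries):
        best = n  # y = x always gives a candidate of value at most n
        for y in keys:
            if y in last_seen:
                w = len(set(queries[last_seen[y] + 1 : t + 1]))
            else:
                w = n
            best = min(best, w + abs(rank[y] - rank[x]))
        bounds.append(best)
        last_seen[x] = t
    return bounds

# The interleaved sequence from the example: 1, n/2+1, 2, n/2+2, ..., repeated.
n = 8
keys = list(range(1, n + 1))
A = [x for i in range(1, n // 2 + 1) for x in (i, n // 2 + i)] * 2
# After the first n queries, the bounds stay below a small constant that does
# not depend on n, unlike the working-set and dynamic finger bounds.
print(unified_bounds(A, keys))
```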

2.3 Splay Trees

The splay tree of Sleator and Tarjan [61] is among the first distribution-sensitive data structures and is certainly one of the most interesting. The reasons for this are two-fold: first, splay trees have a considerable number of distribution-sensitive properties, including most of those mentioned in Section 2.2. Perhaps more interesting is the fact that there is still much to be shown about the behaviour of splay trees.

The splay tree fits into the usual binary search tree model (which we define precisely in Section 3.1.3), and the splay operation, which consists of a series of rotations, is used after every query. We omit the details of the splaying operation. Splay trees support queries in $O(\log n)$ amortized time; this means that individual queries may take as much as $O(n)$ time, but the sequence as a whole executes in $O(m \log n)$ time.

Splay trees have many of the properties mentioned in Section 2.2. Sleator and Tarjan [61] showed that splay trees are statically optimal, have the static finger property and have the working-set property. Iacono [42] showed that splay trees are key-independently optimal.

Sleator and Tarjan conjectured that splay trees have the dynamic finger property; this was verified 15 years later with a fairly difficult proof due to Cole et al. [23, 24], although Tarjan showed that the less general sequence where the keys are queried in sequential order can be executed in $O(n)$ time [62]. This bound is known as the scanning bound or sequential access theorem. Given that splay trees have both the working-set and dynamic finger properties, Badoiu et al. [20] conjectured that splay trees also satisfy the unified property; this question is still open.

Perhaps the most important unresolved conjecture was also posed by Sleator and Tarjan in their original paper: are splay trees dynamically optimal? This problem remains open and is generally viewed as the most important question in this line of research. Note that this problem must, by its nature, be studied with respect to a particular class of data structure. For example, the problem is resolved for some classes of skip lists and B-trees [11].

2.4 Some Distribution-Sensitive Data Structures

In this section, we survey a selection of known distribution-sensitive data structures.

The Working-Set Structure. The working-set structure is due to Badoiu et al. [20] and provides an even stronger version of the working-set property. Rather than executing the sequence in time $O(m + \sum_{t=1}^{m} \log w_t(a_t))$, the working-set structure guarantees that each query $a_t$ is answered in worst-case time $O(\log w_t(a_t))$ per query in the sequence. This can be viewed as de-amortizing the working-set property provided by, for example, splay trees. This means that any query can be answered in time $O(\log n)$, since $w_t(a_t) \le n$ for all $t$; this is a significant improvement over the $O(n)$ guarantee offered by splay trees. The working-set structure is not a binary search tree, however. Instead, it is a collection of binary search trees and queues, each of which increases doubly exponentially in size to ensure that recently queried elements are found quickly. Therefore, the working-set structure is not directly comparable with splay trees.

The Unified Structure. The unified structure was also introduced by Badoiu et al. [20] and was the first data structure to have the unified property. Like the working-set structure, the unified structure is not a binary search tree and is therefore not directly comparable with splay trees either. However, Derryberry [30] studied binary search trees that support the unified property.

Skip-Splay Trees. The skip-splay tree is due to Derryberry and Sleator [31] and is an attempt to achieve the unified property within the binary search tree model. Skip-splay trees fall slightly short of this goal: they execute the query sequence in time $O(m \log\log n + \sum_{t=1}^{m} \log u_t(a_t))$, and so spend an extra $O(\log\log n)$ amortized time per query.

Nearing Dynamic Optimality. Considering the apparent difficulty of the problem of proving splay trees to be dynamically optimal, attention has turned somewhat to finding other binary search trees that are competitive with dynamically optimal data structures: their query times are a (sub-logarithmic) factor away from dynamic optimality. Several data structures come within a factor of $O(\log\log n)$ of dynamic optimality [13, 28, 65].

Finger Search. In the finger search problem, we are given a pointer to some element of the data structure before we perform a query for another element. The goal is then to execute the query in time that is a function (usually logarithmic) of $d$, where $d$ is the rank distance from the query to the finger. This is a generalization of the dynamic finger property: to achieve the dynamic finger property, one can simply use a pointer to the previous answer. Several finger search data structures are known for various models (e.g., [4, 19, 32]). Skip lists also allow for finger searches [57]. Kaporis et al. [47] show how to achieve $O(\log\log d)$ queries when the contents of $S$ are chosen according to a certain kind of distribution.

Biased Search Trees. The biased search trees of Bent et al. [10] are similar in spirit to optimum binary search trees. Each element $s_i \in S$ is assigned a positive real weight $w_i$ and a query for $s_i$ is executed in time $O(\log(W/w_i))$, where $W = \sum_{i=1}^{n} w_i$ is the sum of all weights in the data structure. When $w_i$ is viewed as the probability that $s_i$ is queried, then we have $W = 1$ and achieve essentially the same bound as an optimum binary search tree (to within a constant factor). The key difference is that biased search trees support insertions and deletions of elements into and from $S$. Splay trees can also be made to support this query time, although only in the amortized sense. Treaps support this query time in the expected sense [58]. Bagchi et al. [6] presented a biased version of skip lists that can also match this query time.

Priority Queues. Johnson [46] described a priority queue that allowed for insertion and deletion in time $O(\log\log D)$, where $D$ is the difference in priorities of the elements before and after the element to be inserted or deleted. In this setting, the allowable values of priorities come from a restricted range of integers. This type of distribution sensitivity is similar in spirit to that of the dynamic finger property, except that the measure of distance is in terms of priority and that the distance is measured to both the previous and next elements. Iacono [40] showed that pairing heaps offer a bound very similar to that of the working-set property, except in the priority queue setting. The minimum element $x$ can be extracted from a pairing heap in time $O(\log\min\{n, k\})$, where $n$ is the number of elements in the heap and $k$ is the number of heap operations performed since $x$ was inserted. Similar results were obtained by Elmasry [36].

Queaps and Queueish Dictionaries. Iacono and Langerman [45] presented a heap with the queueish property: the minimum element $x$ can be extracted in time $O(\log k)$, where $k$ is the number of elements that have been in the heap longer than $x$. A dictionary data structure is also presented that supports a query for $x$ at time $t$ in time $O(\log q_t(x) + \log\log n)$. Iacono and Langerman [45] also showed that no binary search tree can have the queueish or even weakly queueish property by showing an access sequence such that having the queueish property would violate the lower bound of Wilber [66].

The Temporal Precedence Problem. For the temporal precedence problem, a list must be maintained over many insertion operations. Queries consist of two pointers into the list and must quickly determine which of the two items pointed to was inserted first. Brodal et al. [18] gave a data structure in the pure pointer machine model that is capable of answering queries in time $O(\log\log \delta)$, where $\delta$ is the number of insertions that occurred between the insertions of the query elements.

Random Input and Predecessor Search. The predecessor search problem asks for the largest element of S less than or equal to some query element. Belazzougui et al. [9] showed that this can be accomplished quickly when the set S is chosen according to a certain kind of distribution (as in the work of Kaporis et al. [47]). This differs from the usual notion of distribution sensitivity: rather than considering patterns in the query distribution, patterns in the input itself are considered.

Point Searching. For the (planar) point searching problem, the set S consists of points in the plane and a query consists of a point and the data structure must determine if that point is in the set S. If the point is not in the set S, then the data structure must indicate this. Demaine et al. [27] described how to answer point searching queries in a distribution-sensitive manner. The query time is logarithmic in a function related to the number of points in a certain region defined by the current and previous queries. Therefore, this data structure can be considered to have a two-dimensional analogue of the dynamic finger property.

Point Location. For the (planar) point location problem, we are given a series of regions in the plane. Queries consist of a query point and the data structure must determine which region contains the query point. This particular problem has been thoroughly studied and has had several distribution-sensitive data structures proposed for it. One approach is to obtain an optimal data structure based on the probability distribution of the queries (e.g., [5, 25]). In fact, it is possible to achieve close to this even without knowledge of the distribution [43]. Another approach to the problem is to support an analogue of the dynamic finger property, as proposed by Iacono and Langerman [44]. In this setting, the query time achieved is very similar to that achieved for point searching [27].

Nearest Neighbour Search. For the nearest neighbour problem, we are given a set of points in the plane, and a query asks for the closest point in the set to the query point. In general, it is difficult to find the exact solution quickly using a reasonable amount of space, and so most data structures settle for an approximate solution (i.e., a point that is not much further away than the nearest neighbour). Derryberry et al. [29] proposed a data structure for the (approximate) nearest neighbour problem that executes queries in time that is a (logarithmic) function of the number of points in a certain box containing the query and the answer to the previous query. In fact, this technique works in any constant dimension; the query time has quadratic dependence on the dimension (and, of course, the approximation becomes worse as the dimension increases).

Biased Range Trees. Range searching is a very well-studied problem, but very little has been done in terms of distribution-sensitive data structures for range searching. Recall that for the range searching problem, we are given a set of points in the plane. A query consists of some region (typically of a specific shape) and we must report (or count) the number of points in the region. Dujmovic et al. [34] described a data structure that, given a query distribution, can answer quarter-space (i.e., quadrant) queries in time that is optimal (to within a constant factor). Afshani et al. [2] subsequently extended this result to four-sided queries.

Chapter 3

The Working-Set Property in Binary Search Trees

In this chapter, we show how to construct a binary search tree, which we call a layered working-set tree, that satisfies a slightly stronger version of the working-set property. In particular, the time to execute an individual query is logarithmic in the working-set number of the query in the worst case. This differs from the working-set property defined in Section 2.2.3, since there we only require this bound for the amortized time of an individual query.

3.1 Problem Definition

Let $S$ be a set of keys drawn from a totally ordered universe and denote a sequence of queries into this set by $A = (a_1, a_2, \ldots, a_m)$, where $a_t \in S$. We wish to construct a binary search tree such that the query $a_t$ can be executed in (worst case) time $O(\log w_t(a_t))$, where $w_t(a_t)$ is the working-set number of $a_t$ at time $t$.


Our binary search tree will also support insertion and deletion. Let $S_t \subseteq S$ denote the set of keys stored in the binary search tree at time $t$. We allow elements of $S$ to be inserted or deleted at time $t$ in time $O(\log |S_t|)$.

3.1.1 Background

As discussed in Sections 2.3 and 2.4, the splay trees of Sleator and Tarjan [61] fall into the binary search tree model (to be defined in Section 3.1.3) and have the working-set property. However, individual queries may take as much as $O(n)$ time to execute. The working-set structure of Badoiu et al. [20], conversely, guarantees that an individual query can be executed in (worst case) time $O(\log w_t(a_t))$ (which implies that any query can be executed in time $O(\log |S_t|)$, since $w_t(a_t) \le |S_t|$ for all $t$), but their structure is not a binary search tree. Therefore, all that remains is to unify these results by providing a binary search tree that guarantees a query can be executed in time logarithmic in the working-set number of the query in the worst case.

3.1.2 The Working-Set Structure

It will be useful to review how the working-set structure achieves its query time. In this section, we summarize the work of Badoiu et al. [20]. The working-set structure, pictured in Figure 3.1, is composed of $k$ balanced binary search trees (e.g., AVL trees [1]) $T_1, T_2, \ldots, T_k$ along with $k$ doubly-linked lists $Q_1, Q_2, \ldots, Q_k$. For all $1 \le i \le k$, the contents of $T_i$ and $Q_i$ are identical, and pointers are maintained between corresponding elements of $T_i$ and $Q_i$. Each element in the set $S_t$ is contained in exactly one tree and its corresponding list.

For $i < k$, the size of $T_i$ and $Q_i$ is $2^{2^i}$; the size of $T_k$ and $Q_k$ is the number of elements remaining, i.e., $|S_t| - \sum_{i=1}^{k-1} 2^{2^i} < 2^{2^k}$.

Figure 3.1: The working-set structure. Pointers between corresponding elements in the trees and linked lists have been omitted.

Every list $Q_i$ orders the elements of its corresponding tree $T_i$ in order of their last access: the youngest (i.e., most recently accessed) element of $T_i$ appears at the beginning of $Q_i$ and the oldest (i.e., least recently accessed) element of $T_i$ appears at the end of $Q_i$.

The key operation used by the working-set structure is called a shift. A shift is performed between two trees $T_i$ and $T_j$. Intuitively, a shift from $T_i$ to $T_j$ decreases the size of $T_i$ by one and increases the size of $T_j$ by one. This will allow us to move elements around the data structure. A shift works in the following way. Assume that $i < j$ (the other case is symmetric). Look at the oldest element of $Q_i$ (which is located at the end) and follow the pointer to the corresponding element in $T_i$. Remove the element from both $Q_i$ and $T_i$ (recall that deletions can be performed from the end of a doubly-linked list in constant time and from a balanced binary search tree of size $2^{2^i}$ in $O(2^i)$ time). This element is then inserted into $T_{i+1}$ and $Q_{i+1}$ (at the beginning, since this element is now the youngest element of $Q_{i+1}$). We then continue the shift from $T_{i+1}$ to $T_j$. This process repeats until we attempt to shift from a tree to itself. Observe that the total time required to perform this shift is $\sum_{\ell=i}^{j} O(2^\ell) = O(2^j)$.

We now describe how to perform a search in the working-set structure. During a query for element $x$, we sequentially query $T_1, T_2, \ldots$ until we either find $x$ or search all trees and fail to find $x$. If we fail to find $x$, then we report so in total time $\sum_{i=1}^{k} O(2^i) = O(2^k) = O(\log |S_t|)$.¹ Otherwise, suppose we find $x$ in $T_i$. Since $T_i$ has size $2^{2^i}$, the time spent to get to $T_i$ is $O(2^i)$. We now remove $x$ from $T_i$ and $Q_i$ and insert it into $T_1$ and $Q_1$. At this point, $T_1$ and $Q_1$ are one element too large, while $T_i$ and $Q_i$ are one element too small. To fix this, we perform a shift from 1 to $i$ in time $O(2^i)$. The total time spent searching is thus $O(2^i)$.

The key observation is that if $x \in T_i$ for $i > 1$, then $x$ must have been removed from $Q_{i-1}$. At the point it was removed from $Q_{i-1}$, $x$ was the oldest element of $Q_{i-1}$, which contains a total of $2^{2^{i-1}}$ elements. Therefore, $w_t(x) \ge 2^{2^{i-1}}$, which implies that $i \le 1 + \log\log w_t(x)$. The search time of $O(2^i)$ is thus $O(2^{\log\log w_t(x)}) = O(\log w_t(x))$, as required.

To insert $x$ into the working-set structure, we insert it into $T_1$ and $Q_1$, which are now one element too large. To fix this, we shift from 1 to $k$ in total time $O(2^k)$. Since $k$ is $O(\log\log |S_t|)$, the insertion time is $O(\log |S_t|)$. To delete $x$ from the working-set structure, we first search for it in time $O(\log w_t(x)) = O(\log |S_t|)$. Suppose we find $x$ in $T_i$. We then remove $x$ from $T_i$ and $Q_i$ in time $O(2^i)$ and perform a shift from $k$ to $i$ in time $O(2^k)$. In total, we spend $O(2^k) = O(\log |S_t|)$ time to delete $x$. Note that during an insertion, we may need to increment $k$ and create a new tree if necessary. Conversely, during a deletion, we may need to decrement $k$ and remove the last tree if it becomes empty.

¹For simplicity, we say that the working-set number of any element not stored in a data structure is equal to the size of the data structure.
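To make the mechanics above concrete, here is a compressed sketch in Python. It is not the structure as implemented in [20]: the balanced trees and doubly-linked lists are replaced by Python's OrderedDict purely for brevity, deletions are omitted, and the class and method names are mine. Only the level capacities of $2^{2^i}$ and the shift-style demotion of oldest elements are meant to mirror the description above.

```python
from collections import OrderedDict

class WorkingSetStructure:
    """A simplified sketch of the working-set structure.

    Level i (indexed from 0 here) holds at most 2**(2**(i+1)) keys.  Each
    level is an OrderedDict ordered from youngest (most recently accessed)
    to oldest; in the real structure a level is a balanced BST plus a
    doubly-linked list, so searching level i costs O(2**i) and a shift
    costs O(2**i) as well."""

    def __init__(self):
        self.levels = []

    @staticmethod
    def _capacity(i):
        return 2 ** (2 ** (i + 1))

    def _make_youngest(self, i, key):
        """Insert key into level i as its youngest element."""
        self.levels[i][key] = None
        self.levels[i].move_to_end(key, last=False)

    def _shift_down(self, i):
        """While level i is too large, demote its oldest element to level i+1."""
        while len(self.levels[i]) > self._capacity(i):
            if i + 1 == len(self.levels):
                self.levels.append(OrderedDict())
            oldest, _ = self.levels[i].popitem(last=True)
            self._make_youngest(i + 1, oldest)
            i += 1

    def insert(self, key):
        if not self.levels:
            self.levels.append(OrderedDict())
        self._make_youngest(0, key)
        self._shift_down(0)

    def search(self, key):
        """Move key to the top level if present; a hit in level i costs O(2**i)
        in the real structure, which is O(log w_t(key)) by the argument above."""
        for level in self.levels:
            if key in level:
                del level[key]
                self._make_youngest(0, key)
                self._shift_down(0)  # the demotion cascade refills the hole
                return True
        return False

# Recently accessed keys migrate to the small top levels and are found quickly.
ws = WorkingSetStructure()
for k in range(100):
    ws.insert(k)
print(ws.search(99), ws.search(42), ws.search(99))
```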

3.1.3 Model

Until now, we have avoided precisely defining what it means to be a binary search tree. We use the model of Wilber [66]. Each node of the tree stores the key of $S$ that is associated with it, and further maintains pointers to its left and right children, as well as its parent. Recall that the keys in the binary search tree are from a totally ordered universe. Nodes are ordered so that at any given node, all of the keys in the left subtree are less than that stored in the node and all of the keys in the right subtree are greater than that stored in the node. Each node is permitted to store a constant² amount of additional information called fields, but no additional pointers may be stored.

To perform a query for a key, the key is accessed in the binary search tree. Initially, we are given a finger to the root of the tree. An access consists of moving the pointer from a node to one of its adjacent nodes (through the parent pointer or one of the child pointers) until the pointer reaches the key being queried. Any time the pointer points to a node, we are allowed to modify the fields and pointers stored at that node at no cost. The total cost of an access (i.e., the query time) is the total number of nodes pointed to by the pointer.

²By standard convention, "constant" means asymptotically equivalent to the size of a key (which is typically logarithmic in the number of keys). Since each node must store a key, this only increases the storage at a node by a constant factor.

3.2 Definition of Layered Working-Set Trees

In this section, we define the structural portion of layered working-set trees. A layered working-set tree results from adapting the working-set structure into the binary search tree model.

3.2.1 Tree Decomposition

Let $T$ denote our layered working-set tree. At a high level, one can view a layered working-set tree as layering the trees $T_1, T_2, \ldots, T_k$ of the working-set structure into a single binary search tree, while augmenting each node with sufficient information to determine the oldest element of a tree at a given time.

The first issue to address is precisely how to layer the trees of the working-set structure. To do this, we will label the nodes of $T$ with a number from the set $\{1, 2, \ldots, k\}$ such that no element of $T$ has an ancestor with a label that is strictly greater than its own label. Given a label $i$ from the set $\{1, 2, \ldots, k\}$, we say that the set of all nodes of $T$ with label $i$ forms a layer $L_i$. A layer plays exactly the same role in $T$ as the tree $T_i$ in the working-set structure. In particular, layer $L_i$ consists of $2^{2^i}$ elements for all $i < k$, and $L_k$ contains the remaining elements. However, $L_i$ differs from $T_i$ in that $L_i$ is not a single tree: instead, $L_i$ consists of a collection of subtrees in $T$. We refer to a connected component of nodes with equal labels as a layer-subtree. Figure 3.2 shows an example of such a decomposition. It is important to note that layer $L_i$ can be connected to any layer $L_{i'}$ with $i' > i$. For example, one of the layer-subtrees below the topmost layer in Figure 3.2 has label 4. It is not always the case that labels increase by one over any root-to-leaf path.

Every node $x \in T$ will maintain a field that records the layer that $x$ is contained in (i.e., the value $i$ such that $x \in L_i$). We shall denote this field's value at a node $x$ by layer[$x$]. At the root of the tree, we will also maintain the number of layers $k$ and the size $|L_k|$ of the deepest layer $L_k$.

Figure 3.2: The decomposition of a layered working-set tree into layers. All triangles represent layer-subtrees, while all triangles with the same label form a layer.

Layer-subtrees are maintained independently as balanced binary search trees. The key requirement is that given a layer-subtree $T'$, each node $x \in T'$ has depth $O(\log |T'|)$, measured from the root of $T'$ (i.e., the highest node of $T'$). Now, suppose that the layer-subtree $T'$ is part of layer $L_i$. It therefore follows that the depth of any node in $T'$ is at most $O(\log |L_i|)$. The independent maintenance of a layer-subtree means that whatever balance criteria are applied to ensure logarithmic depth are applied only within the layer-subtree. For our purposes, we maintain layer-subtrees as red-black trees [7, 38]. We begin with an observation about the depth of a node $x \in L_i$.

Observation 3.1. The depth of a node $x \in L_i$ (measured in the tree as a whole) is $O(2^i)$.

Proof. Consider the layers traversed on the path from the root to $x$. There are at most $i$ such layers: $L_1, L_2, \ldots, L_i$. The path from the root to $x$ traverses at most one layer-subtree in each of these layers. In layer $1 \le j \le i$, there are $2^{2^j}$ elements, and so the layer-subtree we pass through at this layer has size at most $2^{2^j}$. Since each layer-subtree is maintained as a balanced binary search tree, each layer-subtree has logarithmic depth. Therefore, the total depth of $x$ is $\sum_{j=1}^{i} O(2^j) = O(2^i)$. □

3.2.2 Encoding the Linked Lists

In the previous section, we have shown how to incorporate all trees of the working-set structure into a single binary search tree. It remains to show how we will encode the linked lists into the binary search tree model. For a summary of how the linked lists are encoded, refer to Figure 3.3.

Let L_i (1 ≤ i ≤ k) be a layer of the tree. Each node x ∈ L_i maintains fields that store the keys of the elements in that layer inserted immediately after and immediately before it; this information is stored in the fields younger[x] and older[x], respectively. One can read younger[x] as "the element inserted into L_i immediately after x," and older[x] as "the element inserted into L_i immediately before x." If x is the most recently inserted element in its layer, then we assign younger[x] = nil, and if x is the least recently inserted element in its layer, then we assign older[x] = nil.

Each node x ∈ L_i also maintains a field nextlayer[x]. If x is the most recently accessed element in a layer (i.e., x is the youngest, so that younger[x] = nil), then nextlayer[x] contains the key of the most recently accessed (i.e., youngest) element in the next layer L_{i+1}. Conversely, if x is the least recently accessed element in a layer (i.e., x is the oldest, so that older[x] = nil), then nextlayer[x] contains the key of the least recently accessed (i.e., oldest) element in the next layer L_{i+1}. If x is neither the youngest nor the oldest element in its layer, we set nextlayer[x] = nil. To ensure that we stay within the binary search tree model of computation described in Section 3.1.3, the fields of every node contain keys, and not pointers. Therefore, the operations we describe in the next section cannot use these fields to move around the tree. The only permissible use of these keys is to direct a subsequent search through the binary tree.
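Because fields hold keys rather than pointers, "following" a field such as nextlayer[x] always means paying for a fresh search from the root. A minimal sketch of this convention (ours; the node layout matches the earlier sketch, and the helper names are illustrative):

def search_from_root(root, key):
    # The only way to reach the node a field refers to: an ordinary
    # top-down search for the stored key.
    node = root
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return node

def follow_field(root, node, field_name):
    # field_name is one of 'younger', 'older' or 'nextlayer'; a nil field
    # is modelled as a missing entry.
    key = node.fields.get(field_name)
    if key is None:
        return None
    return search_from_root(root, key)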

3.3 Operations on Layered Working-Set Trees

In this section, we define how operations are performed on layered working-set trees. The main obstacle in adapting the working-set structure operations into a single binary search tree is that layers correspond to single trees in the working-set structure but to a series of trees in layered working-set trees. Layer-subtrees, however, are implemented as red-black trees, and so the usual operations on red-black trees must be adapted to work between layers. We divide operations on layered working-set trees into those operating on a single layer (intra-layer operations), those spanning two adjacent layers (inter-layer operations), and finally those operating on the tree as a whole (tree operations).

Figure 3.3: Encoding linked lists into a layered working-set tree. In this example, the elements (from most recently inserted (youngest) to least recently inserted (oldest)) are a, b, c, d, e, f, g. For example, older[b] = c and younger[f] = e. A is the most recently inserted element in the previous layer and B is the least recently inserted element in the previous layer. C is the most recently inserted element in the next layer and D is the least recently inserted element in the next layer. For example, nextlayer[A] = a and nextlayer[g] = D.

3.3.1 Intra-Layer Operations

We refer to operations performed on a single layer as intra-layer operations. These operations correspond to those traditionally performed on any balanced binary search tree. In particular, we need algorithms to maintain balance in layer-subtrees after insertion and deletion operations, as well as algorithms to perform splitting and joining of layer-subtrees to facilitate restructuring of layers. For clarity, we describe how these operations can be achieved using red-black trees. This is not a requirement, however, and any binary search tree that meets the speed requirements for each operation can be used. It is also an important responsibility of layer-subtrees to ensure that their operations do not leave the layer-subtree in which they originate; this can be achieved by checking the layer number (in the corresponding field) of a node before visiting it or performing any operation on it. As indicated above, intra-layer operations rearrange layer-subtrees in some manner. At the boundaries of a layer-subtree (where a node has a different layer number than its child), there may be several layer-subtrees of different layers. Observe that the roots of each of these layer-subtrees can be viewed as the results of unsuccessful searches. In this sense, then, intra-layer operations on layer-subtrees are local: we need not concern ourselves with impacting any layer-subtree below the one we are performing an operation on.

Let T' denote a layer-subtree of layer L_i and let x ∈ T'. We define the following operations.

Rebalance-Insert(x) This operation restores logarithmic depth within T' after the node x has been inserted into this layer-subtree. For red-black trees, this operation is precisely the RB-Insert-Fixup operation presented by Cormen et al. [26, Section 13.3]. While the version presented there does not handle properly assigning a colour to x, it is straightforward to modify it to do so.

Rebalance-Delete(x) This operation is the counterpart to the previous operation and is responsible for ensuring logarithmic depth within T' after a deletion in T'. The node x to be given to this operation is dependent on the underlying structure used to implement layer-subtrees. For red-black trees, this operation is precisely the RB-Delete-Fixup operation presented by Cormen et al. [26, Section 13.4]. In this case, the node x is the child of the node spliced out by the deletion algorithm. We will elaborate on this when discussing inter-layer operations in Section 3.3.2.

Split(x) This operation moves x to the root of T', which results in all other elements of that layer-subtree appearing as a descendant of either the left or the right child of x. A balance condition is also ensured, so that both the subtrees of the left and right children that are within the layer-subtree each have logarithmic depth. It may be the case that T', as a whole, is no longer balanced, however. For red-black trees, this operation is described by Tarjan [63, Chapter 4]. Note that, in our case, we do not destroy the original trees, but rather stop when x becomes the root of the layer-subtree.

Join(x) This operation is the inverse of the previous operation. Initially, we have a node x in a given layer and we wish to join it with its children (if they are in the same layer). We therefore must rebalance the layer-subtree so that each node has logarithmic depth. For red-black trees, this operation is described by Cormen et al. [26, Problem 13-2]. We summarize the results of these operations in Lemma 3.2.

Lemma 3.2. For a node x ∈ L_i, the intra-layer operations Rebalance-Insert(x), Rebalance-Delete(x), Split(x) and Join(x) can be implemented in such a way that they can be executed in time O(2^i).

Proof. By implementing layer-subtrees as red-black trees, the time bounds follow immediately from the arguments given by Cormen et al. [26] and Tarjan [63]. □

3.3.2 Inter-Layer Operations

In this section, we turn our attention to inter-layer operations. These operations facilitate structural changes between adjacent layers and correspond roughly to those used by the shift operation in the working-set structure. We defer discussion of running times until after the descriptions of the operations. As in the previous section, we will describe the requirements of the operations independently of the actual layer-subtree implementation. Only the MoveDown(x) operation will require knowledge of how layer-subtrees are implemented; the remaining operations simply use the operations defined in the previous section, allowing for maximum flexibility.

YoungestInLayer(L_i) This operation returns the key of the youngest (most recently accessed) element in the layer L_i. To accomplish this, we first exhaustively examine all elements in L_1. This exhaustive search is permissible for two reasons: first, this layer has size O(1), and second, all elements in L_1 must be in the same layer-subtree and can thus be enumerated in time linear in the size of the layer. Once we find the element of L_1 that is the youngest (by looking for the element x_1 for which younger[x_1] = nil), we go back to the root and search for nextlayer[x_1], which will return the key of the youngest element in L_2, say x_2. We then return to the root and search for nextlayer[x_2], and so on. This process repeats until we find the youngest element in L_i, as desired.

OldestInLayer(L_i) This operation returns the key of the oldest (least recently accessed) element in the layer L_i. This is accomplished in a manner identical to that of YoungestInLayer(L_i), except that the initial search in L_1 is for the oldest element (the element x_1 for which older[x_1] = nil).
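A sketch of YoungestInLayer(L_i) under the field conventions above (ours; OldestInLayer(L_i) is symmetric, using the older field in L_1). The helper names are illustrative, and each hop restarts at the root, as the model requires.

def search_from_root(root, key):
    node = root
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return node

def layer_subtree_nodes(node, layer):
    # Enumerate the connected component of nodes carrying the given layer label.
    if node is None or node.fields.get('layer') != layer:
        return []
    return ([node] + layer_subtree_nodes(node.left, layer)
                   + layer_subtree_nodes(node.right, layer))

def youngest_in_layer(root, i):
    # Scan the O(1)-size layer L_1 for the node with no younger element ...
    current = next(n for n in layer_subtree_nodes(root, 1)
                   if n.fields.get('younger') is None)
    # ... then hop layer by layer: the youngest node of L_j stores the key of
    # the youngest node of L_{j+1}, and reaching it costs a search from the root.
    for _ in range(i - 1):
        current = search_from_root(root, current.fields['nextlayer'])
    return current.key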

MoveUp(x) This operation will move x from its current layer L_i to the next higher layer L_{i−1}. This corresponds to (one part of) a shift operation in the working-set structure in which we shift from a higher-index tree to a lower-index tree. To accomplish this, we begin by bringing x to the root of its layer-subtree by executing Split(x). We then remove x from L_i and place it into L_{i−1} by setting layer[x] = i − 1, which creates two new layer-subtrees in L_i: one for each child of x. We must now ensure balance in both the layer-subtree x was removed from in L_i and the layer-subtree x was inserted into in L_{i−1}. Observe that, by the definition of the split operation in the previous section, both of the layer-subtrees that are rooted at the children of x are already balanced. Therefore, we need only ensure balance in the layer-subtree in which x was inserted. This can be accomplished by executing the intra-layer operation Rebalance-Insert(x).

The final step is to update the encoding of the linked lists. To accomplish this, we examine the fields associated with x. If neither older[x] nor younger[x] is nil, then we go back to the root and perform searches for older[x] and younger[x]. When we get to those nodes, we set younger[older[x]] = younger[x] and older[younger[x]] = older[x], essentially "splicing out" x from the linked list encoding in layer L_i, which means x has been removed from the linked list encoding for L_i. It remains to insert x into the linked list encoding for layer L_{i−1}. To do this, we note that x should now be the youngest element of L_{i−1}, and so we find the youngest element in L_{i−1}, say y, by executing YoungestInLayer(L_{i−1}). We then search for y and set younger[x] = nil, older[x] = y, and younger[y] = x. Note that, since y was the youngest element in L_{i−1}, we must update nextlayer[x] = nextlayer[y] and set nextlayer[y] = nil. The final step is to update the field stored by the youngest element in L_{i−2}, say z, by executing YoungestInLayer(L_{i−2}), searching for z, and setting nextlayer[z] = x.

Otherwise, if younger[x] is nil but older[x] is not, then we conclude that x was the youngest element of L_i, and so older[x] should be the new youngest element in L_i. We therefore go to the root and search for older[x] and set younger[older[x]] = nil. Since older[x] is now the youngest element in L_i, we also copy nextlayer[x] into nextlayer[older[x]]. At this point, we have removed x from the linked list encoding in layer L_i and must now insert it into the linked list encoding for layer L_{i−1}. This can be achieved exactly as in the previous case. The final case occurs if older[x] is nil but younger[x] is not. If this happens, then we conclude that x was the oldest element of L_i and can proceed in a manner symmetric to the previous case. All three cases are illustrated in Figure 3.4.
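The linked-list bookkeeping for the general case (a) can be sketched as follows. This is our illustration only: the fields are modelled as a dictionary keyed by element, and in the actual tree every assignment to a field of some key is preceded by a search from the root for that key.

def move_up_fields(fields, x, youngest_prev, youngest_prev_prev):
    # fields: dict mapping each key to {'younger': k, 'older': k, 'nextlayer': k},
    # with None playing the role of nil. x is neither the youngest nor the
    # oldest element of L_i; youngest_prev is the youngest element of L_{i-1}
    # before the move, and youngest_prev_prev the youngest element of L_{i-2}.
    f = fields
    # Splice x out of the encoding of L_i.
    f[f[x]['older']]['younger'] = f[x]['younger']
    f[f[x]['younger']]['older'] = f[x]['older']
    # Insert x at the young end of the encoding of L_{i-1}.
    y = youngest_prev
    f[x]['younger'] = None
    f[x]['older'] = y
    f[y]['younger'] = x
    # x inherits y's pointer to the youngest element of L_i; y gives it up.
    f[x]['nextlayer'] = f[y]['nextlayer']
    f[y]['nextlayer'] = None
    # Finally, the youngest element of L_{i-2} now points down at x.
    z = youngest_prev_prev
    if z is not None:
        f[z]['nextlayer'] = x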

MoveDown(x) This operation will move x from its current layer L_i to the next lower layer L_{i+1}. In this sense, it is the opposite of the MoveUp(x) operation and corresponds to (one part of) a shift operation in the working-set structure in which we shift from a lower-index tree to a higher-index tree. This operation is dependent on how layer-subtrees are implemented; we describe it in terms of red-black trees.

Figure 3.4: The MoveUp(x) operation in a layered working-set tree. Each row represents a layer. The elements are sorted left to right from youngest to oldest. The layers, from top to bottom, are L_{i−2}, L_{i−1}, L_i, and L_{i+1}. In all cases, the new position of x after the operation is the leftmost (youngest) position in L_{i−1}. The arrows indicate which keys are referred to in the fields of nodes. Dotted arrows indicate the field has been overwritten during the operation. The lower x is the old position of x and the higher x is the new position of x. (a) x is neither the youngest nor the oldest element in L_i. (b) x is the youngest element of L_i. (c) x is the oldest element of L_i.

Let p denote the predecessor of x in L_i, i.e., the largest element of L_i that is less than x. If x is the smallest element of L_i, then set p = x. Similarly, let s denote the successor of x in L_i, i.e., the smallest element of L_i that is greater than x. If x is the largest element of L_i, then let s = x. Our first goal is to move x such that it becomes a leaf of its layer-subtree. There are three cases to consider, based on the number of children x has that are in the same layer (and thus the same layer-subtree). If x has no children in its layer-subtree, then it is a leaf of its layer-subtree and we are done. If x has only one child s', then to make x a leaf of its layer-subtree, we splice out x by attaching its left (respectively right) child s' as a parent of x and attaching x as a right (respectively left) child of p (respectively s).

If x has two children in its layer-subtree, then to make x a leaf of its layer-subtree, we splice out the node s by making the parent of s point to the right child of s instead of s itself. This is permissible because s is a descendant of x and because s has no left child in L_i, since it is the smallest element greater than x. We then move s to the location of x. Finally, we make x a child of p and make the new children of x the old children of p and s. These operations can all be accomplished by merely changing fields of these nodes. Figure 3.5 explains the process of making x a leaf of its layer-subtree for this case.

Figure 3.5: The MoveDown(x) operation in a layered working-set tree. In this case, x has two children in its layer-subtree. The dotted lines to nodes and subtrees indicate layer boundaries. (a) The initial layer-subtree containing x. (b) The layer-subtree after the nodes have been moved and layers changed, but before rebalancing.

Observe that x is now a leaf of its layer-subtree and that the layer-subtree is configured exactly as if we had deleted x using the deletion operation described by Cormen et al. [26, Section 13.4]. Therefore, we can perform Rebalance-Delete(s'), where s' is the (only) child of the node we have spliced out during deletion, as required by the description of the operation. To complete the movement of x to the next layer, we change the layer number of

x accordingly and execute Join(x) to create a balanced layer-subtree from x and its children. It is critical to note that if the children of x have larger layer numbers than the new layer number of x, the join operation has no effect and x becomes the lone element in a new layer-subtree; this follows from the fact that the join operation only operates on nodes of the same layer.

We must also update the linked lists encoded at the appropriate layers. This can be accomplished in essentially the same way as was done for the MoveUp(x) operation. We summarize the results of these operations in Lemma 3.3.

Lemma 3.3. For any layer L_i, the inter-layer operations YoungestInLayer(L_i) and OldestInLayer(L_i) can be implemented in such a way that they can be executed in time O(2^i). For any node x ∈ L_i, the inter-layer operations MoveUp(x) and MoveDown(x) can be implemented in such a way that they can be executed in time O(2^i).

Proof. The operations YoungestInLayer(L_i) and OldestInLayer(L_i) proceed to find the youngest (respectively oldest) element in layers L_1, L_2, ..., L_i. Given the youngest (respectively oldest) element in layer L_j, we can determine the youngest (respectively oldest) element in layer L_{j+1} in constant time since such an element maintains the key of the youngest (respectively oldest) element in the next layer. We then need to traverse from the root to that element. By Lemma 3.1, the total time is thus ∑_{j=1}^{i} O(2^j) = O(2^i).

The operations MoveDown(x) and MoveUp(x) consist of searching for x, performing a constant number of intra-layer operations and then making a series of queries for the youngest elements in several layers and updating the linked list encodings. The search can be done in time O(2^i) by Lemma 3.1 and the intra-layer operations each take time O(2^i) by Lemma 3.2. Finally, the queries for the youngest elements and the time spent updating the linked lists is dominated by the time of the query in the deepest layer, since each has size the square of the previous one. Since x ∈ L_i, the time is O(2^i) by the previous argument. The total time required for MoveUp(x) and MoveDown(x) is thus O(2^i). □

3.3.3 Tree Operations

Armed with our intra-layer and inter-layer operations, we are now ready to describe how to perform the tree operations Search(x), Insert(x) and Delete(x) on the layered working-set tree as a whole. Tree operations are independent of the layer-subtree implementation given suitable implementations of the intra- and inter-layer operations already defined in the previous sections.

Search(x) To perform a search for x, we begin by searching the binary search tree T in the usual way. Once we have found x ∈ L_i, we execute MoveUp(x) a total of i − 1 times in order to bring x into L_1. We then restore the sizes of the layers as was done in the working-set structure. We execute OldestInLayer(L_1) to find the oldest element in L_1, say x_1, and then run MoveDown(x_1). We then perform the same operation in L_2 by running OldestInLayer(L_2) to find the oldest element in L_2, say x_2, and then run MoveDown(x_2). This process of moving elements down layer-by-layer continues until we reach a layer L_j such that |L_j| < 2^{2^j}, at which point we stop. (In an ordinary search, we have i = j; however, thinking of the algorithm this way gives us a cleaner way to describe the insertion process later.) Note that efficiency can be improved by remembering the oldest elements of previous layers instead of finding the oldest element in each of L_1, ..., L_i during the second stage of the algorithm. However, such an improvement does not alter the asymptotic running time.
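The whole search can be summarized by the following sketch (ours). The operations of Sections 3.3.1 and 3.3.2 are passed in as an object `ops` with hypothetical method names, so the sketch only fixes the order in which they are invoked.

def layered_search(tree, x, ops):
    # ops is assumed to provide: find_node (plain BST search), move_up,
    # move_down, oldest_in_layer and layer_size, as described in the text.
    node = ops.find_node(tree, x)            # costs O(2^i) if x is in L_i
    i = node.fields['layer']
    for _ in range(i - 1):                   # promote x all the way into L_1
        ops.move_up(tree, node)
    # Second stage: push the oldest element of each overfull layer down,
    # stopping at the first layer that is not full.
    j = 1
    while ops.layer_size(tree, j) > 2 ** (2 ** j):
        oldest = ops.oldest_in_layer(tree, j)
        ops.move_down(tree, oldest)
        j += 1
    return node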

Insert(x) To insert x into T, we first examine the index k and size |L_k| of the deepest layer of T (recall that k and |L_k| are stored at the root of T). If |L_k| = 2^{2^k}, then we increment k and set |L_k| = 1. Otherwise, if |L_k| < 2^{2^k}, we simply increment |L_k|. We now insert x into T (ignoring layers for now) using the usual algorithm where x is inserted at the point the search path for x terminates. We set layer[x] = k + 1 (i.e., a temporary layer larger than any other) and update the linked list encoding in the same manner as described in Section 3.3.2. Finally, we run Search(x) to bring x into L_1. Note that since Search(x) stops moving down elements once the first non-full layer is reached, we do not place another element in layer k + 1. Thus, this layer is now empty and we update the youngest and oldest elements in layer k to indicate that there is no layer below it.

Delete(x) To delete x from T, we first examine the number of layers k, which is stored at the root. We then locate x ∈ L_i and perform MoveDown(x) a total of k − i + 1 times. This will cause x to be moved to a new (temporary) layer that is guaranteed to have no other nodes in it. Therefore, x must be a leaf of the tree, and we can simply remove it by setting the corresponding child pointer of its parent to nil. As was the case for insertion, this temporary layer is now empty and can be removed. We must now restore the size of layer i. To do this, we execute YoungestInLayer(L_k) to determine the youngest element in L_k, say x_k. We then execute MoveUp(x_k) to increase the size of L_{k−1} by one. We then execute YoungestInLayer(L_{k−1}) to determine the youngest element in L_{k−1}, say x_{k−1}, and execute MoveUp(x_{k−1}) to increase the size of L_{k−2}, and so on, until we reach L_i. At this point, all layers have the correct size. It could now be the case that |L_k| = 0. If this happens, we decrement the number of layers k, which is stored at the root, and update the youngest and oldest elements in the new deepest layer to indicate that there is no layer below. The main contribution of this chapter is

Theorem 3.4. There exists a binary search tree on n elements that supports a search for key x at time t in worst-case time O(log w_t(x)) (where w_t(x) is the working-set number of x at time t), as well as insertions and deletions in time O(log n).

Proof. A search consists of a regular search in a binary tree followed by several layer operations. Suppose x ∈ L_i at time t. By Lemma 3.1, we can find x in time O(2^i). We then perform MoveUp(x) in time O(2^i) by Lemma 3.3. We then run a constant number of inter-layer operations for every layer from 1 to i. By Lemma 3.3, this takes total time ∑_{j=1}^{i} O(2^j) = O(2^i). The total time to find x ∈ L_i is therefore O(2^i). By the same analysis as that of the working-set structure by Badoiu et al. [20], we have w_t(x) ≥ 2^{2^{i−1}}, and so O(2^i) = O(log w_t(x)).

An insertion consists of traversing through all layers. By Lemma 3.1, this takes time ∑_{j=1}^{k} O(2^j) = O(2^k) = O(2^{log log n}) = O(log n). We then perform a search in time O(log n) by the previous argument, since the element searched for is in the deepest layer. The total time is therefore O(log n).

A deletion consists of searching the tree to find x ∈ L_i and then performing a constant number of inter-layer operations per layer. The initial search takes O(2^i) time by Lemma 3.1 and the inter-layer operations take ∑_{j=1}^{k} O(2^j) = O(2^k) by Lemma 3.3, for a total time of O(2^k) = O(log n). □

3.4 Conclusion and Open Problems

In this chapter, we have developed the first binary search tree that answers queries in time logarithmic in the query's working-set number even in the worst case. This unifies the results of Sleator and Tarjan [61], who showed that one binary search tree (namely the splay tree) has this property in the amortized sense, with the results of Badoiu et al. [20], who demonstrated that this property exists in data structures that are not binary search trees. This result can be viewed as de-amortizing the working-set property in the binary search tree model. There are several possible directions for future research.

1. It seems that by enforcing this good worst-case behaviour for temporally local query distributions, we sacrifice good performance on some other access sequences. Is it the case that a binary search tree that has this stronger version of the working-set property cannot achieve other properties of, e.g., splay trees? For example, what kind of sequential access bound can be achieved in this setting? Recall that the scanning bound refers to the amount of time required to execute all queries in sequential order.

2. Are similar results possible for the dynamic finger property? Is it possible to answer a query in worst-case time O(log |a_t − a_{t−1}|)? As with the working-set property, this is well-known outside of the binary search tree model.

Chapter 4

Searching with Temporal Fingers

In this chapter, we show how to construct a data structure that supports query times that are logarithmic in the distance from the query element to a temporal finger, plus a small additive term.

4.1 Problem Definition

As in Chapter 3, we consider a set S and a sequence A = (a_1, a_2, ..., a_m), where a_t ∈ S. For the purposes of this chapter, we assume S = {1, 2, ..., n} (which is simply a reduction to rank-space) and that the set S is static. Recall that the working-set property roughly states that accesses are fast if they have been made recently. Conversely, the queueish property states that accesses are fast if they have not been made recently. We propose a generalization of these two properties, where a temporal finger is defined and the access time is a function of the distance in A between the temporal finger and the element to be accessed. Such a property can be viewed as a temporal version of the static finger property.


4.1.1 Defining Temporal Distance

We briefly review some definitions from Section 2.2.3. Recall that our access sequence is A = (a_1, a_2, ..., a_m). Define

ℓ_t(x) = min({∞} ∪ {t' > 0 | a_{t−t'} = x})

One can think of ℓ_t(x) as the most recent time x has been queried in A before time t. We then define

w_t(x) = n if ℓ_t(x) = ∞, and w_t(x) = |{a_{t−ℓ_t(x)+1}, ..., a_t}| otherwise.

Here, w_t(x) is the usual working-set number of element x at time t. We are now ready to define temporal distance. Consider a temporal finger f where 1 ≤ f ≤ n. The temporal distance from x to f at time t is defined as

τ_{t,f}(x) = |w_t(x) − f + 1|

Observe that if f = 1, then τ_{t,f}(x) = w_t(x). In this case, having query time logarithmic in the temporal distance is equivalent to the working-set property. Conversely, if f = n, then τ_{t,f}(x) = n − w_t(x) − 1, which is precisely the queue number q_t(x) defined by Iacono and Langerman [45] and described in Section 2.2.4. Performing a query in time logarithmic in the temporal distance in this case is equivalent to the queueish property. In general, our goal is a query time of O(log τ_{t,f}(x)) + o(log n).

Another way to view temporal distance is the following. At any time t, the query sequence A defines a permutation of the elements that orders them from most recent query to least recent query. The temporal finger f points to the f-th item in this permutation, and the temporal distance τ_{t,f}(x) is the distance from the f-th item to x in the permutation.
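As a concrete illustration (ours, not part of the thesis), both quantities can be computed directly from the access sequence:

def working_set_number(A, t, x, n):
    # A is the access sequence a_1, ..., a_m as a 0-indexed list; 1 <= t <= len(A).
    # w_t(x) = n if x does not occur before time t, and otherwise the number of
    # distinct elements in a_{t - l_t(x) + 1}, ..., a_t.
    previous = [s for s in range(1, t) if A[s - 1] == x]
    if not previous:
        return n
    last = previous[-1]                 # the time t - l_t(x) of the last access to x
    return len(set(A[last:t]))

def temporal_distance(A, t, x, n, f):
    # tau_{t,f}(x) = |w_t(x) - f + 1| for a temporal finger 1 <= f <= n.
    return abs(working_set_number(A, t, x, n) - f + 1)

A = [3, 1, 4, 1, 5]
print(working_set_number(A, t=5, x=4, n=5))      # two distinct accesses since 4: {1, 5}
print(temporal_distance(A, t=5, x=4, n=5, f=1))  # f = 1 recovers the working-set number: 2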

4.1.2 Background

The notion of a dictionary with query times that are sensitive to temporal distance is not new: the working-set structure [20] and the queueish dictionary [45] are two well-known examples. The layered working-set tree of Chapter 3 also falls into this category. However, the notion of allowing the finger by which temporal distance is measured to be selected in advance is relatively new; prior to this thesis, this problem has only been studied in the context of priority queues. Elmasry et al. [37] developed a priority queue that supports a constant number of temporal fingers. In particular, this shows the existence of a priority queue that supports both the working-set and queueish properties.

4.2 The Data Structure

Our data structure consists of two parts: Old and Young. A schematic of the data structure is illustrated in Figure 4.1.

Figure 4.1: A schematic of the data structure for temporal finger searching. Pointers between elements in substructures and the corresponding queue elements are not shown. The substructures in the Old data structure are drawn as trees, but are actually implemented as sorted arrays.

4.2.1 The Old Data Structure

The Old data structure contains the n − f elements that were last accessed more than f queries ago. They are stored in a working-set structure [20]. As discussed in Section 3.1.2, the working-set structure consists of balanced binary search trees (e.g., AVL trees [1]) D_1, D_2, ..., D_k of size 2^{2^j} for j = 1, 2, ..., k. It follows that k is O(log log(n − f)). Each tree D_j has an accompanying queue Q_j containing the same elements in the order that they were inserted into D_j. Pointers are maintained between an element in a tree and the corresponding element in the queue. The concatenation of all queues is precisely a list of the n − f elements in the order they were queried, from most recent to least recent.

4.2.2 The Young Data Structure

The Young data structure contains the f elements that were accessed at most f accesses ago. They are stored in a modified queueish dictionary [45]. The queueish dictionary consists of a series of substructures D'_1, D'_2, ..., D'_{k'} and queues Q'_1, Q'_2, ..., Q'_{k'}. As in the Old data structure, the concatenation of the queues in this data structure orders the elements in increasing order of last access time. However, the queues no longer correspond exactly to the substructures: all of the elements of Q'_1 ∪ Q'_2 ∪ ... ∪ Q'_j are stored in D'_j, but D'_j may contain additional elements. Pointers are maintained between each element of D'_j and its corresponding entry in a queue (which may not be Q'_j). The size of Q'_j is between 2^{2^{j−1}} and 2^{2^j}. D'_{k'} will contain all elements in the structure and thus have size f, and Q'_{k'} has size at least 2^{2^{k'−1}}. Therefore, k' = O(log log f). As suggested by Iacono and Langerman [45], D'_j can be implemented as a sorted array. Note that D'_{k'} will be implemented differently; we address this issue during the analysis.

4.2.3 Performing a Query

To perform the query x at time t, we search in D_1, D'_1, D_2, D'_2, ... and so on, until x is found. At this point x is now the most recently accessed (i.e., youngest) element in the data structure, so we delete it from the structure we found it in and insert it into Young. There are two possible cases: either x is found in Old or in Young.

Suppose first that x is found in Old, say x ∈ D_j. In this case, we delete x from D_j and insert x into Young. We then delete the oldest element in Young and insert it into Old. In doing so, we will need to shift elements in Old down to restore the size of D_j. This can be done by taking the oldest element out of each subtree and placing it in the next larger subtree until we reach D_j.

Suppose now that x is found in Young, say x ∈ D'_j. This case is handled exactly as it would be in a regular queueish dictionary: x is removed from Q'_j and inserted at the front of Q'_{k'}. In doing so, |Q'_j| may become too small (i.e., less than 2^{2^{j−1}}). If this occurs, we remove 2^{2^j} − |Q'_j| elements from the end of Q'_{j+1}, insert them at the front of Q'_j, and reconstruct D'_j from the elements in Q'_1, Q'_2, ..., Q'_j. This may result in Q'_{j+1} becoming too small (and so on); these cases are handled identically.
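The probe order and the two cases can be sketched as follows (ours; the substructures are reduced to plain sets and the restructuring work is indicated only by comments):

def query(old_structures, young_structures, x):
    # old_structures = [D_1, D_2, ...], young_structures = [D'_1, D'_2, ...],
    # each modelled as a set of keys. Probe D_1, D'_1, D_2, D'_2, ... in order.
    for j in range(max(len(old_structures), len(young_structures))):
        if j < len(old_structures) and x in old_structures[j]:
            # Case 1: x was accessed long ago. Delete x from D_j, insert it
            # into Young, and shift the oldest element of Young back into Old.
            return ('old', j + 1)
        if j < len(young_structures) and x in young_structures[j]:
            # Case 2: x was accessed recently. Move x to the front of Q'_{k'}
            # and refill any queue Q'_j that has become too small.
            return ('young', j + 1)
    raise KeyError(x)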

4.2.4 Access Cost

The cost of an access can be separated into two parts: the cost of finding the query element and the cost of adjusting the data structure.

Finding x. To find the element x at time t, we search in D_1, D'_1, D_2, D'_2, ... until x is found in D_j or D'_j. The cost to find the element is therefore ∑_{i=1}^{j} O(log 2^{2^i}) = ∑_{i=1}^{j} O(2^i) = O(2^j). Again, there are two possible cases: either x ∈ D_j or x ∈ D'_j.

If x ∈ D_j, then w_t(x) = f + Ω(2^{2^{j−1}}). Therefore, τ_{t,f}(x) = |w_t(x) − f + 1| = |f + Ω(2^{2^{j−1}}) − f + 1| = Ω(2^{2^{j−1}}). Equivalently, j ≤ log log τ_{t,f}(x) + O(1). The cost to find the element is therefore O(2^j) = O(log τ_{t,f}(x)). If x ∈ D'_j, then q_t(x) = (n − f) + Ω(2^{2^{j−1}}), and since w_t(x) = n − q_t(x) − 1, we have w_t(x) = f − Ω(2^{2^{j−1}}) − 1. Therefore, τ_{t,f}(x) = |w_t(x) − f + 1| = |f − Ω(2^{2^{j−1}}) − 1 − f + 1| = |−Ω(2^{2^{j−1}})| = Ω(2^{2^{j−1}}). Equivalently, j ≤ log log τ_{t,f}(x) + O(1). The cost to find the element is therefore O(2^j) = O(log τ_{t,f}(x)). In either case, we have that the portion of the access cost dedicated to finding x is O(log τ_{t,f}(x)).

Adjusting the Data Structure. The data structure must now be adjusted. If x is found in Young, then Young can be restructured in the usual manner at amortized cost O(log log f) by radix sorting the indices, as suggested by Iacono and Langerman [45]. If x is found in Old, however, an insertion must be performed into D'_{k'}. If D'_{k'} is implemented as a binary search tree or sorted array, this will take time O(log f), which is too slow. We therefore describe alternative implementations of D'_{k'} to improve our access time.

The first alternative is to use a y-fast trie [67]. In this case, the word size can be considered O(log n), since we are effectively operating in rank space. Doing so allows us to insert, delete and search in D'_{k'} in amortized time O(log log n). This follows from the fact that we know the rank of x because we have already found it in the previous stage. Recall that the queueish dictionary still requires a way of rebuilding smaller structures from larger structures. If the other substructures are implemented as arrays, all that is required is following pointers from Q'_{k'} to the oldest elements in D'_{k'}, placing them in an array, deleting them from Q'_{k'}, and proceeding as usual. This results in an amortized query time of

O(log τ_{t,f}(x)) + O(log log n)

The second alternative is to instead use the predecessor search structure of Beame and Fich [8]. Doing so allows us to insert, delete and search in D'_{k'} in time

O(min{(log log n)(log log f) / log log log n, √(log f / log log f)})

This results in an amortized query time of

O(log τ_{t,f}(x)) + O(min{(log log n)(log log f) / log log log n, √(log f / log log f)})

At this point we note that, using the first technique, we match the performance of the queueish dictionary described by Iacono and Langerman [45] when f = n. Using the second technique, we match the performance of the working-set structure of Badoiu et al. [20] when f = 1. It is also straightforward to determine in advance which technique should be used: if f is Ω(log n), then the first technique should be used; otherwise the second technique should be used. To summarize, we have

Theorem 4.1. Let 1 ≤ f ≤ n. There exists a static dictionary over the set {1, 2, ..., n} that supports querying element x in amortized time

O(log τ_{t,f}(x)) + O(min{log log n, (log log n)(log log f) / log log log n, √(log f / log log f)})

where τ_{t,f}(x) denotes the temporal distance between the query x and the temporal finger f at time t.

4.3 Conclusion and Open Problems

In this chapter, we constructed the first dictionary data structure that supports query times that are sensitive to an arbitrarily placed temporal finger. There are several possible directions for future research.

1. Can the additive term in Theorem 4.1 be reduced? This would be interesting even for specific (ranges of) values of f. When f = n, for example, the best known result is O(log τ_{t,n}(x) + log log n) [45]. The case when f = 1 is fully solved by Badoiu et al. [20].

2. Is it possible to support multiple temporal fingers (e.g., O(1) many)? Simply searching the structures in parallel allows us to find the query element in time proportional to the logarithm of the minimum temporal distance, but it is not obvious how to quickly restructure the data structures and promote the query element to the cheapest substructure in parallel for each structure.

3. Is it possible to maintain the temporal finger property while supporting dynamic update operations?

Chapter 5

The Strong Unified Property

In this chapter, we define a stronger version of the unified property. In particular, we count distance only among elements in a certain subset of the dictionary (as opposed to all elements). For many access sequences, this results in smaller query times. We then present a data structure that comes to within a small additive term of the query time stated by this property.

5.1 Problem Definition

We consider a set S and a sequence A = (a_1, a_2, ..., a_m), where a_t ∈ S. As in Chapter 4, we assume S = {1, 2, ..., n} (i.e., a reduction to rank-space) and that the set S is static. We wish to store the elements of S in a data structure so that accessing the elements in the order defined by A is fast. Recall that the unified property roughly states that an access is fast if it is close to a recent access. For example, in the access sequence

(1, n/2 + 1, 2, n/2 + 2, 3, n/2 + 3, ...)

almost every access is at a distance of one from the element accessed two accesses ago. However, consider the following access sequence:

(K, n/2 + K, 2K, n/2 + 2K, 3K, n/2 + 3K, ...)

where K is a large number. This new access sequence is very similar to the old one, except that distances are no longer preserved. Therefore, the unified property no longer indicates that this sequence should execute quickly. However, it still seems as if this access sequence has some locality: each access is close in terms of rank distance among recently-accessed elements to the query made two accesses ago.

5.1.1 Defining the Strong Unified Property

Recall the definitions used in Section 2.2.3. Define

ℓ_t(x) = min({∞} ∪ {t' > 0 | a_{t−t'} = x})

One can think of ℓ_t(x) as the most recent time x has been queried in A before time t. We then define

w_t(x) = n if ℓ_t(x) = ∞, and w_t(x) = |{a_{t−ℓ_t(x)+1}, ..., a_t}| otherwise.

We also define W_t(s) to be the set of elements x ∈ S such that w_t(x) ≤ s, and d_T(a, b), for a set T and elements a, b ∈ T, to be the rank distance between a and b in the set T. Intuitively, we would like the strong unified property to state that the time to access an element x at time t is

O( min_{y ∈ W_t(w_t(x))} log (w_t(y) + d_{W_t(w_t(x))}(x, y)) )     (*)

Unfortunately, such an access time is not possible. To see this, consider the access sequence (2, 3, 4, 5, 6, ..., n, 1, x, ...) where x ∈ S. Suppose the access to x occurs at time t. Note that the access at time t − 1 is to 1 and that 2, 3, ..., x − 1 are not present in W_t(w_t(x)). Therefore, the value of y that minimizes (*) is 1, which has constant working-set number and constant rank distance to x. This means that for any of the n choices of x, the strong unified property requires the data structure to perform the access in constant time; this is not information-theoretically possible [59]. We therefore modify the strong unified bound so that it is at least information-theoretically plausible:

O( min_{y ∈ W_t(w_t(x)^2)} log (w_t(y) + d_{W_t(w_t(x)^2)}(x, y)) )

Note that this W_t(w_t(x)^2) includes elements that were accessed less recently than x. These additional elements will allow us to support the rest of the access cost while respecting information-theoretic lower bounds. The intuition for expanding the set under consideration in this manner is the fact that the data structure will consist of substructures that increase doubly-exponentially in size, and so by squaring the working-set number under consideration, we can take advantage of elements in the next substructure. As an example, consider the following access sequence, where 15 is the element currently being searched for at the end of the sequence:

2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 15

Here W_t(w_t(15)) = {15, 16, 1} and W_t(w_t(15)^2) = {9, 10, 11, 12, 13, 14, 15, 16, 1}.

The original definition in (*) uses d_{W_t(w_t(15))}, whereas the modified definition uses d_{W_t(w_t(15)^2)}. The modified definition allows for 9, 10, 11, 12 and 13 to contribute to the rank distance, which results in an information-theoretically plausible query time, since it no longer results in a situation where any query must be executed in constant time.

5.1.2 Background

The unified property was introduced by Badoiu et al. [20], who also showed how to construct a data structure that matched this query time. Derryberry [30] further considered the problem when restricted to the binary search tree model and introduced skip-splay and cache-splay trees; the former comes within an additive O(log log n) of the unified bound, while the latter achieves it to within a constant factor. The idea of restricting the rank space used to count distance has not previously been studied.

5.2 Towards the Strong Unified Property

In this section, we describe a data structure that achieves the strong unified property to within a small additive term of

O((log d_{W_t(w_t(x))}(x, y)) (log log w_t(x)))

5.2.1 The Data Structure

The data structure consists of k finger search trees as well as k accompanying queues. The finger search trees of Brodal et al. [19] support insertions and deletions in O(1) worst-case time (when provided with a pointer to the element to be deleted) and finger searches in O(log d) worst-case time, where d is the distance between the element being searched for and the supplied pointer into the data structure.

The size of T_j is 2^{2^j}, except for T_k which has size n. It follows that k is O(log log n). We will maintain the invariant that T_j ⊂ T_{j+1} for all 1 ≤ j < k. The queue Q_j contains exactly the same elements as T_j in the order they were inserted into T_j. Pointers are maintained between elements in the queue and corresponding elements in the finger search tree. A schematic is presented in Figure 5.1.

Figure 5.1: A schematic of the data structure for the strong unified property. Pointers between elements in finger search trees and the corresponding queue elements are not shown.

To perform a search, we will perform finger searches in T_1, T_2, ... until we find x ∈ T_j. In T_1, we use an arbitrary element as the starting finger for the search. In all other trees, we run two finger searches in parallel, one from the successor of the element found in the previous tree, and one from the predecessor of the element found in the previous tree, stopping as soon as one of these searches terminates. Once we have found x ∈ T_j, we insert x into T_1, T_2, ..., T_{j−1} (note that x is not present in any of these trees, since if it were, it would have already been found) and enqueue x in Q_1, Q_2, ..., Q_{j−1}. At this point, we note that each of T_1, T_2, ..., T_{j−1} and Q_1, Q_2, ..., Q_{j−1} is too big. We therefore dequeue the oldest element in each of Q_1, Q_2, ..., Q_{j−1} and delete the corresponding elements in T_1, T_2, ..., T_{j−1}.
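A self-contained sketch of the search phase (ours): each T_j is modelled as a sorted list, and the finger search is simulated by charging O(log d) for the rank distance d between the finger and the target, in place of the finger search trees of Brodal et al.

import bisect

def simulated_finger_search(tree, finger_value, x):
    # Simulate a finger search in the sorted list `tree`: the cost charged
    # is proportional to the logarithm of the rank distance between the
    # finger and x.
    start = bisect.bisect_left(tree, finger_value)
    idx = bisect.bisect_left(tree, x)
    cost = (abs(idx - start) + 1).bit_length()
    return idx, cost

def strong_unified_search(trees, x):
    # trees = [T_1, T_2, ...]: sorted lists with T_j a subset of T_{j+1}.
    # Returns (j, total_cost) where T_j is the first tree containing x.
    total = 0
    finger = trees[0][0]                # an arbitrary element of T_1
    for j, tree in enumerate(trees, start=1):
        idx, cost = simulated_finger_search(tree, finger, x)
        total += cost
        if idx < len(tree) and tree[idx] == x:
            return j, total             # found; x would now be promoted into T_1..T_{j-1}
        # The predecessor of x in T_j becomes the finger for T_{j+1} (the
        # thesis also runs a second search from the successor, in parallel).
        finger = tree[idx - 1] if idx > 0 else tree[0]
    raise KeyError(x)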

5.2.2 Analysis

Recall that we are aiming for a running time of

O( ( min_{y ∈ W_t(w_t(x)^2)} log (w_t(y) + d_{W_t(w_t(x)^2)}(x, y)) ) + (log d_{W_t(w_t(x))}(x, y)) (log log w_t(x)) )

Consider a search for x at time t, and consider the element y that minimizes the above expression. Suppose x first appears in T_j and y first appears in T_{j'}. Because x first appears in T_j, we have that w_t(x) ≥ 2^{2^{j−1}}. Therefore, j ≤ log log w_t(x) + 1. Similar reasoning shows w_t(y) ≥ 2^{2^{j'−1}}, so that j' ≤ log log w_t(y) + O(1). If j ≤ j' (i.e., x appears no later than y), then the running time follows easily: x has working-set number w_t(x) ≤ w_t(y). The element x can thus be found in time ∑_{i=1}^{j} 2^i = O(2^j), which is O(2^{j'}) = O(log w_t(y)).

The more interesting case occurs when j > j' (i.e., x appears after y). Here, y can be found in time ∑_{i=1}^{j'} 2^i = O(2^{j'}) = O(log w_t(y)). By the time the algorithm finishes searching T_{j'}, it has a finger for an element y' such that d_{W_t(w_t(x))}(x, y') ≤ d_{W_t(w_t(x))}(x, y). Therefore, each of the remaining finger searches is over a rank distance of at most d_{W_t(w_t(x))}(x, y), except for the last search, which is over a rank distance of at most d_{W_t(w_t(x)^2)}(x, y). There are thus O(log log w_t(x)) finger searches that cost O(log d_{W_t(w_t(x))}(x, y)) to be performed, and one that costs O(log d_{W_t(w_t(x)^2)}(x, y)). The total cost of these searches is therefore

O((log d_{W_t(w_t(x))}(x, y)) (log log w_t(x)) + log d_{W_t(w_t(x)^2)}(x, y))

At this point, x has been found and we must now adjust the data structure. First, x must be inserted in T_1, T_2, ..., T_{j−1}. Because we have a finger for x inside each of these structures, this takes total time O(log log w_t(x)). Enqueuing x in each of Q_1, Q_2, ..., Q_{j−1} also takes O(j) = O(log log w_t(x)). The subsequent deletions and dequeueings of the oldest elements in Q_1, Q_2, ..., Q_{j−1} and T_1, T_2, ..., T_{j−1} take a total of O(j) = O(log log w_t(x)) time as well, since the dequeueing operation takes O(1) time and provides a pointer to the node in the corresponding tree where the deletion must be performed. We therefore have

Theorem 5.1. There exists a static dictionary over the set {1, 2, ..., n} that supports querying element x in time

O( ( min_{y ∈ W_t(w_t(x)^2)} log (w_t(y) + d_{W_t(w_t(x)^2)}(x, y)) ) + (log d_{W_t(w_t(x))}(x, y)) (log log w_t(x)) )

5.3 Conclusion and Open Problems

In this chapter, we defined a stronger version of the unified property and described a data structure that achieves it to within a small additive term. Instead of computing rank distance over the entire dictionary, we compute rank distance only within a working-set containing an element that is close to a recently-accessed element. There are several possible directions for future research.

1. One can reduce the distance measure d_{W_t(w_t(x)^2)}(x, y) to d_{W_t(w_t(x)^{1+ε})}(x, y) by changing how the substructures grow. Is it possible to reduce this further, say to d_{W_t(p(w_t(x)))}(x, y) for a more slowly growing function p?

2. We argued that it is not possible to use d_{W_t(w_t(x))}(x, y) in the worst case. Is it possible to achieve this in the amortized sense?

3. Can the additive term in Theorem 5.1 be reduced? It seems difficult to reduce this term below O(log log w_t(x)) using an approach similar to the one presented here, since elements must shift through this many substructures.

4. Is it possible to maintain the strong unified property while supporting dynamic update operations?

Chapter 6

Distance-Sensitive Predecessor Search in Bounded Universes

In this chapter, we show how to perform predecessor searches in bounded universes in a distribution-sensitive manner. Given a set of elements drawn from a larger universe of a known size, we wish to know the largest element of the set that is less than or equal to (i.e., the predecessor of) a given query drawn from the universe. The time for a predecessor query will be a function of the distance between the query and the answer to the query, which unifies results on hashing data structures (e.g. [33]) with results from predecessor search data structures (e.g. [67]). We also give several useful applications of a distance-sensitive predecessor search data structure, namely the approximate nearest neighbour search problem and the range searching problem.


6.1 Problem Definition

Let U = {0, 1, ..., U − 1}. We address the problem of maintaining a dictionary subject to searches, insertions and deletions on a subset S ⊆ U of size |S| = n in the word-RAM model. In this model, we allow constant-time access to elements of U, which are represented by Θ(log U) bits, and arithmetic and comparison operations on elements of U also take constant time.

This problem has been well-studied in several different contexts. If we do not take advantage of the fact that S ⊆ U, then the results of Chapter 3 apply. Using the fact that S ⊆ U, however, we can use dynamic perfect hashing [33] to support all operations in O(1) expected time. It would therefore seem that the problem is solved. However, hashing does not provide a useful answer after an unsuccessful query. In such a situation, one turns to predecessor queries. In such a query, the data structure must return the element searched for (if it is stored in S) or the largest element that is smaller than the query element (if the query element is not stored in S). Again, if we ignore the fact that S ⊆ U, then the results of Chapter 3 apply. However, using the fact that S ⊆ U again allows us to achieve a query time of O(log log U) using, e.g., van Emde Boas trees [64], x- and y-fast tries [67], or the data structure of Mehlhorn and Näher [52]; this is an improvement when n is ω(log U).

Space is an important consideration in such problems. The static version of the problem, for example, can be solved using O(U) space simply by precomputing the answer to every query. However, this amount of space is far too large. Generally, we aim for O(n) or O(n log^c U) space for some constant c. The same is true for the dynamic version of the problem.

We introduce the idea of local searching in a bounded universe in order to unify the results of hashing with those of predecessor search structures. Suppose we have a query x ∈ U and let pred(x) ∈ S denote its predecessor stored in the data structure (if x ∈ S, then we define pred(x) = x). Equivalently, pred(x) = max{y ∈ S | y ≤ x}. Define Δ = |x − pred(x)|. We wish to answer predecessor queries (i.e., return pred(x)) in time O(log log Δ). Observe that if it happens that x is in the data structure, then Δ = 0 and so the query will run in O(1) time, which matches the query time obtained by hashing. Conversely, if x ∉ S, then we still have Δ < U, and so the query runs in O(log log U) time, which is no worse than that offered by the usual predecessor search structures. We will also show how updates can be supported with the same time bound.
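As a small illustration of these definitions (ours, not part of the thesis), pred(x) and Δ can be computed from any sorted snapshot of S; the goal of this chapter is a structure that answers the query itself in time O(log log Δ).

import bisect

def predecessor_and_delta(S_sorted, x):
    # pred(x) = max{y in S | y <= x} and Delta = |x - pred(x)|.
    # Returns (None, None) if no element of S is <= x.
    i = bisect.bisect_right(S_sorted, x)
    if i == 0:
        return None, None
    pred = S_sorted[i - 1]
    return pred, x - pred

S = [3, 9, 17, 120]
print(predecessor_and_delta(S, 17))   # (17, 0): x is stored, so Delta = 0
print(predecessor_and_delta(S, 20))   # (17, 3): a nearby query, so Delta is small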

6.1.1 Background

As mentioned previously, predecessor searches can be performed by van Emde Boas trees [64] or the x- and y-fast tries of Willard [67]. van Emde Boas trees support a query time of O(log log U) using space O(U). This large space usage is the primary drawback of van Emde Boas trees. In general, one endeavours to avoid any dependence on U in the space requirements; linear dependence is not acceptable, while polylogarithmic dependence is more reasonable. x- and y-fast tries were proposed by Willard to overcome this deficiency and support the same query time of O(log log U) using space O(n). We elaborate on these results in Section 6.1.2 since they play a key role in the results of this chapter.

Perhaps surprisingly, the query time of O(log log U) can be further improved somewhat. Beame and Fich [8] gave a data structure that uses space n^{O(1)} and supports a query time of

O(min{log log U / log log log U, √(log n / log log n)})

The space can be reduced to O(n) by allowing an additional O(log log n) factor in the first half of the previous bound to achieve a bound of

O(min{(log log U)(log log n) / log log log U, √(log n / log log n)})

There has been further study into the exact complexity of the predecessor search problem [55, 56], but those results go beyond the scope of the distribution-sensitive data structures to be discussed in this chapter. Few distribution-sensitive results are known for the predecessor search problem. There are two results that can be viewed as "close" to the result presented here. The first is that of Kaporis et al. [47], who presented a property similar to finger search and supported queries in time O(log log d), where d is the distance from the query to a finger. However, that query time only applies to subsets S that are chosen randomly according to a particular kind of distribution. A data structure with a similar restriction was presented by Belazzougui et al. [9] that supports constant query time with high probability. It is crucial to note that these data structures use randomization in the input and therefore depart significantly from the usual setting. The second result similar to the one here is that of Andersson and Thorup [4], who showed that finger search can be supported in time O(√(log d / log log d)). If one does not take advantage of the fact that S ⊆ U, then the other finger search results mentioned in Section 2.4 apply as well, of course, but are exponentially slower than those discussed in this chapter for subsets S of sufficient size. The O(n)-space data structure of Nekrich [53] supports O(1)-time update operations provided that the updates occur near existing elements, but only achieves a query time of O(log log U).

6.1.2 x- and y-fast Tries

We begin with a review of x- and y-fast tries, which were presented by Willard [67] as space-efficient alternatives to van Emde Boas trees [64]. An x-fast trie is a binary tree whose leaves are elements of U and whose internal nodes represent the binary prefixes of these leaves. Since any element of U can be represented by Θ(log U) bits, the height of an x-fast trie is Θ(log U). At any internal node, moving to the left child appends a 0 to the prefix, while moving to the right child appends a 1. The prefix at the root of the trie is empty. Therefore, a node at depth i represents a block of at most 2^{log_2 U − i} elements of S having the same i highest order bits. Any root-to-leaf path in the trie yields the binary representation of the element at the leaf of that path. At every level of the tree (where the level of a node is equal to the height of the subtree rooted at that node, so that leaves are at level 0), a hash table is maintained on all nodes (i.e., prefixes of elements) at that level. Each internal node with no left (respectively right) child is augmented with an additional pointer to the smallest (respectively largest) leaf in its subtree, and all leaves maintain pointers to both the previous and next leaves. Nodes with no elements stored at their leaves are removed from the tree. An x-fast trie therefore uses O(n log U) space, since each of the n elements of S appears in O(log U) hash tables (one at each level).

A query is performed by executing a binary search on the hash tables to find the deepest node whose prefix is a prefix of the query. If the query x has binary representation x = x_{log_2 U} x_{log_2 U − 1} ··· x_0, then the query in the hash table at depth i is for X_i = x_{log_2 U} x_{log_2 U − 1} ··· x_{log_2 U − i}. If this node happens to be a leaf, then the search is complete. Otherwise, this node must have exactly one child (since otherwise the found node is not the deepest node whose prefix is a prefix of the query). Since the node has only one child, this node has a pointer to the largest (or smallest) leaf in its subtree. Following this pointer will lead to a node that is distance at most 1 away from the predecessor of the query. Since the leaves form a doubly-linked list, the correct predecessor can thus be found easily. The binary search among the levels takes O(log log U) time since there are O(log U) levels, and the subsequent operations take O(1) time, for a total of O(log log U) time.

The main drawback of x-fast tries is the fact that they use O(n log U) space. The y-fast trie overcomes this obstacle by using indirection to reduce the space usage to O(n). The leaves are divided into O(n / log U) groups called buckets, each of size O(log U), and each group is placed into a balanced binary search tree. Each binary search tree has a representative which is stored in the trie. A search in the trie for an element will eventually lead to a binary search tree, which can then be searched in time O(log log U). Observe that the number of elements stored in the x-fast trie is O(n / log U), since only one representative from each group is stored. The total space needed is thus O((n / log U) log U) = O(n). The y-fast trie also facilitates insertions and deletions easily by simply rebuilding the binary search trees when their sizes have doubled or quartered: this allows for insertion and deletion in expected amortized time O(log log U).

6.2 A Preliminary Data Structure

In this section, we describe a preliminary version of the data structure that uses a large amount of space and does not support update operations. We will improve upon this in Section 6.3.

In order to make "local" searches (those searches where Δ is small) fast, the key observation is that instead of performing a binary search on the levels of an x-fast trie, we should instead perform a doubly exponential search on the levels (i.e., searching levels 2^{2^i}), beginning at the leaves. By doing this, we ensure that any element for which Δ = 0 is returned immediately. Furthermore, elements that are close to the query will be contained in the same (fairly small) subtree.

Suppose we have an x-fast trie. During a search, we perform hash table queries for the appropriate prefix of the query at level 0 and then at levels 2^{2^i} for i = 0, 1, ..., O(log log log U). If the prefix is found, the query can be answered by performing a search in the x-fast trie starting at that node (i.e., all nodes not in the subtree rooted at that node can be ignored). If, however, the prefix is not found, then we search for the predecessor of that prefix (in the usual binary ordering) at the same level. If this predecessor is found, the query can be answered by following a pointer from that predecessor to the largest leaf in its subtree. If this predecessor is not found, then the exponential search continues. (It is worth noting that we could also check for the successor of the appropriate prefix at that level, since this will also allow us to navigate quickly to a leaf that is close to the predecessor. Such an improvement does not alter the asymptotic query time.) Checking for the presence of this predecessor can be viewed as verifying that the predecessor of the query really is "far away" and so the cost of continuing the exponential search is not too large. We now show that this query algorithm supports the desired query time.
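The level schedule and the two hash-table probes per level can be sketched as follows (ours; the x-fast trie is reduced to one set of prefixes per probed level, and the sketch only reports where the search would descend rather than performing the final descent).

def build_prefix_tables(S, w):
    # One hash table (here a Python set) of prefixes per probed level; the
    # prefix of a value at level l is the value with its l low-order bits removed.
    # Levels probed: 0, then 2^(2^i) for i = 0, 1, ..., plus the root level w.
    levels = [0]
    i = 0
    while 2 ** (2 ** i) < w:
        levels.append(2 ** (2 ** i))
        i += 1
    levels.append(w)
    return levels, {l: {s >> l for s in S} for l in levels}

def locate_level(levels, tables, x):
    # Doubly exponential search over the levels: stop at the first level where
    # either x's prefix or that prefix's predecessor is present.
    for l in levels:
        p = x >> l
        if p in tables[l]:
            return l, 'prefix'        # descend into the subtree rooted at p
        if p - 1 in tables[l]:
            return l, 'predecessor'   # follow its pointer to the largest leaf below
    return None, 'not found'

levels, tables = build_prefix_tables([3, 9, 17, 120], w=8)
print(locate_level(levels, tables, 20))   # (2, 'predecessor'): the block below 20 holds pred(20) = 17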

Lemma 6.1. The modified search algorithm for an x-fast trie described above is correct and can be performed in O(log log Δ) time.

Proof. If the query is contained in the trie, it must be stored in the hash table containing all the leaves, and thus can be found in O(1) = O(log log Δ) time.

Assume the query is not contained in the trie. We must show that its predecessor is found in time O(log log Δ). Assume that a hash table query (for either the appropriate prefix or its predecessor) is first successful at level 2^{2^i}. Then both the prefix of the query and the predecessor of that prefix were not found at level 2^{2^{i−1}}. If the prefix of the query or the predecessor of that prefix had been in the trie, then the subtree rooted there would have size 2^{2^{2^{i−1}}}. However, since they are not present, the elements below them are not present either, and we therefore have that Δ > 2^{2^{2^{i−1}}}, so that i is log log log Δ + O(1). The exponential search reaches level 2^{2^i} in O(i) time, and the subsequent search of the trie rooted at the found node will take O(log log 2^{2^{2^i}}) = O(2^i) time, for a total of O(2^i) = O(log log Δ) time, as required. □

Observe that this data structure is essentially the same as an x-fast trie; we are merely changing the search algorithm. It follows that the space used by this data structure is O(n log U).

Theorem 6.2. Let U = {0, 1, ..., U − 1}. There exists a data structure that supports predecessor searches over a subset of U in time O(log log Δ) using O(n log U) space.

As was the case with x-fast tries, the major drawback of the result of Theorem 6.2 is that its space requirement is a function of U. Although only logarithmic, this extra factor is still fairly large. In the next section, we show how to drastically reduce this factor, while at the same time making the data structure dynamic.

By replacing "predecessor" with "successor" throughout the description of the data structure, the same time bound can be achieved where Δ is defined to be the difference between the element being searched for and its successor in the data structure. Furthermore, if both structures are searched simultaneously until one returns an answer, it is possible to answer such queries in O(log log Δ) time where Δ is the minimum of the differences between the query and its predecessor and successor in the structure.

6.3 Reducing Space and Supporting Updates

In this section, we will improve the result of Theorem 6.2 by reducing the space requirements and by supporting insertions and deletions.

One straightforward solution would be to modify the data structure from Theorem 6.2: to insert an element, we add the appropriate prefixes to all of the hash tables, and to delete an element we delete the appropriate prefixes from the hash tables (with some additional bookkeeping to make sure that no other element stored has a prefix that we delete). Unfortunately, this technique is not fast enough for our purposes: there are O(log U) hash tables in an x-fast trie (one for each level) and so the update time would be O(log log Δ + log U). This can be partially overcome by storing only the hash tables at level 0 and at levels 2^{2^i} for i = 0, 1, ..., O(log log log U). Of course, the search algorithm would still have to be adapted so that not every hash table need be maintained. Furthermore, this still only improves the update time to

O(log log Δ + log log log U). The key problem here is that too many hash tables must be updated during an insertion or deletion. We would therefore like to have each hash table maintain the appropriate prefixes without needing explicit updating after every insertion and deletion. To do so, we again use indirection.

We maintain a skip list² [57] that stores all elements currently in the trie. Pointers are maintained between the elements of the trie and their corresponding elements in the skip list. Furthermore, each element of the hash tables maintains a y-fast trie whose universe is the maximal subset of U that could be stored as leaves of that node; equivalently, the universe is the set of all elements of U whose prefix is the element. Therefore, an element of a hash table at level 2^{2^i} has a y-fast trie over a universe of size 2^{2^{2^i}}. Recall that y-fast tries maintain buckets whose size is logarithmic in the size of the universe. This means that the y-fast trie pointed to by an element of a hash table at level 2^{2^i} has bucket size 2^{2^i}.

The key to reducing the query time is to maintain these buckets indirectly. Any pointer in the "top" portion of a y-fast trie that points to an element in a bucket instead points to a representative of that bucket that is stored in the skip list. The remaining trick is to carefully select a representative of the bucket in order to ensure that buckets do not need to be frequently rebuilt during insertions and deletions. We shall select representatives based on their height in the skip list. For reasons that will become clear shortly, we select the probability of promotion in the skip list to be 1/3. Now, consider a y-fast trie pointed to by a hash table element at level 2^{2^i}. The representatives of the buckets for that y-fast trie will be the elements

having that prefix that have height at least 2^i in the skip list. The expected size of a bucket is thus the expected number of elements between two consecutive elements of the skip list at height 2^i. Since the probability of promotion is 1/3, the expected size of a bucket is thus (1/(1/3))^{2^i} = 3^{2^i}.

²Skip lists belong in the pointer machine model, and can therefore be implemented in our model as well.
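The choice of 1/3 as the promotion probability can be sanity-checked empirically; the following small experiment (an illustration only, with arbitrary sample sizes and a fixed random seed) estimates the average gap between consecutive skip-list elements of height at least h and compares it to 3^h.

    import random

    rng = random.Random(1)

    def skip_height(p=1/3):
        """Number of promotions of a freshly inserted skip-list element."""
        h = 0
        while rng.random() < p:
            h += 1
        return h

    def average_gap(n, h):
        """Average distance between consecutive elements of height >= h."""
        heights = [skip_height() for _ in range(n)]
        marks = [i for i, ht in enumerate(heights) if ht >= h]
        gaps = [b - a for a, b in zip(marks, marks[1:])]
        return sum(gaps) / len(gaps)

    for h in (1, 2, 4):
        print(h, 3 ** h, round(average_gap(200000, h), 1))   # gap is close to 3**h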

Lemma 6.3. The expected space used by this data structure is O(n log log log U).

Proof. Since a prefix corresponding to each of the n elements in the dictionary is stored in O(log log log U) hash tables and y-fast tries, each of which uses space that is linear in the number of elements stored in them, the space required for the trie portion of the data structure is O(n log log log U). The expected space used by the skip list is O(n). Therefore, the total expected space is O(n log log log U). □

There remains one slight technical issue that must be dealt with before discussing how to perform operations on this data structure. By allowing insertions and deletions, it is possible to force many hash table updates in the following way. Consider a (large) empty interval of U that is adjacent to an element stored in the data structure. If the boundary of a tall subtree happens to line up with the end of this interval and the largest element of this interval is repeatedly inserted and deleted, then Δ = O(1) for all operations, but all hash tables must be updated (since there are no other possible representatives in this interval). Updates will therefore take Ω(log log log U) time. To remedy this, we randomly shift the universe before building the data structure.³ Intuitively, performing this random shift means that the probability of being forced to make many hash table updates is small. More formally:

³Of course, even after randomly shifting the universe, this bad situation can still arise. However, we make the standard assumption that the adversary lacks access to the size of the shift.

Lemma 6.4. After performing a random shift by r (0 < r < U) of U, the expected height of the lowest common ancestor of two leaves x and y is O(log |x − y|).

Proof. Let Δ = |x − y| and assume, without loss of generality, that x < y. We begin by computing the probability that the lowest common ancestor of x and y has height at least h. There are at least 2^h elements of U inside a subtree with height at least h. The probability that the lowest common ancestor of x and y has height at least h is therefore the probability that r + x and r + x + Δ are in different subtrees of height

at least h. This happens precisely when x + r is contained in the last Δ positions of its subtree, and so a probability of Δ/2^h follows. We now compute the expected height of the lowest common ancestor of x and y:

\[
\sum_{h=1}^{\infty} \Pr[\text{height} \ge h]
  \;=\; \sum_{h=1}^{\log \Delta} \Pr[\text{height} \ge h]
        \;+\; \sum_{h=\log \Delta + 1}^{\infty} \frac{\Delta}{2^{h}}
  \;\le\; \sum_{h=1}^{\log \Delta} 1
        \;+\; \sum_{h=\log \Delta + 1}^{\infty} \frac{\Delta}{2^{h}}
  \;=\; O(\log \Delta)
\]

Since Δ = |x − y|, the lemma follows. □
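As an illustration (not a substitute for the proof), the following small experiment estimates the expected height of the lowest common ancestor of the shifted leaves for a few values of Δ; the universe size, trial count, and the modular treatment of the shift are assumptions made for the experiment only.

    import random

    def lca_height(a, b):
        """Height of the lowest common ancestor of leaves a and b: the position
        of their highest differing bit (the leaves themselves have height 0)."""
        return (a ^ b).bit_length()

    def average_height(delta, log_u=30, trials=100000):
        u = 1 << log_u
        total = 0
        for _ in range(trials):
            r = random.randrange(u)
            total += lca_height(r % u, (delta + r) % u)
        return total / trials

    for delta in (1, 16, 1024):
        print(delta, round(average_height(delta), 2))   # grows roughly like log(delta)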

Note that Lemma 6.4 is not needed to prove the running time of the search operation.

Search. To execute a search, we proceed as we did in the previous section and perform hash table queries for the appropriate prefix of the query at levels 2^{2^i} for

i = 0, 1, ..., O(log log log U). If the prefix is not found at a level, we search for the predecessor of that prefix (in the usual binary ordering) at the same level. If the prefix or such a predecessor is found, then the query can be answered by following a pointer to the maximum representative in the associated y-fast trie and then performing a finger search in the skip list using a pointer to this representative. If the predecessor of the prefix was not found, then the exponential search is continued. The search algorithm is summarized in Figure 6.1. We now show that this query algorithm supports the desired query time.

Figure 6.1: Searching for the predecessor of a query. The search path begins at the left, where levels 2^{2^i} (for i = 0, 1, ..., O(log log log U)) of the modified x-fast trie are searched. When an appropriate element is found (either the appropriate prefix or its predecessor), the search moves to the corresponding y-fast trie, which is then searched in the usual way. This produces a representative of the bucket that contains the predecessor. Once this representative is found, the search path moves into the skip list and performs a finger search using the representative element as a finger.

Lemma 6.5. The search algorithm described above can be performed in O(log log Δ) expected time.

Proof. As in the proof of Lemma 6.1, if the query is contained in the trie, it must be stored in the hash table containing all leaves and is thus found in O(1) = O(log log Δ) time. Otherwise, assume the query is not contained in the trie and a hash table query (either for the appropriate prefix or its predecessor) is first successful at level 2^{2^i}. Then Δ > 2^{2^{2^{i−1}}}, and so i is log log log Δ + O(1).

If the successful hash table query was for the appropriate prefix, then the query is answered by querying the associated y-fast trie. Recall that a y-fast trie associated with a node at level 2^{2^i} is over a universe of size 2^{2^{2^i}} and therefore can be queried in time O(log log 2^{2^{2^i}}) = O(2^i). This yields the representative which is used as a finger in the skip list. Since the representative and the desired predecessor are in the same bucket, the expected distance between them is 3^{2^i}. The finger search in the skip list thus takes O(log 3^{2^i}) = O(2^i) expected time. If the successful hash table query was for the predecessor of the appropriate prefix, then essentially the same argument shows that the query time is again O(2^i). In either case, the total expected query time is O(2^i) = O(log log Δ). □

Insertion. Inserting an element into the data structure can be accomplished by first searching for the element to be inserted in order to produce a pointer to that element’s predecessor. This pointer is then used to insert the element into the skip list. We then insert the appropriate prefixes of the element into the hash tables and the corresponding y-fast tries. The y-fast tries may need to be restructured as the buckets increase in size, as described in Section 6.1.2. The key observation is that we need only alter a hash table when a representative changes.

Lemma 6.6. The insertion algorithm described above can be performed in O(log log Δ) expected amortized time.

Proof. During an insertion, we insert the element into the skip list given a pointer to its predecessor. The probability that the new element reaches level 2^i (and thus becomes a representative) is (1/3)^{2^i}. If this happens, the prefixes of the new element must be added to every hash table of the corresponding y-fast trie. Since this y-fast trie has height 2^{2^i} and each hash table insertion takes O(1) expected time, the time to modify the y-fast trie is O(2^{2^i}). This could happen at every level 2^{2^i}, however, since the y-fast tries are nested. By Lemma 6.4, we expect to go no higher than height O(log Δ). The total expected amortized time to restructure the y-fast tries is therefore O(log log Δ).

Since the initial search takes expected time O(log log Δ) by Lemma 6.5, the insertion takes O(log log Δ) expected amortized time in total. □

Deletion. Deletion works similarly to insertion. We begin by locating the element to be deleted by performing a search to produce a pointer to it. The pointer is then used to delete the element from the skip list. Finally, the appropriate prefixes must be deleted from the hash tables (if they are not a prefix of another element in the structure, of course) and the element must be removed from the corresponding y-fast tries. The analysis applied to insertion applies equally well to deletion.

Lemma 6.7. The deletion algorithm described above can be performed in O(log log Δ) expected amortized time.

We therefore obtain

Theorem 6.8. Let U = {0, 1, ..., U − 1}. There exists a data structure that supports predecessor searches over a subset of U in expected time O(log log Δ) and insertions and deletions in expected amortized time O(log log Δ) using O(n log log log U) expected space.

Proof. Combining Lemma 6.5, Lemma 6.6 and Lemma 6.7 proves the time bounds on each operation. The expected space bound is proved in Lemma 6.3. □

6.4 Applications

In this section, we turn our attention to some applications of the data structure established by Theorem 6.8. We study the problems of approximate nearest neighbour queries and of range searching. As before, our results are restricted to bounded universes.

6.4.1 Approximate Nearest Neighbour Queries

Suppose we are given a point set S ⊆ U^d in a constant dimension d ≥ 2. Given a query point in U^d, we wish to determine the point in S that is closest to the query point. We refer to such a point as the nearest neighbour of the query. In general, this problem does not admit particularly fast solutions. Therefore, we often settle for a point that approximates the nearest neighbour. In particular, if the point in S nearest to the query point is at distance δ, we wish to return a point that is at distance at most (1 + ε)δ from the query point for some ε > 0: such a point is a (1 + ε)-approximation of the nearest neighbour. If the query point happens to be in S, then we must return that point. Our goal query time is O(log log Δ), where Δ is the Euclidean distance from the query point to the point returned. We would also like to support insertions and deletions in a similar time bound.

This problem is reasonably well-studied. Amir et al. [3] show how to answer queries and perform updates in expected time O(log log U) with exponential dependence on d. For the static version of the problem, they also present a deterministic data structure that has only linear dependence on d. In unbounded universes, Derryberry et al. [29] achieve a query time that is logarithmic in the number of points in a box (whose size depends on d) containing the query point and the answer. There have been no results combining the bounded universe assumption with sensitivity to distance.

The result presented here is based on an observation due to Chan [21]. By placing d + 1 shifted versions of the points of S onto a space-filling curve,⁴ queries can be answered by answering the query on each of these curves (lists) and taking the closest point to be the desired approximation. This yields a query time of O((1/ε)^d log n) and O(log n) update time for sets of size n by placing O((1/ε)^d) additional query points around the real query point.

To apply our results from Section 6.3, observe that searching with the shifted versions of the lists is a one-dimensional predecessor search problem and can thus be solved in expected time O(log log Δ) by Theorem 6.8. This almost completely solves the problem, except that we cannot bound the search time in every one-dimensional structure as a function of Δ, but rather only in the data structure which finds the element that is ultimately returned. In order to fix this, we simply randomly shift the entire point set using the technique described by Har-Peled [39, Chapter 11] before building any of the data structures. The result of such a shift is that the expected time spent searching in any one-dimensional structure is O(log log Δ). By combining this with the technique of Chan [21], we achieve a query time of O((1/ε)^d log log Δ). Insertions and deletions can be performed simply by performing the operation in each of the O(d) one-dimensional structures.

⁴Chan [21] uses the z-order, which is obtained by computing the shuffle of a point. If the i-th coordinate of a point p is p_i with binary representation p_{i,log U} ... p_{i,0}, then the shuffle of p is defined to be the binary number p_{1,log U} ··· p_{d,log U} ··· p_{1,0} ··· p_{d,0}.
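A sketch of the shuffle described in this footnote is given below for concreteness; the function name and the fixed word size w are assumptions, and the d + 1 shifted copies of the point set would each be sorted by this key and then searched as one-dimensional lists.

    def shuffle_key(p, w):
        """Interleave the bits of the coordinates of p, most significant first."""
        key = 0
        for bit in range(w - 1, -1, -1):
            for coord in p:
                key = (key << 1) | ((coord >> bit) & 1)
        return key

    points = [(3, 7), (8, 2), (5, 5)]
    print(sorted(points, key=lambda q: shuffle_key(q, w=4)))   # z-order of the points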

Theorem 6.9. Let U = {0, 1, ..., U − 1} and let d ≥ 2 be a constant. For any ε > 0, there exists a data structure that supports (1 + ε)-approximate nearest neighbour queries over a subset of U^d in O((1/ε)^d log log Δ) expected time. Insertions and deletions can be supported in expected amortized time O(log log Δ). The expected space required for this data structure is O(n log log log U).

6.4.2 Range Searching

In this section, we consider answering range (reporting) queries on the grid U². We wish to preprocess a set of points S ⊆ U² into a data structure so that, given a query region, we can enumerate all points in S that are contained in the query region. This problem was studied in this setting by Overmars [54], where several different types of query regions are considered. Chan et al. [22] also address this problem and obtain data structures for (among other problems) the related problem of orthogonal range emptiness queries: one data structure achieves query time O(log log n) and space O(n log log n) for points given in rank space, while another data structure achieves space O(n) and answers queries in O(log^ε n) time.

Dominance Queries

A dominance query asks to enumerate the points that dominate⁵ the query point. The query region can then be viewed as the intersection of the halfplanes above and to the right of the query point. To solve the dominance reporting problem, Overmars [54] maintains an x-fast trie with points ordered by x-coordinate.⁶

⁵A point p dominates a point q if all coordinates of p are at least as large as those of q.
⁶By an observation of Overmars [54], we may assume that no points share any x- or y-coordinates.

At each internal node of the trie, the point with the largest y-coordinate in that subtree is stored. Additionally, each leaf of the trie stores a list containing pointers to nodes that are right children of nodes on the path from the root to that leaf, sorted by y-coordinate, as well as a priority search tree that contains all points stored by the nodes on the path from the root to that leaf. Recall that priority search trees [49] answer half-infinite range queries in O(log n + k) time, where n is the number of points stored in the tree and k is the number of points in the query range, using linear space. Since each root-to-leaf path has length O(log U), the total space for this structure is O(n log U), although by bucketing into priority search trees, this can be reduced to O(n). To answer a dominance query, a search in the trie with the x-coordinate of the query point is performed. As usual, the search ends at some leaf of the trie. All points that dominate the query point must lie either on the path from the root to this leaf, or to the right of this path. The points that lie on the path can be found by querying the associated priority search tree in time O(log log U + k₁), where k₁ is the number of points stored on nodes of the root-to-leaf path that dominate the query point. The remaining points can be found by examining the associated list of right children in decreasing order of y-coordinate until they no longer dominate the query point. By traversing these subtrees in a way similar to that of priority search trees, only subtrees that contain dominating points will be explored. This therefore takes O(1 + k₂) time, where k₂ is the number of points to the right of the root-to-leaf path that dominate the query point. Since k = k₁ + k₂, a query time of O(log log U + k) follows.

We can use the data structure of Theorem 6.8 in place of the x-fast trie. Additionally, we store at each leaf another such structure on the points on the path from the root to the corresponding leaf of the x-fast trie that are to the right of the point stored at the leaf. We also store these points in a list ordered by decreasing y-coordinate. As before, we also store a list (ordered again by decreasing y-coordinate) of pointers to nodes of the x-fast trie that are right children of nodes on the root-to-leaf path. Now, consider a query point q = (a, b) ∈ U². Let Δ_h denote the horizontal distance from the query point to the end of the grid and let Δ_v denote the vertical distance from the query point to the end of the grid, i.e., Δ_h = U − a and Δ_v = U − b. To answer the query, we first perform a successor⁷ query for a in the main structure; this takes time O(log log Δ_h). Note that if no successor is found, then no points dominate the query point. We can then find the points on the path by performing a successor search for b in the auxiliary structure stored at the leaf we found in time O(log log Δ_v). As before, if no successor is found, then no points dominate the query point. At this point, we can find the rest of the points that dominate the query point that are on the path by walking along the ordered list of points on that path and stopping when we drop below b.
All that remains is to find the points that dominate the query point that are on the right of the search path; this is done exactly as before. The total query time is thus O(log log Δ_h + log log Δ_v + k) = O(log log(Δ_h + Δ_v) + k). The space used is

O(n log U log log log U): the main data structure uses O(n log log log U) space as before, the auxiliary lists each use O(log U) space and thus O(n log U) space in total, and the auxiliary structure at each leaf uses space O(log U log log log U), since each root-to-leaf path has O(log U) points. Since there are n root-to-leaf paths, the total space used

is O(n log U log log log U). Note that bucketing (as in Section 6.1.2) does not seem to reduce the space without adversely affecting the query time; we elaborate on this point in Section 6.5.

⁷Recall that successor queries can be supported in the same time bound as predecessor queries.

Theorem 6.10. Let U = {0, 1, ..., U − 1} and consider a subset S ⊆ U². There exists a data structure that reports the points (x, y) ∈ S that dominate a query point (a, b) ∈ U² in expected time O(log log(Δ_h + Δ_v) + k), where Δ_h = U − a, Δ_v = U − b and k is the number of points that dominate the query point. The data structure uses expected space O(n log U log log log U).

Half-infinite Queries

For the case of half-infinite query regions, we consider queries that consist of two points (a, c) ∈ U² and (b, c) ∈ U², and wish to report all points (x, y) ∈ S such that a < x < b and y > c. Overmars [54] shows how to handle half-infinite query regions using a similar technique. Assume queries are unbounded in the positive y direction but bounded on the three remaining sides. As before, priority search trees are stored at each leaf that contain the points stored on each root-to-leaf path. Instead of a single sorted list at each leaf containing the right children of the nodes on that path, there are O(log U) such lists, where the i-th list contains the points below depth i (therefore, the first such list is exactly the list stored for the dominance reporting data structure discussed previously). Each list is sorted by y-coordinate. Symmetrically, each path contains similar lists for the left children of nodes on the path. Since each of the O(log U) lists has size O(log U), the total space for this data structure is O(n log² U), which can again be reduced to O(n) using buckets. Queries are performed by searching for both the left and right sides of the query region to find where the search paths diverge in O(log log U) time. The points in the query range are either on the search paths or in a right child of the left path or a left child of the right path. Points on the search paths can be found using the priority search trees stored in the leaves as before, and points between the paths can be found by looking at the lists in the leaves corresponding to the depth of the node where the search paths diverge. A query time of O(log log U + k) follows.

As before, we construct the usual x-fast trie as well as a main structure described by Theorem 6.8. At each leaf, we store two auxiliary structures: one on the points on the root-to-leaf path that are to the left of the point stored at the leaf, and one on the points on the root-to-leaf path that are to the right of the point stored at the leaf. Both auxiliary structures are ordered by y-coordinate. Auxiliary lists are also stored at each leaf as described for the original structure of Overmars [54]. Queries are answered by performing a successor query for a in the main structure and a predecessor query for b. We must modify the structure of Theorem 6.8 to "give up" the search after a fixed distance; this can be accomplished by simply stopping the exponential search when the distance covered is larger than the specified distance. In our case, the specified distance is Δ_h = b − a. The successor and predecessor queries can thus be performed in expected time O(log log Δ_h). We then execute successor queries for c in both auxiliary structures in time O(log log Δ_v), where Δ_v = U − c, and walk along the sorted lists to find points on the search path that are in the query region. The points to the left and right of the search path can be found using the same technique as that of Overmars [54]. The total query time is thus O(log log Δ_h + log log Δ_v + k) = O(log log(Δ_h + Δ_v) + k). The space required for this data structure is O(n log² U), since each auxiliary list has size O(log U) and each of the n leaves contains O(log U) such lists.

Theorem 6.11. Let U = {0, 1, ..., U − 1} and consider a subset S ⊆ U². There exists a data structure that reports the points (x, y) ∈ S that are contained in a half-infinite region defined by the points (a, c) ∈ U² and (b, c) ∈ U² such that a < x < b and y > c in expected time O(log log(Δ_h + Δ_v) + k), where Δ_h = b − a, Δ_v = U − c and k is the number of points reported. The data structure uses expected space O(n log² U).

Note that the result of Theorem 6.11 can be easily adapted to the case of half-infinite regions that are unbounded in the negative y, positive x and negative x directions.

Four-Sided Queries

The final type of query we will consider is a four-sided (i.e., rectangular) query. A query consists of two points (a, c) ∈ U² and (b, d) ∈ U², and we wish to report all points (x, y) ∈ S such that a < x < b and c < y < d.

Theorem 6.12. Let U = {0, 1, ..., U − 1} and consider a subset S ⊆ U². There exists a data structure that reports the points (x, y) ∈ S that are contained in a rectangular region defined by the points (a, c) ∈ U² and (b, d) ∈ U² such that a < x < b and c < y < d in expected time O(log log(Δ_h + Δ_v) + k), where Δ_h = b − a, Δ_v = d − c and k is the number of points reported. The data structure uses expected space O(n log³ U).

Reducing Space Usage

The results of Theorems 6.10, 6.11 and 6.12 can be improved by observing that most of the query can be resolved in rank space rather than over the entire universe U. In each case, the data structure is built on the set of points translated to rank space (which has size n), which produces data structures of expected size O(n log n log log log n), O(n log² n) and O(n log³ n), respectively. Before a query can be executed, the query must first be translated into rank space. This can be achieved using the data structure of Theorem 6.8, with the same modification to stop the searching algorithm after exceeding the dimensions of the query box as was done for Theorem 6.11. In each case, this data structure uses expected space O(n log log log U). We obtain the following corollaries:

Corollary 6.13. Let U = {0, 1, ..., U − 1}. There exists a data structure that reports the points of a subset of U² that dominate a query point (a, b) ∈ U² in expected time

O(log log(h + v) + k), where h = U − a, v = U − b and k is the number of points that dominate the query point. The data structure uses O(n(log n log log n + log log log U)) expected space.

Corollary 6.14. Let U = {0, 1, ..., U − 1}. There exists a data structure that reports the points (x, y) of a subset of U² that are contained in a half-infinite range defined by the points (a, c) and (b, c) such that a < x < b and y > c in expected time

O(log log(h + v) + k), where h = b − a, v = U − c and k is the number of points reported. The data structure uses O(n(log² n + log log log U)) expected space.

Corollary 6.15. Let U = {0, 1, ..., U − 1}. There exists a data structure that reports the points (x, y) of a subset of U² that are contained in a rectangular range defined by the points (a, c) and (b, d) such that a < x < b and c < y < d in expected time O(log log(h + v) + k), where h = b − a, v = d − c and k is the number of points reported. The data structure uses O(n(log³ n + log log log U)) expected space.

6.5 Conclusion and Open Problems

In this chapter we unified results on hashing with those of predecessor search by describing a data structure that supports predecessor queries in time that is a function of the distance between the query element and the element returned. For membership queries, we match the results of hashing [33], while for predecessor queries we do no worse than the results of Willard [67]. Several applications were also considered, including dynamic approximate nearest neighbour search and range searching. There are several possible directions for future research.

1. Is it possible to keep the query times stated in Theorem 6.8 using space O(n)? Any progress in this area would immediately improve Theorems 6.10, 6.11 and 6.12.

2. We rely on randomization (skip lists) and amortization (restructuring y-fast tries for the dynamic case). Is it possible to partially de-randomize or de-amortize the result of Theorem 6.8? One might consider replacing the skip list by the optimal finger search structure of Brodal et al. [19], but this seems to require handling restructuring the y-fast tries in a more complicated manner. Note that Theorem 6.8 implies O(1) time for exact searches and therefore the lower bounds for hashing apply [33]: as a consequence, O(1) worst-case time is not achievable for searches, insertions and deletions.

3. Is it possible to improve the space usage for the dynamic approximate nearest neighbour problem proved in Theorem 6.9 while maintaining the query time? Any approach that uses the structure described by Theorem 6.8 will use space Ω(n log log log U).

4. Is it possible to improve the space usage for the range searching problem proved in Theorems 6.10, 6.11 and 6.12 while maintaining the query times?

Any approach that uses the structure described by Theorem 6.8 will use space

Ω(n log log log U).

Chapter 7

Biased Predecessor Search in Bounded Universes

In this chapter, we describe another method to perform predecessor searches in bounded universes in a distribution-sensitive manner. Recall that for the predecessor problem in a bounded universe, we are given a set of elements drawn from a larger universe of known size and wish to know the largest element that is less than or equal to (i.e., the predecessor of) a given query drawn from the universe. The time for a predecessor query will be a function of the probability distribution of the queries. Intuitively, queries from distributions that are highly non-uniform can be executed fast because the less likely queries can typically be ignored. Conversely, queries from uniform distributions correspond to the more usual analysis of data structures when we do not have a priori knowledge of the query distribution.


7.1 Problem Definition

As in Chapter 6, we are given a universe U = {0, 1, ..., U − 1} of size U and wish to perform predecessor searches over a static subset S = {s_1, s_2, ..., s_n} ⊆ U.

Furthermore, we are given a probability distribution D = {p_0, p_1, ..., p_{U−1}} such that the probability of receiving i ∈ U as a query is p_i. Since D is a probability distribution, we have that Σ_{i=0}^{U−1} p_i = 1. Our goal is to preprocess U, S and D into a data structure so that the time to execute a query is related to D.

The motivation for such results is the following. Let H = Σ_{i=0}^{U−1} p_i log(1/p_i) be the entropy of the distribution D. Recall that the entropy of a U-element distribution is between 0 and log U. Therefore, a query time of O(log H) is at most O(log log U), which matches the performance of, e.g., van Emde Boas trees [64] and x- and y-fast tries [67]. However, for distributions with low entropy (i.e., highly non-uniform distributions), we can still obtain constant query times. We call data structures with query times that are related to D biased.

We present several data structures in this chapter that can be classified based on

their space requirements. We give one data structure that has query time O(log H) but uses space that is a function of U, as well as a data structure that uses only linear space but has query time O(√H).
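To illustrate the entropy parameter, consider (as a hypothetical example, not one from the thesis) a geometrically decaying query distribution, say p_i = 2^{-(i+1)} for i < U − 1, with the last probability adjusted so that the distribution sums to 1. Then

\[
H \;=\; \sum_{i=0}^{U-1} p_i \log \frac{1}{p_i}
  \;\le\; \sum_{k \ge 1} \frac{k}{2^{k}} + o(1) \;=\; O(1),
\]

so a query time of O(log H) is O(1) for this distribution, whereas a distribution-oblivious structure such as a y-fast trie spends Θ(log log U) on every query.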

7.1.1 Background

The related literature for the predecessor search problem is the same as that discussed in Section 6.1.1, and so we concentrate primarily on biased data structures in this section.

As discussed in Section 2.1, the optimum binary search trees of Knuth [48] are the most efficient binary search trees possible for a given query distribution. However, computing this binary search tree is fairly time consuming. One therefore turns their attention to finding a binary search tree that answers queries in (expected) time O(H); Mehlhorn [50] described such a data structure and subsequently improved this bound to H + O(1) [51]. Even if a priori knowledge of the distribution is not available, it is still possible to achieve O(H) expected query time [20, 61]. A similar result is that of biased search trees [10]. A biased search tree can answer a query in time O(log(1/p_i)) (and therefore the average query time is O(H)). Perhaps the most natural way to frame the line of research in this chapter is by analogy: the results here are to biased search trees as van Emde Boas trees are to binary search trees.

7.2 Supporting Logarithmic Query Time

In this section, we describe how to achieve query time O(log H) using space that is a function of both n and U, rather than only n.

7.2.1 Using O(n + U^ε) Space

Let ε > 0. We will place all elements i ∈ U with probability p_i > (1/U)^ε, along with their predecessors (which never change since S is static), into a hash table T. All elements of S are also placed into a y-fast trie over the universe U. Since there are at most U^ε elements with probability greater than (1/U)^ε, it is clear that the hash table requires O(U^ε) space. Since the y-fast trie requires O(n) space, we have that the total space used by this structure is O(n + U^ε).

To execute a search, we check the hash table first. If the query (and thus the answer) is not stored there, then a search is performed in the y-fast trie to answer the query. The expected query time is thus

\begin{align*}
\sum_{i \in T} p_i\, O(1) + \sum_{i \in U \setminus T} p_i\, O(\log\log U)
 &= O(1) + \sum_{i \in U \setminus T} p_i\, O(\log\log U) \\
 &= O(1) + \sum_{i \in U \setminus T} p_i\, O\!\left(\log\log (U^{\varepsilon})^{1/\varepsilon}\right) \\
 &= O(1) + \sum_{i \in U \setminus T} p_i\, O\!\left(\log\!\left(\tfrac{1}{\varepsilon}\log U^{\varepsilon}\right)\right) \\
 &= O(1) + \sum_{i \in U \setminus T} p_i\, O(\log(1/\varepsilon)) + \sum_{i \in U \setminus T} p_i\, O(\log\log U^{\varepsilon}) \\
 &= O(1) + O(\log(1/\varepsilon)) + \sum_{i \in U \setminus T} p_i\, O(\log\log U^{\varepsilon}) \\
 &\le O(1) + O(\log(1/\varepsilon)) + \sum_{i \in U \setminus T} p_i\, O(\log\log(1/p_i))
\end{align*}

The last step here follows from the fact that, if i ∈ U \ T, then p_i ≤ (1/U)^ε, and so U^ε ≤ 1/p_i. Recall Jensen's inequality, which states that for concave functions f, E[f(X)] ≤ f(E[X]). Since the logarithm is a concave function, we therefore have

\[
\sum_{i \in U \setminus T} p_i\, O(\log\log(1/p_i))
 \;\le\; O\!\left(\log \sum_{i \in U \setminus T} p_i \log(1/p_i)\right)
 \;\le\; O(\log H) .
\]

Therefore, the expected query time is O(log(1/ε)) + O(log H) = O(log(H/ε)).
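A minimal sketch of this two-tier scheme follows; this is an assumed illustration in which the class and threshold names are invented and a sorted list with binary search stands in for the y-fast trie.

    import bisect

    class BiasedPredecessorSketch:
        def __init__(self, S, prob, threshold):
            self.S = sorted(S)
            # hash table T: likely queries -> their (precomputed, static) predecessor
            self.T = {q: self._slow_predecessor(q)
                      for q, p in prob.items() if p > threshold}

        def _slow_predecessor(self, q):        # stands in for the y-fast trie
            i = bisect.bisect_right(self.S, q) - 1
            return self.S[i] if i >= 0 else None

        def predecessor(self, q):
            if q in self.T:                    # O(1) for the high-probability queries
                return self.T[q]
            return self._slow_predecessor(q)   # O(log log U) in the real structure

    d = BiasedPredecessorSketch([2, 10, 40], prob={7: 0.5, 11: 0.3}, threshold=0.1)
    print(d.predecessor(7), d.predecessor(30))   # prints 2 10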

Theorem 7.1. Suppose we are given a probability distribution with entropy H over the possible queries in a universe of size U. There exists a data structure that performs predecessor searches over a subset of U in expected time O(log(H/ε)) using O(n + U^ε) space for any ε > 0.

7.2.2 Using O(n + 2^{log^{1/c} U}) Space

At this point, we note that O(U^ε) space is fairly large. Our goal now is to reduce this space as much as possible. One observation is that we can more carefully select the threshold for "large probabilities" that we place in the hash table T. Instead of

(1/U)^ε, we can use (1/2)^{log^{1/c} U} for some c > 1. The space used by the hash table is thus O(2^{log^{1/c} U}), which is o(U^ε) for any ε > 0. The analysis of the expected query time carries through as follows:

\begin{align*}
\sum_{i \in T} p_i\, O(1) + \sum_{i \in U \setminus T} p_i\, O(\log\log U)
 &= O(1) + \sum_{i \in U \setminus T} p_i\, O(\log\log U) \\
 &= O(1) + \sum_{i \in U \setminus T} p_i\, O\!\left(\log\log (U^{1/c})^{c}\right) \\
 &= O(1) + c \sum_{i \in U \setminus T} p_i\, O(\log\log U^{1/c}) \\
 &= O(1) + c \sum_{i \in U \setminus T} p_i\, O\!\left(\log\log 2^{\log^{1/c} U}\right) \\
 &\le O(1) + c \sum_{i \in U \setminus T} p_i\, O(\log\log(1/p_i)) \\
 &\le O(c \log H)
\end{align*}

Theorem 7.2. Suppose we are given a probability distribution with entropy H over the possible queries in a universe of size U. There exists a data structure that performs predecessor searches over a subset of U in expected time O(c log H) using O(n + 2^{log^{1/c} U}) space for any c > 1.

7.2.3 Alternate Solution

The same query time as in Theorem 7.2 can be achieved through other means, albeit with slightly higher space requirements. We present this alternate solution for two reasons. First, this solution allows for a smoother tradeoff as H increases: the solutions so far all have the property that query times are either O(1) or O(log log U), whereas the solution presented here has a query time that is a function of the probability of the query element; we elaborate on this point after presenting the data structure. Second, this solution may provide a different approach to solving the problem at hand, possibly by use of the output entropy, which will be described in Section 7.3.

By Theorem 6.8, there exists a predecessor search data structure that has size O(n log log log U) and achieves O(log log Δ) expected query time, where Δ is the distance between the query element and the element returned. We add "fake" elements to S (which have their predecessors precomputed) to ensure that the distance

between i and its predecessor is at most Δ_i ≤ (1/p_i)^{log^{c−1}(1/p_i)} for some c > 2. To perform a search, we query the data structure. If we happen to be returned a "fake" element, then we use its precomputed predecessor to answer the query. We then obtain the following expected running time:

\begin{align*}
\sum_{i \in U} p_i\, O(\log\log \Delta_i)
 &\le \sum_{i \in U} p_i\, O\!\left(\log\log (1/p_i)^{\log^{c-1}(1/p_i)}\right) \\
 &= \sum_{i \in U} p_i\, O\!\left(\log\!\left(\log^{c-1}(1/p_i)\,\log(1/p_i)\right)\right) \\
 &= \sum_{i \in U} p_i\, O\!\left(\log\log^{c}(1/p_i)\right) \\
 &= \sum_{i \in U} p_i\, c\, O(\log\log(1/p_i)) \\
 &= c \sum_{i \in U} p_i\, O(\log\log(1/p_i)) \\
 &\le O(c \log H)
\end{align*}

It remains to determine the amount of space required for this structure. To do this, we need to count the number of "fake" elements added to the data structure. Let us consider how such elements were added. For each element i ∈ U, in order from the highest probability to the lowest, we check to see if the predecessor of i is within distance (1/p_i)^{log^{c−1}(1/p_i)}. Observe that the answer to this question is always "yes" if (1/p_i)^{log^{c−1}(1/p_i)} > U. We have

\begin{align*}
(1/p_i)^{\log^{c-1}(1/p_i)} > U
 &\iff \log\!\left((1/p_i)^{\log^{c-1}(1/p_i)}\right) > \log U \\
 &\iff \log^{c-1}(1/p_i)\,\log(1/p_i) > \log U \\
 &\iff \log^{c}(1/p_i) > \log U \\
 &\iff p_i < 2^{-\log^{1/c} U} .
\end{align*}

Of course, we must be careful about how many “fake” elements we add. The next lemma shows that this number is not too large.

Lemma 7.3. If we ensure that the distance between every element i and its predecessor is at most (1/p_i)^{log^{c−1}(1/p_i)} for some c > 2, then at most O(2^{log^{1/c} U}) "fake" elements are added to the data structure.

Proof. Observe that there are at most two elements of U with probability between 1 and 1/2, at most four elements with probability between 1/2 and 1/4, at most eight elements with probability between 1/4 and 1/8, and so on. In general, there are at

most 2^{i+1} elements with probability between 1/2^i and 1/2^{i+1}. In the worst case, we always place a "fake" element. Therefore, we place at most 2^{i+1} "fake" elements on elements with probabilities between 1/2^i and 1/2^{i+1}, but we never place a "fake"

element for elements with probability at most 1/2^{log^{1/c} U}. The number of "fake"

elements is thus Σ_{i=0}^{log^{1/c} U} 2^{i+1} = O(2^{log^{1/c} U}). □

By Lemma 7.3, the total space requirement is O((n + 2^{log^{1/c} U}) log log log U).

This result has an extra factor of O(log log log U) in the space requirements, but has one interesting property that the structure in Theorem 7.2 does not. Observe

that an individual query for element i can be executed in O(c log log(1/p_i)) time. This is in contrast to Theorem 7.2, where all queries take either O(1) time or

O(log log U) time, because each query is answered in either the hash table or the y-fast trie. As a result, this technique can be used to support arbitrarily weighted elements of U. Suppose each element i ∈ U has a real-valued weight w_i > 0 and let W = Σ_{i∈U} w_i. By assigning each element the probability p_i = w_i/W, we achieve a query time of O(c log log(W/w_i)), which is analogous to the O(log(W/w_i)) query time of biased search trees [10].

Theorem 7.4. Suppose we are given a positive real weight w_i for each element i in

a universe U of size U, such that the sum of all weights is W. There exists a data structure that performs a predecessor search for item i in time O(c log log(W/w_i)) using O((n + 2^{log^{1/c} U}) log log log U) space for any c > 2.

7.3 Supporting Linear Space

In this section, we describe how to achieve space O(n) by accepting a larger query time of O(√H). We begin with a brief note concerning input entropy vs. output entropy.

Input vs. Output Distribution. Until now, we have discussed the input distribution, i.e., the probability that i ∈ U is the query. We could also discuss the output distribution, i.e., the probability that i ∈ U is the answer to a query. This distribution can be defined by letting p*_i = 0 if i ∉ S and p*_i = Σ_{j : pred(j) = i} p_j otherwise.

Suppose we can answer a predecessor query for i in time O(log log(1/p*_{pred(i)})), where pred(i) is the predecessor of i. Then it follows that the expected query time is Σ_{i∈U} p_i O(log log(1/p*_{pred(i)})). Since p_i ≤ p*_{pred(i)} for all i, we have that this is at most Σ_{i∈U} p_i O(log log(1/p_i)), i.e., at most the logarithm of the entropy of the input distribution. It therefore suffices to consider the output distribution.

Our data structure will use a series of optimal data structures for predecessor search [8] that increase doubly-exponentially in size, in much the same way as Badoiu et al. [20]. Recall that Beame and Fich [8] presented an optimal data structure for predecessor search that executes queries in time

\[
O\!\left(\min\!\left\{ \frac{\log\log U}{\log\log\log U},\; \sqrt{\frac{\log n}{\log\log n}} \right\}\right) .
\]

We will maintain several such structures D_1, D_2, ..., where D_j is over the entire universe U but contains only 2^{2^j} elements. In particular, D_j contains the 2^{2^j} elements of highest probability that are not contained in any D_k for k < j. Note that here, "highest probability" refers to the highest output probability.

Searches are performed by doing a predecessor search in each of D_1, D_2, .... Along with each element we also store its successor. When we receive the predecessor of the query in D_j, we check its successor to see if that successor is larger than the query. If so, the predecessor in D_j is the answer to the query. Otherwise, the real predecessor is somewhere between the predecessor in D_j and the query, and can be found by continuing the search. This technique is essentially the same modification described by Bose et al. [12] to make the working-set structure [20] handle predecessor queries.

We now consider the search time in this data structure. Suppose the correct predecessor of the query i is found in D_j where j > 1 (otherwise, the predecessor was found in D_1 in O(1) time). All 2^{2^{j−1}} elements of D_{j−1} have (output) probability greater than p*_{pred(i)}, and so p*_{pred(i)} < 1/2^{2^{j−1}}. Equivalently, j is log log(1/p*_{pred(i)}) + O(1). The total time spent searching is therefore O(√(log(1/p*_{pred(i)}))).

Therefore, since p_i ≤ p*_{pred(i)} for all i, the expected query time is

\[
\sum_{i \in U} p_i\, O\!\left(\sqrt{\log(1/p^*_{\mathrm{pred}(i)})}\right)
 \;\le\; \sum_{i \in U} p_i\, O\!\left(\sqrt{\log(1/p_i)}\right)
 \;\le\; O\!\left(\sqrt{H}\right) .
\]

The final step above follows from Jensen's inequality and the fact that the square root is a concave function. To determine the space used by this data structure, observe that every element of S is stored in exactly one D_j. Since each D_j uses space linear in the number of elements stored in it, the total space usage is O(n).
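The layered search can be sketched as follows; this is an assumed illustration in which sorted lists with binary search stand in for the structures of Beame and Fich, the constructor takes the elements already ordered by non-increasing output probability, and the class and variable names are invented.

    import bisect

    class LayeredPredecessorSketch:
        def __init__(self, S, by_probability):
            S = sorted(S)
            succ = {x: (S[i + 1] if i + 1 < len(S) else None)
                    for i, x in enumerate(S)}
            self.layers, taken, j = [], 0, 1
            while taken < len(by_probability):
                size = 2 ** (2 ** j)                    # |D_j| = 2**(2**j)
                chunk = by_probability[taken:taken + size]
                self.layers.append((sorted(chunk), succ))
                taken += size
                j += 1

        def predecessor(self, q):
            for keys, succ in self.layers:              # search D_1, D_2, ... in turn
                i = bisect.bisect_right(keys, q) - 1
                if i >= 0:
                    cand = keys[i]
                    if succ[cand] is None or succ[cand] > q:
                        return cand                     # successor certifies the answer
            return None                                 # q precedes every stored element

    S = list(range(0, 100, 7))
    d = LayeredPredecessorSketch(S, sorted(S, reverse=True))
    print(d.predecessor(30), d.predecessor(3))          # prints 28 0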

Theorem 7.5. Suppose we are given a probability distribution with entropy H over the possible queries in a universe of size U. There exists a data structure that performs predecessor searches over a subset of U in expected time O(√H) using O(n) space.

Observe that we need not know the exact distribution D to achieve the result of Theorem 7.5; it suffices to know the sorted order of the keys in terms of non-increasing probabilities. Furthermore, since the result of Beame and Fich [8] is in fact dynamic, we can even obtain a bound similar to the working-set property, using the same technique as Badoiu et al. [20]: a predecessor search for i can be answered in time O(√(log w(i))), where w(i) is the number of distinct predecessors reported since the last time the predecessor of i was reported. This result also uses O(n) space.

7.4 Conclusion and Open Problems

In this chapter, we have introduced the idea of biased predecessor search. Two different categories of data structures are considered: one with query times that are logarithmic in the entropy of the query distribution (with space that is a function of U), and one with linear space (with query times that are strictly larger than logarithmic). There are several possible directions for future research.

1. Is it possible to achieve O(log H) query time and O(n) space?

2. The reason for desiring an O(log H) query time comes from the fact that H ≤ log U and the fact that the usual data structures for predecessor searching have query time O(log log U). Of course, the data structure of Beame and Fich [8] is faster than this. Is it possible to achieve a query time of, for example, O(log H / log log U)?

3. What lower bounds can be stated in terms of either the input or output entropies? Clearly O(U) space suffices for O(1) query time, and so such lower bounds must place restrictions on space usage.

Chapter 8

Conclusion

In this final chapter, we summarize the contributions from the previous chapters as well as possible directions for future research.

8.1 Summary of Contributions

In this section, we summarize the contributions of this thesis.

In Chapter 3, we described the first binary search tree that has a stronger version of the working-set property which requires individual queries to be answered in time logarithmic in the working-set number in the worst case. This binary search tree closes the gap between binary search trees that only have the working-set property in the amortized sense (such as splay trees [61]) and data structures which have this stronger version of the working-set property but are not binary search trees [20].

In Chapter 4, we developed a dictionary data structure with sensitivity to the distance between the query element and a temporal finger. This is the first dictionary

data structure that allows this temporal finger to be selected arbitrarily.

In Chapter 5, we strengthened the unified property so that query times need not be inflated with elements outside of the query's working set. This is the first data structure to consider this property.

In Chapter 6, we unified results on hashing with results on predecessor search by describing a data structure with query time that is a function of the distance between the element queried and the element returned. These results are extended to the dynamic case. Several applications were considered, including the dynamic approximate nearest neighbour problem and the range searching problem (with several different types of query regions).

In Chapter 7, we considered data structures for biased predecessor search. We described several data structures with query time that is logarithmic in the entropy of the distribution as well as a data structure with linear space. We also considered the case of weights on the elements of the universe instead of probabilities, as well as the case when the distribution of queries is not known in advance.

8.2 Open Problems

In this section, we summarize the open problems and directions for future work that have appeared in previous chapters.

1. In binary search trees, it seems that by enforcing good worst-case behaviour for temporally local query distributions, we sacrifice good performance on some other access sequences. Is it the case that a binary search tree that has this stronger version of the working-set property cannot achieve other

properties of, e.g., splay trees? For example, what kind of sequential access bound can be achieved in this setting? Recall that the scanning bound refers to the amount of time required to execute all queries in sequential order.

2. Is it possible to answer queries in a binary search tree in worst-case time O(log |a_t − a_{t−1}|) (i.e., a worst-case analogue of the dynamic finger property)? Recall that a_t and a_{t−1} are the current and previous queries, respectively. As with the working-set property, this is well-known outside of the binary search tree model.

3. Can the additive term in Theorem 4.1 be reduced? This would be interesting even for specific (ranges of) values of f. When f = n, for example, the best known result is O(log w_t(x) + log log n) [45]. The case when f = 1 is fully solved by Badoiu et al. [20].

4. Is it possible to support multiple temporal fingers (e.g., O(1) many)? Simply searching the structures in parallel allows us to find the query element in time proportional to the logarithm of the minimum temporal distance, but it is not obvious how to quickly restructure the data structures and promote the query element to the cheapest substructure in parallel for each structure.

5. Is it possible to maintain the temporal finger property while supporting dynamic update operations?

6. One can reduce the distance measure d_{w_t(w_t(x))}(x, y) to d_{w_t(w_t(x)+x)}(x, y) in the strong unified property by changing how the substructures grow. Is it possible to reduce this further, say to d_{w_t(o(w_t(x)))}(x, y)?

7. We argued that it is not possible to use d_{w_t(w_t(x))}(x, y) in the worst case for the strong unified property. Is it possible to achieve this in the amortized sense?

8. Can the additive term in Theorem 5.1 be reduced? It seems difficult to reduce this term below O(log log w_t(x)) using an approach similar to the one presented here, since elements must shift through this many substructures.

9. Is it possible to maintain the strong unified property while supporting dynamic update operations?

10. Is it possible to keep the query times stated in Theorem 6.8 using space O(n)? Any progress in this area would immediately improve Theorems 6.10, 6.11 and 6.12.

11. For the distance-sensitive predecessor search problem, we rely on randomization (skip lists) and amortization (restructuring y-fast tries for the dynamic case). Is it possible to partially de-randomize or de-amortize the result of Theorem 6.8? One might consider replacing the skip list by the optimal finger search structure of Brodal et al. [19], but this seems to require handling restructuring the y-fast tries in a more complicated manner. Note that Theorem 6.8 implies O(1) time for exact searches and therefore the lower bounds for hashing apply [33]: as a consequence, O(1) worst-case time is not achievable for searches, insertions and deletions.

12. Is it possible to improve the space usage for the dynamic approximate nearest neighbour problem proved in Theorem 6.9 while maintaining the query time? Any approach that uses the structure described by Theorem 6.8 will use space Ω(n log log log U).

13. Is it possible to improve the space usage for the range searching problem proved in Theorems 6.10, 6.11 and 6.12 while maintaining the query times? Any approach that uses the structure described by Theorem 6.8 will use space Ω(n log log log U).

14. For the biased predecessor search problem, is it possible to achieve O(log H) query time and O(n) space?

15. In the biased predecessor search problem, the reason for desiring an O(log H) query time comes from the fact that H ≤ log U and the fact that the usual data structures for predecessor searching have query time O(log log U). Of course, the data structure of Beame and Fich [8] is faster than this. Is it possible to achieve a query time of, for example, O(log H / log log U)?

16. For the biased predecessor search problem, what lower bounds can be stated in terms of either the input or output entropies? Clearly O(U) space suffices for O(1) query time.

Bibliography

[1] G.M. Adelson-Velskii and E.M. Landis. An algorithm for the organization of information. Soviet Math. Doklady, 3:1259-1263, 1962.

[2] Peyman Afshani, Jeremy Barbay, and Timothy M. Chan. Instance-optimal geometric algorithms. In FOCS '09: Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science, pages 129-138, 2009.

[3] Arnon Amir, Alon Efrat, Piotr Indyk, and Hanan Samet. Efficient regular data structures and algorithms for dilation, location, and proximity problems. Algorithmica, 30(2):164-187, 2001.

[4] Arne A. Andersson and Mikkel Thorup. Dynamic ordered sets with exponential search trees. Journal of the ACM, 54(3), 2007.

[5] Sunil Arya, Theocharis Malamatos, David M. Mount, and Ka Chun Wong. Optimal expected-case planar point location. SIAM Journal on Computing, 37(2):584-610, 2007.

[6] Amitabha Bagchi, Adam L. Buchsbaum, and Michael T. Goodrich. Biased skip lists. Algorithmica, 42:31-48, 2005.


[7] R. Bayer. Symmetric binary B-trees: Data structures and maintenance algorithms. Acta Informatica, 1:290-306, 1972.

[8] Paul Beame and Faith E. Fich. Optimal bounds for the predecessor problem and related problems. Journal of Computer and System Sciences, 65(1):38-72, 2002.

[9] D. Belazzougui, A.C. Kaporis, and P.G. Spirakis. Random input helps searching predecessors. arXiv:1104.4353, 2011.

[10] Samuel W. Bent, Daniel D. Sleator, and Robert E. Tarjan. Biased search trees. SIAM Journal on Computing, 14:545-568, 1985.

[11] Prosenjit Bose, Karim Douieb, and Stefan Langerman. Dynamic optimality for skip lists and B-trees. In SODA ’08: Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1106-1114, 2008.

[12] Prosenjit Bose, John Howat, and Pat Morin. A distribution-sensitive dictionary with low space overhead. In Proceedings of the 11th International Symposium on Algorithms and Data Structures (WADS 2009), LNCS 5664, pages 110-118, 2009.

[13] Prosenjit Bose, Karim Douieb, Vida Dujmovic, and Rolf Fagerberg. An O(log log n)-competitive binary search tree with optimal worst-case access time. In Proceedings of the Scandinavian Workshop on Algorithm Theory (SWAT), pages 38-49, 2010.

[14] Prosenjit Bose, Karim Douieb, Vida Dujmovic, and John Howat. Layered working-set trees. In Proceedings of the 9th Latin American Theoretical Informatics Symposium (LATIN 2010), LNCS 6034, pages 686-696, 2010.

[15] Prosenjit Bose, Karim Douieb, Vida Dujmovic, John Howat, and Pat Morin. Fast local searches and updates in bounded universes. In Proceedings of the 22nd Canadian Conference on Computational Geometry (CCCG 2010), pages 261-264, 2010.

[16] Prosenjit Bose, Karim Douieb, Vida Dujmovic, and John Howat. Layered working-set trees. Algorithmica, 63(1):476-489, 2012.

[17] Prosenjit Bose, Karim Douieb, Vida Dujmovic, John Howat, and Pat Morin. Fast local searches and updates in bounded universes. Computational Geometry: Theory and Applications, 2012. doi:10.1016/j.comgeo.2012.01.002.

[18] Gerth Stølting Brodal, Christos Makris, Spyros Sioutas, Athanasios Tsakalidis, and Kostas Tsichlas. Optimal solutions for the temporal precedence problem. Algorithmica, 33(4):494-510, 2002.

[19] Gerth Stølting Brodal, George Lagogiannis, Christos Makris, Athanasios Tsakalidis, and Kostas Tsichlas. Optimal finger search trees in the pointer machine. Journal of Computer and System Sciences, 67(2):381-418, 2003.

[20] Mihai Badoiu, Richard Cole, Erik D. Demaine, and John Iacono. A unified access bound on comparison-based dynamic dictionaries. Theoretical Computer Science, 382(2):86-96, 2007.

[21] Timothy M. Chan. Closest-point problems simplified on the RAM. In SODA ’02: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 472-473, 2002.

[22] Timothy M. Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In SoCG ’11: Proceedings of the 27th Annual ACM Symposium on Computational Geometry, pages 1-10, 2011.

[23] Richard Cole. On the dynamic finger conjecture for splay trees. Part II: The proof. SIAM Journal on Computing, 30(1):44-85, 2000.

[24] Richard Cole, Bud Mishra, Jeanette Schmidt, and Alan Siegel. On the dynamic finger conjecture for splay trees. Part I: Splay sorting log n-block sequences. SIAM Journal on Computing, 30(1):1-43, 2000.

[25] Sébastien Collette, Vida Dujmovic, John Iacono, Stefan Langerman, and Pat Morin. Distribution-sensitive point location in convex subdivisions. In SODA ’08: Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 912-921, 2008.

[26] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.

[27] Erik D. Demaine, John Iacono, and Stefan Langerman. Proximate point searching. Computational Geometry: Theory and Applications, 28(1):29-40, 2004.

[28] Erik D. Demaine, Dion Harmon, John Iacono, and Mihai Pătraşcu. Dynamic optimality—almost. SIAM Journal on Computing, 37(1):240-251, 2007.

[29] Jonathan Derryberry, Don Sheehy, Maverick Woo, and Danny Dominic Sleator. Achieving spatial adaptivity while finding approximate nearest neighbors. In CCCG ’08: Proceedings of the 20th Annual Canadian Conference on Computational Geometry, pages 163-166, 2008.

[30] Jonathan C. Derryberry. Adaptive Binary Search Trees. PhD thesis, Carnegie Mellon University, 2009.

[31] Jonathan C. Derryberry and Daniel D. Sleator. Skip-splay: Toward achieving the unified bound in the BST model. In WADS ’09: Proceedings of the 11th International Symposium on Algorithms and Data Structures, pages 194-205, 2009.

[32] Paul F. Dietz and Rajeev Raman. A constant update time finger search tree. Information Processing Letters, 52(3):147-154, 1994.

[33] Martin Dietzfelbinger, Anna R. Karlin, Friedhelm Meyer auf der Heide, Hans Rohnert, and Robert Endre Tarjan. Dynamic perfect hashing: Upper and lower bounds. SIAM Journal on Computing, 23(4):738-761, 1994.

[34] Vida Dujmovic, John Howat, and Pat Morin. Biased range trees. In SODA ’09: Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 486-495, 2009.

[35] H. Edelsbrunner. A note on dynamic range searching. Bulletin of the EATCS, 15:34-40, 1981.

[36] Amr Elmasry. A priority queue with the working-set property. International Journal of Foundations of Computer Science, 17(6):1455-1465, 2006.

[37] Amr Elmasry, Arash Farzan, and John Iacono. A unifying property for distribution-sensitive priority queues. In Combinatorial Algorithms: 22nd International Workshop, IWOCA 2011, pages 209-222, 2011.

[38] Leonidas J. Guibas and Robert Sedgewick. A dichromatic framework for balanced trees. In FOCS ’78: Proceedings of the 19th Annual IEEE Symposium on Foundations of Computer Science, pages 8-21, 1978.

[39] Sariel Har-Peled. Geometric approximation algorithms. American Mathematical Society, 2011.

[40] John Iacono. New upper bounds for pairing heaps. In Scandinavian Workshop on Algorithm Theory (LNCS 1851), pages 32-45, 2000.

[41] John Iacono. Distribution Sensitive Data Structures. PhD thesis, Rutgers, The State University of New Jersey, 2001.

[42] John Iacono. Key-independent optimality. Algorithmica, 42(1):3-10, 2005.

[43] John Iacono. A static optimality transformation with applications to planar point location. arXiv:1104.5597, 2011.

[44] John Iacono and Stefan Langerman. Proximate planar point location. In SoCG ’03: Proceedings of the 19th Annual ACM Symposium on Computational Geometry, pages 220-226, 2003.

[45] John Iacono and Stefan Langerman. Queaps. Algorithmica, 42(1):49-56, 2005.

[46] Donald B. Johnson. A priority queue in which initialization and queue operations take O(log log D) time. Theory of Computing Systems, 15(1):295-309, 1981.

[47] Alexis Kaporis, Christos Makris, Spyros Sioutas, Athanasios Tsakalidis, Kostas Tsichlas, and Christos Zaroliagis. Improved bounds for finger search on a RAM. In ESA ’03: Proceedings of the 11th Annual European Symposium on Algorithms, LNCS 2832, pages 325-336, 2003.

[48] D.E. Knuth. Optimum binary search trees. Acta Informatica, 1:14-25, 1971.

[49] E.M. McCreight. Priority search trees. SIAM Journal on Computing, 14(2):257-276, 1985.

[50] Kurt Mehlhorn. Nearly optimal binary search trees. Acta Informatica, 5(4):287-295, 1975.

[51] Kurt Mehlhorn. A best possible bound for the weighted path length of binary search trees. SIAM Journal on Computing, 6(2):235-239, 1977.

[52] Kurt Mehlhorn and Stefan Näher. Bounded ordered dictionaries in O(log log N) time and O(n) space. Information Processing Letters, 35(4):183-189, 1990.

[53] Yakov Nekrich. Data structures with local update operations. In SWAT ’08: Proceedings of the 11th Scandinavian Workshop on Algorithm Theory, pages 138-147, 2008.

[54] Mark H. Overmars. Efficient data structures for range searching on a grid. Journal of Algorithms, 9(2):254-275, 1988.

[55] Mihai Pătraşcu and Mikkel Thorup. Time-space trade-offs for predecessor search. In STOC ’06: Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pages 232-240, 2006.

[56] Mihai Pătraşcu and Mikkel Thorup. Randomization does not help searching predecessors. In SODA ’07: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 555-564, 2007.

[57] William Pugh. Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM, 33(6):668-676, 1990.

[58] Raimund Seidel and Cecilia R. Aragon. Randomized search trees. Algorithmica, 16(4/5):464-497, 1996.

[59] C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423 and 623-656, 1948.

[60] Daniel Dominic Sleator and Robert Endre Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202-208, 1985.

[61] Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting binary search trees. Journal of the ACM, 32(3):652-686, 1985.

[62] R.E. Tarjan. Sequential access in splay trees takes linear time. Combinatorica, 5:367-378, 1985.

[63] Robert Endre Tarjan. Data structures and network algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1983.

[64] Peter van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80-82, 1977.

[65] Chengwen Chris Wang, Jonathan Derryberry, and Daniel Dominic Sleator. O(log log n)-competitive dynamic binary search trees. In SODA ’06: Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 374-383, 2006.

[66] Robert Wilber. Lower bounds for accessing binary search trees with rotations. SIAM Journal on Computing, 18(1):56-67, 1989.

[67] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space O(N). Information Processing Letters, 17(2):81-84, 1983.