Ranked Queries in Index Data Structures

Universita` degli Studi di Pisa Dipartimento di Informatica Dottorato di Ricerca in Informatica Ph.D. Thesis Ranked Queries in Index Data Structures Iwona Bialynicka-Birula Supervisor Roberto Grossi October 15, 2008 Abstract A ranked query is a query which returns the top-ranking elements of a set, sorted by rank, where the rank corresponds to some sort of preference function defined on the items of the set. This thesis investigates the problem of adding rank query capabil- ities to several index data structures on top of their existing functionality. Among the data structures investigated are suffix trees, range trees, and hierarchical data structures. We explore the problem of additionally specifying rank when querying these data structures. So, for example in the case of suffix trees, we would like to obtain not all of the occurrences of a substring in a text or in a set of documents, but to obtain only the most preferable results. What is most important, the efficiency of such a query must be proportional to the number of preferable results and not all of the occurrences, which can be too many to process efficiently. First, we introduce the concept of rank-sensitive data structures. Rank-sensitive data structures are defined through an analogy to output-sensitive data structures. Output-sensitive data structures are capable of reporting the items satisfying an on- line query in time linear to the number of items returned plus a sublinear function of the number of items stored. Rank-sensitive data structures are additionally given a ranking of the items and just the top k best-ranking items are reported at query time, sorted in rank order. The query must remain linear only with respect to the number of items returned, which this time is not a function of the elements satisfying the query, but the parameter k given at query time. We explore several ways of adding rank-sensitivity to different data structures and the different trade-offs which this incurs. Adding rank to an index query can be viewed as adding an additional dimen- sion to the indexed data set. Therefore, ranked queries can be viewed as multi- dimensional range queries, with the notable difference that we additionally want to maintain an ordering along one of the dimensions. Most range data structures do not maintain such an order, with the exception of the Cartesian tree. The Cartesian tree has multiple applications in range searching an other fields, but is rarely used for indexing due to its rigid structure which makes it difficult to use with dynamic content. The second part of this work deals with overcoming this rigidness and describes the first efficient dynamic version of the Cartesian tree. Acknowledgments I would like to thank everybody at the University of Pisa, especially my advisor Pro- fessor Roberto Grossi, Professor Andrea Maggiolo Schettini, and Professor Pierpaolo Degano, who made my studies at the University of Pisa and this work possible. I would also like to thank Professor Anna Karlin for granting me guest access to the courses at the University of Washington. Last but not least, I would like to thank my family (Iwo Bia lynicki-Birula, Zofia Bia lynicka-Birula, Piotr Bia lynicki-Birula, Stefania Kube) and my husband (Vittorio Bertocci) without whose continuing faith and support this work would not have been possible. iv Contents Introduction xv I Background 1 1 State of the Art 3 1.1 Prioritysearchtrees ........................... 6 1.1.1 Complexity ............................ 7 1.1.2 Construction ........................... 7 1.1.3 Three-sidedquery.. .. .. 8 1.1.4 Dynamicprioritysearchtrees . 9 1.1.5 Priority search trees and ranked queries . .. 11 1.2 Cartesiantrees .............................. 12 1.2.1 Definition ............................. 12 1.2.2 Construction ........................... 12 1.3 Weight-balancedB-trees . 14 1.3.1 B-trees............................... 14 1.3.2 Weight-balancedB-trees . 15 2 Base data structures 19 2.1 Suffixtrees................................. 19 2.1.1 Complexity ............................ 20 2.1.2 Construction ........................... 21 2.1.3 Applications............................ 22 2.2 Rangetrees ................................ 23 2.2.1 Definition ............................. 24 2.2.2 Spacecomplexity .. .. .. 24 2.2.3 Construction ........................... 26 vi Contents 2.2.4 Orthogonalrangesearchquery. 26 2.2.5 Fractionalcascading . 28 2.2.6 Dynamicoperationsonrangetrees . 30 2.2.7 Applications............................ 30 II Making structures rank-sensitive 33 3 Introduction and definitions 35 4 Adding rank-sensitivity to suffix trees 37 4.1 Introduction................................ 37 4.2 Preliminaries ............................... 38 4.2.1 Lightandheavyedges . 38 4.2.2 Rank-predecessor and rank-successor . 40 4.3 Approach ................................. 40 4.3.1 Naive solution with quadratic space . 40 4.3.2 Naive solution with non-optimal time . 42 4.3.3 Rank-treesolution . 42 4.4 Rankedtreestructure. .. .. 44 4.4.1 Leaflists.............................. 45 4.4.2 Leafarrays ............................ 46 4.4.3 Nodearrays ............................ 46 4.5 Rankedtreealgorithms. 46 4.5.1 Top-k query............................ 46 4.6 Complexity ................................ 48 4.6.1 Spacecomplexity . .. .. 48 4.6.2 Timecomplexity ......................... 50 4.7 Experimentalresults . .. .. 52 4.7.1 Structureconstructionalgorithm . 54 4.7.2 Querytimeresults . .. .. 55 4.7.3 Structuresizeresults . 63 4.8 An alternative approach using Cartesian trees . ..... 63 4.8.1 Datastructure .......................... 63 4.8.2 Queryalgorithm ......................... 64 4.8.3 Discussion............................. 66 Contents vii 5 Rank-sensitivity — a general framework 69 5.1 Introductionanddefinitions . 69 5.2 Thestaticcaseanditsdynamization . 70 5.2.1 Staticcaseonasingleinterval . 71 5.2.2 Polylog intervals in the dynamic case . 72 5.3 Multi-Q-heaps............................... 76 5.3.1 High-level implementation . 77 5.3.2 Multi-Q-heap: Representation . 78 5.3.3 Multi-Q-heap: Supportedoperations . 78 5.3.4 Multi-Q-heap: Lookuptables . 80 5.3.5 Generalcase............................ 82 5.4 Conclusions ................................ 82 III Dynamic Cartesian trees 83 5.5 Introduction................................ 85 5.5.1 Background ............................ 85 5.5.2 Motivation............................. 87 5.5.3 Ourresults ............................ 88 5.6 Preliminaries ............................... 89 5.6.1 PropertiesofCartesiantrees . 90 5.6.2 DynamicoperationsonCartesiantrees . 91 5.7 Bounding the number of edge modifications . 95 5.8 Implementing the insertions . 99 5.8.1 TheCompanionIntervalTree . .102 5.8.2 SearchingCost ..........................104 5.8.3 RestructuringCost . .110 5.9 Implementing deletions . 112 5.10Conclusions ................................113 Conclusions 115 Index 117 Bibliography 119 viii Contents List of Tables 1.1 Operations supported with the use of a priority search tree and their corresponding complexities. 7 4.1 A table for the quick answering of rank-successor queries computed for the tree in Figure 4.1. A cell in column i and row j identifies the rank-successor of leaf i at the ancestor of i whose depth is j. For clarity, leaves are labeled with their rank. ... 41 4.2 Table 4.1 with only the distinct rank-successors of each leaf distin- guished. .................................. 41 4.3 The average ratio of the number of executions of the inner loop to k depending on the suffix tree size and shape. 57 4.4 The query time in milliseconds depending on k. ............ 58 4.5 Structure size and construction time (in milliseconds) depending on thelengthandtypeoftheindexedtext. 62 x List of Tables List of Figures 1.1 Asampleprioritysearchtree. 6 1.2 A sample dynamic priority search tree. .. 9 1.3 An example of a Cartesian tree. The nodes storing points x, y h i are located at coordinates (x, y) in the Cartesian plane. The edges e =( x ,y , x ,y ) are depicted as lines connecting the two coor- h L Li h R Ri dinates (xL,yL) and (xR,yR). The tree induces a subdivision of the Cartesian plane, which is illustrated by dotted lines. ...... 13 2.1 Suffix tree for the string senselessness$. Longer substrings are rep- resented using start and end positions of the substring in the text. Leaves contain starting positions of their corresponding suffixes. 20 2.2 Range tree for the set of points (1, 15), (2, 5), (3, 8), (4, 2), (5, 1), { (6, 6), (7, 14), (8, 11), (9, 16), (10, 13), (11, 4), (12, 3), (13, 12), (14, 10), (15, 7), (16, 9) . For clarity, the first coordinate has been omitted in } the second-level tree nodes. The query 1 x 13, 2 y 11 is ≤ ≤ ≤ ≤ considered. The path of the query is marked using bold edges. The leaves and the roots of the second-level nodes considered for the query are marked with black dots. Finally, the ranges of points reported are indicated using gray rectangles. 25 2.3 The second-level structures of the range tree in Figure 2.2 linked using fractional cascading. For clarity, the first-level nodes are not depicted — they are the same as the ones in Figure 2.2. The bold lines indicate the fractional cascading pointers following the paths of the query in the primary tree. The bold dashed lines indicate the fractional cascading pointers followed from a node on the main query path to its child whose secondary structure needs to be queried. ... 29 xii List of Figures 4.1 A sample tree with the leaf rank indicated inside the leaf symbols. Heavy edges

Ranked Queries in Index Data Structures

Interval Trees Storing and Searching Intervals

Balanced Trees Part One

Augmentation: Range Trees (PDF)

Advanced Data Structures

Search Trees

Computational Geometry: 1D Range Tree, 2D Range Tree, Line

Position Heaps for Cartesian-Tree Matching on Strings and Tries

Range Searching

I/O-Efficient Spatial Data Structures for Range Queries

Parallel Range, Segment and Rectangle Queries with Augmented Maps

Lecture Notes of CSCI5610 Advanced Data Structures

CMSC 420: Lecture 7 Randomized Search Structures: Treaps and Skip Lists