Techniques for Improving Efficiency and Scalability for the Integration Of

1 Techniques for Improving Efficiency and Scalability for the Integration of Information Retrieval and Databases Hengzhi Wu Thesis submitted for the degree of Doctor of Philosophy at Queen Mary, University of London May 2010 2 Declaration of originality I hereby declare that this thesis, and the research to which it refers, are the product of my own work, and that any ideas or quotations from the work of other people, published or otherwise, are fully acknowledged in accordance with the standard referencing practices of the discipline. The material contained in this thesis has not been submitted, either in whole or in part, for a degree or diploma or other qualification at the University of London or any other University. Some parts of this work have been previously published as: [Wu and Roelleke, 2009] Wu, H. and Roelleke, T. (2009). Semi-subsumed events: A probabilistic semantics of the BM25 term frequency quantification. In ICTIR, pages 375–379. [Roelleke et al., 2008] Roelleke, T., Wu, H., Wang, J., and Azzam, H. (2008). Modelling retrieval models in a probabilistic relational algebra with a new operator: the relational Bayes. VLDB J., 17(1):5–37. [Wu et al., 2008a] Wu, H., Kazai, G., and Roelleke, T. (2008a). Modelling anchor text retrieval in book search based on back-of-book index. In SIRIG Workshop on Focused Retrieval, pages 51–58. [Wu et al., 2008b] Wu, H., Kazai, G., and Taylor, M. (2008b). Book search experiments: Investigating IR methods for the indexing and retrieval of books. In ECIR, pages 234–245. [Roelleke et al., 2005] Roelleke, T., Ashoori, E., Wu, H., and Cai, Z. (2005). The QMUL team with probabilistic SQL at enterprise track. In TREC. 3 Abstract This thesis is on the topic of integration of Information Retrieval (IR) and Databases (DB), with particular focuses on improving efficiency and scalability of integrated IR and DB technology (IR+DB). The main purpose of this study is to develop efficient and scalable techniques for supporting integrated IR and DB technology, which is a popular approach today for handling complex queries over text and structured data. Our specific interest in this thesis is how to efficiently handle queries over large-scale text and structured data. The work is based on a technology that integrates probability theory and relational algebra, where retrievals for text and data are to be expressed in probabilistic logical programs such as probabilistic relational algebra or probabilistic Datalog. To support efficient processing of probabilistic logical programs, we proposed three optimization techniques that focus on aspects covered logical and physical layers, which include: scoring-driven query optimization using scoring expression, query processing with top-k incorporated pipeline, and indexing with relational inverted index. Specifically, scoring expressions are proposed for expressing the scoring or probabilistic semantics of implied scoring functions of PRA expressions, so that efficient query execution plan can be generated by rule-based scoring-driven optimizer. Secondly, to balance efficiency and effectiveness so that to improve query response time, we studied methods for incorporating top- k algorithms into pipelined query execution engine for IR+DB systems. Thirdly, the proposed relational inverted index integrates IR-style inverted index and DB-style tuple-based index, which can be used to support efficient probability estimation and aggregation as well as conventional relational operations. Experiments were carried out to investigate the performances of proposed techniques. Ex- perimental results showed that the efficiency and scalability of an IR+DB prototype have been improved, while the system can handle queries efficiently on considerable large data sets for a number of IR tasks. 4 Contents 1 Introduction 16 1.1 Research Background of the Thesis . 16 1.1.1 Integration of Information Retrieval and Databases at a Glance . 17 1.1.2 Motivation of Research . 19 1.2 Research Problems . 19 1.3 Outline of the Proposed Techniques in the Thesis . 20 1.3.1 Scoring-Driven Optimization . 21 1.3.2 Top-k Incorporated Pipeline . 21 1.3.3 Relational Inverted Index . 22 1.4 Overview of the Thesis . 23 2 Integration of Information Retrieval and Databases 25 2.1 Introduction . 25 2.2 A Brief Review of Information Retrieval . 27 2.2.1 Basic Procedures of Information Retrieval . 27 2.2.2 Retrieval Models . 30 2.2.2.1 Dominant Non-probabilistic Models . 32 2.2.2.2 Dominant Probabilistic Models . 33 2.3 Integrating Ranking into Relational Databases . 34 2.4 Probabilistic Databases . 36 2.4.1 Possible Worlds Model . 37 2.4.2 Probabilistic Relational Algebra . 41 2.4.3 Query Evaluation Techniques for Conjunctive Queries . 45 2.5 Integrated IR and DB Technologies . 48 2.5.1 State-of-the-Art . 48 2.5.2 Modelling IR Strategies in Declarative Languages . 53 5 2.5.2.1 The Basics . 53 2.5.2.2 An Extended PRA for Modelling IR . 56 2.5.2.3 Examples of Modelling Probability Estimations . 58 2.6 Summary . 60 3 SCX: Scoring-Driven Query Optimization with Scoring Expression 61 3.1 Introduction . 61 3.2 Query Optimizations for Databases . 64 3.2.1 Algebraic Manipulation . 66 3.3 Scoring Expression . 68 3.3.1 Discovering the Scoring Semantics of PRA Expressions . 68 3.3.2 Equivalence of PRA Expressions . 69 3.3.3 Ideas and Principles of Design . 73 3.3.4 Syntax and Semantics . 75 3.3.4.1 Instant Constant and Parameter . 75 3.3.4.2 Variable . 75 3.3.4.3 Operators . 79 3.3.4.4 Functions . 79 3.3.4.5 Expressions . 82 3.3.5 Generated SCX and Interpreted SCX . 86 3.4 Scoring-Driven Optimization . 90 3.4.1 Generating SCX for PRA . 90 3.4.2 Principles of SCX Manipulation . 96 3.4.2.1 Rotation-Based Manipulations . 97 3.4.2.2 Transformations of SCX . 98 3.4.3 Automatic Analysis for SCX . 101 3.4.4 Commencing Scoring-Driven Optimization . 104 3.4.4.1 Algorithm and Rules . 105 3.4.4.2 Assisting Index Selection . 108 3.4.4.3 Aligning Scoring Function under Extensional Semantics . 118 3.4.4.4 Verifying Scoring Equivalence . 121 3.5 Experiments and Results . 123 6 3.5.1 Specifications and Setup . 124 3.5.2 Methodology . 125 3.5.3 Results . 126 3.6 Summary . 127 4 TIP: Query Processing with Top-k Incorporated Pipeline 129 4.1 Introduction . 129 4.2 Background . 130 4.2.1 Computational Model . 130 4.2.2 Typical Scenario and Example . 132 4.2.3 Family of Threshold Algorithms . 133 4.2.4 Pipelined Top-k Operators in Relational Databases . 140 4.2.5 Other Related Work . 141 4.3 Top-k Incorporated Pipeline . 143 4.3.1 Preliminary of Execution Plan in Databases and IR+DB Systems . 143 4.3.1.1 Common Query Block . 143 4.3.1.2 Physical Operators and Pipelined Execution Plan . 144 4.3.2 Conceptual Design of TIP . 145 4.3.2.1 Physical Operators . 145 4.3.2.2 Incorporating Top-k Algorithms . 147 4.3.3 An Investigation of Performances Tradeoff of NRA-Style Top-k Strategies 149 4.3.3.1 Ideal Performances Tradeoff Measurement . 150 4.3.3.2 Modelling NRA-Style Top-k in Declarative Languages . 152 4.3.3.3 Allotting Strategies for Budgets . 153 4.4 Experiments and Results . 155 4.4.1 Specifications and Setup . 155 4.4.2 Methodology . 156 4.4.3 Results . 158 4.5 Summary . 162 5 RIX: Indexing with Relational Inverted Index 163 5.1 Introduction . 163 7 5.1.1 Motivation . 163 5.1.2 Inverted Indexes . 165 5.1.3 Outline . 166 5.2 Relational Inverted Index . 167 5.2.1 Logical Designs of Indexing Structures . 167 5.2.1.1 Inverted Index versus TID Index . 167 5.2.1.2 Structures of RIX . 169 5.2.2 Architecture of RIX Indexer . 172 5.2.3 Abstract Data Types and Data Structures . 174 5.2.3.1 Basic ADTs . 174 5.2.3.2 Operational ADTs . 175 5.2.4 Theoretical Analysis . 177 5.2.4.1 Overall Analysis . 177 5.2.4.2 Disk I/O Characteristics . 178 5.2.4.3 I/O Cost Models for Building RIX . 180 5.2.5 Construction Procedures . 185 5.2.5.1 Outline and Data Flow . 185 5.2.5.2 Building Algorithms . 187 5.2.5.3 Scheduling Algorithms . 194 5.2.6 Retrieval Procedures . 201 5.2.6.1 Accessing Methods for Search and Fetch . 201 5.2.6.2 Supporting Physical Operators and Operations . 202 5.2.7 Update Procedure . 204 5.3 Experiments.

Load more