Managing Large-scale Probabilistic Databases

Christopher Ré

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

University of Washington

2009

Program Authorized to Offer Degree: Computer Science & Engineering

University of Washington Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by

Christopher Ré

and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Chair of the Supervisory Committee:

Dan Suciu

Reading Committee:

Magdalena Balazinska

Anna Karlin

Dan Suciu

Date:

In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”

Signature

Date

University of Washington

Abstract

Managing Large-scale Probabilistic Databases

Christopher Ré

Chair of the Supervisory Committee:
Professor Dan Suciu
Computer Science & Engineering

Modern applications are driven by data, and increasingly the data driving these applications are imprecise. The set of applications that generate imprecise data is diverse: In sensor applications, the goal is to measure some aspect of the physical world (such as temperature in a region or a person’s location). Such an application has no choice but to deal with imprecision, as measuring the physical world is inherently imprecise. In data integration, consider two databases that refer to the same set of real-world entities, but the way in which they refer to those entities is slightly different. For example, one database may contain an entity ‘J. Smith’ while the second database refers to ‘John Smith’. In such a scenario, the large size of the data makes it too costly to manually reconcile all references in the two databases. To lower the cost of integration, state-of-the-art approaches allow the data to be imprecise. In addition to applications which are forced to cope with imprecision, emerging data-driven applications, such as large-scale information extraction, natively produce and manipulate similarity scores. In all these domains, the current state-of-the-art approach is to allow the data to be imprecise and to shift the burden of coping with imprecision to applications. The thesis of this work is that it is possible to effectively manage large, imprecise databases using a generic approach based on probability theory. The key technical challenge in building such a general-purpose approach is performance, and the technical contributions of this dissertation are techniques for efficient evaluation over probabilistic databases. In particular, we demonstrate that it is possible to run complex SQL queries on tens of gigabytes of probabilistic data with performance that is comparable to a standard relational database engine.

TABLE OF CONTENTS


List of Figures

Chapter 1: Introduction
1.1 Problem Space, Key Challenges, and Goals
1.2 Technical Contributions

Chapter 2: Preliminaries
2.1 General Database Preliminaries
2.2 BID databases
2.3 Lineage, c-tables, or Intensional evaluation
2.4 Expressiveness of the Model

Chapter 3: Query-time Technique I: Top-K Query Evaluation
3.1 Motivating Scenario and Problem Definition
3.2 Top-k Query Evaluation using Multisimulation
3.3 Optimizations
3.4 Experiments

Chapter 4: Query-time Technique II: Extensional Evaluation for Aggregates
4.1 Motivating Scenario
4.2 Formal Problem Description
4.3 Preliminaries
4.4 Approaches for HAVING
4.5 Generating a Random World
4.6 Approximating HAVING queries with MIN, MAX and SUM
4.7 Summary of Results

Chapter 5: View-based Technique I: Materialized Views
5.1 Motivating Scenario and Problem Definition
5.2 Query Answering using Probabilistic Views
5.3 Practical Query Answering using Probabilistic Views
5.4 Experiments

Chapter 6: View-based Technique II: Approximate Lineage
6.1 Motivating Scenarios
6.2 Statement of Results
6.3 Sufficient Lineage
6.4 Polynomial Lineage
6.5 Experiments

Chapter 7: Related Work
7.1 Broadly Related Work
7.2 Specific Related Work

Chapter 8: Conclusion and Future Work

Bibliography

Appendix A: Proof of Relaxed Progress
A.1 Bounding the Violations of Progress

Appendix B: Proofs for Extensional Evaluation
B.1 Properties of Safe Plans
B.2 Appendix: Full Proof for COUNT(DISTINCT)
B.3 Appendix: Full Proofs for SUM and AVG
B.4 Convergence Proof of Lemma 4.6.8

LIST OF FIGURES

1.1 A screen shot of MystiQ
2.1 An SQL Query and its equivalent relational plan
2.2 Sample data from a movie database
2.3 The Reviews encoded in the syntax of Lineage
3.1 Schema fragment of IMDB and Amazon database, and a fuzzy match table
3.2 Sample fuzzy match data using IMDB
3.3 Sample top-k query
3.4 Top-5 answers for a query on IMDB
3.5 Illustration of Multisimulation
3.6 Three imprecise datasets
3.7 Query Statistics with and without Safe Plans
3.8 Experimental Evaluation
4.1 Sample data for HAVING Queries
4.2 Query Syntax for HAVING queries
4.3 Sample imprecise data from IMDB
4.4 Query and its extensional
4.5 A list of semirings used to evaluate aggregates
4.6 A graphical representation of extensional evaluation
4.7 Summary of results for MIN, MAX and COUNT
5.1 Sample restaurant data for materialized views
5.2 Representable views and their effect on query processing
5.3 Dataset summary for materialized views
6.1 Modeling the GO database using Lineage
6.2 Query statistics for the GO DB [37]
6.3 Compression ratio for approximate lineage
6.4 Comparison of sufficient and polynomial lineage
6.5 Compression experiment for IMDB
6.6 Explanation recovery experiment
6.7 Impact of approximate lineage on query processing performance

ACKNOWLEDGMENTS

I am forever indebted to Dan Suciu. Dan is blessed with an astonishing combination of brilliance and patience, and I will be forever grateful for the skills he has patiently taught me. I owe him my research career, which is one of the most rewarding pursuits in my life. Magdalena Balazinska offered me patient support and advice throughout my time at the University of Washington. I am also profoundly grateful for her trust that my only vaguely defined research directions would lead to something interesting. I am also grateful to Anna Karlin for her participation on my committee. The students in the database group and the Computer Science & Engineering department made my graduate studies both intellectually stimulating and incredibly enjoyable. An incomplete list of those who helped me on this path include Eytan Adar, Michael Cafarella, Nilesh Dalvi, YongChul Kwon, Julie Letchner, Jayant Madhavan, Gerome Miklau, Kate Moore, Vibhor Rastogi, and Atri Rudra. The conversations that I had with each one of them enriched my life and made this work a joy to pursue. The faculty and staff at the University of Washington have created a wonderful place, and I am very thankful to be a part of it. Finally, I would like to thank my family. My father, Donald Ré, not only provided love and support throughout my life, but also listened to hours of excited chatter about my research ideas that were – at best – only halfway clear in my own mind. John and Peggy Emery provided incredible support including countless meals, rides, and hours of great conversation. I am, however, most grateful to them for two things: their love and the greatest gift of all, their daughter and my wife, Mikel. Mikel’s patience, love, and support are boundless. For any current or future success that I might have, Mikel deserves more credit than I can possibly express in a few lines of text.



Chapter 1

INTRODUCTION

A probabilistic database is a general framework for managing imprecision and uncertainty in data. Building a general framework to manage uncertainty in data is a challenging goal because critical data-management tasks, such as query processing, are theoretically and practically more expensive than in traditional database systems. As a result, a major challenge in probabilistic databases is to efficiently query and understand large collections of uncertain data. This thesis demonstrates that it is possible to effectively manage large, imprecise databases using a generic approach based on probability theory.

One example application domain for a probabilistic database, such as our system MystiQ, is integrating data from autonomous sources [21]. In this application, a key problem is that the same entity may be represented differently in different sources. In one source, the Very Large Database Conference may be represented by its full name; in another source, it is referenced as simply “VLDB”. As a result, if we ask a query such as, “Which papers appeared in the last VLDB?”, a standard database will likely omit some or all of the relevant answers. When integrating large data sources, it is infeasible to manually reconcile every pair of references, so automatic techniques are used that produce scored candidate matches for each pair of references. In MystiQ, we model these scores as probabilistic events. In response to a (potentially complicated) SQL query, MystiQ combines the stored probabilistic events to produce a set of answers that are annotated with the probability that the answer is in the output of the query. An example output of MystiQ is shown in Figure 1.1. MystiQ makes essential use of the scores associated with answers in two ways: First, by allowing the system to return some answers with lower confidence, we can achieve a higher recall; that is, we are able to return more of the papers that truly appeared in VLDB. Second, MystiQ uses the probability scores to rank the answers, which allows high precision, that is, spurious papers are ranked below papers that actually appeared in VLDB. The true power of MystiQ is evident in more sophisticated applications. For example, if we perform a social networking analysis to find

Figure 1.1: A screen shot of MystiQ [21, 142], which ranks query answers by probability; the probability is computed by combining the uncertainty in the database tables mentioned in the query.

influential papers, we could incorporate the (imprecise) results into our paper database. We could then pose interesting structured queries, such as, “find influential papers from previous VLDBs that were authored at the University of Washington”.

1.1 Problem Space, Key Challenges, and Goals

The key technical goal of this dissertation is to build a probabilistic relational database system that is able to scale to large databases (gigabytes or more) with query performance that is comparable to a standard relational database without imprecision. We believe that we have succeeded in this goal as the techniques in this dissertation allow sophisticated SQL queries to be run on tens of gigabytes of relational data using the MystiQ system, which demonstrates our central thesis: it is possible to effectively manage large, imprecise databases using a generic approach based on probability theory. We briefly discuss the problem space of this dissertation (a more substantial discussion of related work is in Chapter 7).

Data model

In this work, we focus exclusively on discrete relational databases that are queried using SQL or its formal equivalent, conjunctive queries [1]. This is not the only choice: Others have chosen in the last few years to consider probabilistic XML databases [3, 36, 104, 150], streaming probabilistic databases queried using sequence languages [99, 138, 157], and continuous probabilistic sensor databases [34, 56]. Even within the space of discrete probabilistic relational databases, there is a wide variety of different approaches: The Monte Carlo DB or MCDB [95] is an approach that is well-grounded in statistics; it allows one to specify complex distributions over attributes and even entire tables. The distributions are specified implicitly; the system can handle any distribution that a user is able to sample from. Processing these black-box distribution functions efficiently is very difficult, and the key challenge the project addresses is performance. Although this model seems substantially different from the approach we consider in this dissertation, we believe that many of our techniques are applicable: As noted by Jampani et al. [95] in the paper that introduces MCDB, the techniques of Chapter 5 dealing with materialized views are a promising technique to enable scalability. Another line of work casts the problem of probabilistic query answering as probabilistic inference [148, 167]. This approach allows these systems to leverage years of statistical inference expertise in coping with complex distributions. In contrast, the approach in this dissertation starts with a simpler model, and attempts to leverage data management techniques for scalability such as materialized views (Chapter 5), and techniques to find the border of tractability (Chapter 4).

Two projects are very closely related to the MystiQ project and the contents of this dissertation: the Trio system [17, 129, 168] and the MayBMS system [8, 9, 91]. The original focus of Trio was understanding the modeling issues that arise when keeping imprecise or uncertain information in the database. Notably, they promoted the idea of lineage, which helps process probabilistic queries, a concept we discuss in detail in Chapter 2. In contrast, the focus of this dissertation is on performance. The MayBMS system supports a query language more powerful than standard SQL that allows one to perform interesting operations such as introducing uncertainty via the query or conditioning on complex probabilistic events. In this dissertation, we will consider only SQL or conjunctive queries. One benefit of choosing a simple model is that many of the results in this dissertation are

applicable to the systems that we have discussed above.

Key Technical Challenges and Goals

The key challenge that we address in this dissertation is performance. Performance is challenging in probabilistic databases: in contrast to standard SQL processing, which is always theoretically easy (AC0), evaluating queries on a probabilistic database is theoretically hard (#P-hard) [46, 78]1. While hardness is a key challenge, it is also a golden opportunity: Optimization techniques from the database management literature can actually become more effective for probabilistic databases than they were for standard relational databases. For example, materialized views are an effective technique in relational optimizers, which allows the system to precompute information and then use this precomputed information to optimize queries. In probabilistic databases, this technique is even more effective: As we will see in Chapter 5, a query may be theoretically hard, but by precomputing information, at run-time the remaining processing can be done efficiently (in PTIME). Practically, using materialized views we are able to reduce the cost of query processing on probabilistic databases substantially (from hours to seconds).

1.2 Technical Contributions

We divide the techniques of this dissertation into two categories: query-time techniques, which are efficient run-time query-processing strategies, and view-based techniques, which are techniques that exploit logical views and precomputation to optimize queries.

Query-time techniques

Top-K Processing Computing the output score of even a single answer on a probabilistic database requires techniques that are orders of magnitude slower than processing standard SQL, such as Monte Carlo sampling. Often, users are not interested in all answers, but only the most highly

1Here, our notion of complexity is data complexity in which the query is fixed, but the database is allowed to grow. We measure the complexity with respect to this growth. AC0 is the class of languages that can be decided by uniform families of circuits with unbounded fan-in, polynomial size and constant depth [128, pg. 386]. #P is a class of counting problems: For any decision problem that can be verified in P, a #P problem asks for the number of solutions of that problem. The canonical #P-hard problem is #SAT, which counts the number of solutions of a Boolean formula [128, pg. 441].

ranked, say top 10, answers. In spite of this, they are forced to wait as the system churns through thousands of useless results. To remedy this, we designed an algorithm called multisimulation that saves computational resources in two ways: (1) it focuses on those answers that have a chance of being in the top 10, and (2) instead of computing a precise probability for each answer, it computes only a relative ordering of the top answers. The central technical result of this chapter is that the multisimulation algorithm is optimal among all deterministic algorithms for finding the top k most highly probable tuples. Moreover, it is within a factor of 2 of any (even non-deterministic) approach. Multisimulation is the heart of MystiQ’s processing of general SQL queries and is the subject of Chapter 3. This work appeared in the International Conference on Data Engineering in 2007 [137] and is joint work with Nilesh Dalvi and Dan Suciu.

Probabilistic Decision Support The next generation of business applications will generate large quantities of imprecise data, and so will be difficult for users to understand. Traditional decision support systems allow users to understand their (precise) databases by providing sophisticated aggregation functions, e.g., using the HAVING clause in SQL. Inspired by the success of decision support queries in relational databases, we studied the evaluation of decision support queries on probabilistic databases. The key challenge is that standard approaches to processing imprecise queries, such as sampling, force the user to choose between poor quality guarantees or prohibitively long execution times. To overcome this, we designed an algorithm to efficiently and exactly evaluate decision support queries using generalized arithmetic operations2. For a class of decision support queries, our algorithms are optimal: (1) either a query can be computed efficiently using our generalized arithmetic operations and is in PTIME, or (2) computing an aggregation query’s probability exactly is intractable (#P-hard), but our results can provide a provably efficient approximation (an FPTRAS), or (3) it is both intractable and does not have an FPTRAS. Our results provide a guide to extend a probabilistic database to process decision support queries and define the limits of any approach. This work is the subject of Chapter 4 and is joint work with Dan Suciu. A preliminary version of this work appeared in the Symposium on Databases and Programming Languages, 2007 [139], and

2Formally, we used an abstraction based on semirings and monoid convolutions. Monoids are an abstraction of addition (or multiplication) on the natural numbers. Semirings can be viewed as an abstraction of the standard rules of multiplication and addition applied to monoids.

an extended version appeared in the Journal of Very Large Databases [143].

View-based techniques

Materialized Views for Probabilistic Databases In standard databases, a fundamental technique is to precompute intermediate results, known as materialized views, and then use these views to expedite future queries. If we directly apply this technique to probabilistic databases, then we run into a problem: each intermediate result in the view (and answer to a query) is actually a probabilistic event and so may be correlated with other intermediate results. The standard approach to tracking correlations requires expensive bookkeeping, called lineage [168], that records every derivation of every answer that appears in the view. To recover the classical performance benefits of materialized views, we need to discard the lineage, but doing so may lead to incorrect results. Our key technical result is a simple and efficient test to decide whether one can safely discard the lineage associated with a view. On rare occasions, this test may return a false negative, meaning that it rejects some views that could be safely processed. Practically, one cannot go too far beyond this test, because we showed that any complete test (without false negatives) must be inefficient: the problem is Π2P-complete in the size of the view definition3. We validated our approach using data from a Seattle-area start-up company, iLike.com, and showed that probabilistic databases are able to handle huge amounts (GBs) of imprecise data. This work is the subject of Chapter 5. It was joint work with Dan Suciu and appeared in the Very Large Database Conference of 2007 [140] and will appear in the Journal of Computer and System Sciences [49].

Approximate Lineage In addition to performance challenges, correlations make it difficult to understand and debug the output of a probabilistic database. Specifically, the sheer size and complexity of the correlation information (lineage) can overwhelm both the query processor and the user of the system. As the amount of lineage increases, the importance of any individual piece of lineage decreases. Inspired by this observation, we propose a scheme called sufficient lineage that removes some correlation information to produce a smaller approximate lineage formula. With a substantially smaller lineage formula (often hundreds of times smaller), a probabilistic database is able to

3Π2P denotes the second level of the polynomial hierarchy, which is the class of problems decidable by a coNP machine with access to an NP oracle [128, pg. 425]. The canonical complete problem for Π2P is ∀∃SAT.

process many queries orders of magnitude more efficiently than with the original, full lineage formula. Although a sufficient lineage formula is much smaller than a traditional lineage formula, we demonstrated theoretically and empirically that a probabilistic database using sufficient lineage still provides high quality answers. Further, sufficient lineage has two practically important properties: first, it is syntactically identical to standard lineage, and so can be used without modifying the query processor, and second, it encodes a small set of explanations for a query answer, and so can be used to debug query answers. Sufficient lineage is only one instantiation of approximate lineage: We also defined other formalisms for approximate lineage, notably, one based on polynomials. This formalism can allow smaller lineage functions, but the formulae it produces are not syntactically identical to standard lineage. Thus, polynomial lineage requires additional work to integrate with a probabilistic database. This was joint work with Dan Suciu and appeared in the proceedings of the Very Large Database Conference in 2008 [141].

Summary The central goal of this dissertation is to provide a framework that can manage large amounts of probabilistic data, which supports the thesis of this work: it is possible to effectively manage large, imprecise databases using a generic approach based on probability theory. The main technical challenge is performance, and the techniques of this dissertation use a two-pronged approach to this problem: query-time optimizations and view-based or precomputation optimizations.

Chapter 2

PRELIMINARIES

This section gives two critical pieces of background material for this dissertation: how to represent (store) a probabilistic database and how to query it. We describe each of these two pieces in two different ways: first, we describe a concrete representation of a probabilistic database called Block-Independent-Disjoint (BID) [137, 140], and then a more abstract representation system based on lineage [168] and c-tables [93]. As we show here, if we add conjunctive views to BID tables then the two representations are essentially equivalent. The strengths of the two approaches are complementary. We introduce the BID representation because it is general and concrete. It is general since it includes as special cases many other representations discussed in the literature such as p-?-sets and p-or-sets [82], ?- and x-relations [145], and tuple independent databases [46, 108]. The BID representation is concrete in that it is the representation implemented in the MystiQ system [21]. On the other hand, the representation based on lineage allows us to specify the technical contributions later in this work precisely and succinctly. The formal model that underlies both representations is the same: the possible worlds model [64]. This model is the standard for probabilistic databases and is essentially a reformulation of standard discrete probability theory.

2.1 General Database Preliminaries

We briefly recall some standard database concepts to fix our notation1.

2.1.1 A Declarative View of Queries

Fix a finite or infinite domain D, which will contain all values that occur in our database, e.g., movie names, review ids, etc. in a movie reviews database. We assume a standard relational schema R

1Our presentation in this section borrows from the thesis of Gerome Miklau [118] and the textbook by Abiteboul, Hull, and Vianu [1].

with relation names R1, R2,... and follow Datalog notation. For example, R1(a, b, c) denotes a tuple

(a, b, c) in R1. A database instance or world J is a subset of tuples in each of the relations. The content (or extent) of a relation R in a world J is denoted R^J. For a fixed schema, we denote the set of all possible instances with Inst. A query of arity k is a function q : Inst → P(D^k), i.e., a query takes as input a database instance and produces a set of tuples with the same schema2. In this work, we focus on a subset of queries called conjunctive queries [1]. An abstract example of a conjunctive query is

q1(z) :- R(x, z, ‘c’), S(x, y, −), T(y, −, ‘a’)

Here, x, y, z are variables, ‘a’ and ‘c’ are constants, − represents anonymous variables (each − is distinct from all other variables). Denote by var(q) the set of all variables in a query q and by const(q) the set of all constants in a query q. In general, a conjunctive query is a rule of the form:

q(y) :- g1, ..., gn

where each of the gi are subgoals that may contain selections. The query q1 above consists of three subgoals. The first subgoal R(x, z, ‘c’) contains a selection that intuitively says we are only interested in R tuples whose third component is ‘c’. We denote the set of subgoals in the query q as goal(q). We evaluate a query q on an instance J in the standard way: we search for a query homomorphism h : var(q) ∪ const(q) → D which is identity on constants, i.e., h(c) = c for all c ∈ const(q), and such that for each subgoal gi = R(z1, ..., zm), where each zi is either a variable or a constant, we have that R(h(z1), h(z2), ..., h(zm)) ∈ R^J. If this holds for all subgoals, then we return the tuple

(h(y1), h(y2),..., h(ym)).
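To make the homomorphism-based semantics concrete, the following is a minimal Python sketch (not part of the dissertation): it evaluates a conjunctive query by brute-force search over all ways of assigning a tuple to each subgoal, checking that the induced variable mapping is consistent. The Var class and the dictionary encoding of an instance are illustrative assumptions.

```python
from itertools import product

class Var:
    """A query variable; anything that is not a Var is treated as a constant."""
    def __init__(self, name): self.name = name

def eval_conjunctive_query(head_vars, subgoals, instance):
    """Answers of q(head_vars) :- subgoals on `instance` (a dict: name -> set of tuples).

    We enumerate candidate homomorphisms h : var(q) -> D by trying, for each
    subgoal, every tuple of the corresponding relation (naive nested loops)."""
    answers = set()
    for choice in product(*(instance[rel] for rel, _ in subgoals)):
        h, ok = {}, True                                   # candidate homomorphism
        for (rel, args), tup in zip(subgoals, choice):
            for a, v in zip(args, tup):
                if isinstance(a, Var):
                    if h.setdefault(a.name, v) != v:       # same variable, two values
                        ok = False; break
                elif a != v:                               # constant must match exactly
                    ok = False; break
            if not ok: break
        if ok:
            answers.add(tuple(h[x.name] for x in head_vars))
    return answers

# q1(z) :- R(x, z, 'c'), S(x, y, u1), T(y, u2, 'a')   (u1, u2 play the role of '-')
x, y, z, u1, u2 = (Var(n) for n in 'x y z u1 u2'.split())
subgoals = [('R', [x, z, 'c']), ('S', [x, y, u1]), ('T', [y, u2, 'a'])]
I = {'R': {(1, 'b1', 'c')}, 'S': {(1, 2, 'd')}, 'T': {(2, 'e', 'a')}}
print(eval_conjunctive_query([z], subgoals, I))   # {('b1',)}
```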

Relationship to SQL

One reason that this fragment is important is that it captures the heart of the standard query language for relational databases, SQL [94]. Figure 2.1(b) shows an SQL query that is equivalent to q1. Later, in Chapter 4, we enhance this language to handle some advanced features of SQL, notably

2A (Boolean) query q is a function where k = 0 (or is a sentence in the standard sense of first-order logic [61]).

(a) Schema:    R(A, B, C)    S(A, B, D)    T(B, E, F)

(b) SQL query:
SELECT DISTINCT R.B
FROM R, S, T
WHERE R.A = S.A AND R.C = ‘c’
  AND S.B = T.B AND T.F = ‘a’

(c) Relational plan:
P1 = R(x, z, ‘c’) ⋈ π−u1 S(x, y, u1)
P  = π−y (P1 ⋈ π−u2 T(y, u2, ‘a’))

Figure 2.1: (a) The corresponding Relational Style Schema for query q1. (b) is an SQL query that is equivalent to q1. (c) P is an equivalent relational query plan (we have labeled the anonymous variables in q as u1 and u2 for readability). We use P1 to aid readability; we simply inline the definition of P1 in P.

aggregation functions that include functions such as SUM or AVG.

2.1.2 An operational view of conjunctive queries

Query plans are a more operational, but equivalent, way to describe conjunctive queries. A query plan is a tree of relational operators, where each relational operator is a function from relations (of any arity) to relations3.

Definition 2.1.1 (Query Plan Syntax). A query plan P is inductively defined as (1) a single subgoal which may contain selections, (2) P = π−xP1 (projection) where P1 is a plan and x is a variable, (3)

P = P1 ⋈ P2 (join) where P1 and P2 are plans.

Figure 2.1(c) shows a query plan that is equivalent to q1 (and so to the SQL query in Figure 2.1(b)). We view query plans as inductively computing tuples. We also denote the variables that are “returned” by a plan P with var(P) and define it inductively: If P = g then var(P) is the set of free variables in g, if P = π−xP1 then var(P) = var(P1) − {x}, and if P = P1 ⋈ P2 then var(P) = var(P1) ∪ var(P2).

3Our description of relational query plans is standard with the minor caveat that we view projections as removing attributes, rather than naming which attributes they keep.

Query Plan Semantics

A query plan P on an instance I produces a set of tuples. To define this set, we first recall term matching [159]: A tuple t matches a subgoal g = R(y1,..., yn) where the yi may be (repeated) variables or constants, if there exists a homomorphism h : var(g) ∪ const(g) → D such that: t = R(h(y1),..., h(yn)). For example, if g = R(x, x, y, ‘a’) then the tuple R(b, b, c, a) matches g, but neither of the following tuples match g: R(a, b, c, a) (there is no way to map x) nor R(a, a, b, b) (the constants cannot match). Now, we can define the semantics of query plans: We write t ∈ P(I) if the tuple t is produced by P on instance I and define this set of tuples inductively:

• Let P = g with g = R(y), a subgoal with selections; then t ∈ P if t ∈ R^I matches g.
• Let P = π−xP′; then t ∈ P if there exists t′ such that t′[var(P)] = t[var(P)] and t′ ∈ P′.

• Let P = P1 ⋈ P2; then t ∈ P if there exist t1 ∈ P1 and t2 ∈ P2 such that ti = t[var(Pi)] for i = 1, 2.

As is standard, the semantics of query plans are equivalent to the semantics of conjunctive queries. Figure 2.1(c) illustrates a query plan that is equivalent to the conjunctive query q1 (and so to the SQL query in Figure 2.1(b)).
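The three operators can be sketched in the same style, with each intermediate relation represented as a list of variable bindings (dictionaries); the ‘$’ prefix marking variables and the tiny instance are illustrative assumptions. The snippet replays the plan of Figure 2.1(c).

```python
def scan(instance, rel, args):
    """Subgoal with selections: return the variable bindings of matching tuples."""
    out = []
    for tup in instance[rel]:
        binding, ok = {}, True
        for a, v in zip(args, tup):
            if a.startswith('$'):                        # '$x' marks a variable
                if binding.setdefault(a, v) != v: ok = False; break
            elif a != v:                                 # constant selection
                ok = False; break
        if ok: out.append(binding)
    return out

def project_away(rows, x):
    """pi_{-x}: drop variable x and eliminate duplicate bindings."""
    seen, out = set(), []
    for r in rows:
        r2 = {k: v for k, v in r.items() if k != x}
        key = tuple(sorted(r2.items()))
        if key not in seen:
            seen.add(key); out.append(r2)
    return out

def join(rows1, rows2):
    """Natural join: merge bindings that agree on their shared variables."""
    return [{**r1, **r2} for r1 in rows1 for r2 in rows2
            if all(r1[k] == r2[k] for k in r1.keys() & r2.keys())]

I = {'R': {(1, 'b1', 'c')}, 'S': {(1, 2, 'd')}, 'T': {(2, 'e', 'a')}}
P1 = join(scan(I, 'R', ['$x', '$z', 'c']),
          project_away(scan(I, 'S', ['$x', '$y', '$u1']), '$u1'))
P = project_away(join(P1, project_away(scan(I, 'T', ['$y', '$u2', 'a']), '$u2')), '$y')
print(P)   # [{'$x': 1, '$z': 'b1'}]
```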

2.2 BID databases

The representation of a probabilistic database that is used in the MystiQ system is called Block-Independent Disjoint, or BID. A BID database consists of one or more tables, where each table is associated with a BID schema. A BID schema describes both the content of the table as in standard relational databases, and additionally, the schema describes how we should interpret the imprecision in that table. An example BID database about movie reviews is shown in Figure 2.2. More precisely, a BID schema is a relational schema where the attributes are partitioned into three disjoint sets. We write a BID schema for a symbol R as R(K; A; P) where the sets are separated by semicolons. Here, K is a set of attributes called the possible worlds key; A is a set of attributes called the set of value attributes; and P is a single, distinguished attribute, called the probability attribute, that stores the marginal probability of the tuple. The values of K and A come from some discrete domain, and the values in the attribute P are numbers in the interval (0, 1]. In the relation, K and A must form a key, i.e., if t[KA] = t′[KA] then t = t′. For example, one BID schema

Reviews(ReviewID, Reviewer; ReviewTitle; P)
  t231a:  231  ‘Ryan’  ‘Fletch’             0.7
  t231b:  231  ‘Ryan’  ‘Spies Like Us’      0.3
  t232a:  232  ‘Ryan’  ‘European Vacation’  0.90
  t232b:  232  ‘Ryan’  ‘Fletch 2’           0.05
  t235a:  235  ‘Ben’   ‘Fletch’             0.8
  t235b:  235  ‘Ben’   ‘Wild Child’         0.2

MovieMatch(CleanTitle, ReviewTitle;; P)
  m1:  ‘Fletch’        ‘Fletch’         0.95
  m2:  ‘Fletch’        ‘Fletch 2’       0.9
  m3:  ‘Fletch 2’      ‘Fletch’         0.4
  m4:  ‘Golden Child’  ‘Golden Child’   0.95
  m5:  ‘Golden Child’  ‘Golden Child’   0.8
  m6:  ‘Golden Child’  ‘Wild Child’     0.2

Figure 2.2: Sample data arising from integrating automatically extracted reviews with a movie database. MovieMatch is a probabilistic relation: we are uncertain which review title matches with which movie in our clean database. Reviews is uncertain because it is the result of information extraction. Note that the identifiers next to each tuple are not part of the database, e.g., t232a; they are there so that we can refer to tuples by a shorthand.

in Figure 2.2 is Reviews(ReviewID, Reviewer; ReviewTitle; P): {ReviewID, Reviewer} is the possible worlds key, {ReviewTitle} is the set of value attributes, and P is the probability attribute. Also pictured in Figure 2.2 is an instance of this schema.

Semantics of BID tables

An instance of a BID schema represents a distribution over instances called possible worlds [64]. The schema of a possible world is R(K, A), i.e., the same schema as the BID table except without the attribute P. Again, this models our choice that the user should query the data as if it were deterministic, and it is the goal of the system to manage the imprecision. Let J be an instance of a BID schema. We denote by t[KAP] a tuple in J, emphasizing its three kinds of attributes, and call t[KA], its projection on the KA attributes, a possible tuple. Define a possible world, W, to be any instance of R(K, A) consisting of possible tuples such that K is a key4 in W. Note that the key constraints do not hold in the BID instance J, but must hold in any possible world W. Let

WJ be the set of all possible worlds associated to J. In Figure 2.2, one possible world W1 is such that R^{W1} = {t231a, t232a, t235a}. Graphically, we have illustrated the blocks from which the BID representation gets its name: Each key value defines a block of tuples such that at most one of them

4We say that K is a key in W if for t, t′ ∈ R^W we have t[K] = t′[K] =⇒ t = t′.

may occur together in any possible world. For example, no possible world may contain both t232a and t232b. We define the semantics of BID instances only for valid instances, which are BID instances such that the values in P can be interpreted as a valid probability distribution, i.e., such that for every tuple t ∈ R^J in any BID relation R(K; A; P), the inequality Σ_{s ∈ R^J : s[K] = t[K]} s[P] ≤ 1 holds.

A valid instance J defines a finite probability space (WJ, µJ). First note that any possible tuple t[KA] can be viewed as an event in the probability space (WJ, µJ), namely the event that a world contains t[KA]. Then we define the semantics of J to be the probability space (WJ, µJ) such that (a) the marginal probability of any possible tuple t[KA] is t[P], (b) any two tuples from the same relation t[KA], t′[KA] such that t[K] = t′[K] are disjoint events (or exclusive events), and (c) for any set of tuples {t1, ..., tn} such that all tuples from the same relation have distinct keys, the events defined by these tuples are independent. If the database contained only the Reviews table, and W1 = {t231a, t232a, t235a}, we see that µ(W1) = 0.7 ∗ 0.9 ∗ 0.8 = 0.504.

Example 2.2.1 The data in Figure 2.2 shows an example of a BID database that stores data from integrating extracted movie reviews from USENET with a movie database from IMDB. The MovieMatch table is uncertain because it is the result of an automatic matching procedure (or fuzzy-join [31]). For example, the probability that a review title ‘Fletch’ matches a movie titled ‘Fletch’ is very high (0.95), but it is not certain (1.0) because the title is extracted from text and so may contain errors. For example, from ‘The second Fletch movie’, our extractor will likely extract just ‘Fletch’ although this review actually refers to ‘Fletch 2’. The Reviews table is uncertain because it is the result of information extraction. That is, we have extracted the title from text (e.g., ‘Fletch is a great movie, just like Spies Like Us’). Notice that t232a[P] + t232b[P] = 0.95 < 1, which indicates that there is some probability that reviewid 232 is actually not a review at all.

Remark 2.2.2. Recall that two distinct possible tuples, say t[KA] and t′[KA], are disjoint if t[K] = t′[K] and t[A] ≠ t′[A]. But what happens if A = ∅, i.e., all attributes are part of the possible worlds key? In that case all possible tuples become independent, and we sometimes call a table R(K;; P) a tuple independent table [46], which is also known as a ?-table [129] or a p-?-table [82]. An example is the MovieMatch table in Figure 2.2.

Finally, we generalize to probabilistic databases that contain many BID tables in the standard way: tuples in distinct tables are independent.

2.2.1 Probabilistic Query Semantics

Users write queries in terms of the possible worlds schema. In particular, a query written by the user does not explicitly mention the probability attributes of the BID relations. This reflects a choice that the user should pose her query as if the data is deterministic, and it is the responsibility of the system to combine the various sources of imprecision. Other choices are possible and useful for applications. For example, the work of Koch [106], Moore et al. [120], and Fagin et al. [64] discuss query languages that allow predication on probabilities, e.g., is the probability of a query q greater than 0.5? Moreover, the works of Koch and Fagin et al. allow these predicates to be used compositionally within queries. In this dissertation, the answer to any query is a set of tuples annotated with probability scores. The score is exactly the marginal probability that the query is true, and we define this formally for any Boolean property:

Definition 2.2.3 (Query Semantics). Let Φ : Inst → {0, 1} be a Boolean property; then the marginal probability of Φ on a probabilistic database W = (J, µ), written µJ(Φ), is defined by

    µJ(Φ) = Σ_{W ∈ WJ : W |= Φ} µJ(W)

where we write W |= Φ if Φ(W) = 1.

A Boolean conjunctive query q is a Boolean property and so, we write µJ(q) to denote the marginal probability that q is true.

Example 2.2.4 For example, the query that asks “was the movie ’Fletch’ reviewed?” or as a

Boolean conjunctive query, q() = R(−, −, ‘Fletch’), is true when t231a or t235a are present. Definition 2.2.3 tells us that, semantically, we can compute the probability that this occurs by summing over all possible worlds. Of course, there is a simple ad hoc shortcut in this case: µ(q) = 1 − (1 − 0.7)(1 − 0.8) = 0.94.
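As a brief illustration (not from the dissertation), the sketch below encodes the Reviews instance of Figure 2.2 as a dictionary keyed by the possible worlds key, enumerates its possible worlds, and evaluates Definition 2.2.3 for this query; the encoding and function names are assumptions made for the example.

```python
from itertools import product

# Reviews(ReviewID, Reviewer ; ReviewTitle ; P) from Figure 2.2, one entry per block.
blocks = {
    (231, 'Ryan'): [('Fletch', 0.7), ('Spies Like Us', 0.3)],
    (232, 'Ryan'): [('European Vacation', 0.90), ('Fletch 2', 0.05)],
    (235, 'Ben'):  [('Fletch', 0.8), ('Wild Child', 0.2)],
}

def possible_worlds(blocks):
    """Yield (world, probability): per block pick one tuple or none, where the
    'none' option carries the leftover mass 1 - (sum of the block's P values)."""
    keys = list(blocks)
    options = []
    for k in keys:
        opts = list(blocks[k])
        leftover = 1.0 - sum(p for _, p in blocks[k])
        if leftover > 1e-9:
            opts = opts + [(None, leftover)]
        options.append(opts)
    for choice in product(*options):
        world = {k + (val,) for k, (val, _) in zip(keys, choice) if val is not None}
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield world, prob

def query_prob(blocks, phi):
    """Definition 2.2.3: sum the probabilities of the worlds that satisfy phi."""
    return sum(p for w, p in possible_worlds(blocks) if phi(w))

# q() = R(-, -, 'Fletch'): was 'Fletch' reviewed?
was_fletch_reviewed = lambda world: any(t[2] == 'Fletch' for t in world)
print(round(query_prob(blocks, was_fletch_reviewed), 3))   # 0.94 = 1 - (1-0.7)*(1-0.8)
```

The same enumeration also recovers the probability of a single world, e.g., the world {t231a, t232a, t235a} receives 0.7 ∗ 0.9 ∗ 0.8 = 0.504.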

For non-Boolean queries, we (semantically) substitute the head variables with constants in all possible ways. The result of this process is a Boolean query. For example, q(x) :- R(−, x, ‘Fletch’)

returns all people who may have reviewed the movie ‘Fletch’. The score of an answer, say that Ryan reviewed the movie, is given by the Boolean query q′() :- R(−, ‘Ryan’, ‘Fletch’). The probability of q′ is precisely the score associated with the answer Ryan that is returned to the user in the MystiQ system (as shown in Figure 1.1).

2.3 Lineage, c-tables, or Intensional evaluation

In a probabilistic database, the presence or absence of any tuple is a probabilistic event. Thus, a Boolean query – whose value may depend on many tuples in the database – defines a complex event which may contain correlations between tuples. Lineage is a mechanism that tracks these correlations by essentially recording all derivations of a query [168]; this is similar to our informal reasoning in Example 2.2.4. In this dissertation, we adopt a viewpoint of lineage based on c-tables [82, 93]: we think of lineage as a constraint that tells us which worlds are possible. This viewpoint still results in the standard possible worlds semantics for probabilistic databases that we saw in the previous section. We introduce this abstraction because it allows us to unify the presentation of the technical contributions in this work.

2.3.1 A declarative view of Lineage

The starting point for lineage is that there is a set of atoms (or base events) denoted X = x1, x2,... which are propositions about the real world, e.g., a Boolean event could encode that “Ryan reviewed Fletch”. Atoms may also be non-Boolean and then can be thought of as variables that take a value in some domain A. An EQ-term is a formula x = a where x ∈ X and a is an element of A. Informally, an EQ-term is true if x takes the value a and false otherwise. A lineage function, λ, assigns to each tuple t a Boolean expression over EQ-terms which is denoted λ(t). These Boolean expressions are precisely the conditions from which the c-tables get their name [93]. When all atoms are Boolean, we abbreviate x = true as simply x for readability.

Example 2.3.1 Figure 2.3 shows the Reviews table from the previous example (Example 2.2.1) encoded in the syntax of lineage. Here, if the atom x232 takes the value a then this implies that in review 232 Ryan reviewed ‘European Vacation’ while if x232 takes value b this means that he instead

Reviews(ReviewID, Reviewer, ReviewTitle, λ)
  231  ‘Ryan’  ‘Fletch’             x231 = a
  231  ‘Ryan’  ‘Spies Like Us’      x231 = b
  232  ‘Ryan’  ‘European Vacation’  x232 = a
  232  ‘Ryan’  ‘Fletch 2’           x232 = b
  235  ‘Ben’   ‘Fletch’             x235 = a
  235  ‘Ben’   ‘Wild Child’         x235 = b

A Probability Assignment
  p(x231 = a) = 0.7    p(x231 = b) = 0.3
  p(x232 = a) = 0.9    p(x232 = b) = 0.05    p(x232 = c) = 0.05
  p(x235 = a) = 0.8    p(x235 = b) = 0.2

Figure 2.3: The Reviews relation encoded in the syntax of Lineage.

reviewed ‘Fletch 2’. Since we insist that x232 has a single value, he could not have reviewed both. He may, however, have reviewed neither, if, for example, x232 = c.

Semantics of Lineage

To define the standard semantics of lineage, we define a possible world W through a two-stage process: First, select an assignment A for each atom in X, i.e., A : X → A. Then, for each tuple t, include t if and only if λt(A) evaluates to true. Here, λt(A) denotes the formula that results from substituting each atom x ∈ X with A(x). This process results in a unique world W for any assignment A. We denote the set of all assignments to X as Assign(X) or simply Assign.

Example 2.3.2 One assignment is x231 = a, x232 = c and x235 = a. Then the resulting possible world contains two tuples: W1 = {(231, Ryan, Fletch), (235, Ben, Fletch)}. In contrast, the world W2 = {(231, Ryan, Fletch), (231, Ryan, Spies Like Us)} is not possible as this would require that x231 = a and x231 = b.

We capture this example in a definition. Fix a relational schema. A world is a subset of tuples conforming to that relational schema. Given an assignment A and a world W, we say that W is the possible world induced by A if it contains exactly those tuples consistent with the lineage function, that is, for all tuples t, λt(A) ⇐⇒ t ∈ W. Moreover, we write λ(A, W) to denote the Boolean function that is true when W is a possible world induced by A. In symbols,

    λ(A, W) def= ( ⋀_{t : t ∈ W} λt(A) ) ∧ ( ⋀_{t : t ∉ W} ¬λt(A) )        (2.1)

We complete the construction of a probabilistic database as a distribution over possible worlds by introducing a score function. We assume that there is a function p that assigns to each atom

x ∈ X and each value a ∈ A a probability score denoted p(x = a). In Figure 2.3, p(x232 = a) has been assigned a score indicating that we are very confident that Ryan reviewed European Vacation in reviewid 232 (0.9). An important special case is when p(x = a) = 1, which indicates absolute certainty.

Definition 2.3.3. Fix a set of atoms X with domain A. A probabilistic assignment p is a function that assigns a probability score to each atom-value pair (x, a) ∈ X × A, which reflects the probability that x = a; we write p(x, a) as p(x = a) for readability. Thus, for each x ∈ X we have Σ_{a ∈ A} p(x = a) = 1. A probabilistic lineage database W is a probabilistic assignment p and a lineage function λ that represents a distribution µ over worlds defined as:

    µ(W) def= Σ_{A ∈ Assign} λ(A, W) ∏_{x,v : A(x)=v} p(x = v)

In words, W is a product space over A^X with measure p.

Since for any A, there is a unique W such that λ(A, W) = 1, µ is a probability measure. We then define the semantics of queries exactly as in Definition 2.2.3.

Example 2.3.4 Consider a simple query on the database in Figure 2.3:

q() = R(−, −, ‘Fletch’), R(−, −, ‘Fletch 2’)

This query asks if the movies ‘Fletch’ and ‘Fletch 2’ were both reviewed in our database. This query is true in a world when the following (lineage) formula holds: (x231 = a ∨ x235 = a) ∧ x232 = b.

Said another way, if we think of q as a view, then its output is a single tuple with λ(q()) = (x231 = a ∨ x235 = a) ∧ x232 = b.
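A minimal sketch of Definition 2.3.3 specialized to this single formula (the dictionary encoding of the probability assignment from Figure 2.3 is an illustrative assumption): it enumerates all assignments to the three atoms and sums the weights of those that satisfy the lineage.

```python
from itertools import product

# The probability assignment p(x = a) of Figure 2.3.
p = {
    'x231': {'a': 0.7, 'b': 0.3},
    'x232': {'a': 0.9, 'b': 0.05, 'c': 0.05},
    'x235': {'a': 0.8, 'b': 0.2},
}

def formula_prob(p, phi):
    """Sum the weight of every assignment A that satisfies the Boolean function phi."""
    atoms = list(p)
    total = 0.0
    for values in product(*(p[x].keys() for x in atoms)):
        A = dict(zip(atoms, values))
        weight = 1.0
        for x, v in A.items():
            weight *= p[x][v]
        if phi(A):
            total += weight
    return total

# Lineage of Example 2.3.4: 'Fletch' was reviewed and 'Fletch 2' was reviewed.
phi = lambda A: (A['x231'] == 'a' or A['x235'] == 'a') and A['x232'] == 'b'
print(round(formula_prob(p, phi), 4))   # 0.047 = 0.05 * (1 - 0.3*0.2)
```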

Generalizing this example, one may wonder whether we can always find a formula (or constraint) that correctly captures the semantics of the derived table. The answer is yes, which is exactly the strong representation property of Imielinski and Lipski for c-tables [93, Theorem 7.1].

2.3.2 An operational view of Lineage

In databases, new tuples are derived from old tuples in the database using queries (views). An important case is when the queries that define the views are conjunctive, as in Example 2.3.4. Recall that a conjunctive query can be computed using a relational plan P. This plan is useful to us here because we can use it to compute the lineage derived by the query in a straightforward way5: Inductively, we compute the lineage of a tuple t using a plan P, λ(P, t).

• Let P = g; then λ(P, t) = λ(t) if t matches g, and ⊥ (false) otherwise.
• Let P = π−xP′; then λ(P, t) = ⋁_{t′ : t′[var(P)] = t[var(P)]} λ(P′, t′).
• Let P = P1 ⋈ P2; then λ(P, t) = λ(P1, t1) ∧ λ(P2, t2) where ti[var(Pi)] = t[var(Pi)] for i = 1, 2.

A small sketch of these three rules in code follows.
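This sketch (not from the dissertation) assumes base tuples carry their EQ-term lineage as strings and every intermediate row pairs a variable binding with a lineage formula; it recomputes the lineage of Example 2.3.4 on the data of Figure 2.3. Function and variable names are illustrative.

```python
def scan(table, args):
    """Base rule: lambda(P, t) = lambda(t) for each base tuple matching the subgoal."""
    out = []
    for tup, lam in table:                               # base tuples carry their lineage
        binding, ok = {}, True
        for a, v in zip(args, tup):
            if a.startswith('$'):                        # '$i' marks a variable
                if binding.setdefault(a, v) != v: ok = False; break
            elif a != v:
                ok = False; break
        if ok: out.append((binding, lam))
    return out

def project_away(rows, x):
    """Projection rule: OR together the lineage of rows that agree once x is dropped."""
    grouped = {}
    for binding, lam in rows:
        key = tuple(sorted((k, v) for k, v in binding.items() if k != x))
        grouped.setdefault(key, []).append(lam)
    return [(dict(key), ' v '.join(f'({l})' for l in lams))
            for key, lams in grouped.items()]

def join(rows1, rows2):
    """Join rule: AND the lineage of the two rows being joined."""
    return [({**b1, **b2}, f'({l1}) ^ ({l2})')
            for b1, l1 in rows1 for b2, l2 in rows2
            if all(b1[k] == b2[k] for k in b1.keys() & b2.keys())]

# Reviews from Figure 2.3 with the EQ-term lineage of each base tuple.
Reviews = [((231, 'Ryan', 'Fletch'),            'x231=a'),
           ((231, 'Ryan', 'Spies Like Us'),     'x231=b'),
           ((232, 'Ryan', 'European Vacation'), 'x232=a'),
           ((232, 'Ryan', 'Fletch 2'),          'x232=b'),
           ((235, 'Ben',  'Fletch'),            'x235=a'),
           ((235, 'Ben',  'Wild Child'),        'x235=b')]

# q() :- R(-, -, 'Fletch'), R(-, -, 'Fletch 2') evaluated as a plan:
fletch  = project_away(project_away(scan(Reviews, ['$i1', '$r1', 'Fletch']),   '$i1'), '$r1')
fletch2 = project_away(project_away(scan(Reviews, ['$i2', '$r2', 'Fletch 2']), '$i2'), '$r2')
print(join(fletch, fletch2))
# One answer (the empty tuple) whose lineage is equivalent to (x231=a v x235=a) ^ x232=b.
```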

It is easy to check using any plan that is logically equivalent to q in Example 2.3.4 that we can compute the same lineage operationally using the above rules, as we computed in an ad hoc way in the example. Later, we generalize this technique of computing lineage in two ways: To compute aggregate queries, we follow Green et al. [81] and generalize lineage to compute formulas in (non-necessarily-Boolean) semirings. In an orthogonal direction, we reduce the expense of processing and understanding lineage by approximating the Boolean formulas during lineage computation, a technique called approximate lineage.

Internal Lineage An important special case of lineage is called internal lineage [145]; the intuition is that internal lineage is built up exclusively from views (or queries) inside the database. In contrast, external lineage models the situation when lineage is also created by external entities, e.g., during an Extract-Transform-Load (ETL) workflow. More precisely, a relation R is a base relation if for each tuple t in R, λ(t) consists of a single EQ-term. An internal lineage database contains base relations and views (whose lineage functions are built using the previous rules). It is interesting to note that BID tables are essentially base relations with an additional schema-like restriction: there is a possible worlds key K that satisfies the following two conditions. Let t, t′ ∈ R:
1. If t[K] = t′[K] then var(λ(t)) = var(λ(t′)) (key values are variables);
2. If λ(t) = λ(t′) then t = t′ (λ is injective).

5This computation is similar to the probabilistic relational algebra of Fuhr and Roelleke [68].

Thus, BID tables are a special case of lineage, but under very mild restrictions.

2.3.3 The role of lineage in query evaluation

A simple, but important, property of conjunctive queries (views) on BID tables is that the lineage formulas they produce can be written as k-DNF formulas6 for small values of k, i.e., k depends only on the size of the query q, but does not depend on the size of the data. More precisely,

Proposition 2.3.5. Let λ be the derived lineage for a tuple t in the output of a conjunctive query q with k subgoals. Then, λ can be written as a k-DNF formula of polynomial size.

Proof. Using the standard equivalence of relational plans, there is always a relational query plan P in which the projections induce a single connected component in the plan tree that contains the root (i.e., the projections are “on top”). Since a projection takes only a single input, this connected component is necessarily a chain. Then, the lineage for each tuple produced by the subplan P1 below this chain can be seen to be a conjunction of atomic terms: the only operator that modifies the lineage is the join (⋈), and each join introduces at most one ∧ symbol per tuple. Thus, the number of conjunctions is upper bounded by the number of joined tables, which is k. Since the chain of projections creates a single disjunction, the output formula is a k-DNF. 

The importance of this proposition is that we have reduced our problem of computing the probability that a query is true to the problem of computing the probability that a k-DNF formula is satisfied, which is a well-studied problem. Unfortunately, this problem is hard for 2-DNFs7. The good news is that this type of formula has efficient approximations, a fact that we leverage in Chapter 3. Moreover, in Chapter 4, we generalize this condition to efficiently evaluate and approximate aggregate queries, and in Chapter 6, we exploit this connection to produce small, approximate lineage formulas.
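The efficient approximation alluded to here is the Monte Carlo estimator of Luby and Karp that Chapter 3 builds on (it is also mentioned in Section 3.1). The following is a minimal sketch of that style of estimator for a monotone DNF over independent Boolean atoms only; the disjoint (BID) events handled in Chapter 3 are outside this sketch, and all names are illustrative.

```python
import random

def karp_luby(monomials, p, samples=100_000):
    """Estimate P(m_1 or ... or m_M) for a monotone DNF over independent atoms.

    `monomials` is a list of sets of atom names; p[x] = P(x = true).
    Sample a pair (i, A): pick clause i proportionally to P(m_i), then draw a full
    assignment A conditioned on m_i being true.  The indicator 'i is the first
    clause satisfied by A' has expectation P(DNF)/U with U = sum_i P(m_i)."""
    clause_prob = []
    for m in monomials:
        q = 1.0
        for x in m:
            q *= p[x]
        clause_prob.append(q)
    U = sum(clause_prob)
    if U == 0.0:
        return 0.0
    atoms, hits = list(p), 0
    for _ in range(samples):
        i = random.choices(range(len(monomials)), weights=clause_prob)[0]
        A = {x: True if x in monomials[i] else random.random() < p[x] for x in atoms}
        first = next(j for j, m in enumerate(monomials) if all(A[x] for x in m))
        if first == i:
            hits += 1
    return U * hits / samples

# A 2-DNF such as the lineage of a two-subgoal query (Proposition 2.3.5):
p = {'x1': 0.7, 'x2': 0.8, 'x3': 0.05}
dnf = [{'x1', 'x3'}, {'x2', 'x3'}]            # (x1 ^ x3) v (x2 ^ x3)
print(karp_luby(dnf, p))                      # close to 0.05 * (1 - 0.3*0.2) = 0.047
```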

6A k-DNF formula is a Boolean formula in disjunctive normal form where each monomial contains at most k literals, i.e., that can be written φ = ⋁_{i=1,...,m} ⋀_{j=1,...,k} x_{i,j} where x_{i,j} is a literal or its negation.
7It is easy to see that the query R(x), S(x, y), R(y) can produce any 2-DNF formula. Computing the probability that even a monotone 2-DNF is satisfied is #P-hard [162]. This fact was used by Grädel et al. [78] to show that computing the reliability of queries is #P-hard. This result was further sharpened by Dalvi and Suciu [46] to the precise syntactic boundary where hardness occurs.

2.4 Expressiveness of the Model

In this dissertation, we consider a data model where probabilities are listed explicitly. For example, if one has a Person table with salary and age attributes whose values are correlated probability distributions, then in our model one needs to enumerate explicitly all combinations of salary and age, e.g. (Smith, 20, 1000, 0.3), (Smith, 20, 5000, 0.1), (Smith, 40, 5000, 0.6). This allows for correlated attributes – as long as the joint distribution is represented explicitly. With the addition of views, any possible worlds distribution can be represented.

More precisely, we say that a possible worlds distribution I = (WI, µI) represents a possible worlds distribution J = (WJ, µJ) if there exists a mapping H : WI → WJ such that (1) for each relational symbol R in WJ and each W ∈ WI, R^W = R^{H(W)}, i.e., every relation in J has exactly the same content, and (2) for any world WJ ∈ WJ, we have µJ(WJ) = µI({WI ∈ WI | H(WI) = WJ}), i.e., the probability is the same8.

Proposition 2.4.1. For any possible worlds database J = (W, µ), there exists a BID database with conjunctive views whose possible worlds represent J.

Proof. Let W = {W1,..., Wn} and introduce new constants w1,..., wn for each such world. Let

A(X; W; P) be a fresh relational symbol not appearing in J. Then, let A = {(x, wi, µ(Wi)) | Wi ∈ W}. Notice that since there is only one value for the X attribute, all these events are disjoint: in any possible world the unique variable x takes a single value. For each symbol R(A) in J, introduce a new symbol R′(A, W), i.e., R′ has the same schema as R with an additional attribute W. Now, define the conjunctive view R(x) :- R′(x, w), A(w) where x has the same arity as A. For i = 1, ..., n, define the mapping H by mapping Wi to the unique world where x = wi. We can see that R^{Wi} = R^{H(Wi)} for each i, and so this distribution represents J. 
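A sketch of this construction, assuming the input distribution is given as an explicit list of (probability, world) pairs; the constants x0 and w0, w1, ... and the primed relation names follow the proof, everything else is illustrative.

```python
def represent(worlds):
    """Build the BID table A(X; W; P) and the deterministic tables R'(A, W) of
    Proposition 2.4.1 from worlds = [(prob, {rel_name: set_of_tuples}), ...].
    The view R(x) :- R'(x, w), A(w) then recovers the original distribution."""
    A = [('x0', f'w{i}', prob) for i, (prob, _) in enumerate(worlds)]  # one block => disjoint
    primed = {}
    for i, (_, world) in enumerate(worlds):
        for rel, tuples in world.items():
            primed.setdefault(rel + "'", set()).update(t + (f'w{i}',) for t in tuples)
    return A, primed

def apply_view(primed_rel, chosen_w):
    """The view, for one possible world: keep tuples tagged with the chosen w."""
    return {t[:-1] for t in primed_rel if t[-1] == chosen_w}

# A two-world distribution over a single relation R:
worlds = [(0.3, {'R': {(1, 2)}}),
          (0.7, {'R': {(1, 2), (3, 4)}})]
A, primed = represent(worlds)
print(A)                              # [('x0', 'w0', 0.3), ('x0', 'w1', 0.7)]
print(apply_view(primed["R'"], 'w1')) # {(1, 2), (3, 4)} (set order may vary)
```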

If we examine the construction, we see that the database we are constructing is an internal lineage database, since the table A is a base table. We say that a representation system is complete if it can represent any discrete possible worlds distribution. Then, we have shown:

Corollary 2.4.2. BID databases with conjunctive views and internal lineage databases are complete.

8This definition necessarily preserves correlations because the choice of H is the same for all relations.

This simple result (or a minor variation of it) has been observed in the literature many times [17, 82, 137]. This gives some intuition as to why variants of this model have been used as far back as 1992 by Barbara et al. [13] (without views), and as recently as the current MystiQ [21], Trio [168], Conquer [70], and MayBMS systems [9]: it is a simple, yet surprisingly expressive, model. There are other more succinct models that have been proposed in the literature, which we discuss in Chapter 7.

Chapter 3

QUERY-TIME TECHNIQUE I: TOP-K QUERY EVALUATION

The central observation of this chapter is that on a probabilistic database, the meaningful information is not conveyed by the exact scores in a query’s answer; instead, the meaningful information is conveyed by the ranking of those scores in the query’s answer. In many cases, the user is not interested in the scores of every tuple, but likely in only the top few highly ranked answers. With this observation in mind, we focus on efficiently finding and ordering those top tuples, and shift away from prior art that focuses exclusively on computing the exact output probabilities. In many applications, the motivation for introducing probabilities is to increase the recall of queries on the database. This increase of recall always comes at the expense of lowering precision, which forces the user to deal with a swarm of low-quality answers. Since many tuples in the query answer are of very low quality (probability), and users are often interested in seeing only the most highly ranked answers, it is sufficient to compute just the first k answers for small k. With the idea that a ranking of the top tuples suffices in many situations, we develop a novel query evaluation algorithm for computing the top-k answers that has theoretical guarantees. Additionally, we consider several optimizations that are implemented in the MystiQ system, and we evaluate their impact on MystiQ’s performance.

3.1 Motivating Scenario and Problem Definition

We illustrate the challenges that are faced by query evaluation on probabilistic databases using an application that integrates the Internet Movie Database from imdb.com with movie reviews from amazon.com. There are over 10M tuples in the integrated database. A simplified schema is shown in Figure 3.1, and we will use it as our running example in this section. Amazon products (DVDs in our case) are identified by a unique Amazon Standard Identification Number, asin, and each DVD object has several subobjects: customer reviews, actors, director, etc. The IMDB schema is self-explanatory. The value of integrating the two data sources lies in combining the detailed movie data

AMZNReviews(asin, title, customer, rating, ...)
AMZNDirector(asin, director)
AMZNActor(asin, actor)
IMDBMovie(mid, movieTitle, genre, did, year)
IMDBDirector(did, dirName)
IMDBCast(mid, aid)
IMDBActor(aid, actorName)
TitleMatch(asin; mid; P)

Figure 3.1: Schema fragment of IMDB and Amazon database, and a fuzzy match table

TitleMatch(asin; mid; P)
  t1:  a282 (“12 Monkeys”)    m897 (“Twelve Monkeys”)         0.4
  t2:  a282 (“12 Monkeys”)    m389 (“Twelve Monkeys (1995)”)  0.3
  t3:  a282 (“12 Monkeys”)    m656 (“Monk”)                   0.013
  t4:  a845 (“Monkey Love”)   m897 (“Twelve Monkeys”)         0.35
  t5:  a845 (“Monkey Love”)   m845 (“Love Story”)             0.27

t4 a845 m897 (“Twelve Monkeys”) 0.35 t5 (“Monkey Love”) m845 (“Love Story”) 0.27 Figure 3.2: Fuzzy matches stored in TitleMatch. The table stores only the asin and mid values, but we included the review title and movie title for readability.

in IMDB with customer ratings in Amazon.

From Imprecisions to Probabilities One source of imprecision in integrating the two sources is that their movie titles often do not match, e.g., Twelve Monkeys vs. 12 Monkeys or Who Done it? vs. The Three Stooges: Who Done it. The problem of detecting when two representations denote the same object has been intensively studied, and referred to as deduplication, record linkage, or merge-purge [6, 10, 32, 65, 71, 83, 89, 169, 170]. Perfect object matching is almost always impossible, and when it is possible it is often very costly, since it requires specialized, domain-specific algorithms. Our approach is to rely on existing domain-independent methods, and change the way their outputs are used. Currently, all fuzzy match methods use a thresholded similarity function approach [6], which relies on a threshold value to classify objects into matches and non-matches. This is a compromise that can lead to false positives (when the threshold value is too low) or to false negatives (when the threshold is too high). In contrast, in our approach the system retains all similarity scores and handles them as probabilistic data. We computed a similarity score between each pair of movie title and review title by comparing their sets of 3-grams: this resulted

SELECT DISTINCT d.dirName AS Director
FROM   AMZNReviews a, AMZNReviews b,
       TitleMatch ax, TitleMatch by,
       IMDBMovie x, IMDBMovie y, IMDBDirector d
WHERE  a.asin=ax.asin and b.asin=by.asin
  and  ax.mid=x.mid and by.mid=y.mid
  and  x.did=y.did and y.did=d.did
  and  x.genre='comedy' and y.genre='drama'
  and  abs(x.year - y.year) <= 5
  and  a.rating>4 and b.rating<2

q(name) D Director(name, z), AMZNReviews(x, r), AMZNReviews(x′, r′), TitleMatch(x, m), TitleMatch(x′, m′), IMDBMovie(m, ‘Comedy’, y, z), IMDBMovie(m′, ‘Drama’, y′, z), r > 4, r′ < 2, abs(y − y′) ≤ 5

Figure 3.3: Query retrieving all directors that produced both a highly rated comedy and a low rated drama less than five years apart.

in a number p between 0 and 1, which we interpret as a confidence score and store in a table called TitleMatch. Figure 3.2 shows a very simplified fragment of TitleMatch(asin; mid; P), consisting of five tuples t1, ..., t5. Each tuple contains an asin value (a review in Amazon) and a mid value (a movie in IMDB). The Amazon review with asin = a282 refers to a movie with title 12 Monkeys, and this can be one of three movies in the IMDB database: Twelve Monkeys, Twelve Monkeys (1995), or Monk. Thus, only one of the tuples t1, t2, t3 can be correct, i.e., they are exclusive, or disjoint, and their probabilities are p1 = 0.4, p2 = 0.3, and p3 = 0.013 respectively. Note that p1 + p2 + p3 ≤ 1, which is a necessary condition since the three tuples are exclusive events: we normalized the similarity scores to enforce this condition. Similarly, the movie review about Monkey Love can refer to one of two IMDB movies, with probabilities p4 = 0.35 and p5 = 0.27 respectively. We assume that any of the three matches for the first review is independent of any of the two matches for the second review. This summarizes how we mapped fuzzy object matches to probabilities. We briefly discuss other types of imprecisions in Section 3.4.1.

Challenges   Query evaluation poses two major challenges. The first is that computing the exact output probabilities is computationally hard. The data complexity for the query in Figure 3.3 is #P-complete (this can be shown using results in [46]), meaning that any algorithm that computes

the probabilities exactly essentially needs to iterate through all possible worlds. Previous work on probabilistic databases avoided this issue in several ways. Barbara et al. [13] require the SQL queries to include all keys in all tables, thus disallowing duplicate elimination. This rules out our query in Figure 3.3, because this query does not include any of the keys of the seven tables in the FROM clause. If we included all these keys in the SELECT clause, then each director in the answer would be listed multiple times, once for each pair of movies that satisfies the criterion: in our example each of the 1415 directors would occur on average 234.8 times. This makes it impossible to rank the directors. ProbView [108] computes probability intervals instead of exact probabilities. In contrast to Luby and Karp's algorithm, which can approximate the probabilities to arbitrary precision, the precision in ProbView's approach cannot be controlled. In fact, the more complex the query, the wider the approximation intervals, since ProbView's strategies have to conservatively account for a wide range of possible correlations between the input probabilities. For example, when combining the (on average) 234.8 different probabilities to compute the probability of a single director, the resulting interval degenerates to [0, 1] for most directors. One can still use this method to rank the outputs (by ordering them based on their intervals' midpoints), but this results in low precision. Fuhr [68] uses an exponential time algorithm that essentially iterates over all possible worlds that support a given answer. This is again impractical in our setting. Finally, Dalvi [46] only considers "safe" queries, while our query is not safe.

The second challenge is that the number of potential answers for which we need to compute the probabilities is large: in our example, there are 1415 such answers. Many of them have very low probability, and exist only because of some highly unlikely matches between movies and reviews. Even if the system spends a large amount of time computing all 1415 probabilities precisely, the user is likely to end up inspecting just the first few of them.

Overview of our Approach

We focus the computation on the top k answers with the highest probabilities. A naive way to find the top k probabilities is to compute all probabilities and then select the top k. Instead, we approximate probabilities only to the degree needed to guarantee that (a) the top k answers are the correct ones, and (b) the ranking of these top k answers is correct. In our running example, we will run an

Rank   Director           p
1      Woody Allen        0.9998
2      Ralph Senensky     0.715017
3      Fred Olen Ray      0.701627
4      George Cukor       0.665626
5      Stewart Raffill    0.645483
...

Figure 3.4: Top 5 query answers, out of 1415

approximation algorithm for many steps on, say, the top k = 10 answers, in order to identify and rank them, but will run only a few steps on the remaining 1415 − 10 = 1405 answers, and approximate their probabilities only as much as needed to guarantee that they are not in the top 10. This turns out to be orders of magnitude more efficient than the naive approach. The major challenge is that we do not know which tuples are in the top 10 before we know their probabilities: the solution to this is the main technical contribution in this chapter.

The problem we address here is the following: We are given a SQL query and a number k, and we have to return to the user the k highest ranked answers sorted by their output probabilities. To compute the probabilities we use Luby and Karp's Monte Carlo simulation algorithm [101] (MC), which can compute an approximation to any desired precision. A naive application of MC would be to run it a sufficiently large number of steps on each query answer and compute its probability with high precision, then sort the answers and return the top k. In contrast, we describe here an algorithm called multisimulation (MS), which concentrates the simulation steps on the top k answers, and simulates the others only a sufficient number of steps to ensure that they are not in the top k. We prove that MS is theoretically optimal in a very strong sense: it is within a factor of two of a non-deterministic optimal algorithm, which magically "knows" how many steps to simulate each answer, and no other deterministic algorithm can be better. There are three variations of MS: computing the set of top k answers, computing and sorting the set of top k answers, and an any-time algorithm, which outputs the answers in the order 1, 2, 3, ..., k, ... and can be stopped at any time. Our experiments show that MS scales gracefully with k (the running times are essentially linear in k) and that MS is dramatically more efficient than a naive application of MC.

3.1.1 Prior Art: Computing the probability of a DNF Formula

As we observed in the Preliminaries (Chapter 2), computing the probability of a tuple in the output is equivalent to computing the probability of a DNF formula. For an answer tuple t, this DNF formula is exactly λ(t), which we can compute using the rules for lineage. From Proposition 2.3.5, we know that the resulting formula λ(t) is in DNF, and so we now recall prior art that computes the probability that λ(t) is satisfied using a Monte Carlo-style algorithm. A Monte Carlo algorithm repeatedly chooses at random a possible world and computes the truth value of the Boolean expression λ(t); the probability p = µ(λ(t)) is approximated by the frequency p̃ with which λ(t) was true. Luby and Karp have described the variant shown in Algorithm 3.1.1.1, which has better guarantees than a naive MC. For our purposes the details of the Luby and Karp algorithm are not important: what is important is that, after running for N steps, the algorithm guarantees with high probability that p lies in some interval [a, b], whose width shrinks as N increases. Formally:

Theorem 3.1.1. [101] Let δ > 0 and define ε = √(4m log(2/δ)/N), where m is the number of disjuncts in λ(t) and N is the number of steps executed by the Luby and Karp algorithm. Let a = p̃ − ε and b = p̃ + ε. Then the value p belongs to [a, b] with probability ≥ 1 − δ, i.e.,1

µ(p ∈ [a, b]) ≥ 1 − δ     (3.1)
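To make the bound concrete, the following short Python sketch (our own illustration, not part of any system described here) computes the interval of Theorem 3.1.1 and shows how its width shrinks as N grows.

    import math

    def luby_karp_interval(p_hat, m, N, delta):
        # Interval [a, b] of Theorem 3.1.1 for an estimate p_hat obtained after
        # N steps on a DNF with m disjuncts, at confidence 1 - delta.
        eps = math.sqrt(4.0 * m * math.log(2.0 / delta) / N)
        return p_hat - eps, p_hat + eps

    # The width shrinks as 1/sqrt(N): e.g., with m = 10 disjuncts and delta = 0.01
    for N in (10**3, 10**4, 10**5):
        a, b = luby_karp_interval(0.5, 10, N, 0.01)
        print(N, round(b - a, 3))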

3.2 Top-k Query Evaluation using Multisimulation

We now describe our algorithm. We are given a query q and an instance J stored in a SQL database engine, and we have to compute the top k answers for q on J. Evaluation has two parts: (1) evaluating the SQL query in the engine and grouping the answer tuples to construct λ, and (2) running a Monte Carlo simulation on each group in the middleware to compute the probabilities, then returning the top k probabilities. We describe here the basic algorithm, and discuss optimizations in the next section.

1In the original paper the bound is given as |p − p̃| ≤ εp; since p ≤ 1, this implies our bound. The bound in that paper is a stronger, relative bound. This is a technical point that we return to in Chapter 4, but it is unimportant for our discussion here.

Algorithm 3.1.1.1 Luby-Karp algorithm for computing the probability of a DNF formula λ(t) = ⋁i ti, where each ti is a disjunct.

    fix an order on the disjuncts: t1, t2, ..., tm
    C := 0
    repeat N times:
        choose randomly one disjunct ti ∈ λ(t)
        choose randomly a truth assignment x with probability conditioned on "ti is satisfied"
        if tj(x) = false for all j < i then C := C + 1
    return p̃ = C/N
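For concreteness, here is a minimal Python sketch of the Karp-Luby estimator for a DNF over independent tuples (disjoint tuples would require adjusting the conditional sampling step). All names are ours, and the sketch makes explicit a normalization by the total disjunct weight that the pseudocode above leaves implicit.

    import random

    def karp_luby(disjuncts, prob, N):
        # Estimate mu(lambda) for lambda = OR of disjuncts, where each disjunct is a
        # set of independent tuple ids and prob[t] is the marginal probability of t.
        weights = []
        for d in disjuncts:
            w = 1.0
            for t in d:
                w *= prob[t]          # probability that the disjunct is satisfied
            weights.append(w)
        total = sum(weights)
        if total == 0.0:
            return 0.0
        C = 0
        for _ in range(N):
            # choose a disjunct with probability proportional to its weight
            i = random.choices(range(len(disjuncts)), weights=weights, k=1)[0]
            # sample a world conditioned on disjunct i being satisfied:
            # its tuples are forced true, all others are sampled independently
            world = set(disjuncts[i])
            for t in prob:
                if t not in world and random.random() < prob[t]:
                    world.add(t)
            # count only if no earlier disjunct is also satisfied (coverage trick)
            if all(not (disjuncts[j] <= world) for j in range(i)):
                C += 1
        return (C / N) * total

    # Example: mu((t1 AND t2) OR t3) = 1 - (1 - 0.12)(1 - 0.5) = 0.56
    prob = {'t1': 0.4, 't2': 0.3, 't3': 0.5}
    print(karp_luby([frozenset({'t1', 't2'}), frozenset({'t3'})], prob, 100000))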

3.2.1 Multisimulation (MS)

We model the problem as follows. We are given a set G = {G1, ..., Gn} of n objects, with unknown probabilities p1, ..., pn, and a number k ≤ n. Our goal is to find a set of k objects with the highest probabilities, denoted TopK ⊆ G; we discuss below how to also sort this set. The way we observe the objects' probabilities is by means of a simulation algorithm that, after running N steps on an object G, returns an approximation interval [a^N, b^N] for its probability p, with a^N < b^N (we assume a^N = b^N can never happen). We make the following four assumptions about the simulation algorithm and about the unknown probabilities:

Convergence:  lim_{N→∞} a^N = lim_{N→∞} b^N.

Precision:  ∀N. p ∈ [a^N, b^N].

Progress:  ∀N. [a^{N+1}, b^{N+1}] ⊆ [a^N, b^N].

Separation:  ∀i ≠ j, pi ≠ pj.

By the separation assumption TopK has a unique solution, i.e., there are no ties, and by the other three assumptions the solution can be found naively by a round-robin algorithm. In our setting each object G is a group of tuples representing a lineage formula λ, and its probability is p = µ(λ). The simulation algorithm is Luby-Karp. Only the first assumption (convergence) holds strictly: we revisit below how to address the other three.

[Figure: panels labeled "Example", "Empty critical region", "Case 1", "Case 2", and "Case 3", each showing the intervals of G1, ..., G5 on the probability axis from 0 to 1, with the critical region endpoints c and d marked.]

Figure 3.5: Illustration of MS; k = 2.

Intuition   Any algorithm that computes TopK can only do so by running simulations on the objects. It initializes the intervals to [a1, b1] = [a2, b2] = ... = [an, bn] = [0, 1], then repeatedly chooses to simulate some Gi for one step. At each point in the execution, object Gi has been simulated Ni steps, and thus its interval is [ai^{Ni}, bi^{Ni}] = [ai, bi] (we omit the superscript when it is clear from the context). The total number of steps over all groups is N = N1 + · · · + Nn. Consider the top left panel of Figure 3.5, where k = 2. Here we have already simulated each of the five groups for a while: clearly G3 is in the top 2 (it may be dominated only by G2), although we do not know whether it is 1st or 2nd. However, it is unclear which the other object in the top 2 is: it might be G1, G2, or G4. It is also certain that G5 is not among the top 2 (it is below G2 and G3).

Given two intervals [ai, bi] and [aj, bj], if bi ≤ aj then we say that the first is below and the second is above. We also say that the two intervals are separated: in this case we know pi < pj (even if bi = aj, due to the separation assumption). We say that a set of n intervals is k-separated if there exists a set T ⊆ G of exactly k intervals such that any interval in T is above any interval not in T. Any algorithm searching for TopK must simulate the intervals until it finds a k-separation (otherwise we can prove that TopK is not uniquely determined); in that case it outputs TopK = T. The cost of the algorithm is the number of steps N at termination.

Our gold standard will be the following nondeterministic algorithm, OPT, which is obviously optimal. OPT "knows" exactly how many steps to simulate each Gi, namely Ni^opt steps, such that the following holds: (a) the intervals [a1^{N1^opt}, b1^{N1^opt}], ..., [an^{Nn^opt}, bn^{Nn^opt}] are k-separated, and (b) the sum N^opt = Σi Ni^opt is minimal. When there are multiple optimal solutions, OPT chooses one arbitrarily. Clearly such an oracle algorithm cannot be implemented in practice. Our goal is to derive a deterministic algorithm that comes close to OPT.

Example 3.2.1   To see the difficulties, consider two objects G1, G2 and k = 1 with probabilities p1 < p2, and suppose the current intervals (say, after simulating both G1 and G2 for one step) are [a1, b1], [a2, b2] such that a1 = p1 < a2 < b1 < p2 = b2. The correct top-1 answer is G2, but we do not know this until we have separated them: all we know is p1 ∈ [a1, b1], p2 ∈ [a2, b2], and it is still possible that p2 < p1. Suppose we decide to simulate repeatedly only G2. This clearly cannot be optimal. For example, G2 may require a huge number of simulation steps before a2 increases above b1, while G1 may take only one simulation step to decrease b1 below a2: thus, by betting only on G2 we may perform arbitrarily worse than OPT, which would know to choose G1 to simulate. Symmetrically, if we bet only on G1, then there are cases when we perform much worse than OPT. Round robin seems a more reasonable strategy, i.e., we simulate G1 and G2 alternately. Here, the cost is twice that of OPT in the following case: for N steps a2 and b1 move very little, such that their relative order remains unchanged, a1 < a2 < b1 < b2. Then, at the (N + 1)-th step, b1 decreases dramatically, changing the order to a1 < b1 < a2 < b2. Round robin finishes in 2N + 1 steps. The N steps used to simulate G2 were wasted, since the changes in a2 were tiny and made no difference. Here OPT chooses to simulate only G1, and its cost is N + 1, which is almost half of round robin's. In fact, no deterministic algorithm can be better than twice the cost of OPT. However, round robin is not always a good algorithm: sometimes it can perform much worse than OPT. Consider n objects G1, ..., Gn and k = 1. Round robin may perform n times worse than OPT, since there are cases in which (as before) choosing the right object on which to bet exclusively is optimal, while round robin wastes simulation steps on all n objects, hence its cost is n · N^opt.

Notations and definitions   Given n non-negative numbers x1, x2, ..., xn, not necessarily distinct, define top_k(x1, ..., xn) to be the k-th largest value. Formally, given some permutation such that x_{i1} ≥ x_{i2} ≥ ... ≥ x_{in}, top_k is defined to be x_{ik}. We set top_{n+1} = 0.

Definition 3.2.2. The critical region, top objects, and bottom objects are:

(c, d) = (top_k(a1, ..., an), top_{k+1}(b1, ..., bn))     (3.2)

T = {Gi | d ≤ ai}

B = {Gi | bi ≤ c}

One can check that B ∩ TopK = ∅ and T ⊆ TopK: e.g., bi ≤ c implies (by the definition of c) that there are k intervals [aj, bj] above [ai, bi], which proves the first claim. Figure 3.5 illustrates four critical regions. The important property of the critical region is that the intervals have a k-separation iff the critical region is empty, i.e., c ≥ d, in which case we can return TopK = T. This is illustrated in the upper right panel of Figure 3.5, where the top 2 objects are clearly those to the right of the critical region.
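As a small illustration (ours, with hypothetical interval values), the critical region and the sets T and B can be computed directly from the current intervals:

    def critical_region(intervals, k):
        # intervals: list of (a_i, b_i) pairs, one per group G_i.
        # Returns (c, d) from Eq. (3.2) together with the index sets T and B
        # of Definition 3.2.2.
        a_sorted = sorted((a for a, b in intervals), reverse=True)
        b_sorted = sorted((b for a, b in intervals), reverse=True)
        c = a_sorted[k - 1]                                  # top_k(a_1, ..., a_n)
        d = b_sorted[k] if k < len(intervals) else 0.0       # top_{k+1}(b_1, ..., b_n)
        T = [i for i, (a, b) in enumerate(intervals) if d <= a]
        B = [i for i, (a, b) in enumerate(intervals) if b <= c]
        return (c, d), T, B

    # Example: five groups, k = 2; the critical region here is (0.55, 0.7).
    ivs = [(0.1, 0.5), (0.55, 0.9), (0.6, 0.8), (0.2, 0.7), (0.05, 0.3)]
    print(critical_region(ivs, 2))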

We therefore assume c < d from now on. Call an object Gi a crosser if [ai, bi] contains the critical region, i.e., ai ≤ c and d ≤ bi. There are always at least two crossers. Indeed, there are k + 1 intervals [ai, bi] such that d ≤ bi, and at most k − 1 of them may satisfy c < ai; hence the others (at least two) satisfy ai ≤ c and are crossers. Given a crosser [ai, bi], we call it an upper crosser if d < bi, a lower crosser if ai < c, and a double crosser if it is both. The algorithm is shown in Algorithm 3.2.1.1. At each step it picks one or two intervals to simulate, according to three cases (see Figure 3.5). First, it tries a double crosser [ai, bi]; if there is none, then it tries to find an upper crosser, lower crosser pair; if none exists, then either all crossers have the same left endpoint ai = c or all have the same right endpoint d = bi. In either case there exists a maximal crosser, i.e., one that contains all other crossers: pick one and simulate it (there may be several, since intervals may be equal). After each iteration, re-compute the critical region; when it becomes empty, stop and return the set T of intervals above the critical region. Based on our previous discussion the algorithm is clearly correct: i.e., it returns TopK when it terminates. From the convergence assumption it follows that the algorithm terminates.

Analysis   We now prove that the algorithm is optimal within a factor of two of OPT and, moreover, that no deterministic algorithm can be better.

At any point during the algorithm's execution we say that an interval [ai, bi] has slack if Ni < Ni^opt. If it has slack, then the algorithm can safely simulate it without doing worse than OPT.

Algorithm 3.2.1.1 The Multisimulation Algorithm MS TopK(G, k)   /* G = {G1, ..., Gn} */

    Let [a1, b1] = ... = [an, bn] = [0, 1], (c, d) = (0, 1)
    while c ≤ d do
        Case 1: if there exists a double crosser, simulate it one step
        Case 2: else if there exist an upper crosser and a lower crosser, simulate both one step
        Case 3: otherwise, pick a maximal crosser and simulate it one step
        Update (c, d) using Eq. (3.2)
    return TopK = T = {Gi | d ≤ ai}
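The following Python sketch mirrors Algorithm 3.2.1.1. It is our own illustration: the simulate callback stands for a per-group Luby-Karp run whose state is owned by the caller, and the toy_simulate function below is purely hypothetical.

    def ms_topk(groups, k, simulate):
        # simulate(g) runs one more simulation step on group g and returns its
        # current interval (a, b).
        iv = {g: (0.0, 1.0) for g in groups}

        def region():
            a = sorted((p[0] for p in iv.values()), reverse=True)
            b = sorted((p[1] for p in iv.values()), reverse=True)
            return a[k - 1], (b[k] if k < len(iv) else 0.0)

        c, d = region()
        while c < d:                                      # critical region non-empty
            crossers = [g for g, (a, b) in iv.items() if a <= c and d <= b]
            double = [g for g in crossers if iv[g][0] < c and d < iv[g][1]]
            upper = [g for g in crossers if d < iv[g][1]]
            lower = [g for g in crossers if iv[g][0] < c]
            if double:                                    # Case 1
                todo = [double[0]]
            elif upper and lower:                         # Case 2
                todo = [upper[0], lower[0]]
            else:                                         # Case 3: a maximal crosser
                todo = [max(crossers, key=lambda g: iv[g][1] - iv[g][0])]
            for g in todo:
                iv[g] = simulate(g)
            c, d = region()
        return [g for g, (a, b) in iv.items() if d <= a]  # the set T = TopK

    # Toy demo: intervals shrink geometrically around hidden "true" probabilities.
    truth = {'G1': 0.72, 'G2': 0.55, 'G3': 0.30, 'G4': 0.10}
    steps = {g: 0 for g in truth}
    def toy_simulate(g):
        steps[g] += 1
        w = 0.5 ** steps[g]
        return max(0.0, truth[g] - w), min(1.0, truth[g] + w)

    print(sorted(ms_topk(list(truth), 2, toy_simulate)))  # expect ['G1', 'G2']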

We have:

Lemma 3.2.3. Let [ai, bi] be a crosser. Then, in all cases below, [ai, bi] has slack: (1) If it is an upper crosser and is not in the top k. (2) If it is a lower crosser and is in the top k. (3) If it is a double crosser. (4) If it contains all crossers (i.e. it is a maximal crosser).

Proof. To see (1), note that OPT must find k intervals above i; but since [ai^{Ni}, bi^{Ni}] is an upper crosser, there are at most k − 1 values bj^{Nj} such that bj^{Nj} > bi^{Ni}; hence, OPT can find at most k − 1 intervals (namely the same ones, at most) that are above bi^{Ni}, i.e., with aj^{Nj^opt} > bi^{Ni}, because aj^{Nj^opt} < bj^{Nj} (due to the progress assumption). It follows that OPT must simulate i at least one more step than Ni, to bring bi^{Ni^opt} below bi^{Ni}, in order to separate it from the top k. (2) and (3) are similar. To prove (4), we assume that the interval i is in TopK; the other case is symmetric. Consider the k + 1 intervals that have bj ≥ d: at least one of them, say [aj, bj], must not be in TopK, and OPT must separate the two by proving that [aj, bj] is below [ai, bi]. But ai ≤ aj, because either [aj, bj] is included in [ai, bi], or [aj, bj] is not a crosser (hence ai ≤ c ≤ aj). Hence, to separate them, OPT must either reduce [aj, bj] to a point or further simulate [ai, bi]. But since we assume that an MC algorithm never returns a point interval (i.e., a^N < b^N for all N), OPT must simulate [ai, bi]. □

Theorem 3.2.4. (1) The cost of algorithm MS TopK is < 2N^opt. (2) For any deterministic algorithm computing the top k and for any c < 2, there exists an instance on which its cost is ≥ c · N^opt.

Proof. To prove (1), notice that at each step the algorithm simulates one or two intervals; it suffices to prove that at least one of them has slack2. There are three cases.

2This shows that the cost is ≤ 2N^opt; to prove that it is < 2N^opt, one notices that at least one iteration simulates a single interval, with slack.

First, a double crosser is simulated: clearly it has slack. Second, an upper and a lower crosser are simulated: for neither to have slack, one must be in the top k and the other not; but in that case OPT must simulate at least one of them, since they are not separated yet, hence one of them does have slack after all. Third, there are only upper or only lower crossers and we simulate the largest one: we have seen that this also has slack.

The main idea for (2) is in Example 3.2.1. Given a deterministic algorithm A, we construct a family of instances indexed by m > 1 (and that depend on A) such that the optimal cost is N_m^opt and A cannot do better than 2N_m^opt. The instance contains two intervals I and J. The intuition is that one of these two intervals will separate at "time" 2m, but not before. Initially, I = [0, 2/3] and J = [1/3, 1]. Suppose that the algorithm spends sI steps simulating interval I and sJ steps simulating interval J, where sI + sJ < 2m. Then the intervals shrink to I = [sI/(6m), 2/3] and, similarly, J = [1/3, 1 − sJ/(6m)]. Since sI + sJ ≤ 2m, this means that the intervals overlap for the entire time. To construct the m-th instance, let the algorithm run for m steps and shrink the intervals as above. Then choose whichever interval the algorithm has simulated least and make it collapse. More precisely, assume that this interval is I: if I was simulated least, then on the next iteration of I we set I = [sI/(6m), (sI + 1)/(6m)]. Notice that sI < m. OPT simulates only I, which has cost N_m^opt = sI < m, while A pays cost at least 2m on this instance. A symmetric argument holds if the algorithm chooses J less frequently. Notice that since the algorithm is deterministic, its choices are a function only of the intervals. □

Corollary 3.2.5. Let A be any deterministic algorithm for finding TopK. Then (a) on any instance the cost of MS TopK is at most twice the cost of A, and (b) for any c < 1 there exists an instance where the cost of A is greater than c times the cost of MS TopK.

3.2.2 Discussion

Variations and extensions   In query answering we need to compute the top k answers and to sort them. The following variation of MS, which we call MS RankK, does this. First, compute the top k set, Tk = MS TopK(G, k). Next, compute the following sets, in this sequence:

Tk−1 = MS TopKni(Tk, k − 1)

Tk−2 = MS TopKni(Tk−1, k − 2) ...

T1 = MS TopKni(T2, 1)

At each step we have a set Tj of the top j answers, and we compute the top j − 1: this also identifies the j-th ranked object. Thus, all top k objects are identified, in reverse order. Here MS TopKni denotes the algorithm MS TopK without the first line: that is, it does not initialize the intervals [ai, bi] but continues from where the previous multisimulation left off. This algorithm is also optimal: it is straightforward to prove a theorem similar to Theorem 3.2.4.

The second variation is an any-time algorithm, which computes and returns the top answers in order, without knowing k. The user can stop at any time. The algorithm starts by identifying the top element, T1 = MS TopK(G, 1), then it finds the remaining groups in decreasing order: Tj+1 = MS TopK(Bj, 1), where Bj = G − (T1 ∪ ... ∪ Tj). Note that for k > 1 this algorithm is not optimal for finding the top k elements; its advantage is its any-time nature. It also prevents the semijoin optimization discussed below, which requires knowledge of k.
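For concreteness, the driver logic of the ranking variant can be written as a short loop. This is a hedged sketch of our own: it assumes an MS TopK-like routine (here called ms_topk_ni) that, as described above, continues from the intervals of the previous call instead of re-initializing them.

    def ms_rankk(groups, k, ms_topk_ni):
        # Peel off T_k ⊇ T_{k-1} ⊇ ... ⊇ T_1 without re-initializing intervals.
        current = list(ms_topk_ni(groups, k))    # T_k: the unsorted top-k set
        ranked = []
        for j in range(k - 1, 0, -1):
            smaller = list(ms_topk_ni(current, j))           # T_j
            dropped = (set(current) - set(smaller)).pop()    # this group has rank j + 1
            ranked.append(dropped)
            current = smaller
        ranked.append(current[0])                # the single remaining group has rank 1
        ranked.reverse()                         # ranked[0] is the highest-probability group
        return ranked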

Revisiting the assumptions   Precision holds for any MC algorithm, but only in a probabilistic sense. For example, after running the Luby-Karp algorithm for N steps, µ(p ∈ [a^N, b^N]) > 1 − δ1. The choice of the confidence δ1 affects the convergence rate: b^N − a^N = 2√(4m log(2/δ1)/N), where m is the size of the group. In our context the user chooses a global parameter δ and requires that all n groups be precise with confidence δ. Assuming equal confidences, the system sets δ1 for each group to δ/n, since this implies (1 − δ1)^n ≥ 1 − δ. Still, since δ1 appears under a logarithm, we can choose very small values for δ without significantly affecting the running time (N); hence precision holds for all practical purposes. The separation assumption is more problematic, since in practice probabilities are often equal or very close to each other. Here we simply rely on a second parameter ε > 0: when the critical region becomes smaller than ε, we stop and rank the uncertain groups based on the midpoints of their intervals. Progress, as stated, does not hold for our Monte Carlo simulation technique: after a large number of steps, say n, our estimate may have jumped by a factor of n^{−1}; this means that its "confidence interval" is like n^{−1} + n^{−1/2}, which can lie outside the bounding interval. Instead, what does hold is the weaker statement that after a negligible number of steps with respect to OPT, the sequence of intervals is again nested (with high probability). This allows us to show that MS takes 2OPT + o(OPT) steps (see Appendix A.1). The choice of ε also affects running time and precision/recall. We discuss the system's sensitivity to δ and ε in Section 3.4.

Finally, note that our restriction that the intervals never collapse (i.e., a^N < b^N for all N) is important. This is always true in practice (for any MC algorithm). As a purely theoretical observation, we note here that without this assumption the proof of Lemma 3.2.3 (4) fails and, in fact, no deterministic algorithm can be within a constant factor of OPT. Consider searching for the top k = 1 of n objects; all n intervals start from the initial configuration [0, 1]. OPT picks the winner object, whose interval, after one simulation step, collapses to [1, 1]: OPT finishes in 1 step, while any deterministic algorithm must touch all n intervals at least once.

Further Improvements   One may wonder whether our adversarial model, in which intervals may shrink at arbitrary, unpredictable rates, is too strong. In theory it may be possible to design an algorithm that finds TopK by exploiting the specific rates at which the intervals shrink (see the bounds in Theorem 3.1.1). However, note that this would yield at most a factor of 2 improvement over the MS algorithm, by Corollary 3.2.5.

3.3 Optimizations

We present two optimizations: the first reduces the number of groups to be simulated using a simple pruning technique; the second reduces the sizes of the groups by pushing more of the processing from the middleware into the engine. Both techniques are provably correct, in that they are guaranteed to preserve the query's semantics.

Pruning   The following are two simple lower and upper bounds for the probability of a group λ(t) = φ1 ∨ · · · ∨ φm, for any tuple t:

    max_{i=1,...,m} µ(φi)  ≤  µ(λ(t)) = µ(φ1 ∨ · · · ∨ φm)  ≤  Σ_{i=1,...,m} µ(φi)
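As a small illustration (ours), the bounds can be computed in one line from the disjunct probabilities and used to seed a group's interval before any simulation step is run; the only extra detail is capping the upper bound at 1, since the sum may exceed it.

    def group_bounds(disjunct_probs):
        # Lower and upper bounds on mu(lambda(t)) from the marginal probabilities
        # of its disjuncts, per the displayed inequality.
        return max(disjunct_probs), min(1.0, sum(disjunct_probs))

    # Example: the group {t1, t2, t3} of Figure 3.2 with probabilities 0.4, 0.3, 0.013.
    print(group_bounds([0.4, 0.3, 0.013]))   # (0.4, 0.713)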

These bounds can be computed easily and allow us to initialize the critical region using Eq. (3.2) and to prune some groups before even starting MS. As an improvement, when there are no pairs of disjoint tuples in the group (a condition that can be checked statically), the upper bound can be tightened to 1 − Π_i (1 − µ(φi)).

Safe Subqueries   Sometimes the probabilities can be pushed to the engine, by multiplying probabilities (when the tuples are independent) or by adding them (when the tuples are disjoint). This can be achieved by running a SQL query over a subset R′ ⊆ R of the tables in the original query q, like the following (here R′ = {R1, R2, R3}):

sq = SELECT B′, AGG(R1p.p * R2p.p * R3p.p) as p
     FROM R1p, R2p, R3p
     WHERE C
     GROUP BY B′

where AGG is either sum or prod_1_1, defined by:

    sum(p1, ..., pm) = Σ_i pi
    prod_1_1(p1, ..., pm) = 1 − Π_i (1 − pi)
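In Python, the two combiners are straightforward; this is a hypothetical illustration of the arithmetic only, since a real engine would compute them as SQL aggregates as in sq above.

    from functools import reduce

    def agg_sum(ps):
        # disjoint contributing tuples: their probabilities simply add
        return sum(ps)

    def agg_prod_1_1(ps):
        # independent contributing tuples: 1 - prod_i (1 - p_i)
        return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), ps, 1.0)

    print(agg_sum([0.4, 0.3, 0.013]))   # 0.713
    print(agg_prod_1_1([0.2, 0.5]))     # 0.6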

The optimization works as follows. Given the query q, choose a subset of its tables R′ ⊆ R and some set of attributes B′ (which must include all attributes on which the relations in R′ join with the other relations). Then construct a subquery like sq above, and use it as a subexpression in q as if it were a normal BID table3, with probability given by p and with its possible-worlds key given by a certain subset S of B′.

3We will see later, in Chapter 5, a more general condition called representability that establishes the exact conditions for a table to be a BID table; these conditions are more general than the sufficient conditions we discuss here.

Three conditions must be met for this rewriting to be correct: (1) the tuple probability p computed by AGG must be correct; (2) in the output, tuples having the same value of S must be disjoint tuples and tuples having different values of S must be independent tuples; and (3) each such probability must be independent of all the other tuples in the original query that it joins with. Recall that Key(R) denotes the set of key attributes for the possible worlds for R. To check (1) we have the following:

Proposition 3.3.1. Consider the query sq above. Let Attr(R) denote the attributes of relation R (not including the p attribute, which technically belongs only to Rp), and let Attr(sq) denote the union of Attr(R) over all relations R in sq.

1. If AGG is sum, then p is computed correctly iff ∃R ∈ R′ s.t. Key(R) ⊆ B′ and Attr(sq) − B′ ⊆ Attr(R).

2. If AGG is prod_1_1, then p is computed correctly iff ∀R ∈ R′, Attr(sq) − B′ ⊆ Key(R).

To check (2) we have the following:

Proposition 3.3.2. Consider the query sq above.

1. Two output tuples having the same values of S are disjoint events iff ∃R ∈ R′ s.t. Key(R) ⊆ S and B′ − S ⊆ Attr(R).

2. Two output tuples having different values of S are independent events iff ∀R ∈ R′, B′ − S ⊆ Key(R).

Finally, to check (3), we need to check that the relations used by sq do not occur again in the rest of the query q.

Example 3.3.3 Consider three probabilistic tables:

AmazonHighReviewsp(asin, reviewer, p)
TitleMatchp(asin, imdbid, p)
IMDBHighRatedFilmsp(imdbid, p)

with possible worlds keys

Key(AmazonHighReviews) = {asin, reviewer}
Key(TitleMatch) = {asin}
Key(IMDBHighRatedFilms) = {imdbid}

Note that AmazonHighReviews and IMDBHighRatedFilms contain only independent tuples. Consider the query q:

q = TOP 5 SELECT DISTINCT A.reviewer
    FROM AmazonHighReviews A, TitleMatch T, IMDBHighRatedFilms I
    WHERE A.asin = T.asin and T.imdbid = I.imdbid

The query can be optimized by observing that the following subquery is a safe subquery:

sq = SELECT T.asin, sum(T.p * I.p) as p
     FROM TitleMatchp T, IMDBHighRatedFilmsp I
     WHERE T.imdbid = I.imdbid
     GROUP BY T.asin

The output of this subquery is a table Tmpp(asin, p) that can be treated as a base probabilistic table with possible-worlds key asin and probability attribute p. To see why, let us verify that this subquery satisfies the three conditions for a safe subquery:

• For condition (1), we use Prop. 3.3.1(1). Here B′ = {asin} and Attr(sq) = {asin, imdbid}. We see that Key(TitleMatch) ⊆ B′ and Attr(sq) − B′ ⊆ Attr(TitleMatch), so the condition is met.

• For condition (2), we use Prop. 3.3.2. Here, S = {asin}, since we are claiming that asin is the key for Tmp. Prop. 3.3.2(2) holds trivially because B′ − S = ∅. Prop. 3.3.2(1) holds because Key(TitleMatch) ⊆ S.

• Condition (3) holds because all event tables outside Tmp are distinct from those inside.

Probabilistic Table Name    #Tuples      #exclusive tuples (Avg. / Max)
MovieToAsin                 339095       4 / 13
AmazonReviews               292680       1 / 1
ActorMatch                  6758782      21 / 2541
DirectorMatch               18832        2 / 36
UsenetMatch                 134803       5 / 203
UsenetReview                3159         1 / 3159
ActivityData                2614480      3 / 10
HMM                         100          10 / 10

Figure 3.6: Three Case Studies of Imprecisions

Query    # of groups    Avg group size       Max group size       # of prob.
name     (n)            no SP      SP        no SP      SP        tables (m)
SS       33             20.4       8.4       63         26        2
SL       16             117.7      77.5      685        377       4
LS       3259           3.03       2.2       30         8         2
LL       1415           234.8      71.0      9088       226       4

Figure 3.7: Query statistics without and with S(afe) P(lan)

Having verified that the subquery is indeed safe, we rewrite the query q by making sq a subquery:

qsafe-plan = TOP 5 SELECT DISTINCT A.reviewer
             FROM AmazonHighReviews A, sq Tmp
             WHERE A.asin = Tmp.asin

Thus, the table Tmp(asin, p) is computed inside the engine and treated like a base table by MS. The rest of MS remains unchanged. The new query has the same number of groups as the original query, but each group is much smaller, since some of the probabilistic computation has been pushed into the engine.

3.4 Experiments

In this section we evaluate our approach experimentally. We address five questions: (1) what is the scale of probabilistic databases when modeling imprecisions; (2) how does our new query evaluation method compare to the current state of the art; (3) how effective is multisimulation (MS) compared to a naive approach; (4) how effective are the optimizations; and (5) how sensitive is the system's performance to the choice of δ and ε.

Setup   Our experiments were run on a dual-processor Intel Xeon 3GHz machine with 8GB of RAM and two 400GB disks. The operating system was Linux with a kernel version 2.6.12 high-mem build. The database was DB2 UDB Trial Edition, v. 8.2. Due to licensing restrictions DB2 was only able to use one of the cores. Indexes and configuration parameters such as buffer pools were tuned by hand.

Methodology   For each running time we perform the experiment 5 times, drop the highest and the lowest, and average the remaining three runs. The naive simulation method was capped at 20 minutes. In between experiments, we force the database to terminate all connections. The same experiment was not run repeatedly, to minimize caching effects, but the cache was allowed to be warm. In the precision/recall experiments, the precision and recall are defined as the fraction of the top k answers returned by the method being evaluated that overlap with the "correct" set of top k answers. In order to compute the latter we had to compute the exact tuple probabilities, which is intractable; instead, we used the approximate values returned by the simulation algorithm with very low settings for ε and δ: ε = 0.001 and δ = 0.01.

3.4.1 Case Studies

In an empirical study we modeled imprecisions in three application domains. The first integrates the IMDB movie database with reviews from Amazon, as described in simplified form in Sec. 3.1; the sources of imprecision are fuzzy object matches (for titles, actors, and directors) and the confidence in the Amazon reviews ("how many people found this review useful"). The second application integrates IMDB with reviews collected from a USENET site4. These reviews were in free text and we had to use information extraction techniques to retrieve, for each review, (a) the movie and (b) the

4ftp://ftp.uu.net/usenet/rec.arts.movies.reviews/

rating. The imprecisions here were generated by the information extraction tools. In the third application we used human activity recognition data obtained from body-worn sensors [110]. The data was first collected from eight different sensors (accelerometer, audio, IR/visible light, high-frequency light, barometric pressure, humidity, temperature, and compass) in a shoulder-mounted multi-sensor board, collected at a rate of 4 readings per second, and then classified into N = 10 classes of human activity A1, A2, ..., AN, one for each subject and each time unit. The classes were: riding elevator up or down, driving car, riding bicycle, walking stairs up or down, jogging, walking, standing, sitting. The imprecisions here come from the classification procedure, which results in a probability distribution over the N activities. Figure 3.6 shows brief summaries of the probabilistic data in each of these applications. Each application required between two and four base probabilistic tables, and between one and three SQL views for complex probabilistic correlations. In addition to the probabilistic data, IMDB had some large deterministic tables (over 400k movies, 850k actors, and 3M casts, not shown in the figure), which are part of the query processor's input in the experiments below; hence they are important for our evaluation.

3.4.2 Query Performance

We report below measurements only from the first data set (the IMDB-Amazon integration), which was the largest and richest. The processor's performance was mostly affected by two query parameters: the number of groups (denoted n in Sec. 3.2) and the average size of each group. In additional experiments (not shown) we noticed that the performance was less affected by the number of probabilistic tables in the query (m in Sec. 3.2), which roughly corresponds to the number of sources of evidence in the imprecise data. By choosing each parameter to be small (S) or large (L) we obtained four classes of queries denoted SS, SL, LS, and LL respectively; we chose one query from each class, and show their statistics in Figure 3.7. The queries are:

SS In which years did Anthony Hopkins appear in a highly rated movie? (Our system returns the top answer 2001, the year he was in Hannibal)

SL Find all actors who were in Pulp Fiction who were in two very bad movies in the five years before Pulp Fiction. (Top 2 Answers: Samuel L Jackson and Christopher Walken)

LS Find all directors who had a low rated movie between 1980 and 1989. (Top 2 Answers: Richard C. Sarafian for Gangster Wars and Tim King for Final Run)

LL Find all directors who had a low rated drama and a high rated comedy less than five years apart. (Top Answer: Woody Allen)

Unless otherwise stated, the precision and confidence parameters were ε = .01 and δ = .01, and the multisimulation algorithm run was MS RankK (Sec. 3.2.2), which finds the top k answers and sorts them.

Comparison with Other Methods   The state of the art in query evaluation on probabilistic databases is to either compute each query answer exactly, using a complete Monte Carlo simulation (we call this method naive (N)), or to approximate the probabilities using strategies [108] that ignore their correlations. The first results in much larger running times than multisimulation (MS): see Figure 3.8(a) (note the logarithmic scale); the naive method timed out for the LS and LL queries. The approximation method is much faster than MS, but results in lower precision/recall, due to the fact that it ignores correlations between imprecisions: this is shown in Figure 3.8(b). Note that, unlike a Monte Carlo simulation, where precision and recall can be improved by running longer, there is no room for further improvement in the approximate method. Note also that one of the queries (LS) flattened at around 60% precision/recall. The queries that reached 100% did so only when k reached the total number of groups; even then, the answers are much worse than they look, since their order is mostly wrong. This clearly shows that one cannot ignore correlations when modeling imprecisions in data.

Analysis of Multisimulation   The main idea of the multisimulation algorithm is that it tries to spend simulation steps only on the top k buckets. We tested experimentally how the total number of simulation steps varies with k, and in which buckets the simulation steps are spent. We show here the results for SS. Figure 3.8(c) shows the total number of simulation steps as a function of k, both for the TopK algorithm (which only finds the top k set without sorting it) and for the RankK algorithm (which finds and sorts the top k set). First, the graph clearly shows that RankK benefits from low values of k: the number of steps increases linearly with k. Second, it shows that, for

TopK, the number of steps is essentially independent of k. This is because most simulation steps are spent at the separation line between the top k and the rest. A deeper view is given by the graph in Figure 3.8(d), which shows for each group (bucket) how many simulation steps were spent, for k = 1, 5, 10, 25, and 50. For example, when k = 1 most simulation steps are spent in buckets 1 to 5 (the highest in the order of probability). The graph illustrates two interesting things: RankK correctly concentrates most simulation steps on the top k buckets, and, once k increases beyond a given bucket's number, the number of simulation steps for that bucket does not increase further. The spikes in both graphs correspond to clusters of probabilities, where MS had to spend more simulation steps to separate them. Figure 3.8(e) shows the effect of k on the measured running time of each query. As expected, the running time scales almost linearly in k. That is, the fewer answers the user requests, the faster they can be retrieved.

Effectiveness of the Optimizations   We tested both optimizations: the semijoin pruning and the safe-plan rewriting. The semijoin pruning was always effective for the queries with a large number of buckets (LS, LL), and harmless for the other two. We performed the pruning in the middleware, and the additional cost to the total running time was negligible. The safe-plan rewriting (SP) is more interesting to study, since it is highly non-trivial. Figure 3.8(a) shows significant improvements (factors of 3 to 4) in the running times when the buckets are large (SL, LL), and modest improvements in the other cases. The query time in the engine differed, since the queries issued are now different: in one case (SL) the engine time was larger. Figure 3.7 shows how the SP optimization affects the average group size: this explains the better running times.

Sensitivity to Parameters   Finally, we tested the system's sensitivity to the parameters δ and ε (see Sec. 3.2.2). Recall that the theoretical running time is O(1/ε²) and O(log(1/(nδ))). Figure 3.8(f) shows both the precision/recall and the total running time as a function of 1 − ε, for two queries, LL and LS, with k = 20, δ = 0.01, and SP turned off. The running times are normalized to that of our gold standard, 1 − ε = 0.99. As 1 − ε increases, the precision/recall quickly approaches the upper values, while the running time increases too, first slowly, then dramatically. There is a price to pay for very high precision/recall (which is what we did in all the other experiments). However, there is some room to tune 1 − ε: around 0.9 both queries have a precision/recall of 90%-100% while the running time is significantly less than the gold standard. The corresponding graphs for δ are much less interesting: the precision/recall reaches 1 very quickly, while the running time is almost

independent of δ (the graphs look almost like two horizontal lines). We can choose δ in a wide range without degrading either precision/recall or performance.

[Figure: six panels of experimental results]
(a) Running times for N(aive), MS, and S(afe) P(lan)
(b) Precision/recall for naive strategies [k = 10, ε = .01, δ = .01]
(c) Total number of simulation steps for query SS
(d) Number of simulation steps per bucket for query SS
(e) Effect of k on running time
(f) Effect of ε on precision and running time

Figure 3.8: Experimental Evaluation

Chapter 4

QUERY-TIME TECHNIQUE II: EXTENSIONAL EVALUATION FOR AGGREGATES

In traditional databases, aggregation is a key technique for summarizing a single large database instance. In probabilistic databases, we need to summarize many large instances (possible worlds). Intuitively, this suggests that aggregation may be an even more fundamental operation on probabilistic data than on standard, deterministic relational data. In this chapter, we both tackle the algorithmic challenge of evaluating (and approximating) aggregate queries on probabilistic databases and discuss the limits of any approach.

In this chapter, we study HAVING queries, which are inspired by the HAVING clause in SQL; more precisely, a HAVING query is a conjunctive Boolean query with an aggregation predicate, e.g., is the MAX greater than 10? Informally, the technical highlight of this chapter is a trichotomy result: we show that for each HAVING query Q, the complexity of evaluating Q falls into one of three categories: (1) The exact evaluation problem has P-time data complexity; in this case, we call the query safe. (2) The exact evaluation problem is #P-hard, but the approximate evaluation problem has (randomized) P-time data complexity; more precisely, there exists an FPTRAS for the query. In this case, we call the query apx-safe. (3) The exact evaluation problem is #P-hard, and the approximate evaluation problem is also hard (no FPTRAS likely exists). We call these queries hazardous.

4.1 Motivating Scenario

In SQL, aggregates come in two forms: value aggregates that are returned to the user in the SELECT clause (e.g., the MAX price) and predicate aggregates that appear in the HAVING clause (e.g., is the MAX price greater than $10.00?). In this section, we focus on positive conjunctive queries with a single predicate aggregate, which we call HAVING queries. Prior art [26, 97] has defined a semantics for value aggregation that returns the expected value of an aggregate query (e.g., the expected MAX price) and has demonstrated its utility for OLAP-style applications. In this section, we propose a

(a) Expectation Style:
    SELECT SUM(PROFIT)
    FROM PROFIT
    WHERE ITEM=‘Widget’

(b) HAVING Style:
    SELECT ITEM
    FROM PROFIT
    WHERE ITEM=‘Widget’
    HAVING SUM(PROFIT) > 0.0

Profit(Item; Forecaster, Profit; P)
    Item       Forecaster    Profit     P
    Widget     Alice         $−99K      0.99
    Widget     Bob           $100M      0.01
    Whatsit    Alice         $1M        1

Figure 4.1: A probabilistic database with a Profit relation that contains the profit an analyst forecasts for each item sold. Prior art [97] has considered a semantics similar to the query in (a), which returns the expected value of an aggregate. In contrast, we study queries similar to (b), which compute the probability of a HAVING-style predicate, e.g., that the SUM of profits exceeds a value (here, 0.0).

complementary semantics for predicate aggregates, inspired by HAVING (e.g., what is the probability that the MAX price is greater than $10.00?). We illustrate the difference between the two approaches with a simple example:

Example 4.1.1   Figure 4.1 illustrates a probabilistic database that contains a single relation, Profit. Intuitively, a tuple in Profit records the profit that one of our analysts forecasts if we continue to sell that item. We are not certain of our predictions, and so Profit records a confidence with each prediction. For example, Alice is quite sure that we will lose money if we continue selling widgets; this is captured by the tuple (Widget, Alice, $−99K, 0.99) in Profit. Intuitively, 0.99 is the marginal probability of the fact (Widget, Alice, $−99K).

An example of a value aggregate is shown in Figure 4.1(a). In this approach, the answer to an aggregation query is the expected value of the aggregate function. Using linearity of expectation, the value of the query in Figure 4.1(a) is 100M · 0.01 + (−99K) · 0.99 ≈ 900K. Intuitively, this large value suggests that we should continue selling widgets because we expect to make money. A second approach (which we propose and study in this chapter) is akin to HAVING-style aggregation in standard SQL. An example is the query in Figure 4.1(b), which intuitively asks: "What is the probability that we will make a profit?". The answer to this query is the probability that the value of the SUM is greater than 0. Here, the answer is only 0.01: this small probability tells us that we should stop selling widgets or risk going out of business.
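The contrast can be reproduced with a few lines of arithmetic. The sketch below is our own illustration; it treats the two Widget forecasts as disjoint alternatives, which is how the possible-worlds key of Profit groups them.

    # The two Widget tuples of Figure 4.1, read as disjoint alternatives:
    # (forecast profit, probability), with exactly one true in each world.
    widget = [(-99_000.0, 0.99), (100_000_000.0, 0.01)]

    expected_profit = sum(profit * p for profit, p in widget)
    prob_profitable = sum(p for profit, p in widget if profit > 0.0)

    print(round(expected_profit))   # ~ 901990: the expectation-style answer of Figure 4.1(a)
    print(prob_profitable)          # 0.01: the HAVING-style answer of Figure 4.1(b)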

Our technical starting point is the observation that we can evaluate a query q with an aggregate α on a deterministic database using a two-step process: (1) annotate the database with values from some semiring Sα, e.g., if α = COUNT, then we can take Sα to be the natural numbers; and (2) propagate these annotations during query processing (using the rules in Green et al. [81]). In this scheme, each tuple output by the query is annotated with a value in the semiring Sα that is exactly the value of the aggregate, e.g., the COUNT of the tuples returned by q. Thus, it is easy to check whether the HAVING query is true: simply test the predicate aggregate on the value returned by the query, e.g., is the SUM returned by the query greater than 0? If the answer is yes, return true.

To evaluate aggregate queries on probabilistic databases, we generalize this approach. On a probabilistic database, the output of an aggregate query Q is described by a random variable, denoted sQ, that takes values in Sα. A HAVING query Q whose predicate is, say, COUNT(∗) < k can be computed over a probabilistic database in two stages: (1) compute the distribution of the random variable sQ; and (2) apply a recovery function that computes the probability that sQ < k, i.e., sums over all satisfying values of sQ. The cost of this algorithm depends on the space required to represent the random variable sQ, which may be exponential in the size of the database. This cost is prohibitively high for many applications1. In general, this cost is unavoidable, as prior art has shown that for SELECT-PROJECT-JOIN (SPJ) queries (without HAVING), computing a query's probability is #P-complete2 [46, 78].

Although evaluating general SPJ queries on a probabilistic database is hard, there is a class of SPJ queries (called safe queries) that can be computed efficiently and exactly [46, 135]. A safe query has a relational plan P, called a safe plan, that is augmented to compute the output probability of a query by manipulating the marginal probabilities associated with tuples. The manipulations performed by the safe plan are standard multiplications and additions. These manipulations are correct because the safe plan "knows" the correlations of the tuples that the probabilities represent, e.g., the plan only multiplies probabilities when the corresponding events are independent. To generalize safe plans

1A probabilistic database represents a distribution over standard, deterministic instances, called possible worlds [63]. A probabilistic database with n tuples can encode 2^n possible worlds, i.e., one for each subset of tuples. We defer to Section 4.2.1 for more details.
2#P, defined by Valiant [162], is the class of functions that contains the problem of counting the number of solutions to NP-hard problems (e.g., #3-SAT). Formally, we mean here that there is a polynomial reduction from a #P-hard problem, and to any problem in #P, since technically the query evaluation problem itself is not in #P.

to compute HAVING queries, we provide analogous operations for semiring random variables. First, we describe marginal vectors, which are analogous to marginal probabilities: a marginal vector is a succinct, but lossy, representation of a random variable. We then show that the operation analogous to multiplying marginal probabilities is a kind of semiring convolution. Informally, we show that substituting multiplications with convolutions is correct precisely when the plan is safe.
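To give a feel for the operation, here is a minimal sketch (ours) of the convolution for the COUNT semiring, i.e., for independent count-valued variables represented by their marginal vectors; the chapter develops the general operators and characterizes exactly when substituting them for multiplication is correct.

    def convolve_count(mx, my):
        # mx[i] = P(X = i), my[j] = P(Y = j) for independent COUNT-valued X and Y;
        # the marginal vector of X + Y is their convolution over the naturals.
        out = [0.0] * (len(mx) + len(my) - 1)
        for i, px in enumerate(mx):
            for j, py in enumerate(my):
                out[i + j] += px * py
        return out

    # A tuple present with probability p contributes count 1 with prob. p, else 0.
    t1 = [0.6, 0.4]                   # P(count = 0), P(count = 1)
    t2 = [0.7, 0.3]
    print(convolve_count(t1, t2))     # [0.42, 0.46, 0.12]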

As we show, the running time of safe plans with convolutions is proportional to the number of elements in the semiring Sα. Thus, to compute HAVING queries with an aggregate α efficiently, we need Sα to be small, i.e., Sα should contain at most polynomially many elements in the size of the instance. This condition is met when the aggregate α is one of {EXISTS, MIN, MAX, COUNT}. For α ∈ {SUM, AVG, COUNT(DISTINCT)}, the condition is not met. In these cases, our algorithm is efficient only for a restricted type of safe plans that we call α-safe. For α-safe plans, a HAVING query with α can be computed efficiently and exactly. Further, we show that α-safe plans capture tractable exact evaluation for queries without self joins3. More precisely, for each aggregate α ∈ {EXISTS, MIN, MAX, COUNT, SUM, AVG, COUNT(DISTINCT)}, there is a dichotomy for queries without self joins: either (1) Q is α-safe, and so has a P-time algorithm, or (2) Q is not α-safe and evaluating Q exactly is #P-hard. Further, we can decide whether a query is α-safe in P-time.

Exact evaluation is the gold standard, but in many applications, approximately computing probabilities suffices. For example, if the input probabilities are obtained heuristically, then computing the precise value of the output probability may be overkill. Alternatively, even if the probabilities are obtained precisely, a user may not care about the difference between a query that returns a probability score of .9 versus .90001; moreover, as we argued in Chapter 3, such precision may be sufficient to rank the answers. Leveraging these observations, we show that there are some queries that can be efficiently approximated, even though they are not α-safe (and so cannot be computed exactly). More precisely, we study when there exists a Fully Polynomial Time Randomized Approximation Scheme (FPTRAS) for approximating the value of a HAVING query4. Our key result is that there is a second dichotomy for approximate evaluation for queries without self joins: either (1) an

3A self join is a join between a relation and itself. The query R(x, y), S(y) does not have a self join, but R(x, y), R(y, z) does.
4An FPTRAS can be thought of as a form of sampling that is guaranteed to converge rapidly and so is efficient. We defer to Definition 4.6.1 for formal details.

approximation scheme in this chapter can approximate a HAVING query efficiently, or (2) there is no such efficient approximation scheme. Interestingly, we show that the introduction of self joins raises the complexity of approximation: we show a stronger inapproximability result for queries involving self joins. In general, the complexity of evaluating a HAVING query Q depends on the predicate that Q uses. More precisely, the hardness depends on both the aggregate function α and the comparison function θ, which together are called an aggregate-test pair; e.g., in Figure 4.1(b) the aggregate-test pair is (SUM, >). For many such aggregate-test pairs (α, θ), we show a trichotomy result: for HAVING queries using (α, θ) without self joins over tuple-independent probabilistic databases, exactly one of the following three statements is true: (1) The exact evaluation problem has P-time data complexity; in this case we call the query safe. (2) The exact evaluation problem is #P-hard, but the approximate evaluation problem has (randomized) P-time data complexity (there exists an FPTRAS to evaluate the query); in this case, we call the query apx-safe. (3) The exact evaluation problem is #P-hard and the approximate evaluation problem is also hard (no FPTRAS exists). We call these queries hazardous. It is interesting to note that the third class is empty for EXISTS, which are essentially the class of Boolean conjunctive queries studied by prior work [46]; that is, all Boolean conjunctive queries have an efficient approximation algorithm. The approximation algorithms in this chapter are Monte-Carlo-style approximation algorithms; the key technical step such algorithms must perform is efficient sampling, i.e., randomly generating instances (called possible worlds). Computing a random possible world is straightforward in a probabilistic database: we select each tuple with its corresponding marginal probability, taking care never to select two disjoint tuples. However, to support efficient techniques like importance sampling [102], we need to do something more: we need to generate a random possible world from the set of worlds that satisfy a constraint that is specified by an aggregate query. For example, we need to generate a random world W̃ such that the MAX price returned by a query q on W̃ is equal to 30. We call this the random possible world generation problem. Our key technical result is that when q is safe (without aggregation) and the number of elements in the semiring S is small, we can solve this problem efficiently, i.e., with randomized polynomial time data complexity. The novel technical insight is that we can use safe plans as a guide to sample the database. This use is in contrast to the traditional use of safe plans for computing query probabilities exactly. We apply

our novel sampling technique to provide an FPTRAS to approximately evaluate some HAVING queries that have #P-hard exact complexity. Thus, the approaches described in this chapter can efficiently answer strictly more queries than our previous, exact approach (albeit only in an approximate sense).
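The naive world sampler mentioned above is easy to state. The sketch below (ours) draws one world from a BID-style table and is only the baseline that the constrained generation problem of this chapter goes beyond.

    import random

    def sample_world(bid_table):
        # bid_table maps each possible-worlds key to a list of (tuple, probability)
        # alternatives whose probabilities sum to at most 1. Alternatives within a
        # block are disjoint; different blocks are independent.
        world = []
        for key, alternatives in bid_table.items():
            r = random.random()
            acc = 0.0
            for tup, p in alternatives:
                acc += p
                if r < acc:
                    world.append(tup)   # at most one alternative per block
                    break               # never select two disjoint tuples
            # if r >= acc, this block contributes no tuple to the sampled world
        return world

    # Example with the two TitleMatch blocks of Figure 3.2.
    tm = {'a282': [('m897', 0.4), ('m389', 0.3), ('m656', 0.013)],
          'a845': [('m897', 0.35), ('m845', 0.27)]}
    print(sample_world(tm))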

Contributions and Outline

We study conjunctive queries with HAVING predicates on common representations of probabilistic databases [13, 137, 168], where the aggregation function is one of EXISTS, MIN, MAX, COUNT, SUM, AVG, or COUNT(DISTINCT), and the aggregate test is one of =, ≠, <, ≤, >, or ≥. In Section 4.2, we formalize HAVING queries, our choice of representation, and efficient evaluation. In Section 4.3, we review the relevant technical background (e.g., semirings and safe plans). In Section 4.4, we give our main results for exact computation: for each aggregate α, we find a class of HAVING queries, called α-safe, such that for any Q using α:

• If Q is α-safe then Q’s data complexity is in P.

• If Q has no self joins and is not α-safe, then Q has #P-hard data complexity.

• We can decide in polynomial time (in the size of Q) if Q is α-safe.

In Section 4.5, we state and solve the problem of generating a random possible world when the query defining the constraint is safe. In Section 4.6, we discuss approximation schemes for queries that have α ∈ {MIN,MAX, COUNT, SUM}. The hardness of an approximation algorithm for a HAVING query depends on the aggregate, α, but also on the predicate test, θ. We show:

• If Q is (α, θ)-apx-safe then Q has an FPTRAS.

• If Q has no self joins and is not (α, θ)-apx-safe, then Q does not have an FPTRAS and is (α, θ)-hazardous.

• We can decide in polynomial time (in the size of Q) if Q is (α, θ)-apx-safe.

(a) SQL Query:
    SELECT m.Title
    FROM MovieMatch m, Reviewer r
    WHERE m.ReviewTitle = r.ReviewTitle
    GROUP BY m.Title
    HAVING COUNT(DISTINCT r.reviewer) ≥ 2

(b) Extended Syntax (Not Boolean):
    Q(m)[COUNT(DISTINCT r) ≥ 2] D MovieMatch(t, m), Reviewer(−, r, t)

(c) Syntax of this paper:
    Q[COUNT(DISTINCT r) ≥ 2] D MovieMatch(t, ‘Fletch’), Reviewer(−, r, t)

Figure 4.2: A translation of the query “Which movies have been reviewed by at least 2 distinct reviewers?” into (a) SQL; (b) an extended syntax of this paper, which is not Boolean; and (C) the syntax of this paper, which is Boolean and is a HAVING query.

We show that the trichotomy holds for all combinations of α and θ ∈ {=, ≤, <, ≥, >}, but leave open the case of COUNT and SUM with either of {≥, >}. We also show that queries with self joins belong to a complexity class that is believed to be as hard to approximate as any problem in #P. This suggests that the complexity of HAVING query approximation is perhaps more subtle than for Boolean queries.

4.2 Formal Problem Description

We first define the syntax and semantics of HAVING queries on probabilistic databases and then define the problem of evaluating HAVING queries.

4.2.1 Semantics

We consider the aggregate functions EXISTS, MIN, MAX, COUNT, SUM, AVG, and COUNT(DISTINCT) as functions on multisets with the obvious semantics.

Definition 4.2.1. A Boolean conjunctive query is a single rule q = g1,..., gm where for i = 1,..., m, gi is a distinct5, positive extensional database predicate (EDB), that is, a relational symbol. A Boolean HAVING query is a single rule:

Q[α(y) θ k] D g1,..., gn

5Since all relational symbols are distinct, HAVING queries do not contain self joins: q = R(x, y), R(y, z) has a self join, while R(x, y), S(y) does not.

where for each i, gi is a positive EDB predicate, α ∈ {MIN, MAX, COUNT, SUM, AVG, COUNT(DISTINCT)}, y is a single variable6, θ ∈ {=, ≠, <, ≤, >, ≥}, and k is a constant. The set of variables in the body of

Q is denoted var(Q). We assume that y ∈ var(Q). The conjunctive query q = g1,..., gn, is called the skeleton of Q and is denoted sk(Q) = q. In the above syntax, θ is called the predicate test; k is called the predicate operand; and the pair (α, θ) is called an aggregate test.

Figure 4.2(a) shows a SQL query with a HAVING predicate that asks for all movies reviewed by at least two distinct reviewers. A translation of this query into an extension of our syntax is shown in Figure 4.2(b). The translated query is not a Boolean HAVING query because it has a head variable (m). In this paper, we discuss only Boolean HAVING queries. As is standard, to study the complexity of non-Boolean queries, we can substitute constants for head variables. For example, if we substitute ‘Fletch’ for m, then the result is Figure 4.2(c) which is a Boolean HAVING query.

Definition 4.2.2. Given a HAVING query Q[α(y) θ k] and a world W (a standard relational instance), we define Y to be the multiset of values v(y) where y is the distinguished variable in Q and v is a valuation of q = sk(Q) that is contained in W. In symbols,

Y = {| v(y) | v is a valuation for sk(Q) and im(v) ⊆ W |}

Here, im(v) ⊆ W denotes that the image of sk(Q) under the valuation v is contained in the world W. We say that Q is satisfied on W and write W |= Q[α(y) θ k] (or simply W |= Q) if Y ≠ ∅ and α(Y) θ k holds.

In the above definition, we follow SQL semantics and require that Y ≠ ∅ in order to say that W |= Q. For example, Q[COUNT(∗) < 10] D R(x) is false in SQL if R^W = ∅, i.e., the interpretation of R in the world W is the empty table. This, however, is a minor technicality and our results are unaffected by the alternate choice that COUNT(∗) < 10 is true on the empty database.

4.2.2 Probabilistic Databases

The data in Figure 4.3 shows an example of a BID database that stores data from integrating ex- tracted movie reviews from USENET with a movie database from IMDB. The MovieMatch table

6For COUNT, we will omit y and write the more familiar COUNT(∗) instead. 54

MovieMatch(CleanTitle, ReviewTitle ;; P)
    ‘Fletch’         ‘Fletch’         0.95   m1
    ‘Fletch’         ‘Fletch 2’       0.9    m2
    ‘Fletch 2’       ‘Fletch’         0.4    m3
    ‘The G. Child’   ‘The G. Child’   0.95   m4
    ‘The G. Child’   ‘G. Child’       0.8    m5
    ‘The G. Child’   ‘Wild Child’     0.2    m6

Reviews(RID, Reviewer ; ReviewTitle ; P)
    231   ‘Ryan’   ‘Fletch’           0.7    t231a
    231   ‘Ryan’   ‘Spies Like Us’    0.3    t231b
    232   ‘Ryan’   ‘Euro. Vacation’   0.90   t232a
    232   ‘Ryan’   ‘Fletch 2’         0.05   t232b
    235   ‘Ben’    ‘Fletch’           0.8    t235a
    235   ‘Ben’    ‘Wild Child’       0.2    t235b

Figure 4.3: Sample data arising from integrating automatically extracted reviews from a movie database. MovieMatch is a probabilistic relation, we are uncertain which review title matches with which movie in our clean database. Reviews is uncertain because it is the result of information extraction.

is uncertain because it is the result of an automatic matching procedure (or fuzzy-join [31]). For example, the probability a review title ‘Fletch’ matches a movie titled ‘Fletch’ is very high (0.95), but it is not certain (1.0) because the title is extracted from text and so may contain errors. For example, from ‘The second Fletch movie’, our extractor will likely extract just ‘Fletch’ although this review actually refers to ‘Fletch 2’. The review table is uncertain because it is the result of information extraction. That is, we have extracted the title from text (e.g., ‘Fletch is a great movie, just like Spies Like Us’). Notice that t232a[P] + t232b[P] = 0.95 < 1, which indicates that there is some probability reviewid 232 is actually not a review at all.

Query Semantics Users write queries on the possible worlds schema, i.e., their queries do not explicitly mention the probability attributes of relations. In this paper, all queries are Boolean so the answer to a query is a probability score (the marginal probability that the query is true). We define this formally:

Definition 4.2.3 (Query Semantics). The marginal probability of a HAVING query Q on BID database

J is denoted µJ(Q) (or simply µ(Q)) and is defined by:

µJ(Q) = Σ_{W ∈ WJ : W |= Q} µJ(W)

In general, for a Boolean conjunctive query q, we write µJ(q) to denote the marginal probability that q is true.

Example 4.2.4 Figure 4.2(c) shows a query that asks for all movies that were reviewed by at least

2 different reviewers. The movie ‘Fletch’ is present when the following formula is satisfied: (m1 ∧ t231a) ∨ (m2 ∧ t232b) ∨ (m1 ∧ t235a). The multiplicity of tuples returned by the query is exactly the number of disjuncts satisfied. Thus, µ(Q) is the probability that at least two of these disjuncts are true. Definition 4.2.3 tells us that, semantically, we can compute this by summing over all possible worlds.
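The following minimal Python sketch spells out Definition 4.2.3 on this example by brute force: it enumerates the 2^5 assignments of the five relevant tuples (treated as independent events with the marginal probabilities of Figure 4.3) and adds up the weight of every world in which at least two of the disjuncts above hold. It mirrors the semantics only; it is not one of the evaluation algorithms developed later.

    from itertools import product

    # Marginal probabilities of the relevant independent tuples (Figure 4.3).
    probs = {'m1': 0.95, 'm2': 0.9, 't231a': 0.7, 't232b': 0.05, 't235a': 0.8}
    names = list(probs)

    def satisfied(world):
        # Disjuncts of Example 4.2.4; the example asks that at least two hold.
        d = [world['m1'] and world['t231a'],
             world['m2'] and world['t232b'],
             world['m1'] and world['t235a']]
        return sum(d) >= 2

    mu = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for n in names:
            weight *= probs[n] if world[n] else 1.0 - probs[n]
        if satisfied(world):
            mu += weight           # sum over satisfying possible worlds
    print(mu)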

4.2.3 Notions of complexity for HAVING queries

In the database tradition, we would like to measure the data complexity [164], i.e., treat the query as fixed, but allow the data to grow. This assumption makes sense in practice because the query is generally orders of magnitude smaller than the size of the database. Hence, a running time for query evaluation of O(n^{f(|Q|)}), where |Q| is the size of a conjunctive query Q, is P-time. In our setting, this introduces a minor technical problem: By fixing a HAVING query Q, we also fix k (the predicate operand); this means that we should accept a running time of n^{f(k)} as efficient. Clearly this is undesirable because k can be large7; consider, for example, Q[SUM(y) > 200] D R(x, y). For that reason, we consider in this paper an alternative definition of the data complexity of HAVING queries, where both the database and k are part of the input.

Definition 4.2.5. Fix a skeleton q, an aggregate α, and a comparison operator θ. The query eval- uation problem is: given as input a BID representation J and a parameter k > 0, calculate µJ(Q) where Q[α(y) θ k] is such that sk(Q) = q.

The technical problem that we address in this work is the complexity of the query evaluation problem. Later, we will see that the query evaluation problem for the query in Example 4.2.4 is hard for #P, and moreover, that this is the general complexity for all HAVING queries.

7If we fix the query then k is assumed to be a constant, and so we could take even double exponential time in k. Thus, we would like to take k as part of the input.

4.3 Preliminaries

We review some basic facts about semirings (for a reference see Lang [109]). Then, we introduce random variables over semirings.

4.3.1 Background: Queries on databases with semiring annotations

In this section, we review material from Green et al. [81] that tells us how to compute queries on a database whose tuples are annotated with elements of a semiring. To get there, we need some classical definitions. A monoid is a triple (S, +, 0) where S is a set, + is an associative binary operation on S, and 0 is the identity of +, i.e., s + 0 = s for each s ∈ S. For example, S = N (the natural numbers) with addition is the canonical example of a monoid. A semiring is a structure (S, +, ·, 0, 1) where (S, +, 0) forms a commutative monoid with identity 0; (S, ·, 1) is a monoid with identity 1; · distributes over +, i.e., s · (t + u) = (s · t) + (s · u) where s, t, u ∈ S; and 0 annihilates S, i.e., 0 · s = 0 for any s ∈ S. A commutative semiring is one in which (S, ·, 1) is a commutative monoid. As is standard, we abbreviate either structure with the set S when the associated operations and distinguished constants are clear from the context. In this paper, all semirings will be commutative semirings.

Example 4.3.1 [Examples of Semirings] For an integer k ≥ 0, let Zk+1 = {0, 1,..., k}; then for every such k, (Zk+1, max, min, 0, k) is a semiring. In particular, Z2 is the Boolean semiring, denoted B.

For k = 1, 2,..., another set of semirings we consider are Sk = (Zk+1, +k, ·k, 0, 1), where +k(x, y) = min(x + y, k) and ·k(x, y) = min(xy, k), and the addition and multiplication on the right-hand sides are in Z.
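Both families of semirings in this example are easy to write down concretely; the sketch below (a Python illustration with names of our choosing) represents a commutative semiring by its two operations and identities and instantiates (Zk+1, max, min, 0, k) and Sk.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass(frozen=True)
    class Semiring:
        plus: Callable[[int, int], int]
        times: Callable[[int, int], int]
        zero: int
        one: int

    def max_min_semiring(k):
        # (Z_{k+1}, max, min, 0, k): elements {0, ..., k}
        return Semiring(plus=max, times=min, zero=0, one=k)

    def truncated_semiring(k):
        # S_k = (Z_{k+1}, +_k, ._k, 0, 1) with arithmetic truncated at k
        return Semiring(plus=lambda x, y: min(x + y, k),
                        times=lambda x, y: min(x * y, k),
                        zero=0, one=1)

    B = max_min_semiring(1)           # the Boolean semiring {0, 1}
    S3 = truncated_semiring(3)
    print(B.plus(0, 1), S3.plus(2, 2), S3.times(2, 2))   # 1, 3, 3 (both truncated at 3)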

The idea is that database elements will be annotated with elements from the semiring (defined next) and then these annotations will be propagated during query processing. For us, the important point is that aggregation queries can be viewed as doing computation in these semirings.

Definition 4.3.2. Given a commutative semiring S and a Boolean conjunctive query q = g1,..., gn, an annotation is a set of functions indexed by subgoals such that for i = 1,..., n, τgi is a function from tuples that unify with gi to S . We denote the set of annotation functions with τ. 57

Figure 4.4: (a) This is a query plan P = π−x(π−y(R(x) Z S(x, y))) for the query q = R(x), S(x, y) over some database annotated in N. The value of the query is q(W, τ) = 6. (b) This is an extensional plan (Definition 4.4.5) π^I_{−x}(π^I_{−y}(R(x) Z S(x, y))) for P. This plan is not safe, since intermediate values may be neither independent nor disjoint. Thus, the extensional value computed by this plan is not the correct marginal probability of the query. For readability, we underline elements of the semiring.

Remark 4.3.3. In the above definition, we restrict τ to assigning values to tuples that unify with gi, since gi may incorporate selections. For example, if gi = R(x, ‘a’) then τ does not need to assign values to tuples whose second component is ‘b’. Implicitly, τ should assign all such tuples 0.

We recall the syntax of relational plans as we generalize them in this work.

Definition 4.3.4 (Query Plan Syntax).

• a plan P is inductively defined as (1) a single subgoal that may include selections, (2) π−xP1

if P1 is a plan and x is a variable, and (3) P1 Z P2 if P1, P2 are plans.

• var(P), the variables output by P, is defined inductively as (1) var(g), the variables in the

subgoal g, if P = g; (2) var(π−xP) = var(P)−{x}; and (3) var(P1 Z P2) = var(P1)∪var(P2).

• goal(P), the set of subgoals in P, is defined inductively as (1) goal(g) = {g}; (2) goal(π−xP1) =

goal(P1); and (3) goal(P1 Z P2) = goal(P1) ∪ goal(P2). 58

A graphical example query plan is shown in Figure 4.4(a) along with its description in the above syntax. We view relational plans as computing relational tuples that are annotated with elements of a semiring (following Green et al. [81]). To be precise, fix a domain D, and denote the value of a plan P on a deterministic instance W as ω^W_P, which is a function D^{|var(P)|} → S. Informally, the value of a plan maps each standard tuple returned by the plan to an element of the semiring S. We define ω^W_P inductively:

• If P = g, then ω^W_P(t) = τg(t) if t ∈ W and t unifies with g; otherwise ω^W_P(t) = 0.

• If P = π−x P1, then ω^W_{π−x P1}(t) = Σ_{t′ : t′[var(P)] = t} ω^W_{P1}(t′).

• Else P = P1 Z P2; for i = 1, 2 let ti be t restricted to var(Pi); then ω^W_{P1 Z P2}(t) = ω^W_{P1}(t1) · ω^W_{P2}(t2).

An example of a plan computing a value in a semiring is shown in Figure 4.4(a). The value of the plan in the figure is 6: Since the plan is Boolean, it returns the empty tuple, which is annotated with 6; more succinctly, ω^W_P() = 6. For a standard conjunctive query q, there may be many distinct, but logically equivalent, relational plans to compute q. Green et al. [81] show that ω^W_P does not depend on the particular choice of logically equivalent plan P for q. In turn, this justifies the notation q(W, τ) for the value of a conjunctive query q on a deterministic instance W under annotation τ. Formally, we define this value as q(W, τ) = ω^W_P(), where P is any plan for q and ω^W_P is applied to the empty tuple. This notion is well defined precisely because the value of q does not depend on the choice of plan P. When τ is clear from the context, we drop it and write simply q(W) to denote the value of q on a world W.
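As an illustration of these rules, the following Python sketch evaluates q = R(x), S(x, y) in the semiring (N, +, ·, 0, 1) exactly as the inductive definition prescribes; the annotations are invented for the example, since Figure 4.4's data is not reproduced here, but they are chosen so the value works out to 6.

    # Evaluate q = R(x), S(x, y) in N by the inductive rules above.
    R = {('a',): 1, ('b',): 1}                          # tau_R (invented annotations)
    S = {('a', 'c'): 2, ('a', 'd'): 2, ('b', 'e'): 2}   # tau_S (invented annotations)

    def q_value(R, S):
        total = 0                                       # outer sum: project away x
        for (x,), rx in R.items():
            inner = 0                                   # inner sum: project away y
            for (x2, _y), sxy in S.items():
                if x2 == x:
                    inner += sxy
            total += rx * inner                         # join multiplies annotations
        return total

    print(q_value(R, S))   # 1*(2+2) + 1*2 = 6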

4.3.2 Background: Random Variables on Semirings

In this section, we extend the idea of semirings on a standard database to probabilistic databases. Intuitively, in each possible world, every tuple is annotated with a (potentially different) semiring el- ement. Hence, we think of each tuple as being associated with a semiring random variable (defined formally below). A naive representation of these random variables can be large, which motivates us 59

to define an efficient (small) representation called marginal vectors (in full analogy with marginal probabilities). In addition, we define (efficient) operations on these marginal vectors that are fully analogous with multiplying and adding marginal probabilities . In the remainder of this section, we fix a BID instance J, and denote by (W, µ) the distribution on possible worlds induced by J (Section 4.2.1).

Definition 4.3.5. Given a semiring S , an S -random variable, r, is a function r : W → S . Given two S -random variables r, t then r + t and r · t denote random variables defined in the obvious way:

(r + t)(W) = r(W) + t(W) and (r · t)(W) = r(W) · t(W)

We write r = s as a shorthand for the event that the random variable r takes value s. We denote the probability of this event as µ(r = s). More precisely, µ(r = s) = µ({W ∈ W | r(W) = s}). Two basic notions on random variables are independence and disjointness:

Definition 4.3.6. Given a semiring S and a set of random variables R = {r1,..., rn} on S, R is independent if for all N ⊆ {1,..., n} and any s1,..., sn ∈ S, we have

µ(⋀_{i∈N} ri = si) = ∏_{i∈N} µ(ri = si)

We say that R is disjoint if for any i ≠ j we have:

µ((ri ≠ 0) ∧ (rj ≠ 0)) = 0

If r and t are two disjoint random variables8, then µ(r = 0 ∧ t = 0) = µ(r = 0) + µ(t = 0) − 1. To represent a single S -random variable, we may need space as large as the number of possible worlds (|W|). This can be exponential in the size of the database J, and so is prohibitive for most applications. We now define an alternative representation called marginal vectors that have size proportional to the size of the semiring, i.e., |S |.

8A more illustrative way to write this computation is µ[r = 0 ∧ t = 0] = 1 − µ[r ≠ 0 ∨ t ≠ 0] = 1 − (1 − µ(r = 0)) − (1 − µ(t = 0)), using disjointness for the second equality.

Definition 4.3.7. Given a random variable r on S, the marginal vector (or simply, the marginal) of r is denoted m^r and is a real-valued vector indexed by S, defined by m^r[s] = µ(r = s) for all s ∈ S.

Two simple facts immediate from the definition are that m^r[s] ≥ 0 for all s ∈ S (all entries are non-negative) and Σ_{s∈S} m^r[s] = 1 (total probability). We use the notation m^r[s1,..., sk], where s1,..., sk are semiring elements, as a shorthand for the tuple of marginal probabilities (m^r[s1],..., m^r[sk]). Marginal vectors for semiring random variables are the analog of marginal probabilities for Boolean events: they are a means to write down a simple, succinct (but lossy) representation of a random variable. In the case of the Boolean semiring (i.e., B = ({0, 1}, max, min, 0, 1)), a random variable r is an event that is true (when r = 1) or false (when r = 0). Suppose that the marginal probability that r is true is pr (and so it is false with probability 1 − pr). Then, the marginal vector has two entries, one for each of the semiring elements 0 and 1:

m^r[0] = 1 − pr and m^r[1] = pr

If r and t are independent Boolean events, then their conjunction r ∧ t has marginal probability given by the simple formula µ[r ∧ t] = µ[r]µ[t]. We generalize the idea of multiplying marginal probabilities to marginal vectors of semiring elements; the resulting operation is called a monoid convolution. In full analogy, when r and t are disjoint semiring random variables, we introduce a disjoint operation that is analogous to the rule µ[r ∨ t] = µ[r] + µ[t] for disjoint Boolean events.

Definition 4.3.8. Given a monoid (S, +, 0), the monoid convolution is a binary operation on marginal vectors denoted ⊕. For any marginals m^r and m^t we define the s-entry (for s ∈ S) of m^r ⊕ m^t by the equation:

(m^r ⊕ m^t)[s] := Σ_{i, j : i + j = s} m^r[i] m^t[j]

That is, the sum ranges over all pairs of elements from the semiring S whose sum (computed in the semiring S) is exactly s. We emphasize that since the entries of the marginal vectors are in R, the arithmetic operations in the above equation are performed in R as well. The disjoint operation for (S, +, 0) is denoted m^r ` m^t and is defined by

(m^r ` m^t)[s] := m^r[s] + m^t[s]            if s ≠ 0
(m^r ` m^t)[0] := (m^r[0] + m^t[0]) − 1.

In a semiring (S, +, ·, 0, 1) we use ⊕ to mean the convolution over addition, i.e., over the monoid (S, +, 0), and ⊗ to mean the convolution over multiplication, i.e., over the monoid (S, ·, 1). Notice that the disjoint operation is always paired with + (not ·).

Example 4.3.9 Consider the Boolean semiring B and two random variables r and t taking values in B with marginal probabilities pr and pt, respectively. Then m^r = (1 − pr, pr) and m^t = (1 − pt, pt). If r and t are independent, then the distribution of r ∨ t can be computed using m^r ⊕ m^t (in B, r ∨ t = r + t).

From the definition, we see that (m^r ⊕ m^t)[0] = (1 − pr)(1 − pt) and (m^r ⊕ m^t)[1] = pr(1 − pt) + (1 − pr)pt + pr pt.

If r and t are disjoint, then m^{r+t}[1] = (m^r ` m^t)[1] = pr + pt and m^{r+t}[0] = (m^r ` m^t)[0] = 1 − m^{r+t}[1].
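A direct transcription of Definition 4.3.8 for a small carrier {0, ..., k} is shown below (Python, names ours); the printed values reproduce the two cases of Example 4.3.9.

    # Marginal-vector operations for a finite carrier S = {0, ..., k}.
    def convolution(mr, mt, op):
        """Combine marginal vectors of independent variables under monoid operation `op`."""
        k = len(mr) - 1
        out = [0.0] * (k + 1)
        for i in range(k + 1):
            for j in range(k + 1):
                out[op(i, j)] += mr[i] * mt[j]      # arithmetic over the reals
        return out

    def disjoint_op(mr, mt):
        """Combine marginal vectors of disjoint variables (addition monoid)."""
        out = [mr[s] + mt[s] for s in range(len(mr))]
        out[0] = mr[0] + mt[0] - 1.0
        return out

    # Example 4.3.9 in the Boolean semiring B = {0, 1}, where + is max:
    pr, pt = 0.4, 0.5
    mr, mt = [1 - pr, pr], [1 - pt, pt]
    print(convolution(mr, mt, max))   # [0.3, 0.7]: (1-pr)(1-pt) and pr(1-pt)+(1-pr)pt+pr*pt
    print(disjoint_op(mr, mt))        # [0.1, 0.9]: 1-(pr+pt) and pr+pt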

The next proposition restates that the two operations in the previous definition yield the correct results, and states bounds on their running time:

Proposition 4.3.10. Let r and t be random variables on the monoid (S, +, 0) with marginal vectors m^r and m^t, respectively, and let m^{r+t} denote the marginal of r + t. If r and t are independent then m^{r+t} = m^r ⊕ m^t. If r and t are disjoint then m^{r+t} = m^r ` m^t. Further, the convolution is associative, so the convolution of n variables r1,..., rn,

⊕_{i=1,...,n} m^{ri} := m^{r1} ⊕ · · · ⊕ m^{rn},

can be computed in time O(n |S|^2), and the disjoint operation applied to r1,..., rn, denoted `_{i=1,...,n} m^{ri} := m^{r1} ` · · · ` m^{rn}, can be computed in time O(n |S|).

Proof. We include the proof for the convolution since it is illustrative. We assume that m^x[i] = µ(x = i) for x ∈ {r, t} and i ∈ S, i.e., the marginal vectors are correct, and that r and t are independent. We show that (m^r ⊕ m^t)[s] = µ(r + t = s). Since s ∈ S is arbitrary, this proves the correctness claim.

(m^r ⊕ m^t)[s] = Σ_{i, j ∈ S : i + j = s} m^r[i] m^t[j]
             = Σ_{i, j ∈ S : i + j = s} µ(r = i) µ(t = j)
             = Σ_{i, j ∈ S : i + j = s} µ(r = i ∧ t = j)
             = µ(r + t = s) = m^{r+t}[s]

The first equality is the definition. The second equality is by the assumption that the marginal vectors are correct. The third line is by the independence assumption. The final line is because the sum is exhaustive. To see the time bound, observe that we can simply consider all |S|^2 pairs to compute the convolution (assuming each semiring operation has unit cost). Since the semiring addition is associative, so is the convolution. This also means that we can compute the n-fold convolutions pairwise. 

The importance of this proposition is that if the number of elements in the semiring is small, then each operation can be done efficiently. We will use this proposition as the basis of our efficient exact algorithms.

4.4 Approaches for HAVING

We define α-safe HAVING queries for α ∈ { EXISTS, MIN, MAX, COUNT} in Section 4.4.3, for α = COUNT(DISTINCT) in Section 4.4.4, and α ∈ {AVG, SUM} in Section 4.4.5.

4.4.1 Aggregates and semirings

We explain how to compute HAVING queries using semirings on deterministic databases, which we then generalize to probabilistic databases. Since HAVING queries are Boolean, we use a function ρ : S → {true, false}, called the recovery function, that maps a semiring value s to true if that value satisfies the predicate in the HAVING query Q, e.g., when checking COUNT(∗) ≥ 4, ρ(4) is true, but ρ(3) is false. Figure 4.5 lists the semirings for the aggregates in this paper, their associated

HAVING Predicate      Semiring           Annotation τg∗(t)                          Recovery ρ(s)

EXISTS                (Z2, max, min)     1                                          s = 1
MIN(y) {<, ≤} k       (Z3, max, min)     if t θ k then 2 else 1                     s = 2
MIN(y) {>, ≥} k       (Z3, max, min)     if t θ k then 1 else 2                     s = 1
MIN(y) {=, ≠} k       (Z4, max, min)     if t < k then 3 else                       if = then s = 2
                                         if t = k then 2 else 1                     if ≠ then s ≠ 2
COUNT(∗) θ k          Sk+1               1                                          (s ≠ 0) ∧ (s θ k)
SUM(y) θ k            Sk+1               t[y]                                       (s ≠ 0) ∧ (s θ k)

Figure 4.5: Semirings for the operators MIN, COUNT and SUM. Let g∗ be the lowest indexed subgoal that contains y. For all g ≠ g∗ and all t, τg(t) equals the multiplicative identity of the semiring. Let Zk+1 = {0, 1,..., k}, +k(x, y) = min(x + y, k), and ·k(x, y) = min(xy, k), where x, y ∈ Z. Let Sk = (Zk+1, +k, ·k, 0, 1). MAX and MIN are symmetric. COUNT(DISTINCT) is omitted because it uses two different algebras together. One important point to note is that, in the case of SUM, if t[y] is larger than the largest element of the semiring then τ(t) is set to that largest element. Once this value is present it forces the value of the predicate θ, e.g., if θ is ≥ then the predicate is trivially satisfied.

annotation functions τ, and an associated Boolean recovery function ρ. The aggregation function EXISTS essentially yields the safe plan algebra of Dalvi and Suciu [46, 48, 135].

Example 4.4.1 Consider the query Q[MIN(y) > 10] D R(y) where R = {t1,..., tn} is a tuple-independent database. Figure 4.5 tells us that we should use the semiring (Z3, max, min). We first apply τ: τ(ti) = 1 represents that ti[y] > 10, while τ(ti) = 2 represents that ti[y] ≤ 10. Let qτ = Σ_{i=1,...,n} τ(ti); the sum is in S, and so qτ = max_{i=1,...,n} τ(ti). Now, ρ(qτ) is satisfied only when qτ is 1. In turn, this occurs if and only if all ti[y] are greater than 10, as required.

A careful reader may have noticed that we could have used Z2 to compute this example (instead of Z3). When we generalize to probabilistic databases, we may have to account for a tuple being absent (for which we use the value 0).
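Putting the pieces together, the sketch below runs Example 4.4.1 end to end on a toy tuple-independent relation (our own numbers): annotate each tuple per Figure 4.5, fold the leaf marginal vectors with the convolution over (Z3, max, 0), and read off the entry selected by the recovery function.

    # Q[MIN(y) > 10] :- R(y) on a tuple-independent R, via the semiring (Z_3, max, min).
    R = [(12, 0.5), (15, 0.8), (7, 0.1)]     # (y value, marginal probability); toy data

    def tau(y):                              # MIN with {>, >=}: 1 if y > 10 else 2
        return 1 if y > 10 else 2

    def leaf_vector(y, p):                   # m[0] = 1 - p, m[tau(y)] = p
        m = [0.0, 0.0, 0.0]
        m[0] = 1 - p
        m[tau(y)] = p
        return m

    def convolve_max(mr, mt):                # convolution over the monoid (Z_3, max, 0)
        out = [0.0, 0.0, 0.0]
        for i in range(3):
            for j in range(3):
                out[max(i, j)] += mr[i] * mt[j]
        return out

    m = [1.0, 0.0, 0.0]                      # identity: value 0 with probability 1
    for y, p in R:
        m = convolve_max(m, leaf_vector(y, p))
    print(m[1])                              # recovery rho: s = 1, i.e. mu(Q)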

More generally, we have the following proposition:

Proposition 4.4.2. Given a HAVING query Q, let q = sk(Q) and S, ρ, and τ be chosen as in Figure 4.5; then for any deterministic instance W:

W |= Q ⇐⇒ ρ (q(W, τ))

Proof. Let q, the skeleton of Q, have n subgoals. We show only MIN with ≤ in full detail. All other aggregate-test pairs follow by similar arguments. We observe the equation

q(W, τ) = Σ_{v : im(v) ⊆ W} ∏_{i=1,...,n} τ_{gi}(v(gi))

Further, W |= Q[MIN(y) ≤ k] if and only if there is some valuation v such that ∏_{i=1,...,n} τ_{gi}(v(gi)) = 2. Since 2 + s = 2 for any s ∈ S, the existence of such a valuation implies q(W, τ) = 2. Conversely, if q(W, τ) = 2 then there must be some such valuation, since x + y = 2 implies that either x or y is 2 in this semiring. Hence, the claim holds. Similarly, for W |= Q[MIN(y) ≥ k], the query is satisfied if and only if all elements are ≥ k, and so each term (valuation) in the summation must evaluate to 0 or 1. Similar arguments hold for = and ≠. In the case of COUNT, if we want to count from 1,..., k we also need two elements, 0 and k + 1: 0 encodes that a tuple is absent and k + 1 encodes that the value is “bigger than k”. 

In probabilistic databases, we view q(W, τ) as a random variable by fixing τ (the semiring annotation functions), i.e., we view q(W, τ) as a function of W alone. We denote this random variable qτ. Our goal is to compute the marginal vector of qτ. The marginal vector of qτ, denoted m^{qτ}, is sufficient to compute the value of any HAVING query, since we can simply examine those entries in m^{qτ} for which the recovery function, ρ, is true. Said another way, a simple corollary of Prop. 4.4.2 is the following generalization to probabilistic databases:

Corollary 4.4.3. Given a HAVING query Q, let q = sk(Q),S, ρ, and τ be as in Prop. 4.4.2, then for any BID instance J we have the following equalities:

µJ(Q) = Σ_{k : ρ(k) is true} m^{qτ}[k]

Cor. 4.4.3 tells us that we can compute µ(Q) by examining the entries of the marginal vector

mqτ . Hence, our goal is to compute mqτ [s] for each such index, s ∈ S .

4.4.2 Computing safely in semirings

We now extend safe plans to compute a marginal vector instead of a Boolean value. Specifically, we compute m^{qτ}, the marginal vector for qτ, using the operations defined in Section 4.3.2.

Definition 4.4.4. An extensional plan for a Boolean conjunctive query q is defined recursively as a subgoal g, and if P1, P2 are extensional plans then so are π^I_{−x}P1 (independent project), π^D_{−x}P1 (disjoint project), and P1 Z P2 (join). An extensional plan P is safe if, assuming P1 and P2 are safe, the following conditions are met:

• P = g is always safe

• P = π^I_{−x}P1 is safe if x ∈ var(P1) and for all g ∈ goal(P1), x ∈ key(g)

• P = π^D_{−x}P1 is safe if x ∈ var(P1) and there exists g ∈ goal(P1) such that key(g) ⊆ var(P) and x ∈ var(g).

• P = P1 Z P2 is safe if goal(P1) ∩ goal(P2) = ∅ and for i = 1, 2, var(goal(P1)) ∩ var(goal(P2)) ⊆ var(Pi), i.e., we may not project away variables that are shared in two subgoals before they are joined.

An extensional plan P is a safe plan for q if P is safe and goal(P) = q and var(P) = ∅.

Intuitively, a safe plan tells us that the tuples produced by intermediate stages of the plan are either independent or disjoint, as opposed to correlated in some unknown way. In particular, P = π^I_{−x}(P1) is a safe plan whenever the tuples produced by P1 on any instance are independent (provided the tuples differ on the variable x). Hence, we call π^I an independent project. Similarly, if P = π^D_{−x}(P1) is safe, then the tuples produced by P1 are disjoint whenever they differ on the variable x. Further, a join is safe if the branches do not contain any common subgoals, i.e., any tuple produced by P1 is independent of any tuple produced by P2. For completeness, we state and prove a formal version of this discussion in Appendix B.1.

Computing With Safe Plans

We now augment safe plans to compute marginal vectors. Intuitively, we generalize the operation of multiplying marginal probabilities (as done in safe plans) to semiring convolutions of marginal vectors, and we generalize the operation of adding the marginal probabilities of disjoint events to disjoint operations on marginal vectors. We think of a plan as computing a marginal vector: The marginal vector computed by a plan P on a BID instance J is called the extensional value of P on J, is denoted ω̂^J_{P,S}, and is defined below.

Definition 4.4.5. Given a BID instance J and a semiring S, let P be a safe plan. Denote the extensional value of P in S on J as ω̂^J_{P,S}; ω̂^J_{P,S} is a function that maps each tuple to a marginal vector. To emphasize the recursion, we fix J and S and denote ω̂^J_{P,S} as ω̂_P. We define the value of ω̂_P inductively:

• If P = g then ω̂_P(t) = m^t, where m^t[0] = 1 − t[P], m^t[τg(t)] = t[P], and all other entries are 0.

• If P = π^I_{−x}P1 then ω̂_P(t) = ⊕_{t′ : t′[var(P)] = t} ω̂_{P1}(t′), where ⊕ denotes the convolution over the monoid (S, +, 0).

• If P = π^D_{−x}P1 then ω̂_P(t) = `_{t′ : t′[var(P)] = t} ω̂_{P1}(t′), where ` denotes the disjoint operation over the monoid (S, +, 0).

• If P = P1 Z P2 then ω̂_P(t) = ω̂_{P1}(t1) ⊗ ω̂_{P2}(t2), where for i = 1, 2, ti is t restricted to var(Pi) and ⊗ denotes the convolution over the monoid (S, ·, 1).

Figure 4.4(b) gives an example of computing the extensional value of a plan: The plan shown is not safe, meaning that the extensional value it computes is not correct, i.e., equal to mqτ . This illustrates that any plan may be converted to an extensional plan, but we need additional conditions 67

(safety) to ensure that the computation is correct. Interestingly, in this case, there is an alternate safe plan: P0 = π−x(R(x) Z π−y(S (x, y))), i.e., we move the projection early. The next lemma states that for safe plans, the extensional value is computed correctly, i.e., the conditions insured by the safe plan and the operator used in Definition 4.4.5 make exactly the same correlation assumptions. For example, πI indicates independence, which ensures that ⊕ correctly combines two input marginal vectors. The proof of the following lemma is a straightforward induc- tion and is omitted.

Lemma 4.4.6. If P is a safe plan for a Boolean query q and τ is any annotation function into S, then for any si ∈ S and any BID instance J, we have ω̂^J_P()[si] = µJ(qτ = si).

A safe plan (in the terminology of this paper) ensures that the convolutions and disjoint opera- tions output the correct results, but it is not sufficient to ensure that the plan is efficient. In particular, the operations in a safe plan on S take time (and space) polynomial in |S |. Thus, if the size of S grows super-polynomially in |J|, the size of the BID instance, the plan will not be efficient. As we will see, this rapid growth happens for SUM in most cases. In contrast, as we show in the next section, if α is one of MIN, MAX, or COUNT, the number of elements in the needed semiring is small enough, so the safety of sk(Q) and Q coincide.

4.4.3 EXISTS-,MIN-, MAX- and COUNT-safe

We now give optimal algorithms when α is one of EXISTS, MIN, MAX, or COUNT. The results on EXISTS are exactly the results of Dalvi and Suciu [46]. We include them to clarify our generaliza- tion.

Definition 4.4.7. Let α be one of {EXISTS, MIN, MAX, COUNT} and let Q[α(y) θ k] be a HAVING query; then Q is α-safe if the skeleton of Q is safe.

Theorem 4.4.8. Let Q[α(y) θ k] be a HAVING query for α ∈ { EXISTS, MIN, MAX, COUNT} such that Q is α-safe then the exact evaluation problem for Q is in polynomial time in the size of the data.

Correctness is straightforward from Lemma 4.4.6. Efficiency follows because the semiring is of constant size for EXISTS, MIN, and MAX. For COUNT, observe that an upper bound on |S| is the number of tuples returned by the query plus one (for the empty multiset), so |S| is polynomially bounded as well. Thus, the entire plan has polynomial time data complexity.

Complexity

The results of Dalvi and Suciu [46, 48, 135] show that a conjunctive query without self joins either has a safe plan or is #P-hard. The idea is to show that a HAVING query Q is satisfied only if sk(Q) is satisfied, which implies that computing Q is at least as hard as computing sk(Q). Formally, we have:

Theorem 4.4.9 (Exact Dichotomy for MIN, MAX, and COUNT). If α ∈ {MIN, MAX, COUNT} and Q[α(y) θ k] does not contain self joins, then either (1) Q is α-safe and so Q has data complexity in P, or (2) Q has #P-hard data complexity. Further, we can find an α-safe plan in P.

Proof. The first part of the dichotomy is Theorem 4.4.8. We show the matching negative result. Consider the predicate test MIN(y) ≥ 1. Assuming that Q is not MIN-safe, we have (by the above) that sk(Q) = q is not safe in the sense of Dalvi and Suciu; we show that Q can be used to compute µ[q] on a BID instance J. To see this, create a new instance J′ that contains exactly the same tuples as J, but recode all values in attributes referenced by y as integers greater than 1: then Q is true precisely when at least one tuple exists, and hence µ(Q) = µ[q]. We show below that this is sufficient to imply that all tests θ are hard as well. The proof for MAX is symmetric. COUNT is similar. 

Lemma 4.4.10. Let α ∈ {MIN, MAX, COUNT, SUM, COUNT(DISTINCT)}; if computing Q[α(y) = k] exactly is #P-hard, then it is #P-hard for all θ ∈ Θ. Furthermore, if qτ takes at most polynomially many values then the converse also holds: if computing Q[α(y) θ k] exactly is #P-hard for any θ ∈ Θ, then it is #P-hard for all θ ∈ Θ.

Proof. We first observe that all aggregate functions α in the statement are positive, integer-valued functions. We show that we can use ≤, ≥, >, <, ≠ as a black box to compute = efficiently. We then show that we can compute the inequalities in time O(k) (using =), thus proving both parts of the claim.

First, observe that q(W, τ) = s is a function on worlds, i.e., the events are disjoint for different values of s. Hence,

µ(Q[α(y) ≤ k]) = Σ_{k′ ≤ k} µ(Q[α(y) = k′])

From this equation it follows that we can compute any inequality using = in time proportional to the number of possible values. To see the forward direction, we compute

µ(Q[α(y) ≤ k]) − µ(Q[α(y) ≤ k − 1]) = µ(Q[α(y) = k])

and similarly for a strict inequality. And, 1 − µ(Q[α(y) ≠ k]) − (1 − µ(Q[α(y) ≠ 0])) = µ(Q[α(y) = k]). The ≠ 0 term is only necessary with SQL semantics. 

The exact #P-hardness proofs in the remainder of this section satisfy the requirement of this lemma. Interestingly, this lemma does not hold for approximation hardness.

4.4.4 COUNT(DISTINCT)-safe queries

Intuitively, we compute COUNT(DISTINCT) in two stages: (1) For the subplan rooted at π−y, we first compute the probability that each value is returned by the plan (i.e., we compute the DISTINCT part using EXISTS). (2) Then, since we have removed duplicates implicitly using EXISTS, we count the number of distinct values using the COUNT algorithm from Section 4.4.3. The ordering of the operators, first EXISTS and then COUNT, is important. As we show in Theorem 4.4.16, this ordering exactly captures tractable evaluation. First, we need a small technical proposition to state our characterization:

Proposition 4.4.11. If P is a safe plan for q, then for x ∈ var(q) there is exactly one of π^I_{−x} or π^D_{−x} in P.

Proof. At least one of the two projections must be present, because we must remove the variable x (q is Boolean). If there were more than one in the plan, then they cannot be descendants of each other, because x ∉ var(P1) for the ancestor, and they cannot be joined afterward because of the join condition that for i = 1, 2, var(goal(P1)) ∩ var(goal(P2)) ⊆ var(Pi). 

Thus, it makes sense to talk about the unique node in the plan tree where a variable x is removed, as we do in the next definition:

Definition 4.4.12. A query Q[COUNT(DISTINCT y) θ k] is COUNT(DISTINCT)-safe if there is a safe plan P for the skeleton of Q such that if P1 is the unique node in the plan where y is removed, i.e., either π^I_{−y} or π^D_{−y} in P, then no proper ancestor of P1 is π^I_{−x} for any x.

This definition exactly insists on the ordering of operators that we highlighted above.

Example 4.4.13 Fix a BID instance J. Consider

Q[COUNT(DISTINCT y) ≥ 2] D R(y, x), S (y)

A COUNT(DISTINCT)-safe plan for the skeleton of Q is P = π^I_{−y}((π^I_{−x}R(y, x)) Z S(y)). The subquery P1 = (π^I_{−x}R(y, x)) Z S(y) returns tuples (values for y). We use the EXISTS algebra to compute the probability that each distinct value appears. Now, we must count the number of distinct values: Since we have eliminated duplicates, all y values are trivially distinct and we can use the COUNT algebra. To do this, we map each EXISTS marginal vector to a vector suitable for computing COUNT, i.e., a vector indexed by Zk+1 (here k = 2). In other words, (1 − p, p) = ω̂^J_{P1,EXISTS}(t) = m^t is mapped to τ̂(m^t) = (1 − p, p, 0). In general, this vector would be of length k + 1. Since P = π^I_{−y}P1, we know that all tuples returned by P1 are independent. Thus, the correct distribution is given by the convolution over all such tuples t′, each one corresponding to a distinct y value, i.e., ⊕_{t′} τ̂(m^{t′}). To compute the final result, we use the recovery function ρ defined by ρ(s) = (s ≥ 2).
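The two stages of this example can be sketched as follows (Python, toy numbers of our own); stage (1), the EXISTS probabilities of the distinct y values, is assumed to have been computed by the safe subplan P1, and stage (2) folds them with the COUNT algebra truncated at k = 2.

    # Q[COUNT(DISTINCT y) >= 2]: stage (2) of Example 4.4.13.
    # p_y: EXISTS probability of each distinct y value (assumed computed by P1).
    p_y = {'Fletch': 0.9, 'Fletch 2': 0.4, 'The G. Child': 0.7}
    k = 2                                        # predicate operand

    def convolve_trunc(mr, mt):                  # convolution over (Z_{k+1}, +_k, 0)
        out = [0.0] * (k + 1)
        for i in range(k + 1):
            for j in range(k + 1):
                out[min(i + j, k)] += mr[i] * mt[j]
        return out

    m = [1.0] + [0.0] * k                        # identity: count 0 with probability 1
    for p in p_y.values():
        m = convolve_trunc(m, [1 - p, p, 0.0])   # tau_hat maps (1-p, p) to (1-p, p, 0)
    print(m[k])                                  # rho(s): s >= 2, read off the truncated entry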

The proof of the following theorem is a generalization of Example 4.4.13, whose proof we include in the appendix (Appendix B.2):

Theorem 4.4.14. If Q is COUNT(DISTINCT)-safe then its evaluation problem is P-time.

Complexity We now establish that for COUNT(DISTINCT) queries without self joins, COUNT(DISTINCT)-safe captures efficient computation. We do this in two stages: first, we exhibit some canonical hard patterns for COUNT(DISTINCT), and second, in the appendix, we reduce any other non-COUNT(DISTINCT)-safe pattern to one of these hard patterns.

Proposition 4.4.15. The following HAVING queries are #P-hard for i = 1, 2,... :

Q1[COUNT(DISTINCT y) θ k] D R(x), S(x, y) and,

Q2,i[COUNT(DISTINCT y) θ k] D R1(x; y),..., Ri(x; y)

Proof. We prove that Q1 is hard and defer Q2,i to Appendix B.2. To see that Q1 is hard, we reduce from counting the number of independent sets in a graph (V, E), which is #P-hard. We let k be the number of edges (|E|) and θ = ‘≥’. Intuitively, with these choices Q1 will be satisfied only when all edges are present. For each node u ∈ V, create a tuple R(u) with probability 0.5. For each edge e = (u, v), create two tuples S(u, e), S(v, e), each with probability 1. For any set V′ ⊆ V, let W_{V′} denote the world where the tuples corresponding to V′ are present. For any subset of nodes N, we show that N is an independent set if and only if W_{V−N} satisfies Q1, i.e., all edges are present in its node-complement. Since f(N) = V − N is one-to-one, the number of possible worlds that satisfy Q1 is exactly the number of independent sets, thus completing the reduction. Now, if N is an independent set, then for any edge (u, v), it must be the case that at least one of u or v is in V − N, else the set would not be independent, since it would contain an induced edge. Thus, every edge is present and Q1 is satisfied. If N is not independent, then there must be some edge (u, v) such that u, v ∈ N, hence neither of u, v is in V − N. Since this edge is missing, Q1 cannot be satisfied. This completes the reduction. The hardness of Q2,i is based on a reduction from counting the set covers of a fixed size and is in the appendix. 

Some work is required to show that the patterns in the previous proposition capture the boundary of hardness.

Theorem 4.4.16 (COUNT(DISTINCT) Dichotomy). Let Q[α(y) θ k] be a HAVING query such that α is COUNT(DISTINCT); then either (1) Q is COUNT(DISTINCT)-safe and so has P data complexity, or (2) Q is not COUNT(DISTINCT)-safe and has #P-hard data complexity.

Proof. Part (1) of the result is Theorem 4.4.14. We sketch the proof of (2) in the simpler case when only tuple-independent probabilistic tables are used in Q, and defer a full proof to Appendix B.2. Assume the theorem fails, and let Q be the minimal counterexample in terms of subgoals; this implies we may assume that Q is connected and the skeleton of Q is safe. Since there is no safe plan projecting on y and only independent projects are possible, the only condition that can fail is that some subgoal does not contain y. Thus, there are at least two subgoals R(x) and S(z, y) such that y ∉ x ∪ z and x ∩ z ≠ ∅. Given a graph (V, E), we then construct a BID instance J exactly as in the proof of Prop. 4.4.15. Only the R relation is required to have probabilistic tuples; all others can set their probabilities to 1. 

Extending to BID databases requires more work because our technique of adding extra tuples with probability 1 does not work: doing so naively may violate a possible worlds key constraint. The full proof appears in Appendix B.2. It is straightforward to decide if a query is COUNT(DISTINCT)-safe: the safe plan algorithm of Dalvi and Suciu [48, 135] simply tries only disjoint projects and joins until it is able to project away y, or it fails.

4.4.5 SUM-safe and AVG-safe queries

To find SUM- and AVG-safe queries, we need to further restrict the class of allowable plans. There are queries involving SUM on a single table that are #P-hard, e.g., the query Q[SUM(y) = k] D R(y) is already #P-hard. There are, however, some queries that can be evaluated efficiently:

Definition 4.4.17. A HAVING query Q[α(y) θ k] for α ∈ {SUM, AVG} is α-safe if there is a safe plan P for the skeleton of Q such that π^D_{−y} is in P and no proper ancestor of π^D_{−y} is π^I_{−x} for any x.

The idea of the positive algorithm is that the plan contains π^D_{−y}, i.e., each value for y is present disjointly. Let a1,..., an be the y values returned by running the standard query q(y) (adding y to the head of sk(Q)). Now consider the query Q′ where sk(Q′) = q[y → ai] (substitute y with ai). On this query, the value of y is fixed, so we only need to compute the multiplicity of ai to figure out whether Q′ is true. To do this, we use the COUNT algebra of Section 4.4.3 whenever q is safe.

Theorem 4.4.18. If Q[α(y) θ k] for α ∈ {SUM, AVG} is α-safe, then Q’s evaluation problem is in P-time. 73

Sketch. Since Q is α-safe, there is a plan P satisfying Definition 4.4.17. The consequence of this definition is that on any possible world W, the conjunctive query q(y) (q = sk(Q)) returns a single tuple (i.e., a single binding for y). This implies that the values are disjoint. So for a fixed positive integer a returned by q(y), the predicate SUM(y) θ k depends only on the multiplicity of a. Hence, we can write:

µ[Q] = Σ_{a ∈ S} µ[Qa[COUNT(∗) θ k/a]]

Here, Qa denotes the query with sk(Qa) = q[y → a], i.e., y is substituted with a in the body of Qa. Since Q is α-safe, we have that q[y → a] is safe, and so by Theorem 4.4.8, each term can be computed with the COUNT algebra. Hence, we can compute the entire sum, and so µ[Q], in polynomial time. For AVG, it is slightly simpler: Since we are averaging m copies of a, the AVG is a if m > 0 (else the query is false). Thus, we simply need to compute the probability that the value a exists with multiplicity at least 1 (which can be handled by the standard EXISTS algebra). 

Example 4.4.19 Consider Q[SUM(y) > 10] D R(‘a’; y), S(y, u). This query is SUM-safe, with plan π^D_{−y}(R(‘a’; y) Z π^I_{−u}S(y, u)).
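The decomposition behind Theorem 4.4.18 is easy to state in code once the per-value multiplicity distributions are available from the COUNT algebra; in the sketch below (Python, with made-up distributions) at most one y value occurs in any world, so the contributions of different values, and of different multiplicities of the same value, are disjoint and simply add.

    # mu(Q[SUM(y) > k]) = sum_a mu(Q_a[COUNT(*) > k/a]) when y values are disjoint.
    # The per-value multiplicity distributions stand in for the COUNT algebra's output.
    k = 10
    count_dist = {                      # value a -> marginal vector over multiplicities 0..2
        5: [0.4, 0.5, 0.1],             # a = 5 occurs 0, 1 or 2 times with these probabilities
        12: [0.7, 0.3, 0.0],
    }

    mu = 0.0
    for a, m in count_dist.items():
        for mult, p in enumerate(m):
            if mult > 0 and mult * a > k:   # SUM = mult * a; require a nonempty result
                mu += p                     # worlds for different (a, mult) pairs are disjoint
    print(mu)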

Complexity We show that if a HAVING query without self joins is not SUM-safe, then it has #P-hard data complexity. AVG follows by essentially the same construction.

Proposition 4.4.20. Let α ∈ {SUM, AVG} and θ ∈ {≤, <, =, >, ≥}; then Q[α(y) θ k] D R(y) has #P-hard data complexity.

Proof. We only show SUM, deferring AVG to the appendix. Consider the case when θ is =. An instance of #SUBSET-SUM is a set of integers x1,..., xn and a target B; the goal is to count the number of subsets S ⊆ {1,..., n} such that Σ_{i∈S} xi = B. We create the representation with schema R(X ;; P) satisfying R = {(x1; 0.5),..., (xn; 0.5)}, i.e., each tuple is present with probability 0.5. Taking k = B, µ(Q) · 2^n is the number of such S. Showing hardness for the other aggregate tests follows from Lemma 4.4.10. 

Theorem 4.4.21. Let α ∈ {SUM, AVG} and let Q[α(y) θ k] be a HAVING query; then either (1) Q is α-safe and hence has P-time data complexity, or (2) Q is not α-safe and Q has #P-hard data complexity.

We prove this theorem in Appendix B.3.

4.5 Generating a Random World

In this section, we give an algorithm (Alg. 4.5.2.1) to solve the random possible world generation problem, which informally asks us to generate a possible world W˜ such that q(W˜, τ) = s, i.e., such that the value of q on W˜ is s. The probability that we generate a fixed world W˜ is exactly the probability of W˜ conditioned on the value of q being equal to s. Our solution to this problem is a key subroutine in our FPTRAS for SUM (in Section 4.6), but it is also an interesting problem in its own right. As pointed out by Cohen et al. [35], a random world satisfying some constraints is useful for many debugging and related tasks.

4.5.1 Problem Definition

Definition 4.5.1. Let J be a BID instance, q be a conjunctive query, and τ be an annotation function. A BID random world generator (simply, a random generator) is a randomized algorithm A that generates a possible world W˜ ∈ WJ such that for any s ∈ S we have9:

µA[W˜ = W] = µ(W | q(W, τ) = s)

where µA emphasizes that the probability is taken over the random choices of the algorithm A. Further, we require that A run in time poly(|J| , |S |).

This definition says that the probability that a world is generated is exactly the conditional probability of that instance (conditioned on the value of the query q being s). In this section, we show that when sk(Q) is safe we can create a random generator for any BID instance and any annotation function.

4.5.2 Possible World Generation Algorithm

To describe our algorithm, we need a notation to record the intermediate operations of the safe plan on the marginal vectors, i.e., a kind of lineage or provenance for the semiring computation.

9Formally, if the set of worlds with q(W, τ) = s is empty, then we require that a random generator return a special value, ⊥. This value is like an exception and will not be returned during the course of normal execution.

Algorithm 4.5.2.1 A random world generator for Jφ
Decl: RWH(φ : semiring parse tree, s : a semiring value) returns a random world denoted W˜ ⊆ Jφ.

if φ is a leaf, i.e., φ = (t, m^t) for some tuple t then
    (* If s ≠ 0 then the tuple must be present. *)
    if s = τ(t) then return {t}
    elif s = 0 then return ∅
    else return ⊥
(* Inductive case *)
Let φ have label (○, m^φ) and children φ1 and φ2 with marginal vectors m^{φ1} and m^{φ2}, respectively.
if ○ = ⊕ then
    Choose (s1, s2) s.t. s1 + s2 = s with probability m^{φ1}[s1] m^{φ2}[s2] / m^φ[s]
if ○ = ⊗ then
    Choose (s1, s2) s.t. s1 · s2 = s with probability m^{φ1}[s1] m^{φ2}[s2] / m^φ[s]
if ○ = ` then
    Choose (s1, s2) = (s, 0) with probability m^{φ1}[s] / m^φ[s]
    or (s1, s2) = (0, s) with probability m^{φ2}[s] / m^φ[s]
(* Union the results of the recursive calls *)
return RWH(φ1, s1) ∪ RWH(φ2, s2)

Here, we view a safe plan as computing the marginal vectors and as computing a symbolic semiring expression (essentially, a parse tree of the extensional computation performed by the plan).

Definition 4.5.2. A semiring parse tree φ is a binary tree where a leaf is labeled with a pair (t, m^t), where t is a tuple and m^t is a marginal vector on S; and an internal node is labeled with a pair (○, m), where ○ ∈ {⊕, ⊗, `} and m is a marginal vector.

Given a safe plan P and a BID instance J with annotation τ, the parse tree associated to P and J is denoted φ(P, J, τ). We think of φ(P, J, τ) as a record of the computation of P on J. More precisely, φ(P, J, τ) is a parse tree for the semiring expression that we compute given P and J using the rules of Definition 4.4.5. The operations in a safe plan are n-ary: we can, however, transform these n- ary operations into a binary parse tree in an arbitrary way, since the operations are associative. An example parse tree is shown in Figure 4.6. We observe that any safe plan can be mapped to a parse tree. 76

Algorithm 4.5.2.2 A random world generator for J
Decl: RW(φ : semiring parse tree, J : a BID instance, s : a semiring element) returns a random world of J denoted W˜.

Let W˜ ← RWH(φ, s) and T ← J − Jφ
for each t ∈ T do
    Let K(t) = {t′ | t[K] = t′[K]} = {t1,..., tm} with pi = µ[ti].
    Let {tk+1,..., tm} = K(t) ∩ Jφ
    if K(t) ∩ W˜ = ∅ then
        select ti from i = 1,..., k with probability pi / (1 − Σ_{j=k+1,...,m} pj) and set W˜ ← W˜ ∪ {ti}
    T ← T − K(t)
return W˜

Example 4.5.3 Figure 4.6 illustrates how φ(P, J, τ) is constructed for a simple example based on SUM. Figure 4.6(a) shows a relation R(A; B) where A is a possible worlds key. Our goal is to generate a random world such that the query Q[SUM(y) = 6] D R(x; y) is true. The skeleton of Q is safe and so has a safe plan, P = π^I_{−x}(π^D_{−y}(R)). Figure 4.6(a) also shows the intermediate tuples that are computed by the plan, along with their associated marginal vectors. For example, the marginal vector associated to t1 is m^{t1}[0, 1] = (1 − p1, p1). Similarly, the marginal vector for intermediate tuples like t6 is m^{t6}[1] = p2. At the top of the plan is the empty tuple, t8, and one entry in its associated marginal vector is m^{t8}[6] = p1 p5 + p2 p4. The parse tree φ(P, J, τ) corresponding to P on the instance J = {R} is illustrated in Figure 4.6(b). The bottom-most level of internal nodes have ○ = `, since they encode the action of the disjoint projection π^D_{−y}. In contrast, the root has ○ = ⊕, since it records the computation of the independent project, π^I_{−x}. As we can see, the parse tree simply records the computation and the intermediate results.

Algorithm Overview Our algorithm has two phases: (1) We first build a random generator for the tuples in the parse tree φ (defined formally below); this is Alg. 4.5.2.1. (2) Using the tuples generated in step (1), we select those tuples not in the parse tree and complete the generator for J; this is Alg. 4.5.2.2. To make this precise, we need the following notation:

Definition 4.5.4. Given a semiring parse tree φ, we define tup(φ) inductively:


Figure 4.6: (a) A BID relation R(A; B) used in Example 4.5.3 along with a safe plan P = π^I_{−x}(π^D_{−y}(R)). The extensional computation is in the semiring N. (b) φ(P, {R}) is shown, where P = π^I_{−x}(π^D_{−y}(R)). The dotted boxes map to the nodes in the tree, described in Definition 4.5.2. In the figure, for readability, we only show the entry for 6 in the root. Also, the entries in the marginal vectors are real (rational) numbers and are written as expressions only for the sake of readability. There is a one-to-one correspondence between intermediate marginal vectors and nodes in the parse tree φ(P, J, τ).

If φ is a leaf corresponding to a tuple t, then tup(φ) = {t}. Otherwise, φ has two child parse trees φ1 and φ2, and tup(φ) = tup(φ1) ∪ tup(φ2). We also consider tup+(φ) = tup(φ) − {t | τ(t) = 0}, i.e., tup+(φ) is the set of tuples with non-zero annotations contained in φ.

If P is a safe plan, then tup has a particular simple form:

Proposition 4.5.5. Let P be a safe plan for q and J be a BID instance; then for any internal node φ′ in φ(P, J, τ) with children φ1 and φ2, we have that tup(φ1) ∩ tup(φ2) = ∅ and tup+(φ1) ∩ tup+(φ2) = ∅.

Proof. We observe that an ⊕ or a ` node is introduced only if there is a projection removing a variable x (Definition 4.4.4), in which case the tuples in tup(φ1) and tup(φ2) disagree on x and, hence, are disjoint sets of tuples. Case two is that ○ = ⊗, which is introduced only as a join of two tuples. In this case, tup(φ1) and tup(φ2) come from different relations (since there are no self joins in q). Thus, tup(φ1) and tup(φ2) have an empty intersection. The second statement follows since tup+(φi) ⊆ tup(φi) for i = 1, 2. 

For any parse tree φ, we can view the tuples in tup+(φ) as a BID instance that we denote Jφ (any subset of a BID instance is again a BID instance). For a deterministic world W and a semiring expression φ, we write φ(W) to mean the semiring value of φ on world W, which is computed in the obvious way.

Step (1): A generator for Jφ We now define precisely the first step of our algorithm: Our goal is to construct a random world generator for the worlds induced by the BID instance Jφ. This is captured by the following lemma:

Lemma 4.5.6. Let P be a safe plan for a query q, φ = φ(P, J, τ), and Jφ = tup+(φ); then Alg. 4.5.2.1 is a random generator for Jφ for any annotation function τ.

Proof. Let φ′ be a subtree of φ(P, J, τ). We show, by induction on the structure of the parse tree, that given any s ∈ S, Alg. 4.5.2.1 is a random generator for J_{φ′}. In the base case, φ′ is a leaf node and our claim is straightforward: If s = 0, then we return the empty world. If τ(t) = s, then we simply return the singleton world {t}. Otherwise, we have that τ(t) ≠ s, the input is not well-formed, and we return an exception (⊥) as required. This is a correct random generator, because our input is conditioned to be deterministic (i.e., µ has all the mass on a single instance). We now write the probability that φ(W) = s in a way that shows that if we can recursively generate random worlds for subtrees, then we can make a random generator. Inductively, we consider an internal node φ with children φ1 and φ2. Assume for concreteness that ○ = ⊕ (the argument for ○ = ⊗ is identical and for ○ = ` is only a slight variation). Let W denote a world of

Jφ. Then,

µ[φ(W) = s] = µ[φ1(W) = s1 ∧ φ2(W) = s2 | s1 + s2 = s]

This equality follows from the computation of φ. We then simplify this expression using the fact that for i = 1, 2, φi’s value is a function of tup+(φi). Letting Wi = W ∩ tup+(φi), we get:

µ[φ1(W1) = s1 ∧ φ2(W2) = s2 | s1 + s2 = s] 79

Observe that µ[s1 + s2 = s] = µ[φ(W) = s]. Then, for any fixed s1, s2 such that s1 + s2 = s, we can then apply Bayes’s rule and independence to get:

µ[φ1(W1) = s1] µ[φ2(W2) = s2] / µ[φ(W) = s]

Notice that W1 (respectively, W2) is a possible world of Jφ1 (respectively, Jφ2), and so the inductive hypothesis applies. Now, by Prop. 4.5.5, the worlds returned by the recursive calls do not make conflicting choices. Since the recursive calls are correct, we just need to ensure that we pick (s1, s2) with the above probability. Examining Alg. 4.5.2.1, we see that we pick (s1, s2) with exactly this probability, since

µ[φ1(W) = s1 ∧ φ2(W) = s2 | s1 + s2 = s] = m^{φ1}[s1] m^{φ2}[s2] / m^φ[s]

This completes the proof. 

Example 4.5.7 We illustrate Alg. 4.5.2.1 using the data of Example 4.5.3. Our goal is to generate a random world such that the query Q[SUM(y) = 6] D R(x; y) is true. The algorithm proceeds top-down from the root φ. The entry for 6 is selected with probability equal to p1 p5 + p2 p4.

Assume we have selected 6, then we look at the child parse trees, φ1 and φ2: There are two ways to derive 6 with non-zero probability (1) the subtree φ1 takes value 1 and φ2 takes value 5, written

(φ1, φ2) = (1, 5), or (2) we set (φ1, φ2) = (2, 4). We choose between these options randomly; we select (φ1, φ2) = (1, 5) with probability p1 p5 / (p1 p5 + p2 p4) (the conditional probability). Otherwise, we select (φ1, φ2) = (2, 4). Suppose we have selected (φ1, φ2) = (1, 5); we then recurse on the subtree φ1 with value s1 = 1 and the subtree φ2 with value s2 = 5.

Recursively, we can see that to set φ1 = 1, it must be that t1 is present and t2 is absent. Similarly, we conclude that t4 must be absent and t5 must be present. Hence, our random world is W˜ = {t1, t5}.

If we had instead chosen (φ1, φ2) = (2, 4), then we would have selected W˜ = {t2, t4}. Notice that our algorithm never selects (φ1, φ2) = (3, 3) (i.e., this occurs with probability 0). More generally, this algorithm never selects any invalid combination of tuple values.
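A compact Python sketch of Alg. 4.5.2.1 on a parse tree shaped like this example is given below. The leaf tuples, their values, and their probabilities are stand-ins chosen to mirror the two derivations of 6 discussed above (they are not Figure 4.6's actual numbers), and marginal vectors are recomputed on the fly rather than cached as a real implementation would.

    import random

    def leaf(name, val, p):  return ('leaf', name, val, p)
    def ind(left, right):    return ('ind', left, right)    # independent project (convolution)
    def disj(left, right):   return ('disj', left, right)   # disjoint project

    def marginal(node):
        if node[0] == 'leaf':
            _, _, val, p = node
            return {0: 1.0 - p, val: p}
        _, l, r = node
        ml, mr = marginal(l), marginal(r)
        m = {}
        if node[0] == 'ind':                       # convolution over (S, +, 0)
            for s1, p1 in ml.items():
                for s2, p2 in mr.items():
                    m[s1 + s2] = m.get(s1 + s2, 0.0) + p1 * p2
        else:                                      # disjoint operation
            for s, p in list(ml.items()) + list(mr.items()):
                m[s] = m.get(s, 0.0) + p
            m[0] = ml.get(0, 0.0) + mr.get(0, 0.0) - 1.0
        return m

    def rwh(node, s):
        # Returns a set of base tuples, conditioned on the node's value being s.
        if node[0] == 'leaf':
            _, name, val, p = node
            return {name} if s == val and s != 0 else set()   # s = 0 means the tuple is absent
        _, l, r = node
        ml, mr, m = marginal(l), marginal(r), marginal(node)
        if node[0] == 'ind':
            pairs = [(s1, s - s1) for s1 in ml if (s - s1) in mr]
            weights = [ml[s1] * mr[s2] / m[s] for s1, s2 in pairs]
        elif s == 0:
            pairs, weights = [(0, 0)], [1.0]
        else:
            pairs = [(s, 0), (0, s)]
            weights = [ml.get(s, 0.0) / m[s], mr.get(s, 0.0) / m[s]]
        s1, s2 = random.choices(pairs, weights=weights)[0]
        return rwh(l, s1) | rwh(r, s2)

    p = {'t1': 0.3, 't2': 0.4, 't3': 0.2, 't4': 0.5, 't5': 0.4}
    block1 = disj(disj(leaf('t1', 1, p['t1']), leaf('t2', 2, p['t2'])), leaf('t3', 3, p['t3']))
    block2 = disj(leaf('t4', 4, p['t4']), leaf('t5', 5, p['t5']))
    print(rwh(ind(block1, block2), 6))   # {'t1', 't5'} or {'t2', 't4'}, never anything else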

Step (2): A generator for J We randomly include tuples in J that are not mentioned in φ, i.e., tuples in J − Jφ. These are tuples that do not match any selection condition in the query, and can be freely added to W˜ without affecting the query result. Here, we need to exercise some care to not

insert two tuples with the same key into W˜ , and so, we only consider tuples whose possible worlds key differs from those returned by Step (1). Formally, we prove the following lemma:

Lemma 4.5.8. Let φ(P, J, τ) be a parse tree for a safe plan P, a BID instance J, and an annotation

τ. Then, given a random generator for Jφ, Alg. 4.5.2.2 is a random generator for J.

Proof. We first use the random generator to produce a random world of Jφ, call it Wφ. Now, consider a tuple t ∈ J − Jφ, and let K(t) = {t′ | t′[K] = t[K], t′ ≠ t} = {t1,..., tm}, i.e., the tuples distinct from t that share a key with t. If K(t) ∩ Wφ ≠ ∅, then t cannot appear in this world because it is disjoint from the tuples in K(t). Otherwise, K(t) ∩ Wφ = ∅; let K(t) − Jφ = {t1,..., tk} (without loss of generality) with marginal probabilities p1,..., pk, i.e., those key tuples not in tup^+(φ). These tuples do not affect the value of the query, so all that matters is adding them with the correct probability, which is easily seen to be the conditional probability:

µ[ti is included] = pi / (1 − Σ_{j=k+1,...,m} pj)

This conditional simply says that we condition on none of the tuples in K(t) ∩ Jφ appearing; this is exactly what Alg. 4.5.2.2 does.
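A small sketch of this second step, assuming the world W_phi from Step (1), the tuple sets J and J_phi, and dictionaries key and prob are supplied by the caller (all hypothetical names): per possible-worlds key, it adds at most one unmentioned tuple with the conditional probability above.

import random
from collections import defaultdict

def extend_world(W_phi, J, J_phi, key, prob):
    """W_phi: tuple ids already chosen; J, J_phi: sets of tuple ids;
    key[t]: possible-worlds key of t; prob[t]: marginal probability of t."""
    W = set(W_phi)
    blocks = defaultdict(list)
    for t in J - J_phi:                       # tuples never mentioned by phi
        blocks[key[t]].append(t)
    for k, candidates in blocks.items():
        if any(key[u] == k for u in W_phi):   # a disjoint sibling was already chosen in Step (1)
            continue
        # condition on none of the same-key tuples inside J_phi appearing
        mass_in_phi = sum(prob[u] for u in J_phi if key[u] == k)
        r, acc = random.random(), 0.0
        for t in candidates:                  # pick at most one tuple of the block
            acc += prob[t] / (1.0 - mass_in_phi)
            if r < acc:
                W.add(t)
                break
    return W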

The main result. We now state the main technical result of this section; it follows directly from the lemma above:

Theorem 4.5.9. Let q be a safe conjunctive query, then Alg. 4.5.2.1 is a random generator for any BID instance J and annotation τ.

An immediate consequence of Theorem 4.5.9 is that if the semiring S does not contain too many elements, then Alg. 4.5.2.1 solves the random possible world generation problem.

Corollary 4.5.10. If q is safe and |S | = poly(|J|), then Alg. 4.5.2.1 solves the random possible world generation problem in time poly(|J|).

We use this corollary in the next section to design an FPTRAS for SUM.

4.6 Approximating HAVING queries with MIN, MAX and SUM

In this section, we study the problem of approximating HAVING queries. First, we describe an FPTRAS for HAVING queries that have α = MIN where the test condition is < or ≤, or α = MAX where the condition is one of {≥, >}. This first FPTRAS applies to arbitrary such HAVING queries, including queries whose skeleton is unsafe. Second, we describe an FPTRAS for HAVING queries whose skeleton is safe, whose aggregate is SUM, and where the test condition is any of <, ≤, >, or ≥. Our FPTRAS for SUM uses the random possible world generator of the previous section. These FPTRASes apply to a class of queries that we call (α, θ)-apx-safe. Additionally, we study the limits of any approach, and prove an approximation dichotomy for many (α, θ) pairs of HAVING queries without self joins: either the above scheme is able to provide an FPTRAS, and so the query is (α, θ)-apx-safe, or there is no FPTRAS: we call these queries (α, θ)-hazardous10.

4.6.1 Background: Approximation of #P-Hard Problems

Although #P-hard problems are unlikely to be solvable exactly and efficiently, some problems have a strong approximation called a Fully Polynomial Time Randomized Approximation Scheme or FPTRAS [121], which is intuitively like a 1 + ε approximation.

Definition 4.6.1. Given a function f that takes an input J and returns a number f(J) ∈ [0, 1], where J is a BID instance, we say that an algorithm A is an FPTRAS for f if, given any δ > 0, a confidence, and any ε > 0, an error, A takes J, ε, and δ as input and produces a number denoted f̃(J) such that

µ_A[ |f(J) − f̃(J)| ≤ ε f(J) ] > 1 − δ

where µ_A is taken over the random choices of the algorithm, A. Further, A runs in time polynomial in ε⁻¹, |W|, and log δ⁻¹.

This definition asks for a relative approximation [121], which means that if f is exponentially small, but non-zero, our algorithm is required to return a non-zero value. This is in contrast to an absolute approximation, which is allowed to return 0 (and could be constructed using naïve random

10 Formally, we mean that the #BIS problem would have an FPTRAS, an unlikely outcome [58, 59].

sampling). In this section, we fix a query Q and consider the function f(J) = µ_J(Q), where J is a BID instance. We study whether this function admits an FPTRAS. We define three counting-like problems that are all #P-hard and will be of use later in this section:

Definition 4.6.2. The #CLIQUE problem is: given a graph (V, E), compute the fraction of the subsets of V that are cliques. The #BIS problem is: given a bipartite graph (U, V, E), compute the fraction of the subsets of U ∪ V that are independent sets. The #KNAPSACK problem is: given a set of positive integers Y = {y1,..., yn} and a positive integer value k, compute the fraction of sets W ⊆ Y such that Σ_{i∈W} yi ≤ k.

All three problems are #P-hard11. In a celebrated result, Jerrum and Sinclair [152] showed that #KNAPSACK does have an FPTRAS using a sophisticated Markov Chain Monte Carlo technique. It is believed that neither #CLIQUE nor #BIS has an FPTRAS. Interestingly, they are not equally hard to approximate (see Dyer et al. [58]). In particular, #BIS is a complete problem with respect to approximation-preserving reductions. We do not need these reductions in their full generality, and simply observe that polynomial-time computable 1-1 reductions (bijections) are approximation preserving. In this section, we say that a problem is #BIS-hard if there is a 1-1, polynomial-time reduction to #BIS. The #KNAPSACK problem is related to the problem of computing HAVING queries with the aggregate function SUM on a single table.

4.6.2 An FPTRAS for MIN with {≤, <} and MAX with {≥, >}

Consider a query Q[MIN(y) ≤ k] D g1,..., gl. An equivalent condition to W |= Q is that W |= q′ where q′ D g1,..., gl, y ≤ k. In other words, Q is equivalent to a conjunctive query, q′, that contains an inequality predicate. As such, the standard algorithm for conjunctive queries on probabilistic databases [46, 78, 137] based on Karp-Luby [102] can be used. A symmetric argument can be used to find an FPTRAS for the aggregate test (MAX, ≥). Thus, we get essentially for free the following theorem:

Theorem 4.6.3. If (α, θ) ∈ {(MIN, ≤), (MIN, <), (MAX, ≥), (MAX, >)} then Q[α(y) θ k] has an FPTRAS.

11 We mean here that there is a 1-1 correspondence with the counting variants of these problems, which are canonical #P-complete problems.

Although this theorem is easy to obtain, it is interesting to note that Q[MIN(y) > k] has an FPTRAS only if sk(Q) is safe (as we show in Lemma 4.6.10). If sk(Q) is safe, then we can compute its value exactly, so the FPTRAS is not very helpful. In contrast, Theorem 4.6.3 has no such restriction: sk(Q) can be an arbitrary conjunctive query. This is a striking example that approximation complexity may be more subtle than exact evaluation. In particular, an analog of Lemma 4.4.10 does not hold.

4.6.3 An FPTRAS for safe queries using SUM with {<, ≤, ≥, >}

The key idea of the FPTRAS is based on a generalization of Dyer's observation: for some k ≥ 0, the query Q[SUM(y) ≤ k] is only hard to compute if k is very large. If k is small, i.e., polynomial in the instance size, then we can compute Q exactly. Dyer's idea is to scale and round down the values, so that the y-values are small enough for exact computation. The cost of rounding is that it introduces some spurious solutions, but not too many. In particular, the fraction of original solutions among the rounded solutions is large enough that, if we can sample from the rounded solutions, then we can efficiently estimate the fraction of original solutions inside the rounded solutions. To perform the sampling, we use Alg. 4.5.2.1 from the previous section (via Alg. 4.6.3.2). Pseudo-code for the entire FPTRAS is shown in Figure 4.6.3.1. We show only (SUM, ≤) in detail, and explain informally how to extend to the other inequalities at the end of the section.

Theorem 4.6.4. Let Q be a HAVING query Q[SUM(y) θ k] such that θ ∈ {≤, <} and sk(Q) is safe; then Alg. 4.6.3.1 is an FPTRAS for Q.

It is interesting to note that Theorem 4.6.4 implies that we can efficiently evaluate a much larger set of queries than the previous, complete exact algorithm (albeit only in an approximate sense). In particular, only a very restricted class of SUM-safe queries can be processed efficiently and exactly (c.f. Definition 4.4.17). Our algorithm makes two technical assumptions: (1) in this section, unlike the rest of the paper, our semantics differ from SQL: in standard SQL, for a Boolean HAVING query q, if no tuples are returned by sk(Q) then Q[SUM(y) ≤ k] is false. In contrast, in this section, we assume that Q[SUM(y) ≤ k] is true even if sk(Q) = q is false, i.e., we choose the mathematical convention that a sum over the empty set is 0, over SQL's choice; and (2) we make a bounded odds assumption12: for any tuple t there

12 For example, this rules out pt = 1 for any tuple and allows any tuple to not be present with some probability.

Algorithm 4.6.3.1 An FPTRAS for SUM
Decl: S(Q: a query Q[SUM(y) ≤ k] with a safe skeleton q, an instance I, a confidence δ and error ε) returns estimate of µ_I(Q).

  Let body(q) = {g1,..., gl} and n_i = |pred(g_i)|, i.e., the size of the i-th relation.
  Let n = Π_{i=1,...,l} n_i.
  Let Q^R[SUM(y) ≤ n²] be the query with the same body as q (see below).
  Let τ^R(y) = ⌊n²y/k⌋, and y > k ↦ n² + 1.
  Construct an expression parse tree, φ = φ(P, I, τ^R), where P is a plan for q^R.
  For i = 1,..., m:      (* Run m samples, for m a polynomial in δ, ε, n *)
      W_i ← SH(φ, k)
  return (|{W_i | W_i |= Q^O}| / m) · µ(Q^R)
      (* |{W_i | W_i |= Q^O}| / m ≈ µ(Q^O)/µ(Q^R), and (µ(Q^O)/µ(Q^R)) · µ(Q^R) = µ(Q^O) *)
      (* Compute the fraction of the W_i that satisfy the original query Q^O. *)

Algorithm 4.6.3.2 Sampling Helper Routine Decl:SH(φ: safe aggregate expression, b: a bound) returns a world

  Select s ∈ {0,..., b} with probability m^φ[s] / Σ_{s′} m^φ[s′].   (* Select a final value for the query that is less than the bound b *)
  return RW(φ, s)

exists β > 1 such that β⁻¹ ≤ pt/(1 − pt) ≤ β. These technical restrictions can be relaxed, but are chosen to simplify our analysis.
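Putting the pieces together, the following sketch shows the shape of the estimator in Alg. 4.6.3.1. The helpers sample_rounded_world, satisfies_original, and mu_QR are placeholders for the machinery of Sections 4.4.2 and 4.5 (not interfaces defined in this chapter), and the constant in the sample count is illustrative only.

import math

def round_value(y, k, n):
    """The rounding annotation tau^R: scale y by n^2/k and floor; values above k map to the sentinel n^2 + 1."""
    return n * n + 1 if y > k else (n * n * y) // k

def sum_fptras(sample_rounded_world, satisfies_original, mu_QR, n, beta, eps, delta):
    # Sample count from the {0,1}-estimator lemma (Lemma 4.6.7), using the lower bound
    # E[f] >= 1 / ((n + 1) * beta) of Lemma 4.6.8; the constant 4 is a loose illustrative choice.
    m = math.ceil(4 * (n + 1) * beta * math.log(1.0 / delta) / (eps * eps))
    hits = sum(1 for _ in range(m) if satisfies_original(sample_rounded_world()))
    return (hits / m) * mu_QR   # (fraction of original among rounded samples) * mu(Q^R) estimates mu(Q^O)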

The Rounding Phase

The goal of the rounding phase is to produce a query and an annotation function that rounds the values in the instance down enough so that (1) the exact processing algorithms of Section 4.4.2 for SUM queries can be used, and (2) we can randomly generate a world using the algorithm of Section 4.5. Alg. 4.6.3.2 shows pseudo-code for how these two steps are put together. The main result of this section is that the resulting algorithm is efficient (runs in time polynomial in the size of the BID instance J). To get there, we construct two things: (1) an annotation function, τ^R, to do the rounding and (2) a query, Q^R, that uses the annotation function τ^R and the semiring S_{n²+1} to compute the exact

distribution of the rounded sum in polynomial time.

The Annotation Function. Let g be the first subgoal of q such that var(g) ∋ y and R = pred(g), i.e., R is some relation containing y values. We scale down the values of y in R via the (rounding) annotation function denoted τ^R. Let n = Π_{g∈goal(q)} |pred(g)|, i.e., the product of the sizes of all relations in q. Observe that n is polynomial in the instance size13. The rounded annotation function maps into the much smaller, rounded semiring S = S_{n²+1}. We define the rounded annotation function τ^R to agree everywhere with the original annotation function τ^O, except on g (the R relation): here, τ^O_g(t) = t[y], while in the rounded annotation function we have τ^R_g(t) = ⌊(n²/k) t[y]⌋, i.e., the y values are scaled down by a factor of n²/k and rounded down to the next lower integer. Additionally, if t[y] is greater than k, then τ^R_g(t) = n² + 1. Intuitively, this mapping is correct since if such a tuple is in the output of the query, then we are sure the summation is greater than k.

The Query. We construct a rounded query Q^R[SUM(y) ≤ n²] with the same body as Q^O. Let q be the skeleton of both Q^O and Q^R, i.e., q = sk(Q^R) = sk(Q^O). We observe that since n² is polynomial in the instance size, the generic semiring algorithm of Section 4.4.2 can be used to compute the entire distribution q(W, τ^R) exactly in time polynomial in the size of the instance. Since we will always use Q^R with the rounded annotation function, it makes sense to write W |= Q^R if q(W, τ^R) ≤ n². Similarly, we will always use Q^O with the original annotation function, so it makes sense to write W |= Q^O if q(W, τ^O) ≤ k.

Definition 4.6.5. Let W be a possible world from some BID instance J. If W |= Q^O, then we call W an original solution. If W |= Q^R, then we call W a rounded solution. Further, denote the set of original solutions with W^O_J and the set of rounded solutions with W^R_J:

W^O_J = {W ∈ W_J | W |= Q^O}   and   W^R_J = {W ∈ W_J | W |= Q^R}

We drop the subscript J when the BID instance is clear from the context.

We observe an essential property of our scheme: All original solutions are rounded solutions,

13 Recall that the query is fixed, so if the database contains m tuples then n = m^{O(1)}, where O(1) hides a constant depending only on q; e.g., the number of subgoals suffices.

i.e., W^O_J ⊆ W^R_J. Formally,

Lemma 4.6.6. For any possible world W, W |= Q^O =⇒ W |= Q^R, and more precisely, there exists a δ ∈ [0, n) such that q(W, τ^R) = (n²/k) q(W, τ^O) − δ.

Proof. Let q = sk(Q^O) = sk(Q^R) and let V be the set of all valuations for q. Let W be a possible world:

W |= Q^O ⟺ Σ_{v∈V : im(v)⊆W} τ^O(v(g)) ≤ k
         ⟺ Σ_{v∈V : im(v)⊆W} (n²/k) τ^O(v(g)) ≤ n²
         ⟹ Σ_{v∈V : im(v)⊆W} (τ^R(v(g)) + δ_v) ≤ n²
         ⟺ W |= Q^R

Here, each δ_v ∈ [0, 1) accounts for the round-off of the floor function. Since 0 ≤ Σ_v δ_v < n, we have the more precise statement.

The importance of this lemma is that by sampling within the rounded solutions, we have a chance of hitting any original solution. Let W̃ be a random rounded solution created using Alg. 4.6.3.2, and let f be the Boolean-valued estimator (random variable) that takes value 1 iff W̃ |= Q^O. It is not hard to see that this estimator satisfies:

E_A[f] = µ_J(W^O) / µ_J(W^R)

Here, A is written to emphasize that the expectation is taken with respect to the (random) choices of Alg. 4.6.3.2. Importantly, this is exactly an individual trial of Alg. 4.6.3.1.

Analysis of the convergence

Using Alg. 4.6.3.2, we can efficiently conduct an individual (random) trial. The last technical piece needed to show that Alg. 4.6.3.1 is an FPTRAS is to show that the number of trials m needed to guarantee that the estimator converges is small enough, i.e., m = poly(|J|). The first lemma that we need is the standard {0, 1}-estimator lemma [121], which is an application of a Chernoff Bound.

Lemma 4.6.7 ( [121]). Let m > 0 be an integer. Given a sequence of independent Boolean-valued

({0, 1}) random variables f1,..., fm with mean E[f], then the estimator

f_m = (1/m) Σ_{i=1,...,m} f_i

achieves a relative error of ε with probability 1 − δ for some m = O(E[f]⁻¹ ε⁻² log δ⁻¹).

Observe that the estimator used in Alg. 4.6.3.1 is exactly of this type. The second lemma that we need is that the probability mass of the original solutions contained in the rounded solutions is “big enough” so that our sampling scheme will converge quickly.

Lemma 4.6.8. Let Q^R and Q^O be defined as above, let J be a BID instance, and let µ_J be J's induced probability measure. Then,

(n + 1)⁻¹ β⁻¹ ≤ µ_J(W^O) / µ_J(W^R) ≤ 1

where n = Π_{g∈goal(q)} |pred(g)|.

This lemma is the technical heart of the argument: it intuitively places bounds on the variance of our estimate. We give a full proof in Appendix B.4. The importance of Lemma 4.6.8 is that it shows that E[f] = µ_J(W^O)/µ_J(W^R) ≥ (n + 1)⁻¹ β⁻¹, and so, applying Lemma 4.6.7, we see that we need at most m = O(nβ ε⁻² log δ⁻¹) samples. We observe that a relative estimate for E[f] implies that we have a relative estimate for E[f] · µ_J(W^R) = µ_J(W^O), the probability that we want to estimate. Thus, the algorithm is efficient as long as the number of samples is bounded by a polynomial in |J|; a sufficient condition for this to hold is β = poly(|J|). Thus, under the bounded odds assumption with β = poly(|J|), we have:

Theorem 4.6.9. Let Q be a HAVING query Q[α(y) θ k] with α = SUM and θ ∈ {<, ≤, >, ≥}; if the skeleton of Q is safe then Q has an FPTRAS.

Extending to Other Inequalities. A virtually identical argument shows that θ = ‘<’ has an FPTRAS. To see that ≥ has an FPTRAS with SUM on a tuple-independent database, the key observation is that we can compute a number M = max_W q(W, τ). Then, we create a new BID instance J̄ in which each tuple t ∈ J is mapped to a tuple t′, where t′ = t except that t′[P] = 1 − t[P]. We then ask the query Q[SUM(y) < M − k] on J̄, which is satisfied precisely when Q[SUM(y) ≥ k] is satisfied on J.
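A minimal sketch of this complement trick on a hypothetical single-table, tuple-independent instance given as (value, probability) pairs; it only illustrates the instance transformation, not the full FPTRAS.

def complement_instance(instance):
    """Map each tuple t to t' with t'[P] = 1 - t[P]; the y values are unchanged."""
    return [(y, 1.0 - p) for (y, p) in instance]

def reduce_geq_to_lt(instance, k):
    M = sum(y for (y, _) in instance)   # M = max_W q(W, tau): the sum when every tuple is present
    return complement_instance(instance), M - k

# The probability of SUM(y) >= k on the original instance equals the probability of the
# corresponding "at most M - k" event on the complemented instance, which Theorem 4.6.4 can estimate.
instance = [(3, 0.4), (5, 0.9), (2, 0.5)]   # hypothetical (value, probability) pairs
print(reduce_geq_to_lt(instance, 6))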

4.6.4 The Limits of Any Approach and a Dichotomy

We now study the limit of any approach to approximating HAVING queries. We see two interesting phenomena: (1) the approximability depends not only on the aggregate, but also on the test. For example, Q[MIN(y) ≤ k] has an FPTRAS while, in general, Q[MIN(y) ≥ k] does not. (2) The introduction of self joins results in problems that are believed to be harder to approximate than those without self joins; this suggests a more interesting complexity landscape for approximate query evaluation than for exact query evaluation [59].

In this section, we only consider θ ∈ {=, <, ≤, >, ≥}, i.e., we omit ≠ from consideration. To compactly specify aggregate tests, e.g., (MIN, >), we write (α, Θ0) where α is an aggregate and Θ0 ⊆ {=, <, ≤, >, ≥} = Θ is a set of tests; (α, Θ0) is shorthand for the set ∪_{θ∈Θ0} {(α, θ)}. We let Θ≤ = {≤, <, =} and Θ≥ = {≥, >, =}. With this notation, we can state our first lemma.

Lemma 4.6.10. Let (α, θ) be in {(MIN, Θ≥), (MAX, Θ≤), (COUNT, Θ≤), (SUM, Θ≤)}; then the following HAVING query is #BIS-hard:

QBIS[α(y) θ k] D R(x), S (x, y), T(y)

Let Q[α(y) θ k] be a HAVING query such that sk(Q) is not safe and consider only tuple-independent databases; then Q is #BIS-hard.

The second statement identifies the precise boundary of hardness for approximation over tuple independent databases.

Proof. We give a general construction that will be used in every reduction proving that QBIS is #BIS-hard. Given a bipartite graph (U, V, E), we create an instance of three relations R, S and T. The skeleton of our query in the reduction is R(x), S(x, y), T(y). Without loss of generality, we assume that U, V are labeled from 1,..., |U| + |V|. Here, we encode the bipartite graph with u ∈ U ↦ R(u) and v ∈ V ↦ T(v), and we assign each of these tuples probability 0.5. We let S encode E. It is not hard to see that there is a bijection between possible worlds and subsets of the graph. In particular, if a possible world corresponds to an independent set then no tuples are returned. We now add in a deterministic set of tuples, i.e., all probabilities are 1, as {R(a), S(a, a), T(a)} for some a that we will

set below. These tuples are always present in the answer. Moreover, only these tuples are present in the output if and only if the world encodes a bipartite independent set. To see the reduction for MAX, set a = 0. We observe that MAX(y) ≤ 0 if and only if the only tuples returned are the a = 0 tuples, i.e., the world is a bipartite independent set. For MIN, let a = |U| + |V| + 1 and check whether MIN(y) ≥ a. For COUNT, COUNT(y) ≤ 1 holds if and only if only the a tuples are present. Similarly, SUM follows by setting a = 1 and ensuring all other values are encoded higher. Thus, the bijection of the solution sets is the same. Claim (2), that this reduction works for any unsafe query, follows by a result of Dalvi and Suciu [46] that shows that if a skeleton is not safe over tuple-independent databases, it must always contain the R(x), S(x, y), T(y) pattern used in this reduction. All other relations can contain a single tuple. This works because our reductions do not care about where the distinguished variable y falls, so we can set everything to 1 (or 0) in another relation.
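The gadget in this proof can be written down directly; the sketch below builds the three relations from a bipartite graph, using a dictionary-of-probabilities encoding that is purely illustrative.

def bis_gadget(U, V, E, a):
    R = {u: 0.5 for u in U}              # u in U  ->  R(u) with probability 0.5
    T = {v: 0.5 for v in V}              # v in V  ->  T(v) with probability 0.5
    S = {(u, v): 1.0 for (u, v) in E}    # S deterministically encodes the edge relation E
    R[a], T[a], S[(a, a)] = 1.0, 1.0, 1.0   # the always-present tuples {R(a), S(a,a), T(a)}
    return R, S, T

# With a = 0 (for MAX) or a = |U| + |V| + 1 (for MIN), the query R(x), S(x, y), T(y) returns only
# the "a" tuples exactly on the worlds that pick a bipartite independent set of (U, V, E).
U, V, E = [1, 2], [3, 4], [(1, 3), (2, 4)]
print(bis_gadget(U, V, E, a=0))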

As a consequence of this lemma, the positive results of this paper, and the completeness of the safe plan algorithm of Dalvi and Suciu [46], we have the following:

Theorem 4.6.11. Assume that #BIS does not have an FPTRAS. Let (α, θ) be one of (MIN, Θ), (MAX, Θ), (COUNT, Θ≤), or (SUM, Θ≤); then for any HAVING query Q[α(y) θ k] over a tuple-independent database J, either (1) the query evaluation problem can be approximated in randomized polynomial time, and we call it (α, θ)-apx-safe, or (2) the query evaluation problem does not have an FPTRAS, and we call it (α, θ)-hazardous. Further, we can decide in which case Q falls in polynomial time.

In some cases deciding in which case a query falls is trivial, e.g., a (MIN, ≤) HAVING query is always (MIN, ≤)-apx-safe. In the cases where the decision is non-trivial, we can reuse the safe plan algorithm of Dalvi and Suciu [46] (applied to sk(Q)). An immediate consequence of this dichotomy is a trichotomy, which is obtained by combining the relevant theorem from Section 4.4 with the above theorem. For example, to get a trichotomy for the class of (COUNT, ≤)-HAVING queries, we combine Theorem 4.4.9 with the above theorem. It is interesting to note that our positive algorithms work for arbitrary safe plans over BID databases. However, it is not immediately clear that the known hardness reductions (based on polynomial interpolation [48]) can be used to prove approximate hardness. Further, we leave the case of COUNT and SUM with {>, ≥} open.

sk(Q)      (MIN, Θ<), (MAX, Θ>)                (MIN, Θ>), (MAX, Θ<), (COUNT, Θ)
safe       safe (P, Theorem 4.4.8)             safe (P, Theorem 4.4.8)
not safe   apx-safe (FPTRAS, Theorem 4.6.3)    hazardous (no FPTRAS, Theorem 4.6.11)

Figure 4.7: Summary of results for MIN, MAX and COUNT. They form a trichotomy over tuple-independent databases.

4.7 Summary of Results

Figure 4.7 summarizes our results for MIN, MAX and COUNT. If we restrict to HAVING queries over tuple-independent instances, the lines of the table are crisp and form a trichotomy: any such HAVING query cleanly falls into exactly one bucket. The positive results we have shown hold for all BID databases. Over general BID databases, however, we have only established the weaker negative result that there exists some hard query when the skeleton is unsafe14. For example, the query R(x; y), S(y) is known to be #P-hard [48]. Our results show that Q[COUNT(y) ≥ k] D R(x; y), S(y) is #P-hard, but leave open whether it has an FPTRAS. The state of the results with (SUM, <) over tuple-independent databases is more interesting: if Q is SUM-safe, then its evaluation is in P-time (Theorem 4.4.18). If Q is not SUM-safe, but sk(Q) is safe, then Q is #P-hard (Theorem 4.4.21), but does admit an FPTRAS (Theorem 4.6.9). We call Q (SUM, <)-apx-safe. If sk(Q) is not safe, then evaluating Q is #BIS-hard (Theorem 4.6.11), and so likely has no FPTRAS. We call Q (SUM, <)-hazardous. We now show that with the addition of self joins, even a simple pattern becomes as hard to approximate as #CLIQUE, which is as hard to approximate as any problem in #P. This is interesting because it points out that the complexity of approximation may be richer than we have explored in this paper:

Lemma 4.7.1. Let (α, θ) be in {(MIN, Θ≤), (MAX, Θ≥), (COUNT, Θ≤), (SUM, ≤)} and consider the HAVING query:

QCLIQUE[α(y) θ k] D R(x), S(x, y), R(y); then QCLIQUE is as hard to approximate as #CLIQUE.

Proof. The input instance of #CLIQUE is G = (V, E): for each v ∈ V, we create a tuple R(v) that has

14 It is an open problem to extend these results to all BID databases.

probability 1/2, and E encodes exactly the complement (symmetrically closed) edge relation; here, (v, v) ∉ E. Notice that a possible world is simply a subset of R. If, in a possible world W, q = sk(Q) is satisfied, then there is some pair of nodes (u, v) that are not connected by an edge in G, and so W does not represent a clique. Hence, the query is false precisely on the worlds that encode cliques. Using exactly the same encoding as we used in the previous proof, we can then test the probability of this condition.

Chapter 5

VIEW-BASED TECHNIQUE I: MATERIALIZED VIEWS

The technique of materialized views is widely used to speed up query evaluation in relational optimizers. Early relational optimizers were restricted to using simple indexes (which are materialized views that contain a simple projection) [161], while modern optimizers can use materialized views that are defined by arbitrary SQL [5]. In general, materialized views can provide dramatic improvements in query performance for expensive queries. Materialized views are a form of caching: instead of computing the query from scratch, we precompute a portion of the information needed by a query, and use this information to reduce our work at query time. In probabilistic databases, query evaluation is not only practically expensive, but theoretically expensive (#P-hard). As a result, it is natural to suspect (and it is indeed the case) that materialized views have an even greater impact on optimizing probabilistic queries than they do for standard, relational database queries. In this chapter, we discuss how to apply materialized view techniques to probabilistic databases.

The major, new challenge in using materialized views in a probabilistic database query optimizer is that views may contain complex correlations. One way to track this correlation is using a complete approach based on lineage [51, 137, 148] which effectively tracks every derivation for a tuple in the output. Using the complete approach allows any query to be used with any view, but it does not allow the large performance gains that we expect from materialized views. The reason is that the bottleneck in query answering is not computing the lineage, but is in performing the probabilistic inference necessary to compute the probability of an answer tuple. In this chapter, we study a more aggressive approach that avoids recomputing the lineage at query time, and so provides greater gains than the complete approach.

The key to our technique is a novel static analysis of the view and query definitions that allows us to determine how the tuples in the view are correlated. An important desideratum for our check is that it is only a function of the view and query definitions, and not the data, which means it can be used in a query optimizer. The property that we are testing is over an (uncountably) infinite set of databases, i.e., do the claimed independences hold for any probabilistic relational database? As a result, it is not clear that this test is even decidable. Our main technical result is that this test is decidable for conjunctive views, and is in Π2P, the second level of the polynomial hierarchy [128, pg. 433]. In fact, the decision problem is complete for Π2P. To cope with this high complexity, we provide efficient approximations, i.e., polynomial-time tests that are sound, but not complete. We show that for a restricted class of query and view definitions these simple, efficient tests actually are complete. One important special case is when a view V is representable, which operationally means that it can be treated by the query optimizer as if it were a BID table, and so we can use all the query optimization techniques in the MystiQ system, notably, safe plans [46].

We validate our solutions using data given to us by iLike.com [40], a service provided by the music site GarageBand [39], which provides users with recommendations for music and friends. iLike has three characteristics which make it a good candidate for materialized views. First, the data are uncertain because the recommendations are based on similarity functions. Second, the database is large (tens of gigabytes) and backs an interactive website with many users; hence, query performance is important. Third, the database hosts more than one service, implying that there are integration issues. Probabilistic materialized views are a tool that addresses all three of these issues. Interestingly, materialized views are present in iLike. Since there is no support for uncertain views, the materialized views are constructed in an ad hoc manner and require care to use correctly. Our experiments show that 80% of the materialized views in iLike are representable by BID tables and more than 98.5% of their workload could benefit from our optimizations. We also experiment with synthetic workloads and find that the majority of queries can benefit from our techniques. Using the techniques described in this paper, we are able to achieve speed ups of more than three orders of magnitude on large probabilistic data sets (e.g., TPC 1G with additional probabilities).

5.1 Motivating Scenario and Problem Definition

Our running example is a scenario in which a user, Alice, maintains a restaurant database that is extracted from web data and she wishes to send data to Bob. Alice's data are uncertain because they are the result of information extraction [84, 96, 126] and sentiment analysis [62]. A natural way for her to send data to Bob is to write a view, materialize its result and send the result to Bob.

W(Chef, Restaurant; ; P) (WorksAt)
  Chef   Restaurant   P
  TD     D. Lounge    0.9  (w1)
  TD     P. Kitchen   0.7  (w2)
  MS     C. Bistro    0.8  (w3)

S(Restaurant, Dish) (Serves)
  Restaurant   Dish
  D. Lounge    Crab Cakes
  P. Kitchen   Crab Cakes
  P. Kitchen   Lamb
  C. Bistro    Fish

R(Chef, Dish; Rating; P) (Rated)
  Chef   Dish         Rating   P
  TD     Crab Cakes   High     0.8  (r11)
  TD     Crab Cakes   Med      0.1  (r12)
  TD     Crab Cakes   Low      0.1  (r13)
  TD     Lamb         High     0.3  (r21)
  TD     Lamb         Low      0.7  (r22)
  MS     Fish         High     0.6  (r31)
  MS     Fish         Low      0.3  (r32)

Figure 5.1: Sample Restaurant Data in Alice’s Database. In WorksAt each tuple is independent. There is no uncertainty about Serves. In Rated, each (Chef,Dish) pair has one true rating. All of these are BID tables.

Sample data is shown in Figure 5.1 for Alice’s schema that contains three relations described in BID syntax: W (WorksAt), S (Serves) and R (Rating). The relation W records chefs, who may work at multiple restaurants in multiple cities. The tuples of W are extracted from text and so are uncertain. For example, (‘TD’, ‘D. Lounge’) (w1) in W signifies that we extracted that ‘TD’ works at ‘D.Lounge’ with probability 0.9. Our syntax tells us that all tuples are independent because they all have different possible worlds keys. The relation R records the rating of a chef’s dish (e.g. ‘High’ or ‘Low’). Each (Chef,Dish) pair has only one true rating. Thus, r11 and r13 are disjoint because they rate the pair (‘TD’, ‘Crab Cakes’) as both ‘High’ and ‘Low’. Because distinct (Chef,Dish) pair ratings are extracted independently, they are associated to ratings independently.

Semantics. In W, there are 2³ possible worlds. For example, the probability of the singleton subset {(‘TD', ‘D. Lounge')} is 0.9 ∗ (1 − 0.7) ∗ (1 − 0.8) = 0.054. This representation is called a p-?-table [82], ?-table [51] or tuple independent [46]. The BID table R yields 3 ∗ 2 ∗ 3 = 18 possible worlds since each (Chef, Dish) pair is associated with at most one rating. When the probabilities shown sum to 1, there is at least one rating for each pair. For example, the probability of the world {r11, r21, r31} is 0.8 ∗ 0.3 ∗ 0.6 = 0.144.

Representable Views

Alice wants to ship her data to Bob, who wants a view with all chefs and restaurant pairs that make a highly rated dish. Alice obliges by computing and sending the following view:

V1(c, r) D W(c, r), S(r, d), R(c, d; ‘High’) (5.1)

In the following example data, we calculate the probability of a tuple appearing in the output both numerically, in the P column, and symbolically, in terms of the other tuple probabilities of Figure 5.1. We calculate the probabilities symbolically only for exposition; the output of a query or view is a set of tuples matching the head with associated probability scores.

Example 5.1.1 [Output of V1 from Eq. (5.1)]
  C    R            P      (Symbolic Probability)
  TD   D. Lounge    0.72   w1 r11                               (t1^o)
  TD   P. Kitchen   0.602  w2 (1 − (1 − r11)(1 − r21))           (t2^o)
  MS   C. Bistro    0.48   w3 r31                               (t3^o)

The output tuples t1^o and t2^o are not independent because both depend on r11. This lack of independence is problematic for Bob because a BID instance cannot represent this type of correlation; hence, we say V1 is not a representable view. For Bob to understand the data, Alice must ship the lineage of each tuple. For example, it would be sufficient to ship the symbolic probability polynomials in Example 5.1.1. Consider a second view where we can be much smarter about the amount of information necessary to understand the view. In V2, Bob wants to know which working chefs make and serve a highly rated dish:

V2(c) D W(c, r), S(r, d), R(c, d; ‘High’) (5.2)

Example 5.1.2 [Output of V2 from Eq. (5.2)] c P (Symbolic Probability)

TD 0.818 r11(1 − (1 − w1)(1 − w2)) + (1 − r11)(w2r21)

MS 0.48 w3r31
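The marginal probability 0.818 for ‘TD' can be checked by brute-force enumeration over the four independent events that matter (the two WorksAt tuples and the two relevant ‘High' ratings, which live in distinct BID blocks and are therefore independent); the sketch below is a worked check, not part of the system.

from itertools import product

w1, w2 = 0.9, 0.7          # TD works at D. Lounge / P. Kitchen
r11, r21 = 0.8, 0.3        # (TD, Crab Cakes) rated High / (TD, Lamb) rated High

total = 0.0
for bw1, bw2, br11, br21 in product([0, 1], repeat=4):
    p = ((w1 if bw1 else 1 - w1) * (w2 if bw2 else 1 - w2) *
         (r11 if br11 else 1 - r11) * (r21 if br21 else 1 - r21))
    # V2('TD') holds iff (w1 and r11) or (w2 and r11) or (w2 and r21)
    if (bw1 and br11) or (bw2 and br11) or (bw2 and br21):
        total += p
print(round(total, 3))      # 0.818, matching r11(1 - (1 - w1)(1 - w2)) + (1 - r11) w2 r21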

Importantly, Bob can understand the data in Example 5.1.2 with no auxiliary information because the set of events contributing to ‘TD' and those contributing to ‘MS' are independent. It can be shown that over any instance, all tuples contributing to distinct c values are independent. Thus, V2 is a representable view. This motivates us to understand the following fundamental question:

Problem 1 (View Representability). Given a view V, can the output of V be represented as a BID table?

It is interesting to note that efficiency and representability are distinct concepts: Computing the output of V1 can be done in polynomial time, but its result is not representable. On the other hand, computing V2 can be shown to be #P-hard [46, 135], but V2 is representable. To process V2, we must use Monte Carlo procedures (e.g., [137]), which are orders of magnitude more expensive than traditional SQL processing. We do not discuss evaluation further because our focus is not on efficiently evaluating views, but on representing the output of views.

Partially Representable Views

To optimize and share a larger class of views, we want to use the output of views that are not representable. The output of a non-representable view still has meaning: it contains the marginal probabilities that each tuple appears in the view, but may not describe how all the tuples in the view correlate. Consider the output of V1 in Example 5.1.1: although this view is not a BID table, there is still a lot of correlation information in the view definition: it is not hard to see that tuples that agree on c are correlated, but tuples with different c values are independent. To build intuition, notice that when a view V is a BID table, then, denoting K = Key(V), the following two properties hold: for any set of tuples, if any two tuples in the set differ on at least one of the K attributes, then the set of tuples is independent, and any two tuples that agree on the K attributes are disjoint. If the view V is not a BID table, but an arbitrary probabilistic relation, then we may still be able to find two sets of attributes, L and K, that satisfy these two properties separately. Formally:

Definition 5.1.3. Let (W, µ) be a probabilistic database over a single relation V.

• We say that (W, µ) is L-block independent, where L ⊆ Attr(V), if any set of tuples {t1,..., tn} s.t. ti.L ≠ tj.L, 1 ≤ i < j ≤ n, is independent.

• We say that (W, µ) is K-block disjoint, where K ⊆ Attr(V), if any two tuples t, t′ s.t. t.K = t′.K are disjoint. Equivalently, K is a key in each possible world of V.

Example 5.1.4 Recall the view V1 in Eq. (5.1), whose natural schema is V(C, R). It has a partial representation (L, K) with L = {C}, K = {C, R}. Tuples that differ on C are independent, but those that agree on C but differ on R may be correlated in unknown ways. Thus, the materialized view does have some meaning for Bob, but does not contain sufficient information to completely determine a probability distribution on its possible worlds.

Trivially, any view V(H) can be partially represented by letting L = ∅ and K = H. At the other extreme, a BID table is one in which L = K. Intuitively, we want a “large” L and a “small” K. Thus the interesting question is: What is the complexity to decide, for a given K, L, if a view V is partially representable? This is a generalization of Problem 3, and this is one of our main technical results in this chapter.

Using Views to Answer Queries

In an information exchange setting, we do not have access to the base data, and so, for partially representable views, we may not know how all tuples are correlated. Thus, to answer a query Q, we need to ensure that the value of Q does not depend on correlations that are not captured by the partial representation. To illustrate, we attempt to use the output of V1, a partially representable view, to answer two queries.

Example 5.1.5 Suppose Bob has a local relation, L(d; r), where d is a possible worlds key for L. Bob wants to answer the following queries:

Qu(d) D L(d; r), V1(c, r) and Qn(c) D V1(c, r)

Since the materialized view V1 does not uniquely determine a probability distribution on its possible worlds, it is not immediately clear that Bob can answer these queries without access to Alice's database and to V1's definition. However, the query Qu is uniquely defined. For any fixed d0, the events L(d0; ri), for the different ri values, are disjoint (d is a possible worlds key for L), so we can sum over them:

µ(Qu(d0)) = Σ_{ri : I |= L(d0; ri)} µ(L(d0; ri) ∧ ∃c V1(c, ri))

The partial representation V1(C; R; ∅) tells us that distinct values for c in V1 are independent. Thus, each term in the summation is uniquely defined, implying that Qu is uniquely defined. In contrast, Qn is not uniquely defined because Qn(‘TD') is true when either of t1^o or t2^o is present; our partial representation does not capture the correlation of the pair (t1^o, t2^o).

This example motivates the following problem:

Problem 2 (View Answering). Let (L, K) be a partial representation for a view V. Given a query Q using V, is the value of Q uniquely defined?

We will solve this problem in the case when V is a conjunctive view and Q is any monotone Boolean property. A special case of this general result is when Q is specified by a conjunctive query.

Relationship with Query Evaluation. As we noted earlier, efficient query evaluation for a view and its degree of representability are distinct concepts. When a query has a PTIME algorithm, it is called a safe query [46]. It can be seen that the super-safe plan optimization of Chapter 3.3 applies to those plans which are representable, safe, and do not contain self joins. Thus, progress on the view answering problem allows us to extend this technique to a larger class of subplans. For the case of SQL-like queries (conjunctive queries), the results in this chapter give a complete solution for the view answering problem, which allows us to build a query optimizer for large-scale probabilistic databases.

5.2 Query Answering using Probabilistic Views

We begin by discussing a very general problem: how to find a partial representation (L, K) for an arbitrary probabilistic relation; we show that there is always a unique maximal choice for L in the partial representation such that a weaker independence property holds (2-independence). Then, we show that if a view V is defined in first-order logic (FO), its output has a unique maximal L such that

our stronger property of total independence holds. We then show that if V is defined by a conjunctive query, then we can find L with a Π2P algorithm (and moreover that the related decision problem is

Π2P-complete). Finally, we discuss the problem of answering queries using partially represented views. The technical issue is to decide if a query using a view has a uniquely defined value1. We show that this property is decidable for conjunctive queries and conjunctive views, and moreover is complete for Π2P as well.

5.2.1 Partial Representation of a Probabilistic Table

In this section we will not assume that a table V is a block disjoint-independent (BID) table, but rather allow it to be an arbitrary probability space over the set of possible instances for V. The intuition is that V is a view computed by evaluating a query over some BID database: the output of the view is, in general, not a BID table, but some arbitrary probabilistic database (W, µ).

In general, a probabilistic table V may admit more than one partial representation. For example, every probabilistic table V admits the trivial partial representation L = ∅ and K = Attr(V), but this representation is not useful to answer queries, except for the most trivial queries that check for the presence of a single ground tuple. Intuitively, we want a "large" L and a "small" K. It is easy to check that we can always relax the representation in the other way: if (L, K) is a partial representation, and L′, K′ are such that L′ ⊆ L and K ⊆ K′, then (L′, K′) is also a partial representation. Of course, we want to go in the opposite direction: increase L, decrease K. We will give several results in the following sections showing that a unique maximal L exists, and will explain how to compute it. On the other hand, no minimal K exists in general; we will show that the space of possible choices for K can be described using standard functional dependency theory.

It is not obvious at all that a maximal L exists, and in fact it fails in the most general case, as shown by the following example.

Example 5.2.1 Consider a probabilistic table V(A, B, C) with three possible tuples:

1This presentation in this section is from “Probabilistic query answering and query answering using Views” in the Journal of Computer and System Sciences [49]. 100

T : A B C

a b c t1 0 a b c t2 0 0 a b c t3 and four possible worlds: I1 = ∅, I1 = {t1, t2}, I2 = {t2, t3}, I3 = {t1, t3}, each with probability

1/4. Any two tuples are independent: indeed µ(t1) = µ(t2) = µ(t3) = 1/2 and µ(t1t2) = µ(t1t3) =

µ(t2t3) = 1/4. V is AB-block independent: this is because the only sets of tuples that differ on AB are {t1, t2} and {t1, t3}, and they are independent. Similarly, V is also AC-block independent. But V is not ABC-block independent, because any two tuples in set {t1, t2, t3} differ on ABC, yet the entire set is not independent: µ(t1t2t3) = 0. This shows that there is no largest set L: both AB and AC are maximal.
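The following few lines check the example numerically: every pair of tuples is independent under the four equally likely worlds, but the triple is not.

from itertools import combinations

worlds = [set(), {"t1", "t2"}, {"t2", "t3"}, {"t1", "t3"}]   # each with probability 1/4

def prob(tuples):
    return sum(1 for w in worlds if set(tuples) <= w) / len(worlds)

for s, t in combinations(["t1", "t2", "t3"], 2):
    assert abs(prob([s, t]) - prob([s]) * prob([t])) < 1e-12   # every pair is independent

print(prob(["t1", "t2", "t3"]))                     # 0.0
print(prob(["t1"]) * prob(["t2"]) * prob(["t3"]))   # 0.125: the triple is not independent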

A weaker result holds. For a set L ⊆ Attr(V) we say that V is L-block 2-independent if any two tuples t1, t2 s.t. t1.L ≠ t2.L are independent.

Lemma 5.2.2. Let V be a probabilistic table. Then there exists a maximal set L s.t. V is L-block 2-independent.

Proof. We prove the following: if L1, L2 ⊆ Attr(V) are such that V is Li-block 2-independent for each i = 1, 2, then V is also L-block 2-independent, where L = L1 ∪ L2. Indeed, let t1, t2 be two tuples s.t. t1.L ≠ t2.L. Then either t1.L1 ≠ t2.L1 or t1.L2 ≠ t2.L2, hence t1, t2 are independent tuples. The largest set L claimed by the lemma is then the union of all sets L′ s.t. V is L′-block 2-independent.

Continuing Example 5.2.1 we note that V is ABC-block 2-independent, since any two of the tuples t1, t2, t3 are independent.

5.2.2 Partial Representation of a probabilistic lineage database

Recall that if V is a view defined by an FO query over a BID database, then V can be expressed as a probabilistic lineage database. In this section we study the partial representation of a probabilistic lineage database. Recall that a lineage database (λ, µ) consists of two parts: λ is a function that

assigns each tuple a Boolean expression and µ is a product probability space on the set of variables X¯ that are used in the Boolean expressions of λ. Our main result in this section is the following: given the lineage function λ and a particular relation V, there exists a maximal set of attributes L ⊆ Attr(V) such that for any product probability space µ, V is L-block independent. Thus, if V is a view defined by an FO query over a BID database, then this result shows us how to compute a good partial representation for the view.

Let X̄ = {X1,..., Xm} be the variables used in the Boolean expressions in λ, and let Dom(Xj) be the finite domain of values for the variable Xj, j = 1, m. The annotated table V consists of n tuples t1,..., tn, each annotated with a Boolean expression λ(t1), ..., λ(tn) obtained from atomic formulas of the form Xj = v, where v ∈ Dom(Xj), and the connectives ∧, ∨, and ¬. A valuation θ is a function θ : X̄ → ∪_j Dom(Xj) s.t. θ(Xj) ∈ Dom(Xj), for j = 1, m. The following definition is adapted from Miklau and Suciu [119].

Definition 5.2.3. Let ϕ be a Boolean expression over variables X̄. A variable Xj is called a critical variable for ϕ if there exists a valuation θ for the variables X̄ − {Xj} and two values v′, v″ ∈ Dom(Xj) s.t. ϕ[θ ∪ {(Xj, v′)}] ≠ ϕ[θ ∪ {(Xj, v″)}].

For a simple illustration, suppose Dom(X1) = Dom(X2) = Dom(X3) = {0, 1, 2}, and consider the Boolean expression

ϕ ≡ (X1 = 0) ∨ (X1 = 1) ∨ ((X3 = 1) ∧ ¬(X1 = 2))

Then X2 is not a critical variable for ϕ, because it is not mentioned in the expression of ϕ. X3 is also not a critical variable, because ϕ simplifies to (X1 = 0) ∨ (X1 = 1) (since Dom(X1) = {0, 1, 2}). On the other hand, X1 is a critical variable: by changing X1 from 0 to 2 we change ϕ from true to false.

In notation, if θ is the valuation {(X2, 0), (X3, 0)}, then ϕ[θ ∪{(X1, 0)}] = true, ϕ[θ ∪{(X1, 2)}] = false. The main result in this section is based on the following technical lemma:

Lemma 5.2.4. Let ϕ, ψ be two Boolean expressions. Then the following two statements are equiva- lent:

• For every product probability space µ on the set of variables X,¯ ϕ and ψ are independent events. 102

• ϕ and ψ have no common critical variables.

Proof. The "if" direction is straightforward: if ϕ and ψ have no common critical variables then, since each expression is a function of its critical variables only, they depend on disjoint sets of variables; hence µ(ϕ ∧ ψ) = µ(ϕ)µ(ψ), and they are independent for any choice of µ.

The “only if” direction was shown in [119] for the case when all variables X j are Boolean, i.e. |Dom(X j)| = 2. We briefly review the proof here, then show how to extend it to non-Boolean variables. Given a probability spaces (Dom(X j), µ j), denote x j = µ j(X j = 1), hence µ j(X j = 0) =

1 − x j. Then µ(ϕ) is a polynomial in the variables x1,..., xm where each variable has degree ≤ 1.

(For example, if ϕ = ¬(X1 ⊗ X2 ⊗ X3) (exclusive or) then µ(ϕ) = x1 x2(1 − x3) + x1(1 − x2)x3 +

(1 − x1)x2 x3 + (1 − x1)(1 − x2)(1 − x3), which is a polynomial of degree 1 in x1, x2, x3.) One can check that if X j is a critical variable for ϕ then the degree of x j in the polynomial µ(ϕ) is 1; on the other hand, if X j is not a critical variable, then the degree of x j in the polynomial µ(ϕ) is 0 (in other words the polynomial does not depend on x j). The identity µ(ϕ)µ(ψ) = µ(ϕ ∧ ψ) must hold for any values of x1,..., xm, because ϕ, ψ are independent for any µ. If X j were a common critical variable for both ϕ and ψ then the degree of x j in the left hand side polynomial is 2, while the right hand side has degree at most 1, which is a contradiction.

We now extend this proof to non-Boolean domains. In this case a variable X j may take values

0, 1,..., d j, for d j ≥ 1. Define the variables xi j to be xi j = µ(X j = i), for i = 1,..., d j, thus

µ(Xj = 0) = 1 − x_{1j} − x_{2j} − · · · − x_{d_j j}. As before, µ(ϕ) is a polynomial of degree 1 in the variables x_{ij} with the additional property that if i1 ≠ i2 then x_{i1 j} and x_{i2 j} cannot appear in the same monomial.

We still have the identity µ(ϕψ) = µ(ϕ)µ(ψ) for all values of the variables x_{ij} (since the identity holds on the set where x_{ij} ≥ 0 for all i, j and Σ_i x_{ij} ≤ 1 for all j, and this set has a non-empty interior).

If X j is a critical variable for ϕ then µ(ϕ) must have a monomial containing some xi1 j; if it is also critical for ψ, then µ(ψ) has a monomial containing xi2 j. Hence their product contains xi1 j · xi2 j, contradiction. 
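The definitions above can be exercised with a brute-force sketch: it recovers the critical variables of an expression by trying all valuations (feasible only for tiny domains, of course) and then applies the disjointness test of Lemma 5.2.4. The expression ϕ is the one from the example following Definition 5.2.3; ψ is a hypothetical second expression.

from itertools import product

def critical_variables(phi, domains):
    crit = set()
    names = list(domains)
    for x in names:
        others = [v for v in names if v != x]
        for vals in product(*(domains[v] for v in others)):
            theta = dict(zip(others, vals))
            outcomes = {phi({**theta, x: v}) for v in domains[x]}
            if len(outcomes) > 1:        # two values of x change phi under theta
                crit.add(x)
                break
    return crit

domains = {"X1": [0, 1, 2], "X2": [0, 1, 2], "X3": [0, 1, 2]}
phi = lambda a: a["X1"] == 0 or a["X1"] == 1 or (a["X3"] == 1 and a["X1"] != 2)
psi = lambda a: a["X2"] == 1

print(critical_variables(phi, domains))   # {'X1'}: X2 is unused and X3 is redundant
print(critical_variables(phi, domains) & critical_variables(psi, domains))
# empty set: by Lemma 5.2.4, phi and psi are independent under every product distribution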

We will use the lemma to prove the main result in this section. Let λV denote the lineage function

λ restricted to tuples in V. Thus, (λV , µ) defines a probabilistic database with a single relation, V.

Theorem 5.2.5. Let V be a table in a lineage database (λ, µ). Then there exists a unique maximal set of attributes L such that, for any product probability space µ, the lineage database (λV , µ) is L-block independent.

Proof. Denote by t1,..., tn the tuples of V, and let λ(t1), ..., λ(tn) be their Boolean expression annotations. Let L be a set of attributes. We prove the following:

Lemma 5.2.6. The following three statements are equivalent:

1. For any µ, the pdb (λV , µ) is L-block independent.

2. For any µ, the pdb (λV , µ) is L-block 2-independent.

3. For any two tuples ti, tj, if ti.L ≠ tj.L then the Boolean expressions λ(ti) and λ(tj) do not have any common critical variables.

Proof. Obviously (1) implies (2), and from the Lemma 5.2.4 it follows that (2) implies (3) (since

µ(ti) = µ(λ(ti)) and µ(ti, tj) = µ(λ(ti) ∧ λ(tj))). For the last implication, let µ be any product probability space, and consider some m tuples t_{i1},..., t_{im} that have distinct values for their L attributes.

Then, the Boolean expressions λ(ti1 ), . . . , λ(tim ) depend on disjoint sets of Boolean variables, hence

µ(t1,..., tm) = µ(ϕi1 , . . . , ϕim ) = µ(ϕi1 ) ··· µ(ϕim ) = µ(t1) ··· µ(tm), proving that they are independent. Thus, the three statements above are equivalent. 

Continuing the proof of Theorem 5.2.5, consider two sets of attributes L1, L2 such that each satisfies condition (3) in Lemma 5.2.6. Then their union, L1 ∪ L2, also satisfies condition (3): indeed, if ti and tj are two tuples such that ti.(L1 ∪ L2) ≠ tj.(L1 ∪ L2), then either ti.L1 ≠ tj.L1 or ti.L2 ≠ tj.L2, and in either case λ(ti) and λ(tj) do not have any common critical variables. It follows that there exists a maximal set of attributes L that satisfies condition (3), which proves the theorem.

As an application of this result, we show how to apply it to a view V defined by an FO expression over a BID database. Let R be a relational schema, and let PDB = (T, µ) be a BID database, where T is a set of possible tuples.

Corollary 5.2.7. For every FO view definition v over the relational schema R and for every set of possible tuples T there exists a unique maximal set of attributes L such that: for any BID database PDB = (T, µ) the probabilistic view v(T, µ) is L-block independent.

The proof follows immediately from Theorem 5.2.5 and the observation that the view v(PDB) is a lineage database. We end this section by noting that, in general, no unique minimal K exists. For example, the lineage table below has two minimal keys, {A} and {B}:

A B

a1 b1 X = 1 ∧ Y = 1

a1 b2 X = 1 ∧ Y = 2

a2 b1 X = 2 ∧ Y = 1

a2 b2 X = 2 ∧ Y = 2

Therefore, any lineage database defined over this table is both A-block disjoint, and B-block disjoint, but it is not ∅-block disjoint.

5.2.3 Partial Representation of a Conjunctive View

In the previous section we have shown how to compute a partial representation for a given view (expressed in FO) and a given input BID database. We now study how to compute a partial representation given only the view expression, and not the input database. In this case we seek a partial representation (L, K) that is satisfied by the view for any input BID database. This partial representation depends only on the view expression, not the data instance, and is computed through static analysis on the view expression only. Throughout this section we will restrict the view expression to be a conjunctive query. Fix the schema R of a BID database, and let V be defined by a conjunctive query v over R. Given a BID input, PDB, we denote by V = v(PDB) the probabilistic table obtained by evaluating the view on the BID input. We prove two results in this section.

Theorem 5.2.8. Fix a relational schema R.

1. For any conjunctive query v there exists a unique maximal set of attributes L ⊆ Attrs(v) such that for any BID database PDB over the schema R, v(PDB) is L-block independent.

2. The problem “given a conjunctive query v and L ⊆ Attrs(v) check whether for any BID

database PDB, v(PDB) is L-block independent”, is in Π2P. Moreover, there exists a schema

R for which this problem is Π2P-hard.

The theorem, in essence, says that there exists a unique maximal set of attributes L, and computing such a set is Π2P-complete in the size of the conjunctive query v. To obtain a good partial

representation (L, K) for the view v, we also need to compute K: finding K is equivalent to inferring functional dependencies on the output of a conjunctive view, from the key dependencies in the input schema R. This problem is well studied [1], and we will not discuss it further, but note that, in general, there may not be a unique minimal set K. Before proving Theorem 5.2.8 we give some examples of partial representations for conjunctive views.

Example 5.2.9 1. Consider the schema R(A), S (A, B, C, D), and the view:

v(x, y, z) D R(x), S (x, y, z, u)

Denote V(X, Y, Z) the schema of the materialized view. A partial representation for V is (X, XY). To see that V is X-block independent, notice that if two tuples in V differ on their X attribute, then the lineage of the two tuples depends on disjoint sets of input tuples in R and S . To see that V is XY-block disjoint, it suffices to see that the functional dependency XY → Z holds in V (because it holds in S ). Thus, (X, XY) is a partial representation for V, and one can see that it is the best possible (that is, we cannot increase X nor decrease XY).

2. Consider the schema R(A, B, C), S (A, C, B) and the view:

v(x, y, z) D R(x, y, z), S (x, z, y)

Here V is X-block independent. In addition, V is both XY-block disjoint and XZ-block disjoint, but it is not X-block disjoint. There are two "best" partial representations: (X, XY) and (X, XZ).

In the remainder of this section we will prove Theorem 5.2.8, and start with Part 1. Fix a BID probabilistic database PDB. Then V = v(PDB) is a lineage database: by Theorem 5.2.5 there exists a unique maximal set of attributes L_PDB such that V is L_PDB-block independent (for any choice of the probability function in PDB). Then the set of attributes ∩_PDB L_PDB is the unique, maximal set of attributes claimed by the theorem.

Before proving part 2 of Theorem 5.2.8, we need to review the notion of a critical tuple for a Boolean query q.

Definition 5.2.10. A ground tuple t is called critical for a Boolean query q if there exists a (conventional) database instance I s.t. q(I) ≠ q(I ∪ {t}).

A tuple is critical if it makes a difference for the query. For a simple illustration, consider the Boolean query q D R(x, x), S(a, x, y), where a is a constant. Then R(b, b) (for some constant b) is a critical tuple because q is false on the instance I = {S(a, b, c)} but true on the instance {R(b, b), S(a, b, c)}. On the other hand, R(b, c) is not a critical tuple. In general, if the query q is a conjunctive query, then any critical tuple must be the ground instantiation of a subgoal. The converse is not true, as the following example due to Miklau and Suciu [119] shows: q D R(x, y, z, z, u), R(x, x, x, y, y). The tuple t = R(a, a, b, b, c), which is a ground instantiation of the first subgoal, is not a critical tuple. Indeed, if q is true on I ∪ {t}, then only the first subgoal can be mapped to t, and therefore the second subgoal is mapped to the ground tuple R(a, a, a, a, a), which must be in I: but then q is also true on I, hence t is not critical. We review here the main results of Miklau and Suciu [119]. While these results were shown for a relational schema R where the key of each relation is the set of all its attributes (in other words, any BID database over schema R is a tuple-independent probabilistic database), they extend immediately to BID probabilistic databases (the main step of the extension consists of Lemma 5.2.4).

Theorem 5.2.11. [119] Fix a relational schema R.

1. Let q, q0 be two Boolean conjunctive queries over the schema R. Then the following two statements are equivalent: (a) q and q0 have no common critical tuples, (b) for any BID probabilistic database over the schema R, q and q0 are independent events.

2. The problem “given two Boolean queries q, q0, check whether they have no common critical

tuples” is in Π2P.

3. There exists a schema R such that the problem “given a Boolean query q and a ground tuple

t, check whether t is not a critical tuple for q", is Π2P-hard.

We now prove part 2 of Theorem 5.2.8. Given a set of attributes L, the following two conditions are equivalent: (a) for any input PDB, v(PDB) is L-block independent, and (b) for any two distinct grounded |L|-tuples ā, b̄, the two Boolean queries v(ā) and v(b̄) have no common critical tuples. (Here v(ā) denotes the Boolean query where the head variables corresponding to the L attributes are substituted with ā, while the rest of the head variables are existentially quantified.) The equivalence between (a) and (b) follows from Lemma 5.2.6 (2) and Theorem 5.2.11 (1). Membership in Π2P follows now from property (b) and Theorem 5.2.11 (2).

To prove hardness for Π2P, we use Theorem 5.2.11 (3): we reduce the problem "given a query q and a tuple t check whether t is not a critical tuple for q" to the problem (b) above. Let R be the vocabulary for q and t, and let the ground tuple t be T(a1,..., ak). Construct a new vocabulary R′ obtained from R by adding two new attributes to each relation name: that is, if R(A1,..., Am) is a relation name in R, then R′(U, V, A1,..., Am) is a relation name in R′. Let U, V be two variables. Denote by q′(U, V) the query obtained from q by replacing every subgoal R(...) in q with R′(U, V, ...) (thus, the variables U, V will occur in all subgoals of q′(U, V)), and define the following view:

v(U, V) D q′(U, V), T(V, U, a1,..., ak)

We show that v is UV-block independent iff t is not a critical tuple for q. For that, we consider two distinct constant tuples (u1, v1) and (u2, v2) and examine whether the Boolean queries v(u1, v1) and v(u2, v2) are independent, or, equivalently, have no common critical tuples. All critical tuples of v(u1, v1) must have the form R(u1, v1,...) or T(v1, u1,...); in other words, they must start with the constants u1, v1 or with v1, u1. Similarly for v(u2, v2); hence, if {u1, v1} ≠ {u2, v2} then the queries v(u1, v1) and v(u2, v2) have no common critical tuples. The only case when they could have common critical tuples is when u1 = v2 and u2 = v1 (since (u1, v1) ≠ (u2, v2)), and in that case they have a common critical tuple iff T(u1, v1, a1,..., ak) is a critical tuple for q′(u1, v1), and this happens iff

T(a1,..., ak) is a critical tuple for q. This completes the hardness proof.

5.2.4 Querying Partially Represented Views

We have shown how to compute a “best” partial representation (L, K) for a materialized view V: tuples with distinct values for L are independent, while tuples that agree on K are disjoint. All other pairs of tuples, which we call intertwined, have unspecified correlations. In this section we study the problem of deciding whether a query q can be answered by using only the marginal probabilities in V: we say that q is well-defined in terms of the view and its partial representation. Intuitively, q does not look at pairs of intertwined tuples. This problem is complementary to the query answering using views problem [85]: there, we are given a query q over a conventional database and a set of views, and we want to check if q can be rewritten into an equivalent query q′ that uses the views. We assume that the rewriting has already been done, thus q already mentions the view(s). We illustrate with an example:

Example 5.2.12 Let V(A, B, C) have the following partial representation: L = A, K = AB. Consider the following queries:

q1 D V(a, y, z)

q2 D V(x, b, y)

q3 D V(x, y, c)

Here x, y, z are variables, and a, b, c are constants. For example, the view could be that from Example 5.2.9 (1), v(x, y, z) D R(x), S(x, y, z, u), and the query q1 could be q1 D R(a), S(a, y, z, u): after rewriting q1 in terms of the view, we obtain the equivalent expression q1 D V(a, y, z).

We argue that q2 is well-defined, while q1, q3 are not. Consider first q2. Its value depends only on the tuples of the form (ai, b, c j), where the constants ai and c j range over the active domain, and b is the fixed constant occurring in the query. Partition these tuples by ai. For each i = 1, 2,..., any two tuples in the set defined by ai are disjoint (because V satisfies the partial representation (A, AB), and any two tuples in the same group agree on both A and B): thus, the Boolean query

∃z.V(ai, b, z) is a disjunction of the exclusive events V(ai, b, cj), and therefore its probability is the sum of the probabilities µ(V(ai, b, cj)). Moreover, the set of events {∃z.V(a1, b, z), ∃z.V(a2, b, z),...},

is independent, which allows us to compute the probability of the query q2, since it is the disjunction of these independent events: q2 ≡ ∃x.∃z.V(x, b, z). Thus, assuming that the view V satisfies the partial representation (A, AB), the probability of q2 depends only on the marginal tuple probabilities in the view V.

In contrast, neither q1 nor q3 is well-defined. To see this, suppose that the view has exactly two tuples, t1 = (a, b1, c) and t2 = (a, b2, c). These tuples are intertwined, i.e., the probability µ(t1 t2) is unknown. Further, µ(q1) = µ(q3) = µ(t1 ∨ t2) = µ(t1) + µ(t2) − µ(t1 t2): µ(t1) and µ(t2) are well defined in the view, but µ(t1 t2) is unknown. So, neither q1 nor q3 is well-defined.

In this section we consider Boolean, monotone queries q, which includes conjunctive queries. We assume that q is over a single view V, and mentions no other relations. This is not too restrictive, since q may have self-joins over V, or unions (since we allow arbitrary monotone queries). It is straightforward to extend our results to a query expressed over multiple views V1, V2,..., each with its own partial representation, assuming that all views are independent.

Definition 5.2.13. Let V be a view with a partial representation (L, K), and let q be a monotone Boolean query over the single relation name V. We say that q is well-defined given the partial representation (L, K), if for any two probabilistic relations PV = (W, µ) and PV′ = (W′, µ′) that satisfy the partial representation (L, K), and that have identical tuple probabilities², the following holds: µ(q) = µ′(q).

Thus, q is well defined iff µ(q) depends only on the marginal tuple probabilities µ(t) (which we know), and not on the entire distribution µ (which we don’t know). We will now give a necessary and sufficient condition for q to be well defined, but first we need to introduce two notions: intertwined tuples, and a set of critical tuples.

Definition 5.2.14. Let (L, K) be a partial representation of a view V. Let t, t′ be two ground tuples of the same arity as V. We call t, t′ intertwined if t.L = t′.L and t.K ≠ t′.K.

Next, we generalize a critical tuple (see Definition 5.2.10) to a set of critical tuples. Let Inst = P(Tup) be the set of (conventional) database instances over the set of ground tuples Tup. To each Boolean query q we associate the real-valued function fq : Inst → R:

² ∀t: µ(t) = µ′(t)

fq(I) = 1 if q(I) is true, and fq(I) = 0 if q(I) is false.

Definition 5.2.15. Let f : Inst → R be a real-valued function on instances. The differential of f w.r.t. a set of tuples S ⊆ Tup is the real-valued function ∆S f : Inst → R defined as follows:

∆∅ f(I) = f(I)

∆{t}∪S f(I) = ∆S f(I) − ∆S f(I − {t})   if t ∉ S

Definition 5.2.16. A set of tuples C ⊆ Tup is critical for f if there exists an instance I s.t. ∆C f(I) ≠ 0. A set of tuples C is critical for a Boolean query q if it is critical for fq.
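The differential can be evaluated directly from its recursive definition, and set-criticality can then be checked by brute force over a small tuple universe. The following sketch is illustrative only; the helper names and the predicate encoding of fq are assumptions made for the example.

    from itertools import combinations

    def instances(universe):
        s = list(universe)
        return (frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r))

    def delta(f, S, I):
        """(Δ_S f)(I): Δ_∅ f(I) = f(I); Δ_{t}∪S f(I) = Δ_S f(I) - Δ_S f(I - {t}) for t ∉ S."""
        S = list(S)
        if not S:
            return f(frozenset(I))
        t, rest = S[0], S[1:]
        return delta(f, rest, I) - delta(f, rest, frozenset(I) - {t})

    def set_is_critical(f, C, universe):
        """C is critical for f iff (Δ_C f)(I) != 0 for some instance I over `universe`."""
        return any(delta(f, C, I) != 0 for I in instances(universe))

    # For the query q(I) = (t1 ∈ I), the pair {t1, t2} is not critical (q ignores t2),
    # while the singleton {t1} is.
    fq = lambda I: 1 if 't1' in I else 0
    print(set_is_critical(fq, ['t1', 't2'], {'t1', 't2'}))   # False
    print(set_is_critical(fq, ['t1'], {'t1', 't2'}))         # True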

We can now state the main result in this section:

Theorem 5.2.17. Let V be a view with a partial representation (L, K).

1. A monotone Boolean query q over V is well defined iff for any two intertwined tuples t, t′ the set {t, t′} is not critical for q.

2. The problem “given a Boolean conjunctive query q over the view V, decide whether q is well defined” is in Π2P. Moreover, there exists a view V and a partial representation (L, K) for which this problem is Π2P-hard.

Thus, in order to evaluate a query q using a view V with partial representation (L, K) one proceeds as follows. First, we check if q is well defined, by checking if it has no pair of intertwined, critical tuples: this is a Π2P-complete problem in the size of the query q. Second, if this holds, then we evaluate q over V by assuming V is a BID table with key attributes L; or, alternatively, we may assume a BID table with key attributes K. The well-definedness condition ensures that we obtain the same answer over either of these two BID tables as over the view V. In the rest of the section we prove Theorem 5.2.17, and for that we give a definition and three lemmas.

Definition 5.2.18. Let T ⊆ Tup be a set of tuples. The restriction of a real-valued function f to T is: fT(I) = f(I ∩ T). Similarly, the restriction of a Boolean query q to T is: qT(I) = q(I ∩ T).

The first lemma establishes some simple properties of critical sets of tuples. Note that a set of tuples C is not critical for f iff ∆C f = 0, meaning ∀I : ∆C f (I) = 0.

Lemma 5.2.19. Let C be a set of tuples and suppose ∆C f = 0. Then:

1. For any set of tuples D ⊇ C, ∆D f = 0.

2. For any set of tuples S , ∆C∆S f = 0.

3. For any set of tuples T, ∆C fT = 0.

Proof. (1): we show that ∆D f(I) = 0 by induction on the size of D. If D = C then it follows from the assumption that ∆C f = 0. We show this for D ∪ {t}, where t ∉ D: ∆D∪{t} f(I) = ∆D f(I) − ∆D f(I − {t}) = 0 − 0 = 0. (2): ∆C∆S f(I) = ∆S∪C f(I), and the latter is 0 by the previous claim. (3): ∆C fT(I) = ∆C f(I ∩ T) = 0, because ∆C f = 0.

The second lemma gives a series expansion for any real-valued function f : Inst → R, in terms of its differentials.

Lemma 5.2.20. For any set of tuples T ⊆ Tup:

f = ∑_{S⊆T} ∆S f^{Tup−(T−S)}    (5.3)

As a consequence:

f = ∑_{S⊆Tup} ∆S f^S    (5.4)

Equation (5.3) can be written equivalently as f(I) = ∑_{S⊆T} ∆S f(I − (T − S)). For example, by setting T = {t} or T = {t1, t2} in Eq. (5.3) we obtain:

f(I) = f(I − {t}) + ∆t f(I)

f(I) = f(I − {t1, t2}) + ∆t1 f(I − {t2}) + ∆t2 f(I − {t1}) + ∆t1,t2 f(I)

Proof. (Sketch) We prove (5.3) by induction on the size of the set T. The first example above shows the base case. Assuming t ∉ T, we can split the sum over S ⊆ T ∪ {t} into two sums: one iterating over S where S ⊆ T and the other iterating over S ∪ {t} where S ⊆ T:

∑_{S⊆T∪{t}} ∆S f^{Tup−(T∪{t}−S)}(I)
  = ∑_{S⊆T} ∆S f^{Tup−(T∪{t}−S)}(I) + ∑_{S⊆T} ∆S∪{t} f^{Tup−(T−S)}(I)
  = ∑_{S⊆T} ∆S f^{Tup−(T∪{t}−S)}(I) + ∑_{S⊆T} ∆S f^{Tup−(T−S)}(I) − ∑_{S⊆T} ∆S f^{Tup−(T−S)}(I − {t})
  = ∑_{S⊆T} ∆S f^{Tup−(T−S)}(I) = f(I)

The last identity holds because f^{Tup−(T∪{t}−S)}(I) = f^{Tup−(T−S)}(I − {t}). This completes the proof of (5.3). To prove (5.4) we set T = Tup in (5.3).
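The expansion can be checked numerically. The sketch below is illustrative only (it repeats the small differential helper for self-containment) and verifies Eq. (5.4) for an arbitrary real-valued function over a three-tuple universe.

    import random
    from itertools import combinations

    def powerset(xs):
        xs = list(xs)
        return [frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]

    def delta(f, S, I):
        S = list(S)
        if not S:
            return f(frozenset(I))
        t, rest = S[0], S[1:]
        return delta(f, rest, I) - delta(f, rest, frozenset(I) - {t})

    Tup = {'t1', 't2', 't3'}
    table = {I: random.randint(0, 5) for I in powerset(Tup)}   # an arbitrary f : Inst -> R
    f = lambda I: table[frozenset(I)]

    for I in powerset(Tup):
        # Eq. (5.4): f(I) = sum over S ⊆ Tup of Δ_S f^S(I), where f^S(J) = f(J ∩ S)
        rhs = sum(delta(lambda J: f(frozenset(J) & S), S, I) for S in powerset(Tup))
        assert rhs == f(I)
    print("Eq. (5.4) holds on every instance over", sorted(Tup))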

For the third lemma, we fix the partial representation (L, K) of the view.

Definition 5.2.21. A set of ground tuples T is non-intertwined, or NIT, if ∀t, t′ ∈ T, t and t′ are not intertwined. In other words: ∀t, t′ ∈ T, either t.L ≠ t′.L or t.K = t′.K.

Lemma 5.2.22. Let (L, K) be a partial representation of a view V, and let q be a monotone Boolean query over V. Assume q has no critical pairs of intertwined tuples, and let T be a NIT set of tuples. Then qT is well defined given the partial representation (L, K) of the view V.

Proof. A minterm for qT is a minimal instance J s.t. qT(J) is true; that is, J is a set of tuples s.t. qT(J) is true and for all J′ ⊆ J, if qT(J′) is true then J = J′. Denote M the set of all minterms for qT. Obviously, each minterm for qT is a subset of T. Since qT is monotone (because q is monotone), it is uniquely determined by M:

qT(I) = ∨_{J∈M} (J ⊆ I)

In other words, qT is true on an instance I iff the set of tuples I contains a minterm J. Denote by rJ the Boolean query rJ(I) = (J ⊆ I); we apply the inclusion-exclusion formula to derive:

µ(qT) = µ( ∨_{J∈M} rJ ) = ∑_{N⊆M, N≠∅} (−1)^{|N|+1} µ(r^{∪N})

Finally, we observe that for each N ⊆ M, the expression µ(r^{∪N}) is well defined. Indeed, the set J = ∪N is the union of the minterms in N, thus it is a subset of T, hence it is a NIT set. If J = {t1, t2,...}, the query rJ simply checks for the presence of all tuples t1, t2,...; in more familiar notation µ(rJ) = µ(t1 t2 ···). If the set J contains two disjoint tuples (ti.K = tj.K) then µ(t1 t2 ···) = 0. Otherwise, it contains only independent tuples (ti.L ≠ tj.L), hence µ(t1 t2 ···) = µ(t1)µ(t2) ···. In either case it is well defined and, hence, so is µ(qT).
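The argument in this proof is effectively an algorithm for computing µ(qT) over a NIT set. The sketch below is a direct transcription under an encoding of our own (tuples are Python tuples, K is given by attribute positions, and the numeric instance is hypothetical); it mirrors the disjoint/independent case analysis above.

    from itertools import combinations

    def prob_from_minterms(minterms, K_attrs, p):
        """µ(q^T) for a monotone query given by its minterms over a NIT set T.
        minterms: list of sets of tuples; K_attrs: positions of the K attributes;
        p: dict tuple -> marginal probability."""
        def conj_prob(J):
            # distinct tuples agreeing on K are disjoint, so their conjunction has probability 0
            for s, t in combinations(J, 2):
                if all(s[i] == t[i] for i in K_attrs):
                    return 0.0
            # otherwise (T is NIT) all pairs differ on L, hence are independent
            prod = 1.0
            for t in J:
                prod *= p[t]
            return prod
        total = 0.0
        for r in range(1, len(minterms) + 1):
            for N in combinations(minterms, r):
                J = frozenset().union(*N)
                total += (-1) ** (r + 1) * conj_prob(J)
        return total

    # Hypothetical view V(A, B, C) with L = {A}, K = {A, B}:
    # the query ∃x ∃z V(x, 'b', z), restricted to three tuples that satisfy it individually.
    p = {('a1', 'b', 'c1'): 0.3, ('a1', 'b', 'c2'): 0.2, ('a2', 'b', 'c1'): 0.5}
    minterms = [{t} for t in p]
    print(prob_from_minterms(minterms, K_attrs=[0, 1], p=p))   # ≈ 0.75 = 1 - (1 - 0.3 - 0.2)(1 - 0.5)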

We now prove Theorem 5.2.17.

Proof. Part 1: We start with the “only if” direction. Let q be well defined, and assume it has a pair t, t′ of intertwined, critical tuples. By definition there exists an instance I s.t. fq(I) − fq(I − {t}) − fq(I − {t′}) + fq(I − {t, t′}) ≠ 0. Since q is monotone it follows that fq is monotone, hence q(I) = true, q(I − {t, t′}) = false, and either q(I − {t}) = q(I − {t′}) = false or q(I − {t}) = q(I − {t′}) = true. Without loss of generality, assume that q(I − {t}) = q(I − {t′}) = false. Then we define two probabilistic databases PV = (W, µ) and PV′ = (W, µ′) as follows. Each has four possible worlds: I, I − {t}, I − {t′}, I − {t, t′}. In PV these worlds are assigned probability µ = (0.5, 0, 0, 0.5), respectively; here, t and t′ are positively correlated. In PV′, all worlds are assigned probability 0.25, i.e., the two tuples are independent. Observe that in both cases, the marginal probability of any tuple is the same, µ(t) = µ(t′) = 0.5 and all other tuples have probability 1. Then we have µ(q) = 0.5 and µ′(q) = 0.25, contradicting the assumption that q is well-defined.

Next we prove the “if” part, so assume q has no pair of intertwined, critical tuples. The basic plan is this. Suppose an instance I contains two intertwined tuples t, t′ (hence we don’t know their correlations). Write fq(I) = fq(I − {t, t′}) + ∆t fq(I − {t′}) + ∆t′ fq(I − {t}) (because ∆t,t′ fq = 0). Thus, we can “remove” t or t′ or both from I and get a definition of q on a smaller instance, and by repeating this process we can eliminate all intertwined tuples from I; then we apply Lemma 5.2.22. Formally, let q be a monotone Boolean query that has no critical pair of intertwined tuples for (L, K). Let PV = (W, µ) be a probabilistic database that satisfies the partial representation (L, K),

and let Tup be the set of possible tuples in PV. We expand fq using Lemma 5.2.20:

fq = ∑_{T⊆Tup} ∆T fq^T
   = ∑_{T⊆Tup: T is NIT} ∆T fq^T
   = ∑_{T⊆Tup: T is NIT} ∑_{S⊆T} (−1)^{|S|} fq^{T−S}    (5.5)

The first line is Eq. (5.4) in Lemma 5.2.20. To show the second line, we start from the fact that ∆{t,t′} fq = 0 when t, t′ are intertwined tuples, because we assumed that q has no critical pair of intertwined tuples. Then, every set T that is not NIT can be written as T = {t, t′} ∪ T′ where t, t′ are two intertwined tuples. We apply Lemma 5.2.19 twice: ∆{t,t′} fq = 0 implies ∆{t,t′} fq^T = 0, which further implies ∆T fq^T = 0. Thus, the only terms in the first line that are non-zero are those that correspond to sets T that are NIT: this is what the second line says. Finally, the last line is the direct definition of ∆T. Next we apply the expectation on both sides of Eq. (5.5), and use the linearity of expectation plus

µ(q) = E[ fq]:

µ(q) = E[fq] = ∑_{T⊆Tup: T is NIT} ∑_{S⊆T} (−1)^{|S|} E[fq^{T−S}]
             = ∑_{T⊆Tup: T is NIT} ∑_{S⊆T} (−1)^{|S|} µ(q^{T−S})

The claim of the theorem follows from the fact that, by Lemma 5.2.22, each expression µ(q^{T−S}) is well defined. We now prove Part 2, by providing a reduction from the problem “given a conjunctive query q and a tuple t, check whether t is critical for q”. Assume w.l.o.g. that the query and the tuple are over a vocabulary with a single relation symbol (namely V). If not, we rename relation symbols by padding them so that they have the same arities, then adding constants: for example, if the vocabulary is R1(A, B), R2(C, D, E, F), R3(G, H, K), then we will rewrite both the query and the tuple using a

single relation symbol V of arity 5 (the largest of the three arities plus one), and replace R1(x, y) with V(x, y, a, a, a), replace R2(x, y, z, u) with V(x, y, z, u, b), and replace R3(x, y, z) with V(x, y, z, c, c), where a, b, c are fixed constants. Thus, we have a query q and a ground tuple t, over a single relation symbol V of arity k, in particular t = V(c1,..., ck) = V(c̄). We want to check whether t is a critical tuple for q. We reduce this problem to the problem of checking whether some query q′ is well defined over some view V′; the new view will have arity k + 1. Fix two distinct constants a, b. The new query q′ is obtained from q by replacing every subgoal V(x, y,...) with V′(a, x, y,...), and by adding the constant subgoal V′(b, c1,..., ck). Thus, queries q and q′ look like this:

q = V(x̄1), V(x̄2),..., V(x̄m)

q′ = V′(a, x̄1), V′(a, x̄2),..., V′(a, x̄m), V′(b, c̄)

Consider the partial representation (L, K) for V′, where L = Attr(V) and K = Attr(V′). Recall that q′ is well defined over this partial representation iff it has no pairs of intertwined, critical tuples. Thus, to prove the hardness claim it suffices to show that the following two statements are equivalent:

1. There exist two intertwined, critical tuples for q′.

2. t is a critical tuple for q.

We start by showing 1 implies 2. Two tuples t1, t2 are intertwined iff they agree on the L attributes, t1.L = t2.L, and disagree on the K attributes, hence t1.A ≠ t2.A, where A = Attrs(V′) − L is the extra attribute that was added to V′. On the other hand, if the set {t1, t2} is also critical for q′, then t1.A = a and t2.A = b (since q′ only inspects tuples that have A = a or A = b), and, moreover, t1.L = t2.L = c̄ (since the only tuple with A = b that is critical for q′ is V′(b, c̄)). Let I′ be an instance that witnesses the fact that {t1, t2} is doubly critical:

0 ≠ ∆t1,t2 fq′(I′) = fq′(I′) − fq′(I′ − {t1}) − fq′(I′ − {t2}) + fq′(I′ − {t1, t2})
                  = 1 − fq′(I′ − {t1}) − 0 + 0

We used the fact that fq′(I′) = 1 (otherwise, if fq′(I′) = 0 then ∆t1,t2 fq′(I′) = 0), which implies that

t2 = V′(b, c̄) ∈ I′. Thus, we have q′(I′) = true and q′(I′ − {t1}) = false. We construct from here an instance I such that q(I) = true and q(I − {t}) = false: indeed, take I = {V(d̄) | V′(a, d̄) ∈ I′}; it obviously satisfies this property. Thus, I is a witness for t being a critical tuple for q. To prove 2 implies 1 we use the same argument in reverse. We start with an instance I such that q(I) = true, q(I − {t}) = false, and define I′ = {V′(a, d̄) | V(d̄) ∈ I} ∪ {V′(b, c̄)}. It is easy to check that ∆t1,t2 fq′(I′) ≠ 0. This completes the proof.

5.3 Practical Query Answering using Probabilistic Views

In this section, we consider practical enhancements to the theory of probabilistic views. Concretely, the complexity of the decision problems in the previous section is high, so we give efficient approximation algorithms for both problems that are sound (but not complete). For a restricted class of queries, we are able to prove that our PTIME algorithms are complete. We then discuss further practical enhancements to our theory, such as coping with dependencies.

5.3.1 Practical Algorithm for Representability

Checking L-block independence is intractable, and so we give a polynomial-time approximation that is sound, i.e., it says a view is representable only if it is representable, but not complete: the test may say that a view is not L-block independent when in fact it is. The central notion is an L-collision, which intuitively says that there are two output tuples that may depend on input tuples that are not independent (i.e., the same tuple or disjoint tuples).

To make this precise, let q = g1,..., gn be a conjunctive query where ki denotes the tuple of variables and constants in the possible worlds key position of the subgoal gi. For example,

q = R(x, x, y; z), R(x, y, y; u), S (x, ‘a’; z)

then k1 = (x, x, y), k2 = (x, y, y), and k3 = (x, ‘a’). We say that a valuation v : var(q) → D is disjoint aware if for i, j = 1,..., n, whenever v(ki) = v(kj) and pred(gi) = pred(gj), then v(gi) = v(gj). The intuition is that such a valuation is consistent with the possible worlds key constraints. Above, v(x, y, z, u) = (a, b, c, d) is disjoint aware, while v(x, y, z, u) = (a, a, c, d) is not: there is no possible world that contains both of the tuples R(a, a, a; c) and R(a, a, a; d), since this would violate

the possible worlds key constraint.

Definition 5.3.1. An L-collision for a view

Vp(L, U) D g1,..., gn

is a pair of disjoint-aware valuations (v, w) such that v(L) ≠ w(L), but there exist i, j such that gi is probabilistic, pred(gi) = pred(gj), and v(ki) = w(kj).

The utility of this definition is that we will show that if we cannot find an L-collision, then we are sure that the output of the view will be L-independent.

Example 5.3.2 Consider V2p(C) in Eq. (5.2): if we unify any pair of probabilistic subgoals, we are forced to unify the head, c. This means that a collision is never possible and we conclude that V2p is C-block independent. Notice that we can unify the S subgoal for distinct values of c; since S is deterministic, this is not a collision. In V1p(c, r) from Eq. (5.1), the following pair (v, w), v(c, r, d) = (‘TD’, ‘D.Lounge’, ‘Crab Cakes’) and w(c, r, d) = (‘TD’, ‘P.Kitchen’, ‘Crab Cakes’), is a collision because v(c, r) ≠ w(c, r) and we have unified the keys of the Rp subgoal. Since there are no repeated probabilistic subgoals, we are sure that V1p is not CR-block independent.

We need the following proposition:

Proposition 5.3.3. If V(L, U) is a view on a BID database without an L-collision, then for any tuples s, t ∈ V such that s[L] ≠ t[L], the sets of variables in λ(s) and λ(t) are disjoint.

Proof. This is immediate from the definition. 

Lemma 5.3.4. If a view v(L, U) on a BID table does not have an L-collision, then v is L-independent. Moreover, if v does not contain repeated predicates then the converse holds as well.

Proof. If there does not exist an L-collision, then for any tuples s, t that appear in the view V with s[L] ≠ t[L], λ(s) and λ(t) have no common variables. Hence, no critical tuple can be shared between them and so the view is L-independent. If there are no repeated predicates in v, then we observe that every tuple in the image of any valuation is critical. Let v, w be the valuations provided by the definition and gi, gj be subgoals such that v(ki) = w(kj) and v(L) ≠ w(L). Then this implies that the tuples v(gi) and w(gj) form a critical pair.

The last thing we must observe is that there is an easy, complete test for L-collisions:

• Create two copies of the query V(L, U) and put them into one query VV(L, U, L′, U′) = V(L, U), V′(L′, U′).

• For each pair of subgoals gi ∈ V and g′j ∈ V′, unify them in VV and then construct the most general unifier [159] that respects the key constraints in the BID instances. If we can construct such a unifier with L ≠ L′, then reject the query: this unifier encodes an L-collision. To see this, v (resp. w) is the unifier restricted to the variables in V (resp. V′). If not, then there is no L-collision.

The second step can be done using the Chase for functional dependencies, which runs in polynomial time [1]. Thus, we can find L-collisions in polynomial time. Summarizing, we have the following theorem:

Theorem 5.3.5. Given a conjunctive view V(L, U) on a BID instance, the previous algorithm gives a polynomial-time, sound test for L-independence. Moreover, if V does not contain self-joins, then the test is complete.
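The copy-and-unify steps of the test can be illustrated with a small sketch. It is only an approximation of the algorithm described above: in particular, it omits the Chase step that enforces the possible-worlds key constraints (disjoint awareness), and the term encoding (variables prefixed with ‘?’), the tuple format for subgoals, and the hypothetical views in the usage example are all assumptions made for illustration.

    def walk(t, s):
        # Resolve a term through the substitution chain.
        while t.startswith('?') and t in s:
            t = s[t]
        return t

    def unify(xs, ys, s=None):
        # Most general unifier of two equal-length term lists (variables start with '?').
        s = dict(s or {})
        for a, b in zip(xs, ys):
            a, b = walk(a, s), walk(b, s)
            if a == b:
                continue
            if a.startswith('?'):
                s[a] = b
            elif b.startswith('?'):
                s[b] = a
            else:
                return None            # clash of two distinct constants
        return s

    def rename(terms):
        # Prime every variable for the second copy of the view.
        return [t + "'" if t.startswith('?') else t for t in terms]

    def has_L_collision(head_L, subgoals):
        """subgoals: list of (pred, key_terms, is_probabilistic). Report a collision if two
        probabilistic subgoals with the same predicate can have their possible-worlds keys
        unified while the two copies of head_L are not forced to coincide."""
        for (p1, k1, prob1) in subgoals:
            for (p2, k2, prob2) in subgoals:
                if not (prob1 and prob2 and p1 == p2):
                    continue
                s = unify(k1, rename(k2))
                if s is None:
                    continue
                if [walk(t, s) for t in head_L] != [walk(t, s) for t in rename(head_L)]:
                    return True        # head keys not forced equal: an L-collision
        return False

    # Hypothetical views: W1(x) :- R^p(x; y)  vs.  W2(x) :- R^p(y; x)   (key before ';')
    print(has_L_collision(['?x'], [('R', ['?x'], True)]))   # False: unifying keys forces x = x'
    print(has_L_collision(['?x'], [('R', ['?y'], True)]))   # True: keys unify but x, x' stay free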

5.3.2 Practical Test for Uniqueness

We now give a test in a similar spirit for uniqueness. We say that a pair of disjoint-aware valuations v, w is compatible if for i, j = 1,..., n, whenever we have v(ki) = w(k j) and pred(gi) = pred(g j), then it must be that v(gi) = w(g j). Intuitively, if a pair of valuations is compatible it means that there is some possible world W where both of these valuations can map the query simultaneously, i.e. v(q) ∪ w(q) ⊆ W.

Definition 5.3.6. Given a schema with a single view V that has a partial representation (L, K), an intertwined collision for a query Q(H) is a pair of compatible valuations (v, w) such that v(H) = w(H) and there exists a pair of subgoals, (gi, gj), such that v(gi) and w(gj) are intertwined.

The intuition of an intertwined collision is that there are two derivations for a tuple in the output of Q that depend on a pair of intertwined tuples, i.e., tuples whose correlation is not specified by the partially represented view.

Example 5.3.7 In V1, KI = {C} and D = {R}. An intertwined collision for Q1 is v(c, r) = (‘TD’, ‘D.Lounge’) and w(c, r) = (‘TD’, ‘P.Kitchen’), thus Q1’s value is not uniquely defined. On the other hand, in Q2, trivially there is no intertwined collision and so Q2’s value is uniquely defined.

The algorithm to find an intertwined collision is a straightforward extension of finding a K-collision. The key difference is that we use the Chase to ensure that the valuations we find are compatible, not individually disjoint aware.

Theorem 5.3.8. If no intertwined collisions exist for a conjunctive query Q, then its value is uniquely defined. If the partially representable view symbol Vp is not repeated, this test is complete. The test can be implemented in PTIME.

Proof. First consider the soundness argument in the special case of a Boolean query Q(). We show that if there exists a critical intertwined pair (s, t) for Q, then there must be an intertwined collision. Let I be the instance provided by Definition 5.2.14. Suppose I − {s} |= Q(). Since I − {s, t} ⊭ Q(), the image of any valuation v that witnesses I − {s} |= Q() must contain t. By symmetry, the image of any valuation w that witnesses I − {t} |= Q() must contain s. It is easy to see that (v, w) is compatible and hence (v, w) is an intertwined collision. If I − {s} ⊭ Q() then there is a single valuation v which uses both s and t. Thus, (v, v) is an intertwined collision. It is straightforward to extend to the non-Boolean case. To see completeness, we observe that every tuple in the image of any valuation is critical. Hence an intertwined collision finds a pair of intertwined tuples that are critical. We then apply Theorem 5.2.17.

5.4 Experiments

In this section we answer three main questions: To what extent do representable and partially representable views occur in real and synthetic data sets? How much do probabilistic materialized views help query processing? How expensive are our proposed algorithms for finding representable views?

5.4.1 Experimental Setup

Data Description We experimented with a variety of real and synthetic data sets including: a database from iLike.com [40], the Northwind database (NW) [41], the Adventure Works Database

Figure 5.2: (a) Percentage by workload that are representable, non-trivially partially representable or not representable. We see that almost all views have some non-trivial partial representation. (b) Running times for Query 10 which is safe. (c) Retrieval times for Query 5 which is not safe. Performance data is TPC-H (0.1, 0.5, 1G) data sets. All running times in seconds and on logarithmic scale.

from SQL Server 2005 (AW) [42] and the TPC-H/R benchmark (TPCH) [43, 44]. We manually created several probabilistic schemata based on the Adventure Works [42], Northwind [41] and TPC-H data which are described in Fig. 5.3.

Queries and Views We interpreted all queries and views with scalar aggregation as probabilistic existence operators (i.e. computing the probability a tuple is present). iLike, Northwind and Adven- ture Works had predefined views as part of the schema. We created materialized views for TPC-H using an exhaustive procedure to find all subqueries that were representable, did not contain cross products and joined at least two probabilistic relations.

Real data: iLike.com We were given query logs and the relational schema of iLike.com, which is interesting for three reasons: it is a real company, a core activity of iLike is manipulating uncertain data (e.g., similarity scores), and the schema contains materialized views. iLike’s data, though not natively probabilistic, is easily mapped to a BID representation. The schema contains over 200 tables of which a handful contain uncertain information. The workload trace contains over 7 million queries of which more than 100,000 manipulated uncertain information contained in 5

(a) Tables (w/P) per schema: AW 18 (6); AW2 18 (3); NW1 16 (2); NW2 16 (5); NW3 16 (4); TPC-H 8 (5).
(b) TPC-H scale, Size (w/P) / Tuples (w/P): 0.1 (440M) / 3.3M (2.4M); 0.5 (2.1G) / 16M (11.6M); 1.0 (4.2G) / 32M (23.2M).

Figure 5.3: Schema and TPC Data statistics. (a) Number of tables referenced by at least one view and number of probabilistic tables (i.e. with attribute P). (b) Size and (w/P) are in Gb. The number of deterministic and probabilistic tuples is in millions.

views. Of these 100,000 queries, we identified fewer than 10 query types, which ranged from simple selections to complicated many-way joins.

Performance Data All performance experiments use the TPC-H data set with a probabilistic schema containing uncertainty in the part, orders, customer, supplier and lineitem tables. We used the TPC-H tool dbgen to generate relational data. The data in each table marked as probabilistic was then transformed by injecting additional tuples uniformly at random so that each key value was expected to occur 2.5 times. We allowed for entity uncertainty, that is, the sum of probabilities for a possible worlds key may be less than 1.

System Details Our experimental machine was a Windows Server 2003 machine running SQL Server 2005 with 4GB of RAM, 700G Ultra ATA drive and dual Xeon (3GHz) processors. The MQ engine is a middleware system that functions as a preprocessor and uses a complete ap- proach [21, 137]. The materialized view tools are implemented using approximately 5000 lines of OCaml. After importing all probabilistic materialized views, we tuned the database using only the SQL Server Database Engine Tuning Advisor.

Execution Time Reporting Method We reduced query time variance by executing each query seven times, dropping the highest and lowest times and averaging the remaining five times. In all reported numbers, the variance of the five runs was less than 5% of query execution time.
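Expressed as code, this reporting rule is a trimmed mean; a minimal sketch (the function name is ours):

    def reported_time(runs):
        # Seven runs per query: drop the highest and lowest, average the remaining five.
        assert len(runs) == 7
        return sum(sorted(runs)[1:-1]) / 5.0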

5.4.2 Question 1: Do Representable and Partially Representable views exist?

In Fig. 5.2(a), we show the percentage of views in each workload that are trivially representable because there are no probabilities in the view (TRIVIAL), representable (REP), non-trivially partially representable (PARTIAL), or only trivially partially representable (NOTREP). In iLike’s workload, 4 of the 5 views (80%) are representable. Further, 98.5% of the over 100k queries that manipulate uncertain data use the representable views. In synthetic data sets, representable views exist as well. In fact, 50% or more of the views in each data set except for AW are representable. Overall, 63% of views are representable. 45% of the representable views are non-trivially representable. Additionally, almost all views we examined have a non-trivial partial representation (over 95%). We conclude that representable and partially representable views exist and can be used in practice.

5.4.3 Question 2: Do our techniques make query processing more efficient?

The TPC data set is the basis for our performance experiments because it is reproducible and the data can be scaled arbitrarily. We present queries 5 and 10, because they both have many joins (6 and 4) and they are contrasting: Query 10 is safe [46, 135], and so can be efficiently evaluated by a modified SQL query. Query 5 is unsafe and so requires expensive Monte Carlo techniques. Graphs 5.2(b) and 5.2(c) report the time taken to execute the query and retrieve the results. For query 10, this is the total time for execution because it is safe. In contrast, query 5 requires additional Monte Carlo techniques to compute output probabilities.

Graph Discussion In Fig. 5.2(b), we see running times of query 10 without probabilistic semantics (PTPC), as a safe plan (SAFE), with a subview materialized and retaining lineage (LIN) and the same subview without lineage (NOLIN). LIN is equivalent to a standard materialized view optimization; the lineage information is computed and stored as a table. In NOLIN, we discard the lineage and retain only the probability that a tuple appears in the view. The graph confirms that materializing the lineage yields an order of magnitude improvement for safe queries because we do not need to compute three of the four joins at query execution time. Interestingly, the bars for NOLIN show that precomputing the probabilities and ignoring the lineage yields an additional order of magnitude improvement. This optimization is correct because the materialized view is representable. This is

interesting because it shows that being aware of when we can remove lineage is helpful even for safe plans. As a baseline, Fig. 5.2(c) shows the query execution times for query 5 without probabilistic semantics but using the enlarged probabilistic tables (PTPC). Fig. 5.2(c) also shows the cost of retrieving the tuples necessary for Monte Carlo simulation (MC). Similarly, we also see the cost when materializing a view and retaining lineage (LIN) and when we precompute the probabilities and discard the lineage (NOLIN). For (MC) and (LIN), the extra step of Monte Carlo simulation is necessary, which for TPC 0.1 (resp. TPC 0.5, TPC 1) requires an additional 13.62 seconds (resp. 62.32s, 138.21s). Interestingly, query 5 using the materialized view does not require Monte Carlo simulation because the rewritten query is safe. Thus, the time for NOLIN is an end-to-end running time and so we conclude that our techniques offer four orders of magnitude improvement over materializing the lineage alone (8.2s + 138.21s with lineage vs. 0.03s without).

5.4.4 Question 3: How costly are our algorithms?

All views listed in this chapter were correctly classified by our practical algorithm, which always executes in well under 1 second. Finding all representable or partially representable sub-views for all but two queries completed in under 145 seconds; the other two queries completed in under an hour. Materializing views for unsafe queries completed in under 1.5 hours for all results reported here. However, this is an offline process and can be parallelized because it can utilize multiple separate Monte Carlo processes.

Chapter 6

VIEW-BASED TECHNIQUE II: APPROXIMATE LINEAGE

In probabilistic databases, lineage is fundamental to processing probabilistic queries and understanding the data. Many state-of-the-art systems use a complete approach, e.g., Trio [16] or MQ [46, 137], in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. The central observation in this chapter is that, for many applications, it is often unnecessary for the system to painstakingly track every derivation. A consequence of ignoring some derivations is that our system may return an approximate query probability such as 0.701 ± 0.002, instead of the true value of 0.7. An application may be able to tolerate this difference, especially if the approximate answer can be obtained significantly faster. A second issue is that although a complete lineage approach explains all derivations of a tuple, it does not tell us which facts are the most influential in that derivation. In large data sets, a derivation may become extremely large because it aggregates together a large number of individual facts. This makes determining which individual facts are influential an important and non-trivial task.

With these observations in mind, we advocate an alternative to complete lineage called approximate lineage. Informally, the spirit of approximate lineage is to compress the data by tracking only the most influential facts in the derivation. This approach allows us to both efficiently answer queries, since the data is much smaller, and also to directly return the most important derivations. We motivate our study of approximate lineage by discussing two application domains: (1) large scientific databases and (2) similarity data. We show that approximate lineage can compress the data by up to two orders of magnitude, e.g., 100s of MB to 1MB, while providing high-quality explanations.

6.1 Motivating Scenarios

We discuss some applications of approximate lineage.

Process (P):
  Gene Product | Process                  | λ
  (t1) AGO2    | “Cell Death”             | x1
  (t2) AGO2    | “Embryonic Development”  | x1
  (t3) AGO2    | “Gland development”      | x2
  (t4) Aac11   | “Cell Death”             | x2
  (t5) Aac11   | “Embryonic Development”  | x3

Annotations (Atoms):
  Atom | Code | Description                               | P
  x1   | TAS  | “Dr. X’s PubMed PMID:12593804”            | 3/4
  x2   | NAS  | “Dr. Y’s RA Private Communication”        | 1/4
  x3   | IEA  | “Inferred from Computational Similarity”  | 1/8

V(y) D P(x, y), P(‘Aac11’, y), x ≠ ‘Aac11’
V is a view that asks for “Gene Products that share a process with a product ‘Aac11’”.

Level I DB (Complete Lineage):
  Gene Product | λ
  (t6) AGO2    | (x1 ∧ x2) ∨ (x1 ∧ x3)

Level II DB (Approximate Lineage):
  Type            | Lineage Formula
  Sufficient      | λ̃S_t6 = x1 ∧ x2
  Arithmetization | λ̃A_t6 = x1(1 − (1 − x2)(1 − x3))
  Polynomial      | λ̃P_t6 = 33/128 + 21/32 (x2 − 1/4) + 9/16 (x3 − 1/8)

Figure 6.1: Process (P) relates each gene product to a process, e.g., AGO2 is involved in “cell death”. Each tuple in Process has an annotation from the set of atoms. An atom, xi for i = 1, 2, 3, is a piece of evidence that has an associated probability, e.g., x1 is the proposition that we trust “Dr. X’s PubMed article PMID:12593804”, which we assign probability 3/4. V is a view that asks for “Gene Products that share a process with a product ‘Aac11’”. Below V’s definition is its output in the original database with a complete approach. At the right, examples of the approximate lineage functions we consider are listed. The compressed database is obtained by replacing λ with one of these functions, e.g., λ̃S_t6. This database is inspired by the Gene Ontology (GO) database [37]. The terms (Level I) and (Level II) are specific to our approach and are defined in Section 6.1.1.

Application (1): Large Scientific databases In large scientific databases, lineage is used to integrate data from several sources [24]. These sources are combined by both large consortia, e.g., [37], and single research groups. A key challenge faced by scientists is that facts from different sources may not be trusted equally. For example, the Gene Ontology Database (GO) [37] is a large (4GB) freely available database of genes and proteins that is integrated by a consortium of researchers. For scientists, the most important data stored in GO is a set of associations between proteins and their functions. These associations are integrated by GO from many sources, such as PubMed articles [134], raw experimental data, data from SWISS-PROT [20], and automatically inferred matchings. GO tracks the provenance of each association, using what we call atoms. An atom is simply a tuple that contains a description of the source of a statement. An example atom is “Dr. X’s PubMed article PMID:12593804”. Tracking provenance is crucial in GO because much of the data is of relatively low quality: approximately 96% of the more than 19 million atoms stored in GO are automatically inferred. To model these trust issues, our system associates each atom with a probability whose value reflects our trust in that particular annotation. Figure 6.1 illustrates such a database.

Example 6.1.1 A statement derivable from GO is, “Dr. X claimed in PubMed PMID:12593804 that the gene Argonaute2 (AGO2) is involved in cell death” [77]. In our model, one way to view this is that there is a fact, the gene Argonaute2 is involved in cell death, and there is an atom, Dr. X made the claim in PubMed PMID:12593804. If we trust Dr. X, then we assign a high confidence value to this atom. This is reflected in Figure 6.1 since the atom, x1, has a high probability, 3/4. More complicated annotations can be derived, e.g., via query processing. An example is the view V in Figure 6.1, that asks for gene products that share a process with the gene ‘Aac11’. The tuple, AGO2 (t6), that appears in V is derived from the facts that both AGO2 and Aac11 are involved in “cell death” (t1 and t4) and “embryonic development” (t2 and t5); these tuples use the atoms x1 (twice), x2 and x3 shown in the Annotations table.

A benefit of annotating the data with confidence scores is that the scientist can now obtain the reliability of each query answer. To compute the reliability value in a complete approach, we may be forced to process all the lineage for a given tuple. This is challenging, because the lineage can be very large. This problem is not unique to GO. For example, [30] reports that a 250MB biological database has 6GB of lineage. In this thesis, we show how to use approximate lineage

to effectively compress the lineage by more than two orders of magnitude, even for extremely low error rates. Importantly, our compression techniques allow us to process queries directly on the compressed data. In our experiments, we show that this can result in up to two orders of magnitude more efficient processing than a complete approach. An additional important activity for scientists is understanding the data; the role of the database in this task is to provide interactive results to hone the scientist’s knowledge. As a result, we cannot tolerate long delays. For example, the lineage of even a single tuple in the gene ontology database may be 9MB. Consider a scientist who finds the result of V in Fig. 6.1 surprising: one of her goals may be to find out why t6 is returned by the system, i.e., she wants a sufficient explanation as to why AGO2 was returned. The system would return that the most likely explanation is that we trust

Dr. X that AGO2 is related to cell death (t1) and Dr. Y’s RA that Aac11 is also related to cell death (t4). An alternative explanation uses t1 and the automatic similarity computation (t5). However, the first explanation is more likely, since the annotation associated with t4 (x2) is more likely than the annotation of t5 (x3): here 1/4 = p(x2) ≥ p(x3) = 1/8.

A scientist also needs to understand the effect of her trust policy on the reliability score of t6.

Specifically, she needs to know which atom is the most influential to computing the reliability for t6. In this case, the scientist is relatively sure that AGO2 is associated with cell death, since it is assigned a score of 3/4. However, the key new element leading to this surprising result is that Aac11 is also associated with “cell death”, which is supported by the atom x2, the statement of Dr. Y’s RA. Concretely,

x2 is the most influential atom because changing x2’s value will change the reliability of t6 more than changing any other atom. In our experiments, we show that we can find sufficient explanations with high precision, e.g., we can find the top 10 influential explanations with between 70% and 100% accuracy. Additionally, we can find influential atoms with high precision (80%−100% of the top 10 influential atoms). In both cases, we can conduct these exploration tasks without directly accessing the raw data.

Application (2): Managing Similarity Scores Applications that manage similarity scores can benefit from approximate lineage. Such applications include managing data from object reconciliation procedures [6, 89] or similarity scores between users, such as iLike.com. In iLike, the system automatically assigns a music compatibility score between friends. The similarity score between

two users, e.g., Bob and Joe, has a lineage: It is a function of many atomic facts, e.g., which songs they listen to and how frequently, which artists they like, etc. All of these atomic facts are combined into a single numeric score which is then converted into quantitative buckets, e.g., high, medium and low. Intuitively, to compute such rough buckets, it is unnecessary to precisely maintain every bit of lineage. However, this painstaking computation is required by a complete approach. In this chapter, we show how to use approximate lineage to effectively compress object reconciliation data in the IMDB database [173].

6.1.1 Overview of our Approach

At a high level, both of our example applications, large scientific data and managing similarity scores, manage data that is annotated with probabilities. In both applications, we propose a two-level architecture: The Level I database is a large, high-quality database that uses a complete approach and is queried infrequently. The Level II database is much smaller, and uses an approximate lineage system. A user conducts her query and exploration tasks on the Level II database, which is the focus of this chapter.

The key technical idea of this chapter is approximate lineage, which is a strict generalization of complete lineage. Abstractly, lineage is a function λ that maps each tuple t in a database to a Boolean formula λt over a fixed set of Boolean atoms. For example in Figure 6.1, the lineage of the tuple t6 is λt6 = (x1 ∧ x2) ∨ (x1 ∧ x3). In this chapter, we propose two instantiations of approximate lineage: a conservative approximation, sufficient lineage, and a more aggressive approximation, polynomial lineage.

In sufficient lineage, each lineage function is replaced with a smaller formula that logically implies the original. For example, a sufficient lineage for t6 is λ̃S_t6 = x1 ∧ x2. The advantage of sufficient lineage is that it can be much smaller than standard lineage, which allows query processing and exploration tasks to proceed much more efficiently. For example, in our experiments processing a query on the uncompressed data took 20 hours, while it completed in 30 minutes on a database using sufficient lineage. Additionally, understanding query reliability is easy with sufficient lineage: the reliability computed for a query q is always less than or equal to the reliability computed on the original Level I database. However, only monotone lineage functions can be represented by a sufficient

approach.

The second generalization is polynomial lineage, which is a function that maps each tuple t in a database to a real-valued polynomial on Boolean variables, denoted λ̃P_t. An example polynomial lineage is λ̃P_t6 in Figure 6.1. There are two advantages of using real-valued polynomials instead of Boolean-valued functions: (1) powerful analytic techniques already exist for understanding and approximating real-valued polynomials, e.g., Taylor series or Fourier series, and (2) any lineage function can be represented by polynomial approximate lineage. Polynomial lineage functions can allow more accurate semantics than sufficient lineage in the same amount of space, i.e., the difference in value between computing q on the Level I and Level II databases is small. In Section 6.5 we demonstrate a view in GO such that polynomial lineage achieves a compression ratio of 171:1 and sufficient lineage achieves a 27:1 compression ratio with error rate less than 10^{-3} (Def. 6.2.10).

Although polynomial lineage can give better compression ratios and can be applied to a broader class of functions, there are three advantages of sufficient lineage over polynomial lineage: (1) sufficient lineage is syntactically identical to complete lineage, and so can be processed by existing probabilistic relational databases without modification, e.g., Trio and Mystiq. (2) The semantics of sufficient lineage is easy to understand, since the value of a query is a lower bound of the true value, while a query may have either a higher or lower value using polynomial lineage. (3) Our experiments show that sufficient lineage is less sensitive to skew, and can result in better compression ratios when the probability assignments to atoms are very skewed.

In both lineage systems, there are three fundamental technical challenges: creating it, processing it and understanding it. In this chapter, we study these three fundamental problems for both forms of approximate lineage.

6.1.2 Contributions, Validation and Outline

In the remainder of this chapter, we show that we can (1) efficiently construct both types of approximate lineage, (2) process both types of lineage efficiently and (3) use approximate lineage to explore and understand the data.

• In Section 6.2, we define the semantics of approximate lineage, motivate the technical problems that any approximate lineage system must solve, and state our main results. The technical problems are: creating approximate lineage (Prob. 3); explaining the data, i.e., finding sufficient explanations (Prob. 4) and finding influential variables (Prob. 5); and query processing with approximate lineage (Prob. 6).
• In Section 6.3, we define our implementation for one type of approximate lineage, sufficient lineage. This requires that we solve the three problems above: we give algorithms to construct it (Section 6.3.2), to use it to understand the data (Section 6.3.3), and to process further queries on the data (Section 6.3.4).
• In Section 6.4, we define our proposal for polynomial approximate lineage; our proposal brings together many previous results in the literature to give algorithms to construct it (Section 6.4.2), to understand it (Section 6.4.3) and to process it.
• In Section 6.5, we provide experimental evidence that both approaches work well in practice; in particular, we show that approximate lineage can compress real data by orders of magnitude even with very low error (Section 6.5.2), provide high-quality explanations (Section 6.5.3) and provide large performance improvements (Section 6.5.4). Our experiments use data from the Gene Ontology database [37, 156] and a probabilistic database of IMDB [173] linked with reviews from Amazon.

6.2 Statement of Results

We first give some background on lineage and probabilistic databases, and then formally state our problem with examples.

6.2.1 Preliminaries: Queries and Views

In this chapter, we consider conjunctive queries and views written in a datalog-style syntax. A query q is a conjunctive rule written q D g1,..., gn where each gi is a subgoal, that is, a relational predicate. For example, q1 D R(x), S(x, y, ‘a’) defines a query with a join between R and S, a variable y that is projected away, and a constant ‘a’. For a relational database W, we write W |= q to denote that W entails q.

6.2.2 Lineage and Probabilistic Databases

We again adopt a viewpoint of lineage similar to c-tables [82, 93], and we think of lineage as a constraint that tells us which worlds are possible. We generalize lineage. To make our generalization clear, we recall the definitions of lineage from Chapter 2.

Definition 6.2.1 (Lineage Function). An atom is a Boolean proposition about the real world, e.g., Bob likes Herbie Hancock. Fix a relational schema σ and a set of atoms A. A lineage function, λ, assigns to each tuple t conforming to some relation in σ, a Boolean expression over A, which is denoted λt. An assignment is a function A → {0, 1}. Equivalently, it is a subset of A, denoted A, consisting of those atoms that are assigned true.

Figure 6.1 illustrates tuples and their lineages. The atoms represent propositions about data provenance. For example, the atom x1 in Figure 6.1 represents the proposition that we trust “Dr. X’s PubMed PMID:12593804”. Of course, atoms can also represent more coarsely grained propositions, “A scientist claimed it was true”, or finely grained facts, “Dr. X claimed it in PubMed 18166081 on page 10”. In this chapter, we assume that the atoms are given; we briefly discuss this at the end of the current section. This form of lineage is sometimes called Boolean pc-tables, since we are restricting the domain of every atom to be Boolean. To define the standard semantics of lineage, we define a possible world W through a two-stage process: (1) select a subset of atoms, A, i.e., an assignment, and (2) for each tuple t, if λt(A) evaluates to true then t is included in W. This process results in a unique world W for any choice of atoms A.

Example 6.2.2 If we select A13 = {x1, x3}, that is, we trust Dr. X and the similarity computation, but distrust Dr. Y’s RA, then W1256 = {t1, t2, t5, t6} is the resulting possible world. The reason is that for each ti ∈ W1256, λti is satisfied by the assignment corresponding to A13, and for each tj ∉ W1256, λtj is false. In contrast, W125 = {t1, t2, t5} is not a possible world because in W125, we know that AGO2 and Aac11 are both associated with Embryonic Development, and so AGO2 should appear in the view (t6). In symbols, λt6(A13) = 1, but t6 ∉ W125.

We capture this example in the following definition:

Definition 6.2.3. Fix a schema σ. A world is a subset of tuples conforming to σ. Given a set of atoms A and a world W, we say that W is a possible world induced by A if it contains exactly those tuples consistent with the lineage function, that is, for all tuples t, λt(A) ⇐⇒ t ∈ W. Moreover, we write λ(A, W) to denote the Boolean function that takes value 1 if W is a possible world induced by A. In symbols,

λ(A, W) ≝ ( ∧_{t: t∈W} λt(A) ) ∧ ( ∧_{t: t∉W} (1 − λt(A)) )    (6.1)

Definition 6.2.4. Fix a set of atoms A. A probabilistic assignment p is a function from A to [0, 1] that assigns a probability score to each atom a ∈ A. A probabilistic database W is a probabilistic assignment p and a lineage function λ that represents a distribution µ over worlds defined as:

µ(W) ≝ ∑_{A⊆A} λ(A, W) ( ∏_{i: ai∈A} p(ai) ) ( ∏_{j: aj∉A} (1 − p(aj)) )

Given any Boolean query q on W, the marginal probability of q denoted µ(q) is defined as

µ(q) ≝ ∑_{W: W|=q} µ(W)    (6.2)

i.e., the sum of the weights over all worlds that satisfy q.

Since for any A, there is a unique W such that λ(A, W) = 1, µ is a probability measure. In all of our semantics, the semantics for queries will be defined similarly to Eq. 6.2.

Example 6.2.5 Consider a simple query on our database:

q1() D P(x, ‘Gland Development’), V(x)

This query asks if there exists a gene product x, that is associated with ‘Gland Development’, and also has a common function with ‘Aac11’, that is it also appears in the output of V. On the data in

Figure 6.1, q1 is satisfied on a world W if and only if (1) AGO2 is associated with Gland development and (2) AGO2 and Aac11 have a common function, here, either Embryonic Development or Cell

Death. The first subgoal requires that t3 be present and the second that t6 be present. The formula

λt3 ∧ λt6 simplifies to x1 ∧ x2, i.e., we must trust both Dr. X and Dr. Y’s RA to derive q1, which has probability 3/4 · 1/4 = 3/16 ≈ 0.19.
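On examples as small as Figure 6.1, Definition 6.2.4 and Eq. (6.2) can be evaluated literally by enumerating assignments. The sketch below is illustrative only: it encodes lineage functions as Python predicates over the set of true atoms, includes only the tuples relevant to q1, and reproduces the 3/16 computed above.

    from itertools import combinations

    def marginal(q, lineage, prob):
        """µ(q): sum, over assignments A of the atoms, of Pr[A] whenever q holds
        in the unique world induced by A."""
        atoms = list(prob)
        total = 0.0
        for r in range(len(atoms) + 1):
            for A in combinations(atoms, r):
                A = frozenset(A)
                weight = 1.0
                for a in atoms:
                    weight *= prob[a] if a in A else 1.0 - prob[a]
                world = {t for t, lam in lineage.items() if lam(A)}
                if q(world):
                    total += weight
        return total

    prob = {'x1': 3/4, 'x2': 1/4, 'x3': 1/8}
    lineage = {
        't3': lambda A: 'x2' in A,                                  # AGO2, "Gland development"
        't6': lambda A: 'x1' in A and ('x2' in A or 'x3' in A),     # (x1 ∧ x2) ∨ (x1 ∧ x3)
    }
    q1 = lambda W: 't3' in W and 't6' in W
    print(marginal(q1, lineage, prob))   # 0.1875 = 3/16, as in Example 6.2.5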

We now generalize the standard (complete) semantics to give approximate semantics; the approximate lineage semantics are used to give semantics to the compressed Level II database.

Sufficient Lineage Our first form of approximate lineage is called sufficient lineage. The idea is simple: each λt is replaced by a Boolean formula λ̃S_t such that λ̃S_t ⇒ λt is a tautology. Intuitively, we think of λ̃S_t as a good approximation to λt if λ̃S_t and λt agree on most assignments. We define the function λ̃S(A, W) following Eq. 6.1:

λ̃S(A, W) ≝ ( ∧_{t: t∈W} λ̃S_t(A) ) ∧ ( ∧_{t: t∉W} (1 − λ̃S_t(A)) )    (6.1s)

The formula simply replaces each tuple t’s lineage λt with the sufficient lineage λ̃S_t and then checks whether W is a possible world for A given the sufficient lineage. This, in turn, defines a new probability distribution on worlds, µ̃S:

µ̃S(W) ≝ ∑_{A⊆A} λ̃S(A, W) ( ∏_{i: ai∈A} p(ai) ) ( ∏_{j: aj∉A} (1 − p(aj)) )

Given a query q, we define µ̃S(q) exactly as in Eq. 6.2, with µ syntactically replaced by µ̃S, i.e., as a weighted sum over all worlds W satisfying q. Two facts are immediate: (1) µ̃S is a probability measure and (2) for any conjunctive (monotone) query q, µ̃S(q) ≤ µ(q). Sufficient lineage is syntactically the same as standard lineage. Hence, it can be used to process queries with existing relational probabilistic database systems, such as Mystiq and Trio. If the lineage is a large DNF formula, then any single disjunct is a sufficient lineage. However, there is a trade-off between choosing sufficient lineage that is small and lineage that is a good approximation. In some cases,

it is possible to get both. For example, the sufficient lineage of a tuple may be less than 1% of the size of the original lineage, but still be a very precise approximation.

Example 6.2.6 We evaluate q from Ex. 6.2.5. In Figure 6.1, a sufficient lineage for tuple t6 is trusting Dr. X and Dr. Y’s RA, that is, λ̃S_t6 = x1 ∧ x2. Thus, q is satisfied exactly with this probability, which is ≈ 0.19. Recall from Ex. 6.2.5 that in the original data, q reduced to exactly the formula x1 ∧ x2, and so the sufficient lineage approach computes the exact answer. In general, this is not the case: if we had chosen λ̃S_t6 = x1 ∧ x3, i.e., our explanation was trusting Dr. X and the matching, then we would have computed that µ̃S(q) = 3/4 · 1/8 = 3/32 ≈ 0.09 ≤ 0.19.
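The lower-bound behaviour can also be checked directly by reusing the marginal sketch given after Example 6.2.5 (together with its lineage and prob data); here we do so for the Boolean query asking whether t6 appears in the view, and the swapped-in formula is the weaker sufficient lineage from this example.

    # Reuses marginal, lineage and prob from the sketch after Example 6.2.5.
    q_t6 = lambda W: 't6' in W                                    # "does AGO2 appear in V?"
    print(marginal(q_t6, lineage, prob))                          # 33/128 ≈ 0.258 (complete lineage)
    suff = dict(lineage, t6=lambda A: 'x1' in A and 'x3' in A)    # sufficient lineage x1 ∧ x3
    print(marginal(q_t6, suff, prob))                             # 3/32 ≈ 0.094, a lower bound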

One can also consider the dual form of sufficient lineage, necessary lineage, where each formula λt is replaced with a Boolean formula λ̃N_t, such that λt ⇒ λ̃N_t is a tautology. Similar properties hold for necessary lineage: for example, µ̃N is an upper bound for µ, which implies that using necessary and sufficient lineage in concert can provide the user with a more robust understanding of query answers. For the sake of brevity, we shall focus on sufficient lineage for the remainder of this chapter.

Polynomial Lineage In contrast to both standard and sufficient lineages that map each tuple to a Boolean function, polynomial approximate lineage maps each tuple to a real-valued function. This generalization allows us to leverage approximation techniques for real-valued functions, such as Taylor and Fourier series.

Given a Boolean formula λt on Boolean variables x1,..., xn, an arithmetization is a real-valued polynomial λA_t(x1,..., xn) in real variables such that (1) each variable xi has degree 1 in λA_t and (2) for any (x1,..., xn) ∈ {0, 1}^n, we have λt(x1,..., xn) = λA_t(x1,..., xn) [121, p. 177]. For example, an arithmetization of xy ∨ xz is x(1 − (1 − y)(1 − z)) and an arithmetization of xy ∨ xz ∨ yz is xy + xz + yz − 2xyz. Figure 6.1 illustrates an arithmetization of the lineage formula for t6, which is denoted λA_t6. In general, the arithmetization of a lineage formula may be exponentially larger than the original lineage formula. As a result, we do not use the arithmetization directly; instead, we approximate it. For example, an approximate polynomial for λt6 is λ̃P_t6 in Figure 6.1.
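The multilinear arithmetization of a monotone DNF can be computed by inclusion-exclusion over its monomials, using idempotence (x·x = x on {0, 1}). The sketch below is illustrative (representing polynomials as coefficient maps is an assumption made for the example); it reproduces the two examples above and the arithmetization λA_t6 of Figure 6.1.

    from itertools import combinations

    def arithmetize(dnf):
        """Multilinear arithmetization of a monotone DNF via inclusion-exclusion.
        dnf: list of monomials, each a set of variable names.
        Returns a dict mapping frozenset(vars) -> integer coefficient."""
        poly = {}
        for r in range(1, len(dnf) + 1):
            for N in combinations(dnf, r):
                mono = frozenset().union(*N)        # product of monomials, with x*x = x
                poly[mono] = poly.get(mono, 0) + (-1) ** (r + 1)
        return {m: c for m, c in poly.items() if c != 0}

    print(arithmetize([{'x', 'y'}, {'x', 'z'}]))
    # coefficients of xy + xz - xyz, i.e. x(1 - (1 - y)(1 - z))
    print(arithmetize([{'x', 'y'}, {'x', 'z'}, {'y', 'z'}]))
    # coefficients of xy + xz + yz - 2xyz
    print(arithmetize([{'x1', 'x2'}, {'x1', 'x3'}]))
    # coefficients of x1x2 + x1x3 - x1x2x3, matching λA_t6 in Figure 6.1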

To define our formal semantics, we define λ˜ P(A, W) generalizing Eq. 6.1 by allowing λ˜ P to assign a real-valued, as opposed to Boolean, weight.

$$\tilde{\lambda}^P(A, W) \stackrel{\mathrm{def}}{=} \prod_{t : t \in W} \tilde{\lambda}^P_t(A) \;\prod_{t : t \notin W} \big(1 - \tilde{\lambda}^P_t(A)\big) \qquad (6.1p)$$

Example 6.2.7 In Figure 6.1, $W_{1256}$ is a possible world since $\lambda(A, W_{1256}) = 1$ for $A = \{x_1, x_3\}$. In contrast, $\tilde{\lambda}^P(A, W_{1256}) \neq 1$. To see this, $\tilde{\lambda}^P(A, W_{1256})$ simplifies to $\tilde{\lambda}^P_{t_6}(A)$, since all other lineage functions have $\{0, 1\}$ values. Evaluating $\tilde{\lambda}^P_{t_6}(A)$ gives $\frac{33}{128} + \frac{21}{32}(0 - \frac{1}{4}) + \frac{9}{16}(1 - \frac{1}{8}) = \frac{75}{128} \approx 0.58$. Further, approximate lineage functions may assign non-zero mass even to worlds which are not possible. For example, $W_{125}$ is not a possible world, but $\tilde{\lambda}^P(A, W_{125}) = 1 - \tilde{\lambda}^P_{t_6}(A) = 1 - \frac{75}{128} \neq 0$.

The second step in the standard construction is to define a probability measure $\mu$ (Def. 6.2.4); in approximate lineage, we define a function $\tilde{\mu}^P$ (which may not be a probability measure) that assigns arbitrary real-valued weights to worlds. Here, $p_i = p(a_i)$, where $p$ is a probability assignment as in Def. 6.2.4:

$$\tilde{\mu}^P(W) \stackrel{\mathrm{def}}{=} \sum_{A \subseteq \mathcal{A}} \tilde{\lambda}^P(A, W) \left(\prod_{i : a_i \in A} p_i\right) \left(\prod_{j : a_j \notin A} (1 - p_j)\right) \qquad (6.3)$$

Our approach is to search for a $\tilde{\lambda}^P$ that is a good approximation, that is, one for which, for any $q$, we have $\tilde{\mu}^P(q) \approx \mu(q)$: the value computed using approximate lineage is close to the standard approach. As with sufficient lineage, we obtain a query semantics by syntactically replacing $\mu$ by $\tilde{\mu}^P$ in Eq. 6.2. However, the semantics for polynomial lineage is more general than the two previous semantics, since an assignment is allowed to map to many worlds.

Example 6.2.8 Continuing Ex. 6.2.7, in the original data, $\mu(W_{1256}) = \frac{9}{128}$. However, $\tilde{\mu}^P$ assigns $W_{1256}$ a different weight:

$$\tilde{\mu}^P(W_{1256}) \;=\; \tilde{\lambda}^P(W_{1256}) \left(\frac{3}{4}\right) \left(\frac{1}{8}\right) \left(1 - \frac{1}{4}\right), \qquad \text{where } \tilde{\lambda}^P(W_{1256}) = \frac{75}{128}$$

Recall $q_1$ from Ex. 6.2.5; its value is $\mu(q_1) \approx 0.19$. Using Eq. 6.3, we can calculate that the value of $q_1$ on the Level II database using the polynomial lineage in Figure 6.1 is $\tilde{\mu}^P(q_1) \approx 0.17$. In this case the error is $\approx 0.02$. If we had treated the tuples in the database independently, we would get the value $\frac{1}{4} \cdot \frac{11}{32} \approx 0.06$, an error of $0.13$, which is an order of magnitude larger than the error of the approach using polynomial lineage. Further, $\tilde{\lambda}^P$ is smaller than the original Boolean formula.

6.2.3 Problem Statements and Results

In our approach, the original Level I database, which uses a complete lineage system, is lossily compressed to create a Level II database, which uses an approximate lineage system; we then perform all querying and exploration on the Level II database. To realize this goal, we need to solve three technical problems: (1) create a "good" Level II database, (2) provide algorithms to explore the data given the approximate lineage, and (3) process queries using the approximate lineage.

Internal Lineage Functions Although our algorithms apply to general lineage functions, many of our theoretical results will consider an important special case of lineage functions called internal lineage functions [146]. In internal lineage functions, there are some tables (base tables) such that every tuple is annotated with a single atom, e.g., P in Figure 6.1. The database also contains derived tables (views), e.g., V in Figure 6.1. The lineage for derived tables is derived using the definition of V and tuples in base tables. For our purposes, the significance of internal lineage is that all lineage is a special kind of Boolean formula, a k-monotone DNF (k-mDNF). A Boolean formula is a k-mDNF if it is a disjunction of monomials, each containing at most k literals and no negations. The GO database is captured by an internal lineage function.

Proposition 6.2.9. If t is a tuple in a view V that, when unfolded, references k (not necessarily distinct) base tables, then the lineage function $\lambda_t$ is a k-mDNF.

Proof. The proof is the same as Proposition 2.3.5 in Chapter 2 with one minor twist: we have restricted all atoms to occur only positively (since they are Boolean), and so the resulting lineage formula $\lambda_t$ is monotone in these atoms. $\square$

One consequence of this is that k is typically small. And so, as in data complexity [1], we consider k a small constant. For example, an algorithm is considered efficient if it is at most polynomial in the size of the data, but possibly exponential in k.

Creating Approximate Lineage

Informally, approximate lineage is good if (1) for each tuple $t$ the function $\tilde{\lambda}_t$ is a close approximation of $\lambda_t$, i.e., $\lambda_t$ and $\tilde{\lambda}_t$ are close on many assignments, and (2) the size of $\tilde{\lambda}_t$ is small for every $t$. Here, we write $\tilde{\lambda}_t$ (without a superscript) when a statement applies to either type of approximate lineage.

Definition 6.2.10. Fix a set of atoms $\mathcal{A}$. Given a probabilistic assignment $p$ for $\mathcal{A}$, we say that $\tilde{\lambda}_t$ is an $\varepsilon$-approximation of $\lambda_t$ if
$$E_p[(\tilde{\lambda}_t - \lambda_t)^2] \leq \varepsilon$$
where $E_p$ denotes the expectation over assignments to atoms induced by the probability function $p$.

Given a probabilistic assignment p, our goal is to ensure that the lineage function for every tuple in the database has an ε-approximation. Def. 6.2.10 is used in computational learning, e.g., [114, 151], because an ε-approximation of a function disagrees on only a few inputs:

Example 6.2.11 Let $y_1$ and $y_2$ be atoms such that $p(y_i) = 0.5$ for $i = 1, 2$. Consider the lineage function for some $t$, $\lambda_t(y_1, y_2) = y_1 \vee y_2$, and an approximate lineage function $\tilde{\lambda}^S_t(y_1, y_2) = \tilde{\lambda}^P_t(y_1, y_2) = y_1$. Here, $\lambda_t$ and $\tilde{\lambda}^S_t$ (or $\tilde{\lambda}^P_t$) differ on precisely one of the four assignments, i.e., $y_1 = 0$ and $y_2 = 1$. Since all assignments are equally weighted, $\tilde{\lambda}^S_t$ is a $1/4$-approximation for $\lambda_t$. In general, if $\lambda_1$ and $\lambda_2$ are Boolean functions on atoms $\mathcal{A} = \{y_1, \dots, y_n\}$ such that $p(y_i) = 0.5$ for $i = 1, \dots, n$, then $\lambda_1$ is an $\varepsilon$-approximation of $\lambda_2$ if $\lambda_1$ and $\lambda_2$ differ on less than an $\varepsilon$ fraction of assignments.
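The following is a minimal sketch (ours, not the dissertation's implementation) of how the error in Def. 6.2.10 can be estimated empirically: sample atom assignments from the product distribution $p$ and average the squared disagreement. The formulas and probabilities are those of Example 6.2.11; the helper name is illustrative.

import random

def approx_error(lam, lam_tilde, probs, samples=100_000, seed=0):
    # Monte Carlo estimate of E_p[(lam_tilde - lam)^2] over atom assignments
    # drawn from the independent product distribution p.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        a = {x: rng.random() < px for x, px in probs.items()}
        total += (lam_tilde(a) - lam(a)) ** 2
    return total / samples

# Example 6.2.11: lam = y1 OR y2 and lam_tilde = y1, with p(y1) = p(y2) = 0.5.
probs = {"y1": 0.5, "y2": 0.5}
lam = lambda a: int(a["y1"] or a["y2"])
lam_tilde = lambda a: int(a["y1"])
print(approx_error(lam, lam_tilde, probs))  # close to 0.25, i.e., a 1/4-approximation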

Our first problem is constructing lineage that has arbitrarily small approximation error and occupies a small amount of space.

Problem 3 (Constructing Lineage). Given a lineage function $\lambda_t$ and an input parameter $\varepsilon$, can we efficiently construct an $\varepsilon$-approximation for $\lambda_t$ that is small?

For internal lineage functions, we show how to efficiently construct approximate lineage that is provably small, for both sufficient lineage (Section 6.3.2) and polynomial lineage (Section 6.4.2), under the technical assumption that the atoms have probabilities bounded away from 0 and 1; e.g., we do not allow probabilities of the form $n^{-1}$ where $n$ is the size of the database. Informally, provably small means that the lineage for a tuple does not depend on the size of the database. The lineage of a tuple in the output of a view V may, however, depend (exponentially) on the size of the description of V. Further, we experimentally verify that sufficient lineage offers compression ratios of up to 60:1 on real datasets and polynomial lineage offers up to 171:1, even with stringent error requirements, e.g., $\varepsilon = 10^{-3}$.

Understanding Lineage

Recall our scientist from the introduction: she is skeptical of an answer the database produces, e.g., $t_6$ in Figure 6.1, and wants to understand why the system believes that $t_6$ is an answer to her query.

We informally discuss the primitive operations our system provides to help her understand t6 and then define the corresponding formal problems that we need to solve to apply approximate lineage to this problem.

Sufficient Explanations She may want to know the possible explanations for a tuple, i.e., "why was tuple $t_6$ returned?". Since there are many possible explanations (or derivations), our technical goal is to find the best or most likely (top) explanations.

Finding influential atoms Our scientist may want to know which atoms contributed to returning the surprising tuple, $t_6$. In a complicated query, the result will depend on many atoms, but some atoms are more influential in producing the query result than others. Informally, an atom $x_1$ is influential if there are many assignments in which it casts the "deciding vote", i.e., changing the assignment of $x_1$ changes whether $t_6$ is returned. In contrast, an atom that does not affect the answer to the query has no influence. This motivates our technical goal, which is to return the most influential atoms.

The technical challenge in both situations is to perform these actions using approximate lineage on the Level II database, without retrieving the much larger Level I database.

Sufficient Explanations An explanation for a lineage function λt is a minimal conjunction of atoms τt such that for any assignment a to the atoms, we have τt(a) =⇒ λt(a). The probability of an explanation, τ, is µ[τ]. A solution would be straightforward on the original, Level I database: Execute the query, produce a lineage formula, and simply select the most highly probable monomials in the answer. Our goal is more difficult: we need to retrieve the top-k explanations, ranked by probability, from the lossily-compressed, Level II data.

Problem 4. Given a tuple t, calculate the top-k explanations, ranked by their probability using only the Level II database.

This problem is straightforward when using sufficient lineage, but is more challenging for polynomial lineage. The first reason is that polynomials seem to throw away information about monomials. For example, $\tilde{\lambda}^P_{t_6}$ in Figure 6.1 does not mention the terms of any monomial. Further complicating matters is that even computing the expectation of $\tilde{\lambda}^P_{t_6}$ may be intractable, and so we have to settle for approximations, which introduce error. As a result, we must resort to statistical techniques to guess if a formula $\tau_t$ is a sufficient explanation. In spite of these problems, we are able to use polynomial lineage to retrieve sufficient explanations with a precision of up to 70% for $k = 10$ with error in the lineage $\varepsilon = 10^{-2}$.

Finding Influential Atoms The technical question is: given a formula, e.g., $\lambda_{t_6}$, which atom is most influential in computing $\lambda_{t_6}$'s value? We define the influence of $x_i$ on $\lambda_t$, denoted $\mathsf{Inf}_{x_i}(\lambda_t)$, as:

$$\mathsf{Inf}_{x_i}(\lambda_t) \stackrel{\mathrm{def}}{=} \mu[\lambda_t(A) \neq \lambda_t(A \oplus \{i\})] \qquad (6.4)$$

where $\oplus$ denotes the symmetric difference. This definition, or a closely related one, has appeared in a wide variety of work, e.g., underlying causality in the AI literature [86, 131], influential variables in the learning literature [114], and critical tuples in the database literature [119, 136].

Example 6.2.12 What influence does $x_2$ have on tuple $t_6$'s presence, i.e., what is the value of $\mathsf{Inf}_{x_2}(\lambda_{t_6})$? Informally, $x_2$ can only change the value of $\lambda_{t_6}$ if $x_1$ is true and $x_3$ is false. This happens with probability $\frac{3}{4}(1 - \frac{1}{8}) = \frac{21}{32}$. As we will see later, it is no coincidence that this is the coefficient of $x_2$ in $\tilde{\lambda}^P_{t_6}$: our polynomial representation uses the influence to define its coefficients.
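As a sanity check on Eq. 6.4, the small sketch below (ours) computes influence exactly by enumerating atom assignments under the product distribution. It assumes a lineage of the form $(x_1 \wedge x_2) \vee (x_1 \wedge x_3)$ for $t_6$, with $p(x_1) = 3/4$, $p(x_2) = 1/4$, $p(x_3) = 1/8$, which is consistent with the numbers in this chapter's running example; the helper names are ours.

from itertools import product

def influence(formula, i, probs):
    # Exact influence of atom i (Eq. 6.4): the probability, over assignments drawn
    # from the product distribution, that flipping atom i changes the formula's value.
    atoms = sorted(probs)
    total = 0.0
    for bits in product([0, 1], repeat=len(atoms)):
        a = dict(zip(atoms, bits))
        weight = 1.0
        for x, b in a.items():
            weight *= probs[x] if b else (1 - probs[x])
        flipped = dict(a)
        flipped[i] = 1 - flipped[i]
        if formula(a) != formula(flipped):
            total += weight
    return total

# Assumed lineage for t6: (x1 AND x2) OR (x1 AND x3), with p = (3/4, 1/4, 1/8).
probs = {"x1": 0.75, "x2": 0.25, "x3": 0.125}
lam_t6 = lambda a: (a["x1"] and a["x2"]) or (a["x1"] and a["x3"])
print(influence(lam_t6, "x2", probs))  # 0.65625 = 21/32, matching Ex. 6.2.12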

The formal problem is to find the top k most influential variables, i.e., the variables with the k highest influences:

Problem 5. Given a tuple t, efficiently calculate the k most influential variables in λt using only the level II database.

This problem is challenging because the Level II database is a lossily-compressed version of the database and so some information needed to exactly answer Prob. 5 is not present. The key observation for polynomial lineage is that the coefficients we retain are the coefficients of influential variables; this allows us to compute the influential variables efficiently in many cases. We show that we can achieve an almost-perfect average precision for the top 10. For sufficient lineage, we are able to give an approach with bounded error to recover the influential coefficients.

Query Processing with Approximate Lineage

Our goal is to efficiently answer queries directly on the Level II database, using sampling approaches:

Problem 6. Given an approximate lineage function $\tilde{\lambda}$ and a query $q$, efficiently evaluate $\tilde{\mu}(q)$ with low error.

Processing sufficient lineage is straightforward using existing complete techniques; however, we are able to prove that the error will be small. We verify experimentally that we can answer queries with low error ($10^{-3}$) two orders of magnitude more quickly than a complete approach. For polynomial lineage, we are able to directly adapt techniques from the literature, such as Blum et al. [19].

6.2.4 Discussion

The acquisition of atoms and trust policies is an interesting future research direction. Since our focus is on large databases, it is impractical to require users to label each atom manually. One approach is

to define a language for specifying trust policies. Such a language could do double duty, by also specifying correlations between atoms. We consider the design of a policy language to be important future work. In this chapter, we assume that the atoms are given, the trust policies are explicitly specified, and all atoms are independent.

6.3 Sufficient Lineage

We define our proposal for sufficient lineage that replaces a complicated lineage formula $\lambda_t$ by a simpler (and smaller) formula $\tilde{\lambda}^S_t$. We construct $\tilde{\lambda}^S_t$ using several sufficient explanations for $\lambda_t$.

6.3.1 Sufficient Lineage Proposal

Given an internal lineage function for a tuple $t$, that is, a monotone k-DNF formula $\lambda_t$, our goal is to efficiently find a sufficient lineage $\tilde{\lambda}^S_t$ that is small and is an $\varepsilon$-approximation of $\lambda_t$ (Def. 6.2.10). This differs from L-minimality [15], which looks for a formula that is equivalent but smaller; in contrast, we look for a formula that may only approximate the original formula. More formally, the size of a sufficient lineage $\tilde{\lambda}^S_t$ is the number of monomials it contains, and so it is small if it contains few monomials. The definition of $\varepsilon$-approximation (Def. 6.2.10) simplifies for sufficient lineage and gives us intuition about how to find good sufficient lineage.

Proposition 6.3.1. Fix a Boolean formula $\lambda_t$ and let $\tilde{\lambda}^S_t$ be a sufficient explanation for $\lambda_t$, that is, for any assignment $A$, we have $\tilde{\lambda}^S_t(A) \Rightarrow \lambda_t(A)$. In this situation, the error function simplifies to $E[\lambda_t] - E[\tilde{\lambda}^S_t]$; formally, the following equation holds: $E[(\lambda_t - \tilde{\lambda}^S_t)^2] = E[\lambda_t] - E[\tilde{\lambda}^S_t]$.

Proof. The formula $(\lambda_t - \tilde{\lambda}^S_t)^2$ is non-zero only if $\lambda_t \neq \tilde{\lambda}^S_t$, which means that $\lambda_t = 1$ and $\tilde{\lambda}^S_t = 0$, since $\tilde{\lambda}^S_t(A) \Rightarrow \lambda_t(A)$ for any $A$. Because both $\lambda_t$ and $\tilde{\lambda}^S_t$ are Boolean, $(\lambda_t - \tilde{\lambda}^S_t)^2 \in \{0, 1\}$ and simplifies to $\lambda_t - \tilde{\lambda}^S_t$. We use linearity of $E$ to conclude. $\square$

Proposition 6.3.1 tells us that to get sufficient lineage with low error, it is enough to look for a sufficient formula $\tilde{\lambda}_t$ with high probability.

Scope of our Analysis In this section, our theoretical analysis considers only internal lineage functions with constant-bounded probability distributions; a distribution is constant bounded if there is a constant $\beta$ such that for any atom $a$, $p(a) > 0$ implies that $p(a) \geq \beta$. To justify this, recall that in GO, the probabilities are computed based on the type of evidence: for example, a citation in PubMed is assigned 0.9, while an automatically inferred matching is assigned 0.1. Here, $\beta = 0.1$ and is independent of the size of the data. In the following discussion, $\beta$ will always stand for this bound and $k$ will always refer to the maximum number of literals in any monomial of the lineage formula. Further, we shall only consider sufficient lineage that is a subformula of $\lambda_t$. This choice guarantees that the resulting formula is sufficient lineage and is also simple enough for us to analyze theoretically.

6.3.2 Constructing Sufficient Lineage

The main result of this section is an algorithm (Alg. 6.3.2.1) that constructs good sufficient lineage, solving Prob. 3. Given an error term $\varepsilon$ and a formula $\lambda_t$, Alg. 6.3.2.1 efficiently produces an approximate sufficient lineage formula $\tilde{\lambda}^S_t$ with error less than $\varepsilon$. Further, Theorem 6.3.3 shows that the size of the formula produced by Alg. 6.3.2.1 depends only on $\varepsilon$, $k$ and $\beta$, not on the number of variables or number of terms in $\lambda_t$, implying that the formula is theoretically small. Before diving into the algorithm, we consider a simple, alternative approach to constructing a sufficient lineage formula to build intuition for the technical challenges.

Example 6.3.2 Given a monotone k-DNF formula $\lambda$, suppose that to construct a sufficient lineage formula we simply select the highest-probability monomials. Consider the following 2-DNF formula, call it $\phi_t$, for $t = 1, \dots, n, \dots$:

$$\phi_t \stackrel{\mathrm{def}}{=} \bigvee_{i=1,\dots,t} (x_0 \wedge y_i) \;\vee\; \bigvee_{j=1,\dots,t} (x_j \wedge z_j)$$

Let $\varepsilon = 0.1$ (for concreteness); then we set the probabilities as follows: $\mu[x_i] = \mu[z_j] = 0.5$ for $i = 0, \dots, t$ and $j = 1, \dots, t$. Then, set $\mu[y_i] = 0.5 + \varepsilon$ for $i = 1, \dots, t$.

In $\phi_t$ the highest-probability monomials will be $H = \{(x_0 \wedge y_1), \dots, (x_0 \wedge y_t)\}$. Now, each of these monomials contains the same variable, $x_0$, and so $\mu[H] \leq \Pr[x_0] = 0.5$. On the other hand, the probability of $\phi_t$ approaches 1, since the other monomials are all independent. Thus, the error of using $H$ approaches 0.5 and in particular cannot be made smaller than a constant. Hence, this approach would not be a solution for constructing sufficient lineage. In this example, we should have selected the large independent block, whose probability tends to 1. Indeed, this is the general rule: pick as many independent monomials as you can. What we show next is that when there is not a sufficiently large independent set, then there must be a small set of bottlenecks like $x_0$, i.e., a small cover, and we can use this cover to recurse.

Algorithm 6.3.2.1 Suff($\lambda_t$, $\varepsilon$) constructs sufficient lineage
Input: A monotone (k+1)-mDNF formula $\lambda_t$ and an error $\varepsilon > 0$
Output: $\tilde{\lambda}^S_t$, a small sufficient lineage $\varepsilon$-approximation.
1: Find a matching M greedily. (* A subset of monomials *)
2: if $\mu[\lambda_t] - \mu[M] \leq \varepsilon$ then (* If $\lambda_t$ is a 1-mDNF, this always holds *)
3:   Let $M = m_1 \vee \cdots \vee m_l$ s.t. $i \leq j$ implies $\mu[m_i] \geq \mu[m_j]$
4:   return $M_r = m_1 \vee \cdots \vee m_r$, where $r$ is minimal s.t. $\mu[\lambda_t] - \mu[M_r] \leq \varepsilon$.
5: else
6:   Select a small cover $C = \{x_1, \dots, x_c\} \subseteq \mathrm{var}(M)$
7:   Arbitrarily assign each monomial to an $x_i \in C$ that covers it; call the resulting group $\lambda_i$
8:   for each $x_i \in C$ do
9:     $\tilde{\lambda}^S_i \leftarrow$ Suff($\lambda_i[x_i \to 1]$, $\varepsilon/c$) (* $\lambda_i[x_i \to 1]$ sets $x_i = 1$; the result is a k-DNF *)
10: return $\bigvee_{i=1,\dots,c} \tilde{\lambda}^S_i$

Algorithm Description Alg. 6.3.2.1 is a recursive algorithm: its input is a k-mDNF $\lambda_t$ and an error $\varepsilon > 0$, and it returns $\tilde{\lambda}^S_t$, a sufficient $\varepsilon$-approximation. For simplicity, we assume that we can compute the expectation of a monotone formula exactly. In practice, we estimate this quantity using sampling, e.g., using Luby-Karp [101]. The algorithm has two cases. In Case (I), on lines 2-4, there is a large matching, that is, a set of monomials M such that distinct monomials in M do not contain common variables. For example, in the formula $(x_1 \wedge y_1) \vee (x_1 \wedge y_2) \vee (x_2 \wedge y_2)$ a matching is $(x_1 \wedge y_1) \vee (x_2 \wedge y_2)$. In Case (II), lines 6-10, there is a small cover, that is, a set of variables $C = \{x_1, \dots, x_c\}$ such that every monomial in $\lambda_t$ contains some element of C. For example, in $(x_1 \wedge y_1) \vee (x_1 \wedge y_2) \vee (x_1 \wedge y_3)$, the singleton $\{x_1\}$ is a cover. The relationship between the two cases is that if we find a maximal matching smaller than $m$, then there is a cover of size smaller than $km$ (all variables in M form a cover).

Case I: (lines 2-4) The algorithm greedily selects a maximal matching $M = \{m_1, \dots, m_l\}$. If M is a good approximation, i.e., $\mu[\lambda_t] - \mu[\bigvee_{m \in M} m] \leq \varepsilon$, then we trim M to be as small as possible so that it is still a good approximation. Observe that $\Pr[\bigvee_{m \in M} m]$ can be computed efficiently since the monomials in M do not share variables, and so are independent. Further, for any size $l$, the subset of M of size $l$ with the highest probability consists of exactly the $l$ highest-probability monomials.

Case II: (lines 6-10) Let var(M) be the set of all variables in the maximal matching we found. Since M is a maximal matching, var(M) forms a cover $x_1, \dots, x_c$. We then arbitrarily assign each monomial $m$ to one element that covers $m$. For each $x_i$, let $\lambda_i$ be the set of monomials associated to that element of the cover, $x_i$. The algorithm recurses on each $\lambda_i$ with smaller error $\varepsilon/c$ and returns their disjunction. We choose $\varepsilon/c$ so that our result is an $\varepsilon$-approximate lineage.
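The sketch below (ours, in Python rather than the dissertation's Caml, with exact brute-force probabilities standing in for Luby-Karp sampling) illustrates the recursive structure just described. It re-attaches the cover variable after each recursive call so that the output remains a subformula of the input, hence sufficient; the atom names and probabilities in the usage example are assumptions for illustration.

from itertools import product

def dnf_prob(clauses, probs):
    # Exact probability of a monotone DNF (list of frozensets of atom names) under
    # independent atoms, by brute-force enumeration; a stand-in for the Luby-Karp
    # estimator used in the dissertation.
    if not clauses:
        return 0.0
    atoms = sorted(set().union(*clauses))
    total = 0.0
    for bits in product([0, 1], repeat=len(atoms)):
        a = dict(zip(atoms, bits))
        w = 1.0
        for x, b in a.items():
            w *= probs[x] if b else 1 - probs[x]
        if any(all(a[x] for x in c) for c in clauses):
            total += w
    return total

def clause_prob(c, probs):
    p = 1.0
    for x in c:
        p *= probs[x]
    return p

def suff(clauses, probs, eps):
    # Return a subset of the input monomials whose probability is within eps of the
    # full formula, following the matching/cover recursion of Alg. 6.3.2.1.
    # Step 1: greedily build a matching M (monomials sharing no atoms).
    matching, used = [], set()
    for c in clauses:
        if not (c & used):
            matching.append(c)
            used |= c
    full = dnf_prob(clauses, probs)
    # Case I: the matching is already an eps-approximation; keep a minimal prefix of
    # its highest-probability monomials.
    if full - dnf_prob(matching, probs) <= eps:
        matching.sort(key=lambda c: clause_prob(c, probs), reverse=True)
        for r in range(1, len(matching) + 1):
            if full - dnf_prob(matching[:r], probs) <= eps:
                return matching[:r]
        return matching
    # Case II: the atoms of M form a cover; group monomials by a covering atom,
    # recurse with that atom set to true (error eps/|C|), and conjoin the atom back
    # so the result stays a subformula of the input.
    cover = sorted(used)
    groups = {x: [] for x in cover}
    for c in clauses:
        for x in cover:
            if x in c:
                groups[x].append(c)
                break
    result = []
    for x in cover:
        if not groups[x]:
            continue
        if any(len(c) == 1 for c in groups[x]):
            result.append(frozenset([x]))   # a monomial was exactly {x}; it suffices here
            continue
        reduced = [c - {x} for c in groups[x]]
        for sub in suff(reduced, probs, eps / len(cover)):
            result.append(sub | {x})
    return result

# Example 6.3.2 with t = 2; the atom names and probabilities are illustrative.
probs = {"x0": 0.5, "x1": 0.5, "x2": 0.5, "y1": 0.6, "y2": 0.6, "z1": 0.5, "z2": 0.5}
phi = [frozenset(c) for c in (("x0", "y1"), ("x0", "y2"), ("x1", "z1"), ("x2", "z2"))]
print(suff(phi, probs, 0.1))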

Theorem 6.3.3 (Solution to Prob. 3). For any $\varepsilon > 0$, Alg. 6.3.2.1 computes small $\varepsilon$-sufficient lineage. Formally, the output of the algorithm, $\tilde{\lambda}^S_t$, satisfies two properties: (1) $\tilde{\lambda}^S_t$ is an $\varepsilon$-approximation of $\lambda_t$ and (2) the number of monomials in $\tilde{\lambda}^S_t$ is less than $k!\,\beta^{-k}\left(\log^{k-1}\!\big(\tfrac{1}{\varepsilon}\big) + O\big(\log^{k-1}\!\big(\tfrac{1}{\delta}\big)\big)\right)$, which is independent of the size of $\lambda_t$.

Proof. Claim (1) follows from the preceding algorithm description. To prove claim (2), we inspect the algorithm. In Case I, the maximum size of a matching is upper bounded by $\beta^{-k} \log(\tfrac{1}{\varepsilon})$, since a matching of size $m$ in a k-mDNF has probability at least $1 - (1 - \beta^k)^m$; if this value is greater than $1 - \varepsilon$, we can trim terms; combining this inequality with the fact that $1 - x \leq e^{-x}$ for $x \geq 0$ completes Case I. In Case II, the size of the cover $c$ satisfies $c \leq k\beta^{-k} \log(\tfrac{1}{\varepsilon})$. If we let $S(k+1, \varepsilon)$ denote the size of our formula at depth $k+1$ with parameter $\varepsilon$, then it satisfies the recurrence $S(k+1, \varepsilon) = (k+1)\beta^{-(k+1)} \log(\tfrac{1}{\varepsilon}) \cdot S(k, \varepsilon/c)$, which grows no faster than the claimed formula. $\square$

The recursion of our algorithm is linear: since no monomial is replicated in any recursive call and the depth of the recursion is at most $k$, the recursion performs $O(k|\lambda_t|)$ steps. But at each stage we must compute the probability of a DNF formula, a #P-hard problem. For this, we use a randomized solution (either Luby-Karp [101] or a Chernoff bound on naive random sampling). Randomization introduces the possibility of failure, which we cope with in the standard way: a slight increase in running time.

We observe that the running time of the randomized solution is a function of two things: (1) the error, which we can take to be a small constant depending on $\varepsilon$, and (2) the probability of failure, $\delta$. We set $\delta$ so small that the algorithm succeeds with high probability. Let $s$ be the number of calls to the randomized algorithm during execution; then the probability that all of them succeed is $(1 - \delta)^s$. Hence, we need to take $\delta \ll s^{-1}$. To see the effect on the running time, let $t_i$ be the running time of call $i$ and $m_i$ be the number of terms in the approximation on call $i$; then $t_i = O(m_i \log \tfrac{1}{\delta})$. Now, we observe that $\sum_i m_i = k|\lambda_t|$ as above, and so the running time is $\sum_i t_i = O(k|\lambda_t| \log |\lambda_t|)$.

Completeness Our goal is to construct lineage that is as small as possible; one may wonder if we could efficiently produce substantially smaller lineage with a different, but still efficient, algorithm. We give evidence that no such algorithm exists by showing that the key step in Alg. 6.3.2.1 is intractable (NP-hard) even if we restrict to internal lineage functions with 3 subgoals, that is, $k = 3$. This justifies our use of a greedy heuristic above.

Proposition 6.3.4. Given a k-mDNF formula $\lambda_t$, finding a subformula $\tilde{\lambda}^S_t$ with $d$ monomials such that $\tilde{\lambda}^S_t$ has the largest probability among all such subformulae of $\lambda_t$ is NP-hard, even if $k = 3$.

Proof. We reduce from the problem of finding a k-uniform k-regular matching in a 3-hypergraph, which is NP-hard; see [88]. Given $(V_1, V_2, V_3, E)$ such that $E \subseteq V_1 \times V_2 \times V_3$, let each $v \in V_i$ have probability $\tfrac{1}{2}$. We observe that if there is a matching of size $d$, then there is a subformula $\tilde{\lambda}^S_t$ with probability $1 - (1 - \beta^k)^d$, and every other size-$d$ formula that is not a matching has strictly smaller probability. Since we can compute the probability of a matching efficiently, this completes the reduction. $\square$

The reduction is from the problem of finding a matching in a k-uniform k-regular hypergraph. The greedy algorithm is essentially an optimal approximation for this hypergraph matching problem [88]. Since our problem appears to be more difficult, this suggests, but does not prove, that our greedy algorithm may be close to optimal.

6.3.3 Understanding Sufficient Lineage

Both Prob. 4, finding sufficient explanations, and Prob. 5, finding influential variables, deal with understanding the lineage functions. Our proposal for sufficient lineage makes Prob. 4 straightforward: since $\tilde{\lambda}^S_t$ is a list of sufficient explanations, we simply return the highest-ranked explanations contained in $\tilde{\lambda}^S_t$. As a result, we focus on computing the influence of a variable given only sufficient lineage. The main result is that we can compute influence with only a small error using sufficient lineage. We do not discuss finding the top-k efficiently; for this we can use prior art, e.g., [137]. We restate the definition of influence in a computationally friendly form (Proposition 6.3.5) and then prove bounds on the error of our approach.

Proposition 6.3.5. Let $x_i$ be an atom with probability $p(x_i)$ and $\sigma_i^2 = p(x_i)(1 - p(x_i))$. If $\lambda_t$ is a monotone lineage formula:
$$\mathsf{Inf}_{x_i}(\lambda_t) = \sigma_i^{-2}\, E[\lambda_t \cdot (x_i - p(x_i))]$$

Proof. We first arithmetize $\lambda_t$ and then factor the resulting polynomial with respect to $x_i$, that is, $\lambda_t = f_i x_i + f_0$ where neither $f_i$ nor $f_0$ contains $x_i$. We then observe that $\mathsf{Inf}_{x_i}(\lambda_t) = E[\lambda_t(A \cup \{x_i\}) - \lambda_t(A - \{x_i\})]$ for monotone $\lambda_t$. Using the factorization above and linearity, we have that $\mathsf{Inf}_{x_i}(\lambda_t) = E[f_i]$. On the other hand, $E[\lambda_t (x_i - p(x_i))] = E[f_i x_i (x_i - p(x_i)) + (x_i - p(x_i)) f_0]$; since $f_i$ and $f_0$ do not contain $x_i$, this reduces to $E[f_i]\, E[x_i (x_i - p(x_i))] + 0$. Observing that $E[x_i (x_i - p(x_i))] = \sigma_i^2$ proves the claim. $\square$

The use of Proposition 6.3.5 is to show that we can compute influence from sufficient lineage with small error:

Proposition 6.3.6. Let $\tilde{\lambda}^S_t$ be a sufficient $\varepsilon$-approximation of $\lambda_t$; then for any $x_i \in \mathcal{A}$ s.t. $p(x_i) \in (0, 1)$, we have the following pair of inequalities:

$$\mathsf{Inf}_{x_i}(\tilde{\lambda}^S_t) - \frac{\varepsilon}{\sigma_i^2}\, p(x_i) \;\leq\; \mathsf{Inf}_{x_i}(\lambda_t) \;\leq\; \mathsf{Inf}_{x_i}(\tilde{\lambda}^S_t) + \frac{\varepsilon}{\sigma_i^2}\, (1 - p(x_i))$$

Proof. $E[\lambda_t (x_i - p(x_i))] = E[(\lambda_t - \tilde{\lambda}^S_t)(x_i - p(x_i)) + \tilde{\lambda}^S_t (x_i - p(x_i))]$. The first term is lower bounded by $-\varepsilon p(x_i)$ and upper bounded by $\varepsilon (1 - p(x_i))$. Multiplying by $\sigma_i^{-2}$ completes the bound, since the second term is $\mathsf{Inf}_{x_i}(\tilde{\lambda}^S_t)$. $\square$

This proposition basically says that we can calculate the influence for uncertain atoms. With naïve random sampling, we can estimate the influence of sufficient lineage to essentially any desired

precision. The number of relevant variables in sufficient lineage is small, so simply evaluating the influence of each variable and sorting is an efficient solution to solve Prob. 5.
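A minimal sketch (ours) of the naive-sampling estimator that this paragraph alludes to: estimate $E[\tilde{\lambda}^S_t (x_i - p(x_i))]$ by Monte Carlo and rescale by $\sigma_i^{-2}$ as in Proposition 6.3.5. The atom probabilities and the sufficient lineage used below are assumptions for illustration.

import random

def influence_estimate(lam, i, probs, samples=200_000, seed=0):
    # Estimate Inf_{x_i}(lam) = sigma_i^{-2} * E[lam(A) * (x_i - p(x_i))]
    # (Proposition 6.3.5) by naive Monte Carlo over independent atom assignments.
    rng = random.Random(seed)
    p_i = probs[i]
    var_i = p_i * (1 - p_i)
    total = 0.0
    for _ in range(samples):
        a = {x: int(rng.random() < px) for x, px in probs.items()}
        total += lam(a) * (a[i] - p_i)
    return total / samples / var_i

# An illustrative sufficient lineage for t6: x1 AND x2, with the example probabilities.
probs = {"x1": 0.75, "x2": 0.25, "x3": 0.125}
lam_suff = lambda a: int(a["x1"] and a["x2"])
print(influence_estimate(lam_suff, "x2", probs))  # close to 0.75 = Inf_{x2}(x1 AND x2)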

6.3.4 Query Processing

Existing systems such as Mystiq or Trio can directly process sufficient lineage since it is syntactically identical to standard (complete) lineage. However, using sufficient lineage in place of complete lineage introduces errors during query processing. In this section, we show that the error introduced by query processing is at most a constant factor worse than the error in a single sufficient lineage formula. Processing a query $q$ on a database with lineage boils down to building a lineage expression for $q$ by combining the lineage functions of individual tuples, i.e., intensional evaluation [68, 137]. For example, a join producing a tuple $t$ from $t_1$ and $t_2$ produces the lineage $\lambda_t = \lambda_{t_1} \wedge \lambda_{t_2}$ for $t$. We first prove that the error in processing a query $q$ is upper bounded by the number of lineage functions combined by $q$ (Proposition 6.3.7). Naïvely applied, this observation would show that the error grows with the size of the data. However, we observe that the lineage function for a conjunctive query depends on at most constantly many variables; from these two observations it follows that the query processing error is only a constant factor worse.

Proposition 6.3.7. If $\tilde{\lambda}^S_1$ and $\tilde{\lambda}^S_2$ are sufficient $\varepsilon$-approximations for $\lambda_1$ and $\lambda_2$, then both $\tilde{\lambda}^S_1 \wedge \tilde{\lambda}^S_2$ and $\tilde{\lambda}^S_1 \vee \tilde{\lambda}^S_2$ are $2\varepsilon$ sufficient approximations.

Proof. Both formulae are clearly sufficient. We write $\lambda_1 \lambda_2 - \tilde{\lambda}^S_1 \tilde{\lambda}^S_2 = \lambda_1(\lambda_2 - \tilde{\lambda}^S_2) + \tilde{\lambda}^S_2(\lambda_1 - \tilde{\lambda}^S_1)$; the expectation of each term is less than $\varepsilon$, completing the bound. $\square$

This proposition is essentially an application of a union bound [121]. From this proposition and the fact that a query $q$ that produces $n$ tuples and has $k$ subgoals involves $kn$ logical operations, we can conclude that if all lineage functions are $\varepsilon_S$-approximations, then $\mu(q) - \tilde{\mu}^S(q) \leq \varepsilon_S k n$. This bound depends on the size of the data. We want to avoid this, because it implies that to answer queries as the data grows, we would need to continually refine the lineage. The following proposition shows that sometimes there is a choice of sufficient lineage that can do a much better job; this is essentially the same idea as in Section 6.3.2:

Lemma 6.3.8. Fix a query $q$ with $k$ subgoals and $\varepsilon > 0$; there exists a database with a sufficient approximate lineage function $\tilde{\lambda}$ such that the lineage $\tilde{\lambda}_t$ for each tuple $t$ is of constant size and

$$\mu(q) - \tilde{\mu}^S(q) \leq \varepsilon$$

Proof. As we have observed, we can evaluate a query by producing an internal lineage function. This means that we can apply Theorem 6.3.3 to show that for any δ > 0, there exists a sub-formula φ˜ of size f (k, δ) such that µ(q) − E[φ˜] ≤ δ. We must only ensure that the atoms in these monomials are present. 

This shows that sufficient lineage can be effectively utilized for query processing, solving Prob. 6. It is an interesting open question to find such lineage that works for many queries simultaneously.

Example 6.3.9 Our current lineage approach uses only local knowledge, but we illustrate why some global knowledge may be required to construct lineage that is good for even simple queries.

Consider a database with $n$ tuples in a single relation $R = \{t_1, \dots, t_n\}$, where the lineage of tuple $t_i$ is $\lambda_{t_i} = x_0 \vee x_i$. A sufficient lineage database could set $\tilde{\lambda}_{t_i} = x_0$ for each $i$. Notice that the lineage of the query $q$ :- $R(x)$ on the sufficient lineage database is then $x_0$, while the formula on the Level I database is $x_0 \vee x_1 \vee \cdots \vee x_n$, which may have a much larger probability.

6.4 Polynomial Lineage

In this section, we propose an instantiation of polynomial lineage based on sparse low-total-degree polynomial series. We focus on the problems of constructing lineage and understanding lineage, since there are existing approaches, [19], that solve the problem of sampling from lineage, which is sufficient to solve the query evaluation problem (Problem 6).

6.4.1 Sparse Fourier Series

Our goal is to write a Boolean function as a sum of smaller terms; this decomposition is similar to Taylor and Fourier series decompositions in basic calculus. We recall the basics of Fourier series on the Boolean hypercube.¹

¹ For more details, see [114, 125].

In our discussion, we fix a set of independent random variables $x_1, \dots, x_n$, e.g., the atoms, where $p_i = E[x_i]$ (the expectation) and $\sigma_i^2 = p_i(1 - p_i)$ (the variance). Let $\mathcal{B}$ be the vector space of real-valued Boolean functions on $n$ variables; a vector in this space is a function $\lambda : \{0, 1\}^n \to \mathbb{R}$. Rather than the standard basis for $\mathcal{B}$, we define the Fourier basis for functions. To do so, we equip $\mathcal{B}$ with an inner product that is defined via expectation, that is, $\langle \lambda_1, \lambda_2 \rangle \stackrel{\mathrm{def}}{=} E[\lambda_1 \cdot \lambda_2]$. This inner product induces a norm, $\|\lambda_t\|^2 \stackrel{\mathrm{def}}{=} \langle \lambda_t, \lambda_t \rangle$. This norm captures our error function (see Def. 6.2.10) since $E[(\lambda_t - \tilde{\lambda}^P_t)^2] = \|\lambda_t - \tilde{\lambda}^P_t\|^2$. We can now define an orthonormal basis for the vector space using the set of characters:

Definition 6.4.1. For each $z \in \{0, 1\}^n$, the character associated with $z$ is a function from $\{0, 1\}^n \to \mathbb{R}$, denoted $\phi_z$ and defined as:
$$\phi_z \stackrel{\mathrm{def}}{=} \prod_{i : z_i = 1} (x_i - p_i)\,\sigma_i^{-1}$$

Since the set of all characters is an orthonormal basis, we can write any function in B as a sum of the characters. The coefficient of a character is given by projection on to that character, as we define below.

Definition 6.4.2. The Fourier transform of a function $\lambda_t$ is denoted $F_{\lambda_t}$ and is a function from $\{0, 1\}^n \to \mathbb{R}$ defined as:
$$F_{\lambda_t}(z) \stackrel{\mathrm{def}}{=} \langle \lambda_t, \phi_z \rangle = E[\lambda_t \phi_z]$$

The Fourier series of $\lambda_t$ is defined as $\sum_{z \in \{0,1\}^n} F_{\lambda_t}(z)\, \phi_z(A)$.

The Fourier series captures $\lambda_t$; that is, for any assignment $A$, $\lambda_t(A) = \sum_{z \in \{0,1\}^n} F_{\lambda_t}(z)\, \phi_z(A)$. An important coefficient is $F_{\lambda_t}(0)$, which is the probability (expectation) of $\lambda_t$. We give an example to illustrate the computation of Fourier series:

Example 6.4.3 Let $\lambda_t = x_1 \vee \cdots \vee x_n$, that is, the logical or of $n$ independent random variables. The arithmetization of $\lambda_t$ is $1 - \prod_{i=1,\dots,n}(1 - x_i)$. Applying Def. 6.4.2, $F_{\lambda_t}(0) = E[\lambda_t] = 1 - \prod_{i=1,\dots,n}(1 - p(x_i))$ and for $z \neq 0$:

$$F_{\lambda_t}(z) \;=\; E\Big[\phi_z \Big(1 - \prod_{i=1,\dots,n}(1 - x_i)\Big)\Big] \;=\; -E\Big[\prod_{i : z_i = 1} \phi_{e_i}(1 - x_i)\Big] \prod_{j : z_j = 0} E[1 - x_j] \;=\; \prod_{i : z_i = 1} \sigma_i \prod_{j : z_j = 0} \big(1 - p(x_j)\big)$$

where for $i = 1, \dots, n$, $\sigma_i^2 = p(x_i)(1 - p(x_i))$ (the variance of $x_i$).
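A brute-force sketch (ours, for intuition only) of Def. 6.4.2: compute $F_{\lambda}(z) = E[\lambda\, \phi_z]$ by enumerating assignments under the product distribution. On a two-atom OR it reproduces $F_\lambda(0) = E[\lambda]$ and the singleton-coefficient formula from Example 6.4.3; the atom names and probabilities are assumptions.

from itertools import product
from math import sqrt

def fourier_coefficient(lam, z, probs):
    # F_lambda(z) = E[lam * phi_z] with phi_z = prod_{i in z} (x_i - p_i)/sigma_i
    # (Defs. 6.4.1 and 6.4.2), computed exactly by enumerating all assignments.
    atoms = sorted(probs)
    total = 0.0
    for bits in product([0, 1], repeat=len(atoms)):
        a = dict(zip(atoms, bits))
        weight = 1.0
        for x, b in a.items():
            weight *= probs[x] if b else (1 - probs[x])
        phi = 1.0
        for x in z:
            p = probs[x]
            phi *= (a[x] - p) / sqrt(p * (1 - p))
        total += weight * lam(a) * phi
    return total

# Example 6.4.3 with n = 2: lam = x1 OR x2 (probabilities are illustrative).
probs = {"x1": 0.75, "x2": 0.25}
lam = lambda a: int(a["x1"] or a["x2"])
print(fourier_coefficient(lam, (), probs))       # F(0) = E[lam] = 1 - (1-0.75)(1-0.25)
print(fourier_coefficient(lam, ("x1",), probs))  # equals sigma_1 * (1 - p(x2)), per Ex. 6.4.3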

Our goal is to get a small, but good approximation; we make this goal precise using sparse Fourier series:

Definition 6.4.4. An s-sparse series is a Fourier series with at most $s$ non-zero coefficients. We say $\lambda$ has an $(s, \varepsilon)$-approximation if there exists an $s$-sparse series $\tilde{\lambda}^P_t$ such that $\|\lambda_t - \tilde{\lambda}^P_t\|^2 \leq \varepsilon$. A best $s$-sparse series for a function $\lambda$ is the $s$-sparse series that minimizes $\varepsilon$.

Our approach for polynomial lineage is to approximate the lineage $\lambda_t$ for a tuple $t$ by a sparse Fourier series $\tilde{\lambda}^P_t$, ideally an $(s, \varepsilon)$-approximation for small $s$ and $\varepsilon$. Additionally, we want $\tilde{\lambda}^P_t$ to have low (constant) total degree so we can describe its coefficients succinctly (in constant space).

Selecting an approximation The standard approach to approximation using series is to keep only the largest coefficients, which is optimal in this case:

Proposition 6.4.5. For any Boolean function $\lambda_t$ and any $s > 0$, a best $s$-sparse approximation for $\lambda_t$ is obtained by keeping the $s$ largest coefficients in absolute value, ties broken arbitrarily.

Proof. Let $g$ be any $s$-term approximation and $S_g$ be its set of non-zero coefficients; then we can write $\|\lambda_t - g\|^2 = \sum_{S \in S_g} (F_{\lambda_t}(S) - F_g(S))^2 + \sum_{S \notin S_g} F_{\lambda_t}(S)^2$. Notice that $S \in S_g$ implies that $F_{\lambda_t}(S) = F_g(S)$, else we could get a strictly better approximation; thus, a best approximation consists of a subset of coefficients of the Fourier expansion. If it does not contain the largest coefficients in magnitude, we can switch a term from the right to the left sum and get a strictly better approximation. Thus, all best approximations are of this form. $\square$

6.4.2 Constructing Lineage

We construct polynomial lineage by searching for the largest coefficients using the KM algorithm [107]. The KM algorithm is complete in the sense that if there is an $(s, \varepsilon)$-sparse approximation, it finds an only slightly worse $(s, \varepsilon + \varepsilon^2/s)$ approximation. The key technical insight is that k-DNFs do have sparse (and low-total-degree) Fourier series [114, 151]. This implies we only need to keep relatively few coefficients to get a good approximation. More precisely:

Theorem 6.4.6 ([107, 114, 151]). Given a set of atoms $\mathcal{A} = \{x_1, \dots, x_n\}$ and a probabilistic assignment $p$, let $\beta = \min_{i=1,\dots,n}\{p(x_i), 1 - p(x_i)\}$ and let $\lambda_t$ be a (not necessarily monotone) k-DNF function over $\mathcal{A}$. Then there exists an $(s, \varepsilon)$-approximation $\tilde{\lambda}^P_t$ where $s \leq k^{O(k \beta^{-1} \log(1/\varepsilon))}$ and the total degree of any term in $\tilde{\lambda}^P_t$ is bounded by $c_0 \beta^{-1} k \log(\tfrac{1}{\varepsilon})$, where $c_0$ is a constant. Further, we can construct $\tilde{\lambda}^P_t$ in randomized polynomial time.

The KM algorithm is an elegant recursive search algorithm. However, a key practical detail is that at each step it requires a two-level estimator: the algorithm requires that at each step we estimate a quantity $y_1$ via sampling, and to compute each sample of $y_1$ we must, in turn, estimate a second quantity $y_2$ via sampling. This can be very slow in practice. This motivates us to propose a cheaper heuristic: for each monomial $m$, we estimate the coefficient corresponding to each subset of the variables of $m$. For example, if $m = x_1 \wedge x_2$, then we estimate the coefficients for $0$, $e_1$, $e_2$ and $e_{12}$. This heuristic takes time $2^k |\lambda_t|$, but can be orders of magnitude more efficient in practice, as we show in our evaluation section (Section 6.5.2). This is linear with respect to data complexity.
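The sketch below (ours; it swaps the two-level KM estimator for plain Monte Carlo) illustrates this cheaper heuristic: for each monomial, estimate the Fourier coefficients indexed by subsets of that monomial's variables and keep the large ones. The sample size, threshold, and example lineage are assumptions for illustration.

import random
from itertools import combinations
from math import sqrt

def estimate_coefficient(lam, z, probs, samples, rng):
    # Monte Carlo estimate of F_lambda(z) = E[lam * phi_z].
    total = 0.0
    for _ in range(samples):
        a = {x: int(rng.random() < p) for x, p in probs.items()}
        phi = 1.0
        for x in z:
            p = probs[x]
            phi *= (a[x] - p) / sqrt(p * (1 - p))
        total += lam(a) * phi
    return total / samples

def heuristic_coefficients(lam, monomials, probs, samples=50_000, threshold=0.05, seed=0):
    # For each monomial, estimate the coefficient of every subset of its variables
    # (at most 2^k per monomial) and keep those whose magnitude exceeds the threshold.
    rng = random.Random(seed)
    seen, kept = set(), {}
    for m in monomials:
        for r in range(len(m) + 1):
            for z in combinations(sorted(m), r):
                if z in seen:
                    continue
                seen.add(z)
                c = estimate_coefficient(lam, z, probs, samples, rng)
                if abs(c) > threshold:
                    kept[z] = c
    return kept

# Illustrative lineage: (x1 AND x2) OR (x1 AND x3) with the example probabilities.
probs = {"x1": 0.75, "x2": 0.25, "x3": 0.125}
lam_t6 = lambda a: int((a["x1"] and a["x2"]) or (a["x1"] and a["x3"]))
print(heuristic_coefficients(lam_t6, [("x1", "x2"), ("x1", "x3")], probs))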

6.4.3 Understanding Approximate Lineage

Our goal in this section is to find sufficient explanations and influential variables, solving Problem 4 and Problem 5, respectively.

Sufficient Explanations Let $\lambda_t$ be a lineage formula such that $E[\lambda_t] \in (0, 1)$ and let $\tilde{\lambda}^P_t$ be a polynomial approximation of $\lambda_t$. Given a monomial $m$, our goal is to test if $m$ is a sufficient explanation for $\lambda_t$. The key idea is that $m$ is a sufficient explanation if and only if $\mu[\lambda_t \wedge m] = \mu[m]$, since this implies the implication holds for every assignment.

If $\tilde{\lambda}^P_t$ is exactly the Fourier series for $\lambda_t$, then we can compute each value in time $O(2^k)$, since

$$E[\tilde{\lambda}^P_t\, m] = \sum_{z :\, z_i = 1 \Rightarrow i \in m} F_{\lambda_t}(z) \left(\prod_{i \in m : z_i = 1} \sigma_i\right) \left(\prod_{j \in m : z_j = 0} \mu_j\right) \qquad (6.5)$$

However, often $\lambda_t$ is complicated, which forces us to use sampling to approximate the coefficients of $\tilde{\lambda}^P_t$. Sampling introduces noise in the coefficients. To tolerate noise, we relax our test:

Definition 6.4.7. Let $\tau > 0$ be the tolerance and $\delta > 0$ the confidence; then we say that a monomial $m$ is a $(\tau, \delta)$ sufficient explanation for $\tilde{\lambda}^P_t$ if:

$$\mu_N\Big[\,\underbrace{\big|E[\tilde{\lambda}^P_t \cdot m] - E[m]\big|}_{(\dagger)} \leq \tau\,\Big] \geq 1 - \delta \qquad (6.6)$$

where $N$ denotes the distribution of the sampling noise.

The intuition is that we want $E[\tilde{\lambda}^P_t\, m]$ and $E[m]$ to be close with high probability. For independent random sampling, $N$ is a set of normally distributed random variables, one for each coefficient. Substituting Eq. 6.5 into Eq. 6.6 shows that $(\dagger)$ is a sum of $2^k$ normal variables, which is again normal; we use this fact to estimate the probability that $(\dagger)$ is less than $\tau$. Our heuristic is straightforward, given a tolerance $\tau$ and a confidence $\delta$: for each monomial $m$, compute the probability in Eq. 6.6; if it is at least $1 - \delta$, then declare $m$ a sufficient explanation. Finally, rank each sufficient explanation by the probability of that monomial.
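A crude sketch (ours) of the underlying test from the start of this subsection: $m$ is a sufficient explanation iff $\mu[\lambda_t \wedge m] = \mu[m]$. Here both quantities are estimated from the same Monte Carlo samples and compared up to a tolerance $\tau$, a simple stand-in for the normal-approximation test of Def. 6.4.7; the tolerance, sample size, and example lineage are assumptions.

import random

def is_sufficient_explanation(lam, monomial, probs, tau=0.01, samples=200_000, seed=0):
    # Test whether the conjunction `monomial` implies lam by checking, via sampling,
    # that mu[lam AND m] and mu[m] agree up to the tolerance tau.
    rng = random.Random(seed)
    hit_m, hit_both = 0, 0
    for _ in range(samples):
        a = {x: int(rng.random() < p) for x, p in probs.items()}
        if all(a[x] for x in monomial):
            hit_m += 1
            if lam(a):
                hit_both += 1
    return (hit_m - hit_both) / samples <= tau

probs = {"x1": 0.75, "x2": 0.25, "x3": 0.125}
lam_t6 = lambda a: int((a["x1"] and a["x2"]) or (a["x1"] and a["x3"]))
print(is_sufficient_explanation(lam_t6, ("x1", "x2"), probs))  # True: x1 AND x2 implies lam_t6
print(is_sufficient_explanation(lam_t6, ("x2",), probs))       # False: x2 alone does not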

Influential tuples The key observation is that the influence of xi is determined by its coefficient in the expansion [114, 151]:

Proposition 6.4.8. Let $\lambda_t$ be an internal lineage function, $x_i$ an atom and $\sigma_i^2 = p(x_i)(1 - p(x_i))$; then

$$\mathsf{Inf}_{x_i}(\lambda_t) = \sigma_i^{-1}\, F_{\lambda_t}(e_i)$$

This gives us a simple algorithm for finding influential atoms using polynomial lineage: simply scale each $F_{\lambda_t}(e_i)$, sort them, and return them. Further, the term corresponding to $e_i$ in the transform is $F_{\lambda_t}(e_i)\, \phi_{e_i} = \mathsf{Inf}_{x_i}(\lambda_t)(x_i - p(x_i))$, as was shown in Figure 6.1.

Query   Tables   # Evidence   # Tuples   Avg. Lin. Size   Size
V1      8        2            1          234              12k
V2      6        2            1119       1211             141M
V3      6        1            295K       3.36             104M
V4      7        1            28M        7.68             31G

Figure 6.2: Query statistics for the GO DB [37].

6.5 Experiments

In this section, we answer three main questions about our approach: (1) In Section 6.5.2, do our lineage approaches compress the data? (2) In Section 6.5.3, to what extent can we recover explanations from the compressed data? (3) In Section 6.5.4, does the compressed data provide a performance improvement while returning high quality answers? To answer these questions, we experimented with the Gene Ontology database [37] (GO) and similarity scores from a movie matching database [137, 173].

6.5.1 Experimental Details

Primary Dataset The primary dataset is GO, which we described in the introduction. We assigned probability scores to evidence tuples based on the type of evidence. For example, we assigned a high reliability score (0.9) to a statement in a PubMed article, while we assigned a low score (0.1) to an automated similarity match. Although many atoms are assigned the same score, they are treated as independent events. Additionally, to test the performance of our algorithms, we generated several sets of probability values drawn from more highly skewed distributions, which are discussed in the relevant sections.

Primary Views We present four views which are taken from the examples and view definitions that accompany the GO database [37]. The first view V1 asks for all evidence associated with a fixed pair of gene products. V2 looks for all terms associated with a fixed gene product. V3 is a view of all annotations associated with the Drosophila fly (via FlyBase [66]). V4 is a large view of all gene products and associated terms. Figure 6.2 summarizes the relevant parameters for each view: (1) the number of tables in the view definition, (2) the number of sources of evidence, that is, how many


Figure 6.3: (a) Compression ratio as error increases in log scale for query V2. (b) Distribution of size of DNF for V2 with and without compression, x-axis is sorted by size, e.g., x = 1 is the largest DNF (823k).

times it joins with the evidence table (3) the number of tuples returned (4) the average of the lineage sizes for each tuple, and (5) the storage size of the result.

Secondary Dataset To verify that our results apply more generally than the GO database, we examined a database that (fuzzily) integrates movie reviews from Amazon [174] with IMDB (the Internet Movie Database) [173]. This data has two sources of imprecision: matches of titles between IMDB and Amazon, and ratings assigned to each movie by automatic sentiment analysis, that is, a classifier.

Experimental Setup All experiments were run on a Fedora Core Linux machine (2.6.23-14 SMP) with dual quad-core 2.66GHz CPUs and 16GB of RAM. Our prototype implementation of the compression algorithms was written in approximately 2000 lines of Caml. Query performance was measured using a modified C++/Caml version of the MystiQ engine [21] backed by databases running SQL Server 2005. The implementation was not heavily optimized.

6.5.2 Compression

We verify that our compression algorithms produce small approximate lineage, even for stringent error requirements. We measured the compression ratios and compression times achieved by our approaches for both datasets at varying errors.

[Figure 6.4(a): plot of compression ratio versus the mean of the probability distribution.]

Q     Suff PT    Poly PT
V1    0.23s      0.50s
V2    3.5h       3.5h
V3    10.3m      24.3m
V4    50.3h      67.4h

Figure 6.4: (a) The compression ratio versus the mean of the distribution for V1. Sufficient is more stable, though the polynomial lineage can provide better approximation ratios. (b) The compression time for each view, the processing time (PT).

Compression Ratios Figure 6.3(a) shows the compression ratio versus error trade-off achieved by polynomial and sufficient lineage for V2. Specifically, for a fixed error on the x-axis, the y-axis shows the compression ratio of the lineage (in log scale). As the graph illustrates, in the best case, V2, the compression ratio for polynomial lineage is very large: even for extremely small error rates, $10^{-3}$, the compression ratio is 171:1 for polynomial lineage versus 27:1 for sufficient lineage. In contrast, V3 is our worst case. The absolute maximum our methods can achieve there is a ratio of 3.36:1, which is the ratio we would get by keeping a single monomial for each tuple. At an error $\varepsilon = 0.01$, polynomial lineage achieves a 1.8:1 ratio, while sufficient lineage betters this with a 2.1:1 ratio.

The large lineage formulae in V2 contain redundant information, which allows our algorithms to compress them efficiently. Figure 6.3(b) shows the distribution of the sizes of the original lineage formulae and, below it, the sizes after compression. There are some very large sources in the real data; the largest one contains approximately 823k monomials. Since large DNFs have probabilities very close to one, polynomial lineage can achieve an $\varepsilon$-approximation using simply the constant 1. In contrast, sufficient lineage cannot do this.

Effect of Skew We investigate the effect of skew by altering the probabilistic assignment, that is, the probability we assign to each atom. Specifically, we assigned each atom a score drawn from a skewed probability distribution. We then compressed V1 with the skewed probabilities. V1 contains only a single tuple with moderately sized lineage (234 monomials). Figure 6.4(a) shows the

ε        Suff    Poly
0.1      4.0     4.5
0.05     2.7     2.6
0.01     1.4     1.5
0.001    1.07    1.3

Figure 6.5: (a) Compression ratio (|Original| / |Compressed|) at each error ε. (b) The distribution of lineage size in the IMDB view, by rank.

compression ratio as we vary the skew from small means, 0.02, to larger means, 0.5. More formally, the probability we assign to an atom is drawn from a Beta distribution with β = 1 and α taking the value on the x axis. Sufficient lineage provides lower compression ratios for extreme means, that is close to 0.02 and 0.5, but is more consistent in the less extreme cases.

Compression Time Figure 6.4(b) shows the processing time for each view we consider. For views V2, V3 and V4, we used 4 dual-core CPUs and 8 processes simultaneously. The actual end-to-end running times are therefore about a factor of 8 smaller, e.g., V2 took less than 30m to compress. It is interesting to note that the processor time for V2 is much larger than that of the comparably sized V3; the reason is that the complexity of our algorithm grows non-linearly with the largest DNF size. Specifically, the increase is due to the cost of sampling. The compression times for polynomial lineage and sufficient lineage are close; this is only true because we are using the heuristic of Section 6.4.2. The generic algorithm is orders of magnitude slower: it could not compress V1 in an hour, compared to only 0.5s using the heuristic approach. Our implementation of the generic search algorithm could be improved, but it would require orders of magnitude of improvement to compete with the efficiency of the simple heuristic.

IMDB and Amazon dataset Using the IMDB movie data, we compressed a view of highly rated movies. Figure 6.5(a) shows the compression ratio versus error rate. Even for stringent error requirements, our approach is able to obtain good compression ratios for both instantiations of approximate lineage. Figure 6.5(b) shows the distribution of the lineage sizes, sorted by rank, and their sufficient-compression sizes. Compared to Figure 6.3, there are relatively few large lineage formulae, which means there is less opportunity for compression. On a single CPU, the time taken to compress the data was always between 180 and 210s. This confirms that our results are more general than a single dataset.

6.5.3 Explanations


Figure 6.6: (a) The precision of the top-k explanations versus the number of terms in the polynomial expansion. (b) The number (precision) of influential variables in the top 10 returned using sufficient lineage that are in the top 10 of the uncompressed function.

We assess how well approximate lineage can solve the explanation tasks in practice, that is, finding sufficient explanations (Prob. 4) and finding influential variables (Prob. 5). Specifically, we answer two questions: (1) How well can sufficient lineage compute influential variables? (2) How well can polynomial lineage generate sufficient explanations? To answer question (1), we created 10 randomly generated probabilistic assignments for the atoms in V1; we ensured that the resulting lineage formula had non-trivial reliability, i.e., in (0.1, 0.9). We then tested precision: out of the top 10 influential variables, how many were returned in the top 10 using sufficient lineage (Section 6.3.3)? Figure 6.6(b) shows that for high error rates, $\varepsilon = 0.1$, we are still able to recover 6 of the top 10 influential variables, and for lower error rates, $\varepsilon = 0.01$, we do even better: the average number of recovered top-10 values is 9.6. The precision trails off for very small error rates due to small swaps in rankings near the bottom of the top 10, e.g., all top 5 are within the top 10. To answer question (2), we used the same randomly generated probabilistic assignments for the atoms in V1 as in the answer to question (1). Figure 6.6(a) shows the average number of terms in

the top k explanations returned by the method of Section 6.4.3 that are actual sufficient explanations versus the number of terms retained by the formula. We have an average recall of approximately 0.7 (with low standard deviation), while keeping only a few coefficients. Here, we are using the heuristic construction of polynomial lineage. Thus, this experiment should be viewed as a lower bound on the quality of using polynomial lineage for providing explanations. These two experiments confirm that both sufficient and polynomial lineage are able to provide high quality explanations of the data directly on the compressed data.

6.5.4 Query Performance


Figure 6.7: Query performance on (a) V2. (b) IMDB data.

Figure 6.7 shows the effect of compression on the execution time of V2; the query computes the probability of each tuple in the view. The y-axis is in log scale: it takes just under 20 hours to run this query on the uncompressed data. On data compressed with sufficient lineage at $\varepsilon = 0.001$, we get an order of magnitude improvement; the query takes approximately 35m to execute. Using the data compressed with polynomial lineage, we get an additional order of magnitude; the query now runs in 1.5m. Figure 6.7(b) shows the effect of compression on query performance for the IMDB movie dataset, where the compression was not as dramatic. Again, our query was to compute the lineage for each tuple in the view. The time taken is dominated by Monte Carlo sampling on the now much smaller lineage. As expected, the data with higher error, and so smaller size, allows up to a five-fold performance gain. In this example both running times scale approximately with the compressed size.

Chapter 7

RELATED WORK

We begin by discussing work that deals with managing imprecision in a broad sense, and then we discuss work whose technical details or approach are related.

7.1 Broadly Related Work

We begin with the most closely related work.

Probabilistic Relational Databases

In recent years, the database community has seen a flurry of work on probabilistic databases. Managing probabilistic data is difficult, and each of the systems addresses similar problems in different ways. The MystiQ project at the University of Washington started with the paper by Dalvi and Suciu [46], who showed that a dichotomy holds for conjunctive queries without self-joins: either the query's data complexity is PTIME or it is #P-hard. Moreover, if the query is in PTIME, then we can create an SQL query which simply multiplies or adds probabilities to compute the probability efficiently and correctly. Inspired by this result, in this dissertation we generalize this dichotomy result to aggregate queries in Chapter 4. Later, this result was generalized to queries that contain self-joins; the algorithm is more complicated and the translation to standard SQL queries is currently unknown [47]. At about the same time, the Trio project began to focus on representation issues for uncertain databases, calling itself an "Uncertain and Lineage Database" (ULDB) [16, 17, 145, 146, 168]. This project promoted lineage as not only a technical convenience, but an important concept in its own right [168]. More recently, Das Sarma et al. [146] optimize queries by exploiting common structure in the lineage of queries at run-time. A similar approach was taken in the context of graphical models by Sen et al. [149]. In contrast, the techniques in this dissertation are usually applied at compile-time, similar to a standard query optimizer. Sen's earlier work [148] in this area was one of the first to suggest that query evaluation could be done efficiently by casting the problem in terms of graphical models. One observation made in this work is that sometimes the data may contain structure unknown to the optimizer (e.g., a functional dependency), which means that a query which has #P data complexity in general is actually tractable; his approach discovers such structure at runtime. The MayBMS system [8, 9, 91] is also a probabilistic relational database with a representation that is close to the Mystiq system; they do, however, allow a much richer language than discussed in this dissertation. In particular, their query language allows direct predication on probabilities, e.g., "is the probability of q1 < 0.3 and q2 > 0.4?", and is able to introduce uncertainty through the repair-by-key construct. These enhancements are not superficial: the query language they propose captures exactly second-order logic; interestingly, they recently announced an algebra to compute it that is similar to the relational algebra [106]. Other powerful features of their language include expressing conditionals, i.e., "what is the probability of q given that price > 10?". Additionally, they have focused on secondary storage issues and begun an interesting research program of physical optimization for probabilistic databases in the context of the SPROUT project [127]. One interesting observation of this project is that the existence of a safe plan for a query can be exploited even if the plan itself is not used. In this context, they also looked at computing probabilities without exact procedures like Davis-Putnam and Ordered Binary Decision Diagrams.

Although there is a lot of recent work on probabilistic relational databases, the problems of imprecision and uncertainty in data are not new: they have a long history within the database community [13, 28, 57, 144]. For example, Barbará et al. [13] discussed a probabilistic model that is very close to the BID model we discuss here. To keep evaluation tractable, they did not allow duplicate elimination in their queries; the techniques in this dissertation attempt to continue this line of work by efficiently evaluating a larger set of SQL queries, notably aggregates. Predating this work by more than a decade, Cavallo and Pittarelli [28] studied a probabilistic database that captured a single alternative world. In the ProbView system [144], an alternate approach was taken which allowed interval probabilities to be returned. This had the benefit that a much richer class of queries and aggregates could be handled; in some cases, however, the intervals may degenerate to [0, 1], which gives little information to the user.

At the same time, there were approaches to a problem called query reliability which implicitly used a probabilistic database [78]. The idea was to understand to what extent a query depends on the answers (tuples) in the database; formally, it was a tuple-independent database where every tuple was given a probability of $\frac{1}{2}$. The reliability is then the probability that the (Boolean) query is true.

Inconsistent and Incomplete Databases

An alternative approach to probabilistic databases, from which this thesis drew inspiration, is the area of incomplete and inconsistent databases. For example, we may have two databases that are individually consistent, but when we merge them some dependencies fail, such as a key constraint. In one database, John lives in Seattle, and in the other he lives in New York. Although each is individually consistent, when merged together they become inconsistent. In this context, there has been a great deal of work [18, 69, 79, 93, 165]. The typical query semantics in this work is the greatest lower bound or least upper bound over the set of all minimal repairs, also known as the certain or possible answer semantics. One particularly relevant piece of related work is Arenas et al. [11], who consider the complexity of aggregate queries, similar to HAVING queries, over data which violates functional dependencies; they also consider multiple predicates, which we do not. There is a deep relationship between the repair semantics and probabilistic approaches; a representative work in this direction is Andritsos et al. [7]. No discussion of incomplete databases would be complete without the seminal work of Imielinski and Lipski [93], who formalized the notion of c-tables and initiated the study of incomplete and inconsistent databases.

Lineage and Provenance

Lineage is central to the Trio system [168], which identifies lineage as a central concept [129]. Our thinking about lineage was influenced by a paper of Green and Tannen [82], which spelled out the connections of the various models in the literature (notably the c-tables of Imielinski and Lipski [93]) to the model we presented here. In addition, the view of lineage in this dissertation was heavily influenced by a paper of Green et al. [81], who explained that semirings are the right mathematical structure needed to unify the various concepts of lineage. In this dissertation we view lineage as a formal language to precisely explain our algorithms. In recent years, however, there has been work on the practical utility of provenance and lineage for applications, especially in scientific databases [25, 30, 53, 80]. There has been work in the systems community on entire architectures that are provenance aware, e.g., provenance-aware storage systems [122]. This speaks to the fundamental nature of the concept of provenance or lineage.

Succinct models

While the BID formalism is complete, Markov Networks [45] and Bayesian Networks [130], and their extensions to Probabilistic Relational Models [67], allow some discrete distributions to be expressed much more succinctly. The trade-off is that query answering becomes correspondingly more difficult. For example, Roth suggests that inference in even very simple Bayesian networks does not admit efficient approximation algorithms. There is active work in the database community on adopting these more succinct models, which represent an interesting alternative to the techniques of this dissertation [148, 167]. One clear advantage of this approach is that it can leverage the deep body of work in the graphical model community, such as variational methods [98], differential inference [50], and lifted inference [52, 153]. On the other hand, we contend that by choosing a more restrictive model, we can more fully concentrate on scale. In particular, we are unaware of any of these approaches that runs with performance comparable to a standard relational database, as we have demonstrated in this thesis in Chapter 5. That said, there is a huge opportunity to merge the two approaches in the next few years.

Continuous attributes

In this work, we do not consider continuous attribute values; e.g., the techniques in this dissertation cannot handle the case where the attribute temperature has a normal distribution with mean 40. Representing this kind of uncertainty is central to approaches that integrate sensor data, such as the BBQ project [55] or the Orion system [34]. More recently, the MCDB project has advocated an approach based on Monte Carlo sampling, which can generate much more sophisticated distributions [95]. This supports, for example, the ability to answer "what-if" queries such as "if we had lowered our prices by 5%, what would our profit have been?"; internally, the user has specified a demand curve that relates price changes to expected demand, and the system samples from the resulting distribution many times. This expressive power comes at a price, and the central challenge here is performance. The performance challenge is more difficult than in our setting because the sampling procedure is a black box; nonetheless, by bundling samples together and using late-materialization strategies, the MCDB system can achieve good performance. Deshpande et al. [56] in the BBQ project consider probabilistic databases resulting from sensor networks, in which the database models continuous values such as temperature. The focus of this work is on rich correlation models, but simpler querying.

Semistructured Models

There is also a wealth of work in non-relational systems, notably in semistructured XML-based systems [3, 36, 92, 104, 124, 150]. As with relational probabilistic databases, a number of different variants of probabilistic XML have been proposed in the literature. We refer the reader to Kimelfeld [103, Chapter 4] for a comprehensive taxonomy and comparison of the expressive power of these models. One compelling reason to consider probabilistic XML is that there may be ambiguity in the structure of the data, and as noted by Nierman et al. [124], XML gracefully captures incompleteness. This property makes XML a particularly appropriate model for data integration applications [163]. Hung et al. [92] define a formal, probabilistic XML algebra that is similar in spirit to the intensional algebra of Fuhr for relational databases [68].

Sequential Models

Another line of work in the database area deals with sequential, relational models called Markovian streams or sequences [99, 138]. These streams arise from tasks including RFID networks [138], radar monitoring systems [157], and speech-recognition systems [111]. The work of Kanagal and Deshpande [99] maps a SQL query to operations on a graphical model representing the input, which allows them to leverage the extensive work in the graphical model community [98]. In the SASE [157] and Lahar [138] projects, a regular-expression-like language is used. These projects both build on earlier work in the event processing literature, such as Cayuga [22] and SnoopIB [4]. The processing techniques are automaton-based, which allows near-real-time performance in some situations. More recently, there has been work on creating indexes for these models, such as the Markov Chain Index [112] and its generalization to tree-structured models [100].

The key idea in both approaches is to save or cache some portion of the probabilistic inference that recurs.

Applications

There has been a wealth of interest in probabilistic models in the database community. The ConQuer system [7] allowed users to cope with uncertainty arising from entity resolution. Its focus was on efficient evaluation over probabilistic data, which is a goal shared with the Mystiq project. Gupta and Sarawagi [84] showed that they could model the output of information extraction tasks using the BID model. A key argument they made for using a probabilistic database to manage these tasks is that throwing away low-scoring extractions negatively impacted recall. One major motivation for probabilistic databases is to increase recall without losing too much precision. More sophisticated probabilistic models for information extraction are an area of interesting ongoing work, notably in the Avatar group [96, 116]. This project is building rich probabilistic models to increase the recall of hand-written extractors. Another important application for probabilistic relational databases is managing the output of entity-resolution or deduplication tasks [6, 10, 32, 65, 71, 83, 89, 169, 170], as we discussed in Chapter 3.

7.2 Specific Related Work

In this section, we discuss work that is very closely related to specific technical contributions of this dissertation.

Top-k and Ranking

Soliman et al. [154] consider combining top-k with measures, such as SUM, for example "tell me the ten highest ranked products by total sales", when the underlying data is imprecise. This combines the uncertainty of the probabilities with the measure dimension, and it has surprisingly subtle semantics that have been the subject of interesting debate within the community [38, 172]. This work is similar in spirit to our HAVING queries (Chapter 4) and our top-k processing based on probabilities (Chapter 3). In the original paper of Soliman et al., they considered a rich correlation model (essentially arbitrary Bayesian networks), but they do not focus on complex queries involving joins.

This work has inspired a large amount of follow-up work, including more efficient algorithms for restricted models [73, 90, 171]. In addition, combining probabilistic data and skyline operators is considered by Pei et al. [132]. There is very interesting recent work that combines ranking (and clustering) in one unified framework with an approach based on generating functions [113].

Materialized Views

Materialized views are a fundamental technique used to optimize queries [2, 33, 75, 85] and to share, protect, and integrate data [117, 160]; they are currently implemented by all major database vendors. Because the complexity of deciding when a query can use a view is high, there has been a considerable amount of work on making query-answering-using-views algorithms scalable [75, 133]. In the same spirit, we provide efficient practical algorithms for our representability problems. As there are technical connections between our problem and the view-security problem, an interesting problem is to apply the work of Machanavajjhala and Gehrke [115] on expanding the syntactic boundary of tractability to the view setting. In prior art [51], the following question is studied: given a class of queries Q, is a particular representation formalism closed for all Q ∈ Q? In contrast, our test is more fine-grained: for any fixed conjunctive query Q, is the BID formalism closed under Q? A related line of work is on World Set Decompositions [9], which allow complete representations by factoring databases; applying our techniques to this representation system is an interesting problem.

Aggregation

Aggregation is a fundamental operation in databases, and it should come as no surprise that aggregation has been considered for probabilistic data many times. In the OLAP setting, Burdick et al. [26, 27] give efficient algorithms for value aggregation in a model that is equivalent to the single-table model. Their focus is on the semantics of the problem. As such, they consider how to assign the correct probabilities, called the allocation problem, and how to handle constraints in the data. The allocation problem is an interesting and important problem. Ross et al. [144] describe an approach to computing aggregates on a probabilistic database by computing bounding intervals (e.g., the AVG is between [5600, 5700]). They consider a richer class of aggregation functions than we discuss, but with an incomparable semantics. Their complexity results show that computing bounding intervals exactly is NP-hard. In contrast, we are interested in a more fine-grained static analysis: our goal is to find the syntactic boundary of hardness. Trio also uses a bounded-interval style approach [123].

There is work on value aggregation over streaming probabilistic databases [97]. In addition, they consider computing approximate value aggregates, such as AVG, in a streaming manner. In contrast, computing the AVG for predicate aggregates (as we do in Chapter 4) on a single table is #P-hard. One way to put these results together is that a value aggregate captures only the first moment (expectation), while a HAVING aggregate allows us to capture the complete distribution (in the exact case). Kanagal and Deshpande [99] also work in the streaming context, computing an expected-value style of aggregation. This work does not look at complex queries, like joins. Koch [105] formalizes a language that allows predication on probabilities and discusses approximation algorithms for this richer language, though he does not consider HAVING aggregation. This is in part because his aim is to create a fully compositional language for probabilistic databases [106]. Extending our style of aggregation to a fully compositional language is an interesting open problem. The problem of generating a random world that satisfies a constraint is fundamental and is considered by Cohen et al. [35]. They point out many applications for this task and use it to answer rich queries on probabilistic XML databases. We differ in the constraint language we choose and in that we use our sampling algorithm as the basis for an approximation algorithm.
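To make the moment-versus-distribution contrast concrete, here is a small brute-force sketch on a toy tuple-independent table; the table, names, and enumeration strategy are our own illustration (not code from any of the systems cited above), and enumerating all possible worlds is feasible only at this tiny scale.

from itertools import product

# Toy tuple-independent table: (price, marginal probability) pairs.
table = [(10, 0.9), (20, 0.5), (30, 0.2)]

expected_sum = 0.0   # value aggregate: the first moment E[SUM(price)]
distribution = {}    # full distribution of SUM(price) over possible worlds

for world in product([0, 1], repeat=len(table)):
    prob, total = 1.0, 0
    for present, (price, p) in zip(world, table):
        prob *= p if present else (1.0 - p)
        total += price if present else 0
    expected_sum += prob * total
    distribution[total] = distribution.get(total, 0.0) + prob

# A HAVING-style predicate aggregate reads the distribution, not just the mean.
p_sum_at_least_30 = sum(pr for s, pr in distribution.items() if s >= 30)

print(expected_sum)        # E[SUM(price)]
print(p_sum_at_least_30)   # P[SUM(price) >= 30]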

Our trichotomy results are based on the conjecture that #BIS does not have an FPRAS. Evidence for this conjecture is given by Dyer et al. [58, 59], who establish that this problem is complete for a class of problems with respect to approximation-preserving reductions. At this point, it would be fair to say that this conjecture is less well established than #P ≠ P. Any positive progress, i.e., showing that #BIS does have an FPRAS, could be adapted to our setting. As we have shown, some problems are as hard to approximate as any problem in #P, e.g., as hard as #CLIQUE. An interesting open problem is to determine whether there is a corresponding syntactic boundary of hardness: is it true that every query is either #BIS-easy or #CLIQUE-hard? We conjecture that such a syntactic boundary exists, though this remains open.

Compression and Approximation

There is a long, successful line of work that compresses (deterministic) data to speed up query processing [54, 72, 76, 155, 166]. In wavelet approaches, probabilistic techniques are used to achieve higher-quality synopses [54]. In contrast, lineage in our setting contains probabilities, which must be captured. The fact that the lineage is probabilistic raises the complexity of compression. For example, the approach of Garofalakis et al. [72] assumes that the entire wavelet transform can be computed efficiently. In our work, the transform size is exponential in the size of the data. Probabilistic query evaluation can be reduced to calculating a single coefficient of the transform, which implies that exact computation of the transform is intractable [46, 78]. Aref et al. [60] advocate an approach that operates directly on compressed data to optimize queries on biological sequences. However, this approach is not lineage aware and so cannot extract explanations from the compressed data. In probabilistic databases, lineage is used for query processing in Mystiq [46, 137] and Trio [168]. However, neither considers approximate lineage. Ré et al. [137] consider approximately computing the probability of a query answer, but do not consider the problem of storing the lineage of a query answer. These techniques are orthogonal: we can use the techniques of [137] to compute the top-k query probabilities from the Level II database using sufficient lineage. Approximate lineage is used to materialize views of probabilistic data; this problem has been previously considered [136], but only with an exact semantics. Sen et al. [148] consider approximate processing of relational queries using graphical models, but not approximate lineage. In the graphical model literature [45, 98], approximate representation is considered, where the goal is to compress the model for improved performance. However, the data and query models of our approaches differ. Specifically, our approach leverages the fact that the lineage is internal to the database.

Learning Theory

Our approach to computing polynomial lineage is based on computational learning techniques, such as the seminal paper by Linial et al. [114], and others [19, 23, 125]. A key ingredient underlying these results is switching lemmata [14, 87, 147]. For the problem of sufficient lineage, we use the observation, implicit in both Segerlind et al. [147] and Trevisan [158], that either a few variables in a DNF matter (they hit every clause) or the formula is ε-large. The most famous (and sharpest) switching lemma, due to Håstad [87], underlies the Fourier results. So far, learning techniques have only been applied to compressing the data, but have not been used to compress the lineage [12, 74]. A difference between our approach and this prior art is that we do not discard any tuples, but may discard lineage.

Explanation

Explanation is an important task for probabilistic databases that we only briefly touched on in Chapter 6. Explanation is a well-studied topic in the Artificial Intelligence community [86, 131]. There, the explanation of a fact is a formula that is minimal and sufficient to explain the fact – which is similar to our definition – but they additionally require that the formula be unknown to the user. We do not model the knowledge of users, but such a semantics would be very useful for scientists.

Chapter 8

CONCLUSION AND FUTURE WORK

This thesis demonstrates that it is possible to effectively manage large, imprecise databases using a generic approach based on probability theory. The technical contributions are two query-time techniques, top-k query processing and aggregate evaluation, and two view-based techniques, materialized views and approximate lineage. We demonstrated that a system, Mystiq, based on the techniques in this dissertation is able to support rich, structured queries on probabilistic databases that contain tens of gigabytes of data with performance comparable to a standard relational engine.

Future Work

The management of uncertainty will be an increasingly important area over the next several years as businesses, governments, and scientific researchers contend with an ever-expanding amount of data. Although the space of applications will be diverse, there will be fundamental primitives common to many of these applications (just as there are with standard, deterministic data). As a result, there will be a need for general-purpose data management frameworks that can answer queries, explain results, and perform advanced analytics on large collections of imprecise data. The timing is right for such a system because of two complementary forces, a technology push and an application pull. The technology push is that there is a diverse and increasingly large set of technologies that produce imprecise data, such as entity matching, information extraction, and inexpensive sensors. The application pull is the wide variety of applications that require the ability to query, transform, and process uncertain data, such as data integration, data exploration, and preventive health-care monitoring applications. The critical problems of these future systems are scale, performance, and maintainability, which are the cornerstones of the data management field.

An immediate challenge is to understand the trade-offs between the expressive power of probabilistic models and the ability to process large datasets. Consider an application that tracks thousands of users equipped with RFID sensors. Ideally, we would capture not only low-level physical constraints, such as "a user's current location is correlated with their previous location", but also higher-level correlation information, such as "the database group has lunch together on Wednesday". An interesting question is: to what extent do specific types of correlations affect the output of a particular application? If our application only asks questions about groups of individuals, we could optimize our model to disregard low-level correlation information about any single individual. Dropping correlation information can vastly improve query performance, both because we must process a much smaller amount of data and because we may be able to use more aggressive processing strategies. In the context of a database of RFID sensors, our preliminary results suggest that for some queries, not tracking correlations allows orders-of-magnitude performance improvements, e.g., we can process thousands more streams using the same resources. At the same time, there is only a small decrease in quality, e.g., our detection rate for events decreases only slightly. There are many other opportunities for aggressive approximations in probabilistic query processing. Approximation techniques for query processing will be crucial in probabilistic database applications, but our understanding of their limits is still in its infancy.

Long-term work Current database management systems are well-suited to informing users of the who, what, and where of data processing, e.g., "which of my stores has a projected profit?", but do a poor job of informing users about the why and how of data processing, e.g., "why does store six have a projected profit?". Worse still, the next generation of data products, such as forecasting data, similarity scores, or the output of information extraction tools, is less precise than traditional relational data and so more difficult to understand. A transparent database would allow us to tackle current data management problems, such as explaining the provenance of data, but also emerging data management problems, such as debugging the output of information extraction tools. For example, consider the database of a large retailer that contains standard relational data, such as current inventory levels and orders, and also contains imprecise data, such as the result of forecasting software. In response to the query above, the system would return an explanation such as "we predict a large profit in store six because predictive model 10 says that cameras will sell well in the Southwest, where store six is located." A facility for explanations is the necessary primitive to build a system that allows an analyst to interactively explore and understand a large collection of imprecise data. A transparent database could also support what-if analyses, where our goal is to understand how changes in the underlying data affect the final output. Adapting existing notions of explanations from the artificial intelligence community [29, 131] to the problem of explaining the results of complex queries on large-scale data products is a major open challenge.

BIBLIOGRAPHY

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.

[2] Serge Abiteboul and Oliver M. Duschka. Complexity of answering queries using materialized views. In PODS, pages 254–263, 1998.

[3] Serge Abiteboul and Pierre Senellart. Querying and updating probabilistic information in xml. In EDBT, pages 1059–1068, 2006.

[4] Raman Adaikkalavan and Sharma Chakravarthy. Snoopib: Interval-based event specification and detection for active databases. Data Knowl. Eng., 59(1):139–165, 2006.

[5] Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. Automated selection of materi- alized views and indexes in sql databases. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 496–505. Morgan Kaufmann, 2000.

[6] Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, pages 586–596, 2002.

[7] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In ICDE, 2006.

[8] L. Antova, C. Koch, and D. Olteanu. World-set decompositions: Expressiveness and efficient algorithms. In ICDT, pages 194–208, 2007.

[9] Lyublena Antova, Christoph Koch, and Dan Olteanu. 10^{10^6} worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, pages 606–615, 2007.

[10] Arvind Arasu, Christopher Ré, and Dan Suciu. Large-scale deduplication with constraints using Dedupalog (full research paper). In ICDE, 2009. To appear.

[11] Marcelo Arenas, Leopoldo E. Bertossi, Jan Chomicki, Xin He, Vijay Raghavan, and Jeremy Spinrad. Scalar aggregation in inconsistent databases. Theor. Comput. Sci., 296(3):405–434, 2003.

[12] S. Babu, M. Garofalakis, and R. Rastogi. Spartan: A model-based semantic compression system for massive data tables. In SIGMOD, pages 283–294, 2001.

[13] D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Trans. Knowl. Data Eng., 4(5):487–502, 1992.

[14] P. Beame. A switching lemma primer. Technical Report 95-07-01, University of Washington, Seattle, WA, 1995.

[15] O. Benjelloun, A. Das Sarma, A. Y. Halevy, M. Theobald, and J. Widom. Databases with uncertainty and lineage. VLDB J., 17(2):243–264, 2008.

[16] O. Benjelloun, A. Das Sarma, C. Hayworth, and J. Widom. An introduction to ULDBs and the Trio system. IEEE Data Eng. Bull, 29(1):5–16, 2006.

[17] Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, and Jennifer Widom. Uldbs: Databases with uncertainty and lineage. In VLDB, pages 953–964, 2006.

[18] L. Bertossi and J. Chomicki. Query answering in inconsistent databases. In G. Saake J. Chomicki and R. van der Meyden, editors, Logics for Emerging Applications of Databases. Springer, 2003.

[19] A. Blum, M. L. Furst, J C. Jackson, M J. Kearns, Y. Mansour, and S. Rudich. Weakly learning dnf and characterizing statistical query learning using fourier analysis. In STOC, pages 253– 262, 1994.

[20] B. Boeckmann, A. Bairoch, R. Apweiler, M. C. Blatter, A. Estreicher, E. Gasteiger, M. J. Martin, K. Michoud, C. O’Donovan, I. Phan, S. Pilbout, and M. Schneider. The swiss-prot protein knowledgebase and its supplement trembl in 2003. Nucleic Acids Res, 31(1):365–370, January 2003.

[21] Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, Christopher Ré, and Dan Suciu. Mystiq: a system for finding more answers by using probabilities (demonstration). In Fatma Özcan, editor, SIGMOD Conference, pages 891–893. ACM, 2005.

[22] Lars Brenna, Alan J. Demers, Johannes Gehrke, Mingsheng Hong, Joel Ossher, Biswanath Panda, Mirek Riedewald, Mohit Thatte, and Walker M. White. Cayuga: a high-performance event processing engine. In SIGMOD Conference, pages 1100–1102, 2007.

[23] N. Bshouty and C. Tamon. On the fourier spectrum of monotone functions. J. ACM, 43(4):747–770, 1996.

[24] P. Buneman, A. Chapman, and J. Cheney. Provenance management in curated databases. In SIGMOD, pages 539–550, 2006.

[25] Peter Buneman, James Cheney, Wang Chiew Tan, and Stijn Vansummeren. Curated databases. In PODS, pages 1–12, 2008.

[26] D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. Olap over uncertain and imprecise data. VLDB J., 16(1):123–144, 2007.

[27] Douglas Burdick, Prasad Deshpande, T. S. Jayram, Raghu Ramakrishnan, and Shivakumar Vaithyanathan. Olap over uncertain and imprecise data. In VLDB, pages 970–981, 2005.

[28] Roger Cavallo and Michael Pittarelli. The theory of probabilistic databases. In Proceedings of VLDB, pages 71–81, 1987.

[29] Urszula Chajewska and Joseph Y. Halpern. Defining explanation in probabilistic systems. In Dan Geiger and Prakash P. Shenoy, editors, UAI, pages 62–71. Morgan Kaufmann, 1997.

[30] A. Chapman and H. V. Jagadish. Issues in building practical provenance systems. IEEE Data Eng. Bull., 30(4):38–43, 2007.

[31] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In ACM SIGMOD, San Diego, CA, 2003.

[32] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, pages 313–324, 2003.

[33] Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Optimizing queries with materialized views. In ICDE, pages 190–200, 1995.

[34] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In Proc. of SIGMOD03, 2003.

[35] Sara Cohen, Benny Kimelfeld, and Yehoshua Sagiv. Incorporating constraints in probabilistic xml. In PODS, pages 109–118, 2008.

[36] Sara Cohen, Benny Kimelfeld, and Yehoshua Sagiv. Running tree automata on probabilistic xml. In PODS, pages 227–236, 2009.

[37] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. In Nature Genet., pages 25–29, (2000).

[38] Graham Cormode, Feifei Li, and Ke Yi. Semantics of ranking queries for probabilistic data and expected ranks. In ICDE, pages 305–316, 2009.

[39] Garage Band Corp. http://www.garageband.com/.

[40] Garage Band Corp. www.ilike.com.

[41] Microsoft Corp. Northwind for sql server 2000.

[42] Microsoft Corp. Sql server 2005 samples (feb. 2007).

[43] Performance Council. Tpc-h (ad-hoc, decision support) benchmark. http://www.tpc.org/.

[44] Transaction Processing Performance Council. Tpc-r (decision support) benchmark (obso- lete). http://www.tpc.org/.

[45] R. G. Cowell, S. L. Lauritzen, A. P. David, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1999.

[46] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, Toronto, Canada, 2004.

[47] N. Dalvi and D. Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS, pages 293–302, 2007.

[48] N. Dalvi and D. Suciu. Management of probabilisitic data: Foundations and challenges. In PODS, pages 1–12, 2007.

[49] Nilesh Dalvi, Christopher Ré, and Dan Suciu. Queries and materialized views on probabilistic databases. JCSS, 2009.

[50] Adnan Darwiche. A differential approach to inference in bayesian networks. J. ACM, 50(3):280–305, 2003.

[51] A. Das Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In ICDE, 2006.

[52] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. In IJCAI, pages 1319–1325, 2005.

[53] Special issue on data provenance. IEEE Data Eng. Bull., 30(4), 2007.

[54] A. Deligiannakis, M. Garofalakis, and N. Roussopoulos. Extended wavelets for multiple measures. ACM Trans. Database Syst., 32(2):10, 2007.

[55] A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, pages 588–599, 2004.

[56] Amol Deshpande, Carlos Guestrin, Samuel Madden, Joseph M. Hellerstein, and Wei Hong. Model-driven data acquisition in sensor networks. In M.A. Nascimento, M.T. Özsu, D. Kossmann, R.J. Miller, J.A. Blakeley, and K.B. Schiefer, editors, VLDB, pages 588–599. Morgan Kaufmann, 2004.

[57] Debabrata Dey and Sumit Sarkar. Generalized normal forms for probabilistic relational data. IEEE Trans. Knowl. Data Eng., 14(3):485–497, 2002.

[58] Martin E. Dyer, Leslie Ann Goldberg, Catherine S. Greenhill, and Mark Jerrum. On the relative complexity of approximate counting problems. In APPROX, pages 108–119, 2000.

[59] Martin E. Dyer, Leslie Ann Goldberg, and Mark Jerrum. An approximation trichotomy for boolean #csp. CoRR, abs/0710.4272, 2007.

[60] M. Eltabakh, M. Ouzzani, and W. G. Aref. bdbms - a database management system for biological data. In CIDR, pages 196–206, 2007.

[61] Herbert B. Enderton. A Mathematical Introduction To Logic. Academic Press, San Diego, 1972.

[62] Oren Etzioni, Michele Banko, and Michael J. Cafarella. Machine reading. In AAAI, 2006.

[63] Ronald Fagin and Joseph Y. Halpern. Reasoning about knowledge and probability. In Moshe Y. Vardi, editor, Proceedings of the Second Conference on Theoretical Aspects of Reasoning about Knowledge, pages 277–293, San Francisco, 1988. Morgan Kaufmann.

[64] Ronald Fagin, Joseph Y. Halpern, and Nimrod Megiddo. A logic for reasoning about proba- bilities. Information and Computation, 87(1/2):78–128, 1990.

[65] I. P. Fellegi and A. B. Sunter. A theory for record linkage. In Journal of the American Statistical Society, volume 64, pages 1183–1210, 1969.

[66] http://flybase.bio.indiana.edu/.

[67] N. Friedman, L .Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300–1309, 1999.

[68] Norbert Fuhr and Thomas Rölleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1):32–66, 1997.

[69] A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent databases. In ICDT, pages 337–351, 2005.

[70] Ariel Fuxman, Elham Fazli, and Renee´ J. Miller. Conquer: Efficient management of incon- sistent databases. In SIGMOD Conference, pages 155–166, 2005.

[71] Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, and Cristian-Augustin Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB, pages 371–380, 2001.

[72] M. Garofalakis and P. Gibbons. Probabilistic wavelet synopses. ACM Trans. Database Syst., 29:43–90, 2004.

[73] Tingjian Ge, Stanley B. Zdonik, and Samuel Madden. Top-k queries on uncertain data: on score distribution and typical answers. In SIGMOD Conference, pages 375–388, 2009.

[74] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, pages 461–472, 2001.

[75] J. Goldstein and P. Larson. Optimizing queries using materialized views: a practical, scalable solution. In SIGMOD 2001, pages 331–342, New York, NY, USA, 2001. ACM Press.

[76] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing relations and indexes. In ICDE, pages 370–379, 1998.

[77] S.M. Gorski, S. Chittaranjan, E.D. Pleasance, J.D. Freeman, C.L. Anderson, R.J. Varhol, S.M. Coughlin, S.D. Zuyderduyn, S.J. Jones, and M.A. Marra. A SAGE approach to discovery of genes involved in autophagic cell death. Curr. Biol., 13:358–363, Feb 2003.

[78] Erich Grädel, Yuri Gurevich, and Colin Hirsch. The complexity of query reliability. In PODS, pages 227–234, 1998.

[79] G. Grahne. Lncs 554: The problem of incomplete information in relational databases. 1991.

[80] T. Green, G. Karvounarakis, N. E. Taylor, O. Biton, Z. G. Ives, and V. Tannen. Orchestra: facilitating collaborative data sharing. In SIGMOD, pages 1131–1133, 2007.

[81] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.

[82] Todd Green and Val Tannen. Models for incomplete and probabilistic information. IEEE Data Engineering Bulletin, 29(1):17–24, 2006.

[83] Lifang Gu, Rohan Baxter, Deanne Vickers, and Chris Rainsford. Record linkage: Current practice and future directions. In CMIS Technical Report No. 03/83, 2003.

[84] R. Gupta and S. Sarawagi. Curating probabilistic databases from information extraction mod- els. In Proc. of the 32nd Int’l Conference on Very Large Databases (VLDB), 2006.

[85] Alon Halevy. Answering queries using views: A survey. VLDB Journal, 10(4):270–294, 2001.

[86] J. Halpern and J. Pearl. Causes and explanations: A structural-model approach - part II: Explanations. In IJCAI, pages 27–34, 2001.

[87] J. Håstad. Computational limitations for small depth circuits. M.I.T Press, Cambridge, Massachusetts, 1986.

[88] E. Hazan, S. Safra, and O. Schwartz. On the hardness of approximating k-dimensional match- ing. ECCC, 10(020), 2003.

[89] M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In SIGMOD, pages 127–138, 1995.

[90] Ming Hua, Jian Pei, Wenjie Zhang, and Xuemin Lin. Efficiently answering probabilistic threshold top-k queries on uncertain data. In ICDE, pages 1403–1405, 2008.

[91] Jiewen Huang, Lyublena Antova, Christoph Koch, and Dan Olteanu. Maybms: a probabilistic database management system. In SIGMOD Conference, pages 1071–1074, 2009.

[92] Edward Hung, Lise Getoor, and V. S. Subrahmanian. Pxml: A probabilistic semistructured data model and algebra. In ICDE, pages 467–, 2003.

[93] T. Imielinski and W. Lipski. Incomplete information in relational databases. Journal of the ACM, 31:761–791, October 1984.

[94] ISO. Standard 9075. Information Processing Systems. Database Language SQL, 1987.

[95] Ravi Jampani, Fei Xu, Mingxi Wu, Luis Leopoldo Perez, Christopher M. Jermaine, and Peter J. Haas. MCDB: a monte carlo approach to managing uncertain data. In SIGMOD Conference, pages 687–700, 2008.

[96] T. S. Jayram, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and Huaiyu Zhu. Avatar information extraction system. IEEE Data Eng. Bull., 29(1):40–48, 2006.

[97] T.S. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In SODA, 2007.

[98] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[99] Bhargav Kanagal and Amol Deshpande. Online filtering, smoothing and probabilistic mod- eling of streaming data. In ICDE, pages 1160–1169, 2008.

[100] Bhargav Kanagal and Amol Deshpande. Indexing correlated probabilistic databases. In SIGMOD Conference, pages 455–468, 2009.

[101] Richard Karp and Michael Luby. Monte-carlo algorithms for enumeration and reliability problems. In STOC, 1983.

[102] Richard M. Karp and Michael Luby. Monte-carlo algorithms for enumeration and reliability problems. In FOCS, pages 56–64, 1983.

[103] Benny Kimelfeld. Querying Paradigms for the Web. PhD thesis, The Hebrew University, August 2008.

[104] Benny Kimelfeld, Yuri Kosharovsky, and Yehoshua Sagiv. Query efficiency in probabilistic xml models. In SIGMOD Conference, pages 701–714, 2008.

[105] Christoph Koch. Approximating predicates and expressive queries on probabilistic databases. In PODS, pages 99–108, 2008.

[106] Christoph Koch. A compositional query algebra for second-order logic and uncertain databases. In ICDT, pages 127–140, 2009.

[107] E. Kushilevitz and Y. Mansour. Learning decision trees using the fourier spectrum. SIAM J. Comput., 22(6):1331–1348, 1993.

[108] L. Lakshmanan, N. Leone, R. Ross, and V.S. Subrahmanian. Probview: A flexible proba- bilistic database system. ACM Trans. Database Syst., 22(3), 1997.

[109] Serge Lang. Algebra. Springer, January 2002.

[110] J. Lester, T. Choudhury, N. Kern, G. Borriello, and B. Hannaford. A hybrid discrimina- tive/generative approach for modeling human activities. In IJCAI, pages 766–772, 2005.

[111] Julie Letchner, Christopher Ré, Magdalena Balazinska, and Matthai Philipose. Lahar demonstration: Warehousing markovian streams. In VLDB, 2009.

[112] Julie Letchner, Christopher Ré, Magdalena Balazinska, and Matthai Philipose. Access methods for markovian streams. In ICDE, pages 246–257, 2009.

[113] Jian Li, Barna Saha, and Amol Deshpande. A unified approach to ranking in probabilistic databases. In VLDB, 2007.

[114] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, fourier transform, and learn- ability. J. ACM, 40(3):607–620, 1993.

[115] A. Machanavajjhala and J. Gehrke. On the efficiency of checking perfect privacy. In Stijn Vansummeren, editor, PODS, pages 163–172. ACM, 2006.

[116] Eirinaios Michelakis, Rajasekar Krishnamurthy, Peter J. Haas, and Shivakumar Vaithyanathan. Uncertainty management in rule-based information extraction systems. In SIGMOD Conference, pages 101–114, 2009.

[117] G. Miklau and D. Suciu. A formal analysis of information disclosure in data exchange. In SIGMOD, 2004.

[118] Gerome Miklau. Confidentiality and Integrity in Data Exchange. PhD thesis, University of Washington, Aug 2005.

[119] Gerome Miklau and Dan Suciu. A formal analysis of information disclosure in data exchange. J. Comput. Syst. Sci., 73(3):507–534, 2007.

[120] Katherine F. Moore, Vibhor Rastogi, Christopher Ré, and Dan Suciu. Query containment of tier-2 queries over a probabilistic database. In Management of Uncertain Databases, 2009.

[121] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, 1997.

[122] Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, and Margo I. Seltzer. Provenance-aware storage systems. In USENIX Annual Technical Conference, General Track, pages 43–56, 2006.

[123] Raghotham Murthy, Robert Ikeda, and Jennifer Widom. Making aggregation work in uncer- tain and probabilistic databases. Technical Report 2007-7, Stanford InfoLab, June 2007.

[124] Andrew Nierman and H. V. Jagadish. Protdb: Probabilistic data in xml. In VLDB, pages 646–657, 2002.

[125] R. W. O’Donnell. Computational Applications of Noise Sensitivity. PhD thesis, M.I.T., 2003.

[126] O.Etzioni, M.J. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D.S. Weld, and A. Yates. Web-scale information extraction in knowitall: (preliminary results). In S.I. Feldman, M. Uretsky, M. Najork, and C.E. Wills, editors, WWW, pages 100–110. ACM, 2004.

[127] Dan Olteanu, Jiewen Huang, and Christoph Koch. SPROUT: Lazy vs. eager query plans for tuple-independent probabilistic databases. In Proc. of ICDE 2009, 2009.

[128] Christos Papadimitriou. Computational Complexity. Addison Wesley Publishing Company, 1994.

[129] P. Agrawal, O. Benjelloun, A.D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data uncertainty and lineage. In VLDB, 2006.

[130] J. Pearl. Probabilistic Reasoning In Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc, 1988.

[131] Judea Pearl. Causality : Models, Reasoning, and Inference. Cambridge University Press, March 2000.

[132] Jian Pei, Bin Jiang, Xuemin Lin, and Yidong Yuan. Probabilistic skylines on uncertain data. In VLDB, pages 15–26, 2007.

[133] Rachel Pottinger and Alon Y. Halevy. Minicon: A scalable algorithm for answering queries using views. VLDB J., 10(2-3):182–198, 2001.

[134] http://www.pubmed.gov.

[135] C. Ré, N. Dalvi, and D. Suciu. Query evaluation on probabilistic databases. IEEE Data Engineering Bulletin, 29(1):25–31, 2006.

[136] C. Ré and D. Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In VLDB, pages 51–62, 2007.

[137] Christopher Ré, Nilesh N. Dalvi, and Dan Suciu. Efficient top-k query evaluation on probabilistic data (full research paper). In ICDE, pages 886–895. IEEE, 2007.

[138] Christopher Ré, Julie Letchner, Magdalena Balazinska, and Dan Suciu. Event queries on correlated probabilistic streams. In SIGMOD Conference, pages 715–728, 2008.

[139] Christopher Ré and Dan Suciu. Efficient evaluation of HAVING queries. In DBPL, pages 186–200, 2007.

[140] Christopher Ré and Dan Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In VLDB, pages 51–62, 2007.

[141] Christopher Ré and Dan Suciu. Approximate lineage for probabilistic databases. (Formerly VLDB) PVLDB, 1(1):797–808, 2008.

[142] Christopher Ré and Dan Suciu. Managing probabilistic data with Mystiq: The can-do, the could-do, and the can't-do. In SUM, pages 5–18, 2008.

[143] Christopher Ré and Dan Suciu. The trichotomy of HAVING queries on a probabilistic database. VLDB Journal, 2009.

[144] Robert Ross, V. S. Subrahmanian, and John Grant. Aggregate operators in probabilistic databases. J. ACM, 52(1):54–101, 2005.

[145] A.D. Sarma, O. Benjelloun, A.Y. Halevy, and J. Widom. Working models for uncertain data. In Ling Liu, Andreas Reuter, Kyu-Young Whang, and Jianjun Zhang, editors, ICDE, page 7. IEEE Computer Society, 2006.

[146] Anish Das Sarma, Martin Theobald, and Jennifer Widom. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In ICDE, pages 1023–1032, 2008.

[147] N. Segerlind, S. R. Buss, and R. Impagliazzo. A switching lemma for small restrictions and lower bounds for k-dnf resolution. SIAM J. Comput., 33(5):1171–1200, 2004.

[148] P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In Proceedings of ICDE, 2007.

[149] Prithviraj Sen, Amol Deshpande, and Lise Getoor. Exploiting shared correlations in proba- bilistic databases. PVLDB, 1(1):809–820, 2008.

[150] Pierre Senellart and Serge Abiteboul. On the complexity of managing probabilistic xml data. In PODS, pages 283–292, 2007.

[151] R. Servedio. On learning monotone dnf under product distributions. Inf. Comput., 193(1):57– 74, 2004.

[152] Alistair Sinclair and Mark Jerrum. Approximate counting, uniform generation and rapidly mixing markov chains. Inf. Comput., 82(1):93–133, 1989.

[153] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In AAAI, pages 1094–1099, 2008.

[154] M. Soliman, I.F. Ilyas, and K. Chen-Chaun Chang. Top-k query processing in uncertain databases. In Proceedings of ICDE, 2007.

[155] M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil, P. O’Neil, A. Rasin, N. Tran, and S. Zdonik. C-store: A column- oriented dbms. In VLDB, pages 553–564, 2005.

[156] Go database v. go200801.

[157] Thanh Tran, Charles Sutton, Richard Cocci, Yanming Nie, Yanlei Diao, and Prashant J. Shenoy. Probabilistic inference over rfid streams in mobile environments. In ICDE, pages 1096–1107, 2009.

[158] L. Trevisan. A note on deterministic approximate counting for k-dnf. Electronic Colloquium on Computational Complexity (ECCC), (069), 2002.

[159] Jeffrey D. Ullman. Principles of Database and Knowledgebase Systems I. Computer Science Press, Rockville, MD 20850, 1989.

[160] Jeffrey D. Ullman. Information integration using logical views. In Foto N. Afrati and Phokion G. Kolaitis, editors, Database Theory - ICDT ’97, 6th International Conference, Delphi, Greece, January 8-10, 1997, Proceedings, volume 1186 of Lecture Notes in Com- puter Science, pages 19–40. Springer, 1997.

[161] Patrick Valduriez. Join indices. ACM Trans. Database Syst., 12(2):218–246, 1987.

[162] L.G. Valiant. The complexity of enumeration and reliability problems. SIAM J. Comput., 8(3):410–421, 1979.

[163] Maurice van Keulen, Ander de Keijzer, and Wouter Alink. A probabilistic xml approach to data integration. In ICDE, pages 459–470, 2005.

[164] M. Y. Vardi. The complexity of relational query languages. In Proceedings of 14th ACM SIGACT Symposium on the Theory of Computing, pages 137–146, San Francisco, California, 1982.

[165] M. Y. Vardi. On the integrity of databases with incomplete information. In Proceedings of 5th ACM Symposium on Principles of Database Systems, pages 252–266, 1986.

[166] J. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD, pages 193–204, 1999.

[167] Daisy Zhe Wang, Eirinaios Michelakis, Minos N. Garofalakis, and Joseph M. Hellerstein. Bayesstore: managing large, uncertain data repositories with probabilistic graphical models. PVLDB, 1(1):340–351, 2008.

[168] Jennifer Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, pages 262–276, 2005.

[169] William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb, pages 73–78, 2003.

[170] William Winkler. The state of record linkage and current research problems. In Technical Report, Statistical Research Division, U.S. Bureau of the Census, 1999.

[171] Ke Yi, Feifei Li, George Kollios, and Divesh Srivastava. Efficient processing of top-k queries in uncertain databases. In ICDE, pages 1406–1408, 2008.

[172] Xi Zhang and Jan Chomicki. On the semantics and evaluation of top-k queries in probabilistic databases. In ICDE Workshops, pages 556–563, 2008.

[173] http://imdb.com.

[174] http://www.amazon.com.

Appendix A PROOF OF RELAXED PROGRESS

A.1 Bounding the Violations of Progress

The high-level goal of this section is to establish that after a negligible number of steps with respect to the optimal, the progress assumption holds. The result contained here allows us to show that with extremely high probability our algorithm takes 2(OPT + o(OPT)) steps, thus achieving our stated competitive ratio of 2. To achieve this goal, our technical tool is an exponentially small upper bound on the probability that progress is violated. Further, we do not assume anything about the underlying distribution of true probabilities, thus keeping our competitive result free of distributional assumptions. Specifically, in what follows, all random choices are over the results of the experiment, with the underlying probabilities chosen adversarially. We first introduce some notation and define our random variables, then state an estimation lemma and prove our main result.

A.1.1 Notation and Random Variables

We let n be the number of iterations on the current interval and consider, for a fixed δ, the probability that progress is violated after we take $n_0$ further steps. We denote the (fixed) true probability by p.

Interval Size Function Let $f(n, \delta)$ be the bounding function

$$f(n, \delta) = \frac{2m}{\sqrt{n}} \log\frac{2}{\delta}$$

Random Variables We have two main random variables, $\rho(n)$ and $X(n_0)$. $\rho(n)$ represents the value of our estimator after $n$ draws. $X(n_0)$ represents the value of $n_0$ random draws taken after the first $n$. We observe that although we have described them as sequential, they are actually independent random variables, because they are sums of independent trials. We summarize the properties of these random variables below.

r.v. $\alpha$     $E[\alpha]$     As a sum of indicators
$\rho(n)$         $p$             $\frac{1}{n}\sum_{i=1}^{n} y_i$
$X(n_0)$          $n_0\,p$        $\sum_{j=1}^{n_0} x_j$

Notation An obvious extension is $\rho(n + n_0)$, the estimate after $n + n_0$ steps. This random variable can be written in terms of the other two (independent) random variables as

$$\rho(n+n_0) = \frac{\rho(n)\, n + X(n_0)}{n+n_0} \qquad \text{(A.1)}$$

The following decompositions are helpful later:

$$\rho(n+n_0) - \rho(n) = \frac{X(n_0) - \rho(n)\, n_0}{n+n_0} \qquad \text{(A.2)}$$

and

$$\rho(n) - \rho(n+n_0) = \frac{\rho(n)\, n_0 - X(n_0)}{n+n_0} \qquad \text{(A.3)}$$
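For completeness, Eq. (A.2) follows from Eq. (A.1) by a one-line calculation (and Eq. (A.3) is simply its negation):

$$\rho(n+n_0) - \rho(n) = \frac{\rho(n)\,n + X(n_0)}{n+n_0} - \frac{\rho(n)\,(n+n_0)}{n+n_0} = \frac{X(n_0) - \rho(n)\,n_0}{n+n_0}.$$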

Progress Conditions Progress is violated when either of the following two conditions is met:

1. $\rho(n) < \rho(n+n_0)$ and $\rho(n) + f(n,\delta) < \rho(n+n_0) + f(n+n_0,\delta')$

2. $\rho(n) > \rho(n+n_0)$ and $\rho(n) - f(n,\delta) > \rho(n+n_0) - f(n+n_0,\delta')$

A.1.2 Helpful Lemma

Lemma A.1.1. Let $n_0 = n^{\alpha}$ and let $\beta$ be a constant such that $\beta > 1$. Then the following holds:

$$\frac{1}{\sqrt{n}} - \frac{1}{\sqrt{n+n_0}} \in \Theta(n^{\alpha - 3/2}) \qquad \text{(A.4)}$$

Proof. We first factor out $n^{-1/2}$ and rearrange the second term to get the inequality

$$c_1 n^{\alpha-1} \;\le\; 1 - \left(\frac{n}{n+n_0}\right)^{1/2} \;\le\; c_2 n^{\alpha-1}$$

We prove that $g(n) = 1 - \left(\frac{n}{n+1}\right)^{1/2} \in \Theta(n^{-1})$. We observe that if $n_0 = n^{\alpha}$, factoring will yield $g(n^{1-\alpha})$. Using composition we get that $g(n^{1-\alpha}) = \Theta(n^{\alpha-1})$, which implies our lemma. Written another way, we are comparing $1 - \frac{1}{n} = \frac{n-1}{n}$ and $\sqrt{\frac{n}{n+1}}$:

$$\frac{(n-1)^2}{n^2} \propto \frac{n}{n+1} \iff (n-1)^2(n+1) \propto n^3$$
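A quick numeric sanity check of the asymptotic in Eq. (A.4) (our own illustration, not part of the original proof): the ratio of the left-hand side to $n^{\alpha - 3/2}$ should settle near a constant as $n$ grows.

import math

alpha = 0.5  # example exponent; n_0 = n ** alpha

for n in [10**3, 10**4, 10**5, 10**6]:
    n0 = n ** alpha
    lhs = 1.0 / math.sqrt(n) - 1.0 / math.sqrt(n + n0)
    scale = n ** (alpha - 1.5)
    # The ratio stabilizes (here near 1/2), consistent with
    # lhs being Theta(n^(alpha - 3/2)).
    print(n, lhs / scale)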



A.1.3 Chernoff Version of Bounds

The idea here is to exponentiate the bounds, i.e., to consider $e^{tX}$, and then apply a Markov inequality. We then optimize our choice of $t$ to minimize the probability.

$$P[\text{Progress is violated}](n, n_0, \delta) \;\le\; P\big[(f(n,\delta) - f(n+n_0,\delta)) < (\rho(n+n_0) - \rho(n))\big] \;+\; P\big[(f(n,\delta) - f(n+n_0,\delta)) < (\rho(n) - \rho(n+n_0))\big] \qquad \text{Union Bound}$$

$$\le\; 2\, P\big[e^{t(f(n,\delta) - f(n+n_0,\delta))} < e^{t(\rho(n+n_0) - \rho(n))}\big] \qquad \text{Dual is the same}$$

$$\le\; 2\, E\big[e^{t(\rho(n+n_0) - \rho(n))}\big]\, e^{-t(f(n,\delta) - f(n+n_0,\delta))} \qquad \text{Markov after } e^{t}$$

The dual case is $\le E[e^{t(\rho(n) - \rho(n+n_0))}]\, e^{-t(f(n,\delta) - f(n+n_0,\delta))}$. We will see that it is no worse than the case we have chosen.

The Expectation term

Lemma A.1.2. If $t \ll n + n_0$ then $E[e^{t(\rho(n+n_0)-\rho(n))}]$ and $E[e^{t(\rho(n)-\rho(n+n_0))}]$ are both $O(1)$.

Proof. We do the calculation for $E[e^{t(\rho(n+n_0)-\rho(n))}]$:

$$E\big[e^{t(\rho(n+n_0)-\rho(n))}\big] = E\Big[e^{t\big(\frac{X(n_0)}{n+n_0} - \frac{n_0\rho(n)}{n+n_0}\big)}\Big] \qquad \text{decomposition in Eq. A.2}$$

$$= E\Big[e^{\frac{t X(n_0)}{n+n_0}}\Big]\, E\Big[e^{-\frac{t n_0 \rho(n)}{n+n_0}}\Big] \qquad \text{independence of } X(n_0),\ \rho(n)$$

Thus we only need to establish that if $t \ll n + n_0$ then

1. $E\big[e^{\frac{t X(n_0)}{n+n_0}}\big] \in O(1)$

2. $E\big[e^{\frac{t n_0 \rho(n)}{n+n_0}}\big] \in O(1)$

First term: $E\big[e^{\frac{t X(n_0)}{n+n_0}}\big]$

$$E\Big[e^{\frac{t X(n_0)}{n+n_0}}\Big] = E\Big[e^{\sum_{i=1}^{n_0} \frac{t x_i}{n+n_0}}\Big] \qquad \text{definition of } X(n_0)$$
$$= \prod_{i=1}^{n_0} E\Big[e^{\frac{t x_i}{n+n_0}}\Big] \qquad \text{independence of the } x_i$$
$$= \prod_{i=1}^{n_0} \Big(e^{\frac{t}{n+n_0}} p + 1 - p\Big) \qquad \text{expectation}$$
$$= \prod_{i=1}^{n_0} \Big(\big(e^{\frac{t}{n+n_0}} - 1\big)p + 1\Big) \qquad \text{factoring } p$$
$$< \prod_{i=1}^{n_0} e^{p\big(e^{\frac{t}{n+n_0}} - 1\big)} \qquad 1 + x < e^x$$
$$= e^{n_0 p \big(e^{\frac{t}{n+n_0}} - 1\big)} \qquad \text{exponent rules}$$

We observe that for $t \ll n + n_0$ this term goes to $e^0 = 1$, as desired.

Second Term: $E\big[e^{\frac{t n_0 \rho(n)}{n+n_0}}\big]$ We deal with the second expectation term.

$$E\Big[e^{\frac{t \rho(n)}{n+n_0}}\Big] = E\Big[e^{\sum_{i=1}^{n} \frac{t y_i}{n(n+n_0)}}\Big] = \exp\Big(np\big(e^{\frac{t}{n(n+n_0)}} - 1\big)\Big) \qquad \text{same argument}$$

Now, we observe that for $t \ll n(n + n_0)$ this term will also go to 1.

The Denominator

Lemma A.1.3. For any choice of $t$,

$$e^{-t(f(n,\delta) - f(n+n_0,\delta))} \le e^{-2tm \log\frac{2}{\delta}\, n^{\alpha - 3/2}}$$

Proof.

$$e^{-t(f(n,\delta) - f(n+n_0,\delta))} = e^{-2mt \log\frac{2}{\delta}\,\big(n^{-1/2} - (n+n_0)^{-1/2}\big)} \le e^{-2mt \log\frac{2}{\delta}\, n^{\alpha - 3/2}} \qquad \text{Lemma A.1.1}$$



The Punchline

Combining Lemmas A.1.3 and A.1.2, we have

$$t \ll n + n_0 \implies P[\text{Progress is violated}](n, n_0, \delta) \le 2\, e^{-2mt \log\frac{2}{\delta}\, n^{\alpha - 3/2}}$$

We take $t = n_0 = n^{\alpha}$ for $\alpha < 1$, which satisfies $t \ll n + n_0$:

$$P[\text{Progress is violated}](n, n_0, \delta) \le 2\, e^{-2m \log\frac{2}{\delta}\, n^{2\alpha - 3/2}}$$

If we take $t = n^{\alpha}$ with $\alpha = \frac{3}{4} + \alpha'$, for $\alpha' \in [0, \frac{1}{4})$, then our probability simplifies to:

$$P[\text{violation}](n, \alpha', \delta) \le 2\, e^{-2m \log\frac{2}{\delta}\, n^{2\alpha'}} \qquad \text{(A.5)}$$

Since this is an exponentially small probability, the number and variance of the errors are exponentially small. Thus, by applying a Chebyshev inequality, we can conclude that with exponentially high probability there are no errors after $n^{\alpha}$ steps from any fixed $n$.

Theorem A.1.4. The probability that progress holds after some negligible number of steps tends to 1 exponentially quickly.

Remark A.1.5. Also, as long as $\alpha < 1$, our optimality ratio is preserved.

Appendix B PROOFS FOR EXTENSIONAL EVALUATION

B.1 Properties of Safe Plans

We formalize the properties that hold in safe extensional plans in the following proposition:

Proposition B.1.1. Let $P = \pi^{I}_{-x} P_1$ be a safe plan. Then for any tuples $t_1, \ldots, t_n \in D^{|var(P_1)|}$ that disagree on $x$, i.e., such that $i \ne j$ implies that $t_i[var(P)] = t_j[var(P)]$ and $t_i[x] \ne t_j[x]$, and for any $s_1, \ldots, s_n \in S$, we have independence; that is, the following equation holds:

$$\mu\Big(\bigwedge_{i=1,\ldots,n} \omega_{P_1,S}(t_i) = s_i\Big) = \prod_{i=1,\ldots,n} \mu\big(\omega_{P_1,S}(t_i) = s_i\big) \qquad \text{(B.1)}$$

Similarly, let $P = \pi^{D}_{-x} P_1$; then we have disjointness:

$$\mu\Big(\bigwedge_{i=1,\ldots,n} \omega_{P_1,S}(t_i) = 0\Big) = \sum_{i=1,\ldots,n} \mu\big(\omega_{P_1,S}(t_i) = 0\big) - (n-1) \qquad \text{(B.2)}$$

Let $P = P_1 \Join P_2$; then for any tuples $t_i \in D^{|var(P_i)|}$ for $i = 1, 2$ and any $s_1, s_2 \in S$, we have independence:

$$\mu\big(\omega_{P_1,S}(t_1) = s_1 \wedge \omega_{P_2,S}(t_2) = s_2\big) = \mu\big(\omega_{P_1,S}(t_1) = s_1\big)\, \mu\big(\omega_{P_2,S}(t_2) = s_2\big) \qquad \text{(B.3)}$$

Proof. We prove Eq. B.1. To see this, observe that, directly from the definition, the sets of tuples that contribute to $t_i$ and $t_j$ ($i \ne j$) do not share the same value for a key in any relation. It is not hard to see that $t_i$ and $t_j$ are functions of independent tuples, hence are independent. The equation then follows by the definition of independence. We prove Eq. B.2. Assume for contradiction that the tuples are not disjoint, that is, there exists some possible world $W$ such that for some $i \ne j$, $\{t_i, t_j\} \subseteq W$. By the definition, there must exist some key goal $g$ such that $key(g) \subseteq var(P)$. Thus, for $t_i$ and $t_j$ to both be present in $W$, there must be two distinct tuples with the same key value – but different values for the attribute corresponding to $x$. This contradicts the key condition, hence the tuples are disjoint and the equation follows.

We prove Eq. B.3. In a safe plan, $goal(P_1) \cap goal(P_2) = \emptyset$ and distinct relations are independent. As a result, the tuples themselves are independent.
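As a sanity check on Eq. (B.2): writing $A_i$ for the event $\omega_{P_1,S}(t_i) \ne 0$ (our own shorthand), pairwise disjointness of the $A_i$ gives

$$\mu\Big(\bigwedge_{i} \omega_{P_1,S}(t_i) = 0\Big) = 1 - \mu\Big(\bigvee_{i} A_i\Big) = 1 - \sum_{i} \mu(A_i) = \sum_{i} \mu\big(\omega_{P_1,S}(t_i) = 0\big) - (n-1),$$

which is exactly the claimed identity.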

B.2 Appendix: Full Proof for COUNT(DISTINCT)

Theorem B.2.1 (Restatement of Theorem 4.4.14). Let Q be COUNT(DISTINCT)-safe; then its evaluation problem is in P.

Proof. Since Q is COUNT(DISTINCT)-safe, there is a safe plan P for the skeleton of Q. In the following, let $P_1 \prec P_2$ denote the relationship that $P_1$ is a descendant of $P_2$ in P (alternatively, containment). Let $P_y$ be a subplan which satisfies $P_{-y} = \pi^{I}_{-y}(P') \prec P$ or $P_{-y} = \pi^{D}_{-y}(P') \prec P$.

$P_{-y}$ is a safe plan, hence S-safe for $S = Z_2$, i.e., the EXISTS algebra. For each t, we can write $\hat\omega^{I}_{P_{-y}}(t) = (1 - p, p)$, i.e., t is present with probability p. From this, create a marginal vector $m_t$ in $Z_{k+1}$, as in COUNT, such that $m_t[0] = 1 - p$, $m_t[1] = p$, and all other entries are 0. Notice that if $t \ne t'$ then $t[y] \ne t'[y]$. Informally, this means all y values are distinct "after" $P_y$.

Compute the remainder of P as follows. If $P'$ is not a proper ancestor or descendant of $P_y$, then compute $P'$ as if using the EXISTS algebra. To emphasize that $P'$ should be computed this way, we denote the value of t under $P'$ as $\hat\omega^{J}_{P',\text{EXISTS}}(t)$. Since P is COUNT(DISTINCT)-safe, any proper ancestor $P_0$ of $P_{-y}$ is of the form $P_0 = \pi^{D}_{-x}P_1$ or $P_0 = P_1 \Join P_2$. If $P_0 = \pi^{D}_{-x}P_1$ then $\hat\omega^{J}_{P_0}(t) = \coprod_{t' \in P_1} \hat\omega^{J}_{P_1}(t')$; this is correct because the tuples we are combining are disjoint, so which values are present does not matter. Else, we may assume $P_0 = P_1 \Join P_2$, and without loss of generality we assume that $P_y \prec P_1$; thus we compute:

$$\hat\omega^{J}_{P_0,\text{COUNT(DISTINCT)}}(t) = \hat\omega^{J}_{P_1}(t_1) \otimes \hat\omega^{J}_{P_2,\text{EXISTS}}(t_2)$$

This is an abuse of notation, since we intend that $\hat\omega^{J}_{P_2} \in Z_2$ is first mapped into $Z_{k+1}$ and then the convolution is performed. Since we are either multiplying our lossy vector by the annihilator or by the multiplicative identity, this convolution has the effect of multiplying by the probability that t is in $P_2$; since these events are independent, this is exactly the value of their conjunction.
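To illustrate the marginal-vector manipulations used above, here is a small sketch in our own notation (it is not the Mystiq implementation): a marginal vector over $Z_{k+1}$ stores the probability of each aggregate value up to k, an independent tuple present with probability p contributes the vector $(1-p, p, 0, \ldots, 0)$, and combining independent subplans is a truncated convolution.

def tuple_marginal(p, k):
    # Marginal vector in Z_{k+1} for an independent tuple present with probability p.
    m = [0.0] * (k + 1)
    m[0] = 1.0 - p
    m[1] = p
    return m

def convolve(m1, m2, k):
    # Independent combination: counts add and probabilities multiply;
    # counts of k or more are collected in the last bucket.
    out = [0.0] * (k + 1)
    for i, p1 in enumerate(m1):
        for j, p2 in enumerate(m2):
            out[min(i + j, k)] += p1 * p2
    return out

# Example: three independent tuples with probabilities 0.9, 0.5, 0.2 and k = 2.
k = 2
m = tuple_marginal(0.9, k)
for p in [0.5, 0.2]:
    m = convolve(m, tuple_marginal(p, k), k)

# m[c] is the probability that exactly c tuples are present for c < k,
# and m[k] is the probability that at least k tuples are present.
print(m)

In this sketch, a predicate such as "the count is at least k" is then read off directly as m[k].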

B.2.1 Complexity

Proposition B.2.2 (Second Half of Prop. 4.4.15). The following HAVING queries are #P-hard for $i \ge 1$:

$$Q_{2,i}[\text{COUNT(DISTINCT } y)\ \theta\ k] \;\text{ :- }\; R_1(x; y), \ldots, R_i(x; y)$$

Proof. We start with $i = 1$. The hardness of $Q_{2,i}$ is shown by a reduction from counting the number of set covers of size k. The input is a set of elements $U = \{u_1, \ldots, u_n\}$ and a family of sets $F = \{S_1, \ldots, S_m\}$. A cover is a subset of F such that for each $u \in U$ there is an S in the cover with $u \in S$. For each element $u \in U$, let $S_u = \{S \in F \mid u \in S\}$ and add a tuple $R(u; S; |S_u|^{-1})$ for each $S \in S_u$. Every possible world corresponds to a set cover and hence, if $W_k$ is the number of covers of size k, then $\mu(Q) = W_k \big(\prod_{u \in U} |S_u|^{-1}\big)$. Notice that if we use the same reduction for $i > 1$, we have that $\mu(Q) = W_k \big(\prod_{u \in U} |S_u|^{-i}\big)$.

We show that if Q contains self joins and is not COUNT(DISTINCT)-safe, then Q has #P-hard data complexity. First, we observe a simple fact:

Proposition B.2.3. Let Q be a HAVING query with an unsafe skeleton; then Q has #P-hard data complexity. Further, if Q is connected and safe but not COUNT(DISTINCT)-safe, then there must exist $x \ne y$ such that $\forall g \in goal(Q)$, $x \in key(g)$.

Proof. We simply observe that the count of distinct values is $\ge 1$ exactly when the query is satisfied, which is #P-hard. The other aggregates follow easily. Since the skeleton of Q is safe, there is a safe plan for Q that is not COUNT(DISTINCT)-safe. This implies that there is some independent projection $\pi^{I}_{-x}$ on all variables.

Definition B.2.4. For a conjunctive query $q$, let $F^{q}_{\infty}$ be the least fixed point of $F^{q}_{0}, F^{q}_{1}, \ldots$, where

$$F^{q}_{0} = \{x \mid \exists g \in \mathrm{goal}(q) \text{ s.t. } \mathrm{key}(g) = \emptyset \wedge x \in \mathrm{var}(g)\}$$

$$F^{q}_{i+1} = \{x \mid \exists g \in \mathrm{goal}(q) \text{ s.t. } \mathrm{key}(g) \subseteq F^{q}_{i} \wedge x \in \mathrm{var}(g)\}$$

Intuitively, $F^{q}_{\infty}$ is the set of variables "fixed" in a possible world.
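The least fixed point can be computed by the obvious iteration; the following is a minimal Python sketch of the definition above, assuming each subgoal is represented as a (key variables, all variables) pair. The representation and names are assumptions made for illustration.

```python
# Illustrative sketch of computing F_infinity for a query skeleton.
# Each goal is (key_variables, all_variables); names are assumptions.

def f_infinity(goals):
    fixed = set()
    changed = True
    while changed:                      # iterate F_0, F_1, ... to a fixed point
        changed = False
        for key, vars_ in goals:
            if set(key) <= fixed and not set(vars_) <= fixed:
                fixed |= set(vars_)     # key already fixed => all its variables are fixed
                changed = True
    return fixed

# Example: R(; x), S(x; y) -- x is fixed by R, which in turn fixes y through S.
print(f_infinity([((), ("x",)), (("x",), ("x", "y"))]))  # {'x', 'y'}
```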

Proposition B.2.5. If $q$ is safe and $x \in F^{q}_{\infty}$, then there is a safe plan $P$ such that $\pi^{D}_{-x} \in P$ and all ancestors of $\pi^{D}_{-x}$ are either $\pi^{D}_{-z}P_1$ for some $z$ or $P_1 \Join P_2$.

Proof. Consider the smallest query $q$ for which the proposition fails, where the order is given by number of subgoals and then number of variables. Let $x_1, \ldots, x_n$ be ordered according to the partial order $x_i \prec x_j$ if there exists $F^{q}_{k}$ such that $x_i \in F^{q}_{k}$ but $x_j \notin F^{q}_{k}$. If $q = q_1 q_2$ such that $x \in \mathrm{var}(q_1)$ and $\mathrm{var}(q_1) \cap \mathrm{var}(q_2) = \emptyset$, then $P_1$ satisfies the claim and $P_1 \Join P_2$ is a safe plan. Otherwise, let $P_1$ be a safe plan for $q[x_1 \to a]$ for some fresh constant $a$. Since this query has fewer variables, $P_1$ satisfies the claim, and $\pi_{-x_1}P_1$ is safe immediately from the definition. □

We now define a set of rewrite rules ⇒ which transform the skeleton and preserve hardness. We use these rewrite rules to show the following lemma:

Lemma B.2.6. Let $Q$ be a HAVING query using COUNT(DISTINCT) such that $q = \mathrm{sk}(Q)$ is safe but $Q$ is not COUNT(DISTINCT)-safe, and suppose there is some $g$ such that $y \notin \mathrm{key}(g)$ and $y \notin F^{q}_{\infty}$; then $Q$ has #P-hard data complexity.

For notational convenience, we shall work with the skeleton of a HAVING query $Q[\alpha(y)\ \theta\ k]$ and assume that $y$ is a distinguished variable.

1) $q \Rightarrow q[z \to c]$ if $z \in F^{q}_{\infty}$
2) $q \Rightarrow q_1$ if $q = q_1 q_2$ and $\mathrm{var}(q_1) \cap \mathrm{var}(q_2) = \emptyset$ and $y \in \mathrm{var}(q_1)$
3) $q \Rightarrow q[z \to x]$ if $x, z \in \mathrm{key}(g)$ and $z \ne y$
4) $q, g \Rightarrow q, g'$ if $\mathrm{key}(g) = \mathrm{key}(g')$, $\mathrm{var}(g) = \mathrm{var}(g')$, and $\mathrm{arity}(g') < \mathrm{arity}(g)$
5) $q, g \Rightarrow q$ if $\mathrm{key}(g) = \mathrm{var}(g)$

We let $q \Rightarrow^{*} q'$ denote that $q'$ is the result of any finite sequence of rewrite rules applied to $q$.
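As a small illustration of how the substitution-style rules act on a skeleton (a sketch with an assumed representation of subgoals, not the dissertation's machinery):

```python
# Illustrative helper; subgoals are (relation_name, key_variables, all_variables).
# The representation and names are assumptions made for this sketch.

def substitute(goals, old, new):
    """Apply the rewriting q[old -> new] to every subgoal."""
    def ren(vs):
        return tuple(new if v == old else v for v in vs)
    return [(rel, ren(key), ren(vars_)) for rel, key, vars_ in goals]

# Rule 3 with x, z in key(g) and z != y: unify z with x.
q = [("R", ("x", "z"), ("x", "z", "y"))]
print(substitute(q, "z", "x"))  # [('R', ('x', 'x'), ('x', 'x', 'y'))]
```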

Proposition B.2.7. If $q \Rightarrow^{*} q'$ and $q'$ has #P-hard data complexity, then so does $q$.

Proof. For rule 1, we can simply restrict to instances where $z \to c$. For rule 2, if $q_1$ is hard then $q$ is hard, because we can fill out each relation in $q_2$ with a single tuple and use $q$ to answer $q_1$. Similarly, for rule 3 we can consider instances where $z = x$, so $q$ will answer $q[z \to x]$. For rule 4, we apply the obvious mapping on instances (to the new subgoal). For rule 5, we fill out $g$ with tuples of probability 1 and use $q$ to answer the rewritten query. □

Proof of Lemma B.2.6. By Prop. B.2.3, there is some $x$ such that $x \in \mathrm{key}(g)$ for every $g \in \mathrm{goal}(Q)$. Let $q = \mathrm{sk}(Q)$; we apply rules 1 and 2 to a fixed point, which removes any products. We then apply rule 3 as $q[z \to x]$ for all $z \ne y$. Thus, all subgoals have two variables, $x$ and $y$. We then apply rule 4 to a fixed point and finally rule 5 to a fixed point. It is easy to see that all remaining subgoals are of the form $R(x; y)$, which is the hard pattern. Further, it is easy to see that $g \Rightarrow^{*} R_i(x; y)$ for some $i$. □

We can now prove the main result:

Lemma B.2.8. If $Q$ is a HAVING query without self joins and $Q$ is not COUNT(DISTINCT)-safe, then the evaluation problem for $Q$ is #P-hard.

Proof. If $q$ is unsafe, then $Q$ has #P-hard data complexity. Thus, we may assume that $q$ is safe but $Q$ is not COUNT(DISTINCT)-safe. If $Q$ contains $g \in \mathrm{goal}(Q)$ such that $y \in \mathrm{var}(g)$ but $y \notin \mathrm{key}(g)$, then $Q$ has #P-hard data complexity by Lemma B.2.6. Thus, we may assume that $y$ appears only in key positions. First apply rewrite rule 2 to remove any products, so we may assume $Q$ is connected. If $Q$ is connected and $y \in \mathrm{key}(g)$ for every $g$, then $Q$ is COUNT(DISTINCT)-safe. Thus, there are at least two subgoals, and one contains a variable $x$ distinct from $y$; call them $g$ and $g'$, respectively. Apply rewrite rule 3 as $q[z \to x]$ for each $z \in \mathrm{var}(q) - \{x, y\}$. Using rules 4 and 5, we can then drop all subgoals but $g, g'$ to obtain the pattern $R(x), S(x, y)$, which is hard. □

B.3 Appendix: Full Proofs for SUM and AVG

B.3.1 AVG hardness

Definition B.3.1. Given a set of nonnegative integers $a_1, \ldots, a_n$, the #NONNEGATIVE SUBSET-AVG problem is to count the number of non-empty subsets $S \subseteq \{1, \ldots, n\}$ such that $\sum_{s \in S} a_s |S|^{-1} = B$ for some fixed integer $B$.

Proposition B.3.2. #NONNEGATIVE SUBSET-AVG is #P-hard.

Proof. We first observe that if we allow arbitrary integers, then we can reduce from #SUBSET-SUM with $B = 0$, which is #P-hard, since the sum of a set is 0 if and only if its average is 0. Thus, we reduce from this unrestricted version of the problem. Let $B = -\min_i a_i$; we set $a'_i = a_i + B$, so that all values are nonnegative, and we ask whether the average is $B$. For any set $S$ we have:

$$\sum_{s \in S} a'_s |S|^{-1} = \sum_{s \in S} (a_s + B) |S|^{-1} = |S|^{-1} \sum_{s \in S} a_s + B$$

Thus, it is clear that the average equals $B$ exactly when $\sum_{s \in S} a_s = 0$. □
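A quick brute-force check of the shift argument on a made-up instance (illustrative only; the example values are arbitrary and not from the dissertation):

```python
# Verify on a small example that, after shifting every element by B = -min(a),
# a subset has average B exactly when its original sum is 0.

from itertools import combinations

a = [-3, 1, 2, -4, 4]
B = -min(a)
shifted = [x + B for x in a]

for r in range(1, len(a) + 1):
    for idx in combinations(range(len(a)), r):
        sum_zero = sum(a[i] for i in idx) == 0
        avg_is_B = sum(shifted[i] for i in idx) / len(idx) == B
        assert sum_zero == avg_is_B
print("shift preserves the correspondence on this example")
```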

B.3.2 Proof of Theorem 4.4.21

It is sufficient to show the following lemma:

Lemma B.3.3. Let $q = \mathrm{sk}(Q)$. If $q$ is safe but $Q$ is not SUM-safe, then there is an instance $I$ such that, for any set of values $y_1, \ldots, y_n$, letting $q_i = q[y \to y_i]$, for every $S \subseteq \{1, \ldots, n\}$ we have $\mu(\bigwedge_{s \in S} q_s) = \prod_{s \in S} \mu(q_s) = 2^{-|S|}$. Further, on any world $W$ and for any $q_i$, there is a single valuation $v$ for $q_i$ such that $\mathrm{im}(v) \subseteq W$.

Armed with this lemma, we can always construct the distribution used in Prop. 4.4.20.

Proof. We observe that $y \notin F^{q}_{\infty}$, else there would be a SUM- and AVG-safe plan by Prop. B.2.5. Now consider the rewriting $q[x \to a]$ for any $x \in F^{q}_{\infty}$ and $q[x \to y]$ if $x \notin F^{q}_{\infty}$. Thus, in any subgoal, $\mathrm{var}(g) \subseteq \{y\}$. Pick one such subgoal and add each $y_i$ value with probability $\frac{1}{2}$ independently. Notice that every relation either contains $y_i$ in each tuple or the constant $a$. Since there are no self joins, in any valuation either a tuple containing $y_i$ must be used or the relation contains a single tuple with $a$ for every attribute. Hence, the multiplicity of $y_i$ is at most 1 in any possible world. Since there is only one relation with probabilistic tuples and all tuples have $\mu(t) = 0.5$, we have $\mu(\bigwedge_{s \in S} q_s) = 2^{-|S|}$ as required. □

Proposition B.3.4. If $Q[\mathrm{SUM}(y) = k]$ is not SUM-safe, then, even on a tuple-independent instance, $Q$ does not have an FPTRAS.

Proof. We observe that $(\mathrm{SUM}, =)$ is hard to approximate even on a single tuple-independent instance as a consequence of the previous reduction, which gives a one-to-one reduction showing that $(\mathrm{SUM}, =)$ is as hard as #SUBSET-SUM, an NP-hard problem, and so has no FPTRAS. □

B.4 Convergence Proof of Lemma 4.6.8

In the proof of this lemma, we need a technical proposition:

Proposition B.4.1. Let $q$ be a conjunctive query without self joins and $R$ any relation contained in $\mathrm{goal}(q)$; then $q(W, \tau) = \sum_{t \in R} q((W - R) \cup \{t\}, \tau)$. Here, the summation is in the semiring $S$.

Proof. By definition, the value of the query can be written as $q(W, \tau) = \sum_{v \in V} \prod_{g \in \mathrm{goal}(q)} \tau(v(g))$. Since $q$ does not contain self joins, each valuation contains exactly one member of $R$. Hence, there is a bijection between the two sums. Since the semiring operations are associative, this completes the proof. □
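A tiny numeric sanity check of this decomposition (an illustrative sketch over the rationals with made-up annotations; the query and values are assumptions, not part of the proof):

```python
# Check q(W, tau) = sum over t in R of q((W - R) + {t}, tau)
# for the query q() :- R(x), S(x, y) with rational-valued annotations tau.

from itertools import product

def q_value(R, S):
    """Sum over valuations of the product of tuple annotations (semiring = Q)."""
    total = 0.0
    for (x, tr), ((x2, y), ts) in product(R.items(), S.items()):
        if x == x2:
            total += tr * ts
    return total

R = {1: 0.5, 2: 0.25}                                # tau(R(x))
S = {(1, 'a'): 2.0, (1, 'b'): 1.0, (2, 'a'): 4.0}    # tau(S(x, y))

lhs = q_value(R, S)
rhs = sum(q_value({x: tr}, S) for x, tr in R.items())  # one R-tuple at a time
assert abs(lhs - rhs) < 1e-9
print(lhs, rhs)
```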

We can now prove Lemma 4.6.8.

Proof of Lemma 4.6.8. We first observe that $\mathcal{W}^{O} \subseteq \mathcal{W}^{R}$, by Lemma 4.6.6, which shows $\frac{\mu(\mathcal{W}^{O})}{\mu(\mathcal{W}^{R})} \le 1$. To see the other inequality, we construct a function $f : \mathcal{W}^{R} \to \mathcal{W}^{O}$ such that for any $W \in \mathcal{W}^{O}$, $\frac{\mu(W)}{\mu(f^{-1}(W))} \ge (n + 1)^{-1}\beta^{-1}$. This is sufficient to prove the claim. We describe $f$: if $W \in \mathcal{W}^{O}$ then $f(W) = W$; else, $W \in \mathcal{W}^{R} - \mathcal{W}^{O}$, and we show that there is a tuple $t \in W$ such that $W - \{t\} \in \mathcal{W}^{O}$ and set $f(W) = W - \{t\}$.

Since there are at most $n$ possible tuples to remove, this shows that $|f^{-1}(W)| \le n + 1$. Using the bounded odds equation, we have that $\frac{\mu(W)}{\mu(f^{-1}(W))} \ge (n + 1)^{-1}\beta^{-1}$. Thus, all that remains to be shown is that we can always find such a tuple $t$. Consider $W \in \mathcal{W}^{R} - \mathcal{W}^{O}$, which means that $q(W, \tau^{O}) > k$ and $q(W, \tau^{R}) \le n^{2}$. There must exist a tuple $t$ such that $q(W, \tau^{O}) - q(W - \{t\}, \tau^{O}) > k/n$; otherwise $q(W, \tau^{O}) \le k$, which is a contradiction. To see this, consider any relation $R$ in the query; we apply the above proposition to observe that:

$$\sum_{t \in R} q(W - \{t\}, \tau^{O}) = \sum_{t \in R} \sum_{s \in R : s \ne t} q((W - R) \cup \{s\}, \tau^{O}) = (|R| - 1) \sum_{t \in R} q((W - R) \cup \{t\}, \tau^{O}) = (|R| - 1)\, q(W, \tau^{O})$$

The second-to-last equality follows by counting how many times each term appears in the summation and using that the semiring is embeddable in the rational numbers ($\mathbb{Q}$).

$$q(W, \tau^{O}) - q(W - \{t\}, \tau^{O}) \le k/n \quad \text{for all } t \in R$$
$$\implies |R|\, q(W, \tau^{O}) - \sum_{t \in R} q(W - \{t\}, \tau^{O}) \le k$$
$$\implies |R|\, q(W, \tau^{O}) - (|R| - 1)\, q(W, \tau^{O}) = q(W, \tau^{O}) \le k$$

The second line follows by summing over $t$ in $R$, using the previous equation, and using that $|R| \le n$. Thus, we can conclude there is some $t$ such that $q(W, \tau^{O}) - q(W - \{t\}, \tau^{O}) > k/n$. By Lemma 4.6.6 we have:

$$q(W, \tau^{R}) \le n^{2} \implies \frac{n^{2}}{k}\, q(W, \tau^{O}) - \delta \le n^{2}$$

where $\delta \in [0, n)$. In turn, this implies

$$q(W, \tau^{O}) \le \frac{k}{n^{2}}\,\delta + k \le k + \frac{k}{n}$$

Since $q(W, \tau^{O}) - q(W - \{t\}, \tau^{O}) > k/n$, we have that $q(W - \{t\}, \tau^{O}) \le k$, and so $W - \{t\} \models Q^{O}$; hence, $W - \{t\} \in \mathcal{W}^{O}$. □

VITA

Christopher Ré was born in Los Angeles, California and raised in nearby Pasadena, California. He received a B.A. with a double major in Mathematics and Computer Science from Cornell University in 2001, an M.Eng. in Computer Science in 2003 from , and an M.S. in Computer Science in 2005 from the University of Washington. Starting in August of 2009, he will be an assistant professor at the University of Wisconsin–Madison.