University of Kentucky UKnowledge

University of Kentucky Doctoral Dissertations Graduate School

2004

Probabilistic Databases and Their Applications

Wenzhong Zhao University of Kentucky, [email protected]


Recommended Citation
Zhao, Wenzhong, "Probabilistic Databases and Their Applications" (2004). University of Kentucky Doctoral Dissertations. 326. https://uknowledge.uky.edu/gradschool_diss/326

ABSTRACT OF DISSERTATION

Wenzhong Zhao

The Graduate School
University of Kentucky
2004

Probabilistic Databases and Their Applications

ABSTRACT OF DISSERTATION

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the College of Engineering at the University of Kentucky

By
Wenzhong Zhao

Lexington, Kentucky

Co-Directors: Dr. Alexander Dekhtyar, Assistant Professor of Computer Science
and: Dr. Judy Goldsmith, Associate Professor of Computer Science

Lexington, Kentucky
2004

ABSTRACT OF DISSERTATION

Probabilistic Databases and Their Applications

Probabilistic reasoning in databases has been an active area of research during the last two decades. However, the previously proposed approaches, including the probabilistic relational approach and the probabilistic object approach, are not good fits for storing and managing diverse probability distributions along with their auxiliary information. The work in this dissertation significantly extends the initial semistructured probabilistic database framework proposed by Dekhtyar, Goldsmith and Hawkes in [20]. We extend the formal Semistructured Probabilistic Object (SPO) data model of [20] and, accordingly, the Semistructured Probabilistic Algebra (SP-algebra), the query algebra proposed for the SPO model. Based on the extended framework, we have designed and implemented a Semistructured Probabilistic Database Management System (SPDBMS) on top of a relational DBMS. The SPDBMS is flexible enough to meet the need of storing and manipulating diverse probability distributions along with their associated information. It supports standard database queries as well as queries specific to probabilities, such as conditionalization and marginalization. Currently the SPDBMS serves as the storage backbone for the project Decision Making and Planning under Uncertainty with Constraints, which involves managing large quantities of probabilistic information. (This project is partially supported by the National Science Foundation under Grant No. ITR-0325063.) We also report experimental results evaluating the performance of the SPDBMS.

We describe an extension of the SPO model for handling interval probability distributions. The Extended Semistructured Probabilistic Object (ESPO) framework improves the flexibility of the original semistructured data model in two important ways: (i) support for interval probabilities and (ii) association of context and conditionals with individual random variables. An extended SPO (ESPO) data model has been developed, and an extended query algebra for ESPO has also been introduced to manipulate interval probability distributions.

The Bayesian Network Development Suite (BaNDeS), a system which builds Bayesian networks with the full data management support of the SPDBMS, is also described. It allows experts with particular expertise to work on specific subsystems independently and asynchronously during the Bayesian network construction process, while updating the model in real time.

There are three major foci of our ongoing and future work: (1) implementation and performance evaluation of a query optimizer, (2) extension of the SPDBMS to handle interval probability distributions, and (3) incorporation of machine learning techniques into the BaNDeS.

KEYWORDS: Probabilistic Databases, Uncertainty Modeling, Probability Intervals, Semistructured Data, Query Algebra

Wenzhong Zhao

June 22, 2004

Probabilistic Databases and Their Applications

By

Wenzhong Zhao

Dr. Alexander Dekhtyar (Co-Director of Dissertation)

Dr. Judy Goldsmith (Co-Director of Dissertation)

Dr. Grzegorz W. Wasilkowski (Director of Graduate Studies)

June 22, 2004

RULES FOR THE USE OF DISSERTATIONS

Unpublished dissertations submitted for the Doctor’s degree and deposited in the University of Kentucky Library are as a rule open for inspection, but are to be used only with due regard to the rights of the authors. Bibliographical references may be noted, but quotations or summaries of parts may be published only with the permission of the author, and with the usual scholarly acknowledgments. Extensive copying or publication of the dissertation in whole or in part requires also the consent of the Dean of The Graduate School of the University of Kentucky. A library which borrows this dissertation for use by its patrons is expected to secure the signature of each user.

Name                                        Date

DISSERTATION

Wenzhong Zhao

The Graduate School
University of Kentucky
2004

Probabilistic Databases and Their Applications

DISSERTATION

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the College of Engineering at the University of Kentucky

By
Wenzhong Zhao

Lexington, Kentucky

Co-Directors: Dr. Alexander Dekhtyar, Assistant Professor of Computer Science
and: Dr. Judy Goldsmith, Associate Professor of Computer Science

Lexington, Kentucky
2004

Acknowledgments

There are so many friends and colleagues who deserve to be acknowledged that I could not include them all in this short acknowledgement. I would like to thank everyone who has given support and advice during the writing of this dissertation. Special thanks are due to my advisors, Dr. Alexander Dekhtyar and Dr. Judy Goldsmith, for their guidance and insight throughout the research. Their encouragement, support and friendship have been the most critical factor in my enjoyable stay at the University of Kentucky. I cannot overstate the positive influence they have had on my education and career. Thanks go to the other committee members, Dr. Jane Hayes, Dr. Arnold Stromberg, and Dr. Anita Lee-Post, for their suggestions and comments.

I wish to acknowledge my gratitude to the fellow students in our research group, Erik Jessup, Jiangyu Li, Derek Williams, Krol Kevin Mathias, and Ionut Emil Iacob, for their time, effort, cooperation, and friendship. In addition, I wish to offer special thanks to a good friend, Lengning Liu, for his technical assistance.

Lastly, I would like to express my sincere appreciation to my family. My wife, Huifang Qin, has given me selfless support during the completion of this dissertation. Without her, I would never have had the strength or desire to finish it. Together we have accomplished the impossible. Thank you for your love and friendship. To my lovely son, Mengzhe: thank you for understanding my frequent absences. I would also like to thank my parents for their unconditional love and understanding when we moved so far from home. Their support has been invaluable to my success.

Table of Contents

Acknowledgments ...... iii
List of Tables ...... vi
List of Figures ...... vii
List of Files ...... x

Chapter 1 Introduction: Probabilistic Reasoning in Databases ...... 1
1.1 Motivating Examples ...... 3
1.1.1 Auto Insurance Risk Analysis ...... 3
1.1.2 Academic Advising ...... 5
1.1.3 Welfare to Work ...... 6
1.1.4 Political Elections ...... 7
1.2 General Framework ...... 9
1.3 Towards Semistructured Probabilistic Databases ...... 11
1.4 Overview of Dissertation ...... 13

Chapter 2 Background ...... 15
2.1 Probability Distributions ...... 15
2.2 Interval Probabilities ...... 17
2.3 Bayesian Networks ...... 24
2.4 Query Processing in RDBMS and in Mediators ...... 26

Chapter 3 Semistructured Probabilistic Database Framework ...... 30
3.1 Overview of the SPO framework ...... 30
3.2 Semistructured Probabilistic Object (SPO) Data Model ...... 31
3.3 Semistructured Probabilistic Algebra ...... 35
3.3.1 Set Operations ...... 36
3.3.2 Selection ...... 37
3.3.3 Projection ...... 43
3.3.4 Conditionalization ...... 45
3.3.5 Cartesian Product ...... 47
3.3.6 Join ...... 48
3.4 Semantics of SP-Algebra Operations ...... 51
3.5 Query Optimization ...... 56
3.5.1 Rewrite Rules ...... 56
3.5.2 Cost Model ...... 71

Chapter 4 Semistructured Probabilistic DBMS ...... 75
4.1 Representation of SPOs ...... 76
4.2 Architecture of SPDBMS ...... 78
4.3 Data Storage for SPOs ...... 78
4.3.1 Mapping SPOs to Relational Tables ...... 79
4.3.2 Algorithms for Populating SP-Relations ...... 80
4.4 Querying the SPDBMS ...... 81
4.4.1 Query Language (SPO-QL) ...... 81
4.4.2 Query Parsing ...... 82
4.4.3 SP-Algebra Query Translation ...... 83
4.4.4 Internal Object Model (IOM) ...... 85
4.5 Experiments ...... 85
4.5.1 System Environment ...... 86
4.5.2 Experimental Results ...... 89
4.6 Summary of the SPDBMS ...... 93

Chapter 5 Extended Semistructured Probabilistic Database Framework ...... 95
5.1 Overview of the ESPO framework ...... 95
5.2 Extended Semistructured Probabilistic Object (ESPO) Data Model ...... 96
5.3 Extended Probabilistic Semistructured Algebra ...... 99
5.3.1 Selection ...... 100
5.3.2 Projection ...... 111
5.3.3 Conditionalization ...... 117
5.3.4 Cartesian Product and Join ...... 121

Chapter 6 Building Bayesian Networks with SPDBMS ...... 127
6.1 Introduction ...... 127
6.2 Bayesian Network Construction ...... 129
6.2.1 The Role of SPDBMS ...... 130
6.2.2 Network Topology Construction ...... 133
6.2.3 Probability Elicitation ...... 134
6.2.4 Browse Bayesian Network Information ...... 135
6.3 Bayesian Network Revision ...... 137
6.3.1 Update Network Structures ...... 137
6.3.2 Update Probability Distributions ...... 138
6.3.3 Bayesian Network Consistency ...... 139
6.4 Conclusions and Ongoing Work ...... 141

Chapter 7 Related Work ...... 142
7.1 Probabilistic Databases ...... 142
7.2 Interval Probabilities ...... 151
7.3 XML Storage Techniques ...... 153

Chapter 8 Conclusions and Future Work ...... 160

Appendix A Conversion from SP-Algebra into SQLs ...... 162
Appendix B XML Schema for SPOs ...... 171
References ...... 173
Vita ...... 179

List of Tables

Table 1.1 Representation of domain letter for random variables...... 5

Table 4.1 File size and number of tuples in Oracle database for typical data sets...... 90 Table 4.2 Time distribution between Oracle and postprocessing for data set of 10,000 SPOs with 3 participating variables and domain size of 2...... 90

Table 5.1 Selection queries to ESPOs...... 103 Table 5.2 Different types of conditions for selection queries...... 103

List of Figures

Figure 1.1 Typical types of probabilistic information for risk analysis applications . . 4 Figure 1.2 Sunny Hill pre-election polls ...... 8 Figure 1.3 A general framework for decision making and planning under uncertainty. . . 11

Figure 2.1 Marginalization and conditioning operations ...... 16 Figure 2.2 Tightness of interval probability distributions...... 20 Figure 2.3 A methodology for performing conditioning (or marginalization) operation on an interval probability distribution ...... 24 Figure 2.4 A sample Bayesian network for Welfare to Work application domain . . . . 26 Figure 2.5 A typical CPT for random variable Communication Skills with different lit- eracy levels...... 26 Figure 2.6 Major components of query processor in relational DBMS and in mediator. . 27

Figure 3.1 Joint Probability Distribution of risk on Driver’s Age and License Years for Asian drivers in Lexington...... 34 Figure 3.2 Selection on Probabilistic Table and on Probability values in SP-Algebra . . 40 Figure 3.3 Projection on Probabilistic Table and on Probability values in SP-Algebra . . 45 Figure 3.4 Conditionalization in SP-Algebra...... 47 Figure 3.5 Join operations in SP-Algebra ...... 50 Figure 3.6 Selection on Probabilistic Table and on Probability values in SP-Algebra . . 52

Figure 4.1 XML DTD template for SPOs...... 77 Figure 4.2 A typical SPO object for risk level on Driver Age and License Years, and its XML representation...... 78 Figure 4.3 The overall architecture of SPDBMS...... 79 Figure 4.4 A set of relational schema for storing SPOs ...... 80 Figure 4.5 Pseudo-code for algorithm CREATE...... 81 Figure 4.6 Pseudo-code for algorithm INSERT...... 82 Figure 4.7 Internal representation for the SPO defined in Figure 3.1...... 83 Figure 4.8 The grammar of the query language for SPDBMS...... 84 Figure 4.9 The query parse tree for the sample SP-Algebra query...... 85

Figure 4.10 The SPDBMS query engine ...... 86
Figure 4.11 Steps to evaluate the selection query ...... 87
Figure 4.12 Steps to evaluate the conditionalization query ...... 88
Figure 4.13 Postprocessing algorithm for the conditionalization query ...... 88
Figure 4.14 An internal object model for SP-… ...... 89
Figure 4.15 Effect of number of SPOs in an SP-relation ...... 91
Figure 4.16 Effect of domain size of participating random variables ...... 92
Figure 4.17 Effect of number of variables on conditionalization operation ...... 93
Figure 4.18 Effect of number of SPOs and selectivity on selection query on probability ...... 93
Figure 4.19 Effect of number of SPOs and selectivity on Cartesian product ...... 93
Figure 4.20 Effect of number of SPOs and selectivity on join operation ...... 93

Figure 5.1 Simple vs. extended context and conditionals in SPOs ...... 97
Figure 5.2 Extended Semistructured Probabilistic Object ...... 99
Figure 5.3 Sunny Hill pre-election polls in ESPO format ...... 101
Figure 5.4 Selection on probability table and probabilities ...... 107
Figure 5.5 Probabilistic selection on lower- and upper-bound conditions ...... 109
Figure 5.6 Projections on context ...... 112
Figure 5.7 Projections on conditionals ...... 113
Figure 5.8 Projection on the participating random variables ...... 117
Figure 5.9 Conditionalization operation ...... 120
Figure 5.10 Join operation (left join) in ESP-Algebra ...... 125

Figure 6.1 The overall system architecture...... 130 Figure 6.2 A typical eCPT for the random variable Communication Skills, and its repre- sentation as a collection of SPOs...... 132 Figure 6.3 Screenshot of the Bayesian network builder...... 133 Figure 6.4 A sample XML file which represents the W2W Bayesian network topology. . 134 Figure 6.5 Screenshot of the current probability elicitation toolkit...... 135 Figure 6.6 A sample XML file communication skills.xml which represents the CPT for variable Communication Skills as SPOs in the W2W Bayesian network. . . . 136 Figure 6.7 Add or remove a node...... 138 Figure 6.8 Add or remove a dependency...... 139 Figure 6.9 Algorithms for maintaining Bayesian network consistency ...... 140

Figure 7.1 An example probabilistic relation ...... 144
Figure 7.2 An example of the STOCHASTIC operator ...... 144
Figure 7.3 An example probabilistic relation ...... 145
Figure 7.4 Possible worlds for CS major ...... 146
Figure 7.5 A class hierarchy for classifying undergraduate students ...... 147
Figure 7.6 Definitions for the two special elements Dist and Val ...... 148
Figure 7.7 An example of transforming an SPO into a ProTDB instance ...... 148
Figure 7.8 An example of transforming an ESPO into a PIXml instance ...... 150
Figure 7.9 An example of transforming a “unit of information” of a PIXml into an ESPO ...... 150
Figure 7.10 Lower bounds do not uniquely define upper bounds for tight interval probability distribution functions ...... 153
Figure 7.11 A sample SPO, with ids added for non-leaf nodes ...... 155
Figure 7.12 A sample edge table ...... 156
Figure 7.13 Sample attribute tables, without the single-attribute tables for elements <spo>, <path>, <context>, <variable>, <table>, and <conditional> ...... 158
Figure 7.14 The SQL statement for finding all SPO SIDs in the edge approach ...... 159
Figure 7.15 The SQL statement for finding all SPO SIDs in the attribute approach ...... 159



Figure A.1 Steps for selection on context ...... 162
Figure A.2 Steps for selection on conditional ...... 163
Figure A.3 Steps for selection on participating variable ...... 164
Figure A.4 Steps for selection on probability table ...... 165
Figure A.5 Steps for selection on context ...... 166
Figure A.6 Steps for projection on context ...... 167
Figure A.7 Steps for projection on conditional ...... 168
Figure A.8 Steps for projection on participating variable ...... 169
Figure A.9 Steps for selection on context ...... 170
Figure B.1 XML schema template for SPOs ...... 171
Figure B.2 XML schema template for SPOs (cont.) ...... 172

List of Files

wzhao0.pdf 1.21 MB

Chapter 1

Introduction: Probabilistic Reasoning in Databases

Uncertain information occurs in many vital applications, such as multimedia databases for storing the results of image recognition, logistics databases, stock market prediction software, and applications of Bayesian networks [70]. Different techniques have been proposed to represent and handle uncertain information. In the context of databases, however, research has focused on two major approaches to dealing with uncertainty. One uses Zadeh's fuzzy sets and possibility theory [86] to build a framework of fuzzy databases, and the other adopts probability theory to establish a framework of probabilistic databases. The former has been studied extensively over the last three decades. It extends the classical relational model to represent and manipulate uncertain information with imprecise data values, and fuzzy set theory and fuzzy logic [86] provide a foundation for handling data with such extended values. In this dissertation we focus on the latter: probabilistic databases. We review related approaches in Section 7.1.

Storing and managing probabilistic information has been an active area of research over the last two decades. Relational [5, 14, 25, 58] and object [28, 56] data models have been proposed to support storage and querying of probabilistic information. Unfortunately, these approaches are not sufficiently flexible to handle the different contexts in which probabilities must be dealt with when analyzing a stochastic system. For instance, consider auto insurance risk analysis, where the risk of financial loss when offering a driver an insurance policy may be represented in a variety of forms: a simple probability distribution for one aspect, a joint probability distribution for several aspects, or a simple or joint conditional probability distribution (the risk level may depend on earlier driving record).

Information in different formats would require separate storage in any of the current probabilistic relational models, making even simple queries hard, if not impossible, to express. For example, for the query "Find all probability distributions that involve the aspect Driver's Age", the system has to query all the relations that have Driver's Age as a field. Note that this may require users to know in advance the names of the tables that have this field, and it may result in thousands of separate queries, one for each table. We propose a semistructured probabilistic data model to alleviate this problem.

In this work, we propose an innovative approach to managing probabilistic information by treating probability tables as the primary objects in the database. In order to make such data usable, we store significant auxiliary information along with the probability tables. The initial

Semistructured Probabilistic Object (SPO) data model and algebra were introduced by Dekhtyar, Goldsmith and Hawkes [20]. We use this initial framework as a starting point and continue developing new features for it [89, 88, 44, 87]. Our databases are flexible enough to handle diverse "shapes" of probability distributions over arbitrary numbers of discrete random variables. Our query algebra allows the user to retrieve data based on probabilities, variables, or associated information, and to transform the probability distributions according to the laws of probability theory. The major contributions of this dissertation are:

We extended and revised the initial SPO data model and algebra of Dekhtyar, Goldsmith and Hawkes [20]. A number of features, key to the SPO framework, were modified in the current model. We added a path expression to the SPO data model to specify the origin of a particular object, and revised the algebra accordingly to reflect this change. A new join operation, which complies with the laws of probability theory, has been defined to replace the one in the initial framework. We developed a set of equivalence relations and cost models for a query optimization framework, and we proved all the theorems in the framework.

Based on the new SPO framework, we designed and implemented a Semistructured Probabilistic Database Management System (SPDBMS) on top of a relational DBMS. We developed an XML-based implementation of the SPO data model with a meta markup language, SPO-ML. A relational schema generation strategy was designed for efficiently mapping XML representations of SPOs into relational tables. A query translation mechanism that automatically generates a sequence of SQL queries for evaluating SP-Algebra queries has been developed. Extensive experiments evaluating our database implementation were conducted.

We further extended the standard SPO model to incorporate two important features: (i) support for interval probabilities and (ii) association of context and conditionals with individual random variables. An extended SPO (ESPO) data model has been developed, and an extended query algebra for ESPO has also been introduced to manipulate interval probability distributions. We also added proofs for several algebra operations.

We designed and implemented a Bayesian Network Development Suite (BaNDeS) for building Bayesian networks with the full support of the SPDBMS. We split the entire Bayesian network building system into subsystems, each of which performs a specific subtask, making the Bayesian network building process distributed and asynchronous. This allows experts with particular expertise to work on specific subsystems independently and asynchronously during the model construction process while updating the model in real time.

The SPO framework described above can be applied to any application involving large quantities of probabilistic information. In the following sections, we give several examples where the database framework can be used.

1.1 Motivating Examples

In this section we describe four motivating examples, which we will use in explaining our database frameworks and the Bayesian Network Development Suite (BaNDeS).

1.1.1 Auto Insurance Risk Analysis

In order to get car insurance, one must first fill out a complex form, giving information on driving history, insurance history, and a variety of personal matters. Based on this data, the insurer sets a policy premium for the available policies. It is a goal of every insurance company to prevent major financial losses and maximize annual profits. As described by Russell and Norvig [74], a 1% improvement in risk assessment brings over a billion dollars annually for a typical insurance company. One way to lure customers is to lower prices: most insurance companies try to set the insurance premium for each policy holder as low as possible without giving up their profit, and as a result they attract more customers. How can an insurer increase the likelihood of a reasonable profit? Insurance companies could try to improve their risk assessment analysis, for instance by constructing a Bayesian network [83] which allows them to decide what the financial risk level is for each policy holder.

Statistical information about the association between financial risk and driver personal information, driving skills and vehicle information can be obtained from a database of previous claims maintained by the company. We may assume that when this information correctly reflects or approximates the true probabilities, it can help provide better estimates for policy premiums. However, the statistical information needs to be updated periodically so that it accurately reflects the current probabilities.

Consider a database designed to assist insurance companies with the risk assessment process. Various types of probabilistic information may be available to an insurance company, and that information needs to be stored in the database. The simplest type of probabilistic information is a probability distribution of financial risk for one aspect. The company may need the probability distribution of risk for Driver's Age, or the probability distribution over different approximate values of risk given the number of years the driver has had a license.

Other types of probabilistic information, for instance joint probability distributions, may also

be useful. One might want to know the risk level of a customer who has a college degree and a brand new passenger car. In this case, the company needs the probability distribution of risk over both Education Level and Vehicle Year. In other situations, the risk level can also depend on the driver's past Driving Record or other observed aspects. A Driving Record value of "medium accident" may suggest to the company that the policy holder belongs to a higher risk group, while a Safety Equipment value of "yes" might suggest that the policy holder belongs to a lower risk group. Other information that can affect the probability distribution may include: where the policy holder lives (city, state, rural/urban); the policy holder's background, such as race and employment type; vehicle information, such as personal/business vehicle; etc.

The typical types of probabilistic information to be stored in a database for risk assessment support are shown in Figure 1.1. Note that the domains of the variables are finite, as shown in Table 1.1. The first probability distribution represents a simple probability distribution over the random variable Driver's Age (DA). The second represents a joint probability distribution over the variables Driver's Age (DA) and License Years (LY). The third is a joint probability distribution with additional auxiliary information, i.e., the driver's job type and location. The last object represents a joint probability distribution for those drivers who had a medium to severe accident within the last 3 years.

[Figure 1.1 shows four sample probability tables over DA and LY, with context information such as city, race, and job type, and a conditional on Driving Record (DR in {A, B}).]

Figure 1.1: Typical types of probabilistic information for risk analysis applications

With all this information in hand, insurance agents may ask themselves a variety of different questions when issuing an insurance policy to a driver, including:

Table 1.1: Representation of domain letters for random variables.

Domain Letter           A             B            C                F
Driver's Age (DA)       <=20          21-35        36-55            >=56
Education Level (EL)    high school   college      advanced degree  none of the above
Driver Gender (DG)      male          female
Marital Status (MS)     single        married
License Years (LY)      <1            1-3          3-10             >10
Driving Record (DR)     severe        medium       minor            none
Vehicle Type (VT)       heavy truck   light truck  passenger car    motorcycle
Vehicle Make (VM)       Toyota        BMW          Ford             GM
Vehicle Age (VA)        <1            1-5          5-10             >10
Safety Equipment (SE)   yes           no

“What would the likelihood of risk level be if we issue an insurance policy to an Asian driver?”

“What will the probability of financial loss be if we issue an insurance policy to a young driver?”

“How likely are drivers who have had severe accidents to cause new financial losses?”

“What information is available about the risk of granting insurance policy to a driver with less than one year of driving experience?”

“What outcomes have probability over 0.1?”

While finding answers to those questions from the data is a matter of statistical analysis, once the results are obtained, they need to be stored somewhere for reuse. These results may also be integrated with results from other analyses. This efficient management of diverse probability distributions is exactly what this dissertation aims at. We propose a database framework specially designed for storing and managing large quantities of probabilistic information.

1.1.2 Academic Advising

This example comes from our work on modeling academic advising as a process of uncertain inference [21]. Consider a database, designed to assist faculty members with the academic advising process. Typically, an advisor sees each advisee once every semester to suggest the set of courses to take next semester. While individual advising styles of the faculty differ, many advisors try to suggest courses that fulfill degree requirements and for which the student has the highest chance (probability) of success.

Statistical information about student performance in different classes can be extracted from the student transcript database maintained by every university. This information, e.g., "33% of students with an A in Data Structures got an A in Databases", captures historical trends and can be used to predict the future performance of students. Under the assumption that this information correctly reflects (approximates) the true probabilities, it can be used to assist the advisor in her recommendation. Note that the type of probabilistic information available to the advisor in this example varies greatly. A simple form is a probability distribution of student performance in one course. The advisor can compare the probability distribution for Databases with the probability distribution for Operating Systems in order to choose the course that has the higher probability of success.

Another type of probabilistic information that can be useful in this situation is a joint probability distribution. When the advisor needs to consider the entire list of courses for the student to take next semester, she is interested in the student's success in all courses at once. Furthermore, we notice that a student's success in future classes can depend intrinsically on her current grades. A C in a Data Structures class may suggest to the advisor that the student might not do well in Algorithms, while an A in Logic suggests a good chance of success in Artificial Intelligence. Other information that can affect the probability distribution may include general information about the student or course, such as the student's major, college, graduate/undergraduate status, the professor who teaches the course, etc.

1.1.3 Welfare to Work

Consider decision-theoretic planning in a federally mandated, locally administered program such as Welfare to Work (W2W). W2W is a program of services offered to US welfare recipients to help them re-enter the work force. The broad goals are federally mandated; the mandates are interpreted and refined by state governments, and the services available to W2W clients depend on local service providers. There is high turn-over both at the level of individual service providers, such as case workers, and at the level of agencies engaged to provide services. To enter the W2W program, clients must go to a specific office and register. Case managers then determine the client's eligibility through a series of assessments of, for example, economic need and family structure. Regulations change frequently at all levels of the regulatory hierarchy. W2W case workers in the locality we are studying often have case loads of 40 to 50 clients. Based on their knowledge of the individual client's preferences and goals, on the current regulations, and on the currently available services, they must recommend appropriate choices of services for each client. Case workers meet their clients on a weekly basis to assess the level of work, progress

in learning, and attitude towards employment, and to review and revise the actions and supports contracted between case managers and their clients.

The W2W program is analogous to the Academic Advising project in that there is an advisor who tries to find the best plans for either their W2W clients or their students. The case managers in W2W play a role similar to that of the academic advisors in the Academic Advising project, while W2W clients play a role similar to that of the students. When case managers develop goals and plans for the clients, they must balance the state and federal requirements imposed on the employment activities. Much of the work case managers do with W2W clients is based on the case managers' assessment of the likelihood that the combined activities suggested for a client will lead to improved employability of the client (or other success measures for the program and/or the particular client).

1.1.4 Political Elections

Imagine the town of Sunny Hill is holding elections for mayor, state representative and state senator. Together with these races, residents of Sunny Hill need to vote on two ballot initiatives: whether or not to build a new park downtown and whether or not to legalize AI conferences. Candidates from three parties, Donkey, Elephant and Rhino, are vying for the elected offices, and each candidate takes a position on each ballot initiative, with the Donkey party candidates generally supporting both, the Elephant party candidates opposing both, and the Rhino party candidates opposing the park but supporting legal AI.

The public, the candidates and their campaigns, as well as the election commissioner are kept aware of the voting trends in Sunny Hill by polls in the weeks preceding the elections. The polls are conducted among diverse groups such as representative samples of the entire town population, of likely voters, women, residents of specific neighborhoods, members of specific parties, etc. Among the questions asked on the polling surveys are the current preferences of the participant for each race and for the initiatives, together with some demographic information and some supplementary questions, such as whether the participant saw a specific and highly-charged infomercial. Poll data, such as the collection shown in Figure 1.2, is assumed to have a margin of error. A typical statement is, "The straight Donkey ticket for the election is preferred by 36% of respondents +/- 3%". Although the actual polling data indicates statistical information about respondents, it can be interpreted probabilistically as "The probability that a resident of Sunny Hill will vote the straight Donkey ticket in the elections is between 33% and 39%, based on the October 26 poll." (See the top line of the Poll7 table in Figure 1.2.)

Figure 1.2 shows a small sample of the wide variety of distributions that may be produced by the pollsters.

[Figure 1.2 contains seven poll tables. Poll1 (men, Donkey party, October 26) gives interval probabilities over the mayoral vote and both ballot initiatives for respondents intending to vote Donkey in the senate race. Poll2 (male residents, October 23) is a joint interval distribution over the senate race and the legalization initiative for respondents intending to vote Donkey for mayor. Poll3 (Sunny Hill, October 26) covers the two ballot initiatives. Poll4 (South Side), Poll5 (Downtown) and Poll6 (West End), all dated October 12, cover the mayoral race. Poll7 (Sunny Hill, October 26) covers the state representative race.]

Figure 1.2: Sunny Hill pre-election polls

Poll1 contains the distribution of the vote in the mayoral race and on the two ballot initiatives by men affiliated with the Donkey party who intended to vote Donkey in the Senate race, as indicated in a survey conducted on October 26. Poll2 is a joint distribution of the expected vote in the senate race and on the legalization ballot initiative, based on a survey of a representative sample of male residents of Sunny Hill taken on October 23. Poll3 contains information about the expected vote on the ballot initiatives based on a survey taken on October 26. Poll4, Poll5 and Poll6 contain information about the expected vote distribution in the mayoral race among the residents of three different parts of Sunny Hill, based on surveys taken on the same day. Finally, Poll7 contains information about the state representative race. Sample sizes are also provided for convenience.

These and other similar distributions are used by campaign managers and the elections commissioner to gain insight into the political trends in Sunny Hill. This statistical and probabilistic information should be stored in a way that facilitates future recall. There are many possible questions that the commissioner and her/his analysts might ask, including:

“What information is available about voter attitudes on October 26?”

“What are other voting intentions of people who vote Donkey for mayor?”

“What information is known about voter intentions in the mayoral race?”

“What voting patterns are likely to occur with probability between 0.2 and 0.3?”

“With what probability are voters likely to vote for a Donkey mayor and Elephant Senator?”

“How do people who intend to vote Donkey for mayor plan to vote for the park construction ballot initiative?”

While analyzing poll data to find answers to these questions is fairly straightforward, the results should also be stored somewhere for future comparison. We present an extended SPO framework for convenient storage and retrieval of complex statistical and probabilistic information.

1.2 General Framework

Decision support systems would be useful in all of the above examples. These systems could assist case managers in automatically generating feasible plans for their clients in the W2W program, help advisors find possible plans for their students, and so on. In this section, we describe a general framework which supports decision making and planning under uncertainty.

Bayesian networks [70] are becoming an increasingly important area of research, especially in the Artificial Intelligence community. A Bayesian network is a compact, graphical representation of a joint probability distribution. Bayesian networks could offer assistance in a wide range of endeavors; their use in a number of applications is examined in [41, 74]. Recently many new methods have been developed to learn Bayesian networks from data [50, 51, 37], and these techniques are remarkably effective for many data analysis problems, for instance characterizing gene regulatory networks from high-throughput biological data [38, 66]. Researchers are exploiting the Bayesian network approach for the construction of decision support systems and expert systems [65].

The work described here is part of the larger project "Decision Making and Planning under Uncertainty with Constraints". This project develops software tools for eliciting, modeling, and managing complex probabilistic information, and algorithms for building domain models and suggesting feasible plans. It will develop database-driven systems for supporting decision making and planning under uncertainty. In our general framework, we adopt the Bayesian network representation for modeling application domains. The framework is an integrated solution for building decision support systems, and it consists of three major parts; the overall architecture is shown in Figure 1.3. The left part is responsible for model construction and data management: it extracts information from data, constructs models for application domains, elicits probabilistic information, and stores and manipulates the data. The middle part handles the construction of situation-specific models: it elicits client demographic data and preferences, and configures a smaller Bayesian network model based on those decision variables most likely to impact the client. The network uses the conditional probability tables most appropriate for that client. The situation-specific network is then passed to the decision-theoretic planning part of the system, which constructs feasible plans for the client.

The process of building a smaller network specific to a planning query is called knowledge-based model construction (KBMC). This notion was first introduced by Breese, Goldman and Wellman [11], and in part inspired this work. Such networks are also referred to as situation-specific network fragments [60]. Several suggested methods for KBMC require on-the-fly construction of conditional probabilities from simple components plus a standard combination method. For instance, Ngo and Haddawy [68] limit their system to NOISY-OR tables, while Glesner and Koller

[43] allow NOISY-OR, NOISY-AND, MIN, and MAX. For recent surveys, we refer the reader to Peter Haddawy's introduction to the 1999 AI Magazine special issue on Bayesian Reasoning [47] and to Kathy Laskey's class notes [59].

Figure 1.3: A general framework for decision making and planning under uncertainty.

The idea is to use information about a specific instance in order to specialize or personalize the model being used to reason about that instance. One of the key features of KBMC is that the network constructed for an instance can be significantly smaller than the non-specific network, and can be constructed on demand. Situation-specific networks improve planning in two ways: (i) their smaller size means faster turnaround time for the planning algorithm (and quite possibly better overall plans), and (ii) situation-specific conditional probability tables may reflect current situations much better than generic ones, resulting in a more appropriate plan.

To support KBMC, our knowledge construction tools have to deal with an enormous amount of probabilistic data efficiently. We cannot rely on ad-hoc data management methods, as most other tools currently do. In order to facilitate efficient and effective KBMC, it is necessary to provide a database capable of storing and manipulating probabilities and contextual information for those probabilities. It is also necessary to provide a means of generating queries to that database that will retrieve those variables deemed likely to impact a client, based on demographic and preference data for the client.

1.3 Towards Semistructured Probabilistic Databases

There are large quantities of uncertain information in decision support systems, such as those for solving problems arising in our motivating examples. However, when trying to store the data

described in the above examples using one of the probabilistic database models, relational or object, a number of problems are encountered [49]. Probabilistic relational database models [5, 25, 58] lack the flexibility to store all of our data in a straightforward manner. For instance, in the risk analysis application the aspects are viewed as random variables. As such, it is natural to represent each aspect as a database attribute that can take values from its domain. However, with such an interpretation, a joint probability distribution of values for two aspects will have a schema different from a joint probability distribution for three aspects, and therefore will have to be stored in a separate relation. For the same reason, joint probability distributions over the same number of aspects but different aspects have to be stored in separate relations. In such a database, expressing queries like "Find all probability distributions that include Driver's Age as a random variable" is very inconvenient, if possible at all.

Probabilistic object models [28, 56] are also not a good fit for storing this kind of data. In the framework of Eiter, et al. [28], a probabilistic object is a "real" object, some of whose properties are uncertain and probabilistically described. For our application, however, the probability distribution is the primary object that needs to be stored. As seen from the examples above, probability distributions are complex objects. Yet we want a way to store probability distributions of varying shapes and sizes, and to access them readily. Probabilistic relational DBMSs cannot deal with both storage and retrieval of the probability distributions constructed during the analysis: such probability distributions are hard to store in probabilistic relational databases. General XML databases can store probability distributions, but they cannot manipulate them correctly.

This work develops a probabilistic database framework for collecting, storing and managing large quantities of probabilistic and statistical information. It provides a data model and query language to store, query and manipulate probability distribution objects. The examples indicate the importance of the following features:

probability distributions along with their associated, non-probabilistic information are treated as single objects;

probability distributions with different structures (e.g., different numbers/types of random variables involved) can be stored in the same "relations";

query language facilities for retrieval of full distributions and retrieval of parts of distributions (individual rows of the probability tables) based on their properties are provided;

query language facilities for manipulations and transformations of probability distributions according to the laws of probability theory are provided;

12 probability distributions are correctly handled during query processing.

The semistructured data model [1, 12, 80] has recently gained wide acceptance as a means of representing data which do not conform to a rigid schema structure. In particular, the similarity between the semistructured data model and the underlying data model of the eXtensible Markup Language (XML) [10], the emerging open standard for data storage and transmission over the Internet, makes this approach attractive. However, currently available XML databases, such as native XML databases [63, 82], do not generally handle probability distributions.
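To make concrete the idea of a probability table plus its auxiliary information being treated as one object, here is a minimal sketch in Python. The field names follow the SPO components discussed later (context, participating variables, probability table, conditionals), but this structure is our illustration only, not the SPO-ML representation defined in Chapter 4, and the values are made up.

```python
from dataclasses import dataclass, field

@dataclass
class SPO:
    """A probability table together with the auxiliary information
    that makes it usable, stored as a single object."""
    context: dict            # non-probabilistic properties, e.g. {"city": "Lexington"}
    variables: list          # participating random variables, e.g. ["DA", "LY"]
    table: dict              # {(one value per variable): probability}
    conditionals: dict = field(default_factory=dict)  # e.g. {"DR": ["A", "B"]}

# A fragment of the risk distribution from Figure 1.1 (values illustrative).
risk = SPO(
    context={"city": "Lexington", "race": "Asian"},
    variables=["DA", "LY"],
    table={("A", "A"): 0.09, ("A", "B"): 0.12},
    conditionals={"DR": ["A", "B"]},
)
```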

1.4 Overview of Dissertation

The dissertation is organized as follows. In Chapter 2, we review concepts related to probability distributions, interval probabilities, Bayesian networks, and query processing in relational database management systems and in middleware or mediators.

Chapter 3 describes the semistructured probabilistic database framework: we introduce the SPO data model and the SP-Algebra, and discuss our initial steps towards designing a query optimization framework. We describe the initial SPO data model and algebra of Dekhtyar, Goldsmith and Hawkes [20], due to its close relationship with the research described in this dissertation, and include all the changes and extensions introduced to it. A number of features, key to the SPO framework, were modified in the current model. For instance, a path expression was added to the SPO data model to specify the origin of a particular object, and a new definition of the join operation replaces the one in the initial framework so that the operation complies with the laws of probability theory. We also describe a set of equivalence relations and a cost model for a query optimization framework.

Chapter 4 describes in detail the actual implementation of the semistructured probabilistic database management system (SPDBMS). We discuss the XML-based implementation of the SPO data model and show how SPOs are represented in XML with the help of a meta language, SPO-ML. We describe our relational schema generation strategies for efficiently mapping XML representations of SPOs into relational tables. We also discuss our query translation mechanism, which automatically generates a sequence of SQL queries for evaluating SP-Algebra queries, as well as the preliminary experimental results of our pilot implementation.

In Chapter 5, we discuss the extension of the SPO framework to support probability intervals. The extended framework allows one to associate context or conditionals with individual random variables. We describe the extended data model and query algebra.

We describe our Bayesian Network Development Suite (BaNDeS), a system which has been specially designed and implemented for building Bayesian networks with the full support of the

SPDBMS, in Chapter 6. We describe related work on probabilistic databases and probability intervals, and discuss physical storage techniques for storing XML data in relational databases, in Chapter 7. Finally, we summarize the work and outline ongoing and future work in Chapter 8.

Chapter 2

Background

In this chapter we briefly introduce concepts related to probability distributions, interval probabilities and Bayesian networks, as well as the principles of query processing in relational database management systems and in mediators. (A mediator is a software layer which facilitates communication between different applications; it functions both as a server to its front end and as a client to its back end.)

2.1 Probability Distributions

In the chapters below, we frequently discuss the manipulation of probability distributions of discrete random variables, so we briefly introduce the basic concepts in this section. The presentation below follows [27]. What is a random variable? We can think of a random variable as a variable with a domain of possible values. A formal mathematical definition of a random variable can be found in [73]: "A random variable is a function from the sample space of an experiment to the set of real numbers".

There are two types of random variables, discrete and continuous. A discrete random variable is one which may take on only a countable number of distinct values. In this work we only consider discrete random variables whose domains contain a finite number of distinct values.

Let $X$ be a discrete random variable with the domain of values $\{x_1, x_2, \ldots, x_n\}$. The probability distribution of $X$ is a function $P$ that gives, for each value $x_i$, the probability $P(X = x_i)$ that the random variable $X$ takes the value $x_i$. We require that the following two conditions be met:

a. $0 \leq P(x_i) \leq 1$ for $i = 1, \ldots, n$, and

b. $\sum_{i=1}^{n} P(x_i) = 1$.

A joint probability distribution of discrete random variables $X_1, \ldots, X_k$ is defined as the probability of co-occurrence of multiple events, i.e., $P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k)$. The domain of the set of random variables is defined as the Cartesian product of all individual domains, i.e., $dom(X_1) \times \cdots \times dom(X_k)$. Note that the sum of all probabilities still equals 1.

Given a joint probability distribution $P$ over $X$ and $Y$ with $dom(X) = \{x_1, \ldots, x_n\}$ and $dom(Y) = \{y_1, \ldots, y_m\}$, one may want to obtain the probability distribution over a subset of the random variables regardless of the rest of the random variables. This operation is called marginalization. Summing a discrete joint probability distribution over a set of random variables effectively computes the marginal probability distribution. For instance, this operation gives

$$P(X = x) = \sum_{y \in dom(Y)} P(X = x, Y = y).$$

In other situations one may also derive the probability distribution of a subset of the random variables given that the rest of the random variables have been set to specific values. This can be obtained with a conditioning operation, and the resulting distribution is called a conditional probability distribution. Bayes' Rule, shown below, can be used to determine a conditional probability:

$$P(A \mid B) = \frac{P(A \wedge B)}{P(B)} = \frac{P(B \mid A) \cdot P(A)}{P(B)}.$$

For probability distributions, the following expression can be used to compute the conditional probability distribution:

$$P(X \mid Y = y) = \frac{P(X, Y = y)}{P(Y = y)}.$$

Here we give an example of how to perform the marginalization and conditioning operations.

Example 2.1 Suppose we have a joint probability distribution over two random variables $X$ and $Y$, as shown on the left side of Figure 2.1. Based on this joint probability distribution, we can perform marginalization to obtain a marginal probability distribution. A marginal probability distribution over the random variable $X$ can be computed by projecting out the random variable $Y$ and summing up the probabilities of the rows that agree on the remaining attributes; it is shown in the middle of Figure 2.1. A conditional probability distribution over $X$, given that $Y$ takes the value $y_1$, can be computed by removing the rows which do not satisfy the condition that $Y$ takes the value $y_1$, followed by a normalization operation. The final result is shown on the right side of Figure 2.1.

[Figure 2.1 shows the joint probability table over $X$ and $Y$ on the left, the marginal distribution over $X$ in the middle, and the conditional distribution over $X$ given $Y = y_1$ on the right.]

Figure 2.1: Marginalization and conditioning operations
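The two operations of Example 2.1 are mechanical enough to state as code. The following sketch (Python is our choice here, and the joint table values are illustrative, since the numbers in Figure 2.1 are only partially recoverable) computes a marginal and a conditional distribution from a joint distribution stored as a dictionary.

```python
from collections import defaultdict

# Joint distribution P(X, Y) stored as {(x, y): probability}; values illustrative.
joint = {
    ("x1", "y1"): 0.2, ("x1", "y2"): 0.2,
    ("x2", "y1"): 0.3, ("x2", "y2"): 0.3,
}

def marginalize(joint, axis):
    """Project onto the variable at position `axis`, summing the
    probabilities of rows that agree on it (marginalization)."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[outcome[axis]] += p
    return dict(marginal)

def condition(joint, axis, value):
    """Keep rows where the variable at `axis` equals `value`,
    then normalize by their total mass (conditioning)."""
    kept = {k: p for k, p in joint.items() if k[axis] == value}
    total = sum(kept.values())          # equals P(variable == value)
    return {k: p / total for k, p in kept.items()}

print(marginalize(joint, 0))        # {'x1': 0.4, 'x2': 0.6}
print(condition(joint, 1, "y1"))    # {('x1','y1'): 0.4, ('x2','y1'): 0.6}
```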

16 2.2 Interval Probabilities

Probability intervals occur in many situations. In general, when computing the probability of a conjunction of two events, knowing the point probabilities of the original events does not immediately guarantee uniqueness of the probability of the conjunction. The latter probability depends also on the known relationship between the two original events. When no such relationship is known, the probability of the conjunction can only be computed to lie in an interval [8]:

$$\max(0,\ P(e_1) + P(e_2) - 1) \leq P(e_1 \wedge e_2) \leq \min(P(e_1),\ P(e_2)).$$

In some applications, it may be infeasible to obtain the exact probabilities, or the point probabilities obtained will not be robust, so intervals better represent our knowledge of the domain.

Here we assume that the probability space is C[0,1], the set of all subintervals of the interval $[0,1]$. The rest of this section formally introduces the possible worlds semantics for probability distributions over C[0,1] and the notions of consistency and tightness of the distributions.

The possible worlds approach to describing interval probabilities has been adopted by a number of researchers. In particular, the semantics described here was first introduced by de Campos, Huete and Moral [17]. Similar notions are also found in the work of Weichselberger [85]. In the database literature, a specialized version of possible worlds semantics for interval probabilities appeared in [22]. The description of the semantics in this dissertation follows the work in [17], but we use a different notation in the context of probabilistic databases [88, 44, 87]. The theorems have been formulated before; however, we give our own proofs (derived independently). We discuss related work in more detail in Section 7.2.
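As a quick illustration of the conjunction bounds above, here is a minimal sketch (the helper name and sample probabilities are ours, not from the dissertation):

```python
def conjunction_bounds(p1, p2):
    """Interval for P(e1 and e2) when only P(e1) and P(e2) are known
    and the relationship between e1 and e2 is unknown."""
    lower = max(0.0, p1 + p2 - 1.0)   # events overlap as little as possible
    upper = min(p1, p2)               # one event implies the other
    return lower, upper

# Example: P(e1) = 0.7 and P(e2) = 0.6 give the interval [0.3, 0.6].
print(conjunction_bounds(0.7, 0.6))
```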

Definition 2.1 Let  be a set of random variables. A probabilistic interpretation (p-interpretation)

¦ ¤

$#   



 

 ¡       





¥ © "! over is a function , such that  .

Given a set of random variables  , a p-interpretation over it is any valid point probability

¦ ¤%

 &

  distribution. Our main idea is that a probability distribution function (pdf) ¡ C[0,1] represents a set of possible point probability distributions (a.k.a., p-interpretations). This corre-

sponds to de Campos, et al.’s instance [17]. ¦

In the rest of the paper we adopt the following notation. Given a probability distribution ¡

¤ ¤%

 ' #   (# 

 

 

 ¡    ¡ ¡  ¥¡ 

 C[0,1], for each we write . Whenever we enumerate

£ ¤ ¤

(#      # #

¡ ¡ §  §¥¡©§)        ¡ ¥¤¥¤¥¤ ¡ © 

 as , we write , .

Definition 2.2 Let V be a set of random variables and P: dom(V) → C[0,1] be a complete interval probability distribution function over V. A probabilistic interpretation I satisfies P (I ⊨ P) iff (∀x̄ ∈ dom(V)) (l_x̄ ≤ I(x̄) ≤ u_x̄).

Let V be a set of random variables and P: dom′ → C[0,1], dom′ ⊆ dom(V), be an incomplete interval probability distribution function over V. A probabilistic interpretation I satisfies P (I ⊨ P) iff (∀x̄ ∈ dom′) (l_x̄ ≤ I(x̄) ≤ u_x̄).

Basically, if a p-interpretation I satisfies an interval probability distribution function P, then, given P, I is a possible point probability distribution.

Example 2.2 Consider a random variable v with a finite domain dom(v). Let interval probability distribution functions P_1, P_2 and P_3 and p-interpretations I_1, I_2, I_3 and I_4 be defined in this table.

[Table: the interval bounds assigned by P_1, P_2 and P_3 and the point probabilities assigned by I_1, I_2, I_3 and I_4 to each value of v.]

P-interpretation I_1 satisfies both P_1 and P_2. P-interpretation I_2 satisfies P_1 but not P_2, while I_3 satisfies P_2 but not P_1. Finally, I_4 satisfies neither P_1 nor P_2. None of the p-interpretations I_1, I_2, I_3, I_4 satisfies P_3.

We can now specify the consistency criterion for interval probability distribution functions.

Definition 2.3 An interval probability distribution function P: dom(V) → C[0,1] is consistent iff there exists a p-interpretation I, such that I ⊨ P.

Example 2.3 Consider the interval probability distribution functions P_1, P_2 and P_3 described in Example 2.2. As that example shows, I_1 ⊨ P_1 and I_1 ⊨ P_2, and thus both P_1 and P_2 are consistent interval probability distributions.

On the other hand, none of the p-interpretations from Example 2.2 satisfies P_3. One notices that any p-interpretation I satisfying P_3 must respect the upper bounds that P_3 assigns to the values of v, and those upper bounds sum to less than 1; hence Σ_{x ∈ dom(v)} I(x) < 1, which contradicts the constraint on p-interpretations. Therefore, no p-interpretation can satisfy P_3, and thus P_3 is inconsistent.

The following theorem specifies the necessary and sufficient conditions for an interval probability distribution function to be consistent.

Theorem 2.1 Let V be a set of random variables and P: dom(V) → C[0,1] be a complete interval probability distribution function over V. Let dom(V) = {x̄_1, …, x̄_n} and P(x̄_i) = [l_i, u_i]. P is consistent iff the following two conditions hold: (1) Σ_{i=1}^{n} l_i ≤ 1; (2) Σ_{i=1}^{n} u_i ≥ 1.

Let P: dom′ → C[0,1] be an incomplete interval probability distribution function over V. Let dom′ = {x̄_1, …, x̄_k} and P(x̄_i) = [l_i, u_i]. P is consistent iff Σ_{i=1}^{k} l_i ≤ 1.
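Definition 2.2 and Theorem 2.1 translate directly into executable checks. The following Python sketch assumes a dictionary representation of interval pdfs, in which each domain value maps to a pair (l, u); this representation is ours, chosen only for illustration.

```python
def satisfies(I, P):
    """Definition 2.2: I |= P iff l_x <= I(x) <= u_x wherever P is defined."""
    return all(l <= I[x] <= u for x, (l, u) in P.items())

def is_consistent(P, complete=True):
    """Theorem 2.1: a complete interval pdf is consistent iff the lower
    bounds sum to at most 1 and the upper bounds sum to at least 1;
    an incomplete one only needs the first condition."""
    lower = sum(l for (l, u) in P.values())
    upper = sum(u for (l, u) in P.values())
    return (lower <= 1 and upper >= 1) if complete else lower <= 1

# The distribution P of Figure 2.2 (left): lower bounds sum to 0.4,
# upper bounds to 1.5, so P is consistent.
P = {"x1": (0.1, 0.2), "x2": (0.1, 0.2), "x3": (0.1, 0.3), "x4": (0.1, 0.8)}
assert is_consistent(P)
assert satisfies({"x1": 0.1, "x2": 0.1, "x3": 0.1, "x4": 0.7}, P)
```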

Proof of Theorem 2.1
Let P be an interval probability distribution function over V and let dom(V) = {x̄_1, …, x̄_n}. Remember that we denote P(x̄_i) as [l_i, u_i]. Consider now two functions l, u: dom(V) → [0,1] such that l(x̄_i) = l_i and u(x̄_i) = u_i for all 1 ≤ i ≤ n.

First we prove that P is consistent if Σ_{i=1}^{n} l_i ≤ 1 and Σ_{i=1}^{n} u_i ≥ 1.

If Σ_{i=1}^{n} l_i = 1, then l is a p-interpretation and l ⊨ P; therefore P is consistent. If Σ_{i=1}^{n} u_i = 1, then u is a p-interpretation and u ⊨ P; therefore P is consistent.

Consider now the case when Σ_{i=1}^{n} l_i < 1 and Σ_{i=1}^{n} u_i > 1. Let L = Σ_{i=1}^{n} l_i and U = Σ_{i=1}^{n} u_i. We know that L < 1 < U.

Consider a function I: dom(V) → [0,1] such that

I(x̄_i) = l_i + (u_i − l_i) · (1 − L)/(U − L).

We now show that I is a p-interpretation and I ⊨ P. Let α = (1 − L)/(U − L). As L < 1 < U, 0 < α < 1, and we can rewrite the definition of I as I(x̄_i) = l_i + α(u_i − l_i). Then l_i ≤ I(x̄_i) ≤ u_i. Thus, if I is a p-interpretation then I ⊨ P.

To show that I is a p-interpretation we need Σ_{i=1}^{n} I(x̄_i) = 1. This can be demonstrated as follows:

Σ_{i=1}^{n} I(x̄_i) = Σ_{i=1}^{n} (l_i + α(u_i − l_i)) = Σ_{i=1}^{n} l_i + α·(Σ_{i=1}^{n} u_i − Σ_{i=1}^{n} l_i) = L + ((1 − L)/(U − L)) · (U − L) = L + (1 − L) = 1.

To complete the proof, we prove that if P is consistent then Σ_{i=1}^{n} l_i ≤ 1 and Σ_{i=1}^{n} u_i ≥ 1. If P is consistent, then there exists a p-interpretation I: dom(V) → [0,1] such that I ⊨ P. Then, for each x̄_i ∈ dom(V), we have l_i ≤ I(x̄_i) ≤ u_i. But then

Σ_{i=1}^{n} l_i ≤ Σ_{i=1}^{n} I(x̄_i) ≤ Σ_{i=1}^{n} u_i.

As I is a p-interpretation, Σ_{i=1}^{n} I(x̄_i) = 1, and we immediately get Σ_{i=1}^{n} l_i ≤ 1 and Σ_{i=1}^{n} u_i ≥ 1.

Consistency is not the only property of interval probability distribution functions that is of interest. Another important property of such functions is tightness.

Example 2.4 Consider the interval probability distribution P as shown in Figure 2.2 (left). Assume that P is complete. It is easy to see that P is consistent (indeed, the sum of the lower bounds of the probability intervals adds up to 0.4 and the sum of the upper bounds adds up to 1.5). In fact, there are many different p-interpretations satisfying P. Of particular interest to us are the p-interpretations that satisfy P and take on marginal values. E.g., p-interpretation I_1: I_1(x̄_1) = 0.1, I_1(x̄_2) = 0.1, I_1(x̄_3) = 0.1, I_1(x̄_4) = 0.7 satisfies P and hits the lower bounds of the probability intervals provided by P for x̄_1, x̄_2 and x̄_3. Similarly, I_2: I_2(x̄_1) = 0.2, I_2(x̄_2) = 0.2, I_2(x̄_3) = 0.3, I_2(x̄_4) = 0.3 satisfies P and hits the upper bounds of the probability intervals for x̄_1, x̄_2 and x̄_3. Thus, every single number in the probability intervals for x̄_1, x̄_2 and x̄_3 is reachable by different p-interpretations satisfying P.

However, the same is not true for x̄_4. If some p-interpretation I satisfies P, then I(x̄_4) ≥ 0.3. Indeed, we know that I(x̄_1) + I(x̄_2) + I(x̄_3) + I(x̄_4) = 1, and if I(x̄_4) < 0.3 then I(x̄_1) + I(x̄_2) + I(x̄_3) > 0.7. However, the maximum values for x̄_1, x̄_2 and x̄_3 allowed by P are 0.2, 0.2 and 0.3 respectively, and they add up to only 0.7.

Similarly, no p-interpretation I satisfying P can have I(x̄_4) > 0.7. Indeed, in this case, I(x̄_1) + I(x̄_2) + I(x̄_3) < 0.3. However, the smallest values for x̄_1, x̄_2 and x̄_3 allowed by P are all 0.1, and they add up to 0.3.

P (left):   x̄_1: [0.1, 0.2];  x̄_2: [0.1, 0.2];  x̄_3: [0.1, 0.3];  x̄_4: [0.1, 0.8]
P′ (right): x̄_1: [0.1, 0.2];  x̄_2: [0.1, 0.2];  x̄_3: [0.1, 0.3];  x̄_4: [0.3, 0.7]

Figure 2.2: Tightness of interval probability distributions.

The notion of “reachability” discussed above can be formalized as follows.

Definition 2.4 Let P: dom(V) → C[0,1] be an interval probability distribution function over a set of random variables V. Let dom(V) = {x̄_1, …, x̄_n} and P(x̄_i) = [l_i, u_i]. A number α ∈ [l_i, u_i] is reachable by P at x̄_i iff there exists a p-interpretation I, such that I ⊨ P and I(x̄_i) = α.

Proposition 2.1 Let P: dom(V) → C[0,1] be an interval probability distribution function over a set of random variables V. If for some x̄ ∈ dom(V) there exist α, β ∈ [0,1], α ≤ β, which are both reachable by P at x̄, then any γ ∈ [α, β] is reachable by P at x̄.

Intuitively, points unreachable by an interval probability distribution function represent "dead weight": they do not provide any additional information about any possible point probability distributions.

Definition 2.5 Let P: dom(V) → C[0,1] be an interval probability distribution over a set V of random variables. P is called tight iff (∀x̄ ∈ dom(V)) (∀α ∈ [l_x̄, u_x̄]) (α is reachable by P at x̄).

Example 2.5 As shown in Example 2.4, the interval probability distribution function P shown on the left-hand side of Figure 2.2 is not tight. On the other hand, the interval probability distribution function P′ shown on the right-hand side of Figure 2.2 is tight. Its tightness follows from the fact that the p-interpretations I_1 and I_2 from Example 2.4 both satisfy it, and now both the upper and the lower bound for x̄_4 are reachable.

Function P′ has another important distinction with respect to P. Indeed, one can show that for any p-interpretation I, I ⊨ P iff I ⊨ P′, i.e., the sets of p-interpretations that satisfy P and P′ coincide. Hence, one can say that P′ is a tight equivalent of P.

We want to replace interval probability distributions that are not tight with their tight equivalents. This will be done using the tightening operator.

Definition 2.6 Given an interval probability distribution P, an interval probability distribution P′ is its tight equivalent iff (i) P′ is tight and (ii) for each p-interpretation I, I ⊨ P iff I ⊨ P′.

Proposition 2.2 Each complete interval probability distribution P has a unique tight equivalent.

Definition 2.7 A tightening operator T takes as input an interval probability function P: dom(V) → C[0,1] and returns its tight equivalent T(P): dom(V) → C[0,1].

Our next goal is to compute the result of applying the tightening operator to an interval probability distribution function efficiently. First we notice that if P is tight then T(P) = P. The theorem below specifies an efficient procedure for computing the result of tightening an interval probability distribution function.

Theorem 2.2 Let P: dom(V) → C[0,1] be a complete interval probability distribution function over a set of random variables V. Let dom(V) = {x̄_1, …, x̄_n} and P(x̄_i) = [l_i, u_i]. Then

T(P)(x̄_i) = [max(l_i, 1 − Σ_{j≠i} u_j), min(u_i, 1 − Σ_{j≠i} l_j)].
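The formula of Theorem 2.2 can be computed in a single pass over the distribution. Below is a Python sketch, using the same illustrative dictionary representation of interval pdfs as above.

```python
def tighten(P):
    """Theorem 2.2: T(P)(x_i) = [max(l_i, 1 - sum_{j != i} u_j),
                                 min(u_i, 1 - sum_{j != i} l_j)]."""
    L = sum(l for (l, u) in P.values())   # sum of all lower bounds
    U = sum(u for (l, u) in P.values())   # sum of all upper bounds
    return {x: (max(l, 1 - (U - u)), min(u, 1 - (L - l)))
            for x, (l, u) in P.items()}

# Tightening the left-hand distribution of Figure 2.2 yields the
# right-hand one: only the interval for x4 shrinks, to [0.3, 0.7].
P = {"x1": (0.1, 0.2), "x2": (0.1, 0.2), "x3": (0.1, 0.3), "x4": (0.1, 0.8)}
print(tighten(P))
```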

Proof of Theorem 2.2
Let P′(x̄_i) = [l′_i, u′_i] = [max(l_i, 1 − Σ_{j≠i} u_j), min(u_i, 1 − Σ_{j≠i} l_j)]. We need to prove two statements: (∀I)(I ⊨ P iff I ⊨ P′), and P′ is tight.

First we prove that I ⊨ P iff I ⊨ P′.

Notice that for all x̄_i ∈ dom(V), P′(x̄_i) ⊆ [l_i, u_i]. Indeed, l′_i = max(l_i, 1 − Σ_{j≠i} u_j) ≥ l_i and u′_i = min(u_i, 1 − Σ_{j≠i} l_j) ≤ u_i. Now, because P is consistent, Σ_{j=1}^{n} l_j ≤ 1 and Σ_{j=1}^{n} u_j ≥ 1. But then 1 − Σ_{j≠i} u_j ≤ u_i, and hence l′_i ≤ u_i; similarly, we obtain u′_i ≥ l_i, so each P′(x̄_i) is a well-formed subinterval of [l_i, u_i]. Finally, for any I ⊨ P′ and any x̄_i ∈ dom(V), l_i ≤ l′_i ≤ I(x̄_i) ≤ u′_i ≤ u_i. This means that (∀I)(I ⊨ P′ ⟹ I ⊨ P).

We now need to show the inverse: (∀I)(I ⊨ P ⟹ I ⊨ P′). Let I be a p-interpretation over V and let I ⊨ P. Therefore l_i ≤ I(x̄_i) ≤ u_i for all x̄_i ∈ dom(V). We need to show that max(l_i, 1 − Σ_{j≠i} u_j) ≤ I(x̄_i) ≤ min(u_i, 1 − Σ_{j≠i} l_j).

We show max(l_i, 1 − Σ_{j≠i} u_j) ≤ I(x̄_i). The other inequality can be proven similarly. We know that l_i ≤ I(x̄_i), so if max(l_i, 1 − Σ_{j≠i} u_j) = l_i then the inequality holds. Assume now that max(l_i, 1 − Σ_{j≠i} u_j) = 1 − Σ_{j≠i} u_j. Assume that the inequality does not hold, i.e., I(x̄_i) < 1 − Σ_{j≠i} u_j. We know that for all j ≠ i, I(x̄_j) ≤ u_j. Therefore

Σ_{j=1}^{n} I(x̄_j) = I(x̄_i) + Σ_{j≠i} I(x̄_j) < (1 − Σ_{j≠i} u_j) + Σ_{j≠i} u_j = 1.

As I is a p-interpretation, Σ_{j=1}^{n} I(x̄_j) must be equal to 1. This contradicts our assumption.

Next we show that P′ is tight. We show that for all x̄_i ∈ dom(V), every point α ∈ P′(x̄_i) is reachable. By Proposition 2.1, it is sufficient to prove that the end points of the interval P′(x̄_i) are reachable. Recall that P′(x̄_i) = [max(l_i, 1 − Σ_{j≠i} u_j), min(u_i, 1 − Σ_{j≠i} l_j)]. We show that l′_i = max(l_i, 1 − Σ_{j≠i} u_j) is reachable. Similar reasoning can be applied to show the reachability of the upper bound.

We show that there exists a p-interpretation I such that I ⊨ P and I(x̄_i) = l′_i. As we have shown that I ⊨ P iff I ⊨ P′, it then follows that I ⊨ P′ and l′_i is reachable.

First, suppose l′_i = 1 − Σ_{j≠i} u_j ≥ l_i. Consider the function I such that I(x̄_i) = 1 − Σ_{j≠i} u_j and I(x̄_j) = u_j for all j ≠ i. Then Σ_{j=1}^{n} I(x̄_j) = 1, so I is a p-interpretation. Also, l_i ≤ I(x̄_i) by the case assumption, and I(x̄_i) ≤ u_i because Σ_{j=1}^{n} u_j ≥ 1. Hence I ⊨ P and l′_i is reachable.

Now suppose l′_i = l_i ≥ 1 − Σ_{j≠i} u_j. Then Σ_{j≠i} u_j ≥ 1 − l_i, and, because Σ_{j=1}^{n} l_j ≤ 1, we also have Σ_{j≠i} l_j ≤ 1 − l_i. But then, by reasoning similar to that in the proof of Theorem 2.1, there exist numbers α_j, l_j ≤ α_j ≤ u_j for j ≠ i, such that Σ_{j≠i} α_j = 1 − l_i. Let I be the function such that I(x̄_i) = l_i and I(x̄_j) = α_j for all j ≠ i. Then Σ_{j=1}^{n} I(x̄_j) = 1, I ⊨ P and I(x̄_i) = l′_i.

This proves the reachability of l′_i = max(l_i, 1 − Σ_{j≠i} u_j), which in turn proves the theorem.

In particular, we are most interested in operations such as marginalization and conditioning for interval probability distributions. We propose a way to perform these two operations. The idea for conditioning an interval probability distribution is shown in Figure 2.3, and consists of the following steps: (i) first, represent the original interval probability distribution as a set of point probability distributions by using p-interpretations (note that this set is infinite in most cases); (ii) then, condition each point probability distribution on the given conditions to form a set of conditional point probability distributions; (iii) finally, compute both a lower bound and an upper bound for each row (event) in the probability distribution: for each event, we take the minimum and the maximum as the lower bound and upper bound, respectively.

[Figure: the original interval probability distribution P is generalized to the conditional interval probability distribution P′; each p-interpretation in the set of point distributions {I | I satisfies P} is conditioned, yielding the set of conditional point distributions {I′ | I′ satisfies P′}.]

Figure 2.3: A methodology for performing the conditioning (or marginalization) operation on an interval probability distribution

This approach works perfectly for the marginalization operation. However, we note that Jaffray [54] has shown that conditioning interval probabilities is a dicey matter. As shown in Figure 2.3, the set of point probability distributions representing P′ will contain point probability distributions which do not correspond to any I representing P. The consequence is that the conditioning operation is no longer commutative. But the conditioning operation is frequently used in applications, and our approach to conditioning, to our knowledge, gives the best possible solution for conditioning interval probability distributions. We include conditionalization in our interval probabilistic database framework with caution. A rough numerical sketch of the three-step methodology follows.
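The three steps above can be approximated numerically by sampling p-interpretations instead of enumerating the (generally infinite) set. The following Python sketch is a rough illustration of the methodology, not the procedure implemented in the SPDBMS; the bounds it reports approach the true conditional intervals from the inside as the number of samples grows.

```python
import random

def sample_p_interpretation(P, tries=1000):
    """Rejection-sample a point distribution I with I |= P,
    where P maps each outcome to an (l, u) interval pair."""
    for _ in range(tries):
        draw = {x: random.uniform(l, u) for x, (l, u) in P.items()}
        total = sum(draw.values())
        I = {x: p / total for x, p in draw.items()}   # renormalize to sum to 1
        if all(l <= I[x] <= u for x, (l, u) in P.items()):
            return I
    return None   # none found (P may be inconsistent or very tight)

def condition_interval(P, event, n_samples=2000):
    """Steps (i)-(iii), approximately: condition each sampled
    p-interpretation on the event (a set of outcomes) and record
    the observed min/max probability per outcome."""
    bounds = {x: [1.0, 0.0] for x in event}
    for _ in range(n_samples):
        I = sample_p_interpretation(P)
        if I is None:
            continue
        mass = sum(I[x] for x in event)   # probability of the conditioning event
        if mass == 0:
            continue
        for x in event:
            p = I[x] / mass
            bounds[x][0] = min(bounds[x][0], p)
            bounds[x][1] = max(bounds[x][1], p)
    return bounds
```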

2.3 Bayesian Networks

In the last two decades, Bayesian networks have become a popular model for representing uncertain expert knowledge [50]. The Bayesian network approach combines two different mathematical theories: graph theory and probability theory. A Bayesian network is a compact, graphical representation of a joint probability distribution. It encodes the probabilistic relationships between random variables, and explicitly represents the conditional independencies among them.



Consider a finite set X = {X_1, …, X_n} of discrete random variables, where each variable X_i takes on a value x_i from the domain dom(X_i). A Bayesian network over X consists of two components: a qualitative part, which represents the network structure G, and a quantitative part, which contains a set of conditional probability distributions, one for each random variable. The network structure can be represented as a directed acyclic graph (DAG). That is, all edges in the graph are directed and there is no cycle when edge directions are followed. Nodes represent random variables in the specific application domain, and edges represent conditional dependencies among the variables. A conditional probability table (CPT) is associated with each node and describes the dependency between a node and its parents. A CPT lists the probability that the child node takes on each of its different values for each combination of values of its parents.

Formally, a conditional probability table (CPT) is defined as follows.

Definition 2.8 A conditional probability table is a tuple CPT = ⟨X, par(X), P⟩, where

X is the random variable for which the CPT is defined;

par(X) = {Y_1, …, Y_k} is the set of X's parents in the Bayesian network;

P: dom(X) × dom(par(X)) → [0,1] is the conditional probability function, where dom(par(X)) = dom(Y_1) × ⋯ × dom(Y_k).

Now we are ready to define a Bayesian network.

Definition 2.9 A Bayesian network BN is a tuple BN = ⟨G, Θ⟩, where

G = ⟨X, E⟩ is a directed acyclic graph, with (i) X = {X_1, …, X_n}, a set of random variables in the Bayesian network, and (ii) E ⊆ {(X_i, X_j) | X_i, X_j ∈ X, i ≠ j}, a set of edges in the Bayesian network;

Θ = {CPT_1, …, CPT_n} is a set of conditional probability tables, with each CPT_i corresponding to a specific random variable X_i in the network.

Example 2.6 Figure 2.4 illustrates a fragment of a Bayesian network model for the W2W application domain. The random variable Communication Skills in the network has one parent node, which is the variable Literacy. Its CPT is shown in Figure 2.5, from which we can see that the variable Communication Skills has the domain of values {'Excellent', 'Good', 'So-so', 'Poor'} and the variable Literacy has the domain of values {'Good', 'Fair', 'Poor'}. Each row in the CPT represents a conditional probability distribution over the variable Communication Skills given that the variable Literacy takes a specific value from its domain.

[Figure: a DAG with nodes Literacy, Numeracy, Communication_Skills, Time_Management, Budget_Management, Cosmetology, Auto_Repaire and Mechanical_Skills.]

Figure 2.4: A sample Bayesian network for the Welfare to Work application domain

Parent       Communication Skills
Literacy     Excellent   Good    So-so   Poor
Good         0.523       0.315   0.131   0.031
Fair         0.400       0.285   0.185   0.130
Poor         0.334       0.215   0.210   0.241

Figure 2.5: A typical CPT for the random variable Communication Skills with different literacy levels.

Let a node in G denote the random variable X_i, and let par(X_i) denote the set of parent nodes of X_i, from which dependency edges come to the node X_i. The conditional distribution of the variable X_i can then be represented as P(X_i | par(X_i)). The joint probability function of the random variables X_1, …, X_n in a Bayesian network is calculated as the product of the local conditional probabilities of all the nodes. In this case, the joint probability function of X_1, …, X_n is given as follows:

P(X_1, …, X_n) = Π_{i=1}^{n} P(X_i | par(X_i)).

(A small computational illustration of this factorization is given at the end of this section.) There are three main research problems in probabilistic reasoning using Bayesian networks: learning, inference and planning. The idea of learning with Bayesian networks is to construct Bayesian networks from available data by using some learning algorithms. Bayesian network inference involves computing the posterior marginal probabilities of some query nodes. There is also recent development in planning with Bayesian networks (see a survey by Boutilier, et al. [9]).
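As a small illustration of the factorization, the following Python sketch evaluates the joint probability of a full assignment by multiplying local CPT entries. Only the Literacy → Communication_Skills fragment of Figures 2.4 and 2.5 is encoded, and the prior on Literacy is an invented value used purely for this example.

```python
# Each CPT maps (value, tuple_of_parent_values) -> probability.
# Literacy has no parents, so its parent tuple is empty.
cpts = {
    "Literacy": {("Good", ()): 0.5, ("Fair", ()): 0.3, ("Poor", ()): 0.2},
    "Communication_Skills": {
        ("Excellent", ("Good",)): 0.523, ("Good", ("Good",)): 0.315,
        ("So-so", ("Good",)): 0.131, ("Poor", ("Good",)): 0.031,
        # rows for Literacy = Fair and Poor would follow Figure 2.5
    },
}
parents = {"Literacy": (), "Communication_Skills": ("Literacy",)}

def joint_probability(assignment):
    """P(X1,...,Xn) = product over i of P(Xi | par(Xi)) -- the chain rule."""
    prob = 1.0
    for var, value in assignment.items():
        pa = tuple(assignment[p] for p in parents[var])
        prob *= cpts[var][(value, pa)]
    return prob

# P(Literacy = Good, Communication_Skills = Excellent) = 0.5 * 0.523 = 0.2615
print(joint_probability({"Literacy": "Good", "Communication_Skills": "Excellent"}))
```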

2.4 Query Processing in RDBMS and in Mediators

A typical relational database management system is composed of a storage management layer, an access layer, a query processing layer, a transaction management layer, and a language and interface layer, among which the query processing layer is the core of any relational database management system. When an application client sends a SQL query to an RDBMS, the query processor gets the query from the upper layer, parses it, executes it and finally sends the result(s) back to the upper layer. The query optimizer is where the database makes a decision among different execution plans that are equivalent. Although these different execution plans all implement the same query, they may differ greatly in terms of their execution efficiency. Good query optimization can reduce query execution time dramatically. A typical query processor for relational databases consists of the following components, as shown in Figure 2.6(a) (for detailed information, please see [40, 72]).

[Figure: (a) the relational query processor pipeline — query parser, preprocessor, logical query plan generator with rewrite rules, logical query plan selector with cost model, physical query plan enumerator, physical query plan selector with cost model, and query execution engine; (b) the mediator pipeline — query parser, pre-plan generator with rewrite rules, query plan selector with cost model and learned statistics, query plan translator producing SQL, and the underlying RDBMS engine.]

Figure 2.6: Major components of query processor in relational DBMS and in mediator.

a) query parser. This component is responsible for syntax checking. It checks incoming SQL queries for correct syntax, and parses them into parse trees if everything goes fine.

b) query preprocessor. This component is responsible for semantic checking, i.e., checking for correct relation names and attribute names. It also checks for possible expansion if there are views involved in a query. A complete parse tree of the query will be generated at the end of this stage.

c) logical query plan generator. This component generates various logical query plans which are equivalent to the initial logical query plan according to some rewrite rules.

d) logical query plan selector. This component chooses the best logical query plan based on the cost model chosen and statistics available, and possibly with the help of heuristics.

e) physical query plan enumerator. According to the best logical plan given by the previous stage, this component transforms logical operators into physical operators. By choosing different orders of join operations, implementation methods of operators, etc., it enumerates all possible physical query plans.

f) physical query plan selector. This component chooses the best physical query plan based on the cost model chosen and statistics available.

g) query execution engine. This component executes the query based on the physical query plan passed to it.

Recently, a new type of data management software, middleware or mediators (e.g., [39, 79, 33, 61, 4]), has emerged. A mediator is a software layer which facilitates the communication between application clients and application servers. It promotes loose coupling between the applications connecting to it. There are different architectures proposed for query processing in mediators because each serves a different purpose. A mediator determines which mediator query operations need to be executed by the underlying RDBMS engine. It pushes as much computation as possible to the underlying RDBMS. These operations are then translated into a sequence of SQL statements. A mediator enumerates all possible combinations and chooses the best possible plan (including the order of executing SQL statements) based on the rewrite rules and cost model, with the help of the statistics maintained by the mediator query engine. Then the final query plan, which is a sequence of SQL statements, is executed by the underlying RDBMS. Post-processing and assembly are applied after the mediator gets the tuple streams from the underlying RDBMS. The common components, shown in Figure 2.6(b), are summarized in the following.

a) query parser. This component is responsible for syntax checking. It checks incoming mediator queries for correct syntax, and parses them into parse trees if everything goes well.

b) pre-plan generator. This component generates various mediator query pre-plans which are equivalent to the initial mediator query plan according to the predefined rewrite rules.

c) query plan selector. This component chooses the best mediator query plan based on the cost model chosen and statistics available, and possibly with the help of heuristics. The statistics can be learned periodically through a thread to the underlying relational database.

d) query plan translator. According to the preferred mediator plan given by the previous stage, this component transforms each mediator operator into a sequence of SQL statements.

e) relational DBMS engine. This component executes the sequence of SQL statements which are passed to it from the previous stage.

Chapter 3

Semistructured Probabilistic Database Framework

The database framework described here has been designed specifically for storing and manipulating probability distributions with diverse structures. It has two major components: a Semistructured Probabilistic Object (SPO) data model and a Semistructured Probabilistic Algebra (SP-Algebra). The SPO data model allows one to represent probability distributions of different "shapes" along with associated information, and the SP-Algebra allows one to efficiently retrieve probabilistic information stored in the database. The probability distributions are handled correctly according to the laws of probability theory. This chapter is organized as follows. Section 3.1 gives an overview of the database framework. In Section 3.2 we describe our SPO data model. We discuss our SP-Algebra and the semantics of SP-Algebra operations in Section 3.3 and Section 3.4, respectively. Finally, we introduce our work on query optimization and outline its major components, rewrite rules and cost models, in Section 3.5.

3.1 Overview of the SPO framework

We start with the SPO data model. Given a universe 𝒱 of random variables and a set 𝒜 of context attributes, a Semistructured Probabilistic Object (SPO) S over (𝒱, 𝒜) is a tuple S = ⟨T, V, P, C, ω⟩, where T, the context of S, is a relational tuple over a context schema R over 𝒜; V = {v_1, …, v_n} ⊆ 𝒱 is a set of random variables that participate in S; P: dom(V) → [0,1] is the probability table of S; C is called the conditional of S; and ω is called the path expression, which records the origin of the SPO. A semistructured probabilistic relation (SP-relation) is defined as a collection of SPOs, and a semistructured probabilistic database (SP-database) is a collection of SP-relations.

SP-Algebra extends the definitions of the standard relational operations and includes selection (σ), projection (π), conditionalization (μ), Cartesian product (×) and join (⋈) operations. Given an SPO S = ⟨T, V, P, C, ω⟩, the selection operation may query information stored in each of its components: context, participating variables, conditional information, probability table and probabilities. Each type of selection operation requires its own language of selection conditions. Given a selection condition on context, participating variables or conditionals, an SPO will either match the condition and be returned in the result of the selection operation, or will not match and be omitted from

the result. Selection conditions that operate on the contents of the probability table (these are conditions either on the rows or on the probability values) are processed in a slightly different manner. Such selection operations return only the rows of the probability table that match the conditions, while keeping the context and conditional parts intact.

The projection operation may apply to context, participating variables or conditionals. Projection on context or conditionals is similar to classical relational projection: it removes the unwanted context or conditional entries from the SPO. Given a list L of random variables, the projection on participating variables computes the marginal probability distribution of the random variables in L from the joint probability distribution for V stored in P (the context and conditional parts are preserved).

Given a constraint v = x, where v is a random variable and x ∈ dom(v), the conditionalization operation μ_{v=x}(S) computes from P the conditional probability distribution of the random variables V − {v} given v = x. The resulting SPO will have the same context, and the new condition v = x will be added to its conditional part. Cartesian product and join compute joint probability distributions from the probability tables of the participating SPOs. Cartesian product is applicable only to a pair of SPOs which do not share common variables, while join applies only to a pair of SPOs that do share common variables. In the sections below, we will describe in detail the new SPO data model and the algebra operations. We will base our examples on the auto insurance risk analysis application described in Section 1.1.1.

3.2 Semistructured Probabilistic Object (SPO) Data Model

Consider a universe 𝒱 = {v_1, …, v_n} of random variables. With each random variable v ∈ 𝒱 we associate dom(v), the set of its possible values. Given a set V = {v_1, …, v_k} ⊆ 𝒱, dom(V) will denote dom(v_1) × ⋯ × dom(v_k).

Let 𝒜 = {A_1, …, A_m} be a collection of regular relational attributes. For A ∈ 𝒜, dom(A) will denote the domain of A. We define a semistructured schema R over 𝒜 as a multiset of attributes from 𝒜. For example, if 𝒜 = {A, B, C}, the following are valid semistructured schemas over 𝒜: {A, A, B}; {A, B, C}; {A, B, B, C, C}.

Let Δ denote a probability space used in the framework to represent the probabilities of different events. Examples of such probability spaces include (but are not limited to) the interval [0,1] and the set C[0,1] of all subintervals of [0,1] [67, 23, 58]. (In selection conditions we use a prefix to specify a particular part of an object: context, participating variable, conditional, probability table, or probability.) For each probability space there should exist a notion of a consistent probability distribution over Δ.

We are ready to define the key notion of our framework: Semistructured Probabilistic Objects (SPOs). We extended the initial SPO data model [20] by adding a fifth element, called the path expression, to the current data model to specify the origin of each object. First we give a brief description of the new framework, then describe it in detail in the sections that follow.

Definition 3.1 A Semistructured Probabilistic Object (SPO) S is defined as a tuple S = ⟨T, V, P, C, ω⟩, where

T is a relational tuple over some semistructured schema R over 𝒜. We will refer to T as the context of S.

V = {v_1, …, v_n} ⊆ 𝒱 is a set of random variables that participate in S. We require that V ≠ ∅.

P: dom(V) → Δ is the probability table of S. Note that P need not be complete, but it must be consistent w.r.t. Δ.

C = {(u_1, X_1), …, (u_m, X_m)}, where u_i ∈ 𝒱 and X_i ⊆ dom(u_i), 1 ≤ i ≤ m, such that V ∩ {u_1, …, u_m} = ∅. We refer to C as the conditional of S.

ω, called the path expression, is an expression of the Semistructured Probabilistic Algebra (SP-Algebra).

An explanation of this definition is in order. For our data model to possess the ability to store all the probability distributions mentioned in Section 1.1 (see Figure 1.1), the following information needs to be stored in a single object:

1. Participating random variables. These variables determine the probability distribution described in an SPO.

2. Probability Table. If only one random variable participates, it is a simple probability distribution table; otherwise the distribution will be joint. The probability table may be complete, when the information about the probability of every instance is supplied, or incomplete.

It is convenient to visualize the probability table P as a table of rows of the form (x̄, P(x̄)), where x̄ ∈ dom(V). Thus, we will speak about rows and columns of the probability table where it makes explanations more convenient.

(For Δ = [0,1] the consistency constraint states that the sum of probabilities in a complete probability distribution must add up to exactly 1.)

3. Conditional. A probability table may represent a distribution conditioned by some prior information. The conditional part of its SPO stores the prior information in one of two forms: "random variable u has value x" or "the value of random variable u is restricted to a subset X of its values". In our definition, this is represented as a pair (u, X). When X is a singleton set, we get the first type of condition.

4. Context. It provides supporting information for a probability distribution – information about the known values of certain parameters, which are not considered to be random variables by the application.

5. Origin or path. Each SPO in an SP-Database can either be inserted into the database directly, or can be a result of one or more SP-Algebra operations over already existing SPOs. When an SPO is inserted into the database, a unique identifier is assigned as its path. Whenever an SPO is created as a result of an SP-Algebra operation, its path is extended by the description of this operation. An SPO inserted into the database is called a base SPO. In Section 3.3 we introduce the syntax for complex path expressions that are formed when SP-Algebra operations are performed on SPOs.

Intuitively, a Semistructured Probabilistic Object represents a (possibly complex) probability distribution and the information associated with it. The actual distribution is described by the participating random variables and the probability table parts of the object. The conditional part, when non-empty, indicates that the object represents a conditional probability distribution and specifies the conditions. The context contains any non-stochastic information associated with the distribution. Finally, its path tells us how the object has been constructed. If the path is atomic (a single unique identifier), then the object was constructed from scratch and inserted into the database. Complex paths indicate which database objects participated in its creation and what SP-Algebra operations have been applied to obtain it. As examples throughout the dissertation will show, knowing how an object was constructed may help in the interpretation of its probability table.

Example 3.1 Consider the joint probability distribution of risk level based on Driver's Age (DA) and License Years (LY) for Asian drivers in Lexington who had either a severe or medium accident within the last 3 years, as defined in Figure 1.1. An SPO representing this probability distribution is shown in Figure 3.1.

We can break this information into our five constituent parts as follows:

participating random variables: {DA, LY}.

ω: S1
race: Asian
city: Lexington

DA  LY  P
A   A   0.09
A   B   0.12
A   C   0.03
A   F   0.005
B   A   0.12
B   B   0.16
B   C   0.13
B   F   0.01
C   A   0.03
C   B   0.08
C   C   0.11
C   F   0.045
F   A   0.0
F   B   0.01
F   C   0.02
F   F   0.04

DR ∈ {A, B}

Figure 3.1: Joint Probability Distribution of risk on Driver's Age and License Years for Asian drivers in Lexington.

probability table: as shown in Figure 3.1, the probability table defines a complete and consistent probability distribution.

conditional: there is a single conditional associated with this distribution, DR ∈ {A, B}, which is stored in the SPO as the pair (DR, {A, B}).

context: information about where the driver lives and the driver's race is not represented by random variables in our universe. They are, therefore, represented as the relational attributes city and race, respectively. Thus, city: Lexington and race: Asian are the context of the probabilistic information in this example.

path: assuming that this information is being added to the database, we associate with this SPO a unique identifier S1 that will serve as its path.
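One possible in-memory rendering of Definition 3.1, shown here as a Python sketch with the object of Example 3.1 as its data. The actual SPDBMS stores SPOs relationally; this class exists only to make the five components concrete.

```python
from dataclasses import dataclass

@dataclass
class SPO:
    context: dict        # T: context attribute -> value
    variables: tuple     # V: participating random variables
    table: dict          # P: tuple of variable values -> probability
    conditionals: dict   # C: random variable -> set of allowed values
    path: str            # omega: path expression recording the origin

s1 = SPO(
    context={"race": "Asian", "city": "Lexington"},
    variables=("DA", "LY"),
    table={("A", "A"): 0.09, ("A", "B"): 0.12, ("A", "C"): 0.03,
           ("A", "F"): 0.005, ("B", "A"): 0.12, ("B", "B"): 0.16,
           ("B", "C"): 0.13, ("B", "F"): 0.01, ("C", "A"): 0.03,
           ("C", "B"): 0.08, ("C", "C"): 0.11, ("C", "F"): 0.045,
           ("F", "A"): 0.0, ("F", "B"): 0.01, ("F", "C"): 0.02,
           ("F", "F"): 0.04},
    conditionals={"DR": {"A", "B"}},
    path="S1",
)
# The table is complete and consistent: the probabilities add up to 1.
assert abs(sum(s1.table.values()) - 1.0) < 1e-9
```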


3.3 Semistructured Probabilistic Algebra

Let us fix the universe 𝒱 of random variables, the universe 𝒜 of context attributes, and the probability space Δ. In the remainder of this section we will assume that Δ = [0,1].

A finite collection 𝒮 = {S_1, …, S_n} of semistructured probabilistic objects over (𝒱, 𝒜) is called a semistructured probabilistic relation (SP-relation). A finite collection 𝒟 = {𝒮_1, …, 𝒮_k} of SP-relations is called a semistructured probabilistic database (SP-database). One important difference between semistructured probabilistic databases and classic relational or relational probabilistic databases is that each table in a relational database has a specified schema, whereas all SP-relations are "schema-less": any collection of SPOs can form an SP-relation. Thus, the division of a semistructured probabilistic database into relations is a matter of the logic of a particular application. For example, if the SP-database is built from the information supplied by three different experts, this information can be arranged into three semistructured probabilistic relations according to the origin of each object inserted in the database. Alternatively, the information can be arranged in SP-relations by the date it was obtained.

The key to the successful management of probabilistic information is how efficiently we can retrieve the information stored in SP-databases. In probabilistic relational databases, Barbara et al. [5], Dey and Sarkar [25] and Lakshmanan et al. [58] define probabilistic relational algebras by extending the classical relational algebra. They add probability-specific (and probability theory compliant) manipulation of the probabilistic attributes in the relations. We likewise define a new semistructured probabilistic algebra for SP-databases, in order to capture properly the manipulation of probabilities. In the remainder of this section we introduce such an algebra, called the Semistructured Probabilistic Algebra (SP-Algebra). This algebra contains the three standard set operations, union, intersection and difference, and extends the definitions of the standard relational operations selection, projection, Cartesian product and join to account for the appropriate management and maintenance of probabilistic information within SPOs. In addition, a new operation, conditionalization (see also [25]), is defined in the SP-Algebra. This operation is specific to probabilistic databases and results in the construction of SPOs that represent conditional probability distributions of the input SPOs.

Before proceeding with the description of individual operations, we need to make an important

distinction among the notions of equality, equivalence and containment of SPOs. Two SPOs S and S′ are equal if all their components, including the paths, are equal. At the same time, only the first four components of any SPO, context, participating variables, probability table and conditional information, represent the real content of the object. The path merely records how the object was obtained in the database. It is possible to obtain, as a result of SP-Algebra operations, two SPOs with the same first four components but different paths. Such objects will not, technically, be equal. However, they would represent exactly the same probability distribution, and in many cases we could substitute one such object for another without any loss. We reserve the notion of equivalence of SPOs for such situations. In some cases, only three components of two SPOs, context, participating variables and conditional information, are exactly the same, but all rows in the probability table of one SPO are also contained in the probability table of the other. We use the notion of containment for such cases.

Definition 3.2 Let S = ⟨T, V, P, C, ω⟩ and S′ = ⟨T′, V′, P′, C′, ω′⟩ be two SPOs. S is equivalent to S′, denoted S ≡ S′, iff T = T′, V = V′, P = P′ and C = C′.

The notion of equivalence can be extended to semistructured probabilistic relations.

Definition 3.3 Let 𝒮 and 𝒮′ be two SP-relations. 𝒮 is equivalent to 𝒮′, denoted 𝒮 ≡ 𝒮′, iff for every SPO S ∈ 𝒮 there exists at least one equivalent SPO S′ ∈ 𝒮′, and for every SPO S′ ∈ 𝒮′ there exists at least one equivalent SPO S ∈ 𝒮.

Definition 3.4 Let S = ⟨T, V, P, C, ω⟩ and S′ = ⟨T′, V′, P′, C′, ω′⟩ be two SPOs. S is contained in S′, denoted S ⊑ S′, iff T = T′, V = V′, C = C′, and all rows in P are also contained in P′.

Proposition 3.1 Let S = ⟨T, V, P, C, ω⟩ and S′ = ⟨T′, V′, P′, C′, ω′⟩ be two SPOs. If S is equivalent to S′, then S is contained in S′ (S ⊑ S′) and S′ is contained in S (S′ ⊑ S).

¢ ¢ ¢¡ ¢ ¢ ¢ ¢ ¢ ¢ alent to ¢ , then is contained in ( ) and is contained in ( ).

3.3.1 Set Operations

Semistructured probabilistic relations are sets of SPOs. Because of this, the definitions of union, intersection and difference of SP-relations are straightforward.

Definition 3.5 Let 𝒮 and 𝒮′ be two SP-relations. Then,

Union: 𝒮 ∪ 𝒮′ = {S | S ∈ 𝒮 or S ∈ 𝒮′}.

Intersection: 𝒮 ∩ 𝒮′ = {S | S ∈ 𝒮 and S ∈ 𝒮′}.

Difference: 𝒮 − 𝒮′ = {S | S ∈ 𝒮 and S ∉ 𝒮′}.

Containment Elimination: {S ∈ 𝒮 | there is no S′ ∈ 𝒮, S′ ≠ S, such that S ⊑ S′}.

We note two features of the set operations in SP-Algebra. Classical relational algebra has a restriction on the applicability of the set operations: they are defined only on pairs of relations with matching schemas. Because SP-relations are schema-less and represent logical rather than syntactic groupings of probability distributions in an SP-database, set operations are applicable to any pair of SP-relations. Another note is that set operations do not leave their imprint on the path component of individual SPOs.
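Below is a sketch of the containment elimination operation over the illustrative SPO rendering introduced in Section 3.2. Its treatment of ties — two mutually contained (i.e., equivalent) objects eliminate each other — is a simplifying assumption of this sketch, not a claim about the SPDBMS.

```python
def contained_in(s, t):
    """Definition 3.4: equal context, variables and conditionals, and every
    (row, probability) pair of s's table also appears in t's table."""
    return (s.context == t.context and s.variables == t.variables
            and s.conditionals == t.conditionals
            and all(t.table.get(row) == p for row, p in s.table.items()))

def containment_elimination(relation):
    """Keep only SPOs not contained in a different SPO of the relation."""
    return [s for s in relation
            if not any(s is not t and contained_in(s, t) for t in relation)]
```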

3.3.2 Selection

Given an SPO S = ⟨T, V, P, C, ω⟩, a selection query may be issued against any part of the object, save for the path. Each part requires its own language of selection conditions.

Given an SPO S, selection on context, participating random variables and conditionals will result in either S being selected or not in its entirety (as is the case with selection on classical relations). It is also possible to select a subset of rows of the probability table, based either on the values of random variables or on the probability values of individual rows in the probability table. Such selection operations may return only part of the original probability table P, while keeping context, conditionals and participating random variables intact. For any selection operation, the path expression of the result will be updated to indicate the selection operation. The five different types of selection are illustrated in the following example.

Example 3.2 Consider the SPO S described in Example 3.1. Five different types of selection queries are illustrated below.

1. "Find all probability distributions related to Asian drivers." The context of S contains the entry race: Asian, therefore S matches the selection condition.

2. "Find all probability distributions that involve the Driver Age aspect." As DA is one of the participating random variables of S, S matches the selection condition.

3. "Find all probability distributions related to drivers who had a medium or severe accident within the last 3 years." The conditional part of S contains the expression DR ∈ {A, B}, which matches the selection condition ("medium or severe accident within the last 3 years").

4. "What information is available about the risk when granting insurance to a driver with less than one year of driving experience?" The probability table of S contains four entries that relate to the risk for drivers with less than one year of driving experience. This part of the probability table should be returned as a result, together with the context, participating variables and conditional parts of S. The remainder of the probability table should not be returned.

As an alternative, consider the query, "what is the risk when granting insurance to a 30-year-old driver with 5 years of driving experience?" The answer to this query on S contains only one line from the probability table of S, for the appropriate combination of values.

5. "What outcomes have probability over 0.1?" In the probability table of S, there are five possible outcomes with probability greater than 0.1. In the result of executing this query on S, the probability table should contain exactly these five rows, with the context, participating variables and conditional parts remaining unchanged.

a. Selection on Context, Participating Variables or Conditionals

Here, we define the three selection operations that do not alter the content of the selected objects. We start by defining the acceptable languages of selection conditions for the three types of

selects.



Recall that the universe 𝒜 of context attributes consists of a finite set of attributes A_1, …, A_m with domains dom(A_1), …, dom(A_m). With each attribute A ∈ 𝒜 we associate a set Pred(A) of allowed predicates. We assume that equality and inequality are allowed for all A ∈ 𝒜.

Definition 3.6
1. An atomic context selection condition is an expression c of the form A op x, where A ∈ 𝒜, x ∈ dom(A) and op ∈ Pred(A).
2. An atomic participation selection condition is an expression c of the form v ∈ V, where v ∈ 𝒱 is a random variable.
3. An atomic conditional selection condition is one of the following expressions: u = {x_1, …, x_k} or u, where u ∈ 𝒱 is a random variable and x_1, …, x_k ∈ dom(u).

Complex selection conditions can be formed as Boolean combinations of atomic selection

conditions.

Definition 3.7 Let S = ⟨T, V, P, C, ω⟩ be an SPO and let c ≡ (A op x) be an atomic context selection condition. Let ω′ = σ_c(ω) and let S′ = ⟨T, V, P, C, ω′⟩. Then σ_c(S) = S′ iff

the attribute A occurs in T;

for some instance (A : a) of A in T, a op x holds;

otherwise σ_c(S) = ∅.

Definition 3.8 Let S = ⟨T, V, P, C, ω⟩ be an SPO and let c ≡ (v ∈ V) be an atomic participation selection condition. Let ω′ = σ_c(ω) and let S′ = ⟨T, V, P, C, ω′⟩. Then σ_c(S) = S′ iff v ∈ V.

Definition 3.9
1. Let S = ⟨T, V, P, C, ω⟩ be an SPO and let c ≡ (u = {x_1, …, x_k}) be an atomic conditional selection condition. Let ω′ = σ_c(ω) and let S′ = ⟨T, V, P, C, ω′⟩. Then σ_c(S) = S′ iff (u, X) ∈ C and X = {x_1, …, x_k}.
2. Let c ≡ u be an atomic conditional selection condition. Then σ_c(S) = S′ iff (u, X) ∈ C for some X ⊆ dom(u).

b. Selection on Probability Table or Probabilities

Selection operations considered in the previous sections were simple in that their result on a semistructured probabilistic relation was always a subset of the relation. The two types of selection introduced here are more complex. The result of each operation applied to an SPO can be a non-empty part of the original SPO. In particular, both operations preserve the context, participating random variables and conditionals of an SPO, but may return only a subset of the rows of the probability table. In both selection on the probability table and selection on probabilities, the selection condition indicates which rows are to be included and which are to be omitted.

Definition 3.10 An atomic probabilistic table selection condition is an expression of the form v = x, where v ∈ 𝒱 and x ∈ dom(v). Probabilistic table selection conditions are Boolean combinations of atomic probabilistic table selection conditions.

Definition 3.11 Let S = ⟨T, V, P, C, ω⟩ be an SPO with V = {v_1, …, v_n}, and let c ≡ (v = x) be an atomic probabilistic table selection condition. If v ∈ V (assume v = v_j), then σ_c(S), the result of selection from S on c, is a semistructured probabilistic object S′ = ⟨T, V, P′, C, ω′⟩, where

P′(x_1, …, x_j, …, x_n) = P(x_1, …, x_j, …, x_n) if x_j = x, and is undefined otherwise,

and ω′ = σ_c(ω).

Definition 3.12 An atomic probabilistic selection condition is an expression of the form P op α, where α ∈ [0,1] and op ∈ {=, ≠, <, ≤, >, ≥}. Probabilistic selection conditions are Boolean combinations of atomic probabilistic selection conditions.

Definition 3.13 Let S = ⟨T, V, P, C, ω⟩ be an SPO and let c ≡ (P op α) be an atomic probabilistic selection condition. Let x̄ ∈ dom(V). The result of selection from S on c is defined as follows: σ_{P op α}(S) = S′ = ⟨T, V, P′, C, ω′⟩, where

P′(x̄) = P(x̄) if P(x̄) op α, and is undefined otherwise,

and ω′ = σ_c(ω).
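Definitions 3.11 and 3.13 both keep the context, participating variables and conditionals and filter only the table rows. Below is a sketch over the illustrative SPO rendering from Section 3.2, with path bookkeeping simplified to a string.

```python
def select_table(spo, var, value):
    """Definition 3.11: keep only the rows in which `var` takes `value`."""
    i = spo.variables.index(var)
    rows = {row: p for row, p in spo.table.items() if row[i] == value}
    return SPO(spo.context, spo.variables, rows, spo.conditionals,
               f"select[{var}={value}]({spo.path})")

def select_prob(spo, op, threshold):
    """Definition 3.13: keep only the rows whose probability satisfies `op`."""
    ops = {">": lambda p: p > threshold, "<": lambda p: p < threshold,
           ">=": lambda p: p >= threshold, "<=": lambda p: p <= threshold}
    rows = {row: p for row, p in spo.table.items() if ops[op](p)}
    return SPO(spo.context, spo.variables, rows, spo.conditionals,
               f"select[P{op}{threshold}]({spo.path})")

# The two queries of Example 3.3, run against the SPO of Example 3.1:
# select_table(s1, "DA", "A") and select_prob(s1, ">", 0.11)
```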

Example 3.3 Figure 3.2 shows two examples of selection queries on an SPO. The central object is obtained from the original SPO (left) as the result of the query, "Find all information about the risk when granting insurance to a 19-year-old driver", denoted σ_{DA=A}(S). In the probability table of the resulting SPO, only the rows that have the value of the DA random variable equal to A remain.

ω: S              ω: σ_{DA=A}(S)     ω: σ_{P>0.11}(S)
race: Asian       race: Asian        race: Asian
DA  LY  P         DA  LY  P          DA  LY  P
A   A   0.10      A   A   0.10       B   A   0.13
A   B   0.10      A   B   0.10       C   C   0.16
B   A   0.13      DR = B             DR = B
B   C   0.09
C   C   0.16
DR = B

Figure 3.2: Selection on Probabilistic Table and on Probability values in SP-Algebra

The rightmost object in the figure is the result of the query "Find all combinations whose probability is greater than 0.11." This query can be written as σ_{P>0.11}(S). The probability table of the resulting object will contain only those rows from the original probability table where the probability value was greater than 0.11.

Proposition 3.2 Let c be an arbitrary selection condition and let S be an SPO. Then σ_c(S) ⊑ S.

SP-Algebra operations can be extended to a semistructured probabilistic relation, as described in the following proposition.

Proposition 3.3 Any SP-Algebra operation on a semistructured probabilistic relation is equivalent to the union of the SP-Algebra operation on each SPO in the SP-relation.

Let 𝒮 be a semistructured probabilistic relation and op be one of the three unary SP-Algebra operators. Then op(𝒮) = ∪_{S ∈ 𝒮} op(S).

Let 𝒮 and 𝒮′ be two semistructured probabilistic relations and ⊗ be one of the two binary SP-Algebra operators. Then 𝒮 ⊗ 𝒮′ = ∪_{S ∈ 𝒮, S′ ∈ 𝒮′} (S ⊗ S′).

The semantics of atomic selection conditions can be extended to their Boolean combinations in a straightforward manner: σ_{c ∧ c′}(𝒮) = σ_c(σ_{c′}(𝒮)) and σ_{c ∨ c′}(𝒮) = σ_c(𝒮) ∪ σ_{c′}(𝒮).

The interpretation of negation in context selection conditions requires some additional explanation. In order for a selection condition of the form A op x to succeed on some SPO S = ⟨T, V, P, C, ω⟩, the attribute A must be present in T. If A is not in T, the selection condition does not get evaluated and the result will be ∅. Therefore, the statement σ_{¬c}(𝒮) = 𝒮 − σ_c(𝒮) is not necessarily true. This also applies to conditional selection conditions.

Different selection operations commute, as shown in the following theorem.

Theorem 3.1 Let c and c′ be two selection conditions and let 𝒮 be a semistructured probabilistic relation. Then σ_c(σ_{c′}(𝒮)) ≡ σ_{c′}(σ_c(𝒮)).

Proof of Theorem 3.1
Let c and c′ be two atomic selection conditions. There are 5 types of atomic selection conditions, namely context, participation, conditional, table and probability. Selection on context, participation or conditional will result in entire SPOs being selected, while selection on table or probability will select only parts of the relevant SPOs. We divide the conditions into two groups:

Group 1, containing context, participation and conditional conditions, and

Group 2, containing table and probability conditions.

First we prove that σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) for a single SPO S.

Case 1. Both conditions c and c′ are in Group 1.

There are four possible combinations for whether each condition is satisfied:

a) S satisfies c but not c′, or

b) S satisfies c′ but not c, or

c) S satisfies both c and c′, or

d) S satisfies neither c nor c′.

By the definition of selection on atomic selection conditions in Group 1, we know that selection on these conditions will result in the entire SPO being selected, or none of it.

For case (a), since S does not satisfy c′, σ_{c′}(S) returns empty, and subsequently σ_c(σ_{c′}(S)) will return empty. Since σ_c(S) returns S, we see that σ_{c′}(σ_c(S)) will also return empty, for the same reason. Thus, σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) holds for case (a). The same applies to case (b). Similarly, σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) holds for case (d). For case (c), σ_c(σ_{c′}(S)) returns S, and σ_{c′}(σ_c(S)) returns S too. Thus, σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) holds for case (c).

So σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) holds for all these cases.

Case 2. Condition c is in Group 1 and condition c′ is in Group 2.

There are only two possible combinations for whether each condition is satisfied, assuming that condition c′ is always partially satisfied:

a) S does not satisfy c, or

b) S satisfies c.

By the definition of selection on atomic selection conditions in both Group 1 and Group 2, we know that selection on conditions in Group 1 will result in the entire SPO being selected or not, while selection on conditions in Group 2 will preserve all the context, participating random variables and conditionals in the original SPO, but produce only a part of the probability table.

Let S″ = σ_{c′}(S), where S″ has the part of the probability table that satisfies the condition c′ and retains all the context, participating random variables and conditionals of S.

For case (a), σ_c(σ_{c′}(S)) = σ_c(S″) will return empty, since S″ does not satisfy the condition c either. Since σ_c(S) returns empty, subsequently σ_{c′}(σ_c(S)) will also return empty. This proves σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) for case (a). For case (b), σ_c(σ_{c′}(S)) will return S″, since S″ also satisfies the condition c. Since σ_c(S) returns S, σ_{c′}(σ_c(S)) will also return S″. This proves that σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) holds for case (b).

So σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) holds for both cases.

Case 3. Both c and c′ are conditions in Group 2.

Assume that both conditions c and c′ are partially satisfied by S. By the definition of selection on atomic selection conditions in Group 2, we know that selection on these conditions will result in part of the probability table and will preserve all the context, participating random variables and conditionals in the original SPO. In other words, all the components of the original SPO except the probability table and the path will be preserved.

Let S = ⟨T, V, P, C, ω⟩. Then σ_c(σ_{c′}(S)) = ⟨T, V, P_1, C, ω_1⟩ and σ_{c′}(σ_c(S)) = ⟨T, V, P_2, C, ω_2⟩, with P_1 = σ_c(σ_{c′}(P)) and P_2 = σ_{c′}(σ_c(P)), where σ here denotes the relational selection operator applied to the probability table. Since the relational selection operator is commutative, σ_c(σ_{c′}(P)) = σ_{c′}(σ_c(P)). Therefore P_1 = P_2, and so σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) holds for this case.

This proves σ_c(σ_{c′}(S)) ≡ σ_{c′}(σ_c(S)) for a single SPO. Now let 𝒮 = {S_1, …, S_n} be an SP-relation. Since the union operator is commutative,

σ_c(σ_{c′}(𝒮)) = ∪_{i=1}^{n} σ_c(σ_{c′}(S_i)) ≡ ∪_{i=1}^{n} σ_{c′}(σ_c(S_i)) = σ_{c′}(σ_c(𝒮)).

This proves σ_c(σ_{c′}(𝒮)) ≡ σ_{c′}(σ_c(𝒮)).

3.3.3 Projection

Just as with selection, the results of the projection operation differ depending on which parts of an SPO are to be projected out. Projections on context and conditionals are similar to the traditional relational algebra projection: either a context attribute or a conditional is removed from an SPO, which does not change otherwise. These operations change the semantics of the SPO and thus must be used with caution. (Note that removing attributes from the relations in a relational database system also changes the semantics of the data.)

database system also changes the semantics of the data.)

¢¨§

¥    ¡ ¤£ ¤¥ ¦ ¦

Definition 3.14 Let ¢ be an SPO and let be a set of context attributes.

 

¢ ¦ # © ¢ ¢    ¡ ¤£ ¤¥ ¦ ¥

The projection of onto , denoted is an SPO where

   

 © £¡ ¥ © £¡  ©  ¦ 

, i.e., contains all entries from for attributes from the

list ¦ only.

 

¥ # © ¥

.

¥    ¡ ¤£ ¤¥ ¦

Definition 3.15 Let ¢ be an SPO and let be a set of conditionals. The pro-

¡

 



¢ # ¡ ¢ ¢ ¥    ¡ ¤£ ¤¥ ¦

jection of the conditional part of onto , denoted is an SPO

where

   

 ¡  ¥ ¡   £ ¥¡  £

.

  The list of variables is also used in the projection onto the participating random variables. The syntax  is chosen to distinguish between the two types of projection operations.

43

 



¡

# ¥ ¥ .

A somewhat more difficult and delicate operation is the projection onto the set of participating random variables. Removal of a random variable from the SPO's participant set entails that information related to this random variable has to be removed from the probability table as well. Informally, this corresponds to removing one random variable from consideration in a joint probability distribution, which is usually called marginalization. The result of this operation is a new marginal probability distribution that needs to be stored in the probability table component of the resulting SPO. This computation is performed in two steps. First, the columns for the random variables that are to be projected out are removed from the probability table. In the remainder of the table, there can now be duplicate rows: rows in which all values (except for the probability value) coincide. All duplicate rows of the same type are then collapsed (coalesced) into one, with the new probability value computed as the sum of the values in the collapsed rows.

A formal definition of this procedure is given below.

Definition 3.16 Let S = ⟨T, V, P, C, ω⟩ be an SPO with V = {v_1, …, v_n}, and let U ⊆ V, U ≠ ∅. The projection of S onto U, denoted π_U(S), is defined to be an object S′ = ⟨T, U, P′, C, ω′⟩, where for each ȳ ∈ dom(U),

P′(ȳ) = Σ { P(x̄) | x̄ ∈ dom(V), x̄ restricted to U equals ȳ, and P(x̄) is defined },

and ω′ = π_U(ω).

Notice that projection on the set of participants is allowed only if the set of participants is not a singleton and if at least one random variable remains in the resulting set.

Example 3.4 Figure 3.3 illustrates how projection onto the set of participating random variables works. First, the columns of random variables to be projected out are removed from the probability table (step I). Next, the remaining rows are coalesced (step II). After the Vehicle Type (VT) random variable has been projected out, the interim probability table has three rows (B, A), with probabilities 0.07, 0.02 and 0.04 respectively. These rows are combined into one row with the probability value set to 0.07 + 0.02 + 0.04 = 0.13. Similar operations are performed on the other rows.


[Figure 3.3 shows the original SPO $S$ (left, conditioned on DR = B), the intermediate table after the VT column is removed in step I (center), and the coalesced result $\pi_{DA,LY}(S)$ (right).]

Figure 3.3: Projection on the set of participating random variables in SP-Algebra

3.3.4 Conditionalization

Conditionalization is an operation specific to probabilistic algebras. Similarly to the projection operation, conditionalization reduces the probability distribution table. The difference is that the result of conditionalization is a conditional probability distribution. Given a joint probability distribution, conditionalization answers the following general query: "What is the probability distribution of the remaining random variables if the value of some random variable $u$ in the distribution is restricted to a subset of its values?"

Informally, the conditionalization operation proceeds on a given SPO as follows. The input to the operation is one participating random variable of the SPO, $u$, and a subset $X$ of its values. The first step of conditionalization consists of removing from the probability table of the SPO all rows whose $u$ values are not from the set $X$. Then the $u$ column is removed from the table. The remaining rows are coalesced (if needed) in the same manner as in the projection operation and the probability values are normalized. Finally, $(u \in X)$ is added to the set of conditionals of the resulting SPO. The formal definition of conditionalization is given below. Note that if the original table is incomplete, there is no meaningful way to normalize a conditionalized probability distribution. Thus, we restrict this operation to situations where normalization is well-defined.


Definition 3.17 An SPO $S = \langle T, V, P, C, \omega\rangle$ is conditionalization-compatible with an atomic conditional selection condition $c = (u \in \{x_1, \ldots, x_k\})$ iff

• $u \in V$;

• the restriction of $P$ to the instances in which $u$ takes a value from $\{x_1, \ldots, x_k\}$ is a complete function.

Definition 3.18 Let $S = \langle T, V, P, C, \omega\rangle$ be an SPO which is conditionalization-compatible with an atomic conditional selection condition $c = (u \in \{x_1, \ldots, x_k\})$. The result of conditionalization of $S$ by $c$, denoted $\mu_c(S)$, is defined as follows: $\mu_c(S) = S' = \langle T, V', P', C', \omega'\rangle$, where

• $V' = V - \{u\}$;

• $C' = C \cup \{(u, \{x_1, \ldots, x_k\})\}$;

• $\omega' = \mu_c(\omega)$.

Let
$$N = \sum_{\bar{y} \in dom(V'),\ x \in \{x_1, \ldots, x_k\},\ P(\bar{y}, x)\ \text{is defined}} P(\bar{y}, x).$$
For any $\bar{y} \in dom(V')$,
$$P'(\bar{y}) = \frac{\sum_{x \in \{x_1, \ldots, x_k\}} P(\bar{y}, x)}{N}.$$

Conditionalization can be extended to a semistructured relation in a straightforward manner. Given a relation $R$, $\mu_c(R)$ will consist of $\mu_c(S)$ for each $S \in R$ that is conditionalization-compatible with $c$. SPOs not conditionalization-compatible with $c$ will not be included in $\mu_c(R)$.

Example 3.5 Consider the SPO $S$ defined in Example 3.1 describing the joint probability distribution of risk on Driver's Age (DA) and License Years (LY) for Asian drivers in Lexington who had either a severe or medium accident within the last 3 years. We derive the probability distribution for drivers with less than one year of driving experience. Figure 3.4 depicts the work of the conditionalization operation $\mu_{LY = A}(S)$. The original object is shown on the left. As $P$ is a complete distribution, $S$ is conditionalization-compatible with $LY = A$. The first step of conditionalization consists of removing from $P$ all rows that do not satisfy the conditionalization condition (the result is depicted in the center). Then, in step II, the LY column is dropped from the table, the probability values in the remaining rows are normalized, and $LY = A$ is added to the list of conditionals. The rightmost object in Figure 3.4 shows the final result.

[Figure 3.4 shows the original SPO $S$ over DA and LY (left), the table after the rows with LY ≠ A are removed in step I (center), and the final result $\mu_{LY=A}(S)$ (right), with DA probabilities 0.375, 0.5, 0.125 and 0 after normalization and with LY = A added to the conditionals.]

Figure 3.4: Conditionalization in SP-Algebra.
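A sketch of the same operation in code may clarify the sequence of steps; it follows Definition 3.18 over the dictionary encoding used in the earlier sketch (the names are hypothetical, not the SPDBMS API):

# Sketch of conditionalization (Definition 3.18): restrict variable u
# to a set of values, drop its column, coalesce, and renormalize.
def conditionalize(variables, table, u, allowed):
    assert u in variables
    pos = variables.index(u)
    keep_idx = [i for i, v in enumerate(variables) if v != u]
    kept = {}
    for row, p in table.items():
        if row[pos] in allowed:                          # step I: filter rows
            reduced = tuple(row[i] for i in keep_idx)    # drop the u column
            kept[reduced] = kept.get(reduced, 0.0) + p   # coalesce
    total = sum(kept.values())                           # normalization constant N
    if total == 0:
        raise ValueError("no probability mass satisfies the condition")
    new_vars = [v for v in variables if v != u]
    new_table = {row: p / total for row, p in kept.items()}
    new_conditional = (u, set(allowed))  # (u in allowed) joins the conditionals
    return new_vars, new_table, new_conditional

# Example 3.5: conditioning the (DA, LY) table on LY = A yields the DA
# probabilities 0.375, 0.5, 0.125 and 0 after dividing by N = 0.24.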

3.3.5 Cartesian Product

Sometimes an SP-database has only simple probability distributions for some random variables. In order to get a joint probability distribution, either a Cartesian product or a join operation has to be performed on the SPOs storing these distributions. Intuitively, a Cartesian product or join of two probabilistic distributions is the joint probability distribution of the random variables involved in both original distributions. Cartesian product is defined only on pairs of compatible SPOs. Here, we restrict ourselves to the assumption of independence between the probability distributions in Cartesian products. This restriction allows us to represent the result as a point probability distribution. In general, given two events $e_1$ and $e_2$ and their point probabilities $p_1$ and $p_2$, the probability of their conjunction lies in the interval $[\max(0, p_1 + p_2 - 1), \min(p_1, p_2)]$. One can obtain a point probability for $e_1 \wedge e_2$ only if a specific relationship between $e_1$ and $e_2$, such as independence or positive or negative correlation, is known to exist between them, or is assumed.

Two SPOs are compatible for Cartesian product if their participating variables are disjoint, but their conditionals coincide. When the sets of participating variables are not disjoint, we will use the join operation instead of the Cartesian product to find, for instance, the Driver's Age joint probability distribution.

Definition 3.19 Two SPOs $S = \langle T, V, P, C, \omega\rangle$ and $S' = \langle T', V', P', C', \omega'\rangle$ are Cartesian product-compatible (cp-compatible) iff $V \cap V' = \emptyset$ and $C = C'$.

We can now define the Cartesian product.

Definition 3.20 Let $S = \langle T, V, P, C, \omega\rangle$ and $S' = \langle T', V', P', C', \omega'\rangle$ be two cp-compatible SPOs. Then, the result of their Cartesian product (under the assumption of independence), denoted $S \times S'$, is defined as follows: $S \times S' = S'' = \langle T'', V'', P'', C'', \omega''\rangle$, where

• $T'' = T \cup T'$;

• $V'' = V \cup V'$;

• $C'' = C = C'$;

• for all $\bar{x} \in dom(V)$ and $\bar{y} \in dom(V')$:
$$P''(\bar{x}, \bar{y}) = P(\bar{x}) \cdot P'(\bar{y});$$

• $\omega'' = \omega \times \omega'$.
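Under the independence assumption the table computation is just a pairwise product of probabilities. A minimal sketch, using the same hypothetical dictionary encoding as the earlier sketches:

# Sketch of the Cartesian product of two cp-compatible probability
# tables under the independence assumption (Definition 3.20).
from itertools import product

def cartesian_product(vars1, table1, vars2, table2):
    assert not set(vars1) & set(vars2), "participating variables must be disjoint"
    new_vars = vars1 + vars2
    new_table = {r1 + r2: p1 * p2
                 for (r1, p1), (r2, p2) in product(table1.items(), table2.items())}
    return new_vars, new_table

# Two single-variable distributions combined into a joint one:
print(cartesian_product(["DA"], {("A",): 0.6, ("B",): 0.4},
                        ["VT"], {("A",): 0.5, ("B",): 0.5}))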

3.3.6 Join

Join is also defined only on pairs of compatible SPOs. Two SPOs are join-compatible if they share some participating variables (these will be the "join attributes") and their conditionals coincide.

Definition 3.21 Two SPOs $S = \langle T, V, P, C, \omega\rangle$ and $S' = \langle T', V', P', C', \omega'\rangle$ are join-compatible iff $V \cap V' \neq \emptyset$ and $C = C'$.

Given two join-compatible SPOs $S$ and $S'$, we can break the set $V \cup V'$ into three non-empty disjoint parts: $V_1 = V - V'$, $V_2 = V' - V$, and $V_3 = V \cap V'$. The information about the probability distribution of the random variables in $V_3$ can be found in both $S$ and $S'$. The join operation must take this into consideration when the joint probability distribution for the variables in $V \cup V'$ is computed. The key to computing the joint distribution correctly is the statement in Lemma 3.1 below.

Based on probability theory, we give a new definition of the join operation in the current framework. We want the join of $S$ and $S'$ to contain the joint probability distribution of the set $V \cup V'$. Since the marginal probability of an instance of $V_3$ could be obtained either from $S$ or from $S'$, there exist two families of join operations, called left join, $\bowtie_l$, and right join, $\bowtie_r$, with the following definitions. The only difference between the two join operations is the probability distribution.

Definition 3.22 Let $S = \langle T, V, P, C, \omega\rangle$ and $S' = \langle T', V', P', C', \omega'\rangle$ be two join-compatible SPOs. Let $V_1 = V - V'$, $V_2 = V' - V$ and $V_3 = V \cap V'$, i.e., $V = V_1 \cup V_3$ and $V' = V_2 \cup V_3$. We define the operations of left join of $S$ and $S'$, denoted $S \bowtie_l S'$, and right join of $S$ and $S'$, denoted $S \bowtie_r S'$, as follows:
$$S \bowtie_l S' = \langle T'', V'', P'', C'', \omega \bowtie_l \omega'\rangle; \qquad S \bowtie_r S' = \langle T'', V'', P''', C'', \omega \bowtie_r \omega'\rangle,$$
where

• $T'' = T \cup T'$;

• $V'' = V \cup V' = V_1 \cup V_2 \cup V_3$;

• $C'' = C = C'$;

• for all $\bar{x} \in dom(V_1)$, $\bar{y} \in dom(V_2)$, $\bar{z} \in dom(V_3)$:
$$P''(\bar{x}, \bar{y}, \bar{z}) = \frac{P(\bar{x}, \bar{z}) \cdot P'(\bar{y}, \bar{z})}{P'_{V_3}(\bar{z})}; \qquad P'''(\bar{x}, \bar{y}, \bar{z}) = \frac{P(\bar{x}, \bar{z}) \cdot P'(\bar{y}, \bar{z})}{P_{V_3}(\bar{z})},$$
where $P_{V_3}$ and $P'_{V_3}$ denote the marginal distributions of $V_3$ computed from $P$ and from $P'$, respectively.

Two join-compatible SPOs are join-consistent if the probability distributions over the set of shared participating variables are identical for both SPOs.

Definition 3.23 Let $S = \langle T, V, P, C, \omega\rangle$ and $S' = \langle T', V', P', C', \omega'\rangle$ be two join-compatible SPOs with $V_3 = V \cap V'$. Then, $S$ and $S'$ are join-consistent iff $P_{V_3}(\bar{z}) = P'_{V_3}(\bar{z})$ for any $\bar{z} \in dom(V_3)$.

Example 3.6 Consider the two simple SPOs $S$ and $S'$ presented in Figure 3.5. $S$ and $S'$ share one random variable (LY) and their conditional parts coincide (DR = B). Hence, $S$ and $S'$ are join-compatible.

[Figure 3.5 shows the SPO $S$ over (DA, LY) with all probabilities 0.25, the SPO $S'$ over (LY, VT) with probabilities 0.2 and 0.3, their marginals $\pi_{LY}(S)$ (A: 0.5, B: 0.5) and $\pi_{LY}(S')$ (A: 0.4, B: 0.6), and the resulting joins $S \bowtie_l S'$ and $S \bowtie_r S'$ over (DA, LY, VT).]

Figure 3.5: Join operations in SP-Algebra

The results of the two join operations of $S$ and $S'$, $S \bowtie_l S'$ and $S \bowtie_r S'$, are presented in the rest of Figure 3.5. In the resulting SPOs, the context is the union of the contexts of the two original objects and the conditional part is the same as in $S$ and $S'$. The probability table is formed by first projecting the shared variable set from one of the original SPOs. For the left join, we first perform projection on all the variables in $V_3$, in this case LY, against the right operand SPO $S'$, and save the result in a temporary SPO object temp,
$$temp = \pi_{V_3}(S'),$$
as shown in the center of the figure. Then we formulate the probability table by using the definition of left join, which is $P''(\bar{x}, \bar{y}, \bar{z}) = P(\bar{x}, \bar{z}) \cdot P'(\bar{y}, \bar{z}) / P_{temp}(\bar{z})$. The right join can be computed in the same manner. The respective join results are shown in the last two columns of Figure 3.5. One can see that these two SPOs are not join-consistent.
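The following is a small sketch of both join flavors, again over the hypothetical dictionary encoding from the earlier sketches (assuming, as reconstructed above, that the left join divides by the right operand's shared-variable marginal and the right join by the left operand's):

# Sketch of left and right join (Definition 3.22).
def marginal(variables, table, shared):
    idx = [variables.index(v) for v in shared]
    m = {}
    for row, p in table.items():
        key = tuple(row[i] for i in idx)
        m[key] = m.get(key, 0.0) + p
    return m

def join(vars1, t1, vars2, t2, left=True):
    shared = [v for v in vars1 if v in vars2]
    assert shared, "join-compatible SPOs must share variables"
    # left join uses the right operand's marginal as divisor, and vice versa
    div = marginal(vars2, t2, shared) if left else marginal(vars1, t1, shared)
    i1 = [vars1.index(v) for v in shared]
    i2 = [vars2.index(v) for v in shared]
    only2 = [i for i, v in enumerate(vars2) if v not in shared]
    out_vars = vars1 + [vars2[i] for i in only2]
    out = {}
    for r1, p1 in t1.items():
        z = tuple(r1[i] for i in i1)
        for r2, p2 in t2.items():
            if tuple(r2[i] for i in i2) == z and div.get(z, 0.0) > 0:
                out[r1 + tuple(r2[i] for i in only2)] = p1 * p2 / div[z]
    return out_vars, out

# Example 3.6: since pi_LY(S) = (0.5, 0.5) differs from pi_LY(S') =
# (0.4, 0.6), the left and right joins of S and S' produce different
# tables — the two SPOs are not join-consistent.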

3.4 Semantics of SP-Algebra Operations

The problem of determining the meaning of the results of the operations of SP-Algebra is complicated by the fact that at any moment, SP-databases can contain SPOs of two types. In the SPOs of the first type, the probabilities of all rows are exact, while in the SPOs of the second type, the probabilities of some rows may represent the lower bounds on the probability of those instances. We start this section by defining the two types of SPOs formally, discussing their properties and the

effects that different SP-Algebra operations have on the SPOs in light of all this.

Definition 3.24 An SPO $S = \langle T, V, P, C, \omega\rangle$ is a Type I SPO iff
$$\sum_{\bar{x} \in dom(V),\ P(\bar{x})\ \text{is defined}} P(\bar{x}) = 1.$$
Otherwise, $S$ is a Type II SPO.

When $S$ is a Type I SPO, its probability table is complete: the probabilities of all rows add up to exactly 1. The probability table may contain a row for every instance $\bar{x} \in dom(V)$, or it may omit some of the instances. However, because the probabilities of the rows present in the table add up to 1, we know that the probabilities of all omitted rows are 0, and these rows can be added to the probability table of $S$. Basically, when $S$ is a Type I SPO, we are guaranteed that for every $\bar{x} \in dom(V)$, $P(\bar{x})$ is the exact point probability of instance $\bar{x}$.

The nature of Type II SPOs is somewhat more complex. The fact that the sum of the probabilities in all rows of the probability table is less than 1 means that the probability table is missing some information. This can either be missing instances: some $\bar{x} \in dom(V)$ has a non-zero probability but is not included in the probability table of $S$; or underestimation: all possible instances are present, but the probabilities add up to less than 1, which means that the information about the probabilities of some (possibly all) instances presents only a lower bound on the true probability of the instance in the distribution.
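A sketch of the Type I / Type II distinction in code, under the same hypothetical table encoding as before (with a floating-point tolerance that the formal definition does not need):

# Sketch of Definition 3.24: an SPO is Type I iff its stated
# probabilities sum to exactly 1.
def spo_type(table, eps=1e-9):
    total = sum(table.values())
    if total > 1 + eps:
        raise ValueError("inconsistent SPO: probabilities sum to more than 1")
    return "I" if abs(total - 1.0) <= eps else "II"

print(spo_type({("A","A"): 0.1, ("A","B"): 0.2, ("B","A"): 0.33,
                ("B","C"): 0.22, ("C","C"): 0.15}))      # Type I (sums to 1)
print(spo_type({("A","A"): 0.1, ("A","B"): 0.2, ("C","C"): 0.15}))  # Type II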

It is important to note here that SP-Algebra operations allow Type II SPOs to occur in the SP-database, even if all original SPOs in the database were Type I. We illustrate this with the following example.

Example 3.7 Consider the SPO $S$: the left-most SPO depicted in Figure 3.6. The probability table $P$ of $S$ contains only 5 of the $|dom(DA)| \cdot |dom(LY)|$ possible instances, which means that not all instances are present in $P$. However, because the probabilities of all rows present in $P$ add up to exactly 1, $S$ is a Type I SPO and the probabilities of all instances not in $P$ are 0. We also can be assured that each probability is exact.

Consider now the central SPO $S' = \sigma_{P \leq 0.2}(S)$ in Figure 3.6. Here, only the rows with probability value less than or equal to 0.2 are selected from the probability table of $S$. There are 3 such rows, for a combined probability of 0.45. Therefore, $S'$ is a Type II SPO. We note here that, despite being of Type II, the probability of each row is exact. Consider now the SPO $S'' = \pi_{LY}(S') = \pi_{LY}(\sigma_{P \leq 0.2}(S))$ shown on the right side of Figure 3.6. The projection operation leads to the removal of the DA random variable from $S'$. However, because each row of the table of $S'$ had a different value for LY, the table of $S''$ will have three rows, one for each value of LY: A, B and C. While the probability table of $S''$ has no missing rows, the probabilities add up to the same value of 0.45 as in $S'$, and therefore $S''$ is also a Type II SPO. More importantly, the rows for A and C contain incomplete probability — applying $\pi_{LY}$ to $S$ we can see that the probability of the value A of LY is 0.43 and the probability of the value C is 0.37. Therefore, the probability values in the probability table of $S''$ represent only lower bounds on the probabilities of these rows.

[Figure 3.6 shows the original SPO $S$ over DA and LY with probabilities 0.1, 0.2, 0.33, 0.22 and 0.15 (left), the selection $\sigma_{P \leq 0.2}(S)$ retaining the rows with probabilities 0.1, 0.2 and 0.15 (center), and the projection $\pi_{LY}(\sigma_{P \leq 0.2}(S))$ (right); all three objects are conditioned on DR = B.]

Figure 3.6: Selection on Probabilistic Table and on Probability values in SP-Algebra

The difference in the meaning of probability values for Type I and Type II SPOs causes us to apply extra caution when interpreting the results of SP-Algebra operations. In particular, when considering a specific SP-Algebra operation applied to an SPO or a pair of SPOs, it is important for us to know the type of the input objects and to be able to determine the type of the result. The following proposition identifies the set of "safe" operations in SP-Algebra: operations that, given Type I SPOs, are guaranteed to produce Type I results.

Proposition 3.4 Let $S$ and $S'$ be two Type I SPOs. Then, the following SPOs are also Type I:

1. $\sigma_c(S)$, where $c$ is a selection condition on context, participating random variables or conditionals.

2. $\pi_L(S)$, $\pi^C_F(S)$ and $\pi_U(S)$, where $L$ is a list of context attribute names, $F$ is a set of conditionals, and $\emptyset \neq U \subset V$.

3. $\mu_c(S)$, where $c$ is a conditional selection condition.

4. $S \times S'$.

5. $S \bowtie_l S'$ and $S \bowtie_r S'$.

Two operations missing from the list in Proposition 3.4 are selection on probabilities and selec- tion on probability table. Example 3.7 shows how selection on probabilities can produce a Type II SPO from Type I input; selection on probability table can be used instead to obtain the same result. The following statements specify the semantics of the SP-Algebra operations producing Type

I results.

Theorem 3.2 Let $S = \langle T, V, P, C, \omega\rangle$ be a Type I SPO and let $\emptyset \neq F \subset V$. Let $S' = \pi_F(S) = \langle T, F, P', C, \omega'\rangle$. Then $P'$ contains the correct marginal probability distribution of the random variables in $F$, given the probability distribution $P$.

Proof of Theorem 3.2

Let $P$ be a probability distribution over $dom(V)$. By the definition of projection in Section 3.3.3, we know that $S' = \langle T, F, P', C, \omega'\rangle$ and for each $\bar{x}' \in dom(F)$,
$$P'(\bar{x}') = \sum_{\bar{x} \in dom(V),\ \bar{x}|_F = \bar{x}',\ P(\bar{x})\ \text{is defined}} P(\bar{x}).$$
Note that this sum is exactly the marginal probability of $\bar{x}'$, for any $\bar{x}' \in dom(F)$.


Theorem 3.3 Let $S = \langle T, V, P, C, \omega\rangle$ be a Type I SPO and let $c$ be a conditional selection condition involving the variable $u \in V$. Let $S' = \mu_c(S) = \langle T, V', P', C', \omega'\rangle$. Then $P'$ contains the correct conditional probability distribution of the random variables in $V' = V - \{u\}$ obtained from the distribution $P$, given the condition $c$ on $u$.

Proof of Theorem 3.3

Let $P$ be a probability distribution over $dom(V)$ and let $c = (u \in \{x_1, \ldots, x_k\})$. By the definition of conditionalization in Section 3.3.4, we have $V' = V - \{u\}$. Let
$$N = \sum_{\bar{y} \in dom(V'),\ x \in \{x_1, \ldots, x_k\},\ P(\bar{y}, x)\ \text{is defined}} P(\bar{y}, x).$$
For any $\bar{y} \in dom(V')$,
$$P'(\bar{y}) = \frac{\sum_{x \in \{x_1, \ldots, x_k\}} P(\bar{y}, x)}{N}.$$
We can see that $N$ represents the sum of the probabilities of those rows which satisfy the conditional selection condition $c = (u \in \{x_1, \ldots, x_k\})$. Then we know that $P'(\bar{y})$ computes the conditional probability $P(\bar{y} \mid u \in \{x_1, \ldots, x_k\})$ for any $\bar{y} \in dom(V')$.

Theorem 3.4 Let $S = \langle T, V, P, C, \omega\rangle$ and $S' = \langle T', V', P', C', \omega'\rangle$ be two Cartesian product-compatible SPOs. Let $S'' = S \times S' = \langle T'', V'', P'', C'', \omega''\rangle$. Then $P''$ is the correct joint probability distribution of the random variables in $V$ and $V'$ under the assumption of independence between them, given the distributions $P$ of $V$ and $P'$ of $V'$.

Proof of Theorem 3.4

Let $P$ and $P'$ be two probability distributions over $dom(V)$ and $dom(V')$, respectively. Since we assume that the variables in the two SPOs are independent, the definition of Cartesian product in Section 3.3.5,
$$P''(\bar{x}, \bar{y}) = P(\bar{x}) \cdot P'(\bar{y}),$$
correctly computes the joint probability distribution by multiplying the probabilities of the two individual events.

Theorem 3.5 Let $S = \langle T, V, P, C, \omega\rangle$ and $S' = \langle T', V', P', C', \omega'\rangle$ be two join-compatible SPOs. Let $S'' = S \bowtie_l S' = \langle T'', V'', P'', C'', \omega''\rangle$ and $S''' = S \bowtie_r S' = \langle T'', V'', P''', C'', \omega'''\rangle$. Then $P''$ and $P'''$ are the correct joint probability distributions of the random variables in $V \cup V'$ under the assumption of independence between $V_1 = V - V'$ and $V_2 = V' - V$, given the distributions $P$ of $V$ and $P'$ of $V'$.

Lemma 3.1 Let $\bar{x} \in dom(V_1)$, $\bar{y} \in dom(V_2)$, $\bar{z} \in dom(V_3)$, where $V_1$, $V_2$ and $V_3$ are all disjoint. Under the assumption of conditional independence between the variables in $V_1$ and $V_2$ given $V_3$,
$$P(\bar{x}, \bar{y}, \bar{z}) = \frac{P(\bar{x}, \bar{z}) \cdot P(\bar{y}, \bar{z})}{P(\bar{z})} = P(\bar{x}, \bar{z}) \cdot P(\bar{y} \mid \bar{z}) = P(\bar{y}, \bar{z}) \cdot P(\bar{x} \mid \bar{z}).$$

Proof of Lemma 3.1

By the definition of conditional probability, we have
$$P(\bar{x}, \bar{y}, \bar{z}) = P(\bar{x}, \bar{y} \mid \bar{z}) \cdot P(\bar{z}).$$
Here we assume that $\bar{x}$ and $\bar{y}$ are conditionally independent given $\bar{z}$; then
$$P(\bar{x}, \bar{y} \mid \bar{z}) = P(\bar{x} \mid \bar{z}) \cdot P(\bar{y} \mid \bar{z}),$$
and therefore
$$P(\bar{x}, \bar{y}, \bar{z}) = P(\bar{z}) \cdot P(\bar{x} \mid \bar{z}) \cdot P(\bar{y} \mid \bar{z}) = \frac{P(\bar{x}, \bar{z}) \cdot P(\bar{y}, \bar{z})}{P(\bar{z})} = P(\bar{x}, \bar{z}) \cdot P(\bar{y} \mid \bar{z}) = P(\bar{y}, \bar{z}) \cdot P(\bar{x} \mid \bar{z}).$$

Proof of Theorem 3.5

Let $P$ and $P'$ be two probability distributions with values $P(\bar{x}, \bar{z})$ and $P'(\bar{y}, \bar{z})$ for $\bar{x} \in dom(V_1)$, $\bar{y} \in dom(V_2)$ and $\bar{z} \in dom(V_3)$. By assuming that the variables in $V_1$ and $V_2$ are conditionally independent given $V_3$, Lemma 3.1 gives
$$P(\bar{x}, \bar{y}, \bar{z}) = \frac{P(\bar{x}, \bar{z}) \cdot P'(\bar{y}, \bar{z})}{P'_{V_3}(\bar{z})}.$$
So the definition of left join in Section 3.3.6 computes the joint probability distribution of the random variables in $V \cup V'$. The argument for the right join is symmetric.

Theorem 3.6 Let $S$ and $S'$ be two join-compatible SPOs. The left join $S \bowtie_l S'$ and the right join $S \bowtie_r S'$ are equivalent iff the two SPOs are join-consistent.

Proof of Theorem 3.6

First, we prove that if two SPOs are join-consistent, then the join operations $S \bowtie_l S'$ and $S \bowtie_r S'$ are equivalent. From the definition of join-consistency, we know that $P_{V_3}(\bar{z}) = P'_{V_3}(\bar{z})$ for any $\bar{z} \in dom(V_3)$. Then the probability distributions
$$P''(\bar{x}, \bar{y}, \bar{z}) = \frac{P(\bar{x}, \bar{z}) \cdot P'(\bar{y}, \bar{z})}{P'_{V_3}(\bar{z})} \quad \text{and} \quad P'''(\bar{x}, \bar{y}, \bar{z}) = \frac{P(\bar{x}, \bar{z}) \cdot P'(\bar{y}, \bar{z})}{P_{V_3}(\bar{z})}$$
will be identical, which implies that the two join operations $S \bowtie_l S'$ and $S \bowtie_r S'$ are equivalent.

Second, consider the other direction. If the join operations $S \bowtie_l S'$ and $S \bowtie_r S'$ are equivalent, i.e., $P''(\bar{x}, \bar{y}, \bar{z}) = P'''(\bar{x}, \bar{y}, \bar{z})$ for any $\bar{x}$, $\bar{y}$, $\bar{z}$, then by the definition of the join operations we have
$$\frac{P(\bar{x}, \bar{z}) \cdot P'(\bar{y}, \bar{z})}{P'_{V_3}(\bar{z})} = \frac{P(\bar{x}, \bar{z}) \cdot P'(\bar{y}, \bar{z})}{P_{V_3}(\bar{z})},$$
so the two marginal probability distributions are identical, i.e., $P_{V_3}(\bar{z}) = P'_{V_3}(\bar{z})$ for any $\bar{z} \in dom(V_3)$, which implies that the two SPOs are join-consistent.

3.5 Query Optimization

In this section we describe our work on query optimization. Query optimization is a necessary part of any useful database system. Our approach to query optimization tries to follow the principles of query optimization for standard relational databases; see the survey in [45]. However, some of the results may differ greatly due to the semistructured nature of SP-relations. The theory of semistructured probabilistic query optimization consists of two parts: rewrite rules and cost models. We have established a number of rewrite rules and a set of cost models for SP-Algebra queries over an SP-database, which can be implemented in the query optimizer in future implementations of our database framework. The rewrite rules obtained so far are listed in Section 3.5.1. As will become apparent, there are cases in which no general rewrite rules are available. The cost models for the different operations are described in Section 3.5.2.

3.5.1 Rewrite rules

For traditional relational query optimizers, cost models are established based on disk I/O and, if necessary, CPU costs. However, since the SPDBMS uses a two-tier architecture and is built on top of a relational DBMS, it is hard for the SPDBMS engine to compute disk I/O (not to mention CPU costs) due to the lack of information about the underlying storage structures. So we need to adopt other measures to establish our cost models, i.e., the number of tuples returned by the underlying RDBMS, or the number of SPOs (and the average size of the SPOs in an SP-relation) in the intermediate SP-relations, or both. Following are the query rewrite rules that hold in SP-Algebra. Note that no useful query rewrite rules have been found for the join operations.



Theorem 3.7 Let $c_1$ and $c_2$ be two selection conditions and let $R$ be an SP-relation. Then the following rules are valid.

1. $\sigma_{c_1}(\sigma_{c_2}(R)) = \sigma_{c_2}(\sigma_{c_1}(R))$ (same as Theorem 3.1)

2. $\sigma_{c_1 \wedge c_2}(R) = \sigma_{c_1}(\sigma_{c_2}(R))$

Proof of Theorem 3.7.2

Let $c_1$ and $c_2$ be two atomic selection conditions. There are 5 types of atomic selection conditions, namely context, participation, conditional, table and probability. Selection on context, participation or conditional conditions results in entire SPOs being selected, while selection on table or probability conditions selects only parts of the relevant SPOs. We can partition the conditions into two groups:

Group 1, containing context, participation and conditional conditions, and

Group 2, containing table and probability conditions.

First we prove that $\sigma_{c_1 \wedge c_2}(S) = \sigma_{c_1}(\sigma_{c_2}(S))$ for a single SPO $S$.

Case 1. Both conditions $c_1$ and $c_2$ are in Group 1. There are four possible combinations for whether each condition is satisfied:

a) $S$ satisfies $c_1$ but not $c_2$; or
b) $S$ satisfies $c_2$ but not $c_1$; or
c) $S$ satisfies both $c_1$ and $c_2$; or
d) $S$ satisfies neither $c_1$ nor $c_2$.

By the definition of selection on atomic selection conditions in Group 1, we know that selection on these conditions will result in the entire SPO being selected, or none of it. For case (a), since $S$ does not satisfy $c_2$, $\sigma_{c_2}(S)$ returns empty and subsequently $\sigma_{c_1}(\sigma_{c_2}(S))$ returns empty; $\sigma_{c_1 \wedge c_2}(S)$ also returns empty. Thus, $\sigma_{c_1 \wedge c_2}(S) = \sigma_{c_1}(\sigma_{c_2}(S))$ holds for case (a). The same applies to case (b), and similarly to case (d). For case (c), $\sigma_{c_1}(\sigma_{c_2}(S))$ returns $S$, and $\sigma_{c_1 \wedge c_2}(S)$ returns $S$ too. Thus, $\sigma_{c_1 \wedge c_2}(S) = \sigma_{c_1}(\sigma_{c_2}(S))$ holds for all these cases.

Case 2. Condition $c_1$ is in Group 2 and condition $c_2$ is in Group 1. There are only two possible combinations for whether each condition is satisfied, assuming that condition $c_1$ is always partially satisfied:

a) $S$ does not satisfy $c_2$; or
b) $S$ satisfies $c_2$.

By the definitions of selection on atomic selection conditions in Group 1 and in Group 2, selection on a Group 1 condition results in the entire SPO being selected or not, while selection on a Group 2 condition preserves all the context, participating random variables and conditionals of the original SPO but produces only a part of the probability table. Let $S_1 = \sigma_{c_1}(S)$, where $S_1$ has the part of the probability table that satisfies the condition $c_1$ and retains all the context, participating random variables and conditionals of $S$. For case (a), $\sigma_{c_1}(\sigma_{c_2}(S))$ returns empty since $\sigma_{c_2}(S)$ is empty, and $\sigma_{c_1 \wedge c_2}(S)$ also returns empty. For case (b), $\sigma_{c_2}(S) = S$, so $\sigma_{c_1}(\sigma_{c_2}(S))$ returns $S_1$, and $\sigma_{c_1 \wedge c_2}(S)$ also returns $S_1$. This proves that $\sigma_{c_1 \wedge c_2}(S) = \sigma_{c_1}(\sigma_{c_2}(S))$ holds for both cases.

Case 3. Both $c_1$ and $c_2$ are conditions in Group 2. Assume that both conditions $c_1$ and $c_2$ are partially satisfied by $S$. By the definition of selection on atomic selection conditions in Group 2, selection on these conditions results in part of the probability table and preserves all the context, participating random variables and conditionals of the original SPO. In other words, all the components of the original SPO except the probability table and the path are preserved. Let $S = \langle T, V, P, C, \omega\rangle$. Then $\sigma_{c_1}(\sigma_{c_2}(S)) = \langle T, V, P_1, C, \omega_1\rangle$ and $\sigma_{c_1 \wedge c_2}(S) = \langle T, V, P_2, C, \omega_2\rangle$ with $P_1 = \sigma^r_{c_1}(\sigma^r_{c_2}(P))$ and $P_2 = \sigma^r_{c_1 \wedge c_2}(P)$, where $\sigma^r$ is the relational selection operator. Since for the relational selection operator we have $\sigma^r_{c_1}(\sigma^r_{c_2}(P)) = \sigma^r_{c_1 \wedge c_2}(P)$, it follows that $P_1 = P_2$, so $\sigma_{c_1 \wedge c_2}(S) = \sigma_{c_1}(\sigma_{c_2}(S))$ holds for this case.

This proves $\sigma_{c_1 \wedge c_2}(S) = \sigma_{c_1}(\sigma_{c_2}(S))$ for a single SPO. Now let $R = \{S_1, \ldots, S_n\}$ be an SP-relation. Since the union operator is commutative,
$$\sigma_{c_1 \wedge c_2}(R) = \bigcup_{i=1}^{n} \sigma_{c_1 \wedge c_2}(S_i) = \bigcup_{i=1}^{n} \sigma_{c_1}(\sigma_{c_2}(S_i)) = \sigma_{c_1}(\sigma_{c_2}(R)).$$
This proves $\sigma_{c_1 \wedge c_2}(R) = \sigma_{c_1}(\sigma_{c_2}(R))$.
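The practical value of such rules is that a plan may replace a cascade of selections by a single conjunctive selection, or vice versa. The following sketch illustrates the equivalence of Theorem 3.7.2 for two Group 2 (probability) conditions over the toy table encoding used in the earlier sketches (illustrative only, not the SPDBMS query engine):

# Sketch: cascading two probability-value selections on one SPO's
# table equals a single conjunctive selection (Theorem 3.7.2, Case 3).
def select_rows(table, pred):
    return {row: p for row, p in table.items() if pred(p)}

table = {("A","A"): 0.1, ("A","B"): 0.2, ("B","A"): 0.33,
         ("B","C"): 0.22, ("C","C"): 0.15}

cascade = select_rows(select_rows(table, lambda p: p <= 0.2),
                      lambda p: p >= 0.12)
conjunct = select_rows(table, lambda p: p <= 0.2 and p >= 0.12)
assert cascade == conjunct   # both keep the rows with 0.12 <= P <= 0.2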

Theorem 3.8 Let $L_1$ and $L_2$ be two projection lists and let $R$ be an SP-relation. Then the following rules are valid.

1. $\pi_{L_1}(\pi_{L_2}(R)) = \pi_{L_2}(\pi_{L_1}(R))$

2. $\pi_{L_1 \cap L_2}(R) = \pi_{L_2}(\pi_{L_1}(R))$, where the intersection $L_1 \cap L_2$ is taken separately within each projection type (context attributes, conditionals, participating random variables), a type mentioned in only one of the lists contributing its entries unchanged.

Proof of Theorem 3.8.1

First we prove that $\pi_{L_1}(\pi_{L_2}(S)) = \pi_{L_2}(\pi_{L_1}(S))$ for a single SPO $S = \langle T, V, P, C, \omega\rangle$. Projection can be performed on context attributes, conditionals, and random variables, so there are nine possible combinations of types for the projection lists $L_1$ and $L_2$.

Case 1. Both $L_1$ and $L_2$ are lists of context attributes. By the definition of projection on context attributes, $\pi_{L_2}(S) = \langle T_2, V, P, C, \omega_2\rangle$ with $T_2 = \{(A, x) \in T \mid A \in L_2\}$, and $\pi_{L_1}(\pi_{L_2}(S)) = \langle T_{12}, V, P, C, \omega_{12}\rangle$ with $T_{12} = \{(A, x) \in T \mid A \in L_1 \text{ and } A \in L_2\}$. Symmetrically, $\pi_{L_2}(\pi_{L_1}(S))$ has context $T_{21} = \{(A, x) \in T \mid A \in L_2 \text{ and } A \in L_1\}$. Both $T_{12}$ and $T_{21}$ contain all entries from $T$ for attributes from the list $L_1 \cap L_2$ only, i.e., $T_{12} = T_{21}$, and all other components coincide. So $\pi_{L_1}(\pi_{L_2}(S))$ is equivalent to $\pi_{L_2}(\pi_{L_1}(S))$.

Case 2. Both $L_1$ and $L_2$ are lists of conditionals. The argument mirrors Case 1: both orders of application produce the conditional set $\{(u, X) \in C \mid u \in L_1 \cap L_2\}$ and leave the remaining components unchanged, so the two results coincide.

Case 3. Both $L_1$ and $L_2$ are lists of random variables. By the definition of projection on random variables, $\pi_{L_2}(S) = \langle T, V_2, P_2, C, \omega_2\rangle$ with $V_2 = V \cap L_2$ and, for each $\bar{x}_2 \in dom(V_2)$,
$$P_2(\bar{x}_2) = \sum_{\bar{x} \in dom(V),\ \bar{x}|_{V_2} = \bar{x}_2,\ P(\bar{x})\ \text{is defined}} P(\bar{x}).$$
Applying $\pi_{L_1}$ yields the participant set $V_{12} = V_2 \cap L_1 = V \cap L_1 \cap L_2$ and, for each $\bar{x}' \in dom(V_{12})$,
$$P_{12}(\bar{x}') = \sum_{\bar{x}_2 \in dom(V_2),\ \bar{x}_2|_{V_{12}} = \bar{x}'} P_2(\bar{x}_2) = \sum_{\bar{x} \in dom(V),\ \bar{x}|_{V_{12}} = \bar{x}',\ P(\bar{x})\ \text{is defined}} P(\bar{x}),$$
where the second equality is obtained by substituting the definition of $P_2$ and collapsing the double summation. The same computation with $L_1$ and $L_2$ interchanged produces the identical participant set and probability table, so $\pi_{L_1}(\pi_{L_2}(S)) = \pi_{L_2}(\pi_{L_1}(S))$.

Cases 4 & 5. $L_1$ is a list of context attributes and $L_2$ is a list of conditionals (or vice versa). The two projections operate on disjoint components of the SPO: the context projection replaces $T$ by $\{(A, x) \in T \mid A \in L_1\}$ and leaves $C$ untouched, while the conditional projection replaces $C$ by $\{(u, X) \in C \mid u \in L_2\}$ and leaves $T$ untouched. Hence both orders of application produce the same SPO, and the operations commute.

Cases 6 & 7. $L_1$ is a list of context attributes and $L_2$ is a list of random variables (or vice versa). Again the operations affect disjoint components ($T$ versus $V$ and $P$), so both orders produce the same context, the same participant set $V \cap L_2$ and the same marginalized probability table, and the operations commute.

Cases 8 & 9. $L_1$ is a list of conditionals and $L_2$ is a list of random variables (or vice versa). The conditional projection affects only $C$ and the variable projection affects only $V$ and $P$, so, as above, both orders produce the same result.

This proves $\pi_{L_1}(\pi_{L_2}(S)) = \pi_{L_2}(\pi_{L_1}(S))$ for a single SPO. Now let $R = \{S_1, \ldots, S_n\}$ be an SP-relation. Since the union operator is commutative,
$$\pi_{L_1}(\pi_{L_2}(R)) = \bigcup_{i=1}^{n} \pi_{L_1}(\pi_{L_2}(S_i)) = \bigcup_{i=1}^{n} \pi_{L_2}(\pi_{L_1}(S_i)) = \pi_{L_2}(\pi_{L_1}(R)).$$
This proves $\pi_{L_1}(\pi_{L_2}(R)) = \pi_{L_2}(\pi_{L_1}(R))$.

Proof of Theorem 3.8.2

First we prove that $\pi_{L_1 \cap L_2}(S) = \pi_{L_2}(\pi_{L_1}(S))$ for a single SPO $S = \langle T, V, P, C, \omega\rangle$. As before, there are nine possible combinations of types for the projection lists $L_1$ and $L_2$.

Case 1. Both $L_1$ and $L_2$ are lists of context attributes. By Case 1 of the proof of Theorem 3.8.1, $\pi_{L_2}(\pi_{L_1}(S))$ produces the context $\{(A, x) \in T \mid A \in L_1 \text{ and } A \in L_2\}$, which contains exactly the entries for the attributes in $L_1 \cap L_2$; this is, by definition, the context of $\pi_{L_1 \cap L_2}(S)$. All other components are unchanged in both expressions, so they are equal.

Case 2. Both $L_1$ and $L_2$ are lists of conditionals. Identically, both expressions yield the conditional set $\{(u, X) \in C \mid u \in L_1 \cap L_2\}$ with all other components unchanged.

Case 3. Both $L_1$ and $L_2$ are lists of random variables. By the computation in Case 3 of the proof of Theorem 3.8.1, $\pi_{L_2}(\pi_{L_1}(S))$ has participant set $V \cap L_1 \cap L_2$ and, for each $\bar{x}' \in dom(V \cap L_1 \cap L_2)$,
$$P'(\bar{x}') = \sum_{\bar{x} \in dom(V),\ \bar{x}|_{V \cap L_1 \cap L_2} = \bar{x}',\ P(\bar{x})\ \text{is defined}} P(\bar{x}),$$
which is exactly the probability table of $\pi_{L_1 \cap L_2}(S)$.

Cases 4–9. $L_1$ and $L_2$ are lists of different types. In these cases the two projections operate on disjoint components of the SPO, and the intersection $L_1 \cap L_2$ is understood componentwise, so $\pi_{L_1 \cap L_2}$ applies each of the two projections to its own component. For instance, when $L_1$ is a list of context attributes and $L_2$ is a list of conditionals, both $\pi_{L_1 \cap L_2}(S)$ and $\pi_{L_2}(\pi_{L_1}(S))$ produce the context $\{(A, x) \in T \mid A \in L_1\}$ and the conditional set $\{(u, X) \in C \mid u \in L_2\}$, while leaving $V$ and $P$ unchanged. The remaining type combinations (context attributes with random variables, and conditionals with random variables) are handled in the same way.

This proves $\pi_{L_1 \cap L_2}(S) = \pi_{L_2}(\pi_{L_1}(S))$ for a single SPO. Now let $R = \{S_1, \ldots, S_n\}$ be an SP-relation. Since the union operator is commutative,
$$\pi_{L_1 \cap L_2}(R) = \bigcup_{i=1}^{n} \pi_{L_1 \cap L_2}(S_i) = \bigcup_{i=1}^{n} \pi_{L_2}(\pi_{L_1}(S_i)) = \pi_{L_2}(\pi_{L_1}(R)).$$
This proves $\pi_{L_1 \cap L_2}(R) = \pi_{L_2}(\pi_{L_1}(R))$.

Theorem 3.9 Let $c_1$ and $c_2$ be two conditionalization conditions and let $R$ be an SP-relation. Then the following rules are valid.

1. $\mu_{c_1}(\mu_{c_2}(R)) = \mu_{c_2}(\mu_{c_1}(R))$

2. $\mu_{c_1 \wedge c_2}(R) = \mu_{c_1}(\mu_{c_2}(R))$, where conditionalization by the conjunctive condition removes the rows violating either conjunct, drops both columns, and normalizes the table once.

Proof of Theorem 3.9.1

First we prove that $\mu_{c_1}(\mu_{c_2}(S)) = \mu_{c_2}(\mu_{c_1}(S))$ for a single SPO $S = \langle T, V, P, C, \omega\rangle$ which is conditionalization-compatible with both conditions, where $c_1 = (u \in \{x_1, \ldots, x_k\})$ and $c_2 = (u' \in \{x'_1, \ldots, x'_m\})$ are two atomic conditional selection conditions over distinct variables $u, u' \in V$.

By the definition of conditionalization, $\mu_{c_2}(S) = \langle T, V_2, P_2, C_2, \omega_2\rangle$, where $V_2 = V - \{u'\}$, $C_2 = C \cup \{(u', \{x'_1, \ldots, x'_m\})\}$ and, for any $\bar{y} \in dom(V_2)$,
$$P_2(\bar{y}) = \frac{\sum_{x' \in \{x'_1, \ldots, x'_m\}} P(\bar{y}, x')}{N_2}, \qquad N_2 = \sum_{\bar{y} \in dom(V_2)} \sum_{x' \in \{x'_1, \ldots, x'_m\}} P(\bar{y}, x').$$
Applying $\mu_{c_1}$ to $\mu_{c_2}(S)$ yields the participant set $V - \{u, u'\}$, the conditional set $C \cup \{(u, \{x_1, \ldots, x_k\}), (u', \{x'_1, \ldots, x'_m\})\}$ and, for any $\bar{z} \in dom(V - \{u, u'\})$, after substituting the definition of $P_2$ and cancelling the common factor $N_2$,
$$P_{12}(\bar{z}) = \frac{\sum_{x \in \{x_1, \ldots, x_k\}} \sum_{x' \in \{x'_1, \ldots, x'_m\}} P(\bar{z}, x, x')}{\sum_{\bar{z} \in dom(V - \{u, u'\})} \sum_{x \in \{x_1, \ldots, x_k\}} \sum_{x' \in \{x'_1, \ldots, x'_m\}} P(\bar{z}, x, x')}.$$
This expression is symmetric in $c_1$ and $c_2$: computing $\mu_{c_2}(\mu_{c_1}(S))$ produces the same participant set, the same conditional set and the same probability table. So $\mu_{c_1}(\mu_{c_2}(S)) = \mu_{c_2}(\mu_{c_1}(S))$.

Now let $R = \{S_1, \ldots, S_n\}$ be an SP-relation. Since the union operator is commutative,
$$\mu_{c_1}(\mu_{c_2}(R)) = \bigcup_{i=1}^{n} \mu_{c_1}(\mu_{c_2}(S_i)) = \bigcup_{i=1}^{n} \mu_{c_2}(\mu_{c_1}(S_i)) = \mu_{c_2}(\mu_{c_1}(R)).$$
This proves $\mu_{c_1}(\mu_{c_2}(R)) = \mu_{c_2}(\mu_{c_1}(R))$.

Proof of Theorem 3.9.2

First we prove that $\mu_{c_1 \wedge c_2}(S) = \mu_{c_1}(\mu_{c_2}(S))$ for a single SPO $S = \langle T, V, P, C, \omega\rangle$ which is conditionalization-compatible with both conditions, where $c_1 = (u \in \{x_1, \ldots, x_k\})$ and $c_2 = (u' \in \{x'_1, \ldots, x'_m\})$ are two atomic conditional selection conditions over distinct variables $u, u' \in V$.

By the computation in the proof of Theorem 3.9.1, $\mu_{c_1}(\mu_{c_2}(S))$ has participant set $V - \{u, u'\}$, conditional set $C \cup \{(u, \{x_1, \ldots, x_k\}), (u', \{x'_1, \ldots, x'_m\})\}$ and, for any $\bar{z} \in dom(V - \{u, u'\})$,
$$P_{12}(\bar{z}) = \frac{\sum_{x \in \{x_1, \ldots, x_k\}} \sum_{x' \in \{x'_1, \ldots, x'_m\}} P(\bar{z}, x, x')}{\sum_{\bar{z} \in dom(V - \{u, u'\})} \sum_{x \in \{x_1, \ldots, x_k\}} \sum_{x' \in \{x'_1, \ldots, x'_m\}} P(\bar{z}, x, x')};$$
that is, the two normalizations performed by the nested conditionalizations compose into a single normalization over the doubly restricted table. On the other hand, by the definition of conditionalization applied to the conjunctive condition $c_1 \wedge c_2$, $\mu_{c_1 \wedge c_2}(S)$ keeps exactly the rows with $u \in \{x_1, \ldots, x_k\}$ and $u' \in \{x'_1, \ldots, x'_m\}$, removes both the $u$ and $u'$ columns, coalesces the remaining rows, and normalizes once. This produces the same participant set, the same conditional set and the same probability table $P_{12}$. So $\mu_{c_1 \wedge c_2}(S) = \mu_{c_1}(\mu_{c_2}(S))$.

Now let $R = \{S_1, \ldots, S_n\}$ be an SP-relation. Since the union operator is commutative,
$$\mu_{c_1 \wedge c_2}(R) = \bigcup_{i=1}^{n} \mu_{c_1 \wedge c_2}(S_i) = \bigcup_{i=1}^{n} \mu_{c_1}(\mu_{c_2}(S_i)) = \mu_{c_1}(\mu_{c_2}(R)).$$
This proves $\mu_{c_1 \wedge c_2}(R) = \mu_{c_1}(\mu_{c_2}(R))$.




Theorem 3.10 Let $R_1$, $R_2$ and $R_3$ be three SP-relations. Then the following rules are valid.

1. $R_1 \times R_2 = R_2 \times R_1$

2. $(R_1 \times R_2) \times R_3 = R_1 \times (R_2 \times R_3)$

Proof of Theorem 3.10.1

 

First we prove that $S \times S' = S' \times S$ for a single pair of cp-compatible SPOs $S = \langle T, V, P, C, \omega\rangle$ and $S' = \langle T', V', P', C', \omega'\rangle$.

By the definition of Cartesian product, $S \times S' = \langle T \cup T', V \cup V', P'', C, \omega \times \omega'\rangle$, where for all $\bar{x} \in dom(V)$ and $\bar{y} \in dom(V')$, $P''(\bar{x}, \bar{y}) = P(\bar{x}) \cdot P'(\bar{y})$. Similarly, $S' \times S = \langle T' \cup T, V' \cup V, P''', C', \omega' \times \omega\rangle$ with $P'''(\bar{y}, \bar{x}) = P'(\bar{y}) \cdot P(\bar{x})$. It can be seen that $T \cup T' = T' \cup T$, $V \cup V' = V' \cup V$, $C = C'$ (by cp-compatibility), and $P'' = P'''$ since multiplication is commutative. So $S \times S'$ is equivalent to $S' \times S$.

Now let $R_1 = \{S_1, \ldots, S_n\}$ and $R_2 = \{S'_1, \ldots, S'_m\}$ be two SP-relations. Since the union operator is commutative,
$$R_1 \times R_2 = \bigcup_{i, j:\ S_i, S'_j\ \text{cp-compatible}} S_i \times S'_j = \bigcup_{i, j} S'_j \times S_i = R_2 \times R_1.$$
This proves $R_1 \times R_2 = R_2 \times R_1$.

Proof of Theorem 3.10.2

       

First we prove that $(S \times S') \times S'' = S \times (S' \times S'')$ for a single triple of pairwise cp-compatible SPOs $S = \langle T, V, P, C, \omega\rangle$, $S' = \langle T', V', P', C', \omega'\rangle$ and $S'' = \langle T'', V'', P'', C'', \omega''\rangle$.

By the definition of Cartesian product, $(S \times S') \times S''$ has context $(T \cup T') \cup T''$, participant set $(V \cup V') \cup V''$, conditional part $C = C' = C''$ and, for all $\bar{x} \in dom(V)$, $\bar{y} \in dom(V')$, $\bar{z} \in dom(V'')$, the probability
$$(P(\bar{x}) \cdot P'(\bar{y})) \cdot P''(\bar{z}).$$
Similarly, $S \times (S' \times S'')$ has context $T \cup (T' \cup T'')$, participant set $V \cup (V' \cup V'')$, the same conditional part, and probability $P(\bar{x}) \cdot (P'(\bar{y}) \cdot P''(\bar{z}))$. Since set union and multiplication are both associative, all the components coincide. So $(S \times S') \times S''$ is equivalent to $S \times (S' \times S'')$.

Now let $R_1$, $R_2$ and $R_3$ be three SP-relations. Since the union operator is commutative and the argument above applies to every triple of pairwise cp-compatible SPOs drawn from the three relations,
$$(R_1 \times R_2) \times R_3 = \bigcup_{i, j, k} (S_i \times S'_j) \times S''_k = \bigcup_{i, j, k} S_i \times (S'_j \times S''_k) = R_1 \times (R_2 \times R_3).$$
This proves $(R_1 \times R_2) \times R_3 = R_1 \times (R_2 \times R_3)$.

3.5.2 Cost Model

The two parameters of the cost model for a traditional query optimizer are disk I/O and CPU cost. In order to get relatively accurate estimates of those two parameters, a query engine needs to maintain certain statistics. Typical statistics maintained by a traditional query optimizer are listed as follows:


 

a) $N(R)$: the number of tuples in a relation $R$.

b) $V(R, a)$: the number of different values in the column of relation $R$ for attribute $a$; if there are a great many values, then some type of histogram is maintained instead.

c) Information about indexes.

These statistics, however, are not sufficient for SP-Algebra queries over an SP-database. Since we use the two-tier architecture, for complex SP-Algebra queries we may use other parameters to establish the cost model, i.e., the number of tuples returned by the underlying RDBMS or the number of SPOs in the intermediate SP-relations. We propose the following statistics for an

SP-database engine:

a) $N_{SPO}(R)$: the number of SPOs in the SP-relation $R$;

b) $N_{ctx}(R)$: the total number of relational attributes in the context of the SP-relation $R$;

c) $V_{ctx}(R, a)$: the number of distinct values of the context sub-element $a$ in the SP-relation $R$;

d) $N_{var}(R)$: the total number of random variables in the SP-relation $R$;

e) $D(R)$: the average (or median) domain size of the random variables in the SP-relation $R$;

f) $n_{var}(R)$: the average (or median) number of random variables in each SPO of the SP-relation $R$;

g) $r(R)$: the average (or median) ratio of the number of rows in the SPOs of the new SP-relation to the number of rows in the SPOs of the original SP-relation.

Based on the statistics provided above, we define an initial set of cost models for the various SP-Algebra queries. Note that the cost models are based on the number of SPOs $N_{SPO}$ in the intermediate SP-relations, and also on the row ratio $r$ in the cases where the number of rows in the probability table changes.

(i) Selection operations:

Selection on context, with a condition on context attribute $a$: by assuming uniform distributions for context attribute values, the number of SPOs selected from the original SP-relation is proportional to $1 / V_{ctx}(R, a)$, i.e., $N_{SPO} \approx N_{SPO}(R) / V_{ctx}(R, a)$.

Selection on conditionals: by assuming uniform distributions for random variables and variable values, the number of SPOs selected from the original SP-relation is proportional to $1 / (N_{var}(R) \cdot D(R))$.

Selection on participating random variables: by assuming uniform distributions for random variables, the number of SPOs selected from the original SP-relation is proportional to $n_{var}(R) / N_{var}(R)$ — the chance that a given SPO mentions the selected variable.

Selection on the probability table, with a table condition $(u = x)$: the set of qualifying SPOs is estimated as for participation, $N_{SPO} \approx N_{SPO}(R) \cdot n_{var}(R) / N_{var}(R)$, and, by assuming uniform distributions for variable values, each qualifying probability table retains roughly the fraction $r \approx 1 / D(R)$ of its rows.

Selection on probability values, with a threshold $p \in [0, 1]$:
$\sigma_{P \leq p}(R)$ or $\sigma_{P < p}(R)$: $N_{SPO} \approx N_{SPO}(R)$ and $r \approx p$;
$\sigma_{P \geq p}(R)$ or $\sigma_{P > p}(R)$: $N_{SPO} \approx N_{SPO}(R)$ and $r \approx 1 - p$;
$\sigma_{P = p}(R)$: $r \approx 0$;
$\sigma_{P \neq p}(R)$: $N_{SPO} \approx N_{SPO}(R)$ and $r \approx 1$.
By assuming uniform distributions for probability values, the number of rows selected from the original SP-relation is proportional to the values specified above.

(ii) Projection operations:

Projection on context: every SPO survives, so $N_{SPO}(\pi_L(R)) = N_{SPO}(R)$; by assuming uniform distributions for context attributes, the size of each context shrinks in proportion to the fraction of attributes retained by the projection list.

Projection on conditionals: again $N_{SPO}$ is unchanged, and the conditional part of each SPO shrinks in proportion to the fraction of conditionals retained by the projection list.

Projection on participating random variables: by assuming uniform distributions for random variables, marginalizing out a random variable divides the number of rows in the probability table by the size of its domain, so $r \approx 1 / D(R)^{k}$ when $k$ variables are projected out.

(iii) Conditionalization operation:

Conditionalization on the probability table, $\mu_c(R)$ with $c = (u \in X)$: the qualifying SPOs are those conditionalization-compatible with $c$, estimated as for participation selection by $N_{SPO} \approx N_{SPO}(R) \cdot n_{var}(R) / N_{var}(R)$; by assuming uniform distributions for random variables and variable values, each resulting probability table retains roughly the fraction $r \approx |X| / D(R)$ of its rows before the $u$ column is dropped and the remaining rows are coalesced.

(iv) Cartesian product and join operations: For Cartesian product and join operations, it is hard to estimate the size of an intermediate SP-relation. The major reason for this is that the operation (either Cartesian product or join) is only defined between two compatible SPOs, and it’s difficult to predict the number of pairs of SPOs which are compatible in two arbitrary SP-relations.
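As an illustration of how such statistics could drive plan costing, the sketch below estimates the output cardinality of a context selection and of a probability-table selection from the statistics a)–g); the parameter names and uniformity assumptions mirror the estimates above and are ours, not the SPDBMS implementation:

# Sketch: cardinality estimation for two SP-Algebra operations from the
# SP-relation statistics proposed above, assuming uniform distributions.
def est_context_selection(n_spo, distinct_values):
    # selection on a context attribute keeps ~ 1/V_ctx(R, a) of the SPOs
    return n_spo / distinct_values

def est_table_selection(n_spo, n_vars_total, avg_vars_per_spo, avg_domain):
    # selection on the probability table applies only to the SPOs that
    # mention the variable (~ n_var(R)/N_var(R) of them) and keeps
    # ~ 1/D(R) of each qualifying table's rows
    spos = n_spo * avg_vars_per_spo / n_vars_total
    row_ratio = 1.0 / avg_domain
    return spos, row_ratio

print(est_context_selection(n_spo=10_000, distinct_values=25))       # 400.0 SPOs
print(est_table_selection(10_000, n_vars_total=200,
                          avg_vars_per_spo=4, avg_domain=5))          # (200.0, 0.2)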

Chapter 4

Semistructured Probabilistic DBMS

In the previous chapter we introduced our SPO database framework. The framework was developed at the logical level, meaning that our definitions of the data model and SP-Algebra do not rely on any specific representation. Here we describe a specific implementation of the database framework — a semistructured probabilistic database management system (SPDBMS), which was built on top of a relational DBMS.

In order to implement the database framework, we first need a format in which to represent our SPOs. There are several applications in our project, and we need to transfer data between those applications and between the database and the applications; we therefore require the transferred data to be portable across platforms and across machines from different vendors. XML is a good candidate for this, which is one of the major reasons why we chose XML to encode our SPOs. Second, we need to establish a set of relational schemas that allows us to store our SPOs in relational tables in the underlying relational database management system. Note that an SP-relation is a finite collection of SPOs and an SP-database is a finite collection of SP-relations. One important difference between semistructured probabilistic databases and classical relational or relational probabilistic databases is that different semistructured probabilistic objects conform to the same "schema" template and can be stored in the same SP-relation, whereas relational tuples over different schemas must be stored in separate relations.

Once the data is stored in the database, an algebra-like query language (which we call SPO-QL) is used to query the information stored in the database. A query processing engine was also developed to process the queries passed to the database. Among its most important components are the query optimizer, the query translator, and the post-processor. The query optimizer determines whether a query operation will be executed by the underlying relational DBMS or by the SPDBMS itself; the query translator translates SP-Algebra into a sequence of SQL statements that can be executed directly by the underlying RDBMS; the post-processor performs the operations that cannot be executed by the RDBMS, for instance, normalizing probability distributions for the conditionalization operation. Instead of using the Document Object Model (DOM), we designed a new data structure — an internal object model — to represent intermediate XML objects. This object model is a compact representation of SP-relations at run time, and it facilitates the efficient retrieval of stored information.

This chapter is organized as follows. We discuss the XML-based implementation of the SPO

data model in Section 4.1, and show how SPOs are represented in XML with a meta-language, SPO-ML. In Section 4.2 we introduce the overall architecture of our SPDBMS. In Section 4.3 we discuss our relational schema generation strategies for efficiently mapping XML representations of SPOs into relational tables, and the algorithms for creating SP-relations and inserting SPOs into existing SP-relations. Then we present our query language and how SP-Algebra queries are processed by the SPDBMS query engine in Section 4.4. We describe our query translation mechanism, which automatically generates a sequence of SQL statements for evaluating SP-Algebra queries. Finally, we report on the results of the experiments testing the performance of the SPDBMS in Section 4.5.

4.1 Representation of SPOs

One of the major considerations in the design of the SPDBMS was the idea that it would take over the data management routine from complex AI applications dealing with uncertain data. Appli- cations such as support of Bayesian network construction typically consist of different components, some of which serve to extract and/or elicit the probability tables while others serve to support the construction and further use of Bayesian networks given the data. If SPDBMS takes on the role of the data backbone of such an application, the issue of representation of SPOs for the purpose of passing the information between different components of the system becomes important. The representation mechanism must be transparent and easy to use by diverse applications. Extensible Markup Language (XML) [10] provides us with such a representation framework. It affords us the benefit of using clear APIs for parsing and processing XML data, together with open source software implementing these tasks, relieving the SPDBMS from the need to do its own syntactic parsing. This makes SPO data encoded in XML easy to pass from component to component. To be more specific, we represent SPOs in XML using a markup meta-language we call SPO- ML. Among a number of possible XML encodings of SPOs we choose the approach that is inde-

pendent of any application domain. Thus, the actual DTD/XML schema of the markup language

£ ¤¡ ¦ does not depend on the application domain, namely the pair ¥ . SPO-ML simply represents the general markup rules for any application domain.

For example, consider an arbitrary application domain with a universe of random variables

¨ ¨

¦¥ ¥ ¥¡ 



¥  ¥ ¥¤¥¤¥¤£ ¥  ¡  ¥¤¥¤¥¤£

 and a collection of context attributes , we can use the same

markup language as shown on the template DTD in Figure 4.1 . ¡ DTD representation of SPO-ML is chosen here for its simplicity and succinctness. We also maintain and use the corresponding XML schemas. See Appendix B for details.

76 ]>

Figure 4.1: XML DTD template for SPOs.

Figure 4.2 (right) shows the SPO-ML encoding of the SPO shown on the left side. The top

¡ ¡

layer of the XML document for the SPO consists of three elements context , table

¡ ¡ and conditional . The path is represented as an attribute for the spo element. The

contents of context and conditional parts are fairly straightforward. The probability table consists ¡

of a variable element, which contains a sequence of names of all the participating random ¡ variables in the SPO, and a collection of variable elements, each of which consists of a sequence of values for corresponding participating random variables and the probability for this row.

Semistructured Probabilistic Objects are complex structures and not all their properties can be

captured by XML validity checks. An SPO representation in SPO-ML should satisfy the following

¡ ¡

extra validity constraints. First, all row elements inside the table elements have to have

¡ ¡ ¡ exactly the same number of val elements. Second, the content of P elements inside row elements is expected to be real numbers between 0 and 1, and their sum must be less than or equal to 1. These additional constraints mean that validation on an XML representation of an SPO is a two-step process: first the XML is validated against the appropriate DTD/XML schema, and then the additional constraints are verified. While the validity of the SPO-ML documents is checked by a validating parser, these extra checks are performed by the SPDBMS itself.

77 ¥ : S race : Asian DA LY P race Asian A A 0.09 A B 0.12

A C 0.03 A F 0.005 DA LY B A 0.12 A A

0.09

B B 0.16 A B

0.12

B C 0.13 A C

0.03

B F 0.01 A F

0.005

...... C A 0.03 F A

0.0

C B 0.08 F B

0.01

C C 0.11 F C

0.02

C F 0.045 F F

0.04

F A 0.0 F B 0.01 DR AB F C 0.02

F F 0.04

  DR  A, B

Figure 4.2: A typical SPO object for risk level on Driver Age and License Years, and its XML representation.

4.2 Architecture of SPDBMS

We have implemented a prototype semistructured probabilistic database system on top of a RDBMS in Java, JDK1.3. Figure 4.3 depicts the overall architecture of our system. The core of the system is the SPDBMS application server which processes query requests from a variety of client applications. The application server provides a JDBC-like API, through which client applica- tions can send standard database management instructions, such as CREATE DATABASE, DROP DATABASE, CREATE SP-RELATION, DROP SP-RELATION, INSERT INTO SP-RELATION, DELETE FROM SP-RELATION, as well as SP-Algebra queries to the server.

4.3 Data Storage for SPOs

A relational database system has been used as a backend to store SPO-ML encoded data by mapping the XML schema onto a set of relational tables. While numerous techniques for convert- ing XML documents into relational databases exist [24, 34, 55, 81], as shown in [81] none of the proposed translation schemes is monotonically better than others. These schemes are proposed for

78 Application Client Application Client Domain Specific XML Schema Application Client XML TCP/IP

Communication Protocol Layer

User Auth. SPO Request Dispatcher

Insert Query Delete Update

Database Adapter

App. JDBC Server SQL RDBMS

Figure 4.3: The overall architecture of SPDBMS. ¡ storage of arbitrary XML with unknown structure. In the case of storing SPO-ML spo elements in relational databases, we can take advantage of our knowledge of the general structure of these elements when designing the translation mechanism. This consideration lead us to adopt a transla- tion scheme, described below, that is specific to the structure of SPO-ML object in lieu of a generic mechanism.

4.3.1 Mapping SPOs to relational tables

The SPO-ML to relational database translation works as follows. SPOs are stored in a set of relational tables with schema defined in Figure 4.4.

In order to improve data integrity and query performance, besides primary keys and foreign keys, we also created indexes for the last three type of relations, for instance, multicolumn index

79

RELATION (rid  integer, name varchar, schema varchar) contains SP-relation level infor- mation. It connects all other tables by using the table naming convention that every table uses the unique identifier of the corresponding SP-relation rid as a prefix for its name. The attributes name and schema represent the SP relation name and the corresponding schema URL, respectively.

rid SPO (id  integer, path varchar, head varchar, numvar integer) contains SPO level infor- mation. The association between this table and other tables is established by the unique identifier of an SPO id. The attribute head stores the prolog of an XML document and numvar stores the number of participating random variables in an SPO.

rid SPO CONS (id  integer, type char, elemname varchar, elemvalue varchar, idref var- char) contains all the information about SPO context and conditional. The attribute id is a , and type tells whether it’s a context or conditional. The attributes elemname and elemvalue give the pair of element name and element value.

rid SPO VAR (id  integer, position integer, varname varchar) contains all the information of SPO participating variables. The attributes position and varname represent a pair of position and

variable name. rid SPO num (id  integer, var 1 char, ... , var num char, p decimal) contains all the infor- mation of the probability tables for SPOs which have num participating random variables. The attribute p stores the probability value.

* represents keys to the relational tables.

¡

£¢¥¤ £¢¥¤

 is a variable, which equals the number of participating variables in a particular SPO. Thus, may vary §¨¦ from 1 to ¦ .

Figure 4.4: A set of relational schema for storing SPOs on (id, elemname) for relation rid SPO CONS, multicolumn index on (id, varname) for relation rid SPO VAR and multicolumn index on (id, var 1,..., var num) for relation rid SPO num.

4.3.2 Algorithms for Populating SP-Relations

The database system stores SPOs from each SP-relation in a separate set of relational tables. The CREATE algorithm, as shown in Figure 4.5, starts by checking whether the SP-relation spec- ified already exists. If not, it stores the SP-relation name and the schema url, and generates a unique identifier rid for the SP-relation. It also creates all the empty tables with the schema de- fined above and associates them with the SP-relation. In particular, all the relation names have a suffix rid. In order to store SPOs in an SP-relation, the SPOs must be validated against a schema

80 template and parsed into a Document Object Model (DOM) tree and then decomposed into five components, Head, Path, Context, Table and Conditional, according to the predefined relational schema generation technique. The INSERT algorithm, as shown in Figure 4.6, gets a unique SPO object identifier id, then stores the XML prolog information, the path and the number of partici- pating random variables in the SPO in the table rid SPO. It stores SPO context and conditional in the table rid SPO CONS, the probability table in the table rid SPO num and participating random variables in the table rid SPO VAR, respectively. Figure 4.7 shows the resulting tables after storing the SPO defined in Figure 3.1 in an SP-relation.

Algorithm CREATE (String SPR, String schema) 1. Query the table RELATION to obtain the unique SP-relation identifier rid for SP-relation SPR. 2. If SPR exists, return false. 3. Query the table RELATION to obtain the next available unique SP-relation identifier rid. 4. Insert the SP-relation name SPR and the schema URL schema into the table RELATION. 5. Create all the empty tables with relational schema defined above, namely rid_SPO, rid_SPO_CONS, rid_SPO_VAR and rid_SPO_num, where num varies from 1 to 4. 6. return true.

Figure 4.5: Pseudo-code for algorithm CREATE.

4.4 Querying the SPDBMS

The SP-Algebra operations described in previous sections have been implemented. The query language allows us to navigate through the entire database with structured queries, including any combination of SELECTION, PROJECTION, CONDITIONALIZATION, CARTESIAN PROD- UCT and JOIN. It also allows for conjunctive, disjunctive queries and negation operations.

4.4.1 Query Language (SPO-QL)

Currently we are using a query algebra-like query language for the SPDBMS. We list the

language grammar in Figure 4.8. Note that, here we only allow the left join for the join operation

¨ ¨

¡ ¢ ¢ ¢ since right join can be easily expressed in terms of left join, i.e. ¢ is equivalent to .

81 Algorithm INSERT (String SPR, InputStream xmlDoc) 1. Query the table RELATION to obtain the unique SP-relation identifier rid for SP-relation SPR. 2. If SPR does not exist, return false. 3. Validate and parse the XML document xmlDoc into XML DOM tree. 4. Extract the XML prolog and the list of SPO subtrees. 5. Dissemble each SPO subtree into five components, namely context, participating variables, table, conditional and path. 6. Insert XML prolog, path and the number of participating random variables num in the SPO into the table rid_SPO. 7. Insert context and conditional into the table rid_SPO_CONS. 8. Insert participating variables into the table rid_SPO_VAR. 9. If num is greater than 4 and table rid_SPO_num does not exist in the database, then a new table rid_SPO_num will be created before inserting the probability table. 10. Insert the values of the probability table into the table rid_SPO_num based on the number of random variables num in the SPO. 11. return true.

Figure 4.6: Pseudo-code for algorithm INSERT.

Also we use the traditional join symbol  instead to represent the left join in the query language.

4.4.2 Query Parsing

In the current implementation the query processing proceeds as follows. Structured queries are

first parsed and then transformed into a query tree. A typical SP-algebra query is

¨ ¨

    ¦

 ¨£ § ¡¤£¦¥ § © §

# § "! ¦  ¦

§  "! ! and its query parse tree is shown in Figure 4.9. The leaf nodes and internal nodes of the query tree denote the local SP-relations and SP-algebra operators, respectively. The SPDBMS query engine determines which operations get executed by the underlying rela- tional DBMS engine and which operations get executed by itself, and pushes as much query compu- tation as possible to the underlying RDBMS. The SPDBMS query engine is shown in Figure 4.10. Operations executed by the RDBMS are translated in a straightforward manner into a sequence

82 1 SPO 2 RELATION id var 1 var 2 P rid name schema 2 A A 0.09 1 First schema url 2 A B 0.12 2 A C 0.03 1 SPO 2 A F 0.005 id path head numvar 2 B A 0.12 2 S 2 2 B B 0.16 1 SPO VAR 2 B C 0.13 id position varname 2 B F 0.01 2 1 DA 2 C A 0.03 2 2 LY 2 C B 0.08 2 C C 0.11 1 SPO CONS 2 C F 0.045 id type elemname elemvalue idref 2 F A 0.0 2 cont race Asian null 2 F B 0.01 2 cond DR A null 2 F C 0.02 2 cond DR B null 2 F F 0.04

Figure 4.7: Internal representation for the SPO defined in Figure 3.1. of corresponding SQL statements that can be executed directly by the underlying RDBMS. The returned tuple stream is then transformed into an instance of the internal object model (for short, IOM, discussed below), and subsequent queries are performed against the IOM instance in the post- processor. Finally the XML generator does the assembly of the resulting XML document. As multiple equivalent query trees can be generated from a single SP-algebra query, the query

tree must be carefully chosen to reflect the query optimization strategy adopted by the query en- ¨ gine . As part of the query optimization (discussed later), the query tree may be simplified by a set of heuristic rules to reduce its processing cost.

4.4.3 SP-Algebra Query Translation

Here we give examples of query translations for two types of probabilistic queries, illustrating how to map from an SP-Algebra query to a sequence of SQL statements; other translations can

be found in Appendix A. First, consider selection on probabilities. Given a selection condition

 §

¡ ¨§ ¥¤¥¤¥¤  §  ¡ op and an SP-relation , this operation returns SPOs that have at least one row with probability that satisfies the selection condition. Selection preserves the context, participating

random variables and conditionals in the original SPOs, but returns only those rows of the proba- ¦ Work on query optimization in SPDBMS is underway, but the current version of the SPDMBS server does not have this feature.

83 

QueryExpr ::= ¢¡¤£¦¥¨§ ©  §

§ )+*-,/.102) 34

  !#"%$ & '¨(

§ )+*-,/.102) 34

< 8!#"%$

5687:9; & '¨(

§ )+*E,F.10) 3C

  !@ 8!C"+$

9:=¤>@? ?A9:= ?AB &D'¨(

§ )+*E,F.10) G § )+*-,/.102) 

'¨( '¨(

§ )+*E,F.10) HDI § ) *E,F.J02)

'¨( '¨(

! LF,F.J02)

SelectClause ::= K

) NE§4O QP £-  R§  S¨)%T#UJ§+V@O © § AY £ )4NE§ O AP £E  §

(XW

K M

ProjectClause ::= M

N-© § ,/.102)  T O T © § ©` § QY £ N-© §+,F.10)

(aW

K ]8^J[ [ ] [_ K Z\[]

ConditionalizeClause ::= Z\[]

)%,F.10)  cLF,F.10) de fg£ )%,F.J02)

K b

CNFExpr ::= b

NE©`§+,F.10)  )%,F.10) ¥¨£ N-© §+,F.10)

b b Z\[]

OrExpr ::= Z\[]

)4NE§ O APh£ c JR§ -© § ¢ h £ ) N-§ O QP £-  R§ © §

( (

i bjM M i

SingleExpr ::= M op op

hV O£h k )%£¢ O

op ::= 2mn£oh 2mepj£8¢ qn£oh 2qepj£8¢ -pj£ ¢ r8£ RelName ::= String Name ::= String Value ::= String

Selection conditions are represented in a conjunctive normal form (CNF).

Figure 4.8: The grammar of the query language for SPDBMS.

 © § bility table satisfying the selection condition. Consider a sample query "¡¤£¦¥¨§ . Figure 4.11 shows the sequence of SQL statements needed in order to evaluate this query.

Our second example is the operation of conditionalization. This operation computes condi-

tional probability distributions and thus, is specific to probabilistic algebras. Given the constraint

§

¥ 

¥ ¨§ ¥¤¥¤¥¤£ §  § "!

 and an SP-relation , conditionalization § first selects SPOs in which

¥ is a participating random variable. For each selected SPO, conditionalization preserves the con-

text, conditions the joint probability distribution, replaces the probability table with a new condi-

 ¥ 

  

tional probability distribution over the remaining random variables, and adds the condition ¥

 



 

§ "  ! to the conditional part of each of the resulting SPOs. Consider a sample query ! . In order to perform the conditionalization operation, first a sequence of SQL statements, as shown in Figure 4.12, need to be issued against the underlying RDBMS to retrieve all the SPOs satisfying the condition. This, however, is not enough for the conditionalization operation since this operation changes the probability tables in the results according to the semantics of the operation. These com-

84 Figure 4.9: The query parse tree for the sample SP-Algebra query. putations are performed by the postprocessing process, and the postprocessing algorithm is shown in Figure 4.13.

4.4.4 Internal Object Model (IOM)

When a client application sends queries to the SP-database, the underlying RDBMS could not handle them all and there always are intermediate XML objects. In a typical XML database system, a generic DOM tree (or XML view) is used to represent those XML objects. In our approach, since we know in advance our DTD/XML schema template we use an internal object model instead, as shown in Figure 4.14, for efficient manipulation. Basically, an SP-relation object consists of a set of SPO objects and each SPO object consists of three objects, namely Context, Table and Conditional. The component Path is included in SPO object and the component Variable is included in the Table object.

4.5 Experiments

In this section we present the results of tests conducted with the prototype system. The algo- rithms for storing and querying SPO objects in a relational database are completely independent of

85 Figure 4.10: The SPDBMS query engine. the underlying RDBMS. All features specific to a RDBMS are implemented in the special database adapter layer, so the system can port to any relational database with little modification.

4.5.1 System Environment

The current system uses Oracle8i as the RDBMS back-end. To avoid network delays during tests, both the application server and Oracle DB server were running on the same machine, a 440 MHz Sun Ultra 10 running Solaris OS with 1GB of main memory. For all the experiments we allocate 256M memory for the JVM and the timing was done on the server side.

86 step 1. Get the SPO ID list based on given probability value: SELECT DISTINCT id FROM rid_SPO_1 WHERE p > 0.1 ... UNION ALL SELECT DISTINCT id FROM rid_SPO_i WHERE p > 0.1 ... UNION ALL SELECT DISTINCT id FROM rid_SPO_max WHERE p > 0.1 step 2. Get the variable list for each SPO in the ID list: SELECT id, varname, position FROM rid_SPO_VAR WHERE id IN {ID list} step 3. Retrieve context/conditional: SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS WHERE id IN {ID list} step 4. Retrieve probability table SELECT * FROM rid_SPO_1 WHERE id IN {ID list} AND p > 0.1 ... UNION ALL SELECT * FROM rid_SPO_i WHERE id IN {ID list} AND p > 0.1 ... UNION ALL SELECT * FROM rid_SPO_max WHERE id IN {ID list}

AND p > 0.1

 

¡¤£¦¥ § © § Figure 4.11: Steps to evaluate the selection query

87 step 1. Get the SPO ID list based on the variable name(s): SELECT DISTINCT id FROM rid_SPO_VAR WHERE varname = ’DA’

step 2. Get the variable list for each SPO in the ID list: SELECT id, varname, position FROM rid_SPO_VAR WHERE id IN {ID list}

step 3. Retrieve context/conditional: SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS WHERE id IN {ID list}

step 4. Retrieve probability table SELECT id, {var_i list}, P FROM rid_SPO_1 WHERE id IN {ID list} AND var_position = ’A’ ... UNION ALL SELECT id, {var_j list}, P FROM rid_SPO_i WHERE id IN {ID list} AND var_position = ’A’ ... UNION ALL SELECT id, {var_k list}, P FROM rid_SPO_max WHERE id IN {ID list}

AND var_position = ’A’



   

§   ! Figure 4.12: Steps to evaluate the conditionalization query !

step 1. Store the retrieved SPOs in predefined data structures; step 2. Remove the entire column for attribute DA; step 3. Compute the sum of all the probabilities; step 4. Divide all the probabilities by the computed sum; step 5. Add a new condition (i.e., DA = ’A’) in the conditional part; step 6. Assemble the final SPOs into an XML document.

Figure 4.13: Postprocessing algorithm for the conditionalization query

In order to ensure consistency, each experiment consists of 20 runs, and each point on a graph represents the average running time for the 20 runs. We also restarted the application server for each experiment to minimize the time difference consumed by garbage collection. Most test data

88 Context Table Conditional

1 1 1 1 1 1 SPO

* 1 SP−Relation

Figure 4.14: An internal object model for SP-relation. ¡ sets used in the experiments are generated randomly by a custom data generator . However, for Cartesian product and join, we generated specific data sets in order to control the selectivity for each query. Each data set was generated based on the following three parameters: number of SPOs per SP-relation, number of participating random variables in an SPO, and size of the domain of participating random variables. The first parameter affects the number of objects to be stored in the database while the other two affect the size of individual objects. The number of SPOs specifies the total number of individual SPOs in an XML document, which are inserted into an SP-relation during the test. Throughout the experiments, we used a fixed number of context and conditional elements in a single SPO. So the last two parameters specify the internal structure of each SPO and consequently the size of each SPO. Table 4.1 shows some typical data sets with corresponding file size and number of tuples in the underlying Oracle database. Please also note that both file size and number of tuples increase linearly with the number of SPOs, increase nearly exponentially with number of variables, and increase polynomially with number of domain size for random variables.

4.5.2 Experimental Results

We examined the running time for each type of atomic SP-Algebra query. Most queries are ¡

generated randomly at runtime by a custom query generator in the client application running on  Both the SPO data generator and the SP-Algebra query generator utilize a linear congruential pseudo-random number generator, which comes with Sun JDK1.3 package.

89 Table 4.1: File size and number of tuples in Oracle database for typical data sets. number of SPOs 1,000 1,000 1,000 10,000 10,000 10,000 number of variables 2 4 2 2 4 2 size of domain 2 2 4 2 2 4 size of original XML file (MB) 0.38 1.64 1.10 3.81 16.4 11.0 number of tuples in Oracle DB 10,000 24,000 22,000 100,000 240,000 220,000 another machine, an 864 MHz Pentium III running Linux with 512MB of main memory. During the experiment, we have collected both the total running time and the time consumed by the Oracle DB server for executing SQL statements. The Oracle server consumes 75 - 95% of the total execution time for most queries, and the percentage increases with the size of the XML files. A typical case is shown in Table 4.2.

Table 4.2: Time distribution between Oracle and postprocessing for data set of 10,000 SPOs with 3 participating variables and domain size of 2. Test type Total time/sec Oracle time/sec Postprocessing/sec %Oracle Select on context 1.193 1.053 0.140 88 Select on conditional 0.955 0.845 0.110 88 Select on variable 1.081 0.959 0.122 88 Select on table 0.736 0.661 0.075 89 Project on conditional 1.080 0.958 0.122 88 Project on variable 0.502 0.471 0.031 93 Conditionalization 0.769 0.619 0.150 80

To study the effects of the number of SPOs in an SP-relation on the query running time, differ- ent experiments were conducted with the number of SPOs varying from 10 to 10,000. The results are plotted in Figure 4.15. It can be observed that all types of unary SP-Algebra queries scale well as the size of SP-relations increases: The reason is that, as the number of SPOs in an SP-relation in- creases, both the number of disk page I/Os and postprocessing time increase. Moreover, the running time increases sublinearly with the number of SPOs for large SP-relations, but at a much slower rate for SP-relations of small size. In Figure 4.16, the effects of domain size for participating variables are shown. Notice that the number of tuples considered grows polynomially (i.e., quadratically in this case, the number of variables equals 2) with the size of the domain. The running time increases with the size of domain, but at a slower rate – especially considering that the size of XML files increases quadratically with the size of domain. In particular, the running time increases 50% – 150% as the size of domain increases from 2 to 5. Figure 4.17 shows the effects of the number of variables in SPOs on the time for the condition- alization operation. The effect of running time on the size of the SP-relation and query selectivity

90 Selection operations .64 domain size = 2 number of variables = 2 .32

.16 on conditional

time/sec on variable .8e−1 on context on table .4e−1

.2e−1 100 316 1000 3160 10000 number of SPOs Projection operations domain size = 2 on context number of variables = 2 10

1 on conditional time/sec

on variable .1

100 316 1000 3160 10000 number of SPOs Conditionalization operation .64 domain size = 2 number of variables = 2 .32

.16

time/sec .8e−1

.4e−1

.2e−1 100 316 1000 3160 10000 number of SPOs

Figure 4.15: Effect of number of SPOs in an SP-relation. for selection on probability is shown in Figure 4.18. We can see that the running time increases with selectivity faster at lower selectivity and increases with the number of SPOs linearly. As we can see, running time increases with the number of participating variables, and at a relatively faster rate for larger SP-relations than for smaller ones. However, the increase in execution time is smaller

91 Selection operations .32 number of SPOs = 1000 number of variables = 2

.16 on context on variable

.8e−1 on conditional time/sec on table .4e−1

.2e−1 2 3 4 5 domain size Projection operations on context 2.56 number of SPOs = 1000 number of variables = 2 1.28

.64

.32 on conditional

time/sec .16

.8e−1 on variable .4e−1

.2e−1 2 3 4 5 domain size Conditionalization operation .64 number of SPOs = 1000 number of variables = 2 .32

.16

time/sec .8e−1

.4e−1

.2e−1 2 3 4 5 domain size

Figure 4.16: Effect of domain size of participating random variables.

compared to the quadratic increase in size of XML files with the number of random variables. The same effect has been seen for selection and projection operations as well. Finally, Figure 4.19 shows the dependence of running time on the size of the SP-relation and query selectivity for the Cartesian product operation. The graph shows that the running time for Cartesian product increases with the

92 number of SPOs at a nearly quadratic rate, and also increases with the selectivity, but at a much slower rate. One reason is that the number of SPOs output increases quadratically with the number of SPOs in the initial SP-relations. The same effect can be seen for the join operation, as shown in Figure 4.20.

Conditionalization operations Selection on probability value domain size = 2 #SPO = 10000 1.28 100 .64 4000

.32 10

.16 1000

time/sec 1 time/sec .8e−1 400 0.1 .4e−1 100 100 0.2 316 .2e−1 0.4 2 3 4 0.6 1000 selectivity 0.8 3160 number of variables 10000 number of SPOs

Figure 4.17: Effect of number of variables on Figure 4.18: Effect of number of SPOs and conditionalization operation. selectivity on selection query on probability.

Cartesian product operation Join operation

10000 10000

1000 1000

100 100

10 10 time/sec time/sec

1 1

0.1 10 0.1 10 32 32 0.2 0.4 100 0.2 0.4 100 0.6 0.8 316 0.6 0.8 316 selectivity 1000 Number of SPOs selectivity 1000 Number of SPOs

Figure 4.19: Effect of number of SPOs and Figure 4.20: Effect of number of SPOs and selectivity on Cartesian product. selectivity on join operation.

4.6 Summary of the SPDBMS

Based on the SPO framework described in Section 3, a semistructured probabilistic database management system (SPDBMS) has been designed and implemented on top of a relational DBMS. We discussed the XML-based implementation of the SPO data model and showed how SPOs are

93 represented in XML with a meta-language SPO-ML. We described our relational schema generation strategies for efficiently mapping XML representations of SPOs into relational tables. We presented our query language and translation mechanism which automatically generates a sequence of SQL queries for evaluating SP-Algebra queries. We described the overall architecture of the SPDBMS and its query engine. An internal object model was designed to represent intermediate XML ob- jects and it facilitates the efficient retrieval of stored information. The experimental results of our database management system indicate that the SPDBMS performs well in terms of response time and scalability even without query optimization.

94 Chapter 5

Extended Semistructured Probabilistic Database Framework

One of the restrictions of the SPO framework described in Chapter 3 is the scope of the context and conditionals in an SPO. Every single attribute of context and every conditional entry relates to the entire probability distribution in SPOs. It would be convenient to specify contextual properties of individual random variables as well as to be able to associate conditionals with single random variables or proper subsets of the set of all participating random variables of an SPO. Another major

difference of the extended SPO framework from SPO framework is the use of probability intervals

 to represent uncertainty. We will denote C[0,1] as the set of all subintervals of interval  . In what follows, we extend the definition of an SPO to handle scopes of context and condition- als, as well as interval probabilities. We give an overview of the extended framework followed by the detailed description.

5.1 Overview of the ESPO framework

We also start with the Extended SPO data model.



Let C[0,1] be a set of all subintervals of the interval  . An Extended Semistructured

  

   ¡ ¤£ ¤¥ ¦ ¢ ¥

Probabilistic Object (ESPO) ¢ is a tuple , where is extended context

©

   ¥ ¥¤¥¤¥¤  ¥  

over some schema ¦ and ; is a set of random variables that participate in

¦ ¤

  

¡   ¢ ¡

¢ ; C[0,1] is the probability table of . must be consistent (see Definition 2.3



 ¤  

  

 ¡    ¥¤¥¤¥¤  ¡     

in Section 2.2); £ is a set of extended conditionals over ,

¡



£¡ ¥¤¥¤¥¤ ¥¡  £  ¥ ¢ and  ; , called a path expression or path of is an expression in the Extended Semistructured Probabilistic Algebra. As we can see, the Extended SPO model extends the flexibility of the SPO model with the two important features: (i) support for interval probabilities and (ii) association of context and conditionals with individual random variables. We describe the ESPO data model in Section 5.2 (We refer the readers to Section 2.2 for the the underlying semantics for interval probability distributions.). In Section 5.3, we describe the Ex- tended Semistructured Probabilistic Algebra (ESP-algebra). Besides similar algebra operations to those described in 3.3, we include new operations for selection and projection operations to handle queries on the extended context and extended conditionals. These are relatively straightforward.

95 However, it is nontrivial to extend the algebra operations to handle interval probability distribu- tions correctly based on the laws of probability theory, especially for conditionalization and join operations. Dekhtyar and Goldsmith [19] solved the problem of computing the result of condition- alization of interval probability distributions. We present here their final formula, while referring the reader to [19] for a more detailed discussion of its derivation and discussion. Please note, however, that as Jaffray [54] pointed out, interval conditional probability estimates will not be perfect, and the unfortunate consequence of this is that conditionalizing is not commutative. The same problem occurs for join operation since join operation utilizes the result of conditionalization operation. In the sections that follow, we will assume that all ESPOs under consideration have consistent and tight probability distribution functions. Using the tightening operator according to Theorem 2.2

will allow us to replace any probability distribution function that is not tight with its tight equivalent.

 

¥    ¡ ¤£ ¤¥ ¦

Definition 5.1 An Extended Semistructured Probabilistic Object ¢ is consis-

¢ ¡ tent iff ¡ is consistent. Also, is tight iff is tight.

We describe in details the data model and query algebra for the extended version of the SPO framework.

5.2 Extended Semistructured Probabilistic Object (ESPO) Data model 

Definition 5.2 Let ¡ be a context schema. Let be a set of random variables. An extended context



£

 ¥ ¤  ¥ 

  

 ¥ ©    ¥¤¥¤¥¤  ©    ¦  © §  ¡    

over ¡ and is a tuple where (i) , ; (ii)

¤% £ £

¥  

  © §     §      § , ; (iii) , .

Intuitively, extended context is organized as follows. Given the context of a simple SPO and

the set of participating random variables  , we associate with each context value the set of random variables for which it provides additional information.

Example 5.1 Consider the context attributes of the Simple SPO in Figure 5.1 (top). The Id, popu- lation and date attributes relate to the entire object. At the same time, we would like to represent the fact that 323 survey respondents indicated the intention to vote on the park construction question while 342 respondents responded to the question about their vote on the AI conference legalization. Without extended context, we can include both responses:323 and responses:342 in the SPO but we cannot associate the occurrences of the attributes with individual random variables. The SPO

96 with extended context in the bottom left of the Figure 5.1 shows how extended context alleviates this

problem.

¡

¢ ¥ : Id: Poll3 population: entire town responses: 323 responses: 342 date: October 22

park legalization ¡ yes yes 0.57 yes no 0.2 no yes 0.25 no no 0.08

senate = Donkey

¡ ¨ ¡ ¡

¢ ¥ ¢ ¥ : : Id: Poll3 Id: Poll3

population: entire town population: entire town

¡ ¨$£¢  ¡ ¨£¢ 

responses: 323  responses: 323

§ ¤¨  £¨  ¦  § ¤¨  £¨ ¥ ¦  responses: 342  responses: 342

date: October 22 date: October 22 ¡ park legalization ¡ park legalization yes yes 0.57 yes yes 0.57 yes no 0.2 yes no 0.2 no yes 0.25 no yes 0.25

no no 0.08 no no 0.08

¡ §¨£¢  senate = Donkey senate = Donkey 

Figure 5.1: Simple vs. Extended context and conditionals in SPOs

We note that in Example 5.1 the other two context attributes, population and date have the scope over the entire set of random variables participating in the SPO. This can be represented explicitly, by specifying the entire set. However, we assume that whenever the scope is not specified for a context attribute, the scope of the attribute is the entire set of participating random variables (as we did here).

Similarly to context, we extend the conditionals.

 ¤  

 

 ¡  ¥¤¥¤¥¤  ¡   

Definition 5.3 Let £ be a set of conditionals and be a set of





¤£ £¡ ¥¤¥¤¥¤ ¥¡  £

random variables s.t.,  . The set of extended conditionals is defined



£

 ¤  

  

 ¡    ¥¤¥¤¥¤  ¡     §       as £ where , .

97 Extended conditionals are more subtle than extended context, but they are useful in a number of situations.

Example 5.2 To continue the previous example, consider now the conditional part of the SPO on the top of Figure 5.1. The condition senate=Donkey applies to the entire distribution. In the SPO model only such conditions can be expressed. As [20, 44] note, this leads to significant restrictions put on query algebra operations of join and Cartesian product: these two operations are defined only for pairs of SPOs with identical conditional parts. By extending conditionals to specify scope, as shown in the bottom right SPO on Figure 5.1, we can extend the expressive power of the frame- work. The particular SPO in question could have originated as a result of a Cartesian product (See Section 5.3.4) of an SPO containing information about park initiative votes of people who prefer the Donkey party candidate for the Senate and an SPO containing voter preferences on the AI conference legalization initiative. In the SPO model of [20] this operation would not have been possible.

We can now give the definition of an Extended SPO (ESPO).



Definition 5.4 Let C[0,1] be a set of all subintervals of the interval  An Extended Semistruc-

 

   ¡ ¤£ ¤¥ ¦ ¢ ¥

tured Probabilistic Object (ESPO) ¢ is a tuple , where





is extended context over some schema ¦ and ;

 ¥ ¥¤¥¤¥¤£ ¥ ©   ¢

 is a set of random variables that participate in . We require that

¡ ¤

 ;

¦ ¤

  

  ¢ ¡ ¡ C[0,1] is the probability table of . must be consistent (see Definition

2.3 in Section 2.2);



 ¤  

  

 ¡    ¥¤¥¤¥¤  ¡      £¡ ¥¤¥¤¥¤ 

£ is a set of extended conditionals over , and

¡



 £ 

¡ , and

¢ ¥ , called a path expression or path of is an expression in the Extended Semistructured Probabilistic Algebra.

The exact syntax and construction of paths will be explained in Section 5.3.

Example 5.3 Figure 5.2 shows the anatomy of ESPOs. The object in the figure represents a joint probability distribution of votes in the Senate race and for the AI conference legalization ballot

98 initiative for male voters who chose to vote Donkey for mayor. The distribution is based on the survey that took place on October 23; 238 respondents indicated their vote in the Senate race, 195

in the legalization vote, with 184 of the respondents giving both answers.

path expression

¥ : S

date: October 23

gender: male extended context

£¢¥¤§¦ ¨ ¤ 

respondents: 238, 

§ ¤¨ £¨  ¦  respondents: 195, 

overlap: 184

random variables senate legalization [ l, u] Rhino yes [0.04, 0.11]

Rhino no [0.1, 0.15] interval probability table Donkey yes [0.22, 0.27] Donkey no [0.09, 0.16] Elephant yes [0.05, 0.13]

Elephant no [0.21, 0.26]

£¢¥¤§¦©¨ ¤§ ¤¨  £¨  ¦  mayor = Donkey extended conditional

Figure 5.2: Extended Semistructured Probabilistic Object

We extend the notion of equivalence for ESPOs.

   

   ¡ ¤£ ¤¥ ¦  £  ¡¤ ¤£ ¤¥ ¦ ¥ ¢ ¥

Definition 5.5 Let ¢ and be two ESPOs. We say

   

¢ ¢ ¢  ¢   ¡ ¡ £ £

that is equivalent to , denoted , iff , , and .

Note: When representing ESPOs we assume that a lack of random variables after a context attribute or a conditional indicates that it is associated with all participating random variables. Therefore, strictly speaking, we did not need to explicitly include the list of associations for the mayor=Donkey conditional in Figure 5.2; we did it to make a point that the conditional part has extended syntax.

5.3 Extended Probabilistic Semistructured Algebra

We have described the ESPO data model and the underlying semantics for interval probabil- ity distributions (See Section 2.2). We are now in position to define the Extended Probabilistic Semistructured Algebra (ESP-Algebra). As in Section 3.2, we will give definitions for five major

99 operations on the objects: selection, projection, Cartesian product, join and conditionalization. In [87] we have described how these operations can be defined in a query algebra for interval prob- ability distributions only (without context and conditionals) in a generic way. Here, we ground the operations described in [87] in the ESPO data model. The first four operations are extensions of the standard relational algebra operations. How- ever, these operations will be expanded significantly in comparison both with classical relational algebra [72] and with the definitions in [20]. The fifth operation, conditionalization, is specific to probabilistic databases and represents the procedure of constructing an ESPO containing a con- ditional probability distribution given an ESPO for a joint probability distribution. First proposed as a database operation by Dey and Sarkar [25] for a relational model with point probabilities, this operation had been extended to non-1NF databases in [20]. In the sections below, we will describe each algebra operation. We will base our examples on the elections in Sunny Hill that we have described in Section 1.1.4. Figure 1.2 shows different probability distributions representing a variety of polling data. These polling data can be easily represented in our ESPO format, as shown in Figure 5.3. We assume that all these objects have been inserted in the database in their current form. In a relational data model, a relation is defined as a collection of data tuples over the same set of attributes. In our model, an Extended Semistructured Probabilistic relation (ESP-relation) is a set of ESPOs and an Extended Semistructured Probabilistic database (ESP-database) is a set of ESP-relations. Grouping ESPOs into relations is not done based on structure, as is the case in

the relational databases; ESPOs with different structures can co-exist in the same ESP-relation. In

¨ ¡



¢  ¢  ¢  ¢  ¢¡ © ¢¡¢  ¢¤£  the examples below we will consider ESP-relation  consisting of ESPOs from Figure 5.3.

5.3.1 Selection

There is a variety of data stored in a single Extended SPO; for each individual part of the object we need to define a specific version of the selection operation, namely, selection based on context, random variables, conditionals, probabilities and probability table. The first three types of operations, described in Section 5.3.1, when applied to an ESP-relation produce a subset of that relation, but individual ESPOs do not change (except for their paths): they either satisfy the query and are returned or do not satisfy it. On the other hand, selections on probabilities or on probability tables (described in section 5.3.1) may lead to changes in the ESPOs being returned: only parts of the probability tables may “survive” such selection operations. Different types of selections are illustrated in the following example.

100

¢ ¥ :

gender: men

¨ ¢ party: Donkey ¥ :

date: October 26 date: October 23 ¡

mayor park legaliz- gender: male

£¢¥¤§¦©¨ ¤ 

ation respondents: 238, 

§ ¤¨  £¨  ¦  Donkey yes yes 0.44 0.52 respondents: 195, 

Donkey yes no 0.12 0.16 overlap: 184 ¡ Donkey no yes 0.08 0.12 senate legaliz- Donkey no no 0.04 0.08 ation Elephant yes yes 0.05 0.1 Rhino yes 0.04 0.11 Elephant yes no 0.01 0.02 Rhino no 0.1 0.15 Elephant no yes 0.03 0.04 Donkey yes 0.22 0.27 Elephant no no 0.06 0.08 Donkey no 0.09 0.16 Rhino yes yes 0.02 0.04 Elephant yes 0.05 0.13

Rhino yes no 0.01 0.03 Elephant no 0.21 0.26

£¢¥¤§¦©¨ ¤§ ¤¨  £¨ ¥ ¦  Rhino no yes 0.03 0.05 mayor = Donkey  Rhino no no 0.01 0.04

senate = Donkey

¡

¥ ¢

¢ ¥ ¢¡ locality: Sunny Hill ¥ : :

date: October 26 locality: South Side locality: Downtown ¡ park legaliz- date: October 12 date: October 12

ation sample: 323 sample: 275

¡ ¡ yes yes 0.56 0.62 mayor mayor yes no 0.14 0.2 Donkey 0.2 0.26 Donkey 0.48 0.55 no yes 0.21 0.25 Elephant 0.42 0.49 Elephant 0.25 0.3 no no 0.03 0.07 Rhino 0.25 0.33 Rhino 0.2 0.24

mayor = Donkey

¢¤£

¥ :

¢¡¢ ¥ : locality: Sunny Hills locality: West End date: October 26 date: October 12 sample: 249

sample: 249 ¡

represent- ¡ mayor ative Donkey 0.38 0.42 Donkey 0.33 0.39 Elephant 0.34 0.4 Elephant 0.32 0.37 Rhino 0.15 0.2 Rhino 0.25 0.3

Figure 5.3: Sunny Hill pre-election polls in ESPO format

101 Example 5.4 Table 5.1 lists some examples of queries that should be expressible as selection queries on ESPOs. For each question we describe the desired output of the selection operation.

Questions 1 – 3 and 5 in the example above do not involve the extensions of the SPO data model suggested in Section 5.2. To deal just with these kinds of queries, we could adapt the definitions from our original SPO algebra of [20]. Questions 4, 6 and 7, however, involve the extensions to the SPO model. Thus, the selection definitions from [20] need to be revised to incorporate new types of queries (like 6 and 7) and new formats for already defined queries (like 4). a. Selection on Context, Random Variables and Conditionals In this section, we define the selection operations that do not alter the content of the selected objects. We start by defining the acceptable languages for selection conditions for these types of

selects.



© ¥¤¥¤¥¤+©

Recall that the universe ¡ of context attributes consists of a finite set of attributes

¤% ¤

 ¤    



© ¥¤¥¤¥¤   © ©  ¡ ¡ ©

with domains  . With each attribute we associate a set of

 ¡

allowed predicates. We assume that equality and inequality are allowed for all © .

© ¡

Definition 5.6 1. An atomic context selection condition is an expression ¦ of the form “ Q

¤

     

© £¡ ©  ¡ ¡   ©  ¡ ©

( )”, where , and .

¥  

2. An atomic participation selection condition is an expression ¦ of the form “ ”, where



¥ is a random variable.

¢¡ ¥¤¥¤¥¤

3. An atomic conditional selection condition is one of the following expressions: “ ¡

¤

 

¡  ¡ £ ¡ ¡  ¡ £¡ ¥¤¥¤¥¤ £¡¢¡   ¡

¡ or “ ” where is a random variable and . We will

¡ ¡ ¢¡ 

slightly abuse notation and write “ ¡ ” instead of “ ”.

¨

 ¦

4. An extended atomic context selection condition is an expression ¦ where is an atomic



context selection condition and  is a set of random variables.

¨

 ¦

5. An extended atomic conditional selection condition is an expression ¦ where is an atomic

 conditional selection condition and  is a set of random variables.

Example 5.5 The table below contains some examples of selection conditions of different types for the Sunny Hill pre-election polls database.

Complex selection conditions can be formed as Boolean combinations of atomic selection conditions. The definitions below formalize the selection operation on a single Extended SPO.

102 Table 5.1: Selection queries to ESPOs.

# Query Answer 1. “What information is available Set of ESPOs that have about voter attitudes on October 26?” date: October 26 in their context. 2. “What are other voting intentions of Set of ESPOs which have as a people who choose to vote Donkey conditional mayor=Donkey. for mayor?” 3. “What information is known about Set of ESPOs that contain mayor in the voter intentions in the mayoral race?” set of participating random variables 4. “What voting patterns are likely to occur In the probability table of each ESPO, the with probability between 0.2 and 0.3?” rows with probability values guaranteed to be between 0.2 and 0.3 are found. If such rows exist, they form the probability table of the ESPO that is returned by the query. 5. “With what probability are voters likely Set of all ESPOs that contain mayor and to choose a Donkey mayor and Elephant senate random variables, with the Senator? probability tables of each containing only the rows where mayor=Donkey and senate=Elephant.

6. “Find all distributions based on more Set of ESPOs that contain senate random

¡  than 200 responses about senate vote.” variable and responses = X with is associated with it in the context. 7. “How do people who intend to vote Set of ESPOs that contain park random Donkey for mayor plan to vote for the variable and conditional mayor=Donkey park construction ballot initiative?” is associated with it.

Table 5.2: Different types of conditions for selection queries.

Selection Condition Type Conditions

¡

 ¤$£¢¥¤ ¢ ¨ ¡ ¤ ¡§¦©¨¥¨

Context ¨ ¤ ;

¨ ¨

¤¢ ¦ ¤§¦ ¢ ¡§¢©¨¥¨ £¢¥¤§¦ ¨ ¤  §¤§¦ ¤$ ¡ ¤§¦  ¡ ¨ # ¨$£¢ 

Extended Context  ;

¨ -  ¨ ¢ 

Participation ¡ ;

¨ - ¦ ¢¤ ¢¥¤§¦ ¨ ¤ £  § ¦  ¢¥¤§¦©¨ ¤    ¦  ¦ ¢ ¤ 

Conditional ¡ ; ;

¨ ¨

¨ # ¦ ¢¤ ¡ ¨$£¢  ¢¥¤§¦ ¨ ¤ £  § ¦  ¡ ¨£¢  ¡ ¨ - 

Extended Conditional ¡ ; ;

 

 

¥    ¡ ¤£ ¤¥ ¦ ¦ © £¡

Definition 5.7 Let ¢ be an ESPO and let be an atomic context

 

    

 

¥    ¡ ¤£ ¤¥ ¦ ¥ ¥ ¢  ¢ 

selection condition. Let ¢ where . Then iff



 ¥  ¦¥    ¡

§ 

    £¡ ¤ ¢

there exists a tuple © such that ; otherwise .

 

¦

¢ ¥ ¦ ¥§    ¡ ¤£ ¤¥ ¦ Definition 5.8 Let  be an ESPO and let be an atomic participation

103

 

    

 

¥    ¡ ¤£ ¤¥ ¦ ¥ ¥ ¢  ¢ 

selection condition. Let ¢ where . Then iff

 

¥ .

 

¦

¥    ¡ ¤£ ¤¥ ¦ ¦ ¡ ¢¡ ¥¤¥¤¥¤ £¡ ¡ 

Definition 5.9 Let ¢ be an ESPO and let be an atomic

 

    

 

¥    ¡ ¤£ ¤¥ ¦ ¥ ¥ ¢

conditional selection condition. Let ¢ where . Then



 

¢  £ £ ¡  ¢¡ ¥¤¥¤¥¤ £¡ ¡ 

 iff and .



¦

 



¡ £ ¡ ¢  ¢  £ £

Let ¦ be an atomic conditional selection condition. Then iff

 

£ ¡ 

¡ and .

 

 ©¨

¥    ¡ ¤£ ¤¥ ¦ ¦ © £¡ 

Definition 5.10 Let ¢ be an ESPO and let be an extended

 

  



¥    ¡ ¤£ ¤¥ ¦ ¥ ¥

atomic context selection condition. Let ¢ where . Then





   ¥  ¦¥ 

 §

¢  ¢         £¡ 

iff there exists a tuple © such that (i) ; (ii) ;

  ¡

 ¢

otherwise .

 

¦

¨

   ¡ ¤£ ¤¥ ¦ ¥ ¦ ¡ ¢¡ ¥¤¥¤¥¤ £¡¢¡  

Definition 5.11 Let ¢ be an ESPO and let be an

 

  



   ¡ ¤£ ¤¥ ¦ ¥ ¥ ¥

extended atomic conditional selection condition. Let ¢ where .



   



¢  ¢  £ £ ¡    ¢¡ ¥¤¥¤¥¤ £¡ ¡    £

Then iff , , and .

¦

¨  



¦ ¡ £¡ ¢  ¢ 

Let be an extended atomic conditional selection condition. Then iff



 

£ ¡    £ ¡    £ , and .

The semantics of atomic selection conditions discussed so far can be extended to their Boolean

        ¦

¢  ¢ ¤ ¢ ¢

¢ ¦

    

! !

combinations in a straightforward manner: ! and

 

¢  ! .

The interpretation of negation in the context selection condition requires some additional

 

© £¡

explanation. In order for a selection condition of the form © to succeed on an ESPO

  

¥    ¡ ¤£ ¤¥ ¦ © ©

¢ , attribute must be present in . If is not present in the context of

¢ , the selection condition does not get evaluated and the result will be . Therefore, the state-

   

 

 ¢ ¢  ¢ ment ¢ is not necessarily true. This also applies to conditional selection

conditions.

  ¦   £ 

¤

§ 

Finally, for an ESP-relation ,  . We note here that whenever an ESPO satisfies any of the selection conditions described above, four of its five components, namely, context, participating variables, probability table and con- ditional, are returned intact. The only part of the ESPO that changes is its path: The new path expression reflects the fact that the selection query had been applied to the object.

104 Example 5.6 Consider our ESP-relation  (Figure 5.3). Below are some possible queries to this relation and their results (we specify the unique ids of the ESPOs that match the query).

Id Type Query Result

¡



¢¡

¦ ¨ ¦¥¤ ¦  ¨¨¤  ¢  ¢  ¢¤£ 

£ §¦ ¨©

Q1 context §

 

¢ £

§©¨ ¨   ¢  ¢  ¢  ¢  ¢ 



Q2 participation §

¡

 

¢¡ ¡ ¡

¤

 ¢  ¢ 

¤ ¨ ¦¥¤ ¡  ¤

¤§  §

Q3 conditionals £

¨



¢¡ ¡ ¡ ¡

¤

 ¢ 

¦ ¤  ¤ ¦ ¤ ¨ ¦¥¤

 £   ¨ ¨ 

Q4 ext. context §





¢¡ ¡ ¡ 

¤

 ¢ 

¦ ¤ ¤ ¤ ¨ © ©¨

 £   ¨

Q5 ext. context §

¨



¢¡ ¡ ¡

¤ ¤

 ¢ 

¨  ¡  ¤ ¤ ¨ ¦¥¤

¤§ §  

Q6 ext. conditional £



 

¢¡ ¡ ¡

¤ ¤

¨  ¡  ¤ ¤ ¨ ¦¥¤  ¤

¤§ §    Q7 ext. conditional £ b. Selection on Probabilities and Probability Tables The two types of selections introduced in this section are more complex. The result of a selec- tion operation of either type depends on the content of the probability table, which can be considered as a relation (each row being a single record). In the process of performing the probabilistic selec- tion or selection on the probability table (see questions 4 and 5, Example 5.4, respectively), each row of the probability table is examined individually to determine whether it satisfies the selection condition. It is retained in the answer if it does and is thrown out if it does not. Thus, a possible result of either of these two types of selection operation is an ESPO with an incomplete probability table. As the selection condition relates only to the content of the probability table of an ESPO, its context, participating random variables, and conditionals are preserved. We start by defining

selection on probability tables.

¡

Definition 5.12 An atomic probabilistic table selection condition is an expression of the form ¥

¤%

 

 ¡   ¥ where ¥ and . Probabilistic table selection conditions are Boolean combinations

of atomic probabilistic table selection conditions.

 

¦

¥    ¡ ¤£ ¤¥ ¦   ¥ ¥¤¥¤¥¤  ¥  ¦ ¥ ¡ Definition 5.13 Let ¢ be an ESPO, , and let be

an atomic probabilistic table selection condition.

¡

£

 



§

  ¥ ¥    ¢ ¦ ¢

If ¥ , then (assuming ) the result of selection from on , is a

 

  



¢ ¥    ¡ ¤£ ¤¥ ¦ ¥ ¥

semistructured probabilistic object , where and

$

S * Y4PCPCP4Y' Y4PCP4P#Y' 3 p)(§*

 "

&

% &

£

*%YCP4PCPCY ! YCP4PCPCY #" 3 p

S if

&

p)( P

unde®ned if ,+

Example 5.7 Consider the ESPO ¢ from Figure 5.3. The leftmost ESPO of Figure 5.4 shows

 ¤



¦©¨  ¨ ¢ '¤

§  the result of the selection query on probability table: (find the probability of all

105 voting outcomes where respondents support the park ballot initiative). Following Definition 5.13, the result of this query is computed as follows: the context, list of conditionals and participating random variables remain the same, while the probability table now contains only the rows that satisfy the selection condition and the path changes to reflect the selection operation.

We note that if the same query is applied to the entire relation  , the resulting relation will

¡

¢ contain two ESPOs constructed from ¢ and : only those ESPOs have participating random variable park (and rows for park=yes).

We are now ready to describe the last type of the selection operation: selection on probabilities.

Example 5.8 The following queries are examples of the types of probabilistic selection queries that need to be expressible in ESP-Algebra.

1. Find all rows where the lower bound is equal to 0.1;

2. Find all rows where the upper bound is greater than 0.4;

3. Find all rows where the probability is guaranteed to be greater than 0.2;

4. Find all rows where the probability can be less than 0.2.

The first two queries refer to the lower and upper bounds as supplied by the ¡ function. The last two queries refer to the point probability value as associated with a row by a p-interpretation. The third query specifies a for-all condition, which is true iff the condition is true for all p-interpretations

satisfying ¡ . The fourth query specifies an exists condition, which is true if at least one p-

interpretation satisfying ¡ is satisfies the condition. Our constraint language will allow for all four types of atomic conditions to be expressed.

Definition 5.14 An atomic probabilistic selection condition is an expression of one of the forms:

¤ 

  ¡   £¤¡   ¢¡           

(i) ; (ii) ; (iii) ; (iv) , where and op

¡   . Probabilistic selection conditions are Boolean combinations of atomic probabilistic selection conditions.

Example 5.9 While the precise semantics of probabilistic selection conditions is determined in

Definition 5.15 below, the following conditions match the probabilistic queries from Example 5.8:

¤  ¡ ¡ ¤ £ ¡ ¡ ¤  ¡ ¤  (1) ; (2) ; (3) ; (4) .

106

 



¥ ©¨ ¦ '¤ ¢  : gender: men party: Donkey

date: October 26 ¡ mayor park legaliz- ation Donkey yes yes 0.44 0.52 Donkey yes no 0.12 0.16 Elephant yes yes 0.05 0.1 Elephant yes no 0.01 0.02 Rhino yes yes 0.02 0.04 Rhino yes no 0.01 0.03

senate = Donkey

¨ ¨

   

¥ ¢  § ¢ ¥ ! ¢ § : :  date: October 23 date: October 23

gender: male gender: male

£¢¥¤§¦ ¨ ¤  £¢¥¤§¦©¨ ¤ 

respondents: 238,  respondents: 238,

§ ¤¨ £¨  ¦  § ¤¨  £¨ ¥ ¦  respondents: 195,  respondents: 195,

overlap: 184 overlap: 184

¡ ¡ senate legaliz- senate legaliz- ation ation Rhino no 0.1 0.15 Rhino yes 0.04 0.11 Donkey yes 0.22 0.27 Rhino no 0.1 0.15 Donkey no 0.09 0.16 Donkey no 0.09 0.16

Elephant no 0.21 0.26 Elephant yes 0.05 0.13

£¢¥¤§¦©¨ ¤§ ¤¨  £¨  ¦  £¢¥¤§¦ ¨ ¤§ ¤¨ £¨  ¦  mayor = Donkey  mayor = Donkey

Figure 5.4: Selection on probability table and probabilities.

 

¦ ¦¡

 

¥    ¡ ¤£ ¤¥ ¦ ¦    

Definition 5.15 Let ¢ be an ESPO. Let be a proba-

¤%

#  

¡    ¦

bilistic atomic selection condition. Let . The result of selection from ¢ on is defined

 

    



¢ ¢ ¥    ¡¤ ¤£ ¤¥ ¦ ¥ ¥ ©

as follows: op , where and

§

£¢

¨ ©

(#   

 ¤



¡ ¡    ¡

if 

(# 

¡ ¡

undefined otherwise.

¦

£¤¡   ¢

Let ¦ be a probabilistic atomic selection condition. The result of selection from

 

    



¥    ¡ ¤£ ¤¥ ¦ ¢ ¢ ¥ ¦ ¥

© on is defined as follows: op , where and

107

§

¨

©

  (#   (# 

£' ¥ ¡  ¡   ¡ ¡

¡ if

(# 

¡ ¡

undefined otherwise.

¦

¡   ¢

Let ¦ be a probabilistic atomic selection condition. The result of selection from

 

    



¦ ¢ ¢ ¥    ¡ ¤£ ¤¥ ¦ ¥ ¥

©

on is defined as follows: op , where and

§

¨

©

(#    (#  

¡ ¡  ¥ ¡  ¡   ¡

if

(# 

¡ ¡ undefined otherwise.

Example 5.10 The center and the rightmost ESPOs on Figure 5.4 represent the results of selections

¨ ¨

   

¢ § ¢ ! ¢ § on probabilities: and  respectively. In both cases, the results of the selection keep the same context, conditionals and participating random variables, while the probability table is modified to retain only the rows where the upper (lower) bound on the probability interval satisfies

the selection condition.

 

¢ §

Notice that the result of would contain seven ESPOs: every object in contains



! §

rows where upper bound on probability is greater that 0.14. The result of  will contain

¨

¢ two ESPOs constructed from ¢ and : only those had rows with lower probability less than 0.11.

While evaluation of the probabilistic selection conditions on lower and upper bounds is fairly straightforward, evaluation of the probabilistic selection conditions referring to p-interpretations may seem to be complex. As it turns out, these conditions can be expressed via the conditions on upper and lower bounds as specified in the following proposition.

Proposition 5.1 The following equivalences hold:

£ -conditions -conditions

     

 ¢£¢  ¢£¢

¥ 

¡ © ! ¥¤© !

       

¢

  



¡ ©  ! ¥¤©  !

       

¢

 

¡ © ! ¥¤© !

       

¢

¥

 

¥

¡ ©¦¥ ! ¥¤©¦¥ !

       

¢

  



¡ ©  ! ¥¤©  !

     

¦£¢ ¦£¢  

  

¡ ©  ! ¥¤©  !

Example 5.11 Figure 5.5 shows some selections on probabilities that use p-interpretation nota-

 

¢ ¢

©  § tion. The first query, , finds all rows in the probability table of for which all

p-interpretations have probability less than 0.11. By Proposition 5.1, this query is equivalent to

 ¤  ¤

¢ ¢  § ¢ ¢

©  § . The second query, ¤ , asks for rows of the probability table of in which

at least one satisfying p-interpretation can have probability less than or equal to 0.04. By Proposi-

 ¤

¢ !

§ tion 5.1, it is equivalent to ¥ .

108

 ¤

¥ ¢

©  §

:

 ¤

¥ ¢

©¦¥ § gender: men : ¤ party: Donkey gender: men

date: October 26 party: Donkey ¡

mayor park legaliz- date: October 26 ¡ ation mayor park legaliz- Donkey no no 0.04 0.08 ation Elephant yes yes 0.05 0.1 Donkey no no 0.04 0.08 Elephant yes no 0.01 0.02 Elephant yes no 0.01 0.02 Elephant no yes 0.03 0.04 Elephant no yes 0.03 0.04 Elephant no no 0.06 0.08 Rhino yes yes 0.02 0.04 Rhino yes yes 0.02 0.04 Rhino yes no 0.01 0.03 Rhino yes no 0.01 0.03 Rhino no yes 0.03 0.05 Rhino no yes 0.03 0.05 Rhino no no 0.01 0.04 Rhino no no 0.01 0.04 senate = Donkey

senate = Donkey

¡ £¤¡ Figure 5.5: Probabilistic selection on and conditions.

Different selection operations (both described in this section and in Section 5.3.1) commute,

as shown in the following theorem:



¦

Theorem 5.1 Let ¦ and be two selection conditions and let be a semistructured probabilistic

     ¦



 ¥¤ ¤

¤ ! relation. Then !

Proof of Theorem 5.1

Here we prove that different selection operations commute.

¦ Let ¦ and be two atomic selection conditions. There are 5 types of atomic selection condi- tions, namely context, participation, conditional, table and probability. Selection on context, partic- ipation or conditional results in entire SPOs being selected, while selection on table or probability

selects only parts of the relevant SPOs. We could partition the conditions into two groups,

Group  , containing context, extended context, participation, conditional and extended con-

ditional conditions, and Group % , containing table and probability conditions.

  ¦  ¦ 

 

¢ ¢ ¢

  ! First we prove ! for a single SPO , and we consider here all the

possible cases for each pair of condition groups.

¦ ¦  Case 1. Both conditions and are in Group .

There are three possible combinations for whether each condition is satisfied: (a) ¢ satisfies

¦ ¢ ¦ ¦ ¢ ¦ ¦ ¢ ¦ but not , or (b) satisfies but not , or (c) satisfies both and , or (d) does not satisfy

109

¦ either ¦ or .

By the definition of selection on atomic selection conditions in Group  , we know selection

on these conditions results in the entire SPO being selected, or none of it. For case a), since

    ¦



¢ ¦ ¢ ¢

  !

does not satisfy , ! returns empty and subsequently returns empty. Since

    ¦  

 

¢ ¢ " ¢  ¢ !

returns , we see that ! also returns empty for the same reason. Thus,

  ¦   ¦

 

 ¢  ¢ !

! holds for case (a). The same applies to case (b). Similarly, for case (d).

  ¦     ¦  

  

 ¢  ¢ ¢ ¢ ¢  ¢

! !

For case c), returns , and ! returns too. This proves

  ¦   ¦

 

 ¢  ¢ !

that ! holds for case (c).

  ¦   ¦

 

 ¢  ¢ !

So ! holds for all the cases.

¦  ¦  Case 2. Condition is in Group and condition is in Group .

There are only two possible combinations for whether each condition is satisfied, assuming

¦ ¢ ¦ ¢ ¦

that condition is always partially satisfied: a) does not satisfy , or b) satisfies .  By the definition of selection on atomic selection conditions in both Group  and Group ,

we know selection on conditions in Group  results in the entire SPO being selected or not, while

selection on conditions in Group % preserves all the context, participating random variables and

conditionals in the original SPO, but produce only a part of the probability table.

 

¢ ¦ ¢ ¢  Let ! , where has part of the probability table which satisfies the condition and

retains all the context, participating random variables and conditionals in ¢ .

  ¦  

 

 ¢ ¦ ¢ ¢

For case a), ! returns empty since does not satisfy the condition either.

     ¦  ¦

  

¢   ¢ ¢ !

Since returns empty, subsequently ! also returns empty. This proves

  ¦



¢ 

! for case (a).

  ¦  

 

¢ ¢ ¦  ¢ ¢

For case (b), ! returns since should satisfy the condition too. Since

     ¦  ¦     ¦

   

¢ " ¢ ¢  ¢  ¢  ¢ ¢

! ! ! returns , so ! also returns . This proves that

holds for case (b).

  ¦  ¦ 

 

¢ ¢

  !

So ! holds for both cases.

¦ ¦ %

Case 3. Both and are conditions in Group .

  ¦   ¦

 

 ¢  ¢ ¢ ¦ !

First we prove ! for a single SPO . Assume that both conditions and

¦ ¢ are partially satisfied by . By the definition of selection on atomic selection conditions in Group

 , we know selection on these conditions results in part of the probability table and preserves all the context, participating random variables and conditionals in the original SPO. In other words, all

the components in the original SPO except the probability table are preserved.

 ¦   ¦ 

 

¢ ¥    ¡ ¤£ ¦ ¢   ¢ ¥    ¡ ¤£ ¦ ¢ ¢

! !

Let . Then and

  ¢   ¢ £

 

¡ ¡ ¥    ¡ ¤£ ¦   ¡ ¡

! !

with , and where is the relational selection opera-

¢   ¨ ¢  

 

 ¡  ¡ ! tor. Since the relational selection operator is commutative, ! . Therefore we

110

  ¦   ¦

 

¡¤ ¡£ ¢ ¢  ¢  ¢ !

have or . So ! holds for this case.

 ¨ ¤  ¦



 § 

Now let an SP-relation . Since the union operator is commutative, !

¤ ¤ ¤ ¤

   ¨  ¦¦ ¨  ¦    ¦¦  ¨  ¦¦ ¨    ¦¦



¤   ¤ ¤   ¤

§ § § §



¤ ! ! ¤ ! ¤ ! !

, and ¤ .

 ¦  ¦



¤



¤ ! So this proves ! .

5.3.2 Projection

Projection in classical relational algebra removes columns from the relation and, if needed, collapses duplicate tuples. ESPOs consist of four different components that can be affected by projection operations. We distinguish between three different types of projection here: on context, on conditionals and on participating random variables, the latter affecting probability table as well. There are two issues that need to be addressed when defining projection on context. First, contexts may contain numerous copies of relational attributes. Hence, projecting out a particular attribute from a context of an ESPO should result in all copies if this attribute being projected out. The second issue is the fact that in extended context, different attributes are associated with different participating random variables. Thus, it would be desirable to be able to take these associations into account when performing projections. To address these two issues we define two types of projection on context. The first operation will be similar to standard relational projection, while the second operation will work by removing

associations between context attributes and random variables.

 

   ¡ ¤£ ¤¥ ¦  © ¥¤¥¤¥¤ ©  ¢ ¥

Definition 5.16 Let be a set of context attributes and

 

!

 

# ¢ ¢ ¢ ¥    ¡ ¤£ ¤¥ ¦

be an ESPO. Projection of ¢ on , denoted is an ESPO , where

 

!

 

 ¥   ¥    

¢

¥  ©    ¥ ©     ©   # ¥

and .



¡

£

   

 ©   ¤¥¤¥¤  ©       © §

Definition 5.17 Let be a set of pairs where for , a

  

§  ¢ ¥    ¡ ¤£ ¤¥ ¦ ¢

context attribute and  . Let be an ESPO. Projection of on ,

  

! !



   ¥   ¥ 

¢    ¡ ¤£ ¤¥ ¦  ©    ¥ ©     # ¢ ¥

¡

denoted ¢ is an ESPO , where



¡

£



¤   

 © © §  © ¥¤¥¤¥¤ ©  £  £  §      ¥ # ¥ ¡

for some and and ¢ .



Given an ESPO ¢ and a set of pairs as described in Definition 5.17, the projection operation 

will proceed as follows. The set of context attributes to keep which comes from specifies for

each attribute the list of random variables for which it is allowed to be kept. The projection operation

 



¦¥ 

  

(i) removes from the input ESPO ¢ all attributes not in and (ii) for each instance





 

§ © §   §    § of attribute © such that it removes all references in that are not in . If

111

 

¦¥ 

§

 £   , then  is omitted from the projection. Projections on context are illustrated in

the example below. ¡

Example 5.12 Figure 5.6 contains the results of two projection queries applied to the ESPO ¢

¡ ¡

   



 

¤ ¤ ¤

¢ # ¨ ¦¥¤ ¢ #

£

¨ ¦¥¤  ¨

 ¨ ¦  ¨

£   from Figure 5.3, (left) and (right). The first operation re- sults in the removal of all context attributes other than date from the context of the ESPO. The second operation removes associations between the context attributes locality and date and all random variables but park. In both cases, conditionals, participating random variables and prob-

ability table are not affected.

¡

 



¢¡  ¡ 

¤ ¤ ¤

¡

¥ # ¢

¦  ¨ ¦  ¨ ¦ ¨ ¦¥¤  ¨

§  § £ 

 

¢¡

¥ # ¦ ¨ ¦¥¤ ¢

§ £

¡ ¨£¢  locality: Sunny Hill 

date: October 26

¡ §¨£¢ 

date: October 26  ¡

park legalization ¡ park legalization yes yes 0.56 0.62 yes yes 0.56 0.62 yes no 0.14 0.2 yes no 0.14 0.2 no yes 0.21 0.25 no yes 0.21 0.25 no no 0.03 0.07 no no 0.03 0.07 senate = Donkey senate = Donkey

Figure 5.6: Projections on context.

The projection operations on conditionals can be defined similarly. We note here that, while syntactically these operations are similar, projecting conditionals out of ESPOs is a more dangerous

operation from the probability theory point of view. Basically, it is taking a conditional probability

 

¥

distribution ¡ and making a decision to “forget” the conditioning information , and refer

   

¡ to the distribution as ¡ from then on. If unconditional distribution is also available, this may lead to some confusion. Projecting out conditionals is sometimes necessary: When a specific random variable is removed from consideration in the entire database, it needs to be projected out from both lists of participating random variables (see below) as well as from the conditionals of affected ESPOs. Similarly, if the users switch to a sub-database, if all ESPOs which contain the same conditional part, that conditional part can be removed for convenience. With this in mind we

present the projection on conditionals.

 

§

£¡ ¥¤¥¤¥¤£¥¡!   ¢ ¥   ¡ ¤£ ¤¥ ¦ Definition 5.18 Let be a set of random variables and 

112

 

§

!



 

# ¢ ¢ ¥    ¡ ¤£ ¤¥ ¦

be an ESPO. Projection of ¢ on , denoted is an ESPO , where

 

§

!

 

      



£ ¡   ¥ # ¥  ¡    ¥ ¡     £ 

and and  .



¡

§

£

   

¡   ¤¥¤¥¤  ¡©      

Definition 5.19 Let  be a set of pairs where for all ,

  

§

§  §  ¢ ¥    ¡ ¤£ ¤¥ ¦ ¢

¡ and . Let be an ESPO. Projection of on , denoted

   

! !



     



# ¢ ¢ ¥    ¡ ¤£ £ ¤¥ ¦  ¡    ¥ ¡      ¡

 is an ESPO , where





¤   

  £  §  ¥ #¤ ¥ ¡ § £¡ ¥¤¥¤¥¤ ¥¡© 

, and ; .

The following example illustrates how projection operations on conditionals work.

Example 5.13 Figure 5.7 shows the results of two different projection on conditionals operations,

¨ ¨

   

¡

¤ ¤

#¡ ¢ #£ ¢

¢ ¨  ¤ ¨ ¦¥¤

¨  

(left) and (right). The first projection removes all conditionals in ¨

¢ , leaving the conditional component empty. The second projection severs the association between

¨ # ¦ ¢¤ the ¡ conditional and the random variable legalization. Context, random variables

and probability table are left intact.

¨

¨

 

 

¡

¤ ¤

¥ #¡ ¢

¨  ¤ ¨ ¦¥¤

¥ # ¢ ¨   ¢ : : date: October 23 date: October 23

gender: male gender: male

£¢¥¤§¦©¨ ¤ 

£¢¥¤§¦ ¨ ¤ 

respondents: 238,  respondents: 238,

§ ¤¨  £¨ ¥ ¦ 

§ ¤¨ £¨  ¦  respondents: 195,  respondents: 195,

overlap: 184 overlap: 184

¡ ¡ senate legalization senate legalization Rhino yes 0.04 0.11 Rhino yes 0.04 0.11 Rhino no 0.1 0.15 Rhino no 0.1 0.15 Donkey yes 0.22 0.27 Donkey yes 0.22 0.27 Donkey no 0.09 0.16 Donkey no 0.09 0.16 Elephant yes 0.05 0.13 Elephant yes 0.05 0.13

Elephant no 0.21 0.26 Elephant no 0.21 0.26

£¢¥¤§¦ ¨ ¤  mayor = Donkey 

Figure 5.7: Projections on conditionals.

Now we are ready to define the most intricate projection operation, projection on the set of ran- dom variables. When defining this operation, we need to keep in mind the following: (i) projection

is only allowed if at least one random variable remains in the resulting set of participating random ¡ Symbol ªCº is used in the notation to distinguish the projection operation from the projection on the set of partici- pating random variables, to be de®ned below.

113

¨ ¥ variables ; (ii) projecting out a random variable ¥ should result in removal of from the extended

context and conditionals; (iii) projecting out a random variable ¥ should remove this variable from the probability table, i.e., the underlying probability distribution function will change. Out of these

notes, the last is of the most importance.

 

¢

 

¥    ¡ ¤£ ¤¥ ¦  ¢ 

Definition 5.20 Let ¢ be an ESPO, and let . Projection of on ,

 

¡ ¢

denoted # is defined as follows:



¡   ¡



£* # ¢

1.  : .

 

! !



¥ ¤  

 £*  #¤ ¢ ¢ ¥    ¡ ¤£ ¤¥ ¦

2. : , where

 

¢ ¢ ¢ ¢ ¢

!

 

 ¥ ¡ ¤   ¥ 

 ©   £* ¥ ©    £  

and and ;

 

¢ ¢ ¢ ¢ ¢

!

 

¡ ¤    

£ £   ¡   £*  ¥ ¡    £

and and ;

¦ ¤%

 

£  £

¡ C[0,1];

¤% ¤

# # #

     

¡    ¡  ¡   

For all and ,

¡ ¡

# # # # #

 ¢¡¤£

     ¦  ¦

¡ ¡  ¡  ¡  ¡  ¡    ¤

¤£ © ¤£ ©

 

   

   

 

! ! ! !  ¥ ©  ! ! ! ! !" ¥ ©  !

  

¥ #¤¥ ¥

.

 

   ¡ ¤£ ¤¥ ¦ ¥

This definition requires a careful explanation. Let ¢ be an ESPO, and let



 

 # ¥ ¢

 be the set of projection random variables. The computation of proceeds as follows.



¢ 

First, we check if the intersection of  , the set of participating random variables of and is



 £  empty, and if it is, we return empty set as the answer. If  is not empty, we build the

projection as follows: 

(i) the new set of participating random variables is ;

   

! ! £

(ii) the new context and conditionals £ are produced from and respectively, by

eliminating all random variables not from  from the extensions (associations). Context

 



entries (conditionals) from ( £ ) associated only with variables not from will be elim-

 

! !

inated from ( £ ); ¦ We want our query algebra to be closed: ESPOs in Ð ESPOs out. Removing all random variables from the ESPO basically collapses it. The object returned by such an operation will no longer satisfy our de®nition of an ESPO. Because of that, we do not consider such operations.

114

(iii) finally, the new probability table function is defined as follows. The function must range

¤ ¤

# # #

     

  £   ¡    ¡  ¡ 

over . As , with each value , a set of values

¤% ¤%

#

   

  ¡   £  ¥ ¡

is associated, where ranges over . Given a p-interpretation ,

¤%

# #

   

¡    ¡  ¡

for each we can compute the probability assigned to it by as

# # #

 

 



 ¡  ¡ ¤ ¡





!  ¥ ©   ! !

! Now, we know that the probability of has to range between the

# #

¢¡¤£

   

 ¡  ¥ ¡   ¡ 

¤£ © ¤£ ©

minimal and maximal value of , for all . This interval,

# #

 

 ¡  ¡ ¡

is defined to be the value of the new probability distribution function on .

While the computation of the new set of participating random variables, context and condition-

als according to Definition 5.20 is straightforward, computing the new probability table requires

# # #

¢¡¤£

 

  ¡  ¡ ¡

solving a number of optimization problems (finding s and s of for all ), which seems a fairly tedious task. However, it turns out that these optimization problems have

analytical solutions.

 

 

¤

   ¡ ¤£ ¤¥ ¦ ¥    £  ¢

Theorem 5.2 Let ¢ be an ESPO and . Let and

 

! !

    



 

     

 

   

¥    ¡  £ ¤¥ ¦ #¥ ¢ ¡    ¡ 

 

!

! ! !

 

! ! !" ¥ © "! ! ! ! !" ¥ © "! !

. Let

  



 

 

¡  ¤ ¡ ¤ 

! ! ! ! Then, .

Proof of Theorem 5.2

Here we only give proof for the lower bound. The same approach applies to the upper bound.



 (# 

# ¡

From the definition of projection given above, we know the lower bound for ¡ is

¡

(# £ # ¦ 

 ¡ 

 £ ©





¢





!  ¥ © 

.

 

 ¢  ¢

   

¡ ¡ 

   

  ! !

! !

Let , and . For any p-interpretation over

 

 

¢ ¢

 

 

 ¥ ©  !  ¥ ©  !

(# £¢# $# £ #

 

 ¢  ¢

   

  ¡  ¥ ¡   ¡   ¡ ¡  

! !

such that satisfies ( ), we have , for all instances

¤

  

 .

§

¤

£ #  

 

Thus the summation over all  gives

¡ ¡ ¡

 $# £¢# 

 

 ¢  ¢

   

 

¡ ¡  ¡  ¤

 

! !

! !

  

  

¢ ¢ ¢

  

  

 ¥ ©  !  ¥ ©  !  ¥ ©  !

By the definition of p-interpretation, we have

¢

¥ ¥

 # £¢#  # £ # ¤£

      

 .

 

 

¡ ¢ ¡ ¢

   

  

!  ¥ © "!  ¥ © !  ¥ ©  ! The above equation can be rearranged as

115

¢

¥

£ # (# £¢#  #

£

  ¡   

 

 .

  

 

¢ ¡ ¡  ¢

    

  ¢ 

 ¥ ©  !  ¥ © !   ¥ ©  !

§

¤

(# £¢# (#£   



 ¢

 

 ¡   ¡    !

By using the relationship given above, for any , we

¢

£ $# £¢# 



¡ ¢

 

¡ ¡   ¡  

 ¡   

!

 !

get .

   

 

¡ ¡  ¢ ¡ ¡  ¢

       

 ¢   ¢ 

 ¥ © !   ¥ ©  !  ¥ © !   ¥ ©  !



 (# 

¡ ¡

Thus, the lower bound for # is

¡ ¡

¢¡¤£

 $# £ # ¦  

 ¡

 ¡   ¡ ¤

 

! !

¤£ ©

 



¢ ¡ ¡ 

   

  ¢

 ¥ ©  !  ¥ © ! 



$#     



 ¢

 

¡ ¡  



 

!

 

! !

On the other hand, the lower bound for is ,





¢





 ¥ ©  !



(#     



 ¢

 

¡ ¡ ¡ ¡  ¡ 

  

!

 

! !

and the upper bound for is . By the defini-





¢





 ¥ ©  !

 $# 

¤ ¡ ¡

tion of the tightening operation, we know the lower bound for is

¡

¢¡¤£

 

¡ 

¡ ¤ 

 

! !



¡  ¡

  

 ¢

  ¥ © !

§

¤%

¢¡¤£

#   

¡    



 !

Now let’s prove that for any , the two lower bounds



¡  ¡

  

 ¢

  ¥ © !

¢¡¤£

  

¡ ¡ 

¡  ¡ 

  

! ! !

and are equal for all possible cases. Please note here that



¡  ¡

  

 ¢

  ¥ © !

§

¤%

#  

 ¡  

  

  

! ! !

always holds, and consequently holds for any .

§

¤%

¥ ¥

#   # ¤ #

¡   ¡ ¡  

¡ ¡

 

! !

First we show that if holds for all such that , then

 

¡ ¡  ¡  ¡ 

 ¡ ¡ ¡

 

   

! ! ! !

and therefore Then the two

 

¡  ¡  ¡ ¡

     

 ¢  ¢

   ¥ © !  ¥ © ! lower bound formulas are identical.

To complete the argument that the two lower bounds are equivalent, we also consider the case

§

¤

¥ ¥

#   # ¤ #

¡   ¡ ¡ ¡   ¡



 ¡ ¡

  

! ! !

that there exists , and . Then and



¡  ¡

  

 ¢

  ¥ © !

¡¤£

  

¡    ¡  

¡  ¡

 

  

! ! !

. Therefore . But then

 

¡  ¡  ¡ ¡

     

 ¢  ¢

   ¥ © !  ¥ © !

¢¡¤£

  

¡   ¡



  ¡   ¡

     

! ! ! ! ! !

, and so the two lower bound expres-



¡ ¡ 

  

 ¢

 ¥ © ! 

sions produces the same values.

¢¡¤£

 

  ¡

 ¡

 

! !

Therefore, we obtain the following equality:



¡ ¡ 

  

 ¢

 ¥ © ! 

¢¡¤£ £

  ¡  # 

 ¡ ¤



 ¡

 

! !

, which means that the lower bound for is equal to the



¡ ¡ 

  

 ¢

 ¥ © ! 



 (# 

¡ ¡ lower bound for # .

The projection operation on a set of random variables is illustrated in the example below.

¨

 

¡

¤

# ¢

¤ ¨ ¦¥¤  Example 5.14 Figure 5.8 illustrates the process of computing projection on partici- pating random variables. The first step of this operation is the removal of all other random variables from the probability table. Next, the duplicate rows of the new probability table are collapsed and

116 the probability intervals are added. After that, the operation on tightening is performed to find the true intervals as we can observe that neither of the three lower bounds is reachable. We then exclude respondents:195 from the context as it is not associated with senate variable and disassociate

legalization with conditionals.

¨ ¨

 

¡

¤

¥ ¢ ¥ # ¢

¤ ¨ ¦¥¤  : : date: October 23 date: October 23

gender: male gender: male

£¢¥¤§¦ ¨ ¤ 

£¢¥¤§¦©¨ ¤ 

respondents: 238, respondents: 238, 

§ ¤¨  £¨ ¥ ¦ 

§ ¤¨  £¨ ¥ ¦  respondents: 195, respondents: 195, 

overlap: 184 overlap: 184

¡ ¡

senate legalization senate

¡ ¡ Rhino yes 0.04 0.11 Rhino 0.04 0.11 Rhino no 0.1 0.15 Rhino 0.1 0.15 Donkey yes 0.22 0.27 Donkey 0.22 0.27 Donkey no 0.09 0.16 Donkey 0.09 0.16 Elephant yes 0.05 0.13 Elephant 0.05 0.13

Elephant no 0.21 0.26 Elephant 0.21 0.26

£¢¥¤§¦ ¨ ¤§ ¤¨  £¨  ¦ 

£¢¥¤§¦ ¨ ¤§ ¤¨ £¨  ¦ 

mayor = Donkey mayor = Donkey 

¨ ¨

   

¡ ¡

¤ ¤

¥ # ¢ ¥ # ¢

¤ ¨ ¦¥¤ ¤ ¨ ¦¥¤

  : : date: October 23 date: October 23

gender: male gender: male

£¢¥¤§¦©¨ ¤  £¢¥¤§¦ ¨ ¤  respondents: 238,  respondents: 238,

overlap: 184 overlap: 184

¡

¡ ¡ senate senate Rhino 0.14 0.26 Rhino 0.18 0.26 Donkey 0.31 0.43 Donkey 0.35 0.43

Elephant 0.26 0.39 Elephant 0.31 0.39

£¢¥¤§¦©¨ ¤  £¢¥¤§¦ ¨ ¤  mayor = Donkey  mayor = Donkey

Figure 5.8: Projection on the participating random variables.

5.3.3 Conditionalization

Conditionalization was first considered as an operation of a relational algebra related to a prob- abilistic data model by Dey and Sarkar [25]; Classical relational algebra has no prototype of it. Intuitively, conditionalization is the operation of computing a conditional probability distribution, given a joint probability distribution. To simplify the definition below, we will employ the following

117



 ¥ ¥¤¥¤¥¤£ ¥  ¥      ¥ 

notation. Let  be a set of random variables and let and . Let

¦ ¤ ¤% ¤%

¢

    £ #  

©

    ¢¡ ¥¤¥¤¥¤ ¡   ¥   

 be a p-interpretation. Let and .

©

$#£  (#£  (#£ 

§

  *      £¡ Then denotes the following sum: § . With this notation in mind,

we define conditionalization as follows.

 

¦

¥    ¡ ¤£ ¤¥ ¦ ¥  ¥ ¡ ¥   ¦ ¥ ¢¡ ¥¤¥¤¥¤££¡ © 

Definition 5.21 Let ¢ be an ESPO, , and

 



¦  ¢

be a conditional selection condition. Then, the result of conditionalization of ¢ on , denoted

 

!

¥  ¤  ¡¤ ¤£ ¤¥ ¦

is the ESPO ¢ , where



   ¥   ¥ ¥¤¥¤¥¤£ ¥ 

. Without loss of generality, we will assume further that  ,



 

¥   ¥ ¥¤¥¤¥¤£ ¥ 

¥ and therefore .

 

!

¨  

£ ¢¡ ¥¤¥¤¥¤ £¡ ©  £  ¥    

, where .

¡

¦ ¤%

 

¡  

C[0,1] is defined as

¢

(#£¢ (#£ 

¢¡ ¢¡

¢¡¤£

$#£¢

£

 ¤ ¡

# #

 £ ¤£  £ ¥£

 

¤£ © ¤£ ©

¢ ¢

¢¡ ¢¡

 

 

!  ¥ ©  ! ! !  ¥ ©  ! !

 



 ¥ ¥ .

From the definition above, it follows that in order to compute the result of conditionalization of an ESPO (in particular, in order to compute the resulting probability distribution) a number of non-linear optimization problems have to be solved. As it turns out, the new probability distribution can be computed directly (i.e., both minimization and maximization problems that need to be solved have analytical solutions). Dekhtyar and Goldsmith [19] solved the problem of computing the result of conditionalization of interval probability distributions. We present here their final formula, while

referring the readers to [19] for a more detailed discussion of its derivation.

 

¦

¥    ¡ ¤£ ¤¥ ¦ ¦ ¥ ¢¡ ¥¤¥¤¥¤ £¡ © 

Theorem 5.3 Let ¢ be an ESPO, be a conditional

¤%

£ #  

+    ¥  ¢¡ ¥¤¥¤¥¤££¡ ©    

selection condition and ¥ . Let , and . The

 

!

 

¢





¤¥ ¦  ¢ ¥  ¤  ¡¤ ¤£  

result of the conditionalization is denoted ¢ . If we define

¢



 *

and ¡ as follows:

¡ ¡

¡¤£§¦¨







¢



¢ 



¢ 

  ¡ ¡ 

!

! ! !





 ¢ ¢



¥¡

©¡  !

! or 

We note, however, that Jaffray [54] has shown that conditioning interval probabilities is a dicey matter: the set of



! !

    point probability distributions represented by will contain distributions which do not correspond to any in .

118

¡ ¡

¦

¨



 

¢

 ¢ 



¢ 

 ¡"  ¡ ¡ 

!

! ! !





¢ ¢ 



¥¡

 !¡©¡ ! or then the following expression correctly computes the lower and upper bounds of the conditional

probability distribution for the resulting ESPO object.

¢



 *

$#£¢ ¡ ¢

¡ 

¤£

§



 





¢¦¥





¢ 

¢ 

 ¢ ¢ 

     ¡  *

!

! ! !

©¡  ¥¡

!

§¨

¢



¡  *

¤

£

¢¡¤£

§



  



¢ ¥





¢ 

¢ 

¢

¢  

¡"   ¡

  

!

! ! !

 ¥¡ ! ¥¡

The proof of this theorem can be found in [19]. The following example will illustrate how the

conditionalization operation works. ¨

Example 5.15 Consider the ESPO ¢ in Figure 5.3. In this example, we illustrate the process of

¨

 

¡

¤

 ¢

'¤ ¤ ¨ © ¨ ¦ 

 

computing the conditionalization , as shown in Figure 5.9. First we collapse all

¤¨ £¨  ¦ ¤¢ the rows that do not satisfy the condition into one row. Next, we do a tightening operation on the new probability distribution. Then what we need to do is a normalization operation,

which means that we must find the minimum and maximum values of the expressions of the form



 !

¥

 

    

¡ ¡ ¡    ¦  ¦ ¢ ¤  ¤ ©¨ ¦ 

 ¤ ¡  ¤ '¤ ¤  ¨ ¦ '¤

 )!  !   !

for over all p-interpretations

¥ ¡

 .

  ¦ 

Let us determine the lower bound for  . Consider the following function of three



 £  £  £ 

 

¡   ¡ ¡   ¢

variables:  . For positive , and , we could rewrite the function as





So, in order to minimize we need to minimize x and maximize y+z. In this case, we need to



    §    

  ¦ ( ¤¢  ¦ ¢¤ ( ¤¢   ¤ ©¨ ¦ ( ¤¢   § ¦ ( ¤¢ ¨ ¤ ¨¦

minimize  and maximize , i.e.,

  §    § 

¦ ¢¤ ( ¤¢   ¤ ©¨ ¦ ( ¤¢ ¡  ¦ ¨ ¤ ¢ ¨ ¤  ¦   ¨ ¤ ¨¦ ¨ ¤!#" ¨ ¤!

and  . Then the min-

 

¡

 ¤

 )!

¨§ ¨%$



 

    

¨ ¤ ¨" ¡ ¡ ¡

¨§ ¨%$ ¨§ $

 '¤ ¡  ¤ ¤ ¤  ¨ ¦ ¤

  ! )!  )!

imum value of is .

 

 § ¦    § ¦ ( ¤¢

Similarly, we can determine the upper bound for  . We need to maximize

  §       §

¦ ¢ ¤ ( ¤¢   ¤ ¨ ¦ ( ¤¢    ¦ ( ¤¢ ¨ ¤   ¦ ¢ ¤ ( ¤¢

and minimize  , i.e., and

   § 

 ¤ ¨ ¦ ( ¤¢ ¡ ¨  ¨ ¤ ¢¥¢ ¨ ¤ ¨&   ¨ ¤  ¨ ¤'& ¨ ¤ ¦¥¢

 . Then the maximum value of

 

¡

 '¤

 ! 

¨§



 

    

¨ ¤ ¦¥¤ 

¡ ¡ ¡

 ¤ ¡  ¤ '¤ ¤  ¨ ¦ '¤ ¨§ ¨§ (©

 )!  !   !

is . We can apply similar operations for

¦ ¢¤   ¤ ©¨ ¦ and . After that, the tightening operation is performed to find the true inter- vals. Finally, we exclude respondents:195 from the context as it is associated with legalization

119

¨ ¢

¥ :

¨

 

¡

¤

¥  ¢

¤ ¨ © ¨ ¦  '¤

  date: October 23 :

gender: male date: October 23

£¢¥¤§¦ ¨ ¤ 

respondents: 238,  gender: male

§ ¤¨  £¨ ¥ ¦ 

£¢¥¤§¦©¨ ¤ 

respondents: 195, respondents: 238, 

§ ¤¨  £¨ ¥ ¦ 

overlap: 184 respondents: 195,  ¡

senate legalization overlap: 184

¡ ¡ ¡ Rhino yes 0.04 0.11 senate legalization Rhino no 0.1 0.15 Rhino yes 0.04 0.11 Donkey yes 0.22 0.27 Donkey yes 0.22 0.27 Donkey no 0.09 0.16 Elephant yes 0.05 0.13

Elephant yes 0.05 0.13 / no 0.4 0.57

£¢¥¤§¦ ¨ ¤§ ¤¨ £¨  ¦ 

Elephant no 0.21 0.26 mayor = Donkey 

£¢¥¤§¦ ¨ ¤§ ¤¨  £¨  ¦ 

mayor = Donkey 

¨ ¨

   

¡ ¡

¤ ¤

¥  ¢ ¥  ¢

'¤ '¤ ¤ ¨ © ¨ ¦  ¤ ¨ © ¨ ¦ 

    : : date: October 23 date: October 23

gender: male gender: male

£¢¥¤§¦©¨ ¤  £¢¥¤§¦©¨ ¤  respondents: 238,  respondents: 238,

overlap: 184 overlap: 184

¡

¡ ¡ senate legalization senate Rhino yes 0.04 0.11 Rhino 0.09 0.36 Donkey yes 0.22 0.27 Donkey 0.48 0.53

Elephant yes 0.05 0.13 Elephant 0.12 0.30

£¢¥¤§¦©¨ ¤ 

/ no 0.49 0.57 mayor = Donkey 

£¢¥¤§¦ ¨ ¤§ ¤¨  £¨  ¦  £¢¥¤§¦ ¨ ¤ 

mayor = Donkey  legalization = yes

¨

 

¡

¤

¥  ¢

¤ ¤ ¨ © ¨ ¦ 

  : date: October 23

gender: male

£¢¥¤§¦ ¨ ¤  respondents: 238, 

overlap: 184 ¡ senate Rhino 0.17 0.36 Donkey 0.48 0.53

Elephant 0.12 0.30

£¢¥¤§¦©¨ ¤ 

mayor = Donkey 

£¢¥¤§¦©¨ ¤  legalization = yes 

Figure 5.9: Conditionalization operation.

120 variable and add legalization=yes to the conditionals. The resulting ESPO is shown in the bottom of Figure 5.9.

5.3.4 Cartesian Product and Join

The Cartesian product of two ESPOs can be viewed as the joint probability distribution of the random variables from both objects. As only point probabilities were used in [20], we made there an assumption of independence between the random variables in the SPOs being combined. As probability distribution functions considered here are interval, this restriction will be removed. Also, the use of extended context and extended conditionals in ESPOs will allow us to make Cartesian product compatibility much less restrictive.

a. Probabilistic Conjunctions Probabilistic conjunctions are interval functions (operations) that are used to compute the prob- ability of a conjunction of two events given the probabilities of individual events. Typically, each probabilistic conjunction operation would have an underlying assumption about the relationship between the events involved, such as independence, ignorance, positive or negative correlation. Probabilistic conjunctions had been introduced by Lakshmanan, et al. [58], where they have also been used in defining the Cartesian product operation. Our definitions are borrowed from [58] and

[22].

§ ¨ § ¨ ¨

¥ ¥  ¥ ¥

¤       

First, recall the standard “truth-ordering” on intervals:  iff

§ § ¨ § ¨ § ¨ ¨ § § ¨

 ¥  ¥  ¥  ¥  

 ¤        . iff .

Definition 5.22 (probabilistic conjunction/disjunction strategy) A probabilistic conjunction is

¦

 

a binary operation C[0,1] C[0,1] C[0,1] that obeys the following postulates:

Postulates

¨ ¨ ¨ ¨

   

¥¡   ¥¡   ¥¡   ¥¡ 

1. Commutativity 

¨ ¨ ¡ ¡ ¨ ¨ ¡ ¡

¦     ¦

¥¡   ¥¡   ¥¡   ¥¡   ¥¡   ¥¡ 

2. Associativity 

¨ ¨ ¡ ¡ ¨ ¨ ¡ ¡

   

¥¡   ¥¡    ¥¡   ¥¡   ¥¡ ¨  ¥¡ 

3. Monotonicity  if

¨ ¨ ¨ ¨



     

¡ ¥¡    ¥¡   ¥¡   

4. Bottomline 

 

¥¡      ¥¡ 

5. Identity 

 

¥¡   £ ¤  £ ¤

6. Annihilator 

¨ ¨ ¨ ¨

¡¤£

   §   

¥¡   ¥¡       ¡ ¥¡  7. Ignorance 

The following are some sample probabilistic conjunction operations ([22, 58]).

121

Probabilistic Conjunctions

¨ ¨ ¨ ¨

¢¡¤£ 

   §   

 ¥¡  §¡  ¥¡      ¥¡ 

Ignorance ¡

¨ ¨ ¨ ¨



     



 ¥¡  ¡  ¥¡     ¥¡ 

Positive Correlation ¡

¡ ¡ ¦ ¦ ¡ ¦ ¡ ¦

¡

'  ¢ £¢¥¤ '  ¢   ¨    '  '  & , ¢  ¢ ¤  

Negative Correlation

¨ ¨ ¨ ¨

  ¡ ¡



¥¡  §  ¥¡   ¥¡ ¡  Independence  b. Cartesian Product As different probabilistic conjunction operations compute the probabilities of conjunction of

two events in different ways, there is no unique Cartesian product operation. Rather, for each

probabilistic conjunction we define a Cartesian product operation .

   

! !

¢ ¥    ¡ ¤£ ¤¥ ¦ ¢ ¥    ¡ ¤£ ¤¥ ¦

Definition 5.23 Let and be two ESPOs. Let



§ §

¤ ¤

 

¨ ¨

   



  ¥ ¥¤¥¤¥¤  ¥   £¡ ¥ §¦   ¨  §¦   ¥ ¥¤¥¤¥¤£ ¥  £¡  ¥

, © , ,



§

!

 £  ¢ ¢  £  £*

. and are Cartesian product-compatible iff (i) ; (ii) , and (iii)

§

¡

£  .

Cartesian product compatibility of two ESPOs means that the joint probability distribution of the random variables from both objects is meaningful. In particular, we require the sets of partici- pating random variables to be disjoint (leaving the other case to be handled by the join operation). We also want the set of random variables found in the conditionals of one ESPO to be disjoint from the participating variables of the other. Thus, for example, the Cartesian product of the probability distribution of mayor votes for respondents who will vote Donkey for senate cannot be combined

with the probability distribution of senate votes. We can now define the Cartesian product.

   

! !

   ¡ ¤£ ¤¥ ¦    ¡¤ ¤£ ¤¥ ¦ ¥ ¢ ¥

Definition 5.24 Let ¢ and be two Cartesian-

§

¤



¨

 



  ¥ ¥¤¥¤¥¤  ¥  £  ¥ ¥¤¥¤¥¤  ¥  £¡  ¥ §¦  

product compatible ESPOs. Let , © ,

 

§

!

¤



¨

 

£¡  ¥  §¦   ¨ ¨ 

, . Let be some probabilistic conjunction. The Cartesian

 

¢ ¢ ¢ ¢ ¢ ¢

product of and under probabilistic conjunction , denoted , is defined as

 

! ! ! !

¤¥ ¦    ¡ ¤£ ¢ ¥

, where

¨

 £

 ;

   

! ! ! !

  

 ¥   ¥   ¥   ¥ 

©     ©     ©    ¥ ©    

 and no or and no

  

¨ ¨

!

    

 ¥   ¥   ¥  ¨

©      ©     ©       

or and and  ;

   

! ! ! !

  

       



 £ ¥ ¡      ¡    ¡     ¡   

£ and no or and no

  

¨ ¨

!

    

 ¨     

     !  ¡     ¡     ¡ or and and ;

122

¦ ¤ ¤% ¤%

#

   #  

¡£  £ ¡   £ ¡   

C[0,1] is defined as follows. Let , (hence

¤%

# # #

# ¦(#    ¦ $#   

¡  ¡ ¡£ ¡  ¡   ¤ ¡ ¡ ¡¤ ¡

( ). Then, .

  

¥ ¥ ¥

.

In Cartesian product the context and the conditionals of the two initial ESPOs are united; if a particular context record or a conditional appears in both ESPOs then their association lists are merged. The new set of participating variables is the union of the two original sets. Finally, the probability interval for each instance (row) of the new probability table is computed by applying the probabilistic conjunction operation to the appropriate rows of the two original tables. c. Join Join in ESP-Algebra is similar to Cartesian product in that it computes the joint probability dis-

tribution of the input ESPOs. The difference is that join is applicable to the ESPOs that have com-

   

! !

   ¡ ¤£ ¥ ¦ ¢ ¥ ¢ ¥    ¡¤ ¤£ ¦

mon participating random variables. Let and ¥ ,



¤

 ¢ ¢ £ £

and let  and participating random variables of are not conditioned in and

¢

vice versa. If these conditions are satisfied, we call ¢ and join-compatible.

   

! !

   ¡ ¤£ ¤¥ ¦ ¢ ¥ ¢ ¥    ¡£ ¤£ ¤¥ ¦

Definition 5.25 Let and  be two ESPOs. Let



§ §

 

¡ ¡  



  ¥ ¥¤¥¤¥¤  ¥  £  ¥ ¥¤¥¤¥¤£ ¥  £¡ ¥ £¢ ¤ £¡  ¥ £¢ ¤ ¥¤ 

, © , ,



§ §

!

¡ ¤ ¡ ¡

 £ ¢ ¢  £ ¡ £*£  £

 . and are join-compatible iff (i) ; (ii) , and (iii) .

¤% ¤% ¤%

  

#  £ #  #    

         ¡ 

Consider three value vectors ¡ , and . The

(# £ # # 

¢ ¡ ¡    £

join of ¢ and is the joint probability distribution of and , or, more specifically,

  

  £ 

of  , and . To construct this joint distribution, we recall from probability theory

 

that under assumption  about the relationship between the random variables in and and

 

(# £ # #  (# £¢# # £¢#

    ¦ ¡   ¦ ¡  ¦ ¥

independence between variables in and in , we have

(# £ # #  $# £ #  (#£ #  (# £¢#

¦ ¡   ¦ ¡ ¥ ¦  ¦ ¡  ¡

and, symmetrically, . is stored in , the probability table of

# (#£ #  £ #

¢ ¦ ¦ 

, and ¥ is the conditional probability that can be computed by conditioning (stored in

£ # ¡

) on . The second equality can be exploited in the same manner.

¡

This gives rise to two families of join operations, left join ( ) and right join ( ) defined as

follows.

   

! !

¤¥ ¦    ¡ ¤£ ¢ ¥    ¡ ¤£ ¤¥ ¦ ¢ ¥

Definition 5.26 Let and be two join-compatible

  

¤ 

    £*    

ESPOs. Let and and . We define the operations of

¢ ¢ ¢ ¢ ¢ ¢ ¢ ¡ ¢

left join of and , denoted and right join of and , denoted under assumption

 as follows:

123

 

! ! ! !

¢¢ ¢ ¢ ¥    ¡ ¤£ ¤¥ ¦ ¡

 

! ! ! !

¢£¡ ¢ ¢ ¥    ¡ ¤£ ¤¥ ¦ 

where

   

! ! ! !

  

 ¥   ¥   ¥   ¥ 

©    ¥ ©     ©     ©    

 and no or and no

  

¨ ¨

!

    

 ¥   ¥   ¥  ¨

     ©     ©        

© or and and ;

   

! ! ! !

  

       



£ ¡    ¥ ¡     £ ¡     ¡    

 and no or and no

  

¨ ¨

!

    

      ¨

  !  ¡     ¡       

¡ or and and ;

¦ ¤

  ¨

£  ¡¤  

¡ C[0,1].

¤% ¤% ¤ ¤%



¥ ¥

#   # $# £ # #  #  £ #  #    

 ¡ ¡   ¡         

For all  ; ; , , :

 

 

!

!

 

   

¢ ¢ ¢ ¢

   

¢ ¢

¢ ¢

 

 

¢ ¢ ¡ ¦ ¥ ¦ ¢ ¥     ¡ ¤£ ¦ ¢ ¥     ¡ ¤£ ¦

let and .

¥

 #  (#  (#£ # 

¢



¡ ¡ ¡ ¡  ¡

¥

 #  ¦(# £¢¦# # 

¢



¡ ¡ ¡  ¡ ¤

  

! ! !

£ £

£ .

¥ ¥£ ¥ ¥ ¥£¡ ¥

; .

¨ ¡ ¢ Example 5.16 Consider the two ESPOs ¢ and in Figure 5.3. They are joint probability dis- tributions for (senate, legalization) and (park, legalization), respectively. However, in some cir- cumstances we may want to combine these two ESPOs and obtain the joint probability distribution for all of the three random variables. We may apply a join operation to these ESPOs since they

are join-compatible according the definition. We’ll illustrate how to obtain the left join under the

¨ ¡

¢   ¢ assumption of independence: § , as follows.

The join operation combines three operations: conditionalization, selection and Cartesian ¨

product. First, we need to calculate the results for conditionalization of the left operand (i.e., ¢ ) on the set of common variables (in this case, legalization), as shown in the top right part of Fig-

ure 5.10. Second, we do selections on probability table for all the possible values of the common

¡ ¡

     

¢

¢ ¢ ¤!¡ ! §  !¢ ! §  

¢ § § ¥ § § ¥ ¥ variables, namely, and . The resulting ESPOs are shown

124

¨ ¢

¥ :

¨

 

¡

¥  ¤ ¨ © ¨ ¦  '¤ ¢

 date: October 23 : 

gender: male date: October 23 ¡

senate legalization gender: male ¡ Rhino yes 0.04 0.11 senate Rhino no 0.1 0.15 Rhino 0.17 0.36 Donkey yes 0.22 0.27 Donkey 0.48 0.53 Donkey no 0.09 0.16 Elephant 0.12 0.30 Elephant yes 0.05 0.13 mayor = Donkey Elephant no 0.21 0.26 legalization = yes

mayor = Donkey

¨

 

¡ ¡

¥  ¤ ¨ © ¨ ¦   ¢

: 

¡ ¢ ¥ date: October 23

locality: Sunny Hill gender: male ¡

date: October 26 senate ¡ park legalization Rhino 0.17 0.36 yes yes 0.56 0.62 Donkey 0.48 0.53 yes no 0.14 0.2 Elephant 0.12 0.30 no yes 0.21 0.25 mayor = Donkey no no 0.03 0.07 legalization = no

major = Donkey

¨ ¡

¢ ¢

¥ :

¡ 

 locality: Sunny Hill

¡

¤ ¥ ¤ ¨ © ¨ ¦  ¢

  locality: Sunny Hill date: October 26 date: October 26 gender: male

senate park legalization l u ¡ park legalization yes yes 0.56 0.62 Rhino yes yes 0.09 0.22 no yes 0.21 0.25 Rhino yes no 0.03 0.06 Rhino no yes 0.04 0.09 major = Donkey Rhino no no 0.01 0.02

¡ Donkey yes yes 0.27 0.33

 

¡ ¡

¥ ¤ ¨ © ¨ ¦   ¢  Donkey yes no 0.03 0.07 locality: Sunny Hill Donkey no yes 0.11 0.13 date: October 26

Donkey no no 0.01 0.02 ¡ park legalization Elephant yes yes 0.07 0.19 yes no 0.14 0.2 Elephant yes no 0.06 0.11 no no 0.03 0.07 Elephant no yes 0.03 0.07 major = Donkey Elephant no no 0.01 0.04 major = Donkey

Figure 5.10: Join operation (left join) in ESP-Algebra

125

in the bottom left part. Third, Cartesian product operations on corresponding ESPOs are applied

¨ ¡

      

¢ ¢

!¡ ! §  ¢ !¡ ! §  ¢

§ § ¥ ¢ § § ¥ ¢

based on the values of the common variables, namely,

¡ ¨

      

!¡ ! §   ¢ !¡ ! §   ¢

§ § ¥ ¥ § § ¥ ¥ and . In this particular example, we assume that the ran- dom variables in the two ESPOs are independent when we apply probability conjunctions. Finally, we union all the resulting ESPOs and apply a tightening operation on it. The final result is shown

in the bottom right part of the figure.

Prior to applying Cartesian product operation, we project legalization = yes and legalization = no out of the conditionals.

126 Chapter 6

Building Bayesian Networks with SPDBMS

Bayesian networks have been used in various applications, such as medical diagnosis and fi- nancial planning. Bayesian networks per se are not designed with parameter changes or probability updates in mind. However, modeling a large and complex problem domain such as those that arise in medical, social science and military applications may involve many knowledge engineers and do- main experts in a complicated iterative process resulting in frequent refinement of model structure and probability distributions. Software is needed to conveniently accommodate such changes, and to efficiently produce the most up-to-date and appropriate Bayesian network on demand. The Bayesian Network Development Suite (BaNDeS) described here addresses exactly this problem. It is built on the idea of a central repository that stores information about all currently available Bayesian networks in the database, including network structures and conditional probabil- ity tables (CPTs). This repository resides on a central database server and is accessed and updated by special-purpose software tools that perform specific model-building subtasks.

6.1 Introduction

Bayesian networks, based on Bayes’ theory, have been introduced in the 1980s as a formalism for representing uncertain information and reasoning under uncertainty. Bayesian networks have the ability to conveniently describe the influences and probabilistic interactions among random variables. It provides a better solution over other approaches and overcomes most of the limita- tions. However, this theory does not gain enough attention among researchers until people started exploring the possibilities for developing medical applications [31, 65]. There have been great advances in algorithm design in Bayesian network inference [46] and planning with Bayesian networks [9] during the last two decades. There are both commercial [64, 52] and open-source [15] software tools available for development and inference with Bayesian networks. These tools create an integrated environment for their users, allowing them to input all necessary information for constructing a network, including random variables, dependencies between them and conditional probability tables, and to conduct inference with the constructed Bayesian network. As the study of inference and planning with Bayesian nets has matured, the range of applications for which Bayesian networks are suitable has become noticeably broader. Better planning and inference engines allow for larger and more complex stochastic domains to be modeled

127 and processed. The integrated approach, good for development of small and medium-size models by individual knowledge engineers, is no longer convenient when the application domains being modeled become larger and the teams become bigger as well. Instead, modeling large, dynamic domains often requires separation of tasks, rather than their integration. To design a complete Bayesian network model for a large application, a team of knowledge en- gineers must determine the random variables present in the application and their domains; possible meta-information attributes that describe specific situations in the application; conditional depen- dencies between the domain elements (Bayesian network structure), and the conditional probability tables associated with Bayesian network nodes and specific situations. While determination of vari- ables must precede all other tasks, in large applications, determination of the network structure and construction of the conditional probability tables often become parallel, and even asynchronous pro- cesses, that may be conducted by different team members in collaboration with different groups of domain experts. The number of domain experts needed to construct a successful model also grows drastically: large applications may span a number of fields of science, engineering, medicine and/or social science, and the knowledge of domain experts becomes specialized and applicable only to a particular fragment of the entire framework. The process of building a Bayesian network involves defining random variables of interest in an application domain, conditional dependencies among variables, and conditional probability dis- tributions. Here we propose a new approach to building, updating and maintaining large Bayesian network models. Our work is particularly appropriate for such large application domains. The Bayesian network development suite (BaNDeS) described here is based on our implementation of the Semistructured Probabilistic DBMS (SPDBMS), which provides robust storage and retrieval mechanisms for large-scale collections of probability distributions. On top of the SPDBMS, we build client applications designed to deal with specific subtasks within the model construction prob-

lem. The client applications we provide allow users to accomplish the latter tasks asynchronously.

The Bayesian Network Builder (BNB) allows the knowledge engineer to construct a dependency structure, obtain the corresponding conditional probability distributions from the SPDBMS, and update the network structure. The Probability Elicitation Tool (PET) takes dependencies from the BNB’s output, and elicits probabilities from domain experts, which are stored as SPOs in the SPDBMS. Since the BNB queries the SPDBMS server each time a particular node is called, it can access elicited probabilities immediately. This allows experts to work on specific subtasks of the

model construction process independently and asynchronously while updating the model in real ¡ The BNB was developed jointly with Erik Jessup, and I was in charge of the design and implementation of the graphical user interface (GUI); The PET was initially developed by Jiangyu Li.

128 time. What distinguishes our approach from others is the following. On one hand, we integrate our Bayesian network builder with the SPDBMS. There are many reasons why a Bayesian network builder would use a database as a backend instead of the straightforward text-based systems common to most Bayesian network software, or a simple hashing system. The success of relational databases as opposed to ad-hoc approaches to storing and retrieving information suggests that application of database techniques for the problem of storing and manipulating large quantities of probabilistic data may be (i) more convenient and (ii) more efficient. In general, efficient management and manipulation of large collections of data are successfully handled by databases. Here, we describe how the power of the database approach can be applied to the problem of storage and management of large quantities of diverse probability tables along with other associated information. The SPDBMS described here encapsulates the above mentioned problems, allowing the knowledge engineer to focus on other problems, such as deriving the specifications of the personalized model from the query and the knowledge base. On the other hand, we split the entire system into subsystems, each of which performs a specific subtask. An expert with particular expertise needs only to work with a specific subsystem. For example, experts can use BNB to dynamically construct Bayesian network structures and navigate probability distributions for each random variable in the network. Experts can input conditional probability tables for a specific Bayesian network through a user-friendly graphical interface provided by PET.

6.2 Bayesian Network Construction

Bayesian networks can be constructed through domain experts’ knowledge, learned from data, or both. Here we use Welfare to Work as an example for constructing Bayesian networks, and will focus on the former — constructing Bayesian network from expert’s knowledge. Expert knowl- edge is so important when building Bayesian networks for applications such as Welfare to Work because there is very little data in written format and readily available. For many other applications such as building models for a gene regulatory network, however, there are large quantities of high- throughput experimental data available. In the latter case, machine learning algorithms are usually applied to learn Bayesian networks, including probabilities and/or structures, from the massive data. At present, the Bayesian network development suite (BaNDeS) described here consists of three components, all implemented in Java. The architecture of BaNDeS is shown in Figure 6.1. The backbone of BaNDeS is the Semistructured Probabilistic DBMS (SPDBMS) server — our imple- mentation of the SPO framework described in Chapter 4. Two other components were developed.

129 The Bayesian Network Builder (BNB) allows its users to define the application domain and con- struct the Bayesian network structure for it. The Probability Elicitation Tool (PET) facilitates elici- tation of conditional probability tables from domain experts. Both PET and BNB use SPDBMS as the source of their input and the repository of the data they generate.

Figure 6.1: The overall system architecture.

6.2.1 The Role of SPDBMS

Generally speaking, a Bayesian network model has two types of information. One is the net- work topology, which specifies the actual structure of the network; and the other is a collection of conditional probability tables (CPTs), which provides quantitative information for the model. During the process of constructing Bayesian network, the network structure and the CPTs need to be accessed and updated frequently. The I/O access time may become a major computational bottleneck. The SPDBMS provides the right solution for storing and managing those information efficiently. It provides the following: (i) facilities for accessing the network structure file; (ii) facili- ties for accessing a specific set of CPTs in the network; (iii) facilities for accessing a specific part of a CPT; (iv) facilities for transforming probability distributions according to the laws of probability. First we extend the concept of conditional probability table (CPT) to incorporate element con-

130 text, which associates contextual information with each CPT. In this chapter, we will only consider extended conditional probability tables (eCPTs), and we will use the terms eCPT and CPT inter-

changeable. Formally, the extended conditional probability table (eCPT) is defined as follows.

¤

¥ ¢¡  £  ¦

Definition 6.1 An extended conditional probability table is a tuple ¥ , where

¥ is the random variable for which is defined;

¨

 

¥  ¥ ¥¤¥¤¥¤£ ¥ © ¥

£ is the set of ’s parents in the Bayesian network;

¤ ¦ ¤ ¤

      

 £    ¥  £

¡ is the conditional probability function, where

¤% ¤%

   ¤  

 ¥ © ¥ ¤¥¤¥¤

 ;

¨ ¤ is a set of global context associated with the Bayesian network .

In our approach, Bayesian network structures and CPTs are stored independently in the SPDBMS. The SPDBMS server allows storage and retrieval of both types of information. Bayesian network

structures are encoded and stored as XML files, with storage operation validating the XML file ¡ against the Bayesian network structure XML Schema . To store conditional probability tables (CPTs), the CPTs need to be converted into SPOs. In the section below, we describe how a CPT can be represented as a collection of SPOs. Consider a Bayesian network for Welfare to Work program, as shown in Figure 6.3. Given a Bayesian network node, i.e., the random variable Communication Skills in the network, its ex- tended conditional probability table (eCPT), as shown in the top of Figure 6.2, represents the prob-

ability distribution for the level of communication skills for 25–30 year-old women with different literacy levels . Each combination of values of the parent nodes gives rise to a single probability distribution and consequently to a single SPO object. For example, if the domain of the literacy random variable is “good”, “fair” and “poor”, the complete eCPT for the communication skills ran- dom variable consists of three SPOs, with one for each of those values in the domain. The three SPOs which represent the eCPT for the variable Communication skills are shown in the bottom of Figure 6.2. Each SPO contains a “slice” of the conditional probability table for the dependency

of the communication skills random variable on the literacy random variable. The context of the ¦ Local context is considered in the Knowledge Based Model Construction (KBMC) framework, but not included in

the current implementation.  The Bayesian network structure encoding is a derivative of the XMLBif format [16], without the description for

probability tables. See Appendix * for the actual XML schema.

In the Welfare to Work application, communication skills are the basic skills used to determine possible training options. Generally speaking, it depends on more than just literacy, however in the network fragment used in this paper, we simplify the dependency structure to produce a compact example.

131 SPOs (top portion) consists of two attributes specifying gender and age. Communication skills is the sole participating random variable, whose domain consists of four values: “excellent”, “good”, “so-so” and “poor”, and the conditional part of the SPO stores the fact that the value of the literacy random variable is either “good”, “fair” or “poor”. In this chapter, we restrict the format of the SPOs stored in the SPDBMS to such eCPT “slices”: conditional probability distributions with a single participating random variable. In general, SPO framework allows to store a wide variety of joint and conditional probability distributions.

Gender: Female Age: 20-30 Literacy Communication Skills Excellent Good So-so Poor Good 0.523 0.315 0.131 0.031 Fair 0.400 0.285 0.185 0.130

Poor 0.334 0.215 0.210 0.241

¥ ¥ ¥ : S1 : S2 : S3 Gender: Female Gender: Female Gender: Female Age: 20-30 Age: 20-30 Age: 20-30 Communication P Communication P Communication P Skills Skills Skills Excellent 0.523 Excellent 0.400 Excellent 0.334 Good 0.315 Good 0.285 Good 0.215 So-so 0.131 So-so 0.185 So-so 0.210 Poor 0.031 Poor 0.130 Poor 0.241 Literacy = Good Literacy = Fair Literacy = Poor

Figure 6.2: A typical eCPT for the random variable Communication Skills, and its representation as a collection of SPOs.

Recall that, in order to store SPOs, SPDBMS server uses as its back end a relational database accessed via an RDBMS (currently Oracle 8i). The SPDBMS server implements the insertion and deletion operations, as well as a full array of SP-Algebra query operations [20]. The server provides a convenient Java API through which BaNDeS components (as well as other client applications in general) can connect to it, and pass information processing and retrieval instructions. To access information stored in the SP database, we has proposed a Semistructured Probabilistic query algebra

(SP-Algebra). It extends the definitions of standard relational operations and includes selection



#  ( ), projection ( ), Cartesian product ( ) and join ( ) operations as well as a new operation of

conditionalization (  ). Detailed information can be found in Chapter 3.

132 6.2.2 Network Topology Construction

The process of Bayesian network construction starts with the Bayesian Network Builder. This tool (see Figure 6.3) provides a simple GUI that allows knowledge engineers to specify random vari- ables, their domains and dependencies between them. The interface, similar to the graph-drawing interfaces of standard Bayesian network inference tools such as Hugin [52] and JavaBayes [15], allows the users to create network nodes on the canvas, describe their domains, move them around and draw edges signifying dependencies. As work on the Bayesian network structure proceeds, BNB maintains an internal representation of the current state of the Bayesian network.

Figure 6.3: Screenshot of the Bayesian network builder.

Once a preliminary network is constructed, it can be saved to the SPDBMS server. The internal representation of the network will be first translated into XML format. A fragment of a sample XML file for the W2W network topology is shown in Figure 6.4. The network description is then sent to the SPDBMS server through the SPDBMS JDBC-like API. For example, a new SP-relation will

133 be created by calling the CREATE operation (i.e., in this case, CREATE W2W ), followed by an

INSERTX operation (i.e., INSERTX INTO W2W w2w.xml) , which actually inserts the XML file into the newly created SP-relation.

W2W Communication_Skills Excellent Good Fair Poor Literacy

. . .

Figure 6.4: A sample XML file which represents the W2W Bayesian network topology.

6.2.3 Probability Elicitation

From this moment, the Bayesian network building process becomes distributed and asyn- chronous. The network description can be imported from the SPDBMS server using a GETX oper- ation (i.e., GETX W2W ) into the Probability Elicitation Tool. This description allows the tool to construct the list of CPTs that need to be elicited from domain experts. The main window of PET

is shown in Figure 6.5. The domain expert (with or without the help of a knowledge engineer) can

¢ ¢

 

¤



 § ¡¡

choose the CPTs that they want to work on using a SELECT query (i.e.,  , where § is the target random variable), and can construct one CPT at a time. The user interface provides a number of different means for entering probability values: a spreadsheet-like environment at the top of the page, the vertical slider bars at the bottom of the screen that can be used to provide both normalized and non-normalized user probability assessments (and controlled via a variety of inter- face widgets) and a number of common preset distributions. The slider scales use both numeric probability values and verbal probability scales allowing the users uncomfortable with numbers to

express their probability distributions in terms like “usually”, “frequently”, “unlikely”, “probably” ¢ Here we assume the SP-relation name is W2W and the XML ®le to insert is called w2w.xml.

134 and such. Non-normalized distributions entered by the user (such as the user selecting “usually” as the value of for all instances) are automatically normalized by PET.

Figure 6.5: Screenshot of the current probability elicitation toolkit.

Once a user is satisfied with a specific CPT (which may still be incomplete), (s)he can save it to the database. The CPT gets converted into a collection of SPOs as described above and submitted to the SPDBMS server for insertion into the appropriate SP-relation. In this case, the query syntax for INSERT looks like INSERT INTO W2W communication_skills.xml.

6.2.4 Browse Bayesian Network Information

The moment a CPT (in SPO format) is inserted into the SPDBMS, it becomes available to the Bayesian Network Builder as well. Each time a node X (with parents Y_1, . . . , Y_k) of the Bayesian network N is selected in the Bayesian Network Builder, the SP-Algebra query

σ_{v=X}(σ_{c=Y_1}(· · · σ_{c=Y_k}(N) · · ·))

(meaning “Select from SP-relation N all SPOs with participating random variable X and conditioned by random variables Y_1, . . . , Y_k”) is issued to the SPDBMS server. The answer set to this query is exactly the set of SPOs for the CPT of node X available in the database. The BNB receives the answer set from the SPDBMS server, parses it, and displays the current state of the requested CPT.

[Figure content: a sample SPO from communication_skills.xml with context Gender = Female, Age = 20-30; a probability table for Communication_Skills with P(Excellent) = 0.523, P(Good) = 0.315, P(Fair) = 0.131, P(Poor) = 0.031; and conditional part Literacy = Good; the rest of the file is elided.]

Figure 6.6: A sample XML file communication_skills.xml which represents the CPT for variable Communication_Skills as SPOs in the W2W Bayesian network.

Figure 6.3 pictures the state of the Bayesian Network Builder after the node Communication_Skills was selected by the user. The BNB has determined that this node has a single parent (Literacy) and has issued the query σ_{v=Communication_Skills}(σ_{c=Literacy}(W2W)), selecting all SPOs from the network W2W that have Communication_Skills as the participating random variable and the Literacy random variable in their conditional parts. The query is passed to the SPDBMS server, which in turn translates it into the sequence of SQL queries requesting the data necessary to construct the answer from the underlying RDBMS. Once the data is received, the SPDBMS query processor generates the XML representation of the returned SPOs and sends them back to the BNB, where the XML is parsed and a new window showing the contents of the requested CPT is displayed on the screen.
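The round trip just described might look roughly as follows on the BNB side. The SPDBMSQueryConnection interface, the executeQuery method, and the textual query syntax are all hypothetical stand-ins for illustration only.

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    // Hypothetical stand-in for the SPDBMS server API.
    interface SPDBMSQueryConnection {
        String executeQuery(String spAlgebraQuery) throws Exception;
    }

    public class FetchCPT {
        // Issue the SP-Algebra selection for a node's CPT and parse the XML
        // answer set that the SPDBMS server returns.
        static Document fetchCPT(SPDBMSQueryConnection conn, String node, String parent)
                throws Exception {
            // Hypothetical textual syntax for the nested selection
            // sigma_{v=node}(sigma_{c=parent}(W2W)).
            String query = "SELECT W2W WHERE v = '" + node + "' AND c = '" + parent + "'";
            String xmlAnswer = conn.executeQuery(query);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlAnswer.getBytes(StandardCharsets.UTF_8)));
        }
    }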

6.3 Bayesian Network Revision

For real life applications such as Welfare to Work, the Bayesian network needs to be accessed and revised frequently because the regulations change over time, and consequently so do the interdependencies of services. Probability distributions will also change over time due to newly available information. The capability of revision is therefore crucial to the success of a Bayesian network development system.

Before we get into the details, we introduce the concept of completeness of a Bayesian network.

Definition 6.2 Given an extended conditional probability table T = ⟨X, {Y_1, . . . , Y_k}, P⟩, where X is the random variable for which T is defined, {Y_1, . . . , Y_k} is the set of X's parents, and P : dom(Y_1) × · · · × dom(Y_k) × dom(X) → [0, 1] is the conditional probability function. The conditional probability table T is complete if and only if Σ_{x ∈ dom(X)} P(y_1, . . . , y_k, x) = 1 holds for every (y_1, . . . , y_k) ∈ dom(Y_1) × · · · × dom(Y_k).

Definition 6.3 Given a Bayesian network BN = ⟨G, T⟩, where G is the network structure and T is the set of eCPTs. The Bayesian network BN is complete if and only if there exists one eCPT in T for each random variable in G, and every eCPT is complete.
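As a concrete reading of Definition 6.2, the sketch below checks completeness for a CPT stored as a map from parent-value assignments to distributions over dom(X). The map representation and the tolerance constant are assumptions made for illustration.

    import java.util.Map;

    public class CompletenessCheck {
        // An eCPT here maps each assignment of parent values (joined into a
        // key string) to a distribution over the values of X. The table is
        // complete iff every parent assignment has a distribution summing to 1.
        static boolean isComplete(Map<String, Map<String, Double>> ecpt,
                                  int expectedParentAssignments) {
            if (ecpt.size() != expectedParentAssignments) {
                return false;  // some parent assignment has no distribution at all
            }
            final double EPS = 1e-9;
            for (Map<String, Double> dist : ecpt.values()) {
                double sum = dist.values().stream().mapToDouble(Double::doubleValue).sum();
                if (Math.abs(sum - 1.0) > EPS) {
                    return false;
                }
            }
            return true;
        }
    }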

In the sections below, we will discuss how one can revise a Bayesian network within our framework, either in its network structure or in probability distributions.

6.3.1 Update Network Structures

Network structures can be updated through the user interface in the BNB. In order to modify the network structure, the BNB first retrieves the network file from the SPDBMS. After the current network is displayed on the screen, the user can make changes such as adding a node, removing a node, adding a dependency relationship, or removing a dependency relationship.

(a) When a node (random variable) is added into an existing Bayesian network, as shown in Figure 6.7(A), the system will do the following. First, the system will update the network structure to include the new node. Second, a new empty CPT for that node will be created and probabilities elicited through the PET.

(b) Adding a node into a network is straightforward. However, removing a node from the network is nontrivial. When a node is removed, as shown in Figure 6.7(B), the system first updates the network structure by removing all the dependencies from and into that node, and then the node itself. Then the system deletes the CPT which is defined for that node. If the node is a parent of at least one node, then the CPTs of all the nodes whose parent set contains the deleted node will be deleted as well. New CPTs, each of which contains a set of new conditional probability distributions, need to be created for these child nodes, and probabilities need to be elicited through the PET (a sketch of this clean-up logic follows this list).

[Figure content: A) Add a node A; B) Remove a node A (with parents P1, . . . , Pn and children C1, . . . , Cn).]

Figure 6.7: Add or remove a node.

(c) When a dependency relation is added, as shown in Figure 6.8(A), the network structure will be updated to reflect the change. The original CPT for the child no longer represents the state of the network and will be deleted. A new CPT containing a new set of conditions needs to be created and probabilities need to be filled in through the PET.

(d) When a dependency relationship is removed, as shown in Figure 6.8(B), the system will update the network structure first, and then the original CPT of the child node will be deleted. A new CPT for that child must be created and probabilities elicited through PET.
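The following sketch illustrates the clean-up performed when a node is removed, per case (b). The graph and CPT-store representations are assumptions made for illustration; the actual BNB data structures are not described in the text.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class StructureRevision {
        // parents maps each node to the set of its parent nodes; cpts holds
        // the CPT (opaque here) defined for each node.
        final Map<String, Set<String>> parents = new HashMap<>();
        final Map<String, Object> cpts = new HashMap<>();

        // Case (b): remove a node, its CPT, and invalidate the CPTs of all
        // of its children, whose conditional parts no longer match the graph.
        void removeNode(String node) {
            parents.remove(node);                 // drop edges into the node
            cpts.remove(node);                    // drop the node's own CPT
            for (Map.Entry<String, Set<String>> e : parents.entrySet()) {
                if (e.getValue().remove(node)) {  // drop edges out of the node
                    cpts.remove(e.getKey());      // child CPT must be re-elicited
                }
            }
        }
    }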

6.3.2 Update Probability Distributions

Probability distributions can also be updated through the graphical user interface in PET. In order to update the probability distributions (CPTs), the network description must be imported from the SPDBMS server. This description allows the PET to list the available CPTs in a specific network, from which the user can choose the ones to update. The CPTs being updated will be retrieved from the SPDBMS with a SELECT query, and displayed on the screen. The user can modify CPTs one at a time, and save each new CPT to the SPDBMS through an UPDATE operation. This operation

overwrites the old CPTs with the new ones.

[Figure content: A) Add a dependency (F, T); B) Remove a dependency (F, T) (F with parents P1, . . . , Pn and children C1, . . . , Cn; T with parents P1', . . . , Pn' and children C1', . . . , Cn').]

Figure 6.8: Add or remove a dependency.

6.3.3 Bayesian Network Consistency

What is Bayesian network consistency? Here, we call a Bayesian network consistent if the dependencies (in this case, parent-child relationships) among variables implicitly specified in all CPTs in the network match exactly those present in the network structure.

Definition 6.4 Given a complete Bayesian network BN = ⟨G, T⟩, where G is the network structure and T is the set of eCPTs. BN is consistent if and only if all dependencies implicitly specified in all eCPTs in T match exactly those present in the network structure G.

The SPDBMS, as a central data storage for managing network structure and CPTs, can help evaluate and maintain Bayesian network consistency. Here, we propose two efficient algorithms for maintaining Bayesian network consistency, as outlined in Figure 6.9. Basically, before any change in the state of an SP-relation is made, the database checks the database consistency for the specific Bayesian network. Following are the two basic scenarios. (i) When a network structure is modified: before the network structure in the SPDBMS is updated, all SPOs (representations of CPTs) in the specific network will be checked for consistency with the new network structure. All SPOs whose implied dependency does not match any dependency in the network will be removed. (ii) When a CPT is modified: before a set of SPOs representing a CPT in the SPDBMS is updated, these SPOs will be checked for consistency with the current network structure.

Algorithm I: When a network structure is modified:
Step 1: Read the network structure XML file for the specific network;
Step 2: Parse the network structure XML file, and store the output in a DOM tree;
Step 3: Extract the set of all dependencies in the structure by traversing the DOM tree;
Step 4: Check whether the dependency of every SPO in the specified network matches one of those in the DOM tree;
Step 5: If YES, return;
Step 6: Construct an SP-Algebra query which deletes all SPOs whose dependency is not in the set of dependencies;
Step 7: Execute the SP-Algebra query against the specific SP-relation;
Step 8: Insert the new Bayesian network structure file.

Algorithm II: When a CPT is modified:
Step 1: Extract the dependency relationship implicitly specified in the new CPT;
Step 2: Read the network structure XML file for the specific network;
Step 3: Parse the network structure XML file, and store the output in a DOM tree;
Step 4: Check whether the dependency relationship for the new CPT matches one of those in the DOM tree;
Step 5: If YES, then insert the new CPT and delete the old version; otherwise, send an error message.

Figure 6.9: Algorithms for maintaining Bayesian network consistency.
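A minimal sketch of Algorithm II, assuming the network structure file has already been parsed into a DOM tree and that dependencies can be read off as parent-child pairs; the edge element and its attribute names used below are hypothetical, since the XML encoding of the structure is not shown.

    import java.util.HashSet;
    import java.util.Set;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class ConsistencyCheck {
        // Algorithm II: accept a new CPT only if the dependency it implies
        // (parent -> child) is present in the network structure DOM tree.
        static boolean cptIsConsistent(Document structure,
                                       String parent, String child) {
            Set<String> deps = new HashSet<>();
            // Hypothetical encoding: <edge from="..." to="..."/> elements.
            NodeList edges = structure.getElementsByTagName("edge");
            for (int i = 0; i < edges.getLength(); i++) {
                Element e = (Element) edges.item(i);
                deps.add(e.getAttribute("from") + "->" + e.getAttribute("to"));
            }
            return deps.contains(parent + "->" + child);
        }
    }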

Proposition 6.1 All of the update operations above preserve Bayesian network consistency.

6.4 Conclusions and Ongoing Work

The framework proposed here can help teams of researchers build large and complex Bayesian networks in a distributed and asynchronous fashion. This is the first implementation (to our knowledge) to incorporate probabilistic database technology into the solution of an Artificial Intelligence problem. The BaNDeS architecture is flexible and extensible: new client applications for performing other model-building tasks, such as data extraction from databases, can be seamlessly incorporated into it. The framework described here is still in the initial stages of its development, both in terms of functionality and efficiency. In the current system, the SPDBMS server can be accessed by the BNB and PET from any computer connected to the Internet. Work on adding new functionality and on making query processing more efficient is under way. The main development thrust in BaNDeS is towards making it suitable for KBMC applications by allowing for storage and retrieval of situation-specific information and reasoning designed to build situation-specific Bayesian network models based on the stored data. The SPO data model naturally allows for storage of such situation-specific data in the context part of SPOs and its retrieval based on SP-Algebra queries on context. Both the Bayesian Network Builder and the Probability Elicitation Tool need to acquire new functionality related to the specification and use of such context information. Finally, decision procedures for KBMC reasoning need to be implemented. Work on this is also currently under way.

Chapter 7

Related Work

This work builds on the work of many people in three different fields: probabilistic databases, interval probabilities, and XML storage techniques. The overlap between these three fields is small, so we address them separately. Interval probabilities and XML storage techniques are surveyed in Section 7.2 and Section 7.3, respectively.

7.1 Probabilistic Databases

While modeling and managing uncertain information has received considerable attention in the last two decades, most of that work has been in the context of probabilistic relational databases. Different probabilistic relational models have been developed. Cavallo and Pittarelli [14] were among the first to address the problem of storing and querying probabilistic information in a database.

Their probabilistic relations resemble a single probability table from our SPO. Their probabilistic system was defined as a four-tuple R = ⟨A, D, dom, p⟩, where

A is a non-empty set of distinct attributes;

D is a non-empty set of domains of attributes;

dom : A → D is a function that associates a domain with each attribute; and

p : dom(A_1) × · · · × dom(A_n) → [0, 1] is a probability distribution function over the tuples of the relation.

Their system is similar to a standard relational database system except that it uses a probability distribution function instead of a characteristic function as its fourth component. Their data model requires that the probabilities for all the tuples in a relation must add up to exactly 1. As a result, separate relations are needed to represent different objects. They focused their work on information content, functional dependency and multi-valued dependency. They used point probabilities and their algebra was quite restricted. They defined only two probabilistic relational operations, namely projection and join, in this context. The projection operation is similar to ours. However, maximum entropy was applied to compute the resulting joint probability distribution for the join operation.

Pittarelli [71] extended the probabilistic algebra defined in [14] to include some new operators such as select, threshold, and pooling operators. Given a probability distribution p over a set of tuples T, the result of selection from p on a selection condition C is defined as the distribution p' = σ_C(p), where

p'(t) = 0 if t does not satisfy C, and p'(t) = p(t) / Σ_{t' satisfies C} p(t') otherwise.

σ_C(p) is undefined when Σ_{t satisfies C} p(t) = 0. Their select operation is similar to our conditionalization operation. Their threshold operation is similar to their select operation except that the threshold condition is defined on the probability value. Their pooling operation combines estimates from different sources into a single distribution. A common approach is to use linear pooling, which computes a weighted average p(t) = Σ_i w_i p_i(t) of the different estimates p_1, . . . , p_m. In our work on Knowledge-based Model Construction (KBMC), we are investigating techniques of data fusion which are similar to the idea of pooling.

Barbará, Garcia-Molina and Porter [5] presented a non-1NF probabilistic data model as an extension of the standard relational model. In their model, relations have deterministic keys, and tuples with different keys represent different real world entities. All the non-key attributes describe the properties of the entities and may be deterministic or stochastic, and independent or interdependent. Probabilities are associated with the values of stochastic attributes, and the interdependent relationship indicates that the attributes involved are jointly distributed. All other attributes are assumed to be independent. Figure 7.1 shows an example probabilistic relation. In this example, the variables CS315 and CS505 form a joint probability distribution. The interpretation is that the outcomes of random variables CS315 and CS505 depend on the Major under consideration. For example, the first row reads like

“The probability that a student has grades ‘A’ in both CS315 and CS505 given that the student is in CS major is 0.3; The probability that a student has grade ‘A’ in MA522 given that the student is in CS major is 0.5.”

The non-stochastic attributes in their framework are analogous to context in our framework, while stochastic attributes are represented as participating random variables. Their framework requires that the probabilities for each attribute (or each set of joint attributes) in a single tuple must add up to 1. They noted, however, that this requirement could be a problem because it is difficult to know the exact probabilities for all possible domain values. To solve this problem, their model allows the inclusion of missing probabilities to account for incomplete probability distributions, as shown in the last row of the sample relation. Their probabilistic relational algebra extends the standard relational algebra operations such as select, project, and join. They also introduced a new set of operators to illustrate the various possibilities. For example, the conditional operator pulls a set of stochastic attributes into the key of the resulting relation, since their framework allows only conditions specified on deterministic keys when expressing conditional probability distributions. The STOCHASTIC operator takes as input a deterministic relation (one where all attributes are deterministic) and returns a probabilistic relation according to a specified probabilistic schema. A standard relation, as shown on the left of Figure 7.2, can be transformed into a probabilistic relation, as shown on the right of Figure 7.2, by computing the probabilities for each distinct attribute (set of attributes) defined in the specified probabilistic schema, i.e., in this case STOCHASTIC(Major; CS315 CS505; MA522). The DISCRETE operator goes the other way around. It takes as input a probabilistic relation and returns a deterministic expected value relation.

Major (Key, Deterministic) | GPA (Independent) | CS315, CS505 (Interdependent Stochastic) | MA522 (Independent Stochastic)
CS   | 3.3 | 0.3 [A, A], 0.3 [A, B], 0.3 [B, A], 0.1 [B, B] | 0.5 [A], 0.5 [B]
Math | 3.1 | 1.0 [B, B] | 0.5 [A], 0.5 [*]

Figure 7.1: An example probabilistic relation.

Deterministic relation (left):
Major CS315 CS505 MA522
CS A A A
CS A A A
CS B B B
CS B B B
CS B B C
Math B B A
Math B B A
Math B B B
Math B B B
Math B B B

Probabilistic relation (right):
Major | CS315, CS505 | MA522
CS   | 0.4 [A, A], 0.6 [B, B] | 0.4 [A], 0.4 [B], 0.2 [C]
Math | 1.0 [B, B] | 0.4 [A], 0.6 [B]

Figure 7.2: An example of the STOCHASTIC operator.
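The STOCHASTIC transformation amounts to computing relative frequencies per key value. The following is a minimal sketch under the assumptions that rows arrive as string arrays and that only one stochastic attribute per key is derived; applied to the left relation of Figure 7.2 with key Major and attribute MA522, it would yield 0.4 [A], 0.4 [B], 0.2 [C] for CS.

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class Stochastic {
        // Compute, for each key value, the relative frequency of each value
        // of one stochastic attribute -- the core of the STOCHASTIC operator.
        static Map<String, Map<String, Double>> stochastic(
                List<String[]> rows, int keyCol, int attrCol) {
            Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
            Map<String, Integer> totals = new HashMap<>();
            for (String[] row : rows) {
                String key = row[keyCol];
                counts.computeIfAbsent(key, k -> new LinkedHashMap<>())
                      .merge(row[attrCol], 1, Integer::sum);
                totals.merge(key, 1, Integer::sum);
            }
            Map<String, Map<String, Double>> result = new LinkedHashMap<>();
            counts.forEach((key, dist) -> {
                Map<String, Double> probs = new LinkedHashMap<>();
                dist.forEach((v, c) -> probs.put(v, c / (double) totals.get(key)));
                result.put(key, probs);
            });
            return result;
        }
    }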

Dey and Sarkar [25] provided a probabilistic database framework with relations abiding by first normal form (1NF). Unlike Barbará, et al. [5], they assigned probabilities to tuples, instead of individual attributes, in terms of a joint probability distribution. They required the sum of all probabilities associated with a key value to be no more than 1. A tuple in their model is analogous to a single row of a probability table in ours, and their probabilistic relation might contain multiple probability distributions. Both [5] and [25] used point probabilities and assumed that all events/random variables in their models were independent. They provided a closed form query algebra and first introduced the conditionalization operation in the context of a probabilistic model. Later they proposed a non-procedural probabilistic query language called PSQL [26] as an extension of the SQL language. Based on a first-order probabilistic logic language proposed by Halpern [48], Zimányi [90] formalized a relational model to represent probabilistic information. His data model is similar to the one in [25]. Zimányi also provided a complete method for evaluating queries in probabilistic theories. In order to evaluate queries, he introduced a special type of relation called a trace relation (or t-relation). A t-relation is a classical relation with an additional special field, called trace, containing for every tuple a formula that keeps track of how the tuple has been obtained. The concept of trace is similar to path in our SPO data model. Evaluation of queries is then performed by manipulating t-relations. Lakshmanan, et al. introduced ProbView [58], a probabilistic database management system. In ProbView, probability distributions are interval-valued, and the assumption of independence of events is replaced with the introduction of probabilistic conjunctions (and disjunctions), implementing different assumptions about the relationships between the events. Figure 7.3 shows an example of a probabilistic relation in ProbView. As in Barbará, Garcia-Molina and Porter's work [5], their definition of probabilistic tuples allows any combination of deterministic and stochastic domains. They associate possible worlds with each probabilistic tuple. For example, there are eight worlds associated with the first tuple (i.e., for CS major), as shown in Figure 7.4.

Major | GPA | CS315 | CS505 | MA522
CS (h1(CS)=[1,1]) | 3.3 (h2(3.3)=[1,1]) | h3(A)=[0.3,0.6], h3(B)=[0.4,0.7] | h4(A)=[0.4,0.6], h4(B)=[0.4,0.6] | h5(A)=[0.5,0.7], h5(B)=[0.3,0.5]
Math (h6(Math)=[1,1]) | 3.1 (h7(3.1)=[1,1]) | h8(A)=[0.2,0.5], h8(B)=[0.5,0.8] | h9(A)=[0.3,0.6], h9(B)=[0.4,0.7] | h10(A)=[0.6,0.8], h10(B)=[0.2,0.4]

Figure 7.3: An example probabilistic relation.

The probability intervals associated with those worlds are computed using a linear program which can be constructed from the original probabilistic relation. They claimed that the linear program LP associated with a probabilistic relation can be constructed in time polynomial in the size of the relation, and that the size of the LP is also polynomially bounded in the size of the relation.

Major GPA CS315 CS505 MA522
CS 3.3 A A A
CS 3.3 A A B
CS 3.3 A B A
CS 3.3 A B B
CS 3.3 B A A
CS 3.3 B A B
CS 3.3 B B A
CS 3.3 B B B

Figure 7.4: Possible worlds for CS major.

They implemented a database system called ProbView on top of DBASE V.0 in C. They also proposed a set of rewrite rules for their algebra, and validated some of them experimentally. They then showed that some rewriting mechanisms yield significant savings in response time. Based on the ProbView model, Dekhtyar, Ross and Subrahmanian developed Probabilistic Temporal Databases (TP-Databases) [22], a special-purpose probabilistic database for managing temporal uncertainty. In this work, a semantics of interval probability distributions similar to the one used in the ESPO model was introduced, and the concept of tightness appeared for the first time in the database literature.

Instead of modeling uncertain information with relational models, Kornatzky and Shimony [56] developed a probabilistic object-oriented model to represent uncertain information based on a probabilistic calculus. Uncertainty in the values of attributes was represented by probabilities. One of the limitations of their work is their assumption that all events are independent.

Eiter, et al. [28, 29] extended the work in [56] by proposing an algebra for probabilistic object bases. They introduced a notion of a probabilistic schema and defined a formal logic model. Each object base consists of a collection of classes which are related and have a hierarchical order. Figure 7.5 shows an example of such a class hierarchy for classifying undergraduate students. As can be seen, the class student is at the top of the class hierarchy, and is a superclass of all others. Subclasses inherit all attributes from their superclasses. Mutually disjoint classes are connected by a node “d”. For example, a student cannot be both female and male. Probabilities are assigned to the relationships (edges in the graph) between a class and its immediate superclasses. Unlike the work by Kornatzky and Shimony [56], their algebra allows users to specify dependencies between involved events. They also developed a list of equivalence results in their algebra [30, 29], which inspired our work on the query optimization framework for the SPDBMS.

[Figure content: a hierarchy with student at the root; two mutually disjoint groups of subclasses, female/male and freshman/sophomore/junior/senior, and the further subclasses female_sophomore and male_senior; edges are labeled with probabilities (0.6, 0.3, 0.2, 0.4, 0.25, 0.25, 0.5, 0.3, 0.4, 0.4).]

Figure 7.5: A class hierarchy for classifying undergraduate students.

All the approaches discussed above are extensions of either relational databases or object databases, with all the limitations inherent in each. A probabilistic object (e.g., as described in [28]) represents a single real world entity with uncertain attribute values. In our case, an SPO represents a probability distribution of one or more random variables. Our work combines and extends the ideas contained in these approaches and applies them to a semistructured data model, which provides us with the benefit of schema flexibility. This model overcomes some of the major problems associated with probabilistic relational models. For instance, our model provides additional contextual information, specifying general information for the probability distribution, and conditional information, making it possible to represent conditional probability distributions. Two approaches to semistructured probabilistic data management are closely related to ours: the ProTDB [69] and the PIXml [53] frameworks. In ProTDB [69], Nierman and Jagadish extended the XML data model by associating probabilities with elements through modifications of regular non-probabilistic DTDs. Their data model is based on the standard XML data model. They provided two ways of modifying non-probabilistic DTDs: one is to introduce to every element a probability attribute Prob to specify the probability of the particular element existing at the specific location of the XML document, and the other is to attach a new sub-element called Dist to each element, which makes it possible to represent probability distributions. They defined two special elements Dist and Val, as shown in Figure 7.6. The ProTDB data model aims to represent uncertainty in the structure of XML documents, instead of the probabilistic information itself. The ProTDB framework lacks a query facility which allows one to efficiently manipulate the stored probabilistic information. Generally speaking, a


Figure 7.6: Definitions for the two special elements Dist and Val.

ProTDB instance is a standard XML document with some probabilistic attributes and/or elements. Any XML document, including the XML representation of an SPO, could be transformed into a ProTDB instance. Figure 7.7 shows an example of transforming an SPO into a ProTDB instance. One of the drawbacks of their model is that probabilities in an ancestor-descendant chain are related probabilistically, while all other probabilities are assumed to be independent. In contrast, our data model focuses on the representation and manipulation of the probabilistic information, namely, the probability distributions. Our approach defines probability distributions over a set of random variables along with additional context information, providing general information for the probability distribution, and conditional information, making it possible to express conditional probability distributions. The SPO data model uses a more rigid structure than ProTDB in order to represent complex probabilistic information. It uses our knowledge of the structure of stored data: our SPOs are compatible with some XML schema templates. We also provide a comprehensive SP-Algebra to efficiently query the semistructured probabilistic database.

[Figure content: side-by-side renderings of the same information, namely an SPO with context Eng and a joint probability table over the variables DB and OS (with entries such as P(A, A) = 0.2 and P(A, B) = 0.3), and the corresponding ProTDB instance encoding the same values with Dist/Val elements and Prob attributes.]

Figure 7.7: An example of transforming an SPO into a ProTDB instance.

In PIXml [53], Hung, et al. proposed a probabilistic interval XML (PIXml) data model with two types of semantics for uncertain data. The global interpretation is a distribution over an entire XML document, while the local interpretation specifies an object probability function for each non-leaf object (a leaf object is one with no children). Given a set of objects O, a set of labels L, and a set of types T, their model defines a probabilistic instance as a 6-tuple (V, lch, τ, val, card, ipf), where

V is the set of objects, V ⊆ O.

For each object o ∈ V and each label l ∈ L, lch(o, l) specifies the set of objects that are children of o with label l.

τ associates a type in T with each leaf object o.

val associates a value in the domain of τ(o) with each leaf object o.

card constrains the number of children with a label l; it is an integer-valued function, with card(o, l) = [min, max], where min ≤ max.

ipf associates an interval probability function with each non-leaf object o. For each set c in the set of potential children of o, they define ipf(o, c) = [lb(c), ub(c)].

As can be seen, they associate a probability interval with each (parent object, potential child) pair. The PIXml data model uses a relatively more flexible structure than ours and could represent almost any form of structured or non-structured data, while the ESPO data model is specially designed to represent conditional probability distributions over a set of random variables. They also proposed a path expression-based query language to access stored information. This approach overcomes some drawbacks present in Jagadish and Nierman's approach [69]. Instances of ESPO and PIXml can be transformed into each other. Strictly speaking, we cannot represent an ESPO in a PIXml instance because there is no way to represent the association (i.e., attributes ID & IDREF) between extended context/conditional and probability table within the probabilistic interval XML framework. However, with minor changes in the definition of a semistructured instance in the PIXml framework (i.e., adding ID and IDREF into the set of types T and associating τ with all objects instead of just leaf objects), we could represent an ESPO in a PIXml instance, as shown in Figure 7.8. We could also represent a PIXml instance as a set of ESPOs, with each ESPO representing a “unit of information” in the PIXml instance. Here we assume the “unit of information” in a PIXml instance is a node along with its directly connected children. We could represent this unit of information in a single ESPO, except that some redundant information for leaf nodes may be discarded or put in context. Figure 7.9 shows an example of transforming a “unit of information” of a PIXml instance into an ESPO.

[Figure content: an ESPO (context college = Eng; conditional AI = A; interval probability table over DB and OS with rows (A, A) [0.3, 0.45], (A, B) [0.2, 0.25], (B, A) [0.25, 0.3], (B, B) [0.1, 0.25]) shown next to the corresponding PIXml instance, i.e., its lch, card, τ/val and ipf tables; probabilities not specified are [1, 1].]

Figure 7.8: An example of transforming an ESPO into a PIXml instance.

[Figure content: a PIXml “unit of information” with object convey, lch(convey, tank) = {tank1, tank2}, card(convey, tank) = [1, 2], ipf values {tank1} [0.4, 0.7], {tank2} [0.2, 0.4], {tank1, tank2} [0.1, 0.7], and leaf values tank1 = T-80, tank2 = T-72 of type tank-type, together with the corresponding ESPO: conditional convey = T and an interval probability table over T-80 and T-72 with rows (T, T) [0.4, 0.7], (T, F) [0.2, 0.4], (F, T) [0.1, 0.7], (F, F) [0, 0].]

Figure 7.9: An example of transforming a “unit of information” of a PIXml instance into an ESPO.

The major difference between PIXml and our work is that PIXml [53] is concerned with the representation of uncertainty in the structure of XML documents, while our ESPO is concerned more with the probabilistic data itself. At the same time, the ESPO model provides a full set of ESP-Algebra operations for efficiently storing and manipulating the probability distributions found in many vital applications.

Hung, et al. [53] use our conditionalization formulae for their computations of conditional probabilities. This makes the two approaches comparable: our ESPO objects can be represented as their probabilistic interval XML. At the same time, we can represent their probabilistic XML documents as a set of ESPOs or simply as a joint probability distribution, and thus embed them into the ESPO model. At the same time, while ESPOs are representable in XML, our definitions of the data model and ESP-Algebra do not rely on a specific representation. The term “Probabilistic Relational Models (PRMs)” was also used in a different context [36]. PRMs represent a language for specifying statistical models among different relational domains. A PRM models uncertainty over relationships between attributes in the same object and uncertainty over relationships among different objects. It is quite different from what we discuss here.

7.2 Interval Probabilities

Imprecise probabilities have attracted the attention of researchers for quite a while now, as documented by the Imprecise Probability Project [18]. Walley’s seminal work [84] made the case for interval probabilities as the means of representing uncertainty. In his book, Walley talks about the computation of conditional probabilities of events. His semantics is quite different from ours, as Walley constructs his theory of imprecise probabilities based on gambles and betting, expressed as lower and upper previsions on the sets of events. Conditional probabilities are also specified via gambles by means of conditional previsions. A similar approach to Walley’s is found in the work of Biazzo, Gilio, et al. [7, 6] where they extend the theory of imprecise probabilities to incorporate logical inference and default reasoning. Walley [84] calls consistency and tightness properties “avoiding sure loss”, and “coherence”, respectively. Biazzo and Gilio [6] also use the term “g-coherence” as a synonym for “avoiding sure loss”. The terminology that we have adopted originated from the work of Dekhtyar, Ross and Subrahmanian on a specialized semantics for probability distributions used in their Temporal Probabilistic Database model [22]. The possible world semantics for interval probabilities also occurs in Givan, Leach and Dean’s discussion of Bounded Parameter Markov Decision Processes [42]. De Campos, Huete and Moral [17] studied probability intervals as a tool to represent uncertain information. They gave definitions for consistency and tightness, which they called reachability. They developed a calculus for probability intervals, including combination, marginalization, and conditioning. They also explored the relationship of their formalism with other uncertain theories, such as lower and upper probabilities. When they defined their conditioning operation, however,

they switched back and applied lower and upper probabilities to uncertain information instead of probability intervals, and gave a definition of a conditioning operation for bidimensional probability intervals. Our semantics follows the one in [17], with different notation. A more direct approach to introducing interval probabilities is found in the work of Weichselberger [85], who extended the Kolmogorov axioms of probability theory to the case of interval probabilities. Building on Kolmogorov probability theory, the interval probability semantics was

defined for a σ-algebra of random events. Weichselberger defined two types of interval probability distributions over this σ-algebra: R-Probabilities, similar to our consistent interval probability distribution functions, and F-Probabilities, similar to our tight interval probability distribution functions. In his semantics, an event is specified as a Boolean combination of atomic events from some set S. Each event partitions the set of possible worlds into two sets: those in which the event has occurred and those in which it has not. A lower bound on the probability that an event has occurred is immediately an upper bound on the probability that it has not occurred. Thus, for F-probabilities, Weichselberger's analogs of our tight p-interpretations, lower bounds uniquely determine upper bounds. Weichselberger completed his theory with two definitions of conditional probability: “intuitive” and “formal”. His “intuitive” definition semantically matches our Definition 5.21. On the

other hand, the “formal” definition specifies the probability interval for the conditional event through a different combination of the lower and upper bounds of the constituent events, which is somewhat different from our Theorem 5.3. There, to determine the lower bound, we minimized the numerator and tried to maximize the denominator. Similarly, for the upper bound, we maximized the numerator and tried to minimize the denominator.

In our semantics, atomic events have the form “random variable X_1 takes value x_1 and random variable X_2 takes value x_2 and . . . and random variable X_n takes value x_n.” The negation of such an event is the disjunction of all other atomic events that complete the joint probability distribution of random variables X_1, . . . , X_n. Our interval probability distribution functions specify only the probability intervals for such atomic events, without explicitly propagating them onto the negations. This means that even for tight interval probability distribution functions, both upper and lower bounds are necessary in all but marginal cases, as illustrated in Figure 7.10. Interval probability distributions of discrete random variables generate a set of linear constraints on the acceptable probability values for individual instances. This set of linear constraints, however, is quite simple. It consists of constraints specifying that the probabilities of individual instances must fall between the given lower and upper bounds, and a constraint that specifies that the sum of all probabilities must be equal to 1. It is possible, however, to study more complex collections of constraints on possible worlds. Significant work in this area has been done by Cano and Moral [13].

X: a [0.3, 0.4], b [0.4, 0.5], c [0.2, 0.3]        X: a [0.3, 0.35], b [0.4, 0.45], c [0.2, 0.27]

Figure 7.10: Lower bounds do not uniquely define upper bounds for tight interval probability distribution functions.

The conditionalization operation as a database operation was first introduced by Dey and Sarkar [25] in a probabilistic database model, and then by Dekhtyar, Goldsmith and Hawkes in a Semistructured Probabilistic database model [20]. However, the conditionalization operation for probability intervals was not included in any data model until our ESPO database framework [44, 88]. We note, however, that Jaffray [54] has shown that interval conditional probability estimates will not be perfect, and that the unfortunate consequence of this is that conditionalizing is not commutative: (p | e_1) | e_2 ≠ (p | e_2) | e_1 for many p, e_1, and e_2. But conditionalization is still useful in many situations. To our knowledge, our algorithm computes the best possible results one can obtain for the conditionalization operation. Thus, a conditionalization operation is included in ESP-Algebra with the caveat that the user must take care in the use and interpretation of the result.

7.3 XML Storage Techniques

XML is becoming the de facto standard data format for electronic data exchange over the Internet. The advantages of using XML as an information medium are well known, since XML is self-descriptive and platform-independent. Efficient storage and fast retrieval of XML documents has been studied extensively over the last several years [3]. Three major approaches have been proposed

to store XML documents, namely, native XML databases [2, 63, 82], object-oriented databases [81] and relational databases [24, 32]. Native XML databases are specialized for the purpose of storing and querying XML documents. This approach, however, does not utilize the sophisticated storage and query capabilities provided by existing relational database systems, and query processors for native XML data are still immature. Object-oriented database approaches suffer from the same problem, and query processors of object-oriented databases are themselves not yet mature. There are many

benefits to storing XML in a relational database. (According to the definition given in [78], a native XML database is not required to have any particular underlying physical storage model; it can be built on relational, object-oriented, or other proprietary storage formats.) As the dominant form of storage for a few decades,

relational databases are a widely-accepted and mature technology. Therefore, we will focus our discussion on managing XML with relational databases. Relational databases are well suited for the storage and maintenance of data-centric XML documents, but the table-based flat data model of an RDBMS does not fit the hierarchical and inter-related nature of XML objects. An RDBMS needs to break XML data into inter-related tables, and a query over XML documents results in many relational queries, such as join operations. This considerably degrades the performance of relational databases. Approaches have been proposed for reducing the effects of such undesirable consequences [77, 81]. The typical procedure for storing XML in relational databases is described as follows [76]. First, based on a predefined relational schema generation technique, relational tables are created for the purpose of storing XML documents. Second, XML documents are decomposed and stored as rows in the tables created in the first step. Structural relationships in XML documents are captured via values such as key/foreign key relationships and order/level encodings. Third, queries over XML documents are translated into a sequence of SQL statements over the relational tables. Finally, tuple streams returned from the relational database are postprocessed if necessary and tagged to generate the final result(s) in XML format. The basic idea behind these approaches is that XML documents are considered as directed, ordered, and labeled graphs. So there are different ways to flatten graphs (XML documents) into tuples. Algorithms that describe the mapping of XML documents into relational tables have been proposed [35, 77, 81]. These algorithms provide different methods to map elements (attributes and values) in XML documents into rows in the relational model, and are sometimes called relational schema generation techniques. However, appropriate relational schemas may depend on many factors, among which are the nature of the XML data, the maximum query workload and the availability of a DTD/XML schema. None of the above storage models can meet all the requirements at once with respect to query response time and/or scalability [81]. Florescu and Kossmann [34] reported a performance evaluation of different mapping schemes for storing XML data in relational databases. They outlined different mapping schemes by the rules of how to represent attributes and values in XML documents. They presented the results of comprehensive performance experiments that analyzed alternative mapping schemes with respect to database size and query performance. Their results indicated that the attribute approach is the best among all alternatives for mapping attributes, and inlining outperforms separate value tables. In parallel, Shanmugasundaram, et al. [77] explored techniques for storing XML documents, including simplifying DTDs, special schema conversion techniques, and an inlining technique. They also proposed methods to convert semistructured queries to SQL queries. Their results showed that

it is possible to handle most queries on XML data using relational databases, except for certain types of queries with complex recursion. Tian, et al. [81] extended the work of [35] and [77]. They studied several different strategies for storing XML documents by evaluating a predefined set of XQuery queries. They demonstrated that three forms of clustering are desirable when storing XML documents: clustering elements of the same real world object, clustering elements based on type, and clustering elements in the same order as in the original documents. Their results indicated that DTD information is crucial in order to achieve compact data representation and subsequently good query performance. The relational-DTD approach outperforms other approaches on average. However, there are situations where it is difficult to describe XML data by a fixed schema or a DTD [57]. One of the drawbacks of the schema generation techniques described above is that each relational schema generation technique requires its own query processor to translate queries over XML documents into a sequence of SQL statements over the generated relational schema. Shanmugasundaram, et al. [76] proposed a general technique for querying XML documents using a relational database system. Their approach makes it possible for the same query processor to be used with most relational schema generation techniques. An XML view is created, based on the stored relational tables, to reconstruct the XML documents, and queries over the original XML documents are then treated as queries over the reconstructed XML view. A drawback is that a specialized query processor may be more efficient than this general technique when translating from XML queries to SQL statements. Manolescu, et al. [62] describe similar work in the context of data integration.

[Figure content: a sample SPO with context major = CS, instructor = Dekhtyar; a probability table over CS505 and CS570 with P(A, A) = 0.3, P(A, B) = 0.2, P(B, A) = 0.3, P(B, B) = 0.2; and conditional CS315 = A.]

Figure 7.11: A sample SPO, with ids added for non-leaf nodes.

SID SourceID Name Ordinal Target Type
1 0 spo 0 1 Element
1 1 path 0 “S1” Attribute
1 1 context 1 2 Element
1 1 table 2 3 Element
1 1 conditional 3 4 Element
1 2 elem 0 5 Element
1 2 elem 1 6 Element
1 3 variable 0 7 Element
1 3 row 1 8 Element
1 3 row 2 9 Element
1 3 row 3 10 Element
1 3 row 4 11 Element
1 4 elem 0 12 Element
1 5 name 0 “major” Element
1 5 val 1 “CS” Element
1 6 name 0 “instructor” Element
1 6 val 1 “Dekhtyar” Element
1 7 name 0 “CS505” Element
1 7 name 1 “CS570” Element
1 8 val 0 “A” Element
1 8 val 1 “A” Element
1 8 P 2 0.3 Element
1 9 val 0 “A” Element
1 9 val 1 “B” Element
1 9 P 2 0.2 Element
1 10 val 0 “B” Element
1 10 val 1 “A” Element
1 10 P 2 0.3 Element
1 11 val 0 “B” Element
1 11 val 1 “B” Element
1 11 P 2 0.2 Element
1 12 name 0 “CS315” Element
1 12 val 1 “B” Element

Figure 7.12: A sample edge table.

To summarize the different techniques for storing and managing XML documents in relational databases, there are two broadly-used approaches: the edge approach [57, 35, 75, 81] and the attribute approach [35, 81]. Now suppose that we have a sample SPO, as described in Figure 7.11. When the edge approach is applied, all SPOs, including the sample SPO, can be stored in a single edge table. Each tuple in the edge table corresponds to a single edge in a directed graph (representing an XML document). The edge table contains the following fields (a sketch of a corresponding table definition follows the list).

SourceID stores the id of the parent node of an edge;

Target stores the id of the child node of an edge (or its text value if the child node has a single text child);

Name stores the name of the child node;

Ordinal stores the ordinal position of the child node within the children of the parent node; and

Type stores the node type of the child node, which tells whether it is an attribute or element.
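A minimal sketch of how such an edge table might be declared through JDBC follows. The exact column types and the connection URL are assumptions, since the text describes the fields but not their SQL declarations.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateEdgeTable {
        public static void main(String[] args) throws Exception {
            // Placeholder connection URL; the SPDBMS uses Oracle 8i as its back end.
            Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@localhost:1521:orcl");
            try (Statement stmt = conn.createStatement()) {
                // One row per edge of the XML document graph.
                stmt.executeUpdate(
                    "CREATE TABLE edge (" +
                    "  SID      INTEGER," +       // id of the stored document (SPO)
                    "  SourceID INTEGER," +       // id of the parent node of the edge
                    "  Name     VARCHAR(64)," +   // name of the child node
                    "  Ordinal  INTEGER," +       // position among the parent's children
                    "  Target   VARCHAR(256)," +  // child node id, or its text value
                    "  Type     VARCHAR(16))");   // 'Element' or 'Attribute'
            }
            conn.close();
        }
    }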

The final result for the sample SPO is shown in Figure 7.12. The attribute approach uses a similar relational schema to store XML documents, but it partitions the entire table based on the element name. So the field Name can be omitted, since every tuple in the same table will have the same value. Different elements are stored in separate tables. Figure 7.13 shows the final results for the attribute approach, with the single-row tables omitted. As can be seen, storing SPOs with either the edge approach or the attribute approach is perfectly fine. However, retrieving SPOs involves a large number of join operations, which definitely slows down query processing. For example, if we want to issue a query which retrieves all the SPOs regarding students in the Computer Science major, then we need to find all SPOs which have context attribute major of

value ‘CS’. Assume the SP-relation storing all the courses is called S; then the SP-Algebra query is σ_{major=CS}(S). Here we list the SQL statements for finding all SIDs. The corresponding SQL statements for the edge approach and the attribute approach are shown in Figure 7.14 and Figure 7.15, respectively. Both of them require joining four relations. In contrast, our approach does not require any join operations in order to find all SIDs in the database. The SQL statement for the SPDBMS is the following.

SELECT con.SID FROM SPO_CONS con
WHERE con.elemname='major' AND con.elemvalue='CS' AND con.type='cont'
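For illustration, the statement above could be issued against the underlying RDBMS as below. The table name SPO_CONS follows the naming used in Appendix A; the surrounding code is a sketch, not the SPDBMS's actual query processor.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class FindSids {
        // Fetch the SIDs of all SPOs whose context contains major = 'CS'.
        static void printSids(Connection conn) throws Exception {
            String sql = "SELECT con.SID FROM SPO_CONS con "
                       + "WHERE con.elemname = ? AND con.elemvalue = ? AND con.type = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "major");
                ps.setString(2, "CS");
                ps.setString(3, "cont");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("SID"));
                    }
                }
            }
        }
    }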

Our approach is especially favorable when operations require manipulation of probability distributions in addition to the retrieval of the stored information (i.e., conditionalization), since our approach stores actual probability tables in their entirety. Our approach combines and extends both the edge approach and the attribute approach. Those approaches use a relational schema generation technique to store arbitrary XML documents, and do

(a) The ⟨elem⟩ table:
SID SourceID Ordinal Target Type
1 2 0 5 Element
1 2 1 6 Element
1 4 0 12 Element

(b) The ⟨row⟩ table:
SID SourceID Ordinal Target Type
1 3 1 8 Element
1 3 2 9 Element
1 3 3 10 Element
1 3 4 11 Element

(c) The ⟨name⟩ table:
SID SourceID Ordinal Target Type
1 5 0 “major” Element
1 6 0 “instructor” Element
1 7 0 “CS505” Element
1 7 1 “CS570” Element
1 12 0 “CS315” Element

(d) The ⟨val⟩ table:
SID SourceID Ordinal Target Type
1 5 1 “CS” Element
1 6 1 “Dekhtyar” Element
1 8 0 “A” Element
1 8 1 “A” Element
1 9 0 “A” Element
1 9 1 “B” Element
1 10 0 “B” Element
1 10 1 “A” Element
1 11 0 “B” Element
1 11 1 “B” Element
1 12 1 “B” Element

(e) The ⟨P⟩ table:
SID SourceID Ordinal Target Type
1 8 2 0.3 Element
1 9 2 0.2 Element
1 10 2 0.3 Element
1 11 2 0.2 Element

Figure 7.13: Sample attribute tables, without the single-row tables for elements ⟨spo⟩, ⟨path⟩, ⟨context⟩, ⟨variable⟩, ⟨table⟩, and ⟨conditional⟩.

not assume any knowledge of the structure of the stored documents. We begin with the assumption that documents stored in our system conform to a schema template (see Appendix B), which ensures the probability information is stored correctly. Our approach exploits the nature of the known DTD/XML schema and employs compact storage and query translation for our XML data. In this

SELECT e1.SID
FROM edge e1, edge e2, edge e3, edge e4
WHERE e1.SID=e2.SID AND e2.SID=e3.SID AND e3.SID=e4.SID
  AND e1.Target=e2.SourceID AND e2.Target=e3.SourceID
  AND e3.SourceID=e4.SourceID
  AND e1.Name='context' AND e2.Name='elem'
  AND e3.Name='name' AND e4.Name='val'
  AND e3.Target='major' AND e4.Target='CS'

Figure 7.14: The SQL statement for finding all SPO SIDs in the edge approach.

SELECT c.SID
FROM context c, elem e, name n, val v
WHERE c.SID=e.SID AND e.SID=n.SID AND n.SID=v.SID
  AND c.Target=e.SourceID AND e.Target=n.SourceID
  AND n.SourceID=v.SourceID
  AND n.Target='major' AND v.Target='CS'

Figure 7.15: The SQL statement for finding all SPO SIDs in the attribute approach.

sense, our approach is also similar to the relational-DTD approach described in [76]. Our approach utilizes an internal object model to represent intermediate XML objects for efficient retrieval. (Although less generic, this lightweight object model provides a much more efficient way to answer queries than the DOM representation.) Our query processor distinguishes computations executed by itself from those executed by the underlying relational database, and pushes down as much query computation as possible to the relational database engine, which is similar to the approach in [33]. More importantly, the approaches described in this subsection focus on storing and querying general XML data. In particular, they do not deal with storing and querying probabilistic data, while our approach handles probabilities according to the laws of probability theory.

Chapter 8

Conclusions and Future Work

We presented a semistructured probabilistic database framework by extending the initial SPO data model proposed by Dekhtyar, Goldsmith and Hawkes in [20] for storing and managing probabilistic information. Accordingly, we extended the Semistructured Probabilistic Algebra (SP-Algebra). We added a path expression to the SPO data model to specify the origin of a particular object. We defined a new join operation according to the laws of probability theory, replacing the one in the initial framework. The SPO data model has been defined flexibly enough to represent probability distributions over arbitrary sets of random variables, along with additional information associated with them. We also developed a set of equivalence relations and cost models for the query optimization framework. Our query algebra allows the user to retrieve data based on probabilities, variables, or associated information, and to transform the probability distributions according to the laws of probability theory. We also reported our work on query optimization for the database framework. Based on the new SPO framework, we designed and implemented a Semistructured Probabilistic Database Management System (SPDBMS) on top of a relational DBMS. The XML encoding of SPOs makes it easier to interchange data between diverse applications. A storage strategy and a translation mechanism from SP-Algebra queries into sequences of relational SQL statements were developed. We reported a performance evaluation of the SPDBMS for each type of SP-Algebra query, which indicates that the SPDBMS performs well in terms of response time and scalability even without query optimization. We developed an extension of the standard Semistructured Probabilistic Object (SPO) data model to improve the flexibility of the original semistructured data model. The Extended Semistructured Probabilistic Object (ESPO) data model extends the flexibility of SPOs with two new features: (i) support for interval probabilities and (ii) association of context and conditionals with individual random variables. The Semistructured Probabilistic Database may be used as a backend to any application that involves the management of large quantities of probabilistic information, such as building stochastic models. We designed and implemented a system, the Bayesian Network Development Suite (BaNDeS), which builds Bayesian networks with full data management support of the SPDBMS. This development tool allows us to build, update and maintain large Bayesian network models in a distributed environment.

The following are major foci of our ongoing and future work:

We have developed the query optimization framework for the SPDBMS, including a set of rewrite rules and an initial set of cost models for all SP-Algebra queries. We are planning to implement a query optimizer and evaluate the performance of query optimization. The current version of the SPDBMS does not include a query optimizer. During query processing, an SP-Algebra query is parsed into an algebra parse tree, then converted into a sequence of SQL statements (without query optimization) which are executed directly by the underlying relational query engine. The returned tuple streams are then transformed into an instance of the Internal Object Model (IOM). The IOM instance represents the intermediate XML objects, and further operations such as post-processing and assembly may be applied to it. As multiple equivalent query trees can be generated from a single SP-Algebra query, the final query tree should be carefully chosen to reflect the query optimization strategy adopted by the query engine (i.e., rewrite rules and cost model). As part of the query optimization, the query tree may be simplified by a set of heuristic rules to reduce its processing cost. With the query optimizer implemented, the database system should be able to choose the best (or nearly best) query tree and subsequently improve query execution efficiency. An evaluation of the query optimization may be performed afterwards.

We are extending the SPDBMS to handle interval probability distributions. The SPO framework assumes for simplicity that all probabilities contained in the SPOs are point probabilities, i.e., single values from the probability space [0, 1]. So the SPDBMS implemented so far deals only with point probability distributions. In order for the SPDBMS to correctly handle interval probability distributions according to the laws of probability theory, we need to extend the SPDBMS so that it complies with the ESPO data model and its associated ESP-Algebra. The new system will apply a new data storage strategy to reflect the new data model. It also requires a new implementation for each ESP-Algebra operation, extending the current implementation of the SPDBMS.

The current implementation of our Bayesian Network Development Suite (BaNDeS) allows us to build Bayesian networks from expert knowledge. While this approach is good for some applications, such as the W2W program, other applications, such as those in the bioinformatics domain, require building Bayesian networks directly from massive biological data, e.g., microarray data. So we plan to incorporate machine learning algorithms into the current implementation of BaNDeS to allow one to build Bayesian network models directly from data.

Appendix A

Conversion from SP-Algebra into SQL

step 1. Get the SPO ID list based on the given context value:
        SELECT DISTINCT id FROM rid_SPO_CON
        WHERE elemname = 'major' AND elemvalue = 'CS'

step 2. Get the variable list for each SPO in the ID list:
        SELECT id, varname, position FROM rid_SPO_VAR
        WHERE id IN {ID list}

step 3. Retrieve context/conditional:
        SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS
        WHERE id IN {ID list}

step 4. Retrieve the probability table:
        SELECT * FROM rid_SPO_1 WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_i WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_max WHERE id IN {ID list}

Figure A.1: Steps for selection on context, i.e., σ_{major = 'CS'}.
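Figures A.1 through A.9 all follow the same four-step pattern: step 1 computes the list of qualifying SPO ids, and steps 2 through 4 are instantiated by splicing that list into their IN clauses. The sketch below illustrates the pattern for the selection on context of Figure A.1; the class name and the use of plain string substitution for {ID list} are simplifications for illustration, not the SPDBMS implementation.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;

    // Illustrative driver for the four-step pattern of Figure A.1 (selection
    // on context). Simplified for illustration; not the SPDBMS translation layer.
    public class SelectionOnContextSketch {

        public static void run(Connection db, String elemName, String elemValue)
                throws Exception {
            // Step 1: compute the qualifying SPO ids from the context table.
            List<String> ids = new ArrayList<>();
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT DISTINCT id FROM rid_SPO_CON "
                  + "WHERE elemname = ? AND elemvalue = ?")) {
                ps.setString(1, elemName);
                ps.setString(2, elemValue);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) ids.add(rs.getString("id"));
                }
            }
            if (ids.isEmpty()) return;   // no SPO satisfies the selection

            // Splice the id list into the IN clauses of steps 2-4; plain string
            // substitution stands in for the system's actual mechanism.
            String idList = ids.stream()
                               .map(id -> "'" + id + "'")
                               .collect(Collectors.joining(", ", "(", ")"));

            try (Statement st = db.createStatement()) {
                // Step 2: variable lists; each tuple stream feeds IOM assembly.
                try (ResultSet vars = st.executeQuery(
                        "SELECT id, varname, position FROM rid_SPO_VAR "
                      + "WHERE id IN " + idList)) { /* assemble */ }
                // Step 3: context and conditional elements.
                try (ResultSet cons = st.executeQuery(
                        "SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS "
                      + "WHERE id IN " + idList)) { /* assemble */ }
                // Step 4: probability tables are stored per arity
                // (rid_SPO_1 ... rid_SPO_max) and combined with UNION ALL;
                // only the first two arities are shown here.
                try (ResultSet probs = st.executeQuery(
                        "SELECT * FROM rid_SPO_1 WHERE id IN " + idList
                      + " UNION ALL SELECT * FROM rid_SPO_2 WHERE id IN " + idList)) {
                    /* assemble */
                }
            }
        }
    }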

step 1. Get the SPO ID list based on the given conditional value:
        SELECT DISTINCT id FROM rid_SPO_CON
        WHERE elemname = 'AI' AND elemvalue = 'B'

step 2. Get the variable list for each SPO in the ID list:
        SELECT id, varname, position FROM rid_SPO_VAR
        WHERE id IN {ID list}

step 3. Retrieve context/conditional:
        SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS
        WHERE id IN {ID list}

step 4. Retrieve the probability table:
        SELECT * FROM rid_SPO_1 WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_i WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_max WHERE id IN {ID list}

Figure A.2: Steps for selection on conditional, i.e., σ_{AI = 'B'}.

step 1. Get the SPO ID list based on the given variable name:
        SELECT DISTINCT id FROM rid_SPO_VAR
        WHERE varname = 'DB'

step 2. Get the variable list for each SPO in the ID list:
        SELECT id, varname, position FROM rid_SPO_VAR
        WHERE id IN {ID list}

step 3. Retrieve context/conditional:
        SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS
        WHERE id IN {ID list}

step 4. Retrieve the probability table:
        SELECT * FROM rid_SPO_1 WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_i WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_max WHERE id IN {ID list}

Figure A.3: Steps for selection on participating variable, i.e., σ_{DB}.

step 1. Get the SPO ID list based on the given variable name:
        SELECT DISTINCT id FROM rid_SPO_VAR
        WHERE varname = 'DB'

step 2. Get the variable list for each SPO in the ID list:
        SELECT id, varname, position FROM rid_SPO_VAR
        WHERE id IN {ID list}

step 3. Retrieve context/conditional:
        SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS
        WHERE id IN {ID list}

step 4. Retrieve the probability table:
        SELECT * FROM rid_SPO_1
        WHERE id IN {ID list} AND var_position = 'A'
        UNION ALL
        ...
        SELECT * FROM rid_SPO_i
        WHERE id IN {ID list} AND var_position = 'A'
        UNION ALL
        ...
        SELECT * FROM rid_SPO_max
        WHERE id IN {ID list} AND var_position = 'A'

Figure A.4: Steps for selection on probability table, i.e., σ_{DB = 'A'}.

step 1. Get the SPO ID list based on the given probability value:
        SELECT DISTINCT id FROM rid_SPO_1 WHERE p > 0.1
        UNION ALL
        ...
        SELECT DISTINCT id FROM rid_SPO_i WHERE p > 0.1
        UNION ALL
        ...
        SELECT DISTINCT id FROM rid_SPO_max WHERE p > 0.1

step 2. Get the variable list for each SPO in the ID list:
        SELECT id, varname, position FROM rid_SPO_VAR
        WHERE id IN {ID list}

step 3. Retrieve context/conditional:
        SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS
        WHERE id IN {ID list}

step 4. Retrieve the probability table:
        SELECT * FROM rid_SPO_1
        WHERE id IN {ID list} AND p > 0.1
        UNION ALL
        ...
        SELECT * FROM rid_SPO_i
        WHERE id IN {ID list} AND p > 0.1
        UNION ALL
        ...
        SELECT * FROM rid_SPO_max
        WHERE id IN {ID list} AND p > 0.1

Figure A.5: Steps for selection on probability, i.e., σ_{p > 0.1}.
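Because the probability tables are partitioned by arity into rid_SPO_1 through rid_SPO_max, probability-based selections such as the one in Figure A.5 must be expanded into a UNION ALL over every partition. A small generator like the hypothetical helper below (names assumed for illustration, not taken from the SPDBMS source) makes that expansion mechanical.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper that expands a probability-based selection (cf.
    // Figure A.5) into a UNION ALL over the per-arity tables
    // rid_SPO_1 .. rid_SPO_max. Names and structure are assumptions.
    public class ProbabilitySelectionSQL {

        // Builds the step-1 query of Figure A.5 for an arbitrary threshold.
        public static String idListQuery(int maxArity, double threshold) {
            List<String> parts = new ArrayList<>();
            for (int arity = 1; arity <= maxArity; arity++) {
                parts.add("SELECT DISTINCT id FROM rid_SPO_" + arity
                        + " WHERE p > " + threshold);
            }
            return String.join(" UNION ALL ", parts);
        }

        public static void main(String[] args) {
            // With maxArity = 3 and threshold 0.1 this prints the UNION ALL
            // of three per-arity selections, matching the shape of Figure A.5.
            System.out.println(idListQuery(3, 0.1));
        }
    }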

step 1. Get the SPO ID list based on the given context name:
        SELECT DISTINCT id FROM rid_SPO_CON
        WHERE elemname = 'major'

step 2. Get the variable list for each SPO in the ID list:
        SELECT id, varname, position FROM rid_SPO_VAR
        WHERE id IN {ID list}

step 3. Retrieve context/conditional:
        SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS
        WHERE id IN {ID list} AND elemname = 'major'

step 4. Retrieve the probability table:
        SELECT * FROM rid_SPO_1 WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_i WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_max WHERE id IN {ID list}

Figure A.6: Steps for projection on context, i.e., π_{major}.

step 1. Get the SPO ID list based on the given conditional name:
        SELECT DISTINCT id FROM rid_SPO_CON
        WHERE elemname = 'AI'

step 2. Get the variable list for each SPO in the ID list:
        SELECT id, varname, position FROM rid_SPO_VAR
        WHERE id IN {ID list}

step 3. Retrieve context/conditional:
        SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS
        WHERE id IN {ID list} AND elemname = 'AI'

step 4. Retrieve the probability table:
        SELECT * FROM rid_SPO_1 WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_i WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT * FROM rid_SPO_max WHERE id IN {ID list}

Figure A.7: Steps for projection on conditional, i.e., π_{AI}.

step 1. Get the SPO ID list based on the given variable name:
        SELECT DISTINCT id FROM rid_SPO_VAR
        WHERE varname = 'DB'

step 2. Get the variable list for each SPO in the ID list:
        SELECT id, varname, position FROM rid_SPO_VAR
        WHERE id IN {ID list}

step 3. Retrieve context/conditional:
        SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS
        WHERE id IN {ID list}

step 4. Retrieve the probability table:
        SELECT var_position FROM rid_SPO_1 WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT var_position FROM rid_SPO_i WHERE id IN {ID list}
        UNION ALL
        ...
        SELECT var_position FROM rid_SPO_max WHERE id IN {ID list}

Figure A.8: Steps for projection on participating variable, i.e., π_{DB}.

step 1. Get the SPO ID list based on the variable name(s):
        SELECT DISTINCT id FROM rid_SPO_VAR
        WHERE varname = 'DB'

step 2. Get the variable list for each SPO in the ID list:
        SELECT id, varname, position FROM rid_SPO_VAR
        WHERE id IN {ID list}

step 3. Retrieve context/conditional:
        SELECT id, elemname, elemvalue, idref FROM rid_SPO_CONS
        WHERE id IN {ID list}

step 4. Retrieve the probability table:
        SELECT id, {var_i list}, P FROM rid_SPO_1
        WHERE id IN {ID list} AND var_position = 'A'
        UNION ALL
        ...
        SELECT id, {var_j list}, P FROM rid_SPO_i
        WHERE id IN {ID list} AND var_position = 'A'
        UNION ALL
        ...
        SELECT id, {var_k list}, P FROM rid_SPO_max
        WHERE id IN {ID list} AND var_position = 'A'

Figure A.9: Steps for conditionalization, i.e., μ_{DB = 'A'}.
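Note that the SQL in Figure A.9 only filters and projects rows; conditionalization also requires renormalizing the retained probabilities so that they sum to 1, which presumably happens during post-processing. The following is a minimal hypothetical sketch of that renormalization step, not taken from the SPDBMS source.

    import java.util.Map;
    import java.util.stream.Collectors;

    // Hypothetical post-processing step for conditionalization: renormalize
    // the probabilities of the rows retained by the SQL of Figure A.9 so
    // that they sum to 1. Shown only to make the missing step explicit.
    public class ConditionalizationPostProcess {

        // rows maps a value assignment (e.g., "AI=B,OS=A") to its raw probability.
        public static Map<String, Double> renormalize(Map<String, Double> rows) {
            double total = rows.values().stream()
                               .mapToDouble(Double::doubleValue).sum();
            if (total == 0.0) {
                throw new IllegalArgumentException(
                    "cannot condition on a zero-probability event");
            }
            return rows.entrySet().stream()
                       .collect(Collectors.toMap(Map.Entry::getKey,
                                                 e -> e.getValue() / total));
        }
    }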

Appendix B XML Schema for SPOs

Figure B.1: XML schema template for SPOs.


Figure B.2: XML schema template for SPOs (cont.).

References

[1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.
[2] Software AG. Tamino XML Server. http://www2.softwareag.com/Corporate/products/entirex/prod_info/architecture/default.asp, 2004.
[3] S. Amer-Yahia. Techniques for storing XML. In Proc. of ICDE'02 Tutorial, 2002.
[4] Y. Arens, Ch.-N. Hsu, and C. A. Knoblock. Query processing in the SIMS information mediator. In Michael N. Huhns and Munindar P. Singh, editors, Readings in Agents, pages 82-90. Morgan Kaufmann, San Francisco, CA, USA, 1997.
[5] D. Barbará, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Trans. on Knowledge and Data Engineering, 4:487-502, 1992.
[6] V. Biazzo and A. Gilio. A generalization of the fundamental theorem of de Finetti for imprecise conditional probability assessments. International Journal of Approximate Reasoning, 24(2):251-272, 2000.
[7] V. Biazzo, A. Gilio, T. Lukasiewicz, and G. Sanfilippo. Probabilistic logic under coherence, model-theoretic probabilistic logic, and default reasoning. LNAI, 2143:290-302, 2001.
[8] G. Boole. The Laws of Thought. Macmillan, London, 1854.
[9] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research (JAIR), 11:1-94, 1999.
[10] T. Bray, J. Paoli, and C.M. Sperberg-McQueen (Eds.). Extensible Markup Language (XML) 1.0. World Wide Web Consortium Recommendation 19980210, 1998.
[11] J. S. Breese. Construction of belief and decision networks. Computational Intelligence, 8(4):624-648, November 1992.
[12] P. Buneman. Semistructured data. In Proc. PODS'97, pages 117-121, 1997.
[13] A. Cano and S. Moral. Using probability trees to compute marginals with imprecise probabilities. Technical Report DECSAI-00-02-14, Universidad de Granada, Escuela Técnica Superior de Ingeniería Informática, 2000.
[14] R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In Proc. VLDB'87, pages 71-81, 1987.
[15] Fabio Cozman. JavaBayes: Bayesian networks in Java. http://www-2.cs.cmu.edu/~javabayes, 1998.
[16] Fabio Cozman. The interchange format for Bayesian networks. http://www-2.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat, 1998.
[17] L. M. de Campos, J. F. Huete, and S. Moral. Probability intervals: A tool for uncertain reasoning. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2(2):167-196, 1994.
[18] G. de Cooman and P. Walley. The Imprecise Probabilities Project. http://ippserv.rug.ac.be, 1997.
[19] A. Dekhtyar and J. Goldsmith. Conditionalization for interval probabilities. In Proc. Workshop on Conditionals, Information, and Inference'02, LNAI, Berlin, 2003. Springer.
[20] A. Dekhtyar, J. Goldsmith, and S. Hawkes. Semistructured probabilistic databases. In Proc. Statistical and Scientific Database Management Systems, pages 36-45, 2001.
[21] A. Dekhtyar, J. Goldsmith, H. Li, and B. Young. The Bayesian Advisor Project I: Modeling Academic Advising. University of Kentucky Dept. of Computer Science Technical Report # 323-01, 2001.
[22] A. Dekhtyar, R. Ross, and V.S. Subrahmanian. Temporal probabilistic databases, I: Algebra. ACM Transactions on Database Systems, 26(1):44-95, 2001.
[23] A. Dekhtyar and V.S. Subrahmanian. Hybrid probabilistic logic programs. Journal of Logic Programming, 43(3):187-250, 2000.
[24] A. Deutsch, M. Fernandez, and D. Suciu. Storing semi-structured data using STORED. In Proc. ACM SIGMOD, pages 431-442, 1999.
[25] D. Dey and S. Sarkar. A probabilistic relational model and algebra. ACM Transactions on Database Systems, 21(3):339-369, 1996.
[26] D. Dey and S. Sarkar. PSQL: A query language for probabilistic relational data. Data and Knowledge Engineering, 28:107-120, 1998.
[27] V. J. Easton and J. H. McColl. Statistics Glossary. http://www.stats.gla.ac.uk/steps/glossary/alphabet.html, 1997.
[28] T. Eiter, J. Lu, T. Lukasiewicz, and V.S. Subrahmanian. Probabilistic object bases. ACM Transactions on Database Systems, 2001.
[29] T. Eiter, T. Lukasiewicz, and M. Walter. Extension of the relational algebra to probabilistic complex values. In Proc. FoIKS 2000, volume 1762 of Lecture Notes in Computer Science, pages 94-115. Springer, 2000.
[30] T. Eiter, T. Lukasiewicz, and M. Walter. A data model and algebra for probabilistic complex values. Annals of Mathematics and Artificial Intelligence, 33(2-4):205-252, 2001.
[31] F. J. Díez, J. Mira, E. Iturralde, and S. Zubillaga. DIAVAL: a Bayesian expert system for echocardiography, 1997.
[32] M. Fernandez, A. Morishima, and D. Suciu. Efficient evaluation of XML middleware queries. In Proc. of ACM SIGMOD 2001, California, USA, 2001.
[33] M. Fernandez, W.-Ch. Tian, and D. Suciu. SilkRoute: Trading between relations and XML. Computer Networks, 33:723-745, 2000.
[34] D. Florescu and D. Kossmann. A performance evaluation of alternative mapping schemes for storing XML data in a relational database. Technical Report 3680, INRIA, 1999.
[35] D. Florescu and D. Kossmann. Storing and querying XML data using an RDBMS. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, pages 28-34, 1999.
[36] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proc. of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1300-1309. Morgan Kaufmann Publishers Inc., 1999.
[37] N. Friedman and D. Koller. Being Bayesian about network structure. In Proc. of the 16th Annual Conference on Uncertainty in AI (UAI), pages 201-210, 2000.
[38] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601-620, 2000.
[39] H. Garcia-Molina, J. Ullman, J. Widom, et al. The Stanford-IBM manager of multiple information sources. http://www-db.stanford.edu/tsimmis/tsimmis.html, 1998.
[40] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice-Hall, Inc., 2002.
[41] J. Giarratano and G. Riley. Expert Systems: Principles of Programming. PWS Publishing Company, Boston, 1994.
[42] R. Givan, S. Leach, and T. Dean. Bounded-parameter Markov decision processes. Artificial Intelligence, 122(1-2):71-109, 2000.
[43] S. Glesner and D. Koller. Constructing flexible dynamic belief networks from first-order probabilistic knowledge bases. In Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pages 217-226, 1995.
[44] J. Goldsmith, A. Dekhtyar, and W. Zhao. Can probabilistic databases help elect qualified officials? In Proc. of the 16th International FLAIRS Conference, pages 501-505, 2003.
[45] G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170, 1993.
[46] H. Guo and W. Hsu. A survey of algorithms for real-time Bayesian network inference. In Proc. AAAI/KDD/UAI-2002 Joint Workshop on Real-Time Decision Support and Diagnosis Systems, 2002.
[47] P. Haddawy. An overview of some recent developments in Bayesian problem solving techniques. AI Magazine, pages 11-19, 1999.
[48] J. Halpern. An analysis of first-order logics of probability. Artificial Intelligence, 46(3):311-350, 1990.
[49] S. Hawkes and A. Dekhtyar. Designing Markup Languages for Probabilistic Information. University of Kentucky Tech. Report TR 319-01, 2001.
[50] D. Heckerman. A Bayesian approach for learning causal networks. In Proc. of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 285-295, 1995.
[51] D. Heckerman, C. Meek, and G. Cooper. A Bayesian Approach to Causal Discovery. MIT Press, Cambridge, MA, 1999.
[52] Hugin Expert A/S. http://www.hugin.dk, 1998.
[53] E. Hung, L. Getoor, and V.S. Subrahmanian. Probabilistic interval XML. In Proc. of the Ninth International Conference on Database Theory (ICDT), 2003.
[54] J. Y. Jaffray. Bayesian updating and belief functions. IEEE Transactions on Systems, Man, and Cybernetics, 22(5):1144-1152, 1992.
[55] C.-Ch. Kanne and G. Moerkotte. Efficient storage of XML data. In Proc. ICDE, page 198, 2000.
[56] E. Kornatzky and S.E. Shimony. A probabilistic object data model. Data and Knowledge Engineering, 12:143-166, 1994.
[57] T. Kudrass. Management of XML documents without schema in relational database systems. Information and Software Technology, 44:269-275, 2002.
[58] V.S. Lakshmanan, N. Leone, R. Ross, and V.S. Subrahmanian. ProbView: A flexible probabilistic database system. ACM Transactions on Database Systems, 22(3):419-469, 1997.
[59] K. B. Laskey. Knowledge representation and model construction. Class notes. citeseer.nj.nec.com/laskey00knowledge.html, 2000.
[60] K. B. Laskey and S. Mahoney. Network fragments: Representing knowledge for constructing probabilistic models. In Proc. of the 13th Conference on Uncertainty in Artificial Intelligence (UAI), 1997.
[61] E. P. Lim and H. K. Lee. Expert database derivation in object-oriented wrappers. Information and Software Technology, 41:183-196, 1999.
[62] I. Manolescu, D. Florescu, and D. Kossmann. Answering XML queries over heterogeneous data sources. In Proc. VLDB Conference, pages 241-250, 2001.
[63] Wolfgang Meier. eXist native XML database. http://exist.sourceforge.net, 2000.
[64] Microsoft. Microsoft Belief Network tools. http://www.research.microsoft.com/adapt/MSBNx, 1998.
[65] I. Milho and A. L. N. Fred. A user-friendly development tool for medical diagnosis based on Bayesian networks. In Proc. ICEIS 2000, 2000.
[66] K. Murphy and S. Mian. Modelling gene expression data using dynamic Bayesian networks. Technical report, Computer Science Division, University of California, Berkeley, CA. citeseer.ist.psu.edu/murphy99modelling.html, 1999.
[67] R. Ng and V.S. Subrahmanian. Probabilistic logic programming. Information and Computation, 101(2):150-201, 1993.
[68] L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic knowledge bases. Theoretical Computer Science, 171(1-2):147-177, 1997.
[69] A. Nierman and H. V. Jagadish. ProTDB: Probabilistic data in XML. In Proc. of the 28th VLDB Conference, 2002.
[70] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[71] M. Pittarelli. An algebra for probabilistic databases. IEEE Transactions on Knowledge and Data Engineering, 6(2):293-303, 1994.
[72] R. Ramakrishnan and J. Gehrke. Database Management Systems. The McGraw-Hill Companies, Inc., 2000.
[73] K. H. Rosen. Discrete Mathematics and Its Applications. WCB/McGraw-Hill, 1999.
[74] S.J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
[75] J. Shanmugasundaram, J. Kiernan, E. Shekita, C. Fan, and J. Funderburk. Querying XML views of relational data. In Proc. VLDB Conference, pages 261-270, 2001.
[76] J. Shanmugasundaram, E. Shekita, J. Kiernan, R. Krishnamurthy, E. Viglas, J. Naughton, and I. Tatarinov. A general technique for querying XML documents using a relational database system. SIGMOD Record, 30(3):20-26, 2001.
[77] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proc. VLDB Conference, pages 302-314, 1999.
[78] Kimbro Staken. Introduction to Native XML Databases. http://www.xml.com/lpt/a/2001/10/31/nativexmldb.html, 2001.
[79] V.S. Subrahmanian et al. A heterogeneous reasoning and mediator system. http://www.cs.umd.edu/projects/hermes/, 1997.
[80] D. Suciu. Semistructured data and XML. In Proc. of the 5th Intl. Conf. on Foundations of Data Organization, pages 1-12, 1998.
[81] F. Tian, D.J. DeWitt, J. Chen, and C. Zhang. The design and performance evaluation of alternative XML storage strategies. SIGMOD Record, 31(1):5-10, 2002.
[82] Lore. http://www-db.stanford.edu/lore/home/index.html, 1998.
[83] S. Viaene, R. Derrig, B. Baesens, and G. Dedene. A comparison of state-of-the-art classification techniques for expert automobile insurance claim fraud detection. Journal of Risk and Insurance, 69(3):373-421, 2002.
[84] P. Walley. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, 1991.
[85] K. Weichselberger. The theory of interval-probability as a unifying concept for uncertainty. In Proc. of the 1st International Symp. on Imprecise Probabilities and Their Applications, 1999.
[86] L. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1):3-28, 1978.
[87] W. Zhao, A. Dekhtyar, and J. Goldsmith. Query algebra for interval probabilities. In Proc. of the 14th International Conference on Database and Expert Systems Applications (DEXA), LNCS 2736, pages 527-536, 2003.
[88] W. Zhao, A. Dekhtyar, and J. Goldsmith. Databases for interval probabilities. International Journal of Intelligent Systems, 19:1-27, 2004.
[89] W. Zhao, A. Dekhtyar, and J. Goldsmith. A framework for management of semistructured probabilistic data. Journal of Intelligent Information Systems, 2004. To appear.
[90] E. Zimányi. Query evaluation in probabilistic relational databases. Theoretical Computer Science, 171:179-219, 1997.

Vita

Date and Place of Birth: July 2nd, 1968, Jiangsu, China

Education:
Master of Science in Chemical Engineering, special subject: Polymer Processing, East China University of Science and Technology, 1995

Bachelor of Science in Chemical Engineering, East China University of Science and Technology, 1990

Professional Positions:
Assistant Professor, East China University of Science and Technology, Shanghai, China, 1995–1998

Chemical Engineer, Shanghai Chloro-Alkali Chemical Company, Ltd., Shanghai, China, 1990–1992

Honors:
Research Assistantship, University of Kentucky, 2001–2004

Research Assistantship, University of Kentucky, 1999–2000

Graduate Teaching Assistantship, University of Kentucky, 1998–1999

Professional Publications:

1. (With A. Dekhtyar and J. Goldsmith) “A Framework for Management of Semistructured Probabilistic Data”. To appear in Journal of Intelligent Information Systems, 2004.

2. (With A. Dekhtyar and J. Goldsmith) “Databases for Interval Probabilities”. To appear in International Journal of Intelligent Systems (IJIS), special issue on uncertain reasoning, 2004.

3. (With A. Dekhtyar, J. Goldsmith, E. Jessup and J. Li) “Building Bayes Nets with Semistructured Probabilistic DBMS”. In GI-EMISA Forum (ISSN 1610-3351), 1:29-30, 2004.

4. (With A. Dekhtyar and J. Goldsmith) “Query Algebra for Interval Probabilities”. In Proc. 14th International Conference on Database and Expert Systems Applications (DEXA), Prague, Czech Republic, Sept. 1-5, LNCS 2736, p 527-536, Springer-Verlag, 2003.

5. (With A. Dekhtyar and J. Goldsmith) “Can Probabilistic Databases Help Elect Qualified Officials?”. In Proc. of the 16th International Florida Artificial Intelligence Research Symposium (FLAIRS), St. Augustine, FL, May 12-14, 2003, p 501-505.

6. (With W. Zhang, M. Li and G. Dai) “Research and Development of Disc Ring Reactor”. In Journal of East China University of Science and Technology (Chinese), Vol. 23, No. 1, p 1-6, 1997.

7. (With G. Dai) “Development of Short Glass Fiber Reinforced PVC Pipes”. In Fiber Composite (Chinese), Vol. 13, No. 1, p 4-9, 1996.

Wenzhong Zhao
