Databases with Uncertainty and Lineage
Total Page:16
File Type:pdf, Size:1020Kb
The VLDB Journal manuscript No. (will be inserted by the editor) Omar Benjelloun · Anish Das Sarma · Alon Halevy · Martin Theobald · Jennifer Widom Databases with Uncertainty and Lineage Abstract This paper introduces ULDBs, an extension of re- Keywords Uncertainty in Databases · Lineage · Prove- lational databases with simple yet expressive constructs for nance · Probabilistic data management representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation, however many applications require the features in 1 Introduction tandem. Fundamentally, lineage enables simple and consis- tent representation of uncertain data, it correlates uncertainty in query results with uncertainty in the input data, and query The problems faced when managing uncertain data, and those processing with lineage and uncertainty together presents associated with tracking data lineage, have been addressed computational benefits over treating them separately. in isolation in the past (e.g., [2,4,21,27,30,31,34,41,45] for uncertain data and [11,17–19,39,40] for data lineage). Mo- We show that the ULDB representation is complete, and tivated by a diverse set of applications including data inte- that it permits straightforward implementation of many re- gration, deduplication, scientific data management, informa- lational operations. We define two notions of ULDB mini- tion extraction, and others, we became interested in the com- mality — data-minimal and lineage-minimal — and study bination of uncertainty and lineage as the basis for a new minimization of ULDB representations under both notions. type of data management system [46]. With lineage, derived relations are no longer self-contained: Intuitively, an uncertain database is one that represents their uncertainty depends on uncertainty in the base data. multiple possible instances, each corresponding to a single We provide an algorithm for the new operation of extracting possible state of the database. Lineage identifies a data item’s a database subset in the presence of interconnected uncer- derivation, in terms of other data in the database, or out- tainty. We also show how ULDBs enable a new approach to side data sources. One relationship between uncertainty and query processing in probabilistic databases. Finally, we de- lineage is that lineage can be used for understanding and scribe the current state of the Trio system, our implementa- resolving uncertainty. To draw a loose analogy with web tion of ULDBs under development at Stanford. search, answers returned by a search engine are uncertain, reflected by their ranking. Search engines typically provide lineage information including at least a URL and text snip- This work was supported by the National Science Foundation under pet, and users tend to consider both ranking and lineage to grants IIS-0324431, IIS-1098447, and IIS-9985114, by DARPA Con- determine which links to follow. More generally, any appli- tract #03-000225, and by a grant from the Boeing Corporation. cation that integrates information from multiple sources may be uncertain about which data is correct, and the original O. Benjelloun Google Inc. source and derivation of data may offer helpful additional [email protected] information. A. Das Sarma Lineage is also important for uncertainty within a single Stanford University database. When users pose queries against uncertain data, [email protected] the results are uncertain too. Lineage facilitates the corre- A. Halevy lation and coordination of uncertainty in query results with Google Inc. uncertainty in the input data. For example, suppose we know [email protected] that either one set of base data is correct or another one is, M. Theobald but not both. Then we don’t want to produce any query re- Stanford University sults that are derived by mixing data from the two sets, di- [email protected] rectly or indirectly, now or later. Lineage is a particularly J. Widom convenient and intuitive mechanism for encoding the com- Stanford University plex uncertainty relationships that can arise among base and [email protected] derived data. 2 Omar Benjelloun et al. Beyond the conceptual relationships between uncertainty be uncertainty about the data itself, since the extraction tech- and lineage, this paper presents several tangible representa- niques are approximate at best. As a result, we have at least tional and computational benefits derived from their com- three kinds of uncertainty in data integration: the data, the bination. We begin by describing a representation for un- mappings between the schemas and the mappings between certainty and lineage that extends the relational model with data objects in different sources. We emphasize that these tuple alternatives (a set of possible values for each tuple), types of uncertainty are pervasive especially in data inte- maybe tuples (tuples that may be present or absent), and a gration applications whose goal is to offer access to a large lineage function mapping tuple alternatives to the data from number of sources on the WWW (e.g., MetaQuerier [13], which they were derived. We call databases in this scheme and PayGo [37]). ULDBs, for Uncertainty-Lineage Databases. We show that The explicit modeling of data lineage essentially acknowl- because we represent lineage along with uncertainty, ULDBs edges that data comes from somewhere, and that we are are complete, i.e., they can represent all finite sets of possi- not sure about the transformations (such as those outlined ble instances. In contrast, complete models for uncertainty above) that brought data into the database nor about its in- without lineage are more complex, e.g., [23,31]. trinsic meaning. As uncertainty-generating data integration Next, we study problems related to querying ULDBs. First, operations are performed, lineage keeps track of the origins we show that ULDBs permit straightforward and efficient im- of data, thereby giving a powerful tool to manage, explain plementation of many relational operations. We then con- and potentially correct the uncertain truth resulting from the sider the problem of extracting one or more relations from integration process. a ULDB: creating a “projection” of a ULDB onto a subset of Throughout the paper, we will illustrate how the key as- its relations, without changing the possible instances of the pects of ULDBs are useful for data integration applications, relations. Extracting relations is tricky because when data in and also point out a few directions for future research in the a relation R is derived from data in R0, then the possible in- application of uncertainty and lineage to data integration. stances of R may correlate with the possible instances of R0 (even when R0 is not included in the projection), which may In summary, this paper makes the following contribu- in turn correlate with possible instances of other relations. tions: Finally, for both querying and extraction we are interested • We define ULDBs—uncertain databases with lineage— in operating on ULDBs that satisfy some notions of mini- and show that they are complete (Section 3). mality. We define data minimality and lineage minimality of • We give algorithms for relational operations in ULDBs ULDBs, and we present results on minimizing ULDBs. (Section 4.2). ULDBs also open up an interesting alternative approach • We define data-minimality and lineage-minimality for to query processing in probabilistic databases, which are ULDBs, and discuss both types of minimization (Sec- captured by a simple extension of basic ULDBs to include tion 4.3). confidence values. Previous work [21] suggests special tech- niques for constructing query plans that ensure correctness • We define the new problem of extracting data from a for probabilistic data. It turns out that when lineage is tracked, ULDB, and we present an algorithm for it (Section 4.4). special considerations are no longer needed: Query execu- • We describe how ULDBs can be extended with confi- tion initially proceeds without computing probabilities, so dence values, and we show how they offer an alternative any query plan may be used. Probabilities are then computed solution to query processing in probabilistic databases from lineage as needed in a separate step. (Section 5). • We describe our Trio system for ULDBs, and give a de- tailed account of how the above features are implemented Uncertainty, Lineage and Data Integration on top of a conventional relational DBMS (Section 6). We discuss related work in Section 7 and conclude with fu- As discussed in [46], uncertainty and lineage are important ture directions in Section 8. for many classes of applications. In this paper we highlight This paper is an extension of an earlier conference pa- the need for modeling uncertainty and lineage in the context per [6]. This paper provides proofs for numerous theorems of data integration systems, which are systems that offer a that did not appear in [6]. It also describes the implemen- uniform interface to a multitude of data sources. tation of ULDBs in a system development effort underway, In data integration applications, uncertainty arises in dif- called Trio. Specifically, we describe how Trio encodes ferent ways, but they can all be traced to the “subjectivity” ULDBs in a relational database, how queries over ULDBs are effect. Data sources designed by different people or organi- translated into sets of SQL queries, and how computation zations will, by nature, describe the same domain in differ- of confidences is optimized in our system. Some of these ent, sometimes inconsistent ways. Differences may arise in aspects of Trio were also a topic of a system demonstra- the aspects of the domain they choose to model, the schemas tion [38]. they design for the domain (table structure and attribute names), and in the naming conventions they use for data ob- jects. As data integration applications strive to offer a single 2 Preliminaries “objective” and coherent integrated view of data sources, un- certainty is bound to appear.