Probabilistic Databases: Diamonds in the Dirt*
Total Page:16
File Type:pdf, Size:1020Kb
Probabilistic Databases: Diamonds in the Dirt∗ Nilesh Dalvi Christopher Ré Dan Suciu Yahoo!Research University of Washington University of Washington USA USA USA [email protected] [email protected]@cs.washington.edu 1. INTRODUCTION nature of relational probabilistic data, and exploited the spe- A wide range of applications have recently emerged that cial form of probabilistic inference that occurs during query need to manage large, imprecise data sets. The reasons for evaluation. A number of such results have emerged recently: imprecision in data are as diverse as the applications them- lineage-based representations [4], safe plans [11], algorithms selves: in sensor and RFID data, imprecision is due to mea- for top-k queries [31, 37], and representations of views over surement errors [15, 34]; in information extraction, impreci- probabilistic data [33]. What is common to all these results sion comes from the inherent ambiguity in natural-language is that they apply and extend well known concepts that are text [20, 26]; and in business intelligence, imprecision is tol- fundamental to data management, such as the separation of erated because of the high cost of data cleaning [5]. In some query and data when analyzing complexity [38], incomplete applications, such as privacy, it is a requirement that the databases [22], the threshold algorithm [16], and the use of data be less precise. For example, imprecision is purposely materialized views to answer queries [21]. In this paper, we inserted to hide sensitive attributes of individuals so that briefly survey the key concepts in probabilistic database sys- the data may be published [30]. Imprecise data has no place tems, and explain the intellectual roots of these concepts in in traditional, precise database applications like payroll and data management. inventory, and so, current database management systems are not prepared to deal with it. In contrast, the newly emerg- 1.1 An Example: The Purple Sox System ing applications offer value precisely because they query, We illustrate using an example from an information ex- search, and aggregate large volumes of imprecise data to traction system. The Purple Sox1 system at Yahoo! Re- find the “diamonds in the dirt”. This wide-variety of new search focuses on technologies to extract and manage struc- applications points to the need for generic tools to manage tured information from the Web related to a specific com- imprecise data. In this paper, we survey the state of the art munity. An example is the DbLife system [14] that aggre- of techniques that handle imprecise data, by modeling it as gates structured information about the database commu- probabilistic data [2–4, 7, 12, 15, 23, 27, 36]. nity from data on the Web. The system extracts lists of A probabilistic database management system, or Prob- database researchers together with structured, related infor- DMS, is a system that stores large volumes of probabilistic mation such as publications that they have authored, their data and supports complex queries. A ProbDMS may also co-author relationships, talks that they have given, their cur- need to perform some additional tasks, such as updates or rent affiliations, and their professional services. Figure 1(a) recovery, but these do not differ from those in conventional illustrates the researchers’ affiliations, and Figure 1(b) il- database management systems and will not be discussed lustrates their professional activities. Although most re- here. The major challenge in a ProbDMS is that it needs searchers have a single affiliation, in the data in Figure 1(a), both to scale to large data volumes, a core competence of the extracted affiliations are not unique. This occurs be- database management systems, and to do probabilistic infer- cause outdated/erroneous information is often present on ence, which is a problem studied in AI. While many scalable the Web, and even if the extractor is operating on an up-to- data management systems exists, probabilistic inference is date Webpage, the difficulty of the extraction problem forces a hard problem [35], and current systems do not scale to the extractors to produce many alternative extractions or the same extent as data management systems do. To ad- risk missing valuable data. Thus, each Name contains sev- dress this challenge, researchers have focused on the specific eral possible affiliations. One can think of Affiliation as being an attribute with uncertain values. Equivalently, one ∗This work was partially supported by NSF Grants can think of each row as being a separate uncertain tuple. IIS-0454425, IIS-0513877, IIS-0713576, and a Gift There are two constraints on this data: tuples with the same from Microsoft. An extended version of this Name but different Affiliation are mutually exclusive; and paper with additional references is available at http://www.cs.washington.edu/homes/suciu/. tuples with different values of Name are independent. The professional services shown in Figure 1 (b) are extracted Permission to make digital or hard copies of all or part of this work for from conference Webpages, and are also imprecise: in our personal or classroom use is granted without fee provided that copies are example, each record in this table is an independent extrac- not made or distributed for profit or commercial advantage and that copies tion and assumed to be independent. bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific In both examples, the uncertainty in the data is repre- permission and/or a fee. 1 Copyright 2008 ACM 0001-0782/08/0X00 ...$5.00. http://research.yahoo.com/node/498 Researchers: Name Affiliation P Services: 1 1 t1 Fred U. Washington p1 = 0.3 X1 = 1 2 2 t1 U. Wisconsin p1 = 0.2 X1 = 2 Name Conference Role P 3 3 t1 Y! Research p1 = 0.5 X1 = 3 s1 Fred VLDB Session Chair q1 = 0.2 Y1 = 1 1 1 t2 Sue U. Washington p2 = 1.0 X2 = 1 s2 Fred VLDB PC Member q2 = 0.8 Y2 = 1 1 1 t3 John U. Wisconsin p3 = 0.7 X3 = 1 s3 John SIGMOD PC Member q3 = 0.7 Y3 = 1 2 2 t3 U. Washington p3 = 0.3 X3 = 2 s4 John VLDB PC Member q4 = 0.7 Y4 = 1 1 1 t4 Frank Y! Research p4 = 0.9 X4 = 1 s5 Sue SIGMOD Chair q5 = 0.5 Y5 = 1 2 2 t4 M. Research p4 = 0.1 X4 = 2 (b) (a) Figure 1: Example of a probabilistic database. This is a block-independent-disjoint database: the 8 tuples 1 2 3 1 2 in Researchers are grouped in four groups of disjoint events, e.g., t1, t1, t1 are disjoint, and so are t4, t4, while 2 2 1 tuples from different blocks are independent, e.g., t1, t2, t4 are independent; the five tuples in Services are independent probabilistic events. This database can be represented as a c-table using the hidden variables X1,X2,X3,X4 for Researchers and Y1,Y2,Y3,Y4,Y5 for Services. sented as a probabilistic confidence score, which is com- steps in computing a SQL query on probabilistic data: first, puted by the extractor. For example, Conditional Ran- fetch and transform the data, and second, perform prob- dom Fields produce extractions with semantically meaning- abilistic inference. A straightforward but na¨ıve approach ful confidence scores [20]. Other sources of uncertainty can is to separate the two steps: use a database engine for the also be converted to confidence scores, for example prob- first step, and a general-purpose probabilistic inference tech- abilities produced by entity matching algorithms (does the nique [9, 13] for the second. But on large data the proba- mention Fred in one Webpage refer to the same entity as Fred bilistic inference quickly dominates the total running time. in another Webpage?). The example in Figure 1 presents a A better approach is to integrate the two steps, which al- very simplified view of a general principle: uncertain data lows us to leverage some database specific techniques, such is annotated with a confidence score, which is interpreted as as query optimization, using materialized views, and schema a probability. In this paper we use “probabilistic data” and information, to speedup the probabilistic inference. “uncertain data” as synonyms. Designing a good user interface raises new challenges. The answer to a SQL query is a set of tuples, and it is critical to find some way to rank these tuples, because there are 1.2 Facets of a ProbDMS typically lots of false positives when the input data is im- There are three important, related facets of any Prob- precise. Alternatively, aggregation queries can extract value DMS: (1) How do we store (or represent) a probabilistic from imprecise data, because errors tend to cancel each other database? (2) How do we answer queries using our chosen out (the Law of the Large Numbers). A separate and dif- representation? (3) How do we present the result of queries ficult task is how to indicate to the user the correlations to the user? between the output tuples. For example, the two highest There is a tension between the power of a representation ranked tuples may be mutually exclusive, but they could system, i.e., as the system more faithfully models correla- also be positively correlated. As a result, their ranks alone tions, it becomes increasingly difficult to scale the system. convey insufficient information to the user. Finally, a major A simple representation where each tuple is an indepen- challenge of this facet is how to obtain feedback from the dent probabilistic event is easier to process, but it cannot users and how to employ this feedback to ”clean” the under- faithfully model the correlations important to all applica- lying database.