Probabilistic Databases: Diamonds in the Dirt∗ (Extended Version)

Nilesh Dalvi (Yahoo! Research, USA) · Christopher Ré (University of Washington, USA) · Dan Suciu (University of Washington, USA)
[email protected] [email protected]@cs.washington.edu

1. INTRODUCTION

A wide range of applications have recently emerged that need to manage large, imprecise data sets. The reasons for imprecision in data are as diverse as the applications themselves: in sensor and RFID data, imprecision is due to measurement errors [28, 66]; in information extraction, imprecision comes from the inherent ambiguity in natural-language text [32, 40]; and in business intelligence, imprecision is used to reduce the cost of data cleaning [12]. In some applications, such as privacy, it is a requirement that the data be less precise. For example, imprecision is purposely inserted to hide sensitive attributes of individuals so that the data may be published [29, 55, 62]. Imprecise data has no place in traditional, precise database applications like payroll and inventory, and so current database management systems are not prepared to deal with it. In contrast, these newly emerging applications offer value precisely because they query, search, and aggregate large volumes of imprecise data to find the "diamonds in the dirt". This wide variety of applications points to the need for generic tools to manage imprecise data. In this paper, we survey state-of-the-art techniques for handling imprecise data, which model imprecision as probabilistic data [4, 8, 11, 14, 21, 28, 45, 51, 71].

A probabilistic database management system, or ProbDMS, is a system that stores large volumes of probabilistic data and supports complex queries. A ProbDMS may also need to perform some additional tasks, such as updates or recovery, but these do not differ from those in conventional database management systems and will not be discussed here. The major challenge in a ProbDMS is that it needs both to scale to large data volumes, a core competence of database management systems, and to do probabilistic inference, which is a problem studied in AI. While many scalable data management systems exist, probabilistic inference is in general a hard problem [68], and current systems do not scale to the same extent as data management systems do. To address this challenge, researchers have focused on the specific nature of relational probabilistic data, and exploited the special form of probabilistic inference that occurs during query evaluation. A number of such results have emerged recently: lineage-based representations [11], safe plans [18], algorithms for top-k queries [63, 82], and representations of views over probabilistic data [65, 67]. What is common to all these results is that they apply and extend well-known concepts that are fundamental to data management, such as the separation of query and data when analyzing complexity [75], incomplete databases [44], the threshold algorithm [31], and the use of materialized views to answer queries [42, 74]. In this paper, we briefly survey the key concepts in probabilistic database systems, and explain the intellectual roots of these concepts in data management.

∗This work was partially supported by NSF Grants IIS-0454425, IIS-0513877, IIS-0713576, and a Gift from Microsoft.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 2008 ACM 0001-0782/08/0X00 ...$5.00.

1.1 An Example

We illustrate using an example from an information extraction system. The Purple Sox [61] system at Yahoo! Research focuses on technologies to extract and manage structured information from the Web related to a specific community. An example is the DbLife system [27] that aggregates structured information about the database community from data on the Web. The system extracts lists of database researchers together with structured, related information such as publications they have authored, their co-author relationships, talks they have given, their current affiliations, and their professional services. Figure 1(a) illustrates the researchers' affiliations, and Figure 1(b) illustrates their professional activities. Although most researchers have a single affiliation, in the data in Figure 1(a) the extracted affiliations are not unique. This occurs because outdated or erroneous information is often present on the Web, and even if the extractor is operating on an up-to-date Webpage, the difficulty of the extraction problem forces the extractors to produce many alternative extractions or risk missing valuable data. Thus, each Name contains several possible affiliations. One can think of Affiliation as being an attribute with uncertain values; or, equivalently, one can think of each affiliation as being a separate uncertain tuple. There are two constraints on this data: tuples with the same Name but different Affiliation are mutually exclusive; and tuples with different values of Name are independent. The professional services shown in Figure 1(b) are extracted from conference Webpages, and are also imprecise: in our example, each record in this table comes from a separate extraction, and the extractions are assumed to be independent.

In both examples, the uncertainty in the data is represented as a probabilistic confidence score, which is computed by the extractor. For example, Conditional Random Fields produce extractions with semantically meaningful confidence scores [40]. Other sources of uncertainty can also be converted to confidence scores, for example probabilities produced by entity matching algorithms (does the mention Fred in one Webpage refer to the same entity as Fred in another Webpage?). The example in Figure 1 presents a very simplified instance of a general principle: uncertain data is annotated with a confidence score, which is interpreted as a probability. In this paper we use "probabilistic data" and "uncertain data" as synonyms.

(a) Researchers:
        Name    Affiliation     P
  t1^1  Fred    U. Washington   p1^1 = 0.3   (X1 = 1)
  t1^2  Fred    U. Wisconsin    p1^2 = 0.2   (X1 = 2)
  t1^3  Fred    Y! Research     p1^3 = 0.5   (X1 = 3)
  t2^1  Sue     U. Washington   p2^1 = 1.0   (X2 = 1)
  t3^1  John    U. Wisconsin    p3^1 = 0.7   (X3 = 1)
  t3^2  John    U. Washington   p3^2 = 0.3   (X3 = 2)
  t4^1  Frank   Y! Research     p4^1 = 0.9   (X4 = 1)
  t4^2  Frank   M. Research     p4^2 = 0.1   (X4 = 2)

(b) Services:
        Name    Conference   Role            P
  s1    Fred    VLDB         Session Chair   q1 = 0.2   (Y1 = 1)
  s2    Fred    VLDB         PC Member       q2 = 0.8   (Y2 = 1)
  s3    John    SIGMOD       PC Member       q3 = 0.7   (Y3 = 1)
  s4    John    VLDB         PC Member       q4 = 0.7   (Y4 = 1)
  s5    Sue     SIGMOD       Chair           q5 = 0.5   (Y5 = 1)

Figure 1: Example of a probabilistic database. This is a block-independent-disjoint database: the 8 tuples in Researchers are grouped in four groups of disjoint events, e.g., t1^1, t1^2, t1^3 are disjoint, and so are t4^1, t4^2, while tuples from different blocks are independent, e.g., t1^2, t2^1, t4^2 are independent; the five tuples in Services are independent probabilistic events. This database can be represented as a c-table using the hidden variables X1, X2, X3, X4 for Researchers and Y1, Y2, Y3, Y4, Y5 for Services.

1.2 Facets of a ProbDMS

There are three important and related facets of any ProbDMS: (1) How do we store (or represent) a probabilistic database? (2) How do we answer queries using our chosen representation? (3) How do we present the result of queries to the user?

There is a tension between the power of a representation system and its scalability: the more faithfully the system models correlations, the more difficult it becomes to scale. A simple representation where each tuple is an independent probabilistic event is easier to process, but it cannot faithfully model the correlations important to all applications. In contrast, a more complicated representation, e.g., a large Markov Network [60], can capture the semantics of the data very faithfully, but it may be impossible to compute even simple SQL queries using this representation. An extra challenge is to ensure that the representation system maps smoothly to relational data, so that the non-probabilistic part of the data can be processed by a conventional database system.

A ProbDMS needs to support complex, decision-support style SQL, with aggregates. While some applications can benefit from point queries, the real value comes from queries that search many tuples, or aggregate over many data values. For example, the answer to find the affiliation of the PC Chair of SIGMOD'2008 is inherently imprecise (and can be answered more effectively by consulting the SIGMOD'2008 home page), but a query like find all affiliations with more than 12 SIGMOD and VLDB PC Members returns much more interesting answers. There are two logical steps in computing a SQL query on probabilistic data: first, fetch and transform the data, and second, perform probabilistic inference. A straightforward but naïve approach is to separate the two steps: use a database engine for the first step, and a general-purpose probabilistic inference technique [16, 24, 80, 81] for the second. But on large data the probabilistic inference quickly dominates the total running time. A better approach is to integrate the two steps, which allows us to leverage some database-specific techniques, such as static analysis of the query, materialized views, and schema information, to speed up the probabilistic inference.

Designing a good user interface raises new challenges. The answer to a SQL query is a set of tuples, and it is critical to find some way to rank these tuples, because there are typically lots of false positives when the input data is imprecise. Alternatively, aggregation queries can extract value from imprecise data because of the Law of Large Numbers. A separate, and difficult, task is how to indicate to the user the correlations between the output tuples. For example, the two highest ranked tuples may be mutually exclusive, but they could also be positively correlated. As a result, their ranks alone convey insufficient information to the user. Finally, a major challenge of this facet is how to obtain feedback from the users and how to employ this feedback to "clean" the underlying database. This is a difficult problem, which to date has not been solved.

1.3 Key Applications

Probabilistic databases have found usage in a wide class of applications. Sensor data is obtained from battery-powered sensors that acquire temperature, pressure, or humidity readings from the surrounding environment. The BBQ system [28] showed that a probabilistic data model could be used to manage this kind of sensor data; their key insight was that the probabilistic model could answer many queries with sufficient confidence without needing to acquire additional readings. This is an important optimization, since acquiring fewer sensor readings allows longer battery life, and so longer-lasting sensor deployments. Information Extraction is a process that extracts data items of a given type from large corpora of text [32]. The extraction is always noisy, and the system often produces several alternatives. Gupta and Sarawagi [40] have argued that such data is best stored and processed by a probabilistic database. In Data Cleaning, deduplication is one of the key components and is also a noisy and imperfect process. Fuxman and Miller [2] have shown how a probabilistic database can simplify the deduplication task, by allowing multiple conflicting tuples to coexist in the database. Many other applications have looked at probabilistic databases for their data management needs: RFID data management [66], social networks [1], management of anonymized data [17, 56, 62] and scientific data management [57].
2. KEY CONCEPTS IN A ProbDMS

We present a number of key concepts for managing probabilistic data that have emerged in recent years, and group them by the three facets, although some concepts may be relevant to more than one facet.

2.1 Facet 1: Semantics and Representation

The de facto formal semantics of a probabilistic database is the possible worlds model [21]. By contrast, there is no agreement on a representation system; instead there are several approaches covering a spectrum between expressive power and usability [25]. A key concept in most representation systems is that of lineage, which is derived from early work on incomplete databases by Imielinski and Lipski [44].

2.1.1 Possible Worlds Semantics

In its most general form, a probabilistic database is a probability space over the possible contents of the database. It is customary to denote a (conventional) relational database instance with the letter I. Assuming there is a single table in our database, I is simply a set of tuples (records) representing that table; this is a conventional database. A probabilistic database is a discrete probability space PDB = (W, P), where W = {I1, I2, ..., In} is a set of possible instances, called possible worlds, and P : W → [0, 1] is such that Σ_{j=1,n} P(Ij) = 1. In the terminology of networks of belief [60], there is one random variable for each possible tuple whose values are 0 (meaning that the tuple is not present) or 1 (meaning that the tuple is present), and a probabilistic database is a joint probability distribution over the values of these random variables.

This is a very powerful definition that encompasses all the concrete data models over discrete domains that have been studied. In practice, however, one has to step back from this generality and impose some workable restrictions, while always keeping the general model in mind. Note that in our discussion we restrict ourselves to discrete domains: although probabilistic databases with continuous attributes are needed in some applications [14, 28], no formal semantics in terms of possible worlds has been proposed so far.

Consider some tuple t (we will use interchangeably the terms tuple and record in this paper). The probability that the tuple belongs to a randomly chosen world is P(t) = Σ_{j: t∈Ij} P(Ij), and is also called the marginal probability of the tuple t. Similarly, if we have two tuples t1, t2, we can examine the probability that both are present in a randomly chosen world, denoted P(t1 t2). When the latter is P(t1)P(t2), we say that t1, t2 are independent tuples; if it is 0, then we say that t1, t2 are disjoint tuples or exclusive tuples. If neither holds, then the tuples are correlated in a non-obvious way. Consider a query Q, expressed in some relational query language like SQL, and a possible tuple t in the query Q's answer. P(t ∈ Q) denotes the probability that, in a randomly chosen world, t is an answer to Q. The job of a probabilistic database system is to return all possible tuples t1, t2, ... together with their probabilities P(t1 ∈ Q), P(t2 ∈ Q), ...

2.1.2 Representation Formalisms

In practice, one can never enumerate all possible worlds, and instead we need to use some more concise representation formalism. One way to achieve that is to restrict the class of probabilistic databases that one may represent. A popular approach is to restrict the possible tuples to be either independent or disjoint. Call a probabilistic database block independent-disjoint, or BID, if the set of all possible tuples can be partitioned into blocks such that tuples from the same block are disjoint events, and tuples from distinct blocks are independent. A BID database is specified by defining the partition into blocks, and by listing the tuples' marginal probabilities. This is illustrated in Figure 1. The blocks are obtained by grouping Researchers by Name, and grouping Services by (Name, Conference, Role). The probabilities are given by the P attribute. Thus, the tuples t1^2 and t1^3 are disjoint (they are in the same block), while the tuples t1^1, t3^2, s1, s2 are independent (they come from different blocks). An intuitive BID model was introduced by Trio [78] and consists of maybe-tuples, which may or may not be in the database, and x-tuples, which are sets of mutually exclusive tuples.

Several applications require a richer representation formalism, one that can express complex correlations between tuples, and several such formalisms have been described in the literature: lineage-based [9, 34], World-Set Decompositions [5, 6], U-relations [3], or the closure of BID tables under conjunctive queries [21]. Others are the Probabilistic Relational Model of Friedman et al. [33, 35] that separates the data from the probabilistic network, Markov Networks in [48], and Markov Chains in [66]. Expressive formalisms, however, are often hard to understand by users, and increase the complexity of query evaluation, which has led researchers to search for simpler, workable models for probabilistic data [25].

Any representation formalism is a form of database normalization: a probabilistic table with correlated tuples is decomposed into several BID tables. This is similar to the factor decomposition in graphical models [16], and also similar to relational normalization based on multivalued dependencies [30, 76]. A first question is how to design the normalized representation given a probabilistic database. This requires a combination of techniques from graphical models and database normalization, but, while the connection between these two theories was described by Verma and Pearl [76] in the early 1990s, to date there exists no comprehensive theory of normalization for probabilistic databases. A second question is how to recover the complex probabilistic database from its normalized representation as BID tables. This can be done through SQL views [21, 38] or through lineage.
2.1.3 Lineage

The lineage of a tuple is an annotation that defines its derivation. Lineage is used both to represent probabilistic data, and to represent query results. The Trio system [10, 78] recognized the importance of lineage in managing data with uncertainty, and called itself a ULDB, for uncertainty-lineage database. In Trio, when new data is produced by queries over uncertain data, the lineage is computed automatically and captures all correlations needed for computing subsequent queries over the derived data.

Lineage also provides a powerful mechanism for understanding and resolving uncertainty. With lineage, user feedback on the correctness of results can be traced back to the sources of the relevant data, allowing unreliable sources to be identified. Users can provide much more detailed feedback if data lineage is made visible to them. For example, in information extraction applications where data items are generated by pipelines of AI operators, users can not only indicate if a data item is correct, but can look at the lineage of data items to locate the exact operator making the error.

The notion of lineage is derived from a landmark paper by Imielinski and Lipski [44] from 1984, who introduced c-tables as a formalism for representing incomplete databases. We describe c-tables and lineage by using the example in Figure 2. In a c-table, each tuple is annotated with a Boolean expression over some hidden variables, which today we call the lineage of that tuple. In our example there are three tuples, U. Washington, U. Wisconsin, and Y! Research, each annotated with a lineage expression over the variables X1, X3, Y1, Y2, Y4. The semantics of a c-table is a set of possible worlds. An assignment of the variables defines the world consisting of precisely those tuples whose lineage is true under that assignment, and the c-table "means" the set of possible worlds defined by all possible assignments. For an illustration, in our example the assignment X1 = 3, Y2 = 1, X3 = 2, Y4 = 1 (and any values for the other variables) defines the world {Y! Research, U. Washington}.

Lineage is a powerful tool in a ProbDMS because of the following important property: the answer to a query over a c-table can always be represented as another c-table, using the same hidden variables. In other words, it is always possible to compute the lineage of the output tuples from those of the input tuples. This is called a closure property and was first shown in [44]. We illustrate this property on the database in Fig. 1, where each tuple has a very simple lineage. Consider now the SQL query in Figure 3(a), which finds the affiliations of all people who performed some service for the VLDB conference. The answer to this query is precisely the c-table in Figure 2.

2.2 Facet 2: Query Evaluation

Query evaluation is the hardest technical challenge in a ProbDMS. One approach is to separate the query and lineage evaluation from the probabilistic inference on the lineage expression. Various algorithms have been used for the latter. Luby and Karp's Monte Carlo approximation algorithm for DNF formulas [49] was used in [19, 63] and, recently, a much more general Monte Carlo framework has been proposed by Jampani et al. [45]. Zhang and Poole's Variable Elimination algorithm [81] was used in [71], while Koch and Olteanu [50] use a variant of the Davis-Putnam procedure.

Another approach is to integrate the probabilistic inference with the query computation step. The advantage of this approach is that it allows us to leverage standard data management techniques to speed up the probabilistic inference, such as static analysis on the query or using materialized views. This has led to safe queries and to partial representations of materialized views, which we discuss next.

2.2.1 Safety

In a paper in 2004 [19], two of the current authors showed that certain SQL queries can be evaluated on a probabilistic database by pushing the probabilistic inference completely inside the relational query plan. Thus, for these queries there is no need for a separate probabilistic inference step: the output probabilities are computed inside the database engine, during normal query processing. The performance improvements can be large, e.g., Ré et al. [63] observed two orders of magnitude improvements over a Monte Carlo simulation. Queries for which this is possible are called safe queries, and the relational plan that computes the output probabilities correctly is called a safe plan. To understand the context of this result we review a fundamental principle in relational query processing: the separation between what and how. In a relational query the user specifies what she wants: relational query languages like SQL are declarative. The system translates the query into the relational algebra, using operators like join ⋈, selection σ, and projection-with-duplicate-elimination Π. The resulting expression is called a relational plan and represents how the query will be evaluated. The separation between what and how was first articulated by Codd when he introduced the relational data model [15], and is at the core of any modern relational database system.

A safe plan allows probabilities to be computed in the relational algebra, by extending its operators to manipulate probabilities [18]. There are multiple ways to extend them; the simplest is to assume all tuples to be independent: a join ⋈ that combines two tuples computes the new probability as p1p2, and a duplicate elimination that replaces n tuples with one tuple computes the output probability as 1 − (1 − p1) ··· (1 − pn). A safe plan is by definition a plan in which all these operations are provably correct. The correctness proof (or safety property) needs to be done by the query optimizer, through a static analysis on the plan. Importantly, safety does not depend on the actual instance of the database; instead, once a plan has been proven to be safe, it can be executed on any database instance.

We illustrate with the query in Figure 3(a). Any modern relational database engine will translate it into the logical plan shown in Figure 3(b). However, this plan is not safe, because the operation ΠAffiliation (projection-with-duplicate-elimination) combines tuples that are not independent, and therefore the output probabilities are incorrect. The figure illustrates this for the output value Y! Research, by tracing its computation through the query plan: the output probability is 1 − (1 − p1^3 q1)(1 − p1^3 q2). However, the lineage of Y! Research is (X1 = 3 ∧ Y1 = 1) ∨ (X1 = 3 ∧ Y2 = 1), hence the correct probability is p1^3 (1 − (1 − q1)(1 − q2)).

Alternatively, consider the plan shown in Figure 3(c). This plan performs an early projection and duplicate elimination on Services. It is logically equivalent to the plan in (b), i.e., the extra duplicate elimination is harmless. However, the new plan computes the output probability correctly: the figure illustrates this for the same output value, Y! Research. Note that although plans (b) and (c) are logically equivalent over conventional databases, they are no longer equivalent in their treatment of probabilities: one is safe, the other not.

Safety is a central concept in query processing on probabilistic databases. A query optimizer needs to search not just for a plan with lowest cost, but for one that is safe as well, and this may affect the search strategy and the outcome: in a conventional database there is no reason to favor the plan in (c) over that in (b) (and in fact modern optimizers will not choose plan (c) because the extra duplicate elimination increases the cost), but in a probabilistic database plan (c) is safe while (b) is unsafe.
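As a minimal illustration of the first, separated approach, the sketch below estimates the probability of a lineage formula by naive Monte Carlo sampling of the hidden variables of Figure 1. This is only the simple sampler; Luby and Karp's algorithm [49] refines this idea with much better guarantees for small probabilities. The encoding is ours.

```python
import random

# Naive Monte Carlo estimate of the probability of a lineage formula,
# here for the tuple U. Washington in Figure 2:
#   (X1=1 ∧ Y1=1) ∨ (X1=1 ∧ Y2=1) ∨ (X3=2 ∧ Y4=1)
# with the block and tuple probabilities of Figure 1.

def sample_assignment(rng):
    x1 = rng.choices([1, 2, 3], weights=[0.3, 0.2, 0.5])[0]  # Fred's block
    x3 = rng.choices([1, 2], weights=[0.7, 0.3])[0]          # John's block
    y1 = rng.random() < 0.2                                  # s1
    y2 = rng.random() < 0.8                                  # s2
    y4 = rng.random() < 0.7                                  # s4
    return (x1 == 1 and y1) or (x1 == 1 and y2) or (x3 == 2 and y4)

rng = random.Random(1)
n = 200_000
estimate = sum(sample_assignment(rng) for _ in range(n)) / n
# close to the exact value 1 - (1 - 0.3*0.84) * (1 - 0.3*0.7) ≈ 0.409
print(round(estimate, 2))
```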

Location
U. Washington   (X1 = 1) ∧ (Y1 = 1) ∨ (X1 = 1) ∧ (Y2 = 1) ∨ (X3 = 2) ∧ (Y4 = 1)
U. Wisconsin    (X1 = 2) ∧ (Y1 = 1) ∨ (X1 = 2) ∧ (Y2 = 1) ∨ (X3 = 1) ∧ (Y4 = 1)
Y! Research     (X1 = 3) ∧ (Y1 = 1) ∨ (X1 = 3) ∧ (Y2 = 1)

Figure 2: An example of a c-table.
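The semantics of this c-table can be checked by brute force: enumerate all assignments of the hidden variables and sum the probabilities of the assignments that make each tuple's lineage true. A sketch, using our own encoding of Figure 2's lineage expressions with the probabilities of Figure 1:

```python
from itertools import product

# Exact probabilities for the c-table of Figure 2, by enumerating all
# assignments of the hidden variables (feasible only at toy scale).
px1 = {1: 0.3, 2: 0.2, 3: 0.5}          # Fred's affiliation block
px3 = {1: 0.7, 2: 0.3}                  # John's affiliation block
py = {"Y1": 0.2, "Y2": 0.8, "Y4": 0.7}  # independent Services tuples

lineage = {
    "U. Washington": lambda x1, x3, y1, y2, y4:
        (x1 == 1 and y1) or (x1 == 1 and y2) or (x3 == 2 and y4),
    "U. Wisconsin": lambda x1, x3, y1, y2, y4:
        (x1 == 2 and y1) or (x1 == 2 and y2) or (x3 == 1 and y4),
    "Y! Research": lambda x1, x3, y1, y2, y4:
        (x1 == 3 and y1) or (x1 == 3 and y2),
}

prob = {t: 0.0 for t in lineage}
for x1, x3, y1, y2, y4 in product(px1, px3, *([[True, False]] * 3)):
    p = px1[x1] * px3[x3]
    for var, val in zip(py, (y1, y2, y4)):
        p *= py[var] if val else 1 - py[var]
    for t, f in lineage.items():
        if f(x1, x3, y1, y2, y4):
            prob[t] += p

print({t: round(p, 3) for t, p in prob.items()})
```

For Y! Research this recovers p1^3 · (1 − (1 − q1)(1 − q2)) = 0.5 · 0.84 = 0.42, the correct probability discussed in Sec. 2.2.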

(a)
SELECT x.Affiliation, confidence( )
FROM Researchers x, Services y
WHERE x.Name = y.Name
  and y.Conference = 'VLDB'
GROUP BY x.Affiliation

(d)
SELECT x.Affiliation, 1 - prod(1 - x.P * y.P)
FROM Researchers x,
     (SELECT Name, 1 - prod(1 - P) as P
      FROM Services
      WHERE Conference = 'VLDB'
      GROUP BY Name) y
WHERE x.Name = y.Name
GROUP BY x.Affiliation

(b) unsafe plan:  ΠAffiliation( Researchers ⋈Name σConference='VLDB'( Services ) )

(c) safe plan:    ΠAffiliation( Researchers ⋈Name ΠName( σConference='VLDB'( Services ) ) )

Figure 3: A SQL query on the data in Figure 1(a) returning the affiliations of all researchers who performed some service for VLDB. The query follows the syntax of MayBMS, where confidence() is an aggregate operator returning the output probability. The figure shows an unsafe plan in (b) and a safe plan in (c), and also traces the computation of the output probability of Y! Research: it assumes there is a single researcher Fred with that affiliation, and that Fred performed two services for VLDB. The safe plan re-written in SQL is shown in (d): the aggregate function prod is not supported by most relational engines, and needs to be rewritten in terms of sum, logarithms, and exponentiation.

A safe plan can be executed directly by a database engine with only small changes to the implementation of its relational operators. Alternatively, a safe plan can be executed by expressing it in regular SQL and executing it on a conventional database engine, without any changes: Figure 3(d) illustrates how the safe plan can be converted back into SQL.

Safe plans have been described for databases with independent tuples in [19], for BID databases in [18, 21], for queries whose predicates have aggregate operators in [64], and for Markov Chain databases in [66]. While conceptually a safe plan ties the probabilistic inference to the query plan, Olteanu et al. [59] have shown that it is possible to separate them at runtime: the optimizer is free to choose any query plan (not necessarily safe), then the probabilistic inference is guided by the information collected from the safe plan. This results in significant execution speedups for typical SQL queries.

2.2.2 Dichotomy of Query Evaluation

In its most general form, query evaluation on a probabilistic database is no easier than general probabilistic inference. The latter is known to be a hard problem [68]. In databases, however, one can approach the query evaluation problem differently, in a way that is best explained by recalling an important distinction made by Vardi in a landmark paper in 1982 [75]. He proposed that the query expression (which is small) and the database (which is large) be treated as two different inputs to the query evaluation problem, leading to three different complexity measures: the data complexity (when the query is fixed), the expression complexity (when the database is fixed), and the combined complexity (when both are part of the input). For example, in conventional databases, all queries have data complexity in PTIME¹, while the combined complexity is PSPACE-complete.

¹AC0, to be more specific.

We apply the same distinction to query evaluation on probabilistic databases, and here the data complexity offers a more striking picture: some queries are in PTIME (e.g., all safe queries), while others have #P-hard data complexity. In fact, for certain query languages or under certain assumptions it is possible to prove a complete dichotomy, i.e., each query belongs to one of these two classes [20, 21, 47, 64, 66]. Figure 4 describes the simplest dichotomy theorem, for conjunctive queries without self-joins over databases with independent tuples, first proven in [19].
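The arithmetic behind safe versus unsafe plans can be replayed in a few lines. The sketch below follows the scenario traced in Figure 3 for the output value Y! Research (a single researcher, Fred, with two VLDB services); the operator extensions are those of Sec. 2.2.1, and the numbers come from Figure 1.

```python
# Extensional evaluation of plans (b) and (c) for the output Y! Research,
# with p = p1^3 = 0.5 (Fred at Y! Research), q1 = 0.2, q2 = 0.8 (his two
# VLDB services). A join multiplies probabilities; duplicate elimination
# combines n tuples as 1 - Π(1 - p_i), correct only for independent tuples.
p, q1, q2 = 0.5, 0.2, 0.8

def dup_elim(ps):
    out = 1.0
    for pi in ps:
        out *= 1 - pi
    return 1 - out

# Plan (b): join first, then eliminate duplicates over the two joined
# tuples -- wrong, because both carry the same factor p (they are correlated).
unsafe = dup_elim([p * q1, p * q2])

# Plan (c): eliminate duplicates on Services first (s1, s2 are independent),
# then join with the single Researchers tuple.
safe = p * dup_elim([q1, q2])

print(round(unsafe, 3), round(safe, 3))  # 0.46 vs. the correct 0.42
```

The safe value 0.42 agrees with the brute-force possible-worlds computation, while the unsafe plan overestimates it.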
Safe queries are by definition in the first class; when the dichotomy property holds, any unsafe query has #P-hard data complexity. For unsafe queries we really have no choice but to resort to a probabilistic inference algorithm that solves, or approximates, a #P-hard problem. The abrupt change in complexity from PTIME to #P-hard is unique to probabilistic databases, and it means that query optimizers need to make special efforts to identify and use safe queries.

The quest for query evaluation techniques that soften the transition from safe to unsafe queries is an active research topic. One line of work is to extend the reach of safe plans: for example, safe sub-plans can be used to speed up processing unsafe queries [65]; functional dependencies on the data, or knowing that some relations are deterministic, can be used to find more safe plans [22, 59]; and safe plans have been described for query languages for streams of events [66].

Current approaches for general query evaluation (including non-safe queries) run some general-purpose probabilistic inference algorithm on the lineage expressions; the shared structure in the lineage can be used to speed up the probabilistic inference [69, 70]. Two recent developments are, in our view, particularly promising for solving the general query evaluation problem. One is based on the observation made by Olteanu et al. [58, 59] that the probabilistic network corresponding to any safe plan has a simple structure called "one occurrence normal form", i.e., it can be written as a Boolean expression where each variable occurs only once. Hence, it can be represented as an Ordered Binary Decision Diagram (OBDD) [77] of linear size. This raises the possibility that other tractable cases of OBDDs can be inferred, perhaps by analyzing both the query expression and the database statistics. The second new direction is taken by a recent project at IBM Almaden [45], which builds a database system where Monte Carlo simulations are pushed deep inside the engine, thus being able to evaluate any query, safe or unsafe. What is particularly promising about this approach is that through clever query optimization techniques, such as tuple bundles, the cost of sampling operations can be drastically reduced.

2.2.3 Materialized Views

The use of materialized views to answer queries is a very powerful concept in data management [42], and has been applied to semantic caching [23], physical data independence [74], and mediator-based data integration [41, 52]. In its most simple formulation, there are a number of materialized views, e.g., answers to previous queries, and the query is rewritten in terms of these views, to improve performance.

In the case of probabilistic databases, materialized views have been studied in [65, 67]. Because of the dichotomy of the query complexity, materialized views can have a dramatic impact on query evaluation: a query may be unsafe, hence #P-hard, but after rewriting it in terms of views it may become a safe query, and thus be in PTIME. There is no magic here: we don't avoid the #P-hard problem, we simply take advantage of the fact that the main cost has already been paid when the view was materialized.

But there is a major challenge in using materialized views over probabilistic data: we need to represent the view's output. We can always compute the lineage of all the tuples in the view, and this provides a complete representation of the view, but it also defeats our purpose, since using these lineage expressions during query evaluation does not simplify the probabilistic inference problem. Instead, we would like to use only the marginal tuple probabilities that have been computed for the materialized view, not their lineage. For example, it may happen that all tuples are independent probabilistic events, and in this case we only need the marginal probabilities; we say in this case that the view is fully representable. In general, not all tuples in the view are independent, but it is always possible to partition the tuples into blocks such that tuples from different blocks are independent, and, moreover, there exists a "best" such partition [65]; within a block, the correlations between the tuples remain unspecified. The blocks are described at the schema level, by specifying a set of attributes: grouping by those attributes gives the blocks. This is called a partial representation in [65], and can be used to evaluate some queries over the views. Note that the problem of finding a good partial representation of the view is solved by a static analysis that is orthogonal to the analysis of whether the view is safe or unsafe: there are examples for all four combinations of safe/unsafe and representable/unrepresentable views.

2.3 Facet 3: User Interface

The semantics of a query Q on a probabilistic database with possible worlds W is, in theory, quite simple, and is given by the image probability space [38] over the set of possible answers, {Q(I) | I ∈ W}. In practice, it is impossible, and perhaps useless, to return all possible sets of answers. An important problem in probabilistic databases is how to best present the set of possible query answers to the user. To date, two practical approaches have been considered: ranking tuples, and aggregation over imprecise values.

2.3.1 Ranking and Top-k Query Answering

In this approach the system returns all possible answer tuples and their probabilities: P(t1 ∈ Q), P(t2 ∈ Q), ... of Sec. 2.1.1; the correlations between the tuples are thus lost. The emphasis in this approach is to rank these tuples, and restrict them to the top k.

One way to rank tuples is in decreasing order of their output probabilities [63]: P(t1 ∈ Q) ≥ P(t2 ∈ Q) ≥ .... Often, however, there may be user-specified ordering criteria, and then the system needs to combine the user's ranking scores with the output probability [43, 73]. This creates some surprisingly subtle semantic issues, which were pointed out and discussed by Zhang and Chomicki [82]. A separate question is whether we can use ranking to our advantage to speed up query performance by returning only the k highest ranked tuples: this problem is called top-k query answering. One can go a step further and drop the output probabilities altogether: Ré et al. [63] argue that ranking the output tuples is the only meaningful semantics for the user, and propose to focus the query processor on computing the ranking, instead of the output probabilities.

The power of top-k query answering in speeding up query processing has been illustrated in an influential paper by Fagin, Lotem, and Naor [31], and later extended to a variety of settings, such as general SQL queries [54] and ranks based on aggregates [53].

Hierarchical Queries

In the case of tuple-independent databases (where all tuples are independent), safe queries are precisely the hierarchical queries; we define hierarchical queries here. A conjunctive query is: q(z̄) :− body

where body consists of a set of subgoals g1, g2, ..., gk, and z̄ are called the head variables. Denote by Vars(gi) the set of variables occurring in gi, and let Vars(q) = ∪i=1..k Vars(gi). For each x ∈ Vars(q), denote sg(x) = {gi | x ∈ Vars(gi)}.

Definition 2.1. Let q be a conjunctive query and z̄ its head variables. q is called hierarchical if for all x, y ∈ Vars(q) − z̄, one of the following holds: (a) sg(x) ⊆ sg(y), (b) sg(x) ⊇ sg(y), or (c) sg(x) ∩ sg(y) = ∅.

A conjunctive query is without self-joins if any two distinct subgoals refer to distinct symbols.

Theorem 2.2 (Dichotomy). [19,21] Let q be a conjunctive query without self-joins. (1) If q is hierarchical then its data complexity over tuple-independent databases is in PTIME. (2) If q is not hierarchical then its data complexity over tuple-independent databases is #P-hard.

To illustrate the theorem, consider the two queries:

q1(z) :− R(x, z), S(x, y), T(x, z)

q2(z) :− R(x, z), S(x, y), T(y, z)

In q1 we have sg(x) = {R, S, T} and sg(y) = {S}; hence q1 is hierarchical and can be evaluated in PTIME. In q2 we have sg(x) = {R, S} and sg(y) = {S, T}; hence q2 is non-hierarchical and is #P-hard.
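The hierarchical condition of Definition 2.1 is a purely syntactic check on the query. As an illustration (not part of the original paper), here is a small Python sketch that encodes a query as a list of (relation, variables) subgoals and classifies q1 and q2 above:

```python
def subgoal_map(query):
    """Compute sg(x): map each variable to the set of subgoal names it occurs in."""
    sg = {}
    for name, variables in query:
        for v in variables:
            sg.setdefault(v, set()).add(name)
    return sg

def is_hierarchical(query, head_vars):
    """Definition 2.1: for every pair of non-head variables x, y,
    sg(x) and sg(y) must be nested (one contains the other) or disjoint."""
    sg = subgoal_map(query)
    non_head = [v for v in sg if v not in head_vars]
    return all(sg[x] <= sg[y] or sg[y] <= sg[x] or not (sg[x] & sg[y])
               for x in non_head for y in non_head)

# q1(z) :- R(x,z), S(x,y), T(x,z)  -- sg(x) = {R,S,T} contains sg(y) = {S}
q1 = [("R", ["x", "z"]), ("S", ["x", "y"]), ("T", ["x", "z"])]
# q2(z) :- R(x,z), S(x,y), T(y,z)  -- sg(x) = {R,S} and sg(y) = {S,T} overlap
q2 = [("R", ["x", "z"]), ("S", ["x", "y"]), ("T", ["y", "z"])]

print(is_hierarchical(q1, {"z"}))  # True  -> PTIME by Theorem 2.2
print(is_hierarchical(q2, {"z"}))  # False -> #P-hard by Theorem 2.2
```

This is exactly the test a query optimizer can run to decide whether a safe plan exists for a self-join-free conjunctive query on a tuple-independent database.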

Figure 4: The dichotomy of conjunctive queries without self-joins on tuple-independent probabilistic databases is captured by hierarchical queries.
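For a non-hierarchical query such as q2 in Figure 4, exact evaluation is #P-hard, and systems fall back on Monte Carlo approximation of the kind used by MCDB [45] and by the multisimulation algorithm [63]: repeatedly sample a possible world, evaluate the query in it, and report the fraction of worlds in which a candidate tuple appears. The following toy sketch (hypothetical data, for illustration only) estimates P(z ∈ q2) on a tuple-independent database:

```python
import random

# Hypothetical tuple-independent tables: tuple -> marginal probability.
R = {("a", "z1"): 0.9, ("b", "z1"): 0.5}
S = {("a", "c"): 0.8, ("b", "c"): 0.4}
T = {("c", "z1"): 0.7}

def sample_world(table):
    """Draw one possible world: keep each tuple independently with its probability."""
    return {t for t, p in table.items() if random.random() < p}

def q2_answers(r, s, t):
    """Evaluate q2(z) :- R(x,z), S(x,y), T(y,z) in one (deterministic) world."""
    return {z for (x, z) in r
              for (x2, y) in s if x2 == x
              for (y2, z2) in t if y2 == y and z2 == z}

def estimate(z, n=20000):
    """Fraction of sampled worlds in which z is an answer.  After n steps the
    estimate lies in an interval (p - eps_n, p + eps_n) with high probability,
    where eps_n shrinks like O(1/sqrt(n))."""
    hits = sum(z in q2_answers(sample_world(R), sample_world(S), sample_world(T))
               for _ in range(n))
    return hits / n

# For this data the exact probability is 0.7 * (1 - (1-0.9*0.8)*(1-0.5*0.4)) = 0.5432.
print(round(estimate("z1"), 2))
```

Multisimulation refines exactly this loop: rather than running every candidate tuple's simulation for the same n, it allocates steps to the tuples whose confidence intervals still overlap the top-k boundary.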

When applied to probabilistic databases, that principle leads to a technique called multisimulation [63]. It assumes that a tuple's probability P(t ∈ Q) is approximated by an iterative algorithm, such as a Monte Carlo simulation [49]: after some number of steps n, the probability P(t ∈ Q) is known (with high probability) to lie in an interval (p − εn, p + εn), where εn decreases with n. The idea in the multisimulation algorithm is to carefully control how the simulation steps are allocated among all candidate tuples in the query's answer, in order to identify the top k tuples without wasting iterations on the other tuples. Multisimulation reduces the computation effort roughly by a factor of N/k, where N is the number of all possible answers and k is the number of top tuples returned to the user.

2.3.2 Aggregates over Imprecise Data

In SQL, aggregates come in two forms: value aggregates, as in for each company return the sum of the profits in all its units; and predicate aggregates, as in return those companies having the sum of profits greater than 1M. Both types of aggregates are needed in probabilistic databases. The first type is interpreted as an expected value, and most aggregate functions can be computed easily using the linearity of expectation. For instance, the complexities of computing sum and count aggregates over a column are the same as the complexities of answering the same query without the aggregate, i.e., where all possible values of the column are returned along with their probabilities [22]. The complexities of computing min and max are the same as those of computing the underlying queries with the aggregates replaced by projections removing the columns [22]. One aggregate that is more difficult to compute is average, which is an important aggregate function for OLAP over imprecise data [12]. Average is difficult to compute even on a single table, because its expected value is not the ratio between the expected values of sum and count(*). A surprising result was shown by Jayram, Kale, and Vee [46], who proved that average can be computed efficiently on a single table: they give an exact algorithm that runs in time O(n log² n). They also give efficient algorithms to compute various aggregates when the data is streaming.

The second type of aggregate occurs in the HAVING clause of a SQL query. A technique for computing such aggregates on a probabilistic database is described in [64], by combining provenance semirings [39] with safe plans.

3. A LITTLE HISTORY OF THE (POSSIBLE) WORLDS

There is a rich literature on probabilistic databases, and we do not aim here to be complete; rather, as in Gombrich's classic [36], we aim to "catch a glimpse". Early extensions of databases with probabilities date back to Wong [79], Shoshani [72], and Cavallo and Pittarelli [13]. In an influential paper, Barbara et al. [8] described a probabilistic data model that is quite close to the BID data model, and showed that SQL queries without duplicate elimination or other aggregations can be evaluated efficiently. ProbView [51] removed the restriction on queries, but returned confidence intervals instead of probabilities. At about the same time, Fuhr and Roelleke [34] started to use c-tables and lineage for probabilistic databases, and showed that every query can be computed this way. A similar approach was pursued independently by Zimanyi [83]. Motivated by the "reliability of queries", de Rougemont [26] was the first to study query complexity over probabilistic data. Three years later, Grädel, Gurevich and Hirsch [37] were the first to observe that a simple query can have data complexity that is #P-hard.

In a line of research started by Arenas, Bertossi and Chomicki [7], possible worlds have been used to define consistent query semantics in inconsistent databases, i.e., databases that violate constraints. It has been observed [2] that query answering in inconsistent databases is a special case of probabilistic query answering over BID tables, where "certain tuples" in inconsistent databases are precisely the tuples with probability 1 under probabilistic semantics.

The intense interest in probabilistic databases seen today is due to a number of influential projects: applications to sensor data [14, 28], data cleaning [2], and information extraction [40], the safe plans of [19], the Trio system [78] that introduced ULDBs, and the advanced representation systems described in [4, 71].

4. CONCLUSIONS

Many applications benefit from finding diamonds in imprecise data, without having to clean the data first. The goal of probabilistic databases is to make uncertainty a first-class citizen and reduce the cost of using such data, or (more likely) enable applications that were otherwise prohibitively expensive. This paper described some of the recent advances for large-scale query processing on probabilistic databases and their roots in classical data management concepts.

Acknowledgments. We thank the anonymous reviewers for their helpful comments, and Abhay Jha for discussions on the paper.

5. REFERENCES
[1] E. Adar and C. Ré. Managing uncertainty in social networks. IEEE Data Engineering Bulletin, 30(2):15–22, 2007.
[2] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In ICDE, 2006.
[3] L. Antova, T. Jansen, C. Koch, and D. Olteanu. Fast and simple relational processing of uncertain data. In ICDE, 2008.
[4] L. Antova, C. Koch, and D. Olteanu. 10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, 2007.
[5] L. Antova, C. Koch, and D. Olteanu. MayBMS: Managing incomplete information with probabilistic world-set decompositions (demonstration). In ICDE, 2007.
[6] L. Antova, C. Koch, and D. Olteanu. World-set decompositions: Expressiveness and efficient algorithms. In ICDT, pages 194–208, 2007.
[7] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, pages 68–79, 1999.
[8] D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Trans. Knowl. Data Eng., 4(5):487–502, 1992.
[9] O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, pages 953–964, 2006.
[10] O. Benjelloun, A. Das Sarma, A. Halevy, M. Theobald, and J. Widom. Databases with uncertainty and lineage. VLDB Journal, 17(2):243–264, 2008.
[11] O. Benjelloun, A. Das Sarma, C. Hayworth, and J. Widom. An introduction to ULDBs and the Trio system. IEEE Data Eng. Bull., 29(1):5–16, 2006.
[12] D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. Efficient allocation algorithms for OLAP over imprecise data. In VLDB, pages 391–402, 2006.
[13] R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In VLDB, pages 71–81, 1987.
[14] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, pages 551–562, 2003.
[15] E. F. Codd. Relational completeness of data base sublanguages. In Database Systems, pages 65–98. Prentice-Hall, 1972.
[16] R. Cowell, P. Dawid, S. Lauritzen, and D. Spiegelhalter, editors. Probabilistic Networks and Expert Systems. Springer, 1999.
[17] N. Dalvi, G. Miklau, and D. Suciu. Asymptotic conditional probabilities for conjunctive queries. In ICDT, 2005.
[18] N. Dalvi, C. Ré, and D. Suciu. Query evaluation on probabilistic databases. IEEE Data Engineering Bulletin, 29(1):25–31, 2006.
[19] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
[20] N. Dalvi and D. Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS, pages 293–302, 2007.
[21] N. Dalvi and D. Suciu. Management of probabilistic data: Foundations and challenges. In PODS, pages 1–12, 2007. (Invited talk.)
[22] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. The VLDB Journal, 16(4):523–544, 2007.
[23] S. Dar, M. J. Franklin, B. T. Jónsson, D. Srivastava, and M. Tan. Semantic data caching and replacement. In VLDB, pages 330–341, 1996.
[24] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.
[25] A. Das Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In ICDE, 2006.
[26] M. de Rougemont. The reliability of queries. In PODS, pages 286–291, 1995.
[27] P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, and R. Ramakrishnan. DBLife: A community information management platform for the database research community. In CIDR, pages 169–172, 2007.
[28] A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, pages 588–599, 2004.
[29] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211–222, 2003.
[30] R. Fagin. Multivalued dependencies and a new normal form for relational databases. ACM TODS, 2(3):262–278, 1977.
[31] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, pages 102–113, 2001.
[32] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[33] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300–1309, 1999.
[34] N. Fuhr and T. Roelleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1):32–66, 1997.
[35] L. Getoor. An introduction to probabilistic graphical models for relational data. IEEE Data Engineering Bulletin, 29(1):32–40, March 2006.
[36] E. H. Gombrich. A Little History of the World. 1935.
[37] E. Grädel, Y. Gurevich, and C. Hirsch. The complexity of query reliability. In PODS, pages 227–234, 1998.
[38] T. Green and V. Tannen. Models for incomplete and probabilistic information. IEEE Data Engineering Bulletin, 29(1):17–24, March 2006.
[39] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
[40] R. Gupta and S. Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, pages 965–976, 2006.
[41] A. Halevy, A. Rajaraman, and J. Ordille. Data integration: The teenage years. In VLDB, pages 9–16, 2006.
[42] A. Halevy. Answering queries using views: A survey. VLDB Journal, 10(4):270–294, 2001.
[43] M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: A probabilistic threshold approach. In SIGMOD, pages 673–686, 2008.
[44] T. Imielinski and W. Lipski. Incomplete information in relational databases. Journal of the ACM, 31:761–791, October 1984.
[45] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. M. Jermaine, and P. J. Haas. MCDB: A Monte Carlo approach to managing uncertain data. In SIGMOD, pages 687–700, 2008.
[46] T. S. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In SODA, 2007.
[47] A. Jha, V. Rastogi, and D. Suciu. Evaluating queries in the presence of soft key constraints. In PODS, 2008.
[48] B. Kanagal and A. Deshpande. Online filtering, smoothing and probabilistic modeling of streaming data. In ICDE, pages 1160–1169, 2008.
[49] R. Karp and M. Luby. Monte-Carlo algorithms for enumeration and reliability problems. In STOC, 1983.
[50] C. Koch and D. Olteanu. Conditioning probabilistic databases. In VLDB, 2008.
[51] L. Lakshmanan, N. Leone, R. Ross, and V. S. Subrahmanian. ProbView: A flexible probabilistic database system. ACM Trans. Database Syst., 22(3), 1997.
[52] A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In VLDB, 1996.
[53] C. Li, K. Chang, and I. Ilyas. Supporting ad-hoc ranking aggregates. In SIGMOD, pages 61–72, 2006.
[54] C. Li, K. Chang, I. Ilyas, and S. Song. RankSQL: Query algebra and optimization for relational top-k queries. In SIGMOD, pages 131–142, 2005.
[55] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-Diversity: Privacy beyond k-anonymity. In ICDE, 2006.
[56] G. Miklau and D. Suciu. A formal analysis of information disclosure in data exchange. In SIGMOD, 2004.
[57] A. Nierman and H. V. Jagadish. ProTDB: Probabilistic data in XML. In VLDB, pages 646–657, 2002.
[58] D. Olteanu and J. Huang. Using OBDDs for efficient query evaluation on probabilistic databases. In SUM, pages 326–340, 2008.
[59] D. Olteanu, J. Huang, and C. Koch. SPROUT: Lazy vs. eager query plans for tuple-independent probabilistic databases. In ICDE, 2009.
[60] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[61] Purple Sox information extraction system. http://research.yahoo.com/node/498.
[62] V. Rastogi, D. Suciu, and S. Hong. The boundary between privacy and utility in data publishing. In VLDB, 2007.
[63] C. Ré, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.
[64] C. Ré and D. Suciu. Efficient evaluation of HAVING queries on a probabilistic database. In DBPL, 2007.
[65] C. Ré and D. Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In VLDB, 2007.
[66] C. Ré, J. Letchner, M. Balazinska, and D. Suciu. Event queries on correlated probabilistic streams. In SIGMOD, 2008.
[67] C. Ré and D. Suciu. Approximate lineage for probabilistic databases. In VLDB, 2008.
[68] D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1-2):273–302, 1996.
[69] A. Das Sarma, M. Theobald, and J. Widom. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In ICDE, pages 1023–1032, 2008.
[70] P. Sen, A. Deshpande, and L. Getoor. Exploiting shared correlations in probabilistic databases. In VLDB, 2008.
[71] P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.
[72] A. Shoshani. Statistical databases: Characteristics, problems, and some solutions. In VLDB, pages 208–222, 1982.
[73] M. A. Soliman, I. F. Ilyas, and K. C.-C. Chang. Probabilistic top-k and ranking-aggregate queries. ACM Trans. Database Syst., 33(3), 2008.
[74] O. Tsatalos, M. Solomon, and Y. Ioannidis. The GMAP: A versatile tool for physical data independence. In VLDB, 1994.
[75] M. Y. Vardi. The complexity of relational query languages. In STOC, pages 137–146, 1982.
[76] T. Verma and J. Pearl. Causal networks: Semantics and expressiveness. Uncertainty in Artificial Intelligence, 4:69–76, 1990.
[77] I. Wegener. BDDs—design, analysis, complexity, and applications. Discrete Applied Mathematics, 138(1-2):229–251, 2004.
[78] J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, pages 262–276, 2005.
[79] E. Wong. A statistical approach to incomplete information in database systems. ACM Trans. Database Syst., 7(3):470–488, 1982.
[80] Y. Zabiyaka and A. Darwiche. Functional treewidth: Bounding complexity in the presence of functional dependencies. In SAT, pages 116–129, 2006.
[81] N. Zhang and D. Poole. Exploiting causal independence in Bayesian network inference. J. Artif. Intell. Res., 5:301–328, 1996.
[82] X. Zhang and J. Chomicki. On the semantics and evaluation of top-k queries in probabilistic databases. In International Workshop on Database Ranking (DBRank), pages 556–563, 2008.
[83] E. Zimanyi. Query evaluation in probabilistic databases. Theoretical Computer Science, 171(1-2):179–219, 1997.