IFSA-EUSFLAT-2015 B.Wanders
16th World Congress of the International Fuzzy Systems Association (IFSA)
9th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT)

Revisiting the formal foundation of Probabilistic Databases

Brend Wanders¹, Maurice van Keulen¹
¹ Faculty of EEMCS, University of Twente, Enschede, The Netherlands. {b.wanders,m.vankeulen}@utwente.nl

© 2015. The authors - Published by Atlantis Press

Abstract

One of the core problems in soft computing is dealing with uncertainty in data. In this paper, we revisit the formal foundation of a class of probabilistic databases with the purpose to (1) obtain data model independence, (2) separate metadata on uncertainty and probabilities from the raw data, (3) better understand aggregation, and (4) create more opportunities for optimization. The paper presents the formal framework and validates data model independence by showing how to obtain probabilistic Datalog as well as a probabilistic relational algebra by applying the framework to their non-probabilistic counterparts. We conclude with a discussion on the latter three goals.

Keywords: probabilistic databases, probabilistic Datalog, probabilistic relational algebra, formal foundation

1. Introduction

One of the core problems in soft computing is dealing with uncertainty in data. For example, many data activities such as data cleaning, coupling, fusion, mapping, transformation, information extraction, etc. are about dealing with the problem of semantic uncertainty [1, 2]. In the last decade, there has been much attention in the database community to scalable manipulation of uncertain data. Probabilistic database research produced numerous uncertainty models and research prototypes, mostly relational; see [3, Chp.3] for an extensive survey.

In our research we actively apply this technology for soft computing data processing tasks such as indeterministic deduplication [4], probabilistic XML data integration [5], and probabilistic integration of data about groupings [6]. Based on these experiences, we find that there are still important open problems in dealing with uncertain data and that the available systems are inadequate on certain aspects. We address the following four aspects.

Data model dependence. Depending on the requirements and domain, we use different data models such as relational, XML, and RDF. The available models for uncertain data are tightly connected to a particular data model, resulting in non-uniform handling of uncertain data as well as replication of functionality in the various prototype systems.

Insufficient understanding of core concepts. Uncertainty in data has been the subject of research in several research communities for decades. Nevertheless, we believe our understanding of certain concepts is not deep enough. For example, what is the truth of facts that are uncertain? Or, what are possible worlds really? Also, many models support possible alternatives in some way, often associated with a probability. Are these probabilities truly add-ons or are they tightly connected to the alternatives?

Aggregates. In many data processing tasks, being able to aggregate data in multiple ways is essential. Computing aggregates over uncertain data is, however, inherently exponential. There is much work on approximating aggregates, often with error bounds, but this does not seem to suffice in all cases. Furthermore, systems offer operations on uncertain data as aggregates, such as EXP (expected value) in Trio [7]. These seem different from traditional aggregates such as SUM; or is there a more generic concept of aggregation that encompasses all?

Optimization opportunities. There has been some work on optimization for probabilistic databases, for example, in the context of MayBMS/SPROUT [8, 9], but as we experienced in [6], where we apply MayBMS to a bio-informatics homology use case, the research prototypes do not scale well enough to thousands of random variables. By generalizing certain concepts in our formal foundation, we hope to create a better understanding of optimization opportunities.

Contributions. We address the above with a new formalization of a probabilistic database and associated notions as a result of revisiting its foundations. The formalization has the following properties:

• Data model independent
• Meta-data about uncertain data loosely coupled to raw data
• Loosely coupled probabilities
• Unified view on aggregates and probabilistic database-specific functions

We demonstrate the usefulness of the formalization for creating more insight by discussing questions like “What are possible worlds?”, “What is truth in an uncertain context?”, “What are aggregates?”, and “What optimization opportunities come to light?”. Furthermore, we validate the data model independence by illustrating how a probabilistic Datalog as well as a probabilistic relational database can be defined using our framework.

Running example. We use natural language processing as a running example, the sub-task of Named Entity Extraction and Disambiguation (NEED) in particular. NEED attempts to detect named entities, i.e., phrases that refer to real-world objects. Natural language is ambiguous, hence the NEED process is inherently uncertain. The example sentence of Figure 1 illustrates this: “Paris Hilton” may refer to a person (the American socialite, television personality, model, actress, and singer) or to the hotel in France. In the latter case, the sub-phrase “Paris” refers to the capital of France although there are many more places and other entities with the name “Paris” (e.g., see http://en.wikipedia.org/wiki/Paris_(disambiguation) or a gazetteer like GeoNames¹).

“Paris Hilton stayed in the Paris Hilton”

      phrase         pos   refers to
  1   Paris Hilton   1,2   the person
  2   Paris Hilton   1,2   the hotel
  3   Paris          1     the capital of France
  4   Paris          1     Paris, Ontario, Canada
  5   Hilton         2     the hotel chain
  6   Paris Hilton   6,7   the person
  7   Paris Hilton   6,7   the hotel
  8   Paris          6     the capital of France
  9   Paris          6     Paris, Ontario, Canada
 10   Hilton         7     the hotel chain
  ...

Figure 1: Example natural language sentence with a few candidate annotations [10].

A human immediately understands all this, but to a computer this is quite elusive. One typically distinguishes different kinds of ambiguity such as [11]: (a) semantic ambiguity (to what class does an entity phrase belong, e.g., does “Paris” refer to a name or a location?), (b) structural ambiguity (does a word belong to the entity or not, e.g., “Lake Garda” vs. “Garda”?), and (c) reference ambiguity (to which real world entity does a phrase refer, e.g., does “Paris” refer to the capital of France or one of the other 158 Paris instances found in GeoNames?).

We represent detected entities and the uncertainty surrounding them as annotation candidates. Figure 1 contains a table with a few candidates for the example sentence. NEED typically is a multi-stage process where voluminous intermediary results need to be stored and manipulated. Furthermore, the dependencies between the candidates should be carefully maintained. For example, “Paris Hilton” can be a person or hotel, but not both (mutual exclusion), and “Paris” can only refer to a place if “Paris Hilton” is interpreted as hotel. We believe that a probabilistic database is

2. Formal framework

The basis of this formalism is the possible world. We use the term possible world in the following sense: as long as the winning number has not been drawn yet in a lottery, you do not know the winner, but you can envision a possible world for each outcome. Analogously, one can envision multiple possible database states depending on whether certain facts are true or not. For example, in Figure 1, a possible world (the true one) could contain annotations 1, 7, 8, and 10, but to a computer a world with annotations 2, 4, 5, and 6 could very well be possible too. Note that this differs from the use of the term ‘possible world’ in logics where it means possible interpretations [12, Chp.6] or as in modal logics [13].

The core of this formalization is the idea that we need to be able to identify the different possible worlds so we can reason about them. We do this by crafting a way to incrementally and constructively describe the name of a possible world.

2.1. Representation

Our formalization begins with the notion of a database as a possible world. A database DB ∈ ℙA consists of assertions {a_0, a_1, ..., a_n} with a_i taken from A, the universe of assertions. For the purpose of data model independence, we abstract from what an assertion is: it may be a tuple in a relational database, a node in an XML database, and so on. Since databases represent possible worlds, we use the symbols DB and w interchangeably. A probabilistic database PDB is a set of databases {DB_0, DB_1, ..., DB_n}, i.e., PDB ∈ ℙℙA. Each different database represents a possible world in the probabilistic database. In other words, if an uncertainty is not distinguishable in the database state, i.e., if two databases are the same, then we regard this as one possible world. When we talk about possible worlds, we intend this to mean ‘all possible worlds contained in the probabilistic database’, denoted with W_PDB.

Implicit possible worlds. Viewing it the other way around, an assertion holds in a subset of all possible worlds. To describe this relationship, we need an identification mechanism to refer to a subset of the possible worlds. For this purpose, we introduce the method of partitioning. A partitioning ω^n splits a database into n disjoint parts, each denoted with a label l of the form ω=v with v ∈ 1..n. If a world w is labelled with label l, we say that ‘l holds for w.’ Every introduced partitioning ω^n is a member of Ω, the set of introduced partitionings. W_l denotes the set of possible worlds in PDB labelled with l. L(ω^n) = {ω=v | v ∈ 1..n} is the set of labels for partitioning ω^n.
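The notions of database, probabilistic database, partitioning, and label introduced in Section 2.1 can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not the paper's implementation: assertions are modelled as the annotation ids of Figure 1, the two partitionings x and y are hypothetical names, and the four worlds are a simplified toy selection that ignores the dependencies between candidates.

```python
from typing import Dict, FrozenSet, Set, Tuple

Assertion = int                # assumption: an assertion is an annotation id
World = FrozenSet[Assertion]   # a database DB, i.e. a possible world w
Label = Tuple[str, int]        # ("x", 1) encodes the label x=1


def labels(omega: str, n: int) -> Set[Label]:
    """L(omega^n): the labels omega=v for v in 1..n."""
    return {(omega, v) for v in range(1, n + 1)}


# Hypothetical labelling: for every possible world, the labels that hold for it.
# x^2 chooses between annotations 1 and 2; y^2 chooses between 8 and 9.
labelling: Dict[World, Set[Label]] = {
    frozenset({1, 8}): {("x", 1), ("y", 1)},
    frozenset({1, 9}): {("x", 1), ("y", 2)},
    frozenset({2, 8}): {("x", 2), ("y", 1)},
    frozenset({2, 9}): {("x", 2), ("y", 2)},
}

# PDB is a set of databases, so an identical database state added a second
# time collapses into the same possible world.
pdb: Set[World] = set(labelling) | {frozenset({8, 1})}


def worlds_with(label: Label) -> Set[World]:
    """W_l: the possible worlds in PDB for which label l holds."""
    return {w for w in pdb if label in labelling[w]}


def is_partitioning(omega: str, n: int) -> bool:
    """Check that the n parts of omega^n cover PDB and are pairwise disjoint."""
    parts = [worlds_with(l) for l in labels(omega, n)]
    covers = set().union(*parts) == pdb
    # Given coverage, the part sizes add up to |PDB| exactly when
    # no world belongs to two parts.
    disjoint = sum(len(p) for p in parts) == len(pdb)
    return covers and disjoint


if __name__ == "__main__":
    print(len(pdb))
    print(sorted(sorted(w) for w in worlds_with(("x", 1))))
    print(is_partitioning("x", 2), is_partitioning("y", 2))
```

Representing a world as a frozenset keeps the sketch data model independent in the same spirit as the text: any hashable value (a tuple, an XML node identifier) could stand in for an assertion without changing the functions.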