Probabilistic with Correlated Tuples

Prithviraj Sen Amol Deshpande Lise Getoor

Department of Computer Science University of Maryland, College Park.

Uncertain/Probabilistic Databases, 2006

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 1/18 Introduction

Abundance of uncertain data. Numerous approaches proposed to handle uncertainty [BP82, IWL84, FR97, LLRS97, CKP03, DS04, Wid05, CBL06]. However, most models make assumptions about data uncertainty that restricts applicability.

Need for a model with a model of uncertainty that can capture correlations simple and intuitive semantics that is readily understood and defines precise answers to every query

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 2/18 Motivations for Correlated Data

Applications: “Dirty” databases [CP87, DS96, AFM06]: Arises while trying to integrate data from various sources. Sensor Networks [DGM+04]: Often shows strong spatial correlations, e.g., nearby sensors report similiar values. More applications: Pervasive computing, approximate string matching in DB systems [DS04] etc.

Additional motivation: Correlated tuples arise while evaluating queries [DS04].

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 3/18 Representing Correlated Tuples: Ingredients

Which theory to use? Probability theory , Dempster-Schafer theory, Fuzzy Logic, Logic ....

At what level do we represent uncertainty? Tuple level , Attribute level.

Approach to representing correlations? Probabilistic Graphical Models (include as special cases: Bayesian networks and Markov networks)

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 4/18 Review: Independent Tuple-based Probabilistic Databases Possible World Semantics [DS04]

Example borrowed from [DS04]:

possible worlds S instance probability A B prob {s1, s2, t1} 0.12 s1 m 1 0.6 {s1, s2} 0.18 s2 n 1 0.5 {s1, t1} 0.12 {s1} 0.18 T {s2, t1} 0.08 C D prob {s2} 0.12 t1 1 p 0.4 {t1} 0.08 ∅ 0.12

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 5/18 Representing Correlations: Basic Ideas

Two simple concepts borrowed from probabilistic graphical models literature [Pea88]: Tuple-based Random Variables

Associate every tuple t with a boolean valued random variable Xt .

Factors f (X) is a function of a (small) set of random variables X. 0 ≤ f (X) ≤ 1

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 6/18 Representing Correlations: Basic Ideas - Contd.

Associate with each tuple in the probabilistic database a random variable.

Define factors on (sub)sets of tuple-based random variables to encode correlations.

The probability of an instantiation of the database is given by the product of all the factors.

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 7/18 Representing Correlations: Mutual Exclusivity example

Suppose we want to represent mutual exclusivity between tuples s1 and t1. In particular, let us try to represent the possible worlds:

possible worlds X X f S instance probability t1 s1 1 0 0 0 A B prob {s , s , t } 0 1 2 1 0 1 0.6 s m 1 0.6 {s , s } 0.3 1 1 2 1 0 0.4 s n 1 0.5 {s , t } 0 2 1 1 1 1 0 {s1} 0.3 T {s , t } 0.2 2 1 X f C D prob {s } 0 s2 2 2 0 0.5 t 1 p 0.4 {t } 0.2 1 1 1 0.5 ∅ 0

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 8/18 Representing Correlations: Positive Correlation example

Suppose we want to represent positive correlation between t1 and s1. In particular, let us try to represent the possible worlds:

possible worlds X X f S instance probability t1 s1 1 0 0 0.4 A B {s , s , t } 0.2 1 2 1 0 1 0.2 s m 1 {s , s } 0.1 1 1 2 1 0 0 s n 1 {s , t } 0.2 2 1 1 1 1 0.4 {s1} 0.1 T {s , t } 0 2 1 X f C D {s } 0.2 s2 2 2 0 0.5 t 1 p {t } 0 1 1 1 0.5 ∅ 0.2

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 9/18 Probabilistic Graphical Model representation

Definition A probabilistic graphical model is graph whose nodes represent random variables and edges represent correlations [Pea88].

Xt1 Xs1 Xt1 Xs1 Xt1 Xs1

Xs2 Xs2 Xs2 Complete Ind. Mutual Exclusivity Positive Correlation Example Example Example

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 10/18 Query Evaluation: Basic Ideas

Treat intermediate tuples as regular tuples.

Carefully represent correlations between intermediate tuples, base tuples and result tuples to construct a probabilistic graphical model.

Cast the probability computations resulting from query evaluation to inference in probabilistic graphical models.

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 11/18 Query Evaluation Example

Q Compute D(S ./B=C T )

AND AND fi ,s ,t , fi ,s ,t S: 1 1 1 2 2 1 A B A B C D s m 1 1 i1 m 1 1 p s n 1 2 i2 n 1 1 p

S./ T fs1 , fs2 B=C Q (S./ T ) −→ D B=C T : ↓ C D D t1 1 p r1 p

ft 1 f OR r1,i1,i2

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 12/18 Query Evaluation Example: Prob. Graphical Model

Query evaluation problem in Prob. Databases: Compute the probability of the result tuple summed over all possible worlds of the database [DS04].

Xs1 Xt1 Xs2

./ ./

Xi1 Xi2

Q

Xr1

Equivalent problem in prob. graph. models: marginal probability computation. Thus we can use inference algorithms (e.g., VE [ZP94]).

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 13/18 Comparison with other probability computation approaches

Q Compute {}(L ./ J ./ R):

L A B Xj1 Xl1 Xj2 l1 m 1

./ ./ J X Xi X Xr B C r1 1 i2 2 j1 1 p ./ ./ j2 1 q Xi4 Xi3 R Q C D Xres r1 p 1 r2 q 1

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 14/18 Comparison with Extensional Semantics [FR97, DS04]

Xj1 Xl1 Xj2 Is not guaranteed to ./ ./ match possible world

Xr1 Xi1 Xi2 Xr2 semantics.

./ ./ Safe plans [DS04] X Xi4 i3 return tree-structured Q graphical models.

Xres

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 15/18 Comparison with Intensional Semantics [FR97, DS04]

Xj Xl1 Xj2 1 Work with a subgraph of ./ ./ the graph. X Xi X Xr r1 1 i2 2 model.

./ ./ Inference X Xi4 i3 algorithms can Q exploit graph structure Xres better.

(Xr1 ∧ Xj1 ∧ Xl1 ) ∨ (Xl1 ∧ Xj2 ∧ Xr2 )

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 16/18 Conclusion & Future Work

Introduced Probabilistic Databases with correlated tuples.

Borrowed ideas from Probabilistic Graphical Models to represent such correlations.

Cast the query evaluation problem as an inference problem. Future Work: Ways to restructure graphical model to speed up inference. Share/reuse computation to speed up inference. Explore the use of approximate inference methods [GRS96, JGJS99].

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 17/18 Thank You.

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 18/18 References

Periklis Andritsos, Ariel Fuxman, and Renee J. Miller. Clean answers over dirty databases. In International Conference on Data Engineering, 2006. Bill P. Buckles and Frederick E. Petry. A fuzzy model for relational databases. International Journal of Fuzzy Sets and Systems, 1982. Sunil Choenni, Henk Ernst Blok, and Erik Leertouwer. Handling uncertainty and ignorance in databases: A rule to combine dependent data. In Database Systems for Advanced Applications, 2006. Reynold Cheng, Dmitri Kalashnikov, and Sunil Prabhakar. Evaluating probabilistic queries over imprecise data. In International Conference on Management of Data., 2003. Roger Cavallo and Michael Pittarelli. The theory of probabilistic databases.

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 18/18 In International Conference on Very Large Data Bases, 1987. Amol Deshpande, Carlos Guestrin, Sam Madden, Joseph M. Hellerstein, and Wei Hong. Model-driven data acquisition in sensor networks. In International Conference on Very Large Data Bases, 2004. Debabrata Dey and Sumit Sarkar. A probabilistic and algebra. ACM Transactions on Database Systems., 1996. Nilesh Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. In International Conference on Very Large Data Bases, 2004. Norbert Fuhr and Thomas Rolleke. A probabilistic for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 1997. Walter R. Gilks, Sylvia Richardson, and David J. Spiegelhalter.

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 18/18 Markov Chain Monte Carlo in Practice. Chapman & Hall, 1996. Tomasz Imielinski and Jr. Witold Lipski. Incomplete information in relational databases. Journal of the ACM., 1984. Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 1999. Laks V. S. Lakshmanan, Nicola Leone, Robert Ross, and V. S. Subrahmanian. Probview: a flexible probabilistic database system. ACM Transactions on Database Systems., 1997. Judaea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. Jennifer Widom. Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 18/18 Trio: A system for integrated management of data, accuracy, and lineage. In Proceedings of the Biennial Conference on Innovative Data Systems Research, 2005. Nevin Lianwen Zhang and David Poole. A simple approach to bayesian network computations. In Canadian Conference on Artificial Intelligence, 1994.

Probabilistic Databases with Correlated Tuples –prithvi, amol, lise 18/18