Brief Tutorial on Probabilistic Databases

Brief Tutorial on Probabilistic Databases Dan Suciu University of Washington Simons 2016 1 About This Talk • Probabilistic databases – Tuple-independent – Query evaluation • Statistical relational models – Representation, learning, inference in FO – Reasoning/learning = lifted inference • Sources: – Book 2011 [S.,Olteanu,Re,Koch] – Upcoming F&T survey [van Den Broek,S] Simons 2016 2 Background: Relational databases Smoker Friend X Y X Z Database D = Alice 2009 Alice Bob Alice 2010 Alice Carol Bob 2009 Bob Carol Carol 2010 Carol Bob Query: Q(z) = ∃x (Smoker(x,’2009’)∧Friend(x,z)) Z Constraint: Q(D) = Bob Q = ∀x (Smoker(x,’2010’)⇒Friend(x,’Bob’)) Carol Q(D) = true Probabilistic Database Probabilistic database D: X y P a1 b1 p1 a1 b2 p2 a2 b2 p3 Possible worlds semantics: Σworld PD(world) = 1 X y P (Q) = Σ P (world) X Xy y D world |= Q D a1 b1 X y a1 a1b2 b1 X y a1 b2 a1 b1 X y p1p2p3 a2 a2b2 b2 a2 b2 X y a2 b2 a1 b2 a1 b1 X y (1-p1)p2p3 a1 b2 (1-p1)(1-p2)(1-p3) Outline • Model Counting • Small Dichotomy Theorem • Dichotomy Theorem • Query Compilation • Conclusions, Open Problems Simons 2016 5 Model Counting • Given propositional Boolean formula F, compute the number of models #F Example: X1 X2 X3 F F = (X1∨X2) ∧ (X2∨X3) ∧ (X3∨X1) 0 0 0 0 0 0 1 0 #F = 4 0 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 1 0 1 [Valiant’79] #P-hard, even for 2CNF 1 1 1 1 Probability of a Formula • Each variable X has a probability p(X); • P(F) = probability that F=true, when each X is set to true independently Example: X1 X2 X3 F F = (X1∨X2) ∧ (X2∨X3) ∧ (X3∨X1) 0 0 0 0 0 0 1 0 P(F) = (1-p1)*p2*p3 + 0 1 0 0 p1*(1-p2)*p3 + p1*p2*(1-p3) + 0 1 1 1 p1*p2*p3 1 0 0 0 1 0 1 1 n If p(X) = ½ for all X, then P(F) = #F / 2 1 1 0 1 1 1 1 1 Algorithms for Model Counting [Gomes, Sabharwal, Selman’2009] Based on full search DPLL: • Shannon expansion. #F = #F[X=0] + #F[X=1] • Caching. Store #F, look it up later • Components. If Vars(F1) ∩ Vars(F2) = ∅: #(F1 ∧ F2) = #F1 * #F2 Simons 2016 8 Relational Representation (1/2) • Fix an FO sentence Q and a domain Δ • Ground atom à Boolean variable Definition The lineage FQ,Δ is: FQ,Δ = Q if Q = ground atom FQ1 ∧ Q2,Δ = FQ1,Δ ∧ FQ2,Δ same for ∨, à, ¬ F∀x.Q,Δ = ∧a∈Δ FQ[a/x],Δ F∃x.Q,Δ = ∨a∈Δ FQ[a/x],Δ Q = ∀x (Student(x) ⇒ Person(x)) FQ,[n] = (Student(1)⇒Person(1))∧…∧(Student(n)⇒Person(n)) Relational Representation (2/2) • For a database D, denote FQ,D = FQ,domain(D) where all tuples not in D are set to false • FQ,Δ or FQ,D is called the lineage or the provenance or the grounding of Q Weighted FO Model Counting • Probabilities of ground atoms in D = probabilities of Boolean variables p(X) X y P • Fix Q. Given D, compute P(FQ,D) a1 b1 p1 a1 b2 p2 a2 b2 p3 • Simple fact: PD(Q) = P(FQ,D) Simons 2016 11 This Talk Fix a query Q: • What is the complexity of PD(Q) in the size of D? • What is the best runtime of a DPLL-based algorithm on FQ,D in the size of D? Simons 2016 12 Discussion: Correlations [Domingos&Richardson’06] MLN = popular FO framework for Machine Learning tasks Lise Getoor’s talk today Smoker(x)∧Friends(x,y) àSmoker(y), weight = 2.3 Theorem [Jha,S’11] One can construct effectively D s.t. PMLN, Δ(Q) = PD(Q | Γ) = PD(Q∧Γ) / PD(Γ) Outline • Model Counting • Small Dichotomy Theorem • Dichotomy Theorem • Query Compilation • Conclusions, Open Problems Simons 2016 14 Background: Query Plans Q(z) = R(z,X), S(x,y),T(y,u), u=123 Query plan = expressions over the input relation Πz Operators = selection, projection, join, union, difference ⋈X σu=123 ⋈y R(z,X) S(x,y) T(y,u) Boolean query An Example Q() = R(x), S(x,y) = ∃X∃y(R(x) ∧S(x,y)) PD (Q) = 1- {1- p1*[ 1 - (1 - q1 )*(1 - q2 ) ]} * {1- p2*[ 1 - (1 - q3 )*(1 - q4 )*(1 - q5 ) ] } One can compute PD(Q) in PTIME S X y P in the size of the database D a1 b1 q1 R X P a1 b2 q2 a1 p1 a2 b3 q3 a2 p2 a2 b4 q4 a3 p3 a2 b5 q5 Extensional Plans • Modify each operator to compute output probabilities, assuming independent events Simons 2016 17 P(Q) = 1 – [1-p1*(1-(1-q1)*(1-q2))] Q() = R(x), S(x,y) *[1- p2*(1-(1-q3)*(1-q4)*(1-q5))] 1-{1-p1[1-(1-q1)(1-q2)]}* 1-(1-p1q1)(1-p1q2)(1-p2q3)(1-p2q4)(1-p2q5) {1-p2[1-(1-q4)(1-q5) (1-q6)]} ΠΦ ΠΦ p1 q1 Wrong Right p1 q2 p2 q3 p2 q4 1-(1-q1)(1-q2) p2 q5 ⋈ 1-(1-q4)(1-q5) (1-q6) ⋈ X y P a1 b1 q1 X P a1 b2 q2 ΠX a p 1 1 R(x) S(x,y) a2 b3 q3 R(x) a p 2 2 a2 b4 q4 S(x,y) 18 a3 p3 a2 b5 q5 Safe Queries Definition A plan for Q is safe if it computes the probabilities correctly. Q is safe if it has a safe plan. • In AI, computing Q using a safe plan is called lifted inference • Safe query = Liftable query • If Q is safe then PD (Q) is in PTIME Unsafe Queries R S T X P X Y Y P x1 y1 x1 p1 y1 q1 P x1 y2 Wrong x2 p2 y2 q2 x2 y2 ⋈ P T H0() = R(x), S(x,y), T(y) ⋈ R S Theorem. [Dalvi&S.2004] PD(H0) is #P-hard However: 1. This plan computes an upper bound [VLDB’15] 2. Use samples on T [VLDB’16] Wolfgang Gatterbauer’s talk today Hierarchical Queries Fix Q; at(X) = set of atoms (=literals) containing the variable X Definition Q is hierarchical if forall variables x, y: at(x) ⊆at(y) or at(x) ⊇ at(y) or at(x) ∩ at(y) = ∅ Hierarchical Non-hierarchical Q() = R(x,y), S(x,z) H0() = R(x), S(x,y), T(y) X y X S T y R S z R 21 The Small Dichotomy Theorem Non-repeating Conjunctive Query = = Conjunctive Query “without self-joins” = “Simple” conjunctive query [Dalvi&S.04] Theorem Let Q be a non-repeating CQ • If Q is hierarchical, then PD(Q) is in PTIME. • If Q is not hierarchical then PD(Q) is #P-hard. By duality, the same holds for a non-repeating clause Summary so Far Non-repeating CQ Complexity of P (Q) D Non-repeating clause PTIME Hierarchical #P - hard Non-hierarchical Simons 2016 23 Outline • Model Counting • Small Dichotomy Theorem • Dichotomy Theorem • Query Compilation • Conclusions, Open Problems Simons 2016 24 The Rules for Lifted Inference Preprocess Q (omitted from this talk; see book), then apply these rules (some have preconditions) P(¬Q) = 1 – P(Q) negation P(Q1 ∧ Q2) = P(Q1)P(Q2) Independent P(Q1 ∨ Q2) =1 – (1 – P(Q1))(1 – P(Q2)) join / union P(∃z Q) = 1 – Π (1– P(Q[a/z]) a ∈Domain Independent project P(∀z Q) = Πa ∈Domain P(Q[a/z] P(Q1 ∧ Q2) = P(Q1) + P(Q2) - P(Q1 ∨ Q2) Inclusion/ P(Q1 ∨ Q2) = P(Q1) + P(Q2) - P(Q1 ∧ Q2) exclusion FOun = Unate FO An FO sentence is unate if: • Negations occur only on atoms • Every relational symbol R either occurs only positively, or only negatively FOun = FO restricted to unate sentences Simons 2016 26 Dichotomy Theorem [Dalvi&S’12] Theorem For any Q in ∀*FOun (or ∃*FOun) • If rules succeed, then PD(Q) in PTIME in |D| • If rules fail, then PD(Q) is #P hard in |D| Note: Unions of Conjunctive queries (UCQ) is essentially∃*FOun Example: Liftable Query QJ() = S(x1,y1), R(y1), S(X2,y2), T(y2) = [S(X1,y1),R(y1)] ∧ [S(X2,y2),T(y2)] Q1 Q2 P(QJ) = P(Q1) + P(Q2) - P(Q1 ∨ Q2) PTIMEPTIME (have(have seenseen before)already) y = y1 = y2 Q1 ∨ Q2 = ∃y [S(X1,y),R(y) ∨ S(X2,y)),T(y)] P(Q1 ∨ Q2) = = 1 – Πb ∈Domain (1 – P[S(X1,b),R(b) ∨ S(X2,b)),T(b)]) = 1 – Πb ∈Domain (1 – P[S(X1,b)] * P[R(b)∨T(b)]) = … etc 2 Runtime = O(n ). Simons 2016 28 Example: Liftable Query QJ = ∀X1∀y1∀X2∀y2 (S(X1,y1) ∨ R(y1) ∨ S(x2,y2) ∨T(y2)) = [∀X1∀y1S(X1,y1)∨R(y1)] ∨ [∀X2∀y2S(X2,y2)∨T(y2)] Q 1 Q2 P(QJ) = P(Q1) + P(Q2) - P(Q1 ∧ Q2) PTIMEPTIME (have(have seenseen before)already) y = y1 = y2 Q1 ∧ Q2 = ∀y [(∀X1S(X1,y)∨R(y)) ∧ (∀X2S(X2,y))∨T(y)] = ∀y [∀X S(x,y)∨(R(y)∧T(y))] P(Q1 ∧ Q2) = Πb ∈Domain P[∀x.S(x,b) ∨(R(b) ∧ T(b))] = …etc 2 Runtime = O(n ). Simons 2016 29 Unliftable Queries Hk Will drop ∀ to reduce clutter H0= R(x)∨S(x,y)∨T(y) H1= [R(x0)∨S(X0,y0)] ∧ [S(x1,y1)∨T(y1)] H2= [R(x0)∨S1(X0,y0)] ∧ [S1(x1,y1)∨S2(x1,y1)] ∨ [S2(x2,y2)∨T(y2)] H3= [R(x0)∨S1(X0,y0)]∧[S1(x1,y1)∨S2(x1,y1)]∧[S2(X2,y2)∨S3(x2,y2)]∧[S3(x3,y3)∨T(y3)] Every Hk, k≥1 .

Brief Tutorial on Probabilistic Databases

Probabilistic Databases

Adaptive Schema Databases ∗

Finding Interesting Itemsets Using a Probabilistic Model for Binary Databases

Open-World Probabilistic Databases

A Compositional Query Algebra for Second-Order Logic and Uncertain Databases

Answering Queries from Statistics and Probabilistic Views∗

Uva-DARE (Digital Academic Repository)

BAYESSTORE: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models

Efficient In-Database Analytics with Graphical Models

Indexing Correlated Probabilistic Databases

Analyzing Uncertain Tabular Data

Entity Linkage for Heterogeneous, Uncertain, and Volatile Data Dr