Brief Tutorial on Probabilistic Databases

Brief Tutorial on Probabilistic Databases Dan Suciu University of Washington Simons 2016 1 About This Talk • Probabilistic databases – Tuple-independent – Query evaluation • Statistical relational models – Representation, learning, inference in FO – Reasoning/learning = lifted inference • Sources: – Book 2011 [S.,Olteanu,Re,Koch] – Upcoming F&T survey [van Den Broek,S] Simons 2016 2 Background: Relational databases Smoker Friend X Y X Z Database D = Alice 2009 Alice Bob Alice 2010 Alice Carol Bob 2009 Bob Carol Carol 2010 Carol Bob Query: Q(z) = ∃x (Smoker(x,’2009’)∧Friend(x,z)) Z Constraint: Q(D) = Bob Q = ∀x (Smoker(x,’2010’)⇒Friend(x,’Bob’)) Carol Q(D) = true Probabilistic Database Probabilistic database D: X y P a1 b1 p1 a1 b2 p2 a2 b2 p3 Possible worlds semantics: Σworld PD(world) = 1 X y P (Q) = Σ P (world) X Xy y D world |= Q D a1 b1 X y a1 a1b2 b1 X y a1 b2 a1 b1 X y p1p2p3 a2 a2b2 b2 a2 b2 X y a2 b2 a1 b2 a1 b1 X y (1-p1)p2p3 a1 b2 (1-p1)(1-p2)(1-p3) Outline • Model Counting • Small Dichotomy Theorem • Dichotomy Theorem • Query Compilation • Conclusions, Open Problems Simons 2016 5 Model Counting • Given propositional Boolean formula F, compute the number of models #F Example: X1 X2 X3 F F = (X1∨X2) ∧ (X2∨X3) ∧ (X3∨X1) 0 0 0 0 0 0 1 0 #F = 4 0 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 1 0 1 [Valiant’79] #P-hard, even for 2CNF 1 1 1 1 Probability of a Formula • Each variable X has a probability p(X); • P(F) = probability that F=true, when each X is set to true independently Example: X1 X2 X3 F F = (X1∨X2) ∧ (X2∨X3) ∧ (X3∨X1) 0 0 0 0 0 0 1 0 P(F) = (1-p1)*p2*p3 + 0 1 0 0 p1*(1-p2)*p3 + p1*p2*(1-p3) + 0 1 1 1 p1*p2*p3 1 0 0 0 1 0 1 1 n If p(X) = ½ for all X, then P(F) = #F / 2 1 1 0 1 1 1 1 1 Algorithms for Model Counting [Gomes, Sabharwal, Selman’2009] Based on full search DPLL: • Shannon expansion. #F = #F[X=0] + #F[X=1] • Caching. Store #F, look it up later • Components. If Vars(F1) ∩ Vars(F2) = ∅: #(F1 ∧ F2) = #F1 * #F2 Simons 2016 8 Relational Representation (1/2) • Fix an FO sentence Q and a domain Δ • Ground atom à Boolean variable Definition The lineage FQ,Δ is: FQ,Δ = Q if Q = ground atom FQ1 ∧ Q2,Δ = FQ1,Δ ∧ FQ2,Δ same for ∨, à, ¬ F∀x.Q,Δ = ∧a∈Δ FQ[a/x],Δ F∃x.Q,Δ = ∨a∈Δ FQ[a/x],Δ Q = ∀x (Student(x) ⇒ Person(x)) FQ,[n] = (Student(1)⇒Person(1))∧…∧(Student(n)⇒Person(n)) Relational Representation (2/2) • For a database D, denote FQ,D = FQ,domain(D) where all tuples not in D are set to false • FQ,Δ or FQ,D is called the lineage or the provenance or the grounding of Q Weighted FO Model Counting • Probabilities of ground atoms in D = probabilities of Boolean variables p(X) X y P • Fix Q. Given D, compute P(FQ,D) a1 b1 p1 a1 b2 p2 a2 b2 p3 • Simple fact: PD(Q) = P(FQ,D) Simons 2016 11 This Talk Fix a query Q: • What is the complexity of PD(Q) in the size of D? • What is the best runtime of a DPLL-based algorithm on FQ,D in the size of D? Simons 2016 12 Discussion: Correlations [Domingos&Richardson’06] MLN = popular FO framework for Machine Learning tasks Lise Getoor’s talk today Smoker(x)∧Friends(x,y) àSmoker(y), weight = 2.3 Theorem [Jha,S’11] One can construct effectively D s.t. PMLN, Δ(Q) = PD(Q | Γ) = PD(Q∧Γ) / PD(Γ) Outline • Model Counting • Small Dichotomy Theorem • Dichotomy Theorem • Query Compilation • Conclusions, Open Problems Simons 2016 14 Background: Query Plans Q(z) = R(z,X), S(x,y),T(y,u), u=123 Query plan = expressions over the input relation Πz Operators = selection, projection, join, union, difference ⋈X σu=123 ⋈y R(z,X) S(x,y) T(y,u) Boolean query An Example Q() = R(x), S(x,y) = ∃X∃y(R(x) ∧S(x,y)) PD (Q) = 1- {1- p1*[ 1 - (1 - q1 )*(1 - q2 ) ]} * {1- p2*[ 1 - (1 - q3 )*(1 - q4 )*(1 - q5 ) ] } One can compute PD(Q) in PTIME S X y P in the size of the database D a1 b1 q1 R X P a1 b2 q2 a1 p1 a2 b3 q3 a2 p2 a2 b4 q4 a3 p3 a2 b5 q5 Extensional Plans • Modify each operator to compute output probabilities, assuming independent events Simons 2016 17 P(Q) = 1 – [1-p1*(1-(1-q1)*(1-q2))] Q() = R(x), S(x,y) *[1- p2*(1-(1-q3)*(1-q4)*(1-q5))] 1-{1-p1[1-(1-q1)(1-q2)]}* 1-(1-p1q1)(1-p1q2)(1-p2q3)(1-p2q4)(1-p2q5) {1-p2[1-(1-q4)(1-q5) (1-q6)]} ΠΦ ΠΦ p1 q1 Wrong Right p1 q2 p2 q3 p2 q4 1-(1-q1)(1-q2) p2 q5 ⋈ 1-(1-q4)(1-q5) (1-q6) ⋈ X y P a1 b1 q1 X P a1 b2 q2 ΠX a p 1 1 R(x) S(x,y) a2 b3 q3 R(x) a p 2 2 a2 b4 q4 S(x,y) 18 a3 p3 a2 b5 q5 Safe Queries Definition A plan for Q is safe if it computes the probabilities correctly. Q is safe if it has a safe plan. • In AI, computing Q using a safe plan is called lifted inference • Safe query = Liftable query • If Q is safe then PD (Q) is in PTIME Unsafe Queries R S T X P X Y Y P x1 y1 x1 p1 y1 q1 P x1 y2 Wrong x2 p2 y2 q2 x2 y2 ⋈ P T H0() = R(x), S(x,y), T(y) ⋈ R S Theorem. [Dalvi&S.2004] PD(H0) is #P-hard However: 1. This plan computes an upper bound [VLDB’15] 2. Use samples on T [VLDB’16] Wolfgang Gatterbauer’s talk today Hierarchical Queries Fix Q; at(X) = set of atoms (=literals) containing the variable X Definition Q is hierarchical if forall variables x, y: at(x) ⊆at(y) or at(x) ⊇ at(y) or at(x) ∩ at(y) = ∅ Hierarchical Non-hierarchical Q() = R(x,y), S(x,z) H0() = R(x), S(x,y), T(y) X y X S T y R S z R 21 The Small Dichotomy Theorem Non-repeating Conjunctive Query = = Conjunctive Query “without self-joins” = “Simple” conjunctive query [Dalvi&S.04] Theorem Let Q be a non-repeating CQ • If Q is hierarchical, then PD(Q) is in PTIME. • If Q is not hierarchical then PD(Q) is #P-hard. By duality, the same holds for a non-repeating clause Summary so Far Non-repeating CQ Complexity of P (Q) D Non-repeating clause PTIME Hierarchical #P - hard Non-hierarchical Simons 2016 23 Outline • Model Counting • Small Dichotomy Theorem • Dichotomy Theorem • Query Compilation • Conclusions, Open Problems Simons 2016 24 The Rules for Lifted Inference Preprocess Q (omitted from this talk; see book), then apply these rules (some have preconditions) P(¬Q) = 1 – P(Q) negation P(Q1 ∧ Q2) = P(Q1)P(Q2) Independent P(Q1 ∨ Q2) =1 – (1 – P(Q1))(1 – P(Q2)) join / union P(∃z Q) = 1 – Π (1– P(Q[a/z]) a ∈Domain Independent project P(∀z Q) = Πa ∈Domain P(Q[a/z] P(Q1 ∧ Q2) = P(Q1) + P(Q2) - P(Q1 ∨ Q2) Inclusion/ P(Q1 ∨ Q2) = P(Q1) + P(Q2) - P(Q1 ∧ Q2) exclusion FOun = Unate FO An FO sentence is unate if: • Negations occur only on atoms • Every relational symbol R either occurs only positively, or only negatively FOun = FO restricted to unate sentences Simons 2016 26 Dichotomy Theorem [Dalvi&S’12] Theorem For any Q in ∀*FOun (or ∃*FOun) • If rules succeed, then PD(Q) in PTIME in |D| • If rules fail, then PD(Q) is #P hard in |D| Note: Unions of Conjunctive queries (UCQ) is essentially∃*FOun Example: Liftable Query QJ() = S(x1,y1), R(y1), S(X2,y2), T(y2) = [S(X1,y1),R(y1)] ∧ [S(X2,y2),T(y2)] Q1 Q2 P(QJ) = P(Q1) + P(Q2) - P(Q1 ∨ Q2) PTIMEPTIME (have(have seenseen before)already) y = y1 = y2 Q1 ∨ Q2 = ∃y [S(X1,y),R(y) ∨ S(X2,y)),T(y)] P(Q1 ∨ Q2) = = 1 – Πb ∈Domain (1 – P[S(X1,b),R(b) ∨ S(X2,b)),T(b)]) = 1 – Πb ∈Domain (1 – P[S(X1,b)] * P[R(b)∨T(b)]) = … etc 2 Runtime = O(n ). Simons 2016 28 Example: Liftable Query QJ = ∀X1∀y1∀X2∀y2 (S(X1,y1) ∨ R(y1) ∨ S(x2,y2) ∨T(y2)) = [∀X1∀y1S(X1,y1)∨R(y1)] ∨ [∀X2∀y2S(X2,y2)∨T(y2)] Q 1 Q2 P(QJ) = P(Q1) + P(Q2) - P(Q1 ∧ Q2) PTIMEPTIME (have(have seenseen before)already) y = y1 = y2 Q1 ∧ Q2 = ∀y [(∀X1S(X1,y)∨R(y)) ∧ (∀X2S(X2,y))∨T(y)] = ∀y [∀X S(x,y)∨(R(y)∧T(y))] P(Q1 ∧ Q2) = Πb ∈Domain P[∀x.S(x,b) ∨(R(b) ∧ T(b))] = …etc 2 Runtime = O(n ). Simons 2016 29 Unliftable Queries Hk Will drop ∀ to reduce clutter H0= R(x)∨S(x,y)∨T(y) H1= [R(x0)∨S(X0,y0)] ∧ [S(x1,y1)∨T(y1)] H2= [R(x0)∨S1(X0,y0)] ∧ [S1(x1,y1)∨S2(x1,y1)] ∨ [S2(x2,y2)∨T(y2)] H3= [R(x0)∨S1(X0,y0)]∧[S1(x1,y1)∨S2(x1,y1)]∧[S2(X2,y2)∨S3(x2,y2)]∧[S3(x3,y3)∨T(y3)] Every Hk, k≥1 .

Brief Tutorial on Probabilistic Databases

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support