Data Management Meets Information Theory

Data Management Meets Information Theory

Data Management Meets Information Theory Dan Suciu U. of Washington and RelationalAI Joint work with M. Abo Khamis, H. Ngo, B. Kenig Background Information theory: • Routinely used in ML (e.g. decision trees) • But not in data management • Recent advances in IT shed deep insight This talk • IT in (1) query processing (2) constraints !2 Part 1: From Proof to Algorithms Query Plans !3 Query Processing 101 • SQL ! Query Plan • Query Plan ! Optimized Query Plan • Two major problems: – Cardinality estimation sucks – Optimal plans don’t exists !4 Example Cardinality estimation sucks Optimal plans don’t exists Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) select * from R, S, T where R.Y = S.Y and S.Z = T.Z and T.X = R.X ⨝ ⨝ T(Z,X) R(X,Y) S(Y,Z) Every query plan O(N2) Largest output O(N1.5) New Paradigm • Find information-theoretic proof of the upper bound, or the submodular width • Convert proof to algorithm !6 Two Running Examples R Full Conjunctive Query X Y R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) T S Z Boolean Query R X Y ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) K S U Z T !7 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !8 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !9 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !10 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !11 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !12 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !13 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !14 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| " N3/2 !15 Entropy Let V = {X1, X2, …} be a set of random variables. The entropy of X⊆V is H(X) = - $i=1,N pi log pi V H: 2 ! R+ is called entropic The conditional entropy H(Y|X) = H(X∪Y) – H(X) Waterloo 2019 !16 Shannon Inequalities Monotonicity H(U ∪ V) % H(U) H(U) + H(V) % H(U & V) + H(U ∪ V) Submodularity V H: 2 ! R+ is called polymatroid X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H !18 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Database D R(X,Y) S(Y,Z) T(Z,X) X Y Y Z Z X a 3 3 m m a a 2 2 q q a b 2 3 q q b d 3 2 m m d !19 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m X Y Y Z Z X a 2 q a 3 3 m m a b 2 q a 2 2 q q a d 3 m b 2 3 q q b a 3 q d 3 2 m m d !20 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m 1/5 X Y Y Z Z X a 2 q 1/5 a 3 3 m m a b 2 q 1/5 a 2 2 q q a d 3 m 1/5 b 2 3 q q b a 3 q 1/5 d 3 2 m m d H(XYZ) = log |Q(D)| !21 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m 1/5 X Y Y Z Z X a 2 q 1/5 a 3 2/5 3 m 2/5 m a 1/5 b 2 q 1/5 a 2 1/5 2 q 2/5 q a 2/5 d 3 m 1/5 b 2 1/5 3 q 1/5 q b 1/5 a 3 q 1/5 d 3 1/5 2 m 0 m d 1/5 H(XYZ) = log |Q(D)| !22 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m 1/5 X Y Y Z Z X a 2 q 1/5 a 3 2/5 3 m 2/5 m a 1/5 b 2 q 1/5 a 2 1/5 2 q 2/5 q a 2/5 d 3 m 1/5 b 2 1/5 3 q 1/5 q b 1/5 a 3 q 1/5 d 3 1/5 2 m 0 m d 1/5 H(XYZ) = log |Q(D)| H(XY) " log NR H(YZ) " log NS H(XZ) " log NT H(Z|Y) " log degS(z|y) Cardinalitites, functional dependences, max degrees X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 !24 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 3 log N % h(XY) + h(YZ) + h(XZ) % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !25 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !26 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !27 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) submodularity % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !28 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) submodularity % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !29 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) submodularity % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Q(D)| !30 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) submodularity % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) Shearer’s inequality = 2 log |Q(D)| !31 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) h(XY)+h(YZ)+h(XZ) h(XYZ) + h(Y) + h(XZ) Proof h(XYZ) !32 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) Algorithm h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X) h(XYZ) + h(Y) + h(XZ) Proof h(XYZ) !33 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) Algorithm h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X) N3/2 h(XYZ) + h(Y) + h(XZ) Rlight(X,Y)∧S(Y,Z) Proof h(XYZ) 1/2 Rlight or Rheavy: degree(Y) " or > N !34 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) Algorithm h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X) N3/2 h(XYZ) + h(Y) + h(XZ) N3/2 R (X,Y) S(Y,Z) light ∧ ∪ Proof h(XYZ) Rheavy(Y)∧T(X,Z) 1/2 Rlight or Rheavy: degree(Y) " or > N !35 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Runtime Õ(N3/2) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) Algorithm h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X) N3/2 h(XYZ) + h(Y) + h(XZ) N3/2 R (X,Y) S(Y,Z) light ∧ ∪ Proof h(XYZ) Rheavy(Y)∧T(X,Z) 1/2 Rlight or Rheavy: degree(Y) " or > N !36 Full Conjunctive Query Asymptotically tight, but open if computable Theorem ∀D that satisfies the statistics log |Q(D)| " max H entropic satisfying stats H(X) " max H polymatroid satisfying stats H(X) Computable in EXPTIME, but not tight Thm Q(D) computable in time Õ(Polymatroid-bound) !37 Discussion AGM Bound [Atserias,Grohe,Marx’08, Ngo,Re,Rudra’13] • Entropic bound = polymatroid bound • Algorithm (NPRR) for Q(D) has single log factor Cardinalities + FDs + max degrees [AboKhamis,Ngo,S’17] • Entropic bound ≨ polymatroid bound • Algorithm (PANDA) for Q(D) has polylog factor !38 Boolean Query Tree decomposition (TD) = a tree where each node t is a full conjunctive query • Fractional hypetree width [Grohe,Marx’14] mintree maxnode t maxD • Submodular width [Marx’13,ANS’17] maxD mintree maxnode t !39 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD !40 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T(z,u),K(u,x) K(u,x),R(x,y) !41 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T(z,u),K(u,x) K(u,x),R(x,y) Runtime Õ(N2) (suboptimal) !42 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T(z,u),K(u,x) K(u,x),R(x,y) Runtime Õ(N2) (suboptimal) !43 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T(z,u),K(u,x) K(u,x),R(x,y) !44 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T1 min(

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    73 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us