Data Management Meets Information Theory

Data Management Meets Information Theory Dan Suciu U. of Washington and RelationalAI Joint work with M. Abo Khamis, H. Ngo, B. Kenig Background Information theory: • Routinely used in ML (e.g. decision trees) • But not in data management • Recent advances in IT shed deep insight This talk • IT in (1) query processing (2) constraints !2 Part 1: From Proof to Algorithms Query Plans !3 Query Processing 101 • SQL ! Query Plan • Query Plan ! Optimized Query Plan • Two major problems: – Cardinality estimation sucks – Optimal plans don’t exists !4 Example Cardinality estimation sucks Optimal plans don’t exists Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) select * from R, S, T where R.Y = S.Y and S.Z = T.Z and T.X = R.X ⨝ ⨝ T(Z,X) R(X,Y) S(Y,Z) Every query plan O(N2) Largest output O(N1.5) New Paradigm • Find information-theoretic proof of the upper bound, or the submodular width • Convert proof to algorithm !6 Two Running Examples R Full Conjunctive Query X Y R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) T S Z Boolean Query R X Y ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) K S U Z T !7 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !8 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !9 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !10 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !11 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !12 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !13 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) 3/2 No other info: |Q(D)| " N !14 Statistics maxD satisfies stats (|Q(D)|) E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| " N • No other info: |Q(D)| " N2 • S.Y is a key: |Q(D)| " N • S.Y has degree " d: |Q(D)| " d#N E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| " N3/2 !15 Entropy Let V = {X1, X2, …} be a set of random variables. The entropy of X⊆V is H(X) = - $i=1,N pi log pi V H: 2 ! R+ is called entropic The conditional entropy H(Y|X) = H(X∪Y) – H(X) Waterloo 2019 !16 Shannon Inequalities Monotonicity H(U ∪ V) % H(U) H(U) + H(V) % H(U & V) + H(U ∪ V) Submodularity V H: 2 ! R+ is called polymatroid X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H !18 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Database D R(X,Y) S(Y,Z) T(Z,X) X Y Y Z Z X a 3 3 m m a a 2 2 q q a b 2 3 q q b d 3 2 m m d !19 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m X Y Y Z Z X a 2 q a 3 3 m m a b 2 q a 2 2 q q a d 3 m b 2 3 q q b a 3 q d 3 2 m m d !20 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m 1/5 X Y Y Z Z X a 2 q 1/5 a 3 3 m m a b 2 q 1/5 a 2 2 q q a d 3 m 1/5 b 2 3 q q b a 3 q 1/5 d 3 2 m m d H(XYZ) = log |Q(D)| !21 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m 1/5 X Y Y Z Z X a 2 q 1/5 a 3 2/5 3 m 2/5 m a 1/5 b 2 q 1/5 a 2 1/5 2 q 2/5 q a 2/5 d 3 m 1/5 b 2 1/5 3 q 1/5 q b 1/5 a 3 q 1/5 d 3 1/5 2 m 0 m d 1/5 H(XYZ) = log |Q(D)| !22 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m 1/5 X Y Y Z Z X a 2 q 1/5 a 3 2/5 3 m 2/5 m a 1/5 b 2 q 1/5 a 2 1/5 2 q 2/5 q a 2/5 d 3 m 1/5 b 2 1/5 3 q 1/5 q b 1/5 a 3 q 1/5 d 3 1/5 2 m 0 m d 1/5 H(XYZ) = log |Q(D)| H(XY) " log NR H(YZ) " log NS H(XZ) " log NT H(Z|Y) " log degS(z|y) Cardinalitites, functional dependences, max degrees X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 !24 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 3 log N % h(XY) + h(YZ) + h(XZ) % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !25 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !26 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !27 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) submodularity % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !28 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) submodularity % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Output| !29 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) submodularity % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) = 2 log |Q(D)| !30 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| " N ! |Q(D)|" N3/2 submodularity 3 log N % h(XY) + h(YZ) + h(XZ) submodularity % h(XYZ) + h(Y) + h(XZ) % h(XYZ) + h(XYZ) + h(∅) = 2 h(XYZ) Shearer’s inequality = 2 log |Q(D)| !31 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) h(XY)+h(YZ)+h(XZ) h(XYZ) + h(Y) + h(XZ) Proof h(XYZ) !32 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) Algorithm h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X) h(XYZ) + h(Y) + h(XZ) Proof h(XYZ) !33 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) Algorithm h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X) N3/2 h(XYZ) + h(Y) + h(XZ) Rlight(X,Y)∧S(Y,Z) Proof h(XYZ) 1/2 Rlight or Rheavy: degree(Y) " or > N !34 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) Algorithm h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X) N3/2 h(XYZ) + h(Y) + h(XZ) N3/2 R (X,Y) S(Y,Z) light ∧ ∪ Proof h(XYZ) Rheavy(Y)∧T(X,Z) 1/2 Rlight or Rheavy: degree(Y) " or > N !35 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Runtime Õ(N3/2) h(XY) + h(YZ) + h(XZ) % 2 h(XYZ) Algorithm h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X) N3/2 h(XYZ) + h(Y) + h(XZ) N3/2 R (X,Y) S(Y,Z) light ∧ ∪ Proof h(XYZ) Rheavy(Y)∧T(X,Z) 1/2 Rlight or Rheavy: degree(Y) " or > N !36 Full Conjunctive Query Asymptotically tight, but open if computable Theorem ∀D that satisfies the statistics log |Q(D)| " max H entropic satisfying stats H(X) " max H polymatroid satisfying stats H(X) Computable in EXPTIME, but not tight Thm Q(D) computable in time Õ(Polymatroid-bound) !37 Discussion AGM Bound [Atserias,Grohe,Marx’08, Ngo,Re,Rudra’13] • Entropic bound = polymatroid bound • Algorithm (NPRR) for Q(D) has single log factor Cardinalities + FDs + max degrees [AboKhamis,Ngo,S’17] • Entropic bound ≨ polymatroid bound • Algorithm (PANDA) for Q(D) has polylog factor !38 Boolean Query Tree decomposition (TD) = a tree where each node t is a full conjunctive query • Fractional hypetree width [Grohe,Marx’14] mintree maxnode t maxD • Submodular width [Marx’13,ANS’17] maxD mintree maxnode t !39 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD !40 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T(z,u),K(u,x) K(u,x),R(x,y) !41 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T(z,u),K(u,x) K(u,x),R(x,y) Runtime Õ(N2) (suboptimal) !42 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T(z,u),K(u,x) K(u,x),R(x,y) Runtime Õ(N2) (suboptimal) !43 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T(z,u),K(u,x) K(u,x),R(x,y) !44 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions R(x,y),S(y,z) S(y,z),T(z,u) T1 min(

Data Management Meets Information Theory

Information Theory Techniques for Multimedia Data Classification and Retrieval

Combinatorial Reasoning in Information Theory

2020 SIGACT REPORT SIGACT EC – Eric Allender, Shuchi Chawla, Nicole Immorlica, Samir Khuller (Chair), Bobby Kleinberg September 14Th, 2020

Information Theory for Intelligent People

An Information-Theoretic Approach to Normal Forms for Relational and XML Data

1 Applications of Information Theory to Biology Information Theory Offers A

Efficient Space-Time Sampling with Pixel-Wise Coded Exposure For

Combinatorial Foundations of Information Theory and the Calculus of Probabilities

Information Theory and Communication - Tibor Nemetz

Wisdom - the Blurry Top of Human Cognition in the DIKW-Model? Anett Hoppe1 Rudolf Seising2 Andreas Nürnberger2 Constanze Wenzel3

Kolmogorov Complexity in Perspective. Part II: Classification, Information Processing and Duality Marie Ferbus-Zanda

An Affective Computational Model for Machine Consciousness