<<

Data Management Meets

Dan Suciu U. of Washington and RelationalAI

Joint work with M. Abo Khamis, H. Ngo, B. Kenig Background

Information theory: • Routinely used in ML (e.g. decision trees) • But not in data management • Recent advances in IT shed deep insight

This talk • IT in (1) query processing (2) constraints

2 Part 1: From Proof to

Query

3 Query Processing 101

• SQL ! Query • Query Plan ! Optimized Query Plan

• Two major problems: – Cardinality estimation sucks – Optimal plans don’t exists

4 Example

Cardinality estimation sucks Optimal plans don’t exists

Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X)

select * from R, S, T where R.Y = S.Y and S.Z = T.Z and T.X = R.X ⨝

⨝ T(Z,X)

R(X,Y) S(Y,Z)

Every query plan O(N2) Largest output O(N1.5) New Paradigm

• Find information-theoretic proof of the upper bound, or the submodular width

• Convert proof to

6 Two Running Examples

R Full Conjunctive Query X Y

R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) T S Z

Boolean Query R X Y

∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) K S U Z T

7 maxD satisfies stats (|Q(D)|)

E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| ≤ N • No other info: |Q(D)| ≤ N2 • S.Y is a : |Q(D)| ≤ N • S.Y has degree ≤ d: |Q(D)| ≤ d×N

E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| ≤ N3/2

8 Statistics maxD satisfies stats (|Q(D)|)

E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| ≤ N • No other info: |Q(D)| ≤ N2 • S.Y is a key: |Q(D)| ≤ N • S.Y has degree ≤ d: |Q(D)| ≤ d×N

E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| ≤ N3/2

9 Statistics maxD satisfies stats (|Q(D)|)

E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| ≤ N • No other info: |Q(D)| ≤ N2 • S.Y is a key: |Q(D)| ≤ N • S.Y has degree ≤ d: |Q(D)| ≤ d×N

E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| ≤ N3/2

10 Statistics maxD satisfies stats (|Q(D)|)

E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| ≤ N • No other info: |Q(D)| ≤ N2 • S.Y is a key: |Q(D)| ≤ N • S.Y has degree ≤ d: |Q(D)| ≤ d×N

E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| ≤ N3/2

11 Statistics maxD satisfies stats (|Q(D)|)

E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| ≤ N • No other info: |Q(D)| ≤ N2 • S.Y is a key: |Q(D)| ≤ N • S.Y has degree ≤ d: |Q(D)| ≤ d×N

E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| ≤ N3/2

12 Statistics maxD satisfies stats (|Q(D)|)

E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| ≤ N • No other info: |Q(D)| ≤ N2 • S.Y is a key: |Q(D)| ≤ N • S.Y has degree ≤ d: |Q(D)| ≤ d×N

E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| ≤ N3/2

13 Statistics maxD satisfies stats (|Q(D)|)

E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| ≤ N • No other info: |Q(D)| ≤ N2 • S.Y is a key: |Q(D)| ≤ N • S.Y has degree ≤ d: |Q(D)| ≤ d×N

E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| ≤ N3/2

14 Statistics maxD satisfies stats (|Q(D)|)

E.g. R(X,Y) ∧ S(Y,Z), |R|, |S| ≤ N • No other info: |Q(D)| ≤ N2 • S.Y is a key: |Q(D)| ≤ N • S.Y has degree ≤ d: |Q(D)| ≤ d×N

E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) No other info: |Q(D)| ≤ N3/2

15

Let V = {X1, X2, …} be a set of random variables.

The entropy of X⊆V is H(X) = - Σi=1,N pi log pi

V H: 2 ! R+ is called entropic

The H(Y|X) = H(X∪Y) – H(X)

Waterloo 2019 16 Inequalities

Monotonicity

H(U ∪ V) ≥ H(U)

H(U) + H(V) ≥ H(U ∩ V) + H(U ∪ V)

Submodularity

V H: 2 ! R+ is called polymatroid X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) D ! entropic function H

18 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H

Database D R(X,Y) S(Y,Z) T(Z,X) X Y Y Z Z X a 3 3 m m a a 2 2 q q a b 2 3 q q b d 3 2 m m d

19 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H

Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m X Y Y Z Z X a 2 q a 3 3 m m a b 2 q a 2 2 q q a d 3 m b 2 3 q q b a 3 q d 3 2 m m d

20 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H

Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m 1/5 X Y Y Z Z X a 2 q 1/5 a 3 3 m m a b 2 q 1/5 a 2 2 q q a d 3 m 1/5 b 2 3 q q b a 3 q 1/5 d 3 2 m m d

H(XYZ) = log |Q(D)|

21 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H

Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m 1/5 X Y Y Z Z X a 2 q 1/5 a 3 2/5 3 m 2/5 m a 1/5 b 2 q 1/5 a 2 1/5 2 q 2/5 q a 2/5 d 3 m 1/5 b 2 1/5 3 q 1/5 q b 1/5 a 3 q 1/5 d 3 1/5 2 m 0 m d 1/5

H(XYZ) = log |Q(D)|

22 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Database D ! entropic function H

Output Q(D) Database D X Y Z R(X,Y) S(Y,Z) T(Z,X) a 3 m 1/5 X Y Y Z Z X a 2 q 1/5 a 3 2/5 3 m 2/5 m a 1/5 b 2 q 1/5 a 2 1/5 2 q 2/5 q a 2/5 d 3 m 1/5 b 2 1/5 3 q 1/5 q b 1/5 a 3 q 1/5 d 3 1/5 2 m 0 m d 1/5

H(XYZ) = log |Q(D)| H(XY) ≤ log NR H(YZ) ≤ log NS H(XZ) ≤ log NT

H(Z|Y) ≤ log degS(z|y)

Cardinalitites, functional dependences, max degrees X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| ≤ N ! |Q(D)|≤ N3/2

24 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| ≤ N ! |Q(D)|≤ N3/2

3 log N ≥ h(XY) + h(YZ) + h(XZ)

≥ h(XYZ) + h(Y) + h(XZ)

≥ h(XYZ) + h(XYZ) + h(∅)

= 2 h(XYZ)

= 2 log |Output| 25 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| ≤ N ! |Q(D)|≤ N3/2

submodularity

3 log N ≥ h(XY) + h(YZ) + h(XZ)

≥ h(XYZ) + h(Y) + h(XZ)

≥ h(XYZ) + h(XYZ) + h(∅)

= 2 h(XYZ)

= 2 log |Output| 26 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| ≤ N ! |Q(D)|≤ N3/2

submodularity

3 log N ≥ h(XY) + h(YZ) + h(XZ)

≥ h(XYZ) + h(Y) + h(XZ)

≥ h(XYZ) + h(XYZ) + h(∅)

= 2 h(XYZ)

= 2 log |Output| 27 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| ≤ N ! |Q(D)|≤ N3/2

submodularity

3 log N ≥ h(XY) + h(YZ) + h(XZ) submodularity ≥ h(XYZ) + h(Y) + h(XZ)

≥ h(XYZ) + h(XYZ) + h(∅)

= 2 h(XYZ)

= 2 log |Output| 28 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| ≤ N ! |Q(D)|≤ N3/2

submodularity

3 log N ≥ h(XY) + h(YZ) + h(XZ) submodularity ≥ h(XYZ) + h(Y) + h(XZ)

≥ h(XYZ) + h(XYZ) + h(∅)

= 2 h(XYZ)

= 2 log |Output| 29 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| ≤ N ! |Q(D)|≤ N3/2

submodularity

3 log N ≥ h(XY) + h(YZ) + h(XZ) submodularity ≥ h(XYZ) + h(Y) + h(XZ)

≥ h(XYZ) + h(XYZ) + h(∅)

= 2 h(XYZ)

= 2 log |Q(D)| 30 X Y Proof of Upper Bound Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) |R|,|S|,|T| ≤ N ! |Q(D)|≤ N3/2

submodularity

3 log N ≥ h(XY) + h(YZ) + h(XZ) submodularity ≥ h(XYZ) + h(Y) + h(XZ)

≥ h(XYZ) + h(XYZ) + h(∅)

= 2 h(XYZ) Shearer’s inequality

= 2 log |Q(D)| 31 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) ≥ 2 h(XYZ)

h(XY)+h(YZ)+h(XZ) h(XYZ) + h(Y) + h(XZ)

Proof h(XYZ)

32 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) ≥ 2 h(XYZ) Algorithm

h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X) h(XYZ) + h(Y) + h(XZ)

Proof h(XYZ)

33 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) ≥ 2 h(XYZ) Algorithm

h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X)

N3/2 h(XYZ) + h(Y) + h(XZ)

Rlight(X,Y)∧S(Y,Z) Proof h(XYZ)

1/2 Rlight or Rheavy: degree(Y) ≤ or > N 34 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) h(XY) + h(YZ) + h(XZ) ≥ 2 h(XYZ) Algorithm

h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X)

N3/2 h(XYZ) + h(Y) + h(XZ) N3/2 R (X,Y) S(Y,Z) light ∧ ∪ Proof h(XYZ) Rheavy(Y)∧T(X,Z)

1/2 Rlight or Rheavy: degree(Y) ≤ or > N 35 X Y Proof to Algorithm Z Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) Runtime Õ(N3/2) h(XY) + h(YZ) + h(XZ) ≥ 2 h(XYZ) Algorithm

h(XY)+h(YZ)+h(XZ) R(X,Y)∧S(Y,Z)∧T(Z,X)

N3/2 h(XYZ) + h(Y) + h(XZ) N3/2 R (X,Y) S(Y,Z) light ∧ ∪ Proof h(XYZ) Rheavy(Y)∧T(X,Z)

1/2 Rlight or Rheavy: degree(Y) ≤ or > N 36 Full Conjunctive Query

Asymptotically tight, but open if computable

Theorem ∀D that satisfies the statistics

log |Q(D)| ≤ max H entropic satisfying stats H(X) ≤ max H polymatroid satisfying stats H(X)

Computable in EXPTIME, but not tight

Thm Q(D) computable in time Õ(Polymatroid-bound)

37 Discussion

AGM Bound [Atserias,Grohe,Marx’08, Ngo,Re,Rudra’13]

• Entropic bound = polymatroid bound • Algorithm (NPRR) for Q(D) has single log factor

Cardinalities + FDs + max degrees [AboKhamis,Ngo,S’17]

• Entropic bound ≨ polymatroid bound • Algorithm (PANDA) for Q(D) has polylog factor

38 Boolean Query

Tree decomposition (TD) = a tree where each node t is a full conjunctive query

• Fractional hypetree width [Grohe,Marx’14]

mintree maxnode t maxD

• Submodular width [Marx’13,ANS’17]

maxD mintree maxnode t

39 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD

40 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u)

T(z,u),K(u,x) K(u,x),R(x,y)

41 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u)

T(z,u),K(u,x) K(u,x),R(x,y)

Runtime Õ(N2)

(suboptimal) 42 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] mintree maxnode t maxD Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u)

T(z,u),K(u,x) K(u,x),R(x,y)

Runtime Õ(N2)

(suboptimal) 43 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u)

T(z,u),K(u,x) K(u,x),R(x,y)

44 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u) T1

min( max(h(xyz),h(zux)), max(h(yzu),h(uxy))) = T(z,u),K(u,x) K(u,x),R(x,y)

45 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u) T1

min( max(h(xyz),h(zux)), max(h(yzu),h(uxy))) = T(z,u),K(u,x) K(u,x),R(x,y)

T2

46 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u) T1

min( max(h(xyz),h(zux)), max(h(yzu),h(uxy))) = T(z,u),K(u,x) K(u,x),R(x,y)

T2

47 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u) T1

min( max(h(xyz),h(zux)), max(h(yzu),h(uxy))) = T(z,u),K(u,x) K(u,x),R(x,y)

T2 = max(min(h(xyz),h(yzu)), min(h(xyz),h(uxy)), min(h(zux),h(yzu)), min(h(zux),h(uxy))) 48 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u) T1

min( max(h(xyz),h(zux)), max(h(yzu),h(uxy))) = T(z,u),K(u,x) K(u,x),R(x,y)

T2 3 log N ≥ h(xy) + h(yz) + h(zu) = max(min(h(xyz),h(yzu)), min(h(xyz),h(uxy)), min(h(zux),h(yzu)), min(h(zux),h(uxy))) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u) T1

min( max(h(xyz),h(zux)), max(h(yzu),h(uxy))) = T(z,u),K(u,x) K(u,x),R(x,y)

T2 3 log N ≥ h(xy) + h(yz) + h(zu) ≥ h(xyz) + h(y) + h(zu) = max(min(h(xyz),h(yzu)), min(h(xyz),h(uxy)), min(h(zux),h(yzu)), min(h(zux),h(uxy))) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u) T1

min( max(h(xyz),h(zux)), max(h(yzu),h(uxy))) = T(z,u),K(u,x) K(u,x),R(x,y)

T2 3 log N ≥ h(xy) + h(yz) + h(zu) ≥ h(xyz) + h(y) + h(zu) = max(min(h(xyz),h(yzu)), min(h(xyz),h(uxy)), ≥ h(xyz) + h(yzu) min(h(zux),h(yzu)), min(h(zux),h(uxy))) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

R(x,y),S(y,z) S(y,z),T(z,u) T1

min( max(h(xyz),h(zux)), max(h(yzu),h(uxy))) = T(z,u),K(u,x) K(u,x),R(x,y)

T2 3 log N ≥ h(xy) + h(yz) + h(zu) ≥ h(xyz) + h(y) + h(zu) = max(min(h(xyz),h(yzu)), min(h(xyz),h(uxy)), ≥ h(xyz) + h(yzu) min(h(zux),h(yzu)), ≥ 2 min(h(xyz),h(yzu)) min(h(zux),h(uxy))) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N R(x,y),S(y,z) S(y,z),T(z,u) T1

min( max(h(xyz),h(zux)), max(h(yzu),h(uxy))) = T(z,u),K(u,x) K(u,x),R(x,y)

T2 3 log N ≥ h(xy) + h(yz) + h(zu) ≥ h(xyz) + h(y) + h(zu) = max(min(h(xyz),h(yzu)), min(h(xyz),h(uxy)), ≥ h(xyz) + h(yzu) min(h(zux),h(yzu)), ≥ 2 min(h(xyz),h(yzu)) min(h(zux),h(uxy))) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N R(x,y),S(y,z) S(y,z),T(z,u)

T(z,u),K(u,x) K(u,x),R(x,y) 3 log N ≥ h(xy) + h(yz) + h(zu) Next: proof to algorithm ≥ h(xyz) + h(y) + h(zu) ≥ h(xyz) + h(yzu) ≥ 2 min(h(xyz),h(yzu)) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N R(x,y),S(y,z) S(y,z),T(z,u)

T(z,u),K(u,x) K(u,x),R(x,y) 3 log N ≥ h(xy) + h(yz) + h(zu) ≥ h(xyz) + h(y) + h(zu) ≥ h(xyz) + h(yzu) ≥ 2 min(h(xyz),h(yzu)) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N A= R(x,y),S(y,z) B= S(y,z),T(z,u)

A(x,y,z)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u)

C= T(z,u),K(u,x) D= K(u,x),R(x,y) 3 log N ≥ h(xy) + h(yz) + h(zu) ≥ h(xyz) + h(y) + h(zu) ≥ h(xyz) + h(yzu) ≥ 2 min(h(xyz),h(yzu)) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N A= R(x,y),S(y,z) B= S(y,z),T(z,u)

A(x,y,z)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u)

C= T(z,u),K(u,x) D= K(u,x),R(x,y)

Partition S into Slight∪Sheavy 3 log N ≥ h(xy) + h(yz) + h(zu)

A(x,y,z) " R(x,y) ⋈ Slight(y,z) ≥ h(xyz) + h(y) + h(zu) ≥ h(xyz) + h(yzu) ≥ 2 min(h(xyz),h(yzu)) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N A= R(x,y),S(y,z) B= S(y,z),T(z,u)

A(x,y,z)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u)

C= T(z,u),K(u,x) D= K(u,x),R(x,y)

Partition S into Slight∪Sheavy 3 log N ≥ h(xy) + h(yz) + h(zu)

A (x,y,z)" R(x,y) ⋈ Slight(y,z) ≥ h(xyz) + h(y) + h(zu) ≥ h(xyz) + h(yzu) B(y,z,u) " Sheavy(y) ⋈ T(z,u) ≥ 2 min(h(xyz),h(yzu)) X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N A= R(x,y),S(y,z) B= S(y,z),T(z,u)

A(x,y,z)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u)

C= T(z,u),K(u,x) D= K(u,x),R(x,y) A(x,y,z)∨D(x,y,u) " R(x,y)∧S(y,z)∧K(u,x)

59 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N A= R(x,y),S(y,z) B= S(y,z),T(z,u)

A(x,y,z)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u)

C= T(z,u),K(u,x) D= K(u,x),R(x,y) A(x,y,z)∨D(x,y,u) " R(x,y)∧S(y,z)∧K(u,x)

C(x,z,u)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u)

60 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N A= R(x,y),S(y,z) B= S(y,z),T(z,u)

A(x,y,z)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u)

C= T(z,u),K(u,x) D= K(u,x),R(x,y) A(x,y,z)∨D(x,y,u) " R(x,y)∧S(y,z)∧K(u,x)

C(x,z,u)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u)

C(x,z,u)∨D(x,y,u) " R(x,y)∧S(y,z)∧K(u,x) 61 X Y Q() = ∃x∃y∃z∃u R(x,y)∧S(y,z)∧T(z,u)∧K(u,x) U Z O(N3/2) algorithm [Alon,Yuster,Zwick’97] maxD mintree maxnode t Tree decompositions

subw(Q) = 3/2 log N A= R(x,y),S(y,z) B= S(y,z),T(z,u)

A(x,y,z)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u)

C= T(z,u),K(u,x) D= K(u,x),R(x,y) A(x,y,z)∨D(x,y,u) " R(x,y)∧S(y,z)∧K(u,x)

C(x,z,u)∨B(y,z,u) " R(x,y)∧S(y,z)∧T(z,u) Runtime: O(N3/2) C(x,z,u)∨D(x,y,u) " R(x,y)∧S(y,z)∧K(u,x) 62 Discussion

Query evaluation summary: • Proof ! Algorithm • Cardinality estimation

Open problem: • Better proofs ! better algorithms

63 Part 2: Relaxing Constraints

Waterloo 2019 64 Relaxation Problem

FDs and MVDs as hard constraints • Exact Implication � ⊨ �.

FDs and MVDs as soft constraints Approximate implication ∑ ≥ • �∈� � �

Relaxation problem • When can we convert EI to AI?

Waterloo 2019 65 FD and MVD

Functional Dependency (FD) • A!B

(Embedded) Multivalued Dependency:

• EMVD: A!(B|C) if �ABC(R)=�AB(R)⋈�AC(R)

A B C 1 1 1 !(B|C) A!(B|C) 1 1 2 1 2 1 1 2 2 2 2 2 Conditional Independence

• X⊥Y | Z if P(X,Y | Z) = P(X | Z) P(Y | Z)

• Graphoid axioms [Pearl&Paz]

B⊥C|A A B C ¬(B⊥C) 1 1 1 1/5 MVD: A ↠ B iff B⊥C|A Fails for EMVD 1 1 2 1/5 1 2 1 1/5 1 2 2 1/5 2 2 2 1/5 H(Y|X) = H(XY) – H(X) I(X;Y|Z) = H(XZ) + H(YZ) – H(XYZ) – H(Z) Soft Constraints

X!Y iff H(Y|X) = 0

X⊥Y | Z iff I(X;Y|Z) = 0

Waterloo 2019 68 Relaxation Holds for FDs+MVDs

� ⊨ �

Theorem [Kenig&S] If � consists of FDs+MVDs Then: n2/4 ∑ ≥ • �∈� � � If is an FD then: ∑ ≥ • � �∈� � � Example: AB ! C, AD ! E, CE ! F ⊨ ABD ! F

H(C|AB) + H(E|AD) + H(F|CE) ≥ H(F|ABD)

Waterloo 2019 69 Relaxation Fails in General

[Kaced&Romashchenko], [Kenig&S]

Theorem (C⊥D|A), (C⊥D|B), (A⊥B), (B⊥C|D) ⊨ C ⊥ D

∀ � ≥ 0: � (I(C;D|A) + I(C;D|B) + I(A;B) + I(B;C|D)) < I(C;D)

However, we can relax “in the limit”

Theorem [Kenig&S] If � ⊨ �, then forall � >0 exists � ≥ 0:

∑ + H(V) ≥ � �∈� � � �

V = all variables Waterloo 2019 70 CI’s Restricted to Shannon

Theorem (folklore) A Shannon implication � ⊨ � relaxes to � ∑ � ≥ � for some � > 0

Theorem [Kenig&S] (1) � ≤ (2n)! (2) there exists implications where � ≥ 3

Example: Contraction Axiom in semi-graphoids: X ⊥ Y|Z & X ⊥ W | YZ ⊨ X ⊥ YW | Z Relaxes to: I(X;Y|Z) + I(X;W|YZ) ≥ I(X;YW | Z) 71 Summary of Part 2

• The relaxation problem: when can we convert exact implications to approximate implications

• Great practical importance: real data satisfies constraints only approximatively, need to relax

• Open problems: bounds on �

Waterloo 2019 72 Conclusions

• Information theory is routinely used in ML

• Applications to data management: query processing and constraints

• Connections to difficult results in IT

73