Application of Information Theory, Lecture 8: Kolmogorov Complexity and Other Entropy Measures (Handout Mode)

Iftach Haitner

Tel Aviv University.

December 16, 2014

Part I

Kolmogorov Complexity

Description length

- What is the description length of the following strings?
  1. 010101010101010101010101010101010101
  2. 011010100000100111100110011001111110
  3. 111010100110001100111100010101011111

- 1. Eighteen copies of 01
  2. The first 36 bits of the binary expansion of √2 − 1
  3. Looks random, but has 22 ones out of 36

- Berry's paradox: let s be "the smallest positive integer that cannot be described in twelve English words"

- The above is a definition of s using fewer than twelve English words...

- Solution: the word "described" in the definition of s above is not well defined

Kolmogorov complexity

- For a string x ∈ {0, 1}^∗, let K(x) be the length of the shortest C++ program (written in binary) that outputs x (on empty input)

- Now the term "described" is well defined.
- Why C++?

- All (complete) programming languages/computational models are essentially equivalent.
- Let K′(x) be the description length of x in another complete language; then |K(x) − K′(x)| ≤ const.

- What is K(x) for x = 0101...01 (n pairs of 01)?
- The program "for i = 1 to n: print 01" outputs x, as sketched below
- Hence K(x) ≤ log n + const

- This is considered to be small complexity. We typically ignore log n factors.
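As an illustration, a minimal C++ sketch of such a program: the loop bound n is hard-coded, so writing the program down costs roughly log n bits for n plus a constant-size skeleton.

    #include <iostream>

    int main() {
        const long long n = 18;            // hard-coded; writing n down costs about log n bits
        for (long long i = 0; i < n; ++i)  // the loop itself is a constant-size skeleton
            std::cout << "01";             // n = 18 reproduces string 1 from the earlier slide
        std::cout << '\n';
        return 0;
    }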

- What is K(x) for x being the first n digits of π?
- K(x) ≤ log n + const

More examples

- What is K(x) for x ∈ {0, 1}^n with k ones?
- Recall that (n choose k) ≤ 2^(n·h(k/n))
- Hence K(x) ≤ log n + n·h(k/n)
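- A quick numerical illustration of this bound (base-2 logarithms): for n = 1000 and k = 250, h(1/4) ≈ 0.811, so K(x) ≤ log 1000 + 1000 · 0.811 ≈ 821, well below the trivial bound of about 1000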

Bounds

- K(x) ≤ |x| + const

- Proof: the program "output x"

- Most sequences have high Kolmogorov complexity:
- At most 2^(n−1) (C++) programs of length ≤ n − 2
- 2^n strings of length n
- Hence, at least 1/2 of the n-bit strings have Kolmogorov complexity at least n − 1
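- In more detail: the number of binary programs of length at most n − 2 is 2^0 + 2^1 + ... + 2^(n−2) = 2^(n−1) − 1 < 2^(n−1), and each program outputs at most one string, so at most half of the 2^n strings of length n can have Kolmogorov complexity ≤ n − 2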

- In particular, a random n-bit string has Kolmogorov complexity ≈ n

Conditional Kolmogorov complexity

- K(x|y): the Kolmogorov complexity of x given y, i.e., the length of the shortest program that outputs x on input y

- Chain rule: K(x, y) ≈ K(y) + K(x|y)

H vs. K

H(X) speaks about a random variable X and K(x) about a string x, but:

- Both quantify the amount of uncertainty or randomness in an object

- Both measure the number of bits it takes to describe an object

- Another property: let X_1, ..., X_n be iid; then whp

K(X_1, ..., X_n) ≈ H(X_1, ..., X_n) = n · H(X_1)

- Proof: AEP

- Example: for n flips of a (0.7, 0.3) coin, whp we get a string x with K(x) ≈ n · h(0.3)
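- Numerically, h(0.3) = −0.3 · log 0.3 − 0.7 · log 0.7 ≈ 0.881 (base-2 logarithms), so for n = 1000 flips a typical outcome has K(x) ≈ 881, noticeably below the trivial bound of about 1000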

Universal compression

- A program of length K(x) that outputs x compresses x into K(x) bits of information.

- Example: the length of the human genome is 6 · 10^9 bits

- But the code is redundant

- The relevant quantity for measuring its information content is the Kolmogorov complexity of the code.

- No one knows its value...

Universal probability

K(x) = min_{p : p() = x} |p|, where p() is the output of the C++ program defined by p.

Definition 1
The universal probability of a string x is
P_U(x) = Σ_{p : p() = x} 2^(−|p|) = Pr_{p←{0,1}^∞}[p() = x]

- Namely, the probability that, if one picks a program at random, it prints x.

- Insensitive (up to a constant factor) to the computation model.

- Interpretation: P_U(x) is the probability that you observe x in nature.

- Computer as an intelligent amplifier

Theorem 2
∃ c > 0 such that 2^(−K(x)) ≤ P_U(x) ≤ c · 2^(−K(x)) for every x ∈ {0, 1}^∗.

- The interesting part is P_U(x) ≤ c · 2^(−K(x))

- Hence, for X ∼ P_U, it holds that |E[K(X)] − H(X)| ≤ c
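- One way to see this (base-2 logarithms): taking logs in Theorem 2 gives K(x) − log c ≤ − log P_U(x) ≤ K(x) for every x; taking expectations over X ∼ P_U and using H(X) = E[− log P_U(X)] yields 0 ≤ E[K(X)] − H(X) ≤ log c ≤ c (note c ≥ 1)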

Proving Theorem 2

- We need to find c > 0 such that K(x) ≤ log(1/P_U(x)) + c for every x ∈ {0, 1}^∗
- In other words, find a program that outputs x whose length is at most log(1/P_U(x)) + c

- Idea: the program chooses x's leaf in the Shannon code for P_U (in which x is at depth ⌈log(1/P_U(x))⌉)

- Problem: P_U is not computable

- Solution: compute better and better estimates of the tree for P_U, along with the "mapping" from tree nodes back to codewords.

Proving Theorem 2

- Initialize T to be the infinite binary tree.

Program 3 (M)
Enumerate over all programs in {0, 1}^∗: at round i, emulate the first i programs (one after the other) for i steps, and do:
If a program p outputs a string x and (∗, x, n(x)) ∉ T, place (p, x, n(x)) at an unused n(x)-depth node of T, for n(x) = ⌈log(1/P̂_U(x))⌉ + 1 and P̂_U(x) = Σ_{p′ : emulated p′ has output x} 2^(−|p′|).

- The program never gets stuck (it can always add the node).
  Proof: Let x ∈ {0, 1}^∗. At each point during the execution of M, Σ_{(p,x,·)∈T} 2^(−|p|) ≤ 2^(−K(x)). Since Σ_x 2^(−K(x)) ≤ 1, the claim follows from the Kraft inequality.
- ∀ x ∈ {0, 1}^∗: M adds a node (·, x, ·) to T at depth 2 + ⌈log(1/P_U(x))⌉.
  Proof: P̂_U(x) converges to P_U(x).
- For x ∈ {0, 1}^∗, let ℓ(x) be the location of its (2 + ⌈log(1/P_U(x))⌉)-depth node.

- Program for printing x: run M until it assigns the node at location ℓ(x), then output the string x stored at that node.

Applications

- (Another) proof that there are infinitely many primes.

- Assume there are finitely many primes p_1, ..., p_m

- Any length-n integer x can be written as x = Π_{i=1}^m p_i^(d_i)

- d_i ≤ n, hence the binary length of each d_i is ≤ log n

- Hence, K(x) ≤ m · log n + const

- But for most n-bit numbers K(x) ≥ n − 1
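- For large enough n, however, m · log n + const < n − 1, so the two bounds contradict each other; hence there must be infinitely many primes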

Computability of K

- Can we compute K(x)?

- Answer: No.

- Proof: assume K is computable by a program of length C

- Let s be the smallest positive integer s.t. K(s) > 2C + 10,000

- s can be computed by the following program (a C++ sketch appears below):
  1. x = 0
  2. While (K(x) ≤ 2C + 10,000): x++
  3. Output x

- Thus K(s) ≤ C + log C + log 10,000 + const < 2C + 10,000, contradicting K(s) > 2C + 10,000
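For concreteness, a minimal C++ sketch of the program above; it assumes a hypothetical routine K_of() that computes K, and the stub below is only a placeholder so the sketch compiles, since (as the proof shows) no such computable routine can exist.

    #include <cstdint>
    #include <iostream>

    // Hypothetical: assumed to compute K(x). The proof shows no such computable
    // function exists; this stub is only a placeholder so the sketch compiles.
    uint64_t K_of(uint64_t /*x*/) { return 0; }

    int main() {
        const uint64_t C = 1000;          // stand-in for the length of the (hypothetical) program computing K
        uint64_t x = 0;
        while (K_of(x) <= 2 * C + 10000)  // scan for the first integer of high complexity
            ++x;
        std::cout << x << '\n';           // a short program, yet it would output s with K(s) > 2C + 10000
        return 0;
    }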

- Berry's paradox, revisited:

- s: the smallest positive number with K(s) > 10,000

- This is not a paradox, since the description of s is not short.

Explicit large complexity strings

- Can we give an explicit example of a string x with large K(x)?

Theorem 4
∃ a constant C s.t. the theorem "K(x) ≥ C" cannot be proven (under any reasonable axiom system).

- For most strings K(x) > C + 1, but this cannot be proven even for a single string

- "K(x) ≥ C" is an example of a theorem that cannot be proven and, for most x's, cannot be disproved.

- Proof: for an integer C, define the program T_C:
  1. y = 0
  2. If y is a proof of the statement "K(x) > C" for some x, output x and halt
  3. y++; go to step 2

- |T_C| = log C + D, where D is a constant
- Take C such that C > log C + D

- If T_C stops and outputs x, then K(x) ≤ log C + D < C, a contradiction to the fact that ∃ a proof that K(x) > C.

Part II

Other Entropy Measures

Other entropy measures

Let X ∼ p be a random variable over a finite set 𝒳.

- Recall that the Shannon entropy of X is H(X) = Σ_{x∈𝒳} −p(x) · log p(x) = E_X[− log p(X)]

- The max entropy of X is H_0(X) = log |Supp(X)|

- The min entropy of X is H_∞(X) = min_{x∈𝒳} {− log p(x)} = − log max_{x∈𝒳} {p(x)}
- The collision probability of X is CP(X) = Σ_{x∈𝒳} p(x)^2, the probability of a collision when drawing two independent samples from X

- The collision entropy/Renyi entropy of X is H_2(X) = − log CP(X)
- For α ∈ ℕ, α ≠ 1: H_α(X) = (1/(1−α)) · log Σ_{i=1}^n p_i^α = (α/(1−α)) · log ||p||_α

- H_∞(X) ≤ H_2(X) ≤ H(X) ≤ H_0(X) (Jensen). Equality iff X is uniform over 𝒳.
- For instance, CP(X) ≤ Σ_x p(x) · max_{x′} p(x′) = max_{x′} p(x′). Hence, H_2(X) ≥ − log max_{x′} p(x′) = H_∞(X).

- H_2(X) ≤ 2 · H_∞(X)
- Proof: CP(X) ≥ (max_{x′} p(x′))^2. Hence, H_2(X) = − log CP(X) ≤ −2 · log max_{x′} p(x′) = 2 · H_∞(X)
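To make the definitions above concrete, here is a small C++ sketch that computes H, H_0, H_2, H_∞ and CP for an example probability vector (the distribution is an arbitrary choice for illustration).

    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <vector>

    int main() {
        // Example distribution p = (1/2, 1/4, 1/8, 1/8) over a 4-element alphabet.
        std::vector<double> p = {0.5, 0.25, 0.125, 0.125};

        double H = 0, CP = 0, pmax = 0;
        int support = 0;
        for (double px : p) {
            if (px <= 0) continue;            // ignore zero-probability symbols
            H += -px * std::log2(px);         // Shannon entropy: E[-log p(X)]
            CP += px * px;                    // collision probability: sum of p(x)^2
            pmax = std::max(pmax, px);
            ++support;
        }
        double H0 = std::log2(support);       // max entropy: log |Supp(X)|
        double H2 = -std::log2(CP);           // collision (Renyi-2) entropy
        double Hinf = -std::log2(pmax);       // min entropy

        // For this p: H = 1.75, H0 = 2, H2 ~ 1.54, Hinf = 1, matching Hinf <= H2 <= H <= H0.
        std::cout << "H=" << H << " H0=" << H0 << " H2=" << H2 << " Hinf=" << Hinf << "\n";
        return 0;
    }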

Other entropy measures, cont.

- No simple chain rule.
- Let X = ⊥ w.p. 1/2 and uniform over {0, 1}^n otherwise, and let Y be the indicator for X = ⊥.

- H_∞(X|Y = 1) = 0 and H_∞(X|Y = 0) = n. But H_∞(X) = 1.
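- Indeed, the most likely value of X is ⊥, with probability 1/2 (every n-bit string has probability 2^(−(n+1))), so H_∞(X) = − log(1/2) = 1, even though conditioned on Y = 0 the min-entropy is n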

Section 1

Shannon to Min entropy

Shannon to Min entropy

Given a rv X ∼ p, let X^n denote n independent copies of X, and let p^n(x_1, ..., x_n) = Π_{i=1}^n p(x_i).

Lemma 5
Let X ∼ p and let ε > 0. Then Pr[− log p^n(X^n) ≤ n · (H(X) − ε)] < 2 · e^(−2ε²n).

Proof: (quantitative) AEP.

- A_{n,ε} := {x ∈ Supp(X^n) : 2^(−n(H(X)+ε)) ≤ p^n(x) ≤ 2^(−n(H(X)−ε))}
- − log p^n(x) ≥ n · (H(X) − ε) for any x ∈ A_{n,ε}

Proposition 6 (Hoeffding's inequality)
Let Z_1, ..., Z_n be iid over [0, 1] with expectation µ. Then Pr[|(1/n) Σ_{j=1}^n Z_j − µ| ≥ ε] ≤ 2 · e^(−2ε²n) for every ε > 0.

- Taking Z_i = log p(X_i), it follows that Pr[X^n ∉ A_{n,ε}] ≤ 2 · e^(−2ε²n)

Corollary 7
∃ a rv W that is (2 · e^(−2ε²n))-close to X^n and has H_∞(W) ≥ n(H(X) − ε).

Proof: W = X^n if X^n ∈ A_{n,ε}, and "well spread" outside Supp(X^n) otherwise.

Shannon to Min entropy, conditional version

Lemma 8
Let (X, Y) ∼ p and let ε > 0. Then Pr_{(x^n,y^n)←(X,Y)^n}[− log p_{X^n|Y^n}(x^n|y^n) ≤ n · (H(X|Y) − ε)] < 2 · e^(−2ε²n).

Proof: same proof, letting Z_i = log p_{X|Y}(X_i|Y_i)

Corollary 9

∃ a rv W over 𝒳^n × 𝒴^n that is (2 · e^(−2ε²n))-close to (X, Y)^n, such that
- SD(W_{Y^n}, Y^n) = 0, and
- H_∞(W | W_{Y^n} = y) ≥ n · (H(X|Y) − ε) for any y ∈ Supp(Y^n)

Proof:?

Section 2

Renyi-entropy to Uniform Distribution

Pairwise independent hashing

Definition 10 (pairwise independent function family)

A function family G = {g : D → R} is pairwise independent if, ∀ x ≠ x′ ∈ D and y, y′ ∈ R, it holds that Pr_{g←G}[g(x) = y ∧ g(x′) = y′] = (1/|R|)^2.

- Example: for D = {0, 1}^n and R = {0, 1}^m, let G = {(A, b) ∈ {0, 1}^(m×n) × {0, 1}^m} with (A, b)(x) = A · x + b, with arithmetic over GF(2) (a C++ sketch appears below).

- 2-universal families: Pr_{g←G}[g(x) = g(x′)] ≤ 1/|R| for every x ≠ x′.

- An example of a universal family that is not pairwise independent?

- Many-wise independent families

- We identify functions with their descriptions.

- An amazingly useful tool
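A minimal C++ sketch (C++20, for std::popcount) of the affine example from this slide, packing GF(2) vectors into 64-bit words; the helper names are mine and n, m are limited to 64 here.

    #include <bit>
    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <vector>

    // The affine family g_{A,b}(x) = A*x + b over GF(2): x in {0,1}^n, output in {0,1}^m.
    // Row i of A and the vectors x, b are packed into the low bits of 64-bit words.
    struct AffineHash {
        std::vector<uint64_t> A;  // m rows, each an n-bit row vector
        uint64_t b = 0;           // m-bit shift vector

        uint64_t operator()(uint64_t x) const {
            uint64_t y = 0;
            for (std::size_t i = 0; i < A.size(); ++i) {
                uint64_t bit = std::popcount(A[i] & x) & 1u;  // <row_i, x> over GF(2) = parity of the AND
                y |= bit << i;
            }
            return y ^ b;  // adding b over GF(2) is XOR
        }
    };

    // Sample a uniform member of the family; its description is (A, b).
    AffineHash sample_affine_hash(unsigned n, unsigned m, std::mt19937_64& rng) {
        std::uniform_int_distribution<uint64_t> row(0, n == 64 ? ~0ULL : (1ULL << n) - 1);
        std::uniform_int_distribution<uint64_t> out(0, m == 64 ? ~0ULL : (1ULL << m) - 1);
        AffineHash g;
        for (unsigned i = 0; i < m; ++i) g.A.push_back(row(rng));
        g.b = out(rng);
        return g;
    }

    int main() {
        std::mt19937_64 rng(std::random_device{}());
        AffineHash g = sample_affine_hash(16, 8, rng);  // g : {0,1}^16 -> {0,1}^8
        std::cout << g(0x1234) << "\n";
        return 0;
    }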

Leftover hash lemma

Lemma 11 (leftover hash lemma)
Let X be a rv over {0, 1}^n with H_2(X) ≥ k, let G = {g : {0, 1}^n → {0, 1}^m} be 2-universal, and let G ← G. Then SD((G, G(X)), (G, ∼{0, 1}^m)) ≤ (1/2) · 2^((m−k)/2).

Extraction.

Lemma 12
Let p be a distribution over U with CP(p) ≤ (1+δ)/|U|; then SD(p, ∼U) ≤ √δ / 2.

Proof: Let q be the uniform distribution over U.

- ||p − q||_2^2 = Σ_{u∈U} (p(u) − q(u))^2 = ||p||_2^2 + ||q||_2^2 − 2⟨p, q⟩ = CP(p) − 1/|U| ≤ δ/|U|
- Chebyshev sum inequality: (Σ_{i=1}^n a_i)^2 ≤ n · Σ_{i=1}^n a_i^2
- Hence, ||p − q||_1^2 ≤ |U| · ||p − q||_2^2
- Thus, SD(p, q) = (1/2) · ||p − q||_1 ≤ √δ / 2.

To deduce Lemma 11, we notice that CP(G, G(X)) ≤ (1/|G|) · (2^(−k) + 2^(−m)) = (1 + 2^(m−k)) / |G × {0, 1}^m|
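The collision-probability bound above can be seen as follows (a sketch using 2-universality and H_2(X) ≥ k): for an independent copy (G′, X′) of (G, X),
CP(G, G(X)) = Pr[G = G′] · Pr_{g,X,X′}[g(X) = g(X′)] ≤ (1/|G|) · (Pr[X = X′] + Pr_g[g(X) = g(X′) | X ≠ X′]) ≤ (1/|G|) · (CP(X) + 2^(−m)) ≤ (1/|G|) · (2^(−k) + 2^(−m)).
Applying Lemma 12 with U = G × {0, 1}^m and δ = 2^(m−k) then gives SD((G, G(X)), (G, ∼{0, 1}^m)) ≤ √(2^(m−k)) / 2 = (1/2) · 2^((m−k)/2), which is exactly Lemma 11.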
