Lecture 5: Asymptotic Equipartition Property

• Law of large numbers for products of random variables
• AEP and consequences

Dr. Yao Xie, ECE587, Information Theory, Duke University

Stock market

• Initial investment Y_0, daily return ratio r_i; on day t, your money is

Y_t = Y_0 r_1 · ... · r_t.

• Now suppose the return ratios r_i are i.i.d., with

r_i = 4, w.p. 1/2
r_i = 0, w.p. 1/2.

• So you think the expected return ratio is E r_i = 2,

• and then

E Y_t = E(Y_0 r_1 · ... · r_t) = Y_0 (E r_i)^t = Y_0 2^t ?

Is “optimized” really optimal?

• With Y_0 = 1, the actual return Y_t goes like

1, 4, 16, 0, 0, 0, ...

• Optimizing the expected return is not optimal?

• Fundamental reason: products do not behave the same way as sums
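A quick simulation of the example above makes the point concrete (the function name and trial count are my own choices): although E Y_t = 2^t, the typical realization of Y_t is 0.

```python
import random

def simulate_returns(t, trials=10000, seed=0):
    """Simulate Y_t = r_1 * ... * r_t (Y_0 = 1), with r_i = 4 or 0, each w.p. 1/2."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(trials):
        y = 1.0
        for _ in range(t):
            y *= 4.0 if rng.random() < 0.5 else 0.0
        outcomes.append(y)
    return outcomes

outcomes = simulate_returns(t=10)
# The expectation E Y_10 = 2^10 = 1024, but almost every realization is 0:
frac_zero = sum(o == 0.0 for o in outcomes) / len(outcomes)  # about 1 - 2^{-10}
```

In any run, all but about ten of the 10000 paths hit a zero return and stay there, which is why maximizing E Y_t alone is misleading.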

(Weak) Law of large numbers

Theorem. For independent, identically distributed (i.i.d.) random variables X_i,

X̄_n = (1/n) ∑_{i=1}^n X_i → EX, in probability.

• Convergence in probability means that for every ϵ > 0,

P {|Xn − X| > ϵ} → 0.

• Proof by Markov's inequality (applied to (X̄_n − EX)^2, i.e., Chebyshev's inequality).

• So this means

P {|X̄_n − EX| ≤ ϵ} → 1, as n → ∞.

Other types of convergence

• In mean square if, as n → ∞,

E(X_n − X)^2 → 0

• With probability 1 (almost surely) if

P { lim_{n→∞} X_n = X } = 1

• In distribution if

lim_{n→∞} F_n = F,

where F_n and F are the cumulative distribution functions of X_n and X.

Product of random variables

• How does this behave?

(∏_{i=1}^n X_i)^{1/n}

• Geometric mean ≤ arithmetic mean:

(∏_{i=1}^n X_i)^{1/n} ≤ (1/n) ∑_{i=1}^n X_i

• Examples:

– Volume V of a random box with dimensions X_i: V = X_1 · ... · X_n
– Stock return: Y_t = Y_0 r_1 · ... · r_t
– Joint distribution of i.i.d. RVs: p(x_1, . . . , x_n) = ∏_{i=1}^n p(x_i)
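The geometric-mean/arithmetic-mean inequality above is easy to check numerically; this sketch (variable names mine) draws random positive vectors and verifies it on each draw.

```python
import math
import random

rng = random.Random(1)
for _ in range(100):
    xs = [rng.uniform(0.1, 10.0) for _ in range(20)]
    geo = math.prod(xs) ** (1.0 / len(xs))  # geometric mean (prod X_i)^{1/n}
    ari = sum(xs) / len(xs)                 # arithmetic mean (1/n) sum X_i
    assert geo <= ari + 1e-9, (geo, ari)    # AM-GM: never violated
```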

Law of large numbers for products of random variables

• We can write X_i = e^{log X_i}

• Hence

(∏_{i=1}^n X_i)^{1/n} = e^{(1/n) ∑_{i=1}^n log X_i}

• So from the LLN,

(∏_{i=1}^n X_i)^{1/n} → e^{E(log X)} ≤ e^{log EX} = EX.
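A simulation of this limit, with X uniform on {1, 2, 3} (my choice of distribution): the n-th root of the product converges to e^{E log X} = 6^{1/3} ≈ 1.82, strictly below EX = 2.

```python
import math
import random

rng = random.Random(0)
n = 100000
xs = [rng.choice([1.0, 2.0, 3.0]) for _ in range(n)]

# Work in logs: (prod X_i)^{1/n} = exp((1/n) sum log X_i), avoiding overflow
geo = math.exp(sum(math.log(x) for x in xs) / n)

limit = (1.0 * 2.0 * 3.0) ** (1.0 / 3.0)  # e^{E log X} = 6^{1/3} ~ 1.817
mean = 2.0                                # EX
```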

• Stock example:

E log r_i = (1/2) log 4 + (1/2) log 0 = −∞

(Y_t)^{1/t} → e^{E log r_i} = 0, so Y_t → 0, t → ∞.

• Example

X = a, w.p. 1/2
X = b, w.p. 1/2.

(∏_{i=1}^n X_i)^{1/n} → √(ab) ≤ (a + b)/2

Asymptotic equipartition property (AEP)

• LLN states that

(1/n) ∑_{i=1}^n X_i → EX

• AEP states that for most sequences

(1/n) log (1 / p(X_1, X_2, . . . , X_n)) → H(X)

so that

p(X_1, X_2, . . . , X_n) ≈ 2^{−nH(X)}

• Analyze using LLN for product of random variables
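A numerical illustration (parameters my own): for i.i.d. Bernoulli(0.8) symbols, the normalized log-likelihood −(1/n) log2 p(X_1, . . . , X_n) concentrates around H(X) ≈ 0.722 bits.

```python
import math
import random

p = 0.8
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # H(X) ~ 0.722 bits

rng = random.Random(0)
n = 100000
xs = [1 if rng.random() < p else 0 for _ in range(n)]

# -(1/n) log2 p(X_1, ..., X_n) = -(1/n) sum_i log2 p(X_i) for i.i.d. symbols
sample_entropy = -sum(math.log2(p if x == 1 else 1 - p) for x in xs) / n
```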

AEP lies at the heart of information theory.

• Proof for lossless source coding

• Proof for

• and more...

AEP

Theorem. If X1,X2,... are i.i.d. ∼ p(x), then

−(1/n) log p(X_1, X_2, . . . , X_n) → H(X), in probability.

Proof:

−(1/n) log p(X_1, X_2, · · · , X_n) = −(1/n) ∑_{i=1}^n log p(X_i)
→ −E log p(X) = H(X).

There are several consequences.


A typical set

The typical set A_ϵ^{(n)} contains all sequences (x_1, x_2, . . . , x_n) ∈ X^n with the property

2^{−n(H(X)+ϵ)} ≤ p(x_1, x_2, . . . , x_n) ≤ 2^{−n(H(X)−ϵ)}.
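The definition translates directly into a membership test. Here is a sketch for a binary i.i.d. source (function name mine), working with log-probabilities to avoid underflow.

```python
import math

def is_typical(seq, p, eps):
    """Is the 0/1 sequence in A_eps^{(n)} for an i.i.d. Bernoulli(p) source?

    Checks 2^{-n(H+eps)} <= p(x_1,...,x_n) <= 2^{-n(H-eps)} in log form.
    """
    n = len(seq)
    H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    log_prob = sum(math.log2(p if s == 1 else 1 - p) for s in seq)
    return -n * (H + eps) <= log_prob <= -n * (H - eps)

# A sequence with the "right" fraction of 1s is typical; all zeros is not
is_typical([1, 1, 1, 1, 0] * 5, p=0.8, eps=0.1)  # True: 20 ones out of 25
is_typical([0] * 25, p=0.8, eps=0.1)             # False
```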

Not all sequences are created equal

• Coin tossing example: X ∈ {0, 1}, p(1) = 0.8

p(1, 0, 1, 1, 0, 1) = p^{∑ x_i} (1 − p)^{6 − ∑ x_i} = p^4 (1 − p)^2 = 0.0164

p(0, 0, 0, 0, 0, 0) = (1 − p)^6 = 0.000064

• In this example, if (x_1, . . . , x_n) ∈ A_ϵ^{(n)}, then

H(X) − ϵ ≤ −(1/n) log p(x_1, . . . , x_n) ≤ H(X) + ϵ.

• This means a binary sequence is in the typical set if its frequency of heads k/n is approximately p
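The two probabilities above can be reproduced by a one-line function (name mine):

```python
def seq_prob(seq, p):
    """Probability of a specific 0/1 sequence under i.i.d. Bernoulli(p)."""
    k = sum(seq)                            # number of 1s
    return p ** k * (1 - p) ** (len(seq) - k)

p = 0.8
p1 = seq_prob([1, 0, 1, 1, 0, 1], p)  # p^4 (1-p)^2 ~ 0.0164
p2 = seq_prob([0, 0, 0, 0, 0, 0], p)  # (1-p)^6 ~ 0.000064
```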


p = 0.6, n = 25, k = number of “1”s

k    (n choose k)   (n choose k) p^k (1−p)^{n−k}   −(1/n) log p(x^n)
0    1              0.000000                       1.321928
1    25             0.000000                       1.298530
2    300            0.000000                       1.275131
3    2300           0.000001                       1.251733
4    12650          0.000007                       1.228334
5    53130          0.000054                       1.204936
6    177100         0.000227                       1.181537
7    480700         0.001205                       1.158139
8    1081575        0.003121                       1.134740
9    2042975        0.013169                       1.111342
10   3268760        0.021222                       1.087943
11   4457400        0.077801                       1.064545
12   5200300        0.075967                       1.041146
13   5200300        0.267718                       1.017748
14   4457400        0.146507                       0.994349
15   3268760        0.575383                       0.970951
16   2042975        0.151086                       0.947552
17   1081575        0.846448                       0.924154
18   480700         0.079986                       0.900755
19   177100         0.970638                       0.877357
20   53130          0.019891                       0.853958
21   12650          0.997633                       0.830560
22   2300           0.001937                       0.807161
23   300            0.999950                       0.783763
24   25             0.000047                       0.760364
25   1              0.000003                       0.736966
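The last column of the table depends only on k and can be recomputed directly (function name mine): −(1/n) log2 p(x^n) = −(k/n) log2 p − ((n−k)/n) log2(1−p).

```python
import math

def normalized_log_prob(k, n=25, p=0.6):
    """-(1/n) log2 p(x^n) for a binary sequence with k ones out of n."""
    return -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n

# Endpoints match the table: all zeros gives -log2(0.4), all ones -log2(0.6)
normalized_log_prob(0)   # ~ 1.321928
normalized_log_prob(25)  # ~ 0.736966
```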

Consequences of AEP

Theorem.

1. If (x_1, x_2, . . . , x_n) ∈ A_ϵ^{(n)}, then

H(X) − ϵ ≤ −(1/n) log p(x_1, x_2, . . . , x_n) ≤ H(X) + ϵ

2. P {A_ϵ^{(n)}} ≥ 1 − ϵ for n sufficiently large.

3. |A_ϵ^{(n)}| ≤ 2^{n(H(X)+ϵ)}.

4. |A_ϵ^{(n)}| ≥ (1 − ϵ) 2^{n(H(X)−ϵ)} for n sufficiently large.
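For a small n these properties can be checked by brute force. The sketch below (parameters my own: Bernoulli(0.8), n = 12, ϵ = 0.12) enumerates all 2^n sequences; note that at such a small n the probability bound of property 2 need not hold yet, but property 3 always does.

```python
import math
from itertools import product

p, eps, n = 0.8, 0.12, 12
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

typical_size = 0
typical_prob = 0.0
for seq in product([0, 1], repeat=n):
    k = sum(seq)
    prob = p ** k * (1 - p) ** (n - k)
    # Membership test from the definition of A_eps^{(n)}
    if 2 ** (-n * (H + eps)) <= prob <= 2 ** (-n * (H - eps)):
        typical_size += 1
        typical_prob += prob

assert typical_size <= 2 ** (n * (H + eps))  # property 3 holds for every n
```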

Property 1

If (x_1, x_2, . . . , x_n) ∈ A_ϵ^{(n)}, then

H(X) − ϵ ≤ −(1/n) log p(x_1, x_2, . . . , x_n) ≤ H(X) + ϵ.

• Proof from the definition:

(x_1, x_2, . . . , x_n) ∈ A_ϵ^{(n)}

if

2^{−n(H(X)+ϵ)} ≤ p(x_1, x_2, . . . , x_n) ≤ 2^{−n(H(X)−ϵ)};

take logarithms and divide by −n.

• The number of bits used to describe sequences in typical set is approximately nH(X).

Property 2

P {A_ϵ^{(n)}} ≥ 1 − ϵ for n sufficiently large.

• Proof: From the AEP, because −(1/n) log p(X_1, . . . , X_n) → H(X) in probability, for a given ϵ > 0, when n is sufficiently large

P { |−(1/n) log p(X_1, . . . , X_n) − H(X)| ≤ ϵ } ≥ 1 − ϵ,

and the event on the left is exactly the event (X_1, . . . , X_n) ∈ A_ϵ^{(n)}.

– High probability: sequences in the typical set are “most typical”.
– These sequences almost all have the same probability (“equipartition”).

Properties 3 and 4: size of the typical set

(1 − ϵ) 2^{n(H(X)−ϵ)} ≤ |A_ϵ^{(n)}| ≤ 2^{n(H(X)+ϵ)}

• Proof:

1 = ∑_{(x_1,...,x_n)} p(x_1, . . . , x_n)
  ≥ ∑_{(x_1,...,x_n) ∈ A_ϵ^{(n)}} p(x_1, . . . , x_n)
  ≥ ∑_{(x_1,...,x_n) ∈ A_ϵ^{(n)}} 2^{−n(H(X)+ϵ)}
  = |A_ϵ^{(n)}| 2^{−n(H(X)+ϵ)}.

On the other hand, P {A_ϵ^{(n)}} ≥ 1 − ϵ for n sufficiently large, so

1 − ϵ < ∑_{(x_1,...,x_n) ∈ A_ϵ^{(n)}} p(x_1, . . . , x_n)
      ≤ |A_ϵ^{(n)}| 2^{−n(H(X)−ϵ)}.

• Size of typical set depends on H(X).

• When p = 1/2 in the coin tossing example, H(X) = 1 and 2^{nH(X)} = 2^n: all sequences are typical sequences.

Typical set diagram

• This enables us to divide all sequences into two sets:
– Typical set: high probability of occurring; the sample entropy is close to the true entropy, so we focus on analyzing sequences in the typical set
– Non-typical set: small probability; can be ignored in general

[Diagram: the set X^n of all |X|^n sequences, divided into the non-typical set and the typical set A_ϵ^{(n)} with ≈ 2^{n(H+ϵ)} elements]

Data compression scheme from AEP

• Let X_1, X_2, . . . , X_n be i.i.d. RVs drawn from p(x)

• We wish to find short descriptions for such sequences of RVs

Dr. Yao Xie, ECE587, Information Theory, Duke University 21 • Divide all sequences in X n into two sets

Non-typical set Description: n log | | + 2 bits

Typical set Description: n(H + ∋ ) + 2 bits

Dr. Yao Xie, ECE587, Information Theory, Duke University 22 • Use one bit to indicate which set (n) – Typical set Aϵ use prefix “1” Since there are no more than 2n(H(X)+ϵ) sequences, indexing requires no more than ⌈(H(X) + ϵ)⌉ + 1 (plus one extra bit) – Non-typical set use prefix “0” Since there are at most |X |n sequences, indexing requires no more than ⌈n log |X |⌉ + 1

• Notation: x^n = (x_1, . . . , x_n); l(x^n) = length of the codeword for x^n

• We can prove

E [ (1/n) l(X^n) ] ≤ H(X) + ϵ
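A sketch of the scheme's average length (function name and parameters my own): one flag bit plus ⌈n(H+ϵ)⌉ index bits for typical sequences, one flag bit plus ⌈n log2 |X|⌉ bits otherwise. Sampling from a Bernoulli(0.8) source with n = 200 already gives a per-symbol cost well below 1 bit.

```python
import math
import random

def code_length(seq, p, eps):
    """Bits used by the two-set scheme: 1 flag bit + index bits."""
    n = len(seq)
    H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    log_prob = sum(math.log2(p if s == 1 else 1 - p) for s in seq)
    typical = -n * (H + eps) <= log_prob <= -n * (H - eps)
    if typical:
        return 1 + math.ceil(n * (H + eps))  # index within A_eps^{(n)}
    return 1 + math.ceil(n * math.log2(2))   # index within X^n, |X| = 2

rng = random.Random(0)
p, eps, n, trials = 0.8, 0.12, 200, 2000
total = sum(
    code_length([1 if rng.random() < p else 0 for _ in range(n)], p, eps)
    for _ in range(trials)
)
avg_per_symbol = total / (trials * n)  # roughly H + eps plus small overhead
```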

Summary of AEP

Almost everything is almost equally probable.

• Reasons that the AEP matters:
– −(1/n) log p(x^n) → H(X), in probability
– n(H(X) ± ϵ) bits suffice to describe the random sequence on average
– 2^{H(X)} is the effective alphabet size
– The typical set is the smallest set with probability near 1
– The size of the typical set is about 2^{nH(X)}
– The distribution of elements within the typical set is nearly uniform

Next Time

• AEP is about the property of independent sequences

• What about dependent processes?

• The answer: next time.
