Lecture 5: Asymptotic Equipartition Property
• Law of large numbers for products of random variables
• AEP and its consequences
Dr. Yao Xie, ECE587, Information Theory, Duke University

Stock market
• Initial investment Y_0, daily return ratio r_i; on day t, your money is

    Y_t = Y_0 r_1 \cdots r_t.
• Now suppose the return ratios r_i are i.i.d., with

    r_i = 4 w.p. 1/2,   r_i = 0 w.p. 1/2.
• So you think the expected return ratio is E r_i = 2,
• and then

    E Y_t = E(Y_0 r_1 \cdots r_t) = Y_0 (E r_i)^t = Y_0 2^t ?
Is “optimized” really optimal?
• With Y_0 = 1, the actual return Y_t goes like

    1, 4, 16, 0, 0, 0, ...
• Optimizing the expected return is not optimal?
• Fundamental reason: products do not behave like sums
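The products-vs-sums point can be checked with a short simulation (a hypothetical sketch; the seed, horizon, and trial count are arbitrary choices, not from the lecture):

```python
import random

# Hypothetical simulation of the stock example: r_i = 4 or 0, each w.p. 1/2.
random.seed(0)

def simulate(t, trials=10000):
    """Return (average Y_t over the trials, fraction of trials with Y_t > 0)."""
    total, survivors = 0.0, 0
    for _ in range(trials):
        y = 1.0                      # Y_0 = 1
        for _ in range(t):
            y *= random.choice([4.0, 0.0])
        total += y
        survivors += (y > 0)
    return total / trials, survivors / trials

avg, frac = simulate(t=10)
# E[Y_10] = 2^10 = 1024, but P{Y_10 > 0} = 2^-10: the expectation is
# driven entirely by rare surviving trajectories; the typical outcome is 0.
print(avg, frac)
```

Almost every trajectory hits a 0 factor and stays ruined, even though the average over trials can be large.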
(Weak) law of large numbers

Theorem. For independent, identically distributed (i.i.d.) random variables X_i,
    \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \to EX,  in probability.

• Convergence in probability: for every \epsilon > 0,

    P\{|X_n - X| > \epsilon\} \to 0.
• Proof by Markov's inequality.
• So this means
    P\{|\bar{X}_n - EX| \le \epsilon\} \to 1,  as n \to \infty.
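A quick numerical illustration of this convergence (a sketch, not part of the lecture; the Uniform(0,1) distribution and the sample sizes are arbitrary choices):

```python
import random

# Sample means of i.i.d. Uniform(0,1) draws concentrate around EX = 1/2:
# the fraction of trials with |mean - 1/2| <= eps should approach 1 as n grows.
random.seed(1)

def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

eps, trials = 0.05, 200
results = {}
for n in (10, 100, 10000):
    inside = sum(abs(sample_mean(n) - 0.5) <= eps for _ in range(trials))
    results[n] = inside / trials
    print(n, results[n])
```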
Other types of convergence
• In mean square if, as n \to \infty,

    E(X_n - X)^2 \to 0.

• With probability 1 (almost surely) if

    P\{\lim_{n \to \infty} X_n = X\} = 1.

• In distribution if

    \lim_{n \to \infty} F_n = F

at every continuity point of F, where F_n and F are the cumulative distribution functions of X_n and X.
Product of random variables
• How does this behave?

    \left( \prod_{i=1}^n X_i \right)^{1/n}

• Geometric mean vs. arithmetic mean:

    \left( \prod_{i=1}^n X_i \right)^{1/n} \le \frac{1}{n} \sum_{i=1}^n X_i
• Examples:
– Volume V of a random box with dimensions X_i: V = X_1 \cdots X_n
– Stock return: Y_t = Y_0 r_1 \cdots r_t
– Joint distribution of i.i.d. RVs: p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i)
Law of large numbers for products of random variables
• We can write X_i = e^{\log X_i}.

• Hence

    \left( \prod_{i=1}^n X_i \right)^{1/n} = e^{\frac{1}{n} \sum_{i=1}^n \log X_i}

• So from the LLN,

    \left( \prod_{i=1}^n X_i \right)^{1/n} \to e^{E(\log X)} \le e^{\log EX} = EX,

where the inequality is Jensen's inequality.
• Stock example:
    E \log r_i = \frac{1}{2} \log 4 + \frac{1}{2} \log 0 = -\infty

    Y_t \approx Y_0 e^{t E \log r_i} \to 0,  t \to \infty.
• Example:

    X = a w.p. 1/2,   X = b w.p. 1/2.

    \left( \prod_{i=1}^n X_i \right)^{1/n} \to \sqrt{ab} \le \frac{a+b}{2}
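A numeric sketch of this limit (a = 1, b = 4, the seed, and n are illustrative choices, not from the lecture):

```python
import math
import random

# For X = a or b each w.p. 1/2, the geometric mean (prod X_i)^(1/n)
# tends to e^{E log X} = sqrt(ab), strictly below the arithmetic-mean
# limit EX = (a + b)/2 (Jensen's inequality).
random.seed(2)
a, b = 1.0, 4.0
n = 100000
xs = [random.choice([a, b]) for _ in range(n)]

# Work in logs so the product of 100000 factors cannot overflow.
geo = math.exp(sum(math.log(x) for x in xs) / n)
ari = sum(xs) / n

print(geo, ari)  # geo near sqrt(ab) = 2, ari near (a + b)/2 = 2.5
```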
Asymptotic equipartition property (AEP)
• The LLN states that

    \frac{1}{n} \sum_{i=1}^n X_i \to EX

• The AEP states that for most sequences

    \frac{1}{n} \log \frac{1}{p(X_1, X_2, \ldots, X_n)} \to H(X)

so that

    p(X_1, X_2, \ldots, X_n) \approx 2^{-nH(X)}
• Analyze using the LLN for products of random variables
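The AEP limit can be observed numerically (a sketch; Bernoulli with p = 0.8, the seed, and n are arbitrary illustrative choices):

```python
import math
import random

# For X_i i.i.d. Bernoulli(p), -(1/n) log2 p(X_1,...,X_n) should be
# close to H(X) = -p log2 p - (1-p) log2 (1-p) when n is large.
random.seed(3)
p = 0.8
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sample_entropy(n):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    log_p = sum(math.log2(p) if x == 1 else math.log2(1 - p) for x in xs)
    return -log_p / n

est = sample_entropy(100000)
print(est, H)  # the two numbers should be close
```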
AEP lies at the heart of information theory.
• Proof of the lossless source coding theorem
• Proof of the channel capacity theorem
• and more...
AEP
Theorem. If X1,X2,... are i.i.d. ∼ p(x), then
    -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X),  in probability.
Proof:
    -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i) \to -E \log p(X) = H(X).
There are several consequences.
Typical set
The typical set A_\epsilon^{(n)} contains all sequences (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n with the property

    2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.
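For small n the typical set can be enumerated by brute force directly from this definition (a sketch; p = 0.8, n = 10, and eps = 0.25 are arbitrary illustrative choices):

```python
import math
from itertools import product

# Enumerate all 2^n binary sequences and keep those satisfying
# 2^{-n(H+eps)} <= p(x_1,...,x_n) <= 2^{-n(H-eps)}.
p, n, eps = 0.8, 10, 0.25
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def prob(seq):
    return math.prod(p if x == 1 else 1 - p for x in seq)

typical = [seq for seq in product([0, 1], repeat=n)
           if 2 ** (-n * (H + eps)) <= prob(seq) <= 2 ** (-n * (H - eps))]

total_prob = sum(prob(seq) for seq in typical)
print(len(typical), 2 ** (n * (H + eps)), total_prob)
# |A| stays below the bound 2^{n(H+eps)}, and the set already carries
# most of the probability even at n = 10.
```

Note that the all-ones sequence is excluded: it is *too* probable, not too rare.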
Not all sequences are created equal
• Coin tossing example: X ∈ {0, 1}, p(1) = 0.8
    p(1, 0, 1, 1, 0, 1) = p^{\sum x_i} (1-p)^{6 - \sum x_i} = p^4 (1-p)^2 \approx 0.0164

    p(0, 0, 0, 0, 0, 0) = (1-p)^6 = 0.000064
• In this example, if (x_1, \ldots, x_n) \in A_\epsilon^{(n)}, then

    H(X) - \epsilon \le -\frac{1}{n} \log p(x_1, \ldots, x_n) \le H(X) + \epsilon.
• This means a binary sequence is in the typical set if its fraction of 1s, k/n, is approximately p.
Dr. Yao Xie, ECE587, Information Theory, Duke University 12 •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ◭ •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ◭ •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ◭
•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
Dr. Yao Xie, ECE587, Information Theory, Duke University 13 p = 0.6, n = 25, k = number of “1”s
 k   (n choose k)   (n choose k) p^k (1-p)^{n-k}   -(1/n) log2 p(x^n)
 0          1              0.000000                  1.321928
 1         25              0.000000                  1.298530
 2        300              0.000000                  1.275131
 3       2300              0.000001                  1.251733
 4      12650              0.000007                  1.228334
 5      53130              0.000045                  1.204936
 6     177100              0.000227                  1.181537
 7     480700              0.000925                  1.158139
 8    1081575              0.003121                  1.134740
 9    2042975              0.008843                  1.111342
10    3268760              0.021222                  1.087943
11    4457400              0.043410                  1.064545
12    5200300              0.075967                  1.041146
13    5200300              0.113950                  1.017748
14    4457400              0.146507                  0.994349
15    3268760              0.161158                  0.970951
16    2042975              0.151086                  0.947552
17    1081575              0.119980                  0.924154
18     480700              0.079986                  0.900755
19     177100              0.044203                  0.877357
20      53130              0.019891                  0.853958
21      12650              0.007104                  0.830560
22       2300              0.001937                  0.807161
23        300              0.000379                  0.783763
24         25              0.000047                  0.760364
25          1              0.000003                  0.736966
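The last column can be reproduced directly from the table's definitions (a small script; eps = 0.1 is an arbitrary illustrative choice):

```python
import math

# For p = 0.6, n = 25, a sequence with k ones has probability
# p^k (1-p)^(n-k), so -(1/n) log2 p(x^n) depends only on k.
p, n = 0.6, 25
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate(k):
    return -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n

# Which counts k give typical sequences when eps = 0.1?
eps = 0.1
typical_k = [k for k in range(n + 1) if abs(rate(k) - H) <= eps]
print(round(H, 6), typical_k)
```

Note that k = 15 (the expected number of ones) gives exactly H(X) = 0.970951 bits per symbol.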
Consequences of AEP
Theorem.

1. If (x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}, then

    H(X) - \epsilon \le -\frac{1}{n} \log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon.

2. P\{A_\epsilon^{(n)}\} \ge 1 - \epsilon for n sufficiently large.

3. |A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}.

4. |A_\epsilon^{(n)}| \ge (1 - \epsilon) 2^{n(H(X)-\epsilon)} for n sufficiently large.
Property 1

If (x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}, then
    H(X) - \epsilon \le -\frac{1}{n} \log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon.
• Proof: immediate from the definition, since (x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)} iff

    2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)};

taking \log_2 and dividing by -n gives the claim.
• The number of bits needed to describe a sequence in the typical set is approximately nH(X).
Property 2

P\{A_\epsilon^{(n)}\} \ge 1 - \epsilon for n sufficiently large.

• Proof: From the AEP, since

    -\frac{1}{n} \log p(X_1, \ldots, X_n) \to H(X)

in probability, for a given \epsilon > 0 and n sufficiently large,

    P\left\{ \left| -\frac{1}{n} \log p(X_1, \ldots, X_n) - H(X) \right| \le \epsilon \right\} \ge 1 - \epsilon,

and the event in braces is exactly (X_1, \ldots, X_n) \in A_\epsilon^{(n)}.
– High probability: sequences in the typical set are the “most typical” ones.
– These sequences all have nearly the same probability: “equipartition”.
Properties 3 and 4: size of the typical set
    (1 - \epsilon) 2^{n(H(X)-\epsilon)} \le |A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}

• Proof of the upper bound:

    1 = \sum_{(x_1, \ldots, x_n)} p(x_1, \ldots, x_n)
      \ge \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n)
      \ge \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)}
      = |A_\epsilon^{(n)}| \, 2^{-n(H(X)+\epsilon)}.
On the other hand, P\{A_\epsilon^{(n)}\} \ge 1 - \epsilon for n sufficiently large, so

    1 - \epsilon \le \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n)
                 \le \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)}
                 = |A_\epsilon^{(n)}| \, 2^{-n(H(X)-\epsilon)}.
• Size of typical set depends on H(X).
• When p = 1/2 in the coin tossing example, H(X) = 1 and 2^{nH(X)} = 2^n: every sequence is typical.
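A one-line check of the p = 1/2 case (n = 12 is an arbitrary choice):

```python
import math

# For p = 1/2 every length-n binary sequence has probability 2^(-n),
# so the sample entropy -(1/n) log2 p(x^n) equals H(X) = 1 exactly,
# and every one of the 2^n sequences is typical.
n = 12
seq_prob = 0.5 ** n
sample_entropy = -math.log2(seq_prob) / n
print(sample_entropy)  # → 1.0
```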
Typical set diagram
• This lets us divide all sequences into two sets:
  – Typical set: high probability, sample entropy close to the true entropy, so we focus on analyzing sequences in the typical set.
  – Non-typical set: small probability, can generally be ignored.
[Figure: the set \mathcal{X}^n of |\mathcal{X}|^n sequences, split into the non-typical set and the typical set A_\epsilon^{(n)} with at most 2^{n(H+\epsilon)} elements.]
Data compression scheme from AEP
• Let X_1, X_2, \ldots, X_n be i.i.d. RVs drawn from p(x)
• We wish to find short descriptions for such sequences of RVs
• Divide all sequences in \mathcal{X}^n into two sets
– Non-typical set: description takes n \log |\mathcal{X}| + 2 bits
– Typical set: description takes n(H + \epsilon) + 2 bits
• Use one bit to indicate which set:
  – Typical set A_\epsilon^{(n)}: use prefix “1”. Since there are no more than 2^{n(H(X)+\epsilon)} typical sequences, indexing requires no more than \lceil n(H(X)+\epsilon) \rceil bits, plus the one prefix bit.
  – Non-typical set: use prefix “0”. Since there are at most |\mathcal{X}|^n sequences, indexing requires no more than \lceil n \log |\mathcal{X}| \rceil bits, plus the one prefix bit.
• Notation: x^n = (x_1, \ldots, x_n), and l(x^n) = length of the codeword for x^n.

• We can prove that, for n sufficiently large,

    E\left[ \frac{1}{n} l(X^n) \right] \le H(X) + \epsilon.
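The expected code length can be computed exactly for a Bernoulli source by summing over the number of ones k (a sketch; p = 0.8, n = 2000, and eps = 0.12 are arbitrary illustrative choices):

```python
import math

# Two-part code: prefix "1" + ceil(n(H+eps)) bits for a typical sequence,
# prefix "0" + ceil(n log2 |X|) bits otherwise (|X| = 2 here).
p, n, eps = 0.8, 2000, 0.12
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

len_typ = 1 + math.ceil(n * (H + eps))  # prefix bit + typical-set index
len_raw = 1 + n                         # prefix bit + raw index, ceil(n log2 2) = n

def pmf(k):
    # Binomial(n, p) probability of k ones, via lgamma to avoid underflow.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def typical(k):
    rate = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
    return abs(rate - H) <= eps

expected_len = sum(pmf(k) * (len_typ if typical(k) else len_raw)
                   for k in range(n + 1))
print(expected_len / n, H + eps + 2 / n)  # per-symbol rate vs. H + eps + overhead
```

The per-symbol rate lands within the 2/n-bit overhead of H + eps, matching the bound above once eps absorbs the rounding and prefix bits.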
Summary of AEP
Almost everything is almost equally probable.
• Reasons that the AEP involves H(X):
  – -\frac{1}{n} \log p(x^n) \to H(X), in probability
  – n(H(X) + \epsilon) bits suffice to describe the random sequence on average
  – 2^{H(X)} is the effective alphabet size
  – The typical set is the smallest set with probability near 1
  – The size of the typical set is about 2^{nH(X)}
  – The distribution over the typical set is nearly uniform
Next Time

• AEP is about the property of independent sequences
• What about dependent processes?
• The answer is the entropy rate (next time).