Asymptotic Equipartition Property
Lecture 5: Asymptotic Equipartition Property

• Law of large numbers for products of random variables
• AEP and its consequences

Dr. Yao Xie, ECE587, Information Theory, Duke University

Stock market

• Initial investment Y_0, daily return ratio r_i; on day t, your money is

  Y_t = Y_0 · r_1 · … · r_t.

• Now suppose the return ratios r_i are i.i.d., with

  r_i = 4 w.p. 1/2,   r_i = 0 w.p. 1/2.

• So you think the expected return ratio is E r_i = 2,
• and then

  E Y_t = E(Y_0 · r_1 · … · r_t) = Y_0 (E r_i)^t = Y_0 · 2^t ?

Is "optimized" really optimal?

• With Y_0 = 1, the actual return Y_t goes like 1, 4, 16, 0, 0, 0, …
• So optimizing the expected return is not optimal?
• Fundamental reason: products do not behave the same way as sums.

(Weak) law of large numbers

Theorem. For independent, identically distributed (i.i.d.) random variables X_i,

  X̄_n = (1/n) Σ_{i=1}^n X_i → E X, in probability.

• Convergence in probability: for every ϵ > 0, P{|X_n − X| > ϵ} → 0.
• Proof by the Markov inequality.
• So this means P{|X̄_n − E X| ≤ ϵ} → 1 as n → ∞.

Other types of convergence

• In mean square: E(X_n − X)^2 → 0 as n → ∞.
• With probability 1 (almost surely): P{lim_{n→∞} X_n = X} = 1.
• In distribution: lim_n F_n = F, where F_n and F are the cumulative distribution functions of X_n and X.

Product of random variables

• How does (Π_{i=1}^n X_i)^{1/n} behave?
• Geometric mean ≤ arithmetic mean: (Π_{i=1}^n X_i)^{1/n} ≤ (1/n) Σ_{i=1}^n X_i.
• Examples:
  – Volume V of a random box with dimensions X_i: V = X_1 · … · X_n
  – Stock return: Y_t = Y_0 · r_1 · … · r_t
  – Joint distribution of i.i.d. RVs: p(x_1, …, x_n) = Π_{i=1}^n p(x_i)

Law of large numbers for products of random variables

• We can write X_i = e^{log X_i}.
• Hence

  (Π_{i=1}^n X_i)^{1/n} = e^{(1/n) Σ_{i=1}^n log X_i}.

• So from the LLN (with Jensen's inequality giving the bound),

  (Π_{i=1}^n X_i)^{1/n} → e^{E(log X)} ≤ e^{log E X} = E X.

• Stock example:

  E log r_i = (1/2) log 4 + (1/2) log 0 = −∞,

  so Y_t ≈ Y_0 e^{t · E log r_i} = 0 as t → ∞.

• Example:

  X = a w.p. 1/2,   X = b w.p. 1/2.

  Then (Π_{i=1}^n X_i)^{1/n} → √(ab) ≤ (a + b)/2.

Asymptotic equipartition property (AEP)

• The LLN states that

  (1/n) Σ_{i=1}^n X_i → E X.

• The AEP states that for most sequences

  (1/n) log (1 / p(X_1, X_2, …, X_n)) → H(X),

  i.e., p(X_1, X_2, …, X_n) ≈ 2^{−nH(X)}.

• Analyze using the LLN for products of random variables.

AEP lies at the heart of information theory:

• proof of lossless source coding,
• proof of channel capacity,
• and more…

AEP

Theorem. If X_1, X_2, … are i.i.d. ∼ p(x), then

  −(1/n) log p(X_1, X_2, …, X_n) → H(X), in probability.

Proof:

  −(1/n) log p(X_1, X_2, …, X_n) = −(1/n) Σ_{i=1}^n log p(X_i) → −E log p(X) = H(X).

There are several consequences.

Typical set

The typical set A_ϵ^{(n)} contains all sequences (x_1, x_2, …, x_n) ∈ X^n with the property

  2^{−n(H(X)+ϵ)} ≤ p(x_1, x_2, …, x_n) ≤ 2^{−n(H(X)−ϵ)}.
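A quick numerical check of the theorem above: a minimal Python sketch (my illustration, not part of the slides), assuming a Bernoulli(p = 0.8) source like the coin-tossing example that follows. The per-symbol sample entropy −(1/n) log₂ p(X_1, …, X_n) should settle near H(X) ≈ 0.7219 bits as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                           # P(X = 1), assumed source
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # entropy H(X) in bits

for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.random(n) < p                         # i.i.d. Bernoulli(p) draws
    k = int(x.sum())                              # number of ones
    # For an i.i.d. sequence, p(x_1,...,x_n) = p^k (1-p)^(n-k), so the
    # sample entropy depends only on the count of ones.
    sample_entropy = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
    print(f"n = {n:6d}   -(1/n) log2 p(X^n) = {sample_entropy:.4f}   H(X) = {H:.4f}")
```

By the AEP, each printed value converges in probability to H(X); equivalently, the probability of the observed sequence is about 2^{−nH(X)}.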
Not all sequences are created equal

• Coin tossing example: X ∈ {0, 1}, p(1) = 0.8.

  p(1, 0, 1, 1, 0, 1) = p^{Σ x_i} (1 − p)^{6 − Σ x_i} = p^4 (1 − p)^2 ≈ 0.0164

  p(0, 0, 0, 0, 0, 0) = (1 − p)^6 = 0.000064

• In this example, if (x_1, …, x_n) ∈ A_ϵ^{(n)}, then

  H(X) − ϵ ≤ −(1/n) log p(x_1, …, x_n) ≤ H(X) + ϵ.

• This means a binary sequence is in the typical set if its frequency of ones k/n is approximately p.

[Figure: grid of dots representing the binary sequences of length n, with a few sequences singled out by markers.]

p = 0.6, n = 25, k = number of "1"s:

  k   (25 choose k)   (25 choose k) p^k (1−p)^{25−k}   −(1/n) log p(x^n)
  0               1                         0.000000            1.321928
  1              25                         0.000000            1.298530
  2             300                         0.000000            1.275131
  3            2300                         0.000001            1.251733
  4           12650                         0.000007            1.228334
  5           53130                         0.000054            1.204936
  6          177100                         0.000227            1.181537
  7          480700                         0.000925            1.158139
  8         1081575                         0.003121            1.134740
  9         2042975                         0.008843            1.111342
 10         3268760                         0.021222            1.087943
 11         4457400                         0.043410            1.064545
 12         5200300                         0.075967            1.041146
 13         5200300                         0.113950            1.017748
 14         4457400                         0.146507            0.994349
 15         3268760                         0.161158            0.970951
 16         2042975                         0.151086            0.947552
 17         1081575                         0.119980            0.924154
 18          480700                         0.079986            0.900755
 19          177100                         0.044203            0.877357
 20           53130                         0.019891            0.853958
 21           12650                         0.007104            0.830560
 22            2300                         0.001937            0.807161
 23             300                         0.000379            0.783763
 24              25                         0.000047            0.760364
 25               1                         0.000003            0.736966

Consequences of AEP

Theorem.
1. If (x_1, x_2, …, x_n) ∈ A_ϵ^{(n)}, then for n sufficiently large:

   H(X) − ϵ ≤ −(1/n) log p(x_1, x_2, …, x_n) ≤ H(X) + ϵ.

2. P{A_ϵ^{(n)}} ≥ 1 − ϵ for n sufficiently large.
3. |A_ϵ^{(n)}| ≤ 2^{n(H(X)+ϵ)}.
4. |A_ϵ^{(n)}| ≥ (1 − ϵ) 2^{n(H(X)−ϵ)}.

Property 1

If (x_1, x_2, …, x_n) ∈ A_ϵ^{(n)}, then

  H(X) − ϵ ≤ −(1/n) log p(x_1, x_2, …, x_n) ≤ H(X) + ϵ.

• Proof from the definition: (x_1, x_2, …, x_n) ∈ A_ϵ^{(n)} if

  2^{−n(H(X)+ϵ)} ≤ p(x_1, x_2, …, x_n) ≤ 2^{−n(H(X)−ϵ)};

  take logarithms and divide by −n.

• The number of bits used to describe sequences in the typical set is approximately nH(X).

Property 2

P{A_ϵ^{(n)}} ≥ 1 − ϵ for n sufficiently large.

• Proof: From the AEP, because

  −(1/n) log p(X_1, …, X_n) → H(X)

  in probability, for a given ϵ > 0, when n is sufficiently large

  P{ |−(1/n) log p(X_1, …, X_n) − H(X)| ≤ ϵ } ≥ 1 − ϵ,

  and the event in braces says exactly that (X_1, …, X_n) ∈ A_ϵ^{(n)}.

  – High probability: sequences in the typical set are "most typical".
  – These sequences almost all have the same probability: "equipartition".

Properties 3 and 4: size of the typical set

  (1 − ϵ) 2^{n(H(X)−ϵ)} ≤ |A_ϵ^{(n)}| ≤ 2^{n(H(X)+ϵ)}

• Proof:

  1 = Σ_{(x_1,…,x_n)} p(x_1, …, x_n)
    ≥ Σ_{(x_1,…,x_n) ∈ A_ϵ^{(n)}} p(x_1, …, x_n)
    ≥ Σ_{(x_1,…,x_n) ∈ A_ϵ^{(n)}} 2^{−n(H(X)+ϵ)}
    = |A_ϵ^{(n)}| 2^{−n(H(X)+ϵ)}.

  On the other hand, P{A_ϵ^{(n)}} ≥ 1 − ϵ for n sufficiently large, so

  1 − ϵ < Σ_{(x_1,…,x_n) ∈ A_ϵ^{(n)}} p(x_1, …, x_n) ≤ |A_ϵ^{(n)}| 2^{−n(H(X)−ϵ)}.

• The size of the typical set depends on H(X); the numerical sketch below makes this concrete.
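Here is a small Python sketch (my illustration; the choice ϵ = 0.2 is an assumption, not from the slides) that checks Properties 2 and 3 for the Bernoulli(p = 0.6), n = 25 example tabulated above. It enumerates sequences by their number of ones, since all sequences with k ones share the same probability:

```python
from math import comb, log2

p, n, eps = 0.6, 25, 0.2                     # eps chosen arbitrarily for illustration
H = -(p * log2(p) + (1 - p) * log2(1 - p))   # H(X) ≈ 0.971 bits

size, prob = 0, 0.0
for k in range(n + 1):                       # k = number of ones
    # every sequence with k ones has probability p^k (1-p)^(n-k)
    sample_entropy = -(k * log2(p) + (n - k) * log2(1 - p)) / n
    if H - eps <= sample_entropy <= H + eps:  # all C(n,k) such sequences are typical
        size += comb(n, k)
        prob += comb(n, k) * p**k * (1 - p)**(n - k)

print(f"|A_eps| = {size}   (Property 3 bound: 2^(n(H+eps)) ≈ {2**(n*(H+eps)):.3g})")
print(f"P(A_eps) = {prob:.5f}   (Property 2 wants ≥ 1 - eps = {1 - eps})")
```

For these numbers, the typical set consists of the types k = 7, …, 23; it already carries probability above 0.999 while containing far fewer than 2^{n(H+ϵ)} sequences.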
• When p = 1/2 in the coin tossing example, H(X) = 1 and 2^{nH(X)} = 2^n: all sequences are typical sequences.

Typical set diagram

• This enables us to divide all sequences into two sets:
  – Typical set: high probability of occurring, sample entropy close to the true entropy, so we will focus on analyzing sequences in the typical set.
  – Non-typical set: small probability, can in general be ignored.

[Diagram: the space X^n of all |X|^n sequences, with the typical set A_ϵ^{(n)}, about 2^{n(H+ϵ)} elements, drawn as a small subset.]

Data compression scheme from AEP

• Let X_1, X_2, …, X_n be i.i.d. RVs drawn from p(x).
• We wish to find short descriptions for such sequences of RVs.

• Divide all sequences in X^n into two sets:
  – Non-typical set: description takes n log|X| + 2 bits.
  – Typical set: description takes n(H + ϵ) + 2 bits.

• Use one bit to indicate which set a sequence lies in:
  – Typical set A_ϵ^{(n)}: use prefix "1". Since there are no more than 2^{n(H(X)+ϵ)} typical sequences, indexing requires no more than ⌈n(H(X)+ϵ)⌉ ≤ n(H(X)+ϵ) + 1 bits; with the prefix bit, at most n(H(X)+ϵ) + 2 bits.
  – Non-typical set: use prefix "0". Since there are at most |X|^n sequences, indexing requires no more than ⌈n log|X|⌉ ≤ n log|X| + 1 bits; with the prefix bit, at most n log|X| + 2 bits.
• Notation: x^n = (x_1, …, x_n); l(x^n) = length of the codeword for x^n.
• We can prove that for n sufficiently large,

  E[(1/n) l(X^n)] ≤ H(X) + ϵ.

  (A code sketch of this scheme follows the summary below.)

Summary of AEP

Almost everything is almost equally probable.

• Reasons that H(X) appears throughout the AEP:
  – −(1/n) log p(x^n) → H(X), in probability
  – n(H(X) + ϵ) bits suffice to describe a random sequence on average
  – 2^{H(X)} is the effective alphabet size
  – The typical set is the smallest set with probability near 1
  – The size of the typical set is about 2^{nH(X)}
  – The probabilities of elements in the typical set are nearly uniform
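To close, a small sketch of the two-set code just described, specialized to a binary source (|X| = 2). The function name, the ϵ = 0.2 choice, and the Monte Carlo check are my own illustration; the slides specify only the prefix bit and the counting argument. Note that the slide's bound E[(1/n) l(X^n)] ≤ H(X) + ϵ absorbs the prefix and rounding overhead, at most 2/n bits per symbol, into ϵ, so the check below compares against H(X) + ϵ + 2/n:

```python
import numpy as np
from math import ceil, log2

def two_set_code_length(x, p, eps):
    """Bits used by the two-set code for a binary sequence x (illustrative)."""
    n = len(x)
    H = -(p * log2(p) + (1 - p) * log2(1 - p))
    k = int(np.sum(x))
    sample_entropy = -(k * log2(p) + (n - k) * log2(1 - p)) / n
    if abs(sample_entropy - H) <= eps:
        # typical: prefix bit "1" + index into A_eps (at most 2^(n(H+eps)) entries)
        return 1 + ceil(n * (H + eps))
    # non-typical: prefix bit "0" + raw index; ceil(n log2|X|) = n bits when |X| = 2
    return 1 + n

rng = np.random.default_rng(1)
p, n, eps = 0.6, 25, 0.2
H = -(p * log2(p) + (1 - p) * log2(1 - p))
lengths = [two_set_code_length(rng.random(n) < p, p, eps) for _ in range(10_000)]
print(f"E[l(X^n)]/n ≈ {np.mean(lengths) / n:.3f}")
print(f"H(X) + eps + 2/n = {H + eps + 2 / n:.3f}")
```

Taking ϵ smaller and n larger drives the average description length toward H(X) bits per symbol, which is the source coding story the AEP is built to tell.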