6.441 Information Theory, Lecture 3

LECTURE 3 Convergence and Asymptotic Equipartition Property Last time: ² Convexity and concavity ² Jensen’s inequality ² Positivity of mutual information ² Data processing theorem ² Fano’s inequality Lecture outline ² Types of convergence ² Weak Law of Large Numbers ² Strong Law of Large Numbers ² Asymptotic Equipartition Property Reading: Scts. 3.1-3.2. Types of convergence Recall what a random variable is: a map ping from its set of sample values Ω onto R X : Ω 7! R » ! X(») In the cases we have been discussing, Ω = X and we map onto [0; 1] Types of convergence ² Sure convergence: a random sequence X1; : : : converges surely to r.v. X if 8» 2 Ω the sequence Xn(») converges to X(») as n ! 1 ² Almost sure convergence (also called con vergence with probability 1) the random sequence converges a.s. (w.p. 1) to X if the sequence X1(»); : : : converges to X(») for all » except possibly on a set of Ω of probability 0 ² Mean-square convergence: X1; : : : con verges in m.s. sense to r.v. X if 2 limn!1 EXn[jXn ¡ Xj ] ! 0 ² Convergence in probability: the sequence converges in probability to X if 8² > 0 limn!1 P r[jXn ¡ Xj > ²] ! 0 ² Convergence in distribution: the sequence converges in distribution if the cumula tive distribution function Fn(x) = P r(Xn · x) satisfies limn!1 Fn(x) ! FX(x) at all x for which F is continuous. Relations among types of convergence Venn diagram of relation: Weak Law of Large Numbers X1; X2; : : : i.i.d. finite mean ¹ and variance σ2 X1 + ¢ ¢ ¢ + Xn Mn = n ² E[Mn] = ² Var(Mn) = 2 σX Pr(jMn ¡ ¹j ¸ ²) · n²2 Weak Law of Large Numbers Consequence of Chebyshev’s inequality: Ran dom variable X X 2 2 σX = (x ¡ E[X]) PX(x) x2X 2 2 σX ¸ c Pr(jX ¡ E[X]j ¸ c) σ2 Pr(jX ¡ E[X]j ¸ c) · X c2 1 Pr(jX ¡ E[X]j ¸ kσ ) · X k2 Strong Law of Large Numbers Theorem: (SLLN) If Xi are IID, and EX[jXj] < 1, then X1 + ¢ ¢ ¢ + Xn Mn = ! E [X]; w:p:1: n X AEP If X1; : : : ; Xn are IID with distribution PX, then ¡1 log(P (x ; : : : ; x )) ! H(X) in prob n X1;:::;Xn 1 n ability j Notation: Xi = (Xi; : : : ; Xj) (if i = 1, gen erally omitted) Proof: create r.v. Y that takes the value yi = ¡ log(PX(xi)) with probability PX(xi) (note that the value of Y is related to its probability distribution) we now apply the WLLN to Y AEP 1 n ¡ log(PXn(x )) n n 1 X = ¡ log(PX(xi)) n i=1 n 1 X = yi n i=1 using the WLLN on Y 1 Pn n i=1 yi ! EY [Y ] in probability EY [Y ] = ¡EZ[log(PX(Z))] = H(X) for some r.v. Z identically distributed with X Consequences of the AEP: the typical set (n) Definition: A² is a typical set with respect to PX(x) if it is the set of sequences in the set of all possible sequences xn 2 X n with probability: ¡n(H(X)+²) n ¡n(H(X)¡²) 2 · PXn (x ) · 2 equivalently 1 n H(X) ¡ ² · ¡ log(P n (x )) · H(X) + ² n X As n increases, the bounds get closer to gether, so we are considering a smaller range of probabilities We shall use the typical set to describe a set with characteristics that belong to the majority of elements in that set. Note: the variance of the entropy is finite Consequences of the AEP: the typical set Why is it typical? AEP says 8² > 0, 8± > 0, 9n0 such that 8n > n0 (n) P r(A² ) ¸ 1 ¡ ± (note: ± can be ²) How big is the typical set? X n 1 = PXn (x ) xn2X n X n ¸ PXn (x ) n (n) x 2A² X ¡n(H(X)+²) ¸ 2 n (n) x 2A² (n) ¡n(H(X)+²) = jA² j2 (n) n(H(X)+²) ) jA² j · 2 (n) P r(A² ) ¸ (1 ¡ ²) X n ) 1 ¡ ² · PXn (x ) n (n) x 2A² (n) ¡n(H(X)¡²) · jA² j2 (n) n(H(X)¡²) ) jA² j ¸ 2 (1 ¡ ²) Visualize: Consequences of the AEP: using the typical set for compression Description in typical set requires no more than n(H(X) + ²) + 1 bits (correction of 1 bit because of integrality) (n)C Description in atypical set A² requires no more than n log(jX j) + 1 bits (n) Add another bit to indicate whether in A² or not to get whole description Consequences of the AEP: using the typical set for compression Let l(xn) be the length of the binary de scription of xn 8² > 0, 9n0 s.t. 8n > n0, n EXn[l(X )] X X n n n n = PXn (x ) l(x ) + PXn (x ) l(x ) (n) C n n (n) x 2A± x 2A X ± n · PXn (x ) (n(H(X) + ±) + 2) xn2A(n) X± n + PXn (x ) (n log(jX j) + 2) C n (n) x 2A± = nH(X) + n² + 2 for ± small enough with respect to ² 1 n so EXn[n l(X )] · H(X)+² for n sufficiently large. Jointly typical sequences (n) A² is a typical set with respect to PX;Y (x; y) if it is the set of sequences in the set of all possible sequences (xn; yn) 2 X n £ Yn with probability: ¡n(H(X)+²) n ¡n(H(X)¡²) 2 · PXn (x ) · 2 ³ ´ ¡n(H(Y )+²) n ¡n(H(Y )¡²) 2 · PY n y · 2 ³ ´ ¡n(H(X;Y )+²) n n ¡n(H(X;Y )¡²) 2 · PXn;Y n x ; y · 2 for (Xn; Y n) sequences of length n IID ac n n Qn cording PXn;Y n(x ; y ) = i=1 PX;Y (xi; yi) n n (n) P r((X ; Y ) 2 A² ) ! 1 as n ! 1 Jointly typical sequences Use the union bound n n (n) P r((X ; Y ) 62 A² ) n n 000(n) · P r((X ; Y ) 62 A² ) n 00(n) + P r((X ) 62 A² ) n 0 (n) + P r((Y ) 62 A² ) For A000 single typical sequence for pair, A00 for X and A0 for Y each element in the RHS goes to 0 MIT OpenCourseWare http://ocw.mit.edu 6.441 Information Theory Spring 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. .

Load more