ECE 587 / STA 563: Lecture 3 — Convergence and Typical Sets
Information Theory, Duke University, Fall 2020
Author: Galen Reeves
Last Modified: September 19, 2020

Outline of lecture:
  3.1 Probability Review
      3.1.1 Basic Inequalities
      3.1.2 Convergence of Random Variables
  3.2 The Typical Set and AEP
      3.2.1 High Probability Sets
      3.2.2 The Typical Set
      3.2.3 Examples

3.1 Probability Review

3.1.1 Basic Inequalities

• Markov's Inequality: For any nonnegative random variable $X$ and $t > 0$,
    $P[X \ge t] \le \dfrac{E[X]}{t}$

• Proof: We have
    $1(x \ge t) \le \dfrac{x}{t}$, for all $x \ge 0$.
  Evaluating this inequality at $X$ and then taking the expectation gives the stated result.

• Chebyshev's Inequality: For any random variable $X$ with finite second moment and $t > 0$,
    $P\big[|X - E[X]| > t\big] \le \dfrac{\mathrm{Var}(X)}{t^2}$

• Proof: Apply Markov's inequality to $Y = (X - E[X])^2$:
    $P\big[|X - E[X]| > t\big] = P[Y > t^2] \le \dfrac{E[Y]}{t^2} = \dfrac{\mathrm{Var}(X)}{t^2}$

• Chernoff Bound: For any random variable $X$, $t \in \mathbb{R}$, and $\lambda > 0$,
    $P[X \ge t] \le \exp(-\lambda t)\, E[\exp(\lambda X)]$
  Here $E[\exp(\lambda X)]$ is the moment generating function.

• Proof:
    $P[X \ge t] = P[\lambda X \ge \lambda t]$                      (since $\lambda > 0$)
    $\phantom{P[X \ge t]} = P\big[e^{\lambda X} \ge e^{\lambda t}\big]$        (since $\exp(\cdot)$ is nondecreasing)
    $\phantom{P[X \ge t]} \le e^{-\lambda t}\, E\big[e^{\lambda X}\big]$        (Markov's inequality)

• Chernoff Bound for Sums: Let $X_1, X_2, \ldots$ be iid and $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. For any $\lambda > 0$,
    $P[\bar{X}_n \ge t] \le \exp(-n\lambda t)\, E[\exp(n\lambda \bar{X}_n)]$                        (Chernoff bound)
    $\phantom{P[\bar{X}_n \ge t]} = \exp(-n\lambda t)\, E\big[\exp\big(\lambda \sum_{i=1}^n X_i\big)\big]$   (definition of $\bar{X}_n$)
    $\phantom{P[\bar{X}_n \ge t]} = \exp(-n\lambda t)\, \prod_{i=1}^n E[\exp(\lambda X_i)]$              (independence of the $X_i$)
    $\phantom{P[\bar{X}_n \ge t]} = \big[\exp(-\lambda t)\, E[\exp(\lambda X_1)]\big]^n$                  (identically distributed)

3.1.2 Convergence of Random Variables

• A sequence of numbers $x_1, x_2, \ldots$ converges to a limit $x$ if, for all $\epsilon > 0$, there exists $N$ such that for all $n \ge N$, $|x_n - x| \le \epsilon$. This is written as $x_n \to x$ as $n \to \infty$, or $\lim_{n\to\infty} x_n = x$.

• For a continuous function $g(\cdot)$,
    $x_n \to x$ as $n \to \infty$  $\implies$  $g(x_n) \to g(x)$ as $n \to \infty$

• There are several ways to characterize convergence for random sequences. In the following, we consider the case where the sequence of random variables $X_1, X_2, \ldots$ converges to a (non-random) limit $x$:
  ◦ convergence with probability one (also called almost sure convergence):
      $P\big[\lim_{n\to\infty} X_n = x\big] = 1$
  ◦ convergence in probability: For every $\epsilon > 0$, $P[|X_n - x| \ge \epsilon] \to 0$ as $n \to \infty$, i.e.,
      $\lim_{n\to\infty} P[|X_n - x| \le \epsilon] = 1$, for all $\epsilon > 0$
  ◦ convergence in $p$-th mean:
      $\lim_{n\to\infty} E[|X_n - x|^p] = 0$

• Note:
    convergence with probability one $\implies$ convergence in probability
    convergence in $p$-th mean ($p \ge 1$) $\implies$ convergence in probability

• Example (Law of Large Numbers): Let $X_1, X_2, \ldots$ be i.i.d. with $E[|X|] < \infty$ and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. The sequence $\bar{X}_1, \bar{X}_2, \ldots$ converges to $E[X]$ almost surely, that is,
    $\bar{X}_n \to E[X]$ almost surely as $n \to \infty$
  Thus, the sequence $\bar{X}_1, \bar{X}_2, \ldots$ also converges in probability, i.e., for all $\epsilon > 0$,
    $\lim_{n\to\infty} P\big[|\bar{X}_n - E[X]| > \epsilon\big] = 0$.

• Convergence of random sequences can be extended to functions of random sequences. If $g(\cdot)$ is a continuous function, then
    $X_n \to x$ in probability as $n \to \infty$  $\implies$  $g(X_n) \to g(x)$ in probability as $n \to \infty$

• Example: Let $X_1, X_2, \ldots$ be i.i.d. and let $g(\cdot)$ be continuous with $E[|g(X_1)|] < \infty$. Then
    $\frac{1}{n}\sum_{i=1}^n g(X_i) \to E[g(X)]$ almost surely as $n \to \infty$
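The inequalities and limit statements above are easy to check numerically. Below is a minimal Monte Carlo sketch (not part of the original notes): it compares the Chernoff bound for sums against a simulated tail probability of the sample mean. The choice of Bernoulli(1/2) samples, the threshold `t = 0.75`, and the grid search over $\lambda$ are illustrative assumptions.

```python
# Sketch: Chernoff bound for sums vs. a Monte Carlo tail estimate.
# Assumptions (not from the notes): X_i ~ Bernoulli(1/2), n = 50, t = 0.75,
# and lambda optimized over a simple grid.
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 50, 0.75, 200_000

# Monte Carlo estimate of P[ Xbar_n >= t ], Xbar_n = (1/n) sum_i X_i
samples = rng.integers(0, 2, size=(trials, n))
p_hat = np.mean(samples.mean(axis=1) >= t)

# Chernoff bound for sums: P[Xbar_n >= t] <= ( exp(-lambda t) E[exp(lambda X_1)] )^n
# For Bernoulli(1/2), the MGF is E[exp(lambda X)] = (1 + exp(lambda)) / 2.
lambdas = np.linspace(0.01, 5.0, 500)
bound = np.min(((1 + np.exp(lambdas)) / 2 * np.exp(-lambdas * t)) ** n)

print(f"Monte Carlo estimate : {p_hat:.2e}")
print(f"Best Chernoff bound  : {bound:.2e}")
```

Optimizing the bound over $\lambda$ is what makes it decay exponentially in $n$; any fixed $\lambda > 0$ is valid but looser.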
3.2 The Typical Set and AEP

Throughout this section, let $X_1, X_2, \ldots$ be iid copies of a random variable $X \sim p(x)$ with finite support $\mathcal{X}$.

3.2.1 High Probability Sets

• A length-$n$ random sequence is denoted by
    $X^n = (X_1, X_2, \ldots, X_n)$
  and a realization is denoted by
    $x^n = (x_1, x_2, \ldots, x_n)$.
  The joint pmf is the product measure:
    $p_{X^n}(x^n) = P[X^n = x^n] = \prod_{i=1}^n p(x_i) = p(x^n)$
  where $p_{X^n}(x^n)$ is the joint pmf and $p(x^n)$ is simplified notation (valid since the $X_i$ are iid).

• The total number of possible sequences of length $n$ is $|\mathcal{X}|^n$. This is huge!

• Our goal is to identify a subset $A \subset \mathcal{X}^n$ which contains most of the probability, i.e., for $\epsilon \in (0,1)$, we want a set such that
    $P[X^n \in A] = \sum_{x^n \in A} p(x^n) \ge 1 - \epsilon$
  The identification of such a set with nice properties is a key step for many of the proofs in information theory.

• It is useful to define such a set in terms of a function $g : \mathcal{X} \to \mathbb{R}$. By the law of large numbers, for any $\epsilon > 0$ there exists $N$ such that
    $P\Big[\Big|\frac{1}{n}\sum_{i=1}^n g(X_i) - E[g(X)]\Big| \le \epsilon\Big] \ge 1 - \epsilon$, for all $n \ge N$
  ◦ This means that almost all of the probability is concentrated on the set of sequences $A \subset \mathcal{X}^n$ given by
      $A = \Big\{ x^n \in \mathcal{X}^n : \Big|\frac{1}{n}\sum_{i=1}^n g(x_i) - E[g(X)]\Big| \le \epsilon \Big\}$
  ◦ This set can be expressed equivalently as all $x^n \in \mathcal{X}^n$ such that
      $E[g(X)] - \epsilon \le \frac{1}{n}\sum_{i=1}^n g(x_i) \le E[g(X)] + \epsilon$
    or equivalently
      $2^{-n(E[g(X)]+\epsilon)} \le 2^{-\sum_{i=1}^n g(x_i)} \le 2^{-n(E[g(X)]-\epsilon)}$

• For which choice of $g(\cdot)$ will this set have nice properties? Can we characterize how large the set is?

3.2.2 The Typical Set

• For the special choice $g(x) = -\log p(x)$, it follows that
    $2^{-\sum_{i=1}^n g(x_i)} = 2^{\sum_{i=1}^n \log p(x_i)} = \prod_{i=1}^n p(x_i) = p_{X^n}(x^n)$   (the joint pmf of $X^n$)
  and
    $E[g(X)] = E[-\log p(X)] = H(X)$   (the entropy of $p(x)$)

• Definition: The $\epsilon$-typical set is defined by
    $A_\epsilon^{(n)} = \big\{ x^n \in \mathcal{X}^n : 2^{-n(H(X)+\epsilon)} \le p_{X^n}(x^n) \le 2^{-n(H(X)-\epsilon)} \big\}$
  or equivalently, the set of all sequences $x^n \in \mathcal{X}^n$ obeying
    $H(X) - \epsilon \le -\frac{1}{n}\log p_{X^n}(x^n) \le H(X) + \epsilon$
  where the middle term is the empirical entropy of $x^n$ and the outer terms involve the entropy $H(X)$. This is the set of all sequences whose probability is approximately equal to the expected probability. Unusually likely and unusually unlikely sequences are excluded.

• The typical set contains almost all of the probability. Furthermore, all of the sequences in the typical set have roughly the same probability. This is known as the Asymptotic Equipartition Property (AEP).

• The advantage of this definition is that we can characterize the size of the typical set in terms of the entropy $H(X)$:
  ◦ By the law of large numbers:
      $P\big[X^n \in A_\epsilon^{(n)}\big] \ge 1 - \epsilon$, for all $n \ge N$
  ◦ Upper bound:
      $\big|A_\epsilon^{(n)}\big| \le 2^{n(H(X)+\epsilon)}$
  ◦ Lower bound:
      $\big|A_\epsilon^{(n)}\big| \ge (1-\epsilon)\, 2^{n(H(X)-\epsilon)}$, for all $n \ge N$

• Proof of upper bound:
    $1 = \sum_{x^n \in \mathcal{X}^n} p(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} p(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)} = \big|A_\epsilon^{(n)}\big|\, 2^{-n(H(X)+\epsilon)}$
  and so $2^{n(H(X)+\epsilon)} \ge \big|A_\epsilon^{(n)}\big|$.

• Proof of lower bound: For $n$ large enough, $P\big[A_\epsilon^{(n)}\big] > 1 - \epsilon$, and so
    $1 - \epsilon < P\big[A_\epsilon^{(n)}\big] = \sum_{x^n \in A_\epsilon^{(n)}} p(x^n) \le \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)} = \big|A_\epsilon^{(n)}\big|\, 2^{-n(H(X)-\epsilon)}$
  which gives $(1-\epsilon)\, 2^{n(H(X)-\epsilon)} \le \big|A_\epsilon^{(n)}\big|$.

• Illustration of the typical set with respect to $\mathcal{X}^n$. Note $|\mathcal{X}^n| = 2^{n \log |\mathcal{X}|}$.
  ◦ High probability sequences ($p(x^n) > 2^{-n(H(X)-\epsilon)}$) are excluded (too few to matter)
  ◦ Low probability sequences ($p(x^n) < 2^{-n(H(X)+\epsilon)}$) are excluded (too unlikely to matter)
  ◦ Average probability sequences ($p(x^n) \approx 2^{-nH(X)}$) are retained
  [Figure: the set of all possible sequences $\mathcal{X}^n$, with $|\mathcal{X}|^n$ elements, partitioned into the typical set $A_\epsilon^{(n)}$, with at most $2^{n(H+\epsilon)}$ elements, and the non-typical set.]
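For a small alphabet the typical set can be built by brute force, which makes the size and probability bounds concrete. The sketch below is not part of the original notes; the pmf, $n = 10$, and $\epsilon = 0.1$ are illustrative choices.

```python
# Sketch: enumerate the epsilon-typical set for a small iid source.
# Assumptions (not from the notes): p(0)=1/2, p(1)=p(2)=1/4, n=10, eps=0.1.
import itertools
import numpy as np

p = {0: 0.5, 1: 0.25, 2: 0.25}                       # pmf with finite support
H = -sum(px * np.log2(px) for px in p.values())       # entropy H(X) in bits
n, eps = 10, 0.1

typical_size, prob_typical = 0, 0.0
for xn in itertools.product(p.keys(), repeat=n):
    log_p = sum(np.log2(p[x]) for x in xn)            # log2 p_{X^n}(x^n), product measure
    emp_entropy = -log_p / n                          # empirical entropy -(1/n) log2 p(x^n)
    if H - eps <= emp_entropy <= H + eps:             # membership in A_eps^(n)
        typical_size += 1
        prob_typical += 2.0 ** log_p

print(f"H(X) = {H:.3f} bits")
print(f"|A_eps^(n)| = {typical_size}  (upper bound 2^(n(H+eps)) = {2 ** (n * (H + eps)):.1f})")
print(f"P[X^n in A_eps^(n)] = {prob_typical:.3f}")
```

Note that the size bounds hold for every $n$, whereas the probability bound is asymptotic: for a small $n$ such as this one, $P[X^n \in A_\epsilon^{(n)}]$ can still fall short of $1 - \epsilon$ and only approaches 1 as $n$ grows.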
3.2.3 Examples

• Example 1: The $X_i$ are iid Bernoulli(3/4) and $\epsilon$ is very small.

    $x^n$        $p(x^n)$                   typical?
    11111111     $(3/4)^8$                  no
    11111100     $(3/4)^6 \times (1/4)^2$   yes
    11101110     $(3/4)^6 \times (1/4)^2$   yes
    00000000     $(1/4)^8$                  no

• Example 2: The $X_i$ are iid according to
    $p(x) = \begin{cases} 1/2, & x = 0 \\ 1/4, & x = 1 \\ 1/4, & x = 2 \end{cases}$
  where $\epsilon$ is very small.

    $x^n$        $p(x^n)$                   typical?
    00001122     $(1/2)^4 \times (1/4)^4$   yes
    00001111     $(1/2)^4 \times (1/4)^4$   yes
    11002222     $(1/2)^2 \times (1/4)^6$   no
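As a quick sanity check, the sketch below (again an illustration, not from the notes) recomputes the empirical entropy $-\frac{1}{n}\log_2 p(x^n)$ of each sequence in the two tables and compares it with $H(X)$; the tolerance `tol` stands in for the unspecified "very small" $\epsilon$.

```python
# Sketch: verify the "typical?" column of the two example tables.
# Assumption: tol = 0.01 plays the role of the small epsilon in the notes.
import numpy as np

tol = 0.01

def empirical_entropy(xn, p):
    """Return -(1/n) log2 p_{X^n}(x^n) for an iid pmf p."""
    return -sum(np.log2(p[x]) for x in xn) / len(xn)

# Example 1: Bernoulli(3/4)
p1 = {1: 0.75, 0: 0.25}
H1 = -sum(q * np.log2(q) for q in p1.values())
for s in ["11111111", "11111100", "11101110", "00000000"]:
    h = empirical_entropy([int(c) for c in s], p1)
    print(s, f"H_emp = {h:.3f}", "typical" if abs(h - H1) <= tol else "not typical")

# Example 2: p(0) = 1/2, p(1) = p(2) = 1/4, so H(X) = 1.5 bits
p2 = {0: 0.5, 1: 0.25, 2: 0.25}
H2 = 1.5
for s in ["00001122", "00001111", "11002222"]:
    h = empirical_entropy([int(c) for c in s], p2)
    print(s, f"H_emp = {h:.3f}", "typical" if abs(h - H2) <= tol else "not typical")
```

The sequences marked "yes" in the tables have empirical entropy essentially equal to $H(X)$ (their empirical symbol frequencies match $p$), while the others do not.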
