CENTRAL LIMIT THEOREM

FREDERICK VU

Abstract. This expository paper provides a short introduction to probability theory before proving a central theorem of the subject, the central limit theorem. The theorem concerns the convergence in distribution of a suitably normalized average of a sample of independently distributed random variables with identical mean and variance. The paper uses Lévy's continuity theorem to prove the central limit theorem.

Contents

1. Introduction
2. Convergence
3. Variance Matrices
4. Multivariate Normal Distribution
5. Characteristic Functions and the Lévy Continuity Theorem
Acknowledgments
References

1. Introduction

Before we state the central limit theorem, we must first define several terms. An understanding of the terms relies on basic functional analysis fitted with new probability terminology.

Definition 1.1. A probability space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a non-empty set, $\mathcal{F}$ is a $\sigma$-algebra of measurable subsets of $\Omega$ (a collection of subsets containing $\Omega$ that is closed under complements and under countable unions and intersections), and $P$ is a finite measure on the measurable space $(\Omega, \mathcal{F})$ with $P(\Omega) = 1$. $P$ is referred to as a probability.

Definition 1.2. A random variable $X$ is a measurable function from a probability space $(\Omega, \mathcal{F}, P)$ to a measurable space $(S, \mathcal{S})$, where $\mathcal{S}$ is a $\sigma$-algebra of measurable subsets of $S$. Normally $(S, \mathcal{S})$ is the real numbers with the Borel $\sigma$-algebra; we will maintain the general notation, but conform to this norm throughout the paper. A random vector is a column vector whose components are real-valued random variables defined on the same probability space. In many places in this paper, a statement concerning random variables will presume the existence of some general probability space.

Definition 1.3.
The expected value of a real-valued random variable $X$ is defined as the Lebesgue integral of $X$ with respect to the measure $P$:

\[ E(X) := \int_\Omega X \, dP. \]

For a random vector $X$, the expected value $E(X)$ is the vector whose components are $E(X_i)$.

Definition 1.4. Because independence is such a central notion in probability, it is best to define it early. First, define the distribution of a random variable $X$ as the measure $Q := P \circ X^{-1}$ on $(S, \mathcal{S})$ given by

\[ Q(B) := P(X^{-1}(B)) \equiv P(X \in B) \equiv P(\{\omega \in \Omega : X(\omega) \in B\}), \qquad B \in \mathcal{S}. \]

This possibly confusing notation can be understood as the pushforward of the measure $P$ to $(S, \mathcal{S})$.

Definition 1.5. A set of random variables $X_1, \ldots, X_n$, with $X_i$ a map from $(\Omega, \mathcal{F}, P)$ to $(S_i, \mathcal{S}_i)$, is called independent if the distribution $Q$ of $X := (X_1, \ldots, X_n)$ on the product space $(S = S_1 \times \cdots \times S_n, \; \mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_n)$ is the product measure $Q = Q_1 \times \cdots \times Q_n$, where $Q_i$ is the distribution of $X_i$; more compactly,

\[ Q(B_1 \times \cdots \times B_n) = \prod_{i=1}^n Q_i(B_i). \]

Two random vectors are said to be independent if their components are pairwise independent as above.

Since the (multivariate) central limit theorem will not be stated until much further along, owing to the required definitions of normal distributions and many lemmas along the way, we pause here to give an informal statement of the central theorem before continuing with a few basic lemmas from probability theory. The central limit theorem says, roughly, that if one repeatedly but independently samples from a fixed distribution with finite mean and variance, the sample average settles near the expected value of the corresponding random variable, and the fluctuations of the average about that expected value, suitably rescaled, approach a normal distribution: a histogram of standardized sample averages takes on the familiar bell-shaped curve.

The following are simple inequalities used often in the paper.

Lemma 1.6 (Markov's Inequality). If $X$ is a nonnegative random variable and $a > 0$, then

\[ P(X \geq a) \leq \frac{E(X)}{a}. \]

Proof. For $U \subseteq \Omega$, denote the indicator function of $U$ by $I_U$.
Since $X$ is nonnegative, $X \geq a I_{\{X \geq a\}}$, so by monotonicity and linearity of the integral and the definition of the probability distribution,

\[ E(X) \geq E(a I_{\{X \geq a\}}) = a E(I_{\{X \geq a\}}) = a P(X \geq a). \qquad \square \]

Corollary 1.7 (Chebyshev's Inequality). For any random variable $X$ and $a > 0$,

\[ P(|X - E(X)| \geq a) \leq \frac{E\big((X - E(X))^2\big)}{a^2}. \]

Proof. Consider the nonnegative random variable $(X - E(X))^2$ and apply Markov's inequality with $a^2$ in place of $a$. $\square$

There are many ways to understand probability measures, and it is from these different points of view and their interrelations that one can derive the multitude of theorems that follow.

Definitions 1.8. The cumulative distribution function (cdf) of a random vector $X = (X_1, \ldots, X_n)$ is the function $F_X : \mathbb{R}^n \to \mathbb{R}$,

\[ F_X(x) = P(X_1 \leq x_1, \ldots, X_n \leq x_n). \]

For a continuous random vector $X$, define the probability density function as

\[ f_X(x) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_X(x_1, \ldots, x_n). \]

This provides us with another way to write the distribution of a random vector $X$: for $A \subseteq \mathbb{R}^n$,

\[ P(X \in A) = \int_A f_X(x) \, dx. \]

Remark 1.9. For a continuous random variable $X$, there is also another way to express the expected value of powers of $X$:

\[ (1.10) \qquad E(X^n) = \int_{\mathbb{R}} x^n f_X(x) \, dx. \]

This is just a specific case of

\[ (1.11) \qquad E(g(X)) = \int_{\mathbb{R}} g(x) f_X(x) \, dx, \]

where $g$ is a measurable function.

2. Convergence

Definition 2.1. A sequence of cumulative distribution functions $\{F_n\}$ is said to converge in distribution, or converge weakly, to the cumulative distribution function $F$, denoted $F_n \Rightarrow F$, if

\[ (2.2) \qquad \lim_n F_n(x) = F(x) \]

for every continuity point $x$ of $F$. If $Q_n$ and $Q$ are the corresponding distributions, we may equivalently define $Q_n \Rightarrow Q$ to mean that for every $A = (-\infty, x]$ with $Q(\{x\}) = 0$,

\[ \lim_n Q_n(A) = Q(A). \]

Similarly, if $X_n$ and $X$ are the random variables corresponding to $F_n$ and $F$, we write $X_n \Rightarrow X$, defined equivalently. Since distributions are just measures on some measurable space $(S, \mathcal{S})$, which again is generally the reals, we have a similar understanding of convergence of measures rather than just distributions.
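Definition 2.1 and the informal statement of the central limit theorem can be illustrated numerically. The following Python sketch is our own illustration, not part of the paper: all names in it (`standardized_mean`, `standard_normal_cdf`, the sample sizes) are assumptions chosen for the example. It compares the empirical cdf $F_n$ of standardized averages of $\mathrm{Uniform}(0,1)$ draws with the standard normal cdf $F$ at a few continuity points.

```python
import math
import random

def standardized_mean(n_terms, rng):
    """Average n_terms Uniform(0,1) draws, centered at the mean 1/2 and
    rescaled by sqrt(n_terms)/sigma, where sigma^2 = Var(U) = 1/12."""
    s = sum(rng.random() for _ in range(n_terms))
    return (s / n_terms - 0.5) * math.sqrt(n_terms) / math.sqrt(1.0 / 12.0)

def standard_normal_cdf(x):
    # cdf of the standard normal distribution, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(0)
n_samples = 20000
draws = sorted(standardized_mean(30, rng) for _ in range(n_samples))

# Empirical cdf F_n evaluated at a few continuity points x of F;
# by (2.2) these values should be close to F(x).
for x in (-1.0, 0.0, 1.0):
    empirical = sum(1 for d in draws if d <= x) / n_samples
    print(f"x = {x:+.1f}: F_n(x) = {empirical:.3f}, F(x) = {standard_normal_cdf(x):.3f}")
```

Even with only 30 terms per average, the empirical cdf tracks the normal cdf closely, which is the sense of convergence the rest of the section develops.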
The following theorem allows the representation of weakly convergent measures as the distributions of random variables defined on a common probability space.

Theorem 2.3. Suppose that $\mu_n$ and $\mu$ are probability measures on $(\mathbb{R}, \mathcal{B})$ and $\mu_n \Rightarrow \mu$. Then there exist random variables $X_n$ and $X$ on some $(\Omega, \mathcal{F}, P)$ such that $X_n, X$ have respective distributions $\mu_n, \mu$, and $X_n(\omega) \to X(\omega)$ for each $\omega \in \Omega$.

Proof. Take $(\Omega, \mathcal{F}, P)$ to be the interval $(0, 1)$ with the Borel subsets of $(0, 1)$ and the Lebesgue measure. Denote the cumulative distribution functions associated with $\mu_n, \mu$ by $F_n, F$, and put

\[ X_n(\omega) = \inf\{x : \omega \leq F_n(x)\} \quad \text{and} \quad X(\omega) = \inf\{x : \omega \leq F(x)\}. \]

The set $\{x : \omega \leq F(x)\}$ is closed on the left since $F$ is right-continuous, as are all cumulative distribution functions, and therefore it is the set $[X(\omega), \infty)$. Hence $\omega \leq F(x)$ if and only if $X(\omega) \leq x$, and

\[ P[\omega : X(\omega) \leq x] = P[\omega : \omega \leq F(x)] = F(x). \]

Thus $X$ has cumulative distribution function $F$; similarly, $X_n$ has cumulative distribution function $F_n$.

To prove pointwise convergence, fix $\omega$ and, for a given $\epsilon > 0$, choose $x$ so that $X(\omega) - \epsilon < x < X(\omega)$ and $\mu(\{x\}) = 0$. Then $F(x) < \omega$, and $F_n(x) \to F(x)$ implies that for large enough $n$, $F_n(x) < \omega$, and therefore $X(\omega) - \epsilon < x < X_n(\omega)$. Thus

\[ \liminf_n X_n(\omega) \geq X(\omega). \]

Now for $\omega' > \omega$, we may similarly choose $y$ with $X(\omega') < y < X(\omega') + \epsilon$ and $\mu(\{y\}) = 0$; then $\omega < \omega' \leq F(y)$, so for large enough $n$, $\omega \leq F_n(y)$ and hence $X_n(\omega) \leq y < X(\omega') + \epsilon$. Thus

\[ \limsup_n X_n(\omega) \leq X(\omega'). \]

Therefore, if $X$ is continuous at $\omega$, then $X_n(\omega) \to X(\omega)$. Since $X$ is increasing on $(0, 1)$, it has at most countably many discontinuities. At each point of discontinuity $\omega$, redefine $X_n(\omega) = X(\omega) = 0$. Since the set of discontinuities has Lebesgue measure $0$, the distributions remain unchanged. $\square$

At the heart of many theorems in probability are the convergence properties of distribution functions. We now come to several fundamental convergence theorems, though in essence they are rehashings of conventional proofs from functional analysis. The first theorem essentially says that measurable maps preserve limits.

Theorem 2.4. Let $h : \mathbb{R} \to \mathbb{R}$ be measurable and let the set $D_h$ of its discontinuities be measurable. If $\mu_n \Rightarrow \mu$ as before and $\mu(D_h) = 0$, then $\mu_n \circ h^{-1} \Rightarrow \mu \circ h^{-1}$.

Proof. Using the random variables $X_n, X$ defined in the previous proof, we see that $h(X_n(\omega)) \to h(X(\omega))$ for every $\omega$ with $X(\omega) \notin D_h$, hence almost everywhere, since $P[X \in D_h] = \mu(D_h) = 0$. Therefore $h(X_n) \Rightarrow h(X)$, where such notation means the composition $h \circ X_n$, as pointwise convergence almost everywhere implies convergence in distribution. For $A \subseteq \mathbb{R}$, since

\[ P[h(X) \in A] = P[X \in h^{-1}(A)] = \mu(h^{-1}(A)), \]

$h \circ X$ has distribution $\mu \circ h^{-1}$; similarly, $h \circ X_n$ has distribution $\mu_n \circ h^{-1}$, again abusing the notation of composition. Thus $h(X_n) \Rightarrow h(X)$ is equivalent to $\mu_n \circ h^{-1} \Rightarrow \mu \circ h^{-1}$. $\square$

Corollary 2.5. If $X_n \Rightarrow X$ and $P[X \in D_h] = 0$, then $h(X_n) \Rightarrow h(X)$.

Lemma 2.6. $\mu_n \Rightarrow \mu$ if and only if $\int f \, d\mu_n \to \int f \, d\mu$ for every bounded, continuous function $f$.

Proof. For the forward direction, by the same construction as in the proof of Theorem 2.3, we have $f(X_n) \to f(X)$ almost everywhere. By change of variables and the dominated convergence theorem (applicable since $f$ is bounded),

\[ \int f \, d\mu_n = E(f(X_n)) \to E(f(X)) = \int f \, d\mu. \]

Conversely, consider the cumulative distribution functions $F_n, F$ associated with $\mu_n, \mu$ and suppose $x < y$. Define the function $f$ by $f(t) = 1$ for $t \leq x$; $f(t) = 0$ for $t \geq y$; and $f(t) = (y - t)/(y - x)$ for $x \leq t \leq y$.
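The forward direction of Lemma 2.6 can be checked on a concrete example. In the sketch below, which is our own illustration rather than anything from the paper, the choice of measures and test function is an assumption: take $\mu_n$ uniform on the points $\{1/n, 2/n, \ldots, 1\}$, so that $\mu_n \Rightarrow \mu$ with $\mu$ the $\mathrm{Uniform}(0,1)$ distribution, and take the bounded continuous function $f(t) = \cos t$. The lemma then predicts $\int f \, d\mu_n \to \int_0^1 \cos t \, dt = \sin 1$.

```python
import math

def integral_f_mu_n(n):
    """Compute the integral of f(t) = cos(t) against mu_n, the uniform
    distribution on {k/n : k = 1, ..., n}; this is just E(f(X_n))."""
    return sum(math.cos(k / n) for k in range(1, n + 1)) / n

limit = math.sin(1.0)  # integral of cos over (0, 1), i.e. E(f(X)) for X ~ Uniform(0,1)
for n in (10, 100, 1000):
    approx = integral_f_mu_n(n)
    print(f"n = {n:4d}: integral f d(mu_n) = {approx:.6f}, |error| = {abs(approx - limit):.6f}")
```

The integrals here are Riemann sums, so the error shrinks on the order of $1/n$, and the convergence the lemma asserts is visible directly in the printed errors.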