Chapter 4 Efficient Likelihood Estimation and Related Tests

1. Maximum likelihood and efficient likelihood estimation
2. Likelihood ratio, Wald, and Rao (or score) tests
3. Examples
4. Consistency of Maximum Likelihood Estimates
5. The EM algorithm and related methods
6. Nonparametric MLE
7. Limit theory for the statistical agnostic: $P \notin \mathcal{P}$

1 Maximum likelihood and efficient likelihood estimation

We begin with a brief discussion of Kullback-Leibler information.

Definition 1.1 Let $P$ be a probability measure, and let $Q$ be a sub-probability measure on $(X, \mathcal{A})$ with densities $p$ and $q$ with respect to a sigma-finite measure $\mu$ ($\mu = P + Q$ always works). Thus $P(X) = 1$ and $Q(X) \le 1$. Then the Kullback-Leibler information $K(P, Q)$ is

$$(1) \qquad K(P, Q) \equiv E_P \log \frac{p(X)}{q(X)} \, .$$

Lemma 1.1 For a probability measure $P$ and a (sub-)probability measure $Q$, the Kullback-Leibler information $K(P, Q)$ is always well-defined, and

$$K(P, Q) \ \begin{cases} \in [0, \infty] & \text{always,} \\ = 0 & \text{if and only if } Q = P. \end{cases}$$

Proof. Now

$$K(P, Q) = \begin{cases} \log 1 = 0 & \text{if } P = Q, \\ \log M > 0 & \text{if } P = MQ, \ M > 1 \, . \end{cases}$$

If $P \neq MQ$, then Jensen's inequality is strict and yields

$$K(P, Q) = E_P \left( - \log \frac{q(X)}{p(X)} \right) > - \log E_P \frac{q(X)}{p(X)} = - \log E_Q 1_{[p(X) > 0]} \ge - \log 1 = 0 \, . \qquad \Box$$

Now we need some assumptions and notation. Suppose that the model $\mathcal{P}$ is given by

$$\mathcal{P} = \{ P_\theta : \theta \in \Theta \} \, .$$

We will impose the following hypotheses about $\mathcal{P}$:

Assumptions:
A0. $\theta \neq \theta^*$ implies $P_\theta \neq P_{\theta^*}$.
A1. $A \equiv \{ x : p_\theta(x) > 0 \}$ does not depend on $\theta$.
A2. $P_\theta$ has density $p_\theta$ with respect to the $\sigma$-finite measure $\mu$, and $X_1, \ldots, X_n$ are i.i.d. $P_{\theta_0} \equiv P_0$.

Notation:

$$L(\theta) \equiv L_n(\theta) \equiv L(\theta \,|\, X) \equiv \prod_{i=1}^n p_\theta(X_i) \, ,$$
$$l(\theta) = l(\theta \,|\, X) \equiv l_n(\theta) \equiv \log L_n(\theta) = \sum_{i=1}^n \log p_\theta(X_i) \, ,$$
$$l(B) \equiv l(B \,|\, X) \equiv l_n(B) = \sup_{\theta \in B} l(\theta \,|\, X) \, .$$

Here is a preliminary result which motivates our definition of the maximum likelihood estimator.

Theorem 1.1 If A0 - A2 hold, then for $\theta \neq \theta_0$,

$$\frac{1}{n} \log \frac{L_n(\theta_0)}{L_n(\theta)} = \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta_0}(X_i)}{p_\theta(X_i)} \ \rightarrow_{a.s.} \ K(P_{\theta_0}, P_\theta) > 0 \, ,$$

and hence $P_{\theta_0}\big( L_n(\theta_0 \,|\, X) > L_n(\theta \,|\, X) \big) \to 1$ as $n \to \infty$.

Proof. The first assertion is just the strong law of large numbers; note that

$$E_{\theta_0} \log \frac{p_{\theta_0}(X)}{p_\theta(X)} = K(P_{\theta_0}, P_\theta) > 0$$

by Lemma 1.1 and A0. The second assertion is an immediate consequence of the first. $\Box$

Theorem 1.1 motivates the following definition.

Definition 1.2 The value $\hat\theta = \hat\theta_n$ of $\theta$ which maximizes the likelihood $L(\theta \,|\, X)$, if it exists and is unique, is the maximum likelihood estimator (MLE) of $\theta$. Thus $L(\hat\theta) = L(\Theta)$, or $l(\hat\theta_n) = l(\Theta)$.

Cautions:
• $\hat\theta_n$ may not exist.
• $\hat\theta_n$ may exist, but may not be unique.
• Note that the definition depends on the version of the density $p_\theta$ which is selected; since this is not unique, different versions of $p_\theta$ lead to different MLE's.

When $\Theta \subset R^d$, the usual approach to finding $\hat\theta_n$ is to solve the likelihood (or score) equations

$$(2) \qquad \dot l(\theta \,|\, X) \equiv \dot l_n(\theta) = 0 \, ;$$

i.e. $\dot l_{\theta_i}(\theta \,|\, X) = 0$, $i = 1, \ldots, d$. The solution, $\tilde\theta_n$ say, may not be the MLE, but may yield simply a local maximum of $l(\theta)$.

The likelihood ratio statistic for testing $H_0 : \theta = \theta_0$ versus $K : \theta \neq \theta_0$ is

$$\lambda_n = \frac{L(\Theta)}{L(\theta_0)} = \frac{\sup_{\theta \in \Theta} L(\theta \,|\, X)}{L(\theta_0 \,|\, X)} = \frac{L(\hat\theta_n)}{L(\theta_0)} \, , \qquad \tilde\lambda_n = \frac{L(\tilde\theta_n)}{L(\theta_0)} \, .$$
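For a concrete sense of how the score equation (2) and the statistic $\lambda_n$ are computed in practice, here is a minimal numerical sketch, assuming a one-parameter Gamma$(\alpha, 1)$ family, for which $\dot l(\alpha \,|\, x) = \log x - \psi(\alpha)$; the model and every name in the code are illustrative choices, not part of the notes.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import gammaln, digamma

rng = np.random.default_rng(0)
n, alpha0 = 200, 2.0
x = rng.gamma(shape=alpha0, scale=1.0, size=n)   # X_1,...,X_n i.i.d. Gamma(alpha0, 1)

def loglik(a):
    # l_n(alpha) = sum_i log p_alpha(X_i), with log p_alpha(x) = (alpha-1) log x - x - log Gamma(alpha)
    return np.sum((a - 1.0) * np.log(x) - x - gammaln(a))

def score(a):
    # likelihood equation (2): l_n-dot(alpha) = sum_i log X_i - n psi(alpha) = 0
    return np.sum(np.log(x)) - n * digamma(a)

alpha_hat = brentq(score, 1e-3, 100.0)             # root of the score equation (here also the MLE)
log_lambda_n = loglik(alpha_hat) - loglik(alpha0)  # log lambda_n for H_0: alpha = alpha0
print(alpha_hat, 2.0 * log_lambda_n)
```

In this one-dimensional example the log-likelihood is strictly concave in $\alpha$, so the unique root of (2) is the MLE; in general the root need not be the global maximizer, which is exactly why the cautions above matter.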
Write $P_0$, $E_0$ for $P_{\theta_0}$, $E_{\theta_0}$. Here are some more assumptions about the model $\mathcal{P}$ which we will use to treat these estimators and test statistics.

Assumptions, continued:
A3. $\Theta$ contains an open neighborhood $\Theta_0 \subset R^d$ of $\theta_0$ for which:
(i) For $\mu$-a.e. $x$, $l(\theta \,|\, x) \equiv \log p_\theta(x)$ is twice continuously differentiable in $\theta$.
(ii) For a.e. $x$, the third order derivatives exist and $\dddot l_{jkl}(\theta \,|\, x)$ satisfy $|\dddot l_{jkl}(\theta \,|\, x)| \le M_{jkl}(x)$ for $\theta \in \Theta_0$, for all $1 \le j, k, l \le d$, with $E_0 M_{jkl}(X) < \infty$.
A4. (i) $E_0 \{ \dot l_j(\theta_0 \,|\, X) \} = 0$ for $j = 1, \ldots, d$.
(ii) $E_0 \{ \dot l_j^2(\theta_0 \,|\, X) \} < \infty$ for $j = 1, \ldots, d$.
(iii) $I(\theta_0) = \big( - E_0 \{ \ddot l_{jk}(\theta_0 \,|\, X) \} \big)$ is positive definite.

Let

$$Z_n \equiv \frac{1}{\sqrt n} \sum_{i=1}^n \dot l(\theta_0 \,|\, X_i) \quad \text{and} \quad \tilde l(\theta_0 \,|\, X) = I^{-1}(\theta_0) \, \dot l(\theta_0 \,|\, X) \, ,$$

so that

$$I^{-1}(\theta_0) Z_n = \frac{1}{\sqrt n} \sum_{i=1}^n \tilde l(\theta_0 \,|\, X_i) \, .$$

Theorem 1.2 Suppose that $X_1, \ldots, X_n$ are i.i.d. $P_{\theta_0} \in \mathcal{P}$ with density $p_{\theta_0}$, where $\mathcal{P}$ satisfies A0 - A4. Then:
(i) With probability converging to 1 there exist solutions $\tilde\theta_n$ of the likelihood equations such that $\tilde\theta_n \rightarrow_p \theta_0$ when $P_0 = P_{\theta_0}$ is true.
(ii) $\tilde\theta_n$ is asymptotically linear with influence function $\tilde l(\theta_0 \,|\, x)$. That is,

$$\sqrt n (\tilde\theta_n - \theta_0) = I^{-1}(\theta_0) Z_n + o_p(1) = \frac{1}{\sqrt n} \sum_{i=1}^n \tilde l(\theta_0 \,|\, X_i) + o_p(1) \ \rightarrow_d \ I^{-1}(\theta_0) Z \equiv D \sim N_d(0, I^{-1}(\theta_0)) \, .$$

(iii)

$$2 \log \tilde\lambda_n \ \rightarrow_d \ Z^T I^{-1}(\theta_0) Z = D^T I(\theta_0) D \sim \chi^2_d \, .$$

(iv)

$$W_n \equiv \sqrt n (\tilde\theta_n - \theta_0)^T \, I_n(\tilde\theta_n) \, \sqrt n (\tilde\theta_n - \theta_0) \ \rightarrow_d \ D^T I(\theta_0) D = Z^T I^{-1}(\theta_0) Z \sim \chi^2_d \, ,$$

where

$$I_n(\tilde\theta_n) = \begin{cases} I(\tilde\theta_n) \, , & \text{or} \\ n^{-1} \sum_{i=1}^n \dot l(\tilde\theta_n \,|\, X_i) \, \dot l(\tilde\theta_n \,|\, X_i)^T \, , & \text{or} \\ - n^{-1} \sum_{i=1}^n \ddot l(\tilde\theta_n \,|\, X_i) \, . \end{cases}$$

(v)

$$R_n \equiv Z_n^T I^{-1}(\theta_0) Z_n \ \rightarrow_d \ Z^T I^{-1}(\theta_0) Z \sim \chi^2_d \, .$$

Here we could replace $I(\theta_0)$ by any of the possibilities for $I_n(\tilde\theta_n)$ given in (iv), and the conclusion continues to hold.
(vi) The model $\mathcal{P}$ satisfies the LAN condition at $\theta_0$:

$$l(\theta_0 + n^{-1/2} t) - l(\theta_0) = t^T Z_n - \frac{1}{2} t^T I(\theta_0) t + o_{P_0}(1) \ \rightarrow_d \ t^T Z - \frac{1}{2} t^T I(\theta_0) t \sim N\big( -(1/2)\sigma_0^2, \, \sigma_0^2 \big)$$

where $\sigma_0^2 \equiv t^T I(\theta_0) t$. Note that

$$\sqrt n (\hat\theta_n - \theta_0) = \hat t_n = \arg\max_t \{ l_n(\theta_0 + n^{-1/2} t) - l_n(\theta_0) \} \ \rightarrow_d \ \arg\max_t \{ t^T Z - (1/2) t^T I(\theta_0) t \} = I^{-1}(\theta_0) Z \sim N_d(0, I^{-1}(\theta_0)) \, .$$

Remark 1.1 Note that the asymptotic form of the log-likelihood given in part (vi) of Theorem 1.2 is exactly the log-likelihood ratio for a normal mean model $N_d(I(\theta_0) t, I(\theta_0))$. Also note that

$$t^T Z - \frac{1}{2} t^T I(\theta_0) t = \frac{1}{2} Z^T I^{-1}(\theta_0) Z - \frac{1}{2} \big( t - I^{-1}(\theta_0) Z \big)^T I(\theta_0) \big( t - I^{-1}(\theta_0) Z \big) \, ,$$

which is maximized as a function of $t$ by $t = I^{-1}(\theta_0) Z$, with maximum value $Z^T I^{-1}(\theta_0) Z / 2$.

Corollary 1 Suppose that A0-A4 hold and that $\nu \equiv \nu(P_\theta) = q(\theta)$ is differentiable at $\theta_0 \in \Theta$. Then $\tilde\nu_n \equiv q(\tilde\theta_n)$ satisfies

$$\sqrt n (\tilde\nu_n - \nu_0) = \frac{1}{\sqrt n} \sum_{i=1}^n \tilde l_\nu(\theta_0 \,|\, X_i) + o_p(1) \ \rightarrow_d \ N\big( 0, \, \dot q^T(\theta_0) I^{-1}(\theta_0) \dot q(\theta_0) \big) \, ,$$

where $\tilde l_\nu(\theta_0 \,|\, X_i) = \dot q^T(\theta_0) I^{-1}(\theta_0) \dot l(\theta_0 \,|\, X_i)$ and $\nu_0 \equiv q(\theta_0)$.

If the likelihood equations (2) are difficult to solve or have multiple roots, then it is possible to use a one-step approximation. Suppose that $\bar\theta_n$ is a preliminary estimator of $\theta$, and set

$$(3) \qquad \check\theta_n \equiv \bar\theta_n + I_n^{-1}(\bar\theta_n) \big( n^{-1} \dot l(\bar\theta_n \,|\, X) \big) \, .$$

The estimator $\check\theta_n$ is sometimes called a one-step estimator.

Theorem 1.3 Suppose that A0-A4 hold, and that $\bar\theta_n$ satisfies $n^{1/4} (\bar\theta_n - \theta_0) = o_p(1)$; note that the latter holds if $\sqrt n (\bar\theta_n - \theta_0) = O_p(1)$. Then

$$\sqrt n (\check\theta_n - \theta_0) = I^{-1}(\theta_0) Z_n + o_p(1) \ \rightarrow_d \ N_d(0, I^{-1}(\theta_0)) \, ,$$

where $Z_n \equiv n^{-1/2} \sum_{i=1}^n \dot l(\theta_0 \,|\, X_i)$.
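To see the one-step construction (3) in action, here is a short sketch, again assuming the illustrative Gamma$(\alpha, 1)$ family: the preliminary estimator is the method-of-moments estimator $\bar\alpha_n = \bar X_n$ (since $E X = \alpha$, so $\sqrt n(\bar\alpha_n - \alpha_0) = O_p(1)$ and the hypothesis of Theorem 1.3 holds), the information is $I(\alpha) = \psi'(\alpha)$, and a single Newton-type step is taken.

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(1)
n, alpha0 = 200, 2.0
x = rng.gamma(shape=alpha0, scale=1.0, size=n)

alpha_bar = x.mean()                                  # preliminary (method-of-moments) estimator
score_bar = np.log(x).mean() - digamma(alpha_bar)     # n^{-1} l-dot(alpha_bar | X)
info_bar = polygamma(1, alpha_bar)                    # I(alpha_bar) = psi'(alpha_bar) (trigamma)

alpha_check = alpha_bar + score_bar / info_bar        # one-step estimator (3)
print(alpha_bar, alpha_check)
```

By Theorem 1.3, $\check\alpha_n$ already has the same limiting distribution as the efficient root of the score equation, even though only one Newton step was taken from $\bar\alpha_n$.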
Proof of Theorem 1.2. (i) Existence and consistency. For $a > 0$, let

$$Q_a \equiv \{ \theta \in \Theta : |\theta - \theta_0| = a \} \, .$$

We will show that

$$(a) \qquad P_0 \{ l(\theta) < l(\theta_0) \ \text{for all} \ \theta \in Q_a \} \to 1 \quad \text{as} \ n \to \infty \, .$$

This implies that $L$ has a local maximum inside $Q_a$. Since the likelihood equations must be satisfied at a local maximum, it will follow that for any $a > 0$, with probability converging to 1, the likelihood equations have a solution $\tilde\theta_n(a)$ within $Q_a$; taking the root closest to $\theta_0$ completes the proof.

To prove (a), write

$$\frac{1}{n} \big( l(\theta) - l(\theta_0) \big) = \frac{1}{n} (\theta - \theta_0)^T \dot l(\theta_0) - \frac{1}{2} (\theta - \theta_0)^T \Big( - \frac{1}{n} \ddot l(\theta_0) \Big) (\theta - \theta_0) + \frac{1}{6n} \sum_{j=1}^d \sum_{k=1}^d \sum_{l=1}^d (\theta_j - \theta_{j0}) (\theta_k - \theta_{k0}) (\theta_l - \theta_{l0}) \sum_{i=1}^n \gamma_{jkl}(X_i) M_{jkl}(X_i)$$

$$(b) \qquad = S_1 + S_2 + S_3 \, ,$$

where, by A3(ii), $|\gamma_{jkl}(x)| \le 1$.
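Returning to the test statistics of Theorem 1.2 (iii)-(v), the following sketch (same illustrative Gamma$(\alpha, 1)$ family, so $d = 1$) computes $2 \log \lambda_n$, $W_n$, and $R_n$ for $H_0 : \alpha = \alpha_0$; under $H_0$ all three should be close to one another and approximately $\chi^2_1$.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import gammaln, digamma, polygamma
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, alpha0 = 200, 2.0
x = rng.gamma(shape=alpha0, scale=1.0, size=n)

loglik = lambda a: np.sum((a - 1.0) * np.log(x) - x - gammaln(a))
score = lambda a: np.sum(np.log(x)) - n * digamma(a)   # l_n-dot(alpha)
info = lambda a: polygamma(1, a)                       # I(alpha) = psi'(alpha)

alpha_hat = brentq(score, 1e-3, 100.0)                 # root of the score equation

LR = 2.0 * (loglik(alpha_hat) - loglik(alpha0))        # 2 log lambda_n
W = n * (alpha_hat - alpha0) ** 2 * info(alpha_hat)    # Wald statistic W_n, with I_n = I(alpha_hat)
Z = score(alpha0) / np.sqrt(n)                         # Z_n at the null value
R = Z ** 2 / info(alpha0)                              # Rao (score) statistic R_n

print(LR, W, R, chi2(1).ppf(0.95))                     # compare with the chi^2_1 critical value
```

Note that $R_n$ requires no fitting at all beyond evaluating the score at $\theta_0$, while $W_n$ and $2\log\lambda_n$ both use the fitted root; their asymptotic equivalence under $H_0$ is exactly the content of Theorem 1.2.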