Asymptotic Theory of Statistical Estimation¹


Jiantao Jiao
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Email: [email protected]
September 11, 2019

¹ Summary of Chapters in [1]

Contents

1 The Problem of Statistical Estimation
  1.1 Formulation of the Problem of Statistical Estimation
  1.2 Some Examples
    1.2.1 Hodges' and Lehmann's Result
    1.2.2 Estimation of the Mean of a Normal Distribution
  1.3 Consistency. Methods for Constructing Consistent Estimators
    1.3.1 An Existence Theorem
    1.3.2 Method of Moments
    1.3.3 Method of Maximum Likelihood
    1.3.4 Bayesian Estimates
  1.4 Inequalities for Probabilities of Large Deviations
    1.4.1 Convergence of θ̂ to θ
    1.4.2 Some Basic Theorems and Lemmas
    1.4.3 Examples
  1.5 Lower Bounds on the Risk Function
  1.6 Regular Statistical Experiments. The Cramér-Rao Inequality
    1.6.1 Regular Statistical Experiments
    1.6.2 The Cramér-Rao Inequality
    1.6.3 Bounds on the Hellinger Distance r₂²(θ, θ′) in Regular Experiments
  1.7 Approximating Estimators by Means of Sums of Independent Random Variables
  1.8 Asymptotic Efficiency
    1.8.1 Basic Definition
    1.8.2 Examples
    1.8.3 Bahadur's Asymptotic Efficiency
    1.8.4 Efficiency in C. R. Rao's Sense
  1.9 Two Theorems on the Asymptotic Behavior of Estimators
    1.9.1 Examples

2 Local Asymptotic Normality of Families of Distributions
  2.1 Independent Identically Distributed Observations
  2.2 Local Asymptotic Normality (LAN)
  2.3 Independent Nonhomogeneous Observations
  2.4 Multidimensional Parameter Set
  2.5 Characterizations of Limiting Distributions of Estimators
    2.5.1 Estimators of an Unknown Parameter when the LAN Condition is Fulfilled
    2.5.2 Regular Parameter Estimators
  2.6 Asymptotic Efficiency under LAN Conditions
  2.7 Asymptotically Minimax Risk Bound
  2.8 Some Corollaries. Superefficient Estimators

3 Some Applications to Nonparametric Estimation
  3.1 A Minimax Bound on Risks
  3.2 Bounds on Risks for Some Smooth Functionals
  3.3 Examples of Asymptotically Efficient Estimators
  3.4 Estimation of Unknown Density
  3.5 Minimax Bounds on Estimators for Density

Acknowledgment

Thanks to the Russian mathematicians for providing excellent research monographs.

Chapter 1. The Problem of Statistical Estimation

1.1 Formulation of the Problem of Statistical Estimation

Let (X, 𝒳, P_θ, θ ∈ Θ) be a statistical experiment generated by the observation X. Let φ be a measurable function from (Θ, ℬ) into (Y, 𝒴). Consider the problem of estimating the value of φ(θ) at the point θ from the observation X, whose distribution is P_θ. Our only information about θ is that θ ∈ Θ. As an estimator of φ(θ) one may choose any function T(X) of the observations with values in (Y, 𝒴). The following problem therefore arises naturally: how should one choose a statistic T that estimates φ(θ) in the best possible manner? But what does "in the best (possible) manner" mean?

Assume that on the set Y × Y there is a real-valued nonnegative function W(y₁, y₂), which we call the loss function, with the following meaning: if the observation X is distributed according to P_θ, then using the statistic T(X) to estimate φ(θ) incurs a loss equal to W(T(X), φ(θ)). Averaging over all possible values of X, we arrive at the risk function

    R_W(T, θ) = E_θ W(T(X), φ(θ)),                                            (1.1)

which will serve as the measure of the quality of the statistic T as an estimator of φ(θ) for a given loss function W.
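The risk function (1.1) is easy to approximate by simulation in any concrete experiment. The following sketch is an illustration added here, not an example from [1]; the choice of estimators, sample size, and replication count are arbitrary assumptions. It estimates R_W(T, θ) under the quadratic loss W(u, v) = (u − v)² for two estimators of the mean θ of n independent N(θ, 1) observations: the sample mean and the sample median.

```python
import numpy as np

def risk(estimator, theta, n=20, reps=20000, seed=0):
    """Monte Carlo approximation of R_W(T, theta) = E_theta (T(X) - theta)^2."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta, scale=1.0, size=(reps, n))  # reps independent samples of size n
    t = estimator(x)                                      # apply T row-wise to each sample
    return np.mean((t - theta) ** 2)

sample_mean = lambda x: x.mean(axis=1)
sample_median = lambda x: np.median(x, axis=1)

for theta in (-2.0, 0.0, 3.0):
    print(theta,
          round(risk(sample_mean, theta), 4),    # close to 1/n = 0.05 for every theta
          round(risk(sample_median, theta), 4))  # close to pi/(2n), about 0.0785
```

In this Gaussian experiment the sample mean has the smaller risk at every θ, so it is preferable to the sample median in the sense of the partial ordering introduced next.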
In this way a partial ordering is introduced on the space of all estimators of φ(θ): the estimator T₁ is preferable to T₂ if R_W(T₁, θ) ≤ R_W(T₂, θ) for all θ ∈ Θ. In view of this definition, an estimator T of the function φ(θ) is called inadmissible with respect to the loss function W if there exists an estimator T* such that

    R_W(T*, θ) ≤ R_W(T, θ) for all θ ∈ Θ,   and   R_W(T*, θ) < R_W(T, θ) for some θ ∈ Θ.      (1.2)

An estimator which is not inadmissible is called admissible.

Although the approach described above is commonly used, it is not free of shortcomings. First, many estimators turn out to be incomparable; second, the choice of loss function is arbitrary to a substantial degree.

Sometimes it is possible to find estimators which are optimal within a certain class smaller than the class of all estimators. One such class is the class of unbiased estimators: an estimator T is called an unbiased estimator of the function φ(θ) if E_θ T = φ(θ) for all θ ∈ Θ. Furthermore, if the initial experiment is invariant with respect to a group of transformations, it is natural to confine ourselves to the class of estimators which do not violate the symmetry of the problem; this is called the invariance principle.

Comparing estimators T₁, T₂ according to their behavior at the "least favorable" points leads to the notion of a minimax estimator. An estimator T₀ is called a minimax estimator of φ(θ) in Θ₁ ⊂ Θ relative to the loss function W if

    sup_{θ ∈ Θ₁} R_W(T₀, θ) = inf_T sup_{θ ∈ Θ₁} R_W(T, θ),                                    (1.3)

where the infimum is taken over all estimators T of φ(θ).

If Θ is a subset of a finite-dimensional Euclidean space, then statistical estimation problems based on this experiment are called parametric estimation problems. Below we shall mainly deal with parametric problems. Moreover, we shall always assume that Θ is an open subset of a finite-dimensional Euclidean space R^k, and that the family of distributions P_θ and the densities p(x, θ) = dP_θ/dμ are defined on the closure Θ̄ of the set Θ. By ℬ we shall denote the σ-algebra of Borel subsets of Θ.

In the parametric case it is usually the parameter itself that is estimated (i.e. φ(θ) = θ). In this case the loss function W is defined on the set Θ × Θ, and as a rule we shall consider loss functions with the following properties:

1. W(u, v) = w(u − v).
2. The function w(u) is defined and nonnegative on R^k; moreover, w(0) = 0 and w(u) is continuous at u = 0 but is not identically zero.
3. The function w is symmetric, i.e. w(u) = w(−u).
4. The sets {u : w(u) < c} are convex for all c > 0.
5. The sets {u : w(u) < c} are convex for all c > 0 and are bounded for all sufficiently small c > 0.

The functions w will also be called loss functions. The first three properties are quite natural and require no additional comment. Property 4, in the case of a one-dimensional parameter set, means that the function w(u) is non-decreasing on [0, ∞). Denote by W the class of loss functions satisfying properties 1-4; the same notation will also be used for the corresponding set of functions w. Denote by W₀ the class of functions satisfying properties 1-5. The notation W_p (respectively W_{p,0}) will be used for the set of functions w in W (respectively W₀) which possess a polynomial majorant. Denote by W_{e,α} (respectively W_{e,α;0}) the set of functions w belonging to W (respectively W₀) whose growth as |u| → ∞ is slower than any one of the functions e^{ε|u|^α}, ε > 0. Clearly, all loss functions of the form |u − v|^r, r > 0, and the indicator loss function W_A(u, v) = I(u − v ∉ A) belong to the class W_p.
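The quadratic loss w(u) = u², which belongs to the class W_p, already illustrates the minimax criterion (1.3). The numerical sketch below is a classical example added here, not one worked out in [1]; the sample size n = 25 is an arbitrary choice. For n Bernoulli(θ) observations with S successes, the empirical frequency S/n has quadratic risk θ(1 − θ)/n, largest at θ = 1/2, whereas T₀ = (S + √n/2)/(n + √n) has constant risk 1/(4(√n + 1)²) and is in fact minimax for θ ∈ [0, 1].

```python
import numpy as np

n = 25
theta = np.linspace(0.001, 0.999, 999)

# Exact quadratic risk of the empirical frequency S/n (unbiased, so variance only).
risk_freq = theta * (1 - theta) / n

# Exact quadratic risk of T0 = (S + sqrt(n)/2) / (n + sqrt(n)):
# variance n*theta*(1-theta)/(n+sqrt(n))^2 plus squared bias n*(1/2-theta)^2/(n+sqrt(n))^2.
risk_t0 = (n * theta * (1 - theta) + n * (0.5 - theta) ** 2) / (n + np.sqrt(n)) ** 2

print("sup risk of S/n:", risk_freq.max())                 # about 1/(4n) = 0.0100
print("sup risk of T0 :", risk_t0.max())                   # 1/(4*(sqrt(n)+1)**2), about 0.0069
print("T0 risk constant over theta:", np.allclose(risk_t0, risk_t0[0]))
```

The two risk functions cross: S/n is better near the endpoints of the parameter set and T₀ is better near θ = 1/2, so neither estimator dominates the other; T₀ wins only at the least favorable point, which is exactly what (1.3) asks for.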
Theorem 1.1.1 (Blackwell's theorem). Let the family {P_θ, θ ∈ Θ} possess a sufficient statistic T. Let the loss function be of the form w(u − v), where w(u) is a convex function on R^k. Let θ* be an arbitrary estimator of θ. Then there exists an estimator T*, representable in the form g(T), such that for all θ ∈ Θ,

    E_θ w(T* − θ) ≤ E_θ w(θ* − θ).                                            (1.4)

If θ* is an unbiased estimator of θ, then T* can also be chosen to be unbiased.

Consider again a parametric statistical experiment. Now assume that θ is a random variable with a known distribution Q on Θ. In this situation the estimation problem is called the estimation problem in the Bayesian formulation. Assume, for simplicity, that Q possesses a density q with respect to Lebesgue measure. If, as before, the losses are measured by means of the function w, then the mean loss incurred by the estimator T (the so-called Bayesian risk of the estimator T) equals

    r_w(T) = E_Q w(T − θ) = ∫_Θ q(θ) dθ ∫_X w(T(x) − θ) P_θ(dx) = ∫_Θ R_w(T, θ) q(θ) dθ.       (1.5)

In the Bayesian setup the risk of an estimator is thus a single number, and one can speak of the best estimator T̃ minimizing the risk r_w:

    r_w(T̃) = min_T r_w(T),                                                   (1.6)

where we assume that the minimum is attained. The estimator T̃ is called the Bayesian estimator with respect to the loss function w and the prior distribution Q. Evidently the form of the optimal estimator T̃ depends on the prior density q. On the other hand, one may assume that as the sample size increases to infinity the Bayesian estimator T̃ ceases to depend on the initial distribution Q within a wide class of such distributions.
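Under the quadratic loss w(u) = u², the minimization in (1.6) can be carried out pointwise, and the Bayesian estimator is the posterior mean of θ. The sketch below is an added illustration with assumed Beta priors, not an example from [1]; the true parameter value and prior parameters are arbitrary. It computes the estimator for Bernoulli observations and shows that estimators built from two rather different priors approach the empirical frequency, and each other, as n grows.

```python
import numpy as np

def bayes_estimate(s, n, a, b):
    """Posterior mean (a + s) / (a + b + n) under a Beta(a, b) prior on theta:
    the minimizer of the Bayesian risk (1.6) for quadratic loss."""
    return (a + s) / (a + b + n)

rng = np.random.default_rng(1)
theta_true = 0.3
for n in (10, 100, 10000):
    s = rng.binomial(n, theta_true)              # number of successes in n trials
    flat = bayes_estimate(s, n, 1.0, 1.0)        # Beta(1, 1): uniform prior
    peaked = bayes_estimate(s, n, 5.0, 5.0)      # Beta(5, 5): prior concentrated near 1/2
    print(n, round(s / n, 4), round(flat, 4), round(peaked, 4))
```

The gap between the two Bayesian estimators, and between each of them and S/n, shrinks at rate 1/n, which is the sense in which the Bayesian estimator ceases to depend on the prior as the sample size grows.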