Asymptotic Theory of Statistical Estimation¹
Jiantao Jiao
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Email: [email protected]
September 11, 2019
¹ Summary of chapters in [1].

Contents
1 The Problem of Statistical Estimation
  1.1 Formulation of the Problem of Statistical Estimation
  1.2 Some Examples
    1.2.1 Hodges' and Lehmann's Result
    1.2.2 Estimation of the Mean of a Normal Distribution
  1.3 Consistency. Methods for Constructing Consistent Estimators
    1.3.1 An Existence Theorem
    1.3.2 Method of Moments
    1.3.3 Method of Maximum Likelihood
    1.3.4 Bayesian Estimates
  1.4 Inequalities for Probabilities of Large Deviations
    1.4.1 Convergence of θ̂ to θ
    1.4.2 Some Basic Theorems and Lemmas
    1.4.3 Examples
  1.5 Lower Bounds on the Risk Function
  1.6 Regular Statistical Experiments. The Cramér–Rao Inequality
    1.6.1 Regular Statistical Experiments
    1.6.2 The Cramér–Rao Inequality
    1.6.3 Bounds on the Hellinger Distance r_2²(θ; θ′) in Regular Experiments
  1.7 Approximating Estimators by Means of Sums of Independent Random Variables
  1.8 Asymptotic Efficiency
    1.8.1 Basic Definition
    1.8.2 Examples
    1.8.3 Bahadur's Asymptotic Efficiency
    1.8.4 Efficiency in C. R. Rao's Sense
  1.9 Two Theorems on the Asymptotic Behavior of Estimators
    1.9.1 Examples

2 Local Asymptotic Normality of Families of Distributions
  2.1 Independent Identically Distributed Observations
  2.2 Local Asymptotic Normality (LAN)
  2.3 Independent Nonhomogeneous Observations
  2.4 Multidimensional Parameter Set
  2.5 Characterizations of Limiting Distributions of Estimators
    2.5.1 Estimators of an Unknown Parameter when the LAN Condition is Fulfilled
    2.5.2 Regular Parameter Estimators
  2.6 Asymptotic Efficiency under LAN Conditions
  2.7 Asymptotically Minimax Risk Bound
  2.8 Some Corollaries. Superefficient Estimators

3 Some Applications to Nonparametric Estimation
  3.1 A Minimax Bound on Risks
  3.2 Bounds on Risks for Some Smooth Functionals
  3.3 Examples of Asymptotically Efficient Estimators
  3.4 Estimation of Unknown Density
  3.5 Minimax Bounds on Estimators for Density
Acknowledgment
I thank the Russian mathematicians for providing excellent research monographs.
Chapter 1
The Problem of Statistical Estimation
1.1 Formulation of the Problem of Statistical Estimation
Let $(\mathcal X, \mathfrak X, P_\theta, \theta \in \Theta)$ be a statistical experiment generated by the observation X. Let ϕ be a measurable function from (Θ, B) into (Y, 𝔜). Consider the problem of estimating the value of ϕ(θ) based on the observation X, whose distribution is P_θ; our only information about θ is that θ ∈ Θ. As an estimator for ϕ(θ) one can choose any function T(X) of the observations with values in (Y, 𝔜). The following problem therefore arises naturally: how should one choose a statistic T that estimates ϕ(θ) in the best possible manner? But what is the meaning of "in the best possible manner"? Assume there is on the set Y × Y a real-valued nonnegative function W(y_1, y_2), which we call a loss function, with the following meaning: if the observation X is distributed according to P_θ, then using the statistic T(X) to estimate ϕ(θ) incurs the loss W(T(X), ϕ(θ)). Averaging over all possible values of X, we arrive at the risk function

$$R_W(T;\theta) = E_\theta W(T(X), \varphi(\theta)), \tag{1.1}$$

which will be taken as the measure of the quality of the statistic T as an estimator of ϕ(θ) for a given loss function W. Thus a partial ordering is introduced on the space of all estimators of ϕ(θ): the estimator T_1 is preferable to T_2 if R_W(T_1; θ) ≤ R_W(T_2; θ) for all θ ∈ Θ. In view of this definition, an estimator T of the function ϕ(θ) is called inadmissible with respect to the loss function W if there exists an estimator T* such that
$$R_W(T^*;\theta) \le R_W(T;\theta) \quad \forall\,\theta\in\Theta, \qquad R_W(T^*;\theta) < R_W(T;\theta) \ \text{for some } \theta\in\Theta. \tag{1.2}$$
An estimator which is not inadmissible is called admissible. Although the approach described above is commonly used, it is not free of certain shortcomings: first, many estimators turn out to be incomparable, and second, the choice of loss function is arbitrary to a substantial degree. Sometimes it is possible to find estimators which are optimal within a class smaller than the class of all estimators. One such class is the class of unbiased estimators: an estimator T is called an unbiased estimator of the function ϕ(θ) if E_θ T = ϕ(θ) for all θ ∈ Θ. Furthermore, if the initial experiment is invariant with respect to a group of transformations, it is natural to confine ourselves to the class of estimators which do not violate the symmetry of the problem; this is called the invariance principle. Comparing estimators T_1, T_2 according to their behavior at the "least favorable" points, we arrive at the notion of a minimax estimator. An estimator T_0 is called the minimax estimator of ϕ(θ) in Θ_1 ⊂ Θ relative to the loss function W if
$$\sup_{\theta\in\Theta_1} R_W(T_0;\theta) = \inf_T \sup_{\theta\in\Theta_1} R_W(T;\theta), \tag{1.3}$$

where the infimum is taken over all estimators T of ϕ(θ). If Θ is a subset of a finite-dimensional Euclidean space, then statistical estimation problems based on this experiment are called parametric estimation problems. Below we shall mainly deal with parametric problems. Moreover, we shall always assume that Θ is an open subset of a finite-dimensional Euclidean space R^k, and that the family of distributions P_θ and the densities p(x; θ) = dP_θ/dµ are defined on the closure Θ̄ of the set Θ. By B we shall denote the σ-algebra of Borel subsets of Θ. In the parametric case it is usually the parameter itself that is estimated (i.e. ϕ(θ) = θ). In this case the loss function W is defined on the set Θ × Θ and, as a rule, we shall consider loss functions possessing the following properties:

1. W(u, v) = w(u − v).
2. The function w(u) is defined and nonnegative on R^k; moreover, w(0) = 0 and w(u) is continuous at u = 0 but is not identically zero.
3. The function w is symmetric, i.e. w(u) = w(−u).
4. The sets {u : w(u) < c} are convex for all c > 0.
5. The sets {u : w(u) < c} are convex for all c > 0 and are bounded for all sufficiently small c > 0.

The functions w will also be called loss functions. The first three properties are quite natural and require no additional comment. Property 4, in the case of a one-dimensional parameter set, means that the function w(u) is non-decreasing on [0, ∞). Denote by W the class of loss functions satisfying 1–4; the same notation will also be used for the corresponding set of functions w. Denote by W^0 the class of functions satisfying 1–5. The notation W_p (W_p^0) will be used for the set of functions w in W (W^0) which possess a polynomial majorant. Denote by W_{e,α} (W^0_{e,α}) the set of functions w belonging to W (W^0) whose growth as |u| → ∞ is slower than any of the functions e^{ε|u|^α}, ε > 0. Clearly, all loss functions of the form |u − v|^r, r > 0, as well as the indicator loss function W_A(u, v) = I(u − v ∉ A), belong to the class W_p^0.
Theorem 1.1.1 (Blackwell's theorem). Let the family {P_θ, θ ∈ Θ} possess a sufficient statistic T. Let the loss function be of the form w(u − v), where w(u) is a convex function on R^k. Let θ* be an arbitrary estimator of θ. Then there exists an estimator T*, representable in the form g(T), such that for all θ ∈ Θ,
$$E_\theta w(T^* - \theta) \le E_\theta w(\theta^* - \theta). \tag{1.4}$$

If θ* is an unbiased estimator of θ, T* can also be chosen to be unbiased.

Consider again a parametric statistical experiment. Now we assume that θ is a random variable with a known distribution Q on Θ. In such a situation the estimation problem is called the estimation problem in the Bayesian formulation. Assume, for simplicity, that Q possesses a density q with respect to Lebesgue measure. If, as before, the losses are measured by means of a function w, then the mean loss incurred by the estimator T (the so-called Bayesian risk of the estimator T) equals

$$r_w(T) = E_Q w(T - \theta) = \int_\Theta q(\theta)\,d\theta \int_{\mathcal X} w(T(x) - \theta)\, P_\theta(dx) = \int_\Theta R_w(T;\theta)\, q(\theta)\,d\theta. \tag{1.5}$$

In the Bayesian setup the risk of an estimator is thus a single number, and one can speak of the best estimator T̃ minimizing the risk r_w:

$$r_w(\tilde T) = \min_T r_w(T). \tag{1.6}$$

Here we assume the minimum is attained. The estimator T̃ is called the Bayesian estimator with respect to the loss function w and the prior distribution Q. Evidently the form of the optimal estimator T̃ depends on the prior density q. On the other hand, one may expect that as the sample size increases to infinity the Bayesian estimator T̃ ceases to depend on the prior distribution Q within a wide class of such distributions (e.g. those Q for which q > 0 on Θ). Therefore, for an asymptotic treatment of Bayesian problems the exact knowledge of q is not so essential.
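Under quadratic loss w(u) = u², the Bayesian estimator T̃ is simply the posterior mean. The following minimal sketch (an illustration added here, not taken from [1]; the Beta–Bernoulli model and all numerical values are assumptions) shows the closed-form posterior mean and how the influence of the prior fades as the sample grows:

```python
# Bayes estimator under squared loss = posterior mean.  With Bernoulli(theta)
# observations and a Beta(a, b) prior, the posterior mean is (a + sum(x)) / (a + b + n).
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
for n in [10, 100, 10000]:
    x = rng.binomial(1, theta, size=n)
    s = x.sum()
    post_mean_1 = (1 + s) / (2 + n)    # Beta(1, 1) (uniform) prior
    post_mean_2 = (5 + s) / (10 + n)   # Beta(5, 5) prior
    print(n, post_mean_1, post_mean_2) # the two Bayes estimates coalesce as n grows
```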
1.2 Some Examples

1.2.1 Hodges' and Lehmann's Result

We shall first formulate a simple general criterion, due to Lehmann, which is useful for proving the minimax property of certain estimators.
Theorem 1.2.1. Let T_k be a Bayesian estimator with respect to the distribution λ_k on Θ and the loss function W, k = 1, 2, .... If the estimator T is such that for all θ ∈ Θ,

$$E_\theta W(\theta, T) \le \limsup_k \int_\Theta E_\theta W(\theta, T_k)\,d\lambda_k(\theta), \tag{1.7}$$

then it is minimax. As a corollary to this theorem we obtain the following result of Hodges and Lehmann.
Theorem 1.2.2. Let T be an estimator which is Bayesian with respect to W and a probability measure λ on Θ. Denote by Θ_0 ⊂ Θ the support of λ. If
1. EθW (θ, T ) ≡ c for all θ ∈ Θ0,
2. E_θ W(θ, T) ≤ c for all θ ∈ Θ, then T is a minimax estimator.
1.2.2 Estimation of the Mean of a Normal Distribution
Let X_j possess the normal distribution N(θ, 1) on the real line. Denote by T_k the estimator which is Bayesian with respect to the normal prior distribution λ_k with mean zero and variance σ_k² = k. Since the loss function is quadratic,

$$T_k = \frac{nk}{nk+1}\,\bar X. \tag{1.8}$$

Therefore

$$\int_{-\infty}^{\infty} E_u (T_k - u)^2\, d\lambda_k = \frac{k}{nk+1}. \tag{1.9}$$

For all θ ∈ Θ,

$$E_\theta(\bar X - \theta)^2 = n^{-1} = \lim_k \int_{-\infty}^{\infty} E_u (T_k - u)^2\, d\lambda_k, \tag{1.10}$$

and it follows from Theorem 1.2.1 that X̄ is a minimax estimator. Consequently the equivariant estimator X̄ is also optimal in the class of equivariant estimators. We note immediately that X̄ is admissible as well (this can be shown easily using results based on Fisher information presented later). Hence in the problem under consideration the estimator X̄ of the parameter θ has all the possible virtues: it is unbiased, admissible, minimax, and equivariant. These properties of X̄ are retained also in the case when the X_j are normally distributed in R².

If, however, the X_j are normally distributed in R^k, k ≥ 3, with density N(θ, J), then the statistic X̄, relative to the loss function W(θ, t) = |θ − t|², loses all of its remarkable properties; it is even inadmissible in this case. We now present briefly Stein's method of constructing estimators which are better than X̄. The following simple relation is the basis for Stein's construction. Let ξ be a random variable with mean a and variance σ². Furthermore, let ϕ(x) be absolutely continuous on R^1 with E|ϕ′(ξ)| < ∞. Then
$$\sigma^2\, E\varphi'(\xi) = E(\xi - a)\varphi(\xi). \tag{1.11}$$

Indeed, integrating by parts we obtain
$$E\varphi'(\xi) = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} \varphi'(x)\,e^{-(x-a)^2/2\sigma^2}\,dx = -\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} \varphi(x)\,d\big(e^{-(x-a)^2/2\sigma^2}\big) = \sigma^{-2}\,E\,\varphi(\xi)(\xi-a). \tag{1.12}$$
Now let ξ = (ξ_1, ..., ξ_k) be a normal random vector in R^k with mean a and correlation matrix σ²J, where J is the identity matrix. Furthermore, let the function ϕ = (ϕ_1, ..., ϕ_k): R^k → R^k be differentiable with E|∂ϕ_i(ξ)/∂ξ_i| < ∞. Under these assumptions the following identity is obtained from Stein's identity:

$$E\left[\sigma^2\,\frac{\partial \varphi_i}{\partial \xi_i}(\xi) - (\xi_i - a_i)\,\varphi_i(\xi)\right] = 0. \tag{1.13}$$
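Identity (1.11) is easy to check numerically. A minimal Monte Carlo sketch (added here as an illustration), with the hypothetical choice ϕ(x) = sin x, so that ϕ′(x) = cos x:

```python
# Monte Carlo sanity check of Stein's identity sigma^2 E phi'(xi) = E (xi - a) phi(xi).
import numpy as np

rng = np.random.default_rng(1)
a, sigma = 0.7, 1.3
xi = rng.normal(a, sigma, size=2_000_000)
lhs = sigma**2 * np.cos(xi).mean()    # sigma^2 * E phi'(xi)
rhs = ((xi - a) * np.sin(xi)).mean()  # E (xi - a) phi(xi)
print(lhs, rhs)                       # the two estimates agree up to MC error
```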
Return now to the sequence of iid observations X_1, ..., X_n, where X_j possesses a normal distribution in R^k with density N(θ, J). An estimator of θ will be sought among the estimators of the form
$$\tilde\theta_n = \bar X + n^{-1}\,g(\bar X), \tag{1.14}$$
where g(x) = (g_1, ..., g_k): R^k → R^k. In view of (1.13),

$$E_\theta|\bar X - \theta|^2 - E_\theta|\bar X + n^{-1}g(\bar X) - \theta|^2 = -2n^{-1}E_\theta\langle \bar X - \theta,\, g(\bar X)\rangle - n^{-2}E_\theta|g(\bar X)|^2 \tag{1.15}$$
$$= -2n^{-2}\sum_{i=1}^k E_\theta\,\frac{\partial g_i}{\partial x_i}(\bar X) - n^{-2}E_\theta|g(\bar X)|^2. \tag{1.16}$$
Assume now that the function g can be represented in the form g(x) = ∇ ln ϕ(x), where ϕ(x) is a twice differentiable function from R^k to R^1. Then

$$\sum_{i=1}^k \frac{\partial g_i}{\partial x_i}(x) = \sum_{i=1}^k \frac{\partial}{\partial x_i}\Big(\frac{1}{\varphi(x)}\frac{\partial \varphi}{\partial x_i}\Big) = -|g|^2 + \frac{1}{\varphi}\,\Delta\varphi, \tag{1.17}$$
where Δ = Σ_{i=1}^k ∂²/∂x_i² is the Laplace operator. Consequently, for the above choice of g,

$$E_\theta|\bar X - \theta|^2 - E_\theta|\bar X + n^{-1}g(\bar X) - \theta|^2 = n^{-2}\,E_\theta|g|^2 - 2n^{-2}\,E_\theta\Big[\frac{1}{\varphi(\bar X)}\,\Delta\varphi(\bar X)\Big]. \tag{1.18}$$
The right-hand side of the last equality is obviously positive provided ϕ(x) is a positive nonconstant superharmonic function, i.e. Δϕ ≤ 0. Since every superharmonic function bounded from below on the real line or in the plane is constant, the proposed approach does not improve on the estimator X̄ in these two cases. However, in spaces of dimension k ≥ 3 there exists a substantial supply of nonnegative superharmonic functions. Consider, for example, the function

$$\varphi_k(x) = \begin{cases} |x|^{-(k-2)}, & |x| \ge \sqrt{k-2},\\ (k-2)^{-(k-2)/2}\, e^{\frac12((k-2)-|x|^2)}, & |x| \le \sqrt{k-2}. \end{cases} \tag{1.19}$$
This function is positive and superharmonic in R^k, k ≥ 3, and

$$\nabla \ln \varphi_k(x) = \begin{cases} -\dfrac{k-2}{|x|^2}\,x, & |x| \ge \sqrt{k-2},\\[2pt] -x, & |x| \le \sqrt{k-2}. \end{cases} \tag{1.20}$$
Thus the Stein–James estimator

$$\bar X + \frac{1}{n}\nabla\ln\varphi_k(\bar X) = \begin{cases} \Big(1 - \dfrac{k-2}{n|\bar X|^2}\Big)\bar X, & |\bar X| \ge \sqrt{k-2},\\[2pt] \Big(1 - \dfrac{1}{n}\Big)\bar X, & |\bar X| \le \sqrt{k-2}, \end{cases} \tag{1.21}$$

is uniformly better than X̄. It is worth mentioning, however, that as n → ∞ the improvement is of order n^{-2}, while E_θ|X̄ − θ|² = k/n is of order n^{-1}. Another improvement on the estimator X̄ may be obtained by setting ϕ(x) = |x|^{−(k−2)}. Here ϕ(x) is a harmonic function and the corresponding estimator is of the form

$$\tilde\theta_n = \Big(1 - \frac{k-2}{n|\bar X|^2}\Big)\bar X. \tag{1.22}$$

This estimator was suggested by Stein. For Stein's estimator,

$$E_\theta|\bar X - \theta|^2 - E_\theta|\tilde\theta_n - \theta|^2 = \Big(\frac{k-2}{n}\Big)^2 E_\theta|\bar X|^{-2} = \Big(\frac{k-2}{n}\Big)^2 \frac{1}{(2\pi)^{k/2}}\int_{\mathbb R^k}\Big|\theta + \frac{x}{\sqrt n}\Big|^{-2} e^{-|x|^2/2}\,dx. \tag{1.23}$$

As before, as n → ∞,

$$E_\theta|\bar X - \theta|^2 - E_\theta|\tilde\theta_n - \theta|^2 = O(n^{-2}) \tag{1.24}$$

provided θ ≠ 0. For θ = 0, however, the difference equals (k − 2)/n. Thus for θ = 0 Stein's estimator is substantially better than X̄:

$$\frac{E_\theta|\bar X - \theta|^2}{E_\theta|\tilde\theta_n - \theta|^2} = \frac{k}{2} > 1. \tag{1.25}$$
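A small simulation makes the comparison concrete. The sketch below (the dimension k = 10, sample size n = 50 and the choice θ = 0 are illustrative assumptions) estimates the quadratic risks of X̄ and of Stein's estimator (1.22); at θ = 0 the risk ratio should be close to k/2, in agreement with (1.25):

```python
# Risk of the sample mean vs. Stein's estimator (1 - (k-2)/(n|Xbar|^2)) Xbar.
import numpy as np

rng = np.random.default_rng(2)
k, n, reps = 10, 50, 20_000
theta = np.zeros(k)                       # theta = 0: the most favorable case
xbar = theta + rng.normal(size=(reps, k)) / np.sqrt(n)   # Xbar ~ N(theta, J/n)
sq = (xbar**2).sum(axis=1)
stein = (1 - (k - 2) / (n * sq))[:, None] * xbar
risk_mean = ((xbar - theta)**2).sum(axis=1).mean()       # ~ k/n
risk_stein = ((stein - theta)**2).sum(axis=1).mean()     # ~ 2/n at theta = 0
print(risk_mean, risk_stein, risk_mean / risk_stein)     # ratio ~ k/2
```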
1.3 Consistency. Methods for Constructing Consistent Estimators

1.3.1 An Existence Theorem

Consider a sequence of statistical experiments (X^n, 𝔛^n, P_θ^n, θ ∈ Θ) generated by observations X^{(n)} = (X_1, X_2, ..., X_n), where X_1, X_2, ... is a sequence of independent observations with values in (X, 𝔛). Would it not be possible to achieve arbitrary precision by increasing indefinitely the number of observations n? A sequence of statistics {T_n(X_1, ..., X_n)} is called a consistent sequence of estimators for the value ϕ(θ) of the function ϕ: Θ → R^l if

$$T_n \xrightarrow{\ P_\theta^n\ } \varphi(\theta). \tag{1.26}$$

Below we shall consider more general families of experiments than sequences of repeated samples, and we now extend the definition of consistent estimators to this formally more general case. Consider the family of experiments (X^{(ε)}, 𝔛^{(ε)}, P_θ^{(ε)}, θ ∈ Θ) generated by observations X^{(ε)}, where ε is a real parameter. For our purposes it is sufficient to deal with the case ε ∈ (0, 1), with asymptotic estimation problems studied as ε → 0. Observe that the case of a sequence of experiments (X^n, 𝔛^n, P_θ^n, θ ∈ Θ) is a particular case of this scheme: choose ε = n^{-1}, n = 1, 2, ....

The family of statistics T_ε = T_ε(X^{(ε)}) is called a consistent family of estimators for the value ϕ(θ) of the function ϕ: Θ → R^l if T_ε → ϕ(θ) in P_θ^{(ε)}-probability as ε → 0, for all θ ∈ Θ. A family of statistics {T_ε} is called a uniformly consistent family of estimators for ϕ(θ) on the set K ⊂ Θ if T_ε → ϕ(θ) in P_θ^{(ε)}-probability as ε → 0 uniformly in θ ∈ K. The latter means that for any δ > 0,
$$\sup_{\theta\in K} P_\theta^{(\varepsilon)}\big(|T_\varepsilon(X^{(\varepsilon)}) - \varphi(\theta)| > \delta\big) \to 0. \tag{1.27}$$
Observe that since the only information about the parameter θ is that it belongs to the set Θ, it is uniformly consistent estimators, in Θ or in compact sets K ⊂ Θ, that are useful. In this section we shall consider the most commonly employed methods for constructing consistent estimators. We first verify that in the case of repeated sampling consistent estimators exist under very general assumptions. For repeated experiments, we consider on Θ the Hellinger-type distances between the measures with densities f(x; θ) and f(x; θ′) with respect to ν:

$$r_p(\theta;\theta') = \left(\int_{\mathcal X} |f^{1/p}(x;\theta) - f^{1/p}(x;\theta')|^p\, \nu(dx)\right)^{1/p}, \quad p \ge 1. \tag{1.28}$$
Clearly, 0 ≤ r_p(θ; θ′) ≤ 2^{1/p}, and in view of the identifiability condition P_θ ≠ P_{θ′} for θ ≠ θ′, we have r_p(θ; θ′) = 0 if and only if θ = θ′. Below we shall often use the distances r_1 and r_2.

Theorem 1.3.1. Let the conditions
1. inf_{θ′: |θ−θ′|>δ} r_1(θ; θ′) > 0 for all δ > 0, θ ∈ Θ,
2. lim_{θ′→θ} r_1(θ; θ′) = 0

be satisfied. Then there exists a sequence {T_n} of estimators consistent for θ. If the conditions above are satisfied uniformly in θ belonging to a compact set K ⊂ Θ, then there exists a sequence of estimators uniformly consistent in K.

Theorem 1.3.1 is an existence theorem which cannot be used for the actual construction of consistent estimators. We now turn to practically usable methods.
1.3.2 Method of Moments

This method was suggested by K. Pearson and is historically the first general method for the construction of estimators. In general terms it can be described as follows. Let (X^{(ε)}, 𝔛^{(ε)}, P_θ^{(ε)}, θ ∈ Θ), Θ ⊂ R^k, be a family of statistical experiments and let g_1(θ), ..., g_k(θ) be k real-valued functions on Θ. Assume that there exist consistent estimators g̃_1, ..., g̃_k for g_1(θ), ..., g_k(θ). The method of moments recommends choosing as an estimator for θ the solution of the system of equations
$$g_i(\theta) = \tilde g_i, \quad i = 1, 2, \ldots, k. \tag{1.29}$$
To specialize this system, consider once more the statistical experiment generated by a sequence of iid random variables X_1, X_2, .... Assume that the X_i are real-valued random variables with a finite k-th moment and let α_ν(θ) = E_θ X_1^ν, ν ≤ k. Denote by a_ν the ν-th sample moment, i.e. a_ν = (Σ_{i=1}^n X_i^ν)/n. It is known that a_ν is a consistent estimator for α_ν. Thus, in view of the above, one can choose as an estimator for θ the solution of the system of equations
$$\alpha_\nu(\theta) = a_\nu, \quad \nu = 1, 2, \ldots, k. \tag{1.30}$$
Theorem 1.3.2. Let the functions α_ν(θ) possess continuous partial derivatives on Θ and let the Jacobian det(∂α_ν/∂θ_i), θ = (θ_1, ..., θ_k), be different from zero everywhere on Θ. Let the method of moments equations possess a unique solution T_n with probability approaching 1 as n → ∞. Then this solution is a consistent estimator for θ.
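As a concrete illustration of this recipe (an added sketch; the Gamma parametrization with shape a and scale s is an assumed example), matching the first two moments α_1 = as and α_2 = as² + (as)² yields closed-form method-of-moments estimators:

```python
# Method of moments for a Gamma(a, s) sample: solve alpha_1 = a*s,
# alpha_2 = a*s^2 + (a*s)^2 for (a, s) from the sample moments.
import numpy as np

rng = np.random.default_rng(3)
a_true, s_true, n = 2.5, 1.7, 100_000
x = rng.gamma(a_true, s_true, size=n)
m1, m2 = x.mean(), (x**2).mean()
s_hat = (m2 - m1**2) / m1        # closed-form solution of the system (1.30)
a_hat = m1 / s_hat
print(a_hat, s_hat)              # consistent: close to (2.5, 1.7)
```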
1.3.3 Method of Maximum Likelihood

This method, suggested by R. A. Fisher, is one of the most commonly used general methods for constructing consistent estimators. Let dP_θ/dµ = p(x; θ) and let X be the observation. The function p(X; θ) is called the likelihood function corresponding to the experiment; thus p(X; θ) is a random function of θ defined on Θ ⊂ R^k. The statistic θ̂ defined by the relation

$$p(X;\hat\theta) = \sup_{\theta\in\bar\Theta} p(X;\theta) \tag{1.31}$$

is called the maximum likelihood estimator of the parameter θ based on the observation X. It may, of course, turn out that the maximization has no solution; below we shall consider the case when a solution does exist. If there are several maximizers we shall assume, unless otherwise specified, that any one of them is a maximum likelihood estimator. If p(X; θ) is a smooth function of θ and θ̂ ∈ Θ, then θ̂ is necessarily also a solution of the likelihood equation

$$\frac{\partial \ln p(X;\theta)}{\partial \theta_i} = 0, \quad i = 1, 2, \ldots, k, \quad \theta = (\theta_1, \ldots, \theta_k). \tag{1.32}$$

To prove the consistency of maximum likelihood estimators it is convenient to use the following simple general result.

Lemma 1.3.1. Let (X^{(ε)}, 𝔛^{(ε)}, P_θ^{(ε)}, θ ∈ Θ) be a family of experiments and let the likelihood functions p(X^{(ε)}; θ) correspond to these experiments. Set

$$Z_{\varepsilon,\theta}(u) = Z_\varepsilon(u) = \frac{p(X^{(\varepsilon)};\theta+u)}{p(X^{(\varepsilon)};\theta)}, \quad u \in U_\varepsilon = \Theta - \theta. \tag{1.33}$$

Then in order for the maximum likelihood estimator θ̂_ε to be consistent it is sufficient that for all θ ∈ Θ and γ > 0,
$$P_\theta^{(\varepsilon)}\Big(\sup_{|u|>\gamma} Z_{\varepsilon,\theta}(u) \ge 1\Big) \to 0 \tag{1.34}$$

as ε → 0. If the last relation holds uniformly in θ ∈ K ⊂ Θ, then the estimator θ̂_ε is uniformly consistent in K.
We now turn to the case of independent identically distributed observations X_1, ..., X_n, where X_i possesses the density f(x; θ) with respect to a measure ν. The maximum likelihood estimator θ̂_n is the solution of the equation

$$\prod_{i=1}^n f(X_i;\hat\theta_n) = \sup_{\theta}\prod_{i=1}^n f(X_i;\theta). \tag{1.35}$$

We show that under quite general assumptions θ̂_n is a consistent estimator.

Theorem 1.3.3. Let Θ be a bounded open set in R^k, let f(x; θ) be a continuous function of θ on Θ̄ for ν-almost all x, and let the following conditions be fulfilled:

1. For all θ ∈ Θ and all γ > 0,

$$\inf_{|\theta'-\theta|>\gamma} r_2^2(\theta;\theta') = k_\theta(\gamma) > 0. \tag{1.36}$$
2. For all θ ∈ Θ̄,

$$\left(\int_{\mathcal X}\sup_{|h|\le\delta}\big(f^{1/2}(x;\theta) - f^{1/2}(x;\theta+h)\big)^2\,d\nu\right)^{1/2} = \omega_\theta(\delta) \to 0 \tag{1.37}$$

as δ → 0.

Then for all θ ∈ Θ the estimator θ̂_n → θ as n → ∞ with probability one.

In order to prove the consistency of maximum likelihood estimators in the case of an unbounded parameter set, additional conditions on the interrelation between f(x; θ) and f(x; θ′) as |θ − θ′| → ∞ are needed. The simplest variant of such a condition is the following.
Theorem 1.3.4. Let Θ be an open set in R^k, let f(x; θ) be a continuous function of θ for ν-almost all x, and let the conditions of Theorem 1.3.3 as well as the following condition be fulfilled: for all θ ∈ Θ,

$$\lim_{c\to\infty}\int_{\mathcal X}\sup_{|u|\ge c} f^{1/2}(x;\theta)\,f^{1/2}(x;\theta+u)\,d\nu < 1. \tag{1.38}$$

Then θ̂_n is a consistent estimator for θ.
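Numerically, the maximum likelihood estimator is obtained by maximizing the log-likelihood; for families where the likelihood equation (1.32) may have several roots, one should take the global maximizer. A minimal sketch for a Cauchy location family (an assumed example, not from the text; the grid bounds are arbitrary):

```python
# MLE for the Cauchy location family f(x; theta) = 1/(pi (1 + (x - theta)^2)):
# maximize the log-likelihood over a grid and take the global maximizer.
import numpy as np

rng = np.random.default_rng(4)
theta_true, n = 1.0, 500
x = theta_true + rng.standard_cauchy(n)
grid = np.linspace(-10, 10, 4001)
loglik = -np.log1p((x[:, None] - grid[None, :])**2).sum(axis=0)  # up to constants
theta_hat = grid[np.argmax(loglik)]
print(theta_hat)                 # close to 1.0 for moderate n
```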
1.3.4 Bayesian Estimates
Theorem 1.3.5. Let Θ be an open bounded set in R^k and let the density f(x; θ) satisfy the following conditions:

1. $\inf_{|\theta-\theta'|>\gamma}\int_{\mathcal X}(f^{1/2}(x;\theta) - f^{1/2}(x;\theta'))^2\,d\nu = k_\theta(\gamma) > 0$ for all θ ∈ Θ, γ > 0.

2. $\int_{\mathcal X}(f^{1/2}(x;\theta+h) - f^{1/2}(x;\theta))^2\,d\nu = O\big((\ln|h|)^{-2}\big)$, h → 0, for all θ ∈ Θ.
Then the estimator t̃_n, which is Bayesian relative to the loss function W(u; θ) = |u − θ|^α, α ≥ 1, and a prior density q(θ) continuous and positive on Θ, is a consistent estimator of the parameter θ.
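For α = 1 the Bayes estimator is the posterior median. A small numerical sketch of its consistency (Bernoulli observations with a uniform prior are an assumed setup, not from the text):

```python
# Posterior median under a uniform prior on (0, 1): the posterior is
# Beta(1 + s, 1 + n - s); compute its median on a grid.
import numpy as np

rng = np.random.default_rng(5)
theta_true = 0.3
for n in [20, 200, 20000]:
    s = rng.binomial(n, theta_true)
    t = np.linspace(1e-6, 1 - 1e-6, 100_001)
    log_post = s * np.log(t) + (n - s) * np.log1p(-t)
    post = np.exp(log_post - log_post.max())
    cdf = np.cumsum(post); cdf /= cdf[-1]
    print(n, t[np.searchsorted(cdf, 0.5)])   # -> 0.3 as n grows (consistency)
```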
1.4 Inequalities for Probabilities of Large Deviations

1.4.1 Convergence of θ̂ to θ
Let a family of statistical experiments (X^{(ε)}, 𝔛^{(ε)}, P_θ^{(ε)}, θ ∈ Θ) generated by observations X^{(ε)} be given. Here Θ is an open subset of R^k. Let p(X^{(ε)}; θ) = (dP_θ^{(ε)}/dν_ε)(X^{(ε)}), where ν_ε is a measure on 𝔛^{(ε)}. Consider the random function
$$\zeta_\varepsilon(u) = \frac{p(X^{(\varepsilon)};\theta+u)}{p(X^{(\varepsilon)};\theta)}, \tag{1.39}$$

where θ is the "true" value of the parameter. We have seen above (Lemma 1.3.1) that as long as the function ζ_ε(u) is sufficiently small for large u, the maximum likelihood estimators θ̂_ε constructed from the observations X^{(ε)} are consistent on Θ. One can expect that by quantifying in some manner the decrease of ζ_ε(u) as u → ∞ and ε → 0, we could pinpoint more closely the nature of the convergence of θ̂_ε to θ. If, for example, for any γ > 0,
$$P_\theta^{(\varepsilon)}\Big(\sup_{|u|>\gamma}\frac{p(X^{(\varepsilon)};\theta+u)}{p(X^{(\varepsilon)};\theta)} \ge 1\Big) \to 0 \tag{1.40}$$

as ε → 0, then evidently also

$$P_\theta^{(\varepsilon)}\big(|\hat\theta_\varepsilon - \theta| > \gamma\big) \to 0 \tag{1.41}$$

as ε → 0. This simple device is essentially the basis for the proofs of all the theorems stated below. Set

$$Z_{\varepsilon,\theta}(u) = Z_\varepsilon(u) = \frac{p(X^{(\varepsilon)};\theta+\varphi(\varepsilon)u)}{p(X^{(\varepsilon)};\theta)}, \tag{1.42}$$

where ϕ(ε) denotes some nondegenerate matrix normalizing factor; it is also assumed that |ϕ(ε)| → 0 as ε → 0. Thus the function Z_{ε,θ} is defined on the set U_ε = (ϕ(ε))^{-1}(Θ − θ). Below we shall denote by G the set of families of functions {g_ε(y)} possessing the following properties:
1. For fixed ε, g_ε(y) is monotonically increasing to ∞ as a function of y on [0, ∞).
2. For any N > 0,

$$\lim_{y\to\infty,\ \varepsilon\to 0} y^N e^{-g_\varepsilon(y)} = 0. \tag{1.43}$$
For example, if g_ε(y) = y^α, α > 0, then {g_ε} ∈ G. We shall agree throughout this section to denote nonnegative constants by the letter B, while the letter b is reserved for positive constants. When we wish to emphasize the dependence of these constants on certain parameters a_1, a_2, ..., we shall sometimes write B(a_1, a_2, ...).
Theorem 1.4.1. Let the functions Z_{ε,θ}(u) be continuous with probability 1 and possess the following properties: to an arbitrary compact set K ⊂ Θ there correspond nonnegative numbers M_1 and m_1 (depending on K) and functions g_ε^K(y) = g_ε(y), {g_ε} ∈ G, such that:

1. There exist numbers α > k and m ≥ α such that for all θ ∈ K,
$$\sup_{|u_1|\le R,\,|u_2|\le R} |u_2-u_1|^{-\alpha}\, E_\theta^{(\varepsilon)}\big|Z_{\varepsilon,\theta}^{1/m}(u_2) - Z_{\varepsilon,\theta}^{1/m}(u_1)\big|^m \le M_1(1+R^{m_1}). \tag{1.44}$$
2. For all u ∈ U_ε, θ ∈ K,

$$E_\theta^{(\varepsilon)} Z_{\varepsilon,\theta}^{1/2}(u) \le e^{-g_\varepsilon(|u|)}. \tag{1.45}$$

Then the maximum likelihood estimator θ̂_ε is consistent and, for all ε sufficiently small, 0 < ε < ε_0,
$$\sup_{\theta\in K} P_\theta^{(\varepsilon)}\big(|(\varphi(\varepsilon))^{-1}(\hat\theta_\varepsilon - \theta)| > H\big) \le B_0 e^{-b_0 g_\varepsilon(H)}, \tag{1.46}$$

where B_0, b_0 > 0 are constants. This theorem, as well as Theorem 1.4.2 below, may appear exceedingly cumbersome, involving conditions which are difficult to verify. In the next subsection we shall, however, illustrate the usefulness of these theorems by applying them to sequences of independent homogeneous observations. Both Theorems 1.4.1 and 1.4.2 play a significant role in the subsequent chapters as well.

Corollary 1.4.1. Under the conditions of Theorem 1.4.1, for any N > 0 we have
$$\lim_{H\to\infty,\ \varepsilon\to 0} H^N \sup_{\theta\in K} P_\theta^{(\varepsilon)}\big(|(\varphi(\varepsilon))^{-1}(\hat\theta_\varepsilon - \theta)| > H\big) = 0. \tag{1.47}$$
Corollary 1.4.2. Under the conditions of Theorem 1.4.1, for any function w ∈ W_p,
$$\limsup_{\varepsilon\to 0}\, E_\theta^{(\varepsilon)} w\big((\varphi(\varepsilon))^{-1}(\hat\theta_\varepsilon - \theta)\big) < \infty. \tag{1.48}$$
If we replace the maximum likelihood estimator by a Bayesian one, condition (1) of Theorem 1.4.1 may be substantially weakened, the only requirement being that α is positive.
Theorem 1.4.2. Let the functions Z_{ε,θ}(u) possess the following properties: to any compact set K ⊂ Θ there correspond numbers M_1 > 0, m_1 ≥ 0 and functions g_ε^K(y) ∈ G such that:

1. For some α > 0 and all θ ∈ K,
$$\sup_{|u_1|\le R,\,|u_2|\le R} |u_2-u_1|^{-\alpha}\, E_\theta^{(\varepsilon)}\big|Z_{\varepsilon,\theta}^{1/2}(u_2) - Z_{\varepsilon,\theta}^{1/2}(u_1)\big|^2 \le M_1(1+R^{m_1}). \tag{1.49}$$
2. For all u ∈ U_ε, θ ∈ K,

$$E_\theta^{(\varepsilon)} Z_{\varepsilon,\theta}^{1/2}(u) \le e^{-g_\varepsilon(|u|)}. \tag{1.50}$$
Let {t̃_ε} be a family of estimators, Bayesian with respect to a prior density q—continuous and positive on K and possessing in Θ a polynomial majorant—and a loss function W_ε(u, v) = l((ϕ(ε))^{-1}(u − v)), where
1. l ∈ W_p^0;
2. there exist numbers γ > 0,H0 ≥ 0 such that for H ≥ H0,
$$\sup\{l(u) : |u| \le H^\gamma\} - \inf\{l(u) : |u| \ge H\} \le 0. \tag{1.51}$$
Then for any N,

$$\lim_{H\to\infty,\ \varepsilon\to 0} H^N \sup_{\theta\in K} P_\theta^{(\varepsilon)}\big(|(\varphi(\varepsilon))^{-1}(\tilde t_\varepsilon - \theta)| > H\big) = 0. \tag{1.52}$$
If in addition l(u) = τ(|u|), then for all ε sufficiently small, 0 < ε < ε_0,
$$\sup_{\theta\in K} P_\theta^{(\varepsilon)}\big(|(\varphi(\varepsilon))^{-1}(\tilde t_\varepsilon - \theta)| > H\big) \le B_0 e^{-b_0 g_\varepsilon(H)}. \tag{1.53}$$
Corollary 1.4.3. Under the conditions of Theorem 1.4.2, for any function w ∈ W_p,
$$\limsup_{\varepsilon\to 0}\, E_\theta^{(\varepsilon)} w\big((\varphi(\varepsilon))^{-1}(\tilde t_\varepsilon - \theta)\big) < \infty. \tag{1.54}$$
1.4.2 Some Basic Theorems and Lemmas

In this subsection we shall consider a sequence of statistical experiments (X^n, 𝔛^n, P_θ^n, θ ∈ Θ), where Θ is an open subset of R^k, generated by a sequence of homogeneous independent observations X_1, X_2, ..., X_n with common density f(x; θ) with respect to a measure ν. Based on the results of the preceding section we shall continue the study of consistent estimators {T_n} for θ, with the aim of determining the rate of convergence of {T_n} to θ for certain classes of estimators {T_n}. It will be shown that the rate of convergence depends on the asymptotic behavior of the Hellinger distance
$$r_2(\theta;\theta+h) = \left(\int_{\mathcal X}|f^{1/2}(x;\theta+h) - f^{1/2}(x;\theta)|^2\,d\nu\right)^{1/2} \tag{1.55}$$

as h → 0.
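Before stating the theorems, a quick numerical sketch of this behavior may be helpful (an added illustration; the normal and uniform location families are assumed examples, cf. Examples 1.4.1 and 1.4.4 below): for the smooth normal family r_2²(θ; θ+h) ≍ h² (α = 2), while for the uniform density on [θ − 1/2, θ + 1/2] one finds r_2² ≍ |h| (α = 1).

```python
# Squared Hellinger distance r_2^2(theta; theta + h) for two location families,
# approximated by a Riemann sum on a fine grid.
import numpy as np

x = np.linspace(-20.0, 20.0, 400_001)
dx = x[1] - x[0]

def normal(t):   # N(t, 1) density
    return np.exp(-(x - t)**2 / 2) / np.sqrt(2 * np.pi)

def unif(t):     # U[t - 1/2, t + 1/2] density
    return ((x >= t - 0.5) & (x <= t + 0.5)).astype(float)

for h in [0.2, 0.1, 0.05]:
    for f, name in [(normal, "normal"), (unif, "uniform")]:
        r2sq = ((np.sqrt(f(h)) - np.sqrt(f(0.0)))**2).sum() * dx
        print(name, h, r2sq)   # normal ~ h^2/4 (alpha = 2); uniform ~ 2h (alpha = 1)
```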
Theorem 1.4.3. Let Θ be a bounded interval in R^1, let f(x; θ) be a continuous function of θ on Θ̄ for ν-almost all x, and let the following conditions be satisfied:

1. There exists a number α > 1 such that
$$\sup_{\theta\in\Theta}\sup_h |h|^{-\alpha}\, r_2^2(\theta;\theta+h) = A < \infty. \tag{1.56}$$
2. For any compact set K there corresponds a positive number a(K) = a > 0 such that
$$r_2^2(\theta;\theta+h) \ge \frac{a|h|^\alpha}{1+|h|^\alpha}, \quad \theta\in K. \tag{1.57}$$
Then the maximum likelihood estimator θ̂_n is defined, is consistent, and
$$\sup_{\theta\in K} P_\theta\big(n^{1/\alpha}|\hat\theta_n - \theta| > H\big) \le B_0 e^{-b_0 a H^\alpha}, \tag{1.58}$$

where the positive constants B_0, b_0 do not depend on K. A version of Theorem 1.4.3 for an unbounded interval can be stated as follows.
Theorem 1.4.4. Let Θ be an interval in R^1 (not necessarily bounded), let f(x; θ) be a continuous function of θ for ν-almost all x, and let the following conditions be satisfied: there exist numbers α > 1 and γ > 0 such that
1. For some positive numbers M, m,

$$\sup_{|\theta|\le R}\sup_h |h|^{-\alpha}\, r_2^2(\theta;\theta+h) \le M(1+R^m).$$

2. To any compact set K ⊂ Θ there corresponds a positive number a(K) = a > 0 such that

$$r_2^2(\theta;\theta+h) \ge \frac{a|h|^\alpha}{1+|h|^\alpha}, \quad \theta\in K. \tag{1.59}$$

3. To any compact set K ⊂ Θ there corresponds a number c = c(K) such that

$$\int_{\mathcal X}\sup_{|h|>R} f^{1/2}(x;\theta)\,f^{1/2}(x;\theta+h)\,d\nu \le cR^{-\gamma}, \quad \theta\in K. \tag{1.60}$$

Then a maximum likelihood estimator is defined, is consistent, and for all n ≥ n_0(γ) the following inequalities are satisfied: to any number Λ > 0 there correspond positive constants B_i, b_i (depending also on α, M, m, a, γ, c) such that

$$\sup_{\theta\in\Theta} P_\theta\big(n^{1/\alpha}|\hat\theta_n-\theta| > H\big) \le \begin{cases} B_1 e^{-b_1 a H^\alpha}, & H < \Lambda n^{1/\alpha},\\ B_2 e^{-b_2 c\, n\ln(H n^{-1/\alpha})}, & H \ge \Lambda n^{1/\alpha}. \end{cases} \tag{1.61}$$

Remark 1.4.1. The last inequality in Theorem 1.4.4 is equivalent to the following:

$$\sup_{\theta\in\Theta} P_\theta\big(|\hat\theta_n-\theta| > \delta\big) \le \begin{cases} B_1 e^{-b_1 a n\delta^\alpha}, & \delta < \Lambda,\\ B_2 e^{-b_2 c\, n\ln\delta}, & \delta \ge \Lambda. \end{cases} \tag{1.62}$$

We now present two theorems on Bayesian estimators. In these theorems Θ is an open subset of R^k. Bayesian estimators are constructed with respect to a positive continuous prior density q on Θ possessing a polynomial majorant on Θ. These theorems are analogous to Theorems 1.4.3 and 1.4.4: the first deals with the case of a bounded set Θ and the second with an unbounded one. However, the transition to Bayesian estimators allows us to substantially weaken the restrictions on f whenever the dimension k of the parameter set is greater than 1.

Theorem 1.4.5. Let Θ be an open bounded set in R^k. Let the following conditions be satisfied: there exists a number α > 0 such that

1. $\sup_{\theta\in\Theta}\sup_h |h|^{-\alpha}\, r_2^2(\theta;\theta+h) = A < \infty$;

2. to any compact set K there corresponds a positive number a(K) = a > 0 such that

$$r_2^2(\theta;\theta+h) \ge \frac{a|h|^\alpha}{1+|h|^\alpha}, \quad \theta\in K. \tag{1.63}$$

Finally, let {t̃_n} be a sequence of estimators which are Bayesian with respect to the prior density q and the loss function W_n(u, v) = l(n^{1/α}|u − v|), where l ∈ W_p^0. Then

$$\sup_{\theta\in K} P_\theta\big(n^{1/\alpha}|\tilde t_n-\theta| > H\big) \le B e^{-b a H^\alpha}. \tag{1.64}$$

Here the constants B, b are positive and depend only on A, α and the diameter of Θ.

Theorem 1.4.6. Let Θ be an open set in R^k. Let the following conditions be satisfied: there exist numbers α > 0, γ > 0 such that

1. $$\sup_{|\theta|\le R}\sup_h |h|^{-\alpha}\, r_2^2(\theta;\theta+h) \le M_1(1+R^{m_1}), \tag{1.65}$$

where M_1, m_1 are positive numbers;

2. to any compact set K ⊂ Θ there corresponds a positive number a(K) = a > 0 such that

$$r_2^2(\theta;\theta+h) \ge \frac{a|h|^\alpha}{1+|h|^\alpha}, \quad \theta\in K; \tag{1.66}$$

3. to any compact set K ⊂ Θ there corresponds a positive number c(K) = c > 0 such that

$$\int_{\mathcal X}\sup_{|h|>R} f^{1/2}(x;\theta)\,f^{1/2}(x;\theta+h)\,d\nu \le cR^{-\gamma}, \quad \theta\in K. \tag{1.67}$$

Finally, let {t̃_n} be a sequence of estimators which are Bayesian with respect to the prior density q and the loss function W_n(u, v) = l(n^{1/α}|u − v|), where l ∈ W_p. Then to any number Λ > 0 there correspond positive constants B_1, B_2, b_1, b_2 such that for all n > n_0(m_1, γ),

$$\sup_{\theta\in K} P_\theta\big(n^{1/\alpha}|\tilde t_n-\theta| > H\big) \le \begin{cases} B_1 e^{-b_1 a H^\alpha}, & H \le \Lambda n^{1/\alpha},\\ B_2 e^{-b_2 c\, n\ln(H n^{-1/\alpha})}, & H > \Lambda n^{1/\alpha}. \end{cases} \tag{1.68}$$

Theorem 1.4.7. Let conditions 1–3 of Theorem 1.4.6 be satisfied. Furthermore, let {t̃_n} be a sequence of estimators which are Bayesian with respect to the prior density q and the loss function W_n(u, v) = l(n^{1/α}|u − v|), where l ∈ W_p^0 and, moreover, for some γ_0 > 0, H_0 > 0 and all H > H_0 the inequality

$$\sup\{l(u) : |u| \le H^{\gamma_0}\} - \inf\{l(u) : |u| > H\} \le 0 \tag{1.69}$$

is fulfilled. Then for all n ≥ n_0,

$$\sup_{\theta\in K} P_\theta\big(n^{1/\alpha}|\tilde t_n-\theta| > H\big) \le B_N H^{-N}, \tag{1.70}$$

whatever the number N > 0 is.

One can formulate for maximum likelihood estimators theorems similar to Theorems 1.4.3 and 1.4.4 also in the case when Θ ⊂ R^k, k > 1; in these cases, in place of the Hellinger distance r_2 one should take the distance

$$r_m(\theta;\theta+h) = \left(\int_{\mathcal X}|f^{1/2}(x;\theta+h) - f^{1/2}(x;\theta)|^m\,d\nu\right)^{1/m}, \quad m > k, \tag{1.71}$$

and require that

$$r_m^m(\theta;\theta+h) \ge \frac{a|h|^\alpha}{1+|h|^\alpha}, \quad \alpha > k. \tag{1.72}$$
We shall not dwell on this point in detail, but present the following result, which is somewhat different from the preceding theorems.

Theorem 1.4.8. Let Θ be a bounded open convex set in R^k, let f(x; θ) be a continuous function of θ on Θ̄ for ν-almost all x, and let the following conditions be satisfied:

1. There exists a number α > 0 such that

$$\sup_{\theta_1,\theta_2\in\Theta}\int_{\mathcal X}\frac{(f^{1/2}(x;\theta_2) - f^{1/2}(x;\theta_1))^2}{|\theta_1-\theta_2|^\alpha}\,d\nu = A < \infty. \tag{1.73}$$

2. There exists a number β > 0 such that for all θ ∈ Θ,

$$r_2^2(\theta;\theta+h) \ge \frac{a(\theta)|h|^\beta}{1+|h|^\beta}, \tag{1.74}$$

where a(θ) > 0.

Then the maximum likelihood estimator θ̂_n is defined, is consistent, and for any λ ≤ 1/β,

$$P_\theta\big(n^\lambda|\hat\theta_n-\theta| > H\big) \le B\, n^{\beta^{-1}-(2\alpha)^{-1}}\, e^{-n^{1-\lambda\beta}\, b\, a(\theta)\, H^\beta}. \tag{1.75}$$

Here the positive constants B, b depend only on A, α, β and the diameter of the set Θ.

1.4.3 Examples

Example 1.4.1. Let (X_1, ..., X_n) be a sample from the normal distribution N(a, σ²), where a, −∞ < a < ∞, is an unknown parameter. The conditions of Theorem 1.4.4 are fulfilled with α = 2. The maximum likelihood estimator for a is X̄, and for this estimator the inequality of Theorem 1.4.4 is satisfied with α = 2. Since L(X̄) = N(a, σ²/n), this inequality can be substantially improved.

Example 1.4.2. Now let the variables X_i ~ Bern(p), 0 < p < 1. The conditions of Theorem 1.4.3 are satisfied if 0 < p_0 ≤ p ≤ p_1 < 1, with α = 2. The maximum likelihood estimator for p is once again X̄. Since the set of values the parameter p can take is compact, we have

$$P_p\big(\sqrt n\,|\bar X - p| > H\big) \le B_1 e^{-b_1 H^2}. \tag{1.76}$$

This bound can be substantially refined using Hoeffding's inequality, which in our case asserts that

$$P_p\big(\sqrt n\,|\bar X - p| > H\big) \le 2e^{-2H^2}, \quad 0 < p < 1. \tag{1.77}$$

Example 1.4.3. Let θ be a location parameter. This means that X = Θ = R^k and the distribution P_θ on R^k possesses a density of the form f(x − θ) with respect to the Lebesgue measure λ. In this case we always have

$$\int_{\mathbb R^k}|f^{1/2}(x+h) - f^{1/2}(x)|^2\,dx = r(h) \ge \frac{a|h|^2}{1+|h|^2}, \quad a > 0. \tag{1.78}$$

Indeed, denote by ϕ(t) the Fourier transform of the function f^{1/2}(x). It follows from the Parseval equality that

$$\liminf_{|h|\to 0}\frac{r(h)}{|h|^2} = \liminf_{|h|\to 0}\frac{4}{|h|^2}\int \sin^2\big(\tfrac12\langle t,h\rangle\big)\,|\varphi(t)|^2\,dt = \inf_{|h|=1}\int \langle t,h\rangle^2\,|\varphi(t)|^2\,dt = \mu \ge 0. \tag{1.79}$$

However, λ{t : |ϕ(t)| > 0} > 0, so that the quadratic form on the right-hand side is positive definite and µ > 0 (the case µ = ∞ is not excluded). Moreover, it is clear that r(h) → 2 as |h| → ∞.

Example 1.4.4. Let (X_1, ..., X_n) be a sample from the uniform distribution on the interval [θ − 1/2, θ + 1/2], where θ is the parameter to be estimated. This is a particular case of the location model with f(x) = 1 for |x| < 1/2 and f(x) = 0 for |x| > 1/2. It is easy to see that

$$\int |f^{1/2}(x+h) - f^{1/2}(x)|^2\,dx = 2|h|, \quad |h| \le 1, \tag{1.80}$$

so that α = 1. The statistic t_n = (X_max + X_min)/2 is a Bayesian estimator with respect to the quadratic loss function and the uniform prior distribution (Pitman's estimator), and in view of Theorem 1.4.5,

$$P_\theta\big(n|t_n-\theta| > H\big) \le B_1 e^{-b_1 H}, \quad H < n/2. \tag{1.81}$$

Taking into account the exact form of the distribution of the statistic t_n, one can write for n ≥ 2,

$$P_\theta\big(n|t_n-\theta| > H\big) = 2\int_H^{n/2}\Big(1-\frac{2u}{n}\Big)^{n-1}du \le 2\int_H^{\infty} e^{-u}\,du = 2e^{-H}. \tag{1.82}$$

Example 1.4.5. Let (X_1, ..., X_n) be a sample from a population with a Gamma distribution with unknown location parameter θ, i.e. let X_j possess on the real line the probability density

$$\frac{(x-\theta)^{p-1}e^{\theta-x}}{\Gamma(p)}, \quad \theta \le x < \infty, \tag{1.83}$$

where p > 0 is known. Once again the conditions of Example 1.4.3 are satisfied and

$$r(h) \sim \begin{cases} h^2, & p > 2,\\ h^p, & p < 2. \end{cases} \tag{1.84}$$
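Returning to Example 1.4.2, the relationship between the empirical tail of √n(X̄ − p) and Hoeffding's bound (1.77) is easy to check by simulation (an added sketch; the values of p, n and H are illustrative):

```python
# Empirical tail P_p(sqrt(n)|Xbar - p| > H) vs. Hoeffding's bound 2 exp(-2 H^2).
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 200, 200_000
xbar = rng.binomial(n, p, size=reps) / n
for H in [0.5, 1.0, 1.5, 2.0]:
    emp = (np.sqrt(n) * np.abs(xbar - p) > H).mean()
    print(H, emp, 2 * np.exp(-2 * H**2))   # empirical tail sits below the bound
```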
Example 1.4.6. Experiments with finite Fisher information serve as a source of a large number of examples. We will discuss this case in more detail later.

1.5 Lower Bounds on the Risk Function

Let a repeated sample X_1, ..., X_n of size n with density f(x; θ) with respect to some measure ν be given. As usual, let θ ∈ Θ ⊂ R^k. If T_n = T_n(X_1, ..., X_n) is an estimator for θ, we set

$$S^{(m)}(T_n;\theta) = E_\theta|T_n-\theta|^m. \tag{1.85}$$

Theorem 1.5.1. Let, for all θ ∈ Θ,

$$r_2^2(\theta;\theta') = \int_{\mathcal X}\big(\sqrt{f(x;\theta)} - \sqrt{f(x;\theta')}\big)^2\,\nu(dx) \le K_1(\theta)|\theta-\theta'|^\alpha, \quad K_1 > 0,\ \alpha > 0, \tag{1.86}$$

as long as |θ − θ′| ≤ h_1(θ), h_1 > 0. Denote by j a vector of unit length in R^k. Then for any sequence of estimators {T_n},

$$\liminf_{n\to\infty} n^{m/\alpha}\Big[S^{(m)}(T_n;\theta) + S^{(m)}\big(T_n;\theta + (2nK_1(\theta))^{-1/\alpha}j\big)\Big] > 0 \tag{1.87}$$

for all θ ∈ Θ and m ≥ 1.

Theorem 1.5.1 establishes an asymptotic lower bound on the risk function of arbitrary estimators; a comparison of this result with Theorems 1.4.3–1.4.7 shows that Bayesian estimators and maximum likelihood estimators are optimal, as far as the order of decrease of the risk is concerned, under power (polynomial) loss functions. If we denote

$$\tilde S^{(m)}(T_n;\theta) = \inf_{|j|=1} S^{(m)}\big(T_n;\theta + (2nK_1(\theta))^{-1/\alpha}j\big), \tag{1.88}$$

and if n > (2K_1 h_1^\alpha)^{-1}, then one can show that

$$\inf_{T_n}\Big[S^{(1)}(T_n;\theta) + \tilde S^{(1)}(T_n;\theta)\Big] \ge \frac{2^{-5}}{(2K_1 n)^{1/\alpha}}. \tag{1.89}$$

This implies

$$\inf_{T_n}\Big[S^{(m)}(T_n;\theta) + \tilde S^{(m)}(T_n;\theta)\Big] \ge 2^{-m+1}\,\frac{2^{-5m}}{(2K_1 n)^{m/\alpha}}. \tag{1.90}$$

In the last step we have used the elementary inequality

$$a^m + b^m \ge 2^{1-m}(a+b)^m, \quad a > 0,\ b > 0,\ m \ge 1. \tag{1.91}$$

1.6 Regular Statistical Experiments. The Cramér–Rao Inequality

1.6.1 Regular Statistical Experiments

Consider the statistical experiment (X, 𝔛, P_θ, θ ∈ Θ), where Θ is an open subset of R^k and all the measures P_θ are absolutely continuous with respect to a measure ν on 𝔛, with dP_θ/dν = p(x; θ). Let p(x; θ) be a continuous function of θ on Θ for ν-almost all x. Assume that for ν-almost all x the density p(x; u) is differentiable at the point u = θ and that all the integrals

$$I_{jj}(\theta) = \int_{\mathcal X}\Big(\frac{\partial p(x;\theta)}{\partial\theta_j}\Big)^2\frac{\nu(dx)}{p(x;\theta)} \tag{1.92}$$

are finite. Here the integration is carried out over those x for which p(x; θ) ≠ 0, so that

$$I_{jj}(\theta) = E_\theta\Big(\frac{\partial p(X;\theta)}{\partial\theta_j}\cdot\frac{1}{p(X;\theta)}\Big)^2. \tag{1.93}$$

Let us agree to interpret all integrals of the form

$$\int_{\mathcal X}\frac{a(x;\theta)}{p(x;\theta)}\,\nu(dx) \tag{1.94}$$

as integrals over the set {x : p(x; θ) ≠ 0}, i.e. as

$$E_\theta\,\frac{a(X;\theta)}{p^2(X;\theta)}. \tag{1.95}$$

The Cauchy–Schwarz inequality implies that together with I_jj(θ) all the integrals

$$I_{ij}(\theta) = \int_{\mathcal X}\frac{\partial p(x;\theta)}{\partial\theta_i}\,\frac{\partial p(x;\theta)}{\partial\theta_j}\,\frac{\nu(dx)}{p(x;\theta)}, \quad i,j = 1,2,\ldots,k, \tag{1.96}$$

are convergent. The matrix I(θ) whose ij-th entry is I_ij(θ) is called Fisher's information matrix. Denote by L_2(ν) the Hilbert space of functions square integrable with respect to the measure ν on X, with the scalar product (ϕ, ψ)_ν = ∫_X ϕ(x)ψ(x)ν(dx) and the norm ‖ϕ‖_ν. Note that for k = 1 Fisher's information amount can be written as

$$I(\theta) = 4\int_{\mathcal X}\Big(\frac{\partial}{\partial\theta}p^{1/2}(x;\theta)\Big)^2 d\nu = 4\Big\|\frac{\partial}{\partial\theta}p^{1/2}\Big\|_\nu^2, \tag{1.97}$$

and the information matrix as

$$I(\theta) = 4\int_{\mathcal X}\frac{\partial}{\partial\theta}p^{1/2}(x;\theta)\Big(\frac{\partial}{\partial\theta}p^{1/2}(x;\theta)\Big)^T d\nu. \tag{1.98}$$

We say that the experiment possesses finite Fisher information at a point θ ∈ Θ if the function p^{1/2}(·; u) is differentiable at the point u = θ in L_2(ν). Set, for brevity, p^{1/2}(x; θ) = g(x; θ). The differentiability of the function g in L_2(ν) means the following: there exists a function ψ: X × Θ → R^k such that

$$\int_{\mathcal X}|\psi(x;u)|^2\,d\nu = \|\psi(\cdot;u)\|_\nu^2 < \infty, \tag{1.99}$$

$$\int_{\mathcal X}|g(x;\theta+h) - g(x;\theta) - \langle\psi(x;\theta),h\rangle|^2\,d\nu = o(|h|^2), \quad h\to 0. \tag{1.100}$$
The matrix

$$I(\theta) = 4\int_{\mathcal X}\psi(x;\theta)(\psi(x;\theta))^T\,d\nu \tag{1.101}$$

will be called, as before, the Fisher information matrix, and the integral

$$I(\theta) = 4\int_{\mathcal X}|\psi(x;\theta)|^2\,d\nu, \quad \Theta\subset\mathbb R^1, \tag{1.102}$$

will be called Fisher's information amount (measure). If in addition the density p(x; u) is differentiable at the point θ, then of course

$$\psi(x;\theta) = \frac{\partial}{\partial\theta}p^{1/2}(x;\theta) \tag{1.103}$$

and, for example,

$$I(\theta) = 4\int_{\mathcal X}|\psi(x;\theta)|^2\,d\nu = \int_{\mathcal X}\frac{\big(\frac{\partial}{\partial\theta}p(x;\theta)\big)^2}{p(x;\theta)}\,d\nu. \tag{1.104}$$

Since our intention is to retain the classical expressions also in the case when the density p is not differentiable in the usual sense, we shall adhere to the following convention. Let c(s) be a differentiable function from R^1 into R^1. Formal differentiation yields

$$\frac{\partial}{\partial\theta}c(p(x;\theta)) = c'(p(x;\theta))\,\frac{\partial}{\partial\theta}p(x;\theta) \tag{1.105}$$
$$= 2c'(p(x;\theta))\,p^{1/2}(x;\theta)\,\frac{\partial}{\partial\theta}p^{1/2}(x;\theta) \tag{1.106}$$
$$= 2c'(p(x;\theta))\,p^{1/2}(x;\theta)\,\psi(x;\theta). \tag{1.107}$$

The expression on the right-hand side is defined provided the function g(x; θ) = p^{1/2}(x; θ) is differentiable in the mean square, and in this case we shall set by definition

$$\frac{\partial}{\partial\theta}c(p) = 2c'(p)\,p^{1/2}\,\psi. \tag{1.108}$$

For example, we have

$$\frac{\partial}{\partial\theta}p(x;\theta) = 2p^{1/2}(x;\theta)\,\psi(x;\theta), \tag{1.109}$$
$$\frac{\partial}{\partial\theta}\ln p(x;\theta) = 2p^{-1/2}(x;\theta)\,\psi(x;\theta), \tag{1.110}$$

and so on. Utilizing this convention, in the case of an experiment with finite Fisher information at the point θ we can use for the Fisher information matrix the notation

$$I(\theta) = \int_{\mathcal X}\frac{\partial p}{\partial\theta}\Big(\frac{\partial p}{\partial\theta}\Big)^T p^{-1}(x;\theta)\,d\nu. \tag{1.111}$$

We shall adhere to this notation below. Moreover, below we shall always use the very same notation ∂/∂θ for the ordinary derivatives as well as for the derivatives in the mean square. In general, in order to construct a well-developed theory of experiments with finite Fisher information it is necessary to impose certain smoothness conditions on the family {p(x; θ)}. We shall utilize the following definition.

Definition 1.6.1 (Regular Statistical Experiment). A statistical experiment E is called regular in Θ if

1. p(x; θ) is a continuous function of θ on Θ for ν-almost all x;
2. E possesses finite Fisher information at each point θ ∈ Θ;
3. the function ψ(·; θ) is continuous in the space L_2(ν).

Note that conditions 1–3 are not totally independent; for example, if Θ ⊂ R^1 then conditions 2–3 imply condition 1. Namely, if the density p(x; θ) satisfies conditions 2–3, it can be modified on sets of ν-measure zero (these sets may depend on θ) in such a manner that it becomes a continuous function of θ. Indeed, the measure ν may be considered to be a probability measure, so that p^{1/2}(x; θ) is a random function of θ satisfying the condition

$$E\big|p^{1/2}(\cdot;\theta+h) - p^{1/2}(\cdot;\theta)\big|^2 \le Bh^2, \tag{1.112}$$

where B is a constant. In view of Kolmogorov's continuity criterion, there exists a modification of p^{1/2}(·; θ) which is continuous with probability one.

It should be verified that the "regularity" property of the experiment and its information matrix do not depend on the choice of the measure ν dominating the family {P_θ}. Let µ be another such measure and let q(x; θ) = dP_θ/dµ. Then q(x; θ) = p(x; θ)γ(x), where γ does not depend on θ. Indeed, we have

$$\frac{dP_\theta}{d(\mu+\nu)}(x) = p(x;\theta)\,\frac{d\nu}{d(\mu+\nu)}(x) = q(x;\theta)\,\frac{d\mu}{d(\mu+\nu)}(x). \tag{1.113}$$

Consequently, the functions p, q either both satisfy the conditions of Definition 1.6.1 or both do not, and moreover

$$4\int_{\mathcal X}\frac{\partial\sqrt p}{\partial\theta}\Big(\frac{\partial\sqrt p}{\partial\theta}\Big)^T d\nu = 4\int_{\mathcal X}\frac{\partial\sqrt q}{\partial\theta}\Big(\frac{\partial\sqrt q}{\partial\theta}\Big)^T d\mu. \tag{1.114}$$
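As a quick numerical illustration of these definitions (an added sketch; Bernoulli(θ) is an assumed example), the variance of the score ∂ ln p(X; θ)/∂θ recovers the Fisher information I(θ) = 1/(θ(1 − θ)):

```python
# For Bernoulli(theta) the score is x/theta - (1-x)/(1-theta); its mean is 0
# and its variance equals the Fisher information 1/(theta(1-theta)).
import numpy as np

rng = np.random.default_rng(7)
theta = 0.3
x = rng.binomial(1, theta, size=2_000_000)
score = x / theta - (1 - x) / (1 - theta)
print(score.mean())                              # ~ 0 (the score has mean zero)
print(score.var(), 1 / (theta * (1 - theta)))    # both ~ 4.762
```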
The following lemma is a simple corollary of Definition 1.6.1.

Lemma 1.6.1. Let E = {X, 𝔛, P_θ, θ ∈ Θ} be a regular experiment. Then:

1. The matrix I(θ) is continuous on Θ (i.e. all of the functions I_ij(θ) are continuous on Θ).

2. The integrals I_ij(θ) converge uniformly on an arbitrary compact set K ⊂ Θ, i.e.

$$\lim_{A\to\infty}\sup_{\theta\in K}\int_{\mathcal X}\frac{\frac{\partial p}{\partial\theta_i}\frac{\partial p}{\partial\theta_j}}{p(x;\theta)}\; I\Big\{x : \Big|\frac{\frac{\partial p}{\partial\theta_i}\frac{\partial p}{\partial\theta_j}}{p(x;\theta)}\Big| > A\Big\}\,d\nu = 0. \tag{1.115}$$

3. If the interval {t + su | 0 ≤ s ≤ 1} belongs to Θ, then

$$g(x;t+u) - g(x;t) = \int_0^1\Big\langle\frac{\partial g}{\partial t}(x;t+su),\,u\Big\rangle\,ds \tag{1.116}$$

in L_2(ν) (i.e. the integral on the right-hand side is the limit in L_2(ν) of the Riemann sums).

Lemma 1.6.2. Let (X, 𝔛, P_θ, θ ∈ Θ), Θ ⊂ R^1, be a regular experiment and let the statistic T: X → R^1 be such that the function E_u T² is bounded in a neighbourhood of a point θ ∈ Θ. Then the function E_u T is continuously differentiable in a neighborhood of θ and

$$\frac{\partial}{\partial u}E_u T = \frac{\partial}{\partial u}\int_{\mathcal X}T(x)\,p(x;u)\,\nu(dx) = \int_{\mathcal X}T(x)\,\frac{\partial}{\partial u}p(x;u)\,\nu(dx). \tag{1.117}$$

Remark 1.6.1. It follows from (1.117) and the Fubini theorem that for all bounded T we have

$$\int_{\mathcal X}T(x)\,[p(x;\theta+u) - p(x;\theta)]\,d\nu = \int_{\mathcal X}T(x)\int_0^1\Big\langle\frac{\partial}{\partial\theta}p(x;\theta+su),\,u\Big\rangle\,ds\,d\nu, \tag{1.118}$$

provided the interval {θ + su : 0 ≤ s ≤ 1} ⊂ Θ. Consequently, for ν-almost all x,

$$p(x;\theta+u) - p(x;\theta) = \int_0^1\Big\langle\frac{\partial}{\partial\theta}p(x;\theta+su),\,u\Big\rangle\,ds. \tag{1.119}$$

Setting T(x) ≡ 1, we obtain

$$\int_{\mathcal X}\frac{\partial}{\partial\theta}p(x;\theta)\,d\nu = \frac{\partial}{\partial\theta}\int_{\mathcal X}p(x;\theta)\,d\nu = \frac{\partial}{\partial\theta}1 = 0. \tag{1.120}$$

Theorem 1.6.1. Let E_1 = {X_1, 𝔛_1, P_θ^1, θ ∈ Θ} and E_2 = {X_2, 𝔛_2, P_θ^2, θ ∈ Θ} be regular statistical experiments with Fisher informations I_1(θ), I_2(θ), respectively. Then the experiment E = E_1 × E_2 is also regular and, moreover,

$$I(\theta) = I_1(\theta) + I_2(\theta), \tag{1.121}$$

where I(θ) is the Fisher information of the experiment E.

Theorem 1.6.2. Let E = {X, 𝔛, P_θ, θ ∈ Θ} be a regular experiment. Let the experiment {Y, 𝔜, Q_θ, θ ∈ Θ} generated by the statistic T: X → Y be a part of the experiment E. Then it possesses finite Fisher information at all points θ ∈ Θ and, moreover,

$$\tilde I(\theta) \le I(\theta), \tag{1.122}$$

where Ĩ(θ), I(θ) are the Fisher information matrices of the experiments {Y, 𝔜, Q_θ, θ ∈ Θ} and {X, 𝔛, P_θ, θ ∈ Θ}, respectively. If the density p(x; θ) is a continuously differentiable function of θ for ν-almost all x, then equality in (1.122) holds for all θ ∈ Θ if and only if T is a sufficient statistic for {P_θ}.

1.6.2 The Cramér–Rao Inequality

Theorem 1.6.3 (The Cramér–Rao inequality). Let (X, 𝔛, P_θ, θ ∈ Θ) be a regular experiment with an information matrix I(θ) > 0, Θ ⊂ R^k. If the statistic T = (T_1, ..., T_k): X → R^k is such that the risk E_u|T − u|² is bounded in a neighborhood of the point θ ∈ Θ, then the bias

$$d(u) = E_u T - u \tag{1.123}$$

is continuously differentiable in a neighborhood of θ and the following matrix inequality is satisfied:

$$E_\theta(T-\theta)(T-\theta)^T \ge \Big(J + \frac{\partial d(\theta)}{\partial\theta}\Big)\,I^{-1}(\theta)\,\Big(J + \frac{\partial d(\theta)}{\partial\theta}\Big)^T + d(\theta)\,d(\theta)^T, \tag{1.124}$$

where J is the identity matrix. In particular, if Θ ⊂ R^1 and I(θ) > 0, then

$$E_\theta|T-\theta|^2 \ge \frac{(1+d'(\theta))^2}{I(\theta)} + d^2(\theta). \tag{1.125}$$

The Cramér–Rao inequality allows us to show that the existence of consistent estimators is connected, in a certain sense, with an unlimited inflow of Fisher information. Consider, for example, a sequence of regular statistical experiments E^{(n)} where the parameter set Θ is a bounded interval (a, b) on the real line. A sequence of consistent estimators {T_n} for θ ∈ Θ may exist only if for an arbitrary [α, β] ⊂ (a, b),

$$\inf_{[\alpha,\beta]} I(\theta;\mathscr E^{(n)}) \to \infty \tag{1.126}$$

as n → ∞. Indeed, let the sequence T_n → θ in probability for all θ ∈ Θ. Since the set Θ is bounded, one can assume that |T_n| ≤ max(|a|+1, |b|+1). Therefore E_θ|T_n − θ|² → 0 for all θ. Assume that for some interval [α, β], I(θ; E^{(n)}) < M for all θ ∈ [α, β]. In view of the Cramér–Rao inequality,

$$\frac{(1+d_n'(\theta))^2}{M} + d_n^2(\theta) \to 0, \quad \theta\in[\alpha,\beta]. \tag{1.127}$$

Here d_n(θ) is the bias of the estimator T_n.
In this case d_n(θ) → 0, θ ∈ Θ, and d_n′(θ) → −1 for all θ ∈ [α, β]. However, in view of Lemma 1.6.2,

$$|d_n'(\theta)| = \Big|\int T\,\frac{\partial}{\partial\theta}\frac{dP_\theta^{(n)}}{d\nu}\,d\nu - 1\Big| \le \big(E_\theta T^2\big)^{1/2}\, I^{1/2}(\theta;\mathscr E^{(n)}) + 1 \le \sqrt M\,(|a|+|b|+2) + 1. \tag{1.128}$$

Lebesgue's theorem yields

$$\lim_n\int_\alpha^\tau d_n'(u)\,du = \int_\alpha^\tau\lim_n d_n'(u)\,du = -(\tau-\alpha), \quad \alpha<\tau\le\beta. \tag{1.129}$$

However, in this case

$$d_n(\tau) = \int_\alpha^\tau d_n'(u)\,du + d_n(\alpha) \to -(\tau-\alpha) \ne 0. \tag{1.130}$$

The contradiction obtained proves our assertion.

The Cramér–Rao inequality is exact in the sense that there are cases when equality is attained. However, equality in the Cramér–Rao inequality is possible only for some special families of distributions. We shall exemplify this using a family with a one-dimensional parameter. In view of Lemma 1.6.2, for the estimator T,

$$d'(\theta) + 1 = \frac{d}{d\theta}E_\theta T = \int_{\mathcal X}(T - E_\theta T)\,\frac{\partial p(x;\theta)}{\partial\theta}\,\nu(dx), \tag{1.131}$$

and from the Cauchy–Schwarz inequality we have

$$E_\theta(T-E_\theta T)^2 \ge \frac{(1+d'(\theta))^2}{I(\theta)}. \tag{1.132}$$

Moreover, E_θ(T − E_θT)² = E_θ(T − θ)² − d²(θ). Thus equality in the Cramér–Rao inequality is attainable only if

$$E_\theta(T-E_\theta T)^2 = \frac{(1+d'(\theta))^2}{I(\theta)} = \frac{(\partial E_\theta T/\partial\theta)^2}{I(\theta)}. \tag{1.133}$$

As is known, equality in the Cauchy–Schwarz inequality is attained only if

$$T - E_\theta T = k(\theta)\,\frac{\partial}{\partial\theta}\ln p(x;\theta). \tag{1.134}$$

Analogous results are also valid for a multi-dimensional parameter. Estimators for which the equality sign holds in the Cramér–Rao inequality are called efficient. They possess a number of nice properties. First, it follows from Theorem 1.6.2 that an efficient estimator is a sufficient statistic for θ. Second, under certain conditions an efficient estimator is admissible with respect to a quadratic loss function. Namely, the following result is valid.

Theorem 1.6.4. Let E = (X, 𝔛, P_θ) be a regular experiment with a one-dimensional parameter set Θ = (a, b), −∞ ≤ a < b ≤ ∞. If the integrals

$$\int_a^c I(u)\,du, \qquad \int_c^b I(u)\,du, \qquad I(u) = I(u;\mathscr E),\ \ a<c<b, \tag{1.135}$$

are both divergent, then the efficient estimator T_0 for the parameter θ is admissible with respect to a quadratic loss function.

Example 1.6.1. In the one-dimensional Gaussian mean estimation problem N(θ, σ²) with n observations and σ² known, the Fisher information is

$$I(\theta) = n\sigma^{-2}, \tag{1.136}$$

and in view of Theorem 1.6.4 the efficient estimator X̄ is also admissible. Note that if the parameter set Θ is not the whole real line but, for example, the ray (0, ∞), then the conditions of Theorem 1.6.4 are no longer satisfied (the integral of nσ^{-2} near 0 is convergent), and the efficient estimator X̄ is in fact inadmissible (it is worse than the estimator max(0, X̄)).

Example 1.6.2. Let us consider the Bern(θ) example. The sample average X̄ is an efficient estimator, and I(θ) = 1/(θ(1−θ)) per observation. Since

$$\int_0^{1/2}\frac{du}{u(1-u)} = \infty, \qquad \int_{1/2}^1\frac{du}{u(1-u)} = \infty, \tag{1.137}$$

we conclude that X̄ is an admissible estimator provided Θ = (0, 1).
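A simulation sketch of the bound (1.125) for N(θ, 1) samples (an added illustration; the sample median is an assumed comparison estimator, asymptotically unbiased here with variance ≈ π/(2n), which exceeds the bound 1/n):

```python
# Cramer-Rao check for N(theta, 1): the unbiased sample mean attains
# 1/(n I(theta)) = 1/n; the sample median does not.
import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 2.0, 100, 50_000
x = rng.normal(theta, 1.0, size=(reps, n))
print(x.mean(axis=1).var(), 1 / n)                  # attains the bound
print(np.median(x, axis=1).var(), np.pi / (2 * n))  # ~ pi/(2n), strictly above 1/n
```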
1.6.3 Bounds on the Hellinger Distance r_2²(θ; θ′) in Regular Experiments

Consider a regular experiment and assume that Θ is a convex subset of R^k.

Theorem 1.6.5. The following inequality is valid:

$$r_2^2(\theta;\theta+h) = \int_{\mathcal X}|p^{1/2}(x;\theta+h) - p^{1/2}(x;\theta)|^2\,d\nu \le \frac{|h|^2}{4}\int_0^1\operatorname{tr} I(\theta+sh)\,ds. \tag{1.138}$$

Moreover, if K is a compact subset of Θ, then uniformly in K,

$$\liminf_{h\to 0}|h|^{-2}\int_{\mathcal X}|p^{1/2}(x;\theta+h) - p^{1/2}(x;\theta)|^2\,d\nu \ge \frac14\inf_{|u|=1}\langle I(\theta)u,\,u\rangle. \tag{1.139}$$

Let a sequence of iid observations X_1, X_2, ..., X_n generate regular experiments with probability density f(x; θ), θ ∈ Θ ⊂ R^k, and information matrix I(θ). Let Θ be a convex bounded set and let the matrix I(θ) be strictly positive at each point θ. Assume also that for any compact K ⊂ Θ and all δ > 0,

$$\inf_{\theta\in K}\inf_{|h|\ge\delta}\int_{\mathcal X}|f^{1/2}(x;\theta+h) - f^{1/2}(x;\theta)|^2\,d\nu > 0. \tag{1.140}$$

It follows from Theorem 1.6.5 and the last inequality that to any compact set K ⊂ Θ there correspond two constants a, A > 0 such that for all θ ∈ K,

$$\frac{a|h|^2}{1+|h|^2} \le \int_{\mathcal X}|f^{1/2}(x;\theta+h) - f^{1/2}(x;\theta)|^2\,d\nu \le A|h|^2. \tag{1.141}$$

This, together with the theorems on large deviations of maximum likelihood and Bayesian estimators as well as the lower bounds on the risk function, implies that the rate of convergence of the best estimators to the value θ is of order n^{-1/2}. For example, if Θ ⊂ R^1 and θ̂_n is a maximum likelihood estimator, then it follows from Theorem 1.4.2 that for all p > 0,

$$\limsup_{n\to\infty}\sup_{\theta\in K}E_\theta\big|\sqrt n(\hat\theta_n-\theta)\big|^p < \infty. \tag{1.142}$$

1.7 Approximating Estimators by Means of Sums of Independent Random Variables

Let X_1, X_2, ... be iid observations with a density f(x; θ) with respect to the measure ν, Θ ⊂ R^k. If, for h → 0,

$$\int_{\mathcal X}|f^{1/2}(x;\theta+h) - f^{1/2}(x;\theta)|^2\,d\nu \sim |h|^\alpha, \tag{1.143}$$

then the results obtained on large deviations of estimators allow us to hope that, under some additional restrictions, the distributions of the random variables

$$n^{1/\alpha}(\hat\theta_n-\theta), \qquad n^{1/\alpha}(\tilde t_n-\theta) \tag{1.144}$$

converge to proper limit distributions as n → ∞. Such theorems will indeed be proved in the subsequent chapters. Here, as an example, we shall present some results along these lines for regular experiments.

Let the experiments E_n generated by X_1, ..., X_n be regular. Let I(θ) denote the Fisher information matrix, and set l(x; θ) = ln f(x; θ). Below we shall assume that Θ is a convex bounded open subset of R^k and that, uniformly with respect to θ ∈ K ⊂ Θ,

$$\inf_{|h|\ge\delta}\int_{\mathcal X}|f^{1/2}(x;\theta+h) - f^{1/2}(x;\theta)|^2\,d\nu > 0. \tag{1.145}$$

Theorem 1.7.1. Let the function l(x; θ) be twice differentiable in θ for all x. Assume, furthermore, that there exists a number δ, 0 < δ ≤ 1, such that for any compact set K ⊂ Θ,

$$\sup_{\theta\in K}E_\theta\Big|\frac{\partial}{\partial\theta}l(X_1;\theta)\Big|^{2+\delta} < \infty, \tag{1.146}$$

$$\sup_{\theta\in K}E_\theta\Big|\frac{\partial^2}{\partial\theta_i\partial\theta_j}l(X_1;\theta)\Big|^{1+\delta} < \infty, \tag{1.147}$$

$$E_\theta\sup_{\theta,\theta'\in K}\Big(\Big|\frac{\partial^2}{\partial\theta_i\partial\theta_j}l(X_1;\theta) - \frac{\partial^2}{\partial\theta_i\partial\theta_j}l(X_1;\theta')\Big|\,|\theta-\theta'|^{-\delta}\Big) < \infty. \tag{1.148}$$

Then there exists a number ε > 0 such that with P_θ-probability one, uniformly in θ ∈ K as n → ∞, we have

$$\sqrt n\,I(\theta)(\hat\theta_n-\theta) = \frac{1}{\sqrt n}\sum_{j=1}^n\frac{\partial}{\partial\theta}l(X_j;\theta) + O(n^{-\varepsilon}). \tag{1.149}$$

Theorem 1.7.2. Let the conditions of Theorem 1.7.1 be satisfied. If t̃_n is a Bayesian estimator with respect to a prior density q(θ) > 0 continuous on Θ and a convex loss function w(u), w ∈ W_p, then there exists ε > 0 such that with P_θ-probability one as n → ∞ we have

$$\sqrt n\,I(\theta)(\tilde t_n-\theta) = \frac{1}{\sqrt n}\sum_{j=1}^n\frac{\partial}{\partial\theta}l(X_j;\theta) + O(n^{-\varepsilon}). \tag{1.150}$$

Theorems 1.7.1 and 1.7.2 imply that in the regular case the estimators θ̂_n, t̃_n are well approximated in terms of sums of the independent random variables ∂l(X_j; θ)/∂θ; in particular, an appeal to limit theorems for sums of independent random variables allows us to obtain analogous limit theorems for the estimators θ̂_n, t̃_n. For example, it follows directly from the central limit theorem that under certain conditions the differences √n(θ̂_n − θ), √n(t̃_n − θ) are asymptotically normal with mean zero and correlation matrix I^{-1}(θ).
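A minimal simulation of this normal approximation for the Bernoulli family (an added sketch; here the MLE is X̄ and I(θ) = 1/(θ(1 − θ))):

```python
# The standardized quantity sqrt(n I(theta)) (Xbar - theta) should be ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(9)
theta, n, reps = 0.3, 400, 100_000
xbar = rng.binomial(n, theta, size=reps) / n
z = np.sqrt(n / (theta * (1 - theta))) * (xbar - theta)
print(z.mean(), z.var())            # ~ 0 and ~ 1
print((np.abs(z) > 1.96).mean())    # ~ 0.05, matching the N(0, 1) tail
```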
If θ is a one-dimensional parameter, the law of the iterated logarithm for θ̂_n, t̃_n follows from the law of the iterated logarithm for sums of independent random variables. We have

$$P_\theta\Big(\limsup_n\sqrt{\tfrac{nI(\theta)}{2\ln\ln n}}\,(\hat\theta_n-\theta) = -\liminf_n\sqrt{\tfrac{nI(\theta)}{2\ln\ln n}}\,(\hat\theta_n-\theta) = 1\Big) = 1, \tag{1.151}$$

$$P_\theta\Big(\limsup_n\sqrt{\tfrac{nI(\theta)}{2\ln\ln n}}\,(\tilde t_n-\theta) = -\liminf_n\sqrt{\tfrac{nI(\theta)}{2\ln\ln n}}\,(\tilde t_n-\theta) = 1\Big) = 1. \tag{1.152}$$

1.8 Asymptotic Efficiency

1.8.1 Basic Definition

The term asymptotically efficient estimator was introduced by R. Fisher to designate consistent asymptotically normal estimators with asymptotically minimal variance. The motivation was that estimators of this kind should be preferable from the asymptotic point of view. The program outlined by R. Fisher consisted in showing that:

1. if θ̂_n is a maximum likelihood estimator, then under some natural regularity conditions the difference √n(θ̂_n − θ) is asymptotically normal with parameters (0, I^{-1}(θ));

2. if T_n is an asymptotically normal sequence of estimators then