Chapter 7: Estimation
Criteria for Estimators
• Main problem in statistics: estimation of population parameters, say θ.
• Recall that an estimator is a statistic, that is, any function of the observations {xi} with values in the parameter space.
• There are many estimators of θ. Question: which one is better?
• Criteria for estimators:
  (1) Unbiasedness
  (2) Efficiency
  (3) Sufficiency
  (4) Consistency

Unbiasedness
Definition: Unbiasedness
An unbiased estimator, say θ̂, has an expected value that is equal to the value of the population parameter being estimated, say θ. That is,
  E[θ̂] = θ
Example: E[x̄] = µ, E[s²] = σ².

Efficiency
Definition: Efficiency / Mean squared error
An estimator is efficient if it estimates the parameter of interest in some best way. The notion of "best way" relies on the choice of a loss function. The usual choice of loss function is the quadratic, ℓ(e) = e², resulting in the mean squared error (MSE) criterion of optimality:
  MSE = E[(θ̂ − θ)²] = E[(θ̂ − E(θ̂) + E(θ̂) − θ)²] = Var(θ̂) + [b(θ̂)]²
where b(θ̂) = E[θ̂] − θ is the bias of θ̂.
The MSE is the sum of the variance and the square of the bias.
=> Trade-off: a biased estimator can have a lower MSE than an unbiased estimator.
Note: The most efficient estimator among a group of unbiased estimators is the one with the smallest variance => BUE (best unbiased estimator).

Now we can compare estimators and select the "best" one.
Example: Three estimators, 1, 2 and 3, based on samples of the same size. [Figure: the three sampling distributions plotted against the value of the estimator, with the true value θ marked on the horizontal axis.]
– 1 and 2: expected value = population parameter (unbiased).
– 3: positively biased.
– Variance decreases from 1, to 2, to 3 (3 has the smallest).
– 3 can have the smallest MSE. 2 is more efficient than 1.

Relative Efficiency
It is difficult to prove that an estimator is the best among all estimators, so a relative concept is usually used.
Definition: Relative efficiency
  Relative efficiency = Variance of first estimator / Variance of second estimator
Example: Sample mean vs. sample median
  Variance of sample mean = σ²/n
  Variance of sample median = πσ²/2n
  Var[median]/Var[mean] = (πσ²/2n) / (σ²/n) = π/2 ≈ 1.57
The sample median is 1.57 times less efficient than the sample mean.

Asymptotic Efficiency
• We compare two sample statistics in terms of their variances. The statistic with the smallest variance is called efficient.
• When we look at asymptotic efficiency, we look at the asymptotic variances of two statistics as n grows. Note that if we compare two consistent estimators, both variances eventually go to zero.
Example: Random sampling from the normal distribution
• The sample mean is asymptotically normal [μ, σ²/n].
• The median is asymptotically normal [μ, (π/2)σ²/n].
• The mean is asymptotically more efficient.
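A minimal Monte Carlo sketch of the mean-vs-median comparison above (my own illustration, not part of the notes; it assumes a normal population and uses NumPy):

```python
import numpy as np

# Monte Carlo check of the relative efficiency of the sample median vs. the sample mean
# for normal data (assumed values for the sketch: mu = 0, sigma = 1, n = 100).
rng = np.random.default_rng(42)
n, reps = 100, 20_000
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))

means = samples.mean(axis=1)            # sampling distribution of the sample mean
medians = np.median(samples, axis=1)    # sampling distribution of the sample median

# Both estimators are unbiased for mu here, so MSE = variance.
print("Var[mean]   ~", means.var())     # theory: sigma^2/n = 0.0100
print("Var[median] ~", medians.var())   # theory: (pi/2)*sigma^2/n ~ 0.0157
print("Var[median]/Var[mean] ~", medians.var() / means.var())  # ~ pi/2 ~ 1.57
```

The printed variance ratio should land close to π/2, matching the relative efficiency computed above.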
Sufficiency
• Definition: Sufficiency
A statistic is sufficient when no other statistic that can be calculated from the same sample provides any additional information about the value of the parameter of interest. Equivalently, conditional on the value of a sufficient statistic for a parameter, the joint probability distribution of the data does not depend on that parameter. That is, if
  P(X = x | T(X) = t, θ) = P(X = x | T(X) = t)
we say that T is a sufficient statistic.
• The sufficient statistic contains all the information needed to estimate the population parameter. It is OK to "get rid" of the original data, while keeping only the value of the sufficient statistic.
• Visualize sufficiency: Consider a Markov chain θ → T(X1, ..., Xn) → {X1, ..., Xn} (although in classical statistics θ is not a RV). Conditioned on the middle part of the chain, the front and back are independent.

Theorem
Let p(x, θ) be the pdf of X and q(t, θ) be the pdf of T(X). Then, T(X) is a sufficient statistic for θ if, for every x in the sample space, the ratio
  p(x, θ) / q(T(x), θ)
is constant as a function of θ.

Example: Normal sufficient statistic
Let X1, X2, ..., Xn be iid N(μ, σ²), where the variance σ² is known. The sample mean, x̄, is a sufficient statistic for μ.
Proof: Start with the joint density:
  f(x | μ) = Π_{i=1..n} (2πσ²)^(−1/2) exp(−(xi − μ)²/(2σ²))
           = (2πσ²)^(−n/2) exp(−Σ_{i=1..n} (xi − μ)²/(2σ²))
• Next, add and subtract the sample mean:
  f(x | μ) = (2πσ²)^(−n/2) exp(−Σ_{i=1..n} (xi − x̄ + x̄ − μ)²/(2σ²))
           = (2πσ²)^(−n/2) exp(−[Σ_{i=1..n} (xi − x̄)² + n(x̄ − μ)²]/(2σ²))
(the cross term vanishes because Σ_{i=1..n} (xi − x̄) = 0).
• Recall that the distribution of the sample mean is
  q(T(x) | μ) = (2πσ²/n)^(−1/2) exp(−n(x̄ − μ)²/(2σ²))
• The ratio of the information in the sample to the information in the statistic becomes independent of μ:
  f(x | μ) / q(T(x) | μ) = [n^(1/2) (2πσ²)^((n−1)/2)]^(−1) exp(−Σ_{i=1..n} (xi − x̄)²/(2σ²))
Since this ratio does not depend on μ, x̄ is a sufficient statistic for μ.

Theorem: Factorization Theorem
Let f(x|θ) denote the joint pdf or pmf of a sample X. A statistic T(X) is a sufficient statistic for θ if and only if there exist functions g(t|θ) and h(x) such that, for all sample points x and all parameter points θ,
  f(x|θ) = g(T(x)|θ) h(x)
• Sufficient statistics are not unique. From the factorization theorem it is easy to see that (i) the identity function T(X) = X is a sufficient statistic vector, and (ii) if T is a sufficient statistic for θ, then so is any 1-1 function of T. This leads to minimal sufficient statistics.
Definition: Minimal sufficiency
A sufficient statistic T(X) is called a minimal sufficient statistic if, for any other sufficient statistic T′(X), T(X) is a function of T′(X).

Consistency
Definition: Consistency
An estimator is consistent if it converges in probability to the population parameter being estimated as n (the sample size) becomes larger. That is,
  θ̂n →p θ
We say that θ̂n is a consistent estimator of θ.
Example: x̄ is a consistent estimator of μ (the population mean).
• Q: Does unbiasedness imply consistency? No. The first observation of {xn}, x1, is an unbiased estimator of μ; that is, E[x1] = μ. But letting n grow is not going to cause x1 to converge in probability to μ.

Squared-Error Consistency
Definition: Squared-error consistency
The sequence {θ̂n} is a squared-error consistent estimator of θ if
  lim_{n→∞} E[(θ̂n − θ)²] = 0
That is, θ̂n converges to θ in mean square.
• Squared-error consistency implies that both the bias and the variance of the estimator approach zero. Thus, squared-error consistency implies consistency.

Order of a Sequence: Big O and Little o
• "Little o", o(.)
A sequence {xn} is o(n^δ) (order less than n^δ) if |n^(−δ) xn| → 0 as n → ∞.
Example: xn = n³ is o(n⁴), since |n^(−4) xn| = 1/n → 0 as n → ∞.
• "Big O", O(.)
A sequence {xn} is O(n^δ) (at most of order n^δ) if n^(−δ) xn → ψ as n → ∞ (ψ ≠ 0, a constant).
Example: f(z) = 6z⁴ − 2z³ + 5 is O(z⁴) and o(z^(4+δ)) for every δ > 0.
Special case: O(1) means bounded by a constant.
• Order of a sequence of RVs
The order of the variance gives the order of the sequence.
Example: What is the order of the sequence {x̄}? Var[x̄] = σ²/n, which is O(1/n), or O(n⁻¹).
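To make the O(1/n) behavior of Var[x̄] concrete, a small simulation sketch (my own illustration; the normal population and NumPy are assumptions, not part of the notes) estimates Var[x̄] for increasing n and checks that n·Var[x̄] stays roughly constant at σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, reps = 5.0, 2.0, 10_000    # assumed population values for the sketch

for n in (10, 100, 1_000):
    # reps independent samples of size n; each row yields one realization of x-bar
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    var_xbar = xbar.var()
    # Var[x-bar] = sigma^2/n is O(1/n), so n * Var[x-bar] should hover near sigma^2 = 4.
    print(f"n = {n:>5}:  Var[x-bar] ~ {var_xbar:.5f},  n*Var[x-bar] ~ {n * var_xbar:.3f}")
```

The shrinking variance illustrates the consistency of x̄, and the stable product n·Var[x̄] is exactly the root-n rate discussed next.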
Root n-Consistency
• Q: Let xn be a consistent estimator of θ. How fast does xn converge to θ?
The sample mean, x̄, has variance σ²/n, which is O(1/n). That is, the convergence is at the rate n^(−½). This is called "root n-consistency." Note: n^(½) x̄ has a variance of O(1).
• Definition: n^δ convergence
If an estimator has an O(1/n^(2δ)) variance, then we say the estimator is n^δ-convergent.
Example: Suppose Var(xn) is O(1/n²). Then xn is n-convergent.
The usual convergence rate is root n. If an estimator has a faster (higher degree of) convergence, it is called super-consistent.

Estimation
• Two philosophies regarding models (assumptions) in statistics:
(1) Parametric statistics. It assumes the data come from a type of probability distribution and makes inferences about the parameters of that distribution. Models are parameterized before collecting the data.
Example: maximum likelihood estimation.
(2) Non-parametric statistics. It assumes no probability distribution, i.e., it is "distribution free." Models are not imposed a priori, but determined by the data.
Examples: histograms, kernel density estimation.
• In general, parametric statistics makes more assumptions.

Least Squares Estimation
• Long history: Gauss (1795, 1801) used it in astronomy.
• Idea: There is a functional form relating Y and k variables X. This function depends on unknown parameters, θ. The relation between Y and X is not exact: there is an error, ε. We estimate the parameters θ by minimizing the sum of squared errors.
(1) Functional form known: yi = f(xi, θ) + εi
(2) Typical assumptions:
  - f(x, θ) is correctly specified. For example, f(x, θ) = X β.
  - X are fixed numbers with full rank, or E(ε|X) = 0. That is, ε ⊥ x.
  - ε ~ iid D(0, σ² I)
• Objective function: S(xi, θ) = Σi εi² = Σi (yi − f(xi, θ))²
• We want to minimize it with respect to θ.
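A minimal least squares sketch (my own example, not from the notes) assuming the linear specification f(x, θ) = Xβ: the objective S(β) = Σi (yi − xi′β)² is minimized by solving the normal equations, which NumPy's least-squares solver handles directly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from y = X @ beta + eps with assumed "true" parameters.
n = 200
beta_true = np.array([2.0, -1.5])                       # intercept and slope
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # full-rank design: constant + one regressor
eps = rng.normal(scale=0.5, size=n)                     # iid errors with mean zero
y = X @ beta_true + eps

# Minimizing S(beta) = sum_i (y_i - x_i'beta)^2 leads to the normal equations X'X beta = X'y;
# np.linalg.lstsq solves them in a numerically stable way.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta_hat =", beta_hat)   # should be close to [2.0, -1.5]
```

The same estimate can be written in closed form as β̂ = (X′X)⁻¹X′y, the usual least squares solution for the linear case.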