5 Decision Theory: Basic Concepts
Point estimation of an unknown parameter is generally considered the most basic inference problem. Speaking generically, if θ is some unknown parameter taking values in a suitable parameter space Θ, then a point estimate is an educated guess at the true value of θ. Of course, we do not just guess the true value of θ without some information. Information comes from data; some information may also come from expert opinion separate from the data. In any case, a point estimate is a function of the available sample data. To start with, we allow any function of the data as a possible point estimate; the theory of inference is then used to separate the good estimates from the not so good or bad estimates. As usual, we will start with an example to help us understand the general definitions.

Example 5.1. Suppose $X_1, \cdots, X_n \stackrel{iid}{\sim} \mathrm{Poi}(\lambda)$, $\lambda > 0$, and suppose we want to estimate the parameter λ. Now, of course, $\lambda = E_\lambda(X_1)$, i.e., λ is the population mean. So, just instinctively, we may want to estimate λ by the sample mean $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$; and, indeed, $\bar{X}$ is a possible point estimator of λ. While λ takes values in Θ = (0, ∞), the estimator $\bar{X}$ takes values in A = [0, ∞); $\bar{X}$ can be equal to zero! So, the set of possible values of the parameter and the set of possible values of an estimator need not be identical. We must allow them to be different sets, in general.

Now, $\bar{X}$ is certainly not the only possible estimator of λ. We can use essentially any function of the sample observations $X_1, \cdots, X_n$ to estimate the parameter λ: for example, just $X_1$, or $X_1 + X_2 - X_3$, or even seemingly poor estimators like $X_1^4$. Any estimator is allowed to begin with; theory will separate the good ones from the bad ones.

Next, suppose $X_1, \cdots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$, where µ, σ are unknown parameters. So, now, we have a two dimensional parameter vector, θ = (µ, σ). Suppose we want to estimate µ. Once again, a possible estimator is $\bar{X}$; a few other possible estimators are the sample median, $M_n = \mathrm{median}\{X_1, \cdots, X_n\}$, or $\frac{X_2 + X_4}{2}$, or even seemingly poor estimators, like $100\bar{X}$. Suppose it was known to us that µ must be nonnegative. Then, the set of possible values of µ is Θ = [0, ∞). However, the instinctive point estimator $\bar{X}$ can take any real value; it takes values in the set A = (−∞, ∞). You would notice again that A is not the same as Θ in this case. In general, A and Θ can be different sets.

If we want instead to estimate $\sigma^2$, which is the population variance, a first thought is to use the sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$ (dividing by n − 1 rather than n seems a little odd at first glance, but has a mathematical reason, which will be clear soon). In this example, the parameter $\sigma^2$ and the estimator $s^2$ both take values in (0, ∞). But, if we knew that $\sigma^2 \leq 100$, say, then once again, Θ and A would be different. And, as always, there are many other possible estimators of $\sigma^2$, for example, $\frac{1}{n-1}\sum_{i=1}^n (X_i - M_n)^2$, where $M_n$ is the sample median. Here is a formal definition of a point estimator.

Definition 5.1. Let the vector of sample observations $X^{(n)} = (X_1, \cdots, X_n)$ have a joint distribution $P = P_n$, and let $\theta = h(P)$, taking values in the parameter space $\Theta \subseteq \mathcal{R}^p$, be a parameter of the distribution P. Let $T(X_1, \cdots, X_n)$, taking values in a specified set $A \subseteq \mathcal{R}^p$, be a general statistic. Then any such $T(X_1, \cdots, X_n)$ is called a point estimator of θ. The set Θ is called the parameter space, and the set A is called the statistician's action space.
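To make Definition 5.1 concrete, here is a small numerical sketch, added purely as an illustration (the sample size n = 25, the true value λ = 3, and the random seed are arbitrary choices). It draws one Poisson sample and evaluates several of the candidate estimators from Example 5.1; note in particular that $X_1 + X_2 - X_3$ can fall outside Θ = (0, ∞), which is exactly why the action space A is allowed to differ from Θ.

```python
import numpy as np

rng = np.random.default_rng(1)
lam_true = 3.0   # the "unknown" parameter lambda; arbitrary choice for the illustration
n = 25           # sample size; arbitrary choice
x = rng.poisson(lam_true, size=n)

# Several candidate point estimators of lambda: each is simply a function of the data.
estimates = {
    "sample mean X-bar": x.mean(),
    "just X_1": x[0],
    "X_1 + X_2 - X_3": x[0] + x[1] - x[2],  # may even be negative, i.e., outside Theta = (0, inf)
    "X_1^4": x[0] ** 4,                     # a seemingly poor estimator, but still an estimator
}

for name, value in estimates.items():
    print(f"{name:18s} = {value}")
```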
If specific observed sample data $X_1 = x_1, \cdots, X_n = x_n$ are available, then the particular value $T(x_1, \cdots, x_n)$ is called an estimate of θ. Thus, the word estimator applies to the general function $T(X_1, \cdots, X_n)$, and the word estimate applies to the value $T(x_1, \cdots, x_n)$ for specific data. In this text, we use estimator and estimate synonymously. A standard general notation for a generic estimate of a parameter θ is $\hat{\theta} = \hat{\theta}(X_1, \cdots, X_n)$.

5.0.1 Evaluating an Estimator and MSE

Except in rare cases, the estimate $\hat{\theta}$ would not be exactly equal to the true value of the unknown parameter θ. It seems reasonable that we like an estimate $\hat{\theta}$ which generally comes quite close to the true value of θ, and dislike an estimate $\hat{\theta}$ which generally misses the true value of θ by a large amount. How are we going to make this precise and quantifiable? The general approach to this question involves the specification of a loss and a risk function, which we will introduce in a later section. For now, we describe a very common and even hugely popular criterion for evaluating a point estimator, the mean squared error.

Definition 5.2. Let θ be a real valued parameter, and $\hat{\theta}$ an estimate of θ. The mean squared error (MSE) of $\hat{\theta}$ is defined as

$$\mathrm{MSE} = \mathrm{MSE}(\theta, \hat{\theta}) = E_\theta[(\hat{\theta} - \theta)^2], \quad \theta \in \Theta.$$

If the parameter θ is p-dimensional, $\theta = (\theta_1, \cdots, \theta_p)$, and $\hat{\theta} = (\hat{\theta}_1, \cdots, \hat{\theta}_p)$, then the mean squared error of $\hat{\theta}$ is defined as

$$\mathrm{MSE} = \mathrm{MSE}(\theta, \hat{\theta}) = E_\theta[\|\hat{\theta} - \theta\|^2] = \sum_{i=1}^p E_\theta[(\hat{\theta}_i - \theta_i)^2], \quad \theta \in \Theta.$$

5.0.2 Bias and Variance

It turns out that the mean squared error of an estimator has one component to do with the systematic error of the estimator and a second component to do with the random error of the estimator. If an estimator $\hat{\theta}$ routinely overestimates the parameter θ, then usually we will have $\hat{\theta} - \theta > 0$; we think of this as the estimator making a systematic error. Systematic errors can also be made by routinely underestimating θ. We quantify the systematic error of an estimator by looking at $E_\theta(\hat{\theta} - \theta) = E_\theta(\hat{\theta}) - \theta$; this is called the bias of the estimator, and we denote it by b(θ). If the bias of an estimator $\hat{\theta}$ is always zero, b(θ) = 0 for all θ, then the estimator $\hat{\theta}$ is called unbiased.

On the other hand, an estimator $\hat{\theta}$ may not make much systematic error, but can still be unreliable because its accuracy may differ wildly from one dataset to another. This is called random or fluctuation error, and we often quantify the random error by looking at the variance of the estimator, $\mathrm{Var}_\theta(\hat{\theta})$. A pleasant property of the MSE of an estimator is that the MSE always breaks neatly into two components, one involving the bias and the other involving the variance. You have to try to keep both of them small; large biases and large variances are both red signals. Here is the bias-variance decomposition result.

Theorem 5.1. Let θ be a real valued parameter and $\hat{\theta}$ an estimator with a finite variance under all θ. Then,

$$\mathrm{MSE}(\theta, \hat{\theta}) = \mathrm{Var}_\theta(\hat{\theta}) + b^2(\theta), \quad \theta \in \Theta.$$

Proof: To prove this simple theorem, we recall the elementary probability fact that for any random variable U with a finite variance, $E(U^2) = \mathrm{Var}(U) + [E(U)]^2$. Identifying U with $\hat{\theta} - \theta$,

$$\mathrm{MSE}(\theta, \hat{\theta}) = \mathrm{Var}_\theta(\hat{\theta} - \theta) + [E(\hat{\theta} - \theta)]^2 = \mathrm{Var}_\theta(\hat{\theta}) + b^2(\theta).$$
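Theorem 5.1 is easy to check by simulation. The sketch below is only an added illustration, not part of the text's development: it uses the divisor-n version of the sample variance, $\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$, simply because it is a convenient estimator with nonzero bias, and the parameter values, sample size, and number of replications are arbitrary choices. It approximates the MSE, variance, and bias by Monte Carlo and confirms that the MSE matches the variance plus the squared bias.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 0.0, 2.0, 10   # true parameters and sample size; arbitrary choices
n_reps = 200_000              # number of simulated datasets

# Estimator under study: the divisor-n sample variance, a biased estimator of sigma^2.
samples = rng.normal(mu, sigma, size=(n_reps, n))
sigma2_hat = samples.var(axis=1, ddof=0)   # (1/n) * sum_i (X_i - X-bar)^2 for each dataset

theta = sigma ** 2
mse = np.mean((sigma2_hat - theta) ** 2)   # Monte Carlo estimate of E[(theta-hat - theta)^2]
var = np.var(sigma2_hat)                   # Monte Carlo estimate of Var(theta-hat)
bias = np.mean(sigma2_hat) - theta         # Monte Carlo estimate of b(theta)

print(f"MSE          : {mse:.4f}")
print(f"Var + bias^2 : {var + bias ** 2:.4f}")   # agrees with the MSE, as Theorem 5.1 asserts
```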
5.0.3 Computing and Graphing MSE

We will now see one introductory example.

Example 5.2. (Estimating a Normal Mean and Variance) Suppose we have sample observations $X_1, \cdots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$, where −∞ < µ < ∞, σ > 0 are unknown parameters. The parameter is two dimensional, θ = (µ, σ).

First consider estimation of µ, and as an example, consider these two estimates: $\bar{X}$ and $\frac{n\bar{X}}{n+1}$. We will calculate the MSE of each estimate, and make some comments.

Since $E(\bar{X}) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n}(n\mu) = \mu$ for any µ and σ, the bias of $\bar{X}$ for estimating µ is zero: $E_\theta(\bar{X}) - \mu = \mu - \mu = 0$. In other words, $\bar{X}$ is an unbiased estimate of µ. Therefore, the MSE of $\bar{X}$ is just its variance,

$$E[(\bar{X} - \mu)^2] = \mathrm{Var}(\bar{X}) = \frac{\mathrm{Var}(X_1)}{n} = \frac{\sigma^2}{n}.$$

Notice that the MSE of $\bar{X}$ does not depend on µ; it only depends on $\sigma^2$.

Now, we will find the MSE of the other estimate $\frac{n\bar{X}}{n+1}$. For this, by our Theorem 5.1, the MSE of $\frac{n\bar{X}}{n+1}$ is

$$E\left[\left(\frac{n\bar{X}}{n+1} - \mu\right)^2\right] = \mathrm{Var}\left(\frac{n\bar{X}}{n+1}\right) + \left[E\left(\frac{n\bar{X}}{n+1} - \mu\right)\right]^2$$
$$= \left(\frac{n}{n+1}\right)^2 \mathrm{Var}(\bar{X}) + \left[\frac{n}{n+1}\mu - \mu\right]^2 = \left(\frac{n}{n+1}\right)^2 \frac{\sigma^2}{n} + \left(\frac{1}{n+1}\right)^2 \mu^2$$
$$= \frac{\mu^2}{(n+1)^2} + \frac{n\sigma^2}{(n+1)^2}.$$

Notice that the MSE of this estimate does depend on both µ and $\sigma^2$; $\frac{\mu^2}{(n+1)^2}$ is the contribution of the bias component in the MSE, and $\frac{n\sigma^2}{(n+1)^2}$ is the contribution of the variance component.

For purposes of comparison, we plot the MSE of both estimates, taking n = 10 and σ = 1; the MSE of $\bar{X}$ is constant in µ, and the MSE of $\frac{n\bar{X}}{n+1}$ is a quadratic in µ. For µ near zero, $\frac{n\bar{X}}{n+1}$ has a smaller MSE, but otherwise, $\bar{X}$ has a smaller MSE.

Figure: Comparison of the MSE of the two estimates in Example 5.2, plotted against µ for n = 10 and σ = 1.
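As a quick numerical check of these formulas, the sketch below (an added illustration, not part of the original example) evaluates the two MSE expressions on a grid of µ values with n = 10 and σ = 1, and reports approximately where $\frac{n\bar{X}}{n+1}$ stops having the smaller MSE.

```python
import numpy as np

n, sigma = 10, 1.0
mu = np.linspace(-4, 4, 401)   # grid of mu values, matching the range used in the plot

mse_xbar = np.full_like(mu, sigma**2 / n)          # MSE of X-bar: sigma^2 / n, constant in mu
mse_shrunk = (mu**2 + n * sigma**2) / (n + 1)**2   # MSE of n*X-bar/(n+1)

# Grid points where the shrunken estimate has the smaller MSE
better = mu[mse_shrunk < mse_xbar]
print(f"n X-bar/(n+1) beats X-bar (on this grid) for |mu| <= {np.abs(better).max():.2f}")
```

Setting the two MSE expressions equal gives the exact crossover $\mu^2 = \frac{(2n+1)\sigma^2}{n}$, i.e., |µ| ≈ 1.449 for n = 10 and σ = 1, consistent with the printed grid value and with the plot.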