Lecture Notes 8

1 Minimax Theory
Suppose we want to estimate a parameter $\theta$ using data $X^n = (X_1, \dots, X_n)$. What is the best possible estimator $\hat\theta = \hat\theta(X_1, \dots, X_n)$ of $\theta$? Minimax theory provides a framework for answering this question.

1.1 Introduction

Let $\hat\theta = \hat\theta(X^n)$ be an estimator for the parameter $\theta \in \Theta$. We start with a loss function $L(\theta, \hat\theta)$ that measures how good the estimator is. For example:

$L(\theta, \hat\theta) = (\theta - \hat\theta)^2$  (squared error loss),
$L(\theta, \hat\theta) = |\theta - \hat\theta|$  (absolute error loss),
$L(\theta, \hat\theta) = |\theta - \hat\theta|^p$  ($L_p$ loss),
$L(\theta, \hat\theta) = 0$ if $\theta = \hat\theta$ and $1$ if $\theta \neq \hat\theta$  (zero-one loss),
$L(\theta, \hat\theta) = I(|\hat\theta - \theta| > c)$  (large deviation loss),
$L(\theta, \hat\theta) = \int \log\!\left(\frac{p(x; \theta)}{p(x; \hat\theta)}\right) p(x; \theta)\, dx$  (Kullback–Leibler loss).

If $\theta = (\theta_1, \dots, \theta_k)$ is a vector then some common loss functions are

$$L(\theta, \hat\theta) = \|\theta - \hat\theta\|^2 = \sum_{j=1}^k (\hat\theta_j - \theta_j)^2$$

and

$$L(\theta, \hat\theta) = \|\theta - \hat\theta\|_p = \left( \sum_{j=1}^k |\hat\theta_j - \theta_j|^p \right)^{1/p}.$$

When the problem is to predict a $Y \in \{0, 1\}$ based on some classifier $h(x)$, a commonly used loss is

$$L(Y, h(X)) = I(Y \neq h(X)).$$

For real-valued prediction a common loss function is

$$L(Y, \hat Y) = (Y - \hat Y)^2.$$

The risk of an estimator $\hat\theta$ is

$$R(\theta, \hat\theta) = \mathbb{E}_\theta\, L(\theta, \hat\theta) = \int L(\theta, \hat\theta(x_1, \dots, x_n))\, p(x_1, \dots, x_n; \theta)\, dx. \tag{1}$$

When the loss function is squared error, the risk is just the MSE (mean squared error):

$$R(\theta, \hat\theta) = \mathbb{E}_\theta (\hat\theta - \theta)^2 = \mathrm{Var}_\theta(\hat\theta) + \mathrm{bias}^2. \tag{2}$$

If we do not state what loss function we are using, assume the loss function is squared error.

The minimax risk is

$$R_n = \inf_{\hat\theta} \sup_{\theta} R(\theta, \hat\theta)$$

where the infimum is over all estimators. An estimator $\hat\theta$ is a minimax estimator if

$$\sup_{\theta} R(\theta, \hat\theta) = \inf_{\tilde\theta} \sup_{\theta} R(\theta, \tilde\theta).$$

Example 1. Let $X_1, \dots, X_n \sim N(\theta, 1)$. We will see that $\bar X_n$ is minimax with respect to many different loss functions. Its risk is $1/n$.

Example 2. Let $X_1, \dots, X_n$ be a sample from a density $p$. Let $\mathcal{P}$ be the class of smooth densities (defined more precisely later). We will see (later in the course) that the minimax risk for estimating $p$ is $C n^{-4/5}$ for some constant $C > 0$.

1.2 Comparing Risk Functions

To compare two estimators, we compare their risk functions. However, this does not provide a clear answer as to which estimator is better. Consider the following examples.

Example 3. Let $X \sim N(\theta, 1)$ and assume we are using squared error loss. Consider two estimators: $\hat\theta_1 = X$ and $\hat\theta_2 = 3$. The risk functions are $R(\theta, \hat\theta_1) = \mathbb{E}_\theta (X - \theta)^2 = 1$ and $R(\theta, \hat\theta_2) = \mathbb{E}_\theta (3 - \theta)^2 = (3 - \theta)^2$. If $2 < \theta < 4$ then $R(\theta, \hat\theta_2) < R(\theta, \hat\theta_1)$; otherwise $R(\theta, \hat\theta_1) < R(\theta, \hat\theta_2)$. Neither estimator uniformly dominates the other; see Figure 1.

[Figure 1: Comparing two risk functions. Neither risk function dominates the other at all values of $\theta$.]

Example 4. Let $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$. Consider squared error loss and let $\hat p_1 = \bar X$. Since this estimator is unbiased, we have that

$$R(p, \hat p_1) = \mathrm{Var}(\bar X) = \frac{p(1-p)}{n}.$$

Another estimator is

$$\hat p_2 = \frac{Y + \alpha}{\alpha + \beta + n}$$

where $Y = \sum_{i=1}^n X_i$ and $\alpha$ and $\beta$ are positive constants. (This is the posterior mean using a $\mathrm{Beta}(\alpha, \beta)$ prior.) Now,

$$R(p, \hat p_2) = \mathrm{Var}_p(\hat p_2) + \bigl(\mathrm{bias}_p(\hat p_2)\bigr)^2 = \mathrm{Var}_p\!\left(\frac{Y + \alpha}{\alpha + \beta + n}\right) + \left( \mathbb{E}_p\!\left(\frac{Y + \alpha}{\alpha + \beta + n}\right) - p \right)^2 = \frac{np(1-p)}{(\alpha + \beta + n)^2} + \left( \frac{np + \alpha}{\alpha + \beta + n} - p \right)^2.$$

Let $\alpha = \beta = \sqrt{n/4}$. The resulting estimator is

$$\hat p_2 = \frac{Y + \sqrt{n/4}}{n + \sqrt{n}}$$

and the risk function is

$$R(p, \hat p_2) = \frac{n}{4(n + \sqrt{n})^2}.$$

The risk functions are plotted in Figure 2. As we can see, neither estimator uniformly dominates the other.

[Figure 2: Risk functions for $\hat p_1$ and $\hat p_2$ in Example 4. The solid curve is $R(p, \hat p_1)$. The dotted line is $R(p, \hat p_2)$.]

These examples highlight the need to be able to compare risk functions. To do so, we need a one-number summary of the risk function. Two such summaries are the maximum risk and the Bayes risk.
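Before formalizing those two summaries, here is a quick numerical check of Example 4 (a minimal sketch, not part of the original notes, assuming NumPy is available). It tabulates the two risk functions on a grid of $p$ values and reports their maxima, anticipating the comparison made in Example 5 below.

import numpy as np

n = 100                                   # sample size (an arbitrary illustrative choice)
p = np.linspace(0.0, 1.0, 1001)           # grid over the parameter space [0, 1]

risk1 = p * (1 - p) / n                                     # R(p, p_hat_1) for the sample mean
risk2 = np.full_like(p, n / (4 * (n + np.sqrt(n)) ** 2))    # constant risk of p_hat_2

print("maximum risk of p_hat_1:", risk1.max())              # equals 1/(4n)
print("maximum risk of p_hat_2:", risk2[0])                 # equals n / (4 (n + sqrt(n))^2), which is smaller
print("fraction of the grid where p_hat_2 beats p_hat_1:", (risk2 < risk1).mean())

The last line shows that $\hat p_2$ wins only on an interval of $p$ values around $1/2$, and that interval shrinks as $n$ grows.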
The maximum risk is

$$\bar R(\hat\theta) = \sup_{\theta \in \Theta} R(\theta, \hat\theta) \tag{3}$$

and the Bayes risk under prior $\pi$ is

$$B_\pi(\hat\theta) = \int R(\theta, \hat\theta)\, \pi(\theta)\, d\theta. \tag{4}$$

Example 5. Consider again the two estimators in Example 4. We have

$$\bar R(\hat p_1) = \max_{0 \le p \le 1} \frac{p(1-p)}{n} = \frac{1}{4n}$$

and

$$\bar R(\hat p_2) = \max_{p} \frac{n}{4(n + \sqrt{n})^2} = \frac{n}{4(n + \sqrt{n})^2}.$$

Based on maximum risk, $\hat p_2$ is a better estimator since $\bar R(\hat p_2) < \bar R(\hat p_1)$. However, when $n$ is large, $\hat p_1$ has smaller risk than $\hat p_2$ except in a small region of the parameter space near $p = 1/2$. Thus, many people prefer $\hat p_1$ to $\hat p_2$. This illustrates that one-number summaries like the maximum risk are imperfect.

These two summaries of the risk function suggest two different methods for devising estimators: choosing $\hat\theta$ to minimize the maximum risk leads to minimax estimators; choosing $\hat\theta$ to minimize the Bayes risk leads to Bayes estimators. An estimator $\hat\theta$ that minimizes the Bayes risk is called a Bayes estimator. That is,

$$B_\pi(\hat\theta) = \inf_{\tilde\theta} B_\pi(\tilde\theta) \tag{5}$$

where the infimum is over all estimators $\tilde\theta$. An estimator that minimizes the maximum risk is called a minimax estimator. That is,

$$\sup_{\theta} R(\theta, \hat\theta) = \inf_{\tilde\theta} \sup_{\theta} R(\theta, \tilde\theta) \tag{6}$$

where the infimum is over all estimators $\tilde\theta$. We call the right hand side of (6), namely,

$$R_n \equiv R_n(\Theta) = \inf_{\hat\theta} \sup_{\theta \in \Theta} R(\theta, \hat\theta), \tag{7}$$

the minimax risk. Statistical decision theory has two goals: determine the minimax risk $R_n$ and find an estimator that achieves this risk.

Once we have found the minimax risk $R_n$ we want to find the minimax estimator that achieves this risk:

$$\sup_{\theta \in \Theta} R(\theta, \hat\theta) = \inf_{\hat\theta} \sup_{\theta \in \Theta} R(\theta, \hat\theta). \tag{8}$$

Sometimes we settle for an asymptotically minimax estimator,

$$\sup_{\theta \in \Theta} R(\theta, \hat\theta) \sim \inf_{\hat\theta} \sup_{\theta \in \Theta} R(\theta, \hat\theta) \quad \text{as } n \to \infty, \tag{9}$$

where $a_n \sim b_n$ means that $a_n / b_n \to 1$. Even that can prove too difficult and we might settle for an estimator that achieves the minimax rate,

$$\sup_{\theta \in \Theta} R(\theta, \hat\theta) \asymp \inf_{\hat\theta} \sup_{\theta \in \Theta} R(\theta, \hat\theta) \quad \text{as } n \to \infty, \tag{10}$$

where $a_n \asymp b_n$ means that both $a_n / b_n$ and $b_n / a_n$ are bounded as $n \to \infty$.

1.3 Bayes Estimators

Let $\pi$ be a prior distribution. After observing $X^n = (X_1, \dots, X_n)$, the posterior distribution is, according to Bayes' theorem,

$$\mathbb{P}(\theta \in A \mid X^n) = \frac{\int_A p(X_1, \dots, X_n \mid \theta)\, \pi(\theta)\, d\theta}{\int_\Theta p(X_1, \dots, X_n \mid \theta)\, \pi(\theta)\, d\theta} = \frac{\int_A \mathcal{L}(\theta)\, \pi(\theta)\, d\theta}{\int_\Theta \mathcal{L}(\theta)\, \pi(\theta)\, d\theta} \tag{11}$$

where $\mathcal{L}(\theta) = p(x^n; \theta)$ is the likelihood function. The posterior has density

$$\pi(\theta \mid x^n) = \frac{p(x^n \mid \theta)\, \pi(\theta)}{m(x^n)} \tag{12}$$

where $m(x^n) = \int p(x^n \mid \theta)\, \pi(\theta)\, d\theta$ is the marginal distribution of $X^n$. Define the posterior risk of an estimator $\hat\theta(x^n)$ by

$$r(\hat\theta \mid x^n) = \int L(\theta, \hat\theta(x^n))\, \pi(\theta \mid x^n)\, d\theta. \tag{13}$$

Theorem 6. The Bayes risk $B_\pi(\hat\theta)$ satisfies

$$B_\pi(\hat\theta) = \int r(\hat\theta \mid x^n)\, m(x^n)\, dx^n. \tag{14}$$

Let $\hat\theta(x^n)$ be the value of $\theta$ that minimizes $r(\hat\theta \mid x^n)$. Then $\hat\theta$ is the Bayes estimator.

Proof. Let $p(x, \theta) = p(x \mid \theta)\, \pi(\theta)$ denote the joint density of $X$ and $\theta$. We can rewrite the Bayes risk as follows:

$$B_\pi(\hat\theta) = \int R(\theta, \hat\theta)\, \pi(\theta)\, d\theta = \int \left( \int L(\theta, \hat\theta(x^n))\, p(x^n \mid \theta)\, dx^n \right) \pi(\theta)\, d\theta$$
$$= \iint L(\theta, \hat\theta(x^n))\, p(x^n, \theta)\, dx^n\, d\theta = \iint L(\theta, \hat\theta(x^n))\, \pi(\theta \mid x^n)\, m(x^n)\, dx^n\, d\theta$$
$$= \int \left( \int L(\theta, \hat\theta(x^n))\, \pi(\theta \mid x^n)\, d\theta \right) m(x^n)\, dx^n = \int r(\hat\theta \mid x^n)\, m(x^n)\, dx^n.$$

If we choose $\hat\theta(x^n)$ to be the value of $\theta$ that minimizes $r(\hat\theta \mid x^n)$, then we minimize the integrand at every $x^n$ and thus minimize the integral $\int r(\hat\theta \mid x^n)\, m(x^n)\, dx^n$. $\square$

Now we can find an explicit formula for the Bayes estimator for some specific loss functions.
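As a numerical illustration of Theorem 6 (again a sketch, not part of the original notes; the Beta(2, 2) prior and the data below are illustrative assumptions, and the code relies on NumPy and SciPy), consider the Bernoulli model of Example 4. The posterior under a $\mathrm{Beta}(\alpha, \beta)$ prior is $\mathrm{Beta}(\alpha + Y, \beta + n - Y)$, and minimizing the posterior risk over a grid of candidate estimates under squared error loss recovers the posterior mean $(\alpha + Y)/(\alpha + \beta + n)$, in line with Theorem 7 below.

import numpy as np
from scipy import stats

n, Y = 20, 13                    # illustrative data: 13 successes in 20 Bernoulli trials
a, b = 2.0, 2.0                  # assumed Beta(2, 2) prior on p
posterior = stats.beta(a + Y, b + n - Y)        # posterior distribution of p given the data

theta = np.linspace(0.0005, 0.9995, 2000)       # grid over the parameter space
w = posterior.pdf(theta)
w = w / w.sum()                                 # discretized posterior weights

cand = theta.copy()                             # candidate point estimates p_hat
# posterior risk r(c | x) = E[(p - c)^2 | x], approximated on the grid, for each candidate c
post_risk = ((theta[None, :] - cand[:, None]) ** 2 * w[None, :]).sum(axis=1)

print("grid minimizer of the posterior risk:", cand[np.argmin(post_risk)])
print("posterior mean (a + Y) / (a + b + n): ", (a + Y) / (a + b + n))

Both printed values agree up to the grid spacing, as Theorem 6 combined with the next result predicts.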
Theorem 7. If $L(\theta, \hat\theta) = (\theta - \hat\theta)^2$ then the Bayes estimator is

$$\hat\theta(x^n) = \int \theta\, \pi(\theta \mid x^n)\, d\theta = \mathbb{E}(\theta \mid X^n = x^n). \tag{15}$$

If $L(\theta, \hat\theta) = |\theta - \hat\theta|$ then the Bayes estimator is the median of the posterior $\pi(\theta \mid x^n)$.
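For squared error loss the first claim can be verified directly (a short check, not in the original notes). Writing $a$ for a candidate value of $\hat\theta(x^n)$, the posterior risk expands as

$$r(a \mid x^n) = \int (\theta - a)^2\, \pi(\theta \mid x^n)\, d\theta = \mathbb{E}(\theta^2 \mid x^n) - 2a\, \mathbb{E}(\theta \mid x^n) + a^2,$$

a quadratic in $a$ whose derivative $2a - 2\,\mathbb{E}(\theta \mid x^n)$ vanishes at $a = \mathbb{E}(\theta \mid x^n)$, the posterior mean. The analogous calculation for absolute error loss, differentiating $\int |\theta - a|\, \pi(\theta \mid x^n)\, d\theta$ with respect to $a$, gives $\mathbb{P}(\theta < a \mid x^n) = \mathbb{P}(\theta > a \mid x^n)$, so the minimizer is the posterior median.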