Shrinkage

Econ 2148, Fall 2017: Shrinkage in the Normal means model

Maximilian Kasy

Department of Economics, Harvard University


Agenda

- Setup: the Normal means model,

  X ∼ N(θ, I_k),

  and the canonical estimation problem with loss ‖θ̂ − θ‖².
- The James-Stein (JS) shrinkage estimator.
- Three ways to arrive at the JS estimator (almost):
  1. Reverse regression of θ_i on X_i.
  2. Empirical Bayes: random effects model for θ_i.
  3. Shrinkage factor minimizing Stein's Unbiased Risk Estimate.
- Proof that JS uniformly dominates X as estimator of θ.
- The Normal means model as asymptotic approximation.

Takeaways for this part of class

- Shrinkage estimators trade off variance and bias.
- In multi-dimensional problems, we can estimate the optimal degree of shrinkage.
- Three intuitions that lead to the JS-estimator:
  1. Predict θ_i given X_i ⇒ reverse regression.
  2. Estimate the distribution of the θ_i ⇒ empirical Bayes.
  3. Find the shrinkage factor that minimizes estimated risk.
- Some calculus allows us to derive the risk of JS-shrinkage ⇒ better than MLE, no matter what the true θ is.
- The Normal means model is more general than it seems: it is a large-sample approximation to any parametric estimation problem.

The Normal means model: Setup

- θ ∈ ℝ^k
- ε ∼ N(0, I_k)
- X = θ + ε ∼ N(θ, I_k)
- Estimator: θ̂ = θ̂(X)
- Loss: squared error,

  L(θ̂, θ) = ∑_i (θ̂_i − θ_i)²

- Risk:

  R(θ̂, θ) = E_θ[L(θ̂, θ)] = ∑_i E_θ[(θ̂_i − θ_i)²].

Two estimators

- Canonical estimator: maximum likelihood,

  θ̂^ML = X

- Risk function:

  R(θ̂^ML, θ) = ∑_i E_θ[ε_i²] = k.

- James-Stein shrinkage estimator:

  θ̂^JS = (1 − ((k − 2)/k) / X̄²) · X,   where X̄² = (1/k) ∑_i X_i².

- Celebrated result: uniform risk dominance; for all θ,

  R(θ̂^JS, θ) < R(θ̂^ML, θ) = k.
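As a concrete illustration, here is a minimal numerical sketch of the two estimators (Python with numpy; the dimension k, the seed, and the particular θ vector are illustrative assumptions, not part of the slides):

```python
import numpy as np

def js_estimator(x):
    """James-Stein: theta_hat = (1 - (k - 2) / sum(x_i^2)) * x."""
    k = x.size
    return (1.0 - (k - 2) / np.sum(x**2)) * x

rng = np.random.default_rng(0)
k = 20
theta = np.linspace(-1.0, 1.0, k)        # arbitrary true mean vector
x = theta + rng.standard_normal(k)       # one draw X ~ N(theta, I_k)

theta_ml = x                             # maximum likelihood estimate
theta_js = js_estimator(x)               # James-Stein estimate
print(np.sum((theta_ml - theta)**2), np.sum((theta_js - theta)**2))
```

Note that (k − 2)/∑_i X_i² equals ((k − 2)/k)/X̄², so the code matches the formula above.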


First motivation of JS: Regression perspective

- We will discuss three ways to motivate the JS-estimator (up to a degrees-of-freedom correction).
- Consider estimators of the form

  θ̂_i = c · X_i

  or

  θ̂_i = a + b · X_i.

- How to choose c or (a, b)?
- Two particular possibilities:
  1. Maximum likelihood: c = 1
  2. James-Stein: c = 1 − ((k − 2)/k) / X̄²

Practice problem (Infeasible estimator)

- Suppose you knew X_1, ..., X_k as well as θ_1, ..., θ_k,
- but are constrained to use an estimator of the form θ̂_i = c · X_i.

1. Find the value of c that minimizes loss.

2. For estimators of the form θ̂_i = a + b · X_i, find the values of a and b that minimize loss.

Solution

- First problem:

  c* = argmin_c ∑_i (c · X_i − θ_i)²

- Least squares problem!
- First order condition:

  0 = ∑_i (c* · X_i − θ_i) · X_i.

- Solution:

  c* = ∑_i X_i θ_i / ∑_i X_i².

Solution continued

- Second problem:

  (a*, b*) = argmin_{a,b} ∑_i (a + b · X_i − θ_i)²

- Least squares problem again!
- First order conditions:

  0 = ∑_i (a* + b* · X_i − θ_i),

  0 = ∑_i (a* + b* · X_i − θ_i) · X_i.

- Solution:

  b* = ∑_i (X_i − X̄)(θ_i − θ̄) / ∑_i (X_i − X̄)² = s_{Xθ} / s_X²,    a* + b* · X̄ = θ̄.

Regression and reverse regression

- Recall X_i = θ_i + ε_i, E[ε_i | θ_i] = 0, Var(ε_i) = 1.
- Regression of X on θ: slope

  s_{Xθ} / s_θ² = 1 + s_{εθ} / s_θ² ≈ 1.

- For optimal shrinkage, we want to predict θ given X, not the other way around!
- Reverse regression of θ on X: slope

  s_{Xθ} / s_X² = (s_θ² + s_{εθ}) / (s_θ² + 2 s_{εθ} + s_ε²) ≈ s_θ² / (s_θ² + 1).

- Interpretation: “signal to (signal plus noise) ratio” < 1.
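The contrast between the two slopes is easy to check in a short simulation (a Python/numpy sketch; the spread of the θ_i and the sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 100_000                        # large k, so sample moments are close to their limits
theta = rng.normal(0.0, 2.0, k)    # true means with s_theta^2 ≈ 4
x = theta + rng.standard_normal(k)

cov_x_theta = np.cov(x, theta)[0, 1]
print(cov_x_theta / np.var(theta))   # slope of X on theta: ≈ 1
print(cov_x_theta / np.var(x))       # slope of theta on X: ≈ 4 / (4 + 1) = 0.8
```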


Illustration

[Figure: scan from Stigler (1990). Figure 1: hypothetical bivariate plot of θ_i vs. X_i, for i = 1, ..., k. The ordinary estimator θ̂_i = X_i corresponds to the regression of X on θ (the 45° line θ = X), while shrinkage corresponds to the reverse regression of θ on X.]

Expectations

Practice problem

1. Calculate the expectations of

   X̄ = (1/k) ∑_i X_i,    X̄² = (1/k) ∑_i X_i²,

   and

   s_X² = (1/k) ∑_i (X_i − X̄)² = X̄² − (X̄)².

2. Calculate the expected numerator and denominator of c* and b*.

Solution

- E[X̄] = θ̄
- E[X̄²] = θ̄² + 1, where θ̄² := (1/k) ∑_i θ_i²
- E[s_X²] = θ̄² − (θ̄)² + 1 = s_θ² + 1
- c* = ((1/k) ∑_i X_i θ_i) / X̄², and E[(1/k) ∑_i X_i θ_i] = θ̄². Thus

  c* ≈ θ̄² / (θ̄² + 1).

- b* = s_{Xθ} / s_X², and E[s_{Xθ}] = s_θ². Thus

  b* ≈ s_θ² / (s_θ² + 1).

Feasible analog estimators

Practice problem

Propose feasible estimators of c* and b*.

A solution

- Recall: c* = ((1/k) ∑_i X_i θ_i) / X̄².
- Note that (1/k) ∑_i θ_i ε_i ≈ 0 and (1/k) ∑_i ε_i² ≈ 1.
- Since X_i = θ_i + ε_i,

  (1/k) ∑_i X_i θ_i = X̄² − (1/k) ∑_i X_i ε_i = X̄² − (1/k) ∑_i θ_i ε_i − (1/k) ∑_i ε_i² ≈ X̄² − 1.

- Thus:

  c* = (X̄² − (1/k) ∑_i θ_i ε_i − (1/k) ∑_i ε_i²) / X̄² ≈ (X̄² − 1) / X̄² = 1 − 1/X̄² =: ĉ.

Solution continued

- Similarly: b* = s_{Xθ} / s_X².
- Note that s_{θε} ≈ 0 and s_ε² ≈ 1.
- Since X_i = θ_i + ε_i,

  s_{Xθ} = s_X² − s_{Xε} = s_X² − s_{θε} − s_ε² ≈ s_X² − 1.

- Thus:

  b* = (s_X² − s_{θε} − s_ε²) / s_X² ≈ (s_X² − 1) / s_X² = 1 − 1/s_X² =: b̂.
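A sketch of the two feasible analog estimators (Python/numpy; the simulated θ_i and the seed are illustrative assumptions):

```python
import numpy as np

def feasible_shrinkage_factors(x):
    """Analog estimators: c_hat = 1 - 1/mean(x^2), b_hat = 1 - 1/var(x)."""
    c_hat = 1.0 - 1.0 / np.mean(x**2)
    b_hat = 1.0 - 1.0 / np.var(x)
    return c_hat, b_hat

rng = np.random.default_rng(2)
k = 50
theta = rng.normal(1.0, 2.0, k)            # illustrative true means
x = theta + rng.standard_normal(k)

c_hat, b_hat = feasible_shrinkage_factors(x)
theta_hat_c = c_hat * x                                  # shrink toward 0
theta_hat_b = x.mean() + b_hat * (x - x.mean())          # shrink toward the grand mean
```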


James-Stein shrinkage

- We have almost derived the James-Stein shrinkage estimator.
- Only difference: a degrees-of-freedom correction.
- Optimal corrections:

  ĉ^JS = 1 − ((k − 2)/k) / X̄²,   and   b̂^JS = 1 − ((k − 3)/k) / s_X².

- Note: if θ = 0, then ∑_i X_i² ∼ χ²_k.
- Then, by properties of inverse χ² distributions,

  E[1 / ∑_i X_i²] = 1 / (k − 2),

  so that E[(k − 2) / ∑_i X_i²] = 1 and hence E[ĉ^JS] = 0: in expectation, full shrinkage toward 0 when θ = 0.
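For completeness, a short derivation of the inverse χ² moment used above (a standard calculation, valid for k > 2):

```latex
\mathbb{E}\!\left[\frac{1}{\chi^2_k}\right]
  = \int_0^\infty \frac{1}{x}\,
      \frac{x^{k/2-1} e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}\,dx
  = \frac{2^{k/2-1}\,\Gamma(k/2-1)}{2^{k/2}\,\Gamma(k/2)}
  = \frac{1}{2\,(k/2-1)}
  = \frac{1}{k-2}.
```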


Positive part JS-shrinkage

- The estimated shrinkage factors can be negative.
- ĉ^JS < 0 iff

  ∑_i X_i² < k − 2.

- Better estimator: restrict to c ≥ 0.
- “Positive part James-Stein estimator:”

  θ̂^{JS+} = max(0, 1 − ((k − 2)/k) / X̄²) · X.

- Dominates James-Stein.
- We will focus on the JS-estimator for analytical tractability.
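A sketch of the positive-part estimator, in the same style as the earlier snippet (Python/numpy):

```python
import numpy as np

def js_positive_part(x):
    """Positive-part James-Stein: the shrinkage factor is truncated at zero."""
    k = x.size
    shrinkage = max(0.0, 1.0 - (k - 2) / np.sum(x**2))
    return shrinkage * x
```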


Second motivation of JS: Parametric empirical Bayes (Setup)

- As before: θ ∈ ℝ^k
- X | θ ∼ N(θ, I_k)
- Loss: L(θ̂, θ) = ∑_i (θ̂_i − θ_i)²
- Now add an additional conceptual layer: think of the θ_i as i.i.d. draws from some distribution.
- “Random effects vs. fixed effects.”
- Let's consider θ_i ∼ iid N(0, τ²), where τ² is unknown.

Practice problem

- Derive the marginal distribution of X given τ².
- Find the maximum likelihood estimator of τ².
- Find the conditional expectation of θ given X and τ².
- Plug in the maximum likelihood estimator of τ² to get the empirical Bayes estimator of θ.

Solution

- Marginal distribution:

  X ∼ N(0, (τ² + 1) · I_k)

- Maximum likelihood estimator of τ²:

  τ̂² = argmax_{τ²} { −(1/2) ∑_i [ log(τ² + 1) + X_i² / (τ² + 1) ] } = X̄² − 1

- Conditional expectation of θ_i given X_i and τ²:

  θ̂_i = (Cov(θ_i, X_i) / Var(X_i)) · X_i = (τ² / (τ² + 1)) · X_i.

- Plugging in τ̂²:

  θ̂_i = (1 − 1/X̄²) · X_i.
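A minimal sketch of this empirical Bayes estimator (Python/numpy); truncating τ̂² at zero is an extra safeguard added here, not part of the slides' formula:

```python
import numpy as np

def empirical_bayes_normal(x):
    """Empirical Bayes under theta_i ~ iid N(0, tau^2) with tau^2 unknown."""
    tau2_hat = max(np.mean(x**2) - 1.0, 0.0)   # marginal MLE, truncated at 0
    shrinkage = tau2_hat / (tau2_hat + 1.0)    # = 1 - 1/mean(x^2) when not truncated
    return shrinkage * x
```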

General parametric empirical Bayes (Setup)

- Data X, parameters θ, hyper-parameters η
- Likelihood:

  X | θ, η ∼ f_{X|θ}

- Family of priors:

  θ | η ∼ f_{θ|η}

- Limiting cases:
  - θ = η: frequentist setup.
  - η has only one possible value: Bayesian setup.

Empirical Bayes estimation

- Marginal likelihood:

  f_{X|η}(x|η) = ∫ f_{X|θ}(x|θ) f_{θ|η}(θ|η) dθ.

  This has a simple form when the family of priors is conjugate.

- Estimator for the hyper-parameter η: marginal MLE,

  η̂ = argmax_η f_{X|η}(x|η).

- Estimator for the parameter θ: pseudo-posterior expectation,

  θ̂ = E[θ | X = x, η = η̂].

Third motivation of JS: Stein’s Unbiased Risk Estimate

- Stein's lemma (simplified version):
  - Suppose X ∼ N(θ, I_k).
  - Suppose g(·): ℝ^k → ℝ is differentiable and E[‖∇g(X)‖] < ∞.
  - Then E[(X − θ) · g(X)] = E[∇g(X)].
- Note:
  - θ shows up in the expression on the LHS, but not on the RHS.
  - Unbiased estimator of the RHS: ∇g(X).
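A quick Monte Carlo sanity check of the lemma for k = 1 (Python/numpy; the choices θ = 1.5 and g(x) = x³ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 1.5
x = theta + rng.standard_normal(10**6)   # X ~ N(theta, 1)

lhs = np.mean((x - theta) * x**3)        # E[(X - theta) g(X)] with g(x) = x^3
rhs = np.mean(3.0 * x**2)                # E[g'(X)] = E[3 X^2]
print(lhs, rhs)                          # both ≈ 3 (theta^2 + 1) = 9.75
```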


Practice problem

Prove this. Hints:

1. Show that the standard Normal density ϕ(·) satisfies

   ϕ′(x) = −x · ϕ(x).

2. Consider each component i separately and use integration by parts.

Solution

- Recall that ϕ(x) = (2π)^{−1/2} · exp(−x²/2). Differentiation immediately yields the first claim.
- Consider the component i = 1; the others follow similarly. Then

  E[∂_{x_1} g(X)]
    = ∫ ⋯ ∫ ∂_{x_1} g(x_1, ..., x_k) · ϕ(x_1 − θ_1) · ∏_{i=2}^k ϕ(x_i − θ_i) dx_1 ⋯ dx_k
    = ∫ ⋯ ∫ g(x_1, ..., x_k) · (−∂_{x_1} ϕ(x_1 − θ_1)) · ∏_{i=2}^k ϕ(x_i − θ_i) dx_1 ⋯ dx_k   (integration by parts)
    = ∫ ⋯ ∫ g(x_1, ..., x_k) · (x_1 − θ_1) ϕ(x_1 − θ_1) · ∏_{i=2}^k ϕ(x_i − θ_i) dx_1 ⋯ dx_k
    = E[(X_1 − θ_1) · g(X)].

- Collecting the components i = 1, ..., k yields E[(X − θ) · g(X)] = E[∇g(X)].


Stein’s representation of risk

- Consider a general estimator of θ of the form θ̂ = θ̂(X) = X + g(X), for differentiable g.
- Recall that the risk function is defined as

  R(θ̂, θ) = ∑_i E[(θ̂_i − θ_i)²].

- We will show that this risk function can be rewritten as

  R(θ̂, θ) = k + ∑_i E[g_i(X)²] + 2 ∑_i E[∂_{x_i} g_i(X)].

Practice problem

- Interpret this expression.
- Propose an unbiased estimator of risk, based on this expression.

Answer

- The expression for risk has three components:
  1. k is the risk of the canonical estimator θ̂ = X, corresponding to g ≡ 0.
  2. ∑_i E[g_i(X)²] = ∑_i E[(θ̂_i − X_i)²] is the expected sample sum of squared errors.
  3. ∑_i E[∂_{x_i} g_i(X)] can be thought of as a penalty for overfitting.
- We can thus think of this expression as giving a “penalized least squares” objective.
- The sample analog expression gives “Stein's Unbiased Risk Estimate” (SURE):

  R̂ = k + ∑_i (θ̂_i − X_i)² + 2 ∑_i ∂_{x_i} g_i(X).

- We will use Stein's representation of risk in two ways:
  1. To derive a feasible optimal shrinkage parameter, using its sample analog (SURE).
  2. To prove uniform dominance of JS, using the population version.

Practice problem

Prove Stein's representation of risk. Hints:

- Add and subtract X_i in the expression defining R(θ̂, θ).
- Use Stein's lemma.

Solution

R(θ̂, θ) = ∑_i E[(θ̂_i − X_i + X_i − θ_i)²]
         = ∑_i E[(X_i − θ_i)² + (θ̂_i − X_i)² + 2 (θ̂_i − X_i) · (X_i − θ_i)]
         = ∑_i ( 1 + E[g_i(X)²] + 2 E[g_i(X) · (X_i − θ_i)] )
         = ∑_i ( 1 + E[g_i(X)²] + 2 E[∂_{x_i} g_i(X)] ),

where Stein’s lemma was used in the last step.


Using SURE to pick the tuning parameter

- First use of SURE: to pick tuning parameters, as an alternative to cross-validation or marginal likelihood maximization.

- Simple example: linear shrinkage estimation,

  θ̂ = c · X.

Practice problem

- Calculate Stein's Unbiased Risk Estimate for θ̂.
- Find the coefficient c minimizing estimated risk.

Solution

- When θ̂ = c · X, then g(X) = θ̂ − X = (c − 1) · X, and ∂_{x_i} g_i(X) = c − 1.
- Estimated risk:

  R̂ = k + (1 − c)² · ∑_i X_i² + 2k · (c − 1).

- First order condition for minimizing R̂:

  k = (1 − c*) · ∑_i X_i².

- Thus

  c* = 1 − 1/X̄².

- Once again: almost the JS estimator, up to the degrees-of-freedom correction!
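The closed-form minimizer can be checked numerically by evaluating SURE on a grid of shrinkage factors (a Python/numpy sketch; the simulated data and grid resolution are illustrative assumptions):

```python
import numpy as np

def sure_linear(c, x):
    """SURE for the linear shrinkage estimator theta_hat = c * x."""
    k = x.size
    return k + (1.0 - c)**2 * np.sum(x**2) + 2.0 * k * (c - 1.0)

rng = np.random.default_rng(4)
k = 50
theta = rng.normal(0.0, 1.0, k)
x = theta + rng.standard_normal(k)

grid = np.linspace(0.0, 1.0, 1001)
c_grid = grid[np.argmin([sure_linear(c, x) for c in grid])]
c_closed = 1.0 - 1.0 / np.mean(x**2)
print(c_grid, c_closed)   # should agree up to grid resolution
```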

Celebrated result: Dominance of the JS-estimator

- We next use the population version of SURE to prove uniform dominance of the JS-estimator relative to maximum likelihood.

- Recall that the James-Stein estimator was defined as

  θ̂^JS = (1 − ((k − 2)/k) / X̄²) · X.

- Claim: the JS-estimator has uniformly lower risk than θ̂^ML = X.

Practice problem

Prove this, using Stein's representation of risk.

Solution

- The risk of θ̂^ML is equal to k.
- For JS, we have

  g_i(X) = θ̂_i^JS − X_i = −((k − 2) / ∑_j X_j²) · X_i,   and

  ∂_{x_i} g_i(X) = ((k − 2) / ∑_j X_j²) · (−1 + 2 X_i² / ∑_j X_j²).

- Summing over components gives

  ∑_i g_i(X)² = (k − 2)² / ∑_j X_j²,   and

  ∑_i ∂_{x_i} g_i(X) = −(k − 2)² / ∑_j X_j².

Solution continued

I Plugging into Stein’s expression for risk then gives " # JS R( , ) =k + E g (X)2 + 2 g (X) θb θ ∑ i ∑∂xi i i i " # (k − 2)2 (k − 2)2 =k + E 2 − 2 2 ∑i Xi ∑j Xj (k − 2)2  = k − E 2 . ∑i Xi

(k−2)2 I The term 2 is always positive (for k ≥ 3), and thus so is its ∑i Xi expectation. Uniform dominance immediately follows.

I Pretty cool, no?
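The risk formula, and the dominance it implies, can be checked by simulation (a Python/numpy sketch; k, θ, and the number of replications are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
k, reps = 10, 200_000
theta = np.full(k, 0.5)                       # illustrative true mean vector
x = theta + rng.standard_normal((reps, k))    # reps draws from N(theta, I_k)

ssq = np.sum(x**2, axis=1)
js = (1.0 - (k - 2) / ssq)[:, None] * x

risk_ml = np.mean(np.sum((x - theta)**2, axis=1))       # ≈ k
risk_js = np.mean(np.sum((js - theta)**2, axis=1))      # Monte Carlo risk of JS
risk_formula = k - (k - 2)**2 * np.mean(1.0 / ssq)      # Stein's expression
print(risk_ml, risk_js, risk_formula)                   # risk_js ≈ risk_formula < k
```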


The Normal means model as asymptotic approximation

- The Normal means model might seem quite special.

- But asymptotically, any sufficiently smooth parametric model is equivalent.

- Formally: the likelihood ratio process of n i.i.d. draws Y_i from the distribution P_{θ_0 + h/√n} converges to the likelihood ratio process of one draw X from

  N(h, I_{θ_0}^{−1}).

- Here h is a local parameter for the model around θ_0, and I_{θ_0} is the Fisher information matrix.

- Suppose that P_θ has a density f_θ relative to some measure.
- Recall the following definitions:
  - Log-likelihood: ℓ_θ(Y) = log f_θ(Y)
  - Score: ℓ̇_θ(Y) = ∂_θ log f_θ(Y)
  - Hessian: ℓ̈_θ(Y) = ∂²_θ log f_θ(Y)
  - Information matrix: I_θ = Var_θ(ℓ̇_θ(Y)) = −E_θ[ℓ̈_θ(Y)]
  - Likelihood ratio process:

    ∏_i f_{θ_0 + h/√n}(Y_i) / f_{θ_0}(Y_i),

    where Y_1, ..., Y_n are i.i.d. P_{θ_0 + h/√n} distributed.

Practice problem (Taylor expansion)

- Using this notation, provide a second-order Taylor expansion of the log-likelihood ℓ_{θ_0 + h}(Y) with respect to h.
- Provide a corresponding Taylor expansion for the log-likelihood of n i.i.d. draws Y_i from the distribution P_{θ_0 + h/√n}.
- Assuming that the remainder is negligible, describe the limiting behavior (as n → ∞) of the log-likelihood ratio process

  log ∏_i f_{θ_0 + h/√n}(Y_i) / f_{θ_0}(Y_i).

Solution

- Expansion for ℓ_{θ_0 + h}(Y):

  ℓ_{θ_0 + h}(Y) = ℓ_{θ_0}(Y) + h′ · ℓ̇_{θ_0}(Y) + (1/2) · h′ · ℓ̈_{θ_0}(Y) · h + remainder.

- Expansion for the log-likelihood ratio of n i.i.d. draws:

  log ∏_i f_{θ_0 + h/√n}(Y_i) / f_{θ_0}(Y_i) = (1/√n) h′ · ∑_i ℓ̇_{θ_0}(Y_i) + (1/(2n)) h′ · ∑_i ℓ̈_{θ_0}(Y_i) · h + remainder.

- Asymptotic behavior (by CLT, LLN):

  ∆ := (1/√n) ∑_i ℓ̇_{θ_0}(Y_i) →_d N(0, I_{θ_0}),

  (1/(2n)) ∑_i ℓ̈_{θ_0}(Y_i) →_p −(1/2) I_{θ_0}.

- Suppose the remainder is negligible.

- Then the previous slide suggests

  log ∏_i f_{θ_0 + h/√n}(Y_i) / f_{θ_0}(Y_i) ≈ h′ · ∆ − (1/2) h′ I_{θ_0} h,

  where

  ∆ ∼ N(0, I_{θ_0}).

- Theorem 7.2 in van der Vaart (2000), chapter 7, states sufficient conditions for this to hold.

- We show next that this is the same likelihood ratio process as for the model N(h, I_{θ_0}^{−1}).
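As a numerical illustration of this quadratic approximation, one can compare the exact log-likelihood ratio with h·∆ − ½h²I in a simple parametric model (a Python/numpy sketch; the Poisson model, λ_0, h, and n are illustrative assumptions, not from the slides):

```python
import numpy as np

# Poisson(lambda): score l_dot(y) = y/lambda0 - 1, Fisher information I = 1/lambda0.
rng = np.random.default_rng(6)
lam0, h, n = 2.0, 1.0, 10_000
y = rng.poisson(lam0 + h / np.sqrt(n), size=n)

# exact log-likelihood ratio of the local alternative against lambda0
llr_exact = -h * np.sqrt(n) + np.sum(y) * np.log(1.0 + h / (lam0 * np.sqrt(n)))

# quadratic approximation h * Delta - 0.5 * h^2 * I
delta = np.sum(y / lam0 - 1.0) / np.sqrt(n)
llr_approx = h * delta - 0.5 * h**2 / lam0

print(llr_exact, llr_approx)   # close for large n
```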


Practice problem

- Suppose X ∼ N(h, I_{θ_0}^{−1}).
- Write out the log-likelihood ratio

  log [ ϕ_{I_{θ_0}^{−1}}(X − h) / ϕ_{I_{θ_0}^{−1}}(X) ].

Solution

- The Normal density is given by

  ϕ_{I_{θ_0}^{−1}}(x) = ((2π)^k |det(I_{θ_0}^{−1})|)^{−1/2} · exp( −(1/2) x′ · I_{θ_0} · x ).

- Taking ratios and logs yields

  log [ ϕ_{I_{θ_0}^{−1}}(X − h) / ϕ_{I_{θ_0}^{−1}}(X) ] = h′ · I_{θ_0} · X − (1/2) h′ · I_{θ_0} · h.

- This is exactly the same process we obtained before, with I_{θ_0} · X taking the role of ∆.

Why care

- Suppose that Y_i ∼ iid P_{θ + h/√n}, and that T_n(Y_1, ..., Y_n) is an arbitrary statistic that satisfies

  T_n →_d L_{θ,h}

  for some limiting distribution L_{θ,h} and all h.

- Then L_{θ,h} is the distribution of some (possibly randomized) statistic T(X)!

- This is a (non-obvious) consequence of the convergence of the likelihood ratio process.

- Cf. Theorem 7.10 in van der Vaart (2000).

Maximum likelihood and shrinkage

- This result applies in particular to T = estimators of θ.
- Suppose that θ̂^ML is the maximum likelihood estimator.
- Then θ̂^ML →_d X, and any shrinkage estimator based on θ̂^ML converges in distribution to a corresponding shrinkage estimator in the limit experiment.
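A stylized sketch of how this gets used in practice (Python/numpy): group-level MLEs are standardized by their standard errors, which puts them approximately into the Normal means model, and JS shrinkage is applied to the standardized coordinates. The grouping, sample sizes, and the choice to shrink toward zero are all illustrative assumptions, not a prescription from the slides.

```python
import numpy as np

rng = np.random.default_rng(7)
k, n_per_group = 25, 40
true_means = rng.normal(0.0, 0.3, k)
data = true_means[:, None] + rng.standard_normal((k, n_per_group))

mle = data.mean(axis=1)                          # group-level MLEs
se = data.std(axis=1, ddof=1) / np.sqrt(n_per_group)

z = mle / se                                     # approximately N(theta_i / se_i, 1)
shrinkage = max(0.0, 1.0 - (k - 2) / np.sum(z**2))
shrunk = shrinkage * z * se                      # shrunken estimates on the original scale
```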


References

- Textbook introduction:
  Wasserman, L. (2006). All of Nonparametric Statistics. Springer Science & Business Media, chapter 7.

- Reverse regression perspective:
  Stigler, S. M. (1990). The 1988 Neyman memorial lecture: A Galtonian perspective on shrinkage estimators. Statistical Science, pages 147–155.

- Parametric empirical Bayes:
  Morris, C. N. (1983). Parametric empirical Bayes inference: Theory and applications. Journal of the American Statistical Association, 78(381):47–55.
  Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, volume 31. Springer, section 4.6.

- Stein's Unbiased Risk Estimate:
  Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151.
  Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, volume 31. Springer, sections 5.2, 5.4, 5.5.

- The Normal means model as asymptotic approximation:
  van der Vaart, A. (2000). Asymptotic Statistics. Cambridge University Press, chapter 7.
  Hansen, B. E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics, 190(1):115–132.