Econ 722 – Advanced Econometrics IV
Francis J. DiTraglia
University of Pennsylvania
Lecture #1 – Decision Theory
Statistical Decision Theory
The James-Stein Estimator
Econ 722, Spring ’19 Lecture 1 – Slide 1 Decision Theoretic Preliminaries
Parameter $\theta \in \Theta$: Unknown state of nature, from parameter space $\Theta$.
Observed Data: Observe $X$ with distribution $F_\theta$ from a sample space $\mathcal{X}$.
Estimator $\hat\theta$: An estimator (aka a decision rule) is a function from $\mathcal{X}$ to $\Theta$.
Loss Function $L(\theta, \hat\theta)$: A function from $\Theta \times \Theta$ to $\mathbb{R}$ that gives the cost we incur if we report $\hat\theta$ when the true state of nature is $\theta$.
Econ 722, Spring ’19 Lecture 1 – Slide 2 Examples of Loss Functions
$$L(\theta, \hat\theta) = (\theta - \hat\theta)^2 \quad \text{squared error loss}$$
$$L(\theta, \hat\theta) = |\theta - \hat\theta| \quad \text{absolute error loss}$$
$$L(\theta, \hat\theta) = \mathbf{1}\{\theta \neq \hat\theta\} \quad \text{zero-one loss}$$
$$L(\theta, \hat\theta) = \int \log\left[\frac{f(x|\theta)}{f(x|\hat\theta)}\right] f(x|\theta)\, dx \quad \text{Kullback–Leibler loss}$$
Econ 722, Spring ’19 Lecture 1 – Slide 3 (Frequentist) Risk of an Estimator θb
$$R(\theta, \hat\theta) = E_\theta\left[L(\theta, \hat\theta)\right] = \int L\big(\theta, \hat\theta(x)\big)\, dF_\theta(x)$$
The frequentist decision theorist seeks to evaluate, for each $\theta$, how much he would "expect" to lose if he used $\hat\theta(X)$ repeatedly with varying $X$ in the problem. (Berger, 1985)
Example: Squared Error Loss
$$R(\theta, \hat\theta) = E_\theta\left[(\theta - \hat\theta)^2\right] = \text{MSE} = \text{Var}(\hat\theta) + \text{Bias}_\theta(\hat\theta)^2$$
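As a quick numerical illustration of the risk calculation above, the following sketch (my own example, not from the slides; the shrinkage constant c and the simulation design are assumptions for illustration) approximates the risk of the estimator $c\bar{X}$ for a normal mean by Monte Carlo and compares it with the variance-plus-squared-bias decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 20, 200_000

# Many samples of size n from N(theta, 1); estimator is the shrunken sample mean c * Xbar
X_bar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)

for c in [1.0, 0.8]:
    est = c * X_bar
    mc_risk = np.mean((est - theta) ** 2)        # Monte Carlo estimate of the risk
    theory = c**2 / n + ((c - 1) * theta) ** 2   # Var(c*Xbar) + Bias^2
    print(f"c = {c}: Monte Carlo risk = {mc_risk:.4f}, Var + Bias^2 = {theory:.4f}")
```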
Econ 722, Spring ’19 Lecture 1 – Slide 4 Bayes Risk and Maximum Risk
Comparing Risk: $R(\theta, \hat\theta)$ is a function of $\theta$ rather than a single number. We want an estimator with low risk, but how can we compare?
Maximum Risk
$$\bar{R}(\hat\theta) = \sup_{\theta \in \Theta} R(\theta, \hat\theta)$$
Bayes Risk
$$r(\pi, \hat\theta) = E_\pi\left[R(\theta, \hat\theta)\right], \quad \text{where } \pi \text{ is a prior for } \theta$$
Econ 722, Spring ’19 Lecture 1 – Slide 5 Bayes and Minimax Rules
Minimize the maximum risk or the Bayes risk over all estimators $\tilde\theta$.
Minimax Rule/Estimator
$\hat\theta$ is minimax if $\sup_{\theta\in\Theta} R(\theta, \hat\theta) = \inf_{\tilde\theta} \sup_{\theta\in\Theta} R(\theta, \tilde\theta)$.
Bayes Rule/Estimator
$\hat\theta$ is a Bayes rule with respect to prior $\pi$ if $r(\pi, \hat\theta) = \inf_{\tilde\theta} r(\pi, \tilde\theta)$.
Econ 722, Spring ’19 Lecture 1 – Slide 6 Recall: Bayes’ Theorem and Marginal Likelihood
Let $\pi$ be a prior for $\theta$. By Bayes' theorem, the posterior $\pi(\theta|x)$ is
$$\pi(\theta|x) = \frac{f(x|\theta)\pi(\theta)}{m(x)}$$
where the marginal likelihood $m(x)$ is given by
$$m(x) = \int f(x|\theta)\pi(\theta)\, d\theta$$
Econ 722, Spring ’19 Lecture 1 – Slide 7 Posterior Expected Loss
Posterior Expected Loss
$$\rho\big(\pi(\theta|x), \hat\theta\big) = \int L(\theta, \hat\theta)\,\pi(\theta|x)\, d\theta$$
Bayesian Decision Theory
Choose an estimator that minimizes posterior expected loss.
Easier Calculation
Since $m(x)$ does not depend on $\theta$, to minimize $\rho\big(\pi(\theta|x), \hat\theta\big)$ it suffices to minimize $\int L(\theta, \hat\theta)\, f(x|\theta)\pi(\theta)\, d\theta$.
Question
Is there a relationship between Bayes risk, $r(\pi, \hat\theta) \equiv E_\pi[R(\theta, \hat\theta)]$, and posterior expected loss?
Econ 722, Spring ’19 Lecture 1 – Slide 8 Bayes Risk vs. Posterior Expected Loss
Theorem
$$r(\pi, \hat\theta) = \int \rho\big(\pi(\theta|x), \hat\theta(x)\big)\, m(x)\, dx$$
Proof
$$\begin{aligned}
r(\pi, \hat\theta) &= \int R(\theta, \hat\theta)\pi(\theta)\, d\theta = \int \left[\int L\big(\theta, \hat\theta(x)\big) f(x|\theta)\, dx\right] \pi(\theta)\, d\theta \\
&= \iint L\big(\theta, \hat\theta(x)\big)\left[f(x|\theta)\pi(\theta)\right] dx\, d\theta \\
&= \iint L\big(\theta, \hat\theta(x)\big)\left[\pi(\theta|x)\, m(x)\right] dx\, d\theta \\
&= \int \left[\int L\big(\theta, \hat\theta(x)\big)\, \pi(\theta|x)\, d\theta\right] m(x)\, dx \\
&= \int \rho\big(\pi(\theta|x), \hat\theta(x)\big)\, m(x)\, dx
\end{aligned}$$
Econ 722, Spring ’19 Lecture 1 – Slide 9 Finding a Bayes Estimator
Hard Problem
Find the function $\hat\theta(x)$ that minimizes $r(\pi, \hat\theta)$.
Easy Problem
Find the number $\hat\theta$ that minimizes $\rho\big(\pi(\theta|x), \hat\theta\big)$.
Punchline
Since $r(\pi, \hat\theta) = \int \rho\big(\pi(\theta|x), \hat\theta(x)\big)\, m(x)\, dx$, to minimize $r(\pi, \hat\theta)$ we can set $\hat\theta(x)$ equal to the value $\hat\theta$ that minimizes $\rho\big(\pi(\theta|x), \hat\theta\big)$ for each $x$.
Econ 722, Spring ’19 Lecture 1 – Slide 10 Bayes Estimators for Common Loss Functions
Zero-one Loss
For zero-one loss, the Bayes estimator is the posterior mode.
Absolute Error Loss: $L(\theta, \hat\theta) = |\theta - \hat\theta|$
For absolute error loss, the Bayes estimator is the posterior median.
Squared Error Loss: $L(\theta, \hat\theta) = (\theta - \hat\theta)^2$
For squared error loss, the Bayes estimator is the posterior mean.
Econ 722, Spring ’19 Lecture 1 – Slide 11 Derivation of Bayes Estimator for Squared Error Loss
By definition,
$$\hat\theta \equiv \underset{a\in\Theta}{\arg\min} \int (\theta - a)^2\, \pi(\theta|x)\, d\theta$$
Differentiating with respect to $a$ and setting the derivative to zero, we have
$$2\int (\theta - a)\, \pi(\theta|x)\, d\theta = 0 \implies a = \int \theta\, \pi(\theta|x)\, d\theta$$
Econ 722, Spring ’19 Lecture 1 – Slide 12 Example: Bayes Estimator for a Normal Mean
Suppose $X \sim N(\mu, 1)$ and $\pi$ is a $N(a, b^2)$ prior. Then,
$$\begin{aligned}
\pi(\mu|x) &\propto f(x|\mu) \times \pi(\mu) \\
&\propto \exp\left\{-\frac{1}{2}\left[(x - \mu)^2 + \frac{(\mu - a)^2}{b^2}\right]\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left[\left(1 + \frac{1}{b^2}\right)\mu^2 - 2\left(x + \frac{a}{b^2}\right)\mu\right]\right\} \\
&\propto \exp\left\{-\frac{1}{2}\cdot\frac{b^2 + 1}{b^2}\left(\mu - \frac{b^2 x + a}{b^2 + 1}\right)^2\right\}
\end{aligned}$$
So $\pi(\mu|x)$ is $N(m, \omega^2)$ with $\omega^2 = \frac{b^2}{1 + b^2}$ and $m = \omega^2 x + (1 - \omega^2)a$. Hence the Bayes estimator for $\mu$ under squared error loss is
$$\hat\theta(X) = \frac{b^2 X + a}{1 + b^2}$$
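The closed form above is easy to check numerically. The sketch below (an illustration I added, not part of the slides; the particular values of a, b, and x are arbitrary) compares $(b^2 x + a)/(1 + b^2)$ with a posterior mean computed by brute-force evaluation of prior times likelihood on a grid.

```python
import numpy as np

a, b, x = 1.0, 2.0, 3.5          # prior mean, prior sd, observed data point

# Posterior mean from the closed-form shrinkage formula
closed_form = (b**2 * x + a) / (1 + b**2)

# Brute force: evaluate prior * likelihood on a grid and take the normalized mean
mu = np.linspace(-20, 20, 200_001)
unnorm = np.exp(-0.5 * (x - mu)**2) * np.exp(-0.5 * (mu - a)**2 / b**2)
numerical = (mu * unnorm).sum() / unnorm.sum()

print(closed_form, numerical)    # should agree to several decimal places
```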
Econ 722, Spring ’19 Lecture 1 – Slide 13 Minimax Analysis
Wasserman (2004) The advantage of using maximum risk, despite its problems, is that it does not require one to choose a prior.
Berger (1986) Perhaps the greatest use of the minimax principle is in situations for which no prior information is available . . . but two notes of caution should be sounded. First, the minimax principle can lead to bad decision rules. . . Second, the minimax approach can be devilishly hard to implement.
Econ 722, Spring ’19 Lecture 1 – Slide 14 Methods for Finding a Minimax Estimator
1. Direct Calculation
2. Guess a "Least Favorable" Prior
3. Search for an "Equalizer Rule"
Method 1 is rarely applicable, so we focus on Methods 2 and 3...
Econ 722, Spring ’19 Lecture 1 – Slide 15 The Bayes Rule for a Least Favorable Prior is Minimax
Theorem
Let $\hat\theta$ be a Bayes rule with respect to $\pi$ and suppose that for all $\theta \in \Theta$ we have $R(\theta, \hat\theta) \le r(\pi, \hat\theta)$. Then $\hat\theta$ is a minimax estimator, and $\pi$ is called a least favorable prior.
Proof
Suppose that $\hat\theta$ is not minimax. Then there exists another estimator $\tilde\theta$ with $\sup_{\theta\in\Theta} R(\theta, \tilde\theta) < \sup_{\theta\in\Theta} R(\theta, \hat\theta)$. But
$$r(\pi, \tilde\theta) \equiv E_\pi\left[R(\theta, \tilde\theta)\right] \le E_\pi\left[\sup_{\theta\in\Theta} R(\theta, \tilde\theta)\right] = \sup_{\theta\in\Theta} R(\theta, \tilde\theta)$$
so this implies that $\hat\theta$ is not Bayes with respect to $\pi$, since
$$r(\pi, \tilde\theta) \le \sup_{\theta\in\Theta} R(\theta, \tilde\theta) < \sup_{\theta\in\Theta} R(\theta, \hat\theta) \le r(\pi, \hat\theta)$$
Econ 722, Spring ’19 Lecture 1 – Slide 16 Example of Least Favorable Prior
Bounded Normal Mean
- $X \sim N(\theta, 1)$
- Squared error loss
- $\Theta = [-m, m]$ for $0 < m < 1$
Least Favorable Prior
$\pi(\theta) = 1/2$ for $\theta \in \{-m, m\}$, zero otherwise.
Resulting Bayes Rule is Minimax
$$\hat\theta(X) = m\tanh(mX) = m\,\frac{\exp\{mX\} - \exp\{-mX\}}{\exp\{mX\} + \exp\{-mX\}}$$
Econ 722, Spring ’19 Lecture 1 – Slide 17 Equalizer Rules
Definition
An estimator $\hat\theta$ is called an equalizer rule if its risk function is constant: $R(\theta, \hat\theta) = C$ for some $C$.
Theorem
If $\hat\theta$ is an equalizer rule and is Bayes with respect to $\pi$, then $\hat\theta$ is minimax and $\pi$ is least favorable.
Proof
$$r(\pi, \hat\theta) = \int R(\theta, \hat\theta)\pi(\theta)\, d\theta = \int C\,\pi(\theta)\, d\theta = C$$
Hence $R(\theta, \hat\theta) \le r(\pi, \hat\theta)$ for all $\theta$, so we can apply the preceding theorem.
Econ 722, Spring ’19 Lecture 1 – Slide 18 Example: X1,..., Xn ∼ iid Bernoulli(p)
Under a Beta$(\alpha, \beta)$ prior with $\alpha = \beta = \sqrt{n}/2$,
$$\hat{p} = \frac{n\bar{X} + \sqrt{n}/2}{n + \sqrt{n}}$$
is the Bayesian posterior mean, hence the Bayes rule under squared error loss. The risk function of $\hat{p}$ is
$$R(p, \hat{p}) = \frac{n}{4(n + \sqrt{n})^2}$$
which is constant in $p$. Hence $\hat{p}$ is an equalizer rule, and by the preceding theorem it is minimax.
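As a sanity check (my own addition, not from the slides; the sample size and grid of p values are arbitrary), the following sketch approximates $R(p, \hat{p})$ by Monte Carlo at several values of $p$ and compares it with the constant $n/\{4(n + \sqrt{n})^2\}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 200_000
const = n / (4 * (n + np.sqrt(n))**2)

for p in [0.1, 0.3, 0.5, 0.9]:
    x_bar = rng.binomial(n, p, size=reps) / n
    p_hat = (n * x_bar + np.sqrt(n) / 2) / (n + np.sqrt(n))
    risk = np.mean((p_hat - p)**2)
    print(f"p = {p}: Monte Carlo risk = {risk:.6f}, theory = {const:.6f}")
```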
Econ 722, Spring ’19 Lecture 1 – Slide 19 Problems with the Minimax Principle
[Figure: two panels plotting the risk functions $R(\theta, \hat\theta)$ and $R(\theta, \tilde\theta)$ against $\theta \in [0, 2]$.]
In the left panel, $\tilde\theta$ is preferred by the minimax principle; in the right panel $\hat\theta$ is preferred. But the only difference between them is that the right panel adds an additional fixed loss of 1 for $1 \le \theta \le 2$.
Econ 722, Spring ’19 Lecture 1 – Slide 20 Problems with the Minimax Principle
Suppose that $\Theta = \{\theta_1, \theta_2\}$, $A = \{a_1, a_2\}$ and the loss function is:

        a1      a2
θ1      10      10.01
θ2       8      -8

- Minimax principle: choose $a_1$.
- Bayes: choose $a_2$ unless $\pi(\theta_1) > 0.9994$.
The minimax principle ignores the fact that under $\theta_1$ we can never do better than a loss of 10, and tries to prevent us from incurring a tiny additional loss of 0.01.
Econ 722, Spring ’19 Lecture 1 – Slide 21 Dominance and Admissibility
Dominance
$\hat\theta$ dominates $\tilde\theta$ with respect to $R$ if $R(\theta, \hat\theta) \le R(\theta, \tilde\theta)$ for all $\theta \in \Theta$ and the inequality is strict for at least one value of $\theta$.
Admissibility
$\hat\theta$ is admissible if no other estimator dominates it.
Inadmissibility
$\hat\theta$ is inadmissible if there is an estimator that dominates it.
Econ 722, Spring ’19 Lecture 1 – Slide 22 Example of an Admissible Estimator
Say we want to estimate $\theta$ from $X \sim N(\theta, 1)$ under squared error loss. Is the estimator $\hat\theta(X) = 3$ admissible?
If not, then there is a $\tilde\theta$ with $R(\theta, \tilde\theta) \le R(\theta, \hat\theta)$ for all $\theta$. Hence:
$$R(3, \tilde\theta) \le R(3, \hat\theta) = \left\{E[\hat\theta] - 3\right\}^2 + \text{Var}(\hat\theta) = 0$$
Since the risk cannot be negative under squared error loss,
$$0 = R(3, \tilde\theta) = \left\{E[\tilde\theta] - 3\right\}^2 + \text{Var}(\tilde\theta)$$
Therefore $\hat\theta = \tilde\theta$, so $\hat\theta$ is admissible, although very silly!
Econ 722, Spring ’19 Lecture 1 – Slide 23 Bayes Rules are Admissible
Theorem A-1
Suppose that $\Theta$ is a discrete set and $\pi$ gives strictly positive probability to each element of $\Theta$. Then, if $\hat\theta$ is a Bayes rule with respect to $\pi$, it is admissible.
Theorem A-2
If a Bayes rule is unique, it is admissible.
Theorem A-3
Suppose that $R(\theta, \hat\theta)$ is continuous in $\theta$ for all $\hat\theta$ and that $\pi$ gives strictly positive probability to any open subset of $\Theta$. Then, if $\hat\theta$ is a Bayes rule with respect to $\pi$, it is admissible.
Econ 722, Spring ’19 Lecture 1 – Slide 24 Admissible Equalizer Rules are Minimax
Theorem
Let $\hat\theta$ be an equalizer rule. Then if $\hat\theta$ is admissible, it is minimax.
Proof
Since $\hat\theta$ is an equalizer rule, $R(\theta, \hat\theta) = C$. Suppose that $\hat\theta$ is not minimax. Then there is a $\tilde\theta$ such that
$$\sup_{\theta\in\Theta} R(\theta, \tilde\theta) < \sup_{\theta\in\Theta} R(\theta, \hat\theta) = C$$
But for any $\theta$, $R(\theta, \tilde\theta) \le \sup_{\theta\in\Theta} R(\theta, \tilde\theta) < C = R(\theta, \hat\theta)$. Thus we have shown that $\tilde\theta$ dominates $\hat\theta$, so that $\hat\theta$ cannot be admissible.
Econ 722, Spring ’19 Lecture 1 – Slide 25 Minimax Implies “Nearly” Admissible
Strong Inadmissibility
We say that $\hat\theta$ is strongly inadmissible if there exists an estimator $\tilde\theta$ and an $\varepsilon > 0$ such that $R(\theta, \tilde\theta) < R(\theta, \hat\theta) - \varepsilon$ for all $\theta$.
Theorem
If $\hat\theta$ is minimax, then it is not strongly inadmissible.
Econ 722, Spring ’19 Lecture 1 – Slide 26 Example: Sample Mean, Unbounded Parameter Space
Theorem
Suppose that $X_1, \dots, X_n \sim \text{iid } N(\theta, 1)$ with $\Theta = \mathbb{R}$. Under squared error loss, one can show that $\hat\theta = \bar{X}$ is admissible.
Intuition
The proof is complicated, but effectively we view this estimator as a limit of a sequence of Bayes estimators with priors $N(a, b^2)$, as $b^2 \to \infty$.
Minimaxity
Since $R(\theta, \bar{X}) = \text{Var}(\bar{X}) = 1/n$, we see that $\bar{X}$ is an equalizer rule. Since it is admissible, it is therefore minimax.
Econ 722, Spring ’19 Lecture 1 – Slide 27 Recall: Gauss-Markov Theorem
Linear Regression Model
$$y = X\beta + \varepsilon, \quad E[\varepsilon|X] = 0$$
Best Linear Unbiased Estimator
- If $\text{Var}(\varepsilon|X) = \sigma^2 I$, then OLS has the lowest variance among linear, unbiased estimators of $\beta$.
- If $\text{Var}(\varepsilon|X) \ne \sigma^2 I$, then GLS gives a lower-variance estimator.
What if we consider biased estimators and squared error loss?
Econ 722, Spring ’19 Lecture 1 – Slide 28 Multiple Normal Means: X ∼ N(θ, I )
Goal
Estimate the $p$-vector $\theta$ using $X$ with $L(\theta, \hat\theta) = \|\hat\theta - \theta\|^2$.
Maximum Likelihood Estimator
The MLE is the sample mean, but we have only one observation: $\hat\theta = X$.
Risk of $\hat\theta$
$$(\hat\theta - \theta)'(\hat\theta - \theta) = (X - \theta)'(X - \theta) = \sum_{i=1}^p (X_i - \theta_i)^2 \sim \chi^2_p$$
Since $E[\chi^2_p] = p$, we have $R(\theta, \hat\theta) = p$.
Econ 722, Spring ’19 Lecture 1 – Slide 29 Multiple Normal Means: X ∼ N(θ, I )
James-Stein Estimator
$$\hat\theta_{JS} = \hat\theta\left(1 - \frac{p - 2}{\hat\theta'\hat\theta}\right) = X - \frac{(p - 2)X}{X'X}$$
- Shrinks the components of the sample mean vector towards zero.
- More elements in $\theta$ $\Rightarrow$ more shrinkage.
- MLE close to zero ($\hat\theta'\hat\theta$ small) gives more shrinkage.
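To make the dominance result derived on the following slides concrete, here is a small simulation (my own illustration, not from the slides; the choice p = 10 and the true mean vector are arbitrary) comparing the Monte Carlo risk of the MLE and the James-Stein estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
p, reps = 10, 100_000
theta = np.ones(p)                                   # true mean vector

X = rng.normal(theta, 1.0, size=(reps, p))           # one N(theta, I) draw per replication
xtx = np.sum(X**2, axis=1, keepdims=True)
theta_js = (1 - (p - 2) / xtx) * X                   # James-Stein shrinkage

risk_mle = np.mean(np.sum((X - theta)**2, axis=1))   # should be close to p
risk_js = np.mean(np.sum((theta_js - theta)**2, axis=1))
print(risk_mle, risk_js)                             # JS risk is strictly smaller
```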
Econ 722, Spring ’19 Lecture 1 – Slide 30 MSE of James-Stein Estimator
$$\begin{aligned}
R\big(\theta, \hat\theta_{JS}\big) &= E\left[\big(\hat\theta_{JS} - \theta\big)'\big(\hat\theta_{JS} - \theta\big)\right] \\
&= E\left[\left\{(X - \theta) - \frac{(p-2)X}{X'X}\right\}'\left\{(X - \theta) - \frac{(p-2)X}{X'X}\right\}\right] \\
&= E\left[(X - \theta)'(X - \theta)\right] - 2(p-2)\, E\left[\frac{X'(X - \theta)}{X'X}\right] + (p-2)^2\, E\left[\frac{1}{X'X}\right] \\
&= p - 2(p-2)\, E\left[\frac{X'(X - \theta)}{X'X}\right] + (p-2)^2\, E\left[\frac{1}{X'X}\right]
\end{aligned}$$
using the fact that $R(\theta, \hat\theta) = p$ from the previous slide.
Econ 722, Spring ’19 Lecture 1 – Slide 31 Simplifying the Second Term Writing Numerator as a Sum
$$E\left[\frac{X'(X - \theta)}{X'X}\right] = E\left[\frac{\sum_{i=1}^p X_i(X_i - \theta_i)}{X'X}\right] = \sum_{i=1}^p E\left[\frac{X_i(X_i - \theta_i)}{X'X}\right]$$
For $i = 1, \dots, p$,
$$E\left[\frac{X_i(X_i - \theta_i)}{X'X}\right] = E\left[\frac{X'X - 2X_i^2}{(X'X)^2}\right]$$
(Not obvious: integration by parts, writing the expectation as a $p$-fold integral, $X \sim N(\theta, I)$.)
Combining,
$$E\left[\frac{X'(X - \theta)}{X'X}\right] = \sum_{i=1}^p E\left[\frac{X'X - 2X_i^2}{(X'X)^2}\right] = p\,E\left[\frac{1}{X'X}\right] - 2\,E\left[\frac{\sum_{i=1}^p X_i^2}{(X'X)^2}\right] = (p - 2)\,E\left[\frac{1}{X'X}\right]$$
Econ 722, Spring ’19 Lecture 1 – Slide 32 The MLE is Inadmissible when p ≥ 3
$$R\big(\theta, \hat\theta_{JS}\big) = p - 2(p-2)\cdot(p-2)\,E\left[\frac{1}{X'X}\right] + (p-2)^2\, E\left[\frac{1}{X'X}\right] = p - (p-2)^2\, E\left[\frac{1}{X'X}\right]$$
- $E[1/(X'X)]$ exists and is positive whenever $p \ge 3$.
- $(p - 2)^2$ is always positive.
- Hence, the second term in the MSE expression is negative.
- The first term is the MSE of the MLE.
Therefore James-Stein strictly dominates the MLE whenever $p \ge 3$!
Econ 722, Spring ’19 Lecture 1 – Slide 33 James-Stein More Generally
- Our example was specific, but the result is general:
- The MLE is inadmissible under quadratic loss in a regression model with at least three regressors.
- Note, however, that this is MSE for the full parameter vector.
- The James-Stein estimator is also inadmissible!
- It is dominated by the "positive-part" James-Stein estimator (see the sketch after this list):
$$\hat\beta_{JS} = \hat\beta\left(1 - \frac{(p - 2)\hat\sigma^2}{\hat\beta' X'X \hat\beta}\right)_+$$
- $\hat\beta$ = OLS, $(x)_+ = \max(x, 0)$, $\hat\sigma^2$ = the usual OLS-based estimator.
- The positive part stops us from shrinking past zero and getting a negative estimate for an element of $\beta$ with a small OLS estimate.
- Positive-part James-Stein isn't admissible either!
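Here is a minimal sketch of the positive-part James-Stein shrinkage applied to OLS coefficients. It only illustrates the formula above; the simulated data, the variable names, and the choice of five regressors are my own assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 100, 5
X = rng.normal(size=(T, p))
beta = np.array([0.1, -0.1, 0.0, 0.2, 0.05])      # small true coefficients
y = X @ beta + rng.normal(size=T)

# OLS and the usual variance estimator
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols
sigma2 = resid @ resid / (T - p)

# Positive-part James-Stein shrinkage factor
quad_form = beta_ols @ X.T @ X @ beta_ols
shrink = max(1 - (p - 2) * sigma2 / quad_form, 0.0)
beta_js = shrink * beta_ols

print(beta_ols)
print(beta_js)
```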
Econ 722, Spring ’19 Lecture 1 – Slide 34 Lecture #2 – Model Selection I
Kullback-Leibler Divergence
Bias of Maximized Sample Log-Likelihood
Review of Asymptotics for Mis-specified MLE
Deriving AIC and TIC
Corrected AIC (AICc )
Mallow’s Cp
Econ 722, Spring ’19 Lecture 2 – Slide 1 Kullback-Leibler (KL) Divergence
Motivation
How well does a given density $f(y)$ approximate an unknown true density $g(y)$? Use this to select between parametric models.
Definition
$$KL(g; f) = E_G\left[\log\frac{g(Y)}{f(Y)}\right] = \underbrace{E_G[\log g(Y)]}_{\substack{\text{depends only on truth} \\ \text{fixed across models}}} - \underbrace{E_G[\log f(Y)]}_{\text{expected log-likelihood}}$$
(The true density $g$ goes on top inside the logarithm.)
Properties
- Not symmetric: $KL(g; f) \ne KL(f; g)$.
- By Jensen's Inequality, $KL(g; f) \ge 0$, with equality iff $g = f$ almost everywhere.
- Minimizing KL $\iff$ maximizing the expected log-likelihood.
Econ 722, Spring ’19 Lecture 2 – Slide 2 KL(g; f ) ≥ 0 with equality iff g = f almost surely
Jensen's Inequality
If $\varphi$ is convex, then $\varphi(E[X]) \le E[\varphi(X)]$; the inequality is strict unless $\varphi$ is affine or $X$ is constant.
Since $\log$ is concave, $(-\log)$ is convex:
$$E_G\left[\log\frac{g(Y)}{f(Y)}\right] = E_G\left[-\log\frac{f(Y)}{g(Y)}\right] \ge -\log E_G\left[\frac{f(Y)}{g(Y)}\right] = -\log\int_{-\infty}^{\infty}\frac{f(y)}{g(y)}\,g(y)\, dy = -\log\int_{-\infty}^{\infty} f(y)\, dy = -\log(1) = 0$$
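As a concrete illustration (added here, not part of the slides; the two normal densities are an arbitrary choice), the sketch below estimates $KL(g; f)$ by Monte Carlo and compares it with the closed-form KL divergence between two normals; both are positive and agree closely.

```python
import numpy as np

rng = np.random.default_rng(0)

# True density g = N(0, 1); approximating density f = N(1, 2^2)
mu_g, sd_g = 0.0, 1.0
mu_f, sd_f = 1.0, 2.0

def log_normal_pdf(y, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * (y - mu)**2 / sd**2

# Monte Carlo: KL(g; f) = E_G[log g(Y) - log f(Y)]
y = rng.normal(mu_g, sd_g, size=1_000_000)
kl_mc = np.mean(log_normal_pdf(y, mu_g, sd_g) - log_normal_pdf(y, mu_f, sd_f))

# Closed form for two normal densities
kl_exact = np.log(sd_f / sd_g) + (sd_g**2 + (mu_g - mu_f)**2) / (2 * sd_f**2) - 0.5

print(kl_mc, kl_exact)   # both positive and nearly equal
```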
Econ 722, Spring ’19 Lecture 2 – Slide 3 KL Divergence and Mis-specified MLE
Pseudo-true Parameter Value $\theta_0$
$$\hat\theta_{MLE} \overset{p}{\to} \theta_0 \equiv \underset{\theta\in\Theta}{\arg\min}\, KL(g; f_\theta) = \underset{\theta\in\Theta}{\arg\max}\, E_G[\log f(Y|\theta)]$$
What if $f_\theta$ is correctly specified?
If $g = f_\theta$ for some $\theta$, then $KL(g; f_\theta)$ is minimized at zero.
Goal: Compare Mis-specified Models
$$E_G[\log f(Y|\theta_0)] \quad \text{versus} \quad E_G[\log h(Y|\gamma_0)]$$
where $\theta_0$ is the pseudo-true parameter value for $f_\theta$ and $\gamma_0$ is the pseudo-true parameter value for $h_\gamma$.
Econ 722, Spring ’19 Lecture 2 – Slide 4 How to Estimate Expected Log Likelihood?
For simplicity: $Y_1, \dots, Y_T \sim \text{iid } g(y)$.
Unbiased but Infeasible
$$E_G\left[\frac{1}{T}\ell(\theta_0)\right] = E_G\left[\frac{1}{T}\sum_{t=1}^T \log f(Y_t|\theta_0)\right] = E_G[\log f(Y|\theta_0)]$$
Biased but Feasible
$T^{-1}\ell(\hat\theta_{MLE})$ is a biased estimator of $E_G[\log f(Y|\theta_0)]$.
Intuition for the Bias
$T^{-1}\ell(\hat\theta_{MLE}) > T^{-1}\ell(\theta_0)$ unless $\hat\theta_{MLE} = \theta_0$. The maximized sample log-likelihood is an overly optimistic estimator of the expected log-likelihood.
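The following small simulation (my own illustration; the normal-mean model and sample size are assumptions chosen for simplicity) makes the over-optimism concrete: on average, the in-sample average log-likelihood at the MLE exceeds the expected log-likelihood that the fitted model achieves under the true distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
T, reps = 20, 50_000

def avg_loglik(y, mu):
    # Average log density of N(mu, 1) evaluated at the points in y
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu)**2)

in_sample, out_sample = [], []
for _ in range(reps):
    y = rng.normal(0.0, 1.0, size=T)      # true theta_0 = 0
    mu_hat = y.mean()                     # MLE
    in_sample.append(avg_loglik(y, mu_hat))
    # Expected log-likelihood under g evaluated at mu_hat: E_G[log f(Y|mu_hat)]
    out_sample.append(-0.5 * np.log(2 * np.pi) - 0.5 * (1.0 + mu_hat**2))

print(np.mean(in_sample), np.mean(out_sample))   # in-sample is larger on average
```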
Econ 722, Spring ’19 Lecture 2 – Slide 5 What to do about this bias?
1. General-purpose asymptotic approximation of the "degree of over-optimism" of the maximized sample log-likelihood.
   - Takeuchi's Information Criterion (TIC)
   - Akaike's Information Criterion (AIC)
2. Problem-specific finite-sample approach, assuming $g \in \{f_\theta : \theta \in \Theta\}$.
   - Corrected AIC (AICc) of Hurvich and Tsai (1989)
Tradeoffs
TIC is the most general and makes the weakest assumptions, but requires very large $T$ to work well. AIC is a good approximation to TIC that requires less data. Both AIC and TIC perform poorly when $T$ is small relative to the number of parameters, hence AICc.
Econ 722, Spring ’19 Lecture 2 – Slide 6 Recall: Asymptotics for Mis-specified ML Estimation
Model $f(y|\theta)$, pseudo-true parameter $\theta_0$. For simplicity $Y_1, \dots, Y_T \sim \text{iid } g(y)$.
Fundamental Expansion
$$\sqrt{T}(\hat\theta - \theta_0) = J^{-1}\sqrt{T}\,\bar{U}_T + o_p(1)$$
$$J = -E_G\left[\frac{\partial^2 \log f(Y|\theta_0)}{\partial\theta\,\partial\theta'}\right], \qquad \bar{U}_T = \frac{1}{T}\sum_{t=1}^T \frac{\partial \log f(Y_t|\theta_0)}{\partial\theta}$$
Central Limit Theorem
$$\sqrt{T}\,\bar{U}_T \to_d U \sim N_p(0, K), \qquad K = \text{Var}_G\left[\frac{\partial \log f(Y|\theta_0)}{\partial\theta}\right]$$
$$\sqrt{T}(\hat\theta - \theta_0) \to_d J^{-1}U \sim N_p(0, J^{-1}KJ^{-1})$$
Information Matrix Equality
If $g = f_\theta$ for some $\theta \in \Theta$, then $K = J \implies \text{AVAR}(\hat\theta) = J^{-1}$.
Econ 722, Spring ’19 Lecture 2 – Slide 7 Bias Relative to Infeasible Plug-in Estimator
Definition of Bias Term $B$
$$B = \underbrace{\frac{1}{T}\ell(\hat\theta)}_{\substack{\text{feasible} \\ \text{overly optimistic}}} - \underbrace{\int g(y) \log f(y|\hat\theta)\, dy}_{\substack{\text{infeasible, uses the data only once} \\ \text{not overly optimistic}}}$$
Question to Answer
On average, over the sampling distribution of $\hat\theta$, how large is $B$? AIC and TIC construct an asymptotic approximation of $E[B]$.
Econ 722, Spring ’19 Lecture 2 – Slide 8 Derivation of AIC/TIC
Step 1: Taylor Expansion
$$B = \bar{Z}_T + (\hat\theta - \theta_0)'J(\hat\theta - \theta_0) + o_p(T^{-1})$$
$$\bar{Z}_T = \frac{1}{T}\sum_{t=1}^T \left\{\log f(Y_t|\theta_0) - E_G[\log f(Y|\theta_0)]\right\}$$
Step 2: $E[\bar{Z}_T] = 0$
$$E[B] \approx E\left[(\hat\theta - \theta_0)'J(\hat\theta - \theta_0)\right]$$
Step 3: $\sqrt{T}(\hat\theta - \theta_0) \to_d J^{-1}U$, so
$$T(\hat\theta - \theta_0)'J(\hat\theta - \theta_0) \to_d U'J^{-1}U$$
Econ 722, Spring ’19 Lecture 2 – Slide 9 Derivation of AIC/TIC Continued. . .
Step 3: $\sqrt{T}(\hat\theta - \theta_0) \to_d J^{-1}U$, so
$$T(\hat\theta - \theta_0)'J(\hat\theta - \theta_0) \to_d U'J^{-1}U$$
Step 4: $U \sim N_p(0, K)$
$$E[B] \approx \frac{1}{T}E\left[U'J^{-1}U\right] = \frac{1}{T}\,\text{tr}\left(J^{-1}K\right)$$
Final Result: $T^{-1}\text{tr}(J^{-1}K)$ is an asymptotically unbiased estimator of the over-optimism of $T^{-1}\ell(\hat\theta)$ relative to $\int g(y)\log f(y|\hat\theta)\, dy$.
Econ 722, Spring ’19 Lecture 2 – Slide 10 TIC and AIC
Takeuchi's Information Criterion
Multiply by $2T$ and estimate $J$, $K$:
$$\text{TIC} = 2\left[\ell(\hat\theta) - \text{tr}\left(\hat{J}^{-1}\hat{K}\right)\right]$$
Akaike's Information Criterion
If $g = f_\theta$ then $J = K$, so $\text{tr}(J^{-1}K) = p$ and
$$\text{AIC} = 2\left[\ell(\hat\theta) - p\right]$$
Contrasting AIC and TIC
Technically, AIC requires that all models under consideration be at least correctly specified while TIC doesn't. But $J^{-1}K$ is hard to estimate, and if a model is badly mis-specified, $\ell(\hat\theta)$ dominates the penalty term anyway.
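As a simple worked example (mine, not from the slides; the data-generating process and model pair are assumptions), the code below computes $\text{AIC} = 2[\ell(\hat\theta) - p]$ for two nested Gaussian models of the same data, one fixing the mean at zero and one estimating it. With the slides' sign convention, the model with the larger AIC is preferred.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.3, 1.0, size=200)    # data with a smallish nonzero mean
T = y.size

def gaussian_loglik(y, mu, sigma2):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - mu)**2 / sigma2)

# Model 1: N(0, sigma^2), one free parameter
s2_0 = np.mean(y**2)
aic_1 = 2 * (gaussian_loglik(y, 0.0, s2_0) - 1)

# Model 2: N(mu, sigma^2), two free parameters
mu_hat, s2_hat = y.mean(), y.var()
aic_2 = 2 * (gaussian_loglik(y, mu_hat, s2_hat) - 2)

print(aic_1, aic_2)   # larger AIC (in this convention) is preferred
```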
Econ 722, Spring ’19 Lecture 2 – Slide 11 Corrected AIC (AICc ) – Hurvich & Tsai (1989)
Idea Behind AICc
The asymptotic approximation used for AIC/TIC works poorly if $p$ is too large relative to $T$. Try an exact, finite-sample approach instead.
Assumption: True DGP
$$y = X\beta_0 + \varepsilon, \quad \varepsilon \sim N(0, \sigma_0^2 I_T), \quad k \text{ regressors}$$
Can Show That
$$KL(g; f) = \frac{T}{2}\left[\frac{\sigma_0^2}{\sigma_1^2} - \log\frac{\sigma_0^2}{\sigma_1^2} - 1\right] + \frac{1}{2\sigma_1^2}(\beta_0 - \beta_1)'X'X(\beta_0 - \beta_1)$$
where $f$ is a normal regression model with parameters $(\beta_1, \sigma_1^2)$ that might not be the true parameters.
Econ 722, Spring ’19 Lecture 2 – Slide 12 But how can we use this?
$$KL(g; f) = \frac{T}{2}\left[\frac{\sigma_0^2}{\sigma_1^2} - \log\frac{\sigma_0^2}{\sigma_1^2} - 1\right] + \frac{1}{2\sigma_1^2}(\beta_0 - \beta_1)'X'X(\beta_0 - \beta_1)$$
1. We would need to know $(\beta_1, \sigma_1^2)$ for the candidate model.
   - Easy: just use the MLE $(\hat\beta_1, \hat\sigma_1^2)$.
2. We would need to know $(\beta_0, \sigma_0^2)$ for the true model.
   - Very hard! The whole problem is that we don't know these!
Hurvich & Tsai (1989) Assume:
- Every candidate model is at least correctly specified.
- This implies that any candidate estimator $(\hat\beta, \hat\sigma^2)$ is consistent for the truth.
Econ 722, Spring ’19 Lecture 2 – Slide 13 Deriving the Corrected AIC
Since $(\hat\beta, \hat\sigma^2)$ are random, look at $E[\widehat{KL}]$, where
$$\widehat{KL} = \frac{T}{2}\left[\frac{\sigma_0^2}{\hat\sigma^2} - \log\frac{\sigma_0^2}{\hat\sigma^2} - 1\right] + \frac{1}{2\hat\sigma^2}(\hat\beta - \beta_0)'X'X(\hat\beta - \beta_0)$$
Finite-sample theory for the correctly specified normal regression model gives
$$E\left[\widehat{KL}\right] = \frac{T}{2}\left[\frac{T + k}{T - k - 2} - \log(\sigma_0^2) + E[\log\hat\sigma^2] - 1\right]$$
Eliminating constants and scaling, and replacing $E[\log\hat\sigma^2]$ by its unbiased estimator $\log\hat\sigma^2$, gives
$$\text{AIC}_c = \log\hat\sigma^2 + \frac{T + k}{T - k - 2}$$
a finite-sample unbiased estimator of KL (up to terms common to all models) for model comparison.
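To illustrate (my own sketch, not part of the slides; the simulated design with six candidate regressors is an assumption), the following code computes $\text{AIC}_c = \log\hat\sigma^2 + (T + k)/(T - k - 2)$ for a sequence of nested regression models and selects the one with the smallest value.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 60
X_full = rng.normal(size=(T, 6))
beta_true = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
y = X_full @ beta_true + rng.normal(size=T)

for k in range(1, 7):
    X = X_full[:, :k]                       # candidate model with the first k regressors
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_mle = resid @ resid / T          # ML variance estimate
    aicc = np.log(sigma2_mle) + (T + k) / (T - k - 2)
    print(f"k = {k}: AICc = {aicc:.3f}")    # pick the k with the smallest AICc
```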
Econ 722, Spring ’19 Lecture 2 – Slide 14 Motivation: Predict y from x via Linear Regression
$$\underset{(T\times 1)}{y} = \underset{(T\times K)}{X}\,\underset{(K\times 1)}{\beta} + \varepsilon, \qquad E[\varepsilon|X] = 0, \qquad \text{Var}(\varepsilon|X) = \sigma^2 I$$
- If $\beta$ were known, we could never achieve lower MSE than by using all regressors to predict.
- But $\beta$ is unknown, so we have to estimate it from the data $\Rightarrow$ bias-variance tradeoff.
- It could make sense to exclude regressors with small coefficients: add a small bias but reduce variance.
Econ 722, Spring ’19 Lecture 2 – Slide 15 Operationalizing the Bias-Variance Tradeoff Idea
Mallow’s Cp Approximate the predictive MSE of each model relative to the infeasible optimum in which β is known.
Notation
- Model index $m$ and regressor matrix $X_m$.
- Corresponding OLS estimator $\hat\beta_m$, padded out with zeros for the excluded regressors.
- $X\hat\beta_m = X_{(-m)}\cdot 0 + X_m(X_m'X_m)^{-1}X_m'y = P_m y$
Econ 722, Spring ’19 Lecture 2 – Slide 16 In-sample versus Out-of-sample Prediction Error
Why not compare RSS(m)?
In-sample prediction error: $\text{RSS}(m) = (y - X\hat\beta_m)'(y - X\hat\beta_m)$.
From your Problem Set RSS cannot decrease even if we add irrelevant regressors. Thus in-sample prediction error is an overly optimistic estimate of out-of-sample prediction error.
Bias-Variance Tradeoff Out-of-sample performance of full model (using all regressors) could be very poor if there is a lot of estimation uncertainty associated with regressors that aren’t very predictive.
Econ 722, Spring ’19 Lecture 2 – Slide 17 Predictive MSE of Xβbm relative to infeasible optimum X β
Step 1: Algebra
$$X\hat\beta_m - X\beta = P_m y - X\beta = P_m(y - X\beta) - (I - P_m)X\beta = P_m\varepsilon - (I - P_m)X\beta$$
Step 2: $P_m$ and $(I - P_m)$ are both symmetric and idempotent, and orthogonal to each other
$$\begin{aligned}
\left\|X\hat\beta_m - X\beta\right\|^2 &= \left\{P_m\varepsilon - (I - P_m)X\beta\right\}'\left\{P_m\varepsilon - (I - P_m)X\beta\right\} \\
&= \varepsilon'P_m'P_m\varepsilon - \beta'X'(I - P_m)'P_m\varepsilon - \varepsilon'P_m'(I - P_m)X\beta + \beta'X'(I - P_m)'(I - P_m)X\beta \\
&= \varepsilon'P_m\varepsilon + \beta'X'(I - P_m)X\beta
\end{aligned}$$
Econ 722, Spring ’19 Lecture 2 – Slide 18 Predictive MSE of Xβbm relative to infeasible optimum X β
Step 3: Expectation of Step 2, conditional on $X$
$$\begin{aligned}
\text{MSE}(m|X) &= E\left[(X\hat\beta_m - X\beta)'(X\hat\beta_m - X\beta)\,\big|\,X\right] \\
&= E\left[\varepsilon'P_m\varepsilon\,|\,X\right] + \beta'X'(I - P_m)X\beta \\
&= \text{tr}\left\{E[\varepsilon\varepsilon'|X]\,P_m\right\} + \beta'X'(I - P_m)X\beta \\
&= \sigma^2\,\text{tr}(P_m) + \beta'X'(I - P_m)X\beta \\
&= \sigma^2 k_m + \beta'X'(I - P_m)X\beta
\end{aligned}$$
where $k_m$ denotes the number of regressors in $X_m$ and
$$\text{tr}(P_m) = \text{tr}\left\{X_m(X_m'X_m)^{-1}X_m'\right\} = \text{tr}\left\{X_m'X_m(X_m'X_m)^{-1}\right\} = \text{tr}(I_{k_m}) = k_m$$
Econ 722, Spring ’19 Lecture 2 – Slide 19 Now we know the MSE of a given model. . .
$$\text{MSE}(m|X) = \sigma^2 k_m + \beta'X'(I - P_m)X\beta$$
Bias-Variance Tradeoff
- Smaller model $\Rightarrow$ $\sigma^2 k_m$ smaller: less estimation uncertainty.
- Bigger model $\Rightarrow$ $\beta'X'(I - P_m)X\beta = \|(I - P_m)X\beta\|^2$ is in general smaller: less (squared) bias.
Mallow's Cp
- Problem: the MSE formula is infeasible since it involves $\beta$ and $\sigma^2$.
- Solution: Mallow's Cp constructs an unbiased estimator.
- Idea: what about plugging in $\hat\beta$ to estimate the second term?
Econ 722, Spring ’19 Lecture 2 – Slide 20 What if we plug in βb to estimate the second term? For the missing algebra in Step 4, see the lecture notes.
Notation
Let $\hat\beta$ denote the full-model estimator and $P$ the corresponding projection matrix: $X\hat\beta = Py$.
Crucial Fact
$\text{span}(X_m)$ is a subspace of $\text{span}(X)$, so $P_m P = P P_m = P_m$.
Step 4: Algebra using the preceding fact
$$E\left[\hat\beta'X'(I - P_m)X\hat\beta\,\big|\,X\right] = \cdots = \beta'X'(I - P_m)X\beta + E\left[\varepsilon'(P - P_m)\varepsilon\,|\,X\right]$$
Econ 722, Spring ’19 Lecture 2 – Slide 21 Substituting βb doesn’t work. . .
Step 5: Use the "Trace Trick" on the second term from Step 4
$$E\left[\varepsilon'(P - P_m)\varepsilon\,|\,X\right] = E\left[\text{tr}\left\{\varepsilon'(P - P_m)\varepsilon\right\}|X\right] = \text{tr}\left\{E[\varepsilon\varepsilon'|X](P - P_m)\right\} = \sigma^2\left(\text{tr}(P) - \text{tr}(P_m)\right) = \sigma^2(K - k_m)$$
where $K$ is the total number of regressors in $X$.
Bias of Plug-in Estimator
$$E\left[\hat\beta'X'(I - P_m)X\hat\beta\,\big|\,X\right] = \underbrace{\beta'X'(I - P_m)X\beta}_{\text{Truth}} + \underbrace{\sigma^2(K - k_m)}_{\text{Bias}}$$
Econ 722, Spring ’19 Lecture 2 – Slide 22 Putting Everything Together: Mallow’s Cp
Want an Unbiased Estimator of This:
$$\text{MSE}(m|X) = \sigma^2 k_m + \beta'X'(I - P_m)X\beta$$
Previous Slide:
$$E\left[\hat\beta'X'(I - P_m)X\hat\beta\,\big|\,X\right] = \beta'X'(I - P_m)X\beta + \sigma^2(K - k_m)$$
End Result:
$$MC(m) = \hat\sigma^2 k_m + \left[\hat\beta'X'(I - P_m)X\hat\beta - \hat\sigma^2(K - k_m)\right] = \hat\beta'X'(I - P_m)X\hat\beta + \hat\sigma^2(2k_m - K)$$
is an unbiased estimator of $\text{MSE}(m|X)$, with $\hat\sigma^2 = y'(I - P)y/(T - K)$.
Econ 722, Spring ’19 Lecture 2 – Slide 23 Why is this different from the textbook formula? Just algebra, but tedious. . .
$$MC(m) - 2\hat\sigma^2 k_m = \hat\beta'X'(I - P_m)X\hat\beta - K\hat\sigma^2 = \cdots = y'(I - P_m)y - T\hat\sigma^2 = \text{RSS}(m) - T\hat\sigma^2$$
Therefore:
$$MC(m) = \text{RSS}(m) + \hat\sigma^2(2k_m - T)$$
Dividing through by $\hat\sigma^2$:
$$C_p(m) = \frac{\text{RSS}(m)}{\hat\sigma^2} + 2k_m - T$$
This tells us how to adjust RSS for the number of regressors...
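Here is a minimal sketch (my own, using simulated data; the design with six regressors is an assumption) that computes $C_p(m) = \text{RSS}(m)/\hat\sigma^2 + 2k_m - T$ for a sequence of nested models, with $\hat\sigma^2$ taken from the full model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 80, 6
X_full = rng.normal(size=(T, K))
beta_true = np.array([1.0, -0.8, 0.4, 0.0, 0.0, 0.0])
y = X_full @ beta_true + rng.normal(size=T)

def rss(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

sigma2_full = rss(X_full, y) / (T - K)   # full-model variance estimate

for k in range(1, K + 1):
    cp = rss(X_full[:, :k], y) / sigma2_full + 2 * k - T
    print(f"k = {k}: Cp = {cp:.2f}")     # choose the model with the smallest Cp
```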
Econ 722, Spring ’19 Lecture 2 – Slide 24 Lecture #3 – Model Selection II
Bayesian Model Comparison
Bayesian Information Criterion (BIC)
K-fold Cross-validation
Asymptotic Equivalence Between LOO-CV and TIC
Econ 722, Spring ’19 Lecture 3 – Slide 1 Bayesian Model Comparison: Marginal Likelihoods
Bayes' Theorem for Model $m \in \mathcal{M}$
$$\underbrace{\pi(\theta|y, m)}_{\text{Posterior}} \propto \underbrace{\pi(\theta|m)}_{\text{Prior}}\,\underbrace{f(y|\theta, m)}_{\text{Likelihood}}$$
$$\underbrace{f(y|m)}_{\text{Marginal Likelihood}} = \int_\Theta \pi(\theta|m)\,f(y|\theta, m)\, d\theta$$
Posterior Model Probability for $m \in \mathcal{M}$
$$P(m|y) = \frac{P(m)f(y|m)}{f(y)} = \frac{\int_\Theta P(m)f(y, \theta|m)\, d\theta}{f(y)} = \frac{P(m)}{f(y)}\int_\Theta \pi(\theta|m)f(y|\theta, m)\, d\theta$$
where $P(m)$ is the prior model probability and $f(y)$ is constant across models.
Econ 722, Spring ’19 Lecture 3 – Slide 2 Laplace (aka Saddlepoint) Approximation Suppress model index m for simplicity.
General Case: for $T$ large,
$$\int_\Theta g(\theta)\exp\{T\cdot h(\theta)\}\, d\theta \approx \left(\frac{2\pi}{T}\right)^{p/2}\exp\{T\cdot h(\theta_0)\}\,g(\theta_0)\,\left|H(\theta_0)\right|^{-1/2}$$
$$p = \dim(\theta), \qquad \theta_0 = \underset{\theta\in\Theta}{\arg\max}\, h(\theta), \qquad H(\theta_0) = -\left.\frac{\partial^2 h(\theta)}{\partial\theta\,\partial\theta'}\right|_{\theta = \theta_0}$$
Use to Approximate the Marginal Likelihood
$$h(\theta) = \frac{\ell(\theta)}{T} = \frac{1}{T}\sum_{t=1}^T \log f(Y_t|\theta), \qquad H(\theta) = J_T(\theta) = -\frac{1}{T}\sum_{t=1}^T \frac{\partial^2 \log f(Y_t|\theta)}{\partial\theta\,\partial\theta'}, \qquad g(\theta) = \pi(\theta)$$
and substitute $\hat\theta_{MLE}$ for $\theta_0$.
Econ 722, Spring ’19 Lecture 3 – Slide 3 Laplace Approximation to Marginal Likelihood Suppress model index m for simplicity.
$$\int_\Theta \pi(\theta)f(y|\theta)\, d\theta \approx \left(\frac{2\pi}{T}\right)^{p/2}\exp\left\{\ell(\hat\theta_{MLE})\right\}\pi(\hat\theta_{MLE})\left|J_T(\hat\theta_{MLE})\right|^{-1/2}$$
$$\ell(\theta) = \sum_{t=1}^T \log f(Y_t|\theta), \qquad H(\theta) = J_T(\theta) = -\frac{1}{T}\sum_{t=1}^T \frac{\partial^2 \log f(Y_t|\theta)}{\partial\theta\,\partial\theta'}$$
Econ 722, Spring ’19 Lecture 3 – Slide 4 Bayesian Information Criterion
$$f(y|m) = \int_\Theta \pi(\theta)f(y|\theta)\, d\theta \approx \left(\frac{2\pi}{T}\right)^{p/2}\exp\left\{\ell(\hat\theta_{MLE})\right\}\pi(\hat\theta_{MLE})\left|J_T(\hat\theta_{MLE})\right|^{-1/2}$$
Take Logs and Multiply by 2
$$2\log f(y|m) \approx \underbrace{2\ell(\hat\theta_{MLE})}_{O_p(T)} - \underbrace{p\log(T)}_{O(\log T)} + \underbrace{p\log(2\pi) + 2\log\pi(\hat\theta) - \log\left|J_T(\hat\theta)\right|}_{O_p(1)}$$
The BIC
Assume a uniform prior over models and ignore lower-order terms:
$$\text{BIC}(m) = 2\log f(y|\hat\theta, m) - p_m\log(T)$$
a large-sample frequentist approximation to the Bayesian marginal likelihood.
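To see the approximation at work, the sketch below (my own example, not from the slides; the conjugate normal-mean model and the prior variance tau2 are assumptions) compares $2\log f(y|m)$ computed exactly for $Y_t \sim N(\theta, 1)$ with prior $\theta \sim N(0, \tau^2)$ against $\text{BIC} = 2\ell(\hat\theta) - p\log T$. The two differ by an $O_p(1)$ term, as the expansion above indicates.

```python
import numpy as np

rng = np.random.default_rng(0)
T, tau2 = 500, 1.0
y = rng.normal(0.5, 1.0, size=T)

# Exact marginal likelihood: y ~ N(0, I + tau2 * 11')
Sigma = np.eye(T) + tau2 * np.ones((T, T))
sign, logdet = np.linalg.slogdet(Sigma)
log_marginal = -0.5 * (T * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(Sigma, y))

# BIC: p = 1 free parameter (the mean); the variance is known and equal to 1
loglik_at_mle = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - y.mean())**2)
bic = 2 * loglik_at_mle - 1 * np.log(T)

print(2 * log_marginal, bic)   # close, up to an O_p(1) remainder
```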
Econ 722, Spring ’19 Lecture 3 – Slide 5 Model Selection using a Hold-out Sample
- The real problem is double use of the data: first for estimation, then for model comparison.
- The maximized sample log-likelihood is an overly optimistic estimate of the expected log-likelihood, and hence of KL divergence.
- In-sample squared prediction error is an overly optimistic estimator of out-of-sample squared prediction error.
- AIC/TIC, AICc, BIC, and Cp penalize the sample log-likelihood or RSS to compensate.
- Another idea: don't re-use the same data!
Econ 722, Spring ’19 Lecture 3 – Slide 6 Hold-out Sample: Partition the Full Dataset
[Diagram: the Full Dataset is split into a Training Set, used to estimate the models $\hat\theta_1, \dots, \hat\theta_M$, and a Test Set, used to evaluate the fit of $\hat\theta_1, \dots, \hat\theta_M$.]
Unfortunately this is extremely wasteful of data. . .
Econ 722, Spring ’19 Lecture 3 – Slide 7 K-fold Cross-Validation: “Pseudo-out-of-sample”
[Diagram: the Full Dataset is partitioned into Fold 1, Fold 2, ..., Fold K.]
Step 1
Randomly partition the full dataset into K folds of approximately equal size.
Step 2
Treat the kth fold as a hold-out sample and estimate the model using all observations except those in fold k, yielding the estimator $\hat\theta_{(-k)}$.
Econ 722, Spring ’19 Lecture 3 – Slide 8 K-fold Cross-Validation: “Pseudo-out-of-sample”
Step 2
Treat the kth fold as a hold-out sample and estimate the model using all observations except those in fold k, yielding the estimator $\hat\theta_{(-k)}$.
Step 3
Repeat Step 2 for each k = 1, ..., K.
Step 4
For each $t$, calculate the prediction $\hat{y}_t^{-k(t)}$ of $y_t$ based on $\hat\theta_{(-k(t))}$, the estimator that excluded observation $t$.
Econ 722, Spring ’19 Lecture 3 – Slide 9 K-fold Cross-Validation: “Pseudo-out-of-sample”
Step 4
For each $t$, calculate the prediction $\hat{y}_t^{-k(t)}$ of $y_t$ based on $\hat\theta_{(-k(t))}$, the estimator that excluded observation $t$.
Step 5
Define $CV_K = \frac{1}{T}\sum_{t=1}^T L\big(y_t, \hat{y}_t^{-k(t)}\big)$, where $L$ is a loss function.
Step 6
Repeat for each model and choose $m$ to minimize $CV_K(m)$.
CV uses each observation for parameter estimation and model evaluation, but never at the same time!
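A minimal K-fold cross-validation sketch for choosing among nested regression models under squared error loss (my own illustration; the fold assignment, data, and candidate models are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, K_folds = 90, 5
X_full = rng.normal(size=(T, 4))
y = X_full @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=T)

# Random partition of the T observations into K folds of equal size
folds = rng.permutation(np.repeat(np.arange(K_folds), T // K_folds))

def cv_score(X, y, folds):
    errors = np.empty_like(y)
    for k in np.unique(folds):
        train, test = folds != k, folds == k
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors[test] = y[test] - X[test] @ b    # prediction errors for held-out fold
    return np.mean(errors**2)

for k in range(1, 5):
    print(f"model with {k} regressors: CV = {cv_score(X_full[:, :k], y, folds):.3f}")
```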
Econ 722, Spring ’19 Lecture 3 – Slide 10 Cross-Validation (CV): Some Details
Which Loss Function?
- For regression, squared error loss makes sense.
- For classification (discrete prediction) we could use zero-one loss.
- We can also use log-likelihood/KL divergence as a loss function...
How Many Folds?
- One extreme: K = 2. Closest to the Training/Test idea.
- Other extreme: K = T, leave-one-out CV (LOO-CV).
- Computationally expensive model $\Rightarrow$ may prefer fewer folds.
- If your model is a linear smoother there's a computational trick that makes LOO-CV extremely fast. (Problem Set)
- Asymptotic properties are related to K...
Econ 722, Spring ’19 Lecture 3 – Slide 11 Relationship between LOO-CV and TIC
Theorem LOO-CV using KL-divergence as the loss function is asymptotically equivalent to TIC but doesn’t require us to estimate the Hessian and variance of the score.
Econ 722, Spring ’19 Lecture 3 – Slide 12 Large-sample Equivalence of LOO-CV and TIC
Notation and Assumptions
For simplicity let $Y_1, \dots, Y_T \sim$ iid. Let $\hat\theta_{(t)}$ be the maximum likelihood estimator based on all observations except $t$, and let $\hat\theta$ be the full-sample estimator.
Log-likelihood as "Loss"
$CV_1 = \frac{1}{T}\sum_{t=1}^T \log f\big(y_t|\hat\theta_{(t)}\big)$, but since minimizing KL = maximizing the expected log-likelihood, we choose the model with the highest $CV_1(m)$.
Econ 722, Spring ’19 Lecture 3 – Slide 13 Overview of the Proof