
Statistics 203: Introduction to Regression and Analysis of Variance

Jonathan Taylor

Today’s class

■ Weighted regression
  ● Heteroskedasticity
  ● MLE for one sample problem
  ● Weighted least squares
  ● Estimating σ²
  ● Weighted regression example
■ Robust methods
  ● Example
  ● M-estimators
  ● Huber’s ψ, Hampel’s ψ, Tukey’s ψ
  ● Solving for β̂
  ● Iteratively reweighted least squares (IRLS)
  ● Robust estimate of scale
  ● Other resistant fitting methods
  ● Why not always use robust regression?

Heteroskedasticity

■ In our standard model, we have assumed that

    ε ∼ N(0, σ²I).

  That is, the errors are independent and have the same variance (homoskedastic).
■ We have discussed graphical checks for non-constant variance (heteroskedasticity), but not “remedies” for heteroskedasticity.
■ Suppose instead that

    ε ∼ N(0, σ²D)

  for some known diagonal matrix D.
■ Where does D come from? If we see that the variance increases like f(X_j), then we might choose D_i = f(X_{ij}).
■ What is the “maximum likelihood” thing to do?
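
As a concrete illustration (not from the original slides), here is a minimal R sketch of data with this structure, assuming a single predictor and D_i = X_i, so the error variance grows linearly in x:

    ## Simulate heteroskedastic errors with Var(eps_i) = sigma^2 * x_i, i.e. D = diag(x).
    ## All names here (n, x, y) are illustrative.
    set.seed(203)
    n <- 100
    x <- runif(n, 1, 10)
    y <- 2 + 3 * x + rnorm(n, sd = sqrt(x))  # sigma = 1, D_i = x_i
    plot(x, y)                               # the spread increases with x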

MLE for one sample problem

■ Consider the simpler problem

    Y_i ∼ N(µ, σ²D_i),  i = 1, …, n,

  with σ² and the D_i’s known.
■ The negative log-likelihood is

    −2 log L(µ | Y, σ) = Σ_{i=1}^n (Y_i − µ)² / (σ²D_i) + n log(2πσ²) + Σ_{i=1}^n log(D_i).

■ Differentiating and setting the derivative to zero,

    −2 Σ_{i=1}^n (Y_i − µ̂) / (σ²D_i) = 0,

  implying

    µ̂ = ( Σ_{i=1}^n Y_i/D_i ) / ( Σ_{i=1}^n 1/D_i ).

■ Observations are weighted inversely proportional to their variance.
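
In R, µ̂ is just a weighted mean with weights 1/D_i; a quick check with made-up variances (illustrative values, not from the slides):

    ## mu-hat = (sum Y_i/D_i) / (sum 1/D_i) is a weighted mean with weights 1/D_i.
    D <- c(1, 1, 4, 9)            # known relative variances (illustrative)
    Y <- c(10.2, 9.8, 11.0, 14.0)
    sum(Y / D) / sum(1 / D)       # the MLE from the slide
    weighted.mean(Y, w = 1 / D)   # identical, via base R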

Weighted least squares

■ For the regression model,

    −2 log L(β, σ | Y, D) = Σ_{i=1}^n (Y_i − β_0 − Σ_{j=1}^{p−1} β_j X_{ij})² / (σ²D_i)
                          = (1/σ²) (Y − Xβ)ᵀ D⁻¹ (Y − Xβ)
                          = (1/σ²) (Y − Xβ)ᵀ W (Y − Xβ),

  with W = D⁻¹.
■ Normal equations:

    −2 Xᵀ W (Y − Xβ̂_W) = 0,

  or

    β̂_W = (Xᵀ W X)⁻¹ Xᵀ W Y.

■ Distribution of β̂_W:

    β̂_W ∼ N(β, σ² (Xᵀ W X)⁻¹).
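
A minimal sketch of the weighted fit in R, reusing the simulated heteroskedastic data from above so that w_i = 1/x_i; lm’s weights argument implements exactly this estimator:

    ## Weighted least squares two ways, assuming Var(eps_i) = sigma^2 * x_i so w_i = 1/x_i.
    set.seed(203)
    x <- runif(100, 1, 10)
    y <- 2 + 3 * x + rnorm(100, sd = sqrt(x))
    w <- 1 / x
    coef(lm(y ~ x, weights = w))                 # R's built-in weighted fit
    X <- cbind(1, x)                             # design matrix with intercept
    drop(solve(t(X * w) %*% X, t(X * w) %*% y))  # (X^T W X)^{-1} X^T W Y by hand; matches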

Estimating σ²

■ What are the right residuals?
■ If we knew β exactly, then

    Y_i − β_0 − Σ_{j=1}^{p−1} X_{ij}β_j ∼ N(0, σ²/W_i).

■ This suggests that the natural residual is

    e_{W,i} = √W_i (Y_i − Ŷ_{W,i}) = √W_i e_i,

  where

    Ŷ_W = X (Xᵀ W X)⁻¹ Xᵀ W Y.

■ Estimate of σ²:

    σ̂²_W = (1/(n−p)) Σ_{i=1}^n e²_{W,i} = (1/(n−p)) Σ_{i=1}^n W_i e_i² ∼ σ² χ²_{n−p} / (n−p).
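
Continuing the sketch above, the weighted residuals and σ̂²_W in R; for a weighted lm fit, summary(fit)$sigma² computes the same quantity:

    ## sigma-hat^2_W = sum(W_i e_i^2) / (n - p), on the simulated data from the WLS sketch.
    set.seed(203)
    x <- runif(100, 1, 10)
    y <- 2 + 3 * x + rnorm(100, sd = sqrt(x))
    w <- 1 / x
    fit <- lm(y ~ x, weights = w)
    e <- residuals(fit)                # unweighted residuals e_i
    sum(w * e^2) / (length(y) - 2)     # n - p with p = 2 here
    summary(fit)$sigma^2               # agrees with R's estimate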

Weighted regression example

[Figure: weighted regression example]

Robust methods

■ We also discussed outlier detection, but no specific remedies.
■ One alternative is to discard potential outliers – not always a good idea.
■ Outliers can really mess up the sample mean, but have relatively little effect on the sample median.
■ Could also “downweight” outliers: the basis of robust techniques.
■ Another “cause” of outliers may be that the error distribution is not really normal.

Example

■ Suppose that we have a sample (Y_i)_{1≤i≤n} from the double exponential (Laplace) density

    f(y | µ, σ) = (1/(2σ)) e^{−|y−µ|/σ}.

  This has heavier tails than the normal distribution.
■ MLE for µ:

    µ̂ = argmin_µ Σ_{i=1}^n |Y_i − µ|.

■ It can be shown that µ̂ is the sample median (exercise in STATS116).
■ Take home message: if the errors are not really normally distributed, then least squares is not the MLE, and the MLE downweights large residuals relative to least squares.
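
A quick numerical check (illustrative, not from the slides) that the minimizer of Σ|Y_i − µ| is the sample median:

    ## argmin over mu of sum |Y_i - mu| equals the sample median (unique here: n is odd).
    set.seed(203)
    Y <- rexp(99)
    opt <- optimize(function(m) sum(abs(Y - m)), range(Y))
    c(opt$minimum, median(Y))   # essentially equal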

M-estimators

■ Depending on the error distribution of ε (assuming i.i.d. errors) we get different optimization problems.
■ Suppose that

    f(y | µ, s) ∝ e^{−ρ((y−µ)/s)}.

■ MLE:

    µ̂ = argmin_µ Σ_{i=1}^n ρ( (Y_i − µ)/s ).

■ For every ρ, we get a different estimator: an M-estimator. Let ψ = ρ′.
■ This generalizes easily to the regression problem:

    β̂ = argmin_β Σ_{i=1}^n ρ( (Y_i − β_0 − Σ_{j=1}^{p−1} X_{ij}β_j) / s ).
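
In R, this regression M-estimator is what MASS::rlm fits (Huber’s ψ, plotted on the next slide, is its default); a minimal example on the standard stackloss dataset:

    ## Regression M-estimation via IRLS; psi = psi.huber is rlm's default.
    library(MASS)
    fit <- rlm(stack.loss ~ ., data = stackloss)
    coef(fit)
    head(fit$w)   # the final weights W_i; small values flag downweighted points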

Huber’s ψ

[Figure: plot of ψ(x) = psi.huber(x) · x for x ∈ (−4, 4).]

Hampel’s ψ

[Figure: plot of ψ(x) = psi.hampel(x) · x for x ∈ (−10, 10).]

Tukey’s ψ

[Figure: plot of ψ(x) = psi.bisquare(x) · x for x ∈ (−4, 4).]
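
The three plots can be reproduced with MASS’s weight functions; note that psi.huber and friends return ψ(x)/x (the IRLS weights), so ψ itself is psi.*(x) · x, exactly as in the axis labels above:

    ## Reproduce the three psi plots.
    library(MASS)
    par(mfrow = c(1, 3))
    curve(psi.huber(x) * x,    -4, 4,   ylab = "psi.huber(x) * x")
    curve(psi.hampel(x) * x,   -10, 10, ylab = "psi.hampel(x) * x")
    curve(psi.bisquare(x) * x, -4, 4,   ylab = "psi.bisquare(x) * x")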

Solving for β̂

■ Assume for now that s is known in the M-estimator.
■ Normal equations (with X_{i0} = 1 for the intercept):

    Σ_{i=1}^n X_{ij} ψ( (Y_i − Σ_{k=0}^{p−1} X_{ik}β̂_k) / s ) = 0,  0 ≤ j ≤ p − 1,

  where ψ = ρ′.
■ Set

    W_i = ψ(e_i/s) / e_i,  where e_i = Y_i − Σ_{k=0}^{p−1} X_{ik}β̂_k.

■ Then

    Σ_{i=1}^n X_{ij} W_i (Y_i − Σ_{k=0}^{p−1} X_{ik}β̂_k) = 0,  0 ≤ j ≤ p − 1,

  or, in matrix form with W = diag(W_i),

    Xᵀ W X β̂ − Xᵀ W Y = 0.

  Since W depends on β̂, this is not a closed-form solution.

Iteratively reweighted least squares (IRLS)

■ Although we are trying to estimate β above, given an initial estimate β̂^(0) we can compute initial weights

    W_i^(0) = ψ( (Y_i − Σ_{j=0}^{p−1} X_{ij}β̂_j^(0)) / s ) / ( Y_i − Σ_{j=0}^{p−1} X_{ij}β̂_j^(0) )

  and solve for β̂^(1):

    β̂^(1) = (Xᵀ W^(0) X)⁻¹ Xᵀ W^(0) Y.

■ Now we can recompute the weights, and re-estimate β.
■ In general, given weights W^(j) we can solve

    β̂^(j+1) = (Xᵀ W^(j) X)⁻¹ Xᵀ W^(j) Y.

■ This is very similar to a Newton–Raphson technique.
■ Used over and over again in statistics.
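
A minimal IRLS sketch for Huber’s ψ with known scale s (names like irls_huber and max_iter are illustrative, not from the slides; MASS::rlm is the production version):

    ## IRLS for Huber's psi: psi(u) = u for |u| <= k, k*sign(u) otherwise,
    ## so the weight psi(u)/u is min(1, k/|u|).
    irls_huber <- function(X, y, s, k = 1.345, max_iter = 50, tol = 1e-8) {
      beta <- solve(t(X) %*% X, t(X) %*% y)      # start from least squares
      for (it in 1:max_iter) {
        u <- as.vector(y - X %*% beta) / s       # scaled residuals
        w <- pmin(1, k / abs(u))                 # W_i = psi(u_i)/u_i
        XtW <- t(X * w)                          # X^T W with W = diag(w)
        beta_new <- solve(XtW %*% X, XtW %*% y)  # weighted least squares step
        if (max(abs(beta_new - beta)) < tol) break
        beta <- beta_new
      }
      beta_new
    }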

Robust estimate of scale

■ In general s needs to be estimated. A popular estimate is

    ŝ = MAD(e_1, …, e_n) / 0.6745,

  where

    MAD(e_1, …, e_n) = Median |e_i − Median(e_1, …, e_n)|.

■ Scale can be estimated at each stage of the IRLS procedure based on the current residuals, or based on some “very resistant” fit (LMS or LTS below).
■ The constant 0.6745 is chosen so that ŝ is asymptotically unbiased for σ if the e_i’s are N(0, σ²).
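
Base R’s mad() already rescales by 1.4826 ≈ 1/0.6745, so it computes exactly this estimate:

    ## mad() defaults to constant = 1.4826, approximately 1/0.6745.
    set.seed(203)
    e <- rnorm(1000)
    mad(e)                               # base R's rescaled MAD
    median(abs(e - median(e))) / 0.6745  # the estimate on the slide; agrees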

Other resistant fitting methods

■ If ψ is not redescending (i.e. does not return to 0), then robust regression is still susceptible to outliers.
■ Least median of squares (LMS):

    β̂ = argmin_β Median( (Y_i − Σ_{j=0}^{p−1} X_{ij}β_j)² ).

■ Least trimmed squares (LTS): fix some q < n and set

    β̂ = argmin_β Σ_{i=1}^q ( (Y_i − Σ_{j=0}^{p−1} X_{ij}β_j)² )_{(i)}.

  Here ·_{(i)} represents “order statistics”: the sum runs over the q smallest squared residuals.
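
Both fits are available in R through MASS::lqs, shown here on the stackloss data used earlier (an illustrative choice):

    ## Highly resistant fits via MASS::lqs.
    library(MASS)
    lqs(stack.loss ~ ., data = stackloss, method = "lms")  # least median of squares
    lqs(stack.loss ~ ., data = stackloss, method = "lts")  # least trimmed squares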

Why not always use robust regression?

■ Inference: it seems reasonable to treat the estimated covariance of β̂ as

    σ̂² (Xᵀ W X)⁻¹,  where σ̂² = Σ_{i=1}^n W_i e_i² / (n − p).

  What about degrees of freedom?
■ R uses a different estimate.
■ Efficiency: this robustness comes at a cost. Even asymptotically, confidence intervals for robust estimates are wider than those for least squares (if the least squares model is applicable).
■ Asymptotic Relative Efficiency:

    ARE(β̂_LS, β̂_robust) = lim_{n→∞} Var(β̂_LS) / Var(β̂_robust) < 1.

■ Biggest advantage: robust estimators can have a higher breakdown point. The median has 50% breakdown, the sample mean has 0%.
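
As a small illustration of the efficiency cost (not from the slides): under normal errors the median’s asymptotic efficiency relative to the mean is 2/π ≈ 0.64, which a quick simulation reproduces:

    ## ARE of the median vs. the mean under N(0,1) errors: about 2/pi = 0.637.
    set.seed(203)
    sims <- replicate(5000, {x <- rnorm(100); c(mean = mean(x), med = median(x))})
    var(sims["mean", ]) / var(sims["med", ])   # close to 2/pi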