Robust estimation, efficiency, and Lasso debiasing

Po-Ling Loh

University of Wisconsin–Madison, Departments of ECE & Statistics

WHOA-PSI workshop, Washington University in St. Louis

Aug 12, 2017

High-dimensional linear regression

[Figure: the n × 1 response vector y equals the n × p design matrix X times the p × 1 coefficient vector, plus noise.]

Linear model:
$$y_i = x_i^T \beta^* + \epsilon_i, \qquad i = 1, \dots, n$$

When $p \gg n$, assume sparsity: $\|\beta^*\|_0 \le k$
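As a concrete reference point for this setting, here is a minimal simulation sketch of a sparse linear model with heavy-tailed errors; the dimensions, sparsity level, and the t-distributed noise are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 100, 500, 5                       # assumed sizes: p >> n, k-sparse signal
X = rng.standard_normal((n, p))             # design matrix
beta_star = np.zeros(p)
beta_star[:k] = rng.uniform(1.0, 2.0, k)    # k nonzero coefficients
eps = rng.standard_t(df=2, size=n)          # heavy-tailed (non-sub-Gaussian) errors
y = X @ beta_star + eps                     # y_i = x_i^T beta* + eps_i
```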

“Robust” M-estimators

Generalization of OLS suitable for heavy-tailed/contaminated errors:
$$\widehat{\beta} \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) \right\}$$

Extensive theory (consistency, asymptotic normality) for p fixed, $n \to \infty$

[Figures (Patrick Breheny, BST 764: Applied Statistical Modeling): loss functions (least squares, absolute value, Huber, Tukey) plotted against the residual; Belgian phone calls data (millions of calls vs. year, 1950–1970), linear vs. robust regression fits.]
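To make the generic loss $\ell$ concrete, here is a small sketch (not from the slides) of the Huber loss and a low-dimensional robust fit by direct minimization; the tuning constant delta and the use of scipy.optimize.minimize are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(r, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    absr = np.abs(r)
    return np.where(absr <= delta, 0.5 * r**2, delta * absr - 0.5 * delta**2)

def m_estimator(X, y, delta=1.345):
    """Unpenalized robust M-estimator: minimize (1/n) sum_i ell(x_i^T beta - y_i)."""
    n, p = X.shape
    obj = lambda b: np.mean(huber_loss(X @ b - y, delta))
    res = minimize(obj, x0=np.zeros(p), method="BFGS")
    return res.x

# Example in the classical regime p << n, where the M-estimator is well-defined
rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_star = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_star + rng.standard_t(df=2, size=n)   # heavy-tailed errors
print(m_estimator(X, y))
```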

High-dimensional M-estimators

Natural idea: For p > n, use a regularized version:
$$\widehat{\beta} \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \lambda \|\beta\|_1 \right\}$$

Complications: Optimization for nonconvex $\ell$? Statistical theory? Are certain losses provably better than others?

Some statistical theory

When $\|\ell'\|_\infty < C$, global optima of the high-dimensional M-estimator satisfy
$$\|\widehat{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}},$$
regardless of the distribution of $\epsilon_i$

Compare to Lasso theory: requires sub-Gaussian $\epsilon_i$'s

If $\ell(u)$ is locally convex/smooth for $|u| \le r$, any local optima within radius $cr$ of $\beta^*$ satisfy
$$\|\widetilde{\beta} - \beta^*\|_2 \le C' \sqrt{\frac{k \log p}{n}}$$

Some optimization theory

[Figure: the local region of radius r around β*, with stationary points lying within O(√(k log p / n)) of β*.]

Local optima may be obtained via a two-step algorithm

Algorithm
1. Run composite gradient descent on a convex, robust loss + $\ell_1$-penalty until convergence; output $\widehat{\beta}_H$
2. Run composite gradient descent on the nonconvex, robust loss + $\mu$-amenable penalty, with input $\beta^0 = \widehat{\beta}_H$
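Step 1 can be run with plain composite (proximal) gradient updates. Below is a minimal sketch of one such update for the Huber loss with an $\ell_1$ penalty; the step size, Huber parameter, and the omission of the $\ell_1$-ball constraint are illustrative choices, not the slides' prescription.

```python
import numpy as np

def huber_grad(r, delta=1.345):
    """Derivative l'(r) of the Huber loss: r clipped to [-delta, delta]."""
    return np.clip(r, -delta, delta)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (the l1 step of composite gradient descent)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def composite_gradient_step(beta, X, y, lam, eta=0.1, delta=1.345):
    """One update for (1/n) sum_i l(x_i^T beta - y_i) + lam * ||beta||_1."""
    n = X.shape[0]
    grad = X.T @ huber_grad(X @ beta - y, delta) / n
    return soft_threshold(beta - eta * grad, eta * lam)
```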

Motivating calculation

Lasso analysis (e.g., van de Geer '07, Bickel et al. '08):
$$\widehat{\beta} \in \arg\min_{\beta} \Big\{ \underbrace{\frac{1}{n} \|y - X\beta\|_2^2}_{\mathcal{L}_n(\beta)} + \lambda \|\beta\|_1 \Big\}$$

Rearranging the basic inequality $\mathcal{L}_n(\widehat{\beta}) \le \mathcal{L}_n(\beta^*)$ and assuming $\lambda \ge 2 \left\| \frac{X^T \epsilon}{n} \right\|_\infty$, obtain
$$\|\widehat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}$$

Sub-Gaussian assumptions on the $x_i$'s and $\epsilon_i$'s provide $O\!\left(\sqrt{\frac{k \log p}{n}}\right)$ bounds, minimax optimal
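For completeness, here is a sketch of one standard version of the basic-inequality argument (constants and the exact cone/RE step vary across references). Optimality of $\widehat{\beta}$ for the penalized objective gives
$$\frac{1}{n}\|y - X\widehat{\beta}\|_2^2 + \lambda\|\widehat{\beta}\|_1 \le \frac{1}{n}\|y - X\beta^*\|_2^2 + \lambda\|\beta^*\|_1.$$
Writing $y = X\beta^* + \epsilon$ and $\Delta = \widehat{\beta} - \beta^*$, this rearranges to
$$\frac{1}{n}\|X\Delta\|_2^2 \le \frac{2}{n}\epsilon^T X\Delta + \lambda\big(\|\beta^*\|_1 - \|\widehat{\beta}\|_1\big) \le 2\Big\|\frac{X^T\epsilon}{n}\Big\|_\infty \|\Delta\|_1 + \lambda\|\Delta\|_1 \le 2\lambda \|\Delta\|_1,$$
using $\lambda \ge 2\|X^T\epsilon/n\|_\infty$. The same choice of $\lambda$ also forces $\Delta$ into a cone where $\|\Delta\|_1 \lesssim \sqrt{k}\,\|\Delta\|_2$, so a restricted eigenvalue bound $\frac{1}{n}\|X\Delta\|_2^2 \ge \alpha\|\Delta\|_2^2$ yields $\|\Delta\|_2 \lesssim \lambda\sqrt{k}/\alpha$.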

Motivating calculation

Key observation: For a general loss function, if $\lambda \ge 2 \left\| \frac{X^T \ell'(\epsilon)}{n} \right\|_\infty$, obtain
$$\|\widehat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}$$

$\ell'(\epsilon)$ is sub-Gaussian whenever $\ell'$ is bounded

$\implies$ can achieve estimation error
$$\|\widehat{\beta} - \beta^*\|_2 \le c \sqrt{\frac{k \log p}{n}},$$
without assuming $\epsilon_i$ is sub-Gaussian
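A quick numerical illustration of why a bounded $\ell'$ helps (a sketch with arbitrary simulation settings, not from the slides): under t-distributed errors, the squared-loss score $\|X^T \epsilon / n\|_\infty$ fluctuates much more than the Huber score $\|X^T \ell'(\epsilon) / n\|_\infty$, and this sup-norm is exactly what the choice of $\lambda$ must dominate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 1000

def huber_deriv(r, delta=1.345):
    return np.clip(r, -delta, delta)      # bounded derivative of the Huber loss

sup_sq, sup_huber = [], []
for _ in range(200):                      # Monte Carlo over designs and errors
    X = rng.standard_normal((n, p))
    eps = rng.standard_t(df=2, size=n)    # heavy-tailed errors
    sup_sq.append(np.max(np.abs(X.T @ eps)) / n)                   # ||X^T eps / n||_inf
    sup_huber.append(np.max(np.abs(X.T @ huber_deriv(eps))) / n)   # ||X^T l'(eps) / n||_inf

print("median sup-norm, squared loss:", np.median(sup_sq))
print("median sup-norm, Huber loss:  ", np.median(sup_huber))
```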

Technical challenges

Lasso analysis also requires verifying a restricted eigenvalue (RE) condition on the design matrix, which is more complicated for general $\ell$
Addressed by local curvature of robust losses around the origin

[Figure (Patrick Breheny, BST 764: Applied Statistical Modeling): loss functions (least squares, absolute value, Huber, Tukey) plotted against the residual.]

When $\ell$ is nonconvex, local optima $\widetilde{\beta}$ may exist that are not global optima
Addressed by theoretical analysis of $\|\widetilde{\beta} - \beta^*\|_2$ and derivation of suitable optimization algorithms

Related work: Nonconvex regularized M-estimators

Composite objective function:
$$\widehat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \mathcal{L}_n(\beta) + \sum_{j=1}^p \rho_\lambda(\beta_j) \right\}$$

Assumptions:
1. $\mathcal{L}_n$ satisfies restricted strong convexity with curvature $\alpha$ (Negahban et al. '12)
2. $\rho_\lambda$ has bounded subgradient at 0, and $\rho_\lambda(t) + \frac{\mu}{2} t^2$ is convex ($\mu$-amenability), with $\alpha > \mu$
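For concreteness, one commonly used $\mu$-amenable regularizer is the SCAD penalty; the sketch below (with the conventional choice a = 3.7) is illustrative and not taken from the slides. Here $\rho_\lambda(t) + \frac{t^2}{2(a-1)}$ is convex, so $\mu = \frac{1}{a-1}$.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty rho_lambda(t); mu-amenable with mu = 1/(a-1)."""
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
            (a + 1) * lam**2 / 2,
        ),
    )
```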

Stationary points (L. & Wainwright '15)

Stationary points are statistically indistinguishable from global optima

Stationary points $\widetilde{\beta}$ satisfy the first-order condition
$$\langle \nabla \mathcal{L}_n(\widetilde{\beta}) + \nabla \rho_\lambda(\widetilde{\beta}), \; \beta - \widetilde{\beta} \rangle \ge 0, \qquad \forall \beta \text{ feasible}$$

Under suitable distributional assumptions, for $\lambda \asymp \sqrt{\frac{\log p}{n}}$ and $R \asymp \frac{1}{\lambda}$,
$$\|\widetilde{\beta} - \beta^*\|_2 \le c \sqrt{\frac{k \log p}{n}} \approx \text{statistical error}$$

[Figure: all stationary points lie within O(√(k log p / n)) of β*.]

Mathematical statement

Theorem (L. & Wainwright '15)
Suppose R is chosen s.t. $\beta^*$ is feasible, and $\lambda$ satisfies
$$\max\left\{ \|\nabla \mathcal{L}_n(\beta^*)\|_\infty, \; \alpha \sqrt{\frac{\log p}{n}} \right\} \;\lesssim\; \lambda \;\lesssim\; \frac{\alpha}{R}.$$
For $n \ge \frac{C \tau^2}{\alpha^2} R^2 \log p$, any stationary point $\widetilde{\beta}$ satisfies
$$\|\widetilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}, \qquad \text{where } k = \|\beta^*\|_0.$$

New ingredient for the robust setting: $\ell$ is convex only in a local region $\implies$ need for local consistency results

Local RSC condition

Local RSC condition: For $\Delta := \beta_1 - \beta_2$,
$$\langle \nabla \mathcal{L}_n(\beta_1) - \nabla \mathcal{L}_n(\beta_2), \; \Delta \rangle \ge \alpha \|\Delta\|_2^2 - \tau \frac{\log p}{n} \|\Delta\|_1^2, \qquad \forall \, \|\beta_j - \beta^*\|_2 \le r$$

How is such a result possible?

[Figure: surface plot of a loss with directions of both positive and negative curvature. Negative directions are forbidden by the regularizer.]

Only requires restricted curvature within a constant-radius region around $\beta^*$

Consistency of local stationary points

[Figure: stationary points within radius r of β* lie within O(√(k log p / n)) of β*.]

Theorem (L. '17)
Suppose $\mathcal{L}_n$ satisfies the $\alpha$-local RSC condition and $\rho_\lambda$ is $\mu$-amenable, with $\alpha > \mu$. Suppose $\|\ell'\|_\infty \le C$ and $\lambda \asymp \sqrt{\frac{\log p}{n}}$. For $n \gtrsim \frac{\tau}{\alpha - \mu} \, k \log p$, any stationary point $\widetilde{\beta}$ s.t. $\|\widetilde{\beta} - \beta^*\|_2 \le r$ satisfies
$$\|\widetilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}.$$

Optimization theory

Question: How to obtain sufficiently close local solutions?

Goal: For the regularized M-estimator
$$\widehat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \right\},$$
where $\ell$ satisfies the $\alpha$-local RSC condition, find a stationary point such that $\|\widetilde{\beta} - \beta^*\|_2 \le r$

Wisdom from Huber

"Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ." — Huber 1981, pp. 191–192

Two-step algorithm (L. '17)

Two-step M-estimator: finds local stationary points of the nonconvex, robust loss + $\mu$-amenable penalty
$$\widehat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \right\}$$

Algorithm
1. Run composite gradient descent on a convex, robust loss + $\ell_1$-penalty until convergence; output $\widehat{\beta}_H$
2. Run composite gradient descent on the nonconvex, robust loss + $\mu$-amenable penalty, with input $\beta^0 = \widehat{\beta}_H$

Note: Want to optimize the original nonconvex objective, since it leads to more efficient (lower-variance) estimators
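A minimal sketch of the two-step scheme, assuming a Huber loss for the convex stage and a SCAD penalty handled by splitting it into its $\ell_1$ part plus a smooth concave part (one standard way to run composite gradient descent on such penalties). Step sizes, iteration counts, the simulation design, and the omission of the $\ell_1$-ball side constraint are arbitrary choices, not the slides' prescription.

```python
import numpy as np

def huber_deriv(r, delta=1.345):
    return np.clip(r, -delta, delta)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def scad_concave_grad(t, lam, a=3.7):
    """Gradient of q(t) = rho_SCAD(t) - lam*|t|, the smooth concave part of SCAD."""
    s, at = np.sign(t), np.abs(t)
    return np.where(at <= lam, 0.0,
                    np.where(at <= a * lam, s * (lam - at) / (a - 1.0), -lam * s))

def two_step_estimator(X, y, lam, eta=0.1, delta=1.345, a=3.7,
                       iters1=500, iters2=100):
    """Stage 1: Huber + l1 (convex). Stage 2: warm-started Huber + SCAD (nonconvex)."""
    n, p = X.shape
    beta = np.zeros(p)
    # Stage 1: proximal gradient descent on the convex objective
    for _ in range(iters1):
        grad = X.T @ huber_deriv(X @ beta - y, delta) / n
        beta = soft_threshold(beta - eta * grad, eta * lam)
    # Stage 2: composite gradient descent on the nonconvex objective, warm start beta_H
    for _ in range(iters2):
        grad = X.T @ huber_deriv(X @ beta - y, delta) / n + scad_concave_grad(beta, lam, a)
        beta = soft_threshold(beta - eta * grad, eta * lam)
    return beta

# Example usage under an assumed simulation design
rng = np.random.default_rng(3)
n, p, k = 200, 400, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 2.0
y = X @ beta_star + rng.standard_t(df=2, size=n)
lam = 2.0 * np.sqrt(np.log(p) / n)
beta_hat = two_step_estimator(X, y, lam)
print("estimation error:", np.linalg.norm(beta_hat - beta_star))
```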

Scale calibration

Closer look: the loss function $\ell$ is in some sense calibrated to the scale of $\epsilon_i$

If the Huber parameter is too large, the estimation error bound based on $\|\ell'\|_\infty$ becomes suboptimal

If the Huber parameter is too small, RSC is no longer satisfied w.h.p.

For the Lasso, the optimal $\lambda$ is known to depend on $\sigma$, but the loss function does not require calibration

Better objective (low-dimensional version proposed by Huber):
$$(\widehat{\beta}, \widehat{\sigma}) \in \arg\min_{\beta, \sigma} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^n \ell\Big(\frac{y_i - x_i^T \beta}{\sigma}\Big)\,\sigma + a\sigma}_{\mathcal{L}_n(\beta, \sigma)} + \lambda \|\beta\|_1 \Big\}$$

However, joint location/scale estimation is notoriously difficult even in low dimensions
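A sketch of how Huber's joint objective above can be minimized numerically in low dimensions (illustrative only: it optimizes over (β, log σ) with a generic quasi-Newton solver, drops the penalty term, and uses an assumed consistency constant a; it ignores the difficulties the slide warns about).

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(r, delta=1.345):
    absr = np.abs(r)
    return np.where(absr <= delta, 0.5 * r**2, delta * absr - 0.5 * delta**2)

def huber_concomitant(X, y, delta=1.345, a=0.5):
    """Jointly estimate (beta, sigma) by minimizing
       (1/n) sum_i sigma * ell((y_i - x_i^T beta)/sigma) + a*sigma
    over beta and log(sigma); 'a' is an assumed consistency constant."""
    n, p = X.shape

    def obj(theta):
        beta, log_sigma = theta[:p], theta[p]
        sigma = np.exp(log_sigma)
        r = (y - X @ beta) / sigma
        return np.mean(sigma * huber_loss(r, delta)) + a * sigma

    theta0 = np.concatenate([np.zeros(p), [0.0]])   # start at beta = 0, sigma = 1
    res = minimize(obj, theta0, method="BFGS")
    return res.x[:p], float(np.exp(res.x[p]))

# Example usage (low-dimensional, as in Huber's original proposal)
rng = np.random.default_rng(4)
n, p = 300, 4
X = rng.standard_normal((n, p))
beta_star = np.array([1.0, -1.0, 2.0, 0.0])
y = X @ beta_star + 1.5 * rng.standard_t(df=3, size=n)
beta_hat, sigma_hat = huber_concomitant(X, y)
print(beta_hat, sigma_hat)
```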

Scale calibration

Another idea: MM-estimator
$$\widehat{\beta} \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell\Big(\frac{y_i - x_i^T \beta}{\widehat{\sigma}_0}\Big) + \lambda \|\beta\|_1 \right\},$$
using a robust estimate of scale $\widehat{\sigma}_0$ based on a preliminary estimate $\widehat{\beta}_0$

How to obtain $(\widehat{\beta}_0, \widehat{\sigma}_0)$?

S-estimators/LMS:
$$\widehat{\beta}_0 \in \arg\min_{\beta} \left\{ \widehat{\sigma}(r(\beta)) \right\}, \qquad \text{where } \widehat{\sigma}(r) = r_{(n - \lfloor n\delta \rfloor)}$$

LTS:
$$\widehat{\beta}_0 \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^{n - \lfloor n\alpha \rfloor} (y_i - x_i^T \beta)^2_{(i)} + \lambda \|\beta\|_1 \right\}$$

Our approach

Lepski's method originally proposed for adaptive bandwidth selection in nonparametric regression

Can be used to select $\sigma$ in the location/scale problem:
$$\widehat{\beta}_\sigma \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell_\sigma(y_i - x_i^T \beta) + \lambda_\sigma \|\beta\|_1 \right\},$$
where $\ell_\sigma$ is the Huber loss parametrized by $\sigma$

Lepski's method

Preceding theory implies
$$\|\widehat{\beta}_\sigma - \beta^*\|_2 \le C\sigma \sqrt{\frac{k \log p}{n}},$$
w.h.p., assuming $\sigma \ge \sqrt{\operatorname{Var}(\epsilon_i)} =: \sigma^*$

Basic idea of Lepski's method: Compute $\widehat{\beta}_\sigma$ on a gridding $\{\sigma_1, \dots, \sigma_M\}$ of an interval $[\sigma_{\min}, \sigma_{\max}] \ni \sigma^*$

For each $\sigma_j$, check if $\|\widehat{\beta}_{\sigma_j} - \widehat{\beta}_{\sigma_\ell}\|_2 \le 2C\sigma_\ell \sqrt{\frac{k \log p}{n}}$ for all $\ell > j$, and let $\widehat{\sigma}$ be the argmin in this set (i.e., the smallest such $\sigma_j$)

[Figure: the grid of σ values between σ_min and σ_max, with σ*, σ_j, σ_ℓ, and the selected σ̂ marked.]
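A sketch of this grid-and-compare rule (illustrative: the regularization level $\lambda_\sigma$, the Huber parameter set equal to $\sigma$, the constant C, and the sparsity guess k are all assumed choices, and the helper huber_lasso is defined here rather than taken from the slides).

```python
import numpy as np

def huber_deriv(r, delta):
    return np.clip(r, -delta, delta)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def huber_lasso(X, y, lam, delta, eta=0.1, iters=500):
    """Proximal gradient for the Huber loss (parameter delta ~ sigma) + l1 penalty."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ huber_deriv(X @ beta - y, delta) / n
        beta = soft_threshold(beta - eta * grad, eta * lam)
    return beta

def lepski_select(X, y, sigma_grid, k_guess, C=1.0):
    """Keep the smallest sigma_j whose estimate agrees with those at all larger
    grid points up to 2 * C * sigma_l * sqrt(k log p / n)."""
    n, p = X.shape
    rate = np.sqrt(k_guess * np.log(p) / n)
    sigma_grid = np.sort(np.asarray(sigma_grid))
    betas = [huber_lasso(X, y, lam=2 * s * np.sqrt(np.log(p) / n), delta=s)
             for s in sigma_grid]
    for j, s_j in enumerate(sigma_grid):
        ok = all(np.linalg.norm(betas[j] - betas[l]) <= 2 * C * sigma_grid[l] * rate
                 for l in range(j + 1, len(sigma_grid)))
        if ok:
            return s_j, betas[j]
    return sigma_grid[-1], betas[-1]
```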

Statistical guarantee

Theorem (L. '17)
With high probability, the output of Lepski's method satisfies
$$\|\widehat{\beta}_{\widehat{\sigma}} - \beta^*\|_2 \le C' \sigma^* \sqrt{\frac{k \log p}{n}}.$$

Method does not require prior knowledge of the scale $\sigma^*$

Efficiency

Although $\widehat{\beta}_{\widehat{\sigma}}$ is guaranteed to be $\ell_2$-consistent, the estimator may have relatively high variance

One-step estimation proposed for obtaining better efficiency (Bickel '75):
$$\widehat{b}_\psi = \widehat{\beta} + \frac{(X^T X)^{-1}}{\widehat{A}(\psi)} \cdot \sum_{i=1}^n \psi(y_i - x_i^T \widehat{\beta}) \, x_i,$$
where $\widehat{A}(\psi)$ is an estimate of $E[\psi'(\epsilon_i)]$

Low-dimensional result:
$$\sqrt{n}\,(\widehat{b}_\psi - \beta) \xrightarrow{d} N\!\left(0, \; \frac{E[\psi^2(\epsilon_i)]}{(E[\psi'(\epsilon_i)])^2} \cdot \Theta^* \right),$$
so the asymptotic variance for $\psi = -\frac{f'}{f}$ matches the variance of the MLE
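A sketch of the classical one-step update in the low-dimensional setting (illustrative; it assumes a Huber ψ and, for simplicity, an ordinary least squares pilot, whereas the slides would plug in the robust pilot $\widehat{\beta}_{\widehat{\sigma}}$).

```python
import numpy as np

def huber_psi(r, delta=1.345):
    return np.clip(r, -delta, delta)

def huber_psi_prime(r, delta=1.345):
    return (np.abs(r) <= delta).astype(float)

def one_step(X, y, beta_pilot, delta=1.345):
    """One-step estimator: b = beta_pilot + (X^T X)^{-1} X^T psi(r) / A_hat."""
    r = y - X @ beta_pilot
    A_hat = np.mean(huber_psi_prime(r, delta))          # estimate of E[psi'(eps)]
    correction = np.linalg.solve(X.T @ X, X.T @ huber_psi(r, delta)) / A_hat
    return beta_pilot + correction

# Example usage with an OLS pilot, for illustration only
rng = np.random.default_rng(5)
n, p = 500, 5
X = rng.standard_normal((n, p))
beta_star = np.array([1.0, 0.0, -2.0, 0.5, 3.0])
y = X @ beta_star + rng.standard_t(df=3, size=n)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(one_step(X, y, beta_ols))
```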

Efficiency

In high dimensions, define
$$\widehat{b}_\psi = \widehat{\beta}_{\widehat{\sigma}} + \frac{\widehat{\Theta}}{\widehat{A}(\psi)} \cdot \frac{1}{n} \sum_{i=1}^n \psi(y_i - x_i^T \widehat{\beta}_{\widehat{\sigma}}) \, x_i,$$
where $\widehat{A}(\psi) = \frac{1}{n} \sum_{i=1}^n \psi'(y_i - x_i^T \widehat{\beta}_{\widehat{\sigma}})$ and $\widehat{\Theta}$ is a high-dimensional estimate of $\Theta^*$ (e.g., a graphical Lasso estimator)

Resembles the Lasso debiasing procedure (Zhang & Zhang '14, van de Geer et al. '14, Javanmard & Montanari '14)

Efficiency

Theorem (L. '17)
Let $J \subset \{1, \dots, p\}$ denote a subset of coordinates of constant dimension. Then
$$\sqrt{n}\,(\widehat{b}_\psi - \beta^*)_J \xrightarrow{d} N\!\left(0, \; \frac{E[\psi^2(\epsilon_i)]}{(E[\psi'(\epsilon_i)])^2} \cdot \Theta^*_{JJ} \right)$$

Implies semiparametric efficiency of the one-step estimator when $\psi = -\frac{f'}{f}$

Can derive asymptotic confidence intervals/regions for subsets of coefficients

Important: Allows statistical inference for high-dimensional regression in cases when the $x_i$'s and $\epsilon_i$'s are heavy-tailed
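Given the limit above, per-coordinate confidence intervals follow by plugging in empirical moments of ψ and the diagonal of $\widehat{\Theta}$. A sketch (assuming beta_debiased, Theta_hat, and residuals r are available from the previous step, with an illustrative 95% level):

```python
import numpy as np
from scipy.stats import norm

def one_step_confidence_intervals(beta_debiased, Theta_hat, r, delta=1.345, level=0.95):
    """Plug-in CIs for individual coordinates, based on the asymptotic normality of b_psi."""
    n = r.shape[0]
    psi = np.clip(r, -delta, delta)                      # Huber psi at the residuals
    psi_prime = (np.abs(r) <= delta).astype(float)
    avar = np.mean(psi**2) / np.mean(psi_prime)**2       # E[psi^2] / (E[psi'])^2
    se = np.sqrt(avar * np.diag(Theta_hat) / n)          # standard error per coordinate
    z = norm.ppf(0.5 + level / 2)
    return beta_debiased - z * se, beta_debiased + z * se
```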


Summary

New theory for robust high-dimensional M-estimators implies $O\!\left(\sqrt{\frac{k \log p}{n}}\right)$ error rates when $\|\ell'\|_\infty \le C$, based on local RSC

Lepski's method proposed to avoid joint scale parameter estimation

Derived properties of the one-step estimator for semiparametric efficiency and high-dimensional inference

Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.

Thank you!