Robust estimation, efficiency, and Lasso debiasing
Po-Ling Loh
University of Wisconsin–Madison, Departments of ECE & Statistics
WHOA-PSI Workshop, Washington University in St. Louis
Aug 12, 2017
High-dimensional linear regression
[Figure: matrix form of the model, with $y \in \mathbb{R}^{n \times 1}$, $X \in \mathbb{R}^{n \times p}$, $\beta^* \in \mathbb{R}^{p \times 1}$, $\epsilon \in \mathbb{R}^{n \times 1}$]

Linear model:
$$y_i = x_i^\top \beta^* + \epsilon_i, \quad i = 1, \ldots, n$$
When $p \gg n$, assume sparsity: $\|\beta^*\|_0 \le k$.
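As a minimal sketch of this setup (the dimensions, sparsity level, and error distribution below are illustrative assumptions, not values from the talk), the sparse high-dimensional linear model can be simulated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 500, 5                 # p >> n, k-sparse signal (illustrative)

X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:k] = 1.0                   # k nonzero coefficients: ||beta*||_0 <= k
eps = rng.standard_t(df=2, size=n)    # heavy-tailed errors, relevant below
y = X @ beta_star + eps
```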
"Robust" M-estimators

Generalization of OLS suitable for heavy-tailed/contaminated errors:
$$\hat{\beta} \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(x_i^\top \beta - y_i) \right\}$$

Extensive theory (consistency, asymptotic normality) for $p$ fixed, $n \to \infty$.

[Figures (from Patrick Breheny, BST 764: Applied Statistical Modeling): loss functions (least squares, absolute value, Huber, Tukey) plotted against the residual; Belgian phone calls data (millions of calls by year, 1950–1970) with linear vs. robust regression fits]
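The slide does not include code; the following is a minimal sketch of an M-estimator with the Huber loss (the tuning constant $\delta = 1.345$ is a conventional choice assumed here), fit by a generic smooth optimizer in the classical regime $p < n$:

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(u, delta=1.345):
    """Huber loss: quadratic near zero, linear in the tails."""
    return np.where(np.abs(u) <= delta,
                    0.5 * u**2,
                    delta * (np.abs(u) - 0.5 * delta))

def m_estimate(X, y, delta=1.345):
    """M-estimator: minimize (1/n) * sum_i loss(x_i' beta - y_i) over beta."""
    def objective(beta):
        return np.mean(huber_loss(X @ beta - y, delta))
    beta0 = np.zeros(X.shape[1])
    return minimize(objective, beta0, method="BFGS").x
```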
High-dimensional M-estimators
Natural idea: for $p > n$, use a regularized version:
$$\hat{\beta} \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(x_i^\top \beta - y_i) + \lambda \|\beta\|_1 \right\}$$
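As a sketch, this regularized objective can be written out directly for a Huber loss (an assumed choice of $\ell$); note that the $\ell_1$ term is nonsmooth, which motivates the composite gradient methods discussed later:

```python
import numpy as np

def penalized_objective(beta, X, y, lam, delta=1.345):
    """l1-penalized robust loss:
    (1/n) * sum_i loss(x_i' beta - y_i) + lam * ||beta||_1."""
    u = X @ beta - y
    loss = np.where(np.abs(u) <= delta,
                    0.5 * u**2,
                    delta * (np.abs(u) - 0.5 * delta))
    return np.mean(loss) + lam * np.sum(np.abs(beta))
```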
Complications: optimization for nonconvex $\ell$? Statistical theory? Are certain losses provably better than others?
Some statistical theory

When $\|\ell'\|_\infty < C$, global optima of the high-dimensional M-estimator satisfy
$$\|\hat{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}},$$
regardless of the distribution of $\epsilon_i$.

Compare to Lasso theory: requires sub-Gaussian $\epsilon_i$'s.

If $\ell(u)$ is locally convex/smooth for $|u| \le r$, any local optima within radius $cr$ of $\beta^*$ satisfy
$$\|\widetilde{\beta} - \beta^*\|_2 \le C' \sqrt{\frac{k \log p}{n}}.$$
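To connect the bounded-derivative condition to a concrete loss: the Huber derivative is clipped at $\pm\delta$, so $\|\ell'\|_\infty$ is bounded, while the squared loss does not satisfy this. A one-line sketch ($\delta = 1.345$ assumed, as before):

```python
import numpy as np

# Huber psi = l'(u): clipped at +/- delta, so ||l'||_inf <= delta < C.
# For squared loss, l'(u) = u is unbounded -- hence the sub-Gaussian
# error requirement in standard Lasso theory.
def psi_huber(u, delta=1.345):
    return np.clip(u, -delta, delta)
```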
Algorithm

1. Run composite gradient descent on the convex, robust loss + $\ell_1$-penalty until convergence; output $\hat{\beta}_H$.
2. Run composite gradient descent on the nonconvex, robust loss + $\mu$-amenable penalty, with input $\beta^0 = \hat{\beta}_H$.
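The slide gives the procedure but not code. Below is a minimal sketch under explicit assumptions: the robust loss is taken to be Huber, the $\mu$-amenable penalty is taken to be MCP, and the step size, $\lambda$, $\gamma$, and iteration count are illustrative. The MCP penalty is split as $\lambda\|\beta\|_1$ plus a smooth concave part, so both stages reduce to proximal gradient steps with soft-thresholding; the warm start from the convex stage is what places the nonconvex stage near $\beta^*$, matching the local-optima theory above.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def huber_grad(beta, X, y, delta=1.345):
    """Gradient of the Huber empirical loss (1/n) * sum_i loss(x_i' beta - y_i)."""
    psi = np.clip(X @ beta - y, -delta, delta)   # bounded loss derivative
    return X.T @ psi / len(y)

def mcp_smooth_grad(beta, lam, gamma=3.0):
    """Gradient of q(beta) = MCP_lam(beta) - lam * ||beta||_1, the smooth
    concave part of the (mu-amenable) MCP penalty."""
    return -np.sign(beta) * np.minimum(np.abs(beta) / gamma, lam)

def composite_gd(X, y, lam, beta0, nonconvex=False, eta=0.01, gamma=3.0,
                 n_iter=2000):
    """Composite gradient descent: gradient step on the smooth part of the
    objective, then soft-thresholding for the l1 component of the penalty."""
    beta = beta0.copy()
    for _ in range(n_iter):
        g = huber_grad(beta, X, y)
        if nonconvex:   # add the smooth concave part of the MCP penalty
            g = g + mcp_smooth_grad(beta, lam, gamma)
        beta = soft_threshold(beta - eta * g, eta * lam)
    return beta

# Two-step procedure from the slide (X, y as in the earlier simulation sketch):
# beta_H   = composite_gd(X, y, lam=0.1, beta0=np.zeros(X.shape[1]))    # step 1
# beta_hat = composite_gd(X, y, lam=0.1, beta0=beta_H, nonconvex=True)  # step 2
```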
Some optimization theory

Iterates of composite gradient descent converge to within statistical precision
$$O\!\left(\sqrt{\frac{k \log p}{n}}\right)$$
of $\beta^*$.