Ensemble Methods - Boosting


Bin Li, IIT Lecture Series

Boosting a weak learner

- A weak learner L produces a hypothesis h with error rate β < 1/2, with probability at least 1 − δ, for any distribution D. L has access to a continuous stream of training data and a class oracle.
  1. L learns h1 on the first N training points.
  2. L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1 and N/2 points misclassified by h1, and produces h2.
  3. L builds a third training set of N points on which h1 and h2 disagree, and produces h3.
  4. L outputs h = Majority Vote(h1, h2, h3).
- Theorem (Schapire, 1990), "The Strength of Weak Learnability":
      error_D(h) ≤ 3β² − 2β³ < β

[Figure: plot of the boosted error rate 3β² − 2β³ as a function of β over [0, 0.5].]

What is AdaBoost? (figure from HTF 2009)

[Figure: schematic of boosting. Classifiers G_1(x), G_2(x), G_3(x), ..., G_M(x) are fit to successively reweighted training samples, and the final classifier is G(x) = sign(Σ_{m=1}^M α_m G_m(x)).]

AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights: w_i = 1/N, i = 1, 2, ..., N.
2. For m = 1 to M repeat steps (a)-(d):
   (a) Fit a classifier G_m(x) to the training data using weights w_i.
   (b) Compute
         err_m = Σ_{i=1}^N w_i · I(y_i ≠ G_m(x_i)) / Σ_{i=1}^N w_i.
   (c) Compute α_m = log((1 − err_m)/err_m).
   (d) Update the weights for i = 1, ..., N:
         w_i ← w_i · exp[α_m · I(y_i ≠ G_m(x_i))],
       and renormalize the w_i to sum to 1.
3. Output G(x) = sign[Σ_{m=1}^M α_m G_m(x)].

An example of AdaBoost (taken from R. Schapire)

(Schapire's figures report ε_1 = 0.3, α_1 = 0.42; ε_2 = 0.21, α_2 = 0.65; ε_3 = 0.14, α_3 = 0.92, using the convention α = (1/2)·log((1 − ε)/ε). Below the α_m are recomputed with the convention of step (c) above, which doubles them.)

Round 1:
- err_1 = Σ_i w_i I(G_1(x_i) ≠ y_i) = 0.3 (three of the ten points, each with weight 0.1, are misclassified).
- α_1 = log((1 − err_1)/err_1) = 0.847.
- w_i for each misclassified point: 0.1 · exp(0.847) = 0.233.
- After normalizing, w_i for the wrong points is 0.233/(0.233·3 + 0.1·7) = 0.1667, and for the correct points 0.1/(0.233·3 + 0.1·7) = 0.0714.

Round 2:
- err_2 = Σ_i w_i I(G_2(x_i) ≠ y_i) = 3 · 0.07 = 0.21.
- α_2 = log((1 − err_2)/err_2) ≈ 1.3.

Round 3:
- err_3 = 0.14.
- α_3 = log((1 − err_3)/err_3) = 1.84.

[Figure: the final classifier, a weighted combination of the three weak classifiers.]

A simulation - "nested spheres" example

Ten features X_1, ..., X_10 are independent N(0, 1) random variables. The deterministic target Y is defined by
      Y = 1 if Σ_{j=1}^{10} X_j² > χ²_10(0.5) = 9.34, and Y = −1 otherwise.
- 2000 training observations (about 1000 cases in each class), and 10,000 test observations.
- Bayes error is 0% (noise-free).
- A stump is a two-node tree, after a single split.
- We use stumps as the "weak learner" in the AdaBoost algorithm.

Boosting stumps on the nested-spheres example (figure from HTF 2009)

[Figure: test error of boosted stumps versus boosting iterations, with reference lines for a single stump and a 400-node tree.]
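To make the recipe concrete, here is a minimal Python sketch of the algorithm above, run on simulated nested-spheres data with scikit-learn decision stumps as the weak learner. The function names, the use of scikit-learn, and the number of boosting rounds are illustrative choices, not part of the lecture.

```python
# A minimal sketch of AdaBoost with decision stumps on simulated
# "nested spheres" data (illustrative, not the lecture's code).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def nested_spheres(n, p=10, threshold=9.34):
    """Ten independent N(0,1) features; y = +1 if the sum of squares exceeds chi^2_10(0.5)."""
    X = rng.standard_normal((n, p))
    y = np.where((X ** 2).sum(axis=1) > threshold, 1, -1)
    return X, y

def adaboost_stumps(X, y, M=400):
    N = len(y)
    w = np.full(N, 1.0 / N)                           # step 1: w_i = 1/N
    stumps, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)   # a two-node tree
        stump.fit(X, y, sample_weight=w)              # (a) fit using weights w_i
        pred = stump.predict(X)
        miss = (pred != y)
        err = np.sum(w * miss) / np.sum(w)            # (b) weighted error err_m
        alpha = np.log((1 - err) / err)               # (c) alpha_m
        w = w * np.exp(alpha * miss)                  # (d) upweight the mistakes
        w = w / w.sum()                               #     renormalize to sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def predict(stumps, alphas, X):
    F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(F)                                 # G(x) = sign(sum_m alpha_m G_m(x))

X_tr, y_tr = nested_spheres(2000)
X_te, y_te = nested_spheres(10000)
stumps, alphas = adaboost_stumps(X_tr, y_tr, M=400)
print("test error:", np.mean(predict(stumps, alphas, X_te) != y_te))
```

With a few hundred rounds the boosted stumps should reach a test error far below that of a single stump, in line with the figure above; the exact number depends on the random draw.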
Boosting stumps works remarkably well on the nested-spheres problem.

Training and testing errors (figure from HTF 2009)

Nested spheres in R^10; Bayes error is 0%; stumps as weak learners.
[Figure: training and test error versus the number of terms.]
Boosting drives the training error to zero. Further iterations continue to improve the test error in many examples.

Boosting for noisy problems (figure from HTF 2009)

Nested Gaussians in R^10, i.e., the nested-spheres example with added noise; Bayes error is 25%.
[Figure: training and test error of boosted stumps versus the number of terms, with the Bayes error marked.]
Here the test error does increase, but quite slowly.

Weak classifiers in AdaBoost

In the illustrative example, after fitting the first classifier the normalized weights w_i for the wrong and the correct points are 0.1667 and 0.0714. What is the sum of the weights for the wrong and the correct points separately?

Weak classifiers in AdaBoost (cont.)

Sum of the weights for the wrongly classified points after the update:

      Σ_{i: y_i ≠ G_m(x_i)} w_i · exp(α_m)
        = Σ_{i: y_i ≠ G_m(x_i)} w_i · (1 − err_m)/err_m
        = Σ_{i: y_i ≠ G_m(x_i)} w_i · [Σ_i w_i I(y_i = G_m(x_i)) / Σ_i w_i I(y_i ≠ G_m(x_i))]
        = Σ_i w_i I(y_i = G_m(x_i))
        = Σ_{i: y_i = G_m(x_i)} w_i.

So the misclassified points end up carrying the same total weight as the correctly classified ones (in the example, 3 · 0.1667 ≈ 0.5 and 7 · 0.0714 ≈ 0.5). The error rate of G_m is 1/2 under the new weighting scheme, so the next weak classifier is "independent" of the previous one.

Boosting as an additive model

- Boosting builds an additive model: f(x) = Σ_{m=1}^M β_m b(x; γ_m). For example, in boosting trees b(x; γ_m) is a tree, and γ_m parametrizes the splits.
- We use additive models in statistics all the time!
  - Multiple linear regression: f(x) = Σ_j β_j x_j
  - Generalized additive model: f(x) = Σ_j f_j(x_j)
  - Basis expansions in spline models: f(x) = Σ_{m=1}^M θ_m h_m(x)
- Traditionally, the parameters f_m, θ_m are fit jointly (e.g., by least squares or maximum likelihood).
- Friedman et al. (2000) have shown that AdaBoost is equivalent to stagewise additive logistic regression using the exponential loss criterion L(y, f(x)) = exp(−y f(x)).
- y f(x) is called the "margin" (> 0: classified correctly, < 0: misclassified).

Stagewise additive modeling

Based on Friedman, Hastie and Tibshirani (2000)'s result, AdaBoost is equivalent to the following modeling strategy.

Forward Stagewise Additive Modeling
1. Initialize f_0(x) = 0.
2. For m = 1 to M:
   (a) Compute
         (β_m, γ_m) = argmin_{β, γ} Σ_{i=1}^N L(y_i, f_{m−1}(x_i) + β·b(x_i; γ)).
   (b) Set f_m(x) = f_{m−1}(x) + β_m·b(x; γ_m).

Compared with the stepwise approach, stagewise additive modeling slows the process down and tends to overfit less quickly.
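The following sketch (my illustration, not the lecture's code) runs forward stagewise additive modeling with the exponential loss L(y, f) = exp(−y f), using stumps as the basis functions b(x; γ). For a basis function taking values in {−1, +1}, the inner minimization over β has the closed form β_m = (1/2)·log((1 − err_m)/err_m), with err_m the weighted error under w_i ∝ exp(−y_i f_{m−1}(x_i)) (Friedman et al., 2000); since AdaBoost's α_m = 2β_m and the factor 2 does not change the sign of the output, the two procedures select the same weak classifiers and produce the same final classifier.

```python
# Forward stagewise additive modeling with exponential loss and stump basis
# functions. A sketch under the assumptions stated above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forward_stagewise_exp(X, y, M=10):
    N = len(y)
    f = np.zeros(N)                            # f_0(x) = 0
    for m in range(M):
        w = np.exp(-y * f)                     # exponential-loss weights
        w = w / w.sum()                        # normalizing does not change the argmin
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # weighted fit chooses gamma_m
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))          # weighted error of b(x; gamma_m)
        beta = 0.5 * np.log((1 - err) / err)   # closed-form minimizer over beta
        f = f + beta * pred                    # f_m = f_{m-1} + beta_m * b(x; gamma_m)
        # Check from the "Weak classifiers in AdaBoost" slide: under the updated
        # weights, the classifier just added has weighted error exactly 1/2.
        w_next = np.exp(-y * f)
        w_next = w_next / w_next.sum()
        print(f"m={m}: err={err:.3f}, reweighted err={np.sum(w_next * (pred != y)):.3f}")
    return f
```

Running forward_stagewise_exp(X_tr, y_tr) on the training sample from the earlier sketch should print a reweighted error of 0.500 at every step: each new weak classifier faces a reweighted problem on which its predecessor is no better than chance.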
Exponential loss vs. misclassification loss (figure from HTF, 2009)

[Figure 10.3 of HTF: simulated data, boosting with stumps; misclassification error rate on the training set and average exponential loss (1/N)·Σ_{i=1}^N exp(−y_i f(x_i)), plotted against boosting iterations. After about 250 iterations the misclassification error is zero, while the exponential loss continues to decrease.]

From the accompanying text (HTF, Section 10.5, "Why Exponential Loss?"): the AdaBoost.M1 algorithm was originally motivated from a very different perspective than the one presented above, and its equivalence to forward stagewise additive modeling based on exponential loss was only discovered five years after its inception. The principal attraction of exponential loss in the context of additive modeling is computational: it leads to the simple, modular reweighting AdaBoost algorithm. It is nevertheless of interest to ask about its statistical properties: what does it estimate, and how well is it estimated? The first question is answered by seeking its population minimizer.

Why exponential loss?

- The AdaBoost algorithm was originally motivated from a very different perspective.
- What are the statistical properties of the exponential loss? What does it estimate?
- The population minimizer of e^{−Yf} is (Friedman et al., 2000):
      f*(x) = argmin_{f(x)} E_{Y|x}[e^{−Y f(x)}] = (1/2)·log[ Pr(Y = 1 | x) / Pr(Y = −1 | x) ].
- AdaBoost estimates one-half the log-odds of P(Y = 1 | x) and uses its sign as the classification rule.
- The binomial negative log-likelihood, or deviance, has the same population minimizer:
      −l(Y, f(x)) = log(1 + e^{−2Y f(x)}).
- Note that e^{−Yf} is not a proper log-likelihood.

Loss functions for classification (figure from HTF, 2009)

[Figure 10.4 of HTF: misclassification, exponential, binomial deviance, squared error, and support vector losses plotted as functions of the margin y·f.]
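As a quick numerical check of the population minimizer above (my own sketch; the value of p and the grid are arbitrary illustrative choices), one can minimize the conditional risk over f on a grid for a fixed p = Pr(Y = 1 | x) and compare with the closed form:

```python
# Numerical check that the population minimizers of the exponential loss and the
# binomial deviance both equal (1/2) * log(p / (1 - p)).
import numpy as np

p = 0.8                                   # Pr(Y = 1 | x) at some fixed x (illustrative)
f = np.linspace(-5, 5, 200001)            # fine grid of candidate values f(x)

# E[e^{-Y f} | x] and E[log(1 + e^{-2 Y f}) | x] under Pr(Y = 1 | x) = p
exp_risk = p * np.exp(-f) + (1 - p) * np.exp(f)
dev_risk = p * np.log1p(np.exp(-2 * f)) + (1 - p) * np.log1p(np.exp(2 * f))

half_log_odds = 0.5 * np.log(p / (1 - p))
print("closed form:     ", half_log_odds)
print("argmin exp loss: ", f[np.argmin(exp_risk)])
print("argmin deviance: ", f[np.argmin(dev_risk)])
```

Both grid minimizers agree with (1/2)·log(0.8/0.2) ≈ 0.693 up to the grid resolution, illustrating that the exponential loss and the binomial deviance target the same half-log-odds function.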