Robust Regression
Robust Data Mining Techniques
By Boonyakorn Jantaranuson

Outline
● Introduction
  ○ OLS and important terminology
● Least Median of Squares (LMedS)
● M-estimator
● Penalized least squares

What is Regression?
● Fit a model to observed data
● Minimize the error between the observed data and the predicted data
https://en.wikipedia.org/wiki/Regression_analysis

Outliers
● Noise: transmission error, measurement error
● Cause problems for the resulting regression model

Robust Regression
● More robust to outliers than ordinary regression
● Outliers are not removed, but they do not strongly affect the model

Problem formulation
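● Linear model: $y_i = x_{i1}\theta_1 + \dots + x_{ip}\theta_p + e_i$, for $i = 1, \dots, n$
● $y_i$ is called the response variable
● $x_i = (x_{i1}, \dots, x_{ip})$ is called the explanatory variable, with p dimensions
● $e_i$ is the error term
● Goal: find an estimate of each parameter $\theta_j$ with minimum error

Problem formulation (contd.)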
● The estimates of the parameters are called regression coefficients
● The residual $r_i$ is the difference between the observed and the predicted value
● Formally, the goal is to find a model that fits the data with the smallest residuals

Ordinary Least Squares (OLS)
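Written out in the usual linear-model notation, the predicted value, the residual, and the least-squares objective look as follows. The symbol $\theta$ for the parameter vector and the hat notation for predictions are notation chosen here for illustration.

```latex
\hat{y}_i(\theta) = x_{i1}\theta_1 + \dots + x_{ip}\theta_p,
\qquad
r_i(\theta) = y_i - \hat{y}_i(\theta),
\qquad
\hat{\theta}_{\mathrm{OLS}} = \arg\min_{\theta} \sum_{i=1}^{n} r_i(\theta)^2
```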
● Most common regression model
● Also called least sum of squares or simply least squares (LS)
● Goal: find the regression coefficients that minimize the sum of squared residuals

Problem with OLS
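A minimal sketch of the issue, assuming NumPy and made-up data lying exactly on the line y = 1 + 2x: the same OLS fit is computed before and after corrupting a single response value. The helper name `ols_fit` and all numbers are illustrative, not taken from the slides.

```python
import numpy as np

def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    # Design matrix with a column of ones for the intercept.
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

# Clean, illustrative data on the line y = 1 + 2x.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x
clean_intercept, clean_slope = ols_fit(x, y)

# Corrupt one response value and fit again.
y_bad = y.copy()
y_bad[-1] = -100.0
bad_intercept, bad_slope = ols_fit(x, y_bad)

print(f"clean data:  slope = {clean_slope:.3f}")
print(f"one outlier: slope = {bad_slope:.3f}")
```

With only one of ten points corrupted, the fitted slope changes sign, which is the behaviour the breakdown point is meant to quantify.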
● The regression model is sensitive to outliers

Breakdown point
● Measure of the robustness of a regression method
● Ratio of the smallest number of outliers that causes the regression model to break down to the total number of data points
● E.g. a single outlier can already corrupt the OLS result
  ○ Its breakdown point is 1/n, which tends to 0% as n grows
● Highest possible breakdown point is 50%

Leverage points
● Outliers can occur in both the x- and y-directions
● An outlier in the x-direction is called a leverage point
● Normally affects the fitted model more strongly than an outlier in the y-direction, because it pulls the fit toward itself

Least Median of Squares (LMedS)
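● Introduced by Hampel in 1975
● Replaces the sum in OLS with the median: minimize the median of the squared residuals
● More robust because the median is not affected by a minority of extreme residuals

LMedS (contd.)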
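The "replace the sum with the median" idea can be sketched with a simple random two-point sampling scheme: draw pairs of points, take the line through each pair as a candidate, and keep the candidate with the smallest median squared residual. This is only an approximation for illustration, not an exact LMedS solver, and the function name `lmeds_line`, the trial count, and the data are assumptions.

```python
import numpy as np

def lmeds_line(x, y, n_trials=500, seed=0):
    """Approximate the LMedS line by random two-point sampling.

    Returns (intercept, slope, median_squared_residual) of the best candidate.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    best_med, best_intercept, best_slope = np.inf, 0.0, 0.0
    for _ in range(n_trials):
        i, j = rng.choice(n, size=2, replace=False)
        if x[i] == x[j]:
            continue  # vertical candidate line, skip
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # LMedS criterion: median (not sum) of the squared residuals.
        med = np.median((y - (intercept + slope * x)) ** 2)
        if med < best_med:
            best_med, best_intercept, best_slope = med, intercept, slope
    return best_intercept, best_slope, best_med

# Illustrative data on y = 1 + 2x with 6 of 20 responses (30%) corrupted.
x = np.arange(20, dtype=float)
y = 1.0 + 2.0 * x
y[[1, 4, 7, 12, 15, 18]] = -50.0

lmeds_intercept, lmeds_slope, lmeds_med = lmeds_line(x, y)
print(f"LMedS estimate: intercept = {lmeds_intercept:.3f}, slope = {lmeds_slope:.3f}")
```

Because more than half of the points lie on the true line, any candidate through two clean points has a median squared residual of zero, so the corrupted points do not influence the selected line.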
● Can achieve a 50% breakdown point
● Computationally expensive for an exact solution
  ○ $O(n^{p+1} \log n)$ in p dimensions
● Needs an approximation algorithm

LMedS with randomization
● Computes an approximation of LMedS
● Achieves a good running time: $O(n \log^2 n)$ in 2-D with high probability, and $O(n^{p-1} \log n)$ in p dimensions in the worst case

LMedS with randomization (contd.)
● Goal: maintain the interval of slopes of lines in order to reach the minimum residual
● The set of lines is defined by
● The interval of slopes (w.r.t. 2 points) is

LMedS with randomization (contd.)
● In each iteration, n cones are sampled at random from all (n-1)(n-2)/2 possible cones
● The median of the residuals is tested and the interval is shrunk
● Repeat until the residual is small enough, then find the optimal solution among the intersections in the remaining interval

Reweighted Least Squares (RLS)
● One variant of LMedS
● Combines OLS with estimates from LMedS
  ○ S is the scale estimate corresponding to LMedS

RLS (contd.)
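One common way to write the reweighting step is given below, as a sketch. Here $r_i$ are the residuals of the LMedS fit, $s^0$ is the scale estimate derived from them, and the constants 1.4826, $5/(n-p)$ and 2.5 follow the formulation usually attributed to Rousseeuw and Leroy. Whether the slides use exactly these constants is an assumption.

```latex
s^0 = 1.4826 \left(1 + \frac{5}{n - p}\right) \sqrt{\operatorname{med}_i \, r_i^2},
\qquad
w_i =
\begin{cases}
1 & \text{if } |r_i / s^0| \le 2.5 \\
0 & \text{otherwise}
\end{cases},
\qquad
\hat{\theta}_{\mathrm{RLS}} = \arg\min_{\theta} \sum_{i=1}^{n} w_i \, r_i(\theta)^2
```

Points flagged as outliers by the LMedS fit receive weight zero, and an ordinary least-squares fit is then computed on the remaining points.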
From Robust Regression and Outlier Detection by Rousseeuw and Leroy

M-estimator
● The "M" in the name comes from "maximum likelihood"
● Replaces the squared residual in OLS with a symmetric, positive semi-definite function ρ

M-estimator (contd.)
● To find the regression coefficients that minimize the objective function, we take the derivative of that function and set it to zero

M-estimator (contd.)
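Setting the derivative of the objective to zero gives a condition of the form $\sum_i \psi(r_i)\, x_i = 0$ with $\psi = \rho'$, which is often solved by iteratively reweighted least squares (IRLS) using weights $w_i = \psi(r_i)/r_i$. The sketch below assumes the Huber ρ function, a tuning constant of 1.345, and a MAD-based residual scale. These are choices made here for illustration and are not specified in the slides.

```python
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=100, tol=1e-8):
    """Huber M-estimate of regression coefficients via IRLS (illustrative sketch)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # start from the OLS fit
    for _ in range(n_iter):
        r = y - X @ theta
        # Robust residual scale: median absolute deviation, scaled for Gaussian data.
        scale = 1.4826 * np.median(np.abs(r - np.median(r)))
        if scale < 1e-12:
            break  # residuals are essentially zero for most points
        u = np.abs(r) / scale
        # Huber weights psi(u)/u: 1 inside [-delta, delta], delta/|u| outside.
        w = delta / np.maximum(u, delta)
        sw = np.sqrt(w)
        # Weighted least squares step.
        theta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
    return theta

# Illustrative data: y = 1 + 2x plus small noise, with three corrupted responses.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
y[-3:] += 60.0

ols_theta, *_ = np.linalg.lstsq(X, y, rcond=None)
huber_theta = huber_irls(X, y)
print("OLS slope:  ", ols_theta[1])
print("Huber slope:", huber_theta[1])
```

The Huber weights cap the influence of large residuals, so the corrupted responses pull the fit less than they do under OLS.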
● The M-estimator reduces to other types of regression for particular choices of ρ
  ○ OLS: $\rho(r_i) = r_i^2$
  ○ Least absolute deviations (LAD): $\rho(r_i) = |r_i|$
● LAD yields smaller residuals than OLS, but on high-dimensional data OLS can perform slightly better
  ○ But both still have a 0% breakdown point!
● Challenge: the right ρ function must be chosen to get a good result

Penalized Least Squares
● OLS is equivalent to finding the maximum likelihood estimate (MLE) under Gaussian noise
● MLE takes only the training data into account, not prior knowledge => overfitting
● Solution: use maximum a posteriori (MAP) estimation

Penalized Least Squares
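● With a Gaussian likelihood for the data and a zero-mean Gaussian prior on the regression coefficients $\theta$, calculating the MAP estimate is equivalent to minimizing $\sum_{i=1}^{n} (y_i - x_i^\top \theta)^2 + \lambda \lVert \theta \rVert_2^2$
● Intuitively, it is OLS with the penalty term $\lambda \lVert \theta \rVert_2^2$, where $\lambda \ge 0$ controls the strength of the penalty
● The above is called ridge regression or l2 regularization

Penalized Least Squares (contd.)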
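As a small illustration of ridge regression, the sketch below assumes the objective $\sum_i (y_i - x_i^\top \theta)^2 + \lambda \lVert \theta \rVert_2^2$, whose minimizer has the closed form $(X^\top X + \lambda I)^{-1} X^\top y$. The function name `ridge_fit`, the synthetic data, and the choice to penalize every coefficient (no separate intercept) are assumptions made for brevity.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) theta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic, illustrative data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
theta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ theta_true + rng.normal(scale=0.5, size=30)

theta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 reduces to OLS
theta_ridge = ridge_fit(X, y, lam=10.0)  # lam > 0 shrinks the coefficients

print("OLS coefficient norm:  ", np.linalg.norm(theta_ols))
print("ridge coefficient norm:", np.linalg.norm(theta_ridge))
```

Increasing `lam` shrinks the coefficient vector toward zero, which is the effect of the penalty term.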
● Different assumptions on the data and the prior give different types of regularization
From Machine Learning: A Probabilistic Perspective by Murphy

Hard Thresholding (TORRENT)
● TORRENT = Thresholding Operator-based Robust RegrEssioN meThod
● Based on l1 regularized regression
● Iteratively maintains the active set $S_t$ using a hard thresholding operator
  ○ The active set is the set of points currently considered clean (not outliers)
● Keeps updating the weights (regression coefficients) until the residual is less than a pre-specified error tolerance

TORRENT (contd.)
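The loop described above can be sketched roughly as follows, in the spirit of the fully corrective variant: fit least squares on the current active set, recompute all residuals, and keep the points with the smallest absolute residuals as the new active set. This is a simplified sketch, not the authors' reference implementation. The function name `torrent_fc`, the parameter `beta` (assumed fraction of points to discard), the iteration cap, and the data are assumptions.

```python
import numpy as np

def torrent_fc(X, y, beta, tol=1e-6, max_iter=100):
    """Hard-thresholding robust regression sketch (fully corrective updates)."""
    n = X.shape[0]
    keep = int(round((1.0 - beta) * n))  # size of the active set
    active = np.arange(n)                # start with every point active
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # Fully corrective step: least squares on the active set only.
        w, *_ = np.linalg.lstsq(X[active], y[active], rcond=None)
        r = y - X @ w
        # Hard thresholding: keep the points with the smallest |residual|.
        active = np.argsort(np.abs(r))[:keep]
        if np.linalg.norm(r[active]) <= tol:
            break
    return w, np.sort(active)

# Illustrative noise-free data with 20% of the responses corrupted.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
y[:40] += 20.0  # corrupt the first 40 responses

w_hat, active_set = torrent_fc(X, y, beta=0.2)
print("estimated weights:", w_hat)
```

The iteration cap guards against the case where noisy data never brings the active-set residual below the tolerance.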
From the paper Robust Regression via Hard Thresholding by Bhatia, Jain and Kar

TORRENT (contd.)
● Offers several variants suited to different situations
● Variants
  ○ TORRENT-FC: fully corrective LS; converges faster but each step is expensive
  ○ TORRENT-GD: uses gradient descent; suitable for high-dimensional data
  ○ TORRENT-HYB: hybrid of the two variants above

Self-Scaled Regularized Robust Regression
● Also based on l1 regularized regression
● Incorporates prior knowledge so that the penalty term can be scaled automatically
  ○ Prior: e.g. data occurrence

Conclusion
● OLS is sensitive to outliers
● LMedS has a high breakdown point but is slow
● M-estimators are flexible, but it is hard to find the right function to make them robust
● Penalized least squares is also robust but requires prior knowledge about the data
  ○ Sometimes needs strong assumptions that are not always correct

Remarks
● Older papers tend to focus on a high breakdown point, i.e. they try to reach a 50% breakdown point
● More recent papers are interested in computational speed instead
  ○ Effect of high-dimensional data