
Robust Regression

Robust Mining Techniques
By Boonyakorn Jantaranuson

Outline

● Introduction
  ○ OLS and important terminology
● Least Median of Squares (LMedS)
● M-estimator
● Penalized Least Squares

What is Regression?

● Fit a model to observed data
● Minimize the error between the real data and the predicted data

From https://en.wikipedia.org/wiki/Regression_analysis

● Noise: transmission error, measurement error
● Causes problems for the resulting regression model

Robust Regression

● More robust to outliers than ordinary regression
● Outliers are not removed, but they do not strongly affect the model

Problem formulation

● yi is called the response variable

● xi is called the explanatory variable, with p dimensions

● ei is the error term

● Goal: find the estimate of each parameter with minimum error
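The model equation itself is not in the extracted slides; assuming the standard linear-regression formulation these definitions describe, it would read:

```latex
\[
  y_i = x_i^{\top}\beta + e_i
      = \beta_1 x_{i1} + \dots + \beta_p x_{ip} + e_i,
  \qquad i = 1, \dots, n
\]
```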

Problem formulation (contd.)

● The parameter estimates are called regression coefficients

● The residual ri is the difference between the real and the predicted value

● Formally, our goal is to find a model that fits the data with the smallest residuals
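In symbols (again assuming the standard notation, since the slide's own equation image is missing):

```latex
\[
  r_i = y_i - \hat{y}_i = y_i - x_i^{\top}\hat{\beta}
\]
```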

Ordinary Least Squares (OLS)

● Most common regression model
● Also called sum of least squares or least squares (LS)
● Goal: find the regression coefficients that minimize the sum of squared residuals
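A minimal sketch of OLS with NumPy; the toy data and variable names here are made up for illustration:

```python
import numpy as np

# Toy data: y depends linearly on a single explanatory variable x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

# Design matrix with an intercept column; lstsq minimizes the sum of squared residuals.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
print("coefficients:", beta)                       # roughly [1.0, 2.0]
print("sum of squared residuals:", np.sum(residuals ** 2))
```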

Problem with OLS

● The regression model is sensitive to outliers
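A quick illustration, continuing the toy data from the OLS sketch above, of how a single corrupted point can drag the fit:

```python
# Corrupt a single observation with a gross error.
y_bad = y.copy()
y_bad[0] = 1000.0                                  # one gross outlier

beta_bad, *_ = np.linalg.lstsq(X, y_bad, rcond=None)
print("clean fit:  ", beta)        # close to the true [1.0, 2.0]
print("one outlier:", beta_bad)    # coefficients shift noticeably
```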

Breakdown point

● A measure of the robustness of a regression method
● The ratio between the smallest number of outliers that causes the regression model to break down and the total number of data points
● E.g. 1 outlier already corrupts the OLS result
  ○ Its breakdown point is 1/n, i.e. 0% asymptotically
● The highest possible breakdown point is 50%

Leverage points

● Outliers can occur in both the x- and y-directions
● An outlier in the x-direction is called a leverage point
● It normally yields a larger residual than an outlier in the y-direction

Least Median of Squares (LMedS)

● Introduced by Hampel in 1975
● Replaces the sum in OLS with the median
● More robust because of the median
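Written out, the objective would be (my reconstruction; the slides' formula is not in the extracted text):

```latex
\[
  \hat{\beta}_{\mathrm{LMedS}} = \arg\min_{\beta}\; \operatorname*{med}_{i}\; r_i^2
\]
```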

LMedS (contd.)

● Can achieve a 50% breakdown point
● Computationally expensive to solve exactly
  ○ O(n^(p+1) log n) in p dimensions
● Needs an approximation algorithm
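One simple way to approximate LMedS is random subsampling; this sketch is my own illustration of that idea and is not the interval-shrinking algorithm described on the next slides:

```python
import numpy as np

def lmeds_subsample(X, y, n_trials=500, seed=0):
    """Crude randomized LMedS approximation: repeatedly fit an exact solution
    on a minimal random subsample and keep the candidate with the smallest
    median squared residual over the whole data set."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_med = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)     # minimal subsample
        try:
            beta = np.linalg.solve(X[idx], y[idx])     # exact fit on p points
        except np.linalg.LinAlgError:
            continue                                   # degenerate subsample
        med = np.median((y - X @ beta) ** 2)
        if med < best_med:
            best_beta, best_med = beta, med
    return best_beta, best_med
```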

LMedS with randomization

● Calculates an approximation of LMedS
● Gets a good running time: O(n log^2 n) in 2-D with high probability and O(n^(p-1) log n) in p dimensions in the worst case

LMedS with randomization (contd.)

● Goal: maintain the interval of slopes of lines that gives the minimum residual
● The set of lines is defined by:

● The interval of slopes (w.r.t. 2 points) is:

LMedS with randomization (contd.)

● In each iteration, n cones are sampled at random from all (n-1)(n-2)/2 possible cones
● The median of the residuals is tested and the interval is shrunk
● Repeat until the residual is small enough, then find the optimal solution from the intersections in the remaining interval

Reweighted Least Squares (RLS)

● A variant of LMedS
● Combines OLS with estimates from LMedS
  ○ S is the scale estimate corresponding to LMedS
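A rough sketch of the reweighting idea, assuming the common recipe from Rousseeuw and Leroy (points whose standardized LMedS residual exceeds a cutoff get weight 0, then OLS is refit on the rest); the constants and names here are my example:

```python
import numpy as np

def reweighted_ls(X, y, beta_lmeds, cutoff=2.5):
    """Reweighted least squares: give weight 0 to points whose standardized
    LMedS residual exceeds the cutoff, then refit OLS on the rest."""
    n, p = X.shape
    r = y - X @ beta_lmeds
    # Scale estimate S derived from the median squared LMedS residual.
    s = 1.4826 * (1 + 5.0 / (n - p)) * np.sqrt(np.median(r ** 2))
    w = (np.abs(r / s) <= cutoff).astype(float)        # 0/1 weights
    keep = w > 0
    beta_rls, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return beta_rls, w
```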

RLS (contd.)

From Robust Regression and Outlier Detection by Rousseeuw

M-estimator

● The name M comes from Maximum Likelihood
● Replaces the squared residual in OLS with a symmetric, positive semi-definite function ρ
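So the objective becomes (my reconstruction, following the description above):

```latex
\[
  \hat{\beta}_{M} = \arg\min_{\beta} \sum_{i=1}^{n} \rho(r_i)
\]
```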

M-estimator (contd.)

● To find the regression coefficients that minimize the objective function, we need to take the derivative of that function
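Setting that derivative to zero gives the usual M-estimation equations (the slide's own formula is not in the extracted text):

```latex
\[
  \sum_{i=1}^{n} \psi(r_i)\, x_{ij} = 0, \qquad j = 1, \dots, p,
  \qquad \text{where } \psi = \rho'
\]
```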

M-estimator (contd.)

● We can also reduce the M-estimator to other types of regression
  ○ OLS: ρ(ri) = ri^2
  ○ Least absolute deviations (LAD): ρ(ri) = |ri|
● LAD yields smaller residuals than OLS, but on high-dimensional data OLS can perform slightly better
  ○ But still a 0% breakdown point!
● Challenge: we need to choose the right ρ function to get a good result
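A minimal sketch of one common way to fit an M-estimator: iteratively reweighted least squares with the Huber ρ. The Huber choice and the tuning constant are my example, not from the slides:

```python
import numpy as np

def m_estimator_huber(X, y, c=1.345, n_iter=50, tol=1e-8):
    """Fit an M-estimator by iteratively reweighted least squares (IRLS)
    using Huber weights: w(u) = 1 for |u| <= c, c/|u| otherwise."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # OLS starting point
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust scale via the median absolute deviation of the residuals.
        s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
        u = r / s
        w = np.where(np.abs(u) <= c, 1.0, c / np.maximum(np.abs(u), 1e-12))
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```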

Penalized Least Squares

● OLS is equivalent to finding the maximum likelihood estimate (MLE) of the data
● MLE only looks at the training data, not at prior knowledge => overfitting
● Solution: use the maximum a posteriori (MAP) estimate

Penalized Least Squares

● With a Gaussian (normal) prior on the coefficients, calculating the MAP estimate is equivalent to:

● Intuitively, it is OLS with a penalty term

● The above is called ridge regression or l2 regularization
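A small sketch of ridge regression via its closed form; the objective ||y - Xβ||^2 + λ||β||^2 is my reconstruction of the penalized form described above, and λ here is an arbitrary example value:

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Ridge regression: minimize ||y - X @ beta||^2 + lam * ||beta||^2.
    Closed form: beta = (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```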

Penalized Least Squares (contd.)

● Different assumptions on the data and the prior give different types of regularization

From Machine Learning: A Probabilistic Perspective by Murphy

Hard Thresholding (TORRENT)

● TORRENT = Thresholding Operator-based Robust RegrEssioN meThod

● Based on l1 regularized regression

● Iteratively maintains the active set St using a hard thresholding operator
  ○ The active set is the set of clean points (not outliers)
● Keeps updating the weights (regression coefficients) until the residual is less than some pre-specified error tolerance
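A minimal sketch of the hard-thresholding idea as described above (closest to the fully corrective variant); the parameter `frac_clean` and the other names are mine, not from the paper:

```python
import numpy as np

def hard_threshold_regression(X, y, frac_clean=0.8, n_iter=100, tol=1e-6):
    """Alternate between (1) least squares on the current active set and
    (2) rebuilding the active set as the points with the smallest residuals."""
    n = X.shape[0]
    k = int(frac_clean * n)                  # number of points assumed clean
    active = np.arange(n)                    # start with all points active
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta_new, *_ = np.linalg.lstsq(X[active], y[active], rcond=None)
        r = np.abs(y - X @ beta_new)
        active = np.argsort(r)[:k]           # hard thresholding step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new, active
        beta = beta_new
    return beta, active
```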

TORRENT (contd.)

From the paper Robust Regression via Hard Thresholding by Bhatia, Jain and Kar

TORRENT (contd.)

● Offers several variants which are suitable in different situations
● Variants
  ○ TORRENT-FC: fully corrective LS, converges faster but is expensive at each step
  ○ TORRENT-GD: uses gradient descent, suitable for high-dimensional data
  ○ TORRENT-HYB: hybrid version of the above variants

Self-Scaled Regularized Robust Regression

● Also based on l1 regularized regression
● Incorporates prior knowledge so that the penalty term can scale automatically
  ○ Prior knowledge, e.g. data occurrence

Conclusion

● OLS is sensitive to outliers
● LMedS has a high breakdown point but is slow
● M-estimators are flexible, but it is hard to find the right function to make them robust
● Penalized Least Squares is also robust but requires prior knowledge about the data
  ○ Sometimes this needs strong assumptions that are not always correct

Remarks

● Older papers tend to focus on a high breakdown point, i.e. trying to reach the 50% breakdown point
● More recent papers are interested in computational speed instead
  ○ An effect of high-dimensional data