
ALGORITHMS FOR ROBUST LINEAR REGRESSION BY EXPLOITING THE CONNECTION TO SPARSE SIGNAL RECOVERY

Yuzhe Jin and Bhaskar D. Rao

Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA 92093-0407, USA {yujin, brao}@ucsd.edu

ABSTRACT

In this paper, we develop algorithms for robust linear regression by leveraging the connection between the problems of robust regression and sparse signal recovery. We explicitly model the measurement noise as a combination of two terms: the first term accounts for regular measurement noise, modeled as zero mean Gaussian noise, and the second term captures the impact of outliers. The fact that the latter component is indeed a sparse vector provides the opportunity to adapt sparse signal reconstruction methods to solve the problem of robust regression. Maximum a posteriori (MAP) based and empirical Bayesian inference based algorithms are developed for this purpose. Experimental studies on simulated and real data sets are presented to demonstrate the effectiveness of the proposed algorithms.

Index Terms— robust linear regression, sparse signal recovery, outlier detection, MAP, sparse Bayesian learning

1. INTRODUCTION

Consider the linear regression problem with the model

yi = a^T xi + ei,  for i = 1, 2, ..., M,   (1)

where xi ∈ R^L is usually termed the explanatory variable, yi is the response variable, a ∈ R^L is the vector of regression coefficients, L is the model order, and ei is the measurement noise in the ith response. Model (1) can be compactly represented by

y = Xa + e,   (2)

where X = [x1, x2, ..., xM]^T, e = [e1, e2, ..., eM]^T ∈ R^M, and y = [y1, y2, ..., yM]^T ∈ R^M.
The goal is to determine the regression coefficients a, and this is often achieved by using a suitable optimization criterion. This problem has many important applications in science and engineering. An important factor that makes this problem interesting and challenging is that the response variable y may contain outliers. The popular ordinary least squares (LS) criterion is sensitive to outliers, and hence robust regression methods are of interest. Numerous approaches for robust regression have been developed [1][2][3] with the goal of extracting the model parameters reliably in the presence of outliers. (This research was supported by NSF Grants IIS-0613595 and CCF-0830612.)

1.1. Background

As a popular technique, ordinary LS estimation determines the model parameters a by minimizing Σ_{i=1}^M ẽi², where ẽi = yi − a^T xi is the fitting error. This criterion is sensitive to outliers and hence not robust. Many existing methods for robust regression follow the idea that one should de-emphasize the impact of samples with large deviation in order to obtain robustness. For example, the method of Least Absolute Value (LAV) [4] is probably a well-known representative of this kind. This method minimizes Σ_{i=1}^M |ẽi|, which can be equivalently viewed as imposing a Laplacian distribution on the measurement noise ei. Alternatively, the family of M-estimates [1] considers flexible weighting schemes on the fitting error ẽi. The weighting functions used in M-estimates aim to de-emphasize samples with large deviation, and they can also be related to certain probability densities imposed on the measurement noise. By assuming that the measurement noise ei is drawn from the Student's t-distribution [5], the impact of extreme errors is also effectively downscaled. Further, Gaussian mixture models have also been employed in robust regression, wherein samples of the measurement noise ei are assumed to be i.i.d. and drawn from a mixture of two Gaussians, with one component accounting for regular noise and the other for outliers [6].

In addition, robust procedures that aim to explicitly remove the impact of extreme errors have also been developed. For instance, the method of Least Trimmed Squares (LTS) [7] employs an optimization criterion that minimizes only the portion of the squared fitting errors with smallest magnitudes. The essence of this method can be viewed as roughly detecting outliers and removing their impact at the data fitting stage. This idea can be generalized to various outlier diagnosis techniques [7]. For an extensive survey of previous work on robust regression and outlier detection, interested readers are referred to [1][3][7] and the references therein.

It is interesting to note that the ideas underlying various robust regression methods indeed have counterparts in the context of the sparse signal recovery problem, which has recently received much attention in many application domains [8][9]. The heavy-tailed, outlier-tolerating priors imposed on the measurement noise correspond to the sparsity-inducing distributions in sparse signal recovery. The Laplacian distribution in the LAV method and its use in the corresponding ℓ1-norm minimization based sparse signal recovery algorithms serves as an excellent example of this kind. As another example, the LTS method exhibits ingredients very similar to the thresholding methods used for finding sparse solutions. Our work examines this connection more deeply. Intuitively, this connection is made possible by the fact that outliers are events that occur infrequently and are thus sparse. To make the connection more explicit, in Section 2.1 we develop a two component model for the additive noise and reformulate the regression problem such that the usefulness of sparse recovery methods is evident. Using this new formulation and connection, in Section 2.2 we develop algorithms for robust regression which are based on sparse signal recovery methods. They are then evaluated in Section 3 and demonstrated to be effective in combating the negative impact of outliers.

2. TWO COMPONENT MODEL AND SPARSE SIGNAL RECOVERY ALGORITHMS FOR ROBUST REGRESSION

2.1. The two component model

We leverage the fact that outliers occur infrequently and hence are sparse. Unfortunately, the model (1) leaves us little opportunity to take advantage of this observation, since a single measurement noise term ei deals with both the impact of outliers and regular noise. To explicitly make use of the sparsity of outliers, we suggest an alternative model obtained by splitting ei into two independent additive components, namely wi and ϵi, as follows,

yi = a^T xi + wi + ϵi,  for i = 1, 2, ..., M.   (3)

The interpretations of xi, yi and a are carried over from (1). If a response yi is not an outlier, then the corresponding wi is assumed to be zero. If yi is an outlier, then wi can be viewed as the anomalous error in yi, such that (yi − wi) appears to be a response contaminated only by regular noise. The term ϵi, on the other hand, contains the regular measurement noise in response yi, and it is modeled as i.i.d. zero mean Gaussian noise, i.e. ϵi ∼ N(0, σ²). Compactly, model (3) can be represented by

y = Xa + w + ϵ = [X, I] [a^T, w^T]^T + ϵ,   (4)

where, in addition to (2), w = [w1, w2, ..., wM]^T and ϵ = [ϵ1, ϵ2, ..., ϵM]^T. By definition, w is a sparse vector, which means the number of nonzero entries of w is (much) smaller than the length of w. As we shall see, model (4) enables the opportunity to adapt sparse signal recovery methods to robust regression.
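To make the reformulation concrete, the short Python sketch below (ours, not from the paper; the sizes, noise level, and outlier magnitudes are arbitrary illustrative choices) generates data according to (3) and forms the augmented design matrix [X, I] of (4).

    import numpy as np

    rng = np.random.default_rng(0)
    M, L = 100, 5                          # number of responses and model order (illustrative)
    X = rng.standard_normal((M, L))        # explanatory variables
    a_true = rng.standard_normal(L)        # regression coefficients

    sigma = 0.1                            # regular noise: eps_i ~ N(0, sigma^2)
    eps = sigma * rng.standard_normal(M)

    w_true = np.zeros(M)                   # sparse outlier component: most entries are zero
    idx = rng.choice(M, size=5, replace=False)
    w_true[idx] = 20.0 * rng.choice([-1.0, 1.0], size=5)

    y = X @ a_true + w_true + eps          # model (3)
    X_aug = np.hstack([X, np.eye(M)])      # augmented dictionary [X, I] in (4)

Any recovery method that promotes sparsity only in the w-block of the augmented coefficient vector can then be applied to (4).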
2.2. Sparse signal recovery algorithms for robust linear regression

Since w is sparse, one can utilize ideas from sparse signal recovery to develop robust linear regression methods. In this work, we consider Bayesian methods, namely MAP techniques and empirical Bayesian methods. For the MAP methods, we assume a super-Gaussian prior for w to encourage sparsity; they are discussed next.

2.2.1. Maximum a posteriori (MAP) based robust regression

To simultaneously estimate the regression coefficients a and the outliers w, we propose to solve the following optimization problem,

â, ŵ = arg min_{a,w} ∥y − Xa − w∥₂² + λ∥w∥_p^p,   (5)

where ∥w∥_p = (Σ_{i=1}^M |wi|^p)^{1/p}, 0 < p ≤ 1, and λ is a regularization parameter. This approach can be viewed as MAP estimation with a super-Gaussian prior distribution P(wi) ∝ exp{−λ|wi|^p}, which encourages a sparse w to be recovered. Another closely related algorithm can be immediately obtained as follows,

â, ŵ = arg min_{a,w} ∥w∥_p,  s.t.  ∥y − Xa − w∥₂ ≤ ζ,   (6)

where ζ is a regularization parameter. Note that for p = 1 these two algorithms are variants of well-known sparse signal recovery methods, namely the Lasso [10] (or Basis Pursuit denoising [8]) and the ℓ1-regularization problem [11]. By estimating w, these algorithms determine how each observation is contaminated. In contrast, the LAV method, which can be obtained by letting ζ → 0 in (6), assumes a Laplacian prior on the total noise and minimizes the sum of the absolute values of the fitting errors; as a result, it is not able to clarify the underlying mechanism of noise contamination.

Regarding solving the above optimization problems, both (5) and (6) become convex when p = 1, and (5) with p = 1 will be considered in the simulation study. Motivated by the analysis in [8, Sec. 5.2] and our experience, we choose the regularization parameter λ = σ̃ √(2 log M) / 3, where σ̃ is a proper estimate of scale.¹ Procedures can be developed for other choices of p [12].

¹In our experiments, the Least Absolute Value method is employed to obtain σ̃. Other robust techniques for scale estimation could also be used.
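For the convex case p = 1, problem (5) splits conveniently: with w fixed, a is an ordinary LS fit to (y − w); with a fixed, each wi is a soft-thresholded residual with threshold λ/2. The Python sketch below (ours, not the authors' implementation) illustrates this block-coordinate descent; the MAD-based scale estimate is only a stand-in for the LAV-based σ̃ described in footnote 1.

    import numpy as np

    def soft_threshold(r, t):
        """Entrywise soft-thresholding operator."""
        return np.sign(r) * np.maximum(np.abs(r) - t, 0.0)

    def map_robust_regression(X, y, lam=None, n_iter=100):
        """Block-coordinate descent for (5) with p = 1 (illustrative sketch)."""
        M = len(y)
        if lam is None:
            # Stand-in for lam = sigma_tilde * sqrt(2 log M) / 3, using a MAD-based
            # scale estimate of the LS residuals instead of the LAV-based one.
            r0 = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
            sigma_tilde = 1.4826 * np.median(np.abs(r0 - np.median(r0)))
            lam = sigma_tilde * np.sqrt(2.0 * np.log(M)) / 3.0
        w = np.zeros(M)
        for _ in range(n_iter):
            a = np.linalg.lstsq(X, y - w, rcond=None)[0]    # LS step on outlier-corrected data
            w = soft_threshold(y - X @ a, lam / 2.0)        # sparse outlier update
        return a, w

In practice the loop would be stopped once the change in (a, w) falls below a tolerance.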

2.2.2. Empirical Bayesian inference based robust regression

This method adopts the empirical Bayesian approach for robust regression. In particular, we utilize the sparse Bayesian learning methodology developed in [13][14]. To this end, it is assumed that wi is a random variable with prior distribution wi ∼ N(0, γi), where γi is a hyperparameter that controls the variance of each wi and has to be learnt. If γi = 0, the corresponding wi will be zero, resulting in no anomalous error being added to observation yi. If γi > 0, an anomalous noise term whose magnitude depends on γi will contaminate yi, resulting in an outlier in the measurement.

To estimate the regression coefficients, we jointly find

â, γ̂, σ̂² = arg max_{a,γ,σ²} P(y | X, a, γ, σ²),   (7)

where γ ≜ {γ1, γ2, ..., γM}. Then w can be estimated by the posterior mean, i.e.

ŵ = E[w | X, y, â, γ̂, σ̂²].   (8)

Note that the essence of this method is that the robust regression problem is cast into the framework of sparse Bayesian learning (SBL) with appropriate modifications, and this is made possible by our proposed two component noise modeling technique. The algorithm development, analysis, and experimental study of the original SBL for sparse signal recovery have been extensively discussed in [13][14][15]. Interested readers are referred to these references for more detail.
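For orientation, the updates (9) and (10) below follow from standard Gaussian conditioning on the residual r = y − Xa, with w ∼ N(0, Γ) and ϵ ∼ N(0, σ²I). A short derivation (ours, added for clarity, in LaTeX notation):

    p(w \mid r, \Gamma, \sigma^2) = \mathcal{N}(w;\, \mu, \Sigma), \qquad
    \Sigma = \left(\sigma^{-2} I + \Gamma^{-1}\right)^{-1}, \qquad
    \mu = \sigma^{-2} \Sigma\, r = \left(I + \sigma^{2} \Gamma^{-1}\right)^{-1} r .

The posterior mean μ is exactly (9), and the posterior second moment E[w w^T | r] = μ μ^T + Σ is exactly (10).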

Due to space limitations, we summarize the extended SBL based robust regression algorithm, which can be derived using the Expectation-Maximization (EM) approach, as follows.²

1. Initialize a(0), σ²(0) and γi(0) for i = 1, 2, ..., M. Denote Γ(k) ≜ diag(γ1(k), ..., γM(k)).

2. At iteration k, compute

ŵ(k) = (I + σ²(k−1) Γ(k−1)⁻¹)⁻¹ (y − Xa(k−1))   (9)

Ŵ(k) = ŵ(k) ŵ(k)^T + (σ(k−1)⁻² I + Γ(k−1)⁻¹)⁻¹   (10)

γi(k) = [Ŵ(k)]_{i,i}   (11)

σ²(k) = (1/M) ∥y − Xa(k−1)∥₂² + (1/M) tr(Ŵ(k)) − (2/M) (y − Xa(k−1))^T ŵ(k)   (12)

a(k) = (X^T X)⁻¹ X^T (y − ŵ(k)).   (13)

3. Check for convergence. If the convergence criterion is not satisfied, go to 2. If it has converged, output a(k) as the regression coefficients.

²Inspired by the SBL implementations by the authors of [13][14], we suggest that hyperparameters γi(k) smaller than a predefined threshold be pruned from future iterations.

To provide an interpretation of this algorithm: at each iteration it first computes the posterior mean ŵ(k), obtaining the current estimate of the outlier components, and then performs an ordinary LS estimation on the corrected data, i.e. (y − ŵ(k)). It is also worthwhile to note that this algorithm can be generalized to more general robust regression problems, where the model is y = f(X, a) + w + ϵ and f is a general functional relationship assumed on the data. One can similarly derive the updating rule for a(k) as

Find a  s.t.  (∂f(X, a)/∂a)^T (y − f(X, a) − ŵ(k)) = 0,   (14)

and use the general function f(X, a) in the algorithm as needed.
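Because Γ and I are diagonal, only the diagonal of Ŵ(k) is needed, and the iteration (9)-(13) maps almost line-by-line into code. The Python sketch below is ours (the initialization, iteration count, and pruning threshold in the spirit of footnote 2 are arbitrary choices), not the authors' implementation.

    import numpy as np

    def sbl_robust_regression(X, y, n_iter=200, sigma2_init=1.0, prune_tol=1e-8):
        """EM-style iteration of (9)-(13) for the two component model (sketch)."""
        M = y.shape[0]
        a = np.linalg.lstsq(X, y, rcond=None)[0]        # initialize a with ordinary LS
        gamma = np.ones(M)
        sigma2 = sigma2_init
        w_hat = np.zeros(M)
        for _ in range(n_iter):
            keep = gamma > prune_tol                    # prune tiny hyperparameters (cf. footnote 2)
            r = y - X @ a
            w_hat = np.zeros(M)
            Sigma_diag = np.zeros(M)
            w_hat[keep] = r[keep] / (1.0 + sigma2 / gamma[keep])         # posterior mean, (9)
            Sigma_diag[keep] = 1.0 / (1.0 / sigma2 + 1.0 / gamma[keep])  # posterior variances
            W_diag = w_hat ** 2 + Sigma_diag            # diagonal of (10)
            gamma = W_diag                              # (11)
            sigma2 = (r @ r + W_diag.sum() - 2.0 * (r @ w_hat)) / M      # (12)
            a = np.linalg.lstsq(X, y - w_hat, rcond=None)[0]             # (13)
        return a, w_hat, gamma, sigma2

For the generalized model y = f(X, a) + w + ϵ, the last update would be replaced by a solver for (14).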

3. EXPERIMENTAL STUDY

3.1. Robust regression with simulated data sets

To study the statistical behavior of the proposed algorithms, we consider the following multiple linear regression problem,

yi = Σ_{k=1}^5 ak xk,i + ei,  i = 1, 2, ..., M,   (15)

where a = [1, 2, −1.5, −3, 2.5]^T and M = 100. The explanatory variables are independently generated according to x1,i ∼ U(1, 31), x2,i ∼ U(−200, −150), x3,i ∼ Laplacian(1, 10), x4,i ∼ N(10, 5²), and x5,i ∼ Poisson(10).

Consider the following two cases. (i) Symmetric outlier distribution: we assume ei ∼ (1 − δ)N(0, 0.1²) + δξ, where ξ ∼ N(b, η²), b takes values equally likely on {−20, 20}, η ∼ U(0, 10), and δ controls the percentage of outlier contamination. (ii) Asymmetric outlier distribution: we assume ei ∼ (1 − δ)N(0, 0.1²) + δξ, where ξ ∼ N(−20, η²) and η ∼ U(0, 10). For each case, two different levels of outlier contamination are considered, namely δ = 5% and δ = 30%.

The MAP based robust regression algorithm (5) of Section 2.2.1 (denoted Alg1) and the empirical Bayesian inference based algorithm of Section 2.2.2 (denoted Alg2) are used to compute the regression coefficients. Additionally, the following algorithms are also employed: (i) Least Absolute Value (LAV); (ii) Least Trimmed Squares (LTS), where 75% of the squared errors are kept; (iii) M-estimate (M), where Huber's function is employed; (iv) Gaussian mixture model for the noise (GM); (v) Student's t-distribution for the noise (Stu-t), i.e. P(ei | ν, θ) ∝ [1 + θ ei²/ν]^{−(ν+1)/2} with ν fixed to 3.

For each algorithm, 5000 random data sets are processed. The performances are compared in terms of the empirical bias and the empirical variance of the estimate of each regression coefficient. The results are shown in Figs. 1 and 2. Subplots (a) and (b) in each figure show the empirical bias and the empirical variance, respectively, for 5% outlier contamination; subplots (c) and (d) are for 30% outlier contamination.

Fig. 1. Empirical bias and variance of â1, ..., â5 for LAV, LTS, M, GM, Stu-t, Alg1, and Alg2 (symmetric outlier case).

Fig. 2. Empirical bias and variance (asymmetric outlier case; legends are the same as in Fig. 1).

Analyzing the results: first, our proposed algorithms, especially Alg2, show consistent performance with lower bias and lower variance in most cases. Second, our methods serve as feasible algorithmic choices over a large range of outlier contamination percentages, as they work well with both small (5%) and large (30%) proportions of outliers; they exhibit the behavior of a higher-breakdown-point method. Third, the empirical Bayesian algorithm (Alg2) actually outperforms the MAP type algorithm (5) (Alg1) in most cases. This observation echoes the fact that empirical Bayesian inference can perform better since the posterior mean of the hyperparameter is more representative of the posterior mass [16].

3.2. Robust regression with real data sets

3.2.1. Brownlee's stackloss data set

This data set, which has been extensively studied in the statistical literature (cf. [7, p. 76]), contains 21 four-dimensional observations regarding the operation of a plant for the oxidation of ammonia to nitric acid. Due to space limitations, we select several algorithms for linear regression. Following the methodology in [7, Chap. 3], the index plots associated with the different algorithms are shown in Fig. 3.

Fig. 3. Index plots (standardized residual versus observation index) for LS, LAV, Alg1, and Alg2. The interval [−2.5, 2.5] is marked by red lines for inspecting outliers.

As seen in Fig. 3, our algorithms exhibit very similar results to the LAV method for this data set. Observations 1, 3, 4, and 21 can be identified as outliers, which is consistent with existing analyses of this data set [5][7].
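An index plot of this kind can be produced from any of the fits by standardizing the residuals and flagging the observations that fall outside [−2.5, 2.5]. The sketch below is ours; the MAD-based scale is one reasonable choice and not necessarily the one used for Fig. 3.

    import numpy as np

    def index_plot_flags(X, y, a_hat, band=2.5):
        """Standardized residuals and the (1-based) indices flagged as outliers."""
        r = y - X @ a_hat
        scale = 1.4826 * np.median(np.abs(r - np.median(r)))   # robust (MAD-based) scale
        r_std = r / scale
        flagged = np.where(np.abs(r_std) > band)[0] + 1
        return r_std, flagged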
Furthermore, for each explanatory variable (e.v.), the estimated coefficient (âi), its standard deviation (std), and its t-value (t-v) are tabulated for LS and Alg1 as follows.

              LS                        Alg1
  e.v.     âi       std      t-v       âi       std      t-v
  rate     0.716    0.135    5.307     0.834    0.073    11.46
  temp.    1.295    0.368    3.520     0.596    0.180    3.330
  acid.   -0.152    0.156   -0.973    -0.071    0.067   -1.062
  const.  -39.92    11.90   -3.356    -39.47    5.105   -7.732

The LS estimation identifies that the regression coefficient for the acid concentration (acid.) is not significantly different from zero at the 5% level. This is confirmed by Alg1 (and also by LAV and Alg2).³ By proper treatment of the outliers, the significance of the rate and of the constant term (const.) is enhanced by Alg1 (and also by LAV and Alg2), and narrower confidence intervals can be constructed. Combining the analysis based on the index plots and the test statistics, we conclude that the robust regression methods could be more trustworthy in revealing the underlying pattern in this data set.

³For LS, use t_{17,0.05} = 2.11. For LAV and Alg1/2, use t_{13,0.05} = 2.16.

3.2.2. The Bupa liver data set

This data set [17] mainly contains blood test results regarding liver function for 345 patients. We focus on the task of using the AST and γGT levels to linearly predict the ALT level. The data is processed by a log-transformation [18]. Results are shown in the following table.

              LS                        LTS
  e.v.     âi       std      t-v       âi       std      t-v
  AST      0.693    0.066    10.57     0.788    0.061    12.87
  γGT      0.204    0.030    6.871     0.207    0.026    7.794
  const.   0.425    0.176    2.408     0.145    0.166    0.870

              Alg1                      Alg2
  e.v.     âi       std      t-v       âi       std      t-v
  AST      0.736    0.062    11.92     0.756    0.062    12.24
  γGT      0.202    0.027    7.557     0.208    0.027    7.796
  const.   0.306    0.168    1.821     0.220    0.168    1.308

The LS estimation indicates that all variables and the constant term (intercept) are significant at the 5% level. However, the robust methods we tested here inform us that the constant term is actually not significant.⁴ Given this observation, we re-perform the linear regression without the constant term. To examine the difference, we perform the F-test with null hypothesis H0: the constant term is equal to zero. The results are summarized in the following table.

⁴Specifically, df_LTS = 331 and df_Alg1 = df_Alg2 = 332. Hence, t_{∞,0.05} = 1.96 and F_{1,∞,0.05} = 3.84 can be used.

             LS       LTS      Alg1     Alg2
  F-value    5.801    1.995    2.708    1.959

We can see that, except for LS, the F-values for all the robust regression methods indicate that the constant term is not significant at the 5% level, which confirms our earlier observation.
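For the LS column, the F-value can be reproduced along the lines of the sketch below (ours); the robust columns would instead use the corresponding robust fits together with the degrees of freedom given in footnote 4.

    import numpy as np
    from scipy import stats

    def f_test_constant_term(Z, y):
        """F-test of H0: constant term = 0, based on ordinary LS fits (sketch).

        Z holds the explanatory variables without the constant column.
        """
        n, k = Z.shape
        Z1 = np.hstack([Z, np.ones((n, 1))])                                  # model with intercept
        rss1 = np.sum((y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]) ** 2)  # full model
        rss0 = np.sum((y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]) ** 2)    # no intercept
        df = n - (k + 1)
        F = (rss0 - rss1) / (rss1 / df)
        return F, stats.f.sf(F, 1, df)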
methods we tested here inform us that the constant term is actually Pure Appl. Math, vol. LIX, pp. 1027–1233, 2006. not significant.4 Having this observation, we re-perform the linear regression without constant term. To examine the difference, we [12] I. F. Gorodnitsky and B. D. Rao, “Sparse signal reconstruc- tion from limited data using focuss: A re-weighted norm mini- perform the F -test with null hypothesis H0 : the constant term is equal to zero. The results are summarized in the following table. mization algorithm,” IEEE Trans. Sig. Proc., vol. 45, no. 3, pp. 600–616, 1997. LS LTS Alg1 Alg2 [13] M. E. Tipping, “Sparse bayesian learning and the relevance F -value 5.801 1.995 2.708 1.959 vector machine,” Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001. We can see that except LS, the F -values for all the robust re- [14] D. P. Wipf and B. D. Rao, “Sparse bayesian learning for basis gression methods indicate that the constant term is not significant at selection,” IEEE Transaction Signal Processing, vol. 52, no. 8, 5% level, which confirms our earlier observation. To be complete, pp. 2153–2164, 2004. the simpler linear models learned by different algorithms are given [15] D. P. Wipf, J. A. Palmer, and B. D. Rao, “Perspectives on as follows, sparse bayesian learning,” NIPS, 2004. LTS Alg1 Alg2 [16] D. J. C. MacKay, “Comparison of approximate methods for e.v. aˆi t-v aˆi t-v aˆi t-v handling hyperparameters,” Neural Computation, vol. 11, pp. AST 0.835 29.28 0.830 28.85 0.834 28.98 1035–1068, 1999. γGT 0.205 7.752 0.204 7.630 0.202 7.555 [17] L. Breiman, “Statistical modeling: the two cultures,” Statisti- cal Sci., vol. 16, no. 3, pp. 199–231, 2001. 3 For LS, use t17,0.05 = 2.11. For LAV, Alg1/2, use t13,0.05 = 2.16. [18] G. E. P. Box and D. R. Cox, “An analysis of transformations,” 4 Specifically, dfLTS = 331, dfAlg1 = dfAlg2 = 332. Hence, t∞,0.05 = J. Royal Stat. Society, vol. 26, no. 2, pp. 211–252, 1964. 1.96, F1,∞,0.05 = 3.84 can be used.