Adaptive LASSO Based on Joint M-Estimation of Regression and Scale

2016 24th European Signal Processing Conference (EUSIPCO) Adaptive LASSO based on joint M-estimation of regression and scale Esa Ollila Aalto University, Dept. of Signal Processing and Acoustics, P.O.Box 13000, FI-00076 Aalto, Finland Abstract—The adaptive Lasso (Least Absolute Shrinkage and where λ > 0 is the shrinkage (penalty) parameter, chosen Selection Operator) obtains oracle variable selection property by by the user, w =(w1,...,wp)> is a vector of non-negative using cleverly chosen adaptive weights for regression coefficients weights, and is the Hadamard product, i.e., the component- in the `1-penalty. In this paper, in the spirit of M-estimation of ◦ p wise product of two vectors. Thus w β = w β . regression, we propose a class of adaptive M-Lasso estimates of k ◦ k1 j=1 j| j| Standard Lasso is obtained when wj 1. Adaptive Lasso regression and scale as solutions to generalized zero subgradient ⌘ P equations. The defining estimating equations depend on a differ- was proposed in the real-valued case, but it can be extended entiable convex loss function and choosing the LS-loss function to complex-valued case in straightforward manner. Adaptive yields the standard adaptive Lasso estimate and the associated Lasso is obtained when λ = λn depends on the sample scale statistic. An efficient algorithm, a generalization of the n wˆ = cyclic coordinate descent algorithm, is developed for computing size and the weights are data dependent, defined as j ˆ γ ˆ p M M 1/ βinit,j for γ > 0, where βinit C is a root-n-consistent the proposed -Lasso estimates. We also propose adaptive - | | 2 Lasso estimate of regression with preliminary scale estimate that initial estimator to β. It was shown in [2] that if λn/pn 0 (γ 1)/2 ! uses a highly-robust bounded loss function. A unique feature of and λ n − , then the adaptive Lasso estimate enjoys n !1 the paper is that we consider complex-valued measurements and oracle properties (consistent variable selection and the same regression parameter. Consistent variable selection property of M asymptotic normal distribution as the LSE that knows the true the adaptive -Lasso estimates are illustrated with a simulation ˆ study. model). It should be noted that the root-n consistency of βinit can be relaxed, see [2] for discussion. In this paper, we use Index Terms—Adaptive Lasso, M-estimation, penalized regres- γ =1and the standard (w 1) Lasso estimate βˆ as βˆ sion, sparsity, variable selection j ⌘ λ init as in [3]. The M-estimates of regression [4] are defined as solu- I. INTRODUCTION tions to generalized normal equations that depend on a score function which is the first derivative of the loss function We consider the complex-valued linear model y = Φβ + ⇢(x), (x)=⇢ (x). Commonly used loss functions are ", where Φ is a known n p complex-valued measurement 0 ⇥ the standard LS loss ⇢(x)= x 2 or the robust Huber’s matrix (or matrix of predictors), β =(β1,...,βp)> is the | | loss function. Most robust loss and score functions require unknown vector of complex-valued regression coefficients (or a preliminary estimate of the scale of the error distribution. system parameters) and " Cn denotes the additive noise. For 2 In this paper, we propose a class of weighted/adaptive Lasso ease of exposition, we consider the centered linear model (i.e., estimates following the spirit of M-estimation; namely, we we assume that the intercept is equal to zero). The primary define the weighted M-Lasso estimates of regression and scale interest is to estimate the unknown parameter β given y Cn n p 2 as solutions to generalized zero subgradient equations that also and Φ C ⇥ . When the linear system is underdetermined 2 ˆ depend on a score function. When the associated loss function (p>n) or p n, the least squares estimate (LSE) βLS = ⇡ 2 is the LS-loss, these equations are a sufficient and necessary arg min 1 y Φβ does not have a unique solution or is β 2 k − 2 condition of a solution to the weighted Lasso problem (1). subject to a very high variance. Furthermore, for large number Furthermore, we develop a simple and efficient algorithm to of predictors, one wish to find a sparse solution, meaning that compute the weighted M-Lasso estimate. This algorithm is βˆ =0for most j 1,...,p , so that only the predictors that j 2 { } a natural generalization of cyclic coordinate descent (CCD) exhibit the strongest effects are selected. A common approach algorithm [5] which is the current state-of-the-art method for in the above cases it to use penalized/regularized regression computing the Lasso solution (1). with sparsity enforcing ` -penalty as in Lasso [1]. The Lasso, 1 The paper is organized as follows. Robust loss functions however, inherits the non-robustness (sensitivity to outliers) and their properties are outlined in Section II. As examples of LSE as well as its inefficiency when the noise follows a we consider the Huber’s loss and highly-robust (non-convex) heavy-tailed non-Gaussian distribution. Tukey’s loss and introduce the notion of pseudo-residual The adaptive Lasso [2] uses adaptive weights for penalizing vector. In Section III, we define the M-Lasso estimates of re- different coefficients in the ` -penalty. The weighted Lasso 1 gression and scale and develop the generalized CCD algorithm solves a weighted ` -penalized LS regression problem, 1 for computing the solution. Section IV provides simulation 1 2 studies to illustrate the model selection abilities and prediction minimize y Φβ + λ w β 1 (1) β p 2 − 2 k ◦ k accuracy of the proposed method in various noise conditions. 2C n o 978-0-9928-6265-7/16/$31.00 ©2016 IEEE 2191 2016 24th European Signal Processing Conference (EUSIPCO) n Huber’s loss ⇢ (x) Tukey’s loss ⇢ (x) Notations. The vector space C is equipped with the usual H,c T,c H H Hermitian inner product, a, b = a b, where ( ) =[()⇤]> h i · · 14 denotes the Hermitian (complex conjugate) transpose. This 3 12 p H 2.5 induces the conventional (Hermitian) `2-norm a 2 = a a. 10 nk k 2 ) 8 The `1-norm is the defined as a 1 = ai , where ) i=1 x x 1.5 ( 6 2 2 k k | | ( p ; a = a a = a + a denotes the modulus of a complex ; ⇤ R I 1 | | Pn p 4 number a = aR + |aI . For a matrix A C ⇥ , we denote 2 0.5 p 2 n th p 0 0 by ai C its i column vector and ai C denotes -4 4 -4 -2 4 2 th · 2 2 -2 0 2 the Hermitian transpose of its i row vector. Hence, we 0 0 0 2 -2 2 -2 n p 4 may write the measurement matrix Φ C ⇥ as Φ = Re(x) -4 Im(x) 4 -4 H 2 Re(x) Im(x) φ1 φp = φ1 φn . ··· · ··· · Fig. 1. Surface plots of the robust loss functions II. ROBUST LOSS FUNCTIONS AND PSEUDO-RESIDUALS 2 Suppose that the noise terms "i are i.i.d. continuous random is a hybrid of ` and ` loss functions ⇢(x)= x and 2 1 | | variables from a circular distribution [6] with p.d.f. f(e)= ⇢(x)= x , respectively, using ` -loss for relatively small | | 2 (1/σ)f0(e/σ), where f0(e) denotes the standard form of the errors and `1-loss for relatively large errors. Moreover, it is density and σ > 0 is the scale parameter. If σ is known, then convex. Huber’s score function becomes an M-estimator βˆ solves x, for x c n y φHβˆ H,c(x)= | | . φ i − i = 0 (2) (c sign(x), for x >c − i σ · | | i=1 · X ✓ ◆ We use c =1.215 as our default choice which gives approxi- where : C C, called the score function, is a complex mate 95% efficiency at the complex Gaussian noise. ! + conjugate derivative [7] of the loss function ⇢ : C R0 . Tukey biweight function is another commonly used loss + ! As in [8], a function ⇢ : C R is called a loss function ! 0 function [4]. We define it for complex-values measurements if it is circularly symmetric (i.e., ⇢(e|✓x)=⇢(x) ✓ R), 8 2 as R-differentiable, increasing in x > 0 and satisfies ⇢(0) = 0. | | 2 2 3 Due to circularity assumption, ⇢(x)=⇢0( x ) for some ⇢0 : ⇢T,c(x)=(c /3) min 1, 1 1 ( x /c) . + + | | − − | | R R and hence the score function becomes n o 0 ! 0 Tukey’s loss function is bounded, which makes it very robust @ 1 @⇢ @⇢ 1 to large outliers. As a consequence, it is also non-convex. The (x)= ⇢(x)= + | = ⇢0 ( x )sign(x), @x 2 @x @x 2 0 | | respective score function is ⇤ ✓ R I ◆ 2 where 2 x/ x , x =0 x 1 ( x /c) for x c for T,c(x)= − | | | | . sign(x)= | | 6 8 0, for x =0 0,⇣ ⌘ for x >c ( < | | H is the complex signum function and ⇢0 denotes the real Thus large residuals:ri = yi φ β are completely rejected, − i derivative of the real-valued function ⇢0. i.e., they have zero weight in (2).· For Tukey’s loss function, An objective function approach for M-estimation, on the we use c =3.0 as our default choice which gives approximate other hand, defines an M-estimate of regression (again as- 85% efficiency at the complex Gaussian noise. Huber’s and suming σ is known) as a solution to an optimization program Tukey’s loss functions are depicted in Figure 1.

Adaptive LASSO Based on Joint M-Estimation of Regression and Scale

A Robust Hybrid of Lasso and Ridge Regression

Lasso Reference Manual Release 17

Overfitting Can Be Harmless for Basis Pursuit, but Only to a Degree

Least Squares After Model Selection in High-Dimensional Sparse Models.” DOI:10.3150/11-BEJ410SUPP

Modern Regression 2: the Lasso

On Lasso Refitting Strategies

Lecture 2: Overfitting. Regularization

A Revisit to De-Biased Lasso for Generalized Linear Models

Chapter 5: Lasso and Sparsity in Statistics

High-Dimensional Generalized Linear Models and the Lasso

Data Mining Model Selection

Performance Analysis of Regularized Linear Regression Models for Oxazolines and Oxazoles Derivatives Descriptor Dataset