Adaptive LASSO Based on Joint M-Estimation of Regression and Scale

2016 24th European Signal Processing Conference (EUSIPCO)

Esa Ollila
Aalto University, Dept. of Signal Processing and Acoustics, P.O. Box 13000, FI-00076 Aalto, Finland

Abstract—The adaptive Lasso (Least Absolute Shrinkage and Selection Operator) obtains the oracle variable selection property by using cleverly chosen adaptive weights for the regression coefficients in the $\ell_1$-penalty. In this paper, in the spirit of M-estimation of regression, we propose a class of adaptive M-Lasso estimates of regression and scale as solutions to generalized zero subgradient equations. The defining estimating equations depend on a differentiable convex loss function, and choosing the LS-loss function yields the standard adaptive Lasso estimate and the associated scale statistic. An efficient algorithm, a generalization of the cyclic coordinate descent algorithm, is developed for computing the proposed M-Lasso estimates. We also propose an adaptive M-Lasso estimate of regression with a preliminary scale estimate that uses a highly robust bounded loss function. A unique feature of the paper is that we consider complex-valued measurements and regression parameters. The consistent variable selection property of the adaptive M-Lasso estimates is illustrated with a simulation study.

Index Terms—Adaptive Lasso, M-estimation, penalized regression, sparsity, variable selection

I. INTRODUCTION

We consider the complex-valued linear model $y = \Phi\beta + \varepsilon$, where $\Phi$ is a known $n \times p$ complex-valued measurement matrix (or matrix of predictors), $\beta = (\beta_1, \ldots, \beta_p)^\top$ is the unknown vector of complex-valued regression coefficients (or system parameters), and $\varepsilon \in \mathbb{C}^n$ denotes the additive noise. For ease of exposition, we consider the centered linear model (i.e., we assume that the intercept is equal to zero). The primary interest is to estimate the unknown parameter $\beta$ given $y \in \mathbb{C}^n$ and $\Phi \in \mathbb{C}^{n \times p}$. When the linear system is underdetermined ($p > n$) or $p \approx n$, the least squares estimate (LSE) $\hat\beta_{\mathrm{LS}} = \arg\min_\beta \tfrac{1}{2}\|y - \Phi\beta\|_2^2$ does not have a unique solution or is subject to very high variance. Furthermore, for a large number of predictors one wishes to find a sparse solution, meaning that $\hat\beta_j = 0$ for most $j \in \{1, \ldots, p\}$, so that only the predictors that exhibit the strongest effects are selected. A common approach in the above cases is to use penalized/regularized regression with the sparsity-enforcing $\ell_1$-penalty as in the Lasso [1]. The Lasso, however, inherits the non-robustness (sensitivity to outliers) of the LSE as well as its inefficiency when the noise follows a heavy-tailed non-Gaussian distribution.

The adaptive Lasso [2] uses adaptive weights for penalizing different coefficients in the $\ell_1$-penalty. The weighted Lasso solves a weighted $\ell_1$-penalized LS regression problem,

$$\underset{\beta \in \mathbb{C}^p}{\text{minimize}} \;\Big\{ \tfrac{1}{2}\|y - \Phi\beta\|_2^2 + \lambda \|w \circ \beta\|_1 \Big\} \qquad (1)$$

where $\lambda > 0$ is the shrinkage (penalty) parameter, chosen by the user, $w = (w_1, \ldots, w_p)^\top$ is a vector of non-negative weights, and $\circ$ is the Hadamard product, i.e., the component-wise product of two vectors. Thus $\|w \circ \beta\|_1 = \sum_{j=1}^{p} w_j |\beta_j|$. The standard Lasso is obtained when $w_j \equiv 1$. The adaptive Lasso was proposed in the real-valued case, but it can be extended to the complex-valued case in a straightforward manner. The adaptive Lasso is obtained when $\lambda = \lambda_n$ depends on the sample size $n$ and the weights are data dependent, defined as $\hat w_j = 1/|\hat\beta_{\mathrm{init},j}|^\gamma$ for $\gamma > 0$, where $\hat\beta_{\mathrm{init}} \in \mathbb{C}^p$ is a root-$n$-consistent initial estimator of $\beta$. It was shown in [2] that if $\lambda_n/\sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$, then the adaptive Lasso estimate enjoys oracle properties (consistent variable selection and the same asymptotic normal distribution as the LSE that knows the true model). It should be noted that the root-$n$ consistency of $\hat\beta_{\mathrm{init}}$ can be relaxed; see [2] for discussion. In this paper, we use $\gamma = 1$ and the standard ($w_j \equiv 1$) Lasso estimate $\hat\beta_\lambda$ as $\hat\beta_{\mathrm{init}}$, as in [3].

The M-estimates of regression [4] are defined as solutions to generalized normal equations that depend on a score function $\psi$, which is the first derivative of the loss function $\rho(x)$, $\psi(x) = \rho'(x)$. Commonly used loss functions are the standard LS loss $\rho(x) = |x|^2$ and the robust Huber's loss function. Most robust loss and score functions require a preliminary estimate of the scale of the error distribution. In this paper, we propose a class of weighted/adaptive Lasso estimates following the spirit of M-estimation; namely, we define the weighted M-Lasso estimates of regression and scale as solutions to generalized zero subgradient equations that also depend on a score function. When the associated loss function is the LS-loss, these equations are a necessary and sufficient condition for a solution of the weighted Lasso problem (1). Furthermore, we develop a simple and efficient algorithm to compute the weighted M-Lasso estimate. This algorithm is a natural generalization of the cyclic coordinate descent (CCD) algorithm [5], which is the current state-of-the-art method for computing the Lasso solution (1).

The paper is organized as follows. Robust loss functions and their properties are outlined in Section II. As examples we consider Huber's loss and the highly robust (non-convex) Tukey's loss, and introduce the notion of a pseudo-residual vector. In Section III, we define the M-Lasso estimates of regression and scale and develop the generalized CCD algorithm for computing the solution. Section IV provides simulation studies to illustrate the model selection abilities and prediction accuracy of the proposed method in various noise conditions.
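As a concrete illustration of the weighted Lasso problem (1) and of the adaptive weight construction $\hat w_j = 1/|\hat\beta_{\mathrm{init},j}|^\gamma$, the following Python/NumPy sketch implements plain cyclic coordinate descent [5] for complex-valued data in the non-robust LS-loss special case, together with the weight computation. It is a minimal sketch under the conventions above, not the paper's generalized M-Lasso CCD of Section III; all function names are mine.

```python
import numpy as np

def soft_threshold(z, t):
    """Complex soft-thresholding: shrink the modulus of z towards zero by t."""
    mag = np.abs(z)
    return np.where(mag > t, (1.0 - t / np.maximum(mag, 1e-300)) * z, 0.0)

def weighted_lasso_ccd(y, Phi, lam, w=None, n_iter=200, tol=1e-8):
    """Cyclic coordinate descent for the weighted complex Lasso problem (1):
    minimize 0.5*||y - Phi @ beta||_2^2 + lam * sum_j w_j * |beta_j|."""
    n, p = Phi.shape
    w = np.ones(p) if w is None else w
    beta = np.zeros(p, dtype=complex)
    r = y.astype(complex).copy()                  # full residual y - Phi @ beta
    col_norm2 = np.sum(np.abs(Phi) ** 2, axis=0)  # ||phi_j||_2^2
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            # phi_j^H (residual with coordinate j added back in)
            zj = np.vdot(Phi[:, j], r) + col_norm2[j] * beta[j]
            bj_new = soft_threshold(zj / col_norm2[j], lam * w[j] / col_norm2[j])
            r += Phi[:, j] * (beta[j] - bj_new)   # keep the residual in sync
            beta[j] = bj_new
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

def adaptive_weights(beta_init, gamma=1.0, eps=1e-12):
    """Adaptive Lasso weights w_j = 1/|beta_init_j|^gamma (paper uses gamma = 1).
    eps only guards against division by zero; an exact zero in beta_init
    corresponds to an effectively infinite penalty weight."""
    return 1.0 / (np.abs(beta_init) ** gamma + eps)
```

With `w=None` the sketch returns the standard Lasso estimate $\hat\beta_\lambda$, which can then be passed to `adaptive_weights` and used for a second, weighted fit; this mirrors the two-step $\gamma = 1$ scheme described above.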
Notations. The vector space $\mathbb{C}^n$ is equipped with the usual Hermitian inner product, $\langle a, b \rangle = a^{\mathsf H} b$, where $(\cdot)^{\mathsf H} = [(\cdot)^*]^\top$ denotes the Hermitian (complex conjugate) transpose. This induces the conventional (Hermitian) $\ell_2$-norm $\|a\|_2 = \sqrt{a^{\mathsf H} a}$. The $\ell_1$-norm is defined as $\|a\|_1 = \sum_{i=1}^{n} |a_i|$, where $|a| = \sqrt{a^* a} = \sqrt{a_R^2 + a_I^2}$ denotes the modulus of a complex number $a = a_R + j a_I$. For a matrix $A \in \mathbb{C}^{n \times p}$, we denote by $a_i \in \mathbb{C}^n$ its $i$th column vector, and $a_{i\cdot} \in \mathbb{C}^p$ denotes the Hermitian transpose of its $i$th row vector. Hence, we may write the measurement matrix $\Phi \in \mathbb{C}^{n \times p}$ as $\Phi = (\phi_1 \ \cdots \ \phi_p) = (\phi_{1\cdot} \ \cdots \ \phi_{n\cdot})^{\mathsf H}$.

[Fig. 1. Surface plots of the robust loss functions: Huber's loss $\rho_{H,c}(x)$ and Tukey's loss $\rho_{T,c}(x)$.]

II. ROBUST LOSS FUNCTIONS AND PSEUDO-RESIDUALS

Suppose that the noise terms $\varepsilon_i$ are i.i.d. continuous random variables from a circular distribution [6] with p.d.f. $f(e) = (1/\sigma) f_0(e/\sigma)$, where $f_0(e)$ denotes the standard form of the density and $\sigma > 0$ is the scale parameter. If $\sigma$ is known, then an M-estimator $\hat\beta$ solves

$$\sum_{i=1}^{n} \phi_{i\cdot}\, \psi\!\left(\frac{y_i - \phi_{i\cdot}^{\mathsf H}\hat\beta}{\sigma}\right) = 0 \qquad (2)$$

where $\psi : \mathbb{C} \to \mathbb{C}$, called the score function, is a complex conjugate derivative [7] of the loss function $\rho : \mathbb{C} \to \mathbb{R}_0^+$. As in [8], a function $\rho : \mathbb{C} \to \mathbb{R}_0^+$ is called a loss function if it is circularly symmetric (i.e., $\rho(e^{j\theta}x) = \rho(x)$ for all $\theta \in \mathbb{R}$), $\mathbb{R}$-differentiable, increasing in $|x| > 0$, and satisfies $\rho(0) = 0$. Due to the circularity assumption, $\rho(x) = \rho_0(|x|)$ for some $\rho_0 : \mathbb{R}_0^+ \to \mathbb{R}_0^+$, and hence the score function becomes

$$\psi(x) = \frac{\partial}{\partial x^*}\,\rho(x) = \frac{1}{2}\left(\frac{\partial \rho}{\partial x_R} + j\,\frac{\partial \rho}{\partial x_I}\right) = \frac{1}{2}\,\rho_0'(|x|)\,\mathrm{sign}(x),$$

where

$$\mathrm{sign}(x) = \begin{cases} x/|x|, & \text{for } x \neq 0 \\ 0, & \text{for } x = 0 \end{cases}$$

is the complex signum function and $\rho_0'$ denotes the real derivative of the real-valued function $\rho_0$.
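To make these definitions concrete, the following Python/NumPy sketch (helper names are mine, assuming the conventions above) evaluates the complex signum function, the generic score $\psi(x) = \tfrac{1}{2}\rho_0'(|x|)\,\mathrm{sign}(x)$, and the left-hand side of the estimating equation (2) for a candidate $\beta$; the result is the zero vector at an M-estimate.

```python
import numpy as np

def csign(x):
    """Complex signum: x/|x| for x != 0 and 0 at the origin."""
    x = np.asarray(x, dtype=complex)
    mag = np.abs(x)
    return np.where(mag > 0, x / np.maximum(mag, 1e-300), 0.0)

def score(rho0_prime, x):
    """Generic score psi(x) = 0.5 * rho0'(|x|) * sign(x) for a circularly
    symmetric loss rho(x) = rho0(|x|); rho0_prime is the real derivative."""
    return 0.5 * rho0_prime(np.abs(x)) * csign(x)

def m_estimating_equation(beta, y, Phi, sigma, psi):
    """Left-hand side of (2): sum_i phi_i. * psi((y_i - phi_i.^H beta)/sigma).
    Since the rows of Phi are phi_i.^H, this is Phi^H applied to the scored
    scaled residuals."""
    r = (y - Phi @ beta) / sigma
    return Phi.conj().T @ psi(r)
```

For the LS loss, $\rho_0(t) = t^2$ gives $\rho_0'(t) = 2t$, so $\psi(x) = x$ and (2) reduces to the usual normal equations $\Phi^{\mathsf H}(y - \Phi\beta) = 0$.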
Huber's loss function $\rho_{H,c}(x)$ is a hybrid of the $\ell_2$- and $\ell_1$-loss functions $\rho(x) = |x|^2$ and $\rho(x) = |x|$, respectively, using the $\ell_2$-loss for relatively small errors and the $\ell_1$-loss for relatively large errors. Moreover, it is convex. Huber's score function becomes

$$\psi_{H,c}(x) = \begin{cases} x, & \text{for } |x| \le c \\ c\,\mathrm{sign}(x), & \text{for } |x| > c. \end{cases}$$

We use $c = 1.215$ as our default choice, which gives approximately 95% efficiency under complex Gaussian noise.

Tukey's biweight function is another commonly used loss function [4]. We define it for complex-valued measurements as

$$\rho_{T,c}(x) = \frac{c^2}{3}\,\min\!\left\{ 1,\; 1 - \bigl(1 - (|x|/c)^2\bigr)^3 \right\}.$$

Tukey's loss function is bounded, which makes it very robust to large outliers. As a consequence, it is also non-convex. The respective score function is

$$\psi_{T,c}(x) = \begin{cases} x\bigl(1 - (|x|/c)^2\bigr)^2, & \text{for } |x| \le c \\ 0, & \text{for } |x| > c. \end{cases}$$

Thus large residuals $r_i = y_i - \phi_{i\cdot}^{\mathsf H}\beta$ are completely rejected, i.e., they have zero weight in (2). For Tukey's loss function, we use $c = 3.0$ as our default choice, which gives approximately 85% efficiency under complex Gaussian noise. Huber's and Tukey's loss functions are depicted in Figure 1.

An objective function approach for M-estimation, on the other hand, defines an M-estimate of regression (again assuming $\sigma$ is known) as a solution to an optimization program.
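The two score functions above translate directly into code. The following Python/NumPy sketch (function names are mine) transcribes $\psi_{H,c}$, $\psi_{T,c}$, and Tukey's loss $\rho_{T,c}$ with the paper's default tuning constants; the score functions can be plugged into the estimating-equation helper sketched after (2).

```python
import numpy as np

def psi_huber(x, c=1.215):
    """Huber's score: x for |x| <= c, c*sign(x) otherwise.
    c = 1.215 is the paper's default (about 95% efficiency under
    complex Gaussian noise)."""
    x = np.asarray(x, dtype=complex)
    mag = np.abs(x)
    return np.where(mag <= c, x, c * x / np.maximum(mag, 1e-300))

def psi_tukey(x, c=3.0):
    """Tukey's biweight score: x*(1 - (|x|/c)^2)^2 for |x| <= c, 0 otherwise,
    so residuals larger than c get zero weight in (2). c = 3.0 is the
    paper's default (about 85% efficiency under complex Gaussian noise)."""
    x = np.asarray(x, dtype=complex)
    mag = np.abs(x)
    return np.where(mag <= c, x * (1.0 - (mag / c) ** 2) ** 2, 0.0)

def rho_tukey(x, c=3.0):
    """Tukey's loss rho_{T,c}(x) = (c^2/3) * min{1, 1 - (1 - (|x|/c)^2)^3};
    bounded above by c^2/3 for |x| > c."""
    mag = np.abs(np.asarray(x))
    return (c ** 2 / 3.0) * np.minimum(1.0, 1.0 - (1.0 - (mag / c) ** 2) ** 3)
```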
