2016 24th European Signal Processing Conference (EUSIPCO)

Adaptive LASSO based on joint M-estimation of regression and scale

Esa Ollila, Aalto University, Dept. of Signal Processing and Acoustics, P.O. Box 13000, FI-00076 Aalto, Finland

Abstract—The adaptive Lasso (Least Absolute Shrinkage and Selection Operator) obtains the oracle variable selection property by using cleverly chosen adaptive weights for the regression coefficients in the $\ell_1$-penalty. In this paper, in the spirit of M-estimation of regression, we propose a class of adaptive M-Lasso estimates of regression and scale as solutions to generalized zero subgradient equations. The defining estimating equations depend on a differentiable convex loss function, and choosing the LS-loss function yields the standard adaptive Lasso estimate and the associated scale statistic. An efficient algorithm, a generalization of the cyclic coordinate descent algorithm, is developed for computing the proposed M-Lasso estimates. We also propose an adaptive M-Lasso estimate of regression with a preliminary scale estimate that uses a highly-robust bounded loss function. A unique feature of the paper is that we consider complex-valued measurements and regression parameter. The consistent variable selection property of the adaptive M-Lasso estimates is illustrated with a simulation study.

Index Terms—Adaptive Lasso, M-estimation, penalized regression, sparsity, variable selection

I. INTRODUCTION

We consider the complex-valued linear model $y = \Phi\beta + \varepsilon$, where $\Phi$ is a known $n \times p$ complex-valued measurement matrix (or matrix of predictors), $\beta = (\beta_1, \ldots, \beta_p)^\top$ is the unknown vector of complex-valued regression coefficients (or system parameters) and $\varepsilon \in \mathbb{C}^n$ denotes the additive noise. For ease of exposition, we consider the centered linear model (i.e., we assume that the intercept is equal to zero). The primary interest is to estimate the unknown parameter $\beta$ given $y \in \mathbb{C}^n$ and $\Phi \in \mathbb{C}^{n \times p}$. When the linear system is underdetermined ($p > n$) or $p \approx n$, the least squares estimate (LSE) $\hat\beta_{LS} = \arg\min_{\beta} \tfrac{1}{2}\| y - \Phi\beta \|_2^2$ does not have a unique solution or is subject to a very high variance. Furthermore, for a large number of predictors, one wishes to find a sparse solution, meaning that $\hat\beta_j = 0$ for most $j \in \{1, \ldots, p\}$, so that only the predictors that exhibit the strongest effects are selected. A common approach in the above cases is to use penalized/regularized regression with the sparsity enforcing $\ell_1$-penalty as in the Lasso [1]. The Lasso, however, inherits the non-robustness (sensitivity to outliers) of the LSE as well as its inefficiency when the noise follows a heavy-tailed non-Gaussian distribution.

The adaptive Lasso [2] uses adaptive weights for penalizing different coefficients in the $\ell_1$-penalty. The weighted Lasso solves a weighted $\ell_1$-penalized LS regression problem,
$$
\underset{\beta \in \mathbb{C}^p}{\text{minimize}} \;\Big\{ \tfrac{1}{2}\| y - \Phi\beta \|_2^2 + \lambda \| w \circ \beta \|_1 \Big\} \tag{1}
$$
where $\lambda > 0$ is the shrinkage (penalty) parameter, chosen by the user, $w = (w_1, \ldots, w_p)^\top$ is a vector of non-negative weights, and $\circ$ is the Hadamard product, i.e., the component-wise product of two vectors. Thus $\| w \circ \beta \|_1 = \sum_{j=1}^p w_j |\beta_j|$. The standard Lasso is obtained when $w_j \equiv 1$. The adaptive Lasso was proposed in the real-valued case, but it can be extended to the complex-valued case in a straightforward manner. The adaptive Lasso is obtained when $\lambda = \lambda_n$ depends on the sample size and the weights are data dependent, defined as $\hat w_j = 1/|\hat\beta_{\text{init},j}|^{\gamma}$ for $\gamma > 0$, where $\hat\beta_{\text{init}} \in \mathbb{C}^p$ is a root-$n$-consistent initial estimator of $\beta$. It was shown in [2] that if $\lambda_n/\sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$, then the adaptive Lasso estimate enjoys the oracle properties (consistent variable selection and the same asymptotic normal distribution as the LSE that knows the true model). It should be noted that the root-$n$ consistency of $\hat\beta_{\text{init}}$ can be relaxed; see [2] for discussion. In this paper, we use $\gamma = 1$ and the standard ($w_j \equiv 1$) Lasso estimate $\hat\beta$ as $\hat\beta_{\text{init}}$, as in [3].
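As a concrete illustration of this weighting scheme, the following sketch computes the data-dependent weights $\hat w_j = 1/|\hat\beta_{\text{init},j}|^{\gamma}$ from an initial fit; NumPy is assumed and `lasso_solve` is a hypothetical placeholder for any solver of problem (1), not a routine from the paper.

```python
import numpy as np

def adaptive_weights(beta_init, gamma=1.0, eps=1e-12):
    """Adaptive Lasso weights w_j = 1/|beta_init_j|^gamma.

    Coordinates that the initial estimate sets to (numerically) zero receive
    an infinite weight, i.e., they are effectively excluded from the model.
    """
    mod = np.abs(beta_init)            # modulus of the complex coefficients
    w = np.full(mod.shape, np.inf)
    nz = mod > eps
    w[nz] = 1.0 / mod[nz] ** gamma
    return w

# Usage sketch (lasso_solve is a hypothetical solver of problem (1)):
#   beta_init = lasso_solve(Phi, y, lam, weights=np.ones(p))   # standard Lasso, w_j = 1
#   w_hat     = adaptive_weights(beta_init, gamma=1.0)         # gamma = 1 as in the paper
#   beta_ad   = lasso_solve(Phi, y, lam_n, weights=w_hat)      # adaptive (weighted) Lasso
```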
The M-estimates of regression [4] are defined as solutions to generalized normal equations that depend on a score function $\psi$, which is the first derivative of the loss function $\rho(x)$, $\psi(x) = \rho'(x)$. Commonly used loss functions are the standard LS loss $\rho(x) = |x|^2$ or the robust Huber's loss function. Most robust loss and score functions require a preliminary estimate of the scale of the error distribution. In this paper, we propose a class of weighted/adaptive Lasso estimates following the spirit of M-estimation; namely, we define the weighted M-Lasso estimates of regression and scale as solutions to generalized zero subgradient equations that also depend on a score function. When the associated loss function is the LS-loss, these equations are a sufficient and necessary condition of a solution to the weighted Lasso problem (1). Furthermore, we develop a simple and efficient algorithm to compute the weighted M-Lasso estimate. This algorithm is a natural generalization of the cyclic coordinate descent (CCD) algorithm [5], which is the current state-of-the-art method for computing the Lasso solution (1).

The paper is organized as follows. Robust loss functions and their properties are outlined in Section II. As examples, we consider Huber's loss and the highly-robust (non-convex) Tukey's loss, and introduce the notion of a pseudo-residual vector. In Section III, we define the M-Lasso estimates of regression and scale and develop the generalized CCD algorithm for computing the solution. Section IV provides simulation studies to illustrate the model selection abilities and prediction accuracy of the proposed method in various noise conditions.


Notations. The vector space $\mathbb{C}^n$ is equipped with the usual Hermitian inner product, $\langle a, b \rangle = a^{\mathsf H} b$, where $(\cdot)^{\mathsf H} = [(\cdot)^*]^\top$ denotes the Hermitian (complex conjugate) transpose. This induces the conventional (Hermitian) $\ell_2$-norm $\|a\|_2 = \sqrt{a^{\mathsf H} a}$. The $\ell_1$-norm is defined as $\|a\|_1 = \sum_{i=1}^n |a_i|$, where $|a| = \sqrt{a^* a} = \sqrt{a_R^2 + a_I^2}$ denotes the modulus of a complex number $a = a_R + \jmath a_I$. For a matrix $A \in \mathbb{C}^{n \times p}$, we denote by $a_j \in \mathbb{C}^n$ its $j$th column vector, and $a_{i\cdot} \in \mathbb{C}^p$ denotes the Hermitian transpose of its $i$th row vector. Hence, we may write the measurement matrix $\Phi \in \mathbb{C}^{n \times p}$ as $\Phi = (\phi_1 \ \cdots \ \phi_p) = (\phi_{1\cdot} \ \cdots \ \phi_{n\cdot})^{\mathsf H}$.

[Fig. 1. Surface plots of the robust loss functions: Huber's loss $\rho_{H,c}(x)$ (left) and Tukey's loss $\rho_{T,c}(x)$ (right), plotted over Re(x) and Im(x).]

II. ROBUST LOSS FUNCTIONS AND PSEUDO-RESIDUALS

Suppose that the noise terms $\varepsilon_i$ are i.i.d. continuous random variables from a circular distribution [6] with p.d.f. $f(e) = (1/\sigma) f_0(e/\sigma)$, where $f_0(e)$ denotes the standard form of the density and $\sigma > 0$ is the scale parameter. If $\sigma$ is known, then an M-estimator $\hat\beta$ solves
$$
\sum_{i=1}^n \phi_{i\cdot}\, \psi\!\left( \frac{y_i - \phi_{i\cdot}^{\mathsf H} \hat\beta}{\sigma} \right) = 0 \tag{2}
$$
where $\psi : \mathbb{C} \to \mathbb{C}$, called the score function, is the complex conjugate derivative [7] of the loss function $\rho : \mathbb{C} \to \mathbb{R}_0^+$. As in [8], a function $\rho : \mathbb{C} \to \mathbb{R}_0^+$ is called a loss function if it is circularly symmetric (i.e., $\rho(e^{\jmath\theta} x) = \rho(x)$ $\forall \theta \in \mathbb{R}$), $\mathbb{R}$-differentiable, increasing in $|x| > 0$, and satisfies $\rho(0) = 0$. Due to the circularity assumption, $\rho(x) = \rho_0(|x|)$ for some $\rho_0 : \mathbb{R}_0^+ \to \mathbb{R}_0^+$ and hence the score function becomes
$$
\psi(x) = \frac{\partial}{\partial x^*} \rho(x) = \frac{1}{2}\left( \frac{\partial \rho}{\partial x_R} + \jmath \frac{\partial \rho}{\partial x_I} \right) = \frac{1}{2}\, \rho_0'(|x|)\, \mathrm{sign}(x),
$$
where
$$
\mathrm{sign}(x) = \begin{cases} x/|x|, & \text{for } x \neq 0 \\ 0, & \text{for } x = 0 \end{cases}
$$
is the complex signum function and $\rho_0'$ denotes the real derivative of the real-valued function $\rho_0$.

An objective function approach to M-estimation, on the other hand, defines an M-estimate of regression (again assuming $\sigma$ is known) as a solution to an optimization program
$$
\underset{\beta \in \mathbb{C}^p}{\text{minimize}} \; \sum_{i=1}^n \rho\!\left( \frac{y_i - \phi_{i\cdot}^{\mathsf H} \beta}{\sigma} \right). \tag{3}
$$
Naturally, if $\rho$ is a convex loss function, then all solutions of (2) are solutions of (3). The maximum likelihood (ML-)estimate of regression is found by solving (3) with $\rho(x) = -\ln f_0(x)$ or, equivalently, solving (2) with $\psi(x) = -\frac{\partial}{\partial x^*} \ln f_0(x)$.

In the complex-valued case, Huber's [4] loss function is defined as [8]:
$$
\rho_{H,c}(x) = \begin{cases} |x|^2, & \text{for } |x| \le c \\ 2c|x| - c^2, & \text{for } |x| > c, \end{cases} \tag{4}
$$
where $c$ is a user-defined threshold that influences the degree of robustness and efficiency of the method. Huber's loss function is a hybrid of the $\ell_2$ and $\ell_1$ loss functions $\rho(x) = |x|^2$ and $\rho(x) = |x|$, respectively, using the $\ell_2$-loss for relatively small errors and the $\ell_1$-loss for relatively large errors. Moreover, it is convex. Huber's score function becomes
$$
\psi_{H,c}(x) = \begin{cases} x, & \text{for } |x| \le c \\ c\, \mathrm{sign}(x), & \text{for } |x| > c. \end{cases}
$$
We use $c = 1.215$ as our default choice, which gives approximately 95% efficiency at the complex Gaussian noise.

Tukey's biweight function is another commonly used loss function [4]. We define it for complex-valued measurements as
$$
\rho_{T,c}(x) = (c^2/3)\, \min\!\Big\{ 1,\; 1 - \big(1 - (|x|/c)^2\big)^3 \Big\}.
$$
Tukey's loss function is bounded, which makes it very robust to large outliers. As a consequence, it is also non-convex. The respective score function is
$$
\psi_{T,c}(x) = \begin{cases} x \big( 1 - (|x|/c)^2 \big)^2, & \text{for } |x| \le c \\ 0, & \text{for } |x| > c. \end{cases}
$$
Thus large residuals $r_i = y_i - \phi_{i\cdot}^{\mathsf H} \beta$ are completely rejected, i.e., they have zero weight in (2). For Tukey's loss function, we use $c = 3.0$ as our default choice, which gives approximately 85% efficiency at the complex Gaussian noise. Huber's and Tukey's loss functions are depicted in Figure 1.

Let $r \equiv r(\beta) = y - \Phi\beta$ denote the residual vector for some candidate $\beta \in \mathbb{C}^p$. The loss function then defines a pseudo-residual,
$$
r_\psi \equiv r_\psi(\beta, \sigma) = \sigma\, \psi\!\left( \frac{y - \Phi\beta}{\sigma} \right), \tag{5}
$$
where the $\psi$-function in (5) acts coordinate-wise on the vector $r/\sigma$, so $[\psi(r/\sigma)]_i = \psi(r_i/\sigma)$. Note that if $\rho(\cdot)$ is the conventional LS-loss, $\rho(x) = |x|^2$, then $\psi(x) = x$, and $r_\psi$ is equal to the residual vector, $r_\psi = y - \Phi\beta = r$. The multiplier $\sigma$ in (5) is essential in bringing the residuals back to the original scale of the data. Using the above notation, we may write (2) more compactly as $\phi_j^{\mathsf H} r_\psi(\hat\beta, \sigma) = 0$ for $j = 1, \ldots, p$, which will be elemental in our developments later on.

The discussion above assumes $\sigma$ is known. In practice, $\sigma$ is unknown and utilizing the robust loss functions above requires a preliminary robust scale estimate $\hat\sigma$.
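The loss and score functions above, and the pseudo-residual (5), translate directly into code. The sketch below (NumPy assumed, vectorized over the residual vector, with the default thresholds $c = 1.215$ and $c = 3.0$ quoted above) is one possible transcription, not the authors' implementation.

```python
import numpy as np

def csign(x):
    """Complex signum: x/|x| for x != 0 and 0 at the origin (element-wise)."""
    x = np.asarray(x, dtype=complex)
    mod = np.abs(x)
    return np.where(mod > 0, x / np.where(mod > 0, mod, 1.0), 0.0)

def psi_huber(x, c=1.215):
    """Huber score: x for |x| <= c, c*sign(x) otherwise (score of (4))."""
    x = np.asarray(x, dtype=complex)
    return np.where(np.abs(x) <= c, x, c * csign(x))

def psi_tukey(x, c=3.0):
    """Tukey biweight score: x*(1 - (|x|/c)^2)^2 for |x| <= c, else 0."""
    x = np.asarray(x, dtype=complex)
    a = np.abs(x) / c
    return np.where(a <= 1.0, x * (1.0 - a**2) ** 2, 0.0)

def pseudo_residual(y, Phi, beta, sigma, psi=psi_huber):
    """Pseudo-residual r_psi(beta, sigma) = sigma * psi((y - Phi beta)/sigma), eq. (5)."""
    r = y - Phi @ beta
    return sigma * psi(r / sigma)
```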


Obtaining an accurate estimate of scale is a challenging problem in the sparse regression scenario and therefore we develop a weighted/adaptive M-Lasso method in which the regression and scale parameters are estimated jointly.

III. WEIGHTED M-LASSO ESTIMATION

A. Definition and properties

Note that the LS criterion function $J_{\ell_2}(\beta) = \tfrac{1}{2}\|y - \Phi\beta\|_2^2$ in (1) is convex (in fact strictly convex if $n > p$) and $\mathbb{R}$-differentiable, but the $\ell_1$-penalty function $\|w \circ \beta\|_1$ is not $\mathbb{R}$-differentiable at a point where at least one coordinate $\beta_j$ is zero (and $w_j$ is nonzero). However, we can resort to a generalization of the notion of gradient applicable for convex functions, called the subdifferential [9]. For a complex function $f : \mathbb{C}^p \to \mathbb{R}$ we can define the subdifferential at a point $\beta$ as
$$
\partial f(\beta) = \big\{ z \in \mathbb{C}^p : f(\beta') \ge f(\beta) + 2\,\mathrm{Re}\big( \langle z, \beta' - \beta \rangle \big) \ \text{for all } \beta' \in \mathbb{C}^p \big\}.
$$
Any element $z \in \partial f(\beta)$ is then called a subgradient of $f$ at $\beta$. The subdifferential of the modulus $|\beta_j|$ is
$$
\partial |\beta_j| = \begin{cases} \tfrac{1}{2}\,\mathrm{sign}(\beta_j), & \text{for } \beta_j \neq 0 \\ \tfrac{1}{2}\, s, & \text{for } \beta_j = 0 \end{cases}
$$
where $s$ is some complex number verifying $|s| \le 1$. Thus the subdifferential of $|\beta_j|$ is the usual complex conjugate derivative when $\beta_j \neq 0$, i.e., $\partial |\beta_j| = \frac{\partial |\beta_j|}{\partial \beta_j^*}$ for $\beta_j \neq 0$. Then a necessary and sufficient condition for a solution to the Lasso problem (1) is that $0 \in \partial\big( J_{\ell_2}(\beta) + \lambda \|w \circ \beta\|_1 \big)$, which gives the zero subgradient weighted Lasso equations
$$
-\phi_j^{\mathsf H}(y - \Phi\hat\beta) + \lambda w_j \hat s_j = 0 \quad \text{for } j = 1, \ldots, p \tag{6}
$$
where $\hat s_j$ is 2 times an element of the subdifferential of $|\beta_j|$ evaluated at $\hat\beta_j$, i.e., equal to $\mathrm{sign}(\hat\beta_j)$ if $\hat\beta_j \neq 0$ and some complex number lying inside the unit complex circle otherwise. If $\hat\beta$ denotes a solution to the weighted Lasso problem (1), then a natural estimate of the scale is
$$
\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \big|y_i - \phi_{i\cdot}^{\mathsf H} \hat\beta\big|^2 = \frac{1}{n}\|\hat r\|_2^2, \tag{7}
$$
where $\hat r = y - \Phi\hat\beta$ denotes the residual vector at the solution.

Definition 1. Let $w$ and $\lambda$ denote a vector of non-negative weights and a penalty parameter, respectively. Let $\rho(x) = \rho_0(|x|)$ denote a convex loss function. The weighted M-Lasso estimates $(\hat\beta, \hat\sigma) \in \mathbb{C}^p \times \mathbb{R}^+$ are defined as solutions to the generalized (zero subgradient) Lasso estimating equations,
$$
-\phi_j^{\mathsf H} r_\psi(\hat\beta, \hat\sigma) + \lambda w_j \hat s_j = 0 \quad \text{for } j = 1, \ldots, p \tag{8}
$$
$$
\sum_{i=1}^n \chi\!\left( \frac{|y_i - \phi_{i\cdot}^{\mathsf H} \hat\beta|}{\hat\sigma} \right) - \alpha n = 0 \tag{9}
$$
where $\alpha > 0$ is a fixed scaling factor and the function $\chi : \mathbb{R}_0^+ \to \mathbb{R}_0^+$ is defined as
$$
\chi(t) = \rho_0'(t)\, t - \rho_0(t). \tag{10}
$$
Equations (8) and (9) are referred to as the weighted Lasso M-estimating equations.

Our approach to robust Lasso is different compared to earlier approaches in the literature, which have followed the objective function approach to M-estimation by adding an $\ell_1$-penalty to (3). We follow the more general estimating equation approach of M-estimation stated in (2).

Some remarks on this definition are in order. First, if one uses the LS-loss $\rho(x) = |x|^2$, then $\hat r_\psi = \hat r$ and (8) reduces to (6). Furthermore, since $\rho_0(t) = t^2$ and $\rho_0'(t) = 2t$, the $\chi$-function in (10) is $\chi(t) = t^2$, and (9) reduces to (7) for $\alpha = 1$. In other words, for the LS-loss, the weighted M-Lasso solution $(\hat\beta, \hat\sigma)$ is the standard Lasso estimate solving (1) and the standard scale statistic given by (7). Second, if $\lambda = 0$ (so no penalization) and $n > p$, then the solution to the weighted M-Lasso estimating equations (8) and (9) is the unique solution to the convex optimization problem
$$
\underset{\beta,\, \sigma}{\arg\min}\; Q(\beta, \sigma) = \alpha n \sigma + \sum_{i=1}^n \rho\!\left( \frac{y_i - \phi_{i\cdot}^{\mathsf H} \beta}{\sigma} \right) \sigma. \tag{11}
$$
The objective function $Q(\beta, \sigma)$ was proposed by Huber [4] in the real-valued case for joint M-estimation of regression and scale. A Lasso penalized Huber's criterion was considered in [10] and $\ell_0$-penalization in the complex-valued case in [8].

To simplify notation, we write the pseudo-residual vector $r_\psi(\hat\beta, \hat\sigma)$ in (8) as $\hat r_\psi$. Then note that (8) can be written compactly as $\langle \phi_j, \hat r_\psi \rangle = \lambda w_j \hat s_j$ for $j = 1, \ldots, p$. Thus, after taking the modulus of both sides, we obtain
$$
|\langle \phi_j, \hat r_\psi \rangle| = \lambda w_j, \quad \text{if } \hat\beta_j \neq 0 \tag{12}
$$
$$
|\langle \phi_j, \hat r_\psi \rangle| \le \lambda w_j, \quad \text{if } \hat\beta_j = 0. \tag{13}
$$
In other words, whenever a component, say $\hat\beta_j$, of $\hat\beta$ becomes non-zero, the corresponding absolute correlation between the pseudo-residual $\hat r_\psi$ and the column $\phi_j$ of $\Phi$, $|\langle \phi_j, \hat r_\psi \rangle|$, meets the boundary $\lambda w_j$ in magnitude. This is a well-known property of the Lasso; see e.g. [11], or [12] in the complex-valued case. This property is then fulfilled by the weighted M-Lasso estimates by definition. In the real-valued case, [10] considered minimization of the (non-weighted) penalized Huber's criterion $Q_\lambda(\beta, \sigma) = Q(\beta, \sigma) + \lambda\|\beta\|_1$. The solution of $\min_{\beta,\sigma} Q_\lambda(\beta, \sigma)$, however, is different from the solutions to (8)-(9). This can be verified by noting that the zero subgradient equation $\partial_{\beta_j} Q_\lambda(\beta, \sigma) = 0$ is different from (8) for $w_j \equiv 1$. This also means that the solution of the weighted penalized Huber's criterion based on the LS-loss function is not the weighted Lasso solution (1). This is somewhat counterintuitive. The equivalence between the weighted Lasso solution of (1) and the weighted M-Lasso for the LS-loss, however, holds.

The scaling factor $\alpha$ in (9) is chosen so that the obtained scale estimate $\hat\sigma$ is Fisher-consistent for the unknown scale $\sigma$ when $\{\varepsilon_i\}_{i=1}^n \overset{\text{iid}}{\sim} \mathbb{C}\mathcal{N}(0, \sigma^2)$. Due to (9), we choose it as $\alpha = \mathrm{E}[\chi(|\varepsilon|)]$, where $\varepsilon \sim \mathbb{C}\mathcal{N}(0, 1)$.
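As a quick numerical check of these remarks, the sketch below (NumPy assumed; a Monte Carlo approximation is used rather than the closed-form expression given later for Huber's loss) evaluates the $\chi$-function for the LS and Huber losses and approximates the consistency factor $\alpha = \mathrm{E}[\chi(|\varepsilon|)]$ for $\varepsilon \sim \mathbb{C}\mathcal{N}(0,1)$; for the LS-loss one obtains $\alpha \approx 1$, consistent with (9) reducing to (7).

```python
import numpy as np

def chi_ls(t):
    """LS-loss: rho0(t) = t^2, rho0'(t) = 2t, hence chi(t) = t^2 by (10)."""
    return np.asarray(t, dtype=float) ** 2

def chi_huber(t, c=1.215):
    """chi(t) for Huber's loss via (10): t^2 if t <= c, else c^2 (cf. (14) below)."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= c, t**2, c**2)

def alpha_mc(chi, n_mc=10**6, seed=0):
    """Monte Carlo approximation of alpha = E[chi(|eps|)] for eps ~ CN(0, 1).

    A CN(0,1) variable has independent real and imaginary parts of variance 1/2.
    """
    rng = np.random.default_rng(seed)
    eps = (rng.standard_normal(n_mc) + 1j * rng.standard_normal(n_mc)) / np.sqrt(2)
    return float(np.mean(chi(np.abs(eps))))

# alpha_mc(chi_ls)     -> approx. 1.00, so (9) reduces to the LS scale statistic (7)
# alpha_mc(chi_huber)  -> approx. 0.77 for the default threshold c = 1.215
```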


For Huber's function (4), the $\chi$-function in (10) becomes
$$
\chi_{H,c}(|x|) = |\psi_{H,c}(x)|^2 = \begin{cases} |x|^2, & \text{for } |x| \le c \\ c^2, & \text{for } |x| > c. \end{cases} \tag{14}
$$
In this case the estimating equation (9) can be written as
$$
\hat\sigma^2 = \frac{1}{\alpha n} \sum_{i=1}^n \left| \psi_{H,c}\!\left( \frac{y_i - \phi_{i\cdot}^{\mathsf H} \hat\beta}{\hat\sigma} \right) \right|^2 \hat\sigma^2 = \frac{1}{\alpha n}\, \|\hat r_\psi\|_2^2,
$$
where $\hat r_\psi = r_\psi(\hat\beta, \hat\sigma)$. The consistency factor $\alpha = \alpha(c)$ can be easily solved in closed form by elementary calculus as $\alpha = c^2\big(1 - F_{\chi^2_2}(2c^2)\big) + F_{\chi^2_4}(2c^2)$, where $F_{\chi^2_k}$ denotes the c.d.f. of the $\chi^2_k$-distribution. Note that the threshold parameter $c$ determines the value of $\alpha$.

B. Algorithm

Next we develop a simple and efficient algorithm, a generalized cyclic coordinate descent (CCD) [5] algorithm, for computing the weighted M-Lasso estimates. Recall that the CCD algorithm repeatedly cycles through the predictors, updating one coordinate $\beta_j$ at a time ($j = 1, \ldots, p$) while keeping the others fixed at their current iterate values. At the $j$th step, the update for $\beta_j$ is obtained by soft-thresholding a conventional coordinate descent update $\hat\beta_j + \langle \phi_j, \hat r \rangle$, where $\hat r$ denotes the residual vector $\hat r = r(\hat\beta)$ at the current estimate $\hat\beta$. In the complex-valued case, it is easy to verify (proof omitted) that
$$
\mathrm{soft}_\lambda(y) = \underset{\beta \in \mathbb{C}}{\arg\min} \Big\{ \tfrac{1}{2}|y - \beta|^2 + \lambda |\beta| \Big\} = \mathrm{sign}(y)\,(|y| - \lambda)_+
$$
is the complex soft-thresholding operator, where $(t)_+ = \max(t, 0)$. For the weighted M-Lasso, similar updates are performed, but $\hat r$ is replaced by the pseudo-residual vector $\hat r_\psi$ and the update for the scale is calculated prior to cycling through the coefficients.

The generalized CCD (GCCD) algorithm for computing the weighted M-Lasso solutions proceeds as follows:

1) Update the scale: $\hat\sigma^2 \leftarrow \dfrac{\hat\sigma^2}{\alpha n} \sum_{i=1}^n \chi\!\left( \dfrac{|y_i - \phi_{i\cdot}^{\mathsf H} \hat\beta|}{\hat\sigma} \right)$
2) For $j = 1, \ldots, p$ do
   a) Update the pseudo-residual: $\hat r_\psi \leftarrow \hat\sigma\, \psi\!\left( \dfrac{y - \Phi\hat\beta}{\hat\sigma} \right)$
   b) Update the coefficient: $\hat\beta_j \leftarrow \mathrm{soft}_{\lambda w_j}\big( \hat\beta_j + \langle \phi_j, \hat r_\psi \rangle \big)$
3) Repeat Steps 1 and 2 until convergence.

We define the adaptive M-Lasso estimates $(\hat\beta^{\mathrm{ad}}, \hat\sigma^{\mathrm{ad}})$ simply as a weighted M-Lasso solution using data dependent weights $\hat w_j$ and penalty $\lambda_n$ as in [2]. They are computed using a two-stage procedure:

A1 Compute the (non-weighted, so $w_j \equiv 1$) M-Lasso estimate $(\hat\beta_{\lambda^*}, \hat\sigma_{\lambda^*})$, where $\lambda^*$ denotes the optimal penalty parameter chosen using the Bayesian information criterion (BIC) over a grid of $\lambda$ values.
A2 Compute the weights $\hat w_j = 1/|\hat\beta_{\mathrm{init},j}|$, where $\hat\beta_{\mathrm{init}} = \hat\beta_{\lambda^*}$, and solve $(\hat\beta^{\mathrm{ad}}, \hat\sigma^{\mathrm{ad}})$ as the weighted M-Lasso solution using $\hat w = (\hat w_1, \ldots, \hat w_p)^\top$ and a penalty parameter $\lambda_n$.

In the simulation studies, we use $\lambda_n = \log(\log(n))$, which satisfies the condition on $\lambda_n$ in [2, Th. 2]. Note that we can omit the variables for which $\hat\beta_{\mathrm{init},j} = 0$ and simply set $\hat\beta^{\mathrm{ad}}_j = 0$. The BIC value is determined as $\lambda^* = \arg\min_{\lambda \in [\lambda]} 2n \ln \hat\sigma_\lambda + \mathrm{df}(\lambda) \cdot \ln n$, where $\mathrm{df}(\lambda)$ is the number of nonzero elements in $\hat\beta_\lambda$ and $[\lambda]$ denotes a grid of $\lambda$ values.

C. Adaptive M-Lasso with preliminary scale

If an accurate robust preliminary scale estimate $\hat\sigma_0$ is available, one may drop the assumption of convexity of the loss function in Definition 1 to allow for bounded (highly-robust) loss functions. Thus we define a weighted M-Lasso estimate $\hat\beta = \hat\beta_\lambda$ with preliminary scale $\hat\sigma_0$ as a solution to
$$
-\phi_j^{\mathsf H} r_\psi(\hat\beta, \hat\sigma_0) + \lambda w_j \hat s_j = 0 \quad \text{for } j = 1, \ldots, p. \tag{15}
$$
We then use Tukey's loss function $\rho_{T,c}(x)$ and compute the solution using the following three-stage procedure. The first stage is as Step A1 for the adaptive M-Lasso, where we utilise Huber's loss function in finding $(\hat\beta_{\lambda^*}, \hat\sigma_{\lambda^*})$. At the second stage, one computes the preliminary scale estimate as the median absolute deviation (MAD) of the residuals based on the Huber M-Lasso fit $\hat\beta_{\lambda^*}$ computed earlier, so
$$
\hat\sigma_0 = 1.20112 \cdot \underset{i=1,\ldots,n}{\mathrm{median}}\, \big| y_i - \phi_{i\cdot}^{\mathsf H} \hat\beta_{\lambda^*} \big|.
$$
The scaling constant 1.20112 is used to obtain a consistent scale estimate in complex Gaussian noise. At the last stage, we compute the adaptive weights $\hat w_j = 1/|\hat\beta_{\mathrm{init},j}|$, where $\hat\beta_{\mathrm{init}} = \hat\beta_{\lambda^*}$, and find the weighted M-Lasso solution with preliminary scale $\hat\sigma_0$ using Tukey's loss function, the $\hat w_j$ as weights and penalty $\lambda_n$. When computing the solution using the GCCD algorithm, it is important to give $\hat\beta_{\lambda^*}$ as an initial warm start for the algorithm. Due to the good warm start, the algorithm appears to converge in practice despite the non-convexity of Tukey's loss function. Note that Step 1 (the scale update) of the GCCD algorithm is now omitted since the scale $\hat\sigma_0$ is not estimated.
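The GCCD iteration above could be coded, for instance, as in the following NumPy sketch. It assumes unit-norm columns of $\Phi$ (so the plain update $\hat\beta_j + \langle \phi_j, \hat r_\psi \rangle$ applies), uses a fixed number of sweeps in place of a formal convergence check, and reuses psi/chi helpers of the kind shown in Section II; it is an illustrative sketch, not the authors' reference implementation. Setting `update_scale=False` and passing a preliminary scale gives the Section III-C variant.

```python
import numpy as np

def soft(z, thr):
    """Complex soft-thresholding: sign(z) * max(|z| - thr, 0)."""
    mod = abs(z)
    return 0.0 if mod <= thr else (z / mod) * (mod - thr)

def gccd_mlasso(y, Phi, lam, w, psi, chi, alpha,
                sigma0=None, n_sweeps=100, update_scale=True):
    """Generalized CCD sketch for the weighted M-Lasso equations (8)-(9).

    If update_scale is False, the scale is held fixed at sigma0
    (preliminary-scale variant of Section III-C). Columns of Phi are
    assumed to have unit Euclidean norm.
    """
    n, p = Phi.shape
    beta = np.zeros(p, dtype=complex)
    sigma = float(np.std(y)) if sigma0 is None else sigma0   # crude initial scale

    for _ in range(n_sweeps):
        if update_scale:
            r = np.abs(y - Phi @ beta)
            # Step 1: scale update, a fixed-point step on (9)
            sigma = np.sqrt(sigma**2 / (alpha * n) * np.sum(chi(r / sigma)))
        for j in range(p):
            # Step 2a: pseudo-residual (5) at the current iterate
            r_psi = sigma * psi((y - Phi @ beta) / sigma)
            # Step 2b: weighted complex soft-thresholding update
            beta[j] = soft(beta[j] + np.vdot(Phi[:, j], r_psi), lam * w[j])
    return beta, sigma
```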
The coefficient y ˆ vector has 1 =1.0, 2 =1.5, 3 =2.0 and j =0 a) Update the pseudoresidual: rˆ ˆ | | | | iid | | | | ˆ for 4 j p, and Arg(j) Unif(0, 2⇡), j =1,...,p ✓ ◆   ⇠ b) Update the coefficient: ˆj softw ˆj + ,rˆ for each MC trial. The measurement vector y is generated j h j i iid according to the linear model where xij C (0, 1) and the 3) Repeat Steps 1 and 2 until convergence ⇠ N ad error terms " are i.i.d. from either the complex circular Gaus- We define adaptive M-Lasso estimates (ˆ , ˆad) simply as i sian distribution C (0, 2) or the circular Cauchy distribution a weighted M-Lasso solution using data dependent weights N Ct1(0, ), i.e., circular complex t-distribution [6] with ⌫ =1 wˆ -s and penalty as in [2]. They are computed using a j n degrees of freedom. In the former case, the scale parameter two-stage procedure: is the variance and in the latter case (as the variance w 1 M A1 Compute (non-weighted, so j ) -Lasso estimate does not exist) the population MAD, =MedF ( "i ). The ? ⌘ (ˆ , ˆ ? ) | | ⇤ where denotes the optimal penalty pa- support of is the index set of its non-zero elements, i.e., rameter chosen using the Bayesian information criterion = supp()= j 1,...,p : =0 and { 2 { } j 6 } (BIC) over a grid of values. ˆ = supp(ˆ) denotes the support of ˆ. We consider the ˆ ˆ ˆ A2 Compute weights wˆj =1/ init,j , where init = ? cases =0.5 and =2which yield signal-to-noise ratio, ad | | ˆ ad and solve ( , ˆ ) as weighted M-Lasso solutions SNR = 20 log10(avej j /) = 12 dB and SNR =0dB, 2 | | using wˆ =(ˆw1,...,wˆn)> and a penalty parameter n. respectively.


To assess the model selection abilities of the estimators, we calculate the correct model selection rate, $\mathrm{CMS}(\hat\beta) = I(\Gamma = \hat\Gamma)$, and the overfitting rate, $\mathrm{OF}(\hat\beta) = I(\Gamma \subset \hat\Gamma)$, where $I(\cdot)$ denotes the indicator function. In the latter case, all the non-zero coefficients and at least one zero coefficient are selected. The underfitting rate, $\mathrm{UF}(\hat\beta) = \neg\big(\mathrm{CMS}(\hat\beta) \vee \mathrm{OF}(\hat\beta)\big)$, means that at least one significant predictor is excluded from the model. This is the least wanted scenario, since the adaptive Lasso can not improve upon the CMS rate of an underfitting initial estimate $\hat\beta_{\mathrm{init}}$. To measure the degree of overfit/underfit, we compute the (conditional) number of false positives/negatives, $\mathrm{FP}(\hat\beta) = \#(\hat\beta_j \neq 0 \wedge \beta_j = 0)$ conditioned on $\mathrm{OF}(\hat\beta) = 1$, and $\mathrm{FN}(\hat\beta) = \#(\hat\beta_j = 0 \wedge \beta_j \neq 0)$ conditioned on $\mathrm{UF}(\hat\beta) = 1$. The prediction accuracy is measured via the prediction error,
$$
\mathrm{PE}(\hat\beta) = \begin{cases}
\Big( \dfrac{1}{n} \displaystyle\sum_{i=1}^n \big|\tilde y_i - \tilde\phi_{i\cdot}^{\mathsf H} \hat\beta\big|^2 \Big)^{1/2}, & \varepsilon_i \sim \mathbb{C}\mathcal{N}(0, \sigma^2) \\[4pt]
\underset{i=1,\ldots,n}{\mathrm{median}} \big| \tilde y_i - \tilde\phi_{i\cdot}^{\mathsf H} \hat\beta \big|, & \varepsilon_i \sim \mathbb{C}t_1(0, \sigma)
\end{cases}
$$
where an additional test data set $(\tilde y, \tilde\Phi)$ of the same sample size $n$ is generated from the respective sampling scheme for each MC trial. Note that the median absolute prediction error (MeAPE) is used for Cauchy noise, since the Cauchy distribution does not have finite variance. The above performance measures for the oracle estimator, which uses the true coefficient vector $\beta$, are also given as a point of reference for the evaluated methods. All measures are computed as averages over 1000 MC trials.
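For concreteness, a per-trial tally of these indicators could be computed as in the sketch below (NumPy assumed; `tol` is a hypothetical numerical threshold for declaring a coefficient non-zero, not a quantity from the paper).

```python
import numpy as np

def selection_metrics(beta_hat, beta_true, tol=0.0):
    """Per-trial model selection indicators (CMS, OF, UF) and FP/FN counts.

    FP is meaningful only on trials with OF = 1, and FN only on trials with
    UF = 1, matching the conditional averages reported in Table I.
    """
    sel = np.abs(beta_hat) > tol           # estimated support
    true = np.abs(beta_true) > 0           # true support Gamma
    cms = bool(np.array_equal(sel, true))                       # exact support recovery
    of = bool(np.all(sel[true]) and sel.sum() > true.sum())     # all true + at least one extra
    uf = not (cms or of)                                        # some true coefficient missed
    fp = int(np.sum(sel & ~true))          # zero coefficients selected
    fn = int(np.sum(~sel & true))          # non-zero coefficients missed
    return cms, of, uf, fp, fn
```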
The methods included in our study are: Las, the standard ($w_j \equiv 1$) Lasso using BIC for penalty parameter selection; Hub, the standard ($w_j \equiv 1$) M-Lasso estimate of regression and scale using Huber's loss and BIC; adLas, the adaptive Lasso using Las as the initial estimate $\hat\beta_{\mathrm{init}}$; adHub, the adaptive M-Lasso estimate of regression and scale using Hub as the initial estimate $\hat\beta_{\mathrm{init}}$; and adTuk, the adaptive M-Lasso estimate with preliminary scale $\hat\sigma_0$ (which uses Tukey's loss function and Hub as $\hat\beta_{\mathrm{init}}$, as explained in Section III-C).

TABLE I
MODEL SELECTION PERFORMANCE OF DIFFERENT METHODS IN GAUSSIAN (UPPER TABLE) AND CAUCHY (LOWER TABLE) NOISE.

                  σ = 0.5                   σ = 2.0
Method    CMS    OF    FP    FN      CMS    OF    FP    FN
Oracle    100     0     0     0      100     0     0     0
Las        70    30  1.31     0       71    29  1.31     0
Hub        68    32  1.46     0       67    33  1.47     0
adLas     100     0     0     0       98     2  1.06     0
adHub     100     0     0     0       99     1  1.08     0
adTuk     100     0     0     0      100     0     0     0

Oracle    100     0     0     0      100     0     0     0
Las        33    12  1.34  2.61        1     0  1.67  2.94
Hub        29    71  1.92     0       31    68  1.81  1.00
adLas      37     8  1.17  2.61        1     0  1.67  2.94
adHub     100     0     0     0       84    15  1.15  1.00
adTuk     100     0     0     0       97     2  1.06  1.00

TABLE II
PREDICTION ERROR (PE) OF DIFFERENT METHODS IN GAUSSIAN (UPPER TABLE) AND CAUCHY (LOWER TABLE) NOISE.

σ      Oracle     Las     Hub   adLas   adHub   adTuk
0.5     0.500   0.517   0.517   0.529   0.537   0.587
2.0     2.001   2.064   2.065   2.034   2.038   2.051

0.5     0.505   1.708   0.541   1.625   0.571   0.633
2.0     2.021   3.424   2.156   3.456   2.109   2.109

Table I reports the model selection performance measures for Gaussian (upper table) and Cauchy (lower table) noise. In both medium SNR ($\sigma = 0.5$) and low SNR ($\sigma = 2$) Gaussian cases, Las and Hub exhibit very similar performance. This is even more apparent when inspecting the prediction errors tabulated in Table II. In other words, using the robust M-Lasso with Huber's loss in Gaussian noise leads to only a marginal loss in performance. In medium SNR Gaussian noise, all adaptive methods have oracle performance (full 100% CMS rates). In low SNR Gaussian noise, however, only adTuk maintains a full 100% CMS rate, whereas the CMS rates of adLas and adHub drop down slightly to 98% and 99%, respectively. This illustrates that the adTuk estimator based on a bounded loss function can be useful even in Gaussian noise with low SNR. In heavy-tailed Cauchy noise, the performance of the non-robust Las and adLas completely collapses, as expected. For example, in low SNR Cauchy noise, Las underfits in 99% of MC trials. Furthermore, the FN number reveals that all zeros ($\hat\beta = 0$) is the most commonly selected model. The robust M-Lasso methods, however, maintain excellent performance. In medium SNR Cauchy errors, both adHub and adTuk achieve oracle performance (full CMS rate), and in low SNR Cauchy noise, their CMS rates decrease to 84% and 97%, respectively. In terms of the PEs in Table II, Hub is seen to obtain the best performance both in Gaussian and Cauchy noise. The adaptive M-Lasso (using $\lambda_n = \log(\log(n))$) does not improve on the PE of the initial estimator, although it provides significant improvements in model selection performance.

REFERENCES

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Royal Stat. Soc., Ser. B, vol. 58, pp. 267–288, 1996.
[2] H. Zou, "The adaptive lasso and its oracle properties," J. Amer. Stat. Assoc., vol. 101, no. 476, pp. 1418–1429, 2006.
[3] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[4] P. J. Huber, Robust Statistics. New York: Wiley, 1981.
[5] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, "Pathwise coordinate optimization," Ann. Appl. Stat., vol. 1, no. 2, pp. 302–332, 2007.
[6] E. Ollila, J. Eriksson, and V. Koivunen, "Complex elliptically symmetric random variables – generation, characterization, and circularity tests," IEEE Trans. Signal Process., vol. 59, no. 1, pp. 58–69, 2011.
[7] J. Eriksson, E. Ollila, and V. Koivunen, "Essential statistics and tools for complex random variables," IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5400–5408, 2010.
[8] E. Ollila, "Multichannel sparse recovery of complex-valued signals using Huber's criterion," in Proc. Compressed Sensing Theory and its Applications to Radar, Sonar and Remote Sensing (CoSeRa'15), Pisa, Italy, Jun. 16–19, 2015, pp. 32–36.
[9] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[10] A. B. Owen, "A robust hybrid of lasso and ridge regression," Contemporary Mathematics, vol. 443, pp. 59–72, 2007.
[11] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
[12] P. Gerstoft, A. Xenaki, and C. Mecklenbräuker, "Multiple and single snapshot compressive beamforming," J. Acoust. Soc. Am., vol. 138, no. 4, pp. 2003–2014, 2015.
