

Review

Surveying and comparing simultaneous sparse approximation (or group-lasso) algorithms

A. Rakotomamonjy

LITIS EA4108, University of Rouen, Avenue de l'Université, 76800 Saint Étienne du Rouvray, France

Article history:
Received 20 April 2010
Received in revised form 4 November 2010
Accepted 16 January 2011
Available online 25 January 2011

Keywords:
Simultaneous sparse approximation
Block-sparse regression
Group lasso
Iterative reweighted algorithms

Abstract: In this paper, we survey and compare different algorithms that, given an overcomplete dictionary of elementary functions, solve the problem of simultaneous sparse signal approximation with a common sparsity profile induced by an $\ell_p$–$\ell_q$ mixed-norm. Such a problem is also known in the statistical learning community as the group lasso problem. We have gathered and detailed different algorithmic results concerning these two equivalent approximation problems. We have also enriched the discussion by providing relations between several algorithms. Experimental comparisons of the detailed algorithms have also been carried out. The main lesson learned from these experiments is that, depending on the performance measure, greedy approaches and iterative reweighted algorithms are the most efficient algorithms in terms of computational complexity, sparsity recovery or mean-square error.

Contents

1. Introduction
   1.1. Problem formalization
2. Solving the $\ell_1$–$\ell_2$ optimization problem
   2.1. Block coordinate descent
      2.1.1. Deriving optimality conditions
      2.1.2. The algorithm and its convergence
   2.2. Landweber iterations
   2.3. Other works
3. Generic algorithms for large classes of p and q
   3.1. M-FOCUSS algorithm
      3.1.1. Deriving the algorithm
      3.1.2. Discussing M-FOCUSS
   3.2. Automatic relevance determination approach
      3.2.1. Exhibiting the relation with ARD
   3.3. Solving the ARD formulation for p = 1 and 1 ≤ q ≤ 2
   3.4. Iterative reweighted $\ell_1$–$\ell_q$ algorithms for p < 1
      3.4.1. Iterative reweighted algorithm
      3.4.2. Connections with Majorization-Minimization algorithm
      3.4.3. Connection with group bridge Lasso

E-mail address: [email protected]

doi:10.1016/j.sigpro.2011.01.012

4. Specific case algorithms
   4.1. S-OMP
   4.2. M-CosAmp
   4.3. Sparse Bayesian learning and reweighted algorithm
5. Numerical experiments
   5.1. Experimental set-up
   5.2. Comparing $\ell_1$–$\ell_2$ M-BP problem solvers
   5.3. Computational performances
   5.4. Comparing performances
6. Conclusions
Acknowledgments
Appendix A
   A.1. $J_{1,2}(C)$ subdifferential
   A.2. Proof of Lemma 1
   A.3. Proof of Eq. (37)
References

1. Introduction

For several years now, there has been a lot of interest in sparse signal approximation. This large interest comes from the frequent wish of practitioners to represent data in the most parsimonious way.

Recently, researchers have focused their efforts on a natural extension of the sparse approximation problem, namely the problem of finding jointly sparse representations of multiple signal vectors. This problem is also known as simultaneous sparse approximation and can be stated as follows. Suppose we have several signals describing the same phenomenon, each signal being contaminated by noise. We want to find the sparsest approximation of each signal by using the same set of elementary functions. Hence, the problem consists in finding the best approximation of each signal while controlling the number of functions involved in all the approximations.

Such a situation arises in many different application domains such as sensor network signal processing [30,11], neuroelectromagnetic imaging [22,37,59], source localization [31], image restoration [17], distributed compressed sensing [27] and signal and image processing [24,49].

1.1. Problem formalization

Formally, the problem of simultaneous sparse approximation is the following. Suppose that we have measured $L$ signals $\{s_i\}_{i=1}^{L}$ where each signal is of the form $s_i = U c_i + e$, where $s_i \in \mathbb{R}^N$, $U \in \mathbb{R}^{N \times M}$ is a matrix of unit-norm elementary functions, $c_i \in \mathbb{R}^M$ is a weighting vector and $e$ is a noise vector. $U$ will be denoted in the sequel as the dictionary matrix. Since we have several signals, the overall measurements can be written as

$$S = UC + E \qquad (1)$$

where $S = [s_1\ s_2\ \cdots\ s_L]$ is a signal matrix, $C = [c_1\ c_2\ \cdots\ c_L]$ a coefficient matrix and $E$ a noise matrix. Note that in the sequel we adopt the following notations: $c_{i,\cdot}$ and $c_{\cdot,j}$ respectively denote the $i$-th row and $j$-th column of matrix $C$, and $c_{i,j}$ is the $i$-th element in the $j$-th column of $C$.

For the simultaneous sparse approximation (SSA) problem, the goal is then to recover the matrix $C$ given the signal $S$ and the dictionary $U$, under the hypothesis that all signals $s_i$ share the same sparsity profile. This latter hypothesis can also be translated into the coefficient matrix $C$ having a minimal number of non-zero rows. In order to measure the number of non-zero rows of $C$, a possible criterion is the so-called row-support or row-diversity measure of a coefficient matrix, defined as

$$\mathrm{rowsupp}(C) = \{\, i \in [1\ \cdots\ M] : c_{i,k} \neq 0 \text{ for some } k \,\}$$

The row-support of $C$ tells us which atoms of the dictionary have been used for building the signal matrix. Hence, if the cardinality of the row-support is lower than the dictionary cardinality, it means that at least one atom of the dictionary has not been used for synthesizing the signal matrix. The row-$\ell_0$ pseudo-norm of a coefficient matrix can then be defined as $\|C\|_{\mathrm{row}\text{-}0} = |\mathrm{rowsupp}(C)|$. According to this definition, the simultaneous sparse approximation problem can be stated as

$$\min_C \ \tfrac{1}{2}\|S - UC\|_F^2 \quad \text{s.t.} \quad \|C\|_{\mathrm{row}\text{-}0} \le T \qquad (2)$$

where $\|\cdot\|_F$ is the Frobenius norm and $T$ a user-defined parameter that controls the sparsity of the solution. Note that the problem can also take the different form:

$$\min_C \ \|C\|_{\mathrm{row}\text{-}0} \quad \text{s.t.} \quad \tfrac{1}{2}\|S - UC\|_F^2 \le \varepsilon \qquad (3)$$

For this latter formulation, the problem translates into minimizing the number of non-zero rows of the coefficient matrix $C$ while keeping control of the approximation error. Both problems (2) and (3) are appealing for their formulation clarity. However, similarly to the single signal approximation case, solving these optimization problems is notably intractable because $\|\cdot\|_{\mathrm{row}\text{-}0}$ is a discrete-valued function. Two ways of addressing the intractable problems (2) and (3) are possible: relaxing the problem by replacing the $\|\cdot\|_{\mathrm{row}\text{-}0}$ function with a more tractable row-diversity measure, or using some sub-optimal algorithms.

A large class of relaxed versions of $\|\cdot\|_{\mathrm{row}\text{-}0}$ proposed in the literature is encompassed by the following form [52]:

$$J_{p,q}(C) = \sum_i \|c_{i,\cdot}\|_q^p \quad \text{with} \quad \|c_{i,\cdot}\|_q = \Big(\sum_j |c_{i,j}|^q\Big)^{1/q}$$

where typically $p \le 1$ and $q \ge 1$. This penalty term can be interpreted as the $\ell_p$ quasi-norm of the sequence $\{\|c_{i,\cdot}\|_q\}_i$ to the power $p$. According to this relaxed version of the row-diversity measure, most of the algorithms proposed in the literature try to solve the relaxed problem

$$\min_C \ \tfrac{1}{2}\|S - UC\|_F^2 + \lambda J_{p,q}(C) \qquad (4)$$

where $\lambda$ is another user-defined parameter that balances the approximation error and the sparsity-inducing penalty $J_{p,q}(C)$. The choice of $p$ and $q$ results in a compromise between the row-support sparsity and the convexity of the optimization problem. Indeed, problem (4) is known to be convex when $p,q \ge 1$, while it is known to produce a row-sparse $C$ if $p \le 1$ (due to the penalty function singularity at $C = 0$ [15]).
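To make the notation concrete, the following short Python/NumPy sketch (ours, not from the paper) evaluates the mixed-norm penalty $J_{p,q}(C)$ and the penalized objective of problem (4); for $p=1$, $q=2$ this is the M-BP objective discussed in Section 2. Variable names and the toy data are our own.

```python
import numpy as np

def mixed_norm_penalty(C, p=1.0, q=2.0):
    """J_{p,q}(C) = sum_i ||c_{i,.}||_q^p, the row-wise mixed-norm penalty."""
    row_norms = np.sum(np.abs(C) ** q, axis=1) ** (1.0 / q)   # ||c_{i,.}||_q for each row
    return np.sum(row_norms ** p)

def mbp_objective(S, U, C, lam, p=1.0, q=2.0):
    """Objective of problem (4): 0.5 * ||S - U C||_F^2 + lambda * J_{p,q}(C)."""
    residual = S - U @ C
    return 0.5 * np.sum(residual ** 2) + lam * mixed_norm_penalty(C, p, q)

# Tiny usage example with random data (N=8 samples, M=5 atoms, L=3 signals).
rng = np.random.default_rng(0)
U = rng.standard_normal((8, 5))
U /= np.linalg.norm(U, axis=0)                    # unit-norm atoms, as assumed in the text
C_true = np.zeros((5, 3)); C_true[1] = 1.0; C_true[3] = -0.5   # two active rows
S = U @ C_true + 0.01 * rng.standard_normal((8, 3))
print(mbp_objective(S, U, C_true, lam=0.1))
```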
The simultaneous sparse approximation problem as described here is equivalent to several other problems studied in other research communities. Problem (4) can be reformulated so as to make clear its relation with some other problems known as the $\ell_p$–$\ell_q$ group lasso [61] or block-sparse regression [28] in statistics, or block-sparse signal recovery [13] in signal processing. Indeed, let us define

$$\tilde{s} = \begin{pmatrix} s_1 \\ \vdots \\ s_L \end{pmatrix}, \qquad \tilde{U} = \begin{pmatrix} U & & (0) \\ & \ddots & \\ (0) & & U \end{pmatrix}, \qquad \tilde{c} = \begin{pmatrix} c_1 \\ \vdots \\ c_L \end{pmatrix}$$

then problem (4) can be equivalently rewritten as

$$\min_{\tilde{c}} \ \tfrac{1}{2}\|\tilde{s} - \tilde{U}\tilde{c}\|_2^2 + \lambda J'_{p,q}(\tilde{c}) \qquad (5)$$

where $J'_{p,q}(\tilde{c}) = \sum_{i=1}^{M} \|\tilde{c}_{g_i}\|_q^p$ with $g_i$ being all the indices in $\tilde{c}$ related to the $i$-th element of the dictionary matrix $U$.

As we have already stated, simultaneous sparse approximation and equivalent problems have been investigated by diverse communities and have also been applied to various application problems, and the literature on these topics has become overwhelming over the last few years. The aim of this work is to gather results from different communities (statistics, signal processing and machine learning) so as to survey, analyze and compare different proposed algorithms for solving optimization problem (4) for different values of $p$ and $q$. In particular, instead of merely summarizing existing results, we enrich our survey by providing results such as proofs of convergence, formal relations between different algorithms and experimental comparisons that were not available in the literature. These experimental comparisons essentially focus on the computational complexity of the different methods, on their ability to recover the signal sparsity pattern and on the quality of approximation they provide, evaluated through mean-square error.

Note that recently there have been various works which addressed the simultaneous sparse approximation in the noiseless case [47,45] and many works which aim at providing theoretical approximation guarantees in both the noiseless and noisy cases [6,13,33,14,25,29]. Surveying these works is beyond the scope of this paper and we suggest that interested readers follow, for instance, these pointers and the references therein.

The most frequent case of problem (4) encountered in the literature is the one where $p=1$ and $q=2$. This case is the simpler one since it leads to a convex problem. We discuss different algorithms for solving this problem in Section 2. Then we survey in Section 3 generic algorithms that are able to solve our approximation problems with different values of $p$ and $q$. These algorithms are essentially iterative reweighted $\ell_1$ or $\ell_2$ algorithms. Many works in the literature have focused on one algorithm for solving a particular case of problem (4). These specific algorithms are discussed in Section 4. Notably, we present some greedy algorithms and discuss a sparse Bayesian learning algorithm. The experimental comparisons we carried out aim at clarifying how each algorithm behaves in terms of computational complexity, sparsity recovery and approximation mean-square error. These results are described in Section 5. For the sake of reproducibility, the code used in this paper has been made freely available (http://asi.insa-rouen.fr/enseignants/arakotom/code/SSAindex.html). Final discussions and conclusions are given in Section 6.

2. Solving the $\ell_1$–$\ell_2$ optimization problem

When $p=1$ and $q=2$, optimization problem (4) becomes a particular problem named M-BP (Multiple Basis Pursuit) in the sequel. It is a special case that deserves attention. Indeed, it seems to be the most natural extension of the so-called Lasso problem [50] or Basis Pursuit Denoising [7], since for $L=1$ it can easily be shown that problem (4) reduces to these two problems. The key point of this case is that it yields a convex optimization problem and thus benefits from all the properties resulting from convexity, e.g. a global minimum. One of the first algorithms addressing the M-BP problem is the one proposed by Cotter et al. [8], known as M-FOCUSS. This algorithm, based on factored gradient descent, works for any $p \le 1$ and has been proved to converge towards a local or global (when $p=1$) minimum of problem (4) if it does not get stuck in a fixed point. Because M-FOCUSS is better tailored to situations where $p \le 1$, we postpone its description to the next section. The two algorithms on which we focus herein are the block-coordinate descent proposed by Yuan et al. [61] for the group lasso and the Landweber iterations of Fornasier et al. [17]. We have biased our survey towards these two approaches because of their efficiency and their convergence properties.

2.1. Block coordinate descent

The specific structure of the optimization problem (4) for $p=1$ and $q=2$ leads to a very simple block coordinate descent algorithm that we name M-BCD. Here, we provide a more detailed derivation of the algorithm and we also give results that were not available in the work of Yuan et al. [61]. We provide a proof of convergence of the M-BCD algorithm and we show that if the dictionary is orthogonal, then the solution of the problem is equivalent to a simple shrinkage of the coefficient matrix.

2.1.1. Deriving optimality conditions

The M-BP optimization problem is the following:

$$\min_C \ W(C) = \tfrac{1}{2}\|S - UC\|_F^2 + \lambda \sum_i \|c_{i,\cdot}\|_2 \qquad (6)$$

where the objective function $W(C)$ is a non-smooth but convex function. Since the problem is unconstrained, a necessary and sufficient condition for a matrix $C^\star$ to be a minimizer of (6) is that $0 \in \partial W(C^\star)$, where $\partial W(C)$ denotes the subdifferential of our objective function $W(C)$ [2]. By computing the subdifferential of $W(C)$ with respect to each row $c_{i,\cdot}$ of $C$, the KKT optimality condition of problem (6) is then

$$-r_i + \lambda g_{i,\cdot} = 0 \quad \forall i$$

where $r_i = u_i^T(S - UC)$, with $u_i$ denoting the $i$-th column (atom) of $U$, and $g_{i,\cdot}$ is the $i$-th row of a subdifferential matrix $G$ of $J_{1,2}(C) = \sum_i \|c_{i,\cdot}\|_2$. According to the definition of $J_{1,2}$'s subdifferential given in the Appendix, the KKT optimality conditions can be rewritten as

$$-r_i + \lambda \frac{c_{i,\cdot}}{\|c_{i,\cdot}\|_2} = 0 \quad \forall i,\ c_{i,\cdot} \neq 0, \qquad \|r_i\|_2 \le \lambda \quad \forall i,\ c_{i,\cdot} = 0 \qquad (7)$$

A matrix $C$ satisfying these equations can be obtained through the following algebra. Let us expand each $r_i$ so that

$$r_i = u_i^T(S - UC_{-i}) - u_i^T u_i\, c_{i,\cdot} = T_i - c_{i,\cdot} \qquad (8)$$

where $C_{-i}$ is the matrix $C$ with the $i$-th row set to 0 and $T_i = u_i^T(S - UC_{-i})$. The second equality is obtained by remembering that $u_i^T u_i = 1$. Then, Eq. (7) tells us that if $c_{i,\cdot}$ is non-zero, $T_i$ and $c_{i,\cdot}$ have to be collinear. Plugging all these points into Eq. (7) yields an optimal solution that can be obtained as

$$c_{i,\cdot} = \left(1 - \frac{\lambda}{\|T_i\|}\right)_+ T_i \quad \forall i \qquad (9)$$

where $(x)_+ = x$ if $x > 0$ and 0 otherwise. From this update equation, we can derive a simple algorithm which consists in iteratively applying the update (9) to each row $i$ of $C$ while keeping the other rows fixed.

2.1.2. The algorithm and its convergence

The block-coordinate descent algorithm is detailed in Algorithm 1. It is a simple and efficient algorithm for solving M-BP. Basically, the idea consists in solving for each row $c_{i,\cdot}$ at a time while keeping the other rows fixed. Starting from a sparse solution such as $C = 0$, at each iteration one checks for a given $i$ whether the row $c_{i,\cdot}$ is optimal or not based on conditions (7). If not, $c_{i,\cdot}$ is then updated according to Eq. (9).

Although such a block-coordinate algorithm does not converge in general for non-smooth optimization problems, Tseng [55] has shown that for an optimization problem whose objective is the sum of a smooth and convex function and a non-smooth but block-separable convex function, block-coordinate optimization converges towards the global minimum of the problem. Our proof of convergence is based on such properties and follows the same lines as the one proposed by Sardy et al. [42] for a single signal approximation.

Theorem 1. The M-BCD algorithm converges to a solution of the M-Basis Pursuit problem given in Eq. (6), where convergence is understood in the sense that any accumulation point of the M-BCD algorithm is a minimum of problem (6) and the sequence $\{C^{(k)}\}$ generated by the algorithm is bounded.

Proof. Note that the M-BP problem presents a particular structure with a smooth and differentiable convex function $\tfrac{1}{2}\|S - UC\|_F^2$ and a row-separable penalty function $\sum_i h_i(c_{i,\cdot})$, where $h(\cdot)$ is a continuous and convex function with respect to $c_{i,\cdot}$. Also note that our algorithm considers a cyclic rule where, within each loop, for any $i \in [1, \ldots, M]$, each $c_{i,\cdot}$ is considered for optimization. The main particularity is that, for some $i$, $c_{i,\cdot}$ may be left unchanged by the block-coordinate descent if already optimal. This occurs especially for rows $c_{i,\cdot}$ which are equal to 0. Then, according to the special structure of the problem and the use of a cyclic rule, the results of Tseng [55] prove that our M-BCD algorithm converges. □

Algorithm 1. Solving M-BP through block-coordinate descent.

1: $t=1$, $C^{(0)} = 0$, Loop = 1
2: while Loop do
3:   for $i = 1, 2, \ldots, M$ do
4:     Compute $\|r_i\|$
5:     if the optimality condition of $c_{i,\cdot}$ according to Eqs. (7) is not satisfied then
6:       $T_i \leftarrow u_i^T(S - UC^{(t-1)}_{-i})$
7:       $c^{(t)}_{i,\cdot} \leftarrow \left(1 - \frac{\lambda}{\|T_i\|}\right)_+ T_i$
8:     end if
9:     $t \leftarrow t+1$
10:  end for
11:  if all optimality conditions are satisfied then
12:    Loop = 0
13:  end if
14: end while
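As a concrete illustration, here is a minimal NumPy sketch of the M-BCD update (9) together with the optimality check (7). It is our own rendering of Algorithm 1, not code from the paper; it assumes unit-norm columns of $U$ and does not include the precomputation tricks discussed below.

```python
import numpy as np

def m_bcd(S, U, lam, max_sweeps=200, tol=1e-6):
    """Block-coordinate descent for M-BP (p=1, q=2), following Algorithm 1.

    Assumes the columns of U are unit-norm. Returns the row-sparse matrix C.
    """
    N, M = U.shape
    L = S.shape[1]
    C = np.zeros((M, L))
    R = S.copy()                           # residual S - U C (C = 0 at start)
    for _ in range(max_sweeps):
        max_violation = 0.0
        for i in range(M):
            # T_i = u_i^T (S - U C_{-i}) = u_i^T R + c_{i,.}  since u_i^T u_i = 1
            Ti = U[:, i] @ R + C[i]
            norm_Ti = np.linalg.norm(Ti)
            # shrinkage update (9): c_{i,.} = (1 - lam/||T_i||)_+ T_i
            new_ci = max(0.0, 1.0 - lam / norm_Ti) * Ti if norm_Ti > 0 else 0.0 * Ti
            delta = new_ci - C[i]
            if np.any(delta):
                R -= np.outer(U[:, i], delta)      # keep the residual consistent
                C[i] = new_ci
            # optimality conditions (7)
            ri = U[:, i] @ R
            if np.linalg.norm(C[i]) == 0:
                max_violation = max(max_violation, np.linalg.norm(ri) - lam)
            else:
                max_violation = max(max_violation,
                                    np.linalg.norm(ri - lam * C[i] / np.linalg.norm(C[i])))
        if max_violation < tol:
            break
    return C
```

With the toy data of the earlier snippet, `m_bcd(S, U, lam=0.05)` returns a matrix whose non-zero rows match the two active rows of `C_true` for a suitable value of λ.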

Regarding computational complexity, if we suppose that the computations of $U^T S$ and $U^T U$ have been performed beforehand, the main computational burden of Algorithm 1 for each loop is the computation of $\|r_i\|$ and $T_i$. The complexity of computing these two terms for all $i$ is about $O(M^2 L)$. Note however that since $T_i$ is computed only for non-optimal $c_{i,\cdot}$, in practice, and as shown in our experiments, the computational complexity is empirically lower than quadratic with respect to $M$.

Another point we want to emphasize, and that has not been shed light on yet, is that this algorithm should be initialized at $C = 0$ so as to take advantage of the sparsity pattern of the solution. Indeed, by doing so, only a few updates are needed before convergence.

Intuitively, we can understand this algorithm as one which tends to shrink to zero the rows of the coefficient matrix that contribute poorly to the approximation. Indeed, $T_i$ can be interpreted as the correlation between the residual when row $i$ has been removed and $u_i$. Hence, the smaller the norm of $T_i$ is, the less $u_i$ is relevant in the approximation, and according to Eq. (9), the smaller the resulting $c_{i,\cdot}$ is. Insight into this block-coordinate descent algorithm can be further obtained by supposing that $M \le N$ and that $U$ is composed of orthonormal elements of $\mathbb{R}^N$, hence $U^T U = I$. In such a situation, we have

$$T_i = u_i^T S \quad \text{and} \quad \|T_i\|_2^2 = \sum_{k=1}^{L} (u_i^T s_k)^2$$

and thus

$$c_{i,\cdot} = \left(1 - \frac{\lambda}{\sqrt{\sum_k (u_i^T s_k)^2}}\right)_+ u_i^T S$$

This last equation highlights the relation between the single Basis Pursuit (when $L=1$) and the Multiple Basis Pursuit algorithm presented here. Both algorithms lead to a shrinkage of the coefficient projections when considering orthonormal dictionary elements. With the inclusion of multiple signals, the shrinking factor becomes more robust to noise since it depends on the correlation of the atom $u_i$ with all signals.
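As a quick numerical sanity check of this closed form (our own example, reusing the `m_bcd` sketch above and random data), one can verify that for an orthonormal dictionary the block-coordinate solution coincides with the row-wise shrinkage of $U^T S$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, L, lam = 8, 8, 3, 0.3
U, _ = np.linalg.qr(rng.standard_normal((N, M)))   # orthonormal dictionary, U^T U = I
S = rng.standard_normal((N, L))

P = U.T @ S                                        # coefficient projections u_i^T S
shrink = np.maximum(0.0, 1.0 - lam / np.linalg.norm(P, axis=1, keepdims=True))
C_closed_form = shrink * P                         # row-wise shrinkage formula above

C_bcd = m_bcd(S, U, lam)                           # from the earlier sketch
print(np.allclose(C_bcd, C_closed_form, atol=1e-6))   # expected: True
```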
This M-BCD algorithm can be considered as an extension to simultaneous signal approximation of the works of Sardy et al. [42] and Elad [12], which also considered block coordinate descent for single signal approximation. In addition to the works of Sardy et al. and Elad, many other authors have considered block-coordinate descent algorithms for related sparse approximation problems. For instance, block coordinate descent has also been used for solving the Lasso [19] and the elastic net [62].

2.2. Landweber iterations

For recovering vector-valued data with joint sparsity constraints, Fornasier et al. [17] have proposed an extension of the iterative Landweber approach introduced by Daubechies et al. [9]. In their work, Fornasier et al. used an iterative shrinkage algorithm (which has the flavor of a gradient projection approach) that is able to solve the general problem (4) with $p=1$ and $q \in \{1, 2, \infty\}$. We summarize here the details of their derivation for general $q$. The problem to solve is the one given in Eq. (4) with $p=1$:

$$\min_C \ \tfrac{1}{2}\|S - UC\|_F^2 + \lambda \sum_i \|c_{i,\cdot}\|_q \qquad (10)$$

The iterative algorithm can be derived first by noting that the objective function given above is strictly convex, then by defining the following surrogate objective function:

$$J_{sur}(C,A) = \tfrac{1}{2}\|S - UC\|_F^2 + \lambda \sum_i \|c_{i,\cdot}\|_q - \tfrac{1}{2}\|UC - UA\|_F^2 + \tfrac{1}{2}\|C - A\|_F^2$$

From this surrogate function, Fornasier et al. consider the solution of problem (4) as the limit of the sequence $C^{(t)}$:

$$C^{(t+1)} = \arg\min_C \ J_{sur}(C, C^{(t)})$$

Note that, according to the additional terms (the last two ones) in the surrogate function, $C^{(t+1)}$ is a coefficient matrix that minimizes the desired objective function plus terms that constrain $C^{(t+1)}$ and the novel signal approximation $UC^{(t+1)}$ to be ''close'' respectively to $C^{(t)}$ and the previous signal approximation $UC^{(t)}$.

Now, the minimizer of $J_{sur}$ for fixed $A$ can be obtained by expanding $J_{sur}$ as

$$J_{sur}(C,A) = \tfrac{1}{2}\|(A + U^T(S - UA)) - C\|_F^2 + \lambda \sum_i \|c_{i,\cdot}\|_q - \tfrac{1}{2}\|A + U^T(S - UA)\|_F^2 + \tfrac{1}{2}\|S\|_F^2 - \tfrac{1}{2}\|UA\|_F^2 + \tfrac{1}{2}\|A\|_F^2$$

and noticing that the last four terms of this expression do not depend on $C$. Thus we have

$$\arg\min_C \ J_{sur}(C,A) = \mathbb{U}_q(A + U^T(S - UA))$$

with $\mathbb{U}_q(X)$ being defined as

$$\mathbb{U}_q(X) = \arg\min_C \ \tfrac{1}{2}\|X - C\|_F^2 + \lambda \sum_i \|c_{i,\cdot}\|_q \qquad (11)$$

Thus the iterative algorithm we get simply reads, at iteration $t$, as

$$C^{(t+1)} = \mathbb{U}_q(C^{(t)} + U^T(S - UC^{(t)})) \qquad (12)$$

Algorithm 2. Solving M-BP through Landweber iterations.

1: $C^{(0)} = 0$, Loop = 1, $t=0$
2: while Loop do
3:   $C^{(t+1/2)} \leftarrow C^{(t)} + U^T(S - UC^{(t)})$
4:   for $i = 1, 2, \ldots, M$ do
5:     Compute $\|c^{(t+1/2)}_{i,\cdot}\|$
6:     $c^{(t+1)}_{i,\cdot} = \left(1 - \frac{\lambda}{\|c^{(t+1/2)}_{i,\cdot}\|}\right)_+ c^{(t+1/2)}_{i,\cdot}$
7:   end for
8:   $t \leftarrow t+1$
9:   if the stopping criteria are satisfied then
10:    Loop = 0
11:  end if
12: end while

The Landweber algorithm proposed by Fornasier et al. involves two steps: a first step, similar to a gradient descent with fixed step size, which does not take into account the sparsity constraint, and a thresholding step through the operator $\mathbb{U}_q(\cdot)$, which updates the solution according to the penalty $\sum_i \|c_{i,\cdot}\|_q$. Fornasier et al. give detailed information on the closed-form expression of $\mathbb{U}_q(\cdot)$ for $q \in \{1, 2, \infty\}$. We can note that problem (11) decouples with respect to the rows of $C$ and that the thresholding operator $\mathbb{U}_q(\cdot)$ acts independently on each row. According to Fornasier et al. [17], for $q=2$, $\mathbb{U}_2(C)$ writes for each row $c_{i,\cdot}$ as

$$[\mathbb{U}_2(C)]_{i,\cdot} = \left(1 - \frac{\lambda}{\|c_{i,\cdot}\|}\right)_+ c_{i,\cdot}$$

and the full algorithm is summarized in Algorithm 2. For this algorithm, the computational complexity at each iteration is similar to that of the block-coordinate descent algorithm and is dominated by the computation of $U^T U C$, which is of the order $O(M^2 L)$ if $U^T U$ is precomputed. We also note that block-coordinate descent and Landweber iterations yield algorithms that are very similar. Two differences can be noted. The first one is that the shrinking operation in one case is based on $T_i$ while in the other case it directly considers $c_{i,\cdot}$. At the moment, it is not clear what the implications of these two different schemes are in terms of convergence rate, and more investigation is needed to clarify this point. The other main difference between the two approaches is that, by optimizing at each loop only the $c_{i,\cdot}$'s that are not optimal yet, the BCD algorithm is more efficient than the one of Fornasier et al. Our experimental results corroborate this finding.

Note that Fornasier et al. have also shown that this iterative scheme converges towards the minimizer of problem (4) for $1 \le q \le \infty$.
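For illustration, a minimal NumPy version of the Landweber iteration (12) with the $q=2$ row-thresholding operator could look as follows. This is our own sketch of Algorithm 2, not the authors' code, and it assumes the standard normalization condition (spectral norm of $U$ at most 1, e.g. after rescaling $U$ and $S$) under which the surrogate-based scheme is valid.

```python
import numpy as np

def row_threshold_q2(X, lam):
    """U_2 operator of Eq. (11): row-wise shrinkage (1 - lam/||x_i||)_+ x_i."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * X

def landweber_mbp(S, U, lam, n_iter=500):
    """Iterative shrinkage of Eq. (12): C <- U_2(C + U^T (S - U C)).

    Assumes ||U||_2 <= 1 so that the surrogate J_sur majorizes the objective.
    """
    M, L = U.shape[1], S.shape[1]
    C = np.zeros((M, L))
    for _ in range(n_iter):
        C = row_threshold_q2(C + U.T @ (S - U @ C), lam)
    return C
```

In practice one would replace the fixed iteration count by a stopping criterion on the change in $C$, as in Algorithm 2.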
2.3. Other works

Besides the M-FOCUSS algorithm of Cotter et al., another seminal work which considered the SSA problem is the one of Malioutov et al. [31]. While dealing with a source localization problem, they had to consider a sparse approximation problem with joint sparsity constraints. From the block-sparse signal approximation problem

$$\min_{\tilde{c}} \ \tfrac{1}{2}\|\tilde{s} - \tilde{U}\tilde{c}\|_2^2 + \lambda \sum_{i=1}^{M} \|\tilde{c}_{g_i}\|_2$$

they derived an equivalent problem which writes as

$$\begin{aligned} \min_{p,q,r,\tilde{c}} \quad & p + \lambda q \\ \text{s.t.} \quad & \tfrac{1}{2}\|\tilde{s} - \tilde{U}\tilde{c}\|_2^2 \le p \\ & \sum_{i=1}^{M} r_i \le q \\ & \|\tilde{c}_{g_i}\|_2 \le r_i \quad \forall i = 1, \ldots, M \end{aligned}$$

Note that this problem has a linear objective function together with linear and second-order cone constraints. Such a problem is known as a second-order cone programming problem and there exist interior point methods for solving it [46]. While very attractive because of its simple formulation and the global convergence of the algorithm, this approach suffers from a high complexity, of the order $O(M^3 L^3)$.

Very recently, two other algorithms have been proposed for addressing our jointly sparse approximation problem. The first one, derived by van den Berg et al. [56], is very similar to the one of Fornasier et al., although obtained from a different perspective. Indeed, they develop a method based on spectral gradient projection for solving the group lasso problem, which is the one given in Eq. (5) with $p=1$ and $q=2$. However, instead of the regularized version, they considered the following constrained version:

$$\min_{\tilde{c}} \ \tfrac{1}{2}\|\tilde{s} - \tilde{U}\tilde{c}\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{M} \|\tilde{c}_{g_i}\|_2 \le \tau$$

which they iteratively solve owing to spectral gradient projection. Basically, their solution iterates write as

$$\tilde{c}^{(t+1)} = P\big(\tilde{c}^{(t)} + \alpha \tilde{U}^T(\tilde{s} - \tilde{U}\tilde{c}^{(t)})\big) \qquad (13)$$

where $\alpha$ is some step size, to be chosen for instance by backtracking, and $P(\cdot)$ is the projection operator defined as

$$P(z) = \arg\min_x \ \|z - x\|^2 \quad \text{subject to} \quad \sum_i \|x_{g_i}\|_2 \le \tau$$

van den Berg et al. then proposed an algorithm that computes this projection in linear time. Note how similar the iterates given in Eqs. (12) and (13) are. Actually, the approaches of Fornasier et al. and van den Berg et al. are equivalent, and the only point in which they differ is the way the projections are computed. This difference essentially comes from the problem formulation: Fornasier et al. use a penalized problem with known $\lambda$, while van den Berg considers the constrained optimization problem. Hence, the latter cannot directly use the analytic solution of the projection $P(\cdot)$ but has to compute the appropriate value of $\lambda$ given the value of $\tau$. However, the algorithm they derive for this projection is efficient. Note that if one has some prior knowledge on the value of $\tau$, then this algorithm would be preferable to the one of Fornasier et al.

The other recent work that is noteworthy is the one of Obozinski et al. [35]. They address the problem of SSA within the context of a multi-task classification problem. Indeed, they want to recover a common set of covariates that are simultaneously relevant for all classification tasks. For achieving this aim, they propose to solve problem (4) with $p=1$, $q=2$ and a differentiable loss function such as the logistic loss or the square loss. One of their most prominent contributions is to have proposed a path-following algorithm that is able to compute the regularization path of the solution with respect to $\lambda$. Their algorithm uses a continuation method based on prediction–correction steps. The prediction step is built upon the optimality conditions given in Eq. (7), while the correction step needs an algorithm which solves problem (4), but only for a small subset of covariates. The main advantage of such an algorithm is its efficiency for selecting an appropriate value of $\lambda$ in a model selection context.
3. Generic algorithms for large classes of p and q

In this section, we present several algorithms that are able to solve the simultaneous sparse approximation problem for large classes of $p$ and $q$. In the first part of the section, we review the M-FOCUSS algorithm of Cotter et al. [8], which addresses the case where $0 < p \le 1$ and $q=2$. In a second part, we make clear how the penalization term $J_{p,q}(C)$ with $p=1$ and $1 \le q \le 2$ is related to automatic relevance determination. From this novel insight, we then propose an iterative reweighted least-squares algorithm that solves this general case. The last part of the section is devoted to the extension of reweighted $\ell_1$ algorithms [63,4] to the SSA problem. We describe how algorithms tailored for solving the SSA problem with $p=1$ and any $q$ can be re-used for solving a problem with the same $q$ but $p < 1$.

3.1. M-FOCUSS algorithm

We first detail how the M-FOCUSS algorithm proposed by Cotter et al. [8] can be derived, then we discuss some of its properties and its relation with other works.

3.1.1. Deriving the algorithm

M-FOCUSS is, as far as we know, the first algorithm that was introduced for solving the simultaneous sparse approximation problem. M-FOCUSS addresses the general case

ð Þ ð Þ ð Þ where q=2 and pr1. This algorithm can be understood as a C t ¼ W t Z t where Z is the minimizer of fixed-point algorithm which can be derived through an 1 l appropriate factorization of problem (4) objective function JSUWðtÞZJ2 þ JZJ2 2 F 2 F gradient. ðtÞ Indeed, since the partial derivative of Jp,2ðCÞ with and thus C can be understood as the minimizer of respects to an entry c is m,n 1 l 0 1 JSUCJ2 þ JðWðtÞÞ1CJ2 p=2 2 F 2 F @J ðCÞ @ X X p,2 ¼ @ c2 A ¼ pJc Jp2c ð14Þ @c @c i,j m, 2 m,n This above equation makes clear the relation between m,n m,n i j sparse approximation problem and iterative reweighted the gradient of the objective function writes least-square. Such an connection has already been high- lighted in other context. Indeed, while sparsity is usually UtðSUCÞþlPC induced by using ‘1 norm penalty, it has been proved that J Jp2 where P ¼ diagðp ci, 2 Þ. Then, we define the weight- solving the problem in which the ‘1 norm has been 1=2J J1p=2 ing matrix W as W ¼ diagðp ci, 2 Þ. After having replaced by an adaptive ‘2 norm leads to equivalent replaced W2 ¼ P in the above gradient, and after simple solutions [10,41,23]. algebras, we have the following necessary optimality Despite this nice insight, M-FOCUSS presents an condition: important issue. Indeed, the updates proposed by Cotter et al. are not guaranteed to converge to a local minimum ððUWÞtðUWÞþlIÞW1C ¼ðUWÞtS of the problem (if the problem is not convex po1) or to Hence, the optimum solution writes as the global minimum of the convex problem (p=1). Their

% algorithm presents several fixed-points since when a row C ¼ WððUWÞtðUWÞþlIÞ1ðUWÞtS ð15Þ of C is equal to 0, it stays at 0 at the next iteration. Note that since W also depends on C, the above closed- Although such a point may be harmless if the algorithm is form expression of the optimal matrix C can be inter- initialized with a ‘‘good’’ starting point, it is nonetheless preted as a fixed-point algorithm. From this insight on C, an undesirable point when solving a convex problem. At Cotter et al. suggest the following iterates for solving the the contrary, the M-BCD and the Landweber iteration simultaneous sparse approximation problem: algorithms do not suffer from the presence of such fixed points. However, in the M-FOCUSS algorithm, this patho- Cðt þ 1Þ ¼ WðtÞððUWðtÞÞtðUWðtÞÞþlIÞ1ðUWðtÞÞtS ð16Þ logical behavior can be handled by introducing a smooth- ðtÞ 1=2J ðtÞJ1p=2 where W ¼ diagðp ci, 2 Þ. Now, since we have ing term e in the weight so that the updated weight ððUWðtÞÞtðUWðtÞÞþlIÞ1ðUWðtÞÞt ¼ðUWðtÞÞtððUWðtÞÞðUWðtÞÞt þlIÞ1 becomes ðtÞ 1=2 J ðtÞJ 1p=2 Cotter et al. actually consider the following iterative W ¼ diagðp ð ci, þeÞÞ Þ scheme: where e40. The use of e avoids a given weight to be at ðt þ 1Þ ðtÞ ðtÞ t ðtÞ ðtÞ t 1 C ¼ W ðUW Þ ððUW ÞðUW Þ þlIÞ S ð17Þ zero and consequently it avoids the related ci, to stay Resulting algorithm is detailed in Algorithm 3. The com- permanently at zero. If we furthermore note that putational complexity of this algorithm is dominated by M-FOCUSS is not more than an iterative reweighted the linear system to be solved at each iteration. Forming least-square, then according to the very recent works of the linear system and its resolutions costs respectively Daubechies et al. [10] and Chartrand et al. [5], it seems about OðN2MÞ and OðN3Þ. justified to iterate the M-FOCUSS algorithm using decreasing value of e. In our numerical experimental, we Algorithm 3. M-FOCUSS with Jp r 1,q ¼ 2 penalty. will consider the M-FOCUSS algorithm with fixed and decreasing value of e. 1: Initialize Cð0Þ to a matrix of 1, t ¼ 1 2: while Loop do 3: WðtÞ’diagðp1=2Jcðt1ÞJ1p=2Þ i, 2 3.2. Automatic relevance determination approach 4: AðtÞ’UWðtÞ 5: ðtÞ’ ðtÞ ðtÞ t ðtÞ ðtÞ t 1 C W ðA Þ ðA ðA Þ þlIÞÞ S In this section, we focus on the relaxed optimization 6: t’t þ1 7: if stopping condition is satisfied then problem given in (4) with the general penalization Jp,qðCÞ. 8: Loop = 0 Our objective here is to clarify the connection between 9: end if such a form of penalization and the automatic relevance 10: end while determination of C’s rows, which has been the keystone of the Bayesian approach of Wipf et al. [60]. We will show

3.1.2. Discussing M-FOCUSS that for some values of p and q, the mixed-norm Jp,qðCÞ has Rao et al. [40] has provided an interesting interpreta- an equivalent variational formulation. Then, by using this tion on the FOCUSS algorithm for a single signal approx- novel formulation in problem (4), instead of Jp,qðCÞ, imation which can be readily extend to M-FOCUSS. we exhibit the relation between our sparse approxima- Indeed, M-FOCUSS can be viewed as an iterative re- tion problem and ARD. We then propose an iterative weighted least-square algorithm. Indeed, From the updat- reweighted least-square approach for solving this result- ing equation of CðtÞ (see Eq. (16)) , we can note that ing ARD problem. 1512 A. Rakotomamonjy / Signal Processing 91 (2011) 1505–1526

3.2.1. Exhibiting the relation with ARD smaller the ct,k norm should be. Furthermore, optimization For this purpose, we first consider the following formu- problem (21) also involves some constraints on fdt,kg.These lation of the simultaneous sparse approximation problem: constraints impose the matrix d to have positive elements

andtobesothatits‘1=ðr þ sÞ,1=s mixed-norm is smaller than 1J J2 2=p min SUC F þluðJp,qðCÞÞ ð18Þ 1. Note that this mixed-norm on d plays an important role C 2 since it induces the row-norm sparsity on C. Indeed, Intheconvexcase(forpZ1andqZ1), since the power according to the relation between p, r and s,forpo1, we function is strictly monotonically increasing, problems (4) also have r þs41, making the ‘1=ðr þ sÞ,1=s non-differentiable and (18) are equivalent, in the sense that for a given value with respect to the first norm. Such singularities favor lu,thereexistsal so that solutions of the two problems are row-norm sparsity of the matrix d at optimality, inducing equivalent. When Jp,q is not convex, this equivalence does row-norm sparsity of C. As we have noted above, when a not strictly apply. However, due to the nature of the row-norm of d is equal to 0, the corresponding row-norm of problem, the problem formulation (18) is more convenient C should also be equal to 0 which means that the corre- for exhibiting the relation with ARD. sponding element of the dictionary is ‘‘irrelevant’’ for the Let us introduce the key lemma that allows us to derive approximation of all signals. Problem (21) proposes an the ARD-based formulation of the problem. This lemma equivalent formulation of problem (4) for which the row- gives a variational form of the ‘p,q norm of a sequence diversity measure has been transformed in another penalty fat,kg. Note that similar proofs have been derived [48,39]. function owing to an ARD formulation. The trade-off between convexity of the problem and the sparsity of the Lemma 1. If s40 and fa : k 2 N ,t 2 N g2R such that t,k n T solution has been transferred from p and q to r and s. at least one a 40, then t,k From a Bayesian perspective, we can interpret the 8 ! 9 < s=ðr þ sÞ = mixed-norm on d as the diagonal term of the covariance X ja j2 X X t,k Z 1=s min : dt,k 0, dt,k r1 matrix of a Gaussian prior over the row-norm of C d : d ; t,k t,k k t distribution. This is typically the classical Bayesian auto- 0 1 != X X p=q 2 p matic relevance determination approach as proposed for @ q A instance by Tipping [51]. This novel insight on the ARD ¼ jat,kj ð19Þ k t interpretation of Jp,qðCÞ clarifies the connection between the M-FOCUSS algorithm of Cotter et al. [8] and the where q ¼ 2=ðsþ1Þ and p ¼ 2=ðsþr þ1Þ. Furthermore, at Multiple Sparse Bayesian Learning (M-SBL) algorithm of optimality, we have P Wipf et al. [60] for any value of po1. In their previous ja j2s=ðs þ 1Þð ja j2=ðs þ 1ÞÞr=ðs þ r þ 1Þ works, Wipf et al. have proved that these two algorithms d% ¼ Pt,k P u u,k ð20Þ t,k 2=ðs þ 1Þ ðs þ 1Þ=ðs þ r þ 1Þ r þ s were related when p 0. Here, we refine their result by ð vð ujau,vj Þ Þ enlarging the connection to other values of p by showing Proof. The proof has been postponed to the Appendix. that both algorithms actually solve a problem with auto-

According to this lemma, the ‘p,q norm of a sequence can matic relevance determination on the row-norm of C. be computed through a minimization problem. Hence, applying this lemma to ðJ ðCÞÞ2=p by defining a ¼ c , p,q t,k t,k 3.3. Solving the ARD formulation for p ¼ 1 and 1rqr2 we get a variational form of the penalization term. We can also note that the mixed-norm on the matrix coefficients Herein, we propose a simple iterative algorithm for has been transformed to a mixed-norm on weight matrix d. solving problem (21) for p=1 and 1rqr2. This algo- Plugging the above variational formulation of the penaliza- rithm, named as M-EMq, is based on an iterative tion term in problem (18) yields to the following equivalent reweighted where the weights are updated problem: according to Eq. (20). Thus, it can be seen as an extension 1 X c2 of the M-FOCUSS algorithm of Cotter et al. for qr2. Note min JSUCJ2 þl t,k F t,k that we have restricted ourselves to p=1 since we will C,d 2 dt,k ! show in the next section that the case po1 can be X X s=ðr þ sÞ 1=s handled using another reweighted scheme. s:t: dt,k r1 k t Since p=1, thus sþr ¼ 1, the problem we are consider- d Z0 8t,k ð21Þ ing is t,k ! X X c2 This problem is the one which makes clear the automatic 1 J J2 t,k min skUc,k 2 þl ¼ JobjðC,dÞ relevance determination interpretation of the original for- C,d 2 dt,k k ! t mulation (4). Indeed, we have transformed problem (4) into X X s aproblemwithasmoothobjectivefunctionattheexpense 1=s s:t: dt,k r1 of some additional variables dt,k. These parameters dt,k k t actually aim at determining the relevance of each element dt,k Z0 ð22Þ of C. Indeed, in the objective function, each squared-value Since, we consider that 1rqr2 hence 0rsr1,2 this c is now inversely weighted by a coefficient d .Bytaking t,k t,k optimization problem is convex with a smooth objective the convention that x=0 ¼1 if xa0 and 0 otherwise, the objective value of the optimization problem becomes finite 2 2 only if dt,k ¼ 0forct,k ¼ 0. Then the smaller the dt,k is, the For s=0, we use the sup norm of vector d,k in the constraints. A. Rakotomamonjy / Signal Processing 91 (2011) 1505–1526 1513 function. We propose to address this problem through a in some fixed points (e.g. all the elements of d being zero block-coordinate descent algorithm which alternatively expect for one entry ft1,k1g). In practice, we have experi- solves the problem with respect to C with the weight d enced, by comparing for q=2 with the M-BCD algorithm, being fixed, and then keeping C fixed and computing the that if d is initialized to non-zero entries, Algorithm 4 optimal weight d. The resulting algorithm is detailed converges to the global minimum of problem (22). Numer- in Algorithm 4. ical experiments will illustrate this point. Regarding computational complexity, it is easy to see Algorithm 4. Iterative reweighted least-square for that at each iteration, the main burden of the algorithm is addressing J penalty. 1,1 r q r 2 the linear system to be solved. Supposing that UtU has 1: Initialize dð0Þ to a strictly positive matrix, t=1 been precomputed, solving the systems for all signals 2: while Loop costs about OðM3LÞ. Hence, the flexibility on the choice of 3: CðtÞ’ solution of problem (22) with fixed d ¼ dðt1Þ as given q that this algorithm allows is heavily paid by an by Eq. (23) increased computational complexity. 4: dðtÞ’ solution of problem (22) with fixed C ¼ CðtÞ as given by Eq. (24) 5: t’tþ1 3.4. 
Iterative reweighted ‘1‘q algorithms for po1 6: if stopping condition is satisfied then 7: Loop = 0 This section introduces an iterative reweighted M- 8: end if Basis Pursuit (IrM-BP) algorithm and proposes a way for 9: end while setting the weights. Through this approach, we are able to provide an iterative algorithm which solves problem (4) Owing to the problem structure, steps 4 and 5 of this with po1 and 1rqr2. algorithm has a simple closed form. Indeed, for fixed d, each vector cðtÞ at iteration t is given by ,k 3.4.1. Iterative reweighted algorithm ðtÞ t ðt1Þ 1 t c,k ¼ðU Uþ2lDk Þ U sk ð23Þ Recently, several works have advocated that sparse approximations can be recovered through iterative algo- where Dðt1Þ is a diagonal matrix of entries 1=dðt1Þ.Ina k ,k rithms based on a reweighted ‘ minimization [63,4,5,58,18]. similar way, for fixed C, step 5 boils down in solving 1 Typically,forasinglesignalcase,theideaconsistsin problem (19). Hence, by defining a ¼ cðtÞ , we also have a t,k t,k iteratively solving the following problem: closed form for dðtÞ as P 1 X 2s=ðs þ 1Þ 2=ðs þ 1Þ ð1sÞ=2 2 ja j ð ja j Þ min JsUcJ þl zijcij dðtÞ ¼ t,kP P u u,k ð24Þ c 2 2 t,k 2=ðs þ 1Þ ðs þ 1Þ=2 i vð ujau,vj Þ where z are some positive weights, and then to update the Note that similar to the M-FOCUSS algorithm, this i positive weights z according to the solution c% of the above algorithm can also be seen as an iterative reweighted i problem. Besides providing empirical evidences that rew- least-square approach or as an Expectation-Minimization eighted ‘ minimization yields to sparser solutions than a algorithm, where the weights are defined in Eq. (24). 1 simple ‘ minimization, the above cited works theoretically Furthermore, it can be shown that if the weights d are 1 support such a claim. These results for the single signal initialized to non-zero values then at each loop involving approximation case suggest that in the simultaneous sparse steps 4 and 5, the objective value of problem (22) approximation problem, reweighted M-Basis Pursuit would decreases. Hence, since the problem is convex, our algo- also lead to sparser solutions than the classical M-Basis rithm should converge towards the global minimum of Pursuit. the problem. The iterative reweighted M-Basis Pursuit algorithm is Theorem 2. If the objective value of problem (22) is strictly defined as follows. We iteratively construct a sequence ðtÞ convex (for instance when U is full-rank), and if for the tth C defined as loop, after the step5, we have dðtÞadðt1Þ, then the objective X ðtÞ 1 J J2 ðtÞJ J value has decreased, i.e.: C ¼ argmin SUC F þl z ci, q ð25Þ C 2 i i ðt þ 1Þ ðtÞ ðtÞ ðtÞ ðtÞ ðt1Þ JobjðC ,d ÞoJobjðC ,d ÞoJobjðC ,d Þ: ðtÞ ðtÞ ðtÞ ðtÞ ðt1Þ where the positive weight vector z depends on the Proof. The right inequality JobjðC ,d ÞoJobjðC ,d Þ ðt1Þ ðtÞ previous iterate C . For t=1, we typically define comes from d being the optimal value of the optimiza- zð1Þ ¼ 1 and for t 41, we will consider the following tion problem resulting from step 5 of Algorithm 4. The weighting scheme: strict inequality yields from the hypothesis that ð Þ ð Þ t a t 1 ðtÞ 1 d d and from the strict convexity of the objective z ¼ 8i ð26Þ i J ðt1ÞJ r function. A similar reasoning allows us to derive the left ð ci, q þeÞ ðtÞ inequality. 
Indeed, since C is not optimal with respect to ðt1Þ ðt1Þ ðtÞ where fc g is the i-th row of C , r auser-defined d for the problem given by step ð4Þ, invoking the strict i, positive constant and e a small regularization term that convexity of the associated objective function and optim- ðt þ 1Þ prevents from having an infinite regularization term for ci, ality of C concludes the proof. & ðt1Þ as soon as ci, vanishes. This is a classical trick that has As stated by the above theorem, the decrease in objective been used for instance by Candes et al. [4] or Chartrand value is actually guaranteed unless, the algorithm get stuck et al. [41]. Note that for any positive weight vector z, 1514 A. Rakotomamonjy / Signal Processing 91 (2011) 1505–1526 problem (25) is a convex problem that does not present the non-convex penalty with a quadratic term. In the light local minima. Furthermore, for 1rqr2, it can be solved by of all these previous works, what we have exposed here is the block-coordinate descent algorithm, the Landweber merely an extension of iterative reweighted ‘1 algorithm iteration approach or by the M-EMq given in Algorithm 4, to simultaneous sparse approximation problem. Through by simply replacing l with li ¼ l zi. This reweighting this iterative scheme, we can solve problem (4) with po1 scheme is similar to the adaptive lasso algorithm of Zou and any q provided that we have an algorithm that is able et al. [63] but uses a larger number of iterations and to solve problem (4) with p ¼ 1 and the desired q. addresses the simultaneous approximation problem. Computational complexity of these iterative reweigh- ted algorithms are difficult to evaluate since it naturally 3.4.2. Connections with Majorization-Minimization depends on the complexity of each inner algorithm and algorithm the number of iterations needed before convergence. This IrM-BP algorithm can be interpreted as an algo- However, we have empirically noted that only few itera- rithm for solving an approximation of problem (4) when tions of the inner algorithm are needed (typically about 0opo1 and 1rqr2. Indeed, similar to the reweighted 10) and that the iterative approach greatly benefits from ‘1 scheme of Candes et al. [4] or the one-step reweighted warm-start initializations. Hence, only a slight increase of lasso of Zou et al. [64], this algorithm falls into the class of complexity can be noted between algorithms that use an majorize-minimize (MM) algorithms [26]. MM algorithms ‘1‘q penalty and their reweighted ‘p o 1‘q counterparts. ðtÞ consists in replacing a difficult optimization problem with Analyzing the convergence of the sequence C towards a easier one, for instance by linearizing the objective the global minimizer of problem (4) is a challenging issue. function, by solving the resulting optimization problem Indeed, several points make a formal proof of convergence ðtÞ and by iterating such a procedure. difficult. At first, in order to avoid a row ci, to be perma- The connection between MM algorithms and our nently at zero, we have introduced a smoothing term e, thus reweighted scheme can be made through linearization. we are only solving an eapproximation of problem (4). Let us first define Jp,q,eðCÞ as an approximation of the Furthermore, the penalty we use is non-convex, thus using a penalty term Jp,qðCÞ: monotonic algorithm like an MM approach which decreases X the objective value at each iteration, cannot guarantee con- J J Jp,q,eðCÞ¼ gð ci, q þeÞ vergence to the global minimum of our eapproximate i problem. 
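As an illustration of the reweighted scheme of Eqs. (25)–(26), the following Python sketch wraps a generic weighted M-BP solver; the callback name `solve_weighted_mbp`, the default exponent and the fixed number of outer iterations are our own choices, not specifications from the paper.

```python
import numpy as np

def irm_bp(S, U, lam, solve_weighted_mbp, r=1.0, eps=1e-3, n_reweight=10):
    """Iterative reweighted M-Basis Pursuit (Eqs. (25)-(26)).

    `solve_weighted_mbp(S, U, row_penalties)` is a placeholder for any solver
    returning the C minimizing 0.5*||S - U C||_F^2 + sum_i row_penalties[i]*||c_{i,.}||_q,
    e.g. M-BCD, Landweber iterations or M-EMq, ideally warm-started across calls.
    r = 1 corresponds to the log penalty, r = 1 - p to an l_p penalty (0 < p < 1).
    """
    M = U.shape[1]
    z = np.ones(M)                                # z^(1) = 1
    C = None
    for _ in range(n_reweight):
        C = solve_weighted_mbp(S, U, lam * z)     # problem (25) with lambda_i = lam * z_i
        row_norms = np.linalg.norm(C, axis=1)
        z = 1.0 / (row_norms + eps) ** r          # weight update (26)
    return C
```

An annealing variant, as discussed in the text, would additionally decrease `eps` across the outer iterations.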
Hence, due to these two major obstacles, we have where gðÞ ¼ j jp. Since gðÞ is concave for 0opo1, a left this convergence proof for future works. Note however ðt1Þ linear approximation of Jp,q,eðCÞ around C yields to that few works have addressed the convergence issue of the following majorizing inequality: reweighted ‘1 or ‘2 algorithms for single sparse signal X ðt1Þ p ðt1Þ recovery. Notably, we can mention the recent work of Jp,q, ðCÞrJp,q, ðC Þþ ðJc JqJc JqÞ e e J ðt1ÞJ 1p i, i, Daubechies et al. [10] which provides a convergence proof i ð c q þeÞ i, of iterative reweighted least square for exact sparse recovery. then for the minimization step, replacing in problem (4) In the same flavor, Foucart et al. [18] have proposed a ten-

Jp,q with the right part of the inequality and dropping tative of rigorous convergence proof for reweighted ‘1 sparse constant terms lead to our optimization problem (25) signal recovery. Although, we do not have any rigorous proof with appropriately chosen zi and r. Note that for the of convergence, in practice, we will show that the reweighted weights given in Eq. (26), r ¼P1 corresponds to the algorithm provides good sparse approximations. J J linearization of a log penalty ilogð ci, þeÞ whereas As already noted by several authors [4,41,10], e plays a setting r ¼ 1p corresponds to a ‘p penalty (0opo1). major role in the quality of the solution. In the experi- mental results presented below, we have investigated two Algorithm 5. Majorization-Minimization algorithm leading methods for setting e: the first one is to set e to a fixed to iterative reweighted ‘ for addressing J penalty. 1 p r 1,q value (e.g.0.001) and the other one, denoted as an anneal- 1: ð1Þ ing approach, consists in gradually decreasing e after Initialize zi ¼ 1, r ¼ 1p, t=1 2: while Loop do having solved problem (25). 3: CðtÞ’ solution of problem (25) 4: t’t þ1 3.4.3. Connection with group bridge Lasso 5: zðtÞ’ 1 8i Recently, in a context of variable selection through i ðJcðt1Þ J þ Þr i, q e 6: if stopping condition is satisfied then group lasso, Huang et al. [25] have proposed an algorithm 7: Loop = 0 for solving problem (5) with po1 and q=1: 8: end if 1 X 9: end while J~ ~ ~J2 þ J~ Jp ð Þ min s Uc l cgi 1: 27 c~ 2 i MM algorithms have already been considered in opti- mization problems with sparsity-inducing penalties. For Instead of directly addressing this optimization problem, instance, an MM approach has been used by Figueiredo they have shown that the above problem is equivalent to et al. [16] for solving a least-square problem with a ‘p the minimization problem: sparsity-inducing penalty, whereas Candes et al. [4] have 1 X Jc~ J X ~ 2 gi 1 addressed the problem for exact sparse signal recovery. In min Js~Uc~J þ þt zi c~,z 2 z1=p1 a context of simultaneous approximation, Simila [43,44] i i i has also considered MM algorithms while approximating s:t: zi Z0 8i ð28Þ A. Rakotomamonjy / Signal Processing 91 (2011) 1505–1526 1515 where t is a regularization parameter. Here equivalence is 4.1. S-OMP understood as follows: c~ % minimizes Eq. (27) if and only if the pair ðc~ % ,z% Þ minimizes Eq. (27) with an appropriate Simultaneous Orthogonal Matching Pursuit (S-OMP) [54] value of t. This value of t will be made clear in the sequel. is an extension of the Orthogonal Matching Pursuit [36,53] algorithm to multiple approximation problem. Algorithm 6. Group bridge Lasso based on iterative S-OMP is a greedy algorithm which is based on the reweighted ‘1 for addressing Jp r 1,q penalty. idea of selecting, at each iteration, an element of the 1: ~ ð0Þ 1p 1=ðp1Þ dictionary and on building all signal approximations as Initialize c , t ¼ p l , t=1 2: while Loop do the projection of the signal matrix S on the span of these  p 3: ðtÞ’ pJ~ ðt1ÞJp 1p selected dictionary elements. Since, according to pro- zi t cg 1 p i P blem (2), one looks for dictionary element that can 4: c~ ðtÞ’argmin 1 Js~U~ c~J2 þ ðzðtÞÞ11=pJc~ J c~ 2 i i gi 1 provide simultaneously good approximation of all signals, 5: t’tþ1 6: if stopping condition is satisfied then Tropp et al. have suggested to select the dictionary 7: Loop = 0 element that maximizes the sum of absolute correlation 8: end if between the basis element and the signal residuals. 
This 9: end while greedy selection criterion is formalized as follows:

The connection between iterative reweighted ‘1 appro- XL ðtÞ max j/skP ðskÞ,UiSj ach and the algorithm proposed by Huang et al. comes from i the way problem (28) has been solved. Indeed, Huang et al. k ¼ 1 have considered a block-coordinate approach which con- where PðtÞðÞ is the projection operator on the span of the t sists in minimizing the problem with respect to z and then, selected dictionary elements. While conceptually very sim- after having plugged the optimal z in Eq. (28) in minimizing ple, this algorithm whose details are given in Algorithm 7, the problem with respect to c~. For the first step, the optimal provide theoretical guarantees about the quality of its z is derived by minimizing the convex problem approximation. For instance, Tropp et al. [54] have shown X J~ J X that if the algorithm is stopped when T dictionary elements cgi 1 min þt zi have been selected then, under some mild conditions on the z 1=p1 i z i ðTÞ i matrix U, there exists a upper bound on JSUC JF .This s:t: z Z0 8i ð29Þ % % i upper bound depends on JSUC JF with C being the Using Lagrangian theory, simple algebras yield to a closed- solution of problem (2) and multiplicative constant which form solution of the optimal z: is a function of T, L and the coherence of U [54].Inthis sense, S-OMP can be understood as an algorithm that 1p p % ¼ pJ~ Jp approximately solves problem (4) with p=0. zi t cgi 1 p Regarding complexity, we note that at each iteration,

Plugging these zi’s in Eq. (28) and using t such that the most demanding part of the S-OMP is the correlation l ¼ðð1pÞ=ptÞp1 proves the equivalence of the pro- E computation and the linear system to be solved for ðtÞ blems (27) and (28). The relation with iterative reweighted obtaining C . These steps respectively cost OðMNLÞ and

‘1 is made clear in the second step of the block-coordinate OðjOjÞ. Since jOj is typically smaller than M, this algorithm descent. Indeed, in problem (28), since we only optimize is a very efficient algorithm. with respect to c~ with fixed zi,thisstepisequivalentto Algorithm 7. S-OMP algorithm. T the number of diction- 1 X Jc~ J min Js~U~ c~J2 þ gi 1 ð30Þ ary elements to select. c~ 2 1=p1 i z i 1: Initialize : t’0,RðtÞ’S,O’|,Loop’1 which is clearly a weighted ‘1 problem. The full iterative 2: while Loop do 3: 9 t ðtÞ reweighted algorithm is detailed in Algorithm 6. Note that E U R P 4: i’argmax je j although the genuine work on group bridge Lasso only k j j,k 5: O’O [ i addresses the case q=1 (because of theoretical develop- 6: t’t þ1 ments considered in the paper), the algorithm proposed by 7: ðtÞ’ t 1 t C ðUOUOÞ UOS Huang et al. can be applied to any choice of q provided that 8: RðtÞ ¼ SUCðtÞ oneisabletosolvetherelated‘1‘q problem (for instance 9: if t=T then using the algorithm described in Section 3). 10: Loop = 0 11: end if 12: end while 4. Specific case algorithms

In this section, we survey some algorithms that provide 4.2. M-CosAmp solutions of the SSA problem (4) when p=0. These algorithms are different in nature and provide different qualities of Very recently, Needell et al. have developed a novel approximation. The first two algorithms we detail are greedy signal reconstruction algorithm known as CosAmp [34]. algorithms which have the flavor of Matching Pursuit [32]. This algorithm uses ideas from Orthogonal Matching The third one is based on a sparse Bayesian learning and is Pursuit but also takes advantage of ideas from the related to iterative reweighted ‘1 algorithm [60]. compressive sensing literature. This novel approach 1516 A. Rakotomamonjy / Signal Processing 91 (2011) 1505–1526 provides stronger theoretical guarantees on the quality of empirical Bayesian learning approach based on automatic signal approximation. The main particularities of CosAmp relevance determination (ARD). The ARD prior over each are the following. Firstly, unlike Orthogonal Matching row they have introduced is Pursuit, CosAmp selects many elements of the dictionary pðci,; diÞ¼N ð0,diIÞ8i at each iteration, the ones that have the largest correla- tion with respect to the signal residual. These novel where d is a vector of non-negative hyperparameters that dictionary elements are thus incorporated into the set of govern the prior variance of each coefficient matrix row. current dictionary elements. The other specificity of Hence, these hyperparameters aim at catching the spar- CosAmp is a pruning dictionary element step after signal sity profile of the approximation. Mathematically, the estimation. This step is justified by the fact that the signal resulting optimization problem is to minimize according to approximate is supposed to be sparse. to d the following cost function: This CosAmp algorithm have been extended so as to XL t 1 handle signals with specific structures such as block- LlogjStjþ sj St sj ð31Þ sparse signal [1]. Extension to jointly sparse approxima- j ¼ 1 tion has also been provided by Duarte et al. [11]. Such an 2 t 2 where St ¼ s IþUDU , D ¼ diagðdÞ and s a parameter of extension of the CosAmp algorithm to simultaneous the algorithm related to the noise level presented in the sparse approximation is detailed in Algorithm 8. Note signals to be approximated. The algorithm is then based that lines 4–5 of the algorithm estimate the dictionary on a likelihood maximization which is performed through elements which have the largest correlations over all the an Expectation-Minimization approach. Very recently, a signals while lines 8–9 prune the set of current dictionary very efficient algorithm for solving this problem has been elements so as to keep the matrix C row-sparse. This proposed [27]. However, the main drawback of this latter algorithm is denoted as M-CosAmp in the experimental approach is that due to its greedy nature, and like any EM section. Complexity of this algorithm is similar to those of algorithm where the objective function is not convex, the S-OMP. However we can note that computing the estima- algorithm can be easily stuck in local minima. ^ tion C involves in the worst case 3T atoms hence its Recently, Wipf et al. [57] have proposed some new 3 complexity is about Oð9T Þ. Besides, computing the cor- insights on automatic relevance determination and sparse relation E is slightly cheaper than for S-OMP since the ðtÞ Bayesian learning. They have shown that, for the vector matrix C has been pruned. 
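To make the greedy selection and pruning steps concrete, here is a short NumPy sketch of these steps (merge the 2T best-correlated atoms with the current support, least-squares estimation, then prune to the T largest rows), consistent with Algorithm 8 given next. It is our own illustrative rendering with a fixed iteration count, not the reference implementation of [11].

```python
import numpy as np

def m_cosamp(S, U, T, n_iter=20):
    """Simultaneous-sparse CoSaMP-style recovery with target row-sparsity T."""
    M = U.shape[1]
    C = np.zeros((M, S.shape[1]))
    support = np.array([], dtype=int)
    R = S.copy()                                   # residual
    for _ in range(n_iter):
        E = U.T @ R                                # correlations with the residuals
        e = np.sum(E ** 2, axis=1)                 # row energies
        new_atoms = np.argsort(e)[::-1][:2 * T]    # 2T best-correlated atoms
        merged = np.union1d(support, new_atoms)
        C_hat, *_ = np.linalg.lstsq(U[:, merged], S, rcond=None)
        v = np.sum(C_hat ** 2, axis=1)             # row norms of the estimate
        keep = np.argsort(v)[::-1][:T]             # prune to the T largest rows
        support = merged[keep]
        C = np.zeros_like(C)
        C[support] = C_hat[keep]
        R = S - U @ C
    return C
```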
regression case, ARD can be achieved by means of iterative Algorithm 8. M-CosAmp algorithm. T the number of reweighted ‘1 minimization. Furthermore, in that paper, dictionary elements to select. they have sketched an extension of such results for matrix regression in which ARD is used for automatically selecting 1: ð0Þ ’ ðtÞ ’| Initialize C ¼ 0,Loop ,R ¼ S,Ores the most relevant covariance components in a dictionary of 2: while Loop do covariance matrices. Such an extension is more related to 3: E9Ut RðtÞ learning with multiple kernels in regression as introduced 4: e ¼ JE J2 i ¼ 1, ...,M i i;: 2 by Girolami et al. [21] or Rakotomamonjy et al. [38] 5: Ores’suppðe,2TÞ 6: O’O [ Ores although some connections with simultaneous sparse 7: ^ ’ t 1 t approximation can be made. Here, we build on the works C ðUOUOÞ UOS 8: J J2 on Wipf et al. [57] and give all the details about how M-SBL vi ¼ ci;: 2 i ¼ 1, ...,M 9: O’suppðv,TÞ and iterative reweighted M-BP are related. 10: t’tþ1 Recall that the cost function minimized by the M-SBL ðtÞ 11: C ¼ 0 of Wipf et al. [60] is 12: ðtÞ ^ CO ¼ CO ðtÞ ðtÞ XL 13: R ¼ SUC LðdÞ¼LlogjS jþ stS 1s ð32Þ 14: if stopping condition is satisfied then t j t j j ¼ 1 15: Loop = 0 16: end if 2 t where St ¼ s IþUDU and D ¼ diagðdÞ, with d being a 17: end while vector of hyperparameters that govern the prior variance of each coefficient matrix row. Now, let us define g% ðzÞ as 4.3. Sparse Bayesian learning and reweighted algorithm the conjugate function of the concave logjStj. Since, that M log function is concave and continuous on R þ , according The approach proposed by Wipf et al. [60], denoted in to the scaling property of conjugate functions we have [3]  the sequel as Multiple-Sparse Bayesian Learning (M-SBL), t % z L logjStj¼minz dLg for solving the sparse simultaneous approximation is z2RM L somewhat related to the optimization problem in Eq. (4) Thus, the cost function LðdÞ in Eq. (32) can then be upper- but from a very different perspective. Indeed, if we bounded by consider that the approaches described in Section 3 are  equivalent to a Maximum a Posteriori estimation proce- z XL Lðd,zÞ9ztdLg% þ stS1s ð33Þ dures, then Wipf et al. have explored a Bayesian model L j t j j ¼ 1 whose prior encourages sparsity. In this sense, their approach is related to the relevance vector machine of Hence when optimized over all its parameters, Lðd,zÞ Tipping et al. [51]. Algorithmically, they proposed an converges to a local minima or a saddle point of (32). A. Rakotomamonjy / Signal Processing 91 (2011) 1505–1526 1517
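As a companion to Algorithm 7, the following is a minimal NumPy sketch of S-OMP. It is an illustrative reimplementation, not the authors' code: the function name s_omp and the choice of stopping after exactly T selections are assumptions made here.

import numpy as np

def s_omp(S, U, T):
    """Minimal S-OMP sketch (Algorithm 7): greedily select T atoms shared by all signals.

    S: (N, L) signal matrix, U: (N, M) dictionary, T: number of atoms to select.
    Returns an (M, L) coefficient matrix with at most T non-zero rows.
    """
    M = U.shape[1]
    R = S.copy()                              # residual R^(t), initialized to S
    omega = []                                # current row-support
    C = np.zeros((M, S.shape[1]))
    for _ in range(T):
        E = U.T @ R                           # correlations of every atom with every residual
        scores = np.abs(E).sum(axis=1)        # sum of |e_{j,k}| over the signals (line 4)
        scores[omega] = -np.inf               # do not reselect an already chosen atom
        omega.append(int(np.argmax(scores)))
        U_om = U[:, omega]
        C_om = np.linalg.lstsq(U_om, S, rcond=None)[0]   # (U_om^t U_om)^{-1} U_om^t S (line 7)
        R = S - U_om @ C_om                   # residual update (line 8)
    C[omega, :] = C_om
    return C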

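Similarly, a rough sketch of the M-CosAmp iteration of Algorithm 8 is given below, again as an illustrative reimplementation under the same assumptions; the fixed iteration budget n_iter simply stands in for the unspecified stopping condition.

import numpy as np

def m_cosamp(S, U, T, n_iter=30):
    """Sketch of M-CosAmp (Algorithm 8): merge 2T candidate atoms, least-squares fit, prune to T rows."""
    M = U.shape[1]
    C = np.zeros((M, S.shape[1]))
    R = S.copy()
    omega = np.array([], dtype=int)
    for _ in range(n_iter):
        E = U.T @ R
        e = (E ** 2).sum(axis=1)                    # row energies of the correlations (line 4)
        omega_res = np.argsort(e)[::-1][:2 * T]     # supp(e, 2T): the 2T most correlated atoms (line 5)
        omega = np.union1d(omega, omega_res)        # merge with the current support (line 6)
        C_hat = np.zeros_like(C)
        C_hat[omega, :] = np.linalg.lstsq(U[:, omega], S, rcond=None)[0]   # line 7
        v = (C_hat ** 2).sum(axis=1)                # row norms of the estimate (line 8)
        omega = np.argsort(v)[::-1][:T]             # prune back to the T largest rows (line 9)
        C = np.zeros_like(C)
        C[omega, :] = C_hat[omega, :]               # lines 11-12
        R = S - U @ C                               # line 13
    return C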
4.3. Sparse Bayesian learning and reweighted algorithm

The approach proposed by Wipf et al. [60], denoted in the sequel as Multiple-Sparse Bayesian Learning (M-SBL), for solving the simultaneous sparse approximation problem is somewhat related to the optimization problem in Eq. (4), but it takes a very different perspective. Indeed, if we consider that the approaches described in Section 3 are equivalent to Maximum a Posteriori estimation procedures, then Wipf et al. have explored a Bayesian model whose prior encourages sparsity. In this sense, their approach is related to the relevance vector machine of Tipping et al. [51]. Algorithmically, they proposed an empirical Bayesian learning approach based on automatic relevance determination (ARD). The ARD prior they have introduced over each row is

$$p(c_{i,\cdot};\, d_i) = \mathcal{N}(0, d_i I)\quad \forall i$$

where d is a vector of non-negative hyperparameters that govern the prior variance of each coefficient matrix row. Hence, these hyperparameters aim at catching the sparsity profile of the approximation. Mathematically, the resulting optimization problem is to minimize with respect to d the following cost function:

$$L\log|\Sigma_t| + \sum_{j=1}^{L} s_j^t \Sigma_t^{-1} s_j \qquad (31)$$

where Σ_t = σ²I + UDU^t, D = diag(d), and σ² is a parameter of the algorithm related to the noise level present in the signals to be approximated. The algorithm is then based on a likelihood maximization which is performed through an Expectation–Maximization (EM) approach. Very recently, a very efficient algorithm for solving this problem has been proposed [27]. However, the main drawback of this latter approach is that, due to its greedy nature, and like any EM algorithm where the objective function is not convex, the algorithm can easily get stuck in local minima.

Recently, Wipf et al. [57] have proposed some new insights on automatic relevance determination and sparse Bayesian learning. They have shown that, for the vector regression case, ARD can be achieved by means of iterative reweighted ℓ1 minimization. Furthermore, in that paper, they have sketched an extension of such results to matrix regression, in which ARD is used for automatically selecting the most relevant covariance components in a dictionary of covariance matrices. Such an extension is more related to learning with multiple kernels in regression, as introduced by Girolami et al. [21] or Rakotomamonjy et al. [38], although some connections with simultaneous sparse approximation can be made. Here, we build on the work of Wipf et al. [57] and give all the details about how M-SBL and iterative reweighted M-BP are related.

Recall that the cost function minimized by the M-SBL of Wipf et al. [60] is

$$\mathcal{L}(d) = L\log|\Sigma_t| + \sum_{j=1}^{L} s_j^t \Sigma_t^{-1} s_j \qquad (32)$$

where Σ_t = σ²I + UDU^t and D = diag(d), with d being a vector of hyperparameters that govern the prior variance of each coefficient matrix row. Now, let us define g*(z) as the conjugate function of the concave log|Σ_t|. Since that log function is concave and continuous on R^M_+, according to the scaling property of conjugate functions we have [3]

$$L\log|\Sigma_t| = \min_{z\in\mathbb{R}^M}\ z^t d - L\, g^{\star}\!\left(\frac{z}{L}\right)$$

Thus, the cost function L(d) in Eq. (32) can then be upper-bounded by

$$\mathcal{L}(d,z) \triangleq z^t d - L\, g^{\star}\!\left(\frac{z}{L}\right) + \sum_{j=1}^{L} s_j^t \Sigma_t^{-1} s_j \qquad (33)$$

Hence, when optimized over all its parameters, L(d,z) converges to a local minimum or a saddle point of (32).

However, for any fixed d, one can optimize over z and get the tight optimal upper bound. If we denote as z* such an optimal z for any fixed d†, since L log|Σ_t| is differentiable, we have, according to conjugate function properties, the following closed form of z*:

$$z^{\star} = L\,\nabla\log|\Sigma_t|(d^{\dagger}) = \mathrm{diag}(U^t \Sigma_t^{-1} U) \qquad (34)$$

Similar to what was proposed by Wipf et al., Eqs. (33) and (34) suggest an alternate optimization scheme for minimizing L(d,z). Such a scheme would consist, after initialization of z to some arbitrary vector, in keeping z fixed and computing

$$d^{\dagger} = \arg\min_d\ \mathcal{L}_z(d) \triangleq z^t d + \sum_{j=1}^{L} s_j^t \Sigma_t^{-1} s_j \qquad (35)$$

then minimizing L(d†, z) for fixed d†, which can be done analytically according to Eq. (34). This alternate scheme is then performed until convergence to some d*. Owing to this iterative scheme proposed for solving M-SBL, we can now make clear the connection between M-SBL and an iterative reweighted M-BP according to the following lemma. Again, this is an extension to the multiple signals case of Wipf's lemma.

Lemma 2. The objective function in Eq. (35) is convex and can be equivalently solved by computing

$$C^{\star} = \arg\min_C\ \mathcal{L}_z(C) = \frac{1}{2}\|S - UC\|_F^2 + \sigma^2 \sum_i z_i^{1/2}\,\|c_{i,\cdot}\| \qquad (36)$$

and then by setting

$$d_i = z_i^{-1/2}\,\|c^{\star}_{i,\cdot}\| \quad \forall i$$

Proof. Convexity of the objective function in Eq. (35) is straightforward since it is just a sum of convex functions [2]. The key point of the proof is based on the equality

$$s_j^t \Sigma_t^{-1} s_j = \min_{c_{\cdot,j}}\ \frac{1}{\sigma^2}\|s_j - U c_{\cdot,j}\|_2^2 + \sum_i \frac{c_{i,j}^2}{d_i} \qquad (37)$$

whose proof is given in the Appendix. According to this equality, we can upper-bound L_z(d) with

$$\mathcal{L}_z(d,C) = z^t d + \frac{1}{\sigma^2}\sum_j \|s_j - U c_{\cdot,j}\|_2^2 + \sum_{i,j}\frac{c_{i,j}^2}{d_i} \qquad (38)$$

The problem of minimizing L_z(d,C) is smooth and jointly convex in its parameters C and d, and thus an iterative coordinate-wise optimization scheme (iteratively optimizing over d with fixed C and then optimizing over C with fixed d) yields the global minimum. It is easy to show that, for any fixed C, the minimal value of L_z(d,C) with respect to d is achieved when

$$d_i = z_i^{-1/2}\,\|c_{i,\cdot}\| \quad \forall i$$

Plugging these solutions back into (38) and multiplying the resulting objective function by σ²/2 yields

$$\mathcal{L}_z(C) = \frac{1}{2}\sum_j \|s_j - U c_{\cdot,j}\|_2^2 + \sigma^2 \sum_i z_i^{1/2}\,\|c_{i,\cdot}\| \qquad (39)$$

Making the relation between ℓ2 and Frobenius norms concludes the proof. □

Minimizing L_z(C) boils down to minimizing the M-BP problem with an adaptive penalty λ_i = σ² z_i^{1/2} on each row-norm. This latter point makes the alternate optimization scheme based on Eqs. (34) and (35) equivalent to our iterative reweighted M-BP, for which the weights z_i would be given by Eq. (34). According to Wipf et al. [60], M-SBL can be implemented so that the cost per iteration is about O(N²M).³

The impact of this relation between M-SBL and iterative reweighted M-BP is essentially methodological. Indeed, its main advantage is that it turns the original M-SBL optimization problem into a series of convex optimization problems. In this sense, the iterative reweighted algorithm described here can again be viewed as an application of the MM approach for solving problem (32). Indeed, we are actually iteratively minimizing a proxy function which has been obtained by majorizing each term of Eq. (32). Owing to this MM point of view, convergence of our iterative algorithm towards a local minimum of Eq. (32) is guaranteed [26]. Convergence for the single signal case, using other arguments, has also been shown by Wipf et al. [57]. Note that, similar to M-FOCUSS, the original M-SBL algorithm based on the EM approach suffers from the presence of fixed points (when d_i = 0). Hence, such an algorithm is not guaranteed to converge towards a local minimum of (32). This is then another argument for preferring IrM-BP.

³ Our experimental analysis with M-SBL has been performed using the code provided and made available by Wipf et al.

5. Numerical experiments

Some computer simulations have been carried out in order to evaluate the algorithms described in the above sections. The results that have been obtained from these numerical studies are detailed in this section.

5.1. Experimental set-up

In order to quantify the performance of the different algorithms and compare them, we have used simulated datasets with different redundancy factors M/N, numbers k of active elements and numbers L of signals to approximate. The dictionary U is based on M vectors sampled from the unit hypersphere of R^N. The true coefficient matrix C* has been obtained as follows. The positions of the k non-zero rows of the matrix are randomly drawn. The non-zero coefficients of C* are then drawn from a zero-mean, unit-variance Gaussian distribution. The signal matrix S is obtained as in Eq. (1), with the noise matrix being drawn i.i.d. from a zero-mean Gaussian distribution whose variance is set so that the signal-to-noise ratio of each single signal is 10 dB. For a given experiment, when several trials are needed, we only resample the dictionary U and the additive noise E.

Each algorithm is provided with the signal matrix S and the dictionary U and outputs an estimate of C. The performance criteria we have considered are the mean-square error between the true and the approximated signals and the sparsity profile of the coefficient matrix that has been recovered. For the latter, we use as a performance criterion the F-measure between the row-support of the true matrix C* and the estimated one Ĉ. In order to take into account numerical precision, we have overloaded the row-support definition as

$$\mathrm{rowsupp}(C) = \{\, i \in [1\ M] : \|c_{i,\cdot}\| > \mu \,\}$$

where μ is a threshold coefficient that has been set by default to 0.01 in our experiments. From rowsupp(Ĉ) and rowsupp(C*), respectively the estimated and true sparsity profiles, we define

$$F\text{-measure} = 2\,\frac{|\mathrm{rowsupp}(\hat C) \cap \mathrm{rowsupp}(C^{\star})|}{|\mathrm{rowsupp}(\hat C)| + |\mathrm{rowsupp}(C^{\star})|}$$

Note that the F-measure is equal to 1 when the estimated sparsity profile coincides exactly with the true one.

Regarding the stopping criterion, in the experiments presented below, we have considered convergence of our M-BCD algorithm when the optimality conditions given in Eq. (7) are satisfied up to a tolerance of 0.001 and when all matrix coefficient c_{i,j} variations are smaller than 0.001. This latter condition has also been used as a stopping criterion for the Landweber iteration algorithm and the M-EM, IrM-BP, M-FOCUSS and M-SBL algorithms. When the annealing approach is in play for IrM-BP and M-FOCUSS, the annealing loop is also stopped under the same condition, e.g. max_{i,j} |C^(t+1)_{i,j} − C^(t)_{i,j}| < 0.001, where (t) and (t+1) denote two consecutive annealing solutions.
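The data-generation protocol and the row-support-based F-measure described in Section 5.1 can be sketched as follows. This is a minimal reimplementation for illustration only: the function names and the threshold argument mu are choices made here, not the authors' code.

import numpy as np

def make_data(M=512, N=128, k=10, L=5, snr_db=10.0, rng=None):
    """Draw a unit-norm dictionary, a k-row-sparse coefficient matrix and noisy signals at the given SNR."""
    rng = np.random.default_rng(rng)
    U = rng.standard_normal((N, M))
    U /= np.linalg.norm(U, axis=0)               # atoms sampled on the unit hypersphere of R^N
    C = np.zeros((M, L))
    rows = rng.choice(M, size=k, replace=False)  # common sparsity profile
    C[rows, :] = rng.standard_normal((k, L))
    S0 = U @ C
    noise = rng.standard_normal((N, L))
    # scale the noise column-wise so that each signal has the requested signal-to-noise ratio
    noise *= np.linalg.norm(S0, axis=0) / np.linalg.norm(noise, axis=0) * 10 ** (-snr_db / 20)
    return U, C, S0 + noise

def row_support(C, mu=0.01):
    """Rows whose Euclidean norm exceeds the threshold mu."""
    return set(np.flatnonzero(np.linalg.norm(C, axis=1) > mu))

def f_measure(C_hat, C_true, mu=0.01):
    """F-measure between estimated and true row-supports (1 means perfect recovery)."""
    est, true = row_support(C_hat, mu), row_support(C_true, mu)
    if not est and not true:
        return 1.0
    return 2 * len(est & true) / (len(est) + len(true))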

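As a complement, the alternating scheme of Section 4.3 (Eqs. (34)–(36)) can also be sketched in a few lines. In the sketch below, the inner weighted M-BP problem (36) is solved by plain proximal-gradient iterations with block soft-thresholding; this solver choice, the function names and the iteration counts are illustrative assumptions and do not correspond to the algorithms benchmarked in this section.

import numpy as np

def block_soft_threshold(C, thresholds):
    """Row-wise proximal operator of sum_i thresholds[i] * ||c_{i,.}||_2."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresholds[:, None] / np.maximum(norms, 1e-12))
    return scale * C

def reweighted_mbp(S, U, sigma2, n_outer=10, n_inner=200):
    """Sketch of the M-SBL / IrM-BP equivalence: alternate Eq. (34) (weights) and Eq. (36) (weighted M-BP)."""
    N, M = U.shape
    C = np.zeros((M, S.shape[1]))
    z = np.ones(M)                                   # arbitrary initialization of the weights
    step = 1.0 / np.linalg.norm(U, 2) ** 2           # gradient step for the quadratic data-fit term
    for _ in range(n_outer):
        lam = sigma2 * np.sqrt(z)                    # adaptive row penalties lambda_i = sigma^2 z_i^{1/2}
        for _ in range(n_inner):                     # proximal-gradient iterations on Eq. (36)
            grad = U.T @ (U @ C - S)
            C = block_soft_threshold(C - step * grad, step * lam)
        d = np.linalg.norm(C, axis=1) / np.maximum(np.sqrt(z), 1e-12)   # d_i = z_i^{-1/2} ||c*_{i,.}||
        Sigma = sigma2 * np.eye(N) + (U * d) @ U.T   # Sigma_t = sigma^2 I + U diag(d) U^t
        z = np.einsum('nm,nm->m', U, np.linalg.solve(Sigma, U))         # Eq. (34): diag(U^t Sigma_t^{-1} U)
    return C, d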
5.2. Comparing ℓ1–ℓ2 M-BP problem solvers

In this first experiment, we have compared different algorithms which solve the M-BP problem with p=1 and q=2. Besides our M-BCD and M-EM algorithms, we have also used the M-FOCUSS of Cotter et al. [8] and the approach of Fornasier et al. [17] based on Landweber iterations, denoted in the sequel as M-BPLand. Note that for M-FOCUSS we have modified the genuine algorithm by introducing an ε parameter, set to 0.001, which helps in avoiding a row-norm of C being permanently stuck at 0. For all compared problems, the regularization parameter λ has been set so as to make the problems equivalent. In this experiment, we have set λ = (1/5) max_i ‖φ_i^t S‖_2 for problem (6).

Fig. 1 shows two examples of how the objective value of the different algorithms evolves with respect to computational time. We can note that the two iterative reweighted least-squares algorithms (M-EM and M-FOCUSS) are the most computationally demanding. Furthermore, we also see that the Landweber iteration approach of Fornasier et al. quickly reduces its objective value but, compared to our M-BCD method, it needs more time before properly converging. Table 1 summarizes more accurately the differences between the four algorithms. These results have been averaged over 100 trials and consider two values of k. As comparison criteria, we have considered the computational time before convergence, the difference (compared to the M-BCD algorithm) in objective values and the maximal absolute difference in the coefficient matrix c_{i,j}. The table clearly shows that the M-BCD algorithm is faster than M-BPLand (especially when the signals to approximate are highly sparse) and than the two iterative reweighted least-squares approaches. We can also note from the table that, although M-FOCUSS and our M-EM are not provided with a formal convergence proof, these two algorithms seem to empirically converge to the global minimum of the problem.

Table 1
Summary of the M-BP solvers comparison. Comparisons have been carried out for two values of k, the number of active elements in the dictionary, and have been averaged over 100 trials. Comparison measures are the time needed before convergence, the difference in objective value and the largest coefficient matrix difference. For the two latter measures, the baseline algorithm is the M-BCD one.

             Time (ms)        Δ ObjVal (×10⁻³)    ‖ΔC‖∞ (×10⁻³)
  k=5
  M-BCD      4.2 ± 2.7        –                   –
  M-EM       57.2 ± 1.7       2.6 ± 2.2           0.8 ± 1.0
  M-Focuss   24.1 ± 4.2       3.6 ± 1.8           16.6 ± 1.0
  M-BPLand   7.5 ± 1.3        5.3 ± 1.1           0.04 ± 1.1
  k=32
  M-BCD      11.5 ± 2.87      –                   –
  M-EM       130.9 ± 17.0     23.1 ± 5.9          15.4 ± 8.0
  M-Focuss   60.0 ± 6.2       28.8 ± 5.8          38.8 ± 5.4
  M-BPLand   14.1 ± 2.7       16.9 ± 3.8          0.41 ± 3.8

Fig. 1. Examples of objective value evolution with respect to computational time. Here we have M=128, N=64, L=3. The number of active elements is: (left) k=5; (right) k=32. For each curve, the large point corresponds to the objective value at convergence. (Both panels plot the objective value against time (s) on logarithmic axes, for M-BCD, M-BPem, M-Focuss and M-BPland.)

Fig. 2. Estimating the empirical exponent, given in parentheses, of the computational complexity of the different algorithms (M-BCD, IrM-BP, M-SBL, M-FOCUSS, CosAmp, Landweber iterations). The top plots (N=128, L=5, k=M/16) give the computation time of the algorithms with respect to the dictionary size M. The bottom plots (M=128, L=5, k=8) depict the computational complexity with respect to the signal dimensionality N. For the sake of readability, we have separated the algorithms into two groups: (left) the ones that solve the ℓ1–ℓq problem; (right) the ones that solve the ℓp–ℓ2 problem (M-BCD result provided for baseline comparison). "IrM-BP Ann" and "M-FOC Ann" refer to the IrM-BP and M-FOCUSS algorithms using an annealing approach for iteratively decreasing ε. Empirical exponents reported in the legends — top row: SBL (1.87), IrM-BCD (1.85), BCD (1.75), M-FOC (1.32), IrM-BCD Ann (1.91), EM (2.08), M-FOC Ann (1.29), CosAmp (2.70), Landweber (1.98); bottom row: BCD (0.10), SBL (1.21), M-FOC (1.29), IrM-BCD (−0.11), EM (0.05), IrM-BCD Ann (−0.08), CosAmp (0.02), Landweber (−0.01). (Figure best viewed on a computer screen.)

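The empirical exponents reported in Fig. 2 (and discussed in Section 5.3 below) summarize how the measured runtime scales with one problem dimension. The original timing and fitting code is not available here; the following is only a sketch of the idea, assuming a least-squares fit of log-runtime against log-size.

import time
import numpy as np

def empirical_exponent(solver, sizes, make_problem, n_trials=20):
    """Fit t ~ size^alpha: return the slope of log(mean runtime) versus log(size)."""
    mean_times = []
    for size in sizes:
        trial_times = []
        for _ in range(n_trials):
            S, U = make_problem(size)            # user-supplied problem generator (e.g., varying M or N)
            tic = time.perf_counter()
            solver(S, U)
            trial_times.append(time.perf_counter() - tic)
        mean_times.append(np.mean(trial_times))
    alpha, _ = np.polyfit(np.log(sizes), np.log(mean_times), 1)
    return alpha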
5.3. Computational performances

We have also empirically assessed the computational complexity of our algorithms (we used s=0.2, hence q=5/3, for M-EM and r=1 for IrM-BP). We varied one of the different parameters (dictionary size M, signal dimensionality N) while keeping the others fixed. All matrices U, C and S are created as described above. Experiments have been run on a Pentium D 3 GHz with 4 GB of RAM using Matlab code. The results in Fig. 2, averaged over 20 trials, show the computational complexity of the different algorithms for different experimental settings. Note that we have also experimented on the M-SBL computational performances owing to the code of Wipf et al. [60], and we have implemented the M-FOCUSS of Cotter et al. [8], the CosAmp block-sparse approach of Baraniuk et al. [1] and the Landweber iteration method of Fornasier et al. [17].⁴ All algorithms need one hyperparameter to be set; for M-SBL and CosAmp, we were able to choose the optimal one since the hyperparameter respectively depends on a known noise level and a known number of active elements in the dictionary. For the other algorithms, we have reported the computational complexity for the λ that yields the best sparsity recovery. Note that our aim here is not to give an exact comparison of the computational complexities of the algorithms but just to give an order of magnitude of these complexities. Indeed, accurate comparisons are difficult since the different algorithms do not solve the same problem and do not use the same stopping criterion.

⁴ All the implementations are included in the toolbox.

We can remark in Fig. 2 that, with respect to the dictionary size, all algorithms present an empirical exponent between 1.3 and 2.7. Interestingly, we have theoretically evaluated the complexity of the M-BCD algorithm as quadratic whereas we measure a sub-quadratic complexity. We suppose that this happens because, at each iteration, only the non-optimal c_{i,·}'s are updated and thus the number of updates drastically reduces along iterations. We can note that among all approaches that solve the ℓ1–ℓq problem (left plots), M-BCD, the Landweber iteration approach and M-CosAmp have similar complexity, with a slight advantage to M-BCD for large dictionary sizes. However, we have to note that the M-CosAmp algorithm sometimes suffers from lack of convergence and thus stops only when the maximal number of allowed iterations is reached. This is the reason why for large dictionary sizes M-CosAmp is computationally expensive. However, for small and medium dictionary sizes, M-CosAmp is slightly more efficient than M-BCD. When considering the algorithms that solve the ℓp–ℓ2 problem (right plots), they all have similar complexity, with a slightly better constant for IrM-BP, while M-SBL seems to be the most demanding algorithm.

The bottom plots of Fig. 2 depict the complexity dependency of all algorithms with respect to the signal dimension N. Interestingly, the results show that, except for the M-SBL and M-FOCUSS algorithms, the algorithms do not suffer from the signal dimension increase. We assume that this is due to the fact that, as the dimension increases, the approximation problem becomes easier and thus faster convergence of those algorithms occurs.

5.4. Comparing performances

The objective of the next empirical study is to compare the performances of the algorithms we surveyed: M-SBL, M-CosAmp, Landweber iterations, S-OMP, M-FOCUSS with an annealing decrease of ε, the M-BCD algorithm and the IrM-BP approach with two values of p and an annealing decrease of ε.

The baseline experimental context is M=512, N=128, k=10 and L=5. For this experiment, we have considered an agnostic context with no prior knowledge about the noise level being available. Hence, for all models, we have performed model selection (either for selecting λ, the noise level σ for M-SBL, or the number of elements for M-CosAmp and S-OMP). The model selection procedure is the following. The training signals S are randomly split into two parts of N/2 samples. Each algorithm is then trained on one part of the signals and the mean-square error of the resulting model is evaluated on the second part. This splitting and training is run 5 times and the hyperparameter yielding the minimal averaged mean-square error is considered as optimal. Each method is then run on the full signals with that hyperparameter. Performances of all methods, averaged over 10 trials, have been evaluated according to the F-measure and a mean-square error computed on 10 000 samples.

Fig. 3 shows, from top to bottom, these performances when k increases from 2 to 75, when M goes from 64 to 1024 and when L = 2, ..., 75. This framework allows us to capture different experimental settings such as high/low sparsity patterns and large/small numbers of signals to be jointly approximated.

When varying k, we can note that, across the range of variation, the IrM-BP method with p=0 is competitive compared to all other approaches both with respect to the F-measure and the mean-square error criterion. When k increases, IrM-BP and M-FOCUSS (both with p=0.5) and M-SBL also perform very well. This may be explained by the fact that as k increases, the optimal solution becomes less and less sparse, hence the need for a less aggressive penalty. CosAmp and S-OMP are very competitive for small k but, as soon as the latter increases, these two methods are no longer able to recover a "reasonable" sparsity pattern. Interestingly, we remark that M-SBL yields a poor sparsity recovery measure while the resulting model achieves a good mean-square error. A reason for this is that the model selection procedure tends to under-estimate the noise level and thus leads to a model which keeps many spurious dictionary elements, as illustrated in Fig. 5 and detailed in the sequel. From Fig. 3, we can also notice that the two M-BP solvers, our M-BCD and the Landweber iteration approach, perform poorly compared to the other methods. However, Fornasier's method seems to be less sensitive to noise and model selection since it provides a better sparsity pattern recovery. It is worth noting that M-SBL and these two latter methods always correctly select all the true dictionary elements but they also have the tendency to include other spurious ones.

In the middle plots, as the dictionary size increases, we remark that the M-CosAmp, S-OMP and IrM-BP with p=0 algorithms yield very good (and similar) sparsity recovery. Regarding mean-square error, again, S-OMP and IrM-BP seem to be the most efficient algorithms. M-SBL and M-BCD keep too many spurious dictionary elements. All other methods provide in-between performances both in terms of F-measure and mean-square error.

Finally, the bottom plots show that when the number of signals to be approximated is rather low, IrM-BP is the most interesting approach with respect to both criteria we consider. However, as soon as L becomes large enough, S-OMP is able to perform better in sparsity recovery as well as in mean-square error.

In order to investigate how the different algorithms behave when the elements of U are correlated, we have run the same experiments with the dictionary U being replaced by a matrix UW^{1/2}, where W is a Toeplitz matrix built from a vector v of components v_k = ρ^{k-1} + a_k, with the vector a being obtained from a vector of uniform random noise of variance σ_v² sorted in decreasing order [20]. Here, we have set ρ = 0.9 and σ_v² = 0.1. Fig. 4 depicts the different algorithm performances. The main difference that we can note is that M-CosAmp and S-OMP seem to suffer more from this correlation between dictionary elements as the number of active elements becomes large (top panels).
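A minimal sketch of this correlated-dictionary construction follows. It is one plausible reading of the description above: the exact Toeplitz convention, the scaling of the uniform noise and the final re-normalization of the atoms are assumptions made for illustration.

import numpy as np
from scipy.linalg import toeplitz, sqrtm

def correlated_dictionary(U, rho=0.9, sigma_v2=0.1, rng=None):
    """Replace U by U W^{1/2}, with W Toeplitz built from v_k = rho^{k-1} + a_k (a: sorted uniform noise)."""
    rng = np.random.default_rng(rng)
    M = U.shape[1]
    a = rng.uniform(0.0, np.sqrt(12 * sigma_v2), size=M)   # uniform noise with variance sigma_v2
    a = np.sort(a)[::-1]                                    # sorted in decreasing order
    v = rho ** np.arange(M) + a
    W = toeplitz(v)
    Uc = U @ np.real(sqrtm(W))
    return Uc / np.linalg.norm(Uc, axis=0)                  # re-normalize the atoms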

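The sample-splitting model selection used in this section can likewise be sketched in a few lines. The helper names are hypothetical, and the way a candidate hyperparameter is passed to each solver is an assumption of this sketch.

import numpy as np

def select_hyperparameter(S, U, solver, candidates, n_splits=5, rng=None):
    """Pick the hyperparameter whose model, fitted on one half of the samples, best predicts the other half."""
    rng = np.random.default_rng(rng)
    N = S.shape[0]
    errors = np.zeros(len(candidates))
    for _ in range(n_splits):
        perm = rng.permutation(N)
        train, valid = perm[: N // 2], perm[N // 2:]
        for i, value in enumerate(candidates):
            C = solver(S[train], U[train], value)                 # fit on one half of the samples
            errors[i] += np.sum((S[valid] - U[valid] @ C) ** 2)   # validation mean-square error
    return candidates[int(np.argmin(errors))]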
Fig. 3. Results comparing the performances of the different simultaneous sparse algorithms. We have varied (top) the number k of active elements in the dictionary (M=512, N=128, L=5), (middle) the dictionary size M (N=128, L=5, k=10) and (bottom) the number L of signals to approximate (M=512, N=128, k=10). The left column gives the F-measure of all methods while the average mean-square errors are on the right column. (Figure best viewed in color on a computer screen.)

Fig. 4. Results comparing the performances of the different simultaneous sparse algorithms for a dictionary with increased correlation between elements. We have varied (top) the number k of active elements in the dictionary (M=512, N=128, L=5), (middle) the dictionary size M (N=128, L=5, k=10) and (bottom) the number L of signals to approximate (M=512, N=128, k=10). The left column gives the F-measure of all methods while the average mean-square errors are on the right column. (Figure best viewed in color on a computer screen.)

Fig. 5. Examples of estimated row-norms using three different algorithms (amplitude versus dictionary element index, with the true row-norms overlaid). (left) M=128, N=64, k=10 and L=3: IrM-BCD ann (p=0) F-meas 1.00, MSE 613.46; M-SBL F-meas 0.59, MSE 738.79; M-BPcosamp F-meas 0.95, MSE 2881.35. (right) M=64, N=64, k=10 and L=3: IrM-BCD ann (p=0) F-meas 0.95, MSE 543.93; M-SBL F-meas 0.38, MSE 812.21; M-BPcosamp F-meas 1.00, MSE 670.23. Here, we want to illustrate cases where a "good" sparsity recovery does not necessarily lead to a low mean-square error.

Fig. 5 illustrates the behavior of M-CosAmp, M-SBL and IrM-BP with p=0 for N=64, M=128, k=10 and L=3, for two different random samplings of U and of the locations and amplitudes of the non-zero components, with W = I. In the left plot, we have a case where, on the one hand, M-CosAmp misses recovering the first active dictionary element, thus yielding a high mean-square error. On the other hand, M-SBL achieves a lower mean-square error while keeping a few spurious dictionary elements in the model. In the meantime, IrM-BP recovers the sparsity pattern perfectly and yields a low mean-square error. In the right plot, we have another case, where M-CosAmp achieves perfect sparsity recovery but provides a model with a higher mean-square error than IrM-BP.

In most of the experimental situations presented here, M-CosAmp or S-OMP and the IrM-BP seem to be the algorithms that perform best. These methods are actually related, since both approaches solve a simultaneous sparse approximation with a J_{0,2}(C) penalty. The main difference lies in the algorithms: our IrM-BP, owing to the ε term, provides a smooth approximation of the ℓ0 quasi-norm, whereas M-CosAmp or S-OMP directly solve the approximation problem with the J_{0,2}(C) penalty. To summarize, we suggest the following rule: in applications where computational complexity is critical, where the signals to be approximated are highly sparse and where the number of signals to be approximated is large, greedy approaches like M-CosAmp or S-OMP are to be preferred. In other situations, like low sparsity patterns or a small number of signals, iterative algorithms like IrM-BP should be considered.

6. Conclusions

This paper aimed at surveying and comparing different simultaneous sparse signal approximation algorithms available in the statistical learning and signal processing communities. These algorithms usually solve a relaxed version of an intractable problem, where the relaxation is based on a sparsity-inducing mixed-norm on the coefficient approximation. In the first part of the paper, we have described several algorithms which address the most common SSA problem, which is a convex problem with an ℓ1–ℓ2 penalty function. For other choices of penalty, one usually considers iterative reweighted algorithms. These algorithms have been detailed in a second part of the paper. We have also brought our attention to greedy algorithms such as S-OMP or M-CosAmp, since they both have the ability to be very efficient and to provide theoretical guarantees on the quality of their solutions.

The last part of the paper is devoted to experimental comparisons of the different algorithms we surveyed. The lesson learned from these comparisons is that the algorithms that seem to be the best performing, either in terms of efficiency, sparsity recovery or low mean-square error, are the greedy methods such as M-CosAmp and S-OMP or an iterative reweighted ℓ1 approach.

Acknowledgments

We would like to thank the anonymous reviewers for their comments. This work was supported in part by the IST Program of the European Community, under the PASCAL2 Network of Excellence, IST-216886. This publication only reflects the authors' views. This work was partly supported by grants from the ANR ClasSel 08-EMER and ASAP ANR-09-EMER-001.

Appendix A

A.1. J_{1,2}(C) subdifferential

By definition, a matrix G lies in ∂J_{1,2}(B) if and only if, for every matrix Z, we have

$$J_{1,2}(Z) \geq J_{1,2}(B) + \langle Z - B,\, G\rangle_F \qquad (40)$$

If we expand this equation, we have the following equivalent expression:

$$\sum_i \|z_{i,\cdot}\|_2 \geq \sum_i \|b_{i,\cdot}\|_2 + \sum_i \langle z_{i,\cdot} - b_{i,\cdot},\, g_{i,\cdot}\rangle \qquad (41)$$

From this latter equation, we understand that, since both J_{1,2} and the Frobenius inner product are row-separable, a matrix G ∈ ∂J_{1,2}(B) if and only if each row of G belongs to the subdifferential of the ℓ2 norm of the corresponding row of B.

Indeed, suppose that G is such that any row of G belongs to the subdifferential of the ℓ2 norm of the corresponding row of B. We thus have, for any row i,

$$\forall z,\quad \|z\|_2 \geq \|b_{i,\cdot}\|_2 + \langle z - b_{i,\cdot},\, g_{i,\cdot}\rangle \qquad (42)$$

A summation over all the rows then proves that G satisfies Eq. (41) and thus belongs to the subdifferential of J_{1,2}(B).

Now, let us show that a matrix G for which there exists a row that does not belong to the subdifferential of the ℓ2 norm of the corresponding row of B cannot belong to the subdifferential of J_{1,2}(B). Let us consider g_{i,·}, the i-th row of G; since we have supposed that g_{i,·} ∉ ∂‖b_{i,·}‖_2, the following equation holds:

$$\exists\, z_0\ \text{s.t.}\ \|z_0\|_2 < \|b_{i,\cdot}\|_2 + \langle z_0 - b_{i,\cdot},\, g_{i,\cdot}\rangle$$

Now let us construct Z so that Z = B except for the i-th row, where z_{i,·} = z_0. Then it is easy to show that this matrix Z does not satisfy Eq. (41), which means that G does not belong to ∂J_{1,2}(B). In conclusion, we get ∂J_{1,2}(B) by applying the ℓ2 norm subdifferential to each row of B. And it is well known [2] that

$$\partial\|b\|_2 = \begin{cases} \{\, g \in \mathbb{R}^L : \|g\|_2 \leq 1 \,\} & \text{if } b = 0 \\[4pt] \left\{\dfrac{b}{\|b\|_2}\right\} & \text{otherwise} \end{cases} \qquad (43)$$

A.2. Proof of Lemma 1

We aim at proving the closed-form expression for d given in Eq. (48) below. According to the optimality conditions of the problem, at a stationary point we have either d_{m,n} = 0 or

$$d_{m,n} = \left(\frac{\lambda}{r+s}\right)^{-s/(s+1)} |a_{m,n}|^{2s/(s+1)} \left(\sum_t d_{t,n}^{1/s}\right)^{rs/[(r+s)(s+1)]} \qquad (44)$$

Then, we can derive

$$\left(\sum_m d_{m,n}^{1/s}\right)^{s+1} = \frac{r+s}{\lambda}\left(\sum_m |a_{m,n}|^{2/(s+1)}\right)^{s+1}\left(\sum_m d_{m,n}^{1/s}\right)^{r/(r+s)} \qquad (45)$$

and thus

$$\sum_m d_{m,n}^{1/s} = \left(\frac{r+s}{\lambda}\left(\sum_m |a_{m,n}|^{2/(s+1)}\right)^{s+1}\right)^{(r+s)/[s(r+s+1)]} \qquad (46)$$

As λ ≠ 0, the inequality on the mixed-norm on d_{t,k} becomes an equality. Hence, after raising each side of Eq. (46) to the power 1/(r+s) and summing each side over n, we have

$$\left(\frac{\lambda}{r+s}\right)^{1/(r+s+1)} = \sum_n g_n^{(s+1)/(r+s+1)} \qquad (47)$$

where g_n = Σ_m |a_{m,n}|^{2/(s+1)}. Then, plugging Eqs. (47) and (46) into (44) gives the desired result:

$$d_{m,n} = \frac{|a_{m,n}|^{2s/(s+1)}\, g_n^{r/(s+r+1)}}{\left(\sum_n g_n^{(s+1)/(s+r+1)}\right)^{r+s}} \qquad (48)$$

A.3. Proof of Eq. (37)

We want to show that at optimality, which occurs at C*, we have

$$s_j^t \Sigma_t^{-1} s_j = \frac{1}{\sigma^2}\, s_j^t \left(s_j - U c^{\star}_{\cdot,j}\right)$$

which is equivalent, after factorizing with s_j^t, to showing that Σ_t^{-1} s_j = (1/σ²)(s_j − U c*_{·,j}).
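The identity in Eq. (37), together with the optimality relation above, can also be checked numerically. The following standalone sketch, with freely chosen dimensions, compares the quadratic form s^t Σ_t^{-1} s with the minimized penalized least-squares value.

import numpy as np

rng = np.random.default_rng(0)
N, M, sigma2 = 20, 50, 0.3
U = rng.standard_normal((N, M))
d = rng.uniform(0.1, 2.0, size=M)          # strictly positive hyperparameters
s = rng.standard_normal(N)

# Left-hand side of Eq. (37): s^t Sigma_t^{-1} s with Sigma_t = sigma^2 I + U diag(d) U^t
Sigma = sigma2 * np.eye(N) + (U * d) @ U.T
lhs = s @ np.linalg.solve(Sigma, s)

# Right-hand side: min_c (1/sigma^2) ||s - Uc||^2 + sum_i c_i^2 / d_i,
# whose minimizer is c* = (U^t U + sigma^2 diag(1/d))^{-1} U^t s
c_star = np.linalg.solve(U.T @ U + sigma2 * np.diag(1.0 / d), U.T @ s)
rhs = np.sum((s - U @ c_star) ** 2) / sigma2 + np.sum(c_star ** 2 / d)

print(lhs, rhs)                            # the two values coincide up to numerical precision
assert np.isclose(lhs, rhs)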

References

[6] J. Chen, X. Huo, Sparse representations for multiple measurement vectors (MMV) in an overcomplete dictionary, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, 2005, pp. 257–260.
[7] S. Chen, D. Donoho, M. Saunders, Atomic decomposition by basis pursuit, SIAM Journal of Scientific Computing 20 (1) (1999) 33–61.
[8] S. Cotter, B. Rao, K. Engan, K. Kreutz-Delgado, Sparse solutions to linear inverse problems with multiple measurement vectors, IEEE Transactions on Signal Processing 53 (7) (2005) 2477–2488.
[9] I. Daubechies, M. Defrise, C.D. Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Communications on Pure and Applied Mathematics 57 (2004) 1413–1541.
[10] I. Daubechies, R. DeVore, M. Fornasier, S. Gunturk, Iteratively reweighted least squares minimization for sparse recovery, Communications on Pure and Applied Mathematics 63 (1) (2010) 1–38.
[11] M. Duarte, V. Cevher, R. Baraniuk, Model-based compressive sensing for signal ensembles, in: Proceedings of the 47th Allerton Conference on Communication, Control and Computing, 2009.
[12] M. Elad, Why simple shrinkage is still relevant for redundant representations?, IEEE Transactions on Information Theory 52 (12) (2006) 5559–5569.
[13] Y. Eldar, P. Kuppinger, H. Bölcskei, Compressed sensing of block-sparse signals: uncertainty relations and efficient recovery, IEEE Transactions on Signal Processing 58 (6) (2010) 3042–3054.
[14] Y. Eldar, M. Mishali, Robust recovery of signals from a structured union of subspaces, IEEE Transactions on Information Theory 55 (11) (2009) 5302–5316.
[15] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association 96 (456) (2001) 1348–1360.
[16] M. Figueiredo, J. Bioucas-Dias, R. Nowak, Majorization–minimization algorithms for wavelet-based image restoration, IEEE Transactions on Image Processing 16 (12) (2007) 2980–2991.
[17] M. Fornasier, H. Rauhut, Recovery algorithms for vector valued data with joint sparsity constraints, SIAM Journal of Numerical Analysis 46 (2) (2008) 577–613.
[18] S. Foucart, M.-J. Lai, Sparsest solutions of underdetermined linear systems via ℓq minimization for 0 < q ≤ 1, Applied and Computational Harmonic Analysis 26 (3) (2009) 395–407.
[19] J. Friedman, T. Hastie, H. Höfling, R. Tibshirani, Pathwise coordinate optimization, The Annals of Applied Statistics 1 (2) (2007) 302–332.
[20] G. Gasso, A. Rakotomamonjy, S. Canu, Recovering sparse signals with a certain family of non-convex penalties and DC programming, IEEE Transactions on Signal Processing 57 (12) (2009) 4686–4698.
[21] M. Girolami, S. Rogers, Hierarchic Bayesian models for kernel learning, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 241–248.
[22] I. Gorodnitsky, J. George, B. Rao, Neuromagnetic source imaging with FOCUSS: a recursive weighted minimum norm algorithm, Journal of Electroencephalography and Clinical Neurophysiology 95 (4) (1995) 231–251.
[23] Y. Grandvalet, Least absolute shrinkage is equivalent to quadratic penalization, in: L. Niklasson, M. Bodén, T. Ziemske (Eds.), ICANN'98, Perspectives in Neural Computing, vol. 1, Springer, 1998, pp. 201–206.
[24] R. Gribonval, M. Nielsen, Sparse approximations in signal and image processing, Signal Processing 86 (3) (2006) 415–416.
[25] J. Huang, S. Ma, H. Xie, C. Zhang, A group bridge approach for variable selection, Biometrika 96 (2) (2009) 339–355.
[26] D. Hunter, K. Lange, A tutorial on MM algorithms, The American Statistician 58 (2004) 30–37.
[27] S. Ji, D. Dunson, L. Carin, Multi-task compressive sensing, IEEE Transactions on Signal Processing 57 (1) (2008) 92–106.
[28] Y. Kim, J. Kim, Y. Kim, Blockwise sparse regression, Statistica Sinica 16 (2006) 375–390, http://www3.stat.sinica.edu.tw/statistica/j16n2/16-2.html.
[29] H. Liu, J. Zhang, On the estimation and variable selection consistency of the sum of q-norm regularized regression, Technical Report, Department of Statistics, Carnegie Mellon University, 2009.
[30] Z. Luo, M. Gaspar, J. Liu, A. Swami, Distributed signal processing in sensor networks, IEEE Signal Processing Magazine 23 (4) (2006) 14–15.
[31] D. Malioutov, M. Cetin, A. Willsky, Sparse signal reconstruction perspective for source localization with sensor arrays, IEEE Transactions on Signal Processing 53 (8) (2005) 3010–3022.
[32] S. Mallat, Z. Zhang, Matching pursuit with time–frequency dictionaries, IEEE Transactions on Signal Processing 41 (12) (1993) 3397–3415.
[33] M. Mishali, Y. Eldar, Reduce and boost: recovering arbitrary sets of jointly sparse vectors, IEEE Transactions on Signal Processing 56 (10) (2008) 4692–4702.
[34] D. Needell, J. Tropp, CosAmp: iterative signal recovery from incomplete and inaccurate samples, Applied and Computational Harmonic Analysis 26 (3) (2009) 301–321.
[35] G. Obozinski, B. Taskar, M. Jordan, Joint covariate selection and joint subspace selection for multiple classification problems, Statistics and Computing 20 (2) (2010) 231–252.
[36] Y. Pati, R. Rezaiifar, P. Krishnaprasad, Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition, in: Proceedings of the 27th Annual Asilomar Conference on Signals, Systems and Computers, 1993.
[37] C. Phillips, J. Mattout, M. Rugg, P. Maquet, K. Friston, An empirical Bayesian solution to the source reconstruction problem in EEG, NeuroImage 24 (2005) 997–1011.
[38] A. Rakotomamonjy, F. Bach, Y. Grandvalet, S. Canu, SimpleMKL, Journal of Machine Learning Research 9 (2008) 2491–2521.
[39] A. Rakotomamonjy, R. Flamary, G. Gasso, S. Canu, ℓp–ℓq penalty for sparse linear and sparse multiple kernel multi-task learning, IEEE Transactions on Neural Networks, submitted for publication, hal-00509608, http://hal.archives-ouvertes.fr/hal-00509608/fr/.
[40] B. Rao, K. Engan, S. Cotter, J. Palmer, K. Kreutz-Delgado, Subset selection in noise based on diversity measure minimization, IEEE Transactions on Signal Processing 51 (3) (2003) 760–770.
[41] R. Saab, R. Chartrand, Ö. Yilmaz, Stable sparse approximations via nonconvex optimization, in: 33rd International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2008.
[42] S. Sardy, A. Bruce, P. Tseng, Block coordinate relaxation methods for non-parametric wavelet denoising, Journal of Computational and Graphical Statistics 9 (2) (2000) 361–379.
[43] T. Simila, Majorize–minimize algorithm for multiresponse sparse regression, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007, pp. 553–556.
[44] T. Simila, J. Tikka, Input selection and shrinkage in multiresponse linear regression, Computational Statistics and Data Analysis 52 (1) (2007) 406–422.
[45] M. Stojnic, F. Parvaresh, B. Hassibi, On the reconstruction of block-sparse signals with an optimal number of measurements, Technical Report, arXiv:0804.0041v1, California Institute of Technology, 2008.
[46] J. Sturm, Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones, Optimization Methods and Software 11 (12) (1999) 625–653.
[47] L. Sun, J. Liu, J. Chen, J. Ye, Efficient recovery of jointly sparse vectors, in: Advances in Neural Information Processing Systems, vol. 23, 2009.
[48] M. Szafranski, Y. Grandvalet, A. Rakotomamonjy, Composite kernel learning, Machine Learning Journal 79 (1–2) (2010) 73–103.
[49] F. Theis, G. Garcia, On the use of sparse signal decomposition in the analysis of multi-channel surface electromyograms, Signal Processing 86 (3) (2006) 603–623.
[50] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society 46 (1996) 267–288.
[51] M. Tipping, Sparse Bayesian learning and the relevance vector machine, Journal of Machine Learning Research 1 (2001) 211–244.
[52] J. Tropp, Algorithms for simultaneous sparse approximation. Part II: convex relaxation, Signal Processing 86 (2006) 589–602.
[53] J. Tropp, A. Gilbert, Signal recovery from random measurements via orthogonal matching pursuit, IEEE Transactions on Information Theory 53 (12) (2007) 4655–4666.
[54] J. Tropp, A. Gilbert, M. Strauss, Algorithms for simultaneous sparse approximation. Part I: greedy pursuit, Signal Processing 86 (2006) 572–588.
[55] P. Tseng, Convergence of block coordinate descent method for nondifferentiable minimization, Journal of Optimization Theory and Applications 109 (2001) 475–494.
[56] E. van den Berg, M. Schmidt, M. Friedlander, K. Murphy, Group sparsity via linear-time projection, Technical Report TR-2008-09, University of British Columbia, Department of Computer Science, 2008.
[57] D. Wipf, S. Nagarajan, A new view of automatic relevance determination, in: Advances in Neural Information Processing Systems, vol. 20, MIT Press, Cambridge, MA, 2008.
[58] D. Wipf, S. Nagarajan, Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions, Journal of Selected Topics in Signal Processing 4 (2) (2010) 317–320.
[59] D. Wipf, J. Owen, H. Attias, K. Sekihara, S. Nagarajan, Robust Bayesian estimation of the location, orientation and time course of multiple correlated neural sources using MEG, NeuroImage 49 (1) (2010) 641–655.
[60] D. Wipf, B. Rao, An empirical Bayesian strategy for solving the simultaneous sparse approximation problem, IEEE Transactions on Signal Processing 55 (7) (2007) 3704–3716.
[61] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society B 68 (2006) 49–67.
[62] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society Series B 67 (2005) 301–320.
[63] H. Zou, The adaptive lasso and its oracle properties, Journal of the American Statistical Association 101 (476) (2006) 1418–1429.
[64] H. Zou, R. Li, One-step sparse estimates in nonconcave penalized likelihood models, The Annals of Statistics 36 (4) (2008) 1509–1533.