Signal Processing 91 (2011) 1505–1526
Review

Surveying and comparing simultaneous sparse approximation (or group-lasso) algorithms
A. Rakotomamonjy
LITIS EA4108, University of Rouen, Avenue de l'Université, 76800 Saint-Étienne-du-Rouvray, France

Article history: Received 20 April 2010; received in revised form 4 November 2010; accepted 16 January 2011; available online 25 January 2011.

Keywords: Simultaneous sparse approximation; Block-sparse regression; Group lasso; Iterative reweighted algorithms

Abstract

In this paper, we survey and compare different algorithms that, given an overcomplete dictionary of elementary functions, solve the problem of simultaneous sparse signal approximation with a common sparsity profile induced by an ℓp-ℓq mixed norm. Such a problem is also known in the statistical learning community as the group lasso problem. We have gathered and detailed different algorithmic results concerning these two equivalent approximation problems. We have also enriched the discussion by providing relations between several algorithms. Experimental comparisons of the detailed algorithms have also been carried out. The main lesson learned from these experiments is that, depending on the performance measure, greedy approaches and iterative reweighted algorithms are the most efficient algorithms in terms of computational complexity, sparsity recovery or mean-square error.

© 2011 Elsevier B.V. All rights reserved.
Contents

1. Introduction
   1.1. Problem formalization
2. Solving the ℓ1-ℓ2 optimization problem
   2.1. Block coordinate descent
        2.1.1. Deriving optimality conditions
        2.1.2. The algorithm and its convergence
   2.2. Landweber iterations
   2.3. Other works
3. Generic algorithms for large classes of p and q
   3.1. M-FOCUSS algorithm
        3.1.1. Deriving the algorithm
        3.1.2. Discussing M-FOCUSS
   3.2. Automatic relevance determination approach
        3.2.1. Exhibiting the relation with ARD
   3.3. Solving the ARD formulation for p = 1 and 1 ≤ q ≤ 2
   3.4. Iterative reweighted ℓ1-ℓq algorithms for p < 1
        3.4.1. Iterative reweighted algorithm
        3.4.2. Connections with the majorization-minimization algorithm
        3.4.3. Connection with the group bridge lasso
E-mail address: [email protected]
doi:10.1016/j.sigpro.2011.01.012
4. Specific case algorithms
   4.1. S-OMP
   4.2. M-CoSaMP
   4.3. Sparse Bayesian learning and reweighted algorithm
5. Numerical experiments
   5.1. Experimental set-up
   5.2. Comparing ℓ1-ℓ2 M-BP problem solvers
   5.3. Computational performances
   5.4. Comparing performances
6. Conclusions
Acknowledgments
Appendix A
   A.1. J_{1,2}(C) subdifferential
   A.2. Proof of Lemma 1
   A.3. Proof of Eq. (37)
References
1. Introduction

For several years now, there has been a lot of interest in sparse signal approximation. This large interest comes from the frequent wish of practitioners to represent data in the most parsimonious way.

Recently, researchers have focused their efforts on a natural extension of the sparse approximation problem: the problem of finding jointly sparse representations of multiple signal vectors. This problem is also known as simultaneous sparse approximation and it can be stated as follows. Suppose we have several signals describing the same phenomenon, each signal being contaminated by noise. We want to find the sparsest approximation of each signal using the same set of elementary functions. Hence, the problem consists in finding the best approximation of each signal while controlling the number of functions involved in all the approximations.

Such a situation arises in many different application domains such as sensor network signal processing [30,11], neuroelectromagnetic imaging [22,37,59], source localization [31], image restoration [17], distributed compressed sensing [27] and signal and image processing [24,49].

1.1. Problem formalization

Formally, the problem of simultaneous sparse approximation is the following. Suppose that we have measured L signals {s_i}_{i=1}^L, where each signal is of the form s_i = Φc_i + ε, with s_i ∈ R^N, Φ ∈ R^{N×M} a matrix of unit-norm elementary functions, c_i ∈ R^M a weighting vector and ε a noise vector. Φ will be denoted in the sequel as the dictionary matrix. Since we have several signals, the overall measurements can be written as

  S = ΦC + E    (1)

with S = [s_1 s_2 ⋯ s_L] a signal matrix, C = [c_1 c_2 ⋯ c_L] a coefficient matrix and E a noise matrix. Note that in the sequel we adopt the following notation: c_{i,·} and c_{·,j} respectively denote the i-th row and the j-th column of the matrix C, and c_{i,j} is the i-th element of the j-th column of C.

For the simultaneous sparse approximation (SSA) problem, the goal is then to recover the matrix C given the signal matrix S and the dictionary Φ, under the hypothesis that all signals s_i share the same sparsity profile. This latter hypothesis can also be translated into the coefficient matrix C having a minimal number of non-zero rows. In order to measure the number of non-zero rows of C, a possible criterion is the so-called row-support or row-diversity measure of a coefficient matrix, defined as

  rowsupp(C) = {i ∈ [1..M] : c_{i,k} ≠ 0 for some k}

The row-support of C tells us which atoms of the dictionary have been used for building the signal matrix. Hence, if the cardinality of the row-support is lower than the dictionary cardinality, at least one atom of the dictionary has not been used for synthesizing the signal matrix. The row-ℓ0 pseudo-norm of a coefficient matrix can then be defined as

  ‖C‖_{row-0} = |rowsupp(C)|

According to this definition, the simultaneous sparse approximation problem can be stated as

  min_C ½‖S − ΦC‖²_F  s.t. ‖C‖_{row-0} ≤ T    (2)

where ‖·‖_F is the Frobenius norm and T a user-defined parameter that controls the sparsity of the solution. Note that the problem can also take the different form

  min_C ‖C‖_{row-0}  s.t. ½‖S − ΦC‖²_F ≤ ε    (3)

For this latter formulation, the problem translates into minimizing the number of non-zero rows of the coefficient matrix C while keeping the approximation error under control. Both problems (2) and (3) are appealing for their formulation clarity. However, similarly to the single-signal approximation case, solving these optimization problems is notably intractable because ‖·‖_{row-0} is a discrete-valued function. Two ways of addressing the intractable problems (2) and (3) are possible: relaxing the problem by replacing the ‖·‖_{row-0} function with a more tractable row-diversity measure, or using some suboptimal algorithms.

A large class of relaxed versions of ‖·‖_{row-0} proposed in the literature is encompassed by the following form [52]:

  J_{p,q}(C) = Σ_i ‖c_{i,·}‖_q^p  with  ‖c_{i,·}‖_q = (Σ_j |c_{i,j}|^q)^{1/q}

where typically p ≤ 1 and q ≥ 1. This penalty term can be interpreted as the ℓp quasi-norm of the sequence {‖c_{i,·}‖_q}_i raised to the power p. According to this relaxed version of the row-diversity measure, most of the algorithms proposed in the literature try to solve the relaxed problem

  min_C ½‖S − ΦC‖²_F + λ J_{p,q}(C)    (4)

where λ is another user-defined parameter that balances the approximation error and the sparsity-inducing penalty J_{p,q}(C). The choice of p and q results in a compromise between row-support sparsity and convexity of the optimization problem. Indeed, problem (4) is known to be convex when p, q ≥ 1, while it is known to produce a row-sparse matrix C if p ≤ 1 (due to the singularity of the penalty function at C = 0 [15]).

The simultaneous sparse approximation problem as described here is equivalent to several other problems studied in other research communities. Problem (4) can be reformulated so as to make clear its relation with the problems known as the ℓp-ℓq group lasso [61] or block-sparse regression [28] in statistics, and block-sparse signal recovery [13] in signal processing. Indeed, let s̃ be the vector obtained by stacking the signals s_1, …, s_L, let Φ̃ be the block-diagonal matrix with L copies of Φ on its diagonal, and let c̃ be the stacking of c_1, …, c_L. Then, problem (4) can be equivalently rewritten as

  min_{c̃} ½‖s̃ − Φ̃c̃‖²₂ + λ J′_{p,q}(c̃)    (5)

where J′_{p,q}(c̃) = Σ_{i=1}^M ‖c̃_{g_i}‖_q^p, with g_i being the indices in c̃ related to the i-th element of the dictionary matrix Φ.

As we have already stated, simultaneous sparse approximation and equivalent problems have been investigated by diverse communities and applied to various problems, and the literature on these topics has become overwhelming over the last few years. The aim of this work is to gather results from different communities (statistics, signal processing and machine learning) so as to survey, analyze and compare different algorithms proposed for solving optimization problem (4) for different values of p and q. In particular, instead of merely summarizing existing results, we enrich our survey with results such as proofs of convergence, formal relations between different algorithms, and experimental comparisons that were not available in the literature. These experimental comparisons essentially focus on the computational complexity of the different methods, on their ability to recover the signal sparsity pattern, and on the quality of the approximation they provide, evaluated through mean-square error.

Note that recently, various works have addressed simultaneous sparse approximation in the noiseless case [47,45], and many works aim at providing theoretical approximation guarantees in both noiseless and noisy cases [6,13,33,14,25,29]. Surveying these works is beyond the scope of this paper, and we suggest that interested readers follow these pointers and the references therein.

The most frequent case of problem (4) encountered in the literature is the one where p = 1 and q = 2. This case is the simplest one since it leads to a convex problem. We discuss different algorithms for solving this problem in Section 2. Then, in Section 3, we survey generic algorithms that are able to solve our approximation problems for different values of p and q. These algorithms are essentially iterative reweighted ℓ1 or ℓ2 algorithms. Many works in the literature have focused on one algorithm for solving a particular case of problem (4). These specific algorithms are discussed in Section 4. Notably, we present some greedy algorithms and discuss a sparse Bayesian learning algorithm. The experimental comparisons we carried out aim at clarifying how each algorithm behaves in terms of computational complexity, sparsity recovery and approximation mean-square error. These results are described in Section 5. For the sake of reproducibility, the code used in this paper has been made freely available (http://asi.insa-rouen.fr/enseignants/arakotom/code/SSAindex.html). Final discussions and conclusions are given in Section 6.

2. Solving the ℓ1-ℓ2 optimization problem

When p = 1 and q = 2, optimization problem (4) becomes a particular problem named M-BP, for Multiple Basis Pursuit, in the sequel. It is a special case that deserves attention. Indeed, it seems to be the most natural extension of the so-called Lasso problem [50] or Basis Pursuit Denoising [7], since for L = 1 it can easily be shown that problem (4) reduces to these two problems. The key point of this case is that it yields a convex optimization problem and thus benefits from all the properties resulting from convexity, e.g. any local minimum is a global one. One of the first algorithms addressing the M-BP problem is the one proposed by Cotter et al. [8], known as M-FOCUSS. This algorithm, based on factored gradient descent, works for any p ≤ 1 and has been proved to converge towards a local or global (when p = 1) minimum of problem (4) if it does not get stuck in a fixed point. Because M-FOCUSS is better tailored for situations where p ≤ 1, we postpone its description to the next section. The two algorithms on which we focus herein are the block-coordinate descent proposed by Yuan et al. [61] for the group lasso and the Landweber iteration scheme of Fornasier et al. [17]. We have biased our survey towards these two approaches because of their efficiency and their convergence properties.

2.1. Block coordinate descent

The specific structure of optimization problem (4) for p = 1 and q = 2 leads to a very simple block-coordinate descent algorithm that we name M-BCD. Here, we provide a more detailed derivation of the algorithm and we also give results that were not available in the work of Yuan et al. [61]. We provide a proof of convergence of the M-BCD algorithm, and we show that if the dictionary is orthogonal, the solution of the problem is equivalent to a simple shrinkage of the coefficient matrix.
2.1.1. Deriving optimality conditions

The M-BP optimization problem is the following:

  min_C W(C) = ½‖S − ΦC‖²_F + λ Σ_i ‖c_{i,·}‖₂    (6)

where the objective function W(C) is a non-smooth but convex function. Since the problem is unconstrained, a necessary and sufficient condition for a matrix C* to be a minimizer of (6) is that 0 ∈ ∂W(C*), where ∂W(C) denotes the subdifferential of our objective value W(C) [2]. By computing the subdifferential of W(C) with respect to each row c_{i,·} of C, the KKT optimality condition of problem (6) is

  −r_i + λ g_{i,·} = 0  ∀i

where r_i = φ_i^T(S − ΦC) and g_{i,·} is the i-th row of a subdifferential matrix G of J_{1,2}(C) = Σ_i ‖c_{i,·}‖₂. According to the definition of the J_{1,2} subdifferential given in the Appendix, the KKT optimality conditions can be rewritten as

  −r_i + λ c_{i,·}/‖c_{i,·}‖₂ = 0  ∀i such that c_{i,·} ≠ 0
  ‖r_i‖₂ ≤ λ  ∀i such that c_{i,·} = 0    (7)

A matrix C satisfying these equations can be obtained through the following algebra. Let us expand each r_i so that

  r_i = φ_i^T(S − ΦC_{−i}) − φ_i^T φ_i c_{i,·} = T_i − c_{i,·}    (8)

where C_{−i} is the matrix C with its i-th row set to 0 and T_i = φ_i^T(S − ΦC_{−i}). The second equality is obtained by remembering that φ_i^T φ_i = 1. Then, Eq. (7) tells us that if c_{i,·} is non-zero, T_i and c_{i,·} have to be collinear. Plugging all these points into Eq. (7) yields an optimal solution that can be obtained as

  c_{i,·} = (1 − λ/‖T_i‖)₊ T_i  ∀i    (9)

where (x)₊ = x if x > 0 and 0 otherwise. From this update equation, we can derive a simple algorithm which consists in iteratively applying the update (9) to each row i of C while keeping the other rows fixed.

2.1.2. The algorithm and its convergence

The block-coordinate descent algorithm is detailed in Algorithm 1. It is a simple and efficient algorithm for solving M-BP. Basically, the idea consists in solving for one row c_{i,·} at a time while keeping the other rows fixed. Starting from a sparse solution such as C = 0, at each iteration one checks for a given i whether the row c_{i,·} is optimal or not, based on conditions (7). If not, c_{i,·} is updated according to Eq. (9). Although such a block-coordinate algorithm does not converge in general for non-smooth optimization problems, Tseng [55] has shown that, for an optimization problem whose objective is the sum of a smooth and convex function and a non-smooth but block-separable convex function, block-coordinate optimization converges towards the global minimum of the problem. Our proof of convergence is based on such properties and follows the same lines as the one proposed by Sardy et al. [42] for single-signal approximation.

Theorem 1. The M-BCD algorithm converges to a solution of the M-Basis Pursuit problem given in Eq. (6), where convergence is understood in the sense that any accumulation point of the M-BCD algorithm is a minimum of problem (6) and the sequence {C^(k)} generated by the algorithm is bounded.

Proof. Note that the M-BP problem presents a particular structure, with a smooth and differentiable convex function ‖S − ΦC‖²_F and a row-separable penalty function Σ_i h_i(c_{i,·}), where h(·) is a continuous and convex function with respect to c_{i,·}. Also note that our algorithm considers a cyclic rule where, within each loop, every c_{i,·}, i ∈ [1, …, M], is considered for optimization. The main particularity is that, for some i, c_{i,·} may be left unchanged by the block-coordinate descent if it is already optimal. This occurs especially for rows c_{i,·} which are equal to 0. Then, according to the special structure of the problem and the use of a cyclic rule, the results of Tseng [55] prove that our M-BCD algorithm converges. □

Algorithm 1. Solving M-BP through block-coordinate descent.

1: t = 1, C^(0) = 0, Loop = 1
2: while Loop do
3:   for i = 1, 2, …, M do
4:     Compute ‖r_i‖
5:     if the optimality condition of c_{i,·} according to Eqs. (7) is not satisfied then
6:       T_i ← φ_i^T(S − ΦC^(t−1)_{−i})
7:       c^(t)_{i,·} ← (1 − λ/‖T_i‖)₊ T_i
8:     end if
9:     t ← t + 1
10:   end for
11:   if all optimality conditions are satisfied then
12:     Loop = 0
13:   end if
14: end while

Regarding computational complexity, if we suppose that Φ^TS and Φ^TΦ have been computed beforehand, the main computational burden of Algorithm 1 for each loop is the computation of ‖r_i‖ and T_i. The complexity of computing these two terms for all i is about O(M²L). Note however that, since T_i is computed only for non-optimal c_{i,·}, in practice, as shown in our experiments, the computational complexity is empirically lower than quadratic with respect to M.

Another point we want to emphasize, which has not been brought to light yet, is that this algorithm should be initialized at C = 0 so as to take advantage of the sparsity pattern of the solution. Indeed, by doing so, only a few updates are needed before convergence.

Intuitively, we can understand this algorithm as one which tends to shrink to zero the rows of the coefficient matrix that contribute poorly to the approximation. Indeed, T_i can be interpreted as the correlation between φ_i and the residual obtained when row i has been removed. Hence, the smaller the norm of T_i, the less relevant φ_i is to the approximation and, according to Eq. (9), the smaller the resulting c_{i,·}. Further insight into this block-coordinate descent algorithm can be obtained by supposing that M ≤ N and that Φ is composed of orthonormal elements of R^N, hence Φ^TΦ = I. In such a situation, we have

  T_i = φ_i^T S  and  ‖T_i‖²₂ = Σ_{k=1}^L (φ_i^T s_k)²

and thus

  c_{i,·} = (1 − λ/(Σ_k (φ_i^T s_k)²)^{1/2})₊ φ_i^T S

This last equation highlights the relation between the single Basis Pursuit (when L = 1) and the Multiple Basis Pursuit algorithm presented here. Both algorithms lead to a shrinkage of the coefficient projections when considering orthonormal dictionary elements. With the inclusion of multiple signals, the shrinking factor becomes more robust to noise since it depends on the correlation of the atom φ_i with all signals.

This M-BCD algorithm can be considered as an extension to simultaneous signal approximation of the works of Sardy et al. [42] and Elad [12], which also considered block-coordinate descent for single-signal approximation. In addition to the works of Sardy et al. and Elad, many other authors have considered block-coordinate descent algorithms for related sparse approximation problems. For instance, block-coordinate descent has also been used for solving the Lasso [19] and the elastic net [62].

2.2. Landweber iterations

For recovering vector-valued data with joint sparsity constraints, Fornasier et al. [17] have proposed an extension of the iterative Landweber approach introduced by Daubechies et al. [9]. In their work, Fornasier et al. used an iterative shrinking algorithm (which has the flavor of a gradient projection approach) able to solve the general problem (4) with p = 1 and q ∈ {1, 2, ∞}. We summarize here the details of their derivation for general q. The problem to solve is the one given in Eq. (4) with p = 1:

  min_C ½‖S − ΦC‖²_F + λ Σ_i ‖c_{i,·}‖_q    (10)

The iterative algorithm can be derived by first noting that the objective function given above is strictly convex, and then by defining the following surrogate objective function:

  J_sur(C, A) = ½‖S − ΦC‖²_F + λ Σ_i ‖c_{i,·}‖_q − ½‖ΦC − ΦA‖²_F + ½‖C − A‖²_F

From this surrogate function, Fornasier et al. consider the solution of problem (4) as the limit of the sequence C^(t):

  C^(t+1) = argmin_C J_sur(C, C^(t))

Note that, owing to the additional terms (the last two) in the surrogate function, C^(t+1) is a coefficient matrix that minimizes the desired objective function plus terms that constrain C^(t+1) and the novel signal approximation ΦC^(t+1) to be "close" respectively to C^(t) and the previous signal approximation ΦC^(t).

Now, the minimizer of J_sur for fixed A can be obtained by expanding J_sur as

  J_sur(C, A) = ½‖(A + Φ^T(S − ΦA)) − C‖²_F + λ Σ_i ‖c_{i,·}‖_q
                − ½‖A + Φ^T(S − ΦA)‖²_F + ½‖S‖²_F − ½‖ΦA‖²_F + ½‖A‖²_F

and noticing that the second line of this expression does not depend on C. Thus we have

  argmin_C J_sur(C, A) = 𝕌_q(A + Φ^T(S − ΦA))

with 𝕌_q(X) being defined as

  𝕌_q(X) = argmin_C ½‖X − C‖²_F + λ Σ_i ‖c_{i,·}‖_q    (11)

Thus the iterative algorithm we get simply reads, at iteration t,

  C^(t+1) = 𝕌_q(C^(t) + Φ^T(S − ΦC^(t)))    (12)

Algorithm 2. Solving M-BP through Landweber iterations.

1: C^(0) = 0, Loop = 1, t = 0
2: while Loop do
3:   C^(t+1/2) ← C^(t) + Φ^T(S − ΦC^(t))
4:   for i = 1, 2, …, M do
5:     Compute ‖c^(t+1/2)_{i,·}‖
6:     c^(t+1)_{i,·} ← (1 − λ/‖c^(t+1/2)_{i,·}‖)₊ c^(t+1/2)_{i,·}
7:   end for
8:   t ← t + 1
9:   if the stopping criteria are satisfied then
10:     Loop = 0
11:   end if
12: end while

The Landweber algorithm proposed by Fornasier et al. involves two steps: a first step, similar to a gradient descent with a fixed step size, which does not take the sparsity constraint into account, and a thresholding step through the operator 𝕌_q(·), which updates the solution according to the penalty Σ_i ‖c_{i,·}‖_q. Fornasier et al. give detailed information on the closed-form expression of 𝕌_q(·) for q ∈ {1, 2, ∞}. We can note that problem (11) decouples with respect to the rows of C, so the thresholding operator 𝕌_q(·) acts independently on each row. According to Fornasier et al. [17], for q = 2, 𝕌₂(C) writes for each row c_{i,·} as

  [𝕌₂(C)]_{i,·} = (1 − λ/‖c_{i,·}‖)₊ c_{i,·}

and the full algorithm is summarized in Algorithm 2. For this algorithm, the computational complexity at each iteration is similar to that of the block-coordinate descent algorithm; it is dominated by the computation of Φ^TΦC, which is of the order O(M²L) if Φ^TΦ is precomputed. We also note that block-coordinate descent and Landweber iterations yield algorithms that are very similar. Two differences can be noted. The first one is that the shrinking operation is based on T_i in one case, while in the other case it directly considers c_{i,·}. At the moment, it is not clear what the implications of these two different schemes are in terms of convergence rate, and more investigation is needed to clarify this point. The other main difference between the two approaches is that, by optimizing at each loop only the c_{i,·}'s that are not optimal yet, the BCD algorithm is more efficient than the one of Fornasier et al. Our experimental results corroborate this finding.

Note that Fornasier et al. have also shown that this iterative scheme converges towards the minimizer of problem (4) for 1 ≤ q ≤ ∞.

2.3. Other works

Besides the M-FOCUSS algorithm of Cotter et al., another seminal work which considered the SSA problem is that of Malioutov et al. [31]. While dealing with a source localization problem, they had to consider a sparse approximation problem with joint sparsity constraints. From the block-sparse signal approximation problem

  min_{c̃} ½‖s̃ − Φ̃c̃‖²₂ + λ Σ_{i=1}^M ‖c̃_{g_i}‖₂

they derived an equivalent problem which writes as

  min_{p,q,r,c̃} p + λq
  s.t. ½‖s̃ − Φ̃c̃‖²₂ ≤ p
       Σ_{i=1}^M r_i ≤ q
       ‖c̃_{g_i}‖₂ ≤ r_i  ∀i = 1, …, M

Note that this problem has a linear objective function together with linear and second-order cone constraints. Such a problem is known as a second-order cone programming problem, and interior point methods exist for solving it [46]. While very attractive because of its simple formulation and the global convergence of the algorithm, this approach suffers from a high complexity, of the order O(M³L³).

Very recently, two other algorithms have been proposed for addressing our jointly sparse approximation problem. The first one, derived by van den Berg et al. [56], is very similar to the one of Fornasier et al., although obtained from a different perspective. Indeed, they develop a method based on spectral gradient projection for solving the group lasso problem, which is the one given in Eq. (5) with p = 1 and q = 2. However, instead of the regularized version, they considered the following constrained version:

  min_{c̃} ½‖s̃ − Φ̃c̃‖²₂  s.t.  Σ_{i=1}^M ‖c̃_{g_i}‖₂ ≤ τ

which they iteratively solve by means of spectral gradient projection. Basically, their solution iterates write as

  c̃^(t+1) = P(c̃^(t) + α Φ̃^T(s̃ − Φ̃c̃^(t)))    (13)

where α is a step size, to be optimized for instance by backtracking, and P(·) is the projection operator defined as

  P(z) = argmin_x ‖z − x‖²  subject to  Σ_i ‖x_{g_i}‖₂ ≤ τ

van den Berg et al. then proposed an algorithm that computes this projection in linear time. Note how similar the iterates given in Eqs. (12) and (13) are. Actually, the approaches of Fornasier et al. and van den Berg et al. are equivalent, and the only point in which they differ is the way the projections are computed. This difference essentially comes from the problem formulation: Fornasier et al. use a penalized problem with known λ, while van den Berg et al. consider the constrained optimization problem. Hence, the latter cannot directly use the analytic solution of the projection P(·), but have to compute the appropriate value of λ given their value of τ. However, the algorithm they derive for this projection is efficient. Note that if one has some prior knowledge of the value of τ, then this algorithm would be preferable to the one of Fornasier et al.

The other noteworthy recent work is the one of Obozinski et al. [35]. They address the SSA problem within the context of multi-task classification. Indeed, they want to recover a common set of covariates that are simultaneously relevant for all classification tasks. To achieve this aim, they propose to solve problem (4) with p = 1, q = 2 and a differentiable loss function such as the logistic loss or the squared loss. One of their most prominent contributions is a path-following algorithm that is able to compute the regularization path of the solution with respect to λ. Their algorithm uses a continuation method based on prediction-correction steps. The prediction step is built upon the optimality conditions given in Eq. (7), while the correction step needs an algorithm which solves problem (4), but only for a small subset of covariates. The main advantage of such an algorithm is its efficiency for selecting an appropriate value of λ in a model-selection context.

3. Generic algorithms for large classes of p and q

In this section, we present several algorithms that are able to solve the simultaneous sparse approximation problem for large classes of p and q. In the first part of the section, we review the M-FOCUSS algorithm of Cotter et al. [8], which addresses the case where 0 < p ≤ 1 and q = 2. In the second part, we make clear how the penalization term J_{p,q}(C), with p = 1 and 1 ≤ q ≤ 2, is related to automatic relevance determination. From this novel insight, we then propose an iterative reweighted least-squares algorithm that solves this general case. The last part of the section is devoted to the extension of reweighted ℓ1 algorithms [63,4] to the SSA problem. We describe how algorithms tailored for solving the SSA problem with p = 1 and any q can be re-used for solving a problem with the same q but p < 1.

3.1. M-FOCUSS algorithm

We first detail how the M-FOCUSS algorithm proposed by Cotter et al. [8] can be derived, then we discuss some of its properties and its relation to other works.

3.1.1. Deriving the algorithm

M-FOCUSS is, as far as we know, the first algorithm that has been introduced for solving the simultaneous sparse approximation problem. M-FOCUSS addresses the general case where q = 2 and p ≤ 1.
where q = 2 and p ≤ 1. This algorithm can be understood as a fixed-point algorithm, which can be derived through an appropriate factorization of the gradient of the objective function of problem (4). Indeed, since the partial derivative of J_{p,2}(C) with respect to an entry c_{m,n} is

∂J_{p,2}(C)/∂c_{m,n} = ∂/∂c_{m,n} Σ_i (Σ_j c_{i,j}²)^{p/2} = p ||c_{m,·}||₂^{p−2} c_{m,n}   (14)

the gradient of the objective function writes

−U^t(S − UC) + λ P C

where P = diag(p ||c_{i,·}||₂^{p−2}). We then define the weighting matrix W = diag(p^{−1/2} ||c_{i,·}||₂^{1−p/2}). After replacing W^{−2} = P in the above gradient, simple algebra gives the following necessary optimality condition:

((UW)^t(UW) + λI) W^{−1} C = (UW)^t S

Hence, the optimum solution writes as

C* = W ((UW)^t(UW) + λI)^{−1} (UW)^t S   (15)

Note that, since W itself depends on C, this closed-form expression of the optimal matrix C can be interpreted as a fixed-point equation. From this insight, Cotter et al. suggest the following iterates for solving the simultaneous sparse approximation problem:

C^{(t+1)} = W^{(t)} ((UW^{(t)})^t(UW^{(t)}) + λI)^{−1} (UW^{(t)})^t S   (16)

where W^{(t)} = diag(p^{−1/2} ||c^{(t)}_{i,·}||₂^{1−p/2}). Now, since we have

((UW^{(t)})^t(UW^{(t)}) + λI)^{−1} (UW^{(t)})^t = (UW^{(t)})^t ((UW^{(t)})(UW^{(t)})^t + λI)^{−1}

Cotter et al. actually consider the following iterative scheme:

C^{(t+1)} = W^{(t)} (UW^{(t)})^t ((UW^{(t)})(UW^{(t)})^t + λI)^{−1} S   (17)

The resulting algorithm is detailed in Algorithm 3. Its computational complexity is dominated by the linear system solved at each iteration: forming the system and solving it cost about O(N²M) and O(N³), respectively.

Algorithm 3. M-FOCUSS with the J_{p≤1, q=2} penalty.

1: Initialize C^{(0)} to a matrix of ones, t = 1
2: while Loop do
3:   W^{(t)} ← diag(p^{−1/2} ||c^{(t−1)}_{i,·}||₂^{1−p/2})
4:   A^{(t)} ← U W^{(t)}
5:   C^{(t)} ← W^{(t)} (A^{(t)})^t (A^{(t)} (A^{(t)})^t + λI)^{−1} S
6:   t ← t + 1
7:   if stopping condition is satisfied then
8:     Loop = 0
9:   end if
10: end while
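As a concrete illustration, here is a minimal NumPy sketch of the iterates of Eq. (17), including the ε-smoothing of the weights discussed below; all function and variable names are ours, not from the paper:

```python
import numpy as np

def m_focuss(S, U, lam=1e-3, p=1.0, eps=1e-6, n_iter=100):
    """Sketch of M-FOCUSS (Algorithm 3) with epsilon-smoothed weights:
    C^(t+1) = W (UW)^t ((UW)(UW)^t + lam*I)^{-1} S,
    with W = diag(p^{-1/2} (||c_i|| + eps)^{1 - p/2})."""
    N, M = U.shape
    C = np.ones((M, S.shape[1]))          # C^(0): a matrix of ones
    for _ in range(n_iter):
        # row norms of the current iterate, smoothed by eps
        w = p ** -0.5 * (np.linalg.norm(C, axis=1) + eps) ** (1.0 - p / 2.0)
        A = U * w                          # A = U W (scale dictionary columns)
        C = w[:, None] * (A.T @ np.linalg.solve(A @ A.T + lam * np.eye(N), S))
    return C
```

With p = 1 this iterates towards the convex ℓ1–ℓ2 (group lasso) solution; gradually decreasing `eps` across iterations gives the annealed variant considered in the experiments.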
3.1.2. Discussing M-FOCUSS

Rao et al. [40] have provided an interesting interpretation of the FOCUSS algorithm for single-signal approximation which readily extends to M-FOCUSS: M-FOCUSS can be viewed as an iterative reweighted least-squares algorithm. Indeed, from the updating equation of C^{(t)} (see Eq. (16)), we can note that C^{(t)} = W^{(t)} Z^{(t)}, where Z^{(t)} is the minimizer of

½ ||S − U W^{(t)} Z||²_F + (λ/2) ||Z||²_F

and thus C^{(t)} can be understood as the minimizer of

½ ||S − UC||²_F + (λ/2) ||(W^{(t)})^{−1} C||²_F

This makes clear the relation between the sparse approximation problem and iterative reweighted least squares. Such a connection has already been highlighted in other contexts: while sparsity is usually induced by an ℓ1-norm penalty, it has been proved that solving the problem in which the ℓ1 norm is replaced by an adaptive ℓ2 norm leads to equivalent solutions [10,41,23].

Despite this nice insight, M-FOCUSS presents an important issue: the updates proposed by Cotter et al. are not guaranteed to converge to a local minimum of the problem when it is not convex (p < 1), nor to the global minimum of the convex problem (p = 1). Their algorithm presents several fixed points, since when a row of C is equal to 0, it stays at 0 at the next iteration. Although such a point may be harmless if the algorithm is initialized with a ''good'' starting point, it is nonetheless an undesirable point when solving a convex problem. On the contrary, the M-BCD and Landweber iteration algorithms do not suffer from the presence of such fixed points. In M-FOCUSS, this pathological behavior can be handled by introducing a smoothing term ε in the weights, which become

W^{(t)} = diag(p^{−1/2} (||c^{(t)}_{i,·}|| + ε)^{1−p/2})

where ε > 0. The use of ε prevents any weight from being zero and consequently prevents the related row c_{i,·} from staying permanently at zero. If we furthermore note that M-FOCUSS is nothing more than iterative reweighted least squares, then, according to the recent works of Daubechies et al. [10] and Chartrand et al. [5], it seems justified to iterate M-FOCUSS with decreasing values of ε. In our numerical experiments, we will consider the M-FOCUSS algorithm with both a fixed and a decreasing value of ε.

3.2. Automatic relevance determination approach

In this section, we focus on the relaxed optimization problem given in (4) with the general penalization J_{p,q}(C). Our objective here is to clarify the connection between such a form of penalization and the automatic relevance determination (ARD) of C's rows, which has been the keystone of the Bayesian approach of Wipf et al. [60]. We will show that, for some values of p and q, the mixed norm J_{p,q}(C) has an equivalent variational formulation. Then, by using this novel formulation in problem (4) instead of J_{p,q}(C), we exhibit the relation between our sparse approximation problem and ARD. We then propose an iterative reweighted least-squares approach for solving the resulting ARD problem.
3.2.1. Exhibiting the relation with ARD

For this purpose, we first consider the following formulation of the simultaneous sparse approximation problem:

min_C ½ ||S − UC||²_F + λ′ (J_{p,q}(C))^{2/p}   (18)

In the convex case (p ≥ 1 and q ≥ 1), since the power function is strictly monotonically increasing, problems (4) and (18) are equivalent, in the sense that for a given value of λ′ there exists a λ such that the solutions of the two problems coincide. When J_{p,q} is not convex, this equivalence does not strictly apply. However, due to the nature of the problem, formulation (18) is more convenient for exhibiting the relation with ARD.

Let us introduce the key lemma that allows us to derive the ARD-based formulation of the problem. This lemma gives a variational form of the ℓ_{p,q} norm of a sequence {a_{t,k}}. Note that similar proofs have been derived in [48,39].

Lemma 1. If s > 0 and {a_{t,k} : k ∈ N_n, t ∈ N_T} ⊂ R is such that at least one a_{t,k} ≠ 0, then

min_d { Σ_{t,k} |a_{t,k}|² / d_{t,k} : d_{t,k} ≥ 0, Σ_k (Σ_t d_{t,k}^{1/s})^{s/(r+s)} ≤ 1 } = ( Σ_k (Σ_t |a_{t,k}|^q)^{p/q} )^{2/p}   (19)

where q = 2/(s+1) and p = 2/(s+r+1). Furthermore, at optimality, we have

d*_{t,k} = |a_{t,k}|^{2s/(s+1)} (Σ_u |a_{u,k}|^{2/(s+1)})^{r/(s+r+1)} / ( Σ_v (Σ_u |a_{u,v}|^{2/(s+1)})^{(s+1)/(s+r+1)} )^{r+s}   (20)

Proof. The proof has been postponed to the Appendix.

According to this lemma, the ℓ_{p,q} norm of a sequence can be computed through a minimization problem. Hence, applying this lemma to (J_{p,q}(C))^{2/p} by defining a_{t,k} = c_{t,k}, we get a variational form of the penalization term: the mixed norm on the matrix coefficients has been transformed into a mixed norm on a weight matrix d. Plugging this variational formulation of the penalization term into problem (18) yields the following equivalent problem:

min_{C,d} ½ ||S − UC||²_F + λ Σ_{t,k} c²_{t,k} / d_{t,k}
s.t. Σ_k (Σ_t d_{t,k}^{1/s})^{s/(r+s)} ≤ 1,  d_{t,k} ≥ 0 ∀ t,k   (21)

This problem is the one which makes clear the automatic relevance determination interpretation of the original formulation (4). Indeed, we have transformed problem (4) into a problem with a smooth objective function, at the expense of some additional variables d_{t,k} which aim at determining the relevance of each element of C: since each squared value c²_{t,k} is inversely weighted by d_{t,k}, the smaller d_{t,k} is, the smaller the corresponding c_{t,k} should be. Furthermore, optimization problem (21) also involves some constraints on {d_{t,k}}. These constraints impose the matrix d to have positive elements and an ℓ_{1/(r+s),1/s} mixed norm smaller than 1. This mixed norm on d plays an important role since it induces the row-norm sparsity of C. Indeed, according to the relation between p, r and s, for p < 1 we also have r + s > 1, making the ℓ_{1/(r+s),1/s} norm non-differentiable with respect to the first norm. Such singularities favor row-norm sparsity of the matrix d at optimality, inducing row-norm sparsity of C: when a row norm of d equals 0, the corresponding row norm of C must also equal 0, which means that the corresponding element of the dictionary is ''irrelevant'' for the approximation of all signals. Problem (21) thus provides an equivalent formulation of problem (4) in which the row-diversity measure has been transformed into another penalty function owing to an ARD formulation; the trade-off between convexity of the problem and sparsity of the solution has been transferred from p and q to r and s.

From a Bayesian perspective, we can interpret the mixed norm on d as the diagonal term of the covariance matrix of a Gaussian prior over the rows of C. This is typically the classical Bayesian automatic relevance determination approach, as proposed for instance by Tipping [51]. This novel insight on the ARD interpretation of J_{p,q}(C) clarifies the connection between the M-FOCUSS algorithm of Cotter et al. [8] and the Multiple Sparse Bayesian Learning (M-SBL) algorithm of Wipf et al. [60] for any value of p ≤ 1. In their previous works, Wipf et al. proved that these two algorithms were related when p → 0. Here, we refine their result by enlarging the connection to other values of p, showing that both algorithms actually solve a problem with automatic relevance determination on the row norms of C.

3.3. Solving the ARD formulation for p = 1 and 1 ≤ q ≤ 2

Herein, we propose a simple iterative algorithm for solving problem (21) for p = 1 and 1 ≤ q ≤ 2. This algorithm, named M-EMq, is based on iterative reweighted least squares where the weights are updated according to Eq. (20). Thus, it can be seen as an extension of the M-FOCUSS algorithm of Cotter et al. to q ≤ 2. Note that we have restricted ourselves to p = 1 since we will show in the next section that the case p < 1 can be handled using another reweighting scheme.

Since p = 1, and thus s + r = 1, the problem we are considering is

min_{C,d} ½ Σ_k ||U c_{·,k} − s_k||²₂ + λ Σ_{t,k} c²_{t,k} / d_{t,k} = J_obj(C,d)
s.t. Σ_k (Σ_t d_{t,k}^{1/s})^s ≤ 1,  d_{t,k} ≥ 0   (22)
Since we consider 1 ≤ q ≤ 2, and hence 0 ≤ s ≤ 1 (for s = 0, we use the sup norm of the vectors d_{·,k} in the constraints), this optimization problem is convex with a smooth objective function. We propose to address it through a block-coordinate descent algorithm which alternately solves the problem with respect to C with the weights d fixed, and then, keeping C fixed, computes the optimal weights d. The resulting algorithm is detailed in Algorithm 4.

Algorithm 4. Iterative reweighted least squares for addressing the J_{1, 1≤q≤2} penalty.

1: Initialize d^{(0)} to a strictly positive matrix, t = 1
2: while Loop do
3:   C^{(t)} ← solution of problem (22) with fixed d = d^{(t−1)}, as given by Eq. (23)
4:   d^{(t)} ← solution of problem (22) with fixed C = C^{(t)}, as given by Eq. (24)
5:   t ← t + 1
6:   if stopping condition is satisfied then
7:     Loop = 0
8:   end if
9: end while

Owing to the problem structure, steps 3 and 4 of this algorithm have simple closed forms. Indeed, for fixed d, each vector c^{(t)}_{·,k} at iteration t is given by

c^{(t)}_{·,k} = (U^t U + 2λ D_k^{(t−1)})^{−1} U^t s_k   (23)

where D_k^{(t−1)} is a diagonal matrix with entries 1/d^{(t−1)}_{·,k}. In a similar way, for fixed C, step 4 boils down to solving problem (19). Hence, defining a_{t,k} = c^{(t)}_{t,k}, we also have a closed form for d^{(t)}:

d^{(t)}_{t,k} = |a_{t,k}|^{2s/(s+1)} (Σ_u |a_{u,k}|^{2/(s+1)})^{(1−s)/2} / Σ_v (Σ_u |a_{u,v}|^{2/(s+1)})^{(s+1)/2}   (24)

Note that, similarly to the M-FOCUSS algorithm, this scheme can also be seen as an iterative reweighted least-squares approach, or as an Expectation-Minimization algorithm where the weights are defined by Eq. (24). Furthermore, it can be shown that if the weights d are initialized to non-zero values, then at each loop involving steps 3 and 4 the objective value of problem (22) decreases. Hence, since the problem is convex, the algorithm should converge towards its global minimum.

Theorem 2. If the objective function of problem (22) is strictly convex (for instance when U is full rank), and if for the t-th loop, after step 4, we have d^{(t)} ≠ d^{(t−1)}, then the objective value has decreased, i.e.:

J_obj(C^{(t+1)}, d^{(t)}) < J_obj(C^{(t)}, d^{(t)}) < J_obj(C^{(t)}, d^{(t−1)})

Proof. The right inequality J_obj(C^{(t)}, d^{(t)}) < J_obj(C^{(t)}, d^{(t−1)}) comes from d^{(t)} being the optimal value of the optimization problem resulting from step 4 of Algorithm 4. The strict inequality follows from the hypothesis d^{(t)} ≠ d^{(t−1)} and from the strict convexity of the objective function. A similar reasoning yields the left inequality: since C^{(t)} is not optimal with respect to d^{(t)} for the problem of step 3, invoking the strict convexity of the associated objective function and the optimality of C^{(t+1)} concludes the proof. □

As stated by the above theorem, the decrease in objective value is guaranteed unless the algorithm gets stuck in one of its fixed points (e.g., all the elements of d being zero except for one entry {t₁,k₁}). In practice, we have experienced, by comparing with the M-BCD algorithm for q = 2, that if d is initialized with non-zero entries, Algorithm 4 converges to the global minimum of problem (22). Numerical experiments will illustrate this point.

Regarding computational complexity, it is easy to see that at each iteration the main burden of the algorithm is the linear systems to be solved. Supposing that U^t U has been precomputed, solving the systems for all signals costs about O(M³L). Hence, the flexibility in the choice of q that this algorithm allows is heavily paid for by an increased computational complexity.

3.4. Iterative reweighted ℓ1−ℓq algorithms for p < 1

This section introduces an iterative reweighted M-Basis Pursuit (IrM-BP) algorithm and proposes a way of setting its weights. Through this approach, we are able to provide an iterative algorithm which solves problem (4) with p < 1 and 1 ≤ q ≤ 2.

3.4.1. Iterative reweighted algorithm

Recently, several works have advocated that sparse approximations can be recovered through iterative algorithms based on a reweighted ℓ1 minimization [63,4,5,58,18]. Typically, for a single-signal case, the idea consists in iteratively solving the problem

min_c ½ ||s − Uc||²₂ + λ Σ_i z_i |c_i|

where the z_i are some positive weights, and then updating the weights z_i according to the solution c* of the above problem. Besides providing empirical evidence that reweighted ℓ1 minimization yields sparser solutions than a simple ℓ1 minimization, the above-cited works theoretically support such a claim. These results for the single-signal approximation case suggest that, in the simultaneous sparse approximation problem, a reweighted M-Basis Pursuit would also lead to sparser solutions than the classical M-Basis Pursuit.

The iterative reweighted M-Basis Pursuit algorithm is defined as follows. We iteratively construct a sequence C^{(t)} defined as

C^{(t)} = argmin_C ½ ||S − UC||²_F + λ Σ_i z_i^{(t)} ||c_{i,·}||_q   (25)

where the positive weight vector z^{(t)} depends on the previous iterate C^{(t−1)}. For t = 1, we typically define z^{(1)} = 1, and for t > 1 we will consider the following weighting scheme:

z_i^{(t)} = 1 / (||c^{(t−1)}_{i,·}||_q + ε)^r   ∀i   (26)

where c^{(t−1)}_{i,·} is the i-th row of C^{(t−1)}, r is a user-defined positive constant, and ε is a small regularization term that prevents the weight from becoming infinite as soon as c^{(t−1)}_{i,·} vanishes.
This is a classical trick that has been used, for instance, by Candes et al. [4] or Chartrand et al. [41]. Note that for any positive weight vector z, problem (25) is a convex problem that does not present local minima. Furthermore, for 1 ≤ q ≤ 2, it can be solved by the block-coordinate descent algorithm, the Landweber iteration approach, or the M-EMq scheme given in Algorithm 4, by simply replacing λ with λ_i = λ z_i. This reweighting scheme is similar to the adaptive lasso algorithm of Zou et al. [63], but it uses a larger number of iterations and addresses the simultaneous approximation problem.

The computational complexity of these iterative reweighted algorithms is difficult to evaluate, since it naturally depends on the complexity of each inner algorithm and on the number of iterations needed before convergence. However, we have empirically noted that only a few iterations of the inner algorithm are needed (typically about 10) and that the iterative approach greatly benefits from warm-start initializations. Hence, only a slight increase of complexity can be noted between algorithms that use an ℓ1−ℓq penalty and their reweighted ℓ_{p<1}−ℓq counterparts.

3.4.2. Connections with the Majorization-Minimization algorithm

This IrM-BP algorithm can be interpreted as an algorithm for solving an approximation of problem (4) when 0 < p < 1 and 1 ≤ q ≤ 2. Indeed, similarly to the reweighted ℓ1 scheme of Candes et al. [4] or the one-step reweighted lasso of Zou et al.
[64], this algorithm falls into the class of majorize-minimize (MM) algorithms [26]. MM algorithms consist in replacing a difficult optimization problem with an easier one, for instance by linearizing the objective function, solving the resulting optimization problem, and iterating such a procedure.

The connection between MM algorithms and our reweighted scheme can be made through linearization. Let us first define J_{p,q,ε}(C) as an approximation of the penalty term J_{p,q}(C):

J_{p,q,ε}(C) = Σ_i g(||c_{i,·}||_q + ε)

where g(·) = |·|^p. Since g(·) is concave for 0 < p < 1, a linear approximation of J_{p,q,ε}(C) around C^{(t−1)} yields the following majorizing inequality:

J_{p,q,ε}(C) ≤ J_{p,q,ε}(C^{(t−1)}) + Σ_i [p / (||c^{(t−1)}_{i,·}||_q + ε)^{1−p}] (||c_{i,·}||_q − ||c^{(t−1)}_{i,·}||_q)

Then, for the minimization step, replacing J_{p,q} in problem (4) with the right-hand side of the inequality and dropping constant terms leads to our optimization problem (25) with appropriately chosen z_i and r. Note that, for the weights given in Eq. (26), r = 1 corresponds to the linearization of a log penalty Σ_i log(||c_{i,·}|| + ε), whereas setting r = 1 − p corresponds to an ℓp penalty (0 < p < 1).

Algorithm 5. Majorization-Minimization algorithm leading to iterative reweighted ℓ1 for addressing the J_{p≤1,q} penalty.

1: Initialize z_i^{(1)} = 1, r = 1 − p, t = 1
2: while Loop do
3:   C^{(t)} ← solution of problem (25)
4:   t ← t + 1
5:   z_i^{(t)} ← 1 / (||c^{(t−1)}_{i,·}||_q + ε)^r   ∀i
6:   if stopping condition is satisfied then
7:     Loop = 0
8:   end if
9: end while

MM algorithms have already been considered in optimization problems with sparsity-inducing penalties. For instance, an MM approach has been used by Figueiredo et al. [16] for solving a least-squares problem with an ℓp sparsity-inducing penalty, whereas Candes et al. [4] have addressed the problem of exact sparse signal recovery. In a context of simultaneous approximation, Simila [43,44] has also considered MM algorithms, approximating the non-convex penalty with a quadratic term. In the light of all these previous works, what we have exposed here is merely an extension of iterative reweighted ℓ1 algorithms to the simultaneous sparse approximation problem. Through this iterative scheme, we can solve problem (4) with p < 1 and any q, provided that we have an algorithm able to solve problem (4) with p = 1 and the desired q.

Analyzing the convergence of the sequence C^{(t)} towards the global minimizer of problem (4) is a challenging issue. Indeed, several points make a formal proof of convergence difficult. First, in order to avoid a row c_{i,·} being permanently at zero, we have introduced a smoothing term ε; thus we are only solving an ε-approximation of problem (4). Furthermore, the penalty we use is non-convex, so a monotonic algorithm such as an MM approach, which decreases the objective value at each iteration, cannot guarantee convergence to the global minimum of our ε-approximate problem. Due to these two major obstacles, we leave this convergence proof for future work. Note, however, that a few works have addressed the convergence issue of reweighted ℓ1 or ℓ2 algorithms for single sparse signal recovery. Notably, we can mention the recent work of Daubechies et al. [10], which provides a convergence proof of iterative reweighted least squares for exact sparse recovery. In the same flavor, Foucart et al. [18] have proposed a tentative rigorous convergence proof for reweighted ℓ1 sparse signal recovery. Although we do not have any rigorous proof of convergence, in practice we will show that the reweighted algorithm provides good sparse approximations.

As already noted by several authors [4,41,10], ε plays a major role in the quality of the solution. In the experimental results presented below, we have investigated two methods for setting ε: the first one is to set ε to a fixed value (e.g., 0.001), and the other one, denoted as an annealing approach, consists in gradually decreasing ε after having solved problem (25).

3.4.3. Connection with the group bridge Lasso

Recently, in a context of variable selection through group lasso, Huang et al. [25] have proposed an algorithm for solving problem (5) with p < 1 and q = 1:

min_c̃ ½ ||s̃ − Ũc̃||² + λ Σ_i ||c̃_{g_i}||₁^p   (27)

Instead of directly addressing this optimization problem, they have shown that the above problem is equivalent to the minimization problem:
min_{c̃,z} ½ ||s̃ − Ũc̃||² + Σ_i ||c̃_{g_i}||₁ / z_i^{1/p−1} + τ Σ_i z_i
s.t. z_i ≥ 0 ∀i   (28)

where τ is a regularization parameter. Here, equivalence is understood as follows: c̃* minimizes Eq. (27) if and only if the pair (c̃*, z*) minimizes Eq. (28) with an appropriate value of τ. This value of τ will be made clear in the sequel.

Algorithm 6. Group bridge Lasso based on iterative reweighted ℓ1 for addressing the J_{p≤1,q} penalty.

1: Initialize c̃^{(0)}, τ = ((1−p)/p) λ^{1/(p−1)}, t = 1
2: while Loop do
3:   z_i^{(t)} ← ((1−p)/(pτ))^p ||c̃^{(t−1)}_{g_i}||₁^p
4:   c̃^{(t)} ← argmin_c̃ ½ ||s̃ − Ũc̃||² + Σ_i (z_i^{(t)})^{1−1/p} ||c̃_{g_i}||₁
5:   t ← t + 1
6:   if stopping condition is satisfied then
7:     Loop = 0
8:   end if
9: end while

4.1. S-OMP

Simultaneous Orthogonal Matching Pursuit (S-OMP) [54] is an extension of the Orthogonal Matching Pursuit algorithm [36,53] to the multiple approximation problem. S-OMP is a greedy algorithm based on the idea of selecting, at each iteration, an element of the dictionary, and of building all signal approximations as the projection of the signal matrix S on the span of the selected dictionary elements. Since, according to problem (2), one looks for dictionary elements that simultaneously provide a good approximation of all signals, Tropp et al. have suggested selecting the dictionary element that maximizes the sum of absolute correlations between the basis element and the signal residuals. This greedy selection criterion is formalized as follows:
max_i Σ_{k=1}^L |⟨s_k − P^{(t)}(s_k), U_i⟩|

where P^{(t)}(·) is the projection operator onto the span of the t selected dictionary elements. While conceptually very simple, this algorithm, whose details are given in Algorithm 7, provides theoretical guarantees about the quality of its approximation. For instance, Tropp et al. [54] have shown that if the algorithm is stopped when T dictionary elements have been selected, then, under some mild conditions on the matrix U, there exists an upper bound on ||S − UC^{(T)}||_F. This upper bound depends on ||S − UC*||_F, with C* being the solution of problem (2), and on a multiplicative constant which is a function of T, L and the coherence of U [54]. In this sense, S-OMP can be understood as an algorithm that approximately solves problem (4) with p = 0.

The connection between the iterative reweighted ℓ1 approach and the algorithm proposed by Huang et al. comes from the way problem (28) is solved. Indeed, Huang et al. have considered a block-coordinate approach, which consists in minimizing the problem with respect to z and then, after having plugged the optimal z into Eq. (28), minimizing the problem with respect to c̃. For the first step, the optimal z is derived by minimizing the convex problem

min_z Σ_i ||c̃_{g_i}||₁ / z_i^{1/p−1} + τ Σ_i z_i
s.t. z_i ≥ 0 ∀i   (29)

Using Lagrangian theory, simple algebra yields a closed-form solution for the optimal z:

z_i* = ((1−p)/(pτ))^p ||c̃_{g_i}||₁^p
Plugging these z_i's into Eq. (28), with τ chosen such that λ = ((1−p)/(pτ))^{p−1}, proves the equivalence of problems (27) and (28). The relation with iterative reweighted ℓ1 is made clear in the second step of the block-coordinate descent.

Regarding the complexity of S-OMP, we note that at each iteration the most demanding parts are the computation of the correlations E and the linear system to be solved for obtaining C^{(t)}. These steps respectively cost about O(MNL) and O(|O|³). Since |O| is typically smaller than M, this is a very efficient algorithm.
Indeed, in problem (28), since we only optimize with respect to c̃ with fixed z_i, this step is equivalent to

min_c̃ ½ ||s̃ − Ũc̃||² + Σ_i ||c̃_{g_i}||₁ / z_i^{1/p−1}   (30)

which is clearly a weighted ℓ1 problem. The full iterative reweighted algorithm is detailed in Algorithm 6. Note that, although the genuine work on group bridge Lasso only addresses the case q = 1 (because of the theoretical developments considered in that paper), the algorithm proposed by Huang et al. can be applied to any choice of q, provided one is able to solve the related ℓ1−ℓq problem (for instance using the algorithms described in Section 3).

Algorithm 7. S-OMP algorithm. T is the number of dictionary elements to select.

1: Initialize: t ← 0, R^{(t)} ← S, O ← ∅, Loop ← 1
2: while Loop do
3:   E ← U^t R^{(t)}
4:   i ← argmax_j Σ_k |e_{j,k}|
5:   O ← O ∪ i
6:   t ← t + 1
7:   C^{(t)} ← (U_O^t U_O)^{−1} U_O^t S
8:   R^{(t)} ← S − U C^{(t)}
9:   if t = T then
10:     Loop = 0
11:   end if
12: end while
4. Specific case algorithms

In this section, we survey some algorithms that provide solutions of the SSA problem (4) when p = 0. These algorithms are different in nature and provide different qualities of approximation. The first two algorithms we detail are greedy algorithms which have the flavor of Matching Pursuit [32]. The third one is based on sparse Bayesian learning and is related to the iterative reweighted ℓ1 algorithm [60].

4.2. M-CosAmp

Very recently, Needell et al. have developed a novel greedy signal reconstruction algorithm known as CosAmp [34]. This algorithm uses ideas from Orthogonal Matching Pursuit but also takes advantage of ideas from the compressive sensing literature, and it provides stronger theoretical guarantees on the quality of the signal approximation. The main particularities of CosAmp are the following. First, unlike Orthogonal Matching Pursuit, CosAmp selects many elements of the dictionary at each iteration, namely those having the largest correlations with the signal residual; these novel dictionary elements are then incorporated into the set of current dictionary elements. The other specificity of CosAmp is a pruning step on the dictionary elements after signal estimation; this step is justified by the fact that the signal approximation is supposed to be sparse.

The CosAmp algorithm has been extended to handle signals with specific structures such as block-sparse signals [1].
An extension to jointly sparse approximation has also been provided by Duarte et al. [11]. Such an extension of the CosAmp algorithm to simultaneous sparse approximation is detailed in Algorithm 8. Note that lines 4–5 of the algorithm estimate the dictionary elements which have the largest correlations over all the signals, while lines 8–9 prune the set of current dictionary elements so as to keep the matrix C row-sparse. This algorithm is denoted as M-CosAmp in the experimental section. Its complexity is similar to that of S-OMP; however, computing the estimate Ĉ involves in the worst case 3T atoms, hence a cost of about O(9T³), while computing the correlations E is slightly cheaper than for S-OMP since the matrix C^{(t)} has been pruned.

Algorithm 8. M-CosAmp algorithm. T is the number of dictionary elements to select.

1: Initialize C^{(0)} = 0, Loop ← 1, R^{(t)} ← S, O_res ← ∅
2: while Loop do
3:   E ← U^t R^{(t)}
4:   e_i = ||E_{i,·}||²₂, i = 1, …, M
5:   O_res ← supp(e, 2T)
6:   O ← O ∪ O_res
7:   Ĉ ← (U_O^t U_O)^{−1} U_O^t S
8:   v_i = ||ĉ_{i,·}||²₂, i = 1, …, M
9:   O ← supp(v, T)
10:  t ← t + 1
11:  C^{(t)} = 0
12:  C_O^{(t)} = Ĉ_O
13:  R^{(t)} = S − U C^{(t)}
14:  if stopping condition is satisfied then
15:    Loop = 0
16:  end if
17: end while

4.3. Sparse Bayesian learning and reweighted algorithm

The approach proposed by Wipf et al. [60], denoted in the sequel as Multiple Sparse Bayesian Learning (M-SBL), solves the sparse simultaneous approximation in a way somewhat related to the optimization problem in Eq. (4), but from a very different perspective. Indeed, if we consider that the approaches described in Section 3 are equivalent to Maximum a Posteriori estimation procedures, then Wipf et al. have explored a Bayesian model whose prior encourages sparsity. In this sense, their approach is related to the relevance vector machine of Tipping et al. [51]. Algorithmically, they proposed an empirical Bayesian learning approach based on automatic relevance determination (ARD). The ARD prior they introduced over each row is

p(c_{i,·}; d_i) = N(0, d_i I)   ∀i

where d is a vector of non-negative hyperparameters that govern the prior variance of each coefficient matrix row. Hence, these hyperparameters aim at capturing the sparsity profile of the approximation. Mathematically, the resulting optimization problem is to minimize with respect to d the cost function

L log|Σ_t| + Σ_{j=1}^L s_j^t Σ_t^{−1} s_j   (31)

where Σ_t = σ²I + U D U^t, D = diag(d), and σ² is a parameter of the algorithm related to the noise level present in the signals to be approximated. The algorithm is then based on a likelihood maximization performed through an Expectation-Minimization approach. Very recently, a very efficient algorithm for solving this problem has been proposed [27]. However, the main drawback of this latter approach is that, due to its greedy nature, and like any EM algorithm whose objective function is not convex, the algorithm can easily get stuck in local minima.

Recently, Wipf et al. [57] have proposed some new insights on automatic relevance determination and sparse Bayesian learning. They have shown that, for the vector regression case, ARD can be achieved by means of iterative reweighted ℓ1 minimization. Furthermore, in that paper, they have sketched an extension of these results to matrix regression, in which ARD is used for automatically selecting the most relevant covariance components in a dictionary of covariance matrices. Such an extension is more related to learning with multiple kernels in regression, as introduced by Girolami et al. [21] or Rakotomamonjy et al. [38], although some connections with simultaneous sparse approximation can be made. Here, we build on the work of Wipf et al. [57] and give all the details about how M-SBL and iterative reweighted M-BP are related.

Recall that the cost function minimized by the M-SBL of Wipf et al. [60] is

L(d) = L log|Σ_t| + Σ_{j=1}^L s_j^t Σ_t^{−1} s_j   (32)

where Σ_t = σ²I + U D U^t and D = diag(d), with d being a vector of hyperparameters that govern the prior variance of each coefficient matrix row. Now, let us define g*(z) as the conjugate function of the concave log|Σ_t|. Since that log function is concave and continuous on R₊^M, according to the scaling property of conjugate functions we have [3]

L log|Σ_t| = min_{z ∈ R^M} z^t d − L g*(z/L)

Thus, the cost function L(d) in Eq. (32) can be upper-bounded by

L(d,z) ≜ z^t d − L g*(z/L) + Σ_{j=1}^L s_j^t Σ_t^{−1} s_j   (33)

Hence, when optimized over all its parameters, L(d,z)
converges to a local minimum or a saddle point of (32).
However, for any fixed d, one can optimize over z and get the tight optimal upper bound. If we denote by z* such an optimal z for a fixed d†, then, since L log|Σ_t| is differentiable, conjugate function properties give the
following closed form of z*:

z* = L ∇log|Σ_t|(d†) = diag(U^t Σ_t^{−1} U)   (34)

Similarly to what was proposed by Wipf et al., Eqs. (33) and (34) suggest an alternate optimization scheme for minimizing L(d,z). Such a scheme consists, after initialization of z to some arbitrary vector, in keeping z fixed and computing

d† = argmin_d L_z(d) ≜ z^t d + Σ_{j=1}^L s_j^t Σ_t^{−1} s_j   (35)

and then in minimizing L(d†, z) for fixed d†, which can be done analytically according to Eq. (34). This alternate scheme is iterated until convergence to some d*.
Owing to this iterative scheme proposed for solving M-SBL, we can now make clear the connection between M-SBL and an iterative reweighted M-BP, according to the following lemma, which is again an extension to the multiple-signal case of Wipf's lemma.

Lemma 2. The objective function in Eq. (35) is convex and can be equivalently minimized by computing

C* = argmin_C L_z(C) = ½ ||S − UC||²_F + σ² Σ_i z_i^{1/2} ||c_{i,·}||   (36)

and then by setting

Minimizing L_z(C) boils down to minimizing the M-BP problem with an adaptive penalty λ_i = σ² z_i^{1/2} on each row norm. This latter point makes the alternate optimization scheme based on Eqs. (34) and (35) equivalent to our iterative reweighted M-BP, for which the weights z_i would be given by Eq. (34). According to Wipf et al. [60], M-SBL can be implemented so that the cost per iteration is about O(N²M).

The impact of this relation between M-SBL and iterative reweighted M-BP is essentially methodological: its main advantage is that it turns the original M-SBL optimization problem into a series of convex optimization problems. In this sense, the iterative reweighted algorithm described here can again be viewed as an application of the MM approach for solving problem (32); indeed, we are actually iteratively minimizing a proxy function obtained by majorizing each term of Eq. (32). Owing to this MM point of view, convergence of the iterative algorithm towards a local minimum of Eq. (32) is guaranteed [26]. Convergence for the single-signal case, using other arguments, has also been shown by Wipf et al. [57]. Note that, similarly to M-FOCUSS, the original M-SBL algorithm based on the EM approach suffers from the presence of fixed points (when d_i = 0); hence, such an algorithm is not guaranteed to converge towards a local minimum of (32). This is another argument for preferring IrM-BP.

5. Numerical experiments

Some computer simulations have been carried out in