Multi-Dimensional Sparse Structured Signal Approximation Using Split Bregman Iterations

Yoann Isaac, Quentin Barthélemy, Cédric Gouy-Pailler, Jamal Atif, Michèle Sebag

To cite this version:

Yoann Isaac, Quentin Barthélemy, Cédric Gouy-Pailler, Jamal Atif, Michèle Sebag. Multi-dimensional sparse structured signal approximation using split Bregman iterations. ICASSP 2013 - 38th IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, Vancouver, Canada. pp. 3826-3830. hal-00862645

HAL Id: hal-00862645
https://hal.archives-ouvertes.fr/hal-00862645
Submitted on 30 Apr 2019

MULTI-DIMENSIONAL SPARSE STRUCTURED SIGNAL APPROXIMATION USING SPLIT BREGMAN ITERATIONS

Yoann Isaac 1,2, Quentin Barthélemy 1, Jamal Atif 2, Cédric Gouy-Pailler 1, Michèle Sebag 2

1 CEA, LIST, Data Analysis Tools Laboratory, 91191 Gif-sur-Yvette CEDEX, France
2 TAO, CNRS − INRIA − LRI, Université Paris-Sud, 91405 Orsay, France

ABSTRACT

The paper focuses on the sparse approximation of signals using overcomplete representations, such that it preserves the (prior) structure of multi-dimensional signals. The underlying optimization problem is tackled using a multi-dimensional extension of the split Bregman optimization approach. An extensive empirical evaluation shows how the proposed approach compares to the state of the art depending on the signal features.

Index Terms — Sparse approximation, Regularization, Fused-LASSO, Split Bregman, Multidimensional signals

1. INTRODUCTION

Dictionary-based representations proceed by approximating a signal via a linear combination of dictionary elements, referred to as atoms. Sparse dictionary-based representations, where each signal involves but a few atoms, have been thoroughly investigated for 1D and 2D signals for their good properties, as they enable robust transmission (compressed sensing [1]) or image in-painting [2]. The dictionary is either given, based on the domain knowledge, or learned from the signals [3].

The so-called sparse approximation algorithm aims at finding a sparse approximate representation of the considered signals using this dictionary, by minimizing a weighted sum of the approximation loss and the representation sparsity (see [4] for a survey). When available, prior knowledge about the application domain can also be used to guide the search toward "plausible" decompositions.

This paper focuses on sparse approximation enforcing a structured decomposition property, defined as follows. Let the signals be structured (e.g. being recorded in consecutive time steps); the structured decomposition property then requires that the signal structure is preserved in the dictionary-based representation (e.g. the atoms involved in the approximation of consecutive signals have "close" weights). The structured decomposition property is enforced through adding a total variation (TV) penalty to the minimization objective.

In the 1D case, the minimization of the above overall objective can be tackled using the fused-LASSO approach first introduced in [5]. In the case of multi-dimensional signals (1), however, the minimization problem presents additional difficulties. The first contribution of the paper is to show how this problem can be handled efficiently, by extending the (1D) split Bregman fused-LASSO solver presented in [6] to the multi-dimensional case. The second contribution is a comprehensive experimental study, comparing state-of-the-art algorithms to the presented approach, referred to as Multi-SSSA, and establishing their relative performance depending on diverse features of the structured signals.

Section 2 introduces the formal background. The proposed optimization approach is described in Section 3. Section 4 presents our experimental setting and reports on the results. The presented approach is discussed w.r.t. related work in Section 5, and the paper concludes with some perspectives for further research.

(1) Our motivating application considers electro-encephalogram (EEG) signals, where the number of sensors ranges up to a few hundreds.

2. PROBLEM STATEMENT

Let Y = [y_1, ..., y_T] \in R^{C \times T} be a matrix made of T C-dimensional signals, and \Phi \in R^{C \times N} an overcomplete dictionary of N atoms (N > C). We consider the linear model

    y_t = \Phi x_t + e_t,  t \in \{1, ..., T\},   (1)

in which X = [x_1, ..., x_T] \in R^{N \times T} stands for the decomposition matrix and E = [e_1, ..., e_T] \in R^{C \times T} is a zero-mean Gaussian noise matrix.

The sparse structured decomposition problem consists of approximating the y_t, t \in \{1, ..., T\}, by decomposing them on the dictionary \Phi, such that the structure of the decompositions x_t reflects that of the signals y_t. This goal is formalized as the minimization (2) of the objective function

    \min_X \|Y - \Phi X\|_2^2 + \lambda_1 \|X\|_1 + \lambda_2 \|X P\|_1,   (2)

where \lambda_1 and \lambda_2 are regularization coefficients and P encodes the signal structure (provided by the prior knowledge) as in [7]. In the remainder of the paper, the considered structure is that of the temporal ordering of the signals, i.e. \|X P\|_1 = \sum_{t=2}^{T} \|x_t - x_{t-1}\|_1.

(2) \|A\|_p = (\sum_i \sum_j |A_{i,j}|^p)^{1/p}. The case p = 2 corresponds to the classical Frobenius norm.
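For concreteness, the following minimal NumPy sketch builds the temporal first-order difference matrix P just described and evaluates the objective of Eq. (2). It is our illustration under the definitions above, not the authors' code; the function names are ours.

```python
import numpy as np

def temporal_difference_matrix(T):
    """P in R^{T x (T-1)} such that (X @ P)[:, t-1] = x_t - x_{t-1},
    making ||X P||_1 the total-variation term of Eq. (2)."""
    P = np.zeros((T, T - 1))
    idx = np.arange(T - 1)
    P[idx, idx] = -1.0       # contributes -x_{t-1}
    P[idx + 1, idx] = 1.0    # contributes +x_t
    return P

def objective(Y, Phi, X, P, lam1, lam2):
    """Objective of Eq. (2): data fit + sparsity + temporal TV penalty."""
    return (np.sum((Y - Phi @ X) ** 2)
            + lam1 * np.abs(X).sum()
            + lam2 * np.abs(X @ P).sum())
```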
3. OPTIMIZATION STRATEGY

3.1. Algorithm description

Bregman iterations have been shown to be very efficient for \ell_1-regularized problems [8]. For convex problems with linear constraints, the split Bregman iteration technique is equivalent to the method of multipliers and to the augmented Lagrangian method [9]. The iteration scheme presented in [6] considers an augmented Lagrangian formalism. We have chosen here to present ours with the initial split Bregman formulation.

First, let us restate the sparse approximation problem:

    \min_{X,A,B} \|Y - \Phi X\|_2^2 + \lambda_1 \|A\|_1 + \lambda_2 \|B\|_1
    s.t.  A = X,  B = X P.   (3)

This reformulation is a key step of the split Bregman method. It decouples the three terms and allows them to be optimized separately within the Bregman iterations. To set up this iteration scheme, Eq. (3) must be transformed into an unconstrained problem:

    \min_{X,A,B} \|Y - \Phi X\|_2^2 + \lambda_1 \|A\|_1 + \lambda_2 \|B\|_1 + \frac{\mu_1}{2} \|X - A\|_2^2 + \frac{\mu_2}{2} \|X P - B\|_2^2.   (4)

The split Bregman iterations can then be expressed as [8]:

    (X^{i+1}, A^{i+1}, B^{i+1}) = \arg\min_{X,A,B} \|Y - \Phi X\|_2^2 + \lambda_1 \|A\|_1 + \lambda_2 \|B\|_1 + \frac{\mu_1}{2} \|X - A + D_A^i\|_2^2 + \frac{\mu_2}{2} \|X P - B + D_B^i\|_2^2   (5)

    D_A^{i+1} = D_A^i + (X^{i+1} - A^{i+1})   (6)
    D_B^{i+1} = D_B^i + (X^{i+1} P - B^{i+1})   (7)

Thanks to the splitting of the three terms realized above, the minimization of Eq. (5) can be carried out iteratively by alternately updating the variables in the system:

    X^{i+1} = \arg\min_X \|Y - \Phi X\|_2^2 + \frac{\mu_1}{2} \|X - A^i + D_A^i\|_2^2 + \frac{\mu_2}{2} \|X P - B^i + D_B^i\|_2^2   (8)
    A^{i+1} = \arg\min_A \lambda_1 \|A\|_1 + \frac{\mu_1}{2} \|X^{i+1} - A + D_A^i\|_2^2   (9)
    B^{i+1} = \arg\min_B \lambda_2 \|B\|_1 + \frac{\mu_2}{2} \|X^{i+1} P - B + D_B^i\|_2^2   (10)

Only a few iterations of this system are necessary for convergence. In our implementation, this update is performed only once at each iteration of the global optimization algorithm.

Eq. (9) and Eq. (10) can be solved with the soft-thresholding operator:

    A^{i+1} = SoftThreshold_{\lambda_1/\mu_1}(X^{i+1} + D_A^i)   (11)
    B^{i+1} = SoftThreshold_{\lambda_2/\mu_2}(X^{i+1} P + D_B^i)   (12)
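Eqs. (11)-(12) are entrywise soft-thresholding steps, i.e. applications of the proximal operator of the \ell_1 norm. A minimal NumPy sketch follows (ours, not the authors' code; variable names are assumptions):

```python
import numpy as np

def soft_threshold(M, thr):
    """Entrywise soft-thresholding: the proximal operator of thr * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - thr, 0.0)

# Updates (11)-(12), given the current X^{i+1}, D_A^i, D_B^i:
# A = soft_threshold(X_new + D_A, lam1 / mu1)
# B = soft_threshold(X_new @ P + D_B, lam2 / mu2)
```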

Solving Eq. (8) requires the minimization of a convex differentiable function, which can be performed via classical optimization methods. We propose here to solve it deterministically. The main difficulty in extending [6] to the multi-dimensional signals case lies in this step. Let us define H from Eq. (8) such that

    X^{i+1} = \arg\min_X H(X).   (13)

Differentiating this expression with respect to X yields

    \frac{dH}{dX} = (2 \Phi^T \Phi + \mu_1 I) X + X (\mu_2 P P^T) - 2 \Phi^T Y   (14)
                    + \mu_1 (D_A^i - A^i) + \mu_2 (D_B^i - B^i) P^T,   (15)

where I is the identity matrix. The minimum \hat{X} = X^{i+1} of Eq. (8) is obtained by solving \frac{dH}{dX}(\hat{X}) = 0, which is a Sylvester equation

    W \hat{X} + \hat{X} Z = C^i,   (16)

with W = 2 \Phi^T \Phi + \mu_1 I, Z = \mu_2 P P^T and C^i = 2 \Phi^T Y - \mu_1 (D_A^i - A^i) - \mu_2 (D_B^i - B^i) P^T. Fortunately, in our case, W and Z are real symmetric matrices. Thus, they can be diagonalized as follows:

    W = F D_w F^T   (17)
    Z = G D_z G^T   (18)

and Eq. (16) can then be rewritten

    D_w \hat{X}' + \hat{X}' D_z = C^{i\prime},   (19)

with \hat{X}' = F^T \hat{X} G and C^{i\prime} = F^T C^i G. \hat{X}' is then obtained column by column:

    \forall s \in \{1, ..., S\},  \hat{X}'(:, s) = (D_w + D_z(s, s) I)^{-1} C^{i\prime}(:, s),

where the notation (:, s) indexes column s of a matrix and S = T is the number of columns of \hat{X}'. Going back to \hat{X} is performed with \hat{X} = F \hat{X}' G^T.

W and Z being independent of the iteration i considered, their diagonalization is done once and for all, as is the computation of the terms (D_w + D_z(s, s) I)^{-1}, \forall s \in \{1, ..., S\}. Thus, this update does not require heavy computation. The full algorithm is summarized below.

3.2. Algorithm summary

 1: Input: Y, \Phi, P
 2: Parameters: \lambda_1, \lambda_2, \mu_1, \mu_2, \epsilon, iterMax, kMax
 3: Init D_A^0, D_B^0 and X^0
 4: A^0 = X^0, B^0 = X^0 P, W = 2 \Phi^T \Phi + \mu_1 I, Z = \mu_2 P P^T
 5: Compute D_w, D_z, F and G
 6: i = 0
 7: while i \le iterMax and \|X^i - X^{i-1}\|_2 / \|X^i\|_2 \ge \epsilon do
 8:   k = 0
 9:   X^temp = X^i; A^temp = A^i; B^temp = B^i
10:   while k \le kMax do
11:     C = F^T (2 \Phi^T Y - \mu_1 (D_A^i - A^temp) - \mu_2 (D_B^i - B^temp) P^T) G
12:     for s = 1 \to S do
13:       X^temp(:, s) = (D_w + D_z(s, s) I)^{-1} C(:, s)
14:     end for
15:     X^temp = F X^temp G^T
16:     A^temp = SoftThreshold_{\lambda_1/\mu_1}(X^temp + D_A^i)
17:     B^temp = SoftThreshold_{\lambda_2/\mu_2}(X^temp P + D_B^i)
18:     k = k + 1
19:   end while
20:   X^{i+1} = X^temp; A^{i+1} = A^temp; B^{i+1} = B^temp
21:   D_A^{i+1} = D_A^i + (X^{i+1} - A^{i+1})
22:   D_B^{i+1} = D_B^i + (X^{i+1} P - B^{i+1})
23:   i = i + 1
24: end while
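To make the whole scheme concrete, the following NumPy sketch transcribes the pseudocode above under our reading of it. It is not the authors' released code: the function name, the default values of mu1, mu2, k_max and eps, and the stopping guard are illustrative assumptions. The key design point is that the diagonalizations of W and Z (Eqs. (17)-(18)) are computed once, so each X update reduces to cheap matrix products and an elementwise division.

```python
import numpy as np

def soft_threshold(M, thr):
    """Entrywise soft-thresholding (proximal operator of thr * ||.||_1)."""
    return np.sign(M) * np.maximum(np.abs(M) - thr, 0.0)

def multi_sssa(Y, Phi, P, lam1, lam2, mu1=1.0, mu2=1.0,
               eps=1e-5, iter_max=100, k_max=3):
    """Sketch of the Multi-SSSA split Bregman iterations of Section 3."""
    N, T = Phi.shape[1], Y.shape[1]
    X = np.zeros((N, T))
    A, B = X.copy(), X @ P
    DA, DB = np.zeros_like(A), np.zeros_like(B)

    # Precompute the diagonalizations of Eqs. (17)-(18); W and Z are symmetric.
    W = 2.0 * Phi.T @ Phi + mu1 * np.eye(N)
    Z = mu2 * P @ P.T
    dw, F = np.linalg.eigh(W)     # W = F diag(dw) F^T
    dz, G = np.linalg.eigh(Z)     # Z = G diag(dz) G^T
    PhiTY2 = 2.0 * Phi.T @ Y

    for _ in range(iter_max):
        X_old = X
        for _ in range(k_max):
            # X update: solve the Sylvester equation (16) column-wise, Eq. (19).
            C = F.T @ (PhiTY2 - mu1 * (DA - A) - mu2 * (DB - B) @ P.T) @ G
            Xp = C / (dw[:, None] + dz[None, :])  # (D_w + D_z(s,s) I)^{-1} per column
            X = F @ Xp @ G.T
            # A, B updates: soft-thresholding, Eqs. (11)-(12).
            A = soft_threshold(X + DA, lam1 / mu1)
            B = soft_threshold(X @ P + DB, lam2 / mu2)
        # Bregman variable updates, Eqs. (6)-(7).
        DA = DA + (X - A)
        DB = DB + (X @ P - B)
        if np.linalg.norm(X - X_old) < eps * max(np.linalg.norm(X), 1e-12):
            break
    return X
```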
4. EXPERIMENTAL EVALUATION

The following experiment aims at assessing the efficiency of our approach in decomposing signals built with particular regularities. We compare it both to algorithms coding each signal separately, the Orthogonal Matching Pursuit (OMP) [10] and the LARS [11] (a LASSO solver), and to methods performing the decomposition simultaneously, the Simultaneous OMP (SOMP) and FISTA [12], a proximal method solving a group-LASSO problem composed of an \ell_{1,2} penalty only.

4.1. Data generation

From a fixed random overcomplete dictionary \Phi, a set of K signals having piecewise constant structures has been created. Each signal is synthesized from the dictionary and a pre-determined decomposition matrix.

The TV penalization of the fused-LASSO regularization makes it more suitable for data having abrupt changes. Thus, the decomposition matrices of the signals have been built as linear combinations of activities:

    P_{ind,m,d}(i, j) = H(j - (m - dT/2)) - H(j - (m + dT/2))  if i = ind,
                        0                                       otherwise,   (20)

where P_{ind,m,d} \in R^{N \times T}, H is the Heaviside function, ind \in \{1, ..., N\} is the index of an atom, m is the center of the activity and d its duration. A decomposition matrix X can then be written

    X = \sum_{i=1}^{M} a_i P_{ind_i, m_i, d_i},   (21)

where M is the number of activities appearing in one signal and a_i stands for an activation weight. An example of such a signal is given in Fig. 1.

[Fig. 1. Built signal.]
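A minimal NumPy sketch of this generation process per Eqs. (20)-(21) follows; it is our reconstruction, with function names of our choosing, and we read N(0, 2) as a Gaussian with variance 2 (an assumption, since the paper does not state whether 2 is the variance or the standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)

def activity_pattern(N, T, ind, m, d):
    """P_{ind,m,d} of Eq. (20): a box of width d*T centred at m, on atom ind only."""
    pat = np.zeros((N, T))
    j = np.arange(T)
    pat[ind] = ((j >= m - d * T / 2) & (j < m + d * T / 2)).astype(float)
    return pat

def random_decomposition(N, T, M, d_min, d_max):
    """Decomposition matrix of Eq. (21): weighted sum of M random activities."""
    X = np.zeros((N, T))
    for _ in range(M):
        a = rng.normal(0.0, np.sqrt(2.0))   # activation weight, a ~ N(0, 2)
        ind = int(rng.integers(0, N))       # atom index, ind ~ U(1, N)
        m = rng.uniform(0, T)               # centre of the activity, m ~ U(0, T)
        d = rng.uniform(d_min, d_max)       # duration, as a fraction of T
        X += a * activity_pattern(N, T, ind, m, d)
    return X

# A synthetic signal matrix then follows the linear model of Eq. (1):
# Y = Phi @ random_decomposition(N, T, M, d_min, d_max) + noise
```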

4.2. Experimental setting

Each method has been applied to the previously created signals. Then the distance between the obtained decomposition matrices and the true ones has been computed as follows:

    dist(X, \hat{X}) = \|X - \hat{X}\|_F / \|X\|_F.   (22)

The goal was to understand the influence of the number of activities (M) and of the range of durations (d) on the efficiency of the fused-LASSO regularization compared to other sparse coding algorithms. The scheme of experiment described above has been carried out with the following grid of parameters:

• M \in \{20, 30, ..., 110\}
• d \sim U(d_min, d_max), with (d_min, d_max) \in \{(0.1, 0.15), (0.2, 0.25), ..., (1, 1)\}

For each point in the above parameter grid, two sets of signals have been created: a train set allowing to determine the best regularization coefficients for each method, and a test set designed to evaluate the methods with these coefficients.

Other parameters have been chosen as follows:

    Model:      C = 20, T = 300, N = 40, K = 100
    Activities: m \sim U(0, T), a \sim N(0, 2), ind \sim U(1, N)

Dictionaries have been randomly generated using independent Gaussian distributions on individual elements.

[Fig. 2. Mean distances on the grid of parameters. On the left: fused-LASSO; in the middle: fused-LASSO vs LARS; on the right: fused-LASSO vs group-LASSO solver. The white mask corresponds to non-significant values.]
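The evaluation protocol of Eq. (22) and the significance mask of Fig. 2 can be sketched as follows (our illustration; the array names are hypothetical):

```python
import numpy as np
from scipy.stats import ttest_rel

def dist(X, X_hat):
    """Normalised error of Eq. (22); np.linalg.norm on a matrix is Frobenius."""
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# For one grid point, given per-signal distances of two methods over the test set,
# the white mask of Fig. 2 corresponds to a non-significant paired t-test:
# _, p = ttest_rel(dists_ours, dists_other)
# significant = p < 0.05
```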

4.3. Results and discussion

In order to evaluate the proposed algorithm, for each point (i, j) in the above grid of parameters, the mean of the previously defined distance has been computed for each method and compared to the mean obtained by our algorithm. A paired t-test (p < 0.05) has then been performed to check the significance of these results.

Results are displayed in Fig. 2. On the ordinate axis, the number of patterns increases from top to bottom; on the abscissa axis, the duration grows from left to right. The left image displays the mean distances obtained with our algorithm. The middle and right ones present its performance compared to the other methods by displaying the point-to-point difference of mean distances in grayscale. This difference is computed such that negative values (darker blocks) mean that our method outperforms the other one. The white mask corresponds to zones where the difference of mean distances is not significant and the methods have similar performances. Results of the OMP and the LARS are very similar, as are those of the SOMP and the group-LASSO solver. Thus, we only display here the matrices comparing our method to the LARS and to the group-LASSO solver.

Compared to the OMP and the LARS, our method obtains the same results when only a few atoms are active at the same time. This happens in our artificial signals when only a few patterns have been added to create the decomposition matrices and/or when the pattern durations are small. On the contrary, when many atoms are active simultaneously, the OMP and the LARS are outperformed by our algorithm, which uses inter-signal prior information to find better decompositions. Compared to the SOMP and the group-LASSO solver, results depend more on the duration of the patterns. When patterns are long and not too numerous, their performances are similar to the fused-LASSO one. The SOMP is outperformed in all other cases. On the contrary, the group-LASSO solver is outperformed only when patterns are short.

5. RELATION TO PRIOR WORKS

The simultaneous sparse approximation of multi-dimensional signals has been widely studied in recent years [13] and numerous methods have been developed [14, 15, 16, 17, 4]. More recently, the concept of structured sparsity has considered the encoding of priors in complex regularizations [18, 19]. Our problem belongs to this last category, with a regularization combining a classical sparsity term and a total variation one. This second term has been studied intensively for image denoising, as in the ROF model [20, 21].

The combination of these terms has been introduced as the fused-LASSO [5]. Despite its convexity, the two non-differentiable \ell_1 terms make it difficult to solve. The initial paper [5] transforms it into a quadratic problem and uses standard optimization tools (SQOPT). As this increases the number of variables, this approach cannot deal with large-scale problems. A path algorithm has been developed, but it is limited to the particular case of the fused-LASSO signal approximator [22]. More recently, scalable approaches based on proximal subgradient methods [23], ADMM [24] and split Bregman iterations [6] have been proposed for the general fused-LASSO. To the best of our knowledge, the multi-dimensional fused-LASSO in the context of overcomplete representations has never been studied. One attempt at a multi-dimensional fused-LASSO can be found in an arXiv paper [7] for a regression task, but the published journal version no longer contains any mention of the multi-dimensional fused-LASSO.

6. CONCLUSION AND PERSPECTIVES

This paper has shown the efficiency of the proposed Multi-SSSA, based on a split Bregman approach, in achieving the sparse structured approximation of multi-dimensional signals under general conditions.
Specifically, the extensive validation has considered different regimes in terms of signal complexity and dynamicity (number of patterns simultaneously involved and average duration thereof), and it has established a relative competence map of the proposed Multi-SSSA approach comparatively to the state of the art. Further work will apply the approach to the motivating application domain, namely the representation of EEG signals.

7. REFERENCES

[1] D.L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.

[2] J. Mairal, M. Elad, and G. Sapiro, "Sparse representation for color image restoration," IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 53–69, 2008.

[3] I. Tošić and P. Frossard, "Dictionary learning: What is the right representation for my signal?," IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 27–38, 2011.

[4] A. Rakotomamonjy, "Surveying and comparing simultaneous sparse approximation (or group-lasso) algorithms," Signal Processing, vol. 91, no. 7, pp. 1505–1526, 2011.

[5] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, "Sparsity and smoothness via the fused lasso," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 1, pp. 91–108, 2005.

[6] G.B. Ye and X. Xie, "Split Bregman method for large scale fused lasso," Computational Statistics & Data Analysis, vol. 55, no. 4, pp. 1552–1569, 2011.

[7] X. Chen, S. Kim, Q. Lin, J.G. Carbonell, and E.P. Xing, "Graph-structured multi-task regression and an efficient optimization method for general fused lasso," arXiv preprint arXiv:1005.3579, 2010.

[8] T. Goldstein and S. Osher, "The split Bregman method for \ell_1 regularized problems," SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 323–343, 2009.

[9] C. Wu and X.C. Tai, "Augmented Lagrangian method, dual methods, and split Bregman iteration for ROF, vectorial TV, and high order models," SIAM Journal on Imaging Sciences, vol. 3, no. 3, pp. 300–339, 2010.

[10] Y.C. Pati, R. Rezaiifar, and P.S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers. IEEE, 1993, pp. 40–44.

[11] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.

[12] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.

[13] J. Chen and X. Huo, "Theoretical results on sparse representations of multiple-measurement vectors," IEEE Transactions on Signal Processing, vol. 54, no. 12, pp. 4634–4643, 2006.

[14] J.A. Tropp, A.C. Gilbert, and M.J. Strauss, "Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit," Signal Processing, vol. 86, no. 3, pp. 572–588, 2006.

[15] J.A. Tropp, "Algorithms for simultaneous sparse approximation. Part II: Convex relaxation," Signal Processing, vol. 86, no. 3, pp. 589–602, 2006.

[16] R. Gribonval, H. Rauhut, K. Schnass, and P. Vandergheynst, "Atoms of all channels, unite! Average case analysis of multi-channel sparse recovery using greedy algorithms," Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 655–687, 2008.

[17] S.F. Cotter, B.D. Rao, K. Engan, and K. Kreutz-Delgado, "Sparse solutions to linear inverse problems with multiple measurement vectors," IEEE Transactions on Signal Processing, vol. 53, no. 7, pp. 2477–2488, 2005.

[18] J. Huang, T. Zhang, and D. Metaxas, "Learning with structured sparsity," Journal of Machine Learning Research, vol. 12, pp. 3371–3412, 2011.

[19] R. Jenatton, J.Y. Audibert, and F. Bach, "Structured variable selection with sparsity-inducing norms," Journal of Machine Learning Research, vol. 12, pp. 2777–2824, 2011.

[20] L.I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D: Nonlinear Phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.

[21] J. Darbon and M. Sigelle, "A fast and exact algorithm for total variation minimization," Pattern Recognition and Image Analysis, pp. 717–765, 2005.

[22] H. Hoefling, "A path algorithm for the fused lasso signal approximator," Journal of Computational and Graphical Statistics, vol. 19, no. 4, pp. 984–1006, 2010.

[23] J. Liu, L. Yuan, and J. Ye, "An efficient algorithm for a class of fused lasso problems," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 323–332.

[24] B. Wahlberg, S. Boyd, M. Annergren, and Y. Wang, "An ADMM algorithm for a class of total variation regularized estimation problems," arXiv preprint arXiv:1203.1828, 2012.