
arXiv:1812.01115v2 [cs.LG] 28 Jun 2019

On learning with shift-invariant structures

Cristian Rusu

Abstract—In this paper, we describe new results and algorithms, based on circulant matrices, for the task of learning shift-invariant components from training data. We deal with the dictionary learning problem for which we consider structures built from circulant and convolutional matrices (including unions of such matrices), define the optimization problems that formulate the learning goals and describe numerically efficient ways to solve them. We also show how to connect these results with various previous works from the literature and how to learn wavelet-like dictionaries from training data. Based on our findings, we show the effectiveness of the proposed algorithms using synthetic data as well as real ECG signals and images.

The author is with the Istituto Italiano di Tecnologia (IIT), Genova, Italy. Contact e-mail address: [email protected]. Demo source code: https://github.com/cristian-rusu-research/shift-invariance.

I. INTRODUCTION

Circulant matrices [1] are highly structured matrices where each column is a circular shift of the previous one. Because of their structure and their connection to the fast Fourier transform [2] (circulant matrices are diagonalized by the Fourier matrix [1, Section 3]), these matrices have seen many applications in the past: computing the shift between two signals (the shift retrieval problem exemplified in the circular convolution and cross-correlation theorems) for the GPS locking problem [3], time delay estimation [4], compressive shift retrieval from Fourier components [5], image matching or alignment problems [6], designing numerically efficient linear transformations [7] and overcomplete dictionaries or matrix decompositions [8], [9], convolutional dictionary learning and sparse coding [10], [11], [12], [13], structured dictionary learning from data [14], [15], sparse coding [17] for medical imaging [16] and audio [18], and shift-invariant methods for signal analysis. A recent review of solutions and applications related to convolutional sparse representations is given in [19].

Previously, several dictionary learning techniques that accommodate shift invariance have been proposed: extending the well-known K-SVD algorithm to deal with shift-invariant structures [17], [20], [21], proposing an iterative least squares dictionary learning algorithm for shift-invariant structures [22], an online learning approach that extracts translation invariant atoms while solving an eigenvalue problem [23], efficient shift-invariant dictionary learning [24], research that combines shift and 2D rotation invariance [25], and new algorithms that directly optimize the sparse dictionary learning objective functions with circulant matrices [14], [15], [26]. The convolutional sparse representation model [27], [28], [29], where the dictionary is a concatenation of circulant matrices, has been extensively studied in the past. Furthermore, recent work [13] uses tools developed in the sparse representations literature to provide theoretical insights into the convolutional sparse model, where the dictionary is a concatenation of banded circulant matrices, and into its connection to convolutional neural networks [30]. Detailed literature reviews of these convolutional sparse representations and learning problems, together with the proposed solutions, have been recently described in [19, Section II].

In this paper, we propose several numerically efficient algorithms to extract shift-invariant or aligned components from data using several structured dictionaries related to circulant matrices.

Structured dictionaries have received a lot of attention mainly because of two reasons: the structure means that the dictionaries will be easier to store (lower memory footprint) and use (lower computational complexity when, for example, performing matrix-vector multiplications or solving linear systems), or the structure acts as a regularizer that encodes a data modeling property of interest. In the case of circulant matrices these two advantages are transparent: manipulating them is done via the fast Fourier transform, storing them takes only linear (instead of quadratic) space and, in the case of shift-invariant dictionaries, they are able to model repetitive patterns from real-world data, as we expect the data (especially time series and image data) to exhibit such patterns.

We start by outlining in Section II circulant matrices and their properties, particularly their factorization with the Fourier matrix, and the other structured matrices (circulant, convolutional and unions of these) that we use in this paper. Then, we propose algorithms that learn shift-invariant and wavelet-like components from training data (Section III). Finally, we show experimental results with various data sources (synthetic, ECG, images) that highlight the learning capabilities of the proposed methods (Section IV).

Notation: bold lowercase x ∈ R^n is used to denote a column vector, bold uppercase X ∈ R^{n×m} is used to denote a matrix, non-bold lowercase Greek letters like α are used to denote scalar values, calligraphic letters like K are used to denote sets and |K| is the cardinality of K (abusing this notation, |α| is the magnitude of a scalar). Then X^T is the matrix transpose, X^{-1} denotes the inverse of a square matrix, X^H is the complex conjugate transpose, ||x||_0 is the ℓ0 pseudo-norm, ||x||_2 is the ℓ2 norm, ||X||_F^2 = tr(X^H X) is the Frobenius norm, tr(X) denotes the trace, vec(X) ∈ R^{nm} vectorizes the matrix X columnwise, diag(x) denotes the diagonal matrix with x on its diagonal, X_{kj} is the (k, j)th entry of X, α* ∈ C is the complex conjugate, X ⊗ Y represents the Kronecker product and X ⊙ Y is the Khatri-Rao product [31]. Tilde variables like X̃ denote the columnwise Fourier transform of X.

II. THE PROPOSED STRUCTURED DICTIONARIES

A. Circulant dictionaries

We consider in this paper circulant matrices C ∈ R^{n×n}. These square matrices are completely defined by their first column c ∈ R^n since every other column is a circular shift of the first one. With a downward shift direction, right circulant matrices are:

                [ c_1      c_n      ...   c_3   c_2 ]
                [ c_2      c_1      ...   c_4   c_3 ]
C = circ(c) =   [  ⋮        ⋮              ⋮     ⋮   ]
                [ c_{n−1}  c_{n−2}  ...   c_1   c_n ]
                [ c_n      c_{n−1}  ...   c_2   c_1 ]
            = [ c   Pc   P^2 c   ...   P^{n−1} c ] ∈ R^{n×n}.   (1)

The matrix P ∈ R^{n×n} denotes the orthonormal circulant matrix that circularly shifts a target vector c by one position, i.e., P = circ(e_2), where e_2 is the second vector of the standard basis of R^n. Notice that P^{q−1} = circ(e_q) is also orthonormal circulant and denotes a cyclic shift by q − 1. The main property of circulant matrices (1) is their eigenvalue factorization, which reads:

C = F^H Σ F,   Σ = diag(σ) ∈ C^{n×n},   (2)

where F ∈ C^{n×n} is the unitary Fourier matrix (F^H F = F F^H = I) and the diagonal is σ = √n F c, σ ∈ C^n.

Note that the multiplication with F on the right is equivalent to the application of the fast Fourier transform, i.e., Fc = FFT(c), while the multiplication with F^H is equivalent to the inverse Fourier transform, i.e., F^H c = IFFT(c). Both transforms are applied in O(n log n) time and memory.
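To make (1) and (2) concrete, the short numpy sketch below (not from the paper; the helper name circ is illustrative) builds a right circulant matrix explicitly and checks that products with it, and its eigenvalue factorization, are obtained with FFTs in O(n log n) operations.

```python
import numpy as np

def circ(c):
    """Right circulant matrix with first column c, as in (1): column j is c shifted down j times."""
    return np.column_stack([np.roll(c, j) for j in range(len(c))])

n = 8
rng = np.random.default_rng(0)
c, x = rng.standard_normal(n), rng.standard_normal(n)
C = circ(c)

# Multiplying by C is a circular convolution, so it costs O(n log n) with FFTs.
assert np.allclose(C @ x, np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real)

# Eigenvalue factorization (2): C = F^H diag(sigma) F with sigma = sqrt(n) F c = FFT(c).
F = np.fft.fft(np.eye(n)) / np.sqrt(n)        # unitary Fourier matrix
sigma = np.fft.fft(c)
assert np.allclose(C, F.conj().T @ np.diag(sigma) @ F)
```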
B. Convolutional dictionaries

Convolutional dictionaries can be reduced to circulant dictionaries by observing that, given c ∈ R^n and x ∈ R^m with m ≥ n, the result of their linear convolution is a vector y of size p = n + m − 1:

y = c ∗ x = toep(c) x = circ([c; 0_{(m−1)×1}]) [x; 0_{(n−1)×1}] = C_conv x_conv,   (3)

where toep(c) is a Toeplitz matrix of size p × m whose columns hold copies of c, each shifted down by one position:

toep(c) = [ c_1    0      ...   0     0   ;
            c_2   c_1     ...   0     0   ;
             ⋮     ⋮             ⋮     ⋮   ;
            c_n  c_{n−1}  ...  c_1    0   ;
             0    c_n     ...  c_2   c_1  ;
             0     0      ...  c_3   c_2  ;
             ⋮     ⋮             ⋮     ⋮   ;
             0     0      ...   0    c_n  ].

Padding with zeros such that all variables are of size p leads again to a matrix-vector multiplication by a circulant matrix. An alternative, but ultimately equivalent, way to write the convolution in terms of a circulant matrix is to notice that a Toeplitz matrix can be embedded into an extended circulant matrix of twice the size (see [15, Section 4]).

For our purposes, there is a fundamental difference between C and C_conv: in the case of C it is exactly equivalent if we choose to operate with c or σ, while for C_conv we necessarily have to work with c_conv in order to impose its sparse structure. As we will see, this means that in the convolutional case we cannot exploit some simplifications that occur in the Fourier domain (when working directly with σ).

C. Wavelet-like dictionaries

Multiples, powers, products, and sums of circulant matrices are themselves circulant. Therefore, extending this class of structures to include other dictionaries is not straightforward. In order to represent a richer class of dictionaries, still based on circulants, consider now the following structured matrix

C_k^{(p)} = [ G_k S   H_k S ] ∈ R^{p×p},   (4)

where G_k = circ(g_k) and H_k = circ(h_k) are both p × p circulant matrices and S ∈ R^{p×p/2} is a selection matrix that keeps only every other column, i.e., G_k S = [ g_k   P^2 g_k   P^4 g_k   ...   P^{p−2} g_k ] ∈ R^{p×p/2} (downsampling the columns by a factor of 2). In general we assume that the filters g_k and h_k have compact support with length denoted n ≤ p. As such, these transformations are related to the convolutional dictionaries previously described. We call g_k and h_k filters because (3) is equivalent to a filtering operation of a signal x where the filter coefficients are stored in the circulant matrix.

Now we define a new transformation that operates only on the first p/2^{k−1} coordinates and keeps the others unchanged:

W_k^{(p)} = [ C_k^{(p/2^{k−1})}                    0_{p/2^{k−1} × (p − p/2^{k−1})} ]
            [ 0_{(p − p/2^{k−1}) × p/2^{k−1}}      I_{p − p/2^{k−1}}               ]  ∈ R^{p×p}.   (5)

Finally, we define a wavelet-like synthesis transformation that is a cascade of the fundamental stages (5) as

W = W_1^{(p)} ··· W_{m−1}^{(p)} W_m^{(p)}.   (6)

We call this transformation wavelet-like because Wx applies the filters to parts of the signal x at different scales (for example, see [32] for a description of discrete wavelet transformations from the perspective of matrix factorizations).

III. THE PROPOSED DICTIONARY LEARNING ALGORITHMS

Dictionary learning [33] provides heuristics that approximate solutions to the following problem: given a dataset Y ∈ R^{n×N}, a sparsity level s and the size of the dictionary S, we want to create a dictionary D ∈ R^{n×S} and the sparse representations X ∈ R^{S×N} that solve

minimize_{D, X}   ||Y − DX||_F^2   subject to   X is s-sparse.   (7)

A classic approach to this problem, which we also use, is the iterative alternating optimization algorithm: keep the dictionary fixed and update the representations, and vice-versa, in a loop until convergence.

In this paper, we constrain the dictionary D to the structures previously discussed: circulant and convolutional (including unions in both cases) and wavelet-like. Our goal is to propose numerically efficient dictionary update rules for these structures. While for the general dictionary learning problem there are several online algorithms that have been proposed [34], [35], [36] which are computationally efficient, in this paper we consider only batch dictionary learning and we focus on the computational complexity of the dictionary update step.
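As a minimal illustration of the alternating scheme just described, the following self-contained Python sketch runs the batch loop for (7) with a naive OMP written out for readability; the dictionary update is left as a callback so that the structured updates derived next can be plugged in. All names are illustrative and the OMP here is only a simplified stand-in for the efficient implementations cited later in the paper.

```python
import numpy as np

def omp(D, y, s):
    """Naive orthogonal matching pursuit: greedy s-sparse approximation of y in D."""
    support, residual = [], y.copy()
    x = np.zeros(D.shape[1])
    for _ in range(s):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def sparse_code(D, Y, s):
    """Column-by-column sparse coding step of the alternating loop."""
    return np.column_stack([omp(D, Y[:, i], s) for i in range(Y.shape[1])])

def batch_dictionary_learning(Y, D0, s, iterations, update_dictionary):
    """Alternating optimization for (7): representations with D fixed, then a dictionary update."""
    D = D0.copy()
    for _ in range(iterations):
        X = sparse_code(D, Y, s)
        D = update_dictionary(Y, X)   # e.g., the circulant or convolutional updates of Section III
    return D, X
```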

A. Circulant dictionary learning Algorithm 1 – UCirc–DLA–SU. Rn N n N Input: The dataset Y × , the number of circulant atoms Given a dataset Y R × and a sparsity level s 1 for ∈ ∈ Rn N ≥ L and the sparsity s n. the representations X × , the work in [15] introduces ≤ Rn nL ∈ Rn n Output: The union of L circulant dictionaries D × an efficient way of learning a circulant dictionary C × nL∈ N ∈ as in (13) and the sparse representations X R × such by approximately solving the optimization problem: 2 ∈ that Y DX F is reduced. 2 k − k minimize Y CX F 1. Initialization: compute the singular value decomposition c, X k − k (8) T (ℓ) (ℓ) of the dataset Y = UΣV , set c = uℓ for ℓ n, set c subject to vec(X) 0 sN, C = circ(c), c 2 =1. ≤ k k ≤ k k to random ℓ2 normalized vectors of size n for L ℓ>n For fixed X, to update c we develop the objective function to and compute the representations X = OMP(D, Y,s≥). 2. Compute Fourier transform Y˜ , set its first row to zero. Y CX 2 = FY ΣFX 2 = Y˜ ΣX˜ 2 , (9) k − kF k − kF k − kF 3. For 1,...,K : and in order to minimize it we set Update dictionary: • ˜xH ˜y ˜xH ˜y – Compute all the L Fourier transforms X˜ (ℓ). σ 1 1 , σ k k , σ σ , k , . . . , n, 1 = 2 k = 2 n k+2 = k∗ =2 – Construct each optimal σ(ℓ) L from (15) ˜x1 2 ˜xk 2 − ℓ=1 k k k k (10) (ℓ) L { } separately: σ1 ℓ=1 = 0 and compute where ˜yT and ˜xT are the rows of Y˜ = FY and X˜ = FX. (ℓ) L { (ℓ) } L (ℓ) L k k σk ℓ=1, σn k+2 ℓ=1 = (σk )∗ ℓ=1, k = Rn N Rn N { } { − } { } Remark 1. Given Y × and X × the best 2,..., n +1, by (10) and then normalize the L ∈ ∈ 2 (ℓ) (ℓ) 1 (ℓ) circulant dictionary C in terms of the Frobenius norm achieves circulants σ σ − σ .   ←k k2 n H 2 Update sparse representations X = OMP(D, Y,s). 2 2 ˜xk ˜yk • minimum Y CX = yk | | . (11) c F 2 2 k − k k k − ˜xk 2 Xk=1  k k  Proof. Expand the objective of (8) using the optimal (10) as iteration). The calculations in (10) take approximately 2nN 2 2 2 H operations: there are n components in σ to be computed while Y CX F = Y F + CX F 2tr(CXY ) 2 k − k k k k k − ˜x 2 and ˜xH ˜y take 2N operations each. = Y 2 + FH ΣFX 2 2tr(FH ΣFXYH ) k kk2 k k kF k kF − We can limit which and how many of all the possible n 2 2 H = Y + ΣX˜ 2tr(ΣX˜ Y˜ ) shifts of c are allowed. We achieve this by ensuring that rows k kF k kF − n H 2 n H 2 of X corresponding to unwanted shifts are zero. 2 ˜xk ˜yk ˜xk ˜yk (12) = Y F + | 2| 2 | 2| Remark 2 (Approximating linear operators by circulant k k ˜xk − ˜xk k=1 2 k=1 2 Rn n X k k X k k matrices). Given Y × , let us consider the special n H 2 n H 2 ∈ 2 ˜xk ˜yk 2 ˜xk ˜yk case of (8) when N = n and we fix X = I. Now we = Y F | 2| = yk 2 | 2| . k k − ˜xk k k − ˜xk calculate the closest, in Frobenius norm, circulant matrix k=1 2 k=1  2  X k k X k k to a given linear transformation Y. Because the Frobenius In the development of (12) we use the definition of the Frobe- norm is elementwise the optimal solution is directly ck = nius norm, the invariance of the Frobenius norm under unitary 1 y , k ,...,n. Unfortunately, in H n (i j) mod n=(k 1) ij = 1 transformations, i.e., FX = F X = X , and the − − k kF k kF k kF general, circulant matrices do not approximate all linear trans- cyclic permutation of the trace, i.e., tr(ABC)= tr(BCA). formationsP with high accuracy. The result is intuitive since ˜xH ˜y 2 By the Cauchy-Schwarz inequality we have that k k matrices have n2 degrees of freedom while circulants have ˜x 2 ˜y 2 which holds with equality if and only| if| the≤ k kk2k kk2 only n. 
Furthermore, if we add other constraints, such as rows ˜xk and ˜yk are multiples of each other. In this case, orthogonality for example, the situation is even worse: [37] 2 the objective function (12) develops to Y CX F = shows that the set of orthonormal circulants is finite and n y 2 n ˜y 2 k − k k=1 k 2 k=1 k 2. This is the only case where the constructs it. Therefore, researchers proposed approximating circulantk dictionaryk − cank reachk zero representation error. P P a linear transformation by a product of O(n) circulant, or A necessary condition that the optimal circulant dictionary  H 2 Toeplitz, and diagonal matrices [9], [38]. n ˜x ˜yk 2 | k | C obeys is σ 2 = k=1 ˜x 4 = 1, i.e., the ℓ2 norm of k k k 2 the optimal solution of (8) isk one.k  P In the context of dictionary learning, to obey the unit ℓ2 B. Union of circulant dictionary learning norm constraint on c we should normalize the optimal solution 1 Let us now consider overcomplete dictionaries (matrices σ σ 2− σ. This is avoided because we can always group that have, significantly, more columns than rows) that are this← normalizing k k factor with the representations X instead of 1 1 unions of circulant matrices. In particular, take a dictionary the circulant dictionary, i.e., σ − CX = C( σ − X). This k k2 k |k2 which is the union of L circulants: grouping is correct because the Fourier transform preserves ℓ2 (1) (2) (L) n nL norms and we have that σ 2 = c 2, i.e., all the columns of D = C C ... C R × , (13) k k k k ∈ C are ℓ2 normalized after normalizing σ. The algorithm called C–DLA, first introduced in [15], where each C(ℓ) = circ(c(ℓ)) = F H Σ(ℓ)F, Σ(ℓ) = has low computational complexity that is dominated by the diag(σ(ℓ)), σ(ℓ) = Fc(ℓ), is a circulant matrix. Given training n N O(nN log n) computation of Y˜ (once) and that of X˜ (at each data Y R × , with this structure, the dictionary learning ∈ 4

(ℓ) problem has the objective: (ℓ) X p N denoted Xconv = R × ) is developed as 0(n 1) N ∈ Y DX 2 = Y C(1) C(2) ... C(L) X 2  − ×  k − kF k − kF 2 H (1) H (L) 2 L = Y F Σ F ... F Σ F X F  2 ˜ (ℓ) ˜ (ℓ) k − k Y DconvXconv F = vec(Y) vec(ΣconvXconv) = Y FH Σ(1)F ... Σ(L)F X 2 k − k − k −  kF (14) Xℓ=1 F (17) 2 2 2 L  L L (ℓ) (ℓ) ˜ (ℓ) ˜ (ℓ) (ℓ) (ℓ) 2 = FY Σ FX = Y Σ X , = ˜y A F:,1:nc = ˜y Bc F , − − − k − k ℓ=1 F ℓ=1 F ℓ=1 F X X X (1) (L) RpN nL where the tilde matrices indicate the Fourier transforms (taken where B = A F:,1:n ... A F:,1:n × , nL N ∈ columnwise) and the representations X R × are sepa- (ℓ) (ℓ) (ℓ) CpN p A = ˜x e ... ˜xp e × , with the rated row-wise into L continuous non-overlapping∈ blocks of 1  1 p ∈  (ℓ) ⊗ ⊗ T (ℓ) n N rows of X˜h , c c(1) c(2) ...i c(L) RnL and size n denoted X R × . A way to update all circulant conv = p n ∈ ∈ ˜ ˜ F C × is the p p Fourier matrix restricted to its first components using the Fourier transforms Y and X presents :,1:n ∈ ×  T th ˜ (ℓ) T th n columns. The solution here is given by the least squares itself. Denote by ˜yk the k row of Y and by (˜xk ) the k (ℓ) row of X˜ . The objective function of the dictionary learning H H n c = (B B) B ˜y, (18) problem separates into k =1,..., 2 +1 (given real valued \ H nL nL training data Y) distinct least squares problems like: where B B R × is a positive definite block symmetric   ∈ 2 matrix, where each n n block is a Toeplitz matrix like L H (ℓ1) H× (ℓ2) th T (ℓ) (ℓ) T Tℓ1ℓ2 = F:,1:n(A ) A F:,1:n for the (ℓ1,ℓ2) block – minimize ˜yk σ (˜x ) . (15) (1) (L) k k the diagonal blocks are symmetric positive definite Toeplitz. σ ,...,σ − k k ℓ=1 F H L(L 1) Therefore, B B is determined by nL + (2n 1) − X 2 − Therefore, for a fixed k, the diagonal entries (k, k) of all parameters – the first term covers the parameters of the L Σ(ℓ) (which are denoted σ(ℓ)) are updated simultaneously by symmetric Toeplitz diagonal blocks and the second terms k L(L 1) solving the least squares problems (15). Given real-valued covers the parameters of all 2− non-diagonal Toeplitz (ℓ) (ℓ) data, mirror relations σ = (σ )∗, k = 2, . . . , n, hold blocks. The computational burden is highest in order to n k+2 k (ℓ ,ℓ ) (ℓ ) H (ℓ ) analogously to (10) for− all ℓ = 1,...,L. Notice that this calculate the diagonals W 1 2 = (A 1 ) A 2 with en- (ℓ1,ℓ2) (ℓ1) H (ℓ2) formulation is just a natural extension of the one dimensional tries wk = (˜xk ) ˜xk , i.e., the inner product of the (ℓ1) (ℓ2) least squares problems in (10). To compute all the components corresponding rows from X˜ conv and X˜ conv respectively. These (L) n 2 of all σ we solve this least squares problem 2 times – the calculations take O(pL N) operations. The inverse Fourier 2 (ℓ1,ℓ2) computational complexity is O(nL N). transforms of W to recover the entries of Tℓ1ℓ2 take In comparison, the union of circulants dictionary learning only O(p log2 p) operations. method presented in [15], UC–DLA, updates each circulant The inverse problem in (18) can be solved exactly in block C(ℓ) sequentially and separately (this can be seen as a O(n3L3) via block Cholesky factorization [40, Chapter 4.2]. block coordinate descent approach). The new proposed learn- When nL is large, an alternative approach is to use some iter- ing method, called Union of Circulant Dictionary Learning ative procedure like the Conjugate Gradient approach (already Algorithm with Simultaneous Updates (UCirc–DLA–SU), is used in convolutional problems [41]) where the computational described in Algorithm 1. 
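To illustrate the simultaneous update used by UCirc–DLA–SU, the sketch below solves the per-frequency least squares problems (15) with numpy: for every frequency index k, one small complex least squares problem in L unknowns is solved. It assumes real-valued training data and, unlike Algorithm 1, it does not normalize the blocks or zero the DC bin; variable names are illustrative, not from the paper.

```python
import numpy as np

def ucirc_dictionary_update(Y, X, L):
    """Simultaneous update of a union of L circulants: one least squares problem (15)
    in L unknowns per frequency index k. X has nL rows (L contiguous blocks of n rows)."""
    n, N = Y.shape
    Yf = np.fft.fft(Y, axis=0)                                     # columnwise FFT of Y
    Xf = [np.fft.fft(X[l * n:(l + 1) * n, :], axis=0) for l in range(L)]
    sigma = np.zeros((n, L), dtype=complex)
    for k in range(n):
        A = np.stack([Xf[l][k, :] for l in range(L)], axis=1)      # N x L system for frequency k
        sigma[k, :], *_ = np.linalg.lstsq(A, Yf[k, :], rcond=None)
    # Back to the time domain: column l of C_cols is the first column of the l-th circulant.
    # (Algorithm 1 additionally zeroes the DC bin and normalizes each block.)
    C_cols = np.fft.ifft(sigma, axis=0).real
    return C_cols, sigma
```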
burden falls on computing matrix-vector products with the Remark 3 (Updating an unused circulant component). matrix BH B which take only O(pL2) operations. The vector (ℓ) th H (1) (L) T Assuming that X = 0n N , i.e., the ℓ circulant matrix v = B ˜y has the structure v = v ... v × is never used in the representations, then we use the update where v(ℓ) = FH z(ℓ) Rn and z(ℓ) Cp with entries 2 :,1:n (ℓ) L (i) (i) (ℓ) (ℓ) H ∈ th  ∈ th  c = argmax Y i=1,i=ℓ C X z .This is zj = (˜xj ) ˜yj , i.e., the j entry of the ℓ component is z; z 2=1 − 6 2 2 ˜ (ℓ) ˜ k k   the inner product of the corresponding rows from Xconv and Y the block update method useP in UC–DLA [15]. Furthermore, respectively. The cost of computing v RnL is O(pLN) since similarly to [39], this update could be used also when atoms ∈ of block ℓ have a lower contribution to the reconstruction than the inverse Fourier transforms are only O(p log2 p). Also, we atoms from other blocks, i.e., X(ℓ) 2 X(i) 2 , i = ℓ. need to compute once the Fourier transform of the dataset k kF ≪k kF ∀ 6 (Y˜ ) and at each iteration all the L Fourier transforms of ˜ (ℓ) the sparse representations Xconv which take O(Np log2 p) and C. Union of convolutional dictionary learning O(LNp log2 p) overall operations respectively. Analogously to (13), we define the dictionary which is a The new proposed learning method, called Union of Con- union of L convolutional matrices as volutional Dictionary Learning Algorithm with Simultaneous Updates (UConv–DLA–SU), is described in Algorithm 2. (1) (2) (L) Rp pL Dconv = Cconv Cconv ... Cconv × , (16) The theoretical and algorithmic importance of dictionaries ∈ that are unions of convolutional matrices where the first h i and the objective function to minimize with respect to this columns have different (and potentially overlapping) supports union dictionary given the fixed representations Xconv has been highlighted in [13] (see in particular the clear pre- pL N ∈ R × (separated into L continuous non-overlapping blocks sentation of the convolutional sparse model in [13, Figure 1]): 5

Algorithm 2 – UConv–DLA–SU. Remark 4 (The special case of L =1 – the computational Rp N Input: The dataset Y × , the number of convolutional simplifications of constructing a single convolutional dic- ∈ atoms L, the length of c denoted n, the length of the tionary). Following a similar development as [15, Remark 2], input signals m n (both n and m are chosen such that consider again the objective function of (8) and develop p n m ) and≥ the sparsity s m. = + 1 2 2 − ≤ Y CconvXconv = ˜y AF:,1:nc , (19) Output: The union of L convolutional dictionaries Dconv k − kF k − kF Rp pL ∈ × as in (16) and the sparse representations Xconv where we have defined A = ˜x1 e1 ... ˜xp ep pL N 2 ∈ p ⊗ ⊗ ∈ R × such that Y DconvXconv is reduced. CpN p, with e the standard basis for Rp, i.e., A is k − kF × k k=1 (ℓ) { } ˜ T  1. Initialization: set c for ℓ = 1,...,L, to random ℓ composed of columns from (Xconv I) corresponding to the 2 ⊗ normalized vectors of size p with non-zeros only in the first non-zero diagonal entries of Σconv (see also the Khatri-Rao n entries and compute X = OMP(Dconv, Y,s). product [31]). 2. Compute Fourier transform Y˜ , set its first row to zero. By the construction in (3) only the first n entries of cconv 3. For 1,...,K : are non-zero and therefore the optimal minimizer of (19) is Update dictionary: • c = (AF:,1:n) ˜y. (20) (ℓ) \ – Compute all the L Fourier transforms X˜ conv. The large matrix AF:,1:n is never explicitly constructed in – Efficiently compute (18): H H 1 H H (1) (L) the computation of c = (F:,1:nA AF:,1:n)− F:,1:nA ˜y = Construct v = v ... v : for ℓ = H 1 → (ℓ) (ℓ) (F:,1:nWF:,1:n)− v, where we have defined 1,...,L set z1 = 0 and compute zk = (ℓ) (ℓ)  (ℓ)  2 2 H H ˜x H ˜y , z z , k ,..., p W = diag( ˜x1 2 ... ˜xp 2 ), v = F:,1:nA ˜y. (21) ( k ) k p k+2 = ( k )∗ = 2 2 +1 k k k k and compute the− inverse Fourier transform of z(ℓ) H Observe that T = F:,1:nWF:,1:n is a real-valued symmetric (ℓ) H (ℓ) and keep only its first n entries: v = F:,1:nz . positive definite Toeplitz matrix – it is the upper left (the H Explicitly construct B B: for ℓ1 = 1,...,L and leading principal) submatrix of the p p circulant matrix → H × ℓ2 = ℓ1,...,L compute first column and row of the F WF. The matrix T is never explicitly computed, but block T (T = TT ) by the inverse Fourier ℓ1ℓ2 ℓ2ℓ1 ℓ1ℓ2 its first column is contained in the n entries of the vector (ℓ1,ℓ2) (ℓ1) H (ℓ2) (ℓ1,ℓ2) FH diag W . Also, notice that AH ˜y is a vector whose jth transform of wk = (˜xk ) ˜xk , wp k+2 = ( ) − ˜ (w(ℓ1 ,ℓ2)) , k =1,..., p +1. entry is the inner product of the corresponding rows from X k ∗ 2 ˜ H Get c, solve (18) by the Cholesky decomposition and and Y respectively, i.e., ˜xj ˜yj . →   (ℓ) (ℓ) 1 (ℓ) The computation of W and v take approximately O(pN) normalize the L convolutions c c − c . ←k k2 operations each – because X is real-valued, the Fourier trans- Update sparse representations X=OMP(Dconv,Y,s). • forms exhibit symmetries and only half the ℓ2 norms in W and of the entries in AH ˜y need to be computed. These calculations dominate the computational complexity since they depend on (ℓ) group together the first columns of each Cconv into the local the size of the dataset N p – together with the computations (1) ≫ dictionary D of size p L, do the same with the second of the Fourier transforms X˜ conv (at each iteration) and Y˜ (only columns into the local dictionary× D(2) and so on until D(p). 
once) which take O(LNp log2 p) and O(Np log2 p) operations These dictionaries are called local because they are localized respectively. The least squares problem (20) is solved via to the reconstruction of only a few (depending on the size of the Levinson-Durbin algorithm [40, Section 4.7.3], whose the support n) grouped entries from a signal, as compared to complexity is 4n2 instead of the regular O(n3) computational a global signal model. complexity for unstructured inverse problems. Notice that (17) makes use of the matrix F:,1:n due to the Also, observe that this least squares solution when sparsity pattern in cconv, i.e., only the first n entries are non- applied to minimizing (9) leads to the same optimal H H 1 H H zero. This can be generalized so that if the support of cconv is solution from (10): c = (F A AF)− F A ˜y = H 1 H H H 1 H H denoted by (cconv) then the least squares problem (17) can F W− FF A ˜y = F W− A ˜y = F σ, with W = be solved onlyS on this support my making use of F 2 2 :, (cconv) diag( ˜x1 2 ... ˜xn 2 ). The approach in (10) is pre- p (cconv) S ∈ k k k k C ×|S |, i.e., the Fourier matrix of size p p restricted to ferred to (20) since in the Fourier domain the overall problem ×   the columns indexed in (cconv). Alternatively, if we do not is decoupled into a series of smaller size independent subprob- S want to (or cannot) decide the support a priori, we can add an lems that can be efficiently solved in parallel. ℓ1 penalty to (17) to promote sparse solutions. Therefore, each single convolutional dictionary can be up- Without the sparsity constraints in cconv, UConv–DLA– dated efficiently and an algorithm in the style of UC–DLA SU essentially reduces to Ucirc–DLA–SU but with a major [15] can be proposed whenever running time is of concern numerical disadvantage: the decoupling that allows for (15) or the dataset is large. In this case, because the convolutional is no longer valid and therefore the least squares problem components would not be updated simultaneously, we would (18) is significantly larger and harder to solve. This extra expect worse results on average.  computational effort seems unavoidable if we want to impose There are several places where structures related to unions the sparsity in cconv. This motivates us to discuss the possibility of circulant and convolutional dictionaries appear. We briefly of updating each convolutional dictionary sequentially, just discuss next Gabor frames and then, in the following section, like [15] does for circulant dictionaries. wavelet-like dictionaries. 6

¯ p ¯ Remark 5 (A special case of time-frequency synthesis columns of WA, X1 are the first 2k−1 rows of X. We have dictionary). Union of circulant matrices also show up when also denote Σgk = diag(Fgk) and similarly for Σgk . The 2p pN − studying discrete time-frequency analysis/synthesis matrices matrix A C × 2k 1 is made up of a subset of the columns ∈ T H [42]. Consider the Gabor synthesis matrix given by from ((I FS)X¯ ) WA F corresponding to the non- 2 ⊗ 1 ⊗ ,1 (1) (2) (m) m m2 zero entries from vec( Σg Σh ). Minimizing the quantity G = D C D C ... D C C × , (22) k k ∈ in (24) leads to a least squares problem where both gk and Cm   where C= circ(g) for g which is called the Gabor hk have a fixed non-zero support of known size n. It obeys window function, i.e., the∈ matrix G contains circular shifts p (p) n 2m−1 such that the circulants for Cm can be constructed. and modulations of g. The matrices D(ℓ),ℓ = 1, . . . , m, are ≤Similarly to the union of circulants cases described before, (ℓ) (ℓ 1)(k 1) 2πi/m diagonal with entries dkk = ω − − and ω = e . some computational benefits arise when minimizing (24), H H H H In the context of compressed sensing with structured ran- i.e., computing (I2 F )A A(I2 F) (I2 F )A ¯y. ⊗ ⊗ \ ⊗ ¯ T dom matrices, Gabor measurement matrices have been used For convenience we will denote Q = (((I2 FS)X1) H ⊗ for sparse signal recovery [43]: the Gabor function of length and R = WA,1F , such that A = Q R. Notice that (1) (2) ⊙ m is chosen with random entries independently and uniformly H p D D ∗ A A = − , where the blocks are diagonal distributed on the torus z C z =1 [43, Theorem 2.3]. 2k 1 D(2) D(3) { ∈ | | | } (1)  (3) Here, our goal is to learn the Gabor function g from a given 2 2 p 2 2 with entries dii = qi 2 ri 2, dii = q − +i 2 ri 2 dataset Y such that the dictionary G allows good sparse k k k k k 2k 1 k k k (2) H 2 and d = q p qi ri 2 where qi and ri are columns representations. Now the objective function develops to: ii k−1 +i 2 k k p 2 2 of Q and R, respectively, i = 1,..., 2k−1 . Therefore, Y GX F = Y D(I C)X F H H k − k k − ⊗ k (I2 F )A A(I2 F) is a 2 2 whose blocks = y (((I F)X)T D(I FH ))vec(I Σ) 2 are⊗ real-valued circulant⊗ matrices× (and the diagonal blocks are k − ⊗ ⊗ ⊗ ⊗ kF m m 2 (23) also symmetric). Also, because AH A is symmetric positive (ℓ) ˜(ℓ) (ℓ) ˜(ℓ) = y ˜x1 d1 ... ˜xm dm σ definite it allows for a Cholesky factorization LLT where − ℓ=1 ⊗ ℓ=1 ⊗   F L P2 2 P the matrix has only the main diagonal and the secondary = y Aσ F = y AFg F , p lower diagonal of size − of non-zero values. A further k − k k − k 2k 1 where we have A CmN m and we have denoted y = computational benefit comes from when n p and we solve a × ≪ (1) ∈ (m) (ℓ) T th least squared problem in 2n variables, as compared to 2p, i.e., vec(Y), D = D ... D , (˜xk ) is the k row of (ℓ) g ¯g X˜ (ℓ) and d˜ is the kth column of D(ℓ)FH . Similarly to the A(I F) k = A(I F ) k , with both ¯g , h¯ k   2 h 2 :,1:n h¯ k k convolutional dictionary learning case, Gabor atoms typically ⊗ k ⊗ k ∈ Rn. Finally, notice that I FH  AH A I F has have compact support and we can add the sparse structure to g ( 2 :,1:n) ( 2 :,1:n) Toeplitz blocks, like in the case⊗ of UConv–DLA–SU.⊗ (the support of size n m) and find the minimizer by solving the least squares problem≤ which this time is unstructured (there The linear transformation (6) has two major advantages: i) is no Toeplitz structure).  
the computational complexity of matrix-vector multiplications Wx with a fixed x Rp is controlled by the number of stages m and the length∈ of the filters n, instead of the fixed D. Wavelet-like dictionary learning matrix-vector multiplication complexity which is O(p2) and Starting from (6), in the spirit of dictionary learning, our ii) it allows for learning atoms that capture features from the goal is to learn a transformation from given data such that it data at different scales, i.e., atoms of different sparsity levels. has sparse representations. The strategy we use is to update The new proposed procedure, called Wavelet-like Dictionary (p) (p) each Wk (actually, the Ck component) while keeping Learning Algorithm (W–DLA), is described in Algorithm 3. all the other transformations fixed. Therefore, for the kth Remark 6 (Extending and constraining the wavelet-like component we want to minimize structure). Unlike wavelets that use the same filters g and h 2 (p) 2 (known as the low and high pass filters, respectively) at each Y WX F = Y WAWk WBX F k − k k − k stage k of the transformation, we learn different filters gk p 2 ( k−1 ) ¯ 2 X1 and hk. Also, the support of the filters (and their size) can be W W Ck 0 = Y A,1 A,2 ¯ decided dynamically at every stage and the downsampling can − " 0 I# X2   F  p  be replaced with a general column selection matrix S. Note ( − ) ¯ 2k 1 ¯ 2 = Y WA,1C X1 F that if we allow full support then the least squares problem k − k k (24) Y¯ W G S H S X¯ 2 can be solved in the Fourier domain with the complex-valued = A,1 k k 1 F ˜ k − k unknowns ˜gk = Fg and hk = Fhk. Furthermore, each stage Y¯ W FH Σ Σ I FS X¯ 2 k = A,1  gk hk ( 2 ) 1 F in (5) applies only to the decomposition of the left-most (so- k − T H ⊗ k 2 = ¯y (((I FS)X¯ ) WA F )vec( Σg Σh ) called low-frequency) components from the previous stage. In k − 2 ⊗  1 ⊗ ,1 k k kF Fg 2 g 2 the spirit of optimal sub-band tree structuring (also known as = ¯y A k = ¯y A(I F) k , Fh 2 h wavelet packet decompositions) [44], we can propose a trans- − k F − ⊗ k F (p) (p) (p)     formation W = C1 Cm 1Cm factored into m stages p ··· − where F is the Fourier matrix of size 2k−1 , we denoted WA = all of the form (4) (where again each stage has its own filters (p ) (p) (p) (p) ¯ W1 Wk 1, WB = Wk+1 Wm and X = WBX, gk and hk which now can have support n p) and use the ¯ ··· − ¯ ¯··· p ≤ Y = Y WA X , ¯y = vec(Y), WA are the first k−1 same optimization procedure. Lastly, we could also propose − ,2 2 ,1 2 7

Algorithm 3 – W–DLA. 90 p N m Input: The dataset Y R × such that 2 divides p exactly, ∈ 80 the number of stages m log2 p, the size of the support of ≤p 70 the filters denoted n 2m−1 and the sparsity s p. ≤ p≤p Output: The wavelet-like dictionary W R × as in (6), 60 p p ∈ the diagonal D R × such that WD has unit ℓ2 norm 50 ∈ Rp N UCirc−DLA−SU columns and the sparse representations X × such that 40 2 ∈ UC−DLA Detection rate (%) M−DLA Y WDX F is reduced. 30 k − k SI−ILS−DLA (n) SI−K−SVD 1. Initialization: set all stages C = I, k = 1,...,m; 20 k K−SVD compute the singular value decomposition of the dataset 10 T Y = UΣV and compute the sparse representations X = 10 20 30 No noise SNR (dB) (UT Y), i.e., project and keep the s largest entries in Ts magnitude for each element (column) in the dataset Y. Fig. 1. Average kernel recovery rate as a function of noise level for different 2. For 1,...,K : shift invariant dictionary learning methods. We compare against SI–K–SVD [20], SI–ILS–DLA [22], M–DLA [25], UC–DLA [15]. The K–SVD approach Update dictionary: with all other components fixed, • serves as a baseline performance indicator since it is not explicitly developed update only the kth non-trivial component of W de- to recover shift-invariant structures. p − ( 2k 1 ) noted Ck (by computing both gk and hk on the 30 UC−DLA support of size n) for each k =1, . . . , m, at a time by UCirc−DLA−SU minimizing the least squares problem (24). Update D such that WD has unit ℓ norm columns and • 2 update sparse representations X = OMP(WD, Y,s). 5 Frequency of utilization 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Shifts of kernel a structured transformation similar to (4) but based on more (p) Fig. 2. Average frequency of utilization for n = 20 atoms of the L = 45 C G S H S J S Rp p Ns than two filters, e.g., k = k k k × , circulants Cℓ. With perfect recovery the q = 3 peaks should be ≈ 44. ∈ p p Lq for a new selection (downsampling by 3) matrix S R 3 .   × Heuristics to choose m, n (maybe at each stage m, i.e.,∈ having  nk), the location of the n non-zero entries in each filter, optimization problem with two quadratic constraints [46]. the structure and sizes of Ck and S may be proposed to further improve the accuracy of the algorithm (or the trade- IV. EXPERIMENTAL RESULTS off between numerical efficiency and accuracy in terms of the We now discuss numerical results that show how well the representation error). proposed methods extract shift-invariant structures from data. Orthonormal wavelets, i.e., in our case meaning that W and Since we are dealing with the dictionary learning problem our (p) goal is to build good representations of the datasets we con- all Ck are orthonormal, are also extensively used in many application. With these orthogonality constraints the objective sider. In our proposed algorithms, in the sparse recovery phase, function develops into a simpler form than (24) as we use the orthogonal matching pursuit (OMP) algorithm [47], but any other sparse approximation method could be chosen. 
(p) 2 T (p) 2 Y WAWk WBX F = WA Y Wk WBX F Also, because the sparse approximation steps are numerically k − k p k − k T ¯ ( k−1 ) ¯ 2 expensive (at least quadratic complexity in general and applied = WA Y X2 W 2 X1 F k T − − H k 2 for all N n data points) and repetitive operations (done at = W Y X¯ F Σg Σh (I F)X¯ (25) ≫ k A − 2 − k k 2 ⊗ 1kF each iterative step of the alternating optimization algorithms), T ¯ ¯ 2 = F(W Y X2) Σg Σh (I2 FS)X1 OMP is chosen from practical considerations, as numerically k A − −  k k  ⊗ kF ¯y I FS X¯ T I vec Σ Σ 2 , efficient implementations are available (for example [48]). = ((( 2 ) 1) n) ( gk hk ) F k − ⊗ ⊗ k Notice from the description of all the proposed algorithms T ¯ where we have now denoted ¯y = vec(F(W Y X2)) that each dictionary update step necessarily decrease the A − and of course the Kronecker products are never explicitly objective function but, unfortunately, the overall algorithms built. We minimize this quadratic objective with the ad- may not be monotonically convergent to a local minimum T ditional orthogonality constraint gk hk = 0 (in order to since OMP is not guaranteed in general to always reduce the (p) keep Ck orthogonal). Minimizing quadratic functions under objective function. As such, in this section, we also provide quadratic equality constraints has been studied in the past some experimental results where we empirically observe the and numerical procedures are available [45]. This constraint converge of the proposed methods. ensures orthogonality between the columns of GS and HS. For datasets with a strong DC component the learned C 1 1 To ensure orthogonality among the columns of GS (and HS, circulant dictionary might be √n n n. Therefore, respectively) we can explicitly add symmetry constraints to preprocessing the dataset Y by removing≈ the mean× component the filter coefficients, e.g., if gk has a non-zero support of size is necessary and we have σ1 = 0 in (10) since ˜y1 = 0N 1. × four we have gk3 = gk1 and gk4 = gk2. Alternatively, both This operation is assumed performed before the application of orthogonality constraints can be added− leading to a quadratic any algorithm developed in this paper. 8

[Figure 3: learning time (seconds) versus the number of bases L, for UC−DLA and UCirc−DLA−SU at s = 4, 8, 12.]
Fig. 3. Average learning times for UC–DLA [15] and the proposed UCirc–DLA–SU over 100 random realizations. For the same parameters, the running time of UConv–DLA–SU is several times higher and therefore not shown in this plot – this highlights the importance of solving the learning problem in the Fourier domain, when possible.

[Figure 4: centered ECG signal (mV) versus time (samples), original and reconstructed.]
Fig. 4. Original ECG sample and reconstruction by UConv–DLA–SU with support n = 12, sparsity s = 4 and L = 2 circulants. With these parameters, the approximation error (26) for the whole training dataset is 7.5%, showing that such data can be indeed well represented in a simple (in terms of small L) shift-invariant dictionary.

A. Synthetic experiments

We create a synthetic dataset generated from a fixed number of atoms and their shifts and then proceed to measure how well we recover them from the dictionary learning perspective. The experimental setup follows: generate N = 2000 signals of length n = 20 that are linear combinations (with sparsity s = 4) of L = 45 randomly generated kernel columns which are allowed to have only q = 3 circular shifts (out of the n = 20 possible), i.e., Y ∈ R^{n×N} where each column is y_i = Σ_{ℓ=1}^{L} α_{iℓ} P^{q_{iℓ}} c_ℓ + n_i for i = 1, ..., N with fixed ||c_ℓ||_2 = 1 and ||α_i||_0 = s, where the α_{iℓ} ∈ [−10, 10] and q_{iℓ} ∈ {0, ..., q − 1} are randomly uniformly distributed and n_i is a random Gaussian vector representing noise.

First, using the synthetic dataset, we show in Figure 1 how UCirc–DLA–SU outperforms previously proposed methods in the task of recovering the atoms used in creating the dataset. This shows the benefit (as compared to UC–DLA) of updating all the circulant components simultaneously with each step of the algorithm. We observe that UCirc–DLA–SU achieves lower error approximately 75% of the time. The typical counter-example is one where UC–DLA converges slower (in more iterations) to a slightly lower representation error, i.e., sub-optimal block calculations ultimately lead to a better final result. This observation is not surprising since both heuristic methods only approximately solve the overall original dictionary learning problem (with unknowns both C and X). To show this, with the same synthetic dataset for noise level SNR = 30dB, in Figure 2 we calculate how many times on average each atom in all circulant components (from all the L = 45) is used in the sparse representations. On average, UCirc–DLA–SU recovers the correct supports (in effect, the indices of the shifts used) more often than UC–DLA.

Figure 3 shows the learning times for the union of circulants algorithms, with blocks and with simultaneous updates, for a fixed number of K = 100 iterations. For this test, we created a synthetic dataset Y of size n = 64 with N = 8192 data points and we vary the sparsity level s and the number of bases L in the union. For s ∈ {8, 12}, in our experiments, UCirc–DLA–SU is always faster than UC–DLA [15] and the speedup is on average 20%, while for s = 4 the average speedup is only 10% and for large L there are cases where UCirc–DLA–SU is slower than UC–DLA. In principle, UCirc–DLA–SU should always be faster but, in practice, the algorithm involves memory manipulations in order to build all the matrices for all the subproblems (15). This is the reason why the running time difference is not larger. Furthermore, for small s the blocks used by UC–DLA are calculated rapidly (calculations are similar to the matrix operations in Remark 5) because X is very sparse and the overall running time is, therefore, lower.

B. Experiments on ECG data

Electrocardiography (ECG) signals [49] have many repetitive sub-structures that could be recovered by shift-invariant dictionary learning. Therefore, in this section, we use the proposed UConv–DLA–SU to find in ECG signals short (compact support) features that are repeated. We use the MIT-BIH arrhythmia database (https://www.physionet.org/physiobank/database/mitdb/) from which we extract a normal sinus rhythm signal composed of equal-length samples from five different patients, all sampled at 128 Hz. This signal is reshaped into a matrix of centered, non-overlapping sections of length p = 64, leading to the training dataset Y ∈ R^{64×101000}.

Because we are searching for sub-signals with limited support, we use UConv–DLA–SU with parameters n = 12, s = 2 and L = 2 to recover the shift-invariant structure. In Figure 4 we show the original ECG signal and its reconstruction in the union of convolutional dictionaries. Of course, the reconstruction is not perfect but it is able to accurately capture the spikes in the data and remove some of the high-frequency features, i.e., the signal looks filtered (denoised). The second plot, Figure 5, shows the L = 2 learned atoms from the data which capture the spiky nature of the training signal.
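Referring back to the synthetic setup of Section IV-A, the experiments above can be reproduced from a generator along the following lines (the noise level shown is an assumption; the paper sweeps several SNR values in Figure 1).

```python
import numpy as np

def make_synthetic(N=2000, n=20, s=4, L=45, q=3, noise_std=0.01, seed=0):
    """Synthetic data of Section IV-A: columns are s-sparse combinations of
    randomly shifted unit-norm kernels plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    kernels = rng.standard_normal((n, L))
    kernels /= np.linalg.norm(kernels, axis=0)             # ||c_l||_2 = 1
    Y = np.zeros((n, N))
    for i in range(N):
        for l in rng.choice(L, size=s, replace=False):     # ||alpha_i||_0 = s
            alpha = rng.uniform(-10, 10)                   # alpha_il in [-10, 10]
            shift = rng.integers(0, q)                     # q_il in {0, ..., q - 1}
            Y[:, i] += alpha * np.roll(kernels[:, l], shift)
        Y[:, i] += noise_std * rng.standard_normal(n)      # noise vector n_i
    return Y, kernels
```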

[Figure 5: centered ECG signal (mV) versus time (samples).]
Fig. 5. The L = 2 kernel atoms of length p = 64 learned by UConv–DLA–SU with support n = 12 and sparsity s = 4.

[Figure 6: representation error ε (%) versus iteration, for W−DLA and W−DLA with Haar initialization.]
Fig. 6. A typical convergence of the proposed W–DLA algorithm with the number of iterations. Since n = 2, m = log2 p we also have the opportunity to initialize the filters with the Haar values.

[Figure 7: representation error ε (%) versus sparsity level s; top panel Haar vs. W−DLA, bottom panel D4 vs. W−DLA.]
Fig. 7. Representation errors as a function of the sparsity level for wavelet and W–DLA transformations. The top figure has parameters n = 2, m = log2 p and therefore allows a Haar initialization, while the bottom figure has parameters n = 4, m = −1 + log2 p and allows a D4 initialization.

[Figure 8: representation error ε (%) versus the size of the filter support n, for W−DLA with maximum m, W−DLA with m = 1 and Circ−DLA.]
Fig. 8. Representation error for W–DLA as a function of the size of the filter support n. For reference we show C–DLA [15], while W–DLA runs twice: once with fixed m = 1 and once with the largest m such that n ≤ p/2^{m−1} is obeyed.

C. Experiments on image data

The training data Y that we consider are taken from popular test images from the image processing literature (pirate, peppers, boat, etc.). The test dataset Y ∈ R^{p×N} consists of 8 × 8 non-overlapping image patches with their means removed. We consider N = 12288 and we have p = 64. To evaluate the learning algorithms, in this section we consider the relative representation error of the dataset Y in the dictionary D given the sparse representations X as

ε = ||Y − DX||_F^2 ||Y||_F^{−2} (%).   (26)

We consider image data because there are well-known wavelet transforms that efficiently encode such data. We will use only the filters of the Haar and Daubechies D4 wavelet transforms, i.e., with n = 2, m = log2 p the filters h̄_k = [1/√2  −1/√2]^T, ḡ_k = [1/√2  1/√2]^T, and with n = 4, m = −1 + log2 p the filters h̄_k = [(1−√3)/(4√2)  −(3−√3)/(4√2)  (3+√3)/(4√2)  −(1+√3)/(4√2)]^T, ḡ_k = [(1+√3)/(4√2)  (3+√3)/(4√2)  (3−√3)/(4√2)  (1−√3)/(4√2)]^T, respectively, for all k. The filters are chosen such that the resulting W is orthonormal.

First, we show in Figure 6 the experimental convergence of the proposed W–DLA. If we also impose an orthogonality constraint on W then we can avoid OMP for the sparse representations: a simple projection operation guarantees optimal sparse representations. The figure shows that, when available, it is convenient to initialize W–DLA with well-known wavelet filters since the algorithm converges faster and to slightly lower representation errors. Still, the differences are not significant and wavelets do not exist for every choice of n and m.

Then, in Figure 7 we show how the representation error varies with the sparsity level s. We also run W–DLA initialized with the well-known wavelet filter coefficients Haar and D4, respectively. W–DLA is always able to improve the representation performance, even when starting with the wavelet coefficients. In the Haar case the improvement is small due to the small number of free filter parameters to learn, i.e., 24: m = log2 p = 6 stages, each with 2 filters and each filter with 2 coefficients. In the D4 case, the representation error is improved significantly. Both figures show that, when available, wavelet coefficients provide an excellent initialization, even slightly better than the proposed W–DLA initialization (also confirmed in Figure 6). Since these wavelets are not available for all choices of n and m, the purpose of this plot is to show that the proposed initialization provides very good results in general.

Finally, in Figure 8 we show the effect that the parameters n and m have on the representation error. For reference, we show the representation error of C–DLA, which has p = 64 free parameters to learn. The performance of this dictionary is approximately matched by W–DLA with n = 4 and m = −1 + log2 p, which has 40 free parameters to learn: m = 5 stages, each with 2 filters of support 4. Notice that the representation error plateaus after n = 8. In general, dictionaries built with W–DLA have 2nm degrees of freedom. We also show a version of W–DLA where we keep m = 1 and vary only n, in which case, of course, the representation error decreases. Note that in this case each run of W–DLA is initialized with a random set of coefficients. To show monotonic convergence it would help to initialize the filters of size n with those previously computed of support n − 1.
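For completeness, the error measure (26) and the Haar and D4 filter pairs quoted above translate directly into code; this is a small helper sketch, not the paper's implementation.

```python
import numpy as np

def representation_error(Y, D, X):
    """Relative representation error (26), in percent."""
    return 100 * np.linalg.norm(Y - D @ X, 'fro') ** 2 / np.linalg.norm(Y, 'fro') ** 2

# Filter pairs used to initialize W-DLA (orthonormal Haar and Daubechies D4).
g_haar = np.array([1.0, 1.0]) / np.sqrt(2)
h_haar = np.array([1.0, -1.0]) / np.sqrt(2)
s3 = np.sqrt(3)
g_d4 = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
h_d4 = np.array([1 - s3, -(3 - s3), 3 + s3, -(1 + s3)]) / (4 * np.sqrt(2))
```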

V. CONCLUSIONS [23] P. Jost, P. Vandergheynst, S. Lesage, and R. Gribonval, “MoTIF: An efficient algorithm for learning translation invariant dictionaries,” in In this paper, we propose several algorithms that learn, Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2006. under different constraints, shift-invariant structures from data. [24] G. Zheng, Y. Yang, and J. Carbonell, “Efficient shift-invariant dictionary Our work is based on using circulant matrices and finding learning,” in Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. numerically efficient closed-form solutions to the dictionary New York, NY, USA: ACM, 2016, pp. 2095–2104. update steps, by least-squares. We analyze the behavior of the [25] Q. Barthelemy, A. Larue, A. Mayoue, D. Mercier, and J. I. Mars, “Shift algorithms on various data sources, we compare and show we & 2D rotation invariant sparse coding for multivariate signals,” IEEE Trans. Sig. Process., vol. 60, no. 4, pp. 1597–1611, 2012. outperform previously proposed algorithms from the literature. [26] F. Barzideh, K. Skretting, and K. Engan, “Imposing shift-invariance using flexible structure dictionary learning (FSDL),” Digital Signal REFERENCES Processing, vol. 69, pp. 162–173, 2017. [27] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng, “Shift-invariant sparse [1] R. M. Gray, “Toeplitz and circulant matrices: a review,” Found. Trends coding for audio classification,” in Uncertainty in Artificial Intelligence, Commun. Inf. Theory, vol. 2, no. 3, pp. 155–239, 2006. 2007, pp. 149–158. [2] A. Hero, H. Messer, J. Goldberg, D. Thomson, M. Amin, G. Giannakis, [28] F. Heide, W. Heidrich, and G. Wetzstein, “Fast and flexible convolutional A. Swami, J. Tugnait, A. Nehorai, A. Swindlehurst, J.-F. Cardoso, sparse coding,” in IEEE Computer Vision and Pattern Recognition, 2015, L. Tong, and J. Krolik, “Highlights of statistical signal and array pp. 5135–5143. processing,” IEEE Signal Processing Magazine, vol. 15, no. 5, pp. 21– [29] H. Bristow and S. Lucey, “Optimization methods for convolutional 64, 1998. sparse coding,” Technical Report, 2014. [3] H. Hassanieh, F. Adib, D. Katabi, and P. Indyk, “Faster GPS via [30] P. Vardan, Y. Romano, and M. Elad, “Convolutional neural networks the sparse Fourier transform,” in Proc. ACM 8th Annual International analyzed via convolutional sparse coding,” arXiv:1607.08194, 2016. Conference on Mobile Computing and Networking (Mobicom), 2012, [31] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” pp. 353–364. SIAM Review, vol. 51, no. 3, pp. 455–500, 2009. [4] G. C. Carter, “Coherence and time delay estimation,” Proceedings of the [32] G. Strang and T. Nguyen, Wavelets and filter banks. Wellesley, MA : IEEE, vol. 75, no. 2, pp. 236–255, 1987. Wellesley-Cambridge Press, 1997. [5] H. Ohlsson, Y. C. Eldar, A. Y. Yang, and S. S. Sastry, “Compressive [33] B. Dumitrescu and P. Irofti, Dictionary Learning Algorithms and Appli- shift retrieval,” IEEE Trans. Sig. Proc., vol. 62, no. 16, pp. 4105–4113, cations. Springer, 2018. 2014. [34] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online dictionary learning [6] B. Lucas and T. Kanade, “An iterative image registration technique with for sparse coding,” in Proceedings of the 26th Annual International an application in stereo vision,” in Int. J. Conf. Artificial Intelligence, Conference on Machine Learning. ACM, 2009, pp. 689–696. 1981, pp. 674–679. [35] Y. Naderahmadian, S. Beheshti, and M. A. 
Tinati, “Correlation based [7] S. Jain and J. Haupt, “Convolutional approximations to linear dimen- online dictionary learning algorithm,” IEEE Transactions on Signal sionality reduction operators,” in Proc. IEEE Int. Conf. Acoust. Speech Processing, vol. 64, no. 3, pp. 592–602, 2016. Signal Process. (ICASSP), 2017, pp. 5885–5889. [36] F. Giovanneschi, K. V. Mishra, M. A. Gonz´alez-Huici, Y. C. Eldar, [8] O. Chabiron, F. Malgouyres, H. Wendt, and J.-Y. Tourneret, and J. H. G. Ender, “Dictionary learning for adaptive GPR target “Optimization of a fast transform structured as a convolutional classification,” CoRR, vol. abs/1806.04599, 2018. tree,” working paper or preprint, 2016. [Online]. Available: [37] A. Bottcher, “Orthogonal symmetric Toeplitz matrices,” Complex Anal- https://hal.archives-ouvertes.fr/hal-01258514 ysis and , vol. 2, no. 2, pp. 285–298, 2008. [9] K. Ye and L.-H. Lim, “Every matrix is a product of Toeplitz matrices,” [38] M. Huhtanen and A. Per¨am¨aki, “Factoring matrices into the product of Found. of Comp. Math., vol. 16, no. 3, pp. 577–598, 2016. circulant and diagonal matrices,” J. of Fourier Anal. and App., vol. 21, [10] H. Bristow, A. Eriksson, and S. Lucey, “Fast convolutional sparse no. 5, pp. 1018–1033, 2015. coding,” in Proc. IEEE Conf. Comp. Vis. Pat. Recog. (CVPR), 2013, [39] C. Rusu and B. Dumitrescu, “Stagewise K-SVD to design efficient pp. 391–398. dictionaries for sparse representations,” IEEE Signal Processing Letters, [11] B. Kong and C. C. Fowlkes, “Fast convolutional sparse coding (FCSC),” vol. 19, no. 10, pp. 631–634, 2012. Tech. Rep., University of California, Irvine, 2014. [40] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins [12] C. Garcia-Cardona and B. Wohlberg, “Convolutional dictionary learning: University Press, 1996. A comparative review and new algorithms,” IEEE Trans. on Comp. [41] B. Wohlberg, “Efficient algorithms for convolutional sparse representa- Imag., vol. 4, no. 3, pp. 366–381, 2018. tions,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 301–315, 2016. [13] V. Papyan, J. Sulam, and M. Elad, “Working locally thinking globally: [42] H. G. Feichtinger and T. Strohmer, Advances in Gabor Analysis. Theoretical guarantees for convolutional sparse coding,” IEEE Trans. Birkhauser, 2003. Sig. Process., vol. 65, no. 21, pp. 5687–5701, 2017. [43] G. Pfander and H. Rauhut, “Sparsity in time-frequency representations,” [14] G. Pope, C. Aubel, and C. Studer, “Learning phase-invariant dictionar- J. Fourier Anal. Appl., vol. 16, pp. 233–260, 2010. ies,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2013, pp. [44] R. R. Coifman and M. V. Wickerhauser, “Entropy-based algorithms for 5979–5983. best basis selection,” IEEE Trans. Inf. Theory, vol. 38, no. 2, pp. 713– [15] C. Rusu, B. Dumitrescu, and S. A. Tsaftaris, “Explicit shift-invariant 718, 1992. dictionary learning,” IEEE Sig. Proc. Let., vol. 21, no. 1, pp. 6–9, 2014. [45] H. Hmam, “Quadratic optimisation with one quadratic equality con- [16] C. Rusu, R. Morisi, D. Boschetto, R. Dharmakumar, and S. A. Tsaftaris, straint,” Electronic Warfare and Radar Division DSTO Defence Science “Synthetic generation of myocardial blood-oxygen-level-dependent MRI and Technology Organisation, Australia, Report DSTO-TR2416, 2010. time series via structural sparse decomposition modeling,” IEEE Trans. [46] S. M. Guu and Y. C. Liou, “On a quadratic optimization problem with Med. Imaging, vol. 33, no. 7, pp. 1422–1433, 2014. 
equality constraints,” Journal of Optimization Theory and Applications, [17] B. Mailhe, S. Lesage, R. Gribonval, F. Bimbot, and P. Vandergheynst, vol. 98, no. 3, pp. 733–741, 1998. “Shift-invariant dictionary learning for sparse representations: extending [47] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching K-SVD,” in European Signal Processing Conference, 2008. pursuit: recursive function approximation with application to wavelet [18] T. Blumensath and M. Davies, “Sparse and shift-invariant representations decomposition,” in Asilomar Conf. on Signals, Systems and Comput., of music,” IEEE Trans. Audio Speech Lang. Process., vol. 14, no. 1, pp. 1993, pp. 40–44. 50–57, 2006. [48] R. Rubinstein, M. Zibulevsky, and M. Elad, “Efficient implementation [19] B. Wohlberg, “Efficient algorithms for convolutional sparse representa- of the K-SVD algorithm using batch orthogonal matching pursuit,” CS tions,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 301–315, 2016. Technion, 2008. [20] M. Aharon, “Overcomplete dictionaries for sparse representation of [49] G. B. Moody and R. G. Mark, “The impact of the MIT-BIH arrhythmia signals,” Ph.D. thesis, Technion - Israel Institute of Technology, 2006. database,” IEEE Eng. in Med. and Biol., vol. 20, no. 3, pp. 45–50, 2001. [21] J. J. Thiagarajan, K. N. Ramamurthy, and A. Spanias, “Shift-invariant sparse representation of images using learned dictionaries,” in IEEE Machine Learning for Signal Processing, 2008. [22] K. Skretting, J. Husoy, and S. Aase, “General design algorithm for sparse frame expansions,” Signal Process., vol. 86, pp. 117–126, 2006.