Sparse Estimation and Dictionary Learning (for Biostatistics?)

Julien Mairal

Biostatistics Seminar, UC Berkeley

What is this talk about?

Why sparsity, what for, and how?
- Feature learning / clustering / sparse PCA;
- selecting relevant features;
- signal and image processing: restoration, reconstruction;
- biostatistics: you tell me.

Part I: Sparse Estimation

Sparse Linear Model: Machine Learning Point of View

Let (y^i, x^i)_{i=1}^n be a training set, where the vectors x^i are in R^p and are called features. The scalars y^i are in {−1, +1} for binary classification problems, and in R for regression problems.

We assume there is a relation y ≈ w⊤x, and solve

min_{w ∈ R^p} (1/n) Σ_{i=1}^n ℓ(y^i, w⊤x^i) + λψ(w),

where the first term is the empirical risk and λψ(w) is the regularization.

Sparse Linear Models: Machine Learning Point of View

A few examples:

Ridge regression:     min_{w ∈ R^p} (1/n) Σ_{i=1}^n ½(y^i − w⊤x^i)² + λ‖w‖₂².
Linear SVM:           min_{w ∈ R^p} (1/n) Σ_{i=1}^n max(0, 1 − y^i w⊤x^i) + λ‖w‖₂².
Logistic regression:  min_{w ∈ R^p} (1/n) Σ_{i=1}^n log(1 + e^{−y^i w⊤x^i}) + λ‖w‖₂².

The squared ℓ2-norm induces “smoothness” in w. When one knows in advance that w should be sparse, one should use a sparsity-inducing regularization such as the ℓ1-norm. [Chen et al., 1999, Tibshirani, 1996]
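To make the contrast concrete, here is a minimal sketch (not from the slides) comparing the two penalties on synthetic data. It assumes scikit-learn's Ridge and Lasso estimators as stand-ins for the generic formulations above; the regularization values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:5] = rng.standard_normal(5)            # only 5 relevant features
y = X @ w_true + 0.1 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)             # squared l2 penalty: coefficients shrink, few exact zeros
lasso = Lasso(alpha=0.1).fit(X, y)             # l1 penalty: many coefficients are exactly zero

print("nonzeros (ridge):", np.sum(np.abs(ridge.coef_) > 1e-8))
print("nonzeros (lasso):", np.sum(np.abs(lasso.coef_) > 1e-8))
```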

Sparse Linear Models: Signal Processing Point of View

Let y ∈ R^n be a signal.

Let X = [x^1, ..., x^p] ∈ R^{n×p} be a set of normalized “basis vectors”. We call it a dictionary.

X is “adapted” to y if it can represent it with a few basis vectors, that is, there exists a sparse vector w ∈ R^p such that y ≈ Xw. We call w the sparse code.

[Figure: y ≈ Xw, with y ∈ R^n, X ∈ R^{n×p}, and w ∈ R^p sparse.]

The Sparse Decomposition Problem

min_{w ∈ R^p} ½‖y − Xw‖₂² + λψ(w)    (data-fitting term + sparsity-inducing regularization)

ψ induces sparsity in w. It can be:
- the ℓ0 “pseudo-norm”: ‖w‖₀ ≜ #{i s.t. w_i ≠ 0} (NP-hard);
- the ℓ1 norm: ‖w‖₁ ≜ Σ_{i=1}^p |w_i| (convex);
- ...

This is a selection problem. When ψ is the ℓ1-norm, the problem is called the Lasso [Tibshirani, 1996] or basis pursuit [Chen et al., 1999].

Why does the ℓ1-norm induce sparsity? Example: a quadratic problem in 1D

min_{w ∈ R} ½(u − w)² + λ|w|

Piecewise quadratic function with a kink at zero.

Derivatives at 0⁺: g⁺ = −u + λ, and at 0⁻: g⁻ = −u − λ.

Optimality conditions: w is optimal iff
- |w| > 0 and −(u − w) + λ sign(w) = 0, or
- w = 0, g⁺ ≥ 0, and g⁻ ≤ 0.

The solution is the soft-thresholding operator:

w⋆ = sign(u)(|u| − λ)⁺.

Why does the ℓ1-norm induce sparsity?

[Figure: (a) the soft-thresholding operator, w⋆ = sign(u)(|u| − λ)⁺, solution of min_w ½(u − w)² + λ|w|; (b) the hard-thresholding operator, w⋆ = 1_{|u| ≥ √(2λ)} u, solution of min_w ½(u − w)² + λ 1_{|w| > 0}.]
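A direct NumPy transcription of the two scalar operators (a sketch, not part of the slides; vectorized over u):

```python
import numpy as np

def soft_threshold(u, lam):
    # argmin_w 0.5*(u - w)**2 + lam*|w|
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def hard_threshold(u, lam):
    # argmin_w 0.5*(u - w)**2 + lam*1{w != 0}
    return np.where(np.abs(u) >= np.sqrt(2.0 * lam), u, 0.0)

u = np.linspace(-3, 3, 7)
print(soft_threshold(u, 1.0))   # shrinks everything, zeroes small entries
print(hard_threshold(u, 1.0))   # keeps large entries unchanged, zeroes the rest
```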

Why does the ℓ1-norm induce sparsity? Comparison with ℓ2-regularization in 1D

ψ(w) = w²  vs  ψ(w) = |w|

ψ′(w) = 2w for the quadratic penalty; ψ′₋(w) = −1 and ψ′₊(w) = +1 for the absolute value.

The gradient of the ℓ2-penalty vanishes when w gets close to 0. On its differentiable part, the gradient of the ℓ1-norm has constant magnitude.

Why does the ℓ1-norm induce sparsity? Physical illustration

[Figures: two mechanical systems, each with a spring of energy E1 = (k1/2)(y0 − y)², combined either with a second spring, E2 = (k2/2)y², or with a weight under gravity, E2 = mgy.]

With the quadratic energy E2 = (k2/2)y², the equilibrium displacement y is small but nonzero; with the linear energy E2 = mgy, the equilibrium can be exactly y = 0 !!

Regularizing with the ℓ1-norm

[Figure: the ℓ1-ball ‖w‖₁ ≤ T in the (w₁, w₂) plane.]

The projection onto a convex set is “biased” towards singularities.

Regularizing with the ℓ2-norm

[Figure: the ℓ2-ball ‖w‖₂ ≤ T in the (w₁, w₂) plane.]

The ℓ2-norm is isotropic.

Regularizing with the ℓ∞-norm

[Figure: the ℓ∞-ball ‖w‖_∞ ≤ T in the (w₁, w₂) plane.]

The ℓ∞-norm encourages |w₁| = |w₂|.

In 3D. (Copyright G. Obozinski)

What about more complicated norms? (Copyright G. Obozinski)

Examples of sparsity-inducing penalties

Exploiting concave functions with a kink at zero: ψ(w) = Σ_{i=1}^p φ(|w_i|).
- ℓq-“pseudo-norm”, with 0 < q < 1: ψ(w) ≜ Σ_{i=1}^p |w_i|^q;
- log penalty: ψ(w) ≜ Σ_{i=1}^p log(|w_i| + ε).

φ is any function that looks like this: [figure: φ(|w|) as a function of w, concave with a kink at zero].

Examples of sparsity-inducing penalties

Figure: Open balls in 2-D corresponding to several ℓq-norms and pseudo-norms: (c) ℓ0.5-ball, (d) ℓ1-ball, (e) ℓ2-ball.

Examples of sparsity-inducing penalties

The ℓ1-ℓ2 norm (group Lasso):

ψ(w) = Σ_{g∈G} ‖w_g‖₂ = Σ_{g∈G} (Σ_{j∈g} w_j²)^{1/2}, with G a partition of {1, ..., p};

it selects groups of variables [Yuan and Lin, 2006].

The fused Lasso [Tibshirani et al., 2005] or total variation [Rudin et al., 1992]:

ψ(w) = Σ_{j=1}^{p−1} |w_{j+1} − w_j|.

Extensions (out of the scope of this talk): hierarchical norms [Zhao et al., 2009]; structured sparsity [Jenatton et al., 2009, Jacob et al., 2009, Huang et al., 2009, Baraniuk et al., 2010].

Part II: Dictionary Learning and Matrix Factorization

Matrix Factorization and Clustering

Let us cluster some training vectors y^1, ..., y^m into p clusters using K-means:

min_{(x^j)_{j=1}^p, (l_i)_{i=1}^m} Σ_{i=1}^m ‖y^i − x^{l_i}‖₂².

It can be equivalently formulated as

min_{X ∈ R^{n×p}, W ∈ {0,1}^{p×m}} Σ_{i=1}^m ‖y^i − Xw^i‖₂² s.t. w^i ≥ 0 and Σ_{j=1}^p w^i_j = 1,

which is a matrix factorization problem:

min_{X ∈ R^{n×p}, W ∈ {0,1}^{p×m}} ‖Y − XW‖_F² s.t. W ≥ 0 and Σ_{j=1}^p w^i_j = 1.

Matrix Factorization and Clustering

Hard clustering:

min_{X ∈ R^{n×p}, W ∈ {0,1}^{p×m}} ‖Y − XW‖_F² s.t. W ≥ 0 and Σ_{j=1}^p w^i_j = 1,

where X = [x^1, ..., x^p] are the centroids of the p clusters.

Soft clustering:

min_{X ∈ R^{n×p}, W ∈ R^{p×m}} ‖Y − XW‖_F² s.t. W ≥ 0 and Σ_{j=1}^p w^i_j = 1,

where X = [x^1, ..., x^p] are the centroids of the p clusters.
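To make the hard-clustering formulation concrete, here is a toy NumPy version of Lloyd's algorithm written directly in the Y ≈ XW form (centroids as columns of X, one-hot assignment columns in W). This is an illustrative sketch under the notation above, not the talk's implementation.

```python
import numpy as np

def kmeans_matrix_factorization(Y, p, n_iter=20, seed=0):
    """Y: (n, m) data matrix, one signal per column; returns centroids X (n, p) and assignments W (p, m)."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    X = Y[:, rng.choice(m, size=p, replace=False)].astype(float).copy()  # init centroids from the data
    for _ in range(n_iter):
        # assignment step: each column of W is one-hot (W >= 0 and its entries sum to 1)
        dist = ((Y[:, None, :] - X[:, :, None]) ** 2).sum(axis=0)        # (p, m) squared distances
        labels = dist.argmin(axis=0)
        W = np.zeros((p, m))
        W[labels, np.arange(m)] = 1.0
        # centroid step: X minimizes ||Y - XW||_F^2 for fixed W
        counts = W.sum(axis=1)
        nonempty = counts > 0
        X[:, nonempty] = (Y @ W.T)[:, nonempty] / counts[nonempty]
    return X, W
```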

Other Matrix Factorization Problems: PCA

min_{X ∈ R^{m×p}, W ∈ R^{p×n}} ½‖Y − XW‖_F² s.t. X⊤X = I and WW⊤ is diagonal.

X = [x^1, ..., x^p] are the principal components.

Other Matrix Factorization Problems: Non-negative Matrix Factorization [Lee and Seung, 2001]

min_{X ∈ R^{m×p}, W ∈ R^{p×n}} ½‖Y − XW‖_F² s.t. W ≥ 0 and X ≥ 0.

Dictionary Learning and Matrix Factorization [Olshausen and Field, 1997]

min_{X ∈ X, W ∈ R^{p×n}} Σ_{i=1}^n ½‖y^i − Xw^i‖₂² + λ‖w^i‖₁,

which is again a matrix factorization problem:

min_{X ∈ X, W ∈ R^{p×n}} ½‖Y − XW‖_F² + λ‖W‖₁.

Why have a unified point of view?

min_{X ∈ X, W ∈ W} ½‖Y − XW‖_F² + λψ(W).

- Same framework for NMF, sparse PCA, dictionary learning, clustering, topic modelling;
- can play with various constraints/penalties on W (coefficients) and on X (loadings, dictionary, centroids);
- same algorithms (no need to reinvent the wheel): alternate minimization (see the sketch below), online learning [Mairal et al., 2010].
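As a concrete instance of the “same algorithms” point, here is a bare-bones alternating-minimization sketch for the dictionary-learning objective ½‖Y − XW‖_F² + λ‖W‖₁: a sparse-coding step (ISTA) alternated with a least-squares dictionary update and column renormalization. It is a toy batch illustration, not the online algorithm of [Mairal et al., 2010]; the parameter values are arbitrary.

```python
import numpy as np

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def dictionary_learning(Y, p, lam=0.1, n_outer=20, n_ista=50, seed=0):
    """Y: (m, n) matrix of n signals; returns X (m, p) with unit-norm columns and sparse codes W (p, n)."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    X = rng.standard_normal((m, p))
    X /= np.linalg.norm(X, axis=0, keepdims=True)
    W = np.zeros((p, n))
    for _ in range(n_outer):
        # sparse coding: minimize over W with X fixed, by ISTA
        L = np.linalg.norm(X, 2) ** 2                   # Lipschitz constant of the smooth part
        for _ in range(n_ista):
            W = soft(W - X.T @ (X @ W - Y) / L, lam / L)
        # dictionary update: least squares in X, then project columns back to unit norm
        X = Y @ W.T @ np.linalg.pinv(W @ W.T)
        X /= np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1e-12)
    return X, W
```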

Advertisement: SPAMS toolbox (open-source), C++, interfaced with Matlab, R, and Python.

- Proximal gradient methods for ℓ0, ℓ1, elastic-net, fused-Lasso, group-Lasso, tree group-Lasso, tree-ℓ0, sparse group-Lasso, overlapping group-Lasso... for square, logistic, and multi-class logistic loss functions;
- handles sparse matrices, provides duality gaps;
- fast implementations of OMP and LARS/homotopy;
- dictionary learning and matrix factorization (NMF, sparse PCA);
- coordinate descent and block coordinate descent algorithms;
- fast projections onto some convex sets.

Try it! http://www.di.ens.fr/willow/SPAMS/

Part III: A Few Image Processing Stories

The Image Denoising Problem

y = x_orig + w, where y is the measurements, x_orig the original image, and w the noise.

Sparse representations for image restoration

y = x_orig + w   (measurements = original image + noise)

Energy minimization problem (MAP estimation):

E(x) = ½‖y − x‖₂² + ψ(x),

where the first term is the relation to the measurements and ψ is the image model (−log prior).

Some classical priors:
- smoothness: λ‖Lx‖₂²;
- total variation: λ‖∇x‖₁;
- MRF priors;
- ...

Sparse representations for image restoration

Designed dictionaries
[Haar, 1910], [Zweig, Morlet, Grossman, ~70s], [Meyer, Mallat, Daubechies, Coifman, Donoho, Candes, ~80s-today]... (see [Mallat, 1999]): wavelets, curvelets, wedgelets, bandlets, ...lets.

Learned dictionaries of patches
[Olshausen and Field, 1997], [Engan et al., 1999], [Lewicki and Sejnowski, 2000], [Aharon et al., 2006], [Roth and Black, 2005], [Lee et al., 2007]:

min_{w_i, X ∈ C} Σ_i ½‖y_i − Xw_i‖₂² + λψ(w_i)    (reconstruction + sparsity)

with ψ(w) = ‖w‖₀ (“ℓ0 pseudo-norm”) or ψ(w) = ‖w‖₁ (ℓ1 norm).

Sparse representations for image restoration: solving the denoising problem [Elad and Aharon, 2006]
- Extract all overlapping 8×8 patches y_i.
- Solve a matrix factorization problem:
  min_{w_i, X ∈ C} Σ_{i=1}^n ½‖y_i − Xw_i‖₂² + λψ(w_i)    (reconstruction + sparsity), with n > 100,000.
- Average the reconstruction of each patch.
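A compact sketch of that pipeline (extract overlapping patches, sparse-code each one, average the overlapping reconstructions). It leans on scikit-learn's patch utilities and MiniBatchDictionaryLearning as convenient stand-ins for the K-SVD procedure of [Elad and Aharon, 2006]; the patch size follows the slide, the other parameters are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

def denoise(noisy, patch_size=(8, 8), n_atoms=256, alpha=1.0):
    patches = extract_patches_2d(noisy, patch_size)           # all overlapping 8x8 patches
    P = patches.reshape(len(patches), -1)
    mean = P.mean(axis=1, keepdims=True)                      # code the centered patches
    dico = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=alpha).fit(P - mean)
    codes = dico.transform(P - mean)                          # sparse codes w_i
    recon = codes @ dico.components_ + mean                   # X w_i for each patch
    return reconstruct_from_patches_2d(recon.reshape(patches.shape), noisy.shape)  # average overlaps
```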

Sparse representations for image restoration: K-SVD [Elad and Aharon, 2006]

Figure: Dictionary trained on a noisy version of the image boat.

Sparse representations for image restoration: grayscale vs color image patches

Sparse representations for image restoration [Mairal, Sapiro, and Elad, 2008b]

Sparse representations for image restoration: inpainting [Mairal, Elad, and Sapiro, 2008a]

Sparse representations for video restoration: key ideas for video processing [Protter and Elad, 2009]
- Using a 3D dictionary.
- Processing of many frames at the same time.
- Dictionary propagation.

Sparse representations for image restoration: inpainting [Mairal, Sapiro, and Elad, 2008b]

Figure: Inpainting results.

Sparse representations for image restoration: color video denoising [Mairal, Sapiro, and Elad, 2008b]

Figure: Denoising results (σ = 25).

Digital Zooming (Couzinie-Devy, 2010)

[Figures: two examples, each showing the original image, bicubic interpolation, and the proposed method.]

Inverse half-toning

[Figures: pairs of original and reconstructed images.]

Conclusion
- We have seen that many formulations are related to sparse regularized matrix factorization problems: PCA, sparse PCA, clustering, NMF, dictionary learning;
- we have seen successful stories in image processing, computer vision, and neuroscience;
- there exists efficient software for Matlab/R/Python.

References I

M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.

R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 2010. To appear.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

E. Candes. Compressive sampling. In Proceedings of the International Congress of Mathematicians, volume 3, 2006.

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.

References II

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, December 2006.

K. Engan, S. O. Aase, and J. H. Husoy. Frame based signal compression using method of optimal directions (MOD). In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, volume 4, 1999.

A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69:331–371, 1910.

J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

References III

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, 2009. Preprint arXiv:0904.3523v1.

D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 2001.

H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 801–808. MIT Press, Cambridge, MA, 2007.

M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337–365, 2000.

J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53–69, January 2008a.

References IV

J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modeling and Simulation, 7(1):214–241, April 2008b.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 2010.

S. Mallat. A Wavelet Tour of Signal Processing, Second Edition. Academic Press, New York, September 1999.

S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl., 27:372–376, 1983.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.

References V

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.

M. Protter and M. Elad. Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18(1):27–36, 2009.

S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

References VI

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67(1):91–108, 2005.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.

One short slide on compressed sensing

Important message: sparse coding is not “compressed sensing”. Compressed sensing is a theory [see Candes, 2006] saying that a sparse signal can be recovered from a few linear measurements under some conditions.
- Signal acquisition: Z⊤y, where Z ∈ R^{m×s} is a “sensing” matrix with s ≪ m.
- Signal decoding: min_{w ∈ R^p} ‖w‖₁ s.t. Z⊤y = Z⊤Xw,
with extensions to approximately sparse signals and noisy measurements.

Remark: the dictionaries we are using in this lecture do not satisfy the recovery assumptions of compressed sensing.

Greedy Algorithms

Several equivalent non-convex and NP-hard problems:

min_{w ∈ R^p} ½‖y − Xw‖₂² + λ‖w‖₀    (residual + regularization),
min_{w ∈ R^p} ‖y − Xw‖₂² s.t. ‖w‖₀ ≤ L,
min_{w ∈ R^p} ‖w‖₀ s.t. ‖y − Xw‖₂² ≤ ε.

The solution is often approximated with a greedy algorithm.
- Signal processing: Matching Pursuit [Mallat and Zhang, 1993], Orthogonal Matching Pursuit [?];
- statistics: L2-boosting, forward selection.

Matching Pursuit: illustration

[Figures: successive iterations of matching pursuit in R³ with dictionary {x¹, x², x³}: the coefficients evolve from w = (0, 0, 0) to (0, 0, 0.75), (0, 0.24, 0.75), and (0, 0.24, 0.65), while the residual r shrinks.]

Matching Pursuit

min_{w ∈ R^p} ‖y − Xw‖₂² s.t. ‖w‖₀ ≤ L

1: w ← 0.
2: r ← y (residual).
3: while ‖w‖₀ < L do
4:   Select the predictor with maximum correlation with the residual:
       î ← arg max_{i=1,...,p} |x^{i⊤} r|
5:   Update the residual and the coefficients:
       w_î ← w_î + x^{î⊤} r
       r ← r − (x^{î⊤} r) x^î
6: end while
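The loop above, as a NumPy sketch (columns of X are assumed to have unit ℓ2-norm, as in the slides):

```python
import numpy as np

def matching_pursuit(y, X, L):
    """Greedy approximation of min ||y - Xw||_2^2 s.t. ||w||_0 <= L (unit-norm columns assumed)."""
    w = np.zeros(X.shape[1])
    r = y.astype(float).copy()                    # residual
    while np.count_nonzero(w) < L:
        corr = X.T @ r
        i = np.argmax(np.abs(corr))               # atom most correlated with the residual
        if abs(corr[i]) < 1e-12:                  # nothing left to explain
            break
        w[i] += corr[i]
        r -= corr[i] * X[:, i]                    # update the residual
    return w
```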

Orthogonal Matching Pursuit: illustration

[Figures: successive iterations of OMP in R³: the active set grows from J = ∅ to {3} to {3, 2}, the coefficients from w = (0, 0, 0) to (0, 0, 0.75) to (0, 0.29, 0.63), and the residual is orthogonally projected onto the span of the selected atoms at each step.]

Orthogonal Matching Pursuit

min_{w ∈ R^p} ‖y − Xw‖₂² s.t. ‖w‖₀ ≤ L

1: J ← ∅.
2: for iter = 1, ..., L do
3:   Select the predictor which most reduces the objective:
       î ← arg min_{i ∈ J^C} { min_{w'} ‖y − X_{J∪{i}} w'‖₂² }
4:   Update the active set: J ← J ∪ {î}.
5:   Update the residual (orthogonal projection): r ← (I − X_J (X_J⊤ X_J)^{-1} X_J⊤) y.
6:   Update the coefficients: w_J ← (X_J⊤ X_J)^{-1} X_J⊤ y.
7: end for

Orthogonal Matching Pursuit: the keys for a good implementation
- If available, use the Gram matrix G = X⊤X;
- maintain the computation of X⊤r for each signal;
- maintain a Cholesky decomposition of (X_J⊤ X_J)^{-1} for each signal.

The total complexity for decomposing n L-sparse signals of size m with a dictionary of size p is

O(p²m) + O(nL³) + O(n(pm + pL²)) = O(np(m + L²)),
(Gram matrix)  (Cholesky)  (X⊤r)

It is also possible to use the matrix inversion lemma instead of a Cholesky decomposition (same complexity, but less numerical stability).
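For reference, a plain least-squares version of the OMP loop (a sketch: it recomputes the projection with lstsq at every step instead of maintaining the Cholesky factorization recommended above, and it uses the standard max-correlation selection rule rather than the slides' "most reduces the objective" rule; the two coincide for unit-norm atoms when only the new coefficient is refit):

```python
import numpy as np

def omp(y, X, L):
    """Orthogonal matching pursuit for min ||y - Xw||_2^2 s.t. ||w||_0 <= L."""
    w = np.zeros(X.shape[1])
    J = []                                             # active set
    r = y.astype(float).copy()
    for _ in range(L):
        i = int(np.argmax(np.abs(X.T @ r)))            # select by correlation with the residual
        J.append(i)
        wJ, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)   # orthogonal projection onto span(X_J)
        r = y - X[:, J] @ wJ
    w[J] = wJ
    return w
```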

Coordinate Descent for the Lasso

Coordinate descent + nonsmooth objective: WARNING, not convergent in general. Here, the problem is equivalent to a convex smooth optimization problem with separable constraints:

min_{w₊, w₋} ½‖y − X₊w₊ + X₋w₋‖₂² + λ w₊⊤1 + λ w₋⊤1 s.t. w₋, w₊ ≥ 0.

For this specific problem, coordinate descent is convergent. Assume the columns of X to have unit ℓ2-norm; updating coordinate i:

w_i ← arg min_{w ∈ R} ½‖y − Σ_{j≠i} w_j x^j − w x^i‖₂² + λ|w|
    = sign(x^{i⊤} r)(|x^{i⊤} r| − λ)⁺, where r ≜ y − Σ_{j≠i} w_j x^j,

⇒ soft-thresholding!
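The update in NumPy (a sketch cycling over coordinates, with unit ℓ2-norm columns assumed as above):

```python
import numpy as np

def cd_lasso(y, X, lam, n_sweeps=100):
    """Coordinate descent for min 0.5*||y - Xw||_2^2 + lam*||w||_1 (unit-norm columns assumed)."""
    p = X.shape[1]
    w = np.zeros(p)
    residual = y - X @ w
    for _ in range(n_sweeps):
        for i in range(p):
            r = residual + w[i] * X[:, i]                 # partial residual r = y - sum_{j!=i} w_j x^j
            c = X[:, i] @ r
            w_new = np.sign(c) * max(abs(c) - lam, 0.0)   # soft-thresholding of x_i^T r
            residual = r - w_new * X[:, i]
            w[i] = w_new
    return w
```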

First-order/proximal methods

min_{w ∈ R^p} f(w) + λψ(w)

f is strictly convex and differentiable with a Lipschitz gradient. This generalizes the idea of gradient descent:

w^{k+1} ← arg min_{w ∈ R^p} f(w^k) + ∇f(w^k)⊤(w − w^k) + (L/2)‖w − w^k‖₂² + λψ(w)
        = arg min_{w ∈ R^p} ½‖w − (w^k − (1/L)∇f(w^k))‖₂² + (λ/L)ψ(w),

where the first two terms form the linear approximation of f and the third is the quadratic term.

When λ = 0, w^{k+1} ← w^k − (1/L)∇f(w^k): this is equivalent to a classical gradient descent step.

First-order/proximal methods

They require solving efficiently the proximal operator

min_{w ∈ R^p} ½‖u − w‖₂² + λψ(w)

For the ℓ1-norm, this amounts to a soft-thresholding:

w⋆_i = sign(u_i)(|u_i| − λ)⁺.

There exist accelerated versions based on Nesterov's optimal first-order method (gradient method with “extrapolation”) [Beck and Teboulle, 2009, Nesterov, 2007, 1983], suited for large-scale experiments.
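Putting the last two slides together gives the ISTA iteration for the Lasso; a minimal NumPy sketch (the accelerated FISTA variant of [Beck and Teboulle, 2009] adds only an extrapolation step on top of this):

```python
import numpy as np

def ista_lasso(y, X, lam, n_iter=200):
    """Proximal gradient for min 0.5*||y - Xw||_2^2 + lam*||w||_1."""
    L = np.linalg.norm(X, 2) ** 2                 # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                  # gradient of the smooth part
        u = w - grad / L                          # gradient step
        w = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)   # proximal step = soft-thresholding
    return w
```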

Optimization for Grouped Sparsity

The proximal operator:

min_{w ∈ R^p} ½‖u − w‖₂² + λ Σ_{g∈G} ‖w_g‖_q

For q = 2:  w⋆_g = ((‖u_g‖₂ − λ)⁺ / ‖u_g‖₂) u_g, ∀g ∈ G.
For q = ∞:  w⋆_g = u_g − Π_{‖·‖₁ ≤ λ}[u_g], ∀g ∈ G.

These formulas generalize soft-thresholding to groups of variables. They are used in block-coordinate descent and proximal algorithms.
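The q = 2 case in NumPy, for a list of index groups (a sketch; the groups are assumed to form a partition of the coordinates):

```python
import numpy as np

def prox_group_l2(u, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2 (group soft-thresholding)."""
    w = np.zeros_like(u, dtype=float)
    for g in groups:
        norm = np.linalg.norm(u[g])
        if norm > lam:
            w[g] = (1.0 - lam / norm) * u[g]      # shrink the whole group towards zero
        # else: the whole group is set exactly to zero
    return w

print(prox_group_l2(np.array([3.0, 4.0, 0.1, -0.2]), [[0, 1], [2, 3]], lam=1.0))
```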

Smoothing Techniques: Reweighted ℓ2

Let us start from something simple:

a² − 2ab + b² ≥ 0,  hence  |a| ≤ ½(a²/b + b), with equality iff |a| = b,

and

‖w‖₁ = min_{η_j ≥ 0} ½ Σ_{j=1}^p (w[j]²/η_j + η_j).

The formulation becomes

min_{w, η_j ≥ ε} ½‖y − Xw‖₂² + (λ/2) Σ_{j=1}^p (w[j]²/η_j + η_j).
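The resulting algorithm alternates a weighted ridge step in w with the closed-form update η_j = max(|w_j|, ε); a minimal sketch of this reweighted-ℓ2 scheme:

```python
import numpy as np

def reweighted_l2_lasso(y, X, lam, eps=1e-3, n_iter=30):
    """Smoothed (reweighted-l2) scheme for the Lasso via the variational bound on ||w||_1."""
    p = X.shape[1]
    eta = np.ones(p)
    G, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        # minimize 0.5*||y - Xw||^2 + (lam/2) * sum_j w_j^2 / eta_j : a weighted ridge problem
        w = np.linalg.solve(G + lam * np.diag(1.0 / eta), Xty)
        # closed-form minimization over eta_j >= eps
        eta = np.maximum(np.abs(w), eps)
    return w
```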

DC (Difference of Convex) Programming

Remember? Concave functions with a kink at zero: ψ(w) = Σ_{i=1}^p φ(|w_i|).
- ℓq-“pseudo-norm”, with 0 < q < 1: ψ(w) ≜ Σ_{i=1}^p |w_i|^q;
- log penalty: ψ(w) ≜ Σ_{i=1}^p log(|w_i| + ε).

φ is any function that looks like this: [figure: φ(|w|) as a function of w, concave with a kink at zero].

Julien Mairal Sparse Estimation and Dictionary Learning Methods 99/69 φ′( w k ) w + C | | | |

w k | | w k f (w) | | φ(w) = log( w + ε) | |

f (w)+ φ′( w k ) w + C | | | |

w k | | f (w)+ φ( w ) | |

Figure: Illustration of the DC-programmingJulien Mairal Sparse approach. Estimation and The Dictionary non-conve Learning Methodsx part of100/69 DC (difference of convex) - Programming

min_{w ∈ R^p} f(w) + λ Σ_{i=1}^p φ(|w_i|).

This problem is non-convex: f is convex, and φ is concave on R₊. If w^k is the current estimate at iteration k, the algorithm solves

w^{k+1} ← arg min_{w ∈ R^p} f(w) + λ Σ_{i=1}^p φ′(|w^k_i|) |w_i|,

which is a reweighted-ℓ1 problem. Warning: it does not solve the non-convex problem; it only provides a stationary point. In practice, each iteration sets small coefficients to zero; after 2-3 iterations, the result does not change much.
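For the log penalty, φ′(|w^k_i|) = 1/(|w^k_i| + ε), so each outer iteration is a weighted Lasso; a minimal sketch reusing the ISTA iteration from above (all parameter values are illustrative):

```python
import numpy as np

def weighted_ista(y, X, weights, n_iter=200):
    # solves min 0.5*||y - Xw||_2^2 + sum_i weights_i * |w_i|
    L = np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = w - X.T @ (X @ w - y) / L
        w = np.sign(u) * np.maximum(np.abs(u) - weights / L, 0.0)
    return w

def dc_log_penalty(y, X, lam, eps=0.01, n_outer=3):
    """Reweighted-l1 (DC) scheme for min 0.5*||y - Xw||_2^2 + lam * sum_i log(|w_i| + eps)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_outer):
        weights = lam / (np.abs(w) + eps)     # phi'(|w_i^k|): linearization of the concave penalty
        w = weighted_ista(y, X, weights)
    return w
```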
