Sparse Estimation for Image and Vision Processing

Julien Mairal

Inria, Grenoble

SPARS summer school, Lisbon, December 2017

Course material (freely available on arXiv)

J. Mairal, F. Bach and J. Ponce. Sparse Modeling for Image and Vision Processing. Foundations and Trends in Computer Graphics and Vision. 2014.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1). 2012.

Outline

1 A short introduction to parsimony

2 Discovering the structure of natural images

3 Sparse models for image processing

4 Optimization for sparse estimation

5 Application cases

Part I: A Short Introduction to Parsimony

1 A short introduction to parsimony
Early thoughts
Sparsity in the statistics literature from the 60’s and 70’s
Wavelet thresholding in signal processing from the 90’s
The modern parsimony and the ℓ1-norm
Structured sparsity and sparse recovery

2 Discovering the structure of natural images

3 Sparse models for image processing

4 Optimization for sparse estimation

5 Application cases

Early thoughts

(a) Dorothy Wrinch (1894–1980). (b) Harold Jeffreys (1891–1989).

The existence of simple laws is, then, apparently, to be regarded as a quality of nature; and accordingly we may infer that it is justifiable to prefer a simple law to a more complex one that fits our observations slightly better. [Wrinch and Jeffreys, 1921]. Philosophical Magazine Series.

Historical overview of parsimony

14th century: Ockham’s razor; 1921: Wrinch and Jeffreys’ simplicity principle; 1952: Markowitz’s portfolio selection; 60’s and 70’s: best subset selection in statistics;

70’s: use of the ℓ1-norm for signal recovery in geophysics; 90’s: wavelet thresholding in signal processing; 1996: Olshausen and Field’s dictionary learning; 1996–1999: Lasso (statistics) and basis pursuit (signal processing); 2006: compressed sensing (signal processing) and Lasso consistency (statistics); 2006–now: applications of dictionary learning in various scientific fields such as image processing and computer vision.

Julien Mairal Sparse Estimation for Image and Vision Processing 7/187 Sparsity in the statistics literature from the 60’s and 70’s

Given some observed data points z1,..., zn that are assumed to be independent samples from a statistical model with parameters θ in Rp, maximum likelihood estimation (MLE) consists of minimizing

min_{θ∈ℝ^p} L(θ) ≜ −Σ_{i=1}^n log P_θ(z_i).

Example: ordinary least squares

Observations z_i = (y_i, x_i), with y_i in ℝ. Linear model: y_i = x_i^⊤θ + ε_i, with ε_i ∼ N(0, 1). The MLE then reduces to
min_{θ∈ℝ^p} (1/2) Σ_{i=1}^n (y_i − x_i^⊤θ)^2.

Julien Mairal Sparse Estimation for Image and Vision Processing 8/187 Sparsity in the statistics literature from the 60’s and 70’s


Motivation for finding a sparse solution: removing irrelevant variables from the model; obtaining an easier interpretation; preventing overfitting;

Julien Mairal Sparse Estimation for Image and Vision Processing 8/187 Sparsity in the statistics literature from the 60’s and 70’s

Sparsity is obtained by constraining the MLE problem to at most k non-zero coefficients:
min_{θ∈ℝ^p} L(θ) s.t. ‖θ‖_0 ≤ k.

Two questions: 1 how to choose k? 2 how to find the best subset of k variables?

Julien Mairal Sparse Estimation for Image and Vision Processing 8/187 Sparsity in the statistics literature from the 60’s and 70’s How to choose k?

Mallows’s C_p statistic [Mallows, 1964, 1966]; Akaike information criterion (AIC) [Akaike, 1973]; Bayesian information criterion (BIC) [Schwarz, 1978]; minimum description length (MDL) [Rissanen, 1978].

These approaches lead to penalized problems

min_{θ∈ℝ^p} L(θ) + λ‖θ‖_0,
with different choices of λ depending on the chosen criterion.

Sparsity in the statistics literature from the 60’s and 70’s. How to solve the best k-subset selection problem? Unfortunately, the problem is NP-hard [Natarajan, 1995].

Two strategies:
combinatorial exploration with branch-and-bound techniques [Furnival and Wilson, 1974] (leaps and bounds): an exact algorithm but with exponential complexity;
greedy approach: forward selection [Efroymson, 1960] (originally developed for observing intermediate solutions), which already contains all the ideas of matching pursuit algorithms (see the sketch below).
Important reference: [Hocking, 1976]. The analysis and selection of variables in linear regression. Biometrics.
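To make the greedy strategy concrete, here is a minimal NumPy sketch of forward selection with refitting (function and variable names are mine, not from the lecture); it already resembles orthogonal matching pursuit:

import numpy as np

def forward_selection(X, y, k):
    """Greedy best-subset heuristic: repeatedly add the variable most
    correlated with the current residual, then refit least squares (k >= 1)."""
    n, p = X.shape
    active, residual = [], y.astype(float).copy()
    theta_active = np.zeros(0)
    for _ in range(k):
        scores = np.abs(X.T @ residual)
        scores[active] = -np.inf                    # never re-select a variable
        j = int(np.argmax(scores))
        active.append(j)
        theta_active, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ theta_active  # refit and update the residual
    theta = np.zeros(p)
    theta[active] = theta_active
    return theta, active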

Julien Mairal Sparse Estimation for Image and Vision Processing 10/187 Wavelet thresholding in signal processing from the 90’s

A wavelet basis represents a set of functions ϕ_1, ϕ_2, ... that are essentially dilated and shifted versions of each other [see Mallat, 2008]. Concept of parsimony with wavelets

When a signal f is “smooth”, it is close to an expansion Σ_i α_i ϕ_i where only a few coefficients α_i are non-zero.

Figure: (a) Meyer’s wavelet; (b) Morlet’s wavelet.

Wavelet thresholding in signal processing from the 90’s
Wavelets were the topic of a long quest for representing natural images: 2D Gabors [Daugman, 1985]; steerable wavelets [Simoncelli et al., 1992]; curvelets [Candès and Donoho, 2002]; contourlets [Do and Vetterli, 2003]; bandlets [Le Pennec and Mallat, 2005]; ⋆-lets (joke).

(a) 2D Gabor filter. (b) With shifted phase. (c) With rotation.

Wavelet thresholding in signal processing from the 90’s
The theory of wavelets is well developed for continuous signals, e.g., in L2(ℝ), but also for discrete signals x in ℝ^n.

Wavelet thresholding in signal processing from the 90’s
Given an orthogonal wavelet basis D = [d_1, ..., d_n] in ℝ^{n×n}, the wavelet decomposition of x in ℝ^n is simply

β = D⊤x and we have x = Dβ.

The k-sparse approximation problem

min_{α∈ℝ^p} (1/2)‖x − Dα‖_2^2 s.t. ‖α‖_0 ≤ k,
is not NP-hard here: since D is orthogonal, it is equivalent to

min_{α∈ℝ^p} (1/2)‖β − α‖_2^2 s.t. ‖α‖_0 ≤ k.

The solution is obtained by hard-thresholding:

α^{ht}[j] = δ_{|β[j]| ≥ µ} β[j] = { β[j] if |β[j]| ≥ µ; 0 otherwise },
where µ is the k-th largest value among the set {|β[1]|, ..., |β[p]|}.

Julien Mairal Sparse Estimation for Image and Vision Processing 14/187 Wavelet thresholding in signal processing, 90’s Another key operator introduced by Donoho and Johnstone [1994] is the soft-thresholding operator:

α^{st}[j] ≜ sign(β[j]) max(|β[j]| − λ, 0) = { β[j] − λ if β[j] ≥ λ; β[j] + λ if β[j] ≤ −λ; 0 otherwise },
where λ is a parameter playing the same role as µ previously. With β ≜ D^⊤x and D orthogonal, it provides the solution of the following sparse reconstruction problem:

min_{α∈ℝ^p} (1/2)‖x − Dα‖_2^2 + λ‖α‖_1,
which will be of high importance later.
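A minimal NumPy sketch of the two thresholding operators (function names are mine):

import numpy as np

def hard_threshold(beta, k):
    """Keep the k largest coefficients in magnitude, set the others to zero."""
    alpha = np.zeros_like(beta, dtype=float)
    idx = np.argsort(np.abs(beta))[-k:]
    alpha[idx] = beta[idx]
    return alpha

def soft_threshold(beta, lam):
    """Entrywise sign(beta) * max(|beta| - lam, 0)."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

With an orthogonal dictionary D, denoising a signal x then amounts to computing beta = D.T @ x, thresholding, and reconstructing D @ alpha.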

Julien Mairal Sparse Estimation for Image and Vision Processing 15/187 Wavelet thresholding in signal processing, 90’s

(d) Soft-thresholding operator, α^{st} = sign(β) max(|β| − λ, 0). (e) Hard-thresholding operator, α^{ht} = δ_{|β|≥µ} β.

Figure: Soft- and hard-thresholding operators, which are commonly used for signal estimation with orthogonal wavelet basis.

Julien Mairal Sparse Estimation for Image and Vision Processing 16/187 Wavelet thresholding in signal processing, 90’s Various work tried to exploit the structure of wavelet coefficients.


Figure: Illustration of a wavelet tree with four scales for one-dimensional signals. We also illustrate the zero-tree coding scheme [Shapiro, 1993].

Julien Mairal Sparse Estimation for Image and Vision Processing 17/187 Wavelet thresholding in signal processing, 90’s To model spatial relations, it is possible to define some (non-overlapping) groups of wavelet coefficients, and define a group soft-thresholding G operator [Hall et al., 1999, Cai, 1999]. For every group g in , G λ gt △ 1 β g β[g] if β[g] 2 λ α [g] = − k [ ]k2 k k ≥ , ( 0 otherwise

where β[g] is the vector of size g recording the entries of β in g. | | △ With β = D⊤x and D orthogonal, it is in fact the solution of

1 2 min x Dα 2 + λ α[g] 2, α∈Rp 2k − k k k g X∈G which will be of interest later in the lecture.
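A minimal NumPy sketch of the group soft-thresholding operator, assuming the groups are given as a list of index arrays forming a partition of the coefficients:

import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Block-wise shrinkage: scale each group toward zero, or kill it entirely."""
    alpha = np.zeros_like(beta, dtype=float)
    for g in groups:
        norm_g = np.linalg.norm(beta[g])
        if norm_g >= lam:
            alpha[g] = (1.0 - lam / norm_g) * beta[g]
    return alpha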

Julien Mairal Sparse Estimation for Image and Vision Processing 18/187 The modern parsimony and the ℓ1-norm Sparse linear models in signal processing

Let x in Rn be a signal.

Let D = [d_1, ..., d_p] ∈ ℝ^{n×p} be a set of elementary signals. We call it a dictionary.

D is “adapted” to x if it can represent it with a few elements, that is, there exists a sparse vector α in ℝ^p such that x ≈ Dα. We call α the sparse code:
x ≈ [d_1 d_2 ⋯ d_p] [α[1], α[2], ..., α[p]]^⊤, with x ∈ ℝ^n, D ∈ ℝ^{n×p}, and α ∈ ℝ^p sparse.

The modern parsimony and the ℓ1-norm Sparse linear models: machine learning/statistics point of view
Let (y_i, x_i)_{i=1}^n be a training set, where the vectors x_i are in ℝ^p and are called features. The scalars y_i are in {−1, +1} for binary classification problems, and in ℝ for regression problems.

We assume there exists a relation y ≈ β^⊤x, and solve
min_{β∈ℝ^p} (1/n) Σ_{i=1}^n L(y_i, β^⊤x_i) + λψ(β),
where the first term is the empirical risk and the second the regularization.

The modern parsimony and the ℓ1-norm Sparse linear models: machine learning/statistics point of view
A few examples:
Ridge regression: min_{β∈ℝ^p} (1/n) Σ_{i=1}^n (1/2)(y_i − β^⊤x_i)^2 + λ‖β‖_2^2.
Linear SVM: min_{β∈ℝ^p} (1/n) Σ_{i=1}^n max(0, 1 − y_i β^⊤x_i) + λ‖β‖_2^2.
Logistic regression: min_{β∈ℝ^p} (1/n) Σ_{i=1}^n log(1 + e^{−y_i β^⊤x_i}) + λ‖β‖_2^2.

The squared ℓ2-norm induces “smoothness” in β. When one knows in advance that β should be sparse, one should use a sparsity-inducing regularization such as the ℓ1-norm [Chen et al., 1999, Tibshirani, 1996].

Julien Mairal Sparse Estimation for Image and Vision Processing 22/187 The modern parsimony and the ℓ1-norm Originally used to induce sparsity in geophysics [Claerbout and Muir, 1973, Taylor et al., 1979], the ℓ1-norm became popular in statistics with the Lasso [Tibshirani, 1996] and in signal processing with the Basis pursuit [Chen et al., 1999]. Three “equivalent” formulations

1 min_{α∈ℝ^p} (1/2)‖x − Dα‖_2^2 + λ‖α‖_1;

2 min_{α∈ℝ^p} (1/2)‖x − Dα‖_2^2 s.t. ‖α‖_1 ≤ µ;

3 min_{α∈ℝ^p} ‖α‖_1 s.t. ‖x − Dα‖_2^2 ≤ ε.
(A numerical sketch of formulation 1 is given below.)
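For a general (non-orthogonal) dictionary, formulation 1 has no closed form; a minimal proximal-gradient (ISTA-style) sketch in NumPy, with illustrative step size and iteration count:

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(x, D, lam, n_iter=200):
    """Proximal gradient for min_a 0.5*||x - D a||_2^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the quadratic term's gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha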

Julien Mairal Sparse Estimation for Image and Vision Processing 23/187 The modern parsimony and the ℓ1-norm And some variants... For noiseless problems

min_{α∈ℝ^p} ‖α‖_1 s.t. x = Dα.

Beyond least squares

min_{α∈ℝ^p} f(α) + λ‖α‖_1,
where f : ℝ^p → ℝ is convex.

An important question remains:

why does the ℓ1-norm induce sparsity?

Julien Mairal Sparse Estimation for Image and Vision Processing 24/187 The modern parsimony and the ℓ1-norm

Why does the ℓ1-norm induce sparsity? Can we get some intuition from the simplest isotropic case?

α̂(λ) = argmin_{α∈ℝ^p} (1/2)‖x − α‖_2^2 + λ‖α‖_1,

or equivalently the Euclidean projection onto the ℓ1-ball?

α̃(µ) = argmin_{α∈ℝ^p} (1/2)‖x − α‖_2^2 s.t. ‖α‖_1 ≤ µ.

“equivalent” means that for all λ > 0, there exists µ ≥ 0 such that α̃(µ) = α̂(λ).

The relation between µ and λ is unknown a priori.

Julien Mairal Sparse Estimation for Image and Vision Processing 25/187 Why does the ℓ1-norm induce sparsity? Regularizing with the ℓ1-norm

Figure: the ℓ1-ball {α ∈ ℝ² : ‖α‖_1 ≤ µ}.

The projection onto a convex set is “biased” towards singularities.

Julien Mairal Sparse Estimation for Image and Vision Processing 26/187 Why does the ℓ1-norm induce sparsity? Regularizing with the ℓ2-norm

Figure: the ℓ2-ball {α ∈ ℝ² : ‖α‖_2 ≤ µ}.

The ℓ2-norm is isotropic.

Julien Mairal Sparse Estimation for Image and Vision Processing 27/187 Why does the ℓ1-norm induce sparsity? In 3D. (images produced by G. Obozinski)

Julien Mairal Sparse Estimation for Image and Vision Processing 28/187 Why does the ℓ1-norm induce sparsity? Regularizing with the ℓ∞-norm

Figure: the ℓ∞-ball {α ∈ ℝ² : ‖α‖_∞ ≤ µ}.

The ℓ∞-norm encourages |α[1]| = |α[2]|.

Julien Mairal Sparse Estimation for Image and Vision Processing 29/187 Why does the ℓ1-norm induce sparsity? Analytical point of view: 1D case

min_{α∈ℝ} (1/2)(x − α)^2 + λ|α|.
Piecewise quadratic function with a kink at zero.

Derivative at 0_+: g_+ = −x + λ, and at 0_−: g_− = −x − λ.
Optimality conditions: α is optimal iff either |α| > 0 and −(x − α) + λ sign(α) = 0, or α = 0 and g_+ ≥ 0 and g_− ≤ 0.
The solution is a soft-thresholding:

α^⋆ = sign(x)(|x| − λ)_+.

Julien Mairal Sparse Estimation for Image and Vision Processing 30/187 Why does the ℓ1-norm induce sparsity? Comparison with ℓ2-regularization in 1D

ψ(α) = |α| vs. ψ(α) = α^2

ψ'_−(α) = −1, ψ'_+(α) = 1 vs. ψ'(α) = 2α

The gradient of the ℓ2-penalty vanishes when α gets close to 0. On its differentiable part, the gradient of the ℓ1-norm has constant magnitude.

Julien Mairal Sparse Estimation for Image and Vision Processing 31/187 Why does the ℓ1-norm induce sparsity? Physical illustration

Figure: physical illustration with springs. In both configurations, the first potential is elastic, E_1 = (k_1/2)(x − α)^2. When the second potential is also elastic, E_2 = (k_2/2)α^2 (quadratic penalty), the equilibrium displacement α is small but non-zero; when it is gravitational, E_2 = mgα (linear penalty), the equilibrium can be exactly α = 0.

Julien Mairal Sparse Estimation for Image and Vision Processing 32/187 Why does the ℓ1-norm induce sparsity?

Figure: The regularization path of the Lasso: the coefficient values (w_1, ..., w_5) of the solution of
min_{α∈ℝ^p} (1/2)‖x − Dα‖_2^2 + λ‖α‖_1
plotted as a function of λ.
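Such a path can also be traced numerically; a sketch assuming scikit-learn is available (the data below are random stand-ins):

import numpy as np
from sklearn.linear_model import lasso_path

D = np.random.randn(100, 20)                      # stand-in dictionary / design matrix
x = D @ np.random.randn(20) + 0.1 * np.random.randn(100)
alphas, coefs, _ = lasso_path(D, x, n_alphas=100)
# coefs has shape (20, 100): one column of coefficients per regularization level,
# to be plotted against alphas as in the figure above. Note that scikit-learn
# divides the quadratic term by the number of samples, so its alpha corresponds
# to lambda / n_samples in the formulation above.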

Julien Mairal Sparse Estimation for Image and Vision Processing 33/187 Non-convex sparsity-inducing penalties

Exploiting concave functions with a kink at zero: ψ(α) = Σ_{j=1}^p ϕ(|α[j]|).
ℓq-penalty, with 0 < q < 1: ψ(α) ≜ Σ_{j=1}^p |α[j]|^q [Frank and Friedman, 1993];
log penalty: ψ(α) ≜ Σ_{j=1}^p log(|α[j]| + ε) [Candès et al., 2008].
ϕ is any concave function of |α| with a kink at zero (the plot of ϕ(|α|) is omitted).

Julien Mairal Sparse Estimation for Image and Vision Processing 34/187 Non-convex sparsity-inducing penalties

(a) ℓ0.5-ball, 2-D (b) ℓ1-ball, 2-D (c) ℓ2-ball, 2-D

Figure: Open balls in 2-D corresponding to several ℓq-norms and pseudo-norms.

Julien Mairal Sparse Estimation for Image and Vision Processing 35/187 Non-convex sparsity-inducing penalties

Figure: the ℓq-ball {α ∈ ℝ² : ‖α‖_q ≤ µ} with q < 1.

Julien Mairal Sparse Estimation for Image and Vision Processing 36/187 Elastic-net The elastic net introduced by [Zou and Hastie, 2005]

ψ(α) = ‖α‖_1 + γ‖α‖_2^2.
The penalty provides more stable (but less sparse) solutions.

(a) ℓ1-ball, 2-D (b) elastic-net, 2-D (c) ℓ2-ball, 2-D

The elastic-net vs other penalties

Figure: comparison in 2-D of the ℓq-ball with q < 1 (‖α‖_q ≤ µ), the ℓ1-ball (‖α‖_1 ≤ µ), the elastic-net ball ((1 − γ)‖α‖_1 + γ‖α‖_2^2 ≤ µ), and the ℓ2-ball.

Total variation and fused Lasso
The anisotropic total variation [Rudin et al., 1992],
ψ(α) = Σ_{j=1}^{p−1} |α[j+1] − α[j]|,
is called fused Lasso in statistics [Tibshirani et al., 2005]. The penalty encourages piecewise constant signals (and can be extended to images).

Image borrowed from a talk of J.-P. Vert, representing DNA copy numbers.

Group Lasso and mixed norms [Turlach et al., 2005, Yuan and Lin, 2006, Zhao et al., 2009] [Grandvalet and Canu, 1999, Bakin, 1999]

the ℓ1/ℓq-norm: ψ(α) = Σ_{g∈G} ‖α[g]‖_q.
G is a partition of {1, ..., p};
q = 2 or q = ∞ in practice;
it can be interpreted as the ℓ1-norm of the vector [‖α[g]‖_q]_{g∈G}.

Example: ψ(α) = ‖α[{1, 2}]‖_2 + |α[3]|.

Julien Mairal Sparse Estimation for Image and Vision Processing 40/187 Spectral sparsity [Fazel et al., 2001, Srebro et al., 2005] A natural regularization function for matrices is the rank

rank(A) ≜ |{j : s_j(A) ≠ 0}| = ‖s(A)‖_0,

where sj is the j-th singular value and s is the spectrum of A. A successful convex relaxation of the rank is the sum of singular values

‖A‖_* ≜ Σ_{j=1}^p s_j(A) = ‖s(A)‖_1, for A in ℝ^{p×k} with k ≥ p.
The resulting function is a norm, called the trace or nuclear norm.
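The proximal operator of the trace norm soft-thresholds the singular values; a small NumPy sketch (the threshold value is illustrative):

import numpy as np

def prox_nuclear(A, lam):
    """Soft-threshold the singular values of A: proximal operator of lam * ||.||_*."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def nuclear_norm(A):
    """Sum of the singular values of A."""
    return np.linalg.svd(A, compute_uv=False).sum()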

Julien Mairal Sparse Estimation for Image and Vision Processing 41/187 Structured sparsity images produced by G. Obozinski

Julien Mairal Sparse Estimation for Image and Vision Processing 42/187 Structured sparsity images produced by G. Obozinski

Julien Mairal Sparse Estimation for Image and Vision Processing 43/187 Structured sparsity Metabolic network of the budding yeast from Rapaport et al. [2007]

Julien Mairal Sparse Estimation for Image and Vision Processing 44/187 Structured sparsity Metabolic network of the budding yeast from Rapaport et al. [2007]

Julien Mairal Sparse Estimation for Image and Vision Processing 45/187 Structured sparsity Warning: Under the name “structured sparsity” appear in fact significantly different formulations!

1 non-convex: zero-tree wavelets [Shapiro, 1993]; predefined collection of sparsity patterns [Baraniuk et al., 2010]; selection of a union of groups [Huang et al., 2009]; structure via Markov random fields [Cevher et al., 2008];
2 convex (norms): tree structure [Zhao et al., 2009]; selection of a union of groups [Jacob et al., 2009]; zero-pattern as a union of groups [Jenatton et al., 2011a]; other norms [Micchelli et al., 2013].

Julien Mairal Sparse Estimation for Image and Vision Processing 46/187 Structured sparsity Group Lasso with overlapping groups [Jenatton et al., 2011a]

ψ(α) = Σ_{g∈G} ‖α[g]‖_q.
What happens when the groups overlap?

the pattern of non-zero variables is an intersection of groups; the zero pattern is a union of groups.

Example: ψ(α) = ‖α‖_2 + |α[2]| + |α[3]|.

Julien Mairal Sparse Estimation for Image and Vision Processing 47/187 Structured sparsity Group Lasso with overlapping groups [Jenatton et al., 2011a] Examples of set of groups G Selection of contiguous patterns on a sequence, p = 6.

G is the set of blue groups. Any union of blue groups set to zero leads to the selection of a contiguous pattern.

Structured sparsity Group Lasso with overlapping groups [Jenatton et al., 2011a] Examples of sets of groups G Selection of rectangles on a 2-D grid, p = 25.

G is the set of blue/green groups (with their complements not displayed).

Any union of blue/green groups set to zero leads to the selection of a rectangle.

Structured sparsity Group Lasso with overlapping groups [Jenatton et al., 2011a]

Examples of sets of groups G Selection of diamond-shaped patterns on a 2-D grid, p = 25.

It is possible to extend such settings to 3-D spaces, or to more complex topologies.

Julien Mairal Sparse Estimation for Image and Vision Processing 48/187 Structured sparsity Hierarchical norms [Zhao et al., 2009].

A node can be active only if its ancestors are active. The selected patterns are rooted subtrees.

Julien Mairal Sparse Estimation for Image and Vision Processing 49/187 Structured sparsity Hierarchical norms [Zhao et al., 2009].

(d) Sparsity. (e) Group sparsity. (f) Hierarchical sparsity.

Julien Mairal Sparse Estimation for Image and Vision Processing 50/187 Structured sparsity the non-convex penalty of Huang et al. [2009] Warning: different point of view than in the previous three slides

ϕ(α) ≜ min_{J⊆G} { Σ_{g∈J} η_g s.t. Supp(α) ⊆ ∪_{g∈J} g }.
The penalty is non-convex and is NP-hard to compute (set cover problem). The pattern of non-zeroes in α is a union of (a few) groups.

It can be rewritten as a boolean linear program:

ϕ(α) = min_{x∈{0,1}^{|G|}} { η^⊤x s.t. Nx ≥ Supp(α) }.

Julien Mairal Sparse Estimation for Image and Vision Processing 51/187 Structured sparsity convex relaxation and the penalty of Jacob et al. [2009] The penalty of Huang et al. [2009]:

ϕ(α) = min_{x∈{0,1}^{|G|}} { η^⊤x s.t. Nx ≥ Supp(α) }.

ψ(α) ≜ min_{x∈ℝ_+^{|G|}} { η^⊤x s.t. Nx ≥ |α| }.

Lemma: ψ is the penalty of Jacob et al. [2009] with the ℓ∞-norm:

ψ(α) = min_{(β^g∈ℝ^p)_{g∈G}} Σ_{g∈G} η_g ‖β^g‖_∞ s.t. α = Σ_{g∈G} β^g and ∀g, Supp(β^g) ⊆ g.

Julien Mairal Sparse Estimation for Image and Vision Processing 52/187 Structured sparsity The norm of Jacob et al. [2009] in 3D

ψ(α) with G = {{1, 2}, {2, 3}, {1, 3}}.

Sparse recovery: theoretical results for the Lasso (the three upcoming slides are inspired from a lecture of G. Obozinski given at Hólar in 2010)

Given some observations (y_i, x_i)_{i=1,...,n}, with y_i in ℝ, assume that the linear model y_i = x_i^⊤θ + ε_i is valid, with ε_i ∼ N(0, σ^2). Given an estimate θ̂, three main problems:
1 Regular consistency: convergence of the estimator θ̂ to θ, i.e., ‖θ̂ − θ‖_2 tends to zero when n tends to ∞;
2 Model selection consistency: convergence of the sparsity pattern of θ̂ to the pattern of θ;
3 Efficiency: convergence of the predictions with θ̂ to the predictions with θ, i.e., (1/n)‖Xθ̂ − Xθ‖_2^2 tends to zero.

Julien Mairal Sparse Estimation for Image and Vision Processing 54/187 Sparse recovery: theoretical results for the Lasso Conditions on the design for success Restricted Isometry Property (RIP)

√(1 − δ_k) ‖θ‖ ≤ ‖Xθ‖ ≤ √(1 + δ_k) ‖θ‖ for all ‖θ‖_0 ≤ k.
Subsets of size k of the columns of X should be close to orthogonal.
Mutual Incoherence Property (MIP)

max_{i≠j} |x_i^⊤ x_j| < µ.

Irrepresentable condition (IC)

‖Q_{J^c J} Q_{JJ}^{−1} sign(θ_J)‖_∞ ≤ 1 − γ, with Q_{JJ'} = X_J^⊤ X_{J'}.
Restricted Eigenvalue condition (RE)
κ(k) = min_{|J| ≤ k} min_{∆ : ‖∆_{J^c}‖_1 ≤ ‖∆_J‖_1} ∆^⊤Q∆ / ‖∆_J‖_2^2 > 0.

Assume θ sparse and denote by J = {j : θ[j] ≠ 0} its nonzero pattern. Irrepresentable Condition (γ) [Zhao and Yu, 2006, Wainwright, 2009]

‖Q_{J^c J} Q_{JJ}^{−1} sign(θ_J)‖_∞ ≤ 1 − γ,
where Q = lim_{n→+∞} (1/n) Σ_{i=1}^n x_i x_i^⊤ ∈ ℝ^{p×p} (covariance matrix). Note that the condition depends on θ and J.

Julien Mairal Sparse Estimation for Image and Vision Processing 56/187 Sparse recovery: theoretical results for the Lasso Model selection consistency (Lasso)

Theorem (model selection for classical asymptotics, i.e., p fixed): IC(0) is necessary and IC(γ) for γ > 0 is sufficient for model selection consistency.
High dimension (p → +∞): additional requirements: a sample size condition n > k log p; a lower bound on the magnitude of the nonzero θ[j]; see Bühlmann and Van De Geer [2011] for a review.

Compressed sensing
Compressed sensing [Candès et al., 2006] says that an s-sparse signal α⋆ in ℝ^p can be exactly recovered by observing x = Dα⋆ and solving the linear program

min_{α∈ℝ^p} ‖α‖_1 s.t. x = Dα,

where D satisfies the RIP assumption with δ_{2s} ≤ √2 − 1. Moreover, the convex relaxation is exact. Matrices D in ℝ^{m×p} satisfying the RIP assumption can be obtained with simple random sampling schemes with

m = O(s log(p/s)).

Julien Mairal Sparse Estimation for Image and Vision Processing 57/187 Compressed sensing and sparse recovery

Remarks The theory also admits extensions to approximately sparse signals, noisy measurements... extensions where D is replaced by Z⊤D where Z is random and D deterministic; the dictionaries we are using in this lecture do not satisfy RIP; sparse estimation and sparse coding is not compressed sensing.

Julien Mairal Sparse Estimation for Image and Vision Processing 58/187 Sparse recovery and compressed sensing

Some thoughts from Hocking [1976]:

The problem of selecting a subset of independent or predictor variables is usually described in an idealized setting. That is, it is assumed that (a) the analyst has data on a large number of potential variables which include all relevant variables and appropriate functions of them plus, possibly, some other extraneous variables and variable functions and (b) the analyst has available “good” data on which to base the eventual conclusions. In practice, the lack of satisfaction of these assumptions may make a detailed subset selection analysis a meaningless exercise.

Julien Mairal Sparse Estimation for Image and Vision Processing 59/187 Conclusions from the first part

the sparsity principle has been used for a long time, and it is not a recent idea; there are numerous ways of designing sparse regularization functions adapted to a particular problem. Choosing the best one is not easy and requires some domain knowledge; the dictionaries we will use in this lecture almost never satisfy the theoretical assumptions ensuring sparse recovery.

Other take-home messages:

sparsity is not always good. If possible, try ℓ2 before trying ℓ1;

convexity is not always good. When trying ℓ1, try also ℓ0.

Julien Mairal Sparse Estimation for Image and Vision Processing 60/187 Part II: Discovering the structure of natural images

Julien Mairal Sparse Estimation for Image and Vision Processing 61/187 1 A short introduction to parsimony

2 Discovering the structure of natural images Dictionary learning Pre-processing Principal component analysis Clustering or vector quantization Structured dictionary learning Other matrix factorization methods

3 Sparse models for image processing

4 Optimization for sparse estimation

5 Application cases

Julien Mairal Sparse Estimation for Image and Vision Processing 62/187 Dictionary learning The goal of automatically learning local structures in natural images was first achieved by neuroscientists. The model of Olshausen and Field [1996] looks for a dictionary D adapted to a training set of natural image patches xi , i = 1,..., n:

min_{D∈C, A∈ℝ^{p×n}} (1/n) Σ_{i=1}^n [ (1/2)‖x_i − Dα_i‖_2^2 + λψ(α_i) ],
where A = [α_1, ..., α_n] and C ≜ {D ∈ ℝ^{m×p} : ∀j, ‖d_j‖_2 ≤ 1}.
Typical settings: n ≈ 100 000; m = 10 × 10 pixels; p = 256 (a minimal sketch follows below).
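A minimal sketch of this formulation with scikit-learn's online dictionary learning (an assumption on the tooling; the hyperparameters are illustrative and the random patches below stand in for real pre-processed patches):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

patches = np.random.randn(10000, 100)          # stand-in for n pre-processed 10x10 patches, one per row
dl = MiniBatchDictionaryLearning(n_components=256, alpha=1.0, batch_size=256)
dl.fit(patches)                                # alternates sparse coding and dictionary updates
D = dl.components_.T                           # dictionary, shape (m, p): one element per column
codes = dl.transform(patches)                  # sparse codes, shape (n, p)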

Julien Mairal Sparse Estimation for Image and Vision Processing 63/187 Dictionary learning

Figure: with centering

Julien Mairal Sparse Estimation for Image and Vision Processing 64/187 Dictionary learning

Figure: with whitening

Julien Mairal Sparse Estimation for Image and Vision Processing 65/187 Dictionary learning

Why was it found impressive by neuroscientists? since Hubel and Wiesel [1968], it is known that some visual neurons are responding to particular image features, such as oriented edges. Later, Daugman [1985] demonstrated that fitting a linear model to neuronal responses given a visual stimuli may produce filters that can be well approximated by a two-dimensional Gabor function. the original motivation of Olshausen and Field [1996] was to establish a relation between the statistical structure of natural images and the properties of neurons from area V1.

The results provided some “support” for classical models of V1 based on Gabor filters.

Julien Mairal Sparse Estimation for Image and Vision Processing 66/187 Dictionary learning

Why was it found impressive by neuroscientists? since Hubel and Wiesel [1968], it is known that some visual neurons are responding to particular image features, such as oriented edges. Later, Daugman [1985] demonstrated that fitting a linear model to neuronal responses given a visual stimuli may produce filters that can be well approximated by a two-dimensional Gabor function. the original motivation of Olshausen and Field [1996] was to establish a relation between the statistical structure of natural images and the properties of neurons from area V1.

Warning In fact, little is known about the early visual cortex [Olshausen and Field, 2005, Carandini et al., 2005].

Julien Mairal Sparse Estimation for Image and Vision Processing 66/187 Dictionary learning Snippet from Olshausen and Field [2005]

However, [...], there remains a great deal that is still unknown about how V1 works and its role in visual system function. We believe it is quite probable that the correct theory of V1 is still far afield from the currently proposed theories.

Julien Mairal Sparse Estimation for Image and Vision Processing 67/187 Dictionary learning Point of views

Matrix factorization It is useful to see dictionary learning as a matrix factorization problem

min_{D∈C, A∈ℝ^{p×n}} (1/(2n))‖X − DA‖_F^2 + λΨ(A).

This is simply a matter of notation:

min_{D∈C, A∈ℝ^{p×n}} (1/n) Σ_{i=1}^n [ (1/2)‖x_i − Dα_i‖_2^2 + λψ(α_i) ],
but the matrix factorization point of view allows us to make connections with numerous other unsupervised learning techniques, such as K-means, PCA, NMF, ICA...

Julien Mairal Sparse Estimation for Image and Vision Processing 68/187 Dictionary learning Point of views

Empirical risk minimization:
min_{D∈C} (1/n) Σ_{i=1}^n L(x_i, D), with L(x, D) ≜ min_{α∈ℝ^p} (1/2)‖x − Dα‖_2^2 + λψ(α).
Again, this is a matter of notation, but the empirical risk minimization point of view paves the way to stochastic optimization [Mairal et al., 2010a] and to some theoretical analysis [Vainsencher et al., 2011, Gribonval et al., 2013].

Julien Mairal Sparse Estimation for Image and Vision Processing 69/187 Dictionary learning Constrained variants The formulations below are not equivalent

min_{D∈C, A∈ℝ^{p×n}} Σ_{i=1}^n (1/2)‖x_i − Dα_i‖_2^2 s.t. ψ(α_i) ≤ µ,
or
min_{D∈C, A∈ℝ^{p×n}} Σ_{i=1}^n ψ(α_i) s.t. ‖x_i − Dα_i‖_2^2 ≤ ε.
Using one instead of the other is a matter of taste and depends on the problem at hand.

Julien Mairal Sparse Estimation for Image and Vision Processing 70/187 Pre-processing of natural image patches Centering (also called removing the DC component)

x_i ← x_i − ((1/m) Σ_{j=1}^m x_i[j]) 1_m,

(a) Without pre-processing. (b) After centering.

Julien Mairal Sparse Estimation for Image and Vision Processing 71/187 Pre-processing of natural image patches Contrast (variance) normalization 1 xi xi . ← max( xi ,η) k k2 ex: η can be 0.2 times the mean value of the xi . k k2

(a) After centering. (b) After contrast normalization.

Julien Mairal Sparse Estimation for Image and Vision Processing 72/187 Pre-processing of natural image patches Whitening after centering

x_i ← U S^† U^⊤ x_i,
where (1/√n)X = USV^⊤ (SVD). Sometimes, small singular values are also set to zero. The resulting covariance (1/n)XX^⊤ is close to the identity.

(a) After centering. (b) After whitening.
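The three pre-processing steps can be chained as follows; a NumPy sketch with one patch per column (the singular-value cutoff and the choice of η follow the slides but the exact values are illustrative):

import numpy as np

def preprocess_patches(X, eta_ratio=0.2, whiten=True):
    """Center, contrast-normalize and (optionally) whiten patches; X has shape (m, n)."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)             # centering: remove the DC component of each patch
    norms = np.linalg.norm(X, axis=0)
    eta = eta_ratio * norms.mean()                    # threshold proportional to the mean patch norm
    X = X / np.maximum(norms, eta)                    # contrast normalization
    if whiten:
        U, S, _ = np.linalg.svd(X / np.sqrt(X.shape[1]), full_matrices=False)
        S_inv = np.where(S > 1e-8, 1.0 / S, 0.0)      # pseudo-inverse: small singular values set to zero
        X = U @ np.diag(S_inv) @ U.T @ X              # x_i <- U S^+ U^T x_i
    return X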

Julien Mairal Sparse Estimation for Image and Vision Processing 73/187 Pre-processing of natural image patches Treatment of color image patches Should we use RGB?

Julien Mairal Sparse Estimation for Image and Vision Processing 74/187 Pre-processing of natural image patches Treatment of color image patches Should we use RGB? RGB dates back to our first understanding of the nature of light: color spectrum [Newton, 1675], trichromatic vision [Young, 1845], color composition [Grassmann, 1854, Maxwell, 1860, von Helmholtz, 1852], biological photoreceptors [Nathans et al., 1986]; other color spaces, such as CIELab, YIQ, YCrBr have less correlated color channels [Pratt, 1971, Sharma and Trussell, 1997], and provide a better perceptual distance; it does not mean that RGB should never be used: changing the color space will also change the nature of the noise...

Julien Mairal Sparse Estimation for Image and Vision Processing 74/187 Pre-processing of natural image patches Treatment of color image patches Should we use RGB? RGB dates back to our first understanding of the nature of light: color spectrum [Newton, 1675], trichromatic vision [Young, 1845], color composition [Grassmann, 1854, Maxwell, 1860, von Helmholtz, 1852], biological photoreceptors [Nathans et al., 1986]; other color spaces, such as CIELab, YIQ, YCrBr have less correlated color channels [Pratt, 1971, Sharma and Trussell, 1997], and provide a better perceptual distance; it does not mean that RGB should never be used: changing the color space will also change the nature of the noise...

how do we perform centering, whitening and normalization?

Julien Mairal Sparse Estimation for Image and Vision Processing 74/187 Pre-processing of natural image patches Treatment of color image patches Should we use RGB? RGB dates back to our first understanding of the nature of light: color spectrum [Newton, 1675], trichromatic vision [Young, 1845], color composition [Grassmann, 1854, Maxwell, 1860, von Helmholtz, 1852], biological photoreceptors [Nathans et al., 1986]; other color spaces, such as CIELab, YIQ, YCrBr have less correlated color channels [Pratt, 1971, Sharma and Trussell, 1997], and provide a better perceptual distance; it does not mean that RGB should never be used: changing the color space will also change the nature of the noise...

how do we perform centering, whitening and normalization? center each R,G,B channel independently.

Julien Mairal Sparse Estimation for Image and Vision Processing 74/187 Pre-processing of natural image patches

(a) Without pre-processing. (b) After centering.

(c) Centering and normalization. (d) After whitening.

Julien Mairal Sparse Estimation for Image and Vision Processing 75/187 Principal component analysis (PCA) Also known has the Karhunen-Lo`eve or Hotelling transform [Hotelling, 1933], it is often presented as an iterative process finding orthogonal directions maximizing variance in the data. In fact, it can be cast as a low-rank matrix factorization problem:

min_{U∈ℝ^{m×k}, V∈ℝ^{n×k}} ‖X − UV^⊤‖_F^2 s.t. U^⊤U = I_k,

where the rows of X are centered. As a consequence of the theorem of Eckart and Young [1936], the matrix U contains the principal components of X corresponding to the k largest singular values.

Julien Mairal Sparse Estimation for Image and Vision Processing 76/187 Principal component analysis (PCA)

(e) DCT Dictionary. (f) Principal components. Figure: On the right, we visualize the principal components of 400 000 randomly sampled natural image patches of size 16 × 16. On the left, we display a discrete cosine transform (DCT) dictionary [Ahmed et al., 1974].

Julien Mairal Sparse Estimation for Image and Vision Processing 77/187 Principal component analysis (PCA)

(a) Original Image. (b) Principal components. Figure: Visualization of the principal components of all overlapping patches from the image tiger. Even though the image is not natural, its principal components are similar to the previous ones.

Julien Mairal Sparse Estimation for Image and Vision Processing 78/187 Principal component analysis (PCA) [Bossomaier and Snyder, 1986, Simoncelli and Olshausen, 2001, Hyv¨arinen et al., 2009]. Warning The sinusoids produced by PCA have nothing to do with the structure of natural images, but are due to a property of shift invariance.

Consider an infinite 1D signal with covariance Σ[k, l] = σ(k − l), where σ is even. Then, for all ω and ϕ,
Σ_l Σ(k, l) e^{i(ωl+ϕ)} = Σ_l σ(l − k) e^{i(ωl+ϕ)} = (Σ_{l'} σ(l') e^{iωl'}) e^{i(ωk+ϕ)}.
Since the function σ is even, the infinite sum Σ_{l'} σ(l') e^{iωl'} is real, and the signals [sin(ωk + ϕ)]_{k∈ℤ} are all eigenvectors of Σ.
Note that controlling the approximation of the principal components by the discrete Fourier transform for finite signals is non-trivial [Pearl, 1973].

Julien Mairal Sparse Estimation for Image and Vision Processing 79/187 Clustering or vector quantization First used on natural image patches for compression and communication purposes [Nasrabadi and King, 1988, Gersho and Gray, 1992]. The goal is to find p clusters in the data, by minimizing the following objective:

min_{D∈ℝ^{m×p}, l_i∈{1,...,p} ∀i} Σ_{i=1}^n ‖x_i − d_{l_i}‖_2^2,   (1)

This is again a matrix factorization problem:
min_{D∈ℝ^{m×p}, A∈{0,1}^{p×n}} (1/(2n))‖X − DA‖_F^2 s.t. ∀i, Σ_{j=1}^p α_i[j] = 1.
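A minimal NumPy sketch of Lloyd's iterations for this objective (initialization and iteration count are illustrative):

import numpy as np

def kmeans(X, p, n_iter=50, seed=0):
    """Plain Lloyd iterations; X holds one data point (patch) per column, shape (m, n)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    D = X[:, rng.choice(X.shape[1], size=p, replace=False)].copy()  # centroids initialized from data
    for _ in range(n_iter):
        # assignment step: nearest centroid for every column of X
        d2 = ((X[:, None, :] - D[:, :, None]) ** 2).sum(axis=0)     # shape (p, n)
        labels = d2.argmin(axis=0)
        # update step: each centroid becomes the mean of its assigned points
        for j in range(p):
            if np.any(labels == j):
                D[:, j] = X[:, labels == j].mean(axis=1)
    return D, labels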

Julien Mairal Sparse Estimation for Image and Vision Processing 80/187 Clustering or vector quantization

(a) With centering. (b) With whitening. Figure: Visualization of p = 256 centroids computed with the K-means algorithm on n = 400 000 image patches of size m = 16 × 16 pixels.

Julien Mairal Sparse Estimation for Image and Vision Processing 81/187 Dictionary learning on color image patches

(a) With centering - RGB. (b) With whitening - RGB. Figure: Dictionaries learned on RGB patches.

Julien Mairal Sparse Estimation for Image and Vision Processing 82/187 Dictionary learning with structured sparsity

Formulation:
min_{D∈C, A∈ℝ^{p×n}} (1/(2n))‖X − DA‖_F^2 + (λ/n) Σ_{i=1}^n Σ_{g∈G} ‖α_i[g]‖_q.
Group structures:
hierarchical: organize the dictionary elements in a tree [Jenatton et al., 2010, 2011b];
topographic: organize the elements on a 2D grid [Kavukcuoglu et al., 2009, Mairal et al., 2011]; the groups are 3 × 3 or 4 × 4 spatial neighborhoods. The second group structure is inspired by topographic ICA [Hyvärinen et al., 2001].

Julien Mairal Sparse Estimation for Image and Vision Processing 83/187 Dictionary learning with structured sparsity Hierarchical dictionary learning

(a) Tree structure 1. (b) Tree structure 2. Figure: Hierarchical dictionaries learned on natural image patches of size 16 × 16 pixels.

Dictionary learning with structured sparsity Topographic dictionary learning

(a) With 3 × 3 neighborhoods. (b) With 4 × 4 neighborhoods. Figure: Topographic dictionaries learned on whitened natural image patches of size 12 × 12 pixels.

Other matrix factorization methods Independent component analysis (ICA)
Assume that x is a random variable (here a natural image patch) and that the columns of X = [x_1, ..., x_n] are random realizations of x. ICA is a principle looking for a factorization x = Dα, where D is orthogonal and α is a random vector whose entries are statistically independent [Bell and Sejnowski, 1997, Hyvärinen et al., 2009].

Warning: because ICA is only a principle, there is not a unique ICA formulation.

How do we measure independence? Compare p(α) with the product of its marginals ∏_{j=1}^p p(α[j]):

KL(p(α), ∏_{j=1}^p p(α[j])) ≜ ∫_{ℝ^p} p(α) log( p(α) / ∏_{j=1}^p p(α[j]) ) dα,
which is zero iff the α[j]'s are independent [Cover and Thomas, 2006].

Julien Mairal Sparse Estimation for Image and Vision Processing 86/187 Other matrix factorization methods Independent component analysis (ICA) We can rewrite the Kullback-Leibler distance with entropies

KL(p(α), ∏_{j=1}^p p(α[j])) = Σ_{j=1}^p H(α[j]) − H(α).
The entropy H(α) can be shown to be independent of D when D is orthogonal and x is whitened. Minimizing the KL divergence thus amounts to minimizing
Σ_{j=1}^p H(α[j]) = Σ_{j=1}^p H(d_j^⊤x).

We are close to an ICA “formulation” but not yet there. The entropy is an abstract quantity that is not computable.

Julien Mairal Sparse Estimation for Image and Vision Processing 87/187 Other matrix factorization methods Independent component analysis (ICA) Strategies leading to concrete formulations/algorithms for solving the following problem after whitening the data

min_D Σ_{j=1}^p H(d_j^⊤x) s.t. D^⊤D = I.

parameterizing the densities p(d_j^⊤x), leading to maximum likelihood estimation [see Hyvärinen et al., 2004]; plugging in non-parametric estimators of the entropy [Pham, 2004]; encouraging the distributions of the α[j]'s to be “non-Gaussian” [Cardoso, 2003]: among all probability distributions with the same variance, the Gaussian ones are known to maximize entropy [Cover and Thomas, 2006].

Julien Mairal Sparse Estimation for Image and Vision Processing 88/187 Other matrix factorization methods Non-negative matrix factorization [Paatero and Tapper, 1994].

min_{D∈ℝ^{m×p}, A∈ℝ^{p×n}} ‖X − DA‖_F^2 s.t. D ≥ 0 and A ≥ 0.
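A sketch with scikit-learn's NMF (an assumption on the tooling; the data below are a random non-negative stand-in and the hyperparameters are illustrative):

import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.randn(1000, 100))     # stand-in non-negative data, one sample per row
model = NMF(n_components=64, init='nndsvda', max_iter=300)
A = model.fit_transform(X)                 # non-negative codes, shape (1000, 64)
D = model.components_                      # non-negative dictionary, shape (64, 100)
# X is approximated by A @ D, with both factors entrywise non-negative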


Archetypal analysis [Cutler and Breiman, 1994].

for every dictionary element j, d_j = Xβ_j, where β_j is in the simplex ∆_n ≜ {β ∈ ℝ^n s.t. β ≥ 0 and Σ_{i=1}^n β[i] = 1};
for every data point i, x_i is close to Dα_i, where α_i is in ∆_p;
formulation:
min_{α_i∈∆_p for 1≤i≤n, β_j∈∆_n for 1≤j≤p} ‖X − XBA‖_F^2,

where A =[α1,..., αn], B =[β1,..., βp] and the matrix of archetypes D is equal to the product XB.

Julien Mairal Sparse Estimation for Image and Vision Processing 89/187 Other matrix factorization methods Convolutional sparse coding [Zhu et al., 2005, Zeiler et al., 2010]

Main idea: decompose directly the full image x using small dictionary elements placed at all possible positions in the image. Given a dictionary D in ℝ^{m×p}, where m is a patch size, and an image x in ℝ^l, the image decomposition can be written

min_{A∈ℝ^{p×l}} (1/2)‖x − Σ_{k=1}^l R_k^⊤ Dα_k‖_2^2 + λ Σ_{k=1}^l ‖α_k‖_1,
where R_k^⊤ places a small patch at position k in the image.

Model with effective applications to visual recognition [Zeiler et al., 2010, Rigamonti et al., 2013, Kavukcuoglu et al., 2010]. Then, the extension to dictionary learning is easy.

Julien Mairal Sparse Estimation for Image and Vision Processing 90/187 Other matrix factorization methods Convolutional sparse coding [Zhu et al., 2005, Zeiler et al., 2010]

Figure: Visualization of p = 100 dictionary elements learned on 30 whitened natural images.

Julien Mairal Sparse Estimation for Image and Vision Processing 91/187 Conclusions from the second part

intriguing structures naturally emerge from natural images; matrix factorization is an effective tool to find these structures;

Advertisement some code will be provided upon publication of the monograph for generating most of the figures from this lecture. the SPAMS toolbox already contains lots of code (C++ interfaced with Matlab, Python, R) for learning dictionaries, factorizing matrices (NMF, archetypal analysis), solving sparse estimation problems. http://spams-devel.gforge.inria.fr/.


Question Is unsupervised learning on natural image patches useful for any prediction task?

Julien Mairal Sparse Estimation for Image and Vision Processing 92/187 Part III: Sparse models for image processing

Julien Mairal Sparse Estimation for Image and Vision Processing 93/187 1 A short introduction to parsimony

2 Discovering the structure of natural images

3 Sparse models for image processing Image denoising Image inpainting Image demosaicking Video processing Image up-scaling Inverting nonlinear local transformations Other patch modeling approaches

4 Optimization for sparse estimation

5 Application cases

Julien Mairal Sparse Estimation for Image and Vision Processing 94/187 Image denoising

y = x_orig + w, where y is the measurements, x_orig the original image, and w the noise.

Image denoising Classical image models

y = x_orig + w (measurements = original image + noise).

Energy minimization problem (MAP estimation):
E(x) = (1/2)‖y − x‖_2^2 [relation to measurements] + ψ(x) [image model].
Some classical priors:
smoothness: λ‖Lx‖_2^2;
total variation: λ‖∇x‖_1 [Rudin et al., 1992];
Markov random fields [Zhu and Mumford, 1997];
wavelet sparsity: λ‖Wx‖_1.

Julien Mairal Sparse Estimation for Image and Vision Processing 96/187 Image denoising The method of Elad and Aharon [2006]

Given a fixed dictionary D, a patch yi is denoised as follows:

1 center y_i: y_i^c ≜ y_i − µ_i 1_m with µ_i ≜ (1/m) 1_m^⊤ y_i;
2 find a sparse linear combination of dictionary elements that approximates y_i^c up to the noise level:

min_{α_i∈ℝ^p} ‖α_i‖_0 s.t. ‖y_i^c − Dα_i‖_2^2 ≤ ε,   (2)

where ε is proportional to the noise variance σ^2;

3 add back the mean component to obtain the clean estimate x̂_i:

x̂_i ≜ Dα_i^⋆ + µ_i 1_m,

Julien Mairal Sparse Estimation for Image and Vision Processing 97/187 Image denoising The method of Elad and Aharon [2006] An adaptive approach

1 extract all overlapping √m × √m patches y_i;
2 dictionary learning: learn D on the set of centered noisy patches [y_1^c, ..., y_n^c];
3 final reconstruction: find an estimate x̂_i for every patch using the approach of the previous slide;
4 patch averaging: x̂ = (1/m) Σ_{i=1}^n R_i^⊤ x̂_i, where R_i^⊤ puts the patch back at its position in the image.
Remark: like other state-of-the-art denoising approaches, it is patch-based [Buades et al., 2005, Dabov et al., 2007].

Julien Mairal Sparse Estimation for Image and Vision Processing 98/187 Practical tricks

use larger patches when the noise level is high; choose ε = m(1.15σ)^2, or take the 0.9-quantile of the χ²_m-distribution; always use the ℓ0 regularization for the final reconstruction;

using ℓ1 for learning the dictionary seems to yield better results.
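Putting the previous slides together, a simplified NumPy sketch of the denoising pipeline, assuming a dictionary D of size m × p is given (e.g., learned on the centered noisy patches); the OMP routine and all names are mine:

import numpy as np

def omp_error(y, D, eps, k_max=None):
    """Greedy l0 sparse coding: add atoms until ||y - D a||^2 <= eps (Eq. (2))."""
    m, p = D.shape
    k_max = k_max or m
    active, residual, alpha = [], y.copy(), np.zeros(p)
    while residual @ residual > eps and len(active) < k_max:
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in active:
            break
        active.append(j)
        coef, *_ = np.linalg.lstsq(D[:, active], y, rcond=None)
        residual = y - D[:, active] @ coef
    if active:
        alpha[active] = coef
    return alpha

def denoise_image(img, D, sigma, patch=8):
    """Patch-based denoising in the spirit of Elad and Aharon [2006] (simplified sketch):
    center each overlapping patch, sparse-code it up to eps = m*(1.15*sigma)^2,
    add the mean back, and average the overlapping estimates."""
    h, w = img.shape
    m = patch * patch
    eps = m * (1.15 * sigma) ** 2
    out = np.zeros_like(img, dtype=float)
    weights = np.zeros_like(img, dtype=float)
    for i in range(h - patch + 1):
        for j in range(w - patch + 1):
            y = img[i:i+patch, j:j+patch].reshape(m).astype(float)
            mu = y.mean()
            alpha = omp_error(y - mu, D, eps)
            out[i:i+patch, j:j+patch] += (D @ alpha + mu).reshape(patch, patch)
            weights[i:i+patch, j:j+patch] += 1.0
    return out / weights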

Julien Mairal Sparse Estimation for Image and Vision Processing 99/187 Image inpainting [Mairal et al., 2008a,b] For removing small holes in the image, a natural extension consists in introducing a binary mask Mi in the formulation:

min_{D∈C, A∈ℝ^{p×n}} (1/n) Σ_{i=1}^n (1/2)‖M_i(y_i − Dα_i)‖_2^2 + λψ(α_i).
The approach assumes that the noise is not structured and that the holes are smaller than the patch size.

The problem is called inpainting [Bertalmio et al., 2000].
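A per-patch sketch of the masked reconstruction (with ψ taken as the ℓ1-norm and a fixed dictionary; in the formulation above the dictionary itself is also learned with the mask in the loss):

import numpy as np

def inpaint_patch(y, mask, D, lam, n_iter=200):
    """Reconstruct a patch y from its observed pixels (mask == True) with a fixed
    dictionary D, using an ISTA-style solver for the masked l1 problem."""
    Dm, ym = D[mask], y[mask]                  # keep only the rows of observed pixels
    L = np.linalg.norm(Dm, 2) ** 2 + 1e-8
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = Dm.T @ (Dm @ alpha - ym)
        z = alpha - grad / L
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    estimate = D @ alpha                       # fills in the missing pixels
    estimate[mask] = y[mask]                   # keep the observed pixels untouched
    return estimate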

Julien Mairal Sparse Estimation for Image and Vision Processing 100/187 Image inpainting [Mairal et al., 2008a,b]

Julien Mairal Sparse Estimation for Image and Vision Processing 101/187 Image inpainting [Mairal et al., 2008a,b]

Julien Mairal Sparse Estimation for Image and Vision Processing 102/187 Image inpainting [Mairal et al., 2008a,b]

Julien Mairal Sparse Estimation for Image and Vision Processing 103/187 Image inpainting Inpainting a 12-Mpixel photograph [Mairal et al., 2009a]

Julien Mairal Sparse Estimation for Image and Vision Processing 104/187 Image inpainting Inpainting a 12-Mpixel photograph [Mairal et al., 2009a]

Julien Mairal Sparse Estimation for Image and Vision Processing 105/187 Image inpainting Inpainting a 12-Mpixel photograph [Mairal et al., 2009a]

Julien Mairal Sparse Estimation for Image and Vision Processing 106/187 Image inpainting Inpainting a 12-Mpixel photograph [Mairal et al., 2009a]

Image demosaicking RAW Image Processing

Figure: Bayer mosaic pattern (rows alternating G R G R ... and B G B G ...). Typical RAW pipeline: white balance, black subtraction, denoising, demosaicking, conversion to sRGB, gamma correction.

Problem The noise pattern is very structured: the previous inpainting scheme needs to be modified [Mairal et al., 2008a].

Image demosaicking

(a) Mosaicked image (b) Demosaicked image A (c) Demosaicked image B

Figure: Demosaicked image A is with the approach previously described; image B is with an extension called non-local sparse model [Mairal et al., 2009b].

Image demosaicking

(a) Zoom (b) Zoom (c) Zoom

Figure: Demosaicked image A is with the approach previously described; image B is with an extension called non-local sparse model [Mairal et al., 2009b].

Julien Mairal Sparse Estimation for Image and Vision Processing 109/187 Video processing Extension developed by Protter and Elad [2009]: Key ideas for video processing Using a 3D dictionary. Processing of many frames at the same time. Dictionary propagation.

Julien Mairal Sparse Estimation for Image and Vision Processing 110/187 Video processing Inpainting, [Mairal et al., 2008b]

Figure: Inpainting results.

Julien Mairal Sparse Estimation for Image and Vision Processing 111/187 Video processing Inpainting, [Mairal et al., 2008b]

Figure: Inpainting results.

Julien Mairal Sparse Estimation for Image and Vision Processing 111/187 Video processing Inpainting, [Mairal et al., 2008b]

Figure: Inpainting results.

Julien Mairal Sparse Estimation for Image and Vision Processing 111/187 Video processing Inpainting, [Mairal et al., 2008b]

Figure: Inpainting results.

Julien Mairal Sparse Estimation for Image and Vision Processing 111/187 Video processing Inpainting, [Mairal et al., 2008b]

Figure: Inpainting results.

Julien Mairal Sparse Estimation for Image and Vision Processing 111/187 Video processing Color video denoising, [Mairal et al., 2008b]

Figure: Denoising results.

Julien Mairal Sparse Estimation for Image and Vision Processing 112/187 Video processing Color video denoising, [Mairal et al., 2008b]

Figure: Denoising results.

Julien Mairal Sparse Estimation for Image and Vision Processing 112/187 Video processing Color video denoising, [Mairal et al., 2008b]

Figure: Denoising results.

Julien Mairal Sparse Estimation for Image and Vision Processing 112/187 Video processing Color video denoising, [Mairal et al., 2008b]

Figure: Denoising results.

Julien Mairal Sparse Estimation for Image and Vision Processing 112/187 Video processing Color video denoising, [Mairal et al., 2008b]

Figure: Denoising results.

Image up-scaling The main recipe of Yang et al. [2010]
The approach requires a database of pairs of training patches (x_i^l, x_i^h)_{i=1}^n, where x_i^l in ℝ^{m_l} is a low-resolution version of the patch x_i^h in ℝ^{m_h}. Training step:

min_{D_l∈C_l, D_h∈C_h, A∈ℝ^{p×n}} (1/n) Σ_{i=1}^n [ (1/(2m_l))‖x_i^l − D_l α_i‖_2^2 + (1/(2m_h))‖x_i^h − D_h α_i‖_2^2 + λ‖α_i‖_1 ].
D_l and D_h are jointly learned such that the pairs (x_i^l, x_i^h) “share” the same sparse decompositions on the two dictionaries.
Reconstruction step, given a low-resolution image:
min_{β_i∈ℝ^p} (1/(2m_l))‖y_i − D_l β_i‖_2^2 + (1/(2m_h))‖z_i − D_h β_i‖_2^2 + λ‖β_i‖_1,

Julien Mairal Sparse Estimation for Image and Vision Processing 113/187 Image up-scaling Variant with regression [Zeyde et al., 2012, Couzinie-Devy et al., 2011, Yang et al., 2012]

1 compute D_l and A with a classical dictionary learning formulation:
min_{D_l∈C_l, A∈ℝ^{p×n}} (1/n) Σ_{i=1}^n (1/2)‖x_i^l − D_l α_i‖_2^2 + λ‖α_i‖_1;
2 obtain D_h by solving a multivariate regression problem:
min_{D_h∈ℝ^{m_h×p}} (1/n) Σ_{i=1}^n (1/2)‖x_i^h − D_h α_i‖_2^2,
where the α_i's are fixed after the first step. See also Zeyde et al. [2012] for other variants (a sketch of the two steps is given below).

Main differences with Yang et al. [2010]:
(+) testing and training are more consistent;
(−) D_h and D_l are not learned jointly anymore.
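A sketch of the two training steps with NumPy and scikit-learn (assumed tooling; the patch data below are random stand-ins and the hyperparameters are illustrative):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

X_lo = np.random.randn(5000, 36)       # stand-in low-resolution patches (e.g., 6x6), one per row
X_hi = np.random.randn(5000, 144)      # corresponding high-resolution patches (e.g., 12x12)

# step 1: dictionary learning and sparse coding on the low-resolution patches
dl = MiniBatchDictionaryLearning(n_components=256, alpha=0.1)
A = dl.fit_transform(X_lo)             # sparse codes, shape (n, p)

# step 2: multivariate regression with the codes fixed: D_h = argmin ||X_hi - A D_h^T||_F^2
D_h, *_ = np.linalg.lstsq(A, X_hi, rcond=None)     # shape (p, m_h)

# at test time: code a low-resolution patch on D_l (dl.transform), then predict code @ D_h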

Julien Mairal Sparse Estimation for Image and Vision Processing 114/187 Image up-scaling Variant with task-driven dictionary learning [Couzinie-Devy et al., 2011, Yang et al., 2012] Define ⋆ △ 1 2 α (x, D) = argmin x Dα 2 + λ α 1 , Rp 2k − k k k α∈   Then, the joint dictionary learning formulation consists of minimizing

n 2 1 1 h ⋆ l min xi Dhα (xi , Dl ) . (3) Dl ∈Cl n 2 − 2 m ×p i=1 Dh∈R h X

Pros and cons:
(+) testing and training are still consistent;
(+) D_h and D_l are learned jointly;
(−) optimization looks horribly difficult.

Julien Mairal Sparse Estimation for Image and Vision Processing 115/187 Image up-scaling Scheme with regression Pipeline:

Input vector x^l → dictionary D_l → sparse codes α → dictionary D_h → output x^h.

Julien Mairal Sparse Estimation for Image and Vision Processing 116/187 Image up-scaling Scheme with regression First step: dictionary learning

First step (sparse coding layer): input vector x^l → dictionary D_l → sparse codes α; the mapping through the dictionary D_h to the output x^h comes from the second step.

Julien Mairal Sparse Estimation for Image and Vision Processing 116/187 Image up-scaling Scheme with regression Second step: regression

Second step (supervised learning layer): input vector x^l → dictionary D_l → sparse codes α → weights W_h → output x^h.

Julien Mairal Sparse Estimation for Image and Vision Processing 116/187 Image up-scaling Scheme with task-driven dictionary learning A single step: supervised (task-driven) dictionary learning

Single step (supervised dictionary learning?): input vector x^l → dictionary D_l → sparse codes α → weights D_h → output x^h, all learned jointly.

In the neural network language, we need back-propagation [LeCun et al., 1998].

Julien Mairal Sparse Estimation for Image and Vision Processing 117/187 Image up-scaling Scheme with task-driven dictionary learning [Couzinie-Devy et al., 2011]

Proposition In the asymptotic regime, the cost function is differentiable and its gradient admits a simple form [Mairal et al., 2012].

Main recipe of the optimization: initialize with the regression variant; use stochastic gradient descent; use classical heuristics from the neural network literature [LeCun et al., 1998].

Julien Mairal Sparse Estimation for Image and Vision Processing 118/187 Image up-scaling Image from Couzinie-Devy et al. [2011]

Figure: Original

Julien Mairal Sparse Estimation for Image and Vision Processing 119/187 Image up-scaling Image from Couzinie-Devy et al. [2011]

Figure: Bicubic interpolation

Julien Mairal Sparse Estimation for Image and Vision Processing 119/187 Image up-scaling Image from Couzinie-Devy et al. [2011]

Figure: Result from Couzinie-Devy et al. [2011].

Julien Mairal Sparse Estimation for Image and Vision Processing 119/187 Image up-scaling Image from Couzinie-Devy et al. [2011]

Figure: Original

Julien Mairal Sparse Estimation for Image and Vision Processing 120/187 Image up-scaling Image from Couzinie-Devy et al. [2011]

Figure: Bicubic interpolation

Julien Mairal Sparse Estimation for Image and Vision Processing 120/187 Image up-scaling Image from Couzinie-Devy et al. [2011]

Figure: Result from Couzinie-Devy et al. [2011].

Julien Mairal Sparse Estimation for Image and Vision Processing 120/187 Inverting nonlinear local transformations

Remark The previous up-scaling approaches are generic and can work for other types of local image transformations. Example: inverse half-toning consists of reconstructing grayscale images from (probably old) binary ones, see, e.g., [Dabov et al., 2006]. A classical algorithm for producing binary images is the one of Floyd and Steinberg [1976]. Warning Inverse half-toning is probably not a hot topic in image processing nowadays.

Julien Mairal Sparse Estimation for Image and Vision Processing 121/187 Inverting nonlinear local transformations Inverse half-toning [Mairal et al., 2012]

Figure: Original

Julien Mairal Sparse Estimation for Image and Vision Processing 122/187 Inverting nonlinear local transformations Inverse half-toning [Mairal et al., 2012]

Figure: Binary image

Julien Mairal Sparse Estimation for Image and Vision Processing 122/187 Inverting nonlinear local transformations Inverse half-toning [Mairal et al., 2012]

Figure: Reconstructed.

[Additional inverse half-toning examples from Mairal et al. [2012]: original, binary, and reconstructed images.]
Julien Mairal Sparse Estimation for Image and Vision Processing 127/187 Other patch modeling approaches

Non-local means and non-parametric approaches Image pixels are well explained by a Nadaraya-Watson estimator:

x̂[i] = Σ_{j=1}^n [ K_h(y_i − y_j) / Σ_{l=1}^n K_h(y_i − y_l) ] y[j],   (4)

with successful application to texture synthesis: [Efros and Leung, 1999]; image denoising (non-local means): [Buades et al., 2005]; image demosaicking: [Buades et al., 2009].
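
As an illustration of estimator (4), here is a small NumPy sketch of non-local means on a 1-D signal, with a Gaussian kernel on patch differences; it is a toy version under assumed parameters (patch half-width, bandwidth h), not the implementations cited above.

import numpy as np

def nlmeans_1d(y, half=3, h=0.1):
    # Non-local means on a 1-D signal: each sample is replaced by a weighted average of all
    # samples, the weights being a Gaussian kernel of the distance between patches (eq. 4).
    n = len(y)
    ypad = np.pad(y, half, mode='reflect')
    patches = np.stack([ypad[i:i + 2 * half + 1] for i in range(n)])  # n x (2*half+1)
    out = np.empty(n)
    for i in range(n):
        d2 = np.sum((patches - patches[i]) ** 2, axis=1)
        w = np.exp(-d2 / (2 * h ** 2))
        out[i] = np.dot(w, y) / np.sum(w)
    return out

# e.g. denoise a noisy step signal:
# y = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * np.random.randn(100)
# x_hat = nlmeans_1d(y)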

Julien Mairal Sparse Estimation for Image and Vision Processing 128/187 Other patch modeling approaches

BM3D state-of-the-art image denoising approach [Dabov et al., 2007]: block matching: for each patch, find similar ones in the image; 3D wavelet filtering: denoise blocks of patches with 3D-DCT; patch averaging: average estimates of overlapping patches; second step with Wiener filtering: use the first estimate to run the previous steps again and improve them. Further refined by Dabov et al. [2009] with shape-adaptive patches and PCA filtering.

Julien Mairal Sparse Estimation for Image and Vision Processing 129/187 Other patch modeling approaches

Non-local sparse models [Mairal et al., 2009b] Exploit some ideas of BM3D to combine the non-local means principle with dictionary learning. The main idea is that similar patches should admit similar decompositions by using group sparsity:

The approach uses a block matching/clustering step, followed by group sparse coding and patch averaging.

Julien Mairal Sparse Estimation for Image and Vision Processing 130/187 Other patch modeling approaches Non-local sparse image models

Julien Mairal Sparse Estimation for Image and Vision Processing 131/187 Other patch modeling approaches Non-local sparse image models

Julien Mairal Sparse Estimation for Image and Vision Processing 132/187 Conclusions from the third part

many inverse problems in image processing can be tackled by modeling natural image patches; dictionary learning is one effective way to do it, among others.

Julien Mairal Sparse Estimation for Image and Vision Processing 133/187 Part IV: Optimization for sparse estimation

Julien Mairal Sparse Estimation for Image and Vision Processing 134/187 1 A short introduction to parsimony

2 Discovering the structure of natural images

3 Sparse models for image processing

4 Optimization for sparse estimation Sparse reconstruction with the ℓ0-penalty Introduction of a few optimization principles Sparse reconstruction with the ℓ1-norm Iterative reweighted ℓ1-algorithms Optimization for dictionary learning

5 Application cases

Julien Mairal Sparse Estimation for Image and Vision Processing 135/187 Sparse reconstruction with the ℓ0-penalty Matching pursuit [Mallat and Zhang, 1993]

[Figure: geometric illustration of matching pursuit in R^3 with dictionary elements d1, d2, d3 and a signal x; at each step the residual r is projected onto the most correlated atom, and the coefficients evolve as α = (0, 0, 0) → (0, 0, 0.75) → (0, 0.24, 0.75) → (0, 0.24, 0.65).]

Julien Mairal Sparse Estimation for Image and Vision Processing 136/187 Sparse reconstruction with the ℓ0-penalty Matching pursuit [Mallat and Zhang, 1993]

min_{α∈R^p} ||x − Dα||_2^2  s.t.  ||α||_0 ≤ k.

1: α ← 0
2: r ← x (residual)
3: while ||α||_0 < k do
4:   select the atom with maximum inner product with the residual: ĵ ← argmax_{j=1,...,p} |d_j^⊤ r|
5:   update the coefficients and the residual: α[ĵ] ← α[ĵ] + d_ĵ^⊤ r;  r ← r − (d_ĵ^⊤ r) d_ĵ
6: end while
Julien Mairal Sparse Estimation for Image and Vision Processing 137/187 Sparse reconstruction with the ℓ0-penalty Matching pursuit [Mallat and Zhang, 1993]

Remarks
Matching pursuit is a coordinate descent algorithm. It greedily selects one coordinate at a time and optimizes the cost function with respect to that coordinate:

α[ĵ] ← argmin_{α∈R} ||x − Σ_{l≠ĵ} α[l]d_l − αd_ĵ||_2^2.

Each coordinate can be selected several times during the process. The roots of this algorithm can be found in the statistics literature [Efroymson, 1960].
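
A compact NumPy sketch of the matching pursuit listing above (a didactic version, not an optimized one): the columns of D are assumed to have unit ℓ2-norm, and the small tolerance used to stop when the residual becomes negligible is an implementation choice.

import numpy as np

def matching_pursuit(x, D, k, tol=1e-12):
    # Greedy matching pursuit: pick the atom most correlated with the residual and update
    # a single coefficient at a time; the same atom may be selected several times.
    p = D.shape[1]
    alpha = np.zeros(p)
    r = x.copy()
    while np.count_nonzero(alpha) < k:
        c = D.T @ r                      # correlations with the residual
        j = np.argmax(np.abs(c))
        if np.abs(c[j]) < tol:           # residual (numerically) orthogonal to all atoms
            break
        alpha[j] += c[j]
        r -= c[j] * D[:, j]
    return alpha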

Julien Mairal Sparse Estimation for Image and Vision Processing 138/187 Sparse reconstruction with the ℓ0-penalty Orthogonal matching pursuit [Pati et al., 1993]

[Figure: geometric illustration of orthogonal matching pursuit in R^3; the signal x is projected orthogonally onto the span of the selected atoms, with active set and coefficients evolving as Γ = ∅, α = (0, 0, 0) → Γ = {3}, α = (0, 0, 0.75) → Γ = {3, 2}, α = (0, 0.29, 0.63).]

Julien Mairal Sparse Estimation for Image and Vision Processing 139/187 Sparse reconstruction with the ℓ0-penalty Orthogonal matching pursuit [Pati et al., 1993]

min_{α∈R^p} ||x − Dα||_2^2  s.t.  ||α||_0 ≤ k

1: Γ ← ∅
2: for iter = 1,...,k do
3:   select the variable that most reduces the objective: (ĵ, β̂) ← argmin_{j∈Γ^∁, β} ||x − D_{Γ∪{j}}β||_2^2
4:   update the active set: Γ ← Γ ∪ {ĵ}
5:   update the coefficients: α[Γ] ← β̂ and α[Γ^∁] ← 0
6: end for

Julien Mairal Sparse Estimation for Image and Vision Processing 140/187 Sparse reconstruction with the ℓ0-penalty Orthogonal matching pursuit [Pati et al., 1993]

Remarks
this is an active-set algorithm.
when a new variable is selected, the coefficients for the full set Γ are re-optimized:
α[Γ] = (D_Γ^⊤D_Γ)^{−1}D_Γ^⊤x,
and the residual is always orthogonal to the matrix D_Γ of previously selected dictionary elements:
D_Γ^⊤(x − Dα) = D_Γ^⊤(x − D_Γα[Γ]) = 0.
several variants of OMP exist regarding the selection rule of ĵ; the one we use appears in Cotter et al. [1999].
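
The sketch below is a plain NumPy version of OMP for illustration. It departs from the listing above in two assumed simplifications: it selects the atom by maximum correlation with the residual (a common variant) rather than by exact error reduction, and it re-solves the least-squares problem on the active set from scratch instead of updating a Cholesky factorization as advocated on the next slide.

import numpy as np

def omp(x, D, k):
    # Orthogonal matching pursuit: grow the active set, then re-optimize all active
    # coefficients by least squares so the residual stays orthogonal to the selected atoms.
    alpha = np.zeros(D.shape[1])
    active = []
    r = x.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ r)))
        if j not in active:
            active.append(j)
        beta, _, _, _ = np.linalg.lstsq(D[:, active], x, rcond=None)
        alpha[:] = 0.0
        alpha[active] = beta
        r = x - D[:, active] @ beta
    return alpha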

Julien Mairal Sparse Estimation for Image and Vision Processing 141/187 Sparse reconstruction with the ℓ0-penalty Orthogonal matching pursuit [Pati et al., 1993]

Keys for a fast implementation
If available, use the Gram matrix G = D^⊤D;
Maintain the computation of D^⊤(x − Dα);
Update the Cholesky decomposition of (D_Γ^⊤D_Γ)^{−1}.
The total complexity for decomposing n k-sparse signals of size m with a dictionary of size p is
O(p^2 m) [Gram matrix] + O(nk^3) [Cholesky] + O(n(pm + pk^2)) [D^⊤(x − Dα)] = O(np(m + k^2)).
It is also possible to use the matrix inversion lemma instead of a Cholesky decomposition.

Julien Mairal Sparse Estimation for Image and Vision Processing 142/187 Sparse reconstruction with the ℓ0-penalty Orthogonal matching pursuit [Pati et al., 1993]

Example with the software SPAMS Software available at http://spams-devel.gforge.inria.fr/.

>> I=double(imread('data/lena.eps'))/255;
>> %extract all patches of I
>> X=im2col(I,[8 8],'sliding');
>> %load a dictionary of size 64 x 256
>> D=load('dict.mat');
>>
>> %set the sparsity parameter L to 10
>> param.L=10;
>> alpha=mexOMP(X,D,param);

On this dual-core laptop: 110000 signals processed per second!

Julien Mairal Sparse Estimation for Image and Vision Processing 143/187 Sparse reconstruction with the ℓ0-penalty Iterative hard-thresholding [Herrity et al., 2006, Blumensath and Davies, 2009]

Require: Signal x in R^m, dictionary D in R^{m×p}, target sparsity k, gradient descent step size η, number of iterations T.
1: Initialize α ← α_0;
2: for t = 1,...,T do
3:   perform one step of gradient descent: α ← α + ηD^⊤(x − Dα);
4:   choose τ to be the k-th largest entry of {|α[1]|,...,|α[p]|};
5:   for j = 1,...,p do
6:     hard-thresholding: α[j] ← α[j] if |α[j]| ≥ τ, and α[j] ← 0 otherwise.
7:   end for
8: end for
9: return the sparse decomposition α in R^p.

Julien Mairal Sparse Estimation for Image and Vision Processing 144/187 Sparse reconstruction with the ℓ0-penalty Iterative hard-thresholding [Herrity et al., 2006, Blumensath and Davies, 2009]

Remarks
This is a projected gradient algorithm:
α ← Π_{||·||_0 ≤ k}[α − η∇f(α)].
It performs one gradient descent step, followed by a Euclidean projection onto the non-convex set of k-sparse vectors.
It can be easily extended to the (approximate) minimization of
min_{α∈R^p} (1/2)||x − Dα||_2^2 + λ||α||_0.
In that case, it can be interpreted as a proximal gradient algorithm.
It can be seen to iteratively decrease the value of the objective function from the majorization-minimization point of view.
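
A short NumPy sketch of iterative hard-thresholding as listed above (illustrative only); a safe default for the step size is η ≤ 1/L with L the largest eigenvalue of D^⊤D, which is the assumption made explicit in the code.

import numpy as np

def iht(x, D, k, T=100, eta=None):
    # Iterative hard-thresholding: gradient step on 0.5*||x - D alpha||^2, then projection
    # onto the (non-convex) set of k-sparse vectors by keeping the k largest magnitudes.
    if eta is None:
        eta = 1.0 / np.linalg.norm(D, 2) ** 2      # 1/L with L = ||D||_2^2 (squared spectral norm)
    alpha = np.zeros(D.shape[1])
    for _ in range(T):
        alpha = alpha + eta * (D.T @ (x - D @ alpha))
        idx = np.argsort(np.abs(alpha))[:-k]       # indices of all but the k largest entries
        alpha[idx] = 0.0
    return alpha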

Julien Mairal Sparse Estimation for Image and Vision Processing 145/187 Sparse reconstruction with the ℓ0-penalty Majorization-minimization principle [Lange et al., 2000]

The principle for (approximately) minimizing a general cost function f:
min_{α∈A} f(α).

[Figure: a surrogate g_β(α) ≥ f(α) that coincides with f at β = α_old; at each step we update α ∈ argmin_{α∈A} g_β(α) to obtain α_new.]

What is the surrogate for the iterative hard-thresholding algorithm? We need to introduce a few principles first...

Julien Mairal Sparse Estimation for Image and Vision Processing 146/187 Introduction of a few optimization principles Convex Functions Why do we care about convexity?

[Figure: graph of a convex function f(α).]

Julien Mairal Sparse Estimation for Image and Vision Processing 147/187 Introduction of a few optimization principles Convex Functions Local observations give information about the global optimum

[Figure: a convex function f(α), a query point α, and the global minimizer α⋆.]

∇f(α) = 0 is a necessary and sufficient optimality condition for differentiable convex functions; it is often easy to upper-bound f(α) − f^⋆.
Julien Mairal Sparse Estimation for Image and Vision Processing 147/187 Introduction of a few optimization principles An important inequality for smooth convex functions

If f is convex

[Figure: a convex function f(α), its linear approximation at α^0, and the minimizer α^⋆.]

f(α) ≥ f(α^0) + ∇f(α^0)^⊤(α − α^0)   (linear approximation);
this is an equivalent definition of convexity for smooth functions.
Julien Mairal Sparse Estimation for Image and Vision Processing 148/187 Introduction of a few optimization principles An important inequality for smooth functions

If ∇f is L-Lipschitz continuous (f does not need to be convex)

[Figure: the quadratic upper bound g(α) of f(α), the iterates α^0 and α^1, and the minimizer α^⋆.]

f(α) ≤ g(α) = f(α^0) + ∇f(α^0)^⊤(α − α^0) [linear approximation] + (L/2)||α − α^0||_2^2;

g(α) = C_{α^0} + (L/2)||α^0 − (1/L)∇f(α^0) − α||_2^2, and its minimizer is α^1 = α^0 − (1/L)∇f(α^0) (gradient descent step).

Julien Mairal Sparse Estimation for Image and Vision Processing 149/187 Introduction of a few optimization principles Gradient Descent Algorithm

Assume that f is convex and differentiable, and that ∇f is L-Lipschitz.

Theorem
Consider the algorithm

α^t ← α^{t−1} − (1/L)∇f(α^{t−1}).
Then,
f(α^t) − f^⋆ ≤ L||α^0 − α^⋆||_2^2 / (2t).

Remarks
the convergence rate improves under additional assumptions on f (strong convexity);
some variants have a O(1/t^2) convergence rate [Nesterov, 2004].

Julien Mairal Sparse Estimation for Image and Vision Processing 150/187 Proof (1/2) Proof of the main inequality for smooth functions

We want to show that for all α and β,
f(α) ≤ f(β) + ∇f(β)^⊤(α − β) + (L/2)||α − β||_2^2.
By using Taylor's theorem with integral form,
f(α) − f(β) = ∫_0^1 ∇f(tα + (1−t)β)^⊤(α − β) dt.
Then,
f(α) − f(β) − ∇f(β)^⊤(α − β) ≤ ∫_0^1 (∇f(tα + (1−t)β) − ∇f(β))^⊤(α − β) dt
  ≤ ∫_0^1 |(∇f(tα + (1−t)β) − ∇f(β))^⊤(α − β)| dt
  ≤ ∫_0^1 ||∇f(tα + (1−t)β) − ∇f(β)||_2 ||α − β||_2 dt   (Cauchy-Schwarz)
  ≤ ∫_0^1 Lt||α − β||_2^2 dt = (L/2)||α − β||_2^2.

Julien Mairal Sparse Estimation for Image and Vision Processing 151/187 Proof (2/2) Proof of the theorem We have shown that for all α,

f(α) ≤ g_t(α) = f(α^{t−1}) + ∇f(α^{t−1})^⊤(α − α^{t−1}) + (L/2)||α − α^{t−1}||_2^2.

g_t is minimized by α^t; it can be rewritten g_t(α) = g_t(α^t) + (L/2)||α − α^t||_2^2. Then,

f(α^t) ≤ g_t(α^t) = g_t(α^⋆) − (L/2)||α^⋆ − α^t||_2^2
  = f(α^{t−1}) + ∇f(α^{t−1})^⊤(α^⋆ − α^{t−1}) + (L/2)||α^⋆ − α^{t−1}||_2^2 − (L/2)||α^⋆ − α^t||_2^2
  ≤ f^⋆ + (L/2)||α^⋆ − α^{t−1}||_2^2 − (L/2)||α^⋆ − α^t||_2^2.
By summing from t = 1 to T, we have a telescopic sum

T(f(α^T) − f^⋆) ≤ Σ_{t=1}^T f(α^t) − f^⋆ ≤ (L/2)||α^⋆ − α^0||_2^2 − (L/2)||α^⋆ − α^T||_2^2.

Julien Mairal Sparse Estimation for Image and Vision Processing 152/187 Sparse reconstruction with the ℓ0-penalty iterative hard-thresholding [Herrity et al., 2006, Blumensath and Davies, 2009]

What is the surrogate gβ(α)?

Julien Mairal Sparse Estimation for Image and Vision Processing 153/187 Sparse reconstruction with the ℓ0-penalty iterative hard-thresholding [Herrity et al., 2006, Blumensath and Davies, 2009] Simply the same as for the gradient descent algorithm:

g_β(α) ≜ f(β) + ∇f(β)^⊤(α − β) + (L/2)||β − α||_2^2,
with β = α_old, L = 1/η and f(α) = (1/2)||x − Dα||_2^2. Indeed,

g_β(α) = C_β + (L/2)||β + ηD^⊤(x − Dβ) − α||_2^2,
and the update can be rewritten

α ← argmin_{α∈R^p: ||α||_0 ≤ k} g_β(α) = Π_{||·||_0 ≤ k}[β + ηD^⊤(x − Dβ)].

Julien Mairal Sparse Estimation for Image and Vision Processing 153/187 Sparse reconstruction with the ℓ1-norm

For the ℓ0-penalty, we have seen 1 a coordinate descent algorithm (matching pursuit); 2 a gradient descent algorithm (iterative hard-thresholding); 3 an active-set algorithm (orthogonal matching pursuit);

For ℓ1, the same three classes of methods play an important role.

Julien Mairal Sparse Estimation for Image and Vision Processing 154/187 Sparse reconstruction with the ℓ1-norm Projected gradient descent

Suppose we want to solve
min_{α∈R^p} (1/2)||x − Dα||_2^2  s.t.  ||α||_1 ≤ µ.
The following update with η small enough converges to a solution:
α ← Π_{||·||_1 ≤ µ}[α + ηD^⊤(x − Dα)].

Remarks
the convergence rate is the same as the gradient descent method for smooth convex functions;
when L is unknown, efficient line-search schemes can be used;
the principle is the same as for the iterative hard-thresholding algorithm;
see [Nesterov, 2004, Bertsekas, 1999, Boyd and Vandenberghe, 2004].
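
Below is a NumPy sketch of this projected gradient scheme for the ℓ1-ball constraint. The projection uses a classical sort-based routine (not detailed in the slides); the fixed step size η = 1/L with L = ||D||_2^2 is an assumption, and in practice a line search would be preferable.

import numpy as np

def project_l1_ball(v, mu):
    # Euclidean projection of v onto the l1-ball {u : ||u||_1 <= mu} (sort-based algorithm).
    if np.sum(np.abs(v)) <= mu:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - mu) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (css[rho] - mu) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_gradient_l1(x, D, mu, T=200):
    # Projected gradient for min 0.5*||x - D alpha||_2^2  s.t.  ||alpha||_1 <= mu.
    eta = 1.0 / np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(T):
        alpha = project_l1_ball(alpha + eta * (D.T @ (x - D @ alpha)), mu)
    return alpha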

Julien Mairal Sparse Estimation for Image and Vision Processing 155/187 Sparse reconstruction with the ℓ1-norm The proximal gradient method

We consider a smooth convex function f and a non-smooth regularizer ψ:
min_{α∈R^p} f(α) + ψ(α).
For example,
min_{α∈R^p} (1/2)||x − Dα||_2^2 + λ||α||_1.

The objective function is not differentiable. An extension of gradient descent for such a problem is called “proximal gradient descent” [Beck and Teboulle, 2009, Nesterov, 2013].

Julien Mairal Sparse Estimation for Image and Vision Processing 156/187 Sparse reconstruction with the ℓ1-norm An important inequality for composite functions

If ∇f is L-Lipschitz continuous

[Figure: the composite upper bound g(α) + ψ(α) of f(α) + ψ(α), with iterates α^0, α^1 and minimizer α^⋆.]

f(α) + ψ(α) ≤ f(α^0) + ∇f(α^0)^⊤(α − α^0) + (L/2)||α − α^0||_2^2 + ψ(α);
α^1 minimizes g + ψ.

Julien Mairal Sparse Estimation for Image and Vision Processing 157/187 Sparse reconstruction with the ℓ1-norm The proximal gradient method Gradient descent for minimizing f consists of

α^t ← argmin_{α∈R^p} g_t(α)  ⟺  α^t ← α^{t−1} − (1/L)∇f(α^{t−1}).
The proximal gradient method for minimizing f + ψ consists of
α^t ← argmin_{α∈R^p} g_t(α) + ψ(α),
which is equivalent to

α^t ← argmin_{α∈R^p} (1/2)||α^{t−1} − (1/L)∇f(α^{t−1}) − α||_2^2 + (1/L)ψ(α).

It requires computing efficiently the proximal operator of ψ:
β ↦ argmin_{α∈R^p} (1/2)||β − α||_2^2 + ψ(α).

Julien Mairal Sparse Estimation for Image and Vision Processing 158/187 Sparse reconstruction with the ℓ1-norm The proximal gradient method Remarks also known as forward-backward algorithm; has similar convergence rates as the gradient descent method; there exist line-search schemes to automatically tune L; there exist accelerated schemes [Beck and Teboulle, 2009, Nesterov, 2013].

The case of ℓ1: the proximal operator of λ||·||_1 is the soft-thresholding operator
α[j] = sign(β[j])(|β[j]| − λ)_+.
The resulting algorithm is called iterative soft-thresholding [Nowak and Figueiredo, 2001, Figueiredo and Nowak, 2003, Starck et al., 2003, Daubechies et al., 2004].
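
A minimal NumPy sketch of iterative soft-thresholding (ISTA) for the ℓ1-regularized least-squares problem above; the constant step size 1/L with L = ||D||_2^2 and the fixed iteration count are assumptions, and the accelerated (FISTA) variant is not shown.

import numpy as np

def soft_threshold(u, t):
    # Proximal operator of t*||.||_1.
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def ista(x, D, lam, T=500):
    # Proximal gradient (ISTA) for min 0.5*||x - D alpha||_2^2 + lam*||alpha||_1.
    L = np.linalg.norm(D, 2) ** 2                  # Lipschitz constant of the smooth part
    alpha = np.zeros(D.shape[1])
    for _ in range(T):
        grad = D.T @ (D @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha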

Julien Mairal Sparse Estimation for Image and Vision Processing 159/187 Sparse reconstruction with the ℓ1-norm The proximal gradient method

The proximal operator for the group Lasso penalty

min_{α∈R^p} (1/2)||β − α||_2^2 + λ Σ_{g∈G} ||α[g]||_q.

For q = 2,

α[g] = (β[g]/||β[g]||_2)(||β[g]||_2 − λ)_+, ∀g ∈ G.
For q = ∞,
α[g] = β[g] − Π_{||·||_1 ≤ λ}[β[g]], ∀g ∈ G.
These formulas generalize soft-thresholding to groups of variables.
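
For the q = 2 case, the group soft-thresholding formula translates directly into a few lines of NumPy (a sketch for non-overlapping groups; passing the group structure as index lists is an assumed convention).

import numpy as np

def prox_group_lasso(beta, groups, lam):
    # Proximal operator of lam * sum_g ||alpha[g]||_2 for non-overlapping groups:
    # each block is shrunk toward zero, and set to zero if its norm is below lam.
    alpha = np.zeros_like(beta)
    for g in groups:                               # g is a list/array of indices
        norm = np.linalg.norm(beta[g])
        if norm > lam:
            alpha[g] = (1.0 - lam / norm) * beta[g]
    return alpha

# e.g. prox_group_lasso(np.array([3.0, 4.0, 0.5]), [[0, 1], [2]], lam=1.0)
# shrinks the first block (norm 5) and zeroes the second (norm 0.5).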

Julien Mairal Sparse Estimation for Image and Vision Processing 160/187 Sparse reconstruction with the ℓ1-norm The proximal gradient method

A few proximal operators:

ℓ0-penalty: hard-thresholding;

ℓ1-norm: soft-thresholding; group-Lasso: group soft-thresholding; fused-lasso (1D total variation): [Hoefling, 2010]; hierarchical norms: [Jenatton et al., 2011b], O(p) complexity;

overlapping group Lasso with ℓ∞-norm: [Mairal et al., 2010b], (link with network flow optimization);

Julien Mairal Sparse Estimation for Image and Vision Processing 161/187 Sparse reconstruction with the ℓ1-norm Coordinate descent for the Lasso [Fu, 1998]

min_{α∈R^p} (1/2)||x − Dα||_2^2 + λ||α||_1.
The coordinate descent method consists of iteratively fixing all variables but one and optimizing with respect to that one:

α[j] ← argmin_{α∈R} (1/2)||x − Σ_{l≠j} α[l]d_l − αd_j||_2^2 + λ|α|,

where r ≜ x − Σ_{l≠j} α[l]d_l denotes the residual. Assuming the columns of D have unit ℓ2-norm,

α[j] ← sign(d_j^⊤r)(|d_j^⊤r| − λ)_+.
This involves again the soft-thresholding operator.

Julien Mairal Sparse Estimation for Image and Vision Processing 162/187 Sparse reconstruction with the ℓ1-norm Coordinate descent for the Lasso [Fu, 1998]

Remarks
no parameter to tune!
several strategies are possible for selecting the variable to update;
impressive performance with five lines of code;
coordinate descent + nonsmooth objective is not convergent in general. Here, the problem is equivalent to a convex smooth optimization problem with separable constraints
min_{α_+, α_−} (1/2)||x − Dα_+ + Dα_−||_2^2 + λα_+^⊤1 + λα_−^⊤1  s.t.  α_−, α_+ ≥ 0.
For this specific problem, the algorithm is convergent;
can be extended to the group-Lasso, or other loss functions;
j can be picked at random, or by cycling (harder to analyze).
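
Here is the "five lines of code" claim made concrete: a NumPy sketch of cyclic coordinate descent for the Lasso, assuming unit-norm columns of D and maintaining the full residual so that each coordinate update is a single soft-thresholding.

import numpy as np

def cd_lasso(x, D, lam, n_epochs=50):
    # Cyclic coordinate descent for min 0.5*||x - D alpha||_2^2 + lam*||alpha||_1,
    # with unit l2-norm columns; r always stores the full residual x - D alpha.
    p = D.shape[1]
    alpha = np.zeros(p)
    r = x.copy()
    for _ in range(n_epochs):
        for j in range(p):
            r += alpha[j] * D[:, j]                # residual without coordinate j
            c = D[:, j] @ r
            alpha[j] = np.sign(c) * max(abs(c) - lam, 0.0)
            r -= alpha[j] * D[:, j]
    return alpha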

Julien Mairal Sparse Estimation for Image and Vision Processing 163/187 Sparse reconstruction with the ℓ1-norm Smoothing techniques: reweighted ℓ2 [Daubechies et al., 2010, Bach et al., 2012]

Let us start from something simple:
a^2 − 2ab + b^2 ≥ 0.
Then
a ≤ (1/2)(a^2/b + b)  with equality iff a = b,
and
||α||_1 = min_{η_j ≥ 0} (1/2) Σ_{j=1}^p (α[j]^2/η_j + η_j).
The formulation becomes

min_{α, η_j ≥ ε} (1/2)||x − Dα||_2^2 + (λ/2) Σ_{j=1}^p (α[j]^2/η_j + η_j).
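
The reweighted-ℓ2 formulation above alternates two closed-form updates; the sketch below (illustrative, with an assumed smoothing parameter ε and iteration count) solves a ridge-like system for α with weights 1/η_j, then updates η_j = max(|α[j]|, ε).

import numpy as np

def reweighted_l2(x, D, lam, eps=1e-3, n_iter=20):
    # Alternate minimization of 0.5*||x - D alpha||^2 + (lam/2)*sum_j(alpha_j^2/eta_j + eta_j)
    # over alpha (weighted ridge, closed form) and eta (eta_j = max(|alpha_j|, eps)).
    p = D.shape[1]
    eta = np.ones(p)
    DtD, Dtx = D.T @ D, D.T @ x
    for _ in range(n_iter):
        alpha = np.linalg.solve(DtD + lam * np.diag(1.0 / eta), Dtx)
        eta = np.maximum(np.abs(alpha), eps)
    return alpha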

Julien Mairal Sparse Estimation for Image and Vision Processing 164/187 Sparse reconstruction with the ℓ1-norm Homotopy [Osborne et al., 2000b, Efron et al., 2004, Ritter, 1962]

[Figure: coefficient values w1, ..., w5 as a function of λ; the regularization path of the Lasso is piecewise linear.]

min_{α∈R^p} (1/2)||x − Dα||_2^2 + λ||α||_1.
Property discovered by Markowitz [1952].
Julien Mairal Sparse Estimation for Image and Vision Processing 165/187 Sparse reconstruction with the ℓ1-norm Homotopy [Osborne et al., 2000b, Efron et al., 2004, Ritter, 1962]

Theorem α is a solution of the Lasso if and only if

|d_j^⊤(x − Dα)| ≤ λ if α[j] = 0,
d_j^⊤(x − Dα) = λ sign(α[j]) otherwise.
Consequence

α^⋆[Γ] = (D_Γ^⊤D_Γ)^{−1}(D_Γ^⊤x − λ sign(α^⋆[Γ])) = A + λB,
where Γ = {j s.t. α[j] ≠ 0}. If we know Γ and the signs of α^⋆ in advance, we have a closed-form solution. Following the piecewise linear regularization path is called the homotopy method [Osborne et al., 2000a, Efron et al., 2004].

Julien Mairal Sparse Estimation for Image and Vision Processing 166/187 Sparse reconstruction with the ℓ1-norm Homotopy [Osborne et al., 2000b, Efron et al., 2004, Ritter, 1962]
The regularization path (λ, α^⋆(λ)) is piecewise linear.
1 Start from the trivial solution (λ = ||D^⊤x||_∞, α^⋆(λ) = 0).
2 Define Γ = {j s.t. |d_j^⊤x| = λ}.
3 Follow the regularization path: α^⋆_Γ(λ) = A + λB, keeping α^⋆_{Γ^c} = 0 and decreasing the value of λ, until one of the following events occurs:
∃ j ∉ Γ such that |d_j^⊤(x − Dα^⋆(λ))| = λ, then Γ ← Γ ∪ {j};
∃ j ∈ Γ such that α^⋆[j](λ) = 0, then Γ ← Γ \ {j}.
4 Update the direction of the path and go back to 3.
Hidden assumptions: the regularization path is unique; variables enter the path one at a time.
Extremely efficient for small/medium scale problems (p ≤ 10 000) and/or very sparse problems (when implemented correctly). Robust to correlated features. Can solve the elastic-net.
Julien Mairal Sparse Estimation for Image and Vision Processing 167/187 Sparse reconstruction with the ℓ1-norm Homotopy [Osborne et al., 2000b, Efron et al., 2004, Ritter, 1962]
Theorem - worst case analysis [Mairal and Yu, 2012]
In the worst case, the regularization path of the Lasso has exactly (3^p + 1)/2 linear segments.
Regularization path, p = 6

[Figure: worst-case regularization path for p = 6; coefficients (log scale) plotted against the kinks of the path (several hundred kinks).]

Julien Mairal Sparse Estimation for Image and Vision Processing 168/187 Sparse reconstruction with the ℓ1-norm Lasso empirical comparison: Lasso, small scale (n = 200, p = 200), reg: low / high, corr: low / high

Julien Mairal Sparse Estimation for Image and Vision Processing 169/187 Sparse reconstruction with the ℓ1-norm Empirical comparison: Lasso, medium scale (n = 2000,p = 10000) reg:low reg:high corr:low corr:high

Julien Mairal Sparse Estimation for Image and Vision Processing 170/187 Sparse reconstruction with the ℓ1-norm Empirical comparison: conclusions

Lasso
Generic methods (subgradient descent, QP/CP solvers) are slow;
homotopy is fastest in low dimension and/or for high correlation;
proximal methods are competitive, especially in larger settings and/or with weak correlation, weak regularization, or low precision;
coordinate descent is usually dominated by LARS, but much simpler to implement!

Smooth losses and other regularizations
LARS is not available → (block) coordinate descent and proximal gradient methods are good candidates.

Julien Mairal Sparse Estimation for Image and Vision Processing 171/187 Iterative reweighted ℓ1-algorithms DC (difference of convex) - Programming

Remember? Concave functions with a kink at zero: ψ(α) = Σ_{j=1}^p ϕ(|α[j]|).
ℓq-“pseudo-norm”, with 0 < q < 1: ψ(α) ≜ Σ_{j=1}^p (|α[j]| + ε)^q;
log penalty: ψ(α) ≜ Σ_{j=1}^p log(|α[j]| + ε);
ϕ is any function that looks like this:

[Figure: a concave, increasing penalty ϕ(α) of |α| with a kink at zero.]

Julien Mairal Sparse Estimation for Image and Vision Processing 172/187 ϕ′( αt ) α + C | | | |

αt | | αt f (α) | | ϕ(α) = log( α + ε) | |

f (α)+ ϕ′( αt ) α + C | | | |

αk | | f (α)+ ϕ( α ) | |

Figure: Illustration of the DC-programmingJulien Mairal Sparse approach. Estimation for The Image non-conve and Vision Processingx part 173/187 of DC (difference of convex) - Programming

min_{α∈R^p} f(α) + λ Σ_{j=1}^p ϕ(|α[j]|).
This problem is non-convex. f is convex, and ϕ is concave on R_+. If α^t is the current estimate at iteration t, the algorithm solves

α^{t+1} ← argmin_{α∈R^p} f(α) + λ Σ_{j=1}^p ϕ'(|α^t[j]|)|α[j]|,

which is a reweighted-ℓ1 problem [Figueiredo and Nowak, 2005, Figueiredo et al., 2007, Cand`es et al., 2008].
Warning: it does not solve the non-convex problem, it only provides a stationary point. In practice, each iteration sets small coefficients to zero. After 2-3 iterations, the result does not change much.
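
A small NumPy sketch of this reweighted-ℓ1 scheme for the log penalty (yielding only a stationary point, as warned above): each outer iteration linearizes the concave penalty, giving per-coordinate weights λ/(|α[j]| + ε), and the resulting weighted Lasso is solved here with a basic ISTA inner loop; the loop sizes and ε are assumptions.

import numpy as np

def reweighted_l1(x, D, lam, eps=0.01, n_outer=3, n_inner=300):
    # DC-programming / reweighted-l1 for f(alpha) + lam * sum_j log(|alpha_j| + eps):
    # the outer loop updates the weights, the inner loop solves the weighted Lasso by ISTA.
    L = np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(n_outer):
        w = lam / (np.abs(alpha) + eps)            # linearization of the concave penalty
        for _ in range(n_inner):
            u = alpha - (D.T @ (D @ alpha - x)) / L
            alpha = np.sign(u) * np.maximum(np.abs(u) - w / L, 0.0)
    return alpha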

Julien Mairal Sparse Estimation for Image and Vision Processing 174/187 Optimization for Dictionary Learning

min_{D∈C, A∈R^{p×n}} Σ_{i=1}^n (1/2)||x_i − Dα_i||_2^2 + λψ(α_i),
with C ≜ {D ∈ R^{m×p} s.t. ∀j = 1,...,p, ||d_j||_2 ≤ 1}.

Classical approach

Alternate minimization between D and α (MOD with ψ = ℓ0 [Engan et al., 1999], K-SVD with ψ = ℓ0 [Aharon et al., 2006], [Lee et al., 2007] with ψ = ℓ1); good results, reliable, but can be slow when n is large!

Julien Mairal Sparse Estimation for Image and Vision Processing 175/187 Optimization for Dictionary Learning

Empirical risk minimization point of view
min_{D∈C} f_n(D) = min_{D∈C} (1/n) Σ_{i=1}^n L(x_i, D),
where
L(x, D) ≜ min_{α∈R^p} (1/2)||x − Dα||_2^2 + λψ(α).

Which formulation are we interested in?
min_{D∈C} { f(D) = E_x[L(x, D)] ≈ lim_{n→+∞} (1/n) Σ_{i=1}^n L(x_i, D) }
[Bottou and Bousquet, 2008]: online learning can handle potentially infinite or dynamic datasets, and be dramatically faster than batch algorithms.

Julien Mairal Sparse Estimation for Image and Vision Processing 176/187 Optimization for Dictionary Learning Stochastic gradient descent

Recipe
draw a single point x_t (or a mini-batch) at each iteration;
update D ← Π_C[D − η_t ∇_D L(x_t, D)],
which is equivalent (up to some assumptions) to
α_t ← argmin_{α∈R^p} (1/2)||x_t − Dα||_2^2 + λ||α||_1,
D ← Π_C[D + η_t(x_t − Dα_t)α_t^⊤].

Remark
historically, this is very close to the original algorithm of Olshausen and Field [1996].

Pros and cons
(+) can be effective in practice;
(−) difficult to tune.
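
One stochastic gradient iteration of this recipe can be sketched in a few NumPy lines (not the SPAMS implementation): sparse-code the drawn sample with a basic ISTA loop, take a gradient step on D, and project each column onto the unit ℓ2-ball; the inner iteration count and the step size η_t are exactly the delicate tuning choices mentioned above.

import numpy as np

def sgd_dict_step(D, x, lam, eta, ista_iters=100):
    # One stochastic step of dictionary learning: lasso on x with the current D (via ISTA),
    # gradient step D <- D + eta*(x - D alpha) alpha^T, then projection of every column
    # onto the unit l2-ball (the constraint set used throughout these slides).
    L = np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(ista_iters):
        u = alpha - (D.T @ (D @ alpha - x)) / L
        alpha = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
    D = D + eta * np.outer(x - D @ alpha, alpha)
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D / norms, alpha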

Julien Mairal Sparse Estimation for Image and Vision Processing 177/187 Optimization for Dictionary Learning Online dictionary learning [Mairal et al., 2010a]

Recipe
stochastic majorization-minimization algorithm;
relies on a fast dictionary update;
easier to tune (the implementation of SPAMS has been successfully used by others in plenty of “exotic”, unexpected scenarios).

Julien Mairal Sparse Estimation for Image and Vision Processing 178/187 Optimization for Dictionary Learning Online dictionary learning [Mairal et al., 2010a]

Require: D_0 ∈ R^{m×p} (initial dictionary); λ ∈ R
1: C_0 = 0, B_0 = 0.
2: for t = 1,...,T do
3:   Draw x_t
4:   Sparse coding: α_t ← argmin_{α∈R^p} (1/2)||x_t − D_{t−1}α||_2^2 + λ||α||_1
5:   Aggregate sufficient statistics: C_t ← C_{t−1} + α_t α_t^⊤,  B_t ← B_{t−1} + x_t α_t^⊤
6:   Dictionary update:
     D_t ← argmin_{D∈C} (1/t) Σ_{i=1}^t [(1/2)||x_i − Dα_i||_2^2 + λ||α_i||_1]
         = argmin_{D∈C} (1/t) [(1/2)Tr(D^⊤DC_t) − Tr(D^⊤B_t)].
7: end for

Julien Mairal Sparse Estimation for Image and Vision Processing 179/187 Optimization for Dictionary Learning Fast dictionary update [Mairal et al., 2010a]

Require: D_0 ∈ C (input dictionary); X ∈ R^{m×n} (dataset); A ∈ R^{p×n} (sparse codes)
1: Initialization: D ← D_0; B ← XA^⊤; C ← AA^⊤;
2: repeat
3:   for j = 1,...,p do
4:     update the j-th column:
         d_j ← (1/C[j,j])(b_j − Dc_j) + d_j,
         d_j ← (1/max(||d_j||_2, 1)) d_j.
5:   end for
6: until convergence;
7: return D (updated dictionary).

Julien Mairal Sparse Estimation for Image and Vision Processing 180/187 Optimization for Dictionary Learning Fast dictionary update [Mairal et al., 2010a]

Minimizing with respect to one column d_j when keeping the other columns fixed can be formulated as

d_j ← argmin_{d∈R^m, ||d||_2≤1} Σ_{i=1}^n (1/2)||x_i − Σ_{l≠j} α_i[l]d_l − α_i[j]d||_2^2.

Then, in matrix form,

d_j ← argmin_{d∈R^m, ||d||_2≤1} (1/2)||X − DA + d_j α^j − d α^j||_F^2,

where α^j denotes the j-th row of A. After expanding the Frobenius norm and removing the constant term,

d_j ← argmin_{d∈R^m, ||d||_2≤1} −d^⊤(Xα^{j⊤} − DAα^{j⊤} + d_j||α^j||_2^2) + (||α^j||_2^2/2)||d||_2^2
    = argmin_{d∈R^m, ||d||_2≤1} −d^⊤(b_j − Dc_j + d_j C[j,j]) + (C[j,j]/2)||d||_2^2
    = argmin_{d∈R^m, ||d||_2≤1} (1/2)||(1/C[j,j])(b_j − Dc_j) + d_j − d||_2^2.
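
The column-wise update derived above is easy to express with the sufficient statistics B = XA^⊤ and C = AA^⊤; the sketch below (an illustration, under the assumption that unused atoms with C[j,j] ≈ 0 are simply skipped) performs one or more passes of this block-coordinate dictionary update.

import numpy as np

def dictionary_update(D, B, C, n_passes=1):
    # Block-coordinate dictionary update: for each column, d_j <- d_j + (b_j - D c_j)/C[j,j],
    # then renormalize so that ||d_j||_2 <= 1.
    D = D.copy()
    p = D.shape[1]
    for _ in range(n_passes):
        for j in range(p):
            if C[j, j] > 1e-12:                    # skip atoms that were never used
                d = D[:, j] + (B[:, j] - D @ C[:, j]) / C[j, j]
                D[:, j] = d / max(np.linalg.norm(d), 1.0)
    return D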

Julien Mairal Sparse Estimation for Image and Vision Processing 181/187 Optimization for Dictionary Learning [Mairal et al., 2010a]

Which guarantees do we have? Under a few reasonable assumptions,

we build a surrogate function ĝ_t of the expected cost f verifying

lim_{t→+∞} ĝ_t(D_t) − f(D_t) = 0,

Dt is asymptotically close to a stationary point.

Extensions (all implemented in SPAMS) non-negative matrix decompositions; sparse PCA (sparse dictionaries); fused-lasso regularizations (piecewise constant dictionaries); non-convex regularization, structured regularization.

Julien Mairal Sparse Estimation for Image and Vision Processing 182/187 Optimization for Dictionary Learning Experimental results, batch vs online

m = 8 × 8, p = 256. Julien Mairal Sparse Estimation for Image and Vision Processing 183/187 Optimization for Dictionary Learning Experimental results, batch vs online

m = 12 × 12 × 3, p = 512. Julien Mairal Sparse Estimation for Image and Vision Processing 184/187 Conclusions from the fourth part

there are a few algorithms for sparse estimation that are efficient and easy to implement; there is no algorithm that wins all the time; designing an evaluation benchmark that makes sense is hard.

What was not covered stochastic optimization for sparse estimation; proximal splitting algorithms.

Advertisement again the SPAMS toolbox already contains lots of code (C++ interfaced with Matlab, Python, R) for learning dictionaries, factorizing matrices (NMF, archetypal analysis), solving sparse estimation problems, including most of the algorithms we have presented. http://spams-devel.gforge.inria.fr/.

Julien Mairal Sparse Estimation for Image and Vision Processing 185/187 Part V: Application cases

Julien Mairal Sparse Estimation for Image and Vision Processing 186/187 Application cases

Case 1 use of dictionary learning for processing electrophysiological data from the visual cortex.

Case 2 use of structured sparse models for next-generation DNA/RNA sequencing.

Julien Mairal Sparse Estimation for Image and Vision Processing 187/187 References I

M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006. N. Ahmed, T. Natarajan, and K. R. Rao. Discrete cosine transform. IEEE Transactions on Computers, 100(1):90–93, 1974. H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, volume 1, pages 267–281, 1973. F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundation and Trends in Machine Learning, 4:1–106, 2012. S. Bakin. Adaptive regression and model selection in data mining problems. PhD thesis, 1999.

Julien Mairal Sparse Estimation for Image and Vision Processing 188/187 References II R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4): 1982–2001, 2010. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009. A. J. Bell and T. J. Sejnowski. The “independent components” of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997. M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In Proceedings of the ACM SIGGRAPH Conference, 2000. D. P. Bertsekas. Nonlinear programming. Athena Scientific, 1999. 2nd edition.

Julien Mairal Sparse Estimation for Image and Vision Processing 189/187 References III T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009. J.M. Borwein and A.S. Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer, 2006. T. Bossomaier and A. W. Snyder. Why spatial frequency processing in the visual cortex? Vision Research, 26(8):1307–1309, 1986. L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2008. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. A. Buades, B. Coll, and J.-M. Morel. A review of image denoising algorithms, with a new one. SIAM Journal on Multiscale Modeling and Simulation, 4(2):490–530, 2005.

Julien Mairal Sparse Estimation for Image and Vision Processing 190/187 References IV A. Buades, B. Coll, J.-M. Morel, and C. Sbert. Self-similarity driven color demosaicking. IEEE Transactions on Image Processing, 18(6): 1192–1202, 2009. P. B¨uhlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer, 2011. T. T. Cai. Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Annals of Statistics, 27(3):898–924, 1999. E. J. Cand`es and D. L. Donoho. Recovering edges in ill-posed inverse problems: Optimality of curvelet frames. Annals of Statistics, 30(3): 784–842, 2002. E. J. Cand`es, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2): 489–509, 2006.

Julien Mairal Sparse Estimation for Image and Vision Processing 191/187 References V E.J. Cand`es, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5): 877–905, 2008. M. Carandini, J. B. Demb, V. Mante, D. J. Tolhurst, Y. Dan, B. A. Olshausen, J. L. Gallant, and N. C. Rust. Do we know what the early visual system does? The Journal of Neuroscience, 25(46): 10577–10597, 2005. J.-F. Cardoso. Dependence, correlation and Gaussianity in independent component analysis. Journal of Machine Learning Research, 4: 1177–1203, 2003. V. Cehver, M. Duarte, C. Hedge, and R.G. Baraniuk. Sparse signal recovery using markov random fields. In Advances in Neural Information Processing Systems (NIPS), 2008.

Julien Mairal Sparse Estimation for Image and Vision Processing 192/187 References VI S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999. J. F. Claerbout and F. Muir. Robust modeling with erratic data. Geophysics, 38(5):826–844, 1973. S. F. Cotter, J. Adler, B. Rao, and K. Kreutz-Delgado. Forward sequential algorithms for best basis selection. In IEEE Proceedings of Vision Image and Signal Processing, pages 235–244, 1999. F. Couzinie-Devy, J. Mairal, F. Bach, and J. Ponce. Dictionary learning for deblurring and digital zoom. preprint arXiv:1110.0957, 2011. T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley and Sons, 2006. 2nd edition. A. Cutler and L. Breiman. Archetypal analysis. Technometrics, 36(4): 338–347, 1994.

Julien Mairal Sparse Estimation for Image and Vision Processing 193/187 References VII K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Inverse halftoning by pointwise shape-adaptive DCT regularized deconvolution. In Proceedings of the International TICSP Workshop on Spectral Methods Multirate Signal Processing (SMMSP), 2006. K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007. K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. BM3D image denoising with shape-adaptive principal component analysis. In SPARS’09-Signal Processing with Adaptive Sparse Structured Representations, 2009. I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11): 1413–1457, 2004.

Julien Mairal Sparse Estimation for Image and Vision Processing 194/187 References VIII I. Daubechies, R. DeVore, M. Fornasier, and C. S. G¨unt¨urk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1–38, 2010. J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2(7):1160–1169, 1985. M. Do and M. Vertterli. Contourlets, Beyond Wavelets. Academic Press, 2003. D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994. C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

Julien Mairal Sparse Estimation for Image and Vision Processing 195/187 References IX A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In Proceedings of the International Conference on Computer Vision (ICCV), 1999. M. A. Efroymson. Multiple regression analysis. Mathematical methods for digital computers, 9(1):191–203, 1960. M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006. K. Engan, S. O. Aase, and J. H. Husoy. Method of optimal directions for frame design. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1999. M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, volume 6, pages 4734–4739, 2001.

Julien Mairal Sparse Estimation for Image and Vision Processing 196/187 References X M. A. T. Figueiredo and R. D. Nowak. An EM algorithm for wavelet-based image restoration. IEEE Transactions on Image Processing, 12(8):906–916, 2003. M. A. T. Figueiredo and R. D. Nowak. A bound optimization approach to wavelet-based image deconvolution. In Proceedings of the International Conference on Image Processing (ICIP), 2005. M. A. T. Figueiredo, J. M. Bioucas-Dias, and R. D. Nowak. Majorization-minimization algorithms for wavelet-based image restoration. IEEE Transactions on Image Processing, 16(12): 2980–2991, 2007. R. W. Floyd and L. Steinberg. An adaptive algorithm for spatial grey scale. In Proceedings of the Society of Information Display, volume 17, pages 75–77, 1976. I. E Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.

Julien Mairal Sparse Estimation for Image and Vision Processing 197/187 References XI W.J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998. G. M. Furnival and R. W. Wilson. Regressions by leaps and bounds. Technometrics, 16(4):499–511, 1974. A. Gersho and R. M. Gray. Vector quantization and signal compression. Kluwer Academic Publishers, 1992. Y. Grandvalet and S. Canu. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. In Advances in Neural Information Processing Systems (NIPS), 1999. H. Grassmann. LXXXVII. On the theory of compound colours. Philosophical Magazine Series 4, 7(45):254–264, 1854. R. Gribonval, R. Jenatton, F. Bach, M. Kleinsteuber, and M. Seibert. Sample complexity of dictionary learning and other matrix factorizations. preprint arXiv:1312.3790, 2013.

Julien Mairal Sparse Estimation for Image and Vision Processing 198/187 References XII P. Hall, G. Kerkyacharian, and D. Picard. On the minimax optimality of block thresholded wavelet estimators. Statistica Sinica, 9(1):33–49, 1999. K. K. Herrity, A. C. Gilbert, and J. A. Tropp. Sparse approximation via iterative thresholding. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, (ICASSP), 2006. R. R. Hocking. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32:1–49, 1976. H. Hoefling. A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical Statistics, 19(4):984–1006, 2010. H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.

Julien Mairal Sparse Estimation for Image and Vision Processing 199/187 References XIII J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009. D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology, 195 (1):215–243, 1968. A. Hyv¨arinen, P. O. Hoyer, and M. Inki. Topographic independent component analysis. Neural computation, 13(7):1527–1558, 2001. A. Hyv¨arinen, J. Karhunen, and E. Oja. Independent component analysis. John Wiley and Sons, 2004. A. Hyv¨arinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer, 2009. L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

Julien Mairal Sparse Estimation for Image and Vision Processing 200/187 References XIV R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010. R. Jenatton, J-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777–2824, 2011a. R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011b. K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

Julien Mairal Sparse Estimation for Image and Vision Processing 201/187 References XV K. Kavukcuoglu, M. Ranzato, and Y. LeCun. Fast inference in sparse coding algorithms with applications to object recognition. preprint arXiv:1010.3467, 2010. K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1–20, 2000. E. Le Pennec and S. Mallat. Sparse geometric image representations with bandelets. IEEE Transactions on Image Processing, 14(4): 423–438, 2005. Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and Muller K., editors, Neural Networks: Tricks of the trade. Springer, 1998. H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. Advances in Neural Information Processing Systems (NIPS), 19:801, 2007.

Julien Mairal Sparse Estimation for Image and Vision Processing 202/187 References XVI J. Mairal and B. Yu. Complexity analysis of the Lasso regularization path. In Proceedings of the International Conference on Machine Learning (ICML), 2012. J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53–69, 2008a. J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modeling and Simulation, 7(1):214–241, 2008b. J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009a. J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In Proceedings of the International Conference on Computer Vision (ICCV), 2009b.

Julien Mairal Sparse Estimation for Image and Vision Processing 203/187 References XVII J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010a. J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems (NIPS), 2010b. J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. Journal of Machine Learning Research, 12:2681–2720, 2011. J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4): 791–804, 2012. S. Mallat. A wavelet tour of signal processing. Academic press, 2008. 3rd edition.

Julien Mairal Sparse Estimation for Image and Vision Processing 204/187 References XVIII S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993. C. L. Mallows. Choosing variables in a linear regression: A graphical aid. unpublished paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas, 1964. C. L. Mallows. Choosing a subset regression. unpublished paper presented at the Joint Statistical Meeting, Los Angeles, California, 1966. H. Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, 1952. J. C. Maxwell. On the theory of compound colours, and the relations of the colours of the spectrum. Philosophical Transactions of the Royal Society of London, pages 57–84, 1860. C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3): 455–489, 2013.

Julien Mairal Sparse Estimation for Image and Vision Processing 205/187 References XIX N. M. Nasrabadi and R. A. King. Image coding using vector quantization: A review. IEEE Transactions on Communications, 36 (8):957–971, 1988. B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995. J. Nathans, D. Thomas, and D. S. Hogness. Molecular genetics of human color vision: the genes encoding blue, green, and red pigments. Science, 232(4747):193–202, 1986. Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2004. Y. Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1):125–161, 2013. I. Newton. Hypothesis explaining the properties of light. In The History of the Royal Society, volume 3, pages 247–269. T. Birch, 1675. text published in 1757.

Julien Mairal Sparse Estimation for Image and Vision Processing 206/187 References XX R. D. Nowak and M. A. T. Figueiredo. Fast wavelet-based image deconvolution using the EM algorithm. In Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers., 2001. B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381: 607–609, 1996. B. A. Olshausen and D. J. Field. How close are we to understanding V1? Neural computation, 17(8):1665–1699, 2005. M. R. Osborne, B. Presnell, and B. A. Turlach. On the Lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–37, 2000a. M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA journal of numerical analysis, 20(3):389–403, 2000b.

Julien Mairal Sparse Estimation for Image and Vision Processing 207/187 References XXI
P. Paatero and U. Tapper. Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.
Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, 1993.
J. Pearl. On coding and filtering stationary signals by discrete Fourier transforms (corresp.). IEEE Transactions on Information Theory, 19(2):229–232, 1973.
D.-T. Pham. Fast algorithms for mutual information based independent component analysis. IEEE Transactions on Signal Processing, 52(10):2690–2700, 2004.
W. Pratt. Spatial transform coding of color images. IEEE Transactions on Communication Technology, 19(6):980–992, 1971.

Julien Mairal Sparse Estimation for Image and Vision Processing 208/187 References XXII
M. Protter and M. Elad. Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18(1):27–35, 2009.
F. Rapaport, A. Zinovyev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray data using gene networks. BMC Bioinformatics, 8(1):35, 2007.
R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua. Learning separable filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
K. Ritter. Ein Verfahren zur Lösung parameterabhängiger, nichtlinearer Maximum-Probleme. Mathematical Methods of Operations Research, 6(4):149–166, 1962.

Julien Mairal Sparse Estimation for Image and Vision Processing 209/187 References XXIII
L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, 1993.
G. Sharma and H. J. Trussell. Digital color imaging. IEEE Transactions on Image Processing, 6(7):901–932, 1997.
E. P. Simoncelli and B. A. Olshausen. Natural image statistics and neural representation. Annual Review of Neuroscience, 24(1):1193–1216, 2001.

Julien Mairal Sparse Estimation for Image and Vision Processing 210/187 References XXIV
E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2):587–607, 1992.
N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2005.
J.-L. Starck, D. L. Donoho, and E. J. Candès. Astronomical image representation by the curvelet transform. Astronomy and Astrophysics, 398(2):785–800, 2003.
H. L. Taylor, S. C. Banks, and J. F. McCoy. Deconvolution with the ℓ1 norm. Geophysics, 44(1):39–52, 1979.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.

Julien Mairal Sparse Estimation for Image and Vision Processing 211/187 References XXV
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society Series B, 67(1):91–108, 2005.
B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
D. Vainsencher, S. Mannor, and A. M. Bruckstein. The sample complexity of dictionary learning. Journal of Machine Learning Research, 12:3259–3281, 2011.
H. von Helmholtz. LXXXI. On the theory of compound colours. Philosophical Magazine Series 4, 4(28):519–534, 1852.
M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

Julien Mairal Sparse Estimation for Image and Vision Processing 212/187 References XXVI
D. Wrinch and H. Jeffreys. XLII. On certain fundamental principles of scientific inquiry. Philosophical Magazine Series 6, 42(249):369–390, 1921.
J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang. Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing, 21(8):3467–3478, 2012.
T. Young. A course of lectures on natural philosophy and the mechanical art. Taylor and Watson, 1845.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68:49–67, 2006.

Julien Mairal Sparse Estimation for Image and Vision Processing 213/187 References XXVII
M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse representations. In Curves and Surfaces, pages 711–730. Springer, 2012.
P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.
S. C. Zhu and D. Mumford. Prior learning and Gibbs reaction-diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1236–1250, 1997.

Julien Mairal Sparse Estimation for Image and Vision Processing 214/187 References XXVIII
S.-C. Zhu, C.-E. Guo, Y. Wang, and Z. Xu. What are textons? International Journal of Computer Vision, 62(1-2):121–143, 2005.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67(2):301–320, 2005.

Julien Mairal Sparse Estimation for Image and Vision Processing 215/187 Appendix

Julien Mairal Sparse Estimation for Image and Vision Processing 216/187 Basic convex optimization tools: subgradients

[Figure: Gradients and subgradients for smooth and non-smooth functions. (a) Smooth case; (b) Non-smooth case.]

\[
\partial f(\alpha) \;\triangleq\; \left\{ \kappa \in \mathbb{R}^p \;\middle|\; f(\alpha) + \kappa^\top (\alpha' - \alpha) \leq f(\alpha') \ \text{for all } \alpha' \in \mathbb{R}^p \right\}.
\]

Julien Mairal Sparse Estimation for Image and Vision Processing 217/187 Basic convex optimization tools: subgradients Some nice properties

∂f(α) = {g} if and only if f is differentiable at α and g = ∇f(α).
Many calculus rules apply, e.g., ∂(γf + µg) = γ∂f + µ∂g for γ, µ > 0.
For more details, see Boyd and Vandenberghe [2004], Bertsekas [1999], Borwein and Lewis [2006], and S. Boyd's course at Stanford.
Optimality conditions
For g : R^p → R convex,
g differentiable: α⋆ minimizes g if and only if ∇g(α⋆) = 0;
g non-differentiable: α⋆ minimizes g if and only if 0 ∈ ∂g(α⋆).
Careful: the concept of subgradient requires a function to lie above its tangents; it only makes sense for convex functions!
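
A quick numerical sanity check of the definition (a minimal Python sketch added here, not from the slides): the subdifferential of f(α) = |α| at α = 0 is the interval [−1, 1], since κ is a subgradient at 0 exactly when κ·a ≤ |a| for every a.

import numpy as np

# Minimal sketch (not from the slides): test the subgradient inequality
# f(0) + kappa * (a - 0) <= f(a) for f(a) = |a| on a grid of points a.
a = np.linspace(-2.0, 2.0, 401)
for kappa in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5):
    holds = bool(np.all(kappa * a <= np.abs(a)))
    print(f"kappa = {kappa:+.1f} is a subgradient of |.| at 0: {holds}")
# Only the candidates with |kappa| <= 1 pass, recovering the interval [-1, 1].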

Julien Mairal Sparse Estimation for Image and Vision Processing 218/187 Basic convex optimization tools: dual-norm

Definition
Let κ be in R^p,
\[
\|\kappa\|_* \;\triangleq\; \max_{\alpha \in \mathbb{R}^p : \|\alpha\| \leq 1} \alpha^\top \kappa.
\]

Exercises
‖α‖∗∗ = ‖α‖ (true in finite dimension);
ℓ2 is dual to itself;
ℓ1 and ℓ∞ are dual to each other;
ℓq and ℓq′ are dual to each other if 1/q + 1/q′ = 1;
similar relations hold for spectral norms on matrices;
∂‖α‖ = {κ ∈ R^p s.t. ‖κ‖∗ ≤ 1 and κ⊤α = ‖α‖}.
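
The ℓ1/ℓ∞ pairing can also be checked numerically: the maximum of α⊤κ over the ℓ1-ball is attained at a signed canonical basis vector ±e_i, so it equals max_i |κ_i| = ‖κ‖∞. The short Python sketch below is an added illustration, not part of the slides.

import numpy as np

# Minimal sketch (not from the slides): the dual norm of l1 is l_infinity.
# max_{||alpha||_1 <= 1} alpha^T kappa is attained at an extreme point +/- e_i
# of the l1-ball, so scanning those 2p signed basis vectors gives the exact maximum.
rng = np.random.default_rng(0)
kappa = rng.standard_normal(6)
best = max(s * kappa[i] for i in range(kappa.size) for s in (-1.0, 1.0))
print(best, np.abs(kappa).max())  # the two values coincide: ||kappa||_inf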

Julien Mairal Sparse Estimation for Image and Vision Processing 219/187 Optimality conditions
Let f : R^p → R be convex and differentiable, and let ‖·‖ be any norm. Consider
\[
\min_{\alpha \in \mathbb{R}^p} \; f(\alpha) + \lambda \|\alpha\|.
\]
α is a solution if and only if
\[
0 \in \partial\big( f(\alpha) + \lambda \|\alpha\| \big) = \nabla f(\alpha) + \lambda \, \partial \|\alpha\|.
\]
Since ∂‖α‖ = {κ ∈ R^p s.t. ‖κ‖∗ ≤ 1 and κ⊤α = ‖α‖}, we obtain the general optimality conditions:
\[
\|\nabla f(\alpha)\|_* \leq \lambda \quad \text{and} \quad -\nabla f(\alpha)^\top \alpha = \lambda \|\alpha\|.
\]
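
As a concrete illustration (added here, not from the slides), take f(α) = ½‖y − α‖² with the ℓ1-norm as penalty; its dual norm is ℓ∞, the solution is the soft-thresholding of y, and both conditions can be verified numerically.

import numpy as np

# Minimal sketch (not from the slides): optimality conditions for
# min_alpha 0.5*||y - alpha||^2 + lam*||alpha||_1, solved by soft-thresholding.
rng = np.random.default_rng(0)
lam = 0.3
y = rng.standard_normal(10)

alpha = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)  # soft-thresholding of y
grad = alpha - y                                       # gradient of the smooth part

print(np.abs(grad).max() <= lam + 1e-12)                     # ||grad f(alpha)||_inf <= lambda
print(np.isclose(-grad @ alpha, lam * np.abs(alpha).sum()))  # -grad^T alpha = lam * ||alpha||_1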

Julien Mairal Sparse Estimation for Image and Vision Processing 220/187 Convex Duality
Strong Duality

[Figure: the primal objective f(α) with its minimizer α⋆, and the dual objective g(κ) with its maximizer κ⋆.]

Strong duality means that max_κ g(κ) = min_α f(α).

Julien Mairal Sparse Estimation for Image and Vision Processing 221/187 Convex Duality
Duality Gaps

[Figure: the primal objective f(α) and the dual objective g(κ), with candidate points α̃ and κ̃ and the duality gap δ(α̃, κ̃) between f(α̃) and g(κ̃).]

Strong duality means that max_κ g(κ) = min_α f(α).
The duality gap guarantees that 0 ≤ f(α̃) − f(α⋆) ≤ δ(α̃, κ̃).
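
To make the bound concrete, here is a minimal Python sketch (an added illustration, not from the slides) for ℓ1-regularized least squares, min_α ½‖y − Xα‖² + λ‖α‖_1. Assuming its standard dual, max over {κ : ‖X⊤κ‖∞ ≤ λ} of ½‖y‖² − ½‖y − κ‖², a dual-feasible κ̃ is obtained by rescaling the residual y − Xα̃, and the gap δ(α̃, κ̃) = f(α̃) − g(κ̃) upper-bounds f(α̃) − f(α⋆) at any iterate of any solver (a plain proximal-gradient loop is used here as a stand-in).

import numpy as np

# Minimal sketch (not from the slides): duality gap for
#   min_alpha 0.5*||y - X alpha||^2 + lam*||alpha||_1,
# using the dual max_{||X^T kappa||_inf <= lam} 0.5*||y||^2 - 0.5*||y - kappa||^2.
rng = np.random.default_rng(0)
n, p, lam = 30, 100, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def primal(alpha):
    return 0.5 * np.sum((y - X @ alpha) ** 2) + lam * np.abs(alpha).sum()

def dual(kappa):
    return 0.5 * np.sum(y ** 2) - 0.5 * np.sum((y - kappa) ** 2)

def duality_gap(alpha):
    residual = y - X @ alpha
    # rescale the residual so that kappa is dual-feasible: ||X^T kappa||_inf <= lam
    kappa = residual * min(1.0, lam / np.abs(X.T @ residual).max())
    return primal(alpha) - dual(kappa)

# Plain proximal-gradient (ISTA) iterations, monitoring the gap.
L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the smooth part
alpha = np.zeros(p)
for it in range(300):
    z = alpha - X.T @ (X @ alpha - y) / L
    alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
print("duality gap after 300 iterations:", duality_gap(alpha))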

Julien Mairal Sparse Estimation for Image and Vision Processing 222/187