Sparse Optimization Lecture: Sparse Recovery Guarantees

Instructor: Wotao Yin

July 2013

Note scriber: Zheng Sun
Online discussions on piazza.com

Those who complete this lecture will know

• how to read different recovery guarantees
• some well-known conditions for exact and stable recovery, such as spark, coherence, RIP, NSP, etc.
• indeed, we can trust $\ell_1$-minimization for recovering sparse vectors

The basic question of sparse optimization is:

Can I trust my model to return an intended sparse quantity?

That is

• does my model have a unique solution? (otherwise, different algorithms may return different answers)
• is the solution exactly equal to the original sparse quantity?
• if not (due to noise), is the solution a faithful approximation of it?
• how much effort is needed to numerically solve the model?

This lecture provides brief answers to the first three questions.

What this lecture does and does not cover

It covers basic sparse vector recovery guarantees based on

• spark
• coherence
• restricted isometry property (RIP) and null-space property (NSP)

as well as both exact and robust recovery guarantees.

It does not cover the recovery of matrices, subspaces, etc.

Recovery guarantees are important parts of sparse optimization, but they are not the focus of this summer course.

Examples of guarantees

Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003])

For $Ax = b$ where $A \in \mathbb{R}^{m\times n}$ has full rank, if $x$ satisfies $\|x\|_0 \le \frac{1}{2}\left(1 + \mu(A)^{-1}\right)$, then $\ell_1$-minimization recovers this $x$.

Theorem (Candes and Tao [2005])

If $x$ is $k$-sparse and $A$ satisfies the RIP-based condition $\delta_{2k} + \delta_{3k} < 1$, then $x$ is the $\ell_1$-minimizer.

Theorem (Zhang [2008])

If $A \in \mathbb{R}^{m\times n}$ is a standard Gaussian matrix, then with probability at least $1 - \exp(-c_0(n-m))$, $\ell_1$-minimization is equivalent to $\ell_0$-minimization for all $x$ with

$$\|x\|_0 < \frac{c_1^2}{4}\,\frac{m}{1 + \log(n/m)},$$

where $c_0, c_1 > 0$ are constants independent of $m$ and $n$.

How to read guarantees

Some basic aspects that distinguish different types of guarantees:

• Recoverability (exact) vs stability (inexact)

• General A or special A?

• Universal (all sparse vectors) or instance (certain sparse vector(s))?

• General optimality, or specific to a model / algorithm?

• Required property of A: spark, RIP, coherence, NSP, dual certificate?

• If randomness is involved, what is its role?

• Condition/bound is tight or not? Absolute or in order of magnitude?

Spark

First questions for finding the sparsest solution to Ax = b

1. Can the sparsest solution be unique? Under what conditions?
2. Given a sparse $x$, how to verify whether it is actually the sparsest one?

Definition (Donoho and Elad [2003]): The spark of a given matrix $A$ is the smallest number of columns from $A$ that are linearly dependent, written as $\mathrm{spark}(A)$. $\mathrm{rank}(A)$ is the largest number of columns from $A$ that are linearly independent. In general, $\mathrm{spark}(A) \ne \mathrm{rank}(A) + 1$, except for many randomly generated matrices.

Rank is easy to compute (due to the structure), but spark needs a combinatorial search.
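To make the combinatorial nature concrete, here is a minimal brute-force sketch (my illustration, not part of the lecture; it assumes NumPy, and the helper name `spark` is hypothetical). The search is only feasible for small matrices:

```python
# A minimal sketch (my illustration, assumes NumPy): brute-force spark.
# The search is combinatorial, so this is only feasible for small matrices.
import itertools
import numpy as np

def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (hypothetical helper)."""
    m, n = A.shape
    for k in range(1, n + 1):
        for cols in itertools.combinations(range(n), k):
            # k columns are linearly dependent iff their rank is below k
            if np.linalg.matrix_rank(A[:, list(cols)], tol=tol) < k:
                return k
    return n + 1  # all n columns are independent (possible only when m >= n)

A = np.random.randn(4, 8)   # a generic Gaussian matrix
print(spark(A))             # expect rank(A) + 1 = 5 with probability 1
```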

Spark

Theorem (Gorodnitsky and Rao [1997])

If $Ax = b$ has a solution $x$ obeying $\|x\|_0 < \mathrm{spark}(A)/2$, then $x$ is the sparsest solution.

• Proof idea: if there is a solution $y$ to $Ax = b$ and $x - y \ne 0$, then $A(x - y) = 0$ and thus

$$\|x\|_0 + \|y\|_0 \ge \|x - y\|_0 \ge \mathrm{spark}(A),$$

or $\|y\|_0 \ge \mathrm{spark}(A) - \|x\|_0 > \mathrm{spark}(A)/2 > \|x\|_0$.

• The result does not mean this x can be efficiently found numerically.

• For many random matrices $A \in \mathbb{R}^{m\times n}$, the result means that if an algorithm returns $x$ satisfying $\|x\|_0 < (m+1)/2$, then this $x$ is optimal with probability 1.

• What to do when spark(A) is difficult to obtain?

General Recovery - Spark

Rank is easy to compute, but spark needs a combinatorial search.

However, for a matrix with entries in general position, $\mathrm{spark}(A) = \mathrm{rank}(A) + 1$.

For example, if a matrix $A \in \mathbb{R}^{m\times n}$ ($m < n$) has entries $A_{ij} \sim \mathcal{N}(0,1)$, then $\mathrm{rank}(A) = m = \mathrm{spark}(A) - 1$ with probability 1.

In general, for any full-rank matrix $A \in \mathbb{R}^{m\times n}$ ($m < n$), any $m + 1$ columns of $A$ are linearly dependent, so

$$\mathrm{spark}(A) \le m + 1 = \mathrm{rank}(A) + 1.$$

Coherence

Definition (Mallat and Zhang [1993]): The (mutual) coherence of a given matrix $A$ is the largest absolute normalized inner product between different columns of $A$. Suppose

$A = [a_1\ a_2\ \cdots\ a_n]$. The mutual coherence of $A$ is given by

$$\mu(A) = \max_{k \ne j} \frac{|a_k^\top a_j|}{\|a_k\|_2 \cdot \|a_j\|_2}.$$

• It characterizes the dependence between columns of $A$
• For unitary matrices, $\mu(A) = 0$
• For matrices with more columns than rows, $\mu(A) > 0$
• For recovery problems, we desire a small $\mu(A)$, since then $A$ behaves similarly to a unitary matrix
• For $A = [\Phi\ \Psi]$ where $\Phi$ and $\Psi$ are $n\times n$ unitary, it holds that $n^{-1/2} \le \mu(A) \le 1$
• $\mu(A) = n^{-1/2}$ is achieved by $[I\ F]$, $[I\ \text{Hadamard}]$, etc.
• If $A \in \mathbb{R}^{m\times n}$ where $n > m$, then $\mu(A) \ge m^{-1/2}$.
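As a quick numerical companion to the definition (my sketch, not from the slides; it assumes NumPy and SciPy for the Hadamard matrix), the following computes $\mu(A)$ and checks that $[I\ \text{Hadamard}]$ attains $n^{-1/2}$:

```python
# A quick numerical companion (my sketch, assumes NumPy and SciPy):
# mutual coherence mu(A), checked on A = [I  Hadamard] with n = 64.
import numpy as np
from scipy.linalg import hadamard

def mutual_coherence(A):
    G = A / np.linalg.norm(A, axis=0)   # normalize columns to unit 2-norm
    M = np.abs(G.T @ G)                 # absolute Gram matrix
    np.fill_diagonal(M, 0.0)            # ignore k = j
    return M.max()

n = 64
A = np.hstack([np.eye(n), hadamard(n) / np.sqrt(n)])
print(mutual_coherence(A))              # ~0.125 = n**(-1/2)
```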

Coherence

Theorem (Donoho and Elad [2003])

$$\mathrm{spark}(A) \ge 1 + \mu(A)^{-1}.$$

Proof sketch:

• $\bar{A} \leftarrow A$ with columns normalized to unit 2-norm
• $p \leftarrow \mathrm{spark}(A)$
• $B \leftarrow$ a $p \times p$ minor of $\bar{A}^\top \bar{A}$
• $|B_{ii}| = 1$ and $\sum_{j \ne i} |B_{ij}| \le (p-1)\,\mu(A)$
• Suppose $p < 1 + \mu(A)^{-1}$ $\Rightarrow$ $1 > (p-1)\,\mu(A)$ $\Rightarrow$ $|B_{ii}| > \sum_{j\ne i} |B_{ij}|$ for all $i$
• $\Rightarrow B \succ 0$ (Gershgorin circle theorem) $\Rightarrow \mathrm{spark}(A) > p$. Contradiction.

Coherence-based guarantee

Corollary: If $Ax = b$ has a solution $x$ obeying $\|x\|_0 < (1 + \mu(A)^{-1})/2$, then $x$ is the unique sparsest solution.

Compare with the previous Theorem

If $Ax = b$ has a solution $x$ obeying $\|x\|_0 < \mathrm{spark}(A)/2$, then $x$ is the sparsest solution.

For $A \in \mathbb{R}^{m\times n}$ where $m < n$, $(1 + \mu(A)^{-1})$ is at most $1 + \sqrt{m}$, but $\mathrm{spark}(A)$ can be as large as $1 + m$, so the spark-based bound is more useful.

Assume $Ax = b$ has a solution with $\|x\|_0 = k < \mathrm{spark}(A)/2$. It will be the unique $\ell_0$ minimizer. Will it be the $\ell_1$ minimizer as well? Not necessarily. However, $\|x\|_0 < (1 + \mu(A)^{-1})/2$ is a sufficient condition.

Coherence-based $\ell_0 = \ell_1$

Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003]): If $A$ has normalized columns and $Ax = b$ has a solution $x$ satisfying
$$\|x\|_0 < \frac{1}{2}\left(1 + \mu(A)^{-1}\right),$$
then this $x$ is the unique minimizer with respect to both $\ell_0$ and $\ell_1$.

Proof sketch:

• Previously we know $x$ is the unique $\ell_0$ minimizer; let $S := \mathrm{supp}(x)$

• Suppose $y$ is the $\ell_1$ minimizer but not $x$; we study $e := y - x$

• $e$ must satisfy $Ae = 0$ and $\|e\|_1 \le 2\|e_S\|_1$
• $A^\top A e = 0 \Rightarrow |e_j| \le (1 + \mu(A))^{-1}\,\mu(A)\,\|e\|_1$ for all $j$
• the last two points together contradict the assumption

Result bottom line: exact recovery is allowed for $\|x\|_0$ up to $O(\sqrt{m})$.
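As a concrete illustration with my own numbers (not from the slides), take $A = [I\ H]$ where $H$ is the column-normalized $64\times 64$ Hadamard matrix, so $m = 64$ and $n = 128$:

$$\mu(A) = \frac{1}{\sqrt{64}} = \frac{1}{8}, \qquad \frac{1}{2}\left(1 + \mu(A)^{-1}\right) = \frac{1 + 8}{2} = 4.5,$$

so any $x$ with at most 4 nonzeros is the unique minimizer with respect to both $\ell_0$ and $\ell_1$.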

The null space of A

• Definition: $\|x\|_p := \left(\sum_i |x_i|^p\right)^{1/p}$.

• Lemma: Let $0 < p \le 1$ and $S \supseteq \mathrm{supp}(x)$ with $\bar S := S^c$. If $\|(y-x)_{\bar S}\|_p > \|(y-x)_S\|_p$, then $\|x\|_p < \|y\|_p$.
  Proof: Let $e := y - x$. Then
  $$\|y\|_p^p = \|x + e\|_p^p = \|x_S + e_S\|_p^p + \|e_{\bar S}\|_p^p = \|x\|_p^p + \left(\|e_{\bar S}\|_p^p - \|e_S\|_p^p\right) + \left(\|x_S + e_S\|_p^p - \|x_S\|_p^p + \|e_S\|_p^p\right).$$
  The last term is nonnegative for $0 < p \le 1$. So a sufficient condition is $\|e_{\bar S}\|_p^p > \|e_S\|_p^p$. □

• If the condition holds for $0 < p \le 1$, it also holds for $q \in (0, p]$.

• Definition (null space property NSP($k, \gamma$)): every nonzero $e \in \mathcal{N}(A)$ satisfies $\|e_S\|_1 < \gamma\,\|e_{\bar S}\|_1$ for all index sets $S$ with $|S| \le k$.

The null space of A

Theorem (Donoho and Huo [2001], Gribonval and Nielsen [2003])

Basis pursuit $\min\{\|x\|_1 : Ax = b\}$ uniquely recovers all $k$-sparse vectors $x^o$ from the measurements $b = Ax^o$ if and only if $A$ satisfies NSP($k$, 1).

Proof. Sufficiency: pick any $k$-sparse vector $x^o$. Let $S := \mathrm{supp}(x^o)$ and $\bar S = S^c$. For any nonzero $h \in \mathcal{N}(A)$, we have $A(x^o + h) = Ax^o = b$ and

$$\|x^o + h\|_1 = \|x^o_S + h_S\|_1 + \|h_{\bar S}\|_1 \ge \|x^o_S\|_1 - \|h_S\|_1 + \|h_{\bar S}\|_1 = \|x^o\|_1 + \left(\|h_{\bar S}\|_1 - \|h_S\|_1\right). \tag{1}$$

NSP($k$, 1) of $A$ guarantees $\|x^o + h\|_1 > \|x^o\|_1$, so $x^o$ is the unique solution.
Necessity: the inequality (1) holds with equality if $\mathrm{sign}(x^o_S) = -\mathrm{sign}(h_S)$ and $h_S$ has a sufficiently small scale. Therefore, for basis pursuit to uniquely recover all $k$-sparse vectors $x^o$, NSP($k$, 1) is also necessary.

The null space of A

• Another sufficient condition (Zhang [2008]) for $\|x\|_1 < \|y\|_1$ is
  $$\|x\|_0 < \frac{1}{4}\left(\frac{\|y - x\|_1}{\|y - x\|_2}\right)^2.$$
  Proof: with $e := y - x$ and $S := \mathrm{supp}(x)$,
  $$\|e_S\|_1 \le \sqrt{|S|}\,\|e_S\|_2 \le \sqrt{|S|}\,\|e\|_2 = \sqrt{\|x\|_0}\,\|e\|_2.$$
  Then the earlier sufficient condition $\|y - x\|_1 > 2\,\|(y - x)_S\|_1$ follows from the above inequality. □

Null space

Theorem (Zhang [2008]): Given $x$ and $b = Ax$,

$$\min\ \|x\|_1 \quad \text{s.t.} \quad Ax = b$$

recovers $x$ uniquely if

$$\|x\|_0 < \frac{1}{4}\,\min\left\{\frac{\|e\|_1^2}{\|e\|_2^2} : e \in \mathcal{N}(A)\setminus\{0\}\right\}.$$

Comments:
• We know $1 \le \|e\|_1/\|e\|_2 \le \sqrt{n}$ for all $e \ne 0$. The ratio is small for sparse vectors, but we want it large, i.e., close to $\sqrt{n}$ and away from 1.
• Fact: in most subspaces, the ratio is away from 1.
• In particular, Kashin, Garnaev, and Gluskin showed that a randomly drawn $(n-m)$-dimensional subspace $V$ satisfies
  $$\frac{\|e\|_1}{\|e\|_2} \ge \frac{c_1\sqrt{m}}{\sqrt{1 + \log(n/m)}}, \qquad e \in V,\ e \ne 0,$$
  with probability at least $1 - \exp(-c_0(n-m))$, where $c_0, c_1 > 0$ are independent of $m$ and $n$.
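A rough numerical illustration of this ratio (my sketch, not from the slides; it assumes NumPy): sample random directions in $\mathcal{N}(A)$ for a Gaussian $A$ and inspect $\|e\|_1/\|e\|_2$. Random sampling only gives an optimistic view of the minimum ratio, not a certified bound:

```python
# A rough illustration (my sketch, assumes NumPy), not a proof: sample random
# directions in N(A) for a Gaussian A and inspect ||e||_1 / ||e||_2.
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 256
A = rng.standard_normal((m, n))

# The last n - m right singular vectors form an orthonormal basis of N(A).
_, _, Vt = np.linalg.svd(A)
N = Vt[m:].T                                  # n x (n - m)

ratios = []
for _ in range(10000):
    e = N @ rng.standard_normal(n - m)        # a random vector in N(A)
    ratios.append(np.linalg.norm(e, 1) / np.linalg.norm(e, 2))

print(min(ratios), np.sqrt(n))                # sampled ratios stay well above 1
```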

Null space

Theorem (Zhang [2008])

If $A \in \mathbb{R}^{m\times n}$ has i.i.d. Gaussian entries, or is any rank-$m$ matrix such that $BA^\top = 0$ where $B \in \mathbb{R}^{(n-m)\times n}$ has i.i.d. Gaussian entries, then with probability at least $1 - \exp(-c_0(n-m))$, $\ell_1$ minimization recovers any sparse $x$ with

$$\|x\|_0 < \frac{c_1^2}{4}\,\frac{m}{1 + \log(n/m)},$$

where $c_0, c_1$ are positive constants independent of $m$ and $n$.
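For a hands-on check of this Gaussian-matrix guarantee (my sketch, not from the lecture; it assumes NumPy and SciPy), basis pursuit can be posed as the linear program $\min\ \mathbf{1}^\top t$ s.t. $-t \le x \le t$, $Ax = b$ and solved with `scipy.optimize.linprog`:

```python
# A hands-on sketch (my illustration, assumes NumPy and SciPy): basis pursuit
#   min ||x||_1  s.t.  Ax = b
# posed as an LP in (x, t) with |x_i| <= t_i, solved by scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n, k = 60, 200, 10
A = rng.standard_normal((m, n)) / np.sqrt(m)        # Gaussian sensing matrix
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
b = A @ x_true

c = np.concatenate([np.zeros(n), np.ones(n)])       # minimize sum(t)
I = np.eye(n)
A_ub = np.block([[I, -I], [-I, -I]])                #  x - t <= 0,  -x - t <= 0
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((m, n))])             #  Ax = b
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * n + [(0, None)] * n)
x_hat = res.x[:n]
print(np.linalg.norm(x_hat - x_true))               # near zero: exact recovery
```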

Comments on NSP

• NSP is no longer necessary if "for all k-sparse vectors" is relaxed.
• NSP is widely used in the proofs of other guarantees.
• NSP of order $2k$ is necessary for stable universal recovery. Consider an arbitrary decoder $\Delta$, tractable or not, that returns a vector from the input $b = Ax^o$. If one requires $\Delta$ to be stable in the sense

  $$\|x^o - \Delta(Ax^o)\|_1 < C\cdot\sigma_{[k]}(x^o)$$

  for all $x^o$, where $\sigma_{[k]}$ is the best $k$-term approximation error, then it holds that

  $$\|h_S\|_1 < C\cdot\|h_{S^c}\|_1$$

  for all nonzero $h \in \mathcal{N}(A)$ and all coordinate sets $S$ with $|S| \le 2k$. See Cohen, Dahmen, and DeVore [2006].

Restricted isometry property (RIP)

Definition (Candes and Tao [2005])

Matrix $A$ obeys the restricted isometry property (RIP) with constant $\delta_s$ if

$$(1 - \delta_s)\,\|c\|_2^2 \le \|Ac\|_2^2 \le (1 + \delta_s)\,\|c\|_2^2$$

for all $s$-sparse vectors $c$.

RIP essentially requires that every set of columns with cardinality less than or equal to s behaves like an orthonormal system.
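Since the tight RIP constant is combinatorial to compute, the following sketch (my illustration, not from the slides; it assumes NumPy) only estimates $\delta_s$ by sampling random column subsets, which yields a lower bound:

```python
# A rough sketch (my illustration, assumes NumPy): estimate delta_s by sampling
# random column subsets. Sampling only yields a lower bound; the exact RIP
# constant requires a combinatorial search.
import numpy as np

def sampled_rip_constant(A, s, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    delta = 0.0
    for _ in range(trials):
        S = rng.choice(n, s, replace=False)
        # eigenvalues of A_S^T A_S bound ||A_S c||_2^2 / ||c||_2^2 on both sides
        eig = np.linalg.eigvalsh(A[:, S].T @ A[:, S])
        delta = max(delta, eig.max() - 1.0, 1.0 - eig.min())
    return delta

m, n, s = 64, 256, 8
A = np.random.default_rng(1).standard_normal((m, n)) / np.sqrt(m)
print(sampled_rip_constant(A, s))   # a sampled lower bound on delta_s
```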

RIP

Theorem (Candes and Tao [2006])

If $x$ is $k$-sparse and $A$ satisfies $\delta_{2k} + \delta_{3k} < 1$, then $x$ is the unique $\ell_1$ minimizer.

Comments:

• RIP requires the matrix to be properly scaled
• the tight RIP constant of a given matrix $A$ is difficult to compute
• the result is universal for all $k$-sparse vectors
• there exist tighter conditions (see next slide)

• all methods (including $\ell_0$) require $\delta_{2k} < 1$ for universal recovery; every $k$-sparse $x$ is the unique $k$-sparse solution if $\delta_{2k} < 1$
• the requirement can be satisfied by certain $A$ (e.g., matrices whose entries are i.i.d. samples from a subgaussian distribution), leading to exact recovery for $\|x\|_0 = O(m/\log(n/k))$.

More Comments

• (Foucart-Lai) If $\delta_{2k+2} < 1$, then there exists a sufficiently small $p$ so that $\ell_p$ minimization is guaranteed to recover any $k$-sparse $x$
• (Candes) $\delta_{2k} < \sqrt{2} - 1$ is sufficient
• (Foucart-Lai) $\delta_{2k} < 2(3 - \sqrt{2})/7 \approx 0.4531$ is sufficient
• RIP gives $\kappa(A_S) \le \sqrt{(1 + \delta_k)/(1 - \delta_k)}$ for all $|S| \le k$; so $\delta_{2k} < 2(3 - \sqrt{2})/7$ gives $\kappa(A_S) \le 1.7$ for all $|S| \le 2k$: very well-conditioned
• (Mo-Li) $\delta_{2k} < 0.493$ is sufficient
• (Cai-Wang-Xu) $\delta_k < 0.307$ is sufficient
• (Cai-Zhang) $\delta_k < 1/3$ is sufficient and necessary for universal $\ell_1$ recovery

Random matrices with RIPs

Quite simple random constructions yield matrices that satisfy the RIP with overwhelming probability.

• Gaussian: $A_{ij} \sim \mathcal{N}(0, 1/m)$; $\|x\|_0 \le O(m/\log(n/m))$ w.h.p. The proof is based on applying concentration of measure to the singular values of Gaussian matrices (Szarek-91, Davidson-Szarek-01).

• Bernoulli: $A_{ij} = \pm 1$ with probability 1/2 each; $\|x\|_0 \le O(m/\log(n/m))$ w.h.p. The proof is based on applying concentration of measure to the smallest singular value of a subgaussian matrix (Candes-Tao-04, Litvak-Pajor-Rudelson-Tomczak-Jaegermann-04).

• Fourier ensemble: $A \in \mathbb{C}^{m\times n}$ is a randomly chosen submatrix of the discrete Fourier transform $F \in \mathbb{C}^{n\times n}$. Candes-Tao show $\|x\|_0 \le O(m/\log^6(n))$ w.h.p.; Rudelson-Vershynin show $\|x\|_0 \le O(m/\log^4(n))$; it is conjectured that $\|x\|_0 \le O(m/\log(n))$.

• ...

Incoherent Sampling

Suppose $(\Phi, \Psi)$ is a pair of orthonormal bases of $\mathbb{R}^n$.
• $\Phi$ is used for sensing: $A$ is a subset of rows of $\Phi^*$
• $\Psi$ is used to sparsely represent $x$: $x = \Psi\alpha$, where $\alpha$ is sparse

Definition: The coherence between $\Phi$ and $\Psi$ is
$$\mu(\Phi, \Psi) = \sqrt{n}\,\max_{1\le k, j\le n} |\langle \phi_k, \psi_j\rangle|.$$

Coherence is the largest correlation between any two elements of Φ and Ψ.

• If $\Phi$ and $\Psi$ contain correlated elements, then $\mu(\Phi, \Psi)$ is large
• Otherwise, $\mu(\Phi, \Psi)$ is small
It always holds that $1 \le \mu(\Phi, \Psi) \le \sqrt{n}$.
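A small numerical check of the definition (my sketch, not from the slides; it assumes NumPy): the spike basis and the orthonormal DFT basis give $\mu(\Phi, \Psi) = 1$:

```python
# A small numerical check (my sketch, assumes NumPy): the spike basis and the
# orthonormal DFT basis achieve the minimum coherence mu(Phi, Psi) = 1.
import numpy as np

def basis_coherence(Phi, Psi):
    n = Phi.shape[0]
    return np.sqrt(n) * np.max(np.abs(Phi.conj().T @ Psi))

n = 64
Phi = np.eye(n)                             # spike (identity) basis
Psi = np.fft.fft(np.eye(n)) / np.sqrt(n)    # orthonormal Fourier basis
print(basis_coherence(Phi, Psi))            # prints 1.0: maximal incoherence
```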

Incoherent Sampling

Compressive sensing requires pairs with low coherence.
• $x$ is sparse under $\Psi$: $x = \Psi\alpha$, where $\alpha$ is sparse
• $x$ is measured as $b \leftarrow Ax$
• $x$ is recovered from $\min \|\alpha\|_1$ s.t. $A\Psi\alpha = b$

Examples:
• $\Phi$ is the spike basis $\phi_k(t) = \delta(t - k)$ and $\Psi$ is the Fourier basis $\psi_j(t) = n^{-1/2} e^{i\cdot 2\pi\cdot jt/n}$; then $\mu(\Phi, \Psi) = 1$, achieving maximal incoherence.
• The coherence between noiselets and Haar wavelets is $\sqrt{2}$.
• The coherence between noiselets and Daubechies D4 and D8 wavelets is roughly 2.2 and 2.9, respectively.
• Random matrices are largely incoherent with any fixed basis $\Psi$. For a randomly generated and orthonormalized $\Phi$, with high probability the coherence between $\Phi$ and any fixed $\Psi$ is about $\sqrt{2\log n}$.
• Similar results apply to random Gaussian or $\pm 1$ matrices. Bottom line: many random matrices are universally incoherent with any fixed $\Psi$ w.h.p.
• Certain random circulant matrices are also universally incoherent with any fixed $\Psi$ w.h.p.

Incoherent Sampling

Theorem (Candes and Romberg [2007]): Fix $x$ and suppose $x$ is $k$-sparse under the basis $\Psi$ with coefficients having uniformly random signs. Select $m$ measurements in the $\Phi$ domain uniformly at random. If $m \ge O(\mu^2(\Phi, \Psi)\,k\log n)$, then $\ell_1$-minimization recovers $x$ with high probability.

Comments:

• The result is not universal for all $\Psi$ or all $k$-sparse $x$ under $\Psi$.
• It is only guaranteed for nearly all sign sequences $x$ with a fixed support.
• Why does probability appear? Because there are special signals that are sparse in $\Psi$ yet vanish at most places in the $\Phi$ domain.
• This result allows structured, as opposed to noise-like (random), matrices.
• It can be seen as an extension of Fourier CS.
• Bottom line: the smaller the coherence, the fewer the samples required. This matches numerical experience.

Robust Recovery

In order to be practically powerful, CS must deal with

• nearly sparse signals
• measurement noise
• sometimes both

Goal: To obtain accurate reconstructions from highly undersampled measurements, or in short, stable recovery.

Stable $\ell_1$ Recovery

Consider

• a sparse x

• noisy CS measurements $b \leftarrow Ax + z$, where $\|z\|_2 \le \epsilon$

Apply the BPDN model: $\min \|x\|_1$ s.t. $\|Ax - b\|_2 \le \epsilon$.

Theorem

Assume (some bound on $\delta_k$ or $\delta_{2k}$). The BPDN model returns a solution $x^*$ satisfying $\|x^* - x\|_2 \le C\cdot\epsilon$ for some constant $C$.

Proof sketch (using an overly sufficient RIP bound):

• Let $e = x^* - x$ and $S^0 = \{i : \text{indices of the largest } 2k \text{ entries of } |x|\}$.

• One can show $\|e\|_2 \le C_1\|e_{S^0}\|_2 \le C_2\|Ae\|_2 \le C\cdot\epsilon$.

• The 1st inequality follows essentially from $\|e_S\|_1 > \|e_{\bar S}\|_1$;
• the 2nd inequality essentially from the RIP;
• the 3rd inequality essentially from the constraint.
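A minimal numerical sketch of the BPDN model itself (my illustration, not from the lecture; it assumes the third-party cvxpy package, though any convex conic solver would do):

```python
# A minimal sketch of the BPDN model (my illustration; assumes cvxpy).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
m, n, k, eps = 80, 256, 8, 1e-2
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
z = rng.standard_normal(m)
b = A @ x_true + eps * z / np.linalg.norm(z)     # noise with ||z||_2 = eps

x = cp.Variable(n)
problem = cp.Problem(cp.Minimize(cp.norm(x, 1)),
                     [cp.norm(A @ x - b, 2) <= eps])
problem.solve()
print(np.linalg.norm(x.value - x_true))          # expected to be O(eps)
```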

Stable $\ell_1$ Recovery

Theorem

Assume (some bound on $\delta_k$ or $\delta_{2k}$). The BPDN model returns a solution $x^*$ satisfying $\|x^* - x\|_2 \le C\cdot\epsilon$ for some constant $C$.

Comments:

• The result is universal and more general than exact recovery;
• The error bound is order-optimal: even knowing $\mathrm{supp}(x)$ would give $C_0\cdot\epsilon$ at best;
• $x^*$ is almost as good as if one knew where the largest $k$ entries are and measured them directly;
• $C$ depends on $k$; when $k$ violates the condition and gets too large, $\|x^* - x\|_2$ will blow up.

Stable $\ell_1$ Recovery

Consider

• a nearly sparse $x = x_k + w$,

• $x_k$ is the vector $x$ with all but the largest (in magnitude) $k$ entries set to 0,

• CS measurements $b \leftarrow Ax + z$, where $\|z\|_2 \le \epsilon$.

Theorem

Assume (some bound on $\delta_k$ or $\delta_{2k}$). The BPDN model returns a solution $x^*$ satisfying

$$\|x^* - x\|_2 \le \bar{C}\cdot\epsilon + \tilde{C}\cdot k^{-1/2}\|x - x_k\|_1$$

for some constants $\bar{C}$ and $\tilde{C}$.

Proof sketch (using an overly sufficient RIP bound): similar to the previous one, except that $x$ is no longer $k$-sparse and $\|e_S\|_1 > \|e_{\bar S}\|_1$ is no longer valid.

Instead, we get $\|e_S\|_1 + 2\|x - x_k\|_1 > \|e_{\bar S}\|_1$. Then $\|e\|_2 \le C_1\|e_{4k}\|_2 + C' k^{-1/2}\|x - x_k\|_1$, ...

Stable $\ell_1$ Recovery

Comments on
$$\|x^* - x\|_2 \le \bar{C}\cdot\epsilon + \tilde{C}\cdot k^{-1/2}\|x - x_k\|_1.$$

Suppose $\epsilon = 0$; let us focus on the last term:

1. Consider power-law decay signals $|x|_{(i)} \le C\cdot i^{-r}$, $r > 1$;
2. Then $\|x - x_k\|_1 \le C_1 k^{-r+1}$, or $k^{-1/2}\|x - x_k\|_1 \le C_1 k^{-r+1/2}$;
3. But even if $x^* = x_k$, $\|x^* - x\|_2 = \|x_k - x\|_2 \le C_1 k^{-r+1/2}$;
4. Conclusion: the bound cannot be fundamentally improved.
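For completeness, the tail-sum estimates behind steps 2 and 3 follow from an integral comparison (a standard calculation, spelled out here rather than on the slide):

$$\|x - x_k\|_1 = \sum_{i>k} |x|_{(i)} \le C\sum_{i>k} i^{-r} \le C\int_k^\infty t^{-r}\,dt = \frac{C}{r-1}\,k^{-r+1}, \qquad \|x - x_k\|_2 \le C\left(\sum_{i>k} i^{-2r}\right)^{1/2} \le \frac{C}{\sqrt{2r-1}}\,k^{-r+1/2}.$$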

Information Theoretic Analysis

Question: is there an encoding-decoding scheme that can do fundamentally better than a Gaussian $A$ and $\ell_1$-minimization?

In math: does there exist an encoder-decoder pair $(A, \Delta)$ such that $\|\Delta(Ax) - x\|_2 \le O(k^{-1/2}\sigma_k(x))$ holds for $k$ larger than $O(m/\log(n/m))$?

Comments: A can be any matrix, and ∆ can be any decoder, tractable or not.

Let $\|x - x_k\|_1$ be called the best $k$-term approximation error, denoted by

$$\sigma_k(x) := \|x - x_k\|_1.$$

Performance of $(A, \Delta)$ where $A \in \mathbb{R}^{m\times n}$:

$$E_m(K) := \inf_{(A,\Delta)}\ \sup_{x\in K}\ \|\Delta(Ax) - x\|_2.$$

Gelfand width:

$$d^m(K) = \inf_{\mathrm{codim}(Y)\le m}\ \sup\left\{\|h\|_2 : h \in K\cap Y\right\}.$$

Cohen, Dahmen, and DeVore [2006]: if the set $K$ satisfies $K = -K$ and $K + K \subseteq C_0 K$, then

$$d^m(K) \le E_m(K) \le C_0\,d^m(K).$$

Gelfand Width and $K = \ell_1$-ball

Kashin, Gluskin, Garnaev: for $K = \{h \in \mathbb{R}^n : \|h\|_1 \le 1\}$,

$$C_1\sqrt{\frac{\log(n/m)}{m}} \le d^m(K) \le C_2\sqrt{\frac{\log(n/m)}{m}}.$$

Consequences:

1. KGG means $E_m(K) \approx \sqrt{\log(n/m)/m}$;
2. we want $\|\Delta(Ax) - x\|_2 \le C\cdot k^{-1/2}\sigma_k(x) \le C\cdot k^{-1/2}\|x\|_1$; normalizing gives $E_m(K) \le C\cdot k^{-1/2}$;
3. Therefore, $k = O(m/\log(n/m))$. We cannot do better than this.
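Making the constants explicit (a routine rearrangement, not shown on the slide), steps 1 and 2 combine as

$$C\,k^{-1/2} \ge E_m(K) \ge C_1\sqrt{\frac{\log(n/m)}{m}} \quad\Longrightarrow\quad k \le \frac{C^2}{C_1^2}\,\frac{m}{\log(n/m)}.$$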

(Incomplete) References:

D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization. Proceedings of the National Academy of Sciences, 100:2197–2202, 2003.

R. Gribonval and M. Nielsen. Sparse representations in unions of bases. IEEE Transactions on Information Theory, 49(12):3320–3325, 2003.

E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51:4203–4215, 2005.

Y. Zhang. Theory of compressive sensing via l1-minimization: a non-RIP analysis and extensions. Rice University CAAM Technical Report TR08-11, 2008.

I. F. Gorodnitsky and B. D. Rao. Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing, 45(3):600–616, 1997.

S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

D. Donoho and X. Huo. Uncertainty principles and ideal atomic decompositions. IEEE Transactions on Information Theory, 47:2845–2862, 2001.

A. Cohen, W. Dahmen, and R. A. DeVore. Compressed sensing and best k-term approximation. Submitted, 2006.

E. Candes and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.

E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969–985, 2007.
