Sparse Optimization Lecture: Sparse Recovery Guarantees
Instructor: Wotao Yin
July 2013
Note scriber: Zheng Sun; online discussions on piazza.com
Those who complete this lecture will know
• how to read different recovery guarantees
• some well-known conditions for exact and stable recovery, such as spark, coherence, RIP, NSP, etc.
• that, indeed, we can trust ℓ₁-minimization for recovering sparse vectors
The basic question of sparse optimization is:
Can I trust my model to return an intended sparse quantity?
That is,
• does my model have a unique solution? (otherwise, different algorithms may return different answers)
• is the solution exactly equal to the original sparse quantity?
• if not (due to noise), is the solution a faithful approximation of it?
• how much effort is needed to numerically solve the model?
This lecture provides brief answers to the first three questions.
What this lecture does and does not cover
It covers basic sparse vector recovery guarantees based on
• spark
• coherence
• restricted isometry property (RIP) and null-space property (NSP)
as well as both exact and robust recovery guarantees.
It does not cover the recovery of matrices, subspaces, etc.
Recovery guarantees are important parts of sparse optimization, but they are not the focus of this summer course.
Examples of guarantees
Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003])
For Ax = b where A ∈ R^{m×n} has full rank, if x satisfies ‖x‖₀ ≤ (1/2)(1 + µ^{-1}(A)), then ℓ₁-minimization recovers this x.
Theorem (Candes and Tao [2005])
If x is k-sparse and A satisfies the RIP-based condition δ_{2k} + δ_{3k} < 1, then x is the ℓ₁-minimizer.
Theorem (Zhang [2008])
If A ∈ R^{m×n} is a standard Gaussian matrix, then with probability at least 1 − exp(−c₀(n − m)), ℓ₁-minimization is equivalent to ℓ₀-minimization for all x obeying
‖x‖₀ < (c₁²/4) · m / (1 + log(n/m)),
where c₀, c₁ > 0 are constants independent of m and n.
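The equivalence can be sanity-checked numerically. The sketch below is our illustration, not part of the slides: it casts min ‖x‖₁ s.t. Ax = b as a linear program by splitting x into nonnegative parts x⁺, x⁻ and solves it with SciPy; the sizes m, n, k are assumptions chosen well inside the recovery regime.

```python
# Illustration (ours): l1-minimization as the LP
#   min 1^T (x+ + x-)  s.t.  A(x+ - x-) = b,  x+, x- >= 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, k = 40, 100, 5                     # assumed sizes, k-sparse signal

A = rng.standard_normal((m, n))          # standard Gaussian measurement matrix
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.standard_normal(k)
b = A @ x_true

# Stack variables z = [x+; x-]; the objective sums all entries, i.e. ||x||_1.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_rec = res.x[:n] - res.x[n:]

print(np.max(np.abs(x_rec - x_true)))    # tiny: l1 recovery succeeds here
```

With these dimensions the sparsity sits far below the threshold in the theorem, so the LP returns the original sparse vector up to solver tolerance.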
How to read guarantees
Some basic aspects that distinguish different types of guarantees:
• Recoverability (exact) vs stability (inexact)
• General A or special A?
• Universal (all sparse vectors) or instance (certain sparse vector(s))?
• General optimality, or specific to a model/algorithm?
• Required property of A: spark, RIP, coherence, NSP, dual certificate?
• If randomness is involved, what is its role?
• Condition/bound is tight or not? Absolute or in order of magnitude?
Spark
First questions for finding the sparsest solution to Ax = b
1. Can the sparsest solution be unique? Under what conditions?
2. Given a sparse x, how to verify whether it is actually the sparsest one?
Definition (Donoho and Elad [2003])
The spark of a given matrix A is the smallest number of columns from A that are linearly dependent, written as spark(A).
rank(A) is the largest number of columns from A that are linearly independent. In general, spark(A) ≠ rank(A) + 1, except for many randomly generated matrices.
Rank is easy to compute (due to the matroid structure), but spark needs a combinatorial search.
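For small matrices the combinatorial search is still feasible. A minimal brute-force sketch (our illustration; the helper `spark` is ours, not lecture code) checks column subsets of growing size and returns the size of the smallest linearly dependent one:

```python
# Brute-force spark: the smallest number of linearly dependent columns.
# Exponential in n, so only usable for small matrices.
import itertools
import numpy as np

def spark(A, tol=1e-10):
    m, n = A.shape
    for size in range(1, n + 1):
        for cols in itertools.combinations(range(n), size):
            # A subset is dependent iff its rank is below its cardinality.
            if np.linalg.matrix_rank(A[:, cols], tol=tol) < size:
                return size
    return n + 1  # all columns independent (only possible when n <= m)

# Column 1 is twice column 0, so two columns already suffice.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
print(spark(A))  # 2
```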
Spark
Theorem (Gorodnitsky and Rao [1997])
If Ax = b has a solution x obeying ‖x‖₀ < spark(A)/2, then x is the sparsest solution.
• Proof idea: if there is another solution y to Ax = b with x − y ≠ 0, then A(x − y) = 0 and thus
‖x‖₀ + ‖y‖₀ ≥ ‖x − y‖₀ ≥ spark(A),
or ‖y‖₀ ≥ spark(A) − ‖x‖₀ > spark(A)/2 > ‖x‖₀.
• The result does not mean this x can be efficiently found numerically.
• For many random matrices A ∈ R^{m×n}, the result means that if an algorithm returns x satisfying ‖x‖₀ < (m + 1)/2, then x is optimal with probability 1.
• What to do when spark(A) is difficult to obtain?
General Recovery - Spark
Rank is easy to compute, but spark needs a combinatorial search.
However, for matrices with entries in general position, spark(A) = rank(A) + 1.
For example, if matrix A ∈ R^{m×n} (m < n) has entries A_{ij} ∼ N(0, 1), then rank(A) = m = spark(A) − 1 with probability 1.
m×n In general, ∀ full rank matrix A ∈ R (m < n), any m + 1 columns of A is linearly dependent, so
spark(A) ≤ m + 1 = rank(A) + 1.
Coherence
Definition (Mallat and Zhang [1993])
The (mutual) coherence of a given matrix A is the largest absolute normalized inner product between different columns from A. Suppose A = [a₁ a₂ ⋯ aₙ]. The mutual coherence of A is given by
µ(A) = max_{k≠j} |a_k^T a_j| / (‖a_k‖₂ · ‖a_j‖₂).
• It characterizes the dependence between columns of A
• For unitary matrices, µ(A) = 0
• For matrices with more columns than rows, µ(A) > 0
• For recovery problems, we desire a small µ(A), since a matrix with small coherence behaves like a unitary matrix
• For A = [Φ Ψ] where Φ and Ψ are n × n unitary, it holds that n^{-1/2} ≤ µ(A) ≤ 1
• µ(A) = n^{-1/2} is achieved with [I F], [I Hadamard], etc.
• If A ∈ R^{m×n} where n > m, then µ(A) ≥ m^{-1/2}
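Unlike spark, coherence is cheap to compute: one Gram matrix. The sketch below (ours, not from the slides; `mutual_coherence` is a hypothetical helper) checks the definition on A = [I H] with H a normalized 4×4 Hadamard matrix, where the lower bound n^{-1/2} = 1/2 is attained:

```python
# Mutual coherence: largest off-diagonal entry (in absolute value) of the
# Gram matrix of the column-normalized A.
import numpy as np

def mutual_coherence(A):
    G = A / np.linalg.norm(A, axis=0)    # normalize each column to unit 2-norm
    gram = np.abs(G.T @ G)
    np.fill_diagonal(gram, 0.0)          # ignore the trivial diagonal (all 1s)
    return gram.max()

H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H4 = np.kron(H2, H2) / 2.0               # 4x4 Hadamard with unit-norm columns
A = np.hstack([np.eye(4), H4])
print(mutual_coherence(A))               # 0.5, i.e. n^{-1/2} for n = 4
```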
Coherence
Theorem (Donoho and Elad [2003])
spark(A) ≥ 1 + µ^{-1}(A).
Proof sketch:
• Ā ← A with columns normalized to unit 2-norm
• p ← spark(A)
• B ← a p × p minor of Ā^T Ā
• |B_{ii}| = 1 and Σ_{j≠i} |B_{ij}| ≤ (p − 1)µ(A)
• Suppose p < 1 + µ^{-1}(A) ⇒ 1 > (p − 1)µ(A) ⇒ |B_{ii}| > Σ_{j≠i} |B_{ij}|, ∀i
• ⇒ B ≻ 0 (Gershgorin circle theorem) ⇒ spark(A) > p. Contradiction.
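The Gershgorin step can be mirrored numerically. In this sketch (ours; sizes and seed are assumptions) we pick p with (p − 1)µ(A) < 1 and confirm that a p × p Gram minor of the normalized matrix is strictly diagonally dominant, hence positive definite:

```python
# Numerical mirror of the proof sketch: for p < 1 + 1/mu(A), a p x p Gram
# minor of the column-normalized A is diagonally dominant, so B > 0.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((400, 500))
A = A / np.linalg.norm(A, axis=0)         # A-bar: unit-norm columns

G = np.abs(A.T @ A)
np.fill_diagonal(G, 0.0)
mu = G.max()                              # mutual coherence

p = int(np.floor(1.0 / mu))               # then (p - 1) * mu <= 1 - mu < 1
B = A[:, :p].T @ A[:, :p]                 # one p x p minor of A-bar^T A-bar

row_off = np.abs(B).sum(axis=1) - np.abs(np.diag(B))
assert np.all(np.diag(B) > row_off)       # strict diagonal dominance
assert np.linalg.eigvalsh(B).min() > 0    # hence B is positive definite
print(p, "columns: Gram minor is positive definite")
```

Any p × p minor would work equally well; we take the leading one only for brevity.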
Coherence-based guarantee
Corollary
If Ax = b has a solution x obeying ‖x‖₀ < (1 + µ^{-1}(A))/2, then x is the unique sparsest solution.
Compare with the previous Theorem
If Ax = b has a solution x obeying ‖x‖₀ < spark(A)/2, then x is the sparsest solution.
For A ∈ R^{m×n} where m < n, (1 + µ^{-1}(A)) is at most 1 + √m, but spark can be 1 + m. spark is more useful.
Assume Ax = b has a solution with ‖x‖₀ = k < spark(A)/2. It will be the unique ℓ₀ minimizer. Will it be the ℓ₁ minimizer as well? Not necessarily. However, ‖x‖₀ < (1 + µ^{-1}(A))/2 is a sufficient condition.
Coherence-based ℓ₀ = ℓ₁
Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003])
If A has normalized columns and Ax = b has a solution x satisfying
‖x‖₀ < (1/2)(1 + µ^{-1}(A)),
then this x is the unique minimizer with respect to both ℓ₀ and ℓ₁.
Proof sketch:
• Previously we know x is the unique ℓ₀ minimizer; let S := supp(x)
• Suppose y is the ℓ₁ minimizer but not x; we study e := y − x
• e must satisfy Ae = 0 and ‖e‖₁ ≤ 2‖e_S‖₁
• A^T Ae = 0 ⇒ |e_j| ≤ (1 + µ(A))^{-1} µ(A)‖e‖₁, ∀j
• the last two points together contradict the assumption

Result bottom line: allows ‖x‖₀ up to O(√m) for exact recovery
The null space of A