Estimating Unknown Sparsity in Compressed Sensing

Miles E. Lopes ([email protected])
UC Berkeley, Dept. of Statistics, 367 Evans Hall, Berkeley, CA 94720-3860

Abstract

In the theory of compressed sensing (CS), the sparsity ||x||_0 of the unknown signal x ∈ R^p is commonly assumed to be a known parameter. However, it is typically unknown in practice. Due to the fact that many aspects of CS depend on knowing ||x||_0, it is important to estimate this parameter in a data-driven way. A second practical concern is that ||x||_0 is a highly unstable function of x. In particular, for real signals with entries not exactly equal to 0, the value ||x||_0 = p is not a useful description of the effective number of coordinates. In this paper, we propose to estimate a stable measure of sparsity s(x) := ||x||_1^2 / ||x||_2^2, which is a sharp lower bound on ||x||_0. Our estimation procedure uses only a small number of linear measurements, does not rely on any sparsity assumptions, and requires very little computation. A confidence interval for s(x) is provided, and its width is shown to have no dependence on the signal dimension p. Moreover, this result extends naturally to the recovery setting, where a soft version of matrix rank can be estimated with analogous guarantees. Finally, we show that the use of randomized measurements is essential to estimating s(x). This is accomplished by proving that the minimax risk for estimating s(x) with deterministic measurements is large when n ≪ p.

1. Introduction

The central problem of compressed sensing (CS) is to estimate an unknown signal x ∈ R^p from n linear measurements y = (y_1, ..., y_n) given by

    y = Ax + ε,    (1)

where A ∈ R^{n×p} is a user-specified measurement matrix, ε ∈ R^n is a random noise vector, and n is much smaller than the signal dimension p. During the last several years, the theory of CS has drawn widespread attention to the fact that this seemingly ill-posed problem can be solved reliably when x is sparse — in the sense that the parameter ||x||_0 := card{j : x_j ≠ 0} is much less than p. For instance, if n is approximately ||x||_0 log(p/||x||_0), then accurate recovery can be achieved with high probability when A is drawn from a Gaussian ensemble (Donoho, 2006; Candès et al., 2006). Along these lines, the value of the parameter ||x||_0 is commonly assumed to be known in the analysis of recovery algorithms — even though it is typically unknown in practice. Due to the fundamental role that sparsity plays in CS, this issue has been recognized as a significant gap between theory and practice by several authors (Ward, 2009; Eldar, 2009; Malioutov et al., 2008). Nevertheless, the literature has been relatively quiet about the problems of estimating this parameter and quantifying its uncertainty.
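To make the setting concrete, the following snippet (not from the paper; the dimensions and constants are arbitrary toy choices) generates a sparse signal, a Gaussian measurement ensemble with n on the order of ||x||_0 log(p/||x||_0), and noisy measurements following model (1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the model y = Ax + eps from eq. (1); sizes are arbitrary.
p, k = 1000, 10                              # ambient dimension and sparsity ||x||_0
x = np.zeros(p)
x[rng.choice(p, size=k, replace=False)] = rng.normal(size=k)

n = int(np.ceil(k * np.log(p / k)))          # n of order ||x||_0 * log(p / ||x||_0)
A = rng.normal(size=(n, p)) / np.sqrt(n)     # Gaussian measurement ensemble
eps = 0.01 * rng.normal(size=n)              # small noise vector
y = A @ x + eps                              # n = 47 noisy measurements of x
```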

1.1. Motivations and the role of sparsity

At a conceptual level, the problem of estimating ||x||_0 is quite different from the more well-studied problems of estimating the full signal x or its support set S := {j : x_j ≠ 0}. The difference arises from sparsity assumptions. On one hand, a procedure for estimating ||x||_0 should make very few assumptions about sparsity (if any). On the other hand, methods for estimating x or S often assume that a sparsity level is given, and then impose this value on the solution x̂ or Ŝ. Consequently, a simple plug-in estimate of ||x||_0, such as ||x̂||_0 or card(Ŝ), may fail when the sparsity assumptions underlying x̂ or Ŝ are invalid.

To emphasize that there are many aspects of CS that depend on knowing ||x||_0, we provide several examples below. Our main point here is that a method for estimating ||x||_0 is valuable because it can help to address a broad range of issues.

• Modeling assumptions. One of the core modeling assumptions invoked in applications of CS is that the signal of interest has a sparse representation. Likewise, the problem of checking whether or not this assumption is supported by data has been an active research topic, particularly in areas of face recognition and image classification (Rigamonti et al., 2011; Shi et al., 2011). In this type of situation, an estimate of ||x||_0 that does not rely on any sparsity assumptions is a natural device for validating the use of sparse representations.

• The number of measurements. If the choice of n is too small compared to the "critical" number n*(x) := ||x||_0 log(p/||x||_0), then there are known information-theoretic barriers to the accurate reconstruction of x (Arias-Castro et al., 2011). At the same time, if n is chosen to be much larger than n*(x), then the measurement process is wasteful, as there are known algorithms that can reliably recover x with approximately n*(x) measurements (Davenport et al., 2011). To deal with the selection of n, a sparsity estimate of ||x||_0 may be used in two different ways, depending on whether measurements are collected sequentially, or in a single batch. In the sequential case, an estimate of ||x||_0 can be computed from a set of "preliminary" measurements, and then the estimated value determines how many additional measurements should be collected to recover the full signal. Also, it is not always necessary to take additional measurements, since the preliminary set may be re-used to compute x̂ (as discussed in Section 5). Alternatively, if all of the measurements must be taken in one batch, the estimated value of ||x||_0 can be used to certify whether or not enough measurements were actually taken.

• The measurement matrix. Two of the most well-known design characteristics of the matrix A are defined explicitly in terms of sparsity. These are the restricted isometry property of order k (RIP-k), and the restricted null-space property of order k (NSP-k), where k is a presumed upper bound on the sparsity level of the true signal. Since many recovery guarantees are closely tied to RIP-k and NSP-k, a growing body of work has been devoted to certifying whether or not a given matrix satisfies these properties (d'Aspremont & El Ghaoui, 2011; Juditsky & Nemirovski, 2011; Tang & Nehorai, 2011). When k is treated as given, this problem is already computationally difficult. Yet, when the sparsity of x is unknown, we must also remember that such a "certificate" is not meaningful unless we can check that k is consistent with the true signal.

• Recovery algorithms. When recovery algorithms are implemented, the sparsity level of x is often treated as a tuning parameter. For example, if k is a presumed bound on ||x||_0, then the Orthogonal Matching Pursuit algorithm (OMP) is typically initialized to run for k iterations. A second example is the Lasso algorithm, which computes the solution x̂ ∈ argmin{ ||y − Av||_2^2 + λ||v||_1 : v ∈ R^p }, for some choice of λ ≥ 0. The sparsity of x̂ is determined by the size of λ, and in order to select the appropriate value, a family of solutions is examined over a range of λ values. In the case of either OMP or Lasso, a sparsity estimate would reduce computation by restricting the possible choices of λ or k, and it would also ensure that the chosen values conform to the true signal.

1.2. An alternative measure of sparsity

Despite the important theoretical role of the parameter ||x||_0 in many aspects of CS, it has the practical drawback of being a highly unstable function of x. In particular, for real signals x ∈ R^p whose entries are not exactly equal to 0, the value ||x||_0 = p is not a useful description of the effective number of coordinates. In order to estimate sparsity in a way that accounts for the instability of ||x||_0, it is desirable to replace the ℓ_0 norm with a "soft" version. More precisely, we would like to identify a function of x that can be interpreted like ||x||_0, but remains stable under small perturbations of x. A natural quantity that serves this purpose is the numerical sparsity

    s(x) := ||x||_1^2 / ||x||_2^2,    (2)

which always satisfies 1 ≤ s(x) ≤ p for any non-zero x. Although the ratio ||x||_1^2 / ||x||_2^2 appears sporadically in different areas (Tang & Nehorai, 2011; Hurley & Rickard, 2009; Hoyer, 2004; Lopes et al., 2011), it does not seem to be well known as a sparsity measure in CS. A key property of s(x) is that it is a sharp lower bound on ||x||_0 for all non-zero x,

    s(x) ≤ ||x||_0,    (3)

which follows from applying the Cauchy-Schwarz inequality to the relation ||x||_1 = ⟨x, sgn(x)⟩. (Equality in (3) is attained iff the non-zero coordinates of x are equal in magnitude.) We also note that this inequality is invariant to scaling of x, since s(x) and ||x||_0 are individually scale invariant. In the opposite direction, it is easy to see that the only continuous upper bound on ||x||_0 is the trivial one: if a continuous function f satisfies ||x||_0 ≤ f(x) ≤ p for all x in some open subset of R^p, then f must be identically equal to p. (There is a dense set of points where ||x||_0 = p.) Therefore, we must be content with a continuous lower bound.

The fact that s(x) is a sensible measure of sparsity for non-idealized signals is illustrated in Figure 1. In essence, if x has k large coordinates and p − k small coordinates, then s(x) ≈ k, whereas ||x||_0 = p. In the left panel, the sorted coordinates of three different vectors in R^100 are plotted. The value of s(x) for each vector is marked with a triangle on the x-axis, which shows that s(x) adapts well to the decay profile. This idea can be seen in a more geometric way in the middle and right panels, which plot the sub-level sets S_c := {x ∈ R^p : s(x) ≤ c} with c ∈ [1, p]. When c ≈ 1, the vectors in S_c are closely aligned with the coordinate axes, and hence contain one effective coordinate. As c ↑ p, the set S_c includes more dense vectors, until S_p = R^p.
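As a quick numerical illustration of definition (2) (a minimal sketch, not code from the paper), the following computes s(x) for a vector with 45 equal-magnitude nonzero entries and for two dense power-law vectors in R^100; the results show the equality case of (3) and values close to those marked in Figure 1 (16.4 and 66.6).

```python
import numpy as np

def numerical_sparsity(x):
    """Numerical sparsity s(x) = ||x||_1^2 / ||x||_2^2 from eq. (2)."""
    x = np.asarray(x, dtype=float)
    return np.linalg.norm(x, 1) ** 2 / np.linalg.norm(x, 2) ** 2

p = 100
i = np.arange(1, p + 1)

x_equal = np.r_[np.ones(45), np.zeros(55)]   # 45 equal nonzero entries
x_fast  = i ** (-1.0)                        # fast power-law decay, ||x||_0 = 100
x_slow  = i ** (-0.5)                        # slow power-law decay, ||x||_0 = 100

# s(x_equal) = 45 exactly (the equality case of (3)); the dense power-law
# vectors have s(x) of roughly 16 and 67 "effective" coordinates.
for x in (x_equal, x_fast, x_slow):
    print(np.count_nonzero(x), round(numerical_sparsity(x), 1))
```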

[Figure 1 — three panels. Left: "vectors in R^100 and s(x) values", showing sorted coordinate values against coordinate index for three vectors, with legend entries s(x) = 16.4, ||x||_0 = 100; s(x) = 32.7, ||x||_0 = 45; s(x) = 66.6, ||x||_0 = 100. Middle: the sub-level set {s(x) ≤ 1.1} in R^2. Right: the sub-level set {s(x) ≤ 1.9} in R^2.]

Figure 1. Characteristics of s(x). Left panel: Three vectors (red, blue, black) in R^100 have been plotted with their coordinates in order of decreasing size (maximum entry normalized to 1). Two of the vectors have power-law decay profiles, and one is a dyadic vector with exactly 45 positive coordinates (red: x_i ∝ i^{-1}, blue: dyadic, black: x_i ∝ i^{-1/2}). Color-coded triangles on the bottom axis indicate that the s(x) value represents the "effective" number of coordinates.

1.3. Related work

Some of the challenges described in Section 1.1 can be approached with the general tools of cross-validation (CV) and empirical risk minimization (ERM). This approach has been used to select various parameters, such as the number of measurements n (Malioutov et al., 2008; Ward, 2009), the number of OMP iterations k (Ward, 2009), or the Lasso regularization parameter λ (Eldar, 2009). At a high level, these methods consider a collection of (say m) solutions x̂^(1), ..., x̂^(m) obtained from different values θ_1, ..., θ_m of some tuning parameter of interest. For each solution, an empirical error estimate êrr(x̂^(j)) is computed, and the value θ_{j*} corresponding to the smallest êrr(x̂^(j)) is chosen.

Although methods based on CV/ERM share common motivations with our work here, these methods differ from our approach in several ways. In particular, the problem of estimating a soft measure of sparsity, such as s(x), has not been considered from that angle. Also, the cited methods do not give any theoretical guarantees to ensure that the estimated sparsity level is close to the true one. Note that even if an estimate x̂ has small error ||x̂ − x||_2, it is not necessary for ||x̂||_0 to be close to ||x||_0. This point is especially relevant when one is interested in identifying a set of important variables or interpreting features.

From a computational point of view, the CV/ERM approaches can also be costly, since x̂^(j) is typically computed from a separate optimization problem for each choice of the tuning parameter. By contrast, our method for estimating s(x) requires no optimization, and can be computed easily from just a small set of preliminary measurements.

1.4. Our contributions

The primary contribution of this paper is our treatment of unknown sparsity as a parameter estimation problem. Specifically, we identify a stable measure of sparsity that is relevant to CS, and propose an efficient estimator with provable guarantees. Secondly, we are not aware of any other papers that have demonstrated a distinction between random and deterministic measurements with regard to unknown sparsity (as in Section 4).

The remainder of the paper is organized as follows. In Section 2, we show that a principled choice of n can be made if s(x) is known. This is accomplished by formulating a recovery condition for the Basis Pursuit algorithm directly in terms of s(x). Next, in Section 3, we propose an estimator ŝ(x), and derive a dimension-free confidence interval for s(x). The procedure is also shown to extend to the problem of estimating a soft measure of rank for matrix-valued signals. In Section 4, we show that the use of randomized measurements is essential to estimating s(x). Finally, we present simulations in Section 5 to validate the consequences of our theoretical results. Due to space constraints, we defer all of our proofs to the supplement.

Notation. We define ||x||_q := (Σ_{j=1}^p |x_j|^q)^{1/q} for any q > 0 and x ∈ R^p, which only corresponds to a genuine norm for q ≥ 1. For sequences of numbers a_n and b_n, we write a_n ≲ b_n or a_n = O(b_n) if there is an absolute constant c > 0 such that a_n ≤ c b_n for all large n. If a_n/b_n → 0, we write a_n = o(b_n). For a matrix M, we define the Frobenius norm ||M||_F = (Σ_{i,j} M_{ij}^2)^{1/2} and the matrix ℓ_1-norm ||M||_1 = Σ_{i,j} |M_{ij}|. Finally, for two matrices A and B of the same size, we define the inner product ⟨A, B⟩ := tr(A^T B).

2. Recovery conditions in terms of s(x)

The purpose of this section is to present a simple proposition that links s(x) with recovery conditions for the Basis Pursuit algorithm (BP). This is an important motivation for studying s(x), since it implies that if s(x) can be estimated well, then n can be chosen appropriately. In other words, we offer an adaptive choice of n.

In order to explain the connection between s(x) and recovery, we first recall a standard result (Candès et al., 2006) that describes the ℓ_2 error rate of the BP algorithm. Informally, the result assumes that the noise is bounded as ||ε||_2 ≤ ε_0 for some constant ε_0 > 0, the matrix A ∈ R^{n×p} is drawn from a suitable ensemble, and n satisfies n ≳ T log(pe/T) for some T ∈ {1, ..., p} with e = exp(1). The conclusion is that with high probability, the solution x̂ ∈ argmin{ ||v||_1 : ||Av − y||_2 ≤ ε_0, v ∈ R^p } satisfies

    ||x̂ − x||_2 ≤ c_1 ε_0 + c_2 ||x − x_T||_1 / √T,    (4)

where x_T ∈ R^p is the best T-term approximation to x,¹ and c_1, c_2 > 0 are constants. This bound is a fundamental point of reference, since it matches the minimax optimal rate under certain conditions (Candès, 2006), and applies to all signals x ∈ R^p (rather than just k-sparse signals). Additional details may be found in (Cai et al., 2010) [Theorem 3.3] and (Vershynin, 2010) [Theorem 5.65].

¹ The vector x_T ∈ R^p is obtained by setting to 0 all coordinates of x except the T largest ones (in magnitude).

We now aim to answer the question, "If s(x) were known, how large should n be in order for x̂ to be close to x?" Since the bound (4) assumes n ≳ T log(pe/T), our question amounts to choosing T. For this purpose, it is natural to consider the relative ℓ_2 error

    ||x̂ − x||_2 / ||x||_2 ≤ c_1 ε_0 / ||x||_2 + c_2 (1/√T) ||x − x_T||_1 / ||x||_2,    (5)

so that the T-term approximation error (1/√T) ||x − x_T||_1 / ||x||_2 does not depend on the scale of x (i.e. it is invariant under x ↦ cx with c ≠ 0).

Proposition 1 below shows how knowledge of s(x) allows us to control the approximation error. Specifically, the result shows that the condition T ≳ s(x) is necessary for the approximation error to be small, and the condition T ≳ s(x) log(p) is sufficient.

Proposition 1. Let x ∈ R^p \ {0}, and T ∈ {1, ..., p}. The following statements hold for any c_0, ε > 0.

(i) If the T-term approximation error satisfies (1/√T) ||x − x_T||_1 / ||x||_2 ≤ ε, then

    T ≥ (1/(1+ε)^2) · s(x).

(ii) If T ≥ c_0 s(x) log(p), then the T-term approximation error satisfies

    (1/√T) ||x − x_T||_1 / ||x||_2 ≤ (1/√(c_0 log(p))) (1 − T/p).

In particular, if T ≥ 2 s(x) log(p) with p ≥ 100, then (1/√T) ||x − x_T||_1 / ||x||_2 ≤ 1/3.

Remarks. A notable feature of these bounds is that they hold for all non-zero signals. In our simulations in Section 5, we show that choosing n = 2⌈ŝ(x)⌉ log(p/⌈ŝ(x)⌉) based on an estimate ŝ(x) leads to accurate reconstruction across many sparsity levels.
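The following small check (a sketch for the power-law example used later in the paper, not code from the paper) evaluates the T-term approximation error at the choice T ≥ 2 s(x) log(p) from Proposition 1(ii), and confirms that the error falls below 1/3.

```python
import numpy as np

def rel_T_term_error(x, T):
    """(1/sqrt(T)) * ||x - x_T||_1 / ||x||_2, where x_T keeps the T largest |x_j|."""
    x = np.asarray(x, dtype=float)
    tail = np.sort(np.abs(x))[::-1][T:]            # coordinates outside the top T
    return tail.sum() / (np.sqrt(T) * np.linalg.norm(x, 2))

p = 10**4
x = np.arange(1, p + 1) ** (-1.0)                  # power-law signal with ||x||_0 = p
s = np.linalg.norm(x, 1) ** 2 / np.linalg.norm(x, 2) ** 2

T = int(np.ceil(2 * s * np.log(p)))                # T >= 2 s(x) log(p), as in Prop. 1(ii)
print(round(s, 1), T, round(rel_T_term_error(x, T), 3))   # error is well below 1/3
```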

3. Estimation results for s(x)

In this section, we give a simple procedure to estimate s(x) for any x ∈ R^p \ {0}. The procedure uses a small number of measurements, makes no sparsity assumptions, and requires very little computation. The measurements we prescribe may also be re-used to recover the full signal after s(x) has been estimated.

The results in this section are based on the measurement model (1), written in scalar notation as

    y_i = ⟨a_i, x⟩ + ε_i,    i = 1, ..., n.    (6)

We assume only that the noise variables ε_i are independent, and bounded by |ε_i| ≤ σ_0, for some constant σ_0 > 0. No additional structure on the noise is needed.

3.1. Sketching with stable laws

Our estimation procedure derives from a technique known as sketching in the streaming computation literature (Indyk, 2006). Although this area deals with problems that have mathematical connections to CS, the use of sketching techniques in CS does not seem to be well known.

For any q ∈ (0, 2], the sketching technique offers a way to estimate ||x||_q from a set of randomized linear measurements. In our approach, we estimate s(x) = ||x||_1^2 / ||x||_2^2 by estimating ||x||_1 and ||x||_2 from separate sets of measurements. The core idea is to generate the measurement vectors a_i ∈ R^p using stable laws (Zolotarev, 1986).

Definition 1. A random variable V has a symmetric stable distribution if its characteristic function is of the form E[exp(√−1 tV)] = exp(−|γt|^q) for some q ∈ (0, 2] and some γ > 0. We denote the distribution by V ∼ S_q(γ), and γ is referred to as the scale parameter.

The most well-known examples of symmetric stable laws are the cases of q = 2, 1, namely the Gaussian distribution N(0, γ^2) = S_2(γ), and the Cauchy distribution C(0, γ) = S_1(γ). To fix some notation, if a vector a_1 = (a_{1,1}, ..., a_{1,p}) ∈ R^p has i.i.d. entries drawn from S_q(γ), we write a_1 ∼ S_q(γ)^⊗p. The connection with ℓ_q norms hinges on the following property of stable distributions (Zolotarev, 1986).

Lemma 1. Suppose x ∈ R^p, and a_1 ∼ S_q(γ)^⊗p with parameters q ∈ (0, 2] and γ > 0. Then, the random variable ⟨x, a_1⟩ is distributed according to S_q(γ ||x||_q).

Using this fact, if we generate a set of i.i.d. vectors a_1, ..., a_n from S_q(γ)^⊗p and let y_i = ⟨a_i, x⟩, then y_1, ..., y_n is an i.i.d. sample from S_q(γ ||x||_q). Hence, in the special case of noiseless linear measurements, the task of estimating ||x||_q is equivalent to a well-studied univariate problem: estimating the scale parameter of a stable law from an i.i.d. sample.

When the y_i are corrupted with noise, our analysis shows that standard estimators for scale parameters are only moderately affected. The impact of the noise can also be reduced via the choice of γ when generating a_i ∼ S_q(γ)^⊗p. The γ parameter controls the "energy level" of the measurement vectors a_i. (Note that in the Gaussian case, if a_1 ∼ S_2(γ)^⊗p, then E||a_1||_2^2 = γ^2 p.) In our results, we leave γ as a free parameter to show how the effect of noise is reduced as γ is increased.

3.2. Estimation procedure for s(x)

Two sets of measurements are used to estimate s(x), and we write the total number as n = n_1 + n_2. The first set is obtained by generating i.i.d. measurement vectors from a Cauchy distribution,

    a_i ∼ C(0, γ)^⊗p,    i = 1, ..., n_1.    (7)

The corresponding values y_i are then used to estimate ||x||_1 via the statistic

    T̂_1 := (1/γ) median(|y_1|, ..., |y_{n_1}|),    (8)

which is a standard estimator of the scale parameter of the Cauchy distribution (Fama & Roll, 1971; Li et al., 2007). Next, a second set of i.i.d. measurement vectors are generated from a Gaussian distribution,

    a_i ∼ N(0, γ^2)^⊗p,    i = n_1 + 1, ..., n_1 + n_2.    (9)

In this case, the associated y_i values are used to compute an estimate of ||x||_2^2 given by

    T̂_2^2 := (1/(γ^2 n_2)) (y_{n_1+1}^2 + ··· + y_{n_1+n_2}^2),    (10)

which is a natural estimator of the variance of a Gaussian distribution. Combining these two statistics, our estimate of s(x) = ||x||_1^2 / ||x||_2^2 is defined as

    ŝ(x) := T̂_1^2 / T̂_2^2.    (11)
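To make the procedure concrete, here is a minimal sketch (not the authors' code) of the estimator defined by (7)-(11); the bounded uniform noise below is just one admissible choice under model (6).

```python
import numpy as np

def sketch_sparsity(x, n1=500, n2=500, gamma=1.0, sigma0=0.0, rng=None):
    """Estimate s(x) = ||x||_1^2 / ||x||_2^2 from n1 + n2 randomized measurements,
    following (7)-(11). sigma0 controls bounded uniform noise as in model (6)."""
    rng = np.random.default_rng(rng)
    p = x.size

    # First set: Cauchy measurement vectors a_i ~ C(0, gamma)^p, eq. (7).
    A1 = gamma * rng.standard_cauchy(size=(n1, p))
    y1 = A1 @ x + rng.uniform(-sigma0, sigma0, size=n1)
    T1 = np.median(np.abs(y1)) / gamma           # scale estimate of ||x||_1, eq. (8)

    # Second set: Gaussian measurement vectors a_i ~ N(0, gamma^2)^p, eq. (9).
    A2 = gamma * rng.normal(size=(n2, p))
    y2 = A2 @ x + rng.uniform(-sigma0, sigma0, size=n2)
    T2_sq = np.mean(y2 ** 2) / gamma ** 2        # estimate of ||x||_2^2, eq. (10)

    return T1 ** 2 / T2_sq                       # eq. (11)

p = 10**4
x = np.arange(1, p + 1) ** (-1.0)                # true s(x) is about 58 here
print(round(sketch_sparsity(x, rng=0), 1))
```

Each measurement costs a single inner product, and no optimization is involved, consistent with the claim that the procedure requires very little computation.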
3.3. Confidence interval

The following theorem describes the relative error |ŝ(x)/s(x) − 1| via an asymptotic confidence interval for s(x). Our result is stated in terms of the noise-to-signal ratio

    ρ := σ_0 / (γ ||x||_2),

and the standard Gaussian quantile z_{1−α}, which satisfies Φ(z_{1−α}) = 1 − α for any coverage level α ∈ (0, 1). In this notation, the following parameters govern the width of the confidence interval,

    η_n(α, ρ) := z_{1−α}/√n + ρ    and    δ_n(α, ρ) := π z_{1−α}/√(2n) + ρ,

and we write these simply as δ_n and η_n. As is standard in high-dimensional statistics, we allow all of the model parameters p, x, σ_0 and γ to vary implicitly as functions of (n_1, n_2), and let (n_1, n_2, p) → ∞. For simplicity, we choose to take measurement sets of equal sizes, n_1 = n_2 = n/2, and we place a mild constraint on ρ, namely η_n(α, ρ) < 1. (Note that standard algorithms such as Basis Pursuit are not expected to perform well unless ρ ≪ 1, as is clear from the bound (5).) Lastly, we make no restriction on the growth of p/n, which makes ŝ(x) well-suited to high-dimensional problems.

Theorem 1. Let α ∈ (0, 1/2) and x ∈ R^p \ {0}. Assume that ŝ(x) is constructed as above, and that the model (6) holds. Suppose also that n_1 = n_2 = n/2 and η_n(α, ρ) < 1 for all n. Then as (n, p) → ∞, we have

    P( √(ŝ(x)/s(x)) ∈ [ (1 − δ_n)/(1 + η_n), (1 + δ_n)/(1 − η_n) ] ) ≥ (1 − 2α)^2 + o(1).    (12)

Remarks. The most important feature of this result is that the width of the confidence interval does not depend on the dimension or sparsity of the unknown signal. Concretely, this means that the number of measurements needed to estimate s(x) to a fixed precision is only O(1) with respect to the size of the problem. Our simulations in Section 5 also show that the relative error of ŝ(x) does not depend on dimension or sparsity of x. Lastly, when δ_n and η_n are small, we note that the relative error |ŝ(x)/s(x) − 1| is at most of order (n^{−1/2} + ρ) with high probability, which follows from the simple Taylor expansion (1 + ε)^2/(1 − ε)^2 = 1 + 4ε + o(ε).

3.4. Estimating rank and sparsity of matrices

The framework of CS naturally extends to the problem of recovering an unknown matrix X ∈ R^{p_1×p_2} on the basis of the measurement model

    y = A(X) + ε,    (13)

where y ∈ R^n, A is a user-specified linear operator from R^{p_1×p_2} to R^n, and ε ∈ R^n is a vector of noise variables. In recent years, many researchers have explored the recovery of X when it is assumed to have sparse or low rank structure. We refer to the papers (Candès & Plan, 2011; Chandrasekaran et al., 2010) for descriptions of numerous applications. In analogy with the previous section, the parameters rank(X) or ||X||_0 play important theoretical roles, but are very sensitive to perturbations of X. Likewise, it is of basic interest to estimate stable measures of rank and sparsity for matrices. Since the sparsity analogue s(X) := ||X||_1^2 / ||X||_F^2 can be estimated as a straightforward extension of Section 3.2, we restrict our attention to the more distinct problem of rank estimation.

3.4.1. The rank of semidefinite matrices

In the context of recovering a low-rank positive semidefinite matrix X ∈ S_+^{p×p} \ {0}, the quantity rank(X) plays the role of ||x||_0 in the recovery of a sparse vector. If we let λ(X) ∈ R_+^p denote the vector of ordered eigenvalues of X, the connection can be made explicit by writing rank(X) = ||λ(X)||_0. As in Section 3.2, our approach is to consider a robust alternative to the rank. Motivated by the quantity s(x) = ||x||_1^2 / ||x||_2^2 in the vector case, we now consider

    r(X) := ||λ(X)||_1^2 / ||λ(X)||_2^2 = tr(X)^2 / ||X||_F^2

as our measure of the effective rank for non-zero X, which always satisfies 1 ≤ r(X) ≤ p. The quantity r(X) has appeared elsewhere as a measure of rank (Lopes et al., 2011; Tang & Nehorai, 2010), but is less well known than other rank relaxations, such as the numerical rank ||X||_F^2 / ||X||_op^2 (Rudelson & Vershynin, 2007). The relationship between r(X) and rank(X) is completely analogous to that of s(x) and ||x||_0. Namely, we have a sharp, scale-invariant inequality

    r(X) ≤ rank(X).

The quantity r(X) is more stable than rank(X) in the sense that if X has k large eigenvalues, and p − k small eigenvalues, then r(X) ≈ k, whereas rank(X) = p.

Our procedure for estimating r(X) is based on estimating tr(X) and ||X||_F^2 from separate sets of measurements. The semidefinite condition is exploited through the basic relation ⟨I_{p×p}, X⟩ = tr(X) = ||λ(X)||_1. To estimate tr(X), we use n_1 linear measurements of the form

    y_i = ⟨γ I_{p×p}, X⟩ + ε_i,    i = 1, ..., n_1,    (14)

and compute the estimator T̆_1 := (1/(γ n_1)) Σ_{i=1}^{n_1} y_i, where γ > 0 is again the measurement energy parameter. Next, to estimate ||X||_F^2, we note that if Z ∈ R^{p×p} has i.i.d. N(0, 1) entries, then E⟨X, Z⟩^2 = ||X||_F^2. Hence, if we collect n_2 additional measurements of the form

    y_i = ⟨γ Z_i, X⟩ + ε_i,    i = n_1 + 1, ..., n_1 + n_2,    (15)

where the Z_i ∈ R^{p×p} are independent random matrices with i.i.d. N(0, 1) entries, then a suitable estimator of ||X||_F^2 is T̆_2^2 := (1/(γ^2 n_2)) Σ_{i=n_1+1}^{n_1+n_2} y_i^2. Combining these statistics, we propose

    r̂(X) := T̆_1^2 / T̆_2^2

as our estimate of r(X). In principle, this procedure can be refined by using the measurements (14) to estimate the noise distribution, but we omit these details. Also, we retain the assumptions of the previous section, and assume only that the ε_i are independent and bounded by |ε_i| ≤ σ_0. The next theorem shows that the estimator r̂(X) mirrors ŝ(x) as in Theorem 1, but with ρ being replaced by ϱ := σ_0/(γ ||X||_F), and with η_n being replaced by ζ_n = ζ_n(ϱ, α) := z_{1−α}/√n + ϱ.
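As with the vector case, the following is a minimal sketch of the rank estimator (not the authors' code), shown in the noiseless setting for simplicity; the example matrix with 5 dominant eigenvalues is an arbitrary illustration.

```python
import numpy as np

def sketch_rank(X, n1=500, n2=500, gamma=1.0, rng=None):
    """Estimate r(X) = tr(X)^2 / ||X||_F^2 for a PSD matrix X from n1 + n2
    linear measurements, following (14)-(15); noiseless case for simplicity."""
    rng = np.random.default_rng(rng)
    p = X.shape[0]

    # Measurements y_i = <gamma * I, X> = gamma * tr(X), eq. (14).
    y1 = np.full(n1, gamma * np.trace(X))
    T1 = np.mean(y1) / gamma                       # estimates tr(X) = ||lambda(X)||_1

    # Measurements y_i = <gamma * Z_i, X>, with Z_i having i.i.d. N(0,1) entries, eq. (15).
    y2 = np.array([gamma * np.sum(rng.normal(size=(p, p)) * X) for _ in range(n2)])
    T2_sq = np.mean(y2 ** 2) / gamma ** 2          # estimates ||X||_F^2

    return T1 ** 2 / T2_sq

# Example: a PSD matrix with 5 dominant eigenvalues among p = 200.
rng = np.random.default_rng(1)
p = 200
U = np.linalg.qr(rng.normal(size=(p, p)))[0]
eigs = np.r_[np.ones(5), 1e-3 * np.ones(p - 5)]
X = (U * eigs) @ U.T
print(round(np.trace(X) ** 2 / np.sum(X ** 2), 2))   # true r(X), close to 5
print(round(sketch_rank(X, rng=2), 2))
```

In the noiseless case the measurements (14) are all equal to γ tr(X), so only the Gaussian sketch contributes randomness; with bounded noise, the averaging in T̆_1 plays its intended role.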

Theorem 2. Let α ∈ (0, 1/2) and X ∈ S_+^{p×p} \ {0}. Assume that r̂(X) is constructed as above, and that the model (13) holds. Suppose also that n_1 = n_2 = n/2 and ζ_n(α, ϱ) < 1 for all n. Then as (n, p) → ∞, we have

    P( √(r̂(X)/r(X)) ∈ [ (1 − ϱ)/(1 + ζ_n), (1 + ϱ)/(1 − ζ_n) ] ) ≥ 1 − 2α + o(1).    (16)

Remarks. In parallel with Theorem 1, this confidence interval has the valuable property that its width does not depend on the rank or dimension of X, but merely on the noise-to-signal ratio ϱ = σ_0/(γ ||X||_F). The relative error |r̂(X)/r(X) − 1| is at most of order (n^{−1/2} + ϱ) with high probability when ζ_n is small.

4. Deterministic measurement matrices

The problem of constructing deterministic matrices A with good recovery properties (e.g. RIP-k or NSP-k) has been a longstanding topic within CS. Since our procedure in Section 3.2 selects A at random, it is natural to ask if randomization is essential to the estimation of unknown sparsity. In this section, we show that estimating s(x) with a deterministic matrix A leads to results that are inherently different from our randomized procedure.

At an informal level, the difference between random and deterministic matrices makes sense if we think of the estimation problem as a game between nature and a statistician. Namely, the statistician first chooses a matrix A ∈ R^{n×p} and an estimation rule δ : R^n → R. (The function δ takes y ∈ R^n as input and returns an estimate of s(x).) In turn, nature chooses a signal x ∈ R^p \ {0}, with the goal of maximizing the statistician's error. When the statistician chooses A deterministically, nature has the freedom to adversarially select an x that is ill-suited to the fixed matrix A. By contrast, if the statistician draws A at random, then nature does not know what value A will take, and therefore has less knowledge to choose a "bad" signal.

In the case of noiseless random measurements, Theorem 1 implies that our particular estimation rule ŝ(x) can achieve a relative error of order |ŝ(x)/s(x) − 1| = O(n^{−1/2}) with high probability for any non-zero x (cf. Remarks for Theorem 1). Our aim is now to show that for noiseless deterministic measurements, all estimation rules δ have a worst-case relative error |δ(Ax)/s(x) − 1| that is much larger than n^{−1/2}. In other words, there is always a choice of x that can defeat a deterministic procedure, whereas ŝ(x) is likely to succeed under any choice of x.

In stating the following result, we note that it involves no randomness whatsoever — since we assume that the observed measurements y = Ax are noiseless and obtained from a deterministic matrix A.

Theorem 3. The minimax relative error for estimating s(x) from noiseless deterministic measurements y = Ax satisfies

    inf_{A ∈ R^{n×p}}  inf_{δ : R^n → R}  sup_{x ∈ R^p \ {0}}  | δ(Ax)/s(x) − 1 |  ≥  (1 − (n+1)/p) / ( 2 (1 + 2√(2 log(2p)))^2 ).

Remarks. Under the typical high-dimensional scenario where there is some κ ∈ (0, ∞) for which p/n → κ as (n, p) → ∞, we have the lower bound |δ(Ax)/s(x) − 1| ≳ 1/log(n), which is much larger than n^{−1/2}.

5. Simulations

Relative error of ŝ(x). To validate the consequences of Theorem 1, we study how the relative error |ŝ(x)/s(x) − 1| depends on the parameters p, ρ, and s(x). We generated measurements y = Ax + ε under a broad range of parameter settings, with x ∈ R^{10^4} in most cases. Note that although p = 10^4 is a very large dimension, it is not at all extreme from the viewpoint of applications (e.g. a megapixel image with p = 10^6). Details regarding parameter settings are given below.

As anticipated by Theorem 1, the left and right panels in Figure 2 show that the relative error has no noticeable dependence on p or s(x). The middle panel shows that for fixed n_1 + n_2, the relative error grows moderately with ρ = σ_0/(γ ||x||_2). Lastly, our theoretical bounds on |ŝ(x)/s(x) − 1| conform to the empirical curves in the case of low noise (ρ = 10^{−2}).

Reconstruction of x based on ŝ(x). For the problem of choosing n, we considered the choice n̂ := 2⌈ŝ(x)⌉ log(p/⌈ŝ(x)⌉). The simulations show that n̂ adapts to the structure of the true signal, and is also sufficiently large for accurate reconstruction. First, to compute ŝ(x), we followed Section 3.2, and drew initial measurement sets of Cauchy and Gaussian vectors with n_1 = n_2 = 500 and γ = 1. If it happened to be the case that 500 ≥ n̂, then reconstruction was performed using only the initial 500 Gaussian measurements. Alternatively, if n̂ > 500, then (n̂ − 500) additional measurement vectors a_i were drawn from N(0, n̂^{−1} I_{p×p}) for reconstruction. Further details are given below. Figure 3 illustrates the results for three power-law signals in R^{10^4} with x_[i] ∝ i^{−ν}, ν = 0.7, 1.0, 1.3 and ||x||_1 = 1 (corresponding to s(x) = 823, 58, 11). In each panel, the coordinates of x are plotted in black, and those of x̂ are plotted in red. Clearly, there is good qualitative agreement in all cases. From left to right, the value of n̂ = 2⌈ŝ(x)⌉ log(p/⌈ŝ(x)⌉) was 4108, 590, and 150.
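A compact sketch of this two-stage pipeline (an illustration, not the authors' code; the Basis Pursuit solve itself is left to an external solver such as SPGL1, which the paper uses):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10**4
x = np.arange(1, p + 1) ** (-1.0)                 # nu = 1.0, so s(x) is about 58
x /= np.abs(x).sum()                              # normalize ||x||_1 = 1 as in the text

# Stage 1: sketch s(x) from 500 Cauchy and 500 Gaussian measurements (gamma = 1).
y1 = rng.standard_cauchy(size=(500, p)) @ x
y2 = rng.normal(size=(500, p)) @ x
s_hat = np.median(np.abs(y1)) ** 2 / np.mean(y2 ** 2)

# Adaptive choice of n: n_hat = 2 * ceil(s_hat) * log(p / ceil(s_hat)).
c = np.ceil(s_hat)
n_hat = int(np.ceil(2 * c * np.log(p / c)))       # comparable to the 590 reported above

# Stage 2: Gaussian measurements with entries of variance 1/n_hat; a Basis Pursuit
# solver would then be applied to (A, y) to reconstruct x.
A = rng.normal(size=(n_hat, p)) / np.sqrt(n_hat)
y = A @ x
print(round(float(s_hat), 1), n_hat)
```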

Settings for relative error of ŝ(x) (Figure 2). For each parameter setting labeled in the figures, we let n_1 = n_2 and averaged |ŝ(x)/s(x) − 1| over 200 problem instances of y = Ax + ε. In all cases, the matrix A was chosen according to (7) and (9) with γ = 1 and ε_i ∼ Uniform[−σ_0, σ_0]. We always chose the normalization ||x||_2 = 1, and γ = 1 so that ρ = σ_0. (In the left and right panels, ρ = 10^{−2}.) For the left and middle panels, all signals have the decay profile x_[i] ∝ i^{−1}. For the right panel, the values s(x) = 2, 58, 4028, and 9878 were obtained using decay profiles x_i ∝ i^{−ν} with ν = 2, 1, 1/2, 1/10. In the left and right panels, we chose p = 10^4 for all curves. A theoretical bound on |ŝ(x)/s(x) − 1| in black was computed from Theorem 1 with α = 1/2 − 1/(2√2). (This bound holds with probability at least 1/2, and hence may be reasonably plotted against the average of |ŝ(x)/s(x) − 1|.)

Settings for reconstruction (Figure 3). We computed reconstructions using the SPGL1 solver (van den Berg & Friedlander, 2007; 2008) for the BP problem x̂ ∈ argmin{ ||v||_1 : ||Av − y||_2 ≤ ε_0, v ∈ R^p }, with the choice ε_0 = σ_0 √n̂ being based on i.i.d. noises ε_i ∼ Uniform[−σ_0, σ_0] and σ_0 = 0.001. When re-using the first 500 Gaussian measurements, we re-scaled the vectors a_i and the values y_i by 1/√n̂ so that the matrix A ∈ R^{n̂×p} would be expected to satisfy the RIP property. For each choice of ν = 0.7, 1.0, 1.3, we generated 25 problem instances and plotted the vector x̂ corresponding to the median of ||x̂ − x||_2 over the 25 runs (so that the plots reflect typical performance).

Acknowledgements

MEL thanks the reviewers for constructive comments, as well as valuable references that substantially improved the paper. Peter Bickel is thanked for helpful discussions, and the DOE CSGF fellowship is gratefully acknowledged for support under grant DE-FG02-97ER25308.

[Figure 2 — three panels titled "error dependence on p", "error dependence on ρ", and "error dependence on s(x)". Each panel plots the averaged relative error of ŝ(x) against the total number of measurements n_1 + n_2 (200 to 1000), together with the theoretical bound. Left panel: curves for p = 10, 10^2, 10^3, 10^4 (all curves: ν = 1, ρ = 10^{−2}). Middle panel: curves for ρ = 0, 10^{−2}, 1, 2, 5 (all curves: p = 10^4, ν = 1). Right panel: curves for s(x) = 2, 58, 4028, 9878 (all curves: p = 10^4, ρ = 10^{−2}).]

Figure 2. Performance of ŝ(x) as a function of p, ρ, s(x), and number of measurements.

[Figure 3 — three panels titled "reconstruction with ν = 0.7", "reconstruction with ν = 1.0", and "reconstruction with ν = 1.3", each plotting coordinate value against coordinate index (0 to 10^4).]

Figure 3. Signal recovery after choosing n based on ŝ(x). True signal x in black, and x̂ in red.

References

Arias-Castro, E., Candès, E. J., and Davenport, M. On the fundamental limits of adaptive sensing. arXiv preprint arXiv:1111.4646, 2011.
Cai, T. T., Wang, L., and Xu, G. New bounds for restricted isometry constants. IEEE Transactions on Information Theory, 56(9), 2010.
Candès, E. J. Compressive sampling. In Proceedings of the International Congress of Mathematicians: Madrid, August 22-30, 2006: invited lectures, pp. 1433–1452, 2006.
Candès, E. J. and Plan, Y. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.
Candès, E. J., Romberg, J. K., and Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The convex algebraic geometry of linear inverse problems. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 699–703. IEEE, 2010.
d'Aspremont, A. and El Ghaoui, L. Testing the nullspace property using semidefinite programming. Mathematical Programming, 127(1):123–144, 2011.
Davenport, M. A., Duarte, M. F., Eldar, Y. C., and Kutyniok, G. Introduction to compressed sensing. Preprint, 93, 2011.
Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
Eldar, Y. C. Generalized SURE for exponential families: Applications to regularization. IEEE Transactions on Signal Processing, 57(2):471–481, 2009.
Fama, E. F. and Roll, R. Parameter estimates for symmetric stable distributions. Journal of the American Statistical Association, 66(334):331–338, 1971.
Hoyer, P. O. Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research, 5:1457–1469, 2004.
Hurley, N. and Rickard, S. Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10):4723–4741, 2009.
Indyk, P. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307–323, 2006.
Juditsky, A. and Nemirovski, A. On verifiable sufficient conditions for sparse signal recovery via ℓ1 minimization. Mathematical Programming, 127(1):57–88, 2011.
Li, P., Hastie, T., and Church, K. Nonlinear estimators and tail bounds for dimension reduction in ℓ1 using Cauchy random projections. Journal of Machine Learning Research, pp. 2497–2532, 2007.
Lopes, M. E., Jacob, L., and Wainwright, M. J. A more powerful two-sample test in high dimensions using random projection. In NIPS 24, pp. 1206–1214, 2011.
Malioutov, D. M., Sanghavi, S., and Willsky, A. S. Compressed sensing with sequential observations. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 3357–3360. IEEE, 2008.
Rigamonti, R., Brown, M. A., and Lepetit, V. Are sparse representations really relevant for image classification? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1545–1552. IEEE, 2011.
Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM (JACM), 54(4):21, 2007.
Shi, Q., Eriksson, A., van den Hengel, A., and Shen, C. Is face recognition really a compressive sensing problem? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 553–560. IEEE, 2011.
Tang, G. and Nehorai, A. The stability of low-rank matrix reconstruction: a constrained singular value view. arXiv:1006.4088, submitted to IEEE Transactions on Information Theory, 2010.
Tang, G. and Nehorai, A. Performance analysis of sparse recovery based on constrained minimal singular values. IEEE Transactions on Signal Processing, 59(12):5734–5745, 2011.
van den Berg, E. and Friedlander, M. P. SPGL1: A solver for large-scale sparse reconstruction, June 2007. http://www.cs.ubc.ca/labs/scl/spgl1.
van den Berg, E. and Friedlander, M. P. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008. doi: 10.1137/080714488. URL http://link.aip.org/link/?SCE/31/890.
Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
Ward, R. Compressed sensing with cross validation. IEEE Transactions on Information Theory, 55(12):5773–5782, 2009.
Zolotarev, V. M. One-dimensional stable distributions, volume 65. American Mathematical Society, 1986.