arXiv:1702.06781v2 [cs.IT] 28 Feb 2020

GELFAND NUMBERS RELATED TO STRUCTURED SPARSITY AND BESOV SPACE EMBEDDINGS WITH SMALL MIXED SMOOTHNESS

SJOERD DIRKSEN AND TINO ULLRICH

Abstract. We consider the problem of determining the asymptotic order of the Gelfand numbers of mixed-(quasi-)norm embeddings $\ell_p^b(\ell_q^d) \hookrightarrow \ell_r^b(\ell_u^d)$ given that $p \leq r$ and $q \leq u$, with emphasis on cases with $p \leq 1$ and/or $q \leq 1$. These cases turn out to be related to structured sparsity. We obtain sharp bounds in a number of interesting parameter constellations. Our new matching bounds for the Gelfand numbers of the embeddings of $\ell_1^b(\ell_2^d)$ and $\ell_2^b(\ell_1^d)$ into $\ell_2^b(\ell_2^d)$ imply optimality assertions for the recovery of block-sparse and sparse-in-levels vectors, respectively. In addition, we apply our sharp estimates for $\ell_p^b(\ell_q^d)$-spaces to obtain new two-sided estimates for the Gelfand numbers of multivariate Besov space embeddings in regimes of small mixed smoothness. It turns out that in some particular cases these estimates show the same asymptotic behavior as in the univariate situation. In the remaining cases they differ at most by a log log factor from the univariate bound.

Key words and phrases. Gelfand numbers, $\ell_p^b(\ell_q^d)$-spaces, compressed sensing, block sparsity, sparsity-in-levels, Besov spaces with small mixed smoothness.

1. Introduction

Gelfand numbers are fundamental concepts in approximation theory and Banach space geometry. From a geometric point of view, the Gelfand numbers of an embedding between two (quasi-)normed spaces $X$ and $Y$ quantify the compactness of the embedding $X \hookrightarrow Y$. From the perspective of approximation theory and signal processing, Gelfand numbers of embeddings naturally appear when considering the problem of the optimal recovery of an element $f \in X$ from few arbitrary linear samples. For instance, $f$ may be a function in a certain function space and the linear samples may consist of Fourier or wavelet coefficients of $f$ or simple function values. The $m$-th minimal worst-case recovery error

(1) $E_m(X, Y) := \inf_{S_m} \sup_{\|f\|_X \leq 1} \|f - S_m(f)\|_Y$,

where the infimum is taken over all maps of the form $S_m = \varphi_m \circ N_m : X \to Y$ with $N_m : X \to \mathbb{R}^m$ a linear and continuous measurement map and $\varphi_m : \mathbb{R}^m \to Y$ an arbitrary (not necessarily linear) recovery map, is under mild assumptions equivalent to the $m+1$-th Gelfand number

$c_{m+1}(\mathrm{Id} : X \to Y) := \inf \big\{ \sup_{x \in B_X \cap M} \|x\|_Y : M \hookrightarrow X \text{ closed subspace with } \mathrm{codim}(M) \leq m \big\}$.

More details on this connection can be found in Proposition 2.2 and the discussion in Remark 2.3 below.

The aim of this paper is to understand the behavior of the Gelfand numbers of embeddings between $\ell_p^b(\ell_q^d)$-spaces, which are equipped with the mixed (quasi-)norm given by

(2) $\|x\|_{\ell_p^b(\ell_q^d)} := \Big( \sum_{i=1}^{b} \Big( \sum_{j=1}^{d} |x_{ij}|^q \Big)^{p/q} \Big)^{1/p}$.

We put emphasis on cases where either $0 < p \leq 1$ or $0 < q \leq 1$, which in particular means that we are confronted with parameter constellations where $\|\cdot\|_{\ell_p^b(\ell_q^d)}$ is only a quasi-norm and the corresponding unit ball is non-convex. These cases turn out to be related to structured sparsity. The study of Gelfand numbers of embeddings $\mathrm{Id} : \ell_p^b(\ell_q^d) \to \ell_r^b(\ell_u^d)$, which we conduct in this paper, yields general performance bounds for the optimal recovery of signals with structured sparsity, see Subsection 1.1. On the other hand, the embeddings $\mathrm{Id} : \ell_p^b(\ell_2^d) \to \ell_2^b(\ell_2^d)$ and $\mathrm{Id} : \ell_2^b(\ell_p^d) \to \ell_2^b(\ell_2^d)$ serve as models for (approximately) block-sparse and sparse-in-levels signals, respectively, see Subsection 2.2 for further details. Moreover, $\ell_p^b(\ell_q^d)$-balls with $0 < p, q < 1$ appear naturally in wavelet
characterizations of multivariate Besov spaces with dominating mixed smoothness. The study of embeddings between quasi-normed $\ell_p^b(\ell_q^d)$-spaces in this paper leads to new asymptotic bounds for Gelfand numbers of embeddings between the aforementioned Besov spaces in regimes of small smoothness, see Subsection 1.2. Interestingly, in some particular cases these estimates show the same asymptotic behavior as in the univariate situation. In the remaining cases they differ at most by a log log factor from the univariate bound.

1.1. Gelfand numbers and sparse recovery. We obtain matching bounds for the Gelfand numbers

$c_m(\mathrm{id} : \ell_p^b(\ell_q^d) \to \ell_r^b(\ell_u^d))$

for a number of interesting parameter ranges in this paper. The results in full detail can be found in Section 6. At this point, we wish to discuss only the most illustrative cases and highlight some interesting effects. In particular, we prove for $1 \leq m \leq bd$ and $0 <$
Then, there exist constants $c_1, c_2 > 0$ depending only on $C$ so that if $s > c_2$, then

$m \geq c_1 \big( s \log(eb/s) + sd \big)$.

On the other hand, if for all $x \in \mathbb{R}^{bd}$,

(8) $\|x - \Delta(Ax)\|_{\ell_2^b(\ell_2^d)} \leq D\, \frac{\sigma_t^{\mathrm{inner}}(x)_{\ell_2^b(\ell_1^d)}}{\sqrt{t}}$,

then there exist constants $c_1, c_2 > 0$ depending only on $D$ such that

$m \geq c_1\, bt \log(ed/t)$,

provided that $t > c_2$.
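To make the quantities appearing in (7) and (8) concrete, the following minimal numerical sketch computes the mixed (quasi-)norm from (2) and the best outer-/inner-sparse approximation errors for a vector identified with a $b \times d$ matrix. The function names are ours, chosen for illustration only, and we assume $0 < p, q < \infty$.

```python
import numpy as np

def mixed_norm(x, p, q):
    """Mixed (quasi-)norm ||x||_{l_p^b(l_q^d)} of a b x d matrix x, cf. (2):
    an inner l_q norm within each row, then an outer l_p norm over the rows."""
    inner = np.sum(np.abs(x) ** q, axis=1) ** (1.0 / q)
    return np.sum(inner ** p) ** (1.0 / p)

def sigma_outer(x, s, p, q):
    """Best s-term outer-sparse approximation error: keep the s blocks
    (rows) with largest inner l_q norm, measure the remainder in (2)."""
    inner = np.sum(np.abs(x) ** q, axis=1) ** (1.0 / q)
    keep = np.argsort(inner)[-s:]
    z = np.zeros_like(x)
    z[keep] = x[keep]
    return mixed_norm(x - z, p, q)

def sigma_inner(x, t, p, q):
    """Best t-term inner-sparse approximation error: within each block,
    keep the t entries of largest modulus, measure the remainder in (2)."""
    z = np.zeros_like(x)
    for i in range(x.shape[0]):
        keep = np.argsort(np.abs(x[i]))[-t:]
        z[i, keep] = x[i, keep]
    return mixed_norm(x - z, p, q)
```

Both error computations are exact: dropping the rows with largest inner norm minimizes the outer $\ell_p$ norm of the remainder, and the inner problem decouples row by row.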
To discuss the consequences of this result for structured sparse recovery, let us recall the following known results. We refer to Section 2.2 for a more extensive discussion of the literature. Let $B$ be an $m \times bd$ Gaussian matrix, i.e., a matrix with i.i.d. standard Gaussian entries, and set $A = (1/\sqrt{m})B$. For any $0 < p, q \leq \infty$, let

$\Delta_{\ell_p(\ell_q)}(Ax) = \operatorname*{argmin}_{z \in \mathbb{R}^{bd} :\, Az = Ax} \|z\|_{\ell_p^b(\ell_q^d)}$

be the reconstruction via $\ell_p(\ell_q)$-minimization. It is known that if the number of measurements $m$ satisfies
$m \gtrsim s \log(eb/s) + sd$,

then with high probability the pair $(A, \Delta_{\ell_1(\ell_2)})$ satisfies the stable reconstruction property in (7) (see [19] and also the discussion in [3, Section 2.1]). Moreover, this number of measurements cannot be reduced: [3, Theorem 2.4] shows that if for a given $A \in \mathbb{R}^{m \times bd}$ one can recover every $s$-outer sparse vector $x$ exactly from $y = Ax$ via $\Delta_{\ell_1(\ell_2)}$ then necessarily $m \gtrsim s \log(eb/s) + sd$. However, one could still hope to further reduce the required number of measurements for outer sparse recovery by picking a better pair $(A, \Delta)$. Corollary 1.2 shows that this is not possible if we want the same stable reconstruction
guarantee as in (7). In this sense, the pair $(A, \Delta_{\ell_1(\ell_2)})$ with $A = (1/\sqrt{m})B$ is optimal.
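Solving the $\ell_1(\ell_2)$-minimization program above requires a convex solver. Purely as an illustration of outer-sparse recovery from Gaussian measurements, the following sketch instead uses a simple block iterative hard thresholding loop; this greedy stand-in is our choice for the example and is not the reconstruction map analyzed in the text.

```python
import numpy as np

def hard_threshold_blocks(x, s):
    """Keep the s rows (blocks) of largest Euclidean norm, zero the rest."""
    keep = np.argsort(np.linalg.norm(x, axis=1))[-s:]
    z = np.zeros_like(x)
    z[keep] = x[keep]
    return z

def block_iht(A, y, b, d, s, iters=2000):
    """Block iterative hard thresholding for s-outer sparse recovery.
    A acts on vectorized b*d signals; step size 1/||A||_2^2."""
    mu = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros((b, d))
    for _ in range(iters):
        grad = (A.T @ (y - A @ x.ravel())).reshape(b, d)
        x = hard_threshold_blocks(x + mu * grad, s)
    return x

rng = np.random.default_rng(0)
b, d, s, m = 8, 4, 1, 24
x_true = np.zeros((b, d))
x_true[3] = rng.standard_normal(d)            # one non-zero block
A = rng.standard_normal((m, b * d)) / np.sqrt(m)  # A = (1/sqrt(m)) B
y = A @ x_true.ravel()
x_hat = block_iht(A, y, b, d, s)
```

With $m = 24 < bd = 32$ Gaussian measurements the $1$-outer sparse signal is recovered, in line with the measurement count $s \log(eb/s) + sd$ discussed above.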
It is worthwhile to note that the pair $(A, \Delta_{\ell_1(\ell_1)})$, $A = (1/\sqrt{m})B$, with high probability satisfies the stable reconstruction guarantee ([10, Theorem 1.2 and Section 1.3], see also [11, 14] and [23, Theorem 9.13])
(9) $\|x - \Delta_{\ell_1(\ell_1)}(Ax)\|_{\ell_2^b(\ell_2^d)} \leq D\, \frac{\sigma_k(x)_{\ell_1^b(\ell_1^d)}}{\sqrt{k}}$,

whenever $m \gtrsim k \log(ebd/k)$, where
$\sigma_k(x)_{\ell_1^b(\ell_1^d)} = \inf \big\{ \|x - z\|_{\ell_1^b(\ell_1^d)} : z \text{ is } k\text{-sparse} \big\}$.

As a consequence (by taking $k = sd$), (7) is satisfied under the condition $m \gtrsim sd \log(eb/s)$. In conclusion, if we are only interested in stability of our reconstruction method with respect to outer sparsity (instead of sparsity) defects, then we can achieve this with fewer measurements (in a best worst-case scenario) by using
$\Delta_{\ell_1(\ell_2)}$ as a reconstruction map instead of $\Delta_{\ell_1(\ell_1)}$. In the case of reconstruction of inner-sparse vectors, one might expect that a similar phenomenon occurs: if we are only interested in stability with respect to inner sparsity (instead of sparsity) defects, then we might be able to reconstruct from fewer measurements by using a reconstruction map that is especially suited for inner sparse recovery (a potential candidate would be
$\Delta_{\ell_2(\ell_1)}$, in analogy with the outer sparse case). Somewhat surprisingly, Corollary 1.2 shows that this is not the case. By taking $k = bt$ in (9) and using that every $t$-inner sparse vector is $bt$-sparse (but not vice versa),
one deduces that the pair $(A, \Delta_{\ell_1(\ell_1)})$, $A = (1/\sqrt{m})B$, with high probability satisfies (8) if $m \gtrsim bt \log(ed/t)$. This is optimal by Corollary 1.2.

1.2. Besov space embeddings with small mixed smoothness. As a second application of our work, we are interested in proving upper and lower bounds for the Gelfand numbers of compact embeddings between Besov spaces

(10) $\mathrm{Id} : S^{r_0}_{p_0,q_0}B([0,1]^d) \to S^{r_1}_{p_1,q_1}B([0,1]^d)$,

where $0 < p_0 \leq p_1 \leq \infty$, $0 < q_0, q_1 \leq \infty$ and $r_0 - r_1 > 1/p_0 - 1/p_1$. By using a well-known hyperbolic wavelet discretization machinery, see for instance [63], we can transfer the problem to bounding the Gelfand numbers of

$\mathrm{id} : s^{r_0}_{p_0,q_0}b([0,1]^d) \to s^{r_1}_{p_1,q_1}b([0,1]^d)$,

where (see (46) below for details)

$s^{r}_{p,q}b := \Big\{ \lambda = \{\lambda_{\bar{j},\bar{k}}\}_{\bar{j},\bar{k}} \subset \mathbb{C} : \|\lambda\|_{s^{r}_{p,q}b} := \Big[ \sum_{\bar{j} \in \mathbb{N}_0^d} 2^{(r - 1/p) q \|\bar{j}\|_1} \Big( \sum_{\bar{k} \in A_{\bar{j}}} |\lambda_{\bar{j},\bar{k}}|^p \Big)^{q/p} \Big]^{1/q} < \infty \Big\}$

and $A_{\bar{j}} = ([0, 2^{j_1} - 1] \times \cdots \times [0, 2^{j_d} - 1]) \cap \mathbb{Z}^d$. We immediately observe that $s^{r}_{p,q}b$ has an $\ell_q(\ell_p)$-structure. After a suitable block decomposition we use the subadditivity of Gelfand numbers (see (S2) below) to reduce the problem to finite-dimensional blocks. This strategy has for instance been pursued in the recent paper [46], where the case $p_0 = q_0 > 1$ has been treated. In our work, we are interested in the case $\min\{p_0, q_0\} \leq 1$, $\max\{p_1, q_1\} \leq 2$, which is connected to Theorem 6.1 below. As it turns out, there is a strong qualitative difference in the behavior of the Gelfand numbers depending on whether one is in the "large mixed smoothness" ($r_0 - r_1 > 1/q_0 - 1/q_1$) or "small mixed smoothness" ($r_0 - r_1 \leq 1/q_0 - 1/q_1$) regime. In the large smoothness regime, sharp bounds for the Gelfand numbers can be readily derived: using almost literally the arguments in [63, Thms. 3.18, 3.19] together with the well-known sharp estimates for the Gelfand numbers of the (non-mixed) embeddings $\mathrm{id} : \ell^n_u \to \ell^n_v$, where $0 < u \leq v \leq \infty$. In the small mixed smoothness regime we prove the following: with $r := r_0 - r_1$, there exist constants $c, C > 0$ depending only on $p_0, p_1, q_0, q_1, d$ and $r$,
(13) $c\, m^{-r} \leq c_m\big(\mathrm{Id} : S^{r_0}_{p_0,q_0}B([0,1]^d) \to S^{r_1}_{p_1,q_1}B([0,1]^d)\big) \leq C\, m^{-r} (\log \log m)^{1/q_0 - 1/q_1}$.
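Numerically, the $(\log\log m)^{1/q_0 - 1/q_1}$ factor in the upper bound grows extremely slowly; a quick arithmetic sketch (with the hypothetical sample value $1/q_0 - 1/q_1 = 3/2$, chosen only for illustration) makes this concrete:

```python
import math

# Illustrative only: the exponent 3/2 is a hypothetical value of 1/q0 - 1/q1.
for m in (10**3, 10**6, 10**9):
    factor = math.log(math.log(m)) ** 1.5
    print(f"m = {m:>12}: (log log m)^(3/2) = {factor:.2f}")
```

Even at $m = 10^9$ the factor stays in the single digits, which is why the multivariate bounds differ so little from the univariate rate.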
By taking $p_1 = q_1$ and $q_0$ ... in the eighties. We refer to [64, Section 4] for a summary of these classical results and historical references. Let us only specifically mention the classical bound

(14) $c_m(\mathrm{id} : \ell_1^b(\ell_1^d) \to \ell_2^b(\ell_2^d)) \simeq \min\Big\{ 1, \Big( \frac{\log(ebd/m)}{m} \Big)^{1/2} \Big\}$,

which was obtained by Garnaev and Gluskin [26, 28] following important earlier advances of Kashin [36]. Note that one recovers this result from (4) by taking '$b = bd$', $d = 1$. Some of the classical estimates were extended to the quasi-Banach range $p < 1$, see [22, 64], where it was shown that for $0 < p \leq 1$ ... If $r > 1 - 1/q$, then

$c_m\big(\mathrm{Id} : S^r_{p,\infty}B(\mathbb{T}^d) \to L_q(\mathbb{T}^d)\big) \simeq_{p,q,r,d} \begin{cases} \big( \frac{\log^{d-1} m}{m} \big)^{r - \frac{1}{2} + \frac{1}{q}} (\log m)^{\frac{d-1}{q}} & : \ \frac{1}{p} + \frac{1}{q} < 1,\ p \leq 2, \\[4pt] \big( \frac{\log^{d-1} m}{m} \big)^{r - \frac{1}{p} + \frac{1}{q}} (\log m)^{\frac{d-1}{q}} & : \ 2 \leq p < q. \end{cases}$

The lower bound in Theorem 1.3 for the (larger) approximation numbers (linear widths) was proved recently by Malykhin and Ryutin. It closes a log log-gap in a result of Galeev [25] from 1996. Let us also mention the results from [35, Thm. 3] and the remark following it. Let $m < bd/2$. First,

$c_m(\mathrm{id} : \ell_\infty^b(\ell_2^d) \to \ell_\infty^b(\ell_\infty^d)) \simeq 1$;

Second, for $1 \leq p \leq 2$,

$c_m(\mathrm{id} : \ell_\infty^b(\ell_2^d) \to \ell_p^b(\ell_p^d)) \simeq_p b^{1/p} d^{1/p - 1/2}$.

Third, if $2 \leq p < \infty$ and in addition there exists a constant $\gamma > 0$ such that $d \geq \gamma \log b$, then

$\frac{b^{1/p}}{(\log b)^{1 - 1/p}} \lesssim_{\gamma,p} c_m(\mathrm{id} : \ell_\infty^b(\ell_2^d) \to \ell_p^b(\ell_p^d)) \lesssim b^{1/p}$.

Finally, we state a result of Vasileva from [62]. For $1 \leq a \leq 2$ and $1 \leq b \leq \infty$ we define

$\lambda(a,b) = \min\Big\{ \frac{\frac{1}{a} - \frac{1}{b}}{\frac{1}{a} - \frac{1}{2}},\, 1 \Big\}$

if $a \neq 2$ and $\lambda(a,b) = 1$ if $a = 2$. Note that if $a < 2$, then $\lambda(a,b) < 1$ if and only if $b < 2$.

Theorem 1.4 ([62]). Let $1 < p \leq 2$, $1 < q \leq 2$, $p \leq r \leq \infty$, and $q \leq u \leq \infty$.
Then there exists an $a = a(p,q) > 0$ such that for all $m, b, d$ with $m \leq a\, bd$,

$c_m(\mathrm{id} : \ell_p^b(\ell_q^d) \to \ell_r^b(\ell_u^d)) \simeq_{p,q,r,u} \Phi_0(m,b,d,p,q,r,u)$

provided that one of the following conditions holds:

(i) $u \geq 2$, $r \geq 2$, and $\Phi_0 = \min\{1,\, m^{-1/2} d^{1-1/q} b^{1-1/p}\}$;

(ii) $\lambda(q,u) \leq \lambda(p,r)$, $\lambda(q,u) < 1$, and $\Phi_0 = \min\big\{1,\, (m^{-1/2} d^{1-1/q} b^{1-1/p})^{\lambda(q,u)},\, d^{1/u - 1/q} (m^{-1/2} d^{1/2} b^{1-1/p})^{\lambda(p,r)}\big\}$;

(iii) $\lambda(q,u) \geq \lambda(p,r)$, $\lambda(p,r) < 1$, and $\Phi_0 = \min\big\{1,\, (m^{-1/2} d^{1-1/q} b^{1-1/p})^{\lambda(p,r)},\, b^{1/r - 1/p} (m^{-1/2} d^{1-1/q} b^{1/2})^{\lambda(q,u)}\big\}$.

1.4. Outline. After fixing notational conventions we introduce several basic concepts, such as Gelfand numbers and Gelfand widths and generalized notions of sparsity and compressibility, in Section 2. In Section 3 we set up some machinery to prove upper bounds for Gelfand widths of bounded subsets in $(\mathbb{R}^n, \|\cdot\|_Y)$ via random subspaces based on Gordon's "escape through the mesh" theorem. For the lower bounds we establish a connection to packing numbers of structured sparse vectors and give lower bounds for these packing numbers, see Sections 4 and 5. Section 6 establishes matching bounds for Gelfand numbers for embeddings between $\ell_p^b(\ell_q^d)$-spaces in several interesting parameter ranges. These results are applied in Section 7, where we prove Corollary 1.2 and establish new results for the Gelfand numbers of Besov space embeddings with small mixed smoothness.

1.5. Notation. As usual $\mathbb{N}$ denotes the natural numbers, $\mathbb{N}_0 := \mathbb{N} \cup \{0\}$, $\mathbb{Z}$ denotes the integers, $\mathbb{R}$ the real numbers, and $\mathbb{C}$ the complex numbers. For $a \in \mathbb{R}$ we denote $a_+ := \max\{a, 0\}$. We write $\log$ for the natural logarithm. $\mathbb{R}^{m \times n}$ denotes the set of all $m \times n$-matrices with real entries and $\mathbb{R}^n$ denotes the Euclidean space. Vectors are usually denoted with $x, y$, sometimes we use $\bar{j}, \bar{k} \in \mathbb{N}_0^d$ for multi-indices. For $0 < p \leq \infty$ and $x \in \mathbb{R}^n$ we denote $\|x\|_p := (\sum_{i=1}^n |x_i|^p)^{1/p}$ with the usual modification in the case $p = \infty$.
If $X$ is a (quasi-)normed space, then $B_X$ denotes its unit ball and the (quasi-)norm of an element $x$ in $X$ is denoted by $\|x\|_X$. For any $U \subset X$ we let $r_X(U) = \sup_{x \in U} \|x\|_X$ denote the radius of $U$. In case $X$ is a Banach space, $X^*$ denotes its dual. We will frequently use the quasi-norm constant, i.e., the smallest constant $\alpha_X$ satisfying

$\|x + y\|_X \leq \alpha_X (\|x\|_X + \|y\|_X)$, for all $x, y \in X$.

For given sequences $(a_n)$ and $(b_n)$ of non-negative numbers, we write $a_n \lesssim b_n$ if there exists a constant $c > 0$ such that $a_n \leq c\, b_n$ for all $n$. We write $a_n \simeq b_n$ if both $a_n \lesssim b_n$ and $b_n \lesssim a_n$. If $\alpha$ is a set of parameters, then we write $a_n \lesssim_\alpha b_n$ if there exists a constant $c_\alpha > 0$ depending only on $\alpha$ such that $a_n \leq c_\alpha b_n$ for all $n$.

Let $b, d \in \mathbb{N}$. The $bd$-dimensional mixed space $\ell_p^b(\ell_q^d)$ is defined as the space of all $bd$-dimensional block vectors $(x_i)_{i=1}^b \in \mathbb{R}^{bd}$, where $x_i \in \mathbb{R}^d$ for all $1 \leq i \leq b$, equipped with the mixed (quasi-)norm defined in (2). Occasionally we will identify $x \in \mathbb{R}^{bd}$ with the $b \times d$ matrix whose rows are the vectors $x_i$. We always refer to the $\ell_p$-space supported on $[b] := \{1, \ldots, b\}$ as the outer space and to the $\ell_q$-space supported on $[d]$ as the inner space. For any $S \subset [b] \times [d]$ and $x \in \mathbb{R}^{bd}$ we define $x_S$ to be the block vector with entries $(x_S)_{ij} = x_{ij}$ for $(i,j) \in S$, $(x_S)_{ij} = 0$ for $(i,j) \in S^c$.

2. Preliminaries

2.1. Gelfand numbers and worst-case errors. We recall the definition of Gelfand numbers and the mild conditions under which Gelfand numbers can be considered as worst-case approximation/recovery errors. Let $X, Y$ be (quasi-)Banach spaces and $T : X \to Y$ be a continuous map.

Definition 2.1. Let $X, Y$ be two quasi-Banach spaces and $T \in \mathcal{L}(X, Y)$. For $m \in \mathbb{N}$ the $m$-th Gelfand number is defined as

$c_m(T) := \inf \big\{ \|T J^X_M\| : M \hookrightarrow X \text{ closed subspace with } \mathrm{codim}(M) < m \big\}$.

Remark 2.3. (i) For a bounded subset $K$ of a (quasi-)Banach space $Y$, the $m$-th Gelfand width of $K$ in $Y$ is given by

(17) $c_m(K, Y) := \inf_{M \hookrightarrow Y,\ \mathrm{codim}(M) \leq m} \sup \big\{ \|f\|_Y : f \in K \cap M \big\}$.

The terms Gelfand numbers and Gelfand widths are often used interchangeably in the literature although they are defined in a fundamentally different way.
In particular, it is important to note that in the definition of the Gelfand numbers $c_m(\mathrm{id} : X \to Y)$ we take an infimum over subspaces $M$ of $X$, whereas in the definition of Gelfand widths $c_m(B_X, Y)$ we take an infimum over subspaces $M$ of $Y$. In particular, if $X$ and $Y$ are quasi-Banach spaces then it is not clear whether the above Gelfand numbers and Gelfand widths are equivalent. Only recently, some sufficient conditions have been determined under which $c_m(\mathrm{id} : X \to Y)$ and $c_m(B_X, Y)$ are equivalent, see [16]. Obviously, if $X$ and $Y$ are both $n$-dimensional, then the Gelfand numbers and Gelfand widths coincide.

(ii) A relation between Gelfand widths and a corresponding notion of minimal worst case errors, which are nowadays sometimes called compressive $m$-widths [23, Def. 10.2, Thm. 10.4], was first given in [59, Thm. 5.4.1]. These associated minimal worst-case errors are defined, for a given bounded set $K$ in a (quasi-)Banach space $Y$, as

(18) $E_m(K, Y) := \inf_{S_m} \sup_{f \in K} \|f - S_m(f)\|_Y$,

where the infimum is taken over all maps $S_m = \varphi_m \circ N_m$, with $N_m : Y \to \mathbb{R}^m$ any linear and continuous measurement map and $\varphi_m : \mathbb{R}^m \to Y$ an arbitrary recovery map. Note that the worst-case reconstruction error $E_m(X, Y)$ in (1) and $E_m(B_X, Y)$ in (18) are defined in a subtly different manner: in the first case we consider measurement maps $N_m : X \to \mathbb{R}^m$, whereas in the second case we consider maps $N_m : Y \to \mathbb{R}^m$. The compressive $m$-width and the $m$-th Gelfand width are equivalent under mild conditions [23, Thm. 10.4]. In particular, if $\|\cdot\|_X$ and $\|\cdot\|_Y$ are quasi-norms on $\mathbb{R}^n$, then

(19) $\alpha_Y^{-1} c_m(B_X, Y) \leq E_m(B_X, Y) \leq 2 \alpha_X c_m(B_X, Y)$.

In conclusion, the $m$-th Gelfand number $c_m(\mathrm{id} : X \to Y)$, the $m$-th Gelfand width $c_m(B_X, Y)$ and their associated minimal worst-case errors $E_m(X, Y)$ and $E_m(B_X, Y)$, respectively, can in general be four non-equivalent quantities.
However, if $X$ and $Y$ are finite-dimensional quasi-normed spaces of the same dimension, then these quantities are equivalent up to constants depending only on the quasi-norm constants of $X$ and $Y$. In particular, if $X = \ell_p^b(\ell_q^d)$ and $Y = \ell_r^b(\ell_u^d)$, then the constants in the equivalences depend only on $p, q, r$ and $u$. We will use these facts heavily in the sections that follow.

2.2. Generalized notions of sparsity and compressibility. We fix some terminology concerning structured sparsity. Consider a block vector $x = (x_i)_{i=1}^b$ with $x_i \in \mathbb{R}^{d_i}$. As usual, we say that $x$ is $s$-sparse if it has at most $s$ nonzero entries. We say $x$ is $s$-block sparse if at most $s$ blocks of $x$ are non-zero. This type of sparsity occurs frequently in the signal processing literature (see e.g. [3, 17, 18, 19]). Block sparsity is strongly related to (and can in fact be viewed as a generalization of) the joint sparsity model, where one considers a signal consisting of several 'channels' (for instance, the three color channels of an RGB image) and assumes that nonzero coefficients appear at the same location within each of the channels (see e.g. [19, 20, 30] and [3, Section 1.2]). Block sparsity also plays an important role in high-dimensional statistics (see e.g. [8, Chapters 4 and 8], [65, Sections 2.1 and 3.2.1] and the references therein) and appears in machine learning, for instance in multi-task learning and multiple kernel learning (see e.g. [4, Sections 1.3 and 1.5] and the references therein). In this literature, it is sometimes of interest to consider group sparsity, a generalization of the block sparsity model where the signal consists of a small number of groups which are allowed to overlap. There is a well-developed theory available on sufficient conditions for recovery of block and, more generally, group sparse vectors via the group Lasso, see e.g. [8, 33, 41, 45]. As a third form of sparsity, we consider sparsity-in-levels, which has emerged recently in the signal processing literature. A block vector $x = (x_i)_{i=1}^b$ is called $(t_i)_{i=1}^b$-sparse-in-levels if each block $x_i$ is a $t_i$-sparse vector. Empirically it has been observed that certain signals exhibit this type of structured sparsity after expansion in an appropriate basis. For instance, the wavelet coefficients of natural images often exhibit an (approximately) sparse-in-levels structure, where the levels correspond to the wavelet scales. We refer to e.g. [1, 5, 7, 40] for recent work on sufficient recovery guarantees for sparse-in-levels signals.

To keep our exposition non-technical and to ensure that our sparsity concepts are linked to classical $\ell_p(\ell_q)$-spaces, we restrict ourselves to the study of block sparsity and sparsity-in-levels in the case where the block sizes and (in the case of sparsity-in-levels) the sparsity of the blocks are identical. Accordingly, we say that a block vector $x = (x_i)_{i=1}^b \in \mathbb{R}^{bd}$ with $x_i \in \mathbb{R}^d$ for all $i \in [b]$ is $s$-outer sparse if at most $s$ blocks of $x$ are non-zero and say that $x$ is $t$-inner sparse if each block of $x$ has at most $t$ non-zero entries. The qualifiers 'outer' and 'inner' refer to the fact that the sparsity is defined with respect to the outer $\ell_p$-space or inner $\ell_q$-space, respectively, if $x \in \ell_p^b(\ell_q^d)$. For our analysis it will also be important to consider a mixture of outer and inner sparsity: we define $x$ to be $(s,t)$-sparse if it has at most $s$ non-zero blocks, each of which has at most $t$ non-zero entries.

To illustrate why outer and inner sparsity should play an important role for the Gelfand numbers of $\ell_p^b(\ell_q^d)$-spaces where $p$ and/or $q$ are small, let us point out a generalization of "Stechkin's lemma" [23, Prop. 2.3, Thm. 2.5] to the two different best $s$-term approximation errors (6) associated with inner and outer sparsity. Let us emphasize that although [23, Prop. 2.3, Thm. 2.5] is often attributed to S.B. Stechkin, he never formulated such a result. The first known version goes back to Temlyakov 1986 [58, p. 92] for $p, q \geq 1$.
Note that there is indeed a well-known Stechkin lemma (or Stechkin criterion) from 1955 proved in [56], see also [13, 7.4] for further historical comments. This criterion implies the inequality [23, Prop. 2.3] in case $p = 1$ and $q = 2$ but with a constant larger than 1. Let us now formulate two generalizations of [23, Prop. 2.3]. Their proofs are straightforward modifications. If $0 <$

The bound (20) shows that vectors in $B_{\ell_p^b(\ell_q^d)}$ with $p$ small can be approximated well by outer-sparse vectors and, similarly, (21) implies that elements in $B_{\ell_p^b(\ell_q^d)}$ with $q$ small can be approximated well by inner-sparse vectors. The constant $c = 1$ in (20) and (21) is not optimal. It is actually below one, depending on $p, q, r, u$, see [13, Section 7.4] and the references therein.

3. A general upper bound for Gelfand widths of $K \subset \mathbb{R}^n$

In this section we set up some machinery to prove upper bounds for Gelfand numbers based on Gordon's "escape through the mesh" theorem. To facilitate potential re-use, we write our results down in more generality than needed in Section 6 below. Below we will deal with Gelfand widths of bounded subsets in $\mathbb{R}^n$, see (17) in Remark 2.3. The main tool in the proof will be Gordon's "escape through a mesh" theorem [29, Cor. 3.4]. Recall that the Gaussian width of a set $T \subset \mathbb{R}^n$ is defined by

$w(T) = \mathbb{E} \sup_{x \in T} \langle g, x \rangle$,

where $g$ is an $n$-dimensional standard Gaussian vector. Let us denote

$E_n := \mathbb{E} \|g\|_2 = \sqrt{2}\, \frac{\Gamma((n+1)/2)}{\Gamma(n/2)}$.

We will state the version of Gordon's theorem from [23, Thm. 9.21].

Lemma 3.1. Let $A \in \mathbb{R}^{m \times n}$ be a Gaussian random matrix and $T$ be a subset of the unit sphere $S^{n-1} := \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$. Then for every $t > 0$ we have

$\mathbb{P}\Big( \inf_{x \in T} \|Ax\|_2 \leq E_m - w(T) - t \Big) \leq e^{-t^2/2}$.

By the known relation $m/\sqrt{m+1} \leq E_m \leq \sqrt{m}$ (see e.g. [23, Prop. 8.1]), Lemma 3.1 yields a non-trivial result if $E_m > w(T)$ and in particular if $m \gtrsim w(T)^2$.

Proposition 3.2.
Let $0 < \rho < \infty$, let $\|\cdot\|_Y$ denote a quasi-norm on $\mathbb{R}^n$ and let $K \subset \mathbb{R}^n$ be a bounded symmetric set. Define the set

$K_\rho = \{ x \in K : \|x\|_Y > \rho \}$

and let $K_\rho^{\|\cdot\|_2} = \{ x / \|x\|_2 : x \in K_\rho \}$ be the associated set of $\ell_2$-normalized vectors. If

(22) $m \gtrsim \max\{ w(K_\rho^{\|\cdot\|_2})^2, 1 \}$,

then $c_m(K, Y) \leq \rho$.

Proof. To prove the assertion, we need to find an $A \in \mathbb{R}^{m \times n}$ such that $\|x\|_Y \leq \rho$ for all $x \in \ker(A) \cap K$. This is satisfied if $\inf_{x \in K_\rho^{\|\cdot\|_2}} \|Ax\|_2 > 0$. By Lemma 3.1, if $A$ is an $m \times n$ standard Gaussian matrix such that $E_m > w(K_\rho^{\|\cdot\|_2})$, then

(23) $\gamma(K, \rho; A) := \inf_{x \in K_\rho^{\|\cdot\|_2}} \|Ax\|_2 > 0$

with probability at least $1 - \exp(-t^2/2)$, where $0 < t$ ... for a certain $C_Y > 0$, as all quasi-norms on $\mathbb{R}^n$ are equivalent. Hence, (22) ensures the existence of $A \in \mathbb{R}^{m \times n}$ satisfying $\inf_{x \in K_\rho^{\|\cdot\|_2}} \|Ax\|_2 > 0$.

In conclusion, we can obtain an upper bound for the Gelfand width by estimating

$w(K_\rho^{\|\cdot\|_2}) = \mathbb{E} \sup_{x \in K : \|x\|_Y > \rho} \Big\langle \frac{x}{\|x\|_2},\, g \Big\rangle$

from above.

4. A lower bound for Gelfand widths via packing numbers

We generalize the proof-by-contradiction technique from [22], which was used to prove lower bounds for $c_m(\mathrm{id} : \ell_p^n \to \ell_q^n)$. To this end, we establish a connection between Gelfand and packing numbers in Proposition 4.4, which requires some further preparation. Recall that for a quasi-normed space $X$, a bounded subset $K \subset X$, and $\varepsilon > 0$, the packing number $\mathcal{P}(K, \|\cdot\|_X, \varepsilon)$ is the maximal number of elements in $K$ which have pairwise distances strictly larger than $\varepsilon$. The following well-known lemma can be proven by a standard volume comparison argument.

Lemma 4.1. Let $\|\cdot\|_X$ be any quasi-norm on $\mathbb{R}^n$ with quasi-norm constant $1 \leq \alpha_X < \infty$ and let $U$ be a subset of the unit ball $B_X = \{x \in \mathbb{R}^n : \|x\|_X \leq 1\}$. Then for any $\varepsilon > 0$,

$\mathcal{P}(U, \|\cdot\|_X, \varepsilon) \leq \alpha_X^n \Big( 1 + \frac{2}{\varepsilon} \Big)^n$.

Lemma 4.2. Let $A \in \mathbb{R}^{m \times n}$ and let $\|\cdot\|_X$ be any quasi-norm on $\mathbb{R}^n$. Suppose that $T \subset \mathbb{R}^n$ is a set of vectors satisfying, for some $c \geq 1$,

(24) $\inf_{z \in \mathbb{R}^n : Az = Ax} \|z\|_X \geq c^{-1} \|x\|_X$ for all $x \in T - T$.

Let $U \subset T$ and let $r_X(U) = \sup_{x \in U} \|x\|_X$ be its radius.
Then, for any $\varepsilon > 0$,

$\log \mathcal{P}(U, \|\cdot\|_X, \varepsilon) \leq m \log\Big( \alpha_X + \frac{2 c\, \alpha_X\, r_X(U)}{\varepsilon} \Big)$,

where $\alpha_X$ is the quasi-norm constant.

Proof. Consider the quotient space $Q = X/\ker(A)$ equipped with its natural quasi-norm

$\|[x]\|_Q := \inf_{v \in \ker(A)} \|x - v\|_X$.

Note that for every $v \in \ker(A)$, the vector $z = x - v$ satisfies $Az = Ax$. On the other hand, if $z \in \mathbb{R}^n$ satisfies $Az = Ax$, then $v = x - z$ defines a vector in $\ker(A)$. By (24), this implies

$\|[x]\|_Q \leq \|x\|_X \leq c\, \|[x]\|_Q$ for all $x \in T - T$

and in particular

$\mathcal{P}(U, \|\cdot\|_X, \varepsilon) \leq \mathcal{P}(U, c\|\cdot\|_Q, \varepsilon) = \mathcal{P}(U, \|\cdot\|_Q, \varepsilon/c)$.

Now apply Lemma 4.1 to obtain

$\mathcal{P}(U, \|\cdot\|_X, \varepsilon) \leq \mathcal{P}(r_X(U) B_Q, \|\cdot\|_Q, \varepsilon/c) \leq \Big( \alpha_X + \frac{2 c\, \alpha_X\, r_X(U)}{\varepsilon} \Big)^{\mathrm{rank}(A)} \leq \Big( \alpha_X + \frac{2 c\, \alpha_X\, r_X(U)}{\varepsilon} \Big)^m$.

Taking logarithms yields the result.

In what follows, we let $\|\cdot\|_X$ be a quasi-norm on $\mathbb{R}^{bd}$ with quasi-norm constant $1 \leq \alpha_X < \infty$. Moreover, we let $0 < \beta_X < \infty$ denote the smallest constant such that for all $S \subset [b] \times [d]$ and all $x \in \mathbb{R}^{bd}$ we have

$\|x_S\|_X + \|x_{S^c}\|_X \leq \beta_X \|x\|_X$.

Clearly, for any $S \subset [b] \times [d]$ and $x \in \mathbb{R}^{bd}$,

$\|x\|_X \leq \alpha_X (\|x_S\|_X + \|x_{S^c}\|_X) \leq \alpha_X \beta_X \|x\|_X$,

so $\alpha_X \beta_X \geq 1$. Note that $\beta_{\ell_p^b(\ell_q^d)} \leq 2$ depends only on $p$ and $q$.

Lemma 4.3. Let $\mathcal{S}$ be a collection of supports in $[b] \times [d]$ and let $A \in \mathbb{R}^{m \times bd}$. Suppose that for every $v \in \ker(A) \setminus \{0\}$ and $S \in \mathcal{S}$ we have $\|v_S\|_X < \|v_{S^c}\|_X$. Then, for any $x \in \mathbb{R}^{bd}$ and $S \in \mathcal{S}$,

$\inf_{z \in \mathbb{R}^{bd} : Az = Ax_S} \|z\|_X > (\alpha_X \beta_X)^{-1} \|x_S\|_X$.

Proof. Let $x \in \mathbb{R}^{bd}$. If $z \neq x_S$ satisfies $Az = Ax_S$, then $v = x_S - z \in \ker(A) \setminus \{0\}$ and therefore

$\alpha_X^{-1} \|x_S\|_X \leq \|x_S - z_S\|_X + \|z_S\|_X = \|v_S\|_X + \|z_S\|_X < \|v_{S^c}\|_X + \|z_S\|_X = \|z_{S^c}\|_X + \|z_S\|_X \leq \beta_X \|z\|_X$.

Let us define the parameter

$\kappa_{X,Y,\mathcal{S}} = \sup_{x \in \mathbb{R}^{bd},\, S \in \mathcal{S}} \frac{\|x_S\|_X}{\|x_S\|_Y}$.

Proposition 4.4. Let $\mathcal{S}$ be a collection of supports in $[b] \times [d]$ and define $\mathcal{S}_+ = \{S_1 \cup S_2 : S_1, S_2 \in \mathcal{S}\}$. Let $\|\cdot\|_Y$ be a quasi-norm on $\mathbb{R}^{bd}$ satisfying $\|x_S\|_Y \leq \|x\|_Y$ for all $x \in \mathbb{R}^n$, $S \in \mathcal{S}_+$.
If

$c_m(B_X, Y) < (2 \alpha_X \kappa_{X,Y,\mathcal{S}_+})^{-1}$,

then for any $U \subset \{x_S : x \in \mathbb{R}^{bd}, S \in \mathcal{S}\}$ and $\varepsilon > 0$,

$\log \mathcal{P}(U, \|\cdot\|_X, \varepsilon) \leq m \log\Big( \alpha_X + \frac{2 \alpha_X^2 \beta_X\, r_X(U)}{\varepsilon} \Big)$.

Proof. By definition of the Gelfand width, there exists an $A \in \mathbb{R}^{m \times bd}$ such that for every $x \in \ker(A) \setminus \{0\}$,

$\|x\|_Y < (2 \alpha_X \kappa_{X,Y,\mathcal{S}_+})^{-1} \|x\|_X$.

For any $S \in \mathcal{S}_+$,

$\|x_S\|_X \leq \kappa_{X,Y,\mathcal{S}_+} \|x_S\|_Y \leq \kappa_{X,Y,\mathcal{S}_+} \|x\|_Y < (2\alpha_X)^{-1} \|x\|_X \leq \tfrac{1}{2} (\|x_S\|_X + \|x_{S^c}\|_X)$.

Rearranging, we conclude that $\|x_S\|_X < \|x_{S^c}\|_X$. Now apply Lemma 4.3 and subsequently Lemma 4.2 (with $T = \{x_S : x \in \mathbb{R}^{bd}, S \in \mathcal{S}\}$, noting that $T - T = \{x_S : x \in \mathbb{R}^{bd}, S \in \mathcal{S}_+\}$) to obtain the conclusion.

Proposition 4.4 can be used to prove lower bounds for the Gelfand widths $c_m(B_{\ell_p^b(\ell_q^d)}, \ell_r^b(\ell_u^d))$ via contradiction: one assumes that $c_m(B_{\ell_p^b(\ell_q^d)}, \ell_r^b(\ell_u^d))$ is small and then constructs a large packing consisting of structured sparse vectors to find a contradiction. This packing construction is the subject of the next section.

5. Construction of a large packing in $B_{\ell_p^b(\ell_q^d)}$

We obtain a lower bound for the packing numbers $\mathcal{P}(B_{\ell_p^b(\ell_q^d)}, \|\cdot\|_{\ell_r^b(\ell_u^d)}, \varepsilon)$, where $0 < p, q, r, u \leq \infty$. The proof, inspired by [3], is based on two combinatorial facts. The first combinatorial fact has been independently observed in various disciplines of mathematics. A proof can be found, e.g., in [23, Lemma 10.12], [38] or [50, Prop. 2.21, Page 219].

Lemma 5.1. Given integers $\ell$

The second combinatorial fact is a variation of Lemma 5.1, which is well known in coding theory [27, 61]. For given sets $A_1, \ldots, A_\ell$ we let $d_H$ denote the Hamming distance on $A_1 \times \cdots \times A_\ell$, i.e.,

$d_H\big((x_1, \ldots, x_\ell), (y_1, \ldots, y_\ell)\big) = \sum_{i=1}^{\ell} 1_{x_i \neq y_i}$.

Lemma 5.2 (Gilbert-Varshamov bound). Fix $\theta > 1$. Let $A_1, \ldots, A_\ell$ be sets, each consisting of $\theta$ elements. Then, for any $k \in [\ell]$,
$\mathcal{P}(A_1 \times \cdots \times A_\ell, d_H, k) \geq \frac{\theta^\ell}{\sum_{j=0}^{k-1} \binom{\ell}{j} (\theta - 1)^j}$.

Let us recall that we defined a block vector $x = (x_i)_{i=1}^b \in \mathbb{R}^{bd}$ to be $(s,t)$-sparse if it has at most $s$ non-zero blocks, each of which has at most $t$ non-zero entries. In the proof of the following proposition it will be notationally convenient to identify each $x \in \mathbb{R}^{bd}$ with the $b \times d$ matrix having the $x_i$ as its rows. Under this identification, the matrix $x$ is $(s,t)$-sparse if it has at most $s$ non-zero rows, each of which has at most $t$ non-zero entries.

Proposition 5.3. Let $b, d \in \mathbb{N}$ with $b, d \geq 8$. For any $0 < p, q, r, u \leq \infty$, $s \in [\lfloor b/8 \rfloor]$, and $t \in [\lfloor d/8 \rfloor]$, there is a set $W$ of $(2s, 2t)$-sparse vectors with $r_{\ell_p^b(\ell_q^d)}(W) \leq (2s)^{1/p} (2t)^{1/q}$ such that

$\mathcal{P}\big(W, \|\cdot\|_{\ell_r^b(\ell_u^d)}, s^{1/r} (2t)^{1/u}\big) \geq \Big( \frac{b}{32 s} \Big)^{s} \Big( \frac{d}{8 t} \Big)^{st}$.

Proof. We use Lemma 5.1 to find a collection $\mathcal{A}_{d,t}$ of $(d/(8t))^{t}$ subsets of $[d]$ such that each $I \in \mathcal{A}_{d,t}$ has cardinality $2t$ and $\mathrm{card}(I \cap J)$