
Blind Deconvolution using Modulated Inputs

Ali Ahmed

arXiv:1811.08453v3 [cs.IT] 23 Dec 2019

Abstract—This paper considers the blind deconvolution of multiple modulated signals/filters, and an arbitrary filter/signal. Multiple inputs s1, s2, ..., sN =: [sn] are modulated (pointwise multiplied) with random sign sequences r1, r2, ..., rN =: [rn], respectively, and the resultant inputs (sn ⊙ rn) ∈ C^Q, n ∈ [N] are convolved against an arbitrary input h ∈ C^M to yield the measurements

yn = (sn ⊙ rn) ~ h, n ∈ [N] := {1, 2, ..., N},

where ⊙ and ~ denote pointwise multiplication and circular convolution, respectively. Given [yn], we want to recover the unknowns [sn] and h. We make a structural assumption that the unknowns [sn] are members of a known K-dimensional (not necessarily random) subspace, and prove that the unknowns can be recovered from sufficiently many observations using a regularized gradient descent algorithm whenever the modulated inputs sn ⊙ rn are long enough, i.e., Q ≳ KN + M (to within logarithmic factors, and signal dispersion/coherence parameters). Under the bilinear model, this is the first result on multichannel (N ≥ 1) blind deconvolution with provable recovery guarantees under near optimal (in the N = 1 case) sample complexity estimates, and comparatively lenient structural assumptions on the convolved inputs. A neat conclusion of this result is that modulation of a bandlimited signal protects it against an unknown convolutive distortion. We discuss the applications of this result in passive imaging, wireless communication in an unknown environment, and image deblurring. A thorough numerical investigation of the theoretical results is also presented using phase transitions, image deblurring experiments, and noise stability plots.

Index Terms—Blind deconvolution, gradient descent, passive imaging, modulation, random signs, multichannel blind deconvolution, random mask imaging, channel protection, image deblurring.

I. INTRODUCTION

This paper considers blind deconvolution of a single input convolved against multiple modulated inputs. We observe the convolutions of h ∈ C^M, and [sn] ∈ C^Q modulated in advance with known [rn] ∈ R^Q, respectively.1 Modulation (sn ⊙ rn) is simply the pointwise multiplication (Hadamard product) of the entries of sn, and rn. Mathematically, the observed convolutions are

yn = h ~ (rn ⊙ sn), n ∈ [N],  (1)

where ~ denotes an L-point circular convolution2 operator giving [yn] ∈ C^L. We always set L ≥ max(Q, M). We want to recover [sn], and h from the circular convolutions [yn]. We make a structural assumption that the entries of the modulating sequences [rn] are random, in particular, binary ±1. This implicitly means that the signs of the inputs sn ⊙ rn, n ∈ [N] are random/generic. In applications, this may either be justified by analog signal modulation with binary waveforms (easily implementable) prior to convolution, or might implicitly hold when the inputs are naturally sufficiently diverse (dissimilar signs).

This problem arises in passive imaging. An ambient uncontrollable source generates an unstructured signal that drives multiple convolutive channels, and one aims to recover the source signal, and the channel impulse responses (CIRs) from the recorded convolutions. Recovering the CIRs reveals important information about the structure of the environment, such as in seismic interferometry [1], [2], and passive synthetic aperture radar imaging [3]. Recovering the source signal is required in, for example, underwater acoustics to classify the identity of a submerged source [4]–[6], and in speech processing to clean the reverberated records [7]. In general, the problem setup (1) is of interest in signal processing, wireless communication, and system theory. Apart from other applications, one specific and interesting result of general interest is that randomly modulating a bandlimited analog signal protects it against an unknown linear time invariant (or convolutive) system; this will also be discussed in detail below.

We assume that [sn] live in known subspaces, that is, each sn can be written as the multiplication of a known Q × K tall orthonormal matrix C, and a short vector xn of expansion coefficients.3 Mathematically,

sn = Cxn, n ∈ [N].  (2)

It is important to point out critical differences in the structural assumptions compared to the recent literature [8], [9]–[13] on blind deconvolution, where at least one of the convolved signals is assumed to live in a known subspace spanned by the columns of a random Gaussian matrix. The signal subspace in actual applications is often poorly described by the Gaussian model. We relinquish the restrictive Gaussian subspace assumption, and give a provable blind deconvolution result by only assuming random/generic sign (modulated) signals that reside in realistic subspaces spanned by DCT, wavelets, etc.

Additionally, we take h to be completely arbitrary, and do not assume any structure such as a known subspace or sparsity in a known basis; this is unlike some of the other recent works [9]–[12]. In general, we take M ≤ L as already assumed in the observation model (1). Equivalently, we assume the length-M vector h to be zero-padded to length L before the L-point circular convolution.

A. Ahmed is an Assistant Professor at the Department of Electrical Engineering, Information Technology University (ITU), Lahore 54000, Pakistan. Email: [email protected]. This work was supported by the Higher Education Commission (HEC), Pakistan under the National Research Program for Universities (NRPU), Project no. 6856. Manuscript submitted on July 16, 2019.
Footnote 1: The notation [sn], along with the other notations in the paper, is introduced in the Notations section below.
Footnote 2: Two vectors in C^M, and C^Q are zero-padded to length L and then circularly convolved to return an L-point circular convolution.
Footnote 3: The model and all the results in the paper can easily be generalized to a different matrix Cn for each n.

In the particular case of multiple (N > 1) convolutions, as will be shown later, no zero-padding is assumed, i.e., one can set M as large as L. This is important in applications such as passive imaging, where often the source signal is uncontrolled, and unstructured, and hence cannot be realistically assumed to be zero-padded; see the discussion in Section II.

Let F be an L × L DFT matrix with entries F[ω, n] = (1/√L) e^{−ι2πωn/L}, (ω, n) ∈ [L] × [L]. Formally, we take the measurements in the Fourier domain

ŷn = √L (F_M h0 ⊙ F_Q (rn ⊙ s_{0,n})) + ên = √L (F_M h0 ⊙ F_Q R_n C x_{0,n}) + ên,  (3)

where R_n := diag(rn) is a Q × Q diagonal matrix, h0, x0 := vec([x_{0,n}]) are the ground truths, ŷn = F yn, and [ên] ∈ C^L denote the additive noise in the Fourier domain. To deconvolve, we minimize the measurement loss by taking a gradient step in each of the unknowns h0, and [x_{0,n}] while keeping the others fixed. This paper details a particular set of conditions on the sample complexity, subspace dimensions, and the signals/filters under which this computationally feasible gradient descent scheme provably succeeds.

A. Notations

Standard notation for matrices (capital, bold: C, F, etc.), column vectors (small, bold: x, y, etc.), and scalars (α, c, C, etc.) holds. Matrix and vector conjugate transpose is denoted by * (e.g., A*, x*). A bar over a column vector x̄ returns the same vector with each entry complex conjugated.

In general, the notation [xn] refers to the set of vectors {x1, x2, ..., xN}. Moreover, [xn] ∈ C^Q means that xn ∈ C^Q for every n. For a scalar N, we define [N] := {1, 2, 3, ..., N}. The notation vec([xn]) refers to a concatenation of the N vectors x1, x2, ..., xN, i.e., vec([xn]) := [x1*, x2*, ..., xN*]*. For a matrix C, we denote by C^{⊗N} a block diagonal matrix Σ_{n=1}^N e_n e_n* ⊗ C, where ⊗ is the standard Kronecker product, and [en] are the standard N-dimensional basis vectors. Building on this notation, we define, for example, (R_n C)^{⊗N} := Σ_{n=1}^N e_n e_n* ⊗ R_n C for a sequence of matrices R1, R2, ..., RN, and C. We denote by F_J a submatrix formed by the first J columns of an L × L matrix F; e.g., (F_J)* denotes a J × L matrix. We denote by I_K a K × K identity matrix. Absolute constants will be denoted by C, c1, c2, ...; their value might change from line to line. We will write A ≳ B if there is an absolute constant c1 for which A ≥ c1 B. We use ‖·‖2, ‖·‖∞ to denote the standard ℓ2, and ℓ∞ norms, respectively. Moreover, ‖·‖*, ‖·‖2→2, and ‖·‖F signify the matrix nuclear, operator, and Frobenius norms, respectively.

B. Coherence Parameters

Our main theoretical results depend on some signal dispersion measures that characterize how diffuse signals are in the Fourier domain. Intuitively, concentrated (not diffuse) signals in the Fourier domain annihilate the measurements in (3), making it relatively difficult (more samples required) to recover such signals. We refer to the signal diffusion measures as coherence parameters, defined and discussed below.

For arbitrary vectors h ∈ C^M, x := vec([xn]) ∈ C^{KN}, where [xn] ∈ C^K for every n, we define the coherences

μ_h² := L ‖F_M h‖∞² / ‖h‖2², ν_x² := QN ‖C^{⊗N} x‖∞² / ‖x‖2², and ν_max² := Q ‖C‖∞²,  (4)

where, as noted in the notations section above, C^{⊗N} is a block diagonal matrix formed by stacking N matrices C.

Similar coherence parameters appear in the related recent literature on blind deconvolution [9], [12], and elsewhere in compressed sensing [14], [15], in general. Without loss of generality, we only assume that ‖h0‖2 = √d0, and ‖x0‖2 = √d0. For brevity, we will denote the coherence parameters μ²_{h0}, and ν²_{x0} of the fixed ground truth vectors (h0, x0) by

μ² := μ²_{h0}, and ν² := ν²_{x0}.  (5)

In words, the coherence parameter μ_h² is the peak value of the frequency spectrum of a fixed norm vector h. A higher value roughly indicates a concentrated spectrum and vice versa. It is easy to check that 1 ≤ μ_h² ≤ L.

On the other hand, ν_x² quantifies the dispersion (not in the Fourier domain) of the signals sn = Cxn. A signal concentrated in time (mostly zero) remains somewhat oblivious to the random sign flips rn ⊙ sn, and as a result is not as well-dispersed in the frequency domain. Let c*_{q,n} be the rows of C^{⊗N}. By definition, ν_x² ‖x‖2² ≥ QN |c*_{q,n} x|² for any (q, n) ∈ [Q] × [N]. Summing over (q, n) ∈ [Q] × [N] on both sides, and using the isometry of C, gives us the inequality ν_x² ≥ 1. The upper bound ν_x² ≤ QN is easy to see using the Cauchy-Schwarz inequality; hence, 1 ≤ ν_x² ≤ QN.

The third coherence parameter ν_max² in (4) ensures that the subspace of the vectors sn is generic—it is spanned by well-dispersed vectors. One can easily check that 1 ≤ ν_max² ≤ Q, and the upper bound is achieved for C = [I_K; 0].

Our results indicate that for successful recovery, the sample complexity or the number LN of measurements, and the length QN of the modulated signals scale with μ², ν², and ν_max².

C. Signal Recovery via Regularized Gradient Descent

Notice that the measurements in (3) are non-linear in the unknowns (h0, x0); however, they are linear in the rank-1 outer product h0 x̄0*. To see this, let f_ℓ ∈ C^M be the ℓth row of F_M, an L × M submatrix of the L × L normalized DFT matrix F, and ĉ_{ℓ,n} ∈ C^{KN} be the ℓth row in the nth block-row of the LN × KN block-diagonal matrix √L (F_Q R_n C)^{⊗N}. The ℓth entry ŷn[ℓ] of the measurements ŷn in (3) is then simply

ŷn[ℓ] = f_ℓ* h0 x̄0* ĉ_{ℓ,n} + ên[ℓ] = ⟨f_ℓ ĉ*_{ℓ,n}, h0 x̄0*⟩ + ên[ℓ];  (6)

the linearity of the measurements in h0 x̄0* is clear from the last equality above. We also define a linear map A : C^{M×KN} → C^{LN} that maps h0 x0* to the vector ŷ := vec([ŷn]). The action of A on a rank-1 matrix hx* returns

A(hx*) := { f_ℓ* h x̄* ĉ_{ℓ,n} }, (ℓ, n) ∈ [L] × [N], and therefore,

ŷ = A(h0 x0*) + e.  (7)
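The rank-1 linearity in (6) can be checked numerically. The sketch below uses illustrative sizes and, for simplicity, real coefficients x (so the conjugation bars act trivially on x); it verifies that each Fourier-domain sample is a linear functional of the outer product h xᵀ:

```python
import numpy as np

rng = np.random.default_rng(0)
L, Q, M, K = 32, 24, 8, 4                  # illustrative sizes, L >= max(Q, M)
C, _ = np.linalg.qr(rng.standard_normal((Q, K)))
F = np.fft.fft(np.eye(L)) / np.sqrt(L)     # normalized L x L DFT matrix
FM, FQ = F[:, :M], F[:, :Q]

h = rng.standard_normal(M)
x = rng.standard_normal(K)                 # one channel (n fixed), real for simplicity
r = rng.choice([-1.0, 1.0], Q)
Chat = np.sqrt(L) * FQ @ (r[:, None] * C)  # its rows play the role of c_hat_{l,n} in (6)

yhat = (FM @ h) * (Chat @ x)               # noiseless Fourier-domain measurements, cf. (3)
X = np.outer(h, x)                         # the rank-1 unknown h x^T

# Each sample equals a fixed bilinear functional of the rank-1 matrix X, cf. (6).
checks = [np.isclose(yhat[l], FM[l] @ X @ Chat[l]) for l in range(L)]
```

Every entry of `checks` is True: the nonlinearity in (h, x) disappears once the unknown is lifted to the rank-1 matrix X, which is what the map A exploits.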

In (7), e := vec([ên]), and we used the definition of A to compactly express (6). It also shows that multichannel blind deconvolution with a shared input h0 can be treated jointly as a rank-1 matrix h0 x0* ∈ C^{M×KN} recovery problem, and the observations [ŷn] in all the channels are the linear measurements of this common rank-1 object.

Given measurements ŷ of the ground truth (h0, x0), we employ a regularized gradient descent algorithm that aims to minimize a loss function

F̃(h, x) := F(h, x) + G(h, x)  (8)

w.r.t. h, and x, where the functions F(h, x), and G(h, x) account for the measurement loss, and the regularization, respectively; and are defined below:

F(h, x) := ‖A(hx*) − ŷ‖2² = ‖A(hx* − h0 x0*) − e‖2² = ‖A(hx* − h0 x0*)‖2² + ‖e‖2² − 2Re(⟨A*(e), hx* − h0 x0*⟩),  (9)

and

G(h, x) := ρ [ G0(‖h‖2²/(2d)) + G0(‖x‖2²/(2d)) + Σ_{ℓ=1}^L G0(L|f_ℓ* h|²/(8dμ²)) + Σ_{q=1}^Q Σ_{n=1}^N G0(QN|c*_{q,n} x|²/(8dν²)) ],  (10)

where G0(z) = max{z − 1, 0}². Conforming to the choice in the proofs below, we set ρ ≥ d² + ‖e‖2², and 0.9 d0 ≤ d ≤ 1.1 d0 (proof of Theorem 2 below). Together, the first and third terms in the regularizer G(h, x) penalize any h for which ‖h‖2² > 2d, and ‖h‖2² μ_h² = L‖F_M h‖∞² > 8dμ². Similarly, the second and fourth terms penalize the coherence ν_x² and the norm of x. In words, the regularizer keeps the coherences μ_h², ν_x², and the norms of (h, x) from deviating too much from those of the ground truth (h0, x0).

The proposed regularized gradient descent algorithm takes alternate Wirtinger gradient (of the loss function F̃(h, x)) steps in each of the unknowns h, and x while fixing the other; see Algorithm 1 below for the pseudo code. The Wirtinger gradients are defined as4

∇F̃_h := ∂F̃/∂h̄, and ∇F̃_x := ∂F̃/∂x̄.  (11)

Footnote 4: For a complex function f(z), where z = u + ιv ∈ C^L, and u, v ∈ R^L, the Wirtinger gradient is defined as ∂f/∂z̄ = (1/2)(∂f/∂u + ι ∂f/∂v).

We explicitly write here the gradients of F(h, x) in (9) w.r.t. h, and x to shed a bit more light on how the multichannel problem is jointly solved across all the channels for the same h. Recall from (7) that by definition

A(hx*) = [ F_M h ⊙ F_Q R_1 C x_1 ; F_M h ⊙ F_Q R_2 C x_2 ; ... ; F_M h ⊙ F_Q R_N C x_N ].

It is then easy to see that the gradients of F(h, x) w.r.t. h, and [xn] are

∇F_h = Σ_{n=1}^N F_M* ( (F_Q R_n C x_n)‾ ⊙ (F_M h ⊙ F_Q R_n C x_n − ŷ_n) ),
∇F_{x_n} = (F_Q R_n C)* ( (F_M h)‾ ⊙ (F_M h ⊙ F_Q R_n C x_n − ŷ_n) ),

where recall that the bar notation z̄ represents the entry-wise conjugate of a vector z. The gradient ∇F_x is obtained by stacking [∇F_{x_n}] in a column vector. However, as h is fixed across the channels, its gradient update is jointly computed as an average of the contributions from all the N channels. Jointly solving for h enables the recovery of an arbitrary h with no additional assumption such as h lying in a known subspace, as is the case for single-channel blind deconvolution [9], [10].

Similar algorithms with provable recovery results appeared earlier, beginning with [16], [17] for matrix completion, and [18] for phase retrieval, and in [10] for blind deconvolution, however, with an observation model different from (1) considered here.

Algorithm 1 Wirtinger gradient descent with a step size η
  Input: Obtain (u0, v0) via Algorithm 2 below.
  for t = 1, ... do
    u_t ← u_{t−1} − η ∇F̃_h(u_{t−1}, v_{t−1})
    v_t ← v_{t−1} − η ∇F̃_x(u_{t−1}, v_{t−1})
  end for

Finally, a suitable initialization (u0, v0) for Algorithm 1 is computed using Algorithm 2 below. In short, the left and right singular vectors of A*(ŷ), when projected onto the set of sufficiently incoherent vectors (measured in terms of the coherences μ, ν of the original vectors h0, and x0), supply us with the initializers (u0, v0).

Algorithm 2 Initialization
  Input: Compute A*(ŷ), and find the leading singular value d, and the corresponding left and right singular vectors ĥ0, and x̂0, respectively. Solve the following optimization programs:
    u0 ← argmin_h ‖h − √d ĥ0‖2, subject to √L ‖F_M h‖∞ ≤ 2√d μ, and
    v0 ← argmin_x ‖x − √d x̂0‖2, subject to √(QN) ‖C^{⊗N} x‖∞ ≤ 2√d ν.
  Output: (u0, v0).

D. Main Results

Our main result shows that given the convolution measurements (3), a suitably initialized Wirtinger gradient descent Algorithm 1 converges to the true solution, i.e., (u_t, v_t) ≈ (h0, x0), under an appropriate choice of Q, N, and L. To state the main theorem, we need to introduce some neighborhood sets. For vectors h ∈ C^M, and x ∈ C^{KN}, we define the following sets of neighboring points of (h, x) based on either magnitude, coherence, or the distance from the ground truth:

N_{d0} := {(h, x) | ‖h‖2 ≤ 2√d0, ‖x‖2 ≤ 2√d0},  (12)
N_μ := {(h, x) | √L ‖F_M h‖∞ ≤ 4μ√d0},  (13)
N_ν := {(h, x) | √(QN) ‖C^{⊗N} x‖∞ ≤ 4ν√d0},  (14)
N_ε := {(h, x) | ‖hx* − h0 x0*‖F ≤ ε d0}.  (15)

Our main result on blind deconvolution from modulated inputs (3) is stated below.
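Before the main theorem, here is a minimal numerical sketch of the scheme above: a spectral initialization in the spirit of Algorithm 2 (the projection steps are omitted), followed by the alternating Wirtinger updates of Algorithm 1 with the regularizer G dropped. All sizes and the step size are illustrative assumptions; this is a plain least-squares variant, not the paper's regularized algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
L, Q, M, K, N = 64, 48, 16, 8, 6                    # illustrative sizes, L >= max(Q, M)
C, _ = np.linalg.qr(rng.standard_normal((Q, K)))    # known tall orthonormal basis
F = np.fft.fft(np.eye(L)) / np.sqrt(L)              # normalized L x L DFT matrix
FM, FQ = F[:, :M], F[:, :Q]

h0 = rng.standard_normal(M)                         # ground truth filter
x0 = rng.standard_normal((N, K))                    # ground truth coefficients
# B[n] plays the role of sqrt(L) * F_Q R_n C in (3).
B = [np.sqrt(L) * FQ @ (rng.choice([-1.0, 1.0], Q)[:, None] * C) for _ in range(N)]
yhat = [(FM @ h0) * (B[n] @ x0[n]) for n in range(N)]   # noiseless measurements (3)

def loss(u, v):
    """Measurement loss F(h, x) of (9), without the regularizer G."""
    return sum(np.linalg.norm((FM @ u) * (B[n] @ v[n]) - yhat[n]) ** 2
               for n in range(N))

# Spectral initialization: leading singular pair of the M x KN matrix A*(yhat).
Astar = np.hstack([FM.conj().T @ (yhat[n][:, None] * B[n].conj()) for n in range(N)])
U, s, Vh = np.linalg.svd(Astar, full_matrices=False)
u = np.sqrt(s[0]) * U[:, 0]
v = (np.sqrt(s[0]) * Vh[0].conj()).reshape(N, K)

# Alternating Wirtinger gradient steps as in Algorithm 1 (both updates use the
# previous iterate, matching the pseudo code).
eta, loss0 = 5e-4, loss(u, v)
for _ in range(300):
    res = [(FM @ u) * (B[n] @ v[n]) - yhat[n] for n in range(N)]
    grad_h = sum(FM.conj().T @ (np.conj(B[n] @ v[n]) * res[n]) for n in range(N))
    grad_x = np.stack([B[n].conj().T @ (np.conj(FM @ u) * res[n]) for n in range(N)])
    u, v = u - eta * grad_h, v - eta * grad_x
```

Note how `grad_h` sums the residual contributions of all N channels, which is exactly the joint update for the shared h discussed above; the per-channel gradients `grad_x` match ∇F_{x_n}.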

Theorem 1. Fix 0 < ε ≤ 1/15. Let C ∈ R^{Q×K} be a tall basis matrix, and set s_{0,n} = C x_{0,n}, with x_{0,n} ∈ C^K for every n = 1, 2, 3, ..., N, and let h0 ∈ C^M be an arbitrary vector. Let the coherence parameters of C and (h0, x0) be as defined in (5). Let [rn] be independently generated Q-vectors with standard iid Rademacher entries. We observe the L-point circular convolutions of the random sign vectors rn ⊙ s_{0,n} with h0, where L ≥ max(Q, M), leading to the observations (3) contaminated with additive noise e. Assume that the initial guess (u0, v0) of (h0, x0) belongs to (1/√3)N_{d0} ∩ (1/√3)N_μ ∩ (1/√3)N_ν ∩ N_{(2/5)ε}. Then Algorithm 1 will create a sequence (u_t, v_t) ∈ N_{d0} ∩ N_μ ∩ N_ν ∩ N_ε, which converges geometrically (in the noiseless case, e = 0) to (h0, x0), and there holds

‖u_{t+1} v*_{t+1} − h0 x0*‖F ≤ (2/3)(1 − ηω)^{(t+1)/2} ε d0 + 50 ‖A*(e)‖2→2  (16)

with probability at least

1 − 2 exp(−c δ_t² QN / (μ²ν²)),  (17)

whenever

QN ≥ (c/δ_t²) (μ² ν_max² K N² + ν² M) log⁴(LN),  (18)

where δ_t := ‖u_t v_t* − h0 x0*‖F / d0, ω > 0, and η is the fixed step size. Fix α ≥ 1. For noise e ~ Normal(0, (σ²d0²/(2LN)) I_LN) + ι Normal(0, (σ²d0²/(2LN)) I_LN), we have ‖A*(e)‖2→2 ≤ (2ε/50) d0 with probability at least 1 − O((LN)^{−α}) whenever

LN ≥ c_α (σ²/ε²) max(M, KN log(LN)) log(LN).  (19)

The above theorem claims that starting from a good enough initial guess, the gradient descent algorithm converges geometrically to the ground truth in the noiseless case. The theorem below guarantees that the required good enough initialization, (u0, v0) ∈ (1/√3)N_{d0} ∩ (1/√3)N_μ ∩ (1/√3)N_ν ∩ N_{(2/5)ε}, is supplied by Algorithm 2.

Theorem 2. The initialization obtained via Algorithm 2 satisfies (u0, v0) ∈ (1/√3)N_{d0} ∩ (1/√3)N_μ ∩ (1/√3)N_ν ∩ N_{(2/5)ε}, and 0.9 d0 ≤ d ≤ 1.1 d0 holds with probability at least 1 − 2 exp(−c ε² QN/(μ²ν²)) whenever

QN ≥ (c/ε²) (μ² ν_max² K N² + ν² M) log⁴(LN).

Proofs of Theorems 1, and 2 are given in Section IV-D, and Appendix G, respectively.

E. Discussion

Theorems 1, and 2 together prove that the randomly modulated unknown Q-vectors [s_{0,n}], and an unknown M-vector h0 can be recovered with a desired accuracy from their N circular convolutions h0 ~ (rn ⊙ s_{0,n}), n ∈ [N], under suitably large Q, N, and L. We will refer to the bounds in (18), (19), and L ≥ max(Q, M) as sample complexity bounds. Together these give

LN ≥ QN ≳ (μ² ν_max² K N² + ν² M) log⁴(LN)  (20)

for a fixed desired accuracy δ_{t+1} of the recovery. We want to remark here that the result only guarantees an approximate recovery, as is the case in some earlier works [16], [19], [20] on matrix completion using non-convex methods. For example, [16] uses fresh independent samples to compute a stochastic gradient update for technical reasons of avoiding dependencies among the iterates; this leads to an approximate recovery. On the other hand, our measurement model gives rise to dependent scalar measurements (6) across the index ℓ, and does not give way to a natural splitting of the measurements in batches of independent samples. Fortunately, we do not need batch splitting in this proof method; however, we are still only able to guarantee approximate recovery, the main technical reason being a limited structured randomness in the linear map A, which does not lead to strong concentration bounds. For "exact" recovery, i.e., with error δ_{t+1} = 0, the method requires infinitely many samples. We leave this mainly technical challenge of improving the results to a finite sample complexity to future work. All the remaining discussion in this section will assume a fixed accuracy δ_t, and hence a constant factor 1/δ_t² in the sample complexity bound.

We now provide a discussion on the interpretation of these sample complexity bounds in several interesting scenarios, such as single (N = 1) and multiple (N > 1) convolutions, to facilitate the understanding of the reader.

Sample Complexity: Observe that the number of unknowns in the system of equations (3) is KN + M. Combining L ≥ max(Q, M), (18), and (19), it becomes clear that the number LN of measurements required for successful recovery scales with KN² + M (within coherences, and log factors). This shows that the sample complexity results are off by a factor of N compared to the optimal scalings. We believe this is mainly a limitation of the proof technique; the phase transitions in Section III show that (18) is a conservative bound, and successful deconvolution generally occurs when LN ≥ QN ≳ KN + M; see the phase transitions in the numerical simulations in Section III-A.

In general, for multiple convolutions, we require LN ≥ QN ≳ (μ² ν_max² K N² + ν² M) log⁴(LN). In the passive imaging problem, where an ambient source drives multiple CIRs, the above bound places a minimum required length QN of the CIRs for a successful blind deconvolution from the recorded data.

In the case of a single (N = 1) convolution, the above bound reduces to L ≥ Q ≳ (μ² ν_max² K + ν² M) log⁴ L. The number of unknowns in this case is only K + M. Unlike the multiple convolutions case above, for a desired accuracy of the recovered estimate, the bound on L above is information theoretically optimal (within log factors and coherence terms). This sample complexity result almost matches the results in [9], [10] except for an extra log factor. However, the important difference is, as mentioned in the introduction, that unlike [9]–[13], the inputs are not assumed to reside in Gaussian subspaces; rather, they only have random signs, which, if not given, can also be enforced through random modulation in some applications; see Section II.

No zero-padding of h0: Assume that the unknown filter h0 is completely arbitrary, that is, its length M can be as large as L. Equivalently, no zero-padding or, in general, no known subspace is assumed. Even in the case of linear systems of equations, recovery is only possible in this case whenever LN ≥ KN + L, i.e., the number LN of measurements exceeds the number KN + L of unknowns.

Evidently, LN ≥ KN + L can never be achieved in the single channel scenario (N = 1). However, in the multichannel scenario (N strictly bigger than 1), LN ≥ KN + L is possible by setting N, and L to be sufficiently large, and hence successful recovery may also be possible. In light of (20), we have that the inputs [x_{0,n}], and a filter h0 with length M, possibly as large as L, can be recovered whenever L, and N are chosen to be sufficiently large such that LN ≥ QN ≳ (μ² ν_max² K N² + ν² L) log⁴(QN) holds. The numerics consistently show that successful recovery occurs at a near optimal sample complexity, i.e., LN ≥ QN ≳ (KN + L), indicating probable room for improvement in the derived performance bounds.

No zero-padding has practical importance in passive imaging, where an unstructured and uncontrollable source signal with no discernible on, or off time is driving the CIRs [5], and hence one cannot realistically assume any zero-padding.

Finally, the bound in (18) might appear contradictory to a reader, as it guarantees recovery for longer filters/signals (large enough Q), whereas one should expect that deconvolution must be easier if the convolved signals are shorter (fewer overlapping copies); for example, deconvolution is immediate in the trivial case of one-tap (Q = 1) filters. However, it is important to note that the bound (18) only gives a range of Q, and N under which recovery is certified, and in no way eliminates the possibility of a successful blind deconvolution when it is violated. Roughly speaking, in our case, a longer length Q only introduces more sign randomness and makes the (blind deconvolution) inverse problem well-conditioned.

II. APPLICATIONS

The measurement model in (3) finds many applications owing to the minimal structural assumptions on the signals. We present three application scenarios: channel protection in wireless communications via an implementable modulation system that protects an analog signal against convolutive interference using real-time preprocessing, random mask imaging, and passive imaging.

A. Channel Protection using Random Modulators

One of the important results of this paper is that binary modulation of an analog bandlimited signal protects it against unknown linear convolutive channels. For illustration, consider a simple scenario of wireless communication of a periodic5 signal s(t) in t ∈ [0, 1), bandlimited to B Hertz. The expansion of s(t) using Fourier basis functions is

s(t) = Σ_{k=−B}^{B} x[k] e^{ι2πkt}.

Footnote 5: We restrict the discussion to a periodic signal mainly to reduce the mathematical clutter. A non-periodic signal can be handled within a time limited window and smoothing around the edges.

The signal s(t) can be captured by taking Q ≥ K := 2B + 1 equally spaced samples at the time instants t ∈ T_Q := {0, 1/Q, ..., 1 − 1/Q}. Let F_K be a matrix formed by the K columns (corresponding to the signal frequencies) of a normalized Q × Q DFT matrix F. Then the samples of s(t) can be expressed as s = F_K x, where the Fourier coefficients x[k] are the entries of the K-vector x. The signal s(t) is modulated by a binary waveform r(t) alternating at a rate Q. Let r be a Q-vector of samples of the binary waveform r(t) at t ∈ T_Q. The modulated signal s(t)r(t) undergoes an unknown linear transformation (s(t)r(t)) ~ h(t) = y(t) through an LTI system, where h(t) is the impulse response of the LTI system, given as

h(t) = Σ_{m=1}^{M} h[m] δ(t − t_m), where t_m ∈ T_Q,

and h[m] is the mth entry of an M-vector h. The samples y of the transformed signal y(t) exactly take the form

y = (R F_K x) ~ h,  (21)

where as before R = diag(r), and y ∈ C^L.

Since the observation model (21) aligns with the model considered in this paper, a direct application of our main result shows that s, and hence s(t), can be recovered from the received signal y(t) = (s(t)r(t)) ~ h(t) without knowing the CIR, by operating the random binary waveform r(t) at a rate Q ≳ (μ²K + ν²M) log⁴ L, and sampling the received signal y(t) at a rate L = Q, where we used the fact that ν_max² = 1 for the DFT matrix F above. The coherences ν², and μ² are simply the peak values in the time domain, ‖s‖∞², and the frequency domain, ‖F_M h‖∞², respectively, where F_M are the first M columns of a normalized L × L DFT matrix.

Binary modulation of an analog signal can be easily implemented using switches that flip the signs of the signal in real time; the setup is shown in Figure 1. Fast rate binary switches can be easily implemented; see, for example, [21], and [22]–[24] for the use of binary switches in other applications in signal processing. The implementation potential, combined with the ubiquity of blind deconvolution, makes this result interesting in system theory, applied communications, and signal processing, among others.

The signal subspace can be other than Fourier vectors in practical applications. In wireless communications, for example, channel coding protects a message against unknown errors by introducing redundancy in the messages. This operation can be viewed as the multiplication of the message vector with a tall matrix C. The coded message s = Cx is transmitted over an unknown channel characterized by an impulse response h ∈ R^M. A simple, and easy to implement, additional step of randomly flipping the signs r ⊙ s of the coded message enables the decoder to recover x from several delayed, and attenuated overlapping copies (r ⊙ s) ~ h of the transmitted codeword; see Figure 2 for a pictorial illustration.
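The sampled channel-protection model (21) can be simulated directly. The sketch below uses illustrative values of B, Q, and M; F_K is built from the columns of the normalized DFT matrix at the occupied frequencies (non-negative frequencies first, then the negative ones, following numpy's FFT ordering):

```python
import numpy as np

rng = np.random.default_rng(2)
B_hz = 7                    # bandlimit in Hz (illustrative)
K = 2 * B_hz + 1            # number of Fourier coefficients, K = 2B + 1
Q = 48                      # samples per period, Q >= K
M = 5                       # unknown CIR taps
L = Q                       # sampling rate of the received signal, L = Q

x = rng.standard_normal(K) + 1j * rng.standard_normal(K)   # Fourier coefficients
F = np.fft.fft(np.eye(Q)) / np.sqrt(Q)                     # normalized Q x Q DFT matrix
freqs = np.r_[0:B_hz + 1, Q - B_hz:Q]                      # the K occupied frequencies
FK = F[:, freqs]
s = FK @ x                  # samples of the bandlimited signal s(t)

r = rng.choice([-1.0, 1.0], Q)          # samples of the binary modulating waveform r(t)
h = rng.standard_normal(M)              # unknown channel taps

# Received samples (21): y = (R F_K x) ~ h, an L-point circular convolution.
y = np.fft.ifft(np.fft.fft(r * s, n=L) * np.fft.fft(h, n=L))
```

Note that the columns of F_K are orthonormal, and every entry of F has magnitude 1/√Q, so ν_max² = Q‖F_K‖∞² = 1, matching the remark above.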

the formation of the micro-organisms. Secondly, the image Decoder deblurring in [25] is achieved via a computationally expensive -tap ADC semidefinite program operating in the lifted space of dimension rate LTI system KNM. On the other hand, the gradient descent scheme in Algorithm 1 is computationally efficient as it operates in rate natural parameter space of dimension only KN + M.

C. Passive Imaging: Multichannel Blind Deconvolution ✕ =

Time In passive imaging, a source signal s(t) feeds multiple con- volutive channels. The signal s(t) is not observed/controlled, and is unstructured. For example, in seismic experiments, a drill generates noise like signature that propagates through ✻ = earth subsurfaces. The reflected copies from earth layers overlap and are recorded at multiple receivers. To characterize Frequency the subsurfaces, a multichannel blind deconvolution (MBD) on the received data discovers the Green’s function; for details, Fig. 1. Analog implementation for real time protection against channel see an interesting recent work [2], and references therein. In intereference. A continuosus time signal s(t), bandlimited to B Hz, is underwater acoustics, a submerged source signal is distorted, modulated with a random binary waveform r(t) alternating at a rate Q. The modulated signal drives an unknown LTI system characterized reverberated while propagating through the water media. Mul- by an M-tap impulse response h(t). The resulting signal is sampled at tiple passive sensors on water surface record the distorted a rate Q. Operate the modulator and ADC at a rate Q & max(B, M) signals. The source recognition is better if the recorded data (to within a constant, log factors and coherences), and recover s(t), is cleaned using blind deconvolution [28]. and h(t) using algorithm 1. Underneath, the preprocessing is shown in The recorded data at each of the receivers in the passive time, and frequency domain. Modulation in time domain spreads the 6 spectrum, and the resulting higher frequency signal remains protected imaging applications above takes the form . against the distortions caused by an unknown LTI system. yn(t) = s(t) ~ hn(t), n = 1, 2, 3,..., N, (22)

where hn(t)’s are short CIRs. Importantly , Theorem 1 clearly an image acquisition system, shown in Figure 3, in which a determines the combined length QN of CIRs must exceed the programmable spatial light modulating (SLM) array is placed length M of the source signal, as is evident from the bound between the image and the lens. SLM modulates (±1) the in (18). This means that for longer (meeting the generic sign light reflected off of the object before it passes through the assumption) CIRs, one can guarantee to resolve a longer length lens. While ideal binary masks are 0/1, we consider (±1) of source signal from the recorded data. for technical reasons; the (±1) masks can be implemented MBD was studied with keen interest in 90’s; see, [29], in practices using a (0/1) mask together with all 1’s mask. [30] for some of the least squares based approaches. Using The light impinging on the detector array is convolution of commutativity of convolutions, an effective strategy [31], [32] of lens with randomly modulated images. relies on the null space of the cross correlation matrix of Assuming an apriori knowledge of the subspace of each image, the recorded outputs. Recovery using these spectral methods which might be a subset of a carefully selected wavelet or depends on the condition that CIRs do not share common DCT bases functions, we can deblur the images using gradient roots in the z-domain — some of the MBD schemes developed descent as discussed in Section I-C. The relative dimension of based on this observation can be found in [33]–[35]. the image subspaces w.r.t. image, and blur size must obey MBD has also been reexamined more recently using the sample complexity bounds presented in Theorem 1; see semidefinite programming (SDP) [8], [12], [13], and spectral Section III for details. 
MBD has also been reexamined more recently using semidefinite programming (SDP) [8], [12], [13], and spectral methods [11] that enjoy theoretical performance guarantees under restrictive Gaussian known-subspace assumptions on the CIRs. In comparison to the computationally expensive SDP operating in the lifted domain, and to spectral methods, we present a gradient descent scheme for MBD with provable guarantees under a weaker random-signs assumption on the CIRs. The generic/random sign assumptions on the CIRs might implicitly hold naturally, or could be made more likely to hold using indirect means; for example, arranging the receivers at dissimilar locations might lead to diverse CIRs. Moreover, as already discussed in Section I-E, we do not assume any unrealistic structure such as a known subspace, or zero-padding on the source signal s(t), and it can be completely arbitrary.

It is instructive to compare our results with a recent and closely related random mask imaging (RMI) setup given in [25] for image deblurring. A similar physical setup is studied, and recovery of a blurred image is achieved by placing a random mask between the lens and the image; however, two important differences exist compared to our approach. Firstly, in [25], and other works in this direction [26], [27], one image is fed multiple times through different random masks to improve the conditioning of the inverse problem, whereas in our setup we use a different unknown image every time. This is very important in applications where it is not possible to obtain multiple snapshots of the same scene as it is dynamic. For example, imagine imaging a culture of micro-organisms; the moving organisms and the surrounding fluid continuously change.

6 Compared to the model in (1), the role of s, and h is swapped in this section as there is one source signal s(t) and multiple CIRs hn(t).

Fig. 2. Channel protection in wireless communications: A user message x is encoded by multiplying with a tall coding matrix C (known subspace) in the conventional channel coding block. The signs of the resultant symbols s are randomly flipped (modulation). This signal is then transmitted and undergoes a series of reverberations and distortions (modeled as a convolution with CIR) while propagating to the receiver. The decoder estimates the symbols x, and CIR h.
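The pipeline in Fig. 2 can be sketched numerically. The dimensions and the scaling of the coding matrix below are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
K, Q, M = 8, 64, 12                                  # message, transmit, CIR lengths (illustrative)

x = rng.standard_normal(K)                           # user message
C = rng.standard_normal((Q, K)) / np.sqrt(Q)         # tall coding matrix (known subspace)
r = rng.choice([-1.0, 1.0], size=Q)                  # random binary modulation (sign flips)
h = np.pad(rng.standard_normal(M), (0, Q - M))       # unknown CIR, zero-padded

s = C @ x                                            # channel coding
y = np.real(np.fft.ifft(np.fft.fft(r * s) * np.fft.fft(h)))   # received: (s ⊙ r) ~ h
```

The decoder's task is exactly the bilinear problem of the paper: estimate both `x` and `h` from `y`, knowing only `C` and `r`.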

on both sides by the (element-wise) inverse ĥ0⁻¹ to give

ĥ0⁻¹ ⊙ ŷn = √L FQ Rn C x0,n + ĥ0⁻¹ ⊙ ên.

Clearly, the problem is now linearized [38] in the unknowns (ĥ0⁻¹, x0,n), and one can proceed with the recovery using the least squares objective below

minimize over g, {x0,n}:  Σ_{n=1}^{N} ‖g ⊙ ŷn − √L FQ Rn C x0,n‖²₂.
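In the noiseless case, the linearization above can be checked numerically. The sketch below uses small illustrative dimensions, absorbs the √L scaling into C, and assumes ĥ0 has no vanishing DFT entries; it weights the frequency-domain data by g = ĥ0⁻¹ and recovers each x0,n by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
K, Q, N = 6, 48, 3

C = rng.standard_normal((Q, K))
F = np.fft.fft(np.eye(Q))                       # DFT matrix F_Q (row k = k-th frequency)
h0 = rng.standard_normal(Q) + 0.1
h0_hat = np.fft.fft(h0)                         # assumed to have no zero entries

R = [np.diag(rng.choice([-1.0, 1.0], size=Q)) for _ in range(N)]
x0 = [rng.standard_normal(K) for _ in range(N)]

# Noiseless frequency-domain data: y_hat_n = h0_hat ⊙ (F R_n C x_{0,n})
y_hat = [h0_hat * (F @ R[n] @ C @ x0[n]) for n in range(N)]

# Elementwise weighting by g = h0_hat^{-1} makes the model linear in x_{0,n}:
g = 1.0 / h0_hat
x_ls = [np.linalg.lstsq(F @ R[n] @ C, g * y_hat[n], rcond=None)[0] for n in range(N)]
```

A blind solver must estimate g and the x0,n jointly (e.g., via the eigenvector formulation discussed next); this sketch only verifies the linear structure using the true g.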

This perfectly models the unstructured source signal in passive imaging.

Fig. 3. Schematic of the random mask imaging setup. The reflection of a target image from a spatial light modulator (SLM) is blurred by a lens, and the resultant intensities are measured on a detector array. Every time, a new image is observed through this system (with a new mask pattern), and eventually the lens blur kernel and all the images are discovered using a gradient descent algorithm.

D. Other Related Work

A regularized gradient descent scheme to minimize the non-convex measurement loss was rigorously analyzed recently in [10] for the single channel (N = 1) blind deconvolution, and was shown to be provably effective under the known Gaussian subspace assumption. In comparison, we study the multichannel blind deconvolution, and the problem setup (3) also has much more limited and structured randomness in the diagonal Rn compared to a dense Gaussian matrix used in [10]. This requires a considerably more intricate proof argument based on generic chaining [36] to show approximate stable recovery using a regularized gradient descent algorithm. Recently, [37] showed that (vanilla) gradient descent, without an additional regularization term such as (10), enjoys provably similar recovery guarantees for blind deconvolution under the Gaussian subspace model given in [10]. Extending a result similar to [37] to our case (3) remains challenging since, unlike the Gaussian subspace case considered in [37], the samples in ŷn in (3) are statistically dependent. Numerically, we observe similar performance to Algorithm 1 even if the regularization term (10) is not included.

Observations in (3) are bilinear in the unknowns (h0, x0). Denoting FM h0 = ĥ0, the measurements (3) can be rescaled by weighting with the element-wise inverse ĥ0⁻¹, as in the display above. The drawbacks of this approach are its sensitivity to the noise components that are amplified due to the weighting ĥ0⁻¹ ⊙ ên, which affects the overall least squares recovered solution. Moreover, the problem can be framed as finding the smallest eigenvector of a matrix, and an inherent ambiguity exists if it has a more than one-dimensional null space. [39] gives provable recovery results using a least squares approach under various random subspace models. [40] relinquishes the subspace model and instead assumes the sn admit sparse representations in Gaussian random matrices, and proves signal recovery using the same linearized eigenvector approach with a power iteration algorithm under strict spectral flatness conditions on the signals. The performance under noise in these linearized schemes [39], [40] is only guaranteed under additional assumptions on the filter invertibility ĥ0⁻¹, and on the magnitudes of the entries of ĥ0⁻¹. In comparison, we directly work with the bilinear model, and give the first provable approximate stable (under noise) recovery results for blind deconvolution using random modulations.

The multichannel blind deconvolution problem can also be framed as a rank-1 matrix recovery problem [12], [41]. Exact and stable recovery results from optimally many measurements are derived in [12] when the signals lie in random Gaussian subspaces.

The question of the uniqueness of the solution (h0, x0) (up to global scaling) of multichannel bilinear problems of the form

ŷn = ĥ0 ⊙ Ĉ x0,n

has been studied in [42]. In particular, necessary and sufficient conditions for the identifiability of (ĥ0, x0) were given for almost all (ĥ0, Ĉ, x0). In the particular case of h0 ∈ C^L, choosing ĥ0 = F h0, and Ĉ = FQ Rn C makes the last display above equivalent to the measurement model (3) in the noiseless case. Thus applying Theorem 2.1 in [42] would imply that if

L > K, and (L − 1)/(L − K) ≤ N ≤ K,

then for almost all ĥ0, FQ Rn C, and x0, the pair (ĥ0, x0) is identifiable up to global scaling. The results show that identifiability is possible under the optimal sample complexity LN ≥ KN + L − 1 for almost all (ĥ0, FQ Rn C, x0). Compared to this result, our derived sample complexity bound (20) is off by a factor of N (to within log factors, and coherences). The numerics also show that this additional factor of N on the right hand side of (20) is not required to obtain successful recovery in practice. Necessary and sufficient conditions on the modulation rate Q for the identifiability of the unknowns, however, do not directly follow from the work in [39], and are an open question. The numerics suggest that successful recovery occurs whenever QN is roughly of the order of the number of unknowns.

Multichannel blind deconvolution from observations yn = h0 ~ sn under the assumption that the sn are sparse vectors has also been studied [43], [44]. The blind inverse problem is solved by looking for a filter g such that [g ~ yn] are sparse. Sparsity is promoted using a convex penalty such as the ℓ1 norm [43], or more recently using a different convex relaxation involving the ℓ4 norm [43]. However, the provable sample complexity results are far from optimal; for details, see [44]. In comparison, we assume that the sn reside in a known subspace, and have generic sign patterns that either exist naturally or can be explicitly enforced using random modulation. This model nicely fits some practical applications, as already laid out in Section II.

We would also like to discuss a related paper [45] that considers recovering sn ∈ C^L, n ∈ [N], and h ∈ C^L from the convolutions

yn = h ~ (rn ⊙ sn), n ∈ [N],

where ‖F sn‖0 ≤ K. Theorem 1.1 in [45] claims that h, and the sn can be recovered with probability at least 1 − O(L^−β) by solving a convex program whenever L ≳ βN, and N ≳ βK², and that

sup_{n,k} |sn[k]|² / inf_k Σ_{n=1}^{N} |sn[k]|² = O(1/N). (23)

However, it seems that at least the statement of Theorem 1.1 in [45] is not correct, as there are several assumptions made in the proof argument, such as |ŝn[k]|² = O(1/L), and h²_min := min_ℓ |ĥ[ℓ]|² = O(1/L), which do not appear in the statement of Theorem 1.1 in [45]. In addition, Theorem 1.1 in [45] claims recovery under comparatively strict coherence requirements such as

sup_{n,k} |sn[k]|² / inf_k Σ_{n=1}^{N} |sn[k]|² = O(1/N).

For example, to satisfy this coherence condition, it must always be true that

max_n ‖sn‖²₂ / Σ_{n=1}^{N} ‖sn‖²₂ = O(1/N),

which says that the energy must be roughly equally shared among all the inputs sn. Not only that, the share of energy should be roughly equal across the corresponding entries sn[k] as well, as is clear from (23). Together with this, the proof also uses other strict flatness conditions:

|ŝn[k]|² = O(1/L), and h²_min = O(1/L),

where ‖sn‖²₂ = 1 for every n ∈ [N], and ‖h‖²₂ = 1. These conditions basically enforce that sn, and h have to be flat in the frequency domain for successful recovery. In comparison, the required coherence parameters (4) in our paper are much milder, and successful recovery is still possible under Theorem 1 of our paper for any value of these coherences (smaller or larger).

Blind deconvolution has also been studied under various assumptions on input statistics; some important references are [46], [47]. We complete the brief tour of the related works in the above sections by pointing readers to the survey articles [48], [49] to account for other interesting works that we might have missed in the expansive literature on this subject.

III. NUMERICAL SIMULATIONS

In this section, we numerically investigate the sample complexity bounds using phase transitions. We showcase random mask image deblurring results, and also report stable recovery in the presence of additive measurement noise.

A. Phase transitions

We present phase transitions to numerically investigate the constraints in (18), and (19) on the dimensions Q, N, M, K, and L for the gradient descent algorithm to succeed with high probability. The shade represents the probability of failure, which is computed over hundred independent experiments. For each experiment, we generate Gaussian random vectors h0, and [x0,n], and choose C to be a subset of the columns of a DCT matrix. The synthetic measurements are then generated following the model (1). We run Algorithm 1, initialized via Algorithm 2, and classify the experiment as successful if the relative error

Relative Error := ‖ĥx̂* − h0x0*‖F / ‖h0x0*‖F (24)

is below 10⁻². The probability of success at each point is computed over hundred such independent experiments.

The first four (left to right) phase diagrams in Figure 4 investigate successful recovery using four different (one for each phase diagram) lengths Q of the modulated inputs, and varying values of K, and M while keeping L, and N fixed. We set L = 3200, and N = 1 in all four phase transitions, while Q is fixed at 800, 1600, 2400, and 3200, respectively. Clearly, the white region (probability of success almost 1) expands with increasing Q. For example, in the first (top left) phase transition, successful recovery almost always occurs when the measurements are a factor of 9 above the number of unknowns, that is, L ≈ 9(K + M), and this factor improves to 5, 3, and 2.8 from the second to the fourth phase transition, respectively. These phase
These phase n=1 sn 2 9 transitions show that successful recovery is obtained for a wide and the right shows that relative error almost reduces to zero range of shorter to longer unknown random sign filters/signals, when the oversampling ration exceeds 2.1. however, successful recovery happens more often for longer (larger Q) modulated inputs. IV. PROOFS Recall the discussion in Section I-E, where we pointed out A. Preliminaries that the bound in (18) is conservative by a factor of N. The fifth phase transition in Figure 4 investigates the affect of N Recall the function F(h, x) defined in (9), and the gradients ˜ on minimum value of Q required for successful recovery, and ∇Fh, and ∇Fx in (11). By linearity, ∇Fh = ∇Fh + ∇Gh, and ˜ shows that numerically this value of Q does not increase with similarly, ∇Fx = ∇Fx + ∇Gx. Using the definitions of F, and increasing N, and is roughly on the order of K + M, and not G in (9), and (10), the gradients w.r.t. h, and x are KN + (M/N) as predicted in (18) in Theorem 1. ∗ ∗ ∗ ∇Fh = A (A(hx − h0 x0) − e)x, Finally, the last phase transition in Figure 4 investigates ∇F = [A∗(A(hx∗ − h x∗) − e)]∗h, (25) K vs. N under fixed M, and L, and setting Q = 1.5K. It x 0 0 shows that increasing N improves the frequency of successful " 2 ! recovery even under a pessimistic choice of Q = 1.5K in ρ 0 kxk2 ∇Gx = G x comparison to the bound in (18), which suggests that Q & 2d 0 2d KN + (M/N). Q N QN|c∗ x|2 ! # In summary, the phase diagrams suggest that LN ≥ QN QN Õ Õ 0 q,n ∗ & + G cq,n c x , (26) 4ν2 0 8dν2 q,n (KN +M) is sufficient for exact recovery with high probability. q=1 n=1 and B. Image deblurring " 2 ! L ∗ 2 ! # ρ 0 khk2 L Õ 0 L| f` h| ∗ In this section, we showcase the result of a synthetic ∇G = G h + G f` f h . h 2d 0 2d 4µ2 0 8dµ2 ` experiment on image deblurring using random masks. 
We `=1 select three microscopic 150 × 150 images of human blood (27) cells each of which is blurred using the same 10 × 10 We have the following useful lower and upper bounds on (M = 100) Gaussian blur of variance 7. The original and F(h, x) using the triangle inequality blurred images are shown in the first and second row of Figure ∗ ∗ ∗ 2 5, respectively. Each of the image is assumed to live in a − 2kA (e)k2→2 khx − h0 x0 k∗ ≤ F(h, x) − kek2 known subspace7 of dimension K = 3400 spanned by the ∗ ∗ 2 ∗ ∗ ∗ − kA(hx − h0 x0)k2 ≤ 2kA (e)k2→2 khx − h0 x0 k∗. most significant wavelet coefficients. We mimic the random For brevity, we set khx∗ −h x∗ k := δd . Since hx∗ −h x∗ is mask imaging setup discussed in Section II-B, and pixelwise 0 0 F 0 √ 0 0 a rank-2 matrix, we have khx∗−h x∗ k ≤ 2khx∗−h x∗ k = multiply each image with a 150×150 random ±1 mask. Given √ 0 0 ∗ √ 0 0 F the observations on the detector array, we jointly deblur three 2δd0, where we used kh0 k2 = kx0 k2 = d0 from Lemma 1. 1 (N = 3) images using the proposed gradient descent algorithm. Invoking Lemma 9, and Lemma 7 with ξ = 4 , we have The deblurred images are shown in the third row of Figure 3 εδd0 5 εδd0 5. The total relative mean squared error (MSE) of the three kek2 + δ2d2 − ≤ F(h, x) ≤ kek2 + δ2d2 + . 2 4 0 5 2 4 0 5 recovered images are 0.0184, and the blur kernel is estimated (28) within a relative MSE of 1.03 × 10−4. In the analysis later, it will be convenient to uniquely decom- pose h, and x as h = α1h0 + h˜, and x = α2 x0 + x˜, where C. Performance under noise and oversampling h∗h x∗ x h˜ ⊥ h, x˜ ⊥ x, and α = 0 , and α = 0 . We also define 1 d0 2 d0 Noise performance of the algorithm is depicted in Figure specific vectors ∆h, and ∆x that will repeatedly arise in the 6. Additive Gaussian noise eˆ is added in the measurements technical discussion later. as in (3). 
As before, we synthetically generate h , and x as 0 0 −1 Gaussian vectors, and C is the subset of the columns of a DCT ∆h = h − αh0, and ∆x = x − α¯ x0, (29) ( matrix. We plot (left) relative error (log scale) in (24) of the (1 − δ0)α1, if khk2 ≥ kxk2 recovered vectors hˆ, and xˆ averaged over hundred independent where α(h, x) = 1 , if khk2 < kxk2 ∗ 2 2 (1−δ0)α¯2 experiments vs. SNR := 10 log10 kh0 x0 kF /kek2 , and (right) average relative error (log scale) vs. oversampling ratio := δ with δ0 := 10 — the choice of α is mainly required for the L/(KN + M) under no noise. Oversampling ratio is a factor by proof of Lemma 6. Note that which the number (L in this case as N = 1) of measurements ∗ − ∗ ( − ) ∗ ˜ ∗ ∗ ˜ ∗ exceed the number K + M of unknowns. The left plot shows hx h0 x0 = α1α¯2 1 h0 x0 + α¯2hx0 + α1h0 x˜ + hx˜ . that the relative error degrades gracefully by reducing SNR, (30) The lemma below gives bounds on some relevant norms that 7 The known subspace of the original image is perhaps an unrealistic will be useful in the proofs later. assumption in this case, however, a reasonably accurate estimate of the image subspace can be obtained from blurred (small blur) image by taking the √ Lemma 1. Recall that kh0 k2 = kx0 k2 = d0. If δ := multiscale structure of wavelets into account to recover the support of wavelet k ∗− ∗ k hx h0 x0 F coefficients of the original image from blurred/smoothed out edges. < 1 then for all (h, x) ∈ Nd , we have the d0 0 10
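A minimal numerical sketch of the gradient expressions in (25): here the measurement map A is a generic linear map on M × K matrices built from illustrative random matrices B_l (a stand-in for the structured map (7)), the noise e is zero, and everything is real for simplicity.

```python
import numpy as np

rng = np.random.default_rng(4)
M, K, L = 8, 5, 120
B = rng.standard_normal((L, M, K))            # stand-in measurement matrices

A  = lambda X: np.einsum('lmk,mk->l', B, X)   # A : R^{M×K} -> R^L
At = lambda z: np.einsum('lmk,l->mk', B, z)   # adjoint A*

h0, x0 = rng.standard_normal(M), rng.standard_normal(K)
y = A(np.outer(h0, x0))                       # noiseless measurements (e = 0)

def loss_and_grads(h, x):
    resid = A(np.outer(h, x)) - y             # A(hx* − h0 x0*) − e
    G = At(resid)
    # eq. (25): ∇F_h = A*(resid) x,  ∇F_x = [A*(resid)]* h
    return np.sum(resid ** 2), G @ x, G.T @ h

F, gh, gx = loss_and_grads(h0 + 0.1, x0)
h1 = (h0 + 0.1) - 1e-4 * gh                   # one descent step in h reduces the loss
```

The gradients vanish at the truth (h0, x0), and a small step along −∇F_h decreases the loss, which is the mechanism exploited by Algorithm 1.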

[Figure 4: six phase-transition diagrams (shade = probability of failure). Panels 1–4: K (y-axis) vs. M (x-axis) for N = 1, L = 3200, with Q = L/4, L/2, 3L/4, and L, respectively. Panel 5: Q vs. N for K = 200, M = 500, and L = 1800. Panel 6: K vs. N for M = 40, Q = 1.5K, and L = 3(M + K).]

Fig. 4. The first four (left to right) are phase transitions of K vs. M for fixed Q, N, and L. Together these four phase diagrams show that longer (larger Q) modulated inputs allow recovery with larger values of K, and M. The fifth phase transition is Q vs. N for fixed L, K, and M. Successful recovery almost always occurs when Q ≈ 2(K + M) across all values of N, hence showing a better scaling than the linear scaling predicted by the theory in (18). The sixth phase transition is K vs. N for fixed M, Q, and L. Here Q, and L are chosen much more pessimistically than (18); however, the dominant white region shows that the multichannel case N > 1 leads to favorable results even in such pessimistic regimes.

following useful bounds: |α1| < 2, |α2| < 2, and |α1ᾱ2 − 1| ≤ δ.

For all (h, x) ∈ N_{d0} ∩ Nε with ε ≤ 1/15, there holds ‖∆h‖²₂ ≤ 6.1δ²d0, ‖∆x‖²₂ ≤ 6.1δ²d0, and ‖∆h‖²₂‖∆x‖²₂ ≤ 8.4δ⁴d0². Moreover, if we assume (h, x) ∈ Nµ ∩ Nν, we have √L‖FM ∆h‖∞ ≤ 6µ√d0, and √QN‖C^{⊗N}(∆x)‖∞ ≤ 6ν√d0.

Proof of this lemma is provided in Appendix C.
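The orthogonal decomposition h = α1h0 + h̃ used above, and the Lemma 1 bound |α1ᾱ2 − 1| ≤ δ (a direct Cauchy–Schwarz consequence), are easy to check numerically on synthetic vectors; the real-valued instance below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
d0 = 1.0
h0 = rng.standard_normal(12); h0 *= np.sqrt(d0) / np.linalg.norm(h0)   # ||h0||^2 = d0
x0 = rng.standard_normal(9);  x0 *= np.sqrt(d0) / np.linalg.norm(x0)

h = h0 + 0.05 * rng.standard_normal(12)       # a point near the truth
x = x0 + 0.05 * rng.standard_normal(9)

alpha1 = (h0 @ h) / d0                        # α1 = h0* h / d0
alpha2 = (x0 @ x) / d0                        # α2 = x0* x / d0
h_tilde = h - alpha1 * h0                     # residual component, orthogonal to h0
delta = np.linalg.norm(np.outer(h, x) - np.outer(h0, x0)) / d0
```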

Fig. 5. Random mask blind image deblurring via gradient descent. Three 150 × 150 blood cell images are shown in the first row. Blurred images using the same 10 × 10 Gaussian blur of variance 7 are shown in the second row. Applying random masks on the images before the unknown blurring (lens), and using the gradient descent algorithm gives the deblurred images in the third row.

B. Proof Strategy

Albeit important differences, the main template of the proof of Theorem 1 is similar to [10]. To avoid overlap with [10], we refer the reader at multiple points in the exposition below to consult [10] for some intermediate results that are already proved there. To facilitate this, the notation is kept very similar to [10]. The main lemmas required to prove Theorem 1 fall under one of the four key conditions [10] stated in Section IV-C. The lemmas under the important local RIP, and noise robustness conditions require completely different proofs compared to [10], due to the new structured random linear map A in (7) in this paper. The limited randomness in A calls for a more intricate chaining argument to handle the probabilistic deviation results required to prove the local RIP.

We begin by stating a main lemma showing that the iterates of the gradient descent algorithm decrease the objective function F̃(h, x). Let zt := (ut, vt) ∈ C^{M+KN} be the t-th iterate of Algorithm 1 that is close enough to the truth (h0, x0), i.e.,

[Figure 6: average relative error (log scale) vs. SNR (dB) for K = 200, M = 400, N = 1, and L = Q = 2400 (left); average relative error (log scale) vs. oversampling ratio L/(K + M) for K = 200, M = 300, N = 1, and L = Q (right).]

Fig. 6. Performance in the presence of additive measurement noise (left). Number of samples vs. the relative error (right).
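The SNR sweep in Fig. 6 amounts to adding complex Gaussian noise scaled to a target level. The helper below is an illustrative implementation with a generic stand-in for the measurement vector; it realizes SNR := 10 log10(‖y‖²/‖e‖²) exactly by rescaling the noise.

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)  # stand-in for A(h0 x0*)

def add_noise(y, snr_db, rng):
    """Scale complex Gaussian noise so that 10*log10(||y||^2/||e||^2) = snr_db."""
    e = rng.standard_normal(y.shape) + 1j * rng.standard_normal(y.shape)
    e *= np.linalg.norm(y) / np.linalg.norm(e) * 10 ** (-snr_db / 20)
    return y + e, e

y_noisy, e = add_noise(y, 20.0, rng)
snr = 10 * np.log10(np.linalg.norm(y) ** 2 / np.linalg.norm(e) ** 2)   # 20 dB by construction
```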

zt ∈ Nε and has a small enough loss F̃(zt) := F̃(ut, vt), that is, zt ∈ N_F̃, where

N_F̃ := { (h, x) | F̃(h, x) ≤ (1/3)ε²d0² + ‖e‖²₂ } (31)

is a sub-level set of the non-convex loss function F̃(h, x). We show that with the current iterate zt ∈ Nε ∩ N_F̃, the next iterate zt+1 of Algorithm 1 belongs to the same neighborhood, and does not increase the loss function.

Lemma 2 (Lemma 5.8 in [10]). Let the step size η ≤ 1/CL, zt := (ut, vt) ∈ C^{M+KN}, and CL be the constant defined in (32). Then, as long as zt ∈ Nε ∩ N_F̃, we have zt+1 ∈ Nε ∩ N_F̃, and

F̃(zt+1) ≤ F̃(zt) − η‖∇F̃(zt)‖²₂.

Proof. The proof is exactly as the proof of Lemma 5.8 in [10], and relies on the smoothness condition in Lemma 3 below. □

C. Key conditions

We now state the lemmas under the four key conditions required to prove Lemma 2, and the theorems.

1) Local smoothness:

Lemma 3. For any z := (h, x) and w := (u, v) such that z, z + w ∈ Nε ∩ N_F̃, there holds

‖∇F̃(z + w) − ∇F̃(z)‖₂ ≤ CL‖w‖₂

with

CL ≤ 2d0 [ 10‖A‖²₂→₂ + 5 + (ρ/d²)( 3L/(2µ²) + 3QN/(2ν²) ) ],

where ρ ≥ d² + 2‖e‖²₂, and ‖A‖₂→₂ ≤ cα √(K log(LN)) holds with probability at least 1 − O((LN)^{−α}). In particular, QN = O(µ²ν²_max KN² + ν²M) log⁴(LN), and ‖e‖²₂ = O(σ²d0²). Therefore, CL can be simplified to

CL = O( d0 (1 + σ²)(µ²ν²_max KN² + L) log⁴(LN) ) (32)

by choosing ρ ≈ d² + 2‖e‖²₂.

Proof of this lemma is provided in Appendix E.

2) Local regularity: Recall from Lemma 2 that we set the step size η ≤ 1/CL. With the step size η characterized through the constant CL in (32), the next task is to find a lower bound on the norm of the gradient ‖∇F̃(zt)‖ in Lemma 2. The following lemma provides this bound.

Lemma 4 (Lemma 5.18 in [10]). Let F̃(h, x) be as defined in (8) and ∇F̃(h, x) := (∇F̃h, ∇F̃x) ∈ C^{M+KN}. Then there exists a regularity constant ω = d0/5000 > 0 such that

‖∇F̃(h, x)‖²₂ ≥ ω [ F̃(h, x) − c ]₊

for any (h, x) ∈ N_{d0} ∩ Nµ ∩ Nν ∩ Nε, where c = ‖e‖²₂ + 1700‖A*(e)‖²₂→₂, and ρ ≥ d² + ‖e‖²₂.

Proof. The proof hinges on the following two conclusions

Re{⟨∇Fh, ∆h⟩ + ⟨∇Fx, ∆x⟩} ≥ δ²d0²/8 − 2δd0‖A*(e)‖₂→₂,
Re{⟨∇Gh, ∆h⟩ + ⟨∇Gx, ∆x⟩} ≥ (δ/5)√(ρ G0(h, x)),

established in Lemmas 5, and 6 below. Given the above two conditions, the proof of this lemma reduces exactly to the proof of Lemma 5.18 in [10]. □

Lemma 5. For any (h, x) ∈ N_{d0} ∩ Nµ ∩ Nν ∩ Nε with ε ≤ 1/15, uniformly:

Re{⟨∇Fh, ∆h⟩ + ⟨∇Fx, ∆x⟩} ≥ δ²d0²/8 − 2δd0‖A*(e)‖₂→₂,

with probability at least

1 − 2 exp(−cδ²QN/µ²ν²) (33)

provided

QN ≥ (c/δ²)(µ²ν²_max KN² + ν²M) log⁴(LN). (34)

Proof. Using the gradients derived earlier in (25), we have

⟨∇Fh, ∆h⟩ = ⟨A*(A(hx* − h0x0*) − e), ∆h x*⟩, and
⟨∇Fx, ∆x⟩ = ⟨A*(A(hx* − h0x0*) − e), h ∆x*⟩,

and hence

⟨∇Fh, ∆h⟩ + ⟨∇Fx, ∆x⟩ = −⟨A*(e), ∆hx* + h∆x*⟩ + ⟨A(hx* − h0x0*), A(∆hx* + h∆x*)⟩. (35)

Using the triangle inequality, ‖∆hx* + h∆x*‖F ≥ ‖hx* − h0x0*‖F − ‖∆h∆x*‖F. Lemma 1 shows ‖∆h∆x*‖F ≤ 2.9δ²d0 when δ ≤ ε ≤ 1/15. This implies that ‖∆hx* + h∆x*‖F ≥ δd0 − 2.9δ²d0 ≥ 0.8δd0. In a similar manner, the upper bound can be established, leading us to

0.8δd0 ≤ ‖∆hx* + h∆x*‖F ≤ 1.2δd0 (36)

that holds when δ ≤ ε ≤ 1/15. Using ξ = 1/4, and δ ≤ ε, Lemmas 7, and 8 below give the following conclusions

‖A(hx* − h0x0*)‖²₂ ≥ ‖hx* − h0x0*‖²F − (1/4)δ²d0² = (3/4)δ²d0², and
‖A(∆hx* + h∆x*)‖²₂ ≥ ‖∆hx* + h∆x*‖²F − (1/4)δ²d0² ≥ (1/4)δ²d0²,

each holding with probability at least (33) under the sample complexity bound (34). In addition, we also have

⟨A*(e), ∆hx* + h∆x*⟩ ≤ √2 ‖A*(e)‖₂→₂ ‖∆hx* + h∆x*‖F ≤ 2δd0‖A*(e)‖₂→₂.

Employing these bounds in (35), we obtain the desired bound. □

Lemma 6. For any (h, x) ∈ Nµ ∩ Nν ∩ N_{d0} ∩ Nε with ε ≤ 1/15, and 0.9d0 ≤ d ≤ 1.1d0, the following inequality holds uniformly:

Re{⟨∇Gh, ∆h⟩ + ⟨∇Gx, ∆x⟩} ≥ (δ/5)√(ρ G0(h, x)),

where ρ ≥ d² + 2‖e‖²₂.

For the proof, see Appendix D below.

3) Local RIP: The following two lemmas state the local restricted isometry property used above in the proof of Lemma 5.

Lemma 7. For all (h, x) ∈ N_{d0} ∩ Nµ ∩ Nν such that ‖hx* − h0x0*‖F = δd0, where δ < 1, the following local restricted isometry property:

| ‖A(hx* − h0x0*)‖²₂ − ‖hx* − h0x0*‖²F | ≤ ξδ²d0² (37)

holds for a ξ ∈ (0, 1) with probability at least 1 − 2 exp(−cξ²δ²QN/µ²ν²) whenever

QN ≥ (c/ξ²δ²)(µ²ν²_max KN² + ν²M) log⁴(LN). (38)

Lemma 8. For all (h, x) ∈ N_{d0} ∩ Nµ ∩ Nν ∩ Nε such that 0.8δd0 ≤ ‖∆hx* + h∆x*‖F ≤ 1.2δd0, where δ ≤ ε ≤ 1/15, the following local restricted isometry property:

| ‖A(∆hx* + h∆x*)‖²₂ − ‖∆hx* + h∆x*‖²F | ≤ ξδ²d0²

holds for a ξ ∈ (0, 1) with probability at least 1 − 2 exp(−cξ²δ²QN/µ²ν²) whenever (38) holds.

The proof of both these lemmas is the main technical contribution above [10], and is provided in Appendix A, and Appendix B. The usual probability concentration, and union bound argument [50] to prove RIP is not sufficient due to the limited/structured randomness in A. We therefore use the result in [51] based on generic chaining, which abstracts out the entire signal space from a coarse to a fine scale, and employs a more efficient use of the union bound at each scale after probability concentration.

4) Noise robustness: Finally, we give the noise robustness result below that gives a bound on the noise term ‖A*(e)‖₂→₂ appearing in Lemma 4.

Lemma 9. Fix α ≥ 1. For the linear map A defined in (7), it holds that ‖A‖₂→₂ ≤ cα √(K log(LN)) with probability at least 1 − O((LN)^{−α}). Moreover, let e ∈ C^{LN} be the additive noise introduced in (3), distributed as e ∼ Normal(0, (σ²d0²/2LN) I_LN) + ι Normal(0, (σ²d0²/2LN) I_LN); then there holds

‖A*(e)‖₂→₂ ≤ (2ε/50) d0 (39)

with probability at least 1 − O((LN)^{−α}) whenever LN ≥ c'α (σ²/ε²) max(M, KN log(LN)) log(LN), where cα, and c'α are absolute constants depending on the free parameter α ≥ 1.

Please refer to Appendix F for the proof of this lemma.

D. Proof of Theorem 1

Given all the intermediate results above, we are in a position to prove Theorem 1 below.

Proof. We denote by zt = (ut, vt) the iterates of the gradient descent algorithm, and δ(zt) = ‖utvt* − h0x0*‖F/d0. At the initial guess z0 := (u0, v0) ∈ N_{d0} ∩ (1/√3)Nµ ∩ (1/√3)Nν ∩ N_{(2/5)ε}, it is easy to verify using the definitions of N_{d0}, Nµ, and Nν that G(u0, v0) = 0. For example, (u0, v0) ∈ (1/√3)Nν gives

QN|c*_{q,n}v0|²/(8dν²) ≤ (1/(8dν²)) · (16d0ν²/3) = 2d0/(3d) < 1,

which immediately implies that the fourth term of G(u0, v0) in (10) is zero. Similar calculations show that all the other terms in G(u0, v0) are zero. The remaining proof is an exact repetition of the proof of Theorem 1 in [10], and uses Lemmas 2, and 4 to produce

‖u_{t+1}v*_{t+1} − h0x0*‖F ≤ (2/3)(1 − ηω)^{(t+1)/2} εd0 + 50‖A*(e)‖₂→₂,

where η is the step size in the gradient descent algorithm and satisfies η ≤ 1/CL for the constant CL defined in Lemma 3, and the constant ω is characterized in Lemma 4. □

Due to space constraints, the proofs of the remaining lemmas are moved to the appendices, which include the proofs of the key lemmas on the local RIP that constitute our main technical contribution.

V. CONCLUSION

We studied the blind deconvolution problem under a practically relevant model of modulated input signals. We discussed several applications to motivate the problem, and presented recovery guarantees. We believe that a better proof technique may show that the regularization term G(h, x) is not required. Moreover, we also conjecture that the approximate recovery guarantees may be improved to exact recovery.
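The convergence behavior stated above — geometric decay of the iterates toward (h0, x0) from a close initialization — can be reproduced on a toy instance. The sketch below runs plain (unregularized) gradient descent, echoing the numerical observation that the regularizer is not needed in practice; the generic random measurement matrices are a stand-in for the structured map A in (7), and the step size and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
M, K, L = 6, 4, 80
B = rng.standard_normal((L, M, K))            # stand-in for the structured map A
A  = lambda X: np.einsum('lmk,mk->l', B, X)
At = lambda z: np.einsum('lmk,l->mk', B, z)

h0, x0 = rng.standard_normal(M), rng.standard_normal(K)
y = A(np.outer(h0, x0))                       # noiseless data

# Initialize close to the truth and iterate h <- h − η ∇F_h, x <- x − η ∇F_x.
h = h0 + 0.05 * rng.standard_normal(M)
x = x0 + 0.05 * rng.standard_normal(K)
eta, errs = 2e-4, []
for _ in range(500):
    G = At(A(np.outer(h, x)) - y)
    h, x = h - eta * (G @ x), x - eta * (G.T @ h)
    errs.append(np.linalg.norm(np.outer(h, x) - np.outer(h0, x0)))
```

The error is measured in the lifted product h x*, which sidesteps the global scaling ambiguity of the factors.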

REFERENCES

[1] A. Curtis, P. Gerstoft, H. Sato, R. Snieder, and K. Wapenaar, "Seismic interferometry — turning noise into signal," The Leading Edge, vol. 25, no. 9, pp. 1082–1092, 2006.
[2] P. Bharadwaj, L. Demanet, and A. Fournier, "Focused blind deconvolution of interferometric Green's functions," in SEG Technical Program Expanded Abstracts 2018. Society of Exploration Geophysicists, 2018, pp. 4085–4090.
[3] J. C. Marron and A. M. Tai, "Passive synthetic aperture imaging," in Adv. Imag. Technol. Commercial Appl., vol. 2566. Int'l Soc. Opt. Photon., 1995, pp. 196–204.
[4] K. G. Sabra and D. R. Dowling, "Blind deconvolution in ocean waveguides using artificial time reversal," J. Acoust. Soc. America, vol. 116, no. 1, pp. 262–271, 2004.
[5] K. G. Sabra, H.-C. Song, and D. R. Dowling, "Ray-based blind deconvolution in ocean sound channels," J. Acoust. Soc. America, vol. 127, no. 2, pp. EL42–EL47, 2010.
[6] N. Tian, S.-H. Byun, K. Sabra, and J. Romberg, "Multichannel myopic deconvolution in underwater acoustic channels via low-rank recovery," J. Acoust. Soc. America, vol. 141, no. 5, pp. 3337–3348, 2017.
[7] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, and W. Kellermann, "Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 114–126, 2012.
[8] A. Ahmed, A. Cosse, and L. Demanet, "A convex approach to blind deconvolution with diverse inputs," in IEEE 6th Int'l Workshop Comput. Adv. Multi-Sensor Adaptive Process. (CAMSAP). IEEE, 2015, pp. 5–8.
[9] A. Ahmed, B. Recht, and J. Romberg, "Blind deconvolution using convex programming," IEEE Trans. Inform. Theory, vol. 60, no. 3, pp. 1711–1732, 2014.
[10] X. Li, S. Ling, T. Strohmer, and K. Wei, "Rapid, robust, and reliable blind deconvolution via nonconvex optimization," Appl. Comput. Harmonic Anal., 2018.
[11] K. Lee, F. Krahmer, and J. Romberg, "Spectral methods for passive imaging: Nonasymptotic performance and robustness," SIAM J. Imag. Sci., vol. 11, no. 3, pp. 2110–2164, 2018.
[12] A. Ahmed and L. Demanet, "Leveraging diversity and sparsity in blind deconvolution," IEEE Trans. Inform. Theory, vol. 64, no. 6, pp. 3975–4000, 2018.
[13] A. Ahmed, "A convex approach to blind MIMO communications," IEEE Wireless Commun. Lett., 2018.
[14] E. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., vol. 9, no. 6, pp. 717–772, 2009.
[15] E. Candes and J. Romberg, "Sparsity and incoherence in compressive sampling," Inverse Problems, vol. 23, no. 3, p. 969, 2007.
[16] P. Jain, P. Netrapalli, and S. Sanghavi, "Low-rank matrix completion using alternating minimization," in Proc. Forty-Fifth Annual ACM Symposium Theory Comput. ACM, 2013, pp. 665–674.
[17] R. Sun and Z.-Q. Luo, "Guaranteed matrix completion via non-convex factorization," IEEE Trans. Inform. Theory, vol. 62, no. 11, pp. 6535–6579, 2016.
[18] E. J. Candes, X. Li, and M. Soltanolkotabi, "Phase retrieval via Wirtinger flow: Theory and algorithms," IEEE Trans. Inform. Theory, vol. 61, no. 4, pp. 1985–2007, 2015.
[19] R. H. Keshavan et al., "Efficient algorithms for collaborative filtering," Ph.D. dissertation, Stanford University, 2012.
[20] M. Hardt, "Understanding alternating minimization for matrix completion," in 2014 IEEE 55th Ann. Symp. Foundations Comput. Science. IEEE, 2014, pp. 651–660.
[21] J. N. Laska, S. Kirolos, M. F. Duarte, T. S. Ragheb, R. G. Baraniuk, and Y. Massoud, "Theory and implementation of an analog-to-information converter using random demodulation," in IEEE Int'l Symposium Circuits Syst. ISCAS. IEEE, 2007, pp. 1959–1962.
[22] J. Tropp, J. Laska, M. Duarte, J. Romberg, and R. Baraniuk, "Beyond Nyquist: Efficient sampling of sparse bandlimited signals," IEEE Trans. Inform. Theory, vol. 56, no. 1, pp. 520–544, 2010.
[27] F. Sroubek and P. Milanfar, "Robust multichannel blind deconvolution via fast alternating minimization," IEEE Trans. Imag. Process., vol. 21, no. 4, pp. 1687–1700, 2012.
[28] S.-H. Byun, C. M. Verlinden, and K. G. Sabra, "Blind deconvolution of shipping sources in an ocean waveguide," J. Acoust. Soc. America, vol. 141, no. 2, pp. 797–807, 2017.
[29] S. C. Douglas, A. Cichocki, and S.-I. Amari, "Multichannel blind separation and deconvolution of sources with arbitrary distributions," in Proc. IEEE Workshop Neural Netw. Signal Process. (1997) VII. IEEE, 1997, pp. 436–445.
[30] S.-i. Amari, S. C. Douglas, A. Cichocki, and H. H. Yang, "Multichannel blind deconvolution and equalization using the natural gradient," in First IEEE Signal Process. Workshop Signal Process. Adv. Wireless Commun. IEEE, 1997, pp. 101–104.
[31] G. Xu, H. Liu, L. Tong, and T. Kailath, "A least-squares approach to blind channel identification," IEEE Trans. Signal Process., vol. 43, no. 12, pp. 2982–2993, 1995.
[32] E. Moulines, P. Duhamel, J.-F. Cardoso, and S. Mayrargue, "Subspace methods for the blind identification of multichannel FIR filters," IEEE Trans. Signal Process., vol. 43, no. 2, pp. 516–525, 1995.
[33] S. Subramaniam, A. P. Petropulu, and C. Wendt, "Cepstrum-based deconvolution for speech dereverberation," IEEE Trans. Speech, Audio Process., vol. 4, no. 5, pp. 392–396, 1996.
[34] X. Lin, N. D. Gaubitch, and P. A. Naylor, "Two-stage blind identification of SIMO systems with common zeros," in 14th European Signal Process. Conf., 2006. IEEE, 2006, pp. 1–5.
[35] Y. A. Huang and J. Benesty, "Adaptive multi-channel least mean square and Newton algorithms for blind channel identification," Signal Process., vol. 82, no. 8, pp. 1127–1138, 2002.
[36] M. Talagrand, The Generic Chaining. Springer Monographs in Mathematics, 2005.
[37] C. Ma, K. Wang, Y. Chi, and Y. Chen, "Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution," arXiv preprint arXiv:1711.10467, 2017.
[38] L. Balzano and R. Nowak, "Blind calibration of sensor networks," in Proceedings of the 6th International Conference on Information Processing in Sensor Networks. ACM, 2007, pp. 79–88.
[39] S. Ling and T. Strohmer, "Self-calibration and bilinear inverse problems via linear least squares," SIAM J. Imag. Sci., vol. 11, no. 1, pp. 252–292, 2018.
[40] Y. Li, K. Lee, and Y. Bresler, "Blind gain and phase calibration via sparse spectral methods," IEEE Trans. Inform. Theory, 2018.
[41] J. Romberg, N. Tian, and K. Sabra, "Multichannel blind deconvolution using low rank recovery," in Independent Component Analyses, Compressive Sampling, Wavelets, Neural Net, Biosystems, and Nanoengineering XI, vol. 8750. Int'l Soc. Opt. Photon., 2013, p. 87500E.
[42] Y. Li, K. Lee, and Y. Bresler, "Optimal sample complexity for blind gain and phase calibration," IEEE Trans. Signal Processing, vol. 64, no. 21, pp. 5549–5556, 2016.
[43] L. Wang and Y. Chi, "Blind deconvolution from multiple sparse inputs," IEEE Signal Process. Letters, vol. 23, no. 10, pp. 1384–1388, 2016.
[44] Y. Li and Y. Bresler, "Global geometry of multichannel sparse blind deconvolution on the sphere," in Advances Neural Inform. Process. Syst., 2018, pp. 1140–1151.
[45] A. Cosse, "A note on the blind deconvolution of multiple sparse signals from unknown subspaces," in Wavelets and Sparsity XVII, vol. 10394. International Society for Optics and Photonics, 2017, p. 103941N.
[46] L. Tong, G. Xu, B. Hassibi, and T. Kailath, "Blind channel identification based on second-order statistics: A frequency-domain approach," IEEE Trans. Inform. Theory, vol. 41, no. 1, pp. 329–334, 1995.
[47] L. Tong, G. Xu, and T. Kailath, "Blind identification and equalization based on second-order statistics: A time domain approach," IEEE Trans. Inform. Theory, vol. 40, no. 2, pp. 340–349, 1994.
[48] P. Campisi and K. Egiazarian, Blind Image Deconvolution: Theory and Applications. CRC Press, 2016.
[49] L. Tong and S.
Perreau, “Multichannel blind identification: From sub- [23] Ali Ahmed and Justin Romberg, “Compressive multiplexing of corre- space to maximum likelihood methods,” Proc. IEEE, vol. 86, no. 10, lated signals,” IEEE Trans. Inform. Th., vol. 1, pp. 479–498, 2015. pp. 1951–1968, 1998. [24] A. Ahmed and J. Romberg, “Compressive sampling of ensembles of [50] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof correlated signals,” arXiv preprint arXiv:1501.06654, 2015. of the restricted isometry property for random matrices,” Constructive [25] S. Bahmani and J. Romberg, “Lifting for blind deconvolution in random Approximation, vol. 28, no. 3, pp. 253–263, 2008. mask imaging: Identifiability and convex relaxation,” SIAM J. Imag. Sci., [51] F. Krahmer, S. Mendelson, and H. Rauhut, “Suprema of chaos processes vol. 8, no. 4, pp. 2203–2238, 2015. and the restricted isometry property,” Commun. Pure Appl. Math., 2014. [26] G. Harikumar and Y. Bresler, “Perfect blind restoration of images blurred [52] R. M. Dudley, “The sizes of compact subsets of hilbert space and by multiple filters: Theory and efficient algorithms,” IEEE Trans. Imag. continuity of gaussian processes,” J. Funct. Analysis, vol. 1, no. 3, pp. Process., vol. 8, no. 2, pp. 202–219, 1999. 290–330, 1967. 14

[53] H. Rauhut, “Compressive sensing and structured random matrices,” . Theoretical foundations and numerical methods for sparse recovery, vol. 9, pp. 1–92, 2010. [54] J. Tropp, “User-friendly tail bounds for sums of random matrices,” APPENDIX Found. Comput. Math., vol. 12, no. 4, pp. 389–434, 2012. [55] R. Escalante and M. Raydan, Alternating projection methods. SIAM, We now complete the proofs of lemmas not covered in the 2011, vol. 8. main body of the paper due to space constraints. We begin with the local RIP proofs that constitute our main technical contribution, and rely on the chaining arguments.
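Before turning to the proofs, it may help to recall the measurement model from the abstract, $y_n = (s_n \odot r_n) \circledast h$: each input is pointwise modulated by a random sign sequence and then circularly convolved with the common filter. A minimal numerical sketch (all dimensions and variable names here are illustrative choices, not the paper's notation for any algorithm):

```python
import numpy as np

def modulated_conv_measurements(s_list, r_list, h, L):
    """y_n = (s_n . r_n) circularly convolved with h over L points,
    computed via the FFT (convolution theorem)."""
    y = []
    for s, r in zip(s_list, r_list):
        g = np.zeros(L, dtype=complex); g[:len(s)] = s * r    # modulated input, zero-padded
        hp = np.zeros(L, dtype=complex); hp[:len(h)] = h      # filter, zero-padded
        y.append(np.fft.ifft(np.fft.fft(g) * np.fft.fft(hp))) # circular convolution
    return y
```

The FFT route is only a convenience; the same measurements can be produced by the direct $O(L^2)$ circular-convolution sum.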

A. Proof of Lemma 7

Recall that $\mathcal{A}$ in (7) maps the unknowns $h_0x_0^*$ to the noiseless convolution measurements in the Fourier domain. Using the isometry of the DFT matrix $F$ (an $L \times L$ normalized DFT matrix), we have
\[
\|\mathcal{A}(h_0x_0^*)\|_2^2 = \sum_{n=1}^{N}\|h_0 \circledast R_nCx_{0,n}\|_2^2 = \sum_{n=1}^{N}\|\mathrm{circ}(h_0)\,\mathrm{diag}(Cx_{0,n})\,r_n\|_2^2,
\]

where as before $x_0 = \mathrm{vec}([x_{0,n}])$, and $R_n = \mathrm{diag}(r_n)$. A similar calculation shows that
\[
\|\mathcal{A}(hx^* - h_0x_0^*)\|_2^2 = \sum_{n=1}^{N}\big\|\mathrm{circ}(h)\,\mathrm{diag}(Cx_n)\,r_n - \mathrm{circ}(h_0)\,\mathrm{diag}(Cx_{0,n})\,r_n\big\|_2^2 = \|(H_hX_x - H_{h_0}X_{x_0})\,r\|_2^2, \quad (40)
\]

where $r = \mathrm{vec}([r_n])$, and the block-diagonal matrices $H_h \in \mathbb{C}^{LN \times QN}$, and $X_x \in \mathbb{C}^{QN \times QN}$ are defined as
\[
H_h := [\mathrm{circ}(h)]^{\otimes N}, \quad \text{and} \quad X_x := [\mathrm{diag}(Cx_n)]^{\otimes N}. \quad (41)
\]
The empirical process in (40) is known as a second-order chaos process, where $r$ is a standard Rademacher $QN$-vector, and

$(H_hX_x - H_{h_0}X_{x_0})$ is a deterministic matrix. The expected value of this random quantity is simply
\[
\mathbb{E}\|(H_hX_x - H_{h_0}X_{x_0})r\|_2^2 = \|H_hX_x - H_{h_0}X_{x_0}\|_F^2. \quad (42)
\]

Recall that $\circledast$ denotes $L$-point circular convolution, and therefore $\mathrm{circ}(h) = F^*\mathrm{diag}(\hat{h})F_Q \in \mathbb{C}^{L \times Q}$, where $\hat{h} = \sqrt{L}F_Mh$. Note the simple identity $\|H_hX_x - H_{h_0}X_{x_0}\|_F^2 = \|hx^* - h_0x_0^*\|_F^2$; its proof follows from the couple of simple steps below:
\[
\|H_hX_x - H_{h_0}X_{x_0}\|_F^2 = \sum_n\big\|F^*\mathrm{diag}(\hat{h})F_Q\mathrm{diag}(Cx_n) - F^*\mathrm{diag}(\hat{h}_0)F_Q\mathrm{diag}(Cx_{0,n})\big\|_F^2
\]
\[
= \sum_n\big\|\mathrm{diag}(\hat{h})F_Q\mathrm{diag}(Cx_n) - \mathrm{diag}(\hat{h}_0)F_Q\mathrm{diag}(Cx_{0,n})\big\|_F^2 = \sum_n\big\|F_Q \odot \big(\hat{h}(Cx_n)^* - \hat{h}_0(Cx_{0,n})^*\big)\big\|_F^2 = \|hx^* - h_0x_0^*\|_F^2, \quad (43)
\]
where the last two equalities follow from the facts that $C^*C = I_K$, $(F_M)^*F_M = I_M$, and that the entries of the DFT matrix $\sqrt{L}F_Q$ have unit magnitude. Define the sets $\mathcal{H}$, and $\mathcal{X}$, indexed by $h$, and $x$, respectively, as below:

\[
\mathcal{H} := \{H_h \,|\, h \in \mathcal{N}_{d_0} \cap \mathcal{N}_\mu\}, \quad \mathcal{X} := \{X_x \,|\, x \in \mathcal{N}_{d_0} \cap \mathcal{N}_\nu\}. \quad (44)
\]
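The identity (43) can be confirmed numerically by building the blocks $\mathrm{circ}(h)\,\mathrm{diag}(Cx_n)$ of $H_hX_x$ directly from the factorization $\mathrm{circ}(h) = F^*\mathrm{diag}(\hat{h})F_Q$. A minimal sketch (the dimensions and the orthonormal-column matrix $C$ are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(1)
L, Q, M, K, N = 16, 8, 3, 2, 2
F = np.fft.fft(np.eye(L)) / np.sqrt(L)             # normalized L x L DFT matrix
FQ, FM = F[:, :Q], F[:, :M]                        # first Q / first M columns
C = np.linalg.qr(rng.standard_normal((Q, K)))[0]   # Q x K with orthonormal columns

def circ(h):
    # circ(h) = F* diag(h_hat) F_Q with h_hat = sqrt(L) F_M h
    return F.conj().T @ np.diag(np.sqrt(L) * (FM @ h)) @ FQ

def blocks(h, xs):
    # diagonal blocks of H_h X_x: circ(h) diag(C x_n), n = 1..N
    return [circ(h) @ np.diag(C @ x) for x in xs]

h, h0 = rng.standard_normal(M), rng.standard_normal(M)
xs = [rng.standard_normal(K) for _ in range(N)]
x0s = [rng.standard_normal(K) for _ in range(N)]
lhs = sum(np.linalg.norm(B - B0, 'fro')**2 for B, B0 in zip(blocks(h, xs), blocks(h0, x0s)))
rhs = np.linalg.norm(np.outer(h, np.concatenate(xs)) - np.outer(h0, np.concatenate(x0s)), 'fro')**2
print(abs(lhs - rhs))   # ~ 0 up to floating-point rounding
```

The match holds for any choice of subspaces because, as the proof notes, only $C^*C = I_K$, $F_M^*F_M = I_M$, and the unit-modulus entries of $\sqrt{L}F_Q$ are used.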

Using (40), (42), and the identity (43), the local-RIP over all $(h, x)$ such that $\|hx^* - h_0x_0^*\|_F = \|H_hX_x - H_{h_0}X_{x_0}\|_F = \delta d_0$, as stated in Lemma 7, can be restated as the event
\[
\sup_{H_h \in \mathcal{H}}\sup_{X_x \in \mathcal{X}}\Big|\|(H_hX_x - H_{h_0}X_{x_0})r\|_2^2 - \mathbb{E}\|(H_hX_x - H_{h_0}X_{x_0})r\|_2^2\Big| \le \xi\delta^2d_0^2
\]
holding with high probability for a $\xi \in (0, 1)$.

This bound depends on the geometric complexity of both of the sets $\mathcal{X}$, and $\mathcal{H}$. The definition of this complexity is subtle and is measured in terms of Talagrand's $\gamma_2$-functional [36] for these sets relative to two different distance metrics. Given a set $\mathcal{S}$, and a distance defined by a norm $\|\cdot\|$, the $\gamma_2$-functional quantifies how well $\mathcal{S}$ can be approximated at different scales; it can be directly related to the rate at which the size of the best $\epsilon$-cover of $\mathcal{S}$ grows as $\epsilon$ decreases. Although this is a purely geometric characteristic of the set $\mathcal{S}$, the $\gamma_2$-functional gives a tight bound on the supremum of a Gaussian process: for example, if $G$ is a random matrix whose entries are independent and distributed $\mathrm{Normal}(0, 1)$, then $\mathbb{E}\sup_{S \in \mathcal{S}}\langle S, G\rangle \sim \gamma_2(\mathcal{S}, \|\cdot\|_F)$.

Along with $\gamma_2$, the other geometric quantities that appear in the final bound are the diameters $d_{2\to2}(\mathcal{S}) := \sup_{S \in \mathcal{S}}\|S\|_{2\to2}$, and $d_F(\mathcal{S}) := \sup_{S \in \mathcal{S}}\|S\|_F$ of the set $\mathcal{S}$ with respect to the operator, and Frobenius (sum of squares) norms, respectively. Since the random quantity $\|(H_hX_x - H_{h_0}X_{x_0})r\|_2^2$ is a second-order chaos process, we now present a result [51] that controls the deviation of a general second-order chaos process from its mean in terms of the geometric quantities introduced above.

Theorem 3 (Theorem 3.1 in [51]). Let $\mathcal{S}$ be a set of matrices, and $r$ be a random vector whose entries $r_j$ are independent, mean-zero, variance-one, and $\alpha$-subgaussian random variables. Let $d_F(\mathcal{S})$, and $d_{2\to2}(\mathcal{S})$ denote the diameters of $\mathcal{S}$ under the $\|\cdot\|_F$, and $\|\cdot\|_{2\to2}$ norms. Set
\[
E = \gamma_2(\mathcal{S}, \|\cdot\|_{2\to2})\big(\gamma_2(\mathcal{S}, \|\cdot\|_{2\to2}) + d_F(\mathcal{S})\big) + d_F(\mathcal{S})\,d_{2\to2}(\mathcal{S}),
\]
\[
V = d_{2\to2}(\mathcal{S})\big(\gamma_2(\mathcal{S}, \|\cdot\|_{2\to2}) + d_F(\mathcal{S})\big), \quad \text{and} \quad U = d_{2\to2}^2(\mathcal{S}).
\]
Then for $t \ge 0$,
\[
\mathbb{P}\Big(\sup_{S \in \mathcal{S}}\big|\|Sr\|_2^2 - \mathbb{E}\|Sr\|_2^2\big| \ge c_1E + t\Big) \le 2\exp\Big(-c_2\min\Big\{\frac{t^2}{V^2}, \frac{t}{U}\Big\}\Big).
\]
The constants $c_1$, and $c_2$ depend only on $\alpha$.

The proof of the local restricted isometry property is an application of the above result. For a fixed $H_{h_0} \in \mathcal{H}$, and $X_{x_0} \in \mathcal{X}$, we start by defining the set of matrices $\mathcal{S} := \{H_hX_x - H_{h_0}X_{x_0} \,|\, H_h \in \mathcal{H}, X_x \in \mathcal{X}\}$. Recall that $\mathrm{circ}(h) = F^*\mathrm{diag}(\hat{h})F_Q$ is an $L \times Q$ circulant matrix, and again $F$ is an $L \times L$ normalized DFT matrix. Using (44), it is now easy to see that
\[
\sup_{H_h \in \mathcal{H}}\|H_h\|_{2\to2} = \|F^*\mathrm{diag}(\hat{h})F_Q\|_{2\to2} = \|\mathrm{diag}(\hat{h})F_Q\|_{2\to2} = \sqrt{L}\|F_Mh\|_\infty \le 4\mu\sqrt{d_0},
\]
and likewise $\|H_{h_0}\|_{2\to2} = \mu\sqrt{d_0}$. Similarly, for $X_x = [\mathrm{diag}(Cx_n)]^{\otimes N}$, we have
\[
\sup_{X_x \in \mathcal{X}}\|X_x\|_{2\to2} = \|C^{\otimes N}x\|_\infty \le \frac{4\nu\sqrt{d_0}}{\sqrt{QN}}, \quad \|X_{x_0}\|_{2\to2} = \|C^{\otimes N}x_0\|_\infty = \frac{\nu\sqrt{d_0}}{\sqrt{QN}}.
\]
An upper bound on the diameter $d_{2\to2}(\mathcal{S})$ can then be obtained as
\[
d_{2\to2}(\mathcal{S}) = \sup_{H_h \in \mathcal{H}}\sup_{X_x \in \mathcal{X}}\|H_hX_x - H_{h_0}X_{x_0}\|_{2\to2} \le \sup_{H_h \in \mathcal{H}}\|H_h\|_{2\to2}\cdot\sup_{X_x \in \mathcal{X}}\|X_x\|_{2\to2} + \|H_{h_0}\|_{2\to2}\|X_{x_0}\|_{2\to2} \le \frac{1}{\sqrt{QN}}\big(4\mu\sqrt{d_0}\cdot4\nu\sqrt{d_0} + \mu\sqrt{d_0}\cdot\nu\sqrt{d_0}\big) = \frac{17\mu\nu d_0}{\sqrt{QN}}. \quad (45)
\]
Since we only consider $(h, x)$ such that $\|H_hX_x - H_{h_0}X_{x_0}\|_F = \delta d_0$, the Frobenius diameter is then simply
\[
d_F(\mathcal{S}) = \sup_{H_h \in \mathcal{H}}\sup_{X_x \in \mathcal{X}}\|H_hX_x - H_{h_0}X_{x_0}\|_F = \delta d_0. \quad (46)
\]
The $\gamma_2$-functional can be directly related to the complexity of the space under consideration. To make this precise, we need to introduce a covering set. A set $\mathcal{C}$ is an $\epsilon$-cover of the set $\mathcal{S}$ in the norm $\|\cdot\|$ if every point in the (infinite) set $\mathcal{S}$ is within a distance $\epsilon$ of the finite set $\mathcal{C}$:
\[
\sup_{S \in \mathcal{S}}\min_{C \in \mathcal{C}}\|C - S\| \le \epsilon.
\]
The covering number $N(\mathcal{S}, \|\cdot\|, \epsilon)$ is the size of the smallest $\epsilon$-cover of $\mathcal{S}$. We can bound the $\gamma_2$-functional in terms of covering numbers using Dudley's integral [36], [52]:
\[
\gamma_2(\mathcal{S}, \|\cdot\|_{2\to2}) \le c\int_0^{d_{2\to2}(\mathcal{S})}\sqrt{\log N(\mathcal{S}, \|\cdot\|_{2\to2}, \epsilon)}\,\mathrm{d}\epsilon, \quad (47)
\]
where $c$ is a known constant, and $d_{2\to2}(\mathcal{S})$ is the diameter of $\mathcal{S}$ in the operator norm $\|\cdot\|_{2\to2}$. The distance between $H_{\tilde{h}}X_{\tilde{x}} - H_{h_0}X_{x_0} \in \mathcal{C} \subseteq \mathcal{S}$, and $H_hX_x - H_{h_0}X_{x_0} \in \mathcal{S}$ is
\[
\|(H_{\tilde{h}}X_{\tilde{x}} - H_{h_0}X_{x_0}) - (H_hX_x - H_{h_0}X_{x_0})\|_{2\to2} = \|H_{\tilde{h}}X_{\tilde{x}} - H_hX_x\|_{2\to2} \le \|H_{\tilde{h}}\|_{2\to2}\|X_{\tilde{x}} - X_x\|_{2\to2} + \|X_x\|_{2\to2}\|H_{\tilde{h}} - H_h\|_{2\to2}
\]
\[
= \sqrt{L}\big(\|F_M\tilde{h}\|_\infty\|C^{\otimes N}(\tilde{x} - x)\|_\infty + \|C^{\otimes N}x\|_\infty\|F_M(\tilde{h} - h)\|_\infty\big) \le \frac{4\mu\sqrt{d_0}}{\sqrt{Q}}\|\tilde{x} - x\|_c + \frac{4\nu\sqrt{d_0}}{\sqrt{QN}}\|\tilde{h} - h\|_f. \quad (48)
\]
The last inequality follows from the fact that $H_{\tilde{h}} \in \mathcal{H}$, and $X_x \in \mathcal{X}$, implying that $\tilde{h} \in \mathcal{N}_\mu \cap \mathcal{N}_{d_0}$, and $x \in \mathcal{N}_\nu \cap \mathcal{N}_{d_0}$. From the distance measure in (48), it is clear that setting the norms to
\[
\|\tilde{x} - x\|_c := \sqrt{Q}\|C^{\otimes N}(\tilde{x} - x)\|_\infty \le \frac{\epsilon}{2}\cdot\frac{\sqrt{Q}}{4\mu\sqrt{d_0}}, \quad \|\tilde{h} - h\|_f := \sqrt{L}\|F_M(\tilde{h} - h)\|_\infty \le \frac{\epsilon}{2}\cdot\frac{\sqrt{QN}}{4\nu\sqrt{d_0}} \quad (49)
\]
gives $\|H_{\tilde{h}}X_{\tilde{x}} - H_hX_x\|_{2\to2} \le \epsilon$. Precisely, if for every point $H_hX_x - H_{h_0}X_{x_0} \in \mathcal{S}$, where $H_h \in \mathcal{H}$, and $X_x \in \mathcal{X}$, there exists an $X_{\tilde{x}} \in \mathcal{C}_{\mathcal{X}}$ such that $\|x - \tilde{x}\|_c \le \epsilon\sqrt{Q}/8\mu\sqrt{d_0}$ ($\mathcal{C}_{\mathcal{X}}$ is an $\epsilon\sqrt{Q}/8\mu\sqrt{d_0}$-cover of $\mathcal{X}$ in the $\|\cdot\|_c$ norm), and an $H_{\tilde{h}} \in \mathcal{C}_{\mathcal{H}}$ such that $\|h - \tilde{h}\|_f \le \epsilon\sqrt{QN}/8\nu\sqrt{d_0}$ ($\mathcal{C}_{\mathcal{H}}$ is an $\epsilon\sqrt{QN}/8\nu\sqrt{d_0}$-cover of $\mathcal{H}$ in the $\|\cdot\|_f$ norm), then from (48) it is clear that the point $H_{\tilde{h}}X_{\tilde{x}} - H_{h_0}X_{x_0}$ obeys $\|(H_{\tilde{h}}X_{\tilde{x}} - H_{h_0}X_{x_0}) - (H_hX_x - H_{h_0}X_{x_0})\|_{2\to2} \le \epsilon$. This implies that $\mathcal{C} := \{H_{\tilde{h}}X_{\tilde{x}} - H_{h_0}X_{x_0} : H_{\tilde{h}} \in \mathcal{C}_{\mathcal{H}}, X_{\tilde{x}} \in \mathcal{C}_{\mathcal{X}}\}$ is an $\epsilon$-cover of $\mathcal{S}$ in the $\|\cdot\|_{2\to2}$ norm, and
\[
N(\mathcal{S}, \|\cdot\|_{2\to2}, \epsilon) \le N\Big(\mathcal{X}, \|\cdot\|_c, \frac{\epsilon\sqrt{Q}}{8\mu\sqrt{d_0}}\Big)\cdot N\Big(\mathcal{H}, \|\cdot\|_f, \frac{\epsilon\sqrt{QN}}{8\nu\sqrt{d_0}}\Big) \le N\Big(2\sqrt{d_0}B_2^{KN}, \|\cdot\|_c, \frac{\epsilon\sqrt{Q}}{8\mu\sqrt{d_0}}\Big)\cdot N\Big(2\sqrt{d_0}B_2^M, \|\cdot\|_f, \frac{\epsilon\sqrt{QN}}{8\nu\sqrt{d_0}}\Big)
\]
\[
= N\Big(B_2^{KN}, \|\cdot\|_c, \frac{\epsilon\sqrt{Q}}{16\mu d_0}\Big)\cdot N\Big(B_2^M, \|\cdot\|_f, \frac{\epsilon\sqrt{QN}}{16\nu d_0}\Big).
\]
We evaluate the Dudley integral as follows:
\[
\int_0^{d_{2\to2}(\mathcal{S})}\sqrt{\log N(\mathcal{S}, \|\cdot\|_{2\to2}, \epsilon)}\,\mathrm{d}\epsilon \le \int_0^{\frac{17\mu\nu d_0}{\sqrt{QN}}}\bigg(\sqrt{\log N\Big(B_2^{KN}, \|\cdot\|_c, \frac{\epsilon\sqrt{Q}}{16\mu d_0}\Big)} + \sqrt{\log N\Big(B_2^M, \|\cdot\|_f, \frac{\epsilon\sqrt{QN}}{16\nu d_0}\Big)}\bigg)\mathrm{d}\epsilon
\]
\[
= \frac{16\mu d_0}{\sqrt{Q}}\int_0^{\frac{17\nu}{16\sqrt{N}}}\sqrt{\log N(B_2^{KN}, \|\cdot\|_c, \epsilon)}\,\mathrm{d}\epsilon + \frac{16\nu d_0}{\sqrt{QN}}\int_0^{\frac{17\mu}{16}}\sqrt{\log N(B_2^M, \|\cdot\|_f, \epsilon)}\,\mathrm{d}\epsilon
\]
\[
\le \frac{16\mu d_0}{\sqrt{Q}}\sqrt{KN}\int_0^{\frac{\sqrt{2}\nu}{\sqrt{N}}}\sqrt{\log N(B_1^{KN}, \|\cdot\|_c, \epsilon)}\,\mathrm{d}\epsilon + \frac{16\nu d_0}{\sqrt{QN}}\sqrt{M}\int_0^{\sqrt{2}\mu}\sqrt{\log N(B_1^M, \|\cdot\|_f, \epsilon)}\,\mathrm{d}\epsilon \lesssim \frac{\mu\nu_{\max}d_0}{\sqrt{Q}}\sqrt{KN\log^4(QN)} + \frac{\nu d_0}{\sqrt{QN}}\sqrt{M\log^4L}, \quad (50)
\]
where the second-to-last inequality follows from $B_2^{KN} \subseteq \sqrt{KN}B_1^{KN}$, and $B_2^M \subseteq \sqrt{M}B_1^M$, and the last inequality is the result of by-now-standard entropy calculations that can be found in, for example, [51], and Section 8.4 in [53]. Combining this result with Dudley's integral in (47) gives a bound on the $\gamma_2$-functional. Observe further that $\nu^2 = QN\|C^{\otimes N}x\|_\infty^2/\|x\|_2^2 \le \nu_{\max}^2KN^2$; using this fact, we have
\[
d_{2\to2}^2(\mathcal{S}) \le \frac{17\mu^2\nu^2d_0^2}{QN} \lesssim \frac{\mu^2\nu_{\max}^2KN^2d_0^2}{QN}. \quad (51)
\]
We now have all the ingredients required in Theorem 3. Recall that $\delta = \|hx^* - h_0x_0^*\|_F/d_0 = \|H_hX_x - H_{h_0}X_{x_0}\|_F/d_0$. The upper bounds in (46), (50), and (51) produce
\[
E \lesssim d_0^2\bigg[\Big(\frac{\mu^2\nu_{\max}^2KN}{Q} + \frac{\nu^2M}{QN}\Big)\log^4(QN + L) + \delta\sqrt{\Big(\frac{\mu^2\nu_{\max}^2KN}{Q} + \frac{\nu^2M}{QN}\Big)\log^4(QN + L)} + \delta\sqrt{\frac{\mu^2\nu_{\max}^2KN^2}{QN}}\bigg].
\]
Similarly, (46), (50), and (45) give
\[
U \lesssim \frac{\mu^2\nu^2d_0^2}{QN}, \quad V \lesssim \frac{\mu\nu d_0}{\sqrt{QN}}\bigg(\frac{\mu\nu_{\max}d_0}{\sqrt{Q}}\sqrt{KN\log^4(QN)} + \frac{\nu d_0}{\sqrt{QN}}\sqrt{M\log^4L} + \delta d_0\bigg).
\]
Using the fact that $L \ge Q$, choosing $QN$ as in (38), and $t = \frac{1}{2}\xi\delta^2d_0^2$, the tail bound in Theorem 3 now gives
\[
\mathbb{P}\bigg(\sup_{H_h \in \mathcal{H}}\sup_{X_x \in \mathcal{X}}\Big|\|(H_hX_x - H_{h_0}X_{x_0})r\|_2^2 - \|H_hX_x - H_{h_0}X_{x_0}\|_F^2\Big| \ge \xi\delta^2d_0^2\bigg) \le 2\exp\Big(-c\xi^2\delta^2\frac{QN}{\mu^2\nu^2}\Big),
\]
which completes the proof.

B. Proof of Lemma 8

Just as $\Delta h$, and $\Delta x$ in (29), we define $\Delta H_h = H_h - \alpha H_{h_0}$, and $\Delta X_x = X_x - \bar{\alpha}^{-1}X_{x_0}$. Similar to (43), one can also show that $\|\Delta H_hX_x + H_h\Delta X_x\|_F = \|\Delta hx^* + h\Delta x^*\|_F \le 1.2\delta d_0$, where the last inequality is already shown in (36), and holds for $\delta \le \epsilon \le 1/15$, where $\delta = \|hx^* - h_0x_0^*\|_F/d_0$. Using similar steps as laid out in the proof of Lemma 7, the local-RIP in Lemma 8 reduces to showing that for a $0 < \xi_1 < 1$, the bound
\[
\sup_{H_h \in \mathcal{H}}\sup_{X_x \in \mathcal{X}}\Big|\|(\Delta H_hX_x + H_h\Delta X_x)r\|_2^2 - \mathbb{E}\|(\Delta H_hX_x + H_h\Delta X_x)r\|_2^2\Big| \le \xi_1\delta^2d_0^2
\]
holds with high probability. Define
\[
\mathcal{S} = \{\Delta H_hX_x + H_h\Delta X_x \,|\, H_h \in \mathcal{H}, X_x \in \mathcal{X}, (h, x) \in \mathcal{N}_\epsilon\}, \quad (52)
\]
where $\mathcal{H}$, and $\mathcal{X}$ are already defined in (44). Let $(\Delta H_hX_x + H_h\Delta X_x)$, and $(\Delta H_{\tilde{h}}X_{\tilde{x}} + H_{\tilde{h}}\Delta X_{\tilde{x}})$ be elements of $\mathcal{S}$, and observe that
\[
(\Delta H_hX_x + H_h\Delta X_x) - (\Delta H_{\tilde{h}}X_{\tilde{x}} + H_{\tilde{h}}\Delta X_{\tilde{x}}) = (\Delta H_h - \Delta H_{\tilde{h}})X_{\tilde{x}} + \Delta H_h(X_x - X_{\tilde{x}}) + (H_h - H_{\tilde{h}})\Delta X_x + H_{\tilde{h}}(\Delta X_x - \Delta X_{\tilde{x}}),
\]
which gives
\[
\|(\Delta H_hX_x + H_h\Delta X_x) - (\Delta H_{\tilde{h}}X_{\tilde{x}} + H_{\tilde{h}}\Delta X_{\tilde{x}})\|_{2\to2} \le \|H_h - H_{\tilde{h}}\|_{2\to2}\|X_{\tilde{x}}\|_{2\to2} + \|\Delta H_h\|_{2\to2}\|X_x - X_{\tilde{x}}\|_{2\to2} + \|H_h - H_{\tilde{h}}\|_{2\to2}\|\Delta X_x\|_{2\to2} + \|H_{\tilde{h}}\|_{2\to2}\|X_x - X_{\tilde{x}}\|_{2\to2}
\]
\[
= \sqrt{L}\Big(\|F_M(h - \tilde{h})\|_\infty\|C^{\otimes N}\tilde{x}\|_\infty + \|F_M(\Delta h)\|_\infty\|C^{\otimes N}(x - \tilde{x})\|_\infty + \|F_M(h - \tilde{h})\|_\infty\|C^{\otimes N}(\Delta x)\|_\infty + \|F_M\tilde{h}\|_\infty\|C^{\otimes N}(x - \tilde{x})\|_\infty\Big).
\]
As is clear from the definition of the set $\mathcal{S}$, the index vectors $(h, x)$, and $(\tilde{h}, \tilde{x})$ of the elements of $\mathcal{S}$ lie in $\mathcal{N}_{d_0} \cap \mathcal{N}_\mu \cap \mathcal{N}_\nu \cap \mathcal{N}_\epsilon$, and by assumption $\epsilon \le 1/15$; therefore, we have $\sqrt{L}\|F_M\Delta h\|_\infty \le 6\mu\sqrt{d_0}$, and $\sqrt{QN}\|C^{\otimes N}\Delta x\|_\infty \le 6\nu\sqrt{d_0}$ using Lemma 1. This results in
\[
\|(\Delta H_hX_x + H_h\Delta X_x) - (\Delta H_{\tilde{h}}X_{\tilde{x}} + H_{\tilde{h}}\Delta X_{\tilde{x}})\|_{2\to2} \le \frac{4\nu\sqrt{d_0}}{\sqrt{QN}}\sqrt{L}\|F_M(h - \tilde{h})\|_\infty + 6\mu\sqrt{d_0}\|C^{\otimes N}(x - \tilde{x})\|_\infty + \frac{6\nu\sqrt{d_0}}{\sqrt{QN}}\sqrt{L}\|F_M(h - \tilde{h})\|_\infty + 4\mu\sqrt{d_0}\|C^{\otimes N}(x - \tilde{x})\|_\infty
\]
\[
= \frac{10\nu\sqrt{d_0}}{\sqrt{QN}}\|h - \tilde{h}\|_f + \frac{10\mu\sqrt{d_0}}{\sqrt{Q}}\|x - \tilde{x}\|_c,
\]
where the last equality follows by using the $\|\cdot\|_c$, and $\|\cdot\|_f$ norms defined earlier in (49). Similar to the discussion before, this means that an $\epsilon$-cover of $\mathcal{S}$ in (52) is obtained by an $\epsilon\sqrt{QN}/20\nu\sqrt{d_0}$-cover of $\mathcal{H}$ in the $\|\cdot\|_f$ norm, and an $\epsilon\sqrt{Q}/20\mu\sqrt{d_0}$-cover of $\mathcal{X}$ in the $\|\cdot\|_c$ norm. With this fact in place, the rest of the proof follows exactly the same outline as the proof of Lemma 7.
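Both chaining arguments above rest on the elementary fact (42) that a Rademacher chaos process has mean $\mathbb{E}\|Sr\|_2^2 = \|S\|_F^2$. For a small matrix this can be checked exhaustively over all sign vectors (the matrix size here is an arbitrary test choice):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
S = rng.standard_normal((5, 7))
# exact expectation over all 2^7 Rademacher sign vectors r
mean = np.mean([np.linalg.norm(S @ np.array(eps))**2
                for eps in product([-1.0, 1.0], repeat=7)])
print(mean, np.linalg.norm(S, 'fro')**2)   # the two values agree
```

The agreement is exact because the cross terms $\mathbb{E}\,r_jr_k = 0$ vanish for $j \ne k$, leaving only the squared entries of $S$.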

C. Proof of Lemma 1

Recall that $\alpha_1 = h^*h_0/d_0$, which directly gives $|\alpha_1| \le \|h\|_2\|h_0\|_2/d_0 \le 2$ by the Cauchy-Schwarz inequality, and the fact that $h \in \mathcal{N}_{d_0}$. In a similar manner, we can also show that $|\alpha_2| \le 2$. Expand $\|hx^* - h_0x_0^*\|_F^2 = \delta^2d_0^2$ using (30) to obtain
\[
\delta^2d_0^2 = (\alpha_1\bar{\alpha}_2 - 1)^2d_0^2 + |\bar{\alpha}_2|^2\|\tilde{h}\|_2^2d_0 + |\alpha_1|^2\|\tilde{x}\|_2^2d_0 + \|\tilde{h}\|_2^2\|\tilde{x}\|_2^2,
\]
which implies $|\alpha_1\bar{\alpha}_2 - 1| \le \delta$. The identities $\|\Delta h\|_2^2 \le 6.1\delta^2d_0$, $\|\Delta x\|_2^2 \le 6.1\delta^2d_0$, $\|\Delta h\|_2^2\|\Delta x\|_2^2 \le 8.4\delta^4d_0^2$, and $\sqrt{L}\|F_M\Delta h\|_\infty \le 6\mu\sqrt{d_0}$ are proved in Lemma 5.15 in [10]. We now prove that $\max_n\sqrt{QN}\|C(\Delta x_n)\|_\infty \le 6\nu\sqrt{d_0}$.

Case 1: $\|h\|_2 \ge \|x\|_2$, and $\alpha = (1 - \delta_0)\alpha_1$. Observe that in this case
\[
|\alpha_2| \le \frac{\|x\|_2\|x_0\|_2}{d_0} \le \sqrt{\frac{\|h\|_2\|x\|_2}{d_0}} \le \sqrt{\frac{\|hx^* - h_0x_0^*\|_F + \|h_0x_0^*\|_F}{d_0}} = \sqrt{1 + \delta},
\]
where we used the facts that $\|hx^* - h_0x_0^*\|_F = \delta d_0$, and $\|h_0\|_2 = \|x_0\|_2 = \sqrt{d_0}$. Therefore,
\[
\frac{1}{(1 - \delta_0)|\alpha_1|} = \frac{|\alpha_2|}{(1 - \delta_0)|\alpha_1\bar{\alpha}_2|} \le \frac{\sqrt{1 + \delta}}{|1 - \delta_0||1 - \delta|} \le 2,
\]
where the last inequality follows using our choice $\delta \le \epsilon \le 1/15$, and $\delta_0 = \delta/10$. This gives us
\[
\max_n\sqrt{QN}\|C(\Delta x_n)\|_\infty \le \max_n\sqrt{QN}\|Cx_n\|_\infty + \frac{1}{(1 - \delta_0)|\alpha_1|}\max_n\sqrt{QN}\|Cx_{0,n}\|_\infty \le 4\nu\sqrt{d_0} + 2\nu\sqrt{d_0} \le 6\nu\sqrt{d_0}.
\]
Case 2: $\|h\|_2 < \|x\|_2$, and $\alpha = \frac{1}{(1 - \delta_0)\bar{\alpha}_2}$. Since $|\alpha_2| \le 2$, we have
\[
\max_n\sqrt{QN}\|C(\Delta x_n)\|_\infty \le \max_n\sqrt{QN}\|Cx_n\|_\infty + (1 - \delta_0)|\alpha_2|\max_n\sqrt{QN}\|Cx_{0,n}\|_\infty \le 4\nu\sqrt{d_0} + 2(1 - \delta_0)\nu\sqrt{d_0} \le 6\nu\sqrt{d_0}.
\]
This completes the proof.

D. Proof of Lemma 6

The proof is adapted from Lemma 5.17 in [10].

Case 1: $\|h\|_2 \ge \|x\|_2$, and $\alpha = (1 - \delta_0)\alpha_1$. Using $\delta \le \epsilon \le 1/15$, we have the following easily verifiable (Lemma 5.17 in [10]) identities: $\langle h, \Delta h\rangle \ge \delta_0\|h\|_2^2$, $\|x\|_2^2 < 2d$, and also
\[
\mathrm{Re}\big(\langle f_\ell f_\ell^*h, \Delta h\rangle\big) \ge \frac{2d\mu^2}{L} \;\;\text{when}\;\; \frac{L|f_\ell^*h|^2}{8d\mu^2} > 1, \qquad \mathrm{Re}\big(\langle c_{q,n}c_{q,n}^*x, \Delta x\rangle\big) \ge \frac{d\nu^2}{QN} \;\;\text{when}\;\; \frac{QN|c_{q,n}^*x|^2}{8d\nu^2} > 1. \quad (53)
\]
For example, the last identity can simply be proven as follows:
\[
\mathrm{Re}\big\langle c_{q,n}c_{q,n}^*x, x - \bar{\alpha}^{-1}x_0\big\rangle \ge |c_{q,n}^*x|^2 - \frac{1}{(1 - \delta_0)|\alpha_1|}|c_{q,n}^*x||c_{q,n}^*x_0| = |c_{q,n}^*x|^2 - \frac{|\alpha_2|}{(1 - \delta_0)|\alpha_1\bar{\alpha}_2|}|c_{q,n}^*x||c_{q,n}^*x_0|.
\]
Using Lemma 1, we have that $|\alpha_2| \le 2$, $|\alpha_1\bar{\alpha}_2 - 1| \le \delta$, and the fact that $(h, x) \in \mathcal{N}_\mu \cap \mathcal{N}_\nu \cap \mathcal{N}_{d_0}$; we further obtain
\[
\mathrm{Re}\big(\langle c_{q,n}c_{q,n}^*x, x - \bar{\alpha}^{-1}x_0\rangle\big) \ge |c_{q,n}^*x|^2 - \frac{2}{(1 - \delta)(1 - \delta_0)}|c_{q,n}^*x||c_{q,n}^*x_0| \ge \frac{8d\nu^2}{QN} - \frac{2}{(1 - \delta)(1 - \delta_0)}\sqrt{\frac{8d\nu^2}{QN}}\sqrt{\frac{10d\nu^2}{9QN}} \ge \frac{d\nu^2}{QN},
\]
where the last inequality is obtained by using $|c_{q,n}^*x_0| = \frac{\nu\sqrt{d_0}}{\sqrt{QN}}$, and $0.9d_0 \le d \le 1.1d_0$.

Case 2: $\|h\|_2 < \|x\|_2$, and $\alpha = \frac{1}{(1 - \delta_0)\bar{\alpha}_2}$. Given $\delta \le \epsilon \le 1/15$, one can show (Lemma 5.17 in [10]) that $\langle x, \Delta x\rangle \ge \delta_0\|x\|_2^2$, $\|h\|_2^2 < 2d$, and also
\[
\mathrm{Re}\big(\langle f_\ell f_\ell^*h, \Delta h\rangle\big) \ge \frac{d\mu^2}{L} \;\;\text{when}\;\; \frac{L|f_\ell^*h|^2}{8d\mu^2} > 1, \qquad \mathrm{Re}\big(\langle c_{q,n}c_{q,n}^*x, \Delta x\rangle\big) \ge \frac{2d\nu^2}{QN} \;\;\text{when}\;\; \frac{QN|c_{q,n}^*x|^2}{8d\nu^2} > 1. \quad (54)
\]
Expanding the gradients, it is easy to see that
\[
\mathrm{Re}\big\{\langle\nabla G_h, \Delta h\rangle + \langle\nabla G_x, \Delta x\rangle\big\} = \frac{\rho}{d}\bigg[G_0'\Big(\frac{\|h\|_2^2}{2d}\Big)\mathrm{Re}\{\langle h, \Delta h\rangle\} + G_0'\Big(\frac{\|x\|_2^2}{2d}\Big)\mathrm{Re}\{\langle x, \Delta x\rangle\} + \sum_\ell\frac{L}{4\mu^2}G_0'\Big(\frac{L|f_\ell^*h|^2}{8d\mu^2}\Big)\mathrm{Re}\big\langle f_\ell f_\ell^*h, \Delta h\big\rangle + \sum_{q,n}\frac{QN}{4\nu^2}G_0'\Big(\frac{QN|c_{q,n}^*x|^2}{8d\nu^2}\Big)\mathrm{Re}\big\langle c_{q,n}c_{q,n}^*x, \Delta x\big\rangle\bigg]. \quad (55)
\]
We can now conclude the following inequality for both of the above cases:
\[
G_0'\Big(\frac{\|h\|_2^2}{2d}\Big)\langle h, \Delta h\rangle \ge \frac{\delta d}{5}G_0'\Big(\frac{\|h\|_2^2}{2d}\Big).
\]
To see this, note that it holds trivially when $\|h\|_2^2 < 2d$, and in the contrary scenario when $\|h\|_2^2 \ge 2d$, Case 2 is not possible, and Case 1 always has $\langle h, \Delta h\rangle \ge \delta_0\|h\|_2^2$; hence $\langle h, \Delta h\rangle \ge \delta d/5$ shows that the inequality holds. Similarly, we can also argue that
\[
G_0'\Big(\frac{\|x\|_2^2}{2d}\Big)\langle x, \Delta x\rangle \ge \frac{\delta d}{5}G_0'\Big(\frac{\|x\|_2^2}{2d}\Big).
\]
Moreover, the following inequalities
\[
\frac{L}{4\mu^2}G_0'\Big(\frac{L|f_\ell^*h|^2}{8d\mu^2}\Big)\mathrm{Re}\big(\langle f_\ell f_\ell^*h, \Delta h\rangle\big) \ge \frac{d}{4}G_0'\Big(\frac{L|f_\ell^*h|^2}{8d\mu^2}\Big), \quad (56)
\]
\[
\frac{QN}{4\nu^2}G_0'\Big(\frac{QN|c_{q,n}^*x|^2}{8d\nu^2}\Big)\mathrm{Re}\big(\langle c_{q,n}c_{q,n}^*x, \Delta x\rangle\big) \ge \frac{d}{4}G_0'\Big(\frac{QN|c_{q,n}^*x|^2}{8d\nu^2}\Big) \quad (57)
\]
hold in general. Again, to see this, note that both hold trivially when $QN|c_{q,n}^*x|^2 \le 8d\nu^2$, and $L|f_\ell^*h|^2 \le 8d\mu^2$, since $G_0'$ vanishes there, and in the contrary case, we have from the bounds (53), and (54) that (56), and (57) above hold in both Case 1 and Case 2. Plugging these results into (55) proves the lemma.

E. Proof of Lemma 3

Given that $z = (h, x)$, $z + w = (h + u, x + v) \in \mathcal{N}_{\tilde{F}} \cap \mathcal{N}_\epsilon$, and the lemma below, it follows that $z + w \in \mathcal{N}_{d_0} \cap \mathcal{N}_\mu \cap \mathcal{N}_\nu$.

Lemma 10. There holds $\mathcal{N}_{\tilde{F}} \subset \mathcal{N}_{d_0} \cap \mathcal{N}_\mu \cap \mathcal{N}_\nu$ under the local-RIP, and noise robustness lemmas in Section IV-C.

The proof of this lemma follows from exactly the same line of reasoning as the proof of Lemma 5.5 in [10].

Using the gradient $\nabla F_h$ expansion in (25), we estimate an upper bound on $\|\nabla F_h(z + w) - \nabla F_h(z)\|_2$. A straightforward calculation gives
\[
\nabla F_h(z + w) - \nabla F_h(z) = \mathcal{A}^*\mathcal{A}(ux^* + hv^* + uv^*)x + \mathcal{A}^*\mathcal{A}\big((h + u)(x + v)^* - h_0x_0^*\big)v - \mathcal{A}^*(e)v.
\]
Note that $z, z + w \in \mathcal{N}_{d_0}$ directly implies $\|ux^* + hv^* + uv^*\|_F \le \|u\|_2\|x\|_2 + \|h + u\|_2\|v\|_2 \le 2\sqrt{d_0}(\|u\|_2 + \|v\|_2)$, where $\|h + u\|_2 \le 2\sqrt{d_0}$. Moreover, $z + w \in \mathcal{N}_\epsilon$ implies $\|(h + u)(x + v)^* - h_0x_0^*\|_F \le \epsilon d_0$. Using $\|x\|_2 \le 2\sqrt{d_0}$ together with $\|\mathcal{A}^*(e)\|_{2\to2} \le \epsilon d_0$, which follows from Lemma 9, gives
\[
\|\nabla F_h(z + w) - \nabla F_h(z)\|_2 \le 4d_0\|\mathcal{A}\|_{2\to2}^2(\|u\|_2 + \|v\|_2) + \epsilon d_0\|\mathcal{A}\|_{2\to2}^2\|v\|_2 + \epsilon d_0\|v\|_2 \le 5d_0\|\mathcal{A}\|_{2\to2}^2(\|u\|_2 + \|v\|_2). \quad (58)
\]
In a similar manner, we can show that
\[
\|\nabla F_x(z + w) - \nabla F_x(z)\|_2 \le 5d_0\|\mathcal{A}\|_{2\to2}^2(\|u\|_2 + \|v\|_2). \quad (59)
\]
Plugging in the gradient expressions from (26), we have
\[
\|\nabla G_h(z + w) - \nabla G_h(z)\|_2 \le \frac{\rho}{2d}\bigg[\Big|G_0'\Big(\frac{\|h + u\|_2^2}{2d}\Big) - G_0'\Big(\frac{\|h\|_2^2}{2d}\Big)\Big|\|h + u\|_2 + G_0'\Big(\frac{\|h\|_2^2}{2d}\Big)\|u\|_2\bigg] + \frac{L\rho}{8d\mu^2}\Big\|\sum_\ell\alpha_\ell f_\ell\Big\|_2,
\]
where
\[
\alpha_\ell = G_0'\Big(\frac{L|f_\ell^*(h + u)|^2}{8d\mu^2}\Big)f_\ell^*(h + u) - G_0'\Big(\frac{L|f_\ell^*h|^2}{8d\mu^2}\Big)f_\ell^*h. \quad (60)
\]
Begin by noting that $G_0'(z) \le 2|z|$, and for any $z_1, z_2 \in \mathbb{R}$ it holds that $|G_0'(z_1) - G_0'(z_2)| \le 2|z_1 - z_2|$; moreover, $h + u \in \mathcal{N}_{d_0}$. Simplifying using the triangle inequality, we obtain
\[
\Big|G_0'\Big(\frac{\|h + u\|_2^2}{2d}\Big) - G_0'\Big(\frac{\|h\|_2^2}{2d}\Big)\Big| \le \frac{\|h + u\|_2 + \|h\|_2}{d}\|u\|_2 \le \frac{4\sqrt{d_0}}{d}\|u\|_2, \quad \text{and} \quad G_0'\Big(\frac{\|h\|_2^2}{2d}\Big) \le 2\frac{\|h\|_2^2}{2d} \le 4\frac{d_0}{d}. \quad (61)
\]
Using the same identities as above, and $h, h + u \in \mathcal{N}_\mu$, we have
\[
\Big|G_0'\Big(\frac{L|f_\ell^*(h + u)|^2}{8d\mu^2}\Big) - G_0'\Big(\frac{L|f_\ell^*h|^2}{8d\mu^2}\Big)\Big| \le \frac{L}{4d\mu^2}\big(|f_\ell^*(h + u)| + |f_\ell^*h|\big)|f_\ell^*u| \le \frac{L}{4d\mu^2}\cdot\frac{8\mu\sqrt{d_0}}{\sqrt{L}}|f_\ell^*u|,
\]
and
\[
G_0'\Big(\frac{L|f_\ell^*(h + u)|^2}{8d\mu^2}\Big) \le 2\frac{L|f_\ell^*(h + u)|^2}{8d\mu^2} \le 4\frac{d_0}{d}.
\]
We can now use the above two displays to obtain $|\alpha_\ell| \le 12\frac{d_0}{d}|f_\ell^*u|$, which eventually gives
\[
\Big\|\sum_\ell\alpha_\ell f_\ell\Big\|_2 \le 12\frac{d_0}{d}\|u\|_2, \quad (62)
\]
where we used the fact that $\sum_\ell f_\ell f_\ell^* = I$. Finally, using $h + u \in \mathcal{N}_{d_0}$, and plugging (62), and (61) into (60), gives us
\[
\|\nabla G_h(z + w) - \nabla G_h(z)\|_2 \le 5\rho\frac{d_0}{d^2}\|u\|_2 + \frac{3d_0L\rho}{2d^2\mu^2}\|u\|_2. \quad (63)
\]
In an exactly similar manner, we have
\[
\|\nabla G_x(z + w) - \nabla G_x(z)\|_2 \le \frac{\rho}{2d}\bigg[\Big|G_0'\Big(\frac{\|x + v\|_2^2}{2d}\Big) - G_0'\Big(\frac{\|x\|_2^2}{2d}\Big)\Big|\|x + v\|_2 + G_0'\Big(\frac{\|x\|_2^2}{2d}\Big)\|v\|_2\bigg] + \frac{QN\rho}{8d\nu^2}\Big\|\sum_{q,n}\beta_{q,n}c_{q,n}\Big\|_2,
\]
where
\[
\beta_{q,n} = G_0'\Big(\frac{QN|c_{q,n}^*(x + v)|^2}{8d\nu^2}\Big)c_{q,n}^*(x + v) - G_0'\Big(\frac{QN|c_{q,n}^*x|^2}{8d\nu^2}\Big)c_{q,n}^*x,
\]
and one can show, using the facts that $x$, and $x + v$ are members of the set $\mathcal{N}_{d_0} \cap \mathcal{N}_\nu$, that $\sum_{q,n}c_{q,n}c_{q,n}^* = I$, and an approach similar to the one used to obtain the bound (63), that
\[
\|\nabla G_x(z + w) - \nabla G_x(z)\|_2 \le 5\rho\frac{d_0}{d^2}\|v\|_2 + \frac{3d_0QN\rho}{2d^2\nu^2}\|v\|_2. \quad (64)
\]
Using (58), (59), (63), and (64) together with the fact that $\nabla\tilde{F}(z) = (\nabla F_h(z) + \nabla G_h(z), \nabla F_x(z) + \nabla G_x(z))$, and using $\|u\|_2 + \|v\|_2 \le \sqrt{2}\|w\|_2$, we obtain
\[
\|\nabla\tilde{F}(z + w) - \nabla\tilde{F}(z)\|_2 \le \sqrt{2}d_0\bigg[10\|\mathcal{A}\|_{2\to2}^2 + \frac{\rho}{d^2}\Big(5 + \frac{3L}{2\mu^2} + \frac{3QN}{2\nu^2}\Big)\bigg]\|w\|_2.
\]

F. Proof of Lemma 9

We begin by controlling the operator norm of the linear map $\mathcal{A}$ in (7), where $f_\ell^*$, and $\hat{c}_{\ell,n}^*$ are the rows of $F_M$, and $\sqrt{L}(F_QR_nC)^{\otimes N}$, respectively. It is easy to see that $\|\mathcal{A}\|_{2\to2} = \max_{\ell,n}\|f_\ell\hat{c}_{\ell,n}^*\|_F$, which follows from the fact that $\langle f_\ell\hat{c}_{\ell,n}^*, f_{\ell'}\hat{c}_{\ell',n'}^*\rangle = 0$ whenever $\ell \ne \ell'$ or $n \ne n'$. Since $\|f_\ell\hat{c}_{\ell,n}^*\|_F = \|f_\ell\|_2\|\hat{c}_{\ell,n}\|_2 \le \|\hat{c}_{\ell,n}\|_2$, we only require an upper bound on $\|\hat{c}_{\ell,n}\|_2$.

As introduced earlier, $R_n = \mathrm{diag}(r_n)$, where $r_n$ is a $Q$-vector of Rademacher random variables. Defining $c_{q,n}^*$ as the rows of $C^{\otimes N}$, we can write $\hat{c}_{\ell,n}$ as the random sum
\[
\hat{c}_{\ell,n} = \sqrt{L}\sum_{q=1}^{Q}f_\ell[q]\,r_n[q]\,c_{q,n},
\]
where $r_n[q]$ is the $q$th entry of $r_n$. The upper bound on $\|\hat{c}_{\ell,n}\|_2$ can now be obtained by an application of Proposition 1 below. The sequence $\{Z_k\}$ in the statement of Proposition 1 is, in this case, $\{\sqrt{L}f_\ell[q]r_n[q]c_{q,n}\}_{q=1}^Q$. Using the identities $\sum_{q=1}^{Q}c_{q,n}c_{q,n}^* = I$, and $\sum_{q=1}^{Q}\|c_{q,n}\|_2^2 = K$, which follow from the fact that the $Q \times K$ matrix $C$ has orthonormal columns, i.e., $C^*C = I$, a simple calculation shows that the variance $\sigma_Z^2$ in (66) is $\sigma_Z^2 \le K + 1$. Choosing $t = \sqrt{\alpha K\log(LN)}$, and using the bound in (67), results in
\[
\max_{\ell,n}\|\hat{c}_{\ell,n}\|_2 \le \sqrt{\alpha K\log(LN)} \quad (65)
\]
with probability at least $1 - \mathcal{O}((LN)^{-\alpha})$, where $\alpha \ge 1$ is a free parameter. This proves the first claim in the statement of the lemma.

As for the second claim, we begin by writing $\mathcal{A}^*(e)$ as a sum of random matrices:
\[
\mathcal{A}^*(e) = \sum_{\ell=1}^{L}\sum_{n=1}^{N}\hat{e}_n[\ell]\,\hat{c}_{\ell,n}f_\ell^* = \frac{\sigma d_0}{\sqrt{LN}}\sum_{\ell=1}^{L}\sum_{n=1}^{N}g_n[\ell]\,\hat{c}_{\ell,n}f_\ell^*,
\]
where the second equality follows by rewriting the Gaussian random variables $\hat{e}_n[\ell]$ as a scaling of the standard Gaussian random variables $g_n[\ell] \sim \mathrm{Normal}(0, \frac{1}{2}) + \iota\,\mathrm{Normal}(0, \frac{1}{2})$. We employ the matrix concentration inequality in Proposition 1 to control the operator norm of the random matrix above. The summand matrices $\{Z_k\}$ in Proposition 1 are, in this case, simply $\{\hat{c}_{\ell,n}f_\ell^*\}_{\ell,n}$. The computation of the variance in (66) now reduces to
\[
\sigma_Z^2 = \frac{\sigma^2d_0^2}{LN}\max\bigg\{\Big\|\sum_{\ell=1}^{L}\sum_{n=1}^{N}\|f_\ell\|_2^2\,\hat{c}_{\ell,n}\hat{c}_{\ell,n}^*\Big\|_{2\to2}, \Big\|\sum_{\ell=1}^{L}\sum_{n=1}^{N}\|\hat{c}_{\ell,n}\|_2^2\,f_\ell f_\ell^*\Big\|_{2\to2}\bigg\}.
\]
Recall that $f_\ell^*$, and $\hat{c}_{\ell,n}^*$ are the rows of $F_M$, and $\sqrt{L}(F_QR_nC)^{\otimes N}$, respectively; therefore,
\[
\|f_\ell\|_2^2 = \frac{M}{L}, \quad \sum_{\ell=1}^{L}f_\ell f_\ell^* = I, \quad \sum_{\ell=1}^{L}\sum_{n=1}^{N}\hat{c}_{\ell,n}\hat{c}_{\ell,n}^* = L\big(C^*R_n(F_Q)^*F_QR_nC\big)^{\otimes N} = LI_{KN\times KN}.
\]
Using the above display together with (65), the variance in this case is upper bounded as
\[
\sigma_Z^2 \le \frac{\sigma^2d_0^2}{LN}\max\big(M, \alpha KN\log(LN)\big).
\]
Choosing $t = \frac{2\epsilon d_0}{50}$, and
\[
LN \ge c_\alpha\frac{\sigma^2}{\epsilon^2}\max\big(M, \alpha KN\log(LN)\big)\log(LN),
\]
and using the inequality (67) in Proposition 1, proves the claim.

Proposition 1 (Corollary 4.2 in [54]). Consider a finite sequence $\{Z_k\}$ of fixed matrices with dimensions $d_1 \times d_2$, and let $\{g_k\}$ be a finite sequence of independent Gaussian or Rademacher random variables. Define the variance
\[
\sigma_Z^2 := \max\bigg\{\Big\|\sum_kZ_kZ_k^*\Big\|_{2\to2}, \Big\|\sum_kZ_k^*Z_k\Big\|_{2\to2}\bigg\}. \quad (66)
\]
Then for all $t \ge 0$,
\[
\mathbb{P}\bigg(\Big\|\sum_kg_kZ_k\Big\|_{2\to2} \ge t\bigg) \le (d_1 + d_2)\,e^{-t^2/2\sigma_Z^2}. \quad (67)
\]

G. Proof of Theorem 2

We now give the proof of Theorem 2 by explicitly constructing a good initial guess $(u_0, v_0) \in \frac{1}{\sqrt{3}}\mathcal{N}_{d_0} \cap \frac{1}{\sqrt{3}}\mathcal{N}_\mu \cap \frac{1}{\sqrt{3}}\mathcal{N}_\nu \cap \mathcal{N}_{\frac{2}{5}\epsilon}$ from the measurements $y$, and the knowledge of the model $\mathcal{A}$.

Proof. Recall that $\delta d_0 = \|hx^* - h_0x_0^*\|_F$; under $(h, x) = (0, 0)$, this implies that $\delta d_0 = \|h_0x_0^*\|_F = d_0$, giving $\delta = 1$. In this scenario, one can conclude from Lemma 7 that $\|h_0x_0^*\|_F^2 - \xi d_0^2 \le \|\mathcal{A}(h_0x_0^*)\|_2^2 \le \|h_0x_0^*\|_F^2 + \xi d_0^2$ with probability at least $1 - 2\exp(-c\xi^2QN/\mu^2\nu^2)$ whenever
\[
QN \ge \frac{c}{\xi^2}\big(\mu^2\nu_{\max}^2KN + \nu^2M\big)\log^4(LN). \quad (68)
\]
This implies that
\[
\big|\big\langle(\mathcal{A}^*\mathcal{A} - \mathcal{I})(h_0x_0^*), h_0x_0^*\big\rangle\big| \le \xi\|h_0x_0^*\|_F^2,
\]
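As an aside, the claim in the proof of Lemma 9 that the variance of the summands $\{\sqrt{L}f_\ell[q]r_n[q]c_{q,n}\}_{q=1}^{Q}$ obeys $\sigma_Z^2 \le K + 1$ is easy to check numerically; in fact, since $\sum_qc_{q,n}c_{q,n}^* = I$ and $\sum_q\|c_{q,n}\|_2^2 = K$, the two operator norms in (66) evaluate to $1$ and $K$. A minimal sketch (dimensions, the row index, and the orthonormal-column $C$ are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(4)
L, Q, K = 32, 8, 3
F = np.fft.fft(np.eye(L)) / np.sqrt(L)             # normalized DFT
FQ = F[:, :Q]                                      # rows f_ell of length Q
C = np.linalg.qr(rng.standard_normal((Q, K)))[0]   # Q x K, orthonormal columns
r = rng.choice([-1.0, 1.0], size=Q)                # Rademacher modulation
ell = 5                                            # any fixed row index
# summands Z_q = sqrt(L) f_ell[q] r[q] c_q, treated as K x 1 matrices
Zs = [np.sqrt(L) * FQ[ell, q] * r[q] * C[q][:, None] for q in range(Q)]
S1 = sum(Z @ Z.conj().T for Z in Zs)               # sum Z Z*  ->  I_K
S2 = sum((Z.conj().T @ Z).real.item() for Z in Zs) # sum Z* Z  ->  K
sigma2 = max(np.linalg.norm(S1, 2), S2)
print(sigma2 <= K + 1)   # True
```

The signs $r[q]$ and the unit-modulus entries of $\sqrt{L}F_Q$ cancel in $Z_qZ_q^*$, which is why the variance is deterministic here.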
Proof of Theorem 2 max kcˆ`,n k2 ≤ αK log(LN) (65) `,n We now give the proof of Theorem 2 by explicitly con- − 1 1 − O( α) ≥ structing a good initial guess: (u0, v0) ∈ √ Nd ∩ √ Nµ ∩ with probability at least 1 LN , where α 1 is a free 3 0 3 1 parameter. This proves the first claim in the statement of the √ Nν ∩ N 2 , from the measurements y, and the knowledge 3 5 ε lemma. of the model A. As for the second claim, we begin by writing the vector ∗ ∗ ∗ A (e) as a sum of random matrices Proof. Recall that δd0 = khx − h0 x0 kF , and under (h, x) = ∗ (0, 0), this implies that δd0 = kh0 x0 kF = d0 giving δ = 1. In L N σd L N ∗ 2 ∗ Õ Õ ∗ 0 Õ Õ ∗ this scenario, one can conclude from Lemma 7 that kh0 x0 kF − A (e) = eˆn[`]cˆ`,n f` = √ gn[`]cˆ`,n f` , 2 ∗ 2 ∗ 2 2 ξd ≤ kA(h0 x )k ≤ kh0 x k + ξd with probability at least `=1 n=1 LN `=1 n=1 0 0 2 0 F 0 1 − 2 exp −cξ2QN/µ2ν2 whenever where the second equality follows by rewriting the Gaussian c  2 2 2 2  4 random variables eˆn[`] as a scaling of the standard Gaussian QN ≥ µ νmaxKN + ν M log (LN). (68) 1 1 ξ2 random variables gn[`] ∼ Normal(0, 2 ) + ιNormal(0, 2 ). We employ the matrix concentration inequality in Proposition 1 This implies that to control the operator norm of the random matrix above. |h(A∗A − I)(h x∗), h x∗i| ≤ ξkh x∗ k2 , The summand matrices {Zk } in Proposition 1 in this case are 0 0 0 0 0 0 F 20

and hence $\|\mathcal{A}^*\mathcal{A}(h_0x_0^*) - h_0x_0^*\|_{2\to2} \le \xi d_0$. Using the triangle inequality, and (7), we obtain
\[
\|\mathcal{A}^*(\hat{y}) - h_0x_0^*\|_{2\to2} \le \|\mathcal{A}^*\mathcal{A}(h_0x_0^*) - h_0x_0^*\|_{2\to2} + \|\mathcal{A}^*(e)\|_{2\to2} \le \xi d_0 + \frac{2\epsilon d_0}{50} \le \frac{3\epsilon d_0}{50} =: \gamma d_0, \quad (69)
\]
where the last display follows from Lemma 9, and choosing $\xi = \frac{\epsilon}{50}$. Recall from Algorithm 2 that $d$, $\hat{h}_0$, and $\hat{x}_0$ denote the highest singular value of $\mathcal{A}^*(\hat{y})$, and the corresponding left, and right singular vectors, respectively. Assuming without loss of generality that $d_0 = 1$ gives $|d - 1| \le \frac{3\epsilon}{50}$. Use this together with $\epsilon \le \frac{1}{15}$ to conclude that $0.9d_0 \le d \le 1.1d_0$.

The initializer $v_0$ of $x_0$, computed by solving the minimization program in Algorithm 2, is basically a projection of $\sqrt{d}\hat{x}_0$ onto the convex set $\mathcal{Z} = \{z \,|\, \sqrt{QN}\|C^{\otimes N}z\|_\infty \le 2\sqrt{d}\nu\}$. Now $v_0 \in \mathcal{Z}$ implies that $\sqrt{QN}\|C^{\otimes N}v_0\|_\infty \le 2\sqrt{d}\nu \le \frac{4\nu}{\sqrt{3}}$, and hence $v_0 \in \frac{1}{\sqrt{3}}\mathcal{N}_\nu$. In addition, we have
\[
\|\sqrt{d}\hat{x}_0 - p\|_2^2 = \|\sqrt{d}\hat{x}_0 - v_0\|_2^2 + 2\,\mathrm{Re}\{\langle\sqrt{d}\hat{x}_0 - v_0, v_0 - p\rangle\} + \|v_0 - p\|_2^2 \ge \|\sqrt{d}\hat{x}_0 - v_0\|_2^2 + \|v_0 - p\|_2^2 \quad (70)
\]
for all $p \in \mathcal{Z}$, where the last inequality is the result of Lemma 11 applied to the inner product.

Lemma 11 (Theorem 2.8 in [55]). Let $\mathcal{Z}$ be a closed nonempty convex set. There holds
\[
\mathrm{Re}\{\langle q - P_{\mathcal{Z}}(q), z - P_{\mathcal{Z}}(q)\rangle\} \le 0, \quad \forall z \in \mathcal{Z}, \; q \in \mathbb{C}^W,
\]
where $P_{\mathcal{Z}}(q)$ is the projection of $q$ onto $\mathcal{Z}$.

A specific choice of $p = 0 \in \mathcal{Z}$ in the above inequality gives $\|v_0\|_2 \le \sqrt{d} \le \frac{2}{\sqrt{3}}$, and hence $v_0 \in \frac{1}{\sqrt{3}}\mathcal{N}_{d_0}$. We have thus shown that $v_0 \in \frac{1}{\sqrt{3}}\mathcal{N}_{d_0} \cap \frac{1}{\sqrt{3}}\mathcal{N}_\nu$. In an exactly similar manner, we can show that $u_0 \in \frac{1}{\sqrt{3}}\mathcal{N}_{d_0} \cap \frac{1}{\sqrt{3}}\mathcal{N}_\mu$.

It remains now to show that $(u_0, v_0) \in \mathcal{N}_{\frac{2}{5}\epsilon}$. Begin by noting that $\|\mathcal{A}^*(\hat{y}) - h_0x_0^*\|_{2\to2} \le \gamma$ implies that $\sigma_i(\mathcal{A}^*(\hat{y})) \le \gamma$ for $i \ge 2$, where $\sigma_i(\mathcal{A}^*(\hat{y}))$ denotes the $i$th largest singular value of the matrix $\mathcal{A}^*(\hat{y})$. This implies, using the triangle inequality, and (69), that
\[
\|d\hat{h}_0\hat{x}_0^* - h_0x_0^*\|_{2\to2} \le \|\mathcal{A}^*(\hat{y}) - d\hat{h}_0\hat{x}_0^*\|_{2\to2} + \|\mathcal{A}^*(\hat{y}) - h_0x_0^*\|_{2\to2} \le 2\gamma, \quad (71)
\]
where $d$ is the highest singular value of $\mathcal{A}^*(\hat{y})$, and $\hat{h}_0$ and $\hat{x}_0$ are the corresponding singular vectors already introduced in Algorithm 2. We also have
\[
\|\hat{x}_0^*(I - x_0x_0^*)\|_2 = \|(\hat{x}_0\hat{h}_0^*\hat{h}_0\hat{x}_0^*)(I - x_0x_0^*)\|_F = \|\hat{x}_0\hat{h}_0^*(\mathcal{A}^*(\hat{y}) - d\hat{h}_0\hat{x}_0^* + d\hat{h}_0\hat{x}_0^* - h_0x_0^*)(I - x_0x_0^*)\|_F \le \|\hat{x}_0\hat{h}_0^*(\mathcal{A}^*(\hat{y}) - h_0x_0^*)(I - x_0x_0^*)\|_F + |d - 1| \le 2\gamma,
\]
where the second equality follows from $h_0x_0^*(I - x_0x_0^*) = 0$, and $\hat{x}_0\hat{h}_0^*(\mathcal{A}^*(\hat{y}) - d\hat{h}_0\hat{x}_0^*) = 0$. Denoting $\beta_0 = \sqrt{d}\hat{x}_0^*x_0$, the above inequality can be equivalently written as
\[
\|\sqrt{d}\hat{x}_0 - \beta_0x_0\|_2 \le 2\sqrt{d}\gamma. \quad (72)
\]
Observe that $p = \beta_0x_0 \in \mathcal{Z}$, which follows from $\sqrt{QN}|\beta_0|\|C^{\otimes N}x_0\|_\infty = |\beta_0|\nu \le \sqrt{d}\nu < 2\sqrt{d}\nu$. Therefore, using $p = \beta_0x_0 \in \mathcal{Z}$ in (70) gives $\|\sqrt{d}\hat{x}_0 - \beta_0x_0\|_2 \ge \|v_0 - \beta_0x_0\|_2$, which combined with (72) yields
\[
\|v_0 - \beta_0x_0\|_2 \le 2\sqrt{d}\gamma. \quad (73)
\]
In an exactly similar manner, one can also show that
\[
\|u_0 - \alpha_0h_0\|_2 \le 2\sqrt{d}\gamma, \quad (74)
\]
where $\alpha_0 = \sqrt{d}\hat{h}_0^*h_0$. Finally,
\[
\|u_0v_0^* - h_0x_0^*\|_F \le \|u_0\|_2\|v_0 - \beta_0x_0\|_2 + |\beta_0|\|x_0\|_2\|u_0 - \alpha_0h_0\|_2 + \|\alpha_0\bar{\beta}_0h_0x_0^* - h_0x_0^*\|_F \le \frac{2}{\sqrt{3}}\cdot2\sqrt{d}\gamma + \sqrt{d}\cdot2\sqrt{d}\gamma + 2\gamma < \frac{20}{3}\gamma,
\]
where for the last term we used $\|\alpha_0\bar{\beta}_0h_0x_0^* - h_0x_0^*\|_F = |d(h_0^*\hat{h}_0)(\hat{x}_0^*x_0) - 1| \le 2\gamma$, which follows from (71). This amounts to showing that $\|u_0v_0^* - h_0x_0^*\|_F \le \frac{2}{5}\epsilon$ using $\gamma$ defined in (69), and $d_0 = 1$ as before, and hence $(u_0, v_0) \in \mathcal{N}_{\frac{2}{5}\epsilon}$. Plugging the choice $\xi = \frac{\epsilon}{50}$ into the probability $1 - 2\exp(-c\xi^2QN/\mu^2\nu^2)$ computed above, and into (68), gives the claimed probability, and sample complexity bound, respectively.
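The spectral step behind the initialization can be sketched in a few lines: perturb a unit-norm rank-one matrix (a stand-in for $\mathcal{A}^*(\hat{y})$ with $d_0 = 1$), take the top singular triple, and the Weyl/triangle-inequality argument behind (71) guarantees the rank-one estimate is within twice the perturbation in operator norm. The dimensions and noise level below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
M, K = 6, 5
h0 = rng.standard_normal(M); h0 /= np.linalg.norm(h0)
x0 = rng.standard_normal(K); x0 /= np.linalg.norm(x0)
E = rng.standard_normal((M, K))
E *= 0.02 / np.linalg.norm(E, 2)            # perturbation of operator norm 0.02
B = np.outer(h0, x0) + E                    # stand-in for A*(y-hat)
U, s, Vt = np.linalg.svd(B)
d, u0, v0 = s[0], U[:, 0], Vt[0]            # top singular value / vectors
err = np.linalg.norm(d * np.outer(u0, v0) - np.outer(h0, x0), 2)
print(err <= 2 * np.linalg.norm(E, 2))      # True, matching the 2-gamma bound in (71)
```

By Weyl's inequality $\sigma_2(B) \le \|E\|_{2\to2}$, so the best rank-one approximation misses $B$ by at most $\|E\|_{2\to2}$, and the triangle inequality gives the factor of two, exactly as in the proof.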