IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020 1785 Convolutional Phase Retrieval via Gradient Descent Qing Qu , Yuqian Zhang , Yonina C. Eldar , Fellow, IEEE, and John Wright

Abstract— We study the convolutional phase retrieval problem, where  is cyclic convolution modulo m and |·| denotes of recovering an unknown signal x ∈ Cn from m measurements entrywise absolute value. This problem can be rewritten in consisting of the magnitude of its cyclic convolution with a the common matrix-vector form given kernel a ∈ Cm . This model is motivated by applications such as channel estimation, optics, and underwater acoustic find z, s.t. y = |Az| , (I.2) communication, where the signal of interest is acted on by a given channel/filter, and phase information is difficult or impossible where A corresponds to a cyclic convolution. Convolutional to acquire. We show that when a is random and the number phase retrieval is motivated by applications in areas such of observations m is sufficiently large, with high probability x as channel estimation [2], noncoherent optical communica- can be efficiently recovered up to a global phase shift using a combination of spectral initialization and generalized gradient tion [3], and underwater acoustic communication [4]. For descent. The main challenge is coping with dependencies in the example, in millimeter-wave (mm-wave) wireless communica- measurement operator. We overcome this challenge by using ideas tions for 5G networks [5], one important problem is to estimate from decoupling theory, suprema of chaos processes and the the angle of arrival (AoA) of a signal from measurements restricted isometry property of random circulant matrices, and taken by the convolution of an antenna pattern and AoAs. recent analysis of alternating minimization methods. As the phase measurements are often very noisy, unreliable, Index Terms— Phase retrieval, nonconvex optimization, non- and expensive to acquire, it may be preferred to only take linear inverse problem, circulant convolution. measurements of the signal magnitude in which case the phase information is lost. I. INTRODUCTION Most known results on the exact solution of phase retrieval problems [6]–[11] pertain to generic random matrices,where E CONSIDER the problem of convolutional phase the entries of A are independent subgaussian random vari- retrieval where our goal is to recover an unknown W ables. We term problem (I.2) with a generic sensing matrix signal x ∈ Cn from the magnitude of its cyclic convolution A as generalized phase retrieval.1 However, in practice it is with a given filter a ∈ Cm. Specifically, the measurements difficult to implement purely random measurement matrices. take the form In most applications, the measurement is much more structured y = |a  x| , (I.1) – the convolutional model studied here is one such structured measurement operator. Moreover, structured measurements Manuscript received May 30, 2018; revised June 25, 2019; accepted often admit more efficient numerical methods: by using the October 2, 2019. Date of publication October 31, 2019; date of current version February 14, 2020. This work was supported in part by National Science fast Fourier transform for matrix-vector products, the benign Foundation (NSF) Computing and Communication Foundations (CCF) under structure of the convolutional model (I.1) allows to design Grant 1527809, in part by NSF Information and Intelligent Systems (IIS) methods with O(m) memory and O(m log m) computation under Grant 1546411, in part by the European Union’s Horizon 2020 Research and Innovation Program under Grant 646804-ERCCOGBNYQ, in part by cost per iteration. 
In contrast, for generic measurements, the Science Foundation under Grant 335/14, and in part by the CCF the cost is around O(mn). under Grant 1740833 and Grant 1733857. This article was presented at the In this work, we study the convolutional phase retrieval NeurIPS 2017. problem (I.1) under the assumption that the kernel a = Q. Qu was with the Department of , Columbia  University, New York, NY 10027 USA, and also with the Data Science [a1, ··· ,am] is randomly generated from an i.i.d. standard Institute, , New York, NY 10027 USA. He is now with complex Gaussian distribution CN (0, I). Compared to the the Center for Data Science, New York University, New York, NY 10011 USA (e-mail: [email protected]). generalized phase retrieval problem, the random convolution Y. Zhang was with the Department of Electrical Engineering, Columbia model (I.1) we study here is far more structured: it is para- University, New York, NY 10027 USA, and also with the Data Science Insti- meterized by only O(m) independent complex normal random tute, Columbia University, New York, NY 10027 USA. She is now with the O Department of , Cornell University, Ithaca, NY 14853 USA variables, whereas the generic model involves (mn) random (e-mail: [email protected]). variables. This extra structure poses significant challenges for Y. C. Eldar is with the Weizmann Faculty of Mathematics and Computer analysis: the rows and columns of the sensing matrix A are Science, Weizmann Institute of Science, 7610001, Israel (e-mail: [email protected]). probabilistically dependent, so that classical probability tools J. Wright is with the Department of Electrical Engineering, Columbia Uni- (based on concentration of functions of independent random versity, New York, NY 10027 USA, with the Department of Applied Physics vectors) do not apply. and Applied Mathematics, Columbia University, New York, NY 10027 USA, and also with the Data Science Institute, Columbia University, New York, We propose and analyze a local gradient descent type NY 10027 USA (e-mail: [email protected]). method, minimizing a weighted, nonconvex and nonsmooth Communicated by P. Grohs, Associate Editor for . Color versions of one or more of the figures in this article are available 1We use this term to make distinctions from Fourier phase retrieval, where online at http://ieeexplore.ieee.org. the matrix A is an oversampled DFT matrix. Here, generalized phase retrieval Digital Object Identifier 10.1109/TIT.2019.2950717 refers to the problem with any generic measurement other than Fourier. 0018-9448 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1786 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020 objective well in practice, while theoretical understandings of its global 1 2 convergence properties still remain largely open [30]. min f(z)= b1/2  (y −|Az|) , (I.3) For the generalized phase retrieval where the sensing z∈Cn 2m matrix A is i.i.d. Gaussian, the problem is better-studied:  ∈ Rm where denotes the Hadamard product. Here, b ++ is in many cases, when the number of measurements is large a weighting vector, which is introduced mainly for analysis enough, the target solution can be exactly recovered by purposes. The choice of b is discussed in Section III. Our using either convex or nonconvex optimization methods. The result can be informally summarized as follows: first theoretical guarantees for global recovery of generalized phase retrieval with i.i.d. Gaussian measurement are based 2 Cx on convex optimization – the so-called Phaselift/Phasemax With m ≥ Ω 2 n poly log n samples, general- x methods [6], [10], [31]. These methods lift the problem to ized gradient descent starting from a data-driven initial- a higher dimension and solve a semi-definite program- ization converges to the target solution with linear rate. ming (SDP) problem. However, the high computational cost of SDP limits their practicality. Quite recently, [32]–[34] reveal m×m Here, Cx ∈ C denotes the circulant matrix corre- that the problem can also be solved in the natural parameter sponding to cyclic convolution with a length m zero padding space via linear programming. of x,andpoly log n denotes a polynomial in log n.Com- Recently, nonconvex approaches have led to new computa- pared to the results of generalized phase retrieval with i.i.d. tional guarantees for global optimizations of generalized phase Gaussian measurement, the sample complexity m here has retrieval. The first result of this type is due to Netrapalli extra dependency on Cx / x. The operator norm Cx et al. [35], showing that the alternating minimization method − is inhomogeneous over CSn 1: for a typical x (e.g., drawn provably converges to the truth when initialized using a n−1 uniformly at random from CS ), Cx is of the order spectral method and provided with fresh samples at each O(log n) and the sample complexity matches that of the iteration. Later on, Candès et al. [36] showed that with the generalized phase retrieval up to log factors; the “bad”√ case same initialization, gradient descent for the nonconvex least is when x is sparse in the Fourier domain: Cx∼O( n) squares objective, 2 and m can be as large as O(n poly log n). 2 1 2 2 min f1(z)= y −|Az| , (I.4) Our proof is based on ideas from decoupling theory [12], z∈Cn 2m the suprema of chaos processes and restricted isometry prop- erty of random circulant matrices [13], [14], and is also provably recovers the ground truth, with near-optimal ≥ inspired by a new iterative analysis of alternating minimization sample complexity m Ω(n log n). The subsequent methods [11]. Our analysis draws connections between the work [8], [9], [37] further reduced the sample complexity to ≥ convergence properties of gradient descent and the classical m Ω(n) by using different nonconvex objectives and trunca- alternating direction method. This allows us to avoid the need tion techniques. 
In particular, recent work by [9], [37] studied a to argue uniform concentration of high-degree polynomials nonsmooth objective that is similar to ours (I.3) with weighting 1 in the structured random matrix A, as would be required b = . Compared to the SDP-based techniques, these methods by a straightforward translation of existing analysis to this are more scalable and closer to the approaches used in practice. new setting. Instead, we control the bulk effect of phase Moreover, Sun et. al. [38] reveal that the nonconvex objec- errors uniformly in a neighborhood around the ground truth. tive (I.4) actually has a benign global geometry: with high probability, it has no bad critical points with m ≥ Ω(n log3 n) This requires us to develop new decoupling and concentration 2 tools for controlling nonlinear phase functions of circulant samples. Such a result enables initialization-free nonconvex random matrices, which could be potentially useful for ana- recovery [40], [41]. For convolutional phase retrieval, it would lyzing other random convolution problems, such as sparse be nicer to characterize the global geometry of the problem   blind deconvolution [15], [16] and convolutional dictionary as in [38], [42]–[45]. However, the inhomogeneity of Cx CSn−1 learning [17], [18]. over causes tremendous difficulties for concentration with m ≥ Ω(n poly log n) samples.

A. Comparison With Literature b) Structured random measurements. The study of struc- tured random measurements in signal processing has quite a) Prior arts on phase retrieval. The challenge of developing a long history [46]. For compressed sensing [47], the efficient, guaranteed methods for phase retrieval has attracted work [48]–[50] studied random Fourier measurements, and substantial interest over the past several decades [19], [20]. later [13], [14] proved similar results for partial random convo- The problem is motivated by applications such as X-ray lution measurements. However, the study of structured random crystallography [21], [22], microscopy [23], astronomy [24], measurements for phase retrieval is still quite limited. In par- diffraction and array imaging [25], [26], optics [27], and more. ticular, [51] and [52] studied t-designs and coded diffraction The most classical method is the error reduction algorithm patterns (i.e., random masked Fourier measurements) using derived by Gerchberg and Saxton [28], also known as the semidefinite programming. Recent work studied nonconvex alternating direction method. This approach has been further improved by the hybrid input-output (HIO) algorithm [29]. For 2 [39] further tightened the sample complexity to m ≥ Ω(n log n) by using oversampled Fourier measurements, it often works surprisingly more advanced probability tools.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1787

optimization using coded diffraction patterns [36] and STFT set of Ω.If1Ω is the indicator function of the set Ω,then measurements [53], both of which minimize a nonconvex ∈ objective similar to (I.4). These measurement models are moti- 1 if j Ω, [1Ω]j = vated by different applications. For instance, coded diffraction 0 otherwise, is designed for imaging applications such as X-ray diffraction · imaging, STFT can be applied to frequency resolved optical where [ ]j is the jth coordinate of a given vector. Suppose | | Cm → C gating [54] and some speech processing tasks [55]. Both Ω = ,weletRΩ : be a mapping that maps a vector into its coordinates restricted to the set Ω. of the results√ show iterative contraction in a region that is ∗ at most O(1/ n)-close to the optimum. Unfortunately, for Correspondingly, we use RΩ to denote its adjoint operator. ∈ Cn×n × both results either the radius of the contraction region is not Let F n to denote√ a unnormalized n n Fourier   m ∈ Cm×n ≥ large enough for initialization to reach, or they require extra matrix with F n = n,andletF n (m n) artificial technique such as resampling the data. In comparison, be an oversampled Fourier matrix. ∈ Cm×m the contraction region we show for the random convolutional • Cyclic convolution. Let Ca be the circulant matrix model is larger O(1/polylog(n)), which is achievable in the generated from a, i.e., ⎡ ⎤ initialization stage via the spectral method. For a more detailed a1 am ··· a3 a2 review of this subject, we refer the readers to Section 4 of [46]. ⎢ ⎥ ⎢ a2 a1 am a3 ⎥ The convolutional measurement can also be reviewed as ⎢ ⎥ ⎢ . .. . ⎥ a single masked coded diffraction patterns [36], [52], since Ca = ⎢ . a2 a1 . . ⎥ − ⎢ ⎥ a  x = F 1(a  x),wherea is the Fourier transform of a ⎣ .. .. ⎦ a − . . a and x is the oversampled Fourier transform of x.Thesample m 1 m a a − ··· a a complexity for coded diffraction patterns m ≥ Ω(n log4 n)  m m 1 2 1 ··· in [36] suggests that the dependence of our sample complexity = s0[a] s1[a] sm−1[a] , (I.5) on C  for convolutional phase retrieval might not be x where s [·](0≤  ≤ m − 1) denotes a circulant shift by  necessary and can be improved in the future. On the other  samples. Therefore, the convolution a  x in (I.1) can be hand, our results suggest that the contraction region is larger √ rewritten in the matrix-vector form than O(1/ n) for coded diffraction patterns, and resampling for initialization might not be necessary.  ∗ a x = CaR[1:n]x, (I.6)

m n m where R[1:n] : R → R maps any vector x ∈ R to this B. Notations, Wirtinger Calculus, and Organizations ∗ first n coordinates, and R[1:n] is its adjoint operator. a) Basic notations. Throughout this paper, all vectors/ matrices are written in bold font a/A; indexed values are b) Wirtinger calculus. Consider a real-valued function n−1 Cn → R written as ai, Aij .WeuseCS to represent the unit g(z): . The function is not holomorphic, so that it is complex sphere in Cn.Weuse(·) and (·)∗ to denote the real not complex differentiable unless it is constant [56]. However, and Hermitian transpose of a vector or matrix, respectively. if one identifies Cn with R2n and treats g as a function in the We use (·) and (·) for the real and imaginary parts of a real domain, g can be differentiable in the real sense. Doing calculus for g directly in the real domain tends to produce = complex variable, respectively. We use g1 | g2 to represent the independence of two random variables g1,g2. Given a matrix cumbersome expressions. A more elegant way is adopting the X ∈ Cm×n, col(X) and row(X) are its column and row Wirtinger calculus [57], which can be considered as a neat · · way of organizing the real partial derivatives (see also [7] and space. We use F and to denote the Frobenius norm and spectral norm of a matrix, respectively. For a random Section 1 of [38]). The Wirtinger derivatives can be defined p p 1/p formally as variable X, its L norm is defined as X p = E [|X| ] L  for any positive p ≥ 1. For a smooth function f ∈C1,  ∞ ∂g . ∂g(z, z) its L norm is defined as f ∞ =sup |f(t)|.We =  L t∈dom(f) ∂z ∂z CN 0  z constant  use ( , I) be the standard complex Gaussian distribution.  ∼CN 0 ∂g(z, z) ∂g(z, z)  If a ( , I),then = ,...,  ∂z1  ∂zn z constant ∼ N 0 1  a = u +iv, u, v i.i.d. , 2 I . ∂g . ∂g(z, z) =  ∂z ∂z  z constant  In addition, for all theorems and proofs we use ci and Ci  ··· ∂g(z, z) ∂g(z, z)  (i =1, 2, ) to denote positive numerical constants. = ,...,  . • Some basic operators. For any vector v ∈ Cn,wedefine ∂z1 ∂zn z constant ∗ ∗ Basically it says that when evaluating ∂g/∂z, one just writes vv − vv P v = , P v⊥ = I ∂g/∂z in the pair of (z, z), and conducts the calculus by treat- v2 v2 ing z as if it was a constant. We compute ∂g/∂z in a similar to be the projection onto the span of v and its orthogonal fashion. To evaluate the individual partial derivatives, such as complement, respectively. For an arbitrary set Ω, we denote ∂g(z,z) , all the usual rules of calculus apply. For more details ∂zi the cardinality of Ω as |Ω|,andletsupp(Ω) be the support on Wirtinger calculus, we refer interested readers to [56].

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1788 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020 c) Organization. The rest of the paper is organized as fol- Algorithm 1 Spectral Initialization lows. In Section II, we introduce the basic formulation of the { }m Input: Observations yk k=1. problem and the proposed algorithm. In Section III, we present Output: The initial guess z(0). the main results and a sketch of the proof; detailed analysis 1: Estimate the norm of x by is postponed to Section VI. In Section IV, we corroborate our   analysis with numerical experiments. We discuss the potential  1 m λ =  y2 impacts of our work in Section V. Finally, all the basic m k probability tools that are used in this paper are described in k=1 the appendices. (0) n−1 2: Compute the leading eigenvector z ∈ CS of the matrix, II. NONCONVEX OPTIMIZATION VIA GRADIENT DESCENT m 1 1 ∗ In this work, we develop an approach to convolutional phase Y = y2a a∗ = A diag y2 A, m k k k m retrieval based on local nonconvex optimization. Our proposed k=1 algorithm has two components: (1) a careful data-driven (0) 3: Set z(0) = λz . initialization using a spectral method; (2) local refinement by gradient descent. We introduce the two steps below.

which is constructed from the knowledge of the sensing A. Nonconvex and Nonsmooth Minimization vectors and observations. The leading eigenvector of Y can We consider minimizing a weighted nonconvex and be efficiently computed via the power method. Note that nonsmooth objective introduced in (I.3), where as shown E [Y ]=x2 I + xx∗, so the leading eigenvector of E [Y ] is ∗ ∈ Cm×n in (I.6) the matrix A = CaR[1:n] is formed by the proportional to the target solution x. Under the random con- first n columns of Ca. The adoption of the positive weights b volutional model of A, by using probability tools from [46], facilitates our analysis, by enabling us to compare certain func- we show that v∗Yvconcentrates to its expectation v∗E [Y ] v − tions of the dependent random matrix A to functions involving for all v ∈ CSn 1 whenever m ≥ Ω(n poly log n), ensuring more independent random variables. We will substantiate this that the initialization z(0) is close to the set of target solutions. claim in the next section. It should be noted that several variants of the initialization We consider the generalized Wirtinger gradient of (I.3), approach in Algorithm 1 have been introduced in the literature. They improve upon the log factors of sample complexity for ∂ 1 ∗ f(z)= A diag (b)[Az − y  exp (iφ(Az))] . generalized phase retrieval with i.i.d. measurements. Those ∂z m methods include the truncated spectral method [8], null initial- · Here, because of the nonsmoothness of (I.3), f( ) is not ization [58] and orthogonality-promoting initialization [9]. For differentiable everywhere even in the real sense. To deal with the simplicity of analysis, here we only consider Algorithm 1 this issue, we specify for the convolutional model. . u/ |u| if |u| =0 , exp (iφ(u)) = 1 otherwise, III. MAIN RESULT AND SKETCH OF ANALYSIS for any complex number u ∈ C and φ(u) ∈ [0, 2π). Starting In this section, we introduce our main theoretical result, from some initialization z(0), we minimize the objective (I.3) and sketch the basic ideas behind the analysis. Without loss of generality, we assume the ground truth signal to be by generalized gradient descent − x ∈ CSn 1. Because the problem can only be solved up to a ∂ z(r+1) = z(r) − τ f(z(r)), (II.1) global phase shift, we define the set of target solutions as ∂z   . iφ ∂ X = xe | φ ∈ [0, 2π) , where τ>0 is the stepsize. Indeed, ∂z f(z) can be interpreted as the subgradient of f(z) in the real case; this method can and correspondingly let be seen as a variant of amplitude flow [9]. . dist(z, X ) =infz − xeiφ , φ∈[0,2π) B. Initialization via Spectral Method which measures the distance from a point z ∈ Cn to the set Similar to [7], [35], we compute the initialization z(0) via a of target solutions X . spectral method, detailed in Algorithm 1. More specifically, z(0) is a scaled version of the leading eigenvector of the following matrix A. Main Result m 1 Suppose the weighting vector b = ζ 2 (y) in (I.3), where Y = y2a a∗ σ m k k k k=1 2 − 2 2 ζσ (t)=12πσξσ (t), (III.1a) 1 2 ∗ 2 1 |t| = A diag y A, (II.2) ξ 2 (t)= 2 exp − 2 , (III.1b) m σ 2πσ 2σ

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1789 with σ2 > 1/2. Our main theoretical result shows that with 1) Proof sketch of iterative contraction Our iterative analy- high probability, the generalized gradient descent (II.1) with sis is inspired by the recent analysis of alternating direction spectral initialization converges linearly to the optimal set X . method (ADM) [11]. In the following, we draw connections 31 Theorem 3.1 (Main Result): If m ≥ C0 n log n,then between the gradient descent method (II.1) and ADM, and Algorithm 1 produces an initialization z(0) that sketch the basic ideas of convergence analysis. (0) X ≤ −6   dist z , c0 log n x , a) ADM iteration. ADM is a classical method for solving phase retrieval problems [11], [28], [35], which can be consid- − −c2 (0) with probability at least 1 c1 m . Starting from z , ered as a heuristic method for solving the following nonconvex with σ2 =0.51 and stepsize τ =2.02, whenever 2   problem Cx 17 4 (r) m ≥ C 2 max log n, n log n , for all iterates z 1 x 2 ≥ min 1 Az − y  u . (r 1) in (II.1), we have z∈Cn,|u|=1 2 (r) r (0) dist z , X ≤ (1 − ) dist z , X , (III.2) At every iterate z(r), ADM proceeds in two steps: −c4 (r+1) (r) with probability at least 1 − c3 m for some numerical c = y  exp Az , ∈ constant (0, 1). 1 2 z(r+1) =argmin Az − c(r+1) , z 2 Remark. Our result shows that by initializing the prob- lem O(1/polylog(n))-close to the optimum via the spectral which leads to the following update method, the gradient descent (II.1) converges linearly to the (r+1) †  (r) optimal solution. As we can see, the sample complexity here z = A y exp Az ,   also depends on Cx , which is quite different from the i.i.d. † ∗ −1 ∗ ∈ CSn−1 where A =(A A) A is the pseudo-inverse of A.Let case. For a typical x (e.g., x is drawn uniformly n−1 (r) − iθ random from CS ), Cx is on the order of O(log n),and θr =argminθ∈[0,2π) z xe . The distance between ≥ the sample complexity m Ω(n poly log n) matches the i.i.d. z(r+1) and X is bounded by case up to log factors. However, Cx is nonhomogeneous n−1 over x ∈ CS :ifx is sparse in the Fourier domain dist z(r+1), X x = √1 1 (e.g., n ), the sample complexity can be as large as (r+1) iθ +1 m ≥ Ω n2 poly log n . Such a behavior is also demonstrated = z − xe r

in the experiments of Section IV. We believe the (very large!) † ≤ iθr −  (r) number of logarithms in our result is an artifact of our analysis, A Axe y exp Az . (III.3) rather than a limitation of the method. We expect to reduce 2 Cx 6 the sample complexity to m ≥ Ω 2 n log n by a tighter x b) Gradient descent with b = 1. For simplicity analysis, which is left for future work. and illustration purposes, let us first consider the gra- As we shall see in the following, the smoothness of the dient descent update (II.1) with b = 1.Letθr = weighting b ∈ Rm in (III.1) helps us get around the difficulty arg min z(r) − xeiθ, with stepsize τ =1.The of convergence analysis. The particular numerical choices of θ∈[0,2π) σ2 =0.51 and τ =2.02 are merely to ensure iterative distance between the iterate z(r+1) and the optimal set X is contraction in (III.2). For analysis, the parameter pair (τ,σ2) bounded by can be selected in a range as long as ∈ (0, 1) in (III.2). In 1 dist z(r+1), X practice, the algorithm converges with b = and a choice of small stepsize τ, or by using backtracking linesearch for the = z(r+1) − xeiθr+1 stepsize τ. 1 ≤ A Axeiθr − y  exp iφ(Az(r)) m B. A Sketch of the Analysis 1 ∗ (r) iθ + I − A A z − xe r . (III.4) In this subsection, we briefly highlight some major chal- m lenges and new ideas behind the analysis. All the detailed proofs are postponed to Section VI. The core idea behind the c) Towards iterative contraction. By measure concentration, analysis is to show that the iterate contracts once we initialize it can be shown that close enough to the optimum. In the following, we first describe the basic ideas of proving iterative contraction, which critically depends on bounding a certain nonlinear function 1 ∗ I − A A = o(1), (III.5a) of a random circulant matrix. We sketch the core ideas of m √ √ how to bound such a complicated term via the decoupling † A≈ m, A ≈ 1/ m, (III.5b) technique.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1790 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

holds with high probability whenever m ≥ Ω(n poly log n). a) Why decoupling? Let ak (1 ≤ k ≤ m)bearowvector Therefore, based on (III.3) and (III.4), to show iterative of A, then the term contraction, it is sufficient to prove   L(a, w)=w A diag (ψ(Ax)) Aw iθ Axe − y  exp (iφ(Az)) m √ ∗   ≤ (1 − η) m z − xeiθ , (III.6) = ψ(akx)w akak w k=1 dependence across k for some constant η ∈ (0, 1) sufficiently small, where θ = ∗ ∗ is a summation of dependent random variables. To address arg min z − xeiθ such that eiθ = x z/ |x z|. θ∈[0,2π) this problem, we deploy ideas from decoupling [12]. Infor- By borrowing ideas from controlling (III.6) in the ADM mally, decoupling allows us to compare moments of random method [11], this observation provides a new way of analyzing functions to functions of more independent random variables, the gradient descent method. As an attempt to show (III.6) which are usually easier to analyze. The book [12] provides A for the random circulant matrix , we invoke Lemma A.1 a beautiful introduction to this area. In our problem, notice in the appendix, which controls the error in a first order that the random vector a occurs twice in the definition of exp(iφ(·)) approximation to . Let us decompose L(a, w) – one in the phase term ψ(Ax)=exp(−2iφ(Ax)), z = αx + βw, and another in the quadratic term. The general spirit of n−1 decoupling is to seek to replace one of these copies of a with where w ∈ CS with w ⊥ x,andα, β ∈ C. Notice that an independent copy a of the same random vector, yielding φ(α)=θ, so that by Lemma A.1, for any ρ ∈ (0, 1) we have a random process with fewer dependencies. Here, we seek to iθ −  L Ax e y exp (iφ(Az)) replace (a, w) with ≤ · | | QL 2 Ax ½| β || |≥ | | dec(a, a , w) α Aw ρ Ax      = w A diag ψ(A x) Aw. (III.8)   T1   QL 1  β  The utility of this new, decoupled form dec(a, a , w) +    ((Aw)  exp (−iφ(Ax))) . 1 − ρ α    of L(a, w) is that it introduces extra randomness — L T Q 2 dec(a, a , w) is now a chaos process of a conditioned on a . QL a a w The first term T1 is relatively much smaller than T2,which This makes analyzing supw∈S dec( , , ) amenable to can be bounded by a small numerical constant using the existing analysis of suprema of chaos processes for random restricted isometry property of a random circulant matrix [14], circulant matrices [46]. together with some auxiliary analysis. The detailed analysis However, achieving the decoupling requires additional is provided in Section VI-D. The second term T2 involves work; the most general existing results on decoupling per- a nonlinear function exp (−iφ(Ax)) of the random circulant tain to tetrahedral polynomials, which are polynomials with matrix A. Controlling this nonlinear, highly dependent random no monomials involving any power larger than one of any process for all w is a nontrivial task. In the next subsection, random variable. By appropriately tracking cross terms, these we explain why bounding T2 is technically challenging, and results can also be applied to more general (non-tetrahedral) describe the key ideas on how to control a smoothed variant polynomials in Gaussian random variables [59]. However, L a w of T2, by using the weighting b introduced in (III.1). We also our random process ( , ) involves a nonlinear phase term provide intuitions for why the weighting b is helpful. ψ(Aw) which is not a polynomial, and hence is not amenable 2) Controlling a smoothed variant of the phase term T2 to a direct appeal to existing results. 
As elaborated above, the major challenge of showing iterative b) Decoupling is “recoupling”. Existing results [59] for contraction is bounding the suprema of the nonlinear, depen- decoupling polynomials of Gaussian random variables are dent random process T (w) over the set 2  derived from two simple facts: . − S = w ∈ CSn 1 | w ⊥ x . (1) orthogonal projections of Gaussian variables are independent3; By using the fact that (u)= 1 (u − u) for any u ∈ C, 2i (2) Jensen’s inequality. we have     For the random vector a ∼CN(0, I), let us introduce an     independent copy δ ∼CN(0, I). Write sup T 2(w) ≤ 1 sup wA diag (ψ(Ax)) Aw 2 2    1 2 − w∈S w∈S   g = a + δ, g = a δ. L(a,w) Because of Item 1, g1 and g2 are two independent CN(0, 2I) + 1 A2 2 vectors. Now, by taking conditional expectation with respect . where we define ψ(t)√=exp(−2iφ(t)). As from (III.5), to δ,wehave  ≈   we know that A m. Thus, to show (III.6), the major E QL (g1, g2, w) task left is to prove that δ  dec  E QL − . L = δ dec(a + δ, a δ, w) = (a, w). (III.9) sup |L(a, w)| < (1 − η )m (III.7) ∈S w 3If two random variables are jointly Gaussian, they are statistically inde- for some constant η ∈ (0, 1). pendent if and only if they are uncorrelated.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1791

Thus, we can see that the key idea of decoupling L(a, w) into QL QL 1 2 dec(a, a , w), is essentially “recoupling” dec(g , g , w) via conditional expectation – the “recoupled” term L can be reviewed as an approximation of L(a, w). Notice that by Fact 2, Jensen’s inequality, for any convex function ϕ,   Ea sup ϕ L(a, w) w∈S    E E QL − = a sup ϕ δ dec(a + δ, a δ, w) w∈S  ≤ E QL − a,δ sup ϕ dec(a + δ, a δ, w) w∈S  L 1 2 E 1 2 Q = g ,g sup ϕ dec(g , g , w) . w∈S Fig. 1. Plots of functions h(t), ψ(t) and ζσ2 (t) over the real line with 2 Thus, by choosing ϕ appropriately, i.e., as ϕ(t)=|t|p, we can σ =0.51. The ψ(t) function is discontinuous at 0, and cannot be uniformly h t h t L approximated by ( ). On the other hand, the function ( ) serves as a good control all the moments of supw∈S (a, w) via ψ t approximation of the weighting ( ).     sup L(a, w) w∈S Lp   error is small? To understand this question, recall that the L ≤ Q 1 2  mechanism that links Qdec to L is the conditional expectation sup dec(g , g , w) . (III.10) w∈S Lp operator Eδ [·|a]. For our case, from (III.9) orthogonality This type of inequality is very useful because it leads to   L  L   relates the moments of supw∈S (a, w) to those of (a, w)=w A diag (h(Ax)) Aw, (III.12)  L  L . Q a a w Q E 2 supw∈S dec( , , ) . As discussed previously, dec is a h(t) = s∼CN (0,x ) [ψ(t + s)] . chaos process of g1 conditioned on g2. Its moments can be bounded using existing results [14]. Thus, by using the results in (III.11) and (III.12), we can L bound sup ∈S |L(a, w)| as If was a tetrahedral polynomial, then we have w   L L   = , i.e., the approximation is exact. As the tail bound |L |≤ L  |L | sup (a, w) sup (a, w) of supw∈S (a, w) can be controlled via its moments w∈S w∈S bounds [60, Chapter 7.2], this allows us to directly control 2 + h − ψ ∞ A . (III.13) the object of interest L. The reason of achieving this bound is   L  approximation error because the conditional expectation operator Eδ [·|a] “recou- QL L ples” dec(a, a , w) back to the target (a, w).Inother Note that the function h is not exactly ψ, but generated by con- words, (Gaussian) decoupling is recoupling. volving ψ with a multivariate Gaussian pdf : indeed, recoupling is Gaussian smoothing. The Fourier transform of a multivariate c) “Recoupling” is Gaussian smoothing. In convolutional Gaussian is again a Gaussian; it decays quickly with frequency. phase retrieval, a distinctive feature of the term L(a, w) So, in order to admit a small approximation error, the target is that ψ(·) is a phase function and therefore L is not a ψ must be smooth. However, in our case, the function ψ(t)= polynomial. Hence, it may be challenging to posit a QL dec exp(−2iφ(t)) is discontinuous at t =0; it changes extremely which “recouples” back to L. In other words, as L = L rapidly in the vicinity of t =0, and hence its Fourier in the existing form, we need to tolerate an approximation L L transform (appropriately defined) does not decay quickly at all. error. Although is not exactly , we can still control L |L | L Therefore, the term (a, w) is a poor target for approximation supw∈S (a, w) through its approximation , L E QL 1 2   by using a smooth function (a, w)= δ[ dec(g , g , w)].   sup |L(a, w)|≤sup L(a, w) −L(a, w) From Figure 1, the difference between h and ψ increases as | |  −  w∈S w∈S   t 0. The poor approximation error ψ f L∞ =1results   in a trivial bound for sup |L(a, w)| instead of the desired +supL(a, w) . (III.11) w∈S w∈S bound (III.7).     
L d) Decoupling and convolutional phase retrieval. To reduce As we discussed above, the term supw∈S  (a, w) can be controlled by using decoupling and the moments bound the approximation error caused by the nonsmoothness of ψ at in (III.10). Therefore, the inequality (III.11) is useful to derive t =0, we smooth ψ. More specifically, we introduce a new a sufficiently tight bound for L(a, w) if L(a, w) is very weighted objective (I.3) with Gaussian weighting b = ζσ2 (y) L in (I.2), replacing the analyzing target T2 with close to (a, w) uniformly, i.e., the approximation error is L small. Now the question is: for what is it possible to T 1/2  − QL 2 = diag b ((Aw) exp ( iφ(Ax))) . find a “well-behaved” dec such that the approximation

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1792 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

Consequently, we obtain a smoothed variant Ls(a, w) of L(a, w),   Ls(a, w)=w A diag (ζσ2 (y)  ψ(Ax)) Aw. Similar to (III.13), we obtain

sup |Ls(a, w)| w∈S     2 ≤ L   − 2    sup (a, w) + h(t) ζσ (t)ψ(t) L∞ A . w∈S  −  Now the approximation error h ψ L∞ in (III.13) is  − 2  replaced by h(t) ζσ (t)ψ(t) L∞ . As observed from Figure 1, the function ζσ2 (t) smoothes ψ(t) especially near the vicinity of t =0, such that the new approximation  − 2  error f(t) ζσ (t)ψ(t) L∞ is significantly reduced. Thus, by using similar ideas above, we can provide a nontrivial bound Fig. 2. Phase transition for signals x ∈ CSn−1 with different signal sup |Ls(a, w)| < (1 − ηs) m, patterns. We fix n = 1000 and generate x with different patterns. We vary w∈S the ratio m/n. for some η ∈ (0, 1), which is sufficient for showing iterative s b) Necessity of initializations. As has been shown contraction. Finally, because of the weighting b = ζ 2 (y), σ in [38], [39], for phase retrieval with generic measurement, it should be noticed that the overall analysis needs to be when the sample complexity satisfies m ≥ Ω(n log n), with slightly modified accordingly. For a more detailed analysis, high probability the landscape of the nonconvex objective (I.4) we refer the readers to Section VI. is nice enough that it enables initialization free global opti- mization. This raises an interesting question of whether spec- IV. EXPERIMENTS tral initialization is necessary for the random convolutional In this section, we conduct experiments on both synthetic model. We consider a similar setting as the previous experi- and real datasets to demonstrate the effectiveness of the ment, where the ground truth x ∈ Cn is drawn uniformly at − proposed method. random from CSn 1. We fix the dimension n = 1000 and change the ratio m/n. For each ratio, we randomly generate A. Experiments on Synthetic Dataset the kernel a ∼CN(0, I) in (I.1) and repeat the experiment a) Dependence of sample complexity on Cx. First, 100 times. For each instance, we start the algorithm from we investigate the dependence of the sample complexity m random and spectral initializations, respectively. We choose n−1 on Cx. We assume the ground truth x ∈ CS ,and the stepsize via backtracking linesearch and terminate the consider three cases: experiment either when the number of iterations is larger than × 4 • x = e1 with e1 to be the standard basis vector, such that 2 10 or the distance of the iterate to the solution is smaller × −5 Cx =1; than 1 10 . • x is uniformly random generated on the complex sphere As we can see from Figure 3, the number of samples n−1 CS ; √ required for successful recovery with random initializations • √1 1   is only slightly more than that with the spectral initialization. x = n , such that Cx = n. For each case, we fix the signal length n = 1000 and vary This implies that the requirement of spectral initialization is the ratio m/n. For each ratio m/n, we randomly generate an artifact of our analysis. For convolutional phase retrieval, the kernel a ∼CN(0, I) in (I.1) and repeat the experiment the result in [40] shows some promises for analyzing global 100 times. We initialize the algorithm by the spectral method convergence of gradient methods with random initializations. in Algorithm 1 and run the gradient descent (II.1). Given the c) Effects of weighting b. Although the weighting b in (III.1) algorithm output x, we judge the success of recovery by that we introduced in Theorem 3.1 is mainly for analysis, here inf x − xeiφ ≤ , (IV.1) we investigate its effectiveness in practice. We consider the ∈ φ [0,2π) same three cases for x as we did before. For each case, we fix where  =10−5. 
From Figure 2, for the case when the signal length n = 100 and vary the ratio m/n. For each Cx = O(1), the number of measurements needed is far ratio m/n, we randomly generate the kernel a ∼CN(0, I) less than Theorem 3.1 suggests. Bridging the gap between the in (I.1) and repeat the experiment 100 times. We initialize practice and theory is left for the future work. the algorithm by the spectral method in Algorithm 1 and Another observation is that the larger Cx is, the more run the gradient descent (II.1) with weighting b = 1 and b samples we needed for the success of recovery. One possibility in (III.1), respectively. We judge success of recovery once the −5 is that the sample complexity depends on Cx, another error (IV.1) is smaller than 10 . From Figure 4, we can see possibility is that the extra logarithmic factors in our analysis that the sample complexity is slightly larger for b = ζσ2 (y), are truly necessary for worst case (here, spectral sparse) inputs. the benefit of weighting here is more for the ease of analysis.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1793

generated as complex Gaussian CN(0, I).Werunthepower method for 100 iterations for initialization, and stop the algorithm once the error is smaller than 1×10−4.Wefirsttest the proposed method on a gray 468×1228 electron microscopy image. As shown in Figure 7, the gradient descent method with spectral initialization converges to the target solution in around 64 iterations. Second, we test our method on a color image of size 200 × 300 as shown in Figure 8, it takes 197.08s to reconstruct all the RGB channels. In contrast, methods using general Gaussian measurements A ∈ Cm×n could easily run out of memory on a personal computer for problems of this size.

V. D ISCUSSION AND FUTURE WORK In this work, we showed that via nonconvex optimiza- tion, the phase retrieval problem with random convolu- Fig. 3. Phase transition with different initializations schemes. We fix n−1 tional measurement can be solved to global optimum with n =1000and x is generated uniformly random from CS .Wevarythe 2 Cx ratio m/n. ≥ 2 m Ω x n poly log n samples. Our result raises sev- eral interesting questions that we discuss below. d) Comparison with generic random measurements. Another interesting question is that, in comparison with a pure a) Tightening sample complexity. Ourestimateofthesam- random model, how many more samples are needed for the ple complexity is only tight up to logarithm factors: there random convolutional model in practice? We investigate this is a substantial gap between our theory and practice for the question numerically. We consider the same three cases for dependence of the logarithm factors. We believe the high x as we did before, and consider two random measurement order dependence of the logarithm factors is an artifact of our models analysis. In particular, our analysis in Appendix VI-D is based |  | | | on the result of RIP conditions for partial circulant random y1 = a x , y2 = Ax , matrices, which is in no way tight. We believe that by using ∼CN 0 ∼ CN 0 where a ( , I),andak i.i.d. ( , I) is a row vector advanced tools in probability, the sample complexity can be of A. For each case, we fix the signal length n = 100 and vary tightened to at least m ≥ Ω n log6 n . the ratio m/n. We repeat the experiment 100 times. We ini- tialize the algorithm by the spectral method in Algorithm 1 for b) Geometric analysis and global result. Our convergence both models, and run gradient descent (II.1). We judge success analysis is based on showing iterative contraction of gradient −5 of recovery once the error (IV.1) is smaller than 10 .From descent methods. However, it would be interesting if we could Figure 5, we can see that when x is typical (e.g., x = e1 characterize the function landscape of nonconvex objectives as n−1 or x is uniformly random generated from CS ), under in [38]. Such a result would provide a better explanation of the same settings, the samples needed for the two random why the gradient descent method works, and help us design models are almost the same. However, when x is Fourier more efficient algorithms. The major difficulty we encountered √1 1 sparse (e.g., x = n ), more samples are required for the is the lack of probability tools for analyzing the random random convolution model. convolutional model: because of the nonhomogeneity of Cz over the sphere, it is hard to tightly uniformize quantities B. Experiments on Real Problems of random convolutional matrices over the complex sphere CSn−1 a) Experiments on real antenna data for 5G . Our preliminary analysis results in suboptimal bounds communication. We demonstrate the effectiveness of the for sample complexity. proposed method on a problem arising in 5G communication, as we mentioned in the introduction. Figure 6 (left) shows an c) Tools for analyzing other structured nonconvex antenna pattern a ∈ C361 obtained from Bell labs. We observe problems. This work is part of a recent surge of research the modulus of the convolution of this pattern with the signal efforts on deriving provable and practical nonconvex algo- of interest. 
For three different types of signals with length rithms to central problems in modern signal processing and machine learning [42], [44], [45], [61]–[81]. On the other n =20,(1)x = e1,(2)x is uniformly random generated CSn−1 √1 1 hand, we believe the probability tools of decoupling and from ,(3)x = n , our result in Figure 6 (right) shows that we can achieve almost perfect recovery. measure concentration we developed here can form a solid foundation for studying other nonconvex problems under the b) Experiments on real images. Finally, we run the experi- random convolutional model. Those problems include blind ment on some real images to demonstrate the effectiveness and calibration [82]–[84], sparse blind deconvolution [16], [78], the efficiency of the proposed method. We use m =5n log n [85]–[94], and convolutional dictionary learning [17], [18], samples for reconstruction. The kernel a ∈ Cm is randomly [95]–[97].

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1794 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

Fig. 4. Phase transition for different signal patterns with weightings b. We fix n = 1000 and vary the ratio m/n.

Fig. 5. Phase transition of random convolution model vs. i.i.d. random model. We fix n = 1000 and vary the ratio m/n.

Fig. 6. Experiments on real antenna filter for 5G communication.

VI. PROOFS OF TECHNICAL RESULTS the proof of our main result, i.e., Theorem 3.1, where some In this section, we provide the detailed proof of Theo- key details are provided in Section VI-C. All the other sup- rem 3.1. The section is organized as follows. In Section VI-A, porting results are provided subsequently. We provide detailed we show that the the initialization produced by Algorithm 1 proofs of two key supporting lemmas in Section VI-D and is close to the optimal solution. In Section VI-B, we sketch Section VI-E, respectively. Finally, other supporting lemmas

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1795

Fig. 7. Experiment on a gray 468 × 1228 electron microscopy image.

Fig. 8. Experiment on colored images. are postponed to the appendices: (i) in Appendix A, we intro- we have duce the elementary tools and results that are useful throughout 2 X ≤  2 analysis; (ii) in Appendix B, we provide results of bounding dist (z0, ) δ x the suprema of chaos processes for random circulant matrices; with probability at least 1−c m−c2 . (iii) in Appendix C, we provide concentration results for 1 The proof is similar to that of [7]. However, the proof suprema of some dependent random processes via decoupling. in [7] only holds for generic random measurements. Our proof here is tailored for random circulant matrices. We sketch A. Spectral Initialization the main ideas of the proof below: more detailed analysis Proposition 6.1: Suppose z0 is produced by Algorithm 1. for concentration of random circulant matrices is retained to Given a fixed scalar δ>0, whenever m ≥ Cδ−2n log7 n, Appendix B and Appendix C.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1796 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

Proof Without loss of generality, we assume that x =1. B. Proof of Main Result  Let z0 be the leading eigenvector of In this section, we prove Theorem 3.1. Without loss of generality, we assume x =1for the rest of the section. 1 m Y = |a∗x|2 a a∗ Given the function m k k k k=1 1 2 f(z)= b1/2  (y −|Az|) , 2m with z0 =1,andletσ1 be the corresponding eigenvalue. We have we show that simple generalized gradient descent ∂ X ≤ −    X z = z − τ f(z), (VI.3) dist(z0, ) z0 z0 +dist(z0, ) . ∂z ∂ 1 ∗  f(z)= A diag (b) × First, since z0 = λz0,wehave ∂z m [Az − y  exp (iφ(Az))] , (VI.4) z0 − z0 = |λ − 1| . with spectral initialization converges linearly to the target By Theorem B.1 in Appendix B, for any ε>0, whenever solution. We restate our main result below. −2 4 31 m ≥ Cε n log n, we know that Theorem 6.2 (Main Result): Whenever m ≥ C0 n log n,   Algorithm 1 produces an initialization z(0) that satisfies |λ − 1|≤λ2 − 1   (0) −6   dist z , X ≤ c0 log n x  1 m  =  |a∗x|2 − 1 ≤ ε/2 (VI.1) k −c2 m  with probability at least 1 − c m . Suppose b = ζ 2 (y), k=1 1 σ where −c log3 n with probability at least 1 − 2 m ,wherec, C > 0 are 2 ζ 2 (t)=1− 2πσ ξ 2 (t), (VI.5) some numerical constants. On the other hand, we have σ σ 2 1 |t| 2 2 − 2 iθ ∗ ξσ (t)= 2 exp 2 , (VI.6) dist (z0, X )=argmin z0 − xe =2− 2 |x z0| . 2πσ 2σ θ 2 (0) 2 Theorem C.1 in Appendix C implies that for any δ>0, with σ > 1/2. Starting from z , with σ = −2 7 0.51 and stepsize τ =2.02, whenever m ≥ whenever m ≥ C δ n log n 2   Cx 17 4 C1 x2 max log n, n log n , with probability at ∗ 2 −   ≤ −c4 (r) Y xx + x I δ, least 1 − c3 m for all iterate z (r ≥ 1) defined in (VI.3), we have with probability at least 1 − 2m−c1 .Herec > 0 is some 1 (r) X ≤ − r (0) X numerical constant. It further implies that dist z , (1 ) dist z , ,       holds for some small numerical constant ∈ (0, 1). ∗  − ∗ 2 −  ≤ z0Y z0 z0x 1 δ, Our proof critically depends on the following result, where we show that with high probability for every z ∈ Cn close so that enough to the optimal set X , the iterate produced by (VI.3) is   a contraction. ∗ 2 ≥ − − z0x σ1 1 δ, Proposition 6.3 (Iterative Contraction): Let σ2 =0.51 and τ =2.02. There exists some positive con- where σ1 is the top singular value of Y .Sinceσ1 is the top stants c ,c ,c and C, such that whenever m ≥ 2 1 2  3  singular value, we have Cx 17 4 C x2 max log n, n log n , with probability at least ∗ ∗ ∗ 2 − −c2 ∈ Cn X ≤ σ ≥ x Yx = x (Y − xx −x I)x +2 1 c1m for every z satisfying dist (z, ) 1 −6   ≥ − c3 log n x ,wehave 2 δ. ! " ∂ dist z − τ f(z), X ≤ (1 − )dist(z, X ) Thus, for δ>0 sufficiently small, we obtain ∂z √ 2  ∈ ∂ dist (z0, X ) ≤ 2 − 2 1 − 2δ ≤ 2δ. (VI.2) holds for some small constant (0, 1). Here, ∂z f(z) is defined in (VI.4). Choose δ = ε2/8. Combining the results in (VI.1) and (VI.2), To prove this proposition, let us first define we obtain that M = M(a) 2 dist(z0, X ) ≤z0 − z0 +dist(z0, X ) ≤ ε, . 2σ +1 ∗ = A diag (ζ 2 (y)) A, (VI.7) m σ . holds with high probability. H = H(a) = P x⊥ M(a)P x⊥ , (VI.8)

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1797 and introduce depending on σ2. With ε =0.2 and σ2 =0.51, Lemma 6.7 ≤ . E implies that Δ(ε) 0.404. Thus, we have h(t) = s∼N (0,1) [ψ(t + s)] . (VI.9)  ! 2 iθ 5.686δ Given some scalar ε>0 and σ > 1/2, let us introduce a P ⊥ d≤ z − xe δ +(1+δ) x ρ quantity % " 1 1 1+2.2δ +0.404(1 + δ) . 2 + Δ∞(ε) = 1+2σ (1 + ε)Es∼CN (0,1) [ψ(t + s)] 1 − ρ 1 − δ 2 P xd≤(0.505 + c 2 δ). σ − ζσ2 (t)ψ(t) , (VI.10) L∞ By choosing the constants δ and ρ sufficiently small, direct calculation reveals that where ψ(t)=exp(−2iφ(t)) and ζσ2 is defined in (VI.5). We sketch the main idea of the proof below. More detailed 2 2 2 dist (z, X ) ≤P ⊥ d + P d analysis is postponed to Appendix VI-C, Appendix VI-D and x x iθ2 Appendix VI-E. ≤ 0.96 z − xe Proof [Proof of Proposition 6.3] By (VI.3) and (VI.4), and =0.96 dist2 (z, X ) , with the choice of stepsize τ =2σ2 +1,wehave 2 as desired. 2σ +1 ∗ z = z − A diag (ζ 2 (y)) × m σ Now with Proposition 6.3 in hand, we are ready to prove [Az − y  exp (iφ (Az))] Theorem 6.2 (in other words, Theorem 3.1). 2 2σ +1 ∗ Proof [Proof of Theorem 6.2] We prove the theorem by = z − Mz + A diag (ζ 2 (y)) × m σ recursion. Let us assume that the properties in Proposition 6.3 E [y  exp (iφ (Az))] . holds, which happens on an event with probability at −c2 least 1 − c1m for some numerical constants c1,c2 > 0. For any z ∈ Cn, let us decompose z as By Proposition 6.1 in Appendix VI-A, for any numerical con- ≥ −12 31 z = αx + βw, (VI.11) stant δ>0, whenever m Cδ n log n, the initialization z(0) produced by Algorithm 1 satisfies − where α, β ∈ C,andw ∈ CSn 1 with w ⊥ x,and (0) 3 −6 α = |α| eiφ(α) with the phase φ(α) of α satisfies eiφ(α) = dist z , X ≤ c3δ log n x , x∗z/ |x∗z|. Therefore, if we let − −c5 with probability at least 1 c4 m . Therefore, conditioned − iθ E θ =argminθ∈[0,2π) z xe , (VI.12) on the event , we know that ! " ∂ then we also have φ (α)=θ. Thus, by using the results above, dist z(1), X =dist z(0) − τ f(z), X we observe ∂z 2 ≤ − (0) X dist2 (z, X )= min z − xeiθ (1 )dist z , θ∈[0,2π) ∈ iθ 2 2 2 holds for some small constant (0, 1). This proves (III.2) ≤ z − e x ≤P ⊥ d + P d , x x for the first iteration z(1). Notice that the inequality above also (1) 3 −6 where we define implies that dist z , X ≤ c3δ log n x. Therefore, 2 by reapplying the same reasoning, we can prove (III.2) for . 2σ +1 d(z) =(I − M) z − eiθx − eiθMx+ × the iterations r =2, 3, ···. m ∗ A diag (ζσ2 (y)) [y  exp (iφ (Az))] . (VI.13)

Let δ>0, by Lemma 6.7 and Lemma 6.8, whenever C. Bounding Px⊥ d(z) and Pxd(z) ≥  2 17 −2 4 m C Cx max log n, δ n log n , with probability   −c2 n iθ Let d(z) be defined as in (VI.13) and assume that x =1. at least 1−c1m for all z ∈ C such that z − xe ≤ − In this section, we provide bounds for P xd and P ⊥ d c δ3 log 6 n,wehave x 3 # under the condition that z and x are close. Before presenting $ the main results, let us first introduce some useful preliminary iθ 4δ P ⊥ d≤ z − xe δ +(1+δ) 2σ2 +1 x ρ lemmas. First, based on the decomposition of z in (VI.11) % & and the definition of θ in (VI.12), we can show the following 1 1 1+(2+ε)δ +(1+δ)Δ∞(ε) result. + − − Lemma 6.4: Let θ =argmin z − xeiθ and sup- 1 ρ 1 δ 2 θ∈[0,2π) ! " 2 pose dist (z, x)= z − xeiθ ≤  for some  ∈ (0, 1),then 2σ iθ  ≤ 2 − P xd + cσ δ z xe we have 1+2σ2     ∈  β  1 iθ holds for any ρ (0, 1),whereΔ∞(ε) be defined in (VI.10)   ≤ z − xe . α 1 −  with ε ∈ (0, 1). Here, cσ2 is a numerical constant only

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1798 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

Proof Given the facts in (VI.11) and (VI.12) that z = αx + The analysis of bounding P x⊥ d is similar to that of [11]. n−1 ∗ βw with w ∈ CS and w ⊥ x,andφ(α)=θ,wehave Proof Let Ay = A diag (ζσ2 (y)). By the definition (VI.13) 2 of d(z), notice that z − xeiθ =(|α|−1)2 + |β|2 . iθ P ⊥ d≤P ⊥ (I − M) z − e x + x ) x , This implies that 2 * + 2σ +1 iφ(Az) iθ P x⊥ Ay y  e − e Mx |β|≤z − xeiθ , |α|≥1 − z − xeiθ m     − iθ For the second term, by (C.9) in Theorem C.4, for any δ>0,  β  z xe 1 iθ =⇒   ≤ ≤ z − xe , ≥ −2  2 4 α 1 −z − xeiθ 1 −  whenever m C1δ Cx n log n,wehave  − ≤ as desired. P x⊥ (I M) δ, (VI.14) 3 −c2 log n On the other hand, our proof is also critically depends on with probability at least 1 − c1m .Forthefirstterm, the concentration of M(a) in Theorem C.4 of Appendix C, we observe ) , and the following lemmas. Detailed proofs are given in 2 * + 2σ +1 iφ(Az) iθ P x⊥ Ay y  e − e Mx Appendix VI-D and Appendix VI-E. m ∈ ) Lemma 6.5: For any given scalar δ (0, 1),letγ = 2 3 −6 2σ +1 c0δ log n,' whenever ( = P x⊥ Ay× ≥  2 17 −2 4 m , m C max Cx log n, δ n log n , with probability * + − −c2  ≤   iφ(Az) iθ+iφ(Ax) at least 1 c1m for all w with w γ x ,wehavethe |Ax| e − e inequality 2 √ 2σ +1 ∗ 1/2 ≤ ⊥ ×  1 ≤   P x A diag ζ 2 (y) Ax |Aw|≥|Ax| δ m w . m * σ + 1/2 | | iφ(Az) − iθ+iφ(Ax) Lemma 6.6: For any scalar δ ∈ (0, 1), whenever diag ζσ2 (y) Ax e e . 2 −2 4 m ≥ C Cx δ n log n, with probability at least 1 − −c log3 n n By (C.9) in Theorem C.4 and Lemma C.10 in Appendix C, for cm for all w ∈ C with w ⊥ x,wehave − 2 4 any δ>0, whenever m ≥ C δ 2 C  n log n,wehave % 1 x 2 % 2σ2 +1 2 1/2 −iφ(Ax) 2σ +1 ∗ diag ζ 2 (y) Aw  e 1/2 σ P ⊥ A diag ζ 2 (y) m m x σ 1+(2+ε)δ +(1+δ)Δ∞(ε) ≤ w2 . ≤H1/2 ≤ (E [H] + H − E [H])1/2 2 ≤ (1 + δ)1/2 ≤ 1+δ, Here, Δ∞(ε) is defined in (VI.10) for any scalar ε ∈ (0, 1). 3 2 −c2 log n In particular, when σ =0.51 and ε =0.2,wehave with probability at least 1−c1m . And by Lemma A.1 n Δ(ε) ≤ 0.404. With the same probability for all w ∈ C and decomposition of z in (VI.11) with φ(α)=θ, we obtain ⊥ % with w x,wehave 2 % 2 2σ +1 1/2 diag ζ 2 (y) × 2 σ 2σ +1 1/2  −iφ(Ax) m diag ζ 2 (y) Aw e m σ * + iφ(Az) iθ+iφ(Ax) 1+2.2δ +0.404(1 + δ) |Ax| e − e ≤ w2 . % 2 2σ2 +1  ⊥  1/2 1) Bounding the “x-perpendicular” term Px d = diag ζ 2 (y) × m σ Lemma 6.7: Let d be defined in (VI.13), and suppose 2 * + σ > 1/2 be a constant. For any δ> 0, whenever β iφ(Ax) iφ Ax+ Aw ≥  2 17 −2 4 | | − ( α ) m C Cx max log n, δ n log n , with probability Ax e e at least 1−c m−c2 for all z ∈ Cn such that z − xeiθ ≤ 1 % 3 −6 2σ2 +1 c3δ log n,wehave ≤ | |1 # 2 Ax | β ||Aw|≥ρ|Ax| m % α $   iθ 4δ 2   2 P x⊥ d≤ z − xe δ +(1+δ) 2σ +1 1 β  2σ +1 1/2 ρ +   diag ζ 2 (y) × & 1 − ρ α m σ % 1 1 1+2δ +(1+δ)Δ∞(ε) + − − .  −iφ(Ax) 1 ρ 1 δ 2 Aw e ,   Here, Δ∞(ε) is defined in (VI.10) for any scalar ε ∈ (0, 1). −  β  2 for any ρ ∈ (0, 1). By Lemma 6.4, we know that ρ 1   ≤ In particular, when ε =0.2 and σ =0.51,wehaveΔ∞(ε) ≤ α 2 − iθ 3 −6 0.404. ρ z xe

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1799

where cρ'is a constant depending on ρ(. Thus, whenever m ≥ definition of d(z) in (VI.13), we observe 2 17 −2 4 C2 max Cx log n, δ n log n for any δ ∈ (0, 1), P xd −c2 n−1  ) with probability at least 1−c1m for all w ∈ CS ,   ∗ iθ iθ Lemma 6.5 implies that = x (I − M) z − e x − e Mx + ,   2    √ 2σ +1 ∗  δ β 2  | |1 ≤   A diag (ζσ (y)) [y exp (iφ (Az))]  Ax | β || |≥ | |   m m α Aw ρ Ax  ρ α  2δ √ ≤  − ∗E | |− iθ − iθ ∗ ≤ m z − xeiθ .  (1 x [M] x)(α 1) e e x Mx + ρ  2  2σ +1 ∗ ∗ 2   ∈ ≥ x A diag (ζσ (y)) [y exp (iφ (Az))]  Moreover, for any δ (0, 1), whenever m m  2 4 iθ C3 Cx n log n, with probability at least + M − E [M] z − xe − 3 n−1 1−c m c4 log n w ∈ CS w ⊥ x ∗ 3 for all with , ≤|(1 − x E [M] x)|||α|−1| Lemma 6.6 implies that    T 1 % iθ 2 + M − E [M] z − xe 2σ +1 1/2 −iφ(Ax)   diag ζ 2 (y) Aw  e * + σ  2 i  m 2σ +1 ∗ iφ(Az−iφ(Ax))−e θ1  +  x Ay (Ax)  e , % m 1+2δ +(1+δ)Δ∞(ε)    ≤ , T 2 2 where for the second inequality, we used Lemma C.10 such ∗ where Δ∞(ε) is defined in (VI.10) for some ε ∈ (0, 1). that x (I − E [M]) w =0.ForthefirsttermT1, notice that iθ 3 −6 iθ - In addition, whenever z − xe ≤ c5δ log n z − xe − iθ || |− |2  2 ≥|||− | for some constant c5 > 0, Lemma 6.4 implies that z xe = α 1 + βw α 1 ,   2   E 2σ ∗ β 1 and by using the fact that [M]=I + 1+2σ2 xx in   ≤ z − xeiθ   3 −6 Lemma C.10, we have α 1 − c5δ log n 2 2 1 iθ 2σ 2σ iθ ≤ z − xe , T1 = ||α|−1|≤ z − xe . 1 − δ 1+2σ2 1+2σ2

For the term T2, using the fact that z = αx + βw and for δ>0 sufficiently small. Thus, combining the results above, θ = φ(α), and by Lemma A.2, notice that we have the bound  ! "   #  iφ(Az)−iφ(Ax) iθ iθ βAw  e − e 1 +ie  $ αAx iθ 4δ 2  ! " P x⊥ d≤ z − xe δ +(1+δ) 2σ +1   βAw βAw ρ  iφ(1+ Ax )  = e α − 1 +i  % & αAx     1 1 1+2δ +(1+δ)Δ∞(ε)  2  2 + β  Aw  1 − ρ 1 − δ 2 ≤ 6     , α Ax     holds as desired. Finally, when σ2 =0.51 and ε =0.2,  βAw  ≤ whenever αAx 1/2. Thus, by using the result above, the bound for Δ∞(ε) can be found in Lemma 6.15 in we observe Appendix VI-E. T2  * + 2) Bounding the “x-parallel” term P d  2  x ≤ 2σ +1 ∗  1  2  x Ay (Ax) | β ||Aw|≥ 1 |Ax|  Lemma 6.8: Let d(z) be defined in (VI.13), and let m α 2 2  ! σ > 1/2 be a constant. For any δ> 0, whenever  2 2 − 2σ +1 ∗ ∗ 17 2 4  2  m ≥ C Cx max log n, δ n log n , with probabil- +  x A diag ζσ (y) − m − c2 − iθ ≤ "  ity at least 1 c1m for all z such that z xe  3 −6 iφ(Az)−iφ(Ax) − iθ1  1  c3δ log n,wehave e e | β || |≤ 1 | | Ax α Aw 2 Ax ! "  * + 2 2σ2 +1  2σ iθ ≤  ∗  1  P xd≤ + cσ2 δ z − xe . 2  x Ay (Ax) | β ||Aw|≥ 1 |Ax|  2 m α 2 1+2σ   ! "    2   2 2σ +1 ∗  βAw  β  2 +  x Ay (Ax)  +6  Here, cσ > 0 is some numerical constant depending only m αAx α on σ2.    2 2  ∗ 2σ +1 ∗ |Aw|  Proof Let Ay = A diag (ζ 2 (y)). Given the decomposition ∗ σ ×  x A diag ζσ2 (y)  Ax of z in (VI.11) with w ⊥ x and φ(α)=θ, and by the  m |Ax|2 

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1800 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

2 3 2σ +1 −c3 log n ≤    1 holds with probability at least 1 − 2m for some 2 A Ax | β ||Aw|≤ 1 |Ax|   m α 2 1 β iθ 3 −6     constant c3 > 0.If   ≤ z − xe ≤ c4δ log n,  2  2  2 α ' ( β 2σ +1 ∗ ∗    2  2 17 −2 4 +6   w A diag (ζσ (y)) Aw whenever m ≥ C3 max Cx log n, δ n log n , α   m     2  Lemma 6.5 implies that 1  β  2σ +1 ∗ ∗  +    x A diag (ζσ2 (y)) Aw 2 α m | |1     Ax | β ||Aw|≥ 1 |Ax|    2 * +   α 2 1  β  2σ +1 ∗ 2iφ(Ax)    √ √ +    x Ay e  Aw   β  iθ 2 α m ≤ 2δ   m ≤ 4δ m z − xe α ⊥ Given the fact that x w, by Lemma C.10 again we have n−1 −c6 ∗ holds for all w ∈CS with probability at least 1−c5m . x E [M] w =0 − . Thus − iθ ≤ c4 3 6   Given z xe 4 δ log n, combining the estimates  2  2σ +1 ∗ ∗  above, we have  x A diag (ζσ2 (y)) Aw  m 2 2σ 2 ∗ ∗ P d≤ +3δ +8(1+δ)δ 2σ +1 = |x Mw− x E [M] w|≤M − E [M] , x 2 ! 1+2σ "  2 1+4σ 3 −6 iθ and similarly we have +24c4 + δ δ log n z − xe   1+2σ2  2 * + ! " 2σ +1 ∗ ∗ 2iφ(Ax)  2 2  2σ iθ  x A diag (ζσ (y)) e Aw  ≤ + c 2 δ z − xe  m  1+2σ2 σ  2 * + 2σ +1   −2iφ(Ax)  =  w A diag (ζσ2 (y)) e  Ax  for δ sufficiently small. Here, cσ2 is some positive numerical  m  constant depending only on σ2.  2  2σ +1    =  w A diag (ζσ2 (y)) Ax  m   2  D. Proof of Lemma 6.5 2σ +1 ∗ ∗  =  x A diag (ζσ2 (y)) Aw m In this section, we prove Lemma 6.5 in Section VI-C, which ≤M − E [M] . can be restated as follows. Lemma 6.9: For any given scalar δ ∈ (0, 1),letγ = − Thus, suppose z − xeiθ ≤ 1 , by using Lemma 6.4 we know c δ3 log 6 n,' whenever (   2 0  β  2 17 −2 4   ≤ − iθ m ≥ C max Cx log n, δ n log n , with probability that α 2 z xe . Combining the estimates above, −c2 we obtain at least 1−c1m for all w with w≤γ x,wehavethe inequality 2σ2 +1 T ≤    1 √ 3 2 A Ax | β ||Aw|≤ 1 |Ax| m α 2 Ax  1|Aw|≥|Ax| ≤ δ m w . (VI.15) 2 +24M z − xeiθ We prove this lemma using the results in Lemma 6.10 and iθ +2M − E [M] z − xe . Lemma 6.11. Proof By Corollary B.2, for some small scalar ε ∈ T T Combining the estimates for 1 and 2,wehave (0, 1), whenever m ≥ Cnlog4 n, with probability at least − −c log3 n  ≤   P d 1 m for every w with w γ x ,wehave x √ √ 2  ≤  ≤   2σ iθ iθ Aw (1 + ε) m w (1 + ε)γ m x ≤ z − x +3M − E [M] z − x ! " 1+2σ2 1/2 1+ε 2σ2 +1 ≤ γ Ax≤2γ Ax .    1 1 − ε +2 A Ax | β ||Aw|≤ 1 |Ax| m α 2 2 Let us define a set +24M z − xeiθ . . ∗ ∗ S = {k ||a w|≥|a x|} . By Theorem C.4, for any δ>0, whenever m ≥ k k −2 2 4 S |S| C1δ Cx n log n,wehave By Lemma 6.10, for every set with >ρm(with some ρ ∈ (0, 1) to be chosen later), with probability at least M − E [M]≤δ, ρ4m 1 − exp − 2 ,wehave 2Cx 1+4σ2 M≤E [M] + δ = + δ 1+2σ2 ρ3/2 (Ax)  1S  > Ax . 3 32 − −c2 log n holds with probability at least 1 c1m . By Corol- 3/2 −2 4 ρ lary B.2, for any δ ∈ (0, 1), whenever m ≥ C2δ n log n, Choose ρ such that γ = 64 ,wehave we have Aw≥(Aw)  1S ≥(Ax)  1S  √ A≤(1 + δ) m > 2γ Ax .

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1801

−4 This contradicts with the fact that Aw≤2γ Ax. Lemma 6.11: Given any scalar δ>0,letρ ∈ (0,cδlog n) Therefore, whenever w≤γ x, with high probability with cδ be some constant depending on δ, whenever m ≥ 2 we have |S| ≤ ρm holds. Given any δ>0, choose γ = Cδ−2n log4 n, with probability at least 1 − 2 m−c log n,for 3 2 3 −6 ρ / any set S∈[m] with |S| <ρmand for all w ∈ Cn,wehave cδ log n for some constant c>0. Because γ = 64 , we know that ρ = c δ2/ log4 n. By Lemma 6.11, whenever √ 2 m ≥ Cδ−2n log4 n, with probability at least 1−2 m−c log n (Aw)  1S ≤δ m w . − for all w ∈ CSn 1,wehave √ Proof Without loss of generality, let us assume that w =1. | |1 ≤| |1 ≤   Ax |Aw|≥|Ax| Aw S δ m w . First, notice that Combining the results above, we complete the proof. Aw  1S  =supv, Aw Lemma 6.10: Let ρ ∈ (0, 1) be a positive scalar, with v∈CSm−1, supp(v)⊆S ρ4 m ≤  ∗  probability at least 1−exp − 2 , for every set S∈[m] sup A v . 2Cx −1 v∈CSm , supp(v)⊆S with |S| ≥ ρm,wehave

1 3/2 By Lemma A.11, for any positive scalar δ>0 and any |Ax|1S  > ρ Ax . −4 − 4 32 ρ ∈ (0,cδ2 log n), whenever m ≥ Cδ 2n log n, with − −c log2 n To prove this, let us define probability at least 1 m ,wehave ⎧ √ ⎨⎪1 if |u|≤v, sup A∗v≤δ m. 1 v∈CSn−1, supp(v)⊆S gv(u)= (2v −|u|) v<|u|≤2v (VI.16) ⎩⎪ v 0 otherwise, Combining the result above, we complete the proof. for a variable u ∈ C and a fixed positive scalar v ∈ R. Lemma 6.12: For a variable u ∈ C and a fixed positive Proof Let ρ ∈ (0, 1) be a positive scalar, from Lemma 6.12, scalar v ∈ R, the function g (u) introduced in (VI.16) is we know that v 1/v-Lipschitz. Moreover, the following bound  Ax  ≥ 1| |≤ gρ( ) 1 Ax ρ 1

gv(u) ≥ ½|u|≤v holds uniformly. Thus, for an independent copy a of a, we have   holds uniformly for u over the whole space.   Proof The proof of Lipschitz continuity of g (u) is straight gρ(Cxa) −gρ(Cxa ) v 1 1√ forward, and the inequality directly follows from the definition ≤ −  ≤ m  −  gρ(Cxa) gρ(Cxa ) 1 Cxa Cxa of gv(u). √ ρ m ≤ C a − a  . ρ x   E. Proof of Lemma 6.6 Therefore, we can see√ that gρ(Cxa) 1 is L-Lipschitz with m   In this section, we prove Lemma 6.6 in Section VI-C, which respect to a, with L = ρ Cx . By Gaussian concentration inequality in Lemma A.3, we have can be restated as follows. 2 Lemma 6.13: For any scalar δ ∈ (0, 1), whenever    t   − 2 2 − 4 P   − E   ≥ ≤ 2L ≥   2 gρ(Cxa) 1 gρ(Cxa) 1 t 2e . m C Cx δ n log n, with probability at least √ −c log3 n n ∗ 1 − cm for all w ∈ C with w ⊥ x,wehave By using the fact that 2 |a x| follows the χ distribution, k we have % 2 2 m * + 2σ +1 1/2 −iφ(Ax)   diag ζ 2 (y) Aw  e

E  σ  ≤ E ½ gρ(Cxa) | ∗ |≤ m 1 ak x 2ρ k=1 1+(2+ε)δ +(1+δ)Δ∞(ε) m ≤ w2 P | ∗ |≤ ≤ 2 = ( akx 2ρ) ρm. k=1 holds. In particular, when σ2 =0.51 and ε =0.2,wehave ρ4 m Δ(ε) ≤ 0.404. With the same probability for all w ∈ Cn with Thus, with probability at least 1−2exp − 2 ,wehave 2Cx w ⊥ x,wehave 1|Ax|≤ρ ≤gρ(Ax) ≤ 2ρm % 1 1 2 2 S |S| ≥ 2σ +1 1/2 −iφ(Ax) holds. Thus, for any set such that 4ρm,wehave diag ζ 2 (y) Aw  e m σ   1 2 ≥  1 2 ≥ 3 (Ax) S (Ax) |Ax|≤ρ 2ρ m. 1+2.2δ +0.404(1 + δ) ≤ w2 . Thus, by replacing 4ρ with ρ, we complete the proof. 2

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1802 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

− 3 Proof Without loss of generality, let us assume w ∈ CSn 1. m ≥ Cδ−2n log4 n, with probability at least 1 − m−c log n − − For any w ∈ CSn 1 with w ⊥ x, we observe for all w ∈ CSn 1 with w ⊥ x,wehave   %  2  2 2σ +1    2 w A diag (ζ 2 (Ax)ψ(Ax)) Aw 2σ +1 1/2 −iφ(Ax)  σ  diag ζ 2 (y) Aw  e m σ m ≤ % (1 + δ)Δ∞(ε)+(1+ε)δ. 2 1 2σ +1 1/2 ∼CN 0 = diag ζ 2 (y) × Proof Let g = a and let δ ( , I) independent of g. 2 m σ ∗ Let v = R[1:n]w. Given a small scalar ε>0,wehave * + 2   −iφ(Ax) iφ(Ax)  2  (Aw)  e − (Aw)  e 2σ +1    2      v Cg diag (ζσ (g x)ψ(g x)) Cgv  2 m 1 ∗ 12σ +1     2 ≤ |w P x⊥ MPx⊥ w| +  w P ⊥ A 2σ +1   x ≤  2   2 2 m   v Cg diag ζσ (g x)ψ(g x)  m   × 2 ⊥  diag (ζσ (Ax)ψ(Ax)) AP x w − E −   ×  (1 + ε) δ [ψ((g δ) x)] Cgv +(1+ε)  2   ≤ 1 E  1  − E  12σ +1   2  [H] + H [H] +  w 2σ +1   E −   2 2 2 m   v Cg diag ( δ [ψ((g δ) x)]) Cgv  m   × 2 ⊥  P x⊥ A diag (ζσ (Ax)ψ(Ax)) AP x w Δ∞(ε) ∗ 2 ≤ R[1:n]C +(1+ε)×  m g    ∈ CSn−1 ⊥  2  holding for all w with w x,whereM and 2σ +1   E −  2  v Cg diag ( δ [ψ((g δ) x)]) Cgv . H are defined in (VI.7) and (VI.8), and ψ(t)= t/ |t| .   m    By Lemma C.10, we know that D(g,w) δ>0 m ≥ E [H] = P ⊥ ≤1. (VI.17) By Corollary B.2, for any , whenever x −2 4 C0δ n log n for some constant C0 > 0,wehave By Theorem C.4, we know that for any δ>0, whenever ∗ 2 ≤ −2 2 4 R[1:n]Cg (1 + δ)m m ≥ C1δ Cx n log n,wehave 3 with probability at least 1−m−c0 log m for some constant H − E [H]≤δ, c0 > 0. Next, let us define a decoupled version of D(g, w), 3 − −c2 log n 2 with probability at least 1 c1m . In addition, QD 1 2 2σ +1   × dec(g , g , w)= w R[1:n]Cg1 Lemma 6.14 implies that for any δ>0,whenm ≥ m −2 4 2 ∗  1 C2δ n log n for some constant C2 > 0,wehave diag ψ(g x) Cg R[1:n]w. (VI.18)    2  1 2 − 2σ +1     where g = g + δ and g = g δ.Thenbyusingthefact w P ⊥ A diag (ζ 2 (Ax)ψ(Ax)) AP ⊥ w  x σ x  that w ⊥ x,wehave m  ≤     2 (1 + δ)Δ∞(ε)+(1+ε)δ,  D 1 2  2σ +1   Eδ Q (g , g , w) =  w R[1:n]C × dec m g − 3  holds with probability at least 1 − 2m c3 log n for some  E −  ∗  constant c3 > 0. Combining the results above, we obtain diag ( δ [ψ ((g δ) x)]) CgR[1:n]w. % 2 2 Then for any positive integer p ≥ 1, by Jensen’s inequality 2σ +1 1/2  −iφ(Ax) diag ζ 2 (y) Aw e and Theorem B.3, we have m σ 1+(2+ε)δ +(1+δ)Δ∞(ε) |D | ≤ . sup (g, w) 2 w=1, w⊥x Lp 2   Finally, by using Lemma 6.15, when σ =0.51 and ε =0.2, ≤ QD 1 2  sup dec(g , g , w) we have w=1 % !% Lp % " 2 √ 2 n 3/2 1/2 n n 2σ +1 1/2 −iφ(Ax) ≤C 2 log n log m + p + p , diag ζ 2 (y) Aw  e σ m σ m m m 2 1+2.2δ +0.404(1 + δ) where Cσ2 > 0 is some positive constant depending on σ , ≤ , 2  ≤ 2 and we used the fact that ψ(g x) ∞ 1 holds uniformly for all g2. Thus, by Lemma A.6, then for any δ>0, whenever as desired. −2 3 m ≥ C1δ n log n log m,wehave Lemma 6.14: For a fixed scalar ε>0,letΔ∞(ε) sup |D(g, w)|≤δ be defined as (VI.10). For any δ>0, whenever w=1, w⊥x

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1803

2 Fig. 9. Computer simulation of the functions ζσ2 (t) and h(t). (a) displays the functions ζσ2 (t) and h(t) with σ =0.51; (b) shows differences two function ζσ2 (t) − (1 + ε)h(t) with ε =0.2.

3 −c1 log m holding with probability at least 1 − m . Combining the function ζσ2 (t) − (1 + ε)h(t). Therefore, we have the results above completes the proof. 2  2 −  Δ∞(ε)=(1+2σ ) ζσ (t) (1 + ε)h(t) L∞ ≤ 0.2 × (1 + 2 × 0.51) = 0.404,

2 F. Bounding Δ∞(ε) when σ =0.51 and ε =0.2. Given h(t) and Δ∞(ε) introduced in (VI.9) and (VI.10), Lemma 6.16: For t ≥ 0,whenε =0.2 and σ2 =0.51, we prove the following results. we have Lemma 6.15: Given σ2 =0.51 and ε =0.2,wehave |ζσ2 (t) − (1 + ε)h(t)|≤0.2. Δ∞(ε) ≤ 0.404. Proof Given ε =0.2 and σ2 =0.51,wehave Proof First, by Lemma 6.17, notice that the function h(t) can 2 − be decomposed as g(t)=ζσ (t) (1 + ε)h(t) 2 − t −2 −2 −t2 = −0.2 − e 1.02 +1.2 t − 1.2 t e . h(t)=g(t)ψ(t) When t =0,wehave|g(t)| =0.Whent>10,wehave where g(t):C → [0, 1) is rotational invariant with respect to t. 2 Since ζ 2 (t) is also rotational invariant with respect to t,itis t − t −3 −t2 σ g (t)= e 1.02 − 2.4t 1 − e enough to consider the case when t ∈ [0, +∞), and bounding 0.51 2 the following quantity +2.4t−1e−t ≤ 0.

sup |(1 + ε)h(t) − ζσ2 (t)| . So the function g(t) is monotonically decreasing for t ≥ 10. t∈[0,+∞) As limt→+∞ g(t)=−0.2 and |g(10)| < 0.2,wehave Lemma 6.18 implies that |g(t)|≤0.2, ∀ t>10. E h(t)=s∼CN (0,1) [ψ(t + s)] For 0 0, Figure 9b implies that |g(t)|≤0.2 for all t ∈ (0, 10) (we = 0 t =0. omit the tedious proof here). 2 Lemma 6.17: Let ψ(t)= t/ |t| ,thenwehave When t =0, we obtain that |(1 + ε)h(t) − ζσ2 (t)| =0.For t>0,whenε =0.2 and σ2 =0.51,wehave h(t)=Es∼CN (0,1) [ψ(t + s)] = g(t)ψ(t).

2 − ζσ (t) (1 + ε)h(t) where g(t):C → [0, 1), such that 2 − t −2 −2 −t2   − − 1.02 − | | 2 − 2 = 0.2 e +1.2 t 1.2 t e . ( t + v1) v2 g(t)=E ∼N , v1,v2 (0,1/2) | | 2 2 From Lemma 6.16, we can prove that ( t + v1) + v2  2 −  ≤ ∼N ∼N ζσ (t) (1 + ε)h(t) L∞ 0.2 by a tight approximation of where v1 (0, 1/2),andv2 (0, 1/2).

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1804 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

Proof By definition, we know that cosh (2rt cos θ) dθdr +∞ π/2 E 2 − 2 − 2 s∼CN# (0,1) [ψ(t&+ s)] # & = e t cos(2θ)re r × ! "2 ! "2 π t + s t t + s r=0 θ=0 = E = E ψ(t). [cosh (2rt cos θ) − cosh(2rt sin θ)] dθdr s |t + s| s |t| |t + s|    g(t) where the third equality uses the fact that the integral of an odd function is zero. By using Taylor expansion of cosh(x), Next, we estimate g(t) and show that it is indeed real. and by using the dominated convergence theorem to exchange We decompose the random variable s as ! " ! " the summation and integration, we observe ts t ts t t t s =  +i = v1 +iv2 , |t| |t| |t| |t| |t| |t| E ∼CN [ψ(t + s)] s (0,1) +∞ π ts ts 2 −t2 −r2 where v1 =  and v2 = are the real and imagi- = e cos(2θ)re × |t| |t| π nary parts of a complex Gaussian variable ts/ |t|∼CN(0, 1). # r=0 θ=0 & +∞ 2k 2k ∼N (2rt cos θ) (2rt sin θ) By rotation invariant property, we have v1 (0, 1/2) and − dθdr v2 ∼N(0, 1/2),andv1 and v2 are independent. Thus, we have (2k)! (2k)! # & k=0 ! " +∞ π | | − 2 2 − 2 t + v1 iv2 = e t cos(2θ)× h(t)=Es |t + s| π r=0 θ=0 ∞  2 # & # & + 2k 2k+1 −r | | 2 − 2 | | (2t cos θ) r e − E ( t + v1) v2 − E ( t + v1) v2 = s 2 2i s 2 (2k)! |t + s| |t + s| k=0  # & 2k 2k+1 −r2 | | 2 − 2 (2t sin θ) r e E ( t + v1) v2 dθdr = v1,v2 (2k)! 2 2 (|t| + v ) + v ∞ # 1 2 & + 2k +∞ 2 − 2 (2t) − 2 | | = e t r2k+1e r dr× − E ( t + v1) v2 π (2k)! 2i v1,v2 . r=0 | | 2 2  k=0  ( t + v1) + v2 π π 2k − 2k (|t|+v1)v2 cos(2θ)cos θdθ cos(2θ)sin θdθ . We can see that 2 2 is an odd function of v2. θ=0 θ=0 (|t|+v1) +v2 (|t|+v1)v2 Therefore, the expectation of 2 2 with respect to v2 (|t|+v1) +v2 We have the integrals is zero. Thus, we have #! " & +∞ 2 2 t t + s 2k+1 −r Γ(k +1) g(t)=E r e dr = , s |t| |t + s| r=0 √ 2   π π kΓ(k +1/2) | | 2 − 2 2k ( t + v1) v2 cos(2θ)cos θdθ = , = E ∼N , 2 Γ(k +2) v1,v2 (0,1/2) | | 2 2 θ=0 √ ( t + v1) + v2 π 2k − π kΓ(k +1/2) which is real. cos(2θ)sin θdθ = θ=0 2 Γ(k +2) Lemma 6.18: For t ∈ [0, +∞),wehave holds for any integer k ≥ 0,whereΓ(k) is the Gamma E f(t)=s∼CN (0,1) [ψ(t + s)] function such that − −2 −2 −t2 1 t + t e t>0, (2k)!√ = (VI.19) Γ(k +1)=k!, Γ(k +1/2) = π. 0 t =0. 4kk!  Proof Let sr = (s) and si = (s),andlets = r exp (iθ) Thus, for t>0,wehave with r = |s| and exp (iθ)=s/ |s|. We observe

E ∼CN [ψ(t + s)] Es∼CN (0,1) [ψ(t + s)] s (0,1) ∞ +∞ +∞ 2 + 2k √ 1 (s +is ) −| − |2 2 − 2 (2t) Γ(k +1) kΓ(k +1/2) r i sr+isi t t × × = 2 e dsrdsi = e π π s =−∞ s =−∞ |s +is | π (2k)! 2 Γ(k +2) r i r i k=0 +∞ 2π 2 2 +∞ +∞ +∞ 1 −i2θ −r −t 2rt cos θ 2 2k 2 2k 2k = e e e rdθdr −t kt −t t − t π =e = e r=0 θ=0 (k +1)! k! (k +1)! +∞ 2π k=0 k=0 k=0 1 − 2 − 2 # & = e t cos(2θ)re r e2rt cos θdθdr +∞ 2k π −t2 t2 −2 t r=0 θ=0 =e e − t − 1 +∞ π k! 2 −t2 −r2 k=0 = e cos(2θ)re × 2 −2 −2 −t π r=0 θ=0 =1 − t + t e .

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1805

When t =0, by using L’Hopital’s rule, we have Lemma A.3 (Gaussian Concentration Inequality): Let w ∈ Rn be a standard Gaussian random variable w ∼N(0, I),and η(0) = lim Es∼CN (0,1) [ψ(t + s)] n t→0 # & let g : R → R denote an L-Lipschitz function. Then for all 2 1 − et −1 t>0, = lim 1+ 2 t2 = 1 + lim =0. t→0 t e t→0 1+t P (|g(w) − E [g(w)]|≥t) ≤ 2exp −t2/(2L2) .

We complete the proof. Moreover, if w ∈ Cn with w ∼CN(0, I),andg : Cn → R In the appendix, we provide details of proofs for some is L-Lipschitz, then the inequality above still holds. supporting results. Appendix A summarizes basic tools used Proof The result for real-valued Gaussian random variables throughout the analysis. In Appendix B, we provide results of is standard, see [98, Chapter 5] for a detailed proof. For the bounding the suprema of chaos processes for random circulant complex case, let matrices. In Appendix C, we present concentration results for   1   v suprema of some dependent random processes via decoupling. v = √ I iI r , v , v ∼ N (0, I) . v r i i.i.d.  2   i APPENDIX A h ELEMENTARY TOOLS AND RESULTS By composition theorem, we know that g ◦ h : R2n → R is Lemma A.1: Given a fixed number ρ ∈ (0, 1),forany L-Lipschitz. Therefore, by applying  the Gaussian concen- z, z ∈ C,wehave v tration inequality for g ◦ h and r , we get the desired |exp (iφ(z + z)) − exp (iφ(z ))| vi result. 1 ≤  | | 2 ½| |≥ | | + (z/z ) . z ρ z 1 − ρ Theorem A.4 (Gaussian tail comparison for vector-valued ∈ Rn Proof See the proof of Lemma 3.2 of [11]. functions, Theorem 3, [99]): Let w be standard Gaussian variable w ∼N(0, I),andletf : Rn → R be an L-Lipschitz Lemma A.2: Let ρ ∈ (0, 1),foranyz ∈ C with |z|≤ρ, function. Then for any t>0,wehave we have ! " − t 2 ρ 2 P (f(w) − E [f(w)]≥t) ≤ eP v≥ , |1 − exp (iφ(1 + z)) + i (z)|≤ |z| . (A.1) (1 − ρ)2 L - where v ∈ R such that v ∼N(0, I). Moreover, if w ∈ Cn ∈ R+  2 2 Proof For any t ,letg(t)= (1 + (z)) + t ,then with w ∼CN(0, I) and f : Cn → R is L-Lipschitz, then the inequality above still holds. - t ≤ t g (t)= |  |. The proof is similar to that of Lemma A.3.  2 2 1+ (z) (1 + (z)) + t Lemma A.5 (Tail of sub-Gaussian Random Variables): Let ∈ C | |≤ X be a centered σ2 sub-Gaussian random variable, such that Hence, for any z with z ρ,wehave ! " || |−  | t2  1+z (1 + (z))  P (|X|≥t) ≤ 2exp − , -  2σ2 =  (1 + (z))2 + 2(z) − (1 + (z))   then for any integer p ≥ 1,wehave 2 (z) 1 p/2 = |g ( (z)) − g(0)|≤ ≤ 2(z). E [|X|p] ≤ 2σ2 pΓ(p/2). |1+(z)| 1 − ρ Let f(z)=1−exp (iφ(1 + z)). By using the estimates above, In particular, we have √ we observe   E | |p 1/p ≤ 1/e ≥ X Lp =( [ X ]) σe p, p 2, |f(z)+i (z)| √   and E [|X|] ≤ σ 2π. |1+z|−(1 + z)    Lemma A.6 (Sub-exponential tail bound via moment =  | | +i (z) 1+z control): Suppose X is a centered random variable satisfying 1 = ||1+z|−(1 + z)+i (z) |1+z|| p 1/p √ |1+z| (E [|X| ]) ≤ α0 + α1 p + α2 p, for all p ≥ p0 1 ≤ | || −| || ≥ | |( (z) 1 1+z for some α0,α1,α2,p0 > 0. Then, for any u p0,wehave 1+z √ || |−  | P | |≥ ≤ − + 1+!z (1 + (z)) ) " X e(α0 + α1 u + α2 u) 2exp( u) . √ 1 1 2 ≤ |z|| (z)| + (z) This further implies that for any t>α1 p0 + α2 p0,wehave |1+z| 1 − ρ ! ) ," − t2 t ≤ 2 ρ | |2 P | |≥ ≤ − 2 z . ( X c1α0 + t) 2exp c2 min 2 , , (1 − ρ) α1 α2

for some positive constants c1,c2 > 0.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1806 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

Proof The first inequality directly comes from Proposi- where N (B, · ,) is the covering number of the set B with tion 2.6 of [14] via Markov inequality, also see Proposi- diameter  ∈ (0, 1). 2 ≥ tion 7.11√ and Proposition√ 7.15 of [60]. For the second, let Theorem A.9 (Theorem 3.5, [14]): Let σξ 1 and ξ = ≤ n { }n t = α1 u + α2 u,ifα1 u α2 u,then (ξj)j=1,where ξj j=1 are independent zero-mean, variance √ 2 B t one, σξ -subgaussian random variables, and let be a class of t = α1 u + α2 u ≤ 2α2 u ⇒ u ≥ . 2α2 matrices. Let us define a quantity  * + 2 2 Otherwise, similarly, we have u ≥ t /(4α ). Combining the .  2 2  1 CB (ξ) =supBξ − E Bξ  . (A.2) two cases above, we get the desired result. B∈B In the following, we describe a tail bound for a class of For every p ≥ 1,wehave heavy-tailed random variables, whose moments are growing much faster than sub-Gaussian and sub-exponential.   ≤ 2 B · B sup Bξ Cσ [γ2 ( , )+dF ( ) Lemma A.7 (Tail bound for heavy-tailed distribution via ∈B ξ B Lp √ moment control): Suppose X is a centered random variable + pd2(B)],  * + satisfying   √  2 − E  2  E | |p 1/p ≤ ∀ ≥ sup Bξ Bξ ( [ X ]) p (α0 + α1 p + α2 p) , p p0, B∈B Lp ≤ 2 { B · B · B for some α0,α1,α2,p0 ≥ 0. Then, for any u ≥ p0,wehave Cσ γ2( , )[γ2 ( , )+dF ( )] √ ξ √  + pd (B)[γ (B, ·)+d (B)] + pd2(B) , P |X|≥eu α0 + α1 u + α2 u ≤ 2exp(−u) . 2 2 F 2

This further implies that for any t>where C 2 is some positive numerical constant only depending √ σξ p0 α0 + α1 p0 + α2 p0 ,wehave 2 · · B · on σξ ,andd2( ),dF ( ) and γ2( , ) are given in Defini- tion A.8. P (|X|≥c1t) The following theorem establishes the restricted isometry t t ≤ 2exp −c min , , property (RIP) of the Gaussian random convolution matrix. 2 m 2(α1 + α2) 2α0 Theorem A.10 (Theorem 4.1, [14] ): Let ξ ∈ C be a ∼ CN for some positive constant c ,c > 0. random vector with ξi i.i.d. (0, 1),andletΩ be a 1 2 | | E Proof The proof of the first tail bound is similar to that of fixed subset of [m] with Ω = n. Define a set s = { ∈ Cm |  ≤ } Lemma A.6 by using Markov inequality. Notice that v v 0 s , and define a matrix ∗ × P (|X|≥eu (α +(α + α )u)) Φ = R C ∈ Cn m, 0 1√ 2 Ω ξ ≤ P | |≥ ≤ − X eu α0 + α1 u + α2 u 2exp( u) . m n where RΩ : C → C is an operator that restrict a vector to 2 2 Let t = α0 u +(α1 + α2) u ,ifα0 u ≤ (α1 + α2) u ,then its entries in Ω. Then for any s ≤ m,andη, δs ∈ (0, 1) such that t = α u +(α + α ) u2 ≤ 2(α + α ) u2 0 1 2 1 2 ≥ −2 2 2 t n Cδs s log s log m, ⇒ u ≥ . 2(α + α ) 1 2 the partial random circulant matrix Φ ∈ Rn×m satisfies the Otherwise, we have u ≥ t/(2α0). Combining the two cases restricted isometry property above, we get the desired result. √ √ (1 − δs) n v≤Φv≤(1 + δs) n v (A.3) Definition A.8 (d2(·),dF (·) and γβ functional): For a given set of matrices B,wedefine − log2 s log m for all v ∈Es, with probability at least 1 − m . . . m d (B) =supB ,d(B) =supB , Lemma A.11 Let the random vector ξ ∈ C and the ran- F F 2 × B∈B B∈B dom matrix Φ ∈ Cn m be defined the same as Theorem A.10, E { ∈ Cm |  ≤ } For a metric space (T,d), an admissible sequence of T is a and let s = v v 0 s for some positive integer ≤ collection of subsets of T , {Tr : r>0}, such that for every s n. For any positive scalar δ>0 and any positive integer 2r ≤ ≥ −2 4 s>1, |Tr|≤2 and |T0| =1.Forβ ≥ 1,definetheγβ s n, whenever m Cδ n log n,wehave functional by √ Φ ≤   ∞ v δ m v , . r/β γβ(T,d) =infsup 2 d(t, Tr), −c log2 s ∈ for all v ∈E , with probability at least 1 − m . t T r=0 s Proof The proof follows from the results in [14]. Without where the infimum is taken with respect to all admissible loss of generality, we assume v =1. Let us define sets sequences of T . In particular, for γ2 functional of the set B · . equipped with distance , [100] shows that D = {v ∈ Cm : v =1, v ≤ s} , ) s,m 0 , d2(B) 1 B · ≤ 1/2 N B · V . √ −1 | ∈D γ2( , ) c log ( , ,)d, = R[1:n]F m diag (F mv) F m v s,m , 0 n

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1807

m n where R[1:n] : R → R denotes an operator that restricts a A. Controlling the Moments and Tail of T1(g) vector to its first n coordinates. Section 4 of [14] shows that m   Theorem B.1 Let g ∈ C be a random complex Gaussian   ∼CN0 2  1 2  vector with g ( ,σg I) and any fixed vector sup Φv − 1    ··· ∈ Rm v∈Ds,m n b =[b1, ,bm] . Given a partial random circulant  * + ∗ ∈ Cm×n ≥  2 2  matrix CgR[1:n] (m n), let us define =sup V vξ − Eξ V vξ  = CV (ξ), V v∈V L . (g) = where CV (ξ) is defined in (A.2). Theorem 4.1 and m 1 ∗ ∗ 1 Lemma 4.2 of [14] implies that R C diag (b) C R − b I . % m [1:n] g g [1:n] m k k=1 V V ≤ s dF ( )=1,d2( ) , Then for any integer p ≥ 1,wehave % n

s L(g) p ≤ C 2 b × L σg ∞ γ2(V, ·) ≤ c log s log m, !% % " n n √ n n log3/2 n log1/2 m + p + p . for some constant c>0. By using the estimates above, m m m Theorem 3.1 of [14] further implies for any t>0 −2 ! % " In addition, for any δ>0, whenever m ≥ C 2 δ σg s 2 2 2 4 P CV (ξ) ≥ c log s log m + t b∞ n log n,wehave 1 n ! ) ," L ≤ nt2 nt (g) δ (B.3) ≤ 2exp −c min , . 2 2 2 − 3 s log s log m s cσ2 log n holds with probability at least 1 − 2 m g . Here,

2 2 2 For any positive constant δ>0, choosing t = δ m/n, cσ ,Cσ ,andCσ2 are some numerical constants only depend- − 2 2 g g g whenever m ≥ Cδ 2n log s log m for some constant C>0 2 ing on σg . large enough, we have 2   % Proof Without loss of generality, let us assume that σg =1.  1  s m Let us first consider the case b ≥ 0,andletΛ =diag(b).  Φ 2 −  ≤ 2 2 − sup  v 1 c1 log s log m + δ For any w ∈ CSn 1, let us denote v∈Ds,m n n n . ∗ 2 m v = R w, ≤ 2δ ,  [1:n]  n S . ∈ CSm−1 ∈ 2 n = v : supp(v) [n] , with probability at least 1−m−c3 log s. Therefore, we have $ √ then   Φv≤ n +2δ2 m ≤ C δ m,  m   1 1   L  ∗ ∗ Λ −  ∈D (g)= sup v Cg Cgv bk . holds for any v s,m with high probability. v∈S m m  n k=1 By the convolution theorem, we know that APPENDIX B 1 1/2 1 1/2 MOMENTS AND SPECTRAL NORM OF PARTIAL √ Λ Cgv = √ Λ (g  v) RANDOM CIRCULANT MATRIX m m 1 ∈ Cm √ Λ1/2 −1 Let g be a random complex Gaussian vector with = F m diag (F mv) F mg ∼CN0 2 m g ( ,σg I). Given a partial random circulant matrix  m×n = V vg. CgR ∈ C (m ≥ n), we control the moments and the [1:n] * + tail bound of the terms in the following form E ∗ Λ ∗ m Since R[1:n]Cg CgR[1:n] =( k=1 bk) I, we observe 1 ∗ ∗  * + T 1(g)= R[1:n]Cg diag (b) CgR[1:n],  2 2  m L(g)= sup V vg − E V vg  , 1   ∗ V v ∈V(b) T 2(g)= R[1:n]Cg diag b CgR[1:n], m where the set V(b) is defined in (B.2). Next, we invoke where b ∈ Rm,andb ∈ Cm. The concentration of these Theorem A.9 to control all the moments of L(a),where quantities plays an important role in our arguments, and the we need to control the quantities d2(·), dF (·) and γ2(·, ·) proof mimics the arguments in [13], [14]. Prior to that, let us defined in Definition A.8 for the set V(b). By Lemma B.7 and define the sets Lemma B.8, we know that   . − 1/2 D = v ∈ CSm 1 : supp(v) ∈ [n] , (B.1) d (V(b)) ≤b , (B.4) ) F % ∞ V . √1 1/2 −1× n 1/2 (d) = V v : V v = diag (d) F m d2(V(b)) ≤ b∞ , (B.5) m , m% n 1/2 diag (F mv) F m, v ∈D , (B.2) γ (V(b), ·) ≤ C b × 2 0 m ∞ for some d ∈ Cm. log3/2 n log1/2 m, (B.6)

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1808 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

∗ ∈ Cn×m ≤ ≥ for some constant C0 > 0. Thus, combining the results Cg (n m). Then for any integer p 1,wehave ≥ 3 in (B.4), (B.5) and (B.6), whenever m C1n log n log m ! % √ for some constant C1 > 0, Theorem A.9 implies that p 1/p n √ (E [G ]) ≤ C 2 m 1+ p σg % m " L(g) p L n ≤ C {γ (V(b), ·)[γ (V(b), ·)+d (V(b))] + + log3/2 n log1/2 m . √2 2 2 F m pd (V(b)) [γ (V, ·)+d (V(b))] + pd2(V(b))} 2 !% 2 F 2 Moreover, for any  ∈ (0, 1), whenever m ≥ Cδ−2n log4 n ≤   n 3/2 1/2 C3 b ∞ log n log m for some constant C>0,wehave % m " n √ n 2 ∗ 2 2 + p + p (1 − δ)m w ≤G w ≤ (1 + δ)m w m m − 3 n cσ2 log n holds for some constants C2,C3 > 0. Based on the moments holds for w ∈ C with probability at least 1−2 m g . 2 estimate of L(g), Lemma A.6 further implies that Here c 2 ,C 2 > 0 are some constants depending only on σ . σg σg g ! % " Proof Firstly, notice that n 3/2 1/2 P L(g) ≥ C4 b∞ log n log m + t ! m ) ," G =sup|w, Gr| 2 w∈CSn−1,r∈CSm−1 ≤ − m  −1 t 2exp C5 b ∞ min ,t , ≤  ∗  n b∞ sup G w w∈CSn−1 for some constants C4,C5 > 0. Thus, for any δ>0, when- ∗ − 2 3 =supCgR[1:n]w ≥ 2   −1 ever m C6δ b ∞ n log n log m for some constant w∈CSn C > 0,wehave 6 =supCgv . v∈CSm−1,supp(v)∈[n] L(g) ≤ δ Thus, similar to the argument of Theorem B.1, let the set D − 3 holds with probability at least 1 − 2 m C7 log n. and V(1) define as (B.1) and (B.2), we have Now when b is not nonnegative, let b = b+ − b−,where     + + − − m 1 b+ = b , ··· ,b , b− = b , ··· ,b ∈ R are the 1 m 1 m + √ G≤ sup V vg . nonnegative and nonpositive part of b, respectively. Let Λ = m V v ∈V(1) diag (b), Λ+ =diag(b+) and Λ− =diag(b−),wehave   By Lemma B.7 and Lemma B.8, we know that    m  L 1 ∗ ∗ Λ − % (g)= sup v Cg Cgv bk ∈S   n m v n d (V(1)) ≤ 1,d(V(1)) ≤ ,  k=1  F 2 m  m  % 1  ∗ ∗  ≤  Λ − + n 3/2 1/2 sup v Cg +Cgv bk γ (V(1), ·) ≤ C log n log m. m v∈S   2 0  n  k=1  m  L+(g)  Thus, using Theorem A.9, we obtain  m  1  ∗ ∗ −  Λ −  #  &1/p + sup v Cg −Cgv bk .  p m v∈S     n k=1 E        sup V vg  V v ∈V(1) L−(g) !% % " √ m n 3/2 1/2 n Now since b+, b− ∈ R , we can apply the results above ≤ Cσ2 log n log m +1+ p , + g m m for L+(g) and L−(g), respectively. Then by Minkowski’s inequality, we have 2 where C 2 > 0 is constant depending only on σ . The concen- σg g L  ≤L  L ≤   tration inequality can be directly derived from Theorem B.1, (g) p +(g) p + −(g C6 b ∞ − 4 L L ) p ≥ 2 !% % L " noticing that for any δ>0, whenever m C1δ n log n √ for some positive constant C > 0,wehave × n 3/2 1/2 n n 1 log n log m + p + p   m m m    1 ∗ ∗ −  ≤ for some constant C > 0. The tail bound can be simi- sup  (w GG w 1) δ 6 w∈CSn−1 m larly derived from the moments bound. This completes the ∗ =⇒ (1 − δ)m ≤ sup G w2 ≤ (1 + δ)m proof. −1 w∈CSn

The result above also implies the following result. − 3 cσ2 log n ∈ Cm holds with probability at least 1−2 m g ,wherec 2 > Corollary B.2 Let g be a random complex Gaussian σg ∼CN0 2 2 vector with g ( ,σg I),andletG = R[1:n] 0 is some constant depending only on σg .

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1809

B. Controlling the Moments of T2(g) Lemma B.4 Let N (g) be defined as (B.7), and let g be an independent copy of g,thenwehave Theorem B.3 Let g ∈ Cm are a complex random Gaussian variable with g ∼CN(0,σ2I),andlet   g   N (g) p ≤ 4 sup V g, V g ,   L v v   V v ∈V(d) N .  1    ∗  Lp (g) =sup w R[1:n]Cg diag b CgR[1:n]w , m w∈CSn−1 m for some vector d ∈ C . ∼CN 0 2 Proof Let δ ( ,σg I) which is independent of g,and  4 where b ∈ Cm. Then whenever m ≥ Cnlog n for some let positive constant C>0, for any positive integer p ≥ 1, 1 2 − we have g = g + δ, g = g δ, 1 2 1 2 ∼  so that g and g are also independent with g , g N (g) p ≤ C 2 b × N L σg CN 0 2 Q 1 2 1 2 !% ∞ % " ( , 2σg I).Let dec(g , g )= V vg , V vg ,thenwe n n √ n have log3/2 n log1/2 m + p + p ,   m m m E QN 1 2 δ dec(g , g ) = V vg, V vg . 2 where C 2 is positive constant only depending on σ . σg g Therefore, by Jensen’s inequality, we have   Proof Let Λ =diagb , similar to the arguments of N  (g) Lp Theorem B.1, we have # p& 1/p        = E sup E QN (g1, g2)  N (g)= sup V vg, V vg , (B.7) g δ dec V v∈V(d) V v ∈V(b) # &   p 1/p N 1 2  ≤ E 1 2 Q  where V b is defined as (B.2). Let g be an independent g ,g sup dec(g , g ) ∈V ≥ V v (d) copy of g, by Lemma B.4, for any integer p 1 we have       =4 sup V vg, V vg , V v ∈V(d) N  ≤   Lp (g) Lp 4 sup V vg, V vg .

 V v ∈V(b) as desired. Lp Lemma B.5 Let g be an independent copy of g,forevery V V  For convenience, let = b . By Lemma B.5 and integer p ≥ 1,wehave Lemma B.6, we know that   sup V vg, V vg   # Lp sup V vg, V vg V v ∈V  Lp ≤ C 2 γ2 (V(d), ·) sup V vg  σg ≤ 2 V ·   V v∈V(d) Cσ γ2 ( , ) sup V vg Lp g ∈V & V v  Lp +sup V g, V g , +sup V g, V g v v Lp v v Lp V v ∈V(d) V v ∈V m ≤ V · V · V for some vector d ∈ C ,whereC 2 > 0 is a constant C 2 [γ2 ( , )(γ2( , )+dF ( )) σg σg √ depending only on σ2. + pd (V)(d (V)+γ (V, ·)) + pd2(V)]. g 2 F 2 2 Proof The proof is similar to the proof of Lemma 3.2 of [14], By Lemma B.7 and Lemma B.8, we know that and it is omitted here. % 1/2 1/2 Lemma B.6 Let g be an independent copy of g,forevery  n  ≥ dF (V) ≤ b ,d2(V) ≤ b , integer p 1,wehave ∞ m ∞ % 1/2 n  3/2 1/2 sup V g  γ2(V, ·) ≤ C b log n log m, v m ∞ V v∈V(d) Lp √ ≤ Cσ2 [γ2(V(d), ·)+dF (V)+ pd2(V(d))] , where C>0 is constant. Thus, combining the results above, g we have sup V vg, V vg p !% ∈V L V v √(d)  n  3/2 1/2 2 N  ≤ ≤ 2 V · V V (g) p Cσ2 b log n log m Cσ pdF ( (d)) d2( (d)) + pd2( (d)) , L g m ∞ g % " m √ for some vector d ∈ C ,whereCσ2 > 0 is a constant n  n  g + p b + p b , 2 m ∞ m ∞ depending only on σg . Proof The proof is similar to the proofs of Theorem 3.5 and 2 where C 2 > 0 is some constant depending on σ . Lemma 3.6 of [14], and it is omitted here. σg g

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1810 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

√ C. Auxiliary Results D⊆ nB[n]. By Proposition 10.1 of [13], we have 1 The following are the auxiliary results required in the main −1/2 1/2

N D,m d ·∞ , ∞  proof. √ D V ≤N B[n] −1/2  1/2 · Lemma B.7 Let the sets , (d) be defined as (B.1) n 1 ,m d ∞ 1 , and (B.2) for some d ∈ Cm,wehave % % ≤N B[n] ·  −1/2 m ( 1 , 1 , d ∞ ) V ≤1/2 V ≤ n  1/2 n dF ( (d)) d ∞ ,d2( (d)) d ∞ . √ n m 2 n d1/2 ≤ 1+ √ ∞ . Proof Since each row of V v ∈V(d) consists of weighted m shifted copies of v,the2-norm of each nonzero row of V v −1/2 1/2 Thus, we have is m |dk| v. Thus, we have −1/2 1/2

log N D,m d ·∞ , ∞  dF (V(d)) = sup V v √ F 1/2 V v∈V(d)   ≤ 2 n√ d ∞ 1/2 1/2 n log 1+ . ≤d∞ sup v = d∞ . m v∈D If the scalar  is large, let us introduce a norm Also, for every v ∈D, we observe m ∗ m 1 v = |(vk)| + | (vk)| , ∀ v ∈ C , (B.8) V ≤√ d1/2 diag (F v) 1 v m ∞ m k=1 m 1 which is the usual 1-norm after identification of C with 1/2  ∗  = √ d∞ F mv∞ . 2m [n] m R .LetB·∗ = v ∈ C : v ≤ 1, supp (v) ∈ [n] , m 1 √ 1 [n] then we have D⊆ 2nB ∗ . By Lemma B.9, we obtain ∈D   ≤ ≤ ·1 √It is obvious√ for any v that F mv ∞ v 1   n v 2 = n,sothat −1/2 1/2

log N D,m d ·∞ , ∞  % ! √ " n V  ≤  1/2 [n] m −1/2 d2( (d)) = sup V v d ∞ . ≤ log N B ∗ , ·∞ , √ d  ·  ∞ v∈D m 1 2n Cn ≤ d log m log n. m2 ∞ D V Lemma B.8 Let the sets , be defined as (B.1) and (B.2) Finally, we combine the results above to estimate the “Dudley ∈ Cm for some d ,wehave integral”, % n d2(V(d)) γ (V(d), ·) ≤ C d1/2 log3/2 n log1/2 m, I . 1/2 N V · 2 m ∞ = log ( , ,) d 0 % where γ (·) is defined in Definition A.8. √ κ n d1/2 2 ≤ n log1/2 1+2 ∞ d Proof By Definition A.8, we know that m  0 √ % 1/2 2 V n   d ( ) m v ∞ 1/2 n −1 γ2 (V(d), ·) ≤ C log N (V(d), · ,) d, + C d∞ log m log n  d 0 m √ κ κ  −1/2 m 2 d ∞ n for some constant C>0, where the right hand side is known 2n 1/2 1/2 −1 ≤ √ d∞ log 1+t dt as the “Dudley integral”. To estimate the covering number m 0 % !% " N (V(d), · ,), we know that for any v, v ∈D, n n + C d log m log n log d1/2 /κ m ∞ m ∞ V − V   = V −   v v v v ! ! % "" 1 √ 2 n ≤ √ diag (d)1/2 F (v − v ) ≤ κ n log e 1+ d1/2 m m ∞ κ ∞ m % !% " ≤ √1  1/2  −  n n d ∞ F m(v v ) ∞ . + C d log m log n log d1/2 /κ , m m ∞ m ∞ .

Let v = F mv that v ≤v ,wehave where the last inequality we used Lemma 10.3 of [13]. Choose

∞ ∞ ∞  1  1/2 −1/2 1/2 d√ ∞ N (V(d), · ,) ≤ND,m d ·∞ , . κ = , we obtain the desired result. ∞  m Next, we bound the covering number B[n] { ∈ Cm  ∗ ≤ ∈ −1/2 1/2 Lemma B.9 Let ·∗ = v : v 1 1, supp(v) N D,m d ·∞ , when  is small and large, 1 ∞  } ·∗ [n] ,and 1 is defined in (B.8), we have respectively. √ ≤O B[n] C When  is small (i.e.,  (1/ m)), let 1 = [n] log N B ∗ , ·∞ , ≤ log m log n (B.9) ·  { ∈ Cm   ≤ ∈ } 1 2 v : v 1 1, supp v [n] , then it is obvious that 

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1811

for some constant C>0, where the norm v∞ = F mv∞. according to (B.10), we can therefore find a vector zS such  9 Proof Let U = {±e1, ··· , ±en, ±ie1, ··· , ±ien},itis that v − zS∞ ≤  with the choice L ≤2 log(10m).   [n] obvious that B·∗ ⊆ conv(U),whereconv(U) denotes the Thus, we have U 1 ∈U convex hull of .Fixanyv , the idea is to approximate [n]

log N B ∗ , · , ≤ log N (conv(U), · ,)

∞ ∞ v by a finite set of very sparse vectors. We define a random ·1  vector 9 ⎧ ≤ L log(4n +1)≤ log(10m) log(4n +1), 2 ⎨⎪sign ((vj)) ej, with prob. | (wj )| ,j∈ [n] z = sign ( (v )) e , with prob. | (w )| ,j∈ [n] as desired. ⎩⎪ j j j 0 − ∗ , with prob. 1 w 1 .  ∗ ≤ Since v 1 1, this is a valid probability distribution with APPENDIX C E [z]=v.Letz1, ··· , zL be independent copies of z, CONCENTRATION VIA DECOUPLING L where is a number to be determined later. We attempt to In this section, we assume that x =1, and we develop approximate v with a L-sparse vector concentration inequalities for the following quantities L 1 1 ∗ 2 zS = zk. Y (g)= R C diag |g  x| × L m [1:n] g k=1 ∗ C R , (C.1) By using a classical symmetrization argument (e.g., see g [1:n] 2 2σ +1 ∗ Lemma 6.7 of [13]), we obtain M(g)= R C × # & m [1:n] g L ∗ 2  1 diag (ζσ (g x)) CgR[1:n], (C.2) E [zS − v∞]=E (zk − E [zk])  L k=1 ∞ # &  via the decoupling technique and moments control, where L 2 m 2 ζ 2 (·) is defined in (VI.5) and σ > 1/2. Suppose g ∈ C ≤ E ε z σ L k k is complex Gaussian random variable g ∼CN(0, I).Once k=1 ∞ #   &   all the moments are bounded, it is easy to turn the moment L  2 E   bounds into a tail bound via Lemma A.6 and Lemma A.7. = max  εk f , zk  L ∈[m]   To bound the moments, we use the decoupling technique k=1 ∗ developed in [12], [14], [101]. The basic idea is to decouple where ε =[ε1, ··· ,εL] is a Rademacher vector, independent the terms above into terms like { }L { }L of zk k=1. Fix a realization of zk k=1, by applying the   Y 1 2 1 ∗  2 2 Hoeffding’s inequality to , we obtain Q (g , g )= R C 1 diag g  x × dec m [1:n] g   ∗ L  1   √ Cg R[1:n], (C.3) P    ≥ ε  εk f , zk  Lt 2 QM 1 2 1+2σ ∗ × k=1  dec(g , g )= R[1:n]Cg1  L  L m   2 ∗   2  1 ≤ P    ≥   diag ησ g x Cg R[1:n], (C.4) ε  εk f , zk  f , zk t k=1 k=1 2 1 2 where ησ2 (t)=1− 2πσ ξ 2− 1 (t),andg and g are two ≤ − 2 σ 2 2exp t /2 independent random variables with for all t>0 and  ∈ [m]. Thus, by combining the result above g1 = g + δ, g2 = g − δ, (C.5) with Lemma 6.6 of [13], it implies that #  &   ∼CN 0 L  $ where δ ( , I) is an independent copy of g.Aswedis- E    ≤ cussed in Section III, it turns out that controlling the moments max  εk f , zk  C L log(8m), ∈[m] QY 1 2 QM 1 2 k=1 of the decoupled terms dec(g , g ) and dec(g , g ) for √ √ −1 convolutional random matrices is easier and sufficient for with C = 2+ 4 2log8 < 1.5. By Fubini’s theorem, providing the tail bound of Y and M. The detailed results we obtain #  & and proofs are described in the following subsections.   2 L  E [zS − v∞] ≤ EzEε max  εk f , zk   L ∈[m]   k=1 A. Concentration of Y (g) 3 $ ≤ √ log(8m). (B.10) In this subsection, we show that L Theorem C.1 Let g ∼CN(0, I),andletY (g) be defined −2 7 1 L as (C.1). For any δ>0,whenm ≥ Cδ n log n,wehave This implies that there exists a vector zS = $L k=1 zk √3 ∗ where each zk ∈U such that zS − v∞ ≤ log(8m).  L Y (g) − xx − I≤δ, Since each zk can take 4n +1 values, so that zS can take at most (4n +1)L values. And for each v ∈ conv(U), holds with probability at least 1 − 2m−c.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1812 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

!% 1 2 Y 1 2 n Proof Suppose g , g are defined as (C.5), and Q (g , g ) ≤ 3/2 3/2 ∗ dec ∗ C3 log n log m p 2  2 1 m is defined as (C.3). Let [g x]k = g x and Cg R[1:n] = ⎡ ∗ ⎤ k % " g1 n n 1 + (log m) p3/2 + (log m) p2 , ⎣ ··· ⎦, then by Lemma C.2, we have m m ∗ 1 gm   where the first inequality follows from the triangle inequality, E QY 1 2 the second inequality follows from Theorem B.1, and the δ dec(g , g ) m *  + last inequality follows from Lemma C.3. Thus, combining the 1  ∗ 2 ∗ T T = E (g − δ ) x (g + δ )(g + δ ) estimates for 1 and 2 above, we have m δ k k k k k k k=1 ! (E [Y (g) − E [Y (g)]p])1/p m !g % 1 | ∗ |2 ∗ ∗ | ∗ |2 ∗ = gkx gkgk + gkgk + gkx I + xx n m ≤ C log3/2 n log3/2 m p k=1 " 4 % m " + I − xx∗g g∗ − g g∗xx∗ n n k k k k + (log m) p3/2 + (log m) p2 . m m 1 m =4I + Y (g) − E [Y (g)] + (g g∗ − I) g m k k Therefore, by using Lemma A.7, for any δ>0, whenever k=1 −2 4 3 m ≥ C5δ n log m log n m 1 ∗ ∗ ∗ ∗ ∗ − (xx g g + g g xx − 2xx )  − E ≤ m k k k k Y [Y ] δ, k=1 −c 1 m with probability at least 1 − 2m ,wherec>0 is some + |g∗x|2 − 1 I. m k numerical constant. Finally, using Lemma C.2, we get the k=1 desired result. Thus, by Minkowski inequality and Jensen’s inequality, for Lemma C.2 Suppose x ∈ Cn with x =1.Letg ∼ any positive integer p ≥ 1,wehave CN(0, I),andletY (g) be defined as (C.1), then we have E  − E p 1/p ( g [ Y (g) [Y (g)] ]) E ∗ !  " [Y (g)] = xx + I. p 1/p ⎡ ⎤ 1 ∗ ∗ ∗ ≤ 4 Eg R[1:n]C CgR − I g m g [1:n] 1 * + ∗ ⎢ . ⎥   1/p Proof Let CgR[1:n] = ⎣ . ⎦.Sincewehave Y 1 2 p . + E E Q (g , g ) − 4I ∗ g δ dec g !  " m p 1/p # & 1 ∗ ∗ m ≤ 4 Eg R[1:n]C CgR − I 1 m g [1:n] E [Y (g)] = E (g∗x)2 g g∗    m k k k k=1 T1 * + * + ∗ 2 ∗ 1/p E Y 1 2 p = (g x) gg , E 1 2 Q − + g ,g dec(g , g ) 4I .    we obtain the desired result by applying Lemma 20 T 2 of [38]. By using Theorem B.1 with b = 1,wehave !% % " Lemma C.3 Suppose g ∼CN(0, 2I), for any positive n n √ n integer p ≥ 1,wehave T ≤ C log3/2 log1/2 m + p + p , 1 1 m m m $ √ E   p 1/p ≤

g [ g x ∞] 6 log m p. where C1 > 0 is some numerical constant. For T2,wehave Proof By Minkowski inequality, we have T2 * + 1/p 1/p E   p ≤ E    Y 1 2 p g [ g x ∞] [ g x ∞] E 1 2 Q − = g ,g dec(g , g ) 4I E    − E    p 1/p

!  " +  [( g x [ g x ]) ] . p 1/p g ∞ ∞ Y 1 2 1 2 2 ≤ E 1 2 Q −  g ,g dec(g , g ) 2 g x I     m We know that g x ∞ is 1-Lipschitz w.r.t. g. Thus, !  " p 1/p by Gaussian concentration inequality in Lemma A.3, we have 1 ∗ ∗ E 1 1 −   2 +2 g R[1:n]Cg1 Cg R[1:n] 2I P    − E     ≥ ≤ −t /2 ! m " g x ∞ g [ g x ∞] t 2e . * +1/p 2 2p ≤ E 2  ×    C2 g g x ∞ +2 By Lemma A.5, we know that g x ∞ is sub-Gaussian, and !% % " satisfies √ n 3/2 1/2 n n    √

log n log m + p + p E    − E    p 1/p ≤  m m m g g x ∞ g [ g x ∞] 4 p.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1813

⎡ ∗ ⎤ g x ⎢ 1 ⎥ where the last equality follows from the second equality in   . Besides, let g x = ⎣ . ⎦, then by Jensen’s inequality, for Lemma C.9. Using the results above and Lemma C.10, for ∗ all integer p ≥ 1, we observe gmx all λ>0,wehave (E [H − E [H]p])1/p E    ≤ E    !  exp (λ [ g x ∞]) [exp (λ g x ∞)]     H = Eg Eδ Q (g + δ, g − δ) − P x⊥ E ∗ dec = max exp λgkx 1≤k≤m " 2 p 1/p m * + 1+2σ ∗ 2 − 1,ζσ2 (|g  x|) P x⊥ λg x λ ≤ E e k ≤ me , m * +   1/p k=1 H p ≤ Eg Eδ Q (g + δ, g − δ) − 2P ⊥ where we used the fact that the moment generating function dec x ∗  ∗  !   "  E  ≤ 2  2 p 1/p of gkx satisfies exp λgkx exp λ .Takingthe  1+2σ  + Eg 1 − 1,ζσ2 (|g  x|) logarithms on both sides, we have m * +1/p E [g  x ] ≤ log m/λ + λ. M 1 2 p ∞ ≤ E 1 2 Q − √ g ,g dec(g , g ) 2I Taking λ = log m, so that the right hand side of the !   "  2 p 1/p inequality above achieves the minimum, which is  1+2σ  + Eg 1 − 1,ζσ2 (|g  x|) , $ m E [g  x∞] ≤ 2 log m. where QM (g1, g2) is defined as (C.2), and we have used the Combining the results above, we obtain the desired result. dec Minkowski’s inequality and the Jensen’s inequality, respec- tively. By Lemma C.5 and Lemma C.11, we obtain B. Concentration of M(g) (E [H − E [H]p])1/p Given M(g) as in (C.2), let us define !% % " √ n 3/2 1/2 n n ≤ 2 H(g)=P ⊥ MP ⊥ (C.6) Cσ log n log m + p + p , x x m m m and correspondingly its decoupled term 2 where Cσ2 is some numerical constant depending only on σ . H 1 2 M 1 2 Q Q ⊥ dec(g , g )=P x⊥ dec(g , g )P x , (C.7) Thus, by using the tail bound in Lemma A.6, for any t>0, we obtain and let ! % " 2 n η 2 (t)=1− 2πσ ξ 2− 1 (t), (C.8a) P  − E ≥ 3/2 1/2 σ σ 2 H [H] C1 log n log m + t 4 ! m " 4πσ 2 νσ2 (t)=1− ξ 2 − 1 (t), (C.8b) mt 2σ2 − 1 σ 2 ≤ 2exp −C 2 n where σ2 > 1/2. In this subsection, we show the following result. for some constants C1,C2 > 0. This further implies that for −2 2 ≥ −2 3 Theorem C.4 For any δ>0,whenm ≥ Cδ Cx any δ>0,ifm C3δ n log n log m for some positive 4 n log n,wehave numerical constant C3,wehave

H(g) − P ⊥ ≤δ x (C.9) H − E [H]≤δ, 2 2σ ∗ − − ≤ 3 M(g) I xx 3δ (C.10) −C4 log n 1+2σ2 holds with probability at least 1 − 2 m ,where C4 > 0 is numerical constant. Next, we use this result to P ⊥ M(g) − P ⊥ ≤2δ (C.11) x x bound the term M − E [M], by Lemma C.10, notice that  3 holds with probability at least 1 − cm−c log n,wherec, c and C are some positive numerical constants depending only M − E [M] 2 on σ . ≤P x⊥ (M − E [M]) P x⊥  Proof Let QH (g1, g2) be defined as (C.7). We calculate its dec + P x (M − E [M]) P x expectation with respect to δ, +2P x⊥ (M − E [M] P x) ! "−1 1+2σ2   ≤ − E  | ∗ − E | E QH (g + δ, g − δ) H [H] + x (M [M]) x m δ dec  ⊥  ∗ +2 P x Mx . E 2 | −  | = P x⊥ R[1:n]Cg diag ( δ [ησ ( (g δ) x )]) ∗ Hence, by using the results in Lemma C.6 and Lemma C.7, × CgR P x⊥ + 1, Eδ [ησ2 ((g − δ)  x)] P x⊥ [1:n] 2 −2 4 whenever m ≥ C Cx δ n log n we obtain = 1,ζσ2 (|g  x|) P x⊥ + ∗ ∗ 2 |  | ⊥ P x⊥ R[1:n]Cg diag (ζσ ( g x )) CgR[1:n]P x M − E [M]≤3δ,

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1814 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

−c log3 n holds with probability at least 1−cm .Herec, c > 0 For the term T2, Lemma C.11 implies that are some numerical constants. Similarly, we have 2   1+2σ  2  T2 = 1,ησ2 g  x − 1 P ⊥ (M − E [M]) x m Lp

≤ ⊥ − E ⊥  √ P x (M [M]) P x Cσ2 ≤ √ Cx p, + P x⊥ (M − E [M]) P x m

= H − E [H] + P x⊥ Mx . for some constant Cσ2 > 0√. Combining the results above and  ≤ Again, by Lemma C.7, we have use the fact that Cx n, we obtain * +1/p  ⊥ − E ≤ M 1 2 p P x (M [M]) 2δ, E 1 2 Q − g ,g dec(g , g ) 2I  3 !% % " − −c log n √ holds with probability at least 1 cm .Byusing n 3/2 1/2 n n ≤ C 2 log n log m + p + p , Lemma C.10, we obtain the desired results. σ m m m Lemma C.5 Suppose g1, g2 are independent with g1, g2 ∼ where C 2 > 0 is some numerical constant only depending CN 0 QM 1 2 σ ( , 2I),andlet dec(g , g ) be defined as (C.4), then for on σ2. any integer p ≥ 1,wehave ∈ Cm * +1/p Lemma C.6 Let g be a complex Gaussian random M 1 2 p E 1 2 Q − ∼CN 0 g ,g dec(g , g ) 2I variable g ( , I).LetM(g) be defined as (C.2). For − 2 !% % " ≥ ≥ 2 1   √ any δ 0, whenever m Cσ δ Cx n log m,wehave n 3/2 1/2 n n ≤ C 2 log n log m + p + p , σ m m m |x∗ (M − E [M]) x|≤δ

2  2 where Cσ > 0 is some numerical constant only depending −C 2 Cx n holds with 1 − m σ . Here, C 2 ,C2 are some on σ2. ⎡ ⎤ σ σ b numerical constants depending on σ2. - 1 ⎢ . ⎥ ∗ 1/2 2σ2+1 2 2 2  Proof h(g)=|x M(g)x| = Proof Let b = 2σ +1 ησ g x ,setb = ⎣ . ⎦, Let m . 1/2 bm diag ζσ2 (Cxg) Cxg . Then we have its Wirtinger QM 1 2 1 ∗ ∗ and write dec(g , g )= m R[1:n]Cg diag (b) CgR[1:n]. gradient % By Minkowski’s inequality, we observe 2 −1 * +1/p ∂ 1 2σ +1 1/2 M 1 2 p h(g)= diag ζ 2 (Cxg) Cxg E 1 2 Q − σ g ,g dec(g , g ) 2I ∂z 2  m # & 1/p ∗ m p × 2  Cx diag (ζσ (Cxg)) Cxg + M 1 2 2 ≤ E 1 2 Q (g , g ) − b I  g ,g dec m k k=1 ∗    Cx diag (f(Cxg)) Cxg , T 1 2   | |2 | |2 1+2σ  2  t − t +2 1,ησ2 g  x − 1 . where g1(t)= 2σ2 exp 2σ2 ,sothat m Lp    % 2 −1 T2 2σ +1 1/2 ∇gh(g) = diag ζ 2 (Cxg) Cxg T 2 m σ For the term 1, conditioned g so that b is fixed, Theorem B.1 ≥ ∗ implies that for any integer p 1, × 2 Cx diag (ζσ (Cxg)) Cxg # & m p 1/p M 1 2 2 2 ∗ Eg1 Q (g , g ) − bkI | g + C diag (g1(Cxg)) Cxg. dec m x !% k=1 n 3/2 1/2 Thus, we have ≤ Cσ2 b∞ log n log m % m ! % " 2σ2 +1 √ n n ∇gh(g)≤ Cx diag (g2(Cxg) + p + p , m m m " 1/2 + diag ζ 2 (Cxg) , where Cσ2 > 0 is some numerical constant depending only σ 2 on σ . Given the fact that b∞ ≤ cσ2 for some constant −1/2 2 2 cσ > 0, and for any choice of g ,wehave where g2(t)=g1(t)ζσ2 (t). By using the fact that !% 1/2 ζ 2 ≤ 1 and g2 ∞ ≤ C1 for some constant C1 > 0, n 3/2 1/2 σ ∞  T ≤ 2  1 Cσ log n log m we have %m " % √ n n 2σ2 +1 + p + p . ∇ h(g)≤C C  , m m g 2 m x

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1815

m for some constant C2 > 0. Therefore, we can- see that the Lemma C.7 Let g ∈ C be a complex Gaussian random 2σ2+1 ∼CN 0 Lipschitz constant L of h(g) is bounded by C2 Cx. variable g ( , I),andletM(g) be defined as (C.2). For m − 2 4 ≥ 2 2   Thus, by the Gaussian concentration inequality, we observe any δ>0, whenever m Cσ δ Cx n log n,wehave 2 C mt − σ2 C 2  ≤ P (|h(g) − E [h(g)]|≥t) ≤ 2e x (C.12) P x⊥ Mx δ,

2 2 3 holds with some constant Cσ > 0 depending only on σ . −c 2 log n holds with probability at least 1 − 2m σ . Here, Thus, we have 2 cσ2 ,Cσ2 are some positive constants only depending on σ . −t ≤ h(g) − E [h(g)] ≤ t (C.13) Proof First, let us define decoupled terms 2 Cσ2 mt holds with probability at least 1 − 2exp − 2 . ⊥ 2 Cx Mx 1 2 2σ +1 ∗ Q ⊥ · dec (g , g )= P x R[1:n]Cg1 By Lemma C.10, we know that m 2 ∗   2 diag ν 2 g  x C 2 R x, E 2 ∗E 4σ +1 σ g [1:n] h (g) = x [M(g)] x = 2 . 2 2σ +1 QHx⊥ 1 2 2σ +1 ∗ × dec (g , g )= R[1:n]Cg1 This implies that m 2 ∗ 2 diag ν 2 g  x C 2 R x, (C.16) h2(g) ≤ (E [h(g)] + t) =⇒ σ g [1:n]   $ ⎡ ⎤ h2(g) − E h2(g) ≤ 2t E [h2(g)] + t2 ∗ % g1 2 ∗ ⎢ . ⎥ 1+4σ 2 ⎣ . ⎦ ≤ 2t + t2 (C.14) where νσ (t) is defined in (C.8). Let CgR[1:n] = . and 1+2σ2 ∗ gm 2 ⎡ ∗ ⎤ Cσ2 mt holds with probability at least 1 − 2exp − 2 .Onthe δ1 Cx ⎢ ⎥ ∗ ⎣ . ⎦ other hand, (C.12) also implies that h(g) is subgaussian, CδR[1:n] = . , then by Lemma C.9, we observe ∗ Lemma A.5 implies that δm * +  2 2 Cσ2 Cx ! "−1 * + E (h(g) − E [h(g)]) ≤ 2 ⊥ 2σ +1 Mx m Eδ Q (g + δ, g − δ)   2 m dec C 2 C  ⇒ E 2 ≤ E 2 σ x  = h (g) ( [h(g)]) + ∗ m E ⊥ 2 −  × = δ P x R[1:n]Cg+δ diag (νσ ((g δ) x)) 2 for some constant Cσ2 > 0 only depending on σ . Suppose  2 ∗ m ≥ C 2 C  for some large constant C 2 > 0 depending σ x σ Cg−δR x 2 [1:n] on σ > 0, from (C.13), we have  m ≥ E − ∗ h(g) [h(g)] t E 2 − ⊥ × = δ νσ ((gk δk) x) P x 2 k=1  C 2 Cx ≥ E [h2(g)] − σ − t. ∗ m (g + δk)(g − δk) x - k k   2 C 2 Cx m   t ≤ E [h2(g)] − σ ∗ ∗ Suppose m , by squaring both sides, ∗ = P x⊥ g Eδ x νσ2 ((g − δk) x)(g − δk) x we have k k k k k=1   2 C 2 Cx m h2(g) ≥ E h2(g) − σ + t2 ∗ ∗ m = ζσ2 (gkx) P x⊥ gkgkx 2 k=1 C 2 C  ∗ ∗ σ x ⊥ 2  − 2t E [h2(g)] − . = P x R[1:n]Cg diag (ζσ (g x)) CgR[1:n]x. m This further implies that Thus, for any integer p ≥ 1,wehave    2 !  2 2 2 C 2 Cx 2 h (g) − E h (g) ≥ t − σ 2σ +1 ∗ m Eg P x⊥ R[1:n]C diag (ζσ2 (g  x)) m g 2 2   p"1/p 4σ +1 Cσ2 Cx − 2t − , (C.15) × ∗ 2σ2 +1 m CgR[1:n]x 2 * * + + Cσ2 mt ⊥ p 1/p holds 1 − 2exp − 2 . Therefore, combining the results Mx Cx = E E Q (g + δ, g − δ) ≥ ≥ g δ dec in (C.14) and (C.15), for any δ 0, whenever m * + −1 2 ⊥ p 1/p C δ C  n log m t = C δ Mx 1 2 4 x , choosing 5 ,wehave ≤ E 1 2 Q    g ,g dec (g , g )  2 − E 2  ≤ h (g) h (g) δ, * p+1/p Hx⊥ 1 2 2 ≤ E 1 2 Q (g , g ) . holds with probability at least 1 − m−C6Cx n. g ,g dec

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. 1816 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

By Lemma C.8, we have holds for all t>0. By using the tail bound above, we obtain * p+ Hx⊥ 1 2 Eg2 Q (g , g ) * p+1/p dec Hx⊥ 1 2 E 1 2 Q ≤ 2  × ∞ g ,g dec (g , g ) Cσ Cx p % ! % "  = P h(g2) ≥ t dt √ n n 3/2 1/2 n t=0√ 1+ log n log m p + p . p ( nLh) m m m p = P h(g2) ≥ t dt t=0 ∞ Therefore, by Lemma A.6, finally for any δ>0, whenever P 2 ≥ 1/p −2 2 4 + √ h(g ) t dt m ≥ Cδ Cx n log n we obtain p t=( nLh) √ ∞ ≤ p P 2 ≥ p−1 −c log3 n nLh + p √ h(g ) u u du P (P ⊥ Mx≥δ) ≤ 2 m , x u= nLh ∞ √ 2 = p P h(g ) ≥ u + nLh × where c, C > 0 are some positive constants. u=0 √ p−1 √ p u + nLh du + nLh 1 2 ∞ 2 Lemma C.8 Let g and g be random variables defined as u ⊥ √ √ − − 2 p p−2 p 1 2L QHx 1 2 ≤ nL +2 p nL e e h du in (C.5), and let dec (g , g ) be defined as (C.16). Then h h u=0 for any integer p ≥ 1,wehave 2 ∞ u − 2 p−2 2L p−1 +2 pe e h u du * p+1/p u=0% Hx⊥ 1 2 E 1 2 Q ≤ 2  × √ √ g ,g dec (g , g ) Cσ Cx p π p−2 p−1 p % ! % "  = nLh + 2 p n Lhe √ 2 n n 3/2 1/2 n ∞ 1+ log n log m p + p , p 3p/2−3 p −τ 2 −1 m m m +2 pLhe e τ dτ ! %τ=0 " √ π 2 ≤ p p p−1 3p/2−3 where Cσ2 is some positive constant only depending on σ . 3 n Lh 1+ 2 p +2 pΓ(p/2) 1 2 QHx⊥ 1 2 2 ' ( Proof First, we fix g ,andleth(g )= dec (g , g ).Let √ √ ≤ p p/2 g(t)=tνσ2 (t), for which the Lipschitz constant Lf ≤ Cσ2 3 4 nLh p max (p/2) , 2π , 2 for some positive constant Cσ2 only depending on σ .Then  √  ≤ p/2 given an independent copy g2 of g2, we observe where we used the fact that Γ(p/2) max (p/2) , 2π for any integer p ≥ 1. By Corollary B.2, we know that * + √ ∗ p p p 2 E 1 1 ≤ × h(g ) − h(g2) g R[1:n]Cg cσ2 m ! % % " √ p 2 n 3/2 1/2 n 2σ +1 ∗ 2 2 ≤ R C 1 g(C g ) − g(C g ) 1+ log n log m + p , m [1:n] g x x m m C 2 2 σ ∗ 2 2 2 ≤ R C 1 C  g − g where cσ is some constant only depending only on σ .There- m [1:n] g x ∗      fore, using the fact that Lh = Cσ2 R[1:n]Cg1 Cx /m and 1/p 1/e Lh p ≤ e , we obtain * p+1/p Hx⊥ 1 2 2 E 1 2 Q (g , g ) where Lh is the Lipschitz constant of h(g ). Given√ the fact g ,g dec 2 % ! % that Eg2 h(g ) = 0, by Lemma A.4, for any t> nLh we √ ≤   n n 3/2 1/2 have Cσ2 Cx p 1+ log n log m % " m m ! " n √ t + p P h(g2) ≥ t ≤ eP v≥ m L  ! h " n 2   1 t √ = Cσ2 Cx p + ≤ e exp − − n , % ! %m "  2 Lh n n √ 1+ log3/2 n log1/2 m p , m m n where v ∈ R with v ∼N(0, I), and we used the Gaussian 2 where C 2 > 0 is some constant depending only on σ . concentration inequality for the tail bound of v. By a change σ of variable, we obtain ! " C. Auxiliary Results √ P 2 ≥ ≤ − 1 2 The following are some auxiliary results used in the main h(g ) t + nLh e exp 2 t 2Lh proof.

Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1817

Lemma C.9. Let $\xi_{\sigma^2}, \zeta_{\sigma^2}, \eta_{\sigma^2}$ and $\nu_{\sigma^2}$ be defined as in (VI.5) and (C.8). For $t \in \mathbb{C}$, we have
\begin{align*}
  &\mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[\xi_{\sigma^2}(t+s)\big] = \xi_{\sigma^2+\frac12}(t), \\
  &\mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[\eta_{\sigma^2}(t+s)\big] = \zeta_{\sigma^2}(t), \\
  &\mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[\zeta_{\sigma^2}(s)\big] = \frac{1}{2\sigma^2+1}, \\
  &\mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[|s|^2\,\zeta_{\sigma^2}(s)\big] = \frac{4\sigma^2+1}{(2\sigma^2+1)^2}, \\
  &\mathbb{E}_{s\sim\mathcal{CN}(0,2)}\big[\eta_{\sigma^2}(s)\big] = \frac{1}{2\sigma^2+1}, \\
  &\mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[(t+s)\,\nu_{\sigma^2}(t+s)\big] = t\,\zeta_{\sigma^2}(t).
\end{align*}

Proof. Let $s_r = \Re(s)$, $s_i = \Im(s)$ and $t_r = \Re(t)$, $t_i = \Im(t)$. By definition, we observe
\begin{align*}
  \mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[\xi_{\sigma^2}(t+s)\big]
  &= \frac{1}{2\pi\sigma^2}\cdot\frac{1}{\pi}\int_{\mathbb{C}} \exp\Big(-\frac{|t+s|^2}{2\sigma^2}\Big)\exp\big(-|s|^2\big)\,ds \\
  &= \frac{1}{2\pi^2\sigma^2}
     \int_{-\infty}^{+\infty}\exp\Big(-\frac{(s_r+t_r)^2}{2\sigma^2} - s_r^2\Big)ds_r
     \int_{-\infty}^{+\infty}\exp\Big(-\frac{(s_i+t_i)^2}{2\sigma^2} - s_i^2\Big)ds_i \\
  &= \frac{1}{2\pi^2\sigma^2}\cdot\frac{2\pi\sigma^2}{1+2\sigma^2}\exp\Big(-\frac{|t|^2}{1+2\sigma^2}\Big)
   = \frac{1}{2\pi(\sigma^2+1/2)}\exp\Big(-\frac{|t|^2}{2(\sigma^2+1/2)}\Big)
   = \xi_{\sigma^2+\frac12}(t).
\end{align*}
Thus, by the definitions of $\eta_{\sigma^2}$ and $\zeta_{\sigma^2}$, we have
\[
  \mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[\eta_{\sigma^2}(t+s)\big]
  = 1 - 2\pi\sigma^2\,\mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[\xi_{\sigma^2-1/2}(t+s)\big]
  = 1 - 2\pi\sigma^2\,\xi_{\sigma^2}(t) = \zeta_{\sigma^2}(t).
\]
For $\mathbb{E}_{t\sim\mathcal{CN}(0,1)}[\zeta_{\sigma^2}(t)]$, we have
\[
  \mathbb{E}_{t\sim\mathcal{CN}(0,1)}\big[\zeta_{\sigma^2}(t)\big]
  = 1 - 2\pi\sigma^2\,\mathbb{E}_{t\sim\mathcal{CN}(0,1)}\big[\xi_{\sigma^2}(t)\big]
  = 1 - \mathbb{E}_{t\sim\mathcal{CN}(0,1)}\Big[\exp\Big(-\frac{|t|^2}{2\sigma^2}\Big)\Big]
  = 1 - \frac{2\sigma^2}{2\sigma^2+1} = \frac{1}{1+2\sigma^2}.
\]
For $\mathbb{E}_{t\sim\mathcal{CN}(0,1)}\big[|t|^2\zeta_{\sigma^2}(t)\big]$, we observe
\begin{align*}
  \mathbb{E}_{t\sim\mathcal{CN}(0,1)}\big[|t|^2\,\zeta_{\sigma^2}(t)\big]
  &= \frac{1}{\pi}\int_{\mathbb{C}} |t|^2\Big(1 - \exp\Big(-\frac{|t|^2}{2\sigma^2}\Big)\Big)\exp\big(-|t|^2\big)\,dt \\
  &= 1 - \frac{1}{\pi}\int_{\mathbb{C}} |t|^2\,\exp\Big(-\frac{2\sigma^2+1}{2\sigma^2}|t|^2\Big)\,dt
   = 1 - \Big(\frac{2\sigma^2}{2\sigma^2+1}\Big)^2
   = \frac{4\sigma^2+1}{(2\sigma^2+1)^2}.
\end{align*}
In addition, by using the fact that $\mathbb{E}_{s\sim\mathcal{CN}(0,1)}[\xi_{\sigma^2}(t+s)] = \xi_{\sigma^2+\frac12}(t)$, we have
\[
  \mathbb{E}_{t\sim\mathcal{CN}(0,2)}\big[\eta_{\sigma^2}(t)\big]
  = \mathbb{E}_{t_1,t_2\sim_{\mathrm{i.i.d.}}\mathcal{CN}(0,1)}\big[\eta_{\sigma^2}(t_1+t_2)\big]
  = \mathbb{E}_{t_1\sim\mathcal{CN}(0,1)}\big[\zeta_{\sigma^2}(t_1)\big]
  = \frac{1}{2\sigma^2+1}.
\]
For the last equality, first notice that
\begin{align*}
  \mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[s\,\xi_{\sigma^2}(t+s)\big]
  &= \frac{1}{2\pi\sigma^2}\cdot\frac{1}{\pi}\int_{\mathbb{C}} s\,\exp\Big(-\frac{|t+s|^2}{2\sigma^2}\Big)\exp\big(-|s|^2\big)\,ds \\
  &= \frac{-t}{\pi(1+2\sigma^2)^2}\exp\Big(-\frac{|t|^2}{1+2\sigma^2}\Big)
   = \frac{-t}{1+2\sigma^2}\,\xi_{\sigma^2+\frac12}(t),
\end{align*}
where the second equality follows by completing the square: the Gaussian in $s$ has mean $-t/(1+2\sigma^2)$. Therefore, we have
\begin{align*}
  \mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[(t+s)\,\xi_{\sigma^2-\frac12}(t+s)\big]
  &= t\cdot\mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[\xi_{\sigma^2-\frac12}(t+s)\big]
     + \mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[s\,\xi_{\sigma^2-\frac12}(t+s)\big] \\
  &= t\,\xi_{\sigma^2}(t) - \frac{1}{2\sigma^2}\,t\,\xi_{\sigma^2}(t)
   = \frac{2\sigma^2-1}{2\sigma^2}\,t\,\xi_{\sigma^2}(t).
\end{align*}
Using the result above, we observe
\begin{align*}
  \mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[(t+s)\,\nu_{\sigma^2}(t+s)\big]
  &= t - \frac{4\pi\sigma^4}{2\sigma^2-1}\,\mathbb{E}_{s\sim\mathcal{CN}(0,1)}\big[(t+s)\,\xi_{\sigma^2-\frac12}(t+s)\big] \\
  &= t\,\big(1 - 2\pi\sigma^2\,\xi_{\sigma^2}(t)\big) = t\,\zeta_{\sigma^2}(t),
\end{align*}
as desired.
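As a quick sanity check, the identities of Lemma C.9 can be verified by Monte Carlo integration. The snippet below is illustrative and not from the paper; it assumes the explicit forms $\xi_{\sigma^2}(u) = \frac{1}{2\pi\sigma^2}e^{-|u|^2/(2\sigma^2)}$ and $\zeta_{\sigma^2}(u) = 1 - e^{-|u|^2/(2\sigma^2)}$ used in the proof above.

```python
import numpy as np

# Illustrative Monte Carlo check of several identities in Lemma C.9,
# assuming the explicit forms used in the proof above:
#   xi_{s2}(u)   = exp(-|u|^2 / (2*s2)) / (2*pi*s2)
#   zeta_{s2}(u) = 1 - exp(-|u|^2 / (2*s2))
rng = np.random.default_rng(2)
s2, N = 1.3, 2_000_000
t = 0.7 - 0.4j                                # an arbitrary fixed t in C

xi = lambda u, v: np.exp(-np.abs(u) ** 2 / (2 * v)) / (2 * np.pi * v)
zeta = lambda u: 1 - np.exp(-np.abs(u) ** 2 / (2 * s2))

s = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)  # CN(0,1)

# E_s[xi_{s2}(t+s)] = xi_{s2+1/2}(t)
print(np.mean(xi(t + s, s2)), xi(t, s2 + 0.5))
# E_s[zeta_{s2}(s)] = 1 / (1 + 2*s2)
print(np.mean(zeta(s)), 1 / (1 + 2 * s2))
# E_s[|s|^2 * zeta_{s2}(s)] = (4*s2 + 1) / (2*s2 + 1)^2
print(np.mean(np.abs(s) ** 2 * zeta(s)), (4 * s2 + 1) / (2 * s2 + 1) ** 2)
# E_s[s * xi_{s2}(t+s)] = -t / (1 + 2*s2) * xi_{s2+1/2}(t)
print(np.mean(s * xi(t + s, s2)), -t / (1 + 2 * s2) * xi(t, s2 + 0.5))
```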

Lemma C.10. Let $g \sim \mathcal{CN}(0, I)$, and let $M(g)$ and $H(g)$ be defined as in (C.2) and (C.6). Then we have
\[
  \mathbb{E}_g\big[M(g)\big] = P_{x^\perp} + \frac{1+4\sigma^2}{1+2\sigma^2}\,xx^*,
  \qquad
  \mathbb{E}_g\big[H(g)\big] = P_{x^\perp}.
\]

Proof. By Lemma C.9, writing the $k$-th row of $C_g R_{[1:n]}^*$ as $g_k^*$, we observe
\begin{align*}
  \mathbb{E}[M]
  &= \frac{2\sigma^2+1}{m}\sum_{k=1}^m \mathbb{E}\big[\zeta_{\sigma^2}(g_k^* x)\,g_k g_k^*\big] \\
  &= \frac{2\sigma^2+1}{m}\sum_{k=1}^m
     \Big\{\mathbb{E}\big[\zeta_{\sigma^2}(g_k^* x)\big]\,\mathbb{E}\big[P_{x^\perp} g_k g_k^* P_{x^\perp}\big]
     + \mathbb{E}\big[\zeta_{\sigma^2}(g_k^* x)\,P_x g_k g_k^* P_x\big]\Big\} \\
  &= P_{x^\perp} + \frac{2\sigma^2+1}{m}\sum_{k=1}^m \mathbb{E}\big[\zeta_{\sigma^2}(g_k^* x)\,|g_k^* x|^2\big]\,xx^* \\
  &= P_{x^\perp} + \frac{4\sigma^2+1}{2\sigma^2+1}\,xx^*,
\end{align*}
where the second equality uses the fact that $P_{x^\perp} g_k$ is zero-mean and independent of $g_k^* x$, so that the cross terms vanish. Thus, we have
\[
  \mathbb{E}[H] = P_{x^\perp}\,\mathbb{E}[M]\,P_{x^\perp}
  = P_{x^\perp}\Big(P_{x^\perp} + \frac{4\sigma^2+1}{2\sigma^2+1}\,xx^*\Big)P_{x^\perp}
  = P_{x^\perp},
\]
as desired.
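A small Monte Carlo check of the expectation computed above (illustrative, not from the paper; $\zeta_{\sigma^2}$ as reconstructed earlier, unit-norm $x$): averaging $(2\sigma^2+1)\,\zeta_{\sigma^2}(g_k^* x)\,g_k g_k^*$ over $g_k \sim \mathcal{CN}(0, I)$ should recover $P_{x^\perp} + \frac{4\sigma^2+1}{2\sigma^2+1}\,xx^*$.

```python
import numpy as np

# Illustrative Monte Carlo check of Lemma C.10 (unit ||x||):
#   E[(2*s2+1) * zeta_{s2}(g^* x) * g g^*] = P_perp + (4*s2+1)/(2*s2+1) * x x^*
rng = np.random.default_rng(3)
n, s2, N = 3, 1.0, 1_000_000

zeta = lambda u: 1 - np.exp(-np.abs(u) ** 2 / (2 * s2))
cn = lambda *sh: (rng.standard_normal(sh) + 1j * rng.standard_normal(sh)) / np.sqrt(2)

x = cn(n); x /= np.linalg.norm(x)
g = cn(N, n)                                      # rows are samples g_k ~ CN(0, I)

w = (2 * s2 + 1) * zeta(g.conj() @ x)             # weights (2*s2+1) * zeta(g_k^* x)
M = (g * w[:, None]).T @ g.conj() / N             # empirical E[w * g g^*]

target = (np.eye(n) - np.outer(x, x.conj())) \
         + (4 * s2 + 1) / (2 * s2 + 1) * np.outer(x, x.conj())
print(np.abs(M - target).max())                   # ~1e-3 Monte Carlo error
```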

Lemma C.11. Let $g \sim \mathcal{CN}(0, I)$ and $\widetilde{g} \sim \mathcal{CN}(0, 2I)$. For any positive integer $p \ge 1$, we have
\[
  \Big\| \frac{1+2\sigma^2}{m}\big\langle \mathbf{1},\, \zeta_{\sigma^2}(g \circledast x) \big\rangle - 1 \Big\|_{L^p}
  \;\le\; \frac{3\,(2\sigma^2+1)}{\sqrt{m}\,\sigma}\,\|C_x\|\,\sqrt{p},
\]
\[
  \Big\| \frac{1+2\sigma^2}{m}\big\langle \mathbf{1},\, \eta_{\sigma^2}(\widetilde{g} \circledast x) \big\rangle - 1 \Big\|_{L^p}
  \;\le\; \frac{3\,\sigma^2(2\sigma^2+1)}{\sqrt{m}\,(\sigma^2 - 1/2)^{3/2}}\,\|C_x\|\,\sqrt{p}.
\]

Proof. Let $h(g) = \frac{1+2\sigma^2}{m}\langle \mathbf{1}, \zeta_{\sigma^2}(g \circledast x)\rangle - 1$ and $h'(\widetilde{g}) = \frac{1+2\sigma^2}{m}\langle \mathbf{1}, \eta_{\sigma^2}(\widetilde{g} \circledast x)\rangle - 1$. By Lemma C.9, we know that
\[
  \mathbb{E}_g\big[h(g)\big] = 0, \qquad \mathbb{E}_{\widetilde{g}}\big[h'(\widetilde{g})\big] = 0.
\]
And for an independent copy $g'$ of $g$, we have
\begin{align*}
  \big|h(g) - h(g')\big|
  &\le \frac{1+2\sigma^2}{m}\,
       \Big|\Big\langle \mathbf{1},\; e^{-\frac{1}{2\sigma^2}|g \circledast x|^2} - e^{-\frac{1}{2\sigma^2}|g' \circledast x|^2} \Big\rangle\Big| \\
  &\le \frac{1+2\sigma^2}{m}\,
       \Big\| e^{-\frac{1}{2\sigma^2}|g \circledast x|^2} - e^{-\frac{1}{2\sigma^2}|g' \circledast x|^2} \Big\|_1 \\
  &\le \frac{1+2\sigma^2}{\sqrt{m}\,\sigma}\,\big\|C_x(g - g')\big\|
   \;\le\; \frac{1+2\sigma^2}{\sqrt{m}\,\sigma}\,\|C_x\|\,\|g - g'\|,
\end{align*}
where we used the fact that $\exp\big(-\frac{x^2}{2\sigma^2}\big)$ is $\sigma^{-1}e^{-1/2}$-Lipschitz. By applying the Gaussian concentration inequality in Lemma A.3, we have
\[
  \mathbb{P}\big(|h(g)| \ge t\big)
  = \mathbb{P}\Big(\Big|\frac{1+2\sigma^2}{m}\big\langle \mathbf{1}, \zeta_{\sigma^2}(|g \circledast x|)\big\rangle - 1\Big| \ge t\Big)
  \;\le\; 2\exp\Big(-\frac{\sigma^2 m t^2}{2(2\sigma^2+1)^2\|C_x\|^2}\Big)
\]
for any scalar $t \ge 0$. Thus, we can see that $h(g)$ is a centered $\frac{(2\sigma^2+1)^2\|C_x\|^2}{\sigma^2 m}$-subgaussian random variable, and by Lemma A.5, we know that for any positive $p \ge 1$,
\[
  \big\|h(g)\big\|_{L^p} \;\le\; 3\,\frac{(2\sigma^2+1)\,\|C_x\|}{\sigma\sqrt{m}}\,\sqrt{p},
\]
as desired. For $h'(\widetilde{g})$, we can obtain the result similarly.
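To connect the lemma back to the convolutional measurement model, the following snippet (illustrative, not from the paper) forms $g \circledast x$ by FFT-based cyclic convolution and checks that $h(g)$ is centered and that its empirical $L^p$ norms grow at most like $\sqrt{p}$, using the reconstructed $\zeta_{\sigma^2}$ from above; it also verifies numerically that $\sigma^{-1}e^{-1/2}$ bounds the derivative magnitude of $e^{-x^2/(2\sigma^2)}$.

```python
import numpy as np

# Illustrative check of Lemma C.11 (not from the paper): with cyclic
# convolution g * x computed via FFT, the statistic
#   h(g) = (1 + 2*s2)/m * <1, zeta(|g * x|)> - 1
# is centered, and its empirical L^p norms grow at most like sqrt(p).
rng = np.random.default_rng(4)
m, n, s2, N = 128, 16, 1.0, 20_000

zeta = lambda u: 1 - np.exp(-np.abs(u) ** 2 / (2 * s2))
cn = lambda *sh: (rng.standard_normal(sh) + 1j * rng.standard_normal(sh)) / np.sqrt(2)

x = cn(n); x /= np.linalg.norm(x)
xpad = np.concatenate([x, np.zeros(m - n)])        # embed x in C^m

g = cn(N, m)
conv = np.fft.ifft(np.fft.fft(g, axis=1) * np.fft.fft(xpad), axis=1)  # cyclic g * x
h = (1 + 2 * s2) * zeta(conv).mean(axis=1) - 1

print(h.mean())                                    # ~0: h(g) is centered
for p in (1, 2, 4, 8):                             # L^p norm divided by sqrt(p)
    print(p, np.mean(np.abs(h) ** p) ** (1 / p) / np.sqrt(p))

# The Lipschitz constant of f(u) = exp(-u^2/(2*s2)) is sigma^{-1} e^{-1/2}:
u = np.linspace(0, 6, 100_000)
fprime = (u / s2) * np.exp(-u ** 2 / (2 * s2))
print(fprime.max(), np.exp(-0.5) / np.sqrt(s2))    # both ~0.6065 for s2 = 1
```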

ACKNOWLEDGMENT

Q. Qu gratefully acknowledges the support of the Microsoft graduate research fellowship and the Moore-Sloan fellowship. The authors would like to thank S. Zhong for helpful discussions of real applications and for providing the antenna data used in the experiments, and J. Sun and H.-W. Kuo for helpful discussions and input regarding the analysis in this work.



Qing Qu is a Moore-Sloan research fellow at the Center for Data Science, New York University. He received the Ph.D. degree in Electrical Engineering from Columbia University in October 2018, the B.Eng. degree from Tsinghua University in July 2011, and the M.Sc. degree from Johns Hopkins University in December 2012, the latter two both in Electrical and Computer Engineering. He interned at the U.S. Army Research Laboratory in 2012 and at Microsoft Research in 2016. His research interests lie at the intersection of signal/image processing, machine learning, and numerical optimization, with a focus on developing efficient nonconvex methods and global optimality guarantees for solving engineering problems in signal processing, computational imaging, and machine learning. He received the Best Student Paper Award at SPARS'15 (with Ju Sun and John Wright) and the 2016-2018 Microsoft Research Fellowship in North America.

Yuqian Zhang is a postdoctoral scholar affiliated with the Department of Computer Science and the TRIPODS data science center at Cornell University. She received the Ph.D. degree in Electrical Engineering from Columbia University in 2018 and the B.S. degree in Electrical Engineering from Xi'an Jiaotong University in 2011. Her research leverages physical models in data-driven computation and convex and nonconvex optimization, solving problems in machine learning, computer vision, and signal processing.


Yonina C. Eldar (S'98-M'02-SM'07-F'12) received the B.Sc. degree in Physics in 1995 and the B.Sc. degree in Electrical Engineering in 1996, both from Tel-Aviv University (TAU), Tel-Aviv, Israel, and the Ph.D. degree in Electrical Engineering and Computer Science in 2002 from the Massachusetts Institute of Technology (MIT), Cambridge. She is currently a Professor in the Department of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot, Israel. She was previously a Professor in the Department of Electrical Engineering at the Technion, where she held the Edwards Chair in Engineering. She is also a Visiting Professor at MIT, a Visiting Scientist at the Broad Institute, and an Adjunct Professor at Duke University, and was a Visiting Professor at Stanford. She is a member of the Israel Academy of Sciences and Humanities (elected 2017), an IEEE Fellow, and a EURASIP Fellow. Her research interests are in the broad areas of statistical signal processing, sampling theory and compressed sensing, learning and optimization methods, and their applications to biology and optics.

Dr. Eldar has received many awards for excellence in research and teaching, including the IEEE Signal Processing Society Technical Achievement Award (2013), the IEEE/AESS Fred Nathanson Memorial Radar Award (2014), and the IEEE Kiyo Tomiyasu Award (2016). She was a Horev Fellow of the Leaders in Science and Technology program at the Technion and an Alon Fellow. She received the Michael Bruno Memorial Award from the Rothschild Foundation, the Weizmann Prize for Exact Sciences, the Wolf Foundation Krill Prize for Excellence in Scientific Research, the Henry Taub Prize for Excellence in Research (twice), the Hershel Rich Innovation Award (three times), the Award for Women with Distinguished Contributions, the Andre and Bella Meyer Lectureship, the Career Development Chair at the Technion, the Muriel & David Jacknow Award for Excellence in Teaching, and the Technion's Award for Excellence in Teaching (twice). She received several best paper and best demo awards together with her research students and colleagues, including the SIAM Outstanding Paper Prize, the UFFC Outstanding Paper Award, the Signal Processing Society Best Paper Award, and the IET Circuits, Devices and Systems Premium Award, and was selected as one of the 50 most influential women in Israel.

She was a member of the Young Israel Academy of Science and Humanities and the Israel Committee for Higher Education. She is the Editor in Chief of Foundations and Trends in Signal Processing, a member of the IEEE Sensor Array and Multichannel Technical Committee, and serves on several other IEEE committees. In the past, she was a Signal Processing Society Distinguished Lecturer and a member of the IEEE Signal Processing Theory and Methods and Bio Imaging Signal Processing technical committees, and she served as an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING, the EURASIP Journal of Signal Processing, the SIAM Journal on Matrix Analysis and Applications, and the SIAM Journal on Imaging Sciences. She was Co-Chair and Technical Co-Chair of several international conferences and workshops.

John Wright is an associate professor in Electrical Engineering at Columbia University. He is also affiliated with the Department of Applied Physics and Applied Mathematics and Columbia's Data Science Institute. He received his Ph.D. in Electrical Engineering from the University of Illinois at Urbana-Champaign in 2009. Before joining Columbia, he was with Microsoft Research Asia from 2009 to 2011. His research interests include sparse and low-dimensional models for high-dimensional data, optimization (convex and otherwise), and applications in imaging and vision. His work has received a number of awards and honors, including the 2012 COLT Best Paper Award and the 2015 PAMI TC Young Researcher Award.
