IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 3, MARCH 2020

Convolutional Phase Retrieval via Gradient Descent

Qing Qu, Yuqian Zhang, Yonina C. Eldar, Fellow, IEEE, and John Wright
Abstract—We study the convolutional phase retrieval problem of recovering an unknown signal x ∈ C^n from m measurements consisting of the magnitude of its cyclic convolution with a given kernel a ∈ C^m. This model is motivated by applications such as channel estimation, optics, and underwater acoustic communication, where the signal of interest is acted on by a given channel/filter, and phase information is difficult or impossible to acquire. We show that when a is random and the number of observations m is sufficiently large, with high probability x can be efficiently recovered up to a global phase shift using a combination of spectral initialization and generalized gradient descent. The main challenge is coping with dependencies in the measurement operator. We overcome this challenge by using ideas from decoupling theory, suprema of chaos processes, the restricted isometry property of random circulant matrices, and a recent analysis of alternating minimization methods.

Index Terms—Phase retrieval, nonconvex optimization, nonlinear inverse problem, circulant convolution.

Manuscript received May 30, 2018; revised June 25, 2019; accepted October 2, 2019. Date of publication October 31, 2019; date of current version February 14, 2020. This work was supported in part by the National Science Foundation (NSF) Computing and Communication Foundations (CCF) under Grant 1527809, in part by NSF Information and Intelligent Systems (IIS) under Grant 1546411, in part by the European Union's Horizon 2020 Research and Innovation Program under Grant 646804-ERC-COG-BNYQ, in part by the Israel Science Foundation under Grant 335/14, and in part by the CCF under Grant 1740833 and Grant 1733857. This article was presented at NeurIPS 2017.

Q. Qu was with the Department of Electrical Engineering, Columbia University, New York, NY 10027 USA, and also with the Data Science Institute, Columbia University, New York, NY 10027 USA. He is now with the Center for Data Science, New York University, New York, NY 10011 USA (e-mail: [email protected]).

Y. Zhang was with the Department of Electrical Engineering, Columbia University, New York, NY 10027 USA, and also with the Data Science Institute, Columbia University, New York, NY 10027 USA. She is now with the Department of Computer Science, Cornell University, Ithaca, NY 14853 USA (e-mail: [email protected]).

Y. C. Eldar is with the Faculty of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot 7610001, Israel (e-mail: [email protected]).

J. Wright is with the Department of Electrical Engineering, the Department of Applied Physics and Applied Mathematics, and the Data Science Institute, Columbia University, New York, NY 10027 USA (e-mail: [email protected]).

Communicated by P. Grohs, Associate Editor for Signal Processing.

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIT.2019.2950717

¹We use this term to make a distinction from Fourier phase retrieval, where the matrix A is an oversampled DFT matrix. Here, generalized phase retrieval refers to the problem with any generic measurement other than Fourier.

0018-9448 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

WE consider the problem of convolutional phase retrieval, where our goal is to recover an unknown signal x ∈ C^n from the magnitude of its cyclic convolution with a given filter a ∈ C^m. Specifically, the measurements take the form

    y = |a ⊛ x|,  (I.1)

where ⊛ is cyclic convolution modulo m and |·| denotes entrywise absolute value. This problem can be rewritten in the common matrix-vector form

    find z,  s.t.  y = |Az|,  (I.2)

where A corresponds to a cyclic convolution. Convolutional phase retrieval is motivated by applications in areas such as channel estimation [2], noncoherent optical communication [3], and underwater acoustic communication [4]. For example, in millimeter-wave (mm-wave) wireless communications for 5G networks [5], one important problem is to estimate the angle of arrival (AoA) of a signal from measurements taken by the convolution of an antenna pattern and AoAs. As the phase measurements are often very noisy, unreliable, and expensive to acquire, it may be preferable to take only measurements of the signal magnitude, in which case the phase information is lost.

Most known results on the exact solution of phase retrieval problems [6]–[11] pertain to generic random matrices, where the entries of A are independent subgaussian random variables. We term problem (I.2) with a generic sensing matrix A generalized phase retrieval.¹ However, in practice it is difficult to implement purely random measurement matrices. In most applications, the measurement is much more structured; the convolutional model studied here is one such structured measurement operator. Moreover, structured measurements often admit more efficient numerical methods: by using the fast Fourier transform for matrix-vector products, the benign structure of the convolutional model (I.1) allows us to design methods with O(m) memory and O(m log m) computational cost per iteration. In contrast, for generic measurements, the cost is around O(mn).

In this work, we study the convolutional phase retrieval problem (I.1) under the assumption that the kernel a = [a_1, ···, a_m] is randomly generated from an i.i.d. standard complex Gaussian distribution CN(0, I). Compared to the generalized phase retrieval problem, the random convolution model (I.1) we study here is far more structured: it is parameterized by only O(m) independent complex normal random variables, whereas the generic model involves O(mn) random variables. This extra structure poses significant challenges for analysis: the rows and columns of the sensing matrix A are probabilistically dependent, so that classical probability tools (based on concentration of functions of independent random vectors) do not apply.

We propose and analyze a local gradient descent type method, minimizing a weighted, nonconvex and nonsmooth
objective

    min_{z ∈ C^n} f(z) = (1/(2m)) ‖ b^{1/2} ⊙ (y − |Az|) ‖²,  (I.3)

where ⊙ denotes the Hadamard product. Here, b ∈ R^m_{++} is a weighting vector, which is introduced mainly for analysis purposes. The choice of b is discussed in Section III. Our result can be informally summarized as follows:

    With m ≥ Ω( (‖C_x‖²/‖x‖²) · n · poly log n ) samples, generalized gradient descent starting from a data-driven initialization converges to the target solution at a linear rate.

Here, C_x ∈ C^{m×m} denotes the circulant matrix corresponding to cyclic convolution with a length-m zero padding of x, and poly log n denotes a polynomial in log n. Compared to the results for generalized phase retrieval with i.i.d. Gaussian measurements, the sample complexity m here has an extra dependency on ‖C_x‖/‖x‖. The operator norm ‖C_x‖ is inhomogeneous over CS^{n−1}: for a typical x (e.g., drawn uniformly at random from CS^{n−1}), ‖C_x‖ is of the order O(log n) and the sample complexity matches that of generalized phase retrieval up to log factors; the "bad" case is when x is sparse in the Fourier domain, where ‖C_x‖ ∼ O(√n) and m can be as large as O(n² poly log n).

Our proof is based on ideas from decoupling theory [12] and on the suprema of chaos processes and the restricted isometry property of random circulant matrices [13], [14], and is also inspired by a new iterative analysis of alternating minimization methods [11]. Our analysis draws connections between the convergence properties of gradient descent and the classical alternating direction method. This allows us to avoid the need to argue uniform concentration of high-degree polynomials in the structured random matrix A, as would be required by a straightforward translation of existing analyses to this new setting. Instead, we control the bulk effect of phase errors uniformly in a neighborhood around the ground truth. This requires us to develop new decoupling and concentration tools for controlling nonlinear phase functions of circulant random matrices, which could be potentially useful for analyzing other random convolution problems, such as sparse blind deconvolution [15], [16] and convolutional dictionary learning [17], [18].

A. Comparison With Literature

a) Prior art on phase retrieval. The challenge of developing efficient, guaranteed methods for phase retrieval has attracted substantial interest over the past several decades [19], [20]. The problem is motivated by applications such as X-ray crystallography [21], [22], microscopy [23], astronomy [24], diffraction and array imaging [25], [26], optics [27], and more. The most classical method is the error reduction algorithm derived by Gerchberg and Saxton [28], also known as the alternating direction method. This approach has been further improved by the hybrid input-output (HIO) algorithm [29]. For oversampled Fourier measurements, it often works surprisingly well in practice, while theoretical understanding of its global convergence properties still remains largely open [30].

For generalized phase retrieval, where the sensing matrix A is i.i.d. Gaussian, the problem is better studied: in many cases, when the number of measurements is large enough, the target solution can be exactly recovered using either convex or nonconvex optimization methods. The first theoretical guarantees for global recovery of generalized phase retrieval with i.i.d. Gaussian measurements are based on convex optimization — the so-called PhaseLift/PhaseMax methods [6], [10], [31]. These methods lift the problem to a higher dimension and solve a semidefinite programming (SDP) problem. However, the high computational cost of SDP limits their practicality. Quite recently, [32]–[34] revealed that the problem can also be solved in the natural parameter space via linear programming.

Recently, nonconvex approaches have led to new computational guarantees for global optimization of generalized phase retrieval. The first result of this type is due to Netrapalli et al. [35], showing that the alternating minimization method provably converges to the truth when initialized using a spectral method and provided with fresh samples at each iteration. Later on, Candès et al. [36] showed that with the same initialization, gradient descent for the nonconvex least squares objective

    min_{z ∈ C^n} f_1(z) = (1/(2m)) ‖ y² − |Az|² ‖²,  (I.4)

provably recovers the ground truth, with near-optimal sample complexity m ≥ Ω(n log n). The subsequent work [8], [9], [37] further reduced the sample complexity to m ≥ Ω(n) by using different nonconvex objectives and truncation techniques. In particular, recent work [9], [37] studied a nonsmooth objective that is similar to ours (I.3) with weighting b = 1. Compared to the SDP-based techniques, these methods are more scalable and closer to the approaches used in practice. Moreover, Sun et al. [38] revealed that the nonconvex objective (I.4) actually has a benign global geometry: with high probability, it has no bad critical points when m ≥ Ω(n log³ n).² Such a result enables initialization-free nonconvex recovery [40], [41]. For convolutional phase retrieval, it would be desirable to characterize the global geometry of the problem as in [38], [42]–[45]. However, the inhomogeneity of ‖C_x‖ over CS^{n−1} causes tremendous difficulties for concentration with m ≥ Ω(n poly log n) samples.

²[39] further tightened the sample complexity to m ≥ Ω(n log n) by using more advanced probability tools.

b) Structured random measurements. The study of structured random measurements in signal processing has quite a long history [46]. For compressed sensing [47], the work [48]–[50] studied random Fourier measurements, and later [13], [14] proved similar results for partial random convolution measurements. However, the study of structured random measurements for phase retrieval is still quite limited. In particular, [51] and [52] studied t-designs and coded diffraction patterns (i.e., random masked Fourier measurements) using semidefinite programming. Recent work studied nonconvex
optimization using coded diffraction patterns [36] and STFT measurements [53], both of which minimize a nonconvex objective similar to (I.4). These measurement models are motivated by different applications: coded diffraction is designed for imaging applications such as X-ray diffraction imaging, while STFT measurements arise in frequency resolved optical gating [54] and some speech processing tasks [55]. Both results show iterative contraction in a region that is at most O(1/√n)-close to the optimum. Unfortunately, in both cases either the radius of the contraction region is not large enough for the initialization to reach, or extra artificial techniques such as resampling the data are required. In comparison, the contraction region we show for the random convolutional model is larger, O(1/poly log(n)), which is achievable in the initialization stage via the spectral method. For a more detailed review of this subject, we refer the readers to Section 4 of [46].

The convolutional measurement can also be viewed as a single masked coded diffraction pattern [36], [52], since a ⊛ x = F^{−1}( â ⊙ x̂ ), where â is the Fourier transform of a and x̂ is the oversampled Fourier transform of x. The sample complexity for coded diffraction patterns, m ≥ Ω(n log⁴ n) in [36], suggests that the dependence of our sample complexity on ‖C_x‖ might not be necessary for convolutional phase retrieval and can be improved in the future. On the other hand, our results suggest that the contraction region is larger than O(1/√n) for coded diffraction patterns, and resampling for initialization might not be necessary.

B. Notations, Wirtinger Calculus, and Organization

a) Basic notations. Throughout this paper, all vectors/matrices are written in bold font a/A; indexed values are written as a_i, A_ij. We use CS^{n−1} to represent the unit complex sphere in C^n. We use (·)^⊤ and (·)* to denote the real transpose and the Hermitian transpose of a vector or matrix, respectively, and ℜ(·) and ℑ(·) for the real and imaginary parts of a complex variable. We write g_1 ⫫ g_2 to denote independence of two random variables g_1, g_2. Given a matrix X ∈ C^{m×n}, col(X) and row(X) are its column and row spaces. We use ‖·‖_F and ‖·‖ to denote the Frobenius norm and the spectral norm of a matrix, respectively. For a random variable X, its L^p norm is defined as ‖X‖_{L^p} = (E[|X|^p])^{1/p} for any p ≥ 1. For a smooth function f ∈ C¹, its L^∞ norm is defined as ‖f‖_{L^∞} = sup_{t ∈ dom(f)} |f(t)|. We use CN(0, I) to denote the standard complex Gaussian distribution: if a ∼ CN(0, I), then

    a = u + i v,  u, v ∼_{i.i.d.} N(0, (1/2) I).

In addition, for all theorems and proofs we use c_i and C_i (i = 1, 2, ···) to denote positive numerical constants.

• Some basic operators. For any vector v ∈ C^n, we define

    P_v = v v* / ‖v‖²,  P_{v⊥} = I − v v* / ‖v‖²

to be the projections onto the span of v and its orthogonal complement, respectively. For an arbitrary set Ω, we denote its cardinality by |Ω| and let supp(Ω) be the support set of Ω. If 1_Ω is the indicator function of the set Ω, then [1_Ω]_j = 1 if j ∈ Ω and 0 otherwise, where [·]_j is the jth coordinate of a given vector. For |Ω| = ℓ, we let R_Ω : C^m → C^ℓ be the operator that restricts a vector to its coordinates in the set Ω, and we use R*_Ω to denote its adjoint. Let F_n ∈ C^{n×n} denote the unnormalized n × n Fourier matrix with ‖F_n‖ = √n, and let F_n^m ∈ C^{m×n} (m ≥ n) be an oversampled Fourier matrix.

• Cyclic convolution. Let C_a ∈ C^{m×m} be the circulant matrix generated from a, i.e.,

    C_a = [ a_1      a_m      ···      a_3      a_2
            a_2      a_1      a_m      ···      a_3
            ⋮        a_2      a_1      ⋱        ⋮
            a_{m−1}  ⋮        ⋱        ⋱        a_m
            a_m      a_{m−1}  ···      a_2      a_1 ]
        = [ s_0[a]  s_1[a]  ···  s_{m−1}[a] ],  (I.5)

where s_ℓ[·] (0 ≤ ℓ ≤ m − 1) denotes a circulant shift by ℓ samples. Therefore, the convolution a ⊛ x in (I.1) can be rewritten in the matrix-vector form

    a ⊛ x = C_a R*_{[1:n]} x,  (I.6)

where R_{[1:n]} : C^m → C^n maps any vector to its first n coordinates, and R*_{[1:n]} is its adjoint operator.

b) Wirtinger calculus. Consider a real-valued function g(z) : C^n → R. Such a function is not holomorphic, so it is not complex differentiable unless it is constant [56]. However, if one identifies C^n with R^{2n} and treats g as a function in the real domain, g can be differentiable in the real sense. Doing calculus for g directly in the real domain tends to produce cumbersome expressions; a more elegant way is to adopt the Wirtinger calculus [57], which can be considered as a neat way of organizing the real partial derivatives (see also [7] and Section 1 of [38]). The Wirtinger derivatives can be defined formally as

    ∂g/∂z := ∂g(z, z̄)/∂z |_{z̄ const} = ( ∂g(z, z̄)/∂z_1, ···, ∂g(z, z̄)/∂z_n ) |_{z̄ const},
    ∂g/∂z̄ := ∂g(z, z̄)/∂z̄ |_{z const} = ( ∂g(z, z̄)/∂z̄_1, ···, ∂g(z, z̄)/∂z̄_n ) |_{z const}.

Basically, when evaluating ∂g/∂z, one writes g in terms of the pair (z, z̄) and conducts the calculus by treating z̄ as if it were a constant; ∂g/∂z̄ is computed in a similar fashion. To evaluate the individual partial derivatives, such as ∂g(z, z̄)/∂z_i, all the usual rules of calculus apply. For more details on Wirtinger calculus, we refer interested readers to [56].
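To make the measurement model concrete, the cyclic convolution (I.1) and its matrix form (I.6) can be sketched numerically: zero-pad x to length m and multiply DFTs, which coincides with applying the first n columns of C_a. This is an illustrative sketch of ours, not code from the paper; the function name `conv_measurements` is hypothetical.

```python
import numpy as np

def conv_measurements(a, x):
    """Measurements y = |a (*) x| of model (I.1), computed via FFT in
    O(m log m) time. a: length-m complex kernel; x: length-n signal, n <= m.
    Zero-padding x to length m realizes A = C_a R*_{[1:n]} from (I.6)."""
    m, n = len(a), len(x)
    x_pad = np.concatenate([x, np.zeros(m - n, dtype=complex)])
    return np.abs(np.fft.ifft(np.fft.fft(a) * np.fft.fft(x_pad)))

rng = np.random.default_rng(0)
n, m = 8, 32
x = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
a = (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)
y = conv_measurements(a, x)

# the same measurements via the circulant matrix C_a of (I.5):
# column k of C_a is the circulant shift s_k[a]; A keeps the first n columns
A = np.stack([np.roll(a, k) for k in range(n)], axis=1)
y_mat = np.abs(A @ x)
```

Both routes agree, and the magnitudes are invariant to a global phase of x, which is why the signal can only ever be recovered up to a global phase shift.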
c) Organization. The rest of the paper is organized as follows. In Section II, we introduce the basic formulation of the problem and the proposed algorithm. In Section III, we present the main results and a sketch of the proof; the detailed analysis is postponed to Section VI. In Section IV, we corroborate our analysis with numerical experiments. We discuss the potential impact of our work in Section V. Finally, all the basic probability tools used in this paper are described in the appendices.

II. NONCONVEX OPTIMIZATION VIA GRADIENT DESCENT

In this work, we develop an approach to convolutional phase retrieval based on local nonconvex optimization. Our proposed algorithm has two components: (1) a careful data-driven initialization using a spectral method; (2) local refinement by gradient descent. We introduce the two steps below.

A. Nonconvex and Nonsmooth Minimization

We consider minimizing the weighted nonconvex and nonsmooth objective introduced in (I.3), where, as shown in (I.6), the matrix A = C_a R*_{[1:n]} is formed by the first n columns of C_a. The adoption of the positive weights b facilitates our analysis by enabling us to compare certain functions of the dependent random matrix A to functions involving more independent random variables. We will substantiate this claim in the next section.

We consider the generalized Wirtinger gradient of (I.3),

    ∂f/∂z̄ (z) = (1/m) A* diag(b) [ Az − y ⊙ exp( iφ(Az) ) ].

Here, because of the nonsmoothness of (I.3), f(·) is not differentiable everywhere, even in the real sense. To deal with this issue, we specify

    exp( iφ(u) ) := u/|u| if |u| ≠ 0, and 1 otherwise,

for any complex number u ∈ C, with φ(u) ∈ [0, 2π). Starting from some initialization z^(0), we minimize the objective (I.3) by generalized gradient descent

    z^(r+1) = z^(r) − τ ∂f/∂z̄ ( z^(r) ),  (II.1)

where τ > 0 is the stepsize. Indeed, ∂f/∂z̄ (z) can be interpreted as a subgradient of f(z) in the real sense, and the method can be seen as a variant of amplitude flow [9].

B. Initialization via Spectral Method

Similar to [7], [35], we compute the initialization z^(0) via a spectral method, detailed in Algorithm 1. More specifically, z^(0) is a scaled version of the leading eigenvector of the matrix

    Y = (1/m) Σ_{k=1}^m y_k² a_k a_k* = (1/m) A* diag(y²) A,  (II.2)

which is constructed from the knowledge of the sensing vectors and observations. The leading eigenvector of Y can be efficiently computed via the power method. Note that E[Y] = ‖x‖² I + x x*, so the leading eigenvector of E[Y] is proportional to the target solution x. Under the random convolutional model of A, by using probability tools from [46], we show that v* Y v concentrates about its expectation v* E[Y] v for all v ∈ CS^{n−1} whenever m ≥ Ω(n poly log n), ensuring that the initialization z^(0) is close to the set of target solutions.

Algorithm 1 Spectral Initialization
Input: Observations {y_k}_{k=1}^m.
Output: The initial guess z^(0).
1: Estimate the norm of x by λ = ( (1/m) Σ_{k=1}^m y_k² )^{1/2}.
2: Compute the leading eigenvector z̃^(0) ∈ CS^{n−1} of the matrix Y = (1/m) Σ_{k=1}^m y_k² a_k a_k* = (1/m) A* diag(y²) A.
3: Set z^(0) = λ z̃^(0).

It should be noted that several variants of the initialization approach in Algorithm 1 have been introduced in the literature; they improve upon the log factors in the sample complexity for generalized phase retrieval with i.i.d. measurements. These methods include the truncated spectral method [8], null initialization [58], and the orthogonality-promoting initialization [9]. For simplicity of analysis, here we only consider Algorithm 1 for the convolutional model.

III. MAIN RESULT AND SKETCH OF ANALYSIS

In this section, we introduce our main theoretical result and sketch the basic ideas behind the analysis. Without loss of generality, we assume the ground truth signal to be x ∈ CS^{n−1}. Because the problem can only be solved up to a global phase shift, we define the set of target solutions as

    X := { x e^{iφ} | φ ∈ [0, 2π) },

and correspondingly let

    dist(z, X) := inf_{φ ∈ [0,2π)} ‖ z − x e^{iφ} ‖,

which measures the distance from a point z ∈ C^n to the set of target solutions X.

A. Main Result

Suppose the weighting vector in (I.3) is b = ζ_{σ²}(y), where

    ζ_{σ²}(t) = 1 − 2πσ² ξ_{σ²}(t),  (III.1a)
    ξ_{σ²}(t) = (1/(2πσ²)) exp( −|t|²/(2σ²) ),  (III.1b)
with σ² > 1/2. Our main theoretical result shows that, with high probability, the generalized gradient descent (II.1) with spectral initialization converges linearly to the optimal set X.

Theorem 3.1 (Main Result): If m ≥ C_0 n log³¹ n, then Algorithm 1 produces an initialization z^(0) satisfying

    dist( z^(0), X ) ≤ c_0 log⁻⁶ n ‖x‖

with probability at least 1 − c_1 m^{−c_2}. Starting from this z^(0), with σ² = 0.51 and stepsize τ = 2.02, whenever

    m ≥ C_1 ( ‖C_x‖² / ‖x‖² ) max{ log¹⁷ n, n log⁴ n },

all iterates z^(r) (r ≥ 1) of (II.1) satisfy

    dist( z^(r), X ) ≤ (1 − ε)^r dist( z^(0), X )  (III.2)

with probability at least 1 − c_3 m^{−c_4}, for some numerical constant ε ∈ (0, 1).

Remark: Our result shows that by initializing the problem O(1/poly log(n))-close to the optimum via the spectral method, gradient descent (II.1) converges linearly to the optimal solution. As we can see, the sample complexity also depends on ‖C_x‖, which is quite different from the i.i.d. case. For a typical x (e.g., x drawn uniformly at random from CS^{n−1}), ‖C_x‖ is on the order of O(log n), and the sample complexity m ≥ Ω(n poly log n) matches the i.i.d. case up to log factors. However, ‖C_x‖ is nonhomogeneous over x ∈ CS^{n−1}: if x is sparse in the Fourier domain (e.g., x = (1/√n) 1), the sample complexity can be as large as m ≥ Ω(n² poly log n). Such behavior is also demonstrated in the experiments of Section IV. We believe the (very large!) number of logarithms in our result is an artifact of our analysis rather than a limitation of the method; we expect that a tighter analysis can reduce the sample complexity to m ≥ Ω( (‖C_x‖²/‖x‖²) n log⁶ n ), which is left for future work.

As we shall see in the following, the smoothness of the weighting b ∈ R^m in (III.1) helps us get around difficulties in the convergence analysis. The particular numerical choices σ² = 0.51 and τ = 2.02 are merely to ensure the iterative contraction (III.2); for the analysis, the parameter pair (τ, σ²) can be selected in a range, as long as ε ∈ (0, 1) in (III.2). In practice, the algorithm converges with b = 1 and a small stepsize τ, or with backtracking line search for the stepsize τ.

B. A Sketch of the Analysis

In this subsection, we briefly highlight the major challenges and new ideas behind the analysis; all detailed proofs are postponed to Section VI. The core idea is to show that the iterates contract once the initialization is close enough to the optimum. In the following, we first describe the basic ideas for proving iterative contraction, which critically depends on bounding a certain nonlinear function of a random circulant matrix. We then sketch the core ideas of how to bound such a complicated term via the decoupling technique.

1) Proof sketch of iterative contraction: Our iterative analysis is inspired by the recent analysis of the alternating direction method (ADM) [11]. In the following, we draw connections between the gradient descent method (II.1) and ADM, and sketch the basic ideas of the convergence analysis.

a) ADM iteration. ADM is a classical method for solving phase retrieval problems [11], [28], [35], which can be considered a heuristic for solving the nonconvex problem

    min_{z ∈ C^n, |u| = 1} (1/2) ‖ Az − y ⊙ u ‖².

At every iterate z^(r), ADM proceeds in two steps:

    c^(r+1) = y ⊙ exp( iφ(A z^(r)) ),
    z^(r+1) = argmin_z (1/2) ‖ Az − c^(r+1) ‖²,

which leads to the update

    z^(r+1) = A† ( y ⊙ exp( iφ(A z^(r)) ) ),

where A† = (A* A)^{−1} A* is the pseudo-inverse of A. Let θ_r = argmin_{θ ∈ [0,2π)} ‖ z^(r) − x e^{iθ} ‖. The distance between z^(r+1) and X is then bounded by

    dist( z^(r+1), X ) = ‖ z^(r+1) − x e^{iθ_{r+1}} ‖
        ≤ ‖A†‖ ‖ A x e^{iθ_r} − y ⊙ exp( iφ(A z^(r)) ) ‖.  (III.3)

b) Gradient descent with b = 1. For simplicity and illustration purposes, let us first consider the gradient descent update (II.1) with b = 1 and stepsize τ = 1. Let θ_r = argmin_{θ ∈ [0,2π)} ‖ z^(r) − x e^{iθ} ‖. The distance between the iterate z^(r+1) and the optimal set X is bounded by

    dist( z^(r+1), X ) = ‖ z^(r+1) − x e^{iθ_{r+1}} ‖
        ≤ (1/m) ‖A‖ ‖ A x e^{iθ_r} − y ⊙ exp( iφ(A z^(r)) ) ‖
          + ‖ I − (1/m) A* A ‖ ‖ z^(r) − x e^{iθ_r} ‖.  (III.4)

c) Towards iterative contraction. By measure concentration, whenever m ≥ Ω(n poly log n) the following hold with high probability:

    ‖ I − (1/m) A* A ‖ = o(1),  (III.5a)
    ‖A‖ ≈ √m,  ‖A†‖ ≈ 1/√m.  (III.5b)
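The two-stage procedure and the contraction just discussed can be simulated in a few lines. The following is our own illustrative sketch (not code from the paper; all function names are ours): it runs the spectral initialization of Algorithm 1, refines with the unweighted update (II.1) using b = 1 and τ = 1, and includes an ADM step for comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 10, 4096

# ground truth x on the complex sphere, random complex Gaussian kernel a
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x /= np.linalg.norm(x)
a = (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)
A = np.stack([np.roll(a, k) for k in range(n)], axis=1)   # A = C_a R*_{[1:n]}
y = np.abs(A @ x)                                         # model (I.1)

def unit_phase(u):
    # exp(i phi(u)), with the convention exp(i phi(0)) = 1
    out = np.ones_like(u)
    nz = np.abs(u) > 0
    out[nz] = u[nz] / np.abs(u[nz])
    return out

def spectral_init():
    # Algorithm 1: scaled leading eigenvector of Y = (1/m) A* diag(y^2) A
    Y = (A.conj().T * y**2) @ A / m
    _, vecs = np.linalg.eigh(Y)               # Y is Hermitian
    return np.sqrt(np.mean(y**2)) * vecs[:, -1]

def grad_step(z, tau=1.0):
    # generalized gradient update (II.1) with weights b = 1
    Az = A @ z
    return z - (tau / m) * (A.conj().T @ (Az - y * unit_phase(Az)))

def adm_step(z):
    # ADM update z <- pinv(A) (y * exp(i phi(Az)))
    return np.linalg.lstsq(A, y * unit_phase(A @ z), rcond=None)[0]

def dist_to_X(z):
    # dist(z, X): error minimized over the global phase
    return np.linalg.norm(z - np.exp(1j * np.angle(np.vdot(x, z))) * x)

z = spectral_init()
d0 = dist_to_X(z)
for _ in range(50):
    z = grad_step(z)
d_final = dist_to_X(z)
```

With m much larger than n, the iterates shrink the distance to X at each step, mirroring the linear contraction (III.2); the dense eigendecomposition here replaces the FFT-based power method the paper would use at scale.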
Therefore, based on (III.3) and (III.4), to show iterative contraction it is sufficient to prove

    ‖ A x e^{iθ} − y ⊙ exp( iφ(Az) ) ‖ ≤ (1 − η) √m ‖ z − x e^{iθ} ‖  (III.6)

for some sufficiently small constant η ∈ (0, 1), where θ = argmin_{θ′ ∈ [0,2π)} ‖ z − x e^{iθ′} ‖, so that e^{iθ} = x* z / |x* z|. By borrowing ideas from the control of the analogous term in the ADM analysis [11], this observation provides a new way of analyzing the gradient descent method. As an attempt to show (III.6) for the random circulant matrix A, we invoke Lemma A.1 in the appendix, which controls the error of a first-order approximation to exp( iφ(·) ). Let us decompose

    z = α x + β w,

where w ∈ CS^{n−1} with w ⊥ x, and α, β ∈ C. Notice that φ(α) = θ, so that by Lemma A.1, for any ρ ∈ (0, 1) we have

    ‖ A x e^{iθ} − y ⊙ exp( iφ(Az) ) ‖
        ≤ 2 ‖ |Ax| ⊙ 1_{{ |β||Aw| ≥ ρ|α||Ax| }} ‖          (=: T_1)
          + (1/(1 − ρ)) |β/α| ‖ ℑ( (Aw) ⊙ exp( −iφ(Ax) ) ) ‖   (=: T_2).

The first term T_1 is much smaller than T_2; it can be bounded by a small numerical constant using the restricted isometry property of a random circulant matrix [14], together with some auxiliary analysis. The detailed analysis is provided in Section VI-D. The second term T_2 involves a nonlinear function exp( −iφ(Ax) ) of the random circulant matrix A, and controlling this nonlinear, highly dependent random process for all w is a nontrivial task. In the next subsection, we explain why bounding T_2 is technically challenging, and describe the key ideas for controlling a smoothed variant of T_2 by using the weighting b introduced in (III.1). We also provide intuition for why the weighting b is helpful.

2) Controlling a smoothed variant of the phase term T_2: As elaborated above, the major challenge in showing iterative contraction is bounding the supremum of the nonlinear, dependent random process T_2(w) over the set

    S := { w ∈ CS^{n−1} | w ⊥ x }.

By using the fact that ℑ(u) = (1/2i)(u − ū) for any u ∈ C, we have

    sup_{w∈S} T_2²(w) ≤ (1/2) sup_{w∈S} | w^⊤ A^⊤ diag( ψ(Ax) ) A w | + (1/2) ‖A‖²,

where we define ψ(t) := exp( −2iφ(t) ) and write L(a, w) := w^⊤ A^⊤ diag( ψ(Ax) ) A w for the first term. From (III.5), we know that ‖A‖ ≈ √m. Thus, to show (III.6), the major task left is to prove that

    sup_{w∈S} |L(a, w)| < (1 − η′) m  (III.7)

for some constant η′ ∈ (0, 1).

a) Why decoupling? Let a_k* (1 ≤ k ≤ m) denote the rows of A; then the term

    L(a, w) = w^⊤ A^⊤ diag( ψ(Ax) ) A w = Σ_{k=1}^m ψ( a_k* x ) ( a_k* w )²

is a summation of random variables that are dependent across k. To address this problem, we deploy ideas from decoupling [12]. Informally, decoupling allows us to compare moments of random functions to those of functions of more independent random variables, which are usually easier to analyze; the book [12] provides a beautiful introduction to this area. In our problem, notice that the random vector a occurs twice in the definition of L(a, w) — once in the phase term ψ(Ax) = exp( −2iφ(Ax) ), and once in the quadratic term. The general spirit of decoupling is to replace one of these copies of a with an independent copy a′ of the same random vector, yielding a random process with fewer dependencies. Here, we seek to replace L(a, w) with

    Q^L_dec(a, a′, w) = w^⊤ A^⊤ diag( ψ(A′x) ) A w,  (III.8)

where A′ is generated from a′ in the same way that A is generated from a. The utility of this new, decoupled form Q^L_dec(a, a′, w) of L(a, w) is that it introduces extra randomness — Q^L_dec(a, a′, w) is a chaos process of a conditioned on a′. This makes analyzing sup_{w∈S} Q^L_dec(a, a′, w) amenable to existing analyses of suprema of chaos processes for random circulant matrices [46].

However, achieving the decoupling requires additional work; the most general existing results on decoupling pertain to tetrahedral polynomials, i.e., polynomials with no monomial involving a power larger than one of any random variable. By appropriately tracking cross terms, these results can also be applied to more general (non-tetrahedral) polynomials in Gaussian random variables [59]. However, our random process L(a, w) involves the nonlinear phase term ψ(Ax), which is not a polynomial, and hence is not amenable to a direct appeal to existing results.

b) Decoupling is "recoupling". Existing results [59] for decoupling polynomials of Gaussian random variables are derived from two simple facts: (1) orthogonal projections of Gaussian variables are independent³; (2) Jensen's inequality. For the random vector a ∼ CN(0, I), let us introduce an independent copy δ ∼ CN(0, I), and write

    g¹ = a + δ,  g² = a − δ.

Because of fact (1), g¹ and g² are two independent CN(0, 2I) vectors. Now, taking the conditional expectation with respect to δ, we obtain

    E_δ[ Q^L_dec(g¹, g², w) ] = E_δ[ Q^L_dec(a + δ, a − δ, w) ] =: L̄(a, w).  (III.9)

³If two random variables are jointly Gaussian, they are statistically independent if and only if they are uncorrelated.
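Fact (1) above is easy to confirm empirically. The following small Monte Carlo sketch (ours, not from the paper) checks that g¹ = a + δ and g² = a − δ have variance 2 and are uncorrelated — and, being jointly Gaussian, therefore independent:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000

def cgauss(size):
    # standard complex Gaussian CN(0, 1): E|a|^2 = 1 and E[a^2] = 0
    return (rng.standard_normal(size) + 1j * rng.standard_normal(size)) / np.sqrt(2)

a, d = cgauss(N), cgauss(N)      # independent copies; d plays the role of delta
g1, g2 = a + d, a - d

var1 = np.mean(np.abs(g1) ** 2)        # should be near 2: g1 ~ CN(0, 2)
var2 = np.mean(np.abs(g2) ** 2)        # should be near 2: g2 ~ CN(0, 2)
cross = np.mean(g1 * np.conj(g2))      # E[g1 conj(g2)] = E|a|^2 - E|d|^2 = 0
pseudo = np.mean(g1 * g2)              # E[g1 g2] = E[a^2] - E[d^2] = 0
```

Both the covariance and the pseudo-covariance of (g¹, g²) vanish, which is the uncorrelatedness needed in footnote 3.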
Thus, the key idea of decoupling L(a, w) into Q^L_dec(a, a′, w) is essentially "recoupling" Q^L_dec(g¹, g², w) via conditional expectation; the "recoupled" term L̄ can be viewed as an approximation of L(a, w). Notice that by Fact (2), Jensen's inequality, for any convex function ϕ,

    E_a sup_{w∈S} ϕ( L̄(a, w) ) = E_a sup_{w∈S} ϕ( E_δ[ Q^L_dec(a + δ, a − δ, w) ] )
                              ≤ E_{a,δ} sup_{w∈S} ϕ( Q^L_dec(a + δ, a − δ, w) )
                              = E_{g¹,g²} sup_{w∈S} ϕ( Q^L_dec(g¹, g², w) ).

Thus, by choosing ϕ appropriately, i.e., ϕ(t) = |t|^p, we can control all the moments of sup_{w∈S} L̄(a, w):

    ‖ sup_{w∈S} L̄(a, w) ‖_{L^p} ≤ ‖ sup_{w∈S} Q^L_dec(g¹, g², w) ‖_{L^p}.    (III.10)

This type of inequality is very useful because it relates the moments of sup_{w∈S} L̄(a, w) to those of sup_{w∈S} Q^L_dec(a, a′, w). As discussed previously, Q^L_dec is a chaos process of g¹ conditioned on g²; its moments can be bounded using existing results [14]. If L were a tetrahedral polynomial, then we would have L̄ = L, i.e., the approximation would be exact. As the tail of sup_{w∈S} L̄(a, w) can be controlled via its moment bounds [60, Chapter 7.2], this would allow us to directly control the object of interest L. The reason is that the conditional expectation operator E_δ[· | a] "recouples" Q^L_dec(a, a′, w) back to the target L(a, w). In other words, (Gaussian) decoupling is recoupling.

c) "Recoupling" is Gaussian smoothing: In convolutional phase retrieval, a distinctive feature of the term L(a, w) is that ψ(·) is a phase function, and therefore L is not a polynomial. Hence, it may be challenging to posit a Q^L_dec which recouples exactly back to L; since L̄ ≠ L here, we need to tolerate an approximation error. Although L̄ is not exactly L, we can still control sup_{w∈S} |L(a, w)| through its approximation L̄(a, w) = E_δ[Q^L_dec(g¹, g², w)]:

    sup_{w∈S} |L(a, w)| ≤ sup_{w∈S} | L̄(a, w) − L(a, w) | + sup_{w∈S} | L̄(a, w) |.    (III.11)

As discussed above, the term sup_{w∈S} |L̄(a, w)| can be controlled by using decoupling and the moment bound (III.10). Therefore, the inequality (III.11) yields a sufficiently tight bound for L(a, w) provided that L̄(a, w) is uniformly close to L(a, w), i.e., the approximation error is small. Now the question is: for which L is it possible to find a "well-behaved" Q^L_dec such that the approximation error is small? To understand this question, recall that the mechanism linking Q^L_dec to L̄ is the conditional expectation operator E_δ[· | a]. In our case, from (III.9),

    L̄(a, w) = w^* A^* diag(h(Ax)) A w,   h(t) := E_{s∼CN(0,‖x‖²)}[ψ(t + s)].    (III.12)

Thus, by using (III.11) and (III.12), we can bound sup_{w∈S} |L(a, w)| as

    sup_{w∈S} |L(a, w)| ≤ sup_{w∈S} | L̄(a, w) | + ‖h − ψ‖_{L∞} ‖A‖².    (III.13)

Fig. 1. Plots of the functions h(t), ψ(t) and ζ_{σ²}(t) over the real line with σ² = 0.51. The function ψ(t) is discontinuous at 0 and cannot be uniformly approximated by h(t); on the other hand, h(t) serves as a good approximation of the weighted phase ζ_{σ²}(t)ψ(t).

Note that the function h is not exactly ψ, but is generated by convolving ψ with a complex Gaussian pdf: indeed, recoupling is Gaussian smoothing. The Fourier transform of a Gaussian is again a Gaussian; it decays quickly with frequency. So, in order to admit a small approximation error, the target ψ must be smooth. However, in our case the function ψ(t) = exp(−2iφ(t)) is discontinuous at t = 0; it changes extremely rapidly in the vicinity of t = 0, and hence its Fourier transform (appropriately defined) does not decay quickly at all. Therefore, the term L(a, w) is a poor target for approximation by the smooth function L̄(a, w) = E_δ[Q^L_dec(g¹, g², w)]. From Figure 1, the difference between h and ψ increases as |t| → 0. The poor approximation error ‖ψ − h‖_{L∞} = 1 results in a trivial bound for sup_{w∈S} |L(a, w)|, instead of the desired bound (III.7).

d) Decoupling and convolutional phase retrieval: To reduce the approximation error caused by the nonsmoothness of ψ at t = 0, we smooth ψ. More specifically, we introduce the weighted objective (I.3) with Gaussian weighting b = ζ_{σ²}(y) in place of (I.2), replacing the analysis target T_2 with its weighted analogue defined through diag(b^{1/2}) ((Aw) ⊙ exp(−iφ(Ax))).
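The effect of the weighting can be seen numerically. The sketch below is our own Monte Carlo check (with σ² = 0.51, as in the paper): it estimates the Gaussian smoothing h(t) = E_{s∼CN(0,1)}[ψ(t + s)] on a real grid and compares the uniform error against the raw phase ψ versus the weighted phase ζ_{σ²}ψ.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.51                                  # same sigma^2 as in the paper

def psi(u):                                    # psi(t) = exp(-2i * phase(t))
    return np.exp(-2j * np.angle(u))

def zeta(t):                                   # zeta_{sigma^2}(t) = 1 - exp(-|t|^2 / (2 sigma^2))
    return 1.0 - np.exp(-np.abs(t) ** 2 / (2 * sigma2))

# Monte Carlo estimate of the smoothing h(t) = E_{s ~ CN(0,1)}[psi(t + s)].
s = (rng.standard_normal(200_000) + 1j * rng.standard_normal(200_000)) / np.sqrt(2)
t_grid = np.linspace(0.05, 4.0, 80)            # a real slice of the complex plane
h = np.array([psi(t + s).mean() for t in t_grid])

err_plain = np.max(np.abs(h - psi(t_grid)))                    # target psi: error ~ 1 near t = 0
err_weighted = np.max(np.abs(h - zeta(t_grid) * psi(t_grid)))  # target zeta * psi: much smaller
print(err_plain, err_weighted)
```

Near t = 0 the raw target ψ jumps while h vanishes, so the unweighted error saturates at about 1; the weighting ζ_{σ²} kills ψ near the discontinuity, which is exactly the mechanism exploited in the analysis below.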
Consequently, we obtain a smoothed variant L_s(a, w) of L(a, w),

    L_s(a, w) = w^* A^* diag( ζ_{σ²}(y) ⊙ ψ(Ax) ) A w.

Similar to (III.13), we obtain
    sup_{w∈S} |L_s(a, w)| ≤ sup_{w∈S} | L̄_s(a, w) | + ‖ h(t) − ζ_{σ²}(t)ψ(t) ‖_{L∞} ‖A‖².

Now the approximation error ‖h − ψ‖_{L∞} in (III.13) is replaced by ‖h(t) − ζ_{σ²}(t)ψ(t)‖_{L∞}. As observed in Figure 1, the function ζ_{σ²}(t) smoothes ψ(t), especially in the vicinity of t = 0, so that the new approximation error ‖h(t) − ζ_{σ²}(t)ψ(t)‖_{L∞} is significantly reduced. Thus, by using similar ideas to those above, we can prove the nontrivial bound

    sup_{w∈S} |L_s(a, w)| < (1 − η_s) m,

for some η_s ∈ (0, 1), which is sufficient for showing iterative contraction. Finally, because of the weighting b = ζ_{σ²}(y), the overall analysis needs to be slightly modified accordingly. For a more detailed analysis, we refer the reader to Section VI.

IV. EXPERIMENTS

In this section, we conduct experiments on both synthetic and real datasets to demonstrate the effectiveness of the proposed method.

A. Experiments on Synthetic Datasets

Fig. 2. Phase transition for signals x ∈ CS^{n−1} with different signal patterns. We fix n = 1000 and generate x with different patterns. We vary the ratio m/n.

a) Dependence of sample complexity on C_x: First, we investigate the dependence of the sample complexity m on C_x. We assume the ground truth x ∈ CS^{n−1}, and consider three cases:
• x = e_1, with e_1 the standard basis vector, so that C_x = 1;
• x generated uniformly at random on the complex sphere CS^{n−1};
• x = (1/√n) 1, so that C_x = √n.
For each case, we fix the signal length n = 1000 and vary the ratio m/n. For each ratio m/n, we randomly generate the kernel a ∼ CN(0, I) in (I.1) and repeat the experiment 100 times. We initialize the algorithm by the spectral method in Algorithm 1 and run the gradient descent (II.1). Given the algorithm output x̂, we declare success of recovery if

    inf_{φ∈[0,2π)} ‖ x̂ − x e^{iφ} ‖ ≤ ε,    (IV.1)

where ε = 10⁻⁵. From Figure 2, for the case when C_x = O(1), the number of measurements needed is far less than Theorem 3.1 suggests. Bridging this gap between practice and theory is left for future work. Another observation is that the larger C_x is, the more samples are needed for successful recovery. One possibility is that the sample complexity truly depends on C_x; another is that the extra logarithmic factors in our analysis are necessary for worst-case (here, spectrally sparse) inputs.

b) Necessity of initialization: As shown in [38], [39], for phase retrieval with generic measurements, when the sample complexity satisfies m ≥ Ω(n log n), with high probability the landscape of the nonconvex objective (I.4) is benign enough to enable initialization-free global optimization. This raises the interesting question of whether spectral initialization is necessary for the random convolutional model. We consider a setting similar to the previous experiment, where the ground truth x ∈ Cⁿ is drawn uniformly at random from CS^{n−1}. We fix the dimension n = 1000 and vary the ratio m/n. For each ratio, we randomly generate the kernel a ∼ CN(0, I) in (I.1) and repeat the experiment 100 times. For each instance, we start the algorithm from random and spectral initializations, respectively. We choose the stepsize via backtracking line search, and terminate the experiment either when the number of iterations exceeds 2 × 10⁴ or when the distance of the iterate to the solution is smaller than 1 × 10⁻⁵. As we can see from Figure 3, the number of samples required for successful recovery with random initialization is only slightly larger than with spectral initialization. This suggests that the requirement of spectral initialization is an artifact of our analysis. For convolutional phase retrieval, the result in [40] shows some promise for analyzing global convergence of gradient methods with random initialization.

c) Effects of the weighting b: Although the weighting b in (III.1) introduced in Theorem 3.1 is mainly for analysis, here we investigate its effectiveness in practice. We consider the same three cases for x as before. For each case, we fix the signal length n = 100 and vary the ratio m/n. For each ratio m/n, we randomly generate the kernel a ∼ CN(0, I) in (I.1) and repeat the experiment 100 times. We initialize the algorithm by the spectral method in Algorithm 1 and run the gradient descent (II.1) with weighting b = 1 and with b as in (III.1), respectively. We declare success of recovery once the error (IV.1) is smaller than 10⁻⁵. From Figure 4, we can see that the sample complexity is slightly larger for b = ζ_{σ²}(y); the benefit of the weighting is mainly the ease of analysis.
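The synthetic pipeline can be sketched end-to-end in a few lines. The following is our own minimal illustration, not the paper's exact configuration: small sizes n = 32, m = 512 instead of n = 1000, the unweighted objective b = 1 with stepsize τ = 1 instead of the weighted b = ζ_{σ²}(y) with τ = 2.02, FFT-based convolutional measurements, spectral initialization by the power method, generalized gradient descent as in (II.1), and the phase-invariant error of (IV.1).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 32, 512                                   # small illustrative sizes

x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x /= np.linalg.norm(x)                           # ground truth on the sphere
a = (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)
fa = np.fft.fft(a)

def A(z):                                        # A z = a (cyclic conv) zero-padded z
    return np.fft.ifft(fa * np.fft.fft(z, m))

def AH(u):                                       # adjoint of A: cross-correlation with a
    return np.fft.ifft(np.conj(fa) * np.fft.fft(u))[:n]

y = np.abs(A(x))                                 # phaseless measurements y = |a * x|

# Spectral initialization: power method on Y = (1/m) A^H diag(y^2) A.
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
for _ in range(100):
    z = AH(y ** 2 * A(z)) / m
    z /= np.linalg.norm(z)
z *= np.sqrt(np.mean(y ** 2))                    # rescale so that ||z|| is close to ||x||

def dist(z):                                     # error up to a global phase, cf. (IV.1)
    return np.linalg.norm(z - x * np.exp(1j * np.angle(np.vdot(x, z))))

d0 = dist(z)
for _ in range(500):                             # generalized gradient descent, b = 1, tau = 1
    u = A(z)
    z = z - AH(u - y * np.exp(1j * np.angle(u))) / m
d1 = dist(z)
print(d0, d1)
```

With this oversampling (m/n = 16) the spectral initializer lands close to the target and gradient descent then contracts the phase-invariant error to well below the 10⁻³ level, mirroring the phase-transition behavior reported above.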
Fig. 3. Phase transition with different initialization schemes. We fix n = 1000 and generate x uniformly at random from CS^{n−1}. We vary the ratio m/n.

d) Comparison with generic random measurements: Another interesting question is, in comparison with a purely random model, how many more samples are needed for the random convolutional model in practice? We investigate this question numerically. We consider the same three cases for x as before, and two random measurement models:

    y₁ = | a ⊛ x |,   y₂ = | A x |,

where a ∼ CN(0, I), and the rows a_k of A are i.i.d. CN(0, I). For each case, we fix the signal length n = 100 and vary the ratio m/n. We repeat the experiment 100 times. We initialize the algorithm by the spectral method in Algorithm 1 for both models, and run gradient descent (II.1). We declare success of recovery once the error (IV.1) is smaller than 10⁻⁵. From Figure 5, we can see that when x is typical (e.g., x = e₁, or x generated uniformly at random from CS^{n−1}), under the same settings the number of samples needed by the two random models is almost the same. However, when x is Fourier sparse (e.g., x = (1/√n) 1), more samples are required for the random convolution model.

B. Experiments on Real Problems

a) Experiments on real antenna data for 5G communication: We demonstrate the effectiveness of the proposed method on a problem arising in 5G communication, as mentioned in the introduction. Figure 6 (left) shows an antenna pattern a ∈ C³⁶¹ obtained from Bell Labs. We observe the modulus of the convolution of this pattern with the signal of interest. For three different types of signals of length n = 20, (1) x = e₁, (2) x generated uniformly at random from CS^{n−1}, and (3) x = (1/√n) 1, our results in Figure 6 (right) show that we achieve almost perfect recovery.

b) Experiments on real images: Finally, we run experiments on real images to demonstrate the effectiveness and efficiency of the proposed method. We use m = 5 n log n samples for reconstruction. The kernel a ∈ C^m is randomly generated as complex Gaussian CN(0, I). We run the power method for 100 iterations for initialization, and stop the algorithm once the error is smaller than 1 × 10⁻⁴. We first test the proposed method on a gray 468 × 1228 electron microscopy image. As shown in Figure 7, the gradient descent method with spectral initialization converges to the target solution in around 64 iterations. Second, we test our method on a color image of size 200 × 300, as shown in Figure 8; it takes 197.08 s to reconstruct all the RGB channels. In contrast, methods using generic Gaussian measurements A ∈ C^{m×n} could easily run out of memory on a personal computer for problems of this size.

V. DISCUSSION AND FUTURE WORK

In this work, we showed that, via nonconvex optimization, the phase retrieval problem with random convolutional measurements can be solved to global optimality with m ≥ Ω( (C_x²/‖x‖²) n poly log n ) samples. Our result raises several interesting questions, which we discuss below.

a) Tightening the sample complexity: Our estimate of the sample complexity is tight only up to logarithmic factors: there is a substantial gap between our theory and practice in the dependence on logarithmic factors. We believe the high-order dependence on logarithmic factors is an artifact of our analysis. In particular, our analysis in Section VI-D is based on RIP results for partial random circulant matrices, which are in no way tight. We believe that, by using more advanced tools from probability, the sample complexity can be tightened to at least m ≥ Ω(n log⁶ n).

b) Geometric analysis and global results: Our convergence analysis is based on showing iterative contraction of the gradient descent method. However, it would be interesting if we could characterize the function landscape of the nonconvex objective as in [38]. Such a result would provide a better explanation of why gradient descent works, and would help design more efficient algorithms. The major difficulty we encountered is the lack of probability tools for analyzing the random convolutional model: because of the nonhomogeneity of ‖C_z‖ over the sphere, it is hard to tightly uniformize quantities of random convolutional matrices over the complex sphere CS^{n−1}. Our preliminary analysis results in suboptimal bounds on the sample complexity.

c) Tools for analyzing other structured nonconvex problems: This work is part of a recent surge of research efforts on deriving provable and practical nonconvex algorithms for central problems in modern signal processing and machine learning [42], [44], [45], [61]–[81]. We believe the probability tools of decoupling and measure concentration developed here can form a solid foundation for studying other nonconvex problems under the random convolutional model, including blind calibration [82]–[84], sparse blind deconvolution [16], [78], [85]–[94], and convolutional dictionary learning [17], [18], [95]–[97].
Fig. 4. Phase transition for different signal patterns with weightings b. We fix n = 1000 and vary the ratio m/n.
Fig. 5. Phase transition of random convolution model vs. i.i.d. random model. We fix n = 1000 and vary the ratio m/n.
Fig. 6. Experiments on real antenna filter for 5G communication.
VI. PROOFS OF TECHNICAL RESULTS

In this section, we provide the detailed proof of Theorem 3.1. The section is organized as follows. In Section VI-A, we show that the initialization produced by Algorithm 1 is close to the optimal solution. In Section VI-B, we sketch the proof of our main result, i.e., Theorem 3.1, with some key details provided in Section VI-C. All other supporting results are provided subsequently. We give detailed proofs of two key supporting lemmas in Section VI-D and Section VI-E, respectively. Finally, other supporting lemmas
are postponed to the appendices: (i) in Appendix A, we introduce elementary tools and results that are useful throughout the analysis; (ii) in Appendix B, we provide results bounding the suprema of chaos processes for random circulant matrices; (iii) in Appendix C, we provide concentration results for suprema of certain dependent random processes via decoupling.

Fig. 7. Experiment on a gray 468 × 1228 electron microscopy image.

Fig. 8. Experiment on colored images.

A. Spectral Initialization

Proposition 6.1: Suppose z₀ is produced by Algorithm 1. Given a fixed scalar δ > 0, whenever m ≥ C δ⁻² n log⁷ n, we have

    dist²(z₀, X) ≤ δ ‖x‖²

with probability at least 1 − c₁ m^{−c₂}.

The proof is similar to that of [7]. However, the proof in [7] only holds for generic random measurements; our proof here is tailored to random circulant matrices. We sketch the main ideas of the proof below; the more detailed analysis of concentration for random circulant matrices is deferred to Appendix B and Appendix C.
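The proposition rests on the concentration of Y around E[Y] = xx^* + ‖x‖²I. This can be checked numerically; the sketch below uses our own sizes (a heavily oversampled regime, m/n = 512), builds the circulant measurement matrix explicitly, and reports the spectral deviation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 16, 8192                                   # our illustrative sizes

x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x /= np.linalg.norm(x)
a = (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)

# Rows of A are length-n windows of the circulant matrix generated by a.
A = np.stack([np.roll(a, k)[:n] for k in range(m)])
y = np.abs(A @ x)

Y = (A.conj().T * y ** 2) @ A / m                 # Y = (1/m) sum_k y_k^2 a_k a_k^*
dev = np.linalg.norm(Y - (np.outer(x, x.conj()) + np.eye(n)), 2)
print(dev)
```

The spectral deviation is small at this oversampling, so the leading eigenvector of Y is close to x (up to phase), which is what the argument below exploits.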
Proof: Without loss of generality, we assume that ‖x‖ = 1. Let z̃₀ be the leading eigenvector of

    Y = (1/m) Σ_{k=1}^m |a_k^* x|² a_k a_k^*,

with ‖z̃₀‖ = 1, and let σ₁ be the corresponding eigenvalue. We have

    dist(z₀, X) ≤ ‖z₀ − z̃₀‖ + dist(z̃₀, X).

First, since z₀ = λ z̃₀, we have ‖z₀ − z̃₀‖ = |λ − 1|. By Theorem B.1 in Appendix B, for any ε > 0, whenever m ≥ C ε⁻² n log⁴ n, we know that

    |λ − 1| ≤ |λ² − 1| = | (1/m) Σ_{k=1}^m |a_k^* x|² − 1 | ≤ ε/2    (VI.1)

with probability at least 1 − 2 m^{−c log³ n}, where c, C > 0 are numerical constants. On the other hand, we have

    dist²(z̃₀, X) = min_{θ∈[0,2π)} ‖ z̃₀ − x e^{iθ} ‖² = 2 − 2 |x^* z̃₀|.

Theorem C.1 in Appendix C implies that for any δ > 0, whenever m ≥ C′ δ⁻² n log⁷ n,

    ‖ Y − ( x x^* + ‖x‖² I ) ‖ ≤ δ

with probability at least 1 − 2 m^{−c₁}, where c₁ > 0 is a numerical constant. This further implies that

    | z̃₀^* Y z̃₀ − |z̃₀^* x|² − 1 | ≤ δ,

so that

    |z̃₀^* x|² ≥ σ₁ − 1 − δ,

where σ₁ is the top eigenvalue of Y. Since σ₁ is the top eigenvalue, we have

    σ₁ ≥ x^* Y x = x^* ( Y − x x^* − ‖x‖² I ) x + 2 ≥ 2 − δ.

Thus, for δ ∈ (0, 1/2), using √(1 − 2δ) ≥ 1 − 2δ, we obtain

    dist²(z̃₀, X) ≤ 2 − 2 √(1 − 2δ) ≤ 4δ.    (VI.2)

Choose δ = ε²/16. Combining the results in (VI.1) and (VI.2), we obtain that

    dist(z₀, X) ≤ ‖z₀ − z̃₀‖ + dist(z̃₀, X) ≤ ε

holds with high probability.

B. Proof of Main Result

In this section, we prove Theorem 3.1. Without loss of generality, we assume ‖x‖ = 1 for the rest of the section. Given the function

    f(z) = (1/2m) ‖ b^{1/2} ⊙ ( y − |Az| ) ‖²,

we show that the simple generalized gradient descent

    z′ = z − τ (∂/∂z̄) f(z),    (VI.3)
    (∂/∂z̄) f(z) = (1/m) A^* diag(b) [ Az − y ⊙ exp(iφ(Az)) ],    (VI.4)

with spectral initialization converges linearly to the target solution. We restate our main result below.

Theorem 6.2 (Main Result): Whenever m ≥ C₀ n log³¹ n, Algorithm 1 produces an initialization z⁽⁰⁾ that satisfies

    dist( z⁽⁰⁾, X ) ≤ c₀ log⁻⁶ n ‖x‖

with probability at least 1 − c₁ m^{−c₂}. Suppose b = ζ_{σ²}(y), where

    ζ_{σ²}(t) = 1 − 2πσ² ξ_{σ²}(t),    (VI.5)
    ξ_{σ²}(t) = (1/(2πσ²)) exp( −|t|²/(2σ²) ),    (VI.6)

with σ² > 1/2. Starting from z⁽⁰⁾, with σ² = 0.51 and stepsize τ = 2.02, whenever m ≥ C₁ (C_x²/‖x‖²) max{ log¹⁷ n, n log⁴ n }, with probability at least 1 − c₃ m^{−c₄}, for all iterates z⁽ʳ⁾ (r ≥ 1) defined in (VI.3) we have

    dist( z⁽ʳ⁾, X ) ≤ (1 − ϱ)^r dist( z⁽⁰⁾, X ),

for some small numerical constant ϱ ∈ (0, 1).

Our proof critically depends on the following result, which shows that, with high probability, for every z ∈ Cⁿ close enough to the optimal set X, the iterate produced by (VI.3) is a contraction.

Proposition 6.3 (Iterative Contraction): Let σ² = 0.51 and τ = 2.02. There exist positive constants c₁, c₂, c₃ and C such that whenever m ≥ C (C_x²/‖x‖²) max{ log¹⁷ n, n log⁴ n }, with probability at least 1 − c₁ m^{−c₂}, for every z ∈ Cⁿ satisfying dist(z, X) ≤ c₃ log⁻⁶ n ‖x‖ we have

    dist( z − τ (∂/∂z̄) f(z), X ) ≤ (1 − ϱ) dist(z, X)

for some small constant ϱ ∈ (0, 1). Here, (∂/∂z̄) f(z) is defined in (VI.4).

To prove this proposition, let us first define

    M = M(a) := ((2σ² + 1)/m) A^* diag( ζ_{σ²}(y) ) A,    (VI.7)
    H = H(a) := P_{x⊥} M(a) P_{x⊥},    (VI.8)
and introduce

    h(t) := E_{s∼CN(0,1)}[ ψ(t + s) ].    (VI.9)

Given a scalar ε > 0 and σ² > 1/2, let us introduce the quantity

    Δ∞(ε) := ‖ √((1 + 2σ²)(1 + ε)) E_{s∼CN(0,1)}[ψ(t + s)] − ζ_{σ²}(t) ψ(t) ‖_{L∞},    (VI.10)

where ψ(t) = exp(−2iφ(t)) and ζ_{σ²} is defined in (VI.5). We sketch the main idea of the proof below; more detailed analysis is postponed to Sections VI-C, VI-D and VI-E.

Proof [Proof of Proposition 6.3]: By (VI.3) and (VI.4), and with the choice of stepsize τ = 2σ² + 1, we have

    z′ = z − ((2σ² + 1)/m) A^* diag(ζ_{σ²}(y)) [ Az − y ⊙ exp(iφ(Az)) ]
       = z − M z + ((2σ² + 1)/m) A^* diag(ζ_{σ²}(y)) [ y ⊙ exp(iφ(Az)) ].

For any z ∈ Cⁿ, let us decompose z as

    z = α x + β w,    (VI.11)

where α, β ∈ C, and w ∈ CS^{n−1} with w ⊥ x, and α = |α| e^{iφ(α)} with the phase φ(α) of α satisfying e^{iφ(α)} = x^* z / |x^* z|. Therefore, if we let

    θ = argmin_{θ∈[0,2π)} ‖ z − x e^{iθ} ‖,    (VI.12)

then we also have φ(α) = θ. Thus, by using the results above, we observe

    dist²(z′, X) = min_{θ} ‖ z′ − x e^{iθ} ‖² ≤ ‖ z′ − e^{iθ} x ‖² ≤ ‖P_{x⊥} d‖² + ‖P_x d‖²,

where we define

    d(z) := (I − M)( z − e^{iθ} x ) − e^{iθ} M x + ((2σ² + 1)/m) A^* diag(ζ_{σ²}(y)) [ y ⊙ exp(iφ(Az)) ].    (VI.13)

Let δ > 0. By Lemma 6.7 and Lemma 6.8, whenever m ≥ C₂ max{ C_x² log¹⁷ n, δ⁻² n log⁴ n }, with probability at least 1 − c₁ m^{−c₂}, for all z ∈ Cⁿ such that ‖z − xe^{iθ}‖ ≤ c₃ δ³ log⁻⁶ n, we have

    ‖P_{x⊥} d‖ ≤ ‖z − xe^{iθ}‖ [ δ + (1 + δ) √( (2σ² + 1) · 4δ/ρ ) + (1/(1 − ρ)) (1/(1 − δ)) √( (1 + (2 + ε)δ + (1 + δ)Δ∞(ε)) / 2 ) ],
    ‖P_x d‖ ≤ ( 2σ²/(1 + 2σ²) + c_{σ²} δ ) ‖z − xe^{iθ}‖,

for any ρ ∈ (0, 1), where Δ∞(ε) is defined in (VI.10) with ε ∈ (0, 1). Here, c_{σ²} is a numerical constant only depending on σ². With ε = 0.2 and σ² = 0.51, Lemma 6.7 implies that Δ∞(ε) ≤ 0.404. Thus, we have

    ‖P_{x⊥} d‖ ≤ ‖z − xe^{iθ}‖ [ δ + (1 + δ) √(5.686 δ/ρ) + (1/(1 − ρ)) (1/(1 − δ)) √( (1 + 2.2δ + 0.404(1 + δ)) / 2 ) ],
    ‖P_x d‖ ≤ (0.505 + c_{σ²} δ) ‖z − xe^{iθ}‖.

By choosing the constants δ and ρ sufficiently small, direct calculation reveals that

    dist²(z′, X) ≤ ‖P_{x⊥} d‖² + ‖P_x d‖² ≤ 0.96 ‖z − xe^{iθ}‖² = 0.96 dist²(z, X),

as desired.

Now with Proposition 6.3 in hand, we are ready to prove Theorem 6.2 (in other words, Theorem 3.1).

Proof [Proof of Theorem 6.2]: We prove the theorem by recursion. Let us assume that the properties in Proposition 6.3 hold, which happens on an event E with probability at least 1 − c₁ m^{−c₂} for some numerical constants c₁, c₂ > 0. By Proposition 6.1 in Section VI-A, for any numerical constant δ > 0, whenever m ≥ C δ⁻¹² n log³¹ n, the initialization z⁽⁰⁾ produced by Algorithm 1 satisfies

    dist( z⁽⁰⁾, X ) ≤ c₃ δ³ log⁻⁶ n ‖x‖

with probability at least 1 − c₄ m^{−c₅}. Therefore, conditioned on the event E, we know that

    dist( z⁽¹⁾, X ) = dist( z⁽⁰⁾ − τ (∂/∂z̄) f(z⁽⁰⁾), X ) ≤ (1 − ϱ) dist( z⁽⁰⁾, X )

holds for some small constant ϱ ∈ (0, 1). This proves (III.2) for the first iterate z⁽¹⁾. Notice that the inequality above also implies that dist( z⁽¹⁾, X ) ≤ c₃ δ³ log⁻⁶ n ‖x‖. Therefore, by reapplying the same reasoning, we can prove (III.2) for the iterates r = 2, 3, ⋯.

C. Bounding ‖P_{x⊥} d(z)‖ and ‖P_x d(z)‖

Let d(z) be defined as in (VI.13), and assume that ‖x‖ = 1. In this section, we provide bounds for ‖P_x d‖ and ‖P_{x⊥} d‖ under the condition that z and x are close. Before presenting the main results, let us first introduce some useful preliminary lemmas. First, based on the decomposition of z in (VI.11) and the definition of θ in (VI.12), we can show the following result.

Lemma 6.4: Let θ = argmin_{θ∈[0,2π)} ‖z − xe^{iθ}‖, and suppose dist(z, x) = ‖z − xe^{iθ}‖ ≤ ε₀ for some ε₀ ∈ (0, 1). Then we have

    |β/α| ≤ (1/(1 − ε₀)) ‖z − xe^{iθ}‖.
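The decomposition (VI.11) and Lemma 6.4 are deterministic and easy to verify numerically. The sketch below (our own small instance) checks the exact identity ‖z − xe^{iθ}‖² = (|α| − 1)² + |β|² at θ = φ(α), together with the bound |β/α| ≤ dist/(1 − dist).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x /= np.linalg.norm(x)
z = x + 0.1 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

alpha = np.vdot(x, z)                    # alpha = x^* z
r = z - alpha * x                        # residual orthogonal to x
beta = np.linalg.norm(r)
w = r / beta                             # z = alpha x + beta w, with w unit and w ⊥ x

theta = np.angle(alpha)                  # the optimal phase is phi(alpha)
d = np.linalg.norm(z - x * np.exp(1j * theta))
lhs = d ** 2
rhs = (np.abs(alpha) - 1) ** 2 + beta ** 2
ratio = np.abs(beta / alpha)             # Lemma 6.4: ratio <= d / (1 - d)
print(lhs, rhs, ratio)
```

The identity holds to machine precision, and the ratio bound follows from |β| ≤ d and |α| ≥ 1 − d exactly as in the proof below.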
Proof: Given the facts in (VI.11) and (VI.12) that z = αx + βw with w ∈ CS^{n−1}, w ⊥ x, and φ(α) = θ, we have

    ‖z − xe^{iθ}‖² = (|α| − 1)² + |β|².

This implies that

    |β| ≤ ‖z − xe^{iθ}‖,   |α| ≥ 1 − ‖z − xe^{iθ}‖
    ⟹ |β/α| ≤ ‖z − xe^{iθ}‖ / (1 − ‖z − xe^{iθ}‖) ≤ (1/(1 − ε₀)) ‖z − xe^{iθ}‖,

as desired.

On the other hand, our proof also critically depends on the concentration of M(a) in Theorem C.4 of Appendix C, and on the following lemmas, whose detailed proofs are given in Sections VI-D and VI-E.

Lemma 6.5: For any given scalar δ ∈ (0, 1), let γ = c₀ δ³ log⁻⁶ n. Whenever m ≥ C max{ C_x² log¹⁷ n, δ⁻² n log⁴ n }, with probability at least 1 − c₁ m^{−c₂}, for all w with ‖w‖ ≤ γ ‖x‖ we have the inequality

    ‖ |Ax| ⊙ 1_{|Aw| ≥ |Ax|} ‖ ≤ δ √m ‖w‖.

Lemma 6.6: For any scalar δ ∈ (0, 1), whenever m ≥ C C_x² δ⁻² n log⁴ n, with probability at least 1 − c m^{−c log³ n}, for all w ∈ Cⁿ with w ⊥ x we have

    ((2σ² + 1)/m) ‖ diag(ζ_{σ²}(y))^{1/2} ℑ( (Aw) ⊙ e^{−iφ(Ax)} ) ‖² ≤ ( (1 + (2 + ε)δ + (1 + δ)Δ∞(ε)) / 2 ) ‖w‖².

Here, Δ∞(ε) is defined in (VI.10) for any scalar ε ∈ (0, 1). In particular, when σ² = 0.51 and ε = 0.2, we have Δ∞(ε) ≤ 0.404, and with the same probability, for all w ∈ Cⁿ with w ⊥ x,

    ((2σ² + 1)/m) ‖ diag(ζ_{σ²}(y))^{1/2} ℑ( (Aw) ⊙ e^{−iφ(Ax)} ) ‖² ≤ ( (1 + 2.2δ + 0.404(1 + δ)) / 2 ) ‖w‖².

1) Bounding the "x-perpendicular" term ‖P_{x⊥} d‖:

Lemma 6.7: Let d be defined in (VI.13), and let σ² > 1/2 be a constant. For any δ > 0, whenever m ≥ C C_x² max{ log¹⁷ n, δ⁻² n log⁴ n }, with probability at least 1 − c₁ m^{−c₂}, for all z ∈ Cⁿ such that ‖z − xe^{iθ}‖ ≤ c₃ δ³ log⁻⁶ n, we have

    ‖P_{x⊥} d‖ ≤ ‖z − xe^{iθ}‖ [ δ + (1 + δ) √( (2σ² + 1) · 4δ/ρ ) + (1/(1 − ρ)) (1/(1 − δ)) √( (1 + 2δ + (1 + δ)Δ∞(ε)) / 2 ) ],

for any ρ ∈ (0, 1). Here, Δ∞(ε) is defined in (VI.10) for any scalar ε ∈ (0, 1). In particular, when ε = 0.2 and σ² = 0.51, we have Δ∞(ε) ≤ 0.404.

The analysis bounding ‖P_{x⊥} d‖ is similar to that of [11].

Proof: Let A_y = A^* diag(ζ_{σ²}(y)). By the definition (VI.13) of d(z), notice that

    ‖P_{x⊥} d‖ ≤ ‖ P_{x⊥} (I − M)( z − e^{iθ} x ) ‖ + ‖ ((2σ² + 1)/m) P_{x⊥} A_y ( y ⊙ e^{iφ(Az)} ) − e^{iθ} P_{x⊥} M x ‖.

For the first term, by (C.9) in Theorem C.4, for any δ > 0, whenever m ≥ C₁ δ⁻² C_x² n log⁴ n, we have

    ‖ P_{x⊥} (I − M) ‖ ≤ δ    (VI.14)

with probability at least 1 − c₁ m^{−c₂ log³ n}. For the second term, we observe

    ‖ ((2σ² + 1)/m) P_{x⊥} A_y ( y ⊙ e^{iφ(Az)} ) − e^{iθ} P_{x⊥} M x ‖
    = ((2σ² + 1)/m) ‖ P_{x⊥} A_y ( |Ax| ⊙ ( e^{iφ(Az)} − e^{iθ + iφ(Ax)} ) ) ‖
    ≤ √((2σ² + 1)/m) ‖ P_{x⊥} A^* diag(ζ_{σ²}(y))^{1/2} ‖ · √((2σ² + 1)/m) ‖ diag(ζ_{σ²}(y))^{1/2} ( |Ax| ⊙ ( e^{iφ(Az)} − e^{iθ + iφ(Ax)} ) ) ‖.

By (C.9) in Theorem C.4 and Lemma C.10 in Appendix C, for any δ > 0, whenever m ≥ C₁ δ⁻² C_x² n log⁴ n, we have

    √((2σ² + 1)/m) ‖ P_{x⊥} A^* diag(ζ_{σ²}(y))^{1/2} ‖ ≤ ‖H‖^{1/2} ≤ ( ‖E[H]‖ + ‖H − E[H]‖ )^{1/2} ≤ (1 + δ)^{1/2} ≤ 1 + δ,

with probability at least 1 − c₁ m^{−c₂ log³ n}. And by Lemma A.1 and the decomposition of z in (VI.11) with φ(α) = θ, with the same probability we obtain

    √((2σ² + 1)/m) ‖ diag(ζ_{σ²}(y))^{1/2} ( |Ax| ⊙ ( e^{iφ(Az)} − e^{iθ + iφ(Ax)} ) ) ‖
    = √((2σ² + 1)/m) ‖ diag(ζ_{σ²}(y))^{1/2} ( |Ax| ⊙ ( e^{iφ(Ax + (β/α) Aw)} − e^{iφ(Ax)} ) ) ‖
    ≤ √((2σ² + 1)/m) ‖ 2 |Ax| ⊙ 1_{|β/α| |Aw| ≥ ρ |Ax|} ‖ + (1/(1 − ρ)) |β/α| √((2σ² + 1)/m) ‖ diag(ζ_{σ²}(y))^{1/2} ℑ( (Aw) ⊙ e^{−iφ(Ax)} ) ‖,

for any ρ ∈ (0, 1). By Lemma 6.4, we know that ρ⁻¹ |β/α| ≤ 2ρ⁻¹ ‖z − xe^{iθ}‖ =: c_ρ ‖z − xe^{iθ}‖, where c_ρ is a constant depending on ρ. Thus, whenever m ≥ C₂ max{ C_x² log¹⁷ n, δ⁻² n log⁴ n } for any δ ∈ (0, 1), with probability at least 1 − c₁ m^{−c₂}, for all w ∈ CS^{n−1}, Lemma 6.5 implies that

    ‖ |Ax| ⊙ 1_{|β/α| |Aw| ≥ ρ |Ax|} ‖ ≤ (δ/ρ) √m |β/α| ≤ (2δ/ρ) √m ‖z − xe^{iθ}‖.

Moreover, for any δ ∈ (0, 1), whenever m ≥ C₃ C_x² n log⁴ n, with probability at least 1 − c₃ m^{−c₄ log³ n}, for all w ∈ CS^{n−1} with w ⊥ x, Lemma 6.6 implies that

    √((2σ² + 1)/m) ‖ diag(ζ_{σ²}(y))^{1/2} ℑ( (Aw) ⊙ e^{−iφ(Ax)} ) ‖ ≤ √( (1 + 2δ + (1 + δ)Δ∞(ε)) / 2 ),

where Δ∞(ε) is defined in (VI.10) for some ε ∈ (0, 1). In addition, whenever ‖z − xe^{iθ}‖ ≤ c₅ δ³ log⁻⁶ n for some constant c₅ > 0, Lemma 6.4 implies that

    |β/α| ≤ (1/(1 − c₅ δ³ log⁻⁶ n)) ‖z − xe^{iθ}‖ ≤ (1/(1 − δ)) ‖z − xe^{iθ}‖,

for δ > 0 sufficiently small. Thus, combining the results above, we have the bound

    ‖P_{x⊥} d‖ ≤ ‖z − xe^{iθ}‖ [ δ + (1 + δ) √( (2σ² + 1) · 4δ/ρ ) + (1/(1 − ρ)) (1/(1 − δ)) √( (1 + 2δ + (1 + δ)Δ∞(ε)) / 2 ) ],

as desired. Finally, when σ² = 0.51 and ε = 0.2, the bound on Δ∞(ε) can be found in Lemma 6.15 in Section VI-E.

2) Bounding the "x-parallel" term ‖P_x d‖:

Lemma 6.8: Let d(z) be defined in (VI.13), and let σ² > 1/2 be a constant. For any δ > 0, whenever m ≥ C C_x² max{ log¹⁷ n, δ⁻² n log⁴ n }, with probability at least 1 − c₁ m^{−c₂}, for all z such that ‖z − xe^{iθ}‖ ≤ c₃ δ³ log⁻⁶ n, we have

    ‖P_x d‖ ≤ ( 2σ²/(1 + 2σ²) + c_{σ²} δ ) ‖z − xe^{iθ}‖.

Here, c_{σ²} > 0 is some numerical constant depending only on σ².

Proof: Let A_y = A^* diag(ζ_{σ²}(y)). Given the decomposition of z in (VI.11) with w ⊥ x and φ(α) = θ, and by the definition of d(z) in (VI.13), we observe

    ‖P_x d‖ = ‖ x^* (I − M)( z − e^{iθ} x ) − e^{iθ} x^* M x + ((2σ² + 1)/m) x^* A_y ( y ⊙ e^{iφ(Az)} ) ‖
    ≤ |1 − x^* E[M] x| ||α| − 1|  +  ‖M − E[M]‖ ‖z − xe^{iθ}‖  +  ((2σ² + 1)/m) | x^* A_y ( (Ax) ⊙ ( e^{iφ(Az) − iφ(Ax)} − e^{iθ} 1 ) ) |
    =: T₁ + ‖M − E[M]‖ ‖z − xe^{iθ}‖ + T₂,

where for the second inequality we used Lemma C.10, which gives x^* (I − E[M]) w = 0. For the first term T₁, notice that

    ‖z − xe^{iθ}‖² = ||α| − 1|² + |β|² ≥ ||α| − 1|²,

and by using the fact that E[M] = I + (2σ²/(1 + 2σ²)) x x^* from Lemma C.10, we have

    T₁ = (2σ²/(1 + 2σ²)) ||α| − 1| ≤ (2σ²/(1 + 2σ²)) ‖z − xe^{iθ}‖.

For the term T₂, using the fact that z = αx + βw and θ = φ(α), and by Lemma A.2, notice that entrywise

    | e^{iφ(Az) − iφ(Ax)} − e^{iθ} 1 − i e^{iθ} ℑ( (β Aw)/(α Ax) ) | = | e^{iφ( 1 + (β Aw)/(α Ax) )} − 1 − i ℑ( (β Aw)/(α Ax) ) | ≤ 6 | (β Aw)/(α Ax) |²,

whenever | (β Aw)/(α Ax) | ≤ 1/2. Thus, by using the result above, we observe

    T₂ ≤ 2 ((2σ² + 1)/m) | x^* A_y ( (Ax) ⊙ 1_{|β/α| |Aw| ≥ |Ax|/2} ) |
       + 6 |β/α|² ((2σ² + 1)/m) w^* A^* diag(ζ_{σ²}(y)) A w
       + (1/2) |β/α| ((2σ² + 1)/m) | x^* A^* diag(ζ_{σ²}(y)) A w |
       + (1/2) |β/α| ((2σ² + 1)/m) | x^* A_y ( e^{2iφ(Ax)} ⊙ conj(Aw) ) |.

Given the fact that x ⊥ w, by Lemma C.10 again we have x^* E[M] w = 0. Thus

    ((2σ² + 1)/m) | x^* A^* diag(ζ_{σ²}(y)) A w | = | x^* M w − x^* E[M] w | ≤ ‖M − E[M]‖,

and similarly, since e^{−2iφ(Ax)} ⊙ (Ax) = conj(Ax), we have

    ((2σ² + 1)/m) | x^* A_y ( e^{2iφ(Ax)} ⊙ conj(Aw) ) | = ((2σ² + 1)/m) | w^* A^* diag(ζ_{σ²}(y)) A x | = ((2σ² + 1)/m) | x^* A^* diag(ζ_{σ²}(y)) A w | ≤ ‖M − E[M]‖.

Thus, supposing ‖z − xe^{iθ}‖ ≤ 1/2, by Lemma 6.4 we know that |β/α| ≤ 2 ‖z − xe^{iθ}‖. Combining the estimates above, we obtain

    T₂ ≤ 2 ((2σ² + 1)/m) ‖A‖ ‖ (Ax) ⊙ 1_{|β/α| |Aw| ≥ |Ax|/2} ‖ + 24 ‖M‖ ‖z − xe^{iθ}‖² + 2 ‖M − E[M]‖ ‖z − xe^{iθ}‖.

Combining the estimates for T₁ and T₂, we have

    ‖P_x d‖ ≤ (2σ²/(1 + 2σ²)) ‖z − xe^{iθ}‖ + 3 ‖M − E[M]‖ ‖z − xe^{iθ}‖ + 2 ((2σ² + 1)/m) ‖A‖ ‖ (Ax) ⊙ 1_{|β/α| |Aw| ≥ |Ax|/2} ‖ + 24 ‖M‖ ‖z − xe^{iθ}‖².

By Theorem C.4, for any δ > 0, whenever m ≥ C₁ δ⁻² C_x² n log⁴ n, we have

    ‖M − E[M]‖ ≤ δ,   ‖M‖ ≤ ‖E[M]‖ + δ = (1 + 4σ²)/(1 + 2σ²) + δ

with probability at least 1 − c₁ m^{−c₂ log³ n}. By Corollary B.2, for any δ ∈ (0, 1), whenever m ≥ C₂ δ⁻² n log⁴ n, we have

    ‖A‖ ≤ (1 + δ) √m

with probability at least 1 − 2 m^{−c₃ log³ n} for some constant c₃ > 0. If (1/2)|β/α| ≤ ‖z − xe^{iθ}‖ ≤ c₄ δ³ log⁻⁶ n, whenever m ≥ C₃ max{ C_x² log¹⁷ n, δ⁻² n log⁴ n }, Lemma 6.5 implies that

    ‖ (Ax) ⊙ 1_{|β/α| |Aw| ≥ |Ax|/2} ‖ ≤ 2δ √m |β/α| ≤ 4δ √m ‖z − xe^{iθ}‖

holds for all w ∈ CS^{n−1}, with probability at least 1 − c₅ m^{−c₆}. Given ‖z − xe^{iθ}‖ ≤ c₄ δ³ log⁻⁶ n, combining the estimates above, we have

    ‖P_x d‖ ≤ ( 2σ²/(1 + 2σ²) + 3δ + 8(1 + δ)δ + 24 ( (1 + 4σ²)/(1 + 2σ²) + δ ) c₄ δ³ log⁻⁶ n ) ‖z − xe^{iθ}‖
            ≤ ( 2σ²/(1 + 2σ²) + c_{σ²} δ ) ‖z − xe^{iθ}‖

for δ sufficiently small. Here, c_{σ²} is some positive numerical constant depending only on σ².

D. Proof of Lemma 6.5

In this section, we prove Lemma 6.5 of Section VI-C, which can be restated as follows.
Lemma 6.9: For any given scalar δ ∈ (0, 1),letγ = − Thus, suppose z − xeiθ ≤ 1 , by using Lemma 6.4 we know c δ3 log 6 n,' whenever ( 2 0 β 2 17 −2 4 ≤ − iθ m ≥ C max Cx log n, δ n log n , with probability that α 2 z xe . Combining the estimates above, −c2 we obtain at least 1−c1m for all w with w≤γ x,wehavethe inequality 2σ2 +1 T ≤ 1 √ 3 2 A Ax | β ||Aw|≤ 1 |Ax| m α 2 Ax 1|Aw|≥|Ax| ≤ δ m w . (VI.15) 2 +24M z − xeiθ We prove this lemma using the results in Lemma 6.10 and iθ +2M − E [M] z − xe . Lemma 6.11. Proof By Corollary B.2, for some small scalar ε ∈ T T Combining the estimates for 1 and 2,wehave (0, 1), whenever m ≥ Cnlog4 n, with probability at least − −c log3 n ≤ P d 1 m for every w with w γ x ,wehave x √ √ 2 ≤ ≤ 2σ iθ iθ Aw (1 + ε) m w (1 + ε)γ m x ≤ z − x +3M − E [M] z − x ! " 1+2σ2 1/2 1+ε 2σ2 +1 ≤ γ Ax≤2γ Ax . 1 1 − ε +2 A Ax | β ||Aw|≤ 1 |Ax| m α 2 2 Let us define a set +24M z − xeiθ . . ∗ ∗ S = {k ||a w|≥|a x|} . By Theorem C.4, for any δ>0, whenever m ≥ k k −2 2 4 S |S| C1δ Cx n log n,wehave By Lemma 6.10, for every set with >ρm(with some ρ ∈ (0, 1) to be chosen later), with probability at least M − E [M]≤δ, ρ4m 1 − exp − 2 ,wehave 2Cx 1+4σ2 M≤E [M] + δ = + δ 1+2σ2 ρ3/2 (Ax) 1S > Ax . 3 32 − −c2 log n holds with probability at least 1 c1m . By Corol- 3/2 −2 4 ρ lary B.2, for any δ ∈ (0, 1), whenever m ≥ C2δ n log n, Choose ρ such that γ = 64 ,wehave we have Aw≥(Aw) 1S ≥(Ax) 1S √ A≤(1 + δ) m > 2γ Ax . Authorized licensed use limited to: New York University. Downloaded on August 26,2020 at 06:19:19 UTC from IEEE Xplore. Restrictions apply. QU et al.: CONVOLUTIONAL PHASE RETRIEVAL VIA GRADIENT DESCENT 1801 −4 This contradicts with the fact that Aw≤2γ Ax. Lemma 6.11: Given any scalar δ>0,letρ ∈ (0,cδlog n) Therefore, whenever w≤γ x, with high probability with cδ be some constant depending on δ, whenever m ≥ 2 we have |S| ≤ ρm holds. 
Given any δ>0, choose γ = Cδ−2n log4 n, with probability at least 1 − 2 m−c log n,for 3 2 3 −6 ρ / any set S∈[m] with |S| <ρmand for all w ∈ Cn,wehave cδ log n for some constant c>0. Because γ = 64 , we know that ρ = c δ2/ log4 n. By Lemma 6.11, whenever √ 2 m ≥ Cδ−2n log4 n, with probability at least 1−2 m−c log n (Aw) 1S ≤δ m w . − for all w ∈ CSn 1,wehave √ Proof Without loss of generality, let us assume that w =1. | |1 ≤| |1 ≤ Ax |Aw|≥|Ax| Aw S δ m w . First, notice that Combining the results above, we complete the proof. Aw 1S =supv, Aw Lemma 6.10: Let ρ ∈ (0, 1) be a positive scalar, with v∈CSm−1, supp(v)⊆S ρ4 m ≤ ∗ probability at least 1−exp − 2 , for every set S∈[m] sup A v . 2Cx −1 v∈CSm , supp(v)⊆S with |S| ≥ ρm,wehave 1 3/2 By Lemma A.11, for any positive scalar δ>0 and any |Ax|1S > ρ Ax . −4 − 4 32 ρ ∈ (0,cδ2 log n), whenever m ≥ Cδ 2n log n, with − −c log2 n To prove this, let us define probability at least 1 m ,wehave ⎧ √ ⎨⎪1 if |u|≤v, sup A∗v≤δ m. 1 v∈CSn−1, supp(v)⊆S gv(u)= (2v −|u|) v<|u|≤2v (VI.16) ⎩⎪ v 0 otherwise, Combining the result above, we complete the proof. for a variable u ∈ C and a fixed positive scalar v ∈ R. Lemma 6.12: For a variable u ∈ C and a fixed positive Proof Let ρ ∈ (0, 1) be a positive scalar, from Lemma 6.12, scalar v ∈ R, the function g (u) introduced in (VI.16) is we know that v 1/v-Lipschitz. Moreover, the following bound Ax ≥ 1| |≤ gρ( ) 1 Ax ρ 1 gv(u) ≥ ½|u|≤v holds uniformly. Thus, for an independent copy a of a, we have holds uniformly for u over the whole space. Proof The proof of Lipschitz continuity of g (u) is straight gρ(Cxa) −gρ(Cxa ) v 1 1√ forward, and the inequality directly follows from the definition ≤ − ≤ m − gρ(Cxa) gρ(Cxa ) 1 Cxa Cxa of gv(u). √ ρ m ≤ C a − a . ρ x E. Proof of Lemma 6.6 Therefore, we can see√ that gρ(Cxa) 1 is L-Lipschitz with m In this section, we prove Lemma 6.6 in Section VI-C, which respect to a, with L = ρ Cx . 
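As an aside, the two properties of the truncation function $g_v$ claimed in Lemma 6.12 — $1/v$-Lipschitzness and domination of the indicator $\mathbf 1_{\{|u|\le v\}}$ — are easy to check numerically. A minimal verification sketch (ours, not from the paper; names are illustrative):

```python
import numpy as np

def g_trunc(u, v):
    """Truncation function g_v(u) of (VI.16): 1 for |u| <= v, linear decay
    (2v - |u|)/v for v < |u| <= 2v, and 0 for |u| > 2v."""
    r = np.abs(u)
    return np.where(r <= v, 1.0, np.where(r <= 2*v, (2*v - r)/v, 0.0))

rng = np.random.default_rng(0)
v = 0.7
u1 = rng.normal(size=200_000) + 1j * rng.normal(size=200_000)
u2 = rng.normal(size=200_000) + 1j * rng.normal(size=200_000)

# 1/v-Lipschitz: |g_v(u1) - g_v(u2)| <= |u1 - u2| / v
lip_ok = np.all(np.abs(g_trunc(u1, v) - g_trunc(u2, v)) <= np.abs(u1 - u2)/v + 1e-12)
# domination of the indicator: g_v(u) >= 1_{|u| <= v}
dom_ok = np.all(g_trunc(u1, v) >= (np.abs(u1) <= v))

print(lip_ok, dom_ok)
```

Both checks pass on random complex inputs, since $g_v$ is a piecewise-linear function of $|u|$ with slopes in $\{0, -1/v\}$ and $\big||u_1|-|u_2|\big| \le |u_1 - u_2|$.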
By the Gaussian concentration inequality in Lemma A.3, we have

$$\mathbb P\left( \left| \|g_\rho(C_x a)\|_1 - \mathbb E\left[\|g_\rho(C_x a)\|_1\right] \right| \ge t \right) \le 2\,e^{-\frac{t^2}{2L^2}}.$$

By using the fact that $\sqrt 2\,|a_k^* x|$ follows the $\chi$-distribution (with two degrees of freedom), we have

$$\mathbb E\left[ \|g_\rho(C_x a)\|_1 \right] \le \mathbb E\left[ \sum_{k=1}^m \mathbf 1_{\{|a_k^* x| \le 2\rho\}} \right] = \sum_{k=1}^m \mathbb P\left( |a_k^* x| \le 2\rho \right) \le \rho m.$$

Thus, with probability at least $1 - 2\exp\left( -\frac{\rho^4 m}{2C_x^2} \right)$, we have

$$\left\| \mathbf 1_{\{|Ax|\le\rho\}} \right\|_1 \le \left\| g_\rho(Ax) \right\|_1 \le 2\rho m.$$

Thus, for any set $\mathcal S$ such that $|\mathcal S| \ge 4\rho m$, we have

$$\left\| (Ax)\odot\mathbf 1_{\mathcal S} \right\|^2 \ge \left\| (Ax)\odot\mathbf 1_{\mathcal S \cap \{|Ax| > \rho\}} \right\|^2 \ge 2\rho^3 m.$$

Thus, by replacing $4\rho$ with $\rho$, we complete the proof.

E. Proof of Lemma 6.6

In this section, we prove Lemma 6.6 in Section VI-C, which can be restated as follows.

Lemma 6.13: For any scalar $\delta\in(0,1)$ and a fixed scalar $\varepsilon>0$, whenever $m\ge C\,C_x^2\,\delta^{-2}\, n\log^4 n$, with probability at least $1-c\,m^{-c\log^3 n}$, for all $w\in\mathbb C^n$ with $w\perp x$, we have

$$\frac{2\sigma^2+1}{m}\left\| \operatorname{diag}\left( \zeta_{\sigma^2}^{1/2}(y) \right)\,\Im\left[ (Aw)\odot e^{-i\phi(Ax)} \right] \right\|^2 \le \frac{1+(2+\varepsilon)\delta+(1+\delta)\Delta_\infty(\varepsilon)}{2}\,\|w\|^2.$$

In particular, when $\sigma^2 = 0.51$ and $\varepsilon = 0.2$, we have $\Delta_\infty(\varepsilon)\le0.404$. With the same probability, for all $w\in\mathbb C^n$ with $w\perp x$, we have

$$\frac{2\sigma^2+1}{m}\left\| \operatorname{diag}\left( \zeta_{\sigma^2}^{1/2}(y) \right)\,\Im\left[ (Aw)\odot e^{-i\phi(Ax)} \right] \right\|^2 \le \frac{1+2.2\delta+0.404(1+\delta)}{2}\,\|w\|^2.$$

Proof: Without loss of generality, let us assume $w\in\mathbb{CS}^{n-1}$ with $w\perp x$.
For any $w\in\mathbb{CS}^{n-1}$ with $w\perp x$ and a given small scalar $\varepsilon>0$, we observe

$$\frac{2\sigma^2+1}{m}\left\| \operatorname{diag}\left(\zeta^{1/2}_{\sigma^2}(y)\right)\,\Im\left[ (Aw)\odot e^{-i\phi(Ax)} \right] \right\|^2 = \frac{2\sigma^2+1}{4m}\left\| \operatorname{diag}\left(\zeta^{1/2}_{\sigma^2}(y)\right)\left[ (Aw)\odot e^{-i\phi(Ax)} - \overline{(Aw)}\odot e^{i\phi(Ax)} \right] \right\|^2$$

$$\le \frac12\left| w^* P_{x^\perp} M P_{x^\perp} w \right| + \frac12\cdot\frac{2\sigma^2+1}{m}\left| \overline w^*\, \overline{P_{x^\perp}}\, A^\top\operatorname{diag}\left( \zeta_{\sigma^2}(Ax)\,\psi(Ax) \right) A\, P_{x^\perp}\, w \right|$$

$$\le \frac12\left[ \left\| \mathbb E[H] \right\| + \left\| H - \mathbb E[H] \right\| \right] + \frac12\cdot\frac{2\sigma^2+1}{m}\left| \overline w^*\, \overline{P_{x^\perp}}\, A^\top\operatorname{diag}\left( \zeta_{\sigma^2}(Ax)\,\psi(Ax) \right) A\, P_{x^\perp}\, w \right|,$$

holding for all $w\in\mathbb{CS}^{n-1}$ with $w\perp x$, where $M$ and $H$ are defined in (VI.7) and (VI.8), and $\psi(t) = t^2/|t|^2$. By Lemma C.10, we know that

$$\left\| \mathbb E[H] \right\| = \left\| P_{x^\perp} \right\| \le 1. \tag{VI.17}$$

By Theorem C.4, we know that for any $\delta>0$, whenever $m\ge C_1\delta^{-2} C_x^2\, n\log^4 n$, we have

$$\left\| H - \mathbb E[H] \right\| \le \delta,$$

with probability at least $1-c_1 m^{-c_2\log^3 n}$. In addition, Lemma 6.14 implies that for any $\delta>0$, when $m\ge C_2\delta^{-2} n\log^4 n$ for some constant $C_2>0$, we have

$$\frac{2\sigma^2+1}{m}\left| \overline w^*\, \overline{P_{x^\perp}}\, A^\top\operatorname{diag}\left( \zeta_{\sigma^2}(Ax)\,\psi(Ax) \right) A\, P_{x^\perp}\, w \right| \le (1+\delta)\,\Delta_\infty(\varepsilon) + (1+\varepsilon)\,\delta$$

with probability at least $1-2m^{-c_3\log^3 n}$ for some constant $c_3>0$. Combining the results above, we obtain

$$\frac{2\sigma^2+1}{m}\left\| \operatorname{diag}\left(\zeta^{1/2}_{\sigma^2}(y)\right)\,\Im\left[ (Aw)\odot e^{-i\phi(Ax)} \right] \right\|^2 \le \frac{1+(2+\varepsilon)\delta+(1+\delta)\Delta_\infty(\varepsilon)}{2}.$$

Finally, by using Lemma 6.15, when $\sigma^2 = 0.51$ and $\varepsilon = 0.2$, we have

$$\frac{2\sigma^2+1}{m}\left\| \operatorname{diag}\left(\zeta^{1/2}_{\sigma^2}(y)\right)\,\Im\left[ (Aw)\odot e^{-i\phi(Ax)} \right] \right\|^2 \le \frac{1+2.2\delta+0.404(1+\delta)}{2},$$

as desired.

Lemma 6.14: For a fixed scalar $\varepsilon>0$, let $\Delta_\infty(\varepsilon)$ be defined as in (VI.10). For any $\delta>0$, whenever $m\ge C\delta^{-2} n\log^4 n$, with probability at least $1-m^{-c\log^3 n}$, for all $w\in\mathbb{CS}^{n-1}$ with $w\perp x$, we have

$$\frac{2\sigma^2+1}{m}\left| \overline w^* A^\top\operatorname{diag}\left( \zeta_{\sigma^2}(Ax)\,\psi(Ax) \right) A\, w \right| \le (1+\delta)\,\Delta_\infty(\varepsilon) + (1+\varepsilon)\,\delta.$$

Proof: Let $g = a$, let $\delta\sim\mathcal{CN}(0,I)$ be independent of $g$, and let $v = R_{[1:n]}^* w$. Given a small scalar $\varepsilon>0$, we have

$$\frac{2\sigma^2+1}{m}\left| \overline v^* C_g^\top\operatorname{diag}\left( \zeta_{\sigma^2}(g\circledast x)\,\psi(g\circledast x) \right) C_g\, v \right| \le \frac{2\sigma^2+1}{m}\left| \overline v^* C_g^\top\operatorname{diag}\left( \zeta_{\sigma^2}(g\circledast x)\,\psi(g\circledast x) - (1+\varepsilon)\,\mathbb E_\delta\left[ \psi\left( (g-\delta)\circledast x \right) \right] \right) C_g\, v \right| + (1+\varepsilon)\cdot\frac{2\sigma^2+1}{m}\left| \overline v^* C_g^\top\operatorname{diag}\left( \mathbb E_\delta\left[ \psi\left((g-\delta)\circledast x\right) \right] \right) C_g\, v \right|$$

$$\le \frac{\Delta_\infty(\varepsilon)}{m}\left\| R_{[1:n]} C_g^* \right\|^2 + (1+\varepsilon)\cdot\underbrace{\frac{2\sigma^2+1}{m}\left| \overline v^* C_g^\top\operatorname{diag}\left( \mathbb E_\delta\left[ \psi\left((g-\delta)\circledast x\right) \right] \right) C_g\, v \right|}_{D(g,w)}.$$

By Corollary B.2, for any $\delta>0$, whenever $m\ge C_0\delta^{-2} n\log^4 n$ for some constant $C_0>0$, we have

$$\left\| R_{[1:n]} C_g^* \right\|^2 \le (1+\delta)\, m$$

with probability at least $1-m^{-c_0\log^3 m}$ for some constant $c_0>0$. Next, let us define a decoupled version of $D(g,w)$:

$$Q^D_{dec}(g^1, g^2, w) = \frac{2\sigma^2+1}{m}\, \overline w^*\, R_{[1:n]}\, C_{g^1}^\top\,\operatorname{diag}\left( \psi\left( g^2\circledast x \right) \right) C_{g^1}\, R_{[1:n]}^*\, w, \tag{VI.18}$$

where $g^1 = g+\delta$ and $g^2 = g-\delta$. Then, by using the fact that $w\perp x$, we have

$$\mathbb E_\delta\left[ Q^D_{dec}(g^1,g^2,w) \right] = \frac{2\sigma^2+1}{m}\,\overline w^*\, R_{[1:n]}\, C_g^\top\operatorname{diag}\left( \mathbb E_\delta\left[ \psi\left((g-\delta)\circledast x\right) \right] \right) C_g\, R_{[1:n]}^*\, w.$$

Then for any positive integer $p\ge1$, by Jensen's inequality and Theorem B.3, we have

$$\left\| \sup_{\|w\|=1,\, w\perp x} \left| D(g,w) \right| \right\|_{L^p} \le \left\| \sup_{\|w\|=1}\left| Q^D_{dec}(g^1,g^2,w) \right| \right\|_{L^p} \le C_{\sigma^2}\left( \sqrt{\frac nm}\,\log^{3/2} n\,\log^{1/2} m + \sqrt{\frac nm}\,\sqrt p + \frac nm\, p \right),$$

where $C_{\sigma^2}>0$ is some positive constant depending on $\sigma^2$, and we used the fact that $\|\psi(g^2\circledast x)\|_\infty \le 1$ holds uniformly for all $g^2$. Thus, by Lemma A.6, for any $\delta>0$, whenever $m\ge C_1\delta^{-2} n\log^3 n\log m$, we have

$$\sup_{\|w\|=1,\, w\perp x}\left| D(g,w) \right| \le \delta$$

holding with probability at least $1-m^{-c_1\log^3 m}$. Combining the results above completes the proof.

Fig. 9. Computer simulation of the functions $\zeta_{\sigma^2}(t)$ and $h(t)$: (a) displays the functions $\zeta_{\sigma^2}(t)$ and $h(t)$ with $\sigma^2 = 0.51$; (b) shows the difference $\zeta_{\sigma^2}(t) - (1+\varepsilon)h(t)$ with $\varepsilon = 0.2$.

F. Bounding $\Delta_\infty(\varepsilon)$

Given $h(t)$ and $\Delta_\infty(\varepsilon)$ introduced in (VI.9) and (VI.10), we prove the following results.

Lemma 6.15: Given $\sigma^2 = 0.51$ and $\varepsilon = 0.2$, we have $\Delta_\infty(\varepsilon)\le0.404$.

Proof: First, by Lemma 6.17, notice that the function $h(t)$ can be decomposed as $h(t) = g(t)\,\psi(t)$, where $g(t):\mathbb C\to[0,1)$ is rotationally invariant with respect to $t$. Since $\zeta_{\sigma^2}(t)$ is also rotationally invariant with respect to $t$, it is enough to consider the case $t\in[0,+\infty)$ and to bound the quantity

$$\sup_{t\in[0,+\infty)}\left| (1+\varepsilon)\,h(t) - \zeta_{\sigma^2}(t) \right|.$$

Lemma 6.18 implies that

$$h(t) = \mathbb E_{s\sim\mathcal{CN}(0,1)}\left[ \psi(t+s) \right] = 1 - t^{-2} + t^{-2}\,e^{-t^2}, \qquad t>0.$$

From Lemma 6.16, we have $\left\| \zeta_{\sigma^2}(t) - (1+\varepsilon)\,h(t) \right\|_{L^\infty} \le 0.2$, obtained via a tight approximation of the function $\zeta_{\sigma^2}(t) - (1+\varepsilon)h(t)$. Therefore, we have

$$\Delta_\infty(\varepsilon) = (1+2\sigma^2)\left\| \zeta_{\sigma^2}(t) - (1+\varepsilon)\,h(t) \right\|_{L^\infty} \le 0.2\times(1+2\times0.51) = 0.404$$

when $\sigma^2 = 0.51$ and $\varepsilon = 0.2$.

Lemma 6.16: For $t\ge0$, when $\varepsilon = 0.2$ and $\sigma^2 = 0.51$, we have

$$\left| \zeta_{\sigma^2}(t) - (1+\varepsilon)\,h(t) \right| \le 0.2.$$

Proof: Given $\varepsilon = 0.2$ and $\sigma^2 = 0.51$, we have

$$g(t) = \zeta_{\sigma^2}(t) - (1+\varepsilon)\,h(t) = -0.2 - e^{-\frac{t^2}{1.02}} + 1.2\,t^{-2} - 1.2\,t^{-2}\,e^{-t^2}.$$

When $t = 0$, we have $|g(t)| = 0$.
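Before the analytic case analysis, the uniform bound claimed in Lemma 6.16 can be sanity-checked on a fine grid. A verification sketch (ours, not part of the proof), using $\zeta_{\sigma^2}(t) = 1 - e^{-t^2/(2\sigma^2)}$ as inferred from the displayed form of $g(t)$:

```python
import numpy as np

# zeta_{0.51}(t) = 1 - exp(-t^2/1.02) (inferred; 2*sigma^2 = 1.02),
# h(t) = 1 - t^{-2} + t^{-2} exp(-t^2) from Lemma 6.18, eps = 0.2
t = np.linspace(1e-3, 30.0, 500_001)
zeta = 1.0 - np.exp(-t**2 / 1.02)
h = 1.0 - t**-2 + t**-2 * np.exp(-t**2)
g = zeta - 1.2 * h                      # g(t) = zeta(t) - (1 + eps) h(t)

sup_abs_g = np.abs(g).max()
print(sup_abs_g < 0.2)                  # the uniform bound of Lemma 6.16
print(2.02 * sup_abs_g <= 0.404)        # hence Delta_inf(0.2) <= 0.404
```

On this grid the supremum stays strictly below $0.2$ (the function tends to $-0.2$ from above as $t\to\infty$), which is consistent with the claimed $\Delta_\infty(0.2) \le 0.404$.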
When $t>10$, we have

$$g'(t) = \frac{t}{0.51}\,e^{-\frac{t^2}{1.02}} - 2.4\,t^{-3}\left( 1 - e^{-t^2} \right) + 2.4\,t^{-1}\,e^{-t^2} \le 0.$$

So the function $g(t)$ is monotonically decreasing for $t \ge 10$. As $\lim_{t\to+\infty} g(t) = -0.2$ and $|g(10)| < 0.2$, we have

$$|g(t)| \le 0.2, \qquad \forall\, t>10.$$

For $0<t\le10$, the bound $\left| \zeta_{\sigma^2}(t) - (1+\varepsilon)\,h(t) \right| \le 0.2$ follows from a tight (fine-grid) approximation of the function $\zeta_{\sigma^2}(t) - (1+\varepsilon)h(t)$; see Fig. 9(b). This completes the proof.

Lemma 6.17: The function $h(t)$ introduced in (VI.9) can be decomposed as

$$h(t) = g(t)\,\psi(t),$$

where $g(t):\mathbb C\to[0,1)$ is rotationally invariant with respect to $t$, such that

$$g(t) = \mathbb E_{v_1,v_2\sim\mathcal N(0,1/2)}\left[ \frac{(|t|+v_1)^2 - v_2^2}{(|t|+v_1)^2 + v_2^2} \right],$$

where $v_1\sim\mathcal N(0,1/2)$ and $v_2\sim\mathcal N(0,1/2)$.

Proof: By definition, we know that

$$\mathbb E_{s\sim\mathcal{CN}(0,1)}\left[ \psi(t+s) \right] = \mathbb E_s\left[ \left( \frac{t+s}{|t+s|} \right)^2 \right] = \underbrace{\mathbb E_s\left[ \left( \frac{\bar t}{|t|}\cdot\frac{t+s}{|t+s|} \right)^2 \right]}_{g(t)}\;\psi(t).$$

Next, we estimate $g(t)$ and show that it is indeed real. We decompose the random variable $s$ as

$$s = \left( \frac{\bar t\, s}{|t|} \right)\frac{t}{|t|} = \left( v_1 + i\,v_2 \right)\frac{t}{|t|},$$

where $v_1 = \Re\left( \bar t s/|t| \right)$ and $v_2 = \Im\left( \bar t s/|t| \right)$ are the real and imaginary parts of the complex Gaussian variable $\bar t s/|t|\sim\mathcal{CN}(0,1)$. By the rotation-invariance property, $v_1\sim\mathcal N(0,1/2)$ and $v_2\sim\mathcal N(0,1/2)$, and $v_1$ and $v_2$ are independent. Thus, we have

$$g(t) = \mathbb E\left[ \frac{\left( |t| + v_1 + i\,v_2 \right)^2}{(|t|+v_1)^2 + v_2^2} \right] = \mathbb E_{v_1,v_2}\left[ \frac{(|t|+v_1)^2 - v_2^2}{(|t|+v_1)^2+v_2^2} \right] + 2i\,\mathbb E_{v_1,v_2}\left[ \frac{(|t|+v_1)\,v_2}{(|t|+v_1)^2+v_2^2} \right].$$

We can see that $\frac{(|t|+v_1)\,v_2}{(|t|+v_1)^2+v_2^2}$ is an odd function of $v_2$. Therefore, its expectation with respect to $v_2$ is zero. Thus, we have

$$g(t) = \mathbb E_{v_1,v_2\sim\mathcal N(0,1/2)}\left[ \frac{(|t|+v_1)^2 - v_2^2}{(|t|+v_1)^2 + v_2^2} \right],$$

which is real.

Lemma 6.18: For $t\in[0,+\infty)$, we have

$$f(t) = \mathbb E_{s\sim\mathcal{CN}(0,1)}\left[ \psi(t+s) \right] = \begin{cases} 1 - t^{-2} + t^{-2}\,e^{-t^2} & t>0, \\ 0 & t=0. \end{cases} \tag{VI.19}$$

Proof: Let $s_r = \Re(s)$ and $s_i = \Im(s)$. We observe

$$\mathbb E_{s\sim\mathcal{CN}(0,1)}\left[ \psi(t+s) \right] = \frac1\pi\int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} \frac{(s_r + i s_i)^2}{|s_r + i s_i|^2}\, e^{-|s_r + i s_i - t|^2}\, ds_r\, ds_i.$$

Writing $s_r + i s_i = r\, e^{i\theta}$ in polar coordinates, this becomes

$$= \frac1\pi\int_{r=0}^{+\infty}\!\!\int_{\theta=0}^{2\pi} e^{i2\theta}\, e^{-r^2}\, e^{-t^2}\, e^{2rt\cos\theta}\, r\, d\theta\, dr = \frac1\pi\, e^{-t^2}\int_{r=0}^{+\infty}\!\!\int_{\theta=0}^{2\pi} \cos(2\theta)\, r\, e^{-r^2}\, e^{2rt\cos\theta}\, d\theta\, dr$$

$$= \frac2\pi\, e^{-t^2}\int_{r=0}^{+\infty}\!\!\int_{\theta=0}^{\pi/2} \cos(2\theta)\, r\, e^{-r^2}\left[ \cosh\left( 2rt\cos\theta \right) - \cosh\left( 2rt\sin\theta \right) \right] d\theta\, dr,$$

where the second equality uses the fact that the integral of an odd function is zero. By using the Taylor expansion of $\cosh(x)$, and by using the dominated convergence theorem to exchange the summation and integration, we observe

$$\mathbb E_{s\sim\mathcal{CN}(0,1)}\left[ \psi(t+s) \right] = \frac2\pi\, e^{-t^2}\int_{r=0}^{+\infty}\!\!\int_{\theta=0}^{\pi/2}\cos(2\theta)\, r\, e^{-r^2}\sum_{k=0}^{+\infty}\left[ \frac{(2rt\cos\theta)^{2k}}{(2k)!} - \frac{(2rt\sin\theta)^{2k}}{(2k)!} \right] d\theta\, dr$$

$$= \frac2\pi\, e^{-t^2}\sum_{k=0}^{+\infty}\frac{(2t)^{2k}}{(2k)!}\int_{r=0}^{+\infty} r^{2k+1}\, e^{-r^2}\, dr\;\left[ \int_{\theta=0}^{\pi/2}\cos(2\theta)\cos^{2k}\theta\, d\theta - \int_{\theta=0}^{\pi/2}\cos(2\theta)\sin^{2k}\theta\, d\theta \right].$$

We have the integrals

$$\int_{r=0}^{+\infty} r^{2k+1}\, e^{-r^2}\, dr = \frac{\Gamma(k+1)}{2}, \qquad \int_{\theta=0}^{\pi/2}\cos(2\theta)\cos^{2k}\theta\, d\theta = \frac{\sqrt\pi}{2}\cdot\frac{k\,\Gamma(k+1/2)}{\Gamma(k+2)}, \qquad \int_{\theta=0}^{\pi/2}\cos(2\theta)\sin^{2k}\theta\, d\theta = -\frac{\sqrt\pi}{2}\cdot\frac{k\,\Gamma(k+1/2)}{\Gamma(k+2)},$$

holding for any integer $k\ge0$, where $\Gamma(\cdot)$ is the Gamma function, with

$$\Gamma(k+1) = k!, \qquad \Gamma(k+1/2) = \frac{(2k)!\,\sqrt\pi}{4^k\, k!}.$$

Thus, for $t>0$, we have

$$\mathbb E_{s\sim\mathcal{CN}(0,1)}\left[ \psi(t+s) \right] = \frac2\pi\, e^{-t^2}\sum_{k=0}^{+\infty}\frac{(2t)^{2k}}{(2k)!}\cdot\frac{\Gamma(k+1)}{2}\cdot\sqrt\pi\cdot\frac{k\,\Gamma(k+1/2)}{\Gamma(k+2)} = e^{-t^2}\sum_{k=0}^{+\infty}\frac{k\, t^{2k}}{(k+1)!} = e^{-t^2}\left[ \sum_{k=0}^{+\infty}\frac{t^{2k}}{k!} - \sum_{k=0}^{+\infty}\frac{t^{2k}}{(k+1)!} \right] = e^{-t^2}\left[ e^{t^2} - t^{-2}\left( e^{t^2} - 1 \right) \right] = 1 - t^{-2} + t^{-2}\, e^{-t^2}.$$
When $t=0$, by using L'Hôpital's rule, we have

$$f(0) = \lim_{t\to0}\mathbb E_{s\sim\mathcal{CN}(0,1)}\left[ \psi(t+s) \right] = \lim_{t\to0}\left( 1 + \frac{1-e^{t^2}}{t^2\, e^{t^2}} \right) = 1 + \lim_{t\to0}\frac{-1}{1+t^2} = 0.$$

We complete the proof.

In the appendix, we provide details of proofs for some supporting results. Appendix A summarizes basic tools used throughout the analysis. In Appendix B, we provide results for bounding the suprema of chaos processes for random circulant matrices. In Appendix C, we present concentration results for suprema of some dependent random processes via decoupling.

APPENDIX A
ELEMENTARY TOOLS AND RESULTS

Lemma A.1: Given a fixed number $\rho\in(0,1)$, for any $z, z'\in\mathbb C$, we have

$$\left| \exp\left( i\phi(z+z') \right) - \exp\left( i\phi(z') \right) \right| \le 2\,\mathbf 1_{\{|z|\ge\rho|z'|\}} + \frac{1}{1-\rho}\left| \Im\left( z/z' \right) \right|.$$

Proof: See the proof of Lemma 3.2 of [11].

Lemma A.2: Let $\rho\in(0,1)$. For any $z\in\mathbb C$ with $|z|\le\rho$, we have

$$\left| 1 - \exp\left( i\phi(1+z) \right) + i\,\Im(z) \right| \le \frac{2}{(1-\rho)^2}\,|z|^2. \tag{A.1}$$

Proof: For any $t\in\mathbb R_+$, let $g(t) = \sqrt{(1+\Re(z))^2 + t^2}$; then

$$g'(t) = \frac{t}{\sqrt{(1+\Re(z))^2+t^2}} \le \frac{t}{|1+\Re(z)|}.$$

Hence, for any $z\in\mathbb C$ with $|z|\le\rho$, we have

$$\left| |1+z| - \left( 1+\Re(z) \right) \right| = \left| g\left( \Im(z) \right) - g(0) \right| \le \frac{\Im^2(z)}{|1+\Re(z)|} \le \frac{1}{1-\rho}\,\Im^2(z).$$

Let $f(z) = 1 - \exp\left( i\phi(1+z) \right)$. By using the estimates above, we observe

$$\left| f(z) + i\,\Im(z) \right| = \left| \frac{|1+z| - (1+z)}{|1+z|} + i\,\Im(z) \right| = \frac{1}{|1+z|}\left| |1+z| - (1+z) + i\,\Im(z)\,|1+z| \right| \le \frac{1}{|1+z|}\left( \left| \Im(z) \right|\left| 1 - |1+z| \right| + \left| |1+z| - \left( 1+\Re(z) \right) \right| \right) \le \frac{1}{1-\rho}\left( |z|\,\left| \Im(z) \right| + \frac{1}{1-\rho}\,\Im^2(z) \right) \le \frac{2}{(1-\rho)^2}\,|z|^2.$$

Lemma A.3 (Gaussian Concentration Inequality): Let $w\in\mathbb R^n$ be a standard Gaussian random vector, $w\sim\mathcal N(0,I)$, and let $g:\mathbb R^n\to\mathbb R$ be an $L$-Lipschitz function. Then for all $t>0$,

$$\mathbb P\left( \left| g(w) - \mathbb E\left[ g(w) \right] \right| \ge t \right) \le 2\exp\left( -t^2/(2L^2) \right).$$

Moreover, if $w\in\mathbb C^n$ with $w\sim\mathcal{CN}(0,I)$, and $g:\mathbb C^n\to\mathbb R$ is $L$-Lipschitz, then the inequality above still holds.

Proof: The result for real-valued Gaussian random vectors is standard; see [98, Chapter 5] for a detailed proof. For the complex case, let

$$h\left( \begin{bmatrix} v_r \\ v_i \end{bmatrix} \right) = \frac{1}{\sqrt2}\begin{bmatrix} I & iI \end{bmatrix}\begin{bmatrix} v_r \\ v_i \end{bmatrix}, \qquad v_r, v_i \sim_{i.i.d.} \mathcal N(0, I).$$

By the composition theorem, we know that $g\circ h:\mathbb R^{2n}\to\mathbb R$ is $L$-Lipschitz. Therefore, by applying the Gaussian concentration inequality to $g\circ h$ and $\begin{bmatrix} v_r \\ v_i \end{bmatrix}$, we get the desired result.

Theorem A.4 (Gaussian tail comparison for vector-valued functions, [99, Theorem 3]): Let $w\in\mathbb R^n$ be a standard Gaussian vector, $w\sim\mathcal N(0,I)$, and let $f:\mathbb R^n\to\mathbb R^k$ be an $L$-Lipschitz function. Then for any $t>0$, we have

$$\mathbb P\left( \left\| f(w) - \mathbb E\left[ f(w) \right] \right\| \ge t \right) \le e\,\mathbb P\left( \|v\| \ge \frac tL \right),$$

where $v\in\mathbb R^k$ is such that $v\sim\mathcal N(0,I)$. Moreover, if $w\in\mathbb C^n$ with $w\sim\mathcal{CN}(0,I)$ and $f:\mathbb C^n\to\mathbb R^k$ is $L$-Lipschitz, then the inequality above still holds. The proof is similar to that of Lemma A.3.

Lemma A.5 (Tail of sub-Gaussian random variables): Let $X$ be a centered $\sigma^2$ sub-Gaussian random variable, such that

$$\mathbb P\left( |X|\ge t \right) \le 2\exp\left( -\frac{t^2}{2\sigma^2} \right);$$

then for any integer $p\ge1$, we have

$$\mathbb E\left[ |X|^p \right] \le \left( 2\sigma^2 \right)^{p/2}\, p\,\Gamma(p/2).$$

In particular, we have

$$\|X\|_{L^p} = \left( \mathbb E\left[ |X|^p \right] \right)^{1/p} \le \sigma\, e^{1/e}\,\sqrt p, \quad p\ge2, \qquad \mathbb E\left[ |X| \right] \le \sigma\sqrt{2\pi}.$$

Lemma A.6 (Sub-exponential tail bound via moment control): Suppose $X$ is a centered random variable satisfying

$$\left( \mathbb E\left[ |X|^p \right] \right)^{1/p} \le \alpha_0 + \alpha_1\sqrt p + \alpha_2\, p, \qquad \text{for all } p\ge p_0,$$

for some $\alpha_0,\alpha_1,\alpha_2,p_0>0$. Then, for any $u\ge p_0$, we have

$$\mathbb P\left( |X| \ge e\left( \alpha_0 + \alpha_1\sqrt u + \alpha_2\, u \right) \right) \le 2\exp(-u).$$

This further implies that for any $t > \alpha_1\sqrt{p_0} + \alpha_2\, p_0$, we have

$$\mathbb P\left( |X| \ge c_1\alpha_0 + t \right) \le 2\exp\left( -c_2\min\left\{ \frac{t^2}{\alpha_1^2},\ \frac{t}{\alpha_2} \right\} \right),$$

for some positive constants $c_1, c_2>0$.

Proof: The first inequality comes directly from Proposition 2.6 of [14] via the Markov inequality; see also Proposition 7.11 and Proposition 7.15 of [60].
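As an illustrative aside (ours, not from the paper), the concentration phenomenon of Lemma A.3 is easy to observe numerically: for the 1-Lipschitz function $f(w) = \|w\|$, the empirical tail of the deviation sits well below the Gaussian bound $2\exp(-t^2/(2L^2))$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 100_000
W = rng.normal(size=(trials, n))            # w ~ N(0, I_n)
f = np.linalg.norm(W, axis=1)               # f(w) = ||w||, 1-Lipschitz (L = 1)
dev = np.abs(f - f.mean())                  # deviation from the (empirical) mean

tails = {t: float(np.mean(dev >= t)) for t in (1.0, 2.0, 3.0)}
for t, emp in tails.items():
    print(t, emp, 2*np.exp(-t**2/2))        # empirical tail vs. 2 exp(-t^2/(2 L^2))
```

The bound is loose here (for $\|w\|$ the actual fluctuations have standard deviation about $1/\sqrt2$, independent of $n$), but the dimension-free exponential decay is exactly what the proofs above exploit.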
For the second claim, let $t = \alpha_1\sqrt u + \alpha_2\, u$. If $\alpha_1\sqrt u \le \alpha_2\, u$, then

$$t = \alpha_1\sqrt u + \alpha_2\, u \le 2\alpha_2\, u \implies u \ge \frac{t}{2\alpha_2}.$$

Otherwise, similarly, we have $u \ge t^2/(4\alpha_1^2)$. Combining the two cases above, we get the desired result.

In the following, we describe a tail bound for a class of heavy-tailed random variables, whose moments grow much faster than sub-Gaussian and sub-exponential.

Lemma A.7 (Tail bound for heavy-tailed distributions via moment control): Suppose $X$ is a centered random variable satisfying

$$\left( \mathbb E\left[ |X|^p \right] \right)^{1/p} \le p\left( \alpha_0 + \alpha_1\sqrt p + \alpha_2\, p \right), \qquad \forall\, p\ge p_0,$$

for some $\alpha_0,\alpha_1,\alpha_2,p_0\ge0$. Then, for any $u\ge p_0$, we have

$$\mathbb P\left( |X| \ge e\, u\left( \alpha_0 + \alpha_1\sqrt u + \alpha_2\, u \right) \right) \le 2\exp(-u).$$

This further implies that for any $t > p_0\left( \alpha_0 + \alpha_1\sqrt{p_0} + \alpha_2\, p_0 \right)$, we have

$$\mathbb P\left( |X| \ge c_1\, t \right) \le 2\exp\left( -c_2\min\left\{ \sqrt{\frac{t}{2(\alpha_1+\alpha_2)}},\ \frac{t}{2\alpha_0} \right\} \right),$$

for some positive constants $c_1, c_2>0$.

Proof: The proof of the first tail bound is similar to that of Lemma A.6, using the Markov inequality. Notice that

$$\mathbb P\left( |X| \ge e\, u\left( \alpha_0 + (\alpha_1+\alpha_2)\, u \right) \right) \le \mathbb P\left( |X| \ge e\, u\left( \alpha_0 + \alpha_1\sqrt u + \alpha_2\, u \right) \right) \le 2\exp(-u).$$

Let $t = \alpha_0\, u + (\alpha_1+\alpha_2)\, u^2$. If $\alpha_0\, u \le (\alpha_1+\alpha_2)\, u^2$, then

$$t = \alpha_0\, u + (\alpha_1+\alpha_2)\, u^2 \le 2(\alpha_1+\alpha_2)\, u^2 \implies u \ge \sqrt{\frac{t}{2(\alpha_1+\alpha_2)}}.$$

Otherwise, we have $u \ge t/(2\alpha_0)$. Combining the two cases above, we get the desired result.

Definition A.8 ($d_2(\cdot)$, $d_F(\cdot)$ and the $\gamma_\beta$ functional): For a given set of matrices $\mathcal B$, define

$$d_F(\mathcal B) = \sup_{B\in\mathcal B}\|B\|_F, \qquad d_2(\mathcal B) = \sup_{B\in\mathcal B}\|B\|.$$

For a metric space $(T,d)$, an admissible sequence of $T$ is a collection of subsets of $T$, $\{T_r : r\ge0\}$, such that $|T_0| = 1$ and, for every $r\ge1$, $|T_r| \le 2^{2^r}$. For $\beta\ge1$, define the $\gamma_\beta$ functional by

$$\gamma_\beta(T,d) = \inf\,\sup_{t\in T}\sum_{r=0}^\infty 2^{r/\beta}\, d(t, T_r),$$

where the infimum is taken with respect to all admissible sequences of $T$. In particular, for the $\gamma_2$ functional of a set $\mathcal B$ equipped with the distance $\|\cdot\|$, [100] shows that

$$\gamma_2\left( \mathcal B, \|\cdot\| \right) \le c\int_0^{d_2(\mathcal B)} \log^{1/2}\mathcal N\left( \mathcal B, \|\cdot\|, \epsilon \right)\, d\epsilon,$$

where $\mathcal N(\mathcal B,\|\cdot\|,\epsilon)$ is the covering number of the set $\mathcal B$ at radius $\epsilon\in(0,1)$.

Theorem A.9 (Theorem 3.5, [14]): Let $\sigma_\xi^2\ge1$ and $\xi = (\xi_j)_{j=1}^n$, where $\{\xi_j\}_{j=1}^n$ are independent zero-mean, variance-one, $\sigma_\xi^2$-subgaussian random variables, and let $\mathcal B$ be a class of matrices. Let us define the quantity

$$C_{\mathcal B}(\xi) = \sup_{B\in\mathcal B}\left| \|B\xi\|^2 - \mathbb E\left[ \|B\xi\|^2 \right] \right|. \tag{A.2}$$

For every $p\ge1$, we have

$$\left\| \sup_{B\in\mathcal B}\|B\xi\| \right\|_{L^p} \le C_{\sigma_\xi^2}\left[ \gamma_2\left( \mathcal B, \|\cdot\| \right) + d_F(\mathcal B) + \sqrt p\, d_2(\mathcal B) \right],$$

$$\left\| \sup_{B\in\mathcal B}\left| \|B\xi\|^2 - \mathbb E\left[ \|B\xi\|^2 \right] \right| \right\|_{L^p} \le C_{\sigma_\xi^2}\left\{ \gamma_2\left( \mathcal B, \|\cdot\| \right)\left[ \gamma_2\left( \mathcal B, \|\cdot\| \right) + d_F(\mathcal B) \right] + \sqrt p\, d_2(\mathcal B)\left[ \gamma_2\left( \mathcal B, \|\cdot\| \right) + d_F(\mathcal B) \right] + p\, d_2^2(\mathcal B) \right\},$$

where $C_{\sigma_\xi^2}$ is some positive numerical constant depending only on $\sigma_\xi^2$, and $d_2(\cdot)$, $d_F(\cdot)$ and $\gamma_2(\mathcal B,\|\cdot\|)$ are given in Definition A.8.

The following theorem establishes the restricted isometry property (RIP) of the Gaussian random convolution matrix.

Theorem A.10 (Theorem 4.1, [14]): Let $\xi\in\mathbb C^m$ be a random vector with $\xi_i\sim_{i.i.d.}\mathcal{CN}(0,1)$, and let $\Omega$ be a fixed subset of $[m]$ with $|\Omega| = n$. Define the set $\mathcal E_s = \left\{ v\in\mathbb C^m \mid \|v\|_0\le s \right\}$, and define the matrix

$$\Phi = R_\Omega\, C_\xi \in\mathbb C^{n\times m},$$

where $R_\Omega:\mathbb C^m\to\mathbb C^n$ is an operator that restricts a vector to its entries in $\Omega$. Then for any $s\le m$, and $\eta, \delta_s\in(0,1)$ such that

$$n \ge C\,\delta_s^{-2}\, s\,\log^2 s\,\log^2 m,$$

the partial random circulant matrix $\Phi\in\mathbb C^{n\times m}$ satisfies the restricted isometry property

$$(1-\delta_s)\sqrt n\,\|v\| \le \|\Phi v\| \le (1+\delta_s)\sqrt n\,\|v\| \tag{A.3}$$

for all $v\in\mathcal E_s$, with probability at least $1 - m^{-\log^2 s\,\log m}$.

Lemma A.11: Let the random vector $\xi\in\mathbb C^m$ and the random matrix $\Phi\in\mathbb C^{n\times m}$ be defined the same as in Theorem A.10, and let $\mathcal E_s = \left\{ v\in\mathbb C^m \mid \|v\|_0\le s \right\}$ for some positive integer $s\le n$. For any positive scalar $\delta>0$ and any positive integer $s$ with $1<s\le n$, whenever $m\ge C\delta^{-2} n\log^4 n$, we have

$$\|\Phi v\| \le \delta\sqrt m\,\|v\|$$

for all $v\in\mathcal E_s$, with probability at least $1 - m^{-c\log^2 s}$.

Proof: The proof follows from the results in [14]. Without loss of generality, we assume $\|v\| = 1$. Let us define the sets

$$\mathcal D_{s,m} = \left\{ v\in\mathbb C^m : \|v\| = 1,\ \|v\|_0\le s \right\}, \qquad \mathcal V = \left\{ V_v = \frac{1}{\sqrt n}\, R_{[1:n]}\, F_m^{-1}\operatorname{diag}\left( F_m v \right) F_m \;\middle|\; v\in\mathcal D_{s,m} \right\},$$

where $R_{[1:n]}:\mathbb C^m\to\mathbb C^n$ denotes the operator that restricts a vector to its first $n$ coordinates.
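The set $\mathcal V$ above rests on the DFT diagonalization of circulant matrices. A small numerical sketch of that identity (ours, not from the paper; unnormalized DFT, so the $1/\sqrt n$ scaling of $\mathcal V$ is omitted here):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 16, 6
xi = rng.normal(size=m) + 1j * rng.normal(size=m)       # random kernel

# Circulant matrix with first column xi: (C_xi)_{kj} = xi_{(k-j) mod m}
C = np.array([[xi[(k - j) % m] for j in range(m)] for k in range(m)])

v = np.zeros(m, dtype=complex)
v[:n] = rng.normal(size=n) + 1j * rng.normal(size=n)    # vector supported on [n]

# DFT diagonalization / convolution theorem: C_xi v = F^{-1} diag(F xi) F v
lhs = C @ v
rhs = np.fft.ifft(np.fft.fft(xi) * np.fft.fft(v))
err = np.max(np.abs(lhs - rhs))
print(err)                                              # numerically zero
```

This is why the action of a random circulant matrix on a fixed sparse vector can be rewritten as a fixed (deterministic) matrix $V_v$ acting on the random vector $\xi$, which is the form required by Theorem A.9.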
Section 4 of [14] shows that

$$\sup_{v\in\mathcal D_{s,m}}\left| \frac1n\|\Phi v\|^2 - 1 \right| = \sup_{V_v\in\mathcal V}\left| \|V_v\xi\|^2 - \mathbb E_\xi\left[ \|V_v\xi\|^2 \right] \right| = C_{\mathcal V}(\xi),$$

where $C_{\mathcal V}(\xi)$ is defined in (A.2). Theorem 4.1 and Lemma 4.2 of [14] imply that

$$d_F(\mathcal V) = 1, \qquad d_2(\mathcal V) \le \sqrt{\frac sn}, \qquad \gamma_2\left( \mathcal V, \|\cdot\| \right) \le c\,\sqrt{\frac sn}\,\log s\,\log m,$$

for some constant $c>0$. By using the estimates above, Theorem 3.1 of [14] further implies that for any $t>0$,

$$\mathbb P\left( C_{\mathcal V}(\xi) \ge c_1\sqrt{\frac sn}\,\log^2 s\,\log^2 m + t \right) \le 2\exp\left( -c_2\min\left\{ \frac{n\, t^2}{s\log^2 s\log^2 m},\ \frac{n\, t}{s} \right\} \right).$$

For any positive constant $\delta>0$, choosing $t = \delta^2 m/n$, whenever $m\ge C\delta^{-2} n\log^2 s\log^2 m$ for some constant $C>0$ large enough, we have

$$\sup_{v\in\mathcal D_{s,m}}\left| \frac1n\|\Phi v\|^2 - 1 \right| \le c_1\sqrt{\frac sn}\,\log^2 s\,\log^2 m + \delta^2\,\frac mn \le 2\delta^2\,\frac mn,$$

with probability at least $1-m^{-c_3\log^2 s}$. Therefore,

$$\|\Phi v\| \le \sqrt{n + 2\delta^2 m} \le C'\,\delta\sqrt m$$

holds for any $v\in\mathcal D_{s,m}$ with high probability.

APPENDIX B
MOMENTS AND SPECTRAL NORM OF PARTIAL RANDOM CIRCULANT MATRICES

Let $g\in\mathbb C^m$ be a random complex Gaussian vector with $g\sim\mathcal{CN}(0,\sigma_g^2 I)$. Given a partial random circulant matrix $C_g R_{[1:n]}^*\in\mathbb C^{m\times n}$ ($m\ge n$), we control the moments and the tail bounds of terms of the following form:

$$\mathcal T_1(g) = \frac1m\, R_{[1:n]}\, C_g^*\operatorname{diag}(b)\, C_g\, R_{[1:n]}^*, \qquad \mathcal T_2(g) = \frac1m\, R_{[1:n]}\, C_g^\top\operatorname{diag}(\tilde b)\, C_g\, R_{[1:n]}^*,$$

where $b\in\mathbb R^m$ and $\tilde b\in\mathbb C^m$. The concentration of these quantities plays an important role in our arguments, and the proofs mimic the arguments in [13], [14]. Prior to that, let us define the sets

$$\mathcal D = \left\{ v\in\mathbb{CS}^{m-1} : \operatorname{supp}(v)\subseteq[n] \right\}, \tag{B.1}$$

$$\mathcal V(d) = \left\{ V_v = \frac{1}{\sqrt m}\operatorname{diag}(d)^{1/2}\, F_m^{-1}\operatorname{diag}\left( F_m v \right) F_m \;\middle|\; v\in\mathcal D \right\}, \tag{B.2}$$

for some $d\in\mathbb C^m$.

A. Controlling the Moments and Tail of $\mathcal T_1(g)$

Theorem B.1: Let $g\in\mathbb C^m$ be a random complex Gaussian vector with $g\sim\mathcal{CN}(0,\sigma_g^2 I)$, and let $b = [b_1,\cdots,b_m]^\top\in\mathbb R^m$ be a fixed vector. Given a partial random circulant matrix $C_g R_{[1:n]}^*\in\mathbb C^{m\times n}$ ($m\ge n$), let us define

$$\mathcal L(g) = \left\| \frac1m\, R_{[1:n]}\, C_g^*\operatorname{diag}(b)\, C_g\, R_{[1:n]}^* - \left( \frac1m\sum_{k=1}^m b_k \right) I \right\|.$$

Then for any integer $p\ge1$, we have

$$\left\| \mathcal L(g) \right\|_{L^p} \le C_{\sigma_g^2}\,\|b\|_\infty\left( \sqrt{\frac nm}\,\log^{3/2} n\,\log^{1/2} m + \sqrt{\frac nm}\,\sqrt p + \frac nm\, p \right).$$

In addition, for any $\delta>0$, whenever $m \ge C'_{\sigma_g^2}\,\delta^{-2}\,\|b\|_\infty^2\, n\log^4 n$, we have

$$\mathcal L(g) \le \delta \tag{B.3}$$

with probability at least $1 - 2m^{-c_{\sigma_g^2}\log^3 n}$. Here, $c_{\sigma_g^2}, C_{\sigma_g^2}$, and $C'_{\sigma_g^2}$ are numerical constants depending only on $\sigma_g^2$.

Proof: Without loss of generality, let us assume that $\sigma_g^2 = 1$. Let us first consider the case $b\ge0$, and let $\Lambda = \operatorname{diag}(b)$. For any $w\in\mathbb{CS}^{n-1}$, let us denote

$$v = R_{[1:n]}^*\, w, \qquad \mathcal S_n = \left\{ v\in\mathbb{CS}^{m-1} : \operatorname{supp}(v)\subseteq[n] \right\};$$

then

$$\mathcal L(g) = \sup_{v\in\mathcal S_n}\left| \frac1m\, v^* C_g^*\,\Lambda\, C_g\, v - \frac1m\sum_{k=1}^m b_k \right|.$$

By the convolution theorem, we know that

$$\frac{1}{\sqrt m}\,\Lambda^{1/2}\, C_g\, v = \frac{1}{\sqrt m}\,\Lambda^{1/2}\left( g\circledast v \right) = \frac{1}{\sqrt m}\,\Lambda^{1/2}\, F_m^{-1}\operatorname{diag}\left( F_m v \right) F_m\, g = V_v\, g.$$

Since $\mathbb E\left[ R_{[1:n]}\, C_g^*\,\Lambda\, C_g\, R_{[1:n]}^* \right] = \left( \sum_{k=1}^m b_k \right) I$, we observe

$$\mathcal L(g) = \sup_{V_v\in\mathcal V(b)}\left| \|V_v g\|^2 - \mathbb E\left[ \|V_v g\|^2 \right] \right|,$$

where the set $\mathcal V(b)$ is defined in (B.2). Next, we invoke Theorem A.9 to control all the moments of $\mathcal L(g)$, for which we need to control the quantities $d_2(\cdot)$, $d_F(\cdot)$ and $\gamma_2(\cdot,\cdot)$ defined in Definition A.8 for the set $\mathcal V(b)$. By Lemma B.7 and Lemma B.8, we know that

$$d_F\left( \mathcal V(b) \right) \le \|b\|_\infty^{1/2}, \tag{B.4}$$

$$d_2\left( \mathcal V(b) \right) \le \sqrt{\frac nm}\,\|b\|_\infty^{1/2}, \tag{B.5}$$

$$\gamma_2\left( \mathcal V(b), \|\cdot\| \right) \le C_0\,\sqrt{\frac nm}\,\|b\|_\infty^{1/2}\,\log^{3/2} n\,\log^{1/2} m, \tag{B.6}$$

for some constant $C_0>0$. Thus, combining the results in (B.4), (B.5) and (B.6), whenever $m\ge C_1 n\log^3 n\log m$ for some constant $C_1>0$, Theorem A.9 implies that

$$\left\| \mathcal L(g) \right\|_{L^p} \le C_2\left\{ \gamma_2\left[ \gamma_2 + d_F \right] + \sqrt p\, d_2\left[ \gamma_2 + d_F \right] + p\, d_2^2 \right\} \le C_3\,\|b\|_\infty\left( \sqrt{\frac nm}\,\log^{3/2} n\,\log^{1/2} m + \sqrt{\frac nm}\,\sqrt p + \frac nm\, p \right)$$

for some constants $C_2, C_3>0$. Based on the moment estimates of $\mathcal L(g)$, Lemma A.6 further implies that

$$\mathbb P\left( \mathcal L(g) \ge C_4\,\|b\|_\infty\sqrt{\frac nm}\,\log^{3/2} n\,\log^{1/2} m + t \right) \le 2\exp\left( -C_5\,\|b\|_\infty^{-1}\min\left\{ \frac mn\, t^2,\ t \right\} \right),$$

for some constants $C_4, C_5>0$. Thus, for any $\delta>0$, whenever $m\ge C_6\,\delta^{-2}\,\|b\|_\infty^2\, n\log^3 n\log m$ for some constant $C_6>0$, we have $\mathcal L(g)\le\delta$ with probability at least $1-2m^{-C_7\log^3 n}$.

Now when $b$ is not nonnegative, let $b = b_+ - b_-$, where $b_+ = \left[ b_1^+,\cdots,b_m^+ \right]$ and $b_- = \left[ b_1^-,\cdots,b_m^- \right]\in\mathbb R_+^m$ are the nonnegative and nonpositive parts of $b$, respectively. Let $\Lambda = \operatorname{diag}(b)$, $\Lambda_+ = \operatorname{diag}(b_+)$ and $\Lambda_- = \operatorname{diag}(b_-)$; we have

$$\mathcal L(g) = \sup_{v\in\mathcal S_n}\left| \frac1m\, v^* C_g^*\,\Lambda\, C_g\, v - \frac1m\sum_{k=1}^m b_k \right| \le \underbrace{\sup_{v\in\mathcal S_n}\left| \frac1m\, v^* C_g^*\,\Lambda_+\, C_g\, v - \frac1m\sum_{k=1}^m b_k^+ \right|}_{\mathcal L_+(g)} + \underbrace{\sup_{v\in\mathcal S_n}\left| \frac1m\, v^* C_g^*\,\Lambda_-\, C_g\, v - \frac1m\sum_{k=1}^m b_k^- \right|}_{\mathcal L_-(g)}.$$

Now since $b_+, b_-\in\mathbb R_+^m$, we can apply the results above to $\mathcal L_+(g)$ and $\mathcal L_-(g)$, respectively. Then by Minkowski's inequality, we have

$$\left\| \mathcal L(g) \right\|_{L^p} \le \left\| \mathcal L_+(g) \right\|_{L^p} + \left\| \mathcal L_-(g) \right\|_{L^p} \le C_6\,\|b\|_\infty\left( \sqrt{\frac nm}\,\log^{3/2} n\,\log^{1/2} m + \sqrt{\frac nm}\,\sqrt p + \frac nm\, p \right)$$

for some constant $C_6>0$. The tail bound can be similarly derived from the moment bounds. This completes the proof.

The result above also implies the following result.

Corollary B.2: Let $g\in\mathbb C^m$ be a random complex Gaussian vector with $g\sim\mathcal{CN}(0,\sigma_g^2 I)$, and let $G = R_{[1:n]}\, C_g^*\in\mathbb C^{n\times m}$ ($n\le m$). Then for any integer $p\ge1$, we have

$$\left( \mathbb E\left[ \|G\|^p \right] \right)^{1/p} \le C_{\sigma_g^2}\,\sqrt m\left( 1 + \sqrt{\frac nm}\,\sqrt p + \sqrt{\frac nm}\,\log^{3/2} n\,\log^{1/2} m \right).$$

Moreover, for any $\delta\in(0,1)$, whenever $m\ge C\delta^{-2} n\log^4 n$ for some constant $C>0$, we have

$$(1-\delta)\, m\,\|w\|^2 \le \left\| G^* w \right\|^2 \le (1+\delta)\, m\,\|w\|^2$$

for all $w\in\mathbb C^n$, with probability at least $1-2m^{-c_{\sigma_g^2}\log^3 n}$. Here, $c_{\sigma_g^2}, C_{\sigma_g^2}>0$ are some constants depending only on $\sigma_g^2$.

Proof: Firstly, notice that

$$\|G\| = \sup_{w\in\mathbb{CS}^{n-1},\ r\in\mathbb{CS}^{m-1}}\left| \langle w, G r\rangle \right| = \sup_{w\in\mathbb{CS}^{n-1}}\left\| G^* w \right\| = \sup_{w\in\mathbb{CS}^{n-1}}\left\| C_g\, R_{[1:n]}^*\, w \right\| = \sup_{v\in\mathbb{CS}^{m-1},\ \operatorname{supp}(v)\subseteq[n]}\left\| C_g\, v \right\|.$$

Thus, similar to the argument of Theorem B.1, with the sets $\mathcal D$ and $\mathcal V(\mathbf 1)$ defined as in (B.1) and (B.2), we have

$$\frac{1}{\sqrt m}\,\|G\| \le \sup_{V_v\in\mathcal V(\mathbf 1)}\left\| V_v\, g \right\|.$$

By Lemma B.7 and Lemma B.8, we know that

$$d_F\left( \mathcal V(\mathbf 1) \right) \le 1, \qquad d_2\left( \mathcal V(\mathbf 1) \right) \le \sqrt{\frac nm}, \qquad \gamma_2\left( \mathcal V(\mathbf 1), \|\cdot\| \right) \le C_0\,\sqrt{\frac nm}\,\log^{3/2} n\,\log^{1/2} m.$$

Thus, using Theorem A.9, we obtain

$$\left( \mathbb E\left[ \sup_{V_v\in\mathcal V(\mathbf 1)} \|V_v g\|^p \right] \right)^{1/p} \le C_{\sigma_g^2}\left( \sqrt{\frac nm}\,\log^{3/2} n\,\log^{1/2} m + 1 + \sqrt{\frac nm}\,\sqrt p \right),$$

where $C_{\sigma_g^2}>0$ is a constant depending only on $\sigma_g^2$. The concentration inequality can be directly derived from Theorem B.1, noticing that for any $\delta>0$, whenever $m\ge C_1\delta^{-2} n\log^4 n$ for some positive constant $C_1>0$, we have

$$\sup_{w\in\mathbb{CS}^{n-1}}\left| \frac1m\, w^* G G^* w - 1 \right| \le \delta \implies (1-\delta)\, m \le \inf_{w\in\mathbb{CS}^{n-1}}\left\| G^* w \right\|^2 \le \sup_{w\in\mathbb{CS}^{n-1}}\left\| G^* w \right\|^2 \le (1+\delta)\, m,$$

with probability at least $1-2m^{-c_{\sigma_g^2}\log^3 n}$, where $c_{\sigma_g^2}>0$ is some constant depending only on $\sigma_g^2$.

B. Controlling the Moments of $\mathcal T_2(g)$

Theorem B.3: Let $g\in\mathbb C^m$ be a complex random Gaussian vector with $g\sim\mathcal{CN}(0,\sigma_g^2 I)$, and let

$$\mathcal N(g) = \sup_{w\in\mathbb{CS}^{n-1}}\frac1m\left| w^\top R_{[1:n]}\, C_g^\top\operatorname{diag}(\tilde b)\, C_g\, R_{[1:n]}^*\, w \right|,$$

where $\tilde b\in\mathbb C^m$. Then whenever $m\ge C n\log^4 n$ for some positive constant $C>0$, for any positive integer $p\ge1$, we have

$$\left\| \mathcal N(g) \right\|_{L^p} \le C_{\sigma_g^2}\,\|\tilde b\|_\infty\left( \sqrt{\frac nm}\,\log^{3/2} n\,\log^{1/2} m + \sqrt{\frac nm}\,\sqrt p + \frac nm\, p \right),$$

where $C_{\sigma_g^2}$ is a positive constant depending only on $\sigma_g^2$.

Proof: Let $\Lambda = \operatorname{diag}(\tilde b)$. Similar to the arguments of Theorem B.1, we have

$$\mathcal N(g) = \sup_{V_v\in\mathcal V(\tilde b)}\left| \left( V_v\, g \right)^\top\left( V_v\, g \right) \right|. \tag{B.7}$$

Let $g'$ be an independent copy of $g$. By Lemma B.4, for any integer $p\ge1$, we have

$$\left\| \mathcal N(g) \right\|_{L^p} \le 4\left\| \sup_{V_v\in\mathcal V(\tilde b)}\left| \left( V_v\, g \right)^\top\left( V_v\, g' \right) \right| \right\|_{L^p}.$$

Lemma B.4: Let $\mathcal N(g)$ be defined as in (B.7), and let $g'$ be an independent copy of $g$. Then we have

$$\left\| \mathcal N(g) \right\|_{L^p} \le 4\left\| \sup_{V_v\in\mathcal V(d)}\left| \left( V_v\, g \right)^\top\left( V_v\, g' \right) \right| \right\|_{L^p},$$

for some vector $d\in\mathbb C^m$.

Proof: Let $\delta\sim\mathcal{CN}(0,\sigma_g^2 I)$ be independent of $g$, and let

$$g^1 = g + \delta, \qquad g^2 = g - \delta,$$

so that $g^1$ and $g^2$ are also independent, with $g^1, g^2\sim\mathcal{CN}(0, 2\sigma_g^2 I)$. Let $Q_{dec}(g^1, g^2) = \left( V_v\, g^1 \right)^\top\left( V_v\, g^2 \right)$; then, since $\mathbb E\left[ \delta\delta^\top \right] = 0$ for a circularly-symmetric complex Gaussian vector, we have

$$\mathbb E_\delta\left[ Q_{dec}(g^1, g^2) \right] = \left( V_v\, g \right)^\top\left( V_v\, g \right).$$

Therefore, by Jensen's inequality, we have

$$\left\| \mathcal N(g) \right\|_{L^p} = \left( \mathbb E_g\left[ \sup_{V_v\in\mathcal V(d)}\left| \mathbb E_\delta\left[ Q_{dec}(g^1,g^2) \right] \right|^p \right] \right)^{1/p} \le \left( \mathbb E_{g^1,g^2}\left[ \sup_{V_v\in\mathcal V(d)}\left| Q_{dec}(g^1,g^2) \right|^p \right] \right)^{1/p} \le 4\left\| \sup_{V_v\in\mathcal V(d)}\left| \left( V_v\, g \right)^\top\left( V_v\, g' \right) \right| \right\|_{L^p},$$

where the last inequality follows because $g^1$ and $g^2$ are independent with variance $2\sigma_g^2$.
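The decoupling step in Lemma B.4 hinges on the fact that for a circularly-symmetric complex Gaussian vector $\delta$ the *unconjugated* second moment vanishes, $\mathbb E[\delta\delta^\top] = 0$, so the $\delta$-quadratic term of $Q_{dec}$ averages to zero. A Monte Carlo sketch of both facts (ours, not from the paper; $B$ is a generic stand-in for the bilinear form $V_v^\top V_v$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 4, 200_000

def cgauss(*shape):
    """Circularly-symmetric complex Gaussian CN(0,1) samples."""
    return (rng.normal(size=shape) + 1j * rng.normal(size=shape)) / np.sqrt(2)

# Key fact: E[delta delta^T] = 0 (transpose, no conjugate)
delta = cgauss(N, n)
M = delta.T @ delta / N                  # empirical E[delta delta^T]
print(np.abs(M).max())                   # ~ 1/sqrt(N), consistent with 0

# Hence E_delta[(g+delta)^T B (g-delta)] = g^T B g for any bilinear form B
B = cgauss(n, n)                         # hypothetical stand-in for V_v^T V_v
g = cgauss(n)
q = np.einsum('ti,ij,tj->t', g + delta, B, g - delta)   # Q_dec per sample
print(abs(q.mean() - g @ B @ g))         # -> 0 as N grows
```

Note this is special to the unconjugated pairing: the sesquilinear moment $\mathbb E[\delta\delta^*] = \sigma_g^2 I$ does not vanish, which is why the quadratic form in $\mathcal N(g)$ is written with transposes rather than adjoints.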