
Blind Demixing via Wirtinger Flow with Random Initialization

Jialin Dong ([email protected])    Yuanming Shi ([email protected])
ShanghaiTech University

Abstract

This paper concerns the problem of demixing a series of source signals from the sum of bilinear measurements. This problem spans diverse areas such as communication, imaging processing, machine learning, etc. However, semidefinite programming for blind demixing is prohibitive for large-scale problems due to its high computational complexity and storage cost. Although several efficient algorithms have been developed recently that enjoy fast convergence rates and are even regularization free, they still call for spectral initialization. To find a simple initialization approach that works equally well as spectral initialization, we propose to solve the blind demixing problem via Wirtinger flow with random initialization, which yields a natural implementation. To reveal the efficiency of this algorithm, we provide a global convergence guarantee for randomly initialized Wirtinger flow for blind demixing. Specifically, it shows that with sufficient samples, the iterates of randomly initialized Wirtinger flow enter a local region that enjoys strong convexity and strong smoothness within a few iterations in the first stage. In the second stage, the iterates of randomly initialized Wirtinger flow further converge linearly to the ground truth.

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

1 INTRODUCTION

Suppose we are given an observation vector $y \in \mathbb{C}^m$ in the frequency domain, generated from the sum of bilinear measurements of unknown vectors $x_i^\natural \in \mathbb{C}^N$, $h_i^\natural \in \mathbb{C}^K$, $i = 1, \cdots, s$, i.e.,
$$ y_j = \sum_{i=1}^{s} b_j^* h_i^\natural x_i^{\natural *} a_{ij}, \quad 1 \le j \le m, \qquad (1) $$
where $\{a_{ij}\} \subset \mathbb{C}^N$, $\{b_j\} \subset \mathbb{C}^K$ are design vectors. Here, the first $K$ columns of the matrix $F$ form the matrix $B := [b_1, \cdots, b_m]^* \in \mathbb{C}^{m \times K}$, where $F \in \mathbb{C}^{m \times m}$ is the unitary discrete Fourier transform (DFT) matrix with $FF^* = I_m$. Our goal is to recover $\{x_i^\natural\}$ and $\{h_i^\natural\}$ from the sum of bilinear measurements, which is known as blind demixing [1, 2].

This problem spans a wide scope of applications ranging from imaging processing [3, 4] and machine learning [5, 6] to communication [7, 8]. Specifically, by solving the blind demixing problem, both the original images and the corresponding convolutional kernels can be recovered from a blurred image [3]. This problem has also been exploited to demix and deconvolve calcium imaging recordings of neuronal ensembles [4]. Blind demixing also finds applications in machine learning [5, 6], where feature maps convolve with filters individually and are added over all users to approximate the input vector. In the context of communication, this problem makes it possible to recover the source signals from distinct users without knowing the channel vectors, thereby enabling low-latency communication [7].

Although blind demixing plays a vital role in various areas, solving it is generally highly intractable. Moreover, some applications of the blind demixing problem involve large-scale data, so efficient algorithms with optimality guarantees are urgently needed to recover the signal vectors from a single observation vector. Instead of the computationally expensive convex method [1], a nonconvex algorithm, regularized gradient descent [9], has recently been proposed to efficiently solve the blind demixing problem. This method, however, requires extra regularization and still yields conservative computational optimality guarantees. To elude the regularization and develop an algorithm with progressive computational guarantees, the least squares estimation approach was proposed in [2]:
$$ \mathscr{P}: \ \underset{\{h_i\}, \{x_i\}}{\text{minimize}} \ f(h, x) := \sum_{j=1}^{m} \Big| \sum_{i=1}^{s} b_j^* h_i x_i^* a_{ij} - y_j \Big|^2, \qquad (2) $$
which is solved via Wirtinger flow with spectral initialization, harnessing the benefits of low computational complexity, regularization freeness, a fast convergence rate with an aggressive step size, and computational optimality guarantees [2]. Nevertheless, this algorithm with theoretical guarantees depends on spectral initialization, which computes the largest singular vector of a data matrix [2]. To find a natural implementation for practitioners, we propose to solve the blind demixing problem via randomly initialized Wirtinger flow. Our goal, in this paper, is to confirm the efficiency of this algorithm with theoretical guarantees. Specifically, we shall verify that the iterates of randomly initialized Wirtinger flow reach a local region that enjoys strong convexity and strong smoothness within a few iterations in the first stage. In the second stage, we further show that the iterates of randomly initialized Wirtinger flow converge linearly to the ground truth.

1.1 State-of-the-Art Algorithms

Despite the general intractability of blind demixing, it can be effectively solved by several algorithms under proper statistical models. In particular, semidefinite programming was developed in [1] to solve the blind demixing problem by lifting the bilinear model into the matrix space. However, it is computationally prohibitive for large-scale problems due to the high computation and storage cost. To address this issue, a nonconvex algorithm, regularized gradient descent with spectral initialization [9], was further developed to optimize the variables in the natural space. Nevertheless, the theoretical guarantees for regularized gradient descent [9] provide a pessimistic convergence rate and require carefully designed initialization. The Riemannian trust-region optimization algorithm without regularization was further proposed in [7] to improve the convergence rate. However, this second-order algorithm brings unique challenges in providing statistical guarantees. Recently, theoretical guarantees concerning regularization-free Wirtinger flow with spectral initialization for blind demixing were provided in [2]. However, this regularization-free method still calls for spectral initialization. In this paper, we aim to explore the random initialization strategy for a natural implementation and theoretically verify the efficiency of randomly initialized Wirtinger flow.

Based on the random initialization strategy, a line of research studies benign global landscapes and aims to design generic saddle-escaping algorithms, e.g., noisy stochastic gradient descent [10], the trust-region method [11], and perturbed gradient descent [12]. With a sufficient sample size, these algorithms are guaranteed to converge globally for phase retrieval [11], matrix recovery [13], matrix sensing [13], robust PCA [14] and shallow neural networks [15], where all local minima are as good as global minima and all saddle points are strict. However, the theoretical results developed in [10, 11, 12, 13, 14, 15] are fairly general and may yield pessimistic convergence rate guarantees. Moreover, these saddle-point-escaping algorithms are more complicated to implement than natural vanilla gradient descent or Wirtinger flow. To advance the theoretical analysis of gradient descent with random initialization, a fast global convergence guarantee for randomly initialized gradient descent for phase retrieval was recently provided in [16].

1.2 Wirtinger Flow with Random Initialization

In this paper, we solve the blind demixing problem via randomly initialized Wirtinger flow, harnessing the benefits of computational efficiency, regularization freeness and freedom from careful initialization. Our main contribution is to provide a global convergence guarantee for randomly initialized Wirtinger flow.

Wirtinger flow with random initialization is an iterative algorithm with a vanilla gradient descent update procedure, i.e., without regularization. Specifically, the gradient step of Wirtinger flow is represented via the notion of the Wirtinger calculus [17], i.e., derivatives of real-valued functions over complex variables. To simplify the presentation, we denote $f(z) := f(h, x)$, where
$$ z = \begin{bmatrix} z_1 \\ \vdots \\ z_s \end{bmatrix} \in \mathbb{C}^{s(N+K)} \quad \text{with} \quad z_i = \begin{bmatrix} h_i \\ x_i \end{bmatrix} \in \mathbb{C}^{N+K}. \qquad (3) $$

Furthermore, we define the discrepancy between the estimate $z$ and the ground truth $z^\natural$ as the distance function
$$ \operatorname{dist}(z, z^\natural) = \Big( \sum_{i=1}^{s} \operatorname{dist}^2(z_i, z_i^\natural) \Big)^{1/2}, \qquad (4) $$
where
$$ \operatorname{dist}^2(z_i, z_i^\natural) = \min_{\alpha_i \in \mathbb{C}} \big( \| h_i / \overline{\alpha}_i - h_i^\natural \|_2^2 + \| \alpha_i x_i - x_i^\natural \|_2^2 \big) / d_i $$
for $i = 1, \cdots, s$. Here, $d_i = \|h_i^\natural\|_2^2 + \|x_i^\natural\|_2^2$ and each $\alpha_i$ is an alignment parameter.

Let $A^*$ and $a^*$ denote the conjugate transpose of a matrix $A$ and a vector $a$, respectively.
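To make the measurement model concrete, the following self-contained NumPy sketch instantiates (1): it builds $B$ from the first $K$ columns of the unitary DFT matrix and forms the sum of bilinear measurements. The problem sizes and the random seed are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
s, K, N, m = 3, 8, 6, 64   # illustrative sizes, not the paper's settings

# Unitary DFT matrix F (so that F F^* = I_m); row j of B = [b_1, ..., b_m]^*
# equals b_j^*, taken from the first K columns of F.
F = np.fft.fft(np.eye(m)) / np.sqrt(m)
B = F[:, :K]

# i.i.d. complex Gaussian design vectors a_ij and Gaussian ground-truth
# signals, in the spirit of (7).
A = (rng.standard_normal((s, m, N)) + 1j * rng.standard_normal((s, m, N))) / np.sqrt(2)
h = rng.standard_normal((s, K)) / np.sqrt(K)   # h_i^natural
x = rng.standard_normal((s, N)) / np.sqrt(N)   # x_i^natural

# y_j = sum_i b_j^* h_i x_i^* a_ij, the sum of bilinear measurements (1).
y = np.array([sum((B[j] @ h[i]) * (x[i].conj() @ A[i, j]) for i in range(s))
              for j in range(m)])

# Sanity checks: F is unitary and y is a length-m complex vector.
print(np.allclose(F @ F.conj().T, np.eye(m)), y.shape)   # → True (64,)
```

The loop over $j$ mirrors the per-measurement structure of (1); a vectorized version is used in the algorithm sketch of Section 1.3.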

Figure 1: Numerical results. (a) Relative error versus iteration count for K ∈ {10, 20, 40, 80, 100}, with Stage I and Stage II marked; (b) relative error versus iteration count for condition numbers κ ∈ {1, 2, 3}; (c)–(d) signal components α_{h_i}, α_{x_i} and perpendicular components β_{h_i}, β_{x_i} versus iteration count; (e)–(f) ratios α_{h_i}/β_{h_i} and α_{x_i}/β_{x_i} versus iteration count.

For each $i = 1, \cdots, s$, let $\nabla_{h_i} f(z)$ and $\nabla_{x_i} f(z)$ denote the Wirtinger gradients of $f(z)$ with respect to $h_i$ and $x_i$, respectively:
$$ \nabla_{h_i} f(z) = \sum_{j=1}^{m} \Big( \sum_{k=1}^{s} b_j^* h_k x_k^* a_{kj} - y_j \Big) b_j a_{ij}^* x_i, \qquad (5a) $$
$$ \nabla_{x_i} f(z) = \sum_{j=1}^{m} \Big( \sum_{k=1}^{s} h_k^* b_j a_{kj}^* x_k - \overline{y}_j \Big) a_{ij} b_j^* h_i. \qquad (5b) $$
In light of the Wirtinger gradient (5), the update rule of Wirtinger flow is given by
$$ \begin{bmatrix} h_i^{t+1} \\ x_i^{t+1} \end{bmatrix} = \begin{bmatrix} h_i^{t} \\ x_i^{t} \end{bmatrix} - \eta \begin{bmatrix} \frac{1}{\|x_i^t\|_2^2} \nabla_{h_i} f(z^t) \\ \frac{1}{\|h_i^t\|_2^2} \nabla_{x_i} f(z^t) \end{bmatrix}, \quad i = 1, \cdots, s, \qquad (6) $$
where $\eta > 0$ is the step size.

1.3 Numerical Results

The power of randomly initialized WF for solving problem $\mathscr{P}$ (2) is illustrated by numerical examples. The ground truth signals and initial points are randomly generated as
$$ h_i^\natural \sim \mathcal{N}(0, K^{-1} I_K), \quad x_i^\natural \sim \mathcal{N}(0, N^{-1} I_N), \qquad (7) $$
$$ h_i^0 \sim \mathcal{N}(0, K^{-1} I_K), \quad x_i^0 \sim \mathcal{N}(0, N^{-1} I_N), \qquad (8) $$
for $i = 1, \cdots, s$. In all simulations, we set $K = N$ and normalize $\|h_i^\natural\|_2 = \|x_i^\natural\|_2 = 1$ for $i = 1, \cdots, s$. Specifically, for each $K \in \{10, 20, 40, 80, 100\}$, $s = 10$ and $m = 50K$, the design vectors $a_{ij}$ follow $a_{ij} \sim \mathcal{N}(0, \tfrac{1}{2} I_N) + i \mathcal{N}(0, \tfrac{1}{2} I_N)$. With the step size $\eta = 0.1$ in all settings, Fig. 1(a) shows the relative error, i.e., $\| \sum_{i=1}^{s} h_i^t x_i^{t*} - \sum_{i=1}^{s} h_i^\natural x_i^{\natural *} \|_F / \| \sum_{i=1}^{s} h_i^\natural x_i^{\natural *} \|_F$, versus the iteration count, where $\|X\|_F$ denotes the Frobenius norm of the matrix $X$. We observe that the iterations of randomly initialized Wirtinger flow can be separated into two stages. Stage I: within the first tens of iterations, the relative error remains nearly flat; Stage II: the relative error enjoys a linear convergence rate that rarely changes as the problem size varies.

We further illustrate that the performance and convergence rate of Wirtinger flow with random initialization depend on the condition number, i.e., $\kappa := \frac{\max_i \|x_i^\natural\|_2}{\min_i \|x_i^\natural\|_2}$. In this experiment, let $K = 50$, $m = 800$, $s = 2$, and let the step size be $\eta = 0.2$. We then set $\|h_1^\natural\|_2 = \|x_1^\natural\|_2 = 1$ for the first component and $\|h_2^\natural\|_2 = \|x_2^\natural\|_2 = \kappa$ for the second one, with $\kappa \in \{1, 2, 3\}$. The random initialization (8) is utilized. Fig. 1(b) shows the relative error versus the iteration count. As we can see, a larger $\kappa$ yields a slower convergence rate.

2 MAIN RESULTS

In this section, we shall verify the preceding numerical results with theoretical guarantees. Throughout the theoretical analysis, we assume that the design vectors $a_{ij}$ follow $a_{ij} \sim \mathcal{N}(0, \tfrac{1}{2} I_N) + i \mathcal{N}(0, \tfrac{1}{2} I_N)$. To present the main theorem, we first introduce several fundamental notions and definitions.
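To make the algorithm concrete, the following self-contained NumPy sketch implements the Wirtinger gradients (5) and the update rule (6), starting from a random initialization in the spirit of (8), on a small synthetic instance of (2). The problem sizes, iteration count and seed are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
s, K, N = 2, 8, 8              # illustrative small sizes
m, eta, T = 50 * K, 0.1, 500   # sample size, step size, iterations

# Design: row j of B equals b_j^* (first K columns of the unitary DFT
# matrix), and a_ij are i.i.d. complex Gaussian vectors.
F = np.fft.fft(np.eye(m)) / np.sqrt(m)
B = F[:, :K]
A = (rng.standard_normal((s, m, N)) + 1j * rng.standard_normal((s, m, N))) / np.sqrt(2)

# Ground truth normalized to unit norm (Section 1.3) and a random start.
h_true = rng.standard_normal((s, K)) + 1j * rng.standard_normal((s, K))
x_true = rng.standard_normal((s, N)) + 1j * rng.standard_normal((s, N))
h_true /= np.linalg.norm(h_true, axis=1, keepdims=True)
x_true /= np.linalg.norm(x_true, axis=1, keepdims=True)
h = (rng.standard_normal((s, K)) / np.sqrt(K)).astype(complex)
x = (rng.standard_normal((s, N)) / np.sqrt(N)).astype(complex)

def forward(h, x):
    # Returns (y_hat, Bh, Ax) with y_hat_j = sum_i (b_j^* h_i)(x_i^* a_ij).
    Bh = B @ h.T                               # (m, s): entries b_j^* h_i
    Ax = np.einsum('imn,in->im', A, x.conj())  # (s, m): entries x_i^* a_ij
    return np.sum(Bh * Ax.T, axis=1), Bh, Ax

y = forward(h_true, x_true)[0]                 # observations, model (1)

losses = []
for _ in range(T):
    y_hat, Bh, Ax = forward(h, x)
    r = y_hat - y                              # residuals
    losses.append(float(np.sum(np.abs(r) ** 2)))
    for i in range(s):
        nx2, nh2 = np.linalg.norm(x[i]) ** 2, np.linalg.norm(h[i]) ** 2
        grad_h = B.conj().T @ (r * Ax[i].conj())   # Wirtinger gradient (5a)
        grad_x = A[i].T @ (r.conj() * Bh[:, i])    # Wirtinger gradient (5b)
        h[i] -= eta / nx2 * grad_h                 # update rule (6)
        x[i] -= eta / nh2 * grad_x

print(losses[0], losses[-1])   # the loss drops sharply over the run
```

On instances of this kind the loss decreases by several orders of magnitude, consistent with the two-stage behavior described above; recovery is only up to the inherent scaling ambiguity, which the loss itself is invariant to.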

Throughout this paper, $f(n) = O(g(n))$ or $f(n) \lesssim g(n)$ denotes that there exists a constant $c > 0$ such that $|f(n)| \le c|g(n)|$, whereas $f(n) \gtrsim g(n)$ means that there exists a constant $c > 0$ such that $|f(n)| \ge c|g(n)|$. $f(n) \gg g(n)$ denotes that there exists some sufficiently large constant $c > 0$ such that $|f(n)| \ge c|g(n)|$. In addition, the notation $f(n) \asymp g(n)$ means that there exist constants $c_1, c_2 > 0$ such that $c_1|g(n)| \le |f(n)| \le c_2|g(n)|$.

Furthermore, we need the incoherence parameter [9], which characterizes the incoherence between $b_j$ and $h_i^\natural$ for $1 \le i \le s$, $1 \le j \le m$.

Definition 1 (Incoherence for blind demixing). Let the incoherence parameter $\mu$ be the smallest number such that
$$ \max_{1 \le i \le s, \, 1 \le j \le m} \frac{|b_j^* h_i^\natural|}{\|h_i^\natural\|_2} \le \frac{\mu}{\sqrt{m}}. \qquad (9) $$

Let $\tilde{h}_i^t$ and $\tilde{x}_i^t$ be
$$ \tilde{h}_i^t = \frac{1}{\omega_i^t} h_i^t \quad \text{and} \quad \tilde{x}_i^t = \omega_i^t x_i^t, \qquad (10) $$
for $i = 1, \cdots, s$, respectively, where the $\omega_i$'s are alignment parameters. We further define the norms of the signal component and the perpendicular component with respect to $h_i$ for $i = 1, \cdots, s$ as
$$ \alpha_{h_i} := \langle h_i^\natural, \tilde{h}_i^t \rangle / \|h_i^\natural\|_2, \qquad (11) $$
$$ \beta_{h_i} := \Big\| \tilde{h}_i^t - \frac{\langle h_i^\natural, \tilde{h}_i^t \rangle}{\|h_i^\natural\|_2^2} h_i^\natural \Big\|_2, \qquad (12) $$
respectively. Similarly, the norms of the signal component and the perpendicular component with respect to $x_i$ for $i = 1, \cdots, s$ can be represented as
$$ \alpha_{x_i} := \langle x_i^\natural, \tilde{x}_i^t \rangle / \|x_i^\natural\|_2, \qquad (13) $$
$$ \beta_{x_i} := \Big\| \tilde{x}_i^t - \frac{\langle x_i^\natural, \tilde{x}_i^t \rangle}{\|x_i^\natural\|_2^2} x_i^\natural \Big\|_2, \qquad (14) $$
respectively.

Theorem 1. Assume that the initial points obey (8) for $i = 1, \cdots, s$ and that the stepsize $\eta > 0$ satisfies $\eta \asymp s^{-1}$. Suppose that the sample size satisfies $m \ge C \mu^2 s^2 \kappa^4 \max\{K, N\} \log^{12} m$ for some sufficiently large constant $C > 0$. Then with probability at least $1 - c_1 m^{-\nu} - c_1 m e^{-c_2 N}$ for some constants $\nu, c_1, c_2 > 0$, there exist a sufficiently small constant $0 \le \gamma \le 1$ and $T_\gamma \lesssim s \log(\max\{K, N\})$ such that

1. The randomly initialized WF linearly converges to $z^\natural$, i.e.,
$$ \operatorname{dist}(z^t, z^\natural) \le \gamma \Big( 1 - \frac{\eta}{16\kappa} \Big)^{t - T_\gamma} \| z^\natural \|_2, \quad t \ge T_\gamma. $$

2. The magnitude ratios of the signal component to the perpendicular component with respect to $h_i^t$ and $x_i^t$ obey
$$ \max_{1 \le i \le s} \frac{\alpha_{h_i^t}}{\beta_{h_i^t}} \gtrsim \frac{1}{\sqrt{K \log K}} (1 + c_3 \eta)^t, \qquad (15a) $$
$$ \max_{1 \le i \le s} \frac{\alpha_{x_i^t}}{\beta_{x_i^t}} \gtrsim \frac{1}{\sqrt{N \log N}} (1 + c_4 \eta)^t, \qquad (15b) $$
respectively, where $t = 0, 1, \cdots$, for some constants $c_3, c_4 > 0$.

Theorem 1 provides a precise statistical analysis of the computational efficiency of WF with random initialization. Specifically, in Stage I, it takes $T_\gamma = O(s \log(\max\{K, N\}))$ iterations for randomly initialized WF to reach a local region near the ground truth that enjoys strong convexity and strong smoothness. The short duration of Stage I is owing to the exponential growth of the magnitude ratio of the signal component to the perpendicular component (15). Moreover, in Stage II, it takes $O(s \log(1/\varepsilon))$ iterations to reach an $\varepsilon$-accurate solution at a linear convergence rate. Thus, randomly initialized WF is guaranteed to converge to the ground truth with iteration complexity $O(s \log(\max\{K, N\}) + s \log(1/\varepsilon))$, given the sample size $m \gtrsim s^2 \max\{K, N\} \operatorname{poly}\log(m)$.

To further illustrate the relationship between the signal component $\alpha_{h_i}$ (resp. $\alpha_{x_i}$) and the perpendicular component $\beta_{h_i}$ (resp. $\beta_{x_i}$) for $i = 1, \cdots, s$, we provide simulation results under the setting $K = N = 10$, $m = 50K$, $s = 4$ and $\eta = 0.1$, with $\|h_i^\natural\|_2 = \|x_i^\natural\|_2 = 1$ for $1 \le i \le s$. In particular, $\alpha_{h_i}, \beta_{h_i}$ versus the iteration count (resp. $\alpha_{x_i}, \beta_{x_i}$ versus the iteration count) for $i = 1, \cdots, s$ are demonstrated in Fig. 1(c) (resp. Fig. 1(d)). Considering Fig. 1(a), Fig. 1(c) and Fig. 1(d) collectively, we see that although the relative error barely declines during Stage I, the sizes of the signal components, i.e., $\alpha_{h_i}$ and $\alpha_{x_i}$ for each $i = 1, \cdots, s$, increase exponentially, and the signal component becomes the dominant component at the end of Stage I. Furthermore, the exponential growth of the ratio $\alpha_{h_i}/\beta_{h_i}$ (resp. $\alpha_{x_i}/\beta_{x_i}$) for each $i = 1, \cdots, s$ is illustrated in Fig. 1(e) (resp. Fig. 1(f)).

3 DYNAMICS ANALYSIS

In this section, we briefly summarize the proof of the main theorem, which is based on investigating the dynamics of the iterates of WF with random initialization. The steps of proving Theorem 1 are summarized as follows.

1. Stage I:

• Dynamics of population-level state evolution. Provide the population-level state evolution of $\alpha_{x_i}$ (20a) and $\beta_{x_i}$ (20b), $\alpha_{h_i}$ (21a) and $\beta_{h_i}$ (21b), respectively, where the sample size approaches infinity. We then develop the approximate state evolution (23), which is remarkably close to the population-level state evolution in the finite-sample regime. See details in Section 3.1.

• Dynamics of approximate state evolution. Show that there exists some $T_\gamma = O(s \log(\max\{K, N\}))$ such that $\operatorname{dist}(z^{T_\gamma}, z^\natural) \le \gamma$ if $\alpha_{h_i}$ (11), $\beta_{h_i}$ (12), $\alpha_{x_i}$ (13) and $\beta_{x_i}$ (14) satisfy the approximate state evolution (23). The exponential growth of the ratios $\alpha_{h_i}/\beta_{h_i}$ and $\alpha_{x_i}/\beta_{x_i}$ is further demonstrated under the same assumption. Please refer to Section 3.2.

• Leave-one-out arguments. Prove that with high probability $\alpha_{h_i}$, $\beta_{h_i}$, $\alpha_{x_i}$ and $\beta_{x_i}$ satisfy the approximate state evolution (23) if the iterates $\{z_i\}$ are independent of $\{a_{ij}\}$. See details in Section A of the supplemental material. To achieve this, the "near-independence" between $\{z_i\}$ and $\{a_{ij}\}$ is established by exploiting leave-one-out arguments and some variants thereof. Specifically, the leave-one-out sequences and random-sign sequences are constructed in Section 3.3. The concentration results between the original and these auxiliary sequences are provided in the supplemental material.

2. Stage II: local geometry in the region of incoherence and contraction. We invoke the prior theory provided in [2] to show local convergence of randomly initialized WF in Stage II. The claim (15) for Stage II is further proven. Please refer to Section A.3 in the supplemental material.

3.1 Dynamics of Population-level State Evolution

In this subsection, we investigate the dynamics of the population-level (i.e., infinite-sample) state evolution of $\alpha_{h_i}$ (11), $\beta_{h_i}$ (12), $\alpha_{x_i}$ (13) and $\beta_{x_i}$ (14). Without loss of generality, we assume that $x_i^\natural = q_i e_1$ for $i = 1, \cdots, s$, where $0 < q_i \le 1$, $i = 1, \cdots, s$, are some constants, $\kappa = \frac{\max_i q_i}{\min_i q_i}$, and $e_1$ denotes the first standard basis vector. This assumption is based on the rotational invariance of Gaussian distributions. Due to the deterministic nature of $\{b_j\}$, the ground truth signals $\{h_i^\natural\}$ (channel vectors) cannot be transferred to such a simple form, which yields a more tedious analysis procedure. For simplicity, for $i = 1, \cdots, s$, we denote
$$ x_{i1}^t \quad \text{and} \quad x_{i\perp}^t := [x_{ij}^t]_{2 \le j \le N}, \qquad (16) $$
as the first entry and the second through the $N$-th entries of $x_i^t$, respectively. Based on the assumption that $x_i^\natural = q_i e_1$ for $i = 1, \cdots, s$, (13) and (14) can be reformulated as
$$ \alpha_{x_i}^t := \tilde{x}_{i1}^t \quad \text{and} \quad \beta_{x_i}^t := \| \tilde{x}_{i\perp}^t \|_2. \qquad (17) $$

To study the population-level state evolution, we start by considering the case where the sequences $\{z_i^t\}$ (refer to (3)) are generated via the population gradient, i.e., for $i = 1, \cdots, s$,
$$ \begin{bmatrix} h_i^{t+1} \\ x_i^{t+1} \end{bmatrix} = \begin{bmatrix} h_i^{t} \\ x_i^{t} \end{bmatrix} - \eta \begin{bmatrix} \frac{1}{\|x_i^t\|_2^2} \nabla_{h_i} F(z^t) \\ \frac{1}{\|h_i^t\|_2^2} \nabla_{x_i} F(z^t) \end{bmatrix}, \qquad (18) $$
where
$$ \nabla_{h_i} F(z) := \mathbb{E}[\nabla_{h_i} f(h, x)] = \|x_i\|_2^2 h_i - (x_i^{\natural *} x_i) h_i^\natural, $$
$$ \nabla_{x_i} F(z) := \mathbb{E}[\nabla_{x_i} f(h, x)] = \|h_i\|_2^2 x_i - (h_i^{\natural *} h_i) x_i^\natural. $$
Here, the population gradients are computed based on the assumption that $\{x_i\}$ (resp. $\{h_i\}$) and $\{a_{ij}\}$ (resp. $\{b_j\}$) are independent of each other. With simple calculations, the dynamics of both the signal and the perpendicular components with respect to $x_i^t$, $i = 1, \cdots, s$, are given by
$$ \tilde{x}_{i1}^{t+1} = (1 - \eta) \tilde{x}_{i1}^t + \eta \frac{q_i}{\|\tilde{h}_i^t\|_2^2} h_i^{\natural *} \tilde{h}_i^t, \qquad (19a) $$
$$ \tilde{x}_{i\perp}^{t+1} = (1 - \eta) \tilde{x}_{i\perp}^t. \qquad (19b) $$
Assuming that $\eta > 0$ is sufficiently small and $\|h_i^\natural\|_2 = \|x_i^\natural\|_2 = q_i$ ($0 < q_i \le 1$) for $i = 1, \cdots, s$, and recognizing that $\|\tilde{h}_i^t\|_2^2 = \alpha_{h_i^t}^2 + \beta_{h_i^t}^2$, we arrive at the following population-level state evolution for $\alpha_{x_i^t}$ and $\beta_{x_i^t}$:
$$ \alpha_{x_i^{t+1}} = (1 - \eta) \alpha_{x_i^t} + \eta \frac{q_i \alpha_{h_i^t}}{\alpha_{h_i^t}^2 + \beta_{h_i^t}^2}, \qquad (20a) $$
$$ \beta_{x_i^{t+1}} = (1 - \eta) \beta_{x_i^t}. \qquad (20b) $$
Likewise, the population-level state evolution for $\alpha_{h_i^t}$ and $\beta_{h_i^t}$ is
$$ \alpha_{h_i^{t+1}} = (1 - \eta) \alpha_{h_i^t} + \eta \frac{q_i \alpha_{x_i^t}}{\alpha_{x_i^t}^2 + \beta_{x_i^t}^2}, \qquad (21a) $$
$$ \beta_{h_i^{t+1}} = (1 - \eta) \beta_{h_i^t}. \qquad (21b) $$
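The two-stage behavior can already be observed by iterating the population-level state evolution (20)–(21) directly. The sketch below does so for a single component with $q_i = 1$; the step size, horizon and initial values are illustrative assumptions mimicking a random start (small signal components, order-one perpendicular components).

```python
eta, q, T = 0.1, 1.0, 300   # illustrative step size, signal strength, horizon

# Illustrative start: tiny signal components, order-one perpendicular ones.
a_h = a_x = 0.01
b_h = b_x = 1.0

for _ in range(T):
    # One simultaneous step of the population-level state evolution (20)-(21):
    # beta shrinks by (1 - eta) each step, alpha is pulled toward q.
    a_x, b_x, a_h, b_h = (
        (1 - eta) * a_x + eta * q * a_h / (a_h**2 + b_h**2),
        (1 - eta) * b_x,
        (1 - eta) * a_h + eta * q * a_x / (a_x**2 + b_x**2),
        (1 - eta) * b_h,
    )

print(a_h, b_h)   # alpha has converged to q = 1, beta has decayed to ~0
```

The trajectory reproduces the two stages of Fig. 1: the signal components stay small for a while (the relative error appears flat), then grow rapidly once the perpendicular components $\beta = (1-\eta)^t$ have decayed, and finally converge to $q_i$.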

In the finite-sample case, the dynamics of the randomly initialized WF iterates can be represented as
$$ z_i^{t+1} = \begin{bmatrix} h_i^{t+1} \\ x_i^{t+1} \end{bmatrix} = \begin{bmatrix} h_i^{t} - \frac{\eta}{\|x_i^t\|_2^2} \nabla_{h_i} F(z) \\ x_i^{t} - \frac{\eta}{\|h_i^t\|_2^2} \nabla_{x_i} F(z) \end{bmatrix} - \begin{bmatrix} \frac{\eta}{\|x_i^t\|_2^2} \big( \nabla_{h_i} f(z) - \nabla_{h_i} F(z) \big) \\ \frac{\eta}{\|h_i^t\|_2^2} \big( \nabla_{x_i} f(z) - \nabla_{x_i} F(z) \big) \end{bmatrix} \qquad (22) $$
for $i = 1, \cdots, s$. Under the assumption that the last term in (22) is well controlled, which will be justified in Section D of the supplemental material, we arrive at the approximate state evolution:
$$ \alpha_{h_i^{t+1}} = \Big( 1 - \eta + \frac{\eta q_i \psi_{h_i^t}}{\alpha_{x_i^t}^2 + \beta_{x_i^t}^2} \Big) \alpha_{h_i^t} + \eta (1 - \rho_{h_i^t}) \frac{q_i \alpha_{x_i^t}}{\alpha_{x_i^t}^2 + \beta_{x_i^t}^2}, \qquad (23a) $$
$$ \beta_{h_i^{t+1}} = \Big( 1 - \eta + \frac{\eta q_i \varphi_{h_i^t}}{\alpha_{x_i^t}^2 + \beta_{x_i^t}^2} \Big) \beta_{h_i^t}, \qquad (23b) $$
$$ \alpha_{x_i^{t+1}} = \Big( 1 - \eta + \frac{\eta q_i \psi_{x_i^t}}{\alpha_{h_i^t}^2 + \beta_{h_i^t}^2} \Big) \alpha_{x_i^t} + \eta (1 - \rho_{x_i^t}) \frac{q_i \alpha_{h_i^t}}{\alpha_{h_i^t}^2 + \beta_{h_i^t}^2}, \qquad (23c) $$
$$ \beta_{x_i^{t+1}} = \Big( 1 - \eta + \frac{\eta q_i \varphi_{x_i^t}}{\alpha_{h_i^t}^2 + \beta_{h_i^t}^2} \Big) \beta_{x_i^t}, \qquad (23d) $$
where $\{\psi_{h_i^t}\}$, $\{\psi_{x_i^t}\}$, $\{\varphi_{h_i^t}\}$, $\{\varphi_{x_i^t}\}$, $\{\rho_{h_i^t}\}$ and $\{\rho_{x_i^t}\}$ represent the perturbation terms.

3.2 Dynamics of Approximate State Evolution

It is easily seen that if $\alpha_{h_i}$ (11), $\beta_{h_i}$ (12), $\alpha_{x_i}$ (13) and $\beta_{x_i}$ (14) obey
$$ |\alpha_{h_i^t} - q_i| \le \frac{\sqrt{2}\gamma}{4\kappa\sqrt{s}}, \quad \beta_{h_i^t} \le \frac{\sqrt{2}\gamma}{4\kappa\sqrt{s}}, \quad |\alpha_{x_i^t} - q_i| \le \frac{\sqrt{2}\gamma}{4\kappa\sqrt{s}}, \quad \beta_{x_i^t} \le \frac{\sqrt{2}\gamma}{4\kappa\sqrt{s}}, \qquad (24) $$
for $i = 1, \cdots, s$, then
$$ \operatorname{dist}(z, z^\natural) \le \Big( \frac{s\kappa^2}{2} \big( |\alpha_{h_i^t} - q_i| + |\beta_{h_i^t}| \big)^2 + \frac{s\kappa^2}{2} \big( |\alpha_{x_i^t} - q_i| + |\beta_{x_i^t}| \big)^2 \Big)^{1/2} \le \gamma. \qquad (25) $$
In this subsection, we shall show that as long as the approximate state evolution (23) holds, there exists some constant $T_\gamma = O(s \log \max\{K, N\})$ satisfying condition (24). This is demonstrated in the following lemma. Prior to that, we first list several conditions and definitions that contribute to the lemma.

• The initial points obey
$$ \alpha_{h_i^0} \ge \frac{q_i}{K \log K} \quad \text{and} \quad \alpha_{x_i^0} \ge \frac{q_i}{N \log N}, \qquad (26a) $$
$$ \sqrt{\alpha_{h_i^0}^2 + \beta_{h_i^0}^2} \in \Big[ \Big( 1 - \frac{1}{\log K} \Big) q_i, \ \Big( 1 + \frac{1}{\log K} \Big) q_i \Big], \qquad (26b) $$
$$ \sqrt{\alpha_{x_i^0}^2 + \beta_{x_i^0}^2} \in \Big[ \Big( 1 - \frac{1}{\log N} \Big) q_i, \ \Big( 1 + \frac{1}{\log N} \Big) q_i \Big], \qquad (26c) $$
for $i = 1, \cdots, s$.

• Define
$$ T_\gamma := \min \{ t : z^t \text{ satisfies } (24) \}, \qquad (27) $$
where $\gamma > 0$ is some sufficiently small constant.

• Define
$$ T_1 := \min \Big\{ t : \min_i \frac{\alpha_{h_i^t}}{q_i} \ge \frac{c_7}{\log^5 m}, \ \min_i \frac{\alpha_{x_i^t}}{q_i} \ge \frac{c_7'}{\log^5 m} \Big\}, \qquad (28) $$
$$ T_2 := \min \Big\{ t : \min_i \frac{\alpha_{h_i^t}}{q_i} > c_8, \ \min_i \frac{\alpha_{x_i^t}}{q_i} > c_8' \Big\}, \qquad (29) $$
for some small absolute positive constants $c_7, c_7', c_8, c_8' > 0$.

• For $0 \le t \le T_\gamma$, it holds that
$$ \frac{1}{2\sqrt{K \log K}} \le \frac{\alpha_{h_i^t}}{q_i} \le 2, \quad c_5 \le \frac{\beta_{h_i^t}}{q_i} \le 1.5, \quad \text{and} \quad \frac{\alpha_{h_i^{t+1}} / \alpha_{h_i^t}}{\beta_{h_i^{t+1}} / \beta_{h_i^t}} \ge 1 + c_5 \eta, \quad i = 1, \cdots, s, \qquad (30) $$
$$ \frac{1}{2\sqrt{N \log N}} \le \frac{\alpha_{x_i^t}}{q_i} \le 2, \quad c_6 \le \frac{\beta_{x_i^t}}{q_i} \le 1.5, \quad \text{and} \quad \frac{\alpha_{x_i^{t+1}} / \alpha_{x_i^t}}{\beta_{x_i^{t+1}} / \beta_{x_i^t}} \ge 1 + c_6 \eta, \quad i = 1, \cdots, s, \qquad (31) $$
for some constants $c_5, c_6 > 0$.

Lemma 1. Assume that the initial points obey condition (26) and that the perturbation terms in the approximate state evolution (23) obey $\max \{ |\psi_{h_i^t}|, |\psi_{x_i^t}|, |\varphi_{h_i^t}|, |\varphi_{x_i^t}|, |\rho_{h_i^t}|, |\rho_{x_i^t}| \} \le \frac{c}{\log m}$ for $i = 1, \cdots, s$, $t = 0, 1, \cdots$, and some sufficiently small constant $c > 0$.

1. Then, for any sufficiently large $K, N$ and stepsize $\eta > 0$ obeying $\eta \asymp s^{-1}$, it follows that
$$ T_\gamma \lesssim s \log(\max\{K, N\}), \qquad (32) $$
and that (30), (31) hold.

2. Then, with the stepsize $\eta > 0$ obeying $\eta \asymp s^{-1}$, one has $T_1 \le T_2 \le T_\gamma \lesssim s \log \max\{K, N\}$, $T_2 - T_1 \lesssim s \log \log m$, and $T_\gamma - T_2 \lesssim s$.

Proof. Please refer to Appendix C in the supplemental material.

The random initialization (8) satisfies condition (26) with probability at least $1 - O(1/\sqrt{\log \min\{K, N\}})$ [16]. According to this fact, Lemma 1 ensures that under both the random initialization (8) and the approximate state evolution (23) with stepsize $\eta \asymp s^{-1}$, Stage I lasts only a few iterations, i.e., $T_\gamma = O(s \log \max\{K, N\})$. In addition, Lemma 1 demonstrates the exponential growth of the ratios $\frac{\alpha_{h_i^{t+1}} / \alpha_{h_i^t}}{\beta_{h_i^{t+1}} / \beta_{h_i^t}}$, which contributes to the short duration of Stage I.

Moreover, Lemma 1 defines the midpoint $T_1$, at which the sizes of the signal components, i.e., $\alpha_{h_i^t}$ and $\alpha_{x_i^t}$, $i = 1, \cdots, s$, become sufficiently large; this is crucial to the subsequent analysis. In particular, when establishing the approximate state evolution (23) in Stage I, we analyze two subphases of Stage I individually:

• Phase 1: the iterations in $0 \le t \le T_1$;

• Phase 2: the iterations in $T_1 < t \le T_\gamma$;

where $T_1$ is defined in (28).

3.3 Leave-One-Out Approach

According to Section 3.1 and Lemma 1, the unique challenge in establishing the approximate state evolution (23) is to bound the perturbation terms to a certain order, i.e., $|\psi_{h_i^t}|, |\psi_{x_i^t}|, |\varphi_{h_i^t}|, |\varphi_{x_i^t}|, |\rho_{h_i^t}|, |\rho_{x_i^t}| \ll 1/\log m$ for $i = 1, \cdots, s$. To achieve this goal, we exploit some variants of the leave-one-out sequences [16] to establish the "near-independence" between $\{z_i^t\}$ and $\{a_{ij}\}$. Hence, some terms can be approximated by a sum of independent variables with well-controlled weights, and can thereby be controlled via the central limit theorem.

In the following, we define three sets of auxiliary sequences $\{z^{t,(l)}\}$, $\{z^{t,\mathrm{sgn}}\}$ and $\{z^{t,\mathrm{sgn},(l)}\}$, respectively.

• Leave-one-out sequences $\{z^{t,(l)}\}_{t \ge 0}$. For each $1 \le l \le m$, the auxiliary sequence $\{z^{t,(l)}\}$ is established by dropping the $l$-th sample and running randomly initialized WF with the objective function
$$ f^{(l)}(z) = \sum_{j : j \ne l} \Big| \sum_{i=1}^{s} b_j^* h_i x_i^* a_{ij} - y_j \Big|^2. \qquad (33) $$
Thus, the sequences $\{z_i^{t,(l)}\}$ (recall the definition of $z_i$ in (3)) are statistically independent of $\{a_{il}\}$.

• Random-sign sequences $\{z^{t,\mathrm{sgn}}\}_{t \ge 0}$. Define the auxiliary design vectors $\{a_{ij}^{\mathrm{sgn}}\}$ as
$$ a_{ij}^{\mathrm{sgn}} := \begin{bmatrix} \xi_{ij} a_{ij,1} \\ a_{ij,\perp} \end{bmatrix}, \qquad (34) $$
where $\{\xi_{ij}\}$ is a set of standard complex uniform random variables independent of $\{a_{ij}\}$, i.e.,
$$ \xi_{ij} \overset{\text{i.i.d.}}{=} u / |u|, \qquad (35) $$
where $u \sim \mathcal{N}(0, \tfrac{1}{2}) + i \mathcal{N}(0, \tfrac{1}{2})$. Moreover, with the corresponding $\xi_{ij}$, the auxiliary design vectors $\{b_j^{\mathrm{sgn}}\}$ are defined as $b_j^{\mathrm{sgn}} = \xi_{ij} b_j$. With these auxiliary design vectors, the sequences $\{z^{t,\mathrm{sgn}}\}$ are generated by running randomly initialized WF with respect to the loss function
$$ f^{\mathrm{sgn}}(z) = \sum_{j=1}^{m} \Big| \sum_{i=1}^{s} b_j^{\mathrm{sgn} *} h_i x_i^* a_{ij}^{\mathrm{sgn}} - \sum_{i=1}^{s} b_j^{\mathrm{sgn} *} h_i^\natural x_i^{\natural *} a_{ij}^{\mathrm{sgn}} \Big|^2. \qquad (36) $$
Note that these auxiliary design vectors $\{a_{ij}^{\mathrm{sgn}}\}$, $\{b_j^{\mathrm{sgn}}\}$ produce the same measurements as $\{a_{ij}\}$, $\{b_j\}$, i.e., $b_j^{\mathrm{sgn} *} h_i^\natural x_i^{\natural *} a_{ij}^{\mathrm{sgn}} = b_j^* h_i^\natural x_i^{\natural *} a_{ij} = q_i a_{ij,1} b_j^* h_i^\natural$ for $1 \le i \le s$, $1 \le j \le m$.

Note that all the auxiliary sequences are assumed to have the same initial point, namely, for $1 \le l \le m$, $\{z^0\} = \{z^{0,(l)}\} = \{z^{0,\mathrm{sgn}}\} = \{z^{0,\mathrm{sgn},(l)}\}$.

In view of the ambiguities, i.e., $h_i^\natural x_i^{\natural *} = \frac{1}{\overline{\omega}} h_i^\natural (\omega x_i^\natural)^*$, several alignment parameters are further defined for the sequel analysis. Specifically, the alignment parameter between $z_i^{t,(l)} = [h_i^{t,(l)*} \ x_i^{t,(l)*}]^*$ and $\tilde{z}_i^t = [\tilde{h}_i^{t*} \ \tilde{x}_i^{t*}]^*$, where $\tilde{h}_i^t = \frac{1}{\omega_i^t} h_i^t$ and $\tilde{x}_i^t = \omega_i^t x_i^t$, is represented as
$$ \omega_{i,\mathrm{mutual}}^{t,(l)} := \arg\min_{\omega \in \mathbb{C}} \ \Big\| \frac{1}{\omega} h_i^{t,(l)} - \frac{1}{\omega_i^t} h_i^t \Big\|_2^2 + \big\| \omega x_i^{t,(l)} - \omega_i^t x_i^t \big\|_2^2, \qquad (37) $$
for $i = 1, \cdots, s$. In addition, we denote $\hat{z}_i^{t,(l)} = [\hat{h}_i^{t,(l)*} \ \hat{x}_i^{t,(l)*}]^*$, where
$$ \hat{h}_i^{t,(l)} := \frac{1}{\omega_{i,\mathrm{mutual}}^{t,(l)}} h_i^{t,(l)} \quad \text{and} \quad \hat{x}_i^{t,(l)} := \omega_{i,\mathrm{mutual}}^{t,(l)} x_i^{t,(l)}. \qquad (38) $$
Define the alignment parameter between $z_i^{t,\mathrm{sgn}} = [h_i^{t,\mathrm{sgn}*} \ x_i^{t,\mathrm{sgn}*}]^*$ and $\tilde{z}_i^t = [\tilde{h}_i^{t*} \ \tilde{x}_i^{t*}]^*$ as
$$ \omega_{i,\mathrm{sgn}}^{t} := \arg\min_{\omega \in \mathbb{C}} \ \Big\| \frac{1}{\omega} h_i^{t,\mathrm{sgn}} - \frac{1}{\omega_i^t} h_i^t \Big\|_2^2 + \big\| \omega x_i^{t,\mathrm{sgn}} - \omega_i^t x_i^t \big\|_2^2, \qquad (39) $$
for $i = 1, \cdots, s$. In addition, we denote $\check{z}_i^{t,\mathrm{sgn}} = [\check{h}_i^{t,\mathrm{sgn}*} \ \check{x}_i^{t,\mathrm{sgn}*}]^*$, where
$$ \check{h}_i^{t,\mathrm{sgn}} := \frac{1}{\omega_{i,\mathrm{sgn}}^{t}} h_i^{t,\mathrm{sgn}} \quad \text{and} \quad \check{x}_i^{t,\mathrm{sgn}} := \omega_{i,\mathrm{sgn}}^{t} x_i^{t,\mathrm{sgn}}. \qquad (40) $$

3.4 Induction Hypotheses

In this subsection, we establish a collection of induction hypotheses which are crucial to the justification of the approximate state evolution (23). We list all the induction hypotheses: for $1 \le i \le s$,
$$ \max_{1 \le l \le m} \operatorname{dist}\big( z_i^{t,(l)}, \tilde{z}_i^t \big) \le \big( \beta_{h_i^t} + \beta_{x_i^t} \big) \Big( 1 + \frac{C_1}{s \log m} \Big)^t \sqrt{\frac{s \mu^2 \kappa^8 \max\{K, N\} \log^8 m}{m}}, \qquad (41a) $$
$$ \max_{1 \le l \le m} \operatorname{dist}\big( h_i^{\natural *} h_i^{t,(l)}, h_i^{\natural *} \tilde{h}_i^t \big) \cdot \| h_i^\natural \|_2^{-1} \le \alpha_{h_i^t} \Big( 1 + \frac{C_2}{s \log m} \Big)^t \sqrt{\frac{s \mu^2 \kappa K \log^{13} m}{m}}, \qquad (41b) $$
$$ \max_{1 \le l \le m} \operatorname{dist}\big( x_{i1}^{t,(l)}, \tilde{x}_{i1}^t \big) \le \alpha_{x_i^t} \Big( 1 + \frac{C_2}{s \log m} \Big)^t \sqrt{\frac{s \mu^2 \kappa N \log^{13} m}{m}}, \qquad (41c) $$
$$ \max_{1 \le i \le s} \operatorname{dist}\big( h_i^{t,\mathrm{sgn}}, \tilde{h}_i^t \big) \le \alpha_{h_i^t} \Big( 1 + \frac{C_3}{s \log m} \Big)^t \sqrt{\frac{s \mu^2 \kappa^2 K \log^8 m}{m}}, \qquad (41d) $$
$$ \max_{1 \le i \le s} \operatorname{dist}\big( x_i^{t,\mathrm{sgn}}, \tilde{x}_i^t \big) \le \alpha_{x_i^t} \Big( 1 + \frac{C_3}{s \log m} \Big)^t \sqrt{\frac{s \mu^2 \kappa^2 N \log^8 m}{m}}, \qquad (41e) $$
$$ \max_{1 \le l \le m} \big\| \tilde{h}_i^t - \hat{h}_i^{t,(l)} - \tilde{h}_i^{t,\mathrm{sgn}} + \hat{h}_i^{t,\mathrm{sgn},(l)} \big\|_2 \le \alpha_{h_i^t} \Big( 1 + \frac{C_4}{s \log m} \Big)^t \sqrt{\frac{s \mu^2 K \log^{16} m}{m}}, \qquad (41f) $$
$$ \max_{1 \le l \le m} \big\| \tilde{x}_i^t - \hat{x}_i^{t,(l)} - \tilde{x}_i^{t,\mathrm{sgn}} + \hat{x}_i^{t,\mathrm{sgn},(l)} \big\|_2 \le \alpha_{x_i^t} \Big( 1 + \frac{C_4}{s \log m} \Big)^t \sqrt{\frac{s \mu^2 N \log^{16} m}{m}}, \qquad (41g) $$
$$ c_5 \le \| h_i^t \|_2, \ \| x_i^t \|_2 \le C_5, \qquad (41h) $$
$$ \| h_i^t \|_2 \le 5 \alpha_{h_i^t} \sqrt{\log^5 m}, \qquad (41i) $$
$$ \| x_i^t \|_2 \le 5 \alpha_{x_i^t} \sqrt{\log^5 m}, \qquad (41j) $$
where $C_1, \cdots, C_5$ and $c_5$ are some absolute positive constants, and $\hat{h}_i^{t,(l)}, \hat{x}_i^{t,(l)}, \check{h}_i^{t,\mathrm{sgn}}, \check{x}_i^{t,\mathrm{sgn}}$ are defined in Section 3.3.

Specifically, (41a), (41c), (41d) and (41e) identify that the auxiliary sequences $\{z^{t,(l)}\}$ and $\{z^{t,\mathrm{sgn}}\}$ are extremely close to the original sequences $\{z^t\}$. In addition, as claimed in (41f) and (41g), $\tilde{h}_i^t - \tilde{h}_i^{t,\mathrm{sgn}}$ (resp. $\tilde{x}_i^t - \tilde{x}_i^{t,\mathrm{sgn}}$) and $\hat{h}_i^{t,(l)} - \hat{h}_i^{t,\mathrm{sgn},(l)}$ (resp. $\hat{x}_i^{t,(l)} - \hat{x}_i^{t,\mathrm{sgn},(l)}$) are also exceedingly close to each other. The hypothesis (41h) illustrates that the norms of the iterates $\{h_i^t\}$ (resp. $\{x_i^t\}$) are well controlled in Phase 1. Moreover, (41i) (resp. (41j)) indicates that $\alpha_{h_i^t}$ (resp. $\alpha_{x_i^t}$) is comparable to $\|h_i^t\|_2$ (resp. $\|x_i^t\|_2$).

Lemmas 4–9 in the supplemental material demonstrate that if the induction hypotheses (41) for $1 \le i \le s$ hold true up to the $t$-th iteration, then they are still satisfied at the $(t+1)$-th iteration under the condition of sufficient samples. Thus the auxiliary sequences are sufficiently close to the original sequences, which contributes to bounding the last term in (22) and thereby validates the approximate state evolution (23).

4 CONCLUSION

In this paper, we studied the least squares estimation approach to the blind demixing problem. To address the limitations of state-of-the-art algorithms, e.g., high computational complexity, slow convergence rates and the requirement of careful initialization, we developed randomly initialized Wirtinger flow to solve the blind demixing problem. A global convergence guarantee for this algorithm is further provided in this paper. It shows that the iterates of randomly initialized Wirtinger flow enter a local region that enjoys strong convexity and strong smoothness within a few iterations, followed by linear convergence to the ground truth.

Acknowledgements

This work was supported in part by the National Nature Science Foundation of China under Grant 61601290 and the Shanghai Sailing Program under Grant 16YF1407700.

References

[1] S. Ling and T. Strohmer, "Blind deconvolution meets blind demixing: Algorithms and performance bounds," IEEE Trans. Inf. Theory, vol. 63, pp. 4497–4520, Jul. 2017.

[2] J. Dong and Y. Shi, "Nonconvex demixing from bilinear measurements," IEEE Trans. Signal Process., vol. 66, pp. 5152–5166, Oct. 2018.

[3] Y. Zhang, Y. Lau, H.-w. Kuo, S. Cheung, A. Pasupathy, and J. Wright, "On the global geometry of sphere-constrained sparse blind deconvolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4894–4902, Jun. 2017.

[4] E. A. Pnevmatikakis, D. Soudry, Y. Gao, T. A. Machado, J. Merel, D. Pfau, T. Reardon, Y. Mu, C. Lacefield, W. Yang, et al., "Simultaneous denoising, deconvolution, and demixing of calcium imaging data," Neuron, vol. 89, pp. 285–299, Jan. 2016.

[5] H. Bristow, A. Eriksson, and S. Lucey, "Fast convolutional sparse coding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 391–398, 2013.

[6] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, "Deconvolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 2528–2535, Jun. 2010.

[7] J. Dong and Y. Shi, "Blind demixing for low-latency communication," IEEE Trans. Wireless Commun., Dec. 2018.

[8] X. Wang and H. V. Poor, "Blind equalization and multiuser detection in dispersive CDMA channels," IEEE Trans. Commun., vol. 46, pp. 91–103, Jan. 1998.

[9] S. Ling and T. Strohmer, "Regularized gradient descent: A nonconvex recipe for fast joint blind deconvolution and demixing," Inf. Inference: J. IMA, Mar. 2018.

[10] R. Ge, F. Huang, C. Jin, and Y. Yuan, "Escaping from saddle points — online stochastic gradient for tensor decomposition," in 28th Annu. Conf. Learn. Theory (COLT), pp. 797–842, Jul. 2015.

[11] J. Sun, Q. Qu, and J. Wright, "A geometric analysis of phase retrieval," Found. Comput. Math., vol. 18, pp. 1131–1198, Oct. 2018.

[12] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, "How to escape saddle points efficiently," in Proc. Int. Conf. Mach. Learn. (ICML), 2017.

[13] S. Bhojanapalli, B. Neyshabur, and N. Srebro, "Global optimality of local search for low rank matrix recovery," in Adv. Neural Inf. Process. Syst. (NIPS), pp. 3873–3881, 2016.

[14] R. Ge, C. Jin, and Y. Zheng, "No spurious local minima in nonconvex low rank problems: A unified geometric analysis," in Proc. Int. Conf. Mach. Learn. (ICML), pp. 1233–1242, 2017.

[15] M. Soltanolkotabi, A. Javanmard, and J. D. Lee, "Theoretical insights into the optimization landscape of over-parameterized shallow neural networks," IEEE Trans. Inf. Theory, Jul. 2018.

[16] Y. Chen, Y. Chi, J. Fan, and C. Ma, "Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval," Math. Program., to appear, 2018.

[17] E. J. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval via Wirtinger flow: Theory and algorithms," IEEE Trans. Inf. Theory, vol. 61, pp. 1985–2007, Apr. 2015.

[18] C. Ma, K. Wang, Y. Chi, and Y. Chen, "Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution," in Proc. Int. Conf. Mach. Learn. (ICML), pp. 3345–3354, Jul. 2018.

[19] K. Pothoven, Real and Functional Analysis. Springer Science & Business Media, 2013.
Dong and Y. Shi, “Blind demixing for low- [19] K. Pothoven, Real and functional analysis. latency communication,” IEEE Trans. Wireless Springer Science & Business Media, 2013. Commun., Dec. 2018. [8] X. Wang and H. V. Poor, “Blind equalization and multiuser detection in dispersive CDMA chan- nels,” IEEE Trans. Commun., vol. 46, pp. 91–103, Jan. 1998. [9] S. Ling and T. Strohmer, “Regularized gradient descent: A nonconvex recipe for fast joint blind deconvolution and demixing,” Inf. Inference: J. IMA, Mar. 2018. [10] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escap- ing from saddle points-online stochastic gradient for tensor decomposition.,” in 28th Annu. Proc. Conf. Learn. Theory (COLT), pp. 797–842, Jul. 2015. [11] J. Sun, Q. Qu, and J. Wright, “A geometric anal- ysis of phase retrieval,” Found. Comput. Math, vol. 18, pp. 1131–1198, Oct. 2018. [12] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, “How to escape saddle points effi- ciently,” Proc. Int. Conf. Mach. Learn. (ICML), 2017. [13] S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Global optimality of local search for low rank matrix recovery,” in Adv. Neural. Inf. Process. Syst. (NIPS), pp. 3873–3881, 2016.