Recursive Importance Sketching for Rank Constrained Least Squares: Algorithms and High-order Convergence

Yuetian Luo1, Wen Huang2, Xudong Li3, and Anru R. Zhang1,4

March 16, 2021

Abstract In this paper, we propose a new Recursive Importance Sketching algorithm for Rank constrained least squares Optimization (RISRO). As its name suggests, the algorithm is based on a new sketching framework, recursive importance sketching. Several existing algorithms in the literature can be reinterpreted under this new sketching framework, and RISRO offers clear advantages over them. RISRO is easy to implement and computationally efficient, where the core procedure in each iteration is only solving a dimension-reduced least squares problem. Different from numerous existing algorithms with only a local geometric convergence rate, we establish the local quadratic-linear and quadratic rates of convergence for RISRO under some mild conditions. In addition, we discover a deep connection of RISRO to Riemannian manifold optimization on fixed-rank matrices. The effectiveness of RISRO is demonstrated in two applications in machine learning and statistics: low-rank matrix trace regression and phase retrieval. Simulation studies demonstrate the superior numerical performance of RISRO.

Keywords: Rank constrained least squares, Sketching, Quadratic convergence, Riemannian manifold optimization, Low-rank matrix recovery, Non-convex optimization

1 Introduction

The focus of this paper is on the rank constrained least squares:

$$\min_{X \in \mathbb{R}^{p_1 \times p_2}} f(X) := \frac{1}{2}\|y - \mathcal{A}(X)\|_2^2, \quad \text{subject to } \operatorname{rank}(X) = r. \tag{1}$$

Here, $y \in \mathbb{R}^n$ is the given data and $\mathcal{A}: \mathbb{R}^{p_1 \times p_2} \to \mathbb{R}^n$ is a known linear map that can be explicitly represented as

$$\mathcal{A}(X) = \left[\langle A_1, X\rangle, \ldots, \langle A_n, X\rangle\right]^\top, \qquad \langle A_i, X\rangle = \sum_{1 \le j \le p_1,\, 1 \le k \le p_2} (A_i)_{[j,k]} X_{[j,k]}, \tag{2}$$

with given measurement matrices $A_i \in \mathbb{R}^{p_1 \times p_2}$, $i = 1, \ldots, n$.
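For concreteness, the linear map (2) and its adjoint can be evaluated by stacking the measurement matrices into a three-way array. The following NumPy sketch illustrates this; the function and array names (e.g., `apply_A`, `A_mats`) are ours and not from the paper.

```python
import numpy as np

def apply_A(A_mats, X):
    """Evaluate the linear map A(X) = [<A_1, X>, ..., <A_n, X>]^T.

    A_mats : array of shape (n, p1, p2) stacking the measurement matrices A_i.
    X      : array of shape (p1, p2).
    Returns a length-n vector of Frobenius inner products <A_i, X>.
    """
    return np.einsum('ijk,jk->i', A_mats, X)

def apply_A_adjoint(A_mats, z):
    """Adjoint map A*(z) = sum_i z_i A_i, a p1-by-p2 matrix."""
    return np.einsum('i,ijk->jk', z, A_mats)

# Small usage example with random Gaussian measurements.
rng = np.random.default_rng(0)
n, p1, p2 = 50, 8, 6
A_mats = rng.standard_normal((n, p1, p2))
X = rng.standard_normal((p1, p2))
y = apply_A(A_mats, X)   # shape (n,)
```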

1Department of Statistics, University of Wisconsin-Madison [email protected], [email protected] . Y. Luo would like to thank RAship from Institute for Foundations of Data Science at UW-Madison. 2School of Mathematical Sciences, Xiamen University [email protected] 3School of Data Science and Shanghai Center for Mathematical Sciences, Fudan University [email protected] 4Department of Biostatistics and Bioinformatics, Duke University

The expected rank is assumed to be known in Problem (1) since in some applications, such as phase retrieval and blind deconvolution, the expected rank is known to be one. If the expected rank is unknown, it is typical to optimize over the set of fixed-rank matrices using the formulation of (1) and dynamically update the rank; see, e.g., Vandereycken and Vandewalle (2010); Zhou et al. (2016). The rank constrained least squares (1) is motivated by the widely studied low-rank matrix recovery problem, where the goal is to recover a low-rank matrix $X^*$ from the observation $y = \mathcal{A}(X^*) + \varepsilon$ ($\varepsilon$ is the noise). This problem is of fundamental importance in a variety of fields such as optimization, machine learning, signal processing, scientific computing, and statistics. With different realizations of $\mathcal{A}$, (1) covers many applications, such as matrix trace regression (Candès and Plan, 2011; Davenport and Romberg, 2016), matrix completion (Candès and Tao, 2010; Keshavan et al., 2009; Koltchinskii et al., 2011; Miao et al., 2016), phase retrieval (Candès et al., 2013; Shechtman et al., 2015), blind deconvolution (Ahmed et al., 2013), and matrix recovery via rank-one projections (Cai and Zhang, 2015; Chen et al., 2015). To overcome the non-convexity and NP-hardness of directly solving (1) (Recht et al., 2010), various computationally feasible schemes have been developed in the past decade. In particular, the convex relaxation has been a central topic of interest (Recht et al., 2010; Candès and Plan, 2011):

$$\min_{X \in \mathbb{R}^{p_1 \times p_2}} \frac{1}{2}\|y - \mathcal{A}(X)\|_2^2 + \lambda\|X\|_*, \tag{3}$$

where $\|X\|_* = \sum_{i=1}^{\min(p_1,p_2)}\sigma_i(X)$ is the nuclear norm of $X$ and $\lambda > 0$ is a tuning parameter. Nevertheless, the convex relaxation technique has one well-documented limitation: the parameter space after relaxation is usually much larger than that of the target problem. Also, algorithms for solving the convex program often require singular value decompositions as a stepping stone and can be prohibitively time-consuming for large-scale instances. In addition, non-convex optimization provides another important class of algorithms for solving (1), which directly enforce the rank-$r$ constraint on the iterates. Since each iterate lies in a low-dimensional space, the computational cost of the non-convex approach can be much smaller than the convex regularized approach. In the last couple of years, there has been a flurry of research on non-convex methods for solving (1) (Chen and Wainwright, 2015; Hardt, 2014; Jain et al., 2013; Sun and Luo, 2015; Tran-Dinh and Zhang, 2016; Tu et al., 2016; Wen et al., 2012; Zhao et al., 2015; Zheng and Lafferty, 2015), and many of the algorithms, such as gradient descent and alternating minimization, are shown to have nice convergence properties under proper assumptions (Hardt, 2014; Jain et al., 2013; Sun and Luo, 2015; Tu et al., 2016; Zhao et al., 2015). We refer readers to Section 1.2 for a further review of recent work on convex and non-convex approaches to solving (1). In the existing literature, many algorithms for solving (1) either require careful tuning of hyperparameters or have a convergence rate no faster than linear. Thus, we raise the following question: Can we develop an easy-to-implement and efficient algorithm (ideally with per-iteration computational complexity comparable to first-order methods) that has provable high-order convergence guarantees (possibly to a stationary point, due to the non-convexity) for solving (1)? In this paper, we give an affirmative answer to this question by making contributions to the rank constrained optimization problem (1) as outlined next.

1.1 Our Contributions

We introduce an easy-to-implement and computationally efficient algorithm, Recursive Importance Sketching for Rank constrained least squares Optimization (RISRO), for solving (1) in this paper. The proposed algorithm is tuning free and has the same per-iteration computational complexity as Alternating Minimization (Jain et al., 2013), as well as comparable complexity to many popular first-order methods such as iterative hard thresholding (Jain et al., 2010) and gradient descent

(Tu et al., 2016) when $r \ll p_1, p_2, n$. We then illustrate the key idea of RISRO under a general framework of recursive importance sketching. This framework also provides a platform to compare RISRO and several existing algorithms for rank constrained least squares. Assuming $\mathcal{A}$ satisfies the restricted isometry property (RIP), we prove that RISRO enjoys local quadratic-linear convergence in general and quadratic convergence under some extra conditions. Figure 1 provides a numerical example on the performance of RISRO in noiseless low-rank matrix trace regression (left panel) and phase retrieval (right panel). In both problems, RISRO converges to the underlying parameter quadratically and reaches a highly accurate solution within five iterations. We will illustrate later that RISRO has the same per-iteration complexity as other first-order methods when $r$ is small while converging quadratically with provable guarantees. To the best of our knowledge, this is among the first algorithms to achieve this for the general rank constrained least squares problem.

[Plots: relative errors $\|X^t - X^*\|_F/\|X^*\|_F$ (left) and $\|x^t x^{t\top} - x^* x^{*\top}\|_F/\|x^* x^{*\top}\|_F$ (right) versus iteration number; the curves correspond to sample-size ratios $n/(pr)$ (left) and $n/p$ (right) in $\{4, 5, 6, 7, 8\}$.]

(a) Noiseless low-rank matrix trace regression: $y_i = \langle A_i, X^*\rangle$ for $1 \le i \le n$, $X^* \in \mathbb{R}^{p\times p}$ with $p = 100$, $\sigma_1(X^*) = \cdots = \sigma_3(X^*) = 3$, $\sigma_k(X^*) = 0$ for $4 \le k \le 100$, and each $A_i$ has independently identically distributed (i.i.d.) standard Gaussian entries. (b) Phase retrieval: $y_i = \langle a_i a_i^\top, x^* x^{*\top}\rangle$ for $1 \le i \le n$, $x^* \in \mathbb{R}^p$ with $p = 1200$, $a_i \overset{\text{i.i.d.}}{\sim} N(0, I_p)$.

Figure 1: RISRO achieves a quadratic rate of convergence (spectral initialization is used in each setting, and more details about the simulation setup are given in Section 7)

In addition, we discover a deep connection between RISRO and Riemannian manifold optimization. The least squares step in RISRO implicitly solves a Fisher scoring or Riemannian Gauss-Newton equation for the Riemannian optimization over fixed-rank matrices, and the updating rule in RISRO can be seen as a retraction map. With this connection, our theory on RISRO also improves the existing convergence results on the Riemannian Gauss-Newton method for the rank constrained least squares problem. Next, we apply RISRO to two important problems arising in machine learning and statistics: low-rank matrix trace regression and phase retrieval. In low-rank matrix trace regression, we prove that RISRO achieves the minimax optimal estimation error rate under the Gaussian ensemble design with only a double-logarithmic number of iterations. In phase retrieval, where $\mathcal{A}$ does not satisfy the RIP condition, we can still establish the local convergence of RISRO given a proper initialization. Finally, we conduct simulation studies to support our theoretical results and compare RISRO with many existing algorithms. The simulation studies show that RISRO not only offers faster and more robust convergence but also a smaller sample size requirement for low-rank matrix recovery, compared with existing approaches.

1.2 Related Literature

This work is related to a range of literature on low-rank matrix recovery, convex/non-convex optimization, and sketching arising from a number of communities, including optimization, machine learning, statistics, and applied mathematics. We make an attempt to review the related literature without claiming the survey is exhaustive. One class of the most popular approaches to solve (1) is the nuclear norm minimization (NNM) (3). Many algorithms have been proposed to solve NNM, such as proximal gradient descent (Toh and Yun, 2010), fixed-point continuation (FPC) (Goldfarb and Ma, 2011), and proximal point methods (Jiang et al., 2014). It has been shown that the solution of NNM has desirable properties under proper model assumptions (Cai and Zhang, 2013, 2014, 2015; Candès and Plan, 2011; Recht et al., 2010). In addition to NNM, the max norm minimization is another widely considered convex relaxation for the rank constrained optimization (Lee et al., 2010; Cai and Zhou, 2013). However, it is usually computationally intensive to solve these convex programs, and this motivates a line of work on non-convex approaches. Since Burer and Monteiro (2003), one of the most popular non-convex methods for solving (1) is to first factorize the low-rank matrix $X$ as $RL^\top$ with two factor matrices $R \in \mathbb{R}^{p_1\times r}$, $L \in \mathbb{R}^{p_2\times r}$, then run either gradient descent or alternating minimization on $R$ and $L$ (Candès et al., 2015; Li et al., 2019b; Ma et al., 2019; Park et al., 2018; Sanghavi et al., 2017; Sun and Luo, 2015; Tu et al., 2016; Wang et al., 2017c; Zhao et al., 2015; Zheng and Lafferty, 2015; Tong et al., 2020). Other methods, such as singular value projection or iterative hard thresholding (Goldfarb and Ma, 2011; Jain et al., 2010; Tanner and Wei, 2013), Grassmann manifold optimization (Boumal and Absil, 2011; Keshavan et al., 2009), and Riemannian manifold optimization (Huang and Hand, 2018; Meyer et al., 2011; Mishra et al., 2014; Vandereycken, 2013; Wei et al., 2016), have also been proposed and studied. We refer readers to the recent survey paper Chi et al. (2019) for a comprehensive overview of the existing literature on convex and non-convex approaches to solving (1). There are a few recent attempts at connecting the geometric structures of different approaches (Ha et al., 2020; Li et al., 2019a), and the landscape of problem (1) has also been studied in various settings (Bhojanapalli et al., 2016; Ge et al., 2017; Uschmajew and Vandereycken, 2018; Zhang et al., 2019; Zhu et al., 2018). Our work is also related to the idea of sketching in numerical linear algebra. Performing sketching to speed up computation via dimension reduction has been explored extensively in recent years (Mahoney, 2011; Woodruff, 2014). Sketching methods have been applied to solve a number of problems including but not limited to matrix approximation (Song et al., 2017; Zheng et al., 2012; Drineas et al., 2012), linear regression (Clarkson and Woodruff, 2017; Dobriban and Liu, 2019; Pilanci and Wainwright, 2016; Raskutti and Mahoney, 2016), and ridge regression (Wang et al., 2017b). In most of the sketching literature, the sketching matrices are randomly constructed (Mahoney, 2011; Woodruff, 2014). Randomized sketching matrices are easy to generate and require little storage for sparse sketching. However, randomized sketching can be suboptimal in statistical settings (Raskutti and Mahoney, 2016).
To overcome this, Zhang et al. (2020) introduced an idea of importance sketching in the context of low-rank tensor regression. In contrast to randomized sketching, importance sketching matrices are constructed deterministically under the supervision of the data and are shown to achieve better statistical efficiency. In this paper, we propose a more powerful recursive importance sketching algorithm in which the sketching matrices are recursively refined. We then provide a comprehensive convergence analysis of the proposed algorithm and demonstrate its advantages over other algorithms for the rank constrained least squares problem.

1.3 Organization of the Paper

The rest of this article is organized as follows. After a brief introduction of notation in Section 1.4, we present our main algorithm RISRO with an interpretation from the recursive importance sketching perspective in Section 2. The theoretical results of RISRO are given in Section 3. In Section 4, we present another interpretation of RISRO from the viewpoint of Riemannian manifold optimization. The computational complexity of RISRO and its applications to low-rank matrix trace regression and phase retrieval are discussed in Sections 5 and 6, respectively. Numerical studies of RISRO and comparisons with existing algorithms in the literature are presented in Section 7. Conclusions and future work are given in Section 8.

1.4 Notation

The following notation will be used throughout this article. Upper and lowercase letters (e.g., $A, B, a, b$), lowercase boldface letters (e.g., $u, v$), and uppercase boldface letters (e.g., $U, V$) are used to denote scalars, vectors, and matrices, respectively. For any two series of numbers, say $\{a_n\}$ and $\{b_n\}$, denote $a_n = O(b_n)$ if there exists a uniform constant $C > 0$ such that $a_n \le C b_n$ for all $n$. For any $a, b \in \mathbb{R}$, let $a \wedge b := \min\{a, b\}$ and $a \vee b := \max\{a, b\}$. For any matrix $X \in \mathbb{R}^{p_1\times p_2}$ with singular value decomposition $X = \sum_{i=1}^{p_1\wedge p_2}\sigma_i(X)u_i v_i^\top$, where $\sigma_1(X) \ge \sigma_2(X) \ge \cdots \ge \sigma_{p_1\wedge p_2}(X)$, let $X_{\max(r)} = \sum_{i=1}^r \sigma_i(X)u_i v_i^\top$ be the best rank-$r$ approximation of $X$, and denote $\|X\|_F = \sqrt{\sum_i \sigma_i^2(X)}$ and $\|X\| = \sigma_1(X)$ as the Frobenius norm and spectral norm, respectively. Let $\mathrm{QR}(X)$ be the Q part of the QR decomposition of $X$. $\mathrm{vec}(X) \in \mathbb{R}^{p_1 p_2}$ represents the vectorization of $X$ by its columns. In addition, $I_r$ is the $r$-by-$r$ identity matrix. Let $\mathbb{O}_{p,r} = \{U: U^\top U = I_r\}$ be the set of all $p$-by-$r$ matrices with orthonormal columns. For any $U \in \mathbb{O}_{p,r}$, $P_U = UU^\top$ represents the orthogonal projector onto the column space of $U$; we also denote by $U_\perp \in \mathbb{O}_{p,p-r}$ an orthonormal complement of $U$. We use bracket subscripts to denote sub-matrices. For example, $X_{[i_1,i_2]}$ is the entry of $X$ in the $i_1$-th row and $i_2$-th column; $X_{[(r+1):p_1,\,:]}$ contains the $(r+1)$-th to the $p_1$-th rows of $X$. For any matrix $X$, we use $X^\dagger$ to denote its Moore-Penrose inverse. For matrices $U \in \mathbb{R}^{p_1\times p_2}$ and $V \in \mathbb{R}^{m_1\times m_2}$, let

$$U \otimes V = \begin{bmatrix} U_{[1,1]}\cdot V & \cdots & U_{[1,p_2]}\cdot V \\ \vdots & & \vdots \\ U_{[p_1,1]}\cdot V & \cdots & U_{[p_1,p_2]}\cdot V \end{bmatrix} \in \mathbb{R}^{(p_1 m_1)\times(p_2 m_2)}$$

be their Kronecker product. Finally, for any given linear operator $\mathcal{L}$, we use $\mathcal{L}^*$ to denote its adjoint and $\mathrm{Ran}(\mathcal{L})$ to denote its range space.

2 Recursive Importance Sketching for Rank Constrained Least Squares

In this section, we discuss the procedure and interpretations of RISRO, then compare it with existing algorithms from a sketching perspective.

2.1 RISRO Procedure and Recursive Importance Sketching

The detailed procedure of RISRO is given in Algorithm 1. RISRO includes three steps in each iteration. Specifically, in the $t$-th iteration, we first sketch each $A_i$ onto the subspace spanned by $[U^t \otimes V^t,\ U_\perp^t \otimes V^t,\ U^t \otimes V_\perp^t]$, which yields the sketched importance covariates $U^{t\top}A_iV^t$, $U_\perp^{t\top}A_iV^t$, $U^{t\top}A_iV_\perp^t$ in (4). See the left panel of Figure 2 for an illustration of the sketching scheme of RISRO. Second, we solve a dimension-reduced least squares problem (5), where the number of parameters is reduced to $(p_1+p_2-r)r$ while the sample size remains $n$. Third, we update the sketching matrices $U^{t+1}$, $V^{t+1}$ and the iterate $X^{t+1}$ in Steps 6 and 7. Note that by construction, $U^{t+1}$, $V^{t+1}$ capture both the column and row spans of $X^{t+1}$. In particular, if $B^{t+1}$ is invertible, then $U^{t+1}$, $V^{t+1}$ are exactly orthonormal bases of the column and row spans of $X^{t+1}$, respectively.

Figure 2: Illustration of RISRO (this work), Alter Mini (Hardt, 2014; Jain et al., 2013), and R2RILS (Bauch and Nadler, 2020) from a sketching perspective

Algorithm 1 Recursive Importance Sketching for Rank Constrained Least Squares (RISRO)

1: Input: $\mathcal{A}(\cdot): \mathbb{R}^{p_1\times p_2} \to \mathbb{R}^n$, $y \in \mathbb{R}^n$, rank $r$, initialization $X^0$ which admits the singular value decomposition $U^0\Sigma^0 V^{0\top}$, where $U^0 \in \mathbb{O}_{p_1,r}$, $V^0 \in \mathbb{O}_{p_2,r}$, $\Sigma^0 \in \mathbb{R}^{r\times r}$
2: for $t = 0, 1, \ldots$ do
3: Perform importance sketching on $\mathcal{A}$ and construct the covariate maps $\mathcal{A}_B: \mathbb{R}^{r\times r}\to\mathbb{R}^n$, $\mathcal{A}_{D_1}: \mathbb{R}^{(p_1-r)\times r}\to\mathbb{R}^n$ and $\mathcal{A}_{D_2}: \mathbb{R}^{r\times(p_2-r)}\to\mathbb{R}^n$, where for $1\le i\le n$,
$$(A_B)_i = U^{t\top}A_iV^t, \quad (A_{D_1})_i = U_\perp^{t\top}A_iV^t, \quad (A_{D_2})_i = U^{t\top}A_iV_\perp^t. \tag{4}$$
Here, $(A_B)_i$ satisfies $[\mathcal{A}_B(\cdot)]_i = \langle\cdot,(A_B)_i\rangle$, and similarly for $(A_{D_1})_i$ and $(A_{D_2})_i$.
4: Solve the unconstrained least squares problem
$$(B^{t+1}, D_1^{t+1}, D_2^{t+1}) = \mathop{\arg\min}_{B\in\mathbb{R}^{r\times r},\, D_i\in\mathbb{R}^{(p_i-r)\times r},\, i=1,2} \left\|y - \mathcal{A}_B(B) - \mathcal{A}_{D_1}(D_1) - \mathcal{A}_{D_2}(D_2^\top)\right\|_2^2 \tag{5}$$
5: Compute $X_U^{t+1} = U^tB^{t+1} + U_\perp^t D_1^{t+1}$ and $X_V^{t+1} = V^tB^{t+1\top} + V_\perp^t D_2^{t+1}$.
6: Perform QR orthogonalization: $U^{t+1} = \mathrm{QR}(X_U^{t+1})$, $V^{t+1} = \mathrm{QR}(X_V^{t+1})$.
7: Update $X^{t+1} = X_U^{t+1}(B^{t+1})^\dagger X_V^{t+1\top}$.
8: end for

We give a high-level explanation of RISRO through a decomposition of $y_i$. Suppose $y_i = \langle A_i, \bar X\rangle + \bar\varepsilon_i$, where $\bar X$ is a rank-$r$ target matrix with singular value decomposition $\bar U\bar\Sigma\bar V^\top$, $\bar U \in \mathbb{O}_{p_1,r}$, $\bar\Sigma \in \mathbb{R}^{r\times r}$, and $\bar V \in \mathbb{O}_{p_2,r}$. Then

$$\begin{aligned} y_i &= \langle U^{t\top}A_iV^t, U^{t\top}\bar XV^t\rangle + \langle U_\perp^{t\top}A_iV^t, U_\perp^{t\top}\bar XV^t\rangle + \langle U^{t\top}A_iV_\perp^t, U^{t\top}\bar XV_\perp^t\rangle + \langle U_\perp^{t\top}A_iV_\perp^t, U_\perp^{t\top}\bar XV_\perp^t\rangle + \bar\varepsilon_i \\ &:= \langle U^{t\top}A_iV^t, U^{t\top}\bar XV^t\rangle + \langle U_\perp^{t\top}A_iV^t, U_\perp^{t\top}\bar XV^t\rangle + \langle U^{t\top}A_iV_\perp^t, U^{t\top}\bar XV_\perp^t\rangle + \varepsilon_i^t. \end{aligned} \tag{6}$$

Here, $\varepsilon^t := \mathcal{A}(P_{U_\perp^t}\bar XP_{V_\perp^t}) + \bar\varepsilon \in \mathbb{R}^n$ can be seen as the residual of the new regression model (6), and $U^{t\top}A_iV^t$, $U_\perp^{t\top}A_iV^t$, $U^{t\top}A_iV_\perp^t$ are exactly the importance covariates constructed in (4). Let

$$\widetilde B^t := U^{t\top}\bar XV^t, \quad \widetilde D_1^t := U_\perp^{t\top}\bar XV^t, \quad \widetilde D_2^{t\top} := U^{t\top}\bar XV_\perp^t. \tag{7}$$

If $\varepsilon^t = 0$, then $(\widetilde B^t, \widetilde D_1^t, \widetilde D_2^t)$ is a solution of the least squares problem (5). Hence, we could set $B^{t+1} = \widetilde B^t$, $D_1^{t+1} = \widetilde D_1^t$, $D_2^{t+1} = \widetilde D_2^t$, and thus $X_U^{t+1} = \bar XV^t$, $X_V^{t+1} = \bar X^\top U^t$. Furthermore, if $B^{t+1}$ is invertible, then it holds that

$$X_U^{t+1}(B^{t+1})^{-1}X_V^{t+1\top} = \bar XV^t(U^{t\top}\bar XV^t)^{-1}(\bar X^\top U^t)^\top = \bar X, \tag{8}$$

which means $\bar X$ can be exactly recovered by one iteration of RISRO.

In general, $\varepsilon^t \ne 0$. When the column spans of $U^t$, $V^t$ well approximate those of $\bar U$, $\bar V$, i.e., the column and row subspaces on which the target parameter $\bar X$ lies, we expect $U_\perp^{t\top}\bar XV_\perp^t$ and $\varepsilon_i^t = \langle U_\perp^{t\top}A_iV_\perp^t, U_\perp^{t\top}\bar XV_\perp^t\rangle + \bar\varepsilon_i$ to have a small amplitude, so $B^{t+1}, D_1^{t+1}, D_2^{t+1}$, the outcomes of the least squares problem (5), can well approximate $\widetilde B^t, \widetilde D_1^t, \widetilde D_2^t$. Lemma 1 gives a precise characterization of this approximation. Before that, let us introduce a convenient notation so that (5) can be written in a more compact way.

Define the linear operator $\mathcal{L}_t$ as

$$\mathcal{L}_t: W = \begin{bmatrix} W_0 & W_2 \\ W_1 & 0 \end{bmatrix} \longmapsto [U^t\ U_\perp^t]\begin{bmatrix} W_0 & W_2 \\ W_1 & 0 \end{bmatrix}[V^t\ V_\perp^t]^\top, \quad W_0\in\mathbb{R}^{r\times r},\ W_1\in\mathbb{R}^{(p_1-r)\times r},\ W_2\in\mathbb{R}^{r\times(p_2-r)}, \tag{9}$$

and it is easy to compute its adjoint $\mathcal{L}_t^*: M \in \mathbb{R}^{p_1\times p_2} \mapsto \begin{bmatrix} U^{t\top}MV^t & U^{t\top}MV_\perp^t \\ (U_\perp^t)^\top MV^t & 0 \end{bmatrix}$. Then, the least squares problem in (5) can be written as

$$(B^{t+1}, D_1^{t+1}, D_2^{t+1}) = \mathop{\arg\min}_{B\in\mathbb{R}^{r\times r},\, D_i\in\mathbb{R}^{(p_i-r)\times r},\, i=1,2}\left\|y - \mathcal{A}\mathcal{L}_t\left(\begin{bmatrix} B & D_2^\top \\ D_1 & 0\end{bmatrix}\right)\right\|_2^2. \tag{10}$$

Lemma 1 (Iteration Error Analysis for RISRO) Let $\bar X$ be any given target matrix. Recall the definition of $\varepsilon^t = \bar\varepsilon + \mathcal{A}(P_{U_\perp^t}\bar XP_{V_\perp^t})$ from (6). If the operator $\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t$ is invertible over $\mathrm{Ran}(\mathcal{L}_t^*)$, then $B^{t+1}, D_1^{t+1}, D_2^{t+1}$ in (5) satisfy

$$\begin{bmatrix} B^{t+1} - \widetilde B^t & D_2^{t+1\top} - \widetilde D_2^{t\top} \\ D_1^{t+1} - \widetilde D_1^t & 0 \end{bmatrix} = (\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t)^{-1}\mathcal{L}_t^*\mathcal{A}^*\varepsilon^t, \tag{11}$$

and

$$\|B^{t+1} - \widetilde B^t\|_F^2 + \sum_{k=1}^{2}\|D_k^{t+1} - \widetilde D_k^t\|_F^2 = \left\|(\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t)^{-1}\mathcal{L}_t^*\mathcal{A}^*\varepsilon^t\right\|_F^2. \tag{12}$$

In view of Lemma 1, the approximation errors of $B^{t+1}, D_1^{t+1}, D_2^{t+1}$ to $\widetilde B^t, \widetilde D_1^t, \widetilde D_2^t$ are driven by the least squares residual $\|(\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t)^{-1}\mathcal{L}_t^*\mathcal{A}^*\varepsilon^t\|_F^2$. This fact plays a key role in the proof of the high-order convergence theory of RISRO.

Remark 1 (Comparison with Randomized Sketching) The importance sketching in RISRO is significantly different from the randomized sketching in the literature (see the surveys Mahoney (2011); Woodruff (2014) and the references therein). While randomized sketching matrices are often randomly generated and reduce the sample size ($n$), the importance sketching matrices are deterministically constructed under the supervision of $y$ and reduce the dimension of the parameter space ($p_1p_2$). See (Zhang et al., 2020, Sections 1.3 and 2) for more comparison of randomized and importance sketching.

2.2 Comparison with More Algorithms in View of Sketching

In addition to RISRO, several classic algorithms for rank constrained least squares can be interpreted from a recursive importance sketching perspective. Through the lens of sketching, RISRO exhibits advantages over these existing algorithms. We first focus on Alternating Minimization (Alter Mini), proposed and studied in Hardt (2014); Jain et al. (2013); Zhao et al. (2015). Suppose $U^t$ contains the left singular vectors of $X^t$, the outcome of the $t$-th iteration. Alter Mini solves the following least squares problems to update $U$ and $V$:

$$\begin{aligned} \widehat V^{t+1} &= \mathop{\arg\min}_{V\in\mathbb{R}^{p_2\times r}}\sum_{i=1}^n\left(y_i - \langle A_i, U^tV^\top\rangle\right)^2 = \mathop{\arg\min}_{V\in\mathbb{R}^{p_2\times r}}\sum_{i=1}^n\left(y_i - \langle U^{t\top}A_i, V^\top\rangle\right)^2, \\ \widehat U^{t+1} &= \mathop{\arg\min}_{U\in\mathbb{R}^{p_1\times r}}\sum_{i=1}^n\left(y_i - \langle A_i, U(V^{t+1})^\top\rangle\right)^2 = \mathop{\arg\min}_{U\in\mathbb{R}^{p_1\times r}}\sum_{i=1}^n\left(y_i - \langle A_iV^{t+1}, U\rangle\right)^2, \\ V^{t+1} &= \mathrm{QR}(\widehat V^{t+1}), \quad U^{t+1} = \mathrm{QR}(\widehat U^{t+1}). \end{aligned} \tag{13}$$

Thus, Alter Mini essentially solves least squares problems with the sketched covariates $U^{t\top}A_i$ and $A_iV^{t+1}$ to update $V^{t+1}$ and $U^{t+1}$ alternatively and iteratively. The numbers of parameters of the least squares in (13) are $rp_2$ and $rp_1$, as opposed to $p_1p_2$, the number of parameters in the original least squares problem. See the upper right panel of Figure 2 for an illustration of the sketching scheme in Alter Mini. Consider the following decomposition of $y_i$:

$$y_i = \langle A_i, P_{U^t}\bar X\rangle + \langle A_i, P_{U_\perp^t}\bar X\rangle + \bar\varepsilon_i = \langle U^{t\top}A_i, U^{t\top}\bar X\rangle + \langle A_i, P_{U_\perp^t}\bar X\rangle + \bar\varepsilon_i := \langle U^{t\top}A_i, U^{t\top}\bar X\rangle + \widehat\varepsilon_i^t, \tag{14}$$

where $\widehat\varepsilon^t := \mathcal{A}(P_{U_\perp^t}\bar X) + \bar\varepsilon \in \mathbb{R}^n$. Define $\widehat A^t \in \mathbb{R}^{n\times p_2r}$ with $\widehat A^t_{[i,:]} = \mathrm{vec}(U^{t\top}A_i)$. Similar to how Lemma 1 is proved, we can show $\|\widehat V^{t+1\top} - U^{t\top}\bar X\|_F^2 = \|(\widehat A^{t\top}\widehat A^t)^{-1}\widehat A^{t\top}\widehat\varepsilon^t\|_2^2$, which implies that the approximation error of $V^{t+1} = \mathrm{QR}(\widehat V^{t+1})$ (i.e., the outcome of one Alter Mini iteration) to $\bar V$ (i.e., the true row span of the target matrix $\bar X$) is driven by $\widehat\varepsilon^t = \mathcal{A}(P_{U_\perp^t}\bar X) + \bar\varepsilon$, i.e., the residual of the least squares problem (14). Recall that for RISRO, Lemma 1 shows the approximation error of $V^{t+1}$ is driven by $\varepsilon^t = \mathcal{A}(P_{U_\perp^t}\bar XP_{V_\perp^t}) + \bar\varepsilon$. Since $\|P_{U_\perp^t}\bar XP_{V_\perp^t}\|_F \le \|P_{U_\perp^t}\bar X\|_F$, the per-iteration approximation error of RISRO can be smaller than that of Alter Mini. Such a difference between RISRO and Alter Mini is due to the following fact: in Alter Mini, the sketching captures only the importance covariates corresponding to the row (or column) span of $X^t$ when updating $V^{t+1}$ (or $U^{t+1}$), while the importance sketching of RISRO in (4) captures the importance covariates from both the row span and the column span of $X^t$. As a consequence, Alter Mini iterations yield first-order convergence, while RISRO iterations render high-order convergence, as will be established in Section 3.

Remark 2 Recently, Kümmerle and Sigl (2018) proposed a harmonic mean iterative reweighted least squares (HM-IRLS) method for low-rank matrix recovery via solving $\min_{X\in\mathbb{R}^{p_1\times p_2}}\|X\|_q$, subject to $y = \mathcal{A}(X)$, where $\|X\|_q = (\sum_i\sigma_i^q(X))^{1/q}$ is the Schatten-$q$ norm of the matrix $X$. Compared to the original iterative reweighted least squares (IRLS) (Fornasier et al., 2011; Mohan and Fazel, 2012), which only uses the column span of $X^t$ in constructing the reweighting matrix, HM-IRLS uses both the column and row spans of $X^t$ in constructing the reweighting matrix and results in better performance. Such a comparison of HM-IRLS versus IRLS shares the same spirit as RISRO versus Alter Mini: the importance sketching of RISRO captures the information of both the column and row spans of $X^t$ and achieves better performance.

Another example is the rank-$2r$ iterative least squares (R2RILS) method proposed in Bauch and Nadler (2020) for solving ill-conditioned matrix completion problems. In particular, at the $t$-th iteration, Step 1 of R2RILS solves the following least squares problem

$$\min_{M\in\mathbb{R}^{p_1\times r},\, N\in\mathbb{R}^{p_2\times r}}\sum_{(i,j)\in\Omega}\left\{\left(U^tN^\top + MV^{t\top}\right)_{[i,j]} - X_{[i,j]}\right\}^2, \tag{15}$$

where $\Omega$ is the set of index pairs of the observed entries. In the matrix completion setting, it turns out that the following equivalence holds (proof given in the Appendix)

$$\begin{aligned} &\mathop{\arg\min}_{M\in\mathbb{R}^{p_1\times r},\, N\in\mathbb{R}^{p_2\times r}}\sum_{(i,j)\in\Omega}\left\{\left(U^tN^\top + MV^{t\top}\right)_{[i,j]} - X_{[i,j]}\right\}^2 \\ &= \mathop{\arg\min}_{M\in\mathbb{R}^{p_1\times r},\, N\in\mathbb{R}^{p_2\times r}}\sum_{(i,j)\in\Omega}\left(\langle U^{t\top}A^{ij}, N^\top\rangle + \langle M, A^{ij}V^t\rangle - X_{[i,j]}\right)^2, \end{aligned} \tag{16}$$

where $A^{ij} \in \mathbb{R}^{p_1\times p_2}$ is the special covariate in matrix completion satisfying $(A^{ij})_{[k,l]} = 1$ if $(i,j) = (k,l)$ and $(A^{ij})_{[k,l]} = 0$ otherwise. This equivalence reveals that the least squares step (15) in R2RILS can be seen as an implicit sketched least squares problem similar to (5) and (13), with covariates $U^{t\top}A^{ij}$ and $A^{ij}V^t$ for $(i,j)\in\Omega$. We give a pictorial illustration of the sketching interpretation of R2RILS in the bottom right part of Figure 2. Different from the sketching in RISRO, R2RILS incorporates the core sketch $U^{t\top}A_iV^t$ twice, which results in rank deficiency in the least squares problem (15) and brings difficulties in both implementation and theoretical analysis. RISRO overcomes this issue by performing a better designed sketching and covers more general low-rank matrix recovery settings than R2RILS. With the new sketching scheme, we are able to give a new and solid theory for RISRO with high-order convergence.

3 Theoretical Analysis

In this section, we provide convergence analysis for the proposed algorithm. For technical convenience, we assume $\mathcal{A}$ satisfies the Restricted Isometry Property (RIP) (Candès, 2008). The RIP condition, first introduced in compressed sensing, has been widely used as one of the most standard assumptions in the low-rank matrix recovery literature (Cai and Zhang, 2013, 2014; Candès and Plan, 2011; Chen and Wainwright, 2015; Jain et al., 2010; Recht et al., 2010; Tu et al., 2016; Zhao et al., 2015). It also plays a critical role in analyzing the landscape of the rank constrained optimization problem (1) (Bhojanapalli et al., 2016; Ge et al., 2017; Uschmajew and Vandereycken, 2018; Zhang et al., 2019; Zhu et al., 2018). Moreover, it is practically useful as the condition has been shown to be satisfied with the desired sample size under random designs (Candès and Plan, 2011; Recht et al., 2010).

Definition 1 (Restricted Isometry Property (RIP)) Let $\mathcal{A}: \mathbb{R}^{p_1\times p_2} \to \mathbb{R}^n$ be a linear map. For every integer $r$ with $1 \le r \le \min(p_1, p_2)$, define the $r$-restricted isometry constant to be the smallest number $R_r$ such that $(1 - R_r)\|Z\|_F^2 \le \|\mathcal{A}(Z)\|_2^2 \le (1 + R_r)\|Z\|_F^2$ holds for all $Z$ of rank at most $r$. $\mathcal{A}$ is said to satisfy the $r$-restricted isometry property ($r$-RIP) if $0 \le R_r < 1$.
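The RIP constant is a worst-case quantity and cannot be certified by sampling; still, the following NumPy sketch (names are ours) illustrates the ratios $\|\mathcal{A}(Z)\|_2^2/\|Z\|_F^2$ that Definition 1 controls, by evaluating them on random rank-$r$ matrices under a Gaussian ensemble design.

```python
import numpy as np

def rip_ratios(A_mats, r, num_trials=200, seed=0):
    """Monte Carlo illustration of the ratios ||A(Z)||_2^2 / ||Z||_F^2 over random
    rank-r matrices Z. This only samples the quantity bounded in Definition 1;
    it does not certify the (worst-case) r-RIP constant.
    """
    rng = np.random.default_rng(seed)
    n, p1, p2 = A_mats.shape
    ratios = []
    for _ in range(num_trials):
        Z = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))
        Z /= np.linalg.norm(Z)                                  # normalize so ||Z||_F = 1
        ratios.append(np.sum(np.einsum('ijk,jk->i', A_mats, Z) ** 2))
    return np.array(ratios)

# For A_i with i.i.d. N(0, 1/n) entries, the ratios concentrate around 1
# when n is large relative to (p1 + p2) * r.
rng = np.random.default_rng(1)
n, p1, p2, r = 2000, 30, 30, 2
A_mats = rng.standard_normal((n, p1, p2)) / np.sqrt(n)
ratios = rip_ratios(A_mats, r)
print(ratios.min(), ratios.max())
```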

Note that, by definition, $R_r \le R_{r'}$ for $r \le r'$. By assuming RIP for $\mathcal{A}$, we can show that the linear operator $\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t$ mentioned in Lemma 1 is always invertible over $\mathrm{Ran}(\mathcal{L}_t^*)$ (i.e., the least squares (5) has a unique solution). In fact, we could give explicit lower and upper bounds for the spectrum of this operator.

Lemma 2 (Bounds for the Spectrum of $\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t$) Recall the definition of $\mathcal{L}_t$ in (9). It holds that

$$\|\mathcal{L}_t(M)\|_F = \|M\|_F, \quad \forall\, M \in \mathrm{Ran}(\mathcal{L}_t^*). \tag{17}$$

Suppose the linear map $\mathcal{A}$ satisfies the $2r$-RIP. Then, it holds that for any matrix $M \in \mathrm{Ran}(\mathcal{L}_t^*)$,

$$(1 - R_{2r})\|M\|_F \le \|\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t(M)\|_F \le (1 + R_{2r})\|M\|_F.$$

Remark 3 (Bounds for the spectrum of $(\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t)^{-1}$) By the relationship between the spectrum of an operator and that of its inverse, Lemma 2 also implies that the spectrum of $(\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t)^{-1}$ is lower and upper bounded by $\frac{1}{1+R_{2r}}$ and $\frac{1}{1-R_{2r}}$, respectively.

In the following Proposition 1, we bound the iteration approximation error given in Lemma 1.

Proposition 1 (Upper Bound for Iteration Approximation Error) Let $\bar X$ be a given target rank-$r$ matrix and $\bar\varepsilon = y - \mathcal{A}(\bar X)$. Suppose that $\mathcal{A}$ satisfies the $3r$-RIP. Then at the $t$-th iteration of RISRO, the approximation error (12) has the following upper bound:

$$\left\|(\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t)^{-1}\mathcal{L}_t^*\mathcal{A}^*\varepsilon^t\right\|_F^2 \le \frac{R_{3r}^2\|X^t - \bar X\|^2\|X^t - \bar X\|_F^2}{(1-R_{2r})^2\sigma_r^2(\bar X)} + \frac{\|\mathcal{L}_t^*\mathcal{A}^*(\bar\varepsilon)\|_F^2}{(1-R_{2r})^2} + \frac{2R_{3r}\|X^t - \bar X\|\|X^t - \bar X\|_F\|\mathcal{L}_t^*\mathcal{A}^*(\bar\varepsilon)\|_F}{\sigma_r(\bar X)(1-R_{2r})^2}. \tag{18}$$

Note that Proposition 1 is rather general in the sense that it applies to any $\bar X$ of rank $r$, and we will pick different choices of $\bar X$ depending on our purposes. For example, in studying the convergence of RISRO, e.g., the upcoming Theorem 1, we treat $\bar X$ as a stationary point, and in the setting of estimating the model parameters in matrix trace regression, we take $\bar X$ to be the ground truth (see Theorem 3).

Now, we are ready to establish the deterministic convergence theory for RISRO. For problem (1), we use the following definition of stationary points: a rank-$r$ matrix $\bar X$ is said to be a stationary point of (1) if $\nabla f(\bar X)^\top\bar U = 0$ and $\nabla f(\bar X)\bar V = 0$, where $\nabla f(\bar X) = \mathcal{A}^*(\mathcal{A}(\bar X) - y)$, and $\bar U$, $\bar V$ are the left and right singular vectors of $\bar X$. See also Ha et al. (2020). In Theorem 1, we show that given any target stationary point $\bar X$ and proper initialization, RISRO has a local quadratic-linear convergence rate in general and a quadratic convergence rate if $y = \mathcal{A}(\bar X)$.

Theorem 1 (Local Quadratic-Linear and Quadratic Convergence of RISRO) Let $\bar X$ be a stationary point of problem (1) and $\bar\varepsilon = y - \mathcal{A}(\bar X)$. Suppose that $\mathcal{A}$ satisfies the $3r$-RIP and the initialization $X^0$ satisfies

$$\|X^0 - \bar X\|_F \le \left(\frac{1}{4} \wedge \frac{1 - R_{2r}}{4\sqrt{5}R_{3r}}\right)\sigma_r(\bar X), \tag{19}$$

and $\|\mathcal{A}^*(\bar\varepsilon)\|_F \le \frac{1-R_{2r}}{4\sqrt{5}}\sigma_r(\bar X)$. Then $\{X^t\}$, the sequence generated by RISRO (Algorithm 1), converges Q-linearly to $\bar X$: $\|X^{t+1} - \bar X\|_F \le \frac{3}{4}\|X^t - \bar X\|_F$, $\forall\, t \ge 0$. More precisely, it holds for all $t \ge 0$ that

$$\|X^{t+1} - \bar X\|_F^2 \le \frac{5\|X^t - \bar X\|^2}{(1-R_{2r})^2\sigma_r^2(\bar X)}\cdot\left(R_{3r}^2\|X^t - \bar X\|_F^2 + 4R_{3r}\|\mathcal{A}^*(\bar\varepsilon)\|_F\|X^t - \bar X\|_F + 4\|\mathcal{A}^*(\bar\varepsilon)\|_F^2\right). \tag{20}$$

In particular, if $\bar\varepsilon = 0$, then $\{X^t\}$ converges quadratically to $\bar X$ as

$$\|X^{t+1} - \bar X\|_F \le \frac{\sqrt{5}R_{3r}}{(1-R_{2r})\sigma_r(\bar X)}\|X^t - \bar X\|_F^2, \quad \forall\, t \ge 0.$$

Remark 4 (Quadratic-Linear and Quadratic Convergence of RISRO) We call the convergence in (20) quadratic-linear since the sequence $\{X^t\}$ generated by RISRO exhibits a phase transition from quadratic to linear convergence: when $\|X^t - \bar X\|_F \gg \|\mathcal{A}^*(\bar\varepsilon)\|_F$, the algorithm has a quadratic convergence rate; when $X^t$ becomes close to $\bar X$ such that $\|X^t - \bar X\|_F \le c\|\mathcal{A}^*(\bar\varepsilon)\|_F$ for some $c > 0$, the convergence rate becomes linear. Moreover, as $\bar\varepsilon$ becomes smaller, the stage of quadratic convergence becomes longer (see Section 7.1 for a numerical illustration of this convergence pattern). In the extreme case $\bar\varepsilon = 0$, Theorem 1 covers the widely studied matrix sensing problem under the RIP framework (Chen and Wainwright, 2015; Jain et al., 2010; Park et al., 2018; Recht et al., 2010; Tu et al., 2016; Zhao et al., 2015; Zheng and Lafferty, 2015). It shows that as long as the initialization error is within a constant factor of $\sigma_r(\bar X)$, RISRO enjoys quadratic convergence to the target matrix $\bar X$.
To the best of our knowledge, these are among the first quadratic-linear algorithmic convergence guarantees for the general rank constrained least squares problem and quadratic convergence guarantees for matrix sensing. Recently, Charisopoulos et al. (2019) formulated (1) as a non-convex composite optimization problem based on the $X = RL^\top$ factorization and showed that the prox-linear algorithm (Burke, 1985; Lewis and Wright, 2016) achieves local quadratic convergence when $\bar\varepsilon = 0$. Note that in each iteration therein, a carefully tuned convex programming problem needs to be solved exactly. In contrast, the proposed RISRO is tuning free, only solves a dimension-reduced least squares problem in each step, and can be as cheap as many first-order methods. See Section 5 for a detailed discussion of the computational complexity of RISRO. It is noteworthy that a quadratic-linear convergence rate also appears in the recent Newton Sketch algorithm (Pilanci and Wainwright, 2017). However, Pilanci and Wainwright (2017) studied a convex problem using randomized sketching, which is significantly different from our setting, i.e., recursive importance sketching for non-convex matrix optimization.

Remark 5 (Initialization and global convergence of RISRO) The convergence theory in Theorem 1 requires a good initialization condition. Practically, the spectral method often provides a sufficiently good initialization that meets the requirement in (19) in many statistical applications. In Sections 6 and 7, we illustrate this point with two applications: matrix trace regression and phase retrieval. Moreover, our main finding in the next section, i.e., that RISRO can be interpreted as a Riemannian manifold optimization method, implies that standard globalization strategies in manifold optimization such as the line search or trust-region scheme (Absil et al., 2009; Nocedal and Wright, 2006) can be used to guarantee the global convergence of RISRO.

Remark 6 (Small residual condition in Theorem 1) In addition to the initialization condition, the small residual condition $\|\mathcal{A}^*(\bar\varepsilon)\|_F \le \frac{1-R_{2r}}{4\sqrt{5}}\sigma_r(\bar X)$ is also needed in Theorem 1. This condition essentially means that the signal strength at the point $\bar X$ needs to dominate the noise. If $\bar\varepsilon = y - \mathcal{A}(\bar X) = 0$, then the aforementioned small residual condition holds automatically.

4 A Riemannian Manifold Optimization Interpretation of RISRO

We gave an interpretation of RISRO via recursive importance sketching in Section 2 and developed the convergence results in Section 3. The superior performance of RISRO raises the following question: Is there a connection between RISRO and any class of optimization algorithms in the literature? In this section, we give an affirmative answer to this question. We show that RISRO can be viewed as a Riemannian optimization algorithm on the manifold $\mathcal{M}_r := \{X \in \mathbb{R}^{p_1\times p_2} \mid \mathrm{rank}(X) = r\}$. We find that the sketched least squares (5) in RISRO actually solves the Fisher scoring or Riemannian

Gauss-Newton equation, and Step 7 in RISRO performs a type of retraction under the framework of Riemannian optimization. Riemannian optimization concerns optimizing a real-valued function $f$ defined on a Riemannian manifold $\mathcal{M}$. One commonly encountered manifold is a submanifold of $\mathbb{R}^n$. Under such circumstances, a manifold can be viewed as a smooth subset of $\mathbb{R}^n$. When a smoothly varying inner product is further defined on the subset, the subset together with the inner product is called a Riemannian manifold. We refer to Absil et al. (2009) for the rigorous definition of Riemannian manifolds. Optimization on a Riemannian manifold often relies on the notions of the Riemannian gradient and Riemannian Hessian, which are used for finding a search direction, and the notion of retraction, which is defined for the motion of iterates on the manifold. The remainder of this section describes the required Riemannian optimization tools and the connection of RISRO to Riemannian optimization. It has been shown in (Lee, 2013, Example 8.14) that the set $\mathcal{M}_r$ is a smooth submanifold of $\mathbb{R}^{p_1\times p_2}$, and the tangent space is also given therein. The result is given in Proposition 2 for completeness.

Proposition 2 (Lee, 2013, Example 8.14) $\mathcal{M}_r = \{X \in \mathbb{R}^{p_1\times p_2}: \mathrm{rank}(X) = r\}$ is a smooth submanifold of dimension $(p_1+p_2-r)r$. Its tangent space $T_X\mathcal{M}_r$ at $X \in \mathcal{M}_r$ with the SVD decomposition $X = U\Sigma V^\top$ ($U \in \mathbb{O}_{p_1,r}$ and $V \in \mathbb{O}_{p_2,r}$) is given by:

$$T_X\mathcal{M}_r = \left\{[U\ U_\perp]\begin{bmatrix}\mathbb{R}^{r\times r} & \mathbb{R}^{r\times(p_2-r)} \\ \mathbb{R}^{(p_1-r)\times r} & 0_{(p_1-r)\times(p_2-r)}\end{bmatrix}[V\ V_\perp]^\top\right\}. \tag{21}$$

The Riemannian metric of $\mathcal{M}_r$ that we use throughout this paper is the Euclidean inner product, i.e., $\langle U, V\rangle = \mathrm{trace}(U^\top V)$. In the Euclidean setting, the update formula in an iterative algorithm is $X^t + \alpha\eta^t$, where $\alpha$ is the stepsize and $\eta^t$ is a descent direction. However, in the framework of Riemannian optimization, $X^t + \alpha\eta^t$ is generally neither well defined nor lying on the manifold. To overcome this difficulty, the notion of retraction is used; see, e.g., Absil et al. (2009). Considering the manifold $\mathcal{M}_r$, a retraction $R$ is a smooth map from $T\mathcal{M}_r$ to $\mathcal{M}_r$ satisfying i) $R(X, 0) = X$ and ii) $\frac{\mathrm{d}}{\mathrm{d}t}R(X, t\eta)\big|_{t=0} = \eta$ for all $X \in \mathcal{M}_r$ and $\eta \in T_X\mathcal{M}_r$, where $T\mathcal{M}_r = \{(X, T_X\mathcal{M}_r): X \in \mathcal{M}_r\}$ is the tangent bundle of $\mathcal{M}_r$. The two conditions guarantee that $R(X, t\eta)$ stays on $\mathcal{M}_r$ and that $R(X, t\eta)$ is a first-order approximation of $X + t\eta$ at $t = 0$. Next, we show that Step 7 in RISRO performs the orthographic retraction on the manifold of fixed-rank matrices given in Absil and Malick (2012). Suppose at iteration $t+1$, $B^{t+1}$ is invertible (this is true under the RIP framework; see Step 2 in the proof of Theorem 1). Some algebraic calculation shows that the update $X^{t+1}$ in Step 7 can be rewritten as

$$X^{t+1} = X_U^{t+1}\left(B^{t+1}\right)^{-1}X_V^{t+1\top} = [U^t\ U_\perp^t]\begin{bmatrix}B^{t+1} & D_2^{t+1\top} \\ D_1^{t+1} & D_1^{t+1}(B^{t+1})^{-1}D_2^{t+1\top}\end{bmatrix}[V^t\ V_\perp^t]^\top. \tag{22}$$

Let $\eta^t \in T_{X^t}\mathcal{M}_r$ be the update direction; then $X^t + \eta^t$ has the following representation,

$$X^t + \eta^t = [U^t\ U_\perp^t]\begin{bmatrix}B^{t+1} & D_2^{t+1\top} \\ D_1^{t+1} & 0\end{bmatrix}[V^t\ V_\perp^t]^\top. \tag{23}$$

Comparing (22) and (23), we can view the update of $X^{t+1}$ from $X^t + \eta^t$ as simply completing the $0$ block in $\begin{bmatrix}B^{t+1} & D_2^{t+1\top} \\ D_1^{t+1} & 0\end{bmatrix}$ by $D_1^{t+1}(B^{t+1})^{-1}D_2^{t+1\top}$. This operation maps the tangent vector on

$T_{X^t}\mathcal{M}_r$ back to the manifold $\mathcal{M}_r$, and it turns out that it coincides with the orthographic retraction

$$R(X^t, \eta^t) = [U^t\ U_\perp^t]\begin{bmatrix}B^{t+1} & D_2^{t+1\top} \\ D_1^{t+1} & D_1^{t+1}(B^{t+1})^{-1}D_2^{t+1\top}\end{bmatrix}[V^t\ V_\perp^t]^\top \tag{24}$$

on the set of fixed-rank matrices (Absil and Malick, 2012). Therefore, we have $X^{t+1} = R(X^t, \eta^t)$.
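As an illustration, the orthographic retraction (24) can be implemented in a few lines. The following NumPy sketch (names are ours) takes the blocks $(B, D_1, D_2^\top)$ of a tangent direction at the current iterate and returns the retracted rank-$r$ matrix, which coincides with Step 7 of Algorithm 1 when $B$ is invertible.

```python
import numpy as np

def orthographic_retraction(U, V, B, D1, D2T):
    """Orthographic retraction (24) on the fixed-rank manifold (a sketch).

    U, V      : (p1, r), (p2, r) orthonormal bases at the current iterate X^t.
    B, D1, D2T: blocks of the tangent direction; B (r x r) is assumed invertible,
                D1 is (p1-r) x r and D2T = D_2^T is r x (p2-r).
    Returns the retracted rank-r matrix, equal to X_U B^{-1} X_V^T as in (22).
    """
    p1, r = U.shape
    U_perp = np.linalg.qr(U, mode='complete')[0][:, r:]
    V_perp = np.linalg.qr(V, mode='complete')[0][:, r:]
    X_U = U @ B + U_perp @ D1              # p1 x r
    X_V = V @ B.T + V_perp @ D2T.T         # p2 x r
    return X_U @ np.linalg.solve(B, X_V.T)
```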

Remark 7 Although the orthographic retraction defined in Absil and Malick (2012) requires that $U^t$ and $V^t$ be the left and right singular vectors of $X^t$, one can verify that even if $U^t$ and $V^t$ are not exactly the left and right singular vectors but satisfy $U^t = \widetilde U^tO$, $V^t = \widetilde V^tQ$, the mapping (24) is still equivalent to the orthographic retraction in Absil and Malick (2012). Here, $O, Q \in \mathbb{O}_{r,r}$, and $\widetilde U^t$ and $\widetilde V^t$ are the left and right singular vectors of $X^t$.

The Riemannian gradient of a smooth function $f: \mathcal{M}_r \to \mathbb{R}$ at $X \in \mathcal{M}_r$ is defined as the unique tangent vector $\mathrm{grad}\, f(X) \in T_X\mathcal{M}_r$ such that $\langle\mathrm{grad}\, f(X), Z\rangle = \mathrm{D}f(X)[Z]$, $\forall\, Z \in T_X\mathcal{M}_r$, where $\mathrm{D}f(X)[Z]$ denotes the directional derivative of $f$ at the point $X$ along the direction $Z$. Since $\mathcal{M}_r$ is an embedded submanifold of $\mathbb{R}^{p_1\times p_2}$ and the Euclidean metric is used, from (Absil et al., 2009, (3.37)), we know that in our problem,

$$\mathrm{grad}\, f(X) = P_{T_X}\left(\mathcal{A}^*(\mathcal{A}(X) - y)\right), \tag{25}$$

and here $P_{T_X}$ is the orthogonal projector onto the tangent space at $X$, defined as follows

$$P_{T_X}(Z) = P_UZP_V + P_{U_\perp}ZP_V + P_UZP_{V_\perp}, \quad \forall\, Z \in \mathbb{R}^{p_1\times p_2}, \tag{26}$$

where $U \in \mathbb{O}_{p_1,r}$, $V \in \mathbb{O}_{p_2,r}$ are the left and right singular vectors of $X$. Next, we introduce the Riemannian Hessian. The Riemannian Hessian of $f$ at $X \in \mathcal{M}_r$ is the linear map $\mathrm{Hess}\, f(X)$ from $T_X\mathcal{M}_r$ onto itself defined as $\mathrm{Hess}\, f(X)[Z] = \nabla_Z\,\mathrm{grad}\, f(X)$, $\forall\, Z \in T_X\mathcal{M}_r$, where $\nabla$ is the Riemannian connection on $\mathcal{M}_r$ (Absil et al., 2009, Section 5.3). Lemma 3 gives an explicit formula for the Riemannian Hessian in our problem.

Lemma 3 (Riemannian Hessian) Consider $f(X)$ in (1). If $X \in \mathcal{M}_r$ has singular value decomposition $U\Sigma V^\top$ and $Z \in T_X\mathcal{M}_r$ has the representation

$$Z = [U\ U_\perp]\begin{bmatrix}Z_B & Z_{D_2}^\top \\ Z_{D_1} & 0\end{bmatrix}[V\ V_\perp]^\top,$$

then the Hessian operator in this setting satisfies

$$\mathrm{Hess}\, f(X)[Z] = P_{T_X}\big(\mathcal{A}^*(\mathcal{A}(Z))\big) + P_{U_\perp}\mathcal{A}^*(\mathcal{A}(X)-y)V_p\Sigma^{-1}V^\top P_V + P_UU\Sigma^{-1}U_p^\top\mathcal{A}^*(\mathcal{A}(X)-y)P_{V_\perp}, \tag{27}$$

where $U_p = U_\perp Z_{D_1}$ and $V_p = V_\perp Z_{D_2}$.

Next, we show that the update direction $\eta^t$, implicitly encoded in (23), is the Riemannian Gauss-Newton direction in the manifold optimization over $\mathcal{M}_r$. Similar to the classic Newton's method, at the $t$-th iteration, the Riemannian Newton method aims to find the Riemannian Newton direction $\eta^t_{\mathrm{Newton}} \in T_{X^t}\mathcal{M}_r$ that solves the following Newton equation

$$-\mathrm{grad}\, f(X^t) = \mathrm{Hess}\, f(X^t)[\eta^t_{\mathrm{Newton}}]. \tag{28}$$

If the residual $(y - \mathcal{A}(X^t))$ is small, the last two terms in $\mathrm{Hess}\, f(X^t)[\eta]$ of (27) are expected to be small, which means we can approximately solve for the Riemannian Newton direction via

$$-\mathrm{grad}\, f(X^t) = P_{T_{X^t}}\left(\mathcal{A}^*(\mathcal{A}(\eta))\right), \quad \eta \in T_{X^t}\mathcal{M}_r. \tag{29}$$

In fact, Equation (29) has an interpretation from the Fisher scoring algorithm. Consider the statistical setting $y = \mathcal{A}(X) + \varepsilon$, where $X$ is a fixed low-rank matrix and $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Then for any $\eta$,

$$\left\{\mathbb{E}\left(\mathrm{Hess}\, f(X)[\eta]\right)\right\}\big|_{X = X^t} = P_{T_{X^t}}\left(\mathcal{A}^*(\mathcal{A}(\eta))\right),$$

where on the left hand side, the expression is evaluated at $X^t$ after taking the expectation. In the literature, the Fisher scoring algorithm computes the update direction by solving the modified Newton equation in which the Hessian is replaced with its expected value (Lange, 2010), i.e.,

$$\left\{\mathbb{E}\left(\mathrm{Hess}\, f(X)[\eta]\right)\right\}\big|_{X = X^t} = -\mathrm{grad}\, f(X^t), \quad \eta \in T_{X^t}\mathcal{M}_r,$$

which exactly becomes (29) in our setting. Meanwhile, it is not difficult to show that the Fisher scoring algorithm here is equivalent to the Riemannian Gauss-Newton method for solving nonlinear least squares (Lange, 2010, Section 14.6) and (Absil et al., 2009, Section 8.4). Thus, the $\eta$ that solves equation (29) is also the Riemannian Gauss-Newton direction. It turns out that the update direction $\eta^t$ in (23) of RISRO solves the Fisher scoring or Riemannian Gauss-Newton equation (29):

Theorem 2 Let $\{X^t\}$ be the sequence generated by RISRO under the same assumptions as in Theorem 1. Then, for all $t \ge 0$, the implicitly encoded update direction $\eta^t$ in (23) solves the Riemannian Gauss-Newton equation (29).

Theorem 2, together with the retraction interpretation in (24), establishes the connection between RISRO and Riemannian manifold optimization. Following this connection, we further show in Proposition 3 that each $\eta^t$ is always a descent direction. Combined with the globalization schemes discussed in Remark 5, this fact is useful in boosting the local convergence of RISRO to global convergence.

Proposition 3 For all $t \ge 0$, the update direction $\eta^t \in T_{X^t}\mathcal{M}_r$ in (23) satisfies $\langle\mathrm{grad}\, f(X^t), \eta^t\rangle < 0$, i.e., $\eta^t$ is a descent direction. If $\mathcal{A}$ satisfies the $2r$-RIP, then the direction sequence $\{\eta^t\}$ is gradient related.

Remark 8 The convergence of Riemannian Gauss-Newton was studied in the recent work Breiding and Vannieuwenhoven (2018). Our results are significantly different from and offer improvements over Breiding and Vannieuwenhoven (2018) in the following ways. First, Breiding and Vannieuwenhoven (2018) considered a more general Riemannian Gauss-Newton setting, but their convergence results are established around a local minimum, which is a much stronger and less practical requirement than the stationary point assumption we need. Second, the initialization condition and convergence rate in Breiding and Vannieuwenhoven (2018) can be suboptimal in our rank constrained least squares setting. Third, our proof technique is quite different from that of Breiding and Vannieuwenhoven (2018). The proof of Breiding and Vannieuwenhoven (2018) is based on manifold optimization, while our proof is motivated by the insight of recursive importance sketching introduced in Section 2; in particular, the approximation error of the sketched least squares established in Lemma 1 plays a key role. Fourth, our recursive importance sketching framework provides new sketching interpretations for several classical algorithms for rank constrained least squares. In Section 6, we also apply RISRO to popular statistical models and establish statistical convergence results.
It is, however, not immediately clear how to utilize the results in Breiding and Vannieuwenhoven (2018) in these statistical settings.

                           Alter Mini               SVP        GD         RISRO (this work)
Complexity per iteration   $O(np^2r^2 + (pr)^3)$    $O(np^2)$  $O(np^2)$  $O(np^2r^2 + (pr)^3)$
Convergence rate           Linear                   Linear     Linear     Quadratic-(linear)

Table 1: Computational complexity per iteration and convergence rate for Alternating Minimization (Alter Mini) (Jain et al., 2013), singular value projection (SVP) (Jain et al., 2010), gradient descent (GD) (Tu et al., 2016), and RISRO

5 Computational Complexity of RISRO

In this section, we discuss the computational complexity of RISRO. Suppose $p_1 = p_2 = p$; then the computational complexity of RISRO per iteration is $O(np^2r^2 + (pr)^3)$ in the general setting. A comparison of the computational complexity of RISRO and other common algorithms is provided in Table 1. Here the main complexity of RISRO and Alter Mini comes from solving the least squares, while the main complexity of the singular value projection (SVP) (Jain et al., 2010) and gradient descent (Tu et al., 2016) comes from computing the gradient. From Table 1, we can see that RISRO has the same per-iteration complexity as Alter Mini and comparable complexity to SVP and GD when $n \ge pr$ and $r$ is much smaller than $n$ and $p$. On the other hand, RISRO and Alter Mini are tuning free, while a proper step size is crucial for SVP and GD to converge fast. Finally, RISRO enjoys high-order convergence, as we have shown in Section 3, while the convergence rates of all the other algorithms are limited to linear.

The main computational bottleneck of RISRO is solving the least squares problem, which can be alleviated by using iterative linear system solvers, such as the (preconditioned) conjugate gradient method, when the linear operator $\mathcal{A}$ has special structure. Such structure occurs, for example, in the matrix completion problem ($\mathcal{A}$ is sparse) (Vandereycken, 2013), phase retrieval for X-ray crystallography imaging ($\mathcal{A}$ involves fast Fourier transforms) (Huang et al., 2017b), and blind deconvolution for image deblurring ($\mathcal{A}$ involves fast Fourier transforms and Haar wavelet transforms) (Huang and Hand, 2018). To utilize these structures, we introduce an intrinsic representation of tangent vectors in $\mathcal{M}_r$: if $U, V$ are the left and right singular vectors of a rank-$r$ matrix $X$, an orthonormal basis of $T_X\mathcal{M}_r$ can be

$$\left\{[U\ U_\perp]\begin{bmatrix}e_ie_j^\top & 0_{r\times(p-r)} \\ 0_{(p-r)\times r} & 0_{(p-r)\times(p-r)}\end{bmatrix}[V\ V_\perp]^\top,\ i = 1,\ldots,r,\ j = 1,\ldots,r\right\} \cup \left\{[U\ U_\perp]\begin{bmatrix}0_{r\times r} & e_i\tilde e_j^\top \\ 0_{(p-r)\times r} & 0_{(p-r)\times(p-r)}\end{bmatrix}[V\ V_\perp]^\top,\ i = 1,\ldots,r,\ j = 1,\ldots,p-r\right\}$$

$$\cup \left\{[U\ U_\perp]\begin{bmatrix}0_{r\times r} & 0_{r\times(p-r)} \\ \tilde e_ie_j^\top & 0_{(p-r)\times(p-r)}\end{bmatrix}[V\ V_\perp]^\top,\ i = 1,\ldots,p-r,\ j = 1,\ldots,r\right\},$$

where $e_i$ and $\tilde e_i$ denote the $i$-th canonical basis vectors of $\mathbb{R}^r$ and $\mathbb{R}^{p-r}$, respectively. It follows that any tangent vector in $T_X\mathcal{M}_r$ can be uniquely represented by a coefficient vector in $\mathbb{R}^{(2p-r)r}$ via the basis above. This representation is called the intrinsic representation (Huang et al., 2017a). Computing the intrinsic representation of a Riemannian gradient can be computationally efficient. For example, the complexity of computing the Riemannian gradient in matrix completion is $O(nr + pr^2)$, and its intrinsic representation can be computed with an additional $O(pr^2)$ operations (Vandereycken, 2013). The complexities of computing the intrinsic representations of the Riemannian gradients in

phase retrieval and blind deconvolution are both $O(n\log(n)r + pr^2)$ (Huang et al., 2017b; Huang and Hand, 2018). By Theorem 2, the least squares problem (5) of RISRO is equivalent to solving for $\eta \in T_{X^t}\mathcal{M}_r$ such that $P_{T_{X^t}}(\mathcal{A}^*(\mathcal{A}(\eta))) = -\mathrm{grad}\, f(X^t)$. Reformulating this equation via the intrinsic representation yields

$$-\mathrm{grad}\, f(X^t) = P_{T_{X^t}}\left(\mathcal{A}^*(\mathcal{A}(\eta))\right) \ \Longrightarrow\ -u = B_X^*\left(\mathcal{A}^*(\mathcal{A}(B_X(v)))\right), \tag{30}$$

where $u, v$ are the intrinsic representations of $\mathrm{grad}\, f(X^t)$ and $\eta$, the mapping $B_X: \mathbb{R}^{(2p-r)r} \to T_X\mathcal{M}_r \subset \mathbb{R}^{p\times p}$ converts an intrinsic representation to the corresponding tangent vector, and $B_X^*: \mathbb{R}^{p\times p} \to \mathbb{R}^{(2p-r)r}$ is the adjoint operator of $B_X$. The computational complexity of using the conjugate gradient method to solve (30) is determined by the cost of evaluating the operator $B_X^*\circ(\mathcal{A}^*\mathcal{A})\circ B_X$ on a given vector. With the intrinsic representation, it can be shown that this evaluation costs $O(nr + pr^2)$ in matrix completion and $O(n\log(n)r + pr^2)$ in phase retrieval and blind deconvolution. Thus, when solving (30) via the conjugate gradient method, the complexity is $O(k(nr + pr^2))$ in matrix completion and $O(k(n\log(n)r + pr^2))$ in phase retrieval and blind deconvolution, where $k$ is the number of conjugate gradient iterations and is provably at most $(2p-r)r$. Hence, for special applications such as matrix completion, phase retrieval, and blind deconvolution, using the conjugate gradient method with the intrinsic representation can greatly reduce the per-iteration complexity of RISRO. This point will be further exploited in our future research.
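The following NumPy/SciPy sketch illustrates this matrix-free strategy: it solves the dimension-reduced least squares (10) by conjugate gradient on its normal equations $\mathcal{L}_t^*\mathcal{A}^*\mathcal{A}\mathcal{L}_t(z) = \mathcal{L}_t^*\mathcal{A}^*(y)$, so that each CG iteration costs one application of $\mathcal{A}$ and one of $\mathcal{A}^*$. For simplicity, it works in the $(B, D_1, D_2^\top)$ coordinates of (9) rather than the intrinsic basis above; all names are ours, and the same pattern applies to the intrinsic-representation equation (30) when $\mathcal{A}(\cdot)$ and $\mathcal{A}^*(\cdot)$ are fast.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_sketched_ls_cg(apply_A, apply_At, y, U, V):
    """Solve the sketched least squares (5)/(10) by CG on its normal equations,
    without forming the n x (p1+p2-r)r design matrix (a sketch).

    apply_A(X)  : evaluates A at a p1 x p2 matrix, returns a length-n vector.
    apply_At(z) : evaluates the adjoint A*, returns a p1 x p2 matrix.
    U, V        : (p1, r), (p2, r) orthonormal sketching matrices.
    Returns the blocks (B, D1, D2^T) of the solution in the coordinates of (9).
    """
    p1, r = U.shape
    p2 = V.shape[0]
    U_perp = np.linalg.qr(U, mode='complete')[0][:, r:]
    V_perp = np.linalg.qr(V, mode='complete')[0][:, r:]
    dim = (p1 + p2 - r) * r

    def unpack(z):
        z = np.ravel(z)
        B = z[:r * r].reshape(r, r)
        D1 = z[r * r:r * r + (p1 - r) * r].reshape(p1 - r, r)
        D2T = z[r * r + (p1 - r) * r:].reshape(r, p2 - r)
        return B, D1, D2T

    def L(z):       # L_t: coordinates -> p1 x p2 matrix, as in (9)
        B, D1, D2T = unpack(z)
        return U @ B @ V.T + U_perp @ D1 @ V.T + U @ D2T @ V_perp.T

    def L_adj(M):   # adjoint of L_t, stacked back into coordinates
        return np.concatenate([(U.T @ M @ V).ravel(),
                               (U_perp.T @ M @ V).ravel(),
                               (U.T @ M @ V_perp).ravel()])

    normal_op = LinearOperator((dim, dim),
                               matvec=lambda z: L_adj(apply_At(apply_A(L(z)))))
    rhs = L_adj(apply_At(y))
    z, _ = cg(normal_op, rhs, maxiter=dim)   # at most (p1+p2-r)r iterations
    return unpack(z)
```

Each matrix-vector product with `normal_op` reduces to one forward and one adjoint evaluation of $\mathcal{A}$ plus a few thin matrix multiplications, which is where the structural savings enter.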

6 Recursive Importance Sketching under Statistical Models

In this section, we study the applications of RISRO in machine learning and statistics. We specifically investigate low-rank matrix trace regression and phase retrieval, while our key ideas can be applied to more problems.

6.1 Low-Rank Matrix Trace Regression

Consider the low-rank matrix trace regression model:

$$y_i = \langle A_i, X^*\rangle + \varepsilon_i, \quad \text{for } 1 \le i \le n, \tag{31}$$

where $X^* \in \mathbb{R}^{p_1\times p_2}$ is the true model parameter and $\mathrm{rank}(X^*) = r$. The goal is to estimate $X^*$ from $\{y_i, A_i\}_{i=1}^n$. Due to the noise $\varepsilon_i$, $X^*$ usually cannot be exactly recovered. The following Theorem 3 shows that RISRO converges quadratic-linearly to $X^*$ up to some statistical error given a proper initialization. Moreover, under the Gaussian ensemble design, RISRO with spectral initialization achieves the minimax optimal estimation error rate.

Theorem 3 (RISRO in Matrix Trace Regression) Consider the low-rank matrix trace regression problem (31). Suppose that $\mathcal{A}$ satisfies the $3r$-RIP, the initialization of RISRO satisfies

$$\|X^0 - X^*\|_F \le \left(\frac{1}{4} \wedge \frac{1-R_{2r}}{2\sqrt{5}R_{3r}}\right)\sigma_r(X^*), \tag{32}$$

and

$$\sigma_r(X^*) \ge \left(16\sqrt{5} \vee \frac{40\sqrt{2}R_{3r}}{1-R_{2r}}\right)\frac{\|(\mathcal{A}^*(\varepsilon))_{\max(r)}\|_F}{1-R_{2r}}. \tag{33}$$

16 Then the iterations of RISRO converge as follows

$$\|X^{t+1} - X^*\|_F^2 \le \frac{10R_{3r}^2\|X^t - X^*\|_F^4}{(1-R_{2r})^2\sigma_r^2(X^*)} + \frac{20\|(\mathcal{A}^*(\varepsilon))_{\max(r)}\|_F^2}{(1-R_{2r})^2}, \quad \forall\, t \ge 0. \tag{34}$$

Moreover, assume $(A_i)_{[j,k]} \overset{\text{i.i.d.}}{\sim} N(0, 1/n)$ and $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2/n)$. Then there exist universal constants $C_1, C_2, C', c > 0$ such that as long as $n \ge C_1(p_1+p_2)r\left(\frac{\sigma^2}{\sigma_r^2(X^*)} \vee r\kappa^2\right)$ (here $\kappa = \frac{\sigma_1(X^*)}{\sigma_r(X^*)}$ is the condition number of $X^*$) and $t_{\max} \ge C_2\left(\log\log\left(\frac{\sigma_r^2(X^*)n}{r(p_1+p_2)\sigma^2}\right) \vee 1\right)$, the output of RISRO with spectral initialization $X^0 = (\mathcal{A}^*(y))_{\max(r)}$ satisfies $\|X^{t_{\max}} - X^*\|_F^2 \le c\,\frac{r(p_1+p_2)}{n}\sigma^2$ with probability at least $1 - \exp(-C'(p_1+p_2))$.

Remark 9 (Quadratic Convergence, Statistical Error, and Robustness) Note that there are two terms in the upper bound (34) of $\|X^{t+1} - X^*\|_F^2$. The first term corresponds to the optimization error, which decreases quadratically over the iteration $t$; the second term, $O(\|(\mathcal{A}^*(\varepsilon))_{\max(r)}\|_F^2)$, is the essential statistical error that is independent of the optimization algorithm. In addition, (34) shows that the error contraction factor is independent of the condition number $\kappa$, which demonstrates the robustness of RISRO to the ill-conditioning of the underlying low-rank matrix. We will further demonstrate this point by simulation studies in Section 7.2.

Remark 10 (Optimal Statistical Error) Under the Gaussian ensemble design, RISRO with spectral initialization achieves the estimation error rate $cr(p_1+p_2)\sigma^2/n$ after a double-logarithmic number of iterations when $n \ge C_1(p_1+p_2)r\left(\frac{\sigma^2}{\sigma_r^2(X^*)} \vee r\kappa^2\right)$. Compared with the lower bound on the estimation error,

$$\min_{\widehat X}\ \max_{\mathrm{rank}(X^*)\le r}\ \mathbb{E}\|\widehat X - X^*\|_F^2 \ge c'\frac{r(p_1+p_2)\sigma^2}{n}$$

for some $c' > 0$ in Candès and Plan (2011), RISRO achieves the minimax optimal estimation error. To the best of our knowledge, RISRO is the first provable algorithm that achieves the minimax rate-optimal estimation error with only a double-logarithmic number of iterations.

6.2 Phase Retrieval

In this section, we consider RISRO for solving the following quadratic equation system

$$y_i = |\langle a_i, x^*\rangle|^2 \quad \text{for } 1 \le i \le n, \tag{35}$$

where $y \in \mathbb{R}^n$ and the covariates $\{a_i\}_{i=1}^n \subseteq \mathbb{R}^p$ (or $\mathbb{C}^p$) are known, whereas $x^* \in \mathbb{R}^p$ (or $\mathbb{C}^p$) is unknown. The goal is to recover $x^*$ based on $\{y_i, a_i\}_{i=1}^n$. One important application is known as phase retrieval, arising in the physical sciences due to the nature of optical sensors (Fienup, 1982). In the literature, various approaches have been proposed for phase retrieval with provable guarantees, such as convex relaxation (Candès et al., 2013; Huang et al., 2017b; Waldspurger et al., 2015) and non-convex approaches (Candès et al., 2015; Chen and Candès, 2017; Gao and Xu, 2017; Ma et al., 2019; Netrapalli et al., 2013; Sanghavi et al., 2017; Wang et al., 2017a; Duchi and Ruan, 2019). For ease of exposition, we focus on the real-valued model, i.e., $x^* \in \mathbb{R}^p$ and $a_i \in \mathbb{R}^p$, while a simple trick in Sanghavi et al. (2017) can recast problem (35) in the complex model into a rank-2 real-valued matrix recovery problem, so our approach still applies. In the real-valued setting, we can rewrite model (35) as a low-rank matrix recovery model

$$y = \mathcal{A}(X^*) \quad \text{with } X^* = x^*x^{*\top} \ \text{ and } \ [\mathcal{A}(X^*)]_i = \langle a_ia_i^\top, x^*x^{*\top}\rangle. \tag{36}$$

There are two challenges in phase retrieval compared to the low-rank matrix trace regression considered previously. First, due to the symmetry of the sensing matrices $a_ia_i^\top$ and $x^*x^{*\top}$ in phase retrieval, the importance covariates $\mathcal{A}_{D_1}$ and $\mathcal{A}_{D_2}$ in (4) are exactly the same, and an adaptation of Algorithm 1 is thus needed. Second, in phase retrieval, the mapping $\mathcal{A}$ no longer satisfies a proper RIP condition in general (Cai and Zhang, 2015; Candès et al., 2013), so new theory is needed. To this end, we introduce a modified RISRO for phase retrieval in Algorithm 2. In particular, in Step 4 of Algorithm 2, we multiply the importance covariates $A_2$ by an extra factor 2 to account for the duplicate importance covariates due to symmetry.

Algorithm 2 RISRO for Phase Retrieval

1: Input: design vectors $\{a_i\}_{i=1}^n \subseteq \mathbb{R}^p$, $y \in \mathbb{R}^n$, initialization $X^0$ that admits the eigenvalue decomposition $\sigma_1u^0u^{0\top}$
2: for $t = 0, 1, \ldots$ do
3: Perform importance sketching on $a_i$ and construct the covariates $A_1 \in \mathbb{R}^n$, $A_2 \in \mathbb{R}^{n\times(p-1)}$, where for $1 \le i \le n$, $(A_1)_i = (a_i^\top u^t)^2$, $(A_2)_{[i,:]} = u_\perp^{t\top}a_ia_i^\top u^t$.
4: Solve the unconstrained least squares problem $(b^{t+1}, d^{t+1}) = \arg\min_{b\in\mathbb{R},\, d\in\mathbb{R}^{p-1}}\|y - A_1b - 2A_2d\|_2^2$.
5: Compute the eigenvalue decomposition of $[u^t\ u_\perp^t]\begin{bmatrix}b^{t+1} & d^{t+1\top} \\ d^{t+1} & 0\end{bmatrix}[u^t\ u_\perp^t]^\top$, and denote it as $[v_1\ v_2]\begin{bmatrix}\lambda_1 & 0 \\ 0 & \lambda_2\end{bmatrix}[v_1\ v_2]^\top$ with $\lambda_1 \ge \lambda_2$.
6: Update $u^{t+1} = v_1$ and $X^{t+1} = \lambda_1u^{t+1}u^{t+1\top}$.
7: end for
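A minimal NumPy sketch of one iteration of Algorithm 2 is given below (names are ours). For the eigenvalue step we form the rank-2 lift explicitly and assume, as expected near the positive semidefinite target, that its leading eigenvalue is positive.

```python
import numpy as np

def risro_phase_retrieval_iteration(a_vecs, y, u):
    """One iteration of Algorithm 2 (RISRO for phase retrieval), as a sketch.

    a_vecs : (n, p) design vectors a_i stacked as rows; y : (n,) measurements;
    u      : (p,) current unit-norm estimate of the leading eigenvector.
    Returns (u_new, X_new, lambda1).
    """
    n, p = a_vecs.shape
    u = u / np.linalg.norm(u)
    u_perp = np.linalg.qr(u.reshape(-1, 1), mode='complete')[0][:, 1:]   # (p, p-1)

    # Step 3: importance covariates (A1)_i = (a_i^T u)^2, (A2)_{i,:} = u_perp^T a_i a_i^T u.
    au = a_vecs @ u                              # (n,)
    A1 = au ** 2                                 # (n,)
    A2 = au[:, None] * (a_vecs @ u_perp)         # (n, p-1)

    # Step 4: least squares over (b, d); the factor 2 accounts for symmetry.
    design = np.hstack([A1[:, None], 2.0 * A2])
    sol = np.linalg.lstsq(design, y, rcond=None)[0]
    b, d = sol[0], sol[1:]

    # Steps 5-6: rank-2 lift, then keep the leading eigenpair
    # (we assume the leading eigenvalue is positive near the PSD target).
    w = u_perp @ d                               # (p,)
    M = b * np.outer(u, u) + np.outer(u, w) + np.outer(w, u)
    evals, evecs = np.linalg.eigh(M)
    lam1 = evals[-1]
    u_new = evecs[:, -1]
    return u_new, lam1 * np.outer(u_new, u_new), lam1
```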

Next, we show that under the Gaussian ensemble design, given sample size $n = O(p\log p)$ and proper initialization, the sequence $\{X^t\}$ generated by Algorithm 2 converges quadratically to $X^*$.

Theorem 4 (Local Quadratic Convergence of RISRO for Phase Retrieval) In the phase retrieval problem (35), assume that $\{a_i\}_{i=1}^n$ are independently generated from $N(0, I_p)$. Then for any $\delta_1, \delta_2 \in (0, 1)$, there exist $c, C(\delta_1), C' > 0$ such that when $p \ge c\log n$, $n \ge C(\delta_1)p\log p$, if $\|X^0 - X^*\|_F \le \frac{1-\delta_1}{C'(1+\delta_2)p}\|X^*\|_F$, then with probability at least $1 - C_1\exp(-C_2(\delta_1,\delta_2)n) - C_3n^{-p}$, the sequence $\{X^t\}$ generated by Algorithm 2 satisfies

$$\|X^{t+1} - X^*\|_F \le \frac{C'(1+\delta_2)p}{(1-\delta_1)\|X^*\|_F}\|X^t - X^*\|_F^2, \quad \forall\, t \ge 0 \tag{37}$$

for some $C_1, C_2(\delta_1,\delta_2), C_3 > 0$.

Remark 11 (Initialization Condition) In Theorem 4, we assume $\|X^0 - X^*\|_F \le O(\|X^*\|_F/p)$ to show the quadratic convergence of RISRO. This initialization assumption is used to handle some technical difficulties in the absence of a proper restricted isometry property in phase retrieval. Although in theory it is difficult to prove that the spectral initialization satisfies this assumption, we find in simulations that the spectral initialization is sufficiently good to guarantee quadratic convergence in the subsequent updates. We leave the convergence theory under weaker initialization assumptions as future work.

7 Numerical Studies

In this section, we conduct simulation studies to investigate the numerical performance of RISRO. We specifically consider two settings:

• Matrix trace regression. Let $p = p_1 = p_2$ and $y_i = \langle X^*, A_i\rangle + \varepsilon_i$, where the $A_i$'s are constructed with independent standard normal entries and $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. $X^* = U^*\Sigma^*V^{*\top}$, where $U^*, V^* \in \mathbb{O}_{p,r}$ are randomly generated and $\Sigma^* = \mathrm{diag}(\lambda_1,\ldots,\lambda_r)$. Also, we set $\lambda_1 = 3$ and $\lambda_i = \lambda_1/\kappa^{i/r}$ for $i = 2,\ldots,r$, so the condition number of $X^*$ is $\kappa$. We initialize $X^0$ via $(\mathcal{A}^*(y))_{\max(r)}$. A data-generation sketch for this setting is given after this list.

• Phase retrieval. Let $y_i = \langle a_i, x^*\rangle^2$, where $x^* \in \mathbb{R}^p$ is a randomly generated unit vector and $a_i \overset{\text{i.i.d.}}{\sim} N(0, I_p)$. We initialize $X^0$ via the truncated spectral initialization (Chen and Candès, 2017).
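The following NumPy sketch (the names and the $1/n$ normalization of the spectral initialization are our choices) generates one instance of the matrix trace regression setting above together with its spectral initialization.

```python
import numpy as np

def trace_regression_instance(p=100, r=3, n=1500, kappa=1.0, sigma=0.0, seed=0):
    """Generate one matrix trace regression instance from the first simulation
    setting and its spectral initialization (a sketch; ~120 MB of measurements
    with the default sizes)."""
    rng = np.random.default_rng(seed)
    # Ground truth X* = U* Sigma* V*^T with lam_1 = 3 and lam_i = lam_1 / kappa^{i/r}.
    U_star = np.linalg.qr(rng.standard_normal((p, r)))[0]
    V_star = np.linalg.qr(rng.standard_normal((p, r)))[0]
    lam = np.array([3.0] + [3.0 / kappa ** (i / r) for i in range(2, r + 1)])
    X_star = U_star @ np.diag(lam) @ V_star.T
    # Standard normal measurement matrices and noisy observations.
    A_mats = rng.standard_normal((n, p, p))
    y = np.einsum('ijk,jk->i', A_mats, X_star) + sigma * rng.standard_normal(n)
    # Spectral initialization: best rank-r approximation of A*(y), scaled by 1/n
    # so that E[A*(y)/n] = X* under this design (our normalization choice).
    G = np.einsum('i,ijk->jk', y, A_mats) / n
    Ug, s, Vgt = np.linalg.svd(G, full_matrices=False)
    X0 = Ug[:, :r] @ np.diag(s[:r]) @ Vgt[:r, :]
    return A_mats, y, X_star, X0
```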

Throughout the simulation studies, we consider errors in two metrics: (1) $\|X^t - X^{t_{\max}}\|_F / \|X^{t_{\max}}\|_F$, which measures the convergence error; (2) $\|X^t - X^*\|_F / \|X^*\|_F$, the relative root mean-squared error (Relative RMSE), which measures the estimation error for $X^*$. The algorithm is terminated when it reaches the maximum number of iterations $t_{\max} = 300$ or when the corresponding error metric falls below $10^{-12}$. Unless otherwise noted, the reported results are averages over 50 simulations, run on a computer with an Intel Xeon E5-2680 2.5GHz CPU.
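In code, the two error metrics and the stopping rule amount to the following minimal sketch (function names are ours):

```python
import numpy as np

def relative_error(X, X_ref):
    """Both metrics above have this form: X_ref is X^{t_max} for the
    convergence error and X* for the relative RMSE."""
    return np.linalg.norm(X - X_ref) / np.linalg.norm(X_ref)

def should_stop(t, err, t_max=300, tol=1e-12):
    """Termination rule assumed in the experiments: iteration cap or error below 1e-12."""
    return t >= t_max or err < tol
```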

7.1 Properties of RISRO

We first study the convergence rate of RISRO. Specifically, we set $p = 100$, $r = 3$, $n \in \{1200, 1500, 1800, 2100, 2400\}$, $\kappa = 1$, $\sigma = 0$ for low-rank matrix trace regression, and $p = 1200$, $n \in \{4800, 6000, 7200, 8400, 9600\}$ for phase retrieval. The convergence performance of RISRO (Algorithm 1 in low-rank matrix trace regression and Algorithm 2 in phase retrieval) is plotted in Figure 1. We can see that RISRO with the (truncated) spectral initialization converges quadratically to the true parameter $X^*$ in both problems, in line with the theory developed in the previous sections. Although our theory on phase retrieval in Theorem 4 is based on a stronger initialization assumption, the truncated spectral initialization achieves excellent empirical performance.

In another setting, we examine the quadratic-linear convergence of RISRO in the noisy case. Consider the matrix trace regression problem with $\sigma = 10^{\alpha}$, $\alpha \in \{0, -1, -2, -3, -5, -14\}$, $n = 1500$, and $p, r, \kappa$ the same as in the previous setting. The simulation results in Figure 3 show that the gradient norm $\|\mathrm{grad} f(X^t)\|$ of the iterates converges to zero, which demonstrates the convergence of the algorithm. Meanwhile, since the observations are noisy, RISRO exhibits the quadratic-linear convergence discussed in Remark 4: when $\alpha = 0$, i.e., $\sigma = 1$, RISRO converges quadratically in the first 2-3 steps and then slows to linear convergence; as $\sigma$ gets smaller, RISRO enjoys a longer stretch of quadratic convergence, which matches our theoretical prediction in Remark 4.


Figure 3: Convergence plot of RISRO in matrix trace regression (left panel: relative error $\|X^t - X^{t_{\max}}\|_F/\|X^{t_{\max}}\|_F$; right panel: gradient norm, both versus iteration number). $p = 100$, $r = 3$, $n = 1500$, $\kappa = 1$, $\sigma = 10^{\alpha}$ with varying $\alpha$.

Finally, we study the performance of RISRO in a large-scale setting of matrix trace regression. We fix $n = 7000$, $r = 3$, $\kappa = 1$, $\sigma = 0$ and let the dimension $p$ grow from 100 to 500. For the largest case, the space cost of storing $\mathcal{A}$ reaches $7000 \cdot 500 \cdot 500 \cdot 8\,\mathrm{B} = 13.04$ GB. Figure 4 shows the relative RMSE of the output of RISRO and the runtime versus the dimension. We can clearly see that the relative RMSE of the output is stable and the runtime scales reasonably well as the dimension $p$ grows.


Figure 4: Relative RMSE and runtime of RISRO in matrix trace regression. $p \in [100, 500]$, $r = 3$, $n = 7000$, $\kappa = 1$, $\sigma = 0$.

7.2 Comparison of RISRO with Other Algorithms in the Literature

In this subsection, we further compare RISRO with existing algorithms in the literature. In matrix trace regression, we compare our algorithm with singular value projection (SVP) (Goldfarb and Ma, 2011; Jain et al., 2010), Alternating Minimization (Alter Mini) (Jain et al., 2013; Zhao et al., 2015), gradient descent (GD) (Park et al., 2018; Tu et al., 2016; Zheng and Lafferty, 2015), and convex nuclear norm minimization (NNM) (3) (Toh and Yun, 2010). We consider the setting $p = 100$, $r = 3$, $n = 1500$, $\kappa \in \{1, 50, 500\}$, and $\sigma = 0$ (noiseless case) or $\sigma = 10^{-6}$ (noisy case). Following Zheng and Lafferty (2015), in the implementation of GD and SVP we evaluate three choices of step size, $\{5 \times 10^{-3}, 10^{-3}, 5 \times 10^{-4}\}$, and choose the best one. In phase retrieval, we compare Algorithm 2 with Wirtinger Flow (WF) (Candès et al., 2015) and Truncated Wirtinger Flow (TWF) (Chen and Candès, 2017) with $p = 1200$ and $n = 6000$. We use the code of the accelerated proximal gradient method for NNM, and the codes of WF and TWF, from the corresponding authors' websites, and implement the other algorithms ourselves. The stopping criteria of all procedures are the same as those of RISRO described in the previous simulation settings.

We compare the performance of the various procedures on noiseless matrix trace regression in Figure 5. For all choices of $\kappa$, RISRO converges quadratically to $X^*$ within 7 iterations with high accuracy, while the other baseline algorithms converge much more slowly, at a linear rate. When $\kappa$ (the condition number of $X^*$) increases from 1 to 50 and 500, so that the problem becomes more ill-conditioned, RISRO, Alter Mini, and SVP perform robustly, while GD converges more slowly. In Theorem 3, we showed that the quadratic convergence rate of RISRO is robust to the condition number (see Remark 9). As expected, the non-convex optimization methods converge much faster than the convex relaxation method. Moreover, to achieve a relative RMSE of $10^{-10}$, RISRO takes only about $1/5$ of the runtime of the other algorithms when $\kappa = 1$, and this factor is even smaller in the ill-conditioned cases $\kappa = 50$ and $500$. The comparison of RISRO, WF, and TWF in phase retrieval is plotted in Figure 6. We can also see that RISRO recovers the underlying true signal with high accuracy in much less time than the other baseline methods.
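As a point of reference for the baselines above, the following is a simplified sketch of a factored gradient descent method of the kind used as the GD baseline. It is our own illustration, not the implementation used in the experiments; in particular, the balancing regularization terms used by Tu et al. (2016) and Zheng and Lafferty (2015) are omitted.

```python
import numpy as np

def gd_trace_regression(A, y, r, step=1e-3, n_iter=500):
    """Simplified sketch of factored gradient descent:
    minimize 0.5 * ||y - A(U V^T)||_2^2 over U (p1 x r) and V (p2 x r),
    starting from a rank-r spectral-type initialization.

    A : (n, p1, p2) measurement matrices, y : (n,) responses.
    """
    n, p1, p2 = A.shape
    G = np.einsum('n,nij->ij', y, A) / n                 # A^*(y) / n
    U0, s, V0t = np.linalg.svd(G, full_matrices=False)
    U = U0[:, :r] * np.sqrt(s[:r])                       # balanced factors
    V = V0t[:r, :].T * np.sqrt(s[:r])

    for _ in range(n_iter):
        X = U @ V.T
        resid = np.einsum('nij,ij->n', A, X) - y         # A(X) - y
        grad_X = np.einsum('n,nij->ij', resid, A)        # A^*(A(X) - y)
        U, V = U - step * grad_X @ V, V - step * grad_X.T @ U
    return U @ V.T
```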


(a) $\kappa = 1$


(b) $\kappa = 50$


(c) $\kappa = 500$

Figure 5: Relative RMSE of RISRO, singular value projection (SVP), Alternating Minimization (Alter Mini), gradient descent (GD), and Nuclear Norm Minimization (NNM) in low-rank matrix trace regression. Here, $p = 100$, $r = 3$, $n = 1500$, $\sigma = 0$, $\kappa \in \{1, 50, 500\}$.


Figure 6: Relative RMSE of RISRO, Wirtinger Flow (WF), and Truncated Wirtinger Flow (TWF) in phase retrieval. Here, $p = 1200$, $n = 6000$.

Next, we compare the performance of RISRO with the other algorithms in the noisy setting, $\sigma = 10^{-6}$, in low-rank matrix trace regression. We can see from the results in Figure 7 that, due to the noise, the estimation error first decreases and then stabilizes at a certain level. Meanwhile, RISRO converges at a much faster quadratic-linear rate before reaching this stable level, compared to all the other algorithms.


Figure 7: Relative RMSE of RISRO, singular value projection (SVP), Alternating Minimization (Alter Mini), gradient descent (GD), and Nuclear Norm Minimization (NNM) in low-rank matrix trace regression. Here, $p = 100$, $r = 3$, $n = 1500$, $\kappa = 5$, $\sigma = 10^{-6}$.

Finally, we study the sample size required for successful recovery by RISRO and the other algorithms. We set $p = 100$, $r = 3$, $\kappa = 5$, $n \in [600, 1500]$ in noiseless matrix trace regression and $p = 1200$, $n \in [2400, 6000]$ in phase retrieval. We say an algorithm achieves successful recovery if the relative RMSE is less than $10^{-2}$ when the algorithm terminates. The simulation results in Figure 8 show that RISRO requires the smallest sample size to achieve successful recovery in both matrix trace regression and phase retrieval; Alter Mini performs similarly to RISRO; and both RISRO and Alter Mini require a smaller sample size than the remaining algorithms for successful recovery.

Figure 8: Successful recovery rate comparison. (a) Matrix trace regression ($p = 100$, $r = 3$, $\sigma = 0$, $\kappa = 5$); (b) phase retrieval ($p = 1200$).
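For completeness, here is a sketch of how such a successful-recovery experiment can be organized. The function and the `solver` interface are hypothetical, and it reuses the `make_trace_regression` helper sketched earlier in this section.

```python
import numpy as np

def success_rate(solver, p, r, kappa, n_grid, n_rep=50, tol=1e-2):
    """Estimate the successful-recovery rate of `solver` over a grid of sample sizes.

    `solver(A, y, r)` is assumed to return an estimate of X*; recovery counts as
    successful when the relative RMSE is below `tol` (here 1e-2) at termination.
    """
    rates = []
    for n in n_grid:
        successes = 0
        for rep in range(n_rep):
            rng = np.random.default_rng(rep)
            A, y, X_star = make_trace_regression(p, r, n, kappa=kappa, sigma=0.0, rng=rng)
            X_hat = solver(A, y, r)
            rel_rmse = np.linalg.norm(X_hat - X_star) / np.linalg.norm(X_star)
            successes += rel_rmse < tol
        rates.append(successes / n_rep)
    return np.array(rates)
```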

8 Conclusion and Discussion

In this paper, we propose a new algorithm, RISRO, for solving rank constrained least squares. RISRO is based on a novel algorithmic framework, recursive importance sketching, which also provides new sketching interpretations for several existing algorithms for rank constrained least squares.



RISRO is easy to implement and computationally efficient. Under some reasonable assumptions, local quadratic-linear and quadratic convergence rates are established for RISRO. Simulation studies demonstrate the superior performance of RISRO.

There are many interesting extensions of the results in this paper to be explored in the future. First, our current convergence theory for RISRO relies on an RIP assumption, which may not hold in many scenarios, such as phase retrieval and matrix completion. In this paper, we give theoretical guarantees for RISRO in phase retrieval under a strong initialization assumption, as discussed in Remark 11. However, such an initialization requirement may be unnecessary, and spectral initialization appears good enough to guarantee quadratic convergence, as we observe in the simulation studies. Also, in matrix completion the analysis becomes more delicate, as an additional incoherence condition on $X^*$ is needed to guarantee recovery. To establish and improve theoretical guarantees for RISRO in phase retrieval and matrix completion, we believe extra properties such as "implicit regularization" (Ma et al., 2019) need to be incorporated into the analysis, which is an interesting direction for future work. Moreover, this paper focuses on the squared error loss in (1), while other loss functions may be of interest in different settings, such as the $\ell_1$ loss in robust low-rank matrix recovery (Charisopoulos et al., 2019; Li et al., 2020a,b), which is worth exploring.

References

Absil, P.-A., Mahony, R., and Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.

Absil, P.-A. and Malick, J. (2012). Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158.

Ahmed, A., Recht, B., and Romberg, J. (2013). Blind deconvolution using convex programming. IEEE Transactions on Information Theory, 60(3):1711–1732.

Bauch, J. and Nadler, B. (2020). Rank 2r iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries. to appear, SIAM Journal on the Mathematics of Data Science.

Bhojanapalli, S., Neyshabur, B., and Srebro, N. (2016). Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881.

Boumal, N. and Absil, P.-A. (2011). RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems, pages 406–414.

Breiding, P. and Vannieuwenhoven, N. (2018). Convergence analysis of Riemannian Gauss–Newton methods and its connection with the geometric condition number. Applied Mathematics Letters, 78:42–50.

Burer, S. and Monteiro, R. D. (2003). A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357.

Burke, J. V. (1985). Descent methods for composite nondifferentiable optimization problems. Mathematical Programming, 33(3):260–279.

Cai, T. T. and Zhang, A. (2013). Sharp RIP bound for sparse signal and low-rank matrix recovery. Applied and Computational Harmonic Analysis, 35(1):74–93.

Cai, T. T. and Zhang, A. (2014). Sparse representation of a polytope and recovery of sparse signals and low-rank matrices. IEEE transactions on information theory, 60(1):122–132.

Cai, T. T. and Zhang, A. (2015). ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102–138.

Cai, T. T. and Zhang, A. (2018). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89.

Cai, T. T. and Zhou, W.-X. (2013). A max-norm constrained minimization approach to 1-bit matrix completion. The Journal of Machine Learning Research, 14(1):3619–3647.

Candès, E. J. (2008). The restricted isometry property and its implications for compressed sensing. Comptes rendus mathematique, 346(9-10):589–592.

Candès, E. J., Li, X., and Soltanolkotabi, M. (2015). Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007.

Candès, E. J. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359.

Candès, E. J., Strohmer, T., and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274.

Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.

Charisopoulos, V., Chen, Y., Davis, D., Díaz, M., Ding, L., and Drusvyatskiy, D. (2019). Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. arXiv preprint arXiv:1904.10020.

Chen, Y. and Candès, E. J. (2017). Solving random quadratic systems of equations is nearly as easy as solving linear systems. Communications on Pure and Applied Mathematics, 70(5):822–883.

Chen, Y., Chi, Y., and Goldsmith, A. J. (2015). Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61(7):4034–4059.

Chen, Y. and Wainwright, M. J. (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025.

Chi, Y., Lu, Y. M., and Chen, Y. (2019). Nonconvex optimization meets low-rank matrix factor- ization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269.

Clarkson, K. L. and Woodruff, D. P. (2017). Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6):54.

Davenport, M. A. and Romberg, J. (2016). An overview of low-rank matrix recovery from incom- plete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622.

Dobriban, E. and Liu, S. (2019). Asymptotics for sketching in least squares regression. In Advances in Neural Information Processing Systems, pages 3675–3685.

Drineas, P., Magdon-Ismail, M., Mahoney, M. W., and Woodruff, D. P. (2012). Fast approxi- mation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506.

Duchi, J. C. and Ruan, F. (2019). Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Information and Inference: A Journal of the IMA, 8(3):471–529.

Fienup, J. R. (1982). Phase retrieval algorithms: a comparison. Applied optics, 21(15):2758–2769.

Fornasier, M., Rauhut, H., and Ward, R. (2011). Low-rank matrix recovery via iteratively reweighted least squares minimization. SIAM Journal on Optimization, 21(4):1614–1640.

Gao, B. and Xu, Z. (2017). Phaseless recovery using the Gauss–Newton method. IEEE Transactions on Signal Processing, 65(22):5885–5896.

Ge, R., Jin, C., and Zheng, Y. (2017). No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1233–1242. JMLR. org.

Goldfarb, D. and Ma, S. (2011). Convergence of fixed-point continuation algorithms for matrix rank minimization. Foundations of Computational Mathematics, 11(2):183–210.

Ha, W., Liu, H., and Barber, R. F. (2020). An equivalence between critical points for rank con- straints versus low-rank factorizations. SIAM Journal on Optimization, 30(4):2927–2955.

Hardt, M. (2014). Understanding alternating minimization for matrix completion. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 651–660. IEEE.

Huang, W., Absil, P.-A., and Gallivan, K. A. (2017a). Intrinsic representation of tangent vectors and vector transports on matrix manifolds. Numerische Mathematik, 136(2):523–543.

Huang, W., Gallivan, K. A., and Zhang, X. (2017b). Solving phaselift by low-rank Riemannian op- timization methods for complex semidefinite constraints. SIAM Journal on Scientific Computing, 39(5):B840–B859.

Huang, W. and Hand, P. (2018). Blind deconvolution by a steepest descent algorithm on a quotient manifold. SIAM Journal on Imaging Sciences, 11(4):2757–2785.

Jain, P., Meka, R., and Dhillon, I. S. (2010). Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems, pages 937–945.

Jain, P., Netrapalli, P., and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665–674. ACM.

Jiang, K., Sun, D., and Toh, K.-C. (2014). A partial proximal point algorithm for nuclear norm regularized matrix least squares problems. Mathematical Programming Computation, 6(3):281– 325.

Keshavan, R. H., Oh, S., and Montanari, A. (2009). Matrix completion from a few entries. In 2009 IEEE International Symposium on Information Theory, pages 324–328. IEEE.

Koltchinskii, V., Lounici, K., Tsybakov, A. B., et al. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329.

Kümmerle, C. and Sigl, J. (2018). Harmonic mean iteratively reweighted least squares for low-rank matrix recovery. The Journal of Machine Learning Research, 19(1):1815–1863.

Lange, K. (2010). Numerical analysis for statisticians. Springer Science & Business Media.

Lee, J. D., Recht, B., Srebro, N., Tropp, J., and Salakhutdinov, R. R. (2010). Practical large- scale optimization for max-norm regularization. In Advances in Neural Information Processing Systems, pages 1297–1305.

Lee, J. M. (2013). Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer.

Lewis, A. S. and Wright, S. J. (2016). A proximal method for composite minimization. Mathematical Programming, 158(1-2):501–546.

Li, Q., Zhu, Z., and Tang, G. (2019a). The non-convex geometry of low-rank matrix optimization. Information and Inference: A Journal of the IMA, 8(1):51–96.

Li, X., Ling, S., Strohmer, T., and Wei, K. (2019b). Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis, 47(3):893–934.

Li, X., Zhu, Z., Man-Cho So, A., and Vidal, R. (2020a). Nonconvex robust low-rank matrix recovery. SIAM Journal on Optimization, 30(1):660–686.

Li, Y., Chi, Y., Zhang, H., and Liang, Y. (2020b). Non-convex low-rank matrix recovery with arbitrary outliers via median-truncated gradient descent. Information and Inference: A Journal of the IMA, 9(2):289–325.

Luo, Y. and Zhang, A. R. (2020). A schatten-q matrix perturbation theory via perturbation projection error bound. arXiv preprint arXiv:2008.01312.

Ma, C., Wang, K., Chi, Y., and Chen, Y. (2019). Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics, pages 1–182.

Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224.

Meyer, G., Bonnabel, S., and Sepulchre, R. (2011). Linear regression under fixed-rank constraints: a Riemannian approach. In Proceedings of the 28th international conference on machine learning.

Miao, W., Pan, S., and Sun, D. (2016). A rank-corrected procedure for matrix completion with fixed basis coefficients. Mathematical Programming, 159(1):289–338.

Mishra, B., Meyer, G., Bonnabel, S., and Sepulchre, R. (2014). Fixed-rank matrix factorizations and Riemannian low-rank optimization. Computational Statistics, 29(3-4):591–621.

Mohan, K. and Fazel, M. (2012). Iterative reweighted algorithms for matrix rank minimization. The Journal of Machine Learning Research, 13(1):3441–3473.

Netrapalli, P., Jain, P., and Sanghavi, S. (2013). Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pages 2796–2804.

Nocedal, J. and Wright, S. (2006). Numerical optimization. Springer Science & Business Media.

Park, D., Kyrillidis, A., Caramanis, C., and Sanghavi, S. (2018). Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably. SIAM Journal on Imaging Sciences, 11(4):2165–2204.

Pilanci, M. and Wainwright, M. J. (2016). Iterative Hessian sketch: Fast and accurate solu- tion approximation for constrained least-squares. The Journal of Machine Learning Research, 17(1):1842–1879.

Pilanci, M. and Wainwright, M. J. (2017). Newton sketch: A near linear-time optimization algo- rithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245.

Raskutti, G. and Mahoney, M. W. (2016). A statistical perspective on randomized sketching for ordinary least-squares. The Journal of Machine Learning Research, 17(1):7508–7538.

Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501.

Sanghavi, S., Ward, R., and White, C. D. (2017). The local convexity of solving systems of quadratic equations. Results in Mathematics, 71(3-4):569–608.

Shechtman, Y., Eldar, Y. C., Cohen, O., Chapman, H. N., Miao, J., and Segev, M. (2015). Phase retrieval with application to optical imaging: a contemporary overview. IEEE signal processing magazine, 32(3):87–109.

Song, Z., Woodruff, D. P., and Zhong, P. (2017). Low rank approximation with entrywise l1-norm error. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 688–701. ACM.

Sun, J., Qu, Q., and Wright, J. (2018). A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198.

Sun, R. and Luo, Z.-Q. (2015). Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270– 289. IEEE.

Tanner, J. and Wei, K. (2013). Normalized iterative hard thresholding for matrix completion. SIAM Journal on Scientific Computing, 35(5):S104–S125.

Toh, K.-C. and Yun, S. (2010). An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 6(615-640):15.

Tong, T., Ma, C., and Chi, Y. (2020). Low-rank matrix recovery with scaled subgradient methods: Fast and robust convergence without the condition number. arXiv preprint arXiv:2010.13364.

Tran-Dinh, Q. and Zhang, Z. (2016). Extended Gauss-Newton and Gauss-Newton-ADMM algo- rithms for low-rank matrix optimization. arXiv preprint arXiv:1606.03358.

Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., and Recht, B. (2016). Low-rank solutions of linear matrix equations via Procrustes flow. In International Conference on Machine Learning, pages 964–973.

Uschmajew, A. and Vandereycken, B. (2018). On critical points of quadratic low-rank matrix optimization problems. IMA Journal of Numerical Analysis.

Vandereycken, B. (2013). Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236.

Vandereycken, B. and Vandewalle, S. (2010). A Riemannian optimization approach for computing low-rank solutions of lyapunov equations. SIAM Journal on Matrix Analysis and Applications, 31(5):2553–2579.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

Waldspurger, I., d’Aspremont, A., and Mallat, S. (2015). Phase recovery, maxcut and complex semidefinite programming. Mathematical Programming, 149(1-2):47–81.

Wang, G., Giannakis, G. B., and Eldar, Y. C. (2017a). Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 64(2):773– 794.

Wang, J., Lee, J. D., Mahdavi, M., Kolar, M., Srebro, N., et al. (2017b). Sketching meets ran- dom projection in the dual: A provable recovery algorithm for big and high-dimensional data. Electronic Journal of Statistics, 11(2):4896–4944.

Wang, L., Zhang, X., and Gu, Q. (2017c). A unified computational and statistical framework for nonconvex low-rank matrix estimation. In Artificial Intelligence and Statistics, pages 981–990.

Wei, K., Cai, J.-F., Chan, T. F., and Leung, S. (2016). Guarantees of Riemannian optimization for low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 37(3):1198–1222.

Wen, Z., Yin, W., and Zhang, Y. (2012). Solving a low-rank factorization model for matrix com- pletion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Compu- tation, 4(4):333–361.

Woodruff, D. P. (2014). Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157.

Zhang, A. R., Luo, Y., Raskutti, G., and Yuan, M. (2020). ISLET: Fast and optimal low-rank tensor regression via importance sketching. SIAM Journal on Mathematics of Data Science, 2(2):444–479.

Zhang, R. Y., Sojoudi, S., and Lavaei, J. (2019). Sharp restricted isometry bounds for the inex- istence of spurious local minima in nonconvex matrix recovery. Journal of Machine Learning Research, 20(114):1–34.

Zhao, T., Wang, Z., and Liu, H. (2015). A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages 559–567.

Zheng, Q. and Lafferty, J. (2015). A convergent gradient descent algorithm for rank minimiza- tion and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, pages 109–117.

Zheng, Y., Liu, G., Sugimoto, S., Yan, S., and Okutomi, M. (2012). Practical low-rank matrix approximation under robust l1-norm. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1417. IEEE.

Zhou, G., Huang, W., Gallivan, K. A., Van Dooren, P., and Absil, P.-A. (2016). A Riemannian rank-adaptive method for low-rank optimization. Neurocomputing, 192:72–80.

Zhu, Z., Li, Q., Tang, G., and Wakin, M. B. (2018). Global optimality in low-rank matrix opti- mization. IEEE Transactions on Signal Processing, 66(13):3614–3628.

9 Proof of the Main Results in the Paper

Proof of Lemma 1. First, by the decomposition of (6), we have

t tJ B D2 t y “ ALt t `  . (38) ˜«D1 0 ff¸ r r ˚ ˚ In view of (10), if the operator Lt A ALt is invertible,r the output of least squares in (5) satisfies

Bt`1 Dt`1J Bt DtJ 2 “ pL˚A˚AL q´1L˚A˚y “ 2 ` pL˚A˚AL q´1L˚A˚t, Dt`1 0 t t t Dt 0 t t t „ 1  « 1 ff r r where the second equality is due to (38). This finishesr the proof.  ˚ Proof of Lemma2 . Equation (17) can be directly verified from definitions of Lt and Lt in (9). The second conclusion holds if M is a zero matrix. When M is not zero, to prove the claim ˚ ˚ it is equivalent to show the spectrum of Lt A ALt is upper and lower bounded by 1 ` R2r and ˚ ˚ ˚ 1 ´ R2r, respectively, in the range of Lt . Since Lt A ALt is a symmetric operator, the upper and lower bounds of its spectrum are given below

paq ˚ ˚ 2 sup xZ, Lt A ALtpZqy “ sup }ALtpZq}2 ď 1 ` R2r ˚ ˚ ZPRanpLt q:}Z}F “1 ZPRanpLt q:}Z}F“1 pbq ˚ ˚ 2 inf xZ, L A ALtpZqy “ inf }ALtpZq} ě 1 ´ R2r. ˚ t ˚ 2 ZPRanpLt q:}Z}F “1 ZPRanpLt q:}Z}F“1

Here (a) and (b) are due to the RIP condition of A, LtpZq is a at most rank 2r matrix by the definition of Lt and (17). This finishes the proof.  ˚ ˚ Proof of Proposition1 . Since A satisfies 3r-RIP, R2r ď R3r ă 1. Then, by Lemma2, Lt A ALt ˚ is invertible over RanpLt q. With a slight abuse of notation, define PXt as

˚ p1ˆp2 PXt pZq :“ LtL pZq “ PUt ZPVt ` P t ZPVt ` PUt ZP t , @ Z P , (39) t UK VK R where Ut, Vt are the updated sketching matrices at iteration t defined in Step 7 of Algorithm1. We can verify PXt is an orthogonal projector. Let P Xt pZq :“ Z ´ PXt pZq “ P t ZP t . Recall p qK UK VK t  “ ApP t XP t q `  “ AP Xt pXq `  from (6), we have UK VK p qK

˚ s˚ ´1 ˚ ˚ t 2 s pLt A ALtq sLt A  F s paq › 1 ˚ ˚ › 2 ď › }Lt A ApP›pXtq Xq `  }F p1 ´ R q2 K 2r (40) ` p1q ˘ p2q p3q s s 1 ˚ ˚ 2 ˚ ˚ 2 ˚ ˚ ˚ ˚ “ }L A ApP t Xq} ` }L A pq} ` 2 L A AP t X, L A pq , p1 ´ R q2 ¨ t pX qK F t F t pX qK t ˛ 2r hkkkkkkkkkkkkikkkkkkkkkkkkj hkkkkkkikkkkkkj hkkkkkkkkkkkkkkkkkkikkkkkkkkkkkkkkkkkkj ˚ @ D‹ ˝ s s s s ‚ where (a) is due to Lemma2 and the definition of t. Notice term p2q is the target term we want, next we bound (1) and (3) at the right hand side of (40).

30 Bound for (1).

˚ ˚ 2 ˚ ˚ ˚ ˚ t t t }Lt A ApPpX qK Xq}F “ xLt A ApPpX qK Xq, Lt A ApPpX qK Xqy ˚ “ xApPpXtq Xq, APXt A ApPpXtq Xqy s K s K s paq (41) ˚ t t t ď R3r}PpX qKsX}F}PX A ApPpX qKsXq}F ˚ ˚ “ R3r}PpXtq X}F}Lt A ApPpXtq Xq}F, K s K s ˚ where (a) is due to the Lemma8 and the fact that xP t X,P t A AP t Xy “ 0, rankpP t Xq ď pXsqK X pX qKs pX qK ˚ t t r and rankpPX A APpX qK Xq ď 2r. Note that s s s P t X “ X ´ P t X pX qK X s paq s “ sPUX ` XsPV ´ PUXPV ´ PUt X ´ XPVt ` PUt XPVt “ pPU ´ PUt qX ` XpPV ´ PVt q ´ PUXPV ` PUXPVt ´ PUXPVt ` PUt XPVt s s s s s s s s s s (42) “ pPU ´ PUt qXpI ´ PVt q ` pI ´ PUqXpPV ´ PVt q s s s s s s s s s s s s pbq “ pPsU ´ PUt qsXpI ´ PVt q s s s pcq t “ pPUs ´ PUt qpsX ´ X qpI ´ PVt q, where U, V are lefts and rights singular vectors of X, (a) is because X “ PUX ` XPV ´ PUXPV, t (b) is due to the fact that pI ´ PUqXpPV ´ PVt q “ 0 and (c) is because X pI ´ PVt q “ 0. Now, from (41s),s we get s s s s s s s s s s s ˚ ˚ s t t }Lt A ApPpX qK Xq}F ď R3r}PpX qK X}F paq t s ď R3r}pPU ´sPUt qpX ´ X qpI ´ PVt q}F t (43) ď R3r}P ´ PUt }}X ´ X }F Us t t s pbq }X ´ X}}X ´ X}F ď R3r s s , σrpXq s s here (a) is due to (42) and (b) is due to the singular subspace perturbation inequality }PU ´PUt } ď t s }X ´ X}{σrpXq obtained from Lemma9. Bound for (3). s

s s ˚ ˚ ˚ ˚ ˚ t t t 2 Lt A APpX qK X, Lt A pq “ 2xAPpX qK X, APX A pqy paq @ D ˚ ˚ ď 2R }P t X} }L A pq} s s 3r pX qsK F t s F (44) pbq t t }X ´ X}}X ´ X}F ˚ ˚ ď 2R3r s }Lt A pq}F, σrpXq s s s here (a) and (b) are by the same arguments in (41) and (43), respectively. s s Plugging (44) and (43) into (40), we get (18). This finishes the proof of this Proposition.  Proof of Theorem1 . First notice that quadratic convergence result follows easily from (20) when we set  “ 0. So the rest of the proof is devoted to prove (20) and we also prove the Q-linear convergence along the way. The proof can be divided into four steps. In Step 1, we use Proposition 1 and give ans upper bound for the approximation error in the case X is a stationary point. Then, we use induction to show the main results in Step 2,3,4. To start, similar as (39), define s P Z P ZP P ZP P ZP , Z p1ˆp2 , Xp q “ U V ` UK V ` U VK @ P R

s s s s s s s 31 where U, V are left and right singular vectors of X. Step 1. In this step, we apply Proposition1 in the case X is a stationary point. In view of (18), ˚ ˚ ˚ the terms thats we can simplify is }Lt A pq}F. Sinces X is a stationary point, we know PX A pq “ P A˚py ´ ApXqq “ 0. Then s X ` ˘ s s ` ˚ ˚ ˘ 2 ˚ 2 s ˚ 2 2 ˚ 2 s s }Lt As pq}F “}PXt A pq}F “ }pPXt ´ PXqA pq}F ď }PXt ´ PX} }A pq}F, (45) where the first equality is by the definition of P t ins (39) and equation (17s ). Meanwhile, it holds s s X s s that

}PXt ´ PX}“ sup }pPXt ´ PXqpZq}F }Z}Fď1 s paq s “ sup }pPUt ´ PUqZpI ´ PVq}F ` }pI ´ PUt qZpPVt ´ PVq}F }Z}Fď1 pbq 2}Xt ´ X}s 2}sXt ´ X} s ď sup }Z}F ď , }Z}Fď1 σrpXq σrpXq s s here (a) is because P t P Z P t P Z I P t I P Z P t P by a similar p X ´ Xqp q “ psU ´ Uq p ´ Vs q ` p ´ Uq p V ´ Vq argument in (42) and (b) is by Lemma9. Then, from (45), we see that s s s s 4}Xt ´ X}2 L˚A˚  2 A˚  2 , } t p q}F ď 2 } p q}F σr pXq s which, together with (18), implies that s s s }Xt ´ X}2 L˚A˚AL ´1L˚A˚t 2 }p t tq t }F ď 2 2 p1 ´ R2rq σr pXq (46) 2 t 2 s˚ 2 ˚ t ¨ R3r}X ´ X}F ` 4}A pq}F ` 4R3r}A pq}F}X ´ X}F . s Step 2. In this step, we start` using induction to show the convergence of Xt in (˘20). First, we s s s s introduce θt to measure the goodness of the current iterate in terms of subspace estimation:

t t θt “ maxt sin ΘpU , Uq , sin ΘpV , Vq u, (47) › › › › t ´1 tJ where U, V are the left and right singular› vectors ofsX› and› here } sins Θ›pU , Uq} :“ | sinpcos pσrpU Uqqq| measures the largest angle between subspaces Ut, U. t t t Recalls s the definition of B , D1 and D2 in (7). Thes induction we want tos show is: given θt ď 1{2, s t t 0 t`1 t`1 B is invertible and }X ´X}F ď }X ´X}F, we proves θt`1 ď 1{2, B is invertible, }X ´X}F ď 0 t`1 }X ´ X}F as well as B ris invertibler r and r s s 2 r s Xt`1 X 2 5 L˚A˚AL ´1L˚A˚t } s ´ }F ď p t tq t F 5}Xt ´ X}2 › R2 X›t X 2 4R A˚  Xt X 4 A˚  2 s ď › 2 2 3r} › ´ }F ` 3r} p q}F} ´ }F ` } p q}F (48) p1 ´ R2rqσr pXq 9 s ` ˘ ď }Xt ´ X}2 . s s s s 16 Fs For the rest of this step, wes show when t “ 0, the induction assumption holds, i.e. θ0 ď 1{2 and 0 }X0´X} 1 0 0 B is invertible. Since ď , it holds by Lemma9 that θ0 “ maxt} sin ΘpU , Uq}, } sin ΘpV , Vq}u ď σrpXq 4 0 2 }X ´X} ď 1 . Then, we haves r σrpXq 2 s s s s paq s 0 0J J 0 pbq 2 σrpB q ě σrpU UqσrpXqσrpV V q “ p1 ´ θ0qσrpXq ě 3{4 ¨ σrpXq ą 0, (49)

r s s s 32 s s where (a) is because U0JU, X, VJV0 are all rank r matrices and (Luo and Zhang, 2020, Lemma 5) and (b) is by the definition of } sin ΘpUt, Uq}. t 0 t`1 Step 3. In this step, we firsts s shows given θt ď 1{2 and }X ´ X}F ď }X ´ X}F, B is invertible. Then we also do some preparation to show thes main contraction (48). Notice that s s t`1 t t`1 t tJ tJ t`1 t σrpB q ě σrpB q ´ }B ´ B } ě σrpU UqσrpXqσrpV Vq ´ }B ´ B } (50) 2 t`1 t ě p1 ´ θ qσrpXq ´ }B ´ B }, r r t s s s r and s r paq t`1 t t`1 t t`1 t ˚ ˚ ´1 ˚ ˚ t maxt}B ´ B }, }D1 ´ D1}, }D2 ´ D2}u ď pLt A ALtq Lt A  (51) ˚ ˚ ´1 ˚ ˚ t ď ›pLt A ALtq Lt A  › , r r r › ›F › › where (a) is due to (11). › › t 0 Under the induction assumption }X ´ X}F ď }X ´ X}F, the initialization condition (19) and A˚  1´?R2r σ X , we have } p q}F ď 4 5 rp q s s R2 }Xt ´ X}2s 4}A˚pq}2 4R }Xt ´ X} }A˚pq} s 3r F 1 80, F 1 20, 3r F F 1 20. 2 2 ď { 2 2 ď { 2 2 ď { p1 ´ R2rq σr pXq p1 ´ R2rq σr pXq p1 ´ R2rq σr pXq s s s s By combining these inequalitiess with (46), we obtains s

˚ ˚ ´1 ˚ ˚ t 3 0 3 pL A ALtq L A  ď ? }X ´ X} ď σrpXq. t t F 4 5 16 › › Thus, from (51) we have› › s s 3 maxt}Bt`1 ´ Bt}, }Dt`1 ´ Dt }, }Dt`1 ´ Dt }u ď σ pXq, (52) 1 1 2 2 16 r t`1 9 r r r s t`1 and σrpB q ě 16 σrpXq ą 0 because of (50). This shows the invertibility of B . t`1 With the invertibility of B , we also introduce ρt`1 to measure the goodness of the current iterate in the followings way

t`1 t`1 ´1 t`1 ´1 t`1J ρt`1 “ maxt D1 pB q , pB q D2 u. (53) › › › › t t ´1 tJ t tJ ›t ´1 tJ › tJ› ´1 › Notice D1pB q “ UK XV pU XV q “ UK UpU › Uq , so ›

t paq } sin ΘpU , Uq} pbq θt r r}Dt pBtq´1} ďs }UtJU}}ps UtJUq´1} “s s ď , (54) 1 K t 2 2 1 ´ } sin ΘpU , Uq} 1 ´ θt s r r s s a a where paq is due to the sin Θ property in Lemma 1 of Cai and Zhang(s2018) and (b) is due to (47). t ´1 tJ The same bound also holds for }pB q D2 }. Meanwhile, it holds that

t`1 t`1 ´1 t`1 t t`1 ´1 t t`1 ´1 }D1 pB q } ď}pD1 r´ D1qprB q } ` }D1pB q } paq }Dt`1 ´ Dt } 1 1 Dt Bt ´1 Dt Bt ´1 Bt`1 Bt Bt`1 ´1 ď t`1r ` } 1p q }r ` } 1p q p ´ qp q } σrpB q (55) t`1 r t t`1 t pbq }D ´ D } θt θt }B ´ B } ď 1 1 ` r r ` r r ,r t`1 2 2 t`1 σrpB q 1 ´ θt 1 ´ θt σrpB q r r a a 33 where (a) is because pBt`1q´1 “ pBtq´1 ´ pBtq´1pBt`1 ´ BtqpBt`1q´1 and (b) is due to the bound t t ´1 t`1 ´1 t`1J for }D1pB q } in (54). We can also bound pB q D2 in a similar way. t`1 r 9 r r Thus, by plugging σrpB q ě σrpXq,›θt ď 1{2 and the› upper bound in (52) into (55), we r r 16 › › have › › ? 1 s 1 1 4 ` 3 ρt`1 ď ` ? ` ? ď ? . 3 3 3 3 3 3

Step 4. In the last step, we show the contraction inequality (48). Note from Lemma9 that θt t t`1 decreases as }X ´X}F decreases. So after we show the contraction of }X ´X}F, it automatically implies that θt`1 ď 1{2, which, together with similar arguments in (49), further guarantees the invertibility of Bt`s1. s t t`1 t t ´1 tJ Since X is a rank r matrix and B , B are invertible, a quick calculation asserts D1pB q D2 “ tJ tJ UK XVK . Therefore,r s r r r r X “ rUt Ut srUt Ut sJXrVt Vt srVt Vt sJ s K K K K t tJ t t B D2 t t J s “ rU UKs t t st ´1 tJ rV VKs . «D1 D1pB q D2 ff r r At the same time, it is easy to check r r r r t`1 t`1J ´1 B D Xt`1 “ Xt`1 Bt`1 Xt`1J “ rUt Ut s 2 rVt Vt sJ. U V K Dt`1 Dt`1pBt`1q´1Dt`1J K „ 1 1 2  ` ˘ So 2 Bt`1 ´ Bt Dt`1J ´ DtJ }Xt`1 ´ X}2 “ 2 2 , (56) F Dt`1 ´ Dt ∆t`1 › 1 1 ›F › r r › t`1 t`1 t`1 ´1 t`1Js t› t ´1 tJ › where ∆ “ D1 pB q D2 ´ D1›pB q D2 r. Recall that (12)› gives a precise error charac- t`1 t 2 2 t`›1 t 2 › t`1 2 terization for }B ´ B }F ` k“1 }Dk ´ Dk}F. Hence, to bound }X ´ X}F from (56), we t`1 2 only need to obtain an upper bound ofr}∆r }F.r Combining (53), (54r) and Lemmař 7, we haver s

θt θtρt 1 }∆t`1} ď ρ }Dt`1 ´ Dt } ` }Dt`1 ´ Dt } ` ` }Bt`1 ´ Bt} . F t`1 1 1 F 2 2 2 F 2 F 1 ´ θt 1 ´ θt Moreover, since pa ` b ` cq2 ď 3par2 ` b2 `ac2q and (12), it holdsr thata r

θt θtρt 1 2 }∆t`1}2 ď 3pρ _ _ ` q2 pL˚A˚AL q´1L˚A˚t . (57) F t`1 2 2 t t t F 1 ´ θt 1 ´ θt › › In summary, combining (56), (12) anda (57), wea get › ›

θt θtρt 1 2 }Xt`1 ´ X}2 ď p1 ` 3 ρ _ _ ` q2 pL˚A˚AL q´1L˚A˚t . (58) F t`1 2 2 t t t F 1 ´ θt 1 ´ θt ` ˘ › › s a a › ? θt ?θtρt`1 2 › Plugging in the upper bounds for θt and ρt`1, we have 1 ` 3pρt`1 _ 2 _ 2 q ď 5, and thus 1´θt 1´θt

t`1 2 ˚ ˚ ´1 ˚ ˚ t 2 }X ´ X}F ď 5 pLt A ALtq Lt A  F paq 5}Xt ´ X}2 › R2 X› t X 2 4 A˚  2 4R A˚  Xt X s ď › 2 2 3r} › ´ }F ` } p q}F ` 3r} p q}F} ´ }F p1 ´ R2rq σr pXq pbq 9 s ` ˘ ď }Xt ´ X}2 , s s s s 16 F s

s 34 ˚ where (a) is due to (46) and (b) is because under the initialization condition (19) and }A pq}F ď 1´?R2r σ X , the following inequalities hold 4 5 rp q s 5R2s }Xt ´ X}2 20}A˚pq}2 20R }Xt ´ X} }A˚pq} 3r F 1 16, F 1 4, 3r F F 1 4. 2 2 ď { 2 2 ď { 2 2 ď { p1 ´ R2rq σr pXq p1 ´ R2rq σr pXq p1 ´ R2rq σr pXq s s s s This completes thes induction and finishes the proofs of this theorem.  s Proof of Theorem2 . In view of the Riemannian Gauss-Newton equation in (29), to prove the claim, we only need to show P A˚ A ηt Xt y 0. (59) TXt p p p ` q ´ qq “ t`1 t`1 From the optimality condition of the least squares problem (10), we know that B , D1 and t`1 D2 obtained in (5) satisfy

t`1 t`1 J ˚ ˚ B pD2 q Lt A pALt t`1 ´ yq “ 0. (60) «D1 0 ff

Then, the updating formula in (23), together with the definition of Lt, implies that

t`1 t`1 J t t B pD2 q η ` X “ Lt t`1 . (61) «D1 0 ff

˚ ˚ t t Hence, (60) implies Lt A pApη ` X q ´ yq “ 0. Note from the proof of Theorem1 that for all t ě 1, Bt is invertible. Then, it is not difficult to verify that Ut, Vt are orthonormal bases of the colmun and row spans of Xt and P L L˚ for all t 0. We thus proved (59). TXt “ t t ě  Proof of Proposition3 . We compute the inner product between the update direction ηt in (23) and the Riemannian gradient:

gradf Xt , ηt P A˚ A Xt y , ηt x p q y “ x TXt p p q ´ q y paq P A˚Aηt, ηt “ x´ TXt y pbq ˚ t t t 2 “ ´xA Aη , η y “ ´}Apη q}2,

t here (a) is due to (59) and (b) is because η lies in TXt Mr. With this, we conclude the update direction ηt has negative inner product with the Riemannian gradient unless it is 0. Thus the update ηt is a descent direction. If A satisfies the 2r-RIP, by similar arguments as in Lemma2, we see that P A˚AP is TXt TXt t symmetric positive definite over TXt Mr for all t ě 0. Since η solves the Riemannian Gauss-Newton equation (29), we know that ηt P A˚AP ´1gradf Xt for all t 0. For any subsequence “ ´p TXt TXt q p q ě t tX utPK that converges to a nonstationary point X, it is not difficult to show that

˚ ˚ t ˚ ´1 lim PT t A APT t “ PT A APT andη ˜ “ lim η “ ´pPT A APT q gradfpXq. tÑ8,tPK X X X X r tÑ8,tPK X X Ă Ă Ă Ă t r Hence, tη utPK is bounded,η ˜ ‰ 0 and

(RIP condition) t t 2 2 lim xgradfpX q, η y “ xgradfpXq, η˜y “ ´}Apη˜q}2 ď ´p1 ´ R2rq}η˜}2 ă 0, tÑ8,tPK

t r i.e., the direction sequence tη u is gradient related by (Absil et al., 2009, Definition 4.2.1). 

35 Proof of Theorem3 . The proof of (34) shares many similar ideas to the proof of Theorem 1. Hence, we point out the main difference first and then give the complete proof. Compared to Theorem1 where the target matrix is a stationary point, here the target matrix is X˚. So when we apply Proposition1, X “ X˚,  “  and we no longer have (45). Due to this difference, here we have an unavoidable statistical error term in the upper bound. ˚ ˚ ´1 ˚ ˚ t 2 We begin by provings (34). Wes first apply Proposition1 to bound pLt A ALtq Lt A  F in this setting. Set X “ X˚ and  “ , by Proposition1, we have › › › › 2 L˚A˚AL ´1L˚A˚t p t stq t Fs 2 t ˚ 2 t ˚ 2 ˚ ˚ 2 t ˚ t ˚ R }X ´ X } }X ´ X } }L A pq} 2R3r}X ´ X }}X ´ X } ď › 3r › F ` t F ` }L˚A˚pq} F › 2 2 ˚ › 2 t F ˚ 2 (62) p1 ´ R2rq σr pX q p1 ´ R2rq σrpX qp1 ´ R2rq paq R2 }Xt ´ X˚}2}Xt ´ X˚}2 }L˚A˚pq}2 2 3r F 2 t F , ď 2 2 ˚ ` 2 p1 ´ R2rq σr pX q p1 ´ R2rq

˚ where (a) is by Cauchy-Schwarz inequality. Recall PXt in (39), PXt pA pqq is a at most rank 2r ˚ ˚ matrix and σipPXt A pqq ď σipA pqq for 1 ď i ď p1 ^ p2 by the projection property of PXt . Then we have ˚ ˚ 2 ˚ 2 ˚ 2 ˚ 2 }Lt A pq}F “}PXt A pq}F ď }pA pqqmaxp2rq}F ď 2}pA pqqmaxprq}F. (63) Recall in this setting, the target matrix is X˚. We replace X in (7) by X˚ and obtain Bt :“ tJ ˚ t t tJ ˚ t tJ tJ ˚ t U X V , D1 :“ UK X V , D2 :“ U X VK. Next, we use the induction to prove the main results. Define θt, ρt`1 in the same way as in the proof of? Theorems 1. We aim to show: givenr rt tr ˚ 0 ˚ 2 10 ˚ θt ď 1{2, B is invertible, }X ´ X }F ď }X ´ X }F _ 1 R }pA pqqmaxprq}F, then θt`1 ď 1{2, ?p ´ 2rq t`1 t`1 ˚ 0 ˚ 2 10 ˚ t`1 B is invertible, }X ´ X }F ď }X ´ X }F _ }pA pqqmax r }F, as well as B is r p1´R2rq p q invertible and (34). r First we can easily check the assumption is true when t “ 0 under the initialization condition. Now, suppose the induction assumption is true at iteration t. Under the conditions (32), (33), from (62) and (63), we have 3 pL˚A˚AL q´1L˚A˚t ď σ pX˚q. t t t F 16 r › › t`1 Then? following? the same proof› as the Step 3,4 of Theorem› 1, we have B is invertible, ρt`1 ď p4 ` 3q{3 3 and

θt θtρt 1 2 }Xt`1 ´ X˚}2 ď p1 ` 3pρ _ _ ` q2q pL˚A˚AL q´1L˚A˚t F t`1 2 2 t t t F 1 ´ θt 1 ´ θt (64) 2 › › ˚ ˚ ´1 ˚ ˚ t › › ď 5 pLt A ALtqaLt A  Fa. › › Plugging (63) and (62) into› (64), we arrive at ›

2 t ˚ 2 t ˚ 2 ˚ 2 R }X ´ X } }X ´ X } 20}pA pqqmaxprq} }Xt`1 ´ X˚}2 ď 10 3r F ` F F p1 ´ R q2σ2pX˚q p1 ´ R q2 2r r 2r (65) paq 1 t ˚ 2 20 ˚ 2 ď }X ´ X }F ` 2 }pA pqqmaxprq}F, 2 p1 ´ R2rq where (a) is because under conditions (32), (33) and induction assumption at iteration t, it holds that 10R2 }Xt ´ X˚}2 3r F 1 2. 2 2 ˚ ď { p1 ´ R2rq σr pX q

36 ? By (65), we get }Xt`1 ´ X˚} ď }X0 ´ X˚} _ 2 10 }pA˚pqq } . Under the initialization F F p1´R2rq maxprq F t`1 conditions, Lemma9 also implies θt`1 ď 1{2 and B is invertible. This finishes the proof of (34). Next, we proof the guarantee of RISRO under the Gaussian ensemble design with spectral initialization. Throughout the proof, we use variousr c, C, C1 to denote constants and they may 0 ˚ vary from line to line. First we give the guarantee for the initialization X “ pA pyqqmaxprq. Define p1ˆ2r 0 ˚ Q0 P R as an orthogonal matrix which spans the column subspaces of X and X . Let Q0K be the orthogonal complement of Q0. Since

0 ˚ 2 0 ˚ 2 ˚ 2 }X ´ A pyq}F “}X ´ PQ0 pA pyqq}F ` }PQ0K pA pyqq}F and ˚ ˚ 2 ˚ ˚ 2 ˚ 2 }X ´ A pyq}F “}X ´ PQ0 pA pyqq}F ` }PQ0K pA pyqq}F, 0 ˚ 2 ˚ ˚ 2 the SVD property }X ´ A pyq}F ď }X ´ A pyq}F implies that

0 ˚ 2 ˚ ˚ 2 }X ´ PQ0 pA pyqq}F ď }X ´ PQ0 pA pyqq}F.

Note that

˚ paq ˚ }PQ0 ´ PQ0 A APQ0 } “ sup xpPQ0 ´ PQ0 A APQ0 qZ, Zy }Z}Fď1 “ sup }P Z}2 ´ }AP Z}2 Q0 F Q0 F (66) }Z}Fď1 pbq ˇ ˇ ˇ 2 ˇ ď sup R2r}PQ0 Z}F ď R2r, }Z}Fď1

˚ where (a) is because PQ0 ´ PQ0 A APQ0 is symmetric and (b) is by the 2r-RIP of A. Hence,

0 ˚ 0 ˚ ˚ ˚ }X ´ X }F ď }X ´ PQ0 pA pyqq}F ` }X ´ PQ0 pA pyqq}F ˚ ˚ ď 2}X ´ PQ0 pA pyqq}F paq ˚ ˚ ˚ “ 2}X ´ PQ0 pA pApX q ` qq}F ˚ ˚ ˚ ˚ “ 2}PQ0 X ´ PQ0 A ApPQ0 X q ´ PQ0 pA pqq}F (67) ˚ ˚ ˚ ď 2 p}pPQ0 ´ PQ0 A APQ0 qX }F ` }PQ0 pA pqq}Fq pbq ? ˚ ˚ ď 2R2r}X }F ` 2 2}pA pqqmaxprq}F ? ? ˚ ˚ ď 2R2r rκσrpX q ` 2 2}pA pqqmaxprq}F,

˚ where (a) is due to the model of y and (b) is due to that PQ0 pA pqq is a at most rank 2r matrix ˚ and the spectral norm bound for the operator pPQ0 ´ PQ0 A APQ0 q in (66). Hence, there exists c1, c2,C ą 0 such that when 1 1 R ď c ? ,R ă , and σ pX˚q ě C}pA˚pqq } , (68) 2r 1 κ r 3r 2 r maxprq F

0 ˚ ˚ we have }X ´ X }F ď c2σrpX q by (67) and the conditions in (32) and (33) are satisfied. Next we show under the sample complexity indicated in the Theorem, (68) are satisfied with high probability. First by (Zhang et al., 2020, Lemma 6), for the Gaussian ensemble design considered

37 ˚ 1 p1`p2 here, we have with probability at least 1´expp´cpp1`p2qq for some c ą 0 that }A pq} ď c n σ. So with the same high probability, we have b

˚ 1 pp1 ` p2qr }pA pqqmaxprq}F ď c σ, (69) c n σ2 ˚ ˚ and when n ě Cpp1 ` p2qr 2 ˚ , we have σrpX q ě C}pA pqqmax r }F. At the same time, by σr pX q p q 2 2 ?1 (Cand`esand Plan, 2011, Theorem 2.3), there exists C ą 0 when n ě Cpp1 ` p2qκ r , R2r ď c1 κ r 1 and R3r ă 2 are satisfied with probability at least 1 ´ expp´cpp1 ` p2qq for some c ą 0. σ2 2 In summary, there exists C ą 0 such that when n ě Cpp1 ` p2qrp 2 ˚ _ rκ q,(68) holds with σr pX q probability at least 1 ´ expp´cpp1 ` p2qq for some c ą 0. So by the first part of the Theorem, we have with the same high probability:

R2 }Xt ´ X˚}4 20}pA˚pqq }2 Xt`1 X˚ 2 10 3r F maxprq F , t 0. } ´ }F ď 2 2 ˚ ` 2 @ ě p1 ´ R2rq σr pX q p1 ´ R2rq More specifically, the above convergence can be divided into two phases. Let ? (Phase I) When }Xt ´ X˚}2 ě 2 }pA˚pqq } σ pX˚q, • F R3r maxprq F r

? R }Xt ´ X˚}2 Xt`1 X˚ 2 5 3r F } ´ }F ď ˚ p1 ´ R2rqσrpX q

? (Phase II) When }Xt ´ X˚}2 ď 2 }pA˚pqq } σ pX˚q, • F R3r maxprq F r ? ˚ t`1 ˚ 2 10}pA pqqmaxprq}F }X ´ X }F ď 1 ´ R2r

t ˚ ´2t 0 ˚ Combining Phase I, II and (69), by induction we have }X ´ X }F ď 2 }X ´ X }F ` 2 rpp1`p2qσ t ˚ c n and this implies the desired error bound for }X ´ X }F after double-logarithmic numberb of iterations.  Proof of Theorem4 . In the phase retrieval example, the mapping A no longer satisfies a proper RIP condition and the strategy we use is to show the contraction of Xt ´ X˚ in terms of its nuclear norm and then transform it back to Frobenius norm. We also use the induction to show the main results. Specifically, we show: given |u˚Jut| ą 0 ˚ x˚ t ˚ 0 ˚ ˚J t`1 t`1 ˚ 0 ˚ where u “ ˚ and }X ´ X } ď }X ´ X } , then |u u | ą 0, }X ´ X } ď }X ´ X } }x }2 F F F F and (37). First, the induction assumption is true when t “ 0 by the initialization condition and the per- turbation bound in Lemma9. Assume it is also correct at iteration t. Let bt “ utJX˚ut, dt “ bt dtJ u tJX˚ut. It is easy to verify dt bt ´1dt utJX˚ut and X˚ ut ut ut ut J. p Kq p q “ K K “ r Ks t r t t ´1 t r r Ks « d d pb q d ff r r Define the linear operator Lt similarr r asr (9) in this setting in the following way r r r r J 1ˆpp´1q J w0 P R w1 P R t t w0 w1 t t J Lt : W “ pp´1qˆ1 Ñ ru uKs ru uKs , (70) w P 0 w1 0 „ 1 R  „ 

38 utJMut utJMut and it is easy to compute its adjoint L˚pMq “ K , where M is a rank 2 t utJMut 0 „ K  symmetric matrix. Define operator PXt similar as (39) over the space of p ˆ p symmetric matrices ˚ t tJ t tJ t tJ t tJ t tJ t tJ PXt pWq :“ LtLt pWq “ u u Wu u ` uKuK Wu u ` u u WuKuK .

t t t It is easy to verify that PX is an orthogonal projector. Meanwhile, let PpX qK pWq “ W´PX pWq “ t tJ t tJ uKuK WuKuK . By using the operator Lt, the least squares solution in Step4 can be rewritten in the following way bt`1 dt`1J b dJ 2 t`1 “ arg min y ´ ALt d 0 p 1 d 0 „  bPR,dPR ´ › „ ›2 ˚ ˚ › ´1 ˚ ˚ › “ pL A AL›tq L A y › t › t › ˚ ˚ ´1 ˚ ˚ ˚ ˚ (71) t t “ pLt A ALtq Lt A pApPX X q ` ApPpX qK pX qqq t tJ b d ˚ ˚ ´1 ˚ ˚ ˚ L A AL L A A P t X . “ t ` p t tq t p pX qK p qq « d 0 ff r r Here, L˚A˚AL is invertible is due to the lower bound of the spectrum of L˚A˚AL in Lemma5. t t r t t So bt`1 dt`1J }Xt`1 ´ X˚} ď Xt`1 ´ rut ut s rut ut sJ ˚ K dt`1 0 K › „  ›˚ › › › bt`1 dt`1J › bt dtJ › t t t t J t› t t t J ` ru uKs t`1 ru uKs ´ pu uKq t t t ´1 tJ ru uKs › d 0 « d d pb q d ff › › „  ›˚ › r r › paq › bt`1 dt`1J bt dtJ › ď2 r›ut ut s rut ut sJ ´ rut ut s r r r r rut ut sJ › K dt`1 0 K K dt dtpbtq´1dtJ K › „  « ff ›˚ › r r › › t`1 t t`1J tJ › › b ´ b d ´ d t t ´1 tJ r r r r › ď2 › t`1 t ` 2}d pb q d }˚ › › d ´ d 0 › › r r ›˚ pbq › ˚ ˚ ´1 ˚ ˚ › ˚ r r tr t ´1 tJ “2}p› L A AL q L A ApP t› X q} ` 2}d pb q d } , › t rt t pX›qK ˚ ˚ (72) r r r here (a) is due to Lemma4 and (b) is due to (71). t t ´1 tJ ˚ ˚ ˚ ´1 ˚ ˚ ˚ t t First notice }d pb q d }˚ “}PpX qK X }˚. Next we give bound for }pLt A ALtq Lt A ApPpX qK X q}˚. ´p With probability at least 1 ´ C1 expp´C2pδ1, δ2qpq ´ C3n (C1,C2,C3 ą 0), we have r r r ˚ ˚ ´1 ˚ ˚ ˚ t }pLt A ALtq Lt A ApPpX qK X q}˚ paq 4 ˚ ˚ ˚ t ď }Lt A APpX qK pX q}˚ p1 ´ δ1qn pbq 4p ˚ t ďC }APpX qK pX q}1 p1 ´ δ1qn pcq 1 1 ` δ2 ˚ t ďC p}PpX qK pX q}˚ 1 ´ δ1 1 for some C,C ą 0. Here } ¨ }1 denotes the `1 norm of a vector, (a) is due to Lemma5, (b) is due ˚ t to Lemma6 and (c) is due to (Cand`eset al., 2013, Lemma 3.1) and PpX qK pX q is a symmetric matrix.

39 Putting above results into (72), we have with probability at least 1 ´ C1 expp´C2pδ1, δ2qpq ´ ´p C3n ,

t`1 ˚ t`1 ˚ 1 ` δ2 ˚ paq 1 ` δ2 ˚ t t }X ´ X }F ď}X ´ X }˚ ď C p}PpX qK pX q}˚ “ C p}PpX qK pX q}F 1 ´ δ1 1 ´ δ1 pbq 1 ` δ }Xt ´ X˚}2 C 2 p F , ď ˚ 1 ´ δ1 σ1pX q

˚ t where (a) is because PpX qK pX q is a symmetric rank 1 matrix, (b) is due to the same argument as ˚ ˚ 0 ˚ p1´δ1q ˚ (43). Since σ1pX q “ }X } , when }X ´ X } ď }X } for some large enough C, we have F F Cp1`δ2qp F t`1 ˚ t ˚ 0 ˚ t`1J ˚ }X ´ X }F ď }X ´ X }F ď }X ´ X }F. Also |u u | ą 0 as it is a non-decreasing function t`1 ˚ of }X ´ X }F by Lemma9. This finishes the induction and the proof. 

10 Additional Proofs and Technical Lemmas

We collect in this section the additional proofs and technical lemmas that support the main technical results.

Proof of Equation (16). First, we denote

\[
U^t = [u_1, \ldots, u_{p_1}]^\top, \quad V^t = [v_1, \ldots, v_{p_2}]^\top, \quad M = [m_1, \ldots, m_{p_1}]^\top, \quad N = [n_1, \ldots, n_{p_2}]^\top.
\]

Then

\[
\operatorname*{arg\,min}_{M \in \mathbb{R}^{p_1 \times r},\, N \in \mathbb{R}^{p_2 \times r}} \sum_{(i,j) \in \Omega} \left\{ \left( U^t N^\top + M V^{t\top} - X \right)_{[i,j]} \right\}^2
= \operatorname*{arg\,min}_{M \in \mathbb{R}^{p_1 \times r},\, N \in \mathbb{R}^{p_2 \times r}} \sum_{(i,j) \in \Omega} \left( u_i^\top n_j + m_i^\top v_j - X_{[i,j]} \right)^2. \tag{73}
\]
If $(i,j) \in \Omega$, then the corresponding design matrix $A^{ij}$ has a 1 at location $(i,j)$ and 0 at all other locations. Then

\[
A^{ij} V^t = [0, \ldots, \underbrace{v_j}_{i\text{th}}, \ldots, 0]^\top, \qquad A^{ij\top} U^t = [0, \ldots, \underbrace{u_i}_{j\text{th}}, \ldots, 0]^\top.
\]

tJ ij J ij t 2 arg min xU A , N y ` xM, A V y ´ Xri,js p ˆr p ˆr MPR 1 ,NPR 2 pi,jqPΩ ÿ ` ˘ jth ith J J 2 “ arg min pxN, r0,..., ui ,..., 0s y ` xM, r0,..., vj ,..., 0s y ´ Xijq p ˆr p ˆr MPR 1 ,NPR 2 i,j Ω p ÿqP hkkikkj hkkikkj J J 2 “ arg min pui nj ` mi vj ´ Xri,jsq , p ˆr p ˆr MPR 1 ,NPR 2 i,j Ω p ÿqP which is exactly the same as (73) and this finishes the proof.  Proof of Lemma3 . The proof is the same as the proof of Proposition 2.3 Vandereycken(2013), except here we need to replace the gradient in the matrix completion setting to the gradient ˚ A pApXq ´ yq in our setting. 

Lemma 4 (Projection onto the Positive Semidefinite Cone in the Nuclear Norm) Given any symmetric matrix $A \in \mathbb{R}^{p \times p}$, denote its eigenvalue decomposition as $\sum_{i=1}^{p} \lambda_i v_i v_i^\top$ with $\lambda_1 \ge \cdots \ge \lambda_p$. Let $A_0 = \sum_{i=1}^{p} (\lambda_i \vee 0) v_i v_i^\top$; then
\[
A_0 = \operatorname*{arg\,min}_{X \in \mathbb{S}_+^{p}} \|A - X\|_*,
\]
where $\mathbb{S}_+^{p}$ is the set of $p \times p$ positive semidefinite (PSD) matrices.

Proof of Lemma 4. The main property we use is the variational representation of the nuclear norm. Let $m = \max\{i : \lambda_i \ge 0\}$. For any PSD matrix $X$,
\[
\begin{aligned}
\|X - A\|_* \ge \sum_{i=1}^{p-m} \sigma_i(X - A)
&= \sup_{U \in \mathbb{O}_{p, p-m},\, V \in \mathbb{O}_{p, p-m}} \operatorname{tr}\!\left(U^\top (X - A) V\right) \\
&\ge \sup_{U \in \mathbb{O}_{p, p-m}} \operatorname{tr}\!\left(U^\top (X - A) U\right)
\ge 0 - \inf_{U \in \mathbb{O}_{p, p-m}} \operatorname{tr}\!\left(U^\top A U\right)
\ge -\Big(\sum_{i=m+1}^{p} \lambda_i\Big).
\end{aligned}
\]
On the other hand, $\|A_0 - A\|_* = -\big(\sum_{i=m+1}^{p} \lambda_i\big)$, and this finishes the proof. $\square$
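The projection in Lemma 4 is straightforward to compute from an eigenvalue decomposition. Below is a minimal NumPy sketch (our own illustration), together with a check of the nuclear-norm distance $-\sum_{i>m}\lambda_i$ derived in the proof; the function name and test instance are arbitrary.

```python
import numpy as np

def psd_projection(A):
    """Keep the nonnegative part of the spectrum of a symmetric A:
    A0 = sum_i max(lambda_i, 0) v_i v_i^T, the minimizer in Lemma 4."""
    lam, V = np.linalg.eigh(A)                    # ascending eigenvalues
    return (V * np.maximum(lam, 0.0)) @ V.T

# check: the nuclear-norm distance equals minus the sum of negative eigenvalues of A
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = (A + A.T) / 2
A0 = psd_projection(A)
lam = np.linalg.eigvalsh(A)
dist = np.linalg.norm(A0 - A, ord='nuc')
print(np.isclose(dist, -lam[lam < 0].sum()))      # True
```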

Lemma 5 (Bounds for spectrumř of L˚A˚AL in Phase Retrieval) For any given unit vec- p tor u P R , define the linear map

J 1ˆpp´1q J w0 P R w1 P R w0 w1 J L : W “ pp´1qˆ1 Ñ ru uKs ru uKs . w P 0 w1 0 „ 1 R  „  J J ˚ u Mu u MuK pˆp It is easy to compute L pMq “ J , where M P R is a symmetric matrix. uKMu 0 i.i.d. „  Suppose ai „ Np0, Ipq. Then @δ P p0, 1q, DCpδq ą 0 such that when n ě Cpδqp log p, with ´p ˚ probability at least 1 ´ c1 expp´c2pδqpq ´ c3n , we have for any u and M P RanpL q 1 ´ δ }L˚A˚ALpMq} ě n}M} . (74) F 2 F

n ˚ where A is the linear map in (36) generated by taiui“1. Also for any u and matrix M P RanpL q, with the same high probability, we have 4 }pL˚A˚ALq´1pMq} ď }M} , (75) ˚ p1 ´ δqn ˚

Proof of Lemma5 Note that (74) is true when M is a zero matrix. When M is non-zero, LpMq “ J J 1 2 1 n J 2 J 2 2 2 um ` mu for some m. Then n }ALpMq}2 “ n i“1 |ai u| |ai m| and }M}F “}LpMq}F “ J 2 2 2 ´p 2p|m u| ` }m} }u} q. For any L and M, with probability 1 ´ c1 expp´c2pδqpq ´ c3n , we have 2 2 ř paq 1 ´ δ 1 ´ δ }L˚A˚ALpMq} “}ALpMq}2{}M} ě n}LpMq}2 {}M} “ n}LpMq} , (76) F 2 F 2 F F 2 F where (a) is due to the (Sun et al., 2018, Lemma 6.4).

41 Next, we prove (75). First suppose W P RanpL˚q satisfies pL˚A˚ALqpWq “ M, then we have

}pL˚A˚ALq´1M} }W} ˚ ˚ . (77) “ ˚ ˚ }M}˚ }Lt A ALtpWq}˚

˚ ˚ Hence, to prove the desired result, we only need to obtain a lower bound of }Lt A ALtpWq}˚:

˚ ˚ W }L A ALpWq}˚ “ sup xALpWq, ALpZqy ě xALpWq, ALp qy }Z}ď1 }W} 2 (78) “}ALpWq}2{}W} paq p1 ´ δq pbq 1 ´ δ ě n}LpWq}2 {}W} ě n}W} , 2 F 4 ˚

´p where (a) holds for any L, W with probability 1 ´ c1 expp´c2pδqpq ´ c3n by? the same reason as ˚ (a) in (76); (b) is true because W P RanpL q and }LpWq}F “}W}F ě }W}˚{ 2. 

˚ ˚ Lemma 6 (Upper Bound for }L A pzq}˚ in Phase Retrieval) Consider the same linear op- i.i.d. erator L as in Lemma5. Suppose ai „ Np0, Ipq and A is the linear map in (36) generated n by taiui“1. Then there exists C, c1, c2 ą 0 such that when n ě Cp, with probability at least ˚ ˚ 1 ´ c1 expp´c2pq, for any L and z, we have }L A pzq}˚ ď cp}z}1 for some c ą 0, where }z}1 denotes the `1 norm of z.

Proof of Lemma6. The proof is based on concentration of sub-exponential random variables. First, for fixed L, we have

˚ ˚ ˚ ˚ sup }L A pzq}˚ “ sup sup xL A pzq, Wy }z}1ď1 }z}1ď1 WPS,}W}ď1

“ sup sup xz, ALpWqy “ sup sup xz, ALpWqy “ sup }ALpWq}8, }z}1ď1 WPS,}W}ď1 WPS,}W}ď1 }z}1ď1 WPS,}W}ď1 here S is the set of symmetric matrices and }ALpWq}8 denotes the largest absolute value in the vector ALpWq. Notice LpWq is a symmetric rank-2 matrix with spectral norm bounded by 1, without loss of generality, we can consider the bound for }ApMq}8 for fixed rank 2 matrix M with eigenvalue J J J 2 J 2 decomposition u1u1 ´tu2u2 and t P r´1, 1s. In this case }ApMq}8 “ maxi ||ai u1| ´t|ai u2| | and J 2 J 2 ||ai u1| ´ t|ai u2| | is a subexponential random variable. By the concentration of subexponetial random variable Vershynin(2010), we have

J 2 J 2 2 Pp||ai u1| ´ t|ai u2| | ´ ξ ą xq ď expp´c minpx , xqq,

J 2 J 2 where ξ “ E||ai u1| ´ t|ai u2| |. And a union bound yields

J 2 J 2 2 Ppmaxt||ai u1| ´ t|ai u2| | ´ ξu ą xq ď n expp´c minpx , xqq. (79) i Next, we use the -net argument to extend the bound to hold for any symmetric rank 2 matrix M with spectral norm bounded by 1. Notice that by proving that, we also prove the desired inequality for any L. Let S be an -net on the unit sphere, T be an  net on r´1, 1s and set

J J N “ tM “ u1u1 ´ tu2u2 : pu1, u2, tq P S ˆ S ˆ Tu.

42 p 2p`1 Since |S| ď p3{q , we have |N| ď p3{q . A union bound yields

J 2 J 2 2 Pp@M P N, maxt||ai u1| ´ t|ai u2| | ´ ξu ą xq ď n expp´c minpx , xq ` p2p ` 1q logp3{qq. (80) i

˚ ˚ ˚ ˚ ˚ ˚J ˚ ˚ ˚J Now suppose pu1 , u2 , t q “ arg maxu1,u2,tPr´1,1s }ApMq}8 and denote M “ u1 u1 ´ t u2 u2 , ˚ J J ˚ ˚ µ “}ApM q}8. Then find the approximation M0 “ u0u0 ´t0v0v0 P N such that }u0´u }2, }v ´ ˚ v0}2, |t ´ t0| are each at most . First notice

˚ ˚J J J 2 ˚J 2 }u u ´ u0u0 }“ sup ||u0 x| ´ |u x| | }x}2“1 ˚ J ˚ J “ sup |pu0 ´ u q x||pu0 ` u q x| }x}2“1 ˚ ˚ ˚ ď }u ´ u0}2}u ` u0}2 ď 2}u ´ u0}2 ď 2.

Using the above bound, we have

˚ ˚ ˚J J ˚ ˚ ˚J ˚ ˚J J }M ´ M0} ď }u u ´ u0u0 } ` |t ´ t0|}v v } ` |t0|}v v ´ v0v0 } ď 5. (81)

Now take x ě cp for some c ą 0, then from (80), we have the event t}ApM0q}8 ď cpu happens with probability at least 1 ´ expp´Cpq. And on this event, we have

paq ˚ cp µ ď }ApM ´ M0q} ` }ApM0q} ď 5 ¨ 2µ ` cp ùñ µ ď , 8 8 p1 ´ 10q

˚ where (a) is by triangle inequality and the fact that M ´ M0 can be decomposed into the sum of two rank 2 symmetric matrices with spectral norm bounded by 5 (81). Take  ă 1{10, we get ˚ }ApM q}8 ă cp for some c ą 0 with probability at least 1 ´ expp´Cpq. This finishes the proof. 

Lemma 7 (Zhang et al., 2020, Lemma 3) Suppose $F, \widehat{F} \in \mathbb{R}^{p_1 \times r}$, $G, \widehat{G} \in \mathbb{R}^{r \times r}$, $H, \widehat{H} \in \mathbb{R}^{r \times p_2}$. If $G$ and $\widehat{G}$ are invertible, $\|F G^{-1}\| \le \lambda_1$, and $\|\widehat{G}^{-1} \widehat{H}\| \le \lambda_2$, we have
\[
\left\| \widehat{F}\widehat{G}^{-1}\widehat{H} - F G^{-1} H \right\|_F \le \lambda_2 \|F - \widehat{F}\|_F + \lambda_1 \|H - \widehat{H}\|_F + \lambda_1 \lambda_2 \|G - \widehat{G}\|_F. \tag{82}
\]

Lemma 8 (Candès and Plan, 2011, Lemma 3.3) Let $Z_1, Z_2 \in \mathbb{R}^{p_1 \times p_2}$ be two low-rank matrices with $r_1 = \mathrm{rank}(Z_1)$, $r_2 = \mathrm{rank}(Z_2)$. Suppose $\langle Z_1, Z_2\rangle = 0$ and $r_1 + r_2 \le \min(p_1, p_2)$. If $\mathcal{A}$ satisfies the $(r_1 + r_2)$-RIP condition, then

\[
\left| \langle \mathcal{A}(Z_1), \mathcal{A}(Z_2) \rangle \right| \le R_{r_1 + r_2} \|Z_1\|_F \|Z_2\|_F.
\]

Lemma 9 Let $X_1 = U_1 \Sigma_1 V_1^\top$ and $X_2 = U_2 \Sigma_2 V_2^\top$ be two rank-$r$ matrices with corresponding singular value decompositions. Then

J J }X1 ´ X2} }U1U ´ U2U } ď , 1 2 σ pX q _ σ pX q $ r 1 r 2 ’ 2}X1 ´ X2} &’ maxt} sin ΘpU1, U2q}, } sin ΘpV1, V2q}u ď . σrpX1q _ σrpX2q ’ ’ Proof. See Lemma% 4.2 of Wei et al.(2016) and Theorem 6 of Luo and Zhang(2020). 
