Principal Component Regression with Semirandom Observations via Matrix Completion

Aditya Bhaskara, Kanchana Ruwanpathirana, Maheshakya Wijewardena
University of Utah

Abstract

Principal Component Regression (PCR) is a popular method for prediction from data, and is one way to address the so-called multi-collinearity problem in regression. It was shown recently that algorithms for PCR such as hard singular value thresholding (HSVT) are also quite robust, in that they can handle data that has missing or noisy covariates. However, such spectral approaches require strong distributional assumptions on which entries are observed. Specifically, every covariate is assumed to be observed with probability (exactly) p, for some value of p. Our goal in this work is to weaken this requirement, and as a step towards this, we study a "semi-random" model. In this model, every covariate is revealed with probability p, and then an adversary comes in and reveals additional covariates. While the model seems intuitively easier, it is well known that algorithms such as HSVT perform poorly. Our approach is based on studying the closely related problem of Noisy Matrix Completion in a semi-random setting. By considering a new semidefinite programming relaxation, we develop new guarantees for matrix completion, which is our core technical contribution.

1 Introduction

Regression is one of the fundamental problems in statistics and data analysis, with over two hundred years of history. We are given n observations, each consisting of d prediction or regression variables and one output that is a linear function of the prediction variables. Denoting the outputs by a vector y ∈ R^n and the prediction variables by a matrix M ∈ R^{n×d} (each row corresponds to an observation), linear regression aims to model y as

    $y = M\beta + \eta$,    (1)

where β ∈ R^d is the vector of regression coefficients, and η is a vector of error terms, typically modeled as having independent Gaussian entries. The goal of the regression problem is to find the coefficients β. The standard procedure is to solve the least squares problem, $\min_\beta \|y - M\beta\|^2$.

One of the issues in high dimensional regression (large d) is the problem of multi-collinearity, where there are correlations (in values) between regression variables, leading to unstable values for the parameters β (see Gunst and Webster (1975); Jolliffe (1986)). One common approach to overcome this problem is to use subset selection methods (e.g., Hocking (1972); Draper and Smith (1966)). Another classic approach (Hotelling (1957); Jolliffe (1986)) is to perform regression after projecting to the space of the top principal components of the matrix M. This is called Principal Component Regression (PCR). Formally, if $M^{(r)} = U^{(r)}\Sigma^{(r)}(V^{(r)})^T$ is the best rank r approximation of M, then the idea is to replace (1) with

    $y = M V^{(r)} \beta' + \eta$.    (2)

The goal is to now find the best β′ ∈ R^r. This is known to yield more stable results in many settings (Mosteller and Tukey, 1977). It also has the advantage of being a "smaller" problem, thus it can be solved more efficiently. Recently, Agarwal et al. (2019) observed that in settings where the regression matrix is (close to) low rank, PCR also provides a way to deal with missing and noisy covariates — a well-known issue in applications (Little, 1992). Their main insight is that observing a random subset of the entries of M is good enough to obtain the top singular vectors V^{(r)} to a high accuracy. Thus one can obtain a β′ that yields a low-error regression model.
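To make the PCR procedure in (2) concrete, the following is a minimal sketch in Python/NumPy (our own illustration, not code from the paper; the matrix M, response y, and rank r are placeholder inputs): it projects the covariates onto the top r right singular vectors and solves the reduced least squares problem.

```python
import numpy as np

def pcr(M, y, r):
    """Principal Component Regression: regress y on the top-r principal
    components of the covariate matrix M (a sketch of Eq. (2))."""
    # Top-r right singular vectors of M (columns of V_r span the principal directions).
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    V_r = Vt[:r].T                                   # d x r
    # Solve the reduced least squares problem  min_{beta'} ||y - M V_r beta'||^2.
    beta_prime, *_ = np.linalg.lstsq(M @ V_r, y, rcond=None)
    # Map back to a coefficient vector in the original d-dimensional space.
    return V_r @ beta_prime

# Example usage on synthetic, (nearly) low-rank covariates.
rng = np.random.default_rng(0)
M = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 50))
beta_true = rng.standard_normal(50)
y = M @ beta_true + 0.1 * rng.standard_normal(100)
beta_hat = pcr(M, y, r=20)
print(np.linalg.norm(M @ beta_hat - M @ beta_true) / np.linalg.norm(M @ beta_true))
```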

While the result of Agarwal et al. (2019) provides novel theoretical guarantees on regression with missing entries, it is restrictive: guarantees can only be obtained when every entry of M is observed independently with (the same) probability p. Given inherent dependencies in applications (e.g., some users may intrinsically reveal more information than others; certain covariates may be easier to measure than others, etc.), we ask: Can we design algorithms for partially-observed PCR under milder assumptions on the observations?

It is easy to see that unless sufficiently many entries are observed from each column, recovering the corresponding coordinate of β is impossible. This motivates having a lower bound on the probability of each entry being revealed. This leads us to considering a so-called semi-random model (Blum and Spencer, 1995; Feige and Krauthgamer, 2000; Makarychev et al., 2012; Cheng and Ge, 2018). In our context, such a model corresponds to the following: (a) initially, all the entries are revealed independently with probability p, (b) an adversary reveals additional elements (arbitrarily). Despite appearing to make the problem "easier" than the random case, since an algorithm has no idea if an entry was revealed in step (a) or (b), algorithms based on obtaining estimators (that rely on entries being observed with probability p) will fail.

Further motivation for the semirandom model. In both PCR with missing covariates and matrix completion, at a high level, we wish to understand what "observation patterns" allow for effective recovery. Without stochasticity assumptions, there are hardness results (even when one observes Ω(n²) entries; Hardt et al. (2014)), and we know of positive results when each entry is observed independently with probability p. The semirandom model has been studied in the literature (for graph partitioning and also for matrix completion) because in spite of seeming easier as discussed above, it ends up causing spectral methods to fail. The semirandom model is equivalent to the setting where every entry has some "base" probability of being observed, but the probability could be higher for some (unknown to us) entries. E.g., in recommender systems, some users may provide more ratings for certain types of items than others, and this is typically unknown to the algorithm. The model is also closely related to the notion of Massart noise, which is known to be challenging for a variety of learning problems (see, e.g., Diakonikolas et al. (2019)).

Our approach to partially-observed PCR will be based on the closely related problem of matrix completion (see Candes and Recht (2008); Keshavan et al. (2012); Bhojanapalli and Jain (2014) and references therein). The problems are related because intuitively, we can think of filling in the missing values in the covariate matrix and applying PCR. Further, matrix completion can also be formulated (and indeed has been studied in Cheng and Ge (2018)) in a semi-random setting. However, to the best of our knowledge, trade-offs between the error in the observations and the recovery error have not been studied in a semi-random model. Analyzing this via a semidefinite programming (SDP) relaxation is our main technical contribution. Interestingly, our analysis relies heavily on a recent non-convex approach to matrix completion (Chen et al., 2019).

1.1 Our results

We discuss now our results about matrix completion and the implications to PCR with missing and noisy entries. Formal definitions and the full setup are deferred to Section 2.

Result about matrix completion. Recall that in matrix completion, we have an unknown rank-r matrix M* (dimensions n × n), which we wish to infer given noisy and partial observations. Ω̃ denotes the set of observed indices, and σ denotes the standard deviation of the noise added to each observation. Ω̃ is chosen using a semi-random process (first forming Ω in which every index is present with probability p and then adversarially adding indices).

Theorem (informal). Under appropriate incoherence assumptions on M*, there exists a polynomial time algorithm that finds an estimate Z that satisfies

    $\|M^* - Z\|_F \le O_{\kappa, p, \mu}\!\left(nr \log n \cdot \sigma\right)$.

The result holds with high probability, and the formal statement along with conditions for p, σ is presented in Theorem 1. The parameters κ, µ capture the condition number and incoherence properties of M*. To the best of our knowledge, such a bound relating the noise and recovery error is not previously known in the semi-random model for matrix completion. The prior work of Cheng and Ge (2018) does not consider the noisy case.

Our recovery error bound is a factor √n worse than the best (and optimal) bounds for matrix completion with random observations (Chen et al., 2019; Keshavan et al., 2012). However, it is better than some of the earlier results of Candes and Plan (2009). Specifically, the work of Candes and Plan (2009) imposes the constraint $\|P_\Omega(Z - M)\|_F \le \delta$ and shows that $\|Z - M^*\|_F \le \sqrt{n/p}\,\delta$ (up to constants). If Ω corresponds to i.i.d. observations with probability p, we must set $\delta = n\sqrt{p}\cdot\sigma$ for the SDP to be feasible. This leads to a recovery error bound of $n^{3/2}\sigma$. In the semi-random model, the SDP feasibility constraint must now depend on Ω̃, and if we have ≫ n²p observations, the $n^{3/2}\sigma$ bound becomes even worse. Chen et al. (2019) improve the error bound above by a factor of $\sqrt{np}$ in the random case. Our result can be viewed as being in between the two results while holding true even in the semirandom model. An interesting open problem is thus to close the √n gap between the random and semirandom settings.

Our main proof ingredient is a semidefinite programming (SDP) relaxation that we can write down given Ω̃, but whose analysis can be conducted using tools from the random case. In particular, we use a technique of Chen et al. (2019) in which the solution to a non-convex program is used to derive a dual certificate.

Result about PCR. Using our algorithm for matrix completion as a subroutine lets us obtain new guarantees for PCR under noisy and partial observations. Under appropriate assumptions on the true covariate matrix M*, we can relax the i.i.d. assumption on the observations to semi-random observations, while still obtaining bounds similar to those of Agarwal et al. (2019).

Theorem (informal). Under appropriate incoherence assumptions on M*, there exists an efficient algorithm that, given a noisy and partially observed covariate matrix, outputs β̂ whose mean-squared-error (defined as $\frac{1}{n}\|M^*\hat\beta - M^*\beta^*\|_2^2$ for the optimal coefficients β*; see Section 2.2) is at most

    O("optimal MSE") + $O_{\kappa,\mu,p}\!\left(\|\beta^*\|_2^2\, r^2 n^2 \log n \cdot \sigma^2\right)$.

The optimal MSE is a term that turns out to be unavoidable (with high probability) even if we knew M*. We also note that our bound is in general incomparable with that of Agarwal et al. (2019), but it avoids additive-error terms. Specifically, when σ = 0, our bounds truly converge to the optimal MSE, even if the matrix is only partially observed (which is not true for the HSVT-based algorithm of Agarwal et al. (2019)).

Experimental results. In our experiments, we discuss the efficacy of our techniques along multiple axes. First, we focus on matrix completion and (a) evaluate the robustness of our SDP formulation to adding more observations (i.e., compare the random setting to the semi-random one) and (b) compare the recovery error in matrix completion with that of HSVT. Next, for matrix completion, we compare our SDP with the best SDP when the observations are purely random. Here, we observe that our SDP does slightly worse. This indicates a "price to pay" for being robust; it also suggests that the loss of √n compared to the best-known bounds in the random case may not be an artefact of our analysis. Finally, we compare our SDP-based algorithm with HSVT in the PCR context and show significant improvement both with and without revealing additional entries.

1.2 Related work

There has been extensive literature on PCR and matrix completion that we do not review for lack of space. For PCR, we refer to the early work of Jolliffe (1986), the recent work Agarwal et al. (2019) and references therein. We highlight that unlike classic regression, where the goal is to find β and error is measured in terms of the distance $\|\hat\beta - \beta^*\|$, in PCR we implicitly consider situations where there are multiple good β*s, and the focus is on the mean-squared error (see Section 2.2).

The HSVT algorithm is a classic method for matrix completion, and Chatterjee (2015) presents near-tight results for it. The early works on matrix completion such as Candes and Plan (2009); Keshavan et al. (2012) obtain weaker dependencies between the noise (per observation) and the recovery error, even with random observations. To the best of our knowledge, the recent work of Cheng and Ge (2018) is the only one to develop guarantees for matrix completion in the semi-random model (albeit without error). There is a large body of work on such models for other problems such as graph partitioning and community detection (see Feige and Krauthgamer (2000); Makarychev et al. (2012) and references therein).

2 Notation and preliminaries

In what follows, all the matrices that we consider will be square (n × n). Our results immediately apply to rectangular matrices, as long as we use the maximum of the two dimensions. The rank parameter r is assumed to be ≪ n.

The matrix M* will denote the unknown (or ground truth) matrix of rank r. We assume that the low-rank decomposition is $M^* = U^*\Sigma^*(V^*)^T$, where $U^*, V^* \in \mathbb{R}^{n\times r}$. The decomposition is said to satisfy the µ-incoherence property if for all i ∈ [n], we have $\|U^*_i\| \le \sqrt{\mu r / n}$. The incoherence property ensures that the mass of the matrix (as well as "information" about the decomposition) is well spread.

High probability. All our theorems hold with high probability, and by this we mean probability over the randomness in the support Ω (which gets appended to form Ω̃ in the semirandom model), as well as the randomness in the noise terms, which are distributed as independent Gaussians.
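As a quick illustration of the µ-incoherence property (a sketch we add for exposition, not part of the paper), the snippet below computes the smallest µ for which a given rank-r matrix satisfies the row-norm condition above; the random low-rank matrix used here is only a placeholder.

```python
import numpy as np

def incoherence(M, r):
    """Return the smallest mu such that max_i ||U_i||^2 <= mu * r / n,
    where U holds the top-r left singular vectors of M."""
    n = M.shape[0]
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    U_r = U[:, :r]
    row_norms_sq = np.sum(U_r ** 2, axis=1)     # ||U_i||^2 for each row i
    return row_norms_sq.max() * n / r

rng = np.random.default_rng(1)
n, r = 200, 5
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # random rank-r matrix
print(incoherence(M_star, r))   # small values mean the mass is well spread
```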

2.1 Model for semi-random matrix completion

Let M* be the ground truth matrix with rank r. We assume that the low-rank decomposition $M^* = U^*\Sigma^*(V^*)^T$ satisfies the µ-incoherence property defined above. Further, we let σ_i denote the ith largest singular value of M*, and for convenience, denote σ_min = σ_r. The condition number is κ := σ_1 / σ_min.

Finally, what we are given are partial observations from a matrix M := M* + E where $E_{ij} \sim N(0, \sigma^2)$. The set of observed entries is denoted by Ω̃. These are generated via a semi-random model where (a) Ω ⊂ [n] × [n] is chosen selecting each index uniformly at random with probability p, and (b) an adversary adds an arbitrary number of indices to Ω to yield Ω̃. Crucially, the adversary sees the matrices M and M* before deciding the entries to reveal.

2.2 The PCR Model

Let M* be the ground-truth matrix of covariates. Let y be the response variables where each y_i is linearly associated with $M^*_{i,\cdot}$, i.e.,

    $y_i = M^*_{i,\cdot}\beta^* + \epsilon_i$    (3)

where β* is the unknown model parameter and the ε_i are independent noise distributed as N(0, γ²). We are given a noisy, partially observed version of M*. Specifically, let M = M* + E, where E is an error matrix whose entries are again independent and distributed as N(0, σ²). To this matrix, a mask is applied according to the semi-random model discussed earlier. $P_{\widetilde\Omega}(M)$ is the matrix that is given as input.

The objective in PCR is to produce a β̂ so as to minimize the recovery error:

    $\mathrm{MSE}(\hat\beta) = \frac{1}{n}\sum_{i=1}^n \left(M^*_{i,\cdot}\beta^* - M^*_{i,\cdot}\hat\beta\right)^2$.    (4)

In other words, the combination β̂ that we produce must lead to a good approximation of the "ideal" (noise-free) observations. Even if we were given the rank r matrix M*, the recovery error via least-squares is $\Theta\!\left(\frac{\gamma^2 r}{n}\right)$ (see, e.g., Agarwal et al. (2019)), and thus this is the "best possible" error we can achieve.

3 Robust semi-random matrix completion

We will use the model and notation as discussed in Section 2.1. While many of the known SDP relaxations (e.g., Candes and Plan (2009); Chen et al. (2019)) involve terms such as $\|P_\Omega(M - Z)\|_F$ in the constraints, using an analogous constraint with Ω̃ makes it impossible to carry over the results, as they rely crucially on the randomness of Ω. Our starting point is thus a different problem:

    $\mathrm{Sdp}(\delta): \quad \min_Z \|Z\|_* \ \text{ subject to } \ |Z_{ij} - M_{ij}| \le \delta \ \ \forall (i, j) \in \widetilde\Omega$,    (5)

where $\delta = 4\sigma\sqrt{\log n}$. The choice of δ ensures that M* is feasible. Our key observation (as also shown in our experiments) is that this SDP is quite stable to the addition of observations (which translate to additional constraints in the matrix).

Parameter choices and assumptions. In what follows, we will set $\lambda = \delta n\sqrt{p}$, and the lemmas will assume that

    $\sigma \le c\, \frac{\sigma_{\min} \log n}{n^{3/2}}$,    (6)
    $n^2 p \ge C \kappa^4 \mu^2 r^2 n \log^3 n$,    (7)

where c is a sufficiently small and C a sufficiently large constant.

Our main technical result bounds the error between the solution to Sdp(δ) and the unknown low-rank matrix M*.

Theorem 1. (Error bound for Sdp) Suppose $\delta = 4\sigma\sqrt{\log n}$. Let µ be the incoherence parameter and let σ² be the variance of the noise. Suppose each $M_{ij}$ is observed with probability at least p. Then for a sufficiently large constant c, the solution Z to optimization problem (5) satisfies

    $\|Z - M^*\|_F \le c \cdot \frac{\kappa^3 r^3 \mu^2 n \sqrt{\log n}}{p} \cdot \sigma$.    (8)

We note that the bound does not depend on the number of additional elements revealed by the adversary. It is also a factor of √n weaker than the optimal recovery bounds (Keshavan et al., 2012; Chen et al., 2019). Our proof of the theorem goes via an elegant argument used in Chen et al. (2019). In the random support setting (where we are given Ω) they show that the solution to the unconstrained SDP:

    $\min_Z \ \frac{1}{2}\|P_\Omega(Z - M)\|_F^2 + \lambda\|Z\|_*$    (9)

is closely related to the solution of an appropriate non-convex optimization problem:

    $\min_{X, Y \in \mathbb{R}^{n\times r}} \ \frac{1}{2}\|P_\Omega(XY^T - M)\|_F^2 + \lambda R(X, Y)$,    (10)

where R(X, Y) is an appropriate regularization term. In both the problems above, λ is a parameter (which may be viewed as an appropriate Lagrange multiplier), that is set to roughly $\sigma\sqrt{np}$ for their result.
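The elementwise constrained program Sdp(δ) in (5) can be written down directly in an off-the-shelf convex solver. Below is a minimal sketch using cvxpy (our own illustration, not the paper's code); the observation mask, noisy matrix, and δ are placeholder inputs.

```python
import cvxpy as cp
import numpy as np

def solve_elementwise_sdp(M_obs, mask, delta):
    """Solve Sdp(delta): minimize ||Z||_* subject to |Z_ij - M_ij| <= delta
    for every observed entry (i, j), i.e., every entry with mask == 1."""
    Z = cp.Variable(M_obs.shape)
    # Only observed entries are constrained; unobserved entries remain free.
    constraints = [cp.abs(cp.multiply(mask, Z - M_obs)) <= delta]
    problem = cp.Problem(cp.Minimize(cp.normNuc(Z)), constraints)
    problem.solve(solver=cp.SCS)
    return Z.value

# Example: noisy, partially observed rank-2 matrix.
rng = np.random.default_rng(2)
n, r, p, sigma = 40, 2, 0.4, 1e-3
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = (rng.random((n, n)) < p).astype(float)
M_obs = (M_star + sigma * rng.standard_normal((n, n))) * mask
delta = 4 * sigma * np.sqrt(np.log(n))
Z_hat = solve_elementwise_sdp(M_obs, mask, delta)
print(np.linalg.norm(Z_hat - M_star) / np.linalg.norm(M_star))
```

Note that adding more observed entries only adds constraints of the same form, which is what makes this formulation natural for the semi-random model.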

Unfortunately, the strong connection between the two optimization problems seems to fail to hold in the semi-random setting, i.e., when Ω in the formulations is replaced by Ω̃. The analysis of Chen et al. (2019) that the non-convex optimum is close to M* also strongly relies on the randomness of Ω. However, the existence of X, Y ∈ R^{n×r} with appropriate structural properties turns out to be very useful for us. We now state the lemmas from Chen et al. (2019) that we will use.

Lemma 1 (Properties from Chen et al. (2019)). There exist matrices $X, Y \in \mathbb{R}^{n\times r}$ (specifically the output of an appropriate algorithm for the non-convex problem above) with the following properties. In what follows, T refers to the tangent space defined as

    $T = \{XW^T + W'Y^T : W, W' \in \mathbb{R}^{n\times r}\}$.

$P_T$ and $P_{T^\perp}$ refer to the projections to T and $T^\perp$ respectively.

1. (Claim 2 of Chen et al. (2019)). Let $XY^T = U\Sigma V^T$ be the SVD. Then

    $P_\Omega(XY^T - M) = -\lambda UV^T + R$    (11)

where R is a residual matrix that satisfies

    $\|P_T(R)\|_F \le \frac{72\kappa\lambda}{n^5}$ and $\|P_{T^\perp}(R)\| \le \lambda/2$.

2. (Strong injectivity) The projection $P_\Omega$ satisfies:

    $\frac{1}{p}\|P_\Omega(H)\|_F^2 \ge \frac{1}{32\kappa}\|H\|_F^2 \quad \forall H \in T$.    (12)

3. (Lemma 5 of Chen et al. (2019)) $XY^T$ is close to the ground truth M* in the following sense:

    $\|XY^T - M^*\|_F \le C_F\, \frac{\lambda\kappa\sqrt{\mu^2 r}}{p}$,
    $\|XY^T - M^*\|_* \le C_{op}\, \frac{2r\lambda\kappa}{p}$,    (13)
    $\|P_\Omega(XY^T - M^*)\|_F \le C_\infty\, \frac{\lambda\sqrt{\mu^3 r^3 \kappa^5}}{\sqrt{p}}$,

where $C_F, C_{op}, C_\infty$ are appropriate constants.

Remarks. Roughly, Lemma 1 shows the existence of a low-rank matrix $XY^T$ that has nicer properties than even the ground-truth $X^*(Y^*)^T$, while being close to it in different norms. We also note that parts (1-3) of Lemma 1 are shown in Chen et al. (2019) for a slightly different value of λ. In the supplementary material, Section B, we verify that all the claims also hold for our choice of λ. This lemma is where the incoherence property plays a crucial role, especially in part 2 (Equation 12).

We now use this lemma to show our main result. The proof is inspired by duality-based arguments in many prior works (including Candes and Plan (2009); Chen et al. (2019)).

3.1 Proof of Theorem 1

Let Z denote the optimum solution to the optimization problem (5). Our goal will be to prove that Z is close to $XY^T$. By part (3) of Lemma 1, it will then follow that Z is also close to M*, by the triangle inequality.

To this end, define $Z = XY^T + \Delta$. In order to bound $\|\Delta\|_F$, we first write

    $\|\Delta\|_F \le \|P_T \Delta\|_F + \|P_{T^\perp} \Delta\|_F$,    (14)

and bound the two terms. The first step is to relate the first term on the RHS with the second:

    $\|P_\Omega(\Delta)\|_F = \|P_\Omega P_T(\Delta) + P_\Omega P_{T^\perp}(\Delta)\|_F \ge \|P_\Omega P_T(\Delta)\|_F - \|P_\Omega P_{T^\perp}(\Delta)\|_F \ge \sqrt{\tfrac{p}{32\kappa}}\,\|P_T(\Delta)\|_F - \|P_{T^\perp}(\Delta)\|_F$.    (15)

In the last step, we used the strong injectivity property from Lemma 1. This implies that

    $\|P_T(\Delta)\|_F \le \sqrt{\tfrac{32\kappa}{p}}\left(\|P_{T^\perp}(\Delta)\|_F + \|P_\Omega(\Delta)\|_F\right)$.    (16)

The term $\|P_\Omega(\Delta)\|_F$ turns out to be easy to bound, using the properties of the SDP:

    $\|P_\Omega(\Delta)\|_F = \|P_\Omega(Z - XY^T)\|_F \le \|P_\Omega(Z - M^*)\|_F + \|P_\Omega(M^* - XY^T)\|_F \le 2\lambda + C_\infty\,\frac{\lambda\sqrt{\mu^3 r^3 \kappa^5}}{\sqrt{p}}$.    (17)

The last inequality is due to the fact that M* is a feasible solution to the SDP (5) (which implies that both Z and M* are within distance λ from M), combined with the last inequality in (13).

Combining (14), (15) and (17), it follows that we only need to bound $\|P_{T^\perp}(\Delta)\|_F$. We will indeed show a stronger bound, on the quantity $\|P_{T^\perp}(\Delta)\|_*$, which is always $\ge \|P_{T^\perp}(\Delta)\|_F$.

The bulk of the argument is thus in bounding $\|P_{T^\perp}(\Delta)\|_*$. The key claim is the following, relating this term to the matrices U, V (defined using the SVD $XY^T = U\Sigma V^T$).

Claim 1. We have

    $\|P_{T^\perp}(\Delta)\|_* \le \|XY^T + \Delta\|_* - \|XY^T\|_* - \langle UV^T, \Delta\rangle$.

Proof of Claim 1. By definition, for any $W \in T^\perp$ with $\|W\| \le 1$, we have that $UV^T + W$ is in the subgradient of $\|\cdot\|_*$ at $XY^T$. Thus for any such W, we have

    $\|XY^T + \Delta\|_* \ge \|XY^T\|_* + \langle UV^T + W, \Delta\rangle$.

The observation is that we can pick W such that $\langle W, \Delta\rangle = \|P_{T^\perp}(\Delta)\|_*$. This is well-known, and follows by choosing W using the SVD directions of $P_{T^\perp}(\Delta)$ (and this will lie entirely in $T^\perp$). Rearranging now implies the claim.

The next claim shows the following bound.

Claim 2. $\|XY^T + \Delta\|_* - \|XY^T\|_* \le C_{op}\,\frac{2r\lambda\kappa}{p}$.

Proof of Claim 2. Since $Z = XY^T + \Delta$ is the optimum solution to the SDP (5) and M* is another feasible solution, we have (using also (13) from Lemma 1),

    $\|Z\|_* \le \|M^*\|_* \le \|M^* - XY^T\|_* + \|XY^T\|_* \le C_{op}\,\frac{2r\lambda\kappa}{p} + \|XY^T\|_*$.    (18)

Rearranging now implies the claim.

Using the two claims, it follows that our goal should be to prove that $\langle UV^T, \Delta\rangle$ is not too negative. This is done somewhat indirectly.

Claim 3. From our choice of λ, we have:

    $\|P_\Omega(XY^T + \Delta - M)\|_F^2 \le \|P_\Omega(XY^T - M)\|_F^2$.    (19)

Proof of Claim 3. Because $Z = XY^T + \Delta$ is feasible for our SDP, we have that the LHS of (19) is $\le \lambda^2$, by our choice of λ. Next, using the property of the non-convex solution (part 1 of Lemma 1), we have $P_\Omega(XY^T - M) = -\lambda UV^T + R$, for some $R \in T^\perp$. This implies that $\|P_\Omega(XY^T - M)\|_F^2 \ge r\lambda^2 \ge \lambda^2$. Thus the claim follows.

We can now expand Eq. (19) to obtain

    $\frac{1}{2}\|P_\Omega(\Delta)\|_F^2 + \langle P_\Omega(XY^T - M), \Delta\rangle \le 0$.

Plugging in part 1 of Lemma 1 again, we get

    $\langle \lambda UV^T - R, \Delta\rangle \ge \frac{1}{2}\|P_\Omega(\Delta)\|_F^2 \ge 0$.

This implies that $\langle UV^T, \Delta\rangle \ge \frac{1}{\lambda}\langle R, \Delta\rangle$.

Now the key observation is that

    $|\langle R, \Delta\rangle| \le |\langle P_{T^\perp} R, P_{T^\perp} \Delta\rangle| + |\langle P_T R, P_T \Delta\rangle| \le \frac{\lambda}{2}\|P_{T^\perp}\Delta\|_* + \frac{72\kappa\lambda}{n^5}\|P_T\Delta\|_*$.    (20)

This immediately implies that

    $-\langle UV^T, \Delta\rangle \le \frac{1}{2}\|P_{T^\perp}\Delta\|_* + \frac{72\kappa}{n^5}\|P_T\Delta\|_*$.

Plugging this into Claim 1 and using Claim 2, we have that

    $\|P_{T^\perp}(\Delta)\|_* \le C_{op}\,\frac{2r\lambda\kappa}{p} + \frac{1}{2}\|P_{T^\perp}\Delta\|_* + \frac{72\kappa}{n^5}\|P_T\Delta\|_*$
    $\implies \|P_{T^\perp}(\Delta)\|_* \le C_{op}\,\frac{4r\lambda\kappa}{p} + \frac{144\kappa}{n^5}\|P_T\Delta\|_*$.    (21)

This is where we crucially used the factor of 1/2 from part 1 of Lemma 1.

Plugging this back into (16) and using (17) along with the fact that $\|\cdot\|_F \le \|\cdot\|_* < n\|\cdot\|_F$, we obtain:

    $\|P_T(\Delta)\|_F \le \sqrt{\tfrac{32\kappa}{p}}\left(C_{op}\,\frac{4r\lambda\kappa}{p} + \frac{144\kappa}{n^5}\|P_T\Delta\|_* + 2\lambda + C_\infty\,\frac{\lambda\sqrt{\mu^3 r^3 \kappa^5}}{\sqrt{p}}\right)$.

Simplifying, we get

    $\|P_T(\Delta)\|_F\left(1 - \frac{900\kappa^{3/2}}{\sqrt{p}\,n^4}\right) \le \frac{16 r\lambda\kappa^{3/2}}{\sqrt{p}}\left(\frac{2C_{op}}{p} + C_\infty\sqrt{\kappa^3\mu^3 r^3}\right)$.

Using the fact that n is large enough, the coefficient on the LHS is > 1/2, and we get

    $\|P_T(\Delta)\|_F \le \frac{32 r\lambda\kappa^{3/2}}{\sqrt{p}}\left(\frac{2C_{op}}{p} + C_\infty\sqrt{\kappa^3\mu^3 r^3}\right)$.    (22)

Combining equations (22) and (21) with (14), the theorem follows.
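The proof above repeatedly manipulates the projections $P_T$ and $P_{T^\perp}$ onto the tangent space of the low-rank matrix $XY^T$. As a sanity check (a sketch we add for illustration, not from the paper), these projections have the standard closed forms $P_T(H) = UU^T H + H VV^T - UU^T H VV^T$ and $P_{T^\perp}(H) = (I - UU^T) H (I - VV^T)$, which the snippet below verifies numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 30, 4

# A rank-r matrix XY^T and the U, V from its SVD.
X = rng.standard_normal((n, r))
Y = rng.standard_normal((n, r))
U_full, _, Vt = np.linalg.svd(X @ Y.T, full_matrices=False)
U, V = U_full[:, :r], Vt[:r].T

def P_T(H):
    """Projection onto the tangent space T = {X W^T + W' Y^T}."""
    return U @ U.T @ H + H @ V @ V.T - U @ U.T @ H @ V @ V.T

def P_Tperp(H):
    """Projection onto the orthogonal complement of T."""
    return (np.eye(n) - U @ U.T) @ H @ (np.eye(n) - V @ V.T)

Delta = rng.standard_normal((n, n))
# Check the decomposition Delta = P_T(Delta) + P_Tperp(Delta) used throughout the proof.
print(np.allclose(Delta, P_T(Delta) + P_Tperp(Delta)))
# The Frobenius triangle inequality in (14) then holds by construction.
print(np.linalg.norm(Delta, 'fro')
      <= np.linalg.norm(P_T(Delta), 'fro') + np.linalg.norm(P_Tperp(Delta), 'fro'))
```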

4 Principal Component Regression

We now present the application of our results on matrix completion to the PCR problem. Recall that we have an unknown rank r covariate matrix M*, and we are given a perturbed, partially observed version of M*, which we will denote by $P_{\widetilde\Omega}(M)$ (see Section 2.2). Also, as before, Ω denotes the random part of the observed indices, and Ω̃ is the given set of indices, where Ω is appended with some (adversarially chosen) subset of indices.

Recall also that our goal is to find a β̂ whose MSE $\frac{1}{n}\|M^*\beta^* - M^*\hat\beta\|^2$ is minimized.

Approach of Agarwal et al. (2019). The work of Agarwal et al. (2019) studies a natural approach for PCR, when we are given $P_\Omega(M)$ (and not the projection to Ω̃). The idea is to first "complete" and denoise this matrix, thus obtaining a low-rank estimate M̂ for M*, and then using ordinary least squares with M̂, y to obtain β̂.

They then prove that the MSE of this β̂ can be bounded in terms of an appropriate function of the error (M̂ − M*) in estimating M*. To obtain M̂, they prove that simply using the best rank-r approximation of the matrix $\frac{1}{p}P_\Omega(M)$ suffices. This procedure is referred to as hard singular-value thresholding, or HSVT.

The semi-random setting. The issue with HSVT is that it crucially relies on $\frac{1}{p}P_\Omega(M)$ being an "unbiased estimator" of M, and since the error M − M* is i.i.d. Gaussian, also of M*. The guarantee for HSVT would completely fail to hold if Ω̃, for instance, is imbalanced so that entries in some columns have a probability > p of being revealed. We will demonstrate this also in our experiments (Section 5). Our main idea is thus to replace the HSVT with the noisy matrix completion subroutine developed in Section 3.

We start by presenting our overall algorithm for PCR. The algorithm assumes knowledge of n, and more crucially, of r and σ. (As discussed in Agarwal et al. (2019), when r ≪ n, dividing the input randomly and performing cross validation is a general way to overcome this issue.)

Algorithm 1 PCR via Matrix Completion
Input: A matrix $P_{\widetilde\Omega}(M)$ with missing entries and noise with variance σ².
Output: A feature weight vector β̂
1: Solve the SDP (5) (using σ) and get the optimal solution Z
2: Define Z^{(r)} ← rank-r approximation of Z (obtained via SVD)
3: Carry out ordinary least squares using Z^{(r)} and the given y, return the obtained β̂
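A minimal end-to-end sketch of Algorithm 1 in Python follows (added for illustration; it assumes a routine like the `solve_elementwise_sdp` sketch from Section 3, which is our own placeholder name rather than code from the paper).

```python
import numpy as np

def pcr_via_completion(M_obs, mask, y, r, sigma, solve_elementwise_sdp):
    """Algorithm 1 (sketch): complete the covariate matrix with the elementwise
    constrained SDP, truncate to rank r, then run ordinary least squares."""
    n = M_obs.shape[0]
    delta = 4 * sigma * np.sqrt(np.log(n))
    # Step 1: solve Sdp(delta) over the observed entries.
    Z = solve_elementwise_sdp(M_obs, mask, delta)
    # Step 2: rank-r approximation of Z via SVD.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    Z_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r]
    # Step 3: ordinary least squares on the completed, denoised covariates.
    beta_hat, *_ = np.linalg.lstsq(Z_r, y, rcond=None)
    return beta_hat
```

The rank-r truncation in step 2 matters for the analysis: Lemma 2 below bounds $\|Z^{(r)} - M^*\|_F$, so that the rank-r argument of Agarwal et al. (2019) can be reused.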

We show the following result.

Theorem 2. Suppose the ground truth matrix M* has rank r and has a decomposition that satisfies the properties described in Section 2. Suppose also that the noise parameter σ satisfies the conditions of Theorem 1. Finally, suppose the values y satisfy the following, for some unknown β*:

    $Y_i = M^*_{i,\cdot}\beta^* + \phi_i + \epsilon_i$,

where $\phi_i$ is a model mismatch parameter. Then with high probability, the output β̂ of Algorithm 1 satisfies

    $\mathrm{MSE}(\hat\beta) \le \frac{4\gamma^2 r}{n} + \frac{20\|\phi\|_2^2}{n} + \frac{3\|\beta^*\|_2^2}{n}\cdot\frac{4C^2\kappa^6 r^6 \mu^4 \sigma^2 n^2 \log n}{p^2}$,

where γ² is the variance of the noise in the regression model and the rest of the parameters are defined as in previous sections.

Remark. In our proof below as well as in previous works on robust PCR, we recover the incomplete matrix before training the regression model rather than directly approximating the target Mβ. The reason for this is partly that the top r components of M must be estimated either directly or indirectly, and bounding the error seems to require bounding the error in this estimation.

4.1 Proof of Theorem 2

We show that a slight modification of the proof of Agarwal et al. (2019), combined with our argument from Section 3, yields the theorem.

At a high level, the argument in Agarwal et al. (2019) proceeds by proving an upper bound on the MSE in terms of ‖β*‖ and the error ‖M̂ − M*‖, where M̂ is the completed matrix. However, the proof uses the fact that M̂ is also of rank r. This is the reason we use Z^{(r)} instead of the SDP solution Z itself in Algorithm 1. A simple lemma shows that ‖Z^{(r)} − M*‖ is also small.

Lemma 2. Let Z^{(r)} be the best rank-r approximation of Z (where r = rank(M*)), as computed in the algorithm. Then with high probability, we have

    $\|Z^{(r)} - M^*\|_F \le 2C\cdot\frac{\kappa^3 r^3 \mu^2 n\sigma\sqrt{\log n}}{p}$.    (23)

The proof is based on using the matrix $XY^T$ from Section 3 to show that Z has a good low-rank approximation. The proof is deferred to Section A of the supplement.

We can thus prove our main result of the section.

Proof of Theorem 2. We invoke the proof of Theorem 6 in Agarwal et al. (2019). This relies on the fact that the estimate M̂ is rank r (denoted Â in the paper; it is the result of HSVT on the normalized matrix). In our case, we can carry out the same arguments with Z^{(r)} instead of Â. This is used to obtain (see Eq. (29) of Agarwal et al. (2019)):

    $n\,\mathrm{MSE}(\hat\beta) \le 4\gamma^2 r + 3\|(Z^{(r)} - M^*)\beta^*\|_2^2 + 20\|\phi\|_2^2$.

To obtain the desired bound from this, we can use Hölder's inequality, which yields:

    $\|(Z^{(r)} - M^*)\beta^*\|_2^2 \le \|\beta^*\|_2^2\, \|Z^{(r)} - M^*\|_F^2$.

This along with the bound above gives us

    $\frac{1}{n}\|Z^{(r)}\hat\beta - M^*\beta^*\|_2^2 \le \frac{4\gamma^2 r}{n} + \frac{20\|\phi\|_2^2}{n} + \frac{3\|\beta^*\|_2^2}{n}\|Z^{(r)} - M^*\|_F^2$.

Combining this with Lemma 2 completes the proof of the theorem.

5 Experiments

In this section we empirically evaluate our results on matrix completion and the application to PCR using synthetic datasets. Additional experiments are deferred to Section A of the supplement.

5.1 Matrix completion

We first compare the error rates of the solutions to SDPs (5) (elementwise constrained) and (9) (unconstrained) with uniformly random observations with varying noise levels (measured by the noise parameter σ). The error is measured as $\|Z - M^*\|_F / \|M^*\|_F$ where Z is the solution to each SDP and M* is the ground truth matrix. Here we consider a ground truth matrix with n = 100 and r = 5 where each singular value is 1 (Figure 1). Covariates are sampled with probability p = 0.4 and results are averaged over 5 iterations. λ in SDP (9) is set to $5\sigma\sqrt{np}$.

Figure 1: Scaled Frobenius norm error of recovered matrices when observations are uniformly random.

When the covariates are observed uniformly at random, as seen in Figure 1, solving the SDP (5) gives slightly worse error compared to the SDP from (9), which is known to give the tight bounds.

In the second experiment, we compare the error rates of the SDP (5) and HSVT with random and semi-random observations. Similar to the previous experiment, error is measured in Frobenius norm scaled by $1/\|M^*\|_F$ with varying noise levels. We consider two ground truth matrices, with ranks 1 and 5. Each covariate is first observed with probability 0.4, then the first 90 rows of the first 30 columns are opened later as additional observations. The results are averaged over 5 iterations (Figure 2).

Error rates of HSVT are higher in general than those obtainable via the SDP. Further, the gap between the random and semi-random cases is much worse in the case of HSVT. This suggests that the solutions to the elementwise constrained SDP (5) remain stable under semirandom perturbation.

5.2 PCR using matrix completion vs. PCR with HSVT

We compare the performance of our algorithm with HSVT in principal component regression with covariates observed in a random and semi-random manner. Similar to the experiments in the previous section, we generate ground truth covariate matrices M* with n = 100 and r = 1, 5. We generate a regression coefficient vector β* of size n. In the rank 1 case, each covariate is observed with probability p = 0.4 at first, then the first 90 rows of the first 30 columns are opened later as additional observations. In the rank 5 case, each covariate is observed with probability p = 0.5 at first, then every row of the first 10 columns is opened later as additional observations. We evaluate the regression error as in equation (4) for the recovered matrices using our algorithm and HSVT with varying noise levels (Figure 3). The results are averaged over 5 iterations.

Regression error rates using HSVT imitate the error rates displayed in matrix completion with the same technique, while replacing the HSVT matrix by the solution to the elementwise constrained SDP (5) maintains stable regression error rates.
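The semirandom observation pattern used in these experiments is easy to reproduce. Below is a sketch (our own code, with placeholder dimensions taken from the description above) that builds the mask with the adversarially revealed block and computes the HSVT baseline for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, p, sigma = 100, 5, 0.4, 1e-3

# Ground truth: rank-5 matrix with unit singular values.
A = np.linalg.qr(rng.standard_normal((n, r)))[0]
B = np.linalg.qr(rng.standard_normal((n, r)))[0]
M_star = A @ B.T

# Random observations, then an adversarial block: first 90 rows of the first 30 columns.
mask = (rng.random((n, n)) < p).astype(float)
mask_semi = mask.copy()
mask_semi[:90, :30] = 1.0

M_noisy = M_star + sigma * rng.standard_normal((n, n))

def hsvt(M_obs, mask, p, r):
    """HSVT baseline: best rank-r approximation of (1/p) * P_Omega(M)."""
    U, S, Vt = np.linalg.svd((mask * M_obs) / p, full_matrices=False)
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r]

for name, m in [("random", mask), ("semirandom", mask_semi)]:
    err = np.linalg.norm(hsvt(M_noisy, m, p, r) - M_star) / np.linalg.norm(M_star)
    print(name, round(err, 3))   # the adversarial block typically degrades the HSVT estimate
```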

Figure 2: Comparison of scaled Frobenius norm error of recovered matrices when observations are random and semirandom (panels for r = 1 and r = 5; curves SDP-random, SDP-semirandom, HSVT-random, HSVT-semirandom).

Figure 3: Comparison of regression error when covariates are observed in random and semirandom manner (panels for r = 1 and r = 5).

6 Conclusion

We investigate the principal component regression problem with noisy and missing entries, where the set of observed entries is not i.i.d., but comes from a semirandom model. In this case, existing algorithms based on spectral methods fail, and our contribution is a semidefinite programming (SDP) based algorithm that is robust to such cases. The key technical contribution in this paper is an analysis of matrix completion using an elementwise constrained SDP, and we develop the first recovery bounds under noisy and semirandom observations. We complement our results with experiments demonstrating the pros and cons of our approach.

References

Agarwal, A., Shah, D., Shen, D., and Song, D. (2019). On robustness of principal component regression.

Bhojanapalli, S. and Jain, P. (2014). Universal matrix completion. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1881–1889, Bejing, China. PMLR.

Blum, A. and Spencer, J. (1995). Coloring random and semi-random k-colorable graphs. J. Algorithms, 19(2):204–234.

Candes, E. J. and Plan, Y. (2009). Matrix completion with noise.

Candes, E. J. and Recht, B. (2008). Exact matrix completion via convex optimization.

Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214.

Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. (2019). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization.

Cheng, Y. and Ge, R. (2018). Non-convex matrix completion against a semi-random adversary. In Bubeck, S., Perchet, V., and Rigollet, P., editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1362–1394. PMLR.

Diakonikolas, I., Gouleakis, T., and Tzamos, C. (2019). Distribution-independent PAC learning of halfspaces with Massart noise. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Draper, N. and Smith, H. (1966). Applied regression analysis. Wiley series in probability and mathematical statistics. Wiley, New York.

Feige, U. and Krauthgamer, R. (2000). Finding and certifying a large hidden clique in a semirandom graph. Random Structures & Algorithms, 16(2):195–208.

Gunst, R. and Webster, J. (1975). Regression analysis and problems of multicollinearity. Communications in Statistics, 4(3):277–292.

Hardt, M., Meka, R., Raghavendra, P., and Weitz, B. (2014). Computational limits for matrix completion. In Balcan, M. F., Feldman, V., and Szepesvári, C., editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 703–725, Barcelona, Spain. PMLR.

Hocking, R. R. (1972). Criteria for selection of a subset regression: Which one should be used? Technometrics, 14(4):967–976.

Hotelling, H. (1957). The relations of the newer multivariate statistical methods to factor analysis. British Journal of Statistical Psychology, 10(2):69–79.

Jolliffe, I. (1986). Principal Component Analysis. Springer Verlag.

Keshavan, R. H., Montanari, A., and Oh, S. (2012). Matrix completion from noisy entries.

Little, R. J. A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association, 87(420):1227–1237.

Makarychev, K., Makarychev, Y., and Vijayaraghavan, A. (2012). Approximation algorithms for semi-random graph partitioning problems.

Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression: a Second Course in Statistics. Addison-Wesley Publishing Company.