Principal Component Regression with Semirandom Observations via Matrix Completion

Aditya Bhaskara, Kanchana Ruwanpathirana, Maheshakya Wijewardena
University of Utah

Abstract

Principal Component Regression (PCR) is a popular method for prediction from data, and is one way to address the so-called multi-collinearity problem in regression. It was shown recently that algorithms for PCR such as hard singular value thresholding (HSVT) are also quite robust, in that they can handle data that has missing or noisy covariates. However, such spectral approaches require strong distributional assumptions on which entries are observed. Specifically, every covariate is assumed to be observed with probability (exactly) p, for some value of p. Our goal in this work is to weaken this requirement, and as a step towards this, we study a "semi-random" model. In this model, every covariate is revealed with probability p, and then an adversary comes in and reveals additional covariates. While the model seems intuitively easier, it is well known that algorithms such as HSVT perform poorly. Our approach is based on studying the closely related problem of Noisy Matrix Completion in a semi-random setting. By considering a new semidefinite programming relaxation, we develop new guarantees for matrix completion, which is our core technical contribution.

1 Introduction

Regression is one of the fundamental problems in statistics and data analysis, with over two hundred years of history. We are given n observations, each consisting of d prediction or regression variables and one output that is a linear function of the prediction variables. Denoting the outputs by a vector y ∈ R^n and the prediction variables by a matrix M ∈ R^{n×d} (each row corresponds to an observation), linear regression aims to model y as

    $y = M\beta + \eta$,    (1)

where β ∈ R^d is the vector of regression coefficients, and η is a vector of error terms, typically modeled as having independent Gaussian entries. The goal of the regression problem is to find the coefficients β. The standard procedure is to solve the least squares problem, $\min_\beta \|y - M\beta\|^2$.

One of the issues in high dimensional regression (large d) is the problem of multi-collinearity, where there are correlations (in values) between regression variables, leading to unstable values for the parameters β (see Gunst and Webster (1975); Jolliffe (1986)). One common approach to overcome this problem is to use subset selection methods (e.g., Hocking (1972); Draper and Smith (1966)). Another classic approach (Hotelling (1957); Jolliffe (1986)) is to perform regression after projecting to the space of the top principal components of the matrix M. This is called Principal Component Regression (PCR). Formally, if $M^{(r)} = U^{(r)}\Sigma^{(r)}(V^{(r)})^T$ is the best rank r approximation of M, then the idea is to replace (1) with

    $y = M V^{(r)} \beta' + \eta$.    (2)

The goal is to now find the best β′ ∈ R^r. This is known to yield more stable results in many settings (Mosteller and Tukey, 1977). It also has the advantage of being a "smaller" problem, thus it can be solved more efficiently. Recently, Agarwal et al. (2019) observed that in settings where the regression matrix is (close to) low rank, PCR also provides a way to deal with missing and noisy covariates — a well-known issue in applications (Little, 1992). Their main insight is that observing a random subset of the entries of M is good enough to obtain the top singular vectors V^{(r)} to a high accuracy. Thus one can obtain a β′ that yields a low-error regression model.
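To make the PCR procedure in (2) concrete, the following is a minimal sketch in Python/NumPy (our own illustration, not code from the paper; the matrix M, response y, and rank r are placeholder inputs): it projects the covariates onto the top r right singular vectors and solves the reduced least squares problem.

```python
import numpy as np

def pcr(M, y, r):
    """Principal Component Regression: regress y on the top-r principal
    components of the covariate matrix M (a sketch of Eq. (2))."""
    # Top-r right singular vectors of M (columns of V_r span the principal directions).
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    V_r = Vt[:r].T                                   # d x r
    # Solve the reduced least squares problem  min_{beta'} ||y - M V_r beta'||^2.
    beta_prime, *_ = np.linalg.lstsq(M @ V_r, y, rcond=None)
    # Map back to a coefficient vector in the original d-dimensional space.
    return V_r @ beta_prime

# Example usage on synthetic, (nearly) low-rank covariates.
rng = np.random.default_rng(0)
M = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 50))
beta_true = rng.standard_normal(50)
y = M @ beta_true + 0.1 * rng.standard_normal(100)
beta_hat = pcr(M, y, r=20)
print(np.linalg.norm(M @ beta_hat - M @ beta_true) / np.linalg.norm(M @ beta_true))
```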

While the result of Agarwal et al. (2019) provides novel theoretical guarantees on regression with missing entries, it is restrictive: guarantees can only be obtained when every entry of M is observed independently with (the same) probability p. Given inherent dependencies in applications (e.g., some users may intrinsically reveal more information than others; certain covariates may be easier to measure than others, etc.), we ask: Can we design algorithms for partially-observed PCR under milder assumptions on the observations?

It is easy to see that unless sufficiently many entries are observed from each column, recovering the corresponding coordinate of β is impossible. This motivates having a lower bound on the probability of each entry being revealed. This leads us to considering a so-called semi-random model (Blum and Spencer, 1995; Feige and Krauthgamer, 2000; Makarychev et al., 2012; Cheng and Ge, 2018). In our context, such a model corresponds to the following: (a) initially, all the entries are revealed independently with probability p, (b) an adversary reveals additional elements (arbitrarily). Despite appearing to make the problem "easier" than the random case, since an algorithm has no idea if an entry was revealed in step (a) or (b), algorithms based on obtaining estimators (that rely on entries being observed with probability p) will fail.

Further motivation for the semirandom model. In both PCR with missing covariates and matrix completion, at a high level, we wish to understand what "observation patterns" allow for effective recovery. Without stochasticity assumptions, there are hardness results (even when one observes Ω(n²) entries; Hardt et al. (2014)), and we know of positive results when each entry is observed independently with probability p. The semirandom model has been studied in the literature (for graph partitioning and also for matrix completion) because in spite of seeming easier as discussed above, it ends up causing spectral methods to fail. The semirandom model is equivalent to the setting where every entry has some "base" probability of being observed, but the probability could be higher for some (unknown to us) entries. E.g., in recommender systems, some users may provide more ratings for certain types of items than others, and this is typically unknown to the algorithm. The model is also closely related to the notion of Massart noise, which is known to be challenging for a variety of learning problems (see, e.g., Diakonikolas et al. (2019)).

Our approach to partially-observed PCR will be based on the closely related problem of matrix completion (see Candes and Recht (2008); Keshavan et al. (2012); Bhojanapalli and Jain (2014) and references therein). The problems are related because intuitively, we can think of filling in the missing values in the covariate matrix and applying PCR. Further, matrix completion can also be formulated (and indeed has been studied in Cheng and Ge (2018)) in a semi-random setting. However, to the best of our knowledge, trade-offs between the error in the observations and the recovery error have not been studied in a semi-random model. Analyzing this via a semidefinite programming (SDP) relaxation is our main technical contribution. Interestingly, our analysis relies heavily on a recent non-convex approach to matrix completion (Chen et al., 2019).

1.1 Our results

We discuss now our results about matrix completion and the implications to PCR with missing and noisy entries. Formal definitions and the full setup are deferred to Section 2.

Result about matrix completion. Recall that in matrix completion, we have an unknown rank-r matrix M* (dimensions n × n), which we wish to infer given noisy and partial observations. Ω̃ denotes the set of observed indices, and σ denotes the standard deviation of the noise added to each observation. Ω̃ is chosen using a semi-random process (first forming Ω in which every index is present with probability p and then adversarially adding indices).

Theorem (informal). Under appropriate incoherence assumptions on M*, there exists a polynomial time algorithm that finds an estimate Z that satisfies

    $\|M^* - Z\|_F \le O_{\kappa, p, \mu}\!\left(nr \log n \cdot \sigma\right)$.

The result holds with high probability, and the formal statement along with conditions for p, σ is presented in Theorem 1. The parameters κ, µ capture the condition number and incoherence properties of M*. To the best of our knowledge, such a bound relating the noise and recovery error is not previously known in the semi-random model for matrix completion. The prior work of Cheng and Ge (2018) does not consider the noisy case.

Our recovery error bound is a factor √n worse than the best (and optimal) bounds for matrix completion with random observations (Chen et al., 2019; Keshavan et al., 2012). However, it is better than some of the earlier results of Candes and Plan (2009). Specifically, the work of Candes and Plan (2009) imposes the constraint $\|P_\Omega(Z - M)\|_F \le \delta$ and shows that $\|Z - M^*\|_F \le \sqrt{n/p}\,\delta$ (up to constants). If Ω corresponds to i.i.d. observations with probability p, we must set $\delta = n\sqrt{p}\cdot\sigma$ for the SDP to be feasible. This leads to a recovery error bound of $n^{3/2}\sigma$. In the semi-random model, the SDP feasibility constraint must now depend on Ω̃, and if we have ≫ n²p observations, the $n^{3/2}\sigma$ bound becomes even worse. Chen et al. (2019) improve the error bound above by a factor of $\sqrt{np}$ in the random case. Our result can be viewed as being in between the two results while holding true even in the semirandom model. An interesting open problem is thus to close the √n gap between the random and semirandom settings.

Our main proof ingredient is a semidefinite programming (SDP) relaxation that we can write down given Ω̃, but whose analysis can be conducted using tools from the random case. In particular, we use a technique of Chen et al. (2019) in which the solution to a non-convex program is used to derive a dual certificate.

Result about PCR. Using our algorithm for matrix completion as a subroutine lets us obtain new guarantees for PCR under noisy and partial observations. Under appropriate assumptions on the true covariate matrix M*, we can relax the i.i.d. assumption on the observations to semi-random observations, while still obtaining bounds similar to those of Agarwal et al. (2019).

Theorem (informal). Under appropriate incoherence assumptions on M*, there exists an efficient algorithm that, given a noisy and partially observed covariate matrix, outputs β̂ whose mean-squared-error (defined as $\frac{1}{n}\|M^*\hat\beta - M^*\beta^*\|_2^2$ for the optimal coefficients β*; see Section 2.2) is at most

    O("optimal MSE") + $O_{\kappa,\mu,p}\!\left(\|\beta^*\|_2^2\, r^2 n^2 \log n \cdot \sigma^2\right)$.

The optimal MSE is a term that turns out to be unavoidable (with high probability) even if we knew M*. We also note that our bound is in general incomparable with that of Agarwal et al. (2019), but it avoids additive-error terms. Specifically, when σ = 0, our bounds truly converge to the optimal MSE, even if the matrix is only partially observed (which is not true for the HSVT-based algorithm of Agarwal et al. (2019)).

Experimental results. In our experiments, we discuss the efficacy of our techniques along multiple axes. First, we focus on matrix completion and (a) evaluate the robustness of our SDP formulation to adding more observations (i.e., compare the random setting to the semi-random one) and (b) compare the recovery error in matrix completion with that of HSVT. Next, for matrix completion, we compare our SDP with the best SDP when the observations are purely random. Here, we observe that our SDP does slightly worse. This indicates a "price to pay" for being robust; it also suggests that the loss of √n compared to the best-known bounds in the random case may not be an artefact of our analysis. Finally, we compare our SDP-based algorithm with HSVT in the PCR context and show significant improvement both with and without revealing additional entries.

1.2 Related work

There has been extensive literature on PCR and matrix completion that we do not review for lack of space. For PCR, we refer to the early work of Jolliffe (1986), the recent work Agarwal et al. (2019) and references therein. We highlight that unlike classic regression, where the goal is to find β and error is measured in terms of the distance $\|\hat\beta - \beta^*\|$, in PCR we implicitly consider situations where there are multiple good β*s, and the focus is on the mean-squared error (see Section 2.2).

The HSVT algorithm is a classic method for matrix completion, and Chatterjee (2015) presents near-tight results for it. The early works on matrix completion such as Candes and Plan (2009); Keshavan et al. (2012) obtain weaker dependencies between the noise (per observation) and the recovery error, even with random observations. To the best of our knowledge, the recent work of Cheng and Ge (2018) is the only one to develop guarantees for matrix completion in the semi-random model (albeit without error). There is a large body of work on such models for other problems such as graph partitioning and community detection (see Feige and Krauthgamer (2000); Makarychev et al. (2012) and references therein).

2 Notation and preliminaries

In what follows, all the matrices that we consider will be square (n × n). Our results immediately apply to rectangular matrices, as long as we use the maximum of the two dimensions. The rank parameter r is assumed to be ≪ n.

The matrix M* will denote the unknown (or ground truth) matrix of rank r. We assume that the low-rank decomposition is $M^* = U^*\Sigma^*(V^*)^T$, where $U^*, V^* \in \mathbb{R}^{n\times r}$. The decomposition is said to satisfy the µ-incoherence property if for all i ∈ [n], we have $\|U^*_i\| \le \sqrt{\mu r / n}$. The incoherence property ensures that the mass of the matrix (as well as "information" about the decomposition) is well spread.

High probability. All our theorems hold with high probability, and by this we mean probability over the randomness in the support Ω (which gets appended to form Ω̃ in the semirandom model), as well as the randomness in the noise terms, which are distributed as independent Gaussians.
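As a quick illustration of the µ-incoherence property (a sketch we add for exposition, not part of the paper), the snippet below computes the smallest µ for which a given rank-r matrix satisfies the row-norm condition above; the random low-rank matrix used here is only a placeholder.

```python
import numpy as np

def incoherence(M, r):
    """Return the smallest mu such that max_i ||U_i||^2 <= mu * r / n,
    where U holds the top-r left singular vectors of M."""
    n = M.shape[0]
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    U_r = U[:, :r]
    row_norms_sq = np.sum(U_r ** 2, axis=1)     # ||U_i||^2 for each row i
    return row_norms_sq.max() * n / r

rng = np.random.default_rng(1)
n, r = 200, 5
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # random rank-r matrix
print(incoherence(M_star, r))   # small values mean the mass is well spread
```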

2.1 Model for semi-random matrix completion

Let M* be the ground truth matrix with rank r. We assume that the low-rank decomposition $M^* = U^*\Sigma^*(V^*)^T$ satisfies the µ-incoherence property defined above. Further, we let σ_i denote the ith largest singular value of M*, and for convenience, denote σ_min = σ_r. The condition number is κ := σ_1 / σ_min.

Finally, what we are given are partial observations from a matrix M := M* + E where $E_{ij} \sim N(0, \sigma^2)$. The set of observed entries is denoted by Ω̃. These are generated via a semi-random model where (a) Ω ⊂ [n] × [n] is chosen selecting each index uniformly at random with probability p, and (b) an adversary adds an arbitrary number of indices to Ω to yield Ω̃. Crucially, the adversary sees the matrices M and M* before deciding the entries to reveal.

2.2 The PCR Model

Let M* be the ground-truth matrix of covariates. Let y be the response variables where each y_i is linearly associated with $M^*_{i,\cdot}$, i.e.,

    $y_i = M^*_{i,\cdot}\beta^* + \epsilon_i$    (3)

where β* is the unknown model parameter and the ε_i are independent noise distributed as N(0, γ²). We are given a noisy, partially observed version of M*. Specifically, let M = M* + E, where E is an error matrix whose entries are again independent and distributed as N(0, σ²). To this matrix, a mask is applied according to the semi-random model discussed earlier. $P_{\widetilde\Omega}(M)$ is the matrix that is given as input.

The objective in PCR is to produce a β̂ so as to minimize the recovery error:

    $\mathrm{MSE}(\hat\beta) = \frac{1}{n}\sum_{i=1}^n \left(M^*_{i,\cdot}\beta^* - M^*_{i,\cdot}\hat\beta\right)^2$.    (4)

In other words, the combination β̂ that we produce must lead to a good approximation of the "ideal" (noise-free) observations. Even if we were given the rank r matrix M*, the recovery error via least-squares is $\Theta\!\left(\frac{\gamma^2 r}{n}\right)$ (see, e.g., Agarwal et al. (2019)), and thus this is the "best possible" error we can achieve.

3 Robust semi-random matrix completion

We will use the model and notation as discussed in Section 2.1. While many of the known SDP relaxations (e.g., Candes and Plan (2009); Chen et al. (2019)) involve terms such as $\|P_\Omega(M - Z)\|_F$ in the constraints, using an analogous constraint with Ω̃ makes it impossible to carry over the results, as they rely crucially on the randomness of Ω. Our starting point is thus a different problem:

    $\mathrm{Sdp}(\delta): \quad \min_Z \|Z\|_* \ \text{ subject to } \ |Z_{ij} - M_{ij}| \le \delta \ \ \forall (i, j) \in \widetilde\Omega$,    (5)

where $\delta = 4\sigma\sqrt{\log n}$. The choice of δ ensures that M* is feasible. Our key observation (as also shown in our experiments) is that this SDP is quite stable to the addition of observations (which translate to additional constraints in the matrix).

Parameter choices and assumptions. In what follows, we will set $\lambda = \delta n\sqrt{p}$, and the lemmas will assume that

    $\sigma \le c\, \frac{\sigma_{\min} \log n}{n^{3/2}}$,    (6)
    $n^2 p \ge C \kappa^4 \mu^2 r^2 n \log^3 n$,    (7)

where c is a sufficiently small and C a sufficiently large constant.

Our main technical result bounds the error between the solution to Sdp(δ) and the unknown low-rank matrix M*.

Theorem 1. (Error bound for Sdp) Suppose $\delta = 4\sigma\sqrt{\log n}$. Let µ be the incoherence parameter and let σ² be the variance of the noise. Suppose each $M_{ij}$ is observed with probability at least p. Then for a sufficiently large constant c, the solution Z to optimization problem (5) satisfies

    $\|Z - M^*\|_F \le c \cdot \frac{\kappa^3 r^3 \mu^2 n \sqrt{\log n}}{p} \cdot \sigma$.    (8)

We note that the bound does not depend on the number of additional elements revealed by the adversary. It is also a factor of √n weaker than the optimal recovery bounds (Keshavan et al., 2012; Chen et al., 2019). Our proof of the theorem goes via an elegant argument used in Chen et al. (2019). In the random support setting (where we are given Ω) they show that the solution to the unconstrained SDP:

    $\min_Z \ \frac{1}{2}\|P_\Omega(Z - M)\|_F^2 + \lambda\|Z\|_*$    (9)

is closely related to the solution of an appropriate non-convex optimization problem:

    $\min_{X, Y \in \mathbb{R}^{n\times r}} \ \frac{1}{2}\|P_\Omega(XY^T - M)\|_F^2 + \lambda R(X, Y)$,    (10)

where R(X, Y) is an appropriate regularization term. In both the problems above, λ is a parameter (which may be viewed as an appropriate Lagrange multiplier), that is set to roughly $\sigma\sqrt{np}$ for their result.
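The elementwise constrained program Sdp(δ) in (5) can be written down directly in an off-the-shelf convex solver. Below is a minimal sketch using cvxpy (our own illustration, not the paper's code); the observation mask, noisy matrix, and δ are placeholder inputs.

```python
import cvxpy as cp
import numpy as np

def solve_elementwise_sdp(M_obs, mask, delta):
    """Solve Sdp(delta): minimize ||Z||_* subject to |Z_ij - M_ij| <= delta
    for every observed entry (i, j), i.e., every entry with mask == 1."""
    Z = cp.Variable(M_obs.shape)
    # Only observed entries are constrained; unobserved entries remain free.
    constraints = [cp.abs(cp.multiply(mask, Z - M_obs)) <= delta]
    problem = cp.Problem(cp.Minimize(cp.normNuc(Z)), constraints)
    problem.solve(solver=cp.SCS)
    return Z.value

# Example: noisy, partially observed rank-2 matrix.
rng = np.random.default_rng(2)
n, r, p, sigma = 40, 2, 0.4, 1e-3
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = (rng.random((n, n)) < p).astype(float)
M_obs = (M_star + sigma * rng.standard_normal((n, n))) * mask
delta = 4 * sigma * np.sqrt(np.log(n))
Z_hat = solve_elementwise_sdp(M_obs, mask, delta)
print(np.linalg.norm(Z_hat - M_star) / np.linalg.norm(M_star))
```

Note that adding more observed entries only adds constraints of the same form, which is what makes this formulation natural for the semi-random model.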

Unfortunately, the strong connection between the two optimization problems seems to fail to hold in the semi-random setting, i.e., when Ω in the formulations is replaced by Ω̃. The analysis of Chen et al. (2019) that the non-convex optimum is close to M* also strongly relies on the randomness of Ω. However, the existence of X, Y ∈ R^{n×r} with appropriate structural properties turns out to be very useful for us. We now state the lemmas from Chen et al. (2019) that we will use.

Lemma 1 (Properties from Chen et al. (2019)). There exist matrices $X, Y \in \mathbb{R}^{n\times r}$ (specifically the output of an appropriate algorithm for the non-convex problem above) with the following properties. In what follows, T refers to the tangent space defined as

    $T = \{XW^T + W'Y^T : W, W' \in \mathbb{R}^{n\times r}\}$.

$P_T$ and $P_{T^\perp}$ refer to the projections to T and $T^\perp$ respectively.

1. (Claim 2 of Chen et al. (2019)). Let $XY^T = U\Sigma V^T$ be the SVD. Then

    $P_\Omega(XY^T - M) = -\lambda UV^T + R$    (11)

where R is a residual matrix that satisfies

    $\|P_T(R)\|_F \le \frac{72\kappa\lambda}{n^5}$ and $\|P_{T^\perp}(R)\| \le \lambda/2$.

2. (Strong injectivity) The projection $P_\Omega$ satisfies:

    $\frac{1}{p}\|P_\Omega(H)\|_F^2 \ge \frac{1}{32\kappa}\|H\|_F^2 \quad \forall H \in T$.    (12)

3. (Lemma 5 of Chen et al. (2019)) $XY^T$ is close to the ground truth M* in the following sense:

    $\|XY^T - M^*\|_F \le C_F\, \frac{\lambda\kappa\sqrt{\mu^2 r}}{p}$,
    $\|XY^T - M^*\|_* \le C_{op}\, \frac{2r\lambda\kappa}{p}$,    (13)
    $\|P_\Omega(XY^T - M^*)\|_F \le C_\infty\, \frac{\lambda\sqrt{\mu^3 r^3 \kappa^5}}{\sqrt{p}}$,

where $C_F, C_{op}, C_\infty$ are appropriate constants.

Remarks. Roughly, Lemma 1 shows the existence of a low-rank matrix $XY^T$ that has nicer properties than even the ground-truth $X^*(Y^*)^T$, while being close to it in different norms. We also note that parts (1-3) of Lemma 1 are shown in Chen et al. (2019) for a slightly different value of λ. In the supplementary material, Section B, we verify that all the claims also hold for our choice of λ. This lemma is where the incoherence property plays a crucial role, especially in part 2 (Equation 12).

We now use this lemma to show our main result. The proof is inspired by duality-based arguments in many prior works (including Candes and Plan (2009); Chen et al. (2019)).

3.1 Proof of Theorem 1

Let Z denote the optimum solution to the optimization problem (5). Our goal will be to prove that Z is close to $XY^T$. By part (3) of Lemma 1, it will then follow that Z is also close to M*, by the triangle inequality.

To this end, define $Z = XY^T + \Delta$. In order to bound $\|\Delta\|_F$, we first write

    $\|\Delta\|_F \le \|P_T \Delta\|_F + \|P_{T^\perp} \Delta\|_F$,    (14)

and bound the two terms. The first step is to relate the first term on the RHS with the second:

    $\|P_\Omega(\Delta)\|_F = \|P_\Omega P_T(\Delta) + P_\Omega P_{T^\perp}(\Delta)\|_F \ge \|P_\Omega P_T(\Delta)\|_F - \|P_\Omega P_{T^\perp}(\Delta)\|_F \ge \sqrt{\tfrac{p}{32\kappa}}\,\|P_T(\Delta)\|_F - \|P_{T^\perp}(\Delta)\|_F$.    (15)

In the last step, we used the strong injectivity property from Lemma 1. This implies that

    $\|P_T(\Delta)\|_F \le \sqrt{\tfrac{32\kappa}{p}}\left(\|P_{T^\perp}(\Delta)\|_F + \|P_\Omega(\Delta)\|_F\right)$.    (16)

The term $\|P_\Omega(\Delta)\|_F$ turns out to be easy to bound, using the properties of the SDP:

    $\|P_\Omega(\Delta)\|_F = \|P_\Omega(Z - XY^T)\|_F \le \|P_\Omega(Z - M^*)\|_F + \|P_\Omega(M^* - XY^T)\|_F \le 2\lambda + C_\infty\,\frac{\lambda\sqrt{\mu^3 r^3 \kappa^5}}{\sqrt{p}}$.    (17)

The last inequality is due to the fact that M* is a feasible solution to the SDP (5) (which implies that both Z and M* are within distance λ from M), combined with the last inequality in (13).

Combining (14), (15) and (17), it follows that we only need to bound $\|P_{T^\perp}(\Delta)\|_F$. We will indeed show a stronger bound, on the quantity $\|P_{T^\perp}(\Delta)\|_*$, which is always $\ge \|P_{T^\perp}(\Delta)\|_F$.

The bulk of the argument is thus in bounding $\|P_{T^\perp}(\Delta)\|_*$. The key claim is the following, relating this term to the matrices U, V (defined using the SVD $XY^T = U\Sigma V^T$).

Claim 1. We have

    $\|P_{T^\perp}(\Delta)\|_* \le \|XY^T + \Delta\|_* - \|XY^T\|_* - \langle UV^T, \Delta\rangle$.

Proof of Claim 1. By definition, for any $W \in T^\perp$ with $\|W\| \le 1$, we have that $UV^T + W$ is in the subgradient of $\|\cdot\|_*$ at $XY^T$. Thus for any such W, we have

    $\|XY^T + \Delta\|_* \ge \|XY^T\|_* + \langle UV^T + W, \Delta\rangle$.

The observation is that we can pick W such that $\langle W, \Delta\rangle = \|P_{T^\perp}(\Delta)\|_*$. This is well-known, and follows by choosing W using the SVD directions of $P_{T^\perp}(\Delta)$ (and this will lie entirely in $T^\perp$). Rearranging now implies the claim.

The next claim shows the following bound.

Claim 2. $\|XY^T + \Delta\|_* - \|XY^T\|_* \le C_{op}\,\frac{2r\lambda\kappa}{p}$.

Proof of Claim 2. Since $Z = XY^T + \Delta$ is the optimum solution to the SDP (5) and M* is another feasible solution, we have (using also (13) from Lemma 1),

    $\|Z\|_* \le \|M^*\|_* \le \|M^* - XY^T\|_* + \|XY^T\|_* \le C_{op}\,\frac{2r\lambda\kappa}{p} + \|XY^T\|_*$.    (18)

Rearranging now implies the claim.

Using the two claims, it follows that our goal should be to prove that $\langle UV^T, \Delta\rangle$ is not too negative. This is done somewhat indirectly.

Claim 3. From our choice of λ, we have:

    $\|P_\Omega(XY^T + \Delta - M)\|_F^2 \le \|P_\Omega(XY^T - M)\|_F^2$.    (19)

Proof of Claim 3. Because $Z = XY^T + \Delta$ is feasible for our SDP, we have that the LHS of (19) is $\le \lambda^2$, by our choice of λ. Next, using the property of the non-convex solution (part 1 of Lemma 1), we have $P_\Omega(XY^T - M) = -\lambda UV^T + R$, for some $R \in T^\perp$. This implies that $\|P_\Omega(XY^T - M)\|_F^2 \ge r\lambda^2 \ge \lambda^2$. Thus the claim follows.

We can now expand Eq. (19) to obtain

    $\frac{1}{2}\|P_\Omega(\Delta)\|_F^2 + \langle P_\Omega(XY^T - M), \Delta\rangle \le 0$.

Plugging in part 1 of Lemma 1 again, we get

    $\langle \lambda UV^T - R, \Delta\rangle \ge \frac{1}{2}\|P_\Omega(\Delta)\|_F^2 \ge 0$.

This implies that $\langle UV^T, \Delta\rangle \ge \frac{1}{\lambda}\langle R, \Delta\rangle$.

Now the key observation is that

    $|\langle R, \Delta\rangle| \le |\langle P_{T^\perp} R, P_{T^\perp} \Delta\rangle| + |\langle P_T R, P_T \Delta\rangle| \le \frac{\lambda}{2}\|P_{T^\perp}\Delta\|_* + \frac{72\kappa\lambda}{n^5}\|P_T\Delta\|_*$.    (20)

This immediately implies that

    $-\langle UV^T, \Delta\rangle \le \frac{1}{2}\|P_{T^\perp}\Delta\|_* + \frac{72\kappa}{n^5}\|P_T\Delta\|_*$.

Plugging this into Claim 1 and using Claim 2, we have that

    $\|P_{T^\perp}(\Delta)\|_* \le C_{op}\,\frac{2r\lambda\kappa}{p} + \frac{1}{2}\|P_{T^\perp}\Delta\|_* + \frac{72\kappa}{n^5}\|P_T\Delta\|_*$
    $\implies \|P_{T^\perp}(\Delta)\|_* \le C_{op}\,\frac{4r\lambda\kappa}{p} + \frac{144\kappa}{n^5}\|P_T\Delta\|_*$.    (21)

This is where we crucially used the factor of 1/2 from part 1 of Lemma 1.

Plugging this back into (16) and using (17) along with the fact that $\|\cdot\|_F \le \|\cdot\|_* < n\|\cdot\|_F$, we obtain:

    $\|P_T(\Delta)\|_F \le \sqrt{\tfrac{32\kappa}{p}}\left(C_{op}\,\frac{4r\lambda\kappa}{p} + \frac{144\kappa}{n^5}\|P_T\Delta\|_* + 2\lambda + C_\infty\,\frac{\lambda\sqrt{\mu^3 r^3 \kappa^5}}{\sqrt{p}}\right)$.

Simplifying, we get

    $\|P_T(\Delta)\|_F\left(1 - \frac{900\kappa^{3/2}}{\sqrt{p}\,n^4}\right) \le \frac{16 r\lambda\kappa^{3/2}}{\sqrt{p}}\left(\frac{2C_{op}}{p} + C_\infty\sqrt{\kappa^3\mu^3 r^3}\right)$.

Using the fact that n is large enough, the coefficient on the LHS is > 1/2, and we get

    $\|P_T(\Delta)\|_F \le \frac{32 r\lambda\kappa^{3/2}}{\sqrt{p}}\left(\frac{2C_{op}}{p} + C_\infty\sqrt{\kappa^3\mu^3 r^3}\right)$.    (22)

Combining equations (22) and (21) with (14), the theorem follows.
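The proof above repeatedly manipulates the projections $P_T$ and $P_{T^\perp}$ onto the tangent space of the low-rank matrix $XY^T$. As a sanity check (a sketch we add for illustration, not from the paper), these projections have the standard closed forms $P_T(H) = UU^T H + H VV^T - UU^T H VV^T$ and $P_{T^\perp}(H) = (I - UU^T) H (I - VV^T)$, which the snippet below verifies numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 30, 4

# A rank-r matrix XY^T and the U, V from its SVD.
X = rng.standard_normal((n, r))
Y = rng.standard_normal((n, r))
U_full, _, Vt = np.linalg.svd(X @ Y.T, full_matrices=False)
U, V = U_full[:, :r], Vt[:r].T

def P_T(H):
    """Projection onto the tangent space T = {X W^T + W' Y^T}."""
    return U @ U.T @ H + H @ V @ V.T - U @ U.T @ H @ V @ V.T

def P_Tperp(H):
    """Projection onto the orthogonal complement of T."""
    return (np.eye(n) - U @ U.T) @ H @ (np.eye(n) - V @ V.T)

Delta = rng.standard_normal((n, n))
# Check the decomposition Delta = P_T(Delta) + P_Tperp(Delta) used throughout the proof.
print(np.allclose(Delta, P_T(Delta) + P_Tperp(Delta)))
# The Frobenius triangle inequality in (14) then holds by construction.
print(np.linalg.norm(Delta, 'fro')
      <= np.linalg.norm(P_T(Delta), 'fro') + np.linalg.norm(P_Tperp(Delta), 'fro'))
```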

4 Principal Component Regression

We now present the application of our results on matrix completion to the PCR problem. Recall that we have an unknown rank r covariate matrix M*, and we are given a perturbed, partially observed version of M*, which we will denote by $P_{\widetilde\Omega}(M)$ (see Section 2.2). Also, as before, Ω denotes the random part of the observed indices, and Ω̃ is the given set of indices, where Ω is appended with some (adversarially chosen) subset of indices.

Recall also that our goal is to find a β̂ whose MSE $\frac{1}{n}\|M^*\beta^* - M^*\hat\beta\|^2$ is minimized.

Approach of Agarwal et al. (2019). The work of Agarwal et al. (2019) studies a natural approach for PCR, when we are given $P_\Omega(M)$ (and not the projection to Ω̃). The idea is to first "complete" and denoise this matrix, thus obtaining a low-rank estimate M̂ for M*, and then using ordinary least squares with M̂, y to obtain β̂.

They then prove that the MSE of this β̂ can be bounded in terms of an appropriate function of the error (M̂ − M*) in estimating M*. To obtain M̂, they prove that simply using the best rank-r approximation of the matrix $\frac{1}{p}P_\Omega(M)$ suffices. This procedure is referred to as hard singular-value thresholding, or HSVT.

The semi-random setting. The issue with HSVT is that it crucially relies on $\frac{1}{p}P_\Omega(M)$ being an "unbiased estimator" of M, and since the error M − M* is i.i.d. Gaussian, also of M*. The guarantee for HSVT would completely fail to hold if Ω̃, for instance, is imbalanced so that entries in some columns have a probability > p of being revealed. We will demonstrate this also in our experiments (Section 5). Our main idea is thus to replace the HSVT with the noisy matrix completion subroutine developed in Section 3.

We start by presenting our overall algorithm for PCR. The algorithm assumes knowledge of n, and more crucially, of r and σ. (As discussed in Agarwal et al. (2019), when r ≪ n, dividing the input randomly and performing cross validation is a general way to overcome this issue.)

Algorithm 1 PCR via Matrix Completion
Input: A matrix $P_{\widetilde\Omega}(M)$ with missing entries and noise with variance σ².
Output: A feature weight vector β̂
1: Solve the SDP (5) (using σ) and get the optimal solution Z
2: Define Z^{(r)} ← rank-r approximation of Z (obtained via SVD)
3: Carry out ordinary least squares using Z^{(r)} and the given y, return the obtained β̂
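A minimal end-to-end sketch of Algorithm 1 in Python follows (added for illustration; it assumes a routine like the `solve_elementwise_sdp` sketch from Section 3, which is our own placeholder name rather than code from the paper).

```python
import numpy as np

def pcr_via_completion(M_obs, mask, y, r, sigma, solve_elementwise_sdp):
    """Algorithm 1 (sketch): complete the covariate matrix with the elementwise
    constrained SDP, truncate to rank r, then run ordinary least squares."""
    n = M_obs.shape[0]
    delta = 4 * sigma * np.sqrt(np.log(n))
    # Step 1: solve Sdp(delta) over the observed entries.
    Z = solve_elementwise_sdp(M_obs, mask, delta)
    # Step 2: rank-r approximation of Z via SVD.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    Z_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r]
    # Step 3: ordinary least squares on the completed, denoised covariates.
    beta_hat, *_ = np.linalg.lstsq(Z_r, y, rcond=None)
    return beta_hat
```

The rank-r truncation in step 2 matters for the analysis: Lemma 2 below bounds $\|Z^{(r)} - M^*\|_F$, so that the rank-r argument of Agarwal et al. (2019) can be reused.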

We show the following result.

Theorem 2. Suppose the ground truth matrix M* has rank r and has a decomposition that satisfies the properties described in Section 2. Suppose also that the noise parameter σ satisfies the conditions of Theorem 1. Finally, suppose the values y satisfy the following, for some unknown β*:

    $Y_i = M^*_{i,\cdot}\beta^* + \phi_i + \epsilon_i$,

where $\phi_i$ is a model mismatch parameter. Then with high probability, the output β̂ of Algorithm 1 satisfies

    $\mathrm{MSE}(\hat\beta) \le \frac{4\gamma^2 r}{n} + \frac{20\|\phi\|_2^2}{n} + \frac{3\|\beta^*\|_2^2}{n}\cdot\frac{4C^2\kappa^6 r^6 \mu^4 \sigma^2 n^2 \log n}{p^2}$,

where γ² is the variance of the noise in the regression model and the rest of the parameters are defined as in previous sections.

Remark. In our proof below as well as in previous works on robust PCR, we recover the incomplete matrix before training the regression model rather than directly approximating the target Mβ. The reason for this is partly that the top r components of M must be estimated either directly or indirectly, and bounding the error seems to require bounding the error in this estimation.

4.1 Proof of Theorem 2

We show that a slight modification of the proof of Agarwal et al. (2019), combined with our argument from Section 3, yields the theorem.

At a high level, the argument in Agarwal et al. (2019) proceeds by proving an upper bound on the MSE in terms of ‖β*‖ and the error ‖M̂ − M*‖, where M̂ is the completed matrix. However, the proof uses the fact that M̂ is also of rank r. This is the reason we use Z^{(r)} instead of the SDP solution Z itself in Algorithm 1. A simple lemma shows that ‖Z^{(r)} − M*‖ is also small.

Lemma 2. Let Z^{(r)} be the best rank-r approximation of Z (where r = rank(M*)), as computed in the algorithm. Then with high probability, we have

    $\|Z^{(r)} - M^*\|_F \le 2C\cdot\frac{\kappa^3 r^3 \mu^2 n\sigma\sqrt{\log n}}{p}$.    (23)

The proof is based on using the matrix $XY^T$ from Section 3 to show that Z has a good low-rank approximation. The proof is deferred to Section A of the supplement.

We can thus prove our main result of the section.

Proof of Theorem 2. We invoke the proof of Theorem 6 in Agarwal et al. (2019). This relies on the fact that the estimate M̂ is rank r (denoted Â in the paper; it is the result of HSVT on the normalized matrix). In our case, we can carry out the same arguments with Z^{(r)} instead of Â. This is used to obtain (see Eq. (29) of Agarwal et al. (2019)):

    $n\,\mathrm{MSE}(\hat\beta) \le 4\gamma^2 r + 3\|(Z^{(r)} - M^*)\beta^*\|_2^2 + 20\|\phi\|_2^2$.

To obtain the desired bound from this, we can use Hölder's inequality, which yields:

    $\|(Z^{(r)} - M^*)\beta^*\|_2^2 \le \|\beta^*\|_2^2\, \|Z^{(r)} - M^*\|_F^2$.

This along with the bound above gives us

    $\frac{1}{n}\|Z^{(r)}\hat\beta - M^*\beta^*\|_2^2 \le \frac{4\gamma^2 r}{n} + \frac{20\|\phi\|_2^2}{n} + \frac{3\|\beta^*\|_2^2}{n}\|Z^{(r)} - M^*\|_F^2$.

Combining this with Lemma 2 completes the proof of the theorem.

5 Experiments

In this section we empirically evaluate our results on matrix completion and the application to PCR using synthetic datasets. Additional experiments are deferred to Section A of the supplement.

5.1 Matrix completion

We first compare the error rates of the solutions to SDPs (5) (elementwise constrained) and (9) (unconstrained) with uniformly random observations with varying noise levels (measured by the noise parameter σ). The error is measured as $\|Z - M^*\|_F / \|M^*\|_F$ where Z is the solution to each SDP and M* is the ground truth matrix. Here we consider a ground truth matrix with n = 100 and r = 5 where each singular value is 1 (Figure 1). Covariates are sampled with probability p = 0.4 and results are averaged over 5 iterations. λ in SDP (9) is set to $5\sigma\sqrt{np}$.

Figure 1: Scaled Frobenius norm error of recovered matrices when observations are uniformly random.

When the covariates are observed uniformly at random, as seen in Figure 1, solving the SDP (5) gives slightly worse error compared to the SDP from (9), which is known to give the tight bounds.

In the second experiment, we compare the error rates of the SDP (5) and HSVT with random and semi-random observations. Similar to the previous experiment, error is measured in Frobenius norm scaled by $1/\|M^*\|_F$ with varying noise levels. We consider two ground truth matrices, with ranks 1 and 5. Each covariate is first observed with probability 0.4, then the first 90 rows of the first 30 columns are opened later as additional observations. The results are averaged over 5 iterations (Figure 2).

Error rates of HSVT are higher in general than those obtainable via the SDP. Further, the gap between the random and semi-random cases is much worse in the case of HSVT. This suggests that the solutions to the elementwise constrained SDP (5) remain stable under semirandom perturbation.

5.2 PCR using matrix completion vs. PCR with HSVT

We compare the performance of our algorithm with HSVT in principal component regression with covariates observed in a random and semi-random manner. Similar to the experiments in the previous section, we generate ground truth covariate matrices M* with n = 100 and r = 1, 5. We generate a regression coefficient vector β* of size n. In the rank 1 case, each covariate is observed with probability p = 0.4 at first, then the first 90 rows of the first 30 columns are opened later as additional observations. In the rank 5 case, each covariate is observed with probability p = 0.5 at first, then every row of the first 10 columns is opened later as additional observations. We evaluate the regression error as in equation (4) for the recovered matrices using our algorithm and HSVT with varying noise levels (Figure 3). The results are averaged over 5 iterations.

Regression error rates using HSVT imitate the error rates displayed in matrix completion with the same technique, while replacing the HSVT matrix by the solution to the elementwise constrained SDP (5) maintains stable regression error rates.
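The semirandom observation pattern used in these experiments is easy to reproduce. Below is a sketch (our own code, with placeholder dimensions taken from the description above) that builds the mask with the adversarially revealed block and computes the HSVT baseline for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, p, sigma = 100, 5, 0.4, 1e-3

# Ground truth: rank-5 matrix with unit singular values.
A = np.linalg.qr(rng.standard_normal((n, r)))[0]
B = np.linalg.qr(rng.standard_normal((n, r)))[0]
M_star = A @ B.T

# Random observations, then an adversarial block: first 90 rows of the first 30 columns.
mask = (rng.random((n, n)) < p).astype(float)
mask_semi = mask.copy()
mask_semi[:90, :30] = 1.0

M_noisy = M_star + sigma * rng.standard_normal((n, n))

def hsvt(M_obs, mask, p, r):
    """HSVT baseline: best rank-r approximation of (1/p) * P_Omega(M)."""
    U, S, Vt = np.linalg.svd((mask * M_obs) / p, full_matrices=False)
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r]

for name, m in [("random", mask), ("semirandom", mask_semi)]:
    err = np.linalg.norm(hsvt(M_noisy, m, p, r) - M_star) / np.linalg.norm(M_star)
    print(name, round(err, 3))   # the adversarial block typically degrades the HSVT estimate
```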

Figure 2: Comparison of scaled Frobenius norm error of recovered matrices when observations are random and semirandom (panels for r = 1 and r = 5; curves SDP-random, SDP-semirandom, HSVT-random, HSVT-semirandom).

Figure 3: Comparison of regression error when covariates are observed in random and semirandom manner (panels for r = 1 and r = 5).

6 Conclusion

We investigate the principal component regression problem with noisy and missing entries, where the set of observed entries is not i.i.d., but comes from a semirandom model. In this case, existing algorithms based on spectral methods fail, and our contribution is a semidefinite programming (SDP) based algorithm that is robust to such cases. The key technical contribution in this paper is an analysis of matrix completion using an elementwise constrained SDP, and we develop the first recovery bounds under noisy and semirandom observations. We complement our results with experiments demonstrating the pros and cons of our approach.

References

Agarwal, A., Shah, D., Shen, D., and Song, D. (2019). On robustness of principal component regression.

Bhojanapalli, S. and Jain, P. (2014). Universal matrix completion. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1881–1889, Bejing, China. PMLR.

Blum, A. and Spencer, J. (1995). Coloring random and semi-random k-colorable graphs. J. Algorithms, 19(2):204–234.

Candes, E. J. and Plan, Y. (2009). Matrix completion with noise.

Candes, E. J. and Recht, B. (2008). Exact matrix completion via convex optimization.

Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214.

Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. (2019). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization.

Cheng, Y. and Ge, R. (2018). Non-convex matrix completion against a semi-random adversary. In Bubeck, S., Perchet, V., and Rigollet, P., editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1362–1394. PMLR.

Diakonikolas, I., Gouleakis, T., and Tzamos, C. (2019). Distribution-independent PAC learning of halfspaces with Massart noise. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Draper, N. and Smith, H. (1966). Applied regression analysis. Wiley series in probability and mathematical statistics. Wiley, New York.

Feige, U. and Krauthgamer, R. (2000). Finding and certifying a large hidden clique in a semirandom graph. Random Structures & Algorithms, 16(2):195–208.

Gunst, R. and Webster, J. (1975). Regression analysis and problems of multicollinearity. Communications in Statistics, 4(3):277–292.

Hardt, M., Meka, R., Raghavendra, P., and Weitz, B. (2014). Computational limits for matrix completion. In Balcan, M. F., Feldman, V., and Szepesvári, C., editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 703–725, Barcelona, Spain. PMLR.

Hocking, R. R. (1972). Criteria for selection of a subset regression: Which one should be used? Technometrics, 14(4):967–976.

Hotelling, H. (1957). The relations of the newer multivariate statistical methods to factor analysis. British Journal of Statistical Psychology, 10(2):69–79.

Jolliffe, I. (1986). Principal Component Analysis. Springer Verlag.

Keshavan, R. H., Montanari, A., and Oh, S. (2012). Matrix completion from noisy entries.

Little, R. J. A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association, 87(420):1227–1237.

Makarychev, K., Makarychev, Y., and Vijayaraghavan, A. (2012). Approximation algorithms for semi-random graph partitioning problems.

Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression: a Second Course in Statistics. Addison-Wesley Publishing Company.