arXiv:2007.01394v4 [stat.ML] 4 Dec 2020

Robust Linear Regression: Optimal Rates in Polynomial Time

Ainesh Bakshi*                    Adarsh Prasad
CMU                               CMU
[email protected]           [email protected]

Abstract

We obtain robust and computationally efficient estimators for learning several linear models that achieve statistically optimal convergence rate under minimal distributional assumptions. Concretely, we assume our data is drawn from a k-hypercontractive distribution and an ǫ-fraction is adversarially corrupted. We then describe an estimator that converges to the optimal least-squares minimizer for the true distribution at a rate proportional to ǫ^{2−2/k}, when the noise is independent of the covariates. We note that no such estimator was known prior to our work, even with access to unbounded computation. The rate we achieve is information-theoretically optimal and thus we resolve the main open question in Klivans, Kothari and Meka [COLT'18].

Our key insight is to identify an analytic condition that serves as a polynomial relaxation of independence of random variables. In particular, we show that when the moments of the noise and covariates are negatively correlated, we obtain the same rate as independent noise. Further, when the condition is not satisfied, we obtain a rate proportional to ǫ^{2−4/k}, and again match the information-theoretic lower bound. Our central technical contribution is to algorithmically exploit independence of random variables in the "sum-of-squares" framework by formulating it as the aforementioned polynomial inequality.

* AB would like to thank the partial support from the Office of Naval Research (ONR) grant N00014-18-1-2562, and the National Science Foundation (NSF) under Grant No. CCF-1815840.

Contents
1 Introduction
  1.1 Our Results
  1.2 Related Work

2 Technical Overview
  2.1 Total Variation Closeness implies Hyperplane Closeness
  2.2 Proofs to Inefficient Algorithms: Distilling Constraints
  2.3 Efficient Algorithms via Sum-of-Squares
  2.4 Distribution Families

3 Preliminaries
  3.1 SoS Background

4 Robust Certifiability and Information Theoretic Estimators

5 Robust Regression in Polynomial Time
  5.1 Analysis

6 Lower Bounds
  6.1 True Linear Model
  6.2 Agnostic Model

7 Bounded Covariance Distributions

A Robust Identifiability for Arbitrary Noise

B Efficient Estimator for Arbitrary Noise

C Proof of Lemma 3.4

1 Introduction
While classical statistical theory has focused on designing statistical estimators assuming access to i.i.d. samples from a nice distribution, estimation in the presence of adversarial outliers has been a challenging problem since it was formalized by Huber [Hub64]. A long and influential line of work in high-dimensional robust statistics has since focused on studying the trade-off between sample complexity, accuracy and, more recently, computational complexity for basic tasks such as estimating the mean and covariance [LRV16, DKK+16, CSV17, KS17b, SCV17, CDG19, DKK+17, DKK+18a, CDGW19], moment tensors of distributions [KS17b] and regression [DKS17, KKM18b, DKK+19, PSBR20, KKK19, RY20a].

Regression continues to be extensively studied under various models, including realizable regression (no noise), true linear models (independent noise), asymmetric noise, agnostic regression and generalized linear models (see [Wei05] and references therein). In each model, a variety of distributional assumptions are considered over the covariates and the noise. As a consequence, there exist innumerable estimators for regression achieving various trade-offs between sample complexity, running time and rate of convergence. The presence of adversarial outliers adds yet another dimension along which to design and compare estimators.

Seminal works on robust regression focused on designing non-convex loss functions, including M-estimators [Hub11], Theil-Sen estimators [The92, Sen68], R-estimators [Jae72], Least-Median-Squares [Rou84] and S-estimators [RY84]. These estimators have desirable statistical properties under disparate assumptions, yet remain computationally intractable in high dimensions. Further, recent works show that it is information-theoretically impossible to design robust estimators for linear regression without distributional assumptions [KKM18b].

An influential recent line of work showed that when the data is drawn from the well-studied and highly general class of hypercontractive distributions (see Definition 1.3), there exist robust and computationally efficient estimators for regression [KKM18b, PSBR20, DKS19]. Several families of natural distributions fall into this category, including Gaussians, strongly log-concave distributions and product distributions on the hypercube. However, both estimators converge to the true hyperplane (in ℓ₂-norm) at a sub-optimal rate, as a function of the fraction of corrupted points.

Given the vast literature on ad-hoc and often incomparable estimators for high-dimensional robust regression, the central question we address in this work is as follows:
Does there exist a unified approach to design robust and computationally efficient estimators achieving optimal rates for all linear regression models under mild distributional assumptions?
We address the aforementioned question by introducing a framework to design robust estimators for linear regression when the input is drawn from a hypercontractive distribution. Our estimators converge to the true hyperplanes at the information-theoretically optimal rate (as a function of the fraction of corrupted data) under various well-studied noise models, including independent and agnostic noise. Further, we show that our estimators can be computed in polynomial time using the sum-of-squares convex hierarchy.

We note that, despite decades of progress, prior to our work, estimators achieving the optimal convergence rate in terms of the fraction of corrupted points were not known, even with independent noise and access to unbounded computation.
1.1 Our Results
We begin by formalizing the regression model we work with. In classical regression, we assume D is a distribution over ℝ^d × ℝ and, for a vector Θ ∈ ℝ^d, the least-squares loss is given by err_D(Θ) = E_{(x,y)∼D}[(y − x⊤Θ)²]. The goal is to learn Θ∗ = arg min_Θ err_D(Θ). We assume sample access to D, and given n i.i.d. samples, we want to obtain a vector Θ that approximately achieves the optimal error, err_D(Θ∗).

In contrast to the classical setting, we work in the strong contamination model. Here, an adversary has access to the input samples and is allowed to corrupt an ǫ-fraction arbitrarily. Note that the adversary has access to unbounded computation and has knowledge of the estimators we design. This is the most stringent corruption model and captures Huber contamination, additive corruption, label noise, agnostic learning, etc. (see [DK19]). Formally,
Model 1.1 (Robust Regression Model). Let D be a distribution over ℝ^d × ℝ such that the marginal distribution over ℝ^d is centered and has covariance Σ∗, and let Θ∗ = arg min_Θ E_{(x,y)∼D}[(y − ⟨Θ, x⟩)²] be the optimal hyperplane for D. Let {(x₁∗, y₁∗), (x₂∗, y₂∗), …, (xₙ∗, yₙ∗)} be n i.i.d. random variables drawn from D. Given ǫ > 0, the robust regression model R_D(ǫ, Σ∗, Θ∗) outputs a set of n samples {(x₁, y₁), …, (xₙ, yₙ)} such that for at least (1 − ǫ)n points xᵢ = xᵢ∗ and yᵢ = yᵢ∗. The remaining ǫn points are arbitrary, and potentially adversarial w.r.t. the input and estimator.
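To make the contamination model concrete, the following is a minimal simulation sketch (hypothetical code of ours, not from the paper): it draws clean samples from a Gaussian linear model and lets a toy "adversary" overwrite an ǫ-fraction. The helper name sample_robust_regression, the Gaussian marginal, and the particular attack are illustrative choices only; Model 1.1 permits an arbitrary, computationally unbounded adversary.

```python
import numpy as np

def sample_robust_regression(n, d, eps, Sigma, theta_star, sigma_noise, rng):
    # Clean draws from D: x ~ N(0, Sigma), y = <Theta*, x> + independent noise,
    # so Theta* is the least-squares minimizer for the uncorrupted distribution.
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    y = X @ theta_star + sigma_noise * rng.standard_normal(n)
    # Adversary: overwrite an eps-fraction of the samples. This toy attack
    # plants points on the hyperplane -Theta*; the model allows any corruption,
    # chosen with full knowledge of the data and the estimator.
    m = int(eps * n)
    idx = rng.choice(n, size=m, replace=False)
    X[idx] = rng.multivariate_normal(np.zeros(d), Sigma, size=m)
    y[idx] = X[idx] @ (-theta_star)
    return X, y

rng = np.random.default_rng(0)
X, y = sample_robust_regression(n=10_000, d=5, eps=0.05, Sigma=np.eye(5),
                                theta_star=np.ones(5), sigma_noise=1.0, rng=rng)
```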
A natural starting point is to assume that the marginal distribution over the covariates (the x’s above) is heavy-tailed and has bounded, finite covariance. However, we show that there is no robust estimator in this setting, even when the linear model has no noise and the uncorrupted points lie on a line.
Theorem 1.2 (Bounded Covariance does not suffice, Theorem 7.1 informal). For all ǫ > 0, there exist two distributions D1, D2 over ℝ^d × ℝ such that the marginal distribution over the covariates has bounded covariance Σ with ‖Σ‖₂ = Θ(1), yet ‖Σ^{1/2}(Θ1 − Θ2)‖₂ = Ω(1), where Θ1 and Θ2 are the optimal hyperplanes for D1 and D2.

The aforementioned result precludes any statistical estimator that converges to the true hyperplane as the fraction of corrupted points tends to 0. Therefore, we strengthen the distributional assumption and consider hypercontractive distributions instead.
Definition 1.3 ((C, k)-Hypercontractivity). A distribution D over ℝ^d is (C, k)-hypercontractive for an even integer k ≥ 4 if for all r ∈ [k/2] and for all v ∈ ℝ^d,

$$\mathop{\mathbb{E}}_{x\sim\mathcal{D}}\left[\big\langle v,\, x - \mathbb{E}[x]\big\rangle^{2r}\right] \;\le\; \mathop{\mathbb{E}}_{x\sim\mathcal{D}}\left[C\,\big\langle v,\, x - \mathbb{E}[x]\big\rangle^{2}\right]^{r}.$$

Remark 1.4. Hypercontractivity captures a broad class of distributions, including Gaussian distributions, uniform distributions over the hypercube and sphere, affine transformations of isotropic distributions satisfying Poincaré inequalities [KS17a] and strongly log-concave distributions. Further, hypercontractivity is preserved under natural closure properties like affine transformations, products and weighted mixtures [KS17c]. Moreover, the efficiently computable estimators appearing in this work require certifiable hypercontractivity (Definition 3.5), a strengthening that continues to capture the aforementioned distribution classes.
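As a quick sanity check (a standard Gaussian moment computation, not taken from the paper): if x ∼ N(0, Σ), then ⟨v, x⟩ ∼ N(0, v⊤Σv), and

$$\mathop{\mathbb{E}}_{x}\left[\langle v, x\rangle^{2r}\right] \;=\; (2r-1)!!\,\big(v^{\top}\Sigma v\big)^{r} \;\le\; \big((2r-1)\, v^{\top}\Sigma v\big)^{r} \;=\; \mathop{\mathbb{E}}_{x}\left[(2r-1)\,\langle v, x\rangle^{2}\right]^{r},$$

so N(0, Σ) satisfies Definition 1.3 with C = k − 1, since 2r − 1 ≤ k − 1 for all r ∈ [k/2].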
In this work we focus on the rate of convergence of our estimators to the true hyperplane, Θ∗, as a function of the fraction of corrupted points, denoted by ǫ. We measure convergence both in parameter distance (the ℓ₂-distance between the hyperplanes) and in least-squares error on the true distribution (err_D).

We introduce a simple analytic condition on the relationship between the noise (the marginal distribution over y − x⊤Θ∗) and the covariates (the marginal distribution over x) that can be considered a proxy for independence of y − x⊤Θ∗ and x:
Definition 1.5 (Negatively Correlated Moments). Given a distribution D over ℝ^d × ℝ such that the marginal distribution on ℝ^d is (c_k, k)-hypercontractive, the corresponding regression instance has negatively correlated moments if for all r ≤ k and for all v,

$$\mathop{\mathbb{E}}_{x,y\sim\mathcal{D}}\left[\langle v, x\rangle^{r}\,\big(y - x^{\top}\Theta^{*}\big)^{r}\right] \;\le\; O(1)\cdot \mathop{\mathbb{E}}_{x\sim\mathcal{D}}\left[\langle v, x\rangle^{r}\right]\, \mathop{\mathbb{E}}_{x,y\sim\mathcal{D}}\left[\big(y - x^{\top}\Theta^{*}\big)^{r}\right].$$

Informally, the negatively correlated moments condition can be viewed as a polynomial relaxation of independence of random variables. Note that it is easy to see that the above definition is satisfied when the noise is independent of the covariates.
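To spell out that last observation (a one-line verification, included for completeness): if the noise η = y − x⊤Θ∗ is independent of x, then for every r ≤ k and every v the expectation factorizes,

$$\mathop{\mathbb{E}}_{x,y\sim\mathcal{D}}\left[\langle v, x\rangle^{r}\,\eta^{r}\right] \;=\; \mathop{\mathbb{E}}_{x\sim\mathcal{D}}\left[\langle v, x\rangle^{r}\right]\,\mathbb{E}\left[\eta^{r}\right],$$

so Definition 1.5 holds with the implied constant equal to 1.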
Remark 1.6. We show that when this condition is satisfied by the true distribution D, we obtain rates that match the information-theoretically optimal rate in a true linear model, where the noise (the marginal distribution over y − x⊤Θ∗) is independent of the covariates (the marginal distribution over x). Further, when this condition is not satisfied, we show that there exist distributions for which obtaining rates matching the true linear model is impossible.
When the distribution over the input is hypercontractive and has negatively correlated moments, we obtain an estimator achieving rate proportional to ǫ^{1−1/k} for parameter recovery. Further, our estimator can be computed efficiently. Thus, our main algorithmic result is as follows:
Theorem 1.7 (Robust Regression with Negatively Correlated Noise, Theorem 5.1 informal). Given ǫ > 0, k ≥ 4, and n ≥ (d log(d))^{O(k)} samples from R_D(ǫ, Σ∗, Θ∗), such that D is (c, k)-certifiably hypercontractive and has negatively correlated moments, there exists an algorithm that runs in n^{O(k)} time and outputs an estimator Θ̃ such that with high probability,

$$\left\|(\Sigma^{*})^{1/2}\big(\Theta^{*} - \tilde{\Theta}\big)\right\|_{2} \;\le\; O\big(\epsilon^{1-1/k}\big)\,\mathrm{err}_{\mathcal{D}}(\Theta^{*})^{1/2}$$

and

$$\mathrm{err}_{\mathcal{D}}(\tilde{\Theta}) \;\le\; \Big(1 + O\big(\epsilon^{2-2/k}\big)\Big)\,\mathrm{err}_{\mathcal{D}}(\Theta^{*}).$$

Remark 1.8. We note that prior work does not draw a distinction between the independent and dependent noise models. In comparison (see Table 1), Klivans, Kothari and Meka [KKM18b] obtained a sub-optimal least-squares error scaling proportionally to ǫ^{1−2/k}. For the special case of k = 4, Prasad et al. [PSBR20] obtain least-squares error proportional to O(ǫκ²(Σ)), where κ is the condition number. In very recent independent work, Zhu, Jiao and Steinhardt [ZJS20] obtained a sub-optimal least-squares error scaling proportionally to ǫ^{2−4/k}.
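For the reader, the two bounds in Theorem 1.7 are consistent with one another via the standard least-squares decomposition (a routine computation, using that the marginal over x is centered with covariance Σ∗ and that the optimality of Θ∗ forces E[(y − x⊤Θ∗)x] = 0):

$$\mathrm{err}_{\mathcal{D}}(\tilde{\Theta}) - \mathrm{err}_{\mathcal{D}}(\Theta^{*}) \;=\; \mathbb{E}\Big[\big(x^{\top}(\Theta^{*} - \tilde{\Theta})\big)^{2}\Big] \;=\; \left\|(\Sigma^{*})^{1/2}\big(\Theta^{*} - \tilde{\Theta}\big)\right\|_{2}^{2},$$

so squaring the ǫ^{1−1/k} parameter-recovery rate gives exactly the ǫ^{2−2/k} excess least-squares error.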
Further, we show that the rate we obtained in Theorem 1.7 is information-theoretically optimal, even when the noise and covariates are independent:
Estimator                                        Independent Noise     Arbitrary Noise
Prasad et al. [PSBR20],
  Diakonikolas et al. [DKK+18b]                  ǫκ² (only k = 4)      ǫκ² (only k = 4)
Klivans, Kothari and Meka [KKM18b]               ǫ^{1−2/k}             ǫ^{1−2/k}
Zhu, Jiao and Steinhardt [ZJS20]                 ǫ^{2−4/k}             ǫ^{2−4/k}
Our Work (Thm 1.7, Cor 1.10)                     ǫ^{2−2/k}             ǫ^{2−4/k}
Lower Bounds (Thm 1.9, Thm 1.11)                 ǫ^{2−2/k}             ǫ^{2−4/k}

Table 1: Comparison of the convergence rates (for least-squares error) achieved by various computationally efficient estimators for robust regression, when the underlying distribution is (c_k, k)-hypercontractive.
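For intuition on how much these exponents matter, here is an illustrative calculation of ours (constants suppressed) tabulating the rates from Table 1:

```python
# Compare the least-squares error rates from Table 1 (constants suppressed).
for eps in (0.1, 0.01):
    for k in (4, 8, 16):
        kkm  = eps ** (1 - 2 / k)  # Klivans, Kothari and Meka [KKM18b]
        zjs  = eps ** (2 - 4 / k)  # Zhu, Jiao and Steinhardt [ZJS20]
        ours = eps ** (2 - 2 / k)  # Theorem 1.7 (independent noise)
        print(f"eps={eps:5}  k={k:2d}  "
              f"KKM18b={kkm:.1e}  ZJS20={zjs:.1e}  ours={ours:.1e}")
```

For example, at ǫ = 0.01 and k = 8 this prints roughly 3.2e−02, 1.0e−03 and 3.2e−04 respectively, so the gap between the prior rates and ours is two orders of magnitude.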
Theorem 1.9 (Lower Bound for Independent Noise, Theorem 6.1 informal). For any ǫ > 0, there exist two distributions D1, D2 over ℝ² × ℝ such that the marginal distribution over ℝ² has covariance Σ and is (c, k)-hypercontractive for both distributions, and yet ‖Σ^{1/2}(Θ1 − Θ2)‖₂ = Ω(ǫ^{1−1/k}σ), where Θ1, Θ2 are the optimal hyperplanes for D1 and D2 respectively, σ² = max(err_{D1}(Θ1), err_{D2}(Θ2)) and the noise is uniform over [−σ, σ]. Further, |err_{D1}(Θ2) − err_{D1}(Θ1)| = Ω(ǫ^{2−2/k}σ²).

Next, we consider the setting where the noise is allowed to be arbitrary, and need not have negatively correlated moments with the covariates. A simple modification to our algorithm and analysis yields an efficient estimator that obtains rate proportional to ǫ^{1−2/k} for parameter recovery.
Corollary 1.10 (Robust Regression with Dependent Noise, Corollary 4.2 informal). Given ǫ > 0, k ≥ 4 and n ≥ (d log(d))^{O(k)} samples from R_D(ǫ, Σ∗, Θ∗), such that D is (c, k)-certifiably hypercontractive, there exists an algorithm that runs in n^{O(k)} time and outputs an estimator Θ̃ such that with probability 9/10,

$$\left\|(\Sigma^{*})^{1/2}\big(\Theta^{*} - \tilde{\Theta}\big)\right\|_{2} \;\le\; O\big(\epsilon^{1-2/k}\big)\,\mathrm{err}_{\mathcal{D}}(\Theta^{*})^{1/2}$$

and

$$\mathrm{err}_{\mathcal{D}}(\tilde{\Theta}) \;\le\; \Big(1 + O\big(\epsilon^{2-4/k}\big)\Big)\,\mathrm{err}_{\mathcal{D}}(\Theta^{*}).$$

Further, we show that the dependence on ǫ is again information-theoretically optimal:
Theorem 1.11 (Lower Bound for Dependent Noise, Theorem 6.2 informal). For any ǫ > 0, there exist two distributions D1, D2 over ℝ² × ℝ such that the marginal distribution over ℝ² has covariance Σ and is (c, k)-hypercontractive for both distributions, and yet ‖Σ^{1/2}(Θ1 − Θ2)‖₂ = Ω(ǫ^{1−2/k}σ), where Θ1, Θ2 are the optimal hyperplanes for D1 and D2 respectively and σ² = max(err_{D1}(Θ1), err_{D2}(Θ2)). Further, |err_{D1}(Θ2) − err_{D1}(Θ1)| = Ω(ǫ^{2−4/k}σ²).

Applications for Gaussian Covariates. The special case where the marginal distribution over x is Gaussian has received considerable interest recently [DKS18, DKK+18b]. We note that our estimators extend to the setting of Gaussian covariates, since the uniform distribution over samples from N(0, Σ) is (O(k), O(k))-certifiably hypercontractive for all k (see Section 5 in Kothari and Steurer [KS17c]). As a consequence, instantiating Corollary 1.10 with k = log(1/ǫ) yields the following:
Corollary 1.12 (Robust Regression with Gaussian Covariates). Given ǫ > 0 and n ≥ (d log(d))^{O(log(1/ǫ))} samples from R_N(ǫ, Σ∗, Θ∗), such that the marginal distribution over the x's is N(0, Σ∗), there exists an algorithm that runs in n^{O(log(1/ǫ))} time and outputs an estimator Θ̃ such that with high probability,

$$\left\|(\Sigma^{*})^{1/2}\big(\Theta^{*} - \tilde{\Theta}\big)\right\|_{2} \;\le\; O\big(\epsilon \log(1/\epsilon)\big)\,\big(\mathrm{err}_{\mathcal{N}}(\Theta^{*})\big)^{1/2}.$$
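To see where the log factor plausibly comes from, here is a back-of-the-envelope calculation of ours, under the assumption (not stated explicitly above) that the constant hidden in the O(·) of Corollary 1.10 grows at most linearly in the hypercontractivity constant c = O(k) for Gaussians: setting k = log(1/ǫ) gives

$$\epsilon^{-2/k} = e^{2\ln(1/\epsilon)/k} = O(1), \qquad \text{hence} \qquad c\cdot\epsilon^{1-2/k} = O(k)\cdot\epsilon\cdot O(1) = O\big(\epsilon\log(1/\epsilon)\big).$$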