arXiv:2009.14774v2 [cs.LG] 25 May 2021

Consistent regression when oblivious outliers overwhelm∗

Tommaso d'Orsi†    Gleb Novikov‡    David Steurer§

Abstract

We consider a robust linear regression model $y = X\beta^* + \eta$, where an adversary oblivious to the design $X \in \mathbb{R}^{n \times d}$ may choose $\eta$ to corrupt all but an $\alpha$ fraction of the observations $y$ in an arbitrary way. Prior to our work, even for Gaussian $X$, no estimator for $\beta^*$ was known to be consistent in this model except for quadratic sample size $n \gtrsim (d/\alpha)^2$ or for logarithmic inlier fraction $\alpha \ge 1/\log n$. We show that consistent estimation is possible with nearly linear sample size and inverse-polynomial inlier fraction. Concretely, we show that the Huber loss estimator is consistent for every sample size $n = \omega(d/\alpha^2)$ and achieves an error rate of $O(d/\alpha^2 n)^{1/2}$. Both bounds are optimal (up to constant factors). Our results extend to designs far beyond the Gaussian case and only require the column span of $X$ to not contain approximately sparse vectors (similar to the kind of assumption commonly made about the kernel space for compressed sensing). We provide two technically similar proofs. One proof is phrased in terms of strong convexity, extending work of [TJSO14], and is particularly short. The other proof highlights a connection between the Huber loss estimator and high-dimensional median computations. In the special case of Gaussian designs, this connection leads us to a strikingly simple algorithm based on computing coordinate-wise medians that achieves optimal guarantees in nearly-linear time, and that can exploit sparsity of $\beta^*$. The model studied here also captures heavy-tailed noise distributions that may not even have a first moment.

∗This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 815464).
†ETH Zürich.
‡ETH Zürich.
§ETH Zürich.

Contents

1 Introduction
  1.1 Results about Huber-loss estimator
  1.2 Results about fast algorithms

2 Techniques
  2.1 Statistical guarantees from strong convexity
  2.2 Huber-loss estimator and high-dimensional medians
  2.3 Fast algorithms for Gaussian design

3 Huber-loss estimation guarantees from strong convexity
  3.1 Huber-loss estimator for Gaussian design and deterministic noise

4 Robust regression in linear time
  4.1 Warm up: one-dimensional settings
  4.2 High-dimensional settings
    4.2.1 High-dimensional estimation via median algorithm
    4.2.2 Nearly optimal estimation via bootstrapping
    4.2.3 Nearly optimal sparse estimation
    4.2.4 Estimation for non-spherical Gaussians
    4.2.5 Estimating covariance matrix

5 Bounding the Huber-loss estimator via first-order conditions

Bibliography

A Error convergence and model assumptions
  A.1 Lower bounds for consistent oblivious linear regression
  A.2 On the design assumptions
    A.2.1 Relaxing well-spread assumptions
  A.3 On the noise assumptions
    A.3.1 Tightness of noise assumptions

B Computing the Huber-loss estimator in polynomial time

C Consistent estimation in high-dimensional settings

D Concentration of measure

E Spreadness notions of subspaces

1 Introduction

Linear regression is a fundamental task in statistics: given observations $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$ following a linear model $y_i = \langle x_i, \beta^* \rangle + \eta_i$, where $\beta^* \in \mathbb{R}^d$ is the unknown parameter of interest and $\eta_1, \ldots, \eta_n$ is noise, the goal is to recover $\beta^*$ as accurately as possible. In the most basic setting, the noise values are drawn independently from a Gaussian distribution with mean 0 and variance $\sigma^2$. Here, the classical least-squares estimator $\hat{\beta}$ achieves an optimal error bound $\frac{1}{n}\|X(\beta^* - \hat{\beta})\|^2 \lesssim \sigma^2 \cdot d/n$ with high probability, where the design $X$ has rows $x_1, \ldots, x_n$. Unfortunately, this guarantee is fragile and the estimator may experience arbitrarily large error in the presence of even a small number of outliers.
In many modern applications, including economics [RL05], image recognition [WYG+08], and sensor networks [HBRN08], there is a desire to cope with such outliers, stemming from extreme events, gross errors, skewed and corrupted measurements. It is therefore paramount to design estimators robust to noise distributions that may have substantial probability mass on outlier values. In this paper, we aim to identify the weakest possible assumptions on the noise distribution such that for a wide range of measurement matrices $X$, we can efficiently recover the parameter vector $\beta^*$ with vanishing error.
The design of learning algorithms capable of succeeding on data sets contaminated by adversarial noise has been a central topic in robust statistics (e.g., see [DKK+19, CSV17] and their follow-ups for some recent developments). In the context of regression with adaptive adversarial outliers (i.e., outliers that may depend on the instance) several results are known [CT05, CRT05, KKM18, KKK19, LLC19, LSLC18, KP18, DT19, RY20]. However, it turns out that for adaptive adversaries, vanishing error bounds are only possible if the fraction of outliers is vanishing.
In order to make vanishing error possible in the presence of large fractions of outliers, we consider weaker adversary models that are oblivious to the design $X$. Different assumptions can be used to model oblivious adversarial corruptions. [SZF19] assume the noise distribution satisfies $\mathbb{E}[\eta_i \mid x_i] = 0$ and $\mathbb{E}\,|\eta_i|^{1+\delta} < \infty$ for some $0 \le \delta \le 1$, and show that if $X$ has constant condition number, then (a modification of) the Huber loss estimator [Hub64] is consistent for1 $n \ge \tilde{O}\big((\|X\|_\infty \cdot d)^{(1+\delta)/\delta}\big)$ (an estimator is consistent if the error tends to zero as the number of observations grows, $\frac{1}{n}\|X(\hat{\beta} - \beta^*)\|^2 \to 0$).
Without constraints on moments, a useful model is that of assuming the noise vector $\eta \in \mathbb{R}^n$ to be an arbitrary fixed vector with $\alpha \cdot n$ coordinates bounded by 1 in absolute value. This model2 also captures random vectors $\boldsymbol{\eta} \in \mathbb{R}^n$ independent of the measurement
1We hide absolute constant multiplicative factors using the standard notations $\lesssim$, $O(\cdot)$. Similarly, we hide multiplicative factors at most logarithmic in $n$ using the notation $\tilde{O}(\cdot)$.
2Several models for consistent robust linear regression appear in the literature. Our model is strong enough to subsume the models we are aware of. See Appendix A for a detailed comparison.

matrix $X$ and conveniently allows us to think of the fraction of samples with small noise as the set of uncorrupted samples. In these settings, the problem has been mostly studied in the context of Gaussian design $x_1, \ldots, x_n \sim N(0, \Sigma)$. [BJKK17a] provided an estimator achieving error $\tilde{O}(d/(\alpha^2 \cdot n))$ for any $\alpha$ larger than some fixed constant. This result was then extended in [SBRJ19], where the authors proposed a near-linear time algorithm computing a $\tilde{O}(d/(\alpha^2 \cdot n))$-close estimate for any3 $\alpha \gtrsim 1/\log\log n$, that is, allowing the number of uncorrupted samples to be $o(n)$. Considering even smaller fractions of inliers, [TJSO14] showed that with high probability the Huber loss estimator is consistent for $n \ge \tilde{O}(d^2/\alpha^2)$, thus requiring sample size quadratic in the ambient dimension.
Prior to this work, little was known for more general settings when the design matrix $X$ is non-Gaussian. From an asymptotic viewpoint, i.e., when $d$ and $\alpha$ are fixed and $n \to \infty$, a similar model was studied 30 years ago in a seminal work by Pollard [Pol91], albeit under stronger assumptions on the noise vector. Under mild constraints on $X$, it was shown that the least absolute deviation (LAD) estimator is consistent.
So, the outlined state of the art provides an incomplete picture of the statistical and computational complexity of the problem. The question of what conditions we need to enforce on the measurement matrix $X$ and the noise vector $\eta$ in order to efficiently and consistently recover $\beta^*$ remains largely unanswered. In high-dimensional settings, no estimator has been shown to be consistent when the fraction of uncontaminated samples is smaller than $1/\log n$ and the number of samples $n$ is smaller than $d^2/\alpha^2$, even in the simple settings of spherical Gaussian design. Furthermore, even less is known on how we can regress consistently when the design matrix is non-Gaussian.
In this work, we provide a more comprehensive picture of the problem. Concretely, we analyze the Huber loss estimator in a non-asymptotic, high-dimensional setting where the fraction of inliers may depend (even polynomially) on the number of samples and the ambient dimension. Under mild assumptions on the design matrix and the noise vector, we show that this estimator achieves optimal error guarantees and sample complexity. Furthermore, a by-product of our analysis is a strikingly simple linear-time estimator based on computing coordinate-wise medians that achieves nearly optimal guarantees for standard Gaussian design, even in the regime where the parameter vector $\beta^*$ is $k$-sparse (i.e., $\beta^*$ has at most $k$ nonzero entries).

1.1 Results about Huber-loss estimator

We provide here guarantees on the error convergence of the Huber-loss estimator, defined as a minimizer of the Huber loss $f \colon \mathbb{R}^d \to \mathbb{R}_{\ge 0}$,
$$f(\beta) = \frac{1}{n} \sum_{i=1}^{n} \Phi\big[(X\beta - y)_i\big]\,,$$
where $\Phi \colon \mathbb{R} \to \mathbb{R}_{\ge 0}$ is the Huber penalty,4
$$\Phi[t] \stackrel{\text{def}}{=} \begin{cases} \tfrac{1}{2} t^2 & \text{if } |t| \le 2\,,\\ 2|t| - 2 & \text{otherwise.} \end{cases}$$
3More precisely, their condition is $\alpha \gtrsim 1/\log n$ for consistent estimation and $\alpha \gtrsim 1/\log\log n$ to get the error bound $\tilde{O}(\tfrac{d}{\alpha^2 n})$.
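For concreteness, the following self-contained sketch (ours, not from the paper) implements the Huber penalty with transition point 2 and an approximate minimizer of the resulting loss; the helper names and the choice of SciPy's L-BFGS are our own, and the routine is only an illustration of the definition above.

import numpy as np
from scipy.optimize import minimize

def huber_penalty(t, h=2.0):
    """Huber penalty: quadratic for |t| <= h, linear (slope h) beyond; h=2 matches the paper."""
    a = np.abs(t)
    return np.where(a <= h, 0.5 * a**2, h * a - 0.5 * h**2)

def huber_loss(beta, X, y, h=2.0):
    """f(beta) = (1/n) * sum_i Phi[(X beta - y)_i]."""
    return np.mean(huber_penalty(X @ beta - y, h))

def huber_grad(beta, X, y, h=2.0):
    """Gradient: (1/n) * X^T clip(X beta - y, -h, h)."""
    return X.T @ np.clip(X @ beta - y, -h, h) / len(y)

def huber_estimator(X, y, h=2.0):
    """Approximate minimizer of the Huber loss via L-BFGS (one possible solver)."""
    d = X.shape[1]
    res = minimize(huber_loss, np.zeros(d), args=(X, y, h),
                   jac=huber_grad, method="L-BFGS-B")
    return res.x

For a general transition point $h$ the penalty is $t^2/2$ for $|t| \le h$ and $h|t| - h^2/2$ beyond, which reduces to the displayed definition for $h = 2$.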

Gaussian design. The following theorem states our guarantees for the Huber-loss estimator in the case of Gaussian designs. Previous quantitative guarantees for consistent robust linear regression focus on this setting [TJSO14, BJKK17a, SBRJ19].

Theorem 1.1 (Guarantees for Huber-loss estimator with Gaussian design). Let $\eta \in \mathbb{R}^n$ be a deterministic vector. Let $\mathbf{X}$ be a random5 $n$-by-$d$ matrix with iid standard Gaussian entries $\mathbf{X}_{ij} \sim N(0,1)$.
Suppose $n \ge C \cdot d/\alpha^2$, where $\alpha$ is the fraction of entries in $\eta$ of magnitude at most 1, and $C > 0$ is a large enough absolute constant.
Then, with probability at least $1 - 2^{-d}$ over $\mathbf{X}$, for every $\beta^* \in \mathbb{R}^d$, given $\mathbf{X}$ and $\mathbf{y} = \mathbf{X}\beta^* + \eta$, the Huber-loss estimator $\hat{\beta}$ satisfies

$$\|\beta^* - \hat{\beta}\|^2 \le O\Big(\frac{d}{\alpha^2 n}\Big)\,.$$

The above result improves over previous quantitative analyses of the Huber-loss estimator, which require quadratic sample size $n \gtrsim d^2/\alpha^2$ to be consistent [TJSO14]. Other estimators developed for this model [BJKK17b, SBRJ19] achieve a sample-size bound nearly-linear in $d$ at the cost of an exponential dependence on $1/\alpha$. For consistent estimation with a sample-size bound nearly-linear in $d$, these results require a logarithmic bound on the inlier fraction, $\alpha \gtrsim 1/\log d$. In contrast, our sample-size bound is nearly-linear in $d$ even for any sub-polynomial inlier fraction $\alpha \ge 1/d^{o(1)}$. In fact, our sample-size bound and estimation-error bound are statistically optimal up to constant factors.6
The proof of the above theorem also applies to approximate minimizers of the Huber loss, and it shows that such approximations can be computed in polynomial time.
We remark that, related to (one of) our analyses of the Huber-loss estimator, we develop a fast algorithm based on (one-dimensional) median computations that achieves estimation guarantees comparable to the ones above but in linear time $O(nd)$. A drawback of this fast algorithm is that its guarantees depend (mildly) on the norm of $\beta^*$.
Several results [CT05, CRT05, KP18, DKS19, DT19] considered settings where the noise vector is adaptively chosen by an adversary. In this setting, it is possible to obtain a unique

4Here, we choose 2 as the transition point between quadratic and linear penalty for simplicity. See Appendix A.3 for a discussion about different choices for this transition point.
5As a convention, we use boldface to denote random variables.
6In the case that $\eta \sim N(0, \sigma^2 \cdot \mathrm{Id})$, it is well known that the optimal Bayesian estimator achieves expected error $\sigma^2 \cdot d/n$. For $\sigma \ge 1$, the vector $\eta$ has a $\Theta(1/\sigma)$ fraction of entries of magnitude at most 1 with high probability.

estimate only if the fraction of outliers is smaller than $1/2$. In contrast, Theorem 1.1 implies consistency even when the fraction of corruptions tends to 1, but it applies to settings where the noise vector $\eta$ is fixed before sampling $\mathbf{X}$ and is thus oblivious to the data.

Deterministic design. The previous theorem makes the strong assumption that the design is Gaussian. However, it turns out that our proof extends to a much broader class of designs with the property that their column spans are well-spread (in the sense that they don't contain vectors whose $\ell_2$-mass is concentrated on a small number of coordinates, see [GLW08]). In order to formulate this more general result, it is convenient to move the randomness from the design to the noise vector and consider deterministic designs $X \in \mathbb{R}^{n \times d}$ with a probabilistic $n$-dimensional noise vector $\boldsymbol{\eta}$,

$$\mathbf{y} = X\beta^* + \boldsymbol{\eta}\,. \tag{1.1}$$

Here, we assume that $\boldsymbol{\eta}$ has independent, symmetrically distributed entries satisfying $\mathbb{P}\{|\boldsymbol{\eta}_i| \le 1\} \ge \alpha$ for all $i \in [n]$.
This model turns out to generalize the one considered in the previous theorem. Indeed, given data following the previous model with Gaussian design and deterministic noise, we can generate data following the above model by randomly subsampling the given data and multiplying with random signs (see Appendix A for more details).

Theorem 1.2 (Guarantees for Huber-loss estimator with general design). Let $X \in \mathbb{R}^{n \times d}$ be a deterministic matrix and let $\boldsymbol{\eta}$ be an $n$-dimensional random vector with independent, symmetrically distributed entries and $\alpha = \min_{i \in [n]} \mathbb{P}\{|\boldsymbol{\eta}_i| \le 1\}$.
Suppose that for every vector $v$ in the column span of $X$ and every subset $S \subseteq [n]$ with $|S| \le C \cdot d/\alpha^2$,

$$\|v_S\| \le 0.9 \cdot \|v\|\,, \tag{1.2}$$

where $v_S$ denotes the restriction of $v$ to the coordinates in $S$, and $C > 0$ is a large enough absolute constant. Then, with probability at least $1 - 2^{-d}$ over $\boldsymbol{\eta}$, for every $\beta^* \in \mathbb{R}^d$, given $X$ and $\mathbf{y} = X\beta^* + \boldsymbol{\eta}$, the Huber-loss estimator $\hat{\beta}$ satisfies

$$\frac{1}{n}\|X(\beta^* - \hat{\beta})\|^2 \le O\Big(\frac{d}{\alpha^2 n}\Big)\,.$$

In particular, Theorem 1.2 implies that under condition Eq. (1.2) and mild noise as- sumptions, the Huber loss estimator is consistent for = > $ 3 2 . / As we only assume the column span of - to be well-spread, the result applies to  a substantially broader class of design matrices - than Gaussian, naturally including those studied in [TJSO14, BJKK17a, SBRJ19]. Spread subspaces are related to ℓ1-vs-ℓ2

6 distortion7, and have some resemblance with restricted isometry properties (RIP). Indeed both RIP and distortion assumptions have been successfully used in compressed sensing [CT05, CRT05, KT07, Don06] but, to the best of our knowledge, they were never observed to play a fundamental role in the context of robust linear regression. This is a key difference between our analysis and that of previous works. Understanding how crucial this well- spread property is and how to leverage it allows us to simultaneously obtain nearly optimal error guarantees, while also relaxing the design matrix assumptions. It is important to remark that a weaker version of property Eq. (1.2) is necessary as otherwise it may be information theoretically impossible to solve the problem (see Lemma A.5). We derive both Theorem 1.1 and Theorem 1.2 using the same proof techniques ex- plained in Section 2. Remark 1.3 (Small failure probability). For both Theorem 1.1 and Theorem 1.2 our proof 3 log 1  also gives that for any  0, 1 , the Huber loss estimator achieves error $ + ( / ) with ∈ ( ) 2= 3 ln 1  probability at least 1  as long as = & + ( / ) , and, in Theorem 1.2, the well-spread − 2 3 log 1  property is satisfied for all sets ( = of size ( 6 $ + ( / ) . ⊆[ ] | | 2   1.2 Results about fast algorithms The Huber loss estimator has been extensively applied to robust regression problems [TSW18, TJSO14, EvdG+18]. However, one possible drawback of such algorithm (as well as other standard approaches such as !1-minimization [Pol91, KP18, NT13]) is the non- linear running time. In real-world applications with large, high dimensional datasets, an algorithm running in linear time $ =3 may make the difference between feasible and ( ) unfeasible. In the special case of Gaussian design, previous results [SBRJ19] already obtained estimators computable in linear time. However these algorithms require a logarithmic bound on the fraction of inliers & 1 log =. We present here a strikingly simple algorithm / that achieves similar guarantees as the ones shown in Theorem 1.1 and runs in linear time: ˆ for each coordinate 9 3 compute the median #9 of y1 ^19 ,..., y= ^=9 subtract the ∈ [ ] / / resulting estimation ^#ˆ and repeat, logarithmically many times, with fresh samples.

= 3 Theorem 1.4 (Guarantees for fast estimator with Gaussian design). Let  ℝ and ∗ ℝ ∈ ∈ be deterministic vectors. Let ^ be a random =-by-3 matrix with iid standard Gaussian entries ^89 # 0, 1 . ∼ ( ) Let be the fraction of entries in  of magnitude at most 1, and let Δ > 10  . Suppose +k ∗ k that 3 = >  ln Δ ln 3 lnln Δ , · 2 · · ( + ) 7Our analysis of Theorem 1.2 also applies to design matrices whose column span has bounded distortion, see Appendix A for more details.

7 where  is a large enough absolute constant. Then, there exists an algorithm that given Δ, ^ and y = ^  as input, in time8 $ =3 ∗ + ( ) finds a vector #ˆ ℝ3 such that ∈ ˆ 2 3 ∗ # 6 $ log 3 , − 2= ·   with probability at least 1 3 10. − − The algorithm in Theorem 1.4 requires knowledge of an upper bound Δ on the norm of the parameter vector. The sample complexity of the estimator has logarithmic dependency on this upper bound. This phenomenon is a consequence of the iterative nature of the algorithm and also appears in other results [SBRJ19]. Theorem 1.4 also works for non-spherical settings Gaussian design matrixandprovides nearly optimal error convergence with nearly optimal sample complexity, albeit with running time $ =32 . The algorithm doesn’t require prior knowledge of the covariance ˜ ( ) matrix Σ. In these settings, even though time complexity is not linear in 3, it is linear in =, and if = is considerably larger than 3, the algorithm may be very efficient.

Sparse linear regression. For spherical Gaussian design, the median-based algorithm introduced above can naturally be extended to the sparse settings, yielding the following theorem.

Theorem 1.5 (Guarantees of fast estimator for sparse regression with Gaussian design). Let  ℝ= and  ℝ3 be deterministic vectors, and assume that  has at most : 6 3 nonzero ∈ ∗ ∈ ∗ entries. Let ^ be a random =-by-3 matrix with iid standard Gaussian entries ^89 # 0, 1 . ∼ ( ) Let be the fraction of entries in  of magnitude at most 1, and let Δ > 10  . Suppose +k ∗ k that : = >  ln Δ ln 3 lnln Δ , · 2 · · ( + ) where  is a large enough absolute constant. Then, there exists an algorithm that given :, Δ, ^ and y = ^  as input, in time $ =3 ∗ + ( ) finds a vector #ˆ ℝ3 such that ∈ ˆ 2 : ∗ # 6 $ log 3 , − 2= ·  

10 with probability at least 1 3− . − 8By time we mean number of arithmetic operations and comparisons of entries of y and ^. We do not take bit complexity into account.

2 Techniques

Recall our linear regression model,

$$\mathbf{y} = X\beta^* + \boldsymbol{\eta}\,, \tag{2.1}$$
where we observe (a realization of) the random vector $\mathbf{y}$, the matrix $X \in \mathbb{R}^{n \times d}$ is a known design, the vector $\beta^* \in \mathbb{R}^d$ is the unknown parameter of interest, and the noise vector $\boldsymbol{\eta}$ has independent, symmetrically distributed coordinates9 with $\alpha = \min_{i \in [n]} \mathbb{P}\{|\boldsymbol{\eta}_i| \le 1\}$.10
To simplify notation in our proofs, we assume $\frac{1}{n} X^{\mathsf{T}} X = \mathrm{Id}$. (For general $X$, we can ensure this property by orthogonalizing and scaling the columns of $X$.) We consider the Huber loss estimator $\hat{\beta}$, defined as a minimizer of the Huber loss $f$,

= 1 f  := Φ - y 8 , ( ) = [( − ) ] Õ8=1 where Φ: ℝ ℝ>0 is the Huber penalty,11 → 1 C2 if C 6 2 , Φ C = 2 | | [ ] (2 C 2 otherwise. | |− 2.1 Statistical guarantees from strong convexity In order to prove statistical guarantees for this estimator, we follow a well-known approach that applies to a wide range of estimators based on convex optimization (see [NRWY09] for a more general exposition), which also earlier analyses of the Huber loss estimator [TJSO14] employ. This approach has two ingredients: (1) an upper bound on the norm of the gradient of the f at the desired parameter ∗ and (2) a lower bound on the strong-convexity curvature parameter of f within a ball centered at ∗. Taken together, these ingredients allow us to construct a global lower bound for f that implies that all (approximate) minimizers of f are close to ∗. (See Theorem 3.1 for the formal statement.) An important feature of this approach is that it only requires strong convexity to hold locally around ∗. (Due to its linear parts, the Huber loss function doesn’t satisfy strong convexity globally.) It turns out that the radius of strong convexity we can prove is the main factor determining the strength of the statistical guarantee we obtain. Indeed, the reason why previous analyses12 of the Huber loss estimator [TJSO14] require quadratic

9The distributions of the coordinates are not known to the algorithm designer and can be non-identical. 10The value of need not be known to the algorithm designer and only affects the error guarantees of the algorithms. 11Here, in order to streamline the presentation, we choose 2 as the transition points between quadratic {± } and linear penalty. Changing these points to 2 is achieved by scaling C 2Φ C  . {± } 7→ ( / ) 12We remark that the results in [TJSO14] are phrased asymptotically, i.e., fixed 3 and = . Therefore, a →∞ radius bound independent of = is enough for them. However, their proof is quantiative and yields a radius bound of 1 √3 a we will discuss. / 9 sample size = & 3 2 to ensure consistency is that they can establish strong convexity ( / ) only within inverse-polynomial radius Ω 1 √3 even for Gaussian - # 0, 1 = 3. In ( / ) ∼ ( ) × contrast, our analysis gives consistency for any super-linear sample size = = $ 3 2 for ( / ) Gaussian - because we can establish strong convexity within constant radius. Compared to the strong-convexity bound, which we disucss next, the gradient bound is straightforward to prove. The gradient of the Huber loss at ∗ for response vector y = - ( takes the following form, ∗ + = 1 f ∗ = Φ′ (8 G8 with Φ′ C = sign C min C , 2 , ∇ ( ) = [ ]· [ ] ( )· {| | } Õ8=1 3 where G1,...,G= ℝ form the rows of -. Since (1,..., (= are independent and symmet- ∈ rically distributed, the random vectors in the above sum are independent and centered. We can bound the expected gradient norm,

= 2 = = 1 1 2 2 2 23 Φ′ (8 = Φ′ (8 G8 6 G8 = . = [ ] =2 [ ] ·k k =2 k k = 8=1 8=1 8=1 Õ Õ Õ

1 T = The last step uses our assumption = - - Id. Finally, Bernstein’s inequality for random vectors (see e.g. [Gro11]) allows us to conclude that the gradient norm is close to its expectation with high probability (see the Theorem 3.2 for details).
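To illustrate the gradient formula and the scale of its norm, the following self-contained Monte Carlo sketch (ours; the Cauchy outliers and all constants are arbitrary choices, not prescribed by the paper) draws a Gaussian design and an oblivious noise vector and evaluates the Huber-loss gradient at $\beta^*$, whose norm should be on the order of $\sqrt{4d/n}$ up to sign conventions.

import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 20000, 50, 0.05

X = rng.standard_normal((n, d))          # Gaussian design, so (1/n) X^T X is close to Id

# Oblivious noise: an alpha fraction of inliers (|eta_i| <= 1), the rest wild Cauchy outliers.
eta = 1e3 * rng.standard_cauchy(n)
inliers = rng.random(n) < alpha
eta[inliers] = rng.uniform(-1.0, 1.0, size=inliers.sum())

# Huber-loss gradient at beta* (up to sign): (1/n) * sum_i Phi'(eta_i) * x_i with Phi'(t) = clip(t, -2, 2).
grad = X.T @ np.clip(eta, -2.0, 2.0) / n

print("empirical ||grad f(beta*)||:", np.linalg.norm(grad))
print("predicted scale sqrt(4d/n) :", np.sqrt(4 * d / n))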

Proving local strong convexity for Huber loss. For response vector y = - ( and ∗ + arbitrary D ℝ3, the Hessian13 of the Huber loss at  D has the following form, ∈ ∗ + = 1 T  f ∗ D = Φ′′ -D 8 (8 G8 G8 with Φ′′ C = ~ C 6 2 . ( + ) = [( ) − ]· [ ] | | Õ8=1 Here, ~  is the Iverson bracket (0/1 indicator). To prove strong convexity within radius · ', we are to lower bound the smallest eigenvalue of this random matrix uniformly over all vectors D ℝ3 with D 6 '. ∈ k k We do not attempt to exploit any cancellations between -D and ( and work with the following lower bound S D for the Hessian, ( ) = 1 T  f ∗ D S D := ~ G8 , D 6 1 ~ (8 6 1 G8 G8 . (2.2) ( + ) ( ) = |h i| · | | · Õ8=1 Here, denotes the Löwner order.  13The second derivative of the Huber penalty doesn’t exit at the transition points 2 between its {± } quadratic and linear parts. Nevertheless, the second derivative exists as an !1-function in the sense that 1 Φ 1 Φ 0 = ~ C 6 2 dC for all 0, 1 ℝ. This property is enough for our purposes. ′[ ]− ′[ ] 0 | | ∈ ∫ 10 It’s instructive to first consider D = 0. Here, the above lower bound for the Hessian satisfies, = 1 T S 0 = ℙ (8 6 1 G8 G8 Id . [ ( )] = {| | }·  Õ8=1 Using standard (matrix) concentration inequalities, we can also argue that this random matrix is close to its expectation with high-probability if = > $ 3 under some mild ˜ ( / ) assumption on - (e.g., that the row norms are balanced so that G1 ,..., G= 6 $ √3 ). k k k k ( ) The main remainingchallenge isdealingwith the quantification over D. Earlier analyses [TJSO14] observe that the Hessian lower bound S is constant over balls of small enough 3 (·) radius. Concretely, for all D ℝ with D 6 1 max8 G8 , we have ∈ k k / k k S D = S 0 , ( ) ( )

because G8 , D 6 G8 D 6 1 by Cauchy-Schwarz. Thus, strong convexity with curva- |h i| k k·k k ture parameter withinradius1 max8 G8 follows from the aforementioned concentration / k k argument for S 0 . However, since max8 G8 > √3, this argument cannot give a better ( ) k k radius bound than 1 √3, which leads to a quadratic sample-size bound = & 32 2 as / / mentioned before. For balls of larger radius, the lower bound S can vary significantly. For illustration, (·) let us consider the case ( = 0 and let us denote the Hessian lower bound by " for this (·) case. (The deterministic choice of ( = 0 would satisfy all of our assumptions about (.) As we will see, a uniform lower bound on the eigenvalues of " over a ball of radius ' > 0 (·) implies that the column span of - is well-spread in the sense that every vector E in this subspace has a constant fraction of its ℓ2 on entries with squared magnitude at most a 1 '2 factor times the average squared entry of E. (Since we aim for ' > 0 to be a small / constant, the number 1 '2 is a large constant.) Concretely, / 1 min min " D 6 min 2 D, " D D D 6' ( ( )) D =' ' h ( ) i k k k k = 1 1 ~ 2 6  2 = min 2 = G8 , D 1 G8 , D D =' ' · h i · h i k k 8=1 Õ = = 1 2 2 6 1 2 2 min E 2 ' E8 = E  E8 E col.span - k k · k k · ∈ ( ) Õ8=1 ≕ ' . (2.3)

(The last step uses our assumption -T- = Id.) It turns out that the above quantity ' in Eq. (2.3) indeed captures up to constant factors the radius and curvature parameter of strong convexity of the Huber loss function around ∗ for ( = 0. In this sense, the well-spreadness of the column span of - is required for the current approach of analyzing the Huber-loss estimator based on strong convexity. The quantity ' in Eq. (2.3) is closely related to previously studied notions of well-spreadness

11 for subspaces [GLW08, GLR10] in the context of compressed sensing and error-correction over the reals. Finally, we use a covering argument to show that a well-spread subspace remains well-spread even when restricted to a random fraction of the coordinates (namely the coordinates satisfying (8 6 1). This fact turns out to imply the desired lower bound on | | the local strong convexity parameter. Concretely, if the column space of - is well-spread in 3 1 2 the sense of Eq. (2.3) with parameter ' for some ' > $ , we show that the Huber ˜ ' ( = ) / loss function is locally Ω ' -strong convex at ∗ within radius Ω ' . (See Theorem 3.4.) ( · ) ( ) Recall that we are interested in the regime = > 3 2 (otherwise, consistent estimation is 0.1 / impossible) and small , e.g., = 3− . In this case, Gaussian - satisfies ' > 0.1 even for constant '.

Final error bound. The aforementioned general framework for analyzing estimators via strong convexity (see Theorem 3.1) allows us to bound the error #ˆ  by the norm k − ∗ k of the gradient f  divided by the strong-convexity parameter, assuming that this k∇ ( ∗)k upper bound is smaller than the strong-convexity radius. Consequently, for the case that our design - satisfies ' > 0.1 (corresponding to the setting of Theorem 1.2), the previously discussed gradient bound and strong-convexity bound together imply that, with probability over (, the error bound satisfies

1 2 ˆ 3 1 3 / # ∗ 6 $ $ = $ , k − k = · 2= r ! !  

gradient bound strong-convexity bound assuming ' & 3 2=. (This| lower{z bound} on '|is{z required} by Theorem 3.1 and is stronger / than the lower bound on ' required by Theorem 3.4.) p

2.2 Huber-loss estimator and high-dimensional medians We discuss here some connections between high-dimensional median computations and efficient estimators such as Huber loss or the LAD estimator. This connection leads to a better understanding of why these estimators are not susceptible to heavy-tailed noise. Through this analysis we also obtain guarantees similar to the ones shown in Theorem 1.2. Recall our linear regression model y = - ( as in Eq. (2.1). The noise vector ( has ∗ + independent, symmetrically distributed coordinates with = min8 = ℙ (8 6 1 . We ∈[ ] further assume the noise entries to satisfy 

C 0, 1 , ℙ (8 6 C > Ω C . ∀ ∈[ ] ( · ) This can be assumed without loss of generality as, for example, we may simply add a

Gaussian vector w # 0, Id= (independent of y) to y (after this operation parameter ∼ ( ) changes only by a constant factor).

12 The one dimensional case: median algorithm. To understand how to design an effi- cient algorithm robust to 1 3 = = corruptions, it is instructive to look into the − / · simple settings of one dimensional  Gaussian design ^ # 0, Id= . Given samples p ∼ ( ) y1, ^1 ,..., y= , ^= for any 8 = such that ^8 > 1 2, consider ( ) ( ) ∈[ ] | | /

y8 ^8 = ∗ (8 ^8 . / + /

By obliviousness the random variables (′ = (8 ^8 are symmetric about 0 and for any 8 / 0 6 C 6 1, still satisfy ℙ C 6 (′ 6 C > Ω C . Surprisingly, this simple observation (− 8 ) ( · ) is enough to obtain an optimal robust algorithm. Standard tail bounds show that with 2 2 probability 1 exp Ω  = the median # of y1 ^1,..., y= ^= falls in the interval − − · · ˆ / / & √ 2  ∗,  ∗ for any  0, 1 . Hence, setting  1 = we immediately get that − + + + ∈[ ] 2 / · 2 2 with probability at least 0.999, ∗ # 6  6 $ 1 = . − ˆ ( /( · ))

The high-dimensional case: from the median to the Huber loss. In the one dimensional case, studying the median of the samples y1 ^1,... y= ^= turns out to be enough to obtain / / optimal guarantees. The next logical step is to try to construct a similar argument in high dimensional settings. However, the main problem here is that high dimensional analogs of the median are usually computationally inefficient (e.g. Tukey median [Tuk75]) and so this doesn’t seem to be a good strategy to design efficient algorithms. Still in our case one such function provides fundamental insight. We start by considering the sign pattern of -∗, we do not fix any property of - yet. Indeed, note that the median satisfies 8 = sign y8 -8 # 0 and so ∈[ ] / − ˆ ≈   8 = sign y8 #-8 sign -8 0. So a natural generalizationÍ to high dimensions is the ∈[ ] − ˆ ( ) ≈ following candidate estimator Í   1 # = argmin ℝ3 max = sign y - , sign -D . (2.4) ˆ ∈ D ℝ3 h − ( )i ∈  Such an estimator may be inefficient to compute, but nonetheless it is instructive to reason about it. We may assume -, ∗ are fixed, so that the randomness of the observations y1,..., y= only depends on (. Since for each 8 = , the distribution of (8 has median ∈ [ ] zero and as there are at most =$ 3 sign patterns in sign -D D ℝ3 , standard -net ( ) ( ) ∈ arguments show that with high probability 

1 6 max = sign y -# , sign -D $ 3 = , (2.5) D ℝ3 h − ˆ ( )i ˜ / ∈     p and hence 1 6 max = sign ( - ∗ # , sign -D $ 3 = . D ℝ3 h + − ˆ ( )i ˜ / ∈      p

13 Consider © I = 1 sign ( -I , sign -I 6 $ 3 = for I ℝ3. Now the central obser- ( ) = h + ( )i ˜ ( / ) ∈ vation is that for any I ℝ3, ∈  1 © I = sign (8 -8 , I sign -8 , I ( ( ) = ( + h i · (h i) 8 = Õ∈[ ]  1 > ℙ 0 > sign -8 , I (8 > -8 , I = (h i)· −|h i| 8 = Õ∈[ ]  1 > Ω min 1, -8 , I . = ( )· { |h i|} 8 = Õ∈[ ] By triangle inequality © I 6 © I © I © I and using a similar argument as in ( ) ( ) + ( )− ( ) Eq. (2.5), with high probability, for any I ℝ3,

∈ © I © I 6 $ 3 = . ( )− ( ) ˜ / p  Denote with z :=  # ℝ 3. Consider © I , thinking of I ℝ3 as a fixed vector. This ∗ − ˆ ∈ ( ) ∈ allows us to easily study © I . On the other hand, since our bounds are based on -net ( ( ) argument, we don’t have to worry about the dependency of z on (. So without any constraint on the measurement - we derived the following inequality:

1 2 min 1, -8 , z 6 $ 3 = . = { |h i|} ˜ /( · ) 8 = Õ∈[ ] p  1 Now, our well-spread condition Eq. (1.2) will allow us to relate = 8 = min 1, -8 , z 1 2 ∈[ ] { |h i|} with = 8 = -8 , z and thus obtain a bound of the form ∈[ ] h i Í 2 Í 1 2 - ∗ # 6 $ 3 = . (2.6) = − ˆ ˜ /( )    So far we glossed over the fact that Eq. (2.4) may be hard to compute, however it is easy to see that we can replace such estimator with some well-known efficient estimators and keep a similar proof structure. For instance, one could expect the LAD estimator

= # min y - 1 (2.7) ˆ  ℝ3 − ∈ to obtain comparable guarantees. For fixed 3 and and = tending to infinity this is indeed the case, as we know by [Pol91] that such estimator recovers ∗. The Huber loss function = 1 Φ also turns out to be a good proxy for Eq. (2.4). Let © D : = ′ℎ (8 -8 , D , -D ( ) 8 = h ( + h i) i ∈[ ] where Φ : ℝ ℝ>0 is the Huber penalty function and z =  Í #. Exploiting only first ℎ → ∗ − ˆ order optimality conditions on #ˆ one can show

© I 6 © I © I 6 $ 3 = , ( ) ( )− ( ) ˜ / p 

14 using a similar argument as the one mentioned for Eq. (2.5). Following a similar proof structure as the one sketched above, we can obtain a bound similar to Eq. (2.6). Note that this approach crucially exploits the fact that the noise ( has median zero but does not rely on symmetry and so can successfully obtain a good estimate of -∗ under weaker noise assumptions.

2.3 Fast algorithms for Gaussian design The one dimensional median approach introduced above can be directly extended to high dimensional settings. This essentially amounts to repeating the procedure for each coordinate, thus resulting in an extremely simple and efficient algorithm. More concretely:

Algorithm 1 Multivariate linear regression iteration via median
Input: $(y, X)$ where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times d}$.
for all $j \in [d]$ do
  for all $i \in [n]$ do
    Compute $z_{ij} = y_i / X_{ij}$.
  end for
  Let $\hat{\beta}_j$ be the median of $\{z_{ij}\}_{i \in [n]}$.
end for
Return $\hat{\beta} := (\hat{\beta}_1, \ldots, \hat{\beta}_d)^{\mathsf{T}}$.
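As a reference point, here is a short NumPy rendering of Algorithm 1 (our own sketch; the paper only gives the pseudocode above): one pass of coordinate-wise medians of the ratios $y_i / X_{ij}$.

import numpy as np

def median_iteration(y, X):
    """One iteration of coordinate-wise median regression (Algorithm 1).

    For each coordinate j, returns the median over samples i of y_i / X_ij.
    Assumes no entry of X is exactly zero (true almost surely for Gaussian X).
    """
    ratios = y[:, None] / X          # ratios[i, j] = y_i / X_ij
    return np.median(ratios, axis=0) # column-wise medians give the estimate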

If ^1,..., ^= # 0, Id3 , the analysis of the one dimensional case shows that with ∼ ( ) 2 high probability, for each 9 3 , the algorithm returns #9 satisfying ∗ #9 6 ∈ [ ] ˆ ( 9 − ˆ ) 1  2 $ +k ∗ k log 3 . Summing up all the coordinate-wise errors, Algorithm 1 returns a 2 · 3 1  2 $  ( +k ∗ k ) log3 -close estimation. This is better than a trivial estimate, but for large 2 · 2 ∗ it is far from the $ 3 log 3 = error guarantees we aim for. However, using ( · /( · )) bootstrapping we can indeed improve the accuracy of the estimate. It suffices to iterate

log ∗ many times.

15 Algorithm 2 Multivariate linear regression via median Input: H,-, Δ where - ℝ= 3, H ℝ= and Δ is an upper bound to  . ( ) ∈ × ∈ ∗ Randomly partition the samples H1,...,H= in C := Θ log Δ sets S1,..., SC, such that all ( ) S S = S 1,..., C 1 have sizes Θ Δ and C has size = 2 . − log ⌊ / ⌋ for all 8 C do   ∈[ ] Run Algorithm 1 on input

9 HS -S #( ) ,-S , 8 − 8 ˆ 8 9<8 1 © © Õ− ª ª ­ ­ ® ® and let # 8 be the resulting estimator. ˆ( ) « « ¬ ¬ end for Return := ,..., T. ˆ ˆ1 ˆ3   As mentioned in Section 1.2, Algorithm 2 requires knowledge of an upper bound Δ on the norm of ∗. The algorithm only obtains meaningful guarantees for 3 = & log Δ log 3 log log Δ 2 + and as such works with nearly optimal (up to poly-logarithmi c terms) sample complexity 2 whenever ∗ is polynomial in 3 . / In these settings, since each iteration 8 requires $ S8 3 steps, Algorithm 2 runs in (| |· ) linear time $ = 3 and outputs a vector # satisfying ( · ) ˆ 2 3 # ∗ 6 $ log 3 , ˆ − 2 = ·  ·  with high probability. Remark 2.1 (On learning the norm of ∗). As was noticed in [SBRJ19], one can obtain a rough estimate of the norm of  by projecting y onto the orthogonal complement of the columns span of ^ = 2 . Since the ordinary least square estimator obtains an estimate with [ / ] error $ √3  = with high probability, if  is polynomial in the number of samples, we ( / ) ˆ ∗ $ 1 obtain a vector #!( such that  # 6 = ( ). The median algorithm can then be applied ˆ − !( ˆ ∗ on y = ^ = = 2 ∗ #!( , ^ = = 2 ,  # . Note that since ^ = 2 and ^ = = 2 [ ]\[ / ]( − )+ [ ]\[ / ] ˆ − !( [ / ] [ ]\[ / ]  ˆ  are independent, ∗ #!( is independent of ^ = = 2 . − [ ]\[ / ] 3 Huber-loss estimation guarantees from strong convexity

In this section, we prove statistical guarantees for the Huber loss estimator by establishing strong convexity bounds for the underlying objective function.

16 The following theorem allows us to show that the global minimizerof the Huberloss is close to the underlying parameter ∗. To be able to apply the theorem, it remains to prove (1) a bound on the gradient of the Huber loss at ∗ and (2) a lower bound on the curvature of the Huber loss within a sufficiently large radius around ∗. Theorem 3.1 (Error bound from strong convexity, adapted from [NRWY09, TJSO14]). Let 5 : ℝ3 ℝ be convex differentiable function and let  ℝ3. Suppose 5 is locally -strongly → ∗ ∈ convex at ∗ within radius ' > 0:

3  2 D ℝ , D 6 '. 5 ∗ D > 5 ∗ 5 ∗ , D D . (3.1) ∀ ∈ k k ( + ) ( )+h∇ ( ) i+ 2 k k If 5  < 1 ', then every vector  ℝ3 such that 5  6 5  satisfies k∇ ( ∗)k 2 · ∈ ( ) ( ∗)  ∗ 6 2 5 ∗  . (3.2) k − k · k∇ ( )k/ Furthermore, if If 5  < 0.49 ', then every vector  ℝ3 such that 5  6 5  , k∇ ( ∗)k · ∈ ( ) ( ∗)+ where  = 0.01 5  2 , satisfies · k∇ ( ∗)k /  ∗ 6 2.01 5 ∗  . (3.3) k − k · k∇ ( )k/ Proof. Let  ℝ3 be any vector that satisfies 5  6 5  . Write  =  C D such that ∈ ( ) ( ∗)+ ∗ + · Δ 6 ' and C = max 1,   ' . Since  =  D lies on the line segment joining  k k { k − ∗ k/ } ′ ∗ + ∗ and , the convexity of 5 implies that 5  6 max 5  , 5  6 5  . By local strong ( ′) { ( ) ( ∗)} ( ∗)+ convexity and Cauchy–Schwarz,

 2  > 5 ′ 5 ∗ > 5 ∗ D D . ( )− ( ) −k∇ ( )k·k k+ 2 ·k k If  = 0, we get D 6 2 5   < '. By our choice of C, this bound on D implies k k ·k∇ ( ∗)k/ k k C = 1 and we get the desired bound. If  = 0.01 5  2  and 5  < 0.49 ', by the quadratic formula, · k∇ ( ∗)k / k∇ ( ∗)k · 2 5 ∗ 5 ∗ 2 2.01 5 ∗ D 6 k∇ ( )k + k∇ ( )k + 6 k∇ ( )k < 2.01 0.49 ' < '. k k p   · · Again, by our choice of C, this bound on D implies C = 1. We can conclude   = k k k − ∗ k D 6 2.01 5   as desired.  k k k∇ ( ∗)k/ We remark that the notion of local strong convexity Eq. (3.1) differs from the usual notion of strong convexity in that one evaluation point for the function is fixed to be ∗. (For the usual notion of strong convexity both evaluation points for the function may vary within some convex region.) However, it is possible to adapt our proof of local strong convexity to establish also (regular) strong convexity inside a ball centered at ∗. A more general form of the above theorem suitable for the analysis of regularized M-estimators appears in [NRWY09] (see also [Wai19]). Earlier analyses of the Huber-loss estimator also use this theorem implicitly [TJSO14]. (See the the discussion in Section 2.1.) The following theorem gives an upper bound on the gradient of the Huber loss at ∗ for probabilistic error vectors (.

17 = 3 T 3 Theorem 3.2 (Gradient bound for Huber loss). Let - ℝ × with - - = Id and ∗ ℝ . ∈ ∈ Let ( be an =-dimensional random vector with independent symmetrically-distributed entries. Then for any  0, 1 , with probability at least 1  2, the Huber loss function f  = 1 = ∈ ( ) − / ( ) Φ - y 8 for y = - ( satisfies = 8=1 [( − ) ] ∗ + Í 3 ln 2  f ∗ 6 8 + ( / ) . k∇ ( )k r = Proof. Let z be the =-dimensional random vector with entries z8 = Φ′ (8 , where Φ′ C = 1 = ( ) ( ) sign C min 2, C . Then, f  = z8 G8. Since (8 is symmetric, z8 = 0. By ( ) · { | |} ∇ ( ∗) = 8=1 · the Hoeffding bound, every unit vector D ℝ3 satisfies with probability at least 1 Í ∈ − 2exp 2 3 ln 2  . (− ( + ( / )))

= f ∗ , D = z, -D 6 4 3 ln 2  -D = 4 3 ln 2  = . · h∇ ( ) i |h i| + ( / ) ·k k ( + ( / )) Hence, by union bound over a 1 2-coveringp of the 3-dimensionalp unit ball of size at most / 53, we have with probability at least 1 2exp 2ln 2  > 1  2, − (− ( / )) − /

max f ∗ , D 6 4 3 ln 2  = max f ∗ , D D 61 h∇ ( ) i ( + ( / ))/ + D 61 2 h∇ ( ) i k k k k / p 1 = 4 3 ln 2  = 2 max f ∗ , D . ( + ( / ))/ + D 61 h∇ ( ) i k k p = 1 =  Since D f  f ∗ satisfies f ∗ , D f ∗ , we get the desired bound. k∇ ( ∗)k ∇ ( ) h∇ ( ) i k∇ ( )k Proof of local strong convexity. The following lemma represents the second-order behav- ior of the Huber penalty as an integral. To prove local strong convexity for the Huber-loss function, we will lower bound this integral summed over all sample points. Lemma 3.3 (Second-order behavior of Huber penalty). For all ℎ,  ℝ, ∈ 1 2 1 Φ  ℎ Φ  Φ′  ℎ = ℎ 1 C 1  C ℎ 62 dC > 1  61 1 ℎ 61 . (3.4) ( + )− ( )− ( )· · ( − )· | + · | 2 | | · | | ¹0 Proof. A direct consequence of Taylor’s theorem and the integral form of the remainder of the Tayler approximation. Concretely, consider the function , : ℝ ℝ with , C = → 2 ( ) Φ  C ℎ . The first derivative of , at 0 is ,′ 0 = Φ′  ℎ. The function ,′′ C = ℎ 1 C 62 ( + · ) ( ) ( )· 1 ( ) · | | is the second derivative of , as an !1 function (so that , C dC = , 1 , 0 for all 0 ′′( ) ′( )− ′( ) 0, 1 ℝ). Then, the lemma follows from the following integral form of the remainder of ∈ ∫ the first-order Taylor expansion of , at 0,

1 , 1 , 0 ,′ 0 = 1 C ,′′ C dC. ( )− ( )− ( ) ( − )· ( ) ¹0 1 1 1 Finally, we lower bound the above right-hand side by > 1  61 1 ℎ 61 using 1 C dC = 2 | | · | | 0 ( − ) 2 and the fact ,′′ C > 1  61 1 ℎ 61 for all C 0, 1 .  ( ) | | · | | ∈[ ] ∫ 18 = 3 T 3 Theorem 3.4 (Strong convexity of Huber loss). Let - ℝ × with - - = Id and ∗ ℝ . Let ∈ ∈ ( be an =-dimensional random vector with independent entries such that = min8 ℙ (8 6 1 . {| | } Let  > 0 and  0, 1 . Suppose that every vector E in the column span of - satisfies ∈ ( ) = A2 E2 6 1 E 2 E2 >  E 2 , (3.5) · 8 = k k · 8 ·k k Õ8=1 with 50 3 ln 100 ln 2  · ·  + ( / ) 6 A 6 . 2 1 s  =  1 = Then, with probability at least 1  2, the Huber loss function f  = Φ - y 8 − / ( ) = 8=1 [( − ) ] for y = -∗ ( is locally  -strongly convex at ∗ within radius A 2 (in the sense of Eq. (3.1)). + / Í Proof. By Lemma 3.3, for every D ℝ3, ∈ = 1 f ∗ D f ∗ f ∗ , D = Φ G8 , D (8 Φ (8 Φ′ (8 G8 , D ( + )− ( )−h∇ ( ) i = (h i− )− (− )− (− ) · h i 8=1 Õ= > 1 2 G8 , D 1 G8 ,D 61 1 (8 61 . (3.6) 2= h i · |h i| · | | Õ8=1 It remains show that with high probability over the realization of (, the right-hand side Eq. (3.6) is bounded from below uniformly over all D ℝ3 in a ball. ∈ To this end, we will show, using a covering argument, that with probability at least 1 2 Ω 3 over (, for every unit vector D ℝ3, the vector E = -D satisfies the following − − ( ) ∈ inequality, = 1 2 2 2 ~E 6 4 A  E ~ (8 6 1 >  2 . (3.7) = 8 / · 8 · | | / Õ8=1 (Since 1 -T- = Id and D = 1, the vector E has average squared entry 1.) = k k Let # be an -covering of the unit sphere in ℝ3 of size # 6 3  3 for a parameter  | | ( / )  to be determined later. Let D ℝ3 be an arbitrary unit vector and let E = -D. Choose ∈ D # such that D D 6  and let E = -D . We establish the following lower bound ′ ∈  k − ′k ′ ′ on the left-hand side of Eq. (3.7) in terms of a similar expression for E′,

= 1 2 2 2 ~ E′ 6 1 A  E′ ~ (8 6 1 (3.8) = ( 8) / · ( 8) · | | 8=1 Õ = 2 1 2 2 2 2 2 6  ~E 6 4 A  ~ E′ 6 1 A  E′ ~ (8 6 1 (3.9) + = 8 / · ( 8) / · ( 8) · | | 8=1 Õ = 2 1 2 2 2 2 2 6 2  ~E 6 4 A  ~ E′ 6 1 A  E ~ (8 6 1 (3.10) + + = 8 / · ( 8) / · 8 · | | Õ8=1

19 The first step Eq. (3.9) uses that each term in the first sum that doesn’t appear in the 6 > second sum corresponds to a coordinate 8 with E′8 1 A and E8 2 A, which means 2 > 2 | | / 2 | | / that E8 E′8 1 A . Since each term has value at most 1 A , the sum of those terms is ( − ) / 2 6 2 / 2 bounded by E E′  =. For the second step Eq. (3.10), let F′8 be the terms of the k − 2k ( ) second sum and F8 the terms of the third sum. Then, the difference of the two sums is equal to F F ,F F 6 F F F F . We have F F 6 E E 6 √= and h − ′ + ′i k − ′k·k + ′k k − ′k k − ′k F F 6 E E = 2√=. k + ′k k k+k ′k It remains to lower bound the expression Eq. (3.8) over all D′ # . Let z8 = 8 ∈  − ~ (8 6 1, where 8 =ℙ (8 6 1 > . The random variables z1,..., z= are independent, | | {| | } 2 2 2 centered, and satisfy z8 6 1. Let 28 = ~ E′ 6 1 A  E′ . By Bernstein inequality, for all | | ( 8) / · ( 8) C > 1, = 2 2 2 C2 4 ℙ 28 z8 > C 8 2 C A 6 4− / . · · 8 + /  8=1 s8 =  Õ Õ∈[ ]  2 6 2 2 6 1 = 1 Since 28 28 A , we have 8 = 8 28 2 8 = 8 28. Denote 1 = 8 = 8 28. Note that /  ∈[ ] A ∈[ ]  ∈[ ] 1 > .   Í Í Í Choosing  = 0.03 , C = 2 3 ln 3  ln 2  , by the union bound over # , it follows ( / )+ ( / )  that with probability at least 1  2, for every D # , the vector E = -D satisfies, p− / ′ ∈  ′ ′ = 2 2 1 2 2 2 C 1 C ~ E′ 6 1 A  E′ ~ (8 6 1 > 1 = ( 8) / · ( 8) · | | − A2= − A2= = r Õ8 1 > 1 √0.1 √1 √  0.08  − · · − · > 0.6  .

As discussed before, this event implies that for all unit vectors D ℝ3, the vector E = -D ∈ satisfies = 1 2 2 2 2 ~E 6 4 A  E ~ (8 6 1 > 0.6  2  > 0.5  . = 8 / · 8 · | | − − Õ8=1 

Putting things together. Inthisparagraph,weproof Theorem 1.2 by combining previous results in this section. We start with the definition of well-spread property:

Definition 3.5. Let + ℝ= be a vector space. + is called <,  -spread, if for every E + ⊆ ( ) ∈ and every subset ( = with ( > = <, ⊆ [ ] | | −

E( >  E . k k k k = 3 Theorem 3.6. Let - ℝ × be a deterministic matrix and let ( be an =-dimensional random ∈ vector with independent, symmetrically distributed entries and = min8 = ℙ (8 6 1 . ∈[ ] {| | } 20 Let  0, 1 and  0, 1 and suppose that column span of - is <,  -spread14 for ∈ ( ) ∈ ( ) ( ) 100 3 ln 10  ln 2  < = · · ( / )+ ( / ) . 4 2 ·  Then, with probability at least 1  over (, for every  ℝ3, given - and y = - (, the − ∗ ∈ ∗ + Huber-loss estimator #ˆ satisfies

1 ˆ 2 3 ln 2  - ∗ # 6 1000 + ( / ) . = ( − ) · 4 2 = · · Proof. Note that if column span of - is <,  -spread, then Eq. (3.5) holds for all E from ( ) column span of - with A2 = < = and  = 2. Indeed, the set 8 = E2 > 1 E 2 has / ∈[ ] 8 A2= k k 100 3 ln 10  ln 2  2 = 2 > 2 2 = 2 = ( / )+ ( / ) size at most A = <, so 8 = ( E8  E  E . Hence for < ( 4 2 ), ∈[ ]\ k k k k  2 the conditions of TheoremÍ 3.4 are satisfied and f is  -strongly convex in the ball of radius < = with probability at least 1  2. / − / By Theorem 3.2, with probability at least 1  2, p − / 3 ln 2  f ∗ 6 8 + ( / ) . k∇ ( )k r = Hence with probability at least 1 , f  < 0.492 < =. Therefore, by Theorem 3.1, − k∇ ( ∗)k / with probability at least 1 , − p 2 1 ˆ 2 2 f ∗ 3 ln 2  - ∗ # 6 2.1 k∇ ( )k 6 1000 + ( / ) . = ( − ) · 4 2 · 4 2 = · · · 

Proof of Theorem 1.2. Theorem 1.2 follows from Theorem 3.6 with  = √1 0.81 = √0.19 − and  = 2 3. Note that in this case < 6 20000 3 2.  − · /

3.1 Huber-loss estimator for Gaussian design and deterministic noise In this section we provide a proof of Theorem 1.1. We will use the same strategy as in the previous section: show that the gradient at  is bounded by $ 3 = , then show that ∗ / Huber loss is locally strongly convex at ∗ in a ball of radius Ω 1 , andthen  use Theorem 3.1 ( ) p to obtain the desired bound. 14The statement of this theorem in the submitted short version of this paper unfortunately contained a typo and stated an incorrect bound for ( = <. | |

21 Gradient bound.

Theorem 3.7 (Gradient bound, Gaussian design). Let ^ # 0, 1 = 3 and  ℝ3. Let ∼ ( ) × ∗ ∈  ℝ= be a deterministic vector. ∈ Then for every  0, 1 , with probability at least 1  2, the Huber loss function f  = 1 = ∈ ( ) − / ( ) Φ ^ y 8 for y = ^  satisfies = 8=1 [( − ) ] ∗ + Í 2 3 2ln 2  f ∗ 6 3 + ( / ) . k∇ ( )k r =

1 Φ 2 Proof. The distribution of f ∗ is # 0, =2 ′ 8 Id3 . Hence by Fact D.10, with ∇ ( ) 8 = ( ) · ∈[ ] ! probability at least 1  2, Í  − /

2 4 83 12ln 2  f ∗ 6 3 2ln 2  2 3 ln 2  6 + ( / ) . k∇ ( )k = + ( / )+ ( / ) =  p  

Strong convexity.

= 3 3 Theorem 3.8 (Strong convexity, Gaussian design). Let ^ # 0, 1 × and ∗ ℝ . Let ∼ ( ) ∈  ℝ= be a deterministic vector with = entries of magnitude at most 1. Suppose that for some ∈  0, 1 , ∈ ( ) 3 2ln 4  = > 200 + ( / ) . · 2 1 = Then with probability at least 1  2, the Huber loss function f  = Φ ^ y 8 − / ( ) = 8=1 [( − ) ] for y = ^  is locally -strongly convex at  within radius 1 6 (in the sense of Eq. (3.1)). ∗ + ∗ / Í Proof. By Lemma 3.3, for every D ℝ3, ∈ = 1 f ∗ D f ∗ f ∗ , D = Φ x8 , D 8 Φ 8 Φ′ 8 x8 , D ( + )− ( )−h∇ ( ) i = (h i− )− (− )− (− ) · h i 8=1 Õ= > 1 2 x8 , D 1 x8 ,D 61 1 8 61 . 2= h i · |h i| · | | Õ8=1

Consider the set = 8 = 8 6 1 . Since = = and  is deterministic, ^ C ∈[ ] | | |C| C ∼ # 0, 1 = 3. ( ) ×  By Fact D.12, for : = = 200, with probability at least 1  4, for any set of / − / K ⊆C size : and every D ℝ3, ∈ 2 2 2 4 = x8 , D 6 D √3 √: 2: ln 2ln 4  h i k k · + + : + ( / ) 8 r ! Õ∈K   p 22 2 6 D 2 = 200 0.01 = ln 2004 0.1 √ = k k · / + ( )+ · 6 0.18 Dp2 =. p  k k · Now if is the set of top : entries of ^D for D ℝ3 such that D 6 1 6, then we get that K ∈ k k / the average squared coordinate in is at most 1. Hence K = 2 > 2 > 2 2 x8 , D 1 x8 ,D 61 1 8 61 x8 , D ^ D 0.18 D =. h i · |h i| · | | h i k C k − ·k k 8=1 8 Õ ∈C\KÕ Since ^ is a Gaussian matrix, for all D ℝ3, with probability at least 1  4, C ∈ − / 2 2 ^ D 2 > D 2 √ = √3 2 log 4  > D 2 √ = 0.1√ = > 0.81 D 2 =. k C k k k − − ( / ) k k − k k  q    Hence with probability at least 1  2, for all D ℝ3 such that D 6 1 6, − / ∈ k k / f ∗ D f ∗ f ∗ , D > 0.5 , ( + )− ( )−h∇ ( ) i and f is locally strongly convex with parameter at  in the ball of radius 1 6.  ∗ /

Putting everything together. The following theorem implies Theorem 1.1. Theorem 3.9. Let  ℝ= be a deterministic vector. Let ^ be a random15 =-by-3 matrix with iid ∈ standard Gaussian entries ^89 # 0, 1 . ∼ ( ) Suppose that for some  0, 1 , ∈ ( ) 3 2ln 2  = > 2000 + ( / ) , · 2 where is the fraction of entries in  of magnitude at most 1. Then, with probability at least 1 , for every  ℝ3, given ^ and y = ^ , the − ∗ ∈ ∗ + Huber-loss estimator #ˆ satisfies

ˆ 2 3 2ln 2  ∗ # 6 100 + ( / ) . − · 2=

Proof. Using bounds from Theorem 3.7 and Theorem 3.8, we can apply Theorem 3.1. In- deed, with probability at least 1 , − 3 2ln 2  f  = > k∇ ( ∗)k ' 1 6 > 2 3 + 2 ( / ) 2 , / · r = · Hence 2 ˆ 2 f ∗ 3 2ln 2  ∗ # 6 4 k∇ ( )k 6 100 + ( / ) . − · 2 · 2= 

15As a convention, we use boldface to denote random variables.

4 Robust regression in linear time

In this section we prove Theorem 1.4 and Theorem 1.5. We consider the linear model = 3 y = ^ ( where ^ ℝ has i.i.d entries ^89 # 0, 1 and the noise vector  satisfies ∗ + ∈ × ∼ ( ) the following assumption. Assumption 4.1.  ℝ= is a fixed vector such that for some = =,3 0, 1 , ∈ ( ) ∈ ( ) 8 = (8 6 1 > =. We denote := 8 = 8 6 1 . ∈[ ] | | T ∈[ ] | |  We start showing how to obtain a linear time algorithm for Gaussian design in one

dimension. Then generalize it to high dimensional settings. We add the following (linear time) preprocessing step

8 = , y′ = 28 y8 w8 , ∀ ∈[ ] 8 · + ^ ′ = 28 ^8 , w8 # 0, 1 , 28 * 1, 1 , (PRE) 8 · ∼ ( ) ∼ {− } where w1,..., w= , 21,..., 2= , ^ are mutually independent. For simplicity, when the con- text is clear we denote 288 w8 by (8 and y , ^ with y, ^. Note that this preprocessing + ′ ′ step takes time linear in =3. Assumption 4.1 implies that after this preprocessing step, ( satisfies the following assumption: Assumption 4.2. For all 8 and for any C 0, 1 , ∈ T ∈[ ] ℙ 0 6 (9 6 C =ℙ C 6 (9 6 0 > C 10 . − /   4.1 Warm up: one-dimensional settings

For the one-dimensional settings, the only property of the design = 1 matrix ^ # 0, Id= × ∼ ( ) we are going to use is anti-concentration.

Fact 4.3. Let ^ # 0, Id= . Then for any 2 0, 1 and 8 = , ∼ ( ) ∈[ ] ∈[ ] ℙ ^8 > 2 > Ω 1 . (| | ) ( ) As shown below, our estimator simply computes a median of the samples.

Algorithm 3 Univariate Linear Regression via Median
Input: $(y, X)$, where $y, X \in \mathbb{R}^n$.
0. Preprocess $(y, X)$ as in Eq. (PRE) and let $(\mathbf{y}', \mathbf{X}')$ be the resulting pair.
1. Let $\mathcal{M} = \{i \in [n] : |\mathbf{X}'_i| \ge 1/2\}$. For $i \in \mathcal{M}$, compute $\mathbf{z}_i = \mathbf{y}'_i / \mathbf{X}'_i$.
2. Return the median $\hat{\beta}$ of $\{\mathbf{z}_i\}_{i \in \mathcal{M}}$.
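A short NumPy sketch of Algorithm 3 together with the sign-flip and Gaussian-smoothing preprocessing of Eq. (PRE) (our own rendering of the pseudocode; the function names are ours).

import numpy as np

def preprocess(y, X, rng):
    """Eq. (PRE): flip each sample by an independent random sign and add standard Gaussian noise,
    which symmetrizes the noise distribution."""
    s = rng.choice([-1.0, 1.0], size=len(y))
    return s * y + rng.standard_normal(len(y)), s * X

def univariate_median(y, X, rng=None):
    """Algorithm 3: median of y'_i / X'_i over the samples with |X'_i| >= 1/2."""
    if rng is None:
        rng = np.random.default_rng()
    y2, X2 = preprocess(y, X, rng)
    mask = np.abs(X2) >= 0.5
    return np.median(y2[mask] / X2[mask])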


Theorem 4.5. Let y = ^  for arbitrary  ℝ, ^ # 0, 1 = 3 and  ℝ= satisfying ∗ + ∗ ∈ ∼ ( ) × ∈ Assumption 4.1 with parameter . Let #ˆ be the estimator computed by Algorithm 3 given ^ , y ( ) as input. Then for any positive 6 2 =, · ˆ 2 ∗ # 6 − 2 = · with probability at least 1 2exp Ω . − {− ( )} To prove Theorem 4.5 we will use the following bound on the median, which we prove in Appendix D.

Lemma 4.6. Let = be a set of size = and let z1,..., z= ℝ be mutually independent S⊆[ ] ∈ random variables satisfying

1. For all 8 = , ℙ z8 > 0 =ℙ z8 6 0 . ∈[ ] ( ) ( )

2. For some  > 0, for all 8 , ℙ z8 0,  =ℙ z8 , 0 > @. ∈ S ( ∈[ ]) ( ∈ [− ]) Then with probability at least 1 2exp Ω @22= the median zˆ satisfies − −  zˆ 6  . | | Proof of Theorem 4.5. Due to the preprocessing step the resulting noise ( satisfies 1 Assumption 4.2. Let M = be the set of entries such that ^ ′ > . Since and ⊆ [ ] | 8 | 2 T M are independent, by Chernoff bound, M > Ω = with probability at least |T ∩ | ( ) 1 2exp Ω = > 1 2exp Ω . Now observe that for all  0, 1 and for all − [− ( )] − [− ( )] ∈ ( ) 8 M, by Assumption 4.2, ∈T∩

(8  ℙ z8 ∗ 6  =ℙ 6  > ℙ (8 6  2 > . − ^ / 20  8′   

By Lemma 4.6, we get the desired bound for = 2 2=. 

4.2 High-dimensional settings The median approach can also be applied in higher dimensions. In these settings we need to assume that an upper bound Δ on  is known. k ∗k

25 Remark 4.7. As shown in [SBRJ19],underthemodelof Theorem 1.1 the classical least square estimator obtain an estimate # with error 3  . Thus under the additional assumption ˆ = · that the noise magnitude is polynomial in = it is easy to obtain a good enough estimate of

the parameter vector. 2  2 We first prove in Section 4.2.1 how to obtain an estimate of the form   6 k ∗ k . ∗ − ˆ 2 Then in Section 4.2.2 we obtain Theorem 1.4, using bootstrapping. In Section 4.2.3 we

generalize the results to sparse parameter vector ∗, proving Theorem 1.5. Finally we extend the result of Section 4.2.2 to non-spherical Gaussians in Section 4.2.4.

4.2.1 High-dimensional Estimation via median algorithm

2  2 To get an estimate of the form   6 k ∗ k , we use the algorithm below: ∗ − ˆ 2

Algorithm 4 Multivariate Linear Regression Iteration via Median

= = 3 Input: H,- where H ℝ , - ℝ × . ( ) ∈ ∈ 0. Preprocess H,- as in Eq. (PRE) and let y , ^ be the resulting pair. ( ′ ′)

1. Forall 9 3 run Algorithm 3 on input y, ^ ′ , where ^ ′ isa 9-th column of ^ (without ∈[ ] ( 9 ) 9 ′ additional preprocessing). Let #ˆ 9 be the resulting estimate. = T 2. Return #ˆ : #ˆ1,..., #ˆ3 .  

Remark 4.8 (Running time). Preprocessing takes linear time. Then the algorithm simply executes Algorithm 3 3 times, so it runs in $ =3 time. ( ) The performance of the algorithm is captured by the following theorem.

Theorem 4.9. Let y = ^  for arbitrary  ℝ3, ^ # 0, 1 = 3 and  ℝ= satisfying ∗ + ∗ ∈ ∼ ( ) × ∈ Assumption 4.1 with parameter . Let #ˆ be the estimator computed by Algorithm 4 given y, ^ ( ) as input. Then for any positive 6 2=,

ˆ 2 3 2 ∗ # 6 · 1 ∗ − 2 = + ·   with probability at least 1 2exp ln 3 Ω . − [ − ( )] Proof. We first show that the algorithm obtains a good estimate for each coordinate. Then it suffices to sum the coordinate-wise errors. For 9 3 , let M 9 = be the set of entries 1 ∈[ ] ⊆ [ ] such that ^89 > . Observe that since M 9 doesn’t depend on , by Chernoff bound, | | 2 T

26 M 9 > Ω = with probability at least 1 2exp Ω = > 1 2exp Ω . Now for T ∩ ( ) − [− ( )] − [− ( )] all 8 = let ∈[ ] 1 = z89 : 288 w8 ^8;′ ∗; . ^ ′ + + 89 ;≠9 © Õ ª ­ 3 ® Note that ℙ z89 > 0 =ℙ z89 6 0 . Now let  ℝ be the vector such that for 9 3 8 , ¯ ∈ ∈ [ ] \ { } 9 = ∗ and 8 = 0. By properties of Gaussian« distribution, for¬ all 8 = , ¯ 9 ¯   ∈[ ] 2 w8 ^ ′ ∗ # 0, 1  . + 8; ; ∼ ( +k ¯k ) Õ;≠9

2 Hence for each 8 M 9, for all 0 6 C 6 1  , ∈T∩ +k ¯k q C ℙ z89 6 C > Ω . | | 1  2  © +k ¯k ª ­q ® By Lemma 4.6, median zˆ 9 of z89 satisfies ­ ® « ¬ 2 2 2 zˆ 6 1  6 1 ∗ 9 2 = + ¯ 2 = + · ·   ˆ   with probability at least 1 2exp Ω . Since #9 = zˆ 9  ∗, applying union bound over − [− ( )] + 9 all coordinates 9 3 , we get the desired bound.  ∈[ ]

4.2.2 Nearly optimal estimation via bootstrapping

Here we show how, through multiple executions of Algorithm 4, we can indeed obtain error $\|\beta^* - \hat{\beta}\|^2 \le \tilde{O}\big(\frac{d}{\alpha^2 \cdot n}\big)$. As already discussed, assume that we know some upper bound on $\|\beta^*\|$, which we denote by $\Delta$. Consider the following procedure:

Algorithm 5 Multivariate Linear Regression via Median

Input: $(y, X, \Delta)$ where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, and $\Delta \ge d$.

1. Randomly partition the samples $y_1, \ldots, y_n$ into $t := \lceil \ln \Delta \rceil$ sets $\mathcal{S}_1, \ldots, \mathcal{S}_t$, such that $\mathcal{S}_1, \ldots, \mathcal{S}_{t-1}$ all have sizes $\Theta\big(\frac{n}{\log \Delta}\big)$ and $\mathcal{S}_t$ has size $\lfloor n/2 \rfloor$.

2. Denote $\hat{\beta}^{(0)} = 0 \in \mathbb{R}^d$. For $i \in [t]$, run Algorithm 4 on input $\big(y_{\mathcal{S}_i} - X_{\mathcal{S}_i}\hat{\beta}^{(i-1)},\, X_{\mathcal{S}_i}\big)$, and let $\hat{\beta}^{(i)}$ be the resulting estimator.

3. Return $\hat{\beta} := \sum_{i \in [t]} \hat{\beta}^{(i)}$.
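A sketch of the bootstrapping loop (assuming a coordinate-wise routine like `median_regression_iteration` from the previous sketch; the batch sizes are simplified to equal splits rather than the $\Theta(n/\log\Delta)$ batches plus half-sized final batch used above).

```python
import numpy as np

def bootstrap_median_regression(y, X, Delta, median_iteration):
    """Sketch of Algorithm 5: regress the current residuals y - X @ beta_hat on
    fresh sample batches, so the error of the running estimate shrinks each round."""
    n, d = X.shape
    T = max(int(np.ceil(np.log(Delta))), 1)
    batches = np.array_split(np.random.permutation(n), T)  # simplified equal split
    beta_hat = np.zeros(d)
    for batch in batches:
        residual = y[batch] - X[batch] @ beta_hat
        beta_hat = beta_hat + median_iteration(residual, X[batch])
    return beta_hat
```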

Remark 4.10 (Running time). Splitting the samples into $t$ sets requires time $O(n)$. For each set $\mathcal{S}_i$, the algorithm simply executes Algorithm 4, so all in all the algorithm takes $O\big(\Theta\big(\frac{n}{\log \Delta}\big) \cdot d \cdot \log \Delta\big) = O(nd)$ time.

The theorem below proves correctness of the algorithm.

Theorem 4.11. Let $\mathbf{y} = \mathbf{X}\beta^* + \boldsymbol{\eta}$ for $\beta^* \in \mathbb{R}^d$, $\mathbf{X} \sim N(0,1)^{n \times d}$ and $\boldsymbol{\eta} \in \mathbb{R}^n$ satisfying Assumption 4.1 with parameter $\alpha$. Suppose that $\Delta \ge d(1 + \|\beta^*\|)$, and that for some positive $\delta \le 1/2$, $n \ge C \cdot \frac{d}{\alpha^2} \cdot \ln \Delta \cdot \big(\ln(d/\delta) + \ln\ln \Delta\big)$ for a sufficiently large absolute constant $C > 0$. Let $\hat{\beta}$ be the estimator computed by Algorithm 5 given $(\mathbf{y}, \mathbf{X}, \Delta)$ as input. Then, with probability at least $1 - \delta$,

$$\frac{1}{n}\big\|\mathbf{X}(\beta^* - \hat{\beta})\big\|^2 \le O\left(\frac{d \cdot \log(d/\delta)}{\alpha^2 \cdot n}\right).$$

Proof. Since = >  3 ln Δ ln 3 lnln Δ , by Theorem 4.9, for each 8 C 1 , · 2 · ( + ) ∈[ − ] 2 2 8 8 1 ˆ 9 1 ∗ 9−=1 #( ) ˆ 9 + − ∗ #( ) 6 , − 10 9=1 Í Õ with probability at least 1 2exp ln 3 10ln 3  10lnln Δ . By union bound over 8 − [ − ( / )− ] ∈ C 1 , with probability at least 1 210, [ − ] − 2 C 1 − ˆ 9 ∗ #( ) 6 100 . − 9=1 Õ

Hence by Theorem 4.9, with probability at least 1 410, − 2 C ˆ 9 3 log 3  ∗ #( ) 6 $ · ( / ) . − 2 = 9=1   Õ ·

By Fact D.8, with probability at least 1 10, − 2 1 ˆ ˆ T 1 T ˆ ˆ 2 ^ ∗ # = ∗ # ^ ^ ∗ # 6 1.1 ∗ # . = − − = − · −        



4.2.3 Nearly optimal sparse estimation

A slight modification of Algorithm 4 can be used in the sparse setting.

Algorithm 6 Multivariate Sparse Linear Regression Iteration via Median

Input: $(y, X)$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$.

1. Run Algorithm 4, and let $\beta'$ be the resulting estimator.

2. Denote by $a_k$ the value of the $k$-th largest (by absolute value) coordinate of $\beta'$. For each $j \in [d]$, let $\hat{\beta}_j = \beta'_j$ if $|\beta'_j| \ge a_k$, and $\hat{\beta}_j = 0$ otherwise.

3. Return $\hat{\beta} := (\hat{\beta}_1, \ldots, \hat{\beta}_d)^\top$.
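Step 2 is a plain top-$k$ hard truncation; a minimal sketch (the `argpartition` call is a linear-time selection, matching the $O(d)$ claim in Remark 4.12 below):

```python
import numpy as np

def truncate_top_k(beta, k):
    """Keep the k largest-magnitude coordinates of beta and zero out the rest."""
    out = np.zeros_like(beta)
    keep = np.argpartition(np.abs(beta), -k)[-k:]  # indices of the top-k entries
    out[keep] = beta[keep]
    return out
```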

Remark 4.12 (Running time). The running time of Algorithm 4 is $O(nd)$. Similar to the median, $a_k$ can be computed in time $O(d)$ (for example, using the selection procedure from [BFP+73]).

The next theorem is the sparse analog of Theorem 4.9.

Theorem 4.13. Let $\mathbf{y} = \mathbf{X}\beta^* + \boldsymbol{\eta}$ for $k$-sparse $\beta^* \in \mathbb{R}^d$, $\mathbf{X} \sim N(0,1)^{n \times d}$ and $\boldsymbol{\eta} \in \mathbb{R}^n$ satisfying Assumption 4.1 with parameter $\alpha$. Let $\hat{\beta}$ be the estimator computed by Algorithm 6 given $(\mathbf{y}, \mathbf{X})$ as input. Then for any positive $\beta \le \alpha^2 n$,

ˆ 2 : 2 ∗ # 6 $ · 1 ∗ − 2 = +   ·   with probability at least 1 2exp ln 3 Ω . − [ − ( )] Proof. The reasoning of Theorem 4.9 shows that for each coordinate 9 , the median zˆ 9 [ ] satisfies 2 zˆ 9 6 1  | | 2 = + ∗ r · with probability at least 1 2exp Ω . By union bound  over 9 3 , with probability − [− ( )] ∈ [ ] at least 1 2exp ln 3 Ω , for any 9 3 , − [ − ( )] ∈[ ] 2 #′ ∗ 6 1  . | 9 − 9 | 2 + ∗ r = ·   If #′ < 0:, then there should be some 8 ∉ supp ∗ such that #′ > #′ . Hence for such 9, 9 { } | 8 | | 9 |

2 6 6 6 $ . ∗9 #′9 ∗9 #′9 #′9 ∗9 #′8 ∗8 2 1 ∗ | | | − |+| | | − |+| − | r = + ! ·   Note that since random variables ^89 are independent and absolutely continuous with positive density, #′ ≠ # for < ≠ 9 with probability 1. Hence 9 3 #′ > 0 = :. It 9 ′< ∈[ ] | 9 | : follows that n o

2 3 2 ˆ 2 6 ∗ # 1 ∗9 1 > ∗9 #′9 − #′9 <0: · + #′9 0: · − 9 supp  | | 9=1 | | ∈ Õ{ ∗} h i   Õ h i  

29 3 2 2 6 $ 1 $ 2 1 ∗ > 2 1 ∗ + + #′9 0: · + = = | | = 9 supp ∗ · 9 1 · ∈ Õ{ }    Õ h i    : 2 6 $ · 1 ∗ . 2 = +   ·   

Again, through bootstrapping we can obtain a nearly optimal estimate. However, for the first few iterations we need a different subroutine: instead of keeping the top-$k$ entries, we zero out all entries smaller than a specific threshold.

Algorithm 7 Multivariate Sparse Linear Regression Iteration via Median

Input: $(y, X, \Delta)$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, $\Delta > 0$.

1. Run Algorithm 4, and let $\beta'$ be the resulting estimator.

2. For each $j \in [d]$, let $\hat{\beta}_j = \beta'_j$ if $|\beta'_j| \ge \frac{\Delta}{100\sqrt{k}}$, and $\hat{\beta}_j = 0$ otherwise.

3. Return $\hat{\beta} := (\hat{\beta}_1, \ldots, \hat{\beta}_d)^\top$.

Remark 4.14 (Running time). The running time of this algorithm is the same as the running time of Algorithm 4, i.e., $O(nd)$.

The following theorem proves correctness of Algorithm 7.

Theorem 4.15. Let $\mathbf{y} = \mathbf{X}\beta^* + \boldsymbol{\eta}$ for $k$-sparse $\beta^* \in \mathbb{R}^d$, $\mathbf{X} \sim N(0,1)^{n \times d}$ and $\boldsymbol{\eta} \in \mathbb{R}^n$ satisfying Assumption 4.1 with parameter $\alpha$. Suppose that $\|\beta^*\| \le \Delta$. Let $\hat{\beta}$ be the estimator computed by Algorithm 7 given $(\mathbf{y}, \mathbf{X}, \Delta)$ as input. Then, with probability at least $1 - 2\exp\Big[\ln d - \Omega\Big(\frac{\alpha^2 \cdot n \cdot \Delta^2}{k(1 + \Delta^2)}\Big)\Big]$, $\operatorname{supp}\{\hat{\beta}\} \subseteq \operatorname{supp}\{\beta^*\}$ and

$$\big\|\beta^* - \hat{\beta}\big\| \le \frac{\Delta}{10}\,.$$

Proof. Fix a coordinate 9 3 . The reasoning of Theorem 4.9 shows that the median zˆ 9 ∈ [ ] satisfies 2 2 2 zˆ 6 1 ∗ 6 1 Δ 9 2 = + 2 = + ·   · with probability at least 1 exp Ω exp Ω = . If ∗ = 0 then with probability at − [− ( )] − [− ( )] 8 2 = Δ2 Δ ˆ least 1 2exp Ω · · 2 we have zˆ 9 6 , so #9 = 0. Conversely if ∗ ≠ 0 then with : 1 Δ 100√: 8 − − ( + ) h  i 2 = Δ2 Δ probability 1 2exp Ω · · the error is at most 2 . Combining the two and : 1 Δ2 100√: − − ( + ) · repeating the argumenth for all 8i 3 , we get that by union bound, with probability at 2 2 ∈ [ ] 2 2 2 Ω = Δ ˆ 6 Δ 6 Δ  least 1 2exp ln 3 : 1· ·Δ2 , ∗ # : 4 10000 : 100. − − ( + ) − · · · h  i 30 Now, combining Algorithm 6 and Algorithm 7 we can introduce the full algorithm.

Algorithm 8 Multivariate Sparse Linear Regression via Median

Input: $(y, X, \Delta)$ where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, and $\Delta \ge d$.

1. Randomly partition the samples $y_1, \ldots, y_n$ into $t := \lceil \ln \Delta \rceil$ sets $\mathcal{S}_1, \ldots, \mathcal{S}_t$, such that $\mathcal{S}_1, \ldots, \mathcal{S}_{t-1}$ all have sizes $\Theta\big(\frac{n}{\log \Delta}\big)$ and $\mathcal{S}_t$ has size $\lfloor n/2 \rfloor$.

2. Denote $\hat{\beta}^{(0)} = 0 \in \mathbb{R}^d$ and $\Delta_0 = \Delta$. For $i \in [t-1]$, run Algorithm 7 on input $\big(y_{\mathcal{S}_i} - X_{\mathcal{S}_i}\hat{\beta}^{(i-1)},\, X_{\mathcal{S}_i},\, \Delta_{i-1}\big)$. Let $\hat{\beta}^{(i)}$ be the resulting estimator and $\Delta_i = \Delta_{i-1}/2$.

3. Run Algorithm 6 on input $\big(y_{\mathcal{S}_t} - X_{\mathcal{S}_t}\hat{\beta}^{(t-1)},\, X_{\mathcal{S}_t}\big)$, and let $\hat{\beta}^{(t)}$ be the resulting estimator.

4. Return $\hat{\beta} := \sum_{i \in [t]} \hat{\beta}^{(i)}$.
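A compact sketch of the combined sparse procedure (assuming the `median_regression_iteration` and `truncate_top_k` helpers sketched earlier; the batch schedule is again simplified to equal splits):

```python
import numpy as np

def threshold_at_level(beta, level):
    """Step 2 of Algorithm 7: zero out every coordinate of magnitude below `level`."""
    out = beta.copy()
    out[np.abs(out) < level] = 0.0
    return out

def sparse_bootstrap(y, X, Delta, k, median_iteration, truncate_top_k):
    """Sketch of Algorithm 8: thresholded rounds with a halving bound Delta_i,
    followed by one final top-k round on the remaining samples."""
    n, d = X.shape
    T = max(int(np.ceil(np.log(Delta))), 1) + 1
    batches = np.array_split(np.random.permutation(n), T)  # simplified equal split
    beta_hat, Delta_i = np.zeros(d), float(Delta)
    for batch in batches[:-1]:                              # Algorithm 7 rounds
        res = y[batch] - X[batch] @ beta_hat
        step = median_iteration(res, X[batch])
        beta_hat = beta_hat + threshold_at_level(step, Delta_i / (100 * np.sqrt(k)))
        Delta_i /= 2
    res = y[batches[-1]] - X[batches[-1]] @ beta_hat        # final Algorithm 6 round
    beta_hat = beta_hat + truncate_top_k(median_iteration(res, X[batches[-1]]), k)
    return beta_hat
```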

Remark 4.16 (Running time). Splitting the samples into C sets requires time $ = . For each ( ) set 8, the algorithm simply executes either Algorithm 7 or Algorithm 6, so all in all the S algorithm takes $ Θ = 3 log Δ $ =3 = $ =3 time. log Δ + ( ) ( ) Finally, Theorem 1.4 follows from the result below. Theorem 4.17. Let y = ^  for :-sparse  ℝ3, ^ # 0, 1 = 3 and  ℝ= satisfying ∗ + ∗ ∈ ∼ ( ) × ∈ Assumption 4.1 with parameter . Suppose that Δ > 3 1  , and that for some positive +k ∗ k  < 1 2, = >  : ln Δ ln 3  lnln Δ for sufficiently large absolute constant  > 0. Let #ˆ / · 2 · ( ( / )+ )  be the estimator computed by Algorithm 8 given y, ^ , Δ as input. Then, with probability at least ( ) 1 , − 1 ˆ 2 : log 3  ^ ∗ # 6 $ · ( / ) . = − 2 =    ·  Proof. Since = >  : ln Δ ln 3 lnln Δ , by Theorem 4.15 and union bound over 8 C 1 , · 2 ·( + ) ∈[ − ] with probability at least 1 2exp ln 3 ln C 10ln 3  10lnln Δ , for each 8 C 1 , − [ + − ( / )− ] ∈[ − ] 8 ˆ 9 Δ ∗ #( ) 6 . − 108 9=1 Õ Hence 2 C 1 − ˆ 9 ∗ #( ) 6 100 − 9=1 Õ

31 with probability 1 210. Therefore, by Theorem 4.13, − 2 C ˆ 9 : log 3  ∗ #( ) 6 $ · ( / ) − 2 = 9=1   Õ · with probability 1 410. − ˆ C 10 C 1 ˆ 9 Since # is :-sparse and with probability 1 2 , supp − # supp  , vector ( ) − 9=1 ( ) ⊆ ∗  #ˆ is 2:-sparse. By Lemma D.9, with probability at leastn 1 10, o ∗ − Í−  1 ˆ 2 ˆ T 1 T ˆ ˆ 2 ^ ∗ # = ∗ # ^ ^ ∗ # 6 1.1 ∗ # . = − − = − · −        



4.2.4 Estimation for non-spherical Gaussians

We further extend the results to non-spherical Gaussian design. In this section we assume $n \ge d$. We use the algorithm below, and we assume that an estimate of the covariance matrix of the rows of $\mathbf{X}$, $\mathbf{x}_1, \ldots, \mathbf{x}_n \sim N(0, \Sigma)$, is given as input. For example, if the number of samples is large enough, the sample covariance matrix is a good estimator of $\Sigma$. For more details, see Section 4.2.5.

Algorithm 9 Multivariate Linear Regression Iteration via Median for Non-Spherical Design

Input: $(y, X, \hat{\Sigma})$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, and $\hat{\Sigma}$ is a positive definite symmetric matrix.

1. Compute $\tilde{X} = X \hat{\Sigma}^{-1/2}$.

2. Run Algorithm 4 on input $(y, \tilde{X})$ and let $\beta'$ be the resulting estimator.

3. Return $\hat{\beta} := \hat{\Sigma}^{-1/2} \beta'$.
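A sketch of this whitening reduction (assuming a covariance estimate `Sigma_hat` and the spherical routine sketched after Algorithm 4):

```python
import numpy as np

def inv_sqrt(Sigma):
    """Sigma^{-1/2} of a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(Sigma)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def median_iteration_nonspherical(y, X, Sigma_hat, median_iteration):
    """Sketch of Algorithm 9: whiten the design with Sigma_hat^{-1/2}, run the
    spherical coordinate-wise median routine, and map the estimate back."""
    W = inv_sqrt(Sigma_hat)
    beta_prime = median_iteration(y, X @ W)   # estimates Sigma_hat^{1/2} beta*
    return W @ beta_prime                     # maps back to an estimate of beta*
```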

Remark 4.18 (Running time). Since $n \ge d$, computing $\tilde{X}$ requires $O(n T(d)/d)$ time, where $T(d)$ is the time required to multiply two $d \times d$ matrices. Algorithm 4 runs in time $O(nd)$, so the running time is $O(n T(d)/d)$.

The performance of the algorithm is captured by the following theorem.

Theorem 4.19. Let $\mathbf{y} = \mathbf{X}\beta^* + \boldsymbol{\eta}$, such that the rows of $\mathbf{X}$ are i.i.d. $\mathbf{x}_i \sim N(0, \Sigma)$, $\boldsymbol{\eta}$ satisfies Assumption 4.1 with parameter $\alpha$, and $\hat{\Sigma} \in \mathbb{R}^{d \times d}$ is a symmetric matrix independent of $\mathbf{X}$ such that $\|\hat{\Sigma}^{1/2}\Sigma^{-1/2} - \mathrm{Id}_d\| \le \varepsilon$ for some $\varepsilon > 0$. Suppose that for some $N \ge n + d$, $\varepsilon \le \frac{1}{100\sqrt{\ln N}}$ and $\alpha^2 n \ge 100 \ln N$. Let $\hat{\beta}$ be the estimator computed by Algorithm 9 given $(\mathbf{y}, \mathbf{X}, \hat{\Sigma})$ as input. Then, with probability at least $1 - O(N^{-5})$,

$$\big\|\hat{\Sigma}^{1/2}(\beta^* - \hat{\beta})\big\|^2 \le O\left(\frac{d \cdot \log N}{\alpha^2 \cdot n}\Big(1 + \big\|\hat{\Sigma}^{1/2}\beta^*\big\|^2\Big) + \varepsilon^2 \cdot \big\|\hat{\Sigma}^{1/2}\beta^*\big\|^2\right).$$

Proof. We first show that the algorithm obtains a good estimate for each coordinate. Then Σ it suffices to add together the coordinate-wise errors. By assumptions on ˆ , 1 2 1 2 1 2 ^ = ^Σ− / = MΣ / Σ− / = M M, ˜ ˆ ˆ + = 3 1 where M # 0, 1 × and  is a matrix such that  6  for some  6 . Since  ∼ ( ) k k 100√ln # is independent of ^ and each column of  has norm at most , for all 8 = and 9 3 , 10 ∈ [ ] ∈ [ ] ^89 M89 6 40√ln # with probability 1 $ #− . For simplicity, we still write y, ^ after | ˜ − | − ˜ the preprocessing step. Fix 9 3 , let M = be the set of entries such that ^ > 1 2. 9 ˜ 89 10∈ [ ] ⊆ [ ]  | | / With probability 1 $ # , 8 = M89 > 1 is a subset of M 9. Hence by Chernoff − − ∈[ ] | | bound, M > Ω = with probability at least 1 $ # 10 . Now for all 8 M let 9   − 9 T ∩ ( ) − ∈  1 q := ^ Σ1 2 89 (8 ˜ 8; ˆ / ∗ ^89 + ; ˜ ;≠9   © Õ ª 1 ­ ® = M Σ1 2 M Σ1 2 M Σ1 2 «(8 8; ˆ / ∗ ¬ 8<<; ˆ / ∗ 899; ˆ / ∗ . ^89 + ; + ; + ; ˜ ;≠9   ;≠9 <≠9   ;≠9   © Õ Õ Õ Õ ª ­ 10 ® Note that for any 8 M 9, with probability 1 $ #− , sign ^89 = sign M89 . Hence « ∈ − ˜ ¬    1 z = w M Σ1 2 M Σ1 2 89 : 288 8 8; ˆ / ∗ 8<<; ˆ / ∗ ^89 + + ; + ; ˜ ;≠9   ;≠9 <≠9   © Õ Õ Õ ª is symmetric about zero.­ ® 3 « 1 2 ¬ Now let  ℝ be the vector such that for ; 3 9 , ; = Σ / ∗ and 8 = 0. Note ¯ ∈ ∈[ ] \ ¯ ˆ ; ¯ 1 2 that  6 Σ / ∗ . By properties of Gaussian distribution,   k ¯k k ˆ k 1 2 1 2 2 w8 M8; Σ / ∗ M8<<; Σ / ∗ # 0,  , + ˆ ; + ˆ ; ∼ ( ) Õ;≠9   Õ;≠9 Õ<≠9   2 2 2 2 where 1  6  6 1 1   . Hence for each 8 M 9, for all 0 6 C 6 + k ¯k + + k ¯k ∈T∩ 1  2, +k ¯k  q C ℙ z89 6 C > Ω . | | 1  2 © ¯ ª  ­ +k k ® ­q ® 33 « ¬ By Lemma 4.6, median zˆ 9 of z89 satisfies 8 M 9 ∈  2 2 log # 2 log # 1 2 zˆ 6 $ 1  6 $ 1 Σ / ∗ 9 2 = + ¯ 2 = + ˆ      ·   · 10 with probability at least 1 $ #− . − For any 8 M 9, the event ∈ 

M89 1 2 1 2 9; Σ / ∗ 6 $  Σ / ∗ ^ ˆ ; ·k ˆ k ˜ 89 ;≠9 Õ    

10 10 occurs with probability 1 $ #− . Moreover, since  6 , with probability 1 $ #− , − k k − for all 81,...,8 M 9, 3 ∈   2 3 M8 9 9 Σ1 2 6 2 Σ1 2 2 9; ˆ / ∗ $  ˆ / ∗ . ^8 9 ; ·k k 9=1 ˜ 9 ;≠9     Õ© Õ ª ­ 9 ® Therefore, with probability 1 $ #− , medians q9 of q89 satisfiy 8 M 9 « − ¬ ˆ ∈   3 2 2 3 log = 1 2 2 1 2 2 q 6 $ · 1 Σ / ∗  Σ / ∗ . ˆ 9 2 = + ˆ + ·k ˆ k 9=1  ·    Õ

1 2 Since y8 ^89 = Σ / ∗ q89 , / ˜ ˆ 9 +   3 2 3 log = 2 1 2 6 1 2 2 1 2 2 #′9 Σ / ∗ $ · 1 Σ / ∗  Σ / ∗ . − ˆ 9 2 = + ˆ + ·k ˆ k 9=1      ·    Õ 

Next we show how to do bootstrapping in this general case. Here we assume that an upper bound $\Delta$ on $\|\mathbf{X}\beta^*\|$ is known.

Algorithm 10 Multivariate Linear Regression via Median for Non-Spherical Design

Input: $(y, X, \hat{\Sigma}, \Delta)$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, $\Delta \ge d$, and $\hat{\Sigma} \in \mathbb{R}^{d \times d}$ is a positive definite symmetric matrix.

1. Randomly partition the samples $y_1, \ldots, y_n$ into $t := t_1 + t_2$ sets $\mathcal{S}_1, \ldots, \mathcal{S}_t$, where $t_1 = \lceil \ln \Delta \rceil$ and $t_2 = \lceil \ln n \rceil$, such that $\mathcal{S}_1, \ldots, \mathcal{S}_{t_1}$ have sizes $\Theta\big(\frac{n}{\log \Delta}\big)$ and $\mathcal{S}_{t_1+1}, \ldots, \mathcal{S}_t$ have sizes $\Theta\big(\frac{n}{\log n}\big)$.

2. Denote $\hat{\beta}^{(0)} = 0 \in \mathbb{R}^d$ and $\Delta_0 = \Delta$. For $i \in [t]$, run Algorithm 9 on input $\big(y_{\mathcal{S}_i} - X_{\mathcal{S}_i}\hat{\beta}^{(i-1)},\, X_{\mathcal{S}_i},\, \hat{\Sigma}\big)$, and let $\hat{\beta}^{(i)}$ be the resulting estimator.

3. Return $\hat{\beta} := \sum_{i \in [t]} \hat{\beta}^{(i)}$.
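The same bootstrapping pattern as Algorithm 5, now in whitened coordinates (a sketch: the whitening is applied once up front, which is equivalent to whitening inside every round since $\hat{\Sigma}$ is fixed, and the $\lceil \ln \Delta \rceil + \lceil \ln n \rceil$ batch schedule is simplified to equal splits):

```python
import numpy as np

def bootstrap_nonspherical(y, X, Sigma_hat, Delta, median_iteration):
    """Sketch of Algorithm 10: bootstrapped median rounds in whitened coordinates."""
    n, d = X.shape
    vals, vecs = np.linalg.eigh(Sigma_hat)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T        # Sigma_hat^{-1/2}
    Xw = X @ W                                               # whitened design
    T = max(int(np.ceil(np.log(Delta)) + np.ceil(np.log(n))), 1)
    batches = np.array_split(np.random.permutation(n), T)   # simplified equal split
    beta_w = np.zeros(d)                                     # estimate of Sigma_hat^{1/2} beta*
    for batch in batches:
        res = y[batch] - Xw[batch] @ beta_w
        beta_w = beta_w + median_iteration(res, Xw[batch])
    return W @ beta_w                                        # back to the original coordinates
```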

Remark 4.20 (Running time). Running time is $ =) 3 3 , where ) 3 is a time required ( ( )/ ) ( ) for multiplication of two 3 3 matrices. × The theorem below extend Theorem 1.4 to non-spherical Gaussians. 3 = 3 Theorem 4.21. Let y = ^  for  ℝ , ^ ℝ with iid rows x8 # 0, Σ ,  ∗ + ∗ ∈ ∈ × ∼ ( ) satisfying Assumption 4.1 with parameter . Let Σ ℝ3 3 be a positive definite symmetric matrix ˆ ∈ × independent of ^. Denote by min, max and  the smallest singular value, the largest singular value, and the condi- tion number of Σ. Suppose that Δ > 3 1 ^ , = >  3 ln = ln Δ = 3 ln = 2 ln = +k ∗k · 2 · ( · ) + ( + ) for some large enough absolute constant , and Σ Σ 6  min .  k ˆ − k √ln = Let #ˆ be the estimator computed by Algorithm 10 given y, ^ , Σ, Δ as input. Then ( ˆ ) 2 1 ˆ 2 3 ln = ^ ∗ # 6 $ · = − 2 =    ·  with probability 1 > 1 as = . − ( ) → ∞ 1 2 1 2 1 min Proof. Let’s show that Σ / Σ− / Id3 6 . Since Σ Σ 6 , k ˆ − k 100√ln = k ˆ − k √ln =

1 1 Σ− Σ Id3 6 . ˆ − √ln =

1 1 So Σ− Σ= Id3 , where  6 . Hence for large enough , ˆ + k k √ln =

1 2 1 2 1 2 1 Σ / Σ− / Id3 = Id3  − / Id3 6 . ˆ − ( + ) − 100√ln =

35 Since = >  3 ln = ln 3 ln Δ and and  < 1 , applying Theorem 4.19 with # = 2=, · 2 ·( + ) √ln = for each 8 C , ∈[ ] 2 2 8 8 1 1 2 1 2 ˆ 9 1 1 2 − 1 2 ˆ 9 Σ / ∗ Σ / #( ) 6 1 Σ / ∗ Σ / #( ) , ˆ − ˆ 10 + ˆ − ˆ 9=1 9=1 Õ © Õ ª ­ ® 5 ­ ® with probability at least 1 $ =− . By union bound over 8 C1 , with probability at least 4 − « ∈[ ] ¬ 1 $ =− , −  2 C1  1 2 1 2 ˆ 9 Σ / ∗ Σ / #( ) 6 100 . ˆ − ˆ 9=1 Õ

Hence by Theorem 4.19, by union bound over 8 C C1 , with probability at least 4 ∈ [ ]\[ ] 1 $ =− , −  2 C 2 1 2 1 2 ˆ 9 3 log = 1 3 log = Σ / ∗ Σ / #( ) 6 $ · 6 $ · . ˆ − ˆ 2 = log = + = 2 = 9=1   ! Õ · / ·  By Fact D.8 , with probability at least 1 $ = 4 , for large enough , − ( − )

min Σ 1 T 1 T    ˆ ^ ^ Σ 6 ^ ^ Σ Σ Σ 6 min min 6 min 6 , = − ˆ = − + − ˆ + 10 5  100√ln = √ln =

Σ Σ where min ˆ is the smallest singular value of ˆ . Hence   2 2 1 ˆ ˆ T 1 T ˆ 1 2 ˆ ^ ∗ # = ∗ # ^ ^ ∗ # 6 1.2 Σ / ∗ # . = − − = − ˆ −          



4.2.5 Estimating covariance matrix

Algorithm 10 requires a symmetric matrix $\hat{\Sigma} \in \mathbb{R}^{d \times d}$ independent of $\mathbf{X}$ such that $\|\hat{\Sigma} - \Sigma\| \lesssim \frac{\sigma_{\min}}{\sqrt{\ln n}}$. The same argument as in the proof of Theorem 4.21 shows that if $n \ge C \cdot \kappa^2 (d + \ln n) \ln n$ for some large enough absolute constant $C > 0$, then with probability at least $1 - O(n^{-4})$ the sample covariance matrix $\hat{\Sigma}$ of $\mathbf{x}_1, \ldots, \mathbf{x}_{\lfloor n/2 \rfloor}$ satisfies the desired property. So we can use Algorithm 10 with design matrix $\mathbf{x}_{\lfloor n/2 \rfloor + 1}, \ldots, \mathbf{x}_n$ and covariance estimator $\hat{\Sigma}$. Computation of $\hat{\Sigma}$ takes time $O(nd^2)$.
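A sketch of this sample split (assumption: the first half of the rows is used only for covariance estimation and the second half only for regression, so the two are independent):

```python
import numpy as np

def split_and_estimate_covariance(y, X):
    """Estimate the row covariance on the first half of the samples; keep the
    second half untouched for the regression step (Section 4.2.5)."""
    n = X.shape[0]
    X_cov, X_reg, y_reg = X[: n // 2], X[n // 2 :], y[n // 2 :]
    Sigma_hat = X_cov.T @ X_cov / X_cov.shape[0]   # sample covariance, O(n d^2) time
    return Sigma_hat, y_reg, X_reg
```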

5 Bounding the Huber-loss estimator via first-order conditions

In this section we study the guarantees of the Huber-loss estimator via first-order optimality conditions. Our analysis exploits the connection with high-dimensional median estimation described in Section 2.2 and yields guarantees comparable to Theorem 1.2 (up to logarithmic factors) under slightly more general noise assumptions. We consider the linear regression model

$$\mathbf{y} = X\beta^* + \boldsymbol{\eta} \tag{5.1}$$

where $X \in \mathbb{R}^{n \times d}$ is a deterministic matrix, $\beta^* \in \mathbb{R}^d$ is a deterministic vector and $\boldsymbol{\eta} \in \mathbb{R}^n$ is a random vector satisfying the assumption below.

Assumption 5.1 (Noise assumption). Let $\mathcal{R} \subseteq [n]$ be a set chosen uniformly at random among all sets of size$^{16}$ $\alpha n$. Then $\boldsymbol{\eta} \in \mathbb{R}^n$ is a random vector such that for all $i \in [n]$, $\boldsymbol{\eta}_i$ satisfies:

1. $\boldsymbol{\eta}_1, \ldots, \boldsymbol{\eta}_n$ are mutually conditionally independent given $\mathcal{R}$.

2. For all $i \in [n]$, $\mathbb{P}(\boldsymbol{\eta}_i \le 0 \mid \mathcal{R}) = \mathbb{P}(\boldsymbol{\eta}_i \ge 0 \mid \mathcal{R})$.

3. For all $i \in \mathcal{R}$, there exists a conditional density $p_i$ of $\boldsymbol{\eta}_i$ given $\mathcal{R}$ such that $p_i(t) \ge 0.1$ for all $t \in [-1, 1]$.

We remark that Assumption 5.1 is more general than the assumptions of Theorem 1.2 (see Appendix A).

We will also require the design matrix $X$ to be well-spread. We restate the definition here.

Definition 5.2. Let $V \subseteq \mathbb{R}^n$ be a vector space. $V$ is called $(m, \zeta)$-spread if for every $v \in V$ and every subset $S \subseteq [n]$ with $|S| \ge n - m$,
$$\|v_S\| \ge \zeta \|v\|\,.$$

Also recall the definition of the Huber loss function.

Definition 5.3 (Huber Loss Function). Let $\mathbf{y} = X\beta^* + \boldsymbol{\eta}$ for $\beta^* \in \mathbb{R}^d$, $\boldsymbol{\eta} \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times d}$. For $h > 0$ and $\beta \in \mathbb{R}^d$, define
$$f_h(\beta) = \sum_{i=1}^n \Phi_h\big(\langle x_i, \beta \rangle - \mathbf{y}_i\big) = \sum_{i=1}^n \Phi_h\big(\langle x_i, \beta - \beta^* \rangle - \boldsymbol{\eta}_i\big)\,,$$
where
$$\Phi_h(t) = \begin{cases} \frac{1}{2h}\, t^2 & \text{if } |t| \le h\,, \\ |t| - \frac{h}{2} & \text{otherwise.} \end{cases}$$

$^{16}$For simplicity we assume that $\alpha n$ is an integer.
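A direct transcription of $\Phi_h$, $f_h$ and the gradient used later in this section (a sketch; `h` is the Huber parameter of Definition 5.3):

```python
import numpy as np

def huber_penalty(t, h):
    """Phi_h(t): quadratic t^2/(2h) for |t| <= h, linear |t| - h/2 beyond h."""
    a = np.abs(t)
    return np.where(a <= h, t**2 / (2 * h), a - h / 2)

def huber_loss(beta, X, y, h):
    """f_h(beta) = sum_i Phi_h(<x_i, beta> - y_i)."""
    return huber_penalty(X @ beta - y, h).sum()

def huber_loss_grad(beta, X, y, h):
    """grad f_h(beta) = sum_i phi_h(<x_i, beta> - y_i) x_i,
    with phi_h(t) = Phi_h'(t) = sign(t) * min(|t|/h, 1)."""
    r = X @ beta - y
    return X.T @ (np.sign(r) * np.minimum(np.abs(r) / h, 1.0))
```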

We are now ready to state the main result of the section. We remark that for simplicity we do not optimize constants in the statement below.

Theorem 5.4. Let $\alpha = \alpha(n, d) \in (0, 1)$ and let $\mathbf{y} = X\beta^* + \boldsymbol{\eta}$ for $\beta^* \in \mathbb{R}^d$, $X \in \mathbb{R}^{n \times d}$ and $\boldsymbol{\eta} \in \mathbb{R}^n$ satisfying Assumption 5.1. Let
$$\epsilon = \frac{10^7 \cdot d \ln n}{\alpha^2 \cdot n}\,,$$
and suppose the column span of $X$ is $(\epsilon \cdot n, 1/2)$-spread. Then, for $h = 1/n$, the Huber-loss estimator $\hat{\beta} := \operatorname{argmin}_{\beta \in \mathbb{R}^d} f_h(\beta)$ satisfies
$$\frac{1}{n}\big\|X(\hat{\beta} - \beta^*)\big\|^2 \le \epsilon$$
with probability at least $1 - 10 n^{-3/2}$.

Remark 5.5. If for all 8 = the conditional distribution of (8 given R is symmetric about ∈ [ ] 0 (this assumption is satisfied for the noise from considered in Theorem 1.2), the theorem is also true for Huber parameter ℎ = 1. To prove Theorem 5.4 we need the Lemmata below. We start showing a consequence of the  =, 1 2 -spread property of -. ( · / ) = 3 Lemma 5.6. Let - ℝ × , and  be as in Theorem 5.4, and let R be as in Assumption 5.1. ∈ 3 2 3 With probability 1 2=− / , for any D ℝ , − ∈ 1 G8 , D > √= -D . |h i| 2 · ·k k 8 R Õ∈

Proof. Let '1,..., '= be i.i.d. Bernoulli random variables such that ℙ '8 = 1 = 1 ( ) 3 − ℙ '8 = 0 = . By Lemma D.7, it is enough to show that with probability 1 = , for ( ) − − any D ℝ3, ∈ = 1 '8 G8 , D > √= -D . |h i| 2 · ·k k Õ8=1 Note that the inequality is scale invariant, hence it is enough to prove it for all D ℝ3 such ∈ that -D = 1. Consider arbitrary D ℝ3 such that -D = 1. Applying Lemma E.5 with k k ∈ k k = , < = = , 1 = 0, 2 = 1 2 and E = -D, we get A ∅ ⌊ ⌋ / = 3 G8 , D > √=. |h i| 4 Õ8=1 Hence = 3 '8 G8 , D > √=. ' |h i| 4 · Õ8=1

38 Let ~  is the Iverson bracket (0/1 indicator). Applying Lemma D.6 with , G,H = ~H = · T ( 3) 1 G , E = -D and w = ' = '1,..., '= , we get that with probability 1 = for all D ·| | ( ) − − such that -D = 1, k k = 1 '8 G8 , D '8 G8 , D 6 20√3 ln = 6 √=, |h i|− ' |h i| 5 8=1   Õ which yields the desired bound. 

Next we show that with high probability - #ˆ  < =. k − ∗ k   = = 3 3 2 Lemma 5.7. Let H ℝ ,- ℝ × as in Theorem 5.4, and let ℎ 6 1. With probability 1 4=− / , ∈ ∈ − for any  such that -   > =, k ( − ∗)k

f  > f ∗ 1 . ℎ( ) ℎ( )+ Proof. Note that = = f ∗ = Φ (8 6 (8 . ℎ( ) ℎ | | 8=1 8=1 Õ  Õ Consider some  such that -   = =. Denote D =   . Since there exists a k ( − ∗)k − ∗ conditional density p8 C > 0.1 (for C 1, 1 ), there exist 0, 1 0, 1 such that for all ( ) ∈ [− ] ∈ [ ] 8 R, ∈ ℙ 0 6 (8 6 0 R =ℙ 0 6 (8 6 1 R > 0.1 . − Let S = 8 = 0 6 (8 6 1 . We get   ∈[ ] −  = =

f  = Φ G8 , D (8 > G8 , D (8 ℎ= ℎ( ) ℎ h i− h i− − 8=1 8=1 Õ  Õ > G8 , D G8 , D (8 2=. |h i|+ h i− − 8 S R 8 = S ∈Õ∩ ∈[Õ]\

Denote '8 = ~ 0 6 (8 6 1. By Lemma 5.6, − 1 '8 G8 , D R > √= -D . " · |h i| # 10 · ·k k 8 R Õ∈ 3 2 with probability 1 2=− / . − By Lemma D.6 with , G,H = ~H = 1 G , E = -R D, ' = = and w = 'R, we get that 3 ( ) ·| | with probability 1 =− for all D such that -RD 6 =, − k k 1 1 G8 , D > √= -D 20 -RD √3 ln = 1 > √= -D 1 . |h i| 10 · ·k k− ·k k· − 20 · · ·k k− 8 S R ∈Õ∩

39 Note that

G8 , D (8 = (8 sign (8 G8 , D > (8 sign (8 G8 , D . h i− | |− ( )h i | |− ( )h i 8 = S 8 = S 8 = S 8 = S ∈[Õ]\ ∈[Õ]\ ∈[Õ]\ ∈[Õ]\

Applying Lemma D.6 with , G,H = ~ 0 6 H 6 1 sign H G , E = -D, w = (, ' = =, ( ) − ( )·| | we get that with probability 1 = 3, for all D ℝ3 such that -D = =, − − ∈ k k = sign (8 G8 , D = ~ 0 6 (8 6 1 sign (8 G8 , D ~ 0 6 (8 6 1 sign (8 G8 , D R ( )h i − ( )h i− − ( )h i 8 = S 8=1 ∈[Õ]\ Õ  

6 20=√3 ln = -D 1 . k k+ 3 2 3 Therefore, with probability 1 4=− / , for any  ℝ such that -  ∗ = =, − ∈ k ( − )k = = 1 f  > √= -D (8 (8 2= 20=√3 ln = -D 2 > (8 1 > f ∗ 1 . ℎ( ) 20 · ·k k+ | |− | |− − k k− | |+ ℎ( )+ 8=1 8 S 8=1 Õ Õ∈ Õ Note that since f is convex, with probability 1 4= 3 2, for any  such that -   > =, − − / k ( − ∗)k f  > f  1.  ℎ( ) ℎ( ∗)+ Now observe that f  is differentiable and ℎ( ) = = f  = ) G8 ,  y8 G8 = ) G8 ,  ∗ (8 G8 , ∇ ℎ( ) ℎ h i− · ℎ h − i− · 8=1 8=1 Õ  Õ  where ) C = Φ′ C = sign C min C ℎ, 1 , C ℝ. We will need the following lemma. ℎ( ) ℎ( ) ( )· {| |/ } ∈ Lemma 5.8. Let z be a random variable such that ℙ z 6 0 =ℙ z > 0 . Then for any such that ( ) ( ) > 2ℎ, | | )ℎ z > ℙ 0 6 sign z 6 2 . · z ( − ) | |· ( )· | |/ Proof. Note that  )ℎ z = )ℎ sign z . · z ( − ) | |· z | |− ( )· We get 

)ℎ sign z =ℙ sign z 6 ℎ ~sign z > ℎ )ℎ sign z z | |− ( ) ( ) | |− + z ( ) | |− · | |− ( )  > ℙ0 6 sign z 6  ℎ ℙ sign z < 0 ℙ sign z > 0  ( ) | |− + ( ) − ( ) > ℙ 0 6 sign z 6 2 . ( ) | |/     

40 Using point 3 of Assumption 5.1, we get for all 8 = and for all > 2ℎ, ∈[ ] 1 ) (8 R > min , 1 . · ℎ − 20 | |· {| | }    Note that for ℎ = 1, if z is symmetric, we can also show it for 6 2ℎ 6 2. Indeed,

)ℎ sign z = ~ z 6 ℎ)ℎ sign z ~ z > ℎ)ℎ sign z | | z | |− ( ) | | z | | | |− ( ) +| | z | | | |− ( )  > ~ z 6 ℎ)ℎ sign z  | | z | | | |− ( ) = ~ z 6 ℎ )ℎ z )ℎ  z , | | z | | (| |− )− (− )  since for symmetric z, z ~ z > ℎ) sign z > 0. Assuming the existence of | | ℎ | |− ( ) density ? of z such that ? C > 0.1 for all C 1, 1 , we get ( ) ∈ [− ]  1 ~ z 6 ℎ )ℎ z )ℎ z >0.1 )ℎ I )ℎ I 3I z | | (| |− )− (− ) 1 (| |− )− ( ) ¹−  1  > min , 1 . 20 {| | } We are now ready to prove Theorem 5.4.

Proof of Theorem 5.4. Consider u = #ˆ  . By Lemma 5.7, with probability 1 3= 3 2, − ∗ − − / -u 6 =. If -u < 100, we get the desired bound. So further we assume that 100 6 k k k k -u 6 =. k k Since fℎ  = 0, ∇ ( ˆ) = ) G8 , u (8 G8 , u = 0 . ℎ h i− · h i 8=1 Õ  For each 8 = , consider the function 8 defined as follows: ∈[ ]

8 0 = G8 , 0 )ℎ G8 , 0 (8 R ( ) h i h i− 3    for any 0 ℝ . Applying Lemma 5.8 with z = (8, = G8 , 0 and ℎ = 1 =, and using point ∈ h i / 3 of Assumption 5.1, we get for any 0 ℝ3, ∈ = = = 8 0 > ~ G8 , 0 < 2ℎ G8 , 0 ~ G8 , 0 > 2ℎ8 0 (5.2) ( ) − |h i| |h i|+ |h i| ( ) Õ8=1 Õ8=1 Õ8=1 1 > 2 G8 , 0 min G8 , 0 , 1 . (5.3) − + 20 |h i| · {|h i| } 8 R Õ∈ Note that this is the only place in the proof where we use ℎ 6 1 =. By the observation / described after Lemma 5.8, if for all 8 = , conditional distribution of (8 given R is ∈ [ ] symmetric about 0, the proof also works for ℎ = 1.

41 For G,H ℝ, consider , G,H = G ) G H . For any ΔG ℝ, ∈ ( ) · ℎ − ∈  G , G ΔG,H , G,H = G ΔG ) G ΔG H G ) G H 6 ΔG ΔG. | ( + )− ( )| ( + )· ℎ + − − · ℎ − + ℎ   By Lemma D.6 with E = -0, w = (, ' = =, and = 1 =2 , with probability 1 4= 3 2, + − − / for all 0 ℝ3 such that -0 6 =, ∈ k k  = G8 , 0 )ℎ G8 , 0 (8 8 0 6 25√3 ln = -0 1 =. (5.4) h i· h i− − ( ) ·k k+ / 8=1 Õ  

Let '1,..., '= be i.i.d. Bernoulli random variables such that ℙ '8 = 1 = 1 ℙ '8 = 0 = ( ) − ( ) . By Lemma D.7 and Lemma D.6 with , G,H = ~H = 1 G min G , 1 , E = -0, w8 = '8, ( ) | |· {| | } ' = = and = 2, with probability 1 3= 3 2, for all 0 ℝ3 such that -0 6 =, − − / ∈ k k = G8 , 0 min G8 , 0 , 1 > G8 , 0 min G8 , 0 , 1 20√3 ln = -0 1 =. (5.5) |h i|· {|h i| } |h i|· {|h i| }− k k− / 8 R 8=1 Õ∈ Õ Plugging 0 = u into inequalities 5.3, 5.4 and 5.5, we get

2 1000 G8 , u G8 , u 6 √3 ln = -u h i + |h i| ·k k G8 ,u 61 G8 ,u >1 |h Õi| |h Õi| with probability 1 7= 3 2. − − / If 2 1 2 G8 , u < -u , h i 3 k k G8 ,u 61 |h Õi| we get 1000 G8 , u 6 √3 ln = -u . |h i| ·k k G8 ,u >1 |h Õi| 1073 ln = Applying Lemma E.5 with < = 2 , 1 = 1 √3, = 8 = : G8 , u 6 1 , 2 = 1 2 ⌊ ⌋ / A { ∈[ ] |h i| } / and E = -u, we get a contradiction. Hence 2 1 2 G8 , u > -u h i 3 k k G8 ,u 61 |h Õi| and we get ˆ 3000 - # ∗ 6 √3 ln =, k − k   with probability at least 1 10= 3 2, which yields the desired bound.  − − /

42 References

[BFP+73] Manuel Blum, Robert W. Floyd, Vaughan R. Pratt, Ronald L. Rivest, and Robert Endre Tarjan, Time bounds for selection, J. Comput. Syst. Sci. 7 (1973), no. 4, 448–461. 25, 29

[BJKK17a] Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar, Consistent robust regression, Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish- wanathan, and R. Garnett, eds.), Curran Associates, Inc., 2017, pp. 2110–2119. 4, 5, 6

[BJKK17b] Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar, Consistent robust regression, NIPS, 2017, pp. 2107–2116. 5

[CRT05] Emmanuel Candes, Justin Romberg, and Terence Tao, Stable signal recovery from incomplete and inaccurate measurements, 2005. 3, 5, 7

[CSV17] Moses Charikar, Jacob Steinhardt, and Gregory Valiant, Learning from untrusted data, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017, pp. 47–60. 3

[CT05] Emmanuel Candes and Terence Tao, Decoding by linear programming, 2005. 3, 5, 7

[DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart, Robust estimators in high-dimensions without the computational intractability, SIAM Journal on Computing 48 (2019), no. 2, 742–864. 3

[DKS19] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart, Efficient algorithms and lower bounds for robust linear regression, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, Cal- ifornia, USA, January 6-9, 2019 (Timothy M. Chan, ed.), SIAM, 2019, pp. 2745– 2754. 5

[Don06] David L Donoho, Compressed sensing, IEEE Transactions on information theory 52 (2006), no. 4, 1289–1306. 7

[DT19] Arnak Dalalyan and Philip Thompson, Outlier-robust estimation of a sparse linear model using $\ell_1$-penalized Huber's M-estimator, Advances in Neural Information Processing Systems, 2019, pp. 13188–13198. 3, 5

[EvdG+18] Andreas Elsener, Sara van de Geer, et al., Robust low-rank matrix estimation, The Annals of Statistics 46 (2018), no. 6B, 3481–3509. 7

[GLR10] Venkatesan Guruswami, James R. Lee, and Alexander A. Razborov, Almost Euclidean subspaces of $\ell_1^N$ via expander codes, Combinatorica 30 (2010), no. 1, 47–68. 12, 62

[GLW08] Venkatesan Guruswami, James R. Lee, and Avi Wigderson, Euclidean sections of $\ell_1^N$ with sublinear randomness and error-correction over the reals, APPROX-RANDOM, Lecture Notes in Computer Science, vol. 5171, Springer, 2008, pp. 444–454. 6, 12

[Gro11] David Gross, Recovering low-rank matrices from few coefficients in any basis, IEEE Trans. Information Theory 57 (2011), no. 3, 1548–1566. 10

[HBRN08] Jarvis Haupt, Waheed U Bajwa, Michael Rabbat, and Robert Nowak, Com- pressed sensing for networked data, IEEE Signal Processing Magazine 25 (2008), no. 2, 92–101. 3

[Hoa61] C. A. R. Hoare, Algorithm 65: Find, Commun. ACM 4 (1961), no. 7, 321–322. 25

[Hub64] Peter J. Huber, Robust estimation of a location parameter, Ann. Math. Statist. 35 (1964), no. 1, 73–101. 3

[Jen69] Robert I. Jennrich, Asymptotic properties of non-linear least squares estimators, Ann. Math. Statist. 40 (1969), no. 2, 633–643. 56

[KKK19] Sushrut Karmalkar, Adam R. Klivans, and Pravesh Kothari, List-decodable linear regression, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 2019, pp. 7423–7432. 3

[KKM18] Adam R. Klivans, Pravesh K. Kothari, and Raghu Meka, Efficient algorithms for outlier-robust regression, Conference On Learning Theory, COLT 2018, Stock- holm, Sweden, 6-9 July 2018, 2018, pp. 1420–1430. 3

[KP18] Sushrut Karmalkar and Eric Price, Compressed sensing with adversarial sparse noise via l1 regression, arXiv preprint arXiv:1809.08055 (2018). 3, 5, 7

[KT07] Boris S Kashin and Vladimir N Temlyakov, A remark on compressed sensing, Mathematical notes 82 (2007), no. 5, 748–755. 7

[LLC19] Liu Liu, Tianyang Li, and Constantine Caramanis, High dimensional robust M-estimation: Arbitrary corruption and heavy tails, 2019. 3

[LM00] B. Laurent and P. Massart, Adaptive estimation of a quadratic functional by model selection, Ann. Statist. 28 (2000), no. 5, 1302–1338. 60

44 [LSLC18] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis, High dimen- sional robust sparse regression, arXiv preprint arXiv:1805.11643 (2018). 3

[NRWY09] Sahand Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu, A unified framework for high-dimensional analysis of $m$-estimators with decomposable regularizers, NIPS, Curran Associates, Inc., 2009, pp. 1348–1356. 9, 17

[NT13] Nam H. Nguyen and Trac D. Tran, Exact recoverability from dense corrupted observations via $\ell_1$-minimization, IEEE Transactions on Information Theory 59 (2013), no. 4. 7

[Pol91] David Pollard, Asymptotics for least absolute deviation regression estimators, Econo- metric Theory 7 (1991), no. 2, 186–199. 4, 7, 14

[RH15] Phillippe Rigollet and Jan-Christian Hütter, High dimensional statistics, Lecture notes for course 18S997 813 (2015), 814. 47

[RL05] Peter J Rousseeuw and Annick M Leroy, Robust regression and outlier detection, vol. 589, John wiley & sons, 2005. 3

[RY20] Prasad Raghavendra and Morris Yau, List decodable learning via sum of squares, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, 2020, pp. 161–180. 3

[SBRJ19] Arun Sai Suggala, Kush Bhatia, Pradeep Ravikumar, and Prateek Jain, Adap- tive hard thresholding for near-optimal consistent robust regression, Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, 2019, pp. 2892–2897. 4, 5, 6, 7, 8, 16, 26, 51

[SZF19] Qiang Sun, Wen-Xin Zhou, and Jianqing Fan, Adaptive huber regression, Journal of the American Statistical Association (2019), 1–24. 3

[TJSO14] Efthymios Tsakonas, Joakim Jaldén, Nicholas D Sidiropoulos, and Björn Otter- sten, Convergence of the huber regression m-estimate in the presence of dense outliers, IEEE Signal Processing Letters 21 (2014), no. 10, 1211–1214. 1, 4, 5, 6, 7, 9, 11, 17, 51

[TSW18] Kean Ming Tan, Qiang Sun, and Daniela Witten, Robust sparse reduced rank regression in high dimensions, arXiv preprint arXiv:1810.07913 (2018). 7

[Tuk75] John W Tukey, Mathematics and the picturing of data, Proceedings of the Interna- tional Congress of Mathematicians, Vancouver, 1975, vol. 2, 1975, pp. 523–531. 13

45 [Ver18] Roman Vershynin, High-dimensional probability: An introduction with applications in data science, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2018. 57

[Vis18] Nisheeth K Vishnoi, Algorithms for convex optimization, Cambridge University Press, 2018. 53

[Wai19] Martin J. Wainwright, High-dimensional statistics: A non-asymptotic viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge Uni- versity Press, 2019. 17, 57, 60

[WYG+08] John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2008), no. 2, 210–227. 3

46 A Error convergence and model assumptions

In this section we discuss the error convergence of our main theorems Theorem 1.1, Theorem 1.2 as well as motivate our model assumptions.

A.1 Lower bounds for consistent oblivious linear regression We show here that no estimator can obtain expected squared error > 3 2 = for any /( · ) 0, 1 and that no estimator can have expected error converging to zero for . 3 =. ∈ ( )  / The first claim is captured by the following statement. p = 3 2 Fact A.1. Let - ℝ be a matrix with linearly independent columns. Let ( # 0,  Id= ∈ × ∼ ( · ) with  > 0 so that = min8 ℙ (8 6 1 = Θ 1  . {| | } ( / ) Then there exists a distribution over # independent of ( such that for every estimator  : ℝ= ∗ ˆ → ℝ3, with probability at least Ω 1 , ( ) 1 2 3 - -#∗ ( -∗ > Ω . = ˆ( + )− 2 =  · 

In particular,for every estimator  : ℝ= ℝ3 there exists a vector  such that for y = - (, ˆ → ∗ ∗ + with probability at least Ω 1 , ( ) 1 2 3 - y -∗ > Ω . = ˆ( )− 2 =  · 

Fact A.1 is well-known and we omit the proof here (see for example [RH15]). The catch is that the vector ( # 0, 2 Id satisfies the noise constraints of Theorem 1.2 for ∼ ( · ) = Θ 1  . Hence, for . 3 = we obtain the second claim as an immediate corollary. ( / ) / p 3 = 3 Corollary A.2. Let =,3 ℝ and . . Let - ℝ × be a matrix with linearly independent ∈ = ∈ 2 = 3 columns and let ( # 0, 1 Id= . Then,q for every estimator  : ℝ ℝ there exists a vector ∼ ( / · ) ˆ →  such that for y = - (, with probability at least Ω 1 , ∗ ∗ + ( ) 1 2 - y -∗ > Ω 1 . = ˆ( )− ( )

In other words, in this regime no estimator obtains error converging to zero.

A.2 On the design assumptions Recall our linear model with Gaussian design:

y = ^∗  (A.1) +

47 3 = 3 = for ∗ ℝ , ^ ℝ × with i.i.d entries ^89 # 0, 1 and  ℝ a deterministic vector ∈ ∈ ∼ ( ) ∈ with = entries bounded by 1 in absolute value. We have mentioned in Section 1.1 how to · extend these Gaussian design settings to deterministic design settings. We formally show here how to apply Theorem 1.2 to reason about the Gaussian design model Eq. (A.1). For this it suffices to to turn an instance of model Eq. (A.1) into an instance of the model:

y′ = -′∗ (′ (A.2) + where  ℝ3, - is a =-by-3 matrix Ω = , Ω 1 -spread and the noise vector ( has ∗ ∈ ′ ( ( ) ( )) independent, symmetrically distributed entries with = min8 = ℙ (8 6 1 . This can be ∈[ ] done resampling through the following procedure. 

Algorithm 11 Resampling

Input: $(y, X)$ where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times d}$.

Sample $n$ indices $\omega_1, \ldots, \omega_n$ independently from the uniform distribution over $[n]$.

Sample $n$ i.i.d. Rademacher random variables $\sigma_1, \ldots, \sigma_n$.

Return the pair $(\mathbf{y}', \mathbf{X}')$ where $\mathbf{y}'$ is an $n$-dimensional random vector and $\mathbf{X}'$ is an $n$-by-$d$ random matrix:
$$\mathbf{y}'_i = \sigma_i \cdot y_{\omega_i}\,, \qquad \mathbf{X}'_{i,-} = \sigma_i \cdot X_{\omega_i, -}\,.$$
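A sketch of the resampling step:

```python
import numpy as np

def resample(y, X, rng=None):
    """Sketch of Algorithm 11: draw n row indices uniformly with replacement and
    flip each selected row's sign with an independent Rademacher variable."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    omega = rng.integers(0, n, size=n)           # indices omega_1, ..., omega_n
    sigma = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs
    return sigma * y[omega], sigma[:, None] * X[omega, :]
```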

Note that ^ ′ and ( are independent. The next result shows that for = & 3 log 3 with high probability over ^ ′ Algorithm 11 outputs an instance of Eq. (A.2) as desired.

Theorem A.3. Let = & 3 ln 3. Let ^ ℝ= 3 be a matrix obtained from a Gaussian matrix ′ ∈ × ^ ℝ= 3 by choosing (independently of ^) = rows of ^ with replacement. Then column span of ∈ × ^ is Ω = , Ω 1 -spread with probability 1 > 1 as = . ′ ( ( ) ( )) − ( ) → ∞ Proof. Let c 8 be the row chosen at 8-th step. Let’s show that with high probability for all ( ) subsets = of size <, c 1 6 $ < ln = < . By Chernoff bound (Fact D.2), for ℳ⊆[ ] − (ℳ) ( ( / )) any 9 = and any Δ > 1,

∈[ ] = Δ ℙ ~c 8 = 9 > 2Δ 6 Δ− . (A.3) ( ) ! Õ8=1 = > < Denote 0< = 4= <. Let 9 2Δ9 be the event 8=1 1 c 8 =9 2Δ9 . If 9=1 Δ9 = 2< ln 0<, / ( ) then [ ]  nÍ o Í < < 1 − ℙ 9 2Δ9 =ℙ < 2Δ< ℙ 9 2Δ9 < 2Δ< [ ( )] · ( )  9=1   9=1  Ù  Ù             48       < 1 − 6 ℙ < 2Δ< ℙ 9 2Δ9 [ ( )] ·  9=1  Ù    <   6 exp Δ9   −   Õ9=1 2<© ª = 0<− ­. ®

« ¬ < By union bound, with probability at least 1 0<− , for all subsets = of size <, 1 − ℳ= ⊆[ ] c 6 4< ln 4= < . Let z be the vector with entries z9 = ~c 8 = 9. With − (ℳ) ( / ) 8=1 ( ) probability at least 1 1 = 6 1 = 0 <, for any = , <=1 <− − / − ℳ⊆[ ] Í Í 4= z9 6 4 ln . | | |ℳ| 9   Õ∈ℳ |ℳ| Note that by Fact D.12, with probability at least 1 1 =, for any vector unit E from − / column span of ^ and any set = , ℳ⊆[ ] 4= E 2 6 23 20 ln . k ℳ k + |ℳ|  |ℳ|  Now let E be arbitrary unit vector from column span of ^ . Let be the set of its top ′ ′ S < = 26 = entries for some small enough constant 2. Then for some vector E from column · span of ^, with probability 1 > 1 , − ( ) 2 6 2 2 6 4 2 E′8 I9 E = I2 ,E 2 3 ln 4= 6= E , ( ) | |· 9 h (S) 2 i · ( ( )+ )·k k 8 9 c (S) Õ∈S ∈Õ(S) where the last inequality follows from Lemma E.7. 2 Now let’s bound E . Let D be a fixed unit vector, then ^D # 0, Id= . By Chernoff k k ∼ ( ) bound, with probability at least 1 exp 2= , number of entries of ^D bounded by 2 is at − (− ) most 22=. Note that with high probability 0.9√= D 6 ^D 6 1.1√= D k k k k k k Let D ℝ3 be a unit vector such that number of entries of ^D bounded by 2 is at most ′ ∈ ′ 22= and let D ℝ3 be a unit vector such that ^D ^D 2 6 1.12 = D D 2 6 22= 5. ∈ k ′ − k · k ′ − k / Then -D cannot have more than 32= entries bounded by 2 2. Hence by union bound over / 2 3-net in the unit ball in ℝ3, with probability at least 1 exp 2= 2 , E has at most 32= / − (− / ) 4 6 entries of magnitude smaller than 2 2. Hence E′ has at most 122= ln 32 0.9= entries / 2 of magnitude smaller than 2 E , and E 2 > 2 E 2. Choosing 2 = 0.01, we get that 3√= k k k ′k 100 k k  column span of ^ is 10 12 =, 1 2 -spread with high probability.  ′ ( − · / ) Remark A.4. It is perhaps more evident how the model considered in Theorem 5.4 sub- sumes the one of Theorem 1.1. Given y, ^ as in Eq. (A.1) it suffices to multiply the ( )

49 instance by an independent random matrix [ corresponding to flip of signs and permu- tation of the entries, then add an independent Gaussian vector w # 0, Id= . The model ∼ ( ) then becomes

[ y = [^∗ [ w , + + which can be rewritten as

y′ = ^ ′∗ [ w . + + = 3 = = Here ^ ′ ℝ × has i.i.d entries ^89 # 0, 1 , [ = YV where V ℝ × is a permutation ∈ ∼ ( ) ∈ matrix chosen u.a.r. among all permutation matrices, Y ℝ= = is a diagonal random ∈ × matrix with i.i.d. Rademacher variables on the diagonal and [ ℝ= is a symmetrically ∈ distributed vector independent of ^ such that ℙ (8 6 1 > 2. Moreover the entries ′ / of [ are conditionally independent given V. At this point, we can relax our Gaussian  design assumption and consider

y = -∗ w [ , (A.4) + + which corresponds to the model considered in Theorem 5.4.

A.2.1 Relaxing well-spread assumptions

It is natural to ask if under weaker assumptions on - we may design an efficient algorithm that correctly recovers ∗. While it is likely that the assumptions in Theorem 1.2 are not tight, some requirements are needed if one hopes to design an estimator with bounded error. Indeed, suppose there exists a vector  ℝ3 and a set = of cardinality > 1 such ∗ ∈ S⊆[ ] ( / ) that - ∗ = -∗ > 0. Consider an instance of linear regression y = -∗ ( with ( as S + in Theorem 1.2. Then with probability 1 > 1 any non-zero row containing information − ( ) about ∗ will be corrupted by (possibly unbounded) noise. More concretely:

Lemma A.5. Let  > 0 be arbitrarily chosen. For large enough absolute constant  > 0, let - ℝ= be an = -sparse deterministic vector and let ( be an =-dimensional random vector with ∈  i.i.d coordinates sampled as

(8 = 0 with probability 2 (8 # 0,  otherwise. ∼ ( ) Then for every estimator  : ℝ= ℝ there exists  ℝ such that for y = - (, with ˆ → ∗ ∈ ∗ + probability at least Ω 1 , ( ) 2 2 1  -  y ∗ > Ω . = ˆ( )− =    

50 Proof. Let = be the set of zero entries of - and its complement. Notice that with C⊆[ ] C probability 1 Ω 1 over ( the set S = 8 = 8 and (8 = 0 is empty. Conditioning − ( ) ∈[ ] ∈ C = = on this event , for any estimator  : ℝ n ℝ and ( define the functiono ,( : ℝ −|C| ℝ ℰ ˆ → C∗ C → such that ,( y =  y . Taking distribution over # from Fact A.1 (independent of (), we C ( ) ˆ( ) get with probabilityC Ω 1 ( ) 2 2 2 1 1 >  = -  y ∗ = = - ,( y ∗ Ω . ˆ( )− C ( C)− =      

Hence for any there exists with desired property.  ˆ ∗ Notice that the noise vector ( satisfies the premises of Theorem 1.2. Furthermore, since  > 0 can be arbitrarily large, no estimator can obtain bounded error.

A.3 On the noise assumptions On the scaling of noise. Recall our main regression model,

y = -∗ ( (A.5) + where we observe (a realization of) the random vector y, the matrix - ℝ= 3 is a known ∈ × design, the vector  ℝ= is the unknown parameter of interest, and the noise vector ( ∗ ∈ has independent, symmetrically distributed coordinates with = min8 = ℙ (8 6 1 . ∈[ ] {| | } A slight generalization of Eq. (A.5) can be obtained if we allow ( to have independent, symmetrically distributed coordinates with = min8 = ℙ (8 6  . This parameter  is ∈[ ] {| | } closely related to the subgaussian parameter  from [SBRJ19]. If we assume as in [SBRJ19] that some (good enough) estimator of this parameter is given, we could then simply divide each y8 by this estimator and obtain bounds comparable to those of Theorem 1.2. For unknown  1 better error bounds can be obtained if we decrease Huber loss parameter ≪ ℎ (for example, it was shown in the Huber loss minimization analysis of [TJSO14]). It is not difficult to see that our analysis (applied to small enough ℎ) also shows similar effect. However, as was also mentioned in [SBRJ19], it is not known whether in general  can be estimated using only y and -. So for simplicity we assume that  = 1.

A.3.1 Tightness of noise assumptions

We provide here a brief discussion concerning our assumptions on the noise vector . We argue that, without further assumptions on -, the assumptions on the noise in Theorem 5.4 are tight (notice that such model is more general than the one considered in Theorem 1.2). Fact A.6 and Fact A.7 provide simple arguments that we cannot relax median zero and independence noise assumptions. Fact A.6 (Tightness of Zero Median Assumption). Let 0, 1 be such that = is integer, ∈ ( ) and let 0 << 1 100. There exist ,  ℝ3 and - ℝ= 3 satisfying / ′ ∈ ∈ × 51 • the column span of - is = 2, 1 2 -spread, ( / / ) • -   >  - >  √=, k − ′ k 10 k k 10 · = and there exist distributions  and ′ over vectors in ℝ such that if I  or I ′, then: ∼ ∼

1. I 1,...,II= are mutually inependent,

2. For all 8 = , ℙ I 8 > 0 > ℙ I 8 6 0 > 1  ℙ I 8 > 0 , ∈[ ] ( ) ( ) ( − )· ( )

3. For all 8 = , there exists a density ?8 of I8 such that ?8 C > 0.1 for all C 1, 1 , ∈[ ] ( ) ∈ [− ] and for   and   , random variables -  and -  have the same distribution. ∼ ′ ∼ ′ + ′ + ′ Proof. It suffices to consider the one dimensional case. Let - ℝ= be a vector with all  ∈ entries equal to 1, let  = 1 and ′ = 1 10. Then  # 0, Id= and ′ = # , Id= with T − ∼ ( ) ( )  =  ,...,  ℝ= satisfy the assumptions, and random variables -  and -  10 10 ∈ + ′ + ′ have the same distribution.   Fact A.7 (Tightness of Independence Assumption). Let 0, 1 be such that = is integer. 3 = 3 ∈ ( ) There exist , ′ ℝ and - ℝ × satisfying ∈ ∈ • the column span of - is = 2, 1 2 -spread, ( / / ) • -   > - > √=, k − ′ k k k and there exists a distribution  over vectors in ℝ= such that   satisfies: ∼

1. For all 8 = , ℙ 8 6 0 > ℙ 8 > 0 , ∈[ ]

2. For all 8 = , there exists a density ?8 of I8 such that ?8 C > 0.1 for all C 1, 1 , ∈[ ] ( ) ∈ [− ] and for some  , with probability 1 2, -  = -  . ′ ∼ / + ′ + ′ Proof. Again it suffices to consider the one dimensional case. Let - ℝ= be a vector ∈ with all entries equal to 1,  = 0,  = 1. Let  * 1, 1 and let E ℝ= be a vector ′ ∼ {− } ∈ independent of  such that for all 8 = , the entries of E are iid E 8 = * 0, 1 . Then  =  E ∈[ ] [ ] · and ′ =  1 E satisfy the assumptions, and if  = 1, -  = -′ ′.  − ( − ) + + Note that the assumptions in both facts do not contain as opposed to the assumptions ℛ of Theorem 5.4. One can take to be a random subset of = of size = independent of  ℛ [ ] and ′.

52 B Computing the Huber-loss estimator in polynomial time

In this section we show that in the settings of Theorem 1.2, we can compute the Huber-loss estimator efficiently. For a vector E ℚ# we denote by b E its bit complexity. For A > 0 ∈ [ ] we denote by 0,A the Euclidean ball of radius A centered at 0. We consider the Huber ℬ( ) T loss function as in Definition 5.3 and for simplicity we will assume - - = =Id3. Theorem B.1. Let - ℚ= 3 be a matrix such that -T- = =Id and H ℚ= be a vector. Let ∈ × 3 ∈ B := b H b - [ ]+ [ ] Let 5 be a Huber loss function 5  = = Φ - H . Then there exists an algorithm that ( ) 8=1 − 8 given -, H and positive  ℚ, computes a vector  ℝ3 such that ∈ Í ∈   5  6 inf 5   , ( )  ℝ3 ( )+ ∈ in time

$ 1 B ( ) ln 1  . · ( / ) As an immediate corollary, the theorem implies that for design matrix - with orthog- onal columns we can compute an -close approximation of the Huber loss estimator in time polynomial in the input size. To prove Theorem B.1 we will rely on the following standard result concerning the Ellipsoid algorithm. Theorem B.2 (See [Vis18]). There is an algorithm that, given 1. a first-order oracle for a convex function , : ℝ3 ', → 2. a separation oracle for a convex set '3, ⊆ 3. numbers A > 0 and ' > 0 such that 0,A 0, ' , ℬ( ) ⊂ ⊂ ℬ( ) 4. bounds ℓ , D such that E , ℓ 6 , E 6 D ∀ ∈ ( ) 5. > 0, outputs a point G such that ∈ , G 6 , G  , ( ) ( ˆ)+ where G is any minimizer of , over . The running time of the algorithm is ˆ

2 2 ' D ℓ $ 3 ) ) 3 log − , + + , · · A ·      where ): ,), are the running times for the separation oracle for and the first-order oracle for , respectively.

53 We only need to apply Theorem B.2 to the settings of Theorem B.1. Let ˆ be a (global) minimizer of 5 . Our first step is to show that  is bounded by exp $ B . k ˆk ( ( )) B Lemma B.3. Consider the settings of Theorem B.1. Then 6 25 . Moreover, for any ˆ  B ∈ 0, 210 ℬ( ) B 0 6 5  6 212 . ( ) B 1 2 4 Proof. Let " = 2 . By definition 5 0 6 2 H 1 H 6 " . On the other hand for any ( ) k k + 2 k k E ℝ3 with E > "5 we have ∈ k k

5 E = Φ H8 -8 ,E ( ) − h i 8 = Õ∈[ ]  2 > Φ -8 ,E H H (h i)− − 1 8 = Õ∈[ ] > "5 "4 − > "4 .

It follows that  6 "5. For the second inequality note that for any E ℝ3 with E 6 "10 k ˆk ∈ k k

5 E = Φ H8 -8 ,E ( ) − h i 8 = Õ∈[ ]  2 6 2 Φ -8 ,E H 2 H (h i)+ + 1 8 = Õ∈[ ] 6 "12 .



Next we state a simple fact about the Huber loss function and separation oracles, which proof we omit. Recall the formula for the gradient of the Huber loss function. = = 1 Φ = 5 ∗ = )′ 8 G8 with ′ C sign C min C ,ℎ . ∇ ( ) 8=1 [ ]· [ ] ( )· {| | } Í B Fact B.4. Consider the settings of Theorem B.1. Let ' = 210 . Then

1. there exists an algorithm that given E ℚ3, computes 5 E and 5 E in time B$ 1 b$ 1 E . ∈ ∇ ( ) ( ) ( ) · ( )[ ] 2. there exists an algorithm that given E ℚ3 outputs ∈ • .( if E 0, ' ∈ ℬ( ) • otherwise outputs a hyperplane G ℝ3 0,G = 1 with 0, 1 ℚ3 separating E ∈ h i ∈ from 0, ' ,  ℬ( ) in time B$ 1 b$ 1 E . ( ) · ( )[ ] 54 We are now ready to prove Theorem B.1. B Proof of Theorem B.1. Let " = 2 . By Lemma B.3 it suffices to set = 0, "10 , ' = "10 12 ℬ( )$ 1 and A = ' 2. Then for any E , 0 6 5 E 6 " . By Fact B.4 ) ) 6  . Thus, / ∈ ( ) 5 + ( ) puttings things together and plugging Theorem B.2 itfollowsthatthere existsan algorithm computing  with 5  6 5   in time ( ) ( ˆ)+ $ 1 B ( ) ln 1  · ( / ) for positive  ℚ.  ∈

C Consistent estimators in high-dimensional settings

In this section we discuss the generalization of the notion of consistency to the case when the dimension 3 (and the fraction of inliers ) can depend on =.

Definition C.1 (Estimator). Let 3 = ∞==1 be a sequence of positive integers. We call a = = 3 = { ( )} 3 = = = 3 = function  : ∞ ℝ ℝ × ( ) ∞ ℝ ( ) an estimator,ifforall = ℕ,  ℝ ℝ × ( ) ˆ ==1 × → ==1 ∈ ˆ × ⊆ ℝ3 = and the restriction of  to ℝ= ℝ= 3 = is a Borel function. ( ) Ð ˆ Ð× × ( )  For example, Huber loss defined at H ℝ= and - ℝ= 3 = as ∈ ∈ × ( ) = = Φ  H,- argmin - H 8 ˆ 3 = −  ℝ ( ) 8=1  ∈ Õ   is an estimator (see Lemma C.4 for formal statement and the proof).

Definition C.2 (Consistent estimator). Let 3 = ∞==1 be a sequence of positive integers = = 3 = = 3 ={ ( )} and let  : ∞ ℝ ℝ ℝ be an estimator. Let ^= ∞ be a sequence ˆ ==1 × × ( ) → ==1 ( ) { }==1 of (possibly random) matrices and let (= ∞ be a sequence of (possibly random) vectors Ð Ð { }==1 such that = ℕ, ^= has dimensions = 3 = and (= has dimension =. ∀ ∈ × ( ) We say that estimator  is consistent for ^ and ( if there exists a sequence ˆ = ∞==1 = ∞==1 = { } { } of positive numbers = = such that lim= = = 0 and for all = ℕ, { }= 1 →∞ ∈ ℙ 1 2 > 6 sup = ^= ^=∗ (= , ^= ^=∗ = = . 3 = k ˆ + − k ∗ ℝ ( ) ∈    Theorem 1.1 implies that if sequences 3 = and = satisfy 3 = 2 = 6 > = and ( ) ( ) ( )/ ( ) ( ) 3 = , then Huber loss estimator is consistent for a sequence ^= ∞==1 of standard ( ) → ∞ = 3 = { } Gaussian matrices ^= # 0, 1 × ( ) and every sequence of vectors = ∞==1 such that = ∼ ( ) { } each = ℝ is independent of ^= and has at least = = entries of magnitude at most 1. ∈ ( )· Similarly, Theorem 1.2 implies that if sequences 3 = and = satisfy 3 = 2 = 6 > = ( ) ( ) ( )/ ( ) ( ) and 3 = , then Huber loss estimator is consistent for each sequence -= ∞ of ( ) → ∞ { }==1 55 = 3 = 2 matrices -= ℝ × ( ) whose column span is $ 3 = = , Ω 1 -spread and every ∈ ( ( )/ ( )) ( ) sequence of =-dimensional random vectors (= ∞ such that each (= is independent of { }==1  -= and has mutually independent, symmetrically distributed entries whose magnitude does not exceed 1 with probability at least = . ( ) Note that the algorithm from Theorem 1.4 requires some bound on  . So formally k ∗k we cannot say that the estimator that is computed by this algorithm is consistent. However, 3 = 3 = if in Definition C.2 we replace supremum over ℝ ( ) by supremum over some ball in ℝ ( ) centered at zero (say, of radius =100), then we can say that this estimator is consistent. More precisely, it is consistent (according to modified definition with sup over ball of radius 100 = 3 = = ) for sequence ^= ∞==1 of standard Gaussian matrices ^= # 0, 1 × ( ) and every { } = ∼ ( ) sequence of vectors = ∞ such that each = ℝ is independent of ^= and has at least { }==1 ∈ = = entries of magnitude at most 1, if = & 3 = log2 3 = 2 = and 3 = . ( )· ( ) ( ( ))/ ( ) ( ) → ∞ To show that Huber loss minimizer is an estimator, we need the following fact: Fact C.3. [Jen69] For 3,# ℕ, let Θ ℝ3 be compact and let ℝ# be a Borel set. Let ∈ ⊂ ℳ ⊆ 5 : Θ ℝ be a function such that for each  Θ, 5 G,  is a Borel function of G and for ℳ× → ∈ ( ) each G , 5 G,  is a continuous function of . Then there exists a Borel function  : Θ ∈ ℳ ( ) ˆ ℳ → such that for all G , ∈ ℳ 5 G,  G = min 5 G,  . ( ˆ( ))  Θ ( ) ∈ The following lemma shows that Huber loss minimizer is an estimator.

Lemma C.4. Let 3 = ∞ be a sequence of positive integers. There exists an estimator  such { ( )}==1 ˆ that for each = ℕ and for all H ℝ= and - ℝ= 3 = , ∈ ∈ ∈ × ( ) = = Φ = Φ - H,- H min - H 8 . ˆ − 8  ℝ3 = − 8=1 ∈ ( ) 8=1 Õ     Õ   Proof. For 8 ℕ denote ∈ 8 = = 3 = 8 8 = H,- ℝ ℝ × ( ) H 6 2 , min - > 2− , ℳ= ∈ × k k ( ) where min - is the smallest positive singular value of -. Note that for all ( ) H,- 8 , there exists a minimizer of Huber loss function at H,- in the ball ( ) ∈ ℳ= ( )  ℝ3 =  6 228 =10 . By Fact C.3, there exists a Borel measurable Huber loss min- ∈ ( ) k k imizer 8 H,- on 8 .  ˆ=( ) ℳ= Denote 0 = H,- ℝ= ℝ= 3 = - = 0 and let 0 H,- = 0 for all H,- 0 . = × ( ) ˆ= = ℳ 8 = ∈ = ×3 = = = 3 =( ) ( )8 ∈ ℳ Note that ∞ = ℝ ℝ . For H,- ℝ ℝ , define = H,- =  H,- , 8=0 ℳ=  × × ( ) ( ) ∈ × × ( ) ˆ ( ) ˆ=( ) where 8 is a minimal index such that H,- 8 . Then for each Borel set ℝ3 = , Ð ∈ ℳ= ℬ ⊆ ( ) 1 1 1 − ∞ − 8  8 1 ∞ 8 − 8 8 1 = = = − =  − . ˆ (ℬ) ˆ (ℬ) ∩ ℳ= \ ℳ= ˆ= (ℬ) ∩ ℳ= \ ℳ=   Ø8=0     Ø8=0     = = 3 = Hence = is a Borel function. Now for each = ℕ and for all H ℝ and - ℝ × ( ) ˆ ∈ ∈ ∈ define an estimator  H,- = = H,- .  ˆ( ) ˆ ( ) 56 D Concentration of measure

This section contains some technical results needed for the proofs of Theorem 1.2 and Theorem 1.4. We start by proving a concentration bound for the empirical median.

Fact D.1 ([Ver18]). Let 0 << 1. Let = E ℝ= E 6 1 . Then has an -net of size 2  = ℬ { ∈ | k 2k = } ℬ + . That is, there exists a set of size at most + such that for any vector D  N ⊆ ℬ  ∈ ℬ there exists some E such that E D 6 .  ∈ N k − k 

Fact D.2 (Chernoff’s inequality, [Ver18]). Let '1,..., '= be independent Bernoulli random variables such that ℙ '8 = 1 =ℙ '8 = 0 = ?. Then for every Δ > 0, ( ) ( ) = ?= 4 Δ ℙ ' > ?= 1 Δ 6 − . 8 1 Δ ( + )! 1 Δ + Õ8=1  ( + )  and for every Δ 0, 1 , ∈ ( ) = ?= 4 Δ ℙ ' 6 ?= 1 Δ 6 − . 8 1 Δ ( − )! 1 Δ − Õ8=1  ( − ) 

Fact D.3 (Hoeffding’s inequality, [Wai19]). Let z1,..., z= be mutually independent random variables such that for each 8 = , z8 is supported on 28 ,28 for some 28 > 0. Then for all C > 0, ∈[ ] [− ] = C2 ℙ z z > C 6 . 8 8 2exp = 2 ( − ) ! −2 8= 2 ! 8=1 1 8 Õ Í Fact D.4 (Bernstein’s inequality [Wai19 ]). Let z1,..., z= be mutually independent random variables such that for each 8 = , z8 is supported on , for some  > 0. Then for all C > 0, ∈[ ] [− ] = C2 ℙ z z > C 6 exp . 8 8 = 2 2C ( − ) ! −2 = z ! Õ8=1 8 1 8 + 3

Lemma D.5 (Restate of Lemma 4.6). Let = be aÍ set of size = and let z1,..., z= ℝ be S⊆[ ] ∈ mutually independent random variables satisfying

1. For all 8 = , ℙ z8 > 0 =ℙ z8 6 0 . ∈[ ] ( ) ( )

2. For some  > 0, for all 8 , ℙ z8 0,  =ℙ z8 , 0 > @. ∈ S ( ∈[ ]) ( ∈ [− ]) Then with probability at least 1 2exp Ω @22= the median zˆ satisfies − −  zˆ 6  . | |

57 Proof. Let Z = z1,..., z= . Consider the following set: { } A := z Z z 6  . { ∈ | | | }

Denote Z+ = Z ℝ>0, A+ = A ℝ>0, Z− = Z ℝ60, A− = A ℝ60. Applying ∩ ∩ ∩ ∩ Chernoff bound for 1, 2, 3 0, 1 , ∈ ( ) 2 1 1 = ℙ Z+ 6 1 = 6 exp · , 2 − − 10     ( )

2 2 @ ℙ A 6 1 2 @ 6 exp · · |S| , | | ( − )· · |S| (− 10 )  2 A 1 3 ℙ A+ 6 3 A A 6 exp ·| | . | | 2 − ·| | | | − 10     ( )

Similar bounds hold for Z−, A−. Now, the median is in A if Z− A+ > = 2 and Z+ A− > = 2. It is enough | |+| | / +| | / to prove one of the two inequalities, the proof for the other is analogous. A union bound

then concludes the proof. 2 1 1 = So for 2 = 3 = , with probability at least 1 exp · 2exp Ω @ , 4 − − 10 − − · |S| n o 1 @   Z− A+ > 1 = · |S| . | |+| | 2 − + 10   it follows that Z− A+ > = 2 for | |+| | / @  6 · |S| . 1 10= 

Lemma D.6. Let + be an <-dimensional vector subspace of ℝ=. Let E + E 6 ' for ℬ ⊆ { ∈ | k k } some ' > 1 . Let , : ℝ2 ℝ be a function such that for all H ℝ and G 6 ', , G,H 6  G for some → ∈ | | | ( )| | |  > 1 and for any ΔG 6 1, , G ΔG,H , G,H 6 ΔG for some > 1. = | | | ( + )− ( )| | | Let w ℝ be a random vector such that w1,..., w= are mutually independent. For any ∈ # > =, with probability at least 1 # <, for all E , − − ∈ ℬ = , E8 , w8 , E8 , w8 6 10 < ln ' # E 1 #. ( )− w ( ) ( )·k k+ / 8=1 Õ  p = Proof. Consider some E ℝ . Since ,8 E8 , w8 6  E8 , by Hoeffding’s inequality, ∈ | ( )| | | = , E8 , w8 , E8 , w8 6  E ( )− w ( ) k k 8=1   Õ

58 with probability 1 2exp 2 2 . − − / < Let # > = and  = 1 . Denote by some -net in such that 6 6 ' . By 2 =#  N ℬ |N|  union bound, for any E , ∈ N  = , E8 , w8 , E8 , w8 6 10 < ln ' # E ( )− w ( ) ( )·k k 8=1   Õ p with probability at least 1 # <. − − Consider arbitrary ΔE + such that ΔE 6 . For any E and F ℝ=, ∈ k k ∈ N ∈ = = = 1 , E8 ΔE8 ,F8 , E8 ,F8 6 , E8 ΔE8 ,F8 , E8 ,F8 6 ΔE8 6 . ( + )− ( ) ( + )− ( ) | | 2# 8=1 8=1 8=1 Õ  Õ Õ

Hence = = 1 , E8 ΔE8 , w8 , E8 , w8 6 , E8 ΔE8 , w8 , E8 , w8 6 , w ( + )− ( ) w ( + )− ( ) 2# 8=1 8=1 Õ  Õ  and = = , E8 , w8 , E8 , w8 , E8 ΔE8 ,F8 , ΔE8 , w8 6 1 #. ( )− w ( ) − ( + )− w ( ) / 8=1   8=1   Õ Õ Therefore, with probability 1 # <, for any E , − − ∈ ℬ = , E8 , w8 , E8 , w8 6 10 < ln ' # E 1 #. ( )− w ( ) ( )·k k+ / 8=1   Õ p 

Lemma D.7. Let '1,..., '= be i.i.d. Bernoulli random variables such that ℙ '8 = 1 = 1 ( ) − ℙ '8 = 0 = < = for some integer < 6 =. Denote S1 = 8 = '8 = 1 . Let S2 = be a ( ) / { ∈[ ] | } ⊆ [ ] random set chosen uniformly from all subsets of = of size exactly <. [ ] Let % be an arbitrary property of subsets of = . If S1 satisfies % with probability at least 1  [ ] − (for some 0 6  6 1), then S2 satisfies % with probability at least 1 2√=. − Proof. If < = 0 or < = =, then S1 = S2 with probability 1. So it is enough to consider the case 0 < < < =. By Stirling’s approximation, for any integer : > 1,

:: :: :: √2: 6 :! 6 √2: 6 1.1 √2: . : : 1 12: : · 4 · 4 − /( ) · · 4 Hence

< = < = < = < − √= 1 ℙ S1 = < = − > > . (| | ) < = = 1.12 2< = < 2√=       · ( − ) 59 p Therefore, ℙ S1 ∉ % ℙ S2 ∉ % =ℙ S1 ∉ % S1 = < 6 ( ) 6 2√= . ( ) ( | | | ) ℙ S1 = < (| | ) 

Fact D.8 (Covariance estimation of Gaussian vectors, [Wai19]). Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be i.i.d. $\mathbf{x}_i \sim N(0, \Sigma)$ for some positive definite $\Sigma \in \mathbb{R}^{d\times d}$. Then, with probability at least $1 - \delta$,
$$\left\|\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{\top} - \Sigma\right\| \;\le\; O\!\left(\sqrt{\frac{d + \log(1/\delta)}{n}} + \frac{d + \log(1/\delta)}{n}\right)\cdot\|\Sigma\|\,.$$

Lemma D.9. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be i.i.d. $\mathbf{x}_i \sim N(0, \Sigma)$ for some positive definite $\Sigma \in \mathbb{R}^{d\times d}$. Let $k \in [d]$. Then, with probability at least $1 - \delta$, for any $k$-sparse unit vector $v \in \mathbb{R}^d$,
$$\left|v^{\top}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{\top} - \Sigma\right)v\right| \;\le\; O\!\left(\sqrt{\frac{k\log d + \log(1/\delta)}{n}} + \frac{k\log d + \log(1/\delta)}{n}\right)\cdot\|\Sigma\|\,.$$

Proof. If $d = 1$, then $k\log d = 0$ and the statement is true by Fact D.8, so assume $d > 1$. Consider some set $\mathcal{S} \subseteq [d]$ of size at most $k$. By Fact D.8, with probability at least $1 - \delta'$, for any $k$-sparse unit vector $v$ with support $\mathcal{S}$,
$$\left|v^{\top}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{\top} - \Sigma\right)v\right| \;\le\; O\!\left(\sqrt{\frac{k + \log(1/\delta')}{n}} + \frac{k + \log(1/\delta')}{n}\right)\cdot\|\Sigma\|\,.$$
Since there are at most $\exp(2k\ln d)$ subsets of $[d]$ of size at most $k$, the lemma follows from a union bound with $\delta' = \delta\exp(-3k\ln d)$. $\square$

Fact D.10 (Chi-squared tail bounds, [LM00]). Let $\mathbf{X} \sim \chi^2_m$ (that is, a squared norm of a standard $m$-dimensional Gaussian vector). Then for all $x > 0$,
$$\mathbb{P}\left(\mathbf{X} - m \ge 2\sqrt{mx} + 2x\right) \;\le\; e^{-x}\,.$$

Fact D.11 (Singular values of Gaussian matrix, [Wai19]). Let $\mathbf{W} \sim N(0, 1)^{n\times d}$, and assume $n \ge d$. Then for each $t > 0$,
$$\mathbb{P}\left(\sigma_{\max}(\mathbf{W}) \ge \sqrt{n} + \sqrt{d} + \sqrt{t}\right) \;\le\; \exp(-t/2)$$
and
$$\mathbb{P}\left(\sigma_{\min}(\mathbf{W}) \le \sqrt{n} - \sqrt{d} - \sqrt{t}\right) \;\le\; \exp(-t/2)\,,$$
where $\sigma_{\max}(\mathbf{W})$ and $\sigma_{\min}(\mathbf{W})$ are the largest and the smallest singular values of $\mathbf{W}$.
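A quick empirical illustration of Fact D.11 (not part of the paper's argument; the dimensions, the value of $t$, and the number of trials below are arbitrary choices): the extreme singular values of a standard Gaussian matrix concentrate around $\sqrt{n} \pm \sqrt{d}$.

\begin{verbatim}
# Empirical illustration of Fact D.11 (illustration only; parameters are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
n, d, t, trials = 2000, 50, 9.0, 200
upper = np.sqrt(n) + np.sqrt(d) + np.sqrt(t)       # violated with probability <= exp(-t/2)
lower = np.sqrt(n) - np.sqrt(d) - np.sqrt(t)       # violated with probability <= exp(-t/2)

viol_hi = viol_lo = 0
for _ in range(trials):
    s = np.linalg.svd(rng.standard_normal((n, d)), compute_uv=False)
    viol_hi += s[0] >= upper
    viol_lo += s[-1] <= lower
print(f"upper violations: {viol_hi}/{trials}, lower violations: {viol_lo}/{trials}, "
      f"bound per trial: {np.exp(-t / 2):.4f}")
\end{verbatim}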

Fact D.12 ($k$-sparse norm of a Gaussian matrix). Let $\mathbf{W} \sim N(0, 1)^{n\times d}$ be a Gaussian matrix and let $1 \le k \le n$. Then for every $\delta > 0$, with probability at least $1 - \delta$,
$$\max_{\substack{u \in \mathbb{R}^d\\ \|u\| = 1}}\;\max_{\substack{k\text{-sparse } v \in \mathbb{R}^n\\ \|v\| = 1}} v^{\top}\mathbf{W}u \;\le\; \sqrt{d} + \sqrt{k} + \sqrt{2k\ln\frac{en}{k} + 2\ln(1/\delta)}\,.$$

Proof. Let $v$ be some $k$-sparse unit vector that maximizes the value, and let $S(v)$ be the set of nonzero coordinates of $v$. Consider some fixed (independent of $\mathbf{W}$) unit $k$-sparse vector $x \in \mathbb{R}^n$ and the set $S(x)$ of nonzero coordinates of $x$. If we remove from $\mathbf{W}$ all the rows with indices not from $S(x)$, we get a $k \times d$ Gaussian matrix $\mathbf{W}_{S(x)}$. By Fact D.11, the norm of this matrix is bounded by $\sqrt{d} + \sqrt{k} + \sqrt{t}$ with probability at least $1 - \exp(-t/2)$. The number of all subsets $S \subseteq [n]$ of size $k$ is $\binom{n}{k} \le \left(\frac{en}{k}\right)^{k}$. By a union bound, the probability that the norm of $\mathbf{W}_{S(v)}$ is greater than $\sqrt{d} + \sqrt{k} + \sqrt{t}$ is at most
$$\binom{n}{k}\cdot\exp(-t/2) \;\le\; \exp\bigl(k\ln(en/k) - t/2\bigr)\,.$$
Taking $t = 2k\ln(en/k) + 2\ln(1/\delta)$, we get the desired bound. $\square$
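For small dimensions the $k$-sparse norm can be computed exactly by enumerating supports, which gives a simple check of Fact D.12. The snippet below is purely illustrative and its parameters are arbitrary choices of ours.

\begin{verbatim}
# Small-scale exact check of Fact D.12 (illustration only; parameters are arbitrary):
# the k-sparse norm of a Gaussian matrix, computed by enumerating supports, vs. the bound.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, d, k, delta = 30, 10, 3, 0.01
W = rng.standard_normal((n, d))

sparse_norm = max(np.linalg.norm(W[list(S), :], ord=2)      # largest singular value of W_S
                  for S in combinations(range(n), k))
bound = np.sqrt(d) + np.sqrt(k) + np.sqrt(2 * k * np.log(np.e * n / k) + 2 * np.log(1 / delta))
print(f"k-sparse norm: {sparse_norm:.3f}   bound: {bound:.3f}")
\end{verbatim}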

E Spreadness notions of subspaces

Lemma E.1. Let $v \in \mathbb{R}^n$ be a vector with $\frac{1}{n}\|v\|^2 = 1$. Suppose
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\bigl\{v_i^2 \le 1/\alpha\bigr\}\cdot v_i^2 \;\ge\; \beta\,.$$
Then $\frac{1}{n}\|v_S\|^2 \ge \beta/2$ for every subset $S \subseteq [n]$ with $|S| \ge (1 - \alpha\beta/2)\,n$.

The above lemma is tight in the sense that there are vectors $v$ that satisfy the premise but have $\|v_S\| = 0$ for a subset $S \subseteq [n]$ of size $|S| = (1 - \alpha\beta)\cdot n$.

Proof. Let $T = [n]\setminus S$, so that $T$ consists of at most $\frac{\alpha\beta}{2}\cdot n$ entries of $v$. Let $w_i = \mathbb{1}\{v_i^2 \le 1/\alpha\}\cdot v_i$. Then $\frac{1}{n}\|w_T\|^2 \le \frac{1}{\alpha}\cdot\frac{\alpha\beta}{2} \le \beta/2$. Thus, $\frac{1}{n}\|v_S\|^2 \ge \frac{1}{n}\|w\|^2 - \frac{1}{n}\|w_T\|^2 \ge \beta/2$. $\square$

Lemma E.2. Let $v \in \mathbb{R}^n$ be a vector with $\frac{1}{n}\|v\|^2 = 1$. Suppose $\frac{1}{n}\|v_S\|^2 \ge \beta$ for every subset $S \subseteq [n]$ with $|S| \ge (1 - \alpha)\,n$. Then
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\bigl\{v_i^2 \le 1/\alpha\bigr\}\cdot v_i^2 \;\ge\; \beta\,.$$

Proof. The number of entries satisfying $v_i^2 \le 1/\alpha$ is at least $(1 - \alpha)\,n$, so the claim follows by applying the assumption to the set $S = \{i \in [n] \mid v_i^2 \le 1/\alpha\}$. $\square$
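A small numerical illustration of Lemma E.1 (purely illustrative; the Gaussian test vector and the parameters $\alpha, \beta$ below are arbitrary choices, not from the paper): we check the premise and then the conclusion on the worst-case large subset, obtained by removing the largest-magnitude entries.

\begin{verbatim}
# Numerical illustration of Lemma E.1 (illustration only; arbitrary parameters).
import numpy as np

rng = np.random.default_rng(2)
n, alpha, beta = 10_000, 0.1, 0.5
v = rng.standard_normal(n)
v *= np.sqrt(n) / np.linalg.norm(v)              # normalize so that ||v||^2 / n = 1

capped_mass = np.mean((v**2 <= 1 / alpha) * v**2)
print(f"premise:    (1/n) sum 1[v_i^2 <= 1/alpha] v_i^2 = {capped_mass:.3f} (need >= {beta})")

# worst large subset: keep the (1 - alpha*beta/2) n smallest-magnitude entries
keep = int(np.ceil((1 - alpha * beta / 2) * n))
S = np.argsort(np.abs(v))[:keep]
print(f"conclusion: (1/n) ||v_S||^2 = {np.sum(v[S]**2) / n:.3f} (lemma: >= {beta / 2})")
\end{verbatim}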

Lemma E.3. Let $V \subseteq \mathbb{R}^n$ be a vector subspace. Assume that for some $\epsilon, \rho \in (0, 1)$, for all $v \in V$,
$$\sum_{i=1}^{n}\mathbb{1}\left\{v_i^2 \le \frac{1}{\rho^2 n}\|v\|^2\right\}\cdot v_i^2 \;\ge\; \epsilon^2\,\|v\|^2\,.$$
Then $V$ is $\left(\frac{\epsilon^2\rho^2}{4}\,n,\ \frac{\epsilon}{2}\right)$-spread.

Proof. Let $m = \frac{\epsilon^2\rho^2}{4}\,n$. Consider a vector $v \in V$ with $\|v\|^2 = m$. The set $M = \{i \in [n] \mid v_i^2 > 1\}$ has size at most $m$, so its complement is a disjoint union of three sets: the set $S$ of the $n - m$ smallest entries of $v$, the set $M' \subseteq [n]\setminus S$ of entries of magnitude at most $\epsilon/2$, and the set of entries $M'' \subseteq [n]\setminus(S \cup M')$ of magnitude between $\epsilon/2$ and $1$. Note that since $|M'| \le m$,
$$\sum_{i \in M'} v_i^2 \;\le\; \frac{1}{4}\epsilon^2 m \;=\; \frac{1}{4}\epsilon^2\|v\|^2\,.$$
Now consider $w = \frac{2}{\epsilon}\,v$, so that $\|w\|^2 = \rho^2 n$. The set $N$ of entries of $w$ of magnitude at most one is a subset of $S \cup M'$. By our assumption, $\sum_{i:\, w_i^2 \le 1} w_i^2 \ge \epsilon^2\|w\|^2$. Hence
$$\sum_{i \in S} w_i^2 \;\ge\; \sum_{i:\, w_i^2 \le 1} w_i^2 \;-\; \frac{4}{\epsilon^2}\sum_{i \in M'} v_i^2 \;\ge\; \frac{3}{4}\,\epsilon^2\,\|w\|^2\,.$$
Since this inequality is scale invariant, $V$ is $\left(m, \sqrt{3/4}\cdot\epsilon\right)$-spread. $\square$

Fact E.4 ([GLR10]). Let $n \in \mathbb{N}$ and let $V$ be a vector subspace of $\mathbb{R}^n$. Define
$$\Delta(V) \;=\; \sup_{\substack{v \in V\\ \|v\| = 1}} \sqrt{n}\,/\,\|v\|_1\,.$$
Then

1. If $V$ is $(m, \epsilon)$-spread, then $\Delta(V) \le \frac{1}{\epsilon^2}\sqrt{n/m}$.

2. $V$ is $\left(\frac{n}{2\Delta(V)^2},\ \frac{1}{4\Delta(V)}\right)$-spread.

The lemma below relates $\|\cdot\|_1$ and $\|\cdot\|$ for vectors satisfying specific sparsity constraints.

Lemma E.5. Let $m \in [n]$, $\mathcal{A} \subset [n]$, $\alpha_1 > 0$ and $\alpha_2 > 0$. Let $v \in \mathbb{R}^n$ be a vector such that
$$\sum_{i \in \mathcal{A}} v_i^2 \;\le\; \alpha_1^2\,\|v\|^2$$
and for any set $\mathcal{M} \subset [n]$ of size $m$,
$$\sum_{i \in \mathcal{M}} v_i^2 \;\le\; \alpha_2^2\,\|v\|^2\,.$$
Then
$$\sum_{i \in [n]\setminus\mathcal{A}} |v_i| \;\ge\; \frac{1 - \alpha_1^2 - \alpha_2^2}{\alpha_2}\,\sqrt{m}\;\|v\|\,.$$

Proof. Let $\mathcal{M}$ be the set of the $m$ largest coordinates of $v$ (by absolute value). Since the inequality $\sum_{i \in [n]\setminus\mathcal{A}} |v_i| \ge \frac{1 - \alpha_1^2 - \alpha_2^2}{\alpha_2}\sqrt{m}\,\|v\|$ is scale invariant, assume without loss of generality that $|v_i| \ge 1$ for all $i \in \mathcal{M}$ and $|v_i| \le 1$ for all $i \in [n]\setminus\mathcal{M}$. Then
$$\|v\|^2 \;\le\; \sum_{i \in \mathcal{M}} v_i^2 + \sum_{i \in \mathcal{A}} v_i^2 + \sum_{i \in [n]\setminus(\mathcal{A}\cup\mathcal{M})} v_i^2 \;\le\; \bigl(\alpha_2^2 + \alpha_1^2\bigr)\|v\|^2 + \sum_{i \in [n]\setminus(\mathcal{A}\cup\mathcal{M})} v_i^2\,,$$
hence
$$\bigl(1 - \alpha_2^2 - \alpha_1^2\bigr)\|v\|^2 \;\le\; \sum_{i \in [n]\setminus(\mathcal{A}\cup\mathcal{M})} v_i^2 \;\le\; \sum_{i \in [n]\setminus(\mathcal{A}\cup\mathcal{M})} |v_i| \;\le\; \sum_{i \in [n]\setminus\mathcal{A}} |v_i|\,.$$
Note that
$$\bigl(1 - \alpha_2^2 - \alpha_1^2\bigr)\|v\|^2 \;\ge\; \frac{1 - \alpha_2^2 - \alpha_1^2}{\alpha_2^2}\sum_{i \in \mathcal{M}} v_i^2 \;\ge\; \frac{1 - \alpha_2^2 - \alpha_1^2}{\alpha_2^2}\,m\,.$$
Therefore,
$$\left(\sum_{i \in [n]\setminus\mathcal{A}} |v_i|\right)^{2} \;\ge\; \frac{\bigl(1 - \alpha_1^2 - \alpha_2^2\bigr)^{2}}{\alpha_2^2}\,m\cdot\|v\|^2\,. \qquad\square$$

Lemma E.6. Suppose that a vector $v \in \mathbb{R}^n$ satisfies the following property: for any $\mathcal{S} \subseteq [n]$,
$$\sum_{i \in \mathcal{S}} |v_i| \;\le\; |\mathcal{S}|\cdot\ln\frac{en}{|\mathcal{S}|}\,. \tag{E.1}$$
Then $\|v\| \le \sqrt{6n}$.

Proof. Note that the set of vectors which satisfy Eq. (E.1) is compact. Hence there exists a vector $v$ in this set with maximal $\|v\|$. Without loss of generality we can assume that the entries of $v$ are nonnegative and sorted in descending order. Then for any $m \in [n]$,
$$\sum_{i=1}^{m} v_i \;\le\; m\,\ln\frac{en}{m}\,.$$
Assume that for some $m$ the corresponding inequality is strict.

Let us increase the last term $v_m$ by a small enough $\epsilon > 0$. If there are no nonzero $v_{m'}$ for $m' > m$, all inequalities are still satisfied and $\|v\|$ becomes larger, which contradicts our choice of $v$. So there exists the smallest $m' > m$ such that $v_{m'} > 0$, and after decreasing $v_{m'}$ by $\epsilon$ (choosing $\epsilon < v_{m'}$) all inequalities are still satisfied. Moreover, $\|v\|$ increases after this operation:
$$(v_m + \epsilon)^2 + (v_{m'} - \epsilon)^2 \;=\; v_m^2 + v_{m'}^2 + 2\epsilon(v_m - v_{m'}) + 2\epsilon^2 \;>\; v_m^2 + v_{m'}^2\,.$$

Therefore, there are no strict inequalities. Hence $v_1 = \ln(en)$ and for all $m > 1$,
$$v_m \;=\; m\ln\frac{en}{m} - (m-1)\ln\frac{en}{m-1} \;=\; m\ln(1 - 1/m) + \ln\frac{en}{m-1} \;\le\; \ln\frac{en}{m-1}\,.$$
Since $\sum_{j=2}^{n} f(j) \le \int_{1}^{n} f(x)\,dx$ for any decreasing function $f \colon [1, n] \to \mathbb{R}$,
$$\|v\|^2 \;\le\; \ln^2(en) + \sum_{j=1}^{n-1}\ln^2\frac{en}{j} \;\le\; 2\ln^2(en) + \int_{1}^{n}\ln^2\frac{en}{x}\,dx\,.$$
Note that
$$\int\ln^2\frac{en}{x}\,dx \;=\; 2x + 2x\ln\frac{en}{x} + x\ln^2\frac{en}{x}\,.$$
Hence
$$\int_{1}^{n}\ln^2\frac{en}{x}\,dx \;=\; 5n - \ln^2(en) - 2\ln(en) - 2 \;\le\; 5n - \ln^2(en)\,,$$
and we get the desired bound. $\square$

Lemma E.7. Suppose that vectors $v \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$ satisfy the following properties: for some $t_1 > 0$ and $t_2 > 0$, for any $\mathcal{S} \subseteq [n]$,
$$\sum_{i \in \mathcal{S}} |v_i| \;\le\; t_1 + |\mathcal{S}|\cdot\ln\frac{en}{|\mathcal{S}|} \tag{E.2}$$
and
$$\sum_{i \in \mathcal{S}} |w_i| \;\le\; t_2 + |\mathcal{S}|\cdot\ln\frac{en}{|\mathcal{S}|}\,. \tag{E.3}$$
Then $|\langle v, w\rangle| \le t_1 t_2 + (t_1 + t_2)\ln(en) + 6n$.

Proof. Note that the set of pairs of vectors which satisfy these properties is compact. Hence there exist vectors $v, w$ that satisfy these properties such that $|\langle v, w\rangle|$ is maximal. Without loss of generality we can assume that the entries of $v$ and $w$ are nonnegative and sorted in descending order. Moreover, if some entry $v_i$ of $v$ is zero, we can increase $|\langle v, w\rangle|$ by assigning some small positive value to it without violating the conditions on $v$ (if $w_i = 0$, we can also assign some positive value to it without violating the conditions on $w$). Hence we can assume that all entries of $v$ and $w$ are strictly positive.

Now assume that for some $m$ the corresponding inequality for $v$ with the set $[m]$ is strict. Let us increase $v_m$ by a small enough $\epsilon > 0$ and decrease $v_{m+1}$ by $\epsilon$. This operation does not decrease $\langle v, w\rangle$:
$$(v_m + \epsilon)\,w_m + (v_{m+1} - \epsilon)\,w_{m+1} - v_m w_m - v_{m+1} w_{m+1} \;=\; \epsilon\,(w_m - w_{m+1}) \;\ge\; 0\,. \tag{E.4}$$

Moreover, if $w_m = w_{m+1}$, the inequality for $w$ with the set $[m]$ is strict, so by adding $\epsilon$ to $w_m$ and subtracting $\epsilon$ from $w_{m+1}$ we can make $w_m$ and $w_{m+1}$ different without violating the constraints on $w$ and without decreasing $\langle v, w\rangle$. Hence without loss of generality we can assume that all $v_i$ are different from each other and all $w_i$ are different from each other. Now, by Eq. (E.4), there are no strict inequalities (otherwise there would be a contradiction). Hence $v_1 = t_1 + \ln(en)$, $w_1 = t_2 + \ln(en)$, and for all $m > 1$,
$$v_m \;=\; w_m \;=\; m\ln\frac{en}{m} - (m-1)\ln\frac{en}{m-1} \;=\; m\ln(1 - 1/m) + \ln\frac{en}{m-1} \;\le\; \ln\frac{en}{m-1}\,.$$
Since $v - t_1 e_1$ satisfies the conditions of Lemma E.6,
$$|\langle v, w\rangle| \;\le\; v_1 w_1 + \|v - t_1 e_1\|^2 - \ln^2(en) \;\le\; t_1 t_2 + (t_1 + t_2)\ln(en) + 6n\,. \qquad\square$$
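As a small sanity check of Lemma E.6 (purely illustrative, not part of the proof), the following snippet builds the extremal vector identified in the proof, whose prefix sums meet Eq. (E.1) with equality, and verifies that its norm stays below $\sqrt{6n}$:

\begin{verbatim}
# Check of Lemma E.6 on the extremal vector from its proof (illustration only):
# v_1 = ln(en), v_m = m ln(en/m) - (m-1) ln(en/(m-1)) for m > 1.
import numpy as np

for n in [10, 100, 10_000]:
    m = np.arange(1, n + 1, dtype=float)
    prefix = m * np.log(np.e * n / m)                      # equality case of (E.1) for S = [m]
    v = np.diff(np.concatenate(([0.0], prefix)))
    assert np.all(v >= 0) and np.all(np.diff(v) <= 1e-9)   # nonnegative and descending
    assert np.linalg.norm(v) <= np.sqrt(6 * n)
    print(f"n={n:6d}   ||v|| = {np.linalg.norm(v):8.3f}   sqrt(6n) = {np.sqrt(6 * n):8.3f}")
\end{verbatim}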
