arXiv:2009.09304v2 [math.ST] 7 Mar 2021

Suboptimality of Constrained Least Squares and Improvements via Non-Linear Predictors

Tomas Vaškevičius¹ and Nikita Zhivotovskiy²
¹ Department of Statistics, University of Oxford, [email protected]
² Department of Mathematics, ETH Zürich, [email protected]

Abstract

We study the problem of predicting as well as the best linear predictor in a bounded Euclidean ball with respect to the squared loss. When only boundedness of the data generating distribution is assumed, we establish that the least squares estimator constrained to a bounded Euclidean ball does not attain the classical O(d/n) excess risk rate, where d is the dimension of the covariates and n is the number of samples. In particular, we construct a bounded distribution such that the constrained least squares estimator incurs an excess risk of order Ω(d^{3/2}/n), hence refuting a recent conjecture of Ohad Shamir [JMLR 2015]. In contrast, we observe that non-linear predictors can achieve the optimal rate O(d/n) with no assumptions on the distribution of the covariates. We discuss additional distributional assumptions sufficient to guarantee an O(d/n) excess risk rate for least squares. Among them are certain moment equivalence assumptions often used in the robust statistics literature. While such assumptions are central in the analysis of unbounded and heavy-tailed settings, our work indicates that in some cases they also rule out unfavorable bounded distributions.

1 Introduction

We study random design linear regression under boundedness assumptions on the data generating distribution. In this paper, we analyze the performance of least squares (or empirical risk minimization (ERM)) estimators constrained to bounded Euclidean balls. Let S_n denote a sample of n i.i.d. input-output pairs (X_i, Y_i) ∈ R^d × R sampled from some unknown distribution P. In a traditional statistical learning theory setup, the aim of a (linear) learning algorithm is to map the observed learning sample S_n to a linear predictor ⟨ω̂, ·⟩ that incurs a small excess risk R(ω̂) − R(ω*), where R(ω) = E(Y − ⟨ω, X⟩)² and the expectation is taken with respect to the random pair (X, Y), distributed according to P and independent of S_n; we always assume that Y is square integrable and that the covariance matrix Σ of X exists.

As a motivating example, consider the well-specified model Y = ⟨ω*, X⟩ + ξ, where ω* ∈ R^d and ξ is zero mean. Assuming additionally that ξ is Gaussian and that X is a zero mean multivariate Gaussian with invertible covariance matrix Σ, a basic result [BF83, Theorem 1.1] implies that the least squares estimator ω̂ (also known as the ordinary least squares estimator) satisfies

    E R(ω̂) − R(ω*) ≤ O(d/n) · R(ω*),                                   (1)

where the expectation is taken with respect to the sample S_n, the notation O(·) suppresses absolute multiplicative constants, and the optimal risk R(ω*) is equal to the variance of the noise random variable ξ. Remarkably, the bound (1) depends neither on the exact form of the covariance matrix Σ nor on the magnitude of ω*. Recent work [Mou19, Theorem 1] shows that if the model is well-specified, then for any distribution of the covariates X such that the sample covariance matrix is almost surely invertible and n ≥ d, the excess risk of unconstrained least squares is exactly equal to the minimax risk. While the above result attests to the existence of regimes where least squares is a statistically optimal estimator in a minimax sense, there is a growing interest in the statistics and machine learning communities in understanding the robustness of statistical estimators to various forms of model misspecification. For instance, the regression function E(Y | X = ·) might be non-linear, or the distribution of the noise random variable ξ might depend on the corresponding covariate X. Many authors have matched the O(d/n) rate (1) for ERM-based algorithms under significantly less restrictive assumptions than that of a well-specified model with Gaussian covariates (e.g., assuming a favorable covariance structure and sub-Gaussian noise [HKZ14], assuming Lq–L2 (for some q > 2) moment equivalence of the marginals ⟨ω, X⟩ and the noise random variable ξ [AC11, Oli16, Cat16, Mou19], or the weaker small-ball assumption [Men15, LM16]). Moment equivalence type assumptions allow for modelling heavy-tailed distributions, and, in particular, they have played a crucial role in recent developments in the robust statistics literature (e.g., [Cat16, CLL19, LM20, MZ20]); however, in some cases, such assumptions only hold with constants that can deteriorate arbitrarily with respect to the parameters of the unknown distribution P, even for light-tailed or bounded distributions. For instance, the smallest constant with respect to which the Bernoulli(p) distribution satisfies the L4–L2 moment equivalence can get arbitrarily large for small p. In the context of linear regression, the work [Cat16, a discussion following Proposition 4.8] highlights that some of the prior results on the performance of least squares relying on such assumptions can have constants that may unintentionally depend on the dimension of the covariates d. Recent literature has further accentuated this problem and witnessed an emerging interest in refining moment equivalence and small-ball assumptions [Sau18, CLL19, Men20]. While the moment equivalence assumptions allow us to study unbounded and possibly heavy-tailed problems, such assumptions might impose undesirable structural constraints on the unknown distribution P and, in some cases, result in overly optimistic bounds, as our work suggests.
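To make the Bernoulli(p) claim above explicit, here is a short supporting calculation (ours, not part of the original text). For Z ∼ Bernoulli(p),

    E Z⁴ = p  and  (E Z²)² = p²,  so  E Z⁴ = (1/p) · (E Z²)²,  i.e.  ‖Z‖_{L_4} = p^{−1/4} ‖Z‖_{L_2}.

Hence the smallest constant C in the L4–L2 moment equivalence E Z⁴ ≤ C (E Z²)² is C = 1/p, which blows up as p → 0 even though Z is bounded.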
In this work we take an alternative point of view, frequently adopted in the aggregation theory literature [Nem00, Tsy03, JRT08, Aud09, Lec13, RST17]: we impose no assumptions on the distribution P other than boundedness and aim to obtain a (possibly non-linear) predictor that performs at least as well as the best linear predictor in W_b = {ω ∈ R^d : ‖ω‖ ≤ b}, where ‖·‖ denotes the Euclidean norm. We remark that some distributional assumptions need to be made since otherwise, any algorithm that returns a linear predictor (including least squares) can incur arbitrarily large excess risk (cf. the lower bounds in [Sha15, LM16, Cat16]).

Let ω̂_b = ω̂_b(S_n) denote a proper estimator, that is, an estimator which corresponds to a linear function ⟨ω̂_b, ·⟩ for some ω̂_b ∈ W_b. Otherwise, the estimator is called improper. Fix any proper estimator ω̂_b and any constants r, m > 0. The recent work [Sha15] shows that there exists a distribution P = P(ω̂_b, r, m) satisfying ‖X‖ ≤ r almost surely and ‖Y‖_{L∞(P)} ≤ m such that the following lower bound holds for any large enough sample size n (see Section 2.4 for a more general statement):

    E R(ω̂_b) − inf_{ω∈W_b} R(ω) ≳ dm²/n + r²b²/n.                        (2)

Note that the first term in the above lower bound corresponds to the rate in the upper bound (1): the risk of the best predictor in the class W_b is upper bounded by that of the zero function, whose risk is in turn bounded by m². On the other hand, the second term in (2) shows that in the absence of simplifying distributional assumptions, the statistical performance of linear predictors can deteriorate arbitrarily with respect to the boundedness constants r, b, even in one-dimensional settings; in contrast, the upper bound (1) does not depend on b and r.

Imposing only boundedness constraints on P, we study excess risk bounds of least squares performed over the class W_b, denoted in what follows by ω̂_b^ERM. A baseline for our work is a conjecture proposed in [Sha15] postulating the statistical optimality of the constrained least squares estimator ω̂_b^ERM in the sense that it matches the lower bound (2). For some of the recent discussions and attempts to resolve this conjecture see, for example, the works [KL15, BGS16, GSS17, Wan18]. The existing results, however, only partially address this conjecture, restricting to the regimes where br ∼ m (the notation a ∼ b means a ≲ b ≲ a). Specifically, the best known guarantees that can be obtained, for instance, via localized Rademacher complexity arguments [BBM05, Kol06, LRS15] yield the following upper bound

    E R(ω̂_b^ERM) − inf_{ω∈W_b} R(ω) ≲ dm²/n + d · r²b²/n.               (3)

We note that an overlooked aspect of the work [Sha15] is that the lower bound (2), proved for proper algorithms, is matched there via the improper Vovk-Azoury-Warmuth (VAW) forecaster. Among the proper algorithms, least squares is arguably the most natural and most extensively studied candidate that could potentially match the lower bound (2) (as conjectured by Shamir). Thus, a natural reformulation of Shamir's conjecture arises:

Provided that the covariate vectors and the response variable are bounded almost surely, is the constrained least squares estimator ω̂_b^ERM optimal among all (potentially non-linear) estimators in the sense that it always matches the lower bound (2)?

We address this question by showing that there exist bounded distributions inducing a multiplicative √d gap between the excess risk achievable by the constrained least squares estimator ω̂_b^ERM and that achievable via non-linear predictors. It is important to highlight that this statistical gap holds despite performing ERM over a convex and bounded function class with respect to the squared loss, a setting considered to be favorable in the literature (see, e.g., [Kol11, Chapter 5]). In particular, the so-called Bernstein class condition (see [BM06]) is always satisfied in our setup, which is known to imply fast rates for least squares in the bounded setup whenever the underlying class is not too complex. Our work identifies a contrasting scenario: we find that the least squares algorithm is suboptimal for a convex problem and, as such, the failure of the least squares procedure cannot be attributed to a complex or non-convex structure of the underlying class. Instead, we identify the localized multiplier process as the complexity measure giving rise to unfavorable distributions; see Sections 2 and 3 for an extended discussion. We now state an informal version of our main result.

Theorem (An informal statement). There exists a distribution P satisfying ‖X‖ ≤ 1 almost surely and ‖Y‖_{L∞} ≤ 1 (i.e., r = m = 1) such that for any large enough d, b ∼ √d and large enough n the following holds:

    E R(ω̂_b^ERM) − inf_{ω∈W_b} R(ω) ≳ d^{3/2}/n ∼ √d · (dm²/n + r²b²/n).          (4)

At the same time, there exists a non-linear predictor f̂(·) such that for any boundedness constant m > 0, any distribution P (with no assumptions on the distribution of the covariates) satisfying ‖Y‖_{L∞} ≤ m, and any d, n > 0, the following holds:

    E R(f̂) − inf_{ω∈R^d} R(ω) ≲ dm²/n.                                            (5)

In particular, the lower bound (4) resolves the conjecture of Ohad Shamir on the optimality of ω̂_b^ERM in the negative. The distribution P used to prove the lower bound relies on a mixture of dense and sparse covariate vectors, with the majority of the samples having dense covariates; intuitively, the sparse covariates have high statistical leverage, which in turn forces least squares to overfit on a small subset of the observed data. The construction of such a distribution is guided by our upper bounds presented in Section 2, where we prove a refined version of the upper bound (3) that allows us to get rid of the excess factor d appearing in the second term of (3), while replacing the first term with a quantity that appears in minimax lower bounds for the well-specified model. In Section 2.4, we demonstrate that some additional assumptions, such as L4–L2 moment equivalence of the marginals ⟨ω, X⟩ and the noise, are enough to ensure that the constrained least squares estimator matches the lower bound (2). An emerging picture is that there exist scenarios (namely, when statistical leverage scores are suitably correlated with the noise, as we discuss in Section 3) under which the performance of least squares is sensitive to the constants with respect to which such assumptions are satisfied. Consequently, unfavorable distributions exist even when both the covariates and the response variables are bounded. Our lower bound (4) is proved by constructing one such distribution with the moment equivalence constants proportional to √d. Rather than viewing our lower bound as an isolated case, we note that there exists a spectrum of bounded distributions, ranging from the ones that satisfy moment equivalence conditions with absolute constants to the ill-behaved ones used to prove our theorem above.

On the positive side, our work unveils the potential statistical improvements offered by non-linear predictors. In the theorem statement above, f̂ denotes a modified Vovk-Azoury-Warmuth forecaster [Vov01, AW01]

introduced in [FW02]. In addition to the √d gap between (4) and (5), the non-linear predictor f̂ surpasses the lower bound (2) that holds for any algorithm returning a linear predictor in W_b. In particular, it removes the dependence on the boundedness constants b and r appearing in the lower bound (2), completely removing any dependence on the distribution of the covariates.

Regarding the terminology, we note that the word pairs proper and linear, as well as improper and non-linear, are used synonymously in this work. Thus, the separation between proper and improper learning is not to be confused with the separation between linear and non-linear procedures considered in other works. For example, the work [DJ98] finds a statistical gap between minimax rates achievable by procedures that are linear and non-linear in the observed response variables. In contrast, in our case, linearity/properness means that an estimator always selects a predictor of the form x ↦ ⟨x, w⟩ for some w ∈ W_b, while a non-linear/improper predictor is allowed to select predictors that are not of the above form.

Finally, we highlight that the construction used to prove our main lower bound (4) is specifically designed to simultaneously satisfy moment equivalence assumptions (on the noise and the design) with ill-behaved constants. The fact that non-linear estimators can always achieve the optimal d/n rate shows that they are insensitive to moment equivalence constants. Thus, the observed statistical gap between proper and improper estimators is of interest to the line of work investigating estimators (typically proper) robust to heavy-tailed data. Indeed, following the first appearance of the preprint of this work, this observation has motivated the design of a non-linear statistical estimator robust to heavy-tailed data that works without any assumptions on the distribution of the covariates [MVZ21].

1.1 Related Work Linear regression. Ordinary least squares and its variations (e.g., ridge regression or constrained least squares) have been studied extensively in the literature, with most of the results primarily focusing on the upper bounds. Many variations of our problem were previously considered in the literature (e.g., fixed-design regression, distributional assumptions different from boundedness or performance metrics that differ from the excess risk); see [Sha15] for a detailed comparison of different setups. For comprehensive surveys of existing work, we refer to [AC10, AC11, HKZ14, Mou19] and the books [GKKW02, Kol11, Wai19]. The lower bound (2) is due to [Sha15], and it is the tightest lower bound in the literature that covers the setting considered in our work. The best known upper bound in such a setting is of the form given in (3); the gap between (2) and (3) is currently not addressed in the literature. Many of the existing upper bounds in the literature hold with high probability. In contrast, we focus on es- tablishing suboptimality of constrained least squares and demonstrating a form of statistical separation between proper and improper estimators; thus, we concentrate on in-expectation analysis to convey our main findings without introducing additional technicalities. In the bounded regime, our upper bound for constrained least squares can be translated into high-probability results via standard arguments based on Talagrand’s concentra- tion inequality for empirical processes (see discussions in the book [Kol11]). At the same time, our upper bounds for ridge regression and Vovk-Azoury-Warmuth estimators are based on stability and online-to-batch arguments, respectively, neither of which easily generalizes to a high-probability counterpart.

(Sub)optimality of ERM. Understanding statistical guarantees pertaining to estimators based on ERM has been a subject of intense study in many contexts. Among the simplest problems where ERM is known to incur suboptimal excess risk rates is model selection aggregation [Nem00, Tsy03], where the aim is to predict as well as the best function in a given finite class of bounded functions of size M. It is well-known that the non-convex structure of finite classes inhibits the excess risk achievable by any proper estimator (including ERM over the class); in particular, proper estimators can only achieve a √(log M / n) rate as opposed to the optimal (log M)/n rate achievable by improper estimators [JRT08]. Among the similarities between the aggregation setup and our work is that optimal rates are often achieved via procedures taking their roots in the sequential prediction setup (in our case, the Vovk-Azoury-Warmuth forecaster) [Aud09]. A key difference, however, is the fact that our results establish suboptimality of ERM for a convex and bounded class, albeit with respect to the dependence on d instead of n.

4 Constrained ERM with the squared loss is also actively studied in an on-going line of work concerning the shape restricted regression literature (e.g., [Cha14, CGS15, Bel18, HWCS19]), where the least squares projection is performed over constraint sets that may be significantly more complex than Euclidean balls. In particular, when considering some expressive nonparametric classes of functions, ERM can be either optimal [HWCS19] or suboptimal [BM93, KRG20], depending on some additional properties of these classes. In contrast, our results establish suboptimality of the constrained least squares estimator for a parametric class that has a small intrinsic complexity. The work [Cha14] allows more general convex constraints and shows that ERM can be rate suboptimal. However, establishing suboptimality of ERM in our setting is more complicated: we are not free to choose an arbitrary ill-behaved convex constraint set and also, we study a random design setting and thus cannot choose a fixed set of ill-behaved covariates Xi. We additionally refer to [Bir06] for an extensive discussion on optimality and suboptimality of least squares and maximum likelihood estimators in different setups. A phenomenon separating statistical performance achievable by proper and improper estimators, related to the one observed in our work for linear regression, has recently attracted considerable attention in the logistic regression literature. Consider the setting of online logistic regression over a bounded Euclidean ball b in Rd and denote the number of rounds by t (for the background on the setup see [FKL+18]). In this case,W the cumulative regret of the online Newton step algorithm [HAK07] is of order ebd log t. The exponential dependence on b lead to a question formulated in [MS12] asking whether there exists an algorithm with logarithmic regret but polynomial dependence on the radius b of the constraint set. The work [HKL14] shows that such an algorithm does not exist in the class of proper estimators; the work [FKL+18] provides an improper algorithm that attains a cumulative regret guarantee of order d log(bt) with a doubly-exponential improvement in the dependence on b.

1.2 Summary of our Techniques and Results As mentioned above, the upper bound (3) can also be readily obtained via the classical localized Rademacher complexity arguments [BBM05, Kol06] or the more recently introduced offset Rademacher complexity [LRS15]. Crucially, the lower bound (2) and the upper bound (3) differ by a factor of d in the worst case. It appears that the suboptimal dependence on the boundedness constants br and m arises due to an application of the

Ledoux-Talagrand contraction inequality [LT13]. In particular, when Y L∞(Pr ) m, the quadratic loss is 2 k k ≤ 2(br + m)-Lipschitz on b, and the constant (br + m) propagates into the resulting upper bounds. It is well- known that in the contextW of unbounded distributions, the application of the contraction argument can yield suboptimal bounds [Men15]; our work shows that the same is true in the classical bounded setup. In order to avoid the contraction step, we base our analysis on two components. First, we reduce our proof to the analysis of the localized multiplier and quadratic processes as in [LM13, Men15]. Second, we use a version of Rudelson’s inequality [Rud99] for sums of rank-one operators due to Oliveira [Oli10] to analyze the localized quadratic process. We believe that our approach can also be used in the case of unbounded distributions: there is a version of Oliveira’s bound for unbounded matrices [KZ20] that can be used to replace the assumption X r with a strictly weaker sub-Gaussian tail assumption on the norms X ; this could be seen as a step ktowardsk ≤ incorporating unbounded distributions within our framework while notk relyingk on moment equivalence assumptions discussed above. Using the notion of average stability, we prove a tight excess risk upper bound for the ridge regression estimator (see [SSBD14, KL15, GSS17] for a detailed account of stability in our context). The novel ingredient in our proof is the exploitation of the curvature of the squared loss in the stability-fitting trade-off. As a result, we show that the ridge estimator does not suffer from an excess factor log(min n, d ) that appears in an upper bound on the localized quadratic process. Moreover, we demonstrate that this logarithmic{ } term is unavoidable for least squares in some regimes, thus showing an interesting performance gap between constrained and penalized least squares. Finally, the proof of our main lower bound, presented in Section 3, relies on a combination of some delicate exact computations and multiple applications of the matrix Chernoff and Bernstein inequalities [Tro15]. Tech- nical difficulties aside, the main challenge in proving our lower bound is constructing the example distribution used to establish that constrained least squares is suboptimal. In addition to the restrictions imposed by the boundedness constraints, we discuss other distributional assumptions sufficient to ensure that constrained least

5 squares matches the rate (2) (see Section 2.4), thus making our main lower bound impossible. The intuition behind our construction is rooted in the form taken by our upper bounds. In particular, we construct a distribu- tion tailored to make the localized multiplier process ill-behaved by simultaneously violating moment equivalence assumptions on the noise and statistical leverage score distributions. See Section 3 for more details. Below is a summary of our main contributions. 1. In Theorem 2.1, we prove a tight upper bound on the expected excess risk of any constrained least squares ERM estimator ωb . We demonstrate that the localized multiplier process is equal to the correlation between the squared noise and the statistical leverage scores of the covariates, a quantity that appears in minimax lower boundsb in the well-specified case with independent noise [Mou19]. In Proposition 2.3, we construct ERM a distribution for which there exists a least squares solution ωb whose excess risk is lower-bounded by the localized quadratic process. Hence, both terms appearing in our upper bound are not improvable in general. b 2. In Theorem 2.2, we prove a tight excess risk upper bound for the ridge regression estimator. We recover the excess risk upper bound proved for constrained least squares, with the localized quadratic term replaced by a bias term that yields a logarithmic improvement. 3. Section 2.4 is dedicated to corollaries of our upper bounds. We show that under some assumptions frequently considered in the literature, the localized multiplier process is, up to log factors, upper-bounded by a term of order dm2/n. Consequently, in such regimes, the constrained least squares and the ridge regression estimators match the rate of the lower bound (2) up to logarithmic factors. 4. In Theorem 3.1, we construct a distribution P with r = 1 and m = 1 such that for b √d, any constrained ERM ∼ least squares estimator ωb incurs an excess risk that is larger by a √d factor than the lower bound (2) proved in [Sha15]. In particular, we refute the conjecture of Ohad Shamir on the optimality of the constrained least squaresb estimator. 5. The lower bound (2) that only holds for proper linear predictors was matched in [Sha15] via the non-linear Vovk-Azoury-Warmuth (VAW) forecaster without tuning the regularization parameter. We observe that once the regularization parameter is tuned, the VAW forecaster yields an exponential improvement on the parameters b and r in the lower bound (2); in particular, our observation demonstrates a fundamental gap between the performance achievable by proper and improper estimators. In addition, we discuss a modified version of the VAW forecaster due to [FW02] that completely removes the dependence on the boundedness constants b and r in its excess risk upper bound. We present the proofs of the main results in Section 4 with some of the details delegated to the appendix.

1.3 Notation

The subscript r in P_r denotes that the distribution P_r of the random pair (X, Y) satisfies ‖X‖ ≤ r almost surely. The boundedness constant m is an upper bound on the L∞ norm of the response variable Y. We denote Euclidean balls in R^d with radius b by W_b = {ω ∈ R^d : ‖ω‖ ≤ b}. Let ω̂_b^ERM denote any ERM over W_b, that is, any solution to

    ω̂_b^ERM ∈ argmin_{ω∈W_b} Σ_{i=1}^n (Y_i − ⟨ω, X_i⟩)².                       (6)

For any λ ≥ 0, denote the regularized sample second moment matrix by

    Σ̂_λ = (1/n) ( λ I_d + Σ_{k=1}^{n−1} X_k X_k^T + X X^T ),                     (7)

where we write X instead of X_n in order to simplify the notation in our main results, and correspondingly, we write Y and ξ instead of Y_n and ξ_n. Given a regularization parameter λ > 0, the ridge estimator is defined as

    ω̂_λ = argmin_{ω∈R^d} { Σ_{i=1}^n (Y_i − ⟨ω, X_i⟩)² + λ‖ω‖² } = (nΣ̂_λ)^{−1} ( Σ_{i=1}^n Y_i X_i ).     (8)

Further, let ω*_b denote any solution minimizing the population risk R(·) over W_b and let ξ denote the noise variable:

    ω*_b ∈ argmin_{ω∈W_b} R(ω)   and   ξ = ξ_b(X, Y) = Y − ⟨ω*_b, X⟩.                                      (9)

We denote positive numerical constants by c, c_1, ... and note that their values may change from line to line; a ≲ b denotes the existence of a numerical constant c such that a ≤ cb; a ∼ b is a shorthand for b ≲ a ≲ b. The notation ‖·‖ denotes the Euclidean norm for vectors and the operator norm for matrices. For any p ∈ [1, ∞], ‖·‖_{L_p} denotes the L_p(P) norm, where the distribution P will always be clear from the context. With a slight abuse of notation, for any ω ∈ R^d, we let ‖ω‖²_{L_2} = ‖⟨ω, X⟩‖²_{L_2} = E ⟨ω, X⟩². The d × d identity matrix is denoted by I_d and Diag(a_1, ..., a_d) denotes the diagonal matrix formed by a_1, ..., a_d. Finally, the indicator function of an event E is denoted by 1_E.
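The following minimal Python sketch (ours, not part of the paper) illustrates the estimators defined in (6) and (8); the function names are ours, and the ball-constrained solver uses the standard KKT/bisection characterization of Euclidean-ball-constrained least squares rather than any method prescribed by the paper.

import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator (8) with lam > 0: (lam*I + sum_i X_i X_i^T)^{-1} sum_i Y_i X_i."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ Y)

def constrained_ls(X, Y, b, tol=1e-10):
    """A constrained least squares solution (6) over the Euclidean ball of radius b.

    Uses the KKT characterization of the ball-constrained problem: the solution is a
    ridge solution (lam*I + X^T X)^{-1} X^T Y for some lam >= 0, where lam = 0 if the
    constraint is inactive and otherwise lam is chosen so that the norm equals b.
    """
    # Minimum-norm unconstrained least squares solution.
    w0 = np.linalg.pinv(X) @ Y
    if np.linalg.norm(w0) <= b:
        return w0
    # Otherwise bisect on lam: the norm of the ridge solution decreases in lam.
    lo, hi = 0.0, 1.0
    while np.linalg.norm(ridge(X, Y, hi)) > b:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(ridge(X, Y, mid)) > b:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return ridge(X, Y, hi)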

2 Upper Bounds

In this section, we provide two upper bounds: the first is for constrained least squares and the second is for the ridge estimator. These bounds will later motivate our construction separating the performance of least squares and non-linear predictors under the boundedness assumption.

2.1 Performance of Constrained Least Squares

Our first theorem is an upper bound on the excess risk of any constrained least squares estimator ω̂_b^ERM. The proof is deferred to Section 4.1. We remark that the upper bound below is non-asymptotic, in contrast to the agnostic upper bound proved in [AC11, Theorem 2.1]. Also, we make no restrictions on the sample size n and the uniqueness of least squares, as opposed to results that hold for the unconstrained least squares estimator (e.g., [Mou19, Proposition 1]).

Theorem 2.1. For any n, d, b, r > 0 and any distribution P_r satisfying E Y² < ∞ it holds that

    E R(ω̂_b^ERM) − R(ω*_b) ≲ inf_{λ>0} { E ξ² X^T Σ̂_{λr²}^{−1} X / n + λr²b²/n } + r²b² log(min{n, d}) / n,     (10)

where ω̂_b^ERM, Σ̂_{λr²} and ω*_b, ξ are defined in (6), (7) and (9) respectively.

We comment on the structure of the above bound. Assume for the sake of presentation that Σ̂_0 is invertible. Then, with the choice λ = 0, we may rewrite the above upper bound as follows:

    E R(ω̂_b^ERM) − R(ω*_b) ≲ E ξ² X^T Σ̂_0^{−1} X / n  +  r²b² log(min{n, d}) / n,     (11)

where the first term quantifies the interaction with the noise and the second term quantifies the low-noise complexity. The first term, which arises from the supremum of the localized multiplier process, shows the correlation between the noise ξ and the statistical leverage score X^T Σ̂_0^{−1} X (let 𝐗 ∈ R^{n×d} denote the matrix with the i-th row equal to X_i; we may write X^T Σ̂_0^{−1} X / n = H_{nn}, where H ∈ R^{n×n} is the "hat matrix" defined as H = 𝐗(𝐗^T 𝐗)^{−1} 𝐗^T). If d ≤ n and the noise random variable ξ is independent of X, then the first term in (11) corresponds essentially to the minimax optimal rate for unconstrained least squares regression [Mou19, Theorem 2] and is hence unimprovable in general. The second term in (11), which arises from the supremum of the localized quadratic process, intuitively captures the problem complexity in low-noise regimes, that is, when ξ is relatively small. In Proposition 2.3, we demonstrate a noiseless problem such that for some constrained least squares solutions the second term in (11) is tight.
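As a concrete illustration of the quantities in (11) (this Monte Carlo sketch is ours, not part of the paper), the following Python snippet computes the leverage scores H_ii and an empirical proxy for the multiplier term on simulated well-specified Gaussian data, where it concentrates around dσ²/n:

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500, 20, 1.0

X = rng.standard_normal((n, d))            # covariates
w_star = rng.standard_normal(d)
xi = sigma * rng.standard_normal(n)        # independent noise
Y = X @ w_star + xi

# Leverage scores: H_ii = X_i^T (X^T X)^{-1} X_i = X_i^T Sigma_0^{-1} X_i / n.
G_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, G_inv, X)

# Empirical proxy for the multiplier term E xi^2 X^T Sigma_0^{-1} X / n = E xi^2 H_nn.
multiplier_proxy = np.mean(xi ** 2 * leverage)
print("multiplier proxy:", multiplier_proxy, "  reference d*sigma^2/n:", d * sigma ** 2 / n)
print("sum of leverage scores (equals d):", leverage.sum())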

2.2 Performance of the Ridge Regression Estimator

We now turn to our second result, which provides an excess risk upper bound for the ridge regression estimator. The proof is deferred to Appendix A.

Theorem 2.2. For any n, d, b, r > 0, any distribution P_r satisfying E Y² < ∞, and any choice of the regularization parameter λ ≳ r², it holds that

    E R(ω̂_λ) − R(ω*_b) ≲ E ξ² X^T Σ̂_λ^{−1} X / n + λb²/n,

where ω̂_λ, Σ̂_λ and ω*_b, ξ are defined in (8), (7) and (9) respectively.

We remark that one may not choose an arbitrarily small value of λ and hence the above theorem does not directly imply the result of Theorem 2.1. Also, note that the empirical risk functional is not normalized in our work, and hence the above choice of λ corresponds to the regularization parameter scaling as r²/n for normalized empirical risks considered in some other works.

2.3 Discussion on the Optimality of Theorems 2.1 and 2.2

At first sight, the upper bounds presented in Theorems 2.1 and 2.2 look similar; however, there are several important differences that we emphasize below:

• The ridge estimator ω̂_λ does not necessarily belong to the set W_b and hence it is an improper estimator, in contrast to the least squares estimator ω̂_b^ERM. In addition, a single ridge regression estimator ω̂_λ provides a family of upper bounds, one for each choice of b, whereas the least squares estimator ω̂_b^ERM itself depends on the choice of b.

• The parameter λ in Theorem 2.1 is used to optimize the trade-off between the two terms and does not affect the estimator itself. In Section 2.4, we demonstrate how the flexibility to optimize λ in the upper bound in Theorem 2.1 allows us to match the "slow rate" term in a more general version of the lower bound (2). It is not immediately evident whether the same is true for the ridge estimator ω̂_λ in view of Theorem 2.2.

• Theorem 2.1 contains an extra factor log(min{n, d}) that is not present in Theorem 2.2. In Proposition 2.3 below, we show that this logarithmic factor is inherent for constrained least squares.

• The analysis of the constrained least squares estimator is based on empirical process theory and concentration inequalities for random matrices; the analysis of the ridge estimator is based on an average stability argument.

The next result, proved in Appendix B, shows that the extra logarithmic factor that appears in Theorem 2.1 but not in Theorem 2.2 is unavoidable. Our proof technique is based on an instance of the coupon collector problem, a common tool for establishing that some logarithmic factors are unimprovable in the noise-free binary classification problem (see, e.g., [BHMZ20] and the references therein). We remark that the lower bound below holds for some ERM, yet there might exist other ERMs which may violate it.

Proposition 2.3. For any large enough sample size n, any d ≥ n, and any r, b > 0, there exists a distribution P_r = P_r(n, d, b) with ξ = 0 such that the following lower bound holds for some constrained least squares estimator:

    E R(ω̂_b^ERM) − R(ω*_b) ≳ r²b² log n / n.

We now discuss some closely related lower bounds indicating that Theorem 2.1 and Theorem 2.2 cannot be improved in a certain sense. First, [Mou19, Theorem 2] shows that if d ≤ n, then for any distribution of the covariates X such that the sample covariance matrix is invertible almost surely and any linear predictor ω̃, there is a joint distribution (X, Y) with independent zero mean Gaussian noise ξ such that the following holds:

    E R(ω̃) − R(ω*_∞) ≥ E ξ² · E [ X^T Σ̂_0^{−1} X / (n − X^T Σ̂_0^{−1} X) ] = E [ ξ² X^T Σ̂_0^{−1} X / (n − X^T Σ̂_0^{−1} X) ],     (12)

where ω*_∞ minimizes the risk among all vectors in R^d. This term is an exact analog of the first term in the upper bound (10) whenever (1/n) X^T Σ̂_0^{−1} X is separated from 1 and λ = 0. Second, it is shown in [Sha15, Theorem 3] that for d = 1, any m, b, r satisfying br ≥ 2m, and any estimator ω̃ taking its value in W_b, there exists a distribution P_r such that ‖Y‖_{L∞} ≤ m and

    E R(ω̃) − R(ω*_b) ≳ min{ m², r²b²/n }.     (13)

Proposition 2.3 and the lower bounds (12), (13) indicate the existence of regimes such that none of the terms appearing in Theorems 2.1 and 2.2 can be improved in general; the full picture is, however, more subtle. In particular, for constrained least squares ω̂_b^ERM (i.e., b < ∞) we obtain a non-trivial upper bound even in the regimes when the leverage scores (1/n) X^T Σ̂_0^{−1} X are close to 1. On the other hand, the global bound (i.e., b = ∞) stated in (12) can deteriorate if the leverage scores are close to 1. We also remark that the construction of distributions used to prove the lower bound (13) relies on non-zero noise problems, in contrast to the construction of the distribution used to prove Proposition 2.3 in our work. Therefore, the term r²b²/n that appears in the lower bound (2) is not directly related to the problem complexity in the low-noise regimes, as opposed to the second term in Theorem 2.1.

2.4 Bounds on the Multiplier Term The upper bounds presented in Theorems 2.1 and 2.2 hold assuming boundedness of the covariates X r, and square integrability of the labels E Y 2 < . In this section, we turn to the bounded setting whenk in additionk≤ it ∞ 2 2 2 2 holds that Y L∞ m . Note that the terms r b log(min n, d )/n and r b /n appearing in Theorems 2.1 and 2.2 respectivelyk k match≤ the corresponding term (up to the logarithmic{ } factor) that appears in the lower bound (2). As a result, the constrained least squares and the ridge estimators can only exhibit suboptimal behaviour when the E 2 T 1 2 multiplier term ξ X Σλ− X/n is significantly larger than dm /n (cf. the discussion following Theorem 2.1). In this section, we discuss several distributional assumptions that in addition to boundedness ensure a well-behaved multiplier term. Intuitively,b all the assumptions considered below introduce a form of independence between the noise variable and the statistical leverage scores. In Section 3, we demonstrate that once such assumptions are violated, the multiplier term can be larger than dm2/n by a multiplicative √d factor, despite the restriction to a family of bounded distributions. The main result of the current section is Proposition 2.8, which shows that the constrained least squares and the ridge estimators match the lower bound (2) under L4–L2 moment equivalence assumptions specified below. We remark, however, that tighter lower bounds than that of (2) might be possible under some of the assumptions considered below. In particular, it is well-known that assumptions closely related to moment equivalence considered below (e.g., small-ball [Men15]) might simplify the quadratic process, which gives rise to the r2d2 log(min n, d )/n term in Theorem 2.1. However, as discussed above, the multiplier term is responsible for the suboptimality{ } of constrained least squares and in this section we are mainly trying to understand the conditions sufficient to ensure a well-behaved multiplier term. Before discussing the upper bounds on the multiplier term, let us briefly review a more general version of the lower bound (2) that holds without any restrictions on the sample size n:

    E R(ω̃) − inf_{ω∈W_b} R(ω) ≳ min{ m², min{ dm²/n, rbm/√n } + r²b²/n },     (14)

where ω̃ is any linear predictor in the set W_b.¹ First, observe that the term m² is matched by the zero predictor, which corresponds to ω̃ = 0. In Proposition 2.3 presented in the previous section, we demonstrate a distribution with m = 0 (i.e., Y = 0 almost surely) under which a provably non-zero lower bound holds for some constrained least squares estimator. However, this is not the primary reason for the suboptimality that we establish in Theorem 3.1 with respect to the above lower bound, and hence we ignore the term m² in what follows.

¹ The statement of Theorem 1 in [Sha15] allows for ω̃ ∈ R^d instead of ω̃ ∈ W_b. However, Lemma 2 in [Sha15], upon which the proof of the lower bound is built, requires that ω̃ ∈ W_b. We formulate the bound in this weaker form.

Second, for constrained least squares, the "slow rate" term rbm/√n + r²b²/n is matched by optimizing the first term in Theorem 2.1 with the choice λ = √(R(ω*_b) n) / (rb). To see that, note that Σ̂_{λr²}^{−1} ⪯ n(λr²)^{−1} I_d and hence

    inf_{λ>0} { E ξ² X^T Σ̂_{λr²}^{−1} X / n + λr²b²/n } ≤ inf_{λ>0} { E ξ² X^T X / (λr²) + λr²b²/n }
                                                        ≤ inf_{λ>0} { E ξ²/λ + λr²b²/n } ≤ 2 √( R(ω*_b) r²b² / n ),

where the last line follows by noting that E ξ² = R(ω*_b) and plugging in the choice of λ defined above (minimizing a/λ + λc over λ > 0 gives the value 2√(ac) at λ = √(a/c)). Finally, since 0 ∈ W_b, we have

    R(ω*_b) ≤ R(0) ≤ m²,

and the result follows.

Since the above discussion establishes that constrained least squares does not match the m² term in (14) but matches the slow rate term, in what follows we focus on the fast rate term dm²/n + r²b²/n, that is, the lower bound stated in (2). We demonstrate that the constrained least squares and the ridge regression estimators match this lower bound up to logarithmic factors under several assumptions widely considered in the literature. The key observation is that E ξ² = R(ω*_b) ≤ m² and, for any λ > 0, E X^T Σ̂_{λr²}^{−1} X = E Tr(Σ̂_{λr²}^{−1} Σ̂_0) ≤ d. In particular, any independence-like assumption that allows us to "split" the noise and the leverage scores in the localized multiplier term, perhaps at the price of extra logarithmic factors, establishes a form of optimality of the constrained least squares and the ridge estimators with respect to the lower bound (2).

We begin with the simplest example, which matches the lower bound (2) under the assumption that the noise random variables ξ_i are independent of the covariates X_1, ..., X_n. Note that such an assumption is weaker than assuming that the model is well-specified, since we do not assume that the ξ_i are zero mean. To simplify the notation we write λ instead of λr² in what follows.

Example 2.4. Assume that the noise variable ξ is independent of X1,...,Xn. Then

    E ξ² X^T Σ̂_λ^{−1} X / n = (E ξ²) · E [ X^T Σ̂_λ^{−1} X ] / n = R(ω*_b) E Tr(Σ̂_λ^{−1} Σ̂_0) / n ≤ R(ω*_b) Tr((Σ + λI_d/n)^{−1} Σ) / n,

where the last step follows by Jensen's inequality and the fact that A ↦ Tr((A + λI_d/n)^{−1} A) is a concave map for A ⪰ 0. Note that the quantity Tr((Σ + λI_d/n)^{−1} Σ), known as the effective dimension (cf. [HKZ14]), is never larger than d.

Our second example shows how to upper bound the multiplier term given an L∞ bound on the noise variable ξ. Among the prior work that proves upper bounds under such an assumption, see [AC11, Theorem 2.1] and [Mou19, Equation (7)]. For a closely related assumption see [BM13, Assumption A6], which is also imposed in order to decouple the noise variables from the statistical leverage scores in the spirit of moment equivalence assumptions.

Example 2.5. Let Σ = E XX^T. Then, for any λ > 0 we have

    E ξ² X^T Σ̂_λ^{−1} X / n ≤ ‖ξ‖²_{L∞} · E [ X^T Σ̂_λ^{−1} X ] / n ≤ ‖ξ‖²_{L∞} Tr((Σ + λI_d/n)^{−1} Σ) / n,

where the last step follows by Jensen's inequality (cf. Example 2.4). We now turn to an example that requires a less restrictive control on the noise variables ξ.

10 Example 2.6. Let ξ1,...,ξn denote independent copies of ξ. Then, for any λ> 0 we have

    E ξ² X^T Σ̂_λ^{−1} X / n ≤ d ‖max_i ξ_i‖²_{L_2} / n.     (15)

The above inequality follows by noting that

2 T 1 2 n T 1 2 1 E ξ X Σ− X E maxi ξ X Σ− Xi E maxi ξ Tr(Σ− (nΣ0)) λ i · i=1 i λ = i · λ n ≤ n2 n2  P  b b b b 1 and using the fact that Tr(Σ− (nΣ )) nd. λ 0 ≤ A sub-Gaussian norm of a random variable Z is defined as (see e.g., Definition 2.5.6 in [Ver16]) b b Z = inf c> 0 : E exp(Z2/c2) 2 . k kψ2 ≤ Below, we show how an assumption that the noise ξ is well-behaved yields to a simplification of the upper bound stated in the above example. We emphasize that the above assumption does not impose any restrictions on the covariates, other than boundedness assumption used throughout this paper. Example 2.7. Suppose that the noise random variable ξ satisfies the sub-Gaussian assumption of the form 2 ξ ψ2 . ξ L2 . Then, by standard sub-Gaussian maximum inequalities (e.g., [LT13]), we have maxi ξi L2 . k k2 k k k k ξ log n = R(ω∗) log n and hence the upper bound (15) simplifies to k kL2 2 T 1 E ξ X Σ− X dR(ω ) log n λ . b∗ . n n b Our final example weakens the above assumption on the distribution of the noise ξ but requires an L4–L2 moment equivalence for the marginals ω,X . For related work proving excess risk bounds under similar moment equivalence assumptions see [LM20, Assumptionh i 2.1], [Oli16, Theorem 1.2] and [Mou19, Assumptions 2 and 3]. The proof of the below proposition is based on controlling the lower tail of random quadratic forms using a result in [Oli16]. See Appendix C for details. Proposition 2.8. Suppose that Σ= E XXT is of full rank and assume that the following holds:

    ‖ξ‖_{L_4} ≲ ‖ξ‖_{L_2}   and   for all ω ∈ R^d we have E ⟨ω, X⟩⁴ ≲ ( E ⟨ω, X⟩² )².

Then, for any λ > 0 and any n ≳ d, it holds that

    E ξ² X^T Σ̂_λ^{−1} X / n ≲ dm²/n + r²b²/n.

As demonstrated in this section, the upper bound of order dm²/n + r²b²/n is achievable by the least squares and the ridge regression estimators under various distributional assumptions frequently considered in the literature. It is important to emphasize, however, that in Example 2.7 and Proposition 2.8 we used the most favorable versions of moment equivalence assumptions, where the corresponding norms are linked via absolute constants. As we mentioned, in some cases these constants may depend on the dimension of the problem, leading to suboptimal results. Indeed, once such assumptions are violated, we prove in Theorem 3.1, presented in the next section, that the optimistic rate dm²/n + r²b²/n is not always achievable. In contrast, the above rate can be achieved and surpassed by improper estimators (see Section 3.1).
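The effective dimension Tr((Σ + λI_d/n)^{−1} Σ) used throughout this section is straightforward to compute numerically; the small Python helper below (ours, not part of the paper) illustrates that it is capped at d and shrinks as λ grows, which is what makes the freedom to choose λ in Theorem 2.1 useful.

import numpy as np

def effective_dimension(Sigma, lam, n):
    """Tr((Sigma + lam*I/n)^{-1} Sigma): at most d, decreasing in lam."""
    d = Sigma.shape[0]
    return np.trace(np.linalg.solve(Sigma + lam * np.eye(d) / n, Sigma))

# Example: a covariance matrix with fast spectral decay.
d, n = 50, 1000
Sigma = np.diag(1.0 / np.arange(1, d + 1) ** 2)
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, effective_dimension(Sigma, lam, n))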

3 Main Results

In this section, we present our main result: a construction of a bounded distribution under which constrained least squares exceeds the lower bound (2) by a factor proportional to √d. The regime considered in our lower bound is essential to establishing a separation between the performance of constrained least squares and non-linear estimators in our bounded setup. As discussed in Section 2.4, various distributional assumptions ensure that the gap between the performance of constrained least squares and that of the lower bound (2) is at most logarithmic. As a result, we need to construct a bounded distribution that violates all the assumptions considered in Section 2.4. If the noise variables and the leverage scores satisfy the assumptions that allow us to split them apart in the multiplier term, then we obtain an upper bound that matches the lower bound (2). Since Bernoulli random variables with a small parameter p satisfy the L4–L2 moment equivalence with an ill-behaved constant 1/p, we aim to construct a distribution such that the noise random variables and the leverage scores both approximately follow Bernoulli distributions with a small parameter that depends on the dimension d. Besides, the noise variables and the leverage scores need to be highly correlated; otherwise, the multiplier term would be too small. We remark that our construction could be considered somewhat extreme only with respect to the constants appearing in the moment equivalence assumptions discussed in Section 2.4. At the same time, our distribution is bounded with the favorable choice of constants, the Bernstein class assumption is satisfied, and therefore, by the upper bound (3), the constrained least squares estimator satisfies a non-trivial fast rate excess risk guarantee, making the construction of our main lower bound more challenging.

Let us now present our construction. For simplicity, we assume that √d is an integer in what follows. Let 𝟙 denote the all-ones vector. For a support set S ⊆ {1, ..., d}, let 𝟙_S denote the vector such that (𝟙_S)_i = 1 if i ∈ S and 0 otherwise. Let S_√d = {S ⊆ {1, ..., d} : |S| = √d}. We consider the following distribution:

    (X, Y) = (d^{−1} 𝟙, 1)        with probability 1 − d^{−1/2},
             (d^{−1/4} 𝟙_S, 0)    with probability d^{−1/2}, where S ∼ Uniform(S_√d).     (16)

A simple calculation shows that ω*_∞ ≈ (1/2) 𝟙 and hence for b ≳ √d we have ω*_b = ω*_∞. In particular, ξ_i² is smaller than 1 for the "high probability" points (X_i, Y_i) = (d^{−1} 𝟙, 1), while ξ_i² ≈ √d for the "low probability" points (X_i, Y_i) = (d^{−1/4} 𝟙_S, 0). This establishes that ξ_i² behaves as a Bernoulli random variable. Similarly, since all the "high probability" points are exactly the same, they essentially have zero leverage. On the other hand, the "low probability" points all have high leverage, thus the leverage scores also approximately follow a Bernoulli distribution. Finally, since ξ_i² is large exactly for the high leverage points, the squared noise random variables are correlated with the leverage scores. Intuitively, the multiplier term (i.e., the first term in Theorems 2.1 and 2.2) scales as d^{3/2}/n under the distribution (16), while the lower bound (2) scales only as d/n provided that b ∼ √d. The main result of our paper is presented below. The proof is deferred to Section 4.2.

Theorem 3.1. Suppose that the distribution P of (X, Y) is given by (16). Then, for any constrained least squares estimator ω̂_b^ERM defined by (6), the following lower bound holds, provided that d is large enough, b ∼ √d and n ≳ d³ log d:

    E R(ω̂_b^ERM) − R(ω*_b) ≳ d^{3/2}/n.

For the distribution (16) we have r = 1 and m = 1; hence, the lower bound (2) scales only as d/n.

We now comment on the above result. Recall that the aim of the construction (16) is to maximize the multiplier term under boundedness constraints on the underlying distribution. In view of the lower bound (2), the parameters m, r, b are chosen in the most relevant way in the sense explained below. First, because of homogeneity, we may always set m = 1. Second, the choice b ∼ √d is natural for d-dimensional vectors, particularly for the underlying parameter ω*_b. Finally, the scaling dm² ∼ r²b² ∼ d equalizes the two terms in the lower bound (2) and, according to the results in Section 3.1, leaves open the possibility that in such regimes improper estimators offer no statistical improvements. Indeed, the best upper bound for non-linear estimators

scales as dm²/n. It also follows from the proof of Theorem 3.1 that under the distribution (16), the constrained least squares estimator ω̂_b^ERM coincides with the global least squares solution, so that any larger value of b can be chosen in Theorem 3.1 without changing the statement. More importantly, our construction can be extended to a family of bounded distributions with the lower bound scaling as d^{1+α}/n for any α ∈ [0, 1/2], while improper estimators (as shown in Section 3.1) can still achieve the optimal d/n rate. In all these cases, there is still a gap between the lower bound (2) and the performance of constrained least squares. To simplify the presentation, we focus only on one particular distribution (16), which maximizes the exhibited performance gap.
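To illustrate the mechanism behind the construction (16), the following simulation sketch (ours, not from the paper; the value ω*_∞ ≈ (1/2)𝟙 is taken from the discussion above and used only as an approximation) samples from (16) and shows that the rare sparse points carry both large squared noise and large leverage, while the dense points carry essentially none.

import numpy as np

rng = np.random.default_rng(1)
d = 64                      # sqrt(d) must be an integer
sqrt_d = int(np.sqrt(d))
n = 20000

def sample(n_samples):
    """Draw n_samples i.i.d. pairs from the distribution (16)."""
    X = np.empty((n_samples, d))
    Y = np.empty(n_samples)
    sparse = rng.random(n_samples) < d ** -0.5
    for i in range(n_samples):
        if sparse[i]:
            S = rng.choice(d, size=sqrt_d, replace=False)
            x = np.zeros(d)
            x[S] = d ** -0.25
            X[i], Y[i] = x, 0.0
        else:
            X[i], Y[i] = np.full(d, 1.0 / d), 1.0
    return X, Y, sparse

X, Y, sparse = sample(n)
H_diag = np.einsum("ij,jk,ik->i", X, np.linalg.pinv(X.T @ X), X)   # leverage scores
w_star_approx = 0.5 * np.ones(d)                                   # approximation of omega*_infty
xi_sq = (Y - X @ w_star_approx) ** 2

print("mean leverage, sparse points :", H_diag[sparse].mean())
print("mean leverage, dense points  :", H_diag[~sparse].mean())
print("mean squared noise, sparse   :", xi_sq[sparse].mean())
print("mean squared noise, dense    :", xi_sq[~sparse].mean())
# Empirical proxy for the multiplier term; the rare sparse points dominate it.
print("proxy for E[xi^2 * H_nn]     :", np.mean(xi_sq * H_diag), " reference d*m^2/n:", d / n)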

3.1 Improvements via Non-Linear Predictors

In this section, we observe that non-linear predictors can surpass the lower bound (2) that holds for linear predictors in W_b. In particular, via known results in the literature, we first demonstrate that the Vovk-Azoury-Warmuth (VAW) forecaster yields an exponential improvement on the boundedness constants b and r compared to the lower bound (2). We remark that in our setup, the VAW forecaster was previously used to match the lower bound (2) in [Sha15]; the difference in the result below is that we tune the regularization parameter and use the resulting upper bound to demonstrate a statistical separation between proper and improper algorithms, rather than matching a lower bound that holds for proper algorithms. Finally, we discuss a less known modification of the VAW forecaster due to Forster and Warmuth [FW02] that completely removes the dependence on the boundedness constants b and r and removes any assumptions on the distribution of the covariates.

The VAW forecaster is defined as follows. Given a (random) sample S_n, a regularization parameter λ > 0, and any point X ∈ R^d we first compute

    ω̂_{λ,n}(X) = argmin_{ω∈R^d} { Σ_{i=1}^n (Y_i − ⟨ω, X_i⟩)² + λ‖ω‖² + ⟨ω, X⟩² }

and then output a prediction

    f̂_λ^VAW(X) = ⟨ω̂_{λ,n}(X), X⟩.

Thus, in order to make a prediction, a new linear predictor ⟨ω̂_{λ,n}(X), ·⟩ is computed for every point X and, in particular, the VAW forecaster is non-linear. For background on the VAW forecaster and regret bounds we refer to [Vov98, CBL06, Ora19]. Below, for any predictor f(·) we denote R(f(X)) = E (Y − f(X))². Our key observation is that the sequence of weights ω̂_{λ*,j}(X) for j = 1, ..., n and λ* = dm²/b² can be immediately translated into a non-linear estimator f̃_{λ*}^VAW(·) satisfying, in the notation of the lower bound (2):

    E R(f̃_{λ*}^VAW(X)) − R(ω*_b) ≲ (dm²/n) log( 1 + r²b²n/(d²m²) ).     (17)

The proof of this fact follows from the regret bound for the VAW forecaster (see the survey [Ora19, Theorem 7.25]) and the standard online-to-batch conversion. The above bound yields an exponential improvement on the boundedness constants b and r compared to the lower bound (2).

More importantly, there exists a modification of the VAW forecaster due to [FW02] that can remove the logarithmic factor in (17).² Let us introduce the modified VAW forecaster. Given a sample S_n and any X ∈ R^d, let h_X = X^T ( Σ_{i=1}^n X_i X_i^T + X X^T )† X denote the leverage of the point X with respect to the covariates X_1, ..., X_n, X, where the notation A† denotes the Moore-Penrose inverse of a matrix A. The modified VAW predictor f̂ is then defined pointwise as follows:

    f̂(X) = (1 − h_X) f̂_0^VAW(X).     (18)

Thus, the above function outputs the predictions of the VAW forecaster (with λ = 0, where the VAW predictions are computed by taking a Moore-Penrose inverse of the sample covariance matrix), albeit reweighted by the factor (1 − h_X). Intuitively, the above predictor avoids making large errors for high leverage points. The following theorem describes the main property of this estimator.

² We are thankful to Manfred Warmuth for pointing us to the modified VAW forecaster.

Theorem (Theorem 6.2 in [FW02]). Let f̂(·) denote the non-linear predictor defined in (18). Let P be any distribution (with possibly unbounded covariates) satisfying ‖Y‖_{L∞} ≤ m. Then, for any d, n > 0, the following holds:

    E R(f̂(X)) − inf_{ω∈R^d} R(ω) ≲ dm²/n.     (19)

For the reader's convenience, we reproduce the proof of the above result in Appendix E and compare some of the steps with the proof of Theorem 2.2. Observe that the bound (19) is closely related to the upper bound (1) that holds for unconstrained least squares in the well-specified setup with Gaussian design: both bounds do not depend on the magnitude of the covariates, specific properties of the covariance structure, or the norm of the optimal linear predictor ω* = arg inf_{ω∈R^d} R(ω). The difference is that R(ω*) is replaced by m². It is reported in [FW02] that the authors could not prove a bound similar to (19) for the least squares estimator. Given the lower bound (2) and Proposition 2.3, this is not surprising: in our setting, the performance of least squares is affected by the boundedness constants r, b, which can be arbitrarily bad in the theorem above.

Finally, note that once the pseudoinverse of the sample covariance matrix is computed, pointwise evaluation of the non-linear estimator (18) can be done in O(d²) operations. In contrast, only O(d) operations are needed to evaluate a linear function. It is unknown whether a more computationally efficient algorithm that matches the upper bounds (17) and (19) exists and, more broadly, whether there exist inherent statistical-computational trade-offs needed to attain the optimal rate in the distribution-free setting. The search for computationally efficient improper algorithms in a related phenomenon observed for logistic regression (cf. Section 1.1) is currently an active line of research [FKL+18, MG19, JGR20].
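A direct Python transcription of the modified VAW predictor (18), as we read it, is given below (ours, not from the paper; function names are ours and the Gaussian usage example is purely illustrative).

import numpy as np

def modified_vaw_predict(X_train, Y_train, x):
    """Modified VAW (Forster-Warmuth) prediction (18) at a single query point x.

    Computes the VAW weights with lambda = 0 using a Moore-Penrose pseudoinverse and
    multiplies the resulting prediction by (1 - h_x), where h_x is the leverage of x
    with respect to X_1, ..., X_n, x.
    """
    A = X_train.T @ X_train + np.outer(x, x)    # sum_i X_i X_i^T + x x^T
    A_pinv = np.linalg.pinv(A)
    h_x = float(x @ A_pinv @ x)                 # leverage of the query point
    w = A_pinv @ (X_train.T @ Y_train)          # VAW weights with lambda = 0
    return (1.0 - h_x) * float(w @ x)

# Tiny usage example with Gaussian data.
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
Y = X @ w_star + 0.1 * rng.standard_normal(n)
x_new = rng.standard_normal(d)
print(modified_vaw_predict(X, Y, x_new), "vs", float(w_star @ x_new))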

4 Proofs

4.1 Proof of Theorem 2.1

This section is devoted to the proof of Theorem 2.1. First, notice that by convexity of the quadratic loss and convexity of the class W_b, the following inequality holds, sometimes called the Bernstein condition in the literature:

    R(ω) − R(ω*_b) ≥ E ⟨ω − ω*_b, X⟩² = ‖ω − ω*_b‖²_{L_2}.     (20)

Our analysis is split into three parts. First, we provide the excess risk bound in terms of the localized complexities corresponding to the quadratic and multiplier terms. Then, we prove sharp bounds for both of them. To simplify the notation, in what follows we write ω̂, ω* instead of ω̂_b^ERM, ω*_b.

Localization. For any ω, x ∈ R^d and y ∈ R define the excess loss functional

    L_ω(x, y) = (⟨ω, x⟩ − y)² − (⟨ω*, x⟩ − y)².

Let us split the empirical excess risk into the quadratic and multiplier components as follows:

    P_n L_ω = (1/n) Σ_{i=1}^n L_ω(X_i, Y_i) = (1/n) Σ_{i=1}^n ⟨ω − ω*, X_i⟩² + (2/n) Σ_{i=1}^n (⟨ω*, X_i⟩ − Y_i) · ⟨ω − ω*, X_i⟩,

where the first sum is P_n Q_{ω−ω*} and the second sum is P_n M_{ω−ω*}, the empirical quadratic and multiplier processes respectively, both indexed by ω ∈ R^d. We denote their population counterparts by

    E Q_{ω−ω*} = ‖ω − ω*‖²_{L_2}   and   E M_{ω−ω*} = 2 E (⟨ω*, X⟩ − Y) · ⟨ω − ω*, X⟩.
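The decomposition above is a pointwise algebraic identity; the following quick numerical check (ours, not part of the proof) verifies it on random instances.

import numpy as np

# Check: (<w,x> - y)^2 - (<w*,x> - y)^2 = <w - w*, x>^2 + 2(<w*,x> - y) * <w - w*, x>.
rng = np.random.default_rng(0)
d = 5
for _ in range(1000):
    w, w_star, x = rng.standard_normal((3, d))
    y = rng.standard_normal()
    excess_loss = (w @ x - y) ** 2 - (w_star @ x - y) ** 2
    quadratic = ((w - w_star) @ x) ** 2
    multiplier = 2.0 * (w_star @ x - y) * ((w - w_star) @ x)
    assert np.isclose(excess_loss, quadratic + multiplier)
print("decomposition verified on 1000 random instances")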

Since ω defined in (6) minimizes the empirical excess risk Pn ω over ω b, we have Pn ωb 0. Thus, it 2 L ∈ W L ≤ suffices to show that, with high probability, if E ω q for some q > 0, then Pn ω > 0. This will imply by L ≥ 2 L contradiction that with high probability E b = R(ω) R(ω∗) q . b Lω − ≤

b 14 2 E 2 2 Recall that ω L2 = ω,X . As a first step, let us show that if ω ω∗ L2 is larger than a maximum of k k h i k − k 2 suitably defined fixed points (see below), then Pn ω > 0, thus implying that ω ω∗ L2 is small. 2 2 L k − k Fix some ω b and let s = ω ω∗ L2 . We aim to investigate under what assumptions on s it holds that P > 0. Using∈W the Bernstein assumptionk − k (20) we have nLω b Pn ω = Pn ω ω∗ + Pn ω ω∗ (21) L Q − M − = (Pn ω ω∗ E ω ω∗ ) + (Pn ω ω∗ E ω ω∗ )+ E ω Q − − Q − M − − M − L 2 (Pn ω ω∗ E ω ω∗ ) + (Pn ω ω∗ E ω ω∗ )+ s . ≥ Q − − Q − M − − M − Observe that if s satisfies s2 sup Pn ω ω∗ E ω ω∗ ω , ω ω∗ s | Q − − Q − |≤ 10 ∈Wb k − kL2 ≤ s2 and sup Pn ω ω∗ E ω ω∗ , (22) ω , ω ω∗ s | M − − M − |≤ 10 ∈Wb k − kL2 ≤ 2s2 2 then P + s > 0. The inequality P > 0 extends to all ω′ such that ω′ ω∗ s via a nLω ≥− 10 nLω ∈Wb k − kL2 ≥ standard star-shapedness argument of the class . To see that, suppose that ω′ ω∗ = s′ >s. Then, Wb k − kL2

s′ s s′ s s′ ω′ ω∗ =  ω′ + − ω∗ ω∗ = (ωs′ ω∗) . (23) − s s′ s′ − s −  ′   ωs   s  By convexity of , ω′ . Further, ω′ | ω∗ {z= ′ ω}′ ω∗ = s. Hence, it follows that Wb s ∈Wb k s − kL2 s k − kL2 Pn ω′ = Pn ω′ ω∗ + Pn ω′ ω∗ L Q − M − 2 s′ s′ s′ = Pn ω′ ω∗ + Pn ω′ ω∗ Pn ω′ ω∗ + Pn ω′ ω∗ > 0. s Q s− s M s− ≥ s Q s− M s−    Therefore, we are interested in the smallest value of s that satisfies the two conditions in (22). This leads to the definition of the (random) fixed point corresponding to the quadratic term:

s2 s∗ = inf s> 0 : sup Pn ω ω∗ E ω ω∗ (24) Q ω , ω ω∗ s | Q − − Q − |≤ 10 ( ∈Wb k − kL2 ≤ ) and the (random) fixed point corresponding to the multiplier term:

s2 s∗ = inf s> 0 : sup Pn ω ω∗ E ω ω∗ . (25) M ω , ω ω∗ s | M − − M − |≤ 10 ( ∈Wb k − kL2 ≤ ) We conclude by proving the excess risk bound for the constrained least squares estimator ω in terms of the 2 2 fixed points s∗ and s∗ defined above. Observe that if ω ω∗ L2 max s∗ ,s∗ , then by the argument above, P >M0. Therefore,Q ω satisfies ω ω 2 maxk s −,s k 2.≥ Assume{ forM theQ} sake of contradiction that n ω ∗ L2 ∗ ∗ b L 2 k − k ≤ { M Q} E ωb > 3max s∗ ,s∗ . Then, L { M Q} b 2 b 2 3max s∗ ,s∗ < E ωb = E ωb ω∗ + E ωb ω∗ max s∗ ,s∗ + E ωb ω∗ . { M Q} L Q − M − ≤ { M Q} M − 2 The above inequality implies that E ωb ω∗ > 2max s∗ ,s∗ . Therefore, using the fact that Pn ωb M − { M Q} L ≥ Pn ωb ω∗ = E ωb ω∗ + Pn ωb ω∗ E ωb ω∗ we have M − M − M − − M − Pn ωb Pn ω ω∗ Pn ωb ω∗ E ωb ω∗ L ≥ M − − | M − − M − | 2 2(max s∗ ,s∗ ) Pn ωb ω∗ E ωb ω∗ . (26) ≥ { M Q} − | M − − M − |

15 We now aim to upper bound Pn ωb ω∗ E ωb ω∗ in order to conclude that Pn ωb > 0, which will yield | M − − M2 − | 2 L the desired contradiction. Recall that ω ω∗ L2 max s∗ ,s∗ . By the star-shapedness argument (cf. k − k ≤ { M Q} Equation (23)), for all ω b with s∗ < ω ω∗ L2 max s∗ ,s∗ , there exists some ω′ b such max s∗ ,s∗ ∈ W M k − k ≤ { M Q} ∈ W { M Q} that ω ω∗ = s∗ (ω′ ω∗) and b ω′ ω∗ L2 s∗ . To simplify the notation, for any r > 0, denote − M − k − k ≤ M (r)= ω : ω ω∗ r . Then, we have B { ∈Wb k − kL2 ≤ } Pn ωb ω∗ E ωb ω∗ sup Pn ω ω∗ E ω ω∗ | M − − M − |≤ ω (max s∗ ,s∗ ) | M − − M − | ∈B { M Q}

max s∗ ,s∗ { M Q} sup Pn ω ω∗ E ω ω∗ ≤ s · ∗ | M − − M − | ∗ ω (sM) ! M ∈B 2 (max s∗ ,s∗ ) { M Q} . ≤ 10

Combining the above inequality with (26) yields Pn ωb > 0, which contradicts the assumption that ω is an empirical risk minimizer over . Therefore, we haveL Wb 2 2 2 b E R(ω) R(ω∗) 3 E(max s∗ ,s∗ ) 3 E(s∗ ) +3 E(s∗ ) (27) − ≤ { M Q} ≤ M Q 2 2 and we turn to the upper bounds on E(s∗ ) and E(s∗ ) in the sequel. b Q M

2 Quadratic term. In this part of the analysis we obtain an upper bound on E(s∗ ) . Denote the second E T 2 2 Q 2 moment matrix by Σ = XX and assume without loss of generality that λ1 λ2 ... λd > 0. Indeed, if some of the eigenvalues are equal to zero then the distribution of X is supported≥ on a≥ subspace≥ of Rd. Then we may restrict our analysis to this subspace only. 1 T 1 We may write X = Σ 2 Z, where Z is an isotropic vector (E ZZ = Id) and the eigenvalues of Σ 2 satisfy λ λ ... λ > 0. Observe that 1 ≥ 2 ≥ ≥ d 2 2 T 2 ω ω∗ s is equivalent to (ω ω∗) Σ(ω ω∗) s . k − kL2 ≤ − − ≤ 1 1 Denoting v =Σ 2 (ω ω∗) and = Σ 2 (ω ω∗): ω we may write − V { − ∈Wb} n 1 2 2 sup Xi,ω ω∗ E X,ω ω∗ ∗ n h − i − h − i ω b, ω ω L2 s i=1 ∈W k − k ≤ X n 1 2 2 = sup Zi, v E Z, v . (28) v , v s n h i − h i ∈V k k≤ i=1 X

In what follows, our idea is to replace the supremum over the set v : v , v s in (28) by the supremum over the unit ball by considering a special vector W defined below{ which∈V replacesk k≤X}(and Z). This will put us in position to apply the concentration result of Oliveira [Oli10]. T 2 2 Since the matrix Σ is real and symmetric, we may write Σ = U Diag(λ1,...,λd)U, where U is an orthogonal 1 d d matrix, and therefore, Σ 2 = Diag(λ ,...,λ )U. Denote by B(s) the closed Euclidean ball in Rd of radius × 1 d s centred at zero. Since ω ω∗ 2b we have k − k≤ 1 v : v , v s B(s) Σ 2 B(2b)= B(s) Diag(λ ,...,λ )B(2b), { ∈V k k≤ }⊆ ∩ ∩ 1 d where the last inequality holds since the orthogonal matrix does not change the Euclidean ball B(2b). Let d e1,...,ed denote the standard basis in R . It is easy to verify that any point (x1,...,xd) that belongs to the intersection of the ball and the ellipsoid B(s) Diag(λ ,...,λ )B(2b) satisfies for any 1 k d, ∩ 1 d ≤ ≤ k d s x e + x e B(2s). (29) i i 2bλ i i ∈ i=1 i X i=Xk+1   16 d d 2 2 k d 2 2 2 2 s xi 2 2 2 s xi 2 Indeed, we have xi s and 4b2λ2 s which leads to xi s and 4b2λ2 s implying (29). i=1 ≤ i=1 i ≤ i=1 ≤ i=k+1 i ≤ In what follows, letP k be the largestP integer that satisfies s 2bλPk. Finally, the setP consisting of all (x1,...,xd) satisfying (29) contains v : v , v s as a subset. Denote≤ Z = (z ,...,z ). For the same value of k define { ∈V k k≤ } 1 d the random vector W = (w1,...,wd), 2bλ 2bλ W = z ,...,z , k+1 z ,..., n z . 1 k s k+1 s n   We may rewrite k d k d s 2bλ s Z, v = v z + v i z = v w + v w . h i i i 2bλ i s i i i 2bλ i i i=1 i i=1 i X i=Xk+1     X i=Xk+1   These computations imply that n n 1 2 2 1 2 2 sup Zi, v E Z, v sup Wi, v E W, v (30) v , v s n h i − h i ≤ v B(2s) n h i − h i ∈V k k≤ i=1 ∈ i=1 X X Finally, we provide two properties of the defined random vector W :

d • 2 2 1/2 2 2 2 1 2 2 Note that λi zi = Σ Z = X 1 almost surely. We have W = i k λ2 (λi zi ) + i=1 k k k k ≤ k k ≤ i · 4b2 d P2 2 1 4b2 P s2 i=k+1 λi zi max λ2 , s2 , and recalling that s 2λkb, it implies that almost surely ≤ k ≤ P n o 2b W . k k≤ s • For every v Rd, since 2λ b s for i k + 1, ∈ i ≤ ≥ d 2bλ 2 E W, v 2 = v2 + v2 i v 2. h i i i s ≤k k2 i k i=k+1   X≤ X Combining these two properties with (30) we apply a version of Rudelson’s inequality for rank one operators [Oli10, Lemma 1] which implies that with probability at least 1 δ − n n 1 2 2 2 1 2 2 sup Wi, v E W, v =4s sup Wi, v E W, v v B(2s) n h i − h i v B(1) n h i − h i ∈ i=1 ∈ i=1 X X 2 log(min n, d ) + 2 log 2 + log 1 32bs { } δ , ≤ s n

2 log(min n,d +2 log 2+log 1 ) 8b { } δ provided that s n 2. Recalling the definition (24) of s∗ and solving the fixed point ≤ Q q2 log(min n,d +2 log 2+log 1 ) 2 inequality 32bs { } δ s , we may choose a large enough numerical constant c such that, n ≤ 10 1 with probabilityq at least 1 δ, − 2 1 2 b (log(min n, d ) + log δ ) (s∗ ) c1 { } . Q ≤ n b2 log(min n,d ) Integrating the last inequality we have for u′ = c2 n { } , where c2 is some numerical constant

∞ ∞ E 2 2 nu (s∗ ) = Pr (s∗ ) >u du u′ + exp 2 + log(min n, d ) du Q Q ≤ −b c1 { } Z Z′   0  u b2 log(min n, d ) . { } . n

17 And for r> 0 we have due to homogeneity

2 2 2 r b log(min n, d ) E(s∗ ) . { } . (31) Q n

Multiplier term. In this part of the proof we work with general r > 0 and we aim to upper bound 2 E(s∗ ) . Recall that ξ = Y ω∗,X . Fix any λ > 0 and consider the event E that s∗ /2 > s∗ . Denote M − h i M Q ′ = ω : ω b, ω ω∗ L2 s∗ /2 Plugging s = s∗ /2 into (25) we have on E, W { ∈W k − k ≤ M } M n 2 1 (s∗ ) sup 80 ξi Xi,ω∗ ω E ξ X,ω∗ ω M ≤ ′ n h − i− h − i ω i=1 ∈W X n 1 sup 80 ξi Xi,ω∗ ω E ξ X,ω∗ ω ≤ ′ n h − i− h − i ω i=1 ∈W X n 2 2 2 1 2 2λr ω ω∗ + ω ω∗ X ,ω ω∗ k − k k − kL2 − n h i − i − n i=1 ! 2 2X2 (s∗ ) 8λr b + M + 40 n n 1 sup 80 ξi Xi,ω∗ ω E ξ X,ω∗ ω ≤ ′ n h − i− h − i ω i=1 ∈W X n 2 2 2 1 2 2λr ω ω∗ ω ω∗ X ,ω ω∗ k − k −k − kL2 − n h i − i − n i=1 ! X 2 2 2 1 1 8λr b + (s∗ ) + + . M 2 40 n  

In the first step above we used the definition (24) of s∗ together with the star-shapedness argument (cf. the Q 2 2 localization part of the proof above), the inequality s∗ /2 >s∗ , and the fact that ω ω∗ 4b for ω b. M Q k − k ≤ ∈W In the second inequality we have used the definition of ′. W Let ε1,...,εn be independent random signs and let E′ denote the expectation with respect to an independent copy of the sample Sn. We now show how to control the multiplier term on the event E, that is, the term 2 d E(s∗ ) 1E. By Jensen’s inequality, the symmetrization argument and the symmetry of R used to remove the M

18 absolute value, we have the following:

n 80 E sup ξi Xi,ω∗ ω E′ ξ X,ω∗ ω ′ n h − i− h − i ω i=1 ∈W X n 2 2 1 2 2λr ω ω∗ X ,ω ω∗ E′ ξ X,ω∗ ω k − k − n h i − i − h − i− n i=1 ! X n n 2 2 80 1 2 2λr v E sup ξi Xi, v E′ ξ X, v Xi, v E′ ξ X, v k k ≤ Rd n h i− h i − n h i − h i− n v i=1 i=1 ! ∈ X X n E E 80 ′ sup εi(ξi Xi, v ξi′ Xi ′, v ) ≤ Rd n h i− h i v i=1 ∈ X n n 2 2 1 2 1 2 2λr v X , v X′, v k k − n h i i − n h i i − n i=1 i=1 ! X X n n 2 2 80 1 2 λr v 2 E sup εiξi Xi, v Xi, v k k ≤ Rd n h i− n h i − n v i=1 i=1 ! ∈ X X 1 n n − 3200 T T = E ε ε ξ ξ X λr2I + X X X n  i j i j i d k k j  i,j ! X kX=1 1 n n −  3200 T T = E ξ2X λr2I + X X X n i i d k k i i=1 ! X Xk=1 3200 2 T 1 = E ξ X Σ− 2 X, n λr where in the last lines we usedb the exact value of v maximizing the expression as well as the exchangeability of Xi and Xj . Finally, we have E 2 E 2 1 E 2 1 (s∗ ) = (s∗ ) s∗ 2s∗ + (s∗ ) s∗ >2s∗ M M { M≤ Q} M { M Q} 2 2 E 2 3200 E 2 T 1 E 2 1 1 8λr b 4 (s∗ ) + ξ X Σλr− 2 X + (s∗ ) + + . ≤ Q n M 2 40 n   Combining the last inequality with (27) and (31) we haveb

E 2 T 1 2 2 ERM ξ X Σλr− 2 X (λ + log(min n, d ))r b E R(ω ) R(ω∗) . + { } . b − n n b This proves Theorem 2.1. b 4.2 Proof of Theorem 3.1 The proof of this result is split into several steps. In this section, we provide three technical lemmas and demonstrate how they imply the result of Theorem 3.1. The first two lemmas are based on exact computations using the Sherman-Morrison formula. The proof of the third lemma, for which we sketch a simple heuristic argument before presenting the formal proof, is based on matrix concentration inequalities. We always assume that d (and therefore n, since it satisfies n & d3 log d) is large enough. Within the proofs, we use auxiliary variables α,β,x,y, that are sometimes redefined throughout the text. Before we proceed, let us remark that (X, Y ) distributed according to (16) satisfies X 1 almost surely k k ≤ and Y L∞ 1, thus r = m = 1. Our first lemma, proved in Appendix D.1, provides an excess risk lower bound for anyk k vector≤ ω Rd, provided that b is large enough. ∈

19 Lemma 4.1. Suppose that b √d/2, d 4, and (X, Y ) is distributed according to (16). Then, for any ω Rd we have ≥ ≥ ∈ 1 3/2 2 √d 1 R(ω) R(ωb∗) d− ω ωb∗ and also ωb∗ = − 1. − ≥ 2 k − k 2√d 1 · − Further, we define an unconstrained least squares solution as (dropping the superscript ERM in our notation):

1 n − ω = nΣ XiYi . (32) ∞ i=1    X  b b In the proof of Theorem 3.1 we work on the event where Σ is invertible which will be shown to hold with sufficient probability. This ensures the uniqueness of ω hence, we remark that the result of Theorem 3.1 holds ∞ for any constrained least squares estimator. Our proof strategyb is quite straightforward: using Lemma 4.1 we 3/2 show that the excess risk of ω is lower bounded by cdb /n, while for large enough b, ω is also a least squares solution constrained to the ball∞ of radius b. Before stating our next lemma we introduce some∞ additional notation. Let b b I = i 1,...,n : X = 1/d { ∈{ } i 6 } denote the (random) subset of data points whose covariates are not equal to 1/d. Denote

T 1 2 T A = X X and hence Σ= (n I ) d− 11 + A . (33) i i n − | | i I X∈  b Further, let v, ζ Rd denote the (random) vectors such that ∈ 1/2 v = A √d and ζ = v I d− 1. (34) i ii − | | In words, the i-th entry of v denotes the number of observations in the set I whose i-th entry is non-zero. 1/2 Conditionally on the size of the set I, E v = I d− 1 and hence, ζ represents the noise present in the counts vector v. We will repeatedly rely on the following| | identities, which can be shown via a simple counting argument:

1/2 A1 = v = I d− 1 + ζ and ζ, 1 =0. (35) | | h i The following lemma provides a sharp inequality for the norm of ω as well as an exact expression for the vector ω itself. The proof is deferred to Appendix D.2. ∞ ∞ Lemma 4.2. Let ω be defined by (32). The following two identitiesb hold whenever the matrix A defined in (b33) is invertible: ∞ 3/2 1 3/2 1 b d I − d I − 1 ω = | | T 1 | | T A− ζ, (36) ∞ (n I ) 1d2 + 1 A 11 − (n I ) 1d2 + 1 A 11 − | | − − − | | − − and b 2 2 2 T 2 ω n d− 1 A− 1. k ∞k ≤ Note that the first summand in (36) as well as the vector ωb∗ are both proportional to 1. However, it will 1 be shown later that the second summand inb (36), which is proportional to A− ζ, is almost orthogonal to 1. Combining this observation with the fact that 1, ζ = 0 will yield the desired lower bound via Lemma 4.1, provided that the magnitude of the second termh in (36i ) is large enough. Combining Lemmas 4.1 and 4.2, the excess risk of the unconstrained least squares solution ω can be T 1 1 ∞ expressed in terms of the random quadratic form 1 A− 1 and the random vector A− ζ. Also, the norm of T 2 ω can be upper-bounded in terms of 1 A− 1. The following result provides sharp bounds on all theb random quantities∞ that we need. Lemmab 4.3. Suppose that d is large enough and n & d3 log d. Then, the following results hold simultaneously, with probability at least 1/2:

20 1/2 (a) I nd− ; | |∼ (b) ζ 2 n; k k ∼ (c) The matrix A defined by (33) is invertible; T 1 3/2 (d) ζ A− ζ & d ; T 1 1 2 (e) 1 A− 1 . n− d ; T 2 2 3 (f) 1 A− 1 . n− d . Before presenting the formal proof (see Appendix D.3), we discuss the intuition behind the proof of this 1/2 lemma. First, observe that (a) follows from the fact that I is Binomially distributed with parameters n, d− . The magnitude of ζ 2 follows from a direct computation| | of its expectation and variance. For large enough n, we expect that A k EkA. Assuming this, we may focus on E A, which conditionally on the size of set I has the following simple form:≈ 1 3/2 1 3/2 1 T E A = I (d− (d + d)− )I + (d + d)− 11 . | | − d Observe that the eigenvector corresponding to the largest eigenvalue of E A is proportional to 1, and the remaining eigenvectors complement this direction and form an orthonormal basis. Moreover, the above expression for E A 1/2 1 implies that λ1(E A) I d− and λj (E A) I d− for j =2,...,d. Thus, E A is invertible and in particular, we have ∼ | | ∼ | | T 1 1 2 T 2 2 2 3 1 (E A)− 1 = d/λ1(E A) . n− d , and 1 (E A)− 1 = d/(λ1(E A)) . n− d . Finally, since by (35) we have ζ, 1 = 0, the vector ζ is orthogonal to the first eigenvalue of E A. Therefore, h i T 1 2 3/2 ζ (E A)− ζ ζ /λ (E A) & d . ≥k k 2 With the above lemmas at hand, we are ready to prove Theorem 3.1. Proof of Theorem 3.1. We work on the event of Lemma 4.3. First, note that combining Lemmas 4.2 and 4.3 we have ω 2 . d. k ∞k Thus, on the event of Lemma 4.3, the unconstrained ERM solution ω is also a solution over the Euclidean ball of any radius b that satisfies b c√d, where c is someb absolute constant.∞ ≥ We will now lower bound the expected excess risk of ω . Observeb that for any vector x and a unit vector u we have x x, u . Consider the unit vector u = ζ/ ζ∞. Denote k k≥|h i | k k 3/2b 1 d I − β = | | . (n I ) 1d2 + 1TA 11 − | | − − Combining Lemmas 4.1 and 4.2 together with ζ, 1 = 0 given by (35) we have h i 2 1 3/2 √d 1 1 R(ω ) R(ωb∗) d− β − 1 βA− ζ ∞ − ≥ 2 − 2√d 1! − − 2 b 1 3/2 ζ √d 1 1 d− , β − 1 βA− ζ ≥ 2 * ζ − 2√d 1! − +! k k − 2 1 3/2 ζ 1 = d− , βA− ζ . (37) 2 ζ k k  1 By Lemma (4.3) we have β & 1, with probability at least 2 . Hence, the lower bound (37) implies on the event of Lemma 4.3 that T 1 2 3/2 1 3/2 2 ζ A− ζ d R(ω ) R(ωb∗) d− β & . ∞ − ≥ 2 ζ n  k k  1 E d3/2 Since the event of Lemma 4.3 holdsb with probability at least 2 , it follows that R(ω ) R(ωb∗) & n . This concludes the proof of our theorem. ∞ − b 21 Acknowledgements

We are indebted to Shahar Mendelson for fruitful discussions and valuable feedback: in particular, for suggesting us the technique to analyze the quadratic process in Theorem 2.1 and for motivating us to work on the lower bounds. We are also grateful to Manfred Warmuth for providing a reference to the predictor that removes the excess logarithmic factor appearing in one of our results. Finally, we thank Jaouad Mourtada for many related discussions. This work was conducted when Nikita Zhivotovskiy was at Google Research, Z¨urich. Tomas Vaˇskeviˇcius is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1).

References

[AC10] Jean-Yves Audibert and Olivier Catoni. Linear regression through pac-bayesian truncation. arXiv preprint arXiv:1010.0072, 2010.

[AC11] Jean-Yves Audibert and Olivier Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.

[Aud09] Jean-Yves Audibert. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4):1591–1646, 2009.

[AW01] Katy S Azoury and Manfred K Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

[BBM05] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

[Bel18] Pierre C Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. The Annals of Statistics, 46(2):745–780, 2018.

[BF83] Leo Breiman and David Freedman. How many variables should be entered in a regression equation? Journal of the American Statistical Association, 78(381):131–136, 1983.

[BGS16] G´abor Bal´azs, Andr´as Gy¨orgy, and Csaba Szepesv´ari. Chaining bounds for empirical risk minimization. arXiv preprint arXiv:1609.01872, 2016.

[BHMZ20] Olivier Bousquet, Steve Hanneke, Shay Moran, and Nikita Zhivotovskiy. Proper learning, Helly number, and an optimal SVM bound. Conference on Learning Theory, 125:582–609, 2020.

[Bir06] Lucien Birg´e. Model selection via testing: an alternative to (penalized) maximum likelihood estimators. In Annales de l’IHP Probabilit´es et statistiques, volume 42, pages 273–325, 2006.

[BM93] Lucien Birg´eand Pascal Massart. Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields, 97(1-2):113–150, 1993.

[BM06] Peter L Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.

[BM13] Francis R. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate o(1/n). In Advances in Neural Information Processing Systems, volume 26, 2013.

[Cat16] Olivier Catoni. Pac-bayesian bounds for the gram matrix and least squares regression with a random design. arXiv preprint arXiv:1603.05229, 2016.

[CBL06] Nicolo Cesa-Bianchi and G´abor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CGS15] Sabyasachi Chatterjee, Adityanand Guntuboyina, and Bodhisattva Sen. On risk bounds in isotonic and other shape restricted regression problems. The Annals of Statistics, 43(4):1774–1800, 2015.

22 [Cha14] Sourav Chatterjee. A new perspective on least squares under convex constraint. The Annals of Statistics, 42(6):2340–2381, 2014.

[CLL19] Geoffrey Chinot, Guillaume Lecu´e, and Matthieu Lerasle. Robust statistical learning with Lipschitz and convex loss functions. Probability Theory and Related Fields, pages 1–44, 2019.

[DJ98] David L Donoho and Iain M Johnstone. Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26(3):879–921, 1998.

[FKL+18] Dylan J. Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik Sridharan. Logistic regression: The importance of being improper. In Conference On Learning Theory, volume 75, pages 167–208, 2018.

[FW02] J¨urgen Forster and Manfred K Warmuth. Relative expected instantaneous loss bounds. Journal of Computer and System Sciences, 64(1):76–102, 2002.

[GKKW02] L Gy¨orfi, M Kohler, A Kryzak, and H Walk. A distribution-Free Theory of Nonparametric Regression. Springer, 2002.

[GSS17] Alon Gonen and Shai Shalev-Shwartz. Average stability is invariant to data preconditioning: Implications to exp-concave empirical risk minimization. The Journal of Machine Learning Research, 18(1):8245–8257, 2017.

[Hag89] William W Hager. Updating the inverse of a matrix. SIAM review, 31(2):221–239, 1989.

[HAK07] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[HKL14] Elad Hazan, Tomer Koren, and Kfir Y Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Conference on Learning Theory, pages 197–209, 2014.

[HKZ14] Daniel Hsu, Sham M Kakade, and Tong Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569–600, 2014.

[HWCS19] Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee, and Richard J Samworth. Isotonic regression in general dimensions. The Annals of Statistics, 47(5):2440–2471, 2019.

[JGR20] R´emi J´ez´equel, Pierre Gaillard, and Alessandro Rudi. Efficient improper learning for online logistic regression. In Conference on Learning Theory, volume 125, pages 2085–2108, 2020.

[JRT08] Anatoli Juditsky, Philippe Rigollet, and Alexandre B Tsybakov. Learning by mirror averaging. The Annals of Statistics, 36(5):2183–2206, 2008.

[KL15] Tomer Koren and Kfir Levy. Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems, pages 1477–1485, 2015.

[Kol06] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

[Kol11] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d’Et´ede Probabilit´es de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011.

[KRG20] Gil Kur, Alexander Rakhlin, and Adityanand Guntuboyina. On suboptimality of least squares with applica- tion to estimation of convex bodies. In Conference on Learning Theory, pages 2406–2424, 2020.

[KZ20] Yegor Klochkov and Nikita Zhivotovskiy. Uniform Hanson-Wright type concentration inequalities for un- bounded entries via the entropy method. Electronic Journal of Probability, 25:1–30, 2020.

[Lec13] Guillaume Lecu´e. Empirical risk minimization is optimal for the convex aggregation problem. Bernoulli, 19(5B):2153–2166, 2013.

23 [LM13] Guillaume Lecu´eand Shahar Mendelson. Learning subgaussian classes: Upper and minimax bounds. arXiv preprint arXiv:1305.4825, 2013.

[LM16] Guillaume Lecu´eand Shahar Mendelson. Performance of empirical risk minimization in linear aggregation. Bernoulli, 22(3):1520–1534, 2016.

[LM20] Gabor Lugosi and Shahar Mendelson. Risk minimization by median-of-means tournaments. J. Eur. Math. Soc, 22:925–965, 2020.

[LRS15] Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset rademacher complexity. In Conference on Learning Theory, pages 1260–1285, 2015.

[LT13] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

[Men15] Shahar Mendelson. Learning without concentration. Journal of the ACM, 62(3), 2015.

[Men20] Shahar Mendelson. Extending the scope of the small-ball method. Studia Mathematica, 2020.

[MG19] Jaouad Mourtada and St´ephane Ga¨ıffas. An improper estimator with optimal excess risk in misspecified density estimation and logistic regression. arXiv preprint arXiv:1912.10784, 2019.

[Mou19] Jaouad Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. arXiv preprint arXiv:1912.10754, 2019.

[MR10] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Chapman & Hall/CRC, 2010.

[MS12] Brendan McMahan and Matthew Streeter. Open problem: Better bounds for online logistic regression. In Conference on Learning Theory, 2012.

[MVZ21] Jaouad Mourtada, Tomas Vaˇskeviˇcius, and Nikita Zhivotovskiy. Distribution-free robust linear regression. arXiv preprint arXiv:2102.12919, 2021.

[MZ20] Shahar Mendelson and Nikita Zhivotovskiy. Robust covariance estimation under L4 − L2 norm equivalence. Annals of Statistics, 48(3):1648–1664, 2020.

[Nem00] Arkadi Nemirovski. Topics in non-parametric statistics. Ecole d’Et´ede Probabilit´es de Saint-Flour, 28:85, 2000.

[Oli10] Roberto Oliveira. Sums of random hermitian matrices and an inequality by Rudelson. Electronic Communi- cations in Probability, 15:203–212, 2010.

[Oli16] Roberto Oliveira. The lower tail of random quadratic forms with applications to ordinary least squares. Probability Theory and Related Fields, 166(3-4):1175–1194, 2016.

[Ora19] Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213v2, 2019.

[RST17] Alexander Rakhlin, Karthik Sridharan, and Alexandre B Tsybakov. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.

[Rud99] Mark Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164(1):60–72, 1999.

[Sau18] Adrien Saumard. On optimality of empirical risk minimization in linear aggregation. Bernoulli, 24(3):2176– 2203, 2018.

[Sha15] Ohad Shamir. The sample complexity of learning linear predictors with the squared loss. The Journal of Machine Learning Research, 16(1):3475–3486, 2015.

24 [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[Tro15] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

[Tsy03] Alexandre B Tsybakov. Optimal rates of aggregation. Conference on Learning Theory, pages 303–313, 2003.

[Ver16] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2016.

[Vov98] Volodya Vovk. Competitive on-line linear regression. In Advances in Neural Information Processing Systems, pages 364–370, 1998.

[Vov01] Volodya Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.

[Wai19] Martin J Wainwright. High-dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.

[Wan18] Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction and estimation in unbounded domain. In Uncertainty in Artificial Intelligence, 2018.

A Proof of Theorem 2.2

Our analysis is based on the notion of average stability (for more details we refer to e.g., [SSBD14] and reference therein). Although there is a vast literature on stability based techniques, with several recent related papers [KL15, GSS17], we could not find any general result that implies Theorem 2.2. Therefore, we provide an elementary proof. One of the differences compared with some of the previous results is that we provide an exact formula for the average stability as well as an exact analysis of the fitting term as an intermediate step in our proof. Our analysis of the fitting term exploits the curvature of the quadratic loss and, in particular, allows to rewrite the stability term as the multiplier term that appears in Theorem 2.1. Crucially, stability based approach λb2 allows us to replace the quadratic term appearing in the proof of Theorem 2.1 by a bias term that scales as n and consequently, removes multiplicative factor log(min n, d ) that is unimprovable for constrained ERM as we demonstrate in Proposition 2.3. { }

Similarly to the proof of Theorem 2.1 presented in Section 4.1, we drop the subscript b from ωb∗ and denote it by ω∗ in the rest of this section. We also introduce an additional independent element (Xn+1, Yn+1) distributed according to Pr to the sample. With a slight abuse of notation, we define for j =1,...,n + 1 the penalized (but not normalized by the sample size as in (7)) empirical second moment matrices by

n+1 n+1 ˜ T ˜ ( j) T Σλ = λId + XiXi and Σλ− = λId + XiXi . i=1 i=1,i=j X X6 For any λ> 0, the (unique) ridge estimator (8) constructed on all but the j-th sample, and all the n + 1 samples respectively, is defined as follows:

n+1 n+1 ( j) ˜ ( j) 1 ˜ 1 ωλ− = (Σλ− )− YiXi andω ˜λ = Σλ− YiXi, (38) i=1,i=j i=1 X6 X b Therefore, the ridge estimator defined in (8) is trained on the first n samples and hence it satisfies

( (n+1)) ωλ = ωλ− .

b b 25 Let E denote the expectation with respect to all n + 1 samples (X1, Y1),..., (Xn+1, Yn+1). Since the sample is exchangeable we have n+1 2 1 2 E R(ω ) R(ω∗)= E(Y ω ,X ) E(Y ω∗,X ) λ − n+1 − h λ n+1i − n +1 i − h ii i=1 X n+1 b 1 ( i)b 2 2 = E (Y ω − ,X ) (Y ω˜ ,X ) n +1 i − h λ ii − i − h λ i i=1 X  Average stability  n+1 b | 2 {z 2 } + E (Y ω˜ ,X ) (Y ω∗,X ) , i − h λ ii − i − h ii i=1 X  Fitting term where the two terms in the last| display are interpreted{z as the fitting-stability} trade-off [SSBD14]. Indeed, the first term corresponds to the average sensitivity of the estimator to the perturbation in one point of the sample and is called the average stability. The second term shows how the empirical loss of the estimator compares to the empirical loss of the best linear predictor ω∗ and is called the fitting term. The remainder of the proof is devoted to the analysis of these two terms.

Average stability term. We provide exact computations for the average stability term via the Sherman- Morrison formula (see e.g., [Hag89]), which states that for any j =1,...,n + 1 we have T 1 ˜ 1 ˜ 1 ( j) − 1 Σλ− Xj Xj Σλ− ˜ − ˜ Σλ = Σλ− + T 1 . (39) 1 X Σ˜ − Xj   − j λ To simplify the notation we denote for the rest of the proof the j-th (random) leverage score by T ˜ 1 hj = Xj Σλ− Xj. Using the definition (38), Sherman-Morrison formula (39) and simple algebra we demonstrate the following equality:

( i) 2 (Y ω − ,X ) i − h λ ii 2 n+1 ( i) 1 ( i) 1 = Y b (Σ˜ − )− Y X ,X + (Σ˜ − )− Y X ,X  i − λ j j i λ i i i  * j=1 + X D E  2  n+1 2 ( i) 1 hi = Y (Σ˜ − )− Y X ,X + Y h +  i − λ j j i i i 1 h  * j=1 + i X  −    2 n+1 1 1 1 T Yihi = Y Σ˜ − Y X ,X Σ˜ − X X ω˜ ,X +  i − λ j j i − 1 h λ i i λ i 1 h  * j=1 + i i X − D E −  h Y h 2 1 2  = Y ω˜ ,X i ω˜ ,X + i i = (Y ω˜ ,X )2 . i − h λ ii− 1 h h λ ii 1 h 1 h i − h λ ii  − i − i   − i  The above result implies 2 E ( i) 2 2 E 1 2 (Yi ωλ− ,Xi ) (Yi ω˜λ,Xi ) = 1 (Yi ω˜λ,Xi ) − h i − − h i 1 hi − ! − h i    −  h h b = E i + i (Y ω˜ ,X )2. (40) (1 h )2 1 h i − h λ ii  − i − i 

26 2 Finally, we show that if λ> cr for some numerical constant c> 0, then hi is separated from 1. First, observe T ˜ T that Xj Xj + λId Σλ. Since Xj Xj is a rank one matrix having at most one non-zero positive eigenvalue denoted by µ, we have

T 1 1 T 0 h = X Σ˜ − X = Tr(Σ˜ − X X ) ≤ j j λ j λ j j 2 T 1 T µ r 1 Tr((X X + λI )− X X )= , (41) ≤ j j d j j µ + λ ≤ r2 + λ ≤ 1+ c

T 1 T x 2 where we applied the facts that (X X + λI )− X X is rank one matrix, x is monotone, and µ r . j j d j j 7→ x+λ ≤ Using simple algebra and the identity given in (40), for any λ cr2 we have ≥ 2 ( i) 2 2 1+3c +2c 2 E (Y ω − ,X ) (Y ω˜ ,X ) E h (Y ω˜ ,X ) . (42) i − h λ ii − i − h λ ii ≤ c2 i i − h λ ii   b 1 2 Fitting term. In contrast to the naive upper bound, which follows by adding and subtracting n+1 (λ ω˜λ 2 k k − ω∗ ) to the (unnormalized) fitting term and using the fact thatω ˜λ minimizes the ridge regression optimization kobjective:k n+1 2 2 2 2 2 E (Y ω˜ ,X ) (Y ω∗,X ) λ ω∗ ω˜ λb , i − h λ ii − i − h ii ≤ k k −k λk ≤ i=1 X   in the proof below, we exploit the curvature of the squared loss, which results in an improved upper bound. The improvement that comes from extra negative terms allow us to compensate rewrite the average stability term as localized multiplier term, thus establishing a direct link between localization and stability analysis. Recall the definition (9) of the noise variable ξ. We show that the following deterministic inequality holds for the fitting term: n+1 n+1 2 2 2 2 (Y ω˜ ,X ) (Y ω∗,X ) (ξ ξ ) + λ ω∗ /2, (43) i − h λ ii − i − h ii ≤− i − i k k i=1 i=1 X  X b where we denote ξi = Yi ω˜λ,Xi . Sinceω ˜λ by the first order optimality conditions nullifies the gradient of the penalized empirical risk,− that h is i b n+1 0= (Y ω˜ ,X )2 + λ ω˜ 2 ∇ω˜λ i − h λ ii k λk i=1 ! X it follows that n+1 X ( ω˜ ,X Y )+ λω˜ =0. i h λ ii− i λ i=1 X Taking an inner product withω ˜ ω∗ in the above equality yields λ − n+1 2 ( ω˜ ,X Y )( ω˜ ,X ω∗,X )= λ( ω˜ ,ω∗ ω˜ ). h λ ii− i h λ ii − h ii h λ i−k λk i=1 X

27 The bound (43) follows by combining the last equality together with the following formula for the excess loss P as in the proof of Theorem 2.1 (cf. Equation (21)): nLω˜λ n+1 2 2 (Y ω˜ ,X ) (Y ω∗,X ) i − h λ ii − i − h ii i=1 X n+1 n+1  2 = ω˜ ω∗,X +2 ( ω˜ ,X Y ) ω˜ ω∗,X − h λ − ii h λ ii− i · h λ − ii i=1 i=1 X X n+1 2 2 = (ξ ξ ) +2λ( ω˜ ,ω∗ ω˜ ) − i − i h λ i−k λk i=1 X n+1 b 2 2 (ξ ξ ) + λ ω∗ /2, ≤− i − i k k i=1 X b 2 2 where in the last line we apply the Cauchy-Schwarz inequality and the fact that ω˜ ω∗ ω˜ ω∗ /4. k λkk k−k λk ≤k k Completing the proof. We are ready to finish the proof of Theorem 2.2 via the stability-fitting trade-off and inequalities (42) and (43). Indeed, optimizing the below quadratic function yields the following inequality:

α αξ2 (ξ ξ )2 = ξ2(α 1)+2ξ ξ ξ2 ξ2 , i − i − i i − i i − i ≤ i 1 α  −  b b b1+3c+2c2 b which holds for any 0 <α< 1, we have that if c2 hi < 1, then

E R(ω ) R(ω∗) λ − n+1 2 1 E 1+3c +2c 2 2 2 b hiξ (ξi ξi) + λ ω∗ /2 ≤ n +1 c2 i − − k k i=1 ! X   n+1 2 b b 2 1 1+3c +2c 2 1+3c +2c 2 E h ξ / 1 h + λ ω∗ /2 . ≤ n +1 c2 i i − c2 i k k i=1 ! X     Using the inequality on the leverage scores given in (41), a simple computation shows that the choice c = 3 1+3c+2c2 1+3c+2c2 guarantees that 2 / 1 2 h 14, which concludes our proof. c − c i ≤ d Observe that we never used any specific properties of ω∗ and it can be replaced by any vector in R . In contrast, our proof technique based on localization (cf. Section 4.1) crucially relies on the fact that ω∗ = ωb∗ minimizes the population risk over all vectors in . Finally, note that we have established our results for the Wb leverage scores hi that are computed on the sample of size n+1. However, it is easy to see that one may decrease the sample size by one so that the original claim of Theorem 2.2 holds.

B Proof of Proposition 2.3

Without loss of generality we set r = 1 (otherwise we can rescale the covariates introduced below by r). Let Y = 0, so that ω∗ = 0 and Y = ω∗,X (i.e., the problem is noise-free). Let k = k(d, n) d be an integer to be b h b i ≤ specified later. Let the covariates X follow a uniform distribution on the set of first k basis vectors e1,...,ek. The distribution of the samples (X, Y ) hence satisfies X 1. We aim to choose the value of k such that with probabilityk k≤ at least 1/2, at most k 1 out of the k basis − vectors are observed in the random sample X1,...,Xn (i.e., at least one of the basis vectors is not observed). This analysis follows from the coupon collector argument showing that one needs, with probability at least 1/2, a sample of size at least ck log k to observe all k vectors where c is a numerical constant. Let T be a random

28 variable that counts the (random) number of trials needed to observe all k basis vectors. A basic result (see, e.g., [MR10, Section 3.6]) shows that for any t> 0,

π2 Pr ( T kH tk) , | − k|≥ ≤ 6t2 where Hk is the k-th Harmonic number. Since log k k log k k log k, − √3 ≥ 2 where a simple computation shows that the last inequality holds provided that k 38. In what follows, we 1 ≥ choose k to be the smallest integer such that n 2 k log k; the condition k 38 can be always be satisfied provided that n is large enough. By the above, with≤ probability at least 1/2 there≥ is at least one basis vector among e1,...,ek such that it is not included in the sample X1,...,Xn. Denote such a (random) basis vector by e∗ and the corresponding event by E. Observe that on this event the vector be∗ b satisfies be∗,Xi = 0; ∈W 2 h i thus it is one of the least squares solutions (on the observed sample) in with R(be∗)= b /k. Let ω be equal Wb to be∗ on E and equal to any linear least squares on the complementary event E. Using Markov’s inequality we have b b2 b2 b2 b2 b2 log n E R(ω) R(ω∗)= E R(ω) Pr R(ω) Pr (E) , − b ≥ k ≥ k ≥ k ≥ 2k ≥ 10n   1 1 where the last step canb be proved as follows:b by our choiceb of k, we have 2 (k 1) log(k 1) n 2 k log k 2 2n 4n 5n − − ≤ ≤ ≤ (k 1) which implies k log(k 1) +1 log n +1 log n . This concludes our proof. − ≤ − ≤ ≤ C Proof of Proposition 2.8

2 Let c 1 denote the numerical constant that satisfies the assumption E ω,X 4 c E ω,X 2 . Since the ≥ h i ≤ h i T 1 leverage scores X (nΣ )− X are in [0, 1], using ξ m + rb we have   λ k kL∞ ≤ 2 T 1 2 b ξ X Σ− X n(m + rb) . λ ≤ 1/2 1 1/2 Let E denote the event that Σ Σ− Σ 2.b Combining the above inequality, the Cauchy-Schwarz inequal- k λ k≤ ity, and the L4–L2 moment equivalence assumption ξ L4 . ξ L2 = R(ωb∗) m, we have b k k k k ≤ E 2 T 1 E 2 T 1/2 1/2 1 1/2 1/2 p 1 E 2 T 1 1 ξ X Σλ− X = ξ (X Σ− )(Σ Σλ− Σ )(Σ− X) E + ξ X Σλ− X E 2 1/2 2 1/2 1 1/2 2 E ξ Σ− X Σ Σ− Σ 1E +n(m + rb) Pr(E) b ≤ k k k b λ k b 2 1/2 2 2 2 E ξ Σ− X + n(m + rb) Pr(E) ≤ k k b 2 E ξ4 E Σ 1/2X 4 + n(m + rb)2 Pr(E) ≤ k − k q . 2pm2 E Σ 1/2X 4 + n(m + rb)2 Pr(E). (44) k − k q 1/2 4 2 A direct calculation [Mou19, Remark 3] shows that E Σ− X . d under our assumption. Indeed, fixing 1/2 k k ωi =Σ− ei the following holds for i =1,...,d:

4 2 2 1/2 1/2 E Σ− e ,X c E Σ− e ,X = c. i ≤ i D E  D E 

29 Therefore, by the Cauchy-Schwarz inequality we have

2 d 2 1/2 4 1/2 E Σ− X = E e , Σ− X k k i i=1 ! X D E d 4 4 E e , Σ 1/2X E e , Σ 1/2X cd2. (45) ≤ i − j − ≤ i,j X q  q  Under the L4–L2 moment equivalence assumption on the marginals ω,X , the following lower tail bound given in [Oli16, Theorem 1.1] shows that for any δ (0, 1), with probabilityh i at least 1 δ, simultaneously for all v Rd, it holds that ∈ − ∈ 2 T d + 2 log T v Σ v 1 9c δ v Σv. (46) 0 ≥  − s n  b   For δ = 1 we have 1 9c d+2 log(2n) 1 , provided that n & d. Hence, by (46), with probability at least 1 1 , n − n ≥ 2 − n q 1 1/2 1 1/2 T 1/2 1/2 − Σ Σλ− Σ = inf v Σ− ΣλΣ− v k k v Sd−1  ∈  1 1 b T b − T − = inf u Σλu inf u Σ0u 2. u:uTΣu=1 ≤ u:uTΣu=1 ≤     The above inequality implies Pr(E) 1 and combinedb with inequalities (44)b and (45) yields ≤ n E 2 T 1 2 2 2 2 2 ξ X Σλ− X . dm + (m + rb) . dm + r b .

The proof is complete. b We remark that one may choose δ = exp( c0n) in (46), for some small enough numerical constant c0 and improve the resulting bound. This may be important− since as discussed in Section 2.4, a bound on the quadratic r2b2 term better than n could be possible under moment equivalence assumptions. Since this is not the main focus of our paper we do not pursue this direction.

D Proofs of Lemmas Supporting Theorem 3.1 D.1 Proof of Lemma 4.1

The proof is split into two steps. We first compute ω∗ = inf R(ω) and show that ω∗ √d/2 so that ∞ ω Rd k ∞k ≤ √ ∈ ω∗ = ωb∗ whenever b d/2. Next, we show that the lower bound follows via the Bernstein assumption (20). ∞ ≥ ∗ Computing ωb . Differentiating R(ω) with respect to ω and applying the first order optimality conditions, we obtain the following well-known expression for an unconstrained minimizer of the population risk over Rd: 1 T ω∗ =Σ− E XY , where Σ = E XX . A simple calculation shows that ∞ T 1/2 2 2 3/2 1 Σ= α11 + βI , with α = (1 d− )d− + (d + d )− (47) d − 3/2 2 3/2 1 and β = d− (d + d )− . − By the Sherman-Morrison formula, we have

1 T 1 2 1 1 (βId)− α11 (βId)− 1 αβ− T − Σ− = (βId) T 1 = β− Id 1 11 , − 1+ α1 (βId)− 1 − 1+ αβ− d

30 1 which plugged into the equation ω∗ =Σ− E XY yields ∞ 2 1 αβ− d 1/2 1 √d 1 ωb∗ = β− 1 (1 d− )d− 1 = − 1. − 1+ αβ− d − · 2√d 1 ·   − For all d 1 we have 0 (√d 1)/(2√d 1) 1/2 and, in particular, ω∗ √d/2 b. ≥ ≤ − − ≤ k ∞k≤ ≤ Lower bounding the excess risk. Let ω denote any parameter vector in Rd. Since we have already Rd shown that ωb∗ minimizes R(ω) over all of , by the Bernstein assumption stated in (20) we have E 2 T R(ω) R(ωb∗) X,ω ωb∗ = (ω ωb∗) Σ(ω ωb∗) − ≥ h − iT − − = (ω ω∗)(α11 + βI )(ω ω∗), − b d − b with the values of α and β given in (47). Since 11T is positive semi-definite, it hence follows that 2 R(ω) R(ω∗) β ω ω∗ . − b ≥ k − b k 1 3/2 Finally, for d 4 we have β d− , which completes our proof. ≥ ≥ 2 D.2 Proof of Lemma 4.2

2 T 1 Computing ω∞. We set once again α = (n I )d− and y = 1 A− 1. Combining (32) and (33) with n X Y = (n I )1/d and the Sherman-Morrison− | formula| we have i=1 i i − | | 1 2 P b 1 αyA− 1 dα y 1 dα 1 ω = dα A− 1 = dα A− 1 = A− 1. (48) ∞ − 1+ αy − 1+ αy 1+ αy     1/2 1 By (35), we have A1b= I d− 1 + ζ. Multiplying both sides by A− and rearranging yields | | 1 1 1/2 1 A− 1 = I − d (1 A− ζ). | | − Plugging the above into (48) yields 3/2 1 d I − α 1 ω = | | (1 A− ζ). ∞ 1+ αy −

2 b Computing ω∞ . Using the computations as above, we obtain k k 2 2 2 T 2 ω = ω , ω = (n I ) d− 1 (nΣ)− 1. (49) b k ∞k h ∞ ∞i − | | · To simplify the notation, let α = (n I )d 2, x = 1TA 21 and y = 1TA 11. Applying the Sherman-Morrison − − b− formula together with (33) we haveb − | | b b

T 2 1 (nΣ)− 1 1 T 1 2 T 1 αA− 11 A− = 1 b A− 1 − 1+ αy   2 T 1 1 T 2 2 1 T 2 T 1 T 2 αA− 11 A− αA− 11 A− α A− 11 A− 11 A− = 1 A− + 1 − 1+ αy − 1+ αy (1 + αy)2   αxy αyx α2yxy x = x + = . − 1+ αy − 1+ αy (1 + αy)2 (1 + αy)2 Plugging the above into (49) yields T 2 2 2 2 1 A− 1 2 2 T 2 ω = (n I ) d− n d− 1 A− 1. k ∞k − | | (1 + (n I )d 21TA 11)2 ≤ − | | − − The claim follows. b

31 D.3 Proof of Lemma 4.3 The proof is based on applying the union bound on the probability of several events. Adjusting the constants 1 one may always guarantee that the statement of Lemma 4.3 holds with probability at least 2 . By writing that 99 the event holds with sufficient probability we mean that it holds with probability at least 100 . Additionally, in many places we work conditionally on the size of the set I . | | Controlling I . The result follows from Chebyshev’s inequality since I follows the Binomial distribution | | 1/2 | | with parameters n, d− .

I 2 | | Bound on ζ . Recalling (33) and (34) we may rewrite vi = vi,j , where vi,j has a Bernoulli distribution k k j=1 1/2 with parameter d− . Moreover, for any fixed i we have that vi,P1,...,vi, I are independent and for any j it | | d 1/2 holds that vi,j = d . Combining these facts we have i=1 P 2 d I | | 2 1/2 ζ = v d− k k  i,j −  i=1 j=1 X X   d I  d I | | 2 | | 1/2 1/2 1/2 = v d− + v d− v d− i,j − i,j − i,k − i=1 j=1 i=1 j=k X X   X X6    d I | | 1/2 1/2 1/2 = d I I + v d− v d− . | | − | | i,j − i,k − i=1 j=k X X6   

d I | | 1/2 1/2 We proceed with analysis of the zero mean sum vi,j d− vi,k d− . Observe that for any given i=1 j=k − − 6 j the values v1,j ,...,vd,j are not independent butP areP sampled with replacement. However, it is possible to avoid this problem using a direct computation. First, for i = i and any j we have 1 6 2 1/2 1/2 1/2 1/2 1 d d 1 1 E v d− v d− = E v v d− = − d− i1 ,j − i2,j − i1 ,j i2,j − d · d 1 −    1 − = . −d3/2 + d This implies the following correlation identity

I I | | | | 1/2 1/2 1/2 1/2 E v d− v d− v d− v d−  i1,j − i1,k −   i2 ,j − i2,k −  j=k j=k X6    X6     I    | | 1/2 1/2 1/2 1/2 = E v d− v d− v d− v d−  i1,j − i1,k − i2,j − i2,k −  j=k X6      I 2 I  = | | − | | . (d3/2 + d)2

32 The last identity leads to

2 d I | | 1/2 1/2 E v d− v d−  i,j − i,k −  i=1 j=k X X6      2 d I | | 2 1/2 1/2 2 I I = E v d− v d− + d d | | − | |  i,j − i,k −  − (d3/2 + d)2 i=1 j=k X X6      2  2 2 1/2 1/2 2 I I 2 = d( I I ) d− (1 d− ) + d d | | − | | 2 I . | | − | | − − (d3/2 + d)2 ≤ | |    Finally, using Chebyshev’s inequality we have ζ 2 I d1/2 n with sufficient probability. k k ∼ | | ∼ Invertibility of A. Observe that A is a sum I independent positive semi-definite random matrixes such that each summand has the operator norm equaling| | one. Using the lower tail of the matrix Chernoff bound [Tro15, Theorem 5.1.1] we have

1 I (d−1+(d3/2+d)−1)/2 Pr(λ (A) λ (E A)/2) d(2e− )| | , d ≤ d ≤ which is arbitrary small for large enough d and n & d3 log d. Finally, observe that

1 3/2 1 3/2 1 T 1 λ (E A)= λ I (d− (d + d)− )I + (d + d)− 11 I d− . d d | | − d ∼ | |    Therefore, λd(A) > 0 with sufficient probability and A is invertible.

T T 1/2 1/2 2 A lower bound on λ1(A). By (34) we have 1 A1 = 1 ( I d− 1 + ζ)= I d− 1 , which shows that | | | | k k 1/2 λ (A) I d− . (50) 1 ≥ | |

An upper bound on λ2(A). We need the following bound which states what with sufficient probability

1 λ (A) . I d− . (51) 2 | | By the Courant-Fischer theorem we have

T T λ2(A) = inf sup x Ax sup x Ax. v x Sd−1, x,v =0 ≤ x Sd−1, x,1 =0 ∈ h i ∈ h i Consider the d d partial isometry matrix R defined as follows. Fix an orthonormal basis w ,...,w in Rd such × 1 d that w1 is proportional to 1. The matrix R has its first row equal to zero and its i-th row for i 2 equal to wi. ≥T T Observe that R 1 = 0 and for any v such that u, 1 = 0 we have Ru = u together with RR = Id e1e1 . Next, we show h i k k k k − T T T sup x Ax = sup x RAR x. (52) x Sd−1, x,1 =0 x Sd−1 ∈ h i ∈ d 1 T Indeed, consider a maximizer x0 S − of the right-hand side. We have that R x0 is orthogonal to 1 since T T T ∈ d 1 d 1 1 R x0 = (R 1) x0 = 0. Finally, we have that for any x′ S − such that x′, 1 = 0 there is x S − such T ∈T T h i d 1∈ that R x = x′. This is because x′ = α2w2 + ...αdwd = R x, where x = (0, α2,...,αd) S − . Therefore, (52) follows. ∈ Further, the matrix RART is non-negative semi-definite as well as each additive term that forms it. We have

T T T RAR = RXiXi R i I X∈ 33 and for the operator norm we have RX XTRT R X XT RT = R RT 1. Note that k i i k≤k kk i i kk k k kk k≤ T T 1 3/2 1 3/2 1 T T E RAR = R E (A) R = I R (d− (d + d)− )I + (d + d)− 11 R . | | − d   Using R 1 = 0, the above simplifies to

T T 1 3/2 1 T E RAR = R E (A) R = I (d− (d + d)− )RR . | | − T T E T 1 3/2 1 Since RR = Id e1e1 , we have λ1( RAR )= I (d− (d +d)− ). Applying the matrix Chernoff inequality [Tro15, Theorem− 5.1.1] we obtain | | −

1 3/2 1 T 1 3/2 1 Pr λ (A) 2 I (d− (d + d)− ) Pr λ (RAR ) 2 I (d− (d + d)− ) 2 ≥ | | − ≤ 1 ≥ | | −    I (d−1+(d3/2+d)−1)  d(e/4)| | . (53) ≤ The above probability is arbitrary small for large enough d and n & d3 log d. The bound follows.

T −1 A lower bound on ζ A ζ. Let u1,...,ud be an orthonormal basis of eigenvectors of A. Using the spectral decomposition and d u , ζ 2 = ζ 2, we have i=1 h i i k k P d d T 1 2 1 1 2 1 2 2 ζ A− ζ = u , ζ λ (A)− λ (A)− u , ζ = λ (A)− ( ζ u , ζ ). h i i i ≥ 2 h i i 2 k k − h 1 i i=1 i=2 X X 3/2 2 By (51) we have λ (A) . nd− and by above computations ζ n. Therefore, the claim immediately follows 2 k k ∼ if we prove that u , ζ 2 n. Note that h 1 i ≪ λ (A) u , ζ = Au , ζ = (A E A)u , ζ + (E A)u , ζ 1 h 1 i h 1 i h − 1 i h 1 i implies 2 2 2 u , ζ λ (A)− ( A E A ζ + (E A)u , ζ ) . (54) h 1 i ≤ 1 k − kk k | h 1 i | Recall that 1 3/2 1 3/2 1 T E A = I (d− (d + d)− )I + (d + d)− 11 and 1, ζ =0. | | − d h i We have   1 3/2 1 1 3/2 1 (E A)u , ζ = I (d− + (d + d)− ) u , ζ I (d− + (d + d)− ) ζ . | h 1 i | | | | h 1 i | ≤ | | k k Using that (X XT)2 = X XT for i I, we have i i i i ∈ I I | | | | T T 2 T 2 1/2 E X X E X X E X X = E A 2 I d− . i i − i i ≤ i i k k≤ | | i=1 i=1 X  X 

Applying the matrix Bernstein inequality [Tro15 , Theorem 6.6.1] we obtain

2 2 1 I d− /2 Pr A E A I d− d exp | | , (55) k − k ≥ | | ≤ −2 I d 1/2 + d/3  −   | | where the above probability is arbitrary small for large enough d and n & d3 log d. Hence, (54) gives with sufficient probability

2 2 2 2 2 2 2 2 2 2 u , ζ 2λ (A)− ( A E A +2 I d− ) ζ . λ (A)− I d− ζ . n/d. h 1 i ≤ 1 k − k | | k k 1 | | k k The claim follows.

34 T −1 An upper bound on 1 A 1. As before let u1,...,ud be an orthonormal basis of eigenvectors of A. Using the lower bound (50) and d u , 1 2 = d u , 1 2, we have i=2 h i i − h 1 i Pd T 1 2 1 1 A− 1 = u , 1 λ (A− ) h i i i i=1 X u , 1 2 /λ (A) + (d u , 1 2)/λ (A) d3/2/ I + (d u , 1 2)/λ (A). (56) ≤ h 1 i 1 − h 1 i d ≤ | | − h 1 i d 2 We want to provide an upper bound on d u1, 1 . By(53) we have that with sufficient probability λj (A) 1 − h i 1/2 ≤ 2 I d− for j = 2,...,d and therefore, for the same values of j we have λj (A)/λ1(A) 2d− . Using the last fact| | we have ≤ T d 1 A1 2 2 λi(A) 2 1/2 2 = u1, 1 + ui, 1 u1, 1 +2d− (d u1, 1 ). λ1(A) h i h i λ1(A) ≤ h i − h i i=2 X The last inequality combined with the fact that 1 is the first eigenvector of E A implies for d 16, ≥ T T E T 2 1/2 1 1 A1 1 A1 1 A1 d u , 1 (1 2d− )− d 2 (57) − h 1 i ≤ − − λ (A) ≤ λ (E A) − λ (A)  1   1 1  Finally, since λ (A) λ (E A) A E A and by the lower bound (50) we have | 1 − 1 |≤k − k T T T 1 E A1 1 A1 T 1 1 1 (E A A)1 = 1 E A1 + − λ (E A) − λ (A) λ (E A) − λ (A) λ (A) 1 1  1 1  1 2d A E A 2d3/2 A E A k − k k − k. (58) ≤ λ (A) ≤ I 1 | | 1 By (55), we have that with sufficient probability A E A . I d− . Therefore, combining this with (57) we have with sufficient probability k − k | | d u , 1 2 . √d. − h 1 i 1 Plugging the above inequality into (56) and using our lower bound λ (A) & I d− we prove the claim. d | |

T −2 T 1 An upper bound on 1 A 1. The proof is completely analogous to the case 1 A− 1. We have T 2 2 2 2 2 1 A− 1 u , 1 /(λ (A)) + (d u , 1 )/(λ (A)) ≤ h 1 i 1 − h 1 i d d2/ I 2 + (d u , 1 2)/(λ (A))2. ≤ | | − h 1 i d 2 2 2 3 As before, with sufficient probability we have d / I . n− d . The only difficulty is that we need a slightly sharper variant of the upper bound on A E A .| Recalling| the bound (55), by the matrix Bernstein inequality [Tro15, Theorem 6.6.1] we have k − k

2 3 E 3/2 I d− /2 Pr A A I d− d exp | | 1/2 , k − k ≥ | | ≤ −4 I d− + d/3    | |  which is arbitrarily small provided that d is large enough and n & d3 log d. Observe that this is the step where we have our strongest requirement on n. Note that using that by matrix Chernoff inequality, as shown above in the proof that A is invertible, we have with sufficient probability:

1 λ (A) & I d− . d | | Using (57), (58) and the two inequalities above, we conclude that the following holds with sufficient probability:

2 3/2 7/2 3/2 2 3 d u , 1 d A E A / I d I d− d d − h 1 i . k − k | | . | | = . . λ (A)2 I 2d 2 I 3 I 2 n2 d | | − | | | | The proof of our result is complete.

35 E Proof of the Theorem by Forster and Warmuth

For the sake of completeness, in this section we present the leave-one-out analysis due to [FW02]. Following the notation used in the proof of Theorem 2.2, we introduce an additional independent element (Xn+1, Yn+1) distributed according to P to the sample Sn+1. With a slight abuse of notation, we define for j =1,...,n +1 the unnormalized and unpenalized (as opposed to (7)) empirical second moment matrices by

n+1 n+1 ˜ T ˜ ( j) T Σ= XiXi and Σ − = XiXi . i=1 i=1,i=j X X6

Given the sample Sn+1, fix the minimum ℓ2 norm ERM defined as follows:

n+1 ω = Σ˜ † YiXi , ∞ i=1  X  b where recall that Σ˜ † denotes the Moore-Penrose inverse of the matrix Σ.˜ For i =1,...,n + 1, let hi denote the i-th leverage score: T ˜ hi = Xi Σ†Xi. ( j) ( j) Let f − denote the non-linear estimator trained on the n samples Sn+1 (Xj , Yj ) . By the definition of f − , we have \{ } b b n+1 ( j) ( j) T † f − (X )=(1 h ) Σ˜ − + X X Y X ,X j − j j j  i i j * i=1,i=j +   X6 b n+1   = (1 h ) Σ˜ † Y X Σ˜ †Y X ,X − j i i − j j j * i=1 ! + X = (1 hj )( ω ,Xj hj Yj ). (59) − h ∞ i− The analysis in [FW02] presented below is akin to the one used in our ridge regression proof (cf. Appendix A), b albeit with one simplifying modification. Instead of decomposing the excess risk into the stability and fitting terms, we use the following leave-one-out decomposition:

2 E ( (n+1)) Sn+1 f − (Xn+1) Yn+1 inf R(ω) − − ω Rd   ∈ n+1 2 n E b 1 ( i) E 1 2 = Sn+1 f − (Xi) Yi inf Sn+1 ( ω,Xi Yi) n +1 − − ω Rd n +1 h i− i=1 ! i=1 ! X   ∈ X n+1 b 2 n E 1 ( i) E 1 2 Sn+1 f − (Xi) Yi Sn+1 inf ( ω,Xi Yi) ≤ n +1 − − ω Rd n +1 h i− i=1 ! i=1 ! X   ∈ X n+1 b 2 n E 1 ( i) E 1 2 = S +1 f − (Xi) Yi S +1 ( ω ,Xi Yi) n n +1 − − n n +1 h ∞ i− i=1 ! i=1 ! X   X n+1 b 2 b E 1 ( i) 2 = S +1 f − (Xi) Yi ( ω ,Xi Yi) . (60) n n +1 − − h ∞ i− i=1 ! X   b b

36 Plugging in (59) into the above summands, we obtain

2 ( i) 2 f − (Xi) Yi ( ω ,Xi Yi) − − h ∞ i−   2 2 2 = (1 hj ) ω ,Xj (1 + hj hj )Yi ( ω ,Xi Yi) b − h ∞ i−b − − h ∞ i− 2 2 2 = (1 hj ) 1 ω ,Xj 2 (1 hj )(1 + hj hj ) 1 ω ,Xj Yj − − h ∞ i − − − − h ∞ i b 2 2 2 b + ((1 + hj hj ) 1)Yj .  − b − b If hj = 0, then the above expression is equal to zero. Assume that hj > 0 (hence, hj (0, 1]). Then, the 2 ∈ coefficient preceding ω ,Xj is negative; optimizing the quadratic equation we have h ∞ i 2 2 2 (1 hj ) 1 ω ,Xj 2 (1 hj )(1 + hj hj ) 1 ω ,Xj Yj b − − h ∞ i − − − − h ∞ i + ((1 + h h2)2 1)Y 2 j − j − j  b 2 2 b (1 hj )(1 + hj h ) 1 − − j − Y 2 + ((1 + h h2)2 1)Y 2 ≤ 1 (1 h )2 j j − j − j − − j  = (2h h2)Y 2h Y 2. j − j j ≤ j j Plugging the above into (60) yields:

2 E ( (n+1)) Sn+1 f − (Xn+1) Yn+1 inf R(ω) − − ω Rd   ∈ n+1 2 E b 1 ( i) 2 S +1 f − (Xi) Yi ( ω ,Xi Yi) ≤ n n +1 − − h ∞ i− i=1 ! X   n+1 1 b 2dm2 b E 2h Y 2 , ≤ Sn+1 n +1 i i ≤ n +1 i=1 X n+1 where in the last line we used the fact that hi = rank(Σ)˜ d. The proof is complete. i=1 ≤ P

37