
ALL-IN-ONE ROBUST ESTIMATOR OF THE GAUSSIAN MEAN

BY ARNAK S. DALALYAN1 AND ARSHAK MINASYAN2

1ENSAE-CREST, [email protected]

2Yerevan State University, YerevaNN, [email protected]

The goal of this paper is to show that a single robust estimator of the mean of a multivariate Gaussian distribution can enjoy five desirable properties. First, it is computationally tractable in the sense that it can be computed in a time which is at most polynomial in dimension, sample size and the logarithm of the inverse of the contamination rate. Second, it is equivariant by translations, uniform scaling and orthogonal transformations. Third, it has a high breakdown point equal to 0.5, and a nearly-minimax-rate-breakdown point approximately equal to 0.28. Fourth, it is minimax rate optimal, up to a logarithmic factor, when data consists of independent observations corrupted by adversarially chosen outliers. Fifth, it is asymptotically efficient when the rate of contamination tends to zero. The estimator is obtained by an iterative reweighting approach. Each sample point is assigned a weight that is iteratively updated by solving a convex optimization problem. We also establish a dimension-free non-asymptotic risk bound for the expected error of the proposed estimator. It is the first result of this kind in the literature and involves only the effective rank of the covariance matrix. Finally, we show that the obtained results can be extended to sub-Gaussian distributions, as well as to the cases of unknown rate of contamination or unknown covariance matrix.

CONTENTS

1 Introduction
2 Desirable properties of a robust estimator
3 Iterative reweighting approach
4 Relation to prior work and discussion
5 Formal statement of main building blocks
6 Sub-Gaussian distributions, high-probability bounds and adaptation
7 Empirical results
8 Postponed proofs

AMS 2000 subject classifications: Primary 62H12; secondary 62F35.
Keywords and phrases: Gaussian mean, robust estimation, breakdown point, minimax rate, computational tractability.

arXiv:2002.01432v2 [math.ST] 4 Mar 2021

1. Introduction. Robust estimation is one of the most fundamental problems in statistics. Its goal is to design efficient methods capable of processing data sets contaminated by outliers, so that these outliers have little influence on the final result. The notion of an outlier is hard to define for a single data point. It is also hard, inefficient and often impossible to clean data by removing the outliers. Instead, one can build methods that take as input the contaminated data set and provide as output an estimate which is not very sensitive to the contamination.

Recent advances in data acquisition and computational power provoked a revival of interest in robust estimation and learning, with a focus on finite sample results and computationally tractable procedures. This was in contrast to the more traditional studies analyzing asymptotic properties of such statistical methods.

This paper builds on recent advances made in robust estimation and suggests a method that has attractive properties both from asymptotic and finite-sample points of view. Furthermore, it is computationally tractable and its statistical complexity depends optimally on the dimension. As a matter of fact, we even show that what really matters is the intrinsic dimension, defined in the Gaussian model as the effective rank of the covariance matrix.

Note that in the framework of robust estimation, the high-dimensional setting is qualitatively different from the one-dimensional setting. This qualitative difference can be shown at two levels. First, from a computational point of view, the running time of several robust methods scales poorly with dimension. Second, from a statistical point of view, while a simple "remove then average" strategy might be successful in low-dimensional settings, it can easily be seen to fail in the high-dimensional case. Indeed, assume that for some ε ∈ (0, 1/2), p ∈ N, and n ∈ N, the data X_1, ..., X_n consist of n(1 − ε) points (inliers) drawn from a p-dimensional Gaussian distribution N_p(0, I_p) (where I_p is the p × p identity matrix) and εn points (outliers) equal to a given vector u. Consider an idealized setting in which, for a given threshold r > 0, an oracle tells the user whether or not X_i is within a distance r of the true mean 0. A simple strategy for robust mean estimation consists of removing all the points of Euclidean norm larger than 2√p and averaging all the remaining points. If the norm of u is equal to √p, one can check that the distance between this estimator and the true mean µ = 0 is of order √(p/n) + ε||u||_2 = √(p/n) + ε√p. This error rate is provably optimal in the small-dimensional setting p = O(1), but suboptimal as compared to the optimal rate √(p/n) + ε when the dimension p is not constant. The reason for this suboptimality is that the individually harmless outliers, lying close to the bulk of the point cloud, have a strong joint impact on the quality of estimation.

We postpone a review of the relevant prior work to Section 4 in order to ease comparison with our results, and proceed here with a summary of our contributions. In the context of a data set subject to a fully adversarial corruption, we introduce a new estimator of the Gaussian mean that enjoys the following properties (the precise meaning of these properties is given in Section 2):
• it is computable in polynomial time,
• it is equivariant with respect to similarity transformations (translations, uniform scaling and orthogonal transformations),
• it has a high (minimax) breakdown point: ε* = (5 − √5)/10 ≈ 0.28,
• it is minimax-rate-optimal, up to a logarithmic factor,
• it is asymptotically efficient when the rate of contamination tends to zero,
• for inhomogeneous covariance matrices, it achieves a better sample complexity than all the other previously studied methods.

In order to keep the presentation simple, all the aforementioned results are established in the case where the inliers are drawn from the Gaussian distribution. We then show that the extension to a sub-Gaussian distribution can be carried out along the same lines. Furthermore, we prove that using Lepski's method, one can get rid of the knowledge of the contamination rate. More precisely, we establish that the rate √(p/n) + ε√log(1/ε) can be achieved without any information on ε other than ε < (5 − √5)/10 ≈ 0.28. Finally, we prove that the same order of magnitude of the estimation error is achieved when the covariance matrix Σ is unknown but isotropic (i.e., proportional to the identity matrix). When the covariance matrix is an arbitrary unknown matrix with bounded operator norm, our estimator has an error of order √(p/n) + √ε, which is the best known rate of estimation by a computationally tractable procedure in the case of unknown covariance matrices.

The rest of this paper is organized as follows. We complete this introduction by presenting the notation used throughout the paper. Section 2 describes the problem setting and provides the definitions of the properties of robust estimators such as rate optimality or breakdown point. The iteratively reweighted mean estimator is introduced in Section 3. This section also contains the main facts characterizing the iteratively reweighted mean estimator along with their high-level proofs. A detailed discussion of the relation to prior work is included in Section 4. Section 5 is devoted to a formal statement of the main building blocks of the proofs. Extensions to the cases of sub-Gaussian distributions, unknown ε and Σ are examined in Section 6. Some empirical results illustrating our theoretical claims are reported in Section 7. Postponed proofs are gathered in Section 8 and in the appendix.

For any vector v, we use the norm notations ||v||_2 for the standard Euclidean norm, ||v||_1 for the sum of absolute values of the entries and ||v||_∞ for the largest entry of v in absolute value. The tensor product of v by itself is denoted by v⊗2 = vv^⊤. We denote by ∆^{n−1} and by S^{n−1}, respectively, the probability simplex and the unit sphere in R^n. For any symmetric matrix M, λ_max(M) is the largest eigenvalue of M, while λ_max,+(M) is its positive part. The operator norm of M is denoted by ||M||_op. We will often use the effective rank r_M defined as Tr(M)/||M||_op, where Tr(M) is the trace of the matrix M. For symmetric matrices A and B of the same size we write A ⪰ B if the matrix A − B is positive semidefinite. For a rectangular p × n matrix A, we let s_min(A) and s_max(A) be the smallest and the largest singular values of A, defined respectively as s_min(A) = inf_{v∈S^{n−1}} ||Av||_2 and s_max(A) = sup_{v∈S^{n−1}} ||Av||_2. The set of all p × p positive semidefinite matrices is denoted by S^p_+.

2. Desirable properties of a robust estimator. We consider the setting in which the sample points are corrupted versions of independent and identically distributed random vectors drawn from a p-variate Gaussian distribution with mean µ* and covariance matrix Σ. In what follows, we will assume that the rate of contamination and the covariance matrix are known and, therefore, can be used for constructing an estimator of µ*. We present in Section 6 some additional results which are valid under relaxations of this assumption.

DEFINITION 1. We say that the distribution P_n of the data X_1, ..., X_n is Gaussian with adversarial contamination, denoted by P_n ∈ GAC(µ*, Σ, ε) with ε ∈ (0, 1/2) and Σ ⪰ 0, if there is a set of n independent and identically distributed random vectors Y_1, ..., Y_n drawn from N_p(µ*, Σ) satisfying

|{ i : X_i ≠ Y_i }| ≤ εn.

In what follows, the sample points X_i with indices in the set O = {i : X_i ≠ Y_i} are called outliers, while all the other sample points are called inliers. We define I = {1, ..., n} \ O, the set of inliers. Assumption GAC allows both the set of outliers O and the outliers themselves to be random and to depend arbitrarily on the values of Y_1, ..., Y_n. In particular, we can consider a game in which an adversary has access to the clean observations Y_1, ..., Y_n and is allowed to modify an ε fraction of them before unveiling them to the Statistician. The Statistician aims at estimating µ* as accurately as possible, the accuracy being measured by the expected estimation error:
R_{P_n}(µ̂_n, µ*) = ||µ̂_n − µ*||_{L_2(P_n)} = ( E_{P_n}[ Σ_{j=1}^p (µ̂_{n,j} − µ*_j)² ] )^{1/2}.

Thus, the goal of the adversary is to apply a contamination that makes the task of estimation the hardest possible. The goal of the Statistician is to find an estimator µ̂_n that minimizes the worst-case risk
R_max(µ̂_n, Σ, ε) = sup_{µ*∈R^p} sup_{P_n∈GAC(µ*,Σ,ε)} R_{P_n}(µ̂_n, µ*).

Let r_Σ = Tr(Σ)/||Σ||_op be the effective rank of Σ. The theory developed by [Chen et al., 2016, 2018], in conjunction with [Bateni and Dalalyan, 2020, Prop. 1], implies that
inf_{µ̂_n} R_max(µ̂_n, Σ, ε) ≥ c ||Σ||_op^{1/2} ( √(r_Σ/n) + ε )   (1)
for some constant c > 0, where the infimum is over all measurable functions of (X_1, ..., X_n). A detailed proof of this claim is presented in Appendix D. This lower bound naturally leads to the following definition.

DEFINITION 2. We say that the estimator µ̂_n is minimax rate optimal (in expectation), if there are universal constants c_1, c_2 and C such that
R_max(µ̂_n, Σ, ε) ≤ C ||Σ||_op^{1/2} ( √(r_Σ/n) + ε )
for every (n, Σ, ε) satisfying r_Σ ≤ c_1 n and ε ≤ c_2.

The iteratively reweighted mean estimator, introduced in the next section, is not minimax rate optimal but is very close to being so. Indeed, we will prove that it is minimax rate optimal up to a √log(1/ε) factor in the second term (ε is replaced by ε√log(1/ε)). It should be stressed here that, to the best of our knowledge, none of the results on robust estimation of the Gaussian mean provides rate-optimality in expectation in the high-dimensional setting. Indeed, all those results provide risk bounds that hold with high probability, and either (a) do not say anything about the magnitude of the error on a set of small but strictly positive probability or (b) use the confidence level in the construction of the estimator. Both of these shortcomings prevent one from extracting bounds for the expected loss from high-probability bounds. This being said, it should be noted that most prior work has focused on the Huber contamination, in which case no meaningful (i.e., different from +∞, see Section 2.6 in [Bateni and Dalalyan, 2020]) upper bound on the minimax risk in expectation can be obtained.

DEFINITION 3. We say that µ̂_n is an asymptotically efficient estimator of µ*, if, when ε = ε_n tends to zero sufficiently fast as n tends to infinity, we have
R_max(µ̂_n, Σ, ε) ≤ ||Σ||_op^{1/2} √(r_Σ/n) (1 + o_n(1)).

If we compare the inequalities in Definition 2 and Definition 3, we can see that the constant C present in the former disappeared in the latter. This reflects the fact that asymptotic efficiency implies not only rate-optimality, but also the optimality of the constant factor. We recall here that in the outlier-free situation, the sample mean is asymptotically efficient and its worst-case risk is equal to ||Σ||_op^{1/2} √(r_Σ/n). One can infer from (1) that a necessary condition for the existence of an asymptotically efficient estimator is ε_n² = o_n(r_Σ/n). We show in the next section that this condition is almost sufficient, by proving that the iteratively reweighted mean estimator is asymptotically efficient provided that ε_n² log(1/ε_n) = o_n(r_Σ/n).

The last notion that we introduce in this section is the breakdown point, the term being coined by Hampel [1968], see also [Donoho and Huber, 1983]. Roughly speaking, the breakdown point of a given estimator is the largest proportion of outliers that the estimator can support without becoming infinitely large. The definition we provide below slightly differs from the original one. This difference is motivated by our goal to focus on studying the expected risk of robust estimators.

DEFINITION 4. We say that ε*_n ∈ [0, 1/2] is the (finite-sample) breakdown point of the estimator µ̂_n, if
R_max(µ̂_n, Σ, ε) < +∞ for every ε < ε*_n,
and R_max(µ̂_n, Σ, ε) = +∞ for every ε > ε*_n.

One can check that the breakdown points of the componentwise and the geometric median (see the definition of µ̂^GM_n in (4) below) equal 1/2. Unfortunately, the minimax rate of these methods is strongly suboptimal, see [Chen et al., 2018, Prop. 2.1] and [Lai et al., 2016, Prop. 2.1]. Among all rate-optimal (up to a polylogarithmic factor) robust estimators, Tukey's median is¹ the one with the highest known breakdown point, equal to 1/3 [Donoho and Gasko, 1992]. It should be noted that the latter paper deals with the original definition of the breakdown point which, as already mentioned, is slightly different from that of Definition 4.
The notion of breakdown point given in Definition 4, well adapted to estimators that do not rely on the knowledge of ε, becomes less relevant in the context of known ε. Indeed, if a given estimator µ̂_n(ε) is proved to have a breakdown point equal to 0.1, one can consider instead the estimator µ̃_n(ε) = µ̂_n(ε)1(ε < 0.1) + µ̂^GM_n 1(ε ≥ 0.1), which will have a breakdown point equal to 0.5. For this reason, it appears more appealing to consider a different notion that we call the rate-breakdown point, and which is of the same flavor as the δ-breakdown point defined in [Chen et al., 2016].

DEFINITION 5. We say that ε*_r ∈ [0, 1/2] is the r(n, Σ, ε)-breakdown point of the estimator µ̂_n for a given function r defined on N × S^p_+ × [0, 1/2), if for every ε < ε*_r,
sup_{n,p} R_max(µ̂_n(ε), Σ, ε) / r(n, Σ, ε) < +∞,
and ε*_r is the largest value satisfying this property.

In the context of Gaussian mean estimation, if the previous definition is applied with r(n, Σ, ε) = ||Σ||_op^{1/2}( √(r_Σ/n) + ε ), we call the corresponding value the minimax-rate-breakdown point. Similarly, if r(n, Σ, ε) = ||Σ||_op^{1/2}( √(r_Σ/n) + ε√log(1/ε) ), we call the corresponding value the nearly-minimax-rate-breakdown point. It should be mentioned here that the extra √log(1/ε) factor cannot be avoided by any Statistical Query (SQ) polynomial-time algorithm, as shown by the SQ lower bound established in [Diakonikolas et al., 2017].

¹Recent results in [Zhu et al., 2020] suggest that an estimator based on the TV-projection may achieve the optimal breakdown point of 1/2.

Algorithm 1: Iteratively reweighted mean estimator (known ε and Σ)

Input: data X_1, ..., X_n ∈ R^p, contamination rate ε and covariance matrix Σ
Output: parameter estimate µ̂^IR_n
Initialize: compute µ̂^0 as a minimizer of Σ_{i=1}^n ||X_i − µ||_2
Set K = 0 ∨ ⌈ (log(4r_Σ) − 2 log(ε(1 − 2ε))) / (2 log(1 − 2ε) − log ε − log(1 − ε)) ⌉.
For k = 1 : K
    Compute current weights:
        w ∈ arg min_{(n−nε)||w||_∞ ≤ 1} λ_max( Σ_{i=1}^n w_i (X_i − µ̂^{k−1})⊗2 − Σ ) ∨ 0.
    Update the estimator: µ̂^k = Σ_{i=1}^n w_i X_i.
EndFor
Return µ̂^K.

3. Iterative reweighting approach. In this section, we define the iterative reweighting estimator that will later be proved to enjoy all the desirable properties. To this end, we set
X̄_w = Σ_{i=1}^n w_i X_i,   G(w, µ) = λ_max,+( Σ_{i=1}^n w_i (X_i − µ)⊗2 − Σ )   (2)
for any pair of vectors w ∈ [0, 1]^n and µ ∈ R^p. The main idea of the proposed method is to find a weight vector ŵ_n belonging to the probability simplex
∆^{n−1} = { w ∈ [0, 1]^n : w_1 + ... + w_n = 1 }
that mimics the ideal weight vector w* defined by w*_j = 1(j ∈ I)/|I|, so that the weighted average X̄_{ŵ_n} is nearly as close to µ* as the average of the inliers. Note that, for any weight vector w ∈ ∆^{n−1} and any vector µ ∈ R^p, we have
Σ_{i=1}^n w_i (X_i − µ)⊗2 = Σ_{i=1}^n w_i (X_i − X̄_w)⊗2 + (X̄_w − µ)⊗2 ⪰ Σ_{i=1}^n w_i (X_i − X̄_w)⊗2.

This readily yields that G(w, µ) ≥ G(w, X̄_w), and, therefore,
G(w, X̄_w) = min_{µ∈R^p} G(w, µ),   ∀w ∈ ∆^{n−1}.   (3)
The precise definition of the proposed estimator is as follows. We start from an initial estimator µ̂^0 of µ*. To give a concrete example, and also in order to guarantee equivariance by similarity transformations, we assume that µ̂^0 is the geometric median:
µ̂^0 = µ̂^GM_n ∈ arg min_{µ∈R^p} Σ_{i=1}^n ||X_i − µ||_2.   (4)
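For concreteness, the criterion G in (2) can be evaluated directly. The snippet below is a minimal numpy sketch of ours (not part of the paper); the names G, X, Sigma are illustrative.

```python
import numpy as np

def G(w, mu, X, Sigma):
    """Evaluate G(w, mu) = lambda_max,+( sum_i w_i (X_i - mu)(X_i - mu)^T - Sigma )."""
    D = X - mu                                   # rows are X_i - mu
    S = (D * w[:, None]).T @ D                   # weighted second-moment matrix
    return max(np.linalg.eigvalsh(S - Sigma)[-1], 0.0)   # positive part of largest eigenvalue
```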

DEFINITION 6. We call iteratively reweighted mean estimator, denoted by µ̂^IR_n, the K-th element of the sequence {µ̂^k; k = 0, 1, ...} starting from µ̂^0 in (4) and defined by the recursion
ŵ^k ∈ arg min_{(n−nε)||w||_∞ ≤ 1} G(w, µ̂^{k−1}),   µ̂^k = X̄_{ŵ^k},   (5)
where the minimum is over all weight vectors w ∈ ∆^{n−1} satisfying max_j w_j ≤ 1/(n − nε) and the number of iterations is
K = 0 ∨ ⌈ (log(4r_Σ) − 2 log(ε(1 − 2ε))) / (2 log(1 − 2ε) − log ε − log(1 − ε)) ⌉.   (6)

[Figure 1 plots the number of iterations K against the contamination level ε for p = 10², 10⁵, 10¹⁰.]
FIG 1. The behavior of the number of iterations K = K_ε, given by (6), as a function of the contamination rate ε for different values of the dimension p = r_Σ.
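As a quick numerical illustration of (6) (a computation of ours, not taken from the paper): for r_Σ = 100 and ε = 0.1, one gets
K = ⌈ (log 400 − 2 log 0.08) / (2 log 0.8 − log 0.1 − log 0.9) ⌉ ≈ ⌈ 11.04 / 1.96 ⌉ = 6 iterations.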

The idea of computing a weighted mean, with weights measuring the outlyingness of the observations, goes back at least to [Donoho, 1982, Stahel, 1981]. Perhaps the first idea similar to that of minimizing the largest eigenvalue of the covariance matrix was that of minimizing the determinant of the sample covariance matrix over all subsamples of a given cardinality [Rousseeuw, 1985, 1984]. It was also observed in [Lopuhaä and Rousseeuw, 1991] that one can improve the estimator by iteratively updating the weights. An overview of these results can be found in [Rousseeuw and Hubert, 2013].
Note that the value of K provided above is tailored to the case where the initial estimator is the geometric median. Clearly, K depends only logarithmically on the dimension and K = K_ε tends to 2 when ε goes to zero, see Figure 1. Note also that the choice of K in (6) is derived from the condition (√(ε(1 − ε))/(1 − 2ε))^K ||µ̂^0 − µ*||_{L_2} ≤ ε, see (10) below. We can use any other initial estimator of µ* instead of the geometric median, provided that K is large enough to satisfy the last inequality.
We have to emphasize that µ̂^IR_n relies on the knowledge of both ε and Σ (the dependence on Σ is through the effective rank, which coincides with the dimension for non-degenerate covariance matrices). Indeed, the number of iterations K depends on both ε and Σ. Additionally, Σ is used in the cost function G(w, µ) and ε is used for specifying the set of feasible weights in the optimization problem (5). We present some extensions to the case of unknown ε and Σ in Section 6.
The rest of this section is devoted to showing that the iteratively reweighted estimator enjoys all the desirable properties announced in the introduction. An estimator is called computationally tractable if its computational complexity is at most polynomial in n, p and 1/ε.

Fact 1

The estimator µ̂^IR_n is computationally tractable.

In order to check computational tractability, it suffices to prove that each iteration of the algorithm can be performed in polynomial time. Since the number of iterations depends logarithmically on r_Σ ≤ p, this will suffice. Note now that the optimization problem in (5) is convex and can be cast into a semi-definite program. Indeed, it is equivalent to minimizing a real value t over all the pairs (t, w) satisfying the constraints
t ≥ 0,   w ∈ ∆^{n−1},   ||w||_∞ ≤ 1/(n(1 − ε)),   Σ_{i=1}^n w_i (X_i − µ̂^{k−1})⊗2 ⪯ Σ + t I_p.
The first two constraints can be rewritten as a set of linear inequalities, while the third constraint is a linear matrix inequality. Given the special form of the cost function and the constraints, it is possible to design specific optimization routines which will find an approximate solution to the problem in a faster way than off-the-shelf SDP solvers. However, we will not pursue this line of research in this work.
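The following is a minimal Python sketch of ours of Algorithm 1, casting the weight update as the convex problem above and solving it with cvxpy and the SCS solver; it is not the authors' implementation (Section 7 mentions an R/MOSEK one). The function names are illustrative, and the geometric-median initialization is approximated by Weiszfeld iterations.

```python
import math
import numpy as np
import cvxpy as cp

def weiszfeld(X, n_iter=100, tol=1e-8):
    """Approximate geometric median of the rows of X by Weiszfeld iterations."""
    mu = np.median(X, axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), tol)   # avoid division by zero
        mu_new = (X / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

def iteratively_reweighted_mean(X, eps, Sigma):
    """Sketch of Algorithm 1; requires 0 < eps < (5 - sqrt(5))/10."""
    n, p = X.shape
    r_Sigma = np.trace(Sigma) / np.linalg.norm(Sigma, 2)       # effective rank
    denom = 2 * math.log(1 - 2 * eps) - math.log(eps) - math.log(1 - eps)
    K = max(0, math.ceil((math.log(4 * r_Sigma) - 2 * math.log(eps * (1 - 2 * eps))) / denom))
    mu = weiszfeld(X)                                          # initial estimator
    for _ in range(K):
        w = cp.Variable(n)
        D = X - mu
        # weighted covariance sum_i w_i (X_i - mu)(X_i - mu)^T (naive, not the fastest construction)
        S = sum(w[i] * np.outer(D[i], D[i]) for i in range(n))
        constraints = [w >= 0, cp.sum(w) == 1, w <= 1.0 / (n * (1 - eps))]
        cp.Problem(cp.Minimize(cp.lambda_max(S - Sigma)), constraints).solve(solver=cp.SCS)
        mu = w.value @ X                                       # weighted mean update
    return mu
```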

Fact 2

The estimator µ̂^IR_n is translation, uniform scaling and orthogonal transformation equivariant.

The equivariance mentioned in this statement should be understood as follows. If we denote by µ̂^IR_{n,X} the estimator computed for data X_1, ..., X_n, and by µ̂^IR_{n,X'} the one computed for data X'_1, ..., X'_n, with X'_i = a + λUX_i, where a ∈ R^p, λ > 0 and U is a p × p orthogonal matrix, then µ̂^IR_{n,X'} = a + λUµ̂^IR_{n,X}. To prove this property, we first note that
min_{µ∈R^p} Σ_{i=1}^n ||X'_i − µ||_2 = λ min_{µ∈R^p} Σ_{i=1}^n ||X_i − λ^{−1}U^⊤(µ − a)||_2.
This implies that µ̂^GM_{n,X} = λ^{−1}U^⊤(µ̂^GM_{n,X'} − a), which is equivalent to µ̂^GM_{n,X'} = a + λUµ̂^GM_{n,X}. Therefore, the initial value of the recursion is equivariant. If we add to this the fact that² G_X(w, µ) = λ²G_{X'}(w, a + λUµ) for every (w, µ), we get the equivariance of µ̂^IR_n.

Fact 3
The breakdown point ε*_n and the nearly-minimax-rate-breakdown point ε*_r of µ̂^IR_n satisfy, respectively, ε*_n = 0.5 and ε*_r ≥ (5 − √5)/10 ≈ 0.28.

We prove later in this paper (see (14)) that if X_1, ..., X_n satisfy GAC(µ*, Σ, ε), there is a random variable Ξ depending only on ζ_i = Y_i − µ*, i = 1, ..., n, such that
||X̄_w − µ*||_2 ≤ (√(ε(1 − ε))/(1 − 2ε)) G(w, µ)^{1/2} + Ξ,   ∀µ ∈ R^p,   (7)
for every w ∈ ∆^{n−1} such that n(1 − ε)||w||_∞ ≤ 1. Inequality (7) is one of the main building blocks of the proof of Facts 3 to 5. This inequality, as well as inequalities (11) and (12) below, will be formally stated and proved in subsequent sections. To check Fact 3, we set

²We use here the notation G_X(w, µ) to make clear the dependence of G in (2) on the X_i's. We also stress that when the estimator is computed for the transformed data X'_i, the matrix Σ is naturally replaced by λ²UΣU^⊤.

α_ε = √(ε(1 − ε))/(1 − 2ε) (see also footnote³) and note that
||µ̂^k − µ*||_2 = ||X̄_{ŵ^k} − µ*||_2 ≤ α_ε G(ŵ^k, µ̂^{k−1})^{1/2} + Ξ
≤ α_ε G(w*, µ̂^{k−1})^{1/2} + Ξ
≤ α_ε ( G(w*, X̄_{w*}) + ||X̄_{w*} − µ̂^{k−1}||²_2 )^{1/2} + Ξ
≤ α_ε ( G(w*, µ*) + ||X̄_{w*} − µ̂^{k−1}||²_2 )^{1/2} + Ξ
≤ α_ε ||µ̂^{k−1} − µ*||_2 + Ξ̃,   (8)
where Ξ̃ = α_ε ( G(w*, µ*)^{1/2} + ||ζ̄_{w*}||_2 ) + Ξ. Unfolding this recursion, we get⁴

||µ̂^IR_n − µ*||_2 = ||µ̂^K − µ*||_2 ≤ α_ε^K ||µ̂^0 − µ*||_2 + Ξ̃/(1 − α_ε).   (9)
The geometric median µ̂^0 = µ̂^GM_n having a breakdown point equal to 1/2, we infer from the last display that the error of the iteratively reweighted estimator remains bounded after altering an ε-fraction of data points provided that α_ε < 1. This implies that the breakdown point is at least equal to the solution of the equation √(ε(1 − ε)) = 1 − 2ε; squaring both sides gives 5ε² − 5ε + 1 = 0, whose smaller root is (5 − √5)/10, which yields ε* ≥ (5 − √5)/10. Moreover, if ε ∈ [(5 − √5)/10, 1/2], then the number of iterations K equals zero and the iteratively reweighted mean coincides with the geometric median. Therefore, its breakdown point is 1/2.

Fact 4

The estimator µ̂^IR_n is nearly minimax rate optimal, in the sense that its worst-case risk is bounded by C||Σ||_op^{1/2}( √(r_Σ/n) + ε√log(1/ε) ), where C is a universal constant.

Without loss of generality, we assume that ||Σ||_op = 1 so that r_Σ = Tr(Σ). We can always reduce the initial problem to this case by considering the scaled data points X_i/||Σ||_op^{1/2} instead of X_i. Combining (9) and the triangle inequality, we get

||µ̂^IR_n − µ*||_{L_2} ≤ α_ε^K ||µ̂^GM_n − µ*||_{L_2} + ||Ξ̃||_{L_2}/(1 − α_ε).   (10)
It is not hard to check that ||µ̂^GM_n − µ*||_{L_2} ≤ 2√r_Σ/(1 − 2ε), see Lemma 2 below. Furthermore, the choice of K in (6) entails 2α_ε^K √r_Σ ≤ ε(1 − 2ε). This implies that

||µ̂^IR_n − µ*||_{L_2} ≤ ε + ||Ξ̃||_{L_2}/(1 − α_ε).
The last two building blocks of the proof are the following⁵ inequalities:
E[G(w*, µ*)] ≤ C ( 1 + √(r_Σ/n) ) √(r_Σ/n),   (11)

||Ξ||_{L_2} ≤ √(r_Σ/n)(1 + C√ε) + C√ε (r_Σ/n)^{1/4} + Cε√log(1/ε),   (12)

where C > 0 is a universal constant. In what follows, the value of C may change from one line to the other. We have
||Ξ̃||_{L_2} ≤ α_ε ( ||G(w*, µ*)^{1/2}||_{L_2} + ||ζ̄_{w*}||_{L_2} ) + ||Ξ||_{L_2}
≤ C√ε ( E^{1/2}[G(w*, µ*)] + √(r_Σ/n) ) + ||Ξ||_{L_2}
≤ C√ε (r_Σ/n)^{1/4} + √(r_Σ/n) + ||Ξ||_{L_2}
≤ √(r_Σ/n)(1 + C√ε) + C√ε (r_Σ/n)^{1/4} + Cε√log(1/ε)
≤ C√(r_Σ/n) + Cε√log(1/ε).   (13)

³See Section 8.1 for more detailed explanations.
⁴Here and in the sequel, α_ε^K stands for the K-th power of α_ε.
⁵Inequality (11) is [Koltchinskii and Lounici, 2017, Th. 4], while (12) is the claim of Proposition 2 below.

Returning to (9) and combining it with (13), we get the claim of Fact 4 for every ε ≤ ε_0, where ε_0 is any positive number strictly smaller than (5 − √5)/10. This also proves the second claim of Fact 3.

Fact 5

In the setting ε = ε_n → 0 so that ε² log(1/ε) = o_n(r_Σ/n) when n → ∞, the estimator µ̂^IR_n is asymptotically efficient.

The proof of this fact follows from (9) and (12). Indeed, if ε² log(1/ε) = o_n(r_Σ/n), then (12) implies that

||Ξ̃||²_{L_2} ≤ (1 + o_n(1)) r_Σ/n.
Injecting this bound in (9) and using the fact that ε tends to zero, we get the claim of Fact 5.

4. Relation to prior work and discussion. Robust estimation of a mean is a statistical problem studied by many authors for at least sixty years. It is impossible to give an overview of all existing results and we will not try to do it here. The interested reader may refer to the books [Maronna et al., 2006] and [Huber and Ronchetti, 2009]. We will rather focus here on some recent results that are the most closely related to the present work. Let us just recall that Huber and Ronchetti [2009] enumerate three desirable properties of a statistical procedure: efficiency, stability and breakdown. We showed here that the iteratively reweighted mean estimator possesses these features and, in addition, is equivariant and computationally tractable.
To the best of our knowledge, the form √(p/n) + ε of the minimax risk in the Gaussian mean estimation problem has been first obtained by Chen et al. [2018]. They proved that this rate holds with high probability for the Tukey median, which is known to be computationally intractable in the high-dimensional setting. The first nearly-rate-optimal and computationally tractable estimators have been proposed by Lai et al. [2016] and Diakonikolas et al. [2016]⁶. The methods analyzed in these papers are different, but they share the same idea: if for a subsample of points the empirical covariance matrix is sufficiently close to the theoretical one, then the arithmetic mean of this subsample is a good estimator of the theoretical mean. Our method is based on this idea as well, which is mathematically formalized in (7), see also Proposition 1 below. Further improvements in running times, up to obtaining a computational complexity linear in np in the case of a constant ε, are presented in [Cheng et al., 2019a]. Some lower bounds

suggesting that the log-factor in the term ε√log(1/ε) cannot be removed from the rate of computationally tractable estimators are established in [Diakonikolas et al., 2017]. In a slightly weaker model of corruption, Diakonikolas et al. [2018] propose an iterative filtering algorithm that achieves the optimal rate ε without the extra factor √log(1/ε). On a related note, [Collier and Dalalyan, 2019] shows that in a weaker contamination model termed parametric contamination, the carefully trimmed sample mean can achieve a better rate than that of the coordinatewise/geometric median. An overview of the recent advances on robust estimation with a focus on computational aspects can be found in [Diakonikolas and Kane, 2019]. Extensions of these methods to sparse mean estimation are developed in [Balakrishnan et al., 2017, Diakonikolas et al., 2019b]. All these results are proved to hold on an event with a prescribed probability, see [Bateni and Dalalyan, 2020] for a relation between results in expectation and those with high probability, as well as for the definitions of various types of contamination.
The proposed estimator shares some features with adaptive weights smoothing [Polzehl and Spokoiny, 2000]. Adaptive weights smoothing (AWS) iteratively updates the weights assigned to observations, similarly to Algorithm 1. The main difference is that the weights in AWS are not measuring the outlyingness but the relevance for interpolating a function at a given point. There are also many other statistical problems in which robust estimation has been recently revisited from the point of view of minimax rates. This includes scale and covariance matrix estimation [Chen et al., 2018, Comminges et al., 2020], matrix completion [Klopp et al., 2017], multivariate regression [Chinot, 2020, Dalalyan and Thompson, 2019, Gao, 2020], classification [Cannings et al., 2020, Li and Bradic, 2018], subspace clustering [Soltanolkotabi and Candès, 2012], community detection [Cai and Li, 2015], etc. Properties of robust M-estimators in high-dimensional settings are studied in [Elsener and van de Geer, 2018, Loh, 2017]. There is also an increasing body of literature on the robustness to heavy tailed distributions [Devroye et al., 2016, Lecué and Lerasle, 2019, Lecué and Lerasle, 2020, Lugosi and Mendelson, 2019, 2020, Minsker, 2018] and the computationally tractable methods in this context [Cherapanamjeri et al., 2019, Depersin and Lecué, 2019, Dong et al., 2019, Hopkins, 2018].
A potentially useful observation, from a computational standpoint, is that it is sufficient to solve the optimization problem in Equation (5) up to an error proportional to √(r_Σ/n) + √ε. Indeed, one can easily repeat all the steps in (8) to check that this optimization error does not alter the order of magnitude of the statistical error.

⁶See [Diakonikolas et al., 2019a] for the extended version.

5. Formal statement of main building blocks. The first building block, inequality (7), used in Section 3 to analyze the risk of µ̂^IR_n, upper bounds the error of estimating the mean by the error of estimating the covariance matrix. In order to formally state the result, we need some additional notations.
Let w ∈ ∆^{n−1} be a vector of weights and let I be a subset of {1, ..., n}. We use the notation w_I for the vector obtained from w by zeroing all the entries having indices outside I. Considering w as a probability on {1, ..., n}, we define w|I as the corresponding conditional probability on I, that is,
w|I ∈ ∆^{n−1},   (w|I)_i = (w_i/||w_I||_1) 1(i ∈ I).
We will make repeated use of the notation
X̄_w = Σ_{i=1}^n w_i X_i,   ξ̄_{w|I} = Σ_{i∈I} (w|I)_i ξ_i,   ζ̄_{w|I} = Σ_{i∈I} (w|I)_i ζ_i.
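As a tiny numpy illustration of ours of the restricted weights w_I and the conditional weights w|I (the arrays w and I below are hypothetical inputs, not from the paper):

```python
import numpy as np

w = np.array([0.1, 0.2, 0.3, 0.4])   # a weight vector in the probability simplex
I = np.array([0, 2])                 # indices of (presumed) inliers

w_I = np.zeros_like(w)
w_I[I] = w[I]                        # w_I: entries outside I zeroed out
w_cond = w_I / w_I.sum()             # w|I: conditional probability supported on I
```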

PROPOSITION 1. Let ζ_1, ..., ζ_n be a set of vectors such that ζ_i = X_i − µ* for every i ∈ I, where I is a subset of {1, ..., n}. For every weight vector w ∈ ∆^{n−1} such that Σ_{i∉I} w_i ≤ ε_w ≤ 1/2 and for every p × p matrix Σ, it holds that
||X̄_w − µ*||_2 ≤ (√ε_w/(1 − ε_w)) λ_max,+^{1/2}( Σ_{i=1}^n w_i (X_i − X̄_w)⊗2 − Σ ) + R(ζ, w, I),
with the remainder term
R(ζ, w, I) = √(2||Σ||_op ε_w) + √(2ε_w) λ_max,+^{1/2}( Σ_{i∈I} (w|I)_i(Σ − ζ_iζ_i^⊤) ) + (1 + √(2ε_w)) ||ζ̄_{w|I}||_2.

The proof of this result is postponed to the last section. In simple words, the claim of the proposition is that the estimation error of the weighted mean X̄_w is, up to a remainder term, governed by the quantity G(w, X̄_w)^{1/2}. It turns out that the remainder term is bounded by a small quantity uniformly in w and I, provided that these two satisfy suitable conditions. For I, it is enough to constrain the cardinality of its complement I^c = O. For w, it appears to be sufficient to assume that its sup-norm is small. In that respect, the following lemma plays a key role in the proof.

LEMMA 1. For any integer ℓ > 0, let W_{n,ℓ} be the set of all w ∈ ∆^{n−1} such that max_i w_i ≤ 1/ℓ. The following facts hold:
i) For every J ⊂ {1, ..., n} such that |J| ≥ ℓ, the uniform weight vector u^J ∈ W_{n,ℓ}.
ii) The set W_{n,ℓ} is the convex hull of the uniform weight vectors {u^J : |J| = ℓ}.
iii) For every convex mapping G : ∆^{n−1} → R, we have
sup_{w∈W_{n,ℓ}} G(w) = max_{|J|=ℓ} G(u^J),
where the last maximum is over all subsets J of cardinality ℓ of the set {1, ..., n}.
iv) If w ∈ W_{n,ℓ} then for any I such that |I| ≥ ℓ' > n − ℓ, we have w|I ∈ W_{n,ℓ+ℓ'−n}.

Let us denote by W_n(ε) the set W_{n,n(1−ε)}. This is exactly the feasible set in the optimization problem defining the iterations of Algorithm 1. It is clear that for w ∈ W_n(ε) and for |I^c| ≤ nε, we have Σ_{i∉I} w_i ≤ ε/(1 − ε). We now infer from Proposition 1 and (3) that
||X̄_w − µ*||_2 ≤ α_ε λ_max,+^{1/2}( Σ_{i=1}^n w_i (X_i − X̄_w)⊗2 − Σ ) + Ξ = α_ε inf_{µ∈R^p} G(w, µ)^{1/2} + Ξ,   (14)
with Ξ being the largest value of R(ζ, w, I) over all possible weights w ∈ W_n(ε) and subsets I ⊂ {1, ..., n} satisfying |I^c| ≤ nε. The second building block, formally stated in the next proposition, provides a suitable upper bound for the random variable Ξ.

PROPOSITION 2. Let R(ζ, w, I) be defined as in Proposition 1 and let ζ_1, ..., ζ_n be i.i.d. centered Gaussian random vectors with covariance matrix Σ satisfying λ_max(Σ) = 1. If ε ≤ 0.28, then the random variable
Ξ = sup_{|I|≥n(1−ε)} max_{w∈W_n(ε)} R(ζ, w, I)
satisfies, for a universal constant C > 0, the inequalities
||Ξ||_{L_2} ≤ √(r_Σ/n)(1 + C√ε) + C√ε (r_Σ/n)^{1/4} + Cε√log(1/ε),   (15)
||Ξ||_{L_2} ≤ √(p/n)(1 + 16√ε) + 5√(3ε) (p/n)^{1/4} + 32ε√log(2/ε),   (16)
where for the second inequality we assumed that p ≥ 2 and n ≥ p ∨ 4.

The second inequality is weaker than the first one, since obviously r_Σ ≤ p. However, the advantage of the second inequality is that it comes with explicit constants and shows that these constants are not excessively large. To close this section, we state a theorem that rephrases Fact 4 in a way that might be more convenient for future reference. Its proof is omitted, since it follows the lines of the proof of Fact 4 presented above.

THEOREM 1. Let µ̂^IR_n be the iteratively reweighted mean defined in Definition 6 and Algorithm 1. There is a universal constant C > 0 such that for any n, p ≥ 1 and for every ε < (5 − √5)/10, we have
sup_{µ*∈R^p} sup_{P_n∈GAC(µ*,Σ,ε)} E^{1/2}[ ||µ̂^IR_n − µ*||²_2 ] ≤ ( C||Σ||_op^{1/2} / (1 − 2ε − √(ε(1 − ε))) ) ( √(r_Σ/n) + ε√log(1/ε) ),
where P_n ∈ GAC(µ*, Σ, ε) means that the data points X_i are Gaussian with adversarial contamination, see Definition 1. If, in addition, p ≥ 2 and n ≥ p ∨ 10, then
sup_{µ*∈R^p} sup_{P_n∈GAC(µ*,Σ,ε)} E^{1/2}[ ||µ̂^IR_n − µ*||²_2 ] ≤ ( 10||Σ||_op^{1/2} / (1 − 2ε − √(ε(1 − ε))) ) ( √(p/n) + ε√log(1/ε) ).

To the best of our knowledge, this is the first result in the literature that provides an upper bound on the expected error of an outlier-robust estimator, which is of nearly optimal rate.

6. Sub-Gaussian distributions, high-probability bounds and adaptation. The risk bounds stated in Fact 4 and Fact 5 and formalized in Theorem 1 hold for the expected error under the condition that the reference distribution is Gaussian. Furthermore, the proposed procedure relies on the knowledge of both the contamination rate ε and the covariance matrix Σ. The goal of this section is to show how some of these restrictions can be alleviated.

6.1. High-probability risk bound for a sub-Gaussian reference distribution. As expected, the risk bounds established in previous sections can be extended to the case of sub-Gaussian distributions. Furthermore, risk bounds holding with high probability can be proved using the same techniques as those employed for proving the in-expectation bounds. In order to be more precise, we state in this subsection the high-probability counterpart of the second claim of Theorem 1. The price to pay for covering the more general sub-Gaussian case is that the constant in the right-hand side of the inequality is no longer explicit. Recall that a zero-mean random vector ξ is called sub-Gaussian with parameter τ > 0 (also known as the variance proxy), if

E[ e^{v^⊤ξ} ] ≤ e^{τ||v||²_2/2},   ∀v ∈ R^p.

We write ξ ∼ SG_p(τ). If ξ is standard Gaussian then it is sub-Gaussian with parameter 1. Similarly, if ξ is centered and belongs almost surely to the unit ball, then ξ is sub-Gaussian with parameter 1. Let us describe now the set of data-generating distributions that we consider in this section.

DEFINITION 7. We say that the joint distribution P_n of the random vectors X_1, ..., X_n ∈ R^p belongs to the sub-Gaussian model with adversarial contamination with mean µ*, covariance matrix Σ and contamination rate ε, if there are independent random vectors ξ_i ∼ SG_p(τ) such that
|{ i = 1, ..., n : X_i ≠ µ* + Σ^{1/2}ξ_i }| ≤ εn.
We then write P_n ∈ SGAC(µ*, Σ, ε).

It is clear that the Gaussian model with adversarial contamination defined in Definition 1 is a particular case of the sub-Gaussian model with adversarial contamination. In other terms, the set SGAC(µ*, Σ, ε) is strictly larger than the set GAC(µ*, Σ, ε). Nevertheless, as the result below shows, the risk bounds established for the iteratively reweighted mean algorithm remain valid uniformly over this extended class SGAC(µ*, Σ, ε).

THEOREM 2. Let µ̂^IR_n be the iteratively reweighted mean defined in Definition 6 and in Algorithm 1. Let δ ∈ (4e^{−n}, 1) be a tolerance level. There exists a constant A_5 depending only on the variance proxy τ such that if n ≥ p ≥ 2 and ε < (5 − √5)/10, then for every µ* ∈ R^p and every P_n ∈ SGAC(µ*, Σ, ε), we have
P( ||µ̂^IR_n − µ*||_2 ≤ ( A_5||Σ||_op^{1/2} / (1 − 2ε − √(ε(1 − ε))) ) ( √((p + log(4/δ))/n) + ε√log(1/ε) ) ) ≥ 1 − 4δ.

The proof of this theorem is postponed to the supplementary material. Let us just mention that [Cheng et al., 2019b, Section 1.2] claim that the rate √(p/n) + ε√log(1/ε) is optimal for sub-Gaussian distributions, meaning that, unlike in the Gaussian case, the √log(1/ε) factor cannot be removed. A formal proof of this fact can be found in the last remark of Section 2 in [Lugosi and Mendelson, 2021].

6.2. Adaptation to unknown contamination rate ε. An appealing feature of the risk bounds that hold with high probability is that they allow us to apply Lepski's method [Lepski and Spokoiny, 1997, Lepskii, 1992] for obtaining an adaptive estimator with respect to ε. The obtained adaptive estimator enjoys all the five properties enumerated in Section 3 except the asymptotic efficiency, since the adaptation results in an inflation of the risk bound by a factor 3. The precise description of the algorithm, already used in the framework of robust estimation by Collier and Dalalyan [2019], is presented below. We will denote by B(µ, r) the ball with center µ and radius r in the Euclidean space R^p.

DEFINITION 8. We choose a geometric grid ε_ℓ = a^ℓ ε_0, ℓ = 1, 2, ..., ℓ_max, of possible values of the contamination rate. Here, a ∈ (0, 1) is a real number, ε_0 = (5 − √5)/10 and ℓ_max = [0.5 log_a(p/n)]. For each ℓ = 1, ..., ℓ_max, we denote by µ̂^IR_n(ε_ℓ) the iteratively reweighted mean computed for ε = ε_ℓ, see Algorithm 1, and we set
R_δ(z) = ( A_5||Σ||_op^{1/2} / (1 − 2z − √(z(1 − z))) ) ( √((p + log(4ℓ_max/δ))/n) + z√log(1/z) ),   z ∈ [0, ε_0),
where δ ∈ (0, 1) is a tolerance level and A_5 is the constant from Theorem 2. The adaptively chosen iteratively reweighted mean estimator µ̂^AIR_n is defined by µ̂^AIR_n = µ̂^IR_n(ε_ℓ̂) where
ℓ̂ = max{ ℓ ≤ ℓ_max : ∩_{j=1}^{ℓ} B( µ̂^IR_n(ε_j); R_δ(ε_j) ) ≠ ∅ }.
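A sketch of ours (not the authors' code) of the Lepski-type selection rule in Definition 8: for each level, we test whether the balls B(µ̂^IR_n(ε_j); R_δ(ε_j)) have a common point by solving a small second-order cone feasibility problem with cvxpy; estimates and radii are hypothetical input arrays.

```python
import cvxpy as cp

def balls_intersect(centers, radii):
    """Return True if the Euclidean balls B(centers[j], radii[j]) have a common point."""
    x = cp.Variable(centers[0].shape[0])
    constraints = [cp.norm(x - c, 2) <= r for c, r in zip(centers, radii)]
    prob = cp.Problem(cp.Minimize(0), constraints)   # pure feasibility problem
    prob.solve()
    return prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

def lepski_select(estimates, radii):
    """estimates[l], radii[l] correspond to the grid point eps_l; return the selected index."""
    selected = 0
    for l in range(len(estimates)):
        if balls_intersect(estimates[: l + 1], radii[: l + 1]):
            selected = l                              # largest l with nonempty intersection
    return selected
```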

Algorithm 2: Iteratively reweighted mean estimator (known ε, unknown Σ)

Input: data X_1, ..., X_n ∈ R^p, contamination rate ε
Output: parameter estimate µ̂^IR_n
Initialize: compute µ̂^0 as a minimizer of Σ_{i=1}^n ||X_i − µ||_2
Set K = 0 ∨ ⌈ (log(4p) − 2 log(ε(1 − 2ε))) / (2 log(1 − 2ε) − log ε − log(1 − ε)) ⌉.
For k = 1 : K
    Compute current weights:
        w ∈ arg min_{(n−nε)||w||_∞ ≤ 1} λ_max( Σ_{i=1}^n w_i (X_i − µ̂^{k−1})⊗2 ).
    Update the estimator: µ̂^k = Σ_{i=1}^n w_i X_i.
EndFor
Return µ̂^K.

The estimator µ̂^AIR_n can be computed without the knowledge of the true contamination rate ε. Furthermore, its computational complexity is nearly of the same order as the complexity of computing a single instance of the iteratively reweighted mean as defined by Algorithm 1. Indeed, to compute µ̂^AIR_n, one needs to apply Algorithm 1 at most ℓ_max = [0.5 log_a(p/n)] times, and to solve a second-order cone program for checking whether the intersection of a small number of balls is empty. The next theorem, proved in the supplementary material, shows that the estimation error of this estimator µ̂^AIR_n is of the optimal rate, up to a logarithmic factor. To the best of our knowledge, this is the first result of this kind in the literature.

THEOREM 3. Let µ̂^AIR_n be the estimator defined in Definition 8. Let δ ∈ (4e^{−n}, 1) be a tolerance level. Let n ≥ p ≥ 2 and ε ≤ (5 − √5)a/10, where a ∈ (0, 1) is the parameter used in Definition 8. Then, for every µ* ∈ R^p and every P_n ∈ SGAC(µ*, Σ, ε), we have
P( ||µ̂^AIR_n − µ*||_2 ≤ ( 3A_5||Σ||_op^{1/2} / (a − 2ε − √(ε(a − ε))) ) ( √((p + log(4ℓ_max/δ))/n) + ε√log(1/ε) ) ) ≥ 1 − 4δ.

The breakdown point of the adaptive estimator µ̂^AIR_n inferred from the last theorem is slightly smaller than the one of µ̂^IR_n. Indeed, there is a factor a < 1 between these two quantities. Note that one can choose a to be very close to one. The only drawback of choosing a too close to one is the higher computational complexity of the resulting estimator.

6.3. Extension to unknown covariance Σ. The iteratively reweighted mean estimator, as defined in Algorithm 1, requires the knowledge of the covariance matrix Σ. Let us briefly discuss what happens when this matrix is unknown, by considering two qualitatively different situations.
The first situation is when the covariance matrix is isotropic, Σ = σ²I_p, with unknown σ > 0. One can easily check that all the claims of Section 3 and Section 5 hold true for a slight modification of µ̂^IR_n obtained by minimizing λ_max( Σ_i w_i (X_i − µ̂^{k−1})⊗2 − Σ ) instead of G(w, µ̂^{k−1}) in (5). For isotropic matrices Σ, we have
arg min_w λ_max( Σ_i w_i (X_i − µ̂^{k−1})⊗2 − σ²I_p ) = arg min_w λ_max( Σ_i w_i (X_i − µ̂^{k−1})⊗2 ),
which means that the resulting value is independent of Σ. Therefore, in this first situation, the modified estimator defined in Algorithm 2 satisfies the inequalities of Theorem 1 and Theorem 2.
The second situation is when Σ is unknown and arbitrary. In this case, to the best of our knowledge, there is no known computationally tractable estimator of µ* achieving a rate faster than √(p/n) + √ε. It turns out that a slightly modified version of the iteratively reweighted mean estimator defined in Algorithm 2 achieves this rate as well. The formal statement of the result entailing this claim is presented below.

THEOREM 4. Let µ̂^IR_n be the iteratively reweighted mean defined in Algorithm 2. Let δ ∈ (4e^{−n}, 1) be a tolerance level. There exists a constant A_5 depending only on the variance proxy τ such that if n ≥ p ≥ 2 and ε < (5 − √5)/10, then for every µ* ∈ R^p and every P_n ∈ SGAC(µ*, Σ, ε), we have
P( ||µ̂^IR_n − µ*||_2 ≤ ( A_5||Σ||_op^{1/2} / (1 − 2ε − √(ε(1 − ε))) ) ( √((p + log(4/δ))/n) + 3√ε ) ) ≥ 1 − 4δ.

The proof of this theorem is deferred to the supplementary material. One can use Theorem 4 to construct an adaptive (with respect to ε) estimator of µ* in the case of unknown Σ using Lepski's method detailed in Section 6.2. For this construction, it suffices to know an upper bound σ_max on the operator norm ||Σ||_op^{1/2}. The resulting estimator has an error of the order σ_max( √(p/n) + √ε ). One can also consider intermediate cases, in which the covariance matrix is not arbitrary but has a more general form than the simple isotropic one. In such a situation, it might be of interest to extend the method proposed in Algorithm 1 by using an initial estimator of Σ and by updating its value at each step. Indeed, when a weight vector is computed, it can be used for updating not only the mean but also the covariance matrix. The study of this estimator is left for future research.

7. Empirical results. This section showcases empirical results obtained by applying the iteratively reweighted mean estimator described in Algorithm 1 to synthetic data sets. We stress right away that there are multiple ways of solving the optimization problem involved in Algorithm 1, and the implementation we used in our experiments is not the most efficient one. As already mentioned, the aforementioned optimization problem can be seen as a semi-definite program and off-the-shelf algorithms can be applied to solve it. We implemented this approach in R using the MOSEK solver. All the results reported in this section are obtained using this implementation. We are currently working on an improved implementation using the dual subgradient algorithm of [Cox et al., 2014].
We applied Algorithm 1 to X_1, ..., X_n from P_n ∈ GAC(µ*, Σ, ε) with various types of contamination schemes. In the numerical experiments below, we illustrate (a) the evolution of the estimation error along the iterations, (b) the properties related to the theoretical breakdown point and (c) the performance of the estimator obtained from Algorithm 1 as compared to some simple estimators of the mean and to the oracle. In this section, the error of estimation is understood as the Euclidean distance between the estimated mean and its true value.
Notice that, due to the equivariance stated in Fact 2, it is sufficient to take as the true target mean vector µ* the zero vector 0_p and as Σ any diagonal matrix with nonnegative entries. We consider the following two schemes of outlier generation:

[Figure 2 shows log(error) versus iteration number for n = 400, 700, 1000, 1300; Figure 3 shows the squared error versus ε, compared with ε√log(1/ε).]
FIG 2. The decay of the error along the iterations for p = 9, ε = 0.2 and different values of n. The contamination scheme is "uniform outliers" with (a, b) = (0.5, 2).
FIG 3. The error as a function of the contamination rate ε for n = 500 and p = 5. The contamination scheme is "smallest" eigenvector. The number of iterations for ε > ε* was set to 30.

• Contamination by "smallest" eigenvector: Sample n i.i.d. observations Y_1, ..., Y_n drawn from N(0_p, I_p) and compute the smallest eigenvalue λ_p and the corresponding eigenvector v_p of the sample covariance matrix, defined as
Σ̂_n = (1/n) Σ_{i=1}^n (Y_i − Ȳ_n)⊗2,
where Ȳ_n := (Y_1 + ··· + Y_n)/n. Choose the nε observations from Y_1, ..., Y_n that have the highest (in absolute value) correlation coefficient with v_p and replace them by a vector proportional to v_p with proportionality coefficient equal to √p.
• Uniform outliers: Sample n observations according to the model

Y_i = θ_i + ξ_i,   with ξ_i i.i.d. ∼ N(0_p, I_p) for i = 1, ..., n,

where the θ_i's are all-zero vectors for the n(1 − ε) observations i ∈ I = {1, ..., n(1 − ε)}, while for the indices i ∉ I we have ||θ_i||_2 ≠ 0. We take the values of {θ_i}_{i∈O} to be i.i.d. from a uniform distribution, i.e., for i ∉ I we take {θ_i^j}_{j=1}^p i.i.d. ∼ U[a, b]. We took different values of a and b in the different experiments reported below. (A code sketch of both schemes is given after this list.)
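The sketch below (ours, not the authors' R code) implements the two contamination schemes described above in numpy; the correlation with v_p is approximated by the centered inner product.

```python
import numpy as np

def uniform_outliers_sample(n, p, eps, a, b, rng=None):
    """'Uniform outliers' scheme: inliers N(0, I_p), outlier shifts with i.i.d. U[a, b] coordinates."""
    rng = np.random.default_rng(rng)
    n_out = int(n * eps)
    theta = np.zeros((n, p))
    theta[n - n_out:] = rng.uniform(a, b, size=(n_out, p))     # outlier shifts
    return theta + rng.standard_normal((n, p))                  # Y_i = theta_i + xi_i

def smallest_eigenvector_sample(n, p, eps, rng=None):
    """'Smallest eigenvector' scheme: replace the n*eps points most aligned with the
    smallest eigenvector of the sample covariance by sqrt(p) times that eigenvector."""
    rng = np.random.default_rng(rng)
    Y = rng.standard_normal((n, p))
    S = np.cov(Y, rowvar=False, bias=True)                      # sample covariance (1/n normalization)
    _, eigvec = np.linalg.eigh(S)
    v = eigvec[:, 0]                                            # eigenvector of the smallest eigenvalue
    corr = np.abs((Y - Y.mean(axis=0)) @ v)                     # proxy for the correlation with v
    X = Y.copy()
    X[np.argsort(-corr)[: int(n * eps)]] = np.sqrt(p) * v
    return X

X = uniform_outliers_sample(n=500, p=20, eps=0.1, a=4, b=10, rng=0)
```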

7.1. Improvement along iterations. In this experiment we show the improvement of the estimation error along the iterations. The data for this experiment were generated according to the second contamination scheme mentioned above with a = 0.5 and b = 2. In Figure 2, we drew the logarithm of the estimation error (i.e., of the distance between µ̂_n and µ*) as a function of the iteration number. The results are obtained by averaging over 50 independent repetitions. We observe that the error decreases very fast during the first iteration and remains almost constant afterwards. As a matter of fact, in order to speed up the procedure, we can stop the iterations if the current weights w satisfy λ_max( Σ_i w_i (X_i − X̄_w)⊗2 − Σ ) ≤ ε. It is easy to check that this modified estimator still possesses all the desirable properties described in previous sections.
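A possible implementation of this early-stopping check, written by us as an illustration of the criterion just stated:

```python
import numpy as np

def stop_early(X, w, Sigma, eps):
    """True if lambda_max( sum_i w_i (X_i - Xbar_w)(X_i - Xbar_w)^T - Sigma ) <= eps."""
    mu_w = w @ X                                # weighted mean Xbar_w
    D = X - mu_w
    S = (D * w[:, None]).T @ D                  # weighted covariance around Xbar_w
    return np.linalg.eigvalsh(S - Sigma)[-1] <= eps
```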

7.2. Breakdown point. The goal of this experiment is to check empirically the validity of the breakdown point ε* := (5 − √5)/10 ≈ 0.28. To this end, we chose to contaminate the standard normal vectors by the "smallest" eigenvector scheme. Note that if the outliers are well separated from the inliers, as in the uniform outliers scheme, then Algorithm 1 detects these outliers well by pushing their weights to 0 even when ε is large (larger than 0.28).

FIG 4. ℓ2-loss for different estimators when n = 500, p = 20 and the uniform outliers scheme with a = 4 and b = 10.
FIG 5. ℓ2-loss for different estimators when p = 10, ε = 0.1 and the uniform outliers scheme with a = 4 and b = 10.

For n = 500 and p = 5, we conducted 50 independent repetitions of the experiment and plotted the error averaged over these 50 repetitions in Figure 3. For ε > 0.28, we manually set the number of iterations to⁷ 30. It is interesting to observe that there is a clear change in the mean error occurring at a value close to 0.28. This empirical result allows us to conjecture that the breakdown point of the presented estimator is indeed close to 0.28.

7.3. Comparison with other estimators. In this last experiment we wanted to provide a visual illustration of the performance of the proposed estimator µ̂^IR_n as compared to some simple competitors: the sample mean, the coordinatewise median, the geometric median and the oracle obtained by averaging all the inliers. The obtained errors, averaged over 50 independent repetitions, are depicted in Figure 4 and Figure 5. The former corresponds to n = 500, p = 20 and varying ε, while the parameters of the latter are p = 10, ε = 0.1 and n ∈ [10, 1000]. In the legend of these figures, Estimated Mu refers to µ̂^IR_n, the output of Algorithm 1, Sample Mean and Sample Median correspond respectively to the sample mean and the sample coordinatewise median, Geometric Median refers to the estimator defined in (4) and Oracle refers to the sample mean of the inliers only. The plots show that the iteratively reweighted mean estimator has an error which is nearly as small as the error of the oracle. These errors are way smaller than the errors of the other estimators included in this experiment, which is in line with all the existing theoretical results.

8. Postponed proofs. We collected in this section all the technical proofs postponed from Section 3 and Section 5. Throughout this section, we will always assume that λ_max(Σ) = ||Σ||_op = 1. As we already mentioned above, the general case can be reduced to this one by dividing all the data vectors by ||Σ||_op^{1/2}. The proofs in this section are presented according to the order of appearance of the corresponding claims in the previous sections. Since the proof of Theorem 1 relies on several lemmas and propositions, we provide in Figure 6 a diagram showing the relations between these results.

8.1. Additional details on (8). One can check that
Σ_{i=1}^n w*_i (X_i − µ̂^{k−1})⊗2 = Σ_{i=1}^n w*_i (X_i − X̄_{w*})⊗2 + (X̄_{w*} − µ̂^{k−1})⊗2
⪯ Σ_{i=1}^n w*_i (X_i − X̄_{w*})⊗2 + ||X̄_{w*} − µ̂^{k−1}||²_2 I_p
⪯ Σ_{i=1}^n w*_i (X_i − µ*)⊗2 + ||X̄_{w*} − µ̂^{k−1}||²_2 I_p.
This readily yields
G(w*, µ̂^{k−1}) ≤ λ_max,+( Σ_{i=1}^n w*_i (X_i − µ*)⊗2 − Σ ) + ||X̄_{w*} − µ̂^{k−1}||²_2
≤ G(w*, µ*) + ( ||ζ̄_{w*}||_2 + ||µ* − µ̂^{k−1}||_2 )².
To get the last line of (8), it suffices to apply the elementary consequence of the Minkowski inequality √(a + (b + c)²) ≤ √a + b + c, valid for every a, b, c > 0.

⁷Since the number of iterations K is well-defined for ε ≤ ε*_r ≈ 0.28.

8.2. Rough bound on the error of the geometric median. This subsection is devoted to the proof of an error estimate of the geometric median. This estimate is rather crude, in terms of its dependence on the sample size, but it is sufficient for our purposes. As a matter of fact, it also shows that the breakdown point of the geometric median is equal to 1/2.

LEMMA 2. For every ε ≤ 1/2, the geometric median satisfies the inequality
||µ̂^GM_n − µ*||_2 ≤ (2/(n(1 − 2ε))) Σ_{i=1}^n ||ζ_i||_2.
Furthermore, its expected error satisfies
||µ̂^GM_n − µ*||_{L_2} ≤ 2r_Σ^{1/2}/(1 − 2ε).

PROOF. Recall that the geometric median of X_1, ..., X_n is defined by
µ̂^GM_n ∈ arg min_{µ∈R^p} Σ_{i=1}^n ||X_i − µ||_2.
It is clear that
Σ_{i=1}^n ||X_i − µ̂^GM_n||_2 ≤ Σ_{i=1}^n ||X_i − µ*||_2.
Without loss of generality, we assume that µ* = 0. We also assume that nε is an integer. On the one hand, we have the simple bound
n(1 − ε)||µ̂^GM_n||_2 ≤ Σ_{i∈I} ( ||X_i − µ̂^GM_n||_2 + ||ζ_i||_2 )
≤ Σ_{i=1}^n ||X_i − µ̂^GM_n||_2 + Σ_{i∈I} ||ζ_i||_2 − Σ_{i∈O} ||X_i − µ̂^GM_n||_2
≤ Σ_{i=1}^n ||X_i − µ*||_2 + Σ_{i∈I} ||ζ_i||_2 − Σ_{i∈O} ||X_i − µ̂^GM_n||_2
≤ 2 Σ_{i∈I} ||ζ_i||_2 + Σ_{i∈O} ( ||X_i||_2 − ||X_i − µ̂^GM_n||_2 )
≤ 2 Σ_{i∈I} ||ζ_i||_2 + nε||µ̂^GM_n||_2,
where the first and the last inequalities follow from the triangle inequality. From the last display, we infer that
||µ̂^GM_n||_{L_2} ≤ (2/(n(1 − 2ε))) Σ_{i=1}^n ||ζ_i||_{L_2} ≤ (2/(1 − 2ε)) ||ζ_1||_{L_2} ≤ 2r_Σ^{1/2}/(1 − 2ε)
and we get the claim of the lemma.

[Figure 6: a diagram with boxes for Theorem 1 (bound on expected error); Lemma 2 (assessing the error of the initial estimator, the geometric median); Proposition 1 (deterministic bound involving the weighted covariance); Proposition 2 (uniform bounds on the moments of the stochastic terms); Lemma 7 (operator-norm error of weighted covariance of Gaussians); Lemma 3 (L2-error of weighted averages of Gaussian vectors); Lemmas 4, 5 (moments and deviations of singular values of Gaussian matrices); Lemma 6 (centered moments and deviations of Gaussian matrices); Lemma 1 (properties of the weights from the feasible set).]
FIG 6. A diagram showing the relation between different lemmas and propositions used in the proof of Theorem 1.

8.3. Proof of Proposition 1. To ease notation throughout this proof, we write ε̄ instead of ε_w. Simple algebra yields
X̄_w − µ* − ζ̄_{w_I} = X̄_{w_I} + X̄_{w_{I^c}} − µ* − ζ̄_{w_I}
= (1 − ε̄)µ* + ζ̄_{w_I} + X̄_{w_{I^c}} − µ* − ζ̄_{w_I}
= X̄_{w_{I^c}} − ε̄µ*
= X̄_{w_{I^c}} − ε̄X̄_w + ε̄(X̄_w − µ*).
This implies that (1 − ε̄)(X̄_w − µ*) − ζ̄_{w_I} = X̄_{w_{I^c}} − ε̄X̄_w, which is equivalent to
X̄_w − µ* − ζ̄_{w|I} = ( X̄_{w_{I^c}} − ε̄X̄_w ) / (1 − ε̄).

Therefore, we have
||X̄_w − µ* − ζ̄_{w|I}||_2 = ||X̄_{w_{I^c}} − ε̄X̄_w||_2 / (1 − ε̄)
= (1/(1 − ε̄)) || Σ_{i∈I^c} w_i (X_i − X̄_w) ||_2
= (1/(1 − ε̄)) sup_{v∈S^{p−1}} Σ_{i∈I^c} w_i v^⊤(X_i − X̄_w).   (17)
Using the Cauchy-Schwarz inequality as well as the notations M_i = (X_i − X̄_w)⊗2 and M = Σ_{i=1}^n w_i M_i, we get
( Σ_{i∈I^c} w_i v^⊤(X_i − X̄_w) )² ≤ ( Σ_{i∈I^c} w_i ) Σ_{i∈I^c} w_i |v^⊤(X_i − X̄_w)|² = ε̄ v^⊤( Σ_{i∈I^c} w_i M_i )v
= ε̄ ( v^⊤ Σ_{i=1}^n w_i M_i v − v^⊤ Σ_{i∈I} w_i M_i v )
= ε̄ v^⊤(M − Σ)v + ε̄ v^⊤( Σ − Σ_{i∈I} w_i M_i )v
≤ ε̄ { λ_max(M − Σ) + v^⊤Σv − v^⊤ Σ_{i∈I} w_i M_i v }.   (18)
Finally, for any unit vector v,
v^⊤ Σ_{i∈I} w_i M_i v = (1 − ε̄) v^⊤ Σ_{i∈I} (w|I)_i (X_i − X̄_w)⊗2 v
≥ (1 − ε̄) v^⊤ Σ_{i∈I} (w|I)_i (X_i − X̄_{w|I})⊗2 v
= (1 − ε̄) v^⊤ Σ_{i∈I} (w|I)_i (ζ_i − ζ̄_{w|I})⊗2 v
= (1 − ε̄) ( v^⊤ Σ_{i∈I} (w|I)_i ζ_iζ_i^⊤ v − (v^⊤ζ̄_{w|I})² )
≥ (1 − ε̄) ( v^⊤Σv − λ_max( Σ_{i∈I} (w|I)_i (Σ − ζ_iζ_i^⊤) ) − ||ζ̄_{w|I}||²_2 ).   (19)
Combining (18) and (19), we get
( Σ_{i∈I^c} w_i v^⊤(X_i − X̄_w) )² ≤ ε̄ λ_max(M − Σ) + ε̄² λ_max(Σ) + ε̄(1 − ε̄)( λ_max( Σ_{i∈I} (w|I)_i (Σ − ζ_iζ_i^⊤) ) + ||ζ̄_{w|I}||²_2 ).
In conjunction with (17), this yields
||X̄_w − µ* − ζ̄_{w|I}||_2 ≤ (√ε̄/(1 − ε̄)) ( λ_max(M − Σ) + ε̄ λ_max(Σ) + (1 − ε̄)( λ_max( Σ_{i∈I} (w|I)_i (Σ − ζ_iζ_i^⊤) ) + ||ζ̄_{w|I}||²_2 ) )^{1/2}.
From this relation, using the triangle inequality and the inequality √(a_1 + ... + a_n) ≤ √a_1 + ... + √a_n, we get
||X̄_w − µ*||_2 ≤ ||ζ̄_{w|I}||_2 + (√ε̄/(1 − ε̄)) λ_max,+^{1/2}(M − Σ) + ε̄||Σ||_op^{1/2}/(1 − ε̄)
+ √(ε̄/(1 − ε̄)) λ_max,+^{1/2}( Σ_{i∈I} (w|I)_i (Σ − ζ_iζ_i^⊤) ) + √(ε̄/(1 − ε̄)) ||ζ̄_{w|I}||_2
and, after rearranging the terms, the claim of the proposition follows.

8.4. Proof of Lemma1. Claim i) of this lemma is straightforward. For ii), we use the fact that the compact convex polytope Wn,` is the convex hull of its extreme points. The fact J that each uniform weight vector u is an extreme point of Wn,` is easy to check. Indeed, if 0 J 0 for two points w and w from Wn,` we have u = 0.5(w + w ), then we necessarily have 0 wJ c = wJ c = 0. Therefore, for any j ∈ J, 1 X 1 1 ≥ w = 1 − w ≥ 1 − ` − 1) × = . ` j i ` ` i∈J\{j} 1 1 J 0 J This implies that wj = ` (j ∈ J). Hence, w = u and the same is true for w . Hence, u is an extreme point. Let us prove now that all the extreme points of Wn,` are of the form J u with |J| = `. Let w ∈ Wn,` be such that one of its coordinates is strictly positive and strictly smaller than 1/`. Without loss of generality, we assume that the two smallest nonzero entries of w are w1 and w2. We have 0 < w1 < 1/` and 0 < w2 < 1/`. Set ρ = w1 ∧ w2 ∧ + − {1/` − w1} ∧ {1/` − w2}. For w = w + (ρ, −ρ, 0,..., 0) and w = w − (ρ, −ρ, 0,..., 0), + − + − we have w , w ∈ Wn,` and w = 0.5(w + w ). Therefore, w is not an extreme point of Wn,`. This completes the proof of ii). P J K−1 n Claim ii) implies that Wn,` = { |J|=` αJ u : α ∈ ∆ } with K = ` . Hence,   X J X J  J sup G(w) = sup G αJ u ≤ sup αJ G u ≤ max G(u ) w∈W K−1 K−1 |J|=` n,` α∈∆ |J|=` α∈∆ |J|=` and claim iii) follows. To prove iv), we check that X X |Ic| n − `0 ` + `0 − n kw k = w = 1 − w ≥ 1 − ≥ 1 − = . I 1 i i ` ` ` i∈I i6∈I 0 This readily yields (w|I )i ≤ 1/(` + ` − n), which leads to the claim of item iv).

8.5. Moments of suprema over Wn,` of weighted averages of Gaussian vectors. We recall that ξi’s, for i = 1, . . . , n are i.i.d. Gaussian vectors with zero mean and identity covariance 1/2 matrix, and ζi = Σ ξi. In addition, the covariance matrix Σ satisfies kΣkop = 1.

LEMMA 3. Let p ≥ 1, n ≥ 1, m ∈ [2, 0.562n] and o ∈ [0, m] be four positive integers. It holds that " #  r  1/2 2 p m 4.6mp E sup kζ¯ k ≤ r /n 1 + 7 + log(ne/m). w|I 2 Σ w∈Wn,n−m+o 2n n |I|≥n−o ROBUST ESTIMATION OF A GAUSSIAN MEAN 23

PROOFOF LEMMA 3. We have (a) (b) sup kζ¯ k ≤ sup kζ¯ k ≤ max ζ¯ J , w|I 2 w 2 u 2 w∈Wn,n−m+o w∈Wn,n−m |J|=n−m where (a) follows from claim iv) of Lemma1 and (b) is a direct consequence of claim iii) of Lemma1. Thus, we get

¯ ¯ 1 X sup kζw k2 ≤ max ζuJ = max ζi |I |J|=n−m 2 n − m |J|=n−m w∈Wn,n−m+o i∈J 2 n 1 X 1 X ≤ ζi + max ζi . (20) n − m n − m |J¯|=m i=1 2 i∈J¯ 2 On the one hand, one readily checks that  n 2 X E ζ = nrΣ. (21) i i=1 2 On the other hand, it is clear that for every J¯ of cardinality m, the random variable P 2 Pp 2 i∈J¯ ζi 2 has the same distribution as m j=1 λj(Σ)ξj , where ξ1, . . . , ξp are i.i.d. stan- dard Gaussian. Therefore, by the union bound, for every t ≥ 0, we have  2     m 2  X  n X  P max ζi > m 2rΣ + t ≤ P ζi > m 2rΣ + t |J¯|=m m i∈J¯ 2 i=1 2 p  n   X   n  = P λ (Σ)ξ2 > 2r + t ≤ e−t/3, m j j Σ m j=1 where the last line follows from a well-known bound on the tails of the generalized χ2- distribution, see for instance [Comminges and Dalalyan, 2012, Lemma 8]. 1 P 2 Therefore, setting Z = m max|J|=m i∈J ζi 2 − 2rΣ and using the well-known identity R ∞ E[Z] ≤ E[Z+] = 0 P(Z ≥ t) dt, we get  2 Z ∞   1 X n −t/3 E max ζi ≤ 2rΣ + 1 ∧ e dt |J¯|=m m 0 m i∈J¯ 2     Z ∞ n n −t/3 = 2rΣ + 3 log + e dt m m 3 log n (m) ne ≤ 2rΣ + 3m log( /m) + 3 ne ≤ 2rΣ + 3.9m log( /m), (22) n  where the last two steps follow from the inequality log m ≤ m log(ne/m) and the fact ne that m ≥ 2, m log( /m) ≥ n infx∈[2/n,1] x(1−log x) ≥ 2(1−log(2/n)) ≥ 10/3. Combining (20), (21) and (22), we arrive at   1/2 √ √ 2 nrΣ m 1/2 E sup kζ¯ k ≤ + × 2r + 3.9m log(ne/m) w|I 2 Σ w∈Wn,n−m+o n − m n − m |I|≥n−o √ p  m 2mn 4.6mp ≤ r /n 1 + + + log(ne/m). Σ n − m n − m n 24 DALALYAN AND MINASYAN

Finally, note that for α = m/n ≤ 0.562, we have √ √ m 2mn r m  2mn 2n  + = + n − m n − m 2n n − m n − m √ r m  2α 2  r m = + ≤ 7 . 2n 1 − α 1 − α 2n This completes the proof of the lemma.

8.6. Moments and deviations of singular values of Gaussian matrices. Let ζ1,..., ζn be i.i.d. random vectors drawn from Np(0, Σ) distribution, where Σ is a p×p covariance matrix. We denote by ζ1:n the p × n random matrix obtained by concatenating the vectors ζi. Recall 1/2 > 1/2 > that smin(ζ1:n) = λmin(ζ1:nζ1:n) and smax(ζ1:n) = λmax(ζ1:nζ1:n) are the smallest and the largest singular values of the matrix ζ1:n.

LEMMA 4 (Vershynin[2012], Theorem 5.32 and Corollary 5.35). Let λmax(Σ) = 1. For every t > 0 and for every pair of positive integers n and p, we have   √ √ √ √  −t2/2 E smax(ζ1:n) ≤ n + rΣ, P smax(ζ1:n) ≥ n + rΣ + t ≤ e . If, in addition, Σ is the identity matrix, then   √ √  √ √  −t2/2 E smin(ζ1:n) ≥ n − p +, P smin(ζ1:n) ≤ n − p − t ≤ e .

The corresponding results in [Vershynin, 2012] treat only the case of identity covariance ma- trix Σ = Ip, however the proof presented therein carries with almost no change over the case of arbitrary covariance matrix. These bounds allow us to establish the following inequalities.

¯ ¯ LEMMA 5. For a subset J of {1, . . . , n}, we denote by ζJ¯ the p × |J| matrix obtained by ¯ concatenating the vectors {ζi : i ∈ J}. Let the covariance matrix Σ be such that λmax(Σ) = 1. For every pair of integers n, p ≥ 1 and for every integer m ∈ [1, n], we have

 2  √ √ 2 E smax(ζ1:n) ≤ rΣ + n + 4, (23)  2  √ E (smax(ζ1:n) − n)+ ≤ 6 nrΣ + 4rΣ, ∀n ≥ 8, (24)  2  √ E (n − smin(ξ1:n) )+ ≤ 6 np, ∀n ≥ 8, (25) h i √ √ 2 2 p ne  E max smax(ζJ¯) ≤ rΣ + m + 1.81 m log( /m) + 4. (26) |J¯|=m

PROOF. The bias-variance decomposition, in conjunction with Lemma4, yields  2    2 E smax(ζ1:n) = Var smax(ζ1:n) + E smax(ζ1:n)  √ √ 2 ≤ Var smax(ζ1:n) + n + rΣ . 2 R ∞ 2 Applying the well-known fact E[Z ] = 0 P(Z ≥ t) dt to the random variable Z = smax(ζ1:n) − E[smax(ζ1:n)] and using the Gaussian concentration inequality, we get Z ∞ √ Z ∞    −t/2 Var smax(ζ1:n) = P smax(ζ1:n) − E[smax(ζ1:n)] ≥ t dt ≤ 2 e dt = 4. 0 0 This completes the proof of (23). ROBUST ESTIMATION OF A GAUSSIAN MEAN 25

For every random variable Z and every constant a > 0, we have  2   2 2  E ((a + Z) − n)+ = E (a + 2aZ + Z − n)+  2  2 ≤ E (a + 2aZ − n)+ + E[Z ] 2 2 ≤ |a − n| + 2aE[Z+] + E[Z ].

Taking a = E[smax(ζ1:n)] and Z = smax(ζ1:n) − E[smax(ζ1:n)], we get √  2  √ √ 2 √ √ E (smax(ζ1:n) − n)+ ≤ |( n + rΣ) − n| + ( n + rΣ) 2π + 4 √ √ √ √ = 2 nrΣ + rΣ + ( n + rΣ) 2π + 4 √ ≤ 6 nrΣ + 4rΣ, ∀n ≥ 8. Similarly, we have  2   2 2  2   E (n − (a + Z) )+ = E (n − a − 2aZ − Z )+ ≤ (n − a )+ + 2aE Z− .

Taking a = E[smin(ξ1:n)] and Z = smin(ξ1:n) − E[smin(ξ1:n)], we get √  2  2 E (n − smin(ξ1:n) )+ ≤ (n − E[smin(ξ1:n)] )+ + E[smin(ξ1:n)] 2π √ √ √ 2 √ √ ≤ (n − ( n − p) )+ + ( n + p) 2π √ √ √ √ √ ≤ 2 np + ( n + p) 2π ≤ 6 np. Thus, we have checked (24) and (25). In view of Lemma4, for every t > 0,   p √  −t2/2 P smax ζJ¯ < |J¯| + rΣ + t ≥ 1 − e . (27) R ∞ Let I0 ⊂ [n] be any set of cardinality m. Combining the relation E[Z] = 0 P(Z ≥ s) ds, the union bound, and (27), we arrive at h i √ √ h √ √  i E max smax(ζJ¯) ≤ m + rΣ + E max smax(ζJ¯) − m − rΣ + |J¯|=m |J¯|=m √ √ Z ∞  √ √  = m + rΣ + P max smax(ζJ¯) ≥ m + rΣ + t dt 0 |J¯|=m √ √ Z ∞ ^  n   √ √  ≤ m + rΣ + 1 P smax(ζI0 ) ≥ m + rΣ + t dt 0 m Z ∞     √ √ ^ n −t2/2 ≤ m + rΣ + 1 e dt. (28) 0 m n  From now on, we assume that m ∈ [2, n − 1], which implies that m ≥ n. Let t0 be the value of t for which the two terms in the last minimum coincide, that is (     2 ne n −t2/2 2 n t0 ≤ 2m log( /m), e 0 = 1 ⇐⇒ t = 2 log =⇒ 0 2 n(n−1) m m t0 ≥ 2 log( /2), We have, for m ≥ 1,

2 Z ∞       Z ∞   −t0/2 ^ n −t2/2 n −t2/2 n e 1 e dt = t0 + e dt ≤ t0 + 0 m m t0 m t0 1 p ≤ t0 + ≤ 1.81 m log(ne/m). (29) t0 26 DALALYAN AND MINASYAN

Inequalities (28) and (29) yield h i √ √  p ne E max smax ζJ¯ ≤ m + rΣ + 1.81 m log( /m). |J¯|=m  On the other hand, the mapping ζ1:n 7→ F (ζ1:n) := max|J¯|=m smax ζJ¯ being 1-Lipschitz, the Gaussian concentration inequality leads to Z ∞ √ Z ∞     −t/2 Var max smax ζJ¯ = P F (ζ1:n) − E[F (ζ1:n)] ≥ t dt ≤ 2 e dt = 4. |J¯|=m 0 0 This completes the proof of the lemma.

LEMMA 6. There is a constant A1 > 0 such that for every pair of integers n ≥ 8 and p ≥ 1 and for every covariance matrix Σ such that λmax(Σ) = 1, we have

h > i √ √ √ E kζ1:nζ1:n − nΣkop ≤ A1 n + rΣ rΣ, (30)

h > i √ E λmax,+(ζ1:nζ1:n − nΣ) ≤ 6 np + 4p, (31)

h > i √ E λmax,+(nΣ − ζ1:nζ1:n) ≤ 6 np, (32) where the last inequality is valid under the additional assumption p ≤ n. Furthermore, there is a constant A2 > 0 such that  √  > h > i  √  −t P kζ1:nζ1:n − nΣkop − E kζ1:nζ1:n − nΣkop ≥ A2 tn + trΣ + t ≤ e , ∀t ≥ 1.

PROOF. Inequality (30) and the last claim of the lemma are respectively Theorems 4 and 5 1/2 in [Koltchinskii and Lounici, 2017]. Let us prove the two other claims. Since ζi = Σ ξi where ξi’s are i.i.d. N (0, Ip), we have > > 2  λmax,+(ζ1:nζ1:n − nΣ) ≤ λmax,+(ξ1:nξ1:n − nIp) = smax(ξ1:n) − n +. Inequality (31) now follows from (24) applied in the case of an identity covariance matrix so that rΣ = p. Similarly, (32) follows from (25) using the same argument.

8.7. Moments of suprema over Wn,` of weighted centered Wishart matrices.

LEMMA 7. Let p ≥ 2, n ≥ 4 ∨ p, m ∈ [2, 0.6n] and o ≤ m be four integers. It holds that " n #  X > p E sup λmax,+ Σ − (w|I )iζiζi ≤ 25 p/n + 33(m/n) log(n/m). w∈Wn,n−m+o i=1 |I|≥n−o Furthermore, for any p ≥ 1, n ≥ 1, m ∈ [2, 0.6n] and o ≤ m, n   r ne X > rΣ m log( /m) E sup Σ − (w|I )iζiζi ≤ (5.1A1 + 2.5A2) + 7.5A2 , op n n w∈Wn,n−m+o i=1 |I|≥n−o where A1 and A2 are the same constants as in Lemma6. ROBUST ESTIMATION OF A GAUSSIAN MEAN 27

PROOFOF LEMMA 7. For every I ⊂ [n] such that |I| ≥ n − o, we have

n (a) n  X >  X > sup λmax,+ Σ − (w|I )iζiζi ≤ sup λmax,+ Σ − wiζiζi w∈Wn,n−m+o i=1 w∈Wn,n−m i=1

(b) n  X J >  ≤ sup λmax,+ ui (Σ − ζiζi ) . |J|=n−m i=1 In the above inequalities, (a) follows from claim iv) of Lemma1, while (b) is a direct conse- quence of Lemma1, claim iii). Thus, we can infer that for every w ∈ Wn,n−m+o, n  X >  X >  (n − m)λmax,+ Σ − (w|I )iζiζi ≤ max λmax,+ (Σ − ζiζi ) |J|=n−m i=1 i∈J  n    X > X > ≤ λmax,+ (Σ − ζiζi ) + max λmax,+ (ζiζi − Σ) |J¯|=m i=1 i∈J¯  n      X > X > ≤ λmax,+ (Σ − ζiζi ) + max λmax ξiξi − m , (33) |J¯|=m i=1 i∈J¯ + where in the last line we have used the fact that for any symmetric matrix B, we have 1/2 1/2 1/2 2 λmax,+(Σ BΣ ) ≤ kΣ kopλmax,+(B) = λmax,+(B). To analyze the last term of the previous display, we note that   X > > 2  λmax ξiξi = λmax ξJ¯ξJ¯ = smax ξJ¯ . i∈J¯ On the one hand, inequality (26) of Lemma5 yields       X > h 2 i E max λmax ξiξi − m ≤ E max smax ξJ¯ |J¯|=m |J¯|=m i∈J¯ + √ √ p 2 ≤ p + m + 1.81 m log(ne/m) + 4 p ≤ p + 10m log(ne/m) + 5.62 pm log(ne/m)

≤ 4p + 13m log(ne/m). (34) On the other hand, for n ≥ p, according to (32),  n  X > √ λmax,+ (Σ − ζiζi ) ≤ 6 np. (35) i=1 Combining (33), (34) and (35), we arrive at n √   X  6 np + 4p + 13m log(ne/m) E sup λ Σ − (w ) ξ ξ> ≤ max,+ |I i i i n − m w∈Wn,n−m+o i=1 |I|≥n−o √ 10 np + 13m log(ne/m) ≤ 0.4n p ≤ 25 p/n + 33(m/n) log(ne/m), and the first claim of the lemma follows. 28 DALALYAN AND MINASYAN

To prove the last claim, we repeat the arguments in (33) to check that for every weight vector w ∈ Wn,n−m+o, n n X > X > X > (n − m) Σ − (w|I )iζiζi ≤ (Σ − ζiζi ) + max (ζiζi − Σ) op |J¯|=m i=1 i=1 op i∈J¯ op > > = ζ1:nζ1:n − nΣ + max ζJ¯ζJ¯ − mΣ . (36) op |J¯|=m op In view of Lemma6, for every t ≥ 1, we have  √  > √  √  −t P kζJ¯ζJ¯ − mΣkop ≥ 2A1 mrΣ + A2 tm + trΣ + t ≤ e .

Since t ≥ 1 and m ≥ 2, the last inequality implies   > √  −t P kζJ¯ζJ¯ − mΣkop ≥ 2A1 mrΣ + A2 m + rΣ + 1.5t ≤ e . √ To ease notation, let us set a = 2A1 mrΣ + A2(m + rΣ) and b = 1.5A2. Using the union bound, we arrive at   Z ∞   E max ζ ζ> − mΣ = P max ζ ζ> − mΣ ≥ s ds J¯ J¯ op J¯ J¯ op |J¯|=m 0 |J¯|=m Z ∞   = a + b P max ζ ζ> − mΣ ≥ a + bt dt J¯ J¯ op 0 |J¯|=m Z ∞  n    ≤ a + b 1 ∧ P ζ ζ> − mΣ ≥ a + bt dt 1:m 1:m op 0 m Z ∞  n  ≤ a + b 1 ∧ e−t dt. 0 m n  Splitting the last integral integral into two parts, corresponding to the intervals [0, log m ] n  and [log m , +∞), we obtain     > n E max ζJ¯ζJ¯ − mΣ ≤ a + b log + b |J¯|=m op m √ ne ≤ 2A1 mrΣ + A2rΣ + 3A2m log( /m), where in the last line we used that  n   n  1.5 + 1.5 log−1 + m log−1 ≤ 3, ∀m ∈ [2, n − 1], ∀n ≥ 4. m m Combining these bounds with (36), we arrive at  n  X > E sup Σ− (w|I )iζiζi op Wn,n−m+o i=1 √ √ √ A r ( n + 2 m) + (A + A )r + 3A m log(ne/m) ≤ 1 Σ 1 2 Σ 2 n − m r rΣ ≤ (5.1A + 2.5A ) + +7.5A (m/n) log(ne/m). 1 2 n 2 This completes the proof. ROBUST ESTIMATION OF A GAUSSIAN MEAN 29

8.8. Proof of Proposition2. Throughout this proof, supw,I stands for the supremum over all w ∈ Wn(ε) and over all I ⊂ {1, . . . , n} of cardinality larger than or equal to n(1 − ε). We recall that, Ξ = supw,I R(ζ, w,I) where for any subset I of{1, . . . , n}, √   √ 1/2 X > ¯ R(ζ, w,I) = 2εw + 2εw λmax,+ (w|I )i(Σ − ζiζi ) + (1 + 2εw)kζw|I k2. i∈I

Furthermore, as already mentioned earlier, for every w ∈ Wn(ε), εw ≤ ε/(1 − ε) ≤ 1.5ε. This implies that √   1/2 2 1/2 ¯ 2 E [Ξ ] ≤ 3ε + (1 + 3ε)E sup kζw|I k2 w,I √    1/2 X > + 3ε E sup λmax,+ (w|I )i(Σ − ζiζi ) . (37) w,I i∈I As proved in Lemma3 (by taking m = 2o and o = nε),   √ 1/2 ¯ 2 p  p E sup kζw|I k2 ≤ rΣ/n 1 + 7 ε + 9.2ε log(2/ε). (38) w,I In addition, in view of the first claim of Lemma7 (with m = 2o and o = nε), stated and proved in the last section, we have    X > p E sup λmax,+ (w|I )i(Σ − ζiζi ) ≤ 25 p/n + 66ε log(1/2ε). (39) w,I i∈I Combining (37), (38) and (39), we get √ √ E1/2[Ξ2] ≤ 3ε + (1 + 3ε)pp/n 1 + 7 ε + 9.2εplog(2/ε) √ + 3ε 25pp/n + 66ε log(1/2ε)1/2 √ √ ≤ pp/n(1 + 16 ε) + 18εplog(2/ε) + 3ε 25pp/n + 66ε log(1/2ε)1/2. This leads to (16). To obtain (15), we repeat the same arguments but use the second claim of Lemma7 instead of the first one. Acknowledgments. The work of AD was partially supported by the grant Investissements d’Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047). The work of AM was supported by the FAST Advance grant.

REFERENCES S. Balakrishnan, S. S. Du, J. Li, and A. Singh. Computationally efficient robust sparse estimation in high dimen- sions. In COLT 2017, pages 169–212, 2017. A.-H. Bateni and A. S. Dalalyan. Confidence regions and minimax rates in outlier-robust estimation on the probability simplex. Electron. J. Statist., 14(2):2653–2677, 2020. T. T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist., 43(3):1027–1059, 2015. T. I. Cannings, Y. Fan, and R. J. Samworth. Classification with imperfect training labels. Biometrika, 107(2): 311–330, 2020. M. Chen, C. Gao, and Z. Ren. A general for Huber’s -contamination model. Electron. J. Statist., 10(2):3752–3774, 2016. M. Chen, C. Gao, and Z. Ren. Robust covariance and scatter matrix estimation under Huber’s contamination model. Ann. Statist., 46(5):1932–1960, 10 2018. 30 DALALYAN AND MINASYAN

Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. In SODA 2019, pages 2755–2771, 2019a. Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. In T. M. Chan, editor, SODA 2019, San Diego, pages 2755–2771. SIAM, 2019b. Y. Cherapanamjeri, N. Flammarion, and P. L. Bartlett. Fast mean estimation with sub-gaussian rates. In Pro- ceedings of COLT, volume 99 of Proceedings of Machine Learning Research, pages 786–806, Phoenix, USA, 25–28 Jun 2019. PMLR. G. Chinot. Erm and rerm are optimal estimators for regression problems when malicious outliers corrupt the labels. Electron. J. Statist., 14(2):3563–3605, 2020. O. Collier and A. S. Dalalyan. Multidimensional linear functional estimation in sparse gaussian models and robust estimation of the mean. Electron. J. Statist., 13(2):2830–2864, 2019. L. Comminges and A. S. Dalalyan. Tight conditions for consistency of variable selection in the context of high dimensionality. Ann. Statist., 40(5):2667–2696, 2012. L. Comminges, O. Collier, M. Ndaoud, and A. B. Tsybakov. Adaptive robust estimation in sparse vector model, 2020. B. Cox, A. Juditsky, and A. Nemirovski. Dual subgradient algorithms for large-scale nonsmooth learning prob- lems. Mathematical Programming, 148(1-2):143–180, 2014. A. Dalalyan and P. Thompson. Outlier-robust estimation of a sparse linear model using `1-penalized Huber’s M-estimator. In NeurIPS 32, pages 13188–13198. 2019. J. Depersin and G. Lecué. Robust subgaussian estimation of a mean vector in nearly linear time. arXiv, abs/1906.03058, 2019. L. Devroye, M. Lerasle, G. Lugosi, and R. I. Oliveira. Sub-Gaussian mean estimators. Ann. Statist., 44(6): 2695–2725, 2016. I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional . CoRR, abs/1911.05911, 2019. I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In FOCS 2016, pages 655–664, 2016. I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high- dimensional gaussians and gaussian mixtures. In FOCS 2017, pages 73–84, 2017. I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In SODA 2018, pages 2683–2702, 2018. I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput., 48(2):742–864, 2019a. I. Diakonikolas, D. Kane, S. Karmalkar, E. Price, and A. Stewart. Outlier-robust high-dimensional sparse estima- tion via iterative filtering. In NeurIPS 2019, pages 10688–10699, 2019b. Y. Dong, S. B. Hopkins, and J. Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In NeurIPS 2019, pages 6065–6075, 2019. D. Donoho. Breakdown properties of multivariate location estimators. Phd thesis, Harvard University, 1982. D. Donoho and P. J. Huber. The notion of breakdown point. In A Festschrift for Erich L. Lehmann, Wadsworth Statist./Probab. Ser., pages 157–184. Wadsworth, Belmont, CA, 1983. D. L. Donoho and M. Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann. Statist., 20(4):1803–1827, 1992. A. Elsener and S. van de Geer. Robust low-rank matrix estimation. Ann. Statist., 46(6B):3481–3509, 2018. C. Gao. via mutivariate regression depth. 
Bernoulli, 26(2):1139–1170, 2020. F. R. Hampel. Contributions to the theory of robust estimation. PhD thesis, University of California, Berkeley, 1968. S. B. Hopkins. Sub-gaussian mean estimation in polynomial time. CoRR, abs/1809.07425, 2018. P. J. Huber and E. M. Ronchetti. Robust Statistics, Second Edition. Wiley Series in Probability and Statistics. Wiley, 2009. O. Klopp, K. Lounici, and A. B. Tsybakov. Robust matrix completion. Probability Theory and Related Fields, 169(1):523–564, Oct 2017. V. Koltchinskii and K. Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 23(1):110–133, 02 2017. K. A. Lai, A. B. Rao, and S. S. Vempala. Agnostic estimation of mean and covariance. In FOCS 2016, pages 665–674, 2016. G. Lecué and M. Lerasle. Learning from MOM’s principles: Le Cam’s approach. Stochastic Process. Appl., 129 (11):4385–4410, 2019. G. Lecué and M. Lerasle. Robust machine learning by median-of-means: Theory and practice. The Annals of Statistics, 48(2):906 – 931, 2020. ROBUST ESTIMATION OF A GAUSSIAN MEAN 31

O. V. Lepski and V. G. Spokoiny. Optimal pointwise adaptive methods in nonparametric estimation. The Annals of Statistics, 25(6):2512–2546, 1997. O. V. Lepskii. Asymptotically minimax adaptive estimation. i: Upper bounds. optimally adaptive estimates. Theory Probab. Appl., 36(4):682–697, 1992. A. H. Li and J. Bradic. Boosting in the presence of outliers: adaptive classification with nonconvex loss functions. J. Amer. Statist. Assoc., 113(522):660–674, 2018. P.-L. Loh. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Ann. Statist., 45(2):866–896, 2017. H. P. Lopuhaä and P. J. Rousseeuw. Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann. Statist., 19(1):229–248, 1991. G. Lugosi and S. Mendelson. Near-optimal mean estimators with respect to general norms. Probab. Theory Related Fields, 175(3-4):957–973, 2019. G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. (JEMS), 22(3):925–965, 2020. G. Lugosi and S. Mendelson. Robust multivariate mean estimation: The optimality of trimmed mean. Ann. Statist., 49(1):393 – 410, 2021. R. Maronna, D. Martin, and V. Yohai. Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. Wiley, 2006. S. Minsker. Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann. Statist., 46 (6A):2871–2903, 2018. J. Polzehl and V. G. Spokoiny. Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(2):335–354, 2000. P. Rigollet and J.-C. Hütter. High dimensional statistics, November 2019. P. Rousseeuw. Multivariate estimation with high breakdown point. In Mathematical statistics and applications, Vol. B (Bad Tatzmannsdorf, 1983), pages 283–297. Reidel, Dordrecht, 1985. P. Rousseeuw and M. Hubert. High-breakdown estimators of multivariate location and scatter. In Robustness and complex data structures, pages 49–66. Springer, Heidelberg, 2013. P. J. Rousseeuw. Least median of squares regression. J. Amer. Statist. Assoc., 79(388):871–880, 1984. M. Soltanolkotabi and E. J. Candès. A geometric analysis of subspace clustering with outliers. Ann. Statist., 40 (4):2195–2238, 2012. W. Stahel. Robuste Schätzungen: infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. Phd thesis, ETH Zurich, 1981. R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed sensing, pages 210–268. Cambridge Univ. Press, Cambridge, 2012. B. Zhu, J. Jiao, and J. Steinhardt. When does the tukey median work? In IEEE International Symposium on Information Theory (ISIT), 2020. 32 DALALYAN AND MINASYAN

APPENDIX A: HIGH-PROBABILITY BOUNDS IN THE SUB-GAUSSIAN CASE

This section is devoted to the proof of Theorem2, which provides a high-probability upper bound on the error of the iteratively reweighted mean estimator in the sub-Gaussian case. We start with some technical lemmas, and postpone the full proof of Theorem2 to the end of the present section. Let us remind some notation. We assume here that for some covariance matrix Σ, we have 1/2 ζi = Σ ξi, for i = 1, . . . , n, where ξi’s are independent zero-mean with identity covariance matrix. In addition, we assume that ξis are sub-Gaussian vectors with parameter τ > 0, that is

> 2 v ξ τ/2kvk p E[e i ] ≤ e 2 , ∀v ∈ R . ind We write ξi ∼ SGp(τ). Recall that if ξi is standard Gaussian then it is sub-Gaussian with parameter 1. It is a well-known fact (see, for instance, [Rigollet and Hütter, 2019, Theorem 1.19]) that if ξ ∼ SGp(τ) then for all δ ∈ (0, 1), it holds  √ p  P kξk2 ≥ 4 pτ + 8τ log(1/δ) ≤ δ. (40)

In what follows, we assume that the covariance matrix Σ satisfies kΣkop = 1.

LEMMA 8. Let J¯ ⊂ {1, . . . , n} be a subset of cardinality m. For every δ ∈ (0, 1), it holds that   X √ √ p  P ζ ≤ mτ 4 p + 8 log(1/δ) ≥ 1 − δ. j j∈J¯ 2

PROOFOF LEMMA 8. Without loss of generality, we assume that J¯ = {1, . . . , m}. On the one hand, kΣkop = 1 implies that m m X X ζ ≤ ξ . i i i=1 2 i=1 2

On the other hand, ξ1 + ... + ξm ∼ SGp(mτ). Applying inequality (40) to this random variable, we get the claim of the lemma.

LEMMA 9. Let p ≥ 1, n ≥ 1, m ∈ [2, 0.562n] and o ∈ [0, m] be four positive integers. For every δ ∈ (0, 1), with probability at least 1 − δ, we have r rτp mpτ log(ne/m) 2τ log(2/δ) sup kζ¯ k ≤ 16 + 6.5 + 8 . w|I 2 w∈Wn,n−m+o n n n |I|≥n−o

PROOFOF LEMMA 9. We have (a) (b) sup kζ¯ k ≤ sup kζ¯ k ≤ sup ζ¯ J , w|I 2 w 2 u 2 w∈Wn,n−m+o w∈Wn,n−m |J|=n−m where (a) follows from claim iv) of Lemma1 and (b) is a direct consequence of claim iii) of Lemma1. Thus, we get

¯ ¯ 1 X sup kζw k2 ≤ max ζuJ = max ζi |I |J|=n−m 2 n − m |J|=n−m w∈Wn,n−m+o i∈J 2 ROBUST ESTIMATION OF A GAUSSIAN MEAN 33 n 1 X 1 X ≤ ζi + max ζi . (41) n − m n − m |J¯|=m i=1 2 i∈J¯ 2 By the union bound, for every t ≥ 0, we have       X √ √  n X √ √  P max ζi > mτ 4 p + 2t ≤ max P ζi > mτ 4 p + 2t |J¯|=m m |J¯|=m i∈J¯ 2 i∈J¯ 2   n 2 ≤ e−t /2, m where the last line follows from Lemma8. Hence, with probability at least 1 − δ/2, we have

X √ √ p p  max ζi ≤ mτ 4 p + 8m log(ne/m) + 8 log(2/δ) , (42) |J¯|=m i∈J¯ 2 n  where we have used the inequality log m ≤ m log(ne/m). Using once again Lemma2, we check that with probability at least 1 − δ/2, n X √ √ p  ζ ≤ nτ 4 p + 8 log(2/δ) . (43) i i=1 2 Combining (41), (42) and (43), and setting α = m/n ≤ 0.562, we arrive at √ r τ 4 p + p8 log(2/δ) mp8τ log(ne/m) ¯ √ sup kζw|I k2 ≤ + w∈Wn,n−m+o n 1 − α n(1 − α) r rτp 2τ log(2/δ) mpτ log(ne/m) ≤ 16 + 8 + 6.5 n n n with probability at least 1 − δ. This completes the proof of the lemma.

We recall that ζ1:n is the p × n random matrix obtained by concatenating the vectors ζi. 1/2 > 1/2 > We also remind that smin(ζ1:n) = λmin(ζ1:nζ1:n) and smax(ζ1:n) = λmax(ζ1:nζ1:n) are the smallest and the largest singular values of the matrix ζ1:n.

LEMMA 10 (Vershynin[2012], Theorem 5.39). There is a constant Cτ depending only on the variance proxy τ such that for every t > 0 and for every pair of positive integers n and p, we have √ √  −t2 P smin(ξ1:n) ≤ n − Cτ p − Cτ t ≤ e , √ √  −t2 P smax(ξ1:n) ≥ n + Cτ p + Cτ t ≤ e . ¯ ¯ LEMMA 11. For a subset J of {1, . . . , n}, we denote by ζJ¯ the p × |J| matrix obtained ¯ by concatenating the vectors {ζi : i ∈ J}. For every pair of integers n, p ≥ 1 and for every integer m ∈ [1, n], we have

 √ √  2 p ne  −t P max smax(ζJ¯) ≤ m + Cτ p + m log( /m) + t ≥ 1 − e , |J¯|=m where Cτ is the same constant as in Lemma 10. 34 DALALYAN AND MINASYAN

PROOF. Using the union bound, we get  √ √  p ne  P max smax(ξJ¯) ≥ m + Cτ p + m log( /m) + t |J¯|=m  n   √ √  p ne  ≤ max P smax(ξJ¯) ≥ m + Cτ p + m log( /m) + t m |J¯|=m

(1)   (2) n p 2 −t2 ≤ exp − { m log(ne/m) + t} ≤ e , m where in (1) above we have used the second inequality from Lemma 10, while (2) is a con- n  sequence of the inequality log m ≤ m log(ne/m).

LEMMA 12 (Koltchinskii and Lounici[2017], Theorem 9). There is a constant A3 > 0 depending only on the variance proxy τ such that for every pair of integers n ≥ 1 and p ≥ 1, we have   > p  −t P kζ1:nζ1:n − nΣkop ≥ A3 (p + t)n + p + t ≤ e , ∀t ≥ 1.

LEMMA 13. Let t > 0. For any p ≥ 1, n ≥ 1, m ∈ [2, 0.6n] and o ≤ m, with probability at least 1 − 2e−t, n r  X > p + t p + t m log(ne/m) sup Σ − (w|I )iζiζi ≤ 5A3 + + , op n n n w∈Wn,n−m+o i=1 |I|≥n−o where A3 is the same constant as in Lemma 12.

PROOFOF LEMMA 13. For every I ⊂ [n] such that |I| ≥ n − o, we check that for every weight vector w ∈ Wn,n−m+o, n n X > X > X > (n − m) Σ − (w|I )iζiζi ≤ (Σ − ζiζi ) + max (ζiζi − Σ) op |J¯|=m i=1 i=1 op i∈J¯ op > > = ζ1:nζ1:n − nΣ + max ζJ¯ζJ¯ − mΣ op |J¯|=m op > > ≤ ξ1:nξ1:n − nIp + max ξJ¯ξJ¯ − mIp . (44) op |J¯|=m op On the one hand, Lemma 12 implies that the inequality > p  kξJ¯ξJ¯ − mIkop ≤ A3 (p + t)m + p + t0  ≤ 0.5A3 m + 3p + 3t0 ,

−t0 n  holds with probability at least 1−e . Choosing t0 = t+log m and using the union bound, the last display yields that with probability at least 1 − e−t,

>   max kξJ¯ξJ¯ − mIkop ≤ 2A3 p + m log(ne/m) + t . (45) |J¯|=m Using once again Lemma 12, we can check that the inequality

> p  kξ1:nξ1:n − nIkop ≤ A3 n(p + log(2/δ)) + p + t (46) ROBUST ESTIMATION OF A GAUSSIAN MEAN 35 holds with probability at least 1 − e−t. Therefore, combining (44), (45) and (46), we get the following inequalities hold with probability at least 1 − 2e−t: n X > A3 p  sup Σ − (w|I )iζiζi ≤ n(p + t) + 2p + 2t + 2m log(ne/m) op n − m Wn,n−m+o i=1 rp + t p + t m log(ne/m) ≤ 5A + + . 3 n n n This completes the proof.

PROPOSITION 3. Let R(ζ, w,I) be defined in Proposition1 and let ζ1,..., ζn be inde- pendent centered random vectors with covariance matrix Σ satisfying λmax(Σ) = 1. Assume −1/2 that ξi = Σ ζi are sub-Gaussian with a variance proxy τ > 0. Let ε ≤ 0.28, p ≤ n and δ ∈ (4e−n, 1). The random variable Ξ = sup max R(ζ, w,I) |I|≥n(1−ε) w∈Wn(ε) satisfies, with probability at least 1 − δ, the inequality r  p + log(4/δ)  Ξ ≤ A + εplog(1/ε) , 4 n where A4 > 0 is a constant depending only on the variance proxy τ.

PROOFOF PROPOSITION 3. We recall that, for any subset I of {1, . . . , n}, √   √ 1/2 X > ¯ R(ζ, w,I) = 2εw + 2εw λmax,+ (w|I )i(Σ − ζiζi ) + (1 + 2εw)kζw|I k2. i∈I

Furthermore, for every w ∈ Wn(ε), εw ≤ ε/(1 − ε) ≤ 1.5ε. This implies that √  1/2 √ X > ¯ Ξ ≤ 3ε + 3ε sup λmax,+ (w|I )i(Σ − ζiζi ) + (1 + 3ε) sup kζw|I k2, (47) w,I i∈I w,I where supw,I is the supremum over all w ∈ Wn(ε) and over all I ⊂ {1, . . . , n} of cardinality larger than or equal to n(1 − ε). As proved in Lemma9 (by taking m = 2o and o = nε), with probability at least 1 − 2e−t, r r ¯ τp p 2tτ sup kζw|I k2 ≤ 16 + 13ε τ log(2/ε) + 8 . (48) w,I n n In addition, in view of Lemma 13 (with m = 2o and o = nε), with probability at least 1 − 2e−t, we have r  X   p + t p + t  sup λ (w ) (Σ − ζ ζ>) ≤ 5A + + 2ε log(2/ε) max,+ |I i i i 3 n n w,I i∈I rp + t  ≤ 13A + ε log(2/ε) , (49) 3 n where in the last inequality we have used that p ≤ n and t ≤ n. Combining (47), (48) and (49), we obtain that the inequality rp + t  Ξ ≤ A + εplog(1/ε) 4 n 36 DALALYAN AND MINASYAN

−t hold true with probability at least 1 − 4e , for every t ∈ [0, n], where A4 is a constant depending only on the variance proxy τ. Replacing t by log(4/δ), we get the claim of the proposition.

PROOFOF THEOREM 2. The first step of the proof consists in establishing a recurrent in- k k−1 equality relating the estimation error of µb to that of µb . More precisely, we show that this error decreases at a geometric rate, up to a small additive error term. Without loss of generality, we assume that kΣkop = 1. Let us recall the notation n n−1 1 o Wn(ε) = w ∈ ∆ : max wi ≤ i n(1 − ε) and √   √ 1/2 X > ¯ R(ζ, w,I) = 2εw + 2εwλmax,+ (w|I )i(Σ − ζiζi ) + (1 + 2εw)kζw|I k2. i∈I For every k ≥ 1, in view of Proposition1 and (3), we have √ k ∗ εk k k 1/2 kµb − µ k2 ≤ G(wb , µb ) + Ξ 1 − εk √ εk k k−1 1/2 ≤ G(wb , µb ) + Ξ, (50) 1 − εk where X εn ε ε = wk ≤ = (51) k i n − nε 1 − ε i6∈I and Ξ = max sup R(ζ, w,I). |I|≥n−nε w∈Wn(ε) √ Inequality (51) and the cat that the function x 7→ x/(1 − x) is increasing for x ∈ [0, 1) imply that √ p εk ε(1 − ε) ≤ = αε. 1 − εk 1 − 2ε k k−1 Using this inequality and the fact that wb is a minimizer of G(·, µb ), we infer from (50) that k ∗ k k−1 1/2 kµb − µ k2 ≤ αεG(wb , µb ) + Ξ (52) ∗ k−1 1/2 ≤ αεG(wI , µb ) + Ξ, ∗ ∗ 1 where wI is the weight vector defined by wi = (i ∈ I)/|I|. One can check that n n X ∗ k−1 ⊗2 X ∗ ⊗2 k−1 ⊗2 w (X − µ ) = w (X − X¯ ∗ ) + (X¯ ∗ − µ ) i i b i i wI wI b i=1 i=1 n X ∗ ⊗2 k−1 2  w (X − X¯ ∗ ) + X¯ ∗ − µ I i i wI wI b 2 p i=1 n X ∗ ∗ ⊗2 k−1 2  w (X − µ ) + X¯ ∗ − µ I . i i wI b 2 p i=1 ROBUST ESTIMATION OF A GAUSSIAN MEAN 37

This readily yields  n  ∗ k−1 X ∗ ∗ ⊗2 k−1 2 G(w , µ ) ≤ λ w (X − µ ) − Σ + X¯ ∗ − µ I b max,+ i i wI b 2 i=1 ∗ ∗ ∗ k−1 2 ≤ G(w , µ ) + ζ¯ ∗ + µ − µ . I wI 2 b 2 Combining with inequality (52), we arrive at k ∗ ∗ ∗ 1/2 ¯ ∗ k−1 kµ − µ k2 ≤ αεG(w , µ ) + αε ζ ∗ + αε µ − µ + Ξ. b I wI 2 b 2 Thus, we have obtained the desired recurrent inequality kµk − µ∗k ≤ α µk−1 − µ∗ + Ξ, b 2 ε b 2 e (53) with ∗ ∗ 1/2 ¯ Ξ = αεG(w , µ ) + αε ζ ∗ + Ξ e I wI 2 √ 1/2 √ X > ¯ ≤ 5ε sup (w|I )i(Σ − ζiζi ) + 5ε ζw∗ + Ξ. (54) I 2 w,I i∈I op √ The last inequality above follows from the inequality αε ≤ 5ε as soon as ε ≤ 0.3. Unfolding the recurrent inequality (53), we get Ξe kµK − µ∗k ≤ αK µ0 − µ∗ + . b 2 ε b 2 1 − αε In view of Lemma1 and the condition ε ≤ 0.3, we get n 5αK X Ξe kµK − µ∗k ≤ ε ζ + . b 2 i 2 n 1 − αε i=1 Simple algebra yields n  n 1/2  n 1/2 1 X 1 X 2 √ 1 X 2 ζ ≤ ξ ≤ p + ( ξ − p) n i 2 n i 2 n i 2 i=1 i=1 i=1 √ Tr(ξ ξ> − I )1/2 = p + 1:n 1:n p n

√ p > 1/2 ≤ p + p/n ξ1:nξ1:n − Ip op . Thus, in view of Lemma6, with probability at least 1 − δ, n  r  1 X √ p 1.5A3 log(1/δ) ζ ≤ p 1 + 2A3 + . n i 2 n i=1 −n K √ Since δ ≥ 4e and K is chosen in such a way that αε ≤ 0.5ε/ p, we get, on an event of probability at least 1 − δ,

K ∗ Ξe kµb − µ k2 ≤ 0.5ε(1 + 3A3) + . 1 − αε Combining (54) with Proposition3, Lemma8 and Lemma 13, we check that for some con- stant A5 depending only on the variance proxy τ, the inequality r ! K ∗ A5 p + log(4/δ) p kµb − µ k2 ≤ + ε log(1/ε) 1 − αε n holds with probability at least 1 − 4δ. This completes the proof of the theorem. 38 DALALYAN AND MINASYAN

APPENDIX B: PROOF OF ADAPTATION TO UNKNOWN CONTAMINATION RATE

This section provides a proof of Theorem3. Let ε be the true contamination rate, which ∗ ∗ means that the distribution Pn ∈ SGAC(µ , Σ, ε). Let ` be the largest value of ` such that the corresponding element ε` of the grid is larger than or equal to ε. Since ε is assumed to be smaller than ε0a, the following inequalities hold

ε0 > ε`∗ ≥ ε ≥ ε`∗ a. (55)

∗ IR ∗ Let us introduce the events Ωj = {µ ∈ B(µbn (εj),Rδ(εj))}. Since Pn ∈ SGAC(µ , Σ, ε) ⊆ ∗ ∗ SGAC(µ , Σ, εj) for all j ≤ ` . We infer from Theorem2 that P(Ωj) ≥ 1 − 4δ/`max for every j ≤ `∗. Hence, by the union bound, with probability at least 1 − 4δ, we have

`∗ ∗ \ IR  µ ∈ B µbn (εj),Rδ(εj) . (56) j=1

From now on, we assume that this event is realized. Clearly, this implies that `b≥ `∗ and, therefore, IR  AIR  µ (ε ∗ ),R (ε ∗ ) ∩ µ ,R (ε ) 6= and ε ≤ ε ∗ . B bn ` δ ` B bn δ `b ∅ `b ` Using the triangle inequality, we get AIR IR kµ − µ (ε ∗ )k ≤ R (ε ∗ ) + 2R (ε ) ≤ 2R (ε ∗ ) ≤ 2R (ε/a), (57) bn bn ` 2 δ ` δ `b δ ` δ where the last inequality follows from (55) and the fact that z 7→ Rδ(z) is an increasing function. Combining (57) and (56), we get AIR ∗ AIR IR IR ∗ kµbn − µ k2 ≤ kµbn − µbn (ε`∗ )k2 + kµbn (ε`∗ ) − µ k2 ≤ 3Rδ(ε/a). This completes the proof of Theorem3.

APPENDIX C: PROOF IN THE CASE OF UNKNOWN Σ

This section is devoted to the proof of Theorem4. We follow exactly the same steps as in the proof of Theorem2. Without loss of generality, we assume that kΣkop ≤ 1. For every k ≥ 1, in view of Proposition1 and (3), we have √ √ k ∗ εk k k 1/2 εk k k−1 1/2 kµb − µ k2 ≤ G(wb , µb ) + Ξ ≤ G(wb , µb ) + Ξ, (58) 1 − εk 1 − εk where X εn ε ε = wk ≤ = (59) k i n − nε 1 − ε i6∈I

and Ξ = max|I|≥n−nε supw∈Wn(ε) R(ζ, w,I). √ Inequality (59) and the cat that the function x 7→ x/(1 − x) is increasing for x ∈ [0, 1) imply that √ p εk ε(1 − ε) ≤ = αε. 1 − εk 1 − 2ε ROBUST ESTIMATION OF A GAUSSIAN MEAN 39

Pn ⊗2 1/2 Using this inequality and the obvious inequality G(w, µ) ≤ k i=1 wi(Xi − µ) kop , we infer from (58) that k ∗ k k−1 1/2 kµb − µ k2 ≤ αεG(wb , µb ) + Ξ n 1/2 X k k−1 ⊗2 ≤ αε w (Xi − µ ) + Ξ. bi b i=1 op Then, using the fact that wk is the minimizer of w 7→ Pn w (X − µk−1)⊗2 , we get b i=1 i i b op k ∗ k k−1 1/2 kµb − µ k2 ≤ αεG(wb , µb ) + Ξ n 1/2 X ∗ k−1 ⊗2 ≤ αε w (Xi − µ ) + Ξ, (60) i b i=1 op ∗ ∗ 1 where wI is the weight vector defined by wi = (i ∈ I)/|I|. Simple algebra (see the proof of Theorem2 for more details) yields n 1/2 n 1/2 X ∗ k−1 ⊗2 X ∗ ∗ ⊗2 ¯ k−1 w (Xi − µ ) ≤ w (Xi − µ ) + Xw∗ − µ i b i I b 2 i=1 op i=1 op n 1/2 X ∗ ∗ ⊗2 ¯ ∗ k−1 ≤ wi (Xi − µ ) + ζw∗ + µ − µ I 2 b 2 i=1 op ∗ ∗ 1/2 ¯ ∗ k−1 ≤ G(w , µ ) + kΣkop + ζ ∗ + µ − µ . I wI 2 b 2 Combining with inequality (60), we arrive at k ∗ ∗ ∗ 1/2 ¯ ∗ k−1 kµ − µ k2 ≤ αεG(w , µ ) + αε + αε ζ ∗ + αε µ − µ + Ξ. b I wI 2 b 2 Thus, we have obtained the desired recurrent inequality kµk − µ∗k ≤ α µk−1 − µ∗ + α + Ξ, b 2 ε b 2 ε e (61) with ∗ ∗ 1/2 ¯ Ξ = αεG(w , µ ) + αε ζ ∗ + Ξ e I wI 2 √ 1/2 √ X > ¯ ≤ 5ε sup (w|I )i(Σ − ζiζi ) + 5ε ζw∗ + Ξ. I 2 w,I i∈I op √ The last inequality above follows from the inequality αε ≤ 5ε as soon as ε ≤ 0.3. Unfolding the recurrent inequality (61), we get

αε + Ξe kµK − µ∗k ≤ αK µ0 − µ∗ + . b 2 ε b 2 1 − αε In view of Lemma1 and the condition ε ≤ 0.3, we get K n 5α X αε + Ξe kµK − µ∗k ≤ ε ζ + . b 2 i 2 n 1 − αε i=1 We have already seen that with probability at least 1 − 4δ, K n r ! 5αε X Ξe A5 p + log(4/δ) p ζi 2 + ≤ + ε log(1/ε) , n 1 − αε 1 − αε n i=1 √ where A > 1 is a constant depending only on τ. In addition, we have α ≤ 5ε and p 5 √ p √ ε ε log(1/ε) ≤ 0.6 ε, which lead to αε + A5ε log(1/ε) ≤ 3A5 ε. This completes the proof of the theorem. 40 DALALYAN AND MINASYAN

APPENDIX D: LOWER BOUND FOR THE GAUSSIAN MODEL WITH ADVERSARIAL CONTAMINATION

In this section we provide the proof of the lower bound on the expected risk in the setting of HC ∗ GAC model. To this end, we first denote by Mn (ε, µ ) the set of joint probability distri- butions Pn of the random vectors X1,..., Xn coming from Huber’s contamination model. Recall that Huber’s contamination model reads as follows i.i.d. X1,..., Xn ∼ Pε,µ∗,Q, n where Pε,µ∗,Q = (1 − ε)Pµ∗ + εQ. It is evident that with probability ε all observations X1,..., Xn can be outliers, i.e., drawn from distribution Q. This means that on an event of non-zero probability, it is impossible to have a bounded estimation error. To overcome this difficulty, one can focus on Huber’s deterministic contamination (HDC) model (see [Bateni and Dalalyan, 2020]). In this model, it is assumed that the number of outliers is at most dnεe. The set of joint probability distributions Pn coming from Huber’s deterministic contamina- HDC ∗ tion model is denoted by Mn (ε, µ ). We define the worst-case risks for these two models RHC (µ , ε) = sup sup R (µ , µ∗), max bn Pn bn µ∗ HC ∗ Pn∈Mn (ε,µ ) RHDC(µ , ε) = sup sup R (µ , µ∗). max bn Pn bn µ∗ HDC ∗ Pn∈Mn (ε,µ )

For Huber’s contamination model, the following lower bound holds true.

THEOREM 5 (Theorem 2.2 from [Chen et al., 2018]). There exists some universal constant C > 0 such that  r  ∗ 1/2 rΣ inf sup sup P kµbn − µ k2 ≥ CkΣkop + ε ≥ 1/2, µ µ∗ HC ∗ n b n Pn∈Mn (ε,µ ) for any ε ∈ [0, 1].

The combination of the lower bound from Theorem5 and the relation between the risks in HC ∗ HDC ∗ Mn (ε, µ ) and Mn (ε, µ ) (Prop. 1 from [Bateni and Dalalyan, 2020]) yields the desired lower bound for the GAC model, as stated in the next proposition.

PROPOSITION 4 (Lower bound for GAC). There exist some universal constants c, ec > 0 such that if n > ec, then r  1/2 rΣ inf Rmax(µbn, Σ, ε) ≥ ckΣkop + ε . µb n n

p PROOF. First, notice that it is sufficient to prove this bound for ε > rΣ/n. Indeed, if ε < p rΣ/n then one gets the desired lower bound by simply comparing GAC model with the standard outlier free model. More formally, r OF Tr Σ inf Rmax(µbn, Σ, ε) ≥ inf Rmax(µbn) = , µb n µb n n OF where Rmax(µbn) is the worst case risk in the classical outlier free setting where the observa- ∗ tions are i.i.d. and drawn from Np(µ , Σ). ROBUST ESTIMATION OF A GAUSSIAN MEAN 41

Since the GAC model is more general than HDC model, then one clearly has HDC inf Rmax(µbn, Σ, ε) ≥ inf Rmax(µbn, ε). (62) µb n µb n On the other hand, in view of Eq. (4) from [Bateni and Dalalyan, 2020], for every r > 0, HDC −nε/6  Rmax(µn, ε) + re ≥ sup rP kµn − µk2 ≥ r . b HC ∗ b Pn∈Mn (0.5ε,µ ) 1/2p  Using the result of Theorem5 and taking first r = CkΣkop rΣ/n + ε then infimum of both sides one arrives at r  HDC 1/2 rΣ −nε/6 inf Rmax(µbn, ε) ≥ (C/4)kΣkop + ε 1 − 2e . (63) µb n n p √ √ 2 Now, using ε > rΣ/n one gets that nε > rΣn ≥ n. Therefore, taking n ≥ 36 log (4), one can check that (63) yields r  HDC 1/2 rΣ inf Rmax(µbn, 2ε) ≥ (C/8)kΣkop + ε . µb n n 2 Then, the final result follows from (62) with constants c = C/8 and ec = 36 log (4).