
ALL-IN-ONE ROBUST ESTIMATOR OF THE GAUSSIAN MEAN

BY ARNAK S. DALALYAN1 AND ARSHAK MINASYAN2

1ENSAE-CREST, [email protected]

2Yerevan State University, YerevaNN, [email protected]

The goal of this paper is to show that a single robust estimator of the mean of a multivariate Gaussian distribution can enjoy five desirable properties. First, it is computationally tractable in the sense that it can be computed in a time which is at most polynomial in dimension, sample size and the logarithm of the inverse of the contamination rate. Second, it is equivariant by translations, uniform scaling and orthogonal transformations. Third, it has a high breakdown point equal to 0.5, and a nearly-minimax-rate-breakdown point approximately equal to 0.28. Fourth, it is minimax rate optimal, up to a logarithmic factor, when data consists of independent observations corrupted by adversarially chosen outliers. Fifth, it is asymptotically efficient when the rate of contamination tends to zero. The estimator is obtained by an iterative reweighting approach. Each sample point is assigned a weight that is iteratively updated by solving a convex optimization problem. We also establish a dimension-free non-asymptotic risk bound for the expected error of the proposed estimator. It is the first result of this kind in the literature and involves only the effective rank of the covariance matrix. Finally, we show that the obtained results can be extended to sub-Gaussian distributions, as well as to the cases of unknown rate of contamination or unknown covariance matrix.

CONTENTS

1 Introduction
2 Desirable properties of a robust estimator
3 Iterative reweighting approach
4 Relation to prior work and discussion
5 Formal statement of main building blocks
6 Sub-Gaussian distributions, high-probability bounds and adaptation
7 Empirical results
8 Postponed proofs

AMS 2000 subject classifications: Primary 62H12; secondary 62F35.
Keywords and phrases: Gaussian mean, robust estimation, breakdown point, minimax rate, computational tractability.

arXiv:2002.01432v2 [math.ST] 4 Mar 2021

1. Introduction. Robust estimation is one of the most fundamental problems in statistics. Its goal is to design efficient methods capable of processing data sets contaminated by outliers, so that these outliers have little influence on the final result. The notion of an outlier is hard to define for a single data point. It is also hard, inefficient and often impossible to clean data by removing the outliers. Instead, one can build methods that take as input the contaminated data set and provide as output an estimate which is not very sensitive to the contamination.

Recent advances in data acquisition and computational power provoked a revival of interest in robust estimation and learning, with a focus on finite sample results and computationally tractable procedures. This was in contrast to the more traditional studies analyzing asymptotic properties of such statistical methods.

This paper builds on recent advances made in robust estimation and suggests a method that has attractive properties both from asymptotic and finite-sample points of view. Furthermore, it is computationally tractable and its statistical complexity depends optimally on the dimension. As a matter of fact, we even show that what really matters is the intrinsic dimension, defined in the Gaussian model as the effective rank of the covariance matrix.

Note that in the framework of robust estimation, the high-dimensional setting is qualitatively different from the one-dimensional setting. This qualitative difference can be shown at two levels. First, from a computational point of view, the running time of several robust methods scales poorly with dimension. Second, from a statistical point of view, while a simple "remove then average" strategy might be successful in low-dimensional settings, it can easily be seen to fail in the high-dimensional case. Indeed, assume that for some ε ∈ (0, 1/2), p ∈ N, and n ∈ N, the data X_1, ..., X_n consist of n(1 − ε) points (inliers) drawn from a p-dimensional Gaussian distribution N_p(0, I_p) (where I_p is the p × p identity matrix) and εn points (outliers) equal to a given vector u. Consider an idealized setting in which, for a given threshold r > 0, an oracle tells the user whether or not X_i is within a distance r of the true mean 0. A simple strategy for robust mean estimation consists of removing all the points of Euclidean norm larger than 2√p and averaging all the remaining points. If the norm of u is equal to √p, one can check that the distance between this estimator and the true mean µ = 0 is of order √(p/n) + ε||u||_2 = √(p/n) + ε√p. This error rate is provably optimal in the small-dimensional setting p = O(1), but suboptimal as compared to the optimal rate √(p/n) + ε when the dimension p is not constant. The reason for this suboptimality is that the individually harmless outliers, lying close to the bulk of the point cloud, have a strong joint impact on the quality of estimation.

We postpone a review of the relevant prior work to Section 4 in order to ease comparison with our results, and proceed here with a summary of our contributions. In the context of a data set subject to a fully adversarial corruption, we introduce a new estimator of the Gaussian mean that enjoys the following properties (the precise meaning of these properties is given in Section 2):
• it is computable in polynomial time,
• it is equivariant with respect to similarity transformations (translations, uniform scaling and orthogonal transformations),
• it has a high (minimax) breakdown point: ε* = (5 − √5)/10 ≈ 0.28,
• it is minimax-rate-optimal, up to a logarithmic factor,
• it is asymptotically efficient when the rate of contamination tends to zero,
• for inhomogeneous covariance matrices, it achieves a better sample complexity than all the other previously studied methods.

In order to keep the presentation simple, all the aforementioned results are established in the case where the inliers are drawn from the Gaussian distribution. We then show that the extension to a sub-Gaussian distribution can be carried out along the same lines. Furthermore, we prove that using Lepski's method, one can get rid of the knowledge of the contamination rate. More precisely, we establish that the rate √(p/n) + ε√log(1/ε) can be achieved without any information on ε other than ε < (5 − √5)/10 ≈ 0.28. Finally, we prove that the same order of magnitude of the estimation error is achieved when the covariance matrix Σ is unknown but isotropic (i.e., proportional to the identity matrix). When the covariance matrix is an arbitrary unknown matrix with bounded operator norm, our estimator has an error of order √(p/n) + √ε, which is the best known rate of estimation by a computationally tractable procedure in the case of unknown covariance matrices.

The rest of this paper is organized as follows. We complete this introduction by presenting the notation used throughout the paper. Section 2 describes the problem setting and provides the definitions of the properties of robust estimators such as rate optimality or breakdown point. The iteratively reweighted mean estimator is introduced in Section 3. This section also contains the main facts characterizing the iteratively reweighted mean estimator along with their high-level proofs. A detailed discussion of the relation to prior work is included in Section 4. Section 5 is devoted to a formal statement of the main building blocks of the proofs. Extensions to the cases of sub-Gaussian distributions, unknown ε and Σ are examined in Section 6. Some empirical results illustrating our theoretical claims are reported in Section 7. Postponed proofs are gathered in Section 8 and in the appendix.

For any vector v, we use the norm notations ||v||_2 for the standard Euclidean norm, ||v||_1 for the sum of absolute values of the entries and ||v||_∞ for the largest entry of v in absolute value. The tensor product of v by itself is denoted by v⊗2 = vv^⊤. We denote by ∆^{n−1} and by S^{n−1}, respectively, the probability simplex and the unit sphere in R^n. For any symmetric matrix M, λ_max(M) is the largest eigenvalue of M, while λ_max,+(M) is its positive part. The operator norm of M is denoted by ||M||_op. We will often use the effective rank r_M defined as Tr(M)/||M||_op, where Tr(M) is the trace of the matrix M. For symmetric matrices A and B of the same size we write A ⪰ B if the matrix A − B is positive semidefinite. For a rectangular p × n matrix A, we let s_min(A) and s_max(A) be the smallest and the largest singular values of A, defined respectively as s_min(A) = inf_{v∈S^{n−1}} ||Av||_2 and s_max(A) = sup_{v∈S^{n−1}} ||Av||_2. The set of all p × p positive semidefinite matrices is denoted by S^p_+.

2. Desirable properties of a robust estimator. We consider the setting in which the sample points are corrupted versions of independent and identically distributed random vectors drawn from a p-variate Gaussian distribution with mean µ* and covariance matrix Σ. In what follows, we will assume that the rate of contamination and the covariance matrix are known and, therefore, can be used for constructing an estimator of µ*. We present in Section 6 some additional results which are valid under relaxations of this assumption.

DEFINITION 1. We say that the distribution P_n of the data X_1, ..., X_n is Gaussian with adversarial contamination, denoted by P_n ∈ GAC(µ*, Σ, ε) with ε ∈ (0, 1/2) and Σ ⪰ 0, if there is a set of n independent and identically distributed random vectors Y_1, ..., Y_n drawn from N_p(µ*, Σ) satisfying

|{ i : X_i ≠ Y_i }| ≤ εn.

In what follows, the sample points X_i with indices in the set O = {i : X_i ≠ Y_i} are called outliers, while all the other sample points are called inliers. We define I = {1, ..., n} \ O, the set of inliers. Assumption GAC allows both the set of outliers O and the outliers themselves to be random and to depend arbitrarily on the values of Y_1, ..., Y_n. In particular, we can consider a game in which an adversary has access to the clean observations Y_1, ..., Y_n and is allowed to modify an ε fraction of them before unveiling them to the Statistician. The Statistician aims at estimating µ* as accurately as possible, the accuracy being measured by the expected estimation error:
R_{P_n}(µ̂_n, µ*) = ||µ̂_n − µ*||_{L_2(P_n)} = ( E_{P_n}[ Σ_{j=1}^p (µ̂_{n,j} − µ*_j)² ] )^{1/2}.

Thus, the goal of the adversary is to apply a contamination that makes the task of estimation the hardest possible. The goal of the Statistician is to find an estimator µ̂_n that minimizes the worst-case risk
R_max(µ̂_n, Σ, ε) = sup_{µ*∈R^p} sup_{P_n∈GAC(µ*,Σ,ε)} R_{P_n}(µ̂_n, µ*).

Let r_Σ = Tr(Σ)/||Σ||_op be the effective rank of Σ. The theory developed by [Chen et al., 2016, 2018], in conjunction with [Bateni and Dalalyan, 2020, Prop. 1], implies that
inf_{µ̂_n} R_max(µ̂_n, Σ, ε) ≥ c ||Σ||_op^{1/2} ( √(r_Σ/n) + ε )   (1)
for some constant c > 0, where the infimum is over all measurable functions of (X_1, ..., X_n). A detailed proof of this claim is presented in Appendix D. This lower bound naturally leads to the following definition.

DEFINITION 2. We say that the estimator µ̂_n is minimax rate optimal (in expectation), if there are universal constants c_1, c_2 and C such that
R_max(µ̂_n, Σ, ε) ≤ C ||Σ||_op^{1/2} ( √(r_Σ/n) + ε )
for every (n, Σ, ε) satisfying r_Σ ≤ c_1 n and ε ≤ c_2.

The iteratively reweighted mean estimator, introduced in the next section, is not minimax rate optimal but is very close to being so. Indeed, we will prove that it is minimax rate optimal up to a √log(1/ε) factor in the second term (ε is replaced by ε√log(1/ε)). It should be stressed here that, to the best of our knowledge, none of the results on robust estimation of the Gaussian mean provides rate-optimality in expectation in the high-dimensional setting. Indeed, all those results provide risk bounds that hold with high probability, and either (a) do not say anything about the magnitude of the error on a set of small but strictly positive probability or (b) use the confidence level in the construction of the estimator. Both of these shortcomings prevent one from extracting bounds for the expected loss from high-probability bounds. This being said, it should be noted that most prior work has focused on the Huber contamination, in which case no meaningful (i.e., different from +∞, see Section 2.6 in [Bateni and Dalalyan, 2020]) upper bound on the minimax risk in expectation can be obtained.

DEFINITION 3. We say that µ̂_n is an asymptotically efficient estimator of µ*, if, when ε = ε_n tends to zero sufficiently fast as n tends to infinity, we have
R_max(µ̂_n, Σ, ε) ≤ ||Σ||_op^{1/2} √(r_Σ/n) (1 + o_n(1)).

If we compare the inequalities in Definition 2 and Definition 3, we can see that the constant C present in the former disappeared in the latter. This reflects the fact that asymptotic efficiency implies not only rate-optimality, but also the optimality of the constant factor. We recall here that in the outlier-free situation, the sample mean is asymptotically efficient and its worst-case risk is equal to ||Σ||_op^{1/2} √(r_Σ/n). One can infer from (1) that a necessary condition for the existence of an asymptotically efficient estimator is ε_n² = o_n(r_Σ/n). We show in the next section that this condition is almost sufficient, by proving that the iteratively reweighted mean estimator is asymptotically efficient provided that ε_n² log(1/ε_n) = o_n(r_Σ/n).

The last notion that we introduce in this section is the breakdown point, the term being coined by Hampel [1968], see also [Donoho and Huber, 1983]. Roughly speaking, the breakdown point of a given estimator is the largest proportion of outliers that the estimator can support without becoming infinitely large. The definition we provide below slightly differs from the original one. This difference is motivated by our goal to focus on studying the expected risk of robust estimators.

DEFINITION 4. We say that ε*_n ∈ [0, 1/2] is the (finite-sample) breakdown point of the estimator µ̂_n, if
R_max(µ̂_n, Σ, ε) < +∞ for every ε < ε*_n,
and R_max(µ̂_n, Σ, ε) = +∞ for every ε > ε*_n.

One can check that the breakdown points of the componentwise and the geometric median (see the definition of µ̂^GM_n in (4) below) equal 1/2. Unfortunately, the minimax rate of these methods is strongly suboptimal, see [Chen et al., 2018, Prop. 2.1] and [Lai et al., 2016, Prop. 2.1]. Among all rate-optimal (up to a polylogarithmic factor) robust estimators, Tukey's median is¹ the one with the highest known breakdown point, equal to 1/3 [Donoho and Gasko, 1992]. It should be noted that the latter paper deals with the original definition of the breakdown point which, as already mentioned, is slightly different from that of Definition 4.
The notion of breakdown point given in Definition 4, well adapted to estimators that do not rely on the knowledge of ε, becomes less relevant in the context of known ε. Indeed, if a given estimator µ̂_n(ε) is proved to have a breakdown point equal to 0.1, one can consider instead the estimator µ̃_n(ε) = µ̂_n(ε)1(ε < 0.1) + µ̂^GM_n 1(ε ≥ 0.1), which will have a breakdown point equal to 0.5. For this reason, it appears more appealing to consider a different notion that we call the rate-breakdown point, and which is of the same flavor as the δ-breakdown point defined in [Chen et al., 2016].

DEFINITION 5. We say that ε*_r ∈ [0, 1/2] is the r(n, Σ, ε)-breakdown point of the estimator µ̂_n for a given function r defined on N × S^p_+ × [0, 1/2), if for every ε < ε*_r,
sup_{n,p} R_max(µ̂_n(ε), Σ, ε) / r(n, Σ, ε) < +∞,
and ε*_r is the largest value satisfying this property.

In the context of Gaussian mean estimation, if the previous definition is applied with r(n, Σ, ε) = ||Σ||_op^{1/2}( √(r_Σ/n) + ε ), we call the corresponding value the minimax-rate-breakdown point. Similarly, if r(n, Σ, ε) = ||Σ||_op^{1/2}( √(r_Σ/n) + ε√log(1/ε) ), we call the corresponding value the nearly-minimax-rate-breakdown point. It should be mentioned here that the extra √log(1/ε) factor cannot be avoided by any Statistical Query (SQ) polynomial-time algorithm, as shown by the SQ lower bound established in [Diakonikolas et al., 2017].

¹Recent results in [Zhu et al., 2020] suggest that an estimator based on the TV-projection may achieve the optimal breakdown point of 1/2.

Algorithm 1: Iteratively reweighted mean estimator (known ε and Σ)

Input: data X_1, ..., X_n ∈ R^p, contamination rate ε and covariance matrix Σ
Output: parameter estimate µ̂^IR_n
Initialize: compute µ̂^0 as a minimizer of Σ_{i=1}^n ||X_i − µ||_2
Set K = 0 ∨ ⌈ (log(4r_Σ) − 2 log(ε(1 − 2ε))) / (2 log(1 − 2ε) − log ε − log(1 − ε)) ⌉.
For k = 1 : K
    Compute current weights:
        w ∈ arg min_{(n−nε)||w||_∞ ≤ 1} λ_max( Σ_{i=1}^n w_i (X_i − µ̂^{k−1})⊗2 − Σ ) ∨ 0.
    Update the estimator: µ̂^k = Σ_{i=1}^n w_i X_i.
EndFor
Return µ̂^K.

3. Iterative reweighting approach. In this section, we define the iterative reweighting estimator that will later be proved to enjoy all the desirable properties. To this end, we set
X̄_w = Σ_{i=1}^n w_i X_i,   G(w, µ) = λ_max,+( Σ_{i=1}^n w_i (X_i − µ)⊗2 − Σ )   (2)
for any pair of vectors w ∈ [0, 1]^n and µ ∈ R^p. The main idea of the proposed method is to find a weight vector ŵ_n belonging to the probability simplex
∆^{n−1} = { w ∈ [0, 1]^n : w_1 + ... + w_n = 1 }
that mimics the ideal weight vector w* defined by w*_j = 1(j ∈ I)/|I|, so that the weighted average X̄_{ŵ_n} is nearly as close to µ* as the average of the inliers. Note that, for any weight vector w ∈ ∆^{n−1} and any vector µ ∈ R^p, we have
Σ_{i=1}^n w_i (X_i − µ)⊗2 = Σ_{i=1}^n w_i (X_i − X̄_w)⊗2 + (X̄_w − µ)⊗2 ⪰ Σ_{i=1}^n w_i (X_i − X̄_w)⊗2.

This readily yields that G(w, µ) ≥ G(w, X̄_w), and, therefore,
G(w, X̄_w) = min_{µ∈R^p} G(w, µ),   ∀w ∈ ∆^{n−1}.   (3)
The precise definition of the proposed estimator is as follows. We start from an initial estimator µ̂^0 of µ*. To give a concrete example, and also in order to guarantee equivariance by similarity transformations, we assume that µ̂^0 is the geometric median:
µ̂^0 = µ̂^GM_n ∈ arg min_{µ∈R^p} Σ_{i=1}^n ||X_i − µ||_2.   (4)
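For concreteness, the criterion G in (2) can be evaluated directly. The snippet below is a minimal numpy sketch of ours (not part of the paper); the names G, X, Sigma are illustrative.

```python
import numpy as np

def G(w, mu, X, Sigma):
    """Evaluate G(w, mu) = lambda_max,+( sum_i w_i (X_i - mu)(X_i - mu)^T - Sigma )."""
    D = X - mu                                   # rows are X_i - mu
    S = (D * w[:, None]).T @ D                   # weighted second-moment matrix
    return max(np.linalg.eigvalsh(S - Sigma)[-1], 0.0)   # positive part of largest eigenvalue
```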

DEFINITION 6. We call iteratively reweighted mean estimator, denoted by µ̂^IR_n, the K-th element of the sequence {µ̂^k; k = 0, 1, ...} starting from µ̂^0 in (4) and defined by the recursion
ŵ^k ∈ arg min_{(n−nε)||w||_∞ ≤ 1} G(w, µ̂^{k−1}),   µ̂^k = X̄_{ŵ^k},   (5)
where the minimum is over all weight vectors w ∈ ∆^{n−1} satisfying max_j w_j ≤ 1/(n − nε) and the number of iterations is
K = 0 ∨ ⌈ (log(4r_Σ) − 2 log(ε(1 − 2ε))) / (2 log(1 − 2ε) − log ε − log(1 − ε)) ⌉.   (6)

[Figure 1 plots the number of iterations K against the contamination level ε for p = 10², 10⁵, 10¹⁰.]
FIG 1. The behavior of the number of iterations K = K_ε, given by (6), as a function of the contamination rate ε for different values of the dimension p = r_Σ.
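As a quick numerical illustration of (6) (a computation of ours, not taken from the paper): for r_Σ = 100 and ε = 0.1, one gets
K = ⌈ (log 400 − 2 log 0.08) / (2 log 0.8 − log 0.1 − log 0.9) ⌉ ≈ ⌈ 11.04 / 1.96 ⌉ = 6 iterations.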

The idea of computing a weighted mean, with weights measuring the outlyingness of the observations, goes back at least to [Donoho, 1982, Stahel, 1981]. Perhaps the first idea similar to that of minimizing the largest eigenvalue of the covariance matrix was that of minimizing the determinant of the sample covariance matrix over all subsamples of a given cardinality [Rousseeuw, 1985, 1984]. It was also observed in [Lopuhaä and Rousseeuw, 1991] that one can improve the estimator by iteratively updating the weights. An overview of these results can be found in [Rousseeuw and Hubert, 2013].
Note that the value of K provided above is tailored to the case where the initial estimator is the geometric median. Clearly, K depends only logarithmically on the dimension and K = K_ε tends to 2 when ε goes to zero, see Figure 1. Note also that the choice of K in (6) is derived from the condition (√(ε(1 − ε))/(1 − 2ε))^K ||µ̂^0 − µ*||_{L_2} ≤ ε, see (10) below. We can use any other initial estimator of µ* instead of the geometric median, provided that K is large enough to satisfy the last inequality.
We have to emphasize that µ̂^IR_n relies on the knowledge of both ε and Σ (the dependence on Σ is through the effective rank, which coincides with the dimension for non-degenerate covariance matrices). Indeed, the number of iterations K depends on both ε and Σ. Additionally, Σ is used in the cost function G(w, µ) and ε is used for specifying the set of feasible weights in the optimization problem (5). We present some extensions to the case of unknown ε and Σ in Section 6.
The rest of this section is devoted to showing that the iteratively reweighted estimator enjoys all the desirable properties announced in the introduction. An estimator is called computationally tractable if its computational complexity is at most polynomial in n, p and 1/ε.

Fact 1

The estimator µ̂^IR_n is computationally tractable.

In order to check computational tractability, it suffices to prove that each iteration of the algorithm can be performed in polynomial time. Since the number of iterations depends logarithmically on r_Σ ≤ p, this will suffice. Note now that the optimization problem in (5) is convex and can be cast into a semi-definite program. Indeed, it is equivalent to minimizing a real value t over all the pairs (t, w) satisfying the constraints
t ≥ 0,   w ∈ ∆^{n−1},   ||w||_∞ ≤ 1/(n(1 − ε)),   Σ_{i=1}^n w_i (X_i − µ̂^{k−1})⊗2 ⪯ Σ + t I_p.
The first two constraints can be rewritten as a set of linear inequalities, while the third constraint is a linear matrix inequality. Given the special form of the cost function and the constraints, it is possible to design specific optimization routines which will find an approximate solution to the problem in a faster way than off-the-shelf SDP solvers. However, we will not pursue this line of research in this work.
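The following is a minimal Python sketch of ours of Algorithm 1, casting the weight update as the convex problem above and solving it with cvxpy and the SCS solver; it is not the authors' implementation (Section 7 mentions an R/MOSEK one). The function names are illustrative, and the geometric-median initialization is approximated by Weiszfeld iterations.

```python
import math
import numpy as np
import cvxpy as cp

def weiszfeld(X, n_iter=100, tol=1e-8):
    """Approximate geometric median of the rows of X by Weiszfeld iterations."""
    mu = np.median(X, axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), tol)   # avoid division by zero
        mu_new = (X / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

def iteratively_reweighted_mean(X, eps, Sigma):
    """Sketch of Algorithm 1; requires 0 < eps < (5 - sqrt(5))/10."""
    n, p = X.shape
    r_Sigma = np.trace(Sigma) / np.linalg.norm(Sigma, 2)       # effective rank
    denom = 2 * math.log(1 - 2 * eps) - math.log(eps) - math.log(1 - eps)
    K = max(0, math.ceil((math.log(4 * r_Sigma) - 2 * math.log(eps * (1 - 2 * eps))) / denom))
    mu = weiszfeld(X)                                          # initial estimator
    for _ in range(K):
        w = cp.Variable(n)
        D = X - mu
        # weighted covariance sum_i w_i (X_i - mu)(X_i - mu)^T (naive, not the fastest construction)
        S = sum(w[i] * np.outer(D[i], D[i]) for i in range(n))
        constraints = [w >= 0, cp.sum(w) == 1, w <= 1.0 / (n * (1 - eps))]
        cp.Problem(cp.Minimize(cp.lambda_max(S - Sigma)), constraints).solve(solver=cp.SCS)
        mu = w.value @ X                                       # weighted mean update
    return mu
```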

Fact 2

The estimator µ̂^IR_n is translation, uniform scaling and orthogonal transformation equivariant.

The equivariance mentioned in this statement should be understood as follows. If we denote by µ̂^IR_{n,X} the estimator computed for data X_1, ..., X_n, and by µ̂^IR_{n,X'} the one computed for data X'_1, ..., X'_n, with X'_i = a + λUX_i, where a ∈ R^p, λ > 0 and U is a p × p orthogonal matrix, then µ̂^IR_{n,X'} = a + λUµ̂^IR_{n,X}. To prove this property, we first note that
min_{µ∈R^p} Σ_{i=1}^n ||X'_i − µ||_2 = λ min_{µ∈R^p} Σ_{i=1}^n ||X_i − λ^{−1}U^⊤(µ − a)||_2.
This implies that µ̂^GM_{n,X} = λ^{−1}U^⊤(µ̂^GM_{n,X'} − a), which is equivalent to µ̂^GM_{n,X'} = a + λUµ̂^GM_{n,X}. Therefore, the initial value of the recursion is equivariant. If we add to this the fact that² G_X(w, µ) = λ²G_{X'}(w, a + λUµ) for every (w, µ), we get the equivariance of µ̂^IR_n.

Fact 3
The breakdown point ε*_n and the nearly-minimax-rate-breakdown point ε*_r of µ̂^IR_n satisfy, respectively, ε*_n = 0.5 and ε*_r ≥ (5 − √5)/10 ≈ 0.28.

We prove later in this paper (see (14)) that if X_1, ..., X_n satisfy GAC(µ*, Σ, ε), there is a random variable Ξ depending only on ζ_i = Y_i − µ*, i = 1, ..., n, such that
||X̄_w − µ*||_2 ≤ (√(ε(1 − ε))/(1 − 2ε)) G(w, µ)^{1/2} + Ξ,   ∀µ ∈ R^p,   (7)
for every w ∈ ∆^{n−1} such that n(1 − ε)||w||_∞ ≤ 1. Inequality (7) is one of the main building blocks of the proof of Facts 3 to 5. This inequality, as well as inequalities (11) and (12) below, will be formally stated and proved in subsequent sections. To check Fact 3, we set

²We use here the notation G_X(w, µ) to make clear the dependence of G in (2) on the X_i's. We also stress that when the estimator is computed for the transformed data X'_i, the matrix Σ is naturally replaced by λ²UΣU^⊤.

α_ε = √(ε(1 − ε))/(1 − 2ε) (see also footnote³) and note that
||µ̂^k − µ*||_2 = ||X̄_{ŵ^k} − µ*||_2 ≤ α_ε G(ŵ^k, µ̂^{k−1})^{1/2} + Ξ
≤ α_ε G(w*, µ̂^{k−1})^{1/2} + Ξ
≤ α_ε ( G(w*, X̄_{w*}) + ||X̄_{w*} − µ̂^{k−1}||²_2 )^{1/2} + Ξ
≤ α_ε ( G(w*, µ*) + ||X̄_{w*} − µ̂^{k−1}||²_2 )^{1/2} + Ξ
≤ α_ε ||µ̂^{k−1} − µ*||_2 + Ξ̃,   (8)
where Ξ̃ = α_ε ( G(w*, µ*)^{1/2} + ||ζ̄_{w*}||_2 ) + Ξ. Unfolding this recursion, we get⁴

||µ̂^IR_n − µ*||_2 = ||µ̂^K − µ*||_2 ≤ α_ε^K ||µ̂^0 − µ*||_2 + Ξ̃/(1 − α_ε).   (9)
The geometric median µ̂^0 = µ̂^GM_n having a breakdown point equal to 1/2, we infer from the last display that the error of the iteratively reweighted estimator remains bounded after altering an ε-fraction of data points provided that α_ε < 1. This implies that the breakdown point is at least equal to the solution of the equation √(ε(1 − ε)) = 1 − 2ε; squaring both sides gives 5ε² − 5ε + 1 = 0, whose smaller root is (5 − √5)/10, which yields ε* ≥ (5 − √5)/10. Moreover, if ε ∈ [(5 − √5)/10, 1/2], then the number of iterations K equals zero and the iteratively reweighted mean coincides with the geometric median. Therefore, its breakdown point is 1/2.

Fact 4

The estimator µ̂^IR_n is nearly minimax rate optimal, in the sense that its worst-case risk is bounded by C||Σ||_op^{1/2}( √(r_Σ/n) + ε√log(1/ε) ), where C is a universal constant.

Without loss of generality, we assume that ||Σ||_op = 1 so that r_Σ = Tr(Σ). We can always reduce the initial problem to this case by considering the scaled data points X_i/||Σ||_op^{1/2} instead of X_i. Combining (9) and the triangle inequality, we get

||µ̂^IR_n − µ*||_{L_2} ≤ α_ε^K ||µ̂^GM_n − µ*||_{L_2} + ||Ξ̃||_{L_2}/(1 − α_ε).   (10)
It is not hard to check that ||µ̂^GM_n − µ*||_{L_2} ≤ 2√r_Σ/(1 − 2ε), see Lemma 2 below. Furthermore, the choice of K in (6) entails 2α_ε^K √r_Σ ≤ ε(1 − 2ε). This implies that

||µ̂^IR_n − µ*||_{L_2} ≤ ε + ||Ξ̃||_{L_2}/(1 − α_ε).
The last two building blocks of the proof are the following⁵ inequalities:
E[G(w*, µ*)] ≤ C ( 1 + √(r_Σ/n) ) √(r_Σ/n),   (11)

||Ξ||_{L_2} ≤ √(r_Σ/n)(1 + C√ε) + C√ε (r_Σ/n)^{1/4} + Cε√log(1/ε),   (12)

where C > 0 is a universal constant. In what follows, the value of C may change from one line to the other. We have
||Ξ̃||_{L_2} ≤ α_ε ( ||G(w*, µ*)^{1/2}||_{L_2} + ||ζ̄_{w*}||_{L_2} ) + ||Ξ||_{L_2}
≤ C√ε ( E^{1/2}[G(w*, µ*)] + √(r_Σ/n) ) + ||Ξ||_{L_2}
≤ C√ε (r_Σ/n)^{1/4} + √(r_Σ/n) + ||Ξ||_{L_2}
≤ √(r_Σ/n)(1 + C√ε) + C√ε (r_Σ/n)^{1/4} + Cε√log(1/ε)
≤ C√(r_Σ/n) + Cε√log(1/ε).   (13)

³See Section 8.1 for more detailed explanations.
⁴Here and in the sequel, α_ε^K stands for the K-th power of α_ε.
⁵Inequality (11) is [Koltchinskii and Lounici, 2017, Th. 4], while (12) is the claim of Proposition 2 below.

Returning to (9) and combining it with (13), we get the claim of Fact 4 for every ε ≤ ε_0, where ε_0 is any positive number strictly smaller than (5 − √5)/10. This also proves the second claim of Fact 3.

Fact 5

In the setting ε = ε_n → 0 so that ε² log(1/ε) = o_n(r_Σ/n) when n → ∞, the estimator µ̂^IR_n is asymptotically efficient.

The proof of this fact follows from (9) and (12). Indeed, if ε² log(1/ε) = o_n(r_Σ/n), then (12) implies that

||Ξ̃||²_{L_2} ≤ (1 + o_n(1)) r_Σ/n.
Injecting this bound in (9) and using the fact that ε tends to zero, we get the claim of Fact 5.

4. Relation to prior work and discussion. Robust estimation of a mean is a statistical problem studied by many authors for at least sixty years. It is impossible to give an overview of all existing results and we will not try to do it here. The interested reader may refer to the books [Maronna et al., 2006] and [Huber and Ronchetti, 2009]. We will rather focus here on some recent results that are the most closely related to the present work. Let us just recall that Huber and Ronchetti [2009] enumerate three desirable properties of a statistical procedure: efficiency, stability and breakdown. We showed here that the iteratively reweighted mean estimator possesses these features and, in addition, is equivariant and computationally tractable.
To the best of our knowledge, the form √(p/n) + ε of the minimax risk in the Gaussian mean estimation problem has been first obtained by Chen et al. [2018]. They proved that this rate holds with high probability for the Tukey median, which is known to be computationally intractable in the high-dimensional setting. The first nearly-rate-optimal and computationally tractable estimators have been proposed by Lai et al. [2016] and Diakonikolas et al. [2016]⁶. The methods analyzed in these papers are different, but they share the same idea: if for a subsample of points the empirical covariance matrix is sufficiently close to the theoretical one, then the arithmetic mean of this subsample is a good estimator of the theoretical mean. Our method is based on this idea as well, which is mathematically formalized in (7), see also Proposition 1 below. Further improvements in running times, up to obtaining a computational complexity linear in np in the case of a constant ε, are presented in [Cheng et al., 2019a]. Some lower bounds

suggesting that the log-factor in the term ε√log(1/ε) cannot be removed from the rate of computationally tractable estimators are established in [Diakonikolas et al., 2017]. In a slightly weaker model of corruption, Diakonikolas et al. [2018] propose an iterative filtering algorithm that achieves the optimal rate ε without the extra factor √log(1/ε). On a related note, [Collier and Dalalyan, 2019] shows that in a weaker contamination model termed parametric contamination, the carefully trimmed sample mean can achieve a better rate than that of the coordinatewise/geometric median. An overview of the recent advances on robust estimation with a focus on computational aspects can be found in [Diakonikolas and Kane, 2019]. Extensions of these methods to sparse mean estimation are developed in [Balakrishnan et al., 2017, Diakonikolas et al., 2019b]. All these results are proved to hold on an event with a prescribed probability, see [Bateni and Dalalyan, 2020] for a relation between results in expectation and those with high probability, as well as for the definitions of various types of contamination.
The proposed estimator shares some features with adaptive weights smoothing [Polzehl and Spokoiny, 2000]. Adaptive weights smoothing (AWS) iteratively updates the weights assigned to observations, similarly to Algorithm 1. The main difference is that the weights in AWS are not measuring the outlyingness but the relevance for interpolating a function at a given point. There are also many other statistical problems in which robust estimation has been recently revisited from the point of view of minimax rates. This includes scale and covariance matrix estimation [Chen et al., 2018, Comminges et al., 2020], matrix completion [Klopp et al., 2017], multivariate regression [Chinot, 2020, Dalalyan and Thompson, 2019, Gao, 2020], classification [Cannings et al., 2020, Li and Bradic, 2018], subspace clustering [Soltanolkotabi and Candès, 2012], community detection [Cai and Li, 2015], etc. Properties of robust M-estimators in high-dimensional settings are studied in [Elsener and van de Geer, 2018, Loh, 2017]. There is also an increasing body of literature on the robustness to heavy tailed distributions [Devroye et al., 2016, Lecué and Lerasle, 2019, Lecué and Lerasle, 2020, Lugosi and Mendelson, 2019, 2020, Minsker, 2018] and the computationally tractable methods in this context [Cherapanamjeri et al., 2019, Depersin and Lecué, 2019, Dong et al., 2019, Hopkins, 2018].
A potentially useful observation, from a computational standpoint, is that it is sufficient to solve the optimization problem in Equation (5) up to an error proportional to √(r_Σ/n) + √ε. Indeed, one can easily repeat all the steps in (8) to check that this optimization error does not alter the order of magnitude of the statistical error.

⁶See [Diakonikolas et al., 2019a] for the extended version.

5. Formal statement of main building blocks. The first building block, inequality (7), used in Section 3 to analyze the risk of µ̂^IR_n, upper bounds the error of estimating the mean by the error of estimating the covariance matrix. In order to formally state the result, we need some additional notations.
Let w ∈ ∆^{n−1} be a vector of weights and let I be a subset of {1, ..., n}. We use the notation w_I for the vector obtained from w by zeroing all the entries having indices outside I. Considering w as a probability on {1, ..., n}, we define w|I as the corresponding conditional probability on I, that is,
w|I ∈ ∆^{n−1},   (w|I)_i = (w_i/||w_I||_1) 1(i ∈ I).
We will make repeated use of the notation
X̄_w = Σ_{i=1}^n w_i X_i,   ξ̄_{w|I} = Σ_{i∈I} (w|I)_i ξ_i,   ζ̄_{w|I} = Σ_{i∈I} (w|I)_i ζ_i.
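As a tiny numpy illustration of ours of the restricted weights w_I and the conditional weights w|I (the arrays w and I below are hypothetical inputs, not from the paper):

```python
import numpy as np

w = np.array([0.1, 0.2, 0.3, 0.4])   # a weight vector in the probability simplex
I = np.array([0, 2])                 # indices of (presumed) inliers

w_I = np.zeros_like(w)
w_I[I] = w[I]                        # w_I: entries outside I zeroed out
w_cond = w_I / w_I.sum()             # w|I: conditional probability supported on I
```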

PROPOSITION 1. Let ζ_1, ..., ζ_n be a set of vectors such that ζ_i = X_i − µ* for every i ∈ I, where I is a subset of {1, ..., n}. For every weight vector w ∈ ∆^{n−1} such that Σ_{i∉I} w_i ≤ ε_w ≤ 1/2 and for every p × p matrix Σ, it holds that
||X̄_w − µ*||_2 ≤ (√ε_w/(1 − ε_w)) λ_max,+^{1/2}( Σ_{i=1}^n w_i (X_i − X̄_w)⊗2 − Σ ) + R(ζ, w, I),
with the remainder term
R(ζ, w, I) = √(2||Σ||_op ε_w) + √(2ε_w) λ_max,+^{1/2}( Σ_{i∈I} (w|I)_i(Σ − ζ_iζ_i^⊤) ) + (1 + √(2ε_w)) ||ζ̄_{w|I}||_2.

The proof of this result is postponed to the last section. In simple words, the claim of the proposition is that the estimation error of the weighted mean X̄_w is, up to a remainder term, governed by the quantity G(w, X̄_w)^{1/2}. It turns out that the remainder term is bounded by a small quantity uniformly in w and I, provided that these two satisfy suitable conditions. For I, it is enough to constrain the cardinality of its complement I^c = O. For w, it appears to be sufficient to assume that its sup-norm is small. In that respect, the following lemma plays a key role in the proof.

LEMMA 1. For any integer ℓ > 0, let W_{n,ℓ} be the set of all w ∈ ∆^{n−1} such that max_i w_i ≤ 1/ℓ. The following facts hold:
i) For every J ⊂ {1, ..., n} such that |J| ≥ ℓ, the uniform weight vector u^J ∈ W_{n,ℓ}.
ii) The set W_{n,ℓ} is the convex hull of the uniform weight vectors {u^J : |J| = ℓ}.
iii) For every convex mapping G : ∆^{n−1} → R, we have
sup_{w∈W_{n,ℓ}} G(w) = max_{|J|=ℓ} G(u^J),
where the last maximum is over all subsets J of cardinality ℓ of the set {1, ..., n}.
iv) If w ∈ W_{n,ℓ} then for any I such that |I| ≥ ℓ' > n − ℓ, we have w|I ∈ W_{n,ℓ+ℓ'−n}.

Let us denote by W_n(ε) the set W_{n,n(1−ε)}. This is exactly the feasible set in the optimization problem defining the iterations of Algorithm 1. It is clear that for w ∈ W_n(ε) and for |I^c| ≤ nε, we have Σ_{i∉I} w_i ≤ ε/(1 − ε). We now infer from Proposition 1 and (3) that
||X̄_w − µ*||_2 ≤ α_ε λ_max,+^{1/2}( Σ_{i=1}^n w_i (X_i − X̄_w)⊗2 − Σ ) + Ξ = α_ε inf_{µ∈R^p} G(w, µ)^{1/2} + Ξ,   (14)
with Ξ being the largest value of R(ζ, w, I) over all possible weights w ∈ W_n(ε) and subsets I ⊂ {1, ..., n} satisfying |I^c| ≤ nε. The second building block, formally stated in the next proposition, provides a suitable upper bound for the random variable Ξ.

PROPOSITION 2. Let R(ζ, w, I) be defined as in Proposition 1 and let ζ_1, ..., ζ_n be i.i.d. centered Gaussian random vectors with covariance matrix Σ satisfying λ_max(Σ) = 1. If ε ≤ 0.28, then the random variable
Ξ = sup_{|I|≥n(1−ε)} max_{w∈W_n(ε)} R(ζ, w, I)
satisfies, for a universal constant C > 0, the inequalities
||Ξ||_{L_2} ≤ √(r_Σ/n)(1 + C√ε) + C√ε (r_Σ/n)^{1/4} + Cε√log(1/ε),   (15)
||Ξ||_{L_2} ≤ √(p/n)(1 + 16√ε) + 5√(3ε) (p/n)^{1/4} + 32ε√log(2/ε),   (16)
where for the second inequality we assumed that p ≥ 2 and n ≥ p ∨ 4.

The second inequality is weaker than the first one, since obviously r_Σ ≤ p. However, the advantage of the second inequality is that it comes with explicit constants and shows that these constants are not excessively large. To close this section, we state a theorem that rephrases Fact 4 in a way that might be more convenient for future reference. Its proof is omitted, since it follows the lines of the proof of Fact 4 presented above.

THEOREM 1. Let µ̂^IR_n be the iteratively reweighted mean defined in Definition 6 and Algorithm 1. There is a universal constant C > 0 such that for any n, p ≥ 1 and for every ε < (5 − √5)/10, we have
sup_{µ*∈R^p} sup_{P_n∈GAC(µ*,Σ,ε)} E^{1/2}[ ||µ̂^IR_n − µ*||²_2 ] ≤ ( C||Σ||_op^{1/2} / (1 − 2ε − √(ε(1 − ε))) ) ( √(r_Σ/n) + ε√log(1/ε) ),
where P_n ∈ GAC(µ*, Σ, ε) means that the data points X_i are Gaussian with adversarial contamination, see Definition 1. If, in addition, p ≥ 2 and n ≥ p ∨ 10, then
sup_{µ*∈R^p} sup_{P_n∈GAC(µ*,Σ,ε)} E^{1/2}[ ||µ̂^IR_n − µ*||²_2 ] ≤ ( 10||Σ||_op^{1/2} / (1 − 2ε − √(ε(1 − ε))) ) ( √(p/n) + ε√log(1/ε) ).

To the best of our knowledge, this is the first result in the literature that provides an upper bound on the expected error of an outlier-robust estimator, which is of nearly optimal rate.

6. Sub-Gaussian distributions, high-probability bounds and adaptation. The risk bounds stated in Fact 4 and Fact 5 and formalized in Theorem 1 hold for the expected error under the condition that the reference distribution is Gaussian. Furthermore, the proposed procedure relies on the knowledge of both the contamination rate ε and the covariance matrix Σ. The goal of this section is to show how some of these restrictions can be alleviated.

6.1. High-probability risk bound for a sub-Gaussian reference distribution. As expected, the risk bounds established in previous sections can be extended to the case of sub-Gaussian distributions. Furthermore, risk bounds holding with high probability can be proved using the same techniques as those employed for proving the in-expectation bounds. In order to be more precise, we state in this subsection the high-probability counterpart of the second claim of Theorem 1. The price to pay for covering the more general sub-Gaussian case is that the constant in the right-hand side of the inequality is no longer explicit. Recall that a zero-mean random vector ξ is called sub-Gaussian with parameter τ > 0 (also known as the variance proxy), if

E[ e^{v^⊤ξ} ] ≤ e^{τ||v||²_2/2},   ∀v ∈ R^p.

We write ξ ∼ SG_p(τ). If ξ is standard Gaussian then it is sub-Gaussian with parameter 1. Similarly, if ξ is centered and belongs almost surely to the unit ball, then ξ is sub-Gaussian with parameter 1. Let us describe now the set of data-generating distributions that we consider in this section.

DEFINITION 7. We say that the joint distribution P_n of the random vectors X_1, ..., X_n ∈ R^p belongs to the sub-Gaussian model with adversarial contamination with mean µ*, covariance matrix Σ and contamination rate ε, if there are independent random vectors ξ_i ∼ SG_p(τ) such that
|{ i = 1, ..., n : X_i ≠ µ* + Σ^{1/2}ξ_i }| ≤ εn.
We then write P_n ∈ SGAC(µ*, Σ, ε).

It is clear that the Gaussian model with adversarial contamination defined in Definition 1 is a particular case of the sub-Gaussian model with adversarial contamination. In other terms, the set SGAC(µ*, Σ, ε) is strictly larger than the set GAC(µ*, Σ, ε). Nevertheless, as the result below shows, the risk bounds established for the iteratively reweighted mean algorithm remain valid uniformly over this extended class SGAC(µ*, Σ, ε).

THEOREM 2. Let µ̂^IR_n be the iteratively reweighted mean defined in Definition 6 and in Algorithm 1. Let δ ∈ (4e^{−n}, 1) be a tolerance level. There exists a constant A_5 depending only on the variance proxy τ such that if n ≥ p ≥ 2 and ε < (5 − √5)/10, then for every µ* ∈ R^p and every P_n ∈ SGAC(µ*, Σ, ε), we have
P( ||µ̂^IR_n − µ*||_2 ≤ ( A_5||Σ||_op^{1/2} / (1 − 2ε − √(ε(1 − ε))) ) ( √((p + log(4/δ))/n) + ε√log(1/ε) ) ) ≥ 1 − 4δ.

The proof of this theorem is postponed to the supplementary material. Let us just mention that [Cheng et al., 2019b, Section 1.2] claim that the rate √(p/n) + ε√log(1/ε) is optimal for sub-Gaussian distributions, meaning that, unlike in the Gaussian case, the √log(1/ε) factor cannot be removed. A formal proof of this fact can be found in the last remark of Section 2 in [Lugosi and Mendelson, 2021].

6.2. Adaptation to unknown contamination rate ε. An appealing feature of the risk bounds that hold with high probability is that they allow us to apply Lepski's method [Lepski and Spokoiny, 1997, Lepskii, 1992] for obtaining an adaptive estimator with respect to ε. The obtained adaptive estimator enjoys all the five properties enumerated in Section 3 except the asymptotic efficiency, since the adaptation results in an inflation of the risk bound by a factor 3. The precise description of the algorithm, already used in the framework of robust estimation by Collier and Dalalyan [2019], is presented below. We will denote by B(µ, r) the ball with center µ and radius r in the Euclidean space R^p.

DEFINITION 8. We choose a geometric grid ε_ℓ = a^ℓ ε_0, ℓ = 1, 2, ..., ℓ_max, of possible values of the contamination rate. Here, a ∈ (0, 1) is a real number, ε_0 = (5 − √5)/10 and ℓ_max = [0.5 log_a(p/n)]. For each ℓ = 1, ..., ℓ_max, we denote by µ̂^IR_n(ε_ℓ) the iteratively reweighted mean computed for ε = ε_ℓ, see Algorithm 1, and we set
R_δ(z) = ( A_5||Σ||_op^{1/2} / (1 − 2z − √(z(1 − z))) ) ( √((p + log(4ℓ_max/δ))/n) + z√log(1/z) ),   z ∈ [0, ε_0),
where δ ∈ (0, 1) is a tolerance level and A_5 is the constant from Theorem 2. The adaptively chosen iteratively reweighted mean estimator µ̂^AIR_n is defined by µ̂^AIR_n = µ̂^IR_n(ε_ℓ̂) where
ℓ̂ = max{ ℓ ≤ ℓ_max : ∩_{j=1}^{ℓ} B( µ̂^IR_n(ε_j); R_δ(ε_j) ) ≠ ∅ }.
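A sketch of ours (not the authors' code) of the Lepski-type selection rule in Definition 8: for each level, we test whether the balls B(µ̂^IR_n(ε_j); R_δ(ε_j)) have a common point by solving a small second-order cone feasibility problem with cvxpy; estimates and radii are hypothetical input arrays.

```python
import cvxpy as cp

def balls_intersect(centers, radii):
    """Return True if the Euclidean balls B(centers[j], radii[j]) have a common point."""
    x = cp.Variable(centers[0].shape[0])
    constraints = [cp.norm(x - c, 2) <= r for c, r in zip(centers, radii)]
    prob = cp.Problem(cp.Minimize(0), constraints)   # pure feasibility problem
    prob.solve()
    return prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

def lepski_select(estimates, radii):
    """estimates[l], radii[l] correspond to the grid point eps_l; return the selected index."""
    selected = 0
    for l in range(len(estimates)):
        if balls_intersect(estimates[: l + 1], radii[: l + 1]):
            selected = l                              # largest l with nonempty intersection
    return selected
```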

Algorithm 2: Iteratively reweighted mean estimator (known ε, unknown Σ)

Input: data X_1, ..., X_n ∈ R^p, contamination rate ε
Output: parameter estimate µ̂^IR_n
Initialize: compute µ̂^0 as a minimizer of Σ_{i=1}^n ||X_i − µ||_2
Set K = 0 ∨ ⌈ (log(4p) − 2 log(ε(1 − 2ε))) / (2 log(1 − 2ε) − log ε − log(1 − ε)) ⌉.
For k = 1 : K
    Compute current weights:
        w ∈ arg min_{(n−nε)||w||_∞ ≤ 1} λ_max( Σ_{i=1}^n w_i (X_i − µ̂^{k−1})⊗2 ).
    Update the estimator: µ̂^k = Σ_{i=1}^n w_i X_i.
EndFor
Return µ̂^K.

The estimator µ̂^AIR_n can be computed without the knowledge of the true contamination rate ε. Furthermore, its computational complexity is nearly of the same order as the complexity of computing a single instance of the iteratively reweighted mean as defined by Algorithm 1. Indeed, to compute µ̂^AIR_n, one needs to apply Algorithm 1 at most ℓ_max = [0.5 log_a(p/n)] times, and to solve a second-order cone program for checking whether the intersection of a small number of balls is empty. The next theorem, proved in the supplementary material, shows that the estimation error of this estimator µ̂^AIR_n is of the optimal rate, up to a logarithmic factor. To the best of our knowledge, this is the first result of this kind in the literature.

THEOREM 3. Let µ̂^AIR_n be the estimator defined in Definition 8. Let δ ∈ (4e^{−n}, 1) be a tolerance level. Let n ≥ p ≥ 2 and ε ≤ (5 − √5)a/10, where a ∈ (0, 1) is the parameter used in Definition 8. Then, for every µ* ∈ R^p and every P_n ∈ SGAC(µ*, Σ, ε), we have
P( ||µ̂^AIR_n − µ*||_2 ≤ ( 3A_5||Σ||_op^{1/2} / (a − 2ε − √(ε(a − ε))) ) ( √((p + log(4ℓ_max/δ))/n) + ε√log(1/ε) ) ) ≥ 1 − 4δ.

The breakdown point of the adaptive estimator µ̂^AIR_n inferred from the last theorem is slightly smaller than the one of µ̂^IR_n. Indeed, there is a factor a < 1 between these two quantities. Note that one can choose a to be very close to one. The only drawback of choosing a too close to one is the higher computational complexity of the resulting estimator.

6.3. Extension to unknown covariance Σ. The iteratively reweighted mean estimator, as defined in Algorithm 1, requires the knowledge of the covariance matrix Σ. Let us briefly discuss what happens when this matrix is unknown, by considering two qualitatively different situations.
The first situation is when the covariance matrix is isotropic, Σ = σ²I_p, with unknown σ > 0. One can easily check that all the claims of Section 3 and Section 5 hold true for a slight modification of µ̂^IR_n obtained by minimizing λ_max( Σ_i w_i (X_i − µ̂^{k−1})⊗2 − Σ ) instead of G(w, µ̂^{k−1}) in (5). For isotropic matrices Σ, we have
arg min_w λ_max( Σ_i w_i (X_i − µ̂^{k−1})⊗2 − σ²I_p ) = arg min_w λ_max( Σ_i w_i (X_i − µ̂^{k−1})⊗2 ),
which means that the resulting value is independent of Σ. Therefore, in this first situation, the modified estimator defined in Algorithm 2 satisfies the inequalities of Theorem 1 and Theorem 2.
The second situation is when Σ is unknown and arbitrary. In this case, to the best of our knowledge, there is no known computationally tractable estimator of µ* achieving a rate faster than √(p/n) + √ε. It turns out that a slightly modified version of the iteratively reweighted mean estimator defined in Algorithm 2 achieves this rate as well. The formal statement of the result entailing this claim is presented below.

THEOREM 4. Let µ̂^IR_n be the iteratively reweighted mean defined in Algorithm 2. Let δ ∈ (4e^{−n}, 1) be a tolerance level. There exists a constant A_5 depending only on the variance proxy τ such that if n ≥ p ≥ 2 and ε < (5 − √5)/10, then for every µ* ∈ R^p and every P_n ∈ SGAC(µ*, Σ, ε), we have
P( ||µ̂^IR_n − µ*||_2 ≤ ( A_5||Σ||_op^{1/2} / (1 − 2ε − √(ε(1 − ε))) ) ( √((p + log(4/δ))/n) + 3√ε ) ) ≥ 1 − 4δ.

The proof of this theorem is deferred to the supplementary material. One can use Theorem 4 to construct an adaptive (with respect to ε) estimator of µ* in the case of unknown Σ using Lepski's method detailed in Section 6.2. For this construction, it suffices to know an upper bound σ_max on the operator norm ||Σ||_op^{1/2}. The resulting estimator has an error of the order σ_max( √(p/n) + √ε ). One can also consider intermediate cases, in which the covariance matrix is not arbitrary but has a more general form than the simple isotropic one. In such a situation, it might be of interest to extend the method proposed in Algorithm 1 by using an initial estimator of Σ and by updating its value at each step. Indeed, when a weight vector is computed, it can be used for updating not only the mean but also the covariance matrix. The study of this estimator is left for future research.

7. Empirical results. This section showcases empirical results obtained by applying the iteratively reweighted mean estimator described in Algorithm 1 to synthetic data sets. We stress right away that there are multiple ways of solving the optimization problem involved in Algorithm 1, and the implementation we used in our experiments is not the most efficient one. As already mentioned, the aforementioned optimization problem can be seen as a semi-definite program and off-the-shelf algorithms can be applied to solve it. We implemented this approach in R using the MOSEK solver. All the results reported in this section are obtained using this implementation. We are currently working on an improved implementation using the dual subgradient algorithm of [Cox et al., 2014].
We applied Algorithm 1 to X_1, ..., X_n from P_n ∈ GAC(µ*, Σ, ε) with various types of contamination schemes. In the numerical experiments below, we illustrate (a) the evolution of the estimation error along the iterations, (b) the properties related to the theoretical breakdown point and (c) the performance of the estimator obtained from Algorithm 1 as compared to some simple estimators of the mean and to the oracle. In this section, the error of estimation is understood as the Euclidean distance between the estimated mean and its true value.
Notice that, due to the equivariance stated in Fact 2, it is sufficient to take as the true target mean vector µ* the zero vector 0_p and as Σ any diagonal matrix with nonnegative entries. We consider the following two schemes of outlier generation:

[Figure 2 shows log(error) versus iteration number for n = 400, 700, 1000, 1300; Figure 3 shows the squared error versus ε, compared with ε√log(1/ε).]
FIG 2. The decay of the error along the iterations for p = 9, ε = 0.2 and different values of n. The contamination scheme is "uniform outliers" with (a, b) = (0.5, 2).
FIG 3. The error as a function of the contamination rate ε for n = 500 and p = 5. The contamination scheme is "smallest" eigenvector. The number of iterations for ε > ε* was set to 30.

• Contamination by "smallest" eigenvector: Sample n i.i.d. observations Y_1, ..., Y_n drawn from N(0_p, I_p) and compute the smallest eigenvalue λ_p and the corresponding eigenvector v_p of the sample covariance matrix, defined as
Σ̂_n = (1/n) Σ_{i=1}^n (Y_i − Ȳ_n)⊗2,
where Ȳ_n := (Y_1 + ··· + Y_n)/n. Choose the nε observations from Y_1, ..., Y_n that have the highest (in absolute value) correlation coefficient with v_p and replace them by a vector proportional to v_p with proportionality coefficient equal to √p.
• Uniform outliers: Sample n observations according to the model

Y_i = θ_i + ξ_i,   with ξ_i i.i.d. ∼ N(0_p, I_p) for i = 1, ..., n,

where the θ_i's are all-zero vectors for the n(1 − ε) observations i ∈ I = {1, ..., n(1 − ε)}, while for the indices i ∉ I we have ||θ_i||_2 ≠ 0. We take the values of {θ_i}_{i∈O} to be i.i.d. from a uniform distribution, i.e., for i ∉ I we take {θ_i^j}_{j=1}^p i.i.d. ∼ U[a, b]. We took different values of a and b in the different experiments reported below. (A code sketch of both schemes is given after this list.)
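The sketch below (ours, not the authors' R code) implements the two contamination schemes described above in numpy; the correlation with v_p is approximated by the centered inner product.

```python
import numpy as np

def uniform_outliers_sample(n, p, eps, a, b, rng=None):
    """'Uniform outliers' scheme: inliers N(0, I_p), outlier shifts with i.i.d. U[a, b] coordinates."""
    rng = np.random.default_rng(rng)
    n_out = int(n * eps)
    theta = np.zeros((n, p))
    theta[n - n_out:] = rng.uniform(a, b, size=(n_out, p))     # outlier shifts
    return theta + rng.standard_normal((n, p))                  # Y_i = theta_i + xi_i

def smallest_eigenvector_sample(n, p, eps, rng=None):
    """'Smallest eigenvector' scheme: replace the n*eps points most aligned with the
    smallest eigenvector of the sample covariance by sqrt(p) times that eigenvector."""
    rng = np.random.default_rng(rng)
    Y = rng.standard_normal((n, p))
    S = np.cov(Y, rowvar=False, bias=True)                      # sample covariance (1/n normalization)
    _, eigvec = np.linalg.eigh(S)
    v = eigvec[:, 0]                                            # eigenvector of the smallest eigenvalue
    corr = np.abs((Y - Y.mean(axis=0)) @ v)                     # proxy for the correlation with v
    X = Y.copy()
    X[np.argsort(-corr)[: int(n * eps)]] = np.sqrt(p) * v
    return X

X = uniform_outliers_sample(n=500, p=20, eps=0.1, a=4, b=10, rng=0)
```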

7.1. Improvement along iterations. In this experiment we show the improvement of the estimation error along the iterations. The data for this experiment were generated according to the second contamination scheme mentioned above with a = 0.5 and b = 2. In Figure 2, we drew the logarithm of the estimation error (i.e., of the distance between µ̂_n and µ*) as a function of the iteration number. The results are obtained by averaging over 50 independent repetitions. We observe that the error decreases very fast during the first iteration and remains almost constant afterwards. As a matter of fact, in order to speed up the procedure, we can stop the iterations if the current weights w satisfy λ_max( Σ_i w_i (X_i − X̄_w)⊗2 − Σ ) ≤ ε. It is easy to check that this modified estimator still possesses all the desirable properties described in previous sections.
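A possible implementation of this early-stopping check, written by us as an illustration of the criterion just stated:

```python
import numpy as np

def stop_early(X, w, Sigma, eps):
    """True if lambda_max( sum_i w_i (X_i - Xbar_w)(X_i - Xbar_w)^T - Sigma ) <= eps."""
    mu_w = w @ X                                # weighted mean Xbar_w
    D = X - mu_w
    S = (D * w[:, None]).T @ D                  # weighted covariance around Xbar_w
    return np.linalg.eigvalsh(S - Sigma)[-1] <= eps
```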

7.2. Breakdown point. The goal of this experiment is to check empirically the validity of the breakdown point ε* := (5 − √5)/10 ≈ 0.28. To this end, we chose to contaminate the standard normal vectors by the "smallest" eigenvector scheme. Note that if the outliers are well separated from the inliers, as in the uniform outliers scheme, then Algorithm 1 detects these outliers well by pushing their weights to 0 even when ε is large (larger than 0.28).

FIG 4. ℓ2-loss for different estimators when n = 500, p = 20 and the uniform outliers scheme with a = 4 and b = 10.
FIG 5. ℓ2-loss for different estimators when p = 10, ε = 0.1 and the uniform outliers scheme with a = 4 and b = 10.

For n = 500 and p = 5, we conducted 50 independent repetitions of the experiment and plotted the error averaged over these 50 repetitions in Figure 3. For ε > 0.28, we manually set the number of iterations to⁷ 30. It is interesting to observe that there is a clear change in the mean error occurring at a value close to 0.28. This empirical result allows us to conjecture that the breakdown point of the presented estimator is indeed close to 0.28.

7.3. Comparison with other estimators. In this last experiment we wanted to provide a visual illustration of the performance of the proposed estimator µ̂^IR_n as compared to some simple competitors: the sample mean, the coordinatewise median, the geometric median and the oracle obtained by averaging all the inliers. The obtained errors, averaged over 50 independent repetitions, are depicted in Figure 4 and Figure 5. The former corresponds to n = 500, p = 20 and varying ε, while the parameters of the latter are p = 10, ε = 0.1 and n ∈ [10, 1000]. In the legend of these figures, Estimated Mu refers to µ̂^IR_n, the output of Algorithm 1, Sample Mean and Sample Median correspond respectively to the sample mean and the sample coordinatewise median, Geometric Median refers to the estimator defined in (4) and Oracle refers to the sample mean of the inliers only. The plots show that the iteratively reweighted mean estimator has an error which is nearly as small as the error of the oracle. These errors are way smaller than the errors of the other estimators included in this experiment, which is in line with all the existing theoretical results.

8. Postponed proofs. We collected in this section all the technical proofs postponed from Section 3 and Section 5. Throughout this section, we will always assume that λ_max(Σ) = ||Σ||_op = 1. As we already mentioned above, the general case can be reduced to this one by dividing all the data vectors by ||Σ||_op^{1/2}. The proofs in this section are presented according to the order of appearance of the corresponding claims in the previous sections. Since the proof of Theorem 1 relies on several lemmas and propositions, we provide in Figure 6 a diagram showing the relations between these results.

8.1. Additional details on (8). One can check that
Σ_{i=1}^n w*_i (X_i − µ̂^{k−1})⊗2 = Σ_{i=1}^n w*_i (X_i − X̄_{w*})⊗2 + (X̄_{w*} − µ̂^{k−1})⊗2
⪯ Σ_{i=1}^n w*_i (X_i − X̄_{w*})⊗2 + ||X̄_{w*} − µ̂^{k−1}||²_2 I_p
⪯ Σ_{i=1}^n w*_i (X_i − µ*)⊗2 + ||X̄_{w*} − µ̂^{k−1}||²_2 I_p.
This readily yields
G(w*, µ̂^{k−1}) ≤ λ_max,+( Σ_{i=1}^n w*_i (X_i − µ*)⊗2 − Σ ) + ||X̄_{w*} − µ̂^{k−1}||²_2
≤ G(w*, µ*) + ( ||ζ̄_{w*}||_2 + ||µ* − µ̂^{k−1}||_2 )².
To get the last line of (8), it suffices to apply the elementary consequence of the Minkowski inequality √(a + (b + c)²) ≤ √a + b + c, valid for every a, b, c > 0.

⁷Since the number of iterations K is well-defined for ε ≤ ε*_r ≈ 0.28.

8.2. Rough bound on the error of the geometric median. This subsection is devoted to the proof of an error estimate of the geometric median. This estimate is rather crude, in terms of its dependence on the sample size, but it is sufficient for our purposes. As a matter of fact, it also shows that the breakdown point of the geometric median is equal to 1/2.

LEMMA 2. For every ε ≤ 1/2, the geometric median satisfies the inequality
||µ̂^GM_n − µ*||_2 ≤ (2/(n(1 − 2ε))) Σ_{i=1}^n ||ζ_i||_2.
Furthermore, its expected error satisfies
||µ̂^GM_n − µ*||_{L_2} ≤ 2r_Σ^{1/2}/(1 − 2ε).

PROOF. Recall that the geometric median of X_1, ..., X_n is defined by
µ̂^GM_n ∈ arg min_{µ∈R^p} Σ_{i=1}^n ||X_i − µ||_2.
It is clear that
Σ_{i=1}^n ||X_i − µ̂^GM_n||_2 ≤ Σ_{i=1}^n ||X_i − µ*||_2.
Without loss of generality, we assume that µ* = 0. We also assume that nε is an integer. On the one hand, we have the simple bound
n(1 − ε)||µ̂^GM_n||_2 ≤ Σ_{i∈I} ( ||X_i − µ̂^GM_n||_2 + ||ζ_i||_2 )
≤ Σ_{i=1}^n ||X_i − µ̂^GM_n||_2 + Σ_{i∈I} ||ζ_i||_2 − Σ_{i∈O} ||X_i − µ̂^GM_n||_2
≤ Σ_{i=1}^n ||X_i − µ*||_2 + Σ_{i∈I} ||ζ_i||_2 − Σ_{i∈O} ||X_i − µ̂^GM_n||_2
≤ 2 Σ_{i∈I} ||ζ_i||_2 + Σ_{i∈O} ( ||X_i||_2 − ||X_i − µ̂^GM_n||_2 )
≤ 2 Σ_{i∈I} ||ζ_i||_2 + nε||µ̂^GM_n||_2,
where the first and the last inequalities follow from the triangle inequality. From the last display, we infer that
||µ̂^GM_n||_{L_2} ≤ (2/(n(1 − 2ε))) Σ_{i=1}^n ||ζ_i||_{L_2} ≤ (2/(1 − 2ε)) ||ζ_1||_{L_2} ≤ 2r_Σ^{1/2}/(1 − 2ε)
and we get the claim of the lemma.

[Figure 6: a diagram with boxes for Theorem 1 (bound on expected error); Lemma 2 (assessing the error of the initial estimator, the geometric median); Proposition 1 (deterministic bound involving the weighted covariance); Proposition 2 (uniform bounds on the moments of the stochastic terms); Lemma 7 (operator-norm error of weighted covariance of Gaussians); Lemma 3 (L2-error of weighted averages of Gaussian vectors); Lemmas 4, 5 (moments and deviations of singular values of Gaussian matrices); Lemma 6 (centered moments and deviations of Gaussian matrices); Lemma 1 (properties of the weights from the feasible set).]
FIG 6. A diagram showing the relation between different lemmas and propositions used in the proof of Theorem 1.

8.3. Proof of Proposition 1. To ease notation throughout this proof, we write ε̄ instead of ε_w. Simple algebra yields
X̄_w − µ* − ζ̄_{w_I} = X̄_{w_I} + X̄_{w_{I^c}} − µ* − ζ̄_{w_I}
= (1 − ε̄)µ* + ζ̄_{w_I} + X̄_{w_{I^c}} − µ* − ζ̄_{w_I}
= X̄_{w_{I^c}} − ε̄µ*
= X̄_{w_{I^c}} − ε̄X̄_w + ε̄(X̄_w − µ*).
This implies that (1 − ε̄)(X̄_w − µ*) − ζ̄_{w_I} = X̄_{w_{I^c}} − ε̄X̄_w, which is equivalent to
X̄_w − µ* − ζ̄_{w|I} = ( X̄_{w_{I^c}} − ε̄X̄_w ) / (1 − ε̄).

Therefore, we have
||X̄_w − µ* − ζ̄_{w|I}||_2 = ||X̄_{w_{I^c}} − ε̄X̄_w||_2 / (1 − ε̄)
= (1/(1 − ε̄)) || Σ_{i∈I^c} w_i (X_i − X̄_w) ||_2
= (1/(1 − ε̄)) sup_{v∈S^{p−1}} Σ_{i∈I^c} w_i v^⊤(X_i − X̄_w).   (17)
Using the Cauchy-Schwarz inequality as well as the notations M_i = (X_i − X̄_w)⊗2 and M = Σ_{i=1}^n w_i M_i, we get
( Σ_{i∈I^c} w_i v^⊤(X_i − X̄_w) )² ≤ ( Σ_{i∈I^c} w_i ) Σ_{i∈I^c} w_i |v^⊤(X_i − X̄_w)|² = ε̄ v^⊤( Σ_{i∈I^c} w_i M_i )v
= ε̄ ( v^⊤ Σ_{i=1}^n w_i M_i v − v^⊤ Σ_{i∈I} w_i M_i v )
= ε̄ v^⊤(M − Σ)v + ε̄ v^⊤( Σ − Σ_{i∈I} w_i M_i )v
≤ ε̄ { λ_max(M − Σ) + v^⊤Σv − v^⊤ Σ_{i∈I} w_i M_i v }.   (18)
Finally, for any unit vector v,
v^⊤ Σ_{i∈I} w_i M_i v = (1 − ε̄) v^⊤ Σ_{i∈I} (w|I)_i (X_i − X̄_w)⊗2 v
≥ (1 − ε̄) v^⊤ Σ_{i∈I} (w|I)_i (X_i − X̄_{w|I})⊗2 v
= (1 − ε̄) v^⊤ Σ_{i∈I} (w|I)_i (ζ_i − ζ̄_{w|I})⊗2 v
= (1 − ε̄) ( v^⊤ Σ_{i∈I} (w|I)_i ζ_iζ_i^⊤ v − (v^⊤ζ̄_{w|I})² )
≥ (1 − ε̄) ( v^⊤Σv − λ_max( Σ_{i∈I} (w|I)_i (Σ − ζ_iζ_i^⊤) ) − ||ζ̄_{w|I}||²_2 ).   (19)
Combining (18) and (19), we get
( Σ_{i∈I^c} w_i v^⊤(X_i − X̄_w) )² ≤ ε̄ λ_max(M − Σ) + ε̄² λ_max(Σ) + ε̄(1 − ε̄)( λ_max( Σ_{i∈I} (w|I)_i (Σ − ζ_iζ_i^⊤) ) + ||ζ̄_{w|I}||²_2 ).
In conjunction with (17), this yields
||X̄_w − µ* − ζ̄_{w|I}||_2 ≤ (√ε̄/(1 − ε̄)) ( λ_max(M − Σ) + ε̄ λ_max(Σ) + (1 − ε̄)( λ_max( Σ_{i∈I} (w|I)_i (Σ − ζ_iζ_i^⊤) ) + ||ζ̄_{w|I}||²_2 ) )^{1/2}.
From this relation, using the triangle inequality and the inequality √(a_1 + ... + a_n) ≤ √a_1 + ... + √a_n, we get
||X̄_w − µ*||_2 ≤ ||ζ̄_{w|I}||_2 + (√ε̄/(1 − ε̄)) λ_max,+^{1/2}(M − Σ) + ε̄||Σ||_op^{1/2}/(1 − ε̄)
+ √(ε̄/(1 − ε̄)) λ_max,+^{1/2}( Σ_{i∈I} (w|I)_i (Σ − ζ_iζ_i^⊤) ) + √(ε̄/(1 − ε̄)) ||ζ̄_{w|I}||_2
and, after rearranging the terms, the claim of the proposition follows.

8.4. Proof of Lemma1. Claim i) of this lemma is straightforward. For ii), we use the fact that the compact convex polytope Wn,` is the convex hull of its extreme points. The fact J that each uniform weight vector u is an extreme point of Wn,` is easy to check. Indeed, if 0 J 0 for two points w and w from Wn,` we have u = 0.5(w + w ), then we necessarily have 0 wJ c = wJ c = 0. Therefore, for any j ∈ J, 1 X 1 1 ≥ w = 1 − w ≥ 1 − ` − 1) × = . ` j i ` ` i∈J\{j} 1 1 J 0 J This implies that wj = ` (j ∈ J). Hence, w = u and the same is true for w . Hence, u is an extreme point. Let us prove now that all the extreme points of Wn,` are of the form J u with |J| = `. Let w ∈ Wn,` be such that one of its coordinates is strictly positive and strictly smaller than 1/`. Without loss of generality, we assume that the two smallest nonzero entries of w are w1 and w2. We have 0 < w1 < 1/` and 0 < w2 < 1/`. Set ρ = w1 ∧ w2 ∧ + − {1/` − w1} ∧ {1/` − w2}. For w = w + (ρ, −ρ, 0,..., 0) and w = w − (ρ, −ρ, 0,..., 0), + − + − we have w , w ∈ Wn,` and w = 0.5(w + w ). Therefore, w is not an extreme point of Wn,`. This completes the proof of ii). P J K−1 n Claim ii) implies that Wn,` = { |J|=` αJ u : α ∈ ∆ } with K = ` . Hence,   X J X J  J sup G(w) = sup G αJ u ≤ sup αJ G u ≤ max G(u ) w∈W K−1 K−1 |J|=` n,` α∈∆ |J|=` α∈∆ |J|=` and claim iii) follows. To prove iv), we check that X X |Ic| n − `0 ` + `0 − n kw k = w = 1 − w ≥ 1 − ≥ 1 − = . I 1 i i ` ` ` i∈I i6∈I 0 This readily yields (w|I )i ≤ 1/(` + ` − n), which leads to the claim of item iv).

8.5. Moments of suprema over Wn,` of weighted averages of Gaussian vectors. We recall that ξi’s, for i = 1, . . . , n are i.i.d. Gaussian vectors with zero mean and identity covariance 1/2 matrix, and ζi = Σ ξi. In addition, the covariance matrix Σ satisfies kΣkop = 1.

LEMMA 3. Let p ≥ 1, n ≥ 1, m ∈ [2, 0.562n] and o ∈ [0, m] be four positive integers. It holds that " #  r  1/2 2 p m 4.6mp E sup kζ¯ k ≤ r /n 1 + 7 + log(ne/m). w|I 2 Σ w∈Wn,n−m+o 2n n |I|≥n−o ROBUST ESTIMATION OF A GAUSSIAN MEAN 23

PROOFOF LEMMA 3. We have (a) (b) sup kζ¯ k ≤ sup kζ¯ k ≤ max ζ¯ J , w|I 2 w 2 u 2 w∈Wn,n−m+o w∈Wn,n−m |J|=n−m where (a) follows from claim iv) of Lemma1 and (b) is a direct consequence of claim iii) of Lemma1. Thus, we get

¯ ¯ 1 X sup kζw k2 ≤ max ζuJ = max ζi |I |J|=n−m 2 n − m |J|=n−m w∈Wn,n−m+o i∈J 2 n 1 X 1 X ≤ ζi + max ζi . (20) n − m n − m |J¯|=m i=1 2 i∈J¯ 2 On the one hand, one readily checks that  n 2 X E ζ = nrΣ. (21) i i=1 2 On the other hand, it is clear that for every J¯ of cardinality m, the random variable P 2 Pp 2 i∈J¯ ζi 2 has the same distribution as m j=1 λj(Σ)ξj , where ξ1, . . . , ξp are i.i.d. stan- dard Gaussian. Therefore, by the union bound, for every t ≥ 0, we have  2     m 2  X  n X  P max ζi > m 2rΣ + t ≤ P ζi > m 2rΣ + t |J¯|=m m i∈J¯ 2 i=1 2 p  n   X   n  = P λ (Σ)ξ2 > 2r + t ≤ e−t/3, m j j Σ m j=1 where the last line follows from a well-known bound on the tails of the generalized χ2- distribution, see for instance [Comminges and Dalalyan, 2012, Lemma 8]. 1 P 2 Therefore, setting Z = m max|J|=m i∈J ζi 2 − 2rΣ and using the well-known identity R ∞ E[Z] ≤ E[Z+] = 0 P(Z ≥ t) dt, we get  2 Z ∞   1 X n −t/3 E max ζi ≤ 2rΣ + 1 ∧ e dt |J¯|=m m 0 m i∈J¯ 2     Z ∞ n n −t/3 = 2rΣ + 3 log + e dt m m 3 log n (m) ne ≤ 2rΣ + 3m log( /m) + 3 ne ≤ 2rΣ + 3.9m log( /m), (22) n  where the last two steps follow from the inequality log m ≤ m log(ne/m) and the fact ne that m ≥ 2, m log( /m) ≥ n infx∈[2/n,1] x(1−log x) ≥ 2(1−log(2/n)) ≥ 10/3. Combining (20), (21) and (22), we arrive at   1/2 √ √ 2 nrΣ m 1/2 E sup kζ¯ k ≤ + × 2r + 3.9m log(ne/m) w|I 2 Σ w∈Wn,n−m+o n − m n − m |I|≥n−o √ p  m 2mn 4.6mp ≤ r /n 1 + + + log(ne/m). Σ n − m n − m n 24 DALALYAN AND MINASYAN

Finally, note that for α = m/n ≤ 0.562, we have √ √ m 2mn r m  2mn 2n  + = + n − m n − m 2n n − m n − m √ r m  2α 2  r m = + ≤ 7 . 2n 1 − α 1 − α 2n This completes the proof of the lemma.

8.6. Moments and deviations of singular values of Gaussian matrices. Let ζ1,..., ζn be i.i.d. random vectors drawn from Np(0, Σ) distribution, where Σ is a p×p covariance matrix. We denote by ζ1:n the p × n random matrix obtained by concatenating the vectors ζi. Recall 1/2 > 1/2 > that smin(ζ1:n) = λmin(ζ1:nζ1:n) and smax(ζ1:n) = λmax(ζ1:nζ1:n) are the smallest and the largest singular values of the matrix ζ1:n.

LEMMA 4 (Vershynin[2012], Theorem 5.32 and Corollary 5.35). Let λmax(Σ) = 1. For every t > 0 and for every pair of positive integers n and p, we have   √ √ √ √  −t2/2 E smax(ζ1:n) ≤ n + rΣ, P smax(ζ1:n) ≥ n + rΣ + t ≤ e . If, in addition, Σ is the identity matrix, then   √ √  √ √  −t2/2 E smin(ζ1:n) ≥ n − p +, P smin(ζ1:n) ≤ n − p − t ≤ e .

The corresponding results in [Vershynin, 2012] treat only the case of identity covariance ma- trix Σ = Ip, however the proof presented therein carries with almost no change over the case of arbitrary covariance matrix. These bounds allow us to establish the following inequalities.

¯ ¯ LEMMA 5. For a subset J of {1, . . . , n}, we denote by ζJ¯ the p × |J| matrix obtained by ¯ concatenating the vectors {ζi : i ∈ J}. Let the covariance matrix Σ be such that λmax(Σ) = 1. For every pair of integers n, p ≥ 1 and for every integer m ∈ [1, n], we have

 2  √ √ 2 E smax(ζ1:n) ≤ rΣ + n + 4, (23)  2  √ E (smax(ζ1:n) − n)+ ≤ 6 nrΣ + 4rΣ, ∀n ≥ 8, (24)  2  √ E (n − smin(ξ1:n) )+ ≤ 6 np, ∀n ≥ 8, (25) h i √ √ 2 2 p ne  E max smax(ζJ¯) ≤ rΣ + m + 1.81 m log( /m) + 4. (26) |J¯|=m

PROOF. The bias-variance decomposition, in conjunction with Lemma4, yields  2    2 E smax(ζ1:n) = Var smax(ζ1:n) + E smax(ζ1:n)  √ √ 2 ≤ Var smax(ζ1:n) + n + rΣ . 2 R ∞ 2 Applying the well-known fact E[Z ] = 0 P(Z ≥ t) dt to the random variable Z = smax(ζ1:n) − E[smax(ζ1:n)] and using the Gaussian concentration inequality, we get Z ∞ √ Z ∞    −t/2 Var smax(ζ1:n) = P smax(ζ1:n) − E[smax(ζ1:n)] ≥ t dt ≤ 2 e dt = 4. 0 0 This completes the proof of (23). ROBUST ESTIMATION OF A GAUSSIAN MEAN 25

For every random variable Z and every constant a > 0, we have  2   2 2  E ((a + Z) − n)+ = E (a + 2aZ + Z − n)+  2  2 ≤ E (a + 2aZ − n)+ + E[Z ] 2 2 ≤ |a − n| + 2aE[Z+] + E[Z ].

Taking a = E[smax(ζ1:n)] and Z = smax(ζ1:n) − E[smax(ζ1:n)], we get √  2  √ √ 2 √ √ E (smax(ζ1:n) − n)+ ≤ |( n + rΣ) − n| + ( n + rΣ) 2π + 4 √ √ √ √ = 2 nrΣ + rΣ + ( n + rΣ) 2π + 4 √ ≤ 6 nrΣ + 4rΣ, ∀n ≥ 8. Similarly, we have  2   2 2  2   E (n − (a + Z) )+ = E (n − a − 2aZ − Z )+ ≤ (n − a )+ + 2aE Z− .

Taking a = E[smin(ξ1:n)] and Z = smin(ξ1:n) − E[smin(ξ1:n)], we get √  2  2 E (n − smin(ξ1:n) )+ ≤ (n − E[smin(ξ1:n)] )+ + E[smin(ξ1:n)] 2π √ √ √ 2 √ √ ≤ (n − ( n − p) )+ + ( n + p) 2π √ √ √ √ √ ≤ 2 np + ( n + p) 2π ≤ 6 np. Thus, we have checked (24) and (25). In view of Lemma4, for every t > 0,   p √  −t2/2 P smax ζJ¯ < |J¯| + rΣ + t ≥ 1 − e . (27) R ∞ Let I0 ⊂ [n] be any set of cardinality m. Combining the relation E[Z] = 0 P(Z ≥ s) ds, the union bound, and (27), we arrive at h i √ √ h √ √  i E max smax(ζJ¯) ≤ m + rΣ + E max smax(ζJ¯) − m − rΣ + |J¯|=m |J¯|=m √ √ Z ∞  √ √  = m + rΣ + P max smax(ζJ¯) ≥ m + rΣ + t dt 0 |J¯|=m √ √ Z ∞ ^  n   √ √  ≤ m + rΣ + 1 P smax(ζI0 ) ≥ m + rΣ + t dt 0 m Z ∞     √ √ ^ n −t2/2 ≤ m + rΣ + 1 e dt. (28) 0 m n  From now on, we assume that m ∈ [2, n − 1], which implies that m ≥ n. Let t0 be the value of t for which the two terms in the last minimum coincide, that is (     2 ne n −t2/2 2 n t0 ≤ 2m log( /m), e 0 = 1 ⇐⇒ t = 2 log =⇒ 0 2 n(n−1) m m t0 ≥ 2 log( /2), We have, for m ≥ 1,

2 Z ∞       Z ∞   −t0/2 ^ n −t2/2 n −t2/2 n e 1 e dt = t0 + e dt ≤ t0 + 0 m m t0 m t0 1 p ≤ t0 + ≤ 1.81 m log(ne/m). (29) t0 26 DALALYAN AND MINASYAN

Inequalities (28) and (29) yield h i √ √  p ne E max smax ζJ¯ ≤ m + rΣ + 1.81 m log( /m). |J¯|=m  On the other hand, the mapping ζ1:n 7→ F (ζ1:n) := max|J¯|=m smax ζJ¯ being 1-Lipschitz, the Gaussian concentration inequality leads to Z ∞ √ Z ∞     −t/2 Var max smax ζJ¯ = P F (ζ1:n) − E[F (ζ1:n)] ≥ t dt ≤ 2 e dt = 4. |J¯|=m 0 0 This completes the proof of the lemma.

LEMMA 6. There is a constant A1 > 0 such that for every pair of integers n ≥ 8 and p ≥ 1 and for every covariance matrix Σ such that λmax(Σ) = 1, we have

h > i √ √ √ E kζ1:nζ1:n − nΣkop ≤ A1 n + rΣ rΣ, (30)

h > i √ E λmax,+(ζ1:nζ1:n − nΣ) ≤ 6 np + 4p, (31)

h > i √ E λmax,+(nΣ − ζ1:nζ1:n) ≤ 6 np, (32) where the last inequality is valid under the additional assumption p ≤ n. Furthermore, there is a constant A2 > 0 such that  √  > h > i  √  −t P kζ1:nζ1:n − nΣkop − E kζ1:nζ1:n − nΣkop ≥ A2 tn + trΣ + t ≤ e , ∀t ≥ 1.

PROOF. Inequality (30) and the last claim of the lemma are respectively Theorems 4 and 5 1/2 in [Koltchinskii and Lounici, 2017]. Let us prove the two other claims. Since ζi = Σ ξi where ξi’s are i.i.d. N (0, Ip), we have > > 2  λmax,+(ζ1:nζ1:n − nΣ) ≤ λmax,+(ξ1:nξ1:n − nIp) = smax(ξ1:n) − n +. Inequality (31) now follows from (24) applied in the case of an identity covariance matrix so that rΣ = p. Similarly, (32) follows from (25) using the same argument.

8.7. Moments of suprema over Wn,` of weighted centered Wishart matrices.

LEMMA 7. Let p ≥ 2, n ≥ 4 ∨ p, m ∈ [2, 0.6n] and o ≤ m be four integers. It holds that " n #  X > p E sup λmax,+ Σ − (w|I )iζiζi ≤ 25 p/n + 33(m/n) log(n/m). w∈Wn,n−m+o i=1 |I|≥n−o Furthermore, for any p ≥ 1, n ≥ 1, m ∈ [2, 0.6n] and o ≤ m, n   r ne X > rΣ m log( /m) E sup Σ − (w|I )iζiζi ≤ (5.1A1 + 2.5A2) + 7.5A2 , op n n w∈Wn,n−m+o i=1 |I|≥n−o where A1 and A2 are the same constants as in Lemma6. ROBUST ESTIMATION OF A GAUSSIAN MEAN 27

PROOFOF LEMMA 7. For every I ⊂ [n] such that |I| ≥ n − o, we have

n (a) n  X >  X > sup λmax,+ Σ − (w|I )iζiζi ≤ sup λmax,+ Σ − wiζiζi w∈Wn,n−m+o i=1 w∈Wn,n−m i=1

(b) n  X J >  ≤ sup λmax,+ ui (Σ − ζiζi ) . |J|=n−m i=1 In the above inequalities, (a) follows from claim iv) of Lemma1, while (b) is a direct conse- quence of Lemma1, claim iii). Thus, we can infer that for every w ∈ Wn,n−m+o, n  X >  X >  (n − m)λmax,+ Σ − (w|I )iζiζi ≤ max λmax,+ (Σ − ζiζi ) |J|=n−m i=1 i∈J  n    X > X > ≤ λmax,+ (Σ − ζiζi ) + max λmax,+ (ζiζi − Σ) |J¯|=m i=1 i∈J¯  n      X > X > ≤ λmax,+ (Σ − ζiζi ) + max λmax ξiξi − m , (33) |J¯|=m i=1 i∈J¯ + where in the last line we have used the fact that for any symmetric matrix B, we have 1/2 1/2 1/2 2 λmax,+(Σ BΣ ) ≤ kΣ kopλmax,+(B) = λmax,+(B). To analyze the last term of the previous display, we note that   X > > 2  λmax ξiξi = λmax ξJ¯ξJ¯ = smax ξJ¯ . i∈J¯ On the one hand, inequality (26) of Lemma5 yields       X > h 2 i E max λmax ξiξi − m ≤ E max smax ξJ¯ |J¯|=m |J¯|=m i∈J¯ + √ √ p 2 ≤ p + m + 1.81 m log(ne/m) + 4 p ≤ p + 10m log(ne/m) + 5.62 pm log(ne/m)

≤ 4p + 13m log(ne/m). (34) On the other hand, for n ≥ p, according to (32),  n  X > √ λmax,+ (Σ − ζiζi ) ≤ 6 np. (35) i=1 Combining (33), (34) and (35), we arrive at n √   X  6 np + 4p + 13m log(ne/m) E sup λ Σ − (w ) ξ ξ> ≤ max,+ |I i i i n − m w∈Wn,n−m+o i=1 |I|≥n−o √ 10 np + 13m log(ne/m) ≤ 0.4n p ≤ 25 p/n + 33(m/n) log(ne/m), and the first claim of the lemma follows. 28 DALALYAN AND MINASYAN

To prove the last claim, we repeat the arguments in (33) to check that for every weight vector w ∈ Wn,n−m+o, n n X > X > X > (n − m) Σ − (w|I )iζiζi ≤ (Σ − ζiζi ) + max (ζiζi − Σ) op |J¯|=m i=1 i=1 op i∈J¯ op > > = ζ1:nζ1:n − nΣ + max ζJ¯ζJ¯ − mΣ . (36) op |J¯|=m op In view of Lemma6, for every t ≥ 1, we have  √  > √  √  −t P kζJ¯ζJ¯ − mΣkop ≥ 2A1 mrΣ + A2 tm + trΣ + t ≤ e .

Since t ≥ 1 and m ≥ 2, the last inequality implies   > √  −t P kζJ¯ζJ¯ − mΣkop ≥ 2A1 mrΣ + A2 m + rΣ + 1.5t ≤ e . √ To ease notation, let us set a = 2A1 mrΣ + A2(m + rΣ) and b = 1.5A2. Using the union bound, we arrive at   Z ∞   E max ζ ζ> − mΣ = P max ζ ζ> − mΣ ≥ s ds J¯ J¯ op J¯ J¯ op |J¯|=m 0 |J¯|=m Z ∞   = a + b P max ζ ζ> − mΣ ≥ a + bt dt J¯ J¯ op 0 |J¯|=m Z ∞  n    ≤ a + b 1 ∧ P ζ ζ> − mΣ ≥ a + bt dt 1:m 1:m op 0 m Z ∞  n  ≤ a + b 1 ∧ e−t dt. 0 m n  Splitting the last integral integral into two parts, corresponding to the intervals [0, log m ] n  and [log m , +∞), we obtain     > n E max ζJ¯ζJ¯ − mΣ ≤ a + b log + b |J¯|=m op m √ ne ≤ 2A1 mrΣ + A2rΣ + 3A2m log( /m), where in the last line we used that  n   n  1.5 + 1.5 log−1 + m log−1 ≤ 3, ∀m ∈ [2, n − 1], ∀n ≥ 4. m m Combining these bounds with (36), we arrive at  n  X > E sup Σ− (w|I )iζiζi op Wn,n−m+o i=1 √ √ √ A r ( n + 2 m) + (A + A )r + 3A m log(ne/m) ≤ 1 Σ 1 2 Σ 2 n − m r rΣ ≤ (5.1A + 2.5A ) + +7.5A (m/n) log(ne/m). 1 2 n 2 This completes the proof. ROBUST ESTIMATION OF A GAUSSIAN MEAN 29

8.8. Proof of Proposition2. Throughout this proof, supw,I stands for the supremum over all w ∈ Wn(ε) and over all I ⊂ {1, . . . , n} of cardinality larger than or equal to n(1 − ε). We recall that, Ξ = supw,I R(ζ, w,I) where for any subset I of{1, . . . , n}, √   √ 1/2 X > ¯ R(ζ, w,I) = 2εw + 2εw λmax,+ (w|I )i(Σ − ζiζi ) + (1 + 2εw)kζw|I k2. i∈I

Furthermore, as already mentioned earlier, for every w ∈ Wn(ε), εw ≤ ε/(1 − ε) ≤ 1.5ε. This implies that √   1/2 2 1/2 ¯ 2 E [Ξ ] ≤ 3ε + (1 + 3ε)E sup kζw|I k2 w,I √    1/2 X > + 3ε E sup λmax,+ (w|I )i(Σ − ζiζi ) . (37) w,I i∈I As proved in Lemma3 (by taking m = 2o and o = nε),   √ 1/2 ¯ 2 p  p E sup kζw|I k2 ≤ rΣ/n 1 + 7 ε + 9.2ε log(2/ε). (38) w,I In addition, in view of the first claim of Lemma7 (with m = 2o and o = nε), stated and proved in the last section, we have    X > p E sup λmax,+ (w|I )i(Σ − ζiζi ) ≤ 25 p/n + 66ε log(1/2ε). (39) w,I i∈I Combining (37), (38) and (39), we get √ √ E1/2[Ξ2] ≤ 3ε + (1 + 3ε)pp/n 1 + 7 ε + 9.2εplog(2/ε) √ + 3ε 25pp/n + 66ε log(1/2ε)1/2 √ √ ≤ pp/n(1 + 16 ε) + 18εplog(2/ε) + 3ε 25pp/n + 66ε log(1/2ε)1/2. This leads to (16). To obtain (15), we repeat the same arguments but use the second claim of Lemma7 instead of the first one. Acknowledgments. The work of AD was partially supported by the grant Investissements d’Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047). The work of AM was supported by the FAST Advance grant.

REFERENCES S. Balakrishnan, S. S. Du, J. Li, and A. Singh. Computationally efficient robust sparse estimation in high dimen- sions. In COLT 2017, pages 169–212, 2017. A.-H. Bateni and A. S. Dalalyan. Confidence regions and minimax rates in outlier-robust estimation on the probability simplex. Electron. J. Statist., 14(2):2653–2677, 2020. T. T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist., 43(3):1027–1059, 2015. T. I. Cannings, Y. Fan, and R. J. Samworth. Classification with imperfect training labels. Biometrika, 107(2): 311–330, 2020. M. Chen, C. Gao, and Z. Ren. A general for Huber’s -contamination model. Electron. J. Statist., 10(2):3752–3774, 2016. M. Chen, C. Gao, and Z. Ren. Robust covariance and scatter matrix estimation under Huber’s contamination model. Ann. Statist., 46(5):1932–1960, 10 2018. 30 DALALYAN AND MINASYAN

Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. In SODA 2019, pages 2755–2771, 2019a. Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. In T. M. Chan, editor, SODA 2019, San Diego, pages 2755–2771. SIAM, 2019b. Y. Cherapanamjeri, N. Flammarion, and P. L. Bartlett. Fast mean estimation with sub-gaussian rates. In Pro- ceedings of COLT, volume 99 of Proceedings of Machine Learning Research, pages 786–806, Phoenix, USA, 25–28 Jun 2019. PMLR. G. Chinot. Erm and rerm are optimal estimators for regression problems when malicious outliers corrupt the labels. Electron. J. Statist., 14(2):3563–3605, 2020. O. Collier and A. S. Dalalyan. Multidimensional linear functional estimation in sparse gaussian models and robust estimation of the mean. Electron. J. Statist., 13(2):2830–2864, 2019. L. Comminges and A. S. Dalalyan. Tight conditions for consistency of variable selection in the context of high dimensionality. Ann. Statist., 40(5):2667–2696, 2012. L. Comminges, O. Collier, M. Ndaoud, and A. B. Tsybakov. Adaptive robust estimation in sparse vector model, 2020. B. Cox, A. Juditsky, and A. Nemirovski. Dual subgradient algorithms for large-scale nonsmooth learning prob- lems. Mathematical Programming, 148(1-2):143–180, 2014. A. Dalalyan and P. Thompson. Outlier-robust estimation of a sparse linear model using `1-penalized Huber’s M-estimator. In NeurIPS 32, pages 13188–13198. 2019. J. Depersin and G. Lecué. Robust subgaussian estimation of a mean vector in nearly linear time. arXiv, abs/1906.03058, 2019. L. Devroye, M. Lerasle, G. Lugosi, and R. I. Oliveira. Sub-Gaussian mean estimators. Ann. Statist., 44(6): 2695–2725, 2016. I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional . CoRR, abs/1911.05911, 2019. I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In FOCS 2016, pages 655–664, 2016. I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high- dimensional gaussians and gaussian mixtures. In FOCS 2017, pages 73–84, 2017. I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In SODA 2018, pages 2683–2702, 2018. I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput., 48(2):742–864, 2019a. I. Diakonikolas, D. Kane, S. Karmalkar, E. Price, and A. Stewart. Outlier-robust high-dimensional sparse estima- tion via iterative filtering. In NeurIPS 2019, pages 10688–10699, 2019b. Y. Dong, S. B. Hopkins, and J. Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In NeurIPS 2019, pages 6065–6075, 2019. D. Donoho. Breakdown properties of multivariate location estimators. Phd thesis, Harvard University, 1982. D. Donoho and P. J. Huber. The notion of breakdown point. In A Festschrift for Erich L. Lehmann, Wadsworth Statist./Probab. Ser., pages 157–184. Wadsworth, Belmont, CA, 1983. D. L. Donoho and M. Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann. Statist., 20(4):1803–1827, 1992. A. Elsener and S. van de Geer. Robust low-rank matrix estimation. Ann. Statist., 46(6B):3481–3509, 2018. C. Gao. via mutivariate regression depth. 
Bernoulli, 26(2):1139–1170, 2020. F. R. Hampel. Contributions to the theory of robust estimation. PhD thesis, University of California, Berkeley, 1968. S. B. Hopkins. Sub-gaussian mean estimation in polynomial time. CoRR, abs/1809.07425, 2018. P. J. Huber and E. M. Ronchetti. Robust Statistics, Second Edition. Wiley Series in Probability and Statistics. Wiley, 2009. O. Klopp, K. Lounici, and A. B. Tsybakov. Robust matrix completion. Probability Theory and Related Fields, 169(1):523–564, Oct 2017. V. Koltchinskii and K. Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 23(1):110–133, 02 2017. K. A. Lai, A. B. Rao, and S. S. Vempala. Agnostic estimation of mean and covariance. In FOCS 2016, pages 665–674, 2016. G. Lecué and M. Lerasle. Learning from MOM’s principles: Le Cam’s approach. Stochastic Process. Appl., 129 (11):4385–4410, 2019. G. Lecué and M. Lerasle. Robust machine learning by median-of-means: Theory and practice. The Annals of Statistics, 48(2):906 – 931, 2020. ROBUST ESTIMATION OF A GAUSSIAN MEAN 31

O. V. Lepski and V. G. Spokoiny. Optimal pointwise adaptive methods in nonparametric estimation. The Annals of Statistics, 25(6):2512–2546, 1997. O. V. Lepskii. Asymptotically minimax adaptive estimation. i: Upper bounds. optimally adaptive estimates. Theory Probab. Appl., 36(4):682–697, 1992. A. H. Li and J. Bradic. Boosting in the presence of outliers: adaptive classification with nonconvex loss functions. J. Amer. Statist. Assoc., 113(522):660–674, 2018. P.-L. Loh. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Ann. Statist., 45(2):866–896, 2017. H. P. Lopuhaä and P. J. Rousseeuw. Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann. Statist., 19(1):229–248, 1991. G. Lugosi and S. Mendelson. Near-optimal mean estimators with respect to general norms. Probab. Theory Related Fields, 175(3-4):957–973, 2019. G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. (JEMS), 22(3):925–965, 2020. G. Lugosi and S. Mendelson. Robust multivariate mean estimation: The optimality of trimmed mean. Ann. Statist., 49(1):393 – 410, 2021. R. Maronna, D. Martin, and V. Yohai. Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. Wiley, 2006. S. Minsker. Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann. Statist., 46 (6A):2871–2903, 2018. J. Polzehl and V. G. Spokoiny. Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(2):335–354, 2000. P. Rigollet and J.-C. Hütter. High dimensional statistics, November 2019. P. Rousseeuw. Multivariate estimation with high breakdown point. In Mathematical statistics and applications, Vol. B (Bad Tatzmannsdorf, 1983), pages 283–297. Reidel, Dordrecht, 1985. P. Rousseeuw and M. Hubert. High-breakdown estimators of multivariate location and scatter. In Robustness and complex data structures, pages 49–66. Springer, Heidelberg, 2013. P. J. Rousseeuw. Least median of squares regression. J. Amer. Statist. Assoc., 79(388):871–880, 1984. M. Soltanolkotabi and E. J. Candès. A geometric analysis of subspace clustering with outliers. Ann. Statist., 40 (4):2195–2238, 2012. W. Stahel. Robuste Schätzungen: infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. Phd thesis, ETH Zurich, 1981. R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed sensing, pages 210–268. Cambridge Univ. Press, Cambridge, 2012. B. Zhu, J. Jiao, and J. Steinhardt. When does the tukey median work? In IEEE International Symposium on Information Theory (ISIT), 2020. 32 DALALYAN AND MINASYAN

APPENDIX A: HIGH-PROBABILITY BOUNDS IN THE SUB-GAUSSIAN CASE

This section is devoted to the proof of Theorem2, which provides a high-probability upper bound on the error of the iteratively reweighted mean estimator in the sub-Gaussian case. We start with some technical lemmas, and postpone the full proof of Theorem2 to the end of the present section. Let us remind some notation. We assume here that for some covariance matrix Σ, we have 1/2 ζi = Σ ξi, for i = 1, . . . , n, where ξi’s are independent zero-mean with identity covariance matrix. In addition, we assume that ξis are sub-Gaussian vectors with parameter τ > 0, that is

> 2 v ξ τ/2kvk p E[e i ] ≤ e 2 , ∀v ∈ R . ind We write ξi ∼ SGp(τ). Recall that if ξi is standard Gaussian then it is sub-Gaussian with parameter 1. It is a well-known fact (see, for instance, [Rigollet and Hütter, 2019, Theorem 1.19]) that if ξ ∼ SGp(τ) then for all δ ∈ (0, 1), it holds  √ p  P kξk2 ≥ 4 pτ + 8τ log(1/δ) ≤ δ. (40)

In what follows, we assume that the covariance matrix Σ satisfies kΣkop = 1.

LEMMA 8. Let J¯ ⊂ {1, . . . , n} be a subset of cardinality m. For every δ ∈ (0, 1), it holds that   X √ √ p  P ζ ≤ mτ 4 p + 8 log(1/δ) ≥ 1 − δ. j j∈J¯ 2

PROOFOF LEMMA 8. Without loss of generality, we assume that J¯ = {1, . . . , m}. On the one hand, kΣkop = 1 implies that m m X X ζ ≤ ξ . i i i=1 2 i=1 2

On the other hand, ξ1 + ... + ξm ∼ SGp(mτ). Applying inequality (40) to this random variable, we get the claim of the lemma.

LEMMA 9. Let p ≥ 1, n ≥ 1, m ∈ [2, 0.562n] and o ∈ [0, m] be four positive integers. For every δ ∈ (0, 1), with probability at least 1 − δ, we have r rτp mpτ log(ne/m) 2τ log(2/δ) sup kζ¯ k ≤ 16 + 6.5 + 8 . w|I 2 w∈Wn,n−m+o n n n |I|≥n−o

PROOFOF LEMMA 9. We have (a) (b) sup kζ¯ k ≤ sup kζ¯ k ≤ sup ζ¯ J , w|I 2 w 2 u 2 w∈Wn,n−m+o w∈Wn,n−m |J|=n−m where (a) follows from claim iv) of Lemma1 and (b) is a direct consequence of claim iii) of Lemma1. Thus, we get

¯ ¯ 1 X sup kζw k2 ≤ max ζuJ = max ζi |I |J|=n−m 2 n − m |J|=n−m w∈Wn,n−m+o i∈J 2 ROBUST ESTIMATION OF A GAUSSIAN MEAN 33 n 1 X 1 X ≤ ζi + max ζi . (41) n − m n − m |J¯|=m i=1 2 i∈J¯ 2 By the union bound, for every t ≥ 0, we have       X √ √  n X √ √  P max ζi > mτ 4 p + 2t ≤ max P ζi > mτ 4 p + 2t |J¯|=m m |J¯|=m i∈J¯ 2 i∈J¯ 2   n 2 ≤ e−t /2, m where the last line follows from Lemma8. Hence, with probability at least 1 − δ/2, we have

X √ √ p p  max ζi ≤ mτ 4 p + 8m log(ne/m) + 8 log(2/δ) , (42) |J¯|=m i∈J¯ 2 n  where we have used the inequality log m ≤ m log(ne/m). Using once again Lemma2, we check that with probability at least 1 − δ/2, n X √ √ p  ζ ≤ nτ 4 p + 8 log(2/δ) . (43) i i=1 2 Combining (41), (42) and (43), and setting α = m/n ≤ 0.562, we arrive at √ r τ 4 p + p8 log(2/δ) mp8τ log(ne/m) ¯ √ sup kζw|I k2 ≤ + w∈Wn,n−m+o n 1 − α n(1 − α) r rτp 2τ log(2/δ) mpτ log(ne/m) ≤ 16 + 8 + 6.5 n n n with probability at least 1 − δ. This completes the proof of the lemma.

We recall that ζ1:n is the p × n random matrix obtained by concatenating the vectors ζi. 1/2 > 1/2 > We also remind that smin(ζ1:n) = λmin(ζ1:nζ1:n) and smax(ζ1:n) = λmax(ζ1:nζ1:n) are the smallest and the largest singular values of the matrix ζ1:n.

LEMMA 10 (Vershynin[2012], Theorem 5.39). There is a constant Cτ depending only on the variance proxy τ such that for every t > 0 and for every pair of positive integers n and p, we have √ √  −t2 P smin(ξ1:n) ≤ n − Cτ p − Cτ t ≤ e , √ √  −t2 P smax(ξ1:n) ≥ n + Cτ p + Cτ t ≤ e . ¯ ¯ LEMMA 11. For a subset J of {1, . . . , n}, we denote by ζJ¯ the p × |J| matrix obtained ¯ by concatenating the vectors {ζi : i ∈ J}. For every pair of integers n, p ≥ 1 and for every integer m ∈ [1, n], we have

 √ √  2 p ne  −t P max smax(ζJ¯) ≤ m + Cτ p + m log( /m) + t ≥ 1 − e , |J¯|=m where Cτ is the same constant as in Lemma 10. 34 DALALYAN AND MINASYAN

PROOF. Using the union bound, we get  √ √  p ne  P max smax(ξJ¯) ≥ m + Cτ p + m log( /m) + t |J¯|=m  n   √ √  p ne  ≤ max P smax(ξJ¯) ≥ m + Cτ p + m log( /m) + t m |J¯|=m

(1)   (2) n p 2 −t2 ≤ exp − { m log(ne/m) + t} ≤ e , m where in (1) above we have used the second inequality from Lemma 10, while (2) is a con- n  sequence of the inequality log m ≤ m log(ne/m).

LEMMA 12 (Koltchinskii and Lounici[2017], Theorem 9). There is a constant A3 > 0 depending only on the variance proxy τ such that for every pair of integers n ≥ 1 and p ≥ 1, we have   > p  −t P kζ1:nζ1:n − nΣkop ≥ A3 (p + t)n + p + t ≤ e , ∀t ≥ 1.

LEMMA 13. Let t > 0. For any p ≥ 1, n ≥ 1, m ∈ [2, 0.6n] and o ≤ m, with probability at least 1 − 2e−t, n r  X > p + t p + t m log(ne/m) sup Σ − (w|I )iζiζi ≤ 5A3 + + , op n n n w∈Wn,n−m+o i=1 |I|≥n−o where A3 is the same constant as in Lemma 12.

PROOFOF LEMMA 13. For every I ⊂ [n] such that |I| ≥ n − o, we check that for every weight vector w ∈ Wn,n−m+o, n n X > X > X > (n − m) Σ − (w|I )iζiζi ≤ (Σ − ζiζi ) + max (ζiζi − Σ) op |J¯|=m i=1 i=1 op i∈J¯ op > > = ζ1:nζ1:n − nΣ + max ζJ¯ζJ¯ − mΣ op |J¯|=m op > > ≤ ξ1:nξ1:n − nIp + max ξJ¯ξJ¯ − mIp . (44) op |J¯|=m op On the one hand, Lemma 12 implies that the inequality > p  kξJ¯ξJ¯ − mIkop ≤ A3 (p + t)m + p + t0  ≤ 0.5A3 m + 3p + 3t0 ,

−t0 n  holds with probability at least 1−e . Choosing t0 = t+log m and using the union bound, the last display yields that with probability at least 1 − e−t,

>   max kξJ¯ξJ¯ − mIkop ≤ 2A3 p + m log(ne/m) + t . (45) |J¯|=m Using once again Lemma 12, we can check that the inequality

> p  kξ1:nξ1:n − nIkop ≤ A3 n(p + log(2/δ)) + p + t (46) ROBUST ESTIMATION OF A GAUSSIAN MEAN 35 holds with probability at least 1 − e−t. Therefore, combining (44), (45) and (46), we get the following inequalities hold with probability at least 1 − 2e−t: n X > A3 p  sup Σ − (w|I )iζiζi ≤ n(p + t) + 2p + 2t + 2m log(ne/m) op n − m Wn,n−m+o i=1 rp + t p + t m log(ne/m) ≤ 5A + + . 3 n n n This completes the proof.

PROPOSITION 3. Let R(ζ, w,I) be defined in Proposition1 and let ζ1,..., ζn be inde- pendent centered random vectors with covariance matrix Σ satisfying λmax(Σ) = 1. Assume −1/2 that ξi = Σ ζi are sub-Gaussian with a variance proxy τ > 0. Let ε ≤ 0.28, p ≤ n and δ ∈ (4e−n, 1). The random variable Ξ = sup max R(ζ, w,I) |I|≥n(1−ε) w∈Wn(ε) satisfies, with probability at least 1 − δ, the inequality r  p + log(4/δ)  Ξ ≤ A + εplog(1/ε) , 4 n where A4 > 0 is a constant depending only on the variance proxy τ.

PROOFOF PROPOSITION 3. We recall that, for any subset I of {1, . . . , n}, √   √ 1/2 X > ¯ R(ζ, w,I) = 2εw + 2εw λmax,+ (w|I )i(Σ − ζiζi ) + (1 + 2εw)kζw|I k2. i∈I

Furthermore, for every w ∈ Wn(ε), εw ≤ ε/(1 − ε) ≤ 1.5ε. This implies that √  1/2 √ X > ¯ Ξ ≤ 3ε + 3ε sup λmax,+ (w|I )i(Σ − ζiζi ) + (1 + 3ε) sup kζw|I k2, (47) w,I i∈I w,I where supw,I is the supremum over all w ∈ Wn(ε) and over all I ⊂ {1, . . . , n} of cardinality larger than or equal to n(1 − ε). As proved in Lemma9 (by taking m = 2o and o = nε), with probability at least 1 − 2e−t, r r ¯ τp p 2tτ sup kζw|I k2 ≤ 16 + 13ε τ log(2/ε) + 8 . (48) w,I n n In addition, in view of Lemma 13 (with m = 2o and o = nε), with probability at least 1 − 2e−t, we have r  X   p + t p + t  sup λ (w ) (Σ − ζ ζ>) ≤ 5A + + 2ε log(2/ε) max,+ |I i i i 3 n n w,I i∈I rp + t  ≤ 13A + ε log(2/ε) , (49) 3 n where in the last inequality we have used that p ≤ n and t ≤ n. Combining (47), (48) and (49), we obtain that the inequality rp + t  Ξ ≤ A + εplog(1/ε) 4 n 36 DALALYAN AND MINASYAN

−t hold true with probability at least 1 − 4e , for every t ∈ [0, n], where A4 is a constant depending only on the variance proxy τ. Replacing t by log(4/δ), we get the claim of the proposition.

PROOFOF THEOREM 2. The first step of the proof consists in establishing a recurrent in- k k−1 equality relating the estimation error of µb to that of µb . More precisely, we show that this error decreases at a geometric rate, up to a small additive error term. Without loss of generality, we assume that kΣkop = 1. Let us recall the notation n n−1 1 o Wn(ε) = w ∈ ∆ : max wi ≤ i n(1 − ε) and √   √ 1/2 X > ¯ R(ζ, w,I) = 2εw + 2εwλmax,+ (w|I )i(Σ − ζiζi ) + (1 + 2εw)kζw|I k2. i∈I For every k ≥ 1, in view of Proposition1 and (3), we have √ k ∗ εk k k 1/2 kµb − µ k2 ≤ G(wb , µb ) + Ξ 1 − εk √ εk k k−1 1/2 ≤ G(wb , µb ) + Ξ, (50) 1 − εk where X εn ε ε = wk ≤ = (51) k i n − nε 1 − ε i6∈I and Ξ = max sup R(ζ, w,I). |I|≥n−nε w∈Wn(ε) √ Inequality (51) and the cat that the function x 7→ x/(1 − x) is increasing for x ∈ [0, 1) imply that √ p εk ε(1 − ε) ≤ = αε. 1 − εk 1 − 2ε k k−1 Using this inequality and the fact that wb is a minimizer of G(·, µb ), we infer from (50) that k ∗ k k−1 1/2 kµb − µ k2 ≤ αεG(wb , µb ) + Ξ (52) ∗ k−1 1/2 ≤ αεG(wI , µb ) + Ξ, ∗ ∗ 1 where wI is the weight vector defined by wi = (i ∈ I)/|I|. One can check that n n X ∗ k−1 ⊗2 X ∗ ⊗2 k−1 ⊗2 w (X − µ ) = w (X − X¯ ∗ ) + (X¯ ∗ − µ ) i i b i i wI wI b i=1 i=1 n X ∗ ⊗2 k−1 2  w (X − X¯ ∗ ) + X¯ ∗ − µ I i i wI wI b 2 p i=1 n X ∗ ∗ ⊗2 k−1 2  w (X − µ ) + X¯ ∗ − µ I . i i wI b 2 p i=1 ROBUST ESTIMATION OF A GAUSSIAN MEAN 37

This readily yields  n  ∗ k−1 X ∗ ∗ ⊗2 k−1 2 G(w , µ ) ≤ λ w (X − µ ) − Σ + X¯ ∗ − µ I b max,+ i i wI b 2 i=1 ∗ ∗ ∗ k−1 2 ≤ G(w , µ ) + ζ¯ ∗ + µ − µ . I wI 2 b 2 Combining with inequality (52), we arrive at k ∗ ∗ ∗ 1/2 ¯ ∗ k−1 kµ − µ k2 ≤ αεG(w , µ ) + αε ζ ∗ + αε µ − µ + Ξ. b I wI 2 b 2 Thus, we have obtained the desired recurrent inequality kµk − µ∗k ≤ α µk−1 − µ∗ + Ξ, b 2 ε b 2 e (53) with ∗ ∗ 1/2 ¯ Ξ = αεG(w , µ ) + αε ζ ∗ + Ξ e I wI 2 √ 1/2 √ X > ¯ ≤ 5ε sup (w|I )i(Σ − ζiζi ) + 5ε ζw∗ + Ξ. (54) I 2 w,I i∈I op √ The last inequality above follows from the inequality αε ≤ 5ε as soon as ε ≤ 0.3. Unfolding the recurrent inequality (53), we get Ξe kµK − µ∗k ≤ αK µ0 − µ∗ + . b 2 ε b 2 1 − αε In view of Lemma1 and the condition ε ≤ 0.3, we get n 5αK X Ξe kµK − µ∗k ≤ ε ζ + . b 2 i 2 n 1 − αε i=1 Simple algebra yields n  n 1/2  n 1/2 1 X 1 X 2 √ 1 X 2 ζ ≤ ξ ≤ p + ( ξ − p) n i 2 n i 2 n i 2 i=1 i=1 i=1 √ Tr(ξ ξ> − I )1/2 = p + 1:n 1:n p n

√ p > 1/2 ≤ p + p/n ξ1:nξ1:n − Ip op . Thus, in view of Lemma6, with probability at least 1 − δ, n  r  1 X √ p 1.5A3 log(1/δ) ζ ≤ p 1 + 2A3 + . n i 2 n i=1 −n K √ Since δ ≥ 4e and K is chosen in such a way that αε ≤ 0.5ε/ p, we get, on an event of probability at least 1 − δ,

K ∗ Ξe kµb − µ k2 ≤ 0.5ε(1 + 3A3) + . 1 − αε Combining (54) with Proposition3, Lemma8 and Lemma 13, we check that for some con- stant A5 depending only on the variance proxy τ, the inequality r ! K ∗ A5 p + log(4/δ) p kµb − µ k2 ≤ + ε log(1/ε) 1 − αε n holds with probability at least 1 − 4δ. This completes the proof of the theorem. 38 DALALYAN AND MINASYAN

APPENDIX B: PROOF OF ADAPTATION TO UNKNOWN CONTAMINATION RATE

This section provides a proof of Theorem3. Let ε be the true contamination rate, which ∗ ∗ means that the distribution Pn ∈ SGAC(µ , Σ, ε). Let ` be the largest value of ` such that the corresponding element ε` of the grid is larger than or equal to ε. Since ε is assumed to be smaller than ε0a, the following inequalities hold

ε0 > ε`∗ ≥ ε ≥ ε`∗ a. (55)

∗ IR ∗ Let us introduce the events Ωj = {µ ∈ B(µbn (εj),Rδ(εj))}. Since Pn ∈ SGAC(µ , Σ, ε) ⊆ ∗ ∗ SGAC(µ , Σ, εj) for all j ≤ ` . We infer from Theorem2 that P(Ωj) ≥ 1 − 4δ/`max for every j ≤ `∗. Hence, by the union bound, with probability at least 1 − 4δ, we have

`∗ ∗ \ IR  µ ∈ B µbn (εj),Rδ(εj) . (56) j=1

From now on, we assume that this event is realized. Clearly, this implies that `b≥ `∗ and, therefore, IR  AIR  µ (ε ∗ ),R (ε ∗ ) ∩ µ ,R (ε ) 6= and ε ≤ ε ∗ . B bn ` δ ` B bn δ `b ∅ `b ` Using the triangle inequality, we get AIR IR kµ − µ (ε ∗ )k ≤ R (ε ∗ ) + 2R (ε ) ≤ 2R (ε ∗ ) ≤ 2R (ε/a), (57) bn bn ` 2 δ ` δ `b δ ` δ where the last inequality follows from (55) and the fact that z 7→ Rδ(z) is an increasing function. Combining (57) and (56), we get AIR ∗ AIR IR IR ∗ kµbn − µ k2 ≤ kµbn − µbn (ε`∗ )k2 + kµbn (ε`∗ ) − µ k2 ≤ 3Rδ(ε/a). This completes the proof of Theorem3.

APPENDIX C: PROOF IN THE CASE OF UNKNOWN Σ

This section is devoted to the proof of Theorem4. We follow exactly the same steps as in the proof of Theorem2. Without loss of generality, we assume that kΣkop ≤ 1. For every k ≥ 1, in view of Proposition1 and (3), we have √ √ k ∗ εk k k 1/2 εk k k−1 1/2 kµb − µ k2 ≤ G(wb , µb ) + Ξ ≤ G(wb , µb ) + Ξ, (58) 1 − εk 1 − εk where X εn ε ε = wk ≤ = (59) k i n − nε 1 − ε i6∈I

and Ξ = max|I|≥n−nε supw∈Wn(ε) R(ζ, w,I). √ Inequality (59) and the cat that the function x 7→ x/(1 − x) is increasing for x ∈ [0, 1) imply that √ p εk ε(1 − ε) ≤ = αε. 1 − εk 1 − 2ε ROBUST ESTIMATION OF A GAUSSIAN MEAN 39

Pn ⊗2 1/2 Using this inequality and the obvious inequality G(w, µ) ≤ k i=1 wi(Xi − µ) kop , we infer from (58) that k ∗ k k−1 1/2 kµb − µ k2 ≤ αεG(wb , µb ) + Ξ n 1/2 X k k−1 ⊗2 ≤ αε w (Xi − µ ) + Ξ. bi b i=1 op Then, using the fact that wk is the minimizer of w 7→ Pn w (X − µk−1)⊗2 , we get b i=1 i i b op k ∗ k k−1 1/2 kµb − µ k2 ≤ αεG(wb , µb ) + Ξ n 1/2 X ∗ k−1 ⊗2 ≤ αε w (Xi − µ ) + Ξ, (60) i b i=1 op ∗ ∗ 1 where wI is the weight vector defined by wi = (i ∈ I)/|I|. Simple algebra (see the proof of Theorem2 for more details) yields n 1/2 n 1/2 X ∗ k−1 ⊗2 X ∗ ∗ ⊗2 ¯ k−1 w (Xi − µ ) ≤ w (Xi − µ ) + Xw∗ − µ i b i I b 2 i=1 op i=1 op n 1/2 X ∗ ∗ ⊗2 ¯ ∗ k−1 ≤ wi (Xi − µ ) + ζw∗ + µ − µ I 2 b 2 i=1 op ∗ ∗ 1/2 ¯ ∗ k−1 ≤ G(w , µ ) + kΣkop + ζ ∗ + µ − µ . I wI 2 b 2 Combining with inequality (60), we arrive at k ∗ ∗ ∗ 1/2 ¯ ∗ k−1 kµ − µ k2 ≤ αεG(w , µ ) + αε + αε ζ ∗ + αε µ − µ + Ξ. b I wI 2 b 2 Thus, we have obtained the desired recurrent inequality kµk − µ∗k ≤ α µk−1 − µ∗ + α + Ξ, b 2 ε b 2 ε e (61) with ∗ ∗ 1/2 ¯ Ξ = αεG(w , µ ) + αε ζ ∗ + Ξ e I wI 2 √ 1/2 √ X > ¯ ≤ 5ε sup (w|I )i(Σ − ζiζi ) + 5ε ζw∗ + Ξ. I 2 w,I i∈I op √ The last inequality above follows from the inequality αε ≤ 5ε as soon as ε ≤ 0.3. Unfolding the recurrent inequality (61), we get

αε + Ξe kµK − µ∗k ≤ αK µ0 − µ∗ + . b 2 ε b 2 1 − αε In view of Lemma1 and the condition ε ≤ 0.3, we get K n 5α X αε + Ξe kµK − µ∗k ≤ ε ζ + . b 2 i 2 n 1 − αε i=1 We have already seen that with probability at least 1 − 4δ, K n r ! 5αε X Ξe A5 p + log(4/δ) p ζi 2 + ≤ + ε log(1/ε) , n 1 − αε 1 − αε n i=1 √ where A > 1 is a constant depending only on τ. In addition, we have α ≤ 5ε and p 5 √ p √ ε ε log(1/ε) ≤ 0.6 ε, which lead to αε + A5ε log(1/ε) ≤ 3A5 ε. This completes the proof of the theorem. 40 DALALYAN AND MINASYAN

APPENDIX D: LOWER BOUND FOR THE GAUSSIAN MODEL WITH ADVERSARIAL CONTAMINATION

In this section we provide the proof of the lower bound on the expected risk in the setting of HC ∗ GAC model. To this end, we first denote by Mn (ε, µ ) the set of joint probability distri- butions Pn of the random vectors X1,..., Xn coming from Huber’s contamination model. Recall that Huber’s contamination model reads as follows i.i.d. X1,..., Xn ∼ Pε,µ∗,Q, n where Pε,µ∗,Q = (1 − ε)Pµ∗ + εQ. It is evident that with probability ε all observations X1,..., Xn can be outliers, i.e., drawn from distribution Q. This means that on an event of non-zero probability, it is impossible to have a bounded estimation error. To overcome this difficulty, one can focus on Huber’s deterministic contamination (HDC) model (see [Bateni and Dalalyan, 2020]). In this model, it is assumed that the number of outliers is at most dnεe. The set of joint probability distributions Pn coming from Huber’s deterministic contamina- HDC ∗ tion model is denoted by Mn (ε, µ ). We define the worst-case risks for these two models RHC (µ , ε) = sup sup R (µ , µ∗), max bn Pn bn µ∗ HC ∗ Pn∈Mn (ε,µ ) RHDC(µ , ε) = sup sup R (µ , µ∗). max bn Pn bn µ∗ HDC ∗ Pn∈Mn (ε,µ )

For Huber’s contamination model, the following lower bound holds true.

THEOREM 5 (Theorem 2.2 from [Chen et al., 2018]). There exists some universal constant C > 0 such that  r  ∗ 1/2 rΣ inf sup sup P kµbn − µ k2 ≥ CkΣkop + ε ≥ 1/2, µ µ∗ HC ∗ n b n Pn∈Mn (ε,µ ) for any ε ∈ [0, 1].

The combination of the lower bound from Theorem5 and the relation between the risks in HC ∗ HDC ∗ Mn (ε, µ ) and Mn (ε, µ ) (Prop. 1 from [Bateni and Dalalyan, 2020]) yields the desired lower bound for the GAC model, as stated in the next proposition.

PROPOSITION 4 (Lower bound for GAC). There exist some universal constants c, ec > 0 such that if n > ec, then r  1/2 rΣ inf Rmax(µbn, Σ, ε) ≥ ckΣkop + ε . µb n n

p PROOF. First, notice that it is sufficient to prove this bound for ε > rΣ/n. Indeed, if ε < p rΣ/n then one gets the desired lower bound by simply comparing GAC model with the standard outlier free model. More formally, r OF Tr Σ inf Rmax(µbn, Σ, ε) ≥ inf Rmax(µbn) = , µb n µb n n OF where Rmax(µbn) is the worst case risk in the classical outlier free setting where the observa- ∗ tions are i.i.d. and drawn from Np(µ , Σ). ROBUST ESTIMATION OF A GAUSSIAN MEAN 41

Since the GAC model is more general than HDC model, then one clearly has HDC inf Rmax(µbn, Σ, ε) ≥ inf Rmax(µbn, ε). (62) µb n µb n On the other hand, in view of Eq. (4) from [Bateni and Dalalyan, 2020], for every r > 0, HDC −nε/6  Rmax(µn, ε) + re ≥ sup rP kµn − µk2 ≥ r . b HC ∗ b Pn∈Mn (0.5ε,µ ) 1/2p  Using the result of Theorem5 and taking first r = CkΣkop rΣ/n + ε then infimum of both sides one arrives at r  HDC 1/2 rΣ −nε/6 inf Rmax(µbn, ε) ≥ (C/4)kΣkop + ε 1 − 2e . (63) µb n n p √ √ 2 Now, using ε > rΣ/n one gets that nε > rΣn ≥ n. Therefore, taking n ≥ 36 log (4), one can check that (63) yields r  HDC 1/2 rΣ inf Rmax(µbn, 2ε) ≥ (C/8)kΣkop + ε . µb n n 2 Then, the final result follows from (62) with constants c = C/8 and ec = 36 log (4).