A Robust Univariate Mean Estimator is All You Need

Adarsh Prasad‡   Sivaraman Balakrishnan†   Pradeep Ravikumar‡
Machine Learning Department‡   Department of Statistics and Data Science†
Carnegie Mellon University

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

Abstract

We study the problem of designing estimators when the data has heavy tails and is corrupted by outliers. In such an adversarial setup, we aim to design statistically optimal estimators for flexible non-parametric distribution classes such as distributions with bounded 2k-moments and symmetric distributions. Our primary workhorse is a conceptually simple reduction from multivariate estimation to univariate estimation. Using this reduction, we design estimators which are optimal in both heavy-tailed and contaminated settings. Our estimators achieve an optimal dimension-independent bias in the contaminated setting, while also simultaneously achieving high-probability error guarantees with optimal sample complexity. These results provide some of the first such estimators for a broad range of problems including Mean Estimation, Sparse Mean Estimation, Covariance Estimation, Sparse Covariance Estimation and Sparse PCA.

1 Introduction

Modern data sets that arise in various branches of science and engineering are characterized by their ever increasing scale and richness. This has been spurred in part by easier access to computer, internet and various sensor-based technologies that enable the automated acquisition of such heterogeneous datasets. On the flip side, these large and rich data sets are no longer carefully curated, are often collected in a decentralized, distributed fashion, and consequently are plagued with the complexities of heterogeneity, adversarial manipulations, and outliers. The analysis of these huge datasets is thus fraught with methodological challenges.

Understanding the fundamental challenges and trade-offs in handling such "dirty data" is precisely the premise of the field of robust statistics. Here, the aforementioned complexities are largely formalized under two different models of robustness: (1) the heavy-tailed model, in which the sampling distribution can have thick tails; for instance, only low-order moments of the distribution are assumed to be finite; and (2) the ε-contamination model, in which the sampling distribution is modeled as a well-behaved distribution contaminated by an ε fraction of arbitrary outliers. In each case, classical estimators of the distribution (based for instance on the maximum likelihood estimator) can behave considerably worse (potentially arbitrarily worse) than under standard settings where the data is better behaved, satisfying various regularity properties. In particular, these classical estimators can be extremely sensitive to the tails of the distribution or to the outliers, and the broad goal in robust statistics is to construct estimators that improve on these classical estimators by reducing their sensitivity to outliers.

Heavy-Tailed Model. Concretely, focusing on the fundamental problem of robust mean estimation, in the heavy-tailed model we observe n samples x_1, ..., x_n drawn independently from a distribution P, which is only assumed to have low-order moments finite (for instance, P only has finite variance). The goal of past work (Catoni, 2012; Minsker, 2015; Lugosi and Mendelson, 2017; Catoni and Giulini, 2017) has been to design an estimator θ̂_n of the true mean µ of P which has a small ℓ2-error with high probability. Formally, for a given δ > 0, we would like an estimator with minimal r_δ such that

$$P\big(\|\hat{\theta}_n - \mu\|_2 \le r_\delta\big) \ge 1 - \delta. \qquad (1)$$
As a benchmark for estimators in the heavy-tailed model, we observe that when P is a multivariate normal distribution (or more generally a sub-Gaussian distribution) with mean µ and covariance Σ, it can be shown (see Hanson and Wright, 1971) that the sample mean µ̂_n = (1/n) Σ_i x_i satisfies, with probability at least 1 − δ,¹

$$\|\hat{\mu}_n - \mu\|_2 \lesssim \sqrt{\frac{\mathrm{trace}(\Sigma)}{n}} + \sqrt{\frac{\|\Sigma\|_2 \log(1/\delta)}{n}}, \qquad (2)$$

where ‖Σ‖_2 denotes the operator norm of the covariance matrix Σ.

¹ Here and throughout our paper we use the notation ≲ to denote an inequality with universal constants dropped for conciseness.

This bound is referred to as a sub-Gaussian-style error bound. However, for heavy-tailed distributions, as for instance shown in Catoni (2012), the sample mean only satisfies the sub-optimal bound r_δ = Ω(√(d/(nδ))). Somewhat surprisingly, recent work of Lugosi and Mendelson (2017) showed that the sub-Gaussian error bound is achievable while only assuming that P has finite variance, but by a carefully designed estimator. In the univariate setting, the classical median-of-means estimator (Alon et al., 1996; Nemirovski and Yudin, 1983; Jerrum et al., 1986) and Catoni's M-estimator (Catoni, 2012) achieve this surprising result, but designing such estimators in the multivariate setting has proved challenging. Estimators that achieve truly sub-Gaussian bounds, but which are computationally intractable, were proposed recently by Lugosi and Mendelson (2017) and subsequently Catoni and Giulini (2017). Hopkins (2018) and Cherapanamjeri et al. (2019) developed a sum-of-squares based relaxation of Lugosi and Mendelson (2017)'s estimator, thereby giving a polynomial-time algorithm which achieves optimal rates.
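As an illustration of the univariate building block, the following is a minimal sketch of the median-of-means estimator mentioned above: partition the data into roughly log(1/δ) blocks and take the median of the block means. The exact block count and randomization are our choices here, not the paper's; constants vary across analyses.

```python
import numpy as np

def median_of_means(x, delta=0.05, rng=None):
    """Median-of-means: split the data into ~log(1/delta) blocks,
    average each block, and return the median of the block means."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    k = max(1, min(len(x), int(np.ceil(np.log(1.0 / delta)))))  # number of blocks
    blocks = np.array_split(rng.permutation(x), k)               # equal-ish random blocks
    return float(np.median([b.mean() for b in blocks]))
```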
Huber's ε-Contamination Model. In this setting, instead of observing samples directly from the true distribution P, we observe samples drawn from P_ε, which for an arbitrary distribution Q is defined as the mixture model

$$P_\epsilon = (1-\epsilon)P + \epsilon Q. \qquad (3)$$

The distribution Q allows one to model arbitrary outliers, which may correspond to gross corruptions, or subtle deviations from the true model. There has been a lot of classical work studying estimators in the ε-contamination model under the umbrella of robust statistics (see for instance Hampel et al. (1986) and references therein). However, most of the estimators that come with strong guarantees are computationally intractable (Tukey, 1975), while others are statistically sub-optimal heuristics (Hastings et al., 1947). Recently, there has been substantial progress (Diakonikolas et al., 2016; Lai et al., 2016; Kothari et al., 2018; Charikar et al., 2017; Diakonikolas et al., 2017; Balakrishnan et al., 2017; Prasad et al., 2018; Diakonikolas et al., 2018) in designing provably robust estimators which are computationally tractable while achieving near-optimal contamination dependence (i.e. dependence on the fraction of outliers ε) for computing means and covariances. In the Huber model, using information-theoretic lower bounds (Chen et al., 2016; Lai et al., 2016; Hopkins and Li, 2018), it can be shown that any estimator must suffer a non-zero bias (the asymptotic error as the number of samples goes to infinity). For example, for the class of distributions with bounded variance, Σ ⪯ σ²I_p, the bias of any estimator is lower bounded by Ω(σ√ε). Surprisingly, the optimal bias that can be achieved is often independent of the data dimension. In other words, in many interesting cases optimally robust estimators in Huber's model can tolerate a constant fraction ε of outliers, independent of the dimension.

While the aforementioned recent estimators for mean estimation under Huber contamination have polynomial computational complexity, their corresponding sample complexities are only known to be polynomial in the dimension p. For example, Kothari et al. (2018) and Hopkins and Li (2018) designed estimators which achieve optimal bias for distributions with certifiably bounded 2k-moments, but their statistical sample complexity scales as O(p^k). Steinhardt et al. (2017) studied mean estimation and presented an estimator which has a sample complexity of Ω(p^{1.5}).

Despite their apparent similarity, developments of estimators that are robust in each of these models have remained relatively independent. Focusing on mean estimation, we notice subtle differences: in the heavy-tailed model our target is the mean of the sampling distribution, whereas in the Huber model our target is the mean of the decontaminated sampling distribution P. Beyond this distinction, it is also important to note that, as highlighted above, the natural focus in heavy-tailed mean estimation is on achieving strong, high-probability error guarantees, while in Huber's model the focus has been on achieving dimension-independent bias.

Contributions. In this work, we aim to design estimators which are statistically optimally robust in both models simultaneously, i.e. they achieve a dimension-independent asymptotic bias in the ε-contamination model and achieve high-probability deviation bounds similar to (2). Our main workhorse is a conceptually simple way of reducing multivariate estimation to the univariate setting. Then, by carefully solving mean estimation in the univariate setting, we are able to design optimal estimators for multivariate mean and covariance estimation for non-parametric distribution classes, both in the low-dimensional (n ≥ p) and high-dimensional (n < p) settings. We achieve these rates for non-parametric distribution classes such as distributions with bounded 2k-moments and the class of symmetric distributions.

2 Background and Setup

In this section, we formally define the two classes of distributions which we work with in this paper: (1) bounded-2k-moment distributions and (2) symmetric distributions.

Bounded 2k-Moment Class. Let x be a random vector with mean µ and covariance Σ. We say that x has bounded 2k-moments if for all v ∈ S^{p−1},

$$E\big[(v^T(x-\mu))^{2k}\big] \le C_{2k}\, E\big[(v^T(x-\mu))^2\big]^k.$$

We let P_{2k}^{σ²} be the class of distributions with bounded 2k-moments with covariance matrix Σ ⪯ σ²I_p.
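For instance (our worked example, not from the paper), any Gaussian satisfies this condition with C_{2k} = (2k − 1)!!, since every projection v^T(x − µ) is itself a zero-mean Gaussian:

$$E\big[(v^T(x-\mu))^{2k}\big] = (2k-1)!!\,(v^T\Sigma v)^{k} = (2k-1)!!\,\big(E[(v^T(x-\mu))^2]\big)^{k}.$$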
Symmetric Distributions. There exist several notions of symmetry for multivariate distributions. We discuss these notions briefly, but refer the reader to Liu (1990) and references therein for a detailed treatment. Let P_sym be the class of H-symmetric distributions with unique center of H-symmetry. Moreover, suppose P_sym^{t₀,κ} ⊂ P_sym is the class of distributions such that for any P ∈ P_sym^{t₀,κ} the CDF of the univariate projection (u^T P), given by F_{u^T P}, increases at least linearly around u^Tθ. Formally, for all x_1 ∈ [med(u^T P), F_{u^T P}^{-1}(1/2 + t_0)] we have that

$$F_{u^T P}(x_1) - \frac{1}{2} \ge \frac{1}{\kappa_{u,P}}\big(x_1 - \mathrm{med}(u^T P)\big), \qquad (4)$$

and for all x_2 ∈ [F_{u^T P}^{-1}(1/2 − t_0), med(u^T P)] we have that 1/2 − F_{u^T P}(x_2) ≥ (1/κ_{u,P})(med(u^T P) − x_2), for κ_{u,P} ≤ κ. A higher κ corresponds to a slower rate of increase of the CDF around the median; note that κ can be thought of as a measure of variance or dispersion. In particular, in the case of the univariate Gaussian distribution, i.e. P = N(µ, σ²), κ(P) = Cσ. Similarly, for the univariate Cauchy distribution with scale γ, κ(P) ≈ Cγ. Note that any univariate distribution P ∈ P_sym with density function p(x) such that min_{|t|≤t₀} p(F_P^{-1}(1/2 + t)) > 1/κ also belongs to P_sym^{t₀,κ}.
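To see why this density condition implies condition (4), integrate the density over the interval in question; the points of [med(P), F_P^{-1}(1/2 + t_0)] are exactly the quantiles F_P^{-1}(1/2 + t) for t ∈ [0, t_0], so

$$F_P(x_1) - \frac{1}{2} = \int_{\mathrm{med}(P)}^{x_1} p(s)\,ds \;\ge\; \big(x_1 - \mathrm{med}(P)\big)\min_{|t|\le t_0} p\big(F_P^{-1}(\tfrac{1}{2}+t)\big) \;\ge\; \frac{1}{\kappa}\big(x_1 - \mathrm{med}(P)\big),$$

and symmetrically to the left of the median.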

The minimax bias for mean estimation in Huber's model is Θ(ε) (Chen et al., 2015). Our explicit dimension-dependent lower bound on the bias of M-estimators shows their sub-optimality.

Subset Search. Having ruled out convex estimation to a certain extent, we next turn our attention to non-convex methods. Perhaps the simplest non-convex method is simple search. Intuitively, the squared loss measures the fit between a candidate θ and the samples Z, and if all samples don't come from the same distribution (i.e. there are outliers), then the corresponding fit should be bad. To capture this intuition algorithmically, one can (1) consider all subsets of size ⌊(1 − ε)n⌋, (2) minimize the squared loss over these subsets, and then (3) return the estimator corresponding to the subset with the least squared loss, or best fit. To be precise, given n samples from P_ε, define

$$S^* \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{S \,:\, |S| = (1-\epsilon)n}\; \min_{\theta}\; \frac{1}{(1-\epsilon)n} \sum_{x_i \in S} \|x_i - \theta\|_2^2, \qquad \hat{\theta}_{SRM} \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{\theta}\; \frac{1}{(1-\epsilon)n} \sum_{x_i \in S^*} \|x_i - \theta\|_2^2. \qquad (5)$$
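For concreteness, the following is a brute-force rendering of (5) (ours, purely illustrative: its run time is exponential in n, so it is only usable on tiny datasets). For a fixed subset, the inner minimization over θ is solved by the subset mean.

```python
import itertools
import numpy as np

def subset_search_mean(X, eps):
    """Exhaustive subset search (5): over all subsets of size
    floor((1 - eps) * n), fit theta by the subset mean (the least-squares
    minimizer) and return the fit with the smallest squared loss."""
    n = X.shape[0]
    m = int(np.floor((1 - eps) * n))
    best_loss, best_theta = np.inf, None
    for S in itertools.combinations(range(n), m):
        sub = X[list(S)]
        theta = sub.mean(axis=0)               # argmin of the squared loss
        loss = np.sum((sub - theta) ** 2) / m  # fit of this subset
        if loss < best_loss:
            best_loss, best_theta = loss, theta
    return best_theta
```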

Our next result studies the asymptotic performance of this estimator.

Lemma 2. Let P = N(0, I_p). Then, as n → ∞, we have that

$$\sup_{Q} \|\hat{\theta}_{SRM} - E_{x\sim P}[x]\|_2 = \frac{\epsilon\sqrt{p}}{(1-\epsilon)(1-2\epsilon)}. \qquad (6)$$

The above result shows that, while subset search has a finite dimension-independent breakdown point (0.5), the bias of this estimator necessarily scales with the dimension p.

4 Optimal Univariate Estimation

In the previous section, we studied some natural candidate estimators and showed that they don't achieve the optimal asymptotic bias in the ε-contamination model for multivariate mean estimation. In this section, we take a step back, and study univariate estimation.

4.1 Bounded 2k-moments

We study the interval estimator which was initially proposed by Lai et al. (2016). The estimator, presented in Algorithm 1, proceeds by using half of the samples to identify the shortest interval containing at least a (1 − ε)n fraction of the points; the remaining half of the points is then used to return an estimate of the mean.

Algorithm 1 Robust Univariate Mean Estimation
function Interval1D({z_i}_{i=1}^{2n}, corruption level ε, confidence level δ)
    Split the data into two subsets: Z_1 = {z_i}_{i=1}^{n} and Z_2 = {z_i}_{i=n+1}^{2n}.
    Let α = max(ε, log(1/δ)/n).
    Using Z_1, let Î = [a, b] be the shortest interval containing n(1 − 2α − √(2α log(4/δ)/n) − log(4/δ)/n) points.
    Use Z_2 to identify points lying in [a, b].
    return ( Σ_{i=n+1}^{2n} z_i 1{z_i ∈ Î} ) / ( Σ_{i=n+1}^{2n} 1{z_i ∈ Î} )
end function

We assume that the contamination level ε and confidence level δ are such that

$$2\epsilon + \sqrt{\frac{\log(4/\delta)}{n}} + \frac{\log(4/\delta)}{n} < \frac{1}{2}.$$

Then, we have the following Lemma.

Lemma 3. Suppose P is any 2k-moment bounded distribution over ℝ with mean µ and variance bounded by σ². Given n samples {x_i}_{i=1}^{n} from the mixture distribution (3), Algorithm 1 returns an estimate θ̂_δ such that, with probability at least 1 − δ,

$$|\hat{\theta}_\delta - \mu| \lesssim \sigma\,\max\Big(2\epsilon, \frac{\log(1/\delta)}{n}\Big)^{1-\frac{1}{2k}} + \sigma\Big(\frac{\log n}{n}\Big)^{1-\frac{1}{2k}} + \sigma\sqrt{\frac{\log(1/\delta)}{n}}.$$

Observe that Algorithm 1 has an asymptotic bias of O(σε^{1−1/2k}) in the ε-contamination setting, which is known to be information-theoretically optimal (Hopkins and Li, 2018; Lai et al., 2016). Moreover, observe that for ε = 0, if P has at least a bounded 4th moment, i.e. k ≥ 2, the (log(n)/n)^{1−1/2k} term can be ignored for large enough n. Hence, for k ≥ 2 and large enough n, Algorithm 1 achieves the deviation rate of σ√(log(1/δ)/n).
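A minimal Python sketch of Algorithm 1, assuming ε and δ satisfy the condition preceding Lemma 3 (otherwise the prescribed fraction can degenerate):

```python
import numpy as np

def interval_1d(z, eps, delta):
    """Sketch of Algorithm 1 (Interval1D): the first half of the data
    locates the shortest interval containing the prescribed number of
    points; the second half is averaged inside that interval."""
    z = np.asarray(z, dtype=float)
    n = len(z) // 2
    z1, z2 = np.sort(z[:n]), z[n:]
    alpha = max(eps, np.log(1 / delta) / n)
    frac = (1 - 2 * alpha - np.sqrt(2 * alpha * np.log(4 / delta) / n)
            - np.log(4 / delta) / n)
    m = max(1, int(np.floor(frac * n)))    # points the interval must cover
    widths = z1[m - 1:] - z1[: n - m + 1]  # widths of all m-point windows
    j = int(np.argmin(widths))
    a, b = z1[j], z1[j + m - 1]            # shortest covering interval
    inside = z2[(z2 >= a) & (z2 <= b)]     # second half, filtered
    return float(inside.mean())
```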

4.2 Symmetric Distributions

In the univariate setting, our estimator, presented in Algorithm 2, simply returns the sample median of the observed samples. While this idea is simple and crucially exploits the fact that the mean and the median coincide for a symmetric distribution, it has profound implications on the effect of contamination in the Huber contamination model. Next, we present the theoretical bound achieved by this estimator, which was shown in Altschuler et al. (2018).

Algorithm 2 Sample Median
function SampleMedian1D({z_i}_{i=1}^{2n+1})
    Let z_{[k]} be the k-th order statistic of {z_i}.
    return z_{[n+1]}
end function

We further assume that ε and δ are such that

$$\frac{\epsilon}{2(1-\epsilon)} + \frac{1}{1-\epsilon}\sqrt{\frac{\log(2/\delta)}{n}} \le t_0.$$

Then we have the following result.

Lemma 4 (Altschuler et al., 2018). Let P be any univariate distribution in P_sym^{t₀,κ}. Given n samples from the mixture distribution (3), with probability at least 1 − δ, Algorithm 2 returns an estimate θ̂ such that

$$|\hat{\theta} - E_{x\sim P}[x]| \le C_1 \kappa\epsilon + C_2 \kappa\sqrt{\frac{\log(1/\delta)}{n}}.$$
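A quick numerical illustration of Lemma 4 (the parameters are ours and arbitrary): the median of a contaminated Cauchy sample stays near the center of symmetry even though the distribution has no finite mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 10_000, 0.05
mask = rng.random(n) < eps                       # epsilon fraction of outliers, as in (3)
z = np.where(mask, 1e6, rng.standard_cauchy(n))  # P symmetric about 0, Q a far point mass
print(np.median(z))  # stays within O(kappa * eps) of the center 0
print(np.mean(z))    # destroyed: heavy tails plus outliers
```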

Observe that Algorithm 2 has an asymptotic bias of O(κε), which is also information-theoretically optimal. To see this, observe that N(·, κ²) lies in P_sym^{t₀,κ} for some constant t_0, and use the fact that TV(N(µ_1, κ²), N(µ_2, κ²)) = O(|µ_1 − µ_2|/κ) (Theorem 1.3, Devroye et al., 2018).

5 From 1D to p-D: A Meta-Estimator

In this section, we propose a general meta-estimator to extend any univariate estimator to the multivariate setting. For any univariate estimator f(·), suppose that when given n samples from the mixture model, it returns an estimate f(X_n) such that with probability 1 − δ,

$$|f(X_n, \epsilon, \delta) - \mu(P)| \le \omega_f(\epsilon, P, \delta).$$

Note that ω_f(ε, P, δ) is the error suffered by the univariate estimator at a contamination level ε and confidence level δ, when the true univariate distribution is P.

5.1 Mean Estimation

The proposed meta-estimator proceeds by robustly estimating the mean along almost every direction u, and returns an estimate θ̂ whose projection along u (that is, u^Tθ̂) is close to the univariate robust mean along that direction. In particular, let N^{1/2}(S^{p−1}) be the 1/2-cover of the unit sphere S^{p−1}, i.e. for all u ∈ S^{p−1} there exists a y ∈ N^{1/2}(S^{p−1}) such that u = y + z for some ‖z‖_2 ≤ 1/2. Then, for any point θ ∈ ℝ^p and any univariate estimator f(·), consider the following loss:

$$D_f(\theta, \{x_i\}_{i=1}^n) = \sup_{u \in N^{1/2}(S^{p-1})} \Big|u^T\theta - f\big(\{u^T x_i\}_{i=1}^n, \epsilon, \tfrac{\delta}{5^p}\big)\Big|.$$

Then, we use it to construct the following multivariate meta-estimator θ̂_f, which takes in n samples {x_i}_{i=1}^n and a univariate estimator f(·):

$$\hat{\theta}_f(\{x_i\}_{i=1}^n) = \operatorname*{argmin}_{\theta}\; D_f(\theta, \{x_i\}_{i=1}^n).$$

Such directional-control based estimators have been previously studied in the context of heavy-tailed mean estimation by Joly et al. (2017) and Catoni and Giulini (2017). Joly et al. (2017) used the median-of-means estimator, while Catoni and Giulini (2017) used Catoni's M-estimator (Catoni, 2012) as their univariate estimator. Then, we have the following result.

Lemma 5. Suppose P is a multivariate distribution with mean µ. Given n samples from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\theta}_f(X_n) - \mu\|_2 \lesssim \sup_{u \in N^{1/2}(S^{p-1})} \omega_f\big(\epsilon, u^T P, \tfrac{\delta}{5^p}\big),$$

where u^T P is the univariate distribution of P along u.

Sparse Mean Estimation. In this setting, we further assume that the true mean vector of the distribution P has only a few non-zero coordinates, i.e. it is sparse. Such sparsity patterns are known to be present in high-dimensional data (see Rish et al. (2014) and references therein). Then, the goal is to design estimators which can exploit this sparsity structure, while remaining robust under the ε-contamination model. Formally, for a vector x ∈ ℝ^p, let supp(x) = {i ∈ [p] s.t. x(i) ≠ 0}. Then, x is s-sparse if |supp(x)| ≤ s. We further assume that s ≤ p/2. Let Θ_s be the set of s-sparse vectors in ℝ^p, and let N_{2s}^{1/2}(S^{p−1}) be the 1/2-cover of the set of unit vectors which are 2s-sparse. Then, for any univariate estimator f(·), let

$$D_{f,s}(\theta, \{x_i\}_{i=1}^n) = \sup_{u \in N_{2s}^{1/2}(S^{p-1})} \Big|u^T\theta - f\big(u^T X_n, \epsilon, \tfrac{\delta}{(6ep/s)^s}\big)\Big|.$$

Then, we can define the following meta-estimator,

$$\hat{\theta}_{f,s}(X_n) = \operatorname*{argmin}_{\theta \in \Theta_s}\; D_{f,s}(\theta, X_n),$$

which has the following error guarantee.

Lemma 6. Suppose P is a multivariate distribution with mean µ such that µ is s-sparse. Given n samples from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\theta}_{f,s}(X_n) - \mu\|_2 \lesssim \sup_{u \in N_{2s}^{1/2}(S^{p-1})} \omega_f\big(\epsilon, u^T P, \tfrac{\delta}{(6ep/s)^s}\big),$$

where u^T P is the univariate distribution of P along u.
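The following is a sketch of the 1D-to-pD reduction (ours, under two stated simplifications): random unit directions stand in for the 1/2-cover N^{1/2}(S^{p−1}), whose exact size 5^p is exponential in p, so this version trades the theoretical guarantee for tractability. For a fixed finite direction set, the argmin over θ is exactly a small linear program in (θ, t).

```python
import numpy as np
from scipy.optimize import linprog

def meta_estimator(X, univariate_est, n_dirs=500, seed=0):
    """Estimate the mean along each direction u with a robust univariate
    estimator f, then return the theta minimizing max_u |u^T theta - m_u|,
    solved as a linear program: minimize t s.t. |U theta - m| <= t."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dirs, p))
    U /= np.linalg.norm(U, axis=1, keepdims=True)     # unit directions
    m = np.array([univariate_est(X @ u) for u in U])  # robust 1D estimates
    c = np.r_[np.zeros(p), 1.0]                       # objective: minimize t
    A = np.block([[U, -np.ones((n_dirs, 1))],
                  [-U, -np.ones((n_dirs, 1))]])
    b = np.r_[m, -m]
    res = linprog(c, A_ub=A, b_ub=b,
                  bounds=[(None, None)] * p + [(0, None)])
    return res.x[:p]
```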

5.2 Covariance Estimation

In this section, we study recovering the true covariance matrix when given samples from a mixture distribution. We first center our observations by defining pseudo-samples z_i = (x_i − x_{i+n/2})/√2. We can think of z_i as being sampled from the Huber contamination P̃_{2ε}, where P̃ = (1/√2)(P − P) is the distribution of the scaled difference of two independent draws from P. Let Z_n = {z_i}_{i=1}^{n} be the set of these pseudo-samples, and let u^T Z_n^{⊗2} = {(u^T z_i)²}_{i=1}^{n}. Let F = {Σ = Σ^T ∈ ℝ^{p×p} : Σ ⪰ 0} be the class of positive semi-definite symmetric matrices. Then, for any matrix Θ ∈ F, let

$$D_f^{\otimes 2}(\Theta, Z_n) = \sup_{u \in N^{1/4}(S^{p-1})} \Big|u^T\Theta u - f\big(u^T Z_n^{\otimes 2}, \epsilon, \tfrac{\delta}{9^p}\big)\Big|.$$

Then, consider the meta-estimator Θ̂_f(Z_n) given as

$$\hat{\Theta}_f(Z_n) = \operatorname*{argmin}_{\Theta \in F}\; D_f^{\otimes 2}(\Theta, Z_n).$$

Lemma 7. Suppose P is a multivariate distribution with covariance Σ. Given n samples from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\Theta}_f(Z_n) - \Sigma\|_2 \lesssim \sup_{u \in N^{1/4}(S^{p-1})} \omega_f\big(2\epsilon, u^T \tilde{P}^{\otimes 2}, \tfrac{\delta}{9^p}\big),$$

where u^T P̃^{⊗2} is the univariate distribution of (u^T z_i)² for z_i ∼ P̃.

Sparse Covariance Estimation. Next, we consider sparse covariance matrices. In particular, we assume that there exists a subset S of |S| = s covariates that are correlated with each other, and the remaining covariates [p]\S are independent from this subset and from each other. Such sparsity patterns arise naturally in various real-world data (Bien and Tibshirani, 2011). More concretely, for a subset of coordinates S, define

$$G(S) \stackrel{\mathrm{def}}{=} \{G = (g_{ij}) \in \mathbb{R}^{p\times p} : g_{ij} = 0 \text{ if } i \notin S \text{ or } j \notin S\},$$

and let G(s) = ∪_{S⊂[p]:|S|≤s} G(S). Consider the class of matrices F_s such that

$$F_s = \{\Sigma = \Sigma^T,\; \Sigma \succeq 0,\; \Sigma - \mathrm{diag}(\Sigma) \in G(s)\}.$$

Then, for any matrix Θ and univariate estimator f, let

$$D_{f,s}(\Theta, Z_n) = \sup_{u \in N_{2s}^{1/4}(S^{p-1})} \Big|u^T\Theta u - f\big(u^T Z_n^{\otimes 2}, \epsilon, \tfrac{\delta}{(9ep/s)^s}\big)\Big|.$$

Then, we can define the following estimator:

$$\hat{\Theta}_{f,s}(X_n) = \operatorname*{arginf}_{\Theta \in F_s}\; D_{f,s}(\Theta, Z_n).$$

Lemma 8. Suppose P is a multivariate distribution with covariance Σ such that Σ ∈ F_s. Given n samples from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\Theta}_{f,s}(Z_n) - \Sigma\|_2 \lesssim \sup_{u \in N_{2s}^{1/4}(S^{p-1})} \omega_f\big(2\epsilon, u^T \tilde{P}^{\otimes 2}, \tfrac{\delta}{(9ep/s)^s}\big),$$

where u^T P̃^{⊗2} is the univariate distribution of (u^T z_i)² for z_i ∼ P̃.
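In code, the pseudo-sample construction underlying both estimators in this subsection is direct (a sketch, assuming an even number of samples):

```python
import numpy as np

def pseudo_samples(X):
    """Pair-and-difference centering: z_i = (x_i - x_{i + n/2}) / sqrt(2).
    The z_i are symmetric about 0 with the same covariance as the clean
    distribution P; the univariate estimator is then fed the squared
    projections (u^T z_i)^2, whose mean is u^T Sigma u."""
    n = X.shape[0] // 2
    return (X[:n] - X[n:2 * n]) / np.sqrt(2)
```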

6 Consequences for P_{2k}^{σ²}

Next, we study the performance of our meta-estimator for multivariate estimation for the class of distributions with bounded 2k-moments. In particular, we use the interval estimator (IM) presented in Algorithm 1 as our univariate estimator to instantiate our meta-estimator.

Multivariate Mean Estimation. In the multivariate setting, we further assume that the contamination level ε and confidence δ are such that

$$2\epsilon + \Big(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\Big) + \frac{p}{n} + \frac{\log(4/\delta)}{n} < c,$$

for some small constant c > 0. Then, we have the following result.

Corollary 1. Suppose P has bounded 2k-moments with mean µ and covariance Σ. Given n samples {x_i}_{i=1}^{n} from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\theta}_{IM}(X_n) - \mu\|_2 \lesssim \|\Sigma\|_2^{1/2}\,\epsilon^{1-\frac{1}{2k}} + \|\Sigma\|_2^{1/2}\sqrt{\frac{\log(1/\delta)}{n}} + \|\Sigma\|_2^{1/2}\Big(\sqrt{\frac{p}{n}} + \Big(\frac{\log n}{n}\Big)^{1-\frac{1}{2k}}\Big).$$

Observe that the proposed estimator achieves a dimension-independent asymptotic bias of O(σε^{1−1/2k}) in the ε-contamination model for multivariate mean estimation, with a sample complexity of O(p).
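Putting the pieces together, here is an end-to-end illustration of Corollary 1's qualitative claim, combining the two sketches above (the numbers are ours and arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, eps = 2000, 10, 0.05
X = rng.standard_t(df=10, size=(n, p))  # heavy-ish tails, true mean 0
X[rng.random(n) < eps] = 100.0          # gross corruptions, as in (3)
theta_hat = meta_estimator(X, lambda z: interval_1d(z, eps=0.05, delta=0.01))
print(np.linalg.norm(theta_hat))        # small, per Corollary 1
print(np.linalg.norm(X.mean(axis=0)))   # the sample mean is dragged away
```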

Sparse Mean Estimation. In this setting, we assume that the contamination level ε and confidence δ are such that

$$2\epsilon + \Big(\sqrt{\frac{s\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\Big) + \frac{s\log p}{n} + \frac{\log(4/\delta)}{n} < c,$$

for some small constant c > 0. Then, we have the following result.

Corollary 2. Suppose P has bounded 2k-moments with mean µ and covariance Σ, where µ is s-sparse. Then, given n samples {x_i}_{i=1}^{n} from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\theta}_{IM,s}(X_n) - \mu\|_2 \lesssim \|\Sigma\|_{2,2s}^{1/2}\,\epsilon^{1-\frac{1}{2k}} + \|\Sigma\|_{2,2s}^{1/2}\sqrt{\frac{\log(1/\delta)}{n}} + \|\Sigma\|_{2,2s}^{1/2}\Big(\sqrt{\frac{s\log p}{n}} + \Big(\frac{\log n}{n}\Big)^{1-\frac{1}{2k}}\Big),$$

where ‖Σ‖_{2,2s} = sup_{u∈S^{p−1}, ‖u‖₀≤2s} u^TΣu.

The above result shows that the proposed estimator exploits the underlying sparsity structure and achieves the near-optimal sample complexity of O(s log p), while simultaneously achieving the optimal asymptotic bias of O(‖Σ‖_{2,2s}^{1/2} ε^{1−1/2k}).

Covariance Estimation. We begin by first calculating ω_IM(2ε, u^T P̃^{⊗2}, δ). To do this, recall that for fixed u, for the clean samples z_i, (u^T z_i)² has mean u^TΣ(P)u and variance C₄(u^TΣ(P)u)². Note that the scalar random variables (u^T z_i)² have bounded k-moments whenever x_i has bounded 2k-moments. Hence, from Lemma 3, we have that

$$\omega_{IM}\big(2\epsilon, u^T \tilde{P}^{\otimes 2}, \delta\big) \lesssim \big(u^T\Sigma(P)u\big)\,\epsilon^{1-\frac{1}{k}} + \big(u^T\Sigma(P)u\big)\sqrt{\frac{\log 1/\delta}{n}}.$$

We assume that the contamination level ε and confidence δ are such that

$$4\epsilon + 2\Big(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\Big) + \frac{p}{n} + \frac{\log(4/\delta)}{n} < c,$$

for some small constant c > 0. Then, we have the following result.

Corollary 3. Suppose P has bounded 2k-moments. Then, given X_n drawn from the mixture model, we have that with probability at least 1 − δ,

$$\|\hat{\Theta}_{IM} - \Sigma(P)\|_2 \lesssim \|\Sigma(P)\|_2\,\epsilon^{1-\frac{1}{k}} + \|\Sigma(P)\|_2\sqrt{\frac{p}{n}} + \|\Sigma(P)\|_2\sqrt{\frac{\log 1/\delta}{n}}.$$

Observe that the proposed estimator achieves a dimension-independent asymptotic bias of O(σ²ε^{1−1/k}) in the ε-contamination model for multivariate covariance estimation, with a sample complexity of O(p).

Sparse Covariance Estimation. In this setting, we assume that the contamination level ε and confidence δ are such that

$$4\epsilon + 2\Big(\sqrt{\frac{s\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\Big) + \frac{s\log p}{n} + \frac{\log(4/\delta)}{n} < c,$$

for some small constant c > 0. Then, we have the following result.

Corollary 4. Suppose P has bounded 2k-moments and Σ(P) ∈ F_s. Then, given X_n drawn from the mixture model, we have that with probability at least 1 − δ,

$$\|\hat{\Theta}_{IM,s} - \Sigma(P)\|_2 \lesssim \|\Sigma(P)\|_2\,\epsilon^{1-\frac{1}{k}} + \|\Sigma(P)\|_2\sqrt{\frac{s\log p}{n}} + \|\Sigma(P)\|_2\sqrt{\frac{\log 1/\delta}{n}}.$$

As before, even in this case, the proposed estimator achieves a dimension-independent bias of O(σ²ε^{1−1/k}), with a sample complexity of O(s log p).

Sparse PCA in the Spiked Covariance Model. As an application of the sparse covariance estimation, we consider the following spiked covariance model, where the true distribution P ∈ P_{2k} is such that

$$\Sigma(P) = V\Lambda V^T + I_p, \qquad (7)$$

where V ∈ ℝ^{p×r} is an orthonormal matrix, and Λ ∈ ℝ^{r×r} is a diagonal matrix with entries Λ_1 ≥ Λ_2 ≥ ... ≥ Λ_r > 0. In this setting, suppose we observe samples from a mixture distribution P_ε; then the goal is to estimate the subspace projection matrix VV^T, i.e. construct V̂ such that ‖V̂V̂^T − VV^T‖_F is small. Note that when V has only s non-zero rows, then the corresponding covariance matrix Σ is s-sparse (Σ ∈ F_s). We follow Chen et al. (2015) and use our sparse covariance estimator Θ̂_{IM,s}(X_n) to construct V̂ ∈ ℝ^{p×r} by setting its j-th column to be the j-th eigenvector of Θ̂_{IM,s}(X_n). Then, under the assumption that (ε, n) are such that

$$(1+\Lambda_1)\,\epsilon^{1-\frac{1}{k}} + (1+\Lambda_1)\sqrt{\frac{s\log p}{n}} + (1+\Lambda_1)\sqrt{\frac{\log 1/\delta}{n}} \lesssim \frac{\Lambda_r}{2},$$

we have the following result.

Corollary 5. Suppose P has bounded 2k-moments, Σ(P) is of the form (7), and we are given n samples from the mixture distribution. Then, we have that with probability at least 1 − δ,

$$\|\hat{V}\hat{V}^T - VV^T\|_F^2 \lesssim \Big(\frac{1+\Lambda_1}{\Lambda_r}\Big)^2 \epsilon^{2-\frac{2}{k}} + \Big(\frac{1+\Lambda_1}{\Lambda_r}\Big)^2\Big(\frac{s\log p}{n} + \frac{\log 1/\delta}{n}\Big).$$
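The V̂ construction above reduces to an eigendecomposition of the robust covariance estimate; a minimal sketch (here `Theta_hat` stands for the output of the sparse covariance estimator):

```python
import numpy as np

def top_r_eigvecs(Theta_hat, r):
    """Sparse PCA step: V-hat's j-th column is the j-th eigenvector of
    the robust covariance estimate, eigenvalues in decreasing order."""
    w, V = np.linalg.eigh(Theta_hat)          # ascending eigenvalues
    return V[:, np.argsort(w)[::-1][:r]]      # top-r, largest first

# Projection-matrix estimate: Vr = top_r_eigvecs(Theta_hat, r); Vr @ Vr.T
```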
In + kΣ(P )k 2 n contrast, all our results are valid for the much broader class of distributions with bounded 2k-moment. As As before, even in this case, the proposed estimator discussed in the introduction, while such results have achieves a dimension independent bias of O(σ21−1/k), been recently obtained for mean estimation, our simple with a sample complexity of O(s log p). meta-estimator achieves these high-probability error A Robust Univariate Mean Estimator is All You Need guarantees for a much broader range of problems. To for symmetric distributions without moments. Note the best of our knowledge, these are some of the first that similar results can be derived for other higher- estimators which get such high-probability deviation order moments. bounds for sparse-mean, covariance, sparse-covariance Discussion. Observe the difference in achievable 2 and sparse-PCA. t0,κ σ rates for Psym and P2k . In particular, for symmet- t0,κ 7 Consequences for Psym ric distributions including those which have no finite variance, the maximum bias introduced by Huber Con- Next, we study the performance of our meta-estimator tamination Model is at most O(κ). In contrast for dis- for multivariate estimation for the class of symmetric tributions with bounded 2k-moments, the lower bound distributions. In particular, we use the sample median for mean estimation is Ω(σ1−1/2k). Note that the presented in Algorithm2 as our univariate estimator depth based estimators of Chen et al.(2015) also im- to instantiate our meta-estimator. In the multivari- plicitly assume that the underlying distribution is sym- ate setting, we further assume that the contamination metric, and hence obtain similar rates for elliptical dis- level , and confidence level δ are such that, tributions. r  1 p log(2/δ) + + ≤ t . 8 Conclusion and Future Directions. 2(1 − ) (1 − ) n n 0 In this work we provided a conceptually simple way Then, we have the following result. of reducing multivariate estimation to univariate es- t0,κ Corollary 6. Suppose P ∈ Psym is a multivariate dis- timation. In particular, we showed how to use any n tribution with mean µ. Given n samples {xi}i=1 from robust univariate estimator to design statistically op- the mixture distribution (3), we get that with probabil- timal robust estimators for multivariate estimation. ity at least 1 − δ, Through this reduction, we derived optimal estima- tors for non-parametric distribution classes such as r r log(1/δ) p distributions with bounded 2k-moments and symmet- kθbMed(Xn) − µk2 κ + κ + κ . n n rical distributions. Our estimators achieved optimal asymptotic bias in the -contamination model, and Observe that the proposed estimator achieves a di- also high-probability deviation bounds in the uncon- mension independent asymptotic bias of O(κ) in the taminated setting. There are several avenues for future -contamination model for multivariate mean estima- work, some of which we discuss below. tion, with a sample complexity of O(p). Computationally Efficient Estimators. As noted Sparse Mean Estimation. In this setting, we as- in the introduction, there has been a flurry of work sume that the contamination level , and confidence in the theoretical computer science community on de- are such that, signing polynomial time estimators for robust mean estimation. Designing sample-efficient estimators for r  1 s log p log(2/δ) sparse-mean estimation for the bounded 2k-moment + + t . 2(1 − ) (1 − ) n n . 0 class is an open problem. 
8 Conclusion and Future Directions

In this work we provided a conceptually simple way of reducing multivariate estimation to univariate estimation. In particular, we showed how to use any robust univariate estimator to design statistically optimal robust estimators for multivariate estimation. Through this reduction, we derived optimal estimators for non-parametric distribution classes such as distributions with bounded 2k-moments and symmetric distributions. Our estimators achieve an optimal asymptotic bias in the ε-contamination model, and also high-probability deviation bounds in the uncontaminated setting. There are several avenues for future work, some of which we discuss below.

Computationally Efficient Estimators. As noted in the introduction, there has been a flurry of work in the theoretical computer science community on designing polynomial-time estimators for robust mean estimation. Designing sample-efficient estimators for sparse mean estimation for the bounded 2k-moment class is an open problem. Similarly, for covariance estimation, most existing work has focused on the Frobenius norm or Mahalanobis distance, and designing estimators for covariance estimation in operator norm for general bounded 2k-moment distributions is an open problem. Another important challenge is to design computationally efficient estimators for the mean of a symmetric distribution.

Extension to General Risk Minimization. Another future line of work is to extend our results to general risk minimization problems such as Linear Regression and Generalized Linear Models.

9 Acknowledgements

AP and PR acknowledge the support of NSF via IIS-1909816.

References

Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pages 1148–1185. Institut Henri Poincaré, 2012.

Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.

Gábor Lugosi and Shahar Mendelson. Sub-Gaussian estimators of the mean of a random vector. arXiv preprint arXiv:1702.00482, 2017.

Olivier Catoni and Ilaria Giulini. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747, 2017.

David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.

Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC '96, pages 20–29, New York, NY, USA, 1996. ACM.

A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication. Wiley, 1983.

Mark R. Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.

Samuel B Hopkins. Sub-Gaussian mean estimation in polynomial time. arXiv preprint arXiv:1809.07425, 2018.

Yeshwanth Cherapanamjeri, Nicolas Flammarion, and Peter L Bartlett. Fast mean estimation with sub-Gaussian rates. arXiv preprint arXiv:1902.01998, 2019.

Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust Statistics. Wiley Online Library, 1986.

John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.

Cecil Hastings, Frederick Mosteller, John W Tukey, and Charles P Winsor. Low moments for small samples: a comparative study of order statistics. The Annals of Mathematical Statistics, 18(3):413–426, 1947.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.

Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.

Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035–1046. ACM, 2018.

Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60. ACM, 2017.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008, 2017.

Sivaraman Balakrishnan, Simon S Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Conference on Learning Theory, pages 169–212, 2017.

Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.

Mengjie Chen, Chao Gao, and Zhao Ren. A general decision theory for Huber's epsilon-contamination model. Electronic Journal of Statistics, 10(2):3752–3774, 2016.

Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1021–1034. ACM, 2018.

Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.

Regina Y Liu. On a notion of data depth based on random simplices. The Annals of Statistics, 18(1):405–414, 1990.

Peter J Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, 36(6):1753–1758, 1965.

Ricardo Antonio Maronna. Robust M-estimators of multivariate location and scatter. The Annals of Statistics, pages 51–67, 1976.

David L Donoho and Miriam Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, 20(4):1803–1827, 1992.

Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huber's contamination model. ArXiv e-prints, to appear in the Annals of Statistics, 2015.

Jason Altschuler, Victor-Emmanuel Brunel, and Alan Malek. Best arm identification for contaminated bandits. arXiv preprint arXiv:1802.09514, 2018.

Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional Gaussians. arXiv preprint arXiv:1810.08693, 2018.

Emilien Joly, Gábor Lugosi, and Roberto Imbuzeiro Oliveira. On the estimation of the mean of a random vector. Electronic Journal of Statistics, 11(1):440–451, 2017.

Irina Rish, Guillermo A Cecchi, Aurelie Lozano, and Alexandru Niculescu-Mizil. Practical Applications of Sparse Modeling. MIT Press, 2014.

Jacob Bien and Robert J Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820, 2011.

José M Bernardo and Adrian FM Smith. Bayesian Theory, volume 405. John Wiley & Sons, 2009.

Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer, 2015.

Jacob Steinhardt. Robust Learning: Information Theory and Algorithms. PhD thesis, Stanford University, 2018.

Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Roman Vershynin. On the role of sparsity in compressed sensing and random matrix theory. In 2009 3rd IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 189–192. IEEE, 2009.

Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.