A Robust Univariate Mean Estimator is All You Need

Adarsh Prasad‡   Sivaraman Balakrishnan†   Pradeep Ravikumar‡
Machine Learning Department‡   Department of Statistics and Data Science†
Carnegie Mellon University

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

Abstract

We study the problem of designing estimators when the data has heavy tails and is corrupted by outliers. In such an adversarial setup, we aim to design statistically optimal estimators for flexible non-parametric distribution classes such as distributions with bounded 2k-moments and symmetric distributions. Our primary workhorse is a conceptually simple reduction from multivariate estimation to univariate estimation. Using this reduction, we design estimators which are optimal in both heavy-tailed and contaminated settings. Our estimators achieve an optimal dimension-independent bias in the contaminated setting, while also simultaneously achieving high-probability error guarantees with optimal sample complexity. These results provide some of the first such estimators for a broad range of problems including Mean Estimation, Sparse Mean Estimation, Covariance Estimation, Sparse Covariance Estimation and Sparse PCA.

1 Introduction

Modern data sets that arise in various branches of science and engineering are characterized by their ever increasing scale and richness. This has been spurred in part by easier access to computer, internet and various sensor-based technologies that enable the automated acquisition of such heterogeneous datasets. On the flip side, these large and rich data sets are no longer carefully curated, are often collected in a decentralized, distributed fashion, and consequently are plagued with the complexities of heterogeneity, adversarial manipulations, and outliers. The analysis of these huge datasets is thus fraught with methodological challenges.

Understanding the fundamental challenges and trade-offs in handling such "dirty data" is precisely the premise of the field of robust statistics. Here, the aforementioned complexities are largely formalized under two different models of robustness: (1) the heavy-tailed model, in which the sampling distribution can have thick tails; for instance, only low-order moments of the distribution are assumed to be finite; and (2) the ε-contamination model, in which the sampling distribution is modeled as a well-behaved distribution contaminated by an ε fraction of arbitrary outliers. In each case, classical estimators of the distribution (based for instance on the maximum likelihood estimator) can behave considerably worse (potentially arbitrarily worse) than under standard settings where the data is better behaved, satisfying various regularity properties. In particular, these classical estimators can be extremely sensitive to the tails of the distribution or to the outliers, and the broad goal in robust statistics is to construct estimators that improve on these classical estimators by reducing their sensitivity to outliers.

Heavy-Tailed Model. Concretely, focusing on the fundamental problem of robust mean estimation, in the heavy-tailed model we observe n samples x_1, ..., x_n drawn independently from a distribution P, which is only assumed to have low-order moments finite (for instance, P only has finite variance). The goal of past work (Catoni, 2012; Minsker, 2015; Lugosi and Mendelson, 2017; Catoni and Giulini, 2017) has been to design an estimator θ̂_n of the true mean µ of P which has a small ℓ2-error with high probability. Formally, for a given δ > 0, we would like an estimator with minimal r_δ such that

$$P\big(\|\hat{\theta}_n - \mu\|_2 \le r_\delta\big) \ge 1 - \delta. \qquad (1)$$
As a benchmark for estimators in the heavy-tailed model, we observe that when P is a multivariate normal distribution (or more generally a sub-Gaussian distribution) with mean µ and covariance Σ, it can be shown (see Hanson and Wright, 1971) that the sample mean µ̂_n = (1/n) Σ_i x_i satisfies, with probability at least 1 − δ,¹

$$\|\hat{\mu}_n - \mu\|_2 \lesssim \sqrt{\frac{\mathrm{trace}(\Sigma)}{n}} + \sqrt{\frac{\|\Sigma\|_2 \log(1/\delta)}{n}}, \qquad (2)$$

where ‖Σ‖_2 denotes the operator norm of the covariance matrix Σ.

¹ Here and throughout our paper we use the notation ≲ to denote an inequality with universal constants dropped for conciseness.

This bound is referred to as a sub-Gaussian-style error bound. However, for heavy-tailed distributions, as for instance shown in Catoni (2012), the sample mean only satisfies the sub-optimal bound r_δ = Ω(√(d/(nδ))). Somewhat surprisingly, recent work of Lugosi and Mendelson (2017) showed that the sub-Gaussian error bound is achievable while only assuming that P has finite variance, but by a carefully designed estimator. In the univariate setting, the classical median-of-means estimator (Alon et al., 1996; Nemirovski and Yudin, 1983; Jerrum et al., 1986) and Catoni's M-estimator (Catoni, 2012) achieve this surprising result, but designing such estimators in the multivariate setting has proved challenging. Estimators that achieve truly sub-Gaussian bounds, but which are computationally intractable, were proposed recently by Lugosi and Mendelson (2017) and subsequently Catoni and Giulini (2017). Hopkins (2018) and Cherapanamjeri et al. (2019) developed a sum-of-squares based relaxation of Lugosi and Mendelson (2017)'s estimator, thereby giving a polynomial-time algorithm which achieves optimal rates.
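As an illustration of the univariate building block, the following is a minimal sketch of the median-of-means estimator mentioned above: partition the data into roughly log(1/δ) blocks and take the median of the block means. The exact block count and randomization are our choices here, not the paper's; constants vary across analyses.

```python
import numpy as np

def median_of_means(x, delta=0.05, rng=None):
    """Median-of-means: split the data into ~log(1/delta) blocks,
    average each block, and return the median of the block means."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    k = max(1, min(len(x), int(np.ceil(np.log(1.0 / delta)))))  # number of blocks
    blocks = np.array_split(rng.permutation(x), k)               # equal-ish random blocks
    return float(np.median([b.mean() for b in blocks]))
```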
Huber's ε-Contamination Model. In this setting, instead of observing samples directly from the true distribution P, we observe samples drawn from P_ε, which for an arbitrary distribution Q is defined as the mixture model

$$P_\epsilon = (1-\epsilon)P + \epsilon Q. \qquad (3)$$

The distribution Q allows one to model arbitrary outliers, which may correspond to gross corruptions, or subtle deviations from the true model. There has been a lot of classical work studying estimators in the ε-contamination model under the umbrella of robust statistics (see for instance Hampel et al. (1986) and references therein). However, most of the estimators that come with strong guarantees are computationally intractable (Tukey, 1975), while others are statistically sub-optimal heuristics (Hastings et al., 1947). Recently, there has been substantial progress (Diakonikolas et al., 2016; Lai et al., 2016; Kothari et al., 2018; Charikar et al., 2017; Diakonikolas et al., 2017; Balakrishnan et al., 2017; Prasad et al., 2018; Diakonikolas et al., 2018) in designing provably robust estimators which are computationally tractable while achieving near-optimal contamination dependence (i.e. dependence on the fraction of outliers ε) for computing means and covariances. In the Huber model, using information-theoretic lower bounds (Chen et al., 2016; Lai et al., 2016; Hopkins and Li, 2018), it can be shown that any estimator must suffer a non-zero bias (the asymptotic error as the number of samples goes to infinity). For example, for the class of distributions with bounded variance, Σ ⪯ σ²I_p, the bias of any estimator is lower bounded by Ω(σ√ε). Surprisingly, the optimal bias that can be achieved is often independent of the data dimension. In other words, in many interesting cases optimally robust estimators in Huber's model can tolerate a constant fraction ε of outliers, independent of the dimension.

While the aforementioned recent estimators for mean estimation under Huber contamination have polynomial computational complexity, their corresponding sample complexities are only known to be polynomial in the dimension p. For example, Kothari et al. (2018) and Hopkins and Li (2018) designed estimators which achieve optimal bias for distributions with certifiably bounded 2k-moments, but their statistical sample complexity scales as O(p^k). Steinhardt et al. (2017) studied mean estimation and presented an estimator which has a sample complexity of Ω(p^{1.5}).

Despite their apparent similarity, developments of estimators that are robust in each of these models have remained relatively independent. Focusing on mean estimation, we notice subtle differences: in the heavy-tailed model our target is the mean of the sampling distribution, whereas in the Huber model our target is the mean of the decontaminated sampling distribution P. Beyond this distinction, it is also important to note that, as highlighted above, the natural focus in heavy-tailed mean estimation is on achieving strong, high-probability error guarantees, while in Huber's model the focus has been on achieving dimension-independent bias.

Contributions. In this work, we aim to design estimators which are statistically optimally robust in both models simultaneously, i.e. they achieve a dimension-independent asymptotic bias in the ε-contamination model and achieve high-probability deviation bounds similar to (2). Our main workhorse is a conceptually simple way of reducing multivariate estimation to the univariate setting. Then, by carefully solving mean estimation in the univariate setting, we are able to design optimal estimators for multivariate mean and covariance estimation for non-parametric distribution classes, both in the low-dimensional (n ≥ p) and high-dimensional (n < p) settings. We achieve these rates for non-parametric distribution classes such as distributions with bounded 2k-moments and the class of symmetric distributions.

2 Background and Setup

In this section, we formally define the two classes of distributions which we work with in this paper: (1) bounded-2k-moment distributions and (2) symmetric distributions.

Bounded 2k-Moment Class. Let x be a random vector with mean µ and covariance Σ. We say that x has bounded 2k-moments if for all v ∈ S^{p−1},

$$E\big[(v^T(x-\mu))^{2k}\big] \le C_{2k}\, E\big[(v^T(x-\mu))^2\big]^k.$$

We let P_{2k}^{σ²} be the class of distributions with bounded 2k-moments with covariance matrix Σ ⪯ σ²I_p.
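For instance (our worked example, not from the paper), any Gaussian satisfies this condition with C_{2k} = (2k − 1)!!, since every projection v^T(x − µ) is itself a zero-mean Gaussian:

$$E\big[(v^T(x-\mu))^{2k}\big] = (2k-1)!!\,(v^T\Sigma v)^{k} = (2k-1)!!\,\big(E[(v^T(x-\mu))^2]\big)^{k}.$$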
Symmetric Distributions. There exist several notions of symmetry for multivariate distributions. We discuss these notions briefly, but refer the reader to Liu (1990) and references therein for a detailed treatment. Let P_sym be the class of H-symmetric distributions with unique center of H-symmetry. Moreover, suppose P_sym^{t₀,κ} ⊂ P_sym is the class of distributions such that for any P ∈ P_sym^{t₀,κ} the CDF of the univariate projection (u^T P), given by F_{u^T P}, increases at least linearly around u^Tθ. Formally, for all x_1 ∈ [med(u^T P), F_{u^T P}^{-1}(1/2 + t_0)] we have that

$$F_{u^T P}(x_1) - \frac{1}{2} \ge \frac{1}{\kappa_{u,P}}\big(x_1 - \mathrm{med}(u^T P)\big), \qquad (4)$$

and for all x_2 ∈ [F_{u^T P}^{-1}(1/2 − t_0), med(u^T P)] we have that 1/2 − F_{u^T P}(x_2) ≥ (1/κ_{u,P})(med(u^T P) − x_2), for κ_{u,P} ≤ κ. A higher κ corresponds to a slower rate of increase of the CDF around the median; note that κ can be thought of as a measure of variance or dispersion. In particular, in the case of the univariate Gaussian distribution, i.e. P = N(µ, σ²), κ(P) = Cσ. Similarly, for the univariate Cauchy distribution with scale γ, κ(P) ≈ Cγ. Note that any univariate distribution P ∈ P_sym with density function p(x) such that min_{|t|≤t₀} p(F_P^{-1}(1/2 + t)) > 1/κ also belongs to P_sym^{t₀,κ}.
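To see why this density condition implies condition (4), integrate the density over the interval in question; the points of [med(P), F_P^{-1}(1/2 + t_0)] are exactly the quantiles F_P^{-1}(1/2 + t) for t ∈ [0, t_0], so

$$F_P(x_1) - \frac{1}{2} = \int_{\mathrm{med}(P)}^{x_1} p(s)\,ds \;\ge\; \big(x_1 - \mathrm{med}(P)\big)\min_{|t|\le t_0} p\big(F_P^{-1}(\tfrac{1}{2}+t)\big) \;\ge\; \frac{1}{\kappa}\big(x_1 - \mathrm{med}(P)\big),$$

and symmetrically to the left of the median.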

The minimax bias for mean estimation in Huber's model is Θ(ε) (Chen et al., 2015). Our explicit dimension-dependent lower bound on the bias of M-estimators shows their sub-optimality.

Subset Search. Having ruled out convex estimation to a certain extent, we next turn our attention to non-convex methods. Perhaps the simplest non-convex method is simple search. Intuitively, the squared loss measures the fit between a candidate θ and the samples Z, and if all samples don't come from the same distribution (i.e. there are outliers), then the corresponding fit should be bad. To capture this intuition algorithmically, one can (1) consider all subsets of size ⌊(1 − ε)n⌋, (2) minimize the squared loss over these subsets, and then (3) return the estimator corresponding to the subset with the least squared loss, or best fit. To be precise, given n samples from P_ε, define

$$S^* \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{S \,:\, |S| = (1-\epsilon)n}\; \min_{\theta}\; \frac{1}{(1-\epsilon)n} \sum_{x_i \in S} \|x_i - \theta\|_2^2, \qquad \hat{\theta}_{SRM} \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{\theta}\; \frac{1}{(1-\epsilon)n} \sum_{x_i \in S^*} \|x_i - \theta\|_2^2. \qquad (5)$$
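For concreteness, the following is a brute-force rendering of (5) (ours, purely illustrative: its run time is exponential in n, so it is only usable on tiny datasets). For a fixed subset, the inner minimization over θ is solved by the subset mean.

```python
import itertools
import numpy as np

def subset_search_mean(X, eps):
    """Exhaustive subset search (5): over all subsets of size
    floor((1 - eps) * n), fit theta by the subset mean (the least-squares
    minimizer) and return the fit with the smallest squared loss."""
    n = X.shape[0]
    m = int(np.floor((1 - eps) * n))
    best_loss, best_theta = np.inf, None
    for S in itertools.combinations(range(n), m):
        sub = X[list(S)]
        theta = sub.mean(axis=0)               # argmin of the squared loss
        loss = np.sum((sub - theta) ** 2) / m  # fit of this subset
        if loss < best_loss:
            best_loss, best_theta = loss, theta
    return best_theta
```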

Our next result studies the asymptotic performance of this estimator.

Lemma 2. Let P = N(0, I_p). Then, as n → ∞, we have that

$$\sup_{Q} \|\hat{\theta}_{SRM} - E_{x\sim P}[x]\|_2 = \frac{\epsilon\sqrt{p}}{(1-\epsilon)(1-2\epsilon)}. \qquad (6)$$

The above result shows that, while subset search has a finite dimension-independent breakdown point (0.5), the bias of this estimator necessarily scales with the dimension p.

4 Optimal Univariate Estimation

In the previous section, we studied some natural candidate estimators and showed that they don't achieve the optimal asymptotic bias in the ε-contamination model for multivariate mean estimation. In this section, we take a step back, and study univariate estimation.

4.1 Bounded 2k-moments

We study the interval estimator which was initially proposed by Lai et al. (2016). The estimator, presented in Algorithm 1, proceeds by using half of the samples to identify the shortest interval containing at least a (1 − ε)n fraction of the points; the remaining half of the points is then used to return an estimate of the mean.

Algorithm 1 Robust Univariate Mean Estimation
function Interval1D({z_i}_{i=1}^{2n}, corruption level ε, confidence level δ)
    Split the data into two subsets: Z_1 = {z_i}_{i=1}^{n} and Z_2 = {z_i}_{i=n+1}^{2n}.
    Let α = max(ε, log(1/δ)/n).
    Using Z_1, let Î = [a, b] be the shortest interval containing n(1 − 2α − √(2α log(4/δ)/n) − log(4/δ)/n) points.
    Use Z_2 to identify points lying in [a, b].
    return ( Σ_{i=n+1}^{2n} z_i 1{z_i ∈ Î} ) / ( Σ_{i=n+1}^{2n} 1{z_i ∈ Î} )
end function

We assume that the contamination level ε and confidence level δ are such that

$$2\epsilon + \sqrt{\frac{\log(4/\delta)}{n}} + \frac{\log(4/\delta)}{n} < \frac{1}{2}.$$

Then, we have the following Lemma.

Lemma 3. Suppose P is any 2k-moment bounded distribution over ℝ with mean µ and variance bounded by σ². Given n samples {x_i}_{i=1}^{n} from the mixture distribution (3), Algorithm 1 returns an estimate θ̂_δ such that, with probability at least 1 − δ,

$$|\hat{\theta}_\delta - \mu| \lesssim \sigma\,\max\Big(2\epsilon, \frac{\log(1/\delta)}{n}\Big)^{1-\frac{1}{2k}} + \sigma\Big(\frac{\log n}{n}\Big)^{1-\frac{1}{2k}} + \sigma\sqrt{\frac{\log(1/\delta)}{n}}.$$

Observe that Algorithm 1 has an asymptotic bias of O(σε^{1−1/2k}) in the ε-contamination setting, which is known to be information-theoretically optimal (Hopkins and Li, 2018; Lai et al., 2016). Moreover, observe that for ε = 0, if P has at least a bounded 4th moment, i.e. k ≥ 2, the (log(n)/n)^{1−1/2k} term can be ignored for large enough n. Hence, for k ≥ 2 and large enough n, Algorithm 1 achieves the deviation rate of σ√(log(1/δ)/n).
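A minimal Python sketch of Algorithm 1, assuming ε and δ satisfy the condition preceding Lemma 3 (otherwise the prescribed fraction can degenerate):

```python
import numpy as np

def interval_1d(z, eps, delta):
    """Sketch of Algorithm 1 (Interval1D): the first half of the data
    locates the shortest interval containing the prescribed number of
    points; the second half is averaged inside that interval."""
    z = np.asarray(z, dtype=float)
    n = len(z) // 2
    z1, z2 = np.sort(z[:n]), z[n:]
    alpha = max(eps, np.log(1 / delta) / n)
    frac = (1 - 2 * alpha - np.sqrt(2 * alpha * np.log(4 / delta) / n)
            - np.log(4 / delta) / n)
    m = max(1, int(np.floor(frac * n)))    # points the interval must cover
    widths = z1[m - 1:] - z1[: n - m + 1]  # widths of all m-point windows
    j = int(np.argmin(widths))
    a, b = z1[j], z1[j + m - 1]            # shortest covering interval
    inside = z2[(z2 >= a) & (z2 <= b)]     # second half, filtered
    return float(inside.mean())
```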

4.2 Symmetric Distributions

In the univariate setting, our estimator, presented in Algorithm 2, simply returns the sample median of the observed samples. While this idea is simple and crucially exploits the fact that the mean and the median coincide for a symmetric distribution, it has profound implications on the effect of contamination in the Huber contamination model. Next, we present the theoretical bound achieved by this estimator, which was shown in Altschuler et al. (2018).

Algorithm 2 Sample Median
function SampleMedian1D({z_i}_{i=1}^{2n+1})
    Let z_{[k]} be the k-th order statistic of {z_i}.
    return z_{[n+1]}
end function

We further assume that ε and δ are such that

$$\frac{\epsilon}{2(1-\epsilon)} + \frac{1}{1-\epsilon}\sqrt{\frac{\log(2/\delta)}{n}} \le t_0.$$

Then we have the following result.

Lemma 4 (Altschuler et al., 2018). Let P be any univariate distribution in P_sym^{t₀,κ}. Given n samples from the mixture distribution (3), with probability at least 1 − δ, Algorithm 2 returns an estimate θ̂ such that

$$|\hat{\theta} - E_{x\sim P}[x]| \le C_1 \kappa\epsilon + C_2 \kappa\sqrt{\frac{\log(1/\delta)}{n}}.$$
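A quick numerical illustration of Lemma 4 (the parameters are ours and arbitrary): the median of a contaminated Cauchy sample stays near the center of symmetry even though the distribution has no finite mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 10_000, 0.05
mask = rng.random(n) < eps                       # epsilon fraction of outliers, as in (3)
z = np.where(mask, 1e6, rng.standard_cauchy(n))  # P symmetric about 0, Q a far point mass
print(np.median(z))  # stays within O(kappa * eps) of the center 0
print(np.mean(z))    # destroyed: heavy tails plus outliers
```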

Observe that Algorithm 2 has an asymptotic bias of O(κε), which is also information-theoretically optimal. To see this, observe that N(·, κ²) lies in P_sym^{t₀,κ} for some constant t_0, and use the fact that TV(N(µ_1, κ²), N(µ_2, κ²)) = O(|µ_1 − µ_2|/κ) (Theorem 1.3, Devroye et al., 2018).

5 From 1D to p-D: A Meta-Estimator

In this section, we propose a general meta-estimator to extend any univariate estimator to the multivariate setting. For any univariate estimator f(·), suppose that when given n samples from the mixture model, it returns an estimate f(X_n) such that with probability 1 − δ,

$$|f(X_n, \epsilon, \delta) - \mu(P)| \le \omega_f(\epsilon, P, \delta).$$

Note that ω_f(ε, P, δ) is the error suffered by the univariate estimator at a contamination level ε and confidence level δ, when the true univariate distribution is P.

5.1 Mean Estimation

The proposed meta-estimator proceeds by robustly estimating the mean along almost every direction u, and returns an estimate θ̂ whose projection along u (that is, u^Tθ̂) is close to the univariate robust mean along that direction. In particular, let N^{1/2}(S^{p−1}) be the 1/2-cover of the unit sphere S^{p−1}, i.e. for all u ∈ S^{p−1} there exists a y ∈ N^{1/2}(S^{p−1}) such that u = y + z for some ‖z‖_2 ≤ 1/2. Then, for any point θ ∈ ℝ^p and any univariate estimator f(·), consider the following loss:

$$D_f(\theta, \{x_i\}_{i=1}^n) = \sup_{u \in N^{1/2}(S^{p-1})} \Big|u^T\theta - f\big(\{u^T x_i\}_{i=1}^n, \epsilon, \tfrac{\delta}{5^p}\big)\Big|.$$

Then, we use it to construct the following multivariate meta-estimator θ̂_f, which takes in n samples {x_i}_{i=1}^n and a univariate estimator f(·):

$$\hat{\theta}_f(\{x_i\}_{i=1}^n) = \operatorname*{argmin}_{\theta}\; D_f(\theta, \{x_i\}_{i=1}^n).$$

Such directional-control based estimators have been previously studied in the context of heavy-tailed mean estimation by Joly et al. (2017) and Catoni and Giulini (2017). Joly et al. (2017) used the median-of-means estimator, while Catoni and Giulini (2017) used Catoni's M-estimator (Catoni, 2012) as their univariate estimator. Then, we have the following result.

Lemma 5. Suppose P is a multivariate distribution with mean µ. Given n samples from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\theta}_f(X_n) - \mu\|_2 \lesssim \sup_{u \in N^{1/2}(S^{p-1})} \omega_f\big(\epsilon, u^T P, \tfrac{\delta}{5^p}\big),$$

where u^T P is the univariate distribution of P along u.

Sparse Mean Estimation. In this setting, we further assume that the true mean vector of the distribution P has only a few non-zero coordinates, i.e. it is sparse. Such sparsity patterns are known to be present in high-dimensional data (see Rish et al. (2014) and references therein). Then, the goal is to design estimators which can exploit this sparsity structure, while remaining robust under the ε-contamination model. Formally, for a vector x ∈ ℝ^p, let supp(x) = {i ∈ [p] s.t. x(i) ≠ 0}. Then, x is s-sparse if |supp(x)| ≤ s. We further assume that s ≤ p/2. Let Θ_s be the set of s-sparse vectors in ℝ^p, and let N_{2s}^{1/2}(S^{p−1}) be the 1/2-cover of the set of unit vectors which are 2s-sparse. Then, for any univariate estimator f(·), let

$$D_{f,s}(\theta, \{x_i\}_{i=1}^n) = \sup_{u \in N_{2s}^{1/2}(S^{p-1})} \Big|u^T\theta - f\big(u^T X_n, \epsilon, \tfrac{\delta}{(6ep/s)^s}\big)\Big|.$$

Then, we can define the following meta-estimator,

$$\hat{\theta}_{f,s}(X_n) = \operatorname*{argmin}_{\theta \in \Theta_s}\; D_{f,s}(\theta, X_n),$$

which has the following error guarantee.

Lemma 6. Suppose P is a multivariate distribution with mean µ such that µ is s-sparse. Given n samples from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\theta}_{f,s}(X_n) - \mu\|_2 \lesssim \sup_{u \in N_{2s}^{1/2}(S^{p-1})} \omega_f\big(\epsilon, u^T P, \tfrac{\delta}{(6ep/s)^s}\big),$$

where u^T P is the univariate distribution of P along u.
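The following is a sketch of the 1D-to-pD reduction (ours, under two stated simplifications): random unit directions stand in for the 1/2-cover N^{1/2}(S^{p−1}), whose exact size 5^p is exponential in p, so this version trades the theoretical guarantee for tractability. For a fixed finite direction set, the argmin over θ is exactly a small linear program in (θ, t).

```python
import numpy as np
from scipy.optimize import linprog

def meta_estimator(X, univariate_est, n_dirs=500, seed=0):
    """Estimate the mean along each direction u with a robust univariate
    estimator f, then return the theta minimizing max_u |u^T theta - m_u|,
    solved as a linear program: minimize t s.t. |U theta - m| <= t."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dirs, p))
    U /= np.linalg.norm(U, axis=1, keepdims=True)     # unit directions
    m = np.array([univariate_est(X @ u) for u in U])  # robust 1D estimates
    c = np.r_[np.zeros(p), 1.0]                       # objective: minimize t
    A = np.block([[U, -np.ones((n_dirs, 1))],
                  [-U, -np.ones((n_dirs, 1))]])
    b = np.r_[m, -m]
    res = linprog(c, A_ub=A, b_ub=b,
                  bounds=[(None, None)] * p + [(0, None)])
    return res.x[:p]
```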

5.2 Covariance Estimation

In this section, we study recovering the true covariance matrix when given samples from a mixture distribution. We first center our observations by defining pseudo-samples z_i = (x_i − x_{i+n/2})/√2. We can think of z_i as being sampled from the Huber contamination P̃_{2ε}, where P̃ = (1/√2)(P − P) is the distribution of the scaled difference of two independent draws from P. Let Z_n = {z_i}_{i=1}^{n} be the set of these pseudo-samples, and let u^T Z_n^{⊗2} = {(u^T z_i)²}_{i=1}^{n}. Let F = {Σ = Σ^T ∈ ℝ^{p×p} : Σ ⪰ 0} be the class of positive semi-definite symmetric matrices. Then, for any matrix Θ ∈ F, let

$$D_f^{\otimes 2}(\Theta, Z_n) = \sup_{u \in N^{1/4}(S^{p-1})} \Big|u^T\Theta u - f\big(u^T Z_n^{\otimes 2}, \epsilon, \tfrac{\delta}{9^p}\big)\Big|.$$

Then, consider the meta-estimator Θ̂_f(Z_n) given as

$$\hat{\Theta}_f(Z_n) = \operatorname*{argmin}_{\Theta \in F}\; D_f^{\otimes 2}(\Theta, Z_n).$$

Lemma 7. Suppose P is a multivariate distribution with covariance Σ. Given n samples from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\Theta}_f(Z_n) - \Sigma\|_2 \lesssim \sup_{u \in N^{1/4}(S^{p-1})} \omega_f\big(2\epsilon, u^T \tilde{P}^{\otimes 2}, \tfrac{\delta}{9^p}\big),$$

where u^T P̃^{⊗2} is the univariate distribution of (u^T z_i)² for z_i ∼ P̃.

Sparse Covariance Estimation. Next, we consider sparse covariance matrices. In particular, we assume that there exists a subset S of |S| = s covariates that are correlated with each other, and the remaining covariates [p]\S are independent from this subset and from each other. Such sparsity patterns arise naturally in various real-world data (Bien and Tibshirani, 2011). More concretely, for a subset of coordinates S, define

$$G(S) \stackrel{\mathrm{def}}{=} \{G = (g_{ij}) \in \mathbb{R}^{p\times p} : g_{ij} = 0 \text{ if } i \notin S \text{ or } j \notin S\},$$

and let G(s) = ∪_{S⊂[p]:|S|≤s} G(S). Consider the class of matrices F_s such that

$$F_s = \{\Sigma = \Sigma^T,\; \Sigma \succeq 0,\; \Sigma - \mathrm{diag}(\Sigma) \in G(s)\}.$$

Then, for any matrix Θ and univariate estimator f, let

$$D_{f,s}(\Theta, Z_n) = \sup_{u \in N_{2s}^{1/4}(S^{p-1})} \Big|u^T\Theta u - f\big(u^T Z_n^{\otimes 2}, \epsilon, \tfrac{\delta}{(9ep/s)^s}\big)\Big|.$$

Then, we can define the following estimator:

$$\hat{\Theta}_{f,s}(X_n) = \operatorname*{arginf}_{\Theta \in F_s}\; D_{f,s}(\Theta, Z_n).$$

Lemma 8. Suppose P is a multivariate distribution with covariance Σ such that Σ ∈ F_s. Given n samples from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\Theta}_{f,s}(Z_n) - \Sigma\|_2 \lesssim \sup_{u \in N_{2s}^{1/4}(S^{p-1})} \omega_f\big(2\epsilon, u^T \tilde{P}^{\otimes 2}, \tfrac{\delta}{(9ep/s)^s}\big),$$

where u^T P̃^{⊗2} is the univariate distribution of (u^T z_i)² for z_i ∼ P̃.
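In code, the pseudo-sample construction underlying both estimators in this subsection is direct (a sketch, assuming an even number of samples):

```python
import numpy as np

def pseudo_samples(X):
    """Pair-and-difference centering: z_i = (x_i - x_{i + n/2}) / sqrt(2).
    The z_i are symmetric about 0 with the same covariance as the clean
    distribution P; the univariate estimator is then fed the squared
    projections (u^T z_i)^2, whose mean is u^T Sigma u."""
    n = X.shape[0] // 2
    return (X[:n] - X[n:2 * n]) / np.sqrt(2)
```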

6 Consequences for P_{2k}^{σ²}

Next, we study the performance of our meta-estimator for multivariate estimation for the class of distributions with bounded 2k-moments. In particular, we use the interval estimator (IM) presented in Algorithm 1 as our univariate estimator to instantiate our meta-estimator.

Multivariate Mean Estimation. In the multivariate setting, we further assume that the contamination level ε and confidence δ are such that

$$2\epsilon + \Big(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\Big) + \frac{p}{n} + \frac{\log(4/\delta)}{n} < c,$$

for some small constant c > 0. Then, we have the following result.

Corollary 1. Suppose P has bounded 2k-moments with mean µ and covariance Σ. Given n samples {x_i}_{i=1}^{n} from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\theta}_{IM}(X_n) - \mu\|_2 \lesssim \|\Sigma\|_2^{1/2}\,\epsilon^{1-\frac{1}{2k}} + \|\Sigma\|_2^{1/2}\sqrt{\frac{\log(1/\delta)}{n}} + \|\Sigma\|_2^{1/2}\Big(\sqrt{\frac{p}{n}} + \Big(\frac{\log n}{n}\Big)^{1-\frac{1}{2k}}\Big).$$

Observe that the proposed estimator achieves a dimension-independent asymptotic bias of O(σε^{1−1/2k}) in the ε-contamination model for multivariate mean estimation, with a sample complexity of O(p).
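Putting the pieces together, here is an end-to-end illustration of Corollary 1's qualitative claim, combining the two sketches above (the numbers are ours and arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, eps = 2000, 10, 0.05
X = rng.standard_t(df=10, size=(n, p))  # heavy-ish tails, true mean 0
X[rng.random(n) < eps] = 100.0          # gross corruptions, as in (3)
theta_hat = meta_estimator(X, lambda z: interval_1d(z, eps=0.05, delta=0.01))
print(np.linalg.norm(theta_hat))        # small, per Corollary 1
print(np.linalg.norm(X.mean(axis=0)))   # the sample mean is dragged away
```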

Sparse Mean Estimation. In this setting, we assume that the contamination level ε and confidence δ are such that

$$2\epsilon + \Big(\sqrt{\frac{s\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\Big) + \frac{s\log p}{n} + \frac{\log(4/\delta)}{n} < c,$$

for some small constant c > 0. Then, we have the following result.

Corollary 2. Suppose P has bounded 2k-moments with mean µ and covariance Σ, where µ is s-sparse. Then, given n samples {x_i}_{i=1}^{n} from the mixture distribution (3), we get that with probability at least 1 − δ,

$$\|\hat{\theta}_{IM,s}(X_n) - \mu\|_2 \lesssim \|\Sigma\|_{2,2s}^{1/2}\,\epsilon^{1-\frac{1}{2k}} + \|\Sigma\|_{2,2s}^{1/2}\sqrt{\frac{\log(1/\delta)}{n}} + \|\Sigma\|_{2,2s}^{1/2}\Big(\sqrt{\frac{s\log p}{n}} + \Big(\frac{\log n}{n}\Big)^{1-\frac{1}{2k}}\Big),$$

where ‖Σ‖_{2,2s} = sup_{u∈S^{p−1}, ‖u‖₀≤2s} u^TΣu.

The above result shows that the proposed estimator exploits the underlying sparsity structure and achieves the near-optimal sample complexity of O(s log p), while simultaneously achieving the optimal asymptotic bias of O(‖Σ‖_{2,2s}^{1/2} ε^{1−1/2k}).

Covariance Estimation. We begin by first calculating ω_IM(2ε, u^T P̃^{⊗2}, δ). To do this, recall that for fixed u, for the clean samples z_i, (u^T z_i)² has mean u^TΣ(P)u and variance C₄(u^TΣ(P)u)². Note that the scalar random variables (u^T z_i)² have bounded k-moments whenever x_i has bounded 2k-moments. Hence, from Lemma 3, we have that

$$\omega_{IM}\big(2\epsilon, u^T \tilde{P}^{\otimes 2}, \delta\big) \lesssim \big(u^T\Sigma(P)u\big)\,\epsilon^{1-\frac{1}{k}} + \big(u^T\Sigma(P)u\big)\sqrt{\frac{\log 1/\delta}{n}}.$$

We assume that the contamination level ε and confidence δ are such that

$$4\epsilon + 2\Big(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\Big) + \frac{p}{n} + \frac{\log(4/\delta)}{n} < c,$$

for some small constant c > 0. Then, we have the following result.

Corollary 3. Suppose P has bounded 2k-moments. Then, given X_n drawn from the mixture model, we have that with probability at least 1 − δ,

$$\|\hat{\Theta}_{IM} - \Sigma(P)\|_2 \lesssim \|\Sigma(P)\|_2\,\epsilon^{1-\frac{1}{k}} + \|\Sigma(P)\|_2\sqrt{\frac{p}{n}} + \|\Sigma(P)\|_2\sqrt{\frac{\log 1/\delta}{n}}.$$

Observe that the proposed estimator achieves a dimension-independent asymptotic bias of O(σ²ε^{1−1/k}) in the ε-contamination model for multivariate covariance estimation, with a sample complexity of O(p).

Sparse Covariance Estimation. In this setting, we assume that the contamination level ε and confidence δ are such that

$$4\epsilon + 2\Big(\sqrt{\frac{s\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\Big) + \frac{s\log p}{n} + \frac{\log(4/\delta)}{n} < c,$$

for some small constant c > 0. Then, we have the following result.

Corollary 4. Suppose P has bounded 2k-moments and Σ(P) ∈ F_s. Then, given X_n drawn from the mixture model, we have that with probability at least 1 − δ,

$$\|\hat{\Theta}_{IM,s} - \Sigma(P)\|_2 \lesssim \|\Sigma(P)\|_2\,\epsilon^{1-\frac{1}{k}} + \|\Sigma(P)\|_2\sqrt{\frac{s\log p}{n}} + \|\Sigma(P)\|_2\sqrt{\frac{\log 1/\delta}{n}}.$$

As before, even in this case, the proposed estimator achieves a dimension-independent bias of O(σ²ε^{1−1/k}), with a sample complexity of O(s log p).

Sparse PCA in the Spiked Covariance Model. As an application of the sparse covariance estimation, we consider the following spiked covariance model, where the true distribution P ∈ P_{2k} is such that

$$\Sigma(P) = V\Lambda V^T + I_p, \qquad (7)$$

where V ∈ ℝ^{p×r} is an orthonormal matrix, and Λ ∈ ℝ^{r×r} is a diagonal matrix with entries Λ_1 ≥ Λ_2 ≥ ... ≥ Λ_r > 0. In this setting, suppose we observe samples from a mixture distribution P_ε; then the goal is to estimate the subspace projection matrix VV^T, i.e. construct V̂ such that ‖V̂V̂^T − VV^T‖_F is small. Note that when V has only s non-zero rows, then the corresponding covariance matrix Σ is s-sparse (Σ ∈ F_s). We follow Chen et al. (2015) and use our sparse covariance estimator Θ̂_{IM,s}(X_n) to construct V̂ ∈ ℝ^{p×r} by setting its j-th column to be the j-th eigenvector of Θ̂_{IM,s}(X_n). Then, under the assumption that (ε, n) are such that

$$(1+\Lambda_1)\,\epsilon^{1-\frac{1}{k}} + (1+\Lambda_1)\sqrt{\frac{s\log p}{n}} + (1+\Lambda_1)\sqrt{\frac{\log 1/\delta}{n}} \lesssim \frac{\Lambda_r}{2},$$

we have the following result.

Corollary 5. Suppose P has bounded 2k-moments, Σ(P) is of the form (7), and we are given n samples from the mixture distribution. Then, we have that with probability at least 1 − δ,

$$\|\hat{V}\hat{V}^T - VV^T\|_F^2 \lesssim \Big(\frac{1+\Lambda_1}{\Lambda_r}\Big)^2 \epsilon^{2-\frac{2}{k}} + \Big(\frac{1+\Lambda_1}{\Lambda_r}\Big)^2\Big(\frac{s\log p}{n} + \frac{\log 1/\delta}{n}\Big).$$
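The V̂ construction above reduces to an eigendecomposition of the robust covariance estimate; a minimal sketch (here `Theta_hat` stands for the output of the sparse covariance estimator):

```python
import numpy as np

def top_r_eigvecs(Theta_hat, r):
    """Sparse PCA step: V-hat's j-th column is the j-th eigenvector of
    the robust covariance estimate, eigenvalues in decreasing order."""
    w, V = np.linalg.eigh(Theta_hat)          # ascending eigenvalues
    return V[:, np.argsort(w)[::-1][:r]]      # top-r, largest first

# Projection-matrix estimate: Vr = top_r_eigvecs(Theta_hat, r); Vr @ Vr.T
```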
In + kΣ(P )k 2 n contrast, all our results are valid for the much broader class of distributions with bounded 2k-moment. As As before, even in this case, the proposed estimator discussed in the introduction, while such results have achieves a dimension independent bias of O(σ21−1/k), been recently obtained for mean estimation, our simple with a sample complexity of O(s log p). meta-estimator achieves these high-probability error A Robust Univariate Mean Estimator is All You Need guarantees for a much broader range of problems. To for symmetric distributions without moments. Note the best of our knowledge, these are some of the first that similar results can be derived for other higher- estimators which get such high-probability deviation order moments. bounds for sparse-mean, covariance, sparse-covariance Discussion. Observe the difference in achievable 2 and sparse-PCA. t0,κ σ rates for Psym and P2k . In particular, for symmet- t0,κ 7 Consequences for Psym ric distributions including those which have no finite variance, the maximum bias introduced by Huber Con- Next, we study the performance of our meta-estimator tamination Model is at most O(κ). In contrast for dis- for multivariate estimation for the class of symmetric tributions with bounded 2k-moments, the lower bound distributions. In particular, we use the sample median for mean estimation is Ω(σ1−1/2k). Note that the presented in Algorithm2 as our univariate estimator depth based estimators of Chen et al.(2015) also im- to instantiate our meta-estimator. In the multivari- plicitly assume that the underlying distribution is sym- ate setting, we further assume that the contamination metric, and hence obtain similar rates for elliptical dis- level , and confidence level δ are such that, tributions. r  1 p log(2/δ) + + ≤ t . 8 Conclusion and Future Directions. 2(1 − ) (1 − ) n n 0 In this work we provided a conceptually simple way Then, we have the following result. of reducing multivariate estimation to univariate es- t0,κ Corollary 6. Suppose P ∈ Psym is a multivariate dis- timation. In particular, we showed how to use any n tribution with mean µ. Given n samples {xi}i=1 from robust univariate estimator to design statistically op- the mixture distribution (3), we get that with probabil- timal robust estimators for multivariate estimation. ity at least 1 − δ, Through this reduction, we derived optimal estima- tors for non-parametric distribution classes such as r r log(1/δ) p distributions with bounded 2k-moments and symmet- kθbMed(Xn) − µk2 κ + κ + κ . n n rical distributions. Our estimators achieved optimal asymptotic bias in the -contamination model, and Observe that the proposed estimator achieves a di- also high-probability deviation bounds in the uncon- mension independent asymptotic bias of O(κ) in the taminated setting. There are several avenues for future -contamination model for multivariate mean estima- work, some of which we discuss below. tion, with a sample complexity of O(p). Computationally Efficient Estimators. As noted Sparse Mean Estimation. In this setting, we as- in the introduction, there has been a flurry of work sume that the contamination level , and confidence in the theoretical computer science community on de- are such that, signing polynomial time estimators for robust mean estimation. Designing sample-efficient estimators for r  1 s log p log(2/δ) sparse-mean estimation for the bounded 2k-moment + + t . 2(1 − ) (1 − ) n n . 0 class is an open problem. 
8 Conclusion and Future Directions

In this work we provided a conceptually simple way of reducing multivariate estimation to univariate estimation. In particular, we showed how to use any robust univariate estimator to design statistically optimal robust estimators for multivariate estimation. Through this reduction, we derived optimal estimators for non-parametric distribution classes such as distributions with bounded 2k-moments and symmetric distributions. Our estimators achieve an optimal asymptotic bias in the ε-contamination model, and also high-probability deviation bounds in the uncontaminated setting. There are several avenues for future work, some of which we discuss below.

Computationally Efficient Estimators. As noted in the introduction, there has been a flurry of work in the theoretical computer science community on designing polynomial-time estimators for robust mean estimation. Designing sample-efficient estimators for sparse mean estimation for the bounded 2k-moment class is an open problem. Similarly, for covariance estimation, most existing work has focused on the Frobenius norm or Mahalanobis distance, and designing estimators for covariance estimation in operator norm for general bounded 2k-moment distributions is an open problem. Another important challenge is to design computationally efficient estimators for the mean of a symmetric distribution.

Extension to General Risk Minimization. Another future line of work is to extend our results to general risk minimization problems such as Linear Regression and Generalized Linear Models.

9 Acknowledgements

AP and PR acknowledge the support of NSF via IIS-1909816.

References

Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pages 1148–1185. Institut Henri Poincaré, 2012.

Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.

Gábor Lugosi and Shahar Mendelson. Sub-Gaussian estimators of the mean of a random vector. arXiv preprint arXiv:1702.00482, 2017.

Olivier Catoni and Ilaria Giulini. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747, 2017.

David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.

Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC '96, pages 20–29, New York, NY, USA, 1996. ACM.

A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication. Wiley, 1983.

Mark R. Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.

Samuel B Hopkins. Sub-Gaussian mean estimation in polynomial time. arXiv preprint arXiv:1809.07425, 2018.

Yeshwanth Cherapanamjeri, Nicolas Flammarion, and Peter L Bartlett. Fast mean estimation with sub-Gaussian rates. arXiv preprint arXiv:1902.01998, 2019.

Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust Statistics. Wiley Online Library, 1986.

John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.

Cecil Hastings, Frederick Mosteller, John W Tukey, and Charles P Winsor. Low moments for small samples: a comparative study of order statistics. The Annals of Mathematical Statistics, 18(3):413–426, 1947.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.

Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.

Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035–1046. ACM, 2018.

Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60. ACM, 2017.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008, 2017.

Sivaraman Balakrishnan, Simon S Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Conference on Learning Theory, pages 169–212, 2017.

Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.

Mengjie Chen, Chao Gao, and Zhao Ren. A general decision theory for Huber's epsilon-contamination model. Electronic Journal of Statistics, 10(2):3752–3774, 2016.

Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1021–1034. ACM, 2018.

Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.

Regina Y Liu. On a notion of data depth based on random simplices. The Annals of Statistics, 18(1):405–414, 1990.

Peter J Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, 36(6):1753–1758, 1965.

Ricardo Antonio Maronna. Robust M-estimators of multivariate location and scatter. The Annals of Statistics, pages 51–67, 1976.

David L Donoho and Miriam Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, 20(4):1803–1827, 1992.

Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huber's contamination model. ArXiv e-prints, to appear in the Annals of Statistics, 2015.

Jason Altschuler, Victor-Emmanuel Brunel, and Alan Malek. Best arm identification for contaminated bandits. arXiv preprint arXiv:1802.09514, 2018.

Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional Gaussians. arXiv preprint arXiv:1810.08693, 2018.

Emilien Joly, Gábor Lugosi, and Roberto Imbuzeiro Oliveira. On the estimation of the mean of a random vector. Electronic Journal of Statistics, 11(1):440–451, 2017.

Irina Rish, Guillermo A Cecchi, Aurelie Lozano, and Alexandru Niculescu-Mizil. Practical Applications of Sparse Modeling. MIT Press, 2014.

Jacob Bien and Robert J Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820, 2011.

José M Bernardo and Adrian FM Smith. Bayesian Theory, volume 405. John Wiley & Sons, 2009.

Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer, 2015.

Jacob Steinhardt. Robust Learning: Information Theory and Algorithms. PhD thesis, Stanford University, 2018.

Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Roman Vershynin. On the role of sparsity in compressed sensing and random matrix theory. In 2009 3rd IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 189–192. IEEE, 2009.

Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.