Learning Entangled Single-Sample Distributions via Iterative Trimming

Hui Yuan
Department of Statistics and Finance
University of Science and Technology of China

Yingyu Liang
Department of Computer Sciences
University of Wisconsin-Madison

Abstract

In the setting of entangled single-sample distributions, the goal is to estimate some common parameter shared by a family of distributions, given one single sample from each distribution. We study mean estimation and linear regression under general conditions, and analyze a simple and computationally efficient method based on iteratively trimming samples and re-estimating the parameter on the trimmed sample set. We show that the method in logarithmic iterations outputs an estimate whose error only depends on the noise level of the $\lceil\alpha n\rceil$-th noisiest data point, where $\alpha$ is a constant and $n$ is the sample size. This means it can tolerate a constant fraction of high-noise points. These are the first such results under our general conditions with computationally efficient methods. It also justifies the wide application and empirical success of iterative trimming in practice. Our theoretical results are complemented by experiments on synthetic data.

1 INTRODUCTION

This work considers the novel parameter estimation setting called entangled single-sample distributions. Different from the typical i.i.d. setting, here we have $n$ data points that are independent, but each is from a different distribution. These distributions are entangled in the sense that they share some common parameter, and our goal is to estimate the common parameter. For example, in the problem of mean estimation for entangled single-sample distributions, we have $n$ data points from $n$ different distributions with a common mean but different variances (the mean and all the variances are unknown), and our goal is to estimate the mean.

This setting is motivated for both theoretical and practical reasons. From the theoretical perspective, it goes beyond the typical i.i.d. setting and raises many interesting open questions, even on basic topics like mean estimation for Gaussians. It can also be viewed as a generalization of traditional mixture modeling, since the number of distinct mixture components can grow with the number of samples. From the practical perspective, many modern applications have various forms of heterogeneity, for which the i.i.d. assumption can lead to bad modeling of their data. The entangled single-sample setting provides potentially better modeling. This is particularly the case for applications where we have no control over the noise levels of the samples. For example, the images taken by self-driving cars can have varying degrees of noise due to changing weather or lighting conditions. Similarly, signals collected from sensors on the Internet of Things can come with interferences from a changing environment.

Though theoretically interesting and practically important, few studies exist in this setting. Chierichetti et al. (2014) considered mean estimation for entangled Gaussians and showed the existence of a gap between the estimation error rates of the best possible estimator in this setting and the maximum likelihood estimator when the variances are known. Pensia et al. (2019) considered mean estimation for symmetric, unimodal distributions including the symmetric multivariate case (i.e., the distributions are radially symmetric) with sharpened bounds, and provided extensive discussion on the performance of their estimators in different configurations of the variances. These existing results focus on specific families of distributions or on the case where most samples are "high-noise" points.

On the contrary, we focus on the case with a constant fraction of "high-noise" points, which is more interesting in practice. We study multivariate mean estimation and linear regression under more general conditions and analyze a simple and efficient estimator based on iterative trimming. The iterative trimming idea is simple: the algorithm keeps an iterate and repeatedly refines it; each time it trims a fraction of bad points based on the current iterate and then uses the trimmed sample set to compute the next iterate. It is computationally very efficient and widely used in practice as a heuristic for handling noisy data. It can also be viewed as an alternating-update version of the classic trimmed estimator (e.g., Huber (2011)), which typically takes exponential time:

$$\hat\theta = \arg\min_{\theta\in\Theta;\ S\subseteq[n],\,|S|=\lceil\alpha n\rceil}\ \sum_{i\in S}\mathrm{Loss}_i(\theta),$$

where $\Theta$ is the feasible set for the parameter $\theta$ to be estimated, $\lceil\alpha n\rceil$ is the size of the trimmed sample set $S$, and $\mathrm{Loss}_i(\theta)$ is the loss of $\theta$ on the $i$-th data point (e.g., $\ell_2$ error for linear regression).

For mean estimation, only assuming the distributions have a common mean and bounded covariances, we show that the iterative trimming method in logarithmic iterations outputs a solution whose error only depends on the noise level of the $\lceil\alpha n\rceil$-th noisiest point for $\alpha \ge 4/5$. More precisely, the error only depends on the $\lceil\alpha n\rceil$-th largest value among all the norms of the $n$ covariance matrices. This means the method can tolerate a $1/5$ fraction of "high-noise" points.
We also provide a similar result for linear regression, under a regularity condition that the explanatory variables are sufficiently spread out in different directions (satisfied by typical distributions like Gaussians). As far as we know, these are the first such results for iterative trimming under our general conditions in the entangled single-sample distributions setting. These results also theoretically justify the wide application and empirical success of the simple iterative trimming method in practice. Experiments on synthetic data provide positive support for our analysis.

2 RELATED WORK

Entangled distributions. This setting is first studied by Chierichetti et al. (2014), which considered mean estimation for entangled Gaussians and presented an algorithm combining the k-median and the k-shortest gap algorithms. It also showed the existence of a gap between the error rates of the best possible estimator in this setting and the maximum likelihood estimator when the variances are known. Pensia et al. (2019) considered a more general class of distributions (unimodal and symmetric) and provided analysis on both individual estimators (r-modal interval, k-shortest gap, k-median estimators) and hybrid estimators, which combine the Median estimator with the Shortest Gap or Modal Interval estimator. They also discussed slight relaxations of the symmetry assumption and provided extensions to linear regression. Our work considers mean estimation and linear regression under more general conditions and analyzes a simpler estimator. However, our results are not directly comparable to the existing ones above, since those focus on the case where most of the points have high noise or assume extra constraints on the distributions. For the constrained distributions, our results are weaker than the existing ones. See the detailed discussion in the remarks after our theorems.

This setting is also closely related to robust estimation, which has been extensively studied in the literature of both classic statistics and machine learning theory.

Robust mean estimation. There are several classes of data distribution models for robust mean estimators. The most commonly addressed is the adversarial contamination model, whose origin can be traced back to the malicious noise model by Valiant (1985) and the contamination model by Huber (2011). Under contamination, mean estimation has been investigated in Diakonikolas et al. (2017, 2019a); Cheng et al. (2019). Another related model is the mixture of distributions. There has been steady progress in algorithms for learning mixtures, in particular, learning Gaussian mixtures. Starting from Dasgupta (1999), a rich collection of results are provided in many studies, such as Sanjeev and Kannan (2001); Achlioptas and McSherry (2005); Kannan et al. (2005); Belkin and Sinha (2010a,b); Kalai et al. (2010); Moitra and Valiant (2010); Diakonikolas et al. (2018a).

Robust regression. Robust Least Squares Regression (RLSR) addresses the problem of learning regression coefficients in the presence of corruptions in the response vector. A class of robust regression estimators solving RLSR is the Least Trimmed Squares (LTS) estimator, which was first introduced by Rousseeuw (1984) and has a high breakdown point. Algorithmic solutions of LTS are investigated in Hössjer (1995); Rousseeuw and Van Driessen (2006); Shen et al. (2013) for the linear regression setting. Recently, for robust linear regression in the adversarial setting (i.e., a small fraction of responses are replaced by adversarial values), there is a line of work providing algorithms with theoretical guarantees following the idea of LTS, e.g., Bhatia et al. (2015); Vainsencher et al. (2017); Yang et al. (2018). For robust linear regression in the adversarial setting where both explanatory and response variables can be replaced by adversarial values, a line of work provided algorithms and guarantees, e.g., Diakonikolas et al. (2018b); Prasad et al. (2018); Klivans et al. (2018); Shen and Sanghavi (2019), while some others like Chen et al. (2013); Balakrishnan et al. (2017); Liu et al. (2018) considered the high-dimensional scenario.

3 MEAN ESTIMATION

Suppose we have $n$ independent samples $x_i \sim F_i \in \mathbb{R}^d$, $d \in \mathbb{N}$, where the mean vector and the covariance matrix of each distribution $F_i$ exist. Assume the $F_i$'s have a common mean $\mu^\ast$ and denote their covariance matrices as $\Sigma_i$. When $d = 1$, each $x_i$ degenerates to a univariate random variable, and we write $\Sigma_i$ as $\sigma_i^2$. Our goal is to estimate the common mean $\mu^\ast$.

Notations. For an integer $m$, $[m]$ denotes the set $\{1,\cdots,m\}$. $|S|$ is the cardinality of a set $S$. For two sets $S_1, S_2$, $S_1\setminus S_2$ is the set of elements in $S_1$ but not in $S_2$. $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues. Denote the order statistics of $\{\lambda_{\max}(\Sigma_i)\}_{i=1}^n$ as $\{\lambda_{(i)}\}_{i=1}^n$. $c$ or $C$ denote constants whose values can vary from line to line.
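As a quick illustration of the order-statistic notation $\lambda_{(\lceil\alpha n\rceil)}$ used throughout, here is a minimal NumPy sketch; the covariance list and $\alpha$ are made-up inputs, and we take the order statistics in the usual ascending convention, which is how they are used later in the proofs.

```python
import numpy as np

def lambda_order_stat(cov_list, alpha):
    """lambda_(ceil(alpha*n)): the ceil(alpha*n)-th order statistic (ascending)
    of the largest eigenvalues lambda_max(Sigma_i), i = 1, ..., n."""
    lam_max = np.array([np.linalg.eigvalsh(S)[-1] for S in cov_list])
    k = int(np.ceil(alpha * len(cov_list)))
    return np.sort(lam_max)[k - 1]

# Example: 8 low-noise and 2 high-noise covariances in dimension 3.
covs = [np.eye(3)] * 8 + [100.0 * np.eye(3)] * 2
print(lambda_order_stat(covs, alpha=0.8))   # -> 1.0
```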

3.1 Iterative Trimmed Mean Algorithm

First, recall the general version of the least trimmed loss estimator. Let $f_\mu(\cdot)$ be the loss function given the currently learned parameter $\mu$. In contrast to minimizing the total loss of all samples, the least trimmed loss estimator of $\mu^\ast$ is given by

$$\hat\mu^{(TL)} = \arg\min_{\mu,\ S:|S|=\lceil\alpha n\rceil}\ \sum_{i\in S} f_\mu(x_i), \qquad (1)$$

where $S \subseteq [n]$ and $\alpha \in (0,1]$ is the fraction of samples we want to fit. Finding $\hat\mu^{(TL)}$ requires minimizing the trimmed loss over both the set of all subsets $S$ of size $\lceil\alpha n\rceil$ and the set of all available values of the parameter $\mu$. However, solving the minimization above is hard in general. Therefore, we attempt to minimize the trimmed loss in an iterative way by alternating between minimizing over $S$ and $\mu$. That is, it follows a natural iterative strategy: in the $t$-th iteration, first select a subset $S_t$ of samples with the least loss on the current parameter $\mu_t$, and then update to $\mu_{t+1}$ by minimizing the total loss over $S_t$.

By taking $f_\mu(x)$ in (1) as $\|x - \mu\|_2^2$, in each iteration $\lceil\alpha n\rceil$ samples are selected due to their least distance to the current $\mu$ in $\ell_2$ norm. Besides, note that for a given sample set $S$,

$$\arg\min_\mu \sum_{i\in S}\|x_i - \mu\|_2^2 = \frac{1}{|S|}\sum_{i\in S} x_i,$$

that is, the parameter $\mu$ minimizing the total loss over the sample set $S$ is the empirical mean over $S$. This leads to our method described in Algorithm 1, which we refer to as Iterative Trimmed Mean (ITM).

Algorithm 1 Iterative Trimmed Mean (ITM)
Input: Samples $\{x_i\}_{i=1}^n$, number of rounds $T$, fraction of samples $\alpha$
1: $\mu_0 \leftarrow \frac{1}{n}\sum_{i=1}^n x_i$
2: for $t = 0, \cdots, T-1$ do
3:   Choose samples with smallest current loss: $S_t \leftarrow \arg\min_{S:|S|=\lceil\alpha n\rceil} \sum_{i\in S} \|x_i - \mu_t\|_2^2$
4:   $\mu_{t+1} = \frac{1}{|S_t|}\sum_{i\in S_t} x_i$
5: end for
Output: $\mu_T$

Each update in our algorithm is similar to the trimmed-mean (or truncated-mean) estimator, which is, in the univariate case, defined by removing a fraction of the sample consisting of the largest and smallest points and averaging over the rest; see Tukey and McLaughlin (1963); Huber (2011); Bickel et al. (1965). A natural generalization to the multivariate case is to remove a fraction of the sample consisting of the points which have the largest distances to $\mu^\ast$. The difference between our algorithm and this generalized trimmed-mean estimator is that ours selects points based on the estimated $\mu_t$, while the trimmed mean is based on the ground-truth $\mu^\ast$.
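To make the procedure concrete, here is a minimal NumPy sketch of Algorithm 1. It is a sketch under our own implementation choices: the function name `itm_mean` and the use of `np.argpartition` for the selection step are not from the paper.

```python
import numpy as np

def itm_mean(X, alpha=0.8, T=20):
    """Iterative Trimmed Mean (sketch of Algorithm 1).

    X     : (n, d) array with one sample per row.
    alpha : fraction of samples kept in each round.
    T     : number of rounds.
    """
    n = X.shape[0]
    k = int(np.ceil(alpha * n))            # |S_t| = ceil(alpha * n)
    mu = X.mean(axis=0)                    # mu_0: plain empirical mean
    for _ in range(T):
        # Squared l2 loss of each sample at the current iterate.
        losses = np.sum((X - mu) ** 2, axis=1)
        # S_t: indices of the k samples with the smallest loss.
        S_t = np.argpartition(losses, k - 1)[:k]
        # Re-estimate: empirical mean over the trimmed set.
        mu = X[S_t].mean(axis=0)
    return mu
```

Each round costs $O(nd)$ for the losses plus an $O(n)$ selection, in line with the per-iteration cost discussed after Theorem 1 below.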

3.2 Theoretical Guarantees

Firstly, we introduce a lemma giving an upper bound on the sum of $\ell_2$ distances between the points $x_i$ and the mean vector $\mu^\ast$ over a sample set $S$.

Lemma 1. Define $\lambda_S = \max_{i\in S}\{\lambda_{\max}(\Sigma_i)\}$. Then we have

$$\sum_{i\in S}\|x_i - \mu^\ast\|_2 \le 2|S|\sqrt{\lambda_S d} \qquad (2)$$

with probability at least $1 - \frac{1}{|S|}$.

Proof. For each $i \in S$, denote the transformed random vector $\Sigma_i^{-1/2}(x_i - \mu^\ast)$ as $\tilde x_i$. Then $\tilde x_i$ has mean $0$ and identity covariance matrix. Since $\|x_i - \mu^\ast\|_2 = \sqrt{\tilde x_i^\top \Sigma_i \tilde x_i} \le \sqrt{\lambda_S}\,\|\tilde x_i\|_2$, we can write

$$\sum_{i\in S}\|x_i - \mu^\ast\|_2 \le \sqrt{\lambda_S}\sum_{i\in S}\|\tilde x_i\|_2.$$

Let $\xi_i = \|\tilde x_i\|_2$; then $\{\xi_i\}_{i\in S}$ are independent, and for any $i \in S$ it holds that

$$\mathbb{E}\,\xi_i \le \sqrt{\mathbb{E}\,\xi_i^2} = \sqrt{d}, \qquad \mathrm{Var}(\xi_i) \le \mathbb{E}\,\xi_i^2 = d.$$

By Chebyshev's inequality, we have

$$\mathbb{P}\Big(\sum_{i\in S}\xi_i \ge 2|S|\sqrt{d}\Big) \le \frac{1}{|S|},$$

which completes the proof.

We are now ready to prove our key lemma, which shows the progress of each iteration of the algorithm. The key idea is that the selected subset $S_t$ of samples has a large overlap with the subset $S^\ast$ of the $\lceil\alpha n\rceil$ "good" points with smallest covariance. Furthermore, $S_t\setminus S^\ast$ is not that "bad" since, by the selection criterion, these points have less loss on $\mu_t$ than the points in $S^\ast\setminus S_t$. This thus allows to show the progress.

Lemma 2. Given ITM with fraction $\alpha \ge \frac{4}{5}$, we have

$$\|\mu_{t+1} - \mu^\ast\|_2 \le \frac{1}{2}\|\mu_t - \mu^\ast\|_2 + 2\sqrt{d\lambda_{(\lceil\alpha n\rceil)}} \qquad (3)$$

with probability at least $1 - \frac{5}{4n}$.

Proof. Define $S^\ast = \{i : \lambda_{\max}(\Sigma_i) \le \lambda_{(\lceil\alpha n\rceil)}\}$. Without loss of generality, assume $\alpha n$ is an integer. Then by the algorithm,

$$\mu_{t+1} = \frac{1}{|S_t|}\sum_{i\in S_t} x_i = \mu^\ast + \frac{1}{\alpha n}\sum_{i\in S_t}(x_i - \mu^\ast).$$

Therefore, the $\ell_2$ distance between the learned parameter $\mu_{t+1}$ and the ground truth parameter $\mu^\ast$ can be bounded by:

$$\|\mu_{t+1} - \mu^\ast\|_2 = \Big\|\frac{1}{\alpha n}\sum_{i\in S_t}(x_i - \mu^\ast)\Big\|_2 \le \frac{1}{\alpha n}\Big(\sum_{i\in S_t\cap S^\ast}\|x_i - \mu^\ast\|_2 + \sum_{i\in S_t\setminus S^\ast}\|x_i - \mu^\ast\|_2\Big)$$
$$\le \frac{|S_t\setminus S^\ast|}{\alpha n}\,\|\mu_t - \mu^\ast\|_2 + \frac{1}{\alpha n}\Big(\sum_{i\in S^\ast\cap S_t}\|x_i - \mu^\ast\|_2 + \sum_{i\in S^\ast\setminus S_t}\|x_i - \mu_t\|_2\Big)$$
$$\le \underbrace{\frac{2|S_t\setminus S^\ast|}{\alpha n}}_{\kappa}\cdot\|\mu_t - \mu^\ast\|_2 + \frac{1}{\alpha n}\sum_{i\in S^\ast}\|x_i - \mu^\ast\|_2,$$

where the second inequality is guaranteed by line 3 in Algorithm 1. Note that $|S^\ast| = \alpha n = |S_t|$, thus $|S^\ast\setminus S_t| = |S_t\setminus S^\ast|$. Due to the choice of $S_t$, the distance $\|x_i - \mu_t\|_2$ of each sample in $S_t\setminus S^\ast$ is less than that of the samples in $S^\ast\setminus S_t$.

By Lemma 1, we can bound $\frac{1}{\alpha n}\sum_{i\in S^\ast}\|x_i - \mu^\ast\|_2$ by $2\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$ with probability at least $1 - \frac{1}{\alpha n}$. Meanwhile, $|S^\ast| = |S_t| = \alpha n$ guarantees $|S_t\setminus S^\ast| \le (1-\alpha)n$. Thus, when $\alpha \ge \frac{4}{5}$, $\kappa \le \frac{2(1-\alpha)}{\alpha} \le \frac{1}{2}$. Combining the inequalities completes the proof.

Based on the per-round error bound in Lemma 2, it is easy to show that $\|\mu_t - \mu^\ast\|_2$ can be upper bounded by $\Theta\big(\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}\big)$ after sufficiently many iterations. This leads to our final guarantee.

Theorem 1. Given ITM with $\alpha \ge \frac{4}{5}$, within $T = \Theta\big(\log_2 \frac{\|\mu_0 - \mu^\ast\|_2}{\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}}\big)$ iterations it holds that

$$\|\mu_T - \mu^\ast\|_2 \le c\sqrt{d\lambda_{(\lceil\alpha n\rceil)}} \qquad (4)$$

with probability at least $1 - \frac{5T}{4n}$.

Proof. By Lemma 2, we derive

$$\|\mu_T - \mu^\ast\|_2 \le \frac{1}{2}\|\mu_{T-1} - \mu^\ast\|_2 + 2\sqrt{d\lambda_{(\lceil\alpha n\rceil)}} \le \frac{1}{2^T}\|\mu_0 - \mu^\ast\|_2 + \sum_{i=0}^{T-1}\frac{1}{2^i}\cdot 2\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$$
$$\le \frac{1}{2^T}\|\mu_0 - \mu^\ast\|_2 + 4\sqrt{d\lambda_{(\lceil\alpha n\rceil)}} \le c\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$$

with probability at least $1 - \frac{5T}{4n}$ when $T = \Theta\big(\log_2\frac{\|\mu_0 - \mu^\ast\|_2}{\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}}\big)$.

The theorem shows that the error bound only depends on the order statistic $\lambda_{(\lceil\alpha n\rceil)}$, irrespective of the larger covariances, for $\alpha \ge 4/5$. Therefore, using the iterative trimming method, the magnitudes of a $1/5$ fraction of highest noises will not affect the quality of the final solution. However, it is still an open question whether the bound can be made $c\sqrt{d\lambda_{(\lceil\alpha n\rceil)}/n}$, i.e., the optimal rate given an oracle that knows the covariances.

The method is also computationally very simple and efficient. Since a single iteration of ITM takes time $O(nd)$, an $\tilde O\big(\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}\big)$ error can be computed in $\tilde O(nd)$ time, which is nearly linear in the input size.

Remark 1. With the further assumption that the $x_i$ are independently sampled from Gaussians $\mathcal{N}(\mu^\ast, \Sigma_i)$, the same bound in the theorem holds with probability converging to $1$ at an exponential rate in $n$. This is because (2) can be guaranteed with a higher probability. The sketch of the proof follows: $\big(\sum_{i\in S}\|x_i - \mu^\ast\|_2\big)^2 \le |S|\lambda_S\sum_{i\in S}\|\tilde x_i\|_2^2$ by the Cauchy-Schwarz inequality. Here $\tilde x_i \sim \mathcal{N}(0, I_d)$ due to the Gaussian distribution of $x_i$. By Lemma 3 in Fan and Lv (2008), $\sum_{i\in S}\|\tilde x_i\|_2^2 \le cd|S|$ with probability at least $1 - e^{-c'd|S|}$. Hence, $\mathbb{P}\big(\big(\sum_{i\in S}\|x_i - \mu^\ast\|_2\big)^2 \le cd|S|^2\lambda_S\big) \ge 1 - e^{-c'd|S|}$, where $c$ can be any constant greater than $1$ and $c' = [c - 1 - \log(c)]/2$.
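As an informal sanity check of Theorem 1 (our own toy simulation, not an experiment from the paper), one can draw entangled Gaussian samples with a common mean, run the `itm_mean` sketch from Section 3.1, and compare the final error with $\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$; all constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 1000, 10, 0.8
mu_star = rng.normal(size=d)

# Common mean, heterogeneous isotropic covariances: a (1 - alpha) fraction is high-noise.
scales = np.where(np.arange(n) < alpha * n, 1.0, 100.0)   # lambda_max(Sigma_i)
X = mu_star + rng.normal(size=(n, d)) * np.sqrt(scales)[:, None]

lam_k = np.sort(scales)[int(np.ceil(alpha * n)) - 1]      # lambda_(ceil(alpha*n)) = 1.0 here

mu_hat = itm_mean(X, alpha=alpha, T=20)
print(np.linalg.norm(mu_hat - mu_star), np.sqrt(d * lam_k))
```

In this regime the reported error should sit well below the $c\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$ envelope of Theorem 1, whereas the plain empirical mean is dragged away by the high-noise fifth of the points.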

Remark 2. For the univariate case with $d = 1$, it is sufficient to regard $\Sigma_i$ as the one-dimensional matrix $\sigma_i^2$. Then we obtain that when $\alpha \ge \frac{4}{5}$ and $T = \Theta\big(\log_2\frac{|\mu_0 - \mu^\ast|}{\sigma_{(\lceil\alpha n\rceil)}}\big)$ iterations: $|\mu_T - \mu^\ast| \le c\sigma_{(\lceil\alpha n\rceil)}$, with probability at least $1 - \frac{5T}{4n}$.

Remark 3. It is worth mentioning that there is a trade-off between the accuracy and running time in ITM. In particular, the constant $\alpha$ can be any constant greater than $\frac{2}{3}$, since it suffices to guarantee that $\kappa$ in the proof of Lemma 2 is less than $1$. On the other hand, a smaller $\alpha$ will slow down the speed at which $\|\mu_t - \mu^\ast\|_2$ shrinks, although the computational complexity is still of the same order $\tilde O(nd)$.

Remark 4. We now discuss existing results in detail. In the univariate case, Chierichetti et al. (2014) achieved $\min_{2\le k\le\log n}\tilde O\big(n^{\frac{1}{2}(1+1/(k-1))}\sigma_k\big)$ error in time $O(n^2\log n)$. Among all estimators studied in Pensia et al. (2019), the superior performance is obtained by the hybrid estimators, which include version (1): combining $k_1$-median with $k_2$-shorth, and version (2): combining $k_1$-median with the modal interval estimator. These two versions achieve similar guarantees, while version 1 has the lower run time $O(n\log n)$. Version 1 of the hybrid estimator outputs $\hat\mu_{k_1,k_2}$ such that $|\hat\mu_{k_1,k_2} - \mu| \le r_{2k_2}\cdot\frac{4\sqrt{n\log n}}{k_2}$ with probability $1 - 2\exp(-c'k_2) - 2\exp(-c'\log^2 n)$, where $k_1 = \sqrt{n\log n}$ and $k_2 \ge C\log n$. Since here $r_k$ is defined as $\inf\{r : \frac{1}{n}\sum_{i=1}^n \mathbb{P}(|x_i - \mu^\ast| \le r) \ge \frac{k}{n}\}$, the error bound given above varies with the specific $\{F_i\}_{i=1}^n$, while the worst-case error guarantee is $O\big(\sqrt{n}\,\sigma_{(C\log n)}\big)$. When taking $k_2 = \frac{\lceil\alpha n\rceil}{2}$, the error can be $O\big(\sqrt{\tfrac{\log n}{n}}\,\sigma_{(\lceil\alpha n\rceil)}\big)$. However, this result requires symmetric and unimodal distributions $F_i$.

In the multivariate case, Chierichetti et al. (2014) studied the special case where $x_i \sim \mathcal{N}(\mu, \sigma_i^2 I_d)$ and provided an algorithm with the error bound $\min_{2\le k\le\log n}\tilde O\big(n^{(1+1/(k-1))/d}\sigma_k\big)$ in time $\tilde O(n^2)$. Pensia et al. (2019) mainly considered the special case where the overall distribution is radially symmetric, and sharpened the bound above, resulting in the worst-case error bound $O\big(\sqrt{d}\,\sqrt{n}^{1/d}\,\sigma_{(Cd\log n)}\big)$. Pensia et al. (2019) also provided a computationally efficient estimator with running time $O(n^2 d)$.

These existing results depend on $\sigma_{(C\log n)}$ (or alike) at the expense of an additional factor of roughly $n^{1/d}$, which is most relevant when $C\log n$ of the points have small noises, i.e., when the samples are dominated by high noises. Our results depend on $\sigma_{(\alpha n)}$ (or $\sqrt{\lambda_{(\lceil\alpha n\rceil)}}$ for the multivariate case) for $\alpha \ge 4/5$, which is more relevant when a $1/5$ fraction of the points have high noises. We believe our bounds are more applicable for many practical scenarios. Finally, for the multivariate case, our result holds under more general assumptions and the method is significantly simpler and more efficient.

4 LINEAR REGRESSION

Given observations $\{(x_i, y_i)\}_{i=1}^n$ from the linear model

$$y_i = x_i^\top\beta^\ast + \epsilon_i, \quad \forall\, 1\le i\le n, \qquad (5)$$

our goal is to estimate $\beta^\ast$. Here $\{\epsilon_i\}_{i=1}^n$ are independently distributed with expectation $0$ and variances $\{\sigma_i^2\}$, and $\{x_i\}_{i=1}^n$ are independent of $\{\epsilon_i\}_{i=1}^n$ and satisfy some regularity conditions described below. Denote the stacked matrix $(x_1^\top,\cdots,x_n^\top)^\top$ as $X$, the noise vector $(\epsilon_1,\cdots,\epsilon_n)^\top$ as $\epsilon$, and the response vector $(y_1,\cdots,y_n)^\top$ as $y$.

Assumption 1. Assume $\|x_i\|_2 = 1$ for all $i$. Define

$$\psi^-(k) = \min_{S:|S|=k}\lambda_{\min}\big(X_S^\top X_S\big),$$

where $X_S$ is the submatrix of $X$ consisting of rows indexed by $S \subseteq [n]$. Assume that for $k = \Omega(n)$, $\psi^-(k) \ge k/c_1$ for a constant $c_1 > 0$.

Remark 5. $\|x_i\|_2 = 1$ is assumed without loss of generality, as we can always normalize $(x_i, y_i)$ without affecting the assumption on the $\epsilon_i$'s. The assumption on $\psi^-(k)$ states that every large enough subset of $X$ is well conditioned, and has been used in previous work, e.g., Bhatia et al. (2015); Shen and Sanghavi (2019). It is worth mentioning that the uniformity over $S$ assumed here is not for convenience but for necessity, since the covariances of the samples are unknown. Still, the assumption holds under several common settings in practice. For example, by Theorem 17 in Bhatia et al. (2015), this regularity is guaranteed for $c_1$ close to $1$ w.h.p. when the rows of $X$ are i.i.d. spherical Gaussian vectors and $n$ is sufficiently large.

Remark 6. In our setting, the noise terms are independent but not necessarily identically distributed. This is also referred to as heteroscedasticity in linear regression by Rao (1970) and Horn et al. (1975).

4.1 Iterative Trimmed Squares Minimization

We now apply iterative trimming to linear regression. The first step is to use the square loss, i.e., let $f_\beta(x_i, y_i) = (y_i - x_i^\top\beta)^2$; then the form of the trimmed loss estimator turns to

$$\hat\beta^{(TL)} = \arg\min_{\beta,\ S:|S|=\lceil\alpha n\rceil}\ \sum_{i\in S}\big(y_i - x_i^\top\beta\big)^2. \qquad (6)$$
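Verifying Assumption 1 exactly requires minimizing over all $\binom{n}{k}$ subsets, which is combinatorial. The sketch below only probes random subsets as a cheap heuristic diagnostic (our own check, not a procedure from the paper), so it merely yields an upper bound on the true $\psi^-(k)$.

```python
import numpy as np

def psi_minus_probe(X, k, trials=200, rng=None):
    """Probe psi^-(k) = min_{|S|=k} lambda_min(X_S^T X_S) over random subsets.

    Only random subsets are tried, so the returned value upper-bounds the true minimum.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    best = np.inf
    for _ in range(trials):
        S = rng.choice(n, size=k, replace=False)
        XS = X[S]
        best = min(best, np.linalg.eigvalsh(XS.T @ XS)[0])   # smallest eigenvalue
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # rows normalized, as in Assumption 1
k = 400
print(psi_minus_probe(X, k, rng=rng), k)        # compare against k / c_1
```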

Such a $\hat\beta^{(TL)}$ was first introduced by Rousseeuw (1984) as the least trimmed squares estimator, and its statistical efficiency has been studied in the previous literature. However, its principal shortcoming is its high computational complexity; see, e.g., Mount et al. (2014). Hence, we again use iterative trimming. Let

$$\hat\beta_S^{(LS)} = \arg\min_\beta \sum_{i\in S}\big(y_i - x_i^\top\beta\big)^2,$$

which is the least squares estimator obtained on the sample set $S$. When $S = [n]$, we omit the subscript $S$ and write it as $\hat\beta^{(LS)}$. The resulting algorithm is called Iterative Trimmed Squares Minimization (ITSM) and is described in Algorithm 2.

Algorithm 2 Iterative Trimmed Squares Minimization (ITSM)
Input: Samples $\{(x_i, y_i)\}_{i=1}^n$, number of rounds $T$, fraction of samples $\alpha$
1: $\beta_0 \leftarrow \hat\beta^{(LS)}$
2: for $t = 0, \cdots, T-1$ do
3:   Choose samples with smallest current loss: $S_t \leftarrow \arg\min_{S:|S|=\lceil\alpha n\rceil} \sum_{i\in S}\big(y_i - x_i^\top\beta_t\big)^2$
4:   $\beta_{t+1} = \hat\beta^{(LS)}_{S_t}$
5: end for
Output: $\beta_T$
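Here is a minimal NumPy sketch of Algorithm 2, mirroring the `itm_mean` sketch from Section 3.1; using `np.linalg.lstsq` for the least squares refit is our own implementation choice.

```python
import numpy as np

def itsm(X, y, alpha=0.8, T=20):
    """Iterative Trimmed Squares Minimization (sketch of Algorithm 2).

    X : (n, d) design matrix, y : (n,) response vector.
    """
    n = X.shape[0]
    k = int(np.ceil(alpha * n))                      # |S_t| = ceil(alpha * n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # beta_0: least squares on all samples
    for _ in range(T):
        # Squared residual of each sample under the current iterate.
        losses = (y - X @ beta) ** 2
        # S_t: indices of the k samples with the smallest residuals.
        S_t = np.argpartition(losses, k - 1)[:k]
        # Refit least squares on the trimmed set.
        beta, *_ = np.linalg.lstsq(X[S_t], y[S_t], rcond=None)
    return beta
```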

4.2 Theoretical Guarantees

The key idea for the analysis is similar to that for mean estimation: the selected set $S_t$ has a sufficiently large overlap with the set $S^\ast$ of $\lceil\alpha n\rceil$ "good" points with smallest noises, while the points in $S_t\setminus S^\ast$ are not that "bad" by the selection criterion. Therefore, the algorithm makes progress in each iteration.

Lemma 3. Under Assumption 1, given ITSM with $\alpha \ge \frac{4c_1}{1+4c_1}$, with probability at least $1 - \frac{1+4c_1}{4c_1 n}$ we have

$$\|\beta_{t+1} - \beta^\ast\|_2 \le \frac{1}{2}\|\beta_t - \beta^\ast\|_2 + 2c_1\sigma_{(\lceil\alpha n\rceil)}. \qquad (7)$$

Proof. First we introduce some notations. Define

$$\psi^+(k) = \max_{S:|S|=k}\lambda_{\max}\big(X_S^\top X_S\big),$$

where $X_S$ is the submatrix of $X$ consisting of rows indexed by $S$. Note that $\psi^+(k) \le k$, since for any $S$ of size $k$, by $\|x_i\|_2 = 1$, $\lambda_{\max}(X_S^\top X_S)$ is bounded by $\mathrm{Tr}(X_S^\top X_S) = \mathrm{Tr}(X_S X_S^\top) = \sum_{i\in S}\|x_i\|_2^2 = k$.

Denote by $W_t$ the diagonal matrix with $W_{t,ii} = 1$ if the $i$-th sample is in the set $S_t$ and $W_{t,ii} = 0$ otherwise. Let $S^\ast$ be a subset of $\{i : \sigma_i \le \sigma_{(\lceil\alpha n\rceil)}\}$ with size $\lceil\alpha n\rceil$, and denote by $W^\ast$ the diagonal matrix w.r.t. $S^\ast$.

Under Assumption 1, $X_{S_t}^\top X_{S_t} = X^\top W_t X$ is nonsingular, so we have $\beta_{t+1} = (X^\top W_t X)^{-1} X^\top W_t y$, where we have used $W_t^2 = W_t$. Then

$$\beta_{t+1} = \beta^\ast + (X^\top W_t X)^{-1} X^\top W_t\,\epsilon = \beta^\ast + (X^\top W_t X)^{-1} X^\top W_t\big((I - W^\ast) + W^\ast\big)\epsilon.$$

Therefore, the error can be bounded by:

$$\|\beta_{t+1} - \beta^\ast\|_2 = \big\|(X^\top W_t X)^{-1} X^\top\big(W_t(I - W^\ast) + W_t W^\ast\big)\epsilon\big\|_2 \le \frac{1}{\psi^-(\lceil\alpha n\rceil)}\Big(\underbrace{\big\|X^\top W_t(I - W^\ast)\epsilon\big\|_2}_{T_1} + \underbrace{\big\|X^\top W_t W^\ast\epsilon\big\|_2}_{T_2}\Big),$$

where we use the spectral norm inequality, the triangle inequality and the fact that $\mathrm{Tr}(W_t) = \lceil\alpha n\rceil$. In the following, we bound the two terms $T_1$ and $T_2$.

First, $T_1$ can be bounded as:

$$T_1 = \big\|X^\top W_t(I - W^\ast)\epsilon\big\|_2 \le \big\|X^\top W_t(I - W^\ast)X(\beta_t - \beta^\ast)\big\|_2 + \big\|X^\top W_t(I - W^\ast)(y - X\beta_t)\big\|_2$$
$$\le \psi^+(|S_t\setminus S^\ast|)\,\|\beta_t - \beta^\ast\|_2 + \big\|X^\top W_t(I - W^\ast)(y - X\beta_t)\big\|_2,$$

since $\mathrm{Tr}\big((I - W^\ast)W_t\big) = |S_t\setminus S^\ast|$.

By the fact that $|S_t\setminus S^\ast| = |S^\ast\setminus S_t|$, there exists a bijection between $S_t\setminus S^\ast$ and $S^\ast\setminus S_t$. Denote the image of $i \in S_t\setminus S^\ast$ as $k_i$. Since the loss $(y_i - x_i^\top\beta_t)^2$ of a sample in $S_t\setminus S^\ast$ is less than that of a sample in $S^\ast\setminus S_t$, we have $|y_i - x_i^\top\beta_t| \le |y_{k_i} - x_{k_i}^\top\beta_t|$ for any $i \in S_t\setminus S^\ast$. Hence we can write

$$\big\|X^\top W_t(I - W^\ast)(y - X\beta_t)\big\|_2 = \Big\|\sum_{i\in S_t\setminus S^\ast}(y_i - x_i^\top\beta_t)x_i\Big\|_2 \le \Big\|\sum_{i\in S_t\setminus S^\ast} n_i\big(y_{k_i} - x_{k_i}^\top\beta_t\big)x_i\Big\|_2$$
$$\le \Big\|\sum_{i\in S_t\setminus S^\ast} n_i\epsilon_{k_i}x_i\Big\|_2 + \Big\|\sum_{i\in S_t\setminus S^\ast} n_i x_i x_{k_i}^\top(\beta^\ast - \beta_t)\Big\|_2,$$

where $n_i$ is either $1$ or $-1$.

By the assumption that $\|x_i\|_2 = 1$,

$$\Big\|\sum_{i\in S_t\setminus S^\ast} n_i\epsilon_{k_i}x_i\Big\|_2 \le \sum_{i\in S_t\setminus S^\ast}|\epsilon_{k_i}| = \sum_{i\in S^\ast\setminus S_t}|\epsilon_i|.$$

Meanwhile,

$$\Big\|\sum_{i\in S_t\setminus S^\ast} n_i x_i x_{k_i}^\top(\beta^\ast - \beta_t)\Big\|_2 \le s_{\max}\cdot\|\beta^\ast - \beta_t\|_2,$$

where $s_{\max} := s_{\max}\big(\sum_{i\in S_t\setminus S^\ast} n_i x_i x_{k_i}^\top\big)$ denotes the maximum singular value and is bounded as follows.

Claim 1. $s_{\max} \le \psi^+(|S_t\setminus S^\ast|)$.

Proof. The proof is implicit in the proof of Theorem 7 in Shen and Sanghavi (2019). With matrix notations, $\sum_{i\in S_t\setminus S^\ast} n_i x_i x_{k_i}^\top$ can be written as $X^\top W_t(I - W^\ast)NPX$, where $N$ is diagonal with diagonal entries in $\{1, -1\}$ and $P$ is some permutation matrix. Then

$$s_{\max} = \max_{\|u\|_2=1,\,\|v\|_2=1} u^\top X^\top W_t(I - W^\ast)NPXv.$$

Let $\tilde u = Xu$ and $\tilde v = Xv$; then $s_{\max}$ is bounded by

$$\sum_{i\in S_t\setminus S^\ast}|\tilde u_{r_i}\tilde v_{t_i}| \le \max\Big\{\sum_{i\in S_t\setminus S^\ast}\tilde u_{r_i}^2,\ \sum_{i\in S_t\setminus S^\ast}\tilde v_{t_i}^2\Big\}$$

for some index sequences $\{r_i\}$ and $\{t_i\}$. So $s_{\max}$ is bounded by the larger of the maximum singular values of $X^\top W_t(I - W^\ast)X$ and $X^\top P^\top N^\top W_t(I - W^\ast)NPX$, which is bounded by $\psi^+(|S_t\setminus S^\ast|)$.

With this claim, we can derive

$$T_1 \le \sum_{i\in S^\ast\setminus S_t}|\epsilon_i| + 2\psi^+(|S_t\setminus S^\ast|)\,\|\beta^\ast - \beta_t\|_2.$$

Next, $T_2$ can be bounded as:

$$T_2 = \Big\|\sum_{i\in S_t\cap S^\ast}\epsilon_i x_i\Big\|_2 \le \sum_{i\in S_t\cap S^\ast}|\epsilon_i|.$$

Combining the inequalities above, we have

$$\|\beta_{t+1} - \beta^\ast\|_2 \le \frac{1}{\psi^-(\lceil\alpha n\rceil)}(T_1 + T_2) \le \frac{1}{\psi^-(\lceil\alpha n\rceil)}\Big(\sum_{i\in S^\ast}|\epsilon_i| + 2\psi^+(|S_t\setminus S^\ast|)\,\|\beta^\ast - \beta_t\|_2\Big)$$
$$= \underbrace{\frac{2\psi^+(|S_t\setminus S^\ast|)}{\psi^-(\lceil\alpha n\rceil)}}_{\kappa}\,\|\beta^\ast - \beta_t\|_2 + \frac{\sum_{i\in S^\ast}|\epsilon_i|}{\psi^-(\lceil\alpha n\rceil)}.$$

By assumption, there exists a constant $c_1$ such that $\frac{\alpha n}{\psi^-(\lceil\alpha n\rceil)} \le c_1$ and $\frac{\psi^+(|S_t\setminus S^\ast|)}{\psi^-(\lceil\alpha n\rceil)} \le c_1\cdot\frac{|S_t\setminus S^\ast|}{\lceil\alpha n\rceil}$. Without loss of generality, assume $\alpha n$ is an integer. Since $|S_t\setminus S^\ast| \le (1-\alpha)n$, when $\alpha \ge \frac{4c_1}{1+4c_1}$, $\kappa \le 2c_1\cdot\frac{1-\alpha}{\alpha} \le \frac{1}{2}$. And $\sum_{i\in S^\ast}|\epsilon_i|$ is bounded by $2|S^\ast|\sigma_{(\lceil\alpha n\rceil)}$ with probability at least $1 - \frac{1}{|S^\ast|}$, using Markov's inequality. Combining the inequalities completes the proof.

Theorem 2. Under Assumption 1, given ITSM with $\alpha \ge \frac{4c_1}{1+4c_1}$ and $T = \Theta\big(\log_2\frac{\|\beta_0 - \beta^\ast\|_2}{\sigma_{(\lceil\alpha n\rceil)}}\big)$, it holds that

$$\|\beta_T - \beta^\ast\|_2 \le c\,c_1\sigma_{(\lceil\alpha n\rceil)} \qquad (8)$$

with probability at least $1 - T\frac{1+4c_1}{4c_1 n}$.

Proof. It follows from Lemma 3 by an argument similar to that for Theorem 1.

Remark 7. When the $x_i$'s are i.i.d. spherical Gaussians, Assumption 1 can be satisfied with $c_1$ close to $1$. Then we require $\alpha \ge 4/5$ and the error bound holds with probability $\ge 1 - 5T/(4n)$, similar to that for mean estimation. Also, (8) is obtained without extra assumptions on the noise $\epsilon_i$ except for assuming its second order moment exists. If $\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)$, the previous bound holds with a higher probability $1 - e^{-c'n}$.

Remark 8. We now discuss existing results. Pensia et al. (2019) also adapted its mean estimation methodology to linear regression. When the $x_i$'s are from a multivariate Gaussian with covariance matrix $\Sigma$, w.h.p. it obtained the error bound $\frac{c'n\,\sigma_{(cd\log n)}}{\sqrt{\lambda_{\min}(\Sigma)}}$. Again, the result depends on $\sigma_{(cd\log n)}$ with an additional factor $n$, while ours depends on $\sigma_{(\alpha n)}$ for a constant $\alpha$ (our bound will also have a $\frac{1}{\sqrt{\lambda_{\min}(\Sigma)}}$ factor for such Gaussians). However, their run time is $O(n^d)$, exponential in the dimension $d$, while ours is polynomial.

There also exist studies for robust linear regression in the adversarial setting, where a small fraction of points are corrupted by an adversary (e.g., Bhatia et al. (2015); Liu et al. (2018); Shen and Sanghavi (2019); Diakonikolas et al. (2019b)). It is unclear if their results directly apply to our setting, since they have additional assumptions on the data.

5 Experiments

5.1 Mean Estimation

To validate Theorem 1, we repeat the following procedure: first generate samples $\{x_i\}_{i=1}^n$ under the designed distributions $\{F_i\}_{i=1}^n$, then run Algorithm 1 with fraction $\alpha = \frac{4}{5}$ and $T = 20$ iterations, and finally output the error $\|\mu_T - \mu^\ast\|$. We report the average error over $R$ repetitions; we set $R = 200$ for the univariate case and $R = 20$ for the multivariate case.

Figure 1: Univariate mean estimation. Left: setting 1; Right: setting 2. x-axis: sample size; y-axis: error after T = 20 iterations.

Figure 2: Multivariate mean estimation. Left: setting 3; Right: setting 4. x-axis: sample size; y-axis: error after T = 20 iterations.

For comparison, we also report the error of the empirical mean over the first $\alpha n$ samples with smallest variance (or norm of the covariance matrix), which is the estimator given an oracle knowing all the covariances, and is thus referred to as the Oracle Mean (OM). It is easy to see that this average only depends on $\{\sigma_{(i)}\}_{i=1}^{\alpha n}$ (or $\{\lambda_{(i)}\}_{i=1}^{\alpha n}$) and not on the rest of the $(1-\alpha)n$ samples.

Univariate Case. We first present experiments for $d = 1$ under two designed settings of the entangled distributions $\{F_i\}_{i=1}^n$.

Setting 1: For $i \le \alpha n$, $F_i = \mathcal{N}(0, 1)$, and $F_i = \mathcal{N}(0, i^2)$ for $i > \alpha n$.

Setting 2: For $i \le \alpha n$, $F_i = \mathcal{N}(0, (\log i)^2)$, and $F_i = \mathcal{N}(0, i^2)$ for $i > \alpha n$.

Figure 1 shows the performance of ITM and OM. In both settings, ITM obtains roughly the same average error as OM, showing that the error of ITM does not depend on the $\sigma_i \ge \sigma_{(\lceil\alpha n\rceil)}$. Besides, ITM actually converges to the truth $\mu^\ast$, at the same rate as OM. This suggests ITM can achieve a vanishing error, at least in some special cases. This is left as a future direction.
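For concreteness, here is a sketch of how the univariate settings can be simulated and compared against the oracle mean, reusing the `itm_mean` sketch from Section 3.1 (this is our reading of the setup; the paper does not list its experimental code, and the sample size and seed below are arbitrary):

```python
import numpy as np

def sample_univariate(n, alpha, setting, rng):
    """Draw x_i ~ F_i for Setting 1 or Setting 2 (common mean mu* = 0)."""
    i = np.arange(1, n + 1).astype(float)
    low = np.ones(n) if setting == 1 else np.log(i)        # std of the low-noise points
    sigma = np.where(i <= alpha * n, low, i)               # high-noise points: std = i
    return rng.normal(0.0, sigma), sigma

rng = np.random.default_rng(0)
n, alpha, T = 5000, 0.8, 20
x, sigma = sample_univariate(n, alpha, setting=1, rng=rng)

mu_itm = itm_mean(x[:, None], alpha=alpha, T=T)[0]
k = int(np.ceil(alpha * n))
mu_om = x[np.argsort(sigma, kind="stable")[:k]].mean()     # oracle mean over the alpha*n least-noisy points
print(abs(mu_itm), abs(mu_om))                             # errors w.r.t. mu* = 0
```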

Multivariate Case. For the multivariate case, we set $d = 10$ and provide two simulation settings of $\{F_i\}_{i=1}^n$.

Setting 3 (radially symmetric): $F_i = \mathcal{N}(0, I_{10})$ for $i \le \alpha n$, and $F_i = \mathcal{N}(0, 100\cdot I_{10})$ for $i > \alpha n$.

Setting 4 (radially asymmetric): Let $\Sigma_0$ be a random positive definite matrix generated as follows. (1) Set the diagonal entries to $1$. (2) Set each off-diagonal entry to zero or nonzero with equal probability; each nonzero off-diagonal entry is sampled from the uniform distribution on the interval $(-0.5, 0.5)$. We then make the matrix symmetric by forcing the lower triangular part to equal the upper triangular part, and add a diagonal matrix $cI_{10}$ to make sure it is positive definite, where $c$ is chosen to make the smallest eigenvalue of the matrix equal to $0.2$. Finally, let $F_i = \mathcal{N}(0, \Sigma_0)$ for $i \le \alpha n$ and $F_i = \mathcal{N}(0, 100\cdot\Sigma_0)$ for $i > \alpha n$. A sketch of this construction is given below.
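The following NumPy sketch implements our reading of the Setting 4 recipe for $\Sigma_0$; the helper name and the seed are our choices, and computing the shift $c$ directly from the smallest eigenvalue is one way to realize the stated requirement.

```python
import numpy as np

def make_sigma0(d=10, target_min_eig=0.2, rng=None):
    """Random positive definite covariance following the Setting 4 recipe."""
    if rng is None:
        rng = np.random.default_rng()
    S = np.eye(d)                                    # (1) unit diagonal
    nonzero = rng.random((d, d)) < 0.5               # (2) each off-diagonal entry zero or nonzero w.p. 1/2
    vals = rng.uniform(-0.5, 0.5, size=(d, d))       #     nonzero entries ~ Uniform(-0.5, 0.5)
    upper = np.triu(np.where(nonzero, vals, 0.0), k=1)
    S = S + upper + upper.T                          # symmetrize: lower triangle mirrors the upper
    c = target_min_eig - np.linalg.eigvalsh(S)[0]    # shift so the smallest eigenvalue equals 0.2
    return S + c * np.eye(d)

Sigma0 = make_sigma0(rng=np.random.default_rng(0))
print(np.linalg.eigvalsh(Sigma0)[0])                 # ~0.2 by construction
```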

Figure 2 shows the results. Again, in both settings, ITM obtains roughly the same average error as OM. The results for setting 4 suggest that the radial symmetry assumption may not be needed.

5.2 Linear Regression

We now consider the linear regression problem with $d = 100$. We generate data according to the model (5), where each row of the matrix $X$ is independently sampled from $\mathcal{N}(0, I_d)$ and we choose $\beta^\ast$ to be a random vector with $\ell_2$ norm $1$. The noise vector is generated s.t. $\epsilon_i \sim \mathcal{N}(0, 1)$ for $i \le \alpha n$ and $\epsilon_i \sim \mathcal{N}(0, 100)$ for $i > \alpha n$. We repeatedly generate data and run ITSM for $R = 20$ times, then report the average errors of both ITSM and the least squares estimator over the first $\alpha n$ samples with smallest noise variances, which we refer to as Oracle Least Squares (OLS). We also set $\alpha = \frac{4}{5}$ and $T = 20$ for simplicity, though the smallest possible value of $\alpha$ in Lemma 3 depends on $X$.

Figure 3: Linear Regression. x-axis: sample size; y-axis: error after T = 20 iterations.

Figure 3 shows the results. Similar to mean estimation, the iterative trimming method has error closely tracking that of the oracle method. This again verifies the effectiveness of iterative trimming and provides positive support for our analysis.
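A sketch of this regression experiment as we read it, reusing the `itsm` sketch from Section 4.1; the seed, the sample size, and the explicit oracle comparison below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, T = 2000, 100, 0.8, 20

X = rng.normal(size=(n, d))                      # rows ~ N(0, I_d)
beta_star = rng.normal(size=d)
beta_star /= np.linalg.norm(beta_star)           # ||beta*||_2 = 1

# Heteroscedastic noise: variance 1 for the first alpha*n points, variance 100 for the rest.
noise_std = np.where(np.arange(n) < alpha * n, 1.0, 10.0)
y = X @ beta_star + rng.normal(size=n) * noise_std

beta_itsm = itsm(X, y, alpha=alpha, T=T)
k = int(np.ceil(alpha * n))
oracle_idx = np.argsort(noise_std, kind="stable")[:k]     # alpha*n samples with the smallest noise variance
beta_ols, *_ = np.linalg.lstsq(X[oracle_idx], y[oracle_idx], rcond=None)

print(np.linalg.norm(beta_itsm - beta_star),
      np.linalg.norm(beta_ols - beta_star))
```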

Acknowledgement

This work was supported in part by FA9550-18-1-0166. The authors would also like to acknowledge the support provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

References

Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory, pages 458-469. Springer, 2005.

Sivaraman Balakrishnan, Simon S Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Conference on Learning Theory, pages 169-212, 2017.

Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 103-112. IEEE, 2010a.

Mikhail Belkin and Kaushik Sinha. Toward learning gaussian mixtures with arbitrary separation. In COLT, pages 407-419. Citeseer, 2010b.

Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721-729, 2015.

Peter J Bickel et al. On some robust estimates of location. The Annals of Mathematical Statistics, 36(3):847-858, 1965.

Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774-782, 2013.

Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2755-2771. SIAM, 2019.

Flavio Chierichetti, Anirban Dasgupta, Ravi Kumar, and Silvio Lattanzi. Learning entangled single-sample gaussians. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 511-522. Society for Industrial and Applied Mathematics, 2014.

Sanjoy Dasgupta. Learning mixtures of gaussians. In 40th Annual Symposium on Foundations of Computer Science, pages 634-644. IEEE, 1999.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, pages 999-1008. JMLR.org, 2017.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2683-2702. Society for Industrial and Applied Mathematics, 2018a.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018b.

Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742-864, 2019a.

Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2745-2754. SIAM, 2019b.

Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849-911, 2008.

Susan D Horn, Roger A Horn, and David B Duncan. Estimating heteroscedastic variances in linear models. Journal of the American Statistical Association, 70(350):380-385, 1975.

Ola Hössjer. Exact computation of the least trimmed squares estimate in simple linear regression. Computational Statistics & Data Analysis, 19(3):265-282, 1995.

Peter J Huber. Robust statistics. Springer, 2011.

Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 553-562. ACM, 2010.

Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. The spectral method for general mixture models. In International Conference on Computational Learning Theory, pages 444-457. Springer, 2005.

Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241, 2018.

Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.

Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 93-102. IEEE, 2010.

David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. On the least trimmed squares estimator. Algorithmica, 69(1):148-183, 2014.

Ankit Pensia, Varun Jog, and Po-Ling Loh. Estimating location parameters in entangled single-sample distributions. arXiv preprint arXiv:1907.03087, 2019.

Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

C Radhakrishna Rao. Estimation of heteroscedastic variances in linear models. Journal of the American Statistical Association, 65(329):161-172, 1970.

Peter J Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871-880, 1984.

Peter J Rousseeuw and Katrien Van Driessen. Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12(1):29-45, 2006.

Arora Sanjeev and Ravi Kannan. Learning mixtures of arbitrary gaussians. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 247-257. ACM, 2001.

Fumin Shen, Chunhua Shen, Anton van den Hengel, and Zhenmin Tang. Approximate least trimmed sum of squares fitting and applications in image analysis. IEEE Transactions on Image Processing, 22(5):1836-1847, 2013.

Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pages 5739-5748, 2019.

John W Tukey and Donald H McLaughlin. Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A, pages 331-352, 1963.

Daniel Vainsencher, Shie Mannor, and Huan Xu. Ignoring is a bliss: Learning with large noise through reweighting-minimization. In Conference on Learning Theory, pages 1849-1881, 2017.

Leslie G Valiant. Learning disjunction of conjunctions. In IJCAI, pages 560-566. Citeseer, 1985.

Eunho Yang, Aurélie C Lozano, Aleksandr Aravkin, et al. A general family of trimmed estimators for robust high-dimensional data analysis. Electronic Journal of Statistics, 12(2):3519-3553, 2018.