Learning Entangled Single-Sample Distributions via Iterative Trimming

Hui Yuan
Department of Statistics and Finance
University of Science and Technology of China

Yingyu Liang
Department of Computer Sciences
University of Wisconsin-Madison

Abstract

In the setting of entangled single-sample distributions, the goal is to estimate some common parameter shared by a family of distributions, given one single sample from each distribution. We study mean estimation and linear regression under general conditions, and analyze a simple and computationally efficient method based on iteratively trimming samples and re-estimating the parameter on the trimmed sample set. We show that the method in logarithmic iterations outputs an estimate whose error only depends on the noise level of the $\lceil\alpha n\rceil$-th noisiest data point, where $\alpha$ is a constant and $n$ is the sample size. This means it can tolerate a constant fraction of high-noise points. These are the first such results under our general conditions with computationally efficient methods. It also justifies the wide application and empirical success of iterative trimming in practice. Our theoretical results are complemented by experiments on synthetic data.

1 INTRODUCTION

This work considers the novel parameter estimation setting called entangled single-sample distributions. Different from the typical i.i.d. setting, here we have $n$ data points that are independent, but each is from a different distribution. These distributions are entangled in the sense that they share some common parameter, and our goal is to estimate the common parameter. For example, in the problem of mean estimation for entangled single-sample distributions, we have $n$ data points from $n$ different distributions with a common mean but different variances (the mean and all the variances are unknown), and our goal is to estimate the mean.

This setting is motivated for both theoretical and practical reasons. From the theoretical perspective, it goes beyond the typical i.i.d. setting and raises many interesting open questions, even on basic topics like mean estimation for Gaussians. It can also be viewed as a generalization of traditional mixture modeling, since the number of distinct mixture components can grow with the number of samples. From the practical perspective, many modern applications have various forms of heterogeneity, for which the i.i.d. assumption can lead to bad modeling of their data. The entangled single-sample setting provides potentially better modeling. This is particularly the case for applications where we have no control over the noise levels of the samples. For example, the images taken by self-driving cars can have varying degrees of noise due to changing weather or lighting conditions. Similarly, signals collected from sensors on the Internet of Things can come with interferences from a changing environment.

Though theoretically interesting and practically important, few studies exist in this setting. Chierichetti et al. (2014) considered mean estimation for entangled Gaussians and showed the existence of a gap between the estimation error rates of the best possible estimator in this setting and the maximum likelihood estimator when the variances are known. Pensia et al. (2019) considered mean estimation for symmetric, unimodal distributions including the symmetric multivariate case (i.e., the distributions are radially symmetric) with sharpened bounds, and provided extensive discussion on the performance of their estimators in different configurations of the variances. These existing results focus on specific families of distributions or on the case where most samples are "high-noise" points.

On the contrary, we focus on the case with a constant fraction of "high-noise" points, which is more interesting in practice. We study multivariate mean estimation and linear regression under more general conditions and analyze a simple and efficient estimator based on iterative trimming. The iterative trimming idea is simple: the algorithm keeps an iterate and repeatedly refines it; each time it trims a fraction of bad points based on the current iterate and then uses the trimmed sample set to compute the next iterate. It is computationally very efficient and widely used in practice as a heuristic for handling noisy data. It can also be viewed as an alternating-update version of the classic trimmed estimator (e.g., Huber (2011)), which typically takes exponential time:

$$\hat\theta = \arg\min_{\theta\in\Theta;\ S\subseteq[n],\,|S|=\lceil\alpha n\rceil}\ \sum_{i\in S}\mathrm{Loss}_i(\theta),$$

where $\Theta$ is the feasible set for the parameter $\theta$ to be estimated, $\lceil\alpha n\rceil$ is the size of the trimmed sample set $S$, and $\mathrm{Loss}_i(\theta)$ is the loss of $\theta$ on the $i$-th data point (e.g., $\ell_2$ error for linear regression).

For mean estimation, only assuming the distributions have a common mean and bounded covariances, we show that the iterative trimming method in logarithmic iterations outputs a solution whose error only depends on the noise level of the $\lceil\alpha n\rceil$-th noisiest point for $\alpha \ge 4/5$. More precisely, the error only depends on the $\lceil\alpha n\rceil$-th largest value among all the norms of the $n$ covariance matrices. This means the method can tolerate a $1/5$ fraction of "high-noise" points.
We also provide a similar result for linear regression, under a regularity condition that the explanatory variables are sufficiently spread out in different directions (satisfied by typical distributions like Gaussians). As far as we know, these are the first such results for iterative trimming under our general conditions in the entangled single-sample distributions setting. These results also theoretically justify the wide application and empirical success of the simple iterative trimming method in practice. Experiments on synthetic data provide positive support for our analysis.

2 RELATED WORK

Entangled distributions. This setting is first studied by Chierichetti et al. (2014), which considered mean estimation for entangled Gaussians and presented an algorithm combining the k-median and the k-shortest gap algorithms. It also showed the existence of a gap between the error rates of the best possible estimator in this setting and the maximum likelihood estimator when the variances are known. Pensia et al. (2019) considered a more general class of distributions (unimodal and symmetric) and provided analysis on both individual estimators (r-modal interval, k-shortest gap, k-median estimators) and hybrid estimators, which combine the Median estimator with the Shortest Gap or Modal Interval estimator. They also discussed slight relaxations of the symmetry assumption and provided extensions to linear regression. Our work considers mean estimation and linear regression under more general conditions and analyzes a simpler estimator. However, our results are not directly comparable to the existing ones above, since those focus on the case where most of the points have high noise or assume extra constraints on the distributions. For the constrained distributions, our results are weaker than the existing ones. See the detailed discussion in the remarks after our theorems.

This setting is also closely related to robust estimation, which has been extensively studied in the literature of both classic statistics and machine learning theory.

Robust mean estimation. There are several classes of data distribution models for robust mean estimators. The most commonly addressed is the adversarial contamination model, whose origin can be traced back to the malicious noise model by Valiant (1985) and the contamination model by Huber (2011). Under contamination, mean estimation has been investigated in Diakonikolas et al. (2017, 2019a); Cheng et al. (2019). Another related model is the mixture of distributions. There has been steady progress in algorithms for learning mixtures, in particular, learning Gaussian mixtures. Starting from Dasgupta (1999), a rich collection of results are provided in many studies, such as Sanjeev and Kannan (2001); Achlioptas and McSherry (2005); Kannan et al. (2005); Belkin and Sinha (2010a,b); Kalai et al. (2010); Moitra and Valiant (2010); Diakonikolas et al. (2018a).

Robust regression. Robust Least Squares Regression (RLSR) addresses the problem of learning regression coefficients in the presence of corruptions in the response vector. A class of robust regression estimators solving RLSR is the Least Trimmed Squares (LTS) estimator, which was first introduced by Rousseeuw (1984) and has a high breakdown point. Algorithmic solutions of LTS are investigated in Hössjer (1995); Rousseeuw and Van Driessen (2006); Shen et al. (2013) for the linear regression setting. Recently, for robust linear regression in the adversarial setting (i.e., a small fraction of responses are replaced by adversarial values), there is a line of work providing algorithms with theoretical guarantees following the idea of LTS, e.g., Bhatia et al. (2015); Vainsencher et al. (2017); Yang et al. (2018). For robust linear regression in the adversarial setting where both explanatory and response variables can be replaced by adversarial values, a line of work provided algorithms and guarantees, e.g., Diakonikolas et al. (2018b); Prasad et al. (2018); Klivans et al. (2018); Shen and Sanghavi (2019), while some others like Chen et al. (2013); Balakrishnan et al. (2017); Liu et al. (2018) considered the high-dimensional scenario.

3 MEAN ESTIMATION

Suppose we have $n$ independent samples $x_i \sim F_i \in \mathbb{R}^d$, $d \in \mathbb{N}$, where the mean vector and the covariance matrix of each distribution $F_i$ exist. Assume the $F_i$'s have a common mean $\mu^\ast$ and denote their covariance matrices as $\Sigma_i$. When $d = 1$, each $x_i$ degenerates to a univariate random variable, and we write $\Sigma_i$ as $\sigma_i^2$. Our goal is to estimate the common mean $\mu^\ast$.

Notations. For an integer $m$, $[m]$ denotes the set $\{1,\cdots,m\}$. $|S|$ is the cardinality of a set $S$. For two sets $S_1, S_2$, $S_1\setminus S_2$ is the set of elements in $S_1$ but not in $S_2$. $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues. Denote the order statistics of $\{\lambda_{\max}(\Sigma_i)\}_{i=1}^n$ as $\{\lambda_{(i)}\}_{i=1}^n$. $c$ or $C$ denote constants whose values can vary from line to line.
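As a quick illustration of the order-statistic notation $\lambda_{(\lceil\alpha n\rceil)}$ used throughout, here is a minimal NumPy sketch; the covariance list and $\alpha$ are made-up inputs, and we take the order statistics in the usual ascending convention, which is how they are used later in the proofs.

```python
import numpy as np

def lambda_order_stat(cov_list, alpha):
    """lambda_(ceil(alpha*n)): the ceil(alpha*n)-th order statistic (ascending)
    of the largest eigenvalues lambda_max(Sigma_i), i = 1, ..., n."""
    lam_max = np.array([np.linalg.eigvalsh(S)[-1] for S in cov_list])
    k = int(np.ceil(alpha * len(cov_list)))
    return np.sort(lam_max)[k - 1]

# Example: 8 low-noise and 2 high-noise covariances in dimension 3.
covs = [np.eye(3)] * 8 + [100.0 * np.eye(3)] * 2
print(lambda_order_stat(covs, alpha=0.8))   # -> 1.0
```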

3.1 Iterative Trimmed Mean Algorithm

First, recall the general version of the least trimmed loss estimator. Let $f_\mu(\cdot)$ be the loss function given the currently learned parameter $\mu$. In contrast to minimizing the total loss of all samples, the least trimmed loss estimator of $\mu^\ast$ is given by

$$\hat\mu^{(TL)} = \arg\min_{\mu,\ S:|S|=\lceil\alpha n\rceil}\ \sum_{i\in S} f_\mu(x_i), \qquad (1)$$

where $S \subseteq [n]$ and $\alpha \in (0,1]$ is the fraction of samples we want to fit. Finding $\hat\mu^{(TL)}$ requires minimizing the trimmed loss over both the set of all subsets $S$ of size $\lceil\alpha n\rceil$ and the set of all available values of the parameter $\mu$. However, solving the minimization above is hard in general. Therefore, we attempt to minimize the trimmed loss in an iterative way by alternating between minimizing over $S$ and $\mu$. That is, it follows a natural iterative strategy: in the $t$-th iteration, first select a subset $S_t$ of samples with the least loss on the current parameter $\mu_t$, and then update to $\mu_{t+1}$ by minimizing the total loss over $S_t$.

By taking $f_\mu(x)$ in (1) as $\|x - \mu\|_2^2$, in each iteration $\lceil\alpha n\rceil$ samples are selected due to their least distance to the current $\mu$ in $\ell_2$ norm. Besides, note that for a given sample set $S$,

$$\arg\min_\mu \sum_{i\in S}\|x_i - \mu\|_2^2 = \frac{1}{|S|}\sum_{i\in S} x_i,$$

that is, the parameter $\mu$ minimizing the total loss over the sample set $S$ is the empirical mean over $S$. This leads to our method described in Algorithm 1, which we refer to as Iterative Trimmed Mean (ITM).

Algorithm 1 Iterative Trimmed Mean (ITM)
Input: Samples $\{x_i\}_{i=1}^n$, number of rounds $T$, fraction of samples $\alpha$
1: $\mu_0 \leftarrow \frac{1}{n}\sum_{i=1}^n x_i$
2: for $t = 0, \cdots, T-1$ do
3:   Choose samples with smallest current loss: $S_t \leftarrow \arg\min_{S:|S|=\lceil\alpha n\rceil} \sum_{i\in S} \|x_i - \mu_t\|_2^2$
4:   $\mu_{t+1} = \frac{1}{|S_t|}\sum_{i\in S_t} x_i$
5: end for
Output: $\mu_T$

Each update in our algorithm is similar to the trimmed-mean (or truncated-mean) estimator, which is, in the univariate case, defined by removing a fraction of the sample consisting of the largest and smallest points and averaging over the rest; see Tukey and McLaughlin (1963); Huber (2011); Bickel et al. (1965). A natural generalization to the multivariate case is to remove a fraction of the sample consisting of the points which have the largest distances to $\mu^\ast$. The difference between our algorithm and this generalized trimmed-mean estimator is that ours selects points based on the estimated $\mu_t$, while the trimmed mean is based on the ground-truth $\mu^\ast$.
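To make the procedure concrete, here is a minimal NumPy sketch of Algorithm 1. It is a sketch under our own implementation choices: the function name `itm_mean` and the use of `np.argpartition` for the selection step are not from the paper.

```python
import numpy as np

def itm_mean(X, alpha=0.8, T=20):
    """Iterative Trimmed Mean (sketch of Algorithm 1).

    X     : (n, d) array with one sample per row.
    alpha : fraction of samples kept in each round.
    T     : number of rounds.
    """
    n = X.shape[0]
    k = int(np.ceil(alpha * n))            # |S_t| = ceil(alpha * n)
    mu = X.mean(axis=0)                    # mu_0: plain empirical mean
    for _ in range(T):
        # Squared l2 loss of each sample at the current iterate.
        losses = np.sum((X - mu) ** 2, axis=1)
        # S_t: indices of the k samples with the smallest loss.
        S_t = np.argpartition(losses, k - 1)[:k]
        # Re-estimate: empirical mean over the trimmed set.
        mu = X[S_t].mean(axis=0)
    return mu
```

Each round costs $O(nd)$ for the losses plus an $O(n)$ selection, in line with the per-iteration cost discussed after Theorem 1 below.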

3.2 Theoretical Guarantees

Firstly, we introduce a lemma giving an upper bound on the sum of $\ell_2$ distances between the points $x_i$ and the mean vector $\mu^\ast$ over a sample set $S$.

Lemma 1. Define $\lambda_S = \max_{i\in S}\{\lambda_{\max}(\Sigma_i)\}$. Then we have

$$\sum_{i\in S}\|x_i - \mu^\ast\|_2 \le 2|S|\sqrt{\lambda_S d} \qquad (2)$$

with probability at least $1 - \frac{1}{|S|}$.

Proof. For each $i \in S$, denote the transformed random vector $\Sigma_i^{-1/2}(x_i - \mu^\ast)$ as $\tilde x_i$. Then $\tilde x_i$ has mean $0$ and identity covariance matrix. Since $\|x_i - \mu^\ast\|_2 = \sqrt{\tilde x_i^\top \Sigma_i \tilde x_i} \le \sqrt{\lambda_S}\,\|\tilde x_i\|_2$, we can write

$$\sum_{i\in S}\|x_i - \mu^\ast\|_2 \le \sqrt{\lambda_S}\sum_{i\in S}\|\tilde x_i\|_2.$$

Let $\xi_i = \|\tilde x_i\|_2$; then $\{\xi_i\}_{i\in S}$ are independent, and for any $i \in S$ it holds that

$$\mathbb{E}\,\xi_i \le \sqrt{\mathbb{E}\,\xi_i^2} = \sqrt{d}, \qquad \mathrm{Var}(\xi_i) \le \mathbb{E}\,\xi_i^2 = d.$$

By Chebyshev's inequality, we have

$$\mathbb{P}\Big(\sum_{i\in S}\xi_i \ge 2|S|\sqrt{d}\Big) \le \frac{1}{|S|},$$

which completes the proof.

We are now ready to prove our key lemma, which shows the progress of each iteration of the algorithm. The key idea is that the selected subset $S_t$ of samples has a large overlap with the subset $S^\ast$ of the $\lceil\alpha n\rceil$ "good" points with smallest covariance. Furthermore, $S_t\setminus S^\ast$ is not that "bad" since, by the selection criterion, these points have less loss on $\mu_t$ than the points in $S^\ast\setminus S_t$. This thus allows to show the progress.

Lemma 2. Given ITM with fraction $\alpha \ge \frac{4}{5}$, we have

$$\|\mu_{t+1} - \mu^\ast\|_2 \le \frac{1}{2}\|\mu_t - \mu^\ast\|_2 + 2\sqrt{d\lambda_{(\lceil\alpha n\rceil)}} \qquad (3)$$

with probability at least $1 - \frac{5}{4n}$.

Proof. Define $S^\ast = \{i : \lambda_{\max}(\Sigma_i) \le \lambda_{(\lceil\alpha n\rceil)}\}$. Without loss of generality, assume $\alpha n$ is an integer. Then by the algorithm,

$$\mu_{t+1} = \frac{1}{|S_t|}\sum_{i\in S_t} x_i = \mu^\ast + \frac{1}{\alpha n}\sum_{i\in S_t}(x_i - \mu^\ast).$$

Therefore, the $\ell_2$ distance between the learned parameter $\mu_{t+1}$ and the ground truth parameter $\mu^\ast$ can be bounded by:

$$\|\mu_{t+1} - \mu^\ast\|_2 = \Big\|\frac{1}{\alpha n}\sum_{i\in S_t}(x_i - \mu^\ast)\Big\|_2 \le \frac{1}{\alpha n}\Big(\sum_{i\in S_t\cap S^\ast}\|x_i - \mu^\ast\|_2 + \sum_{i\in S_t\setminus S^\ast}\|x_i - \mu^\ast\|_2\Big)$$
$$\le \frac{|S_t\setminus S^\ast|}{\alpha n}\,\|\mu_t - \mu^\ast\|_2 + \frac{1}{\alpha n}\Big(\sum_{i\in S^\ast\cap S_t}\|x_i - \mu^\ast\|_2 + \sum_{i\in S^\ast\setminus S_t}\|x_i - \mu_t\|_2\Big)$$
$$\le \underbrace{\frac{2|S_t\setminus S^\ast|}{\alpha n}}_{\kappa}\cdot\|\mu_t - \mu^\ast\|_2 + \frac{1}{\alpha n}\sum_{i\in S^\ast}\|x_i - \mu^\ast\|_2,$$

where the second inequality is guaranteed by line 3 in Algorithm 1. Note that $|S^\ast| = \alpha n = |S_t|$, thus $|S^\ast\setminus S_t| = |S_t\setminus S^\ast|$. Due to the choice of $S_t$, the distance $\|x_i - \mu_t\|_2$ of each sample in $S_t\setminus S^\ast$ is less than that of the samples in $S^\ast\setminus S_t$.

By Lemma 1, we can bound $\frac{1}{\alpha n}\sum_{i\in S^\ast}\|x_i - \mu^\ast\|_2$ by $2\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$ with probability at least $1 - \frac{1}{\alpha n}$. Meanwhile, $|S^\ast| = |S_t| = \alpha n$ guarantees $|S_t\setminus S^\ast| \le (1-\alpha)n$. Thus, when $\alpha \ge \frac{4}{5}$, $\kappa \le \frac{2(1-\alpha)}{\alpha} \le \frac{1}{2}$. Combining the inequalities completes the proof.

Based on the per-round error bound in Lemma 2, it is easy to show that $\|\mu_t - \mu^\ast\|_2$ can be upper bounded by $\Theta\big(\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}\big)$ after sufficiently many iterations. This leads to our final guarantee.

Theorem 1. Given ITM with $\alpha \ge \frac{4}{5}$, within $T = \Theta\big(\log_2 \frac{\|\mu_0 - \mu^\ast\|_2}{\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}}\big)$ iterations it holds that

$$\|\mu_T - \mu^\ast\|_2 \le c\sqrt{d\lambda_{(\lceil\alpha n\rceil)}} \qquad (4)$$

with probability at least $1 - \frac{5T}{4n}$.

Proof. By Lemma 2, we derive

$$\|\mu_T - \mu^\ast\|_2 \le \frac{1}{2}\|\mu_{T-1} - \mu^\ast\|_2 + 2\sqrt{d\lambda_{(\lceil\alpha n\rceil)}} \le \frac{1}{2^T}\|\mu_0 - \mu^\ast\|_2 + \sum_{i=0}^{T-1}\frac{1}{2^i}\cdot 2\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$$
$$\le \frac{1}{2^T}\|\mu_0 - \mu^\ast\|_2 + 4\sqrt{d\lambda_{(\lceil\alpha n\rceil)}} \le c\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$$

with probability at least $1 - \frac{5T}{4n}$ when $T = \Theta\big(\log_2\frac{\|\mu_0 - \mu^\ast\|_2}{\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}}\big)$.

The theorem shows that the error bound only depends on the order statistic $\lambda_{(\lceil\alpha n\rceil)}$, irrespective of the larger covariances, for $\alpha \ge 4/5$. Therefore, using the iterative trimming method, the magnitudes of a $1/5$ fraction of highest noises will not affect the quality of the final solution. However, it is still an open question whether the bound can be made $c\sqrt{d\lambda_{(\lceil\alpha n\rceil)}/n}$, i.e., the optimal rate given an oracle that knows the covariances.

The method is also computationally very simple and efficient. Since a single iteration of ITM takes time $O(nd)$, an $\tilde O\big(\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}\big)$ error can be computed in $\tilde O(nd)$ time, which is nearly linear in the input size.

Remark 1. With the further assumption that the $x_i$ are independently sampled from Gaussians $\mathcal{N}(\mu^\ast, \Sigma_i)$, the same bound in the theorem holds with probability converging to $1$ at an exponential rate in $n$. This is because (2) can be guaranteed with a higher probability. The sketch of the proof follows: $\big(\sum_{i\in S}\|x_i - \mu^\ast\|_2\big)^2 \le |S|\lambda_S\sum_{i\in S}\|\tilde x_i\|_2^2$ by the Cauchy-Schwarz inequality. Here $\tilde x_i \sim \mathcal{N}(0, I_d)$ due to the Gaussian distribution of $x_i$. By Lemma 3 in Fan and Lv (2008), $\sum_{i\in S}\|\tilde x_i\|_2^2 \le cd|S|$ with probability at least $1 - e^{-c'd|S|}$. Hence, $\mathbb{P}\big(\big(\sum_{i\in S}\|x_i - \mu^\ast\|_2\big)^2 \le cd|S|^2\lambda_S\big) \ge 1 - e^{-c'd|S|}$, where $c$ can be any constant greater than $1$ and $c' = [c - 1 - \log(c)]/2$.
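As an informal sanity check of Theorem 1 (our own toy simulation, not an experiment from the paper), one can draw entangled Gaussian samples with a common mean, run the `itm_mean` sketch from Section 3.1, and compare the final error with $\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$; all constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 1000, 10, 0.8
mu_star = rng.normal(size=d)

# Common mean, heterogeneous isotropic covariances: a (1 - alpha) fraction is high-noise.
scales = np.where(np.arange(n) < alpha * n, 1.0, 100.0)   # lambda_max(Sigma_i)
X = mu_star + rng.normal(size=(n, d)) * np.sqrt(scales)[:, None]

lam_k = np.sort(scales)[int(np.ceil(alpha * n)) - 1]      # lambda_(ceil(alpha*n)) = 1.0 here

mu_hat = itm_mean(X, alpha=alpha, T=20)
print(np.linalg.norm(mu_hat - mu_star), np.sqrt(d * lam_k))
```

In this regime the reported error should sit well below the $c\sqrt{d\lambda_{(\lceil\alpha n\rceil)}}$ envelope of Theorem 1, whereas the plain empirical mean is dragged away by the high-noise fifth of the points.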

Remark 2. For the univariate case with $d = 1$, it is sufficient to regard $\Sigma_i$ as the one-dimensional matrix $\sigma_i^2$. Then we obtain that when $\alpha \ge \frac{4}{5}$ and $T = \Theta\big(\log_2\frac{|\mu_0 - \mu^\ast|}{\sigma_{(\lceil\alpha n\rceil)}}\big)$ iterations: $|\mu_T - \mu^\ast| \le c\sigma_{(\lceil\alpha n\rceil)}$, with probability at least $1 - \frac{5T}{4n}$.

Remark 3. It is worth mentioning that there is a trade-off between the accuracy and running time in ITM. In particular, the constant $\alpha$ can be any constant greater than $\frac{2}{3}$, since it suffices to guarantee that $\kappa$ in the proof of Lemma 2 is less than $1$. On the other hand, a smaller $\alpha$ will slow down the speed at which $\|\mu_t - \mu^\ast\|_2$ shrinks, although the computational complexity is still of the same order $\tilde O(nd)$.

Remark 4. We now discuss existing results in detail. In the univariate case, Chierichetti et al. (2014) achieved $\min_{2\le k\le\log n}\tilde O\big(n^{\frac{1}{2}(1+1/(k-1))}\sigma_k\big)$ error in time $O(n^2\log n)$. Among all estimators studied in Pensia et al. (2019), the superior performance is obtained by the hybrid estimators, which include version (1): combining $k_1$-median with $k_2$-shorth, and version (2): combining $k_1$-median with the modal interval estimator. These two versions achieve similar guarantees, while version 1 has the lower run time $O(n\log n)$. Version 1 of the hybrid estimator outputs $\hat\mu_{k_1,k_2}$ such that $|\hat\mu_{k_1,k_2} - \mu| \le r_{2k_2}\cdot\frac{4\sqrt{n\log n}}{k_2}$ with probability $1 - 2\exp(-c'k_2) - 2\exp(-c'\log^2 n)$, where $k_1 = \sqrt{n\log n}$ and $k_2 \ge C\log n$. Since here $r_k$ is defined as $\inf\{r : \frac{1}{n}\sum_{i=1}^n \mathbb{P}(|x_i - \mu^\ast| \le r) \ge \frac{k}{n}\}$, the error bound given above varies with the specific $\{F_i\}_{i=1}^n$, while the worst-case error guarantee is $O\big(\sqrt{n}\,\sigma_{(C\log n)}\big)$. When taking $k_2 = \frac{\lceil\alpha n\rceil}{2}$, the error can be $O\big(\sqrt{\tfrac{\log n}{n}}\,\sigma_{(\lceil\alpha n\rceil)}\big)$. However, this result requires symmetric and unimodal distributions $F_i$.

In the multivariate case, Chierichetti et al. (2014) studied the special case where $x_i \sim \mathcal{N}(\mu, \sigma_i^2 I_d)$ and provided an algorithm with the error bound $\min_{2\le k\le\log n}\tilde O\big(n^{(1+1/(k-1))/d}\sigma_k\big)$ in time $\tilde O(n^2)$. Pensia et al. (2019) mainly considered the special case where the overall distribution is radially symmetric, and sharpened the bound above, resulting in the worst-case error bound $O\big(\sqrt{d}\,\sqrt{n}^{1/d}\,\sigma_{(Cd\log n)}\big)$. Pensia et al. (2019) also provided a computationally efficient estimator with running time $O(n^2 d)$.

These existing results depend on $\sigma_{(C\log n)}$ (or alike) at the expense of an additional factor of roughly $n^{1/d}$, which is most relevant when $C\log n$ of the points have small noises, i.e., when the samples are dominated by high noises. Our results depend on $\sigma_{(\alpha n)}$ (or $\sqrt{\lambda_{(\lceil\alpha n\rceil)}}$ for the multivariate case) for $\alpha \ge 4/5$, which is more relevant when a $1/5$ fraction of the points have high noises. We believe our bounds are more applicable for many practical scenarios. Finally, for the multivariate case, our result holds under more general assumptions and the method is significantly simpler and more efficient.

4 LINEAR REGRESSION

Given observations $\{(x_i, y_i)\}_{i=1}^n$ from the linear model

$$y_i = x_i^\top\beta^\ast + \epsilon_i, \quad \forall\, 1\le i\le n, \qquad (5)$$

our goal is to estimate $\beta^\ast$. Here $\{\epsilon_i\}_{i=1}^n$ are independently distributed with expectation $0$ and variances $\{\sigma_i^2\}$, and $\{x_i\}_{i=1}^n$ are independent of $\{\epsilon_i\}_{i=1}^n$ and satisfy some regularity conditions described below. Denote the stacked matrix $(x_1^\top,\cdots,x_n^\top)^\top$ as $X$, the noise vector $(\epsilon_1,\cdots,\epsilon_n)^\top$ as $\epsilon$, and the response vector $(y_1,\cdots,y_n)^\top$ as $y$.

Assumption 1. Assume $\|x_i\|_2 = 1$ for all $i$. Define

$$\psi^-(k) = \min_{S:|S|=k}\lambda_{\min}\big(X_S^\top X_S\big),$$

where $X_S$ is the submatrix of $X$ consisting of rows indexed by $S \subseteq [n]$. Assume that for $k = \Omega(n)$, $\psi^-(k) \ge k/c_1$ for a constant $c_1 > 0$.

Remark 5. $\|x_i\|_2 = 1$ is assumed without loss of generality, as we can always normalize $(x_i, y_i)$ without affecting the assumption on the $\epsilon_i$'s. The assumption on $\psi^-(k)$ states that every large enough subset of $X$ is well conditioned, and has been used in previous work, e.g., Bhatia et al. (2015); Shen and Sanghavi (2019). It is worth mentioning that the uniformity over $S$ assumed here is not for convenience but for necessity, since the covariances of the samples are unknown. Still, the assumption holds under several common settings in practice. For example, by Theorem 17 in Bhatia et al. (2015), this regularity is guaranteed for $c_1$ close to $1$ w.h.p. when the rows of $X$ are i.i.d. spherical Gaussian vectors and $n$ is sufficiently large.

Remark 6. In our setting, the noise terms are independent but not necessarily identically distributed. This is also referred to as heteroscedasticity in linear regression by Rao (1970) and Horn et al. (1975).

4.1 Iterative Trimmed Squares Minimization

We now apply iterative trimming to linear regression. The first step is to use the square loss, i.e., let $f_\beta(x_i, y_i) = (y_i - x_i^\top\beta)^2$; then the form of the trimmed loss estimator turns to

$$\hat\beta^{(TL)} = \arg\min_{\beta,\ S:|S|=\lceil\alpha n\rceil}\ \sum_{i\in S}\big(y_i - x_i^\top\beta\big)^2. \qquad (6)$$
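Verifying Assumption 1 exactly requires minimizing over all $\binom{n}{k}$ subsets, which is combinatorial. The sketch below only probes random subsets as a cheap heuristic diagnostic (our own check, not a procedure from the paper), so it merely yields an upper bound on the true $\psi^-(k)$.

```python
import numpy as np

def psi_minus_probe(X, k, trials=200, rng=None):
    """Probe psi^-(k) = min_{|S|=k} lambda_min(X_S^T X_S) over random subsets.

    Only random subsets are tried, so the returned value upper-bounds the true minimum.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    best = np.inf
    for _ in range(trials):
        S = rng.choice(n, size=k, replace=False)
        XS = X[S]
        best = min(best, np.linalg.eigvalsh(XS.T @ XS)[0])   # smallest eigenvalue
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # rows normalized, as in Assumption 1
k = 400
print(psi_minus_probe(X, k, rng=rng), k)        # compare against k / c_1
```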

Such a $\hat\beta^{(TL)}$ was first introduced by Rousseeuw (1984) as the least trimmed squares estimator, and its statistical efficiency has been studied in the previous literature. However, its principal shortcoming is its high computational complexity; see, e.g., Mount et al. (2014). Hence, we again use iterative trimming. Let

$$\hat\beta_S^{(LS)} = \arg\min_\beta \sum_{i\in S}\big(y_i - x_i^\top\beta\big)^2,$$

which is the least squares estimator obtained on the sample set $S$. When $S = [n]$, we omit the subscript $S$ and write it as $\hat\beta^{(LS)}$. The resulting algorithm is called Iterative Trimmed Squares Minimization (ITSM) and is described in Algorithm 2.

Algorithm 2 Iterative Trimmed Squares Minimization (ITSM)
Input: Samples $\{(x_i, y_i)\}_{i=1}^n$, number of rounds $T$, fraction of samples $\alpha$
1: $\beta_0 \leftarrow \hat\beta^{(LS)}$
2: for $t = 0, \cdots, T-1$ do
3:   Choose samples with smallest current loss: $S_t \leftarrow \arg\min_{S:|S|=\lceil\alpha n\rceil} \sum_{i\in S}\big(y_i - x_i^\top\beta_t\big)^2$
4:   $\beta_{t+1} = \hat\beta^{(LS)}_{S_t}$
5: end for
Output: $\beta_T$
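Here is a minimal NumPy sketch of Algorithm 2, mirroring the `itm_mean` sketch from Section 3.1; using `np.linalg.lstsq` for the least squares refit is our own implementation choice.

```python
import numpy as np

def itsm(X, y, alpha=0.8, T=20):
    """Iterative Trimmed Squares Minimization (sketch of Algorithm 2).

    X : (n, d) design matrix, y : (n,) response vector.
    """
    n = X.shape[0]
    k = int(np.ceil(alpha * n))                      # |S_t| = ceil(alpha * n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # beta_0: least squares on all samples
    for _ in range(T):
        # Squared residual of each sample under the current iterate.
        losses = (y - X @ beta) ** 2
        # S_t: indices of the k samples with the smallest residuals.
        S_t = np.argpartition(losses, k - 1)[:k]
        # Refit least squares on the trimmed set.
        beta, *_ = np.linalg.lstsq(X[S_t], y[S_t], rcond=None)
    return beta
```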

4.2 Theoretical Guarantees

The key idea for the analysis is similar to that for mean estimation: the selected set $S_t$ has a sufficiently large overlap with the set $S^\ast$ of $\lceil\alpha n\rceil$ "good" points with smallest noises, while the points in $S_t\setminus S^\ast$ are not that "bad" by the selection criterion. Therefore, the algorithm makes progress in each iteration.

Lemma 3. Under Assumption 1, given ITSM with $\alpha \ge \frac{4c_1}{1+4c_1}$, with probability at least $1 - \frac{1+4c_1}{4c_1 n}$ we have

$$\|\beta_{t+1} - \beta^\ast\|_2 \le \frac{1}{2}\|\beta_t - \beta^\ast\|_2 + 2c_1\sigma_{(\lceil\alpha n\rceil)}. \qquad (7)$$

Proof. First we introduce some notations. Define

$$\psi^+(k) = \max_{S:|S|=k}\lambda_{\max}\big(X_S^\top X_S\big),$$

where $X_S$ is the submatrix of $X$ consisting of rows indexed by $S$. Note that $\psi^+(k) \le k$, since for any $S$ of size $k$, by $\|x_i\|_2 = 1$, $\lambda_{\max}(X_S^\top X_S)$ is bounded by $\mathrm{Tr}(X_S^\top X_S) = \mathrm{Tr}(X_S X_S^\top) = \sum_{i\in S}\|x_i\|_2^2 = k$.

Denote by $W_t$ the diagonal matrix with $W_{t,ii} = 1$ if the $i$-th sample is in the set $S_t$ and $W_{t,ii} = 0$ otherwise. Let $S^\ast$ be a subset of $\{i : \sigma_i \le \sigma_{(\lceil\alpha n\rceil)}\}$ with size $\lceil\alpha n\rceil$, and denote by $W^\ast$ the diagonal matrix w.r.t. $S^\ast$.

Under Assumption 1, $X_{S_t}^\top X_{S_t} = X^\top W_t X$ is nonsingular, so we have $\beta_{t+1} = (X^\top W_t X)^{-1} X^\top W_t y$, where we have used $W_t^2 = W_t$. Then

$$\beta_{t+1} = \beta^\ast + (X^\top W_t X)^{-1} X^\top W_t\,\epsilon = \beta^\ast + (X^\top W_t X)^{-1} X^\top W_t\big((I - W^\ast) + W^\ast\big)\epsilon.$$

Therefore, the error can be bounded by:

$$\|\beta_{t+1} - \beta^\ast\|_2 = \big\|(X^\top W_t X)^{-1} X^\top\big(W_t(I - W^\ast) + W_t W^\ast\big)\epsilon\big\|_2 \le \frac{1}{\psi^-(\lceil\alpha n\rceil)}\Big(\underbrace{\big\|X^\top W_t(I - W^\ast)\epsilon\big\|_2}_{T_1} + \underbrace{\big\|X^\top W_t W^\ast\epsilon\big\|_2}_{T_2}\Big),$$

where we use the spectral norm inequality, the triangle inequality and the fact that $\mathrm{Tr}(W_t) = \lceil\alpha n\rceil$. In the following, we bound the two terms $T_1$ and $T_2$.

First, $T_1$ can be bounded as:

$$T_1 = \big\|X^\top W_t(I - W^\ast)\epsilon\big\|_2 \le \big\|X^\top W_t(I - W^\ast)X(\beta_t - \beta^\ast)\big\|_2 + \big\|X^\top W_t(I - W^\ast)(y - X\beta_t)\big\|_2$$
$$\le \psi^+(|S_t\setminus S^\ast|)\,\|\beta_t - \beta^\ast\|_2 + \big\|X^\top W_t(I - W^\ast)(y - X\beta_t)\big\|_2,$$

since $\mathrm{Tr}\big((I - W^\ast)W_t\big) = |S_t\setminus S^\ast|$.

By the fact that $|S_t\setminus S^\ast| = |S^\ast\setminus S_t|$, there exists a bijection between $S_t\setminus S^\ast$ and $S^\ast\setminus S_t$. Denote the image of $i \in S_t\setminus S^\ast$ as $k_i$. Since the loss $(y_i - x_i^\top\beta_t)^2$ of a sample in $S_t\setminus S^\ast$ is less than that of a sample in $S^\ast\setminus S_t$, we have $|y_i - x_i^\top\beta_t| \le |y_{k_i} - x_{k_i}^\top\beta_t|$ for any $i \in S_t\setminus S^\ast$. Hence we can write

$$\big\|X^\top W_t(I - W^\ast)(y - X\beta_t)\big\|_2 = \Big\|\sum_{i\in S_t\setminus S^\ast}(y_i - x_i^\top\beta_t)x_i\Big\|_2 \le \Big\|\sum_{i\in S_t\setminus S^\ast} n_i\big(y_{k_i} - x_{k_i}^\top\beta_t\big)x_i\Big\|_2$$
$$\le \Big\|\sum_{i\in S_t\setminus S^\ast} n_i\epsilon_{k_i}x_i\Big\|_2 + \Big\|\sum_{i\in S_t\setminus S^\ast} n_i x_i x_{k_i}^\top(\beta^\ast - \beta_t)\Big\|_2,$$

where $n_i$ is either $1$ or $-1$.

By the assumption that $\|x_i\|_2 = 1$,

$$\Big\|\sum_{i\in S_t\setminus S^\ast} n_i\epsilon_{k_i}x_i\Big\|_2 \le \sum_{i\in S_t\setminus S^\ast}|\epsilon_{k_i}| = \sum_{i\in S^\ast\setminus S_t}|\epsilon_i|.$$

Meanwhile,

$$\Big\|\sum_{i\in S_t\setminus S^\ast} n_i x_i x_{k_i}^\top(\beta^\ast - \beta_t)\Big\|_2 \le s_{\max}\cdot\|\beta^\ast - \beta_t\|_2,$$

where $s_{\max} := s_{\max}\big(\sum_{i\in S_t\setminus S^\ast} n_i x_i x_{k_i}^\top\big)$ denotes the maximum singular value and is bounded as follows.

Claim 1. $s_{\max} \le \psi^+(|S_t\setminus S^\ast|)$.

Proof. The proof is implicit in the proof of Theorem 7 in Shen and Sanghavi (2019). With matrix notations, $\sum_{i\in S_t\setminus S^\ast} n_i x_i x_{k_i}^\top$ can be written as $X^\top W_t(I - W^\ast)NPX$, where $N$ is diagonal with diagonal entries in $\{1, -1\}$ and $P$ is some permutation matrix. Then

$$s_{\max} = \max_{\|u\|_2=1,\,\|v\|_2=1} u^\top X^\top W_t(I - W^\ast)NPXv.$$

Let $\tilde u = Xu$ and $\tilde v = Xv$; then $s_{\max}$ is bounded by

$$\sum_{i\in S_t\setminus S^\ast}|\tilde u_{r_i}\tilde v_{t_i}| \le \max\Big\{\sum_{i\in S_t\setminus S^\ast}\tilde u_{r_i}^2,\ \sum_{i\in S_t\setminus S^\ast}\tilde v_{t_i}^2\Big\}$$

for some index sequences $\{r_i\}$ and $\{t_i\}$. So $s_{\max}$ is bounded by the larger of the maximum singular values of $X^\top W_t(I - W^\ast)X$ and $X^\top P^\top N^\top W_t(I - W^\ast)NPX$, which is bounded by $\psi^+(|S_t\setminus S^\ast|)$.

With this claim, we can derive

$$T_1 \le \sum_{i\in S^\ast\setminus S_t}|\epsilon_i| + 2\psi^+(|S_t\setminus S^\ast|)\,\|\beta^\ast - \beta_t\|_2.$$

Next, $T_2$ can be bounded as:

$$T_2 = \Big\|\sum_{i\in S_t\cap S^\ast}\epsilon_i x_i\Big\|_2 \le \sum_{i\in S_t\cap S^\ast}|\epsilon_i|.$$

Combining the inequalities above, we have

$$\|\beta_{t+1} - \beta^\ast\|_2 \le \frac{1}{\psi^-(\lceil\alpha n\rceil)}(T_1 + T_2) \le \frac{1}{\psi^-(\lceil\alpha n\rceil)}\Big(\sum_{i\in S^\ast}|\epsilon_i| + 2\psi^+(|S_t\setminus S^\ast|)\,\|\beta^\ast - \beta_t\|_2\Big)$$
$$= \underbrace{\frac{2\psi^+(|S_t\setminus S^\ast|)}{\psi^-(\lceil\alpha n\rceil)}}_{\kappa}\,\|\beta^\ast - \beta_t\|_2 + \frac{\sum_{i\in S^\ast}|\epsilon_i|}{\psi^-(\lceil\alpha n\rceil)}.$$

By assumption, there exists a constant $c_1$ such that $\frac{\alpha n}{\psi^-(\lceil\alpha n\rceil)} \le c_1$ and $\frac{\psi^+(|S_t\setminus S^\ast|)}{\psi^-(\lceil\alpha n\rceil)} \le c_1\cdot\frac{|S_t\setminus S^\ast|}{\lceil\alpha n\rceil}$. Without loss of generality, assume $\alpha n$ is an integer. Since $|S_t\setminus S^\ast| \le (1-\alpha)n$, when $\alpha \ge \frac{4c_1}{1+4c_1}$, $\kappa \le 2c_1\cdot\frac{1-\alpha}{\alpha} \le \frac{1}{2}$. And $\sum_{i\in S^\ast}|\epsilon_i|$ is bounded by $2|S^\ast|\sigma_{(\lceil\alpha n\rceil)}$ with probability at least $1 - \frac{1}{|S^\ast|}$, using Markov's inequality. Combining the inequalities completes the proof.

Theorem 2. Under Assumption 1, given ITSM with $\alpha \ge \frac{4c_1}{1+4c_1}$ and $T = \Theta\big(\log_2\frac{\|\beta_0 - \beta^\ast\|_2}{\sigma_{(\lceil\alpha n\rceil)}}\big)$, it holds that

$$\|\beta_T - \beta^\ast\|_2 \le c\,c_1\sigma_{(\lceil\alpha n\rceil)} \qquad (8)$$

with probability at least $1 - T\frac{1+4c_1}{4c_1 n}$.

Proof. It follows from Lemma 3 by an argument similar to that for Theorem 1.

Remark 7. When the $x_i$'s are i.i.d. spherical Gaussians, Assumption 1 can be satisfied with $c_1$ close to $1$. Then we require $\alpha \ge 4/5$ and the error bound holds with probability $\ge 1 - 5T/(4n)$, similar to that for mean estimation. Also, (8) is obtained without extra assumptions on the noise $\epsilon_i$ except for assuming its second order moment exists. If $\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)$, the previous bound holds with a higher probability $1 - e^{-c'n}$.

Remark 8. We now discuss existing results. Pensia et al. (2019) also adapted its mean estimation methodology to linear regression. When the $x_i$'s are from a multivariate Gaussian with covariance matrix $\Sigma$, w.h.p. it obtained the error bound $\frac{c'n\,\sigma_{(cd\log n)}}{\sqrt{\lambda_{\min}(\Sigma)}}$. Again, the result depends on $\sigma_{(cd\log n)}$ with an additional factor $n$, while ours depends on $\sigma_{(\alpha n)}$ for a constant $\alpha$ (our bound will also have a $\frac{1}{\sqrt{\lambda_{\min}(\Sigma)}}$ factor for such Gaussians). However, their run time is $O(n^d)$, exponential in the dimension $d$, while ours is polynomial.

There also exist studies for robust linear regression in the adversarial setting, where a small fraction of points are corrupted by an adversary (e.g., Bhatia et al. (2015); Liu et al. (2018); Shen and Sanghavi (2019); Diakonikolas et al. (2019b)). It is unclear if their results directly apply to our setting, since they have additional assumptions on the data.

5 Experiments

5.1 Mean Estimation

To validate Theorem 1, we repeat the following procedure: first generate samples $\{x_i\}_{i=1}^n$ under the designed distributions $\{F_i\}_{i=1}^n$, then run Algorithm 1 with fraction $\alpha = \frac{4}{5}$ and $T = 20$ iterations, and finally output the error $\|\mu_T - \mu^\ast\|$. We report the average error over $R$ repetitions; we set $R = 200$ for the univariate case and $R = 20$ for the multivariate case.

Figure 1: Univariate mean estimation. Left: setting 1; Right: setting 2. x-axis: sample size; y-axis: error after T = 20 iterations.

Figure 2: Multivariate mean estimation. Left: setting 3; Right: setting 4. x-axis: sample size; y-axis: error after T = 20 iterations.

For comparison, we also report the error of the empirical mean over the first $\alpha n$ samples with smallest variance (or norm of the covariance matrix), which is the estimator given an oracle knowing all the covariances, and is thus referred to as the Oracle Mean (OM). It is easy to see that this average only depends on $\{\sigma_{(i)}\}_{i=1}^{\alpha n}$ (or $\{\lambda_{(i)}\}_{i=1}^{\alpha n}$) and not on the rest of the $(1-\alpha)n$ samples.

Univariate Case. We first present experiments for $d = 1$ under two designed settings of the entangled distributions $\{F_i\}_{i=1}^n$.

Setting 1: For $i \le \alpha n$, $F_i = \mathcal{N}(0, 1)$, and $F_i = \mathcal{N}(0, i^2)$ for $i > \alpha n$.

Setting 2: For $i \le \alpha n$, $F_i = \mathcal{N}(0, (\log i)^2)$, and $F_i = \mathcal{N}(0, i^2)$ for $i > \alpha n$.

Figure 1 shows the performance of ITM and OM. In both settings, ITM obtains roughly the same average error as OM, showing that the error of ITM does not depend on the $\sigma_i \ge \sigma_{(\lceil\alpha n\rceil)}$. Besides, ITM actually converges to the truth $\mu^\ast$, at the same rate as OM. This suggests ITM can achieve a vanishing error, at least in some special cases. This is left as a future direction.
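For concreteness, here is a sketch of how the univariate settings can be simulated and compared against the oracle mean, reusing the `itm_mean` sketch from Section 3.1 (this is our reading of the setup; the paper does not list its experimental code, and the sample size and seed below are arbitrary):

```python
import numpy as np

def sample_univariate(n, alpha, setting, rng):
    """Draw x_i ~ F_i for Setting 1 or Setting 2 (common mean mu* = 0)."""
    i = np.arange(1, n + 1).astype(float)
    low = np.ones(n) if setting == 1 else np.log(i)        # std of the low-noise points
    sigma = np.where(i <= alpha * n, low, i)               # high-noise points: std = i
    return rng.normal(0.0, sigma), sigma

rng = np.random.default_rng(0)
n, alpha, T = 5000, 0.8, 20
x, sigma = sample_univariate(n, alpha, setting=1, rng=rng)

mu_itm = itm_mean(x[:, None], alpha=alpha, T=T)[0]
k = int(np.ceil(alpha * n))
mu_om = x[np.argsort(sigma, kind="stable")[:k]].mean()     # oracle mean over the alpha*n least-noisy points
print(abs(mu_itm), abs(mu_om))                             # errors w.r.t. mu* = 0
```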

Multivariate Case. For the multivariate case, we set $d = 10$ and provide two simulation settings of $\{F_i\}_{i=1}^n$.

Setting 3 (radially symmetric): $F_i = \mathcal{N}(0, I_{10})$ for $i \le \alpha n$, and $F_i = \mathcal{N}(0, 100\cdot I_{10})$ for $i > \alpha n$.

Setting 4 (radially asymmetric): Let $\Sigma_0$ be a random positive definite matrix generated as follows. (1) Set the diagonal entries to $1$. (2) Set each off-diagonal entry to zero or nonzero with equal probability; each nonzero off-diagonal entry is sampled from the uniform distribution on the interval $(-0.5, 0.5)$. We then make the matrix symmetric by forcing the lower triangular part to equal the upper triangular part, and add a diagonal matrix $cI_{10}$ to make sure it is positive definite, where $c$ is chosen to make the smallest eigenvalue of the matrix equal to $0.2$. Finally, let $F_i = \mathcal{N}(0, \Sigma_0)$ for $i \le \alpha n$ and $F_i = \mathcal{N}(0, 100\cdot\Sigma_0)$ for $i > \alpha n$. A sketch of this construction is given below.
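The following NumPy sketch implements our reading of the Setting 4 recipe for $\Sigma_0$; the helper name and the seed are our choices, and computing the shift $c$ directly from the smallest eigenvalue is one way to realize the stated requirement.

```python
import numpy as np

def make_sigma0(d=10, target_min_eig=0.2, rng=None):
    """Random positive definite covariance following the Setting 4 recipe."""
    if rng is None:
        rng = np.random.default_rng()
    S = np.eye(d)                                    # (1) unit diagonal
    nonzero = rng.random((d, d)) < 0.5               # (2) each off-diagonal entry zero or nonzero w.p. 1/2
    vals = rng.uniform(-0.5, 0.5, size=(d, d))       #     nonzero entries ~ Uniform(-0.5, 0.5)
    upper = np.triu(np.where(nonzero, vals, 0.0), k=1)
    S = S + upper + upper.T                          # symmetrize: lower triangle mirrors the upper
    c = target_min_eig - np.linalg.eigvalsh(S)[0]    # shift so the smallest eigenvalue equals 0.2
    return S + c * np.eye(d)

Sigma0 = make_sigma0(rng=np.random.default_rng(0))
print(np.linalg.eigvalsh(Sigma0)[0])                 # ~0.2 by construction
```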

Figure 2 shows the results. Again, in both settings, ITM obtains roughly the same average error as OM. The results for setting 4 suggest that the radial symmetry assumption may not be needed.

5.2 Linear Regression

We now consider the linear regression problem with $d = 100$. We generate data according to the model (5), where each row of the matrix $X$ is independently sampled from $\mathcal{N}(0, I_d)$ and we choose $\beta^\ast$ to be a random vector with $\ell_2$ norm $1$. The noise vector is generated s.t. $\epsilon_i \sim \mathcal{N}(0, 1)$ for $i \le \alpha n$ and $\epsilon_i \sim \mathcal{N}(0, 100)$ for $i > \alpha n$. We repeatedly generate data and run ITSM for $R = 20$ times, then report the average errors of both ITSM and the least squares estimator over the first $\alpha n$ samples with smallest noise variances, which we refer to as Oracle Least Squares (OLS). We also set $\alpha = \frac{4}{5}$ and $T = 20$ for simplicity, though the smallest possible value of $\alpha$ in Lemma 3 depends on $X$.

Figure 3: Linear Regression. x-axis: sample size; y-axis: error after T = 20 iterations.

Figure 3 shows the results. Similar to mean estimation, the iterative trimming method has error closely tracking that of the oracle method. This again verifies the effectiveness of iterative trimming and provides positive support for our analysis.
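A sketch of this regression experiment as we read it, reusing the `itsm` sketch from Section 4.1; the seed, the sample size, and the explicit oracle comparison below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, T = 2000, 100, 0.8, 20

X = rng.normal(size=(n, d))                      # rows ~ N(0, I_d)
beta_star = rng.normal(size=d)
beta_star /= np.linalg.norm(beta_star)           # ||beta*||_2 = 1

# Heteroscedastic noise: variance 1 for the first alpha*n points, variance 100 for the rest.
noise_std = np.where(np.arange(n) < alpha * n, 1.0, 10.0)
y = X @ beta_star + rng.normal(size=n) * noise_std

beta_itsm = itsm(X, y, alpha=alpha, T=T)
k = int(np.ceil(alpha * n))
oracle_idx = np.argsort(noise_std, kind="stable")[:k]     # alpha*n samples with the smallest noise variance
beta_ols, *_ = np.linalg.lstsq(X[oracle_idx], y[oracle_idx], rcond=None)

print(np.linalg.norm(beta_itsm - beta_star),
      np.linalg.norm(beta_ols - beta_star))
```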

Acknowledgement

This work was supported in part by FA9550-18-1-0166. The authors would also like to acknowledge the support provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

References

Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory, pages 458-469. Springer, 2005.

Sivaraman Balakrishnan, Simon S Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Conference on Learning Theory, pages 169-212, 2017.

Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 103-112. IEEE, 2010a.

Mikhail Belkin and Kaushik Sinha. Toward learning gaussian mixtures with arbitrary separation. In COLT, pages 407-419. Citeseer, 2010b.

Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721-729, 2015.

Peter J Bickel et al. On some robust estimates of location. The Annals of Mathematical Statistics, 36(3):847-858, 1965.

Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774-782, 2013.

Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2755-2771. SIAM, 2019.

Flavio Chierichetti, Anirban Dasgupta, Ravi Kumar, and Silvio Lattanzi. Learning entangled single-sample gaussians. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 511-522. Society for Industrial and Applied Mathematics, 2014.

Sanjoy Dasgupta. Learning mixtures of gaussians. In 40th Annual Symposium on Foundations of Computer Science, pages 634-644. IEEE, 1999.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, pages 999-1008. JMLR.org, 2017.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2683-2702. Society for Industrial and Applied Mathematics, 2018a.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018b.

Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742-864, 2019a.

Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2745-2754. SIAM, 2019b.

Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849-911, 2008.

Susan D Horn, Roger A Horn, and David B Duncan. Estimating heteroscedastic variances in linear models. Journal of the American Statistical Association, 70(350):380-385, 1975.

Ola Hössjer. Exact computation of the least trimmed squares estimate in simple linear regression. Computational Statistics & Data Analysis, 19(3):265-282, 1995.

Peter J Huber. Robust statistics. Springer, 2011.

Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 553-562. ACM, 2010.

Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. The spectral method for general mixture models. In International Conference on Computational Learning Theory, pages 444-457. Springer, 2005.

Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241, 2018.

Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.

Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 93-102. IEEE, 2010.

David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. On the least trimmed squares estimator. Algorithmica, 69(1):148-183, 2014.

Ankit Pensia, Varun Jog, and Po-Ling Loh. Estimating location parameters in entangled single-sample distributions. arXiv preprint arXiv:1907.03087, 2019.

Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

C Radhakrishna Rao. Estimation of heteroscedastic variances in linear models. Journal of the American Statistical Association, 65(329):161-172, 1970.

Peter J Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871-880, 1984.

Peter J Rousseeuw and Katrien Van Driessen. Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12(1):29-45, 2006.

Arora Sanjeev and Ravi Kannan. Learning mixtures of arbitrary gaussians. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 247-257. ACM, 2001.

Fumin Shen, Chunhua Shen, Anton van den Hengel, and Zhenmin Tang. Approximate least trimmed sum of squares fitting and applications in image analysis. IEEE Transactions on Image Processing, 22(5):1836-1847, 2013.

Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pages 5739-5748, 2019.

John W Tukey and Donald H McLaughlin. Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A, pages 331-352, 1963.

Daniel Vainsencher, Shie Mannor, and Huan Xu. Ignoring is a bliss: Learning with large noise through reweighting-minimization. In Conference on Learning Theory, pages 1849-1881, 2017.

Leslie G Valiant. Learning disjunction of conjunctions. In IJCAI, pages 560-566. Citeseer, 1985.

Eunho Yang, Aurélie C Lozano, Aleksandr Aravkin, et al. A general family of trimmed estimators for robust high-dimensional data analysis. Electronic Journal of Statistics, 12(2):3519-3553, 2018.