A short note on the median-of-means estimator

Yen-Chi Chen
University of Washington

July 13, 2020

The main reference of this short note is Chapter 3 of

[L2019] Lerasle, M. (2019). Lecture notes: Selected topics on robust statistical learning theory. arXiv preprint arXiv:1908.10761.

The median-of-means (MoM) is an old method, but it has received a lot of attention these days. The MoM estimator gives a nice concentration bound around the target under mild conditions, and it can be applied to an empirical process as well (this is called a median-of-means process in [L2019]).

1 MoM estimator of the mean

Consider a simple scenario where we observe $X_1, \cdots, X_n \sim F$. The goal is to estimate the mean of the underlying distribution, $\mu_0 = E(X_1) = \int x\, dF(x)$. The MoM estimator works as follows. Assume that the sample size factorizes as $n = K \cdot B$, where $K$ is the number of subsamples and $B$ is the size of each subsample. We first randomly split the data into $K$ subsamples and compute the mean within each subsample, which leads to estimators

$\hat\mu_1, \cdots, \hat\mu_K$, each based on $B$ observations. The MoM estimator is the median of all these estimators, i.e.,

\[
\hat\mu_{\rm MoM} = \text{median}(\hat\mu_1, \cdots, \hat\mu_K).
\]
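The construction is easy to implement. Below is a minimal sketch in Python (NumPy only); the function name mom_mean and the convention of dropping the remainder when K does not divide n are illustrative choices, not a reference implementation.

import numpy as np

def mom_mean(x, K, rng=None):
    """Median-of-means estimate of E[X] from a 1-d sample x using K subsamples."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    B = x.size // K                        # subsample size; any remainder is dropped
    idx = rng.permutation(x.size)[:K * B]  # random split of the data
    block_means = x[idx].reshape(K, B).mean(axis=1)  # \hat\mu_1, ..., \hat\mu_K
    return np.median(block_means)                    # \hat\mu_MoM

For instance, mom_mean(x, K=20) splits the sample into 20 subsamples of size n // 20 and returns the median of the 20 subsample means.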

Why is this estimator useful? It turns out that even under a very mild condition, $\text{Var}(X_1) = \sigma^2 < \infty$, the MoM estimator satisfies a nice concentration inequality in the finite-sample regime.

Proposition 1 Assume that $\text{Var}(X_1) = \sigma^2 < \infty$. Then the MoM estimator has the following property:

\[
P(|\hat\mu_{\rm MoM} - \mu_0| > \varepsilon) \le e^{-2K\left(\frac{1}{2} - \frac{K\sigma^2}{n\varepsilon^2}\right)^2} = e^{-2\frac{n}{B}\left(\frac{1}{2} - \frac{\sigma^2}{B\varepsilon^2}\right)^2}
\]
for every $n = K \cdot B$.

Proposition 1 implies the following two forms, corresponding to different choices of $\varepsilon$. When we choose $\varepsilon = \sigma\sqrt{(2+\delta)K/n} = \sigma\sqrt{(2+\delta)/B}$, we obtain

\[
P\left(|\hat\mu_{\rm MoM} - \mu_0| > \sigma\sqrt{(2+\delta)K/n}\right) \le e^{-\frac{K\delta^2}{2(2+\delta)^2}}, \tag{1}
\]

and when we further choose $\delta = 2$, it becomes
\[
P\left(|\hat\mu_{\rm MoM} - \mu_0| > 2\sigma\sqrt{K/n}\right) \le e^{-K/8}. \tag{2}
\]
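As a concrete illustration of equation (2) (with hypothetical numbers): for $n = 10{,}000$ and $K = 200$ (so $B = 50$), the MoM estimator satisfies $|\hat\mu_{\rm MoM} - \mu_0| \le 2\sigma\sqrt{K/n} \approx 0.28\,\sigma$ with probability at least $1 - e^{-25} \approx 1 - 1.4 \times 10^{-11}$, without any assumption beyond a finite variance.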

A powerful feature of Proposition 1 is the finite-sample exponential concentration, which is not easy to obtain when we only assume that the variance exists. The $X_i$'s can be unbounded in this case (when $X_i$ is bounded, Hoeffding's inequality already implies such a concentration).
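To see this concentration at work, here is a small simulation sketch comparing the sample mean and the MoM estimator on a heavy-tailed distribution with finite variance. It reuses the hypothetical mom_mean helper sketched above; the Student-t degrees of freedom, the sample sizes, and the quantile level are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, K, reps = 10_000, 200, 2_000
mu0 = 0.0                               # true mean of the t-distribution used below

err_mean, err_mom = [], []
for _ in range(reps):
    x = rng.standard_t(df=2.5, size=n)  # heavy tails, but Var(X) = 2.5/0.5 = 5 < infinity
    err_mean.append(abs(x.mean() - mu0))
    err_mom.append(abs(mom_mean(x, K, rng) - mu0))

# Compare extreme quantiles of the two error distributions; the MoM error
# typically has the lighter tail, in line with Proposition 1.
print("sample mean, 99.9% error quantile:", np.quantile(err_mean, 0.999))
print("MoM,         99.9% error quantile:", np.quantile(err_mom, 0.999))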

Now we prove Proposition 1. The proof is simple and elegant.

Proof. First, observe that the event $\{|\hat\mu_{\rm MoM} - \mu_0| > \varepsilon\}$ implies that at least $K/2$ of the $\hat\mu_\ell$'s have to be more than $\varepsilon$ away from $\mu_0$. Namely,
\[
\{|\hat\mu_{\rm MoM} - \mu_0| > \varepsilon\} \subset \left\{\sum_{\ell=1}^{K} I(|\hat\mu_\ell - \mu_0| > \varepsilon) \ge \frac{K}{2}\right\}.
\]

Define $Z_\ell = I(|\hat\mu_\ell - \mu_0| > \varepsilon)$ and let $p_{\varepsilon,B} = E(Z_\ell) = P(|\hat\mu_\ell - \mu_0| > \varepsilon)$. Then the above implies
\begin{align*}
P(|\hat\mu_{\rm MoM} - \mu_0| > \varepsilon) &\le P\left(\sum_{\ell=1}^{K} Z_\ell \ge \frac{K}{2}\right)\\
&= P\left(\sum_{\ell=1}^{K} (Z_\ell - E(Z_\ell)) \ge \frac{K}{2} - K p_{\varepsilon,B}\right)\\
&= P\left(\frac{1}{K}\sum_{\ell=1}^{K} (Z_\ell - E(Z_\ell)) \ge \frac{1}{2} - p_{\varepsilon,B}\right).
\end{align*}

The key trick of the MoM estimator is that the $Z_\ell$'s are IID and bounded. So by Hoeffding's inequality (one-sided),
\[
P\left(\frac{1}{K}\sum_{\ell=1}^{K} (Z_\ell - E(Z_\ell)) \ge t\right) \le e^{-2Kt^2}.
\]
As a result,
\begin{align*}
P(|\hat\mu_{\rm MoM} - \mu_0| > \varepsilon) &\le P\left(\frac{1}{K}\sum_{\ell=1}^{K} (Z_\ell - E(Z_\ell)) \ge \frac{1}{2} - p_{\varepsilon,B}\right)\\
&\le e^{-2K\left(\frac{1}{2} - p_{\varepsilon,B}\right)^2}.
\end{align*}

To conclude the proof, note that the variance $\sigma^2 = \text{Var}(X_1) < \infty$, so Chebyshev's inequality implies
\[
p_{\varepsilon,B} = P(|\hat\mu_\ell - \mu_0| > \varepsilon) \le \frac{\sigma^2}{B\varepsilon^2} = \frac{K\sigma^2}{n\varepsilon^2}.
\]
So the bound becomes
\[
P(|\hat\mu_{\rm MoM} - \mu_0| > \varepsilon) \le e^{-2K\left(\frac{1}{2} - p_{\varepsilon,B}\right)^2} \le e^{-2K\left(\frac{1}{2} - \frac{K\sigma^2}{n\varepsilon^2}\right)^2}.
\]



2 MoM process: example of the KDE

The MoM approach can also be applied to an empirical process. To motivate the problem, we consider the kernel density estimator (KDE). The observations $X_1, \cdots, X_n$ are IID from an unknown distribution with a PDF $p$. With the full sample, the KDE is

\[
\hat p(x) = \frac{1}{nh^d}\sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right),
\]
where $K(\cdot)$ is a kernel function, such as a Gaussian (not to be confused with the number of subsamples, also denoted $K$), and $h > 0$ is the smoothing bandwidth.

Under the MoM procedure, we split the sample into $K$ subsamples and compute the KDE within each subsample, leading to $\hat p_1(x), \cdots, \hat p_K(x)$. The MoM KDE is
\[
\hat p_{\rm MoM}(x) = \text{median}(\hat p_1(x), \cdots, \hat p_K(x)).
\]
Note: this estimator may not be a density (it may not integrate to 1), but we can simply rescale it to resolve this issue.
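To make the construction concrete, here is a rough one-dimensional sketch in Python with a Gaussian kernel. The helper names kde and mom_kde, the equally spaced evaluation grid, and the final renormalization step are illustrative assumptions, not the estimator studied in [HBMV2020].

import numpy as np

def kde(data, grid, h):
    """One-dimensional Gaussian KDE evaluated on an equally spaced grid."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

def mom_kde(data, grid, h, K, rng=None):
    """Pointwise median of K subsample KDEs, rescaled to integrate to one on the grid."""
    rng = np.random.default_rng(rng)
    B = data.size // K
    idx = rng.permutation(data.size)[:K * B].reshape(K, B)  # random split into K subsamples
    curves = np.stack([kde(data[block], grid, h) for block in idx])
    p_mom = np.median(curves, axis=0)                   # pointwise median over subsamples
    return p_mom / (p_mom.sum() * (grid[1] - grid[0]))  # rescale so it integrates to ~1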

We will now show the following: suppose that the full-sample KDE satisfies a bound of the form

\[
P\left(\sup_x |\hat p(x) - p(x)| > \varepsilon\right) \le q_{\varepsilon,n}. \tag{3}
\]

Then we have a bound on $\hat p_{\rm MoM}$ of the form

\[
P\left(\sup_x |\hat p_{\rm MoM}(x) - p(x)| > \varepsilon\right) \le e^{-2K\left(\frac{1}{2} - q_{\varepsilon,B}\right)^2}. \tag{4}
\]

Derivation of equation (4). We use a derivation similar to the proof of Proposition 1. The following derivation is modified from

[HBMV2020] Humbert, P., Le Bars, B., Minvielle, L., & Vayatis, N. (2020). Robust Kernel Density Estimation with Median-of-Means principle. arXiv preprint arXiv:2006.16590.

Note the event inclusion
\[
\left\{\sup_x |\hat p_{\rm MoM}(x) - p(x)| > \varepsilon\right\} \subset \left\{\sup_x \sum_{\ell=1}^{K} I(|\hat p_\ell(x) - p(x)| > \varepsilon) > \frac{K}{2}\right\}.
\]

Using the fact that
\[
|\hat p_\ell(x) - p(x)| \le \sup_x |\hat p_\ell(x) - p(x)|,
\]
we have
\[
I(|\hat p_\ell(x) - p(x)| > \varepsilon) \le I\left(\sup_x |\hat p_\ell(x) - p(x)| > \varepsilon\right),
\]

which implies

\[
\sup_x \sum_{\ell=1}^{K} I(|\hat p_\ell(x) - p(x)| > \varepsilon) \le \sum_{\ell=1}^{K} I\left(\sup_x |\hat p_\ell(x) - p(x)| > \varepsilon\right).
\]
As a result,
\begin{align*}
\left\{\sup_x |\hat p_{\rm MoM}(x) - p(x)| > \varepsilon\right\} &\subset \left\{\sup_x \sum_{\ell=1}^{K} I(|\hat p_\ell(x) - p(x)| > \varepsilon) > \frac{K}{2}\right\}\\
&\subset \left\{\sum_{\ell=1}^{K} I\left(\sup_x |\hat p_\ell(x) - p(x)| > \varepsilon\right) > \frac{K}{2}\right\}.
\end{align*}

Define $Y_\ell = I(\sup_x |\hat p_\ell(x) - p(x)| > \varepsilon)$ and denote $E(Y_\ell) = P(\sup_x |\hat p_\ell(x) - p(x)| > \varepsilon) = q_{\varepsilon,B}$. Then we conclude that
\begin{align*}
P\left(\sup_x |\hat p_{\rm MoM}(x) - p(x)| > \varepsilon\right) &\le P\left(\sum_{\ell=1}^{K} Y_\ell \ge \frac{K}{2}\right)\\
&= P\left(\sum_{\ell=1}^{K} (Y_\ell - E(Y_\ell)) \ge \frac{K}{2} - K q_{\varepsilon,B}\right)\\
&= P\left(\frac{1}{K}\sum_{\ell=1}^{K} (Y_\ell - E(Y_\ell)) \ge \frac{1}{2} - q_{\varepsilon,B}\right)\\
&\le e^{-2K\left(\frac{1}{2} - q_{\varepsilon,B}\right)^2},
\end{align*}
which is equation (4).

Concrete rate under the KDE. Suppose the PDF $p$ belongs to a $\beta$-Hölder class. Then we have the following concentration under suitable conditions on $K$ and $h$:
\[
P\left(\sup_x |\hat p(x) - p(x)| > C_1\sqrt{\frac{\delta|\log h|}{nh^d}} + C_2 h^\beta\right) \le e^{-\delta},
\]
where $C_1$ and $C_2$ are constants depending on the Hölder class and the kernel function; see Lemma 1 of [HBMV2020]. Since each subsample plays the same role as a full sample of size $B$, choosing
\[
\varepsilon = C_1\sqrt{\frac{\log t\,|\log h|}{Bh^d}} + C_2 h^\beta
\]
with $t \ge 1$ (i.e., $\delta = \log t$), we obtain
\[
q_{\varepsilon,B} \le \frac{1}{t},
\]
which leads to
\[
P\left(\sup_x |\hat p_{\rm MoM}(x) - p(x)| > C_1\sqrt{\frac{\log t\,|\log h|}{Bh^d}} + C_2 h^\beta\right) \le e^{-2K\left(\frac{1}{2} - \frac{1}{t}\right)^2}.
\]

When $t = 4$, using the fact that $B = n/K$, we further obtain
\[
P\left(\sup_x |\hat p_{\rm MoM}(x) - p(x)| > C_1\sqrt{\frac{K\log 4\,|\log h|}{nh^d}} + C_2 h^\beta\right) \le e^{-K/8}.
\]

After some algebra, we conclude that
\[
\hat p_{\rm MoM}(x) - p(x) = O(h^\beta) + O_P\left(\sqrt{\frac{K|\log h|}{nh^d}}\right).
\]

If we keep the total number of subsamples $K$ fixed and let $n \to \infty$, the convergence rate is the same as that of the original KDE.

Robustness of the MoM KDE. Although the MoM KDE does not improve the convergence rate (in fact, it may have a slower rate depending on the choice of $K$), it improves the robustness of the density estimator against outliers. In [HBMV2020], the authors showed that as long as the number of outliers is strictly less than $K/2$, the MoM KDE will not be affected by these outliers. One intuition is that when the number of outliers is less than $K/2$, more than $K/2$ subsamples are not affected by the outliers, i.e., the clean subsamples occupy more than half of the subsamples.
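A quick way to see this robustness numerically is to contaminate a sample with a few far-away points and compare the ordinary KDE with the MoM KDE. The sketch below reuses the hypothetical kde and mom_kde helpers from the previous section; the contamination level and all constants are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
n, K, h = 2_000, 50, 0.3
clean = rng.normal(0.0, 1.0, size=n)
outliers = rng.normal(20.0, 0.1, size=10)   # 10 outliers, fewer than K/2 = 25
data = np.concatenate([clean, outliers])

grid = np.linspace(-5.0, 25.0, 600)
p_hat = kde(data, grid, h)                  # ordinary KDE: a spurious bump appears near x = 20
p_mom = mom_kde(data, grid, h, K, rng)      # MoM KDE: the bump is essentially suppressed

dx = grid[1] - grid[0]
print("mass above x = 10, ordinary KDE:", p_hat[grid > 10].sum() * dx)
print("mass above x = 10, MoM KDE     :", p_mom[grid > 10].sum() * dx)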

3 MoM process: general case

To conclude this note, we briefly describe a general MoM process. Suppose that the parameter of interest is $\mu(f) = E(f(X_1))$ for $f \in \mathcal{F}$. Then a full-sample estimator is
\[
\hat\mu(f) = \frac{1}{n}\sum_{i=1}^{n} f(X_i)
\]
for each $f \in \mathcal{F}$. Using the MoM procedure, we obtain $K$ subsample estimators (with $n = K \cdot B$)

$\hat\mu_1(f), \cdots, \hat\mu_K(f)$. The MoM estimator is then

\[
\hat\mu_{\rm MoM}(f) = \text{median}(\hat\mu_1(f), \cdots, \hat\mu_K(f)), \quad f \in \mathcal{F}.
\]
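For a finite family $\mathcal{F}$, the MoM process is simply the block-mean construction applied function by function. Below is a minimal sketch (Python/NumPy); the function name mom_process and the choice of $\mathcal{F}$ as the indicators $f_t(x) = I(x \le t)$ over a grid of thresholds (which yields a MoM version of the empirical CDF) are illustrative assumptions, not a construction taken from [L2019].

import numpy as np

def mom_process(x, funcs, K, rng=None):
    """MoM estimates of mu(f) = E[f(X)] for each f in a finite family funcs."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    B = x.size // K
    idx = rng.permutation(x.size)[:K * B].reshape(K, B)  # K subsamples of size B
    f_vals = np.stack([f(x) for f in funcs])              # f_vals[j, i] = f_j(X_i)
    block_means = f_vals[:, idx].mean(axis=2)              # subsample means, shape (|F|, K)
    return np.median(block_means, axis=1)                  # \hat\mu_MoM(f_j), one per f_j

# Example: a MoM empirical CDF over a grid of thresholds.
ts = np.linspace(-3.0, 3.0, 31)
funcs = [lambda x, t=t: (x <= t).astype(float) for t in ts]
x = np.random.default_rng(2).standard_t(df=3, size=5_000)
mom_cdf = mom_process(x, funcs, K=25)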

In practice, we are often interested in the $L_\infty$ loss of the estimator, i.e.,

\[
\sup_{f \in \mathcal{F}} |\hat\mu_{\rm MoM}(f) - \mu(f)|.
\]

Suppose that for the full-sample estimator $\hat\mu(f)$, we have
\[
P\left(\sup_{f \in \mathcal{F}} |\hat\mu(f) - \mu(f)| > \varepsilon\right) \le \eta_{\varepsilon,n},
\]
or equivalently,
\[
P\left(\sup_{f \in \mathcal{F}} |\hat\mu(f) - \mu(f)| > \phi_{\delta,n}\right) \le \frac{1}{\delta} \tag{5}
\]
for some $\delta > 2$.

Using the same derivation as for equation (4), one can show that equation (5) implies
\[
P\left(\sup_{f \in \mathcal{F}} |\hat\mu_{\rm MoM}(f) - \mu(f)| > \phi_{\delta,\,n/K}\right) \le e^{-2K\left(\frac{1}{2} - \frac{1}{\delta}\right)^2}. \tag{6}
\]

4 Remarks

Finite-sample exponential concentration without sub-Gaussianity. A feature of an MoM estimator is that it establishes a finite-sample exponential concentration without assuming a sub-Gaussian or boundedness condition. In Proposition 1, all we need is the existence of the second moment, which is a very mild condition. We allow for heavy tails, and the estimator still concentrates rapidly. The key is the use of the median, which makes the estimator more robust.

Optimal rate without a sub-Gaussian assumption in high dimensions. Following the previous point on the exponential concentration, one can show that the MoM estimator of a multivariate sample mean in $\mathbb{R}^d$ can achieve a rate of $O_P\left(\sqrt{\frac{\log d}{n}}\right)$ without assuming that $X$ is sub-Gaussian. All we need is a moment condition: at least $2\log d$ moments exist. See Theorem 48 and Remark 49 in [L2019].

Two major benefits of using an MoM estimator. To sum up, there are two major benefits of using an MoM estimator. First, it allows heavy-tailed distributions: we can still obtain an exponential concentration with only moment conditions. Second, it allows outliers in the data. As argued in the MoM KDE case, an MoM estimator is often more robust to outliers. This feature is not limited to the KDE case; the robustness against outliers also applies to other scenarios.
