EE290 Mathematics of Data Science                                Lecture 22 - 11/19/2019

Lecture 22: Robust Location Estimation

Lecturer: Jiantao Jiao                                           Scribe: Vignesh Subramanian

In this lecture, we get a historical perspective on the robust estimation problem and discuss Huber's work [1] on robust estimation of a location parameter. The Huber loss function is given by
\[
\rho_{\mathrm{Huber}}(t) =
\begin{cases}
\frac{1}{2}t^2, & |t| \le k, \\
k|t| - \frac{1}{2}k^2, & |t| > k.
\end{cases}
\tag{1}
\]
Here $k$ is a parameter, and the idea behind the loss function is to penalize outliers (beyond $k$) linearly instead of quadratically. Figure 1 shows the Huber loss function for $k = 1$.

[Figure 1: The green line plots the Huber loss function for $k = 1$, and the blue line plots the quadratic function $\frac{1}{2}t^2$.]

In this lecture we will get an intuitive understanding of the reasons behind the particular form of this function (quadratic in the interior, linear in the exterior, and convex), and we will see that this loss function is optimal for one-dimensional robust mean estimation in the Gaussian location model. First we describe the problem setting.

1 Problem Setting

Suppose we observe $X_1, X_2, \ldots, X_n$ i.i.d., where $X_i - \mu \sim F \in \mathcal{F}_\varepsilon$. Here,
\[
\mathcal{F}_\varepsilon = \{ F \mid F = (1 - \varepsilon)G + \varepsilon H, \; H \in \mathcal{M} \},
\tag{2}
\]
where $G \in \mathcal{M}$ is some fixed distribution function which is usually assumed to have zero mean, and $\mathcal{M}$ denotes the space of all probability measures. This describes the corruption model where the observed distribution is a convex combination of the true distribution $G$ and an arbitrary corruption distribution $H$. It is a location model since we assume $X - \mu$ has distribution $F$, where $\mu \in \mathbb{R}$ is unknown. The goal is to estimate the parameter $\mu$. First we must determine how to evaluate estimators; in the paper, Huber restricted his attention to M-estimators of the form
\[
\hat{\mu} = \arg\min_{t} \sum_{i=1}^{n} \rho(X_i - t).
\]
As an example, if $\rho(t) = \frac{1}{2}t^2$, then $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$, the empirical mean, which is sensitive to outliers. To evaluate estimators, Huber looks at asymptotics.

2 Asymptotics

Let $\psi(t) = \rho'(t)$. Then from the first-order condition of optimality, an optimizer $T_n$ must satisfy
\[
\sum_{i=1}^{n} \psi(X_i - T_n) = 0.
\tag{3}
\]
Assume for now $\mu = 0$ and $\mathbb{E}_F[\psi(X)] = 0$. This means that for the population version of (3), $T_n = 0$ is a solution. We now assume that $T_n \to 0$ as $n \to \infty$, and we give a proof sketch showing that $T_n$ is asymptotically normal and compute its asymptotic variance. From (3), using the first-order approximation of the term $\psi(X_i - T_n)$ around the point $X_i$ and the mean-value theorem, for some $0 \le \theta \le 1$ we have
\[
\sum_{i=1}^{n} \psi(X_i) - T_n \sum_{i=1}^{n} \psi'(X_i - \theta T_n) = 0.
\]
Rearranging, we get
\[
\sqrt{n}\, T_n = \frac{\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi(X_i)}{\frac{1}{n} \sum_{i=1}^{n} \psi'(X_i - \theta T_n)}.
\]
Since we have $\mathbb{E}_F[\psi(X)] = 0$, the numerator converges weakly, by the Central Limit Theorem, to $N \sim \mathcal{N}(0, \mathbb{E}_F[\psi(X)^2])$. Further, since we assumed $T_n \to 0$ as $n \to \infty$, the denominator converges, by the weak law of large numbers, to $\mathbb{E}_F[\psi'(X)]$. Thus we have
\[
\sqrt{n}\,(T_n - 0) \xrightarrow{w} \mathcal{N}\!\left(0, \; \frac{\mathbb{E}_F[\psi(X)^2]}{(\mathbb{E}_F[\psi'(X)])^2}\right).
\]
One basic result for M-estimators is that the maximum likelihood estimator achieves the smallest asymptotic variance among all M-estimators. We provide a proof below. Letting $f(x)$ denote the density function of $F$ (with support contained in $[a, b]$), we have
\[
\mathbb{E}_F[\psi'(X)] = \int_a^b f(x)\,\psi'(x)\,dx
= \Big[ f(x)\,\psi(x) \Big]_a^b - \int_a^b \psi(x)\, f'(x)\,dx.
\]
If we assume that $f(a) = f(b) = 0$, then we have
\[
\mathbb{E}_F[\psi'(X)] = -\int_a^b \psi(x)\, f'(x)\,dx.
\]
Thus,
\[
\frac{\mathbb{E}_F[\psi(X)^2]}{(\mathbb{E}_F[\psi'(X)])^2}
= \frac{\int_a^b \psi(x)^2 f(x)\,dx}{\left( \int_a^b \psi(x)\,\frac{f'(x)}{f(x)}\, f(x)\,dx \right)^2}
\ge \frac{1}{\int_a^b \left( \frac{f'(x)}{f(x)} \right)^2 f(x)\,dx},
\]
where we used the Cauchy-Schwarz inequality; the right-hand side is the inverse of the Fisher information of $F$. Observe that the right-hand side does not depend on $\psi$, and the inequality is tight when $\psi(x) \propto -\frac{f'(x)}{f(x)}$, which (after rescaling $\psi$) corresponds to $f(t) = e^{-\rho(t) - A}$ for some constant $A$. Thus the optimal $\rho$ is $-\log f$ up to constants, i.e., minimizing $\sum_{i} \rho(X_i - t)$ with this choice of $\rho$ is exactly maximum likelihood estimation. When $f(x)$ is a Gaussian density, $\rho$ is the squared loss and the optimizer $T_n$ is the empirical mean.
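As a quick numerical illustration of the M-estimation recipe above (a minimal sketch in Python with NumPy and SciPy, which are not part of the original notes): on $\varepsilon$-contaminated Gaussian data the empirical mean is dragged toward the outliers, while the Huber M-estimator, obtained by numerically minimizing $\sum_i \rho_{\mathrm{Huber}}(X_i - t)$ over $t$, stays near $\mu$. The contamination scheme and the tuning constant $k = 1.345$ are illustrative assumptions, not values from the lecture.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

def huber_loss(t, k):
    # Equation (1): quadratic inside [-k, k], linear outside.
    return np.where(np.abs(t) <= k, 0.5 * t**2, k * np.abs(t) - 0.5 * k**2)

def huber_m_estimate(x, k):
    # M-estimator: minimize the total Huber loss over the location t.
    return minimize_scalar(lambda t: np.sum(huber_loss(x - t, k))).x

rng = np.random.default_rng(0)
n, eps, mu = 1000, 0.05, 2.0
x = mu + rng.standard_normal(n)        # clean part: N(mu, 1)
x[rng.random(n) < eps] = 50.0          # contamination H: a far-away point mass

print("empirical mean :", np.mean(x))                   # pulled toward 50
print("Huber estimate :", huber_m_estimate(x, 1.345))   # close to mu = 2
\end{verbatim}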
3 Two-player game and Huber's Theorem

Consider a two-player game with payoff function given by $-V(\psi, F)$, where $V(\psi, F)$ denotes the asymptotic variance derived in Section 2. Here $\psi$ is the action chosen by the statistician to maximize the payoff (minimize the asymptotic variance) and $F$ is chosen by the adversary to minimize the payoff (maximize the asymptotic variance).

Theorem 1. Assume $G$ is symmetric around $0$ and log-concave, with density function $g(x)$ having convex support. Define
\[
\mathcal{F}_S = \{ F \mid F = (1 - \varepsilon)G + \varepsilon H, \; H \text{ symmetric around } 0 \}.
\tag{4}
\]
The two-player game under the assumptions described above has a saddle point $(\psi_0, F_0)$, i.e.,
\[
\sup_{F \in \mathcal{F}_S} V(\psi_0, F) = V(\psi_0, F_0) = \inf_{\psi} V(\psi, F_0).
\]

First we describe the form of $f_0(x)$, the density function of $F_0$. Let $[t_0, t_1]$ be the interval where $\left|\frac{g'(x)}{g(x)}\right| \le k$. We know that this interval exists since $g(x)$ is log-concave with convex support. Here $k$ is the solution to the equation
\[
\frac{1}{1 - \varepsilon} = \int_{t_0}^{t_1} g(t)\,dt + \frac{g(t_0) + g(t_1)}{k}.
\tag{5}
\]
Then,
\[
f_0(t) =
\begin{cases}
(1 - \varepsilon)\, g(t_0)\, e^{k(t - t_0)}, & t \le t_0, \\
(1 - \varepsilon)\, g(t), & t_0 < t < t_1, \\
(1 - \varepsilon)\, g(t_1)\, e^{-k(t - t_1)}, & t \ge t_1,
\end{cases}
\tag{6}
\]
\[
\psi_0(t) = -\frac{f_0'(t)}{f_0(t)}.
\tag{7}
\]
Before we look at the proof of this theorem, we look at an example.

Example 2. Let $g(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$. Then $-t_0 = t_1 = k$. We can solve for $k$ either by binary search or by line search, using the equation
\[
\frac{1}{1 - \varepsilon} = \int_{-k}^{k} g(t)\,dt + \frac{2 g(k)}{k}.
\]
The optimal loss function to use in this case is the Huber loss function given by (1),
\[
\rho_{\mathrm{Huber}}(t) =
\begin{cases}
\frac{1}{2}t^2, & |t| \le k, \\
k|t| - \frac{1}{2}k^2, & |t| > k.
\end{cases}
\]
Note that for a generic distribution $g(t)$ the dependence of $t_0$ and $t_1$ on $k$ can be highly non-linear, and it is not easy to solve for $k$ using (5).
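For the Gaussian case of Example 2, however, the equation above is easy to solve numerically: its left-hand side $\Phi(k) - \Phi(-k) + 2g(k)/k$ decreases monotonically from $+\infty$ to $1$ as $k$ grows, so for any $0 < \varepsilon < 1$ there is a unique root, which bisection finds. Below is a minimal sketch (Python with SciPy assumed; the helper name solve_huber_k and the bracket $[10^{-6}, 10]$ are illustrative choices, not from the lecture).

\begin{verbatim}
from scipy.stats import norm

def solve_huber_k(eps, lo=1e-6, hi=10.0, tol=1e-10):
    """Solve Phi(k) - Phi(-k) + 2 g(k)/k = 1/(1 - eps) for k by bisection."""
    target = 1.0 / (1.0 - eps)
    lhs = lambda k: norm.cdf(k) - norm.cdf(-k) + 2.0 * norm.pdf(k) / k
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid) > target:   # lhs is decreasing in k, so the root lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for eps in (0.01, 0.05, 0.10):
    print(eps, solve_huber_k(eps))   # k shrinks as the contamination level grows
\end{verbatim}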
Next we look at the proof of Theorem 1.

Proof. First we verify that the distribution $H_0$ determined by $F_0$ and $G$ is indeed a distribution, i.e., its density $h_0(t)$ is non-negative and integrates to one. From $F_0 = (1 - \varepsilon)G + \varepsilon H_0$ we have
\[
\varepsilon\, h_0(t) =
\begin{cases}
(1 - \varepsilon)\left( g(t_0)\, e^{k(t - t_0)} - g(t) \right), & t \le t_0, \\
0, & t_0 < t < t_1, \\
(1 - \varepsilon)\left( g(t_1)\, e^{-k(t - t_1)} - g(t) \right), & t \ge t_1.
\end{cases}
\tag{8}
\]
Since $g(t)$ and $f_0(t)$ each integrate to one, $h_0(t)$ integrates to one. To show non-negativity of $h_0(t)$ we use the fact that $g(t)$ is log-concave, which implies that $-\log g(t)$ is a convex function. For any $t \le t_0$,
\[
-\log g(t) \ge -\log g(t_0) - k(t - t_0)
\quad \Longrightarrow \quad
g(t) \le g(t_0)\, e^{k(t - t_0)},
\]
where we used the facts that $\frac{g'(t_0)}{g(t_0)} = k$ and $(\log g(t))' = \frac{g'(t)}{g(t)}$. The proof for the case $t \ge t_1$ follows via a similar argument. (A numerical check of this construction for the Gaussian case is sketched at the end of these notes.)

Next we need to show that $(\psi_0, F_0)$ is indeed a saddle point. We have
\[
V(\psi_0, F_0) = \inf_{\psi} V(\psi, F_0),
\]
because for the given $F_0$, $\psi_0$ is optimal: it makes the optimizer the maximum likelihood estimator, as discussed in Section 2. Next we show that
\[
V(\psi_0, F_0) = \sup_{F \in \mathcal{F}_S} V(\psi_0, F).
\]
For any $F \in \mathcal{F}_S$ we have
\[
V(\psi_0, F) = \frac{\mathbb{E}_F[\psi_0(X)^2]}{(\mathbb{E}_F[\psi_0'(X)])^2}.
\]
We can rewrite the numerator as
\[
\mathbb{E}_F[\psi_0(X)^2] = (1 - \varepsilon)\,\mathbb{E}_G[\psi_0(X)^2] + \varepsilon\,\mathbb{E}_H[\psi_0(X)^2]
\le (1 - \varepsilon)\,\mathbb{E}_G[\psi_0(X)^2] + \varepsilon k^2,
\]
where we upper bound $\mathbb{E}_H[\psi_0(X)^2]$ using $\psi_0(t) = -\frac{f_0'(t)}{f_0(t)}$ and the form of $f_0(t)$ from (6), which gives $|\psi_0(t)| = k$ for $t \le t_0$ or $t \ge t_1$ and $|\psi_0(t)| = \left|\frac{g'(t)}{g(t)}\right| \le k$ for $t_0 < t < t_1$. Note that $F_0$ results in $h_0(t) = 0$ for $t_0 < t < t_1$, so $H_0$ puts all of its mass where $\psi_0(t)^2 = k^2$ and thus maximizes the numerator.

Similarly, the denominator can be written as
\[
(\mathbb{E}_F[\psi_0'(X)])^2 = \left( (1 - \varepsilon)\,\mathbb{E}_G[\psi_0'(X)] + \varepsilon\,\mathbb{E}_H[\psi_0'(X)] \right)^2
\ge \left( (1 - \varepsilon)\,\mathbb{E}_G[\psi_0'(X)] \right)^2,
\]
where we used the facts that $\psi_0' \ge 0$ pointwise and $\psi_0'(t) = 0$ for $t \le t_0$ or $t \ge t_1$. Again, note that $F_0$ results in $h_0(t) = 0$ for $t_0 < t < t_1$, so $H_0$ puts no mass where $\psi_0' > 0$ and thus minimizes the denominator. Thus $F_0$ is the maximizer of $V(\psi_0, F)$ among all $F \in \mathcal{F}_S$.

4 Summary

There were several criticisms of Huber's work, including the assumptions that $G$ and $H$ are symmetric, and the requirement that $\varepsilon$ be known in order to compute the Huber loss. Further, in higher dimensions the breakdown point scales as $\frac{1}{1+d}$, which is undesirable. (From Wikipedia: intuitively, the breakdown point of an estimator is the proportion of incorrect observations (e.g., arbitrarily large observations) an estimator can handle before giving an incorrect (e.g., arbitrarily large) result.)
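Finally, as the numerical sanity check referenced in the proof of Theorem 1: the sketch below (Python with NumPy/SciPy assumed; not part of the original notes) builds $f_0$ from (6) for the Gaussian case of Example 2, confirms that it integrates to one when $k$ solves (5), and checks that $f_0 - (1-\varepsilon)g \ge 0$, so that $h_0$ in (8) is a valid density.

\begin{verbatim}
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

eps = 0.05
# k from equation (5) specialized to the Gaussian case (t0 = -k, t1 = k).
k = brentq(lambda t: norm.cdf(t) - norm.cdf(-t) + 2 * norm.pdf(t) / t - 1 / (1 - eps),
           1e-6, 10.0)

def f0(t):
    # Least favorable density (6): Gaussian in the middle, exponential tails.
    t = np.asarray(t, dtype=float)
    middle = (1 - eps) * norm.pdf(t)
    left   = (1 - eps) * norm.pdf(k) * np.exp(k * (t + k))
    right  = (1 - eps) * norm.pdf(k) * np.exp(-k * (t - k))
    return np.where(t <= -k, left, np.where(t >= k, right, middle))

print("k =", k)
print("integral of f0 =", quad(f0, -50, 50)[0])   # should be ~1 by equation (5)
grid = np.linspace(-20, 20, 4001)
slack = f0(grid) - (1 - eps) * norm.pdf(grid)     # proportional to h0 on the grid
print("min of eps*h0 =", slack.min())             # non-negative, up to rounding
\end{verbatim}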