Simulation Comparison Between an Outlier Resistant Model-Based Finite Population Estimator and Design-Based Estimators Under Contamination

Simulation Comparison Between an Outlier Resistant Model-Based Finite Population Estimator and Design-Based Estimators under Contamination Bogong Li U.S. Bureau of Labor Statistics There are two approaches to finite population esti- we propose. Limited simulation result shows that mation. One assumes the finite population elements this estimator is as efficient as other model-based es- are fixed quantities. The randomness associated timators when there is no significant violation from with estimators comes from the random selection of model assumptions and is more efficient than least samples. Each sampled element has a weight deter- squares types of estimators when there are contam- mined by the sample design, see Cochran (1977) [2]. inating symmetrical outlying observations. Most Bureau of Labor Statistics surveys rely on this theory, for example, the Occupational Employment Statistics program which was discussed in depth in 1. Oulier-Insensitive Linear Regres- Li (2002) [10] about methods used in small area es- sion timation. The other approach assumes the popu- In the past several decades, research on influence lation elements are a random draw from a larger of outlying observations on linear model estimates population or “super-population”, in a way similar yielded classes of estimators with controllable lev- to taking random samples from a random variable. els of reaction to outlying observations. We briefly The random variable has its own mean and vari- summarize some of them here, in particularly the ance structure, usually expressed in terms of linear ones that are implemented in popular statistical soft- models. Finite population summaries of the vari- ware packages such as R, S-plus and SAS. These are able under investigation, such as mean, total are the M-estimator, least median of squares estimator estimated through fitting sample data to the linear (LMS), the least trimmed means estimator (LTS). model. The design of the sample selection is relevant Also included are other methods that are applicable only to the selection of a suitable linear model un- to more complex linear model structures, the Mal- der the “super-population”, but is irrelevant to how lows’, Schweppe’s methods, and the broader class of the estimates are derived from the model. The first the Generalized M-estimator (GM). approach is known as “design-based”, the second “model-based”. Särndal, Swensson and Wretman (1992) [14] discusses this distinction in greater de- • Huber M-Estimator for Regression: Estimator ˆ tail. Table 1 lists some familiar design-based estima- βn for the regression parameter β (location) is tors and their model-based equivalents. A broader defined by review on this connection is in Li (2001) [9]. Since n least squares linear estimators are very vulnerable X 0 ˆ min ρ (yi − xiβn)/σ to outlying observations, that is, a small variation ˆ βn i=1 in outlying observation produces larger variation in the estimate than non-outlying observations, we de- or sire a model-based estimator that is less responsive n X 0 ˆ to outliers while being efficient. This is especially ψ (yi − xiβ)/σ xi = 0 necessary for survey data since processing error oc- i=1 cur prominently in surveys. This study aims to for some symmetric function ρ : → + and investigate the statistical quality of such a model- R R for a fixed σ. ψ(t) is the derivative of ρ(t). based, outlier insensitive finite population estimator Examples of ρ(t) are probability density func- 2 The views expressed in this article are those of the au- tion f(t), t , |t| etc. defined within certain dis- thor and do not constitute the policy of the Bureau of Labor tance from 0. Optimal βˆ is selected to mini- Statistics. The author can be contacted at: U.S. Bureau of mize asymptotic variance while allowing as high Labor Statistics, Postal Square Building, Suite 1950, 2 Mas- sachusetts Ave. NE. Washington, DC 20212-0001, or E-Mail: a breakdown points as it could be possible, see li [email protected]. Huber (1981) [7] for more details. Estimators Finite Population Tˆ Model Equivalent 2 Expansion estimator NY¯s Yi = µ + i, i ∼ (0, σ ) ˆ 2 Linear regression estimator N Y¯s + bls(¯x − x¯s) Yi = α + xiβ + i, i ∼ (0, σ ) 1 ¯ 2 2 Ratio estimator NYsx/¯ x¯s Yi = xiβ + xi i, i ∼ (0, σ ) P ¯ 2 Stratified expansion estimator h NhYhs Yhi = µh + hi, hi ∼ (0, σh) 1 P ¯ 2 2 Stratified ratio estimator h NhYhsx¯h/x¯hs Yhi = xhiβ + xhihi, hi ∼ (0, σh) P 0 Simple Proportion Estimator s Yi/n Yi ∼ ber (pi), logit(pi) = xiβ Table 1: Finite population estimators and their model-based equivalents • Least Median of Squares Regression (LMS): Re- the method we used to obtain outlier-insensitive gression parameters βˆ are implicitly defined by model parameter estimates, n o 0 0 0 ˆ 2 n −1/2 ô −1/2 ˆ −1 min Med (yi − xiβ) , i = 1, 2, . , n , min W ψ y − Xβ U V βˆ βˆ, Vˆ n o where only the median of the squared resid- U−1/2 W−1/2ψ y − Xβˆ , (1) uals are relevant in defining the best βˆ, see Rousseeuw and Leroy (1987) [13] for details. see Krasker and Welsch (1982) [8]. Solutions for β and V need numerical algorithm similar to • Least Trimmed Squares Regression(LTS): Sim- that for ML and REML, see Harville (1977) [5] ilar to least squares estimators, but only the and Searle, Casella and McCulloch (1992) [15] smallest h residuals are taken to minimize for methods to obtain ML and REML. summed squared residuals, h h 2. Outlier-Resistant Best Linear Un- X 0 ˆ 2 X 2 min (yi − xiβ)i:n = min (r )i:n biased Estimation ˆ ˆ β i=1 β i=1 Linear model with mixed-effects is gradually popu- 2 2 , where (r )1:n ≤ · · · ≤ (r )h:n ≤ · · · ≤ larizing with survey data usage, for example Battese, 2 (r )n:n are the ordered squared residuals, see Hater and Fuller (1987) [1]. We propose a super- Rousseeuw and Leroy (1987) [13] for more de- population structure incorporates mixed-effects: tails on this. Y1 X1 Z1 0 ··· 0 ν1 1 Y2 X2 0 Z2 ··· 0 ν2 2 • Mallows’ Method: A weight w(xi) is part of . = . β+ . . + . , . . . . . the estimating equation in Huber M-estimator . . . ··· . . . Y X 0 0 ··· Z ν which limits the influence of outlying x rows, g N×1 g g g g n where ( iid X 0 ˆ(M) ν ∼ (0, σ2) w(xi)ψ (yi − xiβ )/σ xi = 0, i ν iid i=1 2 i ∼ (0, σ I) see Mallows (1975) [11]. and Yi and xi are population and auxiliary infor- mation vectors. νi is random (can be constant) • Schweppe’s Method: Weights w(xi) depending stratum-specific effects. i is independent random on xi are as well attached to scale the residual error vector. If we rearrange the Y by observed (s) in response to outlying influential observation and non-observed (r) population elements: in x: n Ys Xs Zs 0 X 0 ˆ(S) Y = = β + ν + , w(xi)ψ (yi − xiβ )/(σw(xi)) xi = 0 Yr Xr Zr i=1 and , see Hill (1982) [6]. V V Var(Y) = ss sr , then V V • Generalized M-estimator for Mixed Models: Es- rs rr timating equations written in matrix form to the best linear unbiased predictor (BLUP) of accommodating more complex covariance struc- 0 0 0 ture and weight matrices V, W and U. This is T = lN×1Y = lsYs + lrYr which could be the mean, total or contrasts, is where τˆ is the bounded influence estimator (with IFτˆ,F < ∞) of the vector of all unknown parameters 0 0 ˆ ˆ ˆ −1 ˆ Te = lsys + lr XsβLS + VrsVss (ys − XsβLS) . in the model, (2) 0 0 τ = (β , v11, . , vij, . , vNN ) . Valliant (2000) [17] provides detailed derivation and proofs of unbiasedness and variance minimization of then τˆ is asymptotically normally distributed, Te. We propose an outlier resistant RBLUP, based Maronna and Yohai (1981) [12] discusses its asymp- on above formulation: totic mean and variance; η is a finite projection from τˆ to Te(R). Given such projection exists, by the (R) 0 0 ˆ (R) Chain Rule defined on Gâteaux differentials, the in- Te =l ys + l WsXsβ s r fluence function of RBLUP is (R) ˆ (R) ˆ (R)−1 ˆ 0 + Vrs Vss ψ ys − Xsβ (3) IFRBLUP,F = η IFτˆ,F . 0 (R) Since both IFτˆ,F and η are finite, IFRBLUP,F is nec- where βˆ , Vˆ (R) are any bounded influence esti- essarily bounded. mators from the mixed-effects linear model; Ws are weight functions depend on Xs, ψ is the Huber ψ- 3. Simulation Study ˆ (R) ˆ (R) function. In practice, β , V are generalized We study the statistical quality of proposed RBLUP M-estimators given in section 1. Strictly speaking through a simulation. First we produce an artificial RBLUP is neither “best” in the sense of minimiz- popoulation according to the model: ing variance nor unbiased with respect to the model among all linear unbiased estimators. The name Y1 Y2 β1 given simply indicates its predecessor. = (11000 x) Y3 β2 Influence of RBLUP is necessarily bounded. By Y4 definition, influence function of a statistical func- 150 ··· 0 1 tional T (F ) ∈ k is the first derivative of T (F ): R . ν1 . 1 . . 150 . + . + , T (F) − T (F ) ∂T (F) . . IFT,F (z) := lim = , . 1250 ν . →0 ∂ 4 →0 0 ··· 1550 1000 1 where F := (1−)F +δ , ∈ , with δ the point ( iid z R z ν ∼ (0, σ2) mass at z.

Simulation Comparison Between an Outlier Resistant Model-Based Finite Population Estimator and Design-Based Estimators Under Contamination

A Review on Outlier/Anomaly Detection in Time Series Data

Outlier Identification.Pdf

A Machine Learning Approach to Outlier Detection and Imputation of Missing Data1

Effects of Outliers with an Outlier

How Significant Is a Boxplot Outlier?

Lecture 14 Testing for Kurtosis

A SYSTEMATIC REVIEW of OUTLIERS DETECTION TECHNIQUES in MEDICAL DATA Preliminary Study

Recommended Methods for Outlier Detection and Calculations Of

Identify the Outlier in the Data Set. Determine How the Outlier Affects the Mean, Median, and Mode of the Data

Topics in Biostatistics: Part II

Visualizing Outliers

Outlier Detection and Robust Mixture Modeling Using Nonconvex Penalized Likelihood