<<

Simulation Comparison Between an Outlier Resistant Model-Based Finite Population and Design-Based under Contamination

Bogong Li

U.S. Bureau of Labor

There are two approaches to finite population esti- we propose. Limited simulation result shows that mation. One assumes the finite population elements this estimator is as efficient as other model-based es- are fixed quantities. The randomness associated timators when there is no significant violation from with estimators comes from the random selection of model assumptions and is more efficient than least samples. Each sampled element has a weight deter- squares types of estimators when there are contam- mined by the design, see Cochran (1977) [2]. inating symmetrical outlying observations. Most Bureau of Labor Statistics surveys rely on this , for example, the Occupational Employment Statistics program which was discussed in depth in 1. Oulier-Insensitive Linear Regres- Li (2002) [10] about methods used in small area es- sion timation. The other approach assumes the popu- In the past several decades, research on influence lation elements are a random draw from a larger of outlying observations on linear model estimates population or “super-population”, in a way similar yielded classes of estimators with controllable lev- to taking random samples from a random variable. els of reaction to outlying observations. We briefly The random variable has its own mean and vari- summarize some of them here, in particularly the ance structure, usually expressed in terms of linear ones that are implemented in popular statistical soft- models. Finite population summaries of the vari- ware packages such as R, S-plus and SAS. These are able under investigation, such as mean, total are the M-estimator, least of squares estimator estimated through fitting sample to the linear (LMS), the least trimmed means estimator (LTS). model. The design of the sample selection is relevant Also included are other methods that are applicable only to the selection of a suitable linear model un- to more complex linear model structures, the Mal- der the “super-population”, but is irrelevant to how lows’, Schweppe’s methods, and the broader class of the estimates are derived from the model. The first the Generalized M-estimator (GM). approach is known as “design-based”, the second “model-based”. S¨arndal, Swensson and Wretman (1992) [14] discusses this distinction in greater de- • Huber M-Estimator for Regression: Estimator ˆ tail. Table 1 lists some familiar design-based estima- βn for the regression parameter β (location) is tors and their model-based equivalents. A broader defined by review on this connection is in Li (2001) [9]. Since n least squares linear estimators are very vulnerable X 0 ˆ  min ρ (yi − xiβn)/σ to outlying observations, that is, a small variation ˆ βn i=1 in outlying observation produces larger variation in the estimate than non-outlying observations, we de- or sire a model-based estimator that is less responsive n X 0 ˆ  to outliers while being efficient. This is especially ψ (yi − xiβ)/σ xi = 0 necessary for survey data since processing error oc- i=1 cur prominently in surveys. This study aims to for some symmetric function ρ : → + and investigate the statistical quality of such a model- R R for a fixed σ. ψ(t) is the derivative of ρ(t). based, outlier insensitive finite population estimator Examples of ρ(t) are probability density func- 2 The views expressed in this article are those of the au- tion f(t), t , |t| etc. defined within certain dis- thor and do not constitute the policy of the Bureau of Labor tance from 0. Optimal βˆ is selected to mini- Statistics. The author can be contacted at: U.S. Bureau of mize asymptotic while allowing as high Labor Statistics, Postal Square Building, Suite 1950, 2 Mas- sachusetts Ave. NE. Washington, DC 20212-0001, or E-Mail: a breakdown points as it could be possible, see li [email protected]. Huber (1981) [7] for more details. Estimators Finite Population Tˆ Model Equivalent 2 Expansion estimator NY¯s Yi = µ + i, i ∼ (0, σ )  ˆ  2 estimator N Y¯s + bls(¯x − x¯s) Yi = α + xiβ + i, i ∼ (0, σ ) 1 ¯ 2 2 Ratio estimator NYsx/¯ x¯s Yi = xiβ + xi i, i ∼ (0, σ ) P ¯ 2 Stratified expansion estimator h NhYhs Yhi = µh + hi, hi ∼ (0, σh) 1 P ¯ 2 2 Stratified ratio estimator h NhYhsx¯h/x¯hs Yhi = xhiβ + xhihi, hi ∼ (0, σh) P 0 Simple Proportion Estimator s Yi/n Yi ∼ ber (pi), logit(pi) = xiβ

Table 1: Finite population estimators and their model-based equivalents

• Least Median of Squares Regression (LMS): Re- the method we used to obtain outlier-insensitive gression parameters βˆ are implicitly defined by model parameter estimates,

n o 0 0 0 ˆ 2 n −1/2  ˆo −1/2 ˆ −1 min Med (yi − xiβ) , i = 1, 2, . . . , n , min W ψ y − Xβ U V βˆ βˆ, Vˆ n  o where only the median of the squared resid- U−1/2 W−1/2ψ y − Xβˆ , (1) uals are relevant in defining the best βˆ, see Rousseeuw and Leroy (1987) [13] for details. see Krasker and Welsch (1982) [8]. Solutions for β and V need numerical algorithm similar to • Least Trimmed Squares Regression(LTS): Sim- that for ML and REML, see Harville (1977) [5] ilar to least squares estimators, but only the and Searle, Casella and McCulloch (1992) [15] smallest h residuals are taken to minimize for methods to obtain ML and REML. summed squared residuals,

h h 2. Outlier-Resistant Best Linear Un- X 0 ˆ 2 X 2 min (yi − xiβ)i:n = min (r )i:n biased Estimation ˆ ˆ β i=1 β i=1 Linear model with mixed-effects is gradually popu- 2 2 , where (r )1:n ≤ · · · ≤ (r )h:n ≤ · · · ≤ larizing with survey data usage, for example Battese, 2 (r )n:n are the ordered squared residuals, see Hater and Fuller (1987) [1]. We propose a super- Rousseeuw and Leroy (1987) [13] for more de- population structure incorporates mixed-effects: tails on this.           Y1 X1 Z1 0 ··· 0 ν1 1  Y2   X2   0 Z2 ··· 0   ν2   2            • Mallows’ Method: A weight w(xi) is part of            .  =  .  β+ . . .   . + .  ,  .   .   . . .   .   .  the estimating equation in Huber M-estimator  .   .   . . ··· .   .   .  Y X 0 0 ··· Z ν  which limits the influence of outlying x rows, g N×1 g g g g

n where ( iid X 0 ˆ(M)  ν ∼ (0, σ2) w(xi)ψ (yi − xiβ )/σ xi = 0, i ν iid i=1 2 i ∼ (0, σ I)

see Mallows (1975) [11]. and Yi and xi are population and auxiliary infor- mation vectors. νi is random (can be constant) • Schweppe’s Method: Weights w(xi) depending stratum-specific effects. i is independent random on xi are as well attached to scale the residual error vector. If we rearrange the Y by observed (s) in response to outlying influential observation and non-observed (r) population elements: in x:       n Ys Xs Zs 0 X 0 ˆ(S)  Y = = β + ν +  , w(xi)ψ (yi − xiβ )/(σw(xi)) xi = 0 Yr Xr Zr i=1 and , see Hill (1982) [6].  V V  Var(Y) = ss sr , then V V • Generalized M-estimator for Mixed Models: Es- rs rr timating equations written in matrix form to the best linear unbiased predictor (BLUP) of accommodating more complex covariance struc- 0 0 0 ture and weight matrices V, W and U. This is T = lN×1Y = lsYs + lrYr which could be the mean, total or contrasts, is where τˆ is the bounded influence estimator (with IFτˆ,F < ∞) of the vector of all unknown parameters 0 0  ˆ ˆ ˆ −1 ˆ  Te = lsys + lr XsβLS + VrsVss (ys − XsβLS) . in the model, (2) 0 0 τ = (β , v11, . . . , vij, . . . , vNN ) . Valliant (2000) [17] provides detailed derivation and proofs of unbiasedness and variance minimization of then τˆ is asymptotically normally distributed, Te. We propose an outlier resistant RBLUP, based Maronna and Yohai (1981) [12] discusses its asymp- on above formulation: totic mean and variance; η is a finite projection from τˆ to Te(R). Given such projection exists, by the  (R) 0 0 ˆ (R) Chain Rule defined on Gˆateaux differentials, the in- Te =l ys + l WsXsβ s r fluence function of RBLUP is (R)  ˆ (R) ˆ (R)−1 ˆ  0 + Vrs Vss ψ ys − Xsβ (3) IFRBLUP,F = η IFτˆ,F .

0 (R) Since both IFτˆ,F and η are finite, IFRBLUP,F is nec- where βˆ , Vˆ (R) are any bounded influence esti- essarily bounded. mators from the mixed-effects linear model; Ws are weight functions depend on Xs, ψ is the Huber ψ- 3. Simulation Study ˆ (R) ˆ (R) function. In practice, β , V are generalized We study the statistical quality of proposed RBLUP M-estimators given in section 1. Strictly speaking through a simulation. First we produce an artificial RBLUP is neither “best” in the sense of minimiz- popoulation according to the model: ing variance nor unbiased with respect to the model    among all linear unbiased estimators. The name Y1     Y2 β1 given simply indicates its predecessor.    = (11000 x)  Y3 β2 Influence of RBLUP is necessarily bounded. By   Y4 definition, influence function of a statistical func-   150 ··· 0   1  tional T (F ) ∈ k is the first derivative of T (F ):    R  . ν1 .   1 .   .    150   .     +    .  +   , T (F) − T (F ) ∂T (F)   .   .  IFT,F (z) := lim = ,   . 1250  ν  .  →0  ∂  4 →0  0 ··· 1550 1000

1  where F := (1−)F +δ ,  ∈ , with δ the point  ( iid  z R z  ν ∼ (0, σ2) mass at z. IF (z) reflects the influence on T of an  k ν , T,F  iid 2  i ∼ (0, σ ) infinitesimal change occurred at z. Bounded influ-  2 2 2 2  σ + σν σ ··· σ   2 2 2 2 ence estimators retain finite influence from outlying  σ σ + σν ··· σ   Var(Yk) =  . . .  observations as well as any other observations. A   . . .   . . ··· .  2 2 2 2 sample version of the influence function is  σ σ ··· σ + σν   n o  taken from ˆ ˆ ˆ νk, ei ⇐⇒ (1 − λ)N (0, 1) + λN (0, 41) IFi := (n − 1) T (F ) − T (F(i)) , Then contamination is introduced into this population ˆ where F(i) is the sample functional without ith ob- in four ways: no contamination, contamination in the servation. Cook’s Distance is another form that random effects, in the error term and in both. Contam- ination of percentage λ are taken from normal distribu- expresses the magnitude of IFˆ i by calculating a weighted norm: tion with large variance (40 vs. 1 of clean data). Notice in this case the contamination is symmetrically centered at the central location of clean data. This approach is D = (IFˆ )0M(IFˆ )/c i i i similar to de Jongh, Wet and Welsh (1988) [3]. where M is k × k positive or semipositive definite νk ei matrix and c is a scalar to be defined, see Ham- 0.0 0.0 : no contamination pel, Ronchetti, Rousseeuw, and Stahel (1986) [4] for 0.1 0.0 : only random effect contamination λ greater details. 0.0 0.1 : only error contamination 0.1 0.1 : both random effect and error contamination The influence function of RBLUP, IFRBLUP,F is then necessarily bounded. Now if we rewrite β and V are estimated through Generalized M- RBLUP as Estimator, as those used by Krasker and Welsch Te(R) = η(τˆ), (1982) [8] and Stahel and Welsh (1992) [16]. Figure 1: Fitted super-population models under three estimation methods (Contamination Patter: 0.1:0.1)

We then repeatedly draw stratified random samples , with of size 100 from this artificial population, 100 times to- 4 tal. Sample sizes are allocated to four strata first by the ˆ X 2 2 1/2 SD(TSLR) = ( (1 − fi)Syi(1 − ρi )) . method of sample proportional to size then the Neyman i=1 allocation. Results show little difference between the ρ = S /S S is the population correlation in two methods therefore all results listed later are based i xyi xi yi stratum i, S2 , S2 , and S2 are variance of y, x on Neyman allocation method. For each draw, we calcu- yi xi xyi and covariances between y and x in stratum i, and late an estimate of the population total using RBLUP. In addition we also calculate estimates using the following 3. Best Linear Unbiased estimator (BLUP) (2). alternative estimators: The estimates were then compared to the actual popu- lation total yielding values of root , a 1. Stratified Direct estimator (SD), measure for deviation from the true value. Spread of the 4 ni deviation were estimated through the estimates spread ¯ˆ X X TSD = 1/N Ni/ni yij based on the 100 independent repetitions. The median i=1 j=1 root mean square errors and a nominal 95% confidence with standard deviation region based on the 100 independent samples are listed in Table 2. Figure 1 displays the model fits from one i X 2 1/2 of the random draws. Different dash lines indicate the SD(Tˆ ) = ( (N /N) (1 − f )/n ) SD i i i method used. Notice both the SD and BLU are “pulled” in larger degree than RBLUP by the outliers in the sam- , where S2 is the stratum variance and f the sam- i i ple. RBLUP tend to stay closer to the majority of the pling fraction in stratum i. population in all four strata when outlier in y directions 2. Stratified Linear Regression estimator (SLR), is mild, as in stratum IV where all estimates produces

4 ni 4 similar model fits. ˆ X X X T¯SLR = Ni/N yij /ni − ˆbiNi/N i=1 j=1 i=1 4. Conclusion n N Xi Xi X Outlier-resistant predictor RBLUP under linear ( xij /ni − xij /Ni), mixed-effects model is less influenced by outliers (in j=1 j=1 y or rows of X) than LS predictors such as BLUP. where X When there is no significant deviations from model n n assumptions and/or there are not outlying y or rows Xi Xi ˆbi = (yij − yij /ni)(xij − of X, RBLUP is relatively efficient. j=1 j=1 X Given there is no contamination and presence of ni 4 ni asymmetric outlying observations, RBLUP is bi- X X X 2 xij /ni)/ (xij − xij /ni) ased. However it is reduces variability as compared j=1 i=1 j=1 to other estimators. Contamination Root Mean Squared Error Pattern T SD SLR BLUP RBLUP 0.0 : 0.0 95.2 0.65 1.30 1.31 1.31 (0.61-0.68) (0.54-1.61) (0.79-1.78) (0.79-1.78) 0.1 : 0.0 95.2 1.23 0.97 1.18 1.18 (0.59-7.83) (0.12-1.79) (.05-2.31) (.05-2.31) 0.0 : 0.1 95.2 5.41 2.68 2.72 1.61 (3.29-7.97) (1.12-3.62) (1.25-3.25) (0.81-2.76) 0.1 : 0.1 95.2 5.12 3.20 2.89 1.52 (1.98-7.25) (1.75-4.21) (1.60-4.25) (0.16-2.86)

Table 2: Median root mean squared error and 95% confidence region from 100 independent samples

X Under contamination and presence of outlying ob- [8] Krasker, W. S. and Welsch, R. E. (1982), “Efficient servations, RBLUP reduces bias and standard er- Bounded Influence Regression Estimation,” Journal ror. of the American Statistical Association, 77, 595-604. X Differences between RBLUP and BLUP decrease as [9] Li, B. (2001) Basic Sample Survey Principles & the sampling rate increases. Methods. SSA Course Note on Survey Sampling Methodology, Sacramento, CA. X Effect of the outlying observation on the RBLUP depends on the outlier resistant estimators used. [10] Li, B. (2002) “Small Area Research for the Occu- pational Employment Statistics Survey,” 2002 Pro- X Computation is slow, algorithm complex, solution is not guaranteed especially with large datasets. ceedings of the American Statistical Association, Government Statistics Section: 2091-2096 [CD- References ROM], Alexandria, VA: American Statistical Asso- ciation. [1] Battese, G. E., R. M. Harter and W. A. Fuller [11] Mallows, C. L. (1975), On Some Topics in Ro- (1987), “An Error-Components Model for Predic- bustnes, Unpublished memeorandum, Bell Tele- tion of County Crop Areas using Sruvey and Satel- phone Laboratories, Murray Hill, NJ. lite Data,” Journal of the American Statistical As- sociation, 83, 28-36. [12] Maronna, R. A., and Yohai, V. J. (1981), “Asymp- totic Behaviour of General M -Estimates for Regres- [2] Cochran, W. G. (1977), Sampling Techniques, 3rd sion and Scale with Random Carriers,” Zeitschrift edition. Wiley, New York, 3rd edition, 1977. f¨urWahrscheinlichkeitsheorie und Verwandte Gebi- [3] de Jongh, P. J., de Wet, T., and Welsh, ete/Probability and Related Fields, 58, 7-20. A. H. (1988), “Mallow-Type Bounded-Influence- [13] Rousseeuw, P. J. and A. M. Leroy (1987), Trimmed Means,” Journal of the Amer- Regression and Outlier Detection. Wiley, New York, ican Statistical Association, 83, 805-810. 1987. [4] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P.J. [14] S¨arndal, C. E., Swensson, E. and Wretman, J. and Stahel, W. A. (1986), , the Ap- (1992), Model Assisted Survey Sampling. proach Based on Influence Functions. Wiley, New York, 1986. [15] Searle, S. R., Casella, G. and McCulloch, C. E. (1992), Variance Components. Wiley, New York, [5] Harville, D. A. (1977), “Maximum Likelihood Ap- 1992. proaches to Variance Compoment Estimation and to Related Problems,” Journal of the American Sta- [16] Stahel, W. A., Welsh, A. H. (1992), “Robust Esti- tistical Association, 72, 320-338. mation of Variance Components,” Research Report 69, ETH, Z¨urich. [6] Hill, R. W. (1982), “Robust Regression When There are Outliers in the Carriers: The Case,” [17] Valliant, R., A. H. Dorfman and R.M. Royall Communications in Statistics, Part A-Theory and (2000), Robust Regression and Outlier Detection. Methods, 11, 849-868. Wiley, New York, 2000. [7] Huber, P.J. (1981) Robust Statistics, Wiley, New York, 1981.