Robust Statistics Using Stata UK17 Stata Users Meeting
Total Page:16
File Type:pdf, Size:1020Kb
Robust Statistics using Stata UK17 Stata Users Meeting Vincenzo Verardi Fnrs, UNamur, ULB September 2017 Remark: for a thorough description of the concepts and estimators discussed in this talk, we advice to refer to Maronna, Martin and Yohai (2006). Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 1/77 Outliers do matter and are not always bad Structure of the presentation Introduction Descriptive Satistics Univariate outliers identi…cation Regression models Multivariate analysis Multivariate outlier identi…cation Robust logit Conclusion Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 3/77 Outliers do matter and are not necessarily coding errors Star Cluster CYG OB1 Hertzsprung•Russell Data 7 6 5 4 Log of light intensity 3 3.5 4 4.5 Log of temperature Least Squares Robust Estimator Source:Source: P. J. Rousseeuw P.J. Rousseeuwand A. M. Leroy (1987) and A.M. Leroy (1987) Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 4/77 Outliers do matter and are not necessarily coding errors Number of international calls from Belgium Belgian Statistical Survey, Ministry of Economy. 20 15 10 5 0 50 55 60 65 70 75 Year Least Squares Robust Estimator Source:Source: P. J. Rousseeuw P.J. and Rousseeuw A. M. Leroy (1987) and A.M. Leroy (1987) Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 6/77 Measuring robustness of an estimator Sensitivity curve, see inter alia [Maronna et al., 2006] Let us consider a data set Xn = x1, ... , xn and the statistic f g Tn = Tn(x1, ... , xn). To study the impact of a potential outlier on this statistic, we may analyze the modi…cation of value observed for the statistic when we add an extra data point x and allow it to move on the whole line (from ∞ to +∞). The (standardized) sensitivity curve of the statistic Tn for the sample Xn is de…ned by Tn+1(x1, ... , xn, x) Tn(x1, ... , xn) SC(x; Tn, Xn) = 1 ; n+1 for each value of x, we compare the value of the statistic in the "contaminated" sample with its value in the initial sample, and rescale the di¤erence by dividing by 1/(n + 1), the amount of contamination. Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 8/77 Measuring robustness of an estimator Sensitivity curve Mean and Median Standardized Sensitivity Curve X~N(0,1), N=20 5 0 •5 •5 0 5 Median Mean Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 9/77 Measuring robustness of an estimator In‡uence function The in‡uence function (IF) can be considered as an asymptotic version of the sensitivity curve of the statistic Tn when the sample size n grows, that is, when the empirical distribution function Fn tends to the underlying population distribution function F : 1 1 T 1 F + ∆x T (F ) IF(x; T , F ) = lim n+1 n+1 n ∞ 1 ! n+1 T ((1 ε) F + ε∆x ) T (F ) = lim , ε 0 ε ! where ∆x denotes the probability distribution putting all its mass in the point x. This function measures the e¤ect on T of a pertubation of F obtained by adding a small probability mass at the point x. Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 10/77 Measuring robustness of an estimator In‡uence Function Mean and Median Influence Function X~N(0,1) 5 y 0 •5 •5 0 5 x Median Mean Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 11/77 Measuring robustness of an estimator Gross-error sensitivity The gross-error sensitivity of T at distribution F , de…ned by γ(T , F ) = sup IF(x; T , F ) , x j j evaluates the biggest in‡uence that an outlier may have on T . From the robustness point of view, it is of course preferable to use an estimator for which γ(T , F ) is …nite (i.e. bounded IF). Local-shift sensitivity The local-shift sensitivity measures the e¤ect of a small perturbation of the value of x on T . We may determine the local-shift sensitivity IF(y; T , F ) IF(x; T , F ) λ(T , F ) = sup j j. x =y y x 6 j j From the robustness point of view, it is of course preferable to use an estimator for which the IF is smooth everywhere. Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 12/77 Measuring robustness of an estimator Breakdown point The sensitivity curve shows how an estimator reacts to the introduction of one single outlier. Some estimators have bounded sensitivity curve (SC) and therefore resist to this contamination. However, it is possible that the number of outliers in a sample is so large that even these estimators with bounded SC can break. The breakdown point is, roughly, the smallest amount of contamination in the sample that may cause the estimator to take on arbitrary values. Example If the ith observation among x1, ... , xn goes to in…nity, the sample mean µn goes to in…nity as well. This means that the …nite-sample breakdown point of this statistic is only 1/n. In contrast, the …nite-sample breakdown n/2 (n+1)/2 point of the median Q0.5;n is n if n is even and n if n is odd. Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 13/77 Measuring robustness of an estimator Choosing a good (robust) estimator Fisher-consistent. If the estimator was calculated using the entire population rather than a sample, the true value of the estimated parameter should be obtained Bounded in‡uence function (low gross-error sensitivity). The biggest in‡uence that an outlier may have on the estimator should be limited Smooth in‡uence function (low local-shift sensitivity). The e¤ect on the estimator of a small perturbation in the data should be limited High breakdown point. The estimator must withstand a contamination of a large proportion of the data Highly e¢ cient with convergence rate of pn Computationally feasible Compromises must often be made to achieve good performance. Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 14/77 Descriptive statistics Location parameters Several measures of location are available in the literature. We compare i) two “classical” estimators based on (centered) moments of the empirical distribution, ii) an estimator based on quantiles of the distribution, and iii) an estimator based on pairwise comparisons of the observations Classical estimator (mean) 1 n µn = n ∑i=1 xi Classical estimator (trimmed mean) 1 n αn α b c µn = n 2 αn ∑i= αn +1 x(i) b c b c Quantile-based estimator (median) Q0.5 = med xi f g Pairwise based estimator [Hodges and Lehmann, 1963] xi +xj HLn = med 2 ; i < j Vincenzo Verardi (Fnrs, UNamur,n ULB) UK17o Stata Users Meeting 8/09/2017 16/77 Location parameters In‡uence functions 2 1 0 IF •1 •2 •4 •2 0 2 4 x m Q0.5 HL 0.25 0.05 m m Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 17/77 Location parameters Comparing properties of location estimators Asymptotic Computational Estimator ASV( , Φ) breakdown complexity value µn 1 0% O(n) 1.0263 if α = 0.05 µα 8 1.0604 if α = 0.10 100α% O(n) n > > < 1.1952 if α = 0.25 > Q0.5;n :> π/2 = 1.5708 50% O(n) HLn π/3 = 1.0472 29% O(n log n) Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 18/77 Location parameters Stata example Clean dataset Contaminated dataset clear clear set seed 1234 set seed 1234 set obs 10000 set obs 10000 drawnorm z drawnorm z gen x=z gen x=z+10 in 1/100 sum x, d sum x, d robstat x, stat(hl) robstat x, stat(hl) µn Q0.5;n HLn µn Q0.5;n HLn Value -0.00 -0.01 -0.01 Value 1.00 0.00 0.01 Time 0.01 0.01 0.40 Time 0.01 0.01 0.41 Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 19/77 Descriptive statistics Scale parameters Several measures of dispersion are available in the literature. We compare i) a “classical” estimators based on (centered) moments of the empirical distribution, ii) two estimators based on quantiles of the distribution, and iii) an estimator based on pairwise comparisons of the observations Classical estimator (standard deviation) 1 n 2 σn = ∑ (xi µ ) n i=1 n q Quantile-based estimator (inter-quartile range) IQR = 0.7413 (Q0.75 Q0.25) n Quantile-based estimator (median absolute deviation) MADn = 1.4826 medi xi medj xj j j Pairwise based estimator [Rousseeuw and Croux, 1993] n Qn = 2.2219 xi xj ; i < j and k = ( )/4 fj j g(k) 2 Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 20/77 Scale parameters In‡uence functions 4 3 2 IF 1 0 •1 •4 •2 0 2 4 x s IQR Qn MAD Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 21/77 Scale parameters Comparing properties of location estimators Asymptotic Computational Estimator Type ASV( , Φ) breakdown Complexity value σn Classical 0.5 0% O(n) IQRn Quantile-based 1.3605 25% O(n) MADn Quantile-based 1.3605 50% O(n) Qn Pairwise-based 0.6077 50% O(n log n) Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 22/77 Scale parameters Stata example Clean dataset Contaminated dataset clear clear set seed 1234 set seed 1234 set obs 10000 set obs 10000 drawnorm z drawnorm z gen x=z gen x=z+10 in 1/100 robstat, stat(sd iqrc) robstat, stat(sd iqrc) robstat, stat(qn) robstat, stat(qn) σn IQRn Qn σn IQRn Qn Value 0.99 1.00 1.00 Value 10.00 1.00 1.02 Time 0.02 0.07 0.58 Time 0.01 0.10 0.52 Vincenzo Verardi (Fnrs, UNamur, ULB) UK17 Stata Users Meeting 8/09/2017 23/77 Descriptive statistis Skewness parameters Several measures of skewness are available in the literature.