A generalized boxplot for skewed and heavy-tailed distributions implemented in

Vincenzo Verardi joint with C. Vermandele and C. Bru¤aerts

UK Stata users meeting

September 2014

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 1/23 Roadmap

Structure of the presentation

Introduction

Preamble (Tukey g and h: Tg ,h) A generalized boxplot

Simulations

Examples (Eartquakes in Latin America and Footballers’wages)

Stata command

Conclusion

References

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 2/23 Univariate identi…cation

Standard Boxplot, Standard Normal distribution X is the (n 1) data vector (n individuals, 1 variable) 

IQR

P25 P75

P25•1.5IQR P75+1.5IQR

0.35% 0.35%

•4 sd •3 sd •2 sd •1 sd 1 sd 2 sd 3 sd 4 sd

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 3/23 Univariate outliers identi…cation

Standard Boxplot, heavy tailed t2 distribution X is the (n 1) data vector (n individuals, 1 variable) 

IQR P25 P75

P25•1.25IQR P75+1.5IQR

2.75% 2.75%

•4 sd •3 sd •2 sd •1 sd median 1 sd 2 sd 3 sd 4 sd

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 4/23 Univariate outliers identi…cation

2 Standard Boxplot, skewed χ5 distribution X is the (n 1) data vector (n individuals, 1 variable) 

IQR

P25 P75

max(P25•1.5IQR,0) P75+1.5IQR

0% 2.80%

0 1 sd median 2 sd 3 sd 4 sd 5 sd 6 sd

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 5/23 Univariate outliers identi…cation

Limitations of the boxplot Only suited for (almost) symmetric data and (approximately) mesokurtic distributions Solution 1 Modify the whiskers of the boxplot to deal with asymmetry Adjusted Boxplot (Hubert and Vandervieren, 2008). The whiskers of the boxplot are moved according to a robust measure of asymmetry, the medcouple ( 1 MC 1): 4 MC 3 MC  Q0.25 1.5e IQR ; Q0.75 + 1.5e IQR if MC 0 3 MC 4 MC  Q0.25 1.5e IQR ; Q0.75 + 1.5e IQR if MC < 0,    Copes well with asymmetry (MC 0.6) but does not take (explicitly) into accout heavyness of tails Rejection rate set to 0.7%  Rule based on simulations  Computational complexity O(n log n) (see Gelade et al., 2014).  Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 6/23 Univariate outliers identi…cation

Limitations of the boxplot Only suited for (almost) symmetric data and (approximately) mesokurtic distributions Solution 2 Modify the whiskers of the boxplot to deal with asymmetry and tail heavyness Generalized Boxplot Do a rank preserving transformation of the data to end-up with a known distribution Use the theoretical quantiles of the latter to set whiskers (after applying an inverse transformation) Cope with both the and tail heavyness  Set the desired rejection rate to any chosen level  Computational complexity O(n) (as the standard boxplot)  Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 7/23 Preamble: Tukey g and h distribution

Heavy-tailed distributions De…nition

If Z N(0, 1), g = 0 and h , the random variable Y is said to be  6 1 2 2 Tg ,h distributed if Y = [exp (gZ ) 1] exp hZ /2 g  .4

g~0 .3 .2 .1 0

•10 •5 0 5 10 x

h~0 h=0.5 h=1

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 8/23 Preamble: Tukey g and h distribution

Asymmetrical distributions De…nition

If Z N(0, 1), g = 0 and h R, the random variable Y is said to be  6 1 2 2 Tg ,h distributed if Y = [exp (gZ ) 1] exp hZ /2 g  .6

h~0 .4 .2 0

•5 0 5 10 x

g~0 g=0.5 g=1

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 9/23 Preamble: Tukey g and h distribution

Asymmetrical and heavy-tailed distributions De…nition

If Z N(0, 1), g = 0 and h R, the random variable Y is said to be  6 1 2 2 Tg ,h distributed if Y = [exp (gZ ) 1] exp hZ /2 g  .4 .3 .2 .1 0

•5 0 5 10 15 x

g,h~0 g,h=0.5 g,h=1

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 10/23 Univariate outliers identi…cation

Standard Boxplot An is de…ned as any observation lying outside the fence de…ned by whiskers P25 1.5 IQR and P75 + 1.5 IQR Theoretical detection rate α More generally, a theoretical detection rate equal to α is given by z1 α/2 z0.75 [Q c (α) IQR ; Q + c (α) IQR] with c (α) = where 0.25 0.75 z0.75 z0.25 zp denotes the quantile of order p of the standard normal distribution. Limitations of the boxplot Only suited for (almost) symmetric data and (approximately) mesokurtic distributions Solution Modify the boxplot to deal with asymmetry and tail heavyness.

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 11/23 2 Shift the dataset to obtain only strictly positive values: ri = x min( x ) + 0.1 i f jg 3 Standardize ri to map xi on the open interval (0, 1): ri ri = min( rj )+max( rj ) f g f g 4 Consider the inverse normal (also called probit) transformation 1 ewi = Φ (ri ) wi Q0.5 ( wj ) 5 Center and reduce the values wi : w = f g i IQR( wj )/1.3426 e f g

Procedure

Transformation

For an initial dataset x1, ... , xn , the guidelines of the new method are the following: f g 0 xi m 1 Center and reduce the data: x  = where s0 = IQR( xj ) and i s0 f g m0 = Q0.5( xj ) f g

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 12/23 3 Standardize ri to map xi on the open interval (0, 1): ri ri = min( rj )+max( rj ) f g f g 4 Consider the inverse normal (also called probit) transformation 1 ewi = Φ (ri ) wi Q0.5 ( wj ) 5 Center and reduce the values wi : w = f g i IQR( wj )/1.3426 e f g

Procedure

Transformation

For an initial dataset x1, ... , xn , the guidelines of the new method are the following: f g 0 xi m 1 Center and reduce the data: x  = where s0 = IQR( xj ) and i s0 f g m0 = Q0.5( xj ) f g 2 Shift the dataset to obtain only strictly positive values: ri = x min( x ) + 0.1 i f jg

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 12/23 4 Consider the inverse normal (also called probit) transformation 1 wi = Φ (ri ) wi Q0.5 ( wj ) 5 Center and reduce the values wi : w = f g i IQR( wj )/1.3426 e f g

Procedure

Transformation

For an initial dataset x1, ... , xn , the guidelines of the new method are the following: f g 0 xi m 1 Center and reduce the data: x  = where s0 = IQR( xj ) and i s0 f g m0 = Q0.5( xj ) f g 2 Shift the dataset to obtain only strictly positive values: ri = x min( x ) + 0.1 i f jg 3 Standardize ri to map xi on the open interval (0, 1): ri ri = min( rj )+max( rj ) f g f g e

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 12/23 wi Q0.5 ( wj ) 5 Center and reduce the values wi : w = f g i IQR( wj )/1.3426 f g

Procedure

Transformation

For an initial dataset x1, ... , xn , the guidelines of the new method are the following: f g 0 xi m 1 Center and reduce the data: x  = where s0 = IQR( xj ) and i s0 f g m0 = Q0.5( xj ) f g 2 Shift the dataset to obtain only strictly positive values: ri = x min( x ) + 0.1 i f jg 3 Standardize ri to map xi on the open interval (0, 1): ri ri = min( rj )+max( rj ) f g f g 4 Consider the inverse normal (also called probit) transformation 1 ewi = Φ (ri )

e

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 12/23 Procedure

Transformation

For an initial dataset x1, ... , xn , the guidelines of the new method are the following: f g 0 xi m 1 Center and reduce the data: x  = where s0 = IQR( xj ) and i s0 f g m0 = Q0.5( xj ) f g 2 Shift the dataset to obtain only strictly positive values: ri = x min( x ) + 0.1 i f jg 3 Standardize ri to map xi on the open interval (0, 1): ri ri = min( rj )+max( rj ) f g f g 4 Consider the inverse normal (also called probit) transformation 1 ewi = Φ (ri ) wi Q0.5 ( wj ) 5 Center and reduce the values wi : w = f g i IQR( wj )/1.3426 e f g

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 12/23 7 Select the rejection bounds (L , L+ ) using speci…c quantiles of the adjusted distribution (here P0.35 and P99.65)

8 Build the detection bounds B and B+ (whiskers) for the original dataset applying the complete inverse transformation

IQR( wj ) f (L ) = Φ Q0.5 ( wj ) + 1.3426f g L  f g    B = f (L ) [min( rj ) + max( rj )] + min( xj ) 0.1 s0 + m0   f g f g f g  

Procedure

Transformation

6 Adjust the distribution of the values wi (i = 1, ... , n) by the Tukey T distribution: g ,h P0.9 ( w )P0.1 ( w ) 2 ln g f jg f jg b b P w P w +P w 1 0.9 ( j ) 0.9 (f jg) 0.1 (f jg) ! g = ln f g , h = 2 z0.9 z P0.1 ( wj ) b 0.9  f g  b b

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 13/23 8 Build the detection bounds B and B+ (whiskers) for the original dataset applying the complete inverse transformation

IQR( wj ) f (L ) = Φ Q0.5 ( wj ) + 1.3426f g L  f g    B = f (L ) [min( rj ) + max( rj )] + min( xj ) 0.1 s0 + m0   f g f g f g  

Procedure

Transformation

6 Adjust the distribution of the values wi (i = 1, ... , n) by the Tukey T distribution: g ,h P0.9 ( w )P0.1 ( w ) 2 ln g f jg f jg b b P w P w +P w 1 0.9 ( j ) 0.9 (f jg) 0.1 (f jg) ! g = ln f g , h = 2 z0.9 z P0.1 ( wj ) b 0.9  f g  7 Select the rejection bounds (L , L+ ) using speci…c quantiles of the b b adjusted distribution (here P0.35 and P99.65)

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 13/23 Procedure

Transformation

6 Adjust the distribution of the values wi (i = 1, ... , n) by the Tukey T distribution: g ,h P0.9 ( w )P0.1 ( w ) 2 ln g f jg f jg b b P w P w +P w 1 0.9 ( j ) 0.9 (f jg) 0.1 (f jg) ! g = ln f g , h = 2 z0.9 z P0.1 ( wj ) b 0.9  f g  7 Select the rejection bounds (L , L+ ) using speci…c quantiles of the b b adjusted distribution (here P0.35 and P99.65)

8 Build the detection bounds B and B+ (whiskers) for the original dataset applying the complete inverse transformation

IQR( wj ) f (L ) = Φ Q0.5 ( wj ) + 1.3426f g L  f g    B = f (L ) [min( rj ) + max( rj )] + min( xj ) 0.1 s0 + m0   f g f g f g  

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 13/23 Numerical example

Considered distributions

Standard Normal Student(2) Exponential(1) sk=0, ex. kurt=0 sk=0, ex. kurt= ¥ sk=2, ex. kurt=6 1 .4 .4 .8 .3 .3 .6 y y y .2 .2 .4 .1 .1 .2 0 0 0

•5 0 5 •10 •5 0 5 10 0 2 4 6 8 10 x x x

Fréchet(2) Triangular(0,0.1,1) Beta(2,5) sk=¥, e x . k u r t=¥ s k = 0 .5 4 , e x . k u r t= • 0 .6 sk=0.6, ex. kurt=2.88 2 2.5 1 2 .8 1.5 1.5 .6 y y 1 1 .4 kdensity x kdensity .5 .5 .2 0 0 0

0 2 4 6 8 10 0 .2 .4 .6 .8 1 0 .2 .4 .6 .8 1 x x x

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 14/23 Numerical example

Quality of …t of transformed variable

X~Standard Normal X~Student(2) X~Exponential(1) .4 .4 .4 .3 .3 .3 .2 .2 .2 .1 .1 .1 0 0 0

•5 0 5 10 •40 •20 0 20 40 •2 0 2 4 6 w* w* w*

X~Fréchet(2) X~Triangular(0,0.1,1) X~Beta(2,5) .4 .4 .4 .3 .3 .3 .2 .2 .2 .1 .1 .1 0 0 0

•5 0 5 10 •2 0 2 4 •4 •2 0 2 4 w* w* w*

Tukey g and h Kernel

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 15/23 Boxplot

Standard, Adjusted and Generalized boxplots

X~Standard Normal X~Student(2) X~Exponential(1)

#out=1 #out=6 #out=38 #out=43 #out=0 #out=38 Standard Standard Standard

#out=1 #out=7 #out=32 #out=47 #out=0 #out=2 Adjusted Adjusted Adjusted

#out=3 #out=6 #out=3 #out=2 #out=0 #out=8

Generalized •4 •2 0 2 4 Generalized •40 •20 0 20 Generalized 0 2 4 6 8 x x x

X~Fréchet(2) X~Triangular(0,0.1,1) X~Beta(2,5)

#out=0 #out=83 #out=0 #out=0 #out=0 #out=4 Standard Standard Standard

#out=0 #out=17 #out=0 #out=0 #out=0 #out=0 Adjusted Adjusted Adjusted

#out=0 #out=4 #out=5 #out=3 #out=1 #out=3

Generalized 0 10 20 30 40 Generalized 0 .2 .4 .6 .8 1 Generalized 0 .2 .4 .6 .8 1 x x x

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 16/23 Sensitivity and Speci…city

Outliers U(4.9,5.1) on the scale of the Normal (1000 replications) 

Outliers: U(4.9,5.1) Sensitivity Specificity e n=100 n=1000 n=100 n=1000 1% 100.00% 100.00% 99.06% 99.32% Standard Boxplot 5% 100.00% 100.00% 99.19% 99.58% 1% 98.10% 100.00% 97.81% 99.12% N(0,1) Adjusted Boxplot 5% 92.40% 100.00% 97.71% 98.95% 1% 100.00% 100.00% 96.82% 98.91% Generalized Boxplot 5% 98.30% 100.00% 97.95% 99.45% 1% 100.00% 100.00% 91.96% 91.98% Standard Boxplot 5% 100.00% 100.00% 92.95% 92.83% 1% 100.00% 100.00% 90.93% 91.70% t2 Adjusted Boxplot 5% 100.00% 100.00% 91.05% 91.54% 1% 100.00% 100.00% 96.68% 98.47% Generalized Boxplot 5% 100.00% 100.00% 97.71% 99.15% 1% 100.00% 100.00% 94.93% 95.47% Standard Boxplot 5% 100.00% 100.00% 96.29% 96.72% 1% 99.70% 100.00% 98.48% 99.47% Exp(1) Adjusted Boxplot 5% 98.80% 100.00% 98.54% 99.90% 1% 100.00% 100.00% 96.55% 99.35% Generalized Boxplot 5% 99.50% 100.00% 98.42% 99.95%

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 17/23 Sensitivity and Speci…city

Outliers U(4.9,5.1) on the scale of the Normal (1000 replications) 

Outliers: U(4.9,5.1) Sensitivity Specificity e n=100 n=1000 n=100 n=1000 1% 100.00% 100.00% 91.96% 92.00% Standard Boxplot 5% 100.00% 100.00% 93.31% 93.41% 1% 100.00% 100.00% 94.18% 95.26% Fréchet(2) Adjusted Boxplot 5% 100.00% 100.00% 93.38% 94.54% 1% 100.00% 100.00% 96.74% 98.96% Generalized Boxplot 5% 100.00% 100.00% 98.08% 99.57% 1% 100.00% 100.00% 99.75% 99.83% Standard Boxplot 5% 99.50% 100.00% 99.90% 99.95% 1% 53.20% 56.40% 99.39% 99.99% Triangular(0,0.1,1) Adjusted Boxplot 5% 34.40% 8.20% 99.23% 100.00% 1% 98.90% 99.90% 96.71% 99.26% Generalized Boxplot 5% 93.80% 97.70% 97.67% 99.86% 1% 100.00% 100.00% 98.81% 99.42% Standard Boxplot 5% 100.00% 100.00% 99.15% 99.72% 1% 76.40% 98.70% 99.07% 99.97% Beta(2,5) Adjusted Boxplot 5% 54.60% 71.10% 98.62% 99.94% 1% 99.60% 100.00% 97.26% 99.36% Generalized Boxplot 5% 98.48% 99.60% 98.48% 99.79%

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 18/23 Example 2: 200 earthquakes in Latin America (2013)

Estimated medcouple: 0.43 2 Standard 1.5 1 Density Adjusted .5 Yacuanquer, Colombia Islands, UK Falkland Peru Acari, 0

4.5 5 5.5 6 6.5 7 Generalized Magnitude 4.5 5 5.5 6 6.5 7 Magnitude

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 20/23 Stata command

Syntax box_out varname [if][in][,out(varname ) bdp(#) perc(#) nograph] Options out: Identi…es the new variable to be created to identify individuals outside the fence de…ned by the whiskers bdp: Sets the desired Break-down point (in %). It is 10% by default perc: Sets the desired percentage of points outside the whiskers in case of uncontaminated data. It is set to 0.7% by default nograph: Suppresses the graph Saved results and output e(g), e(h): Estimated skewness and elongation parameters of the underlying Tukey g and h distribution e(lowerW), e(upperW): Value of the lower and upper whiskers A basic boxplot is created but we recommend to refer to N. J. Cox, S.J. (2009) for better output

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 21/23 Conclusion

Generalized boxplot We propose a very simple generalized boxplot that is suited for skewed and/or heavy-tailed distributions allows for setting the desired detection rate of atypical observation has a computational complexity of O(n) In Stata We provide a simple command that estimates the whiskers of the generalized boxplot creates a simple boxplot. we however refer to Cox (2009) and Cox(2013) for more complete graphs. Complementary results In multivariate analysis we have a projection based estimator to create a bagplot in 2D identify outliers for multivariate skewed and heavy-tailed distributions

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 22/23 Reference

References Bru¤aerts, C., Verardi, V. and Vermandele, C., (2014). "A generalized boxplot for skewed and heavy-tailed distributions". and Probability Letters (forthcoming) Cox, N.J. (2013)."Speaking Stata: Creating and varying box plots: Correction". Stata Journal 13(2). pp. 398-400. Cox, N.J. (2009)."Speaking Stata: Creating and varying box plots". Stata Journal 9(3). pp. 478-496. Gelade, W., Verardi, V. and Vermandele, C. (2014). "Time e¢ cient algorithms for robust estimators of location, scale, symmetry and tail heaviness". Stata Journal (forthcoming). Hubert, M. and Vendervieren, E. (2008). "An adjusted boxplot for skewed distributions". Computational Statistics & Data Analysis 52(12). pp 5186-5201

Vincenzo Verardi (UK Stata users meeting) A generalized boxplot 12/09/2014 23/23