CONTRIBUTIONS TO OUTLIER DETECTION METHODS:
SOME THEORY AND APPLICATIONS
by
YINAZE HERVE DOVOEDO
SUBHABRATA CHAKRABORTI, COMMITTEE CHAIR B. MICHAEL ADAMS BRUCE BARRETT GILES D’SOUZA JUNSOO LEE
A DISSERTATION
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Information Systems, Statistics, and Management Science in the Graduate School of The University of Alabama
TUSCALOOSA, ALABAMA
2011
Copyright Yinaze Herve Dovoedo 2011 ALL RIGHTS RESERVED
ABSTRACT
Tukey’s traditional boxplot (Tukey, 1977) is a widely used Exploratory Data Analysis
(EDA) tools often used for outlier detection with univariate data. In this dissertation, a
modification of Tukey’s boxplot is proposed in which the probability of at least one false alarm is controlled, as in Sim et al. 2005. The exact expression for that probability is derived and is
used to find the fence constants, for observations from any specified location-scale distribution.
The proposed procedure is compared with that of Sim et al., 2005 in a simulation study.
Outlier detection and control charting are closely related. Using the preceding procedure,
one- and two-sided boxplot-based Phase I control charts for individual observations are proposed
for data from an exponential distribution, while controlling the overall false alarm rate. The proposed charts are compared with the charts by Jones and Champ, 2002, in a simulation study.
Sometimes, the practitioner is unable or unwilling to make an assumption about the form of the underlying distribution but is confident that the distribution is skewed. In that case, it is well documented that the application of Tukey’s boxplot for outlier detection results in increased number of false alarms. To this end, in this dissertation, a modification of the so-called adjusted boxplot for skewed distributions by Hubert and Vandervieren, 2008, is proposed. The proposed procedure is compared to the adjusted boxplot and Tukey’s procedure in a simulation study.
In practice, the data are often multivariate. The concept of a (statistical) depth (or equivalently outlyingness) function provides a natural, nonparametric, “center-outward” ordering of a multivariate data point with respect to data cloud. The deeper a point, the less outlying it is.
It is then natural to use some outlyingness functions as outlier identifiers. A simulation study is
ii
performed to compare the outlier detection capabilities of selected outlyingness functions available in the literature for multivariate skewed data. Recommendations are provided.
iii
ACKNOWLEDGMENTS
I would like to extend my grateful thanks to my dissertation committee, specially the chair, as well as other professors, for their help and time. I also thank everyone who gave me emotional support.
iv
CONTENTS
ABSTRACT ...... ii
ACKNOWLEDGMENTS ...... iv
LIST OF TABLES ...... viii
LIST OF FIGURES ...... x
1. INTRODUCTION ...... 1
1.1 Brief Overview of Outlier Detection ...... 1
1.2 Univariate Outlier Detection ...... 2
1.3 Higher Dimensional Outlier Detection ...... 3
1.4 Focus of Dissertation ...... 4
1.4.1 A New Boxplot Procedure for Oulier Detection in Location-scale Data ...... 4
1.4.2 Application: Boxplot Based Phase I Control Charts for Individual Observations ...... 6
1.4.3 A Modified Adjusted Boxplot (Distribution-free) Procedure for Skewed Distributions ...... 7
1.4.4 Outlier Detection for Multivariate Skewed Data: A Comparative study ...... 8
1.5 Organization of the Dissertation ...... 10
2. LITERATURE REVIEW ...... 11
2.1 Univariate Outlier Detection ...... 11
v
2.1.1 Controlling False Alarm Rate in the Traditional Boxplot ...... 12
2.1.2 Accounting for Skewness in the Boxplot Procedures ...... 13
2.1.3 Adjusting for Skewness in the Boxplot Procedures: Another Approach ...... 15
2.1.3.1 A new Measure of Skewness: The Medcouple ...... 15
2.1.3.2 Adjusted Boxplot Procedures for Skewed Distributions ...... 17
2.2 Multivariate Outlier Detection ...... 18
2.2.1 Distance Based Outlier Detection ...... 19
2.2.2 Projection Pursuit Method for Outlier Detection ...... 22
2.2.3 Using Data Depth for Outlier Detection ...... 24
2.2.4 Multivariate Skewed Data and Outlier Detection ...... 27
2.3 Summary ...... 29
3. ON A MORE GENERAL BOXPLOT METHOD FOR IDENTIFYING OUTLIERS IN THE LOCATION-SCALE FAMILY ...... 31
4. BOXPLOT-BASED PHASE I CONTROL CHARTS FOR TIME BETWEEN EVENTS ...... 59
5. A MODIFIED ADJUSTED BOXPLOT FOR SKEWED DISTRIBUTIONS ...... 79
6. OUTLIER DETECTION FOR MULTIVARIATE SKEW-NORMAL DATA: A COMPARATIVE STUDY ...... 105
7. SUMMARY AND FUTURE RESEARCH ...... 136
7.1 Summary ...... 136
7.2 Future Research ...... 138
ADDITIONAL REFERENCES...... 139
ADDITIONAL APPENDIX: R PROGRAMS USED IN SIMULATIONS ...... 141
A1 Programs for chapter 3 ...... 141
vi
A2 Programs for chapter 4 ...... 143
A3 Programs for chapter 5 ...... 149
A4 Programs for chapter 6 ...... 166
vii
LIST OF TABLES
3.1 Fence constants for selected sample sizes from the Normal, Logistic, and Exponential populations ...... 38
3.2 Approximate fence constants for selected large sample sizes from the Normal, Logistic, and Exponential populations ...... 39
3.3 Data from Daniel (1959) ...... 41
3.4 Fences for various boxplot procedures for the data in Daniel (1959) ...... 42
3.5 Performance of various procedures in detecting outliers based on Daniel (1959)’s data ...... 43
3.6 Times between failures data ...... 44
3.7 Fences for various boxplot procedures for the times between failures data ...... 44
3.8 Performance of various procedures in detecting outliers based on Times to failure data ...... 45
3.9 Empirical proportions of outliers detected with respect to the hypothesized Expo 1 distribution ...... 48
3.10 Empirical proportions of outliers detected with respect to the hypothesized logis 0,1 distribution ...... 49
4.1 Charting constants for the proposed one-sided control chart ...... 65
4.2 Charting constants for the proposed two-sided control chart ...... 68
4.3 Empirical overall false alarm rates for the one-sided control charts ...... 69
4.4 Empirical overall false alarm rates for the two-side control charts ...... 70
4.5 Out-of-control performance of the one-sided phase I control charts ...... 73
4.6 Out-of-control performance of the two-sided phase I control charts ...... 74
5.1 The 20 different distributions used in the simulation study ...... 81
5.2 Distributions used to fit the models ...... 87
viii
5.3 Various boxplot procedure fences for various data sets ...... 89
5.6 Comparisons among various boxplot procedures ...... 98
6.1 Formulae for outlyingness functions used in the simulation study ...... 125
6.2 The 90th, 95th and 99th percentiles of the distributions of four selected outlyingness functions for 100 when the data come from the multivariate skew-normal distributions specified in the simulation study ...... 126
6.3 Case of 2D-3D Cluster outliers: The performance of four outlyingness functions using the 99th, the 95th and the 90th Percentiles of outlyingness values of underlying uncontaminated distributions ...... 127
6.4 Case of 2D-3D Radial outliers: Performance of four outlyingness functions using the 99th, the 95th and the 90th percentiles of outlyingness values of underlying uncontaminated distributions ...... 128
6.5 Outlyingness values for the bushfire data ...... 129
6.6 Performance of outlyingness functions under study on the bushfire data ...... 129
ix
LIST OF FIGURES
3.1 The traditional boxplot and the proposed boxplot ...... 35
4.1 Phase I control chart for times between failures ...... 76
5.1 Tukey’s boxplot, and the proposed boxplot ...... 99
5.2 Comparison of average percentages of outliers declared, in both tails combined (No contamination) ...... 100
5.3 Absolute deviation from the true percentage of outliers (5% lower contamination) of the average percentages of outliers declared in the lower tail ...... 100
5.4 Absolute deviation from the true percentage of outliers (5% lower contamination) of the average percentages of outliers declared (in the two tails combined) ...... 101
5.5 Absolute deviation from the true percentage of outliers (5% upper contamination) of the average percentages of outliers declared in the upper tail ...... 101
5.6 Absolute deviation from the true percentage of outliers (5% upper contamination) of the average percentages of outliers declared (in the two tails combined) ...... 102
6.1 simulated bivariate skew-normal data with 10% cluster outliers ...... 130
6.2 simulated bivariate skew-normal data with 10% radial outliers ...... 130
6.3 Cluster outliers. Simulation results for two-dimensional and three-dimensional data of size n 100 using the 99th percentiles of uncontaminated data outlyingness values ...... 131
6.4 Cluster outliers. Simulation results for two-dimensional and three-dimensional data of size n 100 using the 95th percentiles of uncontaminated data outlyingness values ...... 131
6.5 Cluster outliers. Simulation results for two-dimensional and three-dimensional data of size n 100 using the 90th percentiles of uncontaminated data outlyingness values ...... 132
6.6 Radial outliers. Simulation results for two-dimensional and three-dimensional
x
data of size 100 using the 99th percentiles of uncontaminated data outlyingness values ...... 132
6.7 Radial outliers. Simulation results for two-dimensional and three-dimensional data of size 100 using the 95th percentiles of uncontaminated data outlyingness values ...... 133
6.8 Radial outliers. Simulation results for two-dimensional and three-dimensional data of size 100 using the 90th percentiles of uncontaminated data outlyingness values ...... 133
xi
CHAPTER 1
INTRODUCTION
1.1 Brief Overview of Outlier Detection
As pointed out by Beckman and Cook (1983), no observation can be guaranteed to be a totally dependable manifestation of a phenomenon under consideration. However, observations which stand apart from the bulk of the data warrant attention. Many terms have been used in the literature to refer to such observations; some of them listed in Beckman and Cook (1983) are
“outliers”, “discordant observations”, “rogue values”, “contaminants”, “surprising values”,
“mavericks”, “dirty”. The notion of outliers has been somewhat vague throughout the years.
Beckman and Cook (1983) reported how Edgeworth (1887) defined such observations:
“discordant observations may be defined as those which present the appearance of differing in respect of their law of frequency from other observations with which they are combined”.
Outliers undeniably have a long history.
It is apparent that if one could identify the “faulty” observations in a dataset, then one could better understand the phenomenon under study; however, if one could better understand the phenomenon under study, discordant observations could easily be identified and hence inference and predictions for the future could be improved. Thus, the study or the detection of outliers is like the “chicken-and-egg” problem as Hadi, Imon, and Werner (2009) put it. The following citation is due to Francis Bacon (1620), and was taken from Billor, Hadi and Velleman
(2000); also cited in Serfling (2007):
1
“Whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately describe her way.” – Francis Bacon, 1620
Needless to say that the study of outliers has received considerable attention (see for example Tukey, 1977; Barnett, 1978; Hawkins, 1980; Davies and Gather, 1993; Barnett and
Lewis, 1994; Schwertman, Owens, and Adnan, 2004; Hawkins, 2006; Schwertman and de Silva
2007; Cerioli 2010; Dang and Serfling, 2010) in the context of data analysis. This is simply because their non-identification or mis-identification can substantially influence the data analysis leading to distortion and possibly inaccurate conclusions. In some circumstances, however, the outliers themselves are of interest, as they may lead to new knowledge or discovery.
1.2 Univariate Outlier Detection
Among the more routinely used univariate outlier detection techniques, the boxplot is the most popular with practitioners. The boxplot is a versatile exploratory data analysis (EDA) tool, used for displaying and summarizing univariate data. It helps visualize the location, spread, and skewness of data distributions, along with unusual values or outliers. The most commonly used version of the boxplot is the one by Tukey (1977) and is referred to as the traditional boxplot. Let