Contributions to Outlier Detection Methods: Some Theory and Applications

CONTRIBUTIONS TO OUTLIER DETECTION METHODS: SOME THEORY AND APPLICATIONS

by

YINAZE HERVE DOVOEDO

SUBHABRATA CHAKRABORTI, COMMITTEE CHAIR
B. MICHAEL ADAMS
BRUCE BARRETT
GILES D’SOUZA
JUNSOO LEE

A DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Information Systems, Statistics, and Management Science in the Graduate School of The University of Alabama

TUSCALOOSA, ALABAMA

2011

Copyright Yinaze Herve Dovoedo 2011
ALL RIGHTS RESERVED

ABSTRACT

Tukey’s traditional boxplot (Tukey, 1977) is a widely used Exploratory Data Analysis (EDA) tool, often applied to outlier detection with univariate data. In this dissertation, a modification of Tukey’s boxplot is proposed in which the probability of at least one false alarm is controlled, as in Sim et al. (2005). The exact expression for that probability is derived and is used to find the fence constants for observations from any specified location-scale distribution. The proposed procedure is compared with that of Sim et al. (2005) in a simulation study.

Outlier detection and control charting are closely related. Using the preceding procedure, one- and two-sided boxplot-based Phase I control charts for individual observations are proposed for data from an exponential distribution, while controlling the overall false alarm rate. The proposed charts are compared with the charts of Jones and Champ (2002) in a simulation study.

Sometimes the practitioner is unable or unwilling to make an assumption about the form of the underlying distribution but is confident that the distribution is skewed. In that case, it is well documented that applying Tukey’s boxplot for outlier detection results in an increased number of false alarms. To this end, in this dissertation, a modification of the so-called adjusted boxplot for skewed distributions by Hubert and Vandervieren (2008) is proposed.
The proposed procedure is compared to the adjusted boxplot and Tukey’s procedure in a simulation study.

In practice, the data are often multivariate. The concept of a (statistical) depth (or, equivalently, outlyingness) function provides a natural, nonparametric, “center-outward” ordering of a multivariate data point with respect to a data cloud. The deeper a point, the less outlying it is. It is then natural to use some outlyingness functions as outlier identifiers. A simulation study is performed to compare the outlier detection capabilities of selected outlyingness functions available in the literature for multivariate skewed data. Recommendations are provided.

ACKNOWLEDGMENTS

I would like to extend my grateful thanks to my dissertation committee, especially the chair, as well as other professors, for their help and time. I also thank everyone who gave me emotional support.

CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES

1. INTRODUCTION
    1.1 Brief Overview of Outlier Detection
    1.2 Univariate Outlier Detection
    1.3 Higher Dimensional Outlier Detection
    1.4 Focus of Dissertation
        1.4.1 A New Boxplot Procedure for Outlier Detection in Location-scale Data
        1.4.2 Application: Boxplot Based Phase I Control Charts for Individual Observations
        1.4.3 A Modified Adjusted Boxplot (Distribution-free) Procedure for Skewed Distributions
        1.4.4 Outlier Detection for Multivariate Skewed Data: A Comparative Study
    1.5 Organization of the Dissertation

2. LITERATURE REVIEW
    2.1 Univariate Outlier Detection
        2.1.1 Controlling False Alarm Rate in the Traditional Boxplot
        2.1.2 Accounting for Skewness in the Boxplot Procedures
        2.1.3 Adjusting for Skewness in the Boxplot Procedures: Another Approach
            2.1.3.1 A New Measure of Skewness: The Medcouple
            2.1.3.2 Adjusted Boxplot Procedures for Skewed Distributions
    2.2 Multivariate Outlier Detection
        2.2.1 Distance Based Outlier Detection
        2.2.2 Projection Pursuit Method for Outlier Detection
        2.2.3 Using Data Depth for Outlier Detection
        2.2.4 Multivariate Skewed Data and Outlier Detection
    2.3 Summary

3. ON A MORE GENERAL BOXPLOT METHOD FOR IDENTIFYING OUTLIERS IN THE LOCATION-SCALE FAMILY

4. BOXPLOT-BASED PHASE I CONTROL CHARTS FOR TIME BETWEEN EVENTS

5. A MODIFIED ADJUSTED BOXPLOT FOR SKEWED DISTRIBUTIONS

6. OUTLIER DETECTION FOR MULTIVARIATE SKEW-NORMAL DATA: A COMPARATIVE STUDY

7. SUMMARY AND FUTURE RESEARCH
    7.1 Summary
    7.2 Future Research

ADDITIONAL REFERENCES

ADDITIONAL APPENDIX: R PROGRAMS USED IN SIMULATIONS
    A1 Programs for chapter 3
    A2 Programs for chapter 4
    A3 Programs for chapter 5
    A4 Programs for chapter 6

LIST OF TABLES

3.1 Fence constants for selected sample sizes from the Normal, Logistic, and Exponential populations
3.2 Approximate fence constants for selected large sample sizes from the Normal, Logistic, and Exponential populations
3.3 Data from Daniel (1959)
3.4 Fences for various boxplot procedures for the data in Daniel (1959)
3.5 Performance of various procedures in detecting outliers based on Daniel (1959)’s data
3.6 Times between failures data
3.7 Fences for various boxplot procedures for the times between failures data
3.8 Performance of various procedures in detecting outliers based on Times to failure data
3.9 Empirical proportions of outliers detected with respect to the hypothesized Expo(1) distribution
3.10 Empirical proportions of outliers detected with respect to the hypothesized logis(0,1) distribution
4.1 Charting constants for the proposed one-sided control chart
4.2 Charting constants for the
