
A Unified Approach to Data Transformation and Outlier

Detection using Penalized Assessment

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Ph.D. in the McMicken College of Arts and Sciences

November 2013 by Wei Guo

B.S., TongJi University, Shanghai, China, 2006; M.S., University of Cincinnati, Cincinnati, OH, USA, 2010

Committee Chair: Seongho Song Ph.D.

Abstract

In many statistical applications, a normally distributed sample free of outliers is desired. In practice, however, the normality assumption is often violated, for example when highly influential outliers exist in the dataset, which adversely impacts the validity of the statistical analysis. In this dissertation, a Unified Approach is proposed that handles outlier detection and Box-Cox transformation simultaneously using a penalized information criterion.

This research began by investigating the performance of the Box-Cox transformation in uncontaminated samples and suggested that the sample should be anchored to 1 before the Box-Cox transformation is applied when the sample minimum is larger than 1. Simulation results showed that the anchor-to-1 method works well in enhancing the accuracy of the Box-Cox transformation by decreasing the variability of λ and eliminating extremely large or small values of λ. The efficacy of the Unified Approach is also verified in clean samples, both normal and lognormal, where the Unified Approach is able to determine that no Box-Cox transformation is needed and no outliers are present. Simulations in contaminated normal samples then demonstrated that the Unified Approach can achieve a balance between a good model fit (close to a normal sample) and the complexity of the data analysis by penalizing anchoring-to-1, outlier exclusion, and Box-Cox transformation in the form of an adjusted information criterion. Through precise outlier detection and appropriate Box-Cox transformation, the efficacy of the Unified Approach is verified in the contaminated samples.



Acknowledgement

I would like to take this opportunity to extend my deepest gratitude and sincere thanks to my adviser, Dr. Paul S. Horn, for his priceless guidance, insightful feedback, warm encouragement, and constant support over the past few years. I appreciate the valuable advice and support of the other committee members, Drs. Seongho Song, Siva Sivaganesan, Xia Wang, and Emily Kang. I thank the Department of Mathematics for providing the financial support for my graduate studies at the University of Cincinnati.

I thank my parents, Zhanjun Guo and Yuexia Zhang, for all their love, support, and faith in me, and for allowing me to be as ambitious as I wanted. I also would like to thank my mother-in-law, Conghui Wang, for helping us take care of my child while I was working toward my degree. Without her help I could not have finished my degree or pursued my dream career.

Lastly, I would like to thank my wife, Xuejiao Diao. Her enduring love, encouragement, patience, and caring have been the greatest support throughout my life. Without her companionship I could not have succeeded in my graduate study in a foreign country. I was fortunate to have my daughter Angie during my graduate study; she brings me so much joy and makes me laugh, although I did not have much time for her every day. Without such a warm family none of this would have been possible.


Table of Contents

Abstract ...... i
Acknowledgement ...... iii
Table of Contents ...... iv
List of Figures ...... vi
List of Tables ...... x
1 Introduction ...... 1
1.1 Background ...... 1
1.2 Outliers ...... 5
1.3 Data transformation ...... 8
1.3.1. Mathematical expressions ...... 8
1.3.2. Estimation of parameters ...... 10
1.3.3. Hypothesis tests and inference on transformation parameter ...... 12
1.3.4. Impact of outliers on data transformation ...... 14
1.4 Information Criterion ...... 16
1.4.1 Akaike Information Criterion ...... 16
1.4.2 Bayesian Information Criterion ...... 17
1.5 Research Gap in Literature ...... 19
2. Unified Approach ...... 22
2.1. Anchoring-to-1 ...... 22
2.2. Penalized Assessment ...... 41
2.3. Implementation of the unified approach ...... 43
2.4. Advantages of the new method ...... 47
3. No Outliers: The Uncontaminated Sample Case ...... 51
3.1. Overview ...... 51
3.2. Anchoring-to-1 ...... 54
3.3. Unified Approach-Normal Distribution ...... 75
3.4. Assessment-Normal Distribution ...... 91
3.5. Unified Approach-Lognormal Distribution ...... 98
3.6. Assessment-Lognormal Distribution ...... 113


4. Outliers: The Contaminated Sample Case ...... 116
4.1. Samples with Outliers ...... 116
4.2. Simulation on N(10,1) samples ...... 123
4.3. Simulation on Lognormal samples ...... 151
5. Discussion and future research ...... 164
Bibliography ...... 169
Appendix ...... 177


List of Figures

Figure 1-1 Research Gap ...... 21
Figure 2-1 Anchor-to-1, N(10,1), Sample size=100 ...... 24
Figure 2-2 Anchor-to-5, N(10,1), Sample size=100 ...... 25
Figure 2-3 LogNormal, Anchor-to-1, sample size=100 ...... 27
Figure 2-4 LogNormal, Anchor-to-5, sample size=100 ...... 28
Figure 2-5 Lambdas of Different Anchoring ...... 29
Figure 2-6 95% CI of λ at different anchor points ...... 30
Figure 2-7 Distribution of Skewed Raw Sample ...... 32
Figure 2-8 Comparison of Skewness and Kurtosis for different anchoring ...... 34
Figure 2-9 Histogram of transformed sample with different anchoring: SQRT Transformation ...... 35
Figure 2-10 Histogram of transformed sample with different anchoring: LOG Transformation ...... 36
Figure 2-11 Histogram of transformed sample with different anchoring: INVERSE Transformation ...... 37
Figure 2-12 Outlier Candidates ...... 46
Figure 2-13 Flowchart of overall data processing ...... 48
Figure 2-14 Flowchart of the Unified Approach ...... 49
Figure 2-15 Flowchart of outlier drop ...... 50
Figure 3-1 Variation of λ using regular BC on original sample ...... 55
Figure 3-2 95% Confidence interval of lambda in baseline case ...... 56
Figure 3-3 Histogram of original sample ...... 58
Figure 3-4 Histogram of transformed data with lambda 9.13 ...... 59
Figure 3-5 Histogram of original sample ...... 61
Figure 3-6 Histogram of transformed sample using lambda=-6.91 ...... 62
Figure 3-7 Distribution of lambda after anchoring-to-1 the original sample ...... 65
Figure 3-8 Mean and 95% confidence interval of lambda after anchoring-to-1 ...... 66
Figure 3-9 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted ...... 67
Figure 3-10 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted, cont. ...... 68
Figure 3-11 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda ...... 69
Figure 3-12 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda, cont. ...... 70
Figure 3-13 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda ...... 71
Figure 3-14 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont. ...... 72
Figure 3-15 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda ...... 73
Figure 3-16 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont. ...... 74
Figure 3-17 Outliers Detected in Clean N(10,1) Samples, 1000 Repetitions ...... 77
Figure 3-18 Location of detected outliers, N(10,1), Clean Sample ...... 78
Figure 3-19 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, no outliers excluded, no penalty for Box-Cox ...... 79
Figure 3-20 Box-plot of lambda: Unified Approach, N(10,1), Clean Data, outliers excluded (>=1) ...... 80


Figure 3-21 Box-plot of Anderson-Darling Statistic: Unified Approach, N(10,1), Clean Data, no outliers detected, no penalty for Box-Cox ...... 81
Figure 3-22 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, outliers excluded (>=1) ...... 82
Figure 3-23 Percent of four cases in unified approach with penalty to BC ...... 86
Figure 3-24 Location of outliers, NO Box-Cox involved ...... 86
Figure 3-25 Location of outliers, BC=1 and DP=1 ...... 88
Figure 3-26 Box-plot of Lambda: Unified Approach, Clean Data, BC=1, outliers excluded ...... 89
Figure 3-27 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, BC=1, no outliers excluded ...... 90
Figure 3-28 Comparison of MSE, BC=0 DP=0, N(10,1) Clean Data ...... 96
Figure 3-29 Comparison of MSE, BC=0 DP=1, N(10,1) Clean Data ...... 96
Figure 3-30 Comparison of MSE, BC=1 DP=0, N(10,1) Clean Data ...... 97
Figure 3-31 Comparison of MSE, BC=1 DP=1, N(10,1) Clean Data ...... 97
Figure 3-32 Outliers Detected in Clean LogN(0,1) Samples, 1000 Repetitions ...... 100
Figure 3-33 Location of outliers detected, LogN(0,1), Clean Data ...... 102
Figure 3-34 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox ...... 103
Figure 3-35 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox ...... 104
Figure 3-36 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox ...... 106
Figure 3-37 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox ...... 107
Figure 3-38 Percent of four cases in unified approach with penalty to BC, Clean LogN(0,1) ...... 110
Figure 3-39 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, no outliers excluded ...... 111
Figure 3-40 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, outliers excluded (>=1) ...... 112
Figure 3-41 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=0 ...... 115
Figure 3-42 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=1 ...... 116
Figure 4-1 Distributions of Normal samples with outliers ...... 118
Figure 4-2 Distributions of Lognormal samples with outliers ...... 121
Figure 4-3 Distribution of Lambda in Baseline with outliers, N(10,1) ...... 124
Figure 4-4 Confidence Interval of Lambda in Baseline with outliers, N(10,1) ...... 125
Figure 4-5 Distribution of Lambda, Anchor-to-1, with outliers, N(10,1) ...... 127
Figure 4-6 Confidence Interval of Lambda, Anchor-to-1, with outliers, N(10,1) ...... 128
Figure 4-7 Accuracy of outlier detection, N(10,1), 5% outliers, no penalty for BC ...... 132
Figure 4-8 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy1=1, no extra penalty for Box-Cox ...... 133
Figure 4-9 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy2=1, no extra penalty for Box-Cox ...... 134
Figure 4-10 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy3=1, no extra penalty for Box-Cox ...... 135


Figure 4-11 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy4=1, no extra penalty for Box-Cox ...... 136
Figure 4-12 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy5=1, no extra penalty for Box-Cox ...... 137
Figure 4-13 Percent of different handling methods, 5% outliers, N(10,1), penalty for BC ...... 139
Figure 4-14 Accuracy of outlier detection, N(10,1), 5% outliers, penalty for BC, DP=1 and BC=0 cases ...... 140
Figure 4-15 Distribution of lambda, Unified Approach, N(10,1), 5% outlier with penalty for BC, accuracy5=1 cases ...... 141
Figure 4-16 Accuracy of outlier detection, N(10,1), 10% outliers, no penalty for BC ...... 142
Figure 4-17 Percent of different handling methods, 10% outliers, penalty for BC ...... 142
Figure 4-18 Accuracy of outlier detection, N(10,1), 5% outliers, penalty for BC, DP=1 and BC=0 cases ...... 143
Figure 4-19 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 10% outlier ...... 144
Figure 4-20 Accuracy of outlier detection, N(10,1), 15% outliers, no penalty for BC ...... 145
Figure 4-21 Percent of different handling methods, 15% outliers, penalty for BC ...... 145
Figure 4-22 Accuracy of outlier detection, N(10,1), 15% outliers, penalty for BC, DP=1 and BC=0 cases ...... 146
Figure 4-23 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 15% outlier ...... 147
Figure 4-24 Accuracy of outlier detection, N(10,1), 20% outliers, no penalty for BC ...... 148
Figure 4-25 Percent of different handling methods, 20% outliers, penalty for BC ...... 148
Figure 4-26 Accuracy of outlier detection, N(10,1), 20% outliers, penalty for BC, DP=1 and BC=0 cases ...... 149
Figure 4-27 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 20% outlier ...... 150
Figure 4-28 Percent of different handling methods, 5% outliers, LogN(0,1), penalty for BC ...... 152
Figure 4-29 Accuracy of outlier detection, LogN(0,1), 5% outliers, no penalty for BC ...... 152
Figure 4-30 Accuracy of outlier detection, LogN(0,1), 5% outliers, penalty for BC ...... 153
Figure 4-31 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC ...... 153
Figure 4-32 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, penalty for BC ...... 154
Figure 4-33 Percent of different handling methods, 10% outliers, LogN(0,1), penalty for BC ...... 155
Figure 4-34 Accuracy of outlier detection, LogN(0,1), 10% outliers, no penalty for BC ...... 155
Figure 4-35 Accuracy of outlier detection, LogN(0,1), 10% outliers, penalty for BC ...... 156
Figure 4-36 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC ...... 156
Figure 4-37 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, penalty for BC ...... 157
Figure 4-38 Percent of different handling methods, 15% outliers, LogN(0,1), penalty for BC ...... 158


Figure 4-39 Accuracy of outlier detection, LogN(0,1), 15% outliers, no penalty for BC ...... 158
Figure 4-40 Accuracy of outlier detection, LogN(0,1), 15% outliers, penalty for BC ...... 159
Figure 4-41 Distribution of lambda, LogN(0,1), 15% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC ...... 159
Figure 4-42 Distribution of lambda, LogN(0,1), 15% outlier, accuracy5=1, BC=1 DP=1, penalty for BC ...... 160
Figure 4-43 Percent of different handling methods, 20% outliers, LogN(0,1), penalty for BC ...... 161
Figure 4-44 Accuracy of outlier detection, LogN(0,1), 20% outliers, no penalty for BC ...... 161
Figure 4-45 Accuracy of outlier detection, LogN(0,1), 20% outliers, penalty for BC ...... 162
Figure 4-46 Distribution of lambda, LogN(0,1), 20% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC ...... 162
Figure 4-47 Distribution of lambda, LogN(0,1), 20% outlier, accuracy5=1, BC=1 DP=1, penalty for BC ...... 163


List of Tables

Table 2-1 Comparison of Skewness and Kurtosis for different anchoring ...... 33
Table 3-1 An example where BC gives extreme lambda ...... 57
Table 3-2 An example where BC gives extreme lambda, cont. ...... 57
Table 3-3 An example showing the bomb of Box-Cox, which gives lambda=-6.91 ...... 60
Table 3-4 An example showing the bomb of Box-Cox, which gives lambda=-6.91, cont. ...... 60
Table 3-5 Comparison of MSE ...... 95
Table 3-6 Location of outliers excluded, LogN(0,1), Clean Data ...... 101
Table 4-1 Goodness of Fit statistics of Normal samples with/without outliers ...... 119
Table 4-2 Goodness of Fit statistics of Lognormal samples with/without outliers ...... 122


1 Introduction

1.1 Background

It is an unfortunate fact of statistical analysis that the data used to build models and draw inferences are not always well-behaved. Outliers are a typical example of aberrant data. They appear in almost all research projects, especially in observational studies, where the dataset contains unusual observations and/or the distribution of the collected data shows extreme skewness and kurtosis.

Outliers can have an adverse impact on statistical analyses. First, they usually increase the error variance and reduce the power of statistical tests. Second, outliers may violate the normality assumption required by many statistical models, inflating both Type I and Type II errors. Third, they can seriously bias the estimates of the parameters of interest. If a researcher's approach is to do nothing about the presence of outliers, the resulting statistical models will essentially describe none of the data, neither the bulk of the data nor the outliers. Even if the outliers are processed, the conclusions drawn from the data might still be biased, or even point in the opposite direction, if the outlier handling strategy is not carefully designed.

Outliers first need to be identified before they can be handled appropriately. Many outlier detection strategies have been proposed and applied in practice. Grubbs' test (Grubbs, 1969), Dixon's Q test (Dixon, 1951), the Tietjen-Moore test (Tietjen and Moore, 1972), and the Generalized ESD test (Rosner, 1983) are popular methods for detecting outliers. Other methods flag observations using the interquartile range, such as Tukey's outlier filter (Hoaglin, Mosteller, and Tukey, 1983). The rest of the approaches are either distance-based (Knorr et al., 1998) or density-based (Breunig, 2000), and both use the distance to the k-nearest neighbors to label outliers.

There is a great deal of literature on how to handle identified outliers. The prevailing methods fall into the following categories. The simplest is to exclude the suspicious observations and then analyze the remaining clean observations, although the deletion of outliers is still debated; in regression, for example, deleted outliers may exhibit a large degree of influence on the parameter estimates (Cook, 1979).

The second method is to use data transformation to alleviate the impact of outliers; natural log and square root transformations are popular choices for handling outliers or skewed distributions. Through transformation, extreme observations can be kept in the dataset without loss of information and the relative ordering of the observations is preserved, while the skewness and error variance present in the data set can be largely reduced (Hamilton, 1992). In particular, Box and Cox's power transformation (Box and Cox, 1964) is widely used to accommodate outliers and skewed distributions. The transformation parameter λ can be found using the maximum likelihood estimator (MLE) (Box and Cox, 1964). However, the MLE of the transformation parameter might be adversely affected by outliers, as shown in Andrews (1971). Conversely, the transformed data set may still contain outliers or exhibit skewness/kurtosis, indicating the failure of the Box-Cox transformation.

The third way is to accommodate outliers, that is, to use various "robust" procedures to protect the data from being distorted by the presence of outliers. These techniques "accommodate the outliers at no serious inconvenience-or are robust against the presence of outliers" (Barnett and Lewis, 1994, p. 35). Certain parameter estimates, especially the mean (variance) and least squares estimates, are particularly vulnerable to outliers, or have "low breakdown" values (Osborne, 2004). For this reason, researchers turn to robust or "high breakdown" methods to provide alternative estimates for these important aspects of the data. Huber's M-estimation (Huber, 1981), least trimmed squares estimation (Rousseeuw, 1984), S-estimation (Rousseeuw and Yohai, 1984), and MM-estimation (Yohai, 1987) are of this type.

On the other side, if we know a priori why an outlier exists, we can simply delete it, for example, when there is a known misplacement of a decimal point. In other situations, however, outliers can contain important information that cannot be ignored, such as in medical studies where the underlying distribution of values is naturally skewed to the right (e.g., cholesterol in adults who are not on statins). In this situation, should we treat those extremely large observations as outliers or simply as regular observations in the long tail? One researcher's outlier may not be another's. Tail observations in one study may be judged as outliers in another.

If we decide those suspicious observations are outliers, we face the question of whether they should be excluded. If we judge them to be non-outliers, we would like to find a way to accommodate them. Data transformation is of great help in this case. A suitable transformation preserves all relevant information while easing the impact of extreme observations.

Another issue is that if we delete some outliers and/or transform the data, how should we assess the end result? If outliers are dropped before the transformation, we bear the risk of losing the information in those deleted observations, as in the cholesterol example above.

We need to make a trade-off between the improved fit of the data and the loss of information through the exclusion of outliers. Information criteria such as the Akaike Information Criterion (AIC) (Akaike, 1974) and the Bayesian Information Criterion (BIC; also the Schwarz criterion, SBC, or SBIC) (Schwarz, 1978) are popular goodness-of-fit statistics that can measure this trade-off; in particular, they penalize over-fitting the data. In our problem, they can be used to penalize the exclusion of outliers.

In this dissertation, I consider the Box-Cox transformation, outlier detection, and an information criterion simultaneously to address the issues mentioned above. A unified new approach to handling outliers in the process of Box-Cox transformation, using penalized assessment, is proposed and examined through simulation studies. Another method for finding an appropriate Box-Cox transformation parameter is also studied, and applications of the new method to real data are demonstrated.


1.2 Outliers

In statistics, an outlier is an observation that is numerically distant from the rest of the data (Barnett and Lewis, 1994). Moore and McCabe (1999) further state that an outlier lies outside the overall pattern of a distribution.

Outliers can have many causes. One possible cause is a change in the measurement system; for example, outliers may appear when a physical measurement apparatus breaks down, and errors in data transmission or transcription can also lead to outliers. A second cause is human manipulation or, sometimes, fraudulent behavior. A third possibility is that the outliers are simply a reflection of reality; in this case, the natural deviations should be flagged for further investigation (Hodge and Austin, 2004).

For centuries there was no standard mathematical definition of an outlier, and recognition was based on subjective judgment. Usually, anomalous observations were detected and, where appropriate, removed to preserve the cleanliness of the dataset. System defects and human fraud could be identified during the outlier detection process, effectively preventing catastrophic consequences in the first place. With today's advanced applications of computer science and statistics, outlier detection methods are much more rigorous and systematic.

The following is a brief review of the models commonly used for outlier detection. All of these models assume that the data follow a normal distribution, so that large deviations from the mean are very "unlikely" (Rousseeuw and Leroy, 1996).

• Grubbs' test for outliers (Grubbs, 1969)


In Grubbs' test, the hypotheses are defined as follows: H0: there are no outliers; Ha: there is at least one outlier. The Grubbs' statistic is defined as the largest absolute deviation from the sample mean in units of the sample standard deviation (Grubbs, 1969):

$$G = \max_{i=1,\ldots,N} \frac{|Y_i - \bar{Y}|}{s}$$

where $\bar{Y}$ and $s$ are the sample mean and standard deviation, respectively. In the two-sided test, the null hypothesis is rejected at significance level $\alpha$ if

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$

where $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the $t$ distribution at significance level $\alpha/(2N)$ with $N-2$ degrees of freedom.
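As an illustration, the statistic and critical value above can be computed directly. The following is a minimal sketch (not the dissertation's code; the function names are illustrative), using SciPy only for the t-distribution quantile:

```python
import math
from scipy.stats import t

def grubbs_statistic(y):
    # G = max |Y_i - Ybar| / s, with s the sample standard deviation
    n = len(y)
    ybar = sum(y) / n
    s = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    return max(abs(v - ybar) for v in y) / s

def grubbs_critical(n, alpha=0.05):
    # two-sided critical value built from the upper t quantile
    # at level alpha/(2N) with N-2 degrees of freedom
    t2 = t.ppf(1 - alpha / (2 * n), n - 2) ** 2
    return (n - 1) / math.sqrt(n) * math.sqrt(t2 / (n - 2 + t2))
```

An observation far from the bulk of a small sample (e.g. a value of 30 among values near 10) pushes G above the critical value, so the null hypothesis of no outliers is rejected.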

• Dixon's Q test (Dixon, 1951)

To screen bad data with a Q test, the first step is to arrange the data in ascending order and calculate Q as:

$$Q = \frac{\text{gap}}{\text{range}}$$

where the gap is the absolute difference between the outlier and the value closest to it. The questionable point can be rejected when $Q > Q_{\text{table}}$, where $Q_{\text{table}}$ is a reference value corresponding to the confidence level and the sample size.

Other methods flag observations according to the interquartile range, for example the interval [Q1 - k(Q3-Q1), Q3 + k(Q3-Q1)], where Q1 and Q3 are the lower and upper quartiles, respectively. Any observation outside this range can be labeled an outlier (Tukey, 1977).
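This quartile-based filter can be sketched as follows, using the conventional k = 1.5. The sketch is illustrative (not from the dissertation), and the quartile convention used here, medians of the lower and upper halves, is one of several in common use:

```python
def tukey_fences(data, k=1.5):
    """Return the observations outside [Q1 - k*IQR, Q3 + k*IQR]."""
    xs = sorted(data)
    n = len(xs)

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    q1 = median(xs[: n // 2])          # median of the lower half
    q3 = median(xs[(n + 1) // 2 :])    # median of the upper half
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]
```

Unlike Grubbs' or Dixon's tests, the fences make no normality assumption and require no reference table, which is why they are popular for quick exploratory screening.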


In addition to the common model-based outlier detection methods, there are also distance-based methods (Knorr, 1998) and density-based methods (Breunig, 2000). Both of these frequently use the distance to the k-nearest neighbors to gauge outliers.

In the process of identifying outliers, the masking effect and the swamping effect are problems we need to pay attention to. The masking effect means that in the successive use of a test for detecting a single outlier, we may fail to detect any outlier because of the effect of other outliers (Tietjen and Moore, 1972). For example, when there are multiple outliers rather than one or two, the additional outliers may influence the value of the test statistic so that it shows no sign of outliers. The swamping effect, on the other hand, occurs when too many outliers are flagged during detection (Barnett and Lewis, 1984). For example, when there exists only a single outlier while we are looking for two or more, the test results will usually indicate either none or all of the tested points as outliers. Because of masking and swamping effects, many tests require that the number of outliers being tested be clearly specified before the test (Kitagawa, 1979).


1.3 Data transformation

Data transformation should align with the particular statistical analysis method. Some statistical analyses provide guidance on how data transformation should be performed, or whether it is necessary at all. For instance, to construct a 95% confidence interval, an easy way is to take the sample mean plus and minus two standard deviation units. However, this two-standard-deviation rule can only be used for data with a normal distribution and is not applicable to non-normal data. The central limit theorem states that the sample mean in most cases does vary normally when the sample size is reasonably large. However, if the sample size is not large enough and the sample is substantially skewed, the central limit theorem, together with the resulting confidence interval, fails to be a good approximation for the population. Therefore, it is advised that skewed data be transformed to a symmetric distribution before a confidence interval is constructed. To preserve the original scale of the data, the confidence interval constructed on the transformed symmetric scale can be transformed back with the inverse of the transformation.
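The back-transformation idea can be sketched for right-skewed data. This illustration (not from the dissertation) uses a log transform and the two-standard-deviation rule on the transformed scale, then inverts with exp():

```python
import math
import statistics

def back_transformed_interval(data):
    """Build mean +/- 2 SD on the log scale, then map the endpoints back."""
    logs = [math.log(x) for x in data]   # transform to (approximate) symmetry
    m = statistics.mean(logs)
    s = statistics.stdev(logs)
    # invert the transformation to return to the original scale
    return math.exp(m - 2 * s), math.exp(m + 2 * s)
```

Note that the resulting endpoints are asymmetric about the raw-scale center, which is exactly what a right-skewed distribution calls for.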

1.3.1. Mathematical expressions

Tukey (1957) proposed a family of power transformations. The transformed values are a monotonic function of the observations over some admissible range, indexed by the constant λ for positive values:

$$x_i^{(\lambda)} = \begin{cases} x_i^{\lambda}/\lambda, & \lambda \neq 0 \\ \log x_i, & \lambda = 0 \end{cases}$$

However, the transformation was modified by Box and Cox (1964) to take into account the discontinuity at λ = 0, so that

$$x_i^{(\lambda)} = \begin{cases} (x_i^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \log x_i, & \lambda = 0 \end{cases}$$

for some unknown λ. To accommodate negative values, the transformation was further modified by Box and Cox (1964) with a shift parameter:

$$x_i^{(\lambda)} = \begin{cases} [(x_i + \mu)^{\lambda} - 1]/\lambda, & \lambda \neq 0 \\ \log(x_i + \mu), & \lambda = 0 \end{cases}$$

where λ is the power transformation parameter and μ is chosen so that $x_i + \mu$ is greater than zero for all of the data.
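The shifted Box-Cox transform above can be written as a small helper. A minimal sketch follows (the function name and the positivity check are illustrative, not the dissertation's code):

```python
import math

def box_cox(x, lam, shift=0.0):
    """Shifted Box-Cox transform of one value; requires x + shift > 0."""
    y = x + shift
    if y <= 0:
        raise ValueError("x + shift must be positive")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam
```

As λ approaches 0, the λ ≠ 0 branch approaches log(y), which is why the λ = 0 case removes the discontinuity of the raw power family.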

Box and Cox (1964) also proposed a normalized version of the transformation, scaled by the geometric mean $\mathrm{GM}(x) = (x_1 \cdots x_n)^{1/n}$:

$$x_i^{(\lambda)} = \begin{cases} (x_i^{\lambda} - 1)\big/\big(\lambda\, \mathrm{GM}(x)^{\lambda-1}\big), & \lambda \neq 0 \\ \mathrm{GM}(x) \log x_i, & \lambda = 0 \end{cases}$$

It can also be extended to a shifted version:

$$x_i^{(\lambda)} = \begin{cases} [(x_i + \mu)^{\lambda} - 1]\big/\big(\lambda\, \mathrm{GM}(x)^{\lambda-1}\big), & \lambda \neq 0 \\ \mathrm{GM}(x) \log(x_i + \mu), & \lambda = 0 \end{cases}$$

Manly (1976) suggested an alternative transformation that is suitable for negative values and is claimed to be effective at turning skewed unimodal distributions into nearly symmetric, normal-like distributions:

$$x_i^{(\lambda)} = \begin{cases} [\exp(\lambda x_i) - 1]/\lambda, & \lambda \neq 0 \\ x_i, & \lambda = 0 \end{cases}$$

John and Draper (1980) introduced the modulus transformation, intended to normalize distributions that already possess some degree of approximate symmetry; it has the following form:

$$x_i^{(\lambda)} = \begin{cases} \operatorname{sign}(x_i)\,[(|x_i| + \mu)^{\lambda} - 1]/\lambda, & \lambda \neq 0 \\ \operatorname{sign}(x_i) \log(|x_i| + \mu), & \lambda = 0 \end{cases}$$

Bickel and Doksum (1981) suggested another version of the transformation so that distributions of $x_i^{(\lambda)}$ with unbounded support, such as the normal distribution, can be included: for λ > 0,

$$x_i^{(\lambda)} = \big[\operatorname{sign}(x_i)\,|x_i|^{\lambda} - 1\big]\big/\lambda$$

The inverse hyperbolic sine transformation proposed by Johnson (1949) is also applied to accommodate skewed data:

$$g_t = \sinh^{-1}(\theta y_t)/\theta = \log\!\big(\theta y_t + (\theta^2 y_t^2 + 1)^{1/2}\big)\big/\theta$$

Yeo and Johnson (2000) proposed a modification of the Box-Cox transformation that is defined for any real x:

$$\psi(\lambda, x) = \begin{cases} [(x+1)^{\lambda} - 1]/\lambda, & x \ge 0,\ \lambda \neq 0 \\ \log(x+1), & x \ge 0,\ \lambda = 0 \\ -[(-x+1)^{2-\lambda} - 1]/(2-\lambda), & x < 0,\ \lambda \neq 2 \\ -\log(-x+1), & x < 0,\ \lambda = 2 \end{cases}$$
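Because the Yeo-Johnson family handles negative values directly, it is straightforward to sketch; the following illustration follows the four cases above (the function name is illustrative):

```python
import math

def yeo_johnson(x, lam):
    """Yeo & Johnson (2000) transform, defined for any real x."""
    if x >= 0:
        if lam == 0:
            return math.log(x + 1)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log(-x + 1)
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)
```

A useful sanity check: at λ = 1 the transform reduces to the identity, ψ(1, x) = x for all x, so λ measures the departure from "no transformation".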

1.3.2. Estimation of parameters

To find the transformation parameter λ (and a shift parameter, where present), Box and Cox (1964) proposed maximum likelihood as well as Bayesian methods. For the maximum likelihood method, it is assumed that for some unknown λ the transformed observations satisfy the normal theory assumptions, i.e., they are independently normally distributed with mean μ and constant variance σ². By multiplying the normal density by the Jacobian of the transformation, we obtain the probability density of the transformed observations, and hence the likelihood in terms of the original observations. Maximizing this likelihood with respect to λ, μ, and σ² provides the corresponding parameter estimates.
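This maximization is commonly carried out by profiling out μ and σ² and scanning λ over a grid. The following sketch (illustrative, not the dissertation's implementation) uses the standard profile log-likelihood, $-(n/2)\log\hat{\sigma}^2_{\lambda} + (\lambda-1)\sum_i \log x_i$, for positive data:

```python
import math

def box_cox_lambda_mle(data, grid=None):
    """Grid search of the Box-Cox profile log-likelihood over lambda."""
    if grid is None:
        grid = [i / 100 for i in range(-200, 201)]  # lambda in [-2, 2]
    n = len(data)
    sum_log = sum(math.log(x) for x in data)  # Jacobian term
    best_lam, best_llf = None, -math.inf
    for lam in grid:
        if lam == 0:
            z = [math.log(x) for x in data]
        else:
            z = [(x ** lam - 1) / lam for x in data]
        mean = sum(z) / n
        var = sum((v - mean) ** 2 for v in z) / n  # profiled sigma^2
        llf = -(n / 2) * math.log(var) + (lam - 1) * sum_log
        if llf > best_llf:
            best_lam, best_llf = lam, llf
    return best_lam
```

For data that grow roughly geometrically, the maximizer lands near λ = 0, i.e., close to a log transform, which matches the intuition that logging such data symmetrizes it.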

For the Bayesian method, it is assumed that the prior distributions of μ and log σ can be taken as essentially uniform over the region in which the likelihood is appreciable, and we integrate over these parameters to obtain a posterior distribution of λ. A review of further Bayesian considerations can be found in Pericchi (1981) and Sweeting (1984). Robust adaptations of the estimation procedures have been studied by a group of researchers, including Carroll (1980, 1982a), Bickel and Doksum (1981), Carroll and Ruppert (1984), Taylor (1983, 1985a, b, 1987), and Carroll and Ruppert (1987).

Andrews et al. (1971, 1973), Dunn and Tubbs (1980), and Beauchamp and Robson (1986) have extended the Box-Cox procedure to multivariate data. Draper and Cox (1969) proposed an approximation for the precision of the maximum likelihood estimate, which was later corrected by Hinkley (1975). Cressie (1978) suggested a simple graphical procedure to estimate the transformation parameter by utilizing the principle of one degree of freedom for non-additivity in a two-way table without replication, while Hernandez and Johnson (1980) proposed to estimate the transformation by minimizing the Kullback-Leibler information (Kullback and Leibler, 1951) when examining the large-sample behavior of transformation to normality.

Hinkley (1985) proposed an estimation of the transformation parameter based on the likelihood analysis for local deviations from a normal theory linear model. This estimation simultaneously corrects for non-additivity and heterogeneity of residuals and is considered a more analytical procedure. Later, based on Kendall's rank correlation, Han (1987) proposed a nonparametric estimation of the transformation parameter; Han's approach was considered to be more consistent and efficient than the maximum likelihood estimator. Solomon (1985) suggested applying the Box-Cox technique to simple random effects models, and the technique was later applied to all types of mixed models by Sakia (1988). Chang compiled computer programs for the estimation of λ, and the programs were subsequently refined by Huang et al. (1978).

1.3.3. Hypothesis tests and inference on transformation parameter

Researchers are always interested in finding out whether the estimate of the Box-Cox transformation parameter conforms to a hypothesized value. Box and Cox (1964) employed the asymptotic distribution of the likelihood ratio to test the hypotheses. Later, Andrews (1971) proposed another test in which the null distribution of the parameter is known. This test for the value of the parameter ignores the Jacobian of the transformation and is easier to calculate. Atkinson (1973) incorporated this approach in a comparison of the powers of the three tests and concluded that the tests derived from the likelihood method have more power. Lawrance (1987a) later standardized Atkinson's score statistic and conducted a comparison of the two tests. It was found that the standardized statistic has improved standard normal behavior compared to Atkinson's test. Lawrance's statistic has since been extended by Hinkley (1988) to models where both the response and the mean may be transformed.

Furthermore, as an alternative to the likelihood ratio confidence interval for testing hypotheses about the transformation parameter, Lawrance (1987b) has given an asymptotically justifiable expression for the estimated variance of λ, which leads to more efficient hypothesis tests on λ than Atkinson's (1985, p. 100), which is based on a regression analogy for constructed variables. A more recent improvement of both Atkinson's test and Lawrance's standardized score statistic has been proposed by Wang (1987) and is said to give a more accurate approximation to the standard normal. A simulated comparison of the test statistics by Atkinson and Lawrance (1989) concluded that, in general, Atkinson's test is very similar to Lawrance's test. However, a caution should be made that the small samples used in Lawrance's test could have masked its superiority over Atkinson's test.

While Draper and Cox (1969) have shown that the estimation of λ is fairly robust to non-normality as long as the variable has a reasonably symmetric distribution, this may not be the case when skewness is encountered. Thus, Poirier (1978) investigated the effect of the transformation in limited dependent variables, i.e., variables which have possibly been censored or truncated, thus introducing some skewness. The procedure maximizes the likelihood function of the truncated normal distribution. Although the asymptotic properties of the maximum likelihood estimators are known, little is known about their small-sample properties. Spitzer (1978) employed the Box-Cox transformation approach and examined the small-sample properties. The results suggested that the Box-Cox approach met the assumption of approximate normality. For forecasting purposes, the forecasts were unbiased and their variances were remarkably low.

Carroll and Ruppert (1981), Carroll (1982a), and Hinkley and Runger (1984) have provided further responses to the work of Bickel and Doksum (1981). Doksum and Wong (1983) concluded that tests on Box-Cox transformed data have good power properties. Therefore, it is acceptable to use the standard methods for the normal linear model on the transformed variables.

Wood (1974) and Carroll and Ruppert (1984, 1988) made transformations on both the response and the theoretical model in order to examine the Box-Cox transformation when the theoretical regression function was given. The Box-Cox transformation approach was also examined by a Monte Carlo study, which concluded that it matters little when the correct transformation is not known a priori. Later, Ruppert et al. (1989) studied whether transforming theoretical and empirical models fits the Michaelis-Menten model, together with its error structure. Rudemo et al. (1989) used the power transformation for the logistic model in bioassay and reached positive conclusions. Wixley (1986) applied a Box-Cox power transformation in linear models, and the results suggested that unconditional likelihood ratio tests maintain a more accurate level.

1.3.4. Impact of outliers on data transformation.

The selection of a transformation may properly be viewed as part of model formulation and, in this initial phase of analysis, influential cases can have particularly important and lasting effects that are difficult to uncover in the subsequent analysis. Thus, an outlying observation in the original scale may conform to the rest of the data in the transformed scale. It is therefore necessary to find out whether the evidence for a particular transformation is spread evenly throughout the data or comes from just a few cases.

Andrews (1971) showed that the maximum likelihood estimator (MLE) of the transformation parameter may be adversely influenced by one or several outlying observations. Atkinson (1982) introduced some diagnostic displays of outlying and influential observations in multiple regression and their possible reduction by a transformation. Some criteria for estimating the transformation through the use of constructed variables are suggested by Atkinson (1983, 1985), and this has been extended to the power transformation after a shift in location. Carroll (1980, 1982b) proposed robust estimators of the transformation parameter by replacing the normal likelihood with an objective function that is less sensitive to outlying responses. In order to measure the influence of cases on the Box-Cox likelihood estimate of the response transformation parameter in regression, Cook and Wang (1983) proposed a method superior to Atkinson's in regard to detecting influential cases for a transformation. Atkinson (1986) further extended his work by deriving some expressions for estimating the effect of deletion of observations on the estimate of the transformation parameter and some subsequent tests of hypotheses.


1.4 Information Criterion

1.4.1 Akaike Information Criterion

The Akaike information criterion, AIC (Akaike, 1974), is a measure of the goodness of fit of a statistical model and is used for variable selection. It describes the tradeoff between the accuracy and the complexity of the model and is often used as a model selection index. The idea behind the AIC is that in choosing between possible models there needs to be a trade-off between the improved fit of a more complex model and the increased number of parameters such a model requires.

AIC = -2 ln(L) + 2k,

where k is the number of parameters and L is the maximized value of the likelihood function. Given a dataset, the model with the minimum value of AIC should be preferred. Hence AIC not only rewards goodness-of-fit, but also includes a penalty that is an increasing function of the number of parameters, i.e., the complexity of the model. The simplicity of the AIC expression and of its calculation makes it appropriate for dealing with outliers.

Kitagawa (1979) was the first to apply AIC to detect outliers. In his work, two models were proposed to describe the distribution of a data set consisting of clean data and outlying observations. Observations on the low side and the high side are treated as outlier candidates, while the middle part of the data is treated as the main group, or clean sample. In Model 1, the outliers are assumed to be normally distributed with the same variance σ² as the main group but with different means. In Model 2, they are assumed to be normally distributed with the same mean as the main group but with different variances. The AIC of models with different numbers of outliers is computed, and the minimum AIC decides which observations are outliers, corresponding to the "best" model.


Model 1:

f_i(x) = φ(x; μ1, σ²),   i = 1, …, n1
f_i(x) = φ(x; μ, σ²),    i = n1 + 1, …, n − n2
f_i(x) = φ(x; μ2, σ²),   i = n − n2 + 1, …, n

Model 2:

f_i(x) = φ(x; μ, τ1²),   i = 1, …, n1
f_i(x) = φ(x; μ, σ²),    i = n1 + 1, …, n − n2
f_i(x) = φ(x; μ, τ2²),   i = n − n2 + 1, …, n

where n1 and n2 are the numbers of outlier candidates at the low end and the high end, respectively.
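The two-model setup above can be sketched in code. The following is a simplified toy illustration of the Model 1 idea (my own construction, not Kitagawa's exact formulation): the lowest n1 and highest n2 sorted observations receive their own group means, all groups share one variance estimated by its pooled MLE, and the (n1, n2) pair with the smallest AIC is selected.

```python
import math

def kitagawa_aic(xs, n1, n2):
    """AIC of a Model-1-style fit on the sorted data xs: the lowest n1
    and highest n2 points get their own group means, and all groups
    share one variance, estimated by its pooled MLE."""
    n = len(xs)
    groups = [xs[:n1], xs[n1:n - n2], xs[n - n2:]]
    sse = 0.0
    for g in groups:
        if g:
            m = sum(g) / len(g)
            sse += sum((v - m) ** 2 for v in g)
    k = 2 + (n1 > 0) + (n2 > 0)   # mu, sigma, plus one mean per outlier group
    sigma2 = sse / n
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return -2 * loglik + 2 * k

def detect_outliers(x, max_per_side=3):
    """Scan all (n1, n2) pairs and return the pair with the smallest AIC."""
    xs = sorted(x)
    scored = [(kitagawa_aic(xs, n1, n2), n1, n2)
              for n1 in range(max_per_side + 1)
              for n2 in range(max_per_side + 1)]
    _, n1, n2 = min(scored)
    return n1, n2
```

On toy data this search reliably isolates a gross outlier, but with AIC's fixed penalty and a tiny sample it also tends to flag some genuine extremes; this small-sample weakness of AIC is one motivation for the BIC-based penalty adopted later in this dissertation.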

Kadota (2003a, 2003b, 2006) applied the U-statistic (Ueda, 1996), an approximation of AIC, to detect genes whose expression profile is considerably different in some tissues than in others. Gene expression ratios are sorted and normalized, and then an approximation to AIC called the "U-statistic" is calculated for different combinations of outlier candidates. The combination with the minimum U-statistic identifies the tissue-specific genes, which are in fact outliers.

1.4.2 Bayesian Information Criterion

Based on the likelihood function, the Bayesian information criterion (BIC) or Schwarz criterion (also SBC, SBIC) (Schwarz, 1978) provides a criterion for model selection among a finite set of models. The BIC is derived under an exponential family distribution assumption. The formula is:

BIC = -2 ln(L) + k ln(n) ≈ -2 ln(p(x|k)),

where ln(L) is the maximized log-likelihood of the observations, k is the number of free parameters to be estimated, and n is the sample size of the data. ln(p(x|k)) is the natural log probability of the observed data given the number of parameters.


In the model fitting process, adding parameters can help increase the likelihood. However, this can cause overfitting of the model. BIC penalizes the number of parameters and resolves the overfitting issue properly. The penalty is larger in BIC than in AIC.
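As a concrete illustration of the two formulas, the sketch below computes AIC and BIC for a normal fit with k = 2 free parameters; the function name and setup are illustrative, not from any particular package.

```python
import math

def normal_fit_criteria(x):
    """Return (AIC, BIC) of a normal fit with k = 2 free parameters
    (mean and variance), using the log-likelihood maximized at the MLEs."""
    n = len(x)
    mu = sum(x) / n
    sigma2 = sum((v - mu) ** 2 for v in x) / n       # MLE of the variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = 2
    return -2 * loglik + 2 * k, -2 * loglik + k * math.log(n)
```

Since the two criteria share the -2 ln(L) term, their difference is exactly k ln(n) - 2k; the BIC penalty exceeds the AIC penalty once ln(n) > 2, i.e., for n ≥ 8.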


1.5 Research Gap in Literature

A lot of research work has been done on detecting outliers, transforming data, and applying information criteria (AIC and BIC). However, many problems remain unaddressed.

First, the Box-Cox transformation is not resistant in the presence of outliers due to the non-robustness of the MLE. Whether to transform data with or without the outliers needs further consideration. The transformed data might not be as normal as expected and may still contain outliers if they are retained. If they are dropped, we bear the risk of losing information in exchange for a better model fit. Moreover, the transformation parameter λ found through optimization is not always reasonable due to the flatness of the likelihood function.

The second question of interest is how suspicious observations are treated. There are two ways to view them: either as real outliers or as regular extreme observations in a skewed dataset. Identified outliers can be dropped to improve model fitting, and the skewness of the data can be alleviated through data transformation. This is a critical issue that needs to be addressed before any analysis can be conducted in the presence of extreme observations.

Third, the advantage of using an information criterion such as AIC to detect outliers is quite obvious, as shown in the work of Kitagawa (1979) and Kadota (2003b), but its limitations cannot be overlooked. Outliers can be identified through AIC, but how to handle them, i.e., whether to exclude the identified outliers or keep them, is still a problem. How to evaluate the goodness of fit of the Box-Cox transformation when outliers are dropped is unsolved. Additionally, in Kitagawa's (1979) paper AIC is used for a small sample (15 observations); when the sample size becomes larger, the computational complexity increases dramatically, because the AIC of all possible combinations of outlier candidates needs to be calculated. Moreover, we do not know exactly how many outliers are in the sample, which makes the case more difficult. The assumption that the outliers are normally distributed is quite strong and sometimes not realistic; for example, if there are in fact only two or three outliers, it is not reasonable to use a normal distribution to describe them.

Last but not least, there is no research that integrates the Bayesian Information Criterion (BIC), the Box-Cox transformation, and outlier detection in a way that can robustly detect outliers and find the appropriate transformation at the same time without excluding too many extreme observations. I will find the best compromise between removing outliers and/or transforming the data, using the BIC as a guide. In this dissertation I propose a unified approach that combines the Box-Cox transformation and outlier detection using penalized assessment, which will tackle the above-mentioned problems. The research gap in the literature is shown in Figure 1-1.


Figure 1-1 Research Gap


2. Unified Approach

2.1. Anchoring-to-1

The Box-Cox transformation can be directly applied to positive observations; for samples containing negative values, one more step is needed, which is to shift the whole sample to positive values by adding a constant to each of the observations. This is stated by Box and Cox (1964):

x_i = [(y_i + μ)^λ − 1] / λ,   λ ≠ 0
x_i = log(y_i + μ),            λ = 0

where y_i is the raw sample of size n, x_i is the transformed sample, λ is the transformation parameter, and μ is chosen so that y_i + μ is greater than zero for all of the data. Now the question is how to choose μ. One may choose μ = −y_min + ε to ensure that the minimum of the shifted sample is always greater than zero, which means that the sorted shifted sample becomes

ε, y_(2) + ε − y_min, y_(3) + ε − y_min, …, y_(n) + ε − y_min,

where ε can theoretically be any positive number.

The following two figures show the comparison of distributions for different choices of ε. The sample here is Normal(10, 1) with sample size 100. The shape of the histogram does not change much, although the minimum of the distribution is shifted to 1 (Figure 2-1) and 5 (Figure 2-2), respectively.

The next two figures show the shift of a skewed distribution, Log-Normal; the sample size is also 100. Before the shift, the sample has quite a long right tail, while after shifting to 1 and 5, the symmetry of the distribution is improved, although it does not look normal yet (Figure 2-3 and Figure 2-4).


Now two questions arise here:

1) Does the choice of ε affect the resulting λ?

2) Does y_min influence the efficacy of the Box-Cox transformation?


Figure 2-1 Anchor-to-1, N(10,1), Sample size=100


Figure 2-2 Anchor-to-5, N(10,1), Sample size=100


It is noted that adding a constant to a variable changes only the mean, not the standard deviation or variance, skewness, or kurtosis. However, the constant ε, and thus the amount of the shift of the distribution, can influence subsequent data transformations. This will be shown in the following examples.

A simple simulation here shows the effect of different choices of ε on the Box-Cox transformation parameter λ. Samples from a normal distribution are generated with sample sizes ranging from 20 to 200. Five choices of ε are used to shift the minimum of the original samples to 0.05, 0.1, 1, 2, and 3, respectively. For each case, 1000 repetitions are produced. Since those samples are generated from a normal distribution, we expect to find λ = 1 using the Box-Cox transformation, which means that no transformation is needed. It can be seen that in the baseline case, where the raw sample without any shift is transformed using Box-Cox, the resulting λ has large variation, especially in small samples. When the other values of ε are applied, the variation of λ is alleviated, its mean is quite close to 1, and the confidence interval is narrower than in the baseline case (Figure 2-5 and Figure 2-6).
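A miniature version of this simulation can be sketched as follows, using a crude grid search in place of the Newton-Raphson step; the repetition count, grid, and seed are scaled-down illustrative choices, not those used in the dissertation's simulations.

```python
import math, random, statistics

def boxcox_lambda(y, grid=None):
    """Profile-likelihood estimate of the Box-Cox lambda via grid search
    (a simple stand-in for Newton-Raphson)."""
    if grid is None:
        grid = [l / 10 for l in range(-50, 51)]   # lambda in [-5, 5]
    n = len(y)
    sumlog = sum(math.log(v) for v in y)
    best_lam, best_ll = grid[0], -math.inf
    for lam in grid:
        if abs(lam) < 1e-12:
            x = [math.log(v) for v in y]
        else:
            x = [(v ** lam - 1) / lam for v in y]
        m = sum(x) / n
        s2 = sum((v - m) ** 2 for v in x) / n
        ll = -0.5 * n * math.log(s2) + (lam - 1) * sumlog
        if ll > best_ll:
            best_ll, best_lam = ll, lam
    return best_lam

# Scaled-down simulation: lambda estimated on raw N(10,1) samples
# versus the same samples anchored to 1.
random.seed(1)
raw_lams, anchored_lams = [], []
for _ in range(100):
    y = [random.gauss(10, 1) for _ in range(30)]
    raw_lams.append(boxcox_lambda(y))
    anchored = [v + 1 - min(y) for v in y]        # shift the minimum to 1
    anchored_lams.append(boxcox_lambda(anchored))

raw_sd = statistics.pstdev(raw_lams)
anchored_sd = statistics.pstdev(anchored_lams)
```

The anchored estimates should spread far less than the raw ones, in line with Figures 2-5 and 2-6.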


Figure 2-3 LogNormal, Anchor-to-1, sample size=100


Figure 2-4 LogNormal, Anchor-to-5, sample size=100


Figure 2-5 Lambdas of Different Anchoring


Figure 2-6 95% Confidence Interval of λ at different anchor points


Another example also shows the impact of anchoring on transformation. We generated a skewed random sample of size 100; the raw sample is generated from Exp(X)+10, where X ~ N(0,1), and the histogram is plotted in Figure 2-7. It is desired to find a transformation that will make it behave normally. Three of the most frequently used transformations, square root, natural logarithm, and inverse, are applied to the raw data. We also anchored the raw sample to different values: 1, 2, 3, 5, and 10. The skewness and kurtosis of the raw sample are 2.1 and 4.8, respectively, which indicates a large deviation from normality. If the three transformations are directly applied to the raw data, the skewness and kurtosis become closer to zero (normal distribution). If the raw sample is first anchored to one of the five values, the skewness and kurtosis both become even closer to zero, indicating a better fit to normality compared to no anchoring (Table 2-1 and Figure 2-8). The histograms of transformed samples using different anchoring values are shown in Figure 2-9, Figure 2-10, and Figure 2-11.
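The anchoring effect in this example can be reproduced in a few lines. The sketch below uses an assumed seed and its own random sample, so the exact numbers will differ from Table 2-1; it compares the skewness of the log-transformed sample with and without anchoring to 1.

```python
import math, random

def skewness(x):
    """Population skewness: third central moment over sigma cubed."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return sum((v - m) ** 3 for v in x) / (n * s ** 3)

# Skewed sample in the spirit of Figure 2-7: Exp(X) + 10 with X ~ N(0,1).
random.seed(7)
y = [math.exp(random.gauss(0, 1)) + 10 for _ in range(200)]
anchored = [v + 1 - min(y) for v in y]           # shift the minimum to 1

sk_log_raw = skewness([math.log(v) for v in y])
sk_log_anchored = skewness([math.log(v) for v in anchored])
```

As in the LOG row of Table 2-1, the log transform applied after anchoring to 1 should leave noticeably less skewness than the log transform applied directly to the raw data.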


Figure 2-7 Distribution of Skewed Raw Sample


Table 2-1 Comparison of Skewness and Kurtosis for different anchoring

                Y        Original   Anchor_1   Anchor_2   Anchor_3   Anchor_5   Anchor_10
SQRT  Skewness  2.1053   1.9236     1.4504     1.6104     1.7020     1.8074     1.9224
      Kurtosis  4.7690   3.8745     2.0144     2.5720     2.9235     3.3575     3.8688
LOG   Skewness  2.1053   1.7506     0.8505     1.1570     1.3303     1.5294     1.7483
      Kurtosis  4.7690   3.0917     0.2596     1.0066     1.5277     2.2145     3.0818
INV   Skewness  2.1053  -1.4316     0.0682    -0.4215    -0.7039    -1.0404    -1.4274
      Kurtosis  4.7690   1.8216    -0.9297    -0.6234    -0.1728     0.6024     1.8066


[Figure 2-8 consists of three panels, "Comparison of anchoring for Sqrt Transformation," "Comparison of anchoring for Log Transformation," and "Comparison of anchoring for Inverse Transformation," each plotting the skewness and kurtosis at the different anchor values.]

Figure 2-8 Comparison of Skewness and Kurtosis for different anchoring


Figure 2-9 Histogram of transformed sample with different anchoring: SQRT Transformation


Figure 2-10 Histogram of transformed sample with different anchoring: LOG Transformation


Figure 2-11 Histogram of transformed sample with different anchoring: INVERSE Transformation


To find an outlier in a dataset, one simple way is to sort the observations from smallest to largest and then look at those on the lower and higher sides to see whether they are extreme enough to be treated as outliers. Usually the transformation is applied to the raw observations without any modification (for positive data). In this thesis, it is proposed that one extra step is needed before the transformation procedure: the sorted raw data should be anchored to 1. Osborne (2002) also mentioned this point, although no simulation was provided in his work. The reason why the raw data is anchored to 1 is that the minimum value of a sample has an effect on the efficacy of transformations, which is shown in the above examples and simulations. Further details of the benefits of anchoring-to-1 will be given in later sections, especially when the objective is to look for outliers and transform the data at the same time. The derivation of the Box-Cox transformation parameter λ is given below.

• Non-Anchoring

The raw data are denoted by y_1, y_2, …, y_n and the transformed data by x_1, x_2, …, x_n, in order of increasing magnitude. The transformed data are expected to follow a normal distribution with unknown mean and variance. The Box-Cox transformation parameter is obtained by maximizing the log-likelihood function of the transformed data.

x_i = (y_i^λ − 1) / λ,   λ ≠ 0
x_i = log(y_i),          λ = 0

x_1, x_2, …, x_n ~ N(μ, σ²)

The likelihood function of x_1, x_2, …, x_n is

L_x = n! ∏_{i=1}^{n} f(x_i) = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²))

To find the likelihood function of the raw data y_1, y_2, …, y_n, the Jacobian of the Box-Cox transformation is

J = |dx_i / dy_i| = y_i^{λ−1}

L_y = n! ∏_{i=1}^{n} f(x_i) ∏_{i=1}^{n} y_i^{λ−1} = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²)) ∏_{i=1}^{n} y_i^{λ−1}

The log-likelihood function is

LL_y = log(n!) − (n/2) log(2πσ²) − Σ_{i=1}^{n} (x_i − μ)² / (2σ²) + (λ − 1) Σ_{i=1}^{n} log(y_i)

Through MLE,

μ̂ = x̄,   σ̂² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²

Substituting μ̂ and σ̂² into the log-likelihood function,

LL_y = log(n!) − (n/2) log(2πe) − (n/2) log(σ̂²) + (λ − 1) Σ_{i=1}^{n} log(y_i)

λ can be found iteratively through the Newton-Raphson algorithm.

• Anchoring-to-1

x_i = [(y_i + m)^λ − 1] / λ,   λ ≠ 0
x_i = log(y_i + m),            λ = 0

where m = −y_(1) + 1, which shifts the minimum of the anchored sample to 1.

The likelihood function of x_1, x_2, …, x_n remains the same:

L_x = n! ∏_{i=1}^{n} f(x_i) = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²))

To find the likelihood function of the raw sample y_1, y_2, …, y_n, the Jacobian of the Box-Cox transformation is

J = |dx_i / dy_i| = (y_i + m)^{λ−1}

L_y = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²)) ∏_{i=1}^{n} (y_i + m)^{λ−1}

The log-likelihood function is

LL_y = log(n!) − (n/2) log(2πσ²) − Σ_{i=1}^{n} (x_i − μ)² / (2σ²) + (λ − 1) Σ_{i=1}^{n} log(y_i + m)

Through MLE,

μ̂ = x̄,   σ̂² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²

Substituting μ̂ and σ̂² into the log-likelihood function,

LL_y = log(n!) − (n/2) log(2πe) − (n/2) log(σ̂²) + (λ − 1) Σ_{i=1}^{n} log(y_i + m)

λ can be found iteratively through the Newton-Raphson algorithm, similar to the non-anchored case.
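The final anchored log-likelihood can be implemented directly. The sketch below evaluates LL_y at the MLEs of μ and σ², with math.lgamma(n + 1) supplying the log(n!) term; the function name is illustrative. One convenient consequence of anchoring, which the first assertion below exploits, is that m absorbs any location shift, so the profile likelihood is unchanged when a constant is added to every raw observation.

```python
import math

def anchored_loglik(y, lam):
    """Profile log-likelihood LL_y of the raw data under the
    anchored-to-1 Box-Cox transformation, evaluated at the MLEs of
    mu and sigma^2."""
    n = len(y)
    m = 1 - min(y)                       # anchor the sample minimum to 1
    z = [v + m for v in y]
    if abs(lam) < 1e-12:
        x = [math.log(v) for v in z]
    else:
        x = [(v ** lam - 1) / lam for v in z]
    mu = sum(x) / n
    sigma2 = sum((v - mu) ** 2 for v in x) / n
    return (math.lgamma(n + 1)                       # log(n!)
            - 0.5 * n * math.log(2 * math.pi * math.e)
            - 0.5 * n * math.log(sigma2)
            + (lam - 1) * sum(math.log(v) for v in z))
```

Maximizing this function over λ (by Newton-Raphson, or by a simple grid) gives the anchored Box-Cox estimate.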


2.2. Penalized Assessment

Information criteria such as AIC and BIC are measures of the goodness of fit of a statistical model. They describe the tradeoff between the accuracy and the complexity of the model and are often used as model selection criteria. The idea behind the AIC is that in choosing between possible models there needs to be a trade-off between the improved fit of a more complex model and the increased number of parameters such a model requires. Information criteria (AIC and BIC) can be used to detect outliers, as seen in Kitagawa (1979) and Kadota (2003a, 2003b, 2006). In my unified approach, an information criterion is used to assess the goodness of fit of the transformed data, which are supposed to follow a normal distribution. The information criterion can be easily computed from the sum of squared errors of the transformed data.

When a skewed sample or a sample with outlying observations is to be analyzed, one may have the following options to make the data behave normally, or close to normally:

a) Box-Cox transform the whole sample to get a normal transformed sample;

b) Exclude some extreme observations in the tails to make the remaining sample look normal;

c) Exclude some extreme observations and Box-Cox transform the remaining sample to get transformed data that are closer to normal than in case b).

The purpose of the above options is to obtain a better model fit, meaning a transformed sample close to normal. However, one may run the risk of over-fitting the data, in the sense that one tends to exclude more extreme observations (causing loss of information) or implement unnecessary transformations (over-analyzing) to get a better fit. In this sense, I would like to utilize the information criterion's attractive feature of penalizing over-fitting to decide the trade-off and answer the questions: how many extreme observations should be excluded, whether a Box-Cox transformation is needed, and whether the transformation should be based on the whole sample or a partial sample.

ሺሻ. In theŽכሺሻ ൅Ž כ ʹand  ൌ െ כʹሺሻ ൅Ž כ ʹIt is known that   ൌ െ unified approach, we treat Box-Cox transformation parameter λ as an extra parameter in the penalty function, in addition to the two parameters μ and σ. Another way to penalize over-fitting is to treat outliers excluded as extra parameters. If the number of observations excluded is denoted by dp, then the number of parameters in the penalty function is 3+dp, namely λ, μ, σ, and the excluded dp observations. Therefore the adjusted information criterion of transformation data becomes

כʹ୷ ൅ כ ʹൌ െ  

ʹ൫ɐ෢ଶ൯െ‰‘Žכ ሺʹɎ‡ሻ ൅ ሺെ†’ሻ‰‘Žכ ൫ሺെ†’ሻǨ൯ ൅ ሺെ†’ሻ‰‘Žכʹൌെ

ሺ͵൅†’ሻכʹሺɉെͳሻ ෍Ž‘‰ሺ›୧ሻ ൅ כ ୧ୀଵ

ሺሻ‰‘Žכ୷ ൅ כ ʹൌ െ  

ʹ൫ɐ෢ଶ൯െ‰‘Žכ ሺʹɎ‡ሻ ൅ ሺെ†’ሻ‰‘Žכ ൫ሺെ†’ሻǨ൯ ൅ ሺെ†’ሻ‰‘Žכʹൌെ

ሺെ†’ሻ‰‘Žכሺɉെͳሻ ෍Ž‘‰ሺ›୧ሻ ൅ሺ͵൅†’ሻ כ ୧ୀଵ

Based on the above adjusted information criteria, we can decide how many observations to exclude and whether the Box-Cox transformation is worth implementing, according to the value of the information criterion: the smaller, the better.
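A direct implementation of the adjusted BIC above might look as follows; the function signature is illustrative, and y_kept is assumed to be the already-anchored retained subsample of size n − dp.

```python
import math

def adjusted_bic(y_kept, lam, dp):
    """Adjusted BIC of a Box-Cox-transformed subsample: lambda, mu and
    sigma plus the dp excluded observations give 3 + dp penalized
    parameters, and the penalty uses the retained sample size."""
    n = len(y_kept)                      # this is n - dp in the text
    if abs(lam) < 1e-12:
        x = [math.log(v) for v in y_kept]
    else:
        x = [(v ** lam - 1) / lam for v in y_kept]
    mu = sum(x) / n
    sigma2 = sum((v - mu) ** 2 for v in x) / n
    loglik = (math.lgamma(n + 1)                     # log((n - dp)!)
              - 0.5 * n * math.log(2 * math.pi * math.e)
              - 0.5 * n * math.log(sigma2)
              + (lam - 1) * sum(math.log(v) for v in y_kept))
    return -2 * loglik + (3 + dp) * math.log(n)
```

Dropping a gross outlier typically lowers this criterion despite the extra penalized parameter, because the fitted variance of the retained sample shrinks dramatically.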


2.3. Implementation of the Unified Approach

For a given data set, our proposed unified approach for dealing with a skewed sample or outliers works as follows:

a) Sort the observations from smallest to largest: y_(1), y_(2), …, y_(n).

b) Anchoring-to-1: if the minimum of the sample is larger than 1, e.g., for an N(10,1) sample, then use the transformation y* = y + 1 − y_min, so that the minimum of the raw data is shifted to 1. If the minimum of the sample is smaller than 1, e.g., for LogN(0,1), it is unnecessary to anchor to 1.

c) The original data anchored to 1 now become 1, y_(2) + 1 − y_min, y_(3) + 1 − y_min, …, y_(n) + 1 − y_min.

d) Sequentially drop observations at the high end and low end; the numbers of observations to be excluded are denoted by hdp and ldp respectively, each ranging from zero to one fifth of the total sample size. The percentage can be set to any number below 50%, because if more than 50% of the sample were true outliers, then the rest of the sample could instead be treated as the outliers. The total number of observations excluded is dp, which is equal to ldp plus hdp.

e) Re-anchor the remaining sample to 1 when an observation at the lower end is excluded.

f) Apply the Box-Cox transformation to the remaining n − dp data.

g) Each time one observation is excluded at the high end or low end, find the Box-Cox transformation parameter λ by maximizing the likelihood of the transformed data.

h) Compute the information criterion BIC (or AIC) for the above cases where different numbers of observations are dropped.

i) Find the minimum BIC value and the corresponding number of observations dropped and λ for this sample.

j) Conduct a normality test, such as the Anderson-Darling statistic, to test the normality of the transformed data.

k) Back-transform to the original scale if needed.

The above procedure is summarized in Figure 2-13, Figure 2-14, and Figure 2-15. In Figure 2-13, given a sample that needs transformation, if the normality test is passed, then no transformation is needed. If the sample does not pass the normality test or contains outliers, extra data processing is needed, such as outlier detection or data transformation. Figure 2-14 shows the procedure of the unified approach: if the sample minimum is larger than 1, then the minimum is anchored to 1. Then the decision regarding outlier dropping and Box-Cox transformation needs to be made based on the penalized assessment, which is the adjusted BIC. Given a sample, four choices are available to make the sample behave normally: BC=0 DP=0 (no Box-Cox and no outlier drop), BC=0 DP=1 (no Box-Cox and drop outliers), BC=1 DP=0 (Box-Cox and no outliers need to be dropped), and BC=1 DP=1 (Box-Cox and outlier dropping are needed at the same time).

For each sample, we need to make three decisions: How many outliers are there? Is a Box-Cox transformation needed to make the sample normal? Are outlier exclusion and Box-Cox transformation both needed to achieve normality? To answer those questions, BIC is utilized in the unified approach to make the decisions. Observations at the low end and high end of the sorted sample may be outliers, and the combinations of observations at the low end and high end provide many candidate models to choose from, as shown in Figure 2-12. This is similar to Kitagawa (1979). In addition, the Box-Cox transformation may help to achieve normality after some suspicious observations are excluded. We need to decide whether it is worth Box-Cox transforming the sample, which is measured as an extra penalty in BIC. Therefore, we have quite a few candidate models, consisting of different numbers of observations to be excluded and whether Box-Cox needs to be involved. The point is that normality can be achieved by dropping some observations and Box-Cox transforming the sample, which might be over-fitting. Another extra penalty comes from anchoring-to-1: if the minimum of the original sample is larger than 1, then it will be anchored to 1 before any Box-Cox transformation, which yields an extra penalty in BIC. For some samples, we need to re-anchor the remaining sample to 1 after an observation at the low end is dropped, if the minimum of the retained sample is still larger than 1, and then redo the Box-Cox transformation on the remaining anchored-to-1 sample. The exclusion of observations at the low end may happen several times; every time one observation is dropped, it is treated as an extra parameter in the BIC penalty part, and each time the sample is anchored to 1, an extra parameter is added as a penalty in BIC. Excluding observations, anchoring the sample to 1, and Box-Cox transforming the sample are all treated as model fitting. BIC is a good measure of the trade-off between model fit (normality) and model complexity (outlier exclusion, Box-Cox transformation, and anchoring-to-1) and hence is used in the unified approach to choose the appropriate solution. The reason why BIC is chosen rather than AIC is that in our simulation study BIC is found to outperform AIC and other information criteria such as AICc. The penalty part of AIC is 2k, which does not depend on the sample size, while BIC has a penalty of k·ln(N), so the penalty for over-analyzing the sample (a bigger k) is magnified by the corresponding sample size. When some observations are excluded, the sample size becomes n − dp, which cannot be reflected in AIC.


[Figure 2-12 diagram: nested sets of outlier candidates, from x(1) up to x(1), x(2), x(3), …, x(10) at the low end, and from x(n) up to x(n−10), …, x(n−1), x(n) at the high end.]

Figure 2-12 Outlier Candidates


2.4. Advantages of new method

First, the new method can handle small to medium-sized data sets; our simulation study shows that the method is applicable to sample sizes ranging from 20 to 200 with acceptable results. Previous research such as Kitagawa (1979) and Kadota (2003a, 2003b) could only handle small samples, usually fewer than 50 observations.

Second, the new method is able to detect proportions of outliers ranging from 0% to 20%. In previous research the proportion of outliers considered is quite low; only two or three suspicious observations at the low end and high end are considered.

Third, the new method balances the tradeoff between excluding too many extreme observations and achieving a better model fit. Excluding all the suspicious observations will help reduce the skewness/kurtosis and hence improve the normality of the data; however, it bears the risk of losing useful information. Our new method employs BIC to make this tradeoff, using an appropriate penalty to avoid over-fitting and loss of information.

Fourth, after applying the new method to data sets with outliers, the transformed data pass the normality test at a high rate, which means that our transformation is effective.

Fifth, our new method is effective in correctly identifying outliers, as shown in both the clean and contaminated data. This feature is attractive because it can avoid the masking and swamping effects in outlier detection.

Lastly, outlier detection usually requires setting a significance level for the testing procedure, such as 1%, 5%, or 10%. The new method does not need a significance level, which avoids the subjectivity issue in statistical tests.


Figure 2-13 Flowchart of overall data processing


Figure 2-14 Flow Chart of the Unified Approach


Figure 2-15 Flowchart of outlier drop


3. No Outliers: The Uncontaminated Sample Case

3.1. Overview

To investigate the impact of outliers on the Box-Cox transformation and evaluate the performance of the unified approach, a simulation study is conducted on clean data. This means that samples are generated from specified distributions without outliers added. The effectiveness of the unified approach lies in its ability to provide the most accurate or appropriate Box-Cox transformation parameter λ; however, for most cases it is not easy to know the true value of λ. Therefore, distributions that can be transformed to the normal distribution with known λ are used in the simulation. The first one is the normal distribution itself, where λ = 1. This means that if one transforms a normally distributed sample, then a reasonable method should yield a λ (approximately) equal to 1. Another distribution used is the Log-Normal distribution. If X is a random variable with a normal distribution, then Y = exp(X) has a log-normal distribution; likewise, if Y is log-normally distributed, then X = log(Y) has a normal distribution, which indicates that λ is 0 according to the definition of the Box-Cox transformation.

Through the simulation on clean data the following three questions would like to be answered:

a) The ability to detect outliers: is the unified approach able to tell that the data is clean?

b) What is the distribution of λ found through the new approach? Since the true value is known, is the λ provided by the unified approach as expected?

c) If the unified approach detects outliers in the clean data and excludes them, what does the distribution of the transformed sample look like?


To answer the above questions, the simulation strategy is set up in the following way:

a) Simulate samples from N(10,1) and Log-Normal distribution

b) The sample size ranges from 20 to 200 by 10.

c) For each sample size, generate 1000 repetitions.

d) In each repetition, apply the new approach to identify outliers and find the appropriate λ

e) Test the normality of transformed sample

f) Proceed to the next repetition and repeat steps d) and e).
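Steps a) through f) for a single cell of the simulation grid (one distribution and one sample size) can be sketched as follows. This is an illustrative Python sketch, not the dissertation's code: `simulate_cell` is a hypothetical helper, scipy's maximum-likelihood Box-Cox stands in for the λ search, and the repetition count is reduced for speed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_cell(sample_size, reps=1000):
    """For each repetition, draw a clean N(10,1) sample, fit the Box-Cox
    lambda by maximum likelihood, and record the Anderson-Darling
    statistic of the transformed sample."""
    lams, ads = [], []
    for _ in range(reps):
        y = rng.normal(10.0, 1.0, size=sample_size)
        z, lam = stats.boxcox(y)                      # lambda estimated by ML
        lams.append(lam)
        ads.append(stats.anderson(z, dist="norm").statistic)
    return np.array(lams), np.array(ads)

lams, ads = simulate_cell(20, reps=200)   # one small cell, for illustration
```

Repeating this over sample sizes 20 to 200 reproduces the grid of the simulation strategy above.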

For each sample, the unified approach will provide the number of outliers detected, λ that can transform the sample to (near) normality, as well as normality test statistics of the transformed sample such as Anderson-Darling statistic, skewness, and kurtosis. For the purpose of comparison, we would like to compare the results from the following cases:

a) Null case: the original sample; no anchoring, no outlier detection/exclusion, and no Box-Cox transformation.

b) Baseline case: Box-Cox transform the whole sample; no anchoring, no outlier detection/exclusion.

c) Anchoring-only case: anchor the original sample to 1 and Box-Cox transform the anchored sample; no outlier detection/exclusion.

d) Unified Approach: Apply the unified approach to the original sample.

Given each original sample, the unified approach will yield an optimal solution based on the penalized information criterion, which includes the following four situations:

a) No outliers, no Box-Cox transformation is needed

b) No outliers, Box-Cox transformation is needed

c) Outliers are excluded and no Box-Cox transformation is needed

d) Outliers are excluded and Box-Cox transformation is also needed


3.2. Anchoring-to-1

The first step is to look at the null case: given a random sample from N(10,1) without outliers, apply the Box-Cox transformation; λ is expected to be one. Figure 3-1 shows the Box-plot of λ across different sample sizes. It is shown that the variation of λ is quite large, especially when the sample size is relatively small. Some outlying values of λ reach positive 8 or negative 7 (for example, when the sample size is 20), although the median is around one and the true value λ=1 is covered between the first and third quartiles. If those extreme λ are used to transform the data, the resulting transformed sample may not follow the normal distribution as expected, causing trouble for subsequent analysis. Figure 3-2 shows the 95% confidence interval of λ based on 1000 repetitions. For most sample sizes one is covered by the 95% confidence interval, but for some cases, such as sample sizes 60, 70, and 80, one is not in the 95% confidence interval, indicating that for those normal samples Box-Cox generates a λ that is not equal to one.

Table 3-1 provides an example where Box-Cox does not work well. The original sample is randomly drawn from N(10,1) with sample size 20; its histogram is shown in Figure 3-3. The λ found through Box-Cox is 9.13, which is far from one. The histogram of the transformed sample is shown in Figure 3-4. The magnitude of the transformed sample is greatly increased, to the order of 10E+08, although normality is improved in terms of skewness, kurtosis, the Anderson-Darling test, and the Shapiro-Wilk test (Table 3-2). This huge increase in the magnitude of the sample may make the analysis more complicated than the improved fit to normality is worth.


Figure 3-1 Box plot shows the variation of λ using regular BC on original sample


[Figure: 95% CI of Lambda, Null Case, N(10,1), Clean data; y-axis Lambda from 0.75 to 1.05; x-axis Sample Size from 20 to 200]

Figure 3-2 95% Confidence interval of lambda in baseline case


Table 3-1 An example where BC gives extreme lambda

λ = 9.13

Original    Transformed       Original    Transformed
10.5281     236356218.86      8.9114      51585550.71
10.4717     225057637.76      10.5156     233817390.30
9.0009      56513694.63       10.8465     310252851.26
9.8508      128804488.19      9.7124      113194033.18
10.0498     154599860.43      10.1058     162650494.65
10.3015     193786523.46      10.3895     209430291.92
10.1181     164464813.53      10.8041     299366619.31
8.9294      52543192.84       10.3644     204850365.23
10.3289     198538190.09      10.3134     195840818.68
9.9043      135332303.71      10.23       181836784.30

Table 3-2 An example where BC gives extreme lambda, cont.

λ=9.13        Skewness   Kurtosis   Anderson-Darling   PValue-Anderson-Darling   Shapiro-Wilk   PValue-Shapiro-Wilk
Original      -1.0797    0.5257     1.0034             0.0095                    0.8705         0.0120
Transformed   -0.1337    -0.1458    0.3206             >0.2500                   0.9557         0.4613


Figure 3-3 Histogram of the original sample


Figure 3-4 Histogram of the transformed data with lambda 9.13


The Box-Cox transformation parameter λ is found by maximizing the likelihood of the transformed sample, which is supposed to follow a normal distribution. The profile log-likelihood function is concave and has a maximum point. However, sometimes the log-likelihood function can be quite flat, which makes it difficult to find a λ that achieves the maximum; therefore λ may not exist or may be quite unstable. Such an example is given in Table 3-3 and Figure 3-5. The original sample is slightly skewed, not too far from normality, yet the λ found through Box-Cox is -6.91, making the transformed sample points nearly identical, all equal to about 0.1447 (Table 3-4). This transformation is not applicable in practice since it collapses the sample almost to a single value (Figure 3-6).

Table 3-3 An example showing the breakdown of Box-Cox, which gives lambda=-6.91

λ = -6.91

Original    Transformed      Original    Transformed
10.8945     0.14471779       9.3805      0.144717773
10.1425     0.144717784      9.5144      0.144717775
12.23       0.144717796      9.6365      0.144717777
10.3208     0.144717786      9.4778      0.144717774
9.7381      0.144717779      10.9863     0.144717791
10.9127     0.144717791      10.2257     0.144717785
10.5673     0.144717788      10.2774     0.144717786
12.5883     0.144717797      9.6361      0.144717777
9.5804      0.144717776      10.5105     0.144717788
9.7774      0.144717779      9.5349      0.144717776

Table 3-4 An example showing the breakdown of Box-Cox, which gives lambda=-6.91, cont.

λ=-6.91       Skewness     Kurtosis   Anderson-Darling   PValue-Anderson-Darling   Shapiro-Wilk   PValue-Shapiro-Wilk
Original      1.37857132   1.632773   0.998438           0.0097                    0.84657        0.0047
Transformed   .            .          .                  .                         .              .


Figure 3-5 Histogram of original sample


Figure 3-6 Histogram of transformed sample using lambda=-6.91


All the above examples show that, even in clean samples, Box-Cox may not perform well in some cases: the λ provided by direct use of Box-Cox has quite a large variance, especially for small sample sizes. If outliers are present in the sample, applying Box-Cox directly without proper handling brings even more variability.

When the original sample is anchored to 1 before the Box-Cox transformation is applied, it can be seen from Figure 3-7 that, compared to the null case, the variation of λ decreases dramatically; the mean value of λ based on 1000 repetitions is quite close to one, although the 95% confidence interval of λ does not always cover one (Figure 3-8). The result confirms the effectiveness of the anchor-to-1 idea, which can dramatically decrease the variance of λ.

Note that whether or not to anchor to 1 depends on the smallest value of the original sample. If the minimum of the original sample is larger than 1, then it needs to be anchored to get a better estimate of λ; if it is between zero and one, it should not be anchored.

The following two examples show this scenario. Figure 3-9 and Figure 3-10 show the Box-plot of λ after the regular Box-Cox transformation of a random normal sample with minimum less than 1; from Figure 3-11 and Figure 3-12 it is noticed that the distribution of λ does not improve after anchoring the sample to 1, and the variance even becomes larger. This particular sample is drawn in two steps: first draw a sample from N(10,1), then shift it so its minimum is 0.5 through y − y_min + 0.5; the minimum of the new sample is then 0.5, which is between 0 and 1.

When the minimum of a sample is less than 1, it is not necessary to anchor to 1. The following example shows this. A random sample is drawn from exp(Z), where Z follows a N(0,1) distribution, so around half of the observations are between zero and one. The following four figures (Figure 3-13 to Figure 3-16) show the difference between anchoring to 1 and not anchoring. It is noticed that after anchoring to 1, the variance of lambda becomes larger. Therefore we propose that if the minimum of a sample is larger than 1, it needs to be anchored to 1; if the minimum is between 0 and 1, we do not need to anchor it to 1.
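This anchoring rule can be sketched as follows; `anchor_to_1` is a hypothetical helper name, and scipy's maximum-likelihood Box-Cox stands in for whichever λ estimator is used:

```python
import numpy as np
from scipy import stats

def anchor_to_1(y):
    """Shift the sample so its minimum becomes 1, but only when min(y) > 1;
    samples whose minimum lies in (0, 1) are left unchanged."""
    y = np.asarray(y, dtype=float)
    return y - y.min() + 1.0 if y.min() > 1.0 else y

rng = np.random.default_rng(0)
y = rng.normal(10.0, 1.0, size=20)              # clean N(10,1) sample, min > 1

_, lam_raw = stats.boxcox(y)                    # regular Box-Cox on the raw sample
_, lam_anchored = stats.boxcox(anchor_to_1(y))  # Box-Cox after anchoring to 1
```

Over many repetitions, `lam_anchored` is expected to vary far less than `lam_raw`, in line with Figures 3-7 and 3-8.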


Figure 3-7 Distribution of lambda after anchoring-to-1 the original sample


[Figure: 95% CI of Lambda, Anchor-to-1, N(10,1), Clean data; y-axis Lambda from 0.87 to 1.01; x-axis Sample Size from 20 to 200]

Figure 3-8 The mean and 95% confidence interval of lambda after anchoring-to-1


Figure 3-9 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted


Figure 3-10 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted, cont.


Figure 3-11 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda


Figure 3-12 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda, cont.


Figure 3-13 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda


Figure 3-14 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont.


Figure 3-15 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda


[Figure: 95% CI of Lambda, Anchor-to-1, LogN(0,1), Clean data; y-axis Lambda from -0.81 to -0.65; x-axis Sample Size from 20 to 200]

Figure 3-16 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont.


3.3. Unified Approach-Normal Distribution

The unified approach is applied to clean N(10,1) samples. Outlier detection, anchoring-to-1, Box-Cox transformation, and the penalized information criterion (BIC) are considered simultaneously here. In terms of anchoring-to-1, since the original sample is from N(10,1) and thus has a minimum value larger than 1, the original sample is anchored to 1 before the Box-Cox transformation. In the unified approach, outlier exclusion is implemented along with the Box-Cox transformation: extreme observations at the lower and upper ends of the sample may be dropped to achieve a better model fit, and the exclusion of observations is penalized in the BIC. Every time an observation at the lower end is dropped, the remaining sample has a minimum value greater than 1; hence the sample needs to be re-anchored to 1 before Box-Cox transformation. The anchoring-to-1, exclusion of observations, and Box-Cox transformation are all penalized in the information criterion to obtain a trade-off between better model fitting (normality) and complexity of the model (over-analyzing).
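The selection logic can be sketched as a penalized criterion evaluated for each candidate processing of the sample. The exact penalty bookkeeping below (one BIC-style term per action) is an assumption for illustration, not the dissertation's exact criterion:

```python
import numpy as np
from scipy import stats

def penalized_bic(y, drop_low=0, drop_high=0, do_boxcox=True):
    """Normal log-likelihood of the processed sample plus a BIC-style
    penalty for every action taken: anchoring, each dropped observation,
    and the Box-Cox transformation.  Illustrative bookkeeping only."""
    z = np.sort(np.asarray(y, dtype=float))
    if drop_high > 0:
        z = z[:-drop_high]                # drop suspected upper-end outliers
    if drop_low > 0:
        z = z[drop_low:]                  # drop suspected lower-end outliers
    anchored = bool(z.min() > 1.0)
    if anchored:
        z = z - z.min() + 1.0             # re-anchor to 1 before transforming
    if do_boxcox:
        z, _ = stats.boxcox(z)            # requires positive data
    n = z.size
    loglik = stats.norm.logpdf(z, z.mean(), z.std(ddof=0)).sum()
    k = 2 + int(anchored) + int(do_boxcox) + drop_low + drop_high
    return -2.0 * loglik + k * np.log(n)

rng = np.random.default_rng(0)
y = rng.normal(10.0, 1.0, size=60)

# Choose the cheaper of transform vs. no transform (dropping omitted for brevity)
best_bic, best_bc = min((penalized_bic(y, do_boxcox=b), b) for b in (False, True))
```

Scanning `drop_low`, `drop_high`, and `do_boxcox` jointly and taking the minimum mimics the trade-off between fit and over-analysis described above.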

Figure 3-17 shows the outliers detected by the unified approach in different sample sizes.

For each sample size, 1000 repetitions are generated and the unified approach is applied to each: it reports how many outliers are in the sample, whether those outliers should be excluded, and, if they are excluded, whether it is necessary to Box-Cox transform the remaining sample to achieve normality. For example, when the sample size is 20, among the 1000 repetitions, 774 repetitions have no outliers, 172 have one outlier, 34 have two outliers, 10 have three outliers, and 20 have six outliers. As the sample size becomes larger, the percentage of repetitions detected as “no outliers” increases, from 92.1% at sample size 60 to 95.4% at sample size 200. This shows that the unified approach does well in identifying outliers; in the clean-data situation, it is able to tell that the samples being analyzed are clean in most cases.

In terms of the locations of the detected outliers, it is noticed from Figure 3-18 that the detected outliers are located at both the lower and upper ends of the sample, including for small sample sizes.

As for the performance of the Box-Cox transformation in the clean samples, λ is supposed to be 1, since the sample is normally distributed and hence no transformation is needed. Figure 3-19 shows the distribution of λ when the unified approach detects zero outliers in the sample, and Figure 3-20 shows the distribution of λ when at least one outlier is detected. It is noticed that λ in Figure 3-19 has smaller variation than in Figure 3-20, although Figure 3-20 has fewer outlying values of λ.

In terms of model fitting (normality tests), there is not much difference between the cases where no outliers are detected and where one or more outliers are detected. Figure 3-21 and Figure 3-22 provide the P-values of the Anderson-Darling statistic.


Figure 3-17 Outliers Detected in Clean N(10,1) Samples, 1000 Repetitions


Figure 3-18 Location of detected outliers, N(10,1), Clean Sample


Figure 3-19 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, no outliers excluded, no penalty for Box-Cox


Figure 3-20 Box-plot of lambda: Unified Approach, N(10,1), Clean Data, outliers excluded(>=1)


Figure 3-21 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, no outliers detected, no penalty for Box-Cox


Figure 3-22 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, outliers excluded(>=1)


In the above analysis, no extra penalty is considered for doing the Box-Cox transformation. This means that after anchoring the sample to 1 there are two choices: the first is to Box-Cox transform the sample to achieve a better model fit but pay an extra penalty in the BIC; the second is not to do Box-Cox. It must be considered whether Box-Cox is really needed, that is, whether it is worth transforming the sample to achieve normality at the cost of the penalty. The BICs of the two situations are compared and the decision is made based on the smaller BIC. When this scenario is taken into account, the unified approach generates quite different results compared to the previous analysis. When considering the penalty of doing Box-Cox, there are four ways of handling each sample:

a) 0-0 case (leave it alone): no outlier detection/exclusion, no Box-Cox transformation.

b) 0-1 case: no outlier exclusion, do Box-Cox transformation.

c) 1-0 case: exclude outliers, no Box-Cox transformation.

d) 1-1 case: exclude outliers, do Box-Cox transformation.


Figure 3-23 shows the percentage of each case over the 1000 repetitions. The majority of repetitions fall into the first case, which means that, given a clean N(10,1) sample, nothing needs to be done to make it behave normally. When the sample size is larger than 80, this percentage is always above 90%. This agrees with our expectation since the sample is normally distributed. In the 0-0 case, no outliers are excluded and no Box-Cox transformation is implemented, so λ is taken to be 1. In the cases where outliers are excluded but no Box-Cox is implemented, the outliers are detected at the lower end or at both the lower and upper ends of the sample (Figure 3-24).


Figure 3-23 Percent of four cases in unified approach with penalty to BC

Figure 3-24 Location of outliers, NO Box-Cox involved


Before Box-Cox is implemented, there might be outliers that need to be excluded. When outliers are detected, they are located either at the lower end or at both the lower and upper ends of the sample (Figure 3-25). The distribution of λ in this case is shown in Figure 3-26; when no outliers are found, the distribution of λ is shown in Figure 3-27. Since the number of repetitions in those cases is quite small, the Box-plots look odd and some of them contain only one diamond point.


Figure 3-25 Location of outliers, BC=1 and DP=1


Figure 3-26 Box-plot of Lambda: Unified Approach, Clean Data, BC=1, outliers Excluded


Figure 3-27 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, BC=1, no outliers excluded


3.4. Assessment-Normal Distribution

To evaluate the effectiveness of the unified approach, a simple estimation of the resulting mean is conducted, and the Mean Square Error (MSE) of the transformed sample is examined. Based on the discussion in the previous section, the four cases have to be examined separately.

Null case: no outlier detection/exclusion, no Box-Cox transformation

The original sample is $Y_1, Y_2, \ldots, Y_n \sim N(10,1)$; then

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(\bar{Y}_k - 10\right)^2, \qquad \bar{Y}_k = \frac{1}{SS}\sum_{i=1}^{SS} Y_{ki},$$

where $SS$ is the sample size (20 to 200 by 10), $Y_{ki}$ is the ith observation in the kth repetition, and $Rep$ is the number of repetitions for each sample size, here 1000. Since the original sample is drawn from N(10,1), the Mean Square Error is based on the squared difference between the observed mean and the expected value 10. Since no Box-Cox transformation is involved, the MSE is expected to be one (after multiplying by the sample size, as described for the lognormal case in Section 3.6).

In the Unified Approach:

a) If BC=0 (DP=0 or DP>0):

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(\bar{Y}_k - 10\right)^2, \qquad \bar{Y}_k = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} Y_{ki},$$

where $\bar{Y}_k$ is the mean of the retained sample and DP is the number of outliers excluded. As no Box-Cox is involved, the Mean Square Error is based on the squared difference between the mean of the retained sample and the expected mean, which is 10. If outliers are excluded, the dropped observations are taken into account by subtracting DP from the sample size; if no outliers are dropped, then DP=0. In this case the MSE calculation is the same as in the null case, except that the sample size changes when outliers are dropped.

b) If BC=1 (DP=0 or DP>0):

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(f_k^{-1}(\bar{Y}_k^{*}) - 10\right)^2, \qquad \bar{Y}_k^{*} = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} f_k(Y_{ki}),$$

where $\bar{Y}_k^{*}$ is the mean of the transformed sample in the kth repetition and $f_k$ is the Box-Cox transformation found in that repetition. Since Box-Cox is involved in the process, the mean of the transformed sample has to be back-transformed to the original scale using $f_k^{-1}$. If the sample was anchored to 1 before the Box-Cox transformation, the back-transformed mean also needs to be un-anchored to the original scale through $\bar{Y}_k = f_k^{-1}(\bar{Y}_k^{*}) + Y_{k,\min} - 1$, where $Y_{k,\min}$ is the sample minimum in the kth repetition. If outliers are excluded, the dropped observations are taken into account by subtracting DP from the sample size; if no outliers are dropped, then DP=0.
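Under the assumption that the raw (unscaled) MSE is intended, the null-case and BC=1 calculations can be sketched with scipy's `boxcox`/`inv_boxcox` pair; variable names are illustrative:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(42)
rep, ss = 1000, 50

# Null case: MSE of the plain sample mean around the true mean 10.
means = rng.normal(10.0, 1.0, size=(rep, ss)).mean(axis=1)
mse_null = np.mean((means - 10.0) ** 2)     # raw (unscaled) MSE, roughly 1/SS

# BC=1, DP=0 sketch: anchor each sample to 1, Box-Cox it, back-transform the
# mean of the transformed sample with inv_boxcox, then un-anchor it.
errs = []
for _ in range(rep):
    y = rng.normal(10.0, 1.0, size=ss)
    y_min = y.min()
    z, lam = stats.boxcox(y - y_min + 1.0)  # anchor-to-1, then transform
    back = inv_boxcox(z.mean(), lam) + y_min - 1.0
    errs.append((back - 10.0) ** 2)
mse_bc = np.mean(errs)
```

Multiplying these raw MSEs by the sample size puts them on the order-one scale reported in Table 3-5.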

The comparison between the unified approach and the null case is shown in Table 3-5 and Figures 3-28 to 3-31.




In the BC=0 and DP=0 case, where no outliers are dropped and no Box-Cox transformation is involved, the MSE is the same for the unified approach and the null case. Among the 1000 repetitions, the majority of cases fall into this category.


Table 3-5 Comparison of MSE

SAMPLE | BC=0 DP=0           | BC=0 DP=1           | BC=1 DP=0             | BC=1 DP=1           | All reps
SIZE   | U     N     Freq    | U     N     Freq    | U     N       Freq    | U     N     Freq    | Null
20     | 1.094 1.094 743     | 1.376 1.003 215     | 1.380 295.883 31      | 2.377 1.648 11      | 1.081
30     | 1.033 1.033 818     | 1.246 0.879 140     | 1.095 174.384 32      | 0.605 0.611 10      | 1.003
40     | 0.972 0.972 851     | 1.257 1.044 116     | 1.312 145.789 22      | 2.516 0.878 11      | 0.970
50     | 0.998 0.998 876     | 1.096 0.930 96      | 1.019 105.216 20      | 2.160 1.439 8       | 0.988
60     | 1.050 1.050 891     | 1.329 1.154 76      | 1.534 101.004 30      | 1.103 2.546 3       | 1.077
70     | 1.125 1.125 893     | 1.128 0.952 72      | 1.074 81.245  30      | 1.068 1.270 5       | 1.128
80     | 0.930 0.930 905     | 1.093 0.981 63      | 1.073 68.846  29      | 0.498 0.085 3       | 0.946
90     | 1.035 1.035 919     | 0.787 0.764 64      | 0.601 50.342  15      | 3.848 0.833 2       | 1.016
100    | 1.047 1.047 903     | 1.566 1.421 68      | 2.225 106.488 27      | 3.388 1.849 2       | 1.082
110    | 0.963 0.963 924     | 1.066 0.921 42      | 0.982 44.768  30      | 4.193 1.826 4       | 0.957
120    | 0.970 0.970 926     | 0.754 1.099 50      | 0.828 37.680  22      | 2.934 0.008 2       | 0.971
130    | 0.971 0.971 931     | 1.663 1.242 52      | 2.066 86.460  12      | 2.288 0.637 5       | 0.979
140    | 0.990 0.990 943     | 1.229 0.688 35      | 0.845 43.022  17      | 1.621 1.163 5       | 0.980
150    | 0.925 0.925 933     | 1.145 0.881 49      | 1.009 56.081  16      | 3.081 1.201 2       | 0.925
160    | 1.023 1.023 935     | 1.477 0.834 51      | 1.232 75.317  12      | 2.126 0.580 2       | 1.013
170    | 1.049 1.049 938     | 1.136 1.061 49      | 1.205 55.644  13      | .     .     0       | 1.049
180    | 1.047 1.047 939     | 1.010 0.908 46      | 0.917 46.451  14      | 2.066 1.110 1       | 1.041
190    | 1.053 1.053 939     | 1.117 0.968 49      | 1.082 54.743  12      | .     .     0       | 1.052
200    | 1.077 1.077 943     | 1.148 0.721 45      | 0.828 51.660  11      | 1.080 0.001 1       | 1.055
U: Unified Approach; N: Null Case


Figure 3-28 Comparison of MSE, BC=0 DP=0, N(10,1) Clean Data

Figure 3-29 Comparison of MSE, BC=0 DP=1, N(10,1) Clean Data


Figure 3-30 Comparison of MSE, BC=1 DP=0, N(10,1) Clean Data

Figure 3-31 Comparison of MSE, BC=1 DP=1, N(10,1) Clean Data


3.5. Unified Approach-Lognormal Distribution

The sample is randomly drawn from the LogN(0,1) distribution, i.e., exp(Z) where Z follows N(0,1), and the unified approach is applied to it. Outlier detection, anchoring-to-1, Box-Cox transformation, and the penalized information criterion (BIC) are considered simultaneously here. In terms of anchoring-to-1, since the original sample is from LogN(0,1) and has a minimum value smaller than 1, the original sample does not need to be anchored to 1 before the Box-Cox transformation. Even if observations at the lower end are excluded during the process, the samples still do not need to be anchored: because the samples are drawn from exp(N(0,1)), around 50% of the observations are smaller than one, and at most 20% of observations are suspected to be outliers, so almost all the samples in the unified approach keep a minimum value smaller than 1. In the unified approach, outlier exclusion is implemented along with the Box-Cox transformation: extreme observations at the lower and upper ends of the sample may be dropped to achieve a better model fit. The exclusion of observations and the Box-Cox transformation are penalized in the information criterion to obtain a trade-off between better model fitting (normality) and complexity of the model (over-analyzing).

Figure 3-32 shows the outliers detected by the unified approach at different sample sizes.

For each sample size, 1000 repetitions are generated and the unified approach is applied to each: it reports how many outliers are in the sample, whether those outliers should be excluded, and, if they are excluded, whether it is necessary to Box-Cox transform the remaining sample to achieve normality. For example, when the sample size is 20, among the 1000 repetitions, 315 repetitions have no outliers, 325 have one outlier, 162 have two outliers, 125 have three outliers, and 0 have six outliers. As the sample size becomes larger, the percentage of repetitions detected as “no outliers” increases, from 51.8% at sample size 70 to 65% at sample size 200. This shows that the unified approach does well in identifying outliers; in the clean-data situation, it is able to tell that the samples being analyzed are clean in most cases.

In terms of the locations of the detected outliers, it is noticed from Table 3-6 that almost all of the detected outliers are located at the upper end of the sample, which differs from the normal samples. This is due to the skewness of the lognormal distribution, which has a long right tail.

As for the performance of the Box-Cox transformation in the clean samples, λ is supposed to be 0 according to the definition of the Box-Cox transformation, since the sample is lognormally distributed. Figure 3-34 shows the distribution of λ when the unified approach detects zero outliers in the sample, and Figure 3-35 shows the distribution of λ when at least one outlier is detected. It is noticed that λ in Figure 3-34 has much smaller variation than in Figure 3-35.


Figure 3-32 Outliers Detected in Clean LogN(0,1) Samples, 1000 Repetitions


Table 3-6 Location of outliers excluded, LogN(0,1), Clean Data

SAMPLE_SIZE   Low   High   Both
20            0     68.2   0.3
30            0.1   59.2   0.2
40            0     55.6   0.0
50            0     51.6   0.0
60            0     51.6   0.0
70            0     48.2   0.0
80            0     50.2   0.0
90            0     43.6   0.0
100           0     44.8   0.0
110           0     42.9   0.0
120           0     40.8   0.0
130           0     39.4   0.0
140           0     41     0.0
150           0     39.3   0.1
160           0     37.6   0.0
170           0     38     0.0
180           0     36.7   0.0
190           0     37.3   0.1
200           0     35     0.0


Figure 3-33 Location of outliers detected, LogN(0,1), Clean Data


Figure 3-34 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox


Figure 3-35 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox


In terms of model fitting (normality tests), there is not much difference between the cases where no outliers are detected and where one or more outliers are detected. Figure 3-36 and Figure 3-37 provide the P-values of the Anderson-Darling statistic.


Figure 3-36 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox


Figure 3-37 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox


In the above analysis, no extra penalty was given for doing the Box-Cox transformation. This means that, given a sample, there are two choices: the first is to Box-Cox transform the sample to achieve a better model fit but pay an extra penalty in the BIC; the second is not to do Box-Cox. It must be considered whether Box-Cox is really needed, that is, whether it is worth transforming the sample to achieve normality at the cost of the penalty. The BICs of the two situations are compared and the decision is made based on the smaller BIC. When this scenario is taken into account, the unified approach generates different results compared to the previous analysis. When considering the penalty of doing Box-Cox, there are four ways of handling each sample:

a) 0-0 case (leave it alone): no outlier exclusion, no Box-Cox transformation.

b) 0-1 case: no outlier exclusion, do Box-Cox transformation.

c) 1-0 case: exclude outliers, no Box-Cox transformation.

d) 1-1 case: exclude outliers, do Box-Cox transformation.


Figure 3-38 shows the percentage of each case over the 1000 repetitions. Notice that the majority of repetitions fall into the second and fourth cases, which means that, given a clean LogN(0,1) sample, the sample needs to be Box-Cox transformed, with appropriate exclusion of extreme observations at the high end, to make it behave normally. This agrees with the expectation since the sample is log-normally distributed. In the 0-1 case, no outliers are excluded and the Box-Cox transformation is implemented (Figure 3-39). In the 1-1 case, where Box-Cox is implemented and outliers are excluded, the distribution of λ is shown in Figure 3-40.


Figure 3-38 Percent of four cases in unified approach with penalty to BC, Clean LogN(0,1)


Figure 3-39 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, no outliers excluded


Figure 3-40 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, outliers excluded(>=1)


3.6. Assessment-Lognormal Distribution

To evaluate the effectiveness of the unified approach, the Mean Square Error (MSE) of the transformed sample is examined. Based on the discussion in the previous section, the four cases have to be examined separately.

Null case: no outlier detection/exclusion, no Box-Cox transformation

The original sample is $Y_1, Y_2, \ldots, Y_n$ with $Y_i = \exp(Z_i)$, $Z_i \sim N(0,1)$; then

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(\bar{Y}_k - 0\right)^2, \qquad \bar{Y}_k = \frac{1}{SS}\sum_{i=1}^{SS} \mathrm{Log}(Y_{ki}),$$

where $SS$ is the sample size (20 to 200 by 10), $Y_{ki}$ is the ith observation in the kth repetition, and $Rep$ is the number of repetitions for each sample size, here 1000. Since the original sample is drawn from LogNormal(0,1), the Mean Square Error is based on the squared difference between the mean of the logarithm of the original values and the expected value 0, as we have perfect knowledge of the samples. As $\bar{Y}_k$ is an average of the transformed observations, the computed MSE needs to be multiplied by the corresponding sample size for the purpose of comparison.
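The null-case computation above, including the sample-size scaling, can be sketched as follows (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
rep, ss = 1000, 50

# Null case for LogN(0,1): per-repetition mean of Log(Y), squared deviation from 0.
ybar = np.log(rng.lognormal(0.0, 1.0, size=(rep, ss))).mean(axis=1)
mse = np.mean(ybar ** 2)        # raw MSE, roughly 1/SS
scaled = ss * mse               # after the sample-size scaling, roughly 1
```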

In the Unified Approach:

a) If BC=0 (DP=0 or DP>0)

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(\bar{Y}_k - 0\right)^2, \qquad \bar{Y}_k = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} \mathrm{Log}(Y_{ki}),$$

where $SS$ is the sample size (20 to 200 by 10), $Y_{ki}$ is the ith observation in the kth repetition, $Rep$ is the number of repetitions (1000 here), and DP is the number of observations excluded from the sample. Since no Box-Cox transformation is involved here, the only difference from the null case is the sample size. The original sample is drawn from LogNormal(0,1), so the Mean Square Error is based on the squared difference between the mean of the logarithm of the remaining original values and the expected value 0. When Box-Cox is not implemented, the MSE is computed the same way as in the null case except for the sample size.

b) If BC=1 (DP=0 or DP>0)

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(f_k^{-1}(\bar{Y}_k^{*}) - 0\right)^2, \qquad \bar{Y}_k^{*} = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} f_k(Y_{ki}),$$

where $\bar{Y}_k^{*}$ is the mean of the retained transformed sample in the kth repetition after DP observations are dropped and $f_k$ is the Box-Cox transformation found in that repetition. Since Box-Cox is involved in the process, the mean of the transformed sample needs to be back-transformed to the original scale using the transformation found in that repetition. As anchor-to-1 is not conducted here, there is no need to un-anchor the back-transformed sample. If outliers are excluded, the dropped observations are taken into account by subtracting DP from the sample size; if no outliers are dropped, then DP=0.

The comparison between the unified approach and the null case is shown in Figure 3-41 and Figure 3-42. It is seen that the MSE of the unified approach is quite close to one, as expected. In the case where both outlier exclusion and Box-Cox transformation are applied, the MSE is even closer to 1 than in the null case. This is because the Log-Normal sample is drawn from exp(Z), where Z follows N(0,1); when it is back-transformed (the logarithm taken), the variance of the back-transformed sample is expected to be 1.

Figure 3-41 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=0


Figure 3-42 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=1

4. Outliers: The Contaminated Sample Case

4.1. Samples with Outliers

To study the effectiveness of Box-Cox transformation in the presence of outliers and verify the usefulness of the unified approach, it is helpful to perform simulations in samples with outliers. To begin with, random samples from known distributions will be generated. Initially, random samples from a Normal distribution with mean 10 and variance 1, with outliers, will be generated. The sample sizes examined will range from 20 to 200 (by 10) and the percentage of outliers in the samples will be 5%, 10%, 15%, and 20%. Potential outliers will be obtained by randomly selecting observations from the sample and then multiplying them by a factor of 10.

Here the probability of each observation being selected to be an outlier ranges from 5% to 20%. This means that for a specific sample of size 100, say, the expected percentage of generated outliers is 5%; the true number of observations multiplied by 10 is not necessarily equal to 5. Over a large number of repetitions, however, the mean number of outliers generated will be 5 when the sample size is 100 and the outlier percentage is 5%. Random samples from other distributions (e.g. Lognormal) will be generated in a similar manner. Here only observations in the high end, i.e. the extremely large observations, are considered to be suspicious outliers.
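The contamination mechanism described above can be sketched as follows (an illustrative implementation; the function name and seed handling are mine):

```python
import random

def contaminate(clean, pct, factor=10, seed=0):
    """Flag each observation independently with probability pct and
    multiply it by `factor` (10 in the text).  The realized number of
    outliers therefore varies around len(clean) * pct across samples."""
    rng = random.Random(seed)
    contaminated, n_outliers = [], 0
    for y in clean:
        if rng.random() < pct:
            contaminated.append(y * factor)
            n_outliers += 1
        else:
            contaminated.append(y)
    return contaminated, n_outliers
```

Because each observation is flagged independently, the realized outlier count is binomial: a "5% outlier" sample of size 100 contains 5 outliers only on average, as the text notes.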

Here is an example of a sample with outliers added. The clean data are generated from Normal(10, 1), and 5% to 30% of the original observations are randomly chosen and multiplied by 10 to produce outliers. The difference between the clean and contaminated samples can be seen in the figures and tables: samples with outliers are right-skewed, and the normality test statistics, histograms, and normal density plots all indicate that the contaminated sample deviates substantially from the normal distribution. For example, when the sample size is 100, even with only 5% outliers the Anderson-Darling statistic drastically changes from 0.36 to 31, the kurtosis changes from -0.04 to 46.75, and the skewness changes from -0.11 to 6.89. As the percentage of outliers increases from 10% to 30%, the Anderson-Darling statistic decreases from 31.73 to 20.03, the kurtosis drops from 29.61 to -0.51, and the skewness falls from 5.56 to 1.18. The Shapiro-Wilk, Kolmogorov-Smirnov, and Cramer-von Mises normality test statistics show a similar pattern as the percentage of outliers increases. Intuitively, as the percentage of outliers grows, the outliers can eventually be treated as non-outliers once they become the majority of the sample: in our settings, for example, if 90% of the observations are multiplied by 10, the sample is equivalent to one with 10% extremely small-valued outliers in which the remaining observations follow a normal distribution with mean 100 and standard deviation 10.
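The moment-based skewness and excess kurtosis used in these comparisons can be computed as below (a plain central-moment version; SAS reports small-sample-corrected values, so table entries for small n differ slightly):

```python
def moment_stats(xs):
    """Sample skewness and excess kurtosis from central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3   # 0 for a normal distribution
    return skewness, excess_kurtosis
```

Multiplying even one observation in ten by 10 already produces strongly positive skewness, which is the pattern the tables below show.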


Figure 4-1 Distributions of Normal samples with outliers


Table 4-1 Goodness of Fit statistics of Normal samples with/without outliers

(AD = Anderson-Darling, SW = Shapiro-Wilk, KS = Kolmogorov-Smirnov)

N=20   % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%            1.4589    1.3617   1.3932  <0.0050   0.8142   0.0014   0.2291  <0.0100
        5%            4.4472   19.8416   6.3656  <0.0050   0.2791  <0.0001   0.4821  <0.0100
       10%            4.4472   19.8416   6.3656  <0.0050   0.2791  <0.0001   0.4821  <0.0100
       15%            2.1329    2.8700   5.2783  <0.0050   0.4715  <0.0001   0.4745  <0.0100
       20%            1.0381   -0.8658   3.5196  <0.0050   0.6469  <0.0001   0.4116  <0.0100
       30%            0.5466   -1.7131   2.7270  <0.0050   0.7169  <0.0001   0.3630  <0.0100

N=50   % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%           -0.5493    0.6696   0.6930   0.0698   0.9620   0.1081   0.1231   0.0571
        5%            6.9953   49.2761  15.5805  <0.0050   0.1843  <0.0001   0.4757  <0.0100
       10%            6.9953   49.2761  15.5805  <0.0050   0.1843  <0.0001   0.4757  <0.0100
       15%            3.2552    9.1514  15.2254  <0.0050   0.3493  <0.0001   0.5032  <0.0100
       20%            1.9472    1.9381  12.9844  <0.0050   0.4852  <0.0001   0.4847  <0.0100
       30%            0.9763   -0.9886   9.0427  <0.0050   0.6385  <0.0001   0.4234  <0.0100

N=100  % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%           -0.1137   -0.0381   0.3600  >0.2500   0.9914   0.7743   0.0600  >0.1500
        5%            6.8897   46.7486  31.0268  <0.0050   0.1806  <0.0001   0.4543  <0.0100
       10%            5.5553   29.6115  31.7335  <0.0050   0.2130  <0.0001   0.4712  <0.0100
       15%            3.1855    8.4605  30.5585  <0.0050   0.3463  <0.0001   0.4927  <0.0100
       20%            2.3078    3.5654  27.5557  <0.0050   0.4443  <0.0001   0.4851  <0.0100
       30%            1.1783   -0.5115  20.0336  <0.0050   0.6077  <0.0001   0.4428  <0.0100

N=150  % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%            0.1587    0.5661   0.3677  >0.2500   0.9929   0.6623   0.0514  >0.1500
        5%            5.9031   33.4386  48.7243  <0.0050   0.1928  <0.0001   0.4704  <0.0100
       10%            4.3321   17.0561  48.8724  <0.0050   0.2540  <0.0001   0.4874  <0.0100
       15%            2.4920    4.3382  43.9501  <0.0050   0.4046  <0.0001   0.4883  <0.0100
       20%            1.8499    1.5209  38.8027  <0.0050   0.4906  <0.0001   0.4725  <0.0100
       30%            0.9334   -1.0686  28.0832  <0.0050   0.6307  <0.0001   0.4227  <0.0100

N=200  % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%            0.0207   -0.4445   0.2900  >0.2500   0.9935   0.5267   0.0373  >0.1500
        5%            6.8671   46.0285  62.5194  <0.0050   0.1763  <0.0001   0.4605  <0.0100
       10%            5.1181   24.7308  63.4635  <0.0050   0.2292  <0.0001   0.4799  <0.0100
       15%            2.6245    5.0245  58.7372  <0.0050   0.3950  <0.0001   0.4944  <0.0100
       20%            1.9086    1.7491  52.1481  <0.0050   0.4863  <0.0001   0.4804  <0.0100
       30%            0.9560   -1.0169  37.4026  <0.0050   0.6327  <0.0001   0.4244  <0.0100


Here is another example of samples with outliers. The clean sample is generated from a Lognormal distribution by taking the exponent of a random variable from the N(0,1) distribution. It can be seen from the following table that the skewness and kurtosis vary considerably as the percentage of outliers increases. The histograms of the contaminated samples show that in the Lognormal data it is not easy to find outliers: an observation randomly chosen from the lower side of the sample and multiplied by 10 may land in the right tail of the sample, making outlier detection difficult.


Figure 4-2 Distributions of Lognormal samples with outliers


Table 4-2 Goodness of Fit statistics of Lognormal samples with/without outliers

                         Skewness                                 Kurtosis
% of Outlier  N=20   N=50   N=100  N=150  N=200      N=20    N=50    N=100   N=150   N=200
  0%          2.750  1.931  3.713  5.075  2.734      6.642   3.191  18.568  34.853   9.342
  5%          2.058  2.596  3.222  3.728  3.445      2.776   7.947  11.612  15.334  15.571
 10%          2.058  2.596  3.058  3.549  3.312      2.776   7.947   9.788  13.662  14.438
 15%          1.588  4.657  5.748  3.661  4.166      1.026  25.724  40.634  15.251  22.926
 20%          3.964  3.089  6.717  3.260  6.048     16.508  10.224  51.309  11.699  43.810
 30%          3.972  3.329  5.567  8.727  5.091     16.571  11.545  35.153  92.166  29.964


4.2. Simulation on N(10,1) samples

• Baseline

In the baseline case, where Box-Cox is applied to the sample directly, the presence of outliers adversely affects the transformation and λ deviates from 1, as shown in the following figure. The variation of λ is large when the sample size is small and the percentage of outliers is low. With only 5% outliers, the median of λ is around -2 while some outlying λ's are around 1; as more outliers are present, the value of λ stabilizes around -2.
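A grid-search version of the Box-Cox maximum-likelihood estimate of λ, as used conceptually in the baseline case, might look like this (the grid, range, and function names are illustrative choices, not the dissertation's exact procedure):

```python
import math

def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox model (constants dropped)."""
    n = len(y)
    z = [math.log(v) if lam == 0 else (v ** lam - 1) / lam for v in y]
    zbar = sum(z) / n
    s2 = sum((v - zbar) ** 2 for v in z) / n
    # Jacobian term (lam - 1) * sum(log y) makes likelihoods comparable
    return -0.5 * n * math.log(s2) + (lam - 1) * sum(math.log(v) for v in y)

def boxcox_lambda(y):
    """Grid-search MLE of lambda over -3..3 in steps of 0.01
    (an illustrative grid, not the dissertation's exact search)."""
    grid = [i / 100 for i in range(-300, 301)]
    return max(grid, key=lambda lam: boxcox_loglik(y, lam))
```

Because a handful of tenfold outliers dominate the likelihood, the maximizing λ for contaminated N(10,1) samples is pulled toward strongly negative values such as -2, as the figures show.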


Figure 4-3 Distribution of Lambda in Baseline with outliers, N(10,1)


[Four panels: 95% CI of Lambda, Baseline Case, N(10,1), with 5%, 10%, 15%, and 20% outliers; Lambda plotted against sample sizes 20-200.]

Figure 4-4 Confidence Interval of Lambda in Baseline with outliers, N(10,1)


• Anchor-to-1

If the sample is anchored to 1, the variation of λ is greatly alleviated compared to the baseline case. Most of the outlying values of λ are gone, although a number of large λ's remain in small samples. When the percentage of outliers is 20%, the mean of λ lies between -1 and -0.5, with a few λ's larger than 1.
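The anchor-to-1 step itself is a simple shift; a sketch (an assumed helper, returning the shift so the sample can be un-anchored after back-transformation):

```python
def anchor_to_1(y):
    """Shift the sample so its minimum equals 1, applied only when
    min(y) > 1 per the anchor-to-1 rule.  Returns the shifted sample
    and the shift amount (0.0 means no anchoring was done)."""
    shift = min(y) - 1
    if shift <= 0:
        return list(y), 0.0   # minimum <= 1: leave the sample unchanged
    return [v - shift for v in y], shift
```

After back-transforming Box-Cox results, adding `shift` back restores the original scale.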


Figure 4-5 Distribution of Lambda, Anchor-to-1, with outliers, N(10,1)


[Four panels: 95% CI of Lambda, Anchor-to-1, N(10,1), with 5%, 10%, 15%, and 20% outliers; Lambda plotted against sample sizes 20-200.]

Figure 4-6 Confidence Interval of Lambda, Anchor-to-1, with outliers, N(10,1)


• Accuracy of outlier detection

To evaluate the efficacy of the unified approach in detecting outliers, it is desirable to check how many outliers are found in each sample. The following steps are conducted to compute this accuracy. For each sample, the true number of observations multiplied by 10 (the generated outliers) is known. The Unified Approach reports how many outliers are in each sample, how many of them should be dropped, and whether Box-Cox is needed to transform the remaining sample to achieve normality. In addition, the Box-Cox transformation parameter λ is computed by the Unified Approach. As the Unified Approach detects outliers at both ends of the sample, it reports how many outliers are detected at each end.

Denote num_outlier the true number of outliers in the sample, upper_drop the number of outliers detected in the high end, low_drop the number of outliers detected in the low end, and outlier_drop the total number of outliers detected (upper_drop plus low_drop). To evaluate the accuracy of outlier detection, the following measurements are used for each sample:

a) If abs(upper_drop - num_outlier) <= 1 then accurate_1=1; else accurate_1=0.

b) If abs(low_drop - 0) <= 1 then accurate_2=1; else accurate_2=0.

c) If abs(low_drop - 0) <= 1 and abs(upper_drop - num_outlier) <= 1 then accurate_3=1; else accurate_3=0.

d) If low_drop=0 and abs(upper_drop - num_outlier) <= 1 then accurate_4=1; else accurate_4=0.

e) If low_drop=0 and upper_drop=num_outlier then accurate_5=1; else accurate_5=0.

The first criterion measures the accuracy of outlier detection in the high end. The second measures the accuracy in the low end, where the true number of outliers is zero because outliers are generated by multiplying observations by 10. The third measures low-end and high-end detection jointly with some flexibility; the fourth imposes a higher standard by requiring low-end detection to be exactly accurate; the fifth is the strictest, requiring both low-end and high-end detection to match the true case exactly. The Unified Approach may find outliers in the low end even though outliers are generated only in the high end. This is probably due to the masking/swamping effect, one of the major difficulties in outlier detection.
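The indicators a)-e) can be coded directly (a direct transcription of the definitions above; the function name is mine):

```python
def accuracy_flags(num_outlier, upper_drop, low_drop):
    """Return the five accuracy indicators [accurate_1..accurate_5].
    True outliers are generated only in the high end, so the low-end
    truth is 0."""
    a1 = abs(upper_drop - num_outlier) <= 1          # high end within 1
    a2 = abs(low_drop - 0) <= 1                      # low end within 1
    a3 = a1 and a2                                   # both, with slack
    a4 = (low_drop == 0) and a1                      # low end exact
    a5 = (low_drop == 0) and (upper_drop == num_outlier)  # both exact
    return [int(f) for f in (a1, a2, a3, a4, a5)]
```

Averaging these flags over the 1000 repetitions for each sample size gives the accuracy rates plotted in the figures.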

For each sample size, there are 1000 repetitions from which the accuracy measurements of outlier detection can be obtained. The percentages of "accurate" outlier detection based on these measurements are plotted in the following figures.


• Unified Approach without penalty for Box-Cox

When the unified approach is applied to the samples with outliers, the extra penalty for conducting the Box-Cox transformation is not included in the BIC at first. The λ found through the unified approach has a much smaller variance. Cases where outliers are detected and excluded only in the high end yield a narrower confidence interval for λ than cases where outliers are also detected in the low end. The correct rates for the high-end and both-end cases are almost the same, since the repetitions match most of the time.

It is seen that the correct rate of the Unified Approach in detecting outliers is above 95% for the first four accuracy criteria. When the sample size is larger than 50, Accurate_5 and Accurate_6 (the most stringent criteria) also achieve an accuracy rate above 95%. The following figures show the distribution of λ found through the Unified Approach under the different accuracy criteria. The variance of λ becomes smaller and the number of extremely large and small values of λ decreases.


Figure 4-7 Accuracy of outlier detection, N(10,1), 5% outliers, no penalty for BC


Figure 4-8 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy1=1, no extra penalty for Box-Cox


Figure 4-9 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy2=1, no extra penalty of Box-Cox


Figure 4-10 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy3=1, no extra penalty of Box-Cox


Figure 4-11 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy4=1, no extra penalty of Box-Cox


Figure 4-12 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy5=1, no extra penalty of Box-Cox


• Unified Approach with penalty for Box-Cox

When the unified approach is applied to the samples with outliers and an extra penalty is paid for conducting the Box-Cox transformation, in most cases the approach decides that outlier exclusion alone is needed to achieve normality rather than an additional Box-Cox transformation; when the sample size is larger than 50, this holds for more than 90% of the 1000 repetitions. Since this is a contaminated normal sample, the expectation is that only the outliers need to be excluded and that the remaining sample is normal, and the simulation results verify this. Applying the accuracy measurements of the previous (no-penalty) section, the accuracy of outlier detection remains promising when the penalty for the Box-Cox transformation is included.

The following box plots show the distribution of λ in the cases where Accurate_6=1 and Accurate_5=1, including all four handling methods: BC=1 DP=1, BC=1 DP=0, BC=0 DP=1, and BC=0 DP=0.


Figure 4-13 Percent of different handling methods, 5% outliers, N(10,1), penalty for BC


Figure 4-14 Accuracy of outlier detection, N(10,1), 5% outliers, penalty for BC, DP=1 and BC=0 cases


Figure 4-15 Distribution of lambda, Unified Approach, N(10,1), 5% outlier with penalty for BC, accuracy5=1 cases


10% outliers:

Figure 4-16 Accuracy of outlier detection, N(10,1), 10% outliers, no penalty for BC

Figure 4-17 Percent of different handling methods, 10% outliers, penalty for BC


Figure 4-18 Accuracy of outlier detection, N(10,1), 10% outliers, penalty for BC, DP=1 and BC=0 cases


Figure 4-19 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 10% outlier


15% outliers

Figure 4-20 Accuracy of outlier detection, N(10,1), 15% outliers, no penalty for BC

Figure 4-21 Percent of different handling methods, 15% outliers, penalty for BC


Figure 4-22 Accuracy of outlier detection, N(10,1), 15% outliers, penalty for BC, DP=1 and BC=0 cases


Figure 4-23 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 15% outlier


20% outliers

Figure 4-24 Accuracy of outlier detection, N(10,1), 20% outliers, no penalty for BC

Figure 4-25 Percent of different handling methods, 20% outliers, penalty for BC


Figure 4-26 Accuracy of outlier detection, N(10,1), 20% outliers, penalty for BC, DP=1 and BC=0 cases


Figure 4-27 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 20% outlier


4.3. Simulation on Lognormal samples

To test the Unified Approach further, Lognormal samples are chosen to verify the accuracy of outlier detection and the Box-Cox transformation. It is known that a Lognormal sample can be transformed to normality through the log transformation, i.e. λ=0. In the presence of outliers, it is expected that the Unified Approach can detect the outliers accurately and transform the remaining sample using the logarithm. The random sample used here is generated through exp(Z), where Z is from N(0,1). As the minimum of such a sample is less than 1, according to the Unified Approach the sample will not be anchored to 1 before the Box-Cox transformation is conducted.

Given such a sample, there are four possible conclusions: outliers are dropped and Box-Cox is needed (DP=1 and BC=1); outliers are dropped and no Box-Cox is needed (DP=1 and BC=0); no outliers are detected and Box-Cox is needed (DP=0 and BC=1); no outliers are detected and no Box-Cox is needed (DP=0 and BC=0).

It is noticed from the following figures that, over the 1000 repetitions, most samples choose to drop outliers and Box-Cox transform the remaining observations even with the penalty for doing Box-Cox (DP=1 and BC=1) (Figures 4-30, 4-35, 4-40, 4-45). The value of λ in the cases where accuracy5=1, BC=1, and DP=1 is close to the expected value 0 (Figures 4-33, 4-34, 4-38, 4-39, 4-43, 4-44, 4-48, 4-49). The variance of λ is small, and the confidence interval for the mean of λ is quite narrow and covers 0 most of the time. The accuracy of outlier detection, however, is not as good as in the normal samples, and the accuracy rate drops dramatically as the sample size increases. This is probably due to the skewness of the lognormal sample: for example, an outlier with value 15, generated by multiplying 1.5 by 10, is not easy to distinguish from an original value of 15 in the right tail of the sample. The regular outlier detection accuracy measurements therefore may not reflect this situation.


5%

Figure 4-28 Percent of different handling methods, 5% outliers, LogN(0,1), penalty for BC

Figure 4-29 Accuracy of outlier detection, LogN(0,1), 5% outliers, no penalty for BC


Figure 4-30 Accuracy of outlier detection, LogN(0,1), 5% outliers, penalty for BC

Figure 4-31 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC


Figure 4-32 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, penalty for BC


10%

Figure 4-33 Percent of different handling methods, 10% outliers, LogN(0,1), penalty for BC

Figure 4-34 Accuracy of outlier detection, LogN(0,1), 10% outliers, no penalty for BC


Figure 4-35 Accuracy of outlier detection, LogN(0,1), 10% outliers, penalty for BC

Figure 4-36 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC


Figure 4-37 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, penalty for BC


15%

Figure 4-38 Percent of different handling methods, 15% outliers, LogN(0,1), penalty for BC

Figure 4-39 Accuracy of outlier detection, LogN(0,1), 15% outliers, no penalty for BC


Figure 4-40 Accuracy of outlier detection, LogN(0,1), 15% outliers, penalty for BC

Figure 4-41 Distribution of lambda, LogN(0,1), 15% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC


Figure 4-42 Distribution of lambda, LogN(0,1), 15% outlier, accuracy5=1, BC=1 DP=1, penalty for BC


20%

Figure 4-43 Percent of different handling methods, 20% outliers, LogN(0,1), penalty for BC

Figure 4-44 Accuracy of outlier detection, LogN(0,1), 20% outliers, no penalty for BC


Figure 4-45 Accuracy of outlier detection, LogN(0,1), 20% outliers, penalty for BC

Figure 4-46 Distribution of lambda, LogN(0,1), 20% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC


Figure 4-47 Distribution of lambda, LogN(0,1), 20% outlier, accuracy5=1, BC=1 DP=1, penalty for BC


5. Discussion and future research

In this dissertation, a unified approach is proposed to handle outlier detection and Box-Cox transformation simultaneously using a penalized information criterion. Extensive simulation studies have been conducted, and the major findings are summarized below:

1) The Box-Cox transformation does not always work well in terms of finding an appropriate value of λ. In the simulations in chapter 2, for N(10,1) samples, the values of λ found through Box-Cox include extreme values such as 9 and -6, far from the expected value 1. The transformed samples can take very large or small values under such extreme λ's, which is undesirable even though the transformed sample might pass a normality test such as Anderson-Darling. For Lognormal samples, a value of λ close to zero is expected, while the simulation shows that Box-Cox does not always provide a λ close to zero. The variance of λ is found to be quite large, especially in small samples.

2) The anchor-to-1 method is proposed to solve the above problem. A sample with minimum value larger than 1 should be anchored so that its minimum is 1 before the Box-Cox transformation. With anchor-to-1, the variance of the Box-Cox λ is much smaller and the extreme values of λ disappear compared to not anchoring. The confidence interval for the mean of λ covers the true value most of the time for sample sizes ranging from 20 to 200.

3) Based on the idea of anchor-to-1, a Unified Approach is proposed to handle outliers and the Box-Cox transformation at the same time. For a given sample, with or without outliers, the unified approach can tell how many outliers are in the sample, how many extreme observations should be excluded, and what Box-Cox λ should be used to transform the remaining sample to achieve normality. The idea behind the Unified Approach is to find a tradeoff between better model fitting (normality) and less information loss (observations excluded). The criterion employed is an adjusted BIC, which penalizes outlier exclusion, anchor-to-1, and the Box-Cox transformation at the same time.
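A minimal sketch of an adjusted BIC of this form, assuming one penalty unit per excluded observation plus one each for Box-Cox and anchor-to-1 on top of the two Gaussian parameters (the exact parameter counting here is this sketch's assumption, not a quote of the dissertation's formula):

```python
import math

def adjusted_bic(loglik, n, dp, used_bc, used_anchor):
    """Adjusted BIC: -2*loglik + k*log(n), where k counts the mean and
    variance (2) plus one unit for each excluded observation (dp), one
    for applying Box-Cox, and one for applying anchor-to-1."""
    k = 2 + dp + int(used_bc) + int(used_anchor)
    return -2 * loglik + k * math.log(n)
```

The candidate (λ, DP, anchoring) combination with the smallest adjusted BIC is then selected, trading normality of the fit against the information lost by dropping observations.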

4) The efficacy of the Unified Approach is verified in clean samples first. According to the simulation results in chapter 3, the unified approach works satisfactorily in clean normal samples: it accurately detects that there are no outliers in a given sample, and based on the penalized BIC it finds a λ close to the true value 1, indicating that neither Box-Cox nor outlier exclusion is needed for a clean normal sample. Simulations have also been conducted in clean lognormal samples. The number of outliers found by the Unified Approach is no more than 2 in more than 90% of the 1000 repetitions, and the identified outliers are mostly located in the high end of the sample. For each clean lognormal sample, the Unified Approach has only two solutions: no outlier exclusion with a Box-Cox transformation (DP=0 and BC=1), or exclusion of some high-end observations with a Box-Cox transformation of the remaining sample (DP=1 and BC=1). The value of λ found through the Unified Approach is quite close to 0; its variance becomes smaller and its confidence interval covers zero when the sample size exceeds 50.

5) The Unified Approach has also been tested on normal and lognormal samples with outliers. Outliers are randomly generated in clean samples by multiplying the original observations by 10. To evaluate the accuracy of outlier detection, five measurements are created (accuracy1 to accuracy5), of which accuracy5 is the most stringent. In the normal samples, the Unified Approach works satisfactorily. Without the extra penalty for doing the Box-Cox transformation, the most stringent accuracy rate is larger than 90% when the sample size is larger than 50, and the value of λ is close to 1, with its variance shrinking as the sample size increases. With the penalty for doing Box-Cox, when the sample size is larger than 50 the Unified Approach chooses, in more than 90% of the repetitions, to only exclude some extremely large observations and not to do Box-Cox, which is consistent with the true case. For samples with more outliers added, similar conclusions can be drawn from the simulations. In the lognormal samples, for most of the repetitions (around 90%), the Unified Approach chooses to exclude extreme observations and Box-Cox transform the remaining sample (DP=1 and BC=1). The value of λ is close to 0 as expected. As for the accuracy of outlier detection, the Unified Approach does not work as well as in the normal samples. This is because the lognormal sample has a long high-end tail and it is difficult to distinguish outliers from large original values, i.e., an outlier generated by multiplying 1.2 by 10 looks no different from an original value of 12.

This dissertation has explored the Box-Cox transformation for dealing with skewed samples and outliers, and a Unified Approach has been proposed for handling samples with outliers. A few more things might be considered in future research:

1) The penalty function used in the Unified Approach treats the following as extra parameters: the Box-Cox transformation, each excluded observation, and anchor-to-1. They are penalized equally, i.e., in a case where one observation is excluded, Box-Cox is conducted, and anchor-to-1 is used, the total penalty count is 3. It might be a good idea to allocate different weights to them. For example, among the excluded observations, the largest observation dropped could be penalized by 1, the second largest by 2, and so forth. The penalties for doing Box-Cox, anchor-to-1, and observation exclusion can also differ. The penalty for outliers dropped at both the high and low ends can be higher than for outliers dropped at only one end: for example, two outliers dropped at both ends (one at each end) could be penalized by 5 (3 for the drop in the high end and 2 in the low end) rather than 4 (2 for each drop at either end). Another consideration is that the penalty part of BIC is k*log(n), where k is the number of parameters; other forms of penalty function might be considered, such as the corrected AIC, \( AICc = AIC + \frac{2k(k+1)}{n-k-1} \).

2) Outliers in the low end can also be considered in outlier detection. In this dissertation, outliers are generated only in the high end for simplicity, while extremely small observations in the low end of a sample are not rare. The impact of low-end observations might be alleviated by shifting the sample to the right; however, the shifted sample still shows skewness and may not pass a normality test. Considering outliers in both the high end and the low end would therefore provide a more general solution for outlier detection and data transformation.

3) Outliers are detected by checking all combinations of suspicious observations. In this dissertation, only 20% of the observations are considered suspicious outliers. For small samples this is not a problem, while for large samples, such as n=1000, 40401 combinations (201*201) need to be checked using the penalized BIC, which causes a heavy computational burden.

4) In deciding how many observations should be excluded and what λ should be used to achieve normality, the adjusted BIC is used as the criterion: the solution with the smaller adjusted BIC is chosen. The potential problem is that the difference in adjusted BIC may be small; for example, BICc=201 might suggest λ=0.8 with 5 outliers excluded while BICc=200.8 suggests λ=1 with 6 outliers excluded. In this situation, does using λ=0.8 instead of λ=1 make a big difference, given the 0.2 difference in BICc? Some flexibility should be allowed in choosing the parameters, since in practice choices are often limited to a few values of λ, such as integers between -3 and 3.


Bibliography

[1]. Grubbs, Frank (February 1969), Procedures for Detecting Outlying Observations in

Samples, Technometrics, 11(1), pp. 1-21.

[2]. R. B. Dean and W. J. Dixon (1951) "Simplified Statistics for Small Numbers of

Observations". Anal. Chem., 1951, 23 (4), 636–638.

[3]. Peirce, Benjamin, "Criterion for the Rejection of Doubtful Observations", Astronomical

Journal II 45 (1852)

[4]. David Hoaglin, Frederick Mosteller, and John Tukey (editors), Understanding Robust and

Exploratory Data Analysis, New York, John Wiley & Sons, 1983, pp. 39, 54, 62, 223.

[5]. Knorr, E. M. and Ng, R. T.: 1998, Algorithms for Mining Distance-Based Outliers in

Large Datasets. In: Proceedings of the VLDB Conference. New York, USA, pp. 392–403

[6]. Markus Breunig and Hans-Peter Kriegel and Raymond T. Ng and Jörg Sander: 2000,

LOF: Identifying Density-Based Local Outliers. In: Proceedings of the ACM SIGMOD

Conference. pp. 93-104

[7]. Cook, R. Dennis (Feb 1977). "Detection of Influential Observations in Linear

Regression". Technometrics (American Statistical Association) 19 (1): 15–18.

[8]. Barnett, V. and Lewis, T.: 1994, Outliers in Statistical Data. John Wiley & Sons., 3rd

edition.

[9]. Osborne, Jason W. & Amy Overbay (2004). The power of outliers (and why researchers

should always check for them). Practical Assessment, Research & Evaluation, 9(6).

[10]. P.J. Huber. . John Wiley & Sons, New York, 1981.

[11]. Rousseeuw, P. J. (1984) "Least Median of Squares Regression" Journal of the American

Statistical Association, 79, 871–880.

169

[12]. Rousseeuw, P.J. and Yohai, V. (1984), “ by Means of S estimators”,

in Robust and Nonlinear Analysis, edited by J. Franke, W. Härdle, and R.D.

Martin, Lecture Notes in Statistics 26, Springer Verlag, New York, 256-274.

[13]. Yohai V.J. (1987), “High Breakdown Point and High Robust Estimates for

Regression,” Annals of Statistics, 15, 642-656.

[14]. Hamilton, L.C. (1992). Regressions with graphics: A second course in applied statistics.

Monterey, CA.: Brooks/Cole.

[15]. Box, G. E. P. & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal

Statistical Society, 26(2), 211-252.

[16]. Andrews, D. F. (1971). A note on the selection of data transformations. Biometrika,

58(2), 249-254.

[17]. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions

on Automatic Control, 19(6), 716-723.

[18]. Moore, D.S. & McCabe G.P. (1999), Introduction to the Practice of Statistics. , Freeman

&Company.

[19]. Hodge, V. & Austin, Journal of Artificial Intelligence Review, 22-2, 85-126

[20]. Tukey, J. W. (1957) “The comparative anatomy of transformations”. Annals of

Mathematical Statistics, 28, pp. 602-632.

[21]. John W. Tukey (1977). Exploratory Data Analysis. Addison-Wesley.

[22]. Tietjen, G & Moore, R (1972). Some Grubbs- type statistics for the detection of several

outliers. Tech- nometrics, 14, 583-597.

[23]. Barnett, B. and Lewis, T. (1984). Outliers in statistical data, second edition. New York:

Wiley.

170

[24]. Manly, B. F. (1976) “Exponential data transformation”. The Statistician, 25, pp. 37-42.

[25]. John, J. A. & Draper, N. R. (1980) “An alternative family of transformations”. Applied

Statistics, 29, pp. 190-197.

[26]. Bickel, P. J. & Robson, D. S. (1981) “An analysis of transformations revisited”. Journal

of the American Statistical Association, 76, pp. 296-311.

[27]. Yeo, In-Kwon and Johnson, Richard (2000). A new family of power transformations to

improve normality or symmetry. Biometrika, 87, 954-959.

[28]. Pericchi, L. R. (1981) “A Bayesian approach to transformations to

normality”. Biometrika, 68, 35-43.

[29]. Sweeting, T. J. (1984) “On the choice of the prior distribution for the Box-Cox

transformed linear model”. Biometrika, 71, 127-134.

[30]. Carroll, R. J. (1980) “A robust method for testing transformation to achieve approximate

normality”. Journal of the Royal Statistical Society, Series B, 42, 71-78.

[31]. Carroll, R. J. (1982 a) “Tests for regression parameters in power transformation models”.

Scandinavian Journal of Statistics, 9, 217-222.

[32]. Carroll, R. J. & Ruppert, D. (1984) “Power transformations when fitting theoretical

models to data”. Journal of the American Statistical Association, 79, 321-328.

[33]. Taylor, M. J. G. (1983) “Power transformation to symmetry”. Unpublished Dissertation,

Department of Statistics, University of California, Berkeley.

[34]. Taylor, M. J. G. (1985 a) “Power transformation to symmetry”. Biometrika, 72, 145-152.

[35]. Taylor, M. J. G. (1985 b) “Measure of location of skew distributions obtained through

Box-Cox transformations”. Journal of the American Statistical Association, 80, 427-432.


[36]. Taylor, M. J. G. (1987) “Using a generalized mean as a measure of location”.

Biometrical Journal, 6, 731-738.

[37]. Andrews, D. F., Gnanadesikan, R. & Warner, J. L. (1973) “Method for assessing

multivariate normality, in: P. R. Krishnaiah (Ed.)”. Multivariate Analysis III, pp. 95-115,

New York, Academic Press.

[38]. Dunn, J. E. & Tubbs, J. D. (1980) “A procedure for determining homoscedastic

transformations of multivariate normal populations”. Communications in Statistics-

Simulation and Computation B, 9(6), 589-598.

[39]. Beauchamp, J. J. & Robson, D. S. (1986) “Transformation considerations in

discriminant analysis”. Communications in Statistics-Simulation and Computation, 15(1),

147-179.

[40]. Rigby, R.A. & Stasinopoulos, D.M. (2004) “Smooth centile curves for skew and

kurtotic data modelled using the Box–Cox power exponential distribution”. Stat Med,

23(19), 3053-76.

[41]. Bozdogan, H. & Ramirez, D.E. (1988) “UTRANS and MTRANS: Marginal and Joint

Box-Cox Transformations of Multivariate Data to 'Near' Normality”. Multivariate

Behavioral Research, 23, 131-132.

[42]. Lachtermacher, G. & Fuller, J.D. (1995). “Back propagation in time-series forecasting”.

Journal of forecasting (0277-6693), 14 (4), p. 381.

[43]. Draper, N. R. & Cox, D. R. (1969) “On distributions and their transformations to

normality”. Journal of the Royal Statistical Society, Series B, 31, 472-476.

[44]. Hinkley, D. V. (1975) “On power transformation to symmetry”. Biometrika, 62, 101-

111.


[45]. Cressie, N. A. C. (1978) “Removing nonadditivity from two-way tables with one

observation per cell”. Biometrics, 34, 505-513.

[46]. Hernandez, F. & Johnson, R. A. (1980) “The large sample behaviour of transformations

to normality”. Journal of the American Statistical Association, 75, 855-861.

[47]. Kullback, S. & Leibler, R.A. (1951). "On Information and Sufficiency". The Annals of

Mathematical Statistics 22 (1): 79–86.

[48]. Hinkley, D. V. (1985) “Transformation diagnostics for linear models”. Biometrika, 72,

487-496.

[49]. Han, A. K. (1987) “A non-parametric analysis of transformation”. Journal of

Econometrics, 35, 191-209.

[50]. Solomon, P. J. (1985) “Transformations for components of variance and

covariance”. Biometrika, 72, 233-239.

[51]. Sakia, R. M. (1988) “Application of the Box-Cox transformation technique to linear

balanced mixed analysis of variance models with a multi-error structure”. Unpublished

PhD Thesis, Universitaet Hohenheim, FRG.

[52]. Chang, H. S. (1977 a). A computer program for Box-Cox transformation and estimation technique. Econometrica, 45, 1741.

[53]. Huang, C.L. & Moon, L.C. & Chang, H.S (1978). A computer program using the Box-

Cox transformation technique for the specification of functional form, The American

Statistician, 32, 144.

[54]. Atkinson, A.C. (1973) Testing transformations to normality, Journal of the Royal

Statistical Society, Series B, 35, 473-479


[55]. Carroll, R. J. (1980) A robust method for testing transformation to achieve approximate

normality, Journal of the Royal Statistical Society, Series B, 42, 71-78.

[56]. Lawrance, A. J. (1987 a) The score statistic for regression transformation, Biometrika, 74,

275-279.

[57]. Hinkley, D. V. (1988) More on score tests for transformation in regression, Biometrika,

75, 366-369.

[58]. Lawrance, A. J. (1987 b) A note on the variance of the Box-Cox regression transformation

estimate, Applied Statistics, 36, 221-223.

[59]. Atkinson, A. C. (1985) Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis (Oxford, Clarendon Press).

[60]. Wang, S. (1987) Improved approximation for transformation diagnostics,

Communications in Statistics Theory and Methods, 16(6), 1797-1819.

[61]. Atkinson, A.C. & Lawrance, A. J. (1989) A comparison of asymptotic equivalent test

statistics for regression transformation, Biometrika, 76, 223-229.

[62]. Draper, N. R. & Cox, D. R. (1969) On distributions and their transformations to

normality, Journal of the Royal Statistical Society, Series B, 31, 472-476.

[63]. Poirier, D. J. (1978) The use of the Box-Cox transformation in limited dependent

variable models, Journal of the American Statistical Association, 73, 285-287.

[64]. Spitzer, J. J.(1978) A Monte Carlo investigation of the Box-Cox transformation in small

samples, Journal of the American Statistical Association, 73, 488-495

[65]. Bickel, P. J. and Doksum, K. A. (1981) An analysis of transformations revisited, Journal

of the American Statistical Association, 76, 296-311.


[66]. Box, G. E. P. & Cox, D. R. (1982) An analysis of transformation revisited, rebutted,

Journal of the American Statistical Association, 77, 209-210.

[67]. Carroll, R. J. & Ruppert, D. (1981) On prediction and the power transformation family,

Biometrika, 68, 609-615.

[68]. Hinkley, D. V. & Runger, G. (1984). The analysis of transformed data, Journal of the

American Statistical Association, 79, 302-320.

[69]. Doksum, K. A. & Wong, C. W. (1983) Statistical tests based on transformed data,

Journal of the American Statistical Association, 78, 411-417.

[70]. Wood, J. T. (1974) An extension of transformations of Box and Cox, Applied Statistics,

23, 278-283.

[71]. Carroll, R. J. & Ruppert, D. (1984) Power transformations when fitting theoretical

models to data, Journal of the American Statistical Association, 79, 321-328.

[72]. Ruppert, D., Cressie, N. & Carroll, R. J. (1989) A transformation/weighting model for

estimating Michaelis-Menten parameters, Biometrics, 45, 637-656.

[73]. Rudemo, M., Ruppert, D. & Streibig, J. C. (1989) Random effects models in nonlinear

regression with application to bioassay, Biometrics, 45, 349-362.

[74]. Wixley, R. A. J.(1986). Unconditional likelihood tests for the linear model following

Box-Cox transformation. South African Statistical Journal, 20, 67-83.

[75]. Atkinson, A. C. (1982). Regression diagnostics, transformation and constructed

variables, Journal of the Royal Statistical Society, Series B, 44, 1-36.

[76]. Atkinson, A. C. (1983) Diagnostic regression analysis and shifted power transformation.

Technometrics, 25, 23-33.


[77]. Carroll, R. J. (1982 b) Two examples of transformation when there are possible outliers,

Applied Statistics, 31, 149-152.

[78]. Cook, R. D. & Wang, P. C. (1983) Transformation and influential cases in regression,

Technometrics, 25, 337-343.

[79]. Atkinson, A. C. (1986) Diagnostic tests for transformation, Technometrics, 28, 29-37.

[80]. Schwarz, Gideon E. (1978). "Estimating the dimension of a model". Annals of Statistics

6 (2): 461–464.

[81]. Osborne, Jason (2002). Notes on the use of data transformations. Practical Assessment,

Research & Evaluation, 8(6).

[82]. Kadota, K., Nishimura, S.I., Bono, H. et al. 2003a. Detection of genes with tissue-specific

expression patterns using Akaike’s Information Criterion (AIC) procedure. Physiol. Genomics,

12:251–259.

[83]. Kadota, K., Tominaga, D., Akiyama, Y., & Takahashi, K. (2003). Detecting outlying samples in

microarray data: A critical assessment of the effect of outliers on sample classification. Chem-Bio

Informatics Journal, 3, 30-45.

[84]. Kadota, K., Ye, J., Nakai, Y., Terada, T., & Shimizu, K. (2006). ROKU: a novel method for

identification of tissue-specific genes. BMC Bioinformatics, 7:294.

[85]. Ueda, T. (1996). Simple method for the detection of outliers. Japanese Journal of

Applied Statistics, 25(1), 17-26.


Appendix

The following SAS/IML programs were used to generate the simulation results in Chapters 2-4.

Due to limited space, only the simulations for normal samples are provided here. For lognormal samples, the only change needed is the distribution generated at the beginning of each simulation.

A. N(10,1) sample, no Box-Cox transformation or outlier detection involved

*original sample, compute AIC BIC MSE AD AD_Pvalue;
proc iml;
pi = constant("pi");
e = constant("e");
ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200};
epsilon={0.0 0.05 0.1 0.15 0.20};
out1=shape(.,95000,9);

/*Normal data*/
do k=1 to 19; **sample size loop;
n=ss[k];
z=shape(.,n,1);
u=z;

neps=ncol(epsilon);
do j=1 to 1; **percent of outliers loop;
eps=epsilon[j];

%let rep=1000;
do i=1 to &rep; **1000 repetition loop;

seedz=21273*k+430856*j+4487*i+9999;
seedu=9789*k+1723*j+1532+4487*i+9999;

AIC=shape(.,&rep,1);
BIC=shape(.,&rep,1);
AD=shape(.,&rep,1);
AD2=shape(.,&rep,1);
SE=shape(.,&rep,1);
pvalue=shape(.,&rep,1);

call rannor(seedz,z);


y=10+1*z;
if eps>0 then do;
call ranuni(seedu,u);
y=y#(10*(u<=eps)+(u>eps));
end;

c=y; d=rank(y); y[d]=c; *sort y into ascending order;

xsl=ssq(y-y[:])/n; *MLE of the variance;
z2=sum(log(1:n)); *log(n!), from using the ordered sample;

AIC[i]=n*log(2*pi*e)+n*log(xsl)-2*z2+4; *AIC;
BIC[i]=n*log(2*pi*e)+n*log(xsl)-2*z2+2*log(n); *BIC;
SE[i]=(y[:]-10)##2; *squared error;

*anderson-darling;
len=nrow(y);
sse=ssq(y-y[:])/(len-1);
uu=(y-y[:])/sqrt(sse);
p=rank(uu);
f=cdf('NORMAL',uu[p]);
if abs(max(f)-1)<0.00001 then do;
ad[i]=999; ad2[i]=999;
end;
else do;
lnf1=log(f);
lnf2=log(1-f);
ad[i]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2);
ad2[i]=ad[i]#(1+0.75/len+2.25/len##2);
if ad2[i]>=0.6 & ad2[i]<13 then pvalue[i]=exp(1.2937-5.709#ad2[i]+0.0186#ad2[i]##2);
else if ad2[i]>=0.34 & ad2[i]<0.6 then pvalue[i]=exp(0.9177-4.297#ad2[i]-1.38#ad2[i]##2);
else if ad2[i]>=0.2 & ad2[i]<0.34 then pvalue[i]=1-exp(-8.318+42.796#ad2[i]-59.938#ad2[i]##2);
else pvalue[i]=1-exp(-13.436+101.14#ad2[i]-223.73#ad2[i]##2);
end;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,1] = n;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,2] = eps;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,3] = i;

out1[&rep*neps*(k-1)+&rep*(j-1)+i,4] = AIC[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,5] = BIC[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,6] = AD[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,7] = AD2[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,8] = pvalue[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,9] = SE[i];

end; *i; end; *j; end; *k;

create temp2 var{Sample_Size pct_outlier Rep AIC BIC AD AD2 AD_pvalue SE};
append from out1;
close temp2;
quit;
data temp2; set temp2; where sample_size ne .; run;
data out.normal_null;
set temp2;
rename AIC=AIC_null BIC=BIC_null AD=AD_null AD2=AD2_null SE=SE_null ad_Pvalue=ad_Pvalue_null;
run;
ods rtf file="F:\April_May_2013_Chapter\Chapter_3\MSE_null.doc";
proc means data=temp2 noprint;
var SE;
by sample_size;
output out=mse(drop=_type_) mean=MSE_null;
run;
proc print data=mse; run;
ods rtf close;
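The Anderson-Darling computation used throughout these programs (statistic A², the small-sample adjustment A²* = A²(1 + 0.75/n + 2.25/n²), and the piecewise p-value approximation) can be sketched outside SAS/IML as follows. This is an illustrative Python version of the same formulas, not part of the original simulation code; the function name `anderson_darling` is ours.

```python
import math

def anderson_darling(x):
    """A-D statistic for normality on sample x: returns (A2*, approximate p-value),
    mirroring the formulas in the SAS/IML programs above."""
    n = len(x)
    mean = sum(x) / n
    s2 = sum((v - mean) ** 2 for v in x) / (n - 1)  # sample variance, n-1 denominator as in the SAS code
    z = sorted((v - mean) / math.sqrt(s2) for v in x)
    # standard normal CDF of the ordered standardized values
    F = [0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z]
    # A^2 = -n - (1/n) * sum_i (2i-1) [ln F_(i) + ln(1 - F_(n+1-i))]
    a2 = -n - sum(((2 * i - 1) / n) * (math.log(F[i - 1]) + math.log(1 - F[n - i]))
                  for i in range(1, n + 1))
    a2_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    # piecewise p-value approximation, same coefficients as the SAS code
    if 0.6 <= a2_star < 13:
        p = math.exp(1.2937 - 5.709 * a2_star + 0.0186 * a2_star ** 2)
    elif 0.34 <= a2_star < 0.6:
        p = math.exp(0.9177 - 4.297 * a2_star - 1.38 * a2_star ** 2)
    elif 0.2 <= a2_star < 0.34:
        p = 1 - math.exp(-8.318 + 42.796 * a2_star - 59.938 * a2_star ** 2)
    else:
        p = 1 - math.exp(-13.436 + 101.14 * a2_star - 223.73 * a2_star ** 2)
    return a2_star, p
```

The SAS code additionally guards against F reaching 1 (setting AD = 999); the sketch omits that guard for brevity.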

B. N(10,1) sample, Box-Cox transformation only

*original data: do Box-Cox, check lambda for large variance;

libname out "F:\April_May_2013_Chapter\new_data_04302013";
proc iml;

/*constants used in the calculation*/ pi = constant("pi"); e = constant("e"); start BoxCox(lam) global(y,n,pi,e); x=(y##lam-1)/lam; xsl=ssq(x-x[:])/n; z1=sum(log(1:n)); f=0.5*n*log(2#pi#e)+0.5*n*log(xsl)-(lam-1)*sum(log(y))-z1; return (f); finish BoxCox; ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; /*ss={20 80 150 };*/ epsilon={0.0 0.05 0.1 0.15 0.20};

/*out1 and out2 are the output datasets*/ out1=shape(.,95000,11);

/*Normal data*/ do k=1 to 19; **sample size loop; n=ss[k]; z=shape(.,n,1); u=z;

neps=ncol(epsilon); do j=1 to 1; **percent of outliers loop; eps=epsilon[j];

%let rep=1000; do i=1 to &rep; **1000 repetition loop;

seedz=21273*k+430856*j+4487*i+9999; *!!!attention: when 9999 is changed to 999, xsl=0 occurs and the program fails; seedu=9789*k+1723*j+1532+4487*i+9999;

lambda=shape(.,&rep,1); AIC=shape(.,&rep,1); BIC=shape(.,&rep,1);


AD=shape(.,&rep,1); AD2=shape(.,&rep,1); SE=shape(.,&rep,1); pvalue=shape(.,&rep,1); btran=shape(.,&rep,1); x_bar=shape(.,&rep,1);

call rannor(seedz,z); y=10+1*z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end; c=y; d=rank(y); y[d]=c;

/*use nlpnra call to find the optimized log-likelihood, calculate AIC*/ l0=1; lammy=.; call nlpnra(q,lammy,"BoxCox",l0);

if q<=0 then do; lambda[i]=. ; AIC[i]=.; BIC[i]=.; AD[i]=.; AD2[i]=.; pvalue[i]=.; SE[i]=.; btran[i]=.; x_bar[i]=.; end;

else do; lambda[i]=lammy; x=(y##lammy-1)/lammy; x_bar[i]=x[:]; xsl=ssq(x-x[:])/n; z2=sum(log(1:n));

AIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+6;*raw; BIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+3*log(n);*BIC; btran[i]=(lammy#x_bar[i]+1)##(1/lammy); SE[i]=(btran[i]-10)##2;*back transformed square error;


*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[i]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[i]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[i]=ad[i]#(1+0.75/len+2.25/len##2); if ad2[i]>=0.6 & ad2[i]<13 then pvalue[i]=exp(1.2937-5.709#ad2[i]+0.0186#ad2[i]##2); else if ad2[i]>=0.34 & ad2[i]<0.6 then pvalue[i]=exp(0.9177-4.297#ad2[i]-1.38#ad2[i]##2); else if ad2[i]>=0.2 & ad2[i]<0.34 then pvalue[i]=1-exp(-8.318+42.796#ad2[i]-59.938#ad2[i]##2); else pvalue[i]=1-exp(-13.436+101.14#ad2[i]-223.73#ad2[i]##2); end;*max(f)=1; end;*q>0; out1[&rep*neps*(k-1)+&rep*(j-1)+i,1] = n; out1[&rep*neps*(k-1)+&rep*(j-1)+i,2] = eps; out1[&rep*neps*(k-1)+&rep*(j-1)+i,3] = i; out1[&rep*neps*(k-1)+&rep*(j-1)+i,4] = AIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,5] = lambda[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,6] = q; out1[&rep*neps*(k-1)+&rep*(j-1)+i,7] = BIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,8] = AD[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,9] = AD2[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,10] = pvalue[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,11] = SE[i];

end;*i;

end;*j; end;*k; create temp1 var{Sample_Size pct_outlier Rep AIC Lambda Q BIC AD AD2 AD_pvalue SE}; append from out1;

close temp1; quit; data temp1; set temp1; where sample_size ne .; run; data out.normal_baseline_BC; set temp1; rename lambda=lambda_baseline BIC=BIC_baseline AIC=AIC_baseline AD=AD_baseline AD2=AD2_baseline AD_pvalue=AD_pvalue_baseline; drop q; run; ods rtf file="F:\April_May_2013_Chapter\Chapter_3\MSE_baseline.doc"; proc means data=temp1 noprint; var SE; by sample_size; output out=mse(drop=_type_) mean=MSE_baseline; run; proc print data=mse; run; ods rtf close; ods html image_dpi=300; ods rtf file="F:\April_May_2013_Chapter\Chapter_3\lambda_baseline.doc"; proc sgplot data=out.normal_baseline_BC(where=(pct_outlier=0)); vbox lambda_baseline / category=sample_size clusterwidth=0.5 ; xaxis display=(nolabel); title 'Box-Plot of Lambda: Null case, N(10,1), Clean data'; YAXIS LABEL = 'Lambda' GRID VALUES = (-8 TO 8 BY 2); refline 1/; XAXIS LABEL = 'Sample Size'; run; title; proc means data=out.normal_baseline_BC(where=(pct_outlier=0)) noprint; var lambda_baseline; by sample_size; output out=CI mean=lambda_mean_null uclm=Upper lclm=Lower; run; data CI_plot; set CI; drop lower upper lambda_mean_null;

bound = lower; output; bound = upper; output; bound = lambda_mean_null; output; run; goptions reset=all; symbol1 interpol=hiloctj cv=red ci=blue width=2 value=dot; axis1 label=('Lambda'); axis2 label=('Sample Size') ; proc gplot data=CI_plot; plot bound*sample_size/vaxis=axis1 haxis=axis2; title '95% CI of Lambda: Null Case, N(10,1), Clean data'; run; title; quit; ods rtf close; data extreme; set out.normal_baseline_BC; where pct_outlier=0 and abs(lambda_baseline)>5; run;
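The BoxCox module above minimizes the profile negative log-likelihood f(λ) = (n/2) log(2πe) + (n/2) log σ̂²(λ) − (λ−1) Σ log yᵢ − log n!, where σ̂²(λ) is the MLE variance of the transformed sample. A minimal Python sketch of the same objective, with a simple grid search standing in for the Newton-Raphson call `nlpnra` (function names are ours):

```python
import math

def boxcox_negloglik(lam, y):
    """Profile negative log-likelihood of the Box-Cox model, mirroring the
    BoxCox module in the SAS/IML code above (y must be positive)."""
    n = len(y)
    if abs(lam) < 1e-8:
        x = [math.log(v) for v in y]          # limit of the transform as lam -> 0
    else:
        x = [(v ** lam - 1.0) / lam for v in y]
    xbar = sum(x) / n
    s2 = sum((v - xbar) ** 2 for v in x) / n  # MLE variance, n denominator as in the SAS code
    z1 = sum(math.log(i) for i in range(1, n + 1))  # log(n!), constant in lam
    return (0.5 * n * math.log(2 * math.pi * math.e) + 0.5 * n * math.log(s2)
            - (lam - 1.0) * sum(math.log(v) for v in y) - z1)

def boxcox_lambda(y, lo=-5.0, hi=5.0, steps=2001):
    """Grid-search MLE of lambda; a crude stand-in for the nlpnra
    Newton-Raphson optimization used in the SAS programs."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda lam: boxcox_negloglik(lam, y))
```

For lognormal-type data (y = exp(z) with z roughly symmetric), the minimizer lands near λ = 0, i.e. the log transform, as expected.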

C. N(10,1) sample, anchor to 1 and Box-Cox transformation only

libname out "F:\April_May_2013_Chapter\new_data_04302013";

*anchor to 1; *MSE here is not correct: it needs to be back-transformed and un-anchored; proc iml;

/*constants used in the calculation*/ pi = constant("pi"); e = constant("e"); start BoxCox(lam) global(y,n,pi,e); x=(y##lam-1)/lam; xsl=ssq(x-x[:])/n; z1=sum(log(1:n)); f=0.5*n*log(2#pi#e)+0.5*n*log(xsl)-(lam-1)*sum(log(y))-z1; return (f); finish BoxCox;


ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; /*ss={20 80 150 };*/ epsilon={0.0 0.05 0.1 0.15 0.20};

/*out1 and out2 are the output datasets*/ out1=shape(.,95000,11);

/*Normal data*/ do k=1 to 19; **sample size loop; n=ss[k]; z=shape(.,n,1); u=z;

neps=ncol(epsilon); do j=1 to 1; **percent of outliers loop; eps=epsilon[j];

%let rep=1000; do i=1 to &rep; **1000 repetition loop; seedz=21273*k+430856*j+4487*i+9999; *!!!attention: when 9999 changes to 999 xsl=0 bombs; seedu=9789*k+1723*j+1532+4487*i+9999; lambda=shape(.,&rep,1); AIC=shape(.,&rep,1); BIC=shape(.,&rep,1); AD=shape(.,&rep,1); AD2=shape(.,&rep,1); SE=shape(.,&rep,1); pvalue=shape(.,&rep,1); btran=shape(.,&rep,1); x_bar=shape(.,&rep,1); call rannor(seedz,z); y=10+1*z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end; c=y; d=rank(y); y[d]=c; y=y+1-y[1];


/*use nlpnra call to find the optimized log-likelihood, calculate AIC*/ l0=1; lammy=.; call nlpnra(q,lammy,"BoxCox",l0); if q<=0 then do; lambda[i]=. ; AIC[i]=.; BIC[i]=.; AD[i]=.; AD2[i]=.; pvalue[i]=.; SE[i]=.; btran[i]=.; x_bar[i]=.; end; else do; lambda[i]=lammy; x=(y##lammy-1)/lammy; x_bar[i]=x[:]; xsl=ssq(x-x[:])/n; z2=sum(log(1:n));

AIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+6;*raw; BIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+3*log(n);*BIC; btran[i]=(lammy#x_bar[i]+1)##(1/lammy); SE[i]=(btran[i]-10)##2;*back transformed square error;

*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[i]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[i]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[i]=ad[i]#(1+0.75/len+2.25/len##2); if ad2[i]>=0.6 & ad2[i]<13 then pvalue[i]=exp(1.2937-5.709#ad2[i]+0.0186#ad2[i]##2); else if ad2[i]>=0.34 & ad2[i]<0.6 then pvalue[i]=exp(0.9177-4.297#ad2[i]-1.38#ad2[i]##2); else if ad2[i]>=0.2 & ad2[i]<0.34 then pvalue[i]=1-exp(-8.318+42.796#ad2[i]-59.938#ad2[i]##2); else pvalue[i]=1-exp(-13.436+101.14#ad2[i]-223.73#ad2[i]##2); end;*max(f)=1; end;*q>0; out1[&rep*neps*(k-1)+&rep*(j-1)+i,1] = n; out1[&rep*neps*(k-1)+&rep*(j-1)+i,2] = eps; out1[&rep*neps*(k-1)+&rep*(j-1)+i,3] = i; out1[&rep*neps*(k-1)+&rep*(j-1)+i,4] = AIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,5] = lambda[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,6] = q; out1[&rep*neps*(k-1)+&rep*(j-1)+i,7] = BIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,8] = AD[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,9] = AD2[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,10] = pvalue[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,11] = SE[i];

end;*i; end;*j; end;*k; create temp1 var{Sample_Size pct_outlier Rep AIC Lambda Q BIC AD AD2 AD_pvalue SE}; append from out1; close temp1; quit; data temp1; set temp1; where sample_size ne .; run; data out.normal_baseline_anchor_BC; set temp1; rename lambda=lambda_baseline_anchor BIC=BIC_baseline_anchor AIC=AIC_baseline_anchor AD=AD_baseline_anchor AD2=AD2_baseline AD_pvalue=AD_pvalue_baseline_anchor; drop q; run; ods html image_dpi=300;

ods rtf file="F:\April_May_2013_Chapter\Chapter_3\lambda_baseline_anchor.doc"; proc sgplot data=out.normal_baseline_anchor_BC(where=(pct_outlier=0)); vbox lambda_baseline_anchor / category=sample_size clusterwidth=0.5 ; xaxis display=(nolabel); title 'Box-Plot of Lambda: Anchor-to-1, N(10,1), Clean data'; YAXIS LABEL = 'Lambda' GRID VALUES = (-8 TO 8 BY 2); refline 1/; XAXIS LABEL = 'Sample Size'; run; title; proc means data=out.normal_baseline_anchor_BC(where=(pct_outlier=0)) noprint; var lambda_baseline_anchor; by sample_size; output out=CI mean=lambda_mean_anchor lclm=Lower uclm=Upper; run; data CI_plot; set CI; drop lower upper lambda_mean_anchor; bound = lower; output; bound = upper; output; bound = lambda_mean_anchor; output; run; goptions reset=all; symbol1 interpol=hiloctj cv=red ci=blue width=2 value=dot; axis1 label=('Lambda'); axis2 label=('Sample Size') ; proc gplot data=CI_plot; plot bound*sample_size/vaxis=axis1 haxis=axis2; title '95% CI of Lambda: Anchor-to-1, N(10,1), Clean data'; run; title; quit; ods rtf close;
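The anchor-to-1 step in the program above is the shift y + 1 − min(y), applied before Box-Cox so the sample minimum is exactly 1; the same shift must be undone after back-transformation when reporting estimates on the original scale (this is why the MSE in this program is flagged as not yet correct). A small illustrative Python sketch, not the original SAS code:

```python
def anchor_to_1(y):
    """Shift the sample so its minimum is exactly 1 (the anchor-to-1 step);
    returns the shifted sample and the shift that was applied."""
    shift = 1.0 - min(y)
    return [v + shift for v in y], shift

def un_anchor(value, shift):
    """Undo the anchor shift, e.g. after back-transforming a mean, so the
    estimate is on the original scale."""
    return value - shift
```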

D. Unified Approach: detect outliers at both ends, anchor to 1 if the minimum is smaller than 1, then Box-Cox transform; penalty for Box-Cox is not considered

options linesize=200 pagesize=200 nocenter; options nosource nonotes;


libname out "C:\wg_05282013"; *p1; proc iml; pi = constant("pi"); e = constant("e"); start BoxCox(lam) global(y2,n,dp,pi,e); x=(y2##lam-1)/lam; xsl=ssq(x-x[:])/(n-dp); z1=sum(log(1:(n-dp))); f=0.5*(n-dp)*log(2#pi#e)+0.5*(n-dp)*log(xsl)-(lam-1)*sum(log(y2))-z1; return (f); finish BoxCox; ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; epsilon={0.05 0.1 0.15 0.20}; out1=shape(.,2290000,17); do k=1 to 10; n=ss[k]; z=shape(.,n,1); u=shape(.,n,1); l_max_drop=n/5; u_max_drop=n/5; max_drop=(l_max_drop+1)*(u_max_drop+1);

*find a constant used later; if k=1 then do; cc=0; end; else do; cc=0; do index=1 to k-1; cc=cc+(0.2*ss[index]+1)##2; end; end;

neps=ncol(epsilon); neps=1;*change accordingly; do j=1 to 1; eps=epsilon[j];


%let rep=1000; do i=1 to &rep; *rep; seedz=21273*k+430856*j+4487*i+9999; seedu=9789*k+1723*j+1532+4487*i+9999; call rannor(seedz,z); y=10+z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end;

do h=1 to l_max_drop+1; ldp=h-1; y22=shape(.,n-ldp,1); y33=shape(.,n-ldp,1); y44=shape(.,n-ldp,1); c=y; d=rank(y); y[d]=c; y_min=y[1];*the minimum of raw obs; y44=y+1-y_min;*anchor to 1; y22=y44[ldp+1:n]; y33=y22+1-y22[1];*anchor every time when lower end is dropped; do g=1 to u_max_drop+1; udp=g-1; dp=ldp+udp; y2=shape(.,n-dp,1); y_kp=shape(.,n-dp,1); lambda=shape(.,max_drop,1); AIC=shape(.,max_drop,1); AIC2=shape(.,max_drop,1); BIC=shape(.,max_drop,1); AIC4=shape(.,max_drop,1); AIC5=shape(.,max_drop,1); ad=shape(.,max_drop,1); ad2=shape(.,max_drop,1); pvalue=shape(.,max_drop,1); SE=shape(.,max_drop,1); y_kp_bar=shape(.,max_drop,1); SE0=shape(.,max_drop,1); x_bar=shape(.,max_drop,1);


btran=shape(.,max_drop,1); SE2=shape(.,max_drop,1);

**drop outliers; c=y33; d=rank(y33); y33[d]=c; y2=y33[1:n-ldp-udp]; y_kp=y[ldp+1:n-udp];

l0=1; lammy=.; call nlpnra(q,lammy,"BoxCox",l0);

if q<=0 then do; lambda[g]=. ; AIC[g]=.; AIC2[g]=.; BIC[g]=.; AIC4[g]=.; ad[g]=.; ad2[g]=.; pvalue[g]=.; SE[g]=.; y_kp_bar[g]=.; btran[g]=.; x_bar[g]=.; SE0[g]=.; SE2[g]=.;

end;

else do; lambda[g]=lammy; x=(y2##lammy-1)/lammy;

z2=sum(log(1:(n-dp))); z3=sum(log(1:n));

xsl=ssq(x-x[:])/(n-dp); AIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy-1)*sum(log(y2))-2*z2+6+2*dp; *raw; BIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy-1)*sum(log(y2))-2*z2+(3+dp)*log(n-dp); *BIC;

x_bar[g]=x[:];


SE0[g]=(x_bar[g]-10)##2;*transformed square error;

y_kp_bar[g]=y_kp[:];

SE2[g]=(y_kp_bar[g]-10)##2;

btran[g]=(lammy#x_bar[g]+1)##(1/lammy)+y[h]-1; SE[g]=(btran[g]-10)##2;*back transformed square error; /* print dp ldp h udp y y2 y_kp x btran lammy se y_kp_bar x_bar SE0 SE2;*/

*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[g]=999; ad2[g]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[g]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[g]=ad[g]#(1+0.75/len+2.25/len##2); end; if ad2[g]>=0.6 & ad2[g]<13 then pvalue[g]=exp(1.2937-5.709#ad2[g]+0.0186#ad2[g]##2); else if ad2[g]>=0.34 & ad2[g]<0.6 then pvalue[g]=exp(0.9177-4.297#ad2[g]-1.38#ad2[g]##2); else if ad2[g]>=0.2 & ad2[g]<0.34 then pvalue[g]=1-exp(-8.318+42.796#ad2[g]-59.938#ad2[g]##2); else pvalue[g]=1-exp(-13.436+101.14#ad2[g]-223.73#ad2[g]##2);

end;

out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,1] = n; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,2] = eps; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,3] = i;


out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,4] = dp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,5] = ldp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,6] = udp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,7] = AIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,8] = lambda[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,9] = q; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,10] = ad[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,11] = ad2[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,12] = pvalue[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,13] = BIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,14] = SE[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,15] = SE0[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,16] = btran[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,17] = SE2[g];

end; *g; end; *h; end; *i; end; *j; end; *k; create p1 var{Sample_Size pct_outlier Rep outlier_drop low_drop upper_drop AIC Lambda Q AD AD2 pvalue BIC SE SE0 btran SE2}; append from out1; close p1; quit; data out.bc_p1; set p1; where sample_size ne .; run;
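The adjusted criterion in this program charges each excluded observation a penalty of 2 on the AIC scale (and log(n−dp) on the BIC scale), so dropping a point must improve the normal fit enough to pay for itself. A compact Python sketch of that selection, with λ held at 1 for simplicity (i.e., the no-Box-Cox variant of the criterion, as in program E); the helper names `penalized_aic` and `best_drop` are ours:

```python
import math

def penalized_aic(y_sorted, ldp, udp, lam=1.0):
    """Adjusted AIC after dropping ldp points from the low end and udp from
    the high end of the sorted sample; each dropped point adds 2 to the
    penalty, mirroring the 2*dp term in the SAS code (lam fixed at 1 here)."""
    dp = ldp + udp
    kept = y_sorted[ldp: len(y_sorted) - udp]
    m = len(kept)
    xbar = sum(kept) / m
    s2 = sum((v - xbar) ** 2 for v in kept) / m          # MLE variance
    logfact = sum(math.log(i) for i in range(1, m + 1))  # log(m!)
    return (m * math.log(2 * math.pi * math.e) + m * math.log(s2)
            - 2 * (lam - 1.0) * sum(math.log(v) for v in kept)
            - 2 * logfact + 4 + 2 * dp)

def best_drop(y, max_drop):
    """Search all (low, high) drop counts up to max_drop each and return the
    pair minimizing the penalized AIC."""
    ys = sorted(y)
    return min(((l, u) for l in range(max_drop + 1) for u in range(max_drop + 1)),
               key=lambda p: penalized_aic(ys, p[0], p[1]))
```

On a sample of points near 10 with two gross high-end outliers, the criterion prefers dropping exactly those two points: the variance reduction outweighs the per-point penalty, while dropping a third (clean) point does not.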


E. Unified Approach: detect outliers at both ends, anchor to 1 if the minimum is smaller than 1, then Box-Cox transform; penalty for Box-Cox is considered

options nosource nonotes; libname out "C:\wg_05282013"; ****p1; proc iml; *no Box-Cox is done in this program; this result is compared with anchor_every_step_drop_BC.sas; *lambda is set to one throughout, indicating no BC is done; pi = constant("pi"); e = constant("e");

/*ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200};*/ ss={20 30 40 50 60 70 80 90 100 110}; epsilon={0.05 0.1 0.15 0.20}; out1=shape(.,2290000,14); do k=1 to 10; n=ss[k]; z=shape(.,n,1); u=shape(.,n,1); l_max_drop=n/5; u_max_drop=n/5; max_drop=(l_max_drop+1)*(u_max_drop+1);

*find a constant used later; if k=1 then do; cc=0; end; else do; cc=0; do index=1 to k-1; cc=cc+(0.2*ss[index]+1)##2; end; end;

neps=ncol(epsilon); neps=1;*change accordingly; do j=1 to 1; eps=epsilon[j];


%let rep=1000; do i=1 to &rep; *rep; seedz=21273*k+430856*j+4487*i+9999; seedu=9789*k+1723*j+1532+4487*i+9999; call rannor(seedz,z); y=10+z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end; do h=1 to l_max_drop+1; ldp=h-1; y22=shape(.,n-ldp,1); y33=shape(.,n-ldp,1); y44=shape(.,n-ldp,1); c=y; d=rank(y); y[d]=c; y_min=y[1];*the minimum of raw obs; y44=y+1-y_min;*anchor to 1; y22=y44[ldp+1:n]; y33=y22+1-y22[1]; do g=1 to u_max_drop+1; udp=g-1; dp=ldp+udp; y2=shape(.,n-dp,1); y_kp=shape(.,n-dp,1); lambda=shape(.,max_drop,1); AIC=shape(.,max_drop,1); AIC2=shape(.,max_drop,1); BIC=shape(.,max_drop,1); AIC4=shape(.,max_drop,1); AIC5=shape(.,max_drop,1); ad=shape(.,max_drop,1); ad2=shape(.,max_drop,1); pvalue=shape(.,max_drop,1); SE=shape(.,max_drop,1);

**drop outliers; c=y33; d=rank(y33);


y33[d]=c; y2=y33[1:n-ldp-udp]; y_kp=y[ldp+1:n-udp];

q=999; lammy=1; lambda[g]=lammy; x=(y2##lammy-1)/lammy;

z2=sum(log(1:(n-dp))); z3=sum(log(1:n));

xsl=ssq(x-x[:])/(n-dp); AIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy-1)*sum(log(y2))-2*z2+4+2*dp; *raw AIC; BIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy-1)*sum(log(y2))-2*z2+(2+dp)*log(n-dp); *BIC; SE[g]=(y_kp[:]-10)##2; /* print dp ldp udp y y2 y_kp x SE;*/

*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[g]=999; ad2[g]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[g]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[g]=ad[g]#(1+0.75/len+2.25/len##2); end; if ad2[g]>=0.6 & ad2[g]<13 then pvalue[g]=exp(1.2937-5.709#ad2[g]+0.0186#ad2[g]##2); else if ad2[g]>=0.34 & ad2[g]<0.6 then pvalue[g]=exp(0.9177-4.297#ad2[g]-1.38#ad2[g]##2); else if ad2[g]>=0.2 & ad2[g]<0.34 then pvalue[g]=1-exp(-8.318+42.796#ad2[g]-59.938#ad2[g]##2); else pvalue[g]=1-exp(-13.436+101.14#ad2[g]-223.73#ad2[g]##2);


/* end;*/

out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,1] = n; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,2] = eps; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,3] = i; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,4] = dp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,5] = ldp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,6] = udp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,7] = AIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,8] = lambda[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,9] = q; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,10] = ad[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,11] = ad2[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,12] = pvalue[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,13] = BIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,14] = SE[g]; end; *g; end; *h; end; *i; end; *j; end; *k; create p1 var{Sample_Size pct_outlier Rep outlier_drop low_drop upper_drop AIC Lambda Q AD AD2 pvalue BIC SE}; append from out1; close p1; quit; data out.no_bc_p1; set p1; where sample_size ne .;

run;

F. Evaluation of outlier detection accuracy of the Unified Approach

libname out "F:\April_May_2013_Chapter\outlier_sample_05282013\normal\count_number_of_outlier"; %macro min2(var=, dat=); proc summary data=&dat nway; class sample_size pct_outlier Rep; var &var; output out=aa(drop=_type_ _freq_) min(&var)=min_&var; run; proc sort data=&dat; by sample_size pct_outlier Rep; run; proc sort data=aa; by sample_size pct_outlier Rep; run; data bb; merge &dat aa; by sample_size pct_outlier Rep; if &var=min_&var; run; proc freq data=bb noprint; by sample_size pct_outlier; table outlier_drop/nocol norow nopercent out=cc_&var; run; %mend; %min2(var=BIC,dat=out.bc_5_pct); data accuracy; set bb; if abs(upper_drop-num_outlier)<=1 then accurate=1; else accurate=0; if abs(low_drop-0)<=1 then accurate2=1;

else accurate2=0; if abs(outlier_drop-num_outlier)<=1 then accurate3=1; else accurate3=0; if abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1; else accurate4=0; if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1; else accurate5=0; if low_drop=0 and upper_drop=num_outlier then accurate6=1; else accurate6=0; run; ods rtf file="F:\April_May_2013_Chapter\Chapter_4\06222013_accuracy_reply_to_horn.doc"; proc freq data=bb noprint; table outlier_drop/norow nocol nopercent out=one; table low_drop/norow nocol nopercent out=two; table upper_drop/norow nocol nopercent out=three; by sample_size; run; proc print data=one;run; proc print data=two;run; proc print data=three;run; ods rtf close;

ods html image_dpi=300;
ods rtf file="F:\April_May_2013_Chapter\Chapter_4\plot_06222013_lambda_outlier_detection_accuracy.doc";
proc freq data=accuracy noprint;
    table accurate/norow nocol nopercent out=acc;
    table accurate2/norow nocol nopercent out=acc2;
    table accurate3/norow nocol nopercent out=acc3;
    table accurate4/norow nocol nopercent out=acc4;
    table accurate5/norow nocol nopercent out=acc5;
    table accurate6/norow nocol nopercent out=acc6;
    by sample_size;
run;

proc print data=acc;
    title 'if abs(upper_drop-num_outlier)<=1 then accurate=1;';
run; title;
proc print data=acc2;
    title 'if abs(low_drop-0)<=1 then accurate2=1;';
run; title;
proc print data=acc3;
    title 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1;';
run; title;
proc print data=acc4;
    title 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1';
run; title;
proc print data=acc5;
    title 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1';
run; title;
proc print data=acc6;
    title 'if low_drop=0 and upper_drop=num_outlier then accurate6=1;';
run; title;

**plot;
symbol1 color=vibg interpol=join value=dot height=1.5;

proc gplot data=acc(where=(accurate=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: high end outlier detection';
    title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1;';
run; title; quit;

proc gplot data=acc2(where=(accurate2=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: low end outlier detection';
    title2 'if abs(low_drop-0)<=1 then accurate2=1;';
run; title; quit;

proc gplot data=acc3(where=(accurate3=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: both ends outlier detection';
    title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1;';
run; title; quit;

proc gplot data=acc4(where=(accurate4=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low end and outlier in high end';
    title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1';
run; title; quit;

proc gplot data=acc5(where=(accurate5=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
    title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1';
run; title; quit;

proc gplot data=acc6(where=(accurate6=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
    title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1;';
run; title; quit;

proc sgplot data=accuracy(where=(accurate=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in high end, no penalty for BC';
    title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1;';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate2=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in low end, no penalty for BC';
    title2 'if abs(low_drop-0)<=1 then accurate2=1;';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate3=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in both ends, no penalty for BC';
    title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1;';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate4=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end, no penalty for BC';
    title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate5=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (low end detected=0), no penalty for BC';
    title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate6=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (exact, and low end detected=0), no penalty for BC';
    title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1;';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

ods rtf close;

* plot_06222013_lambda_outlier_detection_accuracy.doc;
*****************************************************;
***** penalty for BC;
data out.bc_no_bc_5_pct;
    set out.bc_5_pct out.no_bc_5_pct;
run;

%min2(var=BIC, dat=out.bc_no_bc_5_pct);

data leave_raw_alone;
    set bb;
    if lambda=1 then BC=0;
    else BC=1;
    if outlier_drop=0 then DP=0;
    else DP=1;
run;

proc freq data=leave_raw_alone noprint;
    table DP*BC/norow nocol nopercent out=leave_raw_alone_summary;
    table outlier_drop*BC/norow nocol nopercent out=leave_raw_alone_summary_both;
    table upper_drop*BC/norow nocol nopercent out=leave_raw_alone_summary_up;
    table low_drop*BC/norow nocol nopercent out=leave_raw_alone_summary_low;
    by sample_size;
run;

ods rtf file="F:\April_May_2013_Chapter\Chapter_4\06232013_leave_raw_alone.doc";
proc print data=leave_raw_alone_summary;      run;
proc print data=leave_raw_alone_summary_both; run;
proc print data=leave_raw_alone_summary_up;   run;
proc print data=leave_raw_alone_summary_low;  run;
ods rtf close;

data accuracy;
    set leave_raw_alone;
    if abs(upper_drop-num_outlier)<=1 then accurate=1;
    else accurate=0;
    if abs(low_drop-0)<=1 then accurate2=1;
    else accurate2=0;
    if abs(outlier_drop-num_outlier)<=1 then accurate3=1;
    else accurate3=0;
    if abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1;
    else accurate4=0;
    if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1;
    else accurate5=0;
    if low_drop=0 and upper_drop=num_outlier then accurate6=1;
    else accurate6=0;
run;

data accuracy_dp_only;
    set accuracy;
    where DP=1 and BC=0;  * only cases where outliers were dropped (DP=1) and no Box-Cox was applied (BC=0);
run;

ods html image_dpi=300;
ods rtf file="F:\April_May_2013_Chapter\Chapter_4\plot_06232013_lambda_outlier_detection_accuracy_BC_penalty.doc";
proc freq data=accuracy_dp_only noprint;
    table accurate/norow nocol nopercent out=acc;
    table accurate2/norow nocol nopercent out=acc2;
    table accurate3/norow nocol nopercent out=acc3;
    table accurate4/norow nocol nopercent out=acc4;
    table accurate5/norow nocol nopercent out=acc5;
    table accurate6/norow nocol nopercent out=acc6;
    by sample_size;
run;

proc print data=acc;
    title 'if abs(upper_drop-num_outlier)<=1 then accurate=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc2;
    title 'if abs(low_drop-0)<=1 then accurate2=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc3;
    title 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc4;
    title 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc5;
    title 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc6;
    title 'if low_drop=0 and upper_drop=num_outlier then accurate6=1, penalty for BC, cases DP=1 and BC=0';
run; title;

**plot;
symbol1 color=vibg interpol=join value=dot height=1.5;

proc gplot data=acc(where=(accurate=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: high end outlier detection';
    title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;


proc gplot data=acc2(where=(accurate2=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: low end outlier detection';
    title2 'if abs(low_drop-0)<=1 then accurate2=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc gplot data=acc3(where=(accurate3=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: both ends outlier detection';
    title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc gplot data=acc4(where=(accurate4=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low end and outlier in high end';
    title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc gplot data=acc5(where=(accurate5=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
    title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc gplot data=acc6(where=(accurate6=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
    title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc sgplot data=accuracy_dp_only(where=(accurate=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in high end, penalty for BC';
    title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate2=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in low end, penalty for BC';
    title2 'if abs(low_drop-0)<=1 then accurate2=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate3=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in both ends, penalty for BC';
    title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate4=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end, penalty for BC';
    title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate5=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (low end detected=0), penalty for BC';
    title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate6=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (exact, and low end detected=0), penalty for BC';
    title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

ods rtf close;
