Testing for Normality of Censored Data
Department of Statistics

Spring 2015
Johan Andersson & Mats Burberg
Supervisor: Måns Thulin

Abstract

In order to make statistical inference, that is, to draw conclusions about a population from a sample, it is crucial to know the correct distribution of the data. This paper focused on censored data from the normal distribution. The purpose of the paper was to answer whether we can test if data come from a censored normal distribution. This was done by applying normality tests and tests designed for censored data and investigating whether these tests attained the correct size. The investigation was carried out with simulations in the program R for left censored data. The results indicated that, with increasing censoring, the normality tests failed to accept normality in a sample. The tests designed for censored data, on the other hand, met the size requirement as the censoring level increased, which was the most important conclusion of this paper.

Keywords: Censored data, normality tests, Cramer-von Mises test statistic, Anderson-Darling test statistic, testing for normality.

Table of Contents

Introduction
Theory
  Censored Data
  Type I And Type II Error In Hypothesis Testing
  Test Statistics
Method
  Type I Error For Normality Tests With Censoring Of Type I
  Type I Error For The Adjusted Anderson-Darling And Cramer-von Mises Tests With Censoring Of Type II
  Type II Error For The Adjusted Anderson-Darling And Cramer-von Mises Tests With Censoring Of Type II
Results
  Type I Error For All Test Statistics When Censoring Is Applied
  Type I Error From The EDF Tests
  Type I Error From Normality Test Statistics
  Type I Error For Adjusted Anderson-Darling And Cramer-von Mises Test With Type II Censoring
  Power Of The Adjusted Anderson-Darling And Adjusted Cramer-von Mises Tests
Discussion
Conclusion
References
Appendix
  Appendix A: Critical Values For The Adjusted Anderson-Darling And The Adjusted Cramer-von Mises Statistic
  Appendix B: Parameter Estimation
  Appendix C: Critical Values Of The Skewness And Kurtosis Test
  Appendix D: R Code For The Adjusted Anderson-Darling And Cramer-von Mises Test For Censored Data Of Type II

Introduction

In statistical analysis it is important to know which distribution a sample is drawn from in order to make correct inference (Körner, 2006). A standard assumption in many applications is normality. The normal distribution is symmetric around the mean, and its density decreases with the distance from the mean (Wackerly, Mendenhall and Scheaffer, 2008).

Uncensored data are data for which every measurement is known. If some values are not observed or are impossible to measure, the observations are called censored or truncated. For example, in environmental data analysis and in the analysis of substances in blood samples it is common for observations to fall below a limit of detection value (LoD-value), and hence they are not observed (Millard, 2008). The problem with censored data is that there is an information gap in the sample, which makes it harder to evaluate whether a dataset is normally distributed. If a value is not observable, is that the same as it not existing?

There are different ways to deal with censored observations, and an often-used technique is imputation. An imputed value is one that was not observed but is inserted into the sample as the value the observation most probably has.
In these cases the assumption of normally distributed data is used to make imputations when samples have missing values or censored observations. When handling censored data, the unobserved values are often imputed as the LoD-value (Millard, 2008).

To give a more intuitive explanation of censored data, consider the sample

0.5  1  1.75  2  3  3.4

If the LoD-value is 1.1, two observations in the sample above are censored (considering left censoring):

< 1.1  < 1.1  1.75  2  3  3.4

Truncated data differ from censored data in that observations below the LoD are not presented at all. With a truncation limit of 1.1 the sample above would therefore be presented as:

1.75  2  3  3.4

The purpose of the study is to investigate whether it is possible to use normality tests and censoring tests on censored data without changing the size of the test. This will be carried out for left censored data. The test statistics being evaluated are the Lilliefors test (which is based on the Kolmogorov-Smirnov test), the Jarque-Bera test, the skewness test, the kurtosis test, the Shapiro-Wilk test, the Anderson-Darling test, the Anderson-Darling test for censored data and the Cramer-von Mises test for censored data. The simulations are done with a sample size of 20 observations, since this sample size is not uncommon in environmental data analysis.

By sampling data that are censored at different levels and running these simulated samples through the different normality tests, it will be seen how many times each normality test fails to accept normality when handling censored data. The two adjusted normality tests will be tested with data censoring of type I and type II. Samples drawn from a χ²-distribution and a Student's t-distribution will be simulated to approximate the power of the adjusted Anderson-Darling and Cramer-von Mises tests. The approximated power of a test will henceforth be referred to as the power of the test. This has been carried out in the program R.
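The distinction between censoring, truncation and imputation at the LoD-value can be sketched in a few lines of code. The paper's own simulations were done in R; the Python below is only an illustrative sketch, and the function names `left_censor`, `left_truncate` and `impute_at_lod` are hypothetical helpers that do not come from the paper.

```python
def left_censor(sample, lod):
    """Mark every value below the limit of detection as '<lod' (left censoring)."""
    return [f"<{lod}" if x < lod else x for x in sample]


def left_truncate(sample, lod):
    """Drop every value below the limit entirely (left truncation)."""
    return [x for x in sample if x >= lod]


def impute_at_lod(sample, lod):
    """Replace every value below the limit of detection with the limit itself."""
    return [max(x, lod) for x in sample]


# The example sample and LoD-value from the text above.
sample = [0.5, 1, 1.75, 2, 3, 3.4]
lod = 1.1

censored = left_censor(sample, lod)     # two '<1.1' markers, then 1.75 2 3 3.4
truncated = left_truncate(sample, lod)  # only 1.75 2 3 3.4 remain
imputed = impute_at_lod(sample, lod)    # 1.1 1.1 1.75 2 3 3.4

print(censored)
print(truncated)
print(imputed)
```

Note that the censored sample still has six entries, so the number of values below the limit is known, whereas the truncated sample carries no trace of them.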
Theory

One advantage of the normal distribution is its symmetry around the mean, which is why the assumption of normality is preferable from a theoretical perspective (Wackerly et al, 2008). A way to visually check whether a dataset follows a normal distribution is to examine the data in a histogram. The basic idea is that observations are gathered around the mean, and the further from the mean, the fewer observations are observed. A normal distribution with mean μ and variance σ² has the following probability density function:

$$\Phi(x_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}\,(x_i - \mu)^2\right)$$

When the mean is zero and the variance is one, the normal distribution is called the standardized normal distribution. When conducting statistical analysis, the assumption that the data are normally distributed often makes the analysis easier due to the properties of the normal distribution (Wackerly et al, 2008).

Censored Data

Censored data are normally categorized into left censoring, right censoring and interval censoring. Left censored data occur when there are values in a sample that are smaller than the LoD-value. When it is not possible to measure values larger than a LoD-value, the data analysed are called right censored. Observations that are outside the LoD-values are often imputed as the LoD-value, together with a "smaller than" or "larger than" sign. Interval censoring occurs when an observation is only known to lie within a specific interval, for example 10 ≤ x ≤ 15. Any observation on a continuous random variable can be considered interval censored when it is rounded to a few decimals (Millard, 2013).

Censored data are further divided into two types: type I censoring and type II censoring. Type I and type II censoring should not be confused with type I and type II errors in hypothesis testing. The main difference between the censoring types is which part of the outcome is random in a sample with censored observations.
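What is random under each censoring type can be illustrated with a small simulation. The sketch below assumes the standard definitions: under type I censoring the limit is fixed in advance and the number of censored observations is random, while under type II censoring the number of censored observations is fixed in advance and the limit is random. It is written in Python for illustration only (the paper used R), and the function names, the seed, the limit of -0.5 and the choice of four censored values are all hypothetical.

```python
import random


def type_one_censor(sample, lod):
    """Type I left censoring: fixed limit, random number of censored values."""
    observed = sorted(x for x in sample if x >= lod)
    n_censored = len(sample) - len(observed)
    return observed, n_censored


def type_two_censor(sample, n_censored):
    """Type II left censoring: fixed number of censored values, random limit.

    Assumes n_censored is smaller than the sample size.
    """
    ordered = sorted(sample)
    observed = ordered[n_censored:]
    limit = observed[0]  # the smallest observed value acts as the limit
    return observed, limit


# Three standard normal samples of 20 observations, the sample size used in the paper.
rng = random.Random(2015)  # arbitrary seed, for reproducibility only
for _ in range(3):
    sample = [rng.gauss(0.0, 1.0) for _ in range(20)]
    _, count = type_one_censor(sample, lod=-0.5)
    _, limit = type_two_censor(sample, n_censored=4)
    print(f"type I: {count} censored below -0.5 | type II: 4 censored, limit {limit:.3f}")
```

Across the three samples the type I count varies while the limit stays at -0.5, and the type II limit varies while the count stays at four.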