Tests of Normality and Other Goodness-Of-Fit Tests
NESUG '91 Proceedings, Statistics, p. 273

Ralph B. D'Agostino, Albert J. Belanger, and Ralph B. D'Agostino Jr.
Boston University Mathematics Department, Statistics and Consulting Unit

Probability plots and goodness-of-fit tests are useful tools in determining the underlying distribution of a population (D'Agostino and Stephens, 1986, chapter 2). Probability plotting is an informal procedure for describing data and for identifying deviations from the hypothesized distribution. Goodness-of-fit tests are formal procedures which can be used to test for specific hypothesized distributions. We present SAS macros for creating probability plots for six distributions: the uniform, normal, lognormal, logistic, Weibull, and exponential. In addition, these macros compute the skewness ($\sqrt{b_1}$), kurtosis ($b_2$), and D'Agostino-Pearson $K^2$ statistics for testing whether the underlying distribution is normal (or lognormal), and the Anderson-Darling (EDF) $A^2$ statistic for testing for the normal, lognormal, and exponential distributions. The latter can be modified for general distributions.

PROBABILITY PLOTS

Say we desire to investigate whether the underlying cumulative distribution of a population is $F(x)$, where this distribution depends upon a location parameter $\mu$ and a scale parameter $\sigma$, not necessarily the mean and standard deviation. Further, let

$$F(x) = G\left(\frac{x-\mu}{\sigma}\right) = G(z),$$

where $z = (x-\mu)/\sigma$. Further, say we have a random sample of size $n$ with ordered observations $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$. A probability plot is a plot of

$$X_{(i)} \quad \text{on} \quad z_i = G^{-1}\left(F_n(X_{(i)})\right) = G^{-1}(p_i),$$

where $G^{-1}(\cdot)$ is the inverse transformation of the standardized distribution of the population (the hypothesized distribution) under consideration and $F_n(\cdot)$ is the empirical cumulative, defined here as

$$F_n(X_{(i)}) = p_i = \frac{i - 0.5}{n} \qquad (1)$$

(see D'Agostino and Stephens, 1986, p. 34). In our plots we place the data $X_{(i)}$ on the vertical axis and the standardized deviates $z_i$ on the horizontal axis. In Table 1 we list the formulas for the distributions we plot in our macro. If the underlying distribution is $F(x)$, the resulting plot will be approximately a straight line.

TABLE 1
Plotting formulas for the six distributions plotted in our macro ($p_i = (i-0.5)/n$).

| Distribution | cdf $F(x)$ | Vertical axis | Horizontal axis |
|---|---|---|---|
| Uniform | $(x-\mu)/\sigma$ for $\mu < x < \mu+\sigma$ | $X_{(i)}$ | $p_i$ |
| Normal | $\Phi\left((x-\mu)/\sigma\right)$ | $X_{(i)}$ | $\Phi^{-1}\left(\frac{i-3/8}{n+1/4}\right)$ |
| Lognormal | $\Phi\left((\ln x-\mu)/\sigma\right)$ | $\ln X_{(i)}$ | $\Phi^{-1}\left(\frac{i-3/8}{n+1/4}\right)$ |
| Weibull | $1-\exp\{-(x/\theta)^{k}\}$ | $\ln X_{(i)}$ | $\ln(-\ln(1-p_i))$ |
| Logistic | $\left[1+\exp\{-\pi(x-\mu)/(\sigma\sqrt{3})\}\right]^{-1}$ | $X_{(i)}$ | $\ln\left(p_i/(1-p_i)\right)$ |
| Exponential | $1-\exp(-x/\theta)$ | $X_{(i)}$ | $-\ln(1-p_i)$ |

The macro PROBPLOT takes the data set as input and produces probability plots for the six distributions mentioned above. We use the rank procedure to order the data and produce the ranks. When observations are equal we use the mean of their ranks (the ties=mean option), as in D'Agostino and Stephens (1986), chapter 2. Using these ranks, $i$, we compute $p_i = (i-0.5)/n$ and then apply the inverse transformations for the six distributions. Probability plots are then produced.

Since the normal probability plot is the most widely used, we describe it in detail now. This plot consists of the ordered observations on the vertical axis and the standard normal deviates on the horizontal axis. We use Blom's approximation when defining the normal cumulative in order to enhance the linearity of the plot. The plot is thus

$$X_{(i)} \quad \text{on} \quad Z = \Phi^{-1}\left(\frac{i-3/8}{n+1/4}\right),$$

where $X_{(i)}$ is the $i$th ordered observation from the ordered sample and $Z$ is such that

$$\frac{i-3/8}{n+1/4} = \int_{-\infty}^{Z} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz \quad \text{for } i = 1, \dots, n.$$

The figure in Appendix A contains a normal probability plot of sample data with the expected straight line going through the +'s on the graph. In programming the macro to create this plot we took advantage of two options in the Proc Rank procedure. The first was the "ties=mean" option, which assigns the mean rank when there are observations with the same value (see D'Agostino and Stephens, chapter 2, for further discussion). The second was the "normal=blom" option, which computes the Blom standardized normal scores automatically. The pagesize and linesize options allowed the plot to be larger, in particular wider, than the traditional Proc Univariate normal probability plot.

For the lognormal distribution we provide two plots, the first after taking logs of the raw data and the second after taking logs of (observed data - estimated lambda). Lambda corresponds to the third parameter of a three-parameter lognormal distribution whose density is

$$f(x) = \frac{1}{(x-\lambda)\,\sigma\sqrt{2\pi}} \exp\left\{-\frac{[\ln(x-\lambda)-\mu]^2}{2\sigma^2}\right\}, \quad x > \lambda.$$

Unless lambda is close to zero, the probability plot will not be a straight line for a lognormal distribution when one takes the logs of the data. The macro automatically produces both plots and outputs the estimated value of lambda (D'Agostino and Stephens, 1986, p. 53) so the user can decide which plot is more appropriate. If the raw data contain values less than or equal to zero, the macro automatically adds the absolute value of the minimum plus .01 to each value of the data set before calculating the logs of the data.

Probability plots will form approximately a straight line if the underlying distribution is the hypothesized distribution. Deviations from linearity help to determine properties of the underlying distribution, such as whether it is skewed and/or thick tailed. Other deviations may reflect the presence of outliers, mixtures in the data, or truncation (censoring) in the data. The reader is referred to D'Agostino and Stephens (1986), chapter 2, and D'Agostino, Belanger and D'Agostino (1990) for further details.

Probability plots are only informal techniques for evaluating the underlying distribution of data. Next we present several statistical tests which provide a more formal approach.

GOODNESS-OF-FIT TESTS

A population, or its random variable $X$, is said to be normally distributed if its density function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \quad -\infty < x < \infty.$$

Here $\mu$ and $\sigma$ are its mean and standard deviation, respectively. Of interest here are the third and fourth standardized moments, given by

$$\sqrt{\beta_1} = \frac{E(X-\mu)^3}{[E(X-\mu)^2]^{3/2}} = \frac{E(X-\mu)^3}{\sigma^3}$$

and

$$\beta_2 = \frac{E(X-\mu)^4}{[E(X-\mu)^2]^{2}} = \frac{E(X-\mu)^4}{\sigma^4},$$

where $E$ is the expected value operator. These moments measure skewness and kurtosis, respectively, and for the normal distribution they equal 0 and 3, respectively. A positive third moment corresponds to skewness to the right (i.e., a longer right tail) and a negative third moment corresponds to skewness to the left. Kurtosis (the word means curvature) is a measure of tail thickness. A kurtosis larger than 3 on a unimodal distribution indicates thicker or heavier tails than the normal distribution, while a kurtosis less than 3 on a unimodal distribution indicates lighter tails than the normal.

The sample estimates of these moments have been shown to be useful statistics for testing whether data are normally distributed (D'Agostino et al., 1990). For a sample of size $n$, $X_1, \dots, X_n$, the sample estimates of $\sqrt{\beta_1}$ and $\beta_2$ are, respectively,

$$\sqrt{b_1} = \frac{m_3}{m_2^{3/2}} \quad \text{and} \quad b_2 = \frac{m_4}{m_2^{2}},$$

where

$$m_k = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^k}{n}$$

and $\bar{X}$ is the sample mean, $\bar{X} = \sum X_i / n$.

Values near 0 (for the third moment) and near 3 (for the fourth moment) are consistent with a normally distributed underlying population. Their expected values under normality are 0 and $3(n-1)/(n+1)$, respectively. These statistics can be used to test formally whether the underlying distribution is normal (D'Agostino and Stephens, 1986, chapter 9). If they lead to rejecting the normal distribution, they automatically indicate the type of nonnormality present in the data. For instance, a negative third moment indicates that the data are negatively skewed, and a fourth moment greater than $3(n-1)/(n+1)$ indicates heavy tails in the population distribution. Thus, the signs and magnitudes of these statistics are both useful here.

We present tests for normality using these statistics in our macro, as well as an omnibus test using the $K^2$ statistic. (Omnibus here means that the test will detect deviations from normality due to either skewness or kurtosis.) Much of the programming for these tests involved finding the third and fourth moments using the output from SAS's Proc Univariate procedure. The skewness and kurtosis statistics calculated in the procedure are the Fisher g statistics, defined as

$$\text{skewness} = g_1 = \frac{n}{(n-1)(n-2)} \frac{\sum (X_i-\bar{X})^3}{s^3}$$

and

$$\text{kurtosis} = g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \frac{\sum (X_i-\bar{X})^4}{s^4} - \frac{3(n-1)^2}{(n-2)(n-3)},$$

where $s^2 = \sum (X_i-\bar{X})^2/(n-1)$ is the sample variance. These are related to $\sqrt{b_1}$ and $b_2$ via the following:

$$\sqrt{b_1} = \frac{n-2}{\sqrt{n(n-1)}}\, g_1$$

and

$$b_2 = \frac{(n-2)(n-3)}{(n+1)(n-1)}\, g_2 + \frac{3(n-1)}{n+1}.$$

Thus, once the statistics are transformed we can perform the normality tests.

The second type of formal tests we programmed into the macro are EDF (Empirical Distribution Function) tests. For a random sample of size $n$, with data $X_1, \dots, X_n$ and the order statistics defined as $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$, let the distribution of $X$ be $F(x)$. EDF statistics measure the difference between $F_n(x)$ and $F(x)$, where $F_n(x)$ is defined here by

$$F_n(x) = \frac{\text{number of observations} \le x}{n},$$

or, more precisely,

$$F_n(x) = \begin{cases} 0, & x < X_{(1)} \\ i/n, & X_{(i)} \le x < X_{(i+1)}, \quad i = 1, \dots, n-1 \\ 1, & X_{(n)} \le x. \end{cases}$$

Note that $F_n(x)$ here is defined differently than for the probability plots (formula (1)).

In our macro we used the Anderson-Darling (1954) $A^2$ statistic, which uses a quadratic measure of the discrepancy between $F_n(x)$ and $F(x)$. This test falls in the class given by the Cramer-von Mises family

$$Q = n \int_{-\infty}^{\infty} \{F_n(x) - F(x)\}^2\, \psi(x)\, dF(x),$$

where $\psi(x)$ is $[\{F(x)\}\{1-F(x)\}]^{-1}$. See D'Agostino and Stephens (1986), chapter 4, for an in-depth discussion of EDF statistics.
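As a rough illustration of the EDF approach, the sketch below evaluates the step function $F_n(x)$ and the Anderson-Darling $A^2$ statistic in Python rather than SAS. It assumes a fully specified hypothesized distribution $F$ (no estimated parameters) and relies on the standard computing formula $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln z_{(i)} + \ln(1 - z_{(n+1-i)})\right]$ with $z_{(i)} = F(X_{(i)})$, which is not quoted in the text above. The function names `edf` and `anderson_darling` are hypothetical and are not part of the authors' macro.

```python
import math
from statistics import NormalDist


def edf(sample, x):
    """Empirical distribution function: share of observations <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)


def anderson_darling(sample, cdf):
    """A^2 for a fully specified hypothesized cdf (assumption: no
    parameters are estimated from the sample).

    cdf must return values strictly between 0 and 1 for every
    observation, otherwise the logarithms are undefined.
    """
    # A cdf is nondecreasing, so sorting F(X_i) gives z_(1) <= ... <= z_(n).
    z = sorted(cdf(value) for value in sample)
    n = len(z)
    total = 0.0
    for i in range(1, n + 1):
        # z[i - 1] is z_(i); z[n - i] is z_(n + 1 - i).
        total += (2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
    return -n - total / n


if __name__ == "__main__":
    data = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
    print(edf(data, 0.2))
    print(anderson_darling(data, NormalDist().cdf))
```

When the mean and standard deviation are estimated from the data, the statistic is computed in the same way from the standardized observations, but its critical values differ; D'Agostino and Stephens (1986), chapter 4, cover those cases.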
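The relation between the Fisher g statistics and $\sqrt{b_1}$ and $b_2$ described in the goodness-of-fit section can be checked numerically. The following is a minimal sketch in Python, not the authors' SAS code: it computes the moment statistics directly, computes $g_1$ and $g_2$ from the usual unbiased-variance definitions, and converts the latter back. All function names are hypothetical, and the sample must contain at least four observations that are not all equal.

```python
import math


def moment_stats(x):
    """Return (sqrt_b1, b2) from the central moments m_k = sum((x - mean)^k) / n."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def fisher_g(x):
    """Return (g1, g2), using the sample standard deviation with divisor n - 1."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    sum3 = sum(((v - mean) / s) ** 3 for v in x)
    sum4 = sum(((v - mean) / s) ** 4 for v in x)
    g1 = n / ((n - 1) * (n - 2)) * sum3
    g2 = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum4
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return g1, g2


def g_to_b(g1, g2, n):
    """Convert Fisher g statistics to (sqrt_b1, b2)."""
    sqrt_b1 = (n - 2) / math.sqrt(n * (n - 1)) * g1
    b2 = (n - 2) * (n - 3) / ((n + 1) * (n - 1)) * g2 + 3 * (n - 1) / (n + 1)
    return sqrt_b1, b2


if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 15.0]
    print(moment_stats(data))
    print(g_to_b(*fisher_g(data), len(data)))  # should agree with the line above
```

Both routes should give the same pair up to floating-point rounding, which is a convenient sanity check before the values are passed to the normality tests.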
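Returning to the normal probability plot, the plotting coordinates can be sketched outside SAS as well. The Python below is an illustration under stated assumptions, not a port of PROBPLOT: it assigns mean ranks to tied values and applies Blom's formula $\Phi^{-1}((i-3/8)/(n+1/4))$ to those ranks, which mirrors what the text says the ties=mean and normal=blom options of Proc Rank provide. The helper names `mean_ranks`, `blom_scores`, and `normal_plot_points` are hypothetical.

```python
from statistics import NormalDist


def mean_ranks(data):
    """Ranks 1..n in the original order; tied values share the mean of their ranks."""
    order = sorted(range(len(data)), key=lambda j: data[j])
    ranks = [0.0] * len(data)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and data[order[end + 1]] == data[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def blom_scores(data):
    """Blom normal scores, one per observation, in the original order."""
    n = len(data)
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in mean_ranks(data)]


def normal_plot_points(data):
    """(z, x) pairs sorted by x: z goes on the horizontal axis, x on the vertical."""
    return sorted(zip(blom_scores(data), data), key=lambda pair: pair[1])


if __name__ == "__main__":
    for z, x in normal_plot_points([12.1, 9.8, 10.4, 10.4, 15.0]):
        print(f"{z:8.3f} {x:8.2f}")
```

If the points fall close to a straight line, the sample is consistent with a normal distribution; curvature or isolated points suggest the kinds of departures the text describes.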