Statistics 2 7 3

TESTS OF NORMALITY AND OTHER GOODNESS-OF-FIT TESTS

Ralph B. D'Agostino, Albert J. Belanger, and Ralph B. D'Agostino Jr.

Boston University Mathematics Department, Statistics and Consulting Unit

Probability plots and goodness-of-fit tests are useful tools in determining the underlying distribution of a population (D'Agostino and Stephens, 1986, chapter 2). Probability plotting is an informal procedure for describing data and for identifying deviations from the hypothesized distribution. Goodness-of-fit tests are formal procedures which can be used to test for specific hypothesized distributions. We present SAS macros for creating probability plots for six distributions: the uniform, normal, lognormal, logistic, Weibull, and exponential. In addition, these macros compute the $\sqrt{b_1}$, $b_2$, and D'Agostino-Pearson $K^2$ statistics for testing whether the underlying distribution is normal (or lognormal), and the Anderson-Darling (EDF) $A^2$ statistic for testing for the normal, lognormal, and exponential distributions. The latter can be modified for general distributions.

PROBABILITY PLOTS

Say we desire to investigate if the underlying cumulative distribution of a population is $F(x)$, where this distribution depends upon a location parameter $\mu$ and a scale parameter $\sigma$, not necessarily the mean and standard deviation. Further, let

$$F(x) = G\left(\frac{x-\mu}{\sigma}\right) = G(z)$$

where $z = (x-\mu)/\sigma$. Further, say we have a random sample of size $n$ with ordered observations $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$. A probability plot is a plot of $x_{(i)}$ on

$$z_i = G^{-1}(F_n(x_{(i)})) = G^{-1}(p_i)$$

where $G^{-1}(\cdot)$ is the inverse transformation of the standardized distribution of the population (hypothesized distribution) under consideration and $F_n(\cdot)$ is the empirical cumulative, defined here as

$$F_n(x_{(i)}) = p_i = \frac{i-0.5}{n} \qquad (1)$$

(see D'Agostino and Stephens, 1986, p. 34). In our plots we place the data ($x_{(i)}$) on the vertical axis and the standardized deviates, $z_i$, on the horizontal axis. In Table 1 we list the formulas for the distributions we use in our macro. If the underlying distribution is $F(x)$, the resulting plot will be approximately a straight line.

TABLE 1
Plotting Formulas for the six distributions plotted in our macro ($p_i = (i-0.5)/n$)

Distribution   cdf F(x)                                       Vertical axis    Horizontal axis
Uniform        $(x-\mu)/\sigma$ for $\mu < x < \mu+\sigma$    $x_{(i)}$        $p_i$
Normal         $\Phi((x-\mu)/\sigma)$                         $x_{(i)}$        $\Phi^{-1}((i-3/8)/(n+1/4))$
Lognormal      $\Phi((\ln x-\mu)/\sigma)$                     $\ln x_{(i)}$    $\Phi^{-1}((i-3/8)/(n+1/4))$
Weibull        $1-\exp(-(x/\sigma)^{m})$                      $\ln x_{(i)}$    $\ln(-\ln(1-p_i))$
Logistic       $[1+\exp(-(x-\mu)/\sigma)]^{-1}$               $x_{(i)}$        $\ln(p_i/(1-p_i))$
Exponential    $1-\exp(-x/\sigma)$                            $x_{(i)}$        $-\ln(1-p_i)$

The macro PROBPLOT takes as input the data set and produces probability plots for the six distributions mentioned above. We use the rank procedure to order the data and produce the ordered standardized ranks. When observations are equal we use the means of the ranks (ties=mean option), as in D'Agostino and Stephens, chapter 2. Using these ranks, $i$, we compute $p_i = (i-0.5)/n$ and then the inverse transformations for the six distributions. Probability plots are then produced.

Since the normal probability plot is the most widely used, we describe it in detail now. This plot consists of the ordered observations on the vertical axis and the standard normal deviates on the horizontal axis. We use Blom's approximation when defining the normal cumulative in order to enhance the linearity of the plot.
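As a rough illustration of the plotting positions described above, the following sketch computes mean ranks for tied observations, the positions $p_i = (i-0.5)/n$, Blom's normal scores, and the horizontal-axis deviates $G^{-1}(p_i)$ for the distributions whose inverse cdf has a closed form. It is written in Python rather than SAS purely so that it can be run without a SAS installation; the function names are hypothetical and are not part of the PROBPLOT macro.

```python
import math
from statistics import NormalDist


def mean_ranks(values):
    """Ranks 1..n, with tied values sharing the mean of their ranks
    (the behaviour the paper obtains from the ties=mean option)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2.0 + 1.0  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def plotting_positions(values):
    """p_i = (i - 0.5) / n, formula (1), using mean ranks for i."""
    n = len(values)
    return [(r - 0.5) / n for r in mean_ranks(values)]


def blom_scores(values):
    """Blom's approximation: inverse normal cdf of (i - 3/8) / (n + 1/4)."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in mean_ranks(values)]


def horizontal_deviates(p):
    """Closed-form G^{-1}(p) for 0 < p < 1 (standardized forms only)."""
    return {
        "uniform": p,
        "logistic": math.log(p / (1.0 - p)),
        "weibull": math.log(-math.log(1.0 - p)),
        "exponential": -math.log(1.0 - p),
    }
```

For example, `plotting_positions([3, 1, 2, 2])` assigns the two tied values the shared rank 2.5 and hence the same position. The macro listing appears to scale the logistic deviate by $\sqrt{3}/\pi$, which changes only the slope of the plotted line, not its linearity.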

NESUG '91 Proceedings 2 7 4 Statistics

The plot is thus $x_{(i)}$ on

$$Z = \Phi^{-1}\left(\frac{i-3/8}{n+1/4}\right)$$

where $x_{(i)}$ is the ith ordered observation from the ordered sample and $Z$ is such that

$$\frac{i-3/8}{n+1/4} = \int_{-\infty}^{Z} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$$

for $i = 1, \dots, n$. The figure in Appendix A contains a normal probability plot of sample data with the expected straight line going through the +'s on the graph. In programming the macro to create this plot we took advantage of two options in the Proc Rank procedure. The first was the "ties=mean" option, which chooses the mean rank when there are observations with the same value (see D'Agostino and Stephens, chap. 2 for further discussion), and the second was the "normal=blom" option, which finds the standardized cumulative normal Blom rank automatically. The pagesize and linesize options allowed the axes to be longer and wider than in the traditional Proc Univariate normal probability plot.

For the lognormal distribution we provide two plots, the first after taking logs of the raw data and the second after taking the logs of (observed data - estimated lambda). Lambda corresponds to the third parameter of a three-parameter lognormal distribution whose density is

$$f(x) = \frac{1}{(x-\lambda)\,\sigma\sqrt{2\pi}} \exp\left\{-\frac{[\ln(x-\lambda)-\mu]^2}{2\sigma^2}\right\}, \qquad x > \lambda.$$

Unless lambda is close to zero, the probability plot will not be a straight line for a lognormal distribution when one takes the logs of the data. The macro automatically produces both plots and gives as output the estimated value of lambda (D'Agostino and Stephens, 1986, p. 53) so the user can decide which plot is more appropriate. If the raw data contain values less than or equal to zero, the macro automatically adds the absolute value of the minimum plus .01 to each value of the data set before calculating the logs of the data.

Probability plots will form approximately a straight line if the underlying distribution is the hypothesized distribution. Deviations from linearity help to determine properties of the underlying distribution, such as whether it is skewed and/or thick tailed. Other deviations may reflect the presence of outliers, mixtures in the data, or truncation (censoring) in the data. The reader is referred to D'Agostino and Stephens (1986), chapter 2, and D'Agostino, Belanger and D'Agostino (1990) for further details.

Probability plots are only informal techniques for evaluating the underlying distribution of data. Next we provide several statistical tests which provide a more formal approach.

GOODNESS-OF-FIT TESTS

A population, or its random variable $X$, is said to be normally distributed if its density function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \qquad -\infty < x < \infty.$$

Here $\mu$ and $\sigma$ are the mean and standard deviation, respectively, of it. Of interest here are the third and fourth standardized moments given by

$$\sqrt{\beta_1} = \frac{E(X-\mu)^3}{[E(X-\mu)^2]^{3/2}} = \frac{E(X-\mu)^3}{\sigma^3}$$

and

$$\beta_2 = \frac{E(X-\mu)^4}{[E(X-\mu)^2]^2} = \frac{E(X-\mu)^4}{\sigma^4}$$

where $E$ is the expected value operator. These moments measure skewness and kurtosis, respectively, and for the normal distribution they equal 0 and 3, respectively. A positive third moment corresponds to skewness to the right (i.e., a longer right tail) and a negative third moment corresponds to skewness to the left. Kurtosis (the word means curvature) is a measure of tail thickness. A kurtosis larger than 3 on a unimodal distribution indicates thicker or heavier tails than the normal distribution, while a kurtosis less than 3 on a unimodal distribution indicates lighter tails than the normal.

The sample estimates of these moments have been shown to be useful statistics to test whether data are normally distributed (D'Agostino et al., 1990). For a sample of size $n$, $X_1, \dots, X_n$, the sample estimates of $\sqrt{\beta_1}$ and $\beta_2$ are, respectively,

$$\sqrt{b_1} = \frac{m_3}{m_2^{3/2}} \qquad \text{and} \qquad b_2 = \frac{m_4}{m_2^2}$$

where

$$m_k = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^k$$

and $\bar{X} = \sum X_i / n$ is the sample mean.
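The sample skewness and kurtosis estimates can be computed directly from the central moments $m_k$. The following is a minimal Python sketch (Python is used only for illustration; the macro itself is SAS), with hypothetical function names, assuming the data are a plain list of numbers.

```python
def central_moment(xs, k):
    """m_k = (1/n) * sum((x_i - mean)^k)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n


def sqrt_b1(xs):
    """Sample skewness sqrt(b1) = m3 / m2^(3/2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def b2(xs):
    """Sample kurtosis b2 = m4 / m2^2."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2
```

A symmetric sample such as `[1, 2, 3, 4, 5]` gives a skewness of zero, while a sample with one large value, such as `[0, 0, 0, 4]`, gives a positive skewness, matching the interpretation of a longer right tail.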


Values of 0 (for the third moment) and 3 (for the fourth moment) would indicate that the underlying population of a data set is normally distributed. Their expected values under normality are 0 and $3(n-1)/(n+1)$, respectively. These statistics can be used to test formally whether the underlying distribution is normal (D'Agostino and Stephens, 1986, chapter 9). If they lead to rejecting the normal distribution, they automatically indicate the type of nonnormality present in the data. For instance, if the third moment is negative, this indicates that the data are negatively skewed, and if the fourth moment is greater than $3(n-1)/(n+1)$, this indicates heavy tails in the population distribution. Thus, the signs and magnitudes of these statistics are both useful here.

We present tests for normality using these statistics in our macro, as well as an omnibus test using the $K^2$ statistic. (Omnibus here means that the test will detect deviations from normality due to either skewness or kurtosis.) Much of the programming for these tests involved finding the third and fourth moments using the output from SAS's Proc Univariate procedure. The skewness and kurtosis statistics calculated in the procedure are the Fisher g statistics, defined as

$$\text{skewness} = g_1 = \frac{n}{(n-1)(n-2)} \sum \frac{(X_i-\bar{X})^3}{s^3}$$

and

$$\text{kurtosis} = g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum \frac{(X_i-\bar{X})^4}{s^4} - \frac{3(n-1)^2}{(n-2)(n-3)}$$

where

$$s^2 = \frac{\sum (X_i-\bar{X})^2}{n-1}$$

is the sample variance. These are related to $\sqrt{b_1}$ and $b_2$ via the following:

$$\sqrt{b_1} = \frac{n-2}{\sqrt{n(n-1)}}\, g_1$$

and

$$b_2 = \frac{(n-2)(n-3)}{(n+1)(n-1)}\, g_2 + \frac{3(n-1)}{n+1}.$$

Thus, once we have transformed the statistics we can perform the normality tests.

The second type of formal tests we programmed into the macro are EDF (Empirical Distribution Function) tests. For a random sample of size $n$, with data $X_1, \dots, X_n$ and the order statistics defined as $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$, let the distribution of $X$ be $F(x)$. EDF statistics measure the difference between $F_n(x)$ and $F(x)$, where $F_n(x)$ is defined here by

$$F_n(x) = \frac{\text{number of observations} \le x}{n};$$

more precisely,

$$F_n(x) = \begin{cases} 0, & x < X_{(1)} \\ i/n, & X_{(i)} \le x < X_{(i+1)},\ i = 1, \dots, n-1 \\ 1, & X_{(n)} \le x. \end{cases}$$

Note that $F_n(x)$ here is defined differently than for the probability plots (formula (1)).

In our macro we used the Anderson-Darling (1954) $A^2$ statistic, which uses a quadratic measure of discrepancy between $F_n(x)$ and $F(x)$. This test falls in the class given by the Cramer-von Mises family

$$Q = n \int_{-\infty}^{\infty} \{F_n(x) - F(x)\}^2\, \psi(x)\, dF(x)$$

where $\psi(x)$ is $[\{F(x)\}\{1-F(x)\}]^{-1}$. See D'Agostino and Stephens, chapter 4, for an in-depth discussion of EDF statistics. In order to compute the $A^2$ statistic, we used the computing formulas suggested in D'Agostino and Stephens (chap. 4).
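The relations between the Fisher g statistics and $\sqrt{b_1}$, $b_2$ given earlier in this section can be checked numerically. The Python sketch below (for illustration only; the macro performs this step in SAS) computes $g_1$ and $g_2$ from their definitions, applies the two conversion formulas, and can be compared against the direct moment ratios. All function names are hypothetical, and a sample of at least four observations is assumed so that the denominators are nonzero.

```python
import math


def fisher_g(xs):
    """Fisher g1 (skewness) and g2 (kurtosis) as defined in the text."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # sample std. dev.
    g1 = n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)
    g2 = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
          * sum(((x - mean) / s) ** 4 for x in xs)
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return g1, g2


def g_to_b(g1, g2, n):
    """Convert (g1, g2) to (sqrt(b1), b2) for a sample of size n."""
    sqrt_b1 = (n - 2) / math.sqrt(n * (n - 1)) * g1
    b2 = (n - 2) * (n - 3) / ((n + 1) * (n - 1)) * g2 + 3 * (n - 1) / (n + 1)
    return sqrt_b1, b2


def moment_ratios(xs):
    """Direct computation of sqrt(b1) = m3/m2^1.5 and b2 = m4/m2^2."""
    n = len(xs)
    mean = sum(xs) / n
    m2, m3, m4 = (sum((x - mean) ** k for x in xs) / n for k in (2, 3, 4))
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```

For any sample, `g_to_b(*fisher_g(xs), len(xs))` should agree with `moment_ratios(xs)` up to rounding error, which is a convenient check when porting the transformation to another language.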


Using the Probability Integral Transformation (PIT), $Z = F(X)$, we know that if $F(x)$ is the true distribution of $X$ then $Z$ will be uniformly distributed on [0,1]. We calculate values of $Z_{(i)} = F(X_{(i)})$, $i = 1, \dots, n$, from our sample $X_1, \dots, X_n$. $F_n(z)$, the EDF of the values of $Z$, is then found. Using these values we compute $A^2$ as follows:

$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} \left[(2i-1)\ln Z_{(i)} + (2n+1-2i)\ln(1-Z_{(i)})\right].$$

We compute this statistic for testing the normal, lognormal, and exponential distributions. For the [...] For the lognormal distribution we calculate the EDF statistic only without the $\lambda$ we computed in the probability plotting. We also present the $A^2$ statistics in both their unmodified and modified forms, where the modifications are made using suggestions by Stephens in D'Agostino and Stephens (chap. 4). The user could test for other distributions as well by modifying the macro so that it accepts the $Z_i$ values for the distribution of interest.

An example of the output from this macro is presented below. Table 2 contains systolic blood pressures from a sample of 67 subjects from the Framingham Heart Study. The data are presented in a stem-and-leaf plot, together with descriptive statistics.

From examining the seven probability plots computed, one can see that the data appear most likely to follow the lognormal distribution. In the normal probability plot the data seem to form two straight lines, one for z-values below zero and one for z-values above zero. This could be an indication of a mixture of two normals. The lognormal probability plot forms nearly a straight line; however, when we examine the lognormal plot with the estimate of lambda, the data form a straight line except at the lower tail. This should prompt the investigator to check the lowest point as a possible outlier. Finally, the other four plots (the uniform, logistic, exponential and Weibull) are all far from linear, indicating that these data probably do not follow any of these distributions.

When we look at what the goodness-of-fit tests produced, we see that the skewness test rejects normality (p=.01), while the kurtosis test does not (p=.59). The $K^2$ test, which combines the two, rejects normality at p=.03. The Anderson-Darling $A^2$ tests confirm these results, with the statistic to test for normality having a p value <.005. The test for the lognormal distribution is not rejected (p > .15).

Table 2
Systolic Blood Pressure Data from the Framingham Heart Study

Stem-and-leaf plot        Number
21  0                      1
20                         0
19  08                     2
18  0006                   4
17  003                    3
16  0008                   4
15  0046                   4
14  00124688               8
13  000002244689          12
12  000024467888          12
11  0000446889            10
10  046888                 6
 9  0                      1

The descriptive statistics are: sample size, n=67; mean=137.15; standard deviation=25.63; skewness=.768; kurtosis=3.08.

This work was funded by a grant from the National Heart, Lung and Blood Institute to R. B. D'Agostino (R01 HL 40423-03).

REFERENCES

D'Agostino, R.B., Belanger, A.J., and D'Agostino Jr., R.B. (1990), "A Suggestion for Using Powerful and Informative Tests of Normality," The American Statistician, 44, 316-321.

D'Agostino, R.B., and Stephens, M.A. (1986), Goodness-of-fit Techniques, New York: Marcel Dekker.
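The computing formula for $A^2$ described in the EDF discussion can be sketched as follows. This is a minimal Python illustration, not the macro's SAS code: the function names are hypothetical, the normal case is assumed to use the sample mean and the sample standard deviation (divisor $n-1$), and the modification factor $1 + .75/n + 2.25/n^2$ is the one that appears in the macro listing for the normal and lognormal cases.

```python
import math


def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def anderson_darling(z_values):
    """A^2 from PIT values Z_i = F(X_i), each strictly between 0 and 1.

    A^2 = -n - (1/n) * sum[(2i-1) ln Z_(i) + (2n+1-2i) ln(1 - Z_(i))],
    where Z_(1) <= ... <= Z_(n) are the sorted values.
    """
    z = sorted(z_values)
    n = len(z)
    total = 0.0
    for i, zi in enumerate(z, start=1):
        total += (2 * i - 1) * math.log(zi) + (2 * n + 1 - 2 * i) * math.log(1.0 - zi)
    return -n - total / n


def a2_normal(xs):
    """Unmodified and modified A^2 for normality, parameters estimated."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    a2 = anderson_darling([normal_cdf((x - mean) / sd) for x in xs])
    modified = a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)
    return a2, modified
```

To test another distribution, one would pass `anderson_darling` the values $F(X_i)$ for that distribution; the appropriate modification factor and critical values depend on the distribution and on which parameters are estimated (D'Agostino and Stephens, chap. 4).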

APPENDIX A: OUTPUT FROM THE MACRO

[Figure: line-printer probability plots produced by the macro, with the observed values on the vertical axis and the Z-values on the horizontal axis. The only panel title that remains legible is the log-normal Z-value plot with lambda.]

APPENDIX B: THE MACRO

[Printed output interleaved with this listing in the source (continuation of Appendix A): SQRTB1 = 0.76821, Z = 2.56017, P = 0.0105; B2 = 3.08216, P = 0.5995; K**2 (chi-square, 2 DF), P = 0.0329; followed by the EDF tests using the Anderson-Darling A-squared statistic, with critical-value tables for the normal distribution (mean and variance unknown) and for the exponential distribution (origin known, and origin and scale unknown). The remaining values are not legible.]

%MACRO PROBPLOT(VAR,DATA);
DATA; SET &DATA; T=1; KEEP T &VAR;
PROC SORT; BY &VAR;
PROC RANK TIES=MEAN OUT=AA; VAR &VAR; RANKS ...;
PROC UNIVARIATE DATA=AA NORMAL PLOT; VAR &VAR;
  OUTPUT OUT=XXSTAT MIN=MIN N=N MEAN=XBAR STD=S P5=P5 P95=P95
         MEDIAN=MEDIAN SKEWNESS=G1 KURTOSIS=G2;

DATA XXSTAT; SET XXSTAT; T=1;
  LAMBDA=((P95*P5)-(MEDIAN*MEDIAN))/(P95+P5-(2*MEDIAN));
  ALPHA=MIN-1/N;
  BETA=N*(XBAR-MIN)/(N-1);
  DROP P95 P5 MEDIAN;
DATA AA; MERGE AA XXSTAT; BY T;
  LOGVARE=LOG(&VAR-LAMBDA);
  IF MIN>0 THEN LOGVAR=LOG(&VAR);
  ELSE DO;
    LOGVAR=LOG(&VAR+ABS(MIN)+.01);
    FILE PRINT;
    PUT 'WARNING SOME OF THE DATA HAS VALUES ...
  ...
PROC UNIVARIATE DATA=AA NOPRINT; VAR LOGVARE LOGVAR;
  OUTPUT OUT=BB MEAN=LXBARE LXBAR STD=LSE LS;
  ...
  LOGISTZ=(SQRT(3)/3.14159265)*LOG(PI/(1-PI));
  EXPONENZ=-LOG(1-PI);
  NORMALZI=PROBNORM((&VAR-XBAR)/S);
  LOGNORZI=PROBNORM((LOGVAR-LXBAR)/LS);
  EXPONEZI=1-EXP(-(&VAR/XBAR));
  EXPONMZI=1-EXP(-((&VAR-ALPHA)/BETA));
  NORMALAS=(1/N)*((2*_N_-1)*LOG( ...
      ... (2*N+1-2*_N_)*LOG(1-LOGNORZI));
  EXPONEAS=(1/N)*((2*_N_-1)*LOG(EXPONEZI)+
      (2*N+1-2*_N_)*LOG(1-EXPONEZI));
  EXPONMAS=(1/N)*((2*_N_-1)*LOG(EXPONMZI)+
      (2*N+1-2*_N_)*LOG(1-EXPONMZI));
PROC MEANS NOPRINT; VAR NORMALAS LOGNORAS EXPONEAS EXPONMAS;
  OUTPUT OUT=ANDERSON SUM=NASQUARE LASQUARE EASQUARE EMASQUAR N=N;
DATA ANDERSON; MERGE ANDERSON XXSTAT; BY N;
  NASQUARE=-N-NASQUARE;
  MASQUARE=NASQUARE*(1+(.75/N)+(2.25/(N*N)));
  LASQUARE=-N-LASQUARE;
  MLASQUAR=LASQUARE*(1+(.75/N)+(2.25/(N*N)));
  EASQUARE=-N-EASQUARE;
  MEASQUAR=EASQUARE*(1+(.6/N));
  EMASQUAR=-N-EMASQUAR;
  MMEASQUA=EMASQUAR*(1+(.6/N));
  DROP T _FREQ_ _TYPE_;
DATA; SET XXSTAT;
  DO _Z_=-1,0,1; _X_=XBAR+_Z_*S; OUTPUT; END;
  KEEP _X_ _Z_;
DATA; MERGE AA _LAST_;
PROC RANK TIES=MEAN NORMAL=BLOM OUT=AA;
  VAR &VAR LOGVAR LOGVARE;
  RANKS BLOMRANK LOGBLOM LOGEBLOM;
PROC PLOT NOLEGEND;
  LABEL BLOMRANK='NORMALIZED RANK'
        LOGBLOM='LOG-NORMAL RANK'
        LOGEBLOM="LOG-NORMAL RANK (LAMBDA=&LAM...)"
        &VAR='OBSERVED VALUE'
  ...
  ... MASQUARE 6.4 /
  @10 'EDF STATISTIC FOR THE NORMAL DISTRIBUTION ...' @60 NASQUARE 6.4 //
  @10 'EDF STATISTIC FOR THE LOG-NORMAL DISTRIBUTION (MODIFIED)' ... MLASQUAR 6.4 /
  @10 'EDF STATISTIC FOR THE LOG-NORMAL DISTRIBUTION ...
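The lambda estimate and the log shift for non-positive data used in the macro listing can be illustrated outside SAS. The sketch below is in Python for convenience and makes several assumptions: the 5th percentile, median, and 95th percentile are supplied by the caller (percentile definitions differ between packages), and the function names are hypothetical.

```python
import math


def lambda_estimate(p5, median, p95):
    """Third-parameter estimate as read from the macro's LAMBDA line:
    (P95*P5 - MEDIAN^2) / (P95 + P5 - 2*MEDIAN)."""
    denom = p95 + p5 - 2.0 * median
    if denom == 0:
        # Percentiles are symmetric about the median: estimate undefined.
        raise ValueError("percentiles are symmetric about the median")
    return (p95 * p5 - median ** 2) / denom


def log_values(xs):
    """Logs of the data; if any value is <= 0, first add |min| + .01
    to every value, as the text describes."""
    lowest = min(xs)
    shift = 0.0 if lowest > 0 else abs(lowest) + 0.01
    return [math.log(x + shift) for x in xs]
```

As a small check of the idea, percentiles 2, 4 and 10 give an estimate of 1, and the shifted values 1, 3 and 9 are equally spaced on the log scale, which is what a symmetric distribution of $\ln(x-\lambda)$ would require.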
