
USE OF PROC IML TO CALCULATE L-MOMENTS FOR THE UNIVARIATE DISTRIBUTIONAL SHAPE PARAMETERS SKEWNESS AND KURTOSIS

Michael A. Walega, Berlex Laboratories, Wayne, New Jersey

Introduction

Exploratory analysis statistics, such as those generated by the SAS procedure PROC UNIVARIATE (1990), are useful tools to characterize the underlying distribution of data prior to more rigorous statistical analyses. Assessment of the distributional shape of data is usually accomplished by careful examination of the values of the third and fourth central moments, skewness and kurtosis. However, when the sample size is small or the underlying distribution is non-normal, the information obtained from the sample skewness and kurtosis can be misleading.

One alternative to the central moment shape statistics is the use of linear combinations of order statistics (L-moments) to examine the distributional shape characteristics of data. L-moments have several theoretical advantages over the central moment shape statistics: characterization of a wider range of distributions, robustness to outliers and more accurate estimates in small sample sizes.

This paper focuses on the development of a macro program that uses SAS/IML (1989) to generate the central moment and L-moment distributional shape parameters. In addition, the results of simulations, conducted with various sample sizes and distributions, will be presented.

Background

Largely through the influence of John Tukey's work (1977), exploratory analysis of data prior to more formal statistical analyses (t-tests, ANOVA, etc.) has been increasingly emphasized. Tukey has suggested that to fully understand the nature of a variable and its measurement, characteristics other than central tendency and variability need to be examined. Many classical statistical tests rely on the assumption that the underlying distribution of the data (or residuals) is Gaussian. Bickel (1988) and Van Der Laan and Verdooren (1987) discuss the concept of robustness and how it pertains to the assumption of normality.

As discussed by Glass et al. (1972), incorrect conclusions may be reached when the normality assumption is not valid, especially when one-tail tests are employed or the sample size or significance level is very small. Hopkins and Weeks (1990) also discuss the effects of highly non-normal data on hypothesis testing. Thus, it is apparent that examination of the skewness (departure from symmetry) and kurtosis (deviation from a normal curve) is an important component of exploratory data analyses.

Various methods to estimate skewness and kurtosis have been proposed (MacGillivray and Balanda, 1988). For many years, the conventional coefficients of skewness and kurtosis, $\gamma$ and $\kappa$ (Hosking, 1990), have been used to describe the shape characteristics of distributions. However, as pointed out by Hosking (1990) and Royston (1992), these coefficients are not without limitations. Both are sensitive to minute changes in the tails of a distribution, susceptible to moderate outliers and biased in small to moderately-sized samples from skew distributions. Also, the information conveyed by the third and fourth central moments with regard to the shape of a distribution can be difficult to assess. Thus, it would be appropriate to determine if other, more robust measures of skewness and kurtosis can be used to assess the shape of a distribution.

L-moments

One more robust measure is the use of linear combinations of order statistics, or L-moments. In theory, L-moments are less prone to the effects of sampling variability as compared to conventional moments. Hosking (1990) provides an excellent

NESUG '93 Proceedings 574 Statistics

overview of the theory behind the derivation and application of L-moments to univariate probability distributions. Royston (1992) compares the properties of the conventional shape parameters to their L-moment counterparts for two lognormal distributions. The detailed theory behind L-moments is not discussed here; the reader is referred to the two aforementioned papers. Instead, a brief overview of the equations necessary to apply L-moments is given below. As with the paper by Royston, the notation of Hosking (1990) will be employed.

For the random variables $X_1, \ldots, X_n$ of sample size $n$ drawn from the distribution of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, let $X_{1:n} \le \ldots \le X_{n:n}$ be the order statistics such that the L-moments of $X$ are defined by

$$\lambda_r = r^{-1} \sum_{k=0}^{r-1} (-1)^k \binom{r-1}{k} E X_{r-k:r}, \qquad r = 1, 2, \ldots$$

where $\lambda_r$ is the $r$th L-moment of a distribution and $E X_{i:r}$ is the expected value of the $i$th smallest observation in a sample of size $r$.

The first four central moments of a random variable $X$ can be written as

$$\mu = E(X),$$
$$\sigma^2 = E(X - \mu)^2,$$
$$\gamma = E(X - \mu)^3 / \sigma^3 \quad \text{and}$$
$$\kappa = E(X - \mu)^4 / \sigma^4.$$

In a similar fashion, the first four L-moments of a random variable $X$ can be written as

$$\lambda_1 = E(X),$$
$$\lambda_2 = \tfrac{1}{2} E(X_{2:2} - X_{1:2}),$$
$$\lambda_3 = \tfrac{1}{3} E(X_{3:3} - 2X_{2:3} + X_{1:3}) \quad \text{and}$$
$$\lambda_4 = \tfrac{1}{4} E(X_{4:4} - 3X_{3:4} + 3X_{2:4} - X_{1:4}).$$

It can be seen that $\lambda_1$ is equivalent to the usual measure of central tendency, $\mu$. $\lambda_2$ is similar to $\sigma$ in that both measure the difference between two randomly selected values of $X$; however, by its nature $\sigma$ assigns more weight to extreme sample values than does $\lambda_2$. $\lambda_3$ is a scale-dependent measure of skewness for a sample of size 3, and $\lambda_4$ is proportional to a weighted difference between outer extremes and the central portion in samples of size 4 (Royston, 1992).

Scale-free versions of the L-moments for skewness, $\tau_3$, and kurtosis, $\tau_4$, can be written as

$$\tau_3 = \lambda_3 / \lambda_2 \quad \text{and} \quad \tau_4 = \lambda_4 / \lambda_2.$$

An alternative measure of skewness, $\tau_3'$, is defined as $(1 + \tau_3)/(1 - \tau_3)$. This measure is the ratio of the expected length of the upper tail to that of the lower tail in samples of size 3, and as such may be easier to interpret than $\tau_3$. $\lambda_2$, $\tau_3$, $\tau_3'$ and $\tau_4$ are subject to the constraints

$$\lambda_2 > 0, \quad -1 < \tau_3 < 1, \quad 0 < \tau_3' < \infty \quad \text{and} \quad \tfrac{1}{4}(5\tau_3^2 - 1) \le \tau_4 < 1.$$

If a random sample of size $n$ is drawn from a distribution of the random variable $X$ and $x_{1:n} \le \ldots \le x_{n:n}$ are the ordered sample values, then estimates of the L-moments $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, namely $l_1$, $l_2$, $l_3$ and $l_4$, can be calculated as follows. First, define $w_2$, $w_3$ and $w_4$ as

$$w_2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} (i-1)\, x_{i:n},$$
$$w_3 = \frac{1}{n(n-1)(n-2)} \sum_{i=1}^{n} (i-1)(i-2)\, x_{i:n} \quad \text{and}$$
$$w_4 = \frac{1}{n(n-1)(n-2)(n-3)} \sum_{i=1}^{n} (i-1)(i-2)(i-3)\, x_{i:n}.$$

Then the L-moments and the corresponding shape statistics can be estimated as

$$l_1 = \bar{x},$$
$$l_2 = 2w_2 - l_1,$$
$$l_3 = 6w_3 - 6w_2 + l_1 \quad \text{and}$$
$$l_4 = 20w_4 - 30w_3 + 12w_2 - l_1.$$
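As an illustration of the estimation steps in this section, the following is a minimal sketch written in Python rather than SAS/IML, so it is not the macro's own code. The function name `sample_l_moments` and the dictionary keys are assumptions made for this example. The relations between $w_2$, $w_3$, $w_4$ and $l_1, \ldots, l_4$ used here are the standard unbiased estimators described by Hosking (1990). At least four observations are required because $w_4$ divides by $n(n-1)(n-2)(n-3)$.

```python
def sample_l_moments(values):
    """Return sample L-moments l1..l4 and the shape statistics t3, t3', t4."""
    x = sorted(values)  # ordered sample x_{1:n} <= ... <= x_{n:n}
    n = len(x)
    if n < 4:
        raise ValueError("at least four observations are required")

    # Weighted sums of the order statistics; i is the rank, running from 1 to n.
    s2 = sum((i - 1) * xi for i, xi in enumerate(x, start=1))
    s3 = sum((i - 1) * (i - 2) * xi for i, xi in enumerate(x, start=1))
    s4 = sum((i - 1) * (i - 2) * (i - 3) * xi for i, xi in enumerate(x, start=1))
    w2 = s2 / (n * (n - 1))
    w3 = s3 / (n * (n - 1) * (n - 2))
    w4 = s4 / (n * (n - 1) * (n - 2) * (n - 3))

    # Sample L-moments.
    l1 = sum(x) / n
    l2 = 2 * w2 - l1
    l3 = 6 * w3 - 6 * w2 + l1
    l4 = 20 * w4 - 30 * w3 + 12 * w2 - l1
    if l2 == 0:
        raise ValueError("all observations are equal; the ratios are undefined")

    # Scale-free shape statistics.
    t3 = l3 / l2
    t4 = l4 / l2
    # t3' is the upper-to-lower tail ratio; it is unbounded when t3 reaches 1.
    t3_prime = (1 + t3) / (1 - t3) if t3 < 1 else float("inf")
    return {"l1": l1, "l2": l2, "l3": l3, "l4": l4,
            "t3": t3, "t3_prime": t3_prime, "t4": t4}


if __name__ == "__main__":
    # Evenly spaced values are symmetric, so t3 is zero and t3' is one.
    print(sample_l_moments([5, 3, 1, 4, 2]))
```

The function sorts its input, so the caller does not need to order the data first. Raising an error for fewer than four observations plays the role of the sample size flag that the macro sets in that situation.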


The sample L-moment ratios are

$$t_3 = l_3 / l_2 \quad \text{and} \quad t_4 = l_4 / l_2,$$

where $t_3$ and $t_4$ are the sample L-skewness and L-kurtosis, respectively. The sample estimate of the alternative measure of skewness, $t_3'$, is defined as $(1 + t_3)/(1 - t_3)$.

The Program

The macro program L_MOMENTS was written using SAS v6.08 under the VMS operating environment. With slight modification (detailed below), the program should run on any operating system. The user is required to provide the name of the SAS dataset (macro variable INDAT) to be used in the analyses and the name(s) of the variables (macro variable VARS), separated by spaces, to be analyzed. There is no limit as to the number of variables that can be analyzed. Options available to the user include:

· Specify the location of the SAS data library (macro variable LIB). Default is current user location.

· BY group processing (macro variable BYVAR). No limit on number of BY variables, delimited by a space. Default is no BY group processing.

· Generate stem-leaf, box and normal probability plots (macro variable PLOTS). Default is no plots.

· Generate a hardcopy of the usual PROC UNIVARIATE output, with the central moment and L-moment shape statistics appended (macro variable PRINT). Default is to have the output provided.

· Create an output dataset (Temporary or Permanent) that contains the central moment and L-moment shape statistics (macro variable OUT). Default is no output dataset is created.

A brief description of the flow of the program follows. A driver macro is used to initialize variables, search for the analysis data set, call a macro that outputs to the LMOMENTS.LOG file the options selected by the user, and call a macro that performs the calculations.

Two VMS-specific SAS functions, FINDFILE and FINDEND, are used to search for the analysis data set in a permanent SAS data library or in the SAS$WORK data library. Users on other operating systems may have equivalent functions that could be substituted. If no analysis data set is found, the program reports this error to the .LOG file and terminates. Otherwise, the calculation macro begins.

If BY variable processing is requested, the data are sorted before submission to PROC UNIVARIATE for analysis. PROC PRINTTO is used to capture the usual output and send it to the file 'UNI.DAT'. Also, an output data set from PROC UNIVARIATE is used to store the number of non-missing observations for each analysis.

If the user chooses to generate a hardcopy of the results, a DATA step is used to process the UNI.DAT file. The functions PUT and SUBSTR are used in conjunction with the $BINARY8. format to search for pagebreaks (the CC=CR option is specified in an OPTIONS statement) and set a flag that will be used to fire a PUT _PAGE_ in a DATA _NULL_ step at the end of the program. Next, a flag is set if BY variable box plots are created. For each page of output that does not contain BY variable box plots, a counter is incremented. The counter is used to facilitate direct read access of the shape statistics data set created by PROC IML for use in the DATA _NULL_ step that generates the hardcopy output. Next, that part of the output line that displays the values for skewness and kurtosis is removed. Finally, a flag is set that indicates the last line of the tabular portion of the PROC UNIVARIATE output.

For each analysis variable, the raw data are sorted, then merged and transposed. The output data set from PROC UNIVARIATE that contains the number of non-missing observations is also transposed. PROC IML is then used to calculate the central moments and L-moments for skewness and kurtosis.

Using the same method as PROC UNIVARIATE, the sample skewness and kurtosis are calculated. Then, conditional upon there being at least four non-missing observations, for each combination of BY variable and analysis variable the values for $w_1$, $w_2$, $w_3$ and $w_4$ are calculated and appended to an interim matrix. If this condition is not met, then the calculation of the L-moment parameters is not possible and a flag is set. Finally, the L-moment parameters are


calculated, concatenated with the BY variables (if present), the central moment parameters and the conditional flag described above and placed into a SAS data set. The names of the analysis variables are placed into a separate SAS data set.

Once the calculations have been completed, user-defined options direct the results to hardcopy output and/or a temporary or permanent SAS data set. If hardcopy output is requested, a DATA _NULL_ step writes the modified PROC UNIVARIATE output and, using the direct access counter previously described, places the shape statistics immediately below the last line of PROC UNIVARIATE tabular output. If the sample size flag generated by PROC IML has been fired, a '**' is printed for the L-moment output, with an appropriate footnote. If the user has requested that a temporary (OUT = T) or permanent (OUT = P) data set be created, then the two resultant data sets from PROC IML are merged and the data set is created as appropriate.

Simulations

Simulations were conducted to explore the applicability of the L-moment shape statistics to varying sample sizes and distributional shapes. For each of the following distributions, 5000 data sets were generated for sample sizes 5, 10, 15, 20, 30, 40, 50, 80, 125 and 250:

Logistic      y = a + k*log(x/(1-x)), where a = 0 and k = 1;

Gumbel        y = a - b(log(-log(x))), where a = 1 and b = 1;

Normal(0,1);

Exponential   y = a - b*log(1-x), where a = 1 and b = 1;

Lognormal     y = exp(a*x), where a = 0.5;

Lognormal     y = exp(a*x), where a = 1.

In the equations, x is a random normal(0,1) variate for the lognormal distributions; the logistic, Gumbel and exponential transforms are defined only for 0 < x < 1 and so presumably take a uniform(0,1) variate. The table below lists the theoretical values for the shape statistics for the above distributions (values for the central moments are taken from Hastings and Peacock, 1975; except for $\tau_4$ for the Lognormal distributions (Royston, 1992), values for the L-moments are taken from Hosking, 1990):

Distribution      $\gamma$   $\kappa$   $\tau_3$   $\tau_3'$   $\tau_4$
Logistic          0.0        4.2        0.0        1.0         0.167
Gumbel            1.137      5.4        0.170      1.410       0.150
Normal            0.0        3.0        0.0        1.0         0.123
Exponential       2.0        9.0        0.333      2.0         0.167
Lognormal(0.5)    1.75       8.90       0.241      1.635       0.169
Lognormal(1.0)    6.19       113.9      0.463      2.724       0.294

The method of Royston (1992) was used to quantify the results of the simulations. For the logistic and normal distributions, the mean absolute values of $t_3$ and $\gamma$ were plotted against sample size. Otherwise, the simulation mean of the 5000 values of each shape parameter was standardized by dividing it by the nominal, theoretical value and plotted against sample size. The results are illustrated graphically in Figures 1 through 6. The horizontal lines indicate either a nominal value of zero (for $t_3$ and $\gamma$ for the logistic and normal distributions) or one hundred percent.

It is apparent that, for the simulations conducted and independent of sample size or distributional shape, the L-moment shape statistics in general are less biased than the central moment shape statistics. As such, the L-moment shape statistics are much more useful indicators of the type of departure of a sample from normality (Royston, 1992).

Example

The usefulness of L-moment shape statistics becomes apparent when applied to the analysis of pharmacokinetic parameters. It has been suggested that many pharmacokinetic parameters follow a log-normal distribution. To examine this, data from Metzler and Huang (1983) will be used to calculate the central moment and L-moment shape statistics for the untransformed and log-transformed area under the plasma concentration-time curve data. Figures 7 and 8 present an example of the output produced using the macro call

    %LMOMENTS(INDAT=TEST, PLOTS=Y, VARS=AUC LOGAUC);

The shape statistics for the untransformed data suggest that the underlying distribution is positively skewed, with some evidence of kurtosis. Log-transformation of the data results in a closer approximation to normality. Note the disparity


between the central moment, $\kappa$, and L-moment, $t_4$, measures of kurtosis. This can probably be attributed to the poor small sample performance of $\kappa$ compared to $t_4$, and to the bias of $\kappa$ in non-normal distributions.

Discussion

The L-moment shape indices $t_3$, $t_3'$ and $t_4$ have several advantages over the usual shape statistics $\gamma$ and $\kappa$. Accurate characterization of several non-normal distributions, reasonable unbiasedness in small sample sizes, ease of interpretability and robustness to outliers make the L-moment shape statistics useful measures of the shape of a distribution. As shown in the example, the L-moment shape statistics could be useful indicators when transformation of data is required. A macro program was developed to include the calculation of the L-moment shape statistics with the central moment shape statistics in a hardcopy of PROC UNIVARIATE output, an output data set, or both.

SAS and SAS/IML are registered trademarks of the SAS Institute, Inc., Cary, NC.

VMS is a registered trademark of the Digital Equipment Corporation, Maynard, MA.

References

Bickel, P. Robust Estimation in S. Kotz and N. Johnson (eds.), the Encyclopedia of Statistical Sciences, John Wiley and Sons (1988), New York, NY, Volume 8, pp. 157-163.

Glass, G.V., Peckham, P.D. and J.R. Sanders. Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Rev. Educ. Res. 42:238-288, 1972.

Hastings, N.A.J. and J.B. Peacock. Statistical Distributions. John Wiley and Sons (1975), New York, NY.

Hopkins, K.D. and D.L. Weeks. Tests for normality and measures of skewness and kurtosis: Their place in research reporting. Educ. Psychol. Meas. 50:717-729, 1990.

Hosking, J.R.M. L-moments: Analysis and estimation of distributions using linear combinations of order statistics. J. Royal Stat. Soc. B 52:105-124, 1990.

MacGillivray, H.L. and K.P. Balanda. The relationships between skewness and kurtosis. Austral. J. Stat. 30:319-337, 1988.

Metzler, C.M. and D.C. Huang. Statistical methods for bioavailability and bioequivalence. Clin. Res. Pract. Drug Reg. Affairs, 1:109-132, 1983.

Royston, P. Which measures of skewness and kurtosis are best? Stat. Med. 11:333-343, 1992.

SAS Institute, Inc. SAS Language: Reference, Version 6, First Edition, Cary, NC: SAS Institute, Inc., 1990.

SAS Institute, Inc. SAS Procedures Guide, Version 6, Third Edition, Cary, NC: SAS Institute, Inc., 1990.

SAS Institute, Inc. SAS/IML Software: Usage and Reference, Version 6, First Edition, Cary, NC: SAS Institute, Inc., 1989.

Tukey, J.W. Exploratory Data Analysis. Addison-Wesley (1977), Reading, MA.

Van Der Laan, P. and L.R. Verdooren. Classical analysis of variance methods and nonparametric counterparts. Biom. J. 6:635-665, 1987.

The author can be reached at

Berlex Laboratories, Inc.
300 Fairfield Rd.
Wayne, New Jersey 07470

Phone: (201) 305-5336


[FIGURES 1 through 6: simulation results by distribution, plotted against Sample Size (log(10) scale). FIGURE 3: Normal Distribution. FIGURE 5: Lognormal(0.5) Distribution. FIGURE 6: Lognormal(1.0) Distribution.]
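The simulation design described in the Simulations section can be sketched as follows. This is a scaled-down Python illustration, not the SAS code used for the paper. It covers only the exponential distribution, generated with the transform y = a - b*log(1 - x) under the assumption that x is uniform on [0, 1). It uses 1000 replicates per sample size rather than 5000. The names `sample_t3`, `exponential_sample` and `mean_t3_percent`, and the seed, are assumptions of this example. The simulation mean of $t_3$ is expressed as a percentage of the theoretical value of one third (0.333 in the table), so a value near 100 indicates little bias.

```python
import math
import random


def sample_t3(values):
    """Sample L-skewness t3 = l3 / l2 computed from the ordered sample."""
    x = sorted(values)
    n = len(x)
    # Zero-based position i equals (rank - 1), so i * (i - 1) = (rank-1)(rank-2).
    w2 = sum(i * xi for i, xi in enumerate(x)) / (n * (n - 1))
    w3 = sum(i * (i - 1) * xi for i, xi in enumerate(x)) / (n * (n - 1) * (n - 2))
    l1 = sum(x) / n
    return (6 * w3 - 6 * w2 + l1) / (2 * w2 - l1)


def exponential_sample(n, rng, a=1.0, b=1.0):
    """Draw n values with y = a - b*log(1 - x), x uniform on [0, 1)."""
    return [a - b * math.log(1.0 - rng.random()) for _ in range(n)]


def mean_t3_percent(n, replicates, rng, theoretical_t3=1.0 / 3.0):
    """Simulation mean of t3 as a percentage of its theoretical value."""
    total = sum(sample_t3(exponential_sample(n, rng)) for _ in range(replicates))
    return 100.0 * (total / replicates) / theoretical_t3


if __name__ == "__main__":
    rng = random.Random(1993)  # arbitrary seed, used only for reproducibility
    for n in (5, 10, 15, 20, 30, 40, 50, 80, 125, 250):
        print(n, round(mean_t3_percent(n, 1000, rng), 1))
```

The other distributions could be added by swapping in their transforms and theoretical values from the table. For the logistic and normal cases the theoretical $\tau_3$ is zero, so the percentage form does not apply and the mean itself would be reported instead.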

[FIGURE 7: Univariate Procedure output for Variable=AUC (Moments, Quantiles (Def=5), Extremes, Stem Leaf, Boxplot and Normal Probability Plot), followed by the L-Moments statistics T3, T3' and T4, a note that ** indicates that L-moment statistics could not be computed, the constraints on T3, T3' and T4, the method T3' = (1+T3)/(1-T3), and the reference to Royston, P. Stat. Med. 11:333-343 (1992).]

[FIGURE 8: Univariate Procedure output for Variable=LOGAUC (Moments, Quantiles (Def=5), Extremes, Stem Leaf, Boxplot and Normal Probability Plot), followed by the L-Moments statistics T3, T3' and T4, with the same note, constraints, method and reference to Royston (1992) as in Figure 7.]
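The effect of a log transformation on the L-moment skewness statistics, as in the AUC example, can be illustrated with a small Python sketch. The data below are hypothetical values invented for this illustration; they are not the Metzler and Huang (1983) data. The function name `l_skewness` is also an assumption, and the code is not the macro's SAS/IML implementation.

```python
import math


def l_skewness(values):
    """Return (t3, t3') for a sample of size n >= 3 with l2 > 0 and t3 < 1."""
    x = sorted(values)
    n = len(x)
    # Zero-based position i equals (rank - 1), so i * (i - 1) = (rank-1)(rank-2).
    w2 = sum(i * xi for i, xi in enumerate(x)) / (n * (n - 1))
    w3 = sum(i * (i - 1) * xi for i, xi in enumerate(x)) / (n * (n - 1) * (n - 2))
    l1 = sum(x) / n
    l2 = 2 * w2 - l1
    l3 = 6 * w3 - 6 * w2 + l1
    t3 = l3 / l2
    return t3, (1 + t3) / (1 - t3)


# Hypothetical, positively skewed values invented for this illustration.
hypothetical_auc = [2.0, 2.6, 3.1, 3.9, 4.8, 6.0, 7.5, 9.4, 12.0, 16.0]

raw_t3, raw_t3_prime = l_skewness(hypothetical_auc)
log_t3, log_t3_prime = l_skewness([math.log(v) for v in hypothetical_auc])

print(f"raw: t3 = {raw_t3:.3f}, t3' = {raw_t3_prime:.3f}")
print(f"log: t3 = {log_t3:.3f}, t3' = {log_t3_prime:.3f}")
```

For these hypothetical values, $t_3$ moves toward zero and $t_3'$ toward one after the log transformation, which is the pattern expected when the transformed data are closer to symmetric.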