Statistics 573
USE OF PROC IML TO CALCULATE L-MOMENTS FOR THE UNIVARIATE DISTRIBUTIONAL SHAPE PARAMETERS SKEWNESS AND KURTOSIS
Michael A. Walega Berlex Laboratories, Wayne, New Jersey
Introduction
Exploratory data analysis statistics, such as those generated by the SAS procedure PROC UNIVARIATE (1990), are useful tools to characterize the underlying distribution of data prior to more rigorous statistical analyses. Assessment of the distributional shape of data is usually accomplished by careful examination of the values of the third and fourth central moments, skewness and kurtosis. However, when the sample size is small or the underlying distribution is non-normal, the information obtained from the sample skewness and kurtosis can be misleading.

One alternative to the central moment shape statistics is the use of linear combinations of order statistics (L-moments) to examine the distributional shape characteristics of data. L-moments have several theoretical advantages over the central moment shape statistics: characterization of a wider range of distributions, robustness to outliers and more accurate estimates in small sample sizes.

This paper focuses on the development of a macro program that uses SAS/IML (1989) to generate the central moment and L-moment distributional shape parameters. In addition, the results of simulations, conducted with various sample sizes and distributions, will be presented.

Background

Largely through the influence of John Tukey's work (1977), statisticians have increasingly emphasized the exploratory analysis of data prior to more formal statistical analyses (t-tests, ANOVA, etc.). Tukey has suggested that to fully understand the nature of a variable and its measurement, characteristics other than the central tendency (mean) and variability (standard deviation) need to be examined. Many classical statistical tests rely on the assumption that the underlying distribution of the data (or residuals) is Gaussian. Bickel (1988) and Van Der Laan and Verdooren (1987) discuss the concept of robustness and how it pertains to the assumption of normality.

As discussed by Glass et al. (1972), incorrect conclusions may be reached when the normality assumption is not valid, especially when one-tail tests are employed or the sample size or significance level are very small. Hopkins and Weeks (1990) also discuss the effects of highly non-normal data on hypothesis testing of variances. Thus, it is apparent that examination of the skewness (departure from symmetry) and kurtosis (deviation from a normal curve) is an important component of exploratory data analyses.

Various methods to estimate skewness and kurtosis have been proposed (MacGillivray and Balanda, 1988). For many years, the conventional coefficients of skewness and kurtosis, $\gamma$ and $\kappa$ (Hosking, 1990), have been used to describe the shape characteristics of distributions. However, as pointed out by Hosking (1990) and Royston (1992), these coefficients are not without limitations. Both are sensitive to minute changes in the tails of a distribution, susceptible to moderate outliers and biased in small to moderately sized samples from skew distributions. Also, the information conveyed by the third and fourth central moments with regards to the shape of a distribution can be difficult to assess. Thus, it would be appropriate to determine if other, more robust measures of skewness and kurtosis can be used to assess the shape of a distribution.

L-moments

One more robust measure is the use of linear combinations of order statistics, or L-moments. In theory, L-moments are less prone to the effects of sampling variability as compared to conventional moments. Hosking (1990) provides an excellent
NESUG '93 Proceedings 574 Statistics
overview of the theory behind the derivation and application of L-moments as summary statistics for univariate probability distributions. Royston (1992) compares the properties of the conventional shape parameters to their L-moment counterparts for two lognormal distributions. Rather than discuss the detailed theory behind L-moments, the reader is referred to the two aforementioned papers. Instead, a brief overview of the development of the equations necessary to apply L-moments is described below. As with the paper by Royston, the notation of Hosking (1990) will be employed.

For the random variables $X_1, \ldots, X_n$ of sample size $n$ drawn from the distribution of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, let $X_{1:n} \le \ldots \le X_{n:n}$ be the order statistics such that the L-moments of $X$ are defined by

$$\lambda_r = r^{-1} \sum_{k=0}^{r-1} (-1)^k \binom{r-1}{k} E X_{r-k:r}, \quad r = 1, 2, \ldots$$

where $\lambda_r$ is the $r$th L-moment of a distribution and $E X_{i:r}$ is the expected value of the $i$th smallest observation in a sample of size $r$.

The first four central moments of a random variable $X$ can be written as

$$\mu = E(X),$$
$$\sigma^2 = E(X - \mu)^2,$$
$$\gamma = E(X - \mu)^3 / \sigma^3 \text{ and}$$
$$\kappa = E(X - \mu)^4 / \sigma^4.$$

In a similar fashion, the first four L-moments of a random variable $X$ can be written as

$$\lambda_1 = E(X),$$
$$\lambda_2 = \tfrac{1}{2} E(X_{2:2} - X_{1:2}),$$
$$\lambda_3 = \tfrac{1}{3} E(X_{3:3} - 2X_{2:3} + X_{1:3}) \text{ and}$$
$$\lambda_4 = \tfrac{1}{4} E(X_{4:4} - 3X_{3:4} + 3X_{2:4} - X_{1:4}).$$

It can be seen that $\lambda_1$ is equivalent to the usual measure of central tendency, $\mu$. $\lambda_2$ is similar to $\sigma$ in that both measure the difference between two randomly selected values of $X$; however, by its nature $\sigma$ assigns more weight to extreme sample values than does $\lambda_2$. $\lambda_3$ is a scale-dependent measure of skewness for a sample of size 3, and $\lambda_4$ is proportional to a weighted difference between outer extremes and the central portion in samples of size 4 (Royston, 1992).

Scale-free versions of the L-moments for skewness, $\tau_3$, and kurtosis, $\tau_4$, can be written as

$$\tau_3 = \lambda_3 / \lambda_2 \text{ and } \tau_4 = \lambda_4 / \lambda_2.$$

An alternative measure of skewness, $\tau_3'$, is defined as $(1 + \tau_3) / (1 - \tau_3)$. This measure is the ratio of the expected length of the upper tail to that of the lower tail in samples of size 3, and as such may be easier to interpret than $\tau_3$. $\lambda_2$, $\tau_3$, $\tau_3'$ and $\tau_4$ are subject to the constraints

$$\lambda_2 > 0, \quad -1 < \tau_3 < 1, \quad 0 < \tau_3' < \infty \quad \text{and} \quad \tfrac{1}{4}(5\tau_3^2 - 1) \le \tau_4 < 1.$$

If a random sample of size $n$ is drawn from a distribution of the random variable $X$ and $x_{1:n} \le \ldots \le x_{n:n}$ are the ordered sample values then estimates of the L-moments $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, namely $l_1$, $l_2$, $l_3$ and $l_4$, can be calculated as follows. First, define $w_2$, $w_3$ and $w_4$ as

$$w_2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} (i-1)\, x_{i:n},$$
$$w_3 = \frac{1}{n(n-1)(n-2)} \sum_{i=1}^{n} (i-1)(i-2)\, x_{i:n} \text{ and}$$
$$w_4 = \frac{1}{n(n-1)(n-2)(n-3)} \sum_{i=1}^{n} (i-1)(i-2)(i-3)\, x_{i:n}.$$

Then the L-moments and the corresponding shape statistics can be estimated as
$$l_1 = w_1,$$
$$l_2 = 2w_2 - w_1,$$
$$l_3 = 6w_3 - 6w_2 + w_1,$$
$$l_4 = 20w_4 - 30w_3 + 12w_2 - w_1,$$

where $w_1$ is the sample mean, and

$$t_3 = l_3 / l_2 \text{ and } t_4 = l_4 / l_2.$$

$t_3$ and $t_4$ are the sample L-skewness and L-kurtosis, respectively. The sample estimate of the alternative measure of skewness, $t_3'$, is defined as $(1 + t_3) / (1 - t_3)$.

The Program

The macro program L_MOMENTS was written using SAS v6.08 under the VMS operating environment. With slight modification (detailed below), the program should run on any operating system. The user is required to provide the name of the SAS dataset (macro variable INDAT) to be used in the analyses and the name(s) of the variables (macro variable VARS), separated by spaces, to be analyzed. There is no limit as to the number of variables that can be analyzed. Options available to the user include:

· Specify the location of the SAS data library (macro variable LIB). Default is current user location.

· BY group processing (macro variable BYVAR). No limit on number of BY variables, delimited by a space. Default is no BY group processing.

· Generate stem-leaf, box and normal probability plots (macro variable PLOTS). Default is no plots.

· Generate a hardcopy of the usual PROC UNIVARIATE output, with the central moment and L-moment shape statistics appended (macro variable PRINT). Default is to have the output provided.

· Create an output dataset (Temporary or Permanent) that contains the central moment and L-moment shape statistics (macro variable OUT). Default is no output dataset is created.

A brief description of the flow of the program follows. A driver macro is used to initialize variables, search for the analysis data set, call a macro that outputs to the LMOMENTS.LOG file the options selected by the user, and call a macro that performs the calculations.

Two VMS-specific SAS functions, FINDFILE and FINDEND, are used to search for the analysis data set in a permanent SAS data library or in the SAS$WORK data library. Users on other operating systems may have equivalent functions that could be substituted. If no analysis data set is found, the program reports this error to the .LOG file and terminates. Otherwise, the calculation macro begins.

If BY variable processing is requested, the data are sorted before submission to PROC UNIVARIATE for analysis. PROC PRINTTO is used to capture the usual output and send it to the file 'UNI.DAT'. Also, an output data set from PROC UNIVARIATE is used to store the number of non-missing observations for each analysis.

If the user chooses to generate a hardcopy of the results, a DATA step is used to process the UNI.DAT file. The functions PUT and SUBSTR are used in conjunction with the $BINARY8. format to search for pagebreaks (the CC=CR option is specified in an OPTIONS statement) and set a flag that will be used to fire a PUT _PAGE_ in a DATA _NULL_ step at the end of the program. Next, a flag is set if BY variable box plots are created. For each page of output that does not contain BY variable box plots, a counter is incremented. The counter is used to facilitate direct read access of the shape statistics data set created by PROC IML for use in the DATA _NULL_ that generates the hardcopy output. Next, that part of the output line that displays the values for skewness and kurtosis is removed. Finally, a flag is set that indicates the last line of the tabular portion of the PROC UNIVARIATE output.

For each analysis variable, the raw data are sorted, then merged and transposed. The output data set from PROC UNIVARIATE that contains the number of non-missing observations is also transposed. PROC IML is then used to calculate the central moments and L-moments for skewness and kurtosis.

Using the same method as PROC UNIVARIATE, the sample skewness and kurtosis are calculated. Then, conditional upon there being at least four non-missing observations, for each combination of BY variable and analysis variable the values for $w_1$, $w_2$, $w_3$ and $w_4$ are calculated and appended to an interim matrix. If this condition is not met, then the calculation of the L-moment parameters is not possible and a flag is set. Finally, the L-moment parameters are
calculated, concatenated with the BY variables (if present), the central moment parameters and the conditional flag described above and placed into a SAS data set. The names of the analysis variables are placed into a separate SAS data set.

Once the calculations have been completed, user defined options direct the results to hardcopy output and/or a temporary or permanent SAS data set. If hardcopy output is requested, a DATA _NULL_ writes the modified PROC UNIVARIATE output and, using the direct access counter previously described, places the shape statistics immediately below the last line of PROC UNIVARIATE tabular output. If the sample size flag generated by PROC IML has been fired, a '**' is printed for the L-moment output, with an appropriate footnote. If the user has requested that a temporary (OUT = T) or permanent (OUT = P) data set be created, then the two resultant data sets from PROC IML are merged and the data set is created as appropriate.

Simulations

Simulations were conducted to explore the applicability of the L-moment shape statistics to varying sample sizes and distributional shapes. For each of the following distributions, 5000 data sets were generated for sample sizes 5, 10, 15, 20, 30, 40, 50, 80, 125 and 250:

Logistic       y = a + k*log(x/(1-x)), where a = 0 and k = 1;

Gumbel         y = a - b(log(-log(x))), where a = 1 and b = 1;

Normal(0,1)

Exponential    y = a - b*log(1-x), where a = 1 and b = 1;

Lognormal      y = exp(a*x), where a = 0.5;

Lognormal      y = exp(a*x), where a = 1.

In the equations, x is a random normal(0,1) variate. The table below lists the theoretical values for the shape statistics for the above distributions (values for the central moments are taken from Hastings and Peacock, 1975; except for $\tau_4$ for the Lognormal distributions (Royston, 1992), values for the L-moments are taken from Hosking, 1990):

Distribution      $\gamma$   $\kappa$   $\tau_3$   $\tau_3'$   $\tau_4$
Logistic          0.0        4.2        0.0        1.0         0.167
Gumbel            1.137      5.4        0.170      1.410       0.150
Normal            0.0        3.0        0.0        1.0         0.123
Exponential       2.0        9.0        0.333      2.0         0.167
Lognormal(0.5)    1.75       8.90       0.241      1.635       0.169
Lognormal(1.0)    6.19       113.9      0.463      2.724       0.294

The method of Royston (1992) was used to quantify the results of the simulations. For the logistic and normal distributions, the mean absolute values of $t_3$ and $\gamma$ were plotted against sample size. Otherwise, the simulation mean of the 5000 values of each shape parameter was standardized by dividing it by the nominal, theoretical value and plotted against sample size. The results are illustrated graphically in Figures 1 through 6. The horizontal lines indicate either a nominal value of zero (for $t_3$ and $\gamma$ for the logistic and normal distributions) or one hundred percent.

It is apparent that, for the simulations conducted and independent of sample size or distributional shape, the L-moment shape statistics in general are less biased than the central moment shape statistics. As such, the L-moment shape statistics are much more useful indicators of the type of departure of a sample from normality (Royston, 1992).

Example

The usefulness of L-moment shape statistics becomes apparent when applied to the analysis of pharmacokinetic parameters. It has been suggested that many pharmacokinetic parameters follow a log normal distribution. To examine this, data from Metzler and Huang (1983) will be used to calculate the central moment and L-moment shape statistics for the untransformed and log-transformed area under the plasma concentration-time curve data. Figures 7 and 8 present an example of the output produced using the macro call

%LMOMENTS(INDAT=TEST, PLOTS=Y, VARS=AUC LOGAUC);

The shape statistics for the untransformed data suggest that the underlying distribution is positively skewed, with some evidence of kurtosis. Log transformation of the data results in a closer approximation to normality. Note the disparity
between the central moment, $\kappa$, and L-moment, $\tau_4$, measures of kurtosis. This can probably be attributed to the poor small sample performance of $\kappa$ compared to $\tau_4$, and to the biasedness of $\kappa$ in non-normal distributions.

Discussion

The L-moment shape indices $t_3$, $t_3'$ and $t_4$ have several advantages over the usual shape statistics $\gamma$ and $\kappa$. Accurate characterization of several non-normal distributions, reasonably unbiased in small sample sizes, ease of interpretability and robustness to outliers make the L-moment shape statistics useful measures of the shape of a distribution. As shown in the example, the L-moment shape statistics could be useful indicators when transformation of data is required. A macro program was developed to include the calculation of the L-moment shape statistics with the central moment shape statistics in a hardcopy of PROC UNIVARIATE output, an output data set, or both.

SAS and SAS/IML are registered trademarks of the SAS Institute, Inc., Cary, NC.

VMS is a registered trademark of the Digital Equipment Corporation, Maynard, MA.

References

Bickel, P. Robust Estimation in S. Kotz and N. Johnson (eds.), the Encyclopedia of Statistical Sciences, John Wiley and Sons (1988), New York, NY, Volume 8, pp. 157-163.

Glass, G.V., Peckham, P.D. and J.R. Sanders. Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Rev. Educ. Res. 42:238-288, 1972.

Hastings, N.A.J. and J.B. Peacock. Statistical Distributions. John Wiley and Sons (1975), New York, NY.

Hopkins, K.D. and D.L. Weeks. Tests for normality and measures of skewness and kurtosis: Their place in research reporting. Educ. Psychol. Meas. 50:717-729, 1990.

Hosking, J.R.M. L-moments: Analysis and estimation of distributions using linear combinations of order statistics. J. Royal Stat. Soc. B 52:105-124, 1990.

MacGillivray, H.L. and K.P. Balanda. The relationships between skewness and kurtosis. Austral. J. Stat. 30:319-337, 1988.

Metzler, C.M. and D.C. Huang. Statistical methods for bioavailability and bioequivalence. Clin. Res. Pract. Drug Reg. Affairs, 1:109-132, 1983.

Royston, P. Which measures of skewness and kurtosis are best? Stat. Med. 11:333-343, 1992.

SAS Institute, Inc. SAS Language: Reference, Version 6, First Edition, Cary, NC: SAS Institute, Inc., 1990.

SAS Institute, Inc. SAS Procedures Guide, Version 6, Third Edition, Cary, NC: SAS Institute, Inc., 1990.

SAS Institute, Inc. SAS/IML Software: Usage and Reference, Version 6, First Edition, Cary, NC: SAS Institute, Inc., 1989.

Tukey, J.W. Exploratory Data Analysis. Addison Wesley (1977), Reading, MA.

Van Der Laan, P. and L.R. Verdooren. Classical analysis of variance methods and nonparametric counterparts. Biom. J. 6:635-665, 1987.

The author can be reached at

Berlex Laboratories, Inc.
300 Fairfield Rd.
Wayne, New Jersey 07470
Phone: (201) 305-5336
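The sample L-moment estimators described in the L-moments section can be sketched in a few lines. The paper's macro is written in SAS/IML and is not reproduced in the text, so the following is a Python illustration only, not the author's code. The function name `sample_l_moments` and the dictionary keys are hypothetical; the formulas are the weights $w_1$ through $w_4$ and the standard Hosking (1990) combinations, and the check for at least four observations mirrors the condition stated in the program description.

```python
def sample_l_moments(values):
    """Sample L-moments l1..l4 and shape statistics t3, t3', t4.

    Illustrative sketch; names are hypothetical, not from the macro.
    """
    x = sorted(float(v) for v in values)
    n = len(x)
    if n < 4:
        # The paper requires at least four non-missing observations.
        raise ValueError("at least four observations are required")

    # enumerate() is 0-based, so k plays the role of (i - 1) in the paper.
    w1 = sum(x) / n
    w2 = sum(k * xi for k, xi in enumerate(x)) / (n * (n - 1))
    w3 = sum(k * (k - 1) * xi for k, xi in enumerate(x)) / (
        n * (n - 1) * (n - 2))
    w4 = sum(k * (k - 1) * (k - 2) * xi for k, xi in enumerate(x)) / (
        n * (n - 1) * (n - 2) * (n - 3))

    l1 = w1
    l2 = 2 * w2 - w1
    l3 = 6 * w3 - 6 * w2 + w1
    l4 = 20 * w4 - 30 * w3 + 12 * w2 - w1
    if l2 == 0:
        # All observations equal: the ratios t3 and t4 are undefined.
        raise ValueError("l2 is zero; shape statistics are undefined")

    t3 = l3 / l2
    t4 = l4 / l2
    t3_alt = (1 + t3) / (1 - t3)  # alternative skewness measure t3'
    return {"l1": l1, "l2": l2, "l3": l3, "l4": l4,
            "t3": t3, "t3_alt": t3_alt, "t4": t4}


if __name__ == "__main__":
    print(sample_l_moments([1, 2, 3, 4]))
```

For the symmetric sample 1, 2, 3, 4 the function returns $t_3 = 0$ and $t_3' = 1$ (up to floating-point rounding), as expected for a sample with no skewness.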
FIGURE 1: Logistic Distribution
FIGURE 2: Gumbel Distribution
FIGURE 3: Normal Distribution
FIGURE 4: Exponential Distribution
FIGURE 5: Lognormal(0.5) Distribution
FIGURE 6: Lognormal(1.0) Distribution
[Figures 1 through 6: simulation results for the shape statistics plotted against Sample Size (log(10) scale). Plots not reproduced.]
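The standardization used for Figures 1 through 6 can be imitated on a small scale. The sketch below is a Python illustration, not the paper's SAS code, and rests on several assumptions: it covers only the Exponential case; it uses fewer replications than the paper's 5000; it takes x to be a uniform(0,1) variate, since the logarithm in y = a - b*log(1-x) requires 0 < x < 1 even though the text describes x as normal(0,1); and the conventional skewness uses the bias-adjusted form n/((n-1)(n-2)) times the sum of cubed standardized deviations, which is assumed here to correspond to the PROC UNIVARIATE method. All function names are hypothetical.

```python
import math
import random


def sample_t3(values):
    """Sample L-skewness t3 = l3 / l2 from the weights w1, w2, w3."""
    x = sorted(values)
    n = len(x)
    w1 = sum(x) / n
    w2 = sum(k * xi for k, xi in enumerate(x)) / (n * (n - 1))
    w3 = sum(k * (k - 1) * xi for k, xi in enumerate(x)) / (
        n * (n - 1) * (n - 2))
    return (6 * w3 - 6 * w2 + w1) / (2 * w2 - w1)


def sample_skewness(values):
    """Bias-adjusted conventional skewness (assumed formula, see lead-in)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)


def standardized_means(n, reps=2000, seed=1993):
    """Simulation means of t3 and skewness as a percent of nominal values.

    Nominal values for the Exponential distribution are taken from the
    paper's table: tau3 = 0.333 (1/3) and gamma = 2.0.
    """
    rng = random.Random(seed)
    t3_sum = 0.0
    g_sum = 0.0
    for _ in range(reps):
        # y = a - b*log(1 - x) with a = 1, b = 1 and x uniform on [0, 1).
        sample = [1.0 - math.log(1.0 - rng.random()) for _ in range(n)]
        t3_sum += sample_t3(sample)
        g_sum += sample_skewness(sample)
    t3_pct = 100.0 * (t3_sum / reps) / (1.0 / 3.0)
    g_pct = 100.0 * (g_sum / reps) / 2.0
    return t3_pct, g_pct


if __name__ == "__main__":
    for size in (5, 10, 20, 50):
        print(size, standardized_means(size))
```

A value near one hundred percent indicates little bias; comparing the two percentages across sample sizes gives the kind of picture the figures are described as showing.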
FIGURE 7: Univariate Procedure output for Variable=AUC (Moments, Quantiles (Def=5), Extremes, stem-and-leaf plot, boxplot and normal probability plot), with the shape statistics from the usual method (Skewness, Kurtosis) and the L-moments (T3, T3', T4) appended. Output footnotes: T3' = (1+T3)/(1-T3); the L-moment statistics are subject to constraints; '**' indicates that L-moment statistics could not be computed. REF: Royston, P. Stat. Med. 11:333-343 (1992). [Output listing not reproduced.]
FIGURE 8: Univariate Procedure output for Variable=LOGAUC (Moments, Quantiles (Def=5), Extremes, stem-and-leaf plot, boxplot and normal probability plot), with the shape statistics from the usual method (Skewness, Kurtosis) and the L-moments (T3, T3', T4) appended. Output footnotes: T3' = (1+T3)/(1-T3); the L-moment statistics are subject to constraints; '**' indicates that L-moment statistics could not be computed. REF: Royston, P. Stat. Med. 11:333-343 (1992). [Output listing not reproduced.]
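The effect of a log transformation on the L-skewness, as contrasted in Figures 7 and 8, can be illustrated with synthetic data. The sketch below does not use the Metzler and Huang (1983) values, which are not listed in the text. It assumes a hypothetical sample of 20 values built from evenly spaced normal quantiles and exponentiated, so that the raw values are lognormal-like and their logarithms are symmetric. It is written in Python rather than SAS/IML, and the function name `l_skewness` is hypothetical.

```python
import math
from statistics import NormalDist


def l_skewness(values):
    """Return (t3, t3') using the weights w1, w2, w3 defined in the paper."""
    x = sorted(values)
    n = len(x)
    w1 = sum(x) / n
    w2 = sum(k * xi for k, xi in enumerate(x)) / (n * (n - 1))
    w3 = sum(k * (k - 1) * xi for k, xi in enumerate(x)) / (
        n * (n - 1) * (n - 2))
    t3 = (6 * w3 - 6 * w2 + w1) / (2 * w2 - w1)
    return t3, (1 + t3) / (1 - t3)


# Hypothetical data: 20 normal quantiles at plotting positions (i - 0.5)/n.
n = 20
z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
raw = [math.exp(v) for v in z]        # lognormal-like, right skewed
logged = [math.log(v) for v in raw]   # symmetric after transformation

raw_t3, raw_t3_alt = l_skewness(raw)
log_t3, log_t3_alt = l_skewness(logged)

if __name__ == "__main__":
    print("untransformed:   t3 = %.3f, t3' = %.3f" % (raw_t3, raw_t3_alt))
    print("log-transformed: t3 = %.3f, t3' = %.3f" % (log_t3, log_t3_alt))
```

For the untransformed values $t_3$ is clearly positive and $t_3'$ exceeds one, while after the log transformation $t_3$ is essentially zero and $t_3'$ essentially one, which is the pattern the Example section describes for a closer approximation to normality.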