USE OF PROC IML TO CALCULATE L-MOMENTS FOR THE UNIVARIATE DISTRIBUTIONAL SHAPE PARAMETERS SKEWNESS AND KURTOSIS

Michael A. Walega, Covance, Princeton, New Jersey

Introduction

Exploratory data analysis statistics, such as those generated by the SAS® procedure PROC UNIVARIATE (1990), are useful tools to characterize the underlying distribution of data prior to more rigorous statistical analyses. Assessment of the distributional shape of data is usually accomplished by careful examination of the values of the third and fourth central moments, skewness and kurtosis. However, when the sample size is small or the underlying distribution is non-normal, the information obtained from the sample skewness and kurtosis can be misleading.

One alternative to the central moment shape statistics is the use of linear combinations of order statistics (L-moments) to examine the distributional shape characteristics of data. L-moments have several theoretical advantages over the central moment shape statistics: characterization of a wider range of distributions, robustness to outliers and more accurate estimates in small samples.

This paper focuses on the development of a macro program that uses SAS/IML® (1989) to generate the central moment and L-moment distributional shape parameters. In addition, the results of simulations, conducted with various sample sizes and distributions, will be presented.

Background

Largely through the influence of John Tukey's work (1977), statisticians have increasingly emphasized the exploratory analysis of data prior to more formal statistical analyses (t-tests, ANOVA, etc.). Tukey has suggested that to fully understand the nature of a variable and its measurement, characteristics other than the central tendency (mean) and variability (standard deviation) need to be examined. Many classical statistical tests rely on the assumption that the underlying distribution of the data (or residuals) is Gaussian. Bickel (1988) and Van Der Laan and Verdooren (1987) discuss the concept of robustness and how it pertains to the assumption of normality. As discussed by Glass et al. (1972), incorrect conclusions may be reached when the normality assumption is not valid, especially when one-tail tests are employed or the sample size or significance level is very small. Hopkins and Weeks (1990) also discuss the effects of highly non-normal data on hypothesis testing of variances. Thus, the examination of the skewness (departure from symmetry) and kurtosis (deviation from a normal curve) is an important component of exploratory data analyses.

Various methods to estimate skewness and kurtosis have been proposed (MacGillivray and Balanda, 1988). For many years, the conventional coefficients of skewness and kurtosis, ϒ and κ (Hosking, 1990), have been used to describe the shape characteristics of distributions. However, as pointed out by Hosking (1990) and Royston (1992), these coefficients are not without limitations. Both are sensitive to minute changes in the tails of a distribution, susceptible to moderate outliers and biased in small to moderately-sized samples from skew distributions. Also, the information conveyed by the third and fourth central moments with regard to the shape of a distribution can be difficult to assess. Thus, it would be appropriate to determine if other, more robust measures of skewness and kurtosis can be used to assess the shape of a distribution.

L-moments

One more robust class of measures is linear combinations of order statistics, or L-moments. In theory, L-moments are less prone to the effects of sampling variability than conventional moments. Hosking (1990) provides an excellent overview of the theory behind the derivation and application of L-moments as summary statistics for univariate probability distributions. Royston (1992) compares the properties of the conventional shape parameters to their L-moment counterparts for two lognormal distributions.
The detailed theory behind L-moments is not discussed here; the reader is referred to the two aforementioned papers. Instead, a brief overview of the development of the equations necessary to apply L-moments is given below. As with the paper by Royston, the notation of Hosking (1990) will be employed.

For the random variables $X_1, \ldots, X_n$ of a sample of size $n$ drawn from the distribution of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, let $X_{1:n} \le \cdots \le X_{n:n}$ be the order statistics. The L-moments of $X$ are defined by

$$\lambda_r \equiv r^{-1} \sum_{k=0}^{r-1} (-1)^k \binom{r-1}{k} E X_{r-k:r}, \qquad r = 1, 2, \ldots,$$

where $\lambda_r$ is the $r$th L-moment of a distribution and $E X_{i:r}$ is the expected value of the $i$th smallest observation in a sample of size $r$.

The first four central moments of a random variable $X$ can be written as

$$\mu = E(X),$$
$$\sigma^2 = E(X - \mu)^2,$$
$$\Upsilon = E(X - \mu)^3 / \sigma^3 \quad \text{and}$$
$$\kappa = E(X - \mu)^4 / \sigma^4.$$

In a similar fashion, the first four L-moments of a random variable $X$ can be written as

$$\lambda_1 = E(X),$$
$$\lambda_2 = \tfrac{1}{2} E(X_{2:2} - X_{1:2}),$$
$$\lambda_3 = \tfrac{1}{3} E(X_{3:3} - 2X_{2:3} + X_{1:3}) \quad \text{and}$$
$$\lambda_4 = \tfrac{1}{4} E(X_{4:4} - 3X_{3:4} + 3X_{2:4} - X_{1:4}).$$

It can be seen that $\lambda_1$ is equivalent to the usual measure of central tendency, $\mu$. $\lambda_2$ is similar to $\sigma^2$ in that both measure the difference between two randomly selected values of $X$; however, by its nature $\sigma^2$ assigns more weight to extreme sample values than does $\lambda_2$. $\lambda_3$ is a scale-dependent measure of skewness for a sample of size 3, and $\lambda_4$ is proportional to a weighted difference between the outer extremes and the central portion of samples of size 4 (Royston, 1992).

Scale-free versions of the L-moments for skewness, $\tau_3$, and kurtosis, $\tau_4$, can be written as

$$\tau_3 = \lambda_3 / \lambda_2 \quad \text{and} \quad \tau_4 = \lambda_4 / \lambda_2.$$

An alternative measure of skewness, $\tau'_3$, is defined as $(1 + \tau_3) / (1 - \tau_3)$. This measure is the ratio of the expected length of the upper tail to that of the lower tail in samples of size 3, and as such may be easier to interpret than $\tau_3$. $\lambda_2$, $\tau_3$, $\tau'_3$ and $\tau_4$ are subject to the constraints

$$\lambda_2 > 0, \quad -1 < \tau_3 < 1, \quad 0 < \tau'_3 < \infty \quad \text{and} \quad \tfrac{1}{4}(5\tau_3^2 - 1) \le \tau_4 < 1.$$

If a random sample of size $n$ is drawn from a distribution of the random variable $X$ and $x_{1:n} \le \cdots \le x_{n:n}$ are the ordered sample values, then estimates of the L-moments $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, namely $l_1$, $l_2$, $l_3$ and $l_4$, can be calculated as follows. First, define $w_2$, $w_3$ and $w_4$ as

$$w_2 = \frac{1}{n(n-1)} \sum_{i=2}^{n} (i-1)\, x_{i:n},$$
$$w_3 = \frac{1}{n(n-1)(n-2)} \sum_{i=3}^{n} (i-1)(i-2)\, x_{i:n} \quad \text{and}$$
$$w_4 = \frac{1}{n(n-1)(n-2)(n-3)} \sum_{i=4}^{n} (i-1)(i-2)(i-3)\, x_{i:n}.$$

Then the L-moments and the corresponding shape statistics can be estimated as

$$l_1 = \sum x_i / n,$$
$$l_2 = 2w_2 - l_1,$$
$$l_3 = 6w_3 - 6w_2 + l_1,$$
$$l_4 = 20w_4 - 30w_3 + 12w_2 - l_1,$$
$$t_3 = l_3 / l_2 \quad \text{and} \quad t_4 = l_4 / l_2.$$

$t_3$ and $t_4$ are the sample L-skewness and L-kurtosis, respectively. The sample estimate of the alternative measure of skewness, $t'_3$, is defined as $(1 + t_3) / (1 - t_3)$.

The Program

The macro program L_MOMENTS was originally written using SAS® v6.08 under the VMS operating environment. It has been modified to run under V6.09 and V6.12 on HP-UNIX®. With slight modification (detailed below), the program should run on any operating system. The user is required to provide the name of the SAS data set (macro variable INDS) to be used in the analyses and the name(s) of the variables (macro variable VARS), separated by spaces, to be analyzed. There is no limit as to the number of variables that can be analyzed. Options available to the user include:

• Specify the location of the SAS data library (macro variable LIB). Default is current user location.

• BY group processing (macro variable BYVAR). There is no limit on the number of BY variables, which are delimited by a space. Default is no BY group processing.

• Generate stem-leaf, box and normal probability plots (macro variable PLOTS). Default is no plots.

• Generate a hardcopy of the usual PROC UNIVARIATE output, with the central moment and L-moment shape statistics appended (macro variable PRINT). Default is to have the output provided.

• Create an output data set (temporary or permanent) that contains the central moment and L-moment shape statistics (macro variable OUT). Default is that no output data set is created.

A brief description of the flow of the program follows. A driver macro is used to initialize variables, search for the analysis data set, call a macro that outputs to the LMOMENTS.LOG file the options selected by the user, and call a macro that performs the calculations. If no analysis data set is found, the program reports this error to the .LOG file and terminates. Otherwise, the calculation macro begins.

If BY variable processing is requested, the data are sorted before submission to PROC UNIVARIATE for analysis. PROC PRINTTO is used to capture the usual output and send it to the file 'UNI.DAT'. An output data set from PROC UNIVARIATE is used to store the number of non-missing observations for each analysis.

If the user chooses to generate a hardcopy of the results, a DATA step is used to process the UNI.DAT file. The functions PUT and SUBSTR are used in conjunction with the $HEX16. format to search for pagebreaks and set a flag that will be used to fire a PUT _PAGE_ in a DATA _NULL_ step at the end of the program. Next, a flag is set if BY variable box plots are created. For each page of output that does not contain BY variable box plots, a counter is incremented. The counter is used to facilitate direct read access of the shape statistics data set created by PROC IML for use in the DATA _NULL_ that generates the hardcopy output. Next, that part of the output line that displays the values for skewness and kurtosis is removed. Finally, a flag is set that indicates the last line of the tabular portion of the PROC UNIVARIATE output.

For each analysis variable, the raw data are sorted, then merged and transposed. The output data set from PROC UNIVARIATE that contains the number of non-missing observations is also transposed. PROC IML is then used to calculate the central moments and L-moments for skewness and kurtosis.

Using the same method as PROC UNIVARIATE, the sample skewness and kurtosis are calculated. Then, conditional upon there being at least four non-missing observations, for each combination of BY variable and analysis variable the values for w1, w2, w3 and w4 are calculated and appended to an interim matrix. If this condition is not met, then the calculation of the L-moment parameters is not possible and a flag is set. Finally, the L-moment parameters are calculated, concatenated with the BY variables (if present), the central moment parameters and the conditional flag described above and placed into a SAS data set. The names of the analysis variables are placed in a separate SAS data set.

Once the calculations have been completed, user-defined options direct the results to hardcopy output and/or a temporary or permanent SAS data set. If hardcopy output is requested, a DATA _NULL_ writes the modified PROC UNIVARIATE output and, using the direct access counter previously described, places the shape statistics immediately below the last line of PROC UNIVARIATE tabular output.
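The sample estimators given in the L-moments section, together with the rule that at least four non-missing observations are needed, can be illustrated with a short sketch. The code below is Python rather than SAS/IML and is not the author's macro; the function name `sample_l_moments`, the dictionary keys and the handling of a constant sample are assumptions made for illustration only.

```python
import math


def sample_l_moments(values):
    """Sample L-moments l1..l4 and the ratios t3, t4, t3' computed from the
    w2, w3, w4 sums. Returns None when fewer than four non-missing values
    are available (hypothetical convention mirroring the program's flag)."""
    # Treat None and NaN as missing, then sort to obtain the order statistics.
    x = sorted(v for v in values
               if v is not None and not (isinstance(v, float) and math.isnan(v)))
    n = len(x)
    if n < 4:
        return None

    w2 = w3 = w4 = 0.0
    for i, xi in enumerate(x, start=1):        # i is the rank, 1..n
        w2 += (i - 1) * xi
        w3 += (i - 1) * (i - 2) * xi
        w4 += (i - 1) * (i - 2) * (i - 3) * xi
    w2 /= n * (n - 1)
    w3 /= n * (n - 1) * (n - 2)
    w4 /= n * (n - 1) * (n - 2) * (n - 3)

    l1 = sum(x) / n
    l2 = 2 * w2 - l1
    l3 = 6 * w3 - 6 * w2 + l1
    l4 = 20 * w4 - 30 * w3 + 12 * w2 - l1

    if l2 > 0:
        t3, t4 = l3 / l2, l4 / l2
        # t3' = (1 + t3) / (1 - t3); guard the boundary case t3 = 1.
        t3_alt = (1 + t3) / (1 - t3) if t3 < 1 else math.inf
    else:
        # Assumption: a constant sample has l2 = 0, so the ratios are undefined.
        t3 = t4 = t3_alt = math.nan

    return {"l1": l1, "l2": l2, "l3": l3, "l4": l4,
            "t3": t3, "t4": t4, "t3_alt": t3_alt}


if __name__ == "__main__":
    print(sample_l_moments([1, 2, 3, 10]))   # right-skewed toy sample
    print(sample_l_moments([1, 2, 3]))       # too few observations
```

For four equally spaced values such as 1, 2, 3, 4 both ratios are zero, which agrees with the definitions of λ3 and λ4 as contrasts of order statistics.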
If the sample size flag generated by PROC IML has been fired, a '**' is printed for the L-moment output, with an appropriate footnote. If the user has requested that a temporary (OUT = T) or permanent (OUT = P) data set be created, then the two resultant data sets from PROC IML are merged and the data set is created as appropriate.

Simulations

Simulations were conducted to explore the applicability of the L-moment shape statistics to varying sample sizes and distributional shapes. For each of the following distributions, 5000 data sets were generated for sample sizes 5, 10, 20, 40, 60, 125 and 250:

Logistic       y = a + k*log(x/(1-x)), where a = 0 and k = 1;
Gumbel         y = a - b(log(-log(x))), where a = 1 and b = 1;
Normal(0, 1)
Exponential    y = a - b*log(1-x), where a = 1 and b = 1;
Lognormal      y = exp(a*x), where a = 0.5;
Lognormal      y = exp(a*x), where a = 1.

In the equations, x is a random normal (0,1) variate. The table below lists the theoretical values for the shape statistics for the above distributions (values for the central moments are taken from Hastings and Peacock, 1975; except for τ4 for the lognormal distributions (Royston, 1992), values for the L-moments are taken from Hosking, 1990):

Distribution      ϒ       κ       τ3      τ'3     τ4
Logistic          0.0     4.2     0.0     1.0     0.167
Gumbel            1.137   5.4     0.170   1.410   0.150
Normal            0.0     3.0     0.0     1.0     0.123
Exponential       2.0     9.0     0.333   2.0     0.167
Lognormal(0.5)    1.75    8.90    0.241   1.635   0.169
Lognormal(1.0)    6.19    113.9   0.463   2.724   0.294

The method of Royston (1992) was used to quantify the results of the simulations. For the logistic and normal distributions, the mean absolute values for τ3 and κ were determined. Otherwise, each of the 5000 values of the shape parameters was standardized by dividing its simulation mean by the nominal, theoretical value and then averaged. The results are presented in Appendix 1, comparing the usual shape statistics to the L-moment statistics for the simulated sample sizes. The results are presented as the nominal values for τ3 and κ for the logistic and normal distributions or percent of nominal value for the other distributions.

Upon review of the results it appears that, for the simulations conducted and independent of sample size or distributional shape, the L-moment shape statistics in general are less biased than the central moment shape statistics. As such, the L-moment shape statistics may be more useful indicators of the type of departure of a sample from normality (Royston, 1992).

Example

The usefulness of L-moment shape statistics becomes apparent when applied to the analysis of pharmacokinetic parameters. It has been suggested that many pharmacokinetic parameters follow a lognormal distribution. To examine this, data from Metzler and Huang (1983) will be used to calculate the central moment and L-moment shape statistics for the untransformed and log-transformed area under the plasma concentration-time curve data. Figures 1 and 2 present an example of the output produced using the macro call

%LMOMENTS(INDS=TEST,PLOTS=Y,VARS= AUC LOGAUC);

The shape statistics for the untransformed data suggest that the underlying distribution is positively skewed, with some evidence of kurtosis. Log-transformation of the data results in a closer approximation to normality. Note the disparity between the central moment, κ, and L-moment, τ4, measures of kurtosis. This can probably be attributed to the poor small sample performance of κ compared to τ4, and to the bias of κ in non-normal distributions.

Discussion

The L-moment shape indices t3, t'3 and t4 have several advantages over the usual shape statistics ϒ and κ. Accurate characterization of several non-normal distributions, reasonably unbiased estimates in small samples, ease of interpretation and robustness to outliers make the L-moment shape statistics useful measures of the shape of a distribution. As shown in the example, the L-moment shape statistics could be useful indicators when transformation of data is required. A macro program was developed to include the calculation of the L-moment shape statistics with the central moment shape statistics in a hardcopy of PROC UNIVARIATE output, an output data set, or both.

References

Bickel, P. Robust estimation. In S. Kotz and N. Johnson (eds.), Encyclopedia of Statistical Sciences, Volume 8, pp. 157-163. John Wiley and Sons (1988), New York, NY.

Glass, G.V., Peckham, P.D. and J.R. Sanders. Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Rev. Educ. Res. 42:238-288, 1972.

Hastings, N.A.J. and J.B. Peacock. Statistical Distributions. John Wiley and Sons (1975), New York, NY.

Hopkins, K.D. and D.L. Weeks. Tests for normality and measures of skewness and kurtosis: Their place in research reporting. Educ. Psychol. Meas. 50:717-729, 1990.

Hosking, J.R.M. L-moments: Analysis and estimation of distributions using linear combinations of order statistics. J. Royal Stat. Soc. B 52:105-124, 1990.

MacGillivray, H.L. and K.P. Balanda. The relationships between skewness and kurtosis. Austral. J. Stat. 30:319-337, 1988.

Metzler, C.M. and D.C. Huang. Statistical methods for bioavailability and bioequivalence. Clin. Res. Pract. Drug Reg. Affairs, 1:109-132, 1983.

Royston, P. Which measures of skewness and kurtosis are best? Stat. Med. 11:333-343, 1992.

SAS Institute, Inc. SAS® Language: Reference, Version 6, First Edition, Cary, NC: SAS Institute, Inc., 1990.

SAS Institute, Inc. SAS® Procedures Guide, Version 6, Third Edition, Cary, NC: SAS Institute, Inc., 1990.

SAS Institute, Inc. SAS/IML® Software: Usage and Reference, Version 6, First Edition, Cary, NC: SAS Institute, Inc., 1989.

Tukey, J.W. Exploratory Data Analysis. Addison-Wesley (1977), Reading, MA.

Van Der Laan, P. and L.R. Verdooren. Classical analysis of variance methods and nonparametric counterparts. Biom. J. 6:635-665, 1987.

SAS and SAS/IML are registered trademarks of the SAS Institute, Inc., Cary, NC. HP-UNIX is a registered trademark of the Hewlett-Packard Corporation, Boise, Idaho.

The author can be reached at

Covance
210 Carnegie Center
Princeton, NJ 08540
Phone: (609) 452-4150
[email protected]

APPENDIX 1 - Simulation Results
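The simulation design described in the Simulations section can be outlined in a few lines of code. The sketch below is in Python, not the author's SAS program, and it covers only the L-moment ratios t3 and t4. Several points are assumptions made for illustration: the names `DRAW`, `l_ratios`, `simulate` and `relative` are hypothetical; x is taken to be a uniform(0,1) variate for the logistic, Gumbel and exponential transforms (which is what those inverse-CDF formulas require) and a normal(0,1) variate for the normal and lognormal cases; and the demonstration uses far fewer replicates than the 5000 used in the paper.

```python
import math
import random

# Theoretical (tau3, tau4) values copied from the table in the Simulations section.
NOMINAL = {
    "logistic": (0.0, 0.167), "gumbel": (0.170, 0.150), "normal": (0.0, 0.123),
    "exponential": (0.333, 0.167),
    "lognormal_0.5": (0.241, 0.169), "lognormal_1.0": (0.463, 0.294),
}


def _u(rng):
    """Uniform draw on the open interval (0, 1), so every log below is finite."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def _logistic(rng):
    u = _u(rng)
    return math.log(u / (1.0 - u))                      # a = 0, k = 1


# ASSUMPTION: uniform input for the first three transforms, normal input otherwise.
DRAW = {
    "logistic": _logistic,
    "gumbel": lambda rng: 1.0 - math.log(-math.log(_u(rng))),   # a = 1, b = 1
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "exponential": lambda rng: 1.0 - math.log(1.0 - _u(rng)),   # a = 1, b = 1
    "lognormal_0.5": lambda rng: math.exp(0.5 * rng.gauss(0.0, 1.0)),
    "lognormal_1.0": lambda rng: math.exp(1.0 * rng.gauss(0.0, 1.0)),
}


def l_ratios(sample):
    """Return (t3, t4) for a sample of at least four distinct values."""
    x = sorted(sample)
    n = len(x)
    w2 = sum((i - 1) * v for i, v in enumerate(x, 1)) / (n * (n - 1))
    w3 = sum((i - 1) * (i - 2) * v for i, v in enumerate(x, 1)) / (n * (n - 1) * (n - 2))
    w4 = (sum((i - 1) * (i - 2) * (i - 3) * v for i, v in enumerate(x, 1))
          / (n * (n - 1) * (n - 2) * (n - 3)))
    l1 = sum(x) / n
    l2 = 2 * w2 - l1
    l3 = 6 * w3 - 6 * w2 + l1
    l4 = 20 * w4 - 30 * w3 + 12 * w2 - l1
    return l3 / l2, l4 / l2


def simulate(name, n, reps, rng):
    """Mean t3 and mean t4 over `reps` simulated samples of size n."""
    total3 = total4 = 0.0
    for _ in range(reps):
        t3, t4 = l_ratios([DRAW[name](rng) for _ in range(n)])
        total3 += t3
        total4 += t4
    return total3 / reps, total4 / reps


def relative(mean, nominal):
    """Percent of nominal value, or the raw mean when the nominal value is zero."""
    return mean if nominal == 0 else 100.0 * mean / nominal


if __name__ == "__main__":
    rng = random.Random(1998)                 # arbitrary seed for repeatability
    for name, (tau3, tau4) in NOMINAL.items():
        for n in (5, 20, 250):                # subset of the paper's sample sizes
            m3, m4 = simulate(name, n, reps=200, rng=rng)
            print(name, n, round(relative(m3, tau3), 3), round(relative(m4, tau4), 3))
```

Reporting the raw mean when the nominal value is zero follows the paper's description of how the logistic and normal results are presented; values from this sketch are not expected to reproduce the tabulated results exactly.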

Sample Size Logistic Gumbel

ϒ κ ϒ κ τ3 τ4 τ3 τ4
  5    -0.026   -0.006     5%    95%      38%    79%     4%   101%
 10    -0.025   -0.003    18%   102%      55%    87%    12%   105%
 20    -0.007   -0.001    20%   100%      72%   100%    20%   107%
 40    -0.011   -0.002    25%   101%      80%   100%    28%   108%
 60    -0.004   -0.002    28%   102%      85%   101%    34%   108%
125    -0.020   -0.004    30%   102%     100%   103%    40%   108%
250    -0.008   -0.003    31%   102%     104%   103%    42%   108%

Sample Size Normal Exponential

ϒ κ ϒ κ τ3 τ4 τ3 τ4
  5     0.006    0.004    99%    99%      27%    58%     4%    99%
 10    -0.001   -0.004   100%   102%      43%    63%    12%   104%
 20     0.000    0.000   100%   101%      57%    65%    21%   108%
 40     0.005    0.001   100%   100%      64%    67%    32%   112%
 60     0.004    0.001   100%   100%      69%    71%    38%   114%
125     0.003    0.000   100%   100%      74%    73%    46%   115%
250     0.000    0.000   100%   100%      82%    74%    51%   115%

Sample Size Lognormal (0.5) Lognormal (1.0)

ϒ κ ϒ κ τ3 τ4 τ3 τ4
  5     33%    78%     3%    82%      17%    75%     1%    67%
 10     44%    82%     7%    87%      21%    81%     4%    77%
 20     61%    89%    18%    91%      30%    87%     8%    83%
 40     68%    86%    28%    96%      37%    92%    10%    90%
 60     75%    98%    33%    98%      42%    95%    13%    94%
125     82%    99%    40%    99%      51%    97%    18%    96%
250     92%   100%    44%   100%      61%    98%    20%    98%

FIGURE 1

Univariate Procedure

Variable=AUC

Moments Quantiles(Def=5) Extremes

N          20         Sum Wgts   20          100% Max   16.26    99%   16.26     Lowest  Obs    Highest  Obs
Mean       7.08       Sum        141.6        75% Q3    9.175    95%   15.14       2.33(   1)     9.28(  16)
Std Dev    3.771507   Variance   14.22426     50% Med   6.22     90%   12.41       2.89(   2)    10.73(  17)
                                              25% Q1    4.15     10%   2.965       3.04(   3)     10.8(  18)
USS        1272.789   CSS        270.261       0% Min   2.33      5%   2.61        3.14(   4)    14.02(  19)
CV         53.26987   Std Mean   0.843335                         1%   2.33        3.92(   5)    16.26(  20)
T:Mean=0   8.395245   Pr>|T|     0.0001      Range      13.93
Num ^= 0   20         Num > 0    20          Q3-Q1      5.025
M(Sign)    10         Pr>=|M|    0.0001      Mode       2.33
Sgn Rank   105        Pr>=|S|    0.0001
W:Normal   0.864233   Pr

                Skewness   Kurtosis
                --------   --------
Usual Method:      0.923      0.514

L-Moments: T3 0.205 T4 0.121 T3’ 1.516

Stem Leaf       #
  16 3          1
  14 0          1
  12
  10 78         2
   8 13         2
   6 5669       4
   4 46789      5
   2 39019      5

[Boxplot and normal probability plot; probability plot axes: vertical 3 to 17, horizontal -2 to +2]

NOTE: T3’ = (1+T3)/(1-T3).
The L-Moment statistics are subject to the following constraints: -1 < T3 < 1, 0 < T3’ < Infinity, ¼ * (5 * (T3**2) - 1) <= T4 <= 1.
** indicates that L-Moment statistics could not be computed.
REF: P. Royston, Stat. Med. 11:333-43 (1992).

FIGURE 2

Univariate Procedure

Variable=LOGAUC

Moments Quantiles(Def=5) Extremes

N          20         Sum Wgts   20          100% Max   2.788708   99%   2.788708     Lowest  Obs      Highest  Obs
Mean       1.821787   Sum        36.43574     75% Q3    2.216417   95%   2.714596   0.845868(   1)   2.227862(  16)
Std Dev    0.54308    Variance   0.294936     50% Med   1.826445   90%   2.510016   1.061257(   2)   2.373044(  17)
                                              25% Q1    1.42157    10%   1.086557   1.111858(   3)   2.379546(  18)
USS        71.98194   CSS        5.603775      0% Min   0.845868    5%   0.953562   1.144223(   4)   2.640485(  19)
CV         29.81027   Std Mean   0.121436                           1%   0.845868   1.366092(   5)   2.788708(  20)
T:Mean=0   15.002     Pr>|T|     0.0001      Range      1.94284
Num ^= 0   20         Num > 0    20          Q3-Q1      0.794847
M(Sign)    10         Pr>=|M|    0.0001      Mode       0.845868
Sgn Rank   105        Pr>=|S|    0.0001
W:Normal   0.913816   Pr

                Skewness   Kurtosis
                --------   --------
Usual Method:     -0.091     -0.766

L-Moments: T3 -0.029 T4 0.075 T3’ 0.943

Stem Leaf       #
   2 68         2
   2 0012244    7
   1 557889     6
   1 1114       4
   0 8          1

[Boxplot and normal probability plot; probability plot axes: vertical 0.75 to 2.75, horizontal -2 to +2]

NOTE: T3’ = (1+T3)/(1-T3).
The L-Moment statistics are subject to the following constraints: -1 < T3 < 1, 0 < T3’ < Infinity, ¼ * (5 * (T3**2) - 1) <= T4 <= 1.
** indicates that L-Moment statistics could not be computed.
REF: P. Royston, Stat. Med. 11:333-43 (1992).
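The relationships stated in the figure notes can be checked against the values printed in Figures 1 and 2. The short Python sketch below is illustrative only and is not part of the macro; the names `REPORTED` and `check` are hypothetical, and the tolerance of 0.002 is an assumption that allows for T3 and T3’ having been rounded to three decimals.

```python
# Shape statistics as printed in Figures 1 and 2 (rounded to three decimals).
REPORTED = {
    "AUC":    {"t3": 0.205,  "t4": 0.121, "t3_alt": 1.516},
    "LOGAUC": {"t3": -0.029, "t4": 0.075, "t3_alt": 0.943},
}


def check(t3, t4, t3_alt, tol=0.002):
    """Recompute T3' from T3 and test the constraints quoted in the notes."""
    recomputed = (1 + t3) / (1 - t3)
    lower_bound = 0.25 * (5 * t3 ** 2 - 1)      # lower limit for T4
    return {
        "t3_alt_recomputed": recomputed,
        "t3_alt_matches": abs(recomputed - t3_alt) <= tol,
        "t3_in_range": -1 < t3 < 1,
        "t4_in_range": lower_bound <= t4 <= 1,
    }


if __name__ == "__main__":
    for variable, stats in REPORTED.items():
        print(variable, check(**stats))
```

For both variables the recomputed T3’ agrees with the printed value to within rounding, and T3 and T4 fall inside the stated bounds.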