USE OF PROC IML TO CALCULATE L-MOMENTS FOR THE UNIVARIATE DISTRIBUTIONAL SHAPE PARAMETERS SKEWNESS AND KURTOSIS

Michael A. Walega
Covance, Princeton, New Jersey

Introduction

Exploratory data analysis statistics, such as those generated by the SAS® procedure PROC UNIVARIATE (1990), are useful tools to characterize the underlying distribution of data prior to more rigorous statistical analyses. Assessment of the distributional shape of data is usually accomplished by careful examination of the values of the third and fourth central moments, skewness and kurtosis. However, when the sample size is small or the underlying distribution is non-normal, the information obtained from the sample skewness and kurtosis can be misleading.

One alternative to the central moment shape statistics is the use of linear combinations of order statistics (L-moments) to examine the distributional shape characteristics of data. L-moments have several theoretical advantages over the central moment shape statistics: characterization of a wider range of distributions, robustness to outliers and more accurate estimates in small samples.

This paper focuses on the development of a macro program that uses SAS/IML® (1989) to generate the central moment and L-moment distributional shape parameters. In addition, the results of simulations, conducted with various sample sizes and distributions, will be presented.

Background

Largely through the influence of John Tukey's work (1977), statisticians have increasingly emphasized the exploratory analysis of data prior to more formal statistical analyses (t-tests, ANOVA, etc.). Tukey has suggested that to fully understand the nature of a variable and its measurement, characteristics other than the central tendency (mean) and variability (standard deviation) need to be examined. Many classical statistical tests rely on the assumption that the underlying distribution of the data (or residuals) is Gaussian. Bickel (1988) and Van Der Laan and Verdooren (1987) discuss the concept of robustness and how it pertains to the assumption of normality. As discussed by Glass et al. (1972), incorrect conclusions may be reached when the normality assumption is not valid, especially when one-tailed tests are employed or the sample size or significance level is very small. Hopkins and Weeks (1990) also discuss the effects of highly non-normal data on hypothesis testing of variances. Thus, the examination of skewness (departure from symmetry) and kurtosis (deviation from a normal curve) is an important component of exploratory data analyses.

Various methods to estimate skewness and kurtosis have been proposed (MacGillivray and Balanda, 1988). For many years, the conventional coefficients of skewness and kurtosis, $\gamma$ and $\kappa$ (Hosking, 1990), have been used to describe the shape characteristics of distributions. However, as pointed out by Hosking (1990) and Royston (1992), these coefficients are not without limitations. Both are sensitive to minute changes in the tails of a distribution, susceptible to moderate outliers and biased in small to moderately-sized samples from skew distributions. Also, the information conveyed by the third and fourth central moments with regard to the shape of a distribution can be difficult to assess. Thus, it would be appropriate to determine if other, more robust measures of skewness and kurtosis can be used to assess the shape of a distribution.

L-moments

One set of more robust measures is based on linear combinations of order statistics, or L-moments. In theory, L-moments are less prone to the effects of sampling variability than conventional moments. Hosking (1990) provides an excellent overview of the theory behind the derivation and application of L-moments as summary statistics for univariate probability distributions. Royston (1992) compares the properties of the conventional shape parameters to their L-moment counterparts for two lognormal distributions. Rather than discuss the detailed theory behind L-moments, the reader is referred to the two aforementioned papers. Instead, a brief overview of the development of the equations necessary to apply L-moments is given below. As with the paper by Royston, the notation of Hosking (1990) will be employed.

For the random variables $X_1, \ldots, X_n$ of sample size $n$ drawn from the distribution of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, let $X_{1:n} \le \ldots \le X_{n:n}$ be the order statistics. The L-moments of $X$ are defined by

$$\lambda_r \equiv r^{-1} \sum_{k=0}^{r-1} (-1)^k \binom{r-1}{k} E X_{r-k:r}, \qquad r = 1, 2, \ldots,$$

where $\lambda_r$ is the $r$th L-moment of a distribution and $E X_{i:r}$ is the expected value of the $i$th smallest observation in a sample of size $r$.

The first four central moments of a random variable $X$ can be written as

$$\mu = E(X),$$
$$\sigma^2 = E(X - \mu)^2,$$
$$\gamma = E(X - \mu)^3 / \sigma^3 \quad \text{and}$$
$$\kappa = E(X - \mu)^4 / \sigma^4.$$

In a similar fashion, the first four L-moments of a random variable $X$ can be written as

$$\lambda_1 = E(X),$$
$$\lambda_2 = \tfrac{1}{2} E(X_{2:2} - X_{1:2}),$$
$$\lambda_3 = \tfrac{1}{3} E(X_{3:3} - 2X_{2:3} + X_{1:3}) \quad \text{and}$$
$$\lambda_4 = \tfrac{1}{4} E(X_{4:4} - 3X_{3:4} + 3X_{2:4} - X_{1:4}).$$

It can be seen that $\lambda_1$ is equivalent to the usual measure of central tendency, $\mu$. $\lambda_2$ is similar to $\sigma^2$ in that both measure the difference between two randomly selected values of $X$; however, by its nature $\sigma^2$ assigns more weight to extreme sample values than does $\lambda_2$. $\lambda_3$ is a scale-dependent measure of skewness for a sample of size 3, and $\lambda_4$ is proportional to a weighted difference between the outer extremes and the central portion of samples of size 4 (Royston, 1992).

Scale-free versions of the L-moments for skewness, $\tau_3$, and kurtosis, $\tau_4$, can be written as

$$\tau_3 = \lambda_3 / \lambda_2 \quad \text{and}$$
$$\tau_4 = \lambda_4 / \lambda_2.$$

An alternative measure of skewness is defined as $(1 + \tau_3) / (1 - \tau_3)$. This measure is the ratio of the expected length of the upper tail to that of the lower tail in samples of size 3, and as such may be easier to interpret than $\tau_3$. $\lambda_2$, $\tau_3$, the alternative skewness measure and $\tau_4$ are subject to the constraints

$$\lambda_2 > 0, \quad -1 < \tau_3 < 1, \quad 0 < (1 + \tau_3)/(1 - \tau_3) < \infty \quad \text{and} \quad \tfrac{1}{4}(5\tau_3^2 - 1) \le \tau_4 < 1.$$

If a random sample of size $n$ is drawn from a distribution of the random variable $X$ and $x_{1:n} \le \ldots \le x_{n:n}$ are the ordered sample values, then estimates of the L-moments $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, namely $l_1$, $l_2$, $l_3$ and $l_4$, can be calculated as follows. First, define $w_2$, $w_3$ and $w_4$ as

$$w_2 = \frac{1}{n(n-1)} \sum_{i=2}^{n} (i-1)\, x_{i:n},$$
$$w_3 = \frac{1}{n(n-1)(n-2)} \sum_{i=3}^{n} (i-1)(i-2)\, x_{i:n} \quad \text{and}$$
$$w_4 = \frac{1}{n(n-1)(n-2)(n-3)} \sum_{i=4}^{n} (i-1)(i-2)(i-3)\, x_{i:n}.$$

Then the L-moments and the corresponding shape statistics can be estimated as

$$l_1 = \sum x_i / n,$$
$$l_2 = 2w_2 - l_1,$$
$$l_3 = 6w_3 - 6w_2 + l_1,$$
$$l_4 = 20w_4 - 30w_3 + 12w_2 - l_1,$$
$$t_3 = l_3 / l_2 \quad \text{and}$$
$$t_4 = l_4 / l_2.$$

$t_3$ and $t_4$ are the sample L-skewness and L-kurtosis, respectively. The sample estimate of the alternative measure of skewness is defined as $(1 + t_3) / (1 - t_3)$.

The Program

The macro program L_MOMENTS was originally written using SAS® v6.08 under the VMS operating environment. It has been modified to run under V6.09 and V6.12 on HP-UNIX®. With slight modification (detailed below), the program should run on any operating system. The user is required to provide the name of the SAS data set (macro variable INDS) to be used in the analyses and the name(s) of the variables (macro variable VARS), separated by spaces, to be analyzed. There is no limit as to the number of variables that can be analyzed. Options available to the user include:

• Specify the location of the SAS data library (macro variable LIB). Default is current user location.

• BY group processing (macro variable BYVAR). No limit on the number of BY variables, delimited by a space.

If BY variable processing is requested, the data are sorted before submission to PROC UNIVARIATE for analysis. PROC PRINTTO is used to capture the usual output and send it to the file 'UNI.DAT'. An output data set from PROC UNIVARIATE is used to store the number of non-missing observations for each analysis.

If the user chooses to generate a hardcopy of the results, a DATA step is used to process the UNI.DAT file. The functions PUT and SUBSTR are used in conjunction with the $HEX16. format to search for page breaks and set a flag that will be used to fire a PUT _PAGE_ in a DATA _NULL_ step at the end of the program. Next, a flag is set if BY variable box plots are created. For each page of output that does not contain BY variable box plots, a counter is incremented. The counter is used to facilitate direct read access of the shape statistics data set created by PROC IML for use in the DATA _NULL_ step that generates the hardcopy output. Next, that part of the output line that displays the values for skewness and kurtosis is removed. Finally, a flag is set that indicates the last line of the tabular portion of the PROC UNIVARIATE output.

For each analysis variable, the raw data are sorted, then merged and transposed. The output data set from PROC UNIVARIATE that contains the number of non-missing observations is also transposed. PROC IML is then used to calculate the central moments and L-moments for skewness and
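
As an illustration of the sample estimators given in the L-moments section (the sums w2, w3 and w4, the estimates of the first four L-moments, and the ratios t3 and t4), the following is a minimal sketch. It is written in Python rather than SAS/IML and is not the L_MOMENTS macro; the function name `sample_l_moments` and the dictionary keys are hypothetical names chosen for this example.

```python
def sample_l_moments(values):
    """Sample L-moments l1..l4 and ratios t3, t4 from the w2, w3, w4 sums.

    Illustrative sketch only; names are assumptions, not the paper's macro.
    """
    x = sorted(values)                      # x[0] is x_{1:n}, x[-1] is x_{n:n}
    n = len(x)
    if n < 4:
        raise ValueError("need at least 4 observations to compute w4")

    # enumerate(..., start=1) supplies the 1-based rank i used in the formulas.
    # Ranks below the stated lower summation limits contribute zero because
    # one of the factors (i-1), (i-2), (i-3) vanishes, so summing over all
    # ranks gives the same totals.
    ranked = list(enumerate(x, start=1))
    w2 = sum((i - 1) * xi for i, xi in ranked) / (n * (n - 1))
    w3 = sum((i - 1) * (i - 2) * xi for i, xi in ranked) / (
        n * (n - 1) * (n - 2)
    )
    w4 = sum((i - 1) * (i - 2) * (i - 3) * xi for i, xi in ranked) / (
        n * (n - 1) * (n - 2) * (n - 3)
    )

    l1 = sum(x) / n
    l2 = 2 * w2 - l1
    l3 = 6 * w3 - 6 * w2 + l1
    l4 = 20 * w4 - 30 * w3 + 12 * w2 - l1
    if l2 == 0:
        raise ValueError("l2 is zero (no spread); t3 and t4 are undefined")

    t3 = l3 / l2                            # sample L-skewness
    t4 = l4 / l2                            # sample L-kurtosis
    # Alternative skewness measure (1 + t3) / (1 - t3); guarded because a
    # sample can reach t3 = 1, where the ratio is unbounded.
    alt_skew = (1 + t3) / (1 - t3) if t3 < 1 else float("inf")
    return {"l1": l1, "l2": l2, "l3": l3, "l4": l4,
            "t3": t3, "t4": t4, "alt_skew": alt_skew}
```

For the sample 1, 2, 3, 4, 10 this sketch gives l1 = 4, l2 = 2 and t3 = t4 = 0.5 (up to floating-point rounding), so the alternative skewness measure is 3. For the symmetric sample 1, 2, 3, 4, 5 it gives t3 = 0.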