Use of PROC IML to Calculate L-Moments for the Univariate Distributional Shape Parameters Skewness and Kurtosis
Michael A. Walega
Berlex Laboratories, Wayne, New Jersey
NESUG '93 Proceedings, Statistics

Introduction

Exploratory data analysis statistics, such as those generated by the SAS procedure PROC UNIVARIATE (1990), are useful tools to characterize the underlying distribution of data prior to more rigorous statistical analyses. Assessment of the distributional shape of data is usually accomplished by careful examination of the values of the third and fourth central moments, skewness and kurtosis. However, when the sample size is small or the underlying distribution is non-normal, the information obtained from the sample skewness and kurtosis can be misleading.

One alternative to the central moment shape statistics is the use of linear combinations of order statistics (L-moments) to examine the distributional shape characteristics of data. L-moments have several theoretical advantages over the central moment shape statistics: characterization of a wider range of distributions, robustness to outliers and more accurate estimates in small sample sizes.

This paper focuses on the development of a macro program that uses SAS/IML (1989) to generate the central moment and L-moment distributional shape parameters. In addition, the results of simulations, conducted with various sample sizes and distributions, will be presented.

Background

Largely through the influence of John Tukey's work (1977), statisticians have increasingly emphasized the exploratory analysis of data prior to more formal statistical analyses (t-tests, ANOVA, etc.). Tukey has suggested that to fully understand the nature of a variable and its measurement, characteristics other than the central tendency (mean) and variability (standard deviation) need to be examined. Many classical statistical tests rely on the assumption that the underlying distribution of the data (or residuals) is Gaussian. Bickel (1988) and Van Der Laan and Verdooren (1987) discuss the concept of robustness and how it pertains to the assumption of normality.

As discussed by Glass et al. (1972), incorrect conclusions may be reached when the normality assumption is not valid, especially when one-tail tests are employed or the sample size or significance level is very small. Hopkins and Weeks (1990) also discuss the effects of highly non-normal data on hypothesis testing of variances. Thus, it is apparent that examination of the skewness (departure from symmetry) and kurtosis (deviation from a normal curve) is an important component of exploratory data analyses.

Various methods to estimate skewness and kurtosis have been proposed (MacGillivray and Balanda, 1988). For many years, the conventional coefficients of skewness and kurtosis, $\gamma$ and $\kappa$ (Hosking, 1990), have been used to describe the shape characteristics of distributions. However, as pointed out by Hosking (1990) and Royston (1992), these coefficients are not without limitations. Both are sensitive to minute changes in the tails of a distribution, susceptible to moderate outliers and biased in small to moderately sized samples from skew distributions. Also, the information conveyed by the third and fourth central moments with regard to the shape of a distribution can be difficult to assess. Thus, it would be appropriate to determine if other, more robust measures of skewness and kurtosis can be used to assess the shape of a distribution.

L-moments

One more robust measure is the use of linear combinations of order statistics, or L-moments. In theory, L-moments are less prone to the effects of sampling variability as compared to conventional moments. Hosking (1990) provides an excellent overview of the theory behind the derivation and application of L-moments as summary statistics for univariate probability distributions. Royston (1992) compares the properties of the conventional shape parameters to their L-moment counterparts for two lognormal distributions. For the detailed theory behind L-moments, the reader is referred to these two papers; only a brief overview of the equations necessary to apply L-moments is given here. As with the paper by Royston, the notation of Hosking (1990) will be employed.

For the random variables $X_1, \ldots, X_n$ of sample size $n$ drawn from the distribution of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, let $X_{1:n} \le \cdots \le X_{n:n}$ be the order statistics such that the L-moments of $X$ are defined by

$$\lambda_r = r^{-1} \sum_{k=0}^{r-1} (-1)^k \binom{r-1}{k} E X_{r-k:r}, \qquad r = 1, 2, \ldots$$

where $\lambda_r$ is the $r$th L-moment of a distribution and $E X_{i:r}$ is the expected value of the $i$th smallest observation in a sample of size $r$.

The first four central moments of a random variable $X$ can be written as

$$\mu = E(X),$$
$$\sigma^2 = E(X - \mu)^2,$$
$$\gamma = E(X - \mu)^3 / \sigma^3 \quad \text{and}$$
$$\kappa = E(X - \mu)^4 / \sigma^4.$$

In a similar fashion, the first four L-moments of a random variable $X$ can be written as

$$\lambda_1 = E(X),$$
$$\lambda_2 = \tfrac{1}{2} E(X_{2:2} - X_{1:2}),$$
$$\lambda_3 = \tfrac{1}{3} E(X_{3:3} - 2X_{2:3} + X_{1:3}) \quad \text{and}$$
$$\lambda_4 = \tfrac{1}{4} E(X_{4:4} - 3X_{3:4} + 3X_{2:4} - X_{1:4}).$$

It can be seen that $\lambda_1$ is equivalent to the usual measure of central tendency, $\mu$. $\lambda_2$ is similar to $\sigma$ in that both measure the difference between two randomly selected values of $X$; however, by its nature $\sigma$ assigns more weight to extreme sample values than does $\lambda_2$. $\lambda_3$ is a scale-dependent measure of skewness for a sample of size 3, and $\lambda_4$ is proportional to a weighted difference between outer extremes and the central portion in samples of size 4 (Royston, 1992).

Scale-free versions of the L-moments for skewness, $\tau_3$, and kurtosis, $\tau_4$, can be written as

$$\tau_3 = \lambda_3 / \lambda_2 \quad \text{and} \quad \tau_4 = \lambda_4 / \lambda_2.$$

An alternative measure of skewness, $\tau'_3$, is defined as $(1 + \tau_3) / (1 - \tau_3)$. This measure is the ratio of the expected length of the upper tail to that of the lower tail in samples of size 3, and as such may be easier to interpret than $\tau_3$. $\lambda_2$, $\tau_3$, $\tau'_3$ and $\tau_4$ are subject to the constraints

$$\lambda_2 > 0, \qquad -1 < \tau_3 < 1, \qquad 0 < \tau'_3 < \infty \qquad \text{and} \qquad \tfrac{1}{4}(5\tau_3^2 - 1) \le \tau_4 < 1.$$

If a random sample of size $n$ is drawn from a distribution of the random variable $X$ and $x_{1:n} \le \cdots \le x_{n:n}$ are the ordered sample values, then estimates of the L-moments $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, namely $l_1$, $l_2$, $l_3$ and $l_4$, can be calculated as follows. First, define $w_2$, $w_3$ and $w_4$ as

$$w_2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} (i-1)\, x_{i:n},$$
$$w_3 = \frac{1}{n(n-1)(n-2)} \sum_{i=1}^{n} (i-1)(i-2)\, x_{i:n} \quad \text{and}$$
$$w_4 = \frac{1}{n(n-1)(n-2)(n-3)} \sum_{i=1}^{n} (i-1)(i-2)(i-3)\, x_{i:n}.$$

Then the L-moments and the corresponding shape statistics can be estimated as

$$l_1 = \bar{x},$$
$$l_2 = 2w_2 - \bar{x},$$
$$l_3 = 6w_3 - 6w_2 + \bar{x},$$
$$l_4 = 20w_4 - 30w_3 + 12w_2 - \bar{x},$$
$$t_3 = l_3 / l_2 \quad \text{and} \quad t_4 = l_4 / l_2.$$

$t_3$ and $t_4$ are the sample L-skewness and L-kurtosis, respectively. The sample estimate of the alternative measure of skewness, $t'_3$, is defined as $(1 + t_3) / (1 - t_3)$.

The Program

The macro program L_MOMENTS was written using SAS v6.08 under the VMS operating environment. With slight modification (detailed below), the program should run on any operating system. The user is required to provide the name of the SAS dataset (macro variable INDAT) to be used in the analyses and the name(s) of the variables (macro variable VARS), separated by spaces, to be analyzed. There is no limit as to the number of variables that can be analyzed. Options available to the user include:

· Specify the location of the SAS data library (macro variable LIB). Default is current user location.

· BY group processing (macro variable BYVAR). No limit on number of BY variables, delimited by a space. Default is no BY group processing.

· Generate stem-leaf, box and normal probability plots (macro variable PLOTS). Default is no plots.

· Generate a hardcopy of the usual PROC UNIVARIATE output, with the central moment and

Two VMS-specific SAS functions, FINDFILE and FINDEND, are used to search for the analysis data set in a permanent SAS data library or in the SAS$WORK data library. Users on other operating systems may have equivalent functions that could be substituted. If no analysis data set is found, the program reports this error to the .LOG file and terminates. Otherwise, the calculation macro begins.

If BY variable processing is requested, the data are sorted before submission to PROC UNIVARIATE for analysis. PROC PRINTTO is used to capture the usual output and send it to the file 'UNI.DAT'. Also, an output data set from PROC UNIVARIATE is used to store the number of non-missing observations for each analysis.

If the user chooses to generate a hardcopy of the results, a DATA step is used to process the UNI.DAT file. The functions PUT and SUBSTR are used in conjunction with the $BINARY8. format to search for pagebreaks (the CC=CR option is specified in an OPTIONS statement) and set a flag that will be used to fire a PUT _PAGE_ in a DATA _NULL_ step at the end of the program. Next, a flag is set if BY variable box plots are created. For each page of output that does not contain BY variable box plots, a counter is incremented. The counter is used to facilitate direct read access of the shape statistics data set created by PROC IML for use in the DATA _NULL_ step that generates the hardcopy output. Next, that part of the output line that displays the values for skewness and kurtosis is removed. Finally, a flag is set that indicates the last line of the tabular portion of the PROC UNIVARIATE output.

For each analysis variable, the raw data are sorted, then merged and transposed.
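The sample estimators from the L-moments section can be illustrated outside SAS. What follows is a minimal Python sketch, not the author's SAS/IML macro: the function name `sample_l_moments` and the keys of the returned dictionary are hypothetical, and the expressions for l1 to l4 are assumed to be the standard unbiased estimators written in terms of w2, w3 and w4 (Hosking, 1990). As in the program, the data are sorted first so that each value can be weighted by its rank.

```python
def sample_l_moments(values):
    """Sample L-moments l1..l4 and shape statistics t3, t4, t3' (hypothetical helper).

    Assumes the standard unbiased estimators expressed through w2, w3, w4.
    """
    x = sorted(float(v) for v in values)  # ordered sample x_{1:n} <= ... <= x_{n:n}
    n = len(x)
    if n < 4:
        raise ValueError("at least four observations are needed to estimate l4")

    mean = sum(x) / n

    # enumerate() yields k = i - 1 for rank i = 1..n, so
    # (i-1) = k, (i-2) = k-1 and (i-3) = k-2 in the weighted sums.
    w2 = sum(k * xi for k, xi in enumerate(x)) / (n * (n - 1))
    w3 = sum(k * (k - 1) * xi for k, xi in enumerate(x)) / (n * (n - 1) * (n - 2))
    w4 = sum(k * (k - 1) * (k - 2) * xi for k, xi in enumerate(x)) / (
        n * (n - 1) * (n - 2) * (n - 3)
    )

    l1 = mean
    l2 = 2 * w2 - mean
    l3 = 6 * w3 - 6 * w2 + mean
    l4 = 20 * w4 - 30 * w3 + 12 * w2 - mean

    if l2 == 0:
        raise ValueError("l2 is zero (all values equal); shape ratios are undefined")

    t3 = l3 / l2  # sample L-skewness
    t4 = l4 / l2  # sample L-kurtosis
    # Alternative skewness measure (1 + t3) / (1 - t3); unbounded as t3 approaches 1.
    t3_alt = (1 + t3) / (1 - t3) if t3 != 1 else float("inf")

    return {"l1": l1, "l2": l2, "l3": l3, "l4": l4,
            "t3": t3, "t4": t4, "t3_alt": t3_alt}


if __name__ == "__main__":
    # A right-skewed toy sample: the single large value pulls t3 above zero.
    for name, value in sample_l_moments([1, 2, 3, 10]).items():
        print(f"{name}: {value:.4f}")
```

For a symmetric sample such as 1, 2, 3, 4, 5 the sketch gives an L-skewness of zero and an alternative skewness measure of one, which matches the interpretation of that measure as the ratio of upper to lower tail length.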