

Perusing, Choosing, and Not Mis-using: Non-parametric vs. Parametric Tests in SAS®
Venita DePuy and Paul A. Pappas, Duke Clinical Research Institute, Durham, NC

ABSTRACT

Most commonly used statistical procedures, such as the t-test, are based on the assumption of normality. The field of non-parametric statistics provides equivalent procedures that do not require normality, but often require other assumptions, such as equal variances. Parametric tests (which assume normality) are often used on non-normal data; even non-parametric tests are used when their assumptions are violated. This paper will provide an overview of parametric tests and their non-parametric equivalents; what assumptions are required for each test; how to perform the tests in SAS®; and guidelines for when both sets of assumptions are violated. Procedures covered will include PROCs ANOVA, CORR, NPAR1WAY, TTEST and UNIVARIATE. Discussion will include assumptions and assumption violations, robustness, and exact versus approximate tests.

INTRODUCTION

Many statistical tests rely heavily on distributional assumptions, such as normality. When these assumptions are not satisfied, commonly used statistical tests often perform poorly, resulting in a greater chance of committing an error. Non-parametric tests are designed to have desirable statistical properties when few assumptions can be made about the underlying distribution of the data. In other words, when the data are obtained from a non-normal distribution or one containing outliers, a non-parametric test is often a more powerful statistical tool than its parametric "normal theory" equivalent.

For example, the Likert scale data frequently used in the social sciences typically violate the assumption of normality necessary for parametric tests. The ordinal scale also violates the frequent assumption that data come from a continuous distribution. Various authors (Micceri, Breckler) have found that, in reviews of data sets and journal articles, the majority of behavioral sciences data violate the assumption of normality but rarely address that concern.

In this paper, we explore the use of parametric and non-parametric tests for one- and two-sample location differences, two-sample dispersion differences, and a one-way layout analysis. We will also examine testing for general differences between populations, and look at different measures of correlation.

MEASURES OF LOCATION AND SPREAD

The mean and standard deviation are typically used to describe the center and spread of normally distributed data. If the data are not normally distributed or contain outliers, these measures may not be robust enough to accurately describe the data. The median is a more robust measure of the center of a distribution, in that it is not as heavily influenced by outliers or skewed data. As a result, the median is typically used with non-parametric tests. The spread is less easy to quantify but is often represented by the interquartile range, which is simply the difference between the first and third quartiles.

DETERMINING NORMALITY (OR LACK THEREOF)

One of the first steps in test selection should be investigating the distribution of the data. PROC UNIVARIATE can be implemented to help determine whether or not your data are normal. This procedure generates a variety of descriptive statistics, such as the mean and median, as well as numerical representations of distributional properties such as skewness and kurtosis.

If the population from which the data are obtained is normal, the mean and median should be equal or close to equal. The skewness coefficient, which is a measure of symmetry, should be near zero. Positive values for the skewness coefficient indicate that the data are right skewed, and negative values indicate that the data are left skewed. The kurtosis coefficient, which is a measure of the peakedness of the distribution, should also be near zero. Positive values for the kurtosis coefficient indicate that the distribution of the data is steeper than a normal distribution, and negative values for kurtosis indicate that the distribution of the data is flatter than a normal distribution. The NORMAL option in PROC UNIVARIATE produces a table with tests for normality. In general, if the p-values are less than 0.05, then the data should be considered non-normally distributed. However, it is important to remember that these tests are heavily dependent on sample size. Strikingly non-normal data may have a p-value greater than 0.05 due to a small sample size. Therefore, graphical representations of the data should always be examined.

Low-resolution plots and high-resolution histograms are both available in PROC UNIVARIATE. The PLOTS option in PROC UNIVARIATE creates low-resolution stem-and-leaf, box, and normal probability plots. The stem-and-leaf plot is used to visualize the overall distribution of the data, and the box plot is a graphical representation of the 5-number summary. The normal probability plot is designed to investigate whether a variable is normally distributed. If the data are normal, then the plot should display a straight diagonal line. Different departures from the straight diagonal line indicate different types of departures from normality.

The HISTOGRAM statement in PROC UNIVARIATE will produce high-resolution histograms. When used in conjunction with the NORMAL option, the histogram will have a line indicating the shape of a normal distribution with the same mean and variance as the sample.

PROC UNIVARIATE is an invaluable tool in visualizing and summarizing data in order to gain an understanding of the underlying populations from which the data are obtained. To produce these results, the following code can be used. Omitting the VAR statement will run the analysis on all the variables in the dataset.



   PROC UNIVARIATE data=file1 normal plots;
      Histogram;
      Var var1 var2...varn;
   Run;

The determination of the normality of the data should result from evaluation of the graphical output in conjunction with the numerical output. In addition, the user might wish to look at subsets of the data; for example, a CLASS statement might be used to stratify by gender.

GENERAL GUIDELINES FOR CHOOSING TESTS

Obviously, if your data meet the assumptions of a parametric test, you should use it. Parametric tests are always more powerful, if used appropriately.

If you can transform the dependent variable, such as by using the log or square root transformation, to make it normally distributed, this may be a good alternative (a sketch of such a transformation is given below). SAS/INSIGHT® is the easiest method to explore these options, due to its interactive nature. Manning and Mullahy (2001) discuss concerns regarding log transformations and the biases possible when using transformations.
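As an illustration, a simple DATA step can create log- or square-root-transformed versions of a variable, which can then be re-checked for normality with PROC UNIVARIATE. This is only a minimal sketch; the dataset name file1, the variable var1, and the new variable names are placeholders, and which transformation (if any) is appropriate depends on your data.

   * Hypothetical sketch: create transformed versions of var1          ;
   * (file1, var1, and the new variable names are placeholders).       ;
   DATA transformed;
      Set file1;
      If var1 > 0 then logvar = log(var1);    * natural log requires positive values     ;
      If var1 >= 0 then sqrtvar = sqrt(var1); * square root requires non-negative values ;
   Run;

   PROC UNIVARIATE data=transformed normal plots;
      Var logvar sqrtvar;
   Run;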
If the data are not normally distributed but the populations have the same spread and similarly shaped distributions, and other assumptions are met, the non-parametric tests are typically the best options.

If neither the parametric nor the non-parametric test assumptions are met, great care should be taken when selecting tests. Things to consider include: differing sample sizes, differing variances, and differing distributional shapes. Zimmerman has published a variety of papers addressing different aspects of this, some of which are referenced. Specifics regarding each test are listed after the SAS code.

DIFFERENCES IN DEPENDENT POPULATIONS

Testing for the difference between two dependent populations, such as before and after measurements on the same subjects, is typically done by testing for a difference in centers of distribution (means or medians). In this situation, the data are paired; two observations are obtained on each of n subjects, resulting in one sample of 2n observations. The paired t-test looks for a difference in means, while the non-parametric sign and signed rank tests examine differences in medians. It is assumed that the spreads and shapes of the two populations are the same for all tests.

PAIRED T-TEST

If the data are normal, the one-sample paired t-test is the best statistical test to implement. Assumptions for the one-sample paired t-test are that the observations are independent and identically normally distributed. To perform a one-sample paired t-test in SAS, either PROC UNIVARIATE (as described below) or the following PROC TTEST code can be used.

   PROC TTEST data=file1;
      Paired variable1*variable2;
   Run;

If the p-value for the paired t-test is less than the specified alpha level, then there is evidence to suggest that the population means of the two variables differ. In the case of measurements taken before and after a treatment, this would suggest a treatment effect. PROC UNIVARIATE also calculates a t-test for the difference being equal to zero, which is equivalent to the paired t-test (code given in the next section).

While the paired t-test is robust to departures from normality when the two distributions are the same shape, the Type I error is inflated when the distributions are skewed and also have unequal variances. Therefore, care should be taken when variances appear unequal.

SIGN TEST AND SIGNED RANK TEST

The Signed Rank test and the Sign test are non-parametric equivalents to the one-sample paired t-test. Neither of these tests requires the data to be normally distributed, but both tests require that the observed differences between the paired observations be mutually independent, and that each of the observed paired differences comes from a continuous population symmetric about a common median. However, the observed paired differences do not necessarily have to be obtained from the same underlying distribution.

To calculate these tests, the difference between measurements must be calculated for each subject. This "difference" variable will tend to be non-zero if there is a difference between the two groups. The Sign test is calculated by counting the number of positive differences and the number of negative differences. The Signed Rank test is calculated by ranking all differences by their absolute value, from least to greatest. If the two populations do not have significantly different centers, the numbers of positives and negatives in the Sign test, or the sums of the ranks of the positive and negative differences in the Signed Rank test, should be roughly equal.

For small samples, the exact p-value of these tests can be determined manually by comparing the numbers to previously determined critical values, as are found in texts such as Hollander & Wolfe. Larger sample sizes are typically handled using a large-sample approximation. PROC UNIVARIATE uses the large-sample approximation for the signed rank test when the sample size is greater than 20.

In general, the signed rank test has more statistical power than the sign test, but if the data contain outliers or are obtained from a heavy-tailed distribution, the sign test will have the most statistical power. PROC UNIVARIATE produces both of these tests as well as the paired t-test discussed above. Before PROC UNIVARIATE can be used to carry out these tests, the paired differences must be computed, as per the following code:

   DATA file2;
      Set file1;
      Difference=var2-var1;
   PROC UNIVARIATE data=file2;
      Var difference;
   Run;

If the p-values for the signed rank or sign tests are less than the specified alpha level, then there is evidence to suggest that the population medians of the two variables differ. By default, PROC UNIVARIATE tests whether the differences between populations are equal to zero, although a specific number can be used instead via the MU0 option. For example, one might wish to test whether one type of tomato plant produced an average of 3 more tomatoes per plant; a sketch of this is given below.
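As a minimal sketch (the dataset name file2 and variable name difference are the placeholders used above), the MU0= option on the PROC UNIVARIATE statement shifts the null value for the t-test, sign test, and signed rank test from 0 to the specified number:

   * Hypothetical sketch: test whether the paired differences are ;
   * centered at 3 rather than 0.                                 ;
   PROC UNIVARIATE data=file2 mu0=3;
      Var difference;
   Run;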



DIFFERENCES BETWEEN TWO INDEPENDENT POPULATIONS

Investigators may be interested in the difference between centers of distributions, the difference in distributional spreads, or simply in any differences between two populations. PROC TTEST allows the user to test for differences in means for both equal and unequal variances, as well as providing a test for differences in variance. The non-parametric method provides different tests, depending on the hypothesis of interest. The Wilcoxon Rank Sum test investigates differences in medians, with the assumption of identical spreads, while the Ansari-Bradley test examines differences in spreads, with the assumption of identical medians. The more generalized Kolmogorov-Smirnov test looks for differences in center, spread, or both. It should be noted that "two stage" calculations should not be used. For instance, using the Wilcoxon Rank Sum test for differences in medians followed by the Ansari-Bradley test for differences in spread does not protect the nominal significance level. When overall differences between populations need to be investigated, consider the Kolmogorov-Smirnov test.

TWO-SAMPLE T-TEST AND FOLDED F TEST

If the data consist of two samples from independent, normally distributed populations, the two-sample t-test is the best statistical test to implement. Assumptions for the two-sample t-test and Folded F test are that within each sample the observations are independent and identically normally distributed. Also, the two samples must be independent of each other. To perform these tests in SAS, the following code can be used:

   PROC TTEST data=file1;
      Class sample;
      Var variable1;
   Run;

The CLASS statement is necessary in order to distinguish between the two samples. The output from these statements includes the two-sample t-test for both equal and unequal variances, as well as the Folded F test for equality of variances. The equality of variance results should be reviewed first, and used to determine which results from the t-test are applicable. If the p-value for the t-test is less than the specified alpha level, then there is evidence to suggest that the two population means differ.

If population variances are heterogeneous and sample sizes are unequal, the t-test will be invalid, whether or not the sample variances of the treatment groups happen to be equal. It should be noted that the two-sample t-test is robust to non-normality, as long as the two populations are similarly shaped.

WILCOXON RANK SUM TEST

The Wilcoxon Rank Sum test (which is numerically the same as the Mann-Whitney U test) is the non-parametric equivalent to the two-sample t-test. This test can also be performed if only ranks (i.e., ordinal data) are available. It tests the null hypothesis that the two distributions are identical against the alternative that the two distributions differ only with respect to the median. Assumptions for this test are that within each sample the observations are independent and identically distributed, and that the shapes and spreads of the distributions are the same. It does not matter which distribution the data are obtained from, as long as the data are randomly selected from the same underlying population. The two samples must also be independent of each other. The following code will perform the rank sum test in SAS:

   PROC NPAR1WAY data=file1 wilcoxon;
      Class sample;
      Var variable1;
      Exact; *OPTIONAL;
   Run;

PROC NPAR1WAY incorporates a continuity correction into the rank sum scores when computing the standardized test statistic z, unless you specify the CORRECT=NO option. A continuity correction is typically used when the data are discrete, since the test assumes the data are continuous.
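A minimal sketch of suppressing the continuity correction, using the same placeholder dataset and variable names, simply adds CORRECT=NO to the PROC NPAR1WAY statement:

   * Hypothetical sketch: Wilcoxon Rank Sum test without the ;
   * continuity correction.                                  ;
   PROC NPAR1WAY data=file1 wilcoxon correct=no;
      Class sample;
      Var variable1;
   Run;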
The Wilcoxon Rank Sum test will be performed even without the WILCOXON option; specifying the option, however, limits the amount of other output. The EXACT statement is optional and requests exact tests to be performed. Approximate tests perform well when the sample size is sufficiently large. When the sample size is small or the data are sparse, skewed, or heavy-tailed, approximate tests may be unreliable. However, exact tests are computationally intensive and may require computer algorithms to run for hours or days. Thus, if your sample size is sufficiently large, it may not be worth your while to request exact tests. If the p-value is less than the specified alpha level, then there is evidence to suggest that the two population medians differ.

Nanna and Sawilowsky (1998) demonstrated that with typical Likert scale data, the rank sum test has a considerable power advantage over the t-test, and that the advantage increases with sample size. But it should also be noted that when the variances of the treatment groups are heterogeneous, the Type I error probabilities of the Wilcoxon Rank Sum test can increase by as much as 300%, with no indication that they asymptotically approach the nominal significance level as the sample size increases. Therefore, care should be taken when assuming that variances are, in fact, equal. For various combinations of non-normal distribution shapes and degrees of variance heterogeneity, the Type I error probability of the rank sum test was found to be biased to a far greater extent than that of the t-test (Zimmerman 1998).

ANSARI-BRADLEY TEST

In some instances, it may be necessary to test for differences in spread while assuming that the centers of the two populations are identical. One example is comparing two assay methods to see which is more precise. The Ansari-Bradley test is the non-parametric equivalent to the Folded F test for equality of variances. Assumptions for this test are that within each sample the observations are independent and identically distributed. Also, the two samples must be independent of each other, with equal medians. To perform an Ansari-Bradley test in SAS, the following code can be used:

   PROC NPAR1WAY data=file1 ab;
      Class sample;
      Var variable1;
   Run;

The EXACT statement may also be used to request exact tests (a sketch is given below).
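A minimal sketch of requesting the exact Ansari-Bradley test, again with placeholder dataset and variable names, names the test on the EXACT statement:

   * Hypothetical sketch: exact Ansari-Bradley test. ;
   PROC NPAR1WAY data=file1 ab;
      Class sample;
      Var variable1;
      Exact ab;
   Run;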



If the p-value is less than the specified alpha level, then there is evidence to suggest that the spreads of the two populations are not identical.

KOLMOGOROV-SMIRNOV TEST

In many cases, the Kolmogorov-Smirnov test may be the most appropriate of the non-parametric tests for overall differences between two groups. If it can be assumed that the spreads and shapes of the two distributions are the same, the Wilcoxon Rank Sum test is more powerful than the Kolmogorov-Smirnov test; if the medians and distributional shapes are the same, the Ansari-Bradley test is more powerful. This test is for the general hypothesis that the two populations differ. The assumptions are that the data are independent and identically distributed within the two populations, and that the two samples are independent. To perform this test in SAS, use the following code:

   PROC NPAR1WAY data=file1 edf;
      Class sample;
      Var variable1;
      Exact; *OPTIONAL;
   Run;

As with the other non-parametric tests, the EXACT option is available for the Kolmogorov-Smirnov test, but is recommended only for small sample sizes or sparse, skewed, or heavy-tailed data. If the p-value is less than the specified alpha level, then there is evidence to suggest that the two populations are not the same.

DIFFERENCES IN THREE OR MORE INDEPENDENT POPULATIONS

In this situation the data consist of more than two samples.

ONE-WAY ANALYSIS OF VARIANCE TEST

If the data are normal, then a one-way ANOVA is usually the best statistical test to implement. Assumptions for a one-way ANOVA are that within each sample the observations are independent and identically normally distributed. Also, the samples must be independent of each other with equal population variances. To perform a one-way ANOVA in SAS, the following code can be used:

   PROC ANOVA data=file1;
      Class sample;
      Model variable1=sample;
   Run;

If the sample sizes of the groups are not all equal, the ANOVA should be conducted via PROC GLM instead (a sketch is given below). If the p-value is less than the specified alpha level, then there is evidence to suggest that at least one of the population means differs from the others. Further investigation is required to determine which specific population means can be considered statistically significantly different from each other.
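A minimal sketch of the unbalanced case, using the same placeholder dataset and variable names; PROC GLM accepts the same CLASS and MODEL statements as PROC ANOVA:

   * Hypothetical sketch: one-way ANOVA with unequal group sizes ;
   * carried out in PROC GLM.                                    ;
   PROC GLM data=file1;
      Class sample;
      Model variable1=sample;
   Run;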
KRUSKAL-WALLIS TEST

The Kruskal-Wallis test is the non-parametric equivalent to the one-way ANOVA. Assumptions for the Kruskal-Wallis test are that within each sample the observations are independent and identically distributed, and that the samples are independent of each other. To perform a Kruskal-Wallis test in SAS, the following code can be used:

   PROC NPAR1WAY data=file1 wilcoxon;
      Class sample;
      Var variable;
      Exact; *OPTIONAL;
   Run;

This block of code is identical to the code used to produce the Wilcoxon Rank Sum test. In fact, the Kruskal-Wallis test reduces to the rank sum test when there are only two samples. The EXACT statement is optional and requests exact tests to be performed in addition to the large-sample approximation. As before, the exact option is very computationally intensive and should only be used if needed. If the p-value is less than the specified alpha level, then there is evidence to suggest that at least one of the population medians differs from the others. Further investigation is required to determine which specific population medians can be considered statistically significantly different from each other.

As with the Wilcoxon rank sum test, the Kruskal-Wallis test is substantially biased by heteroscedasticity between treatment groups, even with equal sample sizes. The Type I error has been shown to increase dramatically, with no sign of leveling off.

TESTING CORRELATIONS

SAS offers several types of commonly used correlation calculations, including Pearson's product-moment, Spearman's rank-order, and Kendall's tau-b.

PEARSON'S PRODUCT-MOMENT

Pearson's correlation is calculated from the variances and covariances, and is typically used in parametric analyses. When one variable is dichotomous (0,1) and the other is continuous, a Pearson correlation is equivalent to a point biserial correlation. It should be noted that adding the COV option (for Pearson's correlations only) will produce variances and covariances as well as the correlations.

SPEARMAN'S RANK-ORDER

PROC CORR computes Spearman's correlation by ranking the data and using the ranks in the Pearson product-moment correlation formula. In case of ties, the averaged ranks are used. The Spearman rank-order correlation is a nonparametric measure of association based on the ranks of the data values, and assumes that the data were measured at least on an ordinal scale.

KENDALL'S TAU-B

Kendall's tau-b is another nonparametric measure of association, based on the number of concordances and discordances in paired observations. Concordance occurs when paired observations vary together, and discordance occurs when paired observations vary differently. PROC CORR computes Kendall's correlation by double sorting the data: ranking observations according to the values of the first variable and then reranking them according to the values of the second variable. Kendall's tau-b is computed from the number of interchanges of the first variable, and corrects for tied pairs (pairs of observations with equal values of X or equal values of Y).
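For reference, Kendall's tau-b can be written as follows; this is the standard definition, stated here as background rather than taken from the paper, with n_c and n_d the numbers of concordant and discordant pairs and the denominator adjusting for ties:

   \tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}, \qquad
   n_0 = \frac{n(n-1)}{2}, \quad
   n_1 = \sum_i \frac{t_i(t_i - 1)}{2}, \quad
   n_2 = \sum_j \frac{u_j(u_j - 1)}{2}

where t_i is the number of tied values in the i-th group of ties on the first variable and u_j is the number of tied values in the j-th group of ties on the second variable.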



The SAS code is very similar for all three methods, with the SPEARMAN or KENDALL option stated after PROC CORR. The following code produces Pearson's correlation calculations:

   PROC CORR data=datafile;
      Var variable1 variable2;
   Run;

Spearman's and Kendall's correlations have similar levels of power, and are requested by adding the corresponding option to the PROC CORR statement (a combined sketch is given below).
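A minimal sketch, using the same placeholder dataset and variable names, that requests all three correlation types in a single call:

   * Hypothetical sketch: Pearson, Spearman, and Kendall correlations ;
   * from one PROC CORR step.                                         ;
   PROC CORR data=datafile pearson spearman kendall;
      Var variable1 variable2;
   Run;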

CONCLUSION

Statistical analyses may be invalid if the assumptions behind those tests are violated. Prior to conducting analyses, the distribution of the data should be examined for departures from normality, such as skewness or outliers. If the data are normally distributed, and other assumptions are met, parametric tests are the most powerful. If the data are non-normal but other criteria are met, non-parametric tests provide valid analyses. When neither set of assumptions has been met, both tests should be implemented to see if they agree. Also, further research should be done to discover whether the appropriate parametric or non-parametric test is the most robust to specific data issues.

This paper discusses the most commonly used non-parametric tests, and their parametric equivalents. A variety of other non-parametric tests are available in SAS, both in PROC NPAR1WAY and in other procedures such as PROC FREQ.

REFERENCES

Breckler, S. J. (1990). Application of covariance structure modeling in psychology: Cause for concern? Psychological Bulletin, 107, 260-273.

Glass, G. V., Peckham, P. D., and Sanders, J. R. (1972). Consequences of failure to meet the assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research, 42, 237-288.

Hollander, M. and Wolfe, D. A. (1973). Nonparametric Statistical Methods. New York: John Wiley & Sons, Inc.

Keselman, H. J., Huberty, C., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., Kowalchuk, R. K., Lowman, L. L., Petoskey, M. D., and Keselman, J. C. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350-386.

Manning, W. G. and Mullahy, J. (2001). Estimating log models: to transform or not to transform? Journal of Health Economics, 20, 461-494.

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.

Nanna, M. J. and Sawilowsky, S. S. (1998). Analysis of Likert data in disability and medical rehabilitation research. Psychological Methods, 3, 55-67.

SAS Institute Inc. (2003). SAS OnlineDoc, Version 8. Cary, NC: SAS Institute Inc.

Stat Soft online textbook (1994-2003). (Accessed July 25, 2004.)

UCLA Academic Technology Services. "SAS Class Notes 2.0 – Analyzing Data". (Accessed July 9, 2004.)

Wuensch, K. L. Psychology Dept., East Carolina University.

Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions. Journal of Experimental Education, 67, 55-68.

Zimmerman, D. W. (2000). Statistical significance levels of nonparametric tests biased by heterogeneous variances of treatment groups. Journal of General Psychology, October 2000.

Zimmerman, D. W. (2001). The effect of selection of samples for homogeneity on Type I error rates. InterStat.

Zimmerman, D. W. (2003). A warning about the large-sample Wilcoxon-Mann-Whitney test. Understanding Statistics, 2(4), 267-280.

Zimmerman, D. W. A warning about significance tests of differences between paired samples. Unpublished.

CONTACT

Your comments and questions are valued and encouraged. Contact the authors at:

Venita DePuy                      Paul A. Pappas
DCRI                              DCRI
PO Box 17969                      PO Box 17969
Durham, NC 27715                  Durham, NC 27715
(919) 668-8087                    (919) 668-8542
Fax: (919) 668-7124               Fax: (919) 668-7053
[email protected]               [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.