Invasive Plant Researchers Should Calculate Effect Sizes, Not P-Values Author(S): Matthew J
Total Page:16
File Type:pdf, Size:1020Kb
Invasive Plant Researchers Should Calculate Effect Sizes, Not P-Values Author(s): Matthew J. Rinella and Jeremy J. James Source: Invasive Plant Science and Management, 3(2):106-112. 2010. Published By: Weed Science Society of America DOI: 10.1614/IPSM-09-038.1 URL: http://www.bioone.org/doi/full/10.1614/IPSM-09-038.1 BioOne (www.bioone.org) is an electronic aggregator of bioscience research content, and the online home to over 160 journals and books published by not-for-profit societies, associations, museums, institutions, and presses. Your use of this PDF, the BioOne Web site, and all posted and associated content indicates your acceptance of BioOne’s Terms of Use, available at www.bioone.org/page/terms_of_use. Usage of BioOne content is strictly limited to personal, educational, and non-commercial use. Commercial inquiries or rights and permissions requests should be directed to the individual publisher as copyright holder. BioOne sees sustainable scholarly publishing as an inherently collaborative enterprise connecting authors, nonprofit publishers, academic institutions, research libraries, and research funders in the common goal of maximizing access to critical research. Invasive Plant Science and Management 2010 3:106–112 Review Invasive Plant Researchers Should Calculate Effect Sizes, Not P-Values Matthew J. Rinella and Jeremy J. James* Null hypothesis significance testing (NHST) forms the backbone of statistical inference in invasive plant science. Over 95% of research articles in Invasive Plant Science and Management report NHST results such as P-values or statistics closely related to P-values such as least significant differences. Unfortunately, NHST results are less informative than their ubiquity implies. P-values are hard to interpret and are regularly misinterpreted. Also, P- values do not provide estimates of the magnitudes and uncertainties of studied effects, and these effect size estimates are what invasive plant scientists care about most. In this paper, we reanalyze four datasets (two of our own and two of our colleagues; studies put forth as examples in this paper are used with permission of their authors) to illustrate limitations of NHST. The re-analyses are used to build a case for confidence intervals as preferable alternatives to P- values. Confidence intervals indicate effect sizes, and compared to P-values, confidence intervals provide more complete, intuitively appealing information on what data do/do not indicate. Key words: Estimation, statistics, confidence interval, P-values, Null hypothesis significance testing. P-values …, after being widely seeded by statisticians, are Null hypothesis significance testing (NHST) is very now well established in the wild, and often appear as widely used in invasive plant science. Of the papers in pernicious weeds rather than a useful crop (comment by I. Invasive Plant Science and Management that present data M. Wilson in Nelder 1999). amenable to NHST, over 95% report NHST results such as Contrary to common dogma, tests of statistical null P-values or statistics closely related to P-values such as least hypotheses have relatively little utility in science and are significant differences. Despite its widespread use, NHST not a fundamental aspect of the scientific method has serious practical limitations. For one thing, P-values are (Anderson et al. 2000). hard to interpret and often are misinterpreted. Also, P- … p values are neither objective nor credible measures of values do not answer the two questions invasive plant evidence in statistical significance testing. Moreover, the scientists care about most: (1) What is the magnitude of the authenticity of many published studies with p , .05 treatment effect? and (2) With what level of precision did findings must be called into question. Rather than the the study estimate the treatment effect? Over the past half preoccupation with p values … the goal … should be the century, these and other limitations have prompted harsh estimation of sample statistics, effect sizes and the confidence critique of NHST from scientists in virtually every field that intervals (CIs) surrounding them (Hubbard and Lindsay analyzes data (see 402 citations at http://welcome.warnercnr. 2008). colostate.edu/,anderson/thompson1.html), with critics in ecology and allied disciplines growing increasingly adamant over the past decade (Anderson et al. 2000; Fidler et al. 2006; Martinez-Abrain 2007, and many others). DOI: 10.1614/IPSM-09-038.1 This paper advocates for confidence intervals as alterna- * First author: Rangeland Ecologist, United States Department of tives to NHST. We begin with a brief review of NHST and Agriculture, Agricultural Research Service, 243 Fort Keogh Road, Miles confidence intervals. Then, we address the following four City, MT 59301; second author: Plant Physiologist, United States points in the course of illustrating advantages of confidence Department of Agriculture, Agricultural Research Service, Eastern intervals over NHST: (1) Nonsignificant statistical tests are Oregon Agricultural Research Center, 67826-A Hwy 205, Burns, OR not evidence for null hypotheses; (2) Effect sizes are 97720. Corresponding author’s E-mail: [email protected] uncertain, even when statistical tests are significant; (3) 106 N Invasive Plant Science and Management 3, April–June 2010 Significance thresholds are arbitrary; and (4) One minus the Although not technically correct, this interpretation is P-value is not the probability the alternative hypothesis is sometimes acceptable because it is the correct interpretation true. To illustrate these points, we reanalyzed four published for Bayesian confidence intervals, and Bayesian and datasets from the plant science literature (two of our own classical confidence intervals are identical under particular and two of our colleagues’). We fit the same or very similar sets of assumptions/conditions. Specifically, Bayesian and models that the authors originally fit, but instead of P-values, classical confidence intervals for normally distributed data we present confidence intervals on effect sizes. For example, are identical when a particular noninformative Bayesian in some cases we present confidence intervals on effect size 5 prior distribution is used (p. 355, Gelman et al. 2004). We treatment 2 control. present several 50 and 95% confidence intervals in this paper, and given the noninformative priors we used, it is acceptable to view these intervals as having a 0.50 and 0.95 Null Hypothesis Significance Testing and probability of bracketing the parameter. Confidence Intervals Consider a hypothetical experiment designed to evaluate Nonsignificant Statistical Tests Are Not Evidence for the effect of a weed control treatment on desired species productivity. The hypothetical treated and untreated plots Null Hypotheses were arranged in a completely randomized design. An Failing to reject a null hypothesis (large P-value) does expression for the data is: not suggest the null hypotheses is true or even approxi- mately true (Fisher 1929). We reanalyzed one of our y ~b zb x ze ½1 i 0 1 i i published datasets to illustrate this point. 22 where yi is desired plant biomass (g m )inploti, xi equals To evaluate the ability of plant groups to repel weed 0 if plot i was not treated and 1 if plot i was treated, and ei invasions, James et al. (2008) sowed medusahead (Tae- is normally distributed random error. Equation 1 can be niatherum caput-medusae (L.) Nevski) seeds in plots after thought of as a simple ANOVA model. With this scenario removing either (1) nothing, (2) annual forbs, (3) perennial for the xi values, b0 represents the mean of untreated plots forbs, or (4) bunchgrasses. The authors tested null and b1 describes the effect of treatment. As is typical in hypotheses that weed densities did not differ between invasive plant science, the hypothetical authors conduct a treatments, and instituted Bonferonni corrections to point null hypothesis test of b1 5 0; i.e., weed control had maintain an experiment-wise error rate of 5%. The null no effect. The authors use the F-test to calculate a P-value, hypothesis was rejected only for bunchgrasses and it was which has the following interpretation: Pr (observed data or concluded that ‘‘Bunchgrasses were the only functional data more extreme| b1 5 0). In other words, a P-value is the group that inhibited T. caput-medusae establishment.’’ probability of witnessing the observed data, or data more On one hand, James et al. (2008) are to be congratulated extreme, given that b1 5 0. If this probability is lower than for reporting dispersion statistics (i.e., standard errors), a some threshold statistical significance level (e.g., 0.05), the practice advocated by many (e.g., Anderson et al. 2001; authors will conclude that the null hypothesis is unlikely in Nagele 2001) (Figure 1). On the other hand, the large light of the data; i.e., they will reject the null hypothesis that standard errors were clearly ignored in blithely concluding b1 5 0. It is important to comprehend what the P-value is forb removals had no effect. Instead, it is far more likely not. It is not the probability that b1 5 0. that every removal treatment had some effect, and that An increasing number of authors avoid P-values and measurement error and small sample size prevented the instead use confidence intervals to describe the range of authors from detecting these effects. Many authors believe likely values of parameters such as b1 (e.g., Nickerson 2000; it is practically impossible for treatments to have no effect Stephens et al. 2007). Strictly defined, a 95% confidence to an infinite number of decimal points (i.e., a true point interval is one realization of a procedure (collect data, null hypothesis) (e.g., Cohen 1994; Kirk 1996; Tukey calculate interval) that has a 95% success rate at bracketing 1991; but see Guthery et al. 2001). In a reanalysis, we the true parameter value (Berry and Lindgren 1996). (The calculated confidence intervals for the treatment effects interpretations for other confidence intervals such as 50%, (Figure 1).