A Review of Statistical Power Analysis Software
Bulletin of the Ecological Society of America

DEPARTMENTS

Technological Tools

Note: Dr. David Inouye is the editor of the Technological Tools section. Anyone wishing to contribute articles or reviews to this section should contact him at the Department of Zoology, University of Maryland, College Park, MD 20742, e-mail: di5@umail.umd.edu.

Although ecologists have become increasingly sophisticated in applying tests for statistical significance, few are aware of the power of these tests. Statistical power is the probability of getting a statistically significant result given that there is a biologically real effect in the population being studied. If a particular test is not statistically significant, is it because there is no effect or because the study design makes it unlikely that a biologically real effect would be detected? Power analysis can distinguish between these alternatives, and is therefore a critical component of designing experiments and testing results (Toft and Shea 1983, Rotenberry and Wiens 1985, Peterman 1990, Fairweather 1991, Taylor and Gerrodette 1993, Thomas and Juanes 1996).

Discussions with colleagues suggest that one major obstacle to the use of power analysis is the perceived lack of computer software. Our paper is designed to remove this obstacle, by reviewing numerous packages that calculate power or sample size and determining which are likely to be of use to researchers (see also Goldstein 1989). We do not deal with software for planning the precision of studies, although this is an important consideration if the results are to be analyzed using confidence intervals rather than P values (Greenland 1988, Krebs 1989). Because of software updates and new program releases, our program evaluations will quickly become out of date. Nevertheless, we highlight some important and timeless issues to consider when evaluating power software: scope, flexibility, accuracy, ease of use, and ability to deal with prospective and retrospective analyses.

Readers interested only in "the bottom line" should skip straight to the Conclusions by way of the summary evaluation in Table 2. Those less experienced with power analysis will find the Background, Stand-alone power and sample size software, and Discussion sections of most use. Experienced power analysts will be most interested in the section on Programming power analysis using general purpose statistical software and Table 4.

Background

The concepts of statistical power are covered in detail in a number of texts (Kraemer and Thiemann 1987, Cohen 1988, Lipsey 1990; see also a particularly clear paper by Muller and Benignus 1992). Briefly, the power of a test is the probability of rejecting the null hypothesis given that the alternative hypothesis is true. Power depends on the type of test, increases with increasing sample size, effect size, and higher α level, and declines with increasing sampling variance. Effect size is the difference between the null and alternative hypotheses, and can be measured either using raw or standardized values. Raw measures, such as the slope in a regression analysis or difference between means in a t test, are closer to the measurements that researchers take and so are easier to visualize and interpret. Standardized measures, such as the correlation coefficient or d value (difference in means divided by the standard deviation), are dimensionless and incorporate the sampling variance implicitly, removing the need to specify variance when calculating power.

Power analysis is most useful when planning a study. Such "prospective" power analyses are usually exploratory in nature, investigating the relationship between the range of sample sizes that are deemed feasible, effect sizes thought to be biologically important, levels of variance that could exist in the population (usually taken from the literature or from pilot data), and desired levels of α and statistical power. The result is a decision about the sample size and α level that will be used in the study, and the target effect size that will be "detectable" with the given level of statistical power.

After the study is completed and the results analyzed, a "retrospective" power analysis can also be useful if a statistically nonsignificant result was obtained (e.g., Thomas and Juanes 1996). Here the actual sample size and α level are known, and the variance observed in the sample provides an estimate of the variance in the population. These values are used to calculate power at the minimum effect size thought to be of biological significance, or alternatively the effect size detectable with the minimum desired level of power. Note that it is rarely useful to calculate power using the effect size observed in the sample: such analyses tell us nothing about the ability of the test to detect biologically important results (Thomas 1997).

Power calculations can be done using the tables or charts provided in many articles and texts (e.g., Kraemer and Thiemann 1987, Cohen 1988, Lipsey 1990, Zar 1996). However, these often require some hand calculations before they can be used, including interpolation between tabled values, and can give inaccurate results in some situations (e.g., see Bradley et al. 1996 regarding the accuracy of Cohen's ANOVA tables). Computer software has the potential to make power analysis more accurate, interactive, and easy to perform.

Ideally, power analysis should be integrated within the general purpose statistical software researchers use for their regular analyses. This means having a comprehensive "study planning" module for prospective power analysis, and options to produce estimates of retrospective power or detectable effect size for each test performed. Failing this, researchers could use stand-alone software designed solely for power analysis. In either case an ideal program should: (1) cover the test situations most commonly encountered by researchers; (2) be flexible enough to deal with new or unusual situations; (3) produce accurate results; (4) calculate power, sample size, and detectable effect size; (5) allow easy exploration of multiple values of input parameters; (6) take a wide variety of effect size measures as input, both raw and standardized; (7) allow estimation of

tables and graphs for inclusion in reports; (10) allow easy transfer of results to other applications; and (11) be well documented.

The third point, on accuracy, requires further explanation. For some tests, such as z tests and those based on discrete distributions like the binomial, both the P value and statistical power are calculated using the same distribution function. Algorithms for computing these functions are well known, which means that accurate power calculations are easy to perform using a computer. Most t, F, and χ² tests, however, require calculation of a different distribution function for power than for significance tests, called a "noncentral" distribution function. Efficient algorithms for computing noncentral distribution functions have been developed only recently, and it was previously common to calculate power using approximate methods. Some programs still use approximations, a topic we will return to in the Discussion.

Methods

We compiled a list of software capable of performing power or sample size calculations by searching the published literature and the Internet, word of mouth, and requests to Internet news groups. From this preliminary list, we excluded a few specialized programs with limited scope that we considered unlikely to be of use to ecologists (CRCSIZ, EpiINFO, ECHIP, HYPERSTAT, SSIZE, R2; more information about these programs is available at our World Wide Web site http://www.interchg.ubc.ca/cacb/power/). We also excluded general purpose statistical software where power capabilities are not built in but can be programmed, although we briefly discuss these programs later in the paper. For all remaining software, we contacted the vendors asking for a review copy, or downloaded the software if it was available over the Internet.

that replied, all agreed to send review copies, although SPSS Inc. (SPSS and SYSTAT DESIGN) failed to deliver anything.

We reviewed three aspects of each program: scope (points 1-3 above), ease of use (4-10), and ease of learning (11; also the intuitiveness of the program layout, and the clarity and information content of the on-line help). For scope, we compiled a list of the test situations explicitly mentioned in the program documentation or on screen. We also recorded whether the program uses exact or approximate methods to calculate power. We summarized ease of use and ease of learning on a subjective scale from excellent through very good, good, fair, and poor. Note that ease of use is more important than ease of learning for most users, because all the programs can be learned in a day or two.

After our initial review, we selected the four most promising packages and asked a class of 19 graduate students to evaluate them as part of a 2-week "power module" (lectures and seminars) in an advanced ecology class. Students in the class had a wide range of statistical expertise, but none had prior practical experience of power analysis. During the module, students were asked to complete six power analysis problems using each of the four programs, and then to try a problem from their own research. The six questions covered a t test, two-way ANOVA with contrasts, trend analysis (regression), comparison of proportions, nonparametric test (Wilcoxon), and survival analysis. Some were framed in terms of study design, others as retrospective analyses. As part of their report, students independently completed an evaluation form and ranked the packages on the basis of which they would recommend to their colleagues.

We sent a draft of this paper to the vendors or authors of the program