The Role of Hypothesis Testing in Wildlife Science
University of Nebraska - Lincoln, DigitalCommons@University of Nebraska - Lincoln
USGS Northern Prairie Wildlife Research Center, US Geological Survey, 2002

Suggested citation: Johnson, Douglas H., "The Role of Hypothesis Testing in Wildlife Science" (2002). USGS Northern Prairie Wildlife Research Center. 229. https://digitalcommons.unl.edu/usgsnpwrc/229

Invited Paper: THE ROLE OF HYPOTHESIS TESTING IN WILDLIFE SCIENCE

DOUGLAS H. JOHNSON, U.S. Geological Survey, Northern Prairie Wildlife Research Center, Jamestown, ND 58401, USA (E-mail: [email protected])

Abstract: Statistical testing of null hypotheses recently has come under fire in wildlife sciences (Cherry 1998; Johnson 1999; Anderson et al. 2000, 2001). In response to this criticism, Robinson and Wainer (2002) provide some further background information on significance testing; they argue that significance testing in fact is useful in certain situations. I counter by suggesting that such situations rarely arise in our field. I agree with Robinson and Wainer that replication is the key to scientific advancement. I believe, however, that significance testing and the resulting P-values frequently are confused with issues of replication. Any single study can yield a P-value, but only consistent results from truly replicated studies will advance our understanding of the natural world.

JOURNAL OF WILDLIFE MANAGEMENT 66(2):272-276

Keywords: effect size, hypothesis test, null hypothesis, replication, significance test.

The wildlife literature recently has hosted a number of articles critical of the statistical testing of null hypotheses (hereafter, significance testing). Cherry (1998), Johnson (1999), and Anderson et al. (2000, 2001) argued that the practice is applied too often and inappropriately. Scientists in other disciplines have had similar discussions (see Johnson 1999 for references). Robinson and Wainer (2002; hereafter RW) provide some history of significance testing and its background. Understanding how significance testing arose is helpful in appreciating its strengths and weaknesses. Robinson and Wainer further offer some examples intended to illustrate situations in which significance testing of null hypotheses can be appropriate. I agree with much of what RW say. In fact, they restate and reinforce many of the points made in the articles to which they are responding. For example, their emphasis on R. A. Fisher's view of the importance of replication is consistent with points made by Johnson (1999). While I do not dispute most of the points RW make, I question whether some of these points have utility to the wildlife profession. In this commentary, I focus on the issues on which I do not fully concur with RW. I offer comments of 2 types: (1) general responses to RW, and (2) remarks related to the points made by RW as they specifically apply to wildlife situations.

GENERAL RESPONSES

Robinson and Wainer (2002) claim to have seen no evidence that significance testing is misused any more often than any other statistical procedure. Exact numbers would be difficult to calculate, but by its very nature, significance testing of null hypotheses must be misused more than other procedures. This is true because significance testing is integral to the misuse of most other procedures. For example, perhaps the most easily misused statistical procedure is stepwise regression (e.g., Draper et al. 1971, Pope and Webster 1972, Hurvich and Tsai 1990, Thompson 1995). Stepwise regression includes or excludes variables in a model depending on their P-values. That is, model selection is based on tests of hypotheses that the effect of individual explanatory variables on a response variable is zero, given the other explanatory variables already included in the model. A second kit of readily misused statistical tools is multivariate methods (Armstrong 1967, Johnson 1981, Rexstad et al. 1988). Again, a key aspect of their misuse is based on significance tests; multivariate procedures permit a plethora of null hypotheses to be tested. Other examples could be cited, but the mere fact that significance tests are central to so many statistical procedures perforce implies that they are misused more than any single procedure.
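As an illustration of those mechanics, the sketch below (not part of Johnson's commentary; the data, variable names, and 0.05 cutoff are invented for illustration) implements backward elimination: an ordinary least-squares model is refit repeatedly, and at each step the predictor with the largest P-value for the null hypothesis that its coefficient is zero is discarded.

```python
# A minimal, hypothetical sketch of P-value-driven backward elimination.
# The data, variable names, and 0.05 cutoff are invented for illustration;
# they are not taken from Johnson's commentary or from Robinson and Wainer.
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit y = Xb by least squares; return coefficients and two-sided P-values."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)              # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of the estimates
    t = beta / np.sqrt(np.diag(cov))              # t statistic for each coefficient
    p = 2 * stats.t.sf(np.abs(t), df=n - k)       # test of H0: coefficient = 0
    return beta, p

def backward_eliminate(X, y, names, alpha=0.05):
    """Repeatedly drop the predictor with the largest P-value until all pass alpha."""
    keep = list(range(X.shape[1]))                # column indices still in the model
    while len(keep) > 1:
        _, p = ols_pvalues(X[:, keep], y)
        worst = max(range(1, len(keep)), key=lambda i: p[i])  # never drop the intercept
        if p[worst] < alpha:                      # every remaining test is "significant"
            break
        del keep[worst]                           # exclusion decided by a null-hypothesis test
    return [names[i] for i in keep]

# Toy data: only x1 truly affects the response; x2 and x3 are noise.
rng = np.random.default_rng(1)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])
print(backward_eliminate(X, y, ["intercept", "x1", "x2", "x3"]))
```

Every inclusion or exclusion decision in this sketch rests on a significance test of a single coefficient, which is precisely the dependence on null-hypothesis testing described above.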
As an example of misused statistics, RW suggest that the mean is inappropriate when the underlying distribution contains outliers. That statement may be true in some instances, but not in others; the appropriateness of any statistical procedure depends on the objective of the analysis. Consider an example involving a known population of 51 pheasant hunters in a county. I will simplify the example by supposing that 25 of them get no birds during the season, 25 of them bag 1 bird each, and a single individual shoots 179 birds. The mean bag is 4 birds. The median bag is 1 bird. Which of these 2 measures of central tendency is more appropriate? If the objective is to characterize the typical hunter, then the median (1 bird) clearly portrays the harvest of this hunter better than the mean does. So the median might be the measure of choice in a human dimensions study. Suppose, however, that interest was in the dynamics of the pheasant population. Then we would be concerned about the total harvest, which is the mean bag × the number of hunters (here, 4 × 51 = 204 birds). In this situation, the mean, not the median, is the appropriate measure, even though the distribution contains a wild outlier. The blanket statement by RW about the propriety of the mean is misleading: the objective of a particular investigation must be considered.

Robinson and Wainer (2002) agree with Guthery et al. (2001) that adoption of information-theoretic methods in place of significance testing would still involve an arbitrary numerical criterion to judge the strength of evidence in single studies. I disagree. One of the advantages of the information-theoretic approach is that it allows a set of models to be ranked, based on the support each model receives from the data. Further, information-theoretic methods lend themselves nicely to model averaging (Burnham and Anderson 1998, Anderson et al. 2000). Model averaging involves consideration of the full set of meaningful models that are supported by the data. Instead of settling on a single best model, all supported models are considered, with weights related to the strength of evidence for each model. That process is in stark contrast to what is usually done by significance-testing approaches to model selection, in which a single model is chosen based on tests of null hypotheses that the effects of variables are exactly zero.
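To make that contrast concrete, the brief sketch below follows the Akaike-weight recipe associated with Burnham and Anderson (1998): each candidate model's AIC difference from the best-supported model is converted into a weight, and an estimate is then averaged over all supported models rather than taken from a single "best" model. The AIC values and per-model predictions are hypothetical numbers chosen only for the example.

```python
# A hypothetical sketch of Akaike-weight model averaging; the AIC values and
# per-model predictions below are invented solely for illustration.
import numpy as np

aic = np.array([310.2, 311.0, 314.6, 321.9])      # AIC for 4 candidate models
delta = aic - aic.min()                           # differences from the best model
weights = np.exp(-0.5 * delta)
weights /= weights.sum()                          # Akaike weights: relative support, summing to 1

predictions = np.array([12.4, 11.8, 13.1, 10.2])  # each model's estimate of some quantity
averaged = float(np.sum(weights * predictions))   # model-averaged estimate

for d, w in zip(delta, weights):
    print(f"delta AIC = {d:4.1f}   weight = {w:.3f}")
print(f"model-averaged estimate = {averaged:.2f}")
```

Nothing in this calculation tests a null hypothesis that an effect is exactly zero; the weights simply express the relative support the data give each model, which is the distinction drawn in the paragraph above.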
Replication

Robinson and Wainer (2002) remind us of the importance of replication, pointing out R. A. Fisher's perspective of science as a continuing process. They cite (2002:265) with evident approval Fisher's belief that significance testing "only made sense in the context of a continuing series of experiments that were aimed at confirming the size and direction of the effects of specific treatments." They also cite Tukey (1969) and note (2002:269) that "statistically significant results that are replicated provide the basis of scientific truth." I wholeheartedly concur.

[...] are of too small a scale to yield results of unquestioned significance. Robinson and Wainer (2002) note that Fisher was not concerned with what we call Type I errors (claiming an effect to be real when it is not); he thought that continued replications would demonstrate that the effect was not real. In wildlife applications, however, we rarely seek to replicate studies. Many studies simply cannot be replicated because of conditions that vary markedly from 1 occasion to another. Further, we often (too often, in my opinion) urge managers to take action based on the results of a single study. Indeed, authors writing for The Journal of Wildlife Management (JWM) are strongly encouraged to include a Management Implications section. The authors call managers to arms based on results of their single (probably unreplicated) study, in which a Type I error may have occurred. This is not the situation Fisher envisaged.

Johnson (1999) emphasized that replication is a cornerstone of science and referred to Carver's (1978) point that statistical significance generally is interpreted as relating to replication. Johnson (1999:768) even noted Fisher's idea of "repeatedly getting results significant at 5%." Key to this issue is the comment by Bauernfeind (1968) that replicated results automatically make statistical significance testing unnecessary. Fisher argued that a result significant at 5% provided motivation to continue studying the phenomenon. Once such a series of experiments has been conducted, however, and most of them have provided P-values less than 0.05, those individual P-values are of little relevance.

In my view, the interpretation of any statistical evidence (e.g., P-values, estimated effect sizes, confidence intervals) makes sense only if the interpretation is grounded in the context of prior related findings. Even if no individual study obtained statistically significant results, important truth may have been discovered if the effect sizes from a series of studies were consistent. Indeed, proponents of significance testing (e.g., Robinson and Levin 1997) and its opponents (e.g., Thompson 1996) agree on the importance of replication in research.

Detractors of significance testing, however, argue that too many researchers erroneously interpret statistical significance as necessary and sufficient evidence that results are replicable (Cohen 1994). Without statistical significance tests, such researchers would be forced to com-