University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln

USGS Northern Prairie Wildlife Research Center US Geological Survey

2002

The Role of Hypothesis Testing in Wildlife Science

Douglas H. Johnson USGS Northern Prairie Wildlife Research Center, [email protected]

Follow this and additional works at: https://digitalcommons.unl.edu/usgsnpwrc

Part of the Other International and Area Studies Commons

Johnson, Douglas H., "The Role of Hypothesis Testing in Wildlife Science" (2002). USGS Northern Prairie Wildlife Research Center. 229. https://digitalcommons.unl.edu/usgsnpwrc/229

This Article is brought to you for free and open access by the US Geological Survey at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in USGS Northern Prairie Wildlife Research Center by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln.

Invited Paper: THE ROLE OF HYPOTHESIS TESTING IN WILDLIFE SCIENCE

DOUGLAS H. JOHNSON,1 U.S. Geological Survey, Northern Prairie Wildlife Research Center, Jamestown, ND 58401, USA

Abstract: Statistical testing of null hypotheses recently has come under fire in wildlife sciences (Cherry 1998; Johnson 1999; Anderson et al. 2000, 2001). In response to this criticism, Robinson and Wainer (2002) provide some further background information on significance testing; they argue that significance testing in fact is useful in certain situations. I counter by suggesting that such situations rarely arise in our field. I agree with Robinson and Wainer that replication is the key to scientific advancement. I believe, however, that significance testing and resulting P-values frequently are confused with issues of replication. Any single study can yield a P-value, but only consistent results from truly replicated studies will advance our understanding of the natural world.

JOURNAL OF WILDLIFE MANAGEMENT 66(2):272-276

Key words: effect size, hypothesis test, null hypothesis, replication, significance test.

The wildlife literature recently has hosted a number of articles critical of the statistical testing of null hypotheses (hereafter, significance testing). Cherry (1998), Johnson (1999), and Anderson et al. (2000, 2001) argued that the practice is applied too often and inappropriately. Scientists in other disciplines have had similar discussions (see Johnson 1999 for references). Robinson and Wainer (2002; hereafter RW) provide some history of significance testing and its background. Understanding how significance testing arose is helpful in appreciating its strengths and weaknesses. Robinson and Wainer further offer some examples intended to illustrate situations in which significance testing of null hypotheses can be appropriate. I agree with much of what RW say. In fact, they restate and reinforce many of the points made in articles to which they are responding. For example, their emphasis on R. A. Fisher's view of the importance of replication is consistent with points made by Johnson (1999).

While I do not dispute most of the points RW make, I question whether some of these points have utility to the wildlife profession. In this commentary, I focus on the issues on which I do not fully concur with RW. I offer comments of 2 types: (1) general responses to RW, and (2) remarks related to the points made by RW as they specifically apply to wildlife situations.

1 E-mail: [email protected]

GENERAL RESPONSES

Robinson and Wainer (2002) claim to have seen no evidence that significance testing is misused any more often than any other statistical procedure. Exact numbers would be difficult to calculate, but by its very nature, significance testing of null hypotheses must be misused more than other procedures. This is true because significance testing is integral to the misuse of most other procedures. For example, perhaps the most easily misused statistical procedure is stepwise regression (e.g., Draper et al. 1971, Pope and Webster 1972, Hurvich and Tsai 1990, Thompson 1995). Stepwise regression includes or excludes variables in a model depending on their P-values. That is, model selection is based on tests of hypotheses that the effect of individual explanatory variables on a response variable is zero, given the other explanatory variables already included in the model. A second kit of statistical tools that are readily misused are multivariate methods (Armstrong 1967, Johnson 1981, Rexstad et al. 1988). Again, a key aspect involving their misuse is based on significance tests; multivariate procedures permit a plethora of null hypotheses to be tested. Other examples could be cited, but the mere fact that significance tests are central to so many statistical procedures perforce implies that they are misused more than any single procedure.

As an example of misuse, RW suggest that the mean is inappropriate when the underlying distribution contains outliers. That statement may be true in some instances, but not in others; the appropriateness of any statistical procedure depends on the objective of the analysis. Consider an example involving a known population of 51 pheasant hunters in a county. I will simplify the example by supposing that 25 of them get no birds during the season, 25 of them bag 1 bird each, and a single individual shoots 179 birds. The mean bag is 4 birds. The median bag is 1 bird. Which of these 2 measures of central tendency is more appropriate? If the objective is to characterize the typical hunter, then the median (1 bird) clearly portrays the harvest of this hunter better than the mean does. So the median might be the measure of choice in a human dimensions study. Suppose, however, that interest was in the dynamics of the pheasant population. Then we would be concerned about the total harvest, which is the mean bag × the number of hunters. In this situation, the mean-not the median-is the appropriate measure, even though the distribution contains a wild outlier. The blanket statement by RW about the propriety of the mean is misleading: the objective of a particular investigation must be considered.

Robinson and Wainer (2002) agree with Guthery et al. (2001) that adoption of information-theoretic methods in place of significance testing would still involve an arbitrary numerical criterion to judge the strength of evidence in single studies. I disagree. One of the advantages of the information-theoretic approach is that it allows a set of models to be ranked, based on the support each model receives from the data. Further, information-theoretic methods lend themselves nicely to model averaging (Burnham and Anderson 1998, Anderson et al. 2000). Model averaging involves consideration of the full set of meaningful models that are supported by the data. Instead of settling on a single best model, all supported models are considered, with weights related to the strength of evidence for each model. That process is in stark contrast to what is usually done by significance-testing approaches to model selection, in which a single model is chosen based on tests of null hypotheses that the effects of variables are exactly zero.

Replication

Robinson and Wainer (2002) remind us of the importance of replication, pointing out R. A. Fisher's perspective of science as a continuing process. They cite (2002:265) with evident approval Fisher's belief that significance testing "only made sense in the context of a continuing series of experiments that were aimed at confirming the size and direction of the effects of specific treatments." They also cite Tukey (1969) and note (2002:269) that "statistically significant results that are replicated provide the basis of scientific truth." I wholeheartedly concur. Studies conducted to understand phenomena generally are of too small a scale to yield results of unquestioned significance. Robinson and Wainer (2002) note that Fisher was not concerned with what we call Type I errors, claiming an effect to be real when it is not; he thought that continued replications would demonstrate that the effect was not real. In wildlife applications, however, we rarely seek to replicate studies. Many studies simply cannot be replicated because of conditions that vary markedly from 1 occasion to another. Further, we often (too often, in my opinion) urge managers to take action based on the results of a single study. Indeed, authors writing for The Journal of Wildlife Management (JWM) are strongly encouraged to include a Management Implications section. The authors call managers to arms based on results of their single-probably unreplicated-study, in which a Type I error may have occurred. This is not the situation Fisher envisaged.

Johnson (1999) emphasized that replication is a cornerstone of science and referred to Carver's (1978) point that statistical significance generally is interpreted as relating to replication. Johnson (1999:768) even noted Fisher's idea of "repeatedly getting results significant at 5%." Key to this issue is the comment by Bauernfeind (1968) that replicated results automatically make statistical significance testing unnecessary. Fisher argued that a result significant at 5% provided motivation to continue studying the phenomenon. Once such a series of studies has been conducted, however-and most of them have provided P-values less than 0.05-those individual P-values are of little relevance.

In my view, the interpretation of any statistical evidence (e.g., P-values, estimated effect sizes, confidence intervals) makes sense only if the interpretation is grounded in the context of prior related findings. Even if no individual study obtained statistically significant results-but the effect sizes from a series of studies were consistent-important truth may have been discovered. Indeed, proponents of significance testing (e.g., Robinson and Levin 1997) and its opponents (e.g., Thompson 1996) agree on the importance of replication in research.

Detractors of significance testing, however, argue that too many researchers erroneously interpret statistical significance as necessary and sufficient evidence that results are replicable (Cohen 1994). Without statistical significance tests, such researchers would be forced to compare their effect sizes directly with those from similar studies or actually to conduct further replicated studies.
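That kind of direct comparison of effect sizes across studies can be sketched numerically. The short Python sketch below uses hypothetical effect sizes (mean differences) and standard errors, invented purely for illustration, to show how a run of individually nonsignificant studies can still exhibit a consistent effect:

```python
# Hypothetical effect sizes and standard errors from 4 replicated studies.
# These numbers are invented for illustration; they come from no real data.
studies = [
    ("Study A", 0.42, 0.25),
    ("Study B", 0.38, 0.30),
    ("Study C", 0.45, 0.24),
    ("Study D", 0.40, 0.28),
]

for name, effect, se in studies:
    z = effect / se                                   # Wald statistic
    lo, hi = effect - 1.96 * se, effect + 1.96 * se   # 95% CI, normal approx.
    signif = abs(z) > 1.96                            # "significant" at alpha = 0.05
    print(f"{name}: effect = {effect:.2f}, "
          f"95% CI = ({lo:.2f}, {hi:.2f}), significant at 0.05: {signif}")

# None of these hypothetical studies reaches P < 0.05 on its own, yet the
# effect sizes agree in sign and magnitude. The consistent pattern across
# replicated studies, not any single P-value, is the evidence that matters.
```

Each study alone would be reported as "no effect" under a strict significance-testing reading, even though every estimate points the same way; laying the effect sizes and intervals side by side makes the agreement obvious.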

The researchers also would be forced to argue explicitly that their effects are practically-as opposed to only statistically-significant. Doing so would be a positive contribution to most areas of science.

Proving the Null Hypothesis

Robinson and Wainer (2002) note that critics of significance testing worry that researchers too commonly interpret results with P > 0.05 as indicating no effect. The critics are right: researchers do exactly that. A scan of the first few papers in a recent issue of JWM (Volume 65) found several examples in which authors determined that there was no effect after finding P > 0.05, even if P just barely exceeded 0.05. Among these instances were: "the probability of being detected in at least 1 month (p*) did not differ from 1 (P > 0.058)"; "overlap between female and male core areas differed neither in early (U = 30, P = 0.052) nor in late spring (U = 51.0, P = 0.144)"; and "Daily nest survival was not significantly different between regeneration methods for ... yellow-breasted chat (χ² = 3.28, df = 1, P = 0.07)." The problem of declaring no effect arises especially often when interactions are examined. If the interactions are real, even if not statistically significant, interpretation of main effects is confounded. Yet authors typically provide little information about interactions. One of the scanned articles in the JWM issue was characteristic: "[I]nteractions ... were not significant. Therefore, we ..." No evidence, not even a P-value, was provided to demonstrate that the interactions really were negligible and could safely be ignored.

Scientific versus Statistical Hypotheses

Robinson and Wainer (2002) state that not all P-values are unimportant and refer specifically, if obliquely, to Albert Einstein and a hypothesis about the speed of light. Without knowing exactly what hypothesis RW are referring to, I would suggest that it likely represents a scientific, as opposed to a statistical, hypothesis. Johnson (1999), among many others, also distinguished these kinds of hypotheses, citing Copernicus' hypothesis that the Earth revolves around the sun, in contrast to the hypothesis widely believed at the time that the sun revolved around the Earth. That scientific hypothesis was contrasted with the statistical hypotheses typically tested in JWM and many other scientific journals.

Confidence Intervals

Robinson and Wainer (2002) note that Anderson et al. (2001) recommended the use of a (1 - α) confidence interval for portraying the uncertainty of an estimate, in lieu of P-values. Robinson and Wainer suggest that a (1 - α) confidence interval is as arbitrary as rejecting or failing to reject a null hypothesis based on whether or not P < α. It is true that a (1 - α) confidence interval gives as much information as knowing whether or not P < α. But it gives much more information. The width of the confidence interval tells how well the parameter has been estimated. The distance from the hypothesized value of the parameter to the confidence interval gives a measure of the inconsistency of that value with the observed data. In contrast to a P-value, a confidence interval allows the reader to know if lack of statistical significance represents lack of effect or too small a sample size (Johnson 1999: Fig. 1). Further, the clear distinction between confidence intervals and significance testing can be seen in the realization that one cannot test statistical significance without a null hypothesis, but that confidence intervals can be obtained without nulls. A major advantage of confidence intervals is that they allow (and even facilitate) thinking "meta-analytically" about effect size and effect size replicability across studies (Anderson et al. 2000, Cumming and Finch 2001). Confidence intervals have the additional appeal that they are readily amenable to graphical presentation. Unfortunately, confidence intervals are too rarely reported in many scientific journals.

APPLICABILITY TO WILDLIFE SITUATIONS

Some of the arguments made by RW are correct but apply to few situations in wildlife science. As an example, wildlifers can only envy databases like the Cochrane Collaboration, which is based on more than 250,000 medical experiments with random assignments (presumably of treatments to subjects) and for which enough information is provided to conduct meta-analyses. We have nothing comparable, but instead do as RW (2002:265) say: we "rarely replicate results where P < 0.05...."

Robinson and Wainer argue that testing of null hypotheses can be useful when attempting to determine only the sign of an effect, rather than its sign and magnitude. They illustrate this idea with a medical research example in which a new treatment is compared to an old one. Once a treatment has been demonstrated to be superior, ethical considerations demand that the inferior treatment not be applied to additional subjects. The magnitude of the difference between treatments is not estimated; knowing the sign of the effect is sufficient to make a decision.

That example is valid but rarely relevant in the wildlife field. We generally need to know the magnitude, as well as the sign, of an effect. Consider the hypothesis: if we eliminate sport hunting on the North American mallard (Anas platyrhynchos) population, mallard survival rates will increase. Probably all of us believe this statement is true. The real question is: How much will survival rates increase? Is the increase in survival rate worthwhile compared to the loss of recreational opportunities? Similarly, we might all agree, even without study, that eliminating all animals that depredate nests of a species in an area will have a positive effect on the nesting success of that species there. But this information is not enough: we need to know how big that increase will be. Predator reduction is expensive and has social implications, and a conscientious manager wants to know what benefits will result from such costly and potentially controversial actions.

Evolutionary Operation

Evolutionary operation (EVOP) is proffered by RW as a situation in which interest lies only in the sign of an effect. As RW note, EVOP is applied in industrial settings when slight differences in manufacturing procedures (temperature, chemical inputs, etc.) are made and the direction of the effect on the product is noted (Box 1957, Box and Draper 1969). Only small changes from current settings are made, so that actual production is not compromised. Hence, effects are likely to be small, too. Robinson and Wainer note that only the direction of the change is important: did the quality of the product improve or worsen? Box and Draper (1982) provided an overview of EVOP. Interestingly, in their examples, they presented estimated effects and their standard errors, but no P-values or hypothesis tests. Evolutionary operation is akin to adaptive resource management (Walters 1986), which has gained increased popularity in wildlife and fisheries management. Both methodologies focus on learning about the system at the same time the system is managed. That is, managers want to manipulate inputs to the system to seek optimal combinations of those inputs while not varying things so much as to cause a serious reduction in the output. The major difference between the methodologies, in my view, is that natural systems have far more uncontrollable, and often unknowable, inputs than do the industrial systems for which evolutionary operation was designed. It remains to be seen whether adaptive resource management will be as successful as evolutionary operation has been.

CONCLUSIONS

Testing hypotheses is an important component of scientific endeavor. Indeed, it is integral to the hypothetico-deductive method, which is a powerful way of learning (Romesburg 1981). Key to this concept is that the hypotheses being tested are scientific, not merely statistical. That is, scientific hypotheses address fundamental, global predictions that derive from theory. Statistical hypotheses, in contrast, address local questions, usually about single populations or systems (Simberloff 1990), and the null hypotheses usually are meaningless and known a priori to be false. Most hypotheses tested in JWM are statistical in nature, not scientific. Wildlife researchers should be encouraged to employ scientific hypotheses more often and statistical hypotheses less often.

It is widely acknowledged that virtually all statistical null hypotheses are known to be false, even before any data are collected or any tests conducted (Johnson 1995, 1999; Cherry 1998; Anderson et al. 2001). Why then should a significance test be conducted? As it turns out, significance testing can be useful for determining if the null hypothesis is approximately true, if the sample size is not too small and not too large (Berger and Delampady 1987). For example, the null hypothesis that the means of 2 populations are the same (μ1 = μ2) is almost certainly false in any finite population. Nonetheless, the hypothesis will be accepted if the sample is too small. The hypothesis μ1 ≈ μ2 is much more reasonable to consider; how similar the means need to be depends on the context. This hypothesis will be accepted if the sample is too small and will be rejected if the sample is very large, but for moderate samples, the test can be meaningful.

Testing null hypotheses is seldom useful or necessary. The examples cited by RW rarely are germane to the wildlife field. The fundamental need, as RW mention and as R. A. Fisher emphasized, is for true replication. Researchers and managers should not rely on single studies conducted in a single area even over a few years, but instead should require results that are replicated by different researchers using a variety of methods. As Cohen (1994), Thompson (1996), and others have strongly emphasized-contrary to common misperceptions-P-values do not reflect the repeatability of study results; actual replications are required to definitively establish repeatability. Any single study can yield a P-value, but only consistency among replicated studies will advance our science.

ACKNOWLEDGMENTS

I am grateful to D. R. Anderson, M. M. Rowland, and G. A. Sargeant for valuable comments on earlier drafts of this commentary. Special thanks to B. Thompson, Department of Educational Psychology, Texas A&M University, a leader in clarifying the role of statistical hypothesis testing, for his contributions.

LITERATURE CITED

ANDERSON, D. R., K. P. BURNHAM, AND W. L. THOMPSON. 2000. Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management 64:912-923.
ANDERSON, D. R., W. A. LINK, D. H. JOHNSON, AND K. P. BURNHAM. 2001. Suggestions for presenting the results of data analyses. Journal of Wildlife Management 65:373-378.
ARMSTRONG, J. S. 1967. Derivation of theory by means of factor analysis, or Tom Swift and his electric factor analysis machine. American Statistician 21:17-21.
BAUERNFEIND, R. H. 1968. The need for replication in educational research. Phi Delta Kappan 50:126-128.
BERGER, J. O., AND M. DELAMPADY. 1987. Testing precise hypotheses. Statistical Science 2:317-352.
BOX, G. E. P. 1957. Evolutionary operation: a method for increasing industrial productivity. Applied Statistics 6:81-101.
BOX, G. E. P., AND N. R. DRAPER. 1969. Evolutionary operation: a statistical method for process improvement. Wiley, New York, USA.
BOX, G. E. P., AND N. R. DRAPER. 1982. Evolutionary operation (EVOP). Pages 564-572 in S. Kotz and N. L. Johnson, editors-in-chief. Encyclopedia of statistical sciences. Volume 2. Wiley, New York, USA.
BURNHAM, K. P., AND D. R. ANDERSON. 1998. Model selection and inference: a practical information-theoretic approach. Springer-Verlag, New York, USA.
CARVER, R. P. 1978. The case against statistical significance testing. Harvard Educational Review 48:378-399.
CHERRY, S. 1998. Statistical tests in publications of The Wildlife Society. Wildlife Society Bulletin 26:947-953.
COHEN, J. 1994. The earth is round (p < .05). American Psychologist 49:997-1003.
CUMMING, G., AND S. FINCH. 2001. A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement 61:532-574.
DRAPER, N. R., I. GUTTMAN, AND H. KANEMASU. 1971. The distribution of certain regression statistics. Biometrika 58:295-298.
GUTHERY, F. S., J. J. LUSK, AND M. J. PETERSON. 2001. The fall of the null hypothesis: liabilities and opportunities. Journal of Wildlife Management 65:379-384.
HURVICH, C. M., AND C.-L. TSAI. 1990. The impact of model selection on inference in linear regression. American Statistician 44:214-217.
JOHNSON, D. H. 1981. The use and misuse of statistics in wildlife habitat studies. Pages 11-19 in D. E. Capen, editor. The use of multivariate statistics in studies of wildlife habitat. U.S. Forest Service General Technical Report RM-87.
JOHNSON, D. H. 1995. Statistical sirens: the allure of nonparametrics. Ecology 76:1998-2000.
JOHNSON, D. H. 1999. The insignificance of statistical significance testing. Journal of Wildlife Management 63:763-772.
POPE, P. T., AND J. T. WEBSTER. 1972. The use of an F statistic in stepwise regression procedures. Technometrics 14:327-340.
REXSTAD, E. A., D. MILLER, C. FLATHER, E. ANDERSON, J. HUPP, AND D. R. ANDERSON. 1988. Questionable multivariate statistical inference in wildlife habitat and community studies. Journal of Wildlife Management 52:794-798.
ROBINSON, D. H., AND J. R. LEVIN. 1997. Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher 26(5):21-26.
ROBINSON, D. H., AND H. WAINER. 2002. On the past and future of null hypothesis significance testing. Journal of Wildlife Management 66:263-271.
ROMESBURG, H. C. 1981. Wildlife science: gaining reliable knowledge. Journal of Wildlife Management 45:293-313.
SIMBERLOFF, D. 1990. Hypotheses, errors, and statistical assumptions. Herpetologica 46:351-357.
THOMPSON, B. 1995. Stepwise regression and stepwise discriminant analysis need not apply here: a guidelines editorial. Educational and Psychological Measurement 55:525-534.
THOMPSON, B. 1996. AERA editorial policies regarding statistical significance testing: three suggested reforms. Educational Researcher 25(2):26-30.
TUKEY, J. W. 1969. Analyzing data: sanctification or detective work? American Psychologist 24:83-91.
WALTERS, C. 1986. Adaptive management of renewable resources. Macmillan, New York, USA.