14_TPPT_2015_Sep Agenda item 6.8
Environ Monit Assess (2014) 186:2729–2740
DOI 10.1007/s10661-013-3574-8

Assessing environmentally significant effects: a better strength-of-evidence than a single P value?

Graham McBride & Russell G. Cole & Ian Westbrooke & Ian Jowett

Received: 28 May 2013 / Accepted: 28 November 2013 / Published online: 20 December 2013
© Springer Science+Business Media Dordrecht 2013

Abstract Interpreting a P value from a traditional nil hypothesis test as a strength-of-evidence for the existence of an environmentally important difference between two populations of continuous variables (e.g. a chemical concentration) has become commonplace. Yet, there is substantial literature, in many disciplines, that faults this practice. In particular, the hypothesis tested is virtually guaranteed to be false, with the result that P depends far too heavily on the number of samples collected (the ‘sample size’). The end result is a swinging burden-of-proof (permissive at low sample size but precautionary at large sample size). We propose that these tests be reinterpreted as direction detectors (as has been proposed by others, starting from 1960) and that the test’s procedure be performed simultaneously with two types of equivalence tests (one testing that the difference that does exist is contained within an interval of indifference, the other testing that it is beyond that interval—also known as bioequivalence testing). This gives rise to a strength-of-evidence procedure that lends itself to a simple interpretation. It is accompanied by a strength-of-evidence matrix that has many desirable features: not only a strong/moderate/dubious/weak categorisation of the results, but also recommendations about the desirability of collecting further data to strengthen findings.

Keywords Environmental significance · P value · Strength-of-evidence · Confidence interval

Electronic supplementary material The online version of this article (doi:10.1007/s10661-013-3574-8) contains supplementary material, which is available to authorized users.

G. McBride (*)
NIWA (National Institute of Water & Atmospheric Research), PO Box 11115, Hamilton 3216, New Zealand
e-mail: [email protected]

R. G. Cole
NIWA, PO Box 893, Nelson 7010, New Zealand

I. Westbrooke
Department of Conservation, Private Bag 4715, Christchurch, New Zealand

I. Jowett
Jowett Consulting Ltd., 123 Butcher Road, Pukekohe, New Zealand

Introduction

Environmental impact assessment frequently relies on advice based on statistical hypothesis testing. Like many others (http://warnercnr.colostate.edu/~anderson/thompson1.html), we are of the view that the results, while appearing to be instructive, can in fact be misleading. This is because standard tests (e.g. t tests, analysis of variance) are most usually tests of a particular form of null hypotheses, i.e. ‘nil hypotheses’ (Cohen 1994). Typically denoted as H0, they postulate that differences between populations’ parameters (means, …) differ by an exact amount, most usually zero. We should have little interest in such an implausible hypothesis because we already know that such a state between populations’ parameters is extremely unlikely:

‘All we know about the world teaches us that the effects of A and B are always different’ (Tukey 1991). Yet, the test’s calculation procedure is derived from key results in the mathematical theory of statistics that are built on the assumption that the hypothesis to be tested is true. That hypothesis is only rejected if the test’s ‘P value’ is less than a chosen significance level (often taken as α=5 %), in which case the result is called ‘statistically significant’—or sometimes just ‘significant’—and the ‘alternative hypothesis’ is accepted (i.e. the first population’s mean differs from the second).

Conversely, if the P value is greater than α, the best we can say is that the hypothesis is ‘not rejected’, which, on logical grounds, cannot and should not be taken to mean that such a hypothesis can be accepted—although this incorrect inference is often made. As stated by Zar (1984, p. 45): “… failing to reject a null hypothesis is not ‘proof’ that the hypothesis is true. It denotes only that there is not sufficient evidence to conclude that it is false” (a point first made in print by Berkson 1942). In other words, how can it be legitimate to accept as true something that we already understand to be false? Especially when, as implied by Zar, were more and more data to be collected, the hypothesis would eventually be rejected.

If this significant/not significant dichotomy is abandoned, the practice can arise of using the P value as an apparent strength-of-evidence concerning the changes or trends being analysed. This too is generally erroneous (Germano 1999), on two grounds. To see why, we need to consider the P value in a little more detail. In particular, note that it is defined as the probability of obtaining data at least as extreme as has been obtained if the tested hypothesis is true. Two problems arise.

Problem 1 Using P as a strength-of-evidence provided by the data that we collect is to include in that metric data that we did not collect (i.e. all data more extreme than was obtained). This problem has been criticised and defended in the statistical literature. A famous criticism is: ‘What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred’ (Jeffreys 1961). In its defence another statistician stated: ‘… if we were to regard the data under analysis as just decisive against H0 then we would have to regard more extreme samples also as evidence against H0. Therefore, P is the probability of declaring there to be evidence against H0 when it is in fact true and the data under analysis are regarded as just decisive’ (Cox 1987). Other statisticians maintain that this line of reasoning does not resolve the problem of appealing to data that were not obtained (e.g. Berger and Delampady 1987). And note that, as is common in much statistical writing, the above quotes assume that H0 may be true, whereas we (and others, such as Tukey 1991) say the chances of that being so (for continuous variables at least) are vanishingly small. We will propose a possible resolution of this disagreement.

Problem 2 P depends on the number of data collected. Consider a simple t test, addressing the hypothesis that the difference between the means of two populations is zero, where the means can be taken as normally distributed (typically based on the central limit theorem), with common variance. That is, H0 posits that μ1−μ2=0, where μ1 and μ2 denote the unknown population means. The alternative hypothesis is HA: μ1−μ2≠0. If we draw n samples from each population, we can then calculate the test statistic T, which follows the t distribution when H0 is true. It is defined as

T = (|X̄1 − X̄2| / Sp) √(n/2),

where X̄1 and X̄2 are the means of each set of samples, Sp is the pooled standard deviation obtained from the variances of the data, and the vertical bars denote absolute magnitude. Observe that the term in vertical bars, divided by Sp, measures the relative ‘effect size’ in the data (as defined by Cohen 1988, also called the ‘normal deviate’) and is comprised of unbiased and consistent estimators of means and variances (sensu the mathematical theory of statistics, e.g. Freund 1992). As a consequence, were we to have taken fewer (or more) than n samples from each population, on average, the value of the effect size would remain roughly constant.
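The sample-size dependence of T can be illustrated in a few lines. The sketch below is ours, not the authors' code; it assumes equal sample sizes n and computes T = (|x̄1 − x̄2|/Sp)·√(n/2) for a fixed underlying effect size:

```python
import math

def t_statistic(mean1, mean2, pooled_sd, n):
    """Equal-n two-sample t statistic: T = (|x1bar - x2bar| / Sp) * sqrt(n / 2)."""
    effect_size = abs(mean1 - mean2) / pooled_sd  # relative effect size (Cohen)
    return effect_size * math.sqrt(n / 2)

# Same underlying effect size (0.5 pooled-SD units), increasing sample size:
for n in (10, 40, 160):
    print(n, round(t_statistic(10.5, 10.0, 1.0, n), 3))
# T grows like sqrt(n): quadrupling n doubles T, so P shrinks even though
# the difference between the two populations is unchanged.
```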

So in essence, the presence of the √(n/2) term means that the test statistic T will be substantially affected by the number of data we have. In particular, the more data we have from a given population, the larger T will tend to be. And because larger values of T automatically cause smaller P values, the more data we have the smaller P will be (larger values of T cut off smaller areas in the tail of the t distribution, as shown in Fig. 1).

Fig. 1 The larger the test statistic, the smaller the tail area of the t distribution (probability density of the t distribution with 12 degrees of freedom; the tail area beyond t = 2.3 is smaller than that beyond t = 1.6)

So with many samples, nil hypothesis tests on differences in population means will routinely return a statistically significant result when in fact the differences that are present would be considered by environmental scientists to be trivial—reflecting a precautionary approach. Conversely, with only a few samples, environmentally important differences may often be declared to be ‘not significant’—reflecting a permissive approach.¹ (See Appendix 1 for a numerical example of how this works.) Now, if the true difference is environmentally important, this behaviour is desirable: collecting yet more data from a given population can lead to rejection of a false hypothesis that was previously not found to be wanting (as is implied by the quote from Zar above, a feature well-known among statisticians). But this is undesirable behaviour where the true difference is considered to be trivial. Contrast this state-of-affairs with tests of a ‘one-sided hypothesis’ positing that one population mean is greater than the other. Indeed, in that case, it is a straightforward exercise to show that when the one-sided hypothesis is considered to be true, P values will increase as sample size is increased (McBride 2005, Sec. 5.4). That is, increasing the number of data can provide stronger confirmation for the tested hypothesis if it is in fact true. So, in contrast to testing an always-false ‘nil’ hypothesis, we could contemplate accepting a one-sided hypothesis—because it could be true.

¹ In a precautionary approach, it is assumed that important differences exist. That assumption is only abandoned if data are sufficiently convincing; vice versa in the permissive approach.

Over many years, a number of authors have detailed problems associated with tests of nil hypotheses and their associated P values (Berkson 1942; Rozeboom 1960; Gibbons and Pratt 1975; Carver 1978; Schervish 1996; Germano 1999; Johnson 1999; Jones and Tukey 2000; McBride 2005; Newman 2008; Läärä 2009; Gerrodette 2011; Beninger et al. 2012). However, it is seldom pointed out that P values may have some merits when testing hypotheses that can be true, and so, critics may ignore some possible benefits of including tests of nil hypotheses in a strength-of-evidence approach (e.g. Fleiss 1986; Frick 1995; Chow 1996; Hagen 1997; Harris 1997a).

Herein, we present the essentials of an approach to hypothesis testing that retains useful features of nil hypothesis test procedures, marrying them with tests of interval hypotheses (of which nil hypotheses and one-sided hypotheses are special cases). We suggest this new approach will minimise the problems referred to above. Full analysis of the mathematical aspects of this approach awaits further attention, including considerations beyond differences in means between two populations considered herein (especially, trend analysis and ANOVA). We recommend investigation of its merits in a joint exercise between environmental scientists and statisticians.

Simultaneous hypothesis test procedures

The essence of the strength-of-evidence procedure (SEP) we are advocating is to meld together three simultaneous procedures, all based on two one-sided (‘TOST’) decision rules: (a) the nil hypothesis is invoked to infer the direction-of-change; (b) testing the equivalence hypothesis that the true difference in means lies within an interval of indifference; (c) testing the inequivalence hypothesis, positing that the true difference in means lies beyond that interval (also called ‘bioequivalence testing’—Westlake 1981; Schuirmann 1996; Wellek 2003).

What to do with the nil hypothesis procedure?

A number of authors have proposed recasting nil hypothesis testing procedures in what may be called a ‘three-valued logic’ (Bohrer 1979; Harris 1997b). Many years ago, Kaiser (1960) pointed out that accepting the alternative hypothesis HA does not of itself enable us logically to conclude anything about the direction-of-change (because HA merely posits that μ1 and μ2 differ, saying nothing about the direction-of-change). Researchers employing nil hypothesis testing alone cannot therefore come to any logical conclusion about the sign of an effect. Kaiser (1960), following a lead by Hodges and Lehmann (1954), pointed out that the only way to use hypothesis tests to come to a conclusion about the direction-of-change is to employ two one-sided (TOST) procedures, testing the hypotheses that μ1>μ2 or μ1<μ2. To do so invokes the three-valued logic, so named because it allows three outcomes (Harris 1997b): (1) confidence that the change is in the positive direction; (2) confidence that the change is in the negative direction; (3) there are insufficient data to be confident about the direction-of-change. Conclusion (3) is valid in that it is a statement that can be accepted, even though it flows from the apparatus for testing a nil hypothesis.

More recently, this approach has been endorsed by others (Jones and Tukey 2000; Harris 2001; Goudey 2007), though there is inertia resisting its adoption (judging by the lack of its availability in statistical software). Its advocacy by Jones and Tukey (2000) contains three caveats that should now be addressed: we disagree with the first, mostly agree with the second and fully agree with the third.

Caveat 1 Avoid stating a hypothesis? Jones and Tukey (2000) opined that we should not set forth a null hypothesis because to do so is unrealistic and misleading. Instead, we should only entertain one of three conclusions, i.e. outcomes (1)–(3) above. While we sympathise with the motive for this, their two one-sided tests’ decision rules are couched in terms of critical values of the t distribution, which is only appropriate when the nil hypothesis is assumed to be true. So, whether stated or not, mathematical results obtained under that assumption are used to generate the tests’ decision rule. That is, the nil value is used as a starting point from which to infer the direction of change, rather than being posed as a hypothesis that could in fact be true.

Caveat 2 Interpret test outcomes as aids to decisions, not as conclusions? This is a view also espoused by Tukey (1960). So, for example, conclusion (1) above would read: ‘act as if the first population has the larger mean’. This is consistent with advising environmental managers who must soon make decisions, at least in part, based on the direction-of-change. This approach also guards against concluding too much from the results of a single test. In other words, strong inference about the state of the environment should rely on more than one, or even a few, studies (a view expounded by Platt 1964). Nevertheless, our SEP approach does allow strong conclusions for the analysis of a particular dataset. That does not necessarily imply a strong weight-of-evidence for the hypothesis in question, given the possibility that analysis of other appropriate datasets may only reach weak (or even conflicting) outcomes.

Caveat 3 Each one-sided test should be at level α. This is equivalent to couching the tests’ rules in terms of symmetric 100(1−2α) % confidence intervals, as explained in Appendix 2. At first, this may be surprising, yet it is a property with a sound theoretical basis, also shared by the interval tests that we now discuss using the TOST approach.

Interval hypothesis tests

Having thus obtained a coherent decision rule concerning the direction-of-change, environmental management can be better informed compared with the results of a standard nil hypothesis test (Goudey 2007). But in many cases, that will not be enough; some indication about confidence in the magnitude of the measured effect size will also be needed.

There are two available statistical procedures which management professionals could use to address meaningful hypotheses. First is the bioequivalence procedure, which we have used in impact assessments (e.g. Cole and McBride 2004) and which is widely used in drug trials (Berger and Hsu 1996). This procedure tests the hypothesis that the difference between two population means is greater than the ‘equivalence interval’, where any difference within that interval is held by expert ecologists to be practically unimportant (i.e. environmentally unimportant). So, if that hypothesis is rejected, the two populations can be inferred to be ‘equivalent’. As such, it represents a precautionary approach because equivalence can only be inferred if data are sufficiently convincing. In this case, using a small significance level (e.g. α=0.05) minimises the risk of inferring equivalence when in fact there is a practically important difference between the two populations. Properties of these tests have been examined in considerable detail in the statistical literature (e.g. Schuirmann 1987, 1996; Berger and Hsu 1996; McBride 2002; Wellek 2003) and have previously been advocated (McBride et al. 1993).

The second procedure inverts the hypothesis to be tested, which now posits that the difference between two population means is within the ‘equivalence interval’. Note that this inversion also moves the burden-of-proof from ‘precautionary’ to ‘permissive’: its significance level guards against the risk of falsely inferring inequivalence. While the classic text on hypothesis testing does include material on testing this hypothesis (Lehmann 1986), it has seldom been implemented in the applied sciences, possibly because, on its own, it may be considered to be too permissive. These tests have occasionally been used alongside bioequivalence tests (e.g. McBride 1999, Cole and McBride 2004), but not in the simultaneous fashion suggested herein.

The new SEP procedure

The essence of our proposal is to determine a strength-of-evidence metric to decide on our confidence in asserting that a difference in means lies within or outside of a region of indifference (the equivalence interval). This uses testing procedures based on three hypotheses: the traditional nil hypothesis, the equivalence hypothesis and the inequivalence hypothesis. The nil hypothesis procedure differs from the standard approach in that it admits three possible outcomes (traditionally two); the other two each admit two outcomes. So, there is a total of 12 combinations (triplets) of outcomes to consider. We need to determine how many of these can occur, expecting that some will be impossible. This has been done by examining the algebraic interplay of the set of decision rules encompassed by these procedures (as given in Table 1), resulting in eight sets of possible triplets, as shown in the matrix in Table 2—the essential product of this work. This includes the option of including a non-zero value of the difference in population means for the direction-of-change procedure. (The mathematical details are given in the Supplementary Information.)

Table 2 defines the triplets by a convenient shorthand notation in which, for example, ‘PReFi’ denotes that the nil procedure has inferred a positive direction-of-change, the equivalence hypothesis has been rejected and the inequivalence hypothesis has failed to be rejected. As such, one would expect this result to confer strong strength-of-evidence for a positive and practically important change. Note that the first (P) entry in the triplet can be replaced by N (inferring a negative direction-of-change) or U (unsure of direction-of-change). The second and third entries can be Fe and Ri, with obvious interpretation.

Figure 2 displays a confidence interval interpretation of all possible outcomes from the procedure (somewhat similar diagrams appear in Anderson and Meleason 2009 and, particularly, Brosi and Biber 2009, but with somewhat different interpretations). The patterns shown are intuitively appealing. First, rejecting the equivalence hypothesis while failing to reject the inequivalence hypothesis (i.e. obtaining ReFi) is strongly suggestive of a practically important change, because the difference is some way beyond the equivalence interval. Obtaining this ‘strong’ result can be seen as ‘proof of hazard’. That outcome can only be accompanied by P or N in the first position of the triplet. Note that UReFi is an impossible outcome (as explained in Table 2 and in the Supplementary Information); on heuristic grounds, we could have anticipated that it would be highly unlikely to be sure of having found a practically important difference but be unsure of its direction. To fail to reject either hypothesis (obtaining FeFi) but to be confident of the direction-of-change confers only ‘dubious’ strength-of-evidence (the difference is close to the edge of the equivalence region, and so, more data would be needed to obtain narrower confidence intervals). But obtaining that result and being unsure about the direction-of-change can only be described as ‘weak’ evidence.
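The reduction from 12 conceivable triplets to eight possible ones can be checked mechanically. The sketch below (ours, not the authors') simply enumerates the 3 × 2 × 2 combinations and removes the four outcomes that the note to Table 2 states are impossible; it does not rederive the algebra in the Supplementary Information:

```python
from itertools import product

directions = ("P", "N", "U")    # positive / negative / unsure direction-of-change
equivalence = ("Re", "Fe")      # reject / fail to reject the equivalence hypothesis
inequivalence = ("Ri", "Fi")    # reject / fail to reject the inequivalence hypothesis

all_triplets = {d + e + i for d, e, i in product(directions, equivalence, inequivalence)}
impossible = {"PReRi", "NReRi", "UReRi", "UReFi"}  # as stated in the note to Table 2
possible = sorted(all_triplets - impossible)

print(len(possible), possible)  # 8 possible triplets
```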

Table 1 Inferences about the difference between the means of two populations

Decision rule Inference

Three-valued logic (for nil hypothesis H0: δ=δn)

If T>Tc Confidence that the difference is positive

If T<−Tc   Confidence that the difference is negative
Else   Not enough data to be confident about the direction of difference

Testing the equivalence hypothesis (He: δL≤δ≤δU) (from McBride 2005, Table 5.9)

If Tl<−Tc or Tr>Tc   Confidence that the true difference is beyond the equivalence interval
Else   Cannot reject the possibility that the true difference is within the interval

Testing the inequivalence hypothesis (Hi: δ<δL or δ>δU) (from Schuirmann 1987)

If Tl≥Tc and Tr≤−Tc   Confidence that the true difference is within the equivalence interval
Else   Cannot reject the possibility that the true difference is beyond the interval

Definitions

δ   Unknown difference in population means: δ = μ1 − μ2
T   Test statistic for nil hypothesis procedures: T = (d − δn)/SE(d) (a)
d   Known difference between sample means: d = X̄1 − X̄2
δn   User-supplied single value of δ at which the nil hypothesis is true
SE(d)   Standard error of d: SE(d) = Sp √(1/n1 + 1/n2)
Sp   Pooled sample standard deviation: Sp = √{[(n1 − 1)S1² + (n2 − 1)S2²]/f}

n1, n2   Number of samples drawn from each population
S1², S2²   Variances of each set of data

f Degrees of freedom: f=n1+n2−2

Tc   Critical T value for the nil and equivalence hypothesis procedures: Tc = t(1−α, f), cutting off an area α (not α/2) in the right tail of the t distribution
α   User-supplied significance level

δL, δU   User-supplied lower and upper bounds of the equivalence interval (the interval of indifference). Note the requirement that δL < δn < δU, so that the nil δ value is always captured within the equivalence interval
Tl   Left test statistic for equivalence procedures: Tl = (d − δL)/SE(d) (a, b)
Tr   Right test statistic for equivalence procedures: Tr = (d − δU)/SE(d) (a, b)
(a) Test statistics can be negative (whereas nil hypothesis procedures always use positive absolute values)
(b) Note that as δL and δU → δn, then Tl → T and Tr → T, and so the procedure for the test of the equivalence hypothesis reduces to the three-valued nil hypothesis procedure, as expected
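Table 1's three decision rules can be transcribed into a short function. The sketch below is our own rendering, not the authors' software: the critical value Tc is supplied by the caller rather than computed from the t distribution, and T, Tl and Tr follow the definitions above.

```python
def sep_triplet(T, Tl, Tr, Tc):
    """Combine the three decision rules of Table 1 into a results triplet.

    T  = (d - delta_n)/SE(d);  Tl = (d - delta_L)/SE(d);  Tr = (d - delta_U)/SE(d);
    Tc = critical value t(1 - alpha, f), supplied by the caller.
    """
    # Three-valued logic for the nil hypothesis: direction-of-change.
    if T > Tc:
        direction = "P"
    elif T < -Tc:
        direction = "N"
    else:
        direction = "U"
    # Equivalence hypothesis: rejected when the difference is confidently
    # beyond the equivalence interval.
    equivalence = "Re" if (Tl < -Tc or Tr > Tc) else "Fe"
    # Inequivalence hypothesis: rejected when the difference is confidently
    # within the equivalence interval.
    inequivalence = "Ri" if (Tl >= Tc and Tr <= -Tc) else "Fi"
    return direction + equivalence + inequivalence
```

With the n=50 statistics quoted in the ‘Example applications’ section (T=1.250, Tl=1.875, Tr=0.625, Tc=1.661), this returns ‘UFeFi’, matching the reported triplet.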

To fail to reject the equivalence hypothesis but to reject the inequivalence hypothesis (obtaining FeRi) and to be sure of the direction-of-change is ‘moderate’ evidence (of a trivial difference). But obtaining that result and being uncertain of the direction-of-change confers strong evidence (of a trivial difference). Obtaining this strong result (UFeRi) can be seen as a ‘proof of safety’. Note that obtaining FeRi at low sample size can be difficult—the confidence intervals need to be quite narrow (which they are not at low sample size) and well within the equivalence interval, reflecting the finding that ‘proof of safety is much more difficult than proof of hazard’ (Bross 1985).

Table 2 Proposed strength of evidence matrix

Strength-of-evidence | Results triplet | Conclusion (about strength of evidence) | Need more data?
Strong | PReFi, NReFi | Practically important difference | No
Strong | UFeRi | No practically important difference | No
Moderate | PFeRi, NFeRi | Difference may be trivial compared to the equivalence limits | Collecting more data may not help. Consider reducing the interval?
Dubious | PFeFi, NFeFi | Difference may be practically important | Yes, but maybe just a modest amount more
Weak | UFeFi | Inconclusive | Yes, a lot

Note that PReRi, NReRi and UReRi are all formally impossible. UReFi is also impossible because δL and δU straddle the nil hypothesis value (δn)

Definitions of the triplets (by example): PReFi positive direction-of-change found, reject equivalence, fail to reject inequivalence; NFeRi negative direction-of-change found, fail to reject equivalence, reject inequivalence; UFeRi unsure of the direction-of-change, fail to reject equivalence, reject inequivalence
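The matrix lends itself to a simple lookup. The sketch below is our own encoding of Table 2, with the ‘Need more data?’ row keyed by strength category:

```python
# Strength-of-evidence categories for the eight possible triplets (Table 2).
STRENGTH = {
    "PReFi": "strong", "NReFi": "strong",      # practically important difference
    "UFeRi": "strong",                         # no practically important difference
    "PFeRi": "moderate", "NFeRi": "moderate",  # difference may be trivial
    "PFeFi": "dubious", "NFeFi": "dubious",    # difference may be practically important
    "UFeFi": "weak",                           # inconclusive
}

# Table 2's 'Need more data?' recommendations, keyed by strength category.
NEED_MORE_DATA = {
    "strong": "no",
    "moderate": "collecting more data may not help; consider reducing the interval",
    "dubious": "yes, but maybe just a modest amount more",
    "weak": "yes, a lot",
}

def advice(triplet):
    """Return (strength, recommendation); raises KeyError for impossible triplets."""
    strength = STRENGTH[triplet]
    return strength, NEED_MORE_DATA[strength]
```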

Example applications

1. Consider the example given in Appendix 1. For n=50 samples from inside the marine protected area (MPA) and the same number of samples from outside the MPA, the critical value of T at the 5 % significance level is Tc=1.661. But the value of T is only 1.250, so we would conclude that we have insufficient information to decide on which population of fish has the larger mean; a U result. We also calculate the left and right equivalence test statistics as Tl=1.875 and Tr=0.625 (using the data on means and standard deviations given in Appendix 1). Applying the decision rules in Table 1, we therefore reject neither the equivalence hypothesis nor the inequivalence hypothesis. The resulting triplet is UFeFi, described in Table 2 as ‘weak evidence’; data are insufficient to reach a useful conclusion.

For n=500 (using the data in Appendix 1), we obtain Tc=1.646, T=3.309, Tl=5.148 and Tr=1.471. The resulting triplet is PFeFi, indicating ‘dubious evidence’ for the possible presence of a practically important change.

Fig. 2 Confidence interval interpretation of the proposed procedure. The figure plots, against the equivalence interval (δL, δn, δU), the symmetric 100(1−2α) % confidence intervals (‘CI’) leading to each conclusion. U: the CI contains δn (unsure of direction of change). P, N: the CI does not contain δn (confident of the direction of change). Re: all of the CI lies beyond the equivalence interval. Fe: at least some of the CI lies within the equivalence interval. Ri: all of the CI lies within the equivalence interval. Fi: at least some of the CI lies beyond the equivalence interval. Conclusions shown, shaded by strength of evidence (strong, moderate, dubious, weak): PReFi/NReFi (confident of direction-of-change; important difference detected); PFeFi/NFeFi (confident of direction-of-change; unsure of importance of difference); PFeRi/NFeRi (confident of direction-of-change; difference may be trivial); UFeRi (confident that difference is trivial); UFeFi (not enough data)

Possibly only a modest amount of extra data may be needed to move this finding into the ‘strong evidence’ category. For n=700 (again using the data in Appendix 1), we obtain Tc=1.646, T=3.916, Tl=6.091 and Tr=1.740. The resulting triplet is PReFi, indicating strong evidence for the possible presence of a practically important change; no more data are needed (calculations show that this result is reached once 627 data are collected at each site). Note however that were the 700 data taken outside the MPA to have a mean of 5.9 kg (instead of 5.1 kg), the triplet returned by this procedure is UFeRi, indicating that no practically important difference was detected. In general terms, this example demonstrates the intuitively obvious feature that collecting more data will strengthen findings along the weak–dubious–moderate–strong spectrum. But note that if the true difference is very close to an edge of the equivalence region, a very large extra sampling effort may be necessary.

2. Quinn et al. (1992) reported results for benthic invertebrate taxonomic richness upstream and downstream of alluvial gold mining operations on six streams on the west coast of New Zealand’s South Island. In each stream, seven replicates were taken from each site from a bed area of 0.1 m² in a ‘run’ (deeper than a ‘riffle’ but shallower than a ‘pool’). The lead author of that work advised that, as an ecologist, he would regard a difference of ±20 % between the upstream and downstream sites to be environmentally significant. While it was expected that there would be a decline in richness from upstream to the downstream sites, that does not completely preclude the possibility of an increase. Accordingly, these data lend themselves to being analysed by our proposed procedure. Results are given in Table 3. In this case, because of the range of differences measured, some notable results were achieved with only seven samples (strong evidence of environmentally important differences was found for two streams, and for another, strong evidence was found for a trivial difference). The other results were weak or dubious. But had another three samples been taken at Red Jacks Creek (with the same difference in means and pooled standard deviation), calculation shows that its results would have improved from dubious to moderate. These calculations have all been performed in a simple one-page Excel® spreadsheet accompanying the Supplementary Information.

Discussion

This new SEP approach to testing differences between population statistics offers a coherent logical way out from the many and various well-documented confusions that accompany tests of nil hypotheses. For example, it avoids the use of subjective words, especially ‘significant’.

A distinguishing feature of the proposed procedure is that the limits of the equivalence interval must be supplied in advance of performing its procedures, and that can cause discomfort—yet having to specify the interval can lead to more thorough exploration of data for environmentally important effects. In contrast, no such specification is required when performing traditional tests of nil hypotheses. In essence, a nil hypothesis test is a limiting case of the test of the equivalence hypothesis in which the upper and lower interval limits (δL and δU) converge to the same nil value (δn). Standard procedures and software only call for that value (or assume it to be zero), and so, no user input concerning the equivalence interval is required.
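For the Quinn et al. example, the triplets can be recomputed from Table 3's summary statistics. The sketch below is ours: it assumes the equivalence interval is ±20 % of the upstream mean and uses the critical value t(0.95, 12) = 1.782 for n1 = n2 = 7 at α = 0.05 (the paper does not spell out the interval's base here, so treat both as assumptions):

```python
import math

def quinn_triplet(upstream_mean, downstream_mean, pooled_sd, n=7, Tc=1.782):
    """Classify one stream: d = downstream - upstream mean.

    Assumptions (ours): equivalence interval = +/-20 % of the upstream mean,
    and Tc = t(0.95, 12) = 1.782 for n1 = n2 = 7 at alpha = 0.05.
    """
    d = downstream_mean - upstream_mean
    se = pooled_sd * math.sqrt(2.0 / n)
    delta_L, delta_U = -0.2 * upstream_mean, 0.2 * upstream_mean
    T, Tl, Tr = d / se, (d - delta_L) / se, (d - delta_U) / se
    direction = "P" if T > Tc else ("N" if T < -Tc else "U")
    equivalence = "Re" if (Tl < -Tc or Tr > Tc) else "Fe"
    inequivalence = "Ri" if (Tl >= Tc and Tr <= -Tc) else "Fi"
    return direction + equivalence + inequivalence

print(quinn_triplet(16.29, 8.71, 2.46))   # German Gully -> NReFi
print(quinn_triplet(13.14, 13.00, 1.98))  # Waimea Creek -> UFeRi
```

Under these assumptions the function reproduces the triplets reported in Table 3 for all six streams.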

Table 3 Results of Quinn et al. (1992) analysed by the proposed SEP approach

Site   Upstream mean   Downstream mean   Pooled standard deviation   Triplet   Strength-of-evidence
(Taxonomic richness as #/0.1 m²)

German Gully 16.29 8.71 2.46 NReFi Strong (important difference)

Houhou Creek 16.00 11.57 2.04 NFeFi Dubious

Kaniere River 11.29 9.14 2.35 UFeFi Weak

Kapitea Creek 19.00 8.57 2.23 NReFi Strong (important difference)

Red Jacks Creek 19.14 13.43 2.32 NFeFi Dubious

Waimea Creek 13.14 13.00 1.98 UFeRi Strong (trivial difference)

Environmental scientists and managers may see this lack of required input as an advantage in that it supports notions of scientific objectivity, avoiding the subjectivity that is inevitably involved in setting the interval’s limits. Yet in sampling programme design, one is often required to use power analysis, and that does call for such limits, because it is often necessary to ensure that the sample size will be big enough to give good confidence of detecting a given effect size, were it (or a larger effect size) to be present. We argue that the need to state an interval of indifference is essential in most practical applications, that the gains to be made in terms of obtaining valid strengths of evidence far outweigh those of the simpler nil hypothesis approach, and that stating an interval of indifference is often an unavoidable necessity in performing statistical hypothesis testing.

We note some desirable features of the strength-of-evidence matrix (Table 2), particularly concerning the appropriateness of further sampling. First, failing to find the direction-of-change (obtaining the U result for the three-valued logic) does not guarantee that the true difference is ‘small’ (see also Berger and Delampady 1987). It does if accompanied by the FeRi doublet, but it does not necessarily do so if accompanied by FeFi. In the latter case, the dataset is insufficient to support even claims of moderate or dubious strength-of-evidence. Second, the two strong outcomes in the matrix indicate that for those cases, it seems not to be worthwhile to perform any further sampling (at least in the context of testing the stated hypotheses). There are already sufficient data to be firm in a conclusion about whether the difference that does exist is or is not practically (environmentally) important. On the other hand, the weak inconclusive result cannot be strengthened unless more data are collected. In the middle ground are the moderate and dubious results. Note that it does not always follow that collecting a feasible number of extra data will strengthen these findings; it may well do so for the FeFi doublets, but it may not for the FeRi doublets. Calculations can easily be performed to examine that question (e.g.

…-theoretic approaches (Burnham and Anderson 2002; Gerrodette 2011). In many cases, these approaches have been advocated in response to perceived shortcomings in using P values from nil hypothesis tests. Indeed, the SEP procedure we have proposed lends itself to an informative confidence interval interpretation (Fig. 2).

Finally, we note that the utility of P values should not be considered to be independent of the nature of the hypotheses tested—as many do. Indeed, for one-sided hypotheses they can be most useful: it is a well-known result in statistical theory that for one-sided tests, the P value and the Bayesian posterior probability can be the same numerical value (Edwards et al. 1963; DeGroot 1973; Lee 1997). Bayesian probability statements about hypotheses use only the data observed, not any data ‘more extreme’, so ‘problem 1’ raised in the ‘Introduction’ may be capable of simple resolution because the proposed SEP approach uses one-sided procedures. Problem 2 has been resolved by the sensible outcomes in the matrix regarding the desirability of collecting further data.

Conclusions

The problems associated with using P values from traditional tests of nil hypotheses as strength-of-evidence for interesting hypotheses have been well covered in many publications, but alternatives using further hypothesis test procedures have been rather lacking or vague. We propose a strength-of-evidence framework that appears to provide comprehensive strength-of-evidence based on two simultaneous interval tests combined with a reinterpretation of the traditional nil hypothesis tests. This gives rise to a matrix of combined test outcomes with (a) two strong results (further data collection seems not to be necessary), (b) moderate or dubious results (to provide greater strength some further data collection may be necessary) and (c) one weak result (a further substantial data collection is definitely desirable). These procedures have been incorporated into freeware (Time Trends and
using the calculator accompanying the Equivalence, http://www.jowettconsulting.co.nz/home/ Supplementary Information). software). It remains to further examine the statistical Adopting the SEP procedure advocated here does properties of this approach and to extend it to other not detract from the merits of increasingly advocated analyses such as trend detection (for which interval tests practices of reporting confidence intervals and esti- have already been proposed, Dixon and Pechmann 2005) mated effect sizes (e.g. Smithson 2000; Cumming and ANOVA and to consider the implication of and Finch 2001) and using model-based information conducting multiple comparisons. 14_TPPT_2015_Sep Agenda item 6.8 2738 Environ Monit Assess (2014) 186:2729–2740
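The sample-size sensitivity of nil-hypothesis P values, worked through numerically in Appendix 1 with Excel's 'T.DIST.2T', can also be reproduced in a few lines of code. The following is a minimal sketch in Python, assuming SciPy is available; the function name `nil_test_p` is ours, for illustration only, and is not part of the Time Trends and Equivalence freeware.

```python
# Two-sample t test of the nil hypothesis (exactly zero difference in means),
# for equal sample sizes n per site and pooled standard deviation sp.
from scipy import stats

def nil_test_p(x1_bar, x2_bar, sp, n):
    """Two-sided P value for the nil hypothesis of zero difference."""
    t = (x1_bar - x2_bar) / (sp * (2.0 / n) ** 0.5)
    return 2 * stats.t.sf(abs(t), df=2 * n - 2)

# The three hypothetical MPA fish surveys of Appendix 1: P shrinks as n grows,
# even though the standardised effect size shrinks slightly.
for x1_bar, x2_bar, sp, n in [(6.1, 5.1, 4.0, 50),
                              (6.2, 5.3, 4.2, 500),
                              (6.0, 5.1, 4.3, 700)]:
    print(n, nil_test_p(x1_bar, x2_bar, sp, n))  # ≈ 0.214, 0.00073, 0.000094
```

Run as-is, this reproduces the appendix's point: P falls by over three orders of magnitude as n increases from 50 to 700 per site, even though the standardised effect size decreases from 0.250 to 0.209.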

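The TOST decision rule discussed in Appendix 2, two one-sided tests each performed at level α together with the matching 100(1 − 2α)% confidence-interval check, can be sketched as follows. Python with SciPy is assumed; the function `tost` is our illustrative naming (not the paper's freeware), and the ±0.5 kg equivalence interval is the one suggested in Appendix 1.

```python
# Two one-sided tests (TOST) for equivalence within (-delta, +delta), each at
# level alpha, plus the equivalent equi-tailed 100(1 - 2*alpha)% CI.
from scipy import stats

def tost(x1_bar, x2_bar, sp, n, delta=0.5, alpha=0.05):
    """Return the 100(1 - 2*alpha)% CI for the difference in means, and
    whether both one-sided tests reject at level alpha (equivalence shown)."""
    se = sp * (2.0 / n) ** 0.5
    df = 2 * n - 2
    d = x1_bar - x2_bar
    p_low = stats.t.sf((d + delta) / se, df)    # H0: true difference <= -delta
    p_high = stats.t.cdf((d - delta) / se, df)  # H0: true difference >= +delta
    half_width = stats.t.ppf(1 - alpha, df) * se
    ci = (d - half_width, d + half_width)       # a 90 % CI when alpha = 0.05
    return ci, max(p_low, p_high) < alpha

# Appendix 1's first survey: the 90 % CI straddles both zero and +0.5 kg, so
# the data show neither equivalence nor an environmentally important
# difference, and more sampling would be needed.
ci, equivalent = tost(6.1, 5.1, 4.0, 50)
print(ci, equivalent)
```

Both one-sided tests rejecting at level α is exactly the condition that the equi-tailed 100(1 − 2α)% interval lies wholly inside (−Δ, +Δ), which is the Berger and Hsu (1996) result quoted in Appendix 2.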
Acknowledgments We thank colleagues (Jen Drummond, James Sukias, Rob Goudey and Mark Meleason) for constructive comments. John Quinn provided the data used in the example applications. This work was funded by the New Zealand Ministry of Science and Innovation (contract C09X1003: Integrated Valuation and Monitoring Framework for Improved Freshwater Outcomes). Sadly, our second author, Dr. Russell Cole, passed away during the processing of our submission.

Appendices

Appendix 1. Nil hypothesis test result depends on sample size

Consider a survey designed to examine the effect of a marine protected area (MPA) on populations of a fish, using normal statistical methods (for simplicity, ignoring any influence of data skewness). After collecting n = 50 samples from each area, the mean fish weight per transect inside the MPA is found to be X̄1 = 6.1 kg, outside it is X̄2 = 5.1 kg, and the pooled standard deviation of fish sizes inside and outside the MPA is Sp = 4.0 kg. Then, to test the nil hypothesis that there is in fact no difference whatsoever between the means of these two populations (i.e. within the MPA and beyond it), we calculate the test statistic T = [(6.1 − 5.1)/4.0] × √(50/2) = 0.250 × 5 = 1.250. Let us now see what T might be for a larger dataset. Bearing in mind that the sample means and variances are unbiased and consistent estimators of their true population values, on average their values for larger sample sizes would be little different. Say we have 500 samples at each site and obtain X̄1 = 6.2 kg, X̄2 = 5.3 kg and Sp = 4.2 kg. Then T = [(6.2 − 5.3)/4.2] × √(500/2) = 0.214 × √250 = 3.388. Finally, consider taking 700 samples, obtaining X̄1 = 6.0 kg, X̄2 = 5.1 kg and Sp = 4.3 kg, in which case T = [(6.0 − 5.1)/4.3] × √(700/2) = 0.209 × √350 = 3.916. The P values for these three situations (obtained, for example, using Excel's 'T.DIST.2T' function) are 0.214, 0.00073 and 0.000094, respectively. At the 5 % significance level, only the last two results would be statistically significant. Having more samples has led to smaller P values, even though the measured effect size happens to have decreased a little (from 0.250 to 0.214 to 0.209). What is more, this simple pattern will most usually occur when we test a nil hypothesis: P will decrease as n is increased (there will be rare exceptions when the estimated effect size decreases substantially with a larger number of samples, by a factor greater than the square root of the proportional increase in n).

We think that a more meaningful hypothesis to test would be concerned with whether a difference of, say, 0.5 kg in fish size is large enough to be of environmental significance, so that the equivalence interval limits are ±0.5 kg.

Appendix 2. Procedures are all performed at level α (not α/2)

Under the three-valued logic procedure, error risks concern erroneously inferring the direction-of-change in one direction or the other, were such an inference to be made. (The third possible inference, that there are too few data to detect the direction-of-change, is not an error; Jones and Tukey 2000.) In considering the probability of one of these two errors occurring, a critical departure arises from the two-sided nil hypothesis test's decision rule, which is derived by minimising the risk of committing the type I error (of falsely rejecting a hypothesis).² That rule states that its hypothesis should be rejected if the test statistic cuts off an area of no more than α/2 in either tail of the t distribution; this is equivalent to examining whether the measured difference in sample means is contained within a 100(1 − α)% confidence interval. In the three-valued logic of the two one-sided tests (TOST) approach, the decision rule for each of the two one-sided tests rests on whether the test statistic cuts off an area of no more than α (not α/2) in the appropriate tail of the t distribution. This is equivalent to couching the rule in terms of a 100(1 − 2α)% confidence interval. At first this may be surprising. Indeed, in the context of bioequivalence tests, Berger and Hsu (1996) noted: 'The fact that the TOST seemingly corresponds to a 100(1−2α)%, not 100(1−α)%, confidence interval procedure initially caused some concern (Westlake 1976, 1981) … but many authors (e.g. Chow and Shao 1990, and Schuirmann 1989) have defined bioequivalence tests in terms of 100(1−α)% confidence sets'. Indeed, Berger and Hsu (1996), based on earlier material by Berger (1982), present a theorem showing that if each of the two individual TOST tests is performed at level α, the overall test has the same level. Its proof rests on 'intersection–union' theory (IUT); TOST is a simple example of an IUT. Importantly, Schuirmann (1996) noted that this finding rests on the requirement that the 100(1 − 2α)% confidence interval is equi-tailed, i.e. symmetrical about the estimated difference in means, which is the approach adopted herein.

² As we have noted, if the nil hypothesis cannot be true, this error cannot occur.

References

Anderson, P. D., & Meleason, M. A. (2009). Discerning responses of down wood and understory vegetation abundance to riparian buffer width and thinning treatments: an equivalence–inequivalence approach. Canadian Journal of Forest Research, 39, 2470–2485.
Beninger, P. G., Boldina, I., & Katsanevakis, S. (2012). Strengthening statistical usage in marine ecology. Journal of Experimental Marine Biology and Ecology, 426–427, 97–108.
Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24, 295–300.
Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses (rejoinder to Cox 1987). Statistical Science, 2(3), 348.
Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection–union tests and equivalence confidence sets (with discussion). Statistical Science, 11(4), 283–319.
Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37, 325–335.
Bohrer, R. (1979). Multiple three-decision rules for parametric signs. Journal of the American Statistical Association, 74, 432–437.
Brosi, B. J., & Biber, E. G. (2009). Statistical inference, type II error, and decision making under the US Endangered Species Act. Frontiers in Ecology and the Environment, 7(9), 487–494.
Bross, I. D. (1985). Why proof of safety is much more difficult than proof of hazard. Biometrics, 41, 785–793.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: a practical information-theoretic approach (2nd ed.). New York: Springer-Verlag.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.
Chow, S. L. (1996). Statistical significance: rationale, validity and utility. London: Sage.
Chow, S.-C., & Shao, J. (1990). An alternative approach for the assessment of bioequivalence between two formulations of a drug. Biometrical Journal, 32, 969–976.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale: Lawrence Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.
Cole, R. G., & McBride, G. B. (2004). Assessing impacts of dredge spoil disposal using equivalence tests: implications of a precautionary (proof of safety) approach. Marine Ecology Progress Series, 279, 63–72.
Cox, D. R. (1987). Comment on Berger, J. O., & Delampady, M., Testing precise hypotheses. Statistical Science, 2(3), 335–336.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals based on central and noncentral distributions. Educational and Psychological Measurement, 61, 530–572.
DeGroot, M. H. (1973). Doing what comes naturally: interpreting a tail area as a posterior probability or as a likelihood ratio. Journal of the American Statistical Association, 68, 966–969.
Dixon, P. M., & Pechmann, H. K. (2005). A statistical test to show negligible trend. Ecology, 86(7), 1751–1756.
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242.
Fleiss, J. L. (1986). Significance tests have a role in epidemiologic research: reactions to A. M. Walker (different views). American Journal of Public Health, 76, 559–560.
Freund, J. E. (1992). Mathematical statistics (5th ed.). Upper Saddle River: Prentice-Hall.
Frick, R. W. (1995). Accepting the null hypothesis. Memory and Cognition, 23(1), 132–138.
Germano, J. D. (1999). Ecology, statistics, and the art of misdiagnosis: the need for a paradigm shift. Environmental Reviews, 7, 167–190.
Gerrodette, T. (2011). Inference without significance: measuring support for hypotheses rather than rejecting them. Marine Ecology, 32(3), 404–418.
Gibbons, J. D., & Pratt, J. W. (1975). P-values: interpretation and methodology. American Statistician, 29, 20–25.
Goudey, R. (2007). Do statistical inferences allowing three alternative decisions give better feedback for environmentally precautionary decision-making? Journal of Environmental Management, 85, 338–344.
Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52(1), 15–24.
Harris, R. J. (1997a). Significance tests have their place. Psychological Science, 8(1), 8–11.
Harris, R. J. (1997b). Reforming significance testing via three-valued logic. In L. L. Harlow, S. A. Muliak, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 145–174). Mahwah: Lawrence Erlbaum.
Harris, R. J. (2001). A primer of multivariate statistics (3rd ed.). Mahwah: Lawrence Erlbaum.
Hodges, J. L., & Lehmann, E. L. (1954). Testing the approximate validity of statistical hypotheses. Journal of the Royal Statistical Society, Series B, 16, 261–268.
Jeffreys, H. S. (1961). Theory of probability. Oxford: Oxford University Press.
Johnson, D. H. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management, 63(3), 763–772.
Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5(4), 411–414.
Kaiser, H. F. (1960). Directional statistical decisions. Psychological Review, 67(3), 160–167.
Läärä, E. (2009). Statistics: reasoning on uncertainty, and the insignificance of testing null. Annales Zoologici Fennici, 46(2), 138–157.
Lee, P. M. (1997). Bayesian statistics: an introduction (2nd ed.). London: Arnold.
Lehmann, E. L. (1986). Testing statistical hypotheses (2nd ed.). New York: Wiley.
McBride, G. B. (1999). Equivalence tests can enhance environmental science and management. Australian and New Zealand Journal of Statistics, 41(1), 19–29.

McBride, G. B. (2002). Statistical methods helping and hindering environmental science and management. Journal of Agricultural, Biological, and Environmental Statistics, 7, 300–305.
McBride, G. B. (2005). Using statistical methods for water quality management: issues, options and solutions. New York: Wiley.
McBride, G. B., Loftis, J. C., & Adkins, N. C. (1993). What do significance tests really tell us about the environment? Environmental Management, 17(4), 423–432 (erratum: 18, 317).
Newman, M. C. (2008). "What exactly are you inferring?" A closer look at hypothesis testing. Environmental Toxicology and Chemistry, 27(5), 1013–1019.
Platt, J. R. (1964). Strong inference. Science, 146(3642), 347–353.
Quinn, J. M., Davies-Colley, R. J., Hickey, C. W., Vickers, M. L., & Ryan, P. A. (1992). Effects of clay discharges in streams: 2. Benthic invertebrates. Hydrobiologia, 248, 235–247.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416–428.
Schervish, M. J. (1996). P values: what they are and what they are not. American Statistician, 50(3), 203–206.
Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15, 657–680.
Schuirmann, D. J. (1989). Confidence intervals for the ratio of two means from a crossover study. In Proceedings of the Biopharmaceutical Section (pp. 121–126). Alexandria: American Statistical Association.
Schuirmann, D. J. (1996). Comment on 'Bioequivalence trials, intersection–union tests and equivalence confidence sets' by R. L. Berger & J. C. Hsu. Statistical Science, 11(4), 312–313.
Smithson, M. (2000). Statistics with confidence. London: Sage.
Tukey, J. W. (1960). Conclusions vs decisions. Technometrics, 2, 423–433.
Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.
Wellek, S. (2003). Testing statistical hypotheses of equivalence. Boca Raton: Chapman and Hall/CRC.
Westlake, W. J. (1976). Symmetric confidence intervals for bioequivalence trials. Biometrics, 32, 741–744.
Westlake, W. J. (1981). Response to TBL Kirkwood: bioequivalence testing—a need to rethink. Biometrics, 37, 589–594.
Zar, J. H. (1984). Biostatistical analysis (2nd ed.). Englewood Cliffs: Prentice-Hall.