We Agree That Statistical Significance Proves Essentially Nothing: A Rejoinder to Thomas Mayer
Econ Journal Watch 10(1), January 2013: 97-107

Stephen T. Ziliak (Roosevelt University) and Deirdre N. McCloskey (University of Illinois at Chicago)

We are happy to reply a second time to Tom Mayer, and are flattered that he has devoted so much time to our book. Long before most readers here were born, Professor Mayer was already making major contributions to economics—as recognized by a festschrift in his honor (Hoover and Sheffrin, eds., 1996).

Mayer (2012, 2013) strongly agrees with us on four of the five major claims we make in our book about the theory and practice of significance. We are not surprised. As we noted in Ziliak and McCloskey (2004a, b), in reply to other critics from Graham Elliott and Clive Granger (2004) to Jeffrey Wooldridge (2004), Edward Leamer (2004), and Arnold Zellner (2004), leading econometricians find on reflection that they agree with our points, and realize with us that most of modern econometrics has to be redone, focusing on economic significance and not on mere statistical significance. So we have been saying since 1985 (McCloskey 1985).

Mayer limits his complaints to certain aspects of the theoretical and empirical claims we make about significance testing in economics as practiced in the 1990s American Economic Review. He ignores, for example, Ziliak's archival work, especially on William Sealy Gosset and Ronald A. Fisher, and our chapters on the 20th-century history, philosophy, sociology, and practice of significance testing in the life and human sciences, from agronomy to zoology and from Gosset to Joshua Angrist (Ziliak and McCloskey 2008, chs. 1, 6-7, 16-24). We wish Mayer took a wider view.

Our complaints about statistical significance have been routine in other fields for a century, and are not "sweeping claims" (Mayer 2013, 87). Mayer asserts, without citation, "that [our] point has also been criticized repeatedly for an equally long time" (88). That is quite mistaken. No, it has not been. No one has refuted our central theses, not ever. In several dozen journal reviews and in comments we have received—from, for example, four Nobel laureates, the statistician Dennis Lindley (2012), the mathematician Olle Häggström (2010), the sociologist Steve Fuller (2008), and the historian Theodore Porter (2008)—no one, Tom Mayer included, has tried to actually defend null hypothesis significance testing. True, people sometimes express anger. Given the large implications for science and policy, we understand. But no one has offered real arguments.

We praised Kevin Hoover and Mark Siegler (2008) for attempting to use significance tests in step-wise fashion, that is, as a mechanical screening device for sorting possibly important from unimportant relationships (McCloskey and Ziliak 2008, 46-47). But they fumbled the ball, and for the same reason that Mayer (2013, 94) does and must: statistical significance does not and cannot do the job of interpretation and decision-making.

But the main point here is that Mayer mostly agrees with us. We are in fact in complete agreement about four of the five major claims we make in The Cult of Statistical Significance (Ziliak and McCloskey 2008, passim):
1.) Economic significance is not the same thing as statistical significance: each can exist without the other—without, that is, either the presence or the prospect of the other—though economists often confuse and conflate the two kinds of significance.

Mayer agrees. As he admits in his second assessment of our work by quoting his first assessment:

But, far from supporting the orthodox view of significance tests I wrote: "Z-M are right…one must guard against substituting statistical for substantive significance…. They are also right in criticizing the wrong-way-round use of significance tests" ([Mayer 2012,] 278), and that in tests of maintained hypotheses the latter error "is both severe enough and occurs frequently enough to present a serious—and inexcusable—problem" (279). (Mayer 2013, 88)

2.) Economists allocate far more time, space, and legitimacy to calculating statistical significance than to exploring what they should: economic significance.

Mayer agrees: "[I]n countering the mechanical way in which significance tests are often used, and in introducing economists to the significance-test literature in other fields, Z-M have rendered valuable services" (Mayer 2012, 279, quoted in Mayer 2013, 88). A few pages on, he goes further, fixing on the main point of our agreement: "M-Z claim that a significance test is 'not answering the scientific or policy or other human question at issue'…. Yes, I agree" (Mayer 2013, 90).

3.) The probability of the hypothesis given the available evidence is not equal to the probability of the evidence assuming the null hypothesis is true. Thus the null hypothesis test procedure is illogical in its primitives, equations, and conclusions. The illogic of the test is an example of what is known as the fallacy of the transposed conditional.

Mayer again agrees, in this case with mere logic and common sense. As he said both in his first assessment of our work and in the current reply: "They [Z-M] are also right in criticizing the wrong-way-round use of significance tests" (Mayer 2012, 278, quoted in Mayer 2013, 88 and 93; see also Mayer 2012, 271). The null test of significance is illogical because it measures the probability of the evidence assuming a true null hypothesis but then pretends to answer a much more important question: the probability of the hypothesis, given the new evidence (a numerical sketch of the difference follows below). Thus economists do not know the probability of their hypotheses.

4.) Assessing probabilities of hypotheses (Bayesian, classical, and other), comparing expected loss functions, and demonstrating substantive significance ought to be the central activities in applied econometrics and in most other statistical sciences.

Here again, Mayer agrees. "M-Z criticize my argument ([Mayer 2012,] 264, third paragraph) that a variable may be important regardless of its oomph. In this they are right; my argument was muddled, and I withdraw it with apologies to the readers" (2013, 91). In other words, Mayer agrees that statistical significance—though conventional and even required by modern institutions of science, politics, and finance—is a dangerously broken instrument, one that in most cases needs to be put back on the shelf.

Statistical significance proves nothing. Let's get back to science. End of story, off to the pub. With such overlap in our positions, the impartial spectator will begin to suspect that the remaining differences are not very important.
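To make claim (3) concrete, here is a minimal numerical sketch, in Python, of the fallacy of the transposed conditional. The counts, the prior, and the point alternative are illustrative assumptions chosen for the sketch, not figures from our survey; the only point is that the p-value and the probability of the hypothesis are different numbers answering different questions.

```python
# A minimal sketch of the transposed conditional (claim 3). All numbers
# here are illustrative assumptions, not estimates from the survey.
from scipy.stats import binom

# Suppose a researcher observes 60 successes in 100 trials and tests
# the null hypothesis H0: p = 0.5.
n, k = 100, 60

# What the significance test reports: P(evidence at least this extreme | H0).
p_value = 2 * binom.sf(k - 1, n, 0.5)

# What the scientist wants: P(H0 | evidence). That requires a prior.
# Assume, purely for illustration, a 50/50 prior over H0 (p = 0.5)
# and a point alternative H1 (p = 0.6).
prior_h0, prior_h1 = 0.5, 0.5
like_h0 = binom.pmf(k, n, 0.5)   # P(exactly k successes | H0)
like_h1 = binom.pmf(k, n, 0.6)   # P(exactly k successes | H1)
post_h0 = like_h0 * prior_h0 / (like_h0 * prior_h0 + like_h1 * prior_h1)

print(f"P(data at least this extreme | H0) = {p_value:.3f}")  # about 0.057
print(f"P(H0 | data), under this prior     = {post_h0:.3f}")  # about 0.117
```

The two numbers differ, and no manipulation of the first yields the second without a prior; that is the wrong-way-round problem in miniature.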
But Mayer hesitates, perhaps out of an admirable loyalty to other economic scientists, such as Kevin Hoover and Aris Spanos. He persists in trying to defend another side of the story, as though there were one, kicking at a horse that has long been dead, as Thomas Schelling (2004) has noted. Then he falls back into the main and ubiquitous confusion, which our work has sought to repair. Mayer (2013, 89-90) offers a hypothetical test of y > x in 98 cases out of 100 as evidence of a "significant difference." Wait a minute. Suppose y was 4.2 and x was 4.1 and there was no scientific or policy reason to care about a difference of 0.1. "A binomial sign test does tell us something," he writes (90). Yes, but it does not tell us that a difference is important, which is overwhelmingly the scientific point at issue, a point which Mayer elsewhere understands and accepts (the sketch at the end of this reply makes the contrast concrete).

5.) Applying our 19-item questionnaire to all of the 187 full-length empirical articles published in the 1990s in the American Economic Review, we find that about 80% of economists made the significance mistake, equating statistical significance with economic importance, and lack of statistical significance with unimportance.

Mayer (2012, 279) agrees that the error is "inexcusable" but he disagrees with our estimate (2013, abs. and 93). Notice that of the five main claims we make in The Cult, the only real difference of opinion between Mayer and us is not a difference in kind but of degree. This makes our point rather well: quantitative oomph, not a qualitative statement that p < .05, is the main point of science.

Mayer did not conduct a replication of our survey of significance testing in American Economic Review articles from the 1990s. Neither did he attempt to replicate our survey of 182 articles published in the same journal during the 1980s (Ziliak and McCloskey 2008, chs. 6-7; Ziliak and McCloskey 2004a, b; McCloskey and Ziliak 1996). He did not use our survey at all. And yet he expresses discontent with the estimate that emerges from our application of our survey to the pages of the AER. We find, to repeat, that 80% or more of our colleagues who published in the AER in the 1990s made the significance mistake—the mistake of not understanding claim number (1) above—causing incorrect decisions about models, variables, and actual economies. Mayer says he disagrees. We do not accept his alleged evidence, but even if we did, we do not believe that a failure rate of 20% (his lower bound) or of 50% (his upper bound), though lower than the 80% rate we found, would be cause for jubilation. A 50% failure rate in significance testing would still be akin to choosing hypotheses, estimates, and economic policies on the basis of random coin-flipping.
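As a minimal sketch of the contrast discussed above, the following uses Mayer's hypothetical figures (y > x in 98 of 100 cases, with means of 4.2 and 4.1); the two-line Python calculation is merely illustrative. The sign test comes out overwhelmingly "significant," while the difference stays trivially, uninterestingly small.

```python
# A minimal sketch of the sign-test point, assuming the hypothetical
# figures in the text (y > x in 98 of 100 paired cases; means 4.2 and 4.1).
from scipy.stats import binomtest

# The binomial sign test on 98 "wins" out of 100 is overwhelming.
result = binomtest(98, n=100, p=0.5, alternative='greater')
print(f"sign-test p-value: {result.pvalue:.1e}")   # about 4e-27: "significant"

# But oomph is what matters for science and policy, and here there is none.
y_mean, x_mean = 4.2, 4.1
print(f"difference in means: {y_mean - x_mean:.1f}")   # 0.1, of no substantive interest
```

The test answers "is the difference reliably of one sign?"; it is silent on the scientific question, "is the difference big enough to matter?"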