RMM Vol. 3, 2012, 71–107
Special Topic: Statistical Science and Philosophy of Science
Edited by Deborah G. Mayo, Aris Spanos and Kent W. Staley
http://www.rmm-journal.de/

Deborah Mayo

Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations

Abstract: Inability to clearly defend against the criticisms of frequentist methods has turned many a frequentist away from venturing into foundational battlegrounds. Conceding the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson 'really thought', some deny they matter to current practice. The goal of this paper is not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools.

1. Comedy Hour at the Bayesian Retreat

Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist...

"who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he's pretty reliable?"

or

"who claimed that observing 'heads' on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?"

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of 'straw-men' fallacies, they form the basis of why some statisticians and philosophers reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the reader to stay and find out.

If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call 'error statistics', continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.

First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called probabilism. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define 'controlling long-run error', it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error-statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of 'There's No Theorem Like Bayes's Theorem'.

Never mind that frequentists have responded to these criticisms; they keep popping up (verbatim) in Bayesian, and some non-Bayesian, textbooks and articles on philosophical foundations. No wonder that statistician Stephen Senn is inclined to "describe a Bayesian as one who has a reverential awe for all opinions except those of a frequentist statistician" (Senn 2011, 59, this special topic of RMM).
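To see in miniature why the second joke is a straw man rather than a genuine criticism, consider a minimal simulation sketch (mine, not the paper's; the .05 coin is from the joke, everything else is an assumed toy). The coin-flip 'test' announces a 'significant improvement' with probability .05 whether or not the treatment does anything, so a low long-run error rate is 'controlled' only vacuously:

```python
import random

def coin_flip_test(rng: random.Random, p_heads: float = 0.05) -> bool:
    """The joke's 'test': declare a statistically significant improvement
    over the standard treatment whenever the biased coin lands heads."""
    return rng.random() < p_heads

rng = random.Random(1)
n = 100_000

# The verdict never consults the treatment data, so 'significance' is
# declared at the same .05 rate whether the treatment works or not:
rate = sum(coin_flip_test(rng) for _ in range(n)) / n
print(f"rate of 'significant improvement' verdicts: {rate:.3f}")  # ~0.05
```

No error statistician licenses such a procedure, precisely because its verdict is probabilistically independent of the hypothesis it claims to appraise; section 2.2 returns to this point.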
Never mind that a correct understanding of the error-statistical demands belies the assumption that any method (with good performance properties in the asymptotic long run) succeeds in satisfying those demands.

The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees and (ii) avoiding pathological examples in practice has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson 'really thought'. I regard this as a shallow way to do foundations.

Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error-statistical philosophy. Here I mostly sketch key ingredients and report on updates in a larger, ongoing project.

2. Popperians Are to Frequentists as Carnapians Are to Bayesians

Statisticians do, from time to time, allude to better-known philosophers of science (e.g., Popper). The familiar philosophy/statistics analogy—that Popper is to frequentists as Carnap is to Bayesians—is worth exploring more deeply, most notably the contrast between the popular conception of Popperian falsification and inductive probabilism. Popper himself remarked:

"In opposition to [the] inductivist attitude, I assert that C(H,x) must not be interpreted as the degree of corroboration of H by x, unless x reports the results of our sincere efforts to overthrow H. The requirement of sincerity cannot be formalized—no more than the inductivist requirement that x must represent our total observational knowledge." (Popper 1959, 418; I replace 'e' with 'x')

In contrast with the more familiar reference to Popperian falsification, and its apparent similarity to statistical significance testing, here we see Popper alluding to failing to reject, or what he called the "corroboration" of hypothesis H. Popper chides the inductivist for making it too easy for agreements between data x and H to count as giving H a degree of confirmation:

"Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory—or in other words, only if they result from serious attempts to refute the theory." (Popper 1994, 89)

(Note the similarity to Peirce in Mayo 2011, 87, this special topic of RMM.)

2.1 Severe Tests

Popper did not mean to cash out 'sincerity' psychologically, of course, but in some objective manner. Further, high corroboration must be ascertainable: 'sincerely trying' to find flaws will not suffice. Although Popper never adequately cashed out his intuition, there is clearly something right in this requirement. It is the gist of an experimental principle presumably accepted by Bayesians and frequentists alike, thereby supplying a minimal basis to philosophically scrutinize different methods (Mayo 2011, section 2.5, this special topic of RMM).

Error-statistical tests lend themselves to the philosophical standpoint reflected in the severity demand. Pretty clearly, evidence is not being taken seriously in appraising hypothesis H if it is predetermined that, even if H is false, a way would be found to either obtain, or interpret, data as agreeing with (or 'passing') hypothesis H. Here is one of many ways to state this:

Severity Requirement (weakest): An agreement between data x and H fails to count as evidence for a hypothesis or claim H if the test would yield (with high probability) so good an agreement even if H is false.

Because such a test procedure had little or no ability to find flaws in H, finding none would scarcely count in H's favor.
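The requirement can be put slightly more formally. This is a sketch in notation of my own choosing, not the paper's: 'SEV' is adapted from the error-statistical literature, and the semicolon reads 'computed under the supposition that', since no prior probability is assigned to H.

```latex
% Sketch of the weak severity requirement (assumed notation):
% x_0 is the observed data, T the test that H 'passed', and ';' reads
% 'computed under the supposition that' (H receives no prior probability).
\[
  \mathrm{SEV}(T, x_0, H) \;=\;
  P\bigl(T \text{ yields a worse agreement with } H \text{ than } x_0 \text{ does}\;;\; H \text{ false}\bigr)
\]
% The requirement: x_0 fails to count as evidence for H whenever
% SEV(T, x_0, H) is low, i.e., whenever so good an agreement would
% (probably) have come about even were H false.
```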
2.1.1 Example: Negative Pressure Tests on the Deep Horizon Rig

Did the negative pressure readings provide ample evidence that:

H0: leaking gases, if any, were within the bounds of safety (e.g., less than θ0)?

Not if the rig workers kept decreasing the pressure until H0 passed, rather than performing a more stringent test (e.g., a so-called 'cement bond log' using acoustics). Such a lowering of the hurdle for passing H0 made it too easy to pass H0 even if it was false, i.e., even if in fact:

H1: the pressure build-up was in excess of θ0.

That 'the negative pressure readings were misinterpreted' meant that it was incorrect to construe them as indicating H0. If such negative readings would be expected, say, 80 percent of the time, even if H1 is true, then H0 might be said to have passed a test with only .2 severity. Using Popper's nifty abbreviation, it could be said to have low corroboration, .2. So the error probability associated with the inference to H0 would be .8—clearly high. This is not a posterior probability, but it does just what we want it to do.

2.2 Another Egregious Violation of the Severity Requirement

Too readily interpreting data as agreeing with or fitting hypothesis H is not the only way to violate the severity requirement. Using utterly irrelevant evidence, such as the result of a coin flip to appraise a diabetes treatment, would be another way. In order for data x to succeed in corroborating H with severity, two things are required: (i) x must fit H, for an adequate notion of fit, and (ii) the test must have a reasonable probability of finding worse agreement with H, were H false.
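Condition (ii) is what the rig example quantified. Here is a minimal numerical sketch (my own; the 80 percent figure is the paper's, while the function name and the Bernoulli simulation are assumed toys, not a model of the rig):

```python
import random

def severity_of_pass(pass_prob_if_false: float) -> float:
    """Condition (ii) made numerical: the severity with which H passes a
    test is the probability the test would have yielded a worse agreement
    with H, were H false."""
    return 1.0 - pass_prob_if_false

# Section 2.1.1's figures: negative readings ('passing' H0) would be
# expected 80% of the time even if H1 were true, so:
print(severity_of_pass(0.80))  # 0.2 -- and the error probability of
                               # inferring H0 is correspondingly .8

# The same number by simulation (toy Bernoulli stand-in for the lenient
# rig protocol):
rng = random.Random(0)
n = 100_000
passes_despite_h1 = sum(rng.random() < 0.80 for _ in range(n))
print(1 - passes_despite_h1 / n)  # approximately 0.2
```

The coin-flip 'test' of section 1 fails condition (ii) outright: the probability of a worse agreement with H is the same whether H is true or false, so H's passing discriminates nothing.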