The Impotency of Post Hoc Power
By Emma Gunnarsson and Hugo Sebyhed
Department of Statistics, Uppsala University
Supervisor: Per Johansson
Autumn 2020

ABSTRACT

In this thesis, we hope to dispel some of the confusion regarding so-called post hoc power, i.e. power computed under the assumption that the estimated sample effect equals the population effect size. Previous research has shown that post hoc power is a function of the p-value, making it redundant as a tool of analysis. We go further, arguing that it should never be reported, since it is a source of confusion and of potentially harmful incentives. We also conduct a Monte Carlo simulation to illustrate our points of view; its results confirm the previous research.

KEY WORDS: Hypothesis testing, statistical inference, retrospective power, observed power, a posteriori power, a priori power, true power

Table of Contents

1. Introduction
2. Inference and Hypothesis Testing
2.1 Fisher's Hypothesis Testing
2.2 Neyman-Pearson Hypothesis Testing
3. Statistical Power
3.1 A Priori Power
3.2 Post Hoc Power
4. Simulation Study
4.1 Simulation Setup
4.2 Results
5. Discussion
References

1. Introduction

Hypothesis testing is widely used across research fields where one aspires to make inferences. Oftentimes, the testing procedure is oversimplified and applied without a rigorous understanding of what it can in fact achieve. A wide array of aspects must be considered for a test to be valid, which taken together are far beyond the scope of this paper. One of them, however, statistical power, will be discussed thoroughly. As defined by Neyman and Pearson, power is the conditional probability of finding a significant effect, given a specific true effect. Since a test with low power is unlikely to detect effects, the researcher must ensure sufficient power before conducting a study, mainly by increasing the sample size. Although power analysis has experienced an upsurge across inference research (e.g., Bausell and Li, 2002; Murphy and Myors, 2004; Staffa and Zurakowski, 2020), there is a substantial literature suggesting that a large part of published research is underpowered (e.g., Ioannidis et al., 2017; Maxwell, 2004; Mone et al., 1996). Perhaps in response to this, it has been suggested that power be computed post hoc, using the estimated sample effect (e.g., Carle, 2020; Fagley, 1985; Hallahan and Rosenthal, 1996; Onwuegbuzie and Leech, 2004).

Post hoc power has indeed been used in applied research. Bababekov and Chang (2019) use it to help interpret non-significant results in surgical science, while Gantner et al. (2018), for example, report it and argue that it eases the design of future studies in neurology. Moreover, Nadler et al. (2018) use it to argue that cancer medicine research is overpowered. These research fields share the trait of having difficulties obtaining large enough samples, making power discussions especially important.

Despite the noble intentions of its advocates, post hoc power has several inherent flaws. These have been pointed out by statisticians ever since the measure was first promoted, and they are the main focus of this thesis. Here, we aim to dispel the confusion concerning post hoc power, partly through a thorough review of the previous literature and partly through a Monte Carlo simulation study.
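To preview the redundancy result that previous research has established and that our simulation revisits, the following minimal Python sketch (our own illustration, not code from the thesis; the function name post_hoc_power is hypothetical) computes the post hoc power of a two-sided z-test directly from its p-value, with no other input from the data:

```python
from scipy.stats import norm

def post_hoc_power(p_value, alpha=0.05):
    """Post hoc ('observed') power of a two-sided z-test, computed under
    the assumption that the true effect equals the estimated sample
    effect. Note that the data enter only through the p-value."""
    z_obs = norm.ppf(1 - p_value / 2)   # |z|-statistic implied by the p-value
    z_crit = norm.ppf(1 - alpha / 2)    # critical value of the test
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

# A p-value exactly at the 5% threshold always maps to ~50% post hoc power:
print(round(post_hoc_power(0.05), 3))  # 0.5
```

Since this mapping from p-value to post hoc power is one-to-one for a given significance level, reporting both conveys no more information than reporting the p-value alone; this is the redundancy examined in Section 3.2 and in the simulation study.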
Since the research in which post hoc power has been proposed as useful mainly comes from causal inference, that will be our main focus. We argue that post hoc power is not only redundant, but also confusing and a potential source of harmful incentives among researchers and journals. Therefore, we recommend that post hoc power be conclusively abandoned. Power considerations are indeed important in the hypothesis testing procedure and should be thoroughly discussed in any given experiment. However, such analysis must be based on hypothesised effect sizes. Ideally, researchers would calculate power for several different hypothesised true effects, thereby providing a range of plausible powers for the study. Furthermore, the practice of applying post hoc power to evaluate non-significant results is untrue to the fundamentals of inference: power, being a probability, is no longer relevant to discuss once the study has been conducted. Other tools are more apt for evaluating the results found, namely the magnitude of the estimate and its confidence interval.

The paper is structured as follows. We begin with a description of hypothesis testing in general, covering both the original Fisherian approach and the later contributions of Neyman and Pearson, the inventors of power as defined above. Although Fisher strongly opposed power so defined, we present both testing approaches in order to provide a deeper understanding of the disagreements. Thereafter, in Section 3.1, we discuss power as intended by Neyman and Pearson, which we denote "a priori power". Neyman and Pearson never used that exact terminology, but for the purposes of this study, where we wish to distinguish it from post hoc power, we find it suitable. Section 3.2 discusses post hoc power. Section 4 describes the simulation study in depth, and Section 5 concludes.

2. Inference and Hypothesis Testing

Statistical inference refers to methods for estimating population properties from observed samples, including the procedures of hypothesis testing. It may serve purely descriptive purposes, for example determining the population mean by studying the sample mean, or it may aim at determining the causal effect of some kind of "treatment". Power considerations are important in all sorts of hypothesis tests, but since the promoters of post hoc power have mainly worked in the field of causal inference, that is our main focus. Of course, the reasoning presented in this paper can, with minor modifications, be generalized to all inference research.

Before moving on to a discussion of power, we briefly present the process of hypothesis testing; only through a thorough understanding of this process can one fully understand the concept of power. We will discuss both Fisher's testing approach and that of Neyman and Pearson (NP). In today's practice, the two (together with some Bayesian elements) are combined into a hybrid, the so-called Null Hypothesis Significance Testing (NHST) (Denis, 2004). However, since the Fisher and NP frameworks are in fact conceptually rather incompatible, we present them separately. Presenting the theories separately will also make clear that only in the latter approach is power, as we know it in NHST, considered useful.

First, however, one must note that a statistical test alone can never be the sole determinant of statistical inference, regardless of approach. First and foremost, one must have a sample that is representative of the population of interest.
For causal inference, the assumption of a comparable counterfactual must also be satisfied. Neither of these assumptions can be tested by any hypothesis test; rather, one must use proper experimental designs and correct sampling techniques to argue for their fulfilment. However, once those requirements are met, hypothesis testing provides a powerful tool for evaluating the results seen in the sample data. Moreover, for a statistical test to be valid, all necessary assumptions of the test in question must be fulfilled, regarding, for example, distributions or the independence of observations. Although these are important issues, they will not be discussed further; rather, we assume that the statistical test at hand is performed in line with all necessary requirements, so that statistical inference is possible. In fact, only then does it become relevant to discuss the power of the test.

Finally, one must understand what the randomness of a sample entails. Different samples will never be exactly the same (at least for continuous data), even when drawn from the same population using impeccable sampling techniques; by pure chance, the resulting samples will differ. Due to this random sampling error, sample estimates are considered random variables with a sampling distribution. Hypothesis testing is, at its core, a tool for evaluating these distributions in order to make inferences.

2.1 Fisher's Hypothesis Testing

Fisher is a widely known statistician, considered by many the father of statistics. He conducted the famous "lady tasting tea" experiment, testing whether a lady could, as she claimed, tell whether the tea or the milk had been poured into the cup first (Fisher, 1935). This marks the start of the hypothesis testing era.

In the Fisherian hypothesis testing approach, one generally formulates a single null hypothesis to be tested (Christensen, 2005; Fisher, 1935). A test is then applied to evaluate the discrepancy between the observed data and the data expected under the null. There are many kinds of Fisherian tests, depending on the research question and the nature of the data. What they have in common is the computation of a p-value: the probability of obtaining a result as extreme as, or more extreme than, the one observed, given that the null is true. This p-value is the quantity to be evaluated. If it is small, the null hypothesis does not seem to explain the data well. The p-value is thus considered a continuous measure of evidence against the stated null: the lower the p-value, the greater the evidence against it. Since Fisherian testing is not the main focus of this paper, we confine ourselves to a brief discussion of the topic. Having said that, in order to secure a basic understanding of the Fisherian approach, we present a simple example: Fisher's Exact Test.
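To make the example concrete, consider Fisher's original design: eight cups of tea, four with milk poured first and four with tea poured first, which the lady is asked to classify. The following minimal sketch (our own illustration, not code from the thesis) uses SciPy's fisher_exact and assumes she classifies every cup correctly:

```python
from scipy.stats import fisher_exact

# 2x2 table for the "lady tasting tea": rows are the true pouring order,
# columns are the lady's guesses; here she classifies all 8 cups correctly.
table = [[4, 0],   # milk-first cups: 4 guessed milk-first, 0 guessed tea-first
         [0, 4]]   # tea-first cups:  0 guessed milk-first, 4 guessed tea-first

# One-sided test; the null hypothesis is that her guesses are
# unrelated to the true pouring order (i.e. she is guessing at random).
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(p_value)  # 1/70, approximately 0.0143
```

Under the null of random guessing, a perfect classification of all eight cups has probability 1/70, so the small p-value constitutes, in Fisher's terms, considerable evidence against the null.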