Statistical significance
In statistical hypothesis testing,[1][2] statistical significance (or a statistically significant result) is attained whenever the observed p-value of a test statistic is less than the significance level defined for the study.[3][4][5][6][7][8][9] The p-value is the probability of obtaining results at least as extreme as those observed, given that the null hypothesis is true. The significance level, α, is the probability of rejecting the null hypothesis, given that it is true.[10] This statistical technique for testing the significance of results was developed in the early 20th century.

In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[11][12] But if the p-value of an observed effect is less than the significance level, an investigator may conclude that the effect reflects the characteristics of the whole population,[1] thereby rejecting the null hypothesis.[13] A significance level is chosen before data collection, and is typically set to 5%[14] or much lower, depending on the field of study.[15]

The term significance does not imply importance, and the term statistical significance is not the same as research, theoretical, or practical significance.[1][2][16] For example, the term clinical significance refers to the practical importance of a treatment effect.

1 History

Main article: History of statistics

In 1925, Ronald Fisher advanced the idea of statistical hypothesis testing, which he called "tests of significance", in his publication Statistical Methods for Research Workers.[17][18][19] Fisher suggested a probability of one in twenty (0.05) as a convenient cutoff level to reject the null hypothesis.[20] In a 1933 paper, Jerzy Neyman and Egon Pearson called this cutoff the significance level, which they named α. They recommended that α be set ahead of time, prior to any data collection.[20][21]

Despite his initial suggestion of 0.05 as a significance level, Fisher did not intend this cutoff value to be fixed. For instance, in his 1956 publication Statistical Methods and Scientific Inference he recommended that significance levels be set according to specific circumstances.[20]

1.1 Related concepts

The significance level α is the threshold for p below which the experimenter assumes the null hypothesis is false, and that something else is going on. This means α is also the probability of mistakenly rejecting the null hypothesis, if the null hypothesis is true.[22]

Sometimes researchers talk about the confidence level γ = (1 − α) instead. This is the probability of not rejecting the null hypothesis given that it is true.[23][24] Confidence levels and confidence intervals were introduced by Neyman in 1937.[25]

2 Role in statistical hypothesis testing

Main articles: Statistical hypothesis testing, Null hypothesis, Alternative hypothesis, p-value, and Type I and type II errors

[Figure: In a two-tailed test, the rejection region for a significance level of α = 0.05 is partitioned to both ends of the sampling distribution and makes up 5% of the area under the curve (white areas).]

Statistical significance plays a pivotal role in statistical hypothesis testing. It is used to determine whether the null hypothesis should be rejected or retained. The null hypothesis is the default assumption that nothing happened or changed.[26] For the null hypothesis to be rejected, an observed result has to be statistically significant, i.e. the observed p-value must be less than the pre-specified significance level.

To determine whether a result is statistically significant, a researcher calculates a p-value, which is the probability of observing an effect given that the null hypothesis is true.[9] The null hypothesis is rejected if the p-value is less than a predetermined level, α. α is called the significance level, and is the probability of rejecting the null hypothesis given that it is true (a type I error). It is usually set at or below 5%.

For example, when α is set to 5%, the conditional probability of a type I error, given that the null hypothesis is true, is 5%,[27] and a statistically significant result is one where the observed p-value is less than 5%.[28] When drawing data from a sample, this means that the rejection region comprises 5% of the sampling distribution.[29] These 5% can be allocated to one side of the sampling distribution, as in a one-tailed test, or partitioned to both sides of the distribution, as in a two-tailed test, with each tail (or rejection region) containing 2.5% of the distribution.

The use of a one-tailed test depends on whether the research question or alternative hypothesis specifies a direction, such as whether a group of objects is heavier or the performance of students on an assessment is better.[30] A two-tailed test may still be used, but it will be less powerful than a one-tailed test, because the rejection region for a one-tailed test is concentrated on one end of the null distribution and is twice the size (5% vs. 2.5%) of each rejection region for a two-tailed test. As a result, the null hypothesis can be rejected with a less extreme result if a one-tailed test is used.[31] The one-tailed test is only more powerful than a two-tailed test if the specified direction of the alternative hypothesis is correct; if it is wrong, the one-tailed test has no power.
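To make the decision rule above concrete, here is a minimal sketch in Python (an illustration, not part of the article itself); the sample values, the null mean mu0, and the choice of SciPy's one-sample t-test are all assumptions made for the example:

    import numpy as np
    from scipy import stats

    alpha = 0.05  # pre-specified significance level

    # Hypothetical measurements; the null hypothesis is that the
    # population mean equals mu0.
    sample = np.array([5.2, 4.9, 5.6, 5.1, 4.8, 5.4, 5.3, 5.0])
    mu0 = 5.0

    # ttest_1samp reports a two-tailed p-value by default.
    t_stat, p_two_tailed = stats.ttest_1samp(sample, mu0)

    # For the directional alternative "mean > mu0", the one-tailed
    # p-value is half the two-tailed one when the observed effect
    # points in the specified direction (the t distribution under
    # the null hypothesis is symmetric).
    p_one_tailed = p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2

    print(f"t = {t_stat:.3f}, two-tailed p = {p_two_tailed:.3f}, "
          f"one-tailed p = {p_one_tailed:.3f}")
    if p_two_tailed < alpha:
        print("Reject the null hypothesis (statistically significant).")
    else:
        print("Fail to reject the null hypothesis.")

With these made-up numbers the test may or may not reach significance; the point is only the mechanics of comparing the observed p-value against the pre-specified α.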
2.1 Stringent significance thresholds in specific fields

Main articles: Standard deviation and Normal distribution

In specific fields such as particle physics and manufacturing, statistical significance is often expressed in multiples of the standard deviation, or sigma (σ), of a normal distribution, with significance thresholds set at a much stricter level (e.g. 5σ).[32][33] For instance, the certainty of the Higgs boson particle's existence was based on the 5σ criterion, which corresponds to a p-value of about 1 in 3.5 million.[33][34]

In other fields of scientific research, such as genome-wide association studies, significance levels as low as 5×10⁻⁸ are not uncommon.[35][36]
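As a quick sketch of the sigma-to-p-value correspondence (again an illustration, not part of the article), the one-sided tail probability of a standard normal distribution can be evaluated with SciPy:

    from scipy.stats import norm

    # norm.sf(z) = P(Z >= z), the one-sided upper-tail probability
    # of a standard normal distribution.
    for n_sigma in (1, 2, 3, 5):
        p = norm.sf(n_sigma)
        print(f"{n_sigma} sigma: one-sided p = {p:.2e} "
              f"(about 1 in {1 / p:,.0f})")

Here norm.sf(5) evaluates to about 2.9×10⁻⁷, i.e. roughly 1 in 3.5 million, consistent with the figure cited above for the Higgs result.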
3 Limitations

Researchers focusing solely on whether their results are statistically significant might report findings that are not substantive[37] and not replicable.[38] There is also a difference between statistical significance and practical significance: a study that is found to be statistically significant may not necessarily be practically significant.[39]

3.1 Effect size

Main article: Effect size

Effect size is a measure of a study's practical significance.[40] A statistically significant result may have a weak effect. To gauge the research significance of their result, researchers are encouraged to always report an effect size along with p-values. An effect size measure quantifies the strength of an effect, such as the distance between two means in units of standard deviation (cf. Cohen's d), the correlation between two variables or its square, and other measures.[41]

3.2 Reproducibility

Main article: Reproducibility

A statistically significant result may not be easy to reproduce. In particular, some statistically significant results will in fact be false positives. Each failed attempt to reproduce a result increases the belief that the result was a false positive.[42]

3.3 Controversy around overuse in some journals

Starting in the 2010s, some journals began questioning whether significance testing, and particularly using a threshold of α = 5%, was being relied on too heavily as the primary measure of the validity of a hypothesis.[43] Some journals encouraged authors to do more detailed analysis than just a statistical significance test. In social psychology, the Journal of Basic and Applied Social Psychology banned the use of significance testing altogether from papers it published, requiring authors to use other measures to evaluate hypotheses and impact.[44][45]

4 See also

• A/B testing, ABX test
• Fisher's method for combining independent tests of significance
• Look-elsewhere effect
• Multiple comparisons problem
• Sample size
• Texas sharpshooter fallacy (gives examples of tests where the significance level was set too high)

5 References

[1] Sirkin, R. Mark (2005). "Two-sample t tests". Statistics for the Social Sciences (3rd ed.). Thousand Oaks, CA: SAGE Publications. pp. 271–316. ISBN 1-412-90546-X.

[2] Borror, Connie M. (2009). "Statistical decision making". The Certified Quality Engineer Handbook (3rd ed.). Milwaukee, WI: ASQ Quality Press. pp. 418–472. ISBN 0-873-89745-5.

[3] Redmond, Carol; Colton, Theodore (2001). "Clinical significance versus statistical significance". Biostatistics in Clinical Trials. Wiley Reference Series in Biostatistics (3rd ed.). West Sussex, United Kingdom: John Wiley & Sons. pp. 35–36. ISBN 0-471-82211-6.

[14] Craparo, Robert M. (2007). "Significance level". In Salkind, Neil J. Encyclopedia of Measurement and Statistics. Vol. 3. Thousand Oaks, CA: SAGE Publications. pp. 889–891. ISBN 1-412-91611-9.

[15] Sproull, Natalie L. (2002). "Hypothesis testing". Handbook of Research Methods: A Guide for Practitioners and Students in the Social Science (2nd ed.). Lanham, MD: Scarecrow Press. pp. 49–64. ISBN 0-810-84486-9.

[16] Myers, Jerome L.; Well, Arnold D.; Lorch Jr., Robert F. (2010). "The t distribution and its applications". Research Design and Statistical Analysis (3rd ed.). New York, NY: Routledge. pp. 124–153. ISBN 0-805-86431-8.