Redefine statistical significance
We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.


Daniel J. Benjamin, James O. Berger, Magnus Johannesson, Brian A. Nosek, E.-J. Wagenmakers, Richard Berk, Kenneth A. Bollen, Björn Brembs, Lawrence Brown, Colin Camerer, David Cesarini, Christopher D. Chambers, Merlise Clyde, Thomas D. Cook, Paul De Boeck, Zoltan Dienes, Anna Dreber, Kenny Easwaran, Charles Efferson, Ernst Fehr, Fiona Fidler, Andy P. Field, Malcolm Forster, Edward I. George, Richard Gonzalez, Steven Goodman, Edwin Green, Donald P. Green, Anthony Greenwald, Jarrod D. Hadfield, Larry V. Hedges, Leonhard Held, Teck Hua Ho, Herbert Hoijtink, Daniel J. Hruschka, Kosuke Imai, Guido Imbens, John P. A. Ioannidis, Minjeong Jeon, James Holland Jones, Michael Kirchler, David Laibson, John List, Roderick Little, Arthur Lupia, Edouard Machery, Scott E. Maxwell, Michael McCarthy, Don Moore, Stephen L. Morgan, Marcus Munafó, Shinichi Nakagawa, Brendan Nyhan, Timothy H. Parker, Luis Pericchi, Marco Perugini, Jeff Rouder, Judith Rousseau, Victoria Savalei, Felix D. Schönbrodt, Thomas Sellke, Betsy Sinclair, Dustin Tingley, Trisha Van Zandt, Simine Vazire, Duncan J. Watts, Christopher Winship, Robert L. Wolpert, Yu Xie, Cristobal Young, Jonathan Zinman and Valen E. Johnson

The lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on 'statistically significant' findings. There has been much progress toward documenting and addressing several causes of this lack of reproducibility (for example, multiple testing, P-hacking, publication bias and under-powered studies). However, we believe that a leading cause of non-reproducibility has not yet been adequately addressed: statistical standards of evidence for claiming new discoveries in many fields of science are simply too low. Associating statistically significant findings with P < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems.

For fields where the threshold for defining statistical significance for new discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields. Results that would currently be called significant but do not meet the new threshold should instead be called suggestive. While statisticians have known the relative weakness of using P ≈ 0.05 as a threshold for discovery and the proposal to lower it to 0.005 is not new1,2, a critical mass of researchers now endorse this change.

We restrict our recommendation to claims of discovery of new effects. We do not address the appropriate threshold for confirmatory or contradictory replications of existing claims. We also do not advocate changes to discovery thresholds in fields that have already adopted more stringent standards (for example, genomics and high-energy physics research; see the 'Potential objections' section below).

We also restrict our recommendation to studies that conduct null hypothesis significance tests. We have diverse views about how best to improve reproducibility, and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P values. However, changing the P value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance.

Strength of evidence from P values

In testing a point null hypothesis H0 against an alternative hypothesis H1 based on data x_obs, the P value is defined as the probability, calculated under the null hypothesis, that a test statistic is as extreme or more extreme than its observed value. The null hypothesis is typically rejected — and the finding is declared statistically significant — if the P value falls below the (current) type I error threshold α = 0.05.

From a Bayesian perspective, a more direct measure of the strength of evidence for H1 relative to H0 is the ratio of their probabilities. By Bayes' rule, this ratio may be written as:

\[
\frac{\Pr(H_1 \mid x_{\mathrm{obs}})}{\Pr(H_0 \mid x_{\mathrm{obs}})}
= \frac{f(x_{\mathrm{obs}} \mid H_1)}{f(x_{\mathrm{obs}} \mid H_0)} \times \frac{\Pr(H_1)}{\Pr(H_0)}
\equiv \mathrm{BF} \times (\text{prior odds})
\tag{1}
\]

where BF is the Bayes factor that represents the evidence from the data, and the prior odds can be informed by researchers' beliefs, scientific consensus, and validated evidence from similar research questions in the same field. Multiple-hypothesis testing, P-hacking and publication bias all reduce the credibility of evidence. Some of these practices reduce the prior odds of H1 relative to H0 by changing the population of hypothesis tests that are reported. Prediction markets3 and analyses of replication results4 both suggest that for psychology experiments, the prior odds of H1 relative to H0 may be only about 1:10. A similar number has been suggested in cancer clinical trials, and the number is likely to be much lower in preclinical biomedical research5.

There is no unique mapping between the P value and the Bayes factor, since the Bayes factor depends on H1. However, the connection between the two quantities can be evaluated for particular test statistics under certain classes of plausible alternatives (Fig. 1).

NATURE HUMAN BEHAVIOUR | www.nature.com/nathumbehav © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.
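The posterior-odds decomposition in equation (1) can be checked with a minimal numeric sketch. The values used below are illustrative, taken from figures quoted in the text (a Bayes factor of about 3, roughly what P ≈ 0.05 provides under the alternatives in Fig. 1, and prior odds of 1:10 as suggested for psychology experiments); this is not part of the authors' analysis.

```python
# Equation (1): posterior odds of H1 over H0 = Bayes factor × prior odds.

def posterior_odds(bayes_factor: float, prior_odds: float) -> float:
    """Odds of H1 relative to H0 after observing the data."""
    return bayes_factor * prior_odds

# Illustrative values from the text: BF ≈ 3 (about what P ≈ 0.05 yields
# under the alternatives in Fig. 1) and prior odds of 1:10 for H1.
odds = posterior_odds(bayes_factor=3.0, prior_odds=1 / 10)
print(round(odds, 3))  # 0.3: after a 'significant' result, odds still favour H0
```

Even a finding that is significant at P ≈ 0.05 therefore leaves the odds favouring the null by roughly 3:1 when the prior odds are 1:10, which is the intuition behind the weak-evidence argument.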
[Figure 1: four curves (Power, Likelihood ratio bound, UMPBT and Local-H1 bound) plotting the Bayes factor (y axis, logarithmic scale from 0.3 to 100) against the P value (x axis, 0.001 to 0.1).]

Fig. 1 | Relationship between the P value and the Bayes factor. The Bayes factor (BF) is defined as f(x_obs | H1) / f(x_obs | H0). The figure assumes that observations are independent and identically distributed (i.i.d.) according to x ~ N(μ, σ²), where the mean μ is unknown and the variance σ² is known. The P value is from a two-sided z-test (or equivalently a one-sided χ²-test) of the null hypothesis H0: μ = 0. Power (red curve): BF obtained by defining H1 as putting ½ probability on μ = ±m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design. Likelihood ratio bound (black curve): BF obtained by defining H1 as putting ½ probability on μ = ±x̂, where x̂ is approximately equal to the mean of the observations. These BFs are upper bounds among the class of all alternatives H1 that are symmetric around the null, but they are improper because the data are used to define H1. UMPBT (blue curve): BF obtained by defining H1 according to the uniformly most powerful Bayesian test that places ½ probability on μ = ±w, where w is the alternative hypothesis that corresponds to a one-sided test of size 0.0025. This curve is indistinguishable from the 'Power' curve that would be obtained if the power used in its definition was 80% rather than 75%. Local-H1 bound (green curve): BF = 1/(−e p ln p), where p is the P value; this is a large-sample upper bound on the BF from among all unimodal alternative hypotheses that have a mode at the null and satisfy certain regularity conditions15. The red numbers on the y axis indicate the range of Bayes factors that are obtained for P values of 0.005 or 0.05. For more details, see the Supplementary Information.

A two-sided P value of 0.05 corresponds to Bayes factors in favour of H1 that range from about 2.5 to 3.4 under reasonable assumptions about H1 (Fig. 1). This is weak evidence from at least three perspectives. Reducing the threshold to P < 0.005 strengthens the evidential standard in two ways. First, a two-sided P value of 0.005 corresponds to 'substantial' to 'strong' evidence according to conventional Bayes factor classifications6. Second, in many fields the P < 0.005 standard would reduce the false positive rate to levels we judge to be reasonable.

In many studies, statistical power is low7. Figure 2 demonstrates that low statistical power and α = 0.05 combine to produce high false positive rates.

For many, the calculations illustrated by Fig. 2 may be unsettling. For example, the false positive rate is greater than 33% with prior odds of 1:10 and a P value threshold of 0.05, regardless of the level of statistical power. Reducing the threshold to 0.005 would reduce this minimum false positive rate to 5%. Similar reductions in false positive rates would occur over a wide range of statistical powers.

Empirical evidence from recent replication projects in psychology and experimental economics provides insights into the prior odds in favour of H1. In both projects, the rate of replication (that is, significance at P < 0.05 in the replication in a consistent direction) was roughly double for initial studies with P < 0.005 relative to initial studies with 0.005 < P < 0.05: 50% versus 24% for psychology8, and 85% versus 44% for experimental economics9. Although based on relatively small samples of studies (93 in psychology, and 16 in experimental economics, after excluding initial studies with P > 0.05), these numbers are suggestive of the potential gains in reproducibility that would accrue from the new threshold of P < 0.005 in these fields. In biomedical research, 96% of a sample of recent papers claim statistically significant results with the P < 0.05 threshold10. However, replication rates were very low5 for these studies, suggesting a potential for gains by adopting this new standard in these fields as well.

Potential objections

We now address the most compelling arguments against adopting this higher standard of evidence.
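The 'Local-H1 bound' curve in Fig. 1 has a closed form, BF = 1/(−e·p·ln p), so the Bayes factor bounds quoted for the two thresholds can be reproduced directly. A minimal sketch (the printed values match the figure's y-axis annotations up to rounding):

```python
import math

def local_h1_bound(p: float) -> float:
    """Large-sample upper bound on the Bayes factor over unimodal
    alternatives with a mode at the null: BF = 1 / (-e * p * ln p)."""
    return 1.0 / (-math.e * p * math.log(p))

print(round(local_h1_bound(0.05), 2))   # 2.46: weak evidence at the 0.05 threshold
print(round(local_h1_bound(0.005), 1))  # 13.9: the lower red number in Fig. 1
```

Even this most generous bound puts P = 0.05 at a Bayes factor below 2.5, while P = 0.005 clears the conventional threshold for 'substantial' evidence.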
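The false positive rates discussed alongside Fig. 2 follow from the same ingredients. Below is a sketch of that calculation, assuming the standard relation FPR = α / (α + power × prior odds), i.e. the conditional probability that a finding declared significant is a false positive; α, power and the 1:10 prior odds are the quantities used in the text.

```python
def false_positive_rate(alpha: float, power: float, prior_odds: float) -> float:
    """Probability that a finding significant at `alpha` is a false positive,
    given the test's power and the prior odds Pr(H1)/Pr(H0)."""
    return alpha / (alpha + power * prior_odds)

# With prior odds of 1:10, even perfect power leaves a 1-in-3 false positive
# rate at the 0.05 threshold...
print(round(false_positive_rate(alpha=0.05, power=1.0, prior_odds=0.1), 3))   # 0.333
# ...while the 0.005 threshold lowers that floor to about 5%.
print(round(false_positive_rate(alpha=0.005, power=1.0, prior_odds=0.1), 3))  # 0.048
```

Because lowering power only shrinks the denominator, the false positive rate can only be worse than these floors, which is why the text can say 'regardless of the level of statistical power'.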