(P>0.05): Significance Thresholds and the Crisis of Unreplicable Research
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Downloading of a Human Consciousness Into a Digital Computer Would Involve ‘A Certain Loss of Our Finer Feelings and Qualities’89
When We Can Trust Computers (and When We Can’t) Peter V. Coveney1,2* Roger R. Highfield3 1Centre for Computational Science, University College London, Gordon Street, London WC1H 0AJ, UK, https://orcid.org/0000-0002-8787-7256 2Institute for Informatics, Science Park 904, University of Amsterdam, 1098 XH Amsterdam, Netherlands 3Science Museum, Exhibition Road, London SW7 2DD, UK, https://orcid.org/0000-0003-2507-5458 Keywords: validation; verification; uncertainty quantification; big data; machine learning; artificial intelligence Abstract With the relentless rise of computer power, there is a widespread expectation that computers can solve the most pressing problems of science, and even more besides. We explore the limits of computational modelling and conclude that, in the domains of science and engineering that are relatively simple and firmly grounded in theory, these methods are indeed powerful. Even so, the availability of code, data and documentation, along with a range of techniques for validation, verification and uncertainty quantification, are essential for building trust in computer generated findings. When it comes to complex systems in domains of science that are less firmly grounded in theory, notably biology and medicine, to say nothing of the social sciences and humanities, computers can create the illusion of objectivity, not least because the rise of big data and machine learning pose new challenges to reproducibility, while lacking true explanatory power. We also discuss important aspects of the natural world which cannot be solved by digital means. In the long-term, renewed emphasis on analogue methods will be necessary to temper the excessive faith currently placed in digital computation. -
Rothamsted in the Making of Sir Ronald Fisher Scd FRS
Rothamsted in the Making of Sir Ronald Fisher ScD FRS John Aldrich University of Southampton RSS September 2019 1 Finish 1962 “the most famous statistician and mathematical biologist in the world” dies Start 1919 29 year-old Cambridge BA in maths with no prospects takes temporary job at Rothamsted Experimental Station In between 1919-33 at Rothamsted he makes a career for himself by creating a career 2 Rothamsted helped make Fisher by establishing and elevating the office of agricultural statistician—a position in which Fisher was unsurpassed by letting him do other things—including mathematical statistics, genetics and eugenics 3 Before Fisher … (9 slides) The problems were already there viz. o the relationship between harvest and weather o design of experimental trials and the analysis of their results Leading figures in agricultural science believed the problems could be treated statistically 4 Established in 1843 by Lawes and Gilbert to investigate the effective- ness of fertilisers Sir John Bennet Lawes Sir Joseph Henry Gilbert (1814-1900) land-owner, fertiliser (1817-1901) professional magnate and amateur scientist scientist (chemist) In 1902 tired Rothamsted gets make-over when Daniel Hall becomes Director— public money feeds growth 5 Crops and weather and experimental trials Crops & the weather o examined by agriculturalists—e.g. Lawes & Gilbert 1880 o subsequently treated more by meteorologists Experimental trials a hotter topic o treated in Journal of Agricultural Science o by leading figures Wood of Cambridge and Hall -
Caveat Emptor: the Risks of Using Big Data for Human Development
1 Caveat emptor: the risks of using big data for human development Siddique Latif1,2, Adnan Qayyum1, Muhammad Usama1, Junaid Qadir1, Andrej Zwitter3, and Muhammad Shahzad4 1Information Technology University (ITU)-Punjab, Pakistan 2University of Southern Queensland, Australia 3University of Groningen, Netherlands 4National University of Sciences and Technology (NUST), Pakistan Abstract Big data revolution promises to be instrumental in facilitating sustainable development in many sectors of life such as education, health, agriculture, and in combating humanitarian crises and violent conflicts. However, lurking beneath the immense promises of big data are some significant risks such as (1) the potential use of big data for unethical ends; (2) its ability to mislead through reliance on unrepresentative and biased data; and (3) the various privacy and security challenges associated with data (including the danger of an adversary tampering with the data to harm people). These risks can have severe consequences and a better understanding of these risks is the first step towards mitigation of these risks. In this paper, we highlight the potential dangers associated with using big data, particularly for human development. Index Terms Human development, big data analytics, risks and challenges, artificial intelligence, and machine learning. I. INTRODUCTION Over the last decades, widespread adoption of digital applications has moved all aspects of human lives into the digital sphere. The commoditization of the data collection process due to increased digitization has resulted in the “data deluge” that continues to intensify with a number of Internet companies dealing with petabytes of data on a daily basis. The term “big data” has been coined to refer to our emerging ability to collect, process, and analyze the massive amount of data being generated from multiple sources in order to obtain previously inaccessible insights. -
Reproducibility: Promoting Scientific Rigor and Transparency
Reproducibility: Promoting scientific rigor and transparency Roma Konecky, PhD Editorial Quality Advisor What does reproducibility mean? • Reproducibility is the ability to generate similar results each time an experiment is duplicated. • Data reproducibility enables us to validate experimental results. • Reproducibility is a key part of the scientific process; however, many scientific findings are not replicable. The Reproducibility Crisis • ~2010 as part of a growing awareness that many scientific studies are not replicable, the phrase “Reproducibility Crisis” was coined. • An initiative of the Center for Open Science conducted replications of 100 psychology experiments published in prominent journals. (Science, 349 (6251), 28 Aug 2015) - Out of 100 replication attempts, only 39 were successful. The Reproducibility Crisis • According to a poll of over 1,500 scientists, 70% had failed to reproduce at least one other scientist's experiment or their own. (Nature 533 (437), 26 May 2016) • Irreproducible research is a major concern because in valid claims: - slow scientific progress - waste time and resources - contribute to the public’s mistrust of science Factors contributing to Over 80% of respondents irreproducibility Nature | News Feature 25 May 2016 Underspecified methods Factors Data dredging/ Low statistical contributing to p-hacking power irreproducibility Technical Bias - omitting errors null results Weak experimental design Underspecified methods Factors Data dredging/ Low statistical contributing to p-hacking power irreproducibility Technical Bias - omitting errors null results Weak experimental design Underspecified methods • When experimental details are omitted, the procedure needed to reproduce a study isn’t clear. • Underspecified methods are like providing only part of a recipe. ? = Underspecified methods Underspecified methods Underspecified methods Underspecified methods • Like baking a loaf of bread, a “scientific recipe” should include all the details needed to reproduce the study. -
Rejecting Statistical Significance Tests: Defanging the Arguments
Rejecting Statistical Significance Tests: Defanging the Arguments D. G. Mayo Philosophy Department, Virginia Tech, 235 Major Williams Hall, Blacksburg VA 24060 Abstract I critically analyze three groups of arguments for rejecting statistical significance tests (don’t say ‘significance’, don’t use P-value thresholds), as espoused in the 2019 Editorial of The American Statistician (Wasserstein, Schirm and Lazar 2019). The strongest argument supposes that banning P-value thresholds would diminish P-hacking and data dredging. I argue that it is the opposite. In a world without thresholds, it would be harder to hold accountable those who fail to meet a predesignated threshold by dint of data dredging. Forgoing predesignated thresholds obstructs error control. If an account cannot say about any outcomes that they will not count as evidence for a claim—if all thresholds are abandoned—then there is no a test of that claim. Giving up on tests means forgoing statistical falsification. The second group of arguments constitutes a series of strawperson fallacies in which statistical significance tests are too readily identified with classic abuses of tests. The logical principle of charity is violated. The third group rests on implicit arguments. The first in this group presupposes, without argument, a different philosophy of statistics from the one underlying statistical significance tests; the second group—appeals to popularity and fear—only exacerbate the ‘perverse’ incentives underlying today’s replication crisis. Key Words: Fisher, Neyman and Pearson, replication crisis, statistical significance tests, strawperson fallacy, psychological appeals, 2016 ASA Statement on P-values 1. Introduction and Background Today’s crisis of replication gives a new urgency to critically appraising proposed statistical reforms intended to ameliorate the situation. -
Statistical Significance Testing in Information Retrieval:An Empirical
Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors Julián Urbano Harlley Lima Alan Hanjalic Delft University of Technology Delft University of Technology Delft University of Technology The Netherlands The Netherlands The Netherlands [email protected] [email protected] [email protected] ABSTRACT 1 INTRODUCTION Statistical significance testing is widely accepted as a means to In the traditional test collection based evaluation of Information assess how well a difference in effectiveness reflects an actual differ- Retrieval (IR) systems, statistical significance tests are the most ence between systems, as opposed to random noise because of the popular tool to assess how much noise there is in a set of evaluation selection of topics. According to recent surveys on SIGIR, CIKM, results. Random noise in our experiments comes from sampling ECIR and TOIS papers, the t-test is the most popular choice among various sources like document sets [18, 24, 30] or assessors [1, 2, 41], IR researchers. However, previous work has suggested computer but mainly because of topics [6, 28, 36, 38, 43]. Given two systems intensive tests like the bootstrap or the permutation test, based evaluated on the same collection, the question that naturally arises mainly on theoretical arguments. On empirical grounds, others is “how well does the observed difference reflect the real difference have suggested non-parametric alternatives such as the Wilcoxon between the systems and not just noise due to sampling of topics”? test. Indeed, the question of which tests we should use has accom- Our field can only advance if the published retrieval methods truly panied IR and related fields for decades now. -
Basic Statistics
Statistics Kristof Reid Assistant Professor Medical University of South Carolina Core Curriculum V5 Financial Disclosures • None Core Curriculum V5 Further Disclosures • I am not a statistician • I do like making clinical decisions based on appropriately interpreted data Core Curriculum V5 Learning Objectives • Understand why knowing statistics is important • Understand the basic principles and statistical test • Understand common statistical errors in the medical literature Core Curriculum V5 Why should I care? Indications for statistics Core Curriculum V5 47% unable to determine study design 31% did not understand p values 38% did not understand sensitivity and specificity 83% could not use odds ratios Araoye et al, JBJS 2020;102(5):e19 Core Curriculum V5 50% (or more!) of clinical research publications have at least one statistical error Thiese et al, Journal of thoracic disease, 2016-08, Vol.8 (8), p.E726-E730 Core Curriculum V5 17% of conclusions not justified by results 39% of studies used the wrong analysis Parsons et al, J Bone Joint Surg Br. 2011;93-B(9):1154-1159. Core Curriculum V5 Are these two columns different? Column A Column B You have been asked by your insurance carrier 19 10 12 11 to prove that your total hip patient outcomes 10 9 20 19 are not statistically different than your competitor 10 17 14 10 next door. 21 19 24 13 Column A is the patient reported score for you 12 18 and column B for your competitor 17 8 12 17 17 10 9 9 How would you prove you are 24 10 21 9 18 18 different or better? 13 4 18 8 12 14 15 15 14 17 Core Curriculum V5 Are they still the same? Column A: Mean=15 Column B: Mean=12 SD=4 Core Curriculum V5 We want to know if we're making a difference Core Curriculum V5 What is statistics? a. -
Principles of the Design and Analysis of Experiments
A FRESH LOOK AT THE BASIC PRINCIPLES OF THE DESIGN AND ANALYSIS OF EXPERIMENTS F. YATES ROTHAMSTED EXPERIMENTAL STATION, HARPENDEN 1. Introduction When Professor Neyman invited me to attend the Fifth Berkeley Symposium, and give a paper on the basic principles of the design and analysis of experiments, I was a little hesitant. I felt certain that all those here must be thoroughly conversant with these basic principles, and that to mull over them again would be of little interest. This, however, is the first symposium to be held since Sir Ronald Fisher's death, and it does therefore seem apposite that a paper discussing some aspect of his work should be given. If so, what could be better than the design and analysis of experiments, which in its modern form he created? I do not propose today to give a history of the development of the subject. This I did in a paper presented in 1963 to the Seventh International Biometrics Congress [14]. Instead I want to take a fresh look at the logical principles Fisher laid down, and the action that flows from them; also briefly to consider certain modern trends, and see how far they are really of value. 2. General principles Fisher, in his first formal exposition of experimental design [4] laid down three basic principles: replication; randomization; local control. Replication and local control (for example, arrangement in blocks or rows and columns of a square) were not new, but the idea of assigning the treatments at random (subject to the restrictions imposed by the local control) was novel, and proved to be a most fruitful contribution. -
Tests of Hypotheses Using Statistics
Tests of Hypotheses Using Statistics Adam Massey¤and Steven J. Millery Mathematics Department Brown University Providence, RI 02912 Abstract We present the various methods of hypothesis testing that one typically encounters in a mathematical statistics course. The focus will be on conditions for using each test, the hypothesis tested by each test, and the appropriate (and inappropriate) ways of using each test. We conclude by summarizing the di®erent tests (what conditions must be met to use them, what the test statistic is, and what the critical region is). Contents 1 Types of Hypotheses and Test Statistics 2 1.1 Introduction . 2 1.2 Types of Hypotheses . 3 1.3 Types of Statistics . 3 2 z-Tests and t-Tests 5 2.1 Testing Means I: Large Sample Size or Known Variance . 5 2.2 Testing Means II: Small Sample Size and Unknown Variance . 9 3 Testing the Variance 12 4 Testing Proportions 13 4.1 Testing Proportions I: One Proportion . 13 4.2 Testing Proportions II: K Proportions . 15 4.3 Testing r £ c Contingency Tables . 17 4.4 Incomplete r £ c Contingency Tables Tables . 18 5 Normal Regression Analysis 19 6 Non-parametric Tests 21 6.1 Tests of Signs . 21 6.2 Tests of Ranked Signs . 22 6.3 Tests Based on Runs . 23 ¤E-mail: [email protected] yE-mail: [email protected] 1 7 Summary 26 7.1 z-tests . 26 7.2 t-tests . 27 7.3 Tests comparing means . 27 7.4 Variance Test . 28 7.5 Proportions . 28 7.6 Contingency Tables . -
Statistical Significance
Statistical significance In statistical hypothesis testing,[1][2] statistical signif- 1.1 Related concepts icance (or a statistically significant result) is at- tained whenever the observed p-value of a test statis- The significance level α is the threshhold for p below tic is less than the significance level defined for the which the experimenter assumes the null hypothesis is study.[3][4][5][6][7][8][9] The p-value is the probability of false, and something else is going on. This means α is obtaining results at least as extreme as those observed, also the probability of mistakenly rejecting the null hy- given that the null hypothesis is true. The significance pothesis, if the null hypothesis is true.[22] level, α, is the probability of rejecting the null hypothe- Sometimes researchers talk about the confidence level γ sis, given that it is true.[10] This statistical technique for = (1 − α) instead. This is the probability of not rejecting testing the significance of results was developed in the the null hypothesis given that it is true. [23][24] Confidence early 20th century. levels and confidence intervals were introduced by Ney- In any experiment or observation that involves drawing man in 1937.[25] a sample from a population, there is always the possibil- ity that an observed effect would have occurred due to sampling error alone.[11][12] But if the p-value of an ob- 2 Role in statistical hypothesis test- served effect is less than the significance level, an inves- tigator may conclude that that effect reflects the charac- ing teristics of the -
Understanding Statistical Hypothesis Testing: the Logic of Statistical Inference
Review Understanding Statistical Hypothesis Testing: The Logic of Statistical Inference Frank Emmert-Streib 1,2,* and Matthias Dehmer 3,4,5 1 Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, 33100 Tampere, Finland 2 Institute of Biosciences and Medical Technology, Tampere University, 33520 Tampere, Finland 3 Institute for Intelligent Production, Faculty for Management, University of Applied Sciences Upper Austria, Steyr Campus, 4040 Steyr, Austria 4 Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), 6060 Hall, Tyrol, Austria 5 College of Computer and Control Engineering, Nankai University, Tianjin 300000, China * Correspondence: [email protected]; Tel.: +358-50-301-5353 Received: 27 July 2019; Accepted: 9 August 2019; Published: 12 August 2019 Abstract: Statistical hypothesis testing is among the most misunderstood quantitative analysis methods from data science. Despite its seeming simplicity, it has complex interdependencies between its procedural components. In this paper, we discuss the underlying logic behind statistical hypothesis testing, the formal meaning of its components and their connections. Our presentation is applicable to all statistical hypothesis tests as generic backbone and, hence, useful across all application domains in data science and artificial intelligence. Keywords: hypothesis testing; machine learning; statistics; data science; statistical inference 1. Introduction We are living in an era that is characterized by the availability of big data. In order to emphasize the importance of this, data have been called the ‘oil of the 21st Century’ [1]. However, for dealing with the challenges posed by such data, advanced analysis methods are needed. -
Notes: Hypothesis Testing, Fisher's Exact Test
Notes: Hypothesis Testing, Fisher’s Exact Test CS 3130 / ECE 3530: Probability and Statistics for Engineers Novermber 25, 2014 The Lady Tasting Tea Many of the modern principles used today for designing experiments and testing hypotheses were intro- duced by Ronald A. Fisher in his 1935 book The Design of Experiments. As the story goes, he came up with these ideas at a party where a woman claimed to be able to tell if a tea was prepared with milk added to the cup first or with milk added after the tea was poured. Fisher designed an experiment where the lady was presented with 8 cups of tea, 4 with milk first, 4 with tea first, in random order. She then tasted each cup and reported which four she thought had milk added first. Now the question Fisher asked is, “how do we test whether she really is skilled at this or if she’s just guessing?” To do this, Fisher introduced the idea of a null hypothesis, which can be thought of as a “default position” or “the status quo” where nothing very interesting is happening. In the lady tasting tea experiment, the null hypothesis was that the lady could not really tell the difference between teas, and she is just guessing. Now, the idea of hypothesis testing is to attempt to disprove or reject the null hypothesis, or more accurately, to see how much the data collected in the experiment provides evidence that the null hypothesis is false. The idea is to assume the null hypothesis is true, i.e., that the lady is just guessing.