Adjusting the Sample Size During the Experiment

Volume 21, Number 1, February/March 2015

Aniko Szabo, PhD, Division of Biostatistics, MCW

© 2015 by Division of Biostatistics, Medical College of Wisconsin

Selecting the sample size for a study is an essential part of research. The basic statistical tenet of sample size selection is that the sample size needs to be chosen in advance, and that modifying it based on the collected data can lead to invalid results. The IRB and the IACUC require sample size justification before approval of human and animal studies, respectively, so correct practices are typically followed there. However, in contexts not governed by these bodies in detail, such as basic science laboratory studies or secondary data analyses of existing records, practices vary widely. In this article we illustrate some of the dangers of incorrect practices and discuss group-sequential designs as a possible solution.

Sampling to significance

A common experimental plan that researchers apply, consciously or not, is to collect some data, stop if the results are statistically significant, and collect more data to increase the power if they are not. Statisticians refer to this practice as “sampling to significance.” In fact, if such a procedure were carried on indefinitely, every experiment would give a statistically significant result sooner or later!

Figure 1 shows simulated results for 10 hypothetical series of experiments in which there is no actual difference between the two study groups. The p-value is calculated after each set of 5 additional samples from both groups, up to 50 samples per group. In all 10 experiments the p-value varies substantially between repeated tests, and in three experiments it dips below the significance level of 0.05, denoted by the blue horizontal dashed line.
The red points show where the experimenter would have stopped the study; the later p-values, shown by gray lines, would never have been observed. In two of the cases the p-values would have “rebounded” if the experiment had been continued, and only one of the runs would have been significant at the final sample size.

These simulations demonstrate the increased rate of false positives when the researcher “looks” at the p-value multiple times. While false positives are unavoidable, a significance level of 0.05 should limit their occurrence to 5% of experiments. In the situation described above, the probability of a significant result at one of the 10 possible looks is almost 20%! While this example is somewhat extreme (most researchers would probably add new samples only once or twice), the general phenomenon of an increasing rate of false positives is still present. Even just one additional look after 25 samples per group almost doubles the false positive rate.

Sequential testing

The sample size required for any given study depends on a number of factors, such as the expected variability of the measurements or the expected difference between the groups. Unfortunately, these are often unknown at the initiation of a study. One of the reasons for researchers’ reluctance to commit to a fixed sample size a priori might be the uncertainty about these inputs. What if the variability is larger than expected? The preplanned sample size might be too low. What if the difference is much larger than expected? Perhaps money and effort could be saved by stopping earlier. Repeated looks seem to offer a solution to these problems by allowing the sample size to adjust to the experiment, but as we saw above, this approach is not statistically valid. Fortunately, several statistical methods that allow such multiple looks have been developed.
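
Before turning to those methods, the inflation of false positives from repeated looks can be reproduced with a short simulation. The sketch below is only an illustration under stated assumptions: it uses a two-sided z-test with a known standard deviation of 1 in each group (the article does not say which test produced Figure 1), and the function names and default settings are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_looks, batch=5, cutoff=0.05,
                        n_experiments=4000, seed=1):
    """Share of null experiments with p < cutoff at any of n_looks looks.

    Assumption: both groups are N(0, 1), so there is no true difference,
    and the test is a z-test with known standard deviation 1.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        diff_sum = 0.0  # running sum of (group A - group B) observations
        n = 0           # samples per group collected so far
        for _ in range(n_looks):
            for _ in range(batch):
                diff_sum += rng.gauss(0, 1) - rng.gauss(0, 1)
            n += batch
            # Difference of means divided by its standard error sqrt(2 / n).
            z = diff_sum / math.sqrt(2 * n)
            if two_sided_p(z) < cutoff:
                hits += 1
                break  # the experimenter stops at the first "significant" look
    return hits / n_experiments


if __name__ == "__main__":
    for looks in (1, 2, 10):
        print(looks, "look(s):", round(false_positive_rate(looks), 3))
```

With a single look the estimated rate should sit near the nominal 0.05, and it should rise noticeably as looks are added. The exact figures depend on the seed, on the number of simulated experiments, and on the z-test assumption.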
So-called group-sequential designs are most commonly used in clinical trials, but they are applicable in any situation where sample size adjustment is desired. A group-sequential design requires specifying the maximum sample size and the number and timing of the interim evaluations. A statistical program can then calculate the significance cutoff for each look.

Figure 2 shows two such possible adjusted cutoffs for the previous simulation study. The vertical axis is restricted to the 0 to 0.06 range to show the details, so only the three simulation runs with p-values under 0.05 are shown. The green dotted line shows the so-called Pocock boundary, which uses the same adjusted significance level, here 0.0116, for each comparison. Simulation 4 crosses below this boundary at 45 samples per group and simulation 5 at 15, but simulation 9 never reaches it. So in our small example 2 out of the 10 simulations crossed even the adjusted boundary, but it can be proven that in the long run only 5% of experiments with no true difference will do so. In this example we planned for up to 10 equally spaced looks at the p-value; Table 1 shows the p-value cutoffs for different numbers of evaluations.

Table 1

Number of looks       2       3       4       5       6       7       8       9       10
Significance cutoff   0.0304  0.0232  0.0193  0.0169  0.0152  0.0140  0.0130  0.0122  0.0116

One drawback of the constant Pocock boundary is that if the study goes to the planned maximal size, the significance level that has to be used is much lower than the typical 0.05 level that would have been used without the interim looks. As an alternative, a different boundary with significance cutoffs that increase with sample size can be used. The red solid line shows such a boundary; initially a very low p-value is needed to stop the trial, so simulation 5 is not stopped after 15 samples (though it gets really close to the cutoff).
The benefit is that by the end of the study the significance cutoff is much closer to 0.05; it is 0.0412 in this example. As with the Pocock boundary, even though 1 out of the 10 simulations crossed this boundary, in the long run only 5% will do so.

[Figure 2: P-value (vertical axis, 0 to 0.06) against sample size (10 to 50 per group) for simulations 4, 5, and 9, with three significance boundaries: adjusted-varying, adjusted-constant, and unadjusted.]

Conclusion

Adjusting the sample size of a study based on accumulating data destroys the statistical validity of p-value testing. However, more complicated study designs, such as group-sequential studies or internal pilot studies (not discussed here), can accommodate situations with large uncertainty in the estimates required for sample size calculations. Any such design needs to be planned in advance, and adjustments need to be made even if the study does not stop early.
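
The long-run behaviour of the constant adjusted cutoffs in Table 1 can be checked by simulation. The following is a minimal sketch under assumptions that do not come from the article: a two-sided z-test with a known standard deviation of 1 in each group, equally spaced looks of 5 samples per group, and hypothetical names (`TABLE_1`, `overall_false_positive_rate`). Because the article does not state which test was used to derive the cutoffs, the simulated rate should be expected to land close to 5% rather than exactly on it.

```python
import math
import random

# Constant per-look cutoffs copied from Table 1 (number of looks -> cutoff).
TABLE_1 = {2: 0.0304, 3: 0.0232, 4: 0.0193, 5: 0.0169, 6: 0.0152,
           7: 0.0140, 8: 0.0130, 9: 0.0122, 10: 0.0116}


def overall_false_positive_rate(cutoffs, batch=5, n_experiments=4000, seed=2):
    """Share of null experiments whose p-value crosses the boundary at any look.

    cutoffs[i] is the p-value cutoff applied at look i. Assumption: both
    groups are N(0, 1) and the test is a z-test with known standard deviation.
    """
    rng = random.Random(seed)
    crossed = 0
    for _ in range(n_experiments):
        diff_sum = 0.0  # running sum of (group A - group B) observations
        n = 0           # samples per group collected so far
        for cutoff in cutoffs:
            for _ in range(batch):
                diff_sum += rng.gauss(0, 1) - rng.gauss(0, 1)
            n += batch
            z = diff_sum / math.sqrt(2 * n)
            p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
            if p < cutoff:
                crossed += 1
                break  # stop the study at the first boundary crossing
    return crossed / n_experiments


if __name__ == "__main__":
    for looks in (2, 5, 10):
        adjusted = overall_false_positive_rate([TABLE_1[looks]] * looks)
        naive = overall_false_positive_rate([0.05] * looks)
        print(looks, "looks: adjusted", round(adjusted, 3),
              "unadjusted", round(naive, 3))
```

Passing a list of cutoffs rather than a single number means the same function could also evaluate a boundary that varies from look to look, provided the per-look cutoffs are supplied by a group-sequential design program.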