Error Bars in Experimental Biology • Cumming Et Al
JCB: FEATURE

Error bars in experimental biology

Geoff Cumming,1 Fiona Fidler,1 and David L. Vaux2
1School of Psychological Science and 2Department of Biochemistry, La Trobe University, Melbourne, Victoria, Australia 3086

David L. Vaux: [email protected]
© The Rockefeller University Press $15.00
The Journal of Cell Biology, Vol. 177, No. 1, April 9, 2007 7–11
http://www.jcb.org/cgi/doi/10.1083/jcb.200611141
Downloaded from www.jcb.org on May 2, 2007

Error bars commonly appear in figures in publications, but experimental biologists are often unsure how they should be used and interpreted. In this article we illustrate some basic features of error bars and explain how they can help communicate data and assist correct interpretation. Error bars may show confidence intervals, standard errors, standard deviations, or other quantities. Different types of error bars give quite different information, and so figure legends must make clear what error bars represent. We suggest eight simple rules to assist with effective use and interpretation of error bars.

"It is highly desirable to use larger n, to achieve narrower inferential error bars and more precise estimates of true population values."

What are error bars for?

Journals that publish science (knowledge gained through repeated observation or experiment) don't just present new conclusions, they also present evidence so readers can verify that the authors' reasoning is correct. Figures with error bars can, if used properly (1–6), give information describing the data (descriptive statistics), or information about what conclusions, or inferences, are justified (inferential statistics). These two basic categories of error bars are depicted in exactly the same way, but are actually fundamentally different. Our aim is to illustrate basic properties of figures with any of the common error bars, as summarized in Table I, and to explain how they should be used.

What do error bars tell you?

Descriptive error bars. Range and standard deviation (SD) are used for descriptive error bars because they show how the data are spread (Fig. 1). Range error bars encompass the lowest and highest values. SD is calculated by the formula

$$\mathrm{SD} = \sqrt{\frac{\sum (X - M)^2}{n - 1}}$$

where X refers to the individual data points, M is the mean, and Σ (sigma) means add to find the sum, for all the n data points. SD is, roughly, the average or typical difference between the data points and their mean, M. About two thirds of the data points will lie within the region of mean ± 1 SD, and about 95% of the data points will be within 2 SD of the mean.

Figure 1. Descriptive error bars. Means with error bars for three cases: n = 3, n = 10, and n = 30. The small black dots are data points, and the column denotes the data mean M. The bars on the left of each column show range, and the bars on the right show standard deviation (SD). M and SD are the same for every case, but notice how much the range increases with n. Note also that although the range error bars encompass all of the experimental results, they do not necessarily cover all the results that could possibly occur. SD error bars include about two thirds of the sample, and 2 × SD error bars would encompass roughly 95% of the sample.

Descriptive error bars can also be used to see whether a single result fits within the normal range. For example, if you wished to see if a red blood cell count was normal, you could see whether it was within 2 SD of the mean of the population as a whole. Less than 5% of all red blood cell counts are more than 2 SD from the mean, so if the count in question is more than 2 SD from the mean, you might consider it to be abnormal.

As you increase the size of your sample, or repeat the experiment more times, the mean of your results (M) will tend to get closer and closer to the true mean, or the mean of the whole population, μ. We can use M as our best estimate of the unknown μ. Similarly, as you repeat an experiment more and more times, the SD of your results will tend to more and more closely approximate the true standard deviation (σ) that you would get if the experiment was performed an infinite number of times, or on the whole population. However, the SD of the experimental results will approximate to σ, whether n is large or small. Like M, SD does not change systematically as n changes, and we can use SD as our best estimate of the unknown σ, whatever the value of n.

Inferential error bars. In experimental biology it is more common to be interested in comparing samples from two groups, to see if they are different. For example, you might be comparing wild-type mice with mutant mice, or drug with placebo, or experimental results with controls. To make inferences from the data (i.e., to make a judgment whether the groups are significantly different, or whether the differences might just be due to random fluctuation or chance), a different type of error bar can be used. These are standard error (SE) bars and confidence intervals (CIs). The mean of the data, M, with SE or CI error bars, gives an indication of the region where you can expect the mean of the whole possible set of results, or the whole population, μ, to lie (Fig. 2). The interval defines the values that are most plausible for μ.

Figure 2. Confidence intervals. Means and 95% CIs for 20 independent sets of results, each of size n = 10, from a population with mean μ = 40 (marked by the dotted line). In the long run we expect 95% of such CIs to capture μ; here 18 do so (large black dots) and 2 do not (open circles). Successive CIs vary considerably, not only in position relative to μ, but also in length. The variation from CI to CI would be less for larger sets of results, for example n = 30 or more, but variation in position and in CI length would be even greater for smaller samples, for example n = 3.

Because error bars can be descriptive or inferential, and could be any of the bars listed in Table I or even something else, they are meaningless, or misleading, if the figure legend does not state what kind they are. This leads to the first rule.

Rule 1: when showing error bars, always describe in the figure legends what they are.

Figure 3. Inappropriate use of error bars. Enzyme activity for MEFs showing mean + SD from duplicate samples from one of three representative experiments. Values for wild-type vs. −/− MEFs were significant for enzyme activity at the 3-h timepoint (P < 0.0005). This figure and its legend are typical, but illustrate inappropriate and misleading use of statistics because n = 1. The very low variation of the duplicate samples implies consistency of pipetting, but says nothing about whether the differences between the wild-type and −/− MEFs are reproducible. In this case, the means and errors of the three experiments should have been shown.

Statistical significance tests and P values

If you carry out a statistical significance test, the result is a P value, where P is the probability that, if there really is no difference, you would get, by chance, a difference as large as the one you observed, or even larger. Other things (e.g., sample size, variation) being equal, a larger difference in results gives a lower P value, which makes you suspect there is a true difference. By convention, if P < 0.05 you say the result is statistically significant, and if P < 0.01 you say the result is highly significant and you can be more confident you have found a true effect. As always with statistical inference, you may be wrong! Perhaps there really is no effect, and you had the bad luck to get one of the 5% (if P < 0.05) or 1% (if P < 0.01) of sets of results that suggests a difference where there is none. Of course, even if results are statistically highly significant, it does not mean they are necessarily biologically important. It is also essential to note that if P > 0.05, and you therefore cannot conclude there is a statistically significant effect, you may not conclude that the effect is zero. There may be a real effect, but it is small, or you may not have repeated your experiment often enough to reveal it. It is a common and serious error to conclude "no effect exists" just because P is greater than 0.05. If you measured the heights of three male and three female Biddelonian basketball players, and did not see a significant difference, you could not conclude that sex has no relationship with height, as a larger sample size might reveal one.

A big advantage of inferential error bars is that their length gives a graphic signal of how much uncertainty there is in the data: The true value of the mean μ we are estimating could plausibly be anywhere in the 95% CI. Wide inferential bars indicate large error; short inferential bars indicate high precision.
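
To make the difference between descriptive and inferential bars concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' own procedure. It assumes the standard definitions SE = SD/√n and 95% CI = M ± t × SE, which are not spelled out in this part of the article, and it hardcodes three critical values of Student's t from standard tables. The function names `summarize` and `coverage`, and the population SD of 20, are hypothetical choices made only for this example.

```python
import math
import random
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom
# (df = n - 1), taken from standard tables. Assumption: only n = 3, 10 and 30
# (the sample sizes mentioned in the figure legends) are supported here.
T_CRIT_95 = {2: 4.303, 9: 2.262, 29: 2.045}


def summarize(sample):
    """Return n, M, SD, SE and the 95% CI for one sample."""
    n = len(sample)
    m = statistics.mean(sample)
    # statistics.stdev uses the n - 1 denominator, matching the SD formula.
    sd = statistics.stdev(sample)
    se = sd / math.sqrt(n)                 # assumed definition: SE = SD / sqrt(n)
    half_width = T_CRIT_95[n - 1] * se     # assumed definition: CI = M +/- t * SE
    return {"n": n, "M": m, "SD": sd, "SE": se,
            "CI": (m - half_width, m + half_width)}


def coverage(mu, sigma, n, repeats, seed=0):
    """Fraction of simulated 95% CIs that capture the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = summarize(sample)["CI"]
        if low <= mu <= high:
            hits += 1
    return hits / repeats


if __name__ == "__main__":
    # A tiny hand-checkable sample: M = 4 and SD = 2.
    print(summarize([2.0, 4.0, 6.0]))
    # Long-run capture rate for n = 10 and mu = 40; sigma = 20 is an assumption.
    print(coverage(mu=40, sigma=20, n=10, repeats=4000))
```

In a simulation like this, the SD stays near σ whatever the value of n, while the SE and the CI shrink as n grows, and the long-run fraction of intervals that capture μ should sit close to 95%.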