Depicting Error
Howard WAINER

The American Statistician, May 1996, Vol. 50, No. 2. © 1996 American Statistical Association.

Howard Wainer is Principal Research Scientist, Educational Testing Service, Princeton, NJ 08541. This work was supported by Contract R999B40013 from the National Center for Education Statistics to the Educational Testing Service, Howard Wainer, Principal Investigator. The author is pleased to acknowledge this help. He also thanks Alan MacEachren, Eugene Johnson, Kinley Larntz, Robert Mislevy, Charles P. Nesko, Linda Steinberg, and John W. Tukey for their helpful comments on an earlier draft.

Communicating data without some measure of their precision can lead to misinterpretation and incorrect inferences. Several conventions for displaying error along with the data they modify are described and illustrated. Alternatives are offered that seem to provide improvements in the effective communication of error as well as increasing their ease, and hence their likelihood, of use. These alternatives are illustrated principally with data from the National Assessment of Educational Progress.

KEY WORDS: Display; Errors in graphs; Errors in tables.

1. INTRODUCTION

    One hallmark of the statistically conscious investigator is his firm belief that however the survey, experiment, or observational program actually turned out, it could have turned out somewhat differently. Holding such a belief and taking appropriate actions make effective use of data possible. We need not always ask explicitly "How much differently?" but we should be aware of such questions. Most of us find uncertainty uncomfortable ... (but) ... each of us who deals with the analysis and interpretation of data must learn to cope with uncertainty.
    (Mosteller and Tukey 1968, p. 100)

Whenever we discuss information we must also discuss its accuracy. Effective display of data must:

1. remind us that the data being displayed do contain some uncertainty, and then
2. characterize the size of that uncertainty as it pertains to the inferences we have in mind, and in so doing,
3. help keep us from drawing incorrect conclusions through the lack of a full appreciation of the precision of our knowledge.

The examples chosen here focus principally on errors in means, but the graphical ideas expressed should generalize easily to other situations. Throughout this discussion we assume that estimates of precision are available, and thus the task is strictly one of effective display. This assumption is admittedly a big one. Great efforts and much imagination have been expended in the search for the true uncertainty. Modern statistics has gone far beyond characterizing error by dividing the observed standard deviation by √n. Methods of sensitivity analysis through resampling or multiple imputation are some attempts to measure the real uncertainty. Despite our appreciation for the importance of this work, we will ignore how the uncertainty is calculated, and proceed assuming that, in whatever way was required, such estimates are available and our task is simply to convey them as well as we can.

2. ERRORS IN TABLES

Let us begin our examination of the depiction of error with a typical tabular display. Shown in Table 1 is a data table extracted from a much larger display. Omitted are 30 states and two territories as well as another variable (percentage in each category) and its standard error. As abbreviated, this table still contains the elements important to our discussion without being unwieldy. The only deviation from standard format in this table is that the standard error of each data entry is in a separate column rather than denoted within parentheses adjacent to the data entry whose variability it characterizes. This was done primarily to ease text manipulation. There are three lines added to the bottom of this table. These lines are the point of this discussion.

Table 1. Average Mathematics Proficiency by Parents' Highest Level of Education, Grade 8-1992. A revised display of these data is shown in Table 2.

                                                    Some education                  Did not
                                    Graduated       after           Graduated       finish          I don't
                                    college         high school     high school     high school     know
Public schools                        θ    se         θ    se         θ    se         θ    se         θ    se
Nation                              279   1.4       270   1.2       256   1.4       248   1.8       251   1.7
States
Alabama                             261   2.5       258   2.0       244   1.8       239   2.0       237   2.9
Arizona                             277   1.5       270   1.5       256   1.6       245   2.5       248   2.7
Arkansas                            264   1.9       264   1.7       248   1.6       246   2.4       245   2.7
California                          275   2.0       266   2.1       251   2.1       241   2.2       240   2.9
Colorado                            282   1.3       276   1.6       260   1.5       250   2.4       252   2.6
Connecticut                         288   1.0       272   1.8       260   1.8       245   3.3       251   2.4
Delaware                            274   1.3       268   2.3       251   1.7       248   4.0       248   3.4
District of Columbia                244   1.7       240   1.9       224   1.6       225   3.2       229   2.2
Florida                             268   1.9       266   1.9       251   1.8       244   2.7       244   3.2
Georgia                             271   2.1       264   1.7       250   1.3       244   2.2       245   2.6
Hawaii                              267   1.5       266   1.9       246   1.8       242   3.5       246   2.1
Idaho                               281   0.9       278   1.3       268   1.4       254   2.3       254   2.8
Error terms for comparisons
Max. std. error of difference             3.5             3.4             3.5             6.4             6.4
40 Bonferroni (std. err. × 3.2)          11.3            11.0            11.3            20.7            20.7
820 Bonferroni (std. err. × 4.0)         14.0            13.6            14.0            25.6            25.6

θ is an estimate of the average proficiency in the state.
Source: Abstracted from Table 2.12 in The 1992 NAEP Trial State Assessment.

Showing the standard errors of statistics in this way certainly satisfies requirement (1). The visual weight given to the numbers connoting error is the same as that which conveys the data. This depiction does not however allow us to summarize easily the structure of error. For example, we cannot tell if the size of the error is related to the proficiency, nor is it easy to see how much variation there is in the error over the various states or across the categories of parental education. The answers to these kinds of questions are important for requirement (2). In this case, errors across states are quite homogeneous, and so displaying the maximum value of the standard error as a summary allows us to extract a handy, conservative value from the error terms provided. Using such a summary term (rather than say a mean or a median) will lower the likelihood of our declaring states different when there is reasonable evidence that they might not be. This raises the second part of requirement (2), "uncertainty as it pertains to the inferences we have in mind." What kinds of questions are likely to be used to query Table 1?

When national data are reported broken down by states, it is natural to assume that the disaggregated data are meant to be compared. Thus the standard errors are merely the building blocks that are required to be able to construct the standard error of the difference so that the statistical significance of the observed difference between state means can be ascertained. An upper bound on this standard error is obtained by multiplying the maximum standard error by √2. This result is reported on the first line of the section at the bottom of the table labeled "Error terms for comparisons." It is a handy rule-of-thumb for anyone wishing to make a single comparison. Thus suppose, for some reason, we are interested in comparing the performance of Hawaii's 8th graders whose parents were college graduates (267) with their counterparts in Delaware (274). We note that their mean scores differ by 7. This is twice the maximum standard error of the difference of the means (3.5). We can thus conclude that this observed difference is statistically significant beyond the nominal (α = .05) value. We could have done this comparison for any single pair that was of special interest to us.

Although the comparison of a particular pair of states may be an occasional use for a data table like this, we suspect that comparing one state (perhaps our own) with all others is a more likely use. To do this correctly we need to control the artifactual increases in the error rate due to making many comparisons. The most common way to assure that the type I error rate is controlled is boosting the size of the standard error sufficiently so that the overall error [...]

[...] that participated in the state assessment, we can declare any differences greater than 11.3 points "statistically significant beyond the .05 level." The 11.3 point decision rule is conservative (for most comparisons about 50% too large), but it provides a safe and quick rule-of-thumb.

If someone is interested in comparing each state with all others there will be (41 choose 2), or 820, comparisons. To control the type I error rate in this situation requires boosting the critical region still further (to 14.0). Such a figure is given next to the label "820 Bonferroni." We are uncertain of the usefulness of including such a figure, although it still works, because anyone who is really interested in making all 820 comparisons will probably want a somewhat less conservative decision rule. Such a user of the data table would need to go back to the original standard errors and compute a more precise figure.
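
The error terms at the bottom of Table 1 can be checked with a little arithmetic. The Python sketch below is illustrative only and is not part of the original analysis. It assumes that the multipliers are two-sided normal critical values with the nominal α = .05 shared equally among the comparisons (the usual Bonferroni adjustment), and it reads the maximum standard error of the difference for the college-graduate column (3.5) directly from the table. The names `bonferroni_z` and `max_se_diff` are hypothetical.

```python
from math import comb
from statistics import NormalDist


def bonferroni_z(n_comparisons, alpha=0.05):
    """Two-sided normal critical value with alpha shared equally by all comparisons."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_comparisons))


# Read from Table 1 (college-graduate column): the maximum standard
# error of a difference between two state means.
max_se_diff = 3.5

# Single comparison: Hawaii (267) versus Delaware (274).
difference = abs(267 - 274)
ratio = difference / max_se_diff                   # 2.0, i.e. twice the standard error
single_pair_significant = ratio > bonferroni_z(1)  # unadjusted critical value, about 1.96

# One state against each of the 40 others, and all pairs among 41 states.
n_all_pairs = comb(41, 2)
threshold_40 = max_se_diff * bonferroni_z(40)
threshold_820 = max_se_diff * bonferroni_z(n_all_pairs)

print(f"number of pairs: {n_all_pairs}")
print(f"multipliers: {bonferroni_z(40):.1f}, {bonferroni_z(n_all_pairs):.1f}")
print(f"thresholds:  {threshold_40:.1f}, {threshold_820:.1f}")
```

Under these assumptions the multipliers round to the 3.2 and 4.0 given in the table labels, and the thresholds round to 11.3 and 14.0 points. The 7-point difference between Hawaii and Delaware passes the single-comparison rule but falls short of the 11.3-point rule, which is the price paid for protecting the error rate over many comparisons.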