Depicting Error

Howard WAINER

Howard Wainer is Principal Research Scientist, Educational Testing Service, Princeton, NJ 08541. This work was supported by Contract R999B40013 from the National Center for Education Statistics to the Educational Testing Service, Howard Wainer, Principal Investigator. The author is pleased to acknowledge this help. He also thanks Alan MacEachren, Eugene Johnson, Kinley Larntz, Robert Mislevy, Charles P. Nesko, Linda Steinberg, and John W. Tukey for their helpful comments on an earlier draft.

© 1996 American Statistical Association. The American Statistician, May 1996, Vol. 50, No. 2, p. 101.

Communicating data without some measure of their precision can lead to misinterpretation and incorrect inferences. Several conventions for displaying error along with the data they modify are described and illustrated. Alternatives are offered that seem to provide improvements in the effective communication of error as well as increasing their ease, and hence their likelihood, of use. These alternatives are illustrated principally with data from the National Assessment of Educational Progress.

KEY WORDS: Display; Errors in graphs; Errors in tables.

1. INTRODUCTION

    One hallmark of the statistically conscious investigator is his firm belief that however the survey, experiment, or observational program actually turned out, it could have turned out somewhat differently. Holding such a belief and taking appropriate actions make effective use of data possible. We need not always ask explicitly "How much differently?" but we should be aware of such questions. Most of us find uncertainty uncomfortable ... (but) ... each of us who deals with the analysis and interpretation of data must learn to cope with uncertainty.
    (Mosteller and Tukey 1968, p. 100)

Whenever we discuss information we must also discuss its accuracy. Effective display of data must:

1. remind us that the data being displayed do contain some uncertainty, and then
2. characterize the size of that uncertainty as it pertains to the inferences we have in mind, and in so doing,
3. help keep us from drawing incorrect conclusions through the lack of a full appreciation of the precision of our knowledge.

The examples chosen here focus principally on errors in means, but the graphical ideas expressed should generalize easily to other situations. Throughout this discussion we assume that estimates of precision are available, and thus the task is strictly one of effective display. This assumption is admittedly a big one. Great efforts and much imagination have been expended in the search for the true uncertainty. Modern statistics has gone far beyond characterizing error by dividing the observed standard deviation by $\sqrt{n}$. Methods of sensitivity analysis through resampling or multiple imputation are some attempts to measure the real uncertainty. Despite our appreciation for the importance of this work, we will ignore how the uncertainty is calculated, and proceed assuming that, in whatever way was required, such estimates are available and our task is simply to convey them as well as we can.

2. ERRORS IN TABLES

Let us begin our examination of the depiction of error with a typical tabular display. Shown in Table 1 is a data table extracted from a much larger display. Omitted are 30 states and two territories as well as another variable (percentage in each category) and its standard error. As abbreviated, this table still contains the elements important to our discussion without being unwieldy. The only deviation from standard format in this table is that the standard error of each data entry is in a separate column rather than denoted within parentheses adjacent to the data entry whose variability it characterizes. This was done primarily to ease text manipulation. There are three lines added to the bottom of this table. These lines are the point of this discussion.

Table 1. Average Mathematics Proficiency by Parents' Highest Level of Education, Grade 8, 1992. A revised display of these data is shown in Table 2.

                                  Graduated    Some         Graduated    Did not      I don't
                                  college      education    high school  finish       know
                                               after high                high school
                                               school
Public schools                    θ̂    se      θ̂    se      θ̂    se      θ̂    se      θ̂    se
Nation                            279  1.4     270  1.2     256  1.4     248  1.8     251  1.7
States
Alabama                           261  2.5     258  2.0     244  1.8     239  2.0     237  2.9
Arizona                           277  1.5     270  1.5     256  1.6     245  2.5     248  2.7
Arkansas                          264  1.9     264  1.7     248  1.6     246  2.4     245  2.7
California                        275  2.0     266  2.1     251  2.1     241  2.2     240  2.9
Colorado                          282  1.3     276  1.6     260  1.5     250  2.4     252  2.6
Connecticut                       288  1.0     272  1.8     260  1.8     245  3.3     251  2.4
Delaware                          274  1.3     268  2.3     251  1.7     248  4.0     248  3.4
District of Columbia              244  1.7     240  1.9     224  1.6     225  3.2     229  2.2
Florida                           268  1.9     266  1.9     251  1.8     244  2.7     244  3.2
Georgia                           271  2.1     264  1.7     250  1.3     244  2.2     245  2.6
Hawaii                            267  1.5     266  1.9     246  1.8     242  3.5     246  2.1
Idaho                             281  0.9     278  1.3     268  1.4     254  2.3     254  2.8

Error Terms for Comparisons
Max. std. error of difference     3.5          3.4          3.5          6.4          6.4
40 Bonferroni (std. err. x 3.2)   11.3         11.0         11.3         20.7         20.7
820 Bonferroni (std. err. x 4.0)  14.0         13.6         14.0         25.6         25.6

θ̂ is an estimate of the average proficiency in the state.

Source: Abstracted from Table 2.12 in The 1992 NAEP Trial State Assessment.

Showing the standard errors in this way certainly satisfies requirement (1). The visual weight given to the numbers connoting error is the same as that which conveys the data. This depiction does not, however, allow us to summarize easily the structure of error. For example, we cannot tell if the size of the error is related to the proficiency, nor is it easy to see how much variation there is in the error over the various states or across the categories of parental education. The answers to these kinds of questions are important for requirement (2). In this case, errors across states are quite homogeneous, and so displaying the maximum value of the standard error as a summary allows us to extract a handy, conservative value from the error terms provided. Using such a summary term (rather than, say, a mean or a median) will lower the likelihood of our declaring states different when there is reasonable evidence that they might not be. This raises the second part of requirement (2), "uncertainty as it pertains to the inferences we have in mind." What kinds of questions are likely to be used to query Table 1?

When national data are reported broken down by states, it is natural to assume that the disaggregated data are meant to be compared. Thus the standard errors are merely the building blocks that are required to be able to construct the standard error of the difference so that the statistical significance of the observed difference between state means can be ascertained. An upper bound on this standard error is obtained by multiplying the maximum standard error by $\sqrt{2}$. This result is reported on the first line of the section at the bottom of the table labeled "Error terms for comparisons." It is a handy rule-of-thumb for anyone wishing to make a single comparison. Thus suppose, for some reason, we are interested in comparing the performance of Hawaii's 8th graders whose parents were college graduates (267) with their counterparts in Delaware (274). We note that their mean scores differ by 7. This is twice the maximum standard error of the difference of the means (3.5). We can thus conclude that this observed difference is statistically significant beyond the nominal ($\alpha = .05$) value. We could have done this comparison for any single pair that was of special interest to us.

Although the comparison of a particular pair of states may be an occasional use for a data table like this, we suspect that comparing one state (perhaps our own) with all others is a more likely use. To do this correctly we need to control the artifactual increases in the error rate due to making many comparisons. The most common way to assure that the type I error rate is controlled is boosting the size of the standard error sufficiently so that the overall error rate remains at the nominal (.05) level. In the original table there were 41 states, and so comparing any one of these with each of the others would yield a total of 40 separate t tests. The Bonferroni (1936) inequality (presented carefully, with many examples, in Miller 1966) provides us with a conservative rule for combining the standard error of the difference with the 40 different t tests to yield the ordinate associated with a critical region of the right size for the entire family of tests. This figure is provided on the line labeled "40 Bonferroni." Thus if we wish to compare Hawaii's performance (among children whose parents were college graduates) with that of each of the other 40 states that participated in the state assessment, we can declare any differences greater than 11.3 points "statistically significant beyond the .05 level." The 11.3 point decision rule is conservative (for most comparisons about 50% too large), but it provides a safe and quick rule-of-thumb.

If someone is interested in comparing each state with all others there will be $\binom{41}{2}$, or 820, comparisons. To control the type I error rate in this situation requires boosting the critical region still further (to 14.0). Such a figure is given next to the label "820 Bonferroni." We are uncertain of the usefulness of including such a figure, although it still works, because anyone who is really interested in making all 820 comparisons will probably want a somewhat less conservative decision rule. Such a user of the data table would need to go back to the original standard errors and compute a more precise figure. For this particular data set, even though the standard errors are all reasonably similar, a 30-60% shrinkage in the decision rule will occur if the individual standard errors are used. Because our purpose here is not to explore alternative schemes for multiple comparisons we will not dwell on this aspect too much longer. However, depending on the circumstance, it may be profitable to choose a less conservative summary (e.g., Benjamini and Hochberg 1995). The interested reader is referred to Williams, Jones, and Tukey (1994) for a much fuller exploration of the issues surrounding adjustment procedures for multiple comparisons within the National Assessment of Educational Progress (NAEP).
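The three error terms at the bottom of Table 1 can be approximated with a few lines of code. The sketch below is only an illustration under stated assumptions: it uses the twelve "Graduated college" standard errors excerpted in Table 1 (not the full 41-state table), and it assumes the Bonferroni multipliers are two-sided normal critical values with $\alpha = .05$ divided by the number of comparisons. The function names are hypothetical. Because the table appears to work with rounded multipliers (3.2 and 4.0), the thresholds computed here may differ from the printed ones in the first decimal place.

```python
from math import comb, sqrt
from statistics import NormalDist


def max_se_of_difference(standard_errors):
    """Upper bound on the standard error of a difference between two means:
    sqrt(2) times the largest single standard error in the column."""
    return sqrt(2) * max(standard_errors)


def bonferroni_multiplier(n_comparisons, alpha=0.05):
    """Two-sided normal critical value with alpha split evenly over the
    whole family of comparisons (assumed form of the Bonferroni adjustment)."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_comparisons))


# "Graduated college" standard errors for the 12 jurisdictions excerpted in
# Table 1; the full table has more rows, so this is only a sample.
se_college = [2.5, 1.5, 1.9, 2.0, 1.3, 1.0, 1.3, 1.7, 1.9, 2.1, 1.5, 0.9]

se_diff = max_se_of_difference(se_college)
one_vs_all = bonferroni_multiplier(40)           # one state against the other 40
all_pairs = bonferroni_multiplier(comb(41, 2))   # every pair among 41 states

print(f"max se of difference: {se_diff:.1f}")
print(f"40 comparisons: multiplier {one_vs_all:.1f}, threshold {se_diff * one_vs_all:.1f}")
print(f"{comb(41, 2)} comparisons: multiplier {all_pairs:.1f}, threshold {se_diff * all_pairs:.1f}")
```

A reader who wants a less conservative rule for a specific pair could replace `max(standard_errors)` with the two states' own standard errors combined as the square root of the sum of their squares.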

Table 2. Revision of Table 1: Average Mathematics Proficiency by Parents' Highest Level of Education, Grade 8, 1992

                          Graduated    Some         Graduated    Did not      I don't      Mean
                          college      education    high school  finish       know
                                       after high                high school
                                       school
PUBLIC SCHOOLS
Nation                    279          270          256          248          251          267

States
 1 Iowa                   291          285          273          262          266          283
 2 North Dakota           289          283          271          259          272          283
 3 Minnesota              290          284          270          256          268          282

 4 Maine                  288          281          267          259          266          278
 5 Wisconsin              287          282          270          254          255          278
 6 New Hampshire          287          280          267          259          262          278
 7 Nebraska               287          280          267          247 -        256          277

 8 Idaho                  281          278          268          254          254          274
 9 Wyoming                281          278          266          258          260          274
10 Utah                   280          278          258          254          258          274
11 Connecticut            288          272          260          245 -        251          273

12 Colorado               282          276          260          250          252          272
13 Massachusetts          284          272          261          248          248 -        272
14 New Jersey             283          275          259          253          250          271
15 Pennsylvania           282          274          262          252          252          271
16 Missouri               280          275          264          254          252          271
17 Indiana                283          275          260          250          249          269

18 Ohio                   279          272          260          243          249          268
19 Oklahoma               277          272          257          254          251          267
20 Virginia               282          270          252          248          251          267
21 Michigan               277          271          257          249          248          267
22 [name missing]         277          271          256          243          240 -        265
23 Rhode Island           276          271          256          244          239          265
24 Arizona                277          270          256          245          248          264
25 Maryland               278          266          250          240          245          264
26 Texas                  281          272          253          247          244          264

27 Delaware               274          268          251          248          248          262
28 Kentucky               278          267          254          246          242          261
29 California             275          266          251          241          240          260
30 South Carolina         272          268          248          248          247          260
31 Florida                268          266          251          244          244          259
32 Georgia                271          264          250          244          245          259
33 New Mexico             272          264          249          244          245          259
34 Tennessee              267          265          251          245          243          258
35 West Virginia          270          269 +        251          244          239          258
36 North Carolina         271          265          246          240          240          258
37 Hawaii                 267          266          246          242          246          257
38 Arkansas               264          264          248          246 +        245          256

39 Alabama                261          258          244          239          237          251
40 Louisiana              256          259          242          237          236          249
41 Mississippi            254          256          239          234          231          246

Other Jurisdictions
42 Guam                   246          244          229          224          226          235
43 District of Columbia   244          240          224          225          229          234
44 Virgin Islands         224          232          221          219          217          222

Error terms for comparisons
Max. std. error of diff.  3.5          3.4          3.5          6.4          6.4          3.5
40 Bonferroni             11.3         11.0         11.3         20.7         20.7         11.3
820 Bonferroni            14.0         13.6         14.0         25.6         25.6         14.0

A "+" or "-" after an entry marks an unusual value and the direction of its residual.
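The "+" and "-" marks in Table 2 are described in the text as the most extreme residuals from an additive model. A minimal sketch of that kind of calculation follows. It assumes the additive fit is the simple one based on row and column means (grand mean plus row effect plus column effect); the source does not say which fitting method was used, so a median-based fit is equally plausible. The helper names are hypothetical, and only six rows of Table 2 are used, so the cells it ranks highest need not match the seven flagged in the full 44-row table.

```python
def additive_residuals(table):
    """Residuals from the additive fit: grand mean + row effect + column effect,
    with effects estimated from row and column means (an assumed choice)."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]
    return [
        [table[i][j] - row_means[i] - col_means[j] + grand for j in range(n_cols)]
        for i in range(n_rows)
    ]


def most_extreme(residuals, row_labels, col_labels, k):
    """Return the k cells with the largest absolute residuals and their signs."""
    cells = [
        (row_labels[i], col_labels[j], r)
        for i, row in enumerate(residuals)
        for j, r in enumerate(row)
    ]
    cells.sort(key=lambda cell: abs(cell[2]), reverse=True)
    return [(s, c, "+" if r > 0 else "-", round(r, 1)) for s, c, r in cells[:k]]


columns = ["college", "some ed. after HS", "HS graduate", "did not finish HS", "don't know"]
states = ["Iowa", "Nebraska", "Connecticut", "West Virginia", "Arkansas", "Mississippi"]
scores = [  # six rows copied from Table 2, for illustration only
    [291, 285, 273, 262, 266],
    [287, 280, 267, 247, 256],
    [288, 272, 260, 245, 251],
    [270, 269, 251, 244, 239],
    [264, 264, 248, 246, 245],
    [254, 256, 239, 234, 231],
]

residuals = additive_residuals(scores)
flagged = most_extreme(residuals, states, columns, k=3)
for state, column, sign, value in flagged:
    print(f"{state:15s} {column:20s} {sign} {value}")
```

Ranking by absolute residual and then printing the sign mirrors the typographic convention in Table 2, where the mark shows only the direction of the departure.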

A completely revised version of the original table (from Wainer 1997) is shown as Table 2. This table replaces the standard error columns completely with the summary standard errors. We have also reordered the rows of the table by the overall state performance, inserted other summary statistics for comparison, spaced the table according to the data, and indicated unusual data values. ("Unusual" in this context refers to a data entry whose residual from an additive model was, out of the 220 residuals, among the seven most extreme. These seven are typographically denoted with shading and a "+" or "-" indicating the direction of the residual.) Although a description of the value of these other changes is outside the purview of this presentation, we include this complete revision as an indication of how the revised error terms fit into a broader presentation scheme; for the full story see Wainer (in press). [The numbers alongside each state name are location aids. There is a separate locator table, alphabetical by state name, that provides these numbers for easy lookup. The idea of including an alphabetical locator table is common on maps ("Albany M6"), but has been in use for large tables for at least a century (e.g., Francis Walker used it frequently in the tables describing the growth of cities in the 1890 census).]

We have learned several lessons from this tabular experiment. First is that including the standard errors of every data entry clogs the display, hindering our vision of the primary data structure. Second is that the inclusion of this additional complexity may not be especially helpful, even for the job for which the standard errors are intended. Third is that a little thought about the prospective use of the data contained in the table may allow both a considerable simplification and an increase in ease of use. And finally, fourth is that through this consideration we can provide protection for the user of the table (in this case through the explicit inclusion of the stepped-up confidence bounds for multiple comparisons) against the naive use of the standard errors. Thus the implementation demonstrated in Table 2 satisfies all three of the desiderata of effective display of error. In the next section we will explore how these lessons translate into other display formats.

[Figure 1 plot: Average Mathematics Proficiency for 8th grade children with at least one parent who graduated from college, 1992 NAEP State Assessment data.]

Figure 1. Typical Depiction of Data Including Both Their Level and Error. The large dots locate the position of the mean; the bars depict one standard error of the mean in each direction.

[Figure 2 chart: Unemployment Among High School Graduates 16 to 24 Years Old, by Race: October 1975. Panels: number unemployed (thousands) and unemployment rate (percent), each for "White" and "Black and other races."]

Figure 2. Figure 10-1 From Schmid (1983, p. 192) Showing a Traditional Method of Depicting Error on a Bar Chart Using an Error Rectangle on the Side of the Bar. Reprinted with permission. The data depicted by this chart are based on a sample survey. The amount of sampling variability of the data is indicated by the small rectangles. The height of the rectangles covers a range of values of one standard error above and one standard error below the reported value. (From Office of Federal Statistical Policy and Standards, U.S. Bureau of the Census, Social Indicators, 1976, Washington, DC: Government Printing Office, 1977, p. XXVII.)

[Figure 3 chart: Estimates of Expenditures by Owner-Occupants for Alterations and Repairs to Their 1-to-4-Unit Residential Properties as Measured by Quarterly Surveys, 1960-1969. Sampling error: 68% confidence limits. Horizontal axis: year.]

Figure 3. Figure 10-2 From Schmid (1983, p. 193) Showing a Traditional Method of Depicting Error on a Bar Chart Using an Error Rectangle Superimposed Over the End of the Bar. Reprinted with permission. This chart, like the previous one, shows the range of 68% confidence limits of the data, not by separate rectangles but by cross-hatching. (From Maria E. Gonzalez et al., Standards for Discussion and Presentation of Errors in Data, Technical Paper 32, United States Bureau of the Census, Washington, DC: Government Printing Office, 1974, p. v-3.)

[Figure 4 chart: four bar-chart panels, each showing 68% and 95% confidence intervals for 1980 and 1990. Horizontal axis: temporal or other type of designation.]

Figure 4. Figure 10-3 From Schmid (1983, p. 195) Showing Four Methods of Depicting Confidence Bounds on a Bar Chart Using Various Figurations Superimposed Over the End of the Bar. Reprinted with permission. Additional designs for portraying confidence limits for simple column and bar charts. Certain features based on suggestions of Albert Biderman and staff, Bureau of Social Science Research, Washington, DC.

3. ERRORS IN GRAPHS

Figure 1 is a typical display that includes both data and associated error. The data are shown as large black dots; the errors are shown as vertical bars that are one standard error long in each direction. This is as typical as the tabular form of including standard errors as parenthetical additions to the data entries.

There are analogous versions of Figure 1 for other graphical formats. Figures 2 and 3 show two (now outdated) approaches to adding a standard error bar to bar charts. Of course, we need not fixate on just one standard error (which under the usual Gaussian assumptions corresponds to a 68% confidence bound on the mean); we could show about two or two-and-a-half standard errors (corresponding roughly to 95% and 99% confidence bounds). In Figure 4 are four versions of how this might be done. These are all traditional usages despite all of them being wasteful of space. It is certainly profligate to use an entire bar when all of the information about the mean is contained in the location of the top line; the rest is chartjunk (using Tufte's (1983) apt neologism).

An Aside on Accurate Labeling: It is important to explain exactly what the bounds depicted are bounding. In each of the cases illustrated so far they are confidence bounds on the mean of the distribution. In the past there have been clear errors of understanding on this point. As an example consider the graph in Figure 5, which was mislabeled by its producers and subsequently described erroneously by Calvin Schmid in his otherwise accurate reportage.

[Figure 5 chart: Educational Attainment of Adult Males. Years of schooling completed, March 1973, for male age cohorts by year of birth. Source of data as printed on the chart: Bureau of the Census and special tabulations by Hauser & Featherman.]

Figure 5. A Figure Showing How Incorrect Labeling Can Lead to Wildly Inaccurate Inferences. It reports confidence intervals on the mean, but in fact merely depicts the mean and error bars showing one and two standard deviations in each direction. (From Mary A. Golladay, The Condition of Education, Part 1, HEW National Center for Education Statistics, Washington, DC: Government Printing Office, 1977, p. 104.) It also misreports the source of the data, which should probably be Table 1, p. 102, in Hauser & Featherman (1976).
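The distinction at stake in Figure 5, bounds on the mean versus bounds on the distribution, can be made concrete with a small calculation. The sketch below assumes Gaussian sampling error throughout; the standard deviation and sample size at the bottom are hypothetical numbers chosen only for illustration, and the function names are not from the article.

```python
from math import sqrt
from statistics import NormalDist


def coverage(k):
    """Two-sided Gaussian coverage of an interval of +/- k standard errors."""
    return 2 * NormalDist().cdf(k) - 1


def multiplier(level):
    """Number of standard errors needed for a given two-sided coverage level."""
    return NormalDist().inv_cdf(0.5 + level / 2)


def standard_error_of_mean(sd, n):
    """Standard error of a mean from the standard deviation and sample size."""
    return sd / sqrt(n)


for k in (1.0, 1.5, 2.0, 2.5):
    print(f"+/- {k} standard errors covers {coverage(k):.1%}")

for level in (0.68, 0.95, 0.99):
    print(f"{level:.0%} bound needs {multiplier(level):.2f} standard errors")

# Hypothetical cohort: sd of 3 years of schooling, 400 respondents.
sd, n = 3.0, 400
se = standard_error_of_mean(sd, n)
print(f"+/- 2 sd spans {4 * sd:.1f} years; +/- 2 se spans {4 * se:.1f} years")
```

With any sizable sample the standard deviation bars are many times wider than confidence bounds on the mean, which is why mislabeling one as the other leads to badly wrong inferences.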

[Figure 6 chart: grid of state postal abbreviations, one column per state, with shading around each abbreviation. Instructions printed on the chart: Read down the column directly under a state name listed in the heading at the top of the chart. Match the shading intensity surrounding a state postal abbreviation to the key below to determine whether the average mathematics performance of this state is higher than, the same as, or lower than the state in the column heading.]

Figure 6. A chart from the NAEP 1992 mathematics assessment report for the nation and the states, in which the state-by-state comparisons use a Bonferroni adjustment. (The remainder of the caption is illegible in the source.)

Clearly, the bounds specified are far too large to be confidence bounds on the mean. Neither could they be an accurate depiction of the distribution of years of schooling. Obviously, they are showing the means with horizontal lines and using boxes and bars to denote one and two standard deviation bounds. This would only represent the actual distribution of education if that distribution were approximately Gaussian. We suspect that this is unlikely, and that it is probable that at least at the turn of the century the distribution was skewed with a long straggly tail upward. How this skewness changed over time (if it did) would be an interesting question, unanswerable from this depiction.

An obvious modification to these figures would be putting in something analogous to the Bonferroni bounds, or perhaps half a bound. In the former case anytime a point falls within such a bound from another point they are not significantly different. In the latter any overlap of the bars sticking out could be interpreted as "not significant." This sort of idea has been ingeniously implemented in the NAEP charts (whose resemblance to those found on pantyhose packages has not gone unnoticed; see Figure 6) in which only states whose bounds do not overlap are different enough in performance for that difference to be thought of as consequential.

Figure 7 is a simpler version of this same idea in which the error bars shown represent the 820 Bonferroni bounds. To compare any two states you have only to note if one state's point is outside the other's bounds. An alternative to this (shown in Figure 8) is one in which the error bars are half of these. It appears somewhat less cluttered, and comparisons require noting whether two sets of bounds overlap. We have little intuition as to which of these two options is preferred, and thus will depart this aspect with no further recommendation except to try both and see what display appeals to you. Variations on this theme were explored earlier by McGill, Tukey, and Larsen (1978) in their notched box plots.

[Figure 8 plot annotation: Non-overlapping error bars between any two jurisdictions indicate that they are significantly different ($\alpha = .05$).]

Figure 8. Average Mathematics Proficiency for 8th Grade Children With at Least One Parent Who Graduated From College (1992 NAEP State Assessment Data). A difference between two states is statistically significant (at the .05 level) if one state's bounds do not overlap the other state's bounds.
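The two reading rules for Figures 7 and 8 lead to the same decisions, which a short sketch can confirm. It assumes a common bound for every state, here the "820 Bonferroni" value of 14.0 from the "Graduated college" column of Table 2, and uses three means from that column. The function names are hypothetical.

```python
def differs_point_outside_bound(mean_a, mean_b, bound):
    """Figure 7 reading rule: each point carries bars of +/- bound, and two
    states differ if one state's point falls outside the other's bars."""
    return abs(mean_a - mean_b) > bound


def differs_half_bars_disjoint(mean_a, mean_b, bound):
    """Figure 8 reading rule: each point carries bars of +/- bound / 2, and
    two states differ if their bars do not overlap."""
    half = bound / 2
    lo_a, hi_a = mean_a - half, mean_a + half
    lo_b, hi_b = mean_b - half, mean_b + half
    return hi_a < lo_b or hi_b < lo_a


BOUND = 14.0  # "820 Bonferroni" error term, "Graduated college" column
means = {"Iowa": 291, "Delaware": 274, "Hawaii": 267}

for a, b in [("Iowa", "Hawaii"), ("Delaware", "Hawaii"), ("Iowa", "Delaware")]:
    full = differs_point_outside_bound(means[a], means[b], BOUND)
    half = differs_half_bars_disjoint(means[a], means[b], BOUND)
    print(f"{a} vs {b}: full-bar rule {full}, half-bar rule {half}")
```

Since both rules reduce to asking whether the difference in means exceeds the bound, the choice between the two displays is one of visual clutter rather than of statistical content.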

[Figure 7 plot annotation: Error bars represent 95% confidence intervals on any state-to-state comparison. Horizontal axis: jurisdictions.]

Figure 7. Average Mathematics Proficiency for 8th Grade Children With at Least One Parent Who Graduated From College (1992 NAEP State Assessment Data). A difference between two states is statistically significant (at the .05 level) if one state's data point is outside the other state's bounds.

3.1 Changing the Metaphor

Making the physical size, and hence the visual impact, of data points proportional to their error with plotting conventions like error bars may be a bad idea. Not only does it often make the picture crowded, but it also gives greatest visual impact to the most poorly estimated points. Is this what we want to do?

For some uses such a display metaphor may be useful. For example, if we wish to draw some sort of regression line through the points, we should try to get closer to a well-estimated point than to one that is more error laden. Thus we might draw less precisely estimated points with a large symbol (a longer bar) and more precisely estimated ones with a smaller symbol (a shorter bar). Then, when fitting, we try to make the fitted function pass through all the error bars. Often, however, we can implement this sort of weighted regression analytically, and not have to clog the display visually with inaccurate points.

A basic premise of effective display (Bertin 1973; Cleveland 1994; Wainer 1984, 1992) is to make the visual representation of the data an accurate visual metaphor. If the data increase, the representation should get larger or go higher; more should indicate more. A display that violates this rule is bound to be misunderstood. How can this rule be implemented in depicting data and their associated error? At first blush it would seem to suggest that bigger error bars should be associated with bigger error. Indeed this is probably correct if the focus of the display is on the error. If, however, the focus is not on the error but rather on the data themselves, perhaps we can improve the comprehensibility of the display by reversing this metaphor, making the size of the plotting icon proportional to the precision and not the error. Thus data points that are more accurately estimated are drawn to make a bigger visual impression. This suggests that we make data points' size proportional to their precision (i.e., perhaps proportional to $se^{-1}$ if we wish to focus on the precision, or proportional to $se^{-2}$ if we wish instead to focus on the sample size represented). In this way our attention will be drawn to the more accurate points; very inaccurate points would be hardly visible, and so have little impact.

As an example consider the hypothetical data shown in Figure 9. There are five points shown with their standard errors. The graph suggests that the relationship between X and Y is nonlinear with an ogive shape. The same data are shown in Figure 10, in which the data points' areas are shown proportional to their precision (you must look carefully for point 4). In this depiction the linearity of the data is one's principal perception; such a perception is supported by formal statistical testing. Perhaps we can generalize this idea through the use of something akin to horizontal "precision bars" around each point. These bars could be proportional to the (Fisherian) information contained in each point. This idea is merely the visual counterpart to what is commonly done already in many analytic estimation procedures, when we sometimes examine the Hessian (the information matrix) and sometimes the diagonal elements of its inverse (the squared standard errors of the parameters). Another alternative might be to use a plotting icon (like Bachi's (1978) "Graphical Rational Pattern" shown below) to reflect precision.

[Icon chart: Bachi's Graphical Rational Patterns for the values 1 through 10.]

Figure 9. A Plot of Hypothetical Data That Displays the Data and Their Standard Errors as Vertical Bars. The graph suggests a nonlinear relationship between X and Y.

Figure 10. A Plot of Hypothetical Data That Displays the Data Sized Proportional to Their Precision. The graph suggests a linear relationship between X and Y.

Hoaglin and Tukey (1985) use the amount of empty space to depict the least accurately estimated points in their confidence-aperture plots. Figure 11 shows that there is no significant departure from a logarithmic series distribution in the butterfly data that were collected by Fisher, Corbet, and Williams (1943). They plotted a function of the number of species found ($n_k$) and the number of individual members of a butterfly species caught ($k$) against $k$. In this display the tips of the "pencils" represent a 95% confidence interval for an individual data point, while simultaneously yielding but a 29% simultaneous interval for the 24 values of $k$. The portion of the plot associated with the top of the body of the pencil yields a larger area, and so the corresponding probabilities grow to 99.6 and 92.4%. Hoaglin, Mosteller, and Tukey thought enough of this idea to feature it on the cover of their book.

[Figure 11 legend: approximate significance levels, individual and simultaneous, for the pencil tips and pencil bodies. Digits partly illegible in the source.]

Figure 11. A confidence-aperture plot from Hoaglin and Tukey (1985) of the butterfly data collected by Fisher, Corbet, and Williams (1943), in which the amount of open space reflects uncertainty. Reprinted with permission. (Caption partly illegible in the source.)

The worth of such schemes will require some visual experimentation with formats, some experience to overcome the older conventions, and the careful consideration of the questions that the display is meant to answer.

3.2 Sloppy (Dirty) Graphics

Another graphical way to remind the viewer of the fallibility of the data in a display is to draw and label the display only as well as the accuracy of the data allows. This is exactly analogous to the too often ignored practice of writing down numerical values to only the number of digits that are significant (for arguments and [...] of the efficacy and importance of rounding see Ehrenberg 1977; Walker and Durost 1936; Wainer 1993, 1996).

One possible approach might be to calibrate and label the axes in such a way as to limit the accuracy of the inferences to only what is justified by the data. An example of such a display is shown in Figure 12 (after Herrnstein and Murray 1994).

Although an asset of such a display is that it does not allow incorrectly precise statements, a possible drawback is that it does not allow imprecisely correct statements. This provokes the image of Damon Runyon, when he said, "The race is not always to the swift, nor the battle to the strong, but that's the way to lay your bets." (Perhaps, specializing John Tukey's well-known aphorism, this corresponds to: "If it's worth displaying, it's worth displaying badly.") Thus even though we may not have good estimates, it may sometimes be important for the graph to allow us to extract the best guess available. (I have often witnessed requests for a single number to characterize a complex situation. Some then argue that "a single number is inappropriate and we ought not provide it." Others cynically contend that "if we don't give them a number they'll calculate one of their own. Let's at least provide the best single number we can." I have some sympathy with both views. Perhaps by providing graphical answers like that in Figure 12 we can provide the qualitative answer without allowing inferences of inappropriate precision.)

An allied approach might be to accurately label and calibrate the axes, but remove distinct boundaries from the graphical representation. For example, in Figure 13 we show the performance of five ethnic groups on one portion of the 1992 NAEP mathematics assessment. Through the use of graduated shading around the actual mean performance we make it impossible to unambiguously choose a single number to characterize that ethnic group's performance. The location at which the shading begins to thin is determined by both the mean level and the estimated standard error. Groups, like American Indians, with large standard errors have the shading begin well below the mean, and it changes very gradually. Ethnic groups, like Whites, that have small standard errors are characterized by a much steeper gradient in the shading algorithm. (Figure 13 is little more than an initial attempt. I fully expect that it can be improved upon with experience. The rule used to generate it was to use black at the mean, and then graduate the shading exponentially to white 2.5 se above and below the mean.) This approach is a continuous analog to the more traditional methods shown in Figure 4. (This is probably more properly called a Bayesian approach because it depicts a posterior distribution rather than a confidence region.)

Figure 13. Average U.S. Proficiency in "Algebra and Functions," Grade 8-1990. Using graduated shading can make it possible to automatically build the perception of variability into what used to be a bar chart. [Horizontal axis: Race/Ethnicity (Asian/Pacific Islander, White, Hispanic, Black, American Indian).]
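As a rough illustration of the shading rule described above for Figure 13 (black at the mean, fading exponentially to white 2.5 standard errors away), the following Python sketch computes a darkness value for any score. It is only one reading of that rule: the function names, the decay constant `rate`, the rescaling that forces the curve to reach exactly white at the cutoff, and the numbers in the example are all assumptions introduced here, not details taken from the article.

```python
import math


def shade_darkness(y, mean, se, cutoff=2.5, rate=1.0):
    """Darkness in [0, 1] at score y: 1.0 (black) at the mean, 0.0 (white)
    at `cutoff` standard errors or more from the mean, exponential between.
    The decay constant `rate` is an assumption; the article gives none."""
    if se <= 0:
        raise ValueError("se must be positive")
    d = abs(y - mean) / se  # distance from the mean in standard-error units
    if d >= cutoff:
        return 0.0
    floor = math.exp(-rate * cutoff)  # raw exponential value at the cutoff
    # Rescale so the curve equals 1 at d = 0 and 0 at d = cutoff.
    return (math.exp(-rate * d) - floor) / (1.0 - floor)


def shading_column(mean, se, lo, hi, steps=9):
    """Darkness sampled on an evenly spaced vertical grid from lo to hi."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return [(y, round(shade_darkness(y, mean, se), 3)) for y in grid]


# Made-up numbers for illustration only (not NAEP estimates).
precise_group = shading_column(mean=260.0, se=1.0, lo=256.0, hi=264.0)
noisy_group = shading_column(mean=260.0, se=4.0, lo=256.0, hi=264.0)
```

With these hypothetical inputs the group with the small standard error is fully white by 2.5 points from its mean, whereas the group with the large standard error is still visibly shaded at the ends of the grid. This matches the steeper and more gradual gradients described in the text. The resulting darkness values could be handed to any plotting routine that accepts a gray level.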

Figure 12. Displaying Data of Uncertain Precision with the Accuracy They Deserve Through the Use of Appropriately Calibrated Axes.

4. SUMMARY

In this essay we have described and illustrated some suggestions for alternative methods to depict error. The principal goal was to make it easier for inferences about data to include the effects of the precision with which those data


were measured. Sometimes (with tabular presentation) this involved calculating and displaying the appropriate error term for the most likely uses of the data. In other situations (like the fade-away chart in Figure 13) it meant making the precision an integral part of the display.

I am sure that the notion of using shading to characterize error is a long way from being original with me (my former colleague, Albert Biderman, suggested it to me over coffee a decade ago). I am certain that the primary reason we have not seen it used more widely in practice is its difficult implementation. Modern computer graphic software has removed this impediment. This apparently novel statistical suggestion is old hat to cartographers. Petermann (1851) used continuous tone implemented by lithographic crayon shading to show variations in the population of Scotland. But even this use was presaged by Quetelet (1832) who produced some fuzzy social data maps showing the distribution of crime (see Figure 14, reproduced from Robinson 1982). More recently, MacEachren (1992, p. 14; 1994, p. 81) used this method effectively to display the distribution of risk surrounding a nuclear power plant. He also suggests schemes such as defocusing images to convey uncertainty.

Figure 14. Quetelet's 1831 Plate 2, "Cartes Figuratives-Crimes Contre Les Propriétés-Contre Les Personnes." Original 335 x 230 mm. Lithograph, crayon shading. (Courtesy of the Center for Research Libraries, Chicago, IL.) Graduated shading was used by Quetelet in 1831 to show the distribution of crimes against property (left side) and crimes against persons (right side). Reprinted with permission from Robinson (1982).

Dynamic possibilities for the representation of uncertainty have not been discussed, although we believe that such schemes as blinking data points in which the proportion of "on" to "off" time is proportional to the points' precision, or a visual analog of multiple imputation in which the location of a data point changes as you watch it are promising possibilities.

The suggestions contained here are meant only as a beginning. The assessment of their usefulness awaits broader application, discussion, and experimentation for, reapplying Karl Pearson's final words in the Grammar of Science, it is only by "daring to display our ignorance" that our perceptions of scientific progress will stay in reasonably close proximity to that progress.

[Received November 1994. Revised September 1995.]

REFERENCES

Bachi, R. (1978), "Proposals for the Development of Selected Graphical Methods," in Graphic Presentation of Statistical Information: Papers Presented at the 136th Annual Meeting of the American Statistical Association. U.S. Dept. of Commerce, Bureau of the Census.

Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289-300.

Bertin, J. (1973), Semiologie Graphique (2nd ed.), The Hague: Mouton-Gautier. (English translation done by William Berg and Howard Wainer and published as Semiology of Graphics, Madison, WI: University of Wisconsin Press, 1983)

Bonferroni, C. E. (1936), "Il Calcolo delle Assicurazioni su Gruppi di Teste," in Studii in Onore del Prof. S. O. Carboni, Rome, Italy.

Cleveland, W. S. (1994), The Elements of Graphing Data (2nd ed.), Summit, NJ: Hobart.

Ehrenberg, A. S. C. (1977), "Rudiments of Numeracy," Journal of the Royal Statistical Society, Ser. A, 140, 277-297.

Fisher, R. A., Corbet, A. S., and Williams, C. B. (1943), "The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population," Journal of Animal Ecology, 12.

Hauser, R. M., and Featherman, D. L. (1976), "Equality of Schooling: Trends and Prospects," Sociology of Education, 49, 92-112.

Herrnstein, R. J., and Murray, C. (1994), The Bell Curve: Intelligence and Class Structure in American Life, New York: The Free Press.

Hoaglin, D. C., and Tukey, J. W. (1985), "Checking the Shape of Discrete Distributions," in Exploring Data Tables, Trends and Shapes, eds. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, New York: Wiley, chap. 9, pp. 345-416.

MacEachren, A. M. (1992), "Visualizing Uncertain Information," Cartographic Perspectives, 13, 10-19.

MacEachren, A. M. (1994), "[...] Quality and the Representation of Uncertainty," in Some Truth with Maps: A Primer on Symbolization and Design, Washington, DC: Association of American Cartographers, chap. 4.

McGill, R., Tukey, J. W., and Larsen, W. (1978), "Variations of Box Plots," The American Statistician, 32, 12-16.

Miller, R. G. (1966), Simultaneous Statistical Inference, New York: McGraw-Hill.

Mosteller, F., and Tukey, J. W. (1968), "Data Analysis, Including Statistics," in The Handbook of Social Psychology, Vol. 2, eds. G. Lindzey and E. Aronson, Reading, MA: Addison-Wesley, pp. 80-203.

Pearson, K. (1892), The Grammar of Science, London: Walter Scott.

Petermann, A. (1852), "Distribution of the Population," Census of Great Britain, 1851, Lithograph, Crayon Shading in the British Library.

Quetelet, A. (1832), "Recherches sur le Penchant au Crime aux Différens Ages," Nouveaux Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles, 7, 1-87.

Robinson, A. H. (1982), Early Thematic Mapping in the History of Cartography, Chicago: [...] Press.

Schmid, C. F. (1983), Statistical Graphics: Design Principles and Practices, New York: John Wiley.

Tufte, E. R. (1983), The Visual Display of Quantitative Information, Cheshire, CT: Graphics Press.

Wainer, H. (1984), "How to Display Data Badly," The American Statistician, 38, 137-147.

Wainer, H. (1992), "Understanding Graphs and Tables," Educational Researcher, 21, 14-23.

Wainer, H. (1993), "Tabular Presentation," Chance, 6(3), 52-56.

Wainer, H. (in press), "Improving Tabular Displays: With NAEP Tables as Examples and Inspirations," Journal of Educational and Behavioral Statistics, 22.

Walker, H. M., and Durost, W. N. (1936), Statistical Tables: Their Structure and Use, New York: Bureau of Publications, Teachers College, [...].

Williams, V. S. L., Jones, L. V., and Tukey, J. W. (1994), "Controlling Error in Multiple Comparisons with Special Attention to the National Assessment of Educational Progress," Technical Report 33, Research Triangle Park, NC: National Institute of Statistical Sciences.

The American Statistician, May 1996, Vol. 50, No. 2 111
