Violin Plots: A Box Plot-Density Trace Synergism
Author(s): Jerry L. Hintze and Ray D. Nelson
Source: The American Statistician, Vol. 52, No. 2 (May, 1998), pp. 181-184
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2685478

Statistical Computing and Graphics

Jerry L. Hintze is President, NCSS, 329 North 1000 East, Kaysville, UT 84037 (E-mail: [email protected]). Ray D. Nelson is Associate Professor of Business Management, Marriott School of Management, Brigham Young University, Provo, UT 84602.

Many modifications build on Tukey's original box plot. A proposed further adaptation, the violin plot, pools the best statistical features of alternative graphical representations of batches of data. It adds the information available from local density estimates to the basic summary statistics inherent in box plots. This marriage of summary statistics and density shape into a single plot provides a useful tool for data analysis and exploration.

KEY WORDS: Density estimation; Exploratory data analysis; Graphical techniques.

1. INTRODUCTION

Many different statistics and graphs summarize the characteristics of single batches of data. Descriptive statistics give information about location, scale, symmetry, and tail thickness. Other statistics and graphs investigate extreme observations or study the distribution of data values. Diagrams such as stem-leaf plots, dot plots, box plots, histograms, density traces, and probability plots give information about the distribution of values assumed by all observations.

The violin plot, introduced in this article, synergistically combines the box plot and the density trace (or smoothed histogram) into a single display that reveals structure found within the data. The introduction of this new graphical tool begins with a quick overview of the combination of the box plot and density trace into the violin plot. Then, three illustrations and examples show the advantages and challenges of violin plots in data summarization and exploration.

2. COMPONENT PARTS OF VIOLIN PLOTS

The violin plot, as depicted in Figure 1 and implemented in NCSS (1997) statistical software, combines the box plot and density trace into one diagram. The name violin plot originated because one of the first analyses that used the envisioned procedure resulted in a graphic with the appearance of a violin. Violin plots add information to the simple structure of the box plot that Tukey (1977) initially conceived. Although these original graphs are easily drawn with pencil and paper, computers ease subsequent modifications, refinements, and computation of box plots as discussed by McGill, Tukey, and Larsen (1978); Velleman and Hoaglin (1981); Chambers, Cleveland, Kleiner, and Tukey (1983); Frigge, Hoaglin, and Iglewicz (1989), and others.

Box plots show four main features about a variable: center, spread, asymmetry, and outliers. As an example, consider the box plot in Figure 1 for the data published by Hamermesh (1994). The ASA Statistical Graphics Section's 1995 Data Analysis Exposition analyzes these data, which report compensation of professors from all academic ranks in the United States. The labels in the diagram identify the principal lines and points which form the main structure of the traditional box plot diagram. As shown, the violin plot includes a box plot with two slight modifications. First, a circle replaces the median line which facilitates quick comparisons when viewing multiple groups. Second, outside points which are traditionally classified as mild and severe outliers, are not identified by individual symbols.

The density trace supplements traditional summary statistics by graphically showing the distributional characteristics of batches of data. One simple density estimator, the histogram, displays the distribution of data values along the real number line. Weaknesses of the histogram caused Tapia and Thompson (1979), Parzen (1979), Silverman (1986), Izenman (1991), and Scott (1992) to propose and summarize numerous alternative density estimators. One of these alternatives is the density trace described in Chambers, Cleveland, Kleiner, and Tukey (1983). Defining the location density d(x|h) at a point x as the fraction of the data values per unit of measurement that fall in an interval centered at x gives

$$d(x \mid h) = \frac{\sum_{i=1}^{n} \delta_i}{n h} \tag{1}$$
where n is the sample size, h is the interval width, and δ_i is one when the ith data value is in the interval [x - h/2, x + h/2] and zero otherwise. In order to plot the density trace, first select a value for h and then compute d(x|h) on a dense grid of equally spaced x values. Connect the d(x|h) by lines. The shape of the d(x|h) curve is essentially driven by the interval length, h. It is very smooth for large values of h, and "wiggly" for small values.

Unfortunately, several density traces shown side by side are difficult to compare. Contrasting the distributions of several batches of data, however, is a common task. In order to add information to the box plot and still make comparisons possible, Benjamini (1988) suggested "opening the box" of the box plot. He makes the width of the box proportional to the estimated density. The violin plot builds on the Benjamini proposal by combining the advantages of box plots with density traces.

The violin plot, as shown in Figure 1, combines the box plot with density traces. The density trace is plotted symmetrically to the left and the right of the (vertical) box plot. There is no difference in these density traces other than the direction in which they extend. Adding two density traces gives a symmetric plot, which makes it easier to see the magnitude of the density. This hybrid of the density trace and the box plot allows quick and insightful comparison of several distributions.

[Figure 1. Common Components of Box Plot and Violin Plot. Total compensation for all academic ranks.]

3. SPECIFICATION OF INTERVAL WIDTH

As with other density estimators, achieving an acceptable density trace requires experience and judgment in determining the appropriate amount of smoothing. As with the selection of the bin width in the histogram, the interval width h, which is usually specified as a percentage of the data range, must be selected. Experience suggests that values near 15 percent of the data range often give good results. The choice of h, however, must be tempered by the size of the sample. The density trace is subject to the same sample size restrictions and challenges that apply to any density estimator. For small data sets, too small a value for h gives a wiggly density trace that suggests features that are simply artifacts of the individual data points. The oversmoothed density estimate that results from too large h values gives the illusion of knowing the shape of the distribution, while in reality the data set is too small for any conclusions. As a rule of thumb based on practice, the density trace tends to do a reasonable job with samples of at least 30 observations. Even with sample sizes of several hundred, however, choosing too large a value for h causes the density trace to oversmooth the data. In general, values of h greater than 40 percent of the range usually result in oversmoothed densities, while values less than 10 percent of the range result in undersmoothed densities. Hence, percentages between 10 and 40 percent are recommended.

4. ILLUSTRATIONS AND APPLICATIONS

With the addition of the density trace to the box plot, violin plots provide a better indication of the shape of the distribution. This includes showing the existence of clusters in data. The density trace highlights the peaks, valleys, and bumps in the distribution. Three applications and examples of violin plots illustrate these advantages. The first example demonstrates the ability of violin plots to distinguish among the shapes of known distributions. The second highlights

[Figure 2. Comparison of Box Plots and Violin Plots for Known Distributions (Bimodal, Uniform, Normal). (a) Box plots; (b) violin plots.]

[Figure 3. Additional Information in Violin Plots. Two examples from the density estimation literature: (a) annual snowfall for Buffalo, NY, 1910-1972; (b) Old Faithful eruption length.]
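The density trace of Equation (1), together with the interval-width guidance of Section 3, can be sketched in a few lines of code. The following is a minimal illustration only, not the NCSS implementation. The function names `density_trace` and `violin_outline`, the grid of 200 points, the choice to span the grid from the smallest to the largest data value, and the default of 15 percent of the range are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def density_trace(data: Sequence[float], x: float, h: float) -> float:
    """Equation (1): fraction of data values per unit of measurement
    that fall in the interval [x - h/2, x + h/2]."""
    if h <= 0:
        raise ValueError("interval width h must be positive")
    n = len(data)
    if n == 0:
        raise ValueError("data must not be empty")
    # delta_i is one when the i-th value lies in the interval, zero otherwise.
    count = sum(1 for v in data if x - h / 2 <= v <= x + h / 2)
    return count / (n * h)


def violin_outline(
    data: Sequence[float],
    fraction: float = 0.15,   # assumed default; the text recommends 10 to 40 percent
    grid_size: int = 200,     # assumed size of the "dense grid"
) -> List[Tuple[float, float, float]]:
    """Return (x, left, right) triples: the density trace mirrored about zero."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must lie in (0, 1]")
    if grid_size < 2:
        raise ValueError("grid_size must be at least 2")
    lo, hi = min(data), max(data)
    data_range = hi - lo
    if data_range == 0:
        raise ValueError("data must contain at least two distinct values")
    h = fraction * data_range  # interval width as a fraction of the data range
    step = data_range / (grid_size - 1)
    grid = [lo + k * step for k in range(grid_size)]
    # The same trace is drawn to the left (negative) and right (positive).
    return [(x, -d, d) for x in grid for d in [density_trace(data, x, h)]]
```

Drawing the violin would then amount to plotting the left and right offsets against x on either side of a vertical box plot and connecting the points by lines; the plotting step is omitted here.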