Violin Plots: A Box Plot-Density Trace Synergism
Author(s): Jerry L. Hintze and Ray D. Nelson
Source: The American Statistician, Vol. 52, No. 2 (May, 1998), pp. 181-184
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2685478
Accessed: 02/09/2010 11:01


Statistical Computing and Graphics

Violin Plots: A Box Plot-Density Trace Synergism

Jerry L. HINTZE and Ray D. NELSON

Jerry L. Hintze is President, NCSS, 329 North 1000 East, Kaysville, UT 84037 (E-mail: [email protected]). Ray D. Nelson is Associate Professor of Business Management, Marriott School of Management, Brigham Young University, Provo, UT 84602.

Many modifications build on Tukey's original box plot. A proposed further adaptation, the violin plot, pools the best statistical features of alternative graphical representations of batches of data. It adds the information available from local density estimates to the basic summary statistics inherent in box plots. This marriage of summary statistics and density shape into a single plot provides a useful tool for data analysis and exploration.

KEY WORDS: Density estimation; Exploratory data analysis; Graphical techniques.

1. INTRODUCTION

Many different statistics and graphs summarize the characteristics of single batches of data. Descriptive statistics give information about location, scale, symmetry, and tail thickness. Other statistics and graphs investigate extreme observations or study the distribution of data values. Diagrams such as stem-leaf plots, dot plots, box plots, histograms, density traces, and probability plots give information about the distribution of values assumed by all observations.

The violin plot, introduced in this article, synergistically combines the box plot and the density trace (or smoothed histogram) into a single display that reveals structure found within the data. The introduction of this new graphical tool begins with a quick overview of the combination of the box plot and density trace into the violin plot. Then, three illustrations and examples show the advantages and challenges of violin plots in data summarization and exploration.

2. COMPONENT PARTS OF VIOLIN PLOTS

The violin plot, as depicted in Figure 1 and implemented in NCSS (1997) statistical software, combines the box plot and density trace into one diagram. The name violin plot originated because one of the first analyses that used the envisioned procedure resulted in a graphic with the appearance of a violin. Violin plots add information to the simple structure of the box plot that Tukey (1977) initially conceived. Although these original graphs are easily drawn with pencil and paper, computers ease subsequent modifications, refinements, and computation of box plots as discussed by McGill, Tukey, and Larsen (1978); Velleman and Hoaglin (1981); Chambers, Cleveland, Kleiner, and Tukey (1983); Frigge, Hoaglin, and Iglewicz (1989), and others.

Figure 1. Common Components of Box Plot and Violin Plot. Total compensation for all academic ranks.

Box plots show four main features about a variable: center, spread, asymmetry, and outliers. As an example, consider the box plot in Figure 1 for the data published by Hamermesh (1994). The ASA Statistical Graphics Section's 1995 Data Analysis Exposition analyzes these data, which report compensation of professors from all academic ranks in the United States. The labels in the diagram identify the principal lines and points which form the main structure of the traditional box plot diagram. As shown, the violin plot includes a box plot with two slight modifications. First, a circle replaces the median line, which facilitates quick comparisons when viewing multiple groups. Second, outside points, which are traditionally classified as mild and severe outliers, are not identified by individual symbols.

The density trace supplements traditional summary statistics by graphically showing the distributional characteristics of batches of data. One simple density estimator, the histogram, displays the distribution of data values along the real number line. Weaknesses of the histogram caused Tapia and Thompson (1978), Parzen (1979), Silverman (1986), Izenman (1991), and Scott (1992) to propose and summarize numerous alternative density estimators. One of these alternatives is the density trace described in Chambers, Cleveland, Kleiner, and Tukey (1983). Defining the location density $d(x \mid h)$ at a point $x$ as the fraction of the data values per unit of measurement that fall in an interval centered at $x$ gives

$$d(x \mid h) = \frac{\sum_{i=1}^{n} \delta_i}{n h} \tag{1}$$

where $n$ is the sample size, $h$ is the interval width, and $\delta_i$ is one when the $i$th data value is in the interval $[x - h/2,\ x + h/2]$ and zero otherwise. In order to plot the density trace, first select a value for $h$ and then compute $d(x \mid h)$ on a dense grid of equally spaced $x$ values. Connect the $d(x \mid h)$ by lines. The shape of the $d(x \mid h)$ curve is essentially driven by the interval length, $h$. It is very smooth for large values of $h$, and "wiggly" for small values.

Unfortunately, several density traces shown side by side are difficult to compare. Contrasting the distributions of several batches of data, however, is a common task. In order to add information to the box plot and still make comparisons possible, Benjamini (1988) suggested "opening the box" of the box plot. He makes the width of the box proportional to the estimated density. The violin plot builds on the Benjamini proposal by combining the advantages of box plots with density traces.

The violin plot, as shown in Figure 1, combines the box plot with density traces. The density trace is plotted symmetrically to the left and the right of the (vertical) box plot. There is no difference in these density traces other than the direction in which they extend. Adding two density traces gives a symmetric plot which makes it easier to see the magnitude of the density. This hybrid of the density trace and the box plot allows quick and insightful comparison of several distributions.

3. SPECIFICATION OF INTERVAL WIDTH

As with other density estimators, achieving an acceptable density trace requires experience and judgment in determining the appropriate amount of smoothing. As with the selection of the bin width in the histogram, the interval width $h$, which is usually specified as a percentage of the data range, must be selected. Experience suggests that values near 15 percent of the data range often give good results. The choice of $h$, however, must be tempered by the size of the sample. The density trace is subject to the same sample size restrictions and challenges that apply to any density estimator. For small data sets, too small a value for $h$ gives a wiggly density trace that suggests features that are simply artifacts of the individual data points. The oversmoothed density estimate that results from too large $h$ values gives the illusion of knowing the shape of the distribution, while in reality the data set is too small for any conclusions. As a rule of thumb based on practice, the density trace tends to do a reasonable job with samples of at least 30 observations. Even with sample sizes of several hundred, however, choosing too large a value for $h$ causes the density trace to oversmooth the data. In general, values of $h$ greater than 40 percent of the range usually result in oversmoothed densities, while values less than 10 percent of the range result in undersmoothed densities. Hence, percentages between 10 and 40 percent are recommended.

Figure 2. Comparison of Box Plots and Violin Plots for Known Distributions (Bimodal, Uniform, Normal). (a) Box plots; (b) violin plots.

Figure 3. Additional Information in Violin Plots. Two examples from the density estimation literature: (a) annual snowfall for Buffalo, NY, 1910-1972; (b) Old Faithful eruption length.

4. ILLUSTRATIONS AND APPLICATIONS

With the addition of the density trace to the box plot, violin plots provide a better indication of the shape of the distribution. This includes showing the existence of clusters in data. The density trace highlights the peaks, valleys, and bumps in the distribution. Three applications and examples of violin plots illustrate these advantages. The first example demonstrates the ability of violin plots to distinguish among the shapes of known distributions. The second highlights
the ability to detect bumps or clusters of data. The third shows their potential in exploring for structure and pattern in the academic compensation data used previously in the illustration of the components of violin plots. The values for the interval widths $h$ are chosen using personal judgment from within the recommended 10 to 40 percent interval. These examples establish the potential of violin plots in data analysis and exploration.

4.1 Comparison of Known Distributions

Consider first the ability to detect general shapes for distributions of data. Figure 2 depicts box plots and violin plots for random samples of 10,000 simulated observations drawn from three different known distributions. The three distributions share identical location and scale characteristics as measured by the median and interquartile range. The first is a bimodal distribution with modes at -5 and 5 and range between -10 and 10. The second is a uniform distribution on the interval [-10, 10]. The third is a normal distribution N(0, 54.95). The box plots in Figure 2(a) reflect the fact that all three have the same median and interquartile range.

As expected, the density trace accurately reveals the shape of the distribution from which the random samples are drawn. The violin plot for the bimodal distribution clearly shows the twin peaks of the known distribution. Unfortunately, box plots cannot differentiate between the shapes of the bimodal and uniform distributions. The box plots do, however, show that the normal distribution differs from the others as it does have a larger range.

These plots seem to indicate that since the mass of the bimodal plot is less than the normal plot, the bimodal plot is based on fewer observations. This is a weakness of this implementation of the violin plot, which adjusts the density traces so that their maximum heights are equal. This allows a direct comparison of the shapes, but removes the visual impact of sample size. A variation of this implementation would keep a uniform scaling of the density traces.

4.2 Density Estimation Examples

Clusters of data appear as bumps in density estimators. Box plots often do not alert analysts to their existence. Two examples from the density estimation literature clearly illustrate this ability. First, Parzen (1979) and Scott (1992) used annual snowfall data for Buffalo, New York for 1910-1972 to show the value of nonparametric density estimation. The violin plot in Figure 3(a) illustrates the additional insights available through density estimators that the basic box plot does not reveal. The second example in Figure 3(b), which uses data previously considered by Silverman (1986) and Scott (1992), shows the bimodal nature of Old Faithful eruption lengths. Once again, the violin plot clearly adds significant insight about the distribution of the process generating the data.

4.3 Academic Compensation

The information that violin plots add to box plots increases the potential of these tools when used in data exploration. As an example of the value of violin plots, consider the diagrams in Figure 4 for data published by Hamermesh (1994). The graphics in Figure 4(a) summarize total compensation of all professors for three different classifications of colleges and universities. The first category includes institutions with a significant level of doctoral-level education. The second encompasses institutions with diverse post-baccalaureate programs but which do not have a significant level of doctoral programs. The colleges and universities in the third category focus their primary activity on undergraduate baccalaureate-level education. The bumps in the doctoral level and post undergraduate violin plots suggest that some universities in each of these categories might have compensation characteristics which distinguish them from other members of the general group.

Figure 4. Exploring Data with Violin Plots. Total compensation data from ASA analysis competition. (a) Compensation of all professors by university classification (Category I, Category IIA, Category IIB); (b) total compensation by academic rank (Full, Associate, and Assistant Professors).

Figure 4(b) shows the distribution of total compensation for institutions in all three categories by academic rank. All three of the categories appear to be somewhat positively skewed with the [...] increasing with the academic rank. Comparison of the medians gives the expected increase in compensation with the higher academic rank. An interesting bulge grows in the upper tail of the distributions as the academic rank increases. In an exploratory analysis, the violin plots point to the next question, which might investigate the characteristics of the institutions in these clusters.

5. SUMMARY AND CONCLUSIONS

Individually, box plots provide succinct summaries of data. By themselves, density traces reveal important information about the distribution of data. The synergistic combination of the box plot and the density trace allows much of the information from each to be displayed in one plot. This single plot structure makes comparisons of distributional factors of several variables much easier. Three different illustrations show that violin plots retain much of the information of box plots and add information about the shape of the distribution not obvious in box plots. Their ability to detect clusters or bumps within a distribution is especially valuable.

[Received February 1997. Revised November 1997.]

REFERENCES

Benjamini, Y. (1988), "Opening the Box of the Box Plot," The American Statistician, 42, 257-262.
Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983), Graphical Methods for Data Analysis, Belmont, CA: Wadsworth.
Frigge, M., Hoaglin, D. C., and Iglewicz, B. (1989), "Some Implementations of the Box Plot," The American Statistician, 43, 50-54.
Hamermesh, D. (1994), "Plus Ca Change: The Annual Report on the Economic Status of the Profession, 1993-1994," Academe, 5-89.
Hintze, J. (1997), User's Guide, NCSS 97, Statistical System for Windows, Kaysville, UT: Number Cruncher Statistical Systems (http://www.ncss.com).
Izenman, A. J. (1991), "Recent Developments in Nonparametric Density Estimation," Journal of the American Statistical Association, 86, 205-224.
McGill, R., Tukey, J. W., and Larsen, W. A. (1978), "Variations of Box Plots," The American Statistician, 32, 12-16.
Parzen, E. (1979), "Nonparametric Statistical Data Modeling," Journal of the American Statistical Association, 74, 105-131.
Scott, D. W. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization, New York: Wiley.
Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, New York: Chapman and Hall.
Tapia, R. A., and Thompson, J. R. (1978), Nonparametric Probability Density Estimation, Baltimore, MD: Johns Hopkins University Press.
Tukey, J. W. (1977), Exploratory Data Analysis, Reading, MA: Addison-Wesley.
Velleman, P. F., and Hoaglin, D. C. (1981), Applications, Basics and Computing of Exploratory Data Analysis, Boston: Duxbury Press.
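
The density trace of equation (1), the choice of the interval width as a percentage of the data range, and the mirrored outline scaled to equal maximum height can be sketched in a few lines of Python. This is only an illustrative sketch and not the NCSS implementation: the function names `density_trace` and `violin_outline`, the grid size, and the triangular-mixture sample standing in for the bimodal distribution are all assumptions made for the example.

```python
import random
from bisect import bisect_left, bisect_right


def density_trace(data, grid, h):
    """Equation (1): d(x|h) = (count of values in [x - h/2, x + h/2]) / (n * h)."""
    values = sorted(data)
    n = len(values)
    trace = []
    for x in grid:
        # Number of data values inside the closed interval centered at x.
        count = bisect_right(values, x + h / 2) - bisect_left(values, x - h / 2)
        trace.append(count / (n * h))
    return trace


def violin_outline(data, percent=15.0, grid_size=101):
    """Return (grid, half_widths) describing one violin.

    percent   -- interval width h as a percentage of the data range
                 (the text recommends 10 to 40, with values near 15 often working well).
    grid_size -- number of equally spaced grid points (an arbitrary choice here).
    The half-widths are rescaled so that the widest point equals 1, the
    equal-maximum-height scaling whose drawback is discussed in Section 4.1.
    """
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data range is zero, so h cannot be derived from it")
    h = (percent / 100.0) * (hi - lo)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    trace = density_trace(data, grid, h)
    peak = max(trace)
    return grid, [d / peak for d in trace]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical bimodal sample: an equal mixture of two triangular
    # distributions with modes at -5 and 5 and range [-10, 10].
    sample = [
        rng.triangular(-10, 0, -5) if rng.random() < 0.5 else rng.triangular(0, 10, 5)
        for _ in range(10000)
    ]
    grid, widths = violin_outline(sample, percent=15.0)
    # A violin is the curve (-w, x) on the left mirrored by (+w, x) on the right;
    # here only one side is printed as a crude text profile.
    for x, w in list(zip(grid, widths))[::10]:
        print(f"{x:7.2f} " + "#" * round(20 * w))
```

Plotting the pairs (-w, x) and (+w, x) around a vertical box plot gives the violin shape. Dropping the division by `peak` would correspond to the uniform scaling that Section 4.1 mentions as a possible variation.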