Violin Plots: A Box Plot-Density Trace Synergism

Author(s): Jerry L. Hintze and Ray D. Nelson
Source: The American Statistician, Vol. 52, No. 2 (May, 1998), pp. 181-184
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2685478
Accessed: 02/09/2010 11:01

Statistical Computing and Graphics

Many modifications build on Tukey's original box plot. A proposed further adaptation, the violin plot, pools the best statistical features of alternative graphical representations of batches of data. It adds the information available from local density estimates to the basic summary statistics inherent in box plots. This marriage of summary statistics and density shape into a single plot provides a useful tool for data analysis and exploration.

KEY WORDS: Density estimation; Exploratory data analysis; Graphical techniques.

1. INTRODUCTION

Many different statistics and graphs summarize the characteristics of single batches of data. Descriptive statistics give information about location, scale, symmetry, and tail thickness. Other statistics and graphs investigate extreme observations or study the distribution of data values. Diagrams such as stem-leaf plots, dot plots, box plots, histograms, density traces, and probability plots give information about the distribution of values assumed by all observations.

The violin plot, introduced in this article, synergistically combines the box plot and the density trace (or smoothed histogram) into a single display that reveals structure found within the data. The introduction of this new graphical tool begins with a quick overview of the combination of the box plot and density trace into the violin plot. Then, three illustrations and examples show the advantages and challenges of violin plots in data summarization and exploration.

2. COMPONENT PARTS OF VIOLIN PLOTS

The violin plot, as depicted in Figure 1 and implemented in NCSS (1997) statistical software, combines the box plot and density trace into one diagram. The name violin plot originated because one of the first analyses that used the envisioned procedure resulted in a graphic with the appearance of a violin. Violin plots add information to the simple structure of the box plot that Tukey (1977) initially conceived. Although these original graphs are easily drawn with pencil and paper, computers ease subsequent modifications, refinements, and computation of box plots as discussed by McGill, Tukey, and Larsen (1978); Velleman and Hoaglin (1981); Chambers, Cleveland, Kleiner, and Tukey (1983); Frigge, Hoaglin, and Iglewicz (1989), and others.

Box plots show four main features about a variable: center, spread, asymmetry, and outliers. As an example, consider the box plot in Figure 1 for the data published by Hamermesh (1994). The ASA Statistical Graphics Section's 1995 Data Analysis Exposition analyzes these data, which report compensation of professors from all academic ranks in the United States. The labels in the diagram identify the principal lines and points which form the main structure of the traditional box plot diagram. As shown, the violin plot includes a box plot with two slight modifications. First, a circle replaces the median line which facilitates quick comparisons when viewing multiple groups. Second, outside points which are traditionally classified as mild and severe outliers, are not identified by individual symbols.

Figure 1. Common Components of Box Plot and Violin Plot. Total compensation for all academic ranks.

The density trace supplements traditional summary statistics by graphically showing the distributional characteristics of batches of data. One simple density estimator, the histogram, displays the distribution of data values along the real number line. Weaknesses of the histogram caused Tapia and Thompson (1979), Parzen (1979), Silverman (1986), Izenman (1991), and Scott (1992) to propose and summarize numerous alternative density estimators. One of these alternatives is the density trace described in Chambers, Cleveland, Kleiner, and Tukey (1983). Defining the location density d(x|h) at a point x as the fraction of the data values per unit of measurement that fall in an interval centered at x gives

$$d(x \mid h) = \frac{\sum_{i=1}^{n} \delta_i}{n h} \tag{1}$$

where n is the sample size, h is the interval width, and $\delta_i$ is one when the ith data value is in the interval [x - h/2, x + h/2] and zero otherwise. In order to plot the density trace, first select a value for h and then compute d(x|h) on a dense grid of equally spaced x values. Connect the d(x|h) by lines. The shape of the d(x|h) curve is essentially driven by the interval length, h. It is very smooth for large values of h, and "wiggly" for small values.

Unfortunately, several density traces shown side by side are difficult to compare. Contrasting the distributions of several batches of data, however, is a common task. In order to add information to the box plot and still make comparisons possible, Benjamini (1988) suggested "opening the box" of the box plot. He makes the width of the box proportional to the estimated density. The violin plot builds on the Benjamini proposal by combining the advantages of box plots with density traces.

The violin plot, as shown in Figure 1, combines the box plot with density traces. The density trace is plotted symmetrically to the left and the right of the (vertical) box plot. There is no difference in these density traces other than the direction in which they extend. Adding two density traces gives a symmetric plot which makes it easier to see the magnitude of the density. This hybrid of the density trace and the box plot allows quick and insightful comparison of several distributions.

3. SPECIFICATION OF INTERVAL WIDTH

As with other density estimators, achieving an acceptable density trace requires experience and judgment in determining the appropriate amount of smoothing. As with the selection of the bin width in the histogram, the interval width h, which is usually specified as a percentage of the data range, must be selected. Experience suggests that values near 15% of the data range often give good results. The choice of h, however, must be tempered by the size of the sample. The density trace is subject to the same sample size restrictions and challenges that apply to any density estimator. For small data sets, too small a value for h gives a wiggly density trace that suggest features that are simply artifacts of the individual data points. The oversmoothed density estimate that results from too large h values gives the illusion of knowing the shape of the distribution, while in reality the data set is too small for any conclusions. As a rule of thumb based on practice, the density trace tends to do a reasonable job with samples of at least 30 observations. Even with sample sizes of several hundred, however, choosing too large a value for h causes the density trace to oversmooth the data. In general, values of h greater than 40% of the range usually result in oversmoothed densities, while values less than 10 percent of the range result in undersmoothed densities. Hence, percentages between 10 and 40 percent are recommended.

4. ILLUSTRATIONS AND APPLICATIONS

With the addition of the density trace to the box plot, violin plots provide a better indication of the shape of the distribution. This includes showing the existence of clusters in data. The density trace highlights the peaks, valleys, and bumps in the distribution. Three applications and examples of violin plots illustrate these advantages. The first example demonstrates the ability of violin plots to distinguish among the shapes of known distributions. The second highlights

Jerry L. Hintze is President, NCSS, 329 North 1000 East, Kaysville, UT 84037 (E-mail: [email protected]). Ray D. Nelson is Associate Professor of Business Management, Marriott School of Management, Brigham Young University, Provo, UT 84602.
© 1998 American Statistical Association

Figure 2. Comparison of Box Plots and Violin Plots for Known Distributions. (a) Box plots; (b) violin plots. Groups shown in each panel: Bimodal, Uniform, Normal.

Figure 3. Additional Information in Violin Plots. Two examples from the density estimation literature: (a) annual snowfall for Buffalo, N.Y., 1910-1972; (b) Old Faithful eruption length.
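
The density trace of equation (1), together with the guidance that h is usually taken as a percentage of the data range, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (interval_width, dense_grid, density_trace, violin_outline), the 101-point grid, and the sample values are hypothetical choices made here, not part of the article or of the NCSS implementation.

```python
def interval_width(data, fraction=0.15):
    """Interval width h as a fraction of the data range.

    The default of 0.15 follows the article's remark that values near
    15% of the range often work well; 0.10 to 0.40 is the suggested band.
    """
    return fraction * (max(data) - min(data))


def dense_grid(data, points=101):
    """Equally spaced x values spanning the data range (grid size is an assumption)."""
    lo, hi = min(data), max(data)
    step = (hi - lo) / (points - 1)
    return [lo + k * step for k in range(points)]


def density_trace(data, grid, h):
    """Evaluate d(x|h) = (sum of delta_i) / (n * h) at every grid point x."""
    n = len(data)
    trace = []
    for x in grid:
        lo, hi = x - h / 2, x + h / 2
        # delta_i is 1 when the i-th value lies in [x - h/2, x + h/2], else 0
        count = sum(1 for value in data if lo <= value <= hi)
        trace.append(count / (n * h))
    return trace


def violin_outline(trace):
    """Mirror the trace to the left and right of the box plot axis."""
    return [(-d, d) for d in trace]


# Illustrative values only; they are not data from the article.
sample = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 6.0, 6.5, 7.0, 7.5]
h = interval_width(sample)
xs = dense_grid(sample)
outline = violin_outline(density_trace(sample, xs, h))
```

Connecting the left and right values of outline along xs gives the symmetric shape described in Section 2; a box plot drawn on the same axis would complete the display. With only ten observations this sample falls well below the rule of thumb of at least 30, so the resulting shape should not be read as evidence about the distribution.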
Recommended publications
  • If I Were Doing a Quick ANOVA Analysis
    ANOVA Note: If I were doing a quick ANOVA analysis (without the diagnostic checks, etc.), I’d do the following: 1) load the packages (#1); 2) do the prep work (#2); and 3) run the ggstatsplot::ggbetweenstats analysis (in the #6 section). 1. Packages needed. Here are the recommended packages to load prior to working. library(ggplot2) # for graphing library(ggstatsplot) # for graphing and statistical analyses (one-stop shop) library(GGally) # This package offers the ggpairs() function. library(moments) # This package allows for skewness and kurtosis functions library(Rmisc) # Package for calculating stats and bar graphs of means library(ggpubr) # normality related commands 2. Prep Work Declare factor variables as such. class(mydata$catvar) # this tells you how R currently sees the variable (e.g., double, factor) mydata$catvar <- factor(mydata$catvar) #Will declare the specified variable as a factor variable 3. Checking data for violations of assumptions: a) relatively equal group sizes; b) equal variances; and c) normal distribution. a. Group Sizes Group counts (to check group frequencies): table(mydata$catvar) b. Checking Equal Variances Group means and standard deviations (Note: the aggregate and by commands give you the same results): aggregate(mydata$intvar, by = list(mydata$catvar), FUN = mean, na.rm = TRUE) aggregate(mydata$intvar, by = list(mydata$catvar), FUN = sd, na.rm = TRUE) by(mydata$intvar, mydata$catvar, mean, na.rm = TRUE) by(mydata$intvar, mydata$catvar, sd, na.rm = TRUE) A simple bar graph of group means and CIs (using Rmisc package). This command is repeated further below in the graphing section. The ggplot command will vary depending on the number of categories in the grouping variable.
  • Healthy Volunteers (Retrospective Study) Urolithiasis Healthy Volunteers Group P-Value 5 (N = 110) (N = 157)
    Supplementary Table 1. The result of Gal3C-S-OPN between urolithiasis and healthy volunteers (retrospective study)
    Group | Urolithiasis (n = 110), median (IQR 1) | Healthy volunteers (n = 157), median (IQR 1) | p-Value 5
    Gal3C-S-OPN 2 (MFI 3) | 1118 (810–1543) | 515 (292–786) | <0.001
    uFL-OPN 4 (ng/mL/mg protein) | 14 (10–151) | 56392 (30270–115516) | <0.001
    Gal3C-S-OPN 2 /uFL-OPN 4 (MFI 3/uFL-OPN 4) | 52 (5.2–113.0) | 0.007 (0.003–0.020) | <0.001
    1 IQR, Interquartile range; 2 Gal3C-S-OPN, Gal3C-S lectin reactive osteopontin; 3 MFI, mean fluorescence intensity; 4 uFL-OPN, Urinary full-length-osteopontin; 5 p-Value, Mann–Whitney U-test.
    Supplementary Table 2. Sex-related difference between Gal3C-S-OPN and Gal3C-S-OPN normalized to uFL-OPN (retrospective study)
    Group | Urolithiasis, Male a (n = 61), median (IQR 1) | Urolithiasis, Female b (n = 49), median (IQR 1) | Healthy volunteers, Male c (n = 57), median (IQR 1) | Healthy volunteers, Female d (n = 100), median (IQR 1) | p-Value 5, a vs. b | p-Value 5, c vs. d
    Gal3C-S-OPN 2 (MFI 3) | 1216 (888–1581) | 972 (604–1529) | 518 (301–854) | 516 (278–781) | 0.15 | 0.28
    Gal3C-S-OPN 2 /uFL-OPN 4 (MFI 3/uFL-OPN 4) | 67 (9–120) | 42 (4–103) | 0.012 (0.003–0.042) | 0.006 (0.002–0.014) | 0.62 | 0.56
    1 IQR, Interquartile range; 2 Gal3C-S-OPN, Gal3C-S lectin reactive osteopontin; 3 MFI, mean fluorescence intensity; 4 uFL-OPN, Urinary full-length-osteopontin; 5 p-Value, Mann–Whitney U-test.
  • Statistical Analysis in JASP
    Copyright © 2018 by Mark A Goss-Sampson. All rights reserved. This book or any portion thereof may not be reproduced or used in any manner whatsoever without the express written permission of the author except for the purposes of research, education or private study.
    CONTENTS
    PREFACE
    USING THE JASP INTERFACE
    DESCRIPTIVE STATISTICS
    EXPLORING DATA INTEGRITY
    ONE SAMPLE T-TEST
    BINOMIAL TEST
    MULTINOMIAL TEST
    CHI-SQUARE ‘GOODNESS-OF-FIT’ TEST
    MULTINOMIAL AND Χ2 ‘GOODNESS-OF-FIT’ TEST.
  • Beanplot: a Boxplot Alternative for Visual Comparison of Distributions
    Beanplot: A Boxplot Alternative for Visual Comparison of Distributions Peter Kampstra VU University Amsterdam Abstract This introduction to the R package beanplot is a (slightly) modified version of Kampstra (2008), published in the Journal of Statistical Software. Boxplots and variants thereof are frequently used to compare univariate data. Boxplots have the disadvantage that they are not easy to explain to non-mathematicians, and that some information is not visible. A beanplot is an alternative to the boxplot for visual comparison of univariate data between groups. In a beanplot, the individual observations are shown as small lines in a one-dimensional scatter plot. Next to that, the estimated density of the distributions is visible and the average is shown. It is easy to compare different groups of data in a beanplot and to see if a group contains enough observations to make the group interesting from a statistical point of view. Anomalies in the data, such as bimodal distributions and duplicate measurements, are easily spotted in a beanplot. For groups with two subgroups (e.g., male and female), there is a special asymmetric beanplot. For easy usage, an implementation was made in R. Keywords: exploratory data analysis, descriptive statistics, box plot, boxplot, violin plot, density plot, comparing univariate data, visualization, beanplot, R, graphical methods, visualization. 1. Introduction There are many known plots that are used to show distributions of univariate data. There are histograms, stem-and-leaf-plots, boxplots, density traces, and many more. Most of these plots are not handy when comparing multiple batches of univariate data. For example, comparing multiple histograms or stem-and-leaf plots is difficult because of the space they take.
  • Chapter 6: Graphics
    Chapter 6 Graphics Modern computers with high resolution displays and graphic printers have revolutionized the visual display of information in fields ranging from computer-aided design, through flow dynamics, to the spatiotemporal attributes of infectious diseases. The impact on statistics is just being felt. Whole books have been written on statistical graphics and their contents are quite heterogeneous – simple how-to-do-it texts (e.g., ?; ?), reference works (e.g., Murrell, 2006) and generic treatment of the principles of graphic presentation (e.g., ?). There is even a web site devoted to learning statistics through visualization, http://www.seeingstatistics.com/. Hence, it is not possible to be comprehensive in this chapter. Instead, I focus on the types of graphics used most often in neuroscience (e.g., plots of means) and avoid those seldom used in the field (e.g., pie charts, geographical maps). There are four major purposes for statistical graphics. First, they are used to examine and screen data to check for abnormalities and to assess the distributions of the variables. Second, graphics are a very useful aid to exploratory data analysis (??). Exploratory data analysis, however, is used for mining large data sets mostly for the purpose of hypothesis generation and modification, so that use of graphics will not be discussed here. Third, graphics can be used to assess both the assumptions and the validity of a statistical model applied to data. Finally, graphics are used to present data to others. The third of these purposes will be discussed in the appropriate sections on the statistics. This chapter deals with the first and last purposes – examining data and presenting results.
  • Dynamics of Hydrological Model Parameters: Calibration and Reliability
    Dynamics of hydrological model parameters: calibration and reliability Tian Lan1, Kairong Lin1,2,3, Xuezhi Tan1,2,3, Chong-Yu Xu4, Xiaohong Chen1,2,3 1Center for Water Resources and Environment, Sun Yat-sen University, Guangzhou, 510275, China. 2Guangdong Engineering Technology Research Center of Water Security Regulation and Control for Southern China, Guangzhou 510275, China. 3School of Civil Engineering, Sun Yat-sen University, Guangzhou, 510275, China. 4Department of Geosciences, University of Oslo, P.O. Box 1047, Blindern, 0316 Oslo, Norway Correspondence to: Kairong Lin ([email protected]) Contents of this file 1 Case study and data 2 HYMOD model 3 SCE-UA 4 Violin plot 5 Multi-metric evaluation 6 Evaluation of sub-period calibration schemes in Mumahe basin and Xunhe basin 7 Convergence performance in Mumahe basin and Xunhe basin using ECP-VP References Introduction This supporting information includes eight sections that support the analysis. The 1 Case study and data section is used to support the 2 Background section in the main manuscript. The 2 HYMOD model section is used to support the 3.1 Sub-period calibration section in the main manuscript. The 3 SCE-UA algorithm section and 4 Violin plot section are used to support the 3.2 A tool for reliability evaluation section in the main manuscript. The 5 Multi-metric evaluation section and 6 Evaluation of sub-period calibration schemes in Mumahe basin and Xunhe basin section are used to account for 4.1 Evaluation of calibration schemes section in the main manuscript. The 7 Convergence performance in Mumahe basin and Xunhe basin using ECP-VP section is used to supplement 4.3 Evaluation of reliability section in the main manuscript.
  • Modern Statistics for Modern Biology
    Contents
    3 High Quality Graphics in R
    3.1 Goals for this Chapter
    3.2 Base R Plotting
    3.3 An Example Dataset
    3.4 ggplot2
    3.5 The Grammar of Graphics
    3.6 1D Visualisations
    3.6.1 Data tidying I – matrices versus data.frame
    3.6.2 Barplots
    3.6.3 Boxplots
    3.6.4 Violin plots
    3.6.5 Dot plots and beeswarm plots
    3.6.6 Density plots
    3.6.7 ECDF plots
    3.6.8 Transformations
    3.6.9 Data tidying II - wide vs long format
    3.7 2D Visualisation: Scatter Plots
    3.7.1 Plot shapes
    3.8 3–5D Data
    3.8.1 Faceting
    3.8.2 Interactive graphics
    3.9 Colour
    3.9.1 Heatmaps
    3.9.2 Colour spaces
    3.10 Data Transformations
    3.11 Saving Figures
    3.12 Biological Data with ggbio
    3.13 Recap of this Chapter
    3.14 Exercises
    6 Multiple Testing
    6.1 Goals for this Chapter
  • Moving Beyond P Values: Everyday Data Analysis with Estimation Plots
    bioRxiv preprint doi: https://doi.org/10.1101/377978; this version posted April 6, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license. Moving beyond P values: Everyday data analysis with estimation plots Joses Ho 1, Tayfun Tumkaya 1, 2, Sameer Aryal 1, 3, Hyungwon Choi 1, 4, Adam Claridge-Chang 1, 2, 5, 6 1. Institute for Molecular and Cell Biology, A*STAR, Singapore 138673 2. Department of Physiology, National University of Singapore, Singapore 3. Center for Neural Science, New York University, New York, NY, USA 4. Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 5. Program in Neuroscience and Behavioral Disorders, Duke-NUS Medical School, Singapore 6. Correspondence Introduction Over the past 75 years, a number of statisticians have advised that the data-analysis method known as null-hypothesis significance testing (NHST) should be deprecated (Berkson, 1942; Halsey et al., 2015; Wasserstein et al., 2019). The limitations of NHST have been extensively discussed, with a broad consensus that current statistical practice in the biological sciences needs reform. However, there is less agreement on reform’s specific nature, with vigorous debate surrounding what would constitute a suitable alternative (Altman et al., 2000; Benjamin et al., 2017; Cumming and Calin-Jageman, 2016). An emerging view is that a more complete analytic technique would use statistical graphics to estimate effect sizes and evaluate their uncertainty (Cohen, 1994; Cumming and Calin-Jageman, 2016).
  • A Machine Learning Framework for Computationally Expensive Transient Models Prashant Kumar1,4, Kushal Sinha 2,3*, Nandkishor K
    www.nature.com/scientificreports OPEN A machine learning framework for computationally expensive transient models Prashant Kumar1,4, Kushal Sinha 2,3*, Nandkishor K. Nere2,3, Yujin Shin1,5, Raimundo Ho1, Laurie B. Mlinar3 & Ahmad Y. Sheikh1 Transient simulations of dynamic systems, using physics-based scientific computing tools, are practically limited by availability of computational resources and power. While the promise of machine learning has been explored in a variety of scientific disciplines, its application in creation of a framework for computationally expensive transient models has not been fully explored. Here, we present an ensemble approach where one such computationally expensive tool, discrete element method, is combined with time-series forecasting via auto regressive integrated moving average and machine learning methods to simulate a complex pharmaceutical problem: development of an agitation protocol in an agitated filter dryer to ensure uniform solid bed mixing. This ensemble approach leads to a significant reduction in the computational burden, while retaining model accuracy and performance, practically rendering simulations possible. The developed machine-learning model shows good predictability and agreement with the literature, demonstrating its tremendous potential in scientific computing. Machine learning has emerged as one of the most promising technologies in the past decade due to its capability to provide valuable insights1 into vast amounts of data generated during the Internet era. Rapid democratization of machine learning tools has allowed for the successful adoption of the technology in a wide range of fields including robotics, computer vision2, speech and natural language processing3, autonomous driving4, neuroscience, drug-discovery5 and in fundamental sciences6. However, its application to computational sciences, and applied computational physics in general, has been limited.
  • Effect of Continuous Saline Bladder Irrigation with Concomitant Single Instillation of Chemotherapy After Transurethral Resecti
    MOLECULAR AND CLINICAL ONCOLOGY 13: 6, 2020 Effect of continuous saline bladder irrigation with concomitant single instillation of chemotherapy after transurethral resection on intravesical recurrence in patients with non-muscle-invasive bladder cancer KENJI KURODA, SHINSUKE TASAKI, AKINORI SATO, JUNICHI ASAKUMA, AKIO HORIGUCHI and KEIICHI ITO Department of Urology, National Defense Medical College, Tokorozawa, Saitama 359-8513, Japan Received April 29, 2019; Accepted June 1, 2020 DOI: 10.3892/mco.2020.2079 Abstract. A single immediate instillation of chemotherapy following transurethral resection of bladder tumor (TURBT) is effective in preventing intravesical recurrence (IVR) in patients with non-muscle-invasive urothelial bladder carcinoma (NMIBC). However, continuous saline bladder irrigation (CSBI) is also performed with a single instillation of chemotherapy (SIC), but its inhibitory effect on IVR remains unclear. In the present study, the effect of CSBI with concomitant SIC following TUR on IVR was evaluated in patients with NMIBC. A retrospective review of 253 patients who underwent TURBT and were clinically and histologically diagnosed with NMIBC at National Defense Medical College Hospital was performed. Doxorubicin (DXR) was administered to all patients. [...] the matched groups B and C (P=0.0255 and P=0.0023, respectively). In conclusion, SIC alone could provide a higher IVR-free survival rate than CSBI with DXR or CSBI with SIC. Introduction Bladder cancer can present in different pathological stages. Approximately 80% of all bladder cancers initially present as non-muscle-invasive bladder carcinoma (NMIBC) (1). Transurethral resection of bladder tumor (TURBT) is known as the gold standard therapeutic method for NMIBC; however, the recurrence rate ranges between 40 and 80% regardless of
  • Raincloud Plots: a Multi-Platform Tool for Robust Data Visualization
    Raincloud plots: a multi-platform tool for robust data visualization Micah Allen1, Davide Poggiali2,3, Kirstie Whitaker1,4, Tom Rhys Marshall5, Rogier Kievit6,7 1Department of Psychiatry, University of Cambridge, UK 2Department of Mathematics, University of Padova, Padova, Italy 3Padova Neuroscience Center, University of Padova, Padova, Italy 4Alan Turing Institute, London, UK 5Department of Experimental Psychology, University of Oxford, UK 6Department of Psychology, University of Cambridge, UK 7Max-Planck Centre for Computational Psychiatry and Aging, London/Berlin Correspondence should be addressed to Micah Allen, Cambridge Psychiatry: [email protected] Abstract Across scientific disciplines, there is a rapidly growing recognition of the need for more statistically robust, transparent approaches to data visualization. Complimentary to this, many scientists have realized the need for plotting tools that accurately and transparently convey key aspects of statistical effects and raw data with minimal distortion. Previously common approaches, such as plotting conditional mean or median barplots together with error-bars have been criticized for distorting effect size, hiding underlying patterns in the raw data, and obscuring the assumptions upon which the most commonly used statistical tests are based. Here we describe a data visualization approach which overcomes these issues, providing maximal statistical information while preserving the desired ‘inference at a glance’ nature of barplots and other similar visualization devices. These “raincloud plots” can visualize raw data, probability density, and key summary statistics such as median, mean, and relevant confidence intervals in an appealing and flexible format with minimal redundancy. 
In this tutorial paper we provide basic demonstrations of the strength of raincloud plots and similar approaches, outline potential modifications for their optimal use, and provide open-source code for their streamlined implementation in R, Python and Matlab (https://github.com/RainCloudPlots/RainCloudPlots).
  • Lecture Notes Data Mining and Exploration
    Lecture Notes Data Mining and Exploration Original 2017 version by Michael Gutmann Edited and expanded by Arno Onken Spring Semester 2020 May 16, 2020
    Contents
    1 First Steps in Exploratory Data Analysis
    1.1 Numerical Data Description
    1.1.1 Location
    1.1.2 Scale
    1.1.3 Shape
    1.1.4 Multivariate Measures
    1.2 Data Visualisation
    1.2.1 Bar Plot
    1.2.2 Box Plot
    1.2.3 Scatter Plot
    1.2.4 Histogram
    1.2.5 Kernel Density Plot
    1.2.6 Violin Plot
    1.3 Data Pre-Processing
    1.3.1 Standardisation
    1.3.2 Outlier Detection and Removal
    2 Principal Component Analysis
    2.1 PCA by Variance Maximisation
    2.1.1 First Principal Component Direction
    2.1.2 Subsequent Principal Component Directions
    2.1.3 Simultaneous Variance Maximisation
    2.2 PCA by Minimisation of Approximation Error
    2.3 PCA by Low Rank Matrix Approximation
    2.3.1 Approximating the Data Matrix
    2.3.2 Approximating the Sample Covariance Matrix
    2.3.3 Approximating the Gram Matrix
    2.4 Probabilistic PCA
    2.4.1 Probabilistic Model
    2.4.2 Joint, Conditional and Observation Distributions
    2.4.3 Maximum Likelihood
    2.4.4 Relation to PCA
    3 Dimensionality Reduction
    3.1 Linear Dimensionality Reduction
    3.1.1 From Data Points
    3.1.2 From Inner Products
    3.1.3 From Distances