Supplementary Materials


Tomic et al., SIMON, an automated machine learning system reveals immune signatures of influenza vaccine responses

Supplementary Materials:

Figure S1. Staining profiles and gating scheme of immune cell subsets analyzed using mass cytometry. Representative gating strategy for phenotype analysis of different blood-derived immune cell subsets analyzed using mass cytometry in the sample from one donor acquired before vaccination. In total, PBMC from 187 healthy donors were analyzed using the same gating scheme. Text above plots indicates the parent population, while arrows show the gating strategy defining major immune cell subsets (CD4+ T cells, CD8+ T cells, B cells, NK cells, Tregs, NKT cells, etc.).

Figure S2. Distribution of high and low responders included in the initial dataset. Distribution of individuals in groups of high (red, n=64) and low (grey, n=123) responders regarding the (A) CMV status, gender and study year. (B) Age distribution between high and low responders. Age is indicated in years.

Figure S3. Assays performed across different clinical studies and study years. Data from 5 different clinical studies (Study 15, 17, 18, 21 and 29) were included in the analysis. Flow cytometry was performed only in 2009; in other years, the phenotype of immune cells was determined by mass cytometry. Luminex (either 51/63-plex) was performed from 2008 to 2014. Finally, signaling capacity of immune cells was analyzed by phosphorylation cytometry (PhosphoFlow) on a mass cytometer in 2013 and on a flow cytometer in all other years.

Figure S4. Visualization of the initial dataset in the context of missing values. Heatmap showing the distribution of data in the initial dataset. Each row represents a unique donor, while each column is one feature. Missing values are shown in grey, while present values are shown in black.

Fig. S5. Performance evaluation of models built on datasets 13 and 36 after applying restriction filters. ROC curves are shown for all the models built on (A) dataset 36 and (B) dataset 13. Each model (classifier) is denoted in the color indicated in the graph legend.

Fig. S6. Heatmap of the correlation coefficients calculated between features from dataset 13. Heatmap shows the significant correlation coefficients between all the features from dataset 13 calculated using Pearson correlation (p<0.05). Non-significant values are shown as blank. The color of each circle follows the legend on the right side of the heatmap: red indicates values with negative correlation, while blue indicates values with positive correlation.

Fig. S7. Importance of features determined by Boruta. Boruta result plots for (A) dataset 13 and (B) dataset 36. Red boxplots represent the importance score of rejected features, while green boxplots show the minimal, average and maximum importance score for confirmed features. Blue boxplots show the importance score of a shadow feature. Yellow boxplots are tentative features.

Fig. S8. Heatmap of the correlation coefficients calculated between features from dataset 36. Heatmap shows the significant correlation coefficients between all the features from dataset 36 calculated using Pearson correlation (p<0.05). Non-significant values are shown as blank. The color of each circle follows the legend on the right side of the heatmap: red indicates values with negative correlation, while blue indicates values with positive correlation.

Tables S1-S19 are provided in a single Excel file

Data files S1-S2

Online Methods References

List of references of R packages used for Supplementary Table 6:

1. Algorithm: ada Package: ada - Mark Culp, Kjell Johnson and George Michailidis (2016). ada: The R Package Ada for Stochastic Boosting. R package version 2.0-5. https://CRAN.R-project.org/package=ada
2. Algorithm: AdaBag Package: adabag - Esteban Alfaro, Matias Gamez, Noelia Garcia (2013). adabag: An R Package for Classification with Boosting and Bagging. Journal of Statistical Software, 54(2), 1-35. URL http://www.jstatsoft.org/v54/i02/
3. Algorithm: AdaBoost.M1 Package: adabag - Esteban Alfaro, Matias Gamez, Noelia Garcia (2013). adabag: An R Package for Classification with Boosting and Bagging. Journal of Statistical Software, 54(2), 1-35. URL http://www.jstatsoft.org/v54/i02/
4. Algorithm: adaboost Package: fastAdaboost - Sourav Chatterjee (2016). fastAdaboost: a Fast Implementation of Adaboost. R package version 1.0.0. https://CRAN.R-project.org/package=fastAdaboost
5. Algorithm: amdai Package: adaptDA - Charles Bouveyron (2014). adaptDA: Adaptive Mixture Discriminant Analysis. R package version 1.0. https://CRAN.R-project.org/package=adaptDA
6. Algorithm: avNNet Package: nnet - Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
7. Algorithm: awnb Package: caret - Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2017). caret: Classification and Regression Training. R package version 6.0-76. https://CRAN.R-project.org/package=caret
8. Algorithm: bag Package: caret - Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2017). caret: Classification and Regression Training. R package version 6.0-76. https://CRAN.R-project.org/package=caret
9. Algorithm: awtan Package: caret - Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2017). caret: Classification and Regression Training. R package version 6.0-76. https://CRAN.R-project.org/package=caret
10. Algorithm: bagEarth Package: earth - Stephen Milborrow. Derived from mda:mars by Trevor Hastie and Rob Tibshirani. Uses Alan Miller's Fortran utilities with Thomas Lumley's leaps wrapper. (2017). earth: Multivariate Adaptive Regression Splines. R package version 4.5.0. https://CRAN.R-project.org/package=earth
11. Algorithm: bagEarthGCV Package: earth - Stephen Milborrow. Derived from mda:mars by Trevor Hastie and Rob Tibshirani. Uses Alan Miller's Fortran utilities with Thomas Lumley's leaps wrapper. (2017). earth: Multivariate Adaptive Regression Splines. R package version 4.5.0. https://CRAN.R-project.org/package=earth
12. Algorithm: bagFDA Package: earth - Stephen Milborrow. Derived from mda:mars by Trevor Hastie and Rob Tibshirani. Uses Alan Miller's Fortran utilities with Thomas Lumley's leaps wrapper. (2017). earth: Multivariate Adaptive Regression Splines. R package version 4.5.0. https://CRAN.R-project.org/package=earth
13. Algorithm: bagFDAGCV Package: earth - Stephen Milborrow. Derived from mda:mars by Trevor Hastie and Rob Tibshirani. Uses Alan Miller's Fortran utilities with Thomas Lumley's leaps wrapper. (2017). earth: Multivariate Adaptive Regression Splines. R package version 4.5.0. https://CRAN.R-project.org/package=earth
14. Algorithm: bam Package: mgcv - Wood, S.N. (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B) 73(1):3-36
15. Algorithm: bayesglm Package: arm - Andrew Gelman and Yu-Sung Su (2016). arm: Data Analysis Using Regression and Multilevel/Hierarchical Models. R package version 1.9-3. https://CRAN.R-project.org/package=arm
16. Algorithm: binda Package: binda - Sebastian Gibb and Korbinian Strimmer. (2015). binda: Multi-Class Discriminant Analysis using Binary Predictors. R package version 1.0.3. https://CRAN.R-project.org/package=binda
17. Algorithm: blackboost Package: party - Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.
18. Algorithm: C5.0 Package: C50 - Max Kuhn, Steve Weston, Nathan Coulter and Mark Culp. C code for C5.0 by R. Quinlan (2015). C50: C5.0 Decision Trees and Rule-Based Models. R package version 0.1.0-24. https://CRAN.R-project.org/package=C50
19. Algorithm: C5.0Rules Package: C50 - Max Kuhn, Steve Weston, Nathan Coulter and Mark Culp. C code for C5.0 by R. Quinlan (2015). C50: C5.0 Decision Trees and Rule-Based Models. R package version 0.1.0-24. https://CRAN.R-project.org/package=C50
20. Algorithm: C5.0Tree Package: C50 - Max Kuhn, Steve Weston, Nathan Coulter and Mark Culp. C code for C5.0 by R. Quinlan (2015). C50: C5.0 Decision Trees and Rule-Based Models. R package version 0.1.0-24. https://CRAN.R-project.org/package=C50
21. Algorithm: cforest Package: party - Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.
22. Algorithm: chaid Package: CHAID - The FoRt Student Project Team (2015). CHAID: CHi-squared Automated Interaction Detection. R package version 0.1-2.
23. Algorithm: ctree Package: party - Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework.