The VGAM Package, Which Is Ca- Ter of the Publication Will Be Largely Unchanged

Total Page:16

File Type:pdf, Size:1020Kb

The VGAM Package, Which Is Ca- Ter of the Publication Will Be Largely Unchanged News The Newsletter of the R Project Volume 8/2, October 2008 Editorial by John Fox pable of fitting a dizzying array of statistical mod- els. Paul Murrell shows how the compare package I had the opportunity recently to examine the devel- is used to locate similarities and differences among opment of the R Project. A dramatic index of this de- non-identical objects, and explains how the package velopment is the rapid growth in R packages, which may be applied to grading students’ assignments. is displayed in Figure 1; because the vertical axis in Arthur Allignol, Jan Beyersmann, and Martin Schu- this graph is on a log scale, the least-squares linear- macher describe the mvna package for the Nelson- regression line represents exponential growth. Aalen estimator of cumulative transition hazards in This issue of R News is another kind of testament multistate models. Finally, in the latest installment of to the richness and diversity of applications pro- the Programmers’ Niche column, Vince Carey reveals grammed in R: Hadley Wickham, Michael Lawrence, the mind-bending qualities of recursive computation Duncan Temple Lang, and Deborah Swayne describe in R without named self-reference. the rggobi package, which provides an interface to The “news” portion of this issue of R Rnews in- GGobi, software for dynamic and interactive sta- cludes a report on the recent useR! conference in tistical graphics. Carsten Dormann, Bernd Gruber, Dortmund, Germany; announcements of useR! 2009 and Jochen Fründ introduce the bipartite package in Rennes, France, and of DSC 2009 in Copenhagen, for examining interactions among species in ecolog- Denmark, both to be held in July; news from the R ical communities. Ionnis Kosmidis explains how to Foundation and from the Bioconductor Project; the use the profileModel package to explore likelihood latest additions to CRAN; and changes in the re- surfaces and construct confidence intervals for pa- cently released version 2.8.0 of R. rameters of models fit by maximum-likelihood and Unless we accumulate enough completed articles closely related methods. Ingo Feinerer illustrates to justify a third issue in 2008, this will be my last is- how the tm (“text-mining”) package is employed for sue as editor-in-chief of R News. Barring a third issue the analysis of textual data. Yihui Xie and Xiaoyne in 2008, this is also the last issue of R News — but Cheng demonstrate the construction of statistical an- don’t panic! R News will be reborn in 2009 as The R imations in R with the animation package. Thomas Journal. Moreover, the purpose and general charac- Yee introduces the VGAM package, which is ca- ter of the publication will be largely unchanged. We Contents of this issue: Comparing Non-Identical Objects . 40 mvna: An R Package for the Nelson–Aalen Es- Editorial . .1 timator in Multistate Models . 48 An Introduction to rggobi ............3 Programmers’ Niche: The Y of R . 51 Introducing the bipartite Package: Analysing Changes in R Version 2.8.0 . 54 Ecological Networks . .8 Changes on CRAN . 60 The profileModel R Package: Profiling Objec- News from the Bioconductor Project . 69 tives for Models with Linear Predictors. 12 useR! 2008 . 71 An Introduction to Text Mining in R . 19 Forthcoming Events: useR! 2009 . 73 animation: A Package for Statistical Animations 23 Forthcoming Events: DSC 2009 . 74 The VGAM Package . 28 R Foundation News . 74 Vol. 8/2, October 2008 2 think that the new name better reflects the direction R Version in which R News has evolved, and in particular the 1.3 1.4 1.5 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 quality of the refereed articles that we publish. 1400 1427 1300 During my tenure as an R News editor, I’ve had 1200 1000 1000 the pleasure of working with Vince Carey, Peter Dal- 911 800 gaard, Torsten Hothorn, and Paul Murrell. I look for- 739 647 ward to working next year with Heather Turner, who 600 548 is joining the editorial board in 2009, which will be 500 my last year on the board. 400 406 357 300 John Fox 273 McMaster University Number of CRAN Packages 219 200 Canada 162 [email protected] 129 110 100 2001−06−21 2001−12−17 2002−06−12 2003−05−27 2003−11−16 2004−06−05 2004−10−12 2005−06−18 2005−12−16 2006−05−31 2006−12−12 2007−04−12 2007−11−16 2008−03−18 Date Figure 1: The number of R packages on CRAN has grown exponentially since R version 1.3 in 2001. Source of Data: https://svn.r-project.org/ R/branches/. R News ISSN 1609-3631 Vol. 8/2, October 2008 3 An Introduction to rggobi Hadley Wickham, Michael Lawrence, outliers. It is customary to look at uni- or bivariate Duncan Temple Lang, Deborah F Swayne plots to look for uni- or bivariate outliers, but higher- dimensional outliers may go unnoticed. Looking for these outliers is easy to do with the tour. Open your Introduction data with GGobi, change to the tour view, and se- The rggobi (Temple Lang and Swayne, 2001) pack- lect all the variables. Watch the tour and look for age provides a command-line interface to GGobi, an points that are far away or move differently from the interactive and dynamic graphics package. Rggobi others—these are outliers. complements GGobi’s graphical user interface by en- Adding more data sets to an open GGobi is also g$mtcars2 <- mtcars abling fluid transitions between analysis and explo- easy: will add another data ration and by automating common tasks. It builds on set named “mtcars2”. You can load any file type the first version of rggobi to provide a more robust that GGobi recognises by passing the path to that file. ggobi_find_file and user-friendly interface. In this article, we show In conjunction with , which locates how to use ggobi and offer some examples of the in- files in the GGobi installation directory, this makes it sights that can be gained by using a combination of easy to load GGobi sample data. This example loads analysis and visualisation. the olive oils data set included with GGobi: This article assumes some familiarity with library(rggobi) GGobi. A great deal of documentation, both ggobi(ggobi_find_file("data", "olive.csv")) introductory and advanced, is available on the GGobi web site, http://www.ggobi.org; newcom- ers to GGobi might find the demos at ggobi.org/ Modifying observation-level docs especially helpful. The software is there attributes, or “automatic brushing” as well, both source and executables for several platforms. Once you have installed GGobi, you Brushing is typically thought of as an operation per- can install rggobi and its dependencies using in- formed in a graphical user interface. In GGobi, it stall.packages("rggobi", dep=T). refers to changing the colour and symbol (glyph type This article introduces the three main compo- and size) of points. It is typically performed in a nents of rggobi, with examples of their use in com- linked environment, in which changes propagate to mon tasks: every plot in which the brushed observations are • getting data into and out of GGobi; displayed. in GGobi, brushing includes shadowing, where points sit in the background and have less vi- • modifying observation-level attributes (“auto- sual impact, and exclusion, where points are com- matic brushing”); pleted excluded from the plot. Using rggobi, brush- ing can be performed from the command line; we • basic plot control. think of this as “automatic brushing.” The following We will also discuss some advanced techniques such functions are available: as creating animations with GGobi, the use of edges, • change glyph colour with glyph_colour (or and analysing longitudinal data. Finally, a case study glyph_color) shows how to use rggobi to create a visualisation for a statistical algorithm: manova. • change glyph size with glyph_size • change glyph type with glyph_type Data • shadow and unshadow points with shadowed • exclude and include points with excluded Getting data from R into GGobi is easy: g <- ggobi(mtcars). This creates a GGobi object Each of these functions can be used to get or set called g. Getting data out isn’t much harder: Just the current values for the specified GGobiData. The index that GGobi object by position (g[[1]]) or by “getters” are useful for retrieving information that name (g[["mtcars"]] or g$mtcars). These return you have created while brushing in GGobi, and the GGobiData objects which are linked to the data in “setters” can be used to change the appearance of GGobi. They act just like regular data frames, except points based on model information, or to create an- that changes are synchronised with the data in the imations. They can also be used to store, and then corresponding GGobi. You can get a static copy of later recreate, the results of a complicated sequence the data using as.data.frame. of brushing steps. Once you have your data in GGobi, it’s easy to This example demonstrates the use of do something that was hard before: find multivariate glyph_colour to show the results of clustering the R News ISSN 1609-3631 Vol. 8/2, October 2008 4 infamous Iris data using hierachical clustering. Us- Name Variables ing GGobi allows us to investigate the clustering in 1D Plot 1 X the original dimensions of the data. The graphic XY Plot 1 X, 1 Y shows a single projection from the grand tour. 1D Tour n X Rotation 1 X, 1 Y, 1 Z 2D Tour n X g <- ggobi(iris) 2x1D Tour n X, n Y clustering <- hclust(dist(iris[,1:4]), Scatterplot Matrix n X method="average") Parallel Coordinates Display n X glyph_colour(g[1]) <- cuttree(clustering, 3) Time Series 1 X, n Y Barchart 1 X After creating a plot you can get and set the dis- played variables using the variable and variable<- methods.
Recommended publications
  • Book of Abstracts
    Book of Abstracts June 27, 2015 1 Conference Sponsors Diamond Sponsor Platinum Sponsors Gold Sponsors Silver Sponsors Open Analytics Bronze Sponsors Media Sponsors 2 Conference program Time Tuesday Wednesday Thursday Friday 08:00 Registration opens Registration opens Registration opens Registration opens 08:30 – 09:00 Opening session (by Rector peR! M. Johansen, Aalborg University) Aalborghallen 09:00 – 10:00 Romain François Di Cook Thomas Lumley Aalborghallen Aalborghallen Aalborghallen 10:00 – 10:30 Coffee break Coffee break Coffee break (15 min) ee break Sponsored by Quantide Sponsored by Alteryx ff Session 1 Session 4 10:30 – 12:00 Sponsor session (10:15) Kaleidoscope 1 Kaleidoscope 4 Aalborghallen Aalborghallen Aalborghallen incl. co Morning Tutorials DataRobot Ecology Medicine Gæstesalen Gæstesalen RStudio Teradata Networks Regression Musiksalen Musiksalen Revolution Analytics Reproducibility Commercial Offerings alteryx Det Lille Teater Det Lille Teater TIBCO H O Interfacing Interactive graphics 2 Radiosalen Radiosalen HP 12:00 – 13:00 Sandwiches Lunch (standing buffet) Lunch (standing buffet) Break: 12:00 – 12:30 Sponsored by Sponsored by TIBCO ff Revolution Analytics Ste en Lauritzen (12:30) Aalborghallen Session 2 Session 5 13:00 – 14:30 13:30: Closing remarks Kaleidoscope 2 Kaleidoscope 5 Aalborghallen Aalborghallen 13:45: Grab ’n go lunch 14:00: Conference ends Case study Teaching 1 Gæstesalen Gæstesalen Clustering Statistical Methodology 1 Musiksalen Musiksalen ee break Data Management Machine Learning 1 ff Det Lille Teater Det Lille
    [Show full text]
  • Cloud Computing with E-Science Applications
    TERZO Information Technology • MOSSUCCA Cloud Computing with e-Science Applications The amount of data in everyday life has been exploding. This data increase has been especially signicant in scientic elds, where substantial amounts of data must be captured, communicated, aggregated, stored, and analyzed. Cloud Computing with e-Science Applications explains how cloud computing can improve data management in data-heavy elds such as bioinformatics, earth science, and computer science. Cloud Computing The book begins with an overview of cloud models supplied by the National Institute of Standards and Technology (NIST), and then: • Discusses the challenges imposed by big data on scientic data infrastructures, including security and trust issues • Covers vulnerabilities such as data theft or loss, privacy concerns, infected applications, threats in virtualization, and cross-virtual machine attack with • Describes the implementation of workows in clouds, proposing an architecture composed of two layers—platform and application • Details infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), e-Science Applications and software-as-a-service (SaaS) solutions based on public, private, and hybrid cloud computing models • Demonstrates how cloud computing aids in resource control, vertical and horizontal scalability, interoperability, and adaptive scheduling Featuring signicant contributions from research centers, universities, Cloud Computing and industries worldwide, Cloud Computing with e-Science Applications presents innovative cloud
    [Show full text]
  • User! 2020 Tutorials – Afternoon Session
    useR! 2020 Tutorials – Afternoon Session How green was my valley - Spatial analytics with PostgreSQL, PostGIS, R, and PL/R by J. Conway This tutorial combines the use of PostgreSQL with PostGIS and R spatial packages. It covers data ingestion including cleaning and transformation, and basic analysis, of vector and raster based geospatial data from open sources. Learning outcomes: At the end of the tutorial, the participants will: • learn the advantages and disadvantages of using PostgreSQL, PostGIS, R, and PL/R to process geospatial data; • perform a basic installation and setup of the technology stack, including PostgreSQL, PostGIS, R, and PL/R; • gain experience on the data cleaning, subsetting, and ingestion process for that data; • perform several simply types of visualization and analysis on the ingested data. Requirements: The audience should be comfortable with R, SQL, and spatial data at an intermediate level. The attendees need only a modern OS/web browser. The hands-on exercises will be done via Katacoda. Creating beautiful data visualization in R: a ggplot2 crash course by S. Tyner Learning ggplot2 brings joy and “aha moments” to new R users, keeping them more engaged and eager to grow their R skills. Newer R users will be and feel more empowered with data visualization skills. In addition to experiencing joy in creating beautiful graphics, advanced R users will learn to take advantage of ggplot2’s elegant defaults, saving time on manual plotting tasks like drawing legends. Thus, time and energy can be spent on advanced analyses,
    [Show full text]
  • Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation
    J Archaeol Method Theory (2017) 24:424–450 DOI 10.1007/s10816-015-9272-9 Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation Ben Marwick1,2 Published online: 7 January 2016 # Springer Science+Business Media New York 2016 Abstract The use of computers and complex software is pervasive in archaeology, yet their role in the analytical pipeline is rarely exposed for other researchers to inspect or reuse. This limits the progress of archaeology because researchers cannot easily reproduce each other’sworktoverifyorextendit.Fourgeneralprinciplesofrepro- ducible research that have emerged in other fields are presented. An archaeological case study is described that shows how each principle can be implemented using freely available software. The costs and benefits of implementing reproducible research are assessed. The primary benefit, of sharing data in particular, is increased impact via an increased number of citations. The primary cost is the additional time required to enhance reproducibility, although the exact amount is difficult to quantify. Keywords Reproducible research . Computer programming . Software engineering . Australian archaeology. Open science Introduction Archaeology, like all scientific fields, advances through rigorous tests of previously published studies. When numerous investigations are performed by different re- searchers and demonstrate similar results, we hold these results to be a reasonable approximation of a true account of past human behavior. This
    [Show full text]
  • R Studio R Code Execution Error
    R Studio R Code Execution Error When I try to open a freshly-installed RStudio, I am getting the following problem (and a blank application window): ERROR r error 4 (R code execution error). When I installed RRO and click Update packages in RStudio, an error message box says R code execution error and the console says. Error in. When starting up RStudio on OS X, you may see an error similar to the following: ERROR: r error 4 (R code execution error) (errormsg=Error in identical(call((1L)). Shadow graphics device error: r error 4 (R code execution error) like the following ones with no results (of course, I restarted Rstudio after any instalation). Inside Rstudio I would determine whether the error is limited to a package or a particular repository. Next test your R installation outside of the Rstudio e.. Introduction to the R language Installation and Configuration R Studio executed results of our code as well as any error messages in the execution of our code. This pane also shows the history of the R code execution which has been. R Studio R Code Execution Error Read/Download 2015 21:20:22 (rsession-cmohan) ERROR r error 4 (R code execution error) Happens in both 32 and 64 bit R, as well as in RStudio and base R. I started. This content is adapted from the RStudio R Markdown - Dynamic Documents of R, Markdown basics and R code chunks tutorials. is used to name the code chunk, useful for following the code execution and debugging. (r or errors are to be displayed with (r, message=TRUE, warning=TRUE, error=TRUE) or FALSE.
    [Show full text]
  • Mastering Spark with R the Complete Guide to Large-Scale Analysis and Modeling
    Mastering Spark with R The Complete Guide to Large-Scale Analysis and Modeling Javier Luraschi, Kevin Kuo & Edgar Ruiz Foreword by Matei Zaharia Mastering Spark with R The Complete Guide to Large-Scale Analysis and Modeling Javier Luraschi, Kevin Kuo, and Edgar Ruiz Beijing Boston Farnham Sebastopol Tokyo Mastering Spark with R by Javier Luraschi, Kevin Kuo, and Edgar Ruiz Copyright © 2020 Javier Luraschi, Kevin Kuo, and Edgar Ruiz. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Acquisition Editor: Jonathan Hassell Indexer: Judy McConville Development Editor: Melissa Potter Interior Designer: David Futato Production Editor: Elizabeth Kelly Cover Designer: Karen Montgomery Copyeditor: Octal Publishing, LLC Illustrator: Rebecca Demarest Proofreader: Rachel Monaghan October 2019: First Edition Revision History for the First Release 2019-10-04: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492046370 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Mastering Spark with R, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work.
    [Show full text]
  • Package 'Drake'
    Package ‘drake’ December 11, 2018 Title A Pipeline Toolkit for Reproducible Computation at Scale Description A general-purpose computational engine for data analysis, drake rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date. Not every execution starts from scratch, there is native support for parallel and distributed computing, and completed projects have tangible evidence that they are reproducible. Extensive documentation, from beginner-friendly tutorials to practical examples and more, is available at the reference website <https://ropensci.github.io/drake/> and the online manual <https://ropenscilabs.github.io/drake-manual/>. Version 6.2.1 License GPL-3 URL https://github.com/ropensci/drake BugReports https://github.com/ropensci/drake/issues Depends R (>= 3.3.0) Imports codetools, digest, formatR, igraph, methods, pkgconfig, rlang (>= 0.2.0), storr (>= 1.1.0), tibble, utils Suggests abind, bindr, callr, clustermq, CodeDepends, crayon, datasets, DBI, downloader, future, future.apply, grDevices, ggplot2, ggraph, knitr, lubridate, networkD3, parallel, prettycode, RSQLite, stats, testthat, tidyselect (>= 0.2.4), txtq, rmarkdown, styler, visNetwork, webshot VignetteBuilder knitr Encoding UTF-8 Language en-US RoxygenNote 6.1.1 NeedsCompilation no 1 2 R topics documented: Author William Michael Landau [aut, cre] (<https://orcid.org/0000-0003-1878-3253>), Alex Axthelm [ctb], Jasper Clarkberg [ctb], Kirill Müller [ctb], Ben Marwick [rev], Peter Slaughter [rev], Ben Bond-Lamberty [ctb] (<https://orcid.org/0000-0001-9525-4633>), Eli Lilly and Company [cph] Maintainer William Michael Landau <[email protected]> Repository CRAN Date/Publication 2018-12-10 23:20:03 UTC R topics documented: drake-package .
    [Show full text]
  • Business Analytics in R Introduction to Statistical Programming
    Business Analytics in R Introduction to Statistical Programming TIMOTHY WONG1 JAMES GAMMERMAN2 July 15, 2019 [email protected] [email protected] ii Contents 1 R Ecosystem 3 1.1 Programming Enviornment . .3 1.2 Packages . .5 2 Programming Concepts 7 2.1 Vector . .8 2.2 Character and Datetime . .9 2.3 Factor . .9 2.4 Logical Operator . 10 2.5 Special Numbers . 10 2.6 List . 11 2.7 Data Frame and Tibble . 11 2.8 Function . 13 2.9 Flow Control . 14 2.9.1 If-Else . 14 2.9.2 While . 15 2.9.3 For . 16 2.10 Apply . 16 iii iv CONTENTS 3 Data Transformation 19 3.1 Filtering . 22 3.2 Sorting . 24 3.3 Subsetting Variables . 24 3.4 Compute Variables . 26 3.5 Summarising . 27 4 Regression Models 31 4.1 Linear Regression . 31 4.2 Poisson Regression . 42 4.3 Logistic Regression . 45 5 Tree-based Methods 47 5.1 Decision Trees . 48 5.2 Random Forest . 50 6 Neural Networks 53 6.1 Multilayer Perceptron . 55 7 Time Series Analysis 61 7.1 Auto-Correlation Function . 61 7.2 Decomposition . 66 7.3 ARIMA Model . 69 8 Survival Analysis 75 8.1 Kaplan-Meier Estimator . 75 8.2 Cox Proportional Hazards Model . 77 CONTENTS v 9 Unsupervised Learning 81 9.1 K-means Clustering . 82 9.2 Hierarchical Clustering . 85 10 Extending R 93 10.1 R Markdown . 93 10.1.1 R Notebook . 97 10.2 Shiny Web Application . 97 10.3 Writing Packages . 102 10.4 Reproducibility .
    [Show full text]
  • Sparkr: Scaling R Programs with Spark
    SparkR: Scaling R Programs with Spark Shivaram Venkataraman1, Zongheng Yang1, Davies Liu2, Eric Liang2, Hossein Falaki2 Xiangrui Meng2, Reynold Xin2, Ali Ghodsi2, Michael Franklin1, Ion Stoica1;2, Matei Zaharia2;3 1AMPLab UC Berkeley, 2 Databricks Inc., 3 MIT CSAIL ABSTRACT support for reading input from a variety of systems including R is a popular statistical programming language with a number of HDFS, HBase, Cassandra and a number of formats like JSON, Par- extensions that support data processing and machine learning tasks. quet, etc. Integrating with the data source API enables R users to However, interactive data analysis in R is usually limited as the R directly process data sets from any of these data sources. runtime is single threaded and can only process data sets that fit in Performance Improvements: As opposed to a new distributed en- a single machine’s memory. We present SparkR, an R package that gine, SparkR can inherit all of the optimizations made to the Spark provides a frontend to Apache Spark and uses Spark’s distributed computation engine in terms of task scheduling, code generation, computation engine to enable large scale data analysis from the R memory management [3], etc. shell. We describe the main design goals of SparkR, discuss how SparkR is built as an R package and requires no changes to the high-level DataFrame API enables scalable computation and R. The central component of SparkR is a distributed data frame present some of the key details of our implementation. that enables structured data processing with a syntax familiar to R users [31](Figure 1).
    [Show full text]
  • Large-Scale Parallel Statistical Forecasting Computations in R
    Large-Scale Parallel Statistical Forecasting Computations in R Murray Stokely∗ Farzan Rohani∗ Eric Tassone∗ Abstract We demonstrate the utility of massively parallel computational infrastructure for statistical computing using the MapReduce paradigm for R. This framework allows users to write com- putations in a high-level language that are then broken up and distributed to worker tasks in Google datacenters. Results are collected in a scalable, distributed data store and returned to the interactive user session. We apply our approach to a forecasting application that fits a variety of models, prohibiting an analytical description of the statistical uncertainty associated with the overall forecast. To overcome this, we generate simulation-based uncer- tainty bands, which necessitates a large number of computationally intensive realizations. Our technique cut total run time by a factor of 300. Distributing the computation across many machines permits analysts to focus on statistical issues while answering questions that would be intractable without significant parallel computational infrastructure. We present real-world performance characteristics from our application to allow practitioners to better understand the nature of massively parallel statistical simulations in R. Key Words: Statistical Computing, R, forecasting, timeseries, parallelism 1. Introduction Large-scale statistical computing has become widespread at Internet companies in recent years, and the rapid growth of available data has increased the importance of scaling the tools for data analysis. Significant progress has been made in design- ing distributed systems to take advantage of massive clusters of shared machines for long-running batch jobs, but the development of higher-level abstractions and tools for interactive statistical analysis using this infrastructure has lagged.
    [Show full text]
  • Compiler and Runtime Techniques for Optimizing Dynamic Scripting Languages
    c 2015 by Haichuan Wang. All rights reserved. COMPILER AND RUNTIME TECHNIQUES FOR OPTIMIZING DYNAMIC SCRIPTING LANGUAGES BY HAICHUAN WANG DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2015 Urbana, Illinois Doctoral Committee: Professor David A. Padua, Chair, Director of Research Professor Vikram Adve Professor Wen-Mei W. Hwu Doctor Peng Wu, Huawei Abstract This thesis studies the compilation and runtime techniques to improve the performance of dynamic scripting languages using R programming language as a test case. The R programming language is a convenient system for statistical computing. In this era of big data, R is becoming increasingly popular as a powerful data analytics tool. But the performance of R limits its usage in a broader context. The thesis introduces a classification of R programming styles into Looping over data(Type I), Vector programming(Type II), and Glue codes(Type III), and identified the most serious overhead of R is mostly manifested in Type I R codes. It pro- poses techniques to improve the performance R. First, it uses interpreter level specialization to do object allocation removal and path length reduction, and evaluates its effectiveness for GNU R VM. The approach uses profiling to translate R byte-code into a specialized byte-code to improve running speed, and uses data representation specialization to reduce the memory allocation and usage. Secondly, it proposes a lightweight approach that reduces the interpretation overhead of R through vectorization of the widely used Apply class of operations in R.
    [Show full text]
  • Speaking Serial R with a Parallel Accent
    Version 0.3-0 Programming with Big Data in R Speaking Serial R with a Parallel Accent Package Examples and Demonstrations pbdR Core Team Speaking Serial R with a Parallel Accent (Ver. 0.3-0) pbdR Package Examples and Demonstrations September 25, 2015 Drew Schmidt Business Analytics and Statistics University of Tennessee Wei-Chen Chen pbdR Core Team George Ostrouchov Computer Science and Mathematics Division Oak Ridge National Laboratory Pragneshkumar Patel National Institute for Computational Sciences University of Tennessee Version 0.3-0 c 2012–2015 pbdR Core Team. All rights reserved. Permission is granted to make and distribute verbatim copies of this vignette and its source provided the copyright notice and this permission notice are preserved on all copies. This manual may be incorrect or out-of-date. The authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein. Cover art from Join, or Die (Franklin, 1754). Illustrations were created using the ggplot2 package (Wickham, 2009), native R functions, and Microsoft Powerpoint. This publication was typeset using LATEX. Contents List of Figures ........................................ v List of Tables ......................................... vii Acknowledgements ...................................... viii Note About the Cover .................................... ix Disclaimer ........................................... 1 I Preliminaries 2 1 Introduction 3 1.1 What is pbdR? ..................................... 3
    [Show full text]