The R Journal Volume 4/1, June 2012

A peer-reviewed, open-access publication of the R Foundation for Statistical Computing

Contents

Editorial ...... 3

Contributed Research Articles
Analysing Seasonal Data ...... 5
MARSS: Multivariate Autoregressive State-space Models for Analyzing Time-series Data ...... 11
openair – Data Analysis Tools for the Air Quality Community ...... 20
Foreign Library Interface ...... 30
Vdgraph: A Package for Creating Variance Dispersion Graphs ...... 41
xgrid and R: Parallel Distributed Processing Using Heterogeneous Groups of Apple Computers ...... 45
maxent: An R Package for Low-memory Multinomial Logistic Regression with Support for Semi-automated Text Classification ...... 56
Sumo: An Authenticating Web Application with an Embedded R Session ...... 60

From the Core
Who Did What? The Roles of R Package Authors and How to Refer to Them ...... 64

News and Notes
Changes in R ...... 70
Changes on CRAN ...... 80
R Foundation News ...... 96

The R Journal is a peer-reviewed publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are licensed under the Creative Commons Attribution 3.0 Unported license (CC BY 3.0, http://creativecommons.org/licenses/by/3.0/).

Prospective authors will find detailed and up-to-date submission instructions on the Journal’s homepage.

Editor-in-Chief: Martyn Plummer

Editorial Board: Heather Turner, , and Deepayan Sarkar

Editor Programmer’s Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Editor Book Reviews:
G. Jay Kerns
Department of Mathematics and Statistics
Youngstown State University
Youngstown, Ohio 44555-0002
USA
[email protected]

R Journal Homepage: http://journal.r-project.org/

Email of editors and editorial board: [email protected]

The R Journal is indexed/abstracted by EBSCO, DOAJ.

The R Journal Vol. 4/1, June 2012 ISSN 2073-4859 3

Editorial

by Martyn Plummer

Earlier this week I was at the “Premières Rencontres R” in Bordeaux, the first francophone meeting of R users (although I confess that I gave my talk in English). This was a very enjoyable experience, and not just because it coincided with the Bordeaux festival. The breadth and quality of the talks were both comparable with the International UseR! conferences. It was another demonstration of the extent to which R is used in diverse disciplines.

As R continues to expand into new areas, we see the development of packages designed to address the specific needs of different user communities. This issue of The R Journal contains articles on two such packages. Karl Ropkins and David Carslaw describe how the design of the openair package was influenced by the need to make an accessible set of tools available for people in the air quality community who are not experts in R. Elizabeth Holmes and colleagues introduce the MARSS package for multivariate autoregressive state-space models. Such models are used in many fields, but the MARSS package was motivated by the particular needs of researchers in the natural and environmental sciences.

There are two articles on the use of R outside the desktop computing environment. Sarah Anoke and colleagues describe the use of the Apple Xgrid system to do distributed computing on Mac OS X computers, and Timothy Bergsma and Michael Smith demonstrate the Sumo web application, which includes an embedded R session.

Two articles are of particular interest to developers. Daniel Adler demonstrates a Foreign Function Interface for R to call arbitrary native functions without the need for C wrapper code. There is also an article from our occasional series “From the Core”, designed to highlight ideas and methods in the words of R development core team members. This article by Kurt Hornik, Duncan Murdoch and Achim Zeileis explains the new facilities for handling bibliographic information introduced in R 2.12.0 and R 2.14.0. I strongly encourage package authors to read this article and implement the changes to their packages. Doing so will greatly facilitate the bibliometric analyses and investigations of the social structure of the R community.

Elsewhere in this issue, we have articles describing the season package for displaying and analyzing seasonal data, which is a companion to the book Analysing Seasonal Data by Adrian Barnett and Annette Dobson; the Vdgraph package for drawing variance dispersion graphs; and the maxent package, which allows fast multinomial logistic regression with a low memory footprint and is designed for applications in text classification.

The R Journal continues to be the journal of record for the R project. The usual end matter summarizes recent changes to R itself and on CRAN. It is worth spending some time browsing these sections in order to catch up on changes you may have missed, such as the new CRAN Task View on differential equations, an area whose importance was highlighted in the editorial for volume 2, issue 2.

On behalf of the editorial board, I hope you enjoy this issue.



Analysing Seasonal Data

by Adrian G Barnett, Peter Baker and Annette J Dobson

Abstract: Many common diseases, such as the flu and cardiovascular disease, increase markedly in winter and dip in summer. These seasonal patterns have been part of life for millennia and were first noted in ancient Greece by both Hippocrates and Herodotus. Recent interest has focused on climate change, and the concern that seasons will become more extreme with harsher winter and summer weather. We describe a set of R functions designed to model seasonal patterns in disease. We illustrate some simple descriptive and graphical methods, a more complex method that is able to model non-stationary patterns, and the case-crossover to control for seasonal confounding.

In this paper we illustrate some of the functions of the season package (Barnett et al., 2012), which contains a range of functions for analysing seasonal health data. We were motivated by the great interest in seasonality found in the health literature, and the relatively small number of seasonal tools in R (or other software packages). The existing seasonal tools in R are:

• the baysea function of the timsac package and the decompose and stl functions of the stats package for decomposing a time series into a trend and season;

• the dynlm function of the dynlm package and the ssm function of the sspir package for fitting dynamic linear models with optional seasonal components;

• the arima function of the stats package and the Arima function of the forecast package for fitting seasonal components as part of an autoregressive integrated moving average (ARIMA) model; and

• the bfast package for detecting breaks in a seasonal pattern.

These tools are all useful, but most concern decomposing equally spaced time series data. Our package includes models that can be applied to seasonal patterns in unequally spaced data. Such data are common in observational studies when the timing of responses cannot be controlled (e.g. for a postal survey).

In the health literature much of the analysis of seasonal data uses simple methods such as comparing rates of disease by month or using a cosinor regression model, which assumes a sinusoidal seasonal pattern. We have created functions for these simple, but often very effective, analyses, as we describe below.

More complex seasonal analyses examine non-stationary seasonal patterns that change over time. Changing seasonal patterns in health are currently of great interest as global warming is predicted to make seasonal changes in the weather more extreme. Hence there is a need for statistical tools that can estimate whether a seasonal pattern has become more extreme over time or whether its phase has changed. Ours is also the first R package that includes the case-crossover, a useful method for controlling for seasonality.

This paper illustrates just some of the functions of the season package. We show some descriptive functions that give simple means or plots, and functions whose goal is inference based on generalised linear models. The package was written as a companion to a book on seasonal analysis by Barnett and Dobson (2010), which contains further details on the statistical methods and R code.

Analysing monthly seasonal patterns

Seasonal time series are often based on data collected every month. An example that we use here is the monthly number of cardiovascular disease deaths in people aged ≥ 75 years in Los Angeles for the years 1987–2000 (Samet et al., 2000). Before we examine or plot the monthly death rates we need to make them more comparable by adjusting them to a common month length (Barnett and Dobson, 2010, Section 2.2.1). Otherwise January (with 31 days) will likely have more deaths than February (with 28 or 29).

In the example below the monthmean function is used to create the variable mmean, which is the monthly average rate of cardiovascular disease deaths standardised to a month length of 30 days. As the data set contains the population size (pop) we can also standardise the rates to the number of deaths per 100,000 people. The highest death rate is in January (397 per 100,000) and the lowest in July (278 per 100,000).

> data(CVD)
> mmean = monthmean(data = CVD,
    resp = CVD$cvd, adjmonth = "thirty",
    pop = pop/100000)
> mmean
    Month  Mean
  January 396.8
 February 360.8
    March 327.3
    April 311.9


      May 294.9
     June 284.5
     July 277.8
   August 279.2
September 279.1
  October 292.3
 November 313.3
 December 368.5

Plotting monthly data

We can plot these standardised means in a circular plot using the plotCircular function:

> plotCircular(area1 = mmean$mean, dp = 1,
    labels = month.abb, scale = 0.7)

This produces the circular plot shown in Figure 1. The numbers under each month are the adjusted averages, and the area of each segment is proportional to this average.

Figure 1: A circular plot of the adjusted monthly mean number of cardiovascular deaths in Los Angeles in people aged ≥ 75, 1987–2000.

The peak in the average number of deaths is in January, and the low is six months later in July, indicating an annual seasonal pattern. If there were no seasonal pattern we would expect the averages in each month to be equal, and so the plot would be perfectly circular. The seasonal pattern is somewhat non-symmetric, as the decrease in deaths from January to July does not mirror the seasonal increase from July to January. This is because the increase in deaths does not start in earnest until October.

Circular plots are also useful when we have an observed and expected number of observations in each month. As an example, Figure 2 shows the number of Australian Football League players by their month of birth (for the 2009 football season) and the expected number of births per month based on national data. For this example we did not adjust for the unequal number of days in the months because we can compare the observed numbers to the expected (which are also based on unequal month lengths). Using the expected numbers also shows any seasonal pattern in the national birth numbers. In this example there is a very slight decrease in births in November and December.

Figure 2: A circular plot of the monthly number of Australian Football League players by their month of birth (white segments) and the expected numbers based on national data for men born in the same period (grey segments). Australian born players in the 2009 football season.

The figure shows the greater than expected number of players born in January to March, and the fewer than expected born in August to December. The numbers around the outside are the observed number of players. The code to create this plot is:

> data(AFL)
> plotCircular(area1 = AFL$players,
    area2 = AFL$expected, scale = 0.72,
    labels = month.abb, dp = 0, lines = TRUE,
    auto.legend = list(
      labels = c("Obs", "Exp"),
      title = "# players"))

The key difference from the code to create the previous circular plot is that we have given values for both area1 and area2. The ‘lines = TRUE’ option added the dotted lines between the months. We have also included a legend.

As well as a circular plot we also recommend a time series plot for monthly data, as these plots are useful for highlighting the consistency in the seasonal pattern and possibly also the secular trend and

any unusual observations. For the cardiovascular example data a time series plot is created using

> plot(CVD$yrmon, CVD$cvd, type = 'o',
    pch = 19,
    ylab = 'Number of CVD deaths per month',
    xlab = 'Time')

The result is shown in Figure 3. The January peak in CVD was clearly larger in 1992 and 1994 compared with 1991, 1993 and 1995. There also appears to be a slight downward trend from 1987 to 1992.

Figure 3: Monthly number of cardiovascular deaths in Los Angeles for people aged ≥ 75, 1987–2000.

Modelling monthly data

A simple and popular statistical model for examining seasonal patterns in monthly data is to use a simple linear regression model (or generalised linear model) with a categorical variable of month. The code below fits just such a model to the cardiovascular disease data and then plots the rate ratios (Figure 4).

> mmodel = monthglm(formula = cvd ~ 1,
    data = CVD, family = poisson(),
    offsetpop = pop/100000,
    offsetmonth = TRUE, refmonth = 7)
> plot(mmodel)

As the data are counts we used a Poisson model. We adjusted for the unequal number of days in the month by using an offset (offsetmonth = TRUE), which divides the number of deaths in each month by the number of days in each month to give a daily rate. The reference month was set to July (refmonth = 7). We could have added other variables to the model, by adding them to the right hand side of the equation (e.g. ‘formula = cvd ~ year’ to include a linear trend for year).

The plot in Figure 4 shows the mean rate ratios and 95% confidence intervals. The dotted horizontal reference line is at the rate ratio of 1. The mean rate of deaths in January is 1.43 times the rate in July. The rates in August and September are not statistically significantly different to the rates in July, as the confidence intervals in these months both cross 1.

Figure 4: Mean rate ratios and 95% confidence intervals of cardiovascular disease deaths using July as a reference month.

Cosinor

The previous model assumed that the rate of cardiovascular disease varied arbitrarily in each month with no smoothing of or correlation between neighbouring months. This is an unlikely assumption for this seasonal pattern (Figure 4). The advantage of using arbitrary estimates is that it does not constrain the shape of the seasonal pattern. The disadvantage is a potential loss of statistical power. Models that assume some parametric seasonal pattern will have greater power when the parametric model is correct. A popular parametric seasonal model is the cosinor model (Barnett and Dobson, 2010, Chapter 3), which is based on a sinusoidal pattern,

    s_t = A cos(2πt/c − P),  t = 1, ..., n,

where A is the amplitude of the sinusoid and P is its phase, c is the length of the seasonal cycle (e.g. c = 12 for monthly data with an annual seasonal pattern), t is the time of each observation and n is the total number of times observed. The amplitude tells us the size of the seasonal change and the phase tells us where it peaks. The sinusoid assumes a smooth seasonal pattern that is symmetric about its peak (so the rate of the seasonal increase in disease is equal to the decrease). We fit the cosinor as part of a generalised linear model.
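Under the hood, a cosinor fit of this kind reduces to an ordinary GLM, because A cos(2πt/c − P) expands into separate cosine and sine terms. The base-R sketch below illustrates that reduction; it is not the season package’s implementation, and the function name fit_cosinor and the simulated data are hypothetical.

```r
# Illustrative base-R sketch of a cosinor fit (not code from season):
# A*cos(2*pi*t/c - P) = b1*cos(2*pi*t/c) + b2*sin(2*pi*t/c),
# so a GLM with cosine and sine covariates recovers
# A = sqrt(b1^2 + b2^2) and P = atan2(b2, b1).
fit_cosinor <- function(y, t, cycle = 12) {
  cos_t <- cos(2 * pi * t / cycle)
  sin_t <- sin(2 * pi * t / cycle)
  fit <- glm(y ~ cos_t + sin_t)           # gaussian family by default
  b <- coef(fit)
  amplitude <- sqrt(b["cos_t"]^2 + b["sin_t"]^2)
  phase <- atan2(b["sin_t"], b["cos_t"])  # radians; peak at t = phase * cycle / (2 * pi)
  list(amplitude = unname(amplitude), phase = unname(phase))
}

# Simulated monthly series: amplitude 50, peaking in month 1 (phase = pi/6)
set.seed(42)
t <- 1:168
y <- 300 + 50 * cos(2 * pi * t / 12 - pi / 6) + rnorm(168, sd = 5)
res <- fit_cosinor(y, t)
```

A Poisson cosinor of the kind fitted to count data only changes the family argument of the GLM; the cosine and sine expansion is the same.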


The example code below fits a cosinor model to the cardiovascular disease data. The results are for each month, so we used the ‘type = 'monthly'’ option with ‘date = month’.

> res = cosinor(cvd ~ 1, date = month,
    data = CVD, type = 'monthly',
    family = poisson(), offsetmonth = TRUE)
> summary(res)
Cosinor test
Number of observations = 168
Amplitude = 232.34 (absolute scale)
Phase: Month = 1.3
Low point: Month = 7.3
Significant seasonality based on adjusted
significance level of 0.025 = TRUE

We again adjusted for the unequal number of days in the months using an offset (offsetmonth = TRUE). The amplitude is 232 deaths, which has been given on the absolute scale, and the phase is estimated as 1.27 months (early January).

An advantage of these cosinor models is that they can be fitted to unequally spaced data. The example code below fits a cosinor model to data from a randomised controlled trial of physical activity with data on body mass index (BMI) at baseline (Eakin et al., 2009). Subjects were recruited as they became available and so the measurement dates are not equally spaced. In the example below we test for a sinusoidal seasonal pattern in BMI.

> data(exercise)
> res = cosinor(bmi ~ 1, date = date,
    type = 'daily', data = exercise,
    family = gaussian())
> summary(res)
Cosinor test
Number of observations = 1152
Amplitude = 0.3765669
Phase: Month = November, day = 18
Low point: Month = May, day = 19
Significant seasonality based on adjusted
significance level of 0.025 = FALSE

Body mass index has an amplitude of 0.38 kg/m² which peaks on 18 November, but this increase is not statistically significant. In this example we used ‘type = 'daily'’ as subjects’ results related to a specific date (‘date = date’ specifies the day when they were measured). Thus the phase for body mass index is given on a scale of days, whereas the phase for cardiovascular death was given on a scale of months.

Non-stationary cosinor

The models illustrated so far have all assumed a stationary seasonal pattern, meaning a pattern that does not change from year to year. However, seasonal patterns in disease may gradually change because of changes in an important exposure. For example, improvements in housing over the 20th century are part of the reason for a decline in the winter peak in mortality in London (Carson et al., 2006).

To fit a non-stationary cosinor we expand the previous sinusoidal equation thus

    s_t = A_t cos(2πt/c − P_t),  t = 1, ..., n,

so that both the amplitude and phase of the cosinor are now dependent on time. The key unknown is the extent to which these parameters will change over time. Using our nscosinor function the user has some control over the amount of change, and a number of different models can be tested assuming different levels of change. The final model should be chosen using model fit diagnostics and residual checks (available in the seasrescheck function).

The nscosinor function uses the Kalman filter to decompose the time series into a trend and seasonal components (West and Harrison, 1997, Chapter 8), so can only be applied to equally spaced time series data. The code below fits a non-stationary sinusoidal model to the cardiovascular disease data (using the counts adjusted to the average month length, adj).

> nsmodel = nscosinor(data = CVD,
    response = adj, cycles = 12, niters = 5000,
    burnin = 1000, tau = c(10, 500), inits = 1)

The model uses Markov chain Monte Carlo (MCMC) sampling, so we needed to specify the number of iterations (niters), the number discarded as a burn-in (burnin), and an initial value for each seasonal component (inits). The cycles gives the frequency of the sinusoid in units of time, in this case a seasonal pattern that completes a cycle in 12 months. We can fit multiple seasonal components; for example, 6 and 12 month seasonal patterns would be fitted using ‘cycles = c(6,12)’. The tau are smoothing parameters, with tau[1] for the trend, tau[2] for the first seasonal parameter, and tau[3] for the second seasonal parameter. They are fixed values that scale the time between observations. Larger values allow more time between observations and hence create a more flexible spline. The ideal values for tau should be chosen using residual checking and trial and error.

The estimated seasonal pattern is shown in Figure 5. The mean amplitude varies from around 230 deaths (winter 1989) to around 180 deaths (winter 1995), so some winters were worse than others. Importantly the results did not show a steady decline in amplitude, so over this period seasonal deaths continued to be a problem despite any improvements in health care or housing. However, the residuals from this model do show a significant seasonal pattern (checked using the seasrescheck function). This residual seasonal pattern is caused because the

seasonal pattern in cardiovascular deaths is non-sinusoidal (as shown in Figure 1), with a sharper increase in deaths than decline. The model assumed a sinusoidal pattern, albeit a non-stationary one. A better fit might be achieved by adding a second seasonal cycle at a shorter frequency, such as 6 months.

Figure 5: Estimated non-stationary seasonal pattern in cardiovascular disease deaths for Los Angeles, 1987–2000. Mean (black line) and 95% confidence interval (grey lines).

Case-crossover

In some circumstances seasonality is not the focus of investigation, but is important because its effects need to be taken into account. This could be because both the outcome and the exposure have an annual seasonal pattern, but we are interested in associations at a different frequency (e.g. daily).

The case-crossover can be used for individual-level data, e.g. when the data are individual cases with their date of heart attack and their recent exposure. However, we are concerned with regularly spaced time-series data, where the data are grouped, e.g. the number of heart attacks on each day in a year.

The case-crossover is a useful time series method for controlling for seasonality (Maclure, 1991). It is similar to the matched case-control design, where the exposure of cases with the disease is compared with one or more matched controls without the disease. In the case-crossover, cases act as their own control, since exposures are compared on case and control days (also known as index and referent days). The case day will be the day on which an event occurred (e.g. death), and the control days will be nearby days in the same season as the exposure but with a possibly different exposure. This means the cases and controls are matched for season, but not for some other short-term change in exposure such as air pollution or temperature. A number of different case-crossover designs for time-series data have been proposed. We used the time-stratified method as it is a localisable and ignorable design that is free from overlap bias, while other referent window designs that are commonly used in the literature (e.g. symmetric bi-directional) are not (Janes et al., 2005). Using this design the data is broken into a number of fixed strata (e.g. 28 days or the months of the year) and the case and control days are compared within the same strata.

The code below applies a case-crossover model to the cardiovascular disease data. In this case we use the daily cardiovascular disease data (with the number of deaths on every day) rather than the data used above, which used the number of cardiovascular deaths in each month. The independent variables are mean daily ozone (o3mean, which we first scale to a 10 unit increase) and temperature (tmpd). We also control for day of the week (using Sunday as the reference category). For this model we are interested in the effect of day-to-day changes in ozone on the day-to-day changes in mortality.

> data(CVDdaily)
> CVDdaily$o3mean = CVDdaily$o3mean / 10
> cmodel = casecross(cvd ~ o3mean + tmpd +
    Mon + Tue + Wed + Thu + Fri + Sat,
    data = CVDdaily)
> summary(cmodel, digits = 2)
Time-stratified case-crossover with a stratum
length of 28 days
Total number of cases 230695
Number of case days with available control
days 5114
Average number of control days per case day 23.2

Parameter Estimates:
          coef exp(coef) se(coef)     z Pr(>|z|)
o3mean -0.0072      0.99  0.00362 -1.98  4.7e-02
tmpd    0.0024      1.00  0.00059  4.09  4.3e-05
Mon     0.0323      1.03  0.00800  4.04  5.3e-05
Tue     0.0144      1.01  0.00808  1.78  7.5e-02
Wed    -0.0146      0.99  0.00807 -1.81  7.0e-02
Thu    -0.0118      0.99  0.00805 -1.46  1.4e-01
Fri     0.0065      1.01  0.00806  0.81  4.2e-01
Sat     0.0136      1.01  0.00788  1.73  8.4e-02

The default stratum length is 28, which means that cases and controls are compared in blocks of 28 days. This stratum length should be short enough to remove any seasonal pattern in ozone and temperature. Ozone is formed by a reaction between other air pollutants and sunlight and so is strongly seasonal with a peak in summer. Cardiovascular mortality is at its lowest in summer as warmer temperatures lower blood pressures and prevent flu outbreaks. So without removing these seasonal patterns we might find a significant negative association between ozone and mortality. The above results suggest a marginally significant negative association between ozone and mortality, as the odds ratio for a ten unit increase in ozone is exp(−0.0072) = 0.993 (p-value = 0.047). This may indicate that we have not sufficiently controlled for season and so should reduce the stratum length using the stratalength option.

As well as matching cases and controls by stratum, it is also possible to match on another confounder. The code below shows a case-crossover model that matched case and control days by a mean temperature of ±1 degrees Fahrenheit.

> mmodel = casecross(cvd ~ o3mean +
    Mon + Tue + Wed + Thu + Fri + Sat,
    matchconf = 'tmpd', confrange = 1,
    data = CVDdaily)
> summary(mmodel, digits = 2)
Time-stratified case-crossover with a stratum
length of 28 days
Total number of cases 205612
Matched on tmpd plus/minus 1
Number of case days with available control
days 4581
Average number of control days per case day 5.6

Parameter Estimates:
         coef exp(coef) se(coef)    z Pr(>|z|)
o3mean 0.0046         1   0.0043 1.07  2.8e-01
Mon    0.0461         1   0.0094 4.93  8.1e-07
Tue    0.0324         1   0.0095 3.40  6.9e-04
Wed    0.0103         1   0.0094 1.10  2.7e-01
Thu    0.0034         1   0.0093 0.36  7.2e-01
Fri    0.0229         1   0.0094 2.45  1.4e-02
Sat    0.0224         1   0.0092 2.45  1.4e-02

By matching on temperature we have restricted the number of available control days, so there are now only an average of 5.6 control days per case, compared with 23.2 days in the previous example. Also there are now only 4581 case days with at least one control day available, compared with 5114 days for the previous analysis. So 533 days have been lost (and 25,083 cases), and these are most likely the days with unusual temperatures that could not be matched to any other days in the same stratum. We did not use temperature as an independent variable in this model, as it has been controlled for by the matching. The odds ratio for a ten unit increase in ozone is now positive (OR = exp(0.0046) = 1.005), although not statistically significant (p-value = 0.28). It is also possible to match cases and control days by the day of the week using the ‘matchdow = TRUE’ option.

Bibliography

A. G. Barnett, P. Baker, and A. J. Dobson. season: Seasonal analysis of health data, 2012. URL http://CRAN.R-project.org/package=season. R package version 0.3-1.

A. G. Barnett and A. J. Dobson. Analysing Seasonal Health Data. Springer, 2010.

C. Carson, S. Hajat, B. Armstrong, and P. Wilkinson. Declining vulnerability to temperature-related mortality in London over the 20th Century. Am J Epidemiol, 164(1):77–84, 2006.

E. Eakin, M. Reeves, S. Lawler, N. Graves, B. Oldenburg, C. DelMar, K. Wilke, E. Winkler, and A. Barnett. Telephone counseling for physical activity and diet in primary care patients. Am J of Prev Med, 36(2):142–149, 2009.

H. Janes, L. Sheppard, and T. Lumley. Case-crossover analyses of air pollution exposure data: Referent selection strategies and their implications for bias. Epidemiology, 16(6):717–726, 2005.

M. Maclure. The case-crossover design: a method for studying transient effects on the risk of acute events. Am J Epidemiol, 133(2):144–153, 1991.

J. Samet, F. Dominici, S. Zeger, J. Schwartz, and D. Dockery. The national morbidity, mortality, and air pollution study, part I: Methods and methodologic issues. 2000.

M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer Series in Statistics. Springer, New York; Berlin, 2nd edition, 1997.

Adrian Barnett
School of Public Health
Queensland University of Technology
Australia
[email protected]

Peter Baker
School of Population Health
University of Queensland
Australia

Annette Dobson
School of Population Health
University of Queensland
Australia


MARSS: Multivariate Autoregressive State-space Models for Analyzing Time-series Data

by Elizabeth E. Holmes, Eric J. Ward, Kellie Wills

Abstract: MARSS is a package for fitting multivariate autoregressive state-space models to time-series data. The MARSS package implements state-space models in a maximum likelihood framework. The core functionality of MARSS is based on likelihood maximization using the Kalman filter/smoother, combined with an EM algorithm. To make comparisons with other packages available, parameter estimation is also permitted via direct search routines available in ’optim’. The MARSS package allows data to contain missing values and allows a wide variety of model structures and constraints to be specified (such as fixed or shared parameters). In addition to model-fitting, the package provides bootstrap routines for simulating data and generating confidence intervals, and multiple options for calculating model selection criteria (such as AIC).

The MARSS package (Holmes et al., 2012) is an R package for fitting linear multivariate autoregressive state-space (MARSS) models with Gaussian errors to time-series data. This class of model is extremely important in the study of linear stochastic dynamical systems, and these models are used in many different fields, including economics, engineering, genetics, physics and ecology. The model class has different names in different fields; some common names are dynamic linear models (DLMs) and vector autoregressive (VAR) state-space models. There are a number of existing R packages for fitting this class of models, including sspir (Dethlefsen et al., 2009) for univariate data and dlm (Petris, 2010), dse (Gilbert, 2009), KFAS (Helske, 2011) and FKF (Luethi et al., 2012) for multivariate data. Additional packages are available on other platforms, such as SsfPack (Durbin and Koopman, 2001), EViews (www.eviews.com) and Brodgar (www.brodgar.com). Except for Brodgar and sspir, these packages provide maximization of the likelihood surface (for maximum-likelihood parameter estimation) via quasi-Newton or Nelder-Mead type algorithms. The MARSS package was developed to provide an alternative maximization algorithm, based instead on an Expectation-Maximization (EM) algorithm, and to provide a standardized model-specification framework for fitting different model structures.

The MARSS package was originally developed for researchers analyzing data in the natural and environmental sciences, because many of the problems often encountered in these fields are not commonly encountered in disciplines like engineering or finance. Two typical problems are high fractions of irregularly spaced missing observations and observation error variance that cannot be estimated or known a priori (Schnute, 1994). Packages developed for other fields did not always allow estimation of the parameters of interest to ecologists because these parameters are always fixed in the package authors’ field or application. The MARSS package was developed to address these issues and its three main differences are summarized as follows.

First, maximum-likelihood optimization in most packages for fitting state-space models relies on quasi-Newton or Nelder-Mead direct search routines, such as provided in optim (for dlm) or nlm (for dse). Multidimensional state-space problems often have complex, non-linear likelihood surfaces. For certain types of multivariate state-space models, an alternative maximization algorithm exists; though generally slower for most models, the EM algorithm (Metaxoglou and Smith, 2007; Shumway and Stoffer, 1982) is considerably more robust than direct search routines (this is particularly evident with large amounts of missing observations). To date, no R package for the analysis of multivariate state-space models has implemented the EM algorithm for maximum-likelihood parameter estimation (sspir implements it for univariate models). In addition, the MARSS package implements an EM algorithm for constrained parameter estimation (Holmes, 2010) to allow fixed and shared values within parameter matrices. To our knowledge, this constrained EM algorithm is not implemented in any package, although the Brodgar package implements a limited version for dynamic factor analysis.

Second, model specification in the MARSS package has a one-to-one relationship to a MARSS model as written in matrix form on paper. Any model that can be written in MARSS form can be fitted without extra code by the user. In contrast, other packages require users to write unique functions in matrix form (a non-trivial task for many non-expert R users). For example, while dlm includes linear and polynomial univariate models, multivariate regression is not readily accessible without these custom functions; in MARSS, all models written in matrix form are fitted using the same model specification.

The R Journal Vol. 4/1, June 2012 ISSN 2073-4859 12 CONTRIBUTED RESEARCH ARTICLES

The MARSS package also allows degenerate multivariate models to be fitted, which means that some or all observation or process variances can be set to 0. This allows users to include deterministic features in their models and to rewrite models with longer lags or moving averaged errors as a MARSS model.

Third, the MARSS package provides algorithms for computing model selection criteria that are specific to state-space models. Model selection criteria are used to quantify the data support for different model and parameter structures by balancing the ability of the model to fit the data against the flexibility of the model. The criteria computed in the MARSS package are based on Akaike's Information Criterion (AIC). Models with the lowest AIC are interpreted as receiving more data support. While AIC and its small-sample corrected version AICc are easily calculated for fixed effects models (these are standard output for lm and glm, for instance), these criteria are biased for hierarchical or state-space autoregressive models. The MARSS package provides an unbiased AIC criterion via innovations bootstrapping (Cavanaugh and Shumway, 1997; Stoffer and Wall, 1991) and parametric bootstrapping (Holmes, 2010). The package also provides functions for approximate and bootstrap confidence intervals, and bias correction for estimated parameters.

The package comes with an extensive user guide that introduces users to the package and walks the user through a number of case studies involving ecological data. The selection of case studies includes estimating trends with univariate and multivariate data, making inferences concerning spatial structure with multi-location data, estimating inter-species interaction strengths using multi-species data, using dynamic factor analysis to reduce the dimension of a multivariate dataset, and detecting structural breaks in data sets. Though these examples have an ecological focus, the analysis of multivariate time-series models is cross-disciplinary work and researchers in other fields will likely benefit from these examples.

The MARSS model

The MARSS model includes a process model and an observation model. The process component of a MARSS model is a multivariate first-order autoregressive (MAR-1) process. The multivariate process model takes the form

  x_t = B x_{t-1} + u + w_t;   w_t ~ MVN(0, Q)   (1)

The x is an m × 1 vector of state values, equally spaced in time, and B, u and Q are the state process parameters. The m × m matrix B allows interaction between state processes; the diagonal elements of B can also be interpreted as coefficients of auto-regression in the state vectors through time. The vector u describes the mean trend or mean level (depending on the B structure), and the correlation of the process deviations is determined by the structure of the matrix Q. In version 2.x of the MARSS package, the B and Q parameters are time-invariant; however, in version 3.x, all parameters can be time-varying, and the u can also be moved into the x term to specify models with an auto-regressive trend process (called a stochastic level model).

Written out for two state processes, the MARSS process model would be:

\[
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_t =
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_{t-1} +
\begin{bmatrix} u_1 \\ u_2 \end{bmatrix} +
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}_t, \quad
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}_t \sim
\mathrm{MVN}\!\left(0, \begin{bmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{bmatrix}\right)
\]

Some of the parameter elements will generally be fixed to ensure identifiability. Within the MAR-1 form, MAR-p (autoregressive models with lag-p) and models with exogenous factors (covariates) can be included by properly defining the B, x and Q elements. See for example Tsay (2010), where the formulation of various time-series models as MAR-1 models is covered.

The initial state vector is specified at t = 0 or t = 1 as

  x_0 ~ MVN(π, Λ)  or  x_1 ~ MVN(π, Λ)   (2)

where π is the mean of the initial state distribution or a constant. Λ specifies the variance of the initial states, or is specified as 0 if the initial state is treated as fixed (but possibly unknown).

The multivariate observation component in a MARSS model is expressed as

  y_t = Z x_t + a + v_t;   v_t ~ MVN(0, R)   (3)

where y_t is an n × 1 vector of observations at time t, Z is an n × m matrix, a is an n × 1 matrix, and the correlation structure of observation errors is specified with the matrix R. Including Z and a is not required for every model, but these parameters are used when some state processes are observed multiple times (perhaps with different scalings) or when observations are linear combinations of the state processes (such as in dynamic factor analysis). Note that time steps in the model are equidistant, but there is no requirement that there be an observation at every time step; y may have missing values scattered throughout it.

Written out for three observation processes and two state processes, the MARSS observation model would be

\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}_t =
\begin{bmatrix} z_{11} & z_{12} \\ z_{21} & z_{22} \\ z_{31} & z_{32} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_t +
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} +
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}_t, \quad
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}_t \sim
\mathrm{MVN}\!\left(0, \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}\right)
\]
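As a concrete illustration of the process (Equation 1) and observation (Equation 3) equations, the model can be simulated directly in base R. This is a sketch only: the parameter values and dimensions (m = 2 states, n = 3 observations, matching the written-out example above) are invented for the illustration and are not estimates from the paper.

```r
# Simulate a MARSS model: x_t = B x_{t-1} + u + w_t and y_t = Z x_t + a + v_t.
# All numeric values below are illustrative choices, not package defaults.
set.seed(1)
TT <- 50                                  # number of time steps
B  <- matrix(c(0.8, 0, 0, 0.9), 2, 2)     # state interaction matrix (m x m)
u  <- matrix(c(0.1, -0.1), 2, 1)          # mean trend / level
Q  <- diag(0.01, 2)                       # process error variance-covariance
Z  <- matrix(c(1, 0, 1, 0, 1, 0), 3, 2)   # maps 2 states to 3 observations
a  <- matrix(c(0, 0, 0.5), 3, 1)          # observation-level offsets
R  <- diag(0.04, 3)                       # observation error variance-covariance

x <- matrix(NA, 2, TT)                    # state trajectories (m x T)
y <- matrix(NA, 3, TT)                    # observations (n x T)
x[, 1] <- c(0, 0)                         # initial state treated as fixed at 0
for (t in 2:TT) {
  w <- t(chol(Q)) %*% rnorm(2)            # w_t ~ MVN(0, Q)
  x[, t] <- B %*% x[, t - 1] + u + w      # Equation 1
}
for (t in 1:TT) {
  v <- t(chol(R)) %*% rnorm(3)            # v_t ~ MVN(0, R)
  y[, t] <- Z %*% x[, t] + a + v          # Equation 3
}
dim(y)                                    # n x T: time runs across columns
```

Note that the simulated y is stored with time across columns, which is the orientation the MARSS function expects of data.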


Again, some of the parameter elements will generally be fixed to ensure identifiability. The MARSS process and observation models are flexible, and many multivariate time-series models can be rewritten in MARSS form. See textbooks on time-series analysis where reformulating AR-p and ARMA-p models as a MARSS model is covered (such as Tsay, 2010; Harvey, 1989).

Package overview

The package is designed to fit MARSS models with fixed and shared elements within the parameter matrices. The following shows such a model using a mean-reverting random walk model with three observation time series as an example:

\[
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_t =
\begin{bmatrix} b & 0 \\ 0 & b \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_{t-1} +
\begin{bmatrix} 0 \\ 0 \end{bmatrix} +
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}_t \tag{4a}
\]

\[
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}_t \sim
\mathrm{MVN}\!\left(0, \begin{bmatrix} q_{11} & q_{12} \\ q_{12} & q_{22} \end{bmatrix}\right) \tag{4b}
\]

\[
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_0 \sim
\mathrm{MVN}\!\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) \tag{4c}
\]

\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}_t =
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_t +
\begin{bmatrix} a_1 \\ 0 \\ 0 \end{bmatrix} +
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}_t \tag{4d}
\]

\[
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}_t \sim
\mathrm{MVN}\!\left(0, \begin{bmatrix} r_1 & 0 & 0 \\ 0 & r_2 & 0 \\ 0 & 0 & r_2 \end{bmatrix}\right) \tag{4e}
\]

Notice that many parameter elements are fixed, while others are shared (have the same symbol).

Model specification

Model specification has a one-to-one relationship to the model written on paper in matrix form (Equation 4). The user passes in a list which includes an element for each parameter in the model: B, U, Q, Z, A, R, x0, V0. The list element specifies the form of the corresponding parameter in the model.

The most general way to specify a parameter form is to use a list matrix. The list matrix allows one to combine fixed and estimated elements in one's parameter specification. For example, to specify the parameters in Equation 4, one would write the following list matrices:

> B1 = matrix(list("b", 0, 0, "b"), 2, 2)
> U1 = matrix(0, 2, 1)
> Q1 = matrix(c("q11", "q12", "q12", "q22"), 2, 2)
> Z1 = matrix(c(1, 0, 1, 0, 1, 0), 3, 2)
> A1 = matrix(list("a1", 0, 0), 3, 1)
> R1 = matrix(list("r1", 0, 0, 0, "r2", 0,
+   0, 0, "r2"), 3, 3)
> pi1 = matrix(0, 2, 1)
> V1 = diag(1, 2)
> model.list = list(B = B1, U = U1, Q = Q1, Z = Z1,
+   A = A1, R = R1, x0 = pi1, V0 = V1)

When printed at the R command line, each parameter matrix looks exactly like the parameters as written out in Equation 4. Numeric values are fixed, and character values are names of elements to be estimated. Elements with the same character name are constrained to be equal (no sharing across parameter matrices, only within).

List matrices allow the most flexible model structures, but MARSS also has text shortcuts for many common model structures. The structures of the variance-covariance matrices in the model are specified via the Q, R, and V0 components. Common structures are errors independent with one variance on the diagonal (specified with "diagonal and equal") or all variances different ("diagonal and unequal"); errors correlated with the same correlation parameter ("equalvarcov") or correlated with each process having unique correlation parameters ("unconstrained"); or all zero ("zero"). Common structures for the matrix B are an identity matrix ("identity"); a diagonal matrix with one value on the diagonal ("diagonal and equal") or all different values on the diagonal ("diagonal and unequal"); or all values different ("unconstrained"). Common structures for the u and a parameters are all equal ("equal"); all unequal ("unequal" or "unconstrained"); or all zero ("zero").

The Z matrix is idiosyncratic and there are fewer shortcuts for its specification. One common form for Z is a design matrix, composed of 0s and 1s, in which each row sum equals 1. This is used when each y in y is associated with one x in x (one x may be associated with multiple y's). In this case, Z can be specified using a factor of length n. The factor specifies to which x each of the n y's corresponds. For example, the Z in Equation 4 could be specified factor(c(1,2,1)) or factor(c("a","b","a")).

Model fitting with the MARSS function

The main package function is MARSS. This function fits a MARSS model (Equations 1 and 3) to a matrix of data and returns the maximum-likelihood estimates for the B, u, Q, Z, a, R, π, and Λ parameters, or more specifically, the free elements in those parameters, since many elements will be fixed for a given model form. The basic MARSS call takes the form:

> MARSS(data, model = model.list)

where model.list has the form shown in the R code above. The data must be passed in as an n × T matrix, that is, time goes across columns. A vector is not a matrix, nor is a data frame. A matrix of 3 inputs (n = 3) measured for 6 time steps might look like

      [ 1    2   NA   NA   3.2   8 ]
  y = [ 2    5    3   NA   5.1   5 ]
      [ 1   NA    2   2.2  NA    7 ]
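The correspondence between a factor specification of Z and the design matrix it implies can be sketched in base R. The construction below is illustrative only (it is not the MARSS package's internal code), using the factor(c(1, 2, 1)) example from the text:

```r
# Build the design matrix implied by a factor specification of Z.
# factor(c(1, 2, 1)) says observations 1 and 3 follow state 1 and
# observation 2 follows state 2 -- the Z used in Equation 4.
z.factor <- factor(c(1, 2, 1))
Z <- matrix(0, nrow = length(z.factor), ncol = nlevels(z.factor))
Z[cbind(seq_along(z.factor), as.integer(z.factor))] <- 1
Z
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1
# [3,]    1    0
rowSums(Z)   # each row sums to 1, as required of a design matrix
```

Each row has exactly one 1, placing each observed time series against exactly one state process.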

where NA denotes a missing value. Note that the time steps are equidistant, but there may not be data at each time step; thus the observations are not constrained to be equidistant.

The only required argument to MARSS is a matrix of data. The model argument is an optional argument, although typically included, which specifies the form of the MARSS model to be fitted (if left off, a default form is used). Other MARSS arguments include initial parameter values (inits), a non-default value for missing values (miss.value), whether or not the model should be fitted (fit), whether or not output should be verbose (silent), and the method for maximum-likelihood estimation (method). The default method is the EM algorithm ('method = "kem"'), but the quasi-Newton method BFGS is also allowed ('method = "BFGS"'). Changes to the default fitting algorithm are specified with control; for example, control$minit is used to change the minimum number of iterations used in the maximization algorithm. Use ?MARSS to see all the control options.

Each call to the MARSS function returns an object of class "marssMLE", an estimation object. The marssMLE object is a list with the following elements: the estimated parameter values (par), Kalman filter and smoother output (kf), convergence diagnostics (convergence, numIter, errors, logLik), basic model selection metrics (AIC, AICc), and a model object (model) which contains both the MARSS model specification and the data. Further list elements, such as bootstrap AIC or confidence intervals, are added with functions that take a marssMLE object as input.

Model selection and confidence intervals

One of the most commonly used model selection tools in the maximum-likelihood framework is Akaike's Information Criterion (AIC) or the small-sample variant (AICc). Both versions are returned with the call to MARSS. A state-space specific model selection metric, AICb (Cavanaugh and Shumway, 1997; Stoffer and Wall, 1991), can be added to a marssMLE object using the function MARSSaic.

The function MARSSparamCIs is used to add confidence intervals and bias estimates to a marssMLE object. Confidence intervals may be computed by resampling the Kalman innovations if all data are present (innovations bootstrapping) or by implementing a parametric bootstrap if some data are missing (parametric bootstrapping). All bootstrapping in the MARSS package is done with the lower-level function MARSSboot. Because bootstrapping can be time-consuming, MARSSparamCIs also allows approximate confidence intervals to be calculated with a numerically estimated Hessian matrix. If bootstrapping is used to generate confidence intervals, MARSSparamCIs will also return a bootstrap estimate of parameter bias.

Estimated state trajectories

In addition to outputting the maximum-likelihood estimates of the model parameters, the MARSS call will also output the maximum-likelihood estimates of the state trajectories (the x in the MARSS model) conditioned on the entire dataset. These states represent output from the Kalman smoother using the maximum-likelihood parameter values. The estimated states are in the states element of the marssMLE object output from a MARSS call. The standard errors of the state estimates are in element states.se of the marssMLE object. For example, if a MARSS model with one state (m = 1) was fitted to data and the output placed in fit, the state estimates and ±2 standard errors could be plotted with the following code:

> plot(fit$states, type = "l", lwd = 2)
> lines(fit$states - 2*fit$states.se)
> lines(fit$states + 2*fit$states.se)

Example applications

To demonstrate the MARSS package functionality, we show a series of short examples based on a few of the case studies in the MARSS User Guide. The user guide includes much more extensive analysis and includes many other case studies. All the case studies are based on analysis of ecological data, as this was the original motivation for the package.

Trend estimation

A commonly used model in ecology describing the dynamics of a population experiencing density-dependent growth is the Gompertz model (Reddingius and den Boer, 1989; Dennis and Taper, 1994). The population dynamics are stochastic and driven by the combined effects of environmental variation and random demographic variation.

The log-population sizes from this model can be described by an AR-1 process:

  x_t = b x_{t-1} + u + w_t;   w_t ~ Normal(0, σ²)   (5)

In the absence of density dependence (b = 1), the exponential growth rate or rate of decline of the population is controlled by the parameter u; when the population experiences density-dependence (|b| < 1), the population will converge to an equilibrium u/(1 − b). The initial states are given by

  x_0 ~ Normal(π, λ)   (6)

The observation process is described by

  y_t = x_t + v_t;   v_t ~ Normal(0, η²)   (7)

where y is a vector of observed log-population sizes. The bias parameter, a, has been set equal to 0.
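A quick base-R simulation of the Gompertz state-space model in Equations 5-7 shows the density-dependent case (|b| < 1) fluctuating around the equilibrium u/(1 − b). The parameter values below are invented for the sketch, not taken from any analysis in the paper:

```r
# Simulate the Gompertz state-space model (Equations 5-7).
# Illustrative parameters: b = 0.8, u = 1, so equilibrium u/(1 - b) = 5.
set.seed(1)
TT <- 200
b <- 0.8; u <- 1; sigma <- 0.1; eta <- 0.2
x <- numeric(TT)
x[1] <- 0                                        # initial log-population size
for (t in 2:TT) {
  x[t] <- b * x[t - 1] + u + rnorm(1, 0, sigma)  # Equation 5, process
}
y <- x + rnorm(TT, 0, eta)                       # Equation 7, observation
mean(x[100:200])                                 # near the equilibrium of 5
```

With b = 1 instead, the process becomes a random walk with drift u, which is the trend-estimation model used in the next example.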


This model with b set to 1 is a standard model used for species of concern (Dennis et al., 1991) when the population is well below carrying capacity and the goal is to estimate the underlying trend (u) for use in population viability analysis (Holmes, 2001; Holmes et al., 2007; Humbert et al., 2009). To illustrate a simple trend analysis, we use the graywhales dataset included in the MARSS package. This dataset consists of 24 abundance estimates of eastern North Pacific gray whales (Gerber et al., 1999). This population was depleted by commercial whaling prior to 1900 and has steadily increased since the first abundance estimates were made in the 1950s (Gerber et al., 1999). The dataset consists of 24 annual observations over 39 years, 1952-1997; years without an estimate are denoted as NA. The time step between observations is one year, and the goal is to estimate the annual rate of increase (or decrease).

After log-transforming the data, maximum-likelihood parameter estimates can be calculated using MARSS:

> data(graywhales)
> years = graywhales[,1]
> loggraywhales = log(graywhales[,2])
> kem = MARSS(loggraywhales)

The MARSS wrapper returns an object of class "marssMLE", which contains the parameter estimates (kem$par), state estimates (kem$states) and their standard errors (kem$states.se). The state estimates can be plotted with ±2 standard errors, along with the original data (Figure 1).

[Figure 1: Maximum-likelihood estimates of eastern North Pacific gray whale population size (solid line), with data (points) and 2 standard errors (dashed lines).]

Additional examples of trend estimation may be found in the MARSS User Guide.

Identification of spatial population structure using model selection

In population surveys, censuses are often taken at multiple sites, and those sites may or may not be connected through dispersal. Adjacent sites with high exchange rates of individuals will synchronize subpopulations (making them behave as a single subpopulation), while sites with low mixing rates will behave more independently as discrete subpopulations. We can use MARSS models and model selection to test hypotheses concerning the spatial structure of a population (Hinrichsen, 2009; Hinrichsen and Holmes, 2009; Ward et al., 2010).

For the multi-site application, x is a vector of abundance in each of m subpopulations and y is a vector of n observed time series associated with n sites. The number of underlying processes, m, is controlled by the user and not estimated. In the multi-site scenario, Z controls the assignment of survey sites to subpopulations. For instance, if the population is modeled as having three independent subpopulations, with the first two surveyed at one site each and the third surveyed at two sites, then Z is a 4 × 3 matrix with (1, 0, 0, 0) in the first column, (0, 1, 0, 0) in the second and (0, 0, 1, 1) in the third. In the model argument for the MARSS function, we could specify this as 'Z = as.factor(c(1,2,3,3))'. The Q matrix controls the temporal correlation in the process errors for each subpopulation; these errors could be correlated or uncorrelated. The R matrix controls the correlation structure of observation errors between the n survey sites.

The dataset harborSeal in the MARSS package contains survey data for harbor seals on the west coast of the United States. We can test the hypothesis that the four survey sites in Puget Sound (Washington state) are better described by one panmictic population measured at four sites versus four independent subpopulations.

> dat = t(log(harborSeal[,4:7]))
> harbor1 = MARSS(dat,
+   model = list(Z = factor(rep(1, 4))))
> harbor2 = MARSS(dat,
+   model = list(Z = factor(1:4)))

The default model settings of 'Q = "diagonal and unequal"', 'R = "diagonal and equal"', and 'B = "identity"' are used for this example. The AICc value for the one subpopulation surveyed at four sites is considerably smaller than that for the four-subpopulations model. This suggests that these four sites are within one subpopulation (which is what one would expect given their proximity and lack of geographic barriers).

> harbor1$AICc
[1] -280.0486


> harbor2$AICc
[1] -254.8166

We can add a bootstrapped AIC using the function MARSSaic.

> harbor1 = MARSSaic(harbor1,
+   output = c("AICbp"))

We can add confidence intervals to the estimated parameters using the function MARSSparamCIs.

> harbor1 = MARSSparamCIs(harbor1)

The default is to compute approximate confidence intervals using a numerically estimated Hessian matrix. Two full model selection exercises can be found in the MARSS User Guide.

Analysis of animal movement data

Another application of state-space modeling is identification of true movement paths of tagged animals from noisy satellite data. Using the MARSS package to analyze movement data assumes that the observations are equally spaced in time, or that enough NAs are included to make the observations spaced accordingly. The MARSS package includes the satellite tracks from eight tagged sea turtles, which can be used to demonstrate estimation of location from noisy data.

The data consist of daily longitude and latitude for each turtle. The true location is the underlying state x, which is two-dimensional (latitude and longitude), and we have one observation time series for each state process. To model movement as a random walk with drift in two dimensions, we set B equal to an identity matrix and constrain u to be unequal to allow for non-isotropic movement. We constrain the process and observation variance-covariance matrices (Q, R) to be diagonal to specify that the errors in latitude and longitude are independent. Figure 2 shows the estimated locations for the turtle named "Big Mama".

[Figure 2: Estimated movement track (solid line) and observed locations (points) for one turtle ("Big Mama") using a two-dimensional state-space model. The MARSS User Guide shows how to produce this figure from the satellite data provided with the package.]

Estimation of species interactions from multi-species data

Ecologists are often interested in inferring species interactions from multi-species datasets. For example, Ives et al. (2003) analyzed four time series of predator (zooplankton) and prey (phytoplankton) abundance estimates collected weekly from a freshwater lake over seven years. The goal of their analysis was to estimate the per-capita effects of each species on every other species (predation, competition), as well as the per-capita effect of each species on itself (density dependence). These types of interactions can be modeled with a multivariate Gompertz model:

  x_t = B x_{t-1} + u + w_t;   w_t ~ MVN(0, Q)   (8)

The B matrix is the interaction matrix and is the main parameter to be estimated. Here we show an example of estimating B using one of the Ives et al. (2003) datasets. This is only for illustration; in a real analysis, it is critical to take into account the important environmental covariates, otherwise correlation driven by a covariate will affect the B estimates. We show how to incorporate covariates in the MARSS User Guide.

The first step is to log-transform and de-mean the data:

> plank.dat = t(log(ivesDataByWeek[,c("Large Phyto",
+   "Small Phyto","Daphnia","Non-daphnia")]))
> d.plank.dat = (plank.dat - apply(plank.dat, 1,
+   mean, na.rm = TRUE))

Second, we specify the MARSS model structure. We assume that the process variances of the two phytoplankton species are the same and that the process variances for the two zooplankton are the same, but we assume that the abundance of each species is an independent process. Because the data have been de-meaned, the parameter u is set to zero. Following Ives et al. (2003), we set the observation error to 0.04 for phytoplankton and to 0.16 for zooplankton. The model is specified as follows:

> Z = factor(rownames(d.plank.dat))
> U = "zero"
> A = "zero"
> B = "unconstrained"


> Q = matrix(list(0), 4, 4)
> diag(Q) = c("Phyto", "Phyto", "Zoo", "Zoo")
> R = diag(c(0.04, 0.04, 0.16, 0.16))
> plank.model = list(Z = Z, U = U, Q = Q,
+   R = R, B = B, A = A, tinitx = 1)

We do not specify x0 or V0 since we assume the default behavior, which is that the initial states are treated as estimated parameters with 0 variance. We can fit this model using MARSS:

> kem.plank = MARSS(d.plank.dat,
+   model = plank.model)
> B.est = matrix(kem.plank$par$B, 4, 4)
> rownames(B.est) = colnames(B.est) =
+   c("LP", "SP", "D", "ND")
> print(B.est, digits = 2)

        LP    SP      D    ND
LP   0.549 -0.23  0.096 0.085
SP  -0.058  0.52  0.059 0.014
D    0.045  0.27  0.619 0.455
ND  -0.097  0.44 -0.152 0.909

where "LP" and "SP" represent large and small phytoplankton, respectively, and "D" and "ND" represent the two zooplankton species, "Daphnia" and "Non-Daphnia", respectively.

Elements on the diagonal of B indicate the strength of density dependence. Values near 1 indicate no density dependence; values near 0 indicate strong density dependence. Elements on the off-diagonal of B represent the effect of species j on the per-capita growth rate of species i. For example, the second time series (small phytoplankton) appears to have relatively strong positive effects on both zooplankton time series. This is somewhat expected, as more food availability might lead to higher predator densities. The predators appear to have relatively little effect on their prey; again, this is not surprising, as phytoplankton densities are often strongly driven by environmental covariates (temperature and pH) rather than predator densities. Our estimates of B are different from those presented by Ives et al. (2003); in the MARSS User Guide we illustrate how to recover the estimates in Ives et al. (2003) by fixing elements to zero and including covariates.

Dynamic factor analysis

In the previous examples, each observed time series was representative of a single underlying state process, though multiple time series may be observations of the same state process. This is not a requirement for a MARSS model, however. Dynamic factor analysis (DFA) also uses MARSS models but treats each observed time series as a linear combination of multiple state processes. DFA has been applied in a number of fields (e.g. Harvey, 1989) and can be thought of as a principal components analysis for time-series data (Zuur et al., 2003a,b). In a DFA, the objective is to estimate the number of underlying state processes, m, and the Z matrix now represents how these state processes are linearly combined to explain the larger set of n observed time series. Model selection is used to select the size of m, and the objective is to find the most parsimonious number of underlying trends (size of m) plus their weightings (Z matrix) that explain the larger dataset.

To illustrate, we show results from a DFA analysis for six time series of plankton from Lake Washington (Hampton et al., 2006). Using model selection with models using m = 1 to m = 6, we find that the best (lowest AIC) model has four underlying trends, reduced from the original six. When m = 4, the DFA observation process then has the following form:

\[
\begin{bmatrix} y_{1,t} \\ y_{2,t} \\ y_{3,t} \\ y_{4,t} \\ y_{5,t} \\ y_{6,t} \end{bmatrix} =
\begin{bmatrix}
\gamma_{11} & 0 & 0 & 0 \\
\gamma_{21} & \gamma_{22} & 0 & 0 \\
\gamma_{31} & \gamma_{32} & \gamma_{33} & 0 \\
\gamma_{41} & \gamma_{42} & \gamma_{43} & \gamma_{44} \\
\gamma_{51} & \gamma_{52} & \gamma_{53} & \gamma_{54} \\
\gamma_{61} & \gamma_{62} & \gamma_{63} & \gamma_{64}
\end{bmatrix} x_t +
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \end{bmatrix} +
\begin{bmatrix} v_{1,t} \\ v_{2,t} \\ v_{3,t} \\ v_{4,t} \\ v_{5,t} \\ v_{6,t} \end{bmatrix}
\]

Figure 3 shows the fitted data with the four state trajectories superimposed.

The benefit of reducing the data with DFA is that we can reduce a larger dataset into a smaller set of underlying trends, and we can use factor analysis to identify whether some time series fall into clusters represented by similar trend weightings. A more detailed description of dynamic factor analysis, including the use of factor rotation to interpret the Z matrix, is presented in the MARSS User Guide.

[Figure 3: Results of a dynamic factor analysis applied to the Lake Washington plankton data (panels: Cryptomonas, bluegreen, Diatom, Unicells, Green, Other algae). Colors represent estimated state trajectories and black lines represent the fitted values for each species.]


Summary

We hope this package will provide scientists in a variety of disciplines some useful tools for the analysis of noisy univariate and multivariate time-series data, and tools to evaluate data support for different model structures for time-series observations. Future development will include Bayesian estimation and constructing non-linear time-series models.

Bibliography

J. Cavanaugh and R. Shumway. A bootstrap variant of AIC for state-space model selection. Statistica Sinica, 7:473-496, 1997.

B. Dennis and M. L. Taper. Density dependence in time series observations of natural populations: Estimation and testing. Ecological Monographs, 64(2):205-224, 1994.

B. Dennis, P. L. Munholland, and J. M. Scott. Estimation of growth and extinction parameters for endangered species. Ecological Monographs, 61:115-143, 1991.

C. Dethlefsen, S. Lundbye-Christensen, and A. L. Christensen. sspir: state space models in R, 2009. URL http://CRAN.R-project.org/package=sspir. R package version 0.2.8.

J. Durbin and S. J. Koopman. Time series analysis by state space methods. Oxford University Press, Oxford, 2001.

L. R. Gerber, D. P. DeMaster, and P. M. Kareiva. Gray whales and the value of monitoring data in implementing the U.S. Endangered Species Act. Conservation Biology, 13:1215-1219, 1999.

P. D. Gilbert. Brief user's guide: dynamic systems estimation, 2009. URL http://cran.r-project.org/web/packages/dse/vignettes/Guide.pdf.

S. E. Hampton, M. D. Scheuerell, and D. E. Schindler. Coalescence in the Lake Washington story: interaction strengths in a planktonic food web. Limnology and Oceanography, 51(5):2042-2051, 2006.

A. C. Harvey. Forecasting, structural time series models and the Kalman filter. Cambridge University Press, Cambridge, UK, 1989.

J. Helske. KFAS: Kalman filter and smoothers for exponential family state space models, 2011. URL http://CRAN.R-project.org/package=KFAS. R package version 0.6.1.

R. Hinrichsen. Population viability analysis for several populations using multivariate state-space models. Ecological Modelling, 220(9-10):1197-1202, 2009.

R. Hinrichsen and E. E. Holmes. Using multivariate state-space models to study spatial structure and dynamics. In R. S. Cantrell, C. Cosner, and S. Ruan, editors, Spatial Ecology. CRC/Chapman Hall, 2009.

E. Holmes, E. Ward, and K. Wills. MARSS: multivariate autoregressive state-space modeling, 2012. URL http://CRAN.R-project.org/package=MARSS. R package version 2.9.

E. E. Holmes. Estimating risks in declining populations with poor data. Proceedings of the National Academy of Sciences of the United States of America, 98(9):5072-5077, 2001.

E. E. Holmes. Derivation of the EM algorithm for constrained and unconstrained MARSS models. Technical report, Northwest Fisheries Science Center, Mathematical Biology Program, 2010.

E. E. Holmes, J. L. Sabo, S. V. Viscido, and W. F. Fagan. A statistical approach to quasi-extinction forecasting. Ecology Letters, 10(12):1182-1198, 2007.

J.-Y. Humbert, L. S. Mills, J. S. Horne, and B. Dennis. A better way to estimate population trends. Oikos, 118(12):1940-1946, 2009.

A. R. Ives, B. Dennis, K. L. Cottingham, and S. R. Carpenter. Estimating community stability and ecological interactions from time-series data. Ecological Monographs, 73(2):301-330, 2003.

D. Luethi, P. Erb, and S. Otziger. FKF: fast Kalman filter, 2012. URL http://CRAN.R-project.org/package=FKF. R package version 0.1.2.

K. Metaxoglou and A. Smith. Maximum likelihood estimation of VARMA models using a state-space EM algorithm. Journal of Time Series Analysis, 28:666-685, 2007.

G. Petris. An R package for dynamic linear models. Journal of Statistical Software, 36(12):1-16, 2010. URL http://www.jstatsoft.org/v36/i12/.

J. Reddingius and P. den Boer. On the stabilization of animal numbers. Problems of testing. 1. Power estimates and observation errors. Oecologia, 78:1-8, 1989.

J. T. Schnute. A general framework for developing sequential fisheries models. Canadian Journal of Fisheries and Aquatic Sciences, 51:1676-1688, 1994.

R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4):253-264, 1982.

D. S. Stoffer and K. D. Wall. Bootstrapping state-space models: Gaussian maximum likelihood estimation and the Kalman filter. Journal of the American Statistical Association, 86(416):1024-1033, 1991.


Elizabeth E. Holmes
Northwest Fisheries Science Center
2725 Montlake Blvd E, Seattle WA 98112
USA
[email protected]

Eric J. Ward
Northwest Fisheries Science Center
2725 Montlake Blvd E, Seattle WA 98112
USA
[email protected]

Kellie Wills
University of Washington
2725 Montlake Blvd E, Seattle WA 98112
USA
[email protected]

openair – Data Analysis Tools for the Air Quality Community

by Karl Ropkins and David C. Carslaw

Abstract The openair package contains data analysis tools for the air quality community. This paper provides an overview of data importers, main functions, and selected utilities and workhorse functions within the package, and the function output class, as of package version 0.4-14. It is intended as an explanation of the rationale for the package and a technical description for those wishing to work more interactively with the main functions or develop additional functions to support ‘higher level’ use of openair and R.

Large volumes of air quality data are routinely collected for regulatory purposes, but few of those in local authorities and government bodies tasked with this responsibility have the time, expertise or funds to comprehensively analyse this potential resource (Chow and Watson, 2008). Furthermore, few of these institutions can routinely access the more powerful statistical methods typically required to make the most effective use of such data without a suite of often expensive and niche-application proprietary software products. This in turn places large cost and time burdens both on these institutions and on others (e.g. academic or commercial) wishing to contribute to this work. In addition, such collaborative working practices can also become highly restricted and polarised if data analysis undertaken by one partner cannot be validated or replicated by another because they lack access to the same licensed products.

Being freely distributed under general licence, R has the obvious potential to act as a common platform for those routinely collecting and archiving data and the wider air quality community. This potential has already been proven in several other research areas; commonly cited examples include the Bioconductor project (Gentleman et al, 2004) and the Epitools collaboration (http://www.medepi.com/epitools). However, what is perhaps most inspiring is the degree of transparency demonstrated by the recent public analysis of climate change data in R and the associated open debate (http://chartsgraphs.wordpress.com/category/r-climate-data-analysis-tool/). Anyone affected by a policy decision could potentially have unlimited access to scrutinise both the tools and data used to shape that decision.

The openair rationale

With this potential in mind, the openair project was funded by UK NERC (award NE/G001081/1) specifically to develop data analysis tools for the wider air quality community in R, as part of the NERC Knowledge Exchange programme (http://www.nerc.ac.uk/using/introduction/).

One potential issue was identified during the very earliest stages of the project that is perhaps worth emphasising for existing R users. Most R users already have several years of either formal or self-taught experience in statistical, mathematical or computational working practices before they first encounter R. They probably first discovered R because they were already researching a specific technique that they had identified as beneficial to their research and saw a reference to a package or script in an expert journal, or were recommended R by a colleague. Their first reaction on discovering R, and in particular the packages, was probably one of excitement. Since then they have most likely gone on to use numerous packages, selecting an appropriate combination for each new application they undertook.

Many in the air quality community, especially those associated with data collection and archiving, are likely to be coming to both openair (Carslaw and Ropkins, 2010) and R with little or no previous experience of statistical programming. Like other R users, they recognise the importance of highly evolved statistical methods in making the most effective use of their data; but, for them, the step-change to working with R is significantly larger. As a result, many of the decisions made when developing and documenting the openair package were shaped by this issue.

Data structures and importers

Most of the main functions in openair operate on a single data frame. Although it is likely that in future this will be replaced with an object class to allow data unit handling, the data frame was initially adopted for two reasons. Firstly, air quality data is currently collected and archived in numerous formats, and keeping the import requirements for openair simple minimised the frustrations associated with data importation. Secondly, restricting the user to working in a single data format greatly simplifies data management and working practices for those less familiar with programming environments.

Typically, the data frame should have named


fields, three of which are specifically reserved, namely: date, a field of ‘POSIXt’ class time stamps, and ws and wd, numeric fields for wind speed and wind direction data. There are no restrictions on the number of other fields or the names used, outside the standard conventions of R. This means that the ‘work up’ needed to make a new file openair-compatible is minimal: read in the data; reformat and rename date; and rename wind speed and wind direction as ws and wd, if present.

That said, many users new to programming still found class structures, in particular ‘POSIXt’, daunting. Therefore, a series of importer functions were developed to simplify this process.

The first to be developed was import, a general purpose function intended to handle comma and tab delimited file types. It defaults to a file browser (via file.choose), and is intended to be used in the common form, e.g.:

newdata <- import()
newdata <- import(file.type = "txt") #etc

(Here, as elsewhere in openair, argument options have been selected pragmatically for users with limited knowledge of file structures or programming conventions. Note that the file.type option is the file extension "txt" that many users are familiar with, rather than either the delim from read.delim or the "\t" separator.)

A wide range of monitoring, data logging and archiving systems are used by the air quality community, and many of these employ unique file layouts, including e.g. multi-column date and time stamps, isolated or multi-row headers, or additional information of different dimensions to the main data set. So, import includes a number of arguments, described in detail in ?import, that can be used to fine-tune its operation for novel file layouts.

Dedicated importers have since been written for some of the file formats and data sources most commonly used by the air quality community in the UK. These operate in the common form:

newdata <- import[Name]()

And include:

• importADMS, a general importer for the ‘.bgd’, ‘.met’, ‘.mop’ and ‘.pst’ file formats used by the Atmospheric Dispersion Modelling System (McHugh et al, 1997) developed by CERC (http://www.cerc.co.uk/index.php). ADMS is widely used in various forms to model both current and future air quality scenarios (http://www.cerc.co.uk/environmental-software/ADMS-model.html).

• importAURN and importAURNCsv, importers for hourly data from the UK (Automatic Urban and Rural Network) Air Quality Data Archive (http://www.airquality.co.uk/data_and_statistics.php). importAURN provides a direct link to the archive and downloads data directly from it, while importAURNCsv allows ‘.csv’ files previously downloaded from the archive to be read into openair.

• importKCL, an importer for direct online access to data from the King’s College London databases (http://www.londonair.org.uk/).

Here, we gratefully acknowledge the very significant help and support of AEAT, King’s College London’s Environmental Research Group (ERG) and CERC in the development of these importers. AEAT and ERG operate the AURN and LondonAir archives, respectively, and both specifically set up dedicated services to allow the direct download of ‘.RData’ files from their archives. CERC provided extensive access to multiple generations of ADMS file structures and ran an extensive programme of compatibility testing to ensure the widest possible body of ADMS data was accessible to openair users.

Example data

The openair package includes one example dataset, mydata. This is a data frame of hourly measurements of air pollutant concentrations, wind speed and wind direction collected at the Marylebone (London) air quality monitoring supersite between 1st January 1998 and 23rd June 2005 (source: London Air Quality Archive; http://www.londonair.org.uk).

The same dataset is available to download as a ‘.csv’ file from the openair website (http://www.openair-project.org/CSV/OpenAir_example_data_long.csv). This file can be directly loaded into openair using the import function. As a result, many users, especially those new to R, have found it a very useful template when loading their own data.

Manuals

Two manuals are available for use with openair. The standard R manual is available alongside the package at its ‘CRAN’ repository site. An extended manual, intended to provide new users less familiar with either R or openair with a gentler introduction, is available on the openair website: http://www.openair-project.org.

Main functions

Most of the main functions within openair share a highly similar structure and, wherever possible, common arguments. Many in the air quality community are very familiar with ‘GUI’ interfaces and data

analysis procedures that are very much predefined by the software developers. R allows users the opportunity to really explore their data. However, a command line framework can sometimes feel frustratingly slow and awkward to users more used to a ‘click and go’ style of working. Standardising the argument structure of the main functions both encourages a more interactive approach to data analysis and minimises the amount of typing required of users more used to working with a mouse than a keyboard.

Common openair function arguments include: pollutant, which identifies the data frame field or fields to select data from; statistic, which, where data are grouped, e.g. share common coordinates on a plot, identifies the summary statistic to apply if only a single value is required; and avg.time, which, where data series are to be averaged over longer time periods, identifies the required time resolution. However, perhaps the most important of these common arguments is type, a simplified form of the conditioning term cond widely used elsewhere in R.

Rapid data conditioning is only one of a large number of benefits that R provides, but it is probably the one that has most resonance with air quality data handlers. Most can instantly appreciate its potential power as a data visualisation tool and its flexibility when used in a programming environment like R. However, many new users can struggle with the fine control of cond, particularly with regards to the application of format.POSIX* to time stamps. The type argument therefore uses an openair workhorse function called cutData, which is discussed further below, to provide a robust means of conditioning data using options such as "hour", "weekday", "month" and "year".

These features are perhaps best illustrated with an example. The openair function trendLevel is basically a wrapper for the lattice (Sarkar, 2009) function levelplot that incorporates a number of built-in conditioning and data handling options based on these common arguments. So, many users will be very familiar with the basic implementation. The function generates a levelplot of pollutant ~ x * y | type where x, y and type are all cut/factorised by cutData, and in each x/y/type case pollutant data is summarised using the option statistic.

When applied to the openair example dataset mydata in its default form, trendLevel uses x = "month" (month of year), y = "hour" (time of day) and type = "year" to provide information on trends, seasonal effects and diurnal variations in mean NOx concentrations in a single output (Figure 1).

Figure 1: openair plot trendLevel(mydata, "nox"). Note: the seasonal and diurnal trends, high in winter months and daytime hours, most notably early morning and evening, are very typical of man-made sources such as traffic, and the general, by-panel, decrease in mean concentrations reflects the effect of incremental air quality management regulations introduced during the last decade.

However, x, y, type and statistic can all be user-defined. The function arguments x, y and type can be set to a wide range of time/date options or to any of the fields within the supplied data frame, with numerics being cut into quantiles, characters converted to factors, and factors used as is. Similarly, statistic can also be either a pre-coded option, e.g. "mean", "max", etc, or a user defined function. This ‘tiered approach’ provides both simple, robust access for new users and a very flexible structure for more experienced R users. To illustrate this point, the default trendLevel plot (Figure 1) can be generated using three equivalent calls:

# predefined
trendLevel(mydata, statistic = "mean")

# using base::mean
trendLevel(mydata, statistic = mean)

# using local function
my.mean <- function(x){
  x <- na.omit(x)
  sum(x) / length(x)}
trendLevel(mydata, pollutant = "nox",
           statistic = my.mean)

The type argument can accept one or two options, depending on the function, and in the latter case strip labelling is handled using the latticeExtra (Sarkar and Andrews, 2011) function useOuterStrips.
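The cut-and-summarise pattern described above can be approximated without openair, using only base R and lattice. The sketch below uses synthetic data and made-up names (demo, cells; not openair objects) to mirror the pollutant ~ month * hour | year aggregation that trendLevel wraps:

```r
## A base R + lattice sketch of the cut-and-summarise step behind
## trendLevel (synthetic data; 'demo' and 'cells' are made-up names).
library(lattice)

dates <- seq(as.POSIXct("2000-01-01", tz = "GMT"),
             as.POSIXct("2002-12-31 23:00", tz = "GMT"), by = "hour")
demo <- data.frame(date = dates,
                   nox  = rlnorm(length(dates), meanlog = 4))

## factorise date into month, hour and year (the step cutData automates)
demo$month <- factor(format(demo$date, "%m"), labels = month.abb)
demo$hour  <- factor(format(demo$date, "%H"))
demo$year  <- factor(format(demo$date, "%Y"))

## summarise the pollutant in every month/hour/year cell ...
cells <- aggregate(nox ~ month + hour + year, data = demo, FUN = mean)

## ... then display as a conditioned levelplot
levelplot(nox ~ month * hour | year, data = cells)
```

Swapping the aggregate call's FUN for another function reproduces the statistic flexibility discussed above.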


Figure 2: openair plots generated using scatterPlot(mydata, "nox", "no2", ...) and method = "scatter" (default; left), "hexbin" (middle) and "density" (right).
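Outside openair, the "density" approach to over-plotting shown in Figure 2 can be approximated with base R's smoothScatter, which is built on the same grDevices kernel density machinery; a minimal sketch on synthetic (not mydata) values:

```r
## Sketch: a kernel density scatter as an alternative to plotting
## thousands of overlapping points (synthetic stand-ins for two
## correlated pollutant series).
set.seed(42)
x <- rnorm(20000)
y <- 0.7 * x + rnorm(20000, sd = 0.5)

plot(x, y, pch = ".")            # ordinary scatter: heavily over-plotted
smoothScatter(x, y, nbin = 128)  # density surface keeps structure visible
```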

The other main functions include:

• summaryPlot, a function which generates a rug plot and histogram view of one or more data frame fields, as well as calculating several key statistics that can be used to characterise the quality and capture efficiency of data collection in extended monitoring programmes. The plot provides a useful screening prior to the main data analysis.

• timePlot and scatterPlot, time-series and traditional scatter plot functions. These were originally introduced to provide such plots in a format consistent with other openair outputs, but have evolved through user-feedback to include additional hard-coded options that may be of more general use. Perhaps the most obvious of these are the "hexbin" (hexagonal binning using the hexbin package (Carr et al, 2011)) and "density" (2D kernel density estimates using .smoothScatterCalcDensity in grDevices) methods for handling over-plotting (Figure 2).

• Trend analysis is an important component of air quality management, both in terms of historical context and abatement strategy evaluation. openair includes three trend analysis functions: MannKendall, smoothTrend and linearRelation. MannKendall uses methods based on Hirsch et al (1982) and Helsel and Hirsch (2002) to evaluate monotonic trends. Sen-Theil slope and uncertainty are estimated using code based on that published on-line by Rand Wilcox (http://www-rcf.usc.edu/~rwilcox/), and the basic method has been extended to block bootstrap simulation to account for auto-correlation (Kunsch, 1989). smoothTrend fits a generalized additive model (GAM) to monthly averaged data to provide a non-parametric description of trends using methods and functions in the package mgcv (Wood, 2004, 2006). Both functions incorporate an option to deseasonalise data prior to analysis using the stl function in stats (Cleveland et al, 1990). The other function, linearRelation, uses a rolling window linear regression method to visualise the degree of change in the relations between two species over larger timescales.

• windRose generates a traditional ‘wind rose’ style plot of wind speed and direction. The associated wrapper function pollutionRose allows the user to substitute wind speed with another data frame field, most commonly a pollutant concentration time-series, to produce ‘pollution roses’ similar to those used by Henry et al (2009).

• polarFreq, polarPlot and polarAnnulus are a family of functions that extend polar visualisations. In its default form polarFreq provides an alternative wind speed/direction description to windRose, but pollutant and statistic arguments can also be included in the call to produce a wide range of other polar data summaries. polarPlot uses mgcv::gam to fit a surface to data in the form polarFreq(..., statistic = "mean") to provide a useful visualisation tool for pollutant source characterisation. This point is illustrated by Figure 3, which shows three related plots. Figure 3 left is a basic polar presentation of mean NOx concentrations produced using polarFreq(mydata, "nox", statistic = "mean"). Figure 3 middle is a comparable polarPlot, which although strictly less quantitatively accurate, greatly simplifies the identification of the main features, namely a broad high concentration feature to the Southwest with a maximum at lower wind speeds (indicating a local source) and a lower concentration but more resolved high wind speed feature to the East (indicating a more distant source). Then, finally, Figure 3 right presents a similar polarPlot of SO2, which demonstrates that the local source (most likely near-by traffic) is relatively NOx rich while the more distant Easterly feature (most likely power station emissions) is relatively SO2 rich. This theme is discussed in further detail in Carslaw et al (2006). The polarAnnulus function provides an alternative polar visualisation based on wind direction and time frequency (e.g. hour of day, day of year, etc.) to explore similar diagnostics to those discussed above with regard to trendLevel.

Figure 3: openair plots generated using (left) polarFreq(mydata, "nox", statistic = "mean"); (middle) polarPlot(mydata, "nox"); and, (right) polarPlot(mydata, "so2").

• Likewise, timeVariation generates a range of hour-of-the-day, day-of-the-week and month-of-the-year plots that can provide useful insights regarding the time frequency of one or more pollutants.

• calendarPlot presents daily data in a conventional calendar-style layout. This is a highly effective format for the presentation of information, especially when working with non-experts.

• Air quality standards are typically defined in terms of upper limits that concentrations of a particular pollutant may not exceed, or may only exceed a number of times in any year. kernelExceed uses a kernel density function (.smoothScatterCalcDensity in grDevices) to map the distribution of such exceedances relative to two other parameters. The function was developed for use with the daily mean (European) limit value for PM10 (airborne particulate matter up to 10 micrometers in size) of 50 µg/m³, not to be exceeded on more than 35 days, and in its default form plots PM10 exceedances relative to wind speed and direction.

• openair also includes a number of specialist functions. The calcFno2 function provides an estimate of primary NO2 emissions ratios, a question of particular concern for the air quality community at the moment. Associated theory is provided in Carslaw and Beevers (2005), and further details of the function’s use are given in the extended version of the openair manual (http://www.openair-project.org). Functions modStats and conditionalQuantile were developed for model evaluation.

Utilities and workhorse functions

The openair package includes a number of utilities and workhorse functions that can be directly accessed by users and therefore used more generally. These include:

• cutData, a workhorse function for data conditioning, intended for use with the type option in main openair functions. It accepts a data frame and returns the conditioned form:

head(olddata)

                 date   ws  wd nox
1 1998-01-01 00:00:00 0.60 280 285
2 1998-01-01 01:00:00 2.16 230  NA
3 1998-01-01 02:00:00 2.76 190  NA
4 1998-01-01 03:00:00 2.16 170 493
5 1998-01-01 04:00:00 2.40 180 468
6 1998-01-01 05:00:00 3.00 190 264

newdata <- cutData(olddata,
                   type = "hour")
head(newdata)

                 date   ws  wd nox hour
1 1998-01-01 00:00:00 0.60 280 285   00
2 1998-01-01 01:00:00 2.16 230  NA   01
3 1998-01-01 02:00:00 2.76 190  NA   02
4 1998-01-01 03:00:00 2.16 170 493   03
5 1998-01-01 04:00:00 2.40 180 468   04
6 1998-01-01 05:00:00 3.00 190 264   05
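The "hour" conditioning shown in this example boils down to a conventional format.POSIXt operation; a base-R sketch, rebuilding the six rows printed above by hand, is:

```r
## Base-R sketch of what cutData(olddata, type = "hour") does for
## this case (illustrative only; cutData handles many more cases).
olddata <- data.frame(
  date = as.POSIXct("1998-01-01", tz = "GMT") + 3600 * (0:5),
  ws   = c(0.60, 2.16, 2.76, 2.16, 2.40, 3.00),
  wd   = c(280, 230, 190, 170, 180, 190),
  nox  = c(285, NA, NA, 493, 468, 264))

newdata <- olddata
## hour-of-day as a factor, matching the hour column shown above
newdata$hour <- factor(format(newdata$date, "%H"))
head(newdata)
```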


Here type can be the name of one of the fields in the data frame or one of a number of predefined terms. Data frame fields are handled pragmatically, e.g. factors are returned unmodified, characters are converted to factors, and numerics are subset by stats::quantile. By default numerics are converted into four quantile bands, but this can be modified using the additional option n.levels. Predefined terms include "hour" (hour-of-day), "daylight" (daylight or nighttime), "weekday" (day-of-week), "weekend" (weekday or weekend), "month" (month-of-year), "monthyear" (month and year), "season" (season-of-year), "gmtbst" (GMT or BST) and "wd" (wind direction sector).

With the exception of "daylight", "season", "gmtbst" and "wd", these are wrappers for conventional format.POSIXt operations commonly required by openair users.

"daylight" conditions the data relative to estimated sunrise and sunset to give either daylight or nighttime. The cut is actually made by a specialist function, cutDaylight, but is more conveniently accessed via cutData or the main functions. The ‘daylight’ estimation, which is valid for dates between 1901 and 2099, is made using the measurement date, time and location and astronomical algorithms to estimate the relative positions of the Sun and the measurement location on the Earth’s surface, and is based on NOAA methods (http://www.esrl.noaa.gov/gmd/grad/solcalc/). Date and time are extracted from the date field but can be modified using the additional option local.hour.offset, and location is determined by latitude and longitude, which should be supplied as additional options.

"season" conditions the data by month-of-year (as "month") and then regroups the data into the four seasons. By default, the operation assumes the measurement was made in the northern hemisphere (winter = December, January and February, spring = March, April and May, etc.), but this can be reset using the additional option hemisphere (hemisphere = "southern" returns winter = June, July and August, spring = September, October and November, etc.). Note: for convenience/traceability these are uniquely labelled, i.e. northern hemisphere: winter (DJF), spring (MAM), summer (JJA), autumn (SON); southern hemisphere: winter (JJA), spring (SON), summer (DJF), autumn (MAM).

"gmtbst" (and the alternative form "bstgmt") conditions the data according to daylight saving. The operation returns two cases, GMT or BST, for measurement date/times where daylight saving was or was not enforced (or, more directly, where GMT and BST time stamps are or are not equivalent), respectively, and resets the date field to local time (BST). Man-made sources, such as NOx emissions from vehicles in urban areas where daylight saving is enforced, will tend to be associated with local time. So, for example, higher ‘rush-hour’ concentrations will tend to be associated with BST time stamps (and remain relatively unaffected by the BST/GMT case). By contrast, a variable such as wind speed or temperature that is not as directly influenced by daylight saving should show a clear BST/GMT shift when expressed in local time. Therefore, when used with an openair function such as timeVariation, this type conditioning can help determine whether variations in pollutant concentrations are driven by man-made emissions or natural processes.

"wd" conditions the data by the conventional eight wind sectors: N (0-22.5° and 337.5-360°), NE (22.5-67.5°), E (67.5-112.5°), SE (112.5-157.5°), S (157.5-202.5°), SW (202.5-247.5°), W (247.5-292.5°) and NW (292.5-337.5°).

• selectByDate and splitByDate are functions for conditioning and subsetting a data frame using a range of format.POSIXt operations and options similar to cutData. These are mainly intended as a convenient alternative for newer openair users.

• drawOpenKey generates a range of colour keys used by other openair functions. It is a modification of the lattice::draw.colorkey function, and here we gratefully acknowledge the help and support of Deepayan Sarkar in providing both draw.colorkey and significant advice on its use and modification. The more widely used colour key operations can be accessed from main openair functions using standard options, e.g. key.position (= "right", "left", "top" or "bottom"), and key.header and key.footer (text to be added as headers and footers, respectively, on the colour key). In addition, finer control can be obtained using the option key, which should be a list. key is similar to key in lattice::draw.colorkey but allows the additional components: header (a character vector of text or list including header text and formatting options for text to be added above the colour key), footer (as header but for below the colour key), auto.text (TRUE/FALSE for using the openair workhorse quickText), and tweaks, fit, slot (a range of options to control the relative positions of header, footer and the colour key) and plot.style (a list of options controlling the type of colour key produced). One simple example of the use of


drawOpenKey(key(plot.style)) is the paddle scale in windRose - compare windRose(mydata) and windRose(mydata, paddle = FALSE). drawOpenKey can be used with other lattice outputs using methods previously described by Deepayan Sarkar (Sarkar, 2008, 2009), e.g.:

## some silly data and colour scale
range <- 1:200; cols <- rainbow(200)

## my.key -- an openair plot key
my.key <- list(col = cols, at = range,
               space = "right",
               header = "my header",
               footer = "my footer")

## pass to lattice::xyplot
xyplot(range ~ range,
       cex = range/10, col = cols,
       legend = list(right =
           list(fun = drawOpenKey,
                args = list(key = my.key)
       )))

Figure 4: Trivial example of the use of drawOpenKey with a lattice plot.

• openColours is a workhorse function that produces a range of colour gradients for other openair plots. The main option is scheme, and this can be either the name of a pre-defined colour scheme, e.g. "increment", "default", "brewer1", "heat", "jet", "hue", "greyscale", or two or more terms that can be coerced into valid colours and between which a colour gradient can be extrapolated, e.g. c("red", "blue"). In most openair plot functions openColours(scheme) can be accessed using the common option cols. Most of the colour gradient operations are based on standard R and RColorBrewer (Neuwirth, 2011) methods. The "greyscale" scheme is a special case intended for those producing figures for use in black and white reports that also automatically resets some other openair plot parameters (e.g. strip backgrounds to white, and other preset text, point and line colours to black or a ‘best guess’ grey).

• quickText is a look-up style wrapper function that applies routine text formatting to expressions widely used in the air quality community, e.g. the super- and sub-scripting of chemical names and measurement units. The function accepts a ‘character’ class object and returns it as an ‘expression’ with any recognised text reformatted according to common convention. Labels in openair plots (xlab, ylab, main, etc) are typically passed via quickText. This, for example, handles the automatic formatting of the colour key and axes labels in Figures 1–3, where the inputs were data frame field names, "nox", "no2", etc (most convenient for command line entry), and the outputs followed the conventional chemical naming convention (International Union of Pure and Applied Chemistry, IUPAC, nomenclature), NOx, NO2, etc. quickText can also be used as a label wrapper with non-openair plots, e.g.:

my.lab <- "Particulates as pm10, ug/m3"
plot(pm10 ~ date, data = mydata[1:1000,],
     ylab = quickText(my.lab))

(While many of us regard ‘expressions’ as trivial, such label formatting can be particularly confusing for those new to both programming languages and command line work, and functions like quickText really help those users to focus on the data rather than becoming frustrated with the periphery.)

Figure 5: Trivial example of the use of quickText outside openair.
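The look-up idea behind quickText can be illustrated with a toy base-R formatter. The helper below (tiny_quicktext, a hypothetical name, not the openair implementation) maps a couple of known labels to plotmath expressions and falls back to the raw string otherwise:

```r
## Toy illustration of the quickText look-up idea (hypothetical
## helper; openair's quickText covers far more cases and units).
tiny_quicktext <- function(x) {
  lookup <- list(nox  = expression(NO[x]),
                 pm10 = expression(PM[10]))
  hit <- lookup[[tolower(x)]]
  if (!is.null(hit)) hit else x   # fall back to the unformatted label
}

## used as an axis label, e.g.:
plot(1:10, ylab = tiny_quicktext("pm10"))
```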


• timeAverage is a function for the aggregation of openair data frames using the 'POSIXt' field "date". By default it calculates the day average using an operation analogous to mean(..., na.rm = TRUE), but additional options also allow it to be used to calculate a wide range of other statistics. The aggregation interval can be set using the option avg.time, e.g. "hour", "2 hour", etc. (cut.POSIXt conventions). The statistic can be selected from a range of pre-defined terms, including "mean", "max", "min", "median", "sum", "frequency" (using length), "sd" and "percentile". If statistic = "percentile", the default 95 (%) may be modified using the additional option percentile. The data capture threshold can be set using data.thresh, which defines the percentage of valid cases required in any given (aggregation) period where the statistics are to be calculated.

  While for the most part timeAverage could be considered merely a convenient wrapper for a number of more complex 'POSIXt' aggregation operations, one important point that should be emphasised is that it handles the polar measure wind direction correctly. Assuming wind direction and speed are supplied as the data frame fields "wd" and "ws", respectively, these are converted to wind vectors and the average wind direction is calculated using these. If this were not done and wind direction averages were calculated from "wd" alone, then measurements about North (e.g. 355–360° and 0–5°) would average at about 180°, not 0° or 360°.

Output class

Many of the main functions in openair return an object of "openair" class, e.g.:

#From:
[object] <- openair.function(...)

#object structure
[object]  #list[S3 "openair"]
  $call   [function call]
  $data   [data.frame generated/used in plot]
          [or list if multiple part]
  $plot   [plot from function]
          [or list if multiple part]

#Example
ans <- windRose(mydata)
ans

openair object created by:
  windRose(mydata = mydata)
this contains:
  a single data frame:
    $data [with no subset structure]
  a single plot object:
    $plot [with no subset structure]

Associated generic methods (head, names, plot, print, results, summary) have a common structure:

#method structure for openair generics
[generic method].[class]                #method.name
  ([object], [class-specific options],
   [method-specific options])           #options

As would be expected, most .openair methods work like the associated .default methods, and object- and method-specific options are based on those of the .default method. Typically, openair methods return outputs consistent with the associated .default method unless either $data or $plot have multiple parts, in which case outputs are returned as lists of data frames or plots, respectively. The main class-specific option is subset, which can be used to select specific sub-data or sub-plots if these are available. The local method results extracts the data frames generated during the plotting process. Figure 6 shows some trivial examples of current usage.

Conclusions and future directions

As with many other R packages, the feedback process associated with users and developers working in an open-source environment means that openair is subject to continual optimisation. As a result, openair will undoubtedly continue to evolve through future versions. Obviously, the primary focus of openair will remain the development of tools to help the air quality community make better use of their data. However, as part of this work we recognise that there is still work to be done.

One area that is likely to undergo significant updates is the use of classes and methods. The current 'S3' implementation of the output class is crude, and future options currently under consideration include improvements to plot.openair and print.openair, the addition of an update.openair method (for reworking openair plots), the release of openairApply (a currently un-exported wrapper function for apply-type operations with openair objects), and the migration of the object to 'S4'.

In light of the progress made with the output class, we are also considering the possibility of replacing the current simple data frame input with a dedicated class structure, as this could provide access to extended capabilities such as measurement unit management.

Figure 6: Trivial examples of "openair" object handling, with the outputs of plot(ans) and plot(ans, subset = "hour") shown as inserts right top and right bottom, respectively.


One other particularly exciting component of recent work on openair is international compatibility. The initial focus of the openair package was very much on the air quality community in the UK. However, distribution of the package through CRAN has meant that we now have an international user group with members in Europe, the United States, Canada, Australia and New Zealand. This is obviously great. However, it has also brought with it several challenges, most notably in association with local time stamp and language formats. Here, we greatly acknowledge the help of numerous users and colleagues who bug-tested and provided feedback as part of our work to make openair less 'UK-centric'. We will continue to work on this aspect of openair.

We also greatly acknowledge those in our current user group who were less familiar with programming languages and command lines but who took a real 'leap of faith' in adopting both R and openair. We will continue to work to minimise the severity of the learning curves associated with both the uptake of openair and the subsequent move from using openair in a standalone fashion to its much more efficient use as part of R.

Bibliography

D. Carr, N. Lewin-Koh and M. Maechler. hexbin: Hexagonal Binning Routines. R package version 1.26.0. URL http://CRAN.R-project.org/package=hexbin.

D.C. Carslaw and S.D. Beevers. Estimations of road vehicle primary NO2 exhaust emission fractions using monitoring data in London. Atmospheric Environment, 39(1):167–177, 2005.

D.C. Carslaw, S.D. Beevers, K. Ropkins and M.C. Bell. Detecting and quantifying aircraft and other on-airport contributions to ambient nitrogen oxides in the vicinity of a large international airport. Atmospheric Environment, 40(28):5424–5434, 2006.

D.C. Carslaw and K. Ropkins. openair: Open-source tools for the analysis of air pollution data. R package version 0.4-14. URL http://www.openair-project.org/.

J.C. Chow and J.G. Watson. New Directions: Beyond compliance air quality measurements. Atmospheric Environment, 42:5166–5168, 2008.

R.B. Cleveland, W.S. Cleveland, J.E. McRae and I. Terpenning. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, 6:3–73, 1990.

R. Gentleman, V.J. Carey, D.M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A.J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J.Y.H. Yang and J. Zhang. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology, 5:R80, 2004. URL http://www.bioconductor.org/.

D. Helsel and R. Hirsch. Statistical methods in water resources. US Geological Survey. URL http://pubs.usgs.gov/twri/twri4a3/.

R. Henry, G.A. Norris, R. Vedantham and J.R. Turner. Source Region Identification Using Kernel Smoothing. Environmental Science and Technology, 43(11):4090–4097, 2009.

R.M. Hirsch, J.R. Slack and R.A. Smith. Techniques of trend analysis for monthly water-quality data. Water Resources Research, 18(1):107–121, 1982.

H.R. Kunsch. The jackknife and the bootstrap for general stationary observations. Annals of Statistics, 17(3):1217–1241, 1989.

C.A. McHugh, D.J. Carruthers and H.A. Edmunds. ADMS and ADMS-Urban. International Journal of Environment and Pollution, 8(3–6):438–440, 1997.

E. Neuwirth. RColorBrewer: ColorBrewer palettes. R package version 1.0-5, 2011. URL http://CRAN.R-project.org/package=RColorBrewer/.

D. Sarkar. Lattice: Multivariate Data Visualization with R. Springer. ISBN 978-0-387-75968-5. URL http://lmdvr.r-forge.r-project.org/.

D. Sarkar. lattice: Lattice Graphics. R package version 0.18-5. URL http://r-forge.r-project.org/projects/lattice/.

D. Sarkar and F. Andrews. latticeExtra: Extra Graphical Utilities Based on Lattice. R package version 0.6-18. URL http://CRAN.R-project.org/package=latticeExtra.

S.N. Wood. Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association, 99:673–686, 2004.

S.N. Wood. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 2006.

Karl Ropkins
Institute for Transport Studies
University of Leeds, LS2 9JT, UK
[email protected]

David C. Carslaw
King's College London
Environmental Research Group
Franklin Wilkins Building, 150 Stamford Street
London SE1 9NH, UK
[email protected]


Foreign Library Interface

by Daniel Adler

Abstract  We present an improved Foreign Function Interface (FFI) for R to call arbitrary native functions without the need for C wrapper code. Further, we discuss a dynamic linkage framework for binding standard C libraries to R across platforms using a universal type information format. The package rdyncall comprises the framework and an initial repository of cross-platform bindings for standard libraries such as (legacy and modern) OpenGL, the family of SDL libraries and Expat. The package enables system-level programming using the R language; sample applications are given in the article. We outline the underlying automation tool-chain that extracts cross-platform bindings from C headers, making the repository extendable and open for library developers.

Introduction

We present an improved Foreign Function Interface (FFI) for R that significantly reduces the amount of C wrapper code needed to interface with C. We also introduce a dynamic linkage that binds the C interface of a pre-compiled library (as a whole) to an interpreted programming environment such as R, hence the name Foreign Library Interface. Table 1 gives a list of the C libraries currently supported across major R platforms. For each library supported, abstract interface specifications are declared in a compact, platform-neutral, text-based format stored in a so-called DynPort file on a local repository.

R was chosen as the first language to implement a proof-of-concept implementation for this approach. This article describes the rdyncall package, which implements a toolkit of low-level facilities that can be used as an alternative FFI to interface with C. It also facilitates direct and convenient access to common C libraries from R without compilation.

The project was motivated by the fact that high-quality software solutions implemented in portable C are often not available in interpreter-based languages such as R. The pool of freely available C libraries is quite large and represents an invaluable resource for software development. For example, OpenGL (OpenGL Architecture Review Board et al., 2005) is the most portable and standard interface to accelerated graphics hardware for developing real-time graphics software. The combination of OpenGL with the Simple DirectMedia Layer (SDL) (Lantinga, 2009) core and extension libraries offers a foundation framework for developing interactive multimedia applications that can run on a multitude of platforms.

Foreign function interfaces

FFIs provide the backbone of a language to interface with foreign code. Depending on the design of this service, it can largely unburden developers from writing additional wrapper code. In this section, we compare the built-in R FFI with that provided by rdyncall. We use a simple example that sketches the different work flow paths for making an R binding to a function from a foreign C library.

FFI of base R

Suppose that we wish to invoke the C function sqrt of the Standard C Math library. The function is declared as follows in C:

double sqrt(double x);

The .C function from the base R FFI offers a call gate to C code with very strict conversion rules and strong limitations regarding argument and return types: R arguments are passed as C pointers, and C return types are not supported, so only C void functions, which are procedures, can be called. Given these limitations, we are not able to invoke the foreign sqrt function directly; intermediate wrapper code written in C is needed:

#include <math.h>

void R_C_sqrt(double * ptr_to_x)
{
  double x = ptr_to_x[0], ans;
  ans = sqrt(x);
  ptr_to_x[0] = ans;
}

We assume that the wrapper code is deployed as a shared library in a package named testsqrt which links to the Standard C Math library¹. Then we load the testsqrt package and call the C wrapper function directly via .C:

> library(testsqrt)
> .C("R_C_sqrt", 144, PACKAGE = "testsqrt")
[[1]]
[1] 12

To make sqrt available as a public function, an additional R wrapper layer is needed to carry out type-safety checks:

¹We omit here details such as registering C functions, which is described in detail in the R manual 'Writing R Extensions' (R Development Core Team, 2010).


Lib/DynPort   Description                       Functions  Constants  Struct/Union
GL            OpenGL                            336        3254       -
GLU           OpenGL Utility                    59         155        -
R             R library                         238        700        27
SDL           Audio/Video/UI abstraction        201        416        34
SDL_image     Pixel format loaders              35         3          -
SDL_mixer     Music format loaders and playing  71         27         -
SDL_net       Network programming               34         5          3
SDL_ttf       Font format loaders               38         7          -
cuda          GPU programming                   387        665        84
expat         XML parsing framework             65         70         -
glew          GL extensions                     1857       -          -
gl3           OpenGL 3 (strict)                 317        838        1
ode           Rigid Physics Engine              547        109        11
opencl        GPU programming                   79         263        10
stdio         Standard I/O                      75         3          -

Table 1: Overview of available DynPorts for portable C libraries

sqrtViaC <- function(x)
{
  x <- as.numeric(x)  # type(x) should be C double.
  # make sure length > 0:
  length(x) <- max(1, length(x))
  .C("R_C_sqrt", x, PACKAGE = "example")[[1]]
}

We can conclude that, in realistic settings, the built-in FFI of R almost always needs support by a wrapper layer written in C. The "foreign" in the FFI of base R is in fact relegated to the C wrapper layer.

FFI of rdyncall

rdyncall provides an alternative FFI for R that is accessible via the function .dyncall. In contrast to the base R FFI, which uses a C wrapper layer, the sqrt function is invoked dynamically and directly by the interpreter at run-time. Whereas the Standard C Math library was loaded implicitly via the testsqrt package, it now has to be loaded explicitly.

R offers functions to deal with shared libraries at run-time, but the location has to be specified as an absolute file path, which is platform-specific. A platform-portable solution is discussed below in the section Portable loading of shared libraries. For now, we assume that the example is done on Mac OS X, where the Standard C Math library has the file path '/usr/lib/libm.dylib':

> libm <- dyn.load("/usr/lib/libm.dylib")
> sqrtAddr <- libm$sqrt$address

We first need to load the R package rdyncall:

> library(rdyncall)

Finally, we invoke the foreign C function sqrt directly via .dyncall:

> .dyncall(sqrtAddr, "d)d", 144)
[1] 12

The last call pinpoints the core solution for a direct invocation of foreign code within R: the first argument specifies the address of the foreign code, given as an external pointer. The second argument is a call signature that specifies the argument and return types of the target C function. The string "d)d" specifies that the foreign function expects a double scalar argument and returns a double scalar value, in accordance with the C declaration of sqrt. Arguments following the call signature are passed to the foreign function in the form specified by the call signature. In the example we pass 144 as a C double argument and receive a C double value converted to an R numeric.

Call signatures

The introduction of a type descriptor for foreign functions is a key component that makes the FFI flexible and type-safe. The format of the call signature has the following pattern:

argument-types ')' return-type

The signature can be derived from the C function declaration: argument types are specified first, in the direct left-to-right order of the corresponding C function prototype declaration, and are terminated by the symbol ')' followed by a single return-type signature.

Almost all fundamental C types are supported and there is no restriction² regarding the number of arguments supported to issue a call. Table 2 gives an

²The maximum number of arguments is limited by the amount of memory required for prebuffering a single call. It is currently fixed to 4 kilobytes (approx. 512-1024 arguments).


Type           Sign.     Type                 Sign.
void           v         bool                 B
char           c         unsigned char        C
short          s         unsigned short       S
int            i         unsigned int         I
long           j         unsigned long        J
long long      l         unsigned long long   L
float          f         double               d
void*          p         struct name*         *<name>
type*          *...      const char*          Z

Table 2: C/C++ Types and Signatures

C function declaration                                     Call signature
void rsort_with_index(double*, int*, int n)                *d*ii)v
SDL_Surface* SDL_SetVideoMode(int, int, int, Uint32_t)     iiiI)*<SDL_Surface>
void glClearColor(GLfloat, GLfloat, GLfloat, GLfloat)      ffff)v

Table 3: Examples of C functions and corresponding call signatures

overview of supported C types and the corresponding text encoding; Table 3 provides some examples of C functions and call signatures.

A public R function that encapsulates the details of the sqrt call is simply defined by

> sqrtViaDynCall <- function(...)
+   .dyncall(sqrtAddr, "d)d", ...)

No further guard code is needed here because .dyncall has built-in type checks that are specified by the signature. In contrast to the R wrapper code using .C, no explicit cast of the arguments via as.numeric is required, because automatic coercion rules for fundamental types are implemented as specified by the call signature. For example, using the integer literal 144L instead of a double works here as well:

> sqrtViaDynCall(144L)
[1] 12

If any incompatibility is detected, such as a wrong number of arguments, empty atomic vectors or incompatible type mappings, the invocation is aborted and an error is reported without risking an application crash.

Pointer type arguments, expressed via 'p', are handled differently. The type signature 'p' indicates that the argument is an address. When passing R atomic vectors, the C argument value is the address of the first element of the vector. External pointers and the NULL object can also be passed as values for pointer type arguments. Automatic coercion is deliberately not implemented for pointer types. This is to support C functions that write into memory referenced by out pointer types.

Typed pointers, specified by the prefix '*' followed by the signature of the base type, offer a measure of type-safety for pointer types; if an R vector is passed and the R atomic type does not match the base type, the call will be rejected. Typed pointers to C struct and union types are also supported; they are briefly described in the section Handling of C types in R.

In contrast to the R FFI, where the argument conversion is dictated solely by the R argument type at call-time in a one-way fashion, the introduction of an additional specification with a call signature gives several advantages:

• Almost all possible C functions can be invoked by a single interface; no additional C wrapper is required.

• The built-in type-safety checks enhance stability and significantly reduce the need for assertion code.

• The same call signature works across platforms, given that the C function type remains constant.

• Given that our FFI is implemented in multiple languages (e.g. Python, Ruby, Perl, Lua), call signatures represent a universal type description for C libraries.

Package overview

Besides dynamic calling of foreign code, the package provides essential facilities for interoperability between the R and C programming languages. An


Figure 1: Package Overview

overview of the components that make up the package is given in Figure 1.

We have already described the .dyncall FFI. It is followed by a brief description of portable loading of shared libraries using dynfind, installation of wrappers via dynbind, handling of foreign data types via new.struct and wrapping of R functions as C callbacks via new.callback. Finally, the high-level dynport interface for accessing whole C libraries is briefly discussed. The low-level technical details of some components are described briefly in the section Architecture.

Portable loading of shared libraries

The portable loading of shared libraries across platforms is not trivial because the file path is different across operating systems. Referring back to the previous example, to load a particular library in a portable fashion, one would have to check the platform to locate the C library.³

Although there is variation among the operating systems, library file paths and search patterns have common structures. For example, among all the different locations, prefixes and suffixes, there is a part within a full library filename that can be taken as a short library name or label. The function dynfind takes a list of short library names to locate a library using common search heuristics. For example, to load the Standard C Math library, depending on the operating system the library is either the Microsoft Visual C Run-Time DLL labeled 'msvcrt' on Windows or the Standard C Math shared library labeled 'm' or 'c' on other operating systems:

> mLib <- dynfind(c("msvcrt","m","c"))

dynfind also supports more exotic schemes, such as Mac OS X Framework folders. Depending on the library, it is sometimes enough to have a single short filename, e.g. "expat" for the Expat library.

Wrapping C libraries

Functional R interfaces to foreign code can be defined with small R wrapper functions, which effectively delegate to .dyncall. Each function interface is parameterized by a target address and a matching call signature:

f <- function(...) .dyncall(target, signature, ...)

Since an Application Programming Interface (API) often consists of hundreds of functions (see Table 1), dynbind can create and install a batch of function wrappers for a library with a single call by using a library signature that consists of concatenated function names and signatures separated by semicolons. For example, to install wrappers to the C functions sqrt, sin and cos from the math library, one could use

> dynbind( c("msvcrt","m","c"),
+          "sqrt(d)d;sin(d)d;cos(d)d;" )

The function call has the side-effect that three R wrapper functions are created and stored in an environment that defaults to the global environment. Let us review the sin wrapper (on the 64-bit version of R running on Mac OS X 10.6):

> sin
function (...)
.dyncall.cdecl(<pointer: 0x...>, "d)d", ...)

The wrapper directly uses the address of the sin symbol from the Standard C Math library. In addition, the wrapper uses .dyncall.cdecl, which is a concrete selector of a particular calling convention, as outlined below.

³Possible C math library names are 'libm.so' and 'MSVCRT.DLL' in locations such as '/lib', '/usr/lib', '/lib64', '/usr/lib64', 'C:\WINDOWS\SYSTEM32' etc.


Calling conventions

Calling conventions specify how arguments and return values are passed across sub-routines and functions at machine level. This information is vital for interfacing with the binary interface of C libraries. The package has support for multiple calling conventions. Calling conventions are controlled by .dyncall via the named argument callmode to specify a non-default calling convention.

Most supported operating systems and platforms only have support for a single "default" calling convention at run-time. An exception to this is the Microsoft Windows platform on the Intel i386 processor architecture: while the default C calling convention on i386 (excluding Plan9) is "default", system shared libraries from Microsoft such as 'KERNEL32.DLL' and 'USER32.DLL', as well as the OpenGL library 'OPENGL32.DLL', use the "stdcall" calling convention. Only on this platform does the callmode argument have an effect; all other platforms currently ignore this argument.

Handling of C types in R

C APIs often make use of high-level C struct and union types for exchanging information. Thus, to make interoperability work at that level, the handling of C data types is addressed by the package.

To illustrate this concept we consider the following example: a user-interface library has a function to set the 2D coordinates and dimension of a graphical output window. The coordinates are specified using a C struct Rect data type, and the C function receives a pointer to that object:

void setWindowRect(struct Rect *pRect);

The structure type is defined as follows:

struct Rect {
  short x, y;
  unsigned short w, h;
};

Before we can issue a call, we have to allocate an object of that size and initialize the fields with values encoded in C types that are not part of the supported set of R data types. The framework provides helper functions and objects to deal with C data types. Type information objects can be created with a description of the C structure type. First, we create a type information object in R for the struct Rect C data type with the function parseStructInfos, using a structure type signature:

> parseStructInfos("Rect{ssSS}x y w h;")

After registration, an R object named Rect is installed that contains C type information corresponding to struct Rect. The format of a structure type signature has the following pattern:

Struct-name '{' Field-types '}' Field-names ';'

Field-types use the same type signature encoding as that of call signatures for argument and return types (Table 2). Field-names consist of a list of whitespace-separated names that label each field component, left to right.

An instance of a C type can be allocated via new.struct:

> r <- new.struct(Rect)

Finally, the extraction ('$', '[') and replacement ('$<-', '[<-') operators can be used to access structure fields symbolically. During value transfer between R and C, automatic conversion of values with respect to the underlying C field type takes place.

> r$x <- -10 ; r$y <- -20 ; r$w <- 40 ; r$h <- 30

In this example, R numeric values are converted on the fly to signed and unsigned short integers (usually 16-bit values). On printing r, a detailed picture of the data object is given:

> r
struct Rect {
  x: -10
  y: -20
  w: 40
  h: 30
}

At low level, one can see that r is stored as an R raw vector object:

> r[]
[1] f6 ff ec ff 28 00 1e 00
attr(,"struct")
[1] "Rect"

To follow the example, we issue a foreign function call to setWindowRect via .dyncall and pass in the r object, assuming the library is loaded and the symbol is resolved and stored in an external pointer object named setWindowRectAddr:

> .dyncall(setWindowRectAddr, "*<Rect>)v", r)

We make use of a typed pointer expression '*<Rect>' instead of the untyped pointer signature 'p', which would also work but does not prevent R users from passing other objects that do not reference a struct Rect data object. Typed pointer expressions increase type-safety and use the pattern '*<Type-Name>'. The invocation will be rejected if the argument passed in is not of C type Rect. As r is tagged with a struct attribute that refers to Rect, the call will be issued.

Typed pointers can also occur as return types; these permit the manipulation of returned objects in the same symbolic manner as above. C union types are supported as well, but use the parseUnionInfos function instead for registration, and a slightly different signature format:


Union-name '|' Field-types '}' Field-names ';'

The underlying low-level C type read and write operations, and conversions from R data types, are performed by the functions .pack and .unpack. These can be used for various low-level operations as well, such as dereferencing of pointers to pointers. R objects such as external pointers and atomic raw, integer and numeric vectors can be used as C struct/union types via the attribute struct. To cast a type in the style of C, one can use as.struct.

Wrapping R functions as C callbacks

Some C libraries, such as user-interface toolkits and I/O processing frameworks, use callbacks as part of their interface to enable registration and activation of user-supplied event handlers. A callback is a user-defined function that has a library-defined function type. Callbacks are usually registered via a registration function offered by the library interface and are activated later from within a library run-time context.

rdyncall has support for wrapping ordinary R functions as C callbacks via the function new.callback. Callback wrappers are defined by a callback signature and the user-supplied R function to be wrapped. Callback signatures look very similar to call signatures and should match the functional type of the underlying C callback. new.callback returns an external pointer that can be used as a low-level function pointer for the registration as a C callback. See the section Parsing XML using Expat below for applications of new.callback.

Foreign library interface

At the highest level, rdyncall provides the front-end function dynport to dynamically set up an interface to a C Application Programming Interface. This includes loading of the corresponding shared C library and resolving of symbols. During the binding process, a new R name space (Tierney, 2003) will be populated with thin R wrapper objects that represent abstractions to C counterparts such as functions, pointers-to-functions, type-information objects for C struct and union types, and symbolic constant equivalents of C enums and macro definitions. The mechanism works across platforms, as long as the corresponding shared libraries of a DynPort have been installed in a system standard location on the host.

An initial repository of DynPorts is available in the package that provides bindings for several popular C APIs; see Table 1 for available bindings. The R interface to C libraries looks very similar to the actual C API. For details on the usage of a particular C library, the programming manuals and documentation of the libraries should be consulted.

Before loading R bindings via dynport, the shared library should have been installed onto the system. Currently this is to be done manually, and the installation method depends on the target operating system. While OpenGL and Expat are often pre-installed on typical desktop systems, SDL usually has to be installed explicitly, as described in the package; see ?'rdyncall-demos' for details.

Sample applications

We give examples that demonstrate the direct usage of C APIs from within R through the rdyncall package.

OpenGL programming in R

Figure 2: demo(SDL)

In the first example, we make use of the Simple DirectMedia Layer library (SDL) (Pendleton, 2003) and the Open Graphics Library (OpenGL) (OpenGL Architecture Review Board et al., 2005) to implement a portable multimedia application skeleton in R.

We first need to load bindings to SDL and OpenGL via dynport:

> dynport(SDL)
> dynport(GL)

Now we initialize the SDL library, e.g. we initialize the video subsystem, and open a 640x480 window surface in 32-bit colour depth with support for OpenGL rendering:

> SDL_Init(SDL_INIT_VIDEO)
> surface <- SDL_SetVideoMode(640, 480, 32, SDL_OPENGL)

Next, we implement the application loop which updates the display repeatedly and processes the event queue until a quit request is issued by the user via the window close button.

The R Journal Vol. 4/1, June 2012 ISSN 2073-4859 36 CONTRIBUTED RESEARCH ARTICLES

> mainloop <- function()
  {
    ev <- new.struct(SDL_Event)
    quit <- FALSE
    while(!quit) {
      draw()
      while(SDL_PollEvent(ev)) {
        if (ev$type == SDL_QUIT) {
          quit <- TRUE
        }
      }
    }
  }

SDL event processing is implemented by collecting events that occur in a queue. Typical SDL applications poll the event queue once per update frame by calling SDL_PollEvent with a pointer to a user-allocated buffer of C type union SDL_Event. Event records have a common type identifier, which is set to SDL_QUIT when a quit event has occurred, e.g. when the user presses the close button of a window.

Next we implement our draw function making use of the OpenGL API. We clear the background with a blue color and draw a light-green rectangle.

> draw <- function()
  {
    glClearColor(0,0,1,0)
    glClear(GL_COLOR_BUFFER_BIT)
    glColor3f(0.5,1,0.5)
    glRectf(-0.5,-0.5,0.5,0.5)
    SDL_GL_SwapBuffers()
  }

Now we can run the application mainloop.

> mainloop()

To stop the application, we press the close button of the window. A similar example is also available via demo(SDL). Here the draw function displays a rotating 3D cube, displayed in Figure 2.

demo(randomfield) gives a slightly more scientific application of OpenGL and R: random fields of 512x512 size are generated via blending of 5000 texture-mapped 2D Gaussian kernels. The counter in the window title bar gives the number of matrices generated per second (see Figure 3). When clicking on the animation window, the current frame and matrix is passed to R and plotted. While several dozen matrices are computed and drawn per second using OpenGL, it takes several seconds to plot a single matrix in R using image().

Figure 3: demo(randomfield)

Parsing XML using Expat

In the second example, we use the Expat XML Parser library (Clark, 2007; Kim, 2001) to implement a stream-oriented XML parser suitable for very large documents. In Expat, custom XML parsers are implemented by defining functions that are registered as callbacks to be invoked on events that occur during parsing, such as the start and end of XML tags. In our second example, we create a simple parser skeleton that prints the start and end tag names.

First we load R bindings for Expat via dynport.

> dynport(expat)

Next we create an abstract parser object via the C function XML_ParserCreate, which receives one argument of type C string to specify a desired character encoding that overrides the document encoding declaration. We want to pass a null pointer (NULL) here. In the .dyncall FFI, C null pointer values for pointer types are expressed via the R NULL value:

> p <- XML_ParserCreate(NULL)

The C interface for registering start- and end-tag event handler callbacks is given below:

/* Language C, from file expat.h: */
typedef void (*XML_StartElementHandler)
    (void *userData, const XML_Char *name,
     const XML_Char **atts);
typedef void (*XML_EndElementHandler)
    (void *userData, const XML_Char *name);
void XML_SetElementHandler(XML_Parser parser,
    XML_StartElementHandler start,
    XML_EndElementHandler end);

We implement the callbacks as R functions that print the event and tag name. They are wrapped as C callback pointers via new.callback using a matching callback signature. The second argument name of type C string in both callbacks, XML_StartElementHandler and XML_EndElementHandler, is of primary interest in this example; this argument passes over the XML tag name. C strings are handled in a special way by the .dyncall FFI because they have to be copied as R character objects. The special type signature 'Z' is used to denote a C string type. The other arguments are simply denoted as untyped pointers using 'p':

> start <- new.callback("pZp)v",
    function(ignored1,tag,ignored2)
      cat("Start tag:", tag, "\n")
  )
> end <- new.callback("pZ)v",
    function(ignored,tag)
      cat("End tag:", tag, "\n")
  )
> XML_SetElementHandler(p, start, end)

To test the parser, we create a sample document stored in a character object named text and pass it to the parse function XML_Parse:

> text <- "<hello><world/></hello>"
> XML_Parse( p, text, nchar(text), 1)

The resulting output is

Start tag: hello
Start tag: world
End tag: world
End tag: hello

Expat supports processing of very large XML documents in a chunk-based manner by calling XML_Parse several times, where the last argument is used as an indicator for the final chunk of the document.

Architecture

The core implementation of the FFI, callback wrapping and loading of code is based on small C libraries of the DynCall project (Adler and Philipp, 2011).

The implementation of the FFI is based on the dyncall C library, which provides an abstraction for making arbitrary machine-level calls, offering a universal C interface for scripting language interpreters. It has support for almost all fundamental C argument/return types (passing of long double, struct and union argument/return C value types is currently work in progress) and multiple calling conventions, and is open for extension to other platforms and binary standards. Generic call implementations for the following processor architectures are supported: Intel i386 32-bit, AMD 64-bit, PowerPC 32-bit, ARM (including Thumb extension), MIPS 32/64-bit and SPARC 32/64-bit, including support for several platform-, processor- and compiler-specific calling conventions.

The dyncallback C library implements generic callback handling. Callback handlers receive calls from C and forward the call, including conversion of arguments, to a function of a scripting-language interpreter. A subset of the architectures above is currently supported here: i386, AMD64 and ARM, with partial support for PowerPC 32-bit on Mac OS X/Darwin.

Besides the processor architecture, the libraries support various operating systems such as Linux, Mac OS X, Windows, the BSD family, Solaris, Haiku, Minix and Plan9. Support for embedded platforms such as Playstation Portable, Nintendo DS and iOS is available as well. FFI implementations for other languages such as Python (van Rossum and Drake, Jr., 2005), Lua (Ierusalimschy et al., 1996) and Ruby (Flanagan and Matsumoto, 2008) are available from the DynCall project source repository.

The source tree supports various build tools such as gcc, msvc, SunPro, pcc and llvm, and supports several make tools (BSD, C, GNU, N, Sun). A common abstraction layer for assembler dialects helps to develop cross-operating-system call kernels. Due to the generic implementation and simple design, the libraries are quite small (the dyncall library for Mac OS X/AMD64 is 24 kb).

To test the stability of the libraries, a suite of testing frameworks is available, including test-case generators with support for structured or random case studies and for testing extreme scenarios with a large number of arguments. Prior to each release, the libraries and tests are built for a large set of architectures on DynOS (Philipp, 2011), a batch-build system using CPU emulators such as QEmu (Bellard, 2005) and GXEmul (Gavare, 2010) and various operating system images, to test the release candidates and to create pre-built binary releases of the library.

Creation of DynPort files

The creation of DynPort files from C header files is briefly described next. A tool chain, comprising freely available components, is applied once on a build machine as depicted in Figure 4.

Figure 4: Tool-chain to create DynPort files from C headers

At first, a main source file references the C header files of the library that should be made accessible via dynport.

In a preprocessing phase, the GNU C Macro Processor is used to process all #include statements, using standard system search paths, to create a concatenated all-in-one source file. GCC-XML (King, 2004), a modified version of the GNU C compiler, transforms C header declarations to XML. The XML is further transformed to the final type signature format using xsltproc (Veillard and Reese, 2009), an XSLT (Clark, 2001) processor, and a custom XSL stylesheet that has been implemented for the actual transformation from GCC-XML output to the type signature text format.

C macro #define statements are handled separately by a custom C preprocessor implemented in C++ using the boost wave library (Kaiser, 2011). An optional filter stage is used to include only elements with a certain pattern, such as a common prefix usually found in many libraries, e.g. 'SDL_'. In a last step, the various fragments are assembled into a single text file that represents the DynPort file.

Limitations

During the creation of DynPort files, we encountered some cases (mainly for the SDL library) where we had to comment out some symbolic assignments (derived from C macro definitions) manually. These could not be converted as-is into valid R assignments because they consist of complex C expressions such as bit-shift operations. One could solve this problem by integrating a C interpreter within the tool-chain that deduces the appropriate type and value information from the replacement part of each C macro definition; definitions with incomplete type could be rejected and constant values could be stored in a language-neutral encoding.

In order to use a single DynPort for a given C library across multiple platforms, its interface must be constant across platforms. DynPort does not support the conditional statements of the C preprocessor. Thus interfaces that use different types for arguments or structure fields depending on the architecture cannot be supported in a universal manner. For example, the Objective-C run-time C library of Mac OS X uses a different number of fields within certain struct data types depending on whether the architecture is i386 or AMD64, in which case padding fields are inserted in the middle of the structure. We are aware of this problem, although we have not encountered a conflict with the given palette of C libraries available via DynPorts to R. A possible workaround for such cases would be to offer separate DynPorts for different architectures.

dyncall and dyncallback currently lack support for handling long double, struct and union argument and return value types and architecture-specific vector types. Work is in progress to overcome this limitation. The middleware BridJ (Chafik, 2011) for the Java VM and C/C++ libraries, which uses dyncall, provides support for passing struct value types on a number of i386 and AMD64 platforms.

R character strings have a maximum size that can limit the number of library functions per dynbind function call. An improved DynPort file format and parser are being developed and are already available for luadyncall.

This version of DynPort does not capture the full range of the C type system. For example, array and bit-field types are not supported; the pointer-to-function type in an argument list can only be specified using the void pointer '*v' or 'p' instead of a (more informative) explicit type. An extended version of DynPort, which overcomes these inconveniences and improves type safety, is being developed.

Certain restrictions apply when rdyncall is used to work with C libraries. These arise from limitations in R. For example, the handling of C float pointers/arrays and char pointer-to-pointer types is not implemented in R. The functions .unpack and .pack are powerful helper functions designed to overcome these and some other restrictions. Additional helper functions are included, such as floatraw and floatraw2numeric, which translate numeric R vectors to C float arrays and vice versa.

The portable loading of shared libraries via dynfind might require fine-tuning the list of short names when using less common R platforms such as the BSDs and Solaris.

Related work

Several dynamic languages offer a flexible FFI, e.g. ctypes (Heller, 2011) for Python, alien (Mascarenhas, 2009) for Lua, Rffi (Temple Lang, 2011) for R, CFFI (Bielman, 2010) for Common LISP, and the FFI modules for Perl (Moore et al., 2008) and Ruby (Meissner, 2011). These all facilitate similar services such as foreign function calls and handling of foreign data. With the exception of Rffi, they also support wrapping of scripting functions as C callbacks. In most cases, the type information is specified in the grammar of the dynamic language. An exception to this is the Perl FFI, which uses text-based type signatures similar to rdyncall.

ctypeslib (Kloss, 2008) is an extension to ctypes that comes closest to the idea of DynPorts, in which Python ctypes statements are automatically generated from C library header files, also using GCC-XML. In contrast, the DynPort framework contributes a compact text-based type information format that is also used as the main user interface for various tasks in rdyncall. This software design is applicable across languages, and thus type information can be shared across platforms and languages at the same time.

Specific alternatives to dyncall include libffi (Green, 2011) and ffcall (Haible, 2004). These are mature FFI libraries that use a data-driven C interface and have support for many platforms. Although not as popular as the first two, the C/Invoke library (Weisser, 2007) also offers a similar service, with bindings to Lua, Java and Kite. The dyncall library offers a functional C interface (inspired by the OpenGL API). It includes a comprehensive test suite and detailed documentation of calling conventions on a variety of platforms and compilers. As the framework was developed "de novo", we were free to introduce our own strategy to support open as well as commercial and embedded platforms. For example, the i386 assembly (except for Plan9) is implemented in a common abstract syntax that translates to GNU and Microsoft assemblers. This makes sense here because i386-based operating systems use a common C calling convention, which we address using a single assembly source. A by-product of this feature is that dyncall enables the user to call operating-system-foreign code on some architectures.

In contrast to the dynamic zero-compilation approach of ctypeslib and rdyncall, the majority of language bindings to libraries use a compiled approach, in which code (handwritten or auto-generated) is compiled for each platform. SWIG (Beazley, 2003) is a development tool for the automatic generation of language bindings. The user specifies the interface for a particular library in a C-like language and then chooses among the several supported languages (including R) to generate C sources that implement the binding for that particular library/language combination. RGtk2 (Lawrence and Temple Lang, 2011) offers R bindings for the GTK+ GUI framework consisting of R and C code. These are produced by a custom code generator to offer carefully conceived mappings to the object-oriented GObject framework. The generated code includes features such as ownership management of returned objects using human annotations. While custom bindings offer the ability to take into account the features of a particular library and framework to offer very user-friendly mapping schemes, rdyncall aims to offer convenient access to C libraries in general, but it requires users to know the details of the particular interface of a C library and of the R run-time environment.

Summary and Outlook

This paper introduces the rdyncall package (version 0.7.4 on CRAN as of this writing) that contributes an improved Foreign Function Interface for R. The FFI facilitates direct invocation of foreign functions without the need to compile wrappers in C. The FFI offers a dynamic cross-platform linkage framework to wrap and access whole C interfaces of native libraries from R. Instead of compiling bindings for every library/language combination, R bindings of a library are created dynamically at run-time in a data-driven manner via DynPort files, a cross-platform universal type information format. C libraries are made accessible in R as though they were extension packages, and the R interface looks very similar to that of C. This enables system-level programming in R and brings a new wave of possibilities to R developers, such as direct access to OpenGL across platforms as illustrated in the example. An initial repository of DynPorts for standard cross-platform portable C libraries comes with the package. Work is in progress on the implementation of callback support on the architectures already supported by the dyncall C library. The handling of foreign data types, which is currently implemented in R and C, is planned to be reimplemented as a C library and part of the DynCall project.

The DynPort facility in rdyncall constitutes an initial step in building up an infrastructure between scripting languages and C libraries. Analogous to the way in which R users enjoy quick access to the large pool of R software managed by CRAN, we envision an archive network in which C library developers can distribute their work across languages; users could then get quick access to the pool of C libraries from within scripting languages via automatic installation of precompiled components, using universal type information for cross-platform and cross-language dynamic bindings.

Bibliography

D. Adler and T. Philipp. DynCall project, November 2011. URL http://dyncall.org. C library version 0.7.

D. M. Beazley. Automated scientific software scripting with SWIG. Future Gener. Comput. Syst., 19:599–609, July 2003. ISSN 0167-739X. doi: 10.1016/S0167-739X(02)00171-1. URL http://portal.acm.org/citation.cfm?id=860016.860018.

F. Bellard. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track, pages 41–46. USENIX, 2005. URL http://www.usenix.org/events/usenix05/tech/freenix/bellard.html.

J. Bielman. CFFI - common foreign function interface, August 2010. URL http://common-lisp.net/project/cffi/. CL library version 0.10.6.

O. Chafik. BridJ - Let Java & Scala call C, C++, Objective-C, C#..., June 2011. URL http://code.google.com/p/bridj/. Java package version 0.5.


J. Clark. XSL transformations (XSLT) version 1.1. W3C working draft, W3C, August 2001. URL http://www.w3.org/TR/2001/WD-xslt11-20010824/.

J. Clark. The Expat XML parser, June 2007. URL http://expat.sourceforge.net/. C library version 2.0.1.

D. Flanagan and Y. Matsumoto. The Ruby Programming Language. O'Reilly, Cambridge, 2008.

A. Gavare. GXEmul: a framework for full-system computer architecture emulation, February 2010. URL http://gxemul.sourceforge.net/. Program version 0.6.0.

A. Green. libffi - a portable foreign function interface library, August 2011. URL http://sourceware.org/libffi/. C library version 3.0.10.

B. Haible. ffcall - foreign function call libraries, June 2004. URL http://www.gnu.org/s/libffcall/. C library version 1.10.

T. Heller. ctypes - a foreign function library for Python, October 2011. URL http://starship.python.net/crew/theller/ctypes/.

R. Ierusalimschy, L. H. de Figueiredo, and W. C. Filho. Lua – an extensible extension language. Software – Practice and Experience, 26(6):635–652, June 1996.

H. Kaiser. Wave V2.0 - Boost C++ Libraries, July 2011. URL http://www.boost.org/doc/libs/release/libs/wave/index.html. Boost C++ Library version 1.45, Wave C++ Library version 2.1.0.

E. E. Kim. A triumph of simplicity: James Clark on markup languages and XML. Dr. Dobb's Journal of Software Tools, 26(7):56, 58–60, July 2001. ISSN 1044-789X. URL http://www.ddj.com/.

B. King. GCC-XML, February 2004. URL http://www.gccxml.org. Program version 0.6.0.

G. K. Kloss. Automatic C library wrapping – ctypes from the trenches. The Python Papers, 3(3), 2008. ISSN 1834-3147.

S. Lantinga. libSDL: Simple DirectMedia Layer, October 2009. URL http://www.libsdl.org/. C library version 1.2.14.

M. Lawrence and D. Temple Lang. RGtk2: A graphical user interface toolkit for R. Journal of Statistical Software, 2011. ISSN 1548-7660. URL http://www.jstatsoft.org/v37/i08/paper.

F. Mascarenhas. Alien - pure Lua extensions, October 2009. URL http://alien.luaforge.net/. Lua module version 0.5.1.

W. Meissner. Ruby-FFI, October 2011. URL https://github.com/ffi/ffi/wiki. Ruby package version 1.0.10.

P. Moore, G. Yahas, and A. Vorobey. FFI - Perl foreign function interface, September 2008. URL http://search.cpan.org/~gaal/FFI/FFI.pm. Perl module version 1.04.

OpenGL Architecture Review Board, D. Shreiner, M. Woo, J. Neider, and T. Davis. OpenGL(R) Programming Guide: The Official Guide to Learning OpenGL(R), Version 2. Addison Wesley, 2005.

B. Pendleton. Game programming with the Simple DirectMedia Layer (SDL). Linux Journal, 110:42, 44, 46, 48, June 2003. ISSN 1075-3583.

T. Philipp. DynOS project, May 2011. URL http://dyncall.org/dynos.

R Development Core Team. Writing R Extensions. R Foundation for Statistical Computing, Vienna, Austria, 2010. URL http://www.R-project.org. ISBN 3-900051-11-9.

D. Temple Lang. Rffi for run-time invocation of arbitrary compiled routines from R, January 2011. URL http://www.omegahat.org/Rffi/. R package version 0.3-0.

L. Tierney. A simple implementation of name spaces for R, May 2003. URL http://www.stat.uiowa.edu/~luke/R/namespaces/morenames.pdf.

G. van Rossum and F. L. Drake, Jr. Python Language Reference Manual. Network Theory Ltd., 2005. ISBN 0-9541617-8-5. URL http://www.network-theory.co.uk/python/language/.

D. Veillard and B. Reese. The XSLT C library for GNOME, September 2009. URL http://xmlsoft.org/XSLT/. C library version 1.1.26.

W. Weisser. C/Invoke - version 1.0 - easily call C from any language, January 2007. URL http://cinvoke.teegra.net/index.html. C library version 1.0.

Daniel Adler
Georg-August Universität
Institute for Statistics and Economics
Platz der Göttinger Sieben 5
37079 Göttingen, Germany
[email protected]


Vdgraph: A Package for Creating Variance Dispersion Graphs

by John Lawson

Abstract This article introduces the package Vdgraph that is used for making variance dispersion graphs of response surface designs. The package includes functions that make the variance dispersion graph of one design or compare the variance dispersion graphs of two designs, which are stored in data frames or matrices. The package also contains several minimum-run response surface designs (stored as matrices) that are not available in other R packages.

Introduction

Response surface methods consist of (1) experimental designs for collecting data to fit an approximate relationship between the factors and the response, (2) regression analyses for fitting the model and (3) graphical and numerical techniques for examining the fitted model to identify the optimum. The model normally used in response surface analysis is a second order polynomial, as shown in Equation (1):

    y = β0 + ∑_{i=1}^{k} βi xi + ∑_{i=1}^{k} βii xi² + ∑∑_{i<j} βij xi xj    (1)

Run  x1  x2
 1   -1  -1
 2    1  -1
 3   -1   1
 4    1   1
 5   -1   0
 6    1   0
 7    0  -1
 8    0   1
 9    0   0

Table 1: Face-Centered Cube Design or 3² Design

Since it is not known in advance which neighborhood of the design space will be of most interest, a desirable response surface design will be one that makes the variance of a predicted value as uniform as possible throughout the experimental region. Standard response surface designs, such as the uniform precision central composite design, are constructed so that the variance of a predicted value will be near constant within a coded radius of one from the center of the design.

One way to visualize the uniformity of the variance of a predicted value for designs with more than two factors is to use the variance dispersion graph proposed by Myers et al. (1992).

Variance of a predicted value

The variance of a predicted value at a point (x1, ..., xk) in the experimental region is given by Equation (2):

    Var[ŷ(x)] = σ² x′(X′X)⁻¹ x    (2)

where X is the design matrix for the quadratic model in Equation (1), σ² is the variance of the experimental error, and

    x = [1, x1, ..., xk, x1², ..., xk², x1x2, ...]

is a vector-valued function of the coordinates (of the point in the experimental region) whose elements correspond to the columns of the design matrix X.

For the face-centered cube design, or 3² design, shown in Table 1, Figure 1 is a contour plot of the scaled variance of a predicted value, NVar[ŷ(x)]/σ², over the experimental region.

Figure 1: Contour plot of NVar[ŷ(x)]/σ²

As seen in Figure 1, the variance of a predicted value increases faster along the line x2 = 0 than along the line x2 = −x1. This is easy to visualize in two dimensions, but would be more difficult to see in higher dimensions.
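To make Equation (2) concrete: for the 3² design of Table 1, the scaled prediction variance NVar[ŷ(x)]/σ² can be computed in a few lines. The sketch below is ours and is written in Python/NumPy purely for illustration (Vdgraph itself is an R package); the design points and model columns come straight from Table 1 and Equation (1). At the design center the value works out to exactly 5, consistent with the contour levels of Figure 1.

```python
import numpy as np

# Face-centered cube (3^2) design from Table 1.
design = np.array([
    [-1, -1], [ 1, -1], [-1,  1], [ 1,  1],
    [-1,  0], [ 1,  0], [ 0, -1], [ 0,  1], [ 0,  0],
], dtype=float)

def model_row(x1, x2):
    # Columns of the quadratic model in Equation (1):
    # [1, x1, x2, x1^2, x2^2, x1*x2]
    return np.array([1.0, x1, x2, x1**2, x2**2, x1*x2])

X = np.array([model_row(*run) for run in design])   # design matrix
XtX_inv = np.linalg.inv(X.T @ X)
N = len(design)

def scaled_variance(x1, x2):
    """N * Var[yhat(x)] / sigma^2 = N * x'(X'X)^{-1} x  (Equation 2)."""
    x = model_row(x1, x2)
    return N * x @ XtX_inv @ x

center = scaled_variance(0.0, 0.0)   # exactly 5 for this design
```

Evaluating along the axes versus along the diagonal reproduces the behaviour visible in Figure 1: the variance grows faster along x2 = 0 than along x2 = −x1.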


Variance dispersion graphs

A variance dispersion graph allows one to visualize the uniformity of the scaled variance of a predicted value in multidimensional space. It consists of three curves: the maximum, the minimum and the average scaled variance of a predicted value on a hypersphere. Each value is plotted against the radius of the hypersphere. Figure 2 shows the variance dispersion graph of the design shown in Table 1.

In this figure it can be seen that the maximum scaled variance of a predicted value is near 14 at a radius of 1.4 in coded units, while the minimum scaled variance is less than 10 at the same radius. This is the same phenomenon that can be seen in the contour plot of the scaled variance of a predicted value shown in Figure 1. The path of the maximum and minimum variance through the design space will be determined by the design and may not follow straight lines as shown in the specific example in this contour plot.

Figure 2: Variance dispersion graph for the design in Table 1.

Unlike the contour plot of the scaled prediction variance, the variance dispersion graph has the same format for a k-dimensional response surface design as for a two-dimensional design like Table 1.

Recent textbooks such as Montgomery (2005), Myers et al. (2009), and Lawson (2010) illustrate variance dispersion graphs as a tool for judging the merits of a response surface design. These graphs can be produced in commercial software such as SAS ADX (see SAS Institute Inc., 2010) and Minitab (see Minitab Inc., 2010) by using a downloadable macro (see Santiago, 2009).

The Vdgraph package

Vining (1993a) and Vining (1993b) published FORTRAN code for creating variance dispersion graphs. Vining's code obtains the maximum and minimum prediction variance on hyperspheres using a combination of a grid search and the Nelder-Mead search as described by Cook and Nachtsheim (1980). The package Vdgraph (Lawson, 2011) incorporates this code in R functions that make the graphs.

The package includes the function Vdgraph for making a variance dispersion graph of one design and the function Compare2Vdg for comparing the variance dispersion graphs of two designs on the same plot. The package also includes several minimum-run response surface designs stored as matrices. These include Hartley's small composite design for 2 to 6 factors, Draper and Lin's small composite design for 5 factors, the hexagonal rotatable design for 2 factors and Roquemore's hybrid designs for 3 to 6 factors.

Examples

The first example shown below illustrates the use of the R function Vdgraph to make variance dispersion graphs of a three-factor Box-Behnken design created by the bbd function in the R package rsm (see Lenth, 2009).

> library(rsm)
> BB.des3 <- bbd(3)
> Vdgraph(BB.des3)

number of design points= 16
number of factors= 3

          Radius   Maximum   Minimum  Average
 [1,] 0.00000000  4.000000  4.000000  4.00000
 [2,] 0.08660254  3.990100  3.990067  3.99008
 [3,] 0.17320508  3.961600  3.961067  3.96128
 [4,] 0.25980762  3.918100  3.915400  3.91648
 [5,] 0.34641016  3.865600  3.857067  3.86048
 [6,] 0.43301270  3.812500  3.791667  3.80000
 [7,] 0.51961524  3.769600  3.726400  3.74368
 [8,] 0.60621778  3.750100  3.670067  3.70208
 [9,] 0.69282032  3.769600  3.633067  3.68768
[10,] 0.77942286  3.846100  3.627400  3.71488
[11,] 0.86602540  4.000000  3.666667  3.80000
[12,] 0.95262794  4.254106  3.766067  3.96128
[13,] 1.03923048  4.633600  3.942400  4.21888
[14,] 1.12583302  5.166116  4.214067  4.59488
[15,] 1.21243557  5.881600  4.601067  5.11328
[16,] 1.29903811  6.812500  5.125000  5.80000
[17,] 1.38564065  7.993638  5.809067  6.68288
[18,] 1.47224319  9.462100  6.678067  7.79168
[19,] 1.55884573 11.257600  7.758400  9.15808
[20,] 1.64544827 13.422175  9.078067 10.81568
[21,] 1.73205081 16.000000 10.666667 12.80000
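The three VDG curves can also be approximated by brute force: sweep a set of radii and, on each hypersphere, record the maximum, minimum and average scaled prediction variance. The sketch below is ours, in Python/NumPy rather than R, and uses a crude grid search over angles in place of Vining's grid-plus-Nelder-Mead algorithm; for the two-factor design of Table 1 it reproduces the behaviour described earlier (a maximum near 14 and a minimum below 10 at radius 1.4).

```python
import numpy as np

# Face-centered cube (3^2) design from Table 1 and its quadratic model.
design = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1],
                   [-1,  0], [1,  0], [0, -1], [0, 1], [0, 0]], dtype=float)

def quad(x1, x2):
    # Model columns of Equation (1): [1, x1, x2, x1^2, x2^2, x1*x2]
    return np.array([1.0, x1, x2, x1**2, x2**2, x1*x2])

X = np.array([quad(*r) for r in design])
XtX_inv = np.linalg.inv(X.T @ X)
N = len(design)

def vdg_curves(radius, n_angles=720):
    """Max/min/average of N*Var[yhat]/sigma^2 on the circle of the given
    radius, via a grid search (stand-in for Vining's algorithm)."""
    theta = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    pts = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
    v = np.array([N * quad(*p) @ XtX_inv @ quad(*p) for p in pts])
    return v.max(), v.min(), v.mean()

mx, mn, avg = vdg_curves(1.4)   # max near 14, min below 10
```

Plotting mx, mn and avg against a sequence of radii yields the three curves of a variance dispersion graph.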


The result from the first example (shown above) includes a listing of the coordinates of the plot and the graph shown in Figure 3.

Figure 3: Variance dispersion graph for BB.des3.

The second example illustrates the use of Compare2Vdg by comparing the variance dispersion graph of Draper and Lin's small composite design for 5 factors (SCDDL5) (Draper and Lin, 1990) with Hartley's small composite design (SCDH5) (Hartley, 1959). Hartley's small composite design requires only 28 runs by utilizing a 1/2 fraction of the factorial portion of the design.

Run  x1  x2  x3  x4  x5
 1    1  -1   1   1   1
 2    1   1  -1  -1  -1
 3   -1   1   1  -1   1
 4    1  -1   1  -1   1
 5    1   1  -1   1   1
 6    1   1   1  -1  -1
 7   -1   1   1   1  -1
 8   -1  -1   1   1  -1
 9   -1  -1  -1  -1   1
10    1  -1  -1   1  -1
11   -1   1  -1   1   1
12   -1  -1  -1  -1  -1
13   −α   0   0   0   0
14    α   0   0   0   0
15    0  −α   0   0   0
16    0   α   0   0   0
17    0   0  −α   0   0
18    0   0   α   0   0
19    0   0   0  −α   0
20    0   0   0   α   0
21    0   0   0   0  −α
22    0   0   0   0   α
23    0   0   0   0   0

Table 2: Draper and Lin's Small Composite Design for 5 Factors

Although Draper and Lin's design (shown in Table 2 with α = 1.86121) further reduces the number of runs to 23, by substituting a 12-run Plackett-Burman design in the factorial portion, its variance dispersion graph reveals that the variance of a predicted value is not nearly as uniform as it is for Hartley's design.

> data(SCDH5)
> data(SCDDL5)
> Compare2Vdg("Hartley's Small Composite-5 fac",SCDH5,
+ "Draper and Lin's Small Composite-5 fac",SCDDL5)

Figure 4: Comparison of Two Variance Dispersion Graphs (max, min and average curves for both designs).

As seen in Figure 4, Hartley's small composite design for 5 factors is rotatable, since the three blue curves for the max, min and average scaled prediction variance coincide. The scaled variance of a predicted value for Hartley's design is near the minimum scaled variance of a predicted value for Draper and Lin's design throughout the experimental region.

Acknowledgement

I would like to thank the editor and two reviewers for helpful suggestions that improved this article and the package Vdgraph.

Bibliography

R. D. Cook and C. J. Nachtsheim. A comparison of algorithms for constructing D-optimal designs. Technometrics, 22:315–324, 1980.

N. R. Draper and D. K. J. Lin. Small response surface designs. Technometrics, 32:187–194, 1990.

H. O. Hartley. Smallest composite design for quadratic response surfaces. Biometrics, 15:611–624, 1959.


J. Lawson. Vdgraph: This package creates variance dispersion graphs for response surface designs, 2011. URL http://CRAN.R-project.org/package=Vdgraph. R package version 1.0-1.

J. S. Lawson. Design and Analysis of Experiments with SAS. CRC Press, Boca Raton, 2010.

R. V. Lenth. Response surface methods in R, using rsm. Journal of Statistical Software, 32(7):1–17, 2009.

Minitab Inc. Minitab software for quality improvement, 2010. URL http://www.minitab.com.

D. C. Montgomery. Design and Analysis of Experiments. John Wiley & Sons, New York, sixth edition, 2005.

R. H. Myers, G. Vining, A. Giovannitti-Jensen, and S. L. Myers. Variance dispersion properties of second order response surface designs. Journal of Quality Technology, 24:1–11, 1992.

R. H. Myers, D. C. Montgomery, and C. M. Anderson-Cook. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley & Sons, New York, 2009.

E. Santiago. Macro: VDG.MAC, 2009. URL www.minitab.com/support/macros/default.aspx.

SAS Institute Inc. Getting started with the SAS 9.2 ADX interface for design of experiments, 2010. URL http://support.sas.com/documentation/cdl/en/adxgs/60376/PDF/default/adxgs.pdf.

G. G. Vining. A computer program for generating variance dispersion graphs. Journal of Quality Technology, 25:45–58, 1993a.

G. G. Vining. Corrigenda: A computer program for generating variance dispersion graphs. Journal of Quality Technology, 25:333–335, 1993b.

John Lawson
Department of Statistics
Brigham Young University
Provo, UT (USA)
[email protected]

xgrid and R: Parallel Distributed Processing Using Heterogeneous Groups of Apple Computers

by Sarah C. Anoke, Yuting Zhao, Rafael Jaeger and Nicholas J. Horton

Abstract The Apple Xgrid system provides access to groups (or grids) of computers that can be used to facilitate parallel processing. We describe the xgrid package which facilitates access to this system to undertake independent simulations or other long-running jobs that can be divided into replicate runs within R. Detailed examples are provided to demonstrate the interface, along with results from a simulation study of the performance gains using a variety of grids. Use of the grid for "embarrassingly parallel" independent jobs has the potential for major speedups in time to completion. Appendices provide guidance on setting up the workflow, utilizing add-on packages, and constructing grids using existing machines.

Introduction

Many scientific computations can be sped up by dividing them into smaller tasks and distributing the computations to multiple systems for simultaneous processing. Particularly in the case of embarrassingly parallel (Wilkinson and Allen, 1999) statistical simulations, where the outcome of any given simulation is independent of others, parallel computing on existing grids of computers can dramatically increase computation speed. Rather than waiting for the previous simulation to complete before moving on to the next, a grid controller can distribute tasks to agents (also known as nodes) as quickly as they can process them in parallel. As the number of nodes in the grid increases, the total computation time for a given job will generally decrease. Figure 1 provides a conceptual model of this framework.

Several solutions exist to facilitate parallel computation within R. Wegener et al. (2007) developed GridR, a Condor-based environment for settings where one can connect directly to agents in a grid. The Rmpi package (Yu, 2002) is an R wrapper for the popular Message Passing Interface (MPI) protocol and provides extremely low-level control over grid functionality. The rpvm package (Li and Rossini, 2001) provides a connection to a Parallel Virtual Machine (Geist et al., 1994). The snow package (Rossini et al., 2007) provides a simpler implementation of Rmpi and rpvm, using a low-level socket functionality. The multicore package (Urbanek, 2011) provides several functions to divide work between a single machine's multiple cores. Starting with release 2.14.0, snow and multicore are available as slightly revised copies within the parallel package in base R.

The Apple Xgrid (Apple Inc., 2009) technology is a parallel computing environment. Many Apple Xgrids already exist in academic settings, and are straightforward to set up. As loosely organized clusters, Apple Xgrids provide graceful degradation, where agents can easily be added to or removed from the grid without disrupting its operation. Xgrid supports heterogeneous agents (also a plus in many settings, where a single grid might include a lab, classroom, individual computers, as well as more powerful dedicated servers) and provides automated housekeeping and cleanup. The Xgrid Admin program provides a graphical overview of the controller, agents and jobs that are being managed (instructions on downloading and installing this tool can be found in Appendix C).

We created the xgrid package (Horton and Anoke, 2012) to provide a simple interface to this distributed computing system. The package facilitates use of an Apple Xgrid for distributed processing of a simulation with many independent repetitions, by simplifying job submission (or grid stuffing) and collation of results. It provides a relatively thin but useful layer between R and Apple's ‘xgrid’ shell command, where the user constructs input scripts to be run remotely. A similar set of routines, optimized for parallel estimation of JAGS (just another Gibbs sampler) models, is available within the runjags package (Denwood, 2010). However, with the exception of runjags, none of the previously mentioned packages support parallel computation over an Apple Xgrid.

We begin by describing the xgrid package interface to the Apple Xgrid, detailing two examples which utilize this setup, and summarizing simulation studies that characterize the performance of a variety of tasks on different grid configurations, then close with a summary. We also include a glossary of terms and provide three appendices detailing how to access a grid using R (Appendix A), how to utilize add-on packages within R (Appendix B), and how to construct a grid using existing machines (Appendix C).


[Figure 1 depicts the client, controller, and agents, connected by the following workflow:]

1. Initiate simulations.
2. Simulations transferred to controller.
3. Controller splits simulations into multiple jobs.
4. Jobs are transferred to individual agents on the network as they become available.
5. Agents compute a single job with multiple tasks.
6. Agents return job results.
7. Controller retrieves and collates job results and returns them to the client.

Figure 1: Conceptual model of the Apple Xgrid framework (derived from graphic by F. Loxy)

Controlling an Xgrid using R

To facilitate use of an Apple Xgrid using R, we created the xgrid package, which contains support routines to split up, submit, monitor, and then retrieve results from a series of simulation studies.

The xgrid() function connects to the grid by repeated calls to the xgrid command at the Mac OS X shell level on the client. Table 1 displays some of the actions supported by the xgrid command and their analogous routines in the xgrid package. While users will ordinarily not need to use these routines, they are helpful in understanding the workflow. These routines are designed to call a specified R script with a suitable environment (packages, input files) on a remote machine. The remote job is given arguments as part of a call to ‘R CMD BATCH’, which allow it to create a unique location to save results, which are communicated back to the client by way of R object files created with the R saveRDS() function. Much of the work involves specifying a naming structure for the jobs, to allow the results to be automatically collated.

The xgrid() function is called to start a series of simulations. This function takes as arguments the R script to run on the grid (by default set to ‘job.R’), the directory containing input files (by default set to ‘input’), the directory to save output created within R on the agent (by default set to ‘output’), and a name for the results file (by default set to ‘RESULTS.rds’). In addition, the total number of iterations in the simulation (numsim) and the number of tasks per job (ntask) can be specified. The xgrid() function divides the total number of iterations into numsim/ntask individual jobs, where each job is responsible for calculating the specified number of tasks on a single agent (see Figure 1). For example, if 2,000 iterations are desired, these could be divided into 200 jobs, each running 10 of the tasks.

The number of active jobs on the grid can be controlled using the throttle option (by default, all jobs are submitted and then queued until an agent is available). The throttle option helps facilitate sharing a large grid between multiple users (since by default the Apple Xgrid system provides no load balancing amongst users).

The xgrid() function checks for errors in specification, then begins to repeatedly call the xgridsubmit() function for each job that needs to be created. xgridsubmit() creates a properly formatted ‘xgrid -job submit’ command, which is executed at the Mac OS X shell level through the R system() function. This has the effect of executing a command of the form ‘R CMD BATCH file.R’ on the grid, with appropriate arguments (the number of repetitions to run, parameters to pass along, and the name of the unique filename to save results). The results of the system() call are saved to be able to determine the identification number for that particular job. This number can be used to check the status of the job as well as to retrieve its results and delete it from the system once it has completed.

Once all of the jobs have been submitted, xgrid() then periodically polls the list of active jobs until they are completed. This function makes a call to xgridattr() and determines the value of the jobStatus attribute. The function waits (sleeps) between each poll, to lessen load on the grid.

When a job has completed, its results are retrieved using xgridresults() and then deleted from the system using xgriddelete().
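The job-splitting arithmetic just described can be sketched directly; the command string below illustrates the general form of a submission, not the package's exact invocation:

```r
# Sketch of how numsim iterations are divided into jobs of ntask
# tasks each; one submission command is built per job.
numsim <- 2000   # total simulation iterations
ntask  <- 10     # tasks per job
numjobs <- numsim / ntask   # 200 jobs
# Illustrative command strings, one per job, each naming a unique output file:
cmds <- sprintf("xgrid -job submit R CMD BATCH job.R job%03d.Rout",
                seq_len(numjobs))
length(cmds)   # one submission per job
```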


Action       R Function        Description
submit       xgridsubmit()     submit a job to the grid controller
attributes   xgridattr()       check on the status of a job
results      xgridresults()    retrieve the results from a completed job
delete       xgriddelete()     delete the job

Table 1: Job actions supported by the xgrid command and their analogous functions in the xgrid package

This capability relies on the properties of the Apple Xgrid, which can be set up to have all files created by the agent when running a given job copied to the ‘output’ directory on the client computer. When all jobs have completed, the individual result files are combined into a single data frame in the current directory. The ‘output’ directory has a complete listing of the individual results as well as the R output from the remote agents. This directory can be useful for debugging in case of problems, and its contents are typically accessed only in those cases.

To help demonstrate how to access an existing Xgrid, we provide two detailed examples: one involving a relatively straightforward computation assessing the robustness of the one-sample t-test, and a second requiring the use of add-on packages to undertake simulations of a latent class model. These examples are provided as vignettes within the package. In addition, the example files are available for download from http://www.math.smith.edu/xgrid.

Examples

Example 1: Assessing the robustness of the one-sample t-test

The t-test is remarkably robust to violations of its underlying assumptions (Sawilowsky and Blair, 1992). However, as Hesterberg (2008) argues, not only is it possible for the total non-coverage to exceed α, the asymmetry of the test statistic causes one tail to account for more than its share of the overall α level. Hesterberg found that sample sizes in the thousands were needed to get symmetric tails.

In this example, we demonstrate how to utilize an Apple Xgrid cluster to investigate the robustness of the one-sample t-test, by looking at how the α level is split between the two tails. When the number of simulation iterations is small (< 100,000), this study runs very quickly as a loop in R. Here, we provide the computation of a study consisting of 10^6 iterations. A more efficient alternative would be to use pvec() or mclapply() of the multicore package to distribute this study over the cores of a single machine. However, for the purposes of illustration, we conduct this simple statistical investigation over an Xgrid to demonstrate package usage, and to compare the results and computation time to the same study run with a for loop on a local machine.

Our first step is to set up an appropriate directory structure for our simulation (see Figure 5; Appendix A provides an overview of requirements). The first item is the folder ‘input’, which contains two files that will be run on the remote agents. The first of these files, ‘job.R’ (Figure 2), defines the code to run a particular task, ntask times.

In this example, the job() function begins by generating a sample of param exponential random variables with mean 1. A one-sample t-test is conducted on this sample, and logical (TRUE/FALSE) values denoting whether the test rejected in that tail are saved in the vectors leftreject and rightreject. This process is repeated ntask times, after which job() returns a data frame with the rejection results and the corresponding sample size.

# Assess the robustness of the one-sample
# t-test when underlying data are exponential.
# This function returns a data frame with
# number of rows equal to the value of "ntask".
# The option "param" specifies the sample size.
job <- function(ntask, param) {
  alpha <- 0.05  # how often to reject under null
  leftreject <- logical(ntask)   # placeholder
  rightreject <- logical(ntask)  # for results
  for (i in 1:ntask) {
    dat <- rexp(param)  # generate skewed data
    left <- t.test(dat, mu = 1,
                   alternative = "less")
    leftreject[i] <- left$p.value <= alpha/2
    right <- t.test(dat, mu = 1,
                    alternative = "greater")
    rightreject[i] <- right$p.value <= alpha/2
  }
  return(data.frame(leftreject, rightreject,
                    n = rep(param, ntask)))
}

Figure 2: Contents of ‘job.R’

The folder ‘input’ also contains ‘runjob.R’ (Figure 3), which retrieves command line arguments generated by xgrid() and passes them to job(). The results from the completed job are saved as res0, which is subsequently saved to the ‘output’ folder. The folder ‘input’ may also contain other files (including add-on packages or other files needed for the simulation; see Appendix B for details).


source("job.R")
# commandArgs() is expecting three arguments:
#   1) number of tasks to run within this job
#   2) parameter to pass to the function
#   3) place to stash the results when finished
args <- commandArgs(trailingOnly = TRUE)
ntask1 <- as.numeric(args[1])
param1 <- args[2]
resfile <- args[3]
res0 <- job(ntask = ntask1, param = param1)
# stash the results
saveRDS(res0, file = resfile)

Figure 3: Contents of ‘runjob.R’

Figure 5: File structure needed on the client to access the Xgrid (the ‘rlibs’ folder contains any add-on packages required by the remote agent)

The next item in the directory structure is ‘simulation.R’ (Figure 4), which contains R code to be run on the client machine using source(). The call to xgrid() submits the simulation to the grid for calculation, which includes passing param and ntask to job() within ‘job.R’. Results from all jobs are returned as one data frame, res. The call to with() summarizes all results in a table and prints them to the console.

require(xgrid)
# run the simulation
res <- xgrid(Rcmd = "runjob.R", param = 30,
             numsim = 10^6, ntask = 5*10^4)
# analyze the results
with(res, table(leftreject, rightreject))

Figure 4: Contents of ‘simulation.R’
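Putting the pieces together, the client-side layout described above (and in the Figure 5 caption) is approximately as follows; the exact nesting is an assumption based on that description:

```
simulation.R     # run on the client with source()
input/
    job.R        # defines the code for one task
    runjob.R     # run on each agent via R CMD BATCH
rlibs/           # add-on packages needed by the agents (Appendix B)
output/          # per-job results; created automatically if absent
```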

Here we specify a total of 10^6 simulation iterations, to be split into twenty jobs of 5 × 10^4 simulations each. Note that the number of jobs is calculated as the total number of simulation iterations (numsim) divided by the number of tasks per job (ntask). Each simulation iteration has a sample size of param.

The final item in the directory structure is ‘output’. Initially empty, results returned from the grid are saved here (this directory is automatically created if it does not already exist).

Figure 5 displays the file structure within the directory used to access the Xgrid. Jobs are submitted to the grid by running ‘simulation.R’. In this particular simulation, twenty jobs are submitted. As jobs are completed, the results are saved in the ‘output’ folder then removed from the grid.

Figures 6 and 7 display management tools available using the Xgrid Admin interface. In addition to returning a data frame (10^6 rows and 3 columns) with the collated results, the xgrid() function saves this object as a file (by default as res in the file ‘RESULTS.rds’).

Figure 6: Monitoring overall grid status using the Xgrid Admin application

Figure 7: Job management using Xgrid Admin
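The collation step described above, in which per-job ‘.rds’ files written by saveRDS() are combined into one data frame, can be sketched in a few lines; the file names here are illustrative, not the package's own naming scheme:

```r
# Write three fake per-job result files, then combine them,
# mimicking the collation that xgrid() performs.
dir <- file.path(tempdir(), "xgrid-demo")
dir.create(dir, showWarnings = FALSE)
for (j in 1:3) {
  res0 <- data.frame(leftreject = logical(5),
                     rightreject = logical(5),
                     n = rep(30, 5))
  saveRDS(res0, file.path(dir, sprintf("res%03d.rds", j)))
}
files <- list.files(dir, pattern = "^res[0-9]+\\.rds$", full.names = TRUE)
res <- do.call(rbind, lapply(files, readRDS))
nrow(res)  # 3 jobs x 5 tasks = 15 rows
```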


                                                                   Class
Criteria                                              Acronym     2     3     4     5    6+
Bayesian Information Criterion                        BIC       49%   44%    7%    0%    0%
Akaike Information Criterion                          AIC        0%    0%   53%   31%   16%
Consistent Akaike Information Criterion               cAIC      73%   25%    2%    0%    0%
Sample Size Adjusted Bayesian Information Criterion   aBIC       0%    5%   87%    6%    2%

Table 2: Percentage of times (out of 100 simulations) that a particular number of classes was selected as the best model (where the true data derive from 4 distinct and equiprobable classes, simulation sample size 300), reprinted from Swanson et al. (2011). Note that a perfect criterion would have "100%" under class 4.
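The bookkeeping behind Table 2 can be sketched as follows; the criterion values below are invented for illustration, and a real run would compute them from poLCA() fits:

```r
# For one simulated data set: record which number of classes
# minimizes each criterion (values below are made up).
classes <- 2:6
bic <- c(3105, 3090, 3101, 3122, 3148)   # hypothetical BIC by class count
aic <- c(3061, 3012, 2995, 2998, 3006)   # hypothetical AIC by class count
best <- c(BIC = classes[which.min(bic)],
          AIC = classes[which.min(aic)])
best  # for these made-up values, BIC selects 3 classes and AIC selects 4
```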

In terms of our motivating example, when the underlying data are normally distributed, we would expect to reject the null hypothesis 2.5% of the time on the left and 2.5% on the right. The simulation yielded rejection rates of 6.5% and 0.7% for left and right, respectively. This confirms Hesterberg's argument regarding lack of robustness for both the overall α-level as well as the individual tails.

Regarding computation time, this simulation took 48.2 seconds (standard deviation of 0.7 seconds) when run on a heterogeneous mixture of twenty iMacs and Mac Pros. When run locally on a single quad-core Intel Xeon Mac Pro computer, this simulation took 592 seconds (standard deviation of 0.4 seconds).

Example 2: Fitting latent class models using add-on packages

Our second example is more realistic, as it involves simulations that would ordinarily take days to complete (as opposed to seconds for the one-sample t-test). It involves study of the properties of latent class models, which are used to determine better schemes for classification of eating disorders (Keel et al., 2004). The development of an empirically-created eating disorder classification system is of public health interest as it may help identify individuals who would benefit from diagnosis and treatment.

As described by Collins and Lanza (2009), latent class analysis (LCA) is used to identify subgroups in a population. There are several criteria used to evaluate the fit of a given model, including the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Consistent Akaike Information Criterion (cAIC), and the Sample Size Adjusted Bayesian Information Criterion (aBIC). These criteria are useful, but further guidance is needed for researchers to choose between them, as well as better understand how their accuracy is affected by methodological factors encountered in eating disorder classification research, such as unbalanced class size, sample size, missing data, and under- or over-specification of the model. Swanson et al. (2011) undertook a comprehensive review of these model criteria, including a full simulation study to generate hypothetical data sets, and investigated how each criterion behaved in a variety of statistical environments. For this example, we replicate some of their simulations using an Apple Xgrid to speed up computation time.

Following the approach of Swanson et al., we generated "true models" where we specified an arbitrary four-class structure (with balanced number of observations in each class). This structure was composed of 10 binary indicators in a simulated data set of size 300. The model was fitted using the poLCA() function of the poLCA (polytomous latent class analysis) package (Linzer and Lewis, 2011). Separate latent class models were fitted specifying the number of classes, ranging from two to six. For each simulation, we determined the lowest values of BIC, AIC, cAIC and aBIC and recorded the class structure associated with that value.

Swanson and colleagues found that for this set of parameter values, the AIC and aBIC picked the correct number of classes more than half the time (see Table 2).

This example illustrates the computational burden of undertaking simulation studies to assess the performance of modern statistical methods, as several minutes are needed to undertake each of the single iterations of the simulation (which may explain why Swanson and colleagues only ran 100 simulations for each of their scenarios).

A complication of this example is that fitting LCA models in R requires the use of add-on packages. For instance, we use poLCA and its supporting packages with an Apple Xgrid to conduct our simulation. When a simulation requires add-on packages, they must be preloaded within the job and shipped over to the Xgrid. The specific steps of utilizing add-on packages are provided in Appendix B.

Simulation results

To better understand the potential performance gains when running simulations on the grid, we assessed the relative performance of a variety of grids with different characteristics and differing numbers of tasks per job (since there is some overhead to the


Grid description                       ntask   numjobs    mean      sd
Mac Pro (1 processor)                  1,000         1   43.98    0.27
Mac Pro (8 processors)                    20        50    9.97    0.04
Mac Pro (8 processors)                    10       100    9.65    0.05
Mac Pro (8 processors)                     4       250    9.84    0.59
grid of 11 machines (66 processors)        1     1,000    1.02   0.003
grid of 11 machines (77 processors)        1     1,000    0.91   0.003
grid of 11 machines (88 processors)       20        50    1.28    0.01
grid of 11 machines (88 processors)       10       100    1.20    0.04
grid of 11 machines (88 processors)        8       125    1.04    0.03
grid of 11 machines (88 processors)        4       250    0.87    0.01
grid of 11 machines (88 processors)        1     1,000    0.87   0.004

Table 3: Mean and standard deviation of elapsed timing results (in hours) to run 1,000 simulations (each with three repetitions) using poLCA(), by grid and number of tasks per job. Note that the Mac Pro processors were rated at 3 GHz while the grid CPUs were rated at 2.8 GHz.
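The headline comparisons from Table 3 can be checked directly from the tabulated means:

```r
# Percent decreases in elapsed time quoted in the text,
# computed from the Table 3 means (hours).
single <- 43.98  # 1 processor
eight  <- 9.84   # 8 processors, ntask = 4
grid88 <- 0.87   # 88 processors, ntask = 4
round(100 * (single - eight)  / single)  # 78
round(100 * (single - grid88) / single)  # 98
```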

Apple Xgrid system). One grid was a single Mac Pro computer with two quad-core Intel Xeon processors (a total of 8 processors, each rated at 3 GHz, total 24 GHz). The other was a grid of eleven rack-mounted quad-core Intel servers (a total of 88 processors, each rated at 2.8 GHz, total 246.4 GHz). For comparison, we also ran the job directly within R on the Mac Pro computer (equivalent to ‘ntask = 1000, numjobs = 1’), as well as on the 88 processor grid but restricting the number of active jobs to 66 or 77 using the throttle option within xgrid(). For each setup, we varied the number of tasks (ntask) while maintaining 1,000 simulations for each scenario (ntask * numjobs = 1000).

For simulations with a relatively small number of tasks per job and a large number of jobs, the time to submit them to the grid controller can be measured in minutes; this overhead may be nontrivial. Also, for simulations on a quiescent grid where the number of jobs is not evenly divisible by the number of available processors, some part of the grid may be idle near the end of the simulation and the elapsed time longer than necessary.

Each simulation was repeated three times and the mean and standard deviation of total elapsed time (in hours) was recorded. Table 3 displays the results.

As expected, there was considerable speedup by using the grid. When moving from a single processor to 8, we were able to speed up our simulation by 34 hours, a 78% decrease in elapsed time [(43.98-9.84)/43.98]. We saw an even larger decrease of 98% when moving to 88 (somewhat slower) processors. There were only modest differences in the elapsed time comparing different configurations of the number of tasks per job and number of jobs, indicating that the overhead of the system imposes a relatively small cost. Once the job granularity is sufficiently fine, the performance change is minimal.

Summary

We have described an implementation of a "grid stuffer" that can be used to access an Apple Xgrid from within R. Xgrid is an attractive platform for scientific computing because it can be easily set up, and it handles tedious housekeeping and collation of results from large simulations with relatively minimal effort. Core technologies in Mac OS X and easily available extensions simplify the process of creating a grid. An Xgrid may be configured with little knowledge of advanced networking technologies or disruption of networking activities.

Another attractive feature of this environment is that the Xgrid degrades gracefully. If an individual agent fails, then the controller will automatically resubmit the job to another agent. Once all of the submitted jobs are queued up on a controller, the xgrid() function can handle temporary controller failures (if the controller restarts then all pending and running jobs will be restarted automatically).
able processors, some part of the grid may be idle If the client crashes, the entire simulation must be near the end of the simulation and the elapsed time restarted. It may be feasible to keep track of this state longer than necessary. to allow more fault tolerance in a future release. A limitation of the Apple Xgrid system is that Each simulation was repeated three times and the it is proprietary, which may limit its use. A second mean and standard deviation of total elapsed time limitation is that it requires some effort on the part (in hours) was recorded. Table3 displays the results. of the user to structure their program in a manner that can be divided into chunks then reassembled. As expected, there was considerable speedup by Because of the relatively thin communication con- using the grid. When moving from a single proces- nections between the client and controller, the user sor to 8, we were able to speed up our simulation by is restricted to passing configuration via command 34 hours – a 78% decrease in elapsed time [(43.98- line arguments in R, which must be parsed by the 9.84)/43.98]. We saw an even larger decrease of 98% agent. Results are then passed back via the filesys- when moving to 88 (somewhat slower) processors. tem. In addition to the need for particular files to be There were only modest differences in the elapsed created that can be run on the client as well as on the time comparing different configurations of the num- controller, use of the Apple Xgrid system may also ber of tasks per job and number of jobs, indicating require separate installation of additional packages that the overhead of the system imposes a relatively needed for the simulation (particularly if the user small cost. Once the job granularity is sufficiently does not have administrative access or control of the fine, performance change is minimal. individual agents).


However, as demonstrated by our simulation studies, the Apple Xgrid system is particularly well-suited for embarrassingly parallel problems (e.g. simulation studies with multiple independent runs). It can be set up on a heterogeneous set of machines (such as a computer classroom) and configured to run when these agents are idle. For such a setup, the overall process load tends to balance if the simulation is broken up into sufficiently fine-grained jobs, as faster agents will take on more than their share of jobs.

The Xgrid software is available as an additional download for Apple users as part of Mac OS X Server, and it is straightforward to set up a grid using existing computers (instructions are provided in Appendix C). In our tests, it took approximately 30 minutes to set up an Apple Xgrid on a simple network consisting of three computers.

Our implementation provides for three distinct authentication schemes (Kerberos, clear text password, or none). While the Xgrid system provides full support for authentication using Kerberos, this requires more configuration and is beyond the scope of this paper. The system documentation (Apple Inc., 2009) provides a comprehensive review of administration using this mechanism. If the ‘auth = "Password"’ option is used, then the XGRID_CONTROLLER_PASSWORD environment variable must be set by the user to specify the password. As with any shared networking resource, care should be taken when less secure approaches are used to access the controller and agents. Firewalling the Xgrid controller and agents from outside networks may be warranted.

There are a number of areas of potential improvement for this interface. It may be feasible to create a single XML file to allow all of the simulations to be included as multiple tasks within a single job. This has the potential to decrease input/output load and simplify monitoring.

As described previously, the use of parallel computing to speed up scientific computation using R by use of multiple cores, tightly connected clusters, and other approaches remains an active area of research. Other approaches exist, though they address different issues. For example, the GridR package does not support access through the xgrid command and requires access to individual agents. The runjags package provides an alternative interface that is particularly well suited for Bayesian estimation using Gibbs sampling.

Our setup focuses on a simple (but common) problem: embarrassingly parallel computations that can be chopped into multiple tasks or jobs. The ability to dramatically speed up such simulations within R by running them on a designated grid (or less formally on a set of idle machines) may be an attractive option in many settings with Apple computers.

Glossary

Agent A member of a grid that performs tasks distributed to it by the controller (also known as a node). An agent can be its own controller.

Apple Remote Desktop (ARD) An application that allows a user to monitor or control networked computers remotely.

Bonjour networking Apple's automatic network service discovery and configuration tool. Built on a variety of commonly used networking protocols, Bonjour allows Mac computers to easily communicate with each other and with a large number of compatible devices, with little to no user configuration required.

Client A computer that submits jobs to the controller for grid processing.

Controller A computer (running OS X Server or XgridLite) that accepts jobs from clients and distributes them to agents. A controller can also be an agent within the grid.

Embarrassingly parallel A computation that can be divided into a series of independent tasks where little or no communication is required.

Grid More general than an Apple Xgrid, a group of machines (agents), managed by a controller, which accept and perform computational tasks.

Grid stuffer An application that submits jobs and retrieves their results through the xgrid command. The xgrid() function within the xgrid package implements a grid stuffer in R.

Job A group of one or more tasks submitted by a client to the controller.

Mac OS X Server A Unix server operating system from Apple, that provides advanced features for operating an Xgrid controller (for those without Mac OS X Server, the XgridLite panel provides similar functionality for grid computing).

plist file Short for "property list file". On Mac OS computers, this is an XML file that stores user- or system-defined settings for an application or extension. For example, the "Preferences" pane in an application is linked to a plist file such that when Preferences are changed, their values are recorded in the plist file, to be read by other parts of the application when running.

Preference pane Any of the sections under the Mac OS X System Preferences, each controlling a different aspect of the operating system (desktop background, energy saving preferences, etc.) or of user-installed programs. XgridLite is a preference pane, since it is simply a GUI for core system tools (such as the xgridctl command).


core system tools (such as the xgridctl command).

Processor core Independent processors in the same CPU. Most modern computers have multi-core CPUs, usually with 2 or 4 cores. Multi-core systems can execute more tasks simultaneously (in our case, multiple R runs at the same time). Multi-core systems are often able to manage tasks more efficiently, since less time is spent waiting for a core to be ready to process data.

Task Sub-components of jobs. A job may contain many tasks, each of which can be performed individually, or may depend on the results of other tasks or jobs.

Xgrid Software and distributed computing protocol developed by Apple that allows networked computers to contribute to distributed computations.

XgridLite Preference pane authored by Ed Baskerville. A free GUI to access Xgrid functionality that is built into all Mac OS X computers. Also facilitates configuration of controllers (similar functionality is provided by Mac OS X Server).

Appendix A: Accessing the Apple Xgrid using R

This section details the steps needed to access an existing Xgrid using R. All commands are executed on the client.

Install the package from CRAN

The first step is to install the package from CRAN on the client computer (needs to be done only once):

install.packages("xgrid")

Determine hostname and access information for the Apple Xgrid

The next step is to determine the access procedures for your particular grid. These include the hostname (specified as the grid option) and authorization scheme (specified as the auth option).

Create directory and file structure

An overview of the required directory structure is provided in Figure 5. The 'input' folder contains two files, 'job.R' and 'runjob.R'. The file 'job.R' contains the definition for the function job(). This function expects two arguments: ntask, which specifies the number of tasks per job, and param, used to define any parameter of interest. Note that job() must have ntask and param as arguments, but can be modified to take more. This function returns a data frame, where each row is the result of one task. The file 'runjob.R' (running on the agent) receives three arguments when called by xgrid() (running on the client): ntask, param, and resfile (see the description of 'simulation.R' below). The arguments ntask and param are passed to job(), and the data frame returned by job() is saved to the folder 'output', with the filename specified by resfile.

The folder 'output' will contain the results from the simulations. If created manually, it should be empty. If not created manually, the xgrid() function will create it.

The script 'simulation.R' accesses the controller by calling xgrid() and allows the user to override any default arguments.

Call xgrid() within 'simulation.R'

After loading the xgrid package and calling the xgrid() function as described in 'simulation.R', results from all numsim simulations are collated and returned as the object res. This object is also saved as the file 'RESULTS.rds', at the root of the directory structure. This can be analyzed as needed.

Appendix B: Using additional packages with the Apple Xgrid

One of the strengths of R is the large number of add-on packages that extend the system. If the user of the grid has administrative privileges on the individual machines then any needed packages can be installed in the usual manner. However, if no administrative access is available, it is possible to utilize such packages within this setup by manually installing them in the 'input/rlibs' directory and loading them within a given job. This appendix details how to make a package available to a job running on a given agent.

1. Install the appropriate distribution file from CRAN into the directory 'input/rlibs'. This can be done by using the lib argument of install.packages(). For instance, in Example 2 we can install poLCA by using the call install.packages("poLCA", lib = "~/.../input/rlibs"), where '~/.../' emphasizes use of the absolute path to the 'rlibs' directory.

2. Access the add-on package within a job running on an agent. This is done by using the lib.loc option within the standard invocation of library(). For instance, in Example 2 this can be done by using the call library(poLCA, lib.loc = "./rlibs").
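For concreteness, a minimal 'job.R' satisfying the contract described in Appendix A might look like the following. This is a sketch rather than code from the package: the simulation body (a one-sample t-test on random data) and the column names are placeholders; only the requirement that job() accept ntask and param and return a data frame with one row per task comes from the text.

```r
# Hypothetical contents of 'job.R': one row of results per task.
# The signature (ntask, param) is required by the xgrid package;
# the body shown here is an illustrative placeholder simulation.
job <- function(ntask, param) {
  one.task <- function(i) {
    x <- rnorm(20, mean = param)   # param used here as an assumed effect size
    test <- t.test(x)
    data.frame(task     = i,
               estimate = unname(test$estimate),
               p.value  = test$p.value)
  }
  # Bind the per-task rows into the single data frame that xgrid() expects
  do.call(rbind, lapply(seq_len(ntask), one.task))
}
```

On the agent, 'runjob.R' would then call job(ntask, param) and write the returned data frame into the 'output' folder under the name given by resfile (for example, via saveRDS(job(ntask, param), file.path("output", resfile))).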


It should be noted that the package will need to be shipped over to the agent for each job, which may be less efficient than installing the package once per agent in the usual manner (but the latter option may not be available unless the grid user has administrative access to the individual agents).

Appendix C: Xgrid setup

In this section we discuss the steps required to set up a new Xgrid using XgridLite under Mac OS X 10.6 or 10.7. We presume some background in system administration and note that installation of a grid will likely require the assistance of technical staff. Depending on the nature of the network and machines involved, different Xgrid-related tools can be used, including Mac OS X Server and XgridLite. Mac OS X Server has several advantages, but for the purposes of this paper, we restrict our attention to XgridLite.

Anatomy of an Apple Xgrid

It is important to understand the flow of information through an Apple Xgrid and the relationship between the different computers involved. The centerpoint of an Apple Xgrid is the controller (Figure 1). When clients submit jobs to the grid, the controller distributes them to agents, which make themselves available to the controller on the local network (using Apple's Bonjour network discovery). That is, rather than the controller keeping a master list of which agents are part of the grid, agents detect the presence of a controller on the network and, depending on their configuration, make themselves available for job processing.

This loosely coupled grid structure means that if an agent previously involved in computation is unavailable, the controller simply passes jobs to other agents. This functionality is known as graceful degradation. Once an agent has completed its computation, the results are made available to the controller, where the client can retrieve them. As a result, the client never communicates directly with agents, only with the controller.

Downloading server tools

Although not necessary to install and operate an Apple Xgrid, the free Apple software suite called Mac OS X Server Tools is extremely useful. In particular, the Xgrid Admin application (see Figure 6) allows for the monitoring of any number of Apple Xgrids, including statistics on the grid such as the number of agents and total processing power, as well as a list of jobs which can be paused, stopped, restarted, and deleted. Most "grid stuffers" (our system included) are designed to clean up after themselves, so the job management functions in Xgrid Admin are primarily useful for monitoring the status of jobs.

Setting up a controller

Figure 8 displays the process of setting up a grid controller within XgridLite. To start the Xgrid controller service, simply click the "Start" button. Authentication settings are configured with the "Set Client Password..." and "Set Agent Password..." buttons. From the XgridLite panel, you can also reset the controller or turn it off.

Figure 8: Setting up a controller

Unlike many other client-server relationships, the Xgrid controller authenticates to the client, rather than the other way around. That is, the password entered in the XgridLite "Set Agent Password..." field is provided to agents, not required of them. Individual agents must be configured to require that particular password (or none at all) in order to join the Xgrid.
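On the client side, the counterpart of the "Set Client Password..." setting is the XGRID_CONTROLLER_PASSWORD environment variable described earlier, which must be set when the 'auth = "Password"' option is used. A minimal sketch of doing this from within R for the current session (the value is a placeholder, not a real credential):

```r
# Set the controller password for this R session only; required when the
# 'auth = "Password"' option is used (variable name from the text above,
# value is a placeholder).
Sys.setenv(XGRID_CONTROLLER_PASSWORD = "example-password")

# Confirm the variable is now visible to this session and its child processes.
nzchar(Sys.getenv("XGRID_CONTROLLER_PASSWORD"))  # TRUE
```

Setting the variable from R (rather than in a shell profile) keeps the password scoped to the session that actually submits jobs.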

Downloading XgridLite

XgridLite (http://edbaskerville.com/software/xgridlite) is free and open-source software, released under the GNU GPL (general public license), and installed as a Mac OS X preference pane. Once it is installed, it is found in the "Other" section (at the bottom) of the System Preferences (Apple menu → System Preferences).

Setting up agents

To set up an agent to join the Xgrid, open the Sharing preferences pane ("Internet & Wireless" section in OS 10.6) in System Preferences (see Figures 9 and 10). Select the "Xgrid Sharing" item in the list on the left. Click the "Configure..." button. From here, you can choose whether the agent should join the first available Xgrid, or a specific one. The latter of these

options is usually best – the drop-down menu of controllers will auto-populate with all that are currently running.

Figure 9: Sharing setup to configure an agent for Xgrid

Figure 10: Specifying the controller for an agent

If there are multiple controllers on a network, you may want to choose one specifically so that the agent does not default to a different one. You may also choose whether or not the agent should accept tasks when idle. Then click "Done" and choose the desired authentication method (most likely "Password" or [less advisable] "None") in the main window. If applicable, enter the password that will be required of the controller. Then, select the checkbox next to the "Xgrid Sharing" item on the left. The agent needs to be restarted to complete the process.

Note that a controller can also be an agent within its grid. However, it is important to consider that if the controller is also an agent, this has the potential to slow a simulation.

Automation

For those with more experience in system administration and access to Apple Remote Desktop (ARD) or secure shell (SSH) on the agent machines (or who work with an administrator who does), the process of setting up agents can be automated. R can be installed remotely (allowing you to be sure that it is installed in the same location on each machine), the file that contains Xgrid agent settings can be pushed to all agents and, finally, the agent service can be activated. We will not discuss the ARD remote package installation process in detail here, but a full explanation can be found in the Apple Remote Desktop Administrator Guide at http://images.apple.com/remotedesktop/pdf/ARD_Admin_Guide_v3.3.pdf.

The settings for the Xgrid agent service are stored in the plist file located at '/Library/Preferences/com.apple.xgrid.agent.plist'. These settings include authentication options, controller binding, and the number of physical processor cores the agent should report to the controller (see the note below for a discussion of this value, which is significant). The simplest way to automate the creation of an Apple Xgrid is to configure the agent service on a specific computer with the desired settings and then push the above plist file to the same relative location on all agents. Keep the processor core settings in mind; you may need to push different versions of the file to different agents in a heterogeneous grid (see Note 3 below). Once all agents have the proper plist file, run the commands 'xgridctl agent enable' and 'xgridctl agent on' on the agents.

Notes

1. Mac OS X Server provides several additional options for setting up an Xgrid controller. These include Kerberos authentication and customized Xgrid cache file locations. Depending on your network setup, these features may bolster security, but we will not explore them further here.

2. R must be installed in the same location on each agent. If it is installed locally by an end-user, it should default to the proper location, but it is worth verifying this before running tasks, otherwise they may fail. As mentioned previously, installing R remotely using ARD is an easy way to ensure an identical instantiation on each agent.

3. The reported number of processor cores (the ProcessorCount attribute) is significant because the controller considers it the maximum number of tasks the agent can execute simultaneously. On Mac OS X 10.6, agents by default report the number of cores (not processors) that they have (see discussion at http://lists.


apple.com/archives/Xgrid-users/2009/Oct/msg00001.html). The ProcessorCount attribute can be modified depending on the nature of the grid. Since R uses only 1 processor per core, this may leave many processors idle if the default behavior is not modified.

4. Because Xgrid jobs are executed as the user "nobody" on grid agents, they are given lower priority if the CPU is not idle.

Acknowledgements

Thanks to Tony Caldanaro and Randall Pruim for technical assistance, as well as Molly Johnson, two anonymous reviewers and the associate editor for comments and useful suggestions on an earlier draft. This material is based in part upon work supported by the National Institute of Mental Health (5R01MH087786-02) and the US National Science Foundation (DUE-0920350, DMS-0721661, and DMS-0602110).

Bibliography

Apple Inc. Mac OS X Server: Xgrid Administration and High Performance Computing (Version 10.6 Snow Leopard). Apple Inc., 2009.

L. M. Collins and S. T. Lanza. Latent Class and Latent Transition Analysis: With Applications in the Social, Behavioral, and Health Sciences. Wiley, 2009.

M. J. Denwood. runjags: Run Bayesian MCMC Models in the BUGS syntax from Within R, 2010. URL http://cran.r-project.org/web/packages/runjags/. R package version 0.9.9-1.

A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994. URL http://www.csm.ornl.gov/pvm/.

T. Hesterberg. It's time to retire the n ≥ 30 rule. Proceedings of the Joint Statistical Meetings, 2008. URL http://home.comcast.net/~timhesterberg/articles/.

N. Horton and S. Anoke. xgrid: Access Apple Xgrid using R, 2012. URL http://CRAN.R-project.org/package=xgrid. R package version 0.2-5.

P. K. Keel, M. Fichter, N. Quadflieg, C. M. Bulik, M. G. Baxter, L. Thornton, K. A. Halmi, A. S. Kaplan, M. Strober, D. B. Woodside, S. J. Crow, J. E. Mitchell, A. Rotondo, M. Mauri, G. Cassano, J. Treasure, D. Goldman, W. H. Berrettini, and W. H. Kaye. Application of a latent class analysis to empirically define eating disorder phenotypes. Archives of General Psychiatry, 61(2):192–200, February 2004.

N. M. Li and A. J. Rossini. rpvm: Cluster statistical computing in R. R News, 1(3):4–7, 2001.

D. A. Linzer and J. B. Lewis. poLCA: An R package for polytomous variable latent class analysis. Journal of Statistical Software, 42(10):1–29, 2011. URL http://www.jstatsoft.org/v42/i10/.

A. J. Rossini, L. Tierney, and N. Li. Simple parallel statistical computing in R. Journal of Computational and Graphical Statistics, 16(2):399–420, 2007.

S. S. Sawilowsky and R. C. Blair. A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychological Bulletin, 111(2):352–360, 1992.

S. A. Swanson, K. Lindenberg, S. Bauer, and R. D. Crosby. A Monte Carlo investigation of factors influencing latent class analysis: An application to eating disorder research. Journal of Abnormal Psychology, 121(1):225–231, 2011.

S. Urbanek. multicore: Parallel processing of R code on machines with multiple cores or CPUs, 2011. URL http://CRAN.R-project.org/package=multicore. R package version 0.1-7.

D. Wegener, T. Sengstag, S. Sfakianakis, S. Rüping, and A. Assi. GridR: An R-based grid-enabled tool for data analysis in ACGT Clinico-Genomic Trials. Proceedings of the 3rd International Conference on e-Science and Grid Computing (eScience 2007), 2007.

B. Wilkinson and M. Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, 1999.

H. Yu. Rmpi: Parallel statistical computing in R. R News, 2(2):10–14, 2002.

Sarah Anoke
Department of Mathematics and Statistics, Smith College
Clark Science Center, 44 College Lane
Northampton, MA 01063-0001 USA
[email protected]

Yuting Zhao
Department of Mathematics and Statistics, Smith College
Clark Science Center, 44 College Lane
Northampton, MA 01063-0001 USA
[email protected]

Rafael Jaeger
Brown University
Providence, RI 02912 USA
[email protected]

Nicholas J. Horton
Department of Mathematics and Statistics, Smith College
Clark Science Center, 44 College Lane
Northampton, MA 01063-0001 USA
[email protected]

maxent: An R Package for Low-memory Multinomial Logistic Regression with Support for Semi-automated Text Classification

by Timothy P. Jurka

Abstract maxent is a package with tools for data classification using multinomial logistic regression, also known as maximum entropy. The focus of this maximum entropy classifier is to minimize memory consumption on very large datasets, particularly sparse document-term matrices represented by the tm text mining package.

Introduction

The information era has provided political scientists with a wealth of data, yet along with this data has come the need to make sense of it all. Researchers have spent the past two decades manually classifying, or "coding" data according to their specifications—a monotonous task often assigned to undergraduate research assistants.

In recent years, supervised machine learning has become a boon in the social sciences, supplementing assistants with a computer that can classify documents with comparable accuracy. One of the main programming languages used for supervised learning in political science is R, which contains a plethora of machine learning add-ons via its package repository, CRAN.

Multinomial logistic regression

Multinomial logistic regression, or maximum entropy, has historically been a strong contender for text classification via supervised learning. When compared to the naive Bayes algorithm, a common benchmark for text classification, maximum entropy generally classifies documents with higher accuracy (Nigam, Lafferty, and McCallum, 1999).

Although packages for multinomial logistic regression exist on CRAN (e.g. multinom in package nnet, package mlogit), none handle the data using compressed sparse matrices, and most use a formula interface to define the predictor variables, approaches which tend to be inefficient. For most text classification applications, these approaches do not provide the level of optimization necessary to train and classify maximum entropy models using large corpora. Typically, this problem manifests as the error "cannot allocate vector of size N" that terminates execution of the script.

Reducing memory consumption

maxent seeks to address this issue by utilizing an efficient C++ library based on an implementation written by Yoshimasa Tsuruoka at the University of Tokyo (Tsuruoka, 2011). Tsuruoka reduces memory consumption by using three efficient algorithms for parameter estimation: the limited-memory Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS), orthant-wise limited-memory quasi-Newton optimization (OWLQN), and stochastic gradient descent optimization (SGD). Typically maximum entropy models grow rapidly in size when trained on large text corpora, and minimizing the number of parameters in the model is key to reducing memory consumption. In this respect, these algorithms outperform alternative optimization techniques such as the conjugate gradient method (Nocedal, 1990). Furthermore, the SGD algorithm provides particularly efficient parameter estimation when dealing with large-scale learning problems (Bottou and Bousquet, 2008).

After modifying the C++ implementation to make it work with R and creating an interface to its functions using the Rcpp package (Eddelbuettel and Francois, 2011), the maxent package yielded impressively efficient memory usage on large corpora created by tm (Feinerer, Hornik, and Meyer, 2008), as can be seen in Figure 1. Additionally, maxent includes the as.compressed.matrix() function for converting various matrix representations including "DocumentTermMatrix" found in the tm package, "Matrix" found in the Matrix package, and "simple_triplet_matrix" found in package slam into the "matrix.csr" representation from package SparseM (Koenker and Ng, 2011). In the case of document-term matrices generated by tm, this function eliminates the intermediate step of converting the "DocumentTermMatrix" class to a standard "matrix", which consumes significant amounts of memory when representing large datasets. Memory consumption was measured using a shell script that tracked the peak memory usage reported by the UNIX top command during execution of the maxent() and predict() functions in a sterile R environment.

Figure 1: Memory consumption of a SparseM "matrix.csr" representation compared with a standard "matrix" representation for 500, 1000, and 3000 documents from The New York Times dataset.

Furthermore, the standard "DocumentTermMatrix" represents i indices as repeating values, which needlessly consumes memory. The i indices link a document to its features via an intermediate j index. In the example below, an eight-digit series of 1's signifies that the first eight j indices correspond to the first document in the "DocumentTermMatrix".

> matrix$i
1 1 1 1 1 1 1 1 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
4 4 4 5 5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 7 7 7 7 7 8 8 8 8
8 8

Conversely, the SparseM package used in maxent consolidates repeating values into a range of values. In the next example, j indices 1 through 8 correspond to the first document, 9 through 14 to the second document, and so forth. This removes excess integers in the "DocumentTermMatrix" representation of the sparse triplet matrix, further reducing the memory footprint of the sparse matrix representation. The matrix representation below has been reduced from 62 integers down to 9, or approximately 1/7th the original size.

> sparse@ia
1 9 15 24 34 46 52 57 63

Example

We read in our dataset using input/output functions, in this case read.csv(). The NYTimes.csv.gz dataset is bundled with the maxent package, and contains 4000 manually coded headlines from The New York Times (Boydstun, 2008).

data <- read.csv(
  system.file("data/NYTimes.csv.gz",
              package = "maxent"))

Next, we use tm to transform our data into a corpus, and subsequently a "DocumentTermMatrix" or its transpose, the "TermDocumentMatrix". We then convert our "DocumentTermMatrix" into a compressed "matrix.csr" representation using the as.compressed.matrix() function.

corpus <- Corpus(VectorSource(data$Title))
matrix <- DocumentTermMatrix(corpus)
sparse <- as.compressed.matrix(matrix)

We are now able to specify a training set and a classification set of data, and start the supervised learning process using the maxent() and predict() functions. In this case we have specified a training set of 1000 documents and a classification set of 500 documents.

model <- maxent(sparse[1:1000,],
                data$Topic.Code[1:1000])
results <- predict(model,
                   sparse[1001:1500,])

Although maxent's primary appeal is to those users needing low-memory multinomial logistic regression, the process can also be run using standard matrices. However, the input matrix must be an as.matrix() representation of a "DocumentTermMatrix"; class "TermDocumentMatrix" is not currently supported.

model <- maxent(as.matrix(matrix)[1:1000,],
                as.factor(data$Topic.Code)[1:1000])
results <- predict(model,
                   as.matrix(matrix)[1001:1500,])

To save time during the training process, we can save our model to a file using the save.model() function. The saved maximum entropy model is accessible using the load.model() function.

model <- maxent(sparse[1:1000,],
                data$Topic.Code[1:1000])
save.model(model, "myModel")
saved_model <- load.model("myModel")
results <- predict(saved_model,
                   sparse[1001:1500,])

Algorithm performance

In political science, supervised learning is heavily used to categorize textual data such as legislation, television transcripts, and survey responses. To demonstrate the accuracy of the maxent package in a research setting, we will train the multinomial logistic regression classifier using 4,000 summaries of legislation proposed in the United States Congress and then test accuracy on 400 new summaries. The summaries have been manually partitioned into 20 topic codes corresponding to policy issues such as health,

agriculture, education, and defense (Adler and Wilkerson, 2004).

We begin by loading the maxent package and reading in the bundled USCongress.csv.gz dataset.

data <- read.csv(
  system.file("data/USCongress.csv.gz",
              package = "maxent"))

Using the built-in sample() function in R and the tm text mining package, we then randomly sample 4,400 summaries and generate a "TermDocumentMatrix". To reduce noise in the data, we make all characters lowercase and remove punctuation, numbers, stop words, and whitespace.

indices <- sample(1:4449, 4400, replace=FALSE)
sample <- data[indices,]
vector <- VectorSource(as.vector(sample$text))
corpus <- Corpus(vector)
matrix <- TermDocumentMatrix(corpus,
  control=list(weighting = weightTfIdf,
               language = "english",
               tolower = TRUE,
               stopwords = TRUE,
               removeNumbers = TRUE,
               removePunctuation = TRUE,
               stripWhitespace = TRUE))

Next, we pass this matrix into the maxent() function to train the classifier, partitioning the first 4,000 summaries as the training data. Similarly, we partition the last 400 summaries as the new data to be classified. In this case, we use stochastic gradient descent for parameter estimation and set a held-out sample of 400 summaries to account for model overfitting.

model <- maxent(matrix[,1:4000],
                sample$major[1:4000],
                use_sgd = TRUE,
                set_heldout = 400)
results <- predict(model, matrix[,4001:4400])

When we compare the predicted value of the results to the true value assigned by a human coder, we find that 80% of the summaries are accurately classified. Considering we only used 4,000 training samples and 20 labels, this result is impressive, and the classifier could certainly be used to supplement human coders. Furthermore, the algorithm is very fast; maxent uses only 135.4 megabytes of RAM and finishes in 53.3 seconds.

Model tuning

The maxent package also includes the maxent.tune() function to determine parameters that will maximize the recall of the results during the training process. The function tests the algorithm across a sample space of 18 parameter configurations, including varying the regularization terms for L1 and L2 regularization, using SGD optimization, and setting a held-out portion of the data to detect overfitting.

We can use the tuning function to optimize a model that will predict the species of an iris given the dimensions of petals and sepals. Using Anderson's Iris data set that is bundled with R, we will use n-fold cross-validation to test the model using several parameter configurations.

data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species
f <- tune.maxent(x, y, nfold=3, showall=TRUE)

 L1   L2   SGD  Held-out  Accuracy  Pct. Fit
 0    0    0    0         0.948     0.975
 0.2  0    0    0         0.666     0.685
 0.4  0    0    0         0.666     0.685
 0.6  0    0    0         0.666     0.685
 0.8  0    0    0         0.666     0.685
 1    0    0    0         0.666     0.685
 0    0    0    25        0.948     0.975
 0    0.2  0    0         0.966     0.993
 0    0.4  0    0         0.966     0.993
 0    0.6  0    0         0.972     1
 0    0.8  0    0         0.972     1
 0    1    0    0         0.966     0.993
 0    0    1    0         0.966     0.993
 0.2  0    1    0         0.666     0.685
 0.4  0    1    0         0.666     0.685
 0.6  0    1    0         0.666     0.685
 0.8  0    1    0         0.666     0.685
 1    0    1    0         0.666     0.685

Table 1: Output of the maxent.tune() function when run on Anderson's Iris data set, with each row corresponding to a unique parameter configuration. The first two columns are L1/L2 regularization parameters bounded by 0 and 1. SGD is a binary variable indicating whether stochastic gradient descent should be used. The fourth column represents the number of training samples held out to test for overfitting. Accuracy is measured via n-fold cross-validation and percent fit is the accuracy of the configuration relative to the optimal parameter configuration.

As can be seen in Table 1, the results of the tuning function show that using L2 regularization will provide the best results, but using stochastic gradient descent without any regularization will perform nearly as well. We can also see that we would not want to use L1 regularization for this application as it drastically reduces accuracy. The held-out parameter is used without L1/L2 regularization and SGD because tests showed a decrease in accuracy when using them together. However, when L1/L2 regularization and SGD do not yield an increase in accuracy, using held-out data can improve performance. Using this information, we can improve the accuracy of our classifier without adding or manipulating the training data.

Summary

The maxent package provides a fast, low-memory maximum entropy classifier for a variety of classification tasks including text classification and natural language processing. The package integrates directly into the popular tm text mining package, and provides sample datasets and code to get started quickly. Additionally, a number of advanced features are available to prevent model overfitting and provide more accurate results.

Acknowledgements

This project was made possible through financial support from the University of California at Davis, University of Antwerp, and Sciences Po Bordeaux. I am indebted to Professor Yoshimasa Tsuruoka for taking the time to answer questions about his excellent C++ maximum entropy implementation, as well as Professor Amber E. Boydstun for her New York Times dataset and comments on this paper. This paper also benefited from the feedback of two anonymous reviewers.

Bibliography

E. Adler, J. Wilkerson. Congressional Bills Project. University of Washington, 2004. URL http://www.congressionalbills.org/.

L. Bottou, O. Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, pp. 161–168, 2008. URL http://leon.bottou.org/publications/pdf/nips-2007.pdf.

A. Boydstun. How Policy Issues Become Front-Page News. Under review. URL http://psfaculty.ucdavis.edu/boydstun/.

D. Eddelbuettel, R. Francois. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), pp. 1–18, 2011. URL http://www.jstatsoft.org/v40/i08/.

I. Feinerer, K. Hornik, D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5), pp. 1–54, 2008. URL http://www.jstatsoft.org/v25/i05/.

R. Koenker, P. Ng. SparseM: A Sparse Matrix Package for R. CRAN Package Archive, 2011. URL http://cran.r-project.org/package=SparseM.

K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. IJCAI-99 Workshop on Machine Learning for Information Filtering, pp. 61–67, 1999. URL http://www.kamalnigam.com/papers/maxent-ijcaiws99.pdf.

J. Nocedal. The performance of several algorithms for large scale unconstrained optimization. Large Scale Numerical Optimization, pp. 138–151, 1990. Society for Industrial & Applied Mathematics.

Y. Tsuruoka. A simple C++ library for maximum entropy classification. University of Tokyo Department of Computer Science (Tsujii Laboratory), 2011. URL http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/.

Timothy P. Jurka
Department of Political Science
University of California, Davis
Davis, CA 95616-8682 USA
[email protected]


Sumo: An Authenticating Web Application with an Embedded R Session

by Timothy T. Bergsma and Michael S. Smith

Abstract Sumo is a web application intended as a template for developers. It is distributed as a Java 'war' file that deploys automatically when placed in a Servlet container's 'webapps' directory. If a user supplies proper credentials, Sumo creates a session-specific Secure Shell connection to the host and a user-specific R session over that connection. Developers may write dynamic server pages that make use of the persistent R session and user-specific file space. The supplied example plots a data set conditional on preferences indicated by the user; it also displays some static text. A companion server page allows the user to interact directly with the R session. Sumo's novel feature set complements previous efforts to supply R functionality over the internet.

Despite longstanding interest in delivering R functionality over the internet (Hornik, 2011), embedding R in a web application can still be technically challenging. Anonymity of internet users is a compounding problem, complicating the assignment of persistent sessions and file space. Sumo uses Java servlet technology to minimize administrative burden and requires authentication to allow unambiguous assignment of resources. Sumo has few dependencies and may be readily adapted to create other applications.

Deployment

Use of Sumo requires an internet server with the following:

• users with passworded accounts (Sumo defers to the server for all authentication)

• a system-wide installation of R (http://cran.r-project.org)

• a running instance of a Java servlet container, e.g. Tomcat 6 (http://tomcat.apache.org)

• Secure Shell¹, e.g. OpenSSH (http://www.openssh.com) or freeSSHd (http://www.freesshd.com) for Windows.

Tomcat, the reference implementation for the Java Servlet and Java Server Pages (JSP) specifications, is widely available and surprisingly easy to install. While frequently run behind Apache (http://www.apache.org), it can function as a stand-alone web server. In our experience, it requires mere minutes to install under Linux (Ubuntu) or Mac OS X, and a little longer to install for Windows. Tomcat is open source.

Sumo is maintained at http://sumo.googlecode.com and is distributed under the GPL v3 open source license as a Java 'war' file. The current distribution, considered complete, is 'sumo-007.war' as of this writing. It can be downloaded, optionally renamed (e.g. 'sumo.war'), and saved in Tomcat's 'webapps' directory. Tomcat discovers, installs, and launches the web application. By default, Tomcat listens on port 8080. The application can be accessed with a web browser, e.g. at http://localhost:8080/sumo.

[Figure 1: Sumo architecture: flow control among page requests. Solid boxes: visible JSPs. Dashed boxes: JSPs with no independent content.]

Architecture

An http request for the web application Sumo defaults to the 'login.jsp' page (Figure 1). Tomcat creates a Java servlet session for the particular connection. 'login.jsp' forwards the supplied credentials to a servlet (not shown) that tries to connect to the local host² using Secure Shell. If successful, it immediately opens an R session using R --vanilla³, stores pointers to the R session, and forwards the user to 'evaluate.jsp': the core of the application (see Figure 2). The only other viewable page is 'monitor.jsp',

¹OS X includes OpenSSH. Check the box: System Preferences|Sharing|remote login. In '/etc/sshd_config', add "PasswordAuthentication yes". With freeSSHd for Windows, uncheck: settings|SSH|Use new console engine.
²Host is configurable in the file 'sumo.xml'.
³Command to start R is configurable in the file 'sumo.xml'.

which allows direct interaction with the R command line (Figure 3).

Both 'evaluate.jsp' and 'monitor.jsp' use optional Sitemesh technology (http://www.sitemesh.org) to include 'decorate.jsp', which itself includes the header, the footer, and navigation links to 'evaluate.jsp', 'monitor.jsp', and 'logout.jsp'. 'decorate.jsp' also includes 'authorize.jsp', which redirects to 'login.jsp' for non-authenticated requests. Thus decorated, 'evaluate.jsp' and 'monitor.jsp' always include links to each other and are only available to authenticated users.

The key to embedding R in Sumo is the exchange of information between the Java session and the R session. The exchange occurs primarily in 'evaluate.jsp'. Similar to Sweave (Leisch, 2002), Java Server Pages support alternation between two languages: in this case, html and Java. The R session can be accessed using either language. With respect to Java, 'evaluate.jsp' defines a reference to an object called R that evaluates R code and returns a string result (with or without echo).

<% String directory = R.evaluate(
     "writeLines(getwd())", false); %>

With respect to html, 'evaluate.jsp' includes Sumo's RTags library (a JSP Tag Library), which allows blocks of R code to be inserted directly within html-like tags.

<%@ taglib prefix="r" uri="/WEB-INF/RTags.tld" %>

library(lattice)
Theoph$all <- 1
head(Theoph)

[Figure 2: Sumo display: sample output of 'evaluate.jsp'.]

Also included in 'evaluate.jsp' is 'params2r.jsp', which assures that parameters supplied by the user are automatically defined in the R session as either numeric (if possible) or character objects. Note that 'evaluate.jsp' has a single web form (Figure 2) whose target is 'evaluate.jsp' itself. Interaction with a user consists of repeated calls to the page, with possibly modified parameters. The example given illustrates table and figure display, emphasizing a variety of input styles (pull-down, radio button, text box). The data frame Theoph is plottable eight different ways (grouped or ungrouped, conditioned or unconditioned, log or linear) with an adjustable smoothing span. Internally, 'evaluate.jsp' follows a well-defined sequence:

[Figure 3: Sumo display: sample output of 'monitor.jsp'.]

1. Capture any parameters specified by the user.


2. Supply defaults for parameters not user-specified.

3. Echo parameters to the R session as necessary.

4. Create and display a figure.

5. Present a form with current parameters as the defaults.

Additionally, static text is included to illustrate a very simple way of presenting tabular data.

Development

Sumo is intended as a template for web application developers. There are at least three ways to adapt Sumo. For experimental purposes, one can edit deployed JSPs in place (see 'webapps/sumo/'); Tomcat notices the changes and recompiles corresponding servlets. For better source control, one can unzip 'sumo.war'⁴, edit files of interest, re-zip and redeploy. For a formal development environment, one can install Apache Ant (http://ant.apache.org) and check out Sumo's source:

svn checkout http://sumo.googlecode.com/svn/trunk/ sumo

Then at the command prompt in the 'sumo' directory use⁵:

ant -f build.xml

Security

Sumo inherits most of its security attributes from the infrastructure on which it builds. Since the R session is created with a Secure Shell connection, it has exactly the file access configured at the machine level for the particular user. The JSPs and related servlets, which are not user-modifiable, verify the user before proceeding. 'SpecificImageServlet' (called from 'evaluate.jsp') can be used to request arbitrary files, but it defers to the R session's verdict on read permission. Over http, login credentials may be vulnerable to discovery. Where necessary, administrators may configure Tomcat or Apache to use the secure http protocol (i.e. https).

Persistence

While it is possible to evaluate R expressions in isolation, more functionality is available if results can persist in session memory or in hardware memory. Sumo uses both forms of persistence. A session is created for each user at login, and persists across page requests until logout. Files created on behalf of the user are stored in the server file space associated with the user's account. The application developer can take advantage of both.

In some cases, it is possible to create R sessions that persist, not just within server sessions, but also across them. On systems supporting 'screen' (e.g. Linux, OS X), the command used to start R can be configured in 'sumo.xml' as

screen -d -r || screen R --vanilla

An R session will be attached if one exists, else a new one will be created. On logout, the R session persists, and is reattached at next login.

Discussion

While Sumo is not the first attempt at an R-enabled web application, it offers simple deployment and standard architecture. Deployment consists of dropping the Sumo archive into the 'webapps' directory of a running instance of Tomcat. Development can be as simple as editing the resulting text files, even while the application is running.

Like Rpad (Short and Grosjean, 2007), Sumo accesses R functionality through the user's web browser. Rpad is arguably easier to deploy on a local machine, requiring only the Rpad package, but Sumo is easier to deploy on a server, as it requires no customization of Tomcat or Apache.

Relative to Sumo, rApache (Horner, 2012b) is a more sophisticated solution for the development of R-enabled web applications. It uses the Apache web server rather than Tomcat. Sumo is an example application, whereas rApache is a module. As such, rApache remains generally unprejudiced with respect to design choices like those for authentication and persistence. Like Rpad, rApache requires server configuration. rApache is currently only available for Linux and Mac OS X, whereas Tomcat (and therefore Sumo) is available for those platforms as well as Windows.

Of broader utility than just for web pages, the packages brew (Horner, 2011) and R.rsp (Bengtsson, 2012) independently provide functions for text preprocessing. Expressions within delimiters are evaluated in R, substituting the resulting values into the original text. The delimiters are similar or identical to those used in Java Server Pages. Sumo retains such delimiters to evaluate Java expressions and introduces alternative delimiters to evaluate R expressions.

Rook (Horner, 2012a) is a specification that defines an interface between a web server and R. Applications written against the specification can be run unmodified with any supporting server, such as R's

⁴'war' files have 'zip' file architecture.
⁵The value of tomcat.dir in 'build.xml' may need modification.

internal web server or an rApache instance. As a convenience, Rook also supplies classes for writing Rook-compliant applications and for working with R's built-in web server. For comparison, Sumo is compliant with the Java Servlet specification and runs on supporting servers such as Tomcat.

Rserve (Urbanek, 2003) can be adapted for web applications, but is intended for applications where bandwidth is more important and security less so, relative to Sumo.

While Sumo does provide some access to the R command line, those interested primarily in running R remotely should favor the server version of RStudio (http://rstudio.org). RStudio provides some graphical interaction via the manipulate package, but the technique does not seem easy to formalize for an independent application.

Display of text is only briefly treated by Sumo; R2HTML (Lecoutre, 2003) seems a natural complement.

Summary

Sumo is a web application designed to be easily deployed and modified. It relies on industry-standard software, such as Tomcat and Secure Shell, to enlist the web browser as an interface to R functionality. It has few dependencies, few limitations, good persistence and good security. Enforced authentication of users allows unambiguous assignment of sessions and file space for full-featured use of R. The learning curve is shallow: while some Java is required, most interested parties will already know enough html and R to produce derivative applications quickly.

Acknowledgments

We thank Henrik Bengtsson, Jeffrey Horner, James Rogers, Joseph Hebert, and an anonymous reviewer for valuable comments.

Development and maintenance of Sumo is funded by Metrum Research Group LLC (http://www.metrumrg.com).

Bibliography

H. Bengtsson. R.rsp: Dynamic generation of scientific reports, 2012. URL http://CRAN.R-project.org/package=R.rsp. R package version 0.7.1.

J. Horner. brew: Templating framework for report generation, 2011. URL http://CRAN.R-project.org/package=brew. R package version 1.0-6.

J. Horner. Rook: a web server interface for R, 2012a. URL http://CRAN.R-project.org/package=Rook. R package version 1.0-3.

J. Horner. rApache: Web application development with R and Apache, 2012b. URL http://www.rapache.net.

K. Hornik. R FAQ: Frequently asked questions on R, 2011. URL http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-Web-Interfaces. Version 2.13.2011-07-05, ISBN 3-900051-08-9.

E. Lecoutre. The R2HTML package. R News, 3(3):33–36, December 2003. URL http://cran.r-project.org/doc/Rnews/Rnews_2003-3.pdf.

F. Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In W. Härdle and B. Rönz, editors, Compstat 2002 — Proceedings in Computational Statistics, pages 575–580. Physica Verlag, Heidelberg, 2002. URL http://www.stat.uni-muenchen.de/~leisch/Sweave. ISBN 3-7908-1517-9.

T. Short and P. Grosjean. Rpad: Workbook-style, web-based interface to R, 2007. URL http://www.rpad.org/Rpad. R package version 1.3.0.

S. Urbanek. Rserve: A fast way to provide R functionality to applications. In K. Hornik, F. Leisch, and A. Zeileis, editors, Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria, March 20-22 2003. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2003/. ISSN 1609-395X.

Timothy T. Bergsma
Metrum Research Group LLC
2 Tunxis Road Suite 112
Tariffville CT 06081
USA
[email protected]

Michael S. Smith
Consultant
26 Harris Fuller Road
Preston, CT 06365
USA
[email protected]

Who Did What? The Roles of R Package Authors and How to Refer to Them

by Kurt Hornik, Duncan Murdoch and Achim Zeileis

Abstract Computational infrastructure for representing persons and citations has been available in R for several years, but has been restructured through enhanced classes "person" and "bibentry" in recent versions of R. The new features include support for the specification of the roles of package authors (e.g. maintainer, author, contributor, translator, etc.) and more flexible formatting/printing tools among various other improvements. Here, we introduce the new classes and their methods and indicate how this functionality is employed in the management of R packages. Specifically, we show how the authors of R packages can be specified along with their roles in package 'DESCRIPTION' and/or 'CITATION' files and the citations produced from it.

R packages are the result of scholarly activity and as such constitute scholarly resources which must be clearly identifiable for the respective scientific communities and, more generally, today's information society. In particular, packages published by standard repositories can be regarded as reliable sources which can and should be referenced (i.e. cited) by scientific works such as articles or other packages. This requires conceptual frameworks and computational infrastructure for describing bibliographic resources, general enough to encompass the needs of communities with an interest in R. These needs include support for exporting bibliographic metadata in standardized formats such as BibTeX (Berry and Patashnik, 2010), but also facilitating bibliometric analyses and investigations of the social fabric underlying the creation of scholarly knowledge.

The latter requires a richer vocabulary than commonly employed by reference management software such as BibTeX, identifying persons and their roles in relation to bibliographic resources. For example, a thesis typically has an author and advisors. Software can have an (original) author and a translator to another language (such as from S to R). The maintainer of an R package is not necessarily an author.

In this paper, we introduce the base R infrastructure (as completely available in R since version 2.14.0) for representing and manipulating such scholarly data: objects of class "person" (hereafter, person objects) hold information about persons, possibly including their roles; objects of class "bibentry" (hereafter, bibentry objects) hold bibliographic information in enhanced BibTeX style, ideally using person objects when referring to persons (such as authors or editors). Furthermore, we indicate how this functionality is employed in the management of R packages, in particular in their 'CITATION' and 'DESCRIPTION' files.

Persons and their roles

Person objects can hold information about an arbitrary positive number of persons. These can be obtained by one call to person() with list arguments, or by first creating objects representing single persons and combining these via c().

Every person has at least one given name, and natural persons typically (but not necessarily, Wikipedia, 2012) have a family name. These names are specified by arguments given and family, respectively. For backward compatibility with older versions, there are also arguments first, middle and last with obvious (Western-centric) meaning; however, these are deprecated as not universally applicable: some cultures place the given after the family name, or use no family name.

An optional email argument allows specifying persons' email addresses, which may serve as unique identifiers for the persons.

Whereas person names by nature may be arbitrary non-empty character strings, the state of the art in library science is to employ standardized vocabularies for person roles. R uses the "Relator and Role" codes and terms (http://www.loc.gov/marc/relators/relaterm.html) from MARC (MAchine-Readable Cataloging, Library of Congress, 2012), one of the leading collections of standards used by librarians to encode and share bibliographic information. Argument role specifies the roles using the MARC codes; if necessary, the comment argument can be used to provide (human-readable) character strings complementing the standardized role information.

As an example (see Figure 1 and also ?person), consider specifying the authors of package boot (Canty and Ripley, 2012): Angelo Canty is the first author ("aut") who wrote the S original; Brian D. Ripley is the second author ("aut"), who translated the code to R ("trl"), and maintains the package ("cre").

Note that the MARC definition of the relator term "Creator" is "Use for a person or organization responsible for the intellectual or artistic content of a work.", which maps nicely to the notion of the maintainer of an R package: the maintainer creates the package as a bibliographic resource for possible redistribution (e.g. via the CRAN package repository), and is the person (currently) responsible for


R> ## specifying the authors of the "boot" package
R> p <- c(
+   person("Angelo", "Canty", role = "aut", comment =
+     "S original, http://statwww.epfl.ch/davison/BMA/library.html"),
+   person(c("Brian", "D."), "Ripley", role = c("aut", "trl", "cre"),
+     comment = "R port, author of parallel support", email = "[email protected]"))
R> p

[1] "Angelo Canty [aut] (S original, http://statwww.epfl.ch/davison/BMA/library.html)"
[2] "Brian D. Ripley [aut, trl, cre] (R port, author of parallel support)"

R> ## subsetting
R> p[-1]

[1] "Brian D. Ripley [aut, trl, cre] (R port, author of parallel support)"

R> ## extracting information (as list for multiple persons)
R> p$given

[[1]]
[1] "Angelo"

[[2]]
[1] "Brian" "D."

R> ## extracting information (as vector for single person)
R> p[2]$given

[1] "Brian" "D."

R> ## flexible formatting
R> format(p, include = c("family", "given", "role"),
+   braces = list(family = c("", ","), role = c("[Role(s): ", "]")),
+   collapse = list(role = "/"))

[1] "Canty, Angelo [Role(s): aut]" "Ripley, Brian D. [Role(s): aut/trl/cre]"

R> ## formatting for BibTeX
R> toBibtex(p)

[1] "Angelo Canty and Brian D. Ripley"

R> ## alternative BibTeX formatting "by hand"
R> paste(format(p, include = c("family", "given"),
+   braces = list(family = c("", ","))), collapse = " and ")

[1] "Canty, Angelo and Ripley, Brian D."

Figure 1: Example for person(), specifying the authors of the boot package.

the package, being the "interface" between what is inside the package and the outside world, and warranting in particular that the license and representation of intellectual property rights are appropriate. For the persons who originally "created" the package code, typically author roles should be employed.

In general, while all MARC relator codes are supported, the following usage is suggested when giving the roles of persons in the context of authoring R packages:

"aut" (Author): Full authors who have made substantial contributions to the package and should show up in the package citation.

"com" (Compiler): Persons who collected code (potentially in other languages) but did not make further substantial contributions to the package.

"ctb" (Contributor): Authors who have made smaller contributions (such as code patches


etc.) but should not show up in the package citation.

"cph" (Copyright holder): Copyright holders.

"cre" (Creator): Package maintainer.

"ths" (Thesis advisor): Thesis advisor, if the package is part of a thesis.

"trl" (Translator): Translator to R, if the R code is a translation from another language (typically S).

Person objects can be subscripted by field (using $) or by position (using [), and allow for flexible formatting and printing. By default, email, role and comment entries are suitably "embraced" by enclosing with angle brackets, square brackets and parentheses, respectively, and suitably collapsed. There is also a toBibtex() method which creates a suitable BibTeX representation. Finally, the default method for as.person() is capable of inverting the default person formatting, and also can handle formatted person entries collapsed by comma or 'and' (with appropriate white space). See Figure 1 and ?person for various illustrative usages of these methods.

R code specifying person objects can also be used in the Authors@R field of 'DESCRIPTION' files. This information is used by the R package management system (specifically, R CMD build and R CMD INSTALL) to create the Author and Maintainer fields from Authors@R, unless these are still defined (statically) in 'DESCRIPTION'. Also, the person information is used for reliably creating the author information in auto-generated package citations (more on this later). Note that there are no default roles for Authors@R: all authors must be given an author role ("aut"), and the maintainer must be identified by the creator role ("cre").

Bibliographic information

Bibentry objects can hold information about an arbitrary positive number of bibliography entries. These can be obtained by one call to bibentry() with list arguments, or by first creating objects representing single bibentries and combining these via c().

Bibentry objects represent bibliographic information in "enhanced BibTeX style", using the same entry types (such as 'Article' or 'Book') and field names (such as 'author' and 'year') as BibTeX. A single bibentry is created by calling bibentry() with arguments bibtype (typically used positionally) and key (optionally) giving the type of the entry and a key for it, and further fields given in tag = value form, with tag and value the name and value of the field, respectively. See ?bibentry for details on types and fields. The author and editor field values can be person objects as discussed above (and are coerced to such using as.person() if they are not). In addition, it is possible to specify header and footer information for the individual entries (arguments header and footer) and the collection of entries (arguments mheader and mfooter).

Bibentry objects can be subscripted by field (using $) or by position (using [), and have print() and format() methods providing a choice between at least six different styles (as controlled by a style argument): plain text, HTML, LaTeX, BibTeX, R (for formatting bibentries as (high-level) R code), and citation information (including headers/footers and a mixture of plain text and BibTeX). The first three make use of a .bibstyle argument using the bibstyle() function in package tools which defines the display style of the citation by providing a collection of functions to format the individual parts of the entry. Currently the default style "JSS", which corresponds to the bibliography style employed by the Journal of Statistical Software (http://www.jstatsoft.org/), is the only available style, but it can be modified by replacing the component functions, or replaced with a completely new style. A style is implemented as an environment, and is required to contain functions to format each type of bibliographic entry (formatArticle(), formatBook(), etc.) as well as a function sortKeys() to sort entries. In addition to these required functions, the "JSS" style includes several dozen auxiliary functions that are employed by the required ones. Users modifying it should read the help page ?tools::bibstyle, which includes an example that changes the sort order. Those implementing a completely new style should consult the source code.

Figure 2 shows an illustrative example for using bibentry(): A vector with two bibentries is created for the boot package and the book whose methods it implements. In the first bibentry the person object p from Figure 1 is reused while the second bibentry applies as.person() to a string with author and role information. Subsequently, the bibentry is printed in "JSS" style and transformed to BibTeX. Note that the roles of the persons in the author field are not employed in toBibtex() as BibTeX does not support that concept. (Hence, the role information in the as.person() call creating b[2] would not have been necessary for producing BibTeX output.)

Currently, the primary use of bibentry objects is in package 'CITATION' files, which contain R code for defining such objects.¹ Function citation() collects all of the defined bibentry objects into a single object for creating package citations, primarily for visual inspection and export to BibTeX. (Technically, citation() creates an object of class "citation" inheriting from "bibentry", for which the print()

¹Typically, the R code should contain bibentry() calls but special constructs for reusing the information from the 'DESCRIPTION' are also available as is some legacy functionality (citEntry() etc.). More details are provided in the following.
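As a concrete illustration of the Authors@R mechanism described above, a 'DESCRIPTION' file might contain a fragment such as the following (the package name and persons are hypothetical):

```
Package: foo
Title: Tools That Illustrate Author Roles
Version: 0.1-0
Authors@R: c(
    person("Jane", "Doe", role = c("aut", "cre"),
           email = "[email protected]"),
    person("Joe", "Smith", role = "ctb"))
```

From such a specification, the package management system can generate the corresponding Author and Maintainer fields, and the auto-generated package citation picks up all persons with an "aut" role.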


R> b <- c(
+   bibentry(
+     bibtype = "Manual",
+     title = "{boot}: Bootstrap {R} ({S-Plus}) Functions",
+     author = p,
+     year = "2012",
+     note = "R package version 1.3-4",
+     url = "http://CRAN.R-project.org/package=boot",
+     key = "boot-package"
+   ),
+
+   bibentry(
+     bibtype = "Book",
+     title = "Bootstrap Methods and Their Applications",
+     author = "Anthony C. Davison [aut], David V. Hinkley [aut]",
+     year = "1997",
+     publisher = "Cambridge University Press",
+     address = "Cambridge",
+     isbn = "0-521-57391-2",
+     url = "http://statwww.epfl.ch/davison/BMA/",
+     key = "boot-book"
+   )
+ )
R> b

Canty A and Ripley BD (2012). _boot: Bootstrap R (S-Plus) Functions_. R package version 1.3-4, <URL: http://CRAN.R-project.org/package=boot>.

Davison AC and Hinkley DV (1997). _Bootstrap Methods and Their Applications_. Cambridge University Press, Cambridge. ISBN 0-521-57391-2, <URL: http://statwww.epfl.ch/davison/BMA/>.

R> ## formatting for BibTeX
R> toBibtex(b[1])

@Manual{boot-package,
  title = {{boot}: Bootstrap {R} ({S-Plus}) Functions},
  author = {Angelo Canty and Brian D. Ripley},
  year = {2012},
  note = {R package version 1.3-4},
  url = {http://CRAN.R-project.org/package=boot},
}

Figure 2: Example for bibentry(), specifying citations for the boot package (i.e. the package itself and the book which it implements).

method uses a different default style.)

Package citations can also be auto-generated from the package 'DESCRIPTION' metadata, see Figure 3 for an example: The bibentry title field is generated from the package metadata fields Package and Title, which should be capitalized (in title case), and not use markup or end in a period. (Hence, it may need manual touch-ups, e.g. protecting {S} or {B}ayesian etc.) The year is taken from Date/Publication (for CRAN packages) or Date (if available), and the note field is generated from the package's Version. If the package was installed from CRAN, the official stable CRAN package URL (here: http://CRAN.R-project.org/package=boot) is given in the url field.² Finally, the bibentry author field is generated from the description Author field (unless there is an Authors@R, see below). As the Author field in 'DESCRIPTION' is a plain text field intended for human readers, but not necessarily for automatic processing, it may happen that the auto-generated BibTeX is incorrect (as in Figure 3). Hence, citation() provides an ATTENTION note to this

²Note that the current physical URLs, e.g. http://CRAN.R-project.org/web/packages/boot/index.html, are not guaranteed to be stable and may change in the future (as they have in the past).


R> citation("boot", auto = TRUE)

To cite package 'boot' in publications use:

Angelo Canty and Brian Ripley (2012). boot: Bootstrap Functions (originally by Angelo Canty for S). R package version 1.3-4. http://CRAN.R-project.org/package=boot

A BibTeX entry for LaTeX users is

@Manual{,
  title = {boot: Bootstrap Functions (originally by Angelo Canty for S)},
  author = {Angelo Canty and Brian Ripley},
  year = {2012},
  note = {R package version 1.3-4},
  url = {http://CRAN.R-project.org/package=boot},
}

Figure 3: Citation for the boot package auto-generated from its ‘DESCRIPTION’ file. (Note that this is not the recommended citation of the package authors, see citation("boot") without auto = TRUE for that.) end. One can overcome this problem by providing Actions for package maintainers an Authors@R field in ‘DESCRIPTION’, from which the names of all persons with author roles can reliably be Package maintainers should provide an Authors@R determined and included in the author field. Note entry in their ‘DESCRIPTION’ files to allow compu- again that author roles ("aut") are not “implied” as tations on the list of authors, in particular to ensure defaults, and must be specified explicitly. that auto-generation of bibliographic references to their packages works reliably. Typically, the Author As illustrated in Figure3, one can inspect the and Maintainer entries in ‘DESCRIPTION’ can then auto-generated package citation bibentry by giving be omitted in the sources (as these are automatically auto = TRUE when calling citation() on a package added by R CMD build). – even if the package provides a ‘CITATION’ file in Packages wanting to employ the auto-generated which case auto defaults to FALSE. The motivation citation as (one of) the officially recommended refer- is that it must be possible to refer to the package ence(s) to their package can use the new simplified 3 itself, even if the preferred way of citing the pack- citation(auto = meta) to do so. Note that if this age in publications is via the references given in the is used then the package’s citation depends on R at package ‘CITATION’ file. Of course, both approaches least 2.14.0, suggesting to declare this requirement in might also be combined as in the boot package where the Depends field of ‘DESCRIPTION’. 
one of the bibentries in ‘CITATION’ corresponds es- As of 2012-04-02, CRAN had 3718 packages in to- sentially to the auto-generated package bibentry (see tal, with about 20% (734) having ‘CITATION’ files, but citation("boot")). To facilitate incorporating the less than 2.5% (92) already providing an Authors@R auto-generated citations into ‘CITATION’, one can in- entry in their package metadata: so clearly, there is clude a citation(auto = meta) entry: when evalu- substantial room for future improvement. ating the code in the ‘CITATION’ file, citation() au- tomagically sets meta to the package metadata. This eliminates the need for more elaborate programmatic Conclusions and outlook constructions of the package citation bibentry (or even worse, manually duplicating the relevant pack- R has provided functionality for representing per- age metadata), provided that auto-generation is reli- sons and generating package citations for several able (i.e. correctly processes author names and does years now. Earlier versions used ‘CITATION’ files with not need additional LATEX markup or protection in calls to citEntry() (and possibly citHeader() and the title). To check whether this is the case for a par- citFooter()). R 2.12.0 introduced the new biben- ticular package’s metadata and hence citation(auto try functionality for improved representation and = meta) can conveniently be used in ‘CITATION’, manipulation of bibliographic information (the old one can employ constructs such as citation(auto = mechanism is still supported and implemented by packageDescription("boot")). the new one). R 2.12.0 also added the new per- 3Thus, the information from the ‘DESCRIPTION’ can be reused in the ‘CITATION’. However, the reverse is not possible as the ‘DESCRIPTION’ needs to be self-contained.

son functionality (changing to the more appropriate given/family scheme and adding roles). R 2.14.0 added more functionality, including the Authors@R field in package ‘DESCRIPTION’ files, from which the Author and Maintainer fields in this file as well as package bibentries can reliably be auto-generated.

Having reliable comprehensive information on package “authors” and their roles which can be processed automatically is also highly relevant for the purpose of understanding the roles of persons in the social fabric underlying the remarkable growth of the R package multiverse, opening up fascinating possibilities for research on R (of course, with R), e.g. using social network analysis and related methods.

Since R 2.10.0, R help files have been able to contain R code in \Sexpr macros to be executed before the help text is displayed. In the future, it is planned to use this facility to auto-generate \references sections based on citations in the text, much as LaTeX and BibTeX can automatically generate bibliographies.

These enhancements will clearly make dealing with bibliographic information in Rd files much more convenient and efficient, and eliminate the need to “manually” (re)format entries in the \references sections. Of course, this should not come at the price of needing to manually generate bibentries for references already available in other bibliographic formats. For BibTeX format ‘.bib’ databases (presumably the format typically employed by R users), conversion is straightforward: one can read the entries in ‘.bib’ files into bibentry objects using read.bib() in CRAN package bibtex (François, 2011), use format() with style = "R" to obtain a character vector with bibentry() calls for each bibentry, and use writeLines() to write (or add) this to a bibentry database file (see ?bibentry for examples of the last two steps). For other formats, one could first transform to BibTeX format: e.g. Bibutils (Putnam, 2010) can convert between the bibliography formats COPAC, EndNote refer, EndNote XML, Pubmed XML, ISI, US Library of Congress MODS XML, RIS, and Word 2007. However, this may lose useful information: e.g. the MODS format can provide role data which (as discussed above) will be dropped when converting to BibTeX format. We are thus planning to provide a bibutils package which instead converts to the MODS XML format first, and creates bibentry objects from this. However, as this functionality requires both external software and package XML (Temple Lang, 2012), it cannot be made available within base R.

With these functionalities in place, managing bibliographic information in R documentation files will become much more convenient, and R will become a much sharper knife for cutting-edge bibliometric and scientometric analyses. In particular, it will become possible to programmatically analyze package citations, and use these to learn about the processes of knowledge dissemination driving the creation of R packages, and the relatedness of package authors and concepts.

Bibliography

K. Berry and O. Patashnik. BibTeX, 2010. URL http://www.ctan.org/tex-archive/biblio/bibtex/base.

A. Canty and B. D. Ripley. boot: Bootstrap R (S-Plus) Functions, 2012. URL http://CRAN.R-project.org/package=boot. R package version 1.3-4.

R. François. bibtex: BibTeX Parser, 2011. URL http://CRAN.R-project.org/package=bibtex. R package version 0.3-0.

Library of Congress. MARC standards, 2012. URL http://www.loc.gov/marc/, accessed 2012-04-02.

C. Putnam. Bibutils: Bibliography Conversion Utilities. The Scripps Research Institute, 2010. URL http://sourceforge.net/p/bibutils/.

D. Temple Lang. XML: Tools for Parsing and Generating XML within R and S-Plus, 2012. URL http://CRAN.R-project.org/package=XML. R package version 3.9-4.

Wikipedia. Personal name – Wikipedia, the free encyclopedia, 2012. URL http://en.wikipedia.org/wiki/Personal_name, accessed 2012-04-02.

Kurt Hornik
Department of Finance, Accounting and Statistics
WU Wirtschaftsuniversität Wien
Augasse 2–6, 1090 Wien
Austria
[email protected]

Duncan Murdoch
Department of Statistical and Actuarial Sciences
University of Western Ontario
London, Ontario
Canada
[email protected]

Achim Zeileis
Faculty of Economics and Statistics
Universität Innsbruck
Universitätsstr. 15, 6020 Innsbruck
Austria
[email protected]
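As a side note, the ‘.bib’ conversion route described in the article above might be sketched as follows (the file names here are illustrative only; read.bib() is provided by CRAN package bibtex, as cited):

```r
library("bibtex")                   # provides read.bib()
refs <- read.bib("refs.bib")        # bibentry objects from a BibTeX database
rcode <- format(refs, style = "R")  # character vector of bibentry() calls
writeLines(rcode, "REFERENCES")     # write (or add) these to a bibentry database file
```

See ?bibentry for worked examples of the last two steps.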

NEWS AND NOTES

Changes in R

From version 2.14.1 to version 2.15.1

by the R Core Team

CHANGES IN R VERSION 2.15.1

NEW FEATURES

• source() now uses withVisible() rather than .Internal(eval.with.vis). This sometimes alters tracebacks slightly.

• A new option "show.error.locations" has been added. When set to TRUE, error messages will contain the location of the most recent call containing source reference information. (Other values are supported as well; see ?options.)

• The NA warning messages from e.g. pchisq() now report the call to the closure and not that of the .Internal.

• Added Polish translations by Łukasz Daniel.
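A minimal sketch of the "show.error.locations" option described above (the exact form of the message depends on the source references available):

```r
options(show.error.locations = TRUE)
# For code carrying source reference information (e.g. run via source()),
# error messages now append the location of the most recent call, roughly:
#   Error in f() (from script.R#3) : ...
```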

• install.packages("pkg_version.tgz") on Mac OS X now has sanity checks that this is actually a binary package (as people have tried it with incorrectly named source packages).

• splineDesign() and spline.des() in package splines have a new option sparse which can be used for efficient construction of a sparse B-spline design matrix (via Matrix).

• norm() now allows type = "2" (the ‘spectral’ or 2-norm) as well, mainly for didactical completeness.

• pmin() and pmax() now also work when one of the inputs is of length zero and others are not, returning a zero-length vector, analogously to, say, +.

• colorRamp() (and hence colorRampPalette()) now also works for the boundary case of just one color when the ramp is flat.

• qqline() has new optional arguments distribution, probs and qtype, following the example of lattice’s panel.qqmathline().

• .C() gains some protection against the misuse of character vector arguments. (An all too common error is to pass character(N), which initializes the elements to "", and then attempt to edit the strings in-place, sometimes forgetting to terminate them.)

• Calls to the new function globalVariables() in package utils declare that functions and other objects in a package should be treated as globally defined, so that R CMD check will not note them.

• print(packageDescription(*)) trims the Collate field by default.

• The included copy of zlib has been updated to version 1.2.7.

PERFORMANCE IMPROVEMENTS

• In package parallel, makeForkCluster() and the multicore-based functions use native byte-order for serialization (deferred from 2.15.0).

• lm.fit(), lm.wfit(), glm.fit() and lsfit() do less copying of objects, mainly by using .Call() rather than .Fortran().

• .C() and .Fortran() do less copying: arguments which are raw, logical, integer, real or complex vectors and are unnamed are not copied before the call, and (named or not) are not copied after the call. Lists are no longer copied (they are supposed to be used read-only in the C code).

• tabulate() makes use of .C(DUP = FALSE) and hence does not copy bin. (Suggested by Tim Hesterberg.) It also avoids making a copy of a factor argument bin.

• Other functions (often or always) doing less copying include cut(), dist(), the complex case of eigen(), hclust(), image(), kmeans(), loess(), stl() and svd(LINPACK = TRUE).

• There is less copying when using primitive replacement functions such as names(), attr() and attributes().

DEPRECATED AND DEFUNCT

• The converters for use with .C() (see ?getCConverterDescriptions) are deprecated: use the .Call() interface instead. There are no known examples (they were never fully documented).

UTILITIES

• For R CMD check, a few people have reported problems with junctions on Windows (although they were tested on Windows 7,


XP and Server 2008 machines and it is unknown under what circumstances the problems occur). Setting the environment variable R_WIN_NO_JUNCTIONS to a non-empty value (e.g. in ‘~/.R/check.Renviron’) will force copies to be used instead.

INSTALLATION

• R CMD INSTALL with _R_CHECK_INSTALL_DEPENDS_ set to a true value (as done by R CMD check --as-cran) now restricts the packages available when lazy-loading as well as when test-loading (since packages such as ETLUtils and agsemisc had top-level calls to library() for undeclared packages). This check is now also available on Windows.

C-LEVEL FACILITIES

• C entry points mkChar and mkCharCE now check that the length of the string they are passed does not exceed 2^31 − 1 bytes: they used to overflow with unpredictable consequences.

• C entry points R_GetCurrentSrcref and R_GetSrcFilename have been added to the API to allow debuggers access to the source references on the stack.

WINDOWS-SPECIFIC CHANGES

• Windows-specific changes will now be announced in this file (‘NEWS’). Changes up to and including R 2.15.0 remain in the ‘CHANGES’ file.

• There are two new environment variables which control the defaults for command-line options. If R_WIN_INTERNET2 is set to a non-empty value, it is as if ‘--internet2’ was used. If R_MAX_MEM_SIZE is set, it gives the default memory limit if ‘--max-mem-size’ is not specified: invalid values being ignored.

BUG FIXES

• lsfit() lost the names from the residuals.

• More cases in which merge() could create a data frame with duplicate column names now give warnings. Cases where names specified in by match multiple columns are errors.

• Nonsense uses such as seq(1:50, by = 5) (from package plotrix) and seq.int(1:50, by = 5) are now errors.

• The residuals in the 5-number summary printed by summary() on an "lm" object are now explicitly labelled as weighted residuals when non-constant weights are present. (Wish of PR#14840.)

• tracemem() reported that all objects were copied by .C() or .Fortran() whereas only some object types were ever copied. It also reported and marked as copies some transformations such as rexp(n, x): it no longer does so.

• The plot() method for class "stepfun" only used the optional xval argument to compute xlim and not the points at which to plot (as documented). (PR#14864)

• Names containing characters which need to be escaped were not deparsed properly. (PR#14846)

• Trying to update (recommended) packages in ‘R_HOME/library’ without write access is now dealt with more gracefully. Further, such package updates may be skipped (with a warning), when a newer installed version is already going to be used from .libPaths(). (PR#14866)

• hclust() is now fast again (as up to end of 2003), with a different fix for the "median"/"centroid" problem. (PR#4195)

• get_all_vars() failed when the data came entirely from vectors in the global environment. (PR#14847)

• R CMD check with _R_CHECK_NO_RECOMMENDED_ set to a true value (as done by the --as-cran option) could issue false errors if there was an indirect dependency on a recommended package.

• formatC() uses the C entry point str_signif which could write beyond the length allocated for the output string.

• Missing default argument added to implicit S4 generic for backsolve(). (PR#14883)

• Some bugs have been fixed in handling load actions that could fail to export assigned items or generate spurious warnings in R CMD check on loading.

• For tiff(type = "windows"), the numbering of per-page files except the last was off by one.

• On Windows, loading package stats (which is done for a default session) would switch line endings on ‘stdout’ and ‘stderr’ from CRLF to LF. This affected Rterm and R CMD BATCH.


• On Windows, the compatibility function x11() had not kept up with changes to windows(), and issued warnings about bad parameters. (PR#14880)

• On Windows, the Sys.glob() function did not handle UNC paths as it was designed to try to do. (PR#14884)

• In package parallel, clusterApply() and similar failed to handle a (pretty pointless) length-1 argument. (PR#14898)

• Quartz Cocoa display reacted asynchronously to dev.flush(), which means that the redraw could be performed after the plot had already been modified by subsequent code. The redraw is now done synchronously in dev.flush() to allow animations without sleep cycles.

• Source locations reported in traceback() were incorrect when byte-compiled code was on the stack.

• plogis(x, lower = FALSE, log.p = TRUE) no longer underflows early for large x (e.g. 800).

• ?Arithmetic’s “1 ^ y and y ^ 0 are 1, always” now also applies for integer vectors y.

• X11-based pixmap devices like png(type = "Xlib") were trying to set the cursor style, which triggered some warnings and hangs.

• Code executed by the built-in HTTP server no longer allows other HTTP clients to re-enter R until the current worker evaluation finishes, to prevent cascades.

• The plot() and Axis() methods for class "table" now respect graphical parameters such as cex.axis. (Reported by Martin Becker.)

• Under some circumstances package.skeleton() would give out progress reports that could not be translated and so were displayed by question marks. Now they are always in English. (This was seen for CJK locales on Windows, but may have occurred elsewhere.)

• The evaluator now keeps track of source references outside of functions, e.g. when source() executes a script.

• The replacement method for window() now works correctly for multiple time series of class "mts". (PR#14925)

• is.unsorted() gave incorrect results on non-atomic objects such as data frames. (Reported by Matthew Dowle.)

• The value returned by tools::psnice() for invalid pid values was not always NA as documented.

• Closing an X11() window while locator() was active could abort the R process.

• getMethod(f, sig) produced an incorrect error message in some cases when f was not a string.

• Using a string as a “call” in an error condition with options(showErrorCalls = TRUE) could cause a segfault. (PR#14931)

• The string "infinity" allowed by C99 was not accepted as a numerical string value by e.g. scan() and as.character(). (PR#14933)

• In legend(), setting some entries of lwd to NA was inconsistent (depending on the graphics device) in whether it would suppress those lines; now it consistently does so. (PR#14926)

• by() failed for a zero-row data frame. (Reported by Weiqiang Qian)

• Yates correction in chisq.test() could be bigger than the terms it corrected, previously leading to an infinite test statistic in some corner cases which are now reported as NaN.

• xgettext() and related functions sometimes returned items that were not strings for translation. (PR#14935)

• plot(<lm>, which = 5) now correctly labels the factor level combinations for the special case where all hii are the same. (PR#14837)

CHANGES IN R VERSION 2.15.0

SIGNIFICANT USER-VISIBLE CHANGES

• The behaviour of unlink(recursive = TRUE) for a symbolic link to a directory has changed: it now removes the link rather than the directory contents (just as rm -r does). On Windows it no longer follows reparse points (including junctions and symbolic links).

• Even data-only packages without R code need a namespace and so may need to be installed under R 2.14.0 or later.

NEW FEATURES

• Environment variable RD2DVI_INPUTENC has been renamed to RD2PDF_INPUTENC.

• .Deprecated() becomes a bit more flexible, getting an old argument.


• assignInNamespace() has further restrictions on use apart from at top-level, as its help page has warned. Expect it to be disabled from programmatic use in the future.

• Function setClass() in package methods now returns, invisibly, a generator function for the new class, slightly preferred to calling new(), as explained on the setClass help page.
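A small sketch of the new setClass() return value (the class and slot names here are invented for illustration):

```r
library(methods)
Pos <- setClass("Pos", representation(x = "numeric"))
p1 <- Pos(x = 1)          # use the returned generator function ...
p2 <- new("Pos", x = 1)   # ... slightly preferred to calling new() directly
```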

• system() and system2() when capturing output report a non-zero status in the new "status" attribute.

• The "dendrogram" method of str() now takes its default for last.str from option str.dendrogram.last.
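For example (a sketch; the exact command and status value depend on the system):

```r
out <- system2("ls", "no-such-file", stdout = TRUE, stderr = TRUE)
attr(out, "status")   # non-zero exit status reported in the new attribute
```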

• kronecker() now has an S4 generic in package methods on which packages can set methods. It will be invoked by X %x% Y if either X or Y is an S4 object.

• pdf() accepts forms like file = "|lpr" in the same way as postscript().

• pdf() accepts file = NULL. This means that the device does NOT create a PDF file (but it can still be queried, e.g., for font metric info).

• format() (and hence print()) on "bibentry" objects now uses options("width") to set the output width.

• legend() gains a text.font argument. (Suggested by Tim Paine, PR#14719.)

• nchar() and nzchar() no longer accept factors (as integer vectors). (Wish of PR#6899.)

• New simple fitted() method for "kmeans" objects.

• The traceback() function can now be called with an integer argument, to display a current stack trace. (Wish of PR#14770.)

• setGeneric() calls can be simplified when creating a new generic function by supplying the default method as the def argument. See ?setGeneric.

• serialize() has a new option xdr = FALSE which will use the native byte-order for binary serializations. In scenarios where only little-endian machines are involved (these days, close to universal) and (un)serialization takes an appreciable amount of time, this may noticeably speed up transferring data between systems.

• The internal (un)serialization code is faster for long vectors, particularly with XDR on some platforms. (Based on a suggested patch by Michael Spiegel.)

• For consistency, circles with zero radius are omitted by points() and grid.circle(). Previously this was device-dependent, but they were usually invisible.

• summary() behaves slightly differently (or more precisely, its print() method does). For numeric inputs, the number of NAs is printed as an integer and not a real. For dates and datetimes, the number of NAs is included in the printed output (the latter being the wish of PR#14720).
The "data.frame" method is more consistent with the default method: in particular it now applies zapsmall() to numeric/complex summaries.

• NROW(x) and NCOL(x) now work whenever dim(x) looks appropriate, e.g., also for more generalized matrices.

• PCRE has been updated to version 8.30.

• The number of items retained with options(warn = 0) can be set by options(nwarnings=).

• The internal R_Srcref variable is now updated before the browser stops on entering a function. (Suggestion of PR#14818.)

• A new function assignInMyNamespace() uses the namespace of the function it is called from.

• There are ‘bare-bones’ functions .colSums(), .rowSums(), .colMeans() and .rowMeans() for use in programming where ultimate speed is required.

• attach() allows the default name for an attached file to be overridden.

• The function .package_dependencies() from package tools for calculating (recursive) (reverse) dependencies on package databases, which was formerly internal, has been renamed to package_dependencies() and is now exported.

• bxp(), the work horse of boxplot(), now uses a more sensible default xlim in the case where at is specified differently from 1:n; see the discussion on R-devel, https://stat.ethz.ch/pipermail/r-devel/2011-November/062586.html.

• New function paste0(), an efficient version of paste(*, sep = ""), to be used in many places for more concise (and slightly more efficient) code.

• There is a new function optimHess() to compute the (approximate) Hessian for an optim() solution if hessian = TRUE was forgotten.
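For instance, paste0() abbreviates the common paste(*, sep = "") idiom (a sketch; the file names are made up):

```r
paste0("pkg_", c("a", "b"), ".tar.gz")
# equivalent to, but more concise than:
paste("pkg_", c("a", "b"), ".tar.gz", sep = "")
```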


• .filled.contour() is a ‘bare-bones’ function to add a filled-contour rectangular plot to an already prepared plot region.

• The stepping in debugging and single-step browsing modes has changed slightly: now left braces at the start of the body are stepped over for if statements as well as for for and while statements. (Wish of PR#14814.)

• library() no longer warns about a conflict with a function from package:base if the function has the same code as the base one but with a different environment. (An example is Matrix::det().)

• When deparsing very large language objects, as.character() now inserts newlines after each line of approximately 500 bytes, rather than truncating to the first line.

• New function rWishart() generates Wishart-distributed random matrices.

• Packages may now specify actions to be taken when the package is loaded (setLoadActions()).

• options(max.print = Inf) and similar now give an error (instead of warnings later).

• The replacement method of units for "difftime" objects tries harder to preserve other attributes of the argument. (Wish of PR#14839.)

• poly(raw = TRUE) no longer requires more unique points than the degree. (Requested by John Fox.)

PACKAGE parallel

• There is a new function mcmapply(), a parallel version of mapply(), and a wrapper mcMap(), a parallel version of Map().

• A default cluster can be registered by the new function setDefaultCluster(): this will be used by default in functions such as parLapply().

• clusterMap() has a new argument .scheduling to allow the use of load-balancing.

• There are new load-balancing functions parLapplyLB() and parSapplyLB().

• makePSOCKCluster() has a new option useXDR = FALSE which can be used to avoid byte-shuffling for serialization when all the nodes are known to be little-endian (or all big-endian).

PACKAGE INSTALLATION

• Non-ASCII vignettes without a declared encoding are no longer accepted.

• C/C++ code in packages is now compiled with -DNDEBUG to mitigate against the C/C++ function assert being called in production use. Developers can turn this off during package development with PKG_CPPFLAGS = -UNDEBUG.

• R CMD INSTALL has a new option ‘--dsym’ which on Mac OS X (Darwin) dumps the symbols alongside the ‘.so’ file: this is helpful when debugging with valgrind (and especially when installing packages into ‘R.framework’). [This can also be enabled by setting the undocumented environment variable PKG_MAKE_DSYM, since R 2.12.0.]

• R CMD INSTALL will test loading under all installed sub-architectures even for packages without compiled code, unless the flag ‘--no-multiarch’ is used. (Pure R packages can do things which are architecture-dependent: in the case which prompted this, looking for an icon in a Windows R executable.)

• There is a new option install.packages(type = "both") which tries source packages if binary packages are not available, on those platforms where the latter is the default.

• The meaning of install.packages(dependencies = TRUE) has changed: it now means to install the essential dependencies of the named packages plus the ‘Suggests’, but only the essential dependencies of dependencies. To get the previous behaviour, specify dependencies as a character vector.

• R CMD INSTALL --merge-multiarch is now supported on OS X and other Unix-alikes using multiple sub-architectures.

• R CMD INSTALL --libs-only now by default does a test load on Unix-alikes as well as on Windows: suppress with ‘--no-test-load’.

UTILITIES

• R CMD check now gives a warning rather than a note if it finds inefficiently compressed datasets. With bzip2 and xz compression having been available since R 2.10.0, it only exceptionally makes sense to not use them. The environment variable _R_CHECK_COMPACT_DATA2_ is no longer consulted: the check is always done if _R_CHECK_COMPACT_DATA_ has a true value (its default).
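The new facilities in package parallel described above might be combined as in this sketch (the cluster size and inputs are illustrative; the mc* functions rely on forking and so run serially on Windows):

```r
library(parallel)
cl <- makePSOCKCluster(2, useXDR = FALSE)    # skip byte-shuffling on all-little-endian nodes
setDefaultCluster(cl)                        # registered cluster used by parLapply() etc.
sq <- parLapplyLB(cl, 1:4, function(i) i^2)  # load-balancing variant of parLapply()
mm <- mcmapply(rep, 1:3, 3:1)                # parallel version of mapply()
stopCluster(cl)
```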


• Where multiple sub-architectures are to be tested, R CMD check now runs the examples and tests for all the sub-architectures even if one fails.

• R CMD check can optionally report timings on various parts of the check: this is controlled by environment variable _R_CHECK_TIMINGS_ documented in ‘Writing R Extensions’. Timings (in the style of R CMD BATCH) are given at the foot of the output files from running each test and the R code in each vignette.

• There are new options for more rigorous testing by R CMD check selected by environment variables – see the ‘Writing R Extensions’ manual.

• R CMD check now warns (rather than notes) on undeclared use of other packages in examples and tests: increasingly people are using the metadata in the ‘DESCRIPTION’ file to compute information about packages, for example reverse dependencies.

• The defaults for some of the options in R CMD check (described in the ‘R Internals’ manual) have changed: checks for unsafe and .Internal() calls and for partial matching of arguments in R function calls are now done by default.

• R CMD check has more comprehensive facilities for checking compiled code and so gives fewer reports on entry points linked into ‘.so’/‘.dll’ files from libraries (including C++ and Fortran runtimes). Checking compiled code is now done on FreeBSD (as well as the existing supported platforms of Linux, Mac OS X, Solaris and Windows).

• R CMD build has more options for ‘--compact-vignettes’: see R CMD build --help.

• R CMD build has a new option ‘--md5’ to add an ‘MD5’ file (as done by CRAN): this is used by R CMD INSTALL to check the integrity of the distribution. If this option is not specified, any existing (and probably stale) ‘MD5’ file is removed.

DEPRECATED AND DEFUNCT

• R CMD Rd2dvi is now defunct: use R CMD Rd2pdf.

• Options such as ‘--max-nsize’, ‘--max-vsize’ and the function mem.limits() are now defunct. (Options ‘--min-nsize’ and ‘--min-vsize’ remain available.)

• Use of library.dynam() without specifying all the first three arguments is now disallowed. Use of an argument chname in library.dynam() including the extension ‘.so’ or ‘.dll’ (which was never allowed according to the help page) is defunct. This also applies to library.dynam.unload() and to useDynLib directives in ‘NAMESPACE’ files.

• The internal functions .readRDS() and .saveRDS() are now defunct.

• The off-line help() types ‘"postscript"’ and ‘"ps"’ are defunct.

• Sys.putenv(), replaced and deprecated in R 2.5.0, is finally removed.

• Some functions/objects which have been defunct for five or more years have been removed completely. These include .Alias(), La.chol(), La.chol2inv(), La.eigen(), Machine(), Platform(), Version, codes(), delay(), format.char(), getenv(), httpclient(), loadURL(), machine(), parse.dcf(), printNoClass(), provide(), read.table.url(), restart(), scan.url(), symbol.C(), symbol.For() and unix().

• The ENCODING argument to .C() is deprecated. It was intended to smooth the transition to multi-byte character strings, but can be replaced by the use of iconv() in the rare cases where it is still needed.

INSTALLATION

• Building with a positive value of ‘--with-valgrind-instrumentation’ now also instruments logical, complex and raw vectors.

C-LEVEL FACILITIES

• Passing R objects other than atomic vectors, functions, lists and environments to .C() is now deprecated and will give a warning. Most cases (especially NULL) are actually coding errors. NULL will be disallowed in future. .C() now passes a pairlist as a SEXP to the compiled code. This is as was documented, but pairlists were in reality handled differently as a legacy from the early days of R.

• call_R and call_S are deprecated. They still exist in the headers and as entry points, but are no longer documented and should not be used for new code.


BUG FIXES

• str(x, width) now obeys its width argument also for function headers and other objects x where deparse() is applied.

• The convention for x %/% 0L for integer-mode x has been changed from 0L to NA_integer_. (PR#14754)

• The exportMethods directive in a ‘NAMESPACE’ file now exports S4 generics as necessary, as the extensions manual said it does. The manual has also been updated to be a little more informative on this point. It is now required that there is an S4 generic (imported or created in the package) when methods are to be exported.

• Reference methods cannot safely use non-exported entries in the namespace. We now do not do so, and warn in the documentation.

• The namespace import code was warning when identical S4 generic functions were imported more than once, but should not (reported by Brian Ripley, then Martin Morgan).

• merge() is no longer allowed (in some ways) to create a data frame with duplicate column names (which confused PR#14786).

• Fixes for rendering raster images on X11 and Windows devices when the x-axis or y-axis scale is reversed.

• getAnywhere() found S3 methods as seen from the utils namespace and not from the environment from which it was called.

• selectMethod(f, sig) would not return inherited group methods when caching was off (as it is by default).

• dev.copy2pdf(out.type = "") gave an error. (PR#14827)

• Virtual classes (e.g., class unions) had a NULL prototype even if that was not a legal subclass. See ?setClassUnion.

• The C prototypes for zdotc and zdotu in ‘R_ext/BLAS.h’ have been changed to the more modern style rather than that used by f2c. (Patch by Berwin Turlach.)

• isGeneric() produced an error for primitives that can not have methods.

• .C() or .Fortran() had a lack-of-protection error if the registration information resulted in an argument being coerced to another type.

• boxplot(x = x, at = at) with non-finite elements in x and non-integer at could not generate a warning but failed.

• heatmap(x, symm = TRUE, RowSideColors = *) no longer draws the colors in reversed order.

• predict() was incorrect in the multivariate case, for p >= 2.

• print(x, max = m) is now consistent when x is a "Date"; also the “reached ... max.print ..” messages are now consistently using single brackets.

• Closed the ‘<li>’ tag in pages generated by Rd2HTML(). (PR#14841)

• Axis tick marks could go out of range when a log scale was used. (PR#14833)

• Signature objects in methods were not allocated as S4 objects (caused a problem with trace() reported by Martin Morgan).

CHANGES IN R VERSION 2.14.2

NEW FEATURES

• The internal untar() (as used by default by R CMD INSTALL) now knows about some pax headers which bsdtar (e.g., the default tar for Mac OS >= 10.6) can incorrectly include in tar files, and will skip them with a warning.

• PCRE has been upgraded to version 8.21: as well as bug fixes and greater Perl compatibility, this adds a JIT pattern compiler, about which PCRE’s news says ‘large performance benefits can be had in many situations’. This is supported on most but not all R platforms.

• Function compactPDF() in package tools now takes the default for argument gs_quality from environment variable GS_QUALITY: there is a new value "none", the ultimate default, which prevents GhostScript being used in preference to qpdf just because environment variable R_GSCMD is set. If R_GSCMD is unset or set to "", the function will try to find a suitable GhostScript executable.

• The included version of zlib has been updated to 1.2.6.

• For consistency with the logLik() method, nobs() for "nls" fits now excludes observations with zero weight. (Reported by Berwin Turlach.)


UTILITIES

• R CMD check now reports by default on licenses not according to the description in ‘Writing R Extensions’.

• R CMD check has a new option ‘--as-cran’ to turn on most of the customizations that CRAN uses for its incoming checks.

PACKAGE INSTALLATION

• R CMD INSTALL will now no longer install certain file types from ‘inst/doc’: these are almost certainly mistakes and for some packages are wasting a lot of space. These are ‘Makefile’, files generated by running LaTeX, and unless the package uses a ‘vignettes’ directory, PostScript and image bitmap files. Note that only PDF vignettes have ever been supported: some of these files come from DVI/PS output from the Sweave defaults prior to R 2.13.0.

BUG FIXES

• R configured with ‘--disable-openmp’ would mistakenly set HAVE_OPENMP (internal) and SUPPORT_OPENMP (in ‘Rconfig.h’) even though no OpenMP flags were populated.

• The getS3method() implementation had an old computation to find an S4 default method.

• readLines() could overflow a buffer if the last line of the file was not terminated. (PR#14766)

• Forgetting the #endif tag in an Rd file could cause the parser to go into a loop. (Reported by Hans-Jorg Bibiko.)

• str(*, ..., strict.width = "cut") now also obeys list.len = n. (Reported by Sören Vogel.)

• Printing of arrays did not have enough protection (C level), e.g., in the context of capture.output(). (Reported by Hervé Pagès and Martin Morgan.)

• pdf(file = NULL) would produce a spurious file named ‘NA’. (PR#14808)

• list2env() did not check the type of its envir argument. (PR#14807)

• svg() could segfault if called with a non-existent file path. (PR#14790)

• make install can install to a path containing ‘+’ characters. (PR#14798)

• The edit() function did not respect the options("keep.source") setting. (Reported by Cleridy Lennert.)

• predict.lm(*, type = "terms", terms = *, se.fit = TRUE) did not work. (PR#14817)

• There is a partial workaround for errors in the TRE regular-expressions engine with named classes and repeat counts of at least 2 in a MBCS locale (PR#14408): these are avoided when TRE is in 8-bit mode (e.g. for useBytes = TRUE and when all the data are ASCII).

• The C function R_ReplDLLdo1() did not call top-level handlers.
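As a quick illustration of the str() item above, a small sketch (the list contents are made up): with list.len = 2, only two components are shown before a truncation note, and strict.width = "cut" trims over-wide lines instead of wrapping them.

```r
## str(*, strict.width = "cut") now also obeys list.len:
## two components are displayed, then a truncation note.
x <- list(a = 1:100, b = letters, c = 1:5, d = "text")
str(x, strict.width = "cut", list.len = 2)
```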

• R CMD check could miss undocumented S4 objects in packages which used S4 classes but did not ‘Depends: methods’ in their ‘DESCRIPTION’ file.

• The HTML Help Search page had malformed links. (PR#14769)

• A couple of instances of lack of protection of SEXPs have been squashed. (PR#14772, PR#14773)

• image(x, useRaster = TRUE) misbehaved on single-column x. (PR#14774)

• Negative values for options("max.print") or the max argument to print.default() caused crashes. Now the former are ignored and the latter trigger an error. (PR#14779)

• The text of a function body containing more than 4096 bytes was not properly saved by the parser when entered at the console.

• The Quartz device was unable to detect window sessions on Mac OS X 10.7 (Lion) and higher and thus it was not used as the default device on the console. Since Lion any application can use window sessions, so Quartz will now be the default device if the user’s window session is active and R is not run via ssh, which is at least close to the behavior in prior OS X versions.

• mclapply() would fail in code assembling the translated error message if some (but not all) cores encountered an error.

• format.POSIXlt(x) raised an arithmetic exception when x was an invalid object of class "POSIXlt" and parts were empty.

• installed.packages() has some more protection against package installs going on in parallel.

• .Primitive() could be mis-used to call .Internal() entry points.
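A short sketch of the max.print behaviour described above (the vector is arbitrary): a positive max truncates printing with a note, while a negative max to print.default() is now an error rather than a crash.

```r
## 'max' truncates output of print.default() with a note:
print(1:10, max = 5)
## [1] 1 2 3 4 5
## [ reached getOption("max.print") -- omitted 5 entries ]
```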


CHANGES IN R VERSION 2.14.1

NEW FEATURES

• parallel::detectCores() is now able to find the number of physical cores (rather than CPUs) on Sparc Solaris. It can also do so on most versions of Windows; however the default remains detectCores(logical = TRUE) on that platform.

• Reference classes now keep a record of which fields are locked. $lock() with no arguments returns the names of the locked fields.

• HoltWinters() reports a warning rather than an error for some optimization failures (where the answer might be a reasonable one).

• tools::dependsOnPkgs() now accepts the shorthand dependencies = "all".

• parallel::clusterExport() now allows specification of an environment from which to export.

• The quartz() device now does tilde expansion on its file argument.

• tempfile() on a Unix-alike now takes the process ID into account. This is needed with multicore (and as part of parallel) because the parent and all the children share a session temporary directory, and they can share the C random number stream used to produce the unique part. Further, two children can call tempfile() simultaneously.

• Option print in Sweave’s RweaveLatex() driver now emulates auto-printing rather than printing (which can differ for an S4 object by calling show() rather than print()).

• filled.contour() now accepts infinite values: previously it might have generated invalid graphics files (e.g. containing NaN values).

INSTALLATION

• On 64-bit Linux systems, configure now only sets ‘LIBnn’ to lib64 if ‘/usr/lib64’ exists. This may obviate setting ‘LIBnn’ explicitly on Debian-derived systems. It is still necessary to set ‘LIBnn = lib’ (or ‘lib32’) for 32-bit builds of R on a 64-bit OS on those Linux distributions capable of supporting that concept.

• configure looks for ‘inconsolata.sty’, and if not found adjusts the default R_RD4PDF to not use it (with a warning, since it is needed for high-quality rendering of manuals).

PACKAGE INSTALLATION

• R CMD INSTALL will now do a test load for all sub-architectures for which code was compiled (rather than just the primary sub-architecture).

UTILITIES

• When checking examples under more than one sub-architecture, R CMD check now uses a separate directory ‘examples_arch’ for each sub-architecture, and leaves the output in file ‘pkgname-Ex_arch.Rout’. Some packages expect their examples to be run in a clean directory.

BUG FIXES

• stack() now gives an error if no vector column is selected, rather than returning a 1-column data frame (contrary to its documentation).

• summary.mlm() did not handle objects where the formula had been specified by an expression. (Reported by Helios de Rosario Martinez.)

• tools::deparseLatex(dropBraces = TRUE) could drop text as well as braces.

• colormodel = "grey" (new in R 2.14.0) did not always work in postscript() and pdf().

• file.append() could return TRUE for failures. (PR#14727)

• gzcon() connections are no longer subject to garbage collection: it was possible for this to happen when unintended (e.g. when calling load()).

• nobs() does not count zero-weight observations for glm() fits, for consistency with lm(). This affects the BIC() values reported for such glm() fits. (Spotted by Bill Dunlap.)

• options(warn = 0) failed to end a (C-level) context with more than 50 accumulated warnings. (Spotted by Jeffrey Horner.)

• The internal plot.default() code did not do sanity checks on a cex argument, so invalid input could cause problems. (Reported by Ben Bolker.)

• anyDuplicated(*, MARGIN = 0) no longer fails. (Reported by Hervé Pagès.)

• read.dcf() removes trailing blanks: unfortunately on some platforms this included \xa0 (non-breaking space) which is the trailing byte of a UTF-8 character. It now only considers ASCII space and tab to be ‘blank’.
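Since the nobs() change above silently alters BIC() for weighted glm() fits, a small sketch may help; the data are invented for illustration.

```r
## Zero-weight observations are no longer counted by nobs() for glm()
## fits, matching lm(); BIC(), which calls nobs(), changes accordingly.
set.seed(1)
d <- data.frame(x = 1:10)
d$y <- 2 * d$x + rnorm(10)
d$w <- c(rep(1, 8), 0, 0)        # two zero-weight rows
fit_lm  <- lm(y ~ x, data = d, weights = w)
fit_glm <- glm(y ~ x, data = d, weights = w)
c(nobs(fit_lm), nobs(fit_glm))   # both 8: zero-weight rows are excluded
```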


• There was a sign error in part of the calculations for the variance returned by KalmanSmooth(). (PR#14738)

• pbinom(10, 1e6, 0.01, log.p = TRUE) was NaN thanks to the buggy fix to PR#14320 in R 2.11.0. (PR#14739)

• RweaveLatex() now emulates auto-printing rather than printing, by calling methods::show() when auto-printing would.

• duplicated() ignored fromLast for a one-column data frame. (PR#14742)

• source() and related functions did not put the correct timestamp on the source references; srcfilecopy() has gained a new argument timestamp to support this fix. (PR#14750)

• srcfilecopy() has gained a new argument isFile and now records the working directory, to allow debuggers to find the original source file. (PR#14826)

• LaTeX conversion of Rd files did not correctly handle preformatted backslashes. (PR#14751)

• HTML conversion of Rd files did not handle markup within tabular cells properly. (PR#14708)

• source() on an empty file with keep.source = TRUE tried to read from stdin(), in R 2.14.0 only. (PR#14753)

• The code to check Rd files in packages would abort if duplicate description sections were present.
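A sketch of the srcfilecopy() arguments mentioned above; the file name and source lines here are hypothetical, and the argument defaults shown are those documented in current R.

```r
## srcfilecopy() now accepts 'timestamp' (and, per the fix list,
## 'isFile') so source references carry the right metadata.
sf <- srcfilecopy("demo.R", lines = c("f <- function() 1", "f()"),
                  timestamp = Sys.time(), isFile = FALSE)
inherits(sf, "srcfile")   # TRUE: a srcfile object usable by debuggers
```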


    Changes on CRAN

    2011-12-01 to 2012-06-08

    by Kurt Hornik and Achim Zeileis

    New CRAN task views

    DifferentialEquations Topic: Differential Equations. Maintainers: Karline Soetaert and Thomas Petzoldt. Packages: CollocInfer, FME, GillespieSSA, PBSddesolve, PKfit, PKtools, PSM, ReacTran, bvpSolve∗, deSolve∗, deTestSet, ecolMod, mkin, nlmeODE, pomp, pracma, rootSolve∗, scaRabee, sde∗, simecol.

    (∗ = core package)

    New packages in CRAN task views

    Bayesian AtelieR, Runuran, iterLap.

    ChemPhys FITSio, cda, dielectric, planar, ringscale, stellaR.

    ClinicalTrials AGSDest, CRTSize∗, FrF2, MSToolkit∗, PIPS∗, PowerTOST∗, TEQR∗, ThreeGroups, adaptTest∗, asd∗, bcrm∗, clinsig, conf.design, crossdes∗, dfcrm∗, longpower, medAdherence, nppbib, pwr∗, qtlDesign∗, samplesize∗, tdm.

    Cluster CoClust, bgmm, pmclust, psychomix.

    Distributions LambertW, TSA, agricolae, e1071, emdbook, modeest, moments, npde, s20x.

    Econometrics geepack, intReg, lfe, mhurdle, pcse, spdep, sphet.

    Environmetrics Interpol.T, RMAWGEN, anm, boussinesq, fast, hydroPSO, nsRFA, qualV, rconifers, sensitivity, tiger.

    ExperimentalDesign dynaTree, mixexp, plgp, tgp.

    Finance AmericanCallOpt, BurStFin, FinAsym, VarSwapPrice, crp.CSFP, frmqa.

    HighPerformanceComputing BatchExperiments, BatchJobs, WideLM, cloudRmpi, cloudRmpiJars, harvestr, pmclust, rreval.

    MachineLearning GMMBoost, RPMM, bst, evtree, gamboostLSS, gbev, longRPart, oblique.tree, obliqueRF, partykit, rattle, rpartOrdinal.

    MedicalImaging DATforDCEMRI∗, arf3DS4∗, neuRosim∗, occ∗.

    NaturalLanguageProcessing KoNLP, RcmdrPlugin.TextMining, TextRegression, koRpus, maxent, tau, textcat, textir, tm.plugin.dc, tm.plugin.mail, topicmodels, wordcloud.

    OfficialStatistics FFD, pxR.

    Optimization GenSA, Rmalschains, clpAPI, cplexAPI, crs, dfoptim, glpkAPI, hydroPSO, mcga, nloptr, powell.

    Phylogenetics SYNCSA, auteur, paleotree, pmc.

    Psychometrics CDM, MPTinR, MplusAutomation, betareg, irtrees, pathmox, plspm.

    ReproducibleResearch RExcelInstaller, SWordInstaller, rtf, tables.

    Spatial McSpatial, geospt, rangeMapper, vec2dtrans.

    Survival BaSTA, CPE, CPHshape, OIsurv, RobustAFT, Survgini, ahaz, asbio, bwsurvival, dcens, dynpred, dynsurv, flexsurv, kaps, logconcens, parfm, riskRegression, rtv, survC1, tlmec.

    TimeSeries Interpol.T, RMAWGEN, deseasonalize, quantspec, tempdisagg.

    gR abn, gRim.

    (∗ = core package)

    New contributed packages

    AGD Analysis of Growth Data. Author: Stef van Buuren.

    ARTIVA Time-varying DBN inference with the ARTIVA (Auto Regressive TIme VArying) model. Authors: S. Lebre and G. Lelandais.

    Ace Assay-based Cross-sectional Estimation of incidence rates. Authors: Brian Claggett, Weiliang Qiu, Rui Wang.

    AdaptFitOS Adaptive Semiparametric Regression with Simultaneous Confidence Bands. Authors: Manuel Wiesenfarth and Tatyana Krivobokova.

    AmericanCallOpt This package includes pricing function for selected American call options with underlying assets that generate payouts. Author: Paolo Zagaglia. In view: Finance.

    AssotesteR Statistical Tests for Genetic Association Studies. Author: Gaston Sanchez.

    BBmisc Miscellaneous helper functions for B. Bischl. Authors: Bernd Bischl, Michel Lang, Olaf Mersmann.

    BCEA Bayesian Cost Effectiveness Analysis. Author: Gianluca Baio.


    BINCO Bootstrap Inference for Network COstruction. Authors: Shuang Li, Li Hsu, Jie Peng, Pei Wang.

    BVS Bayesian Variant Selection: Bayesian Model Uncertainty Techniques for Genetic Association Studies. Author: Melanie Quintana.

    BaSAR Bayesian Spectrum Analysis in R. Authors: Emma Granqvist, Matthew Hartley and Richard J Morris.

    BatchExperiments Statistical experiments on batch computing clusters. Authors: Bernd Bischl, Michel Lang, Olaf Mersmann. In view: HighPerformanceComputing.

    BatchJobs Batch computing with R. Authors: Bernd Bischl, Michel Lang, Olaf Mersmann. In view: HighPerformanceComputing.

    BayesLCA Bayesian Latent Class Analysis. Authors: Arthur White and Brendan Murphy.

    BayesLogit Logistic Regression. Authors: Nicolas Polson, James G. Scott, and Jesse Windle.

    CPHshape Find the maximum likelihood estimator of the shape constrained hazard baseline and the effect parameters in the Cox proportional hazards model. Authors: Rihong Hui and Hanna Jankowski. In view: Survival.

    CPMCGLM Correction of the pvalue after multiple coding. Authors: Jeremie Riou, Amadou Diakite, and Benoit Liquet.

    CUMP Analyze Multivariate Phenotypes by Combining Univariate results. Authors: Xuan Liu and Qiong Yang.

    CePa Centrality-based pathway enrichment. Author: Zuguang Gu.

    ChoiceModelR Choice Modeling in R. Authors: Ryan Sermas, assisted by John V. Colias.

    CoClust Copula based cluster analysis. Authors: Francesca Marta Lilja Di Lascio, Simone Giannerini. In view: Cluster.

    Compounding Computing Continuous Distributions. Authors: Bozidar V. Popovic, Saralees Nadarajah, Miroslav M. Ristic.

    BayesSingleSub Computation of Bayes factors for interrupted time-series designs. Authors: Richard D. Morey, Rivka de Vries.

    ConjointChecks A package to check the cancellation axioms of conjoint measurement. Author: Ben Domingue.

    BayesXsrc R Package Distribution of the BayesX C++ Sources. Authors: Daniel Adler [aut, cre], Thomas Kneib [aut], Stefan Lang [aut], Nikolaus Umlauf [aut], Achim Zeileis [aut].

    BerkeleyEarth Data Input for Berkeley Earth Surface Temperature. Author: Steven Mosher.

    BrailleR Improved access for blind useRs. Author: Jonathan Godfrey.

    BurStFin Burns Statistics Financial. Author: Burns Statistics. In view: Finance.

    CALINE3 R interface to Fortran 77 implementation of CALINE3. Author: David Holstius.

    CARBayes Spatial areal unit modelling. Author: Duncan Lee.

    CpGassoc Association between Methylation and a phenotype of interest. Authors: Barfield, R., Conneely, K., Kilaru, V.

    CrypticIBDcheck Identifying cryptic relatedness in genetic association studies. Authors: Annick Joelle Nembot-Simo, Jinko Graham and Brad McNeney.

    DDD Diversity-dependent diversification. Authors: Rampal S. Etienne & Bart Haegeman.

    DIRECT Bayesian Clustering of Multivariate Data Under the Dirichlet-Process Prior. Authors: Audrey Qiuyan Fu, Steven Russell, Sarah J. Bray and Simon Tavare.

    DSL Distributed Storage and List. Authors: Ingo Feinerer [aut], Stefan Theussl [aut, cre], Christian Buchta [ctb].

    CARE1 Statistical package for population size estimation in capture-recapture models. Author: T.C. Hsieh.

    CARramps Reparameterized and marginalized posterior sampling for conditional autoregressive models. Authors: Kate Cowles and Stephen Bonett; with thanks to Juan Cervantes, Dong Liang, Alex Sawyer, and Michael Seedorff.

    CAscaling CA scaling in R. Author: J. H. Straat.

    DandEFA Dandelion Plot for R-mode Exploratory Factor Analysis. Authors: Artur Manukyan, Ahmet Sedef, Erhan Cene, Ibrahim Demir (Advisor).

    DeducerSpatial Deducer for spatial data analysis. Authors: Ian Fellows and Alex Rickett with contributions from Neal Fultz.

    DetSel A computer program to detect markers responding to selection. Author: Renaud Vitalis.


    DiscreteLaplace Discrete Laplace distribution. Authors: Alessandro Barbiero, Riccardo Inchingolo.

    Distance A simple way to fit detection functions to distance sampling data and calculate abundance/density for biological populations. Author: David L. Miller.

    ETLUtils Utility functions to execute standard ETL operations (using package ff) on large data. Author: Jan Wijffels.

    GPvam Maximum Likelihood Estimation of the Generalized Persistence Value-Added Model. Authors: Andrew Karl, Yan Yang, and Sharon Lohr.

    GUniFrac Generalized UniFrac distances. Author: Jun Chen.

    GenOrd Simulation of ordinal and discrete variables with given correlation matrix and marginal distributions. Authors: Alessandro Barbiero, Pier Alda Ferrari.

    EcoTroph EcoTroph modelling support. Authors: J. Guitton and M. Colleter, D. Gascuel.

    EffectStars Visualization of Categorical Response Models. Author: Gunther Schauberger.

    EpiEstim A package to estimate time varying reproduction numbers from epidemic curves. Author: Anne Cori.

    EstSimPDMP Estimation of the jump rate and simulation for piecewise-deterministic Markov processes. Author: Romain Azais.

    Exact Exact Unconditional Tests for 2x2 Tables. Author: Peter Calhoun.

    ExomeDepth Calls CNV from exome sequence data. Author: Vincent Plagnol.

    ExpDes Experimental Designs package. Authors: Eric Batista Ferreira, Portya Piscitelli Cavalcanti, Denismar Alves Nogueira.

    FacPad Bayesian Sparse Factor Analysis model for the inference of pathways responsive to drug treatment. Author: Haisu Ma.

    GeoLight Analysis of light based geolocator data. Authors: Simeon Lisovski, Silke Bauer, Tamara Emmenegger.

    Grid2Polygons Convert Spatial Grids to Polygons. Author: Jason C. Fisher.

    HIest Hybrid index estimation. Author: Ben Fitzpatrick.

    HMMmix The HMMmix (HMM with mixture of gaussians as emission distribution) package. Authors: Stevenn Volant and Caroline Berard.

    Holidays Holiday and halfday data, for use with the TimeWarp package. Author: Lars Hansen.

    IC2 Inequality and Concentration Indices and Curves. Author: Didier Plat.

    IPMpack Builds and analyses Integral Projection Models (IPMs). Authors: CJE Metcalf, SM McMahon, R Salguero-Gomez, E Jongejans.

    ISBF Iterative Selection of Blocks of Features — ISBF. Author: Pierre Alquier.

    FactMixtAnalysis Factor Mixture Analysis with covariates. Author: Cinzia Viroli.

    Familias Probabilities for Pedigrees given DNA data. Authors: Petter Mostad and Thore Egeland.

    FastImputation Learn from training data then quickly fill in missing data. Author: Stephen R. Haptonstahl.

    FinAsym Classifies implicit trading activity from market quotes and computes the probability of informed trading. Author: Paolo Zagaglia. In view: Finance.

    ForeCA ForeCA — Forecastable Component Analysis. Author: Georg M. Goerg.

    G2Sd Grain-size Statistics and Description of Sediment. Authors: Regis K. Gallon, Jerome Fournier.

    GOGANPA GO-Functional-Network-based Gene-Set-Analysis. Author: Billy Chang.

    IgorR Read binary files saved by Igor Pro (including Neuromatic data). Author: Greg Jefferis.

    Interpol.T Hourly interpolation of multiple temperature daily series. Authors: Emanuele Eccel & Emanuele Cordano. In views: Environmetrics, TimeSeries.

    JMLSD Joint Modeling of Longitudinal and Survival Data — Power Calculation. Authors: Emil A. Cornea, Liddy M. Chen, Bahjat F. Qaqish, Haitao Chu, and Joseph G. Ibrahim.

    JPSurv Methods for population-based cancer survival analysis. Author: Yongwu Shao.

    Jmisc Julian Miscellaneous Function. Author: TszKin Julian Chan.

    JohnsonDistribution Johnson Distribution. Authors: A.I. McLeod and Leanna King.

    LCFdata Data sets for package LMERConvenienceFunctions. Author: Antoine Tremblay.


    LIHNPSD Poisson Subordinated Distribution. Author: Stephen Horng-Twu Lihn.

    LOST Missing morphometric data simulation and estimation. Authors: J. Arbour and C. Brown.

    LTR Perform LTR analysis on microarray data. Author: Paul C. Boutros.

    Lambda4 Estimation techniques for the reliability estimate: Maximized Lambda4. Author: Tyler Hunt.

    LeafAngle Fits, plots, and summarizes leaf angle distributions. Author: Remko Duursma.

    LinearizedSVR Linearized Support Vector Regression. Authors: Sriharsha Veeramachaneni and Ken Williams.

    Luminescence Package for Luminescence Dating data analysis. Authors: Sebastian Kreutzer [aut, cre], Christoph Schmidt [aut, ctb], Margret C. Fuchs [aut, ctb], Michael Dietze [aut, ctb], Manfred Fischer [aut, ctb], Markus Fuchs [ths].

    MAMS Designing Multi-Arm Multi-Stage Studies. Authors: Thomas Jaki and Dominic Magirr.

    MAVTgsa Ordinary least square test and Multivariate Analysis Of Variance test with n contrasts. Authors: Chih-Yi Chien, Chen-An Tsai, Ching-Wei Chang, and James J. Chen.

    MImix Mixture summary method for multiple imputation. Authors: Russell Steele, Naisyin Wang, and Adrian Raftery. In view: OfficialStatistics.

    MLEP Maximum likelihood estimate of penetrance parameters. Author: Yuki Sugaya.

    MM The multiplicative multinomial distribution. Authors: Robin K. S. Hankin and P. M. E. Altham.

    MRCE Author: Adam J. Rothman.

    McSpatial Nonparametric spatial data analysis. Author: Daniel McMillen. In view: Spatial.

    MetaDE Meta analysis of multiple microarray data. Authors: Jia Li and Xingbin Wang.

    MetaPath Perform the Meta-Analysis for Pathway Enrichment analysis (MAPE). Authors: Kui Shen and Geroge Tseng.

    MixMod Analysis of Mixed Models. Authors: Alexandra Kuznetsova, Per Bruun Brockhoff.

    Mobilize Mobilize plots and functions. Author: Jeroen Ooms.

    Momocs Shape Analysis of Outlines. Authors: Vincent Bonhomme, Sandrine Picq, Julien Claude.

    MorseGen Simple raw data generator based on user-specified summary statistics. Author: Brendan Morse.

    MultiOrd Generation of multivariate ordinal variates. Authors: Anup Amatya and Hakan Demirtas.

    NHPoisson Modelling and validation of non homogeneous Poisson processes. Author: Ana C. Cebrian.

    NPMPM tertiary probabilistic model in predictive microbiology for use in food manufacture. Author: Nadine Schoene.

    NbClust An examination of indices for determining the number of clusters. Authors: Malika Charrad and Nadia Ghazzali and Veronique Boiteau and Azam Niknafs.

    NetComp Network Generation and Comparison. Authors: Shannon M. Bell, Lyle D. Burgoon.

    NetPreProc NetPreProc: Network Pre-Processing and normalization. Author: Giorgio Valentini.

    NlsyLinks Utilities and kinship information for Behavior Genetics and Developmental research using the NLSY. Authors: Will Beasley, Joe Rodgers, David Bard, and Kelly Meredith.

    NormalGamma Normal-gamma convolution model. Authors: S. Plancade and Y. Rozenholc.

    OIdata Data sets and supplements (OpenIntro). Authors: Andrew P Bray and David M Diez.

    OIsurv Survival analysis supplement to OpenIntro guide. Author: David M Diez. In view: Survival.

    OLScurve OLS growth curve trajectories. Authors: Phil Chalmers and Carrie Smith and Matthew Sigal.

    OOmisc Ozgur-Ozlem Miscellaneous. Authors: Ozgur Asar, Ozlem Ilk.

    Ohmage R Client for Mobilize/Andwellness server. Author: Jeroen Ooms.

    OneHandClapping Prediction of condition-specific transcription factor interactions. Author: Sebastian Dümcke.

    OpenStreetMap Access to open street map raster images. Authors: Ian Fellows, using the JMapViewer library by Jan Peter Stotz.

    OptimalCutpoints Computing optimal cutpoints in diagnostic tests. Authors: Monica Lopez-Raton, Maria Xose Rodriguez-Alvarez.

    OptionsPdf This package estimates a mix of lognormal distributions from interest-rate option data. Author: Paolo Zagaglia.

    The R Journal Vol. 4/1, June 2012 ISSN 2073-4859 84 NEWSAND NOTES

    PEIP Functions for Aster Book on Inverse Theory. Author: Jonathan M. Lees.

    PIPS Predicted Interval Plots. Authors: Daniel G. Muenz, Ray Griner, Huichao Chen, Lijuan Deng, Sachiko Miyahara, and Scott R. Evans, with contributions from Lingling Li, Hajime Uno, and Laura M. Smeaton. In view: ClinicalTrials.

    PKPDmodels Pharmacokinetic/pharmacodynamic models. Authors: Anne Dubois, Julie Bertand, France Mentre and Douglas Bates.

    PairedData Paired Data Analysis. Author: Stephane Champely.

    ParamHelpers Helpers for parameters in black-box optimization, tuning and machine learning. Authors: Bernd Bischl, Patrick Koch.

    PopGenome Population genetic analysis. Author: Bastian Pfeifer.

    PrivateLR Differentially private regularized logistic regression. Author: Staal A. Vinterbo.

    QRM Provides R-language code to examine Quantitative Risk Management concepts. Author: Bernhard Pfaff.

    QUIC Regularized sparse inverse covariance matrix estimation. Authors: Cho-Jui Hsieh [aut], Matyas A. Sustik [aut, cre], Inderjit S. Dhillon [aut], Pradeep Ravikumar [aut].

    QuasiSeq Author: Steve Lund.

    R.devices Enhanced methods for handling graphical devices. Author: Henrik Bengtsson.

    R2BayesX Estimate Structured Additive Regression Models with BayesX. Authors: Nikolaus Umlauf [aut, cre], Thomas Kneib [aut], Stefan Lang [aut], Achim Zeileis [aut].

    R2G2 Converting R CRAN outputs into Google Earth. Author: Nils Arrigo.

    R2OpenBUGS Running OpenBUGS from R. Authors: originally written as R2WinBUGS by Andrew Gelman; changes and packaged by Sibylle Sturtz and Uwe Ligges. With considerable contributions by Gregor Gorjanc and Jouni Kerman. Adapted to R2OpenBUGS from R2WinBUGS by Neal Thomas.

    R2admb ADMB to R interface functions. Authors: Ben Bolker, Hans Skaug.

    R330 An R package for Stats 330. Authors: Alan Lee, Blair Robertson.

    RAFM Admixture F-model. Authors: Markku Karhunen, Uni. Helsinki.

    RCALI Calculation of the Integrated Flow of Particles between Polygons. Authors: Annie Bouvier, Kien Kieu, Kasia Adamczyk, and Herve Monod.

    RCassandra R/Cassandra interface. Author: Simon Urbanek.

    RForcecom RForcecom provides the connection to Force.com (Salesforce.com) from R. Author: Takekatsu Hiramura.

    RGIFT Create quizzes in GIFT Format. Authors: María José Haro-Delicado, Virgilio Gómez-Rubio and Francisco Parreño-Torres.

    RIFS Random Iterated Function System (RIFS). Authors: Pavel V. Moskalev, Alexey G. Bukhovets and Tatyana Ya. Biruchinskay.

    RMallow Fit Multi-Modal Mallows’ Models to ranking data. Author: Erik Gregory.

    RMendeley Interface to Mendeley API methods. Authors: Carl Boettiger, Duncan Temple Lang.

    RSeed borenstein analysis. Author: Claus Jonathan Fritzemeier.

    RTDAmeritrade Author: Theodore Van Rooy.

    RcmdrPlugin.EBM Rcmdr Evidence Based Medicine Plug-In package. Author: Daniel-Corneliu Leucuta.

    RcmdrPlugin.KMggplot2 Rcmdr Plug-In for Kaplan-Meier Plot and Other Plots by Using the ggplot2 Package. Authors: Triad sou. and Kengo NAGASHIMA.

    RcmdrPlugin.SCDA Rcmdr plugin for designing and analyzing single-case experiments. Authors: Isis Bulte and Patrick Onghena.

    RcmdrPlugin.UCA UCA Rcmdr Plug-in. Author: Manuel Munoz-Marquez.

    RcmdrPlugin.doBy Rcmdr doBy Plug-In. Author: Jonathan Lee.

    RcmdrPlugin.pointG Rcmdr Graphical POINT of view for questionnaire data Plug-In. Author: Stephane Champely.

    RcppSMC Rcpp bindings for Sequential Monte Carlo. Authors: Dirk Eddelbuettel and Adam M. Johansen.

    Rdistance Distance sampling analyses. Author: Trent McDonald.

    ReCiPa Redundancy Control in Pathways databases. Author: Juan C. Vivar.

    RenextGUI GUI for Renext. Authors: Yves Deville and IRSN.

    The R Journal Vol. 4/1, June 2012 ISSN 2073-4859 NEWSAND NOTES 85

    RfmriVC Varying stimulus coefficient fMRI models in R. Authors: Ludwig Bothmann, Stefanie Kalus.

    Rmalschains Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R. Authors: Christoph Bergmeir, Daniel Molina, José M. Benítez. In view: Optimization.

    Rmisc Rmisc: Ryan Miscellaneous. Author: Ryan M. Hope.

    Rmixmod An interface of MIXMOD. Authors: Remi Lebret and Serge Iovleff and Florent Langrognet, with contributions from C. Biernacki and G. Celeux and G. Govaert.

    Rquake Seismic Hypocenter Determination. Author: Jonathan M. Lees.

    RxCEcolInf R x C Ecological Inference With Optional Incorporation of Survey Information. Authors: D. James Greiner, Paul Baines, and Kevin M. Quinn. In view: Bayesian.

    SAScii Import ASCII files directly into R using only a SAS input script. Author: Anthony Joseph Damico.

    SCMA Single-Case Meta-Analysis. Authors: Isis Bulte and Patrick Onghena.

    SCRT Single-Case Randomization Tests. Authors: Isis Bulte and Patrick Onghena.

    SCVA Single-Case Visual Analysis. Authors: Isis Bulte and Patrick Onghena.

    SCperf Supply Chain Perform. Author: Marlene Silva Marchena.

    SEER2R reading and writing SEER*STAT data files. Author: Jun Luo.

    SGL Fit a GLM (or cox model) with a combination of lasso and group lasso regularization. Authors: Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani.

    SGPdata Exemplar data sets for SGP analyses. Authors: Damian W. Betebenner, Adam Van Iwaarden and Ben Domingue.

    SKAT SNP-set (Sequence) Kernel Association Test. Authors: Seunggeun Lee, Larisa Miropolsky and Micheal Wu.

    SNSequate Standard and Nonstandard Statistical Models and Methods for Test Equating. Author: Jorge Gonzalez Burgos.

    SPA3G SPA3G: R package for the method of Li and Cui (2012). Authors: Shaoyu Li and Yuehua Cui.

    SPIn Optimal Shortest Probability Intervals. Author: Ying Liu.

    SPSL Site Percolation on Square Lattice (SPSL). Author: Pavel V. Moskalev.

    SRMA SRMA Analysis of Array-based Sequencing Data. Authors: Wenyi Wang, Nianxiang Zhang, Yan Xu.

    SemiParSampleSel Semiparametric Sample Selection Modelling with Continuous Response. Authors: Giampiero Marra and Rosalba Radice.

    SightabilityModel Wildlife Sightability Modeling. Author: John Fieberg.

    Simile Interact with Simile models. Author: Simulistics Ltd.

    SoilR Models of Soil Organic Matter Decomposition. Authors: Carlos A. Sierra, Markus Mueller.

    SparseGrid Sparse grid integration in R. Author: Jelmer Ypma.

    SpatialTools Author: Joshua French.

    SteinerNet Steiner approach for Biological Pathway Graph Analysis. Authors: Afshin Sadeghi, Holger Froehlich.

    StressStrength Computation and estimation of reliability of stress-strength models. Authors: Alessandro Barbiero, Riccardo Inchingolo.

    StructR Structural geology tools. Author: Jeffrey R. Webber.

    SunterSampling Sunter’s sampling design. Authors: Alessandro Barbiero, Giancarlo Manzi.

    TCC TCC: tag count comparison package. Authors: Koji Kadota, Tomoaki Nishiyama, Kentaro Shimizu.

    TPAM Author: Yuping Zhang.

    TTAinterfaceTrendAnalysis Temporal Trend Analysis Graphical Interface. Authors: David Devreker [aut], Alain Lefebvre [aut, cre].

    TUWmodel Lumped hydrological model developed at the Vienna University of Technology for education purposes. Authors: Juraj Parajka, Alberto Viglione.

    Taxonstand Taxonomic standardization of plant species names. Author: Luis Cayuela.

    TestSurvRec Statistical tests to compare two survival curves with recurrent events. Author: Dr. Carlos Martinez.


    ThresholdROC Optimum threshold estimation based on cost function in a two and three state setting. Author: Konstantina Skaltsa.

    awsMethods Class and Methods definitions for packages aws, adimpro, fmri, dwi. Author: Joerg Polzehl.

    TimeWarp Date calculations and manipulation. Authors: Tony Plate, Jeffrey Horner, Lars Hansen.

    bams Breakpoint annotation model smoothing. Author: Toby Dylan Hocking.

    VarEff Variation of effective population size. Authors: Natacha Nikolic and Claude Chevalet.

    bandit Functions for simple A/B split test and multi-armed bandit analysis. Author: Thomas Lotze.
Monte Carlo Markov Chain sampler. Authors: binomlogit Efficient MCMC for Binomial Logit Andreas Scheidegger,. Models. Author: Agnes Fussl. ageprior Prior distributions for molecular dating. bionetdata Biological and chemical data networks. Author: Michael Matschiner. Authors: Matteo Re. anametrix Connects to Anametrix. Author: Roman bisectr Tools to find bad commits with git bisect. Jugai. Author: Winston Chang. anoint Analysis of interactions. Authors: Ravi bit64 A S3 class for vectors of 64bit integers. Author: Varadhan and Stephanie Kovalchik. Jens Oehlschlägel. appell Compute Appell’s F1 hypergeometric func- biwavelet Conduct univariate and bivariate wavelet tion. Authors: Daniel Sabanes Bove with con- analyses. Authors: Tarik C. Gouhier, Aslak tributions by F. D. Colavecchia, R. C. Forrey, G. Grinsted. Gasaneo, N. L. J. Michel, L. F. Shampine, M. V. Stoitsov and H. A. Watts. bmp Read Windows Bitmap (BMP) images. Author: Gregory Jefferis. apple Approximate Path for Penalized Likelihood Estimators. Authors: Yi Yu, Yang Feng. boss Boosted One-Step Statistics: Fast and accu- rate approximations for GLM, GEE and Mixed assertive Readable check functions to ensure code models for use in GWAS. Author: Arend Voor- integrity. Authors: Richard Cotton [aut, cre]. man.

The R Journal Vol. 4/1, June 2012 ISSN 2073-4859

NEWS AND NOTES

boussinesq Analytic Solutions for (ground-water) Boussinesq Equation. Author: Emanuele Cordano. In view: Environmetrics.

bpkde Back Projected Kernel Density Estimation. Authors: Kjell Konis, Victor Panaretos.

catdata Categorical Data. Authors: Gerhard Tutz, Gunther Schauberger.

cec2005benchmark Benchmark for the CEC 2005 Special Session on Real-Parameter Optimization. Authors: Yasser González-Fernández and Marta Soto.

cepp Context Driven Exploratory Projection Pursuit. Author: Mohit Dayal.

cin Causal Inference for Neuroscience. Authors: Xi (Rossi) LUO with contributions from Dylan Small, Chiang-shan Li, and Paul Rosenbaum.

clhs Conditioned Latin Hypercube Sampling. Author: Pierre Roudier.

cloudRmpi Cloud-based MPI Parallel Processing for R (cloudRmpi). Author: Barnet Wagman. In view: HighPerformanceComputing.

cloudRmpiJars Third-party jars for cloudRmpi. Author: Barnet Wagman. In view: HighPerformanceComputing.

cloudUtil Cloud Util Plots. Authors: Christian Panse, Ermir Qeli.

coloc Colocalisation tests of two genetic traits. Author: Chris Wallace.

comparison Multivariate likelihood ratio calculation and evaluation. Author: David Lucy.

complex.surv.dat.sim Simulation of recurrent event survival data. Authors: David Moriña, Centre Tecnològic de Nutrició i Salut and Albert Navarro.

compound.Cox Regression estimation based on the compound covariate method under the Cox proportional hazard model. Authors: Takeshi Emura & Yi-Hau Chen.

condmixt Conditional Density Estimation with Neural Network Conditional Mixtures. Author: Julie Carreau.

cpm Sequential Parametric and Nonparametric Change Detection. Authors: Gordon J. Ross, incorporating code from the GNU Scientific Library.

crblocks Categorical Randomized Block Data Analysis. Authors: David Allingham, D.J. Best.

crp.CSFP CreditRisk+ portfolio model. Authors: Dr. Matthias Fischer, Kevin Jakob & Stefan Kolb. In view: Finance.

crrstep Stepwise covariate selection for the Fine & Gray competing risks regression model. Authors: Ravi Varadhan & Deborah Kuk.

csound Accessing Csound functionality through R. Author: Ethan Brown.

cumplyr Extends ddply to allow calculation of cumulative quantities. Author: John Myles White.

curvclust Curve clustering. Authors: Madison Giacofci, Sophie Lambert-Lacroix, Guillemette Marot, Franck Picard.

dataframe Fast data frames. Author: Tim Hesterberg.

deTestSet Testset for differential equations. Authors: Karline Soetaert, Jeff Cash, Francesca Mazzia. In view: DifferentialEquations.

deltaPlotR Identification of dichotomous differential item functioning (DIF) using Angoff's Delta Plot method. Authors: David Magis, Bruno Facon.

depend.truncation Inference for parametric and semiparametric models based on dependent truncation data. Author: Takeshi Emura.

deseasonalize Optimal deseasonalization for geophysical time series using AR fitting. Author: A. I. McLeod. In view: TimeSeries.

dgmb Simulating data for PLS structural models. Authors: Alba Martinez-Ruiz and Claudia Martinez-Araneda.

diffEq Functions from the book Solving Differential Equations in R. Author: Karline Soetaert.

disclap Discrete Laplace Family. Authors: Mikkel Meyer Andersen and Poul Svante Eriksen.

disclapmix Discrete Laplace mixture inference using the EM algorithm. Authors: Mikkel Meyer Andersen and Poul Svante Eriksen.

discreteMTP Multiple testing procedures for discrete test statistics. Authors: Ruth Heller [aut], Hadas Gur [aut], Shay Yaacoby [aut, cre].

disp2D 2D Hausdorff and Simplex Dispersion Orderings. Author: Guillermo Ayala.

distory Distance Between Phylogenetic Histories. Authors: John Chakerian and Susan Holmes. In view: Phylogenetics.

dkDNA Diffusion kernels on a set of genotypes. Authors: Gota Morota and Masanori Koyama.

dma Dynamic model averaging. Authors: Tyler H. McCormick, Adrian Raftery, David Madigan.

doParallel Foreach parallel adaptor for the parallel package. Author: .

dostats Compute statistics helper functions. Author: Andrew Redd.

eigendog EigenDog. Author: The EigenDog Team.

endorse R Package for Analyzing Endorsement Experiments. Authors: Yuki Shiraito, Kosuke Imai.

events Store and manipulate event data. Author: Will Lowe.

evora Epigenetic Variable Outliers for Risk prediction Analysis. Author: Andrew E Teschendorff.

exptest Tests for Exponentiality. Authors: Ruslan Pusev and Maxim Yakovlev.

extraBinomial Extra-binomial approach for pooled sequencing data. Author: Xin Yang.

faisalconjoint Faisal Conjoint Model: A New Approach to Conjoint Analysis. Authors: Faisal Afzal Siddiqui, Ghulam Hussain, and Mudassiruddin.

fanc Penalized likelihood factor analysis via nonconvex penalty. Authors: Kei Hirose, Michio Yamamoto.

fanovaGraph Building Kriging models from FANOVA graphs. Authors: Jana Fruth, Thomas Muehlenstaedt, Olivier Roustant.

faoutlier Influential case detection methods for factor analysis and SEM. Author: Phil Chalmers.

fastGHQuad Fast Rcpp implementation of Gauss-Hermite quadrature. Author: Alexander W Blocker.

fastcox Lasso and elastic-net penalized Cox's regression in high dimensions models using the cocktail algorithm. Authors: Yi Yang, Hui Zou.

fgof Fast Goodness-of-fit Test. Authors: Ivan Kojadinovic and Jun Yan.

filterviewR Get started easily with FilterView from R. Author: Nelson Ray.

fishmove Prediction of Fish Movement Parameters. Author: Johannes Radinger.

foodweb visualisation and analysis of food web networks. Author: Giselle Perdomo.

fracprolif Fraction Proliferation via a quiescent growth model. Authors: Shawn Garbett, Darren Tyson.

frmqa Financial Risk Management and Quantitative Analysis. Author: Thanh T. Tran. In view: Finance.

frt Full Randomization Test. Authors: Giangiacomo Bravo, Lucia Tamburino.

fume FUME package. Author: Santander Meteorology Group.

fwsim Fisher-Wright Population Simulation. Authors: Mikkel Meyer Andersen and Poul Svante Eriksen.

gammSlice Generalized additive mixed model analysis via slice sampling. Authors: Tung Pham and Matt Wand.

gcdnet A generalized coordinate descent (GCD) algorithm for computing the solution path of the hybrid Huberized support vector machine (HHSVM) and its generalization. Authors: Yi Yang, Hui Zou.

genomicper Circular Genomic Permutation using GWAS p-values of association. Authors: Claudia P. Cabrera-Cardenas, Pau Navarro, Chris S. Haley.

geospt Spatial geostatistics; some geostatistical and radial basis functions, prediction and cross validation; design of optimal spatial sampling networks based on geostatistical modelling. Authors: Carlos Melo, Alí Santacruz, Oscar Melo and others. In view: Spatial.

geotools Geo tools. Author: Antoine Lucas.

gglasso Group Lasso Penalized Learning Using A Unified BMD Algorithm. Authors: Yi Yang, Hui Zou.

gldist An Asymmetry-Steepness Parameterization of the Generalized Lambda Distribution. Authors: Yohan Chalabi and Diethelm Wuertz.

glmmGS Gauss-Seidel Generalized Linear Mixed Model solver. Authors: Michele Morara, Louise Ryan, Subharup Guha, Christopher Paciorek.

googlePublicData An R library to build Google's Public Data Explorer DSPL Metadata files. Author: George Vega Yon.

govdat Interface to API methods for government data. Author: Scott Chamberlain.

graphComp Visual graph comparison. Authors: Khadija El Amrani, Ulrich Mansmann.

gridDebug Debugging Grid Graphics. Authors: Paul Murrell and Velvet Ly.
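Several entries on this page (doParallel, cloudRmpi, harvestr) concern parallel computing via the foreach idiom. As a minimal sketch of the registration pattern doParallel supports (assuming the doParallel and foreach packages are installed from CRAN):

```r
## Minimal sketch: register a doParallel backend and evaluate a
## foreach loop with %dopar% (assumes doParallel and foreach are
## installed from CRAN).
library(doParallel)

cl <- makeCluster(2)    # two local worker processes
registerDoParallel(cl)  # route %dopar% through the cluster

# square 1..4 in parallel; .combine = c flattens the results
res <- foreach(i = 1:4, .combine = c) %dopar% i^2

stopCluster(cl)
print(res)  # 1 4 9 16
```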

gridGraphviz Drawing Graphs with Grid. Author: Paul Murrell.

growcurves Bayesian semiparametric growth curve models that additionally include multiple membership random effects. Author: Terrance Savitsky.

gsbDesign Group Sequential Bayes Design. Authors: Florian Gerber, Thomas Gsponer.

gsmaRt Gene Set Microarray Testing. Authors: Stephan Artmann, Mathias Fuchs.

gvcm.cat Regularized Categorial Effects/Categorial Effect Modifiers in GLMs. Author: Margret-Ruth Oelker.

harvestr A Parallel Simulation Framework. Author: Andrew Redd. In view: HighPerformanceComputing.

hiPOD hierarchical Pooled Optimal Design. Author: Wei E. Liang.

hitandrun "Hit and Run" method for sampling uniformly from convex shapes. Author: Gert van Valkenhoef.

hof More higher order functions for R. Author: Hadley Wickham.

httr Tools for working with URLs and HTTP. Author: Hadley Wickham.

hydroPSO Model-Independent Particle Swarm Optimisation for Environmental Models. Authors: Mauricio Zambrano-Bigiarini [aut, cre] and Rodrigo Rojas [ctb]. In views: Environmetrics, Optimization.

iDEMO integrated degradation models. Authors: Ya-Shan Cheng, Chien-Yu Peng.

iGasso Genetic association tests. Author: Dr. Kai Wang.

iSubpathwayMiner The package can implement the graph-based reconstruction, analyses, and visualization of the KEGG pathways. Author: Chunquan Li.

idbg R debugger. Author: Ronen Kimchi.

igraph0 Network analysis and visualization. Author: Gabor Csardi.

igraphdata A collection of network data sets for the igraph package. Author: Gabor Csardi.

imputeYn Imputing the last largest censored datum under weighted least squares. Authors: Hasinur Rahaman Khan and Ewart Shaw.

intReg Interval Regression. Author: Ott Toomet. In view: Econometrics.

intpoint linear programming solver by the interior point method and graphically (two dimensions). Author: Alejandro Quintela del Rio.

irace Iterated Racing Procedures. Authors: Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Thomas Stützle, Mauro Birattari, Eric Yuan and Prasanna Balaprakash.

irtrees Estimation of Tree-Based Item Response Models. Authors: Ivailo Partchev and Paul De Boeck. In view: Psychometrics.

ivbma Bayesian Instrumental Variable Estimation and Model Determination via Conditional Bayes Factors. Authors: Alex Lenkoski, Anna Karl, Andreas Neudecker.

iwtp Numerical evaluation willingness to pay based on interval data. Authors: Yuri Belyaev, Wenchao Zhou.

jmec Fit joint model for event and censoring with cluster-level frailties. Author: Willem M. van der Wal.

joineR Joint modelling of repeated measurements and time-to-event data. Authors: Pete Philipson, Ines Sousa, Peter Diggle, Paula Williamson, Ruwanthi Kolamunnage-Dona, Robin Henderson.

kappaSize Sample Size Estimation Functions for Studies of Interobserver Agreement. Author: Michael A Rotondi.

kaps K Adaptive Partitioning for Survival data. Authors: Soo-Heang Eo and HyungJun Cho. In view: Survival.

kequate The kernel method of test equating. Authors: Bjorn Andersson, Kenny Branberg and Marie Wiberg.

knitr A general-purpose package for dynamic report generation in R. Author: . In view: ReproducibleResearch.

knnGarden Multi-distance based k-Nearest Neighbors. Author: Boxian Wei & Fan Yang & Xinmiao Wang & Yanni Ge.

koRpus An R Package for Text Analysis. Authors: m.eik michalke, with contributions from Earl Brown, Alberto Mirisola, and Laura Hauser. In view: NaturalLanguageProcessing.

kriging Ordinary Kriging. Author: Omar E. Olmedo.

ktspair k-Top Scoring Pairs for Microarray Classification. Author: Julien Damond.

labeledLoop Labeled Loop. Author: Kohske Takahashi.
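The knitr entry above describes a dynamic-report workflow; a minimal sketch (assuming the knitr package is installed from CRAN) is to knit an R Markdown source, which evaluates the embedded chunks and writes a Markdown file:

```r
## Minimal sketch of knitr's basic workflow (assumes the knitr
## package is installed from CRAN).
library(knitr)

src <- file.path(tempdir(), "report.Rmd")
fence <- strrep("`", 3)  # build the chunk fence without literal backquote runs
writeLines(c(
  "# Example report",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), src)

out <- knit(src, output = file.path(tempdir(), "report.md"))
file.exists(out)  # the Markdown output embeds the code and its results
```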

lava Linear Latent Variable Models. Author: Klaus K. Holst.

lava.tobit LVM with censored and binary outcomes. Author: Klaus K. Holst.

leapp latent effect adjustment after primary projection. Authors: Yunting Sun, Nancy R. Zhang, Art B. Owen.

leiv Bivariate linear errors-in-variables estimation. Author: David Leonard.

lestat A package for LEarning STATistics. Author: Petter Mostad.

likelihood Methods for maximum likelihood estimation. Author: Lora Murphy.

lisp List-processing à la SRFI-1. Author: Peter Danenberg.

lle Locally linear embedding. Authors: Holger Diedrich, Dr. Markus Abel (Department of Physics, University Potsdam).

lmbc Linear Model Bias Correction for RNA-Seq Data. Author: David Dalpiaz.

lmf Functions for estimation and inference of selection in age-structured populations. Author: Thomas Kvalnes.

logitnorm Functions for the logitnormal distribution. Author: Thomas Wutzler.

longclust Model-Based Clustering and Classification for Longitudinal Data. Authors: P. D. McNicholas, K. Raju Jampani and Sanjeena Subedi.

lqmm Linear Quantile Mixed Models. Author: Marco Geraci.

lsr Companion to "Learning Statistics with R". Author: Daniel Navarro.

lxb Fast LXB file reader. Author: Björn Winckler.

mada Meta-Analysis of Diagnostic Accuracy (mada). Author: Philipp Doebler.

makeProject Creates an empty package framework for the LCFD format. Author: Noah Silverman.

makeR Package for managing projects with multiple versions derived from a single source repository. Author: Jason Bryer.

malaria.em EM Estimation of Malaria Haplotype Probabilities from Multiply Infected Human Blood Samples. Author: Xiaohong Li.

markdown Markdown rendering for R. Authors: JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte.

marqLevAlg An algorithm for least-squares curve fitting. Authors: D. Commenges, M. Prague and Amadou Diakite.

maxlike Model species distributions by estimating the probability of occurrence using presence-only data. Authors: Richard Chandler and Andy Royle.

mcbiopi Matrix Computation Based Identification Of Prime Implicants. Author: Holger Schwender.

mcmcse Monte Carlo standard errors for MCMC. Author: James M. Flegal.

mets Analysis of Multivariate Event Times. Authors: Klaus K. Holst and Thomas Scheike.

mgraph Graphing map attributes and non-map variables in R. Author: George Owusu.

miscFuncs Miscellaneous Useful Functions. Author: Benjamin M. Taylor.

mlgt Multi-Locus Geno-Typing. Author: Dave T. Gerrard.

mmand Mathematical Morphology in Any Number of Dimensions. Author: Jon Clayden.

mmds Mixture Model Distance Sampling (mmds). Author: David Lawrence Miller.

mmm2 Multivariate marginal models with shared regression parameters. Authors: Ozgur Asar, Ozlem Ilk.

mmod Modern measures of population differentiation. Author: David Winter.

mobForest Model based Random Forest analysis. Authors: Nikhil Garge and Barry Eggleston.

mosaicManip Project MOSAIC (mosaic-web.org) manipulate examples. Authors: Andrew Rich, Daniel Kaplan, Randall Pruim.

mpMap Multi-parent RIL genetic analysis. Author: Emma Huang.

mpc MPC — Multiple Precision Complex Library. Author: Murray Stokely.

mpoly Symbolic computation and more with multivariate polynomials. Author: David Kahle.

mrds Mark-Recapture Distance Sampling (mrds). Authors: Jeff Laake, David Borchers, Len Thomas, David Miller and Jon Bishop.

mrp Multilevel Regression and Poststratification. Authors: Andrew Gelman, Michael Malecki, Daniel Lee, Yu-Sung Su.
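The markdown entry above pairs naturally with knitr output; a minimal sketch (assuming the markdown package is installed from CRAN) renders a Markdown file to HTML:

```r
## Minimal sketch: render Markdown to HTML with the markdown package
## (assumes it is installed from CRAN).
library(markdown)

md <- file.path(tempdir(), "example.md")
writeLines("*hello* from **markdown**", md)

html <- file.path(tempdir(), "example.html")
markdownToHTML(md, output = html)  # writes a standalone HTML document
file.exists(html)
```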

mrpdata Data for Multilevel Regression and Poststratification. Authors: Andrew Gelman, Michael Malecki, Daniel Lee, Yu-Sung Su, Jiqiang Guo.

mseapca Metabolite set enrichment analysis for factor loading in principal component analysis. Author: Hiroyuki Yamamoto.

msgps Degrees of freedom of elastic net, adaptive lasso and generalized elastic net. Author: Kei Hirose.

multivator A multivariate emulator. Author: Robin K. S. Hankin.

muma Metabolomic Univariate and Multivariate Analysis. Authors: Edoardo Gaude, Francesca Chignola, Dimitrios Spiliotopoulos, Silvia Mari, Andrea Spitaleri and Michela Ghitti.

myepisodes MyEpisodes RSS/API functions. Author: Matt Malin.

nadiv Functions to construct non-additive genetic relatedness matrices. Author: Matthew Wolak.

networkDynamic Dynamic Extensions for Network Objects. Authors: Ayn Leslie-Cook, Zack Almquist, Pavel N. Krivitsky, Skye Bender-deMoll, David R. Hunter, Martina Morris, Carter T. Butts.

neuroblastoma Neuroblastoma. Author: Toby Dylan Hocking.

nlsmsn Fitting nonlinear models with scale mixture of skew-normal distributions. Authors: Aldo Garay, Marcos Prates and Victor Lachos.

oblique.tree Oblique Trees for Classification Data. Author: Alfred Truong. In view: MachineLearning.

occ Estimates PET neuroreceptor occupancies. Author: Joaquim Radua. In view: MedicalImaging.

okmesonet Retrieve Oklahoma Mesonet climatological data. Authors: Brady Allred, Torre Hovick, Samuel Fuhlendorf.

opencpu.demo OpenCPU Demo apps. Authors: Jeroen Ooms and others as specified in the manual page.

opm Tools for analysing OmniLog(R) Phenotype Microarray data. Authors: Markus Goeker, with contributions by Lea A.I. Vaas, Johannes Sikorski, Nora Buddruhs and Anne Fiebig.

ops Optimal Power Space Transformation. Author: Micha Sammeth.

p2distance Welfare's Synthetic Indicator. Authors: A.J. Perez-Luque; R. Moreno; R. Perez-Perez and F.J. Bonet.

pacose iPACOSE, PACOSE and other methods for covariance selection. Authors: Vincent Guillemot, Andreas Bender.

pairedCI Confidence intervals for the ratio of locations and for the ratio of scales of two paired samples. Authors: Cornelia Froemke, Ludwig Hothorn and Michael Schneider.

pairheatmap A tool for comparing heatmaps. Author: Xiaoyong Sun.

paleotree Paleontological and Phylogenetic Analyses of Evolution. Author: David Bapst. In view: Phylogenetics.

parfm Parametric Frailty Models. Authors: Federico Rotolo and Marco Munda. In view: Survival.

pcaL1 Three L1-Norm PCA Methods. Authors: Sapan Jot and Paul Brooks.

pcaPA Parallel Analysis for ordinal and numeric data using polychoric and Pearson correlations with S3 classes. Authors: Carlos A. Arias and Víctor H. Cervantes.

pcenum Permutations and Combinations Enumeration. Author: Benjamin Auder.

pdc Permutation Distribution Clustering. Author: Andreas M. Brandmaier.

pgmm Parsimonious Gaussian mixture models. Authors: Paul McNicholas, Raju Jampani, Aaron McDaid, Brendan Murphy, Larry Banks.

pgnorm The p-generalized normal distribution. Author: Steve Kalke.

phcfM phcfM R package. Author: Ghislain Vieilledent.

phenology Tools to manage a parametric function that describes phenology. Author: Marc Girondot.

phia Post-Hoc Interaction Analysis. Author: Helios De Rosario-Martinez.

phom Persistent Homology in R. Author: Andrew Tausz.

pkgmaker Package development utilities. Author: Renaud Gaujoux.

plmDE Additive partially linear models for differential gene expression analysis. Author: Jonas Mueller.

pmc Phylogenetic Monte Carlo. Author: Carl Boettiger. In view: Phylogenetics.

pmclust Parallel Model-Based Clustering. Authors: Wei-Chen Chen, George Ostrouchov. In views: Cluster, HighPerformanceComputing.

polywog Bootstrapped Basis Regression with Oracle Model Selection. Authors: Brenton Kenkel and Curtis S. Signorino.

popdemo Provides Tools For Demographic Modelling Using Projection Matrices. Authors: Iain Stott, Dave Hodgson, Stuart Townley.

postCP A package to estimate posterior probabilities in change-point models using constrained HMM. Authors: Gregory Nuel and The Minh Luong.

qLearn Estimation and inference for Q-learning. Authors: Jingyi Xin, Bibhas Chakraborty, and Eric B. Laber.

qmap Quantile-mapping for post-processing climate model output. Author: Lukas Gudmundsson.

qtbase Interface between R and Qt. Authors: Michael Lawrence, Deepayan Sarkar.

qtlhot Inference for QTL Hotspots. Authors: Elias Chaibub Neto and Brian S Yandell.

qtlnet Causal Inference of QTL Networks. Authors: Elias Chaibub Neto and Brian S. Yandell.

qtpaint Qt-based painting infrastructure. Authors: Michael Lawrence, Deepayan Sarkar.

qtutils Miscellaneous Qt-based utilities. Author: Deepayan Sarkar.

quantspec Quantile-based Spectral Analysis of Time Series. Author: Tobias Kley. In view: TimeSeries.

rBeta2009 The Beta Random Number and Dirichlet Random Vector Generating Functions. Authors: Ching-Wei Cheng, Ying-Chao Hung, Narayanaswamy Balakrishnan.

rFerns Random ferns classifier. Author: Miron B. Kursa.

rSFA Slow Feature Analysis in R. Authors: Wolfgang Konen, Martin Zaefferer, Patrick Koch; Bug hunting and testing by Ayodele Fasika, Ashwin Kumar, Prawyn Jebakumar.

rapport A report templating system. Authors: Aleksandar Blagotić and Gergely Daróczi.

rbamtools Reading, manipulation and writing BAM (Binary alignment) files. Author: Wolfgang Kaisers.

rcppbugs R binding for cppbugs. Author: Whit Armstrong.

rdryad Dryad API interface. Authors: Scott Chamberlain [aut, cre], Carl Boettiger [aut], Karthik Ram [aut].

readMLData Reading machine learning benchmark data sets from different sources in their original format. Author: Petr Savicky.

readbitmap Read bitmap images (BMP, JPEG, PNG) without external dependencies. Author: Gregory Jefferis.

reams Resampling-Based Adaptive Model Selection. Authors: Philip Reiss and Lei Huang.

rehh Searching for footprints of selection using Haplotype Homozygosity based tests. Authors: Mathieu Gautier and Renaud Vitalis.

relSim Relative Simulator. Author: James M. Curran.

replicationDemos Agnotology. Author: Rasmus E. Benestad.

represent Determine the representativity of two multidimensional data sets. Author: Harmen Draisma.

rgbif Interface to the Global Biodiversity Information Facility API methods. Authors: Scott Chamberlain [aut, cre], Carl Boettiger [aut], Karthik Ram [aut].

ri ri: R package for performing randomization-based inference for experiments. Authors: Peter M. Aronow and Cyrus Samii.

risaac A fast cryptographic random number generator using ISAAC by Robert Jenkins. Author: David L Gibbs.

riskRegression Risk regression for survival analysis. Authors: Thomas Alexander Gerds, Thomas Harder Scheike. In view: Survival.

rknn Random KNN Classification and Regression. Author: Shengqiao Li.

rngSetSeed Initialization of the R base random number generator Mersenne-Twister with a vector of an arbitrary length. Author: Petr Savicky.

robustHD Robust methods for high-dimensional data. Author: Andreas Alfons.

robustfa An Object Oriented Framework for Robust Factor Analysis. Author: Ying-Ying Zhang (Robert).

rolasized Solarized colours in R. Author: Christian Zang.

ror Robust Ordinal Regression MCDA library. Author: Tommi Tervonen.

rpartScore Classification trees for ordinal responses. Authors: Giuliano Galimberti, Gabriele Soffritti, Matteo Di Maso.

rplos Interface to PLoS Journals API methods. Authors: Scott Chamberlain, Carl Boettiger.

rportfolios Random portfolio generation. Author: Frederick Novomestky.

rreval Remote R Evaluator (rreval). Author: Barnet Wagman. In view: HighPerformanceComputing.

rriskDistributions Collection of functions for fitting distributions (related to the 'rrisk' project). Authors: Natalia Belgorodski, Matthias Greiner, Kristin Tolksdorf, Katharina Schueller.

rsgcc Gini methodology-based correlation and clustering analysis of both microarray and RNA-Seq gene expression data. Authors: Chuang Ma, Xiangfeng Wang.

rstiefel Random orthonormal matrix generation on the Stiefel manifold. Author: Peter Hoff.

sROC Nonparametric Smooth ROC Curves for Continuous Data. Author: Xiao-Feng Wang.

scam Shape constrained additive models. Author: Natalya Pya.

scio Sparse Column-wise Inverse Operator. Authors: Xi (Rossi) Luo and Weidong Liu.

sdprisk Measures of Risk for the Compound Poisson Risk Process with Diffusion. Authors: Benjamin Baumgartner, Riccardo Gatto.

semGOF Goodness-of-fit indices for structural equations models. Author: Elena Bertossi.

semTools Useful tools for structural equation modeling. Authors: Sunthud Pornprasertmanit, Patrick Miller, Alex Schoemann, Yves Rosseel.

sentiment Tools for Sentiment Analysis. Author: Timothy P. Jurka.

sessionTools Tools for saving, restoring and teleporting R sessions. Author: Matthew D. Furia.

setwidth Set the value of options("width") on terminal emulators. Author: Jakson Aquino.

sft Functions for Systems Factorial Technology Analysis of Data. Authors: Joe Houpt, Leslie Blaha.

shotGroups Analyze shot group data. Author: Daniel Wollschlaeger.

simSummary Simulation summary. Author: Gregor Gorjanc.

simsem SIMulated Structural Equation Modeling. Authors: Sunthud Pornprasertmanit, Patrick Miller, Alexander Schoemann.

sitools Format a number to a string with SI prefix. Authors: Jonas Stein, Ben Tupper.

skewtools Tools for analyze Skew-Elliptical distributions and related models. Author: Javier E. Contreras-Reyes.

smcure Fit Semiparametric Mixture Cure Models. Authors: Chao Cai, Yubo Zou, Yingwei Peng, Jiajia Zhang.

soilDB Soil Database Interface. Authors: D.E. Beaudette and J.M. Skovlin.

soobench Single Objective Optimization Benchmark Functions. Authors: Olaf Mersmann, Bernd Bischl.

spMC Continuous Lag Spatial Markov Chains. Author: Luca Sartore.

spaceExt Extension of SPACE. Author: Shiyuan He.

spacom Spatially weighted context data for multilevel modelling. Authors: Till Junge, Sandra Penic, Guy Elcheroth.

sparseLTSEigen RcppEigen back end for sparse least trimmed squares regression. Author: Andreas Alfons.

sparsenet Fit sparse linear regression models via nonconvex optimization. Authors: Rahul Mazumder, Trevor Hastie, Jerome Friedman.

spartan Spartan (Simulation Parameter Analysis R Toolkit ApplicatioN). Authors: Kieran Alden, Mark Read, Paul Andrews, Jon Timmis, Henrique Veiga-Fernandes, Mark Coles.

spatialprobit Spatial Probit Models. Authors: Stefan Wilhelm and Miguel Godinho de Matos.

spcadjust Functions for calibrating control charts. Authors: Axel Gandy and Jan Terje Kvaloy.

spclust Selective phenotyping for experimental crosses. Author: Emma Huang.

spcov Sparse Estimation of a Covariance Matrix. Authors: Jacob Bien and Rob Tibshirani.

speedglm Fitting Linear and Generalized Linear Models to large data sets. Author: Marco ENEA. In view: HighPerformanceComputing.

splm Econometric Models for Spatial Panel Data. Authors: Giovanni Millo, Gianfranco Piras.

stabledist Stable Distribution Functions. Authors: Diethelm Wuertz, Martin Maechler and Rmetrics core team members. In view: Distributions.

stellaR stellar evolution tracks and isochrones. Authors: Matteo Dell'Omodarme [aut, cre], Giada Valle [aut]. In view: ChemPhys.

stpp Space-Time Point Pattern simulation, visualisation and analysis. Authors: Edith Gabriel, Peter J Diggle, stan function by Barry Rowlingson.

superdiag R Code for Testing Markov Chain Nonconvergence. Authors: Tsung-han Tsai, Jeff Gill and Jonathan Rapkin.

survSNP Power and Sample Size Calculations for SNP Association Studies with Censored Time to Event Outcomes. Authors: Kouros Owzar, Zhiguo Li, Nancy Cox, Sin-Ho Jung and Chanhee Yi.

svHttp SciViews GUI API — R HTTP server. Author: Philippe Grosjean.

svKomodo SciViews GUI API — Functions to interface with Komodo Edit/IDE. Author: Philippe Grosjean.

sybil SyBiL — Efficient constrained based modelling in R. Authors: Gabriel Gelius-Dietrich [cre], C. Jonathan Fritzemeier [ctb], Rajen Piernikarczyk [ctb], Benjamin Braasch [ctb].

sybilDynFBA Dynamic FBA: dynamic flux balance analysis. Author: Abdelmoneim Amer Desouki.

synbreed Framework for the analysis of genomic prediction data using R. Authors: Valentin Wimmer, Theresa Albrecht, Hans-Juergen Auinger, Chris-Carolin Schoen with contributions by Larry Schaeffer, Malena Erbe, Ulrike Ober and Christian Reimer.

synbreedData Data for the synbreed package. Authors: Valentin Wimmer, Theresa Albrecht, Hans-Juergen Auinger, Chris-Carolin Schoen with contributions by Malena Erbe, Ulrike Ober and Christian Reimer.

tawny.types Common types for tawny. Author: Brian Lee Yung Rowe.

tempdisagg Methods for Temporal Disaggregation and Interpolation of Time Series. Authors: Christoph Sax, Peter Steiner. In view: TimeSeries.

ternvis Visualisation, verification and calibration of ternary probabilistic forecasts. Author: Tim Jupp.

tfplot Time Frame User Utilities. Author: Paul Gilbert.

timetools provides objects and tools to manipulate irregular unhomogeneous time data and sub-time data. Author: Vladislav Navel.

tlmec Linear Student-t Mixed-Effects Models with Censored Data. Authors: Larissa Matos, Marcos Prates and Victor Lachos. In view: Survival.

tm.plugin.factiva A plug-in for the tm text mining framework to import articles from Factiva. Author: Milan Bouchet-Valat.

tpe Tree preserving embedding. Authors: Albert D. Shieh, Tatsunori B. Hashimoto, and Edoardo M. Airoldi.

translate Bindings for the Google Translate API v2. Author: Peter Danenberg.

truncdist Truncated Random Variables. Authors: Frederick Novomestky, Saralees Nadarajah.

turboEM A Suite of Convergence Acceleration Schemes for EM and MM algorithms. Authors: Jennifer F. Bobb, Hui Zhao, and Ravi Varadhan.

vacem Vaccination Activities Coverage Estimation Model. Authors: Justin Lessler, Jessica Metcalf.

varbvs Variational inference for Bayesian variable selection. Author: Peter Carbonetto.

vec2dtransf 2D Cartesian Coordinate Transformation. Author: German Carrillo. In view: Spatial.

vimcom Intermediate the communication between Vim and R. Author: Jakson Aquino.

visreg Visualization of regression functions. Authors: Patrick Breheny, Woodrow Burchett.

visualFields Statistical methods for visual fields. Authors: Ivan Marin-Franch, Paul H Artes.

waffect A package to simulate constrained phenotypes under a disease model H1. Authors: Gregory Nuel and Vittorio Perduca.

weightedKmeans Weighted KMeans Clustering. Authors: Graham Williams, Joshua Z Huang, Xiaojun Chen, Qiang Wang, Longfei Xiao.

wfe Weighted Linear Fixed Effects Estimators for Causal Inference. Authors: In Song Kim, Kosuke Imai.

Other changes

• The following packages which were previously double-hosted on CRAN and Bioconductor were moved to the Archive, and are now solely available from Bioconductor: aroma.light, DART.

• The following packages which were previously double-hosted on CRAN and Omegahat were moved to the Archive, and are now solely available from Omegahat: CGIwithR, Rstem, XMLSchema.


• The following packages were moved to the Archive: Aelasticnet, BARD, BioIDMapper, CoCo, CoCoCg, CoCoGraph, DiagnosisMed, FEST, FactoClass, FracSim, GGally, GLDEX, IniStatR, LLAhclust, MMG, OSACC, Package:, RFreak, RImageJ, ROracleUI, RPPanalyzer, SSOAP, Sim.DiffProcGUI, SoPhy, SocrataR, SubpathwayMiner, TSTutorial, TradeStrategyAnalyzer, agsemisc, binomSamSize, caMassClass, catmap, cheatmap, dmRTools, doSMP, dummies, egarch, farmR, formula.tools, fso, geoPlot, geofd, ggcolpairs, gllm, glpk, gmaps, gmvalid, gsc, hsmm, kernelPop, kinship, knnflex, lago, last.call, latentnet, lcd, log10, makesweave, mapReduce, marelacTeaching, mirf, mvgraph, mvtnormpcs, ncomplete, nnDiag, nonrandom, noverlap, nvis, ofp, operator.tools, optpart, pheno, pkDACLASS, press, pressData, revoIPC, rimage, rmetasim, rpvm, segclust, siatclust, snpXpert, spef, statnet, stringkernels, svcR, time, treelet, ttime, twopartqtl, urn, yest.

• The following packages were resurrected from the Archive: DescribeDisplay, EMC, EMCC, G1DBN, SMC, diseasemapping, fArma, irtProb, partsm, powerpkg, riv.

Kurt Hornik
WU Wirtschaftsuniversität Wien, Austria
[email protected]

Achim Zeileis
Universität Innsbruck, Austria
[email protected]


R Foundation News

by Kurt Hornik

Donations and new members

Donations

Ilker Esener, Switzerland
Greenwich Statistics, France
Kansai University, Faculty Commerce, Japan
Olaf Krenz, Germany
Phil Shepherd, New Zealand
Thomas Roth, Germany

New benefactors

Greenwich Statistics, France
Quantide, Italy

New supporting institutions

Clinical Trial Unit, University Hospital Basel, Switzerland

New supporting members

Benjamin French, USA
Arthur W. Green, USA
Scott Kostyshak, USA
Philippe Baril Lecavalier, Canada
Joshua Wiley, USA
Mathe W. Woodyard, USA

Kurt Hornik
WU Wirtschaftsuniversität Wien, Austria
[email protected]
