R News: The Newsletter of the R Project. Volume 8/2, October 2008

Editorial

by John Fox

I had the opportunity recently to examine the development of the R Project. A dramatic index of this development is the rapid growth in R packages, which is displayed in Figure 1; because the vertical axis in this graph is on a log scale, the least-squares linear-regression line represents exponential growth.

This issue of R News is another kind of testament to the richness and diversity of applications programmed in R: Hadley Wickham, Michael Lawrence, Duncan Temple Lang, and Deborah Swayne describe the rggobi package, which provides an interface to GGobi, software for dynamic and interactive statistical graphics. Carsten Dormann, Bernd Gruber, and Jochen Fründ introduce the bipartite package for examining interactions among species in ecological communities. Ioannis Kosmidis explains how to use the profileModel package to explore likelihood surfaces and construct confidence intervals for parameters of models fit by maximum-likelihood and closely related methods. Ingo Feinerer illustrates how the tm ("text-mining") package is employed for the analysis of textual data. Yihui Xie and Xiaoyue Cheng demonstrate the construction of statistical animations in R with the animation package. Thomas Yee introduces the VGAM package, which is capable of fitting a dizzying array of statistical models. Paul Murrell shows how the compare package is used to locate similarities and differences among non-identical objects, and explains how the package may be applied to grading students' assignments. Arthur Allignol, Jan Beyersmann, and Martin Schumacher describe the mvna package for the Nelson-Aalen estimator of cumulative transition hazards in multistate models. Finally, in the latest installment of the Programmers' Niche column, Vince Carey reveals the mind-bending qualities of recursive computation in R without named self-reference.

The "news" portion of this issue of R News includes a report on the recent useR! conference in Dortmund, Germany; announcements of useR! 2009 in Rennes, France, and of DSC 2009 in Copenhagen, Denmark, both to be held in July; news from the R Foundation and from the Bioconductor Project; the latest additions to CRAN; and changes in the recently released version 2.8.0 of R.

Unless we accumulate enough completed articles to justify a third issue in 2008, this will be my last issue as editor-in-chief of R News. Barring a third issue in 2008, this is also the last issue of R News, but don't panic! R News will be reborn in 2009 as The R Journal. Moreover, the purpose and general character of the publication will be largely unchanged. We think that the new name better reflects the direction in which R News has evolved, and in particular the quality of the refereed articles that we publish.

During my tenure as an R News editor, I have had the pleasure of working with Vince Carey, Peter Dalgaard, Torsten Hothorn, and Paul Murrell. I look forward to working next year with Heather Turner, who is joining the editorial board in 2009, which will be my last year on the board.

John Fox
McMaster University, Canada
[email protected]

Contents of this issue:

Editorial ...... 1
An Introduction to rggobi ...... 3
Introducing the bipartite Package: Analysing Ecological Networks ...... 8
The profileModel R Package: Profiling Objectives for Models with Linear Predictors ...... 12
An Introduction to Text Mining in R ...... 19
animation: A Package for Statistical Animations ...... 23
The VGAM Package ...... 28
Comparing Non-Identical Objects ...... 40
mvna: An R Package for the Nelson-Aalen Estimator in Multistate Models ...... 48
Programmers' Niche: The Y of R ...... 51
Changes in R Version 2.8.0 ...... 54
Changes on CRAN ...... 60
News from the Bioconductor Project ...... 69
useR! 2008 ...... 71
Forthcoming Events: useR! 2009 ...... 73
Forthcoming Events: DSC 2009 ...... 74
R Foundation News ...... 74


Figure 1: The number of R packages on CRAN has grown exponentially since R version 1.3 in 2001. Source of data: https://svn.r-project.org/R/branches/.


An Introduction to rggobi

Hadley Wickham, Michael Lawrence, Duncan Temple Lang, and Deborah F. Swayne

Introduction

The rggobi (Temple Lang and Swayne, 2001) package provides a command-line interface to GGobi, an interactive and dynamic graphics package. Rggobi complements GGobi's graphical user interface by enabling fluid transitions between analysis and exploration and by automating common tasks. It builds on the first version of rggobi to provide a more robust and user-friendly interface. In this article, we show how to use rggobi and offer some examples of the insights that can be gained by using a combination of analysis and visualisation.

This article assumes some familiarity with GGobi. A great deal of documentation, both introductory and advanced, is available on the GGobi web site, http://www.ggobi.org; newcomers to GGobi might find the demos at ggobi.org/docs especially helpful. The software is there as well, both source and executables for several platforms. Once you have installed GGobi, you can install rggobi and its dependencies using install.packages("rggobi", dep = TRUE).

This article introduces the three main components of rggobi, with examples of their use in common tasks:

• getting data into and out of GGobi;
• modifying observation-level attributes ("automatic brushing");
• basic plot control.

We will also discuss some advanced techniques, such as creating animations with GGobi, the use of edges, and analysing longitudinal data. Finally, a case study shows how to use rggobi to create a visualisation for a statistical algorithm: manova.

Data

Getting data from R into GGobi is easy: g <- ggobi(mtcars). This creates a GGobi object called g. Getting data out isn't much harder: just index that GGobi object by position (g[[1]]) or by name (g[["mtcars"]] or g$mtcars). These return GGobiData objects which are linked to the data in GGobi. They act just like regular data frames, except that changes are synchronised with the data in the corresponding GGobi. You can get a static copy of the data using as.data.frame.

Once you have your data in GGobi, it's easy to do something that was hard before: find multivariate outliers. It is customary to look at uni- or bivariate plots to look for uni- or bivariate outliers, but higher-dimensional outliers may go unnoticed. Looking for these outliers is easy to do with the tour. Open your data with GGobi, change to the tour view, and select all the variables. Watch the tour and look for points that are far away or move differently from the others; these are outliers.

Adding more data sets to an open GGobi is also easy: g$mtcars2 <- mtcars will add another data set named "mtcars2". You can load any file type that GGobi recognises by passing the path to that file. In conjunction with ggobi_find_file, which locates files in the GGobi installation directory, this makes it easy to load GGobi sample data. This example loads the olive oils data set included with GGobi:

library(rggobi)
ggobi(ggobi_find_file("data", "olive.csv"))

Modifying observation-level attributes, or "automatic brushing"

Brushing is typically thought of as an operation performed in a graphical user interface. In GGobi, it refers to changing the colour and symbol (glyph type and size) of points. It is typically performed in a linked environment, in which changes propagate to every plot in which the brushed observations are displayed. In GGobi, brushing includes shadowing, where points sit in the background and have less visual impact, and exclusion, where points are completely excluded from the plot. Using rggobi, brushing can be performed from the command line; we think of this as "automatic brushing." The following functions are available:

• change glyph colour with glyph_colour (or glyph_color);
• change glyph size with glyph_size;
• change glyph type with glyph_type;
• shadow and unshadow points with shadowed;
• exclude and include points with excluded.

Each of these functions can be used to get or set the current values for the specified GGobiData. The "getters" are useful for retrieving information that you have created while brushing in GGobi, and the "setters" can be used to change the appearance of points based on model information, or to create animations. They can also be used to store, and then later recreate, the results of a complicated sequence of brushing steps.
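A minimal sketch of this getter/setter pattern (assuming a working GGobi installation; the colour codes and brushing criteria are arbitrary choices for illustration):

library(rggobi)
g <- ggobi(mtcars)
d <- g$mtcars                                    # linked GGobiData object
glyph_colour(d) <- ifelse(mtcars$am == 1, 2, 3)  # paint by transmission type
excluded(d) <- mtcars$mpg < 15                   # hide the low-mpg cars
saved <- glyph_colour(d)                         # "getter": record current colours
# ... brush interactively in GGobi ...
glyph_colour(d) <- saved                         # "setter": restore the recorded state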

This example demonstrates the use of glyph_colour to show the results of clustering the infamous iris data using hierarchical clustering. Using GGobi allows us to investigate the clustering in the original dimensions of the data. The graphic shows a single projection from the grand tour.

g <- ggobi(iris)
clustering <- hclust(dist(iris[, 1:4]),
  method = "average")
glyph_colour(g[1]) <- cutree(clustering, 3)

Another function, selected, returns a logical vector indicating whether each point is currently enclosed by the brush. This could be used to further explore interesting or unusual points.

Displays

A GGobiDisplay represents a window containing one or more related plots. With rggobi you can create new displays, change the projection of an existing plot, set the mode which determines the interactions available in a display, or select a different set of variables to plot.

To retrieve a list of displays, use the displays function. To create a new display, use the display method of a GGobiData object. You can specify the plot type (the default is a bivariate scatterplot, called "XY Plot") and variables to include. For example:

g <- ggobi(mtcars)
display(g[1], vars = list(X = 4, Y = 5))
display(g[1], vars = list(X = "drat", Y = "hp"))
display(g[1], "Parallel Coordinates Display")
display(g[1], "2D Tour")

The following display types are available in GGobi (all are described in the manual, available from ggobi.org/docs):

Name                           Variables
1D Plot                        1 X
XY Plot                        1 X, 1 Y
1D Tour                        n X
Rotation                       1 X, 1 Y, 1 Z
2D Tour                        n X
2x1D Tour                      n X, n Y
Scatterplot Matrix             n X
Parallel Coordinates Display   n X
Time Series                    1 X, n Y
Barchart                       1 X

After creating a plot you can get and set the displayed variables using the variables and variables<- methods. Because of the range of plot types in GGobi, variables should be specified as a list of one or more named vectors. All displays require an X vector, and some require Y and even Z vectors, as specified in the above table.

g <- ggobi(mtcars)
d <- display(g[1], "Parallel Coordinates Display")
variables(d)
variables(d) <- list(X = 8:6)
variables(d) <- list(X = 8:1)
variables(d)

The function ggobi_display_save_picture saves the contents of a GGobi display to a file on disk; it is what we used to create the images in this document. It creates an exact (raster) copy of the GGobi display. If you want to create publication-quality graphics from GGobi, have a look at the DescribeDisplay plugin and package at http://www.ggobi.org/describe-display, which create R versions of GGobi plots.

To support the construction of custom interactive graphics applications, rggobi enables the embedding of GGobi displays in graphical user interfaces (GUIs) based on the RGtk2 package. If embed = TRUE is passed to the display method, the display is not immediately shown on the screen but is returned to R as a GtkWidget object suitable for use with RGtk2. Multiple displays of different types may be combined with other widgets to form a cohesive GUI designed for a particular data analysis task.
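For instance, a GGobi display might be packed into a GTK+ window along these lines (a sketch only, assuming rggobi and RGtk2 are installed; the widget layout is illustrative):

library(rggobi)
library(RGtk2)
g <- ggobi(mtcars)
# embed = TRUE returns a GtkWidget instead of opening a window
w <- display(g[1], embed = TRUE)
win <- gtkWindow(show = FALSE)
win$setTitle("Embedded GGobi display")
win$add(w)
win$showAll()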

Animation

Any changes that you make to the GGobiData objects are updated in GGobi immediately, so you can easily create animations. This example scrolls through a long time series:

df <- data.frame(
  x = 1:2000,
  y = sin(1:2000 * pi / 20) + runif(2000, max = 0.5)
)

g <- ggobi_longitudinal(df[1:100, ])
df_g <- g[1]
for (i in 1:1901) {
  df_g[, 2] <- df[i:(i + 99), 2]
}

Edge data

In GGobi, an edge data set is treated as a special type of dataset in which a record describes an edge, which may still be associated with an n-tuple of data. Edges can be used to represent many different types of data, such as distances between observations, social relationships, or biological pathways.

In this example we explore marital and business relationships between Florentine families in the 15th century. The data comes from the ergm (social network analysis) package (Handcock et al., 2003), in the format provided by the network package (Butts et al., 2008).

install.packages(c("network", "ergm"))
library(ergm)
data(florentine)

flo <- data.frame(
  priorates = get.vertex.attribute(
    flobusiness, "priorates"
  ),
  wealth = get.vertex.attribute(
    flobusiness, "wealth"
  )
)

families <- network.vertex.names(flobusiness)
rownames(flo) <- families

edge_data <- function(net) {
  edges <- as.matrix.network(
    net, matrix.type = "edgelist"
  )
  matrix(families[edges], ncol = 2)
}

g <- ggobi(flo)
edges(g) <- edge_data(flomarriage)
edges(g) <- edge_data(flobusiness)

This example has two sets of edges because some pairs of families have marital relationships but not business relationships, and vice versa. We can use the edges menu in GGobi to change between the two edge sets and compare the relationship patterns they reveal.

How is this stored in GGobi? An edge dataset records the names of the source and destination observations for each edge. You can convert a regular dataset into an edge dataset with the edges function. This takes a matrix with two columns, source and destination names, with a row for each edge observation. Typically, you will need to add a new data frame with a number of rows equal to the number of edges you want to add.
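A minimal hand-built edge set might look like this (a sketch; "A", "B" and "C" are hypothetical labels that must match the row names of the data already in GGobi):

g <- ggobi(data.frame(x = 1:3, row.names = c("A", "B", "C")))
links <- matrix(c("A", "B",
                  "B", "C"), ncol = 2, byrow = TRUE)
edges(g) <- links  # one row per edge: source name, destination name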

Longitudinal data

A special case of data with edges is time series or longitudinal data, in which observations adjacent in time are connected with a line. Rggobi provides a convenient function for creating edge sets for longitudinal data, ggobi_longitudinal, that links observations in sequential time order.

This example uses the stormtracks data included in rggobi. The first argument is the dataset, the second is the variable specifying the time component, and the third is the variable that distinguishes the observations.

ggobi_longitudinal(stormtracks, seasday, id)

For regular time series data (already in order, with no grouping variables), just use ggobi_longitudinal with no other arguments.


Case study

This case study explores using rggobi to add model information to data; here we will add confidence ellipsoids around the means so we can perform a graphical manova.

The first (and most complicated) step is to generate the confidence ellipsoids. The conf.ellipse function below does this. First we generate random points on the surface of a sphere by drawing npoints from a random normal distribution and standardising each dimension. This sphere is then skewed to match the desired variance-covariance matrix, and its size is adjusted to give the appropriate cl-level confidence ellipsoid. Finally, the ellipsoid is translated to match the column locations.

conf.ellipse <- function(data, npoints = 1000,
  cl = 0.95, mean = colMeans(data), cov = var(data),
  n = nrow(data)
) {
  norm.vec <- function(x) x / sqrt(sum(x^2))

  p <- length(mean)
  ev <- eigen(cov)

  normsamp <- matrix(rnorm(npoints * p), ncol = p)
  sphere <- t(apply(normsamp, 1, norm.vec))
  ellipse <- sphere %*%
    diag(sqrt(ev$values)) %*% t(ev$vectors)
  conf.region <- ellipse * sqrt(p * (n - 1) *
    qf(cl, p, n - p) / (n * (n - p)))
  if (!missing(data))
    colnames(conf.region) <- colnames(data)

  conf.region + rep(mean, each = npoints)
}

This function can be called with a data matrix, or with the sufficient statistics (mean, covariance matrix, and number of points). We can look at the output with ggobi:

ggobi(conf.ellipse(mean = c(0, 0), cov = diag(2), n = 100))

cv <- matrix(c(1, 0.15, 0.25, 1), ncol = 2)
ggobi(conf.ellipse(mean = c(1, 2), cov = cv, n = 100))

mean <- c(0, 0, 1, 2)
ggobi(conf.ellipse(mean = mean, cov = diag(4), n = 100))

In the next step, we will need to take the original data and supplement it with the generated ellipsoid:

manovaci <- function(data, cl = 0.95) {
  dm <- data.matrix(data)
  ellipse <- as.data.frame(
    conf.ellipse(dm, n = 1000, cl = cl)
  )

  both <- rbind(data, ellipse)
  both$SIM <- factor(
    rep(c(FALSE, TRUE), c(nrow(data), 1000))
  )

  both
}

ggobi(manovaci(matrix(rnorm(30), ncol = 3)))

Finally, we create a method to break a dataset into groups based on a categorical variable and compute the mean confidence ellipsoid for each group. We then use the automatic brushing functions to make the ellipsoids distinct and to paint each group a different colour. Here we use 68% confidence ellipsoids so that non-overlapping ellipsoids are likely to have significantly different means.

ggobi_manova <- function(data, catvar, cl = 0.68) {
  each <- split(data, catvar)
  cis <- lapply(each, manovaci, cl = cl)

  df <- as.data.frame(do.call(rbind, cis))
  df$var <- factor(rep(
    names(cis), sapply(cis, nrow)
  ))

  g <- ggobi(df)
  glyph_type(g[1]) <- c(6, 1)[df$SIM]
  glyph_colour(g[1]) <- df$var
  invisible(g)
}
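A hypothetical call, using the iris data again, would draw one 68% confidence ellipsoid per species:

g <- ggobi_manova(iris[, 1:4], iris$Species)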

These images show a graphical manova. You can see that in some projections the means overlap, but in others they do not.

Conclusion

GGobi is designed for data exploration, and its integration with R through rggobi allows a seamless workflow between analysis and exploration.

Much of the potential of rggobi has yet to be realized, but some ideas are demonstrated in the classifly package (Wickham, 2007), which visualises high-dimensional classification boundaries. We are also keen to hear about your work: if you develop a package using rggobi, please let us know so we can highlight it on the GGobi homepage.

We are currently working on the infrastructure behind GGobi and rggobi to allow greater control from within R. The next generation of rggobi will offer a direct low-level binding to the public interface of every GGobi module. This will coexist with the high-level interface presented in this paper. We are also working on consistently generating events in GGobi so that you will be able to respond to events of interest from your R code. Together with the RGtk2 package (Lawrence and Temple Lang, 2007), this should allow the development of custom interactive graphics applications for specific tasks, written purely with high-level R code.

Acknowledgements

This work was supported by National Science Foundation grant DMS0706949. The ellipse example is taken, with permission, from Professor Di Cook's notes for multivariate analysis.

Bibliography

C. T. Butts, M. S. Handcock, and D. R. Hunter. network: Classes for Relational Data. Irvine, CA, July 26, 2008. URL http://statnet.org/. R package version 1.4-1.

M. S. Handcock, D. R. Hunter, C. T. Butts, S. M. Goodreau, and M. Morris. ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks. Seattle, WA, 2003. URL http://CRAN.R-project.org/package=ergm. Version 2.1. Project home page at http://statnetproject.org.

M. Lawrence and D. Temple Lang. RGtk2: R bindings for Gtk 2.8.0 and above, 2007. URL http://www.ggobi.org/rgtk2, http://www.omegahat.org. R package version 2.12.1.

D. Temple Lang and D. F. Swayne. GGobi meets R: an extensible environment for interactive dynamic data visualization. In Proceedings of the 2nd International Workshop on Distributed Statistical Computing, 2001.

H. Wickham. classifly: Explore classification models in high dimensions, 2007. URL http://had.co.nz/classifly. R package version 0.2.3.


Introducing the bipartite Package: Analysing Ecological Networks

Carsten F. Dormann, Bernd Gruber and Jochen Fründ

Introduction

Interactions among species in ecological communities have long fascinated ecologists. Prominent recent examples include pollination webs (Memmott et al., 2004), species-rich predator-prey systems (Tylianakis et al., 2007) and seed dispersal mutualisms (all reviewed in Blüthgen et al., 2007). Many of the topological descriptors used in food webs since the 1960s have limited ecological meaning when only two trophic levels are investigated (for example chain length: Pimm, 1982/2002; Montoya et al., 2006). Here, the network becomes bipartite, i.e. every member of one trophic level is only connected to members of the other trophic level: direct interactions within trophic levels are regarded as unimportant. For bipartite ecological networks very many, more or less different, indices have been proposed to capture important features of the interactions and of the species. For example, species degrees (the number of species the target species is linked to) or species strength (the sum of the dependencies of other species on the target) are supposed to quantify the importance of a species in a web.

The new R package bipartite, introduced here, provides functions to visualise webs and calculate a series of indices commonly used to describe patterns in ecological webs. It focusses on webs consisting of only two trophic levels, e.g. pollination webs or predator-prey webs. We had three types of ecological bipartite webs in mind when writing this package: seed-disperser, plant-pollinator and predator-prey systems.

Bipartite networks, as analysed and plotted in the package bipartite, can be represented by a matrix in which, in our definition, columns represent species in the higher trophic level, and rows species in the lower trophic level. Entries in the matrix represent observed links, either quantitatively (with one to many interactions per link) or qualitatively (binary). Usually such matrices are very sparse, marginal sums (i.e. abundance distributions) are highly skewed, and the average number of interactions per link is low (around 2: Blüthgen et al., 2007).

With the package bipartite, presented here, we wanted to overcome two main deficiencies in the field: 1. the lack of software to calculate various indices and topological descriptors of bipartite networks; and 2. the lack of a convenient plotting tool for bipartite networks. This article aims to briefly present the two visualisation functions (plotweb and visweb), then present an example output from the calculation of network-level descriptors (using function networklevel), and finally address some miscellaneous issues to do with fitting degree distributions, secondary extinction slopes and null models for bipartite webs.

Along with several functions we also include 19 data sets on pollinator networks, taken from the National Center for Ecological Analysis and Synthesis webpage devoted to this topic (www.nceas.ucsb.edu/interactionweb). There are several other bipartite data sets at this repository, and our data include only quantitative plant-pollinator networks.

Plotting ecological networks

The function plotweb draws a bipartite graph in which rectangles represent species, and the width is proportional to the sum of interactions involving this species. Interacting species are linked by lines, whose width is again proportional to the number of interactions (but can be represented as simple lines or triangles pointing up or down). An example is given in Fig. 1 for the data set mosquin1967, which is included in the package; the plot can be generated using plotweb(mosquin1967).

Alternatively, the function visweb plots the data matrix with shading representing the number of interactions per link. As default, this gives an idea about the filling of the matrix. With option type="diagonal", however, visweb depicts compartments for easy perception (Fig. 2). The same compartments are visible in Fig. 1, too, because the default sequence of species there is the arrangement used in visweb(., type="diagonal"). This sequence is determined, as suggested by Lewinsohn et al. (2006), using correspondence analysis.

Figure 2: A plot of the network matrix produced by visweb(mosquin1967, type="diagonal").


Figure 1: A bipartite graph produced by default settings of function plotweb.
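Collected as a runnable sketch (assuming bipartite is installed), the calls behind Figures 1 and 2 are simply:

library(bipartite)
data(mosquin1967)                       # pollination web shipped with the package
plotweb(mosquin1967)                    # bipartite graph, as in Figure 1
visweb(mosquin1967, type = "diagonal")  # matrix view with compartments, as in Figure 2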

Calculating network metrics

bipartite features two main functions to calculate indices: specieslevel and networklevel. The former returns various values describing, e.g., the specialisation or the dependence asymmetry of each species. Since its output is bulky, we omit it here and only present the function networklevel in detail, which also comprises a larger set of different indices:

> networklevel(mosquin1967)
$`number of higher trophic species`
[1] 18

$`number of lower trophic species`
[1] 11

$`number of links`
[1] 1.310345

$generality
[1] 2.677306

$vulnerability
[1] 4.114345

$`interaction evenness`
[1] 0.8512671

$`Alatalo interaction evenness`
[1] 0.6587369

$`number of compartments`
[1] 3

$`compartment diversity`
[1] 2.00787

$`cluster coefficient`
[1] 0.1363636

$H2
[1] 0.4964885

$`web asymmetry`
[1] 0.2413793

$`interaction strength asymmetry`
[1] 0.1607192

$`specialisation asymmetry`
[1] -0.1755229

$`extinction slope lower trophic level`
[1] 2.286152

$`extinction slope higher trophic level`
[1] 2.9211

$`degree distribution lower trophic level`
                         Estimate Std.Err Pr(>|t|)    R2   AIC
exponential              0.207527 0.02905 0.000834 0.992 -7.11
power law                0.701034 0.08856 0.000517 0.967 -8.09
trunc. power law [slope] 0.431810 0.31661 0.244321 0.987 -7.15

$`degree distribution higher trophic level`
                         Estimate Std.Err Pr(>|t|)    R2   AIC
exponential              0.221084 0.04283 0.006691 0.999 -3.21
power law                0.744383 0.12834 0.004394 0.960 -4.34
trunc. power law [slope] 0.511777 0.43347 0.322823 0.980 -2.82

$`higher trophic level niche overlap`
[1] 0.2237163

$`lower trophic level niche overlap`
[1] 0.2505869

$`mean number of shared hosts`
[1] 0.8545455

$togetherness
[1] 0.1050109

$`C-score`
[1] 0.6407096

$`V-ratio`
[1] 11.11811

$nestedness
[1] 44.28693
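Because the return value is a plain named list (a design choice discussed below), individual indices are easy to pull out; a small sketch:

nl <- networklevel(mosquin1967)
nl[["cluster coefficient"]]                      # a scalar-valued entry
nl[["degree distribution lower trophic level"]]  # a table-valued entry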

We opted for a list structure to be able to accommodate tables in the output, and because the option index allows specification of the indices the user is interested in (it defaults to all indices).

All indices are explained and/or referenced in the help pages, so a detailed description is omitted here. Among our personal favourites are the network-wide specialisation H2' (Blüthgen et al., 2006), generality and vulnerability (Tylianakis et al., 2007) and the small-world measure 'clustering coefficient' (Watts and Strogatz, 1998). Furthermore, we took the liberty to modify the index 'dependence asymmetry', because it has been shown to be biased (Blüthgen et al., 2007). The original formulation is available as a special case of 'interaction strength asymmetry' and can be called using networklevel(mosquin1967, index="ISA", ISAmethod="Bascompte").

Miscellaneous

Three list entries may warrant particular mention. Web asymmetry is simply the ratio of matrix dimensions. In a recent paper, Blüthgen et al. (2007) showed that some indices may be particularly influenced by the matrix dimensions, and hence web asymmetry may serve as a kind of correction index. Extinction slopes (for the lower and higher level) are hyperbolic fits to a simulated extinction sequence of the network, which causes secondary extinctions in the other trophic level (only for networks with strong dependence). The idea was presented by Memmott et al. (2004) and we include this rather rough measure as a simple implementation (see ?second.extinct for the specification of the simulations and ?slope.bipartite for details on the fitting of the hyperbolic curve). Finally, degree distributions (for both trophic levels) have been heralded by Jordano et al. (2003) and Montoya et al. (2006) as being best described by truncated power laws, rather than exponential or simple power law functions. We fit all three, but also notice that many networks provide only 4 to 5 levels of degrees, so that a non-linear fit to so few data points gives us little confidence in its meaning, and often the fitting does not converge, due to singular gradients.

Some of the indices calculated by networklevel may be ecologically uninformative because they are driven either by constraints in web dimensions or by the lognormal distribution of species abundances (e.g. nestedness). This is not to say that for specific questions these indices are not important, just to caution that in some cases statistical artefacts may confound an index's intended meaning. We can investigate this by constructing random webs, e.g. by employing Patefield's r2dtable algorithm (Fig. 3).

These findings give us some hope that the observed patterns are not a mere artefact of species distributions and web dimensions. There are many different ways to construct a null model, and this is only one of them (Vázquez and Aizen, 2003). We provide two further null models, shuffle.web and swap.web. The former simply shuffles the observed values within the matrix under the constraint of retaining dimensionality; the latter constrains both marginal sums (as does r2dtable) and connectance (i.e. the number of non-zero entries in the matrix). As observed connectance is much lower than connectance in random marginal-sum-constrained networks, maintaining connectance implies an ecological mechanism (such as co-evolution of pollinator and plant, body-size relationships between prey and predator, and so forth).

Although we tried to incorporate many descriptors of networks, there are certainly several missing. For example, the social literature has put much emphasis on betweenness and centrality, concepts that we find difficult to interpret in the context of bipartite (=two-mode) ecological networks. Some of these functions are implemented in the R package sna (Social Network Analysis: Butts, 2007), which can be accessed after transforming the bipartite data to one-mode graphs using bipartite's function as.one.mode. Others can be calculated using the freeware Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/: Batagelj and Mrvar, 2003). We made, as yet, no attempt to include indices that use sophisticated optimisation algorithms (e.g. the modularity index of Guimerà and Amaral (2005) or the hierarchical structure functions of Clauset et al. (2008)), mainly because of time limitations, but also because they draw heavily on computer resources. Contributions to bipartite are welcomed, in order to make further progress in the issues awaiting networks in ecology (Bascompte, 2007).


> null.t.test(mosquin1967, index = c("generality",
+   "vulnerability", "cluster coefficient", "H2",
+   "ISA", "SA"), nrep = 2, N = 20)

                                      obs   null mean    lower CI    upper CI         t            P
generality                      2.6773063  4.22916358  4.15161471  4.30671245  41.88423 3.497206e-20
vulnerability                   4.1143452  7.21792806  7.07648249  7.35937363  45.92490 6.178095e-21
cluster coefficient             0.1363636  0.24545455  0.22667908  0.26423001  12.16108 2.067759e-10
H2                              0.4964885  0.14030524  0.12975531  0.15085517 -70.66398 1.800214e-24
interaction strength asymmetry  0.1607192  0.05951281  0.05190249  0.06712313 -27.83425 7.291115e-17
specialisation asymmetry       -0.1978587 -0.16827052 -0.20874785 -0.12779319   1.52996 1.425060e-01

Figure 3: Illustration of the null.t.test function with selected network indices.
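The same kind of comparison can be sketched by hand with base R's r2dtable (Patefield's algorithm), here for a single index; this sketch assumes, as above, that networklevel accepts a plain interaction matrix:

nulls <- r2dtable(100, rowSums(mosquin1967), colSums(mosquin1967))
null.H2 <- sapply(nulls, function(w)
  unlist(networklevel(w, index = "H2")))
quantile(null.H2, c(0.025, 0.975))  # null interval to compare with the observed H2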

Bibliography

J. Bascompte. Networks in ecology. Basic and Applied Ecology, 8:485–490, 2007.

V. Batagelj and A. Mrvar. Pajek: analysis and visualization of large networks. In M. Jünger and P. Mutzel, editors, Graph Drawing Software, pages 77–103. Springer, Berlin, 2003.

N. Blüthgen, F. Menzel, and N. Blüthgen. Measuring specialization in species interaction networks. BMC Ecology, 6(9):12, 2006.

N. Blüthgen, F. Menzel, T. Hovestadt, B. Fiala, and N. Blüthgen. Specialization, constraints and conflicting interests in mutualistic networks. Current Biology, 17:1–6, 2007.

C. T. Butts. sna: Tools for Social Network Analysis, 2007. URL http://erzuli.ss.uci.edu/R.stuff. R package version 1.5.

A. Clauset, C. Moore and M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453:98–101, 2008.

R. Guimerà and L. A. N. Amaral. Functional cartography of complex metabolic networks. Nature, 433:895–900, 2005.

P. Jordano, J. Bascompte, and J. M. Olesen. Invariant properties in coevolutionary networks of plant-animal interactions. Ecology Letters, 6:69–81, 2003.

T. M. Lewinsohn, P. I. Prado, P. Jordano, J. Bascompte, and J. M. Olesen. Structure in plant-animal interaction assemblages. Oikos, 113(1):174–184, 2006.

J. Memmott, N. M. Waser, and M. V. Price. Tolerance of pollination networks to species extinctions. Proceedings of the Royal Society, 271:2605–2611, 2004.

J. M. Montoya, S. L. Pimm, and R. V. Solé. Ecological networks and their fragility. Nature, 442:259–264, 2006.

S. Pimm. Food Webs. Chicago University Press, Chicago, 1982/2002.

J. M. Tylianakis, T. Tscharntke, and O. T. Lewis. Habitat modification alters the structure of tropical host-parasitoid food webs. Nature, 445:202–205, 2007.

D. P. Vázquez and M. A. Aizen. Null model analyses of specialization in plant-pollinator interactions. Ecology, 84(9):2493–2501, 2003.

D. J. Watts and S. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.

Carsten F. Dormann & Bernd Gruber
Department of Computational Landscape Ecology
UFZ Centre for Environmental Research
Permoserstr. 15, 04318 Leipzig, Germany
Email: [email protected]

Jochen Fründ
Agroecology, University of Göttingen
Waldweg 26, 37073 Göttingen, Germany


The profileModel R Package: Profiling Objectives for Models with Linear Predictors

by Ioannis Kosmidis

Introduction

In a multi-parameter problem the profiles of the likelihood can be used for the construction of confidence intervals for parameters, as well as to assess features of the likelihood surface such as local maxima, asymptotes, etc., which can affect the performance of asymptotic procedures. The profile methods of the R language (stats and MASS packages) can be used for profiling the likelihood function for several classes of fitted objects, such as glm and polr. However, the methods are limited to cases where the profiles are almost quadratic in shape and can fail, for example, in cases where the profiles have an asymptote. Furthermore, the likelihood is often replaced by an alternative objective, either to improve the properties of the estimator, or for computational efficiency when the likelihood has a complicated form (see, for example, Firth (1993) for a maximum penalized likelihood approach to bias reduction, and Lindsay (1988) for composite likelihood methods, respectively). Alternatively, estimation might be performed by using a set of estimating equations which do not necessarily correspond to a unique objective to be optimized, as in quasi-likelihood estimation (Wedderburn, 1974; McCullagh, 1983) and in generalized estimating equations for models for clustered and longitudinal data (Liang and Zeger, 1986). In all of the above cases, the construction of confidence intervals can be done using the profiles of appropriate objective functions, in the same way as with likelihood profiles. For example, in the case of bias reduction in logistic regression via maximum penalized likelihood, Heinze and Schemper (2002) suggest using the profiles of the penalized likelihood, and when estimation is via a set of estimating equations, Lindsay and Qu (2003) suggest the use of profiles of appropriate quadratic score functions.

profileModel is an R package that provides tools to calculate, evaluate, plot and use for inference the profiles of arbitrary objectives for arbitrary fitted models with linear predictors. It allows the developers of fitting procedures to use these capabilities by simply writing a small function for the appropriate objective to be profiled. Furthermore, it provides an alternative to already implemented profiling methods, such as profile.glm and profile.polr, by extending and improving their capabilities (see the incidence of leaf-blotch on barley example in the Examples section).

In its current version (0.5-3), the profileModel package has been tested and is known to work for fitted objects resulting from lm, glm, polr, gee, geeglm, brglm, brlr, BTm and survreg.

Supported fitted objects

profileModel aims at generality of application and thus it is designed to support most of the classes of fitted objects that are constructed according to the specifications of a fitted object as given in Chambers and Hastie (1991, Chapter 2). Specifically, the supported classes of fitted objects (object) should comply with the following:

1. The parameters of the model to be fitted have to be related through a linear combination of parameters and covariates. That is, only models with linear predictors are supported.

2. The fitting procedure that resulted in object has to support offset in formula. Also, object$call has to be the call that generated object.

3. object has to be an object which supports the method coef and which has object$terms with the same meaning as, for example, in lm and glm. coef(object) has to be a vector of coefficients where each component corresponds to a column of the model matrix

> mf <- model.frame(object$terms,
+   data = eval(object$call$data))
> model.matrix(object$terms, mf,
+   contrasts = object$contrasts)

or maybe just model.matrix(object), for fitted objects that support the model.matrix method. Exceptions to this are objects returned by the BTm function of the BradleyTerry package, where some special handling of the required objects takes place. Note that either or both of data and contrasts could be NULL; this depends on whether the data argument has been supplied to the fitting function and whether object$contrasts exists.


4. Support for a summary method is optional. The summary method is only used for obtaining the estimated asymptotic standard errors associated with the coefficients in object. If these standard errors are not in coef(summary(object))[,2], they can be supplied by the user.

The profileModel objective functions

Given a fitted object of p parameters, the (p − 1)-dimensional restricted fit is an object resulting from the following procedure:

1. Restrict a parameter β_r, r ∈ {1, ..., p}, to a specific scalar c.

2. Calculate c x_r, where x_r is the model matrix column that corresponds to β_r.

3. Estimate the remaining p − 1 parameters with c x_r as an additional offset in the model formula.

The profiles of an objective should depend on the restricted fit, and a profileModel objective function should be defined as such. For example, if object was the result of a glm call and restrFit is the object corresponding to the restricted fit for β_r = c for some r ∈ {1, ..., p}, then an appropriate objective function for the profile likelihood is

profDev <- function(restrFit, dispersion)
  restrFit$deviance / dispersion

where the dispersion argument has to be defined in the relevant R functions calling profDev. Once profDev is passed to the functions in profileModel, i) the restricted fits for an appropriate grid of c-values are obtained using glm with the appropriate offsets and covariates in its formula argument, and ii) the differences profDev(restrFit) - profDev(object) are calculated for each restricted fit. Note that, in the current case, where the fitting procedure is maximum likelihood and the objective is the deviance, the above difference is the log-likelihood ratio statistic that is usually used for testing the hypothesis β_r = c, i.e.,

    $2\left\{ l(\hat\beta_1, \ldots, \hat\beta_p) - l(\hat\beta_1^c, \ldots, \hat\beta_{r-1}^c, c, \hat\beta_{r+1}^c, \ldots, \hat\beta_p^c) \right\}$,    (1)

where $\hat\beta_t^c$ (t = 1, ..., r − 1, r + 1, ..., p) is the restricted maximum likelihood estimate when β_r = c, and l(.) is the log-likelihood function.

As an illustrative example consider

> m1 <- example(birthwt, "MASS")$value
> library(profileModel)
> prof1.m1 <- profileModel(fitted = m1,
+   objective = profDev,
+   dispersion = 1)

Here, we use the default behaviour of the profileModel function; prof1.m1$profiles is a list of matrices, one for each estimated parameter in m1. The first column of each matrix contains 10 values of the corresponding parameter, with the smallest and largest values being 5 estimated standard errors to the left and to the right, respectively, of the maximum likelihood estimate of the parameter in m1. The second column contains the values of (1) that correspond to the grid of values in the first column.

The profileModel function and class

The primary function of the profileModel package is the profiling function with the same name, which provides the following two main features.

Profiling the objective over a specified grid of values for the parameters of interest. This feature can be used with generic objectives irrespective of their shape. It is most useful for the study of the shape of the profiles in a given range of parameter values. As an illustration consider the following:

> prof2.m1 <- update(prof1.m1,
+   which = c("age", "raceblack", "ftv1"),
+   grid.bounds = c(-2, 0, 0, 2, -1, 1),
+   gridsize = 50)
> plot(prof2.m1)
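The computed profiles can also be inspected directly; a small sketch (the component names follow the description given earlier):

> names(prof2.m1$profiles)          # "age" "raceblack" "ftv1"
> head(prof2.m1$profiles[["age"]])  # grid values and profiled objective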


Figure 1: Profiling the log-likelihood ratio statistic over a grid of values.

Profiling the objective until the value of the profile exceeds a given quantile value for the parameters of interest. This feature relates to convex objectives. Restricted fits are calculated on the left and on the right of the estimate until the value of the profile exceeds a given quantile value for the parameters of interest. In contrast with the default profile methods, in profileModel the size of the steps to be taken on the left and on the right of the estimate depends on the shape of the profile curve, through estimates of the slope of the tangents to the curve. Specifically, the steps are smaller when the profile curve seems to change fast. In this way any asymptotes in either direction can be detected (see the documentation of the profileModel function for detailed information on the calculation of the steps). For example, consider

> y <- c(1/3, 1/3, 1, 1)
> x1 <- c(0, 0, 1, 1) ; x2 <- c(0, 1, 0, 1)
> m2 <- glm(y ~ x1 + x2, binomial)
> prof.m2 <- profileModel(m2,
+   objective = profDev,
+   quantile = qchisq(0.95, 1),
+   gridsize = 50, dispersion = 1)
> plot(prof.m2)

Figure 2: Profiling until the profile exceeds qchisq(0.95, 1) (dashed line). The ML estimate for x1 is infinity.

The profileModel procedure results in objects of class profileModel. S3 methods specific to profileModel objects are:

• Generics: print, update
• Plotting facilities: plot, pairs
• Construction of confidence intervals: profConfint, profZoom, profSmooth

Construction of confidence intervals

The confidence-interval methods in the profileModel package relate to convex objectives. Asymptotic confidence intervals based on the profiles of the objective can be constructed using the profConfint method, which is a wrapper of two different methods for the computation of confidence intervals.

Spline smoothing via the profSmooth method

A smoothing spline is fitted on the points of the profiles in a profileModel object, and the endpoints of the confidence intervals are then computed by finding the values of the parameter where the fitted spline attains the appropriate quantile. For example,

> profConfint(prof.m2, method = "smooth")
                Lower    Upper
(Intercept) -7.670327 3.802680
x1          -0.507256      Inf
x2          -7.668919 7.668919

Since quantile = qchisq(0.95, 1) was used for prof.m2, these are 95% asymptotic profile confidence intervals.
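The idea behind profSmooth can be sketched by hand with base R tools; this illustrates the principle only, not the package's internal implementation:

> pr <- prof.m2$profiles[["x2"]]    # grid and profile values, as described above
> sp <- smooth.spline(pr[, 1], pr[, 2])
> f <- function(b) predict(sp, b)$y - qchisq(0.95, 1)
> uniroot(f, interval = c(min(pr[, 1]), 0))$root  # left endpoint for x2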

The advantage of spline smoothing is that it requires minimal computation. However, by construction the method depends on the size of the profiling grid, as well as on the shape of the profiles. The endpoints of the resultant confidence intervals are most accurate when the profiles are quadratic in shape.

Binary search via the profZoom method

A binary search is performed that "zooms in" on the appropriate regions until the absolute difference of the quantile from the profile value at each end point is smaller than a specified tolerance. Construction of confidence intervals in this way requires the calculation of the value of the profiles at the candidates returned by the binary search. Thus, in contrast to smoothing, this method, whereas accurate, is computationally intensive. On the same example as for spline smoothing we have

> profConfint(prof.m2, method = "zoom",
+   endpoint.tolerance = 1e-8)
                 Lower    Upper
(Intercept) -7.6703317 3.802770
x1          -0.8518235      Inf
x2          -7.6689352 7.668935

The smoothing method was accurate up to at least 3 decimal places for the endpoints of the confidence intervals for (Intercept) and x2. However, it failed to accurately estimate the left endpoint of the confidence interval for x1; the profile for x1 in Figure 2 has an asymptote on the right.

Examples

Profile likelihood for survreg fitted objects

To illustrate the generality of application of the profileModel package we consider the example in the help file of the survreg function in the survival package.

> library(survival)
> m3 <- survreg(Surv(futime, fustat) ~ ecog.ps + rx,
+   ovarian, dist = "weibull", scale = 1)

For the parameters (Intercept), ecog.ps and rx we calculate the values of the log-likelihood ratio statistic in (1), plot the profiles and obtain asymptotic confidence intervals based on the profiles. According to the rules for the construction of objectives in previous sections and to the available quantities in a survreg object, an appropriate profileModel objective could be defined as

> profLogLik <- function(restrFit) {
+   -2 * restrFit$loglik[2]
+ }

In the case of a survreg object, the estimated asymptotic standard errors for the estimates are given by summary(m3)$table[,2]. We have

> prof.m3 <- profileModel(m3,
+   quantile = qchisq(0.95, 1),
+   objective = profLogLik,
+   stdErrors = summary(m3)$table[,2],
+   gridsize = 20)

Using binary search, the profile likelihood 95% asymptotic confidence intervals are

> profConfint(prof.m3,
+   method = "zoom",
+   endpoint.tolerance = 1e-8)
                 Lower     Upper
(Intercept)  4.5039809 9.7478481
ecog.ps     -1.6530569 0.7115453
rx          -0.5631772 1.8014250

Note that the 95% Wald asymptotic confidence intervals

> confint(m3)
                 2.5 %    97.5 %
(Intercept)  4.3710056 9.5526696
ecog.ps     -1.5836210 0.7173517
rx          -0.5689836 1.7319891

are similar because of the almost quadratic shape of the profiles, or equivalently because of the apparent linearity of the signed square roots of the profiles, as can be seen in Figure 3.

> plot(prof.m3, signed = TRUE)

Figure 3: Signed square roots of the profiles for m3.


Incidence of leaf-blotch on barley

As a demonstration of the improvement over the default profile methods, as well as of the generality of the profileModel package in terms of profiled objectives, we consider the leaf-blotch on barley study found in McCullagh and Nelder (1989, Section 9.4.2).

A linear logistic regression model with main effects is chosen to describe the variety and site effects, and estimation is performed using quasi-likelihood with variance function $\mu^2(1-\mu)^2$ (see Wedderburn, 1974). The family object to be used along with glm is implemented through the wedderburn function of the gnm package.

> library(gnm)
> data(barley)
> m4 <- glm(y ~ site + variety,
+   family = wedderburn, data = barley)

In this case, an appropriate profileModel objective is the quasi-deviance (i.e. we calculate the values of (1) with l(.) replaced by the corresponding quasi-likelihood function).

> profPars <- paste("variety", c(2:9, "X"),
+   sep = "")
> prof1.m4 <- profileModel(m4,
+   objective = profDev,
+   quantile = qchisq(0.95, 1),
+   which = profPars,
+   gridsize = 20,
+   dispersion = summary(m4)$dispersion)
> cis1.m4 <- profConfint(prof1.m4)
> cis1.m4
              Lower     Upper
variety2 -1.4249881 0.5047550
variety3 -0.8653948 1.0327784
variety4  0.0086728 1.8875857
variety5  0.4234367 2.2604532
variety6  0.3154417 2.3844366
variety7  1.4351908 3.2068870
variety8  2.3436347 4.1319455
variety9  2.1698223 4.0626864
varietyX  2.9534824 4.7614612
> plot(prof1.m4, cis = cis1.m4)

Figure 4: Profiles for m4.

In contrast to profileModel, the default profile method of the MASS package for glm fitted objects fails to calculate the correct profiles in this example; the profile confidence intervals using the default profile method are

> confint(m4)[profPars, ]
Waiting for profiling to be done...
               2.5 %     97.5 %
variety2 -1.42315422  0.5048776
variety3 -0.86535806  1.0328576
variety4 -0.03998811  0.8842478
variety5 -2.23635211 -2.2011092
variety6  0.31534770  2.3847282
variety7  1.41342757  1.7276810
variety8  2.31977238  2.6500143
variety9  2.15012980  2.5229588
varietyX  2.93050444  3.2755633
There were 36 warnings (use warnings()
to see them)

which clearly disagree with the confidence intervals obtained above (the reader is also referred to the example in the documentation of the profConfint function). The produced warnings give a clue to the misbehaviour: some of the restricted fits do not converge to the maximum quasi-likelihood estimates, and so wrong values for the profiled objective are returned. This is most probably caused by bad starting values, which, in turn, depend on the size of the steps taken on the left and on the right of each estimate in the default profile method. In profileModel, the dependence of the size of the steps on the quasi-likelihood profiles avoids such issues.

In addition to the plot method, the profileModel package provides a pairs method for the construction of pairwise profile trace plots (see Venables and Ripley, 2002, for a detailed description). This method is a minor modification of the original pairs method for profile objects in the MASS package. The modification was made under the GPL licence version 2 or greater, and with the permission of the original authors, in order to comply with objects of class profileModel.

> pairs(prof1.m4)

Figure 5: Pairwise profile trace plots for m4.

The direction of the pairwise profile trace plots in Figure 5 agrees with the fact that all the correlations amongst the variety effects are exactly 1/2.

An alternative to the quasi-likelihood function for the construction of asymptotic confidence intervals in quasi-likelihood estimation is the quadratic form

    $Q^2(\beta) = s(\beta)^T i^{-1}(\beta) s(\beta)$,    (2)

where $\beta = (\beta_1, \ldots, \beta_p)$ is the parameter vector, $s(\beta)$ is the vector of the quasi-score functions and $i(\beta) = \mathrm{cov}(s(\beta))$. A review of test statistics of the above form and their properties can be found in Lindsay and Qu (2003). The quadratic form (2) could be used when there is not a unique quasi-likelihood function corresponding to the set of quasi-score functions. The profile test statistic for the hypothesis β_r = c for some r ∈ {1, ..., p} has the form

    $Q^2(\hat\beta_1^c, \ldots, \hat\beta_{r-1}^c, c, \hat\beta_{r+1}^c, \ldots, \hat\beta_p^c) - Q^2(\hat\beta_1, \ldots, \hat\beta_p)$,    (3)

where $\hat\beta_t^c$ (t = 1, ..., r − 1, r + 1, ..., p) are the restricted quasi-likelihood estimates when β_r = c. The appropriate profileModel objective to calculate (3) is implemented through the RaoScoreStatistic function, on which more details can be found in the documentation of profileModel. For m4 we have

> prof2.m4 <- profileModel(m4,
+   objective = RaoScoreStatistic,
+   quantile = qchisq(0.95, 1),
+   which = profPars,
+   gridsize = 20, X = model.matrix(m4),
+   dispersion = summary(m4)$dispersion)
> profConfint(prof2.m4)
              Lower     Upper
variety2 -1.4781768 0.5971332
variety3 -0.9073192 1.0991394
variety4 -0.0401468 1.9064558
variety5  0.3890679 2.2472880
variety6  0.2144497 2.6012640
variety7  1.4212454 3.1476794
variety8  2.3107389 4.0685900
variety9  2.0921331 4.0503524
varietyX  2.9017855 4.6967597

Summary

In this article we introduce and demonstrate the main capabilities of the profileModel package. The described facilities, along with other complementary functions included in the package, were developed with the following aims:

• Generality: the package provides a unified approach to profiling objectives. It supports most of the classes of fitted objects with linear predictors that are constructed according to the specifications of a fitted object as given in Chambers and Hastie (1991, Chapter 2).

• Embedding: the developers of current and new fitting procedures can have direct access to profiling capabilities. The only requirement is authoring a simple function that calculates the value of the appropriate objective to be profiled. Furthermore, each function of the profileModel package is designed to perform a very specific task. For example, prelim.profiling relates only to convex objectives and results in the grid of parameter values which should be used in order for the profile to cover a specific value (usually a quantile). Such a modular design enables developers to use only the relevant functions for their specific application (see, for example, the documentation of the profile.brglm and confint.brglm methods of the brglm package).

• Computational stability: all the facilities have been developed with computational stability in mind, in order to provide an alternative that improves and extends the capabilities of already available profile methods.

The result, we hope, is a flexible and extensible package that will be useful both to other package authors and to end-users of R's statistical modelling functions.


Bibliography

J. M. Chambers and T. Hastie. Statistical Models in S. Chapman & Hall, 1991.

D. Firth. Bias reduction of maximum likelihood estimates. Biometrika, 80(1):27–38, 1993.

G. Heinze and M. Schemper. A solution to the problem of separation in logistic regression. Statistics in Medicine, 21:2409–2419, 2002.

K.-Y. Liang and S. L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22, 1986.

B. G. Lindsay. Composite likelihood methods. In N. U. Prabhu, editor, Statistical Inference from Stochastic Processes, pages 221–239. American Mathematical Society, 1988.

B. G. Lindsay and A. Qu. Inference functions and quadratic score tests. Statistical Science, 18(3):394–410, 2003.

P. McCullagh. Quasi-likelihood functions. The Annals of Statistics, 11:59–67, 1983.

P. McCullagh and J. Nelder. Generalized Linear Models. Chapman and Hall, London, 2nd edition, 1989.

W. N. Venables and B. D. Ripley. Statistics complements to Modern Applied Statistics with S, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4/VR4stat.pdf.

R. W. M. Wedderburn. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61:439–447, 1974.

Ioannis Kosmidis
Department of Statistics, University of Warwick
Coventry, CV4 7AL, United Kingdom
[email protected]


An Introduction to Text Mining in R

by Ingo Feinerer

Introduction

Text mining has attracted great interest in both academic research and business intelligence applications within the last decade. There is an enormous amount of textual data available in machine-readable format which can be easily accessed via the Internet or databases. It ranges from scientific articles, abstracts and books to memos, letters, online forums, mailing lists, blogs, and other communication media delivering useful information.

Text mining is a highly interdisciplinary research field utilizing techniques from computer science, linguistics, and statistics. For the latter, R is one of the leading computing environments, offering a broad range of statistical methods. However, until recently, R has lacked an explicit framework for text mining purposes. This has changed with the tm package (Feinerer, 2008; Feinerer et al., 2008), which provides a text mining infrastructure for R. It allows R users to work efficiently with texts and corresponding meta data, and to transform the texts into structured representations to which existing R methods can be applied, e.g., for clustering or classification.

In this article we give a short overview of the tm package and then present two exemplary text mining applications: one in the field of stylometry and authorship attribution, the other an analysis of a mailing list. Our aim is to show that tm provides the necessary abstraction over the actual text management, so that readers can apply a wide variety of text mining techniques to their own texts.

The tm package

The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the handling of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands, and implements advanced meta data management for collections of text documents, easing the use of large document sets enriched with meta data. The package ships with native support for handling the Reuters 21578 data set, Gmane RSS feeds, e-mails, and several classic file formats (e.g., plain text, CSV text, or PDFs). The data structures and algorithms can be extended to fit custom demands, since the package is designed in a modular way that enables easy integration of new file formats, readers, transformations and filter operations. tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, or conversion between file formats. Further, a generic filter architecture is available in order to filter documents for certain criteria or perform full text search. The package supports the export of document collections to term-document matrices, and string kernels can be easily constructed from text documents.

Wizard of Oz stylometry

The Wizard of Oz book series has been among the most popular children's novels of the last century. The first book was written by Lyman Frank Baum and published in 1900, and a series of Oz books followed until Baum died in 1919. After his death Ruth Plumly Thompson continued the series, but there has been longstanding doubt as to which was the first Oz book written by Thompson. In particular the authorship of the 15th book of Oz—The Royal Book of Oz—has long been disputed amongst literature experts. Today it is commonly attributed to Thompson as her first Oz book, supported by several statistical stylometric analyses within the last decade.

Based on techniques shown in the work of Binongo (2003) we investigate a subset of the Oz book series for authorship attribution. Our data set consists of five books of Oz, three attributed to Lyman Frank Baum and two attributed to Ruth Plumly Thompson:

The Wonderful Wizard of Oz is the first Oz book, written by Baum and published in 1900.

The Marvelous Land of Oz is the second Oz book by Baum, published in 1904.

Ozma of Oz was published in 1907. It is authored by Baum and forms the third book in the Oz series.

The Royal Book of Oz is nowadays attributed to Thompson, but its authorship has been disputed for decades. It was published in 1921 and is considered the 15th book in the series of Oz novels.

Ozoplaning with the Wizard of Oz was written by Thompson, is the 33rd book of Oz, and was published in 1939.

Most books of Oz, including the five books we use as our corpus, can be freely downloaded at the Gutenberg Project website (http://www.gutenberg.org/) or at the Wonderful Wizard of Oz website (http://thewizardofoz.info/).

The main data structure in tm is a corpus consisting of text documents and additional meta data. So-called reader functions can be used to read in texts from various sources, like text files on a hard disk.


E.g., the following code creates the corpus oz by reading in the text files from the directory 'OzBooks/', utilizing a standard reader for plain texts.

oz <- Corpus(DirSource("OzBooks/"))

The directory 'OzBooks/' contains five text files in plain text format holding the five Oz books described above. So the resulting corpus has five elements representing them:

> oz
A text document collection with 5 text documents

Since our input is plain text without meta data annotations we might want to add the meta information manually. Technically we use the meta() function to operate on the meta data structures of the individual text documents. In our case the meta data is stored locally in explicit slots of the text documents (type = "local"):

meta(oz, tag = "Author", type = "local") <-
    c(rep("Lyman Frank Baum", 3),
      rep("Ruth Plumly Thompson", 2))
meta(oz, "Heading", "local") <-
    c("The Wonderful Wizard of Oz",
      "The Marvelous Land of Oz",
      "Ozma of Oz",
      "The Royal Book of Oz",
      "Ozoplaning with the Wizard of Oz")

E.g., for the first book in our corpus this yields

> meta(oz[[1]])
Available meta data pairs are:
  Author       : Lyman Frank Baum
  Cached       : TRUE
  DateTimeStamp: 2008-04-21 16:23:52
  ID           : 1
  Heading      : The Wonderful Wizard of Oz
  Language     : en_US
  URI          : file OzBooks//01_TheWonderfulWizardOfOz.txt UTF-8

The first step in our analysis will be the extraction of the most frequent terms, similarly to a procedure by Binongo (2003). At first we create a so-called term-document matrix, that is, a representation of the corpus in the form of a matrix with documents as rows and terms as columns. The matrix elements are frequencies counting how often a term occurs in a document.

ozMatBaum <- TermDocMatrix(oz[1:3])
ozMatRoyal <- TermDocMatrix(oz[4])
ozMatThompson <- TermDocMatrix(oz[5])
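As an aside, the representation that TermDocMatrix() produces can be mimicked in a few lines of base R. The following is an illustrative sketch only (the two-element character vector docs is a made-up toy example, not data from this article):

docs <- c(doc1 = "the wizard of oz",
          doc2 = "the royal book of oz")
tokens <- strsplit(tolower(docs), "[[:space:]]+")  # crude whitespace tokenizer
terms <- sort(unique(unlist(tokens)))
tdm <- t(sapply(tokens, function(doc)
    table(factor(doc, levels = terms))))  # frequency of each term per document
tdm  # rows = documents, columns = terms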

After creating three term-document matrices for the three cases we want to distinguish—first, books by Baum; second, the Royal Book of Oz; and third, a book by Thompson—we can use the function findFreqTerms(mat, low, high) to compute the terms in the matrix mat occurring at least low times and at most high times (high defaults to Inf).

baum <- findFreqTerms(ozMatBaum, 70)
royal <- findFreqTerms(ozMatRoyal, 50)
thomp <- findFreqTerms(ozMatThompson, 40)

The lower bounds have been chosen so that we obtain about 100 terms from each term-document matrix. A simple test of intersection within the 100 most common words shows that the Royal Book of Oz has more matches with the Thompson book than with the Baum books, a first indication of Thompson's authorship.

> length(intersect(thomp, royal))
[1] 73
> length(intersect(baum, royal))
[1] 65

The next step applies a Principal Component Analysis (PCA), where we use the kernlab (Karatzoglou et al., 2004) package for computation and visualization. Instead of using the whole books we create equally sized chunks to consider stylometric fluctuations within single books. In detail, the function makeChunks(corpus, chunksize) takes a corpus and the chunk size as input and returns a new corpus containing the chunks. Then we can compute a term-document matrix for a corpus holding chunks of 500 lines length. Note that we use a binary weighting of the matrix elements, i.e., multiple term occurrences are only counted once.

ozMat <- TermDocMatrix(makeChunks(oz, 500),
                       list(weighting = weightBin))

This matrix is the input for the kernel PCA, where we use a standard Gaussian kernel and request two feature dimensions for better visualization.

k <- kpca(as.matrix(ozMat), features = 2)
plot(rotated(k),
     col = c(rep("black", 10), rep("red", 14),
             rep("blue", 10), rep("yellow", 6),
             rep("green", 4)),
     pty = "s",
     xlab = "1st Principal Component",
     ylab = "2nd Principal Component")

Figure 1 shows the results. Circles in black are chunks from the first book of Oz by Baum, red circles denote chunks from the second book by Baum, and blue circles correspond to chunks of the third book by Baum. Yellow circles depict chunks from the long-disputed 15th book (by Thompson), whereas green circles correspond to the 33rd book of Oz by Thompson. The results show that there is a visual correspondence between the 15th and the 33rd book (yellow and green), meaning that both books tend to be authored by Thompson. However, the results also reveal parts matching books by Baum.

[Figure 1: Principal component plot for five Oz books using 500 line chunks.]

Similarly, we can redo this plot using chunks of 100 lines. This time we use a weighting scheme called term frequency inverse document frequency, a weighting very common in text mining and information retrieval.

ozMat <- TermDocMatrix(makeChunks(oz, 100),
                       list(weighting = weightTfIdf))

Figure 2 shows the corresponding plot using the same colors as before. Again we see that there is a correspondence between the 15th book and the 33rd book by Thompson (yellow and green), but also parts matching books by Baum.

[Figure 2: Principal component plot for five Oz books using 100 line chunks.]
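The tf-idf weighting just used down-weights terms that appear in many documents. As a rough base-R sketch of the idea (a hypothetical illustration, not tm's weightTfIdf, whose normalization may differ), applied to a documents-by-terms count matrix such as the toy tdm built earlier:

tfidf <- function(tdm) {
    df <- colSums(tdm > 0)        # document frequency of each term
    idf <- log2(nrow(tdm) / df)   # inverse document frequency
    sweep(tdm, 2, idf, "*")       # reweight each term column
}
round(tfidf(tdm), 2)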

R-help mailing list analysis

Our second example deals with aspects of analyzing a mailing list. As data set we use the R-help mailing list, since its archives are publicly available and can be accessed via an RSS feed or downloaded in mailbox format. For the RSS feed we can use the Gmane (Ingebrigtsen, 2008) service, so, e.g., with

rss <- Corpus(GmaneSource("http://rss.gmane.org/gmane.comp.lang.r.general"))

we get the latest news from the mailing list. tm ships with a variety of sources for different purposes, including a source for RSS feeds in the format delivered by Gmane. The source automatically uses a reader function for newsgroups as it detects the RSS feed structure. This means that the meta data in the e-mail headers is extracted, e.g., we obtain

> meta(rss[[1]])
Available meta data pairs are:
  Author       :
  Cached       : TRUE
  DateTimeStamp: 2008-04-22 08:16:54
  ID           : http://permalink.gmane.org/gmane.comp.lang.r.general/112060
  Heading      : R 2.7.0 is released
  Language     : en_US
  Origin       : Gmane Mailing List Archive
  URI          : file http://rss.gmane.org/gmane.comp.lang.r.general UTF-8

for a recent mail announcing the availability of R 2.7.0.

If we want to work with a bigger collection of mails we use the mailing list archives in mailbox format. For demonstration we downloaded the file '2008-January.txt.gz' from https://stat.ethz.ch/pipermail/r-help/. In the next step we convert the single file containing all mails to a collection of files, each containing a single mail. This is especially useful when adding new files or when lazy loading of the mail corpora is intended, i.e., when the loading of the mail content into memory is deferred until it is first accessed. For the conversion we use the function convertMboxEml(), which extracts the R-help January 2008 mailing list postings to the directory 'Mails/2008-January/':

convertMboxEml(
    gzfile("Mails/2008-January.txt.gz"),
    "Mails/2008-January/")

Then we can create the corpus from the created directory. This time we explicitly define the reader to be used, as the directory source cannot automatically infer the internal structure of its files:

rhelp <- Corpus(DirSource("Mails/2008-January/"),
                list(reader = readNewsgroup))

The newly created corpus rhelp representing the January archive contains almost 2500 documents:


> rhelp
A text document collection with 2486 documents

Then we can perform a set of transformations, like converting the e-mails to plain text, stripping extra whitespace, converting the mails to lower case, or removing e-mail signatures (i.e., text after a "-- " mark: two hyphens followed by a space).

rhelp <- tmMap(rhelp, asPlain)
rhelp <- tmMap(rhelp, stripWhitespace)
rhelp <- tmMap(rhelp, tmTolower)
rhelp <- tmMap(rhelp, removeSignature)

This simple preprocessing already enables us to work with the e-mails in a comfortable way, concentrating on the main content of the mails, and circumvents the manual handling and parsing of the internals of newsgroup postings.

Since we have the meta data available we can find out who wrote the most postings to the R-help mailing list during January 2008. We extract the author information from all documents in the corpus and normalize multiple entries (i.e., several lines for the same mail) to a single line:

authors <- lapply(rhelp, Author)
authors <- sapply(authors, paste, collapse = " ")

Then we can easily find the most active writers:

> sort(table(authors), decreasing = TRUE)[1:3]
Gabor Grothendieck  Prof Brian Ripley  Duncan Murdoch
               100                 97              63

Finally we perform a full text search on the corpus. We want to find out the percentage of mails dealing with problems. In detail, we search for those documents in the corpus explicitly containing the term problem. The function tmIndex() filters out those documents and returns their index where the full text search matches.

p <- tmIndex(rhelp, FUN = searchFullText,
             "problem", doclevel = TRUE)

The return value p is a logical vector of the same length as the corpus, indicating for each document whether the search matched or not. So we obtain

> sum(p) / length(rhelp)
[1] 0.2373290

as the percentage of explicit problem mails in relation to the whole corpus.
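The same count-the-matches logic is easy to state in base R on a plain character vector. A small sketch with made-up strings (hypothetical, not tm code), mirroring what tmIndex() with searchFullText does at the corpus level:

mails <- c("I have a problem with read.table",
           "R 2.7.0 is released",
           "problem installing a package")   # hypothetical mail bodies
p <- grepl("problem", mails, fixed = TRUE)   # logical index of matching texts
sum(p) / length(mails)                       # proportion of 'problem' mails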
Outlook

The tm package provides the basic infrastructure for text mining applications in R. However, there are open challenges for future research. First, larger data sets easily consist of several thousand documents, resulting in large term-document matrices. This causes a severe memory problem as soon as dense data structures are computed from sparse term-document matrices. However, in many cases we can significantly reduce the size of a term-document matrix by either removing stopwords, i.e., words with low information entropy, or by using a controlled vocabulary. Both techniques are supported by tm via the stopwords and dictionary arguments of the TermDocMatrix() constructor. Second, operations on large corpora are time consuming and should be avoided as far as possible. A first solution resulted in so-called lazy transformations, which materialize operations only when documents are later accessed, but further improvement is necessary. Finally, we have to work on better integration with other packages for natural language processing, e.g., with packages for tag-based annotation.

Acknowledgments

I would like to thank Kurt Hornik, David Meyer, and Vince Carey for their constructive comments and feedback.

Bibliography

J. N. G. Binongo. Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2):9–17, 2003.

I. Feinerer. tm: Text Mining Package, 2008. URL http://CRAN.R-project.org/package=tm. R package version 0.3-1.

I. Feinerer, K. Hornik, and D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1–54, March 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05.

L. M. Ingebrigtsen. Gmane: A mailing list archive, 2008. URL http://gmane.org/.

A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004. URL http://www.jstatsoft.org/v11/i09/.

Ingo Feinerer
Wirtschaftsuniversität Wien, Austria
[email protected]

animation: A Package for Statistical Animations

by Yihui Xie and Xiaoyue Cheng

The mastery of statistical ideas has always been a challenge for many who study statistics. The power of modern statistical software, taking advantage of advances in theory and in computer systems, both adds to those challenges and offers new opportunities for creating tools that will assist learners. The animation package (Xie, 2008) uses graphical and other animations to communicate the results of statistical simulations, giving meaning to abstract statistical theory. From our own experience, and that of our classmates, we are convinced that such experimentation can assist the learning of statistics. Often, practically minded data analysts try to bypass much of the theory, moving quickly into practical data analysis. For such individuals, use of animation to help understand simulations may make their use of statistical methods less of a "black box".

Introduction

Animation is the rapid display of a sequence of 2D or 3D images to create an illusion of movement; persistence of vision creates the illusion of motion. Figure 1 shows a succession of frames that might be used for a simple animation – here an animation that does not serve any obvious statistical purpose.

[Figure 1: This sequence of frames, displayed in quick succession, creates an illusion of movement.]

An animation consists of multiple image frames, which can be designed to correspond to the successive steps of an algorithm or of a data analysis. Thus far, we have sought to:

• build a gallery of statistical animations, including some that have a large element of novelty;

• provide convenient functions to generate animations in a variety of formats including HTML/JS (JavaScript), GIF/MPEG and SWF (Flash);

• make the package extensible, allowing users to create their own animations.

Design and features

It is convenient to create animations using the R environment. The code below illustrates how the functions in the animation package produce animations. In this case, the output is similar to the animation on the web page at http://animation.yihui.name/animation:start#method, as indicated in the comment on the first line of the following code:

# animation:start#method
clr = rainbow(360)
for (i in 1:360) {
    plot(1, ann = F, type = "n", axes = F)
    text(1, 1, "Animation", srt = i,
         col = clr[i], cex = 7 * i/360)
    Sys.sleep(0.01)
}

High-level and low-level plotting commands, used along with various graphical devices described in Murrell (2005), make it straightforward to create and display animation frames. Within an R session, any available graphics device (for 'Windows', 'Linux', 'MacOS X', etc.) can be used for display, and any R graphics system can be used (graphics, grid, lattice, ggplot, etc.).

Image formats in which animations can be saved (and shared) include, depending on the computer platform, png(), jpeg() and pdf(). For display separately from an R session, possible formats include (1) HTML and JS, (2) GIF/MPEG, or (3) Flash. All that is needed is (1) a browser with JavaScript enabled, (2) an image viewer (for GIF) or a media player (for MPEG), or (3) a Flash viewer for the relevant animation format.

The basic schema for all animation functions in the package is:

ani.fun <- function(args.for.stat.method,
                    args.for.graphics, ...) {
    {stat.calculation.for.preparation.here}
    i = 1
    while (i <= ani.options("nmax") &
           other.conditions.for.stat.method) {
        {stat.calculation.for.animation}
        {plot.results.in.ith.step}
        # pause for a while in this step
        Sys.sleep(ani.options("interval"))
        i = i + 1
    }
    # (i - 1) frames produced in the loop
    ani.options("nmax") = i - 1
    {return.something}
}
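To make the schema concrete, here is a small function of our own devising (lln.ani() is a hypothetical example, not a function in the package) that animates a running mean converging to its expectation; it fills in the schema's slots, querying nmax and interval through ani.options() in the same way:

lln.ani <- function(mean = 0, sd = 1, ...) {
    x <- rnorm(ani.options("nmax"), mean, sd)   # preparation step
    i <- 1
    while (i <= ani.options("nmax")) {
        # plot the running mean of the first i observations
        plot(cumsum(x[1:i])/(1:i), type = "l",
             xlim = c(1, length(x)), ylim = mean + c(-1, 1) * sd,
             xlab = "n", ylab = "running mean", ...)
        abline(h = mean, lty = 2)            # the limiting value
        Sys.sleep(ani.options("interval"))   # pause in this step
        i <- i + 1
    }
    invisible(x)
}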


As this schema illustrates, an animation consists of successive plots with pauses inserted as appropriate. The function ani.options(), modeled on the function options(), is available to set or query animation parameters. For example, nmax sets the maximum number of steps in a loop, while interval sets the duration of the pause in each step. The following demonstrates the use of ani.options() in creating an animation of Brownian Motion. (As with all code shown in this paper, a comment on the first line of code has the name of the web page, in the root directory http://animation.yihui.name, where the animation can be found.)

# prob:brownian_motion
library(animation)
# 100 frames of brownian motion
ani.options(nmax = 100)
brownian.motion(pch = 21, cex = 5,
                col = "red", bg = "yellow")
# faster: interval = 0.02
ani.options(interval = 0.02)
brownian.motion(pch = 21, cex = 5,
                col = "red", bg = "yellow")

Additional software is not required for HTML/JS, as the pages are generated by writing pure text (HTML/JS) files. The functions ani.start() and ani.stop() do all that is needed. Creation of animations in GIF/MPEG or SWF formats requires third-party utilities such as ImageMagick¹ or SWF Tools². We can use saveMovie() and saveSWF() to create GIF/MPEG and Flash animations, respectively.

The four examples that now follow demonstrate the four sorts of animations that are supported.

############################################
# (1) Animations inside R windows graphics
# devices: Bootstrapping
oopt = ani.options(interval = 0.3, nmax = 50)
boot.iid()
ani.options(oopt)

############################################
# (2) Animations in HTML pages: create an
# animation page for the Brownian Motion in
# the tempdir() and auto-browse it
oopt = ani.options(interval = 0.05, nmax = 100,
    title = "Demonstration of Brownian Motion",
    description = "Random walk on the 2D plane:
    for each point (x, y), x = x + rnorm(1)
    and y = y + rnorm(1).", outdir = tempdir())
ani.start()
opar = par(mar = c(3, 3, 2, 0.5),
    mgp = c(2, .5, 0), tcl = -0.3,
    cex.axis = 0.8, cex.lab = 0.8, cex.main = 1)
brownian.motion(pch = 21, cex = 5, col = "red",
                bg = "yellow")
par(opar)
ani.stop()
ani.options(oopt)

############################################
# (3) GIF animations (require ImageMagick!)
oopt = ani.options(interval = 0, nmax = 100)
saveMovie({brownian.motion(pch = 21, cex = 5,
               col = "red", bg = "yellow")},
          interval = 0.05, outdir = getwd(),
          width = 600, height = 600)
ani.options(oopt)

############################################
# (4) Flash animations (require SWF Tools!)
oopt = ani.options(nmax = 100, interval = 0)
saveSWF(buffon.needle(type = "S"),
        para = list(mar = c(3, 2.5, 1, 0.2),
                    pch = 20, mgp = c(1.5, 0.5, 0)),
        dev = "pdf", swfname = "buffon.swf",
        outdir = getwd(), interval = 0.1)
ani.options(oopt)

Additionally, users can write their own custom functions that generate animation frames, then use ani.start(), ani.stop(), saveMovie() and saveSWF() as required to create animations that can be viewed outside R.³ The following commands define a function jitter.ani() that demonstrates the jitter effect when plotting discrete variables. This helps make clear the randomness of jittering.

# animation:example_jitter_effect
# function to show the jitter effect
jitter.ani <- function(amount = 0.2, ...) {
    x = sample(3, 300, TRUE, c(.1, .6, .3))
    y = sample(3, 300, TRUE, c(.7, .2, .1))
    i = 1
    while (i <= ani.options("nmax")) {
        plot(jitter(x, amount = amount),
             jitter(y, amount = amount),
             xlim = c(0.5, 3.5), ylim = c(0.5, 3.5),
             xlab = "jittered x", ylab = "jittered y",
             panel.first = grid(), ...)
        points(rep(1:3, 3), rep(1:3, each = 3),
               cex = 3, pch = 19)
        Sys.sleep(ani.options("interval"))
        i = i + 1
    }
}

¹ Use the command convert to convert multiple image frames into single GIF/MPEG files.
² The utilities png2swf, jpeg2swf and pdf2swf can convert PNG, JPEG and PDF files into Flash animations.
³ Very little change to the code is needed in order to change to a different output format.


# HTML/JS
ani.options(nmax = 30, interval = 0.5,
            title = "Jitter Effect")
ani.start()
par(mar = c(5, 4, 1, 0.5))
jitter.ani()
ani.stop()

# GIF
ani.options(interval = 0)
saveMovie(jitter.ani(), interval = 0.5,
          outdir = getwd())

# Flash
saveSWF(jitter.ani(), interval = 0.5,
        dev = "png", outdir = getwd())

Animation examples

Currently, the animation package has about 20 simulation options. The website http://animation.yihui.name has several HTML pages of animations. These illustrate many different statistical methods and concepts, including the K-Means cluster algorithm (mvstat:k-means_cluster_algorithm), the Law of Large Numbers (prob:law_of_large_numbers), the Central Limit Theorem (prob:central_limit_theorem), and several simulations of famous experiments including Buffon's Needle (prob:buffon_s_needle). Two examples will now be discussed in more detail – coin flips and the k-nearest neighbor algorithm (kNN).

Example 1 (Coin Flips). Most elementary courses in probability theory use coin flips as a basis for some of the initial discussion. While teachers can use coins or dice for simulations, these quickly become tedious, and it is convenient to use computers for extensive simulations. The function flip.coin() is designed for repeated coin flips (or die tosses) and for displaying the result. Code for a simple simulation is:

# prob:flipping_coins
flip.coin(faces = c("Head", "Stand", "Tail"),
          type = "n", prob = c(0.45, 0.1, 0.45),
          col = c(1, 2, 4))

Suppose the possible results are 'Head', 'Tail', or just 'Stand' on the table (which is rare but not impossible), and the corresponding probabilities are 0.45, 0.1, and 0.45. Figure 2 is the final result of a simulation. The frequencies (0.42, 0.14, and 0.44) are as close as might be expected to the true probabilities.

[Figure 2: Result of flipping a coin 50 times. The right part of the graph shows the process of flipping, while the left part displays the result.]

Example 2 (k-Nearest Neighbor Algorithm). The k-nearest neighbor algorithm is among the simplest of all machine learning algorithms. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors. Basically the kNN algorithm has four steps:

1. locate the test point (for which we want to predict the class label);

2. compute the distances between the test point and all points in the training set;

3. find the k shortest distances and the corresponding training set points;

4. classify by majority vote.

The function knn.ani() gives an animated demonstration (i.e., the four steps above) of classification using the kNN algorithm. The following command uses the default arguments⁴ to show the animation:

# dmml:k-nearest_neighbour_algorithm
knn.ani()

Figure 3 is a two-dimensional demonstration of the kNN algorithm. Points with unknown classes are marked as '?'. The algorithm is applied to each test point in turn, assigning the class ◦ or △. A bare-bones version of the four steps in plain R is sketched below.

Our initial motivation for writing this package came in large part from the experience of students in a machine learning class. We noticed that when the lecturer described the kNN algorithm, mentioning also k-fold cross-validation, many students confused the two 'k's. We therefore designed animations that were intended to help students understand the basic ideas behind various machine learning algorithms. Judging from the response of the audience, our efforts proved effective.

[Figure 3: Demonstration of the k-nearest neighbor algorithm. The training and test sets are denoted by different colors, and the different classes by different symbols. For each test point, there will be four animation frames. In this plot there are 8 circles and 2 triangles; hence the class '◦' is assigned.]

There are many other demonstrations in this package. Figure 4 is the third frame of a demonstration of k-fold cross-validation (cv.ani()). In that animation, the meaning of k is quite clear; no-one would confuse it with the k in kNN even when these two topics are discussed together. (Web page: dmml:k-fold_cross-validation)

[Figure 4: An animation frame from the demonstration of k-fold cross-validation. First, points are partitioned into 10 subsets. Each subset will be used in turn as the test set. In the animation, the subsets are taken in order from left to right.]

Other examples on the website include the Newton-Raphson method (see Figure 5 and web page compstat:newton_s_method) and the Gradient Descent algorithm (see Figure 6 and web page compstat:gradient_descent_algorithm).

[Figure 5: The fifth animation frame from the demonstration of the Newton-Raphson method for root-finding of f(x) = atan(x); the tangent lines "guide" us to the final solution.]

[Figure 6: The final frame of a demonstration of the Gradient Descent algorithm on z = x² + 2y²; the arrows move from the initial solution to the final solution.]

Summary

The R animation package is designed to convert statistical ideas into animations that will assist the understanding of statistical concepts, theories and data analysis methods. Animations can be displayed both inside and outside R.

Further development of this package will include: (1) the addition of further animations; and (2) the development of a graphical user interface (GUI), using Tcl/Tk or gWidgets, etc., for the convenience of users.


Acknowledgment

We'd like to thank John Maindonald for his invaluable suggestions for the development of the animation package.

Bibliography

P. Murrell. R Graphics. Chapman & Hall/CRC, 2005.

Y. Xie. animation: Demonstrate Animations in Statistics, 2008. URL http://animation.yihui.name. R package version 1.0-1.

Yihui Xie
School of Statistics, Room 1037, Mingde Main Building, Renmin University of China, Beijing, 100872, China
[email protected]

Xiaoyue Cheng
School of Statistics, Room 1037, Mingde Main Building, Renmin University of China, Beijing, 100872, China
[email protected]


The VGAM Package

by Thomas W. Yee

Introduction

The VGAM package implements several large classes of regression models, of which vector generalized linear and additive models (VGLMs/VGAMs) are most commonly used (Table 1). The primary key words are maximum likelihood estimation, iteratively reweighted least squares (IRLS), Fisher scoring and additive models. Other subsidiary concepts are reduced rank regression, constrained ordination and vector smoothing. Altogether the package represents a broad and unified framework.

Written in S4 (Chambers, 1998), the VGAM modelling functions are used in a similar manner as glm() and Trevor Hastie's gam() (in gam, which is very similar to his S-PLUS version). Given a vglm()/vgam() object, standard generic functions such as coef(), fitted(), predict(), summary() and vcov() are available. The typical usage is like

vglm(yvector ~ x2 + x3 + x4,
     family = VGAMfamilyFunction,
     data = mydata)
vgam(ymatrix ~ s(x2) + x3,
     family = VGAMfamilyFunction,
     data = mydata)

Many models have a multivariate response, and therefore the LHS of the formula is often a matrix (otherwise a vector). The function assigned to the family argument is known as a VGAM family function; Table 2 lists some widely used ones.

The scope of VGAM is very broad; it potentially covers a wide range of multivariate response types and models, including univariate and multivariate distributions, categorical data analysis, quantile and expectile regression, time series, survival analysis, extreme value analysis, mixture models, correlated binary data, bioassay data and nonlinear least-squares problems. Consequently there is overlap with many other R packages. VGAM is designed to be as general as possible; currently over 100 VGAM family functions are available and new ones are being written all the time.

Basically, VGLMs model each parameter, transformed if necessary, as a linear combination of the explanatory variables. That is,

g_j(θ_j) = η_j = β_j^T x,   (1)

where g_j is a parameter link function. This idea is developed in the next section. VGAMs extend (1) to

g_j(θ_j) = η_j = Σ_{k=1}^p f_(j)k(x_k),   (2)

i.e., an additive model for each parameter.

Two examples

To motivate VGAM, let's consider two specific examples: the negative binomial distribution and the proportional odds model. These are often fitted using glm.nb() and polr() respectively, both of which happen to reside in the MASS package. We now attempt to highlight some advantages of using VGAM to fit these models—that is, to describe some of the VGLM/VGAM framework.

Negative binomial regression

A negative binomial random variable Y has a probability function that can be written as

P(Y = y) = C(y + k − 1, y) (µ/(µ + k))^y (k/(k + µ))^k,

where C(·, ·) denotes the binomial coefficient, y = 0, 1, 2, ..., and the parameters µ (= E(Y)) and k are positive. The quantity 1/k is known as the dispersion parameter, and the Poisson distribution is the limit as k approaches infinity. The negative binomial is commonly used to handle overdispersion with respect to the Poisson distribution because Var(Y) = µ + µ²/k > µ.

The MASS implementation is restricted to an intercept-only estimate of k. For example, one cannot fit log k = β_(2)1 + β_(2)2 x_2, say. In contrast, VGAM can fit

log µ = η_1 = β_1^T x,
log k = η_2 = β_2^T x,

by using negbinomial(zero=NULL) in the call to vglm(). Note that it is natural for any positive parameter to have the log link as the default.

There are also variants of the negative binomial in the VGAM package, e.g., the zero-inflated and zero-altered versions; see Table 2.
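The overdispersion relation Var(Y) = µ + µ²/k is easily checked by simulation. A quick illustrative sketch (not from the article), using the same mu/size parameterization of rnbinom() that appears in Example 1 later on:

mu <- 4; k <- 2
y <- rnbinom(1e6, mu = mu, size = k)
c(theoretical = mu + mu^2/k, empirical = var(y))  # both approximately 12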

Proportional odds model

The proportional odds model (POM) for an ordered factor Y taking levels {1, 2, ..., M + 1} may be written

logit P(Y ≤ j | x) = η_j(x),   (3)

where x = (x_1, ..., x_p)^T with x_1 = 1 denoting an intercept, and

η_j(x) = β_(j)1 + β_*^T x_(−1),   j = 1, ..., M.   (4)

Here, x_(−1) is x with the first element deleted. Standard references for the POM include McCullagh and Nelder (1989) and Agresti (2002).


η                                            Model     S function  Reference
B_1^T x_1 + B_2^T x_2 (= B^T x)              VGLM      vglm()      Yee and Hastie (2003)
B_1^T x_1 + Σ_{k=p1+1}^{p1+p2} H_k f*_k(x_k) VGAM      vgam()      Yee and Wild (1996)
B_1^T x_1 + A ν                              RR-VGLM   rrvglm()    Yee and Hastie (2003)
B_1^T x_1 + A ν + (ν^T D_1 ν, ..., ν^T D_M ν)^T  QRR-VGLM  cqo()   Yee (2004a)
B_1^T x_1 + Σ_{r=1}^R f_r(ν_r)               RR-VGAM   cao()       Yee (2006)

Table 1: A summary of VGAM and its framework. The latent variables ν = C^T x_2, or ν = c^T x_2 if the rank R = 1. Here, x^T = (x_1^T, x_2^T). Abbreviations: A = additive, C = constrained, L = linear, O = ordination, Q = quadratic, RR = reduced-rank, VGLM = vector generalized linear model.

The VGAM family function cumulative() may be used to fit this model. Additionally, in contrast to polr(), it can also fit some useful variants/extensions, some of which are described as follows.

(i) Nonproportional odds model. The regression coefficients β_* in (4) are common for all j. This is known as the parallelism or proportional odds assumption. This is a theoretically elegant idea because the η_j do not cross, and therefore there are no problems with negative probabilities etc. However, this assumption is a strong one and ought to be checked for each explanatory variable.

(ii) Selecting different link functions. VGAM is purposely extensible. The link argument to cumulative() may be assigned any VGAM link function, whereas polr() currently provides four fixed choices. Of course, the user is responsible for choosing appropriate links, e.g., the probit and complementary log-log. Users may write their own VGAM link function if they wish.

(iii) Partial proportional odds model. An intermediary between the proportional odds and nonproportional odds models is to have some explanatory variables parallel and others not. Some authors call this a partial proportional odds model. As an example, suppose p = 4, M = 2 and

η_1 = β_(1)1 + β_(1)2 x_2 + β_(1)3 x_3 + β_(1)4 x_4,
η_2 = β_(2)1 + β_(1)2 x_2 + β_(2)3 x_3 + β_(1)4 x_4.

Here, the parallelism assumption applies to x_2 and x_4 only. This can be achieved by

vglm(ymatrix ~ x2 + x3 + x4,
     cumulative(parallel = TRUE ~ x2 + x4 - 1))

or equivalently,

vglm(ymatrix ~ x2 + x3 + x4,
     cumulative(parallel = FALSE ~ x3))

There are several other extensions that can easily be handled by the constraint matrices idea described later.

(iv) Common VGAM family function arguments. Many authors define the POM as

logit P(Y ≥ j + 1 | x) = η_j(x)   (5)

rather than (3), because M = 1 then coincides with logistic regression. Many VGAM family functions share common arguments, and one of them is reverse; setting reverse=TRUE will fit (5). Other common arguments include link (usually one for each parameter), zero, parallel and initial values of parameters.

Modelling ordinal responses in MASS is serviced by a "one-off" function, whereas VGAM naturally supports several related models such as the adjacent categories and continuation/stopping ratio models; see Table 2.


General framework

In this section we describe the general framework and notation. Fuller details can be found in the references of Table 1. Although this section may be skipped on first reading, an understanding of these details is necessary to realize the package's full potential.

Vector generalized linear models

Suppose the observed response y is a q-dimensional vector. VGLMs are defined as a model for which the conditional distribution of Y given explanatory x is of the form

f(y | x; B) = h(y, η_1, ..., η_M)   (6)

for some known function h(·), where B = (β_1 β_2 ··· β_M) is a p × M matrix of unknown regression coefficients, and the jth linear predictor is

η_j = β_j^T x = Σ_{k=1}^p β_(j)k x_k,   j = 1, ..., M,   (7)

where x = (x_1, ..., x_p)^T with x_1 = 1 if there is an intercept. VGLMs are thus like GLMs but allow for multiple linear predictors, and they encompass models outside the limited confines of the classical exponential family.

The η_j of VGLMs may be applied directly to parameters of a distribution rather than just to means as for GLMs. A simple example is a univariate distribution with a location parameter ξ and a scale parameter σ > 0, where we may take η_1 = ξ and η_2 = log σ. In general, η_j = g_j(θ_j) for some parameter link function g_j and parameter θ_j. In VGAM, there are currently over a dozen links to choose from (Table 3), of which any can be assigned to any parameter, ensuring maximum flexibility.

There is no relationship between q and M in general: it depends specifically on the model or distribution to be fitted. For example, the mixture of two normal distributions has q = 1 and M = 5.

VGLMs are estimated by IRLS. Most models that can be fitted have a log-likelihood

ℓ = Σ_{i=1}^n w_i ℓ_i   (8)

and this will be assumed here. The w_i are known positive prior weights. Let x_i denote the explanatory vector for the ith observation, for i = 1, ..., n. Then one can write

η_i = (η_1(x_i), ..., η_M(x_i))^T = B^T x_i = (β_1^T x_i, ..., β_M^T x_i)^T.   (9)

In IRLS, an adjusted dependent vector z_i = η_i + W_i^{-1} d_i is regressed upon a large (VLM) design matrix, with d_i = w_i ∂ℓ_i/∂η_i. The working weights W_i here are w_i Var(∂ℓ_i/∂η_i) (which, under regularity conditions, equals −w_i E[∂²ℓ_i/(∂η_i ∂η_i^T)], called the expected information matrix or EIM), giving rise to the Fisher scoring algorithm. Fisher scoring usually has good numerical stability because the W_i are positive-definite over a larger region of parameter space. The price to pay for this stability is typically a slower convergence rate and less accurate standard errors (Efron and Hinkley, 1978) compared to the Newton-Raphson algorithm.

Vector generalized additive models

VGAMs provide additive-model extensions to VGLMs, that is, (7) is generalized to

η_j(x) = β_(j)1 + Σ_{k=2}^p f_(j)k(x_k),   j = 1, ..., M,

a sum of smooth functions of the individual covariates, just as with ordinary GAMs (Hastie and Tibshirani, 1990). The f_k = (f_(1)k(x_k), ..., f_(M)k(x_k))^T are centered for uniqueness, and are estimated simultaneously using vector smoothers. VGAMs are thus a visual data-driven method that is well suited to exploring data, and they retain the simplicity of interpretation that GAMs possess.

In practice we may wish to constrain the effect of a covariate to be the same for some of the η_j and to have no effect for others. For example, for VGAMs, we may wish to take

η_1 = β_(1)1 + f_(1)2(x_2) + f_(1)3(x_3),
η_2 = β_(2)1 + f_(1)2(x_2),

so that f_(1)2 ≡ f_(2)2 and f_(2)3 ≡ 0. For VGAMs, we can represent these models using

η(x) = β_(1) + Σ_{k=2}^p f_k(x_k)
     = H_1 β*_(1) + Σ_{k=2}^p H_k f*_k(x_k),   (10)

where H_1, H_2, ..., H_p are known full-column rank constraint matrices, f*_k is a vector containing a possibly reduced set of component functions, and β*_(1) is a vector of unknown intercepts. With no constraints at all, H_1 = H_2 = ··· = H_p = I_M and β*_(1) = β_(1). Like the f_k, the f*_k are centered for uniqueness. For VGLMs, the f_k are linear so that

B = (H_1 β*_(1)  H_2 β*_(2)  ···  H_p β*_(p)).   (11)
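For readers meeting IRLS for the first time, the M = 1 special case may help. The following hand-coded logistic regression (a sketch of our own, not VGAM code) shows the adjusted dependent variable z and the working weights in action; the VLM algorithm stacks M such pieces per observation:

set.seed(1)
X <- cbind(1, runif(100))              # design matrix with intercept
y <- rbinom(100, 1, plogis(1 - 2 * X[, 2]))
beta <- c(0, 0)
for (iter in 1:8) {                    # Fisher scoring iterations
    eta <- drop(X %*% beta)
    mu <- plogis(eta)
    w <- mu * (1 - mu)                 # working weights (EIM for the logit link)
    z <- eta + (y - mu)/w              # adjusted dependent variable
    beta <- coef(lm.wfit(X, z, w))     # weighted least squares step
}
cbind(beta, coef(glm(y ~ X[, 2], family = binomial)))  # the two should agree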


Type                 Name           Description
Categorical          acat           Adjacent categories model
                     cratio         Continuation ratio model
                     cumulative     Proportional and nonproportional odds model
                     multinomial    Multinomial logit model
                     sratio         Stopping ratio model
                     brat           Bradley Terry model
                     bratt          Bradley Terry model with ties
                     ABO            ABO blood group system
Quantile regression  lms.bcn        Box-Cox transformation to normality
                     alsqreg        Asymmetric least squares (expectiles on normal distribution)
                     amlbinomial    Asymmetric maximum likelihood—for binomial
                     amlpoisson     Asymmetric maximum likelihood—for Poisson
                     alaplace1      Asymmetric Laplace distribution (1-parameter)
Counts               negbinomial    Negative binomial
                     zipoisson      Zero-inflated Poisson
                     zinegbinomial  Zero-inflated negative binomial
                     zibinomial     Zero-inflated binomial
                     zapoisson      Zero-altered Poisson (a hurdle model)
                     zanegbinomial  Zero-altered negative binomial (a hurdle model)
                     mix2poisson    Mixture of two Poissons
                     pospoisson     Positive Poisson
Discrete             logarithmic    Logarithmic
distributions        skellam        Skellam (Pois(λ1) − Pois(λ2))
                     yulesimon      Yule-Simon
                     zetaff         Zeta
                     zipf           Zipf
Continuous           betaff         Beta (2-parameter)
distributions        betabinomial   Beta-binomial
                     betaprime      Beta-prime
                     dirichlet      Dirichlet
                     gamma2         Gamma (2-parameter)
                     gev            Generalized extreme value
                     gpd            Generalized Pareto
                     mix2normal1    Mixture of two univariate normals
                     skewnormal1    Skew-normal
                     vonmises       Von Mises
                     weibull        Weibull
Bivariate            amh            Ali-Mikhail-Haq
distributions        fgm            Farlie-Gumbel-Morgenstern
                     frank          Frank
                     morgenstern    Morgenstern
Bivariate binary     binom2.or      Bivariate logistic odds-ratio model
responses            binom2.rho     Bivariate probit model
                     loglinb2       Loglinear model for 2 binary responses

Table 2: Some commonly used VGAM family functions. They have been grouped into types. Families in the classical exponential family are omitted here.


Vector splines and penalized likelihood

VGAMs employ vector smoothers in their estimation. One type of vector smoother is the vector spline, which minimizes the quantity

Σ_{i=1}^n {y_i − f(x_i)}^T Σ_i^{-1} {y_i − f(x_i)} + Σ_{j=1}^M λ_j ∫_a^b {f_j''(t)}² dt   (12)

for scatterplot data (x_i, y_i, Σ_i), i = 1, ..., n. Here, the Σ_i are known symmetric and positive-definite error covariances, and a < min(x_1, ..., x_n) and b > max(x_1, ..., x_n). The first term of (12) measures lack of fit while the second term is a smoothing penalty. Equation (12) simplifies to an ordinary cubic smoothing spline when M = 1 (see, e.g., Green and Silverman, 1994). Each component function f_j has a non-negative smoothing parameter λ_j which operates as with an ordinary cubic spline.

Now consider the penalized likelihood

ℓ − (1/2) Σ_{k=1}^p Σ_{j=1}^M λ_(j)k ∫_{a_k}^{b_k} {f_(j)k''(x_k)}² dx_k.   (13)

Special cases of this quantity have appeared in a wide range of situations in the statistical literature to introduce cubic spline smoothing into a model's parameter estimation (for a basic overview of this roughness penalty approach see Green and Silverman, 1994). It transpires that the VGAM local scoring algorithm can be justified via maximizing the penalized likelihood (13) by vector spline smoothing (12) with W_i = Σ_i^{-1}.

From a practitioner's viewpoint, the smoothness of a smoothing spline can be more conveniently controlled by specifying the degrees of freedom of the smooth rather than λ_j. Given a value of the degrees of freedom, the software can search for the λ_j producing this value. In general, the higher the degrees of freedom, the more wiggly the curve.
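This df-for-λ convenience has a base-R analogue in smooth.spline(), which makes the trade-off easy to see; a brief illustrative sketch (simulated data of our own):

x <- seq(0, 1, length = 200)
y <- sin(4 * pi * x) + rnorm(200, sd = 0.4)
plot(x, y, col = "grey")
lines(smooth.spline(x, y, df = 3), col = "blue")   # small df: very smooth
lines(smooth.spline(x, y, df = 20), col = "red")   # large df: more wiggly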

RR-VGLMs

Partition x into (x_1^T, x_2^T)^T and B = (B_1^T B_2^T)^T. In general, B is a dense matrix of full rank, i.e., rank = min(M, p). Thus there are M × p regression coefficients to estimate, and for some models and data sets this is "too" large.

One solution is based on a simple and elegant idea: replace B_2 by a reduced-rank regression. This will cut down the number of regression coefficients enormously if the rank R is kept low. Ideally, the problem can be reduced down to one or two dimensions—a successful application of dimension reduction—and therefore can be plotted. The reduced-rank regression is applied to B_2 because we want to make provision for some variables x_1 that we want to leave alone, e.g., the intercept.

It transpires that reduced-rank VGLMs (RR-VGLMs) are simply VGLMs where the constraint matrices are estimated. The modelling function rrvglm() calls an alternating algorithm which toggles between estimating two thin matrices A and C, where B_2 = AC^T.

Incidentally, special cases of RR-VGLMs have appeared in the literature. For example, a RR-multinomial logit model, or RR-MLM, is known as the stereotype model (Anderson, 1984). Another is Goodman (1981)'s RC model, which is a reduced-rank multivariate Poisson model. Note that the parallelism assumption of the proportional odds model (McCullagh and Nelder, 1989) can be thought of as a type of reduced rank regression where the constraint matrices are thin and known.

QRR-VGLMs and constrained ordination

RR-VGLMs form an optimal linear combination of the explanatory variables x_2 and then fit a VGLM to these. Thus, in terms of (ecological) ordination, they perform a constrained linear ordination, or CLO. Biotic responses are usually unimodal, and therefore this suggests fitting a quadratic on the η scale. This gives rise to the class of quadratic RR-VGLMs, or QRR-VGLMs, so as to have constrained quadratic ordination, or CQO. The result is bell-shaped curves/surfaces on axes defined in terms of gradients. The CQO approach is statistically more rigorous than the popular canonical correspondence analysis (CCA) method. For more details see the references in Table 1.

Some user-oriented topics

Making the most of VGAM requires an understanding of the general framework described above plus a few other notions. Here are some of them.

Common arguments

VGAM family functions share a pool of common types of arguments, e.g., exchangeable, nsimEIM, parallel, zero. These are more convenient shortcuts for the argument constraints, which accepts a named list of constraint matrices H_k. For example, setting parallel = TRUE would constrain the coefficients β_(j)k in (7) to be equal for all j = 1, ..., M, each separately for k = 2, ..., p. For some models the β_(j)1 are equal too. Another example is the zero argument, which accepts a vector specifying which η_j are to be modelled as intercept-only; assigning it NULL means none. A last example is nsimEIM: some VGAM family functions use simulation to estimate the EIM, and for these this argument controls the number of random variates generated.

Link functions

Most VGAM family functions allow a flexible choice of link functions to be assigned to each η_j; Table 3 lists some of them. Most VGAM family functions offer a link argument for each parameter plus an extra argument to pass in any value associated with the link function, e.g., an offset or power.

Random variates and EIMs

Many dpqr-functions exist to return the density, distribution function, quantile function and random generation of their respective VGAM family function. For example, there are the [dpqr]sinmad() functions corresponding to the Singh-Maddala distribution VGAM family function sinmad().

Incidentally, if you are a developer or writer of models/distributions where scoring is natural, then it helps writers of future VGAM family functions to have expressions for the EIM. A good example is Kleiber and Kotz (2003), many of whose distributions have been implemented in VGAM. If EIMs are intractable, then algorithms for generating random variates and an expression for the score vector are the next best choice.

Smart prediction (for interest only)

It is well known among seasoned R and S-PLUS users that lm() and glm()-type models used to have prediction problems. VGAM currently implements "smart prediction" (written by T. Yee and T. Hastie) whereby the parameters of parameter-dependent functions are saved on the object. This means that functions such as scale(x), poly(x), bs(x) and ns(x) will always correctly predict. However, most R users are unaware that other R modelling functions will not handle pathological cases such as I(poly(x)) and bs(scale(x)). In S-PLUS even terms such as poly(x) and scale(x) will not predict properly without special measures.

Examples

Here are some examples giving a flavour of VGAM that I hope convey some of its breadth (at the expense of depth); http://www.stat.auckland.ac.nz/~yee/VGAM has more examples as well as some fledgling documentation and the latest prerelease version of VGAM.

Example 1: Negative binomial regression

Consider a simple example involving simulated data where the dispersion parameter is a function of x. Two independent responses are constructed. Note that the k parameter corresponds to the size argument.

set.seed(123)
x = runif(n <- 500)
y1 = rnbinom(n, mu=exp(3+x), size=exp(1+x))
y2 = rnbinom(n, mu=exp(2-x), size=exp(2*x))
fit = vglm(cbind(y1,y2) ~ x,
           fam = negbinomial(zero=NULL))

Then

> coef(fit, matrix=TRUE)
            log(mu1) log(k1) log(mu2) log(k2)
(Intercept)     2.96   0.741     2.10  -0.119
x               1.03   1.436    -1.24   1.865

An edited summary shows

> summary(fit)
...
Coefficients:
               Value Std. Error t value
(Intercept):1  2.958     0.0534   55.39
(Intercept):2  0.741     0.1412    5.25

Table 3: Some VGAM link functions currently available.

Link      g(θ)                          Range of θ
cauchit   tan(π(θ − 1/2))               (0, 1)
cloglog   log_e{− log_e(1 − θ)}         (0, 1)
fisherz   (1/2) log_e{(1 + θ)/(1 − θ)}  (−1, 1)
fsqrt     √(2θ) − √(2(1 − θ))           (0, 1)
identity  θ                             (−∞, ∞)
loge      log_e(θ)                      (0, ∞)
logc      log_e(1 − θ)                  (−∞, 1)
logit     log_e(θ/(1 − θ))              (0, 1)
logoff    log_e(θ + A)                  (−A, ∞)
probit    Φ^{-1}(θ)                     (0, 1)
powl      θ^p                           (0, ∞)
rhobit    log_e{(1 + θ)/(1 − θ)}        (−1, 1)

(Intercept):3  2.097     0.0851   24.66
(Intercept):4 -0.119     0.1748   -0.68
x:1            1.027     0.0802   12.81
x:2            1.436     0.2508    5.73
x:3           -1.235     0.1380   -8.95
x:4            1.865     0.4159    4.48

so that all estimates are within two standard errors of their true values.

Example 2: Proportional odds model

Here we fit the POM on data from Section 5.6.2 of McCullagh and Nelder (1989).

data(pneumo)
pneumo = transform(pneumo,
                   let=log(exposure.time))
fit = vglm(cbind(normal,mild,severe) ~ let,
           cumulative(par=TRUE, rev=TRUE),
           data = pneumo)

Then

> coef(fit, matrix=TRUE)
            logit(P[Y>=2]) logit(P[Y>=3])
(Intercept)        -9.6761       -10.5817
let                 2.5968         2.5968
> constraints(fit)  # Constraint matrices
$‘(Intercept)‘
     [,1] [,2]
[1,]    1    0
[2,]    0    1

$let
     [,1]
[1,]    1
[2,]    1

The constraint matrices are the H_k in (10) and (11). The estimated variance-covariance matrix, and some of the fitted values and prior weights, are

> vcov(fit)
              (Intercept):1 (Intercept):2    let
(Intercept):1         1.753         1.772 -0.502
(Intercept):2         1.772         1.810 -0.509
let                  -0.502        -0.509  0.145
> cbind(fitted(fit),
        weights(fit, type="prior"))[1:4,]
  normal    mild  severe
1  0.994 0.00356 0.00243 98
2  0.934 0.03843 0.02794 54
3  0.847 0.08509 0.06821 43
4  0.745 0.13364 0.12181 48

Example 3: constraint matrices

How can we estimate θ given a random sample of n observations from a N(µ = θ, σ = θ) distribution? One way is to use the VGAM family function normal1() with the constraint that the mean and standard deviation are equal. Its default is η_1 = µ and η_2 = log σ, but we can use the identity link function for σ and set H_1 = (1, 1)^T. Suppose n = 100 and θ = 10. Then we can use

set.seed(123)
theta = 10
y = rnorm(n <- 100, mean = theta, sd = theta)
clist = list("(Intercept)" = rbind(1, 1))
fit = vglm(y ~ 1, normal1(lsd="identity"),
           constraints = clist)

Then we get θ̂ = 9.7504 from

> coef(fit, matrix = TRUE)
              mean     sd
(Intercept) 9.7504 9.7504

Consider a similar problem but from N(µ = θ, σ² = θ). To estimate θ, make use of log µ = log θ = 2 log σ so that

set.seed(123)
y2 = rnorm(n, theta, sd = sqrt(theta))
clist2 = list("(Intercept)" = rbind(1, 0.5))
fit2 = vglm(y2 ~ 1, normal1(lmean="loge"),
            constraints = clist2)

Then we get θ̂ = 10.191 from

> (cfit2 <- coef(fit2, matrix = TRUE))
            log(mean) log(sd)
(Intercept)    2.3215  1.1608
> exp(cfit2[1,"log(mean)"])
[1] 10.191

It is left to the reader as an exercise to figure out how to estimate θ from a random sample from N(µ = θ, σ = 1 + e^{−θ}), for θ = 1, say.

Example 4: mixture models

The 2008 World Fly Fishing Championships (WFFC) were held in New Zealand a few months ago. Permission to access and distribute the data was kindly given, and here we briefly look at the length of fish caught in Lake Rotoaira during the competition. A histogram of the data is given in Figure 1(a). The data look possibly bimodal, so let's see how well a mixture of two normal distributions fits the data. This distribution is usually estimated by the EM algorithm, but mix2normal1() uses simulated Fisher scoring.

data(wffc)
fit.roto = vglm(length/10 ~ 1, data=wffc,
    mix2normal1(imu2=45, imu1=25, ESD=FALSE),
    subset = water=="Rotoaira")

Here, we convert mm to cm. The estimation was helped by passing in some initial values, and we could have constrained the two standard deviations to be equal for parsimony. Now


> coef(fit.roto, matrix = TRUE)
            logit(phi)    mu1 log(sd1)
(Intercept)    -1.8990 25.671   1.4676
               mu2 log(sd2)
(Intercept) 46.473   1.6473
> Coef(fit.roto)
    phi     mu1     sd1     mu2     sd2
 0.1302 25.6706  4.3388 46.4733  5.1928

The resulting density is overlaid on the histogram in Figure 1(b). The LHS normal distribution doesn't look particularly convincing and would benefit from further investigation.
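The overlaid density is simply a φ-weighted sum of two normal densities. Using the Coef() values just printed, it can be reconstructed in base R — a sketch of our own, independent of VGAM:

dmix <- function(x, phi, mu1, sd1, mu2, sd2)
    phi * dnorm(x, mu1, sd1) + (1 - phi) * dnorm(x, mu2, sd2)
curve(dmix(x, 0.1302, 25.6706, 4.3388, 46.4733, 5.1928),
      from = 15, to = 70, xlab = "length (cm)", ylab = "density")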

0.04 0.04 > fit@extra[["percentile"]] w.aml=0.11 w.aml=0.9 w.aml=5.5 0.02 0.02 24.6 50.2 75.0

0.00 0.00 is to sufficient accuracy. Then Figure 2 was obtained

20 30 40 50 60 70 20 30 40 50 60 70 by with(alldat, plot(x, y, col="darkgreen")) Figure 1: (a) Histogram of fish lengths (cm) at with(alldat, matlines(x, fitted(fit), Lake Rotoaira. (b) Overlaid with the estimated mix- col="blue", lty=1)) ture of two normal density functions. The sample size is 201. If parallel=TRUE was set then this would avoid the embarrasing crossing quantile problem.

Example 5: quantile & expectile regression ● ● ● ● ● 15 ●● ● Over recent years quantile regression has become an ● ●●● ●● ●●●●●●●●● ●● ● ● important tool with a wide range of applications. ● ●●●● ●●●●● ● ●●● ●●●● ● ● ●●●●● ● ● The breadth of the VGLM/VGAM framework is il- 10 ● ●● ●● ●●●●●●● ●●● ●●● ● ● ●● ● ● ● ●●●● ●●● ●●●●● ● ●●●● ● ●●●● ● y lustrated by VGAM implementing three subclasses ● ●● ●●● ●● ●● ●●●●● ●●● ● ●●●●●●●● ● ● ● ● ●●● ●●●●●●● ● ● ●●●● ● ●● ●● ● ● ● of quantile regression methods. We will only men- ●● ●● ● ●● ● ● ● ● ●●●●● ●● ● ●● ●●● ● ● ● 5 ●●●● ●● ● ● ●● ●● ● ● ●●●●● ● ●●●● ●● ●●●● ● ●●● ● ● ● ● ● ●●●● ● ●● ● ●●● ● ●●●● ●●●● ●● ●●●●●●●●●●●● ● ●● tion the two most popular subclasses here. ●●● ● ● ● ● ● ●● ● ● ● ●●●●●●●●● ● ●●●●● ●●● ●●●●● ●● ● ● ●● ●●●● ●●●●● ● ● ●●●●●●●●●● ● ●● ●● ●● The first is of the LMS-type where a transforma- ●●● ●●●●● ● ● ● ●●●●●●●● ●●●● ●●●●●●●●●●●●● ●●●●●●●●● tion of the response (e.g., Box-Cox) to some paramet- 0 ● ● ● ● ●● ● ●●●●● ●●●●●●● ●●●●●●●●● ric distribution (e.g., standard normal) enables quan- 0.2 0.4 0.6 0.8 1.0 1.2 tiles to be estimated on the transformed scale and x then back-transformed to the original scale. In partic- ular the popular Cole and Green (1992) method falls Figure 2: AML Poisson model fitted to the simulated in this subclass. More details are in Yee (2004b). Poisson data. The fitted expectiles correspond ap- The second is via expectile regression. Expec- proximately to τ = (0.25, 0.5, 0.75)T and were fitted tiles are almost as interpretable as quantiles because, by smoothing splines.

R News ISSN 1609-3631 Vol. 8/2, October 2008 36

Example 6: the bivariate logistic odds-ratio the maxima is Gumbel and let Y(1),..., Y(r) be the r model largest observations such that Y(1) ≥ · · · ≥ Y(r). The joint distribution of We consider the coalminers data of Table 6.6 of Mc-  T Cullagh and Nelder (1989). Briefly, 18282 working Y(1) − bn Y(r) − bn coalminers were classified by the presence of Y1 = ,..., an an breathlessness and Y2 = wheeze. The explanatory variable x = age was available. We fit a nonparamet- has, for large n, a limiting distribution with density ric bivariate logistic odds-ratio model f (y(1),..., y(r); µ,σ) =

η1 = logit P(Y1 = 1|x) = β( ) + f( ) (x), (14) (   r  ) 1 1 1 1 y(r) − µ y(j) − µ σ−r exp − exp − − , η2 = logit P(Y2 = 1|x) = β(2)1 + f(2)1(x), (15) σ ∑ σ j=1 η3 = log ψ = β(3)1 + f(3)1(x), (16) for y(1) ≥ · · · ≥ y(r). Upon taking logarithms, one where ψ = p11 p00/(p10 p01) is the odds ratio, p jk = can treat this as an approximate log-likelihood. P(Y1 = j, Y2 = k|x). It is a good idea to afford The VGAM family function gumbel() fits this T f(3)1(x) less flexibility, and so we use model. Its default is η(x) = (µ(x), log σ(x)) . We apply the block-Gumbel model to the well-known data(coalminers) Venice sea levels data where the 10 highest sea lev- fit.coal = vgam(cbind(nBnW,nBW,BnW,BW) ~ els (in cm) for each year for the years x = 1931 to s(age, df=c(4,4,3)), 1981 were recorded. Note that only the top 6 values binom2.or(zero=NULL), coalminers) are available in 1935. For this reason let’s try using mycols = c("blue","darkgreen","purple") the top 5 values. We have plot(fit.coal, se=TRUE, lcol=mycols, scol=mycols, overlay=TRUE) data(venice) fit=vglm(cbind(r1,r2,r3,r4,r5)~I(year-1930), to give Figure 3. It appears that the log odds ratio gumbel(R=365, mpv=TRUE, zero=2, could be modelled linearly. lscale="identity"), Were an exchangeable error structure true we data = venice) would expect both functions fb(1)1(x) and fb(2)1(x) to be on top of each other. However it is evident that giving this is not so, relative to the standard error bands. > coef(fit, mat=TRUE) Possibly a quadratic function for each is suitable. location scale 2 (Intercept) 104.246 12.8 I(year - 1930) 0.458 0.0 1 This agrees with Smith (1986). 0 Now let’s look at the first 10 order statistics and introduce some splines such as Rosen and Cohen −1 (1996). As they do, let’s assume no serial correlation. s(age, df = c(4, 4, 3)) −2 y=as.matrix(venice[,paste("r",1:10,sep="")]) fit1 = vgam(y ~ s(year, df=3), gumbel(R=365, mpv=TRUE), 30 40 50 60 data=venice, na.action=na.pass)

age A plot of the fitted quantiles mycols = 1:3 Figure 3: Bivariate logistic odds-ratio model fitted qtplot(fit1, mpv=TRUE, lcol=mycols, to the coalminers data. This is the VGAM (14)–(16) tcol=mycols, las=1, llwd=2, ( ) with centered components overlaid: fb(1)1 x in blue ylab="Venice sea levels (cm)", (steepest curve), fb(2)1(x) in green, fb(3)1(x) in purple pcol="blue", tadj=0.1, bty="l") (downward curve). The dashed curves are pointwise ±2 standard error bands produces Figure 4. The curves suggest a possible in- flection point in the mid-1960s. The median predicted Example 7: the Gumbel model value (MPV) for a particular year is the value that the maximum of the year has an even chance of exceed- Central to classical extreme value theory is the gen- ing. As a check, the plot shows about 26 values ex- eralized extreme value (GEV) distribution of which ceeding the MPV curve—this is about half of the 51 the Gumbel is a special case when ξ = 0. Suppose year values.

R News ISSN 1609-3631 Vol. 8/2, October 2008 37

● The result is Figure 5.

● 180 0.06 + ● ● 160 0.05 ● ● ● ● 0.04 ● 140 ● ● ● ● ● ● ● ● ● MPV ● ● ● ● ● ● ● ● ● ● 0.03 + ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● velocity 120 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 99% 0.02 ● + ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Venice sea levels (cm) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● + ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 100 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● + ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● + ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.01 ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 95% ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 80 ● ● ● ● ● ● ● 0.00 ● ● ● ● ● ● ● ● 0.0 0.5 1.0 1.5 2.0 1930 1940 1950 1960 1970 1980 concentration year Figure 5: Michaelis-Menten model fitted to some en- Figure 4: Block-Gumbel model fitted to the Venice zyme data. sea levels data, smoothing splines are used.

More details about the application of VGLMs and Example 9: constrained ordination VGAMs to extreme value data analysis are given Ecologists frequently perform an ordination in order in Yee and Stephenson (2007). to see how multi-species and environmental data are related. An ordination enables one to display these Example 8: nonlinear regression data types simultaneously in an ordination plot. Several methods have been developed of which The VGLM framework enables a Gauss-Newton-like one of the most common is canonical correspondence algorithm to be performed in order to solve the non- analysis (CCA). It was developed under the restric- linear regression model tive assumptions of a species packing model which stip- ulates that species have equal tolerances (a measure Yi = f (ui;θ) + εi, i = 1, . . . , n, (17) of niche width), equal maxima (a measure of abun- where θ is an M-vector of unknown parameters and dance), and optima (optimal environments) that are 2 the εi are assumed to be i.i.d. N(0,σ ). Here, it is uniformly distributed over the range of the gradient, more convenient to use u to denote the regressors which are linear combinations of the environmental rather than the usual x. An example of (17) is the variables (also known as latent variables). Michaelis-Menten model CCA is a heuristic approximation to constrained cqo() θ u quadratic ordination (CQO). The function im- Y = 1 + ε plements CQO, meaning it assumes each species has θ + u 2 a symmetric bell-shaped response with respect to un- and this is implemented in the VGAM family func- derlying gradients. CQO does not necessarily make tion micmen(). One advantage VGAM has over any of the species packing model assumptions. nls() for fitting this model is that micmen() is self- A simple CQO model is where the responses are starting, i.e., initial values are automatically chosen. Poisson counts and there is one gradient. Then the Here is the Michaelis-Menten model fitted to model can be written like a Poisson regression some enzyme velocity and substrate concentration 2 data (see Figure 5). log µs = β(s)1 + β(s)2 ν + β(s)3 ν (18)  2 data(enzyme) 1 ν − us = αs − . (19) fit = vglm(velocity ~ 1, micmen, enzyme, 2 ts form2 = ~ conc - 1) T with(enzyme, plot(conc, velocity, las=1, where ν = c x2 is the latent variable, and s denotes xlab="concentration", the species (s = 1, . . . , S). The curves are unimodal ylim=c(0,max(velocity)), provided β(s)3 are negative. The second parameter- xlim=c(0,max(conc)))) ization (19) has direct ecological interpretation: us is with(enzyme, points(conc, fitted(fit), the species’ optimum or species score, ts is the toler- col="red", pch="+")) ance, and exp(αs) is the maximum. Here is an example using some hunting spiders U = with(enzyme, max(conc)) data. The responses are 12 species’ numbers trapped newenzyme = data.frame(conc=seq(0,U,len=200)) over a 60 week period and there are 6 environmental fv = predict(fit, newenzyme, type="response") variables measured at 28 sites. The x are standard- with(newenzyme, lines(conc, fv, col="blue")) ized prior to fitting.

R News ISSN 1609-3631 Vol. 8/2, October 2008 38 data(hspider) clr = (1:(S+1))[-7] # omits yellow hspider[,1:6] = scale(hspider[,1:6]) persp(p1, col=clr, llty=1:S, llwd=2, las=1) p1 = cqo(cbind(Alopacce, Alopcune, Alopfabr, legend("topright", legend=colnames(p1@y), Arctlute, Arctperi, Auloalbi, col=clr, bty="n", lty=1:S, lwd=2) Pardlugu, Pardmont, Pardnigr, and this results in Figure 6. Pardpull, Trocterr, Zoraspin) 120 ~ WaterCon + BareSand + FallTwig + Alopacce Alopcune 100 CoveMoss + CoveHerb + ReflLux, Alopfabr Arctlute fam = poissonff, data = hspider, 80 Arctperi Crow1posit = FALSE, ITol = FALSE) Auloalbi Pardlugu 60 Pardmont Then the fitted coefficients of (19) can be seen in Pardnigr 40 Pardpull Expected Value Trocterr > coef(p1) Zoraspin 20

C matrix (constrained/canonical coefficients) 0

lv −2 −1 0 1 2 3

WaterCon -0.233 Latent Variable BareSand 0.513 FallTwig -0.603 Figure 6: CQO Poisson model, with unequal toler- CoveMoss 0.211 ances, fitted to the hunting spiders data. CoveHerb -0.340 ReflLux 0.798 Currently cqo() will handle presence/absence data via family = binomialff. Altogether the ... present cqo() function is a first stab at ordination based on sound regression techniques. Since it fits Optima and maxima by maximum likelihood estimation it is likely to be Optimum Maximum sensitive to outliers. The method requires very clear Alopacce 1.679 19.29 unimodal responses amongst the species, hence the x Alopcune -0.334 18.37 needs to be collected over a broad range of environ- Alopfabr 2.844 13.03 ments. All this means that it may be of limited use Arctlute -0.644 6.16 to ‘real’ biological data where messy data is common Arctperi 3.906 14.42 and robustness to outliers is needed. Auloalbi -0.586 19.22 Pardlugu NA NA Pardmont 0.714 48.60 Summary Pardnigr -0.530 87.80 Pardpull -0.419 110.32 Being a large project, VGAM is far from completion. Trocterr -0.683 102.27 Currently some of the internals are being rewritten Zoraspin -0.742 27.24 and a monograph is in the making. Its continual de- velopment means changes to some details presented Tolerance here may occur in the future. The usual maintenance, lv consisting of bug fixing and implementing improve- Alopacce 1.000 ments, applies. Alopcune 0.849 In summary, the contributed R package VGAM is Alopfabr 1.048 purposely general and computes the maximum like- Arctlute 0.474 lihood estimates of many types of models and distri- Arctperi 0.841 butions. It comes not only with the capability to do a Auloalbi 0.719 lot of things but with a large and unified framework Pardlugu NA that is flexible and easily understood. It is hoped that Pardmont 0.945 the package will be useful to many statistical practi- Pardnigr 0.529 tioners. Pardpull 0.597 Trocterr 0.932 Zoraspin 0.708 Bibliography

All but one species have fitted bell-shaped curves. A. Agresti. Categorical Data Analysis. Wiley, New The νb can be interpreted as a moisture gradient. The York, second edition, 2002. curves may be plotted with J. A. Anderson. Regression and ordered categorical S = ncol(p1@y) # Number of species variables (with discussion). Journal of the Royal Sta-

R News ISSN 1609-3631 Vol. 8/2, October 2008 39

tistical Society, Series B, Methodological, 46(1):1–30, ing: Proceedings of the COMPSTAT ’94 Satellite Meet- 1984. ing held in Semmering, Austria, 27–28 August 1994, pages 200–214, Heidelberg, 1996. Physica-Verlag. J. M. Chambers. Programming with Data: A Guide to the S Language. Springer, New York, 1998. R. L. Smith. Extreme value theory based on of Hydrology T. J. Cole and P. J. Green. Smoothing reference cen- largest annual events. , 86(1– tile curves: The LMS method and penalized likeli- 2):27–43, 1986. hood. Statistics in Medicine, 11(10):1305–1319, 1992. T. W. Yee. A new technique for maximum-likelihood B. Efron. Poisson overdispersion estimates based canonical Gaussian ordination. Ecological Mono- on the method of asymmetric maximum likeli- graphs, 74(4):685–701, 2004a. hood. Journal of the American Statistical Association, 87(417):98–107, 1992. T. W. Yee. Quantile regression via vector generalized additive models. Statistics in Medicine, 23(14):2295– B. Efron and D. V. Hinkley. Assessing the accuracy of 2315, 2004b. the maximum likelihood estimator: Observed ver- sus expected Fisher information. Biometrika, 65(3): T. W. Yee. Constrained additive ordination. Ecology, 457–481, 1978. 87(1):203–213, 2006.

L. A. Goodman. Association models and canonical T. W. Yee. Vector generalized linear and additive correlation in the analysis of cross-classifications models, with applications to the 2008 World Fly having ordered categories. Journal of the American Fishing Championships. In preparation, 2008. Statistical Association, 76(374):320–334, 1981. T. W. Yee and T. J. Hastie. Reduced-rank vector gen- P. J. Green and B. W. Silverman. Nonparametric Re- eralized linear models. Statistical Modelling, 3(1): gression and Generalized Linear Models: A Roughness 15–41, 2003. Penalty Approach. Chapman & Hall, London, 1994.

T. J. Hastie and R. J. Tibshirani. Generalized Additive T. W. Yee and A. G. Stephenson. Vector general- Models. Chapman & Hall, London, 1990. ized linear and additive extreme value models. Ex- tremes, 10(1–2):1–19, 2007. C. Kleiber and S. Kotz. Statistical Size Distribu- tions in Economics and Actuarial Sciences. Wiley- T. W. Yee and C. J. Wild. Vector generalized additive Interscience, Hoboken, NJ, USA, 2003. models. Journal of the Royal Statistical Society, Series B, Methodological, 58(3):481–493, 1996. P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, London, second edition, 1989. Thomas W. Yee Department of Statistics O. Rosen and A. Cohen. Extreme percentile regres- University of Auckland sion. In W. Härdle and M. G. Schimek, editors, Sta- New Zealand tistical Theory and Computational Aspects of Smooth- [email protected]

R News ISSN 1609-3631 Vol. 8/2, October 2008 40

Comparing Non-Identical Objects

Introducing the compare Package The problem with this function is that it is very strict indeed and will fail for objects that are, for all by Paul Murrell practical purposes, the same. The classic example is the comparison of two real (floating-point) values, The compare package provides functions for com- as demonstrated in the following code, where differ- paring two R objects for equality, while allowing for ences can arise simply due to the limitations of how a range of “minor” differences. Objects may be re- numbers are represented in computer memory (see ordered, rounded, or resized, they may have names R FAQ 7.31, Hornik, 2008). or attributes removed, or they may even be coerced > identical(0.3 - 0.2, 0.1) to a new class if necessary in order to achieve equal- ity. [1] FALSE The results of comparisons report not just Using the function to test for equality would clearly whether the objects are the same, but also include a be unreasonably harsh when marking any student record of any modifications that were performed. answer that involves calculating a numeric result. This package was developed for the purpose of The identical() function, by itself, is not suffi- partially automating the marking of coursework in- cient for comparing student answers with model an- volving R code submissions, so functions are also swers. provided to convert the results of comparisons into numeric grades and to provide feedback for stu- Shades of grey dents. The recommended solution to the problem men- tioned above of comparing two floating-point values Motivation is to use the all.equal() function. This function al- lows for “insignificant” differences between numeric STATS 220 is a second year university course run values, as shown below. by the Department of Statistics at the University of > all.equal(0.3 - 0.2, 0.1) Auckland.1 The course covers a range of “Data Tech- [1] TRUE nologies”, including HTML, XML, databases, SQL, and, as a general purpose data processing tool, R. This makes all.equal() a much more appropriate In addition to larger assignments, students in the function for comparing student answers with model course must complete short exercises in weekly com- answers. puter labs. What is less well-known about the all.equal() For the R section of the course, students must function is that it also works for comparing other write short pieces of R code to produce specific R ob- sorts of R objects, besides numeric vectors, and that jects. Figure 1 shows two examples of basic, intro- it does more than just report equality between two ductory exercises. objects. The students submit their answers to the exer- If the objects being compared have differences, cises as a file containing R code, which means that then all.equal() does not simply return FALSE. In- it is possible to recreate their answers by calling stead, it returns a character vector containing mes- source() on the submitted files. sages that describe the differences between the ob- At this point, the R objects generated by the stu- jects. The following code gives a simple example, dents’ code can be compared with a set of model R where all.equal() reports that the two character objects in order to establish whether the students’ an- vectors have different lengths, and that, of the two swers are correct. pairs of strings that can be compared, one pair of How this comparison occurs is the focus of this strings does not match. article. 
> all.equal(c("a", "b", "c"), c("a", "B")) [1] "Lengths (3, 2) differ (string compare on first 2)" Black and white comparisons [2] "1 string mismatch" The simplest and most strict test for equality be- This feature is actually very useful for marking tween two objects in the base R system (R Develop- student work. Information about whether a stu- ment Core Team, 2008) is provided by the function dent’s answer is correct is useful for determining a identical(). This returns TRUE if the two objects are raw mark, but it is also useful to have information exactly the same, otherwise it returns FALSE. about what the student did wrong. This information 1http://www.stat.auckland.ac.nz/courses/stage2/#STATS220

R News ISSN 1609-3631 Vol. 8/2, October 2008 41

1. Write R code to create the three vectors and 2. Combine the objects from Question 1 the factor shown below, with names id, together to make a data frame called age, edu, and class. IndianMothers. You should end up with objects that look You should end up with an object that like this: looks like this:

> id > IndianMothers [1] 1 2 3 4 5 6 id age edu class 1 1 30 0 poor > age 2 2 32 0 poor [1] 30 32 28 39 20 25 3 3 28 0 poor 4 4 39 0 middle > edu 5 5 20 0 middle [1] 0 0 0 0 0 0 6 6 25 0 middle

> class [1] poor poor poor middle [5] middle middle Levels: middle poor

Figure 1: Two simple examples of the exercises that STATS 220 students are asked to perform. can be used as the basis for assigning partial marks [1] "Modes: character, numeric" for an answer that is close to the correct answer, and [2] "Attributes: < target is NULL, current for providing feedback to the student about where is list >" marks were lost. [3] "target is character, current is factor" The all.equal() function has some useful fea- The alternative approach would be to allow var- tures that make it a helpful tool for comparing stu- ious transformations of the objects to see if they can dent answers with model answers. However, there be transformed to be the same. The following code is an approach that can perform better than this. shows this approach, which reports that the objects The all.equal() function looks for equality be- are equal, if the second one is coerced from a factor to tween two objects and, if that fails, provides infor- a character vector. This is more information than was mation about the sort of differences that exist. An al- provided by all.equal() and, in the particular case ternative approach, when two objects are not equal, of comparing student answers to model answers, it is to try to transform the objects to make them equal, tells us a lot about how close the student got to the and report on which transformations were necessary right answer. in order to achieve equality. As an example of the difference between these ap- > library(compare) > compare(obj1, obj2, allowAll=TRUE) proaches, consider the two objects below: a character vector and a factor. TRUE coerced from to > obj1 <- c("a", "a", "b", "c") > obj1 Another limitation of all.equal() is that it does not report on some other possible differences between [1] "a" "a" "b" "c" objects. For example, it is possible for a student to have the correct values for an R object, but have the > obj2 <- factor(obj1) > obj2 values in the wrong order. Another common mistake is to get the case wrong in a set of string values (e.g., [1] a a b c in a character vector or in the names attribute of an Levels: a b c object). In summary, while all.equal() provides some The all.equal() function reports that these objects desirable features for comparing student answers to are different because they differ in terms of their fun- model answers, we can do better by allowing for a damental mode—one has attributes and the other wider range of differences between objects and by does not—and because each object is of a different taking a different approach that attempts to trans- class. form the student answer to be the same as the model answer, if at all possible, while reporting which > all.equal(obj1, obj2) transformations were necessary.

R News ISSN 1609-3631 Vol. 8/2, October 2008 42

The remainder of this article describes the com- Of course, transforming an object is not guaranteed pare package, which provides functions for produc- to produce identical objects if the original objects are ing these sorts of comparisons. genuinely different. > compare(obj1, obj2[1:3], coerce=TRUE) The compare() function FALSE coerced from to The main function in the compare package is the Notice, however, that even though the comparison compare() function. This function checks whether failed, the result still reports the transformation that two objects are the same and, if they are not, car- was attempted. This result indicates that the compar- ries out various transformations on the objects and ison object was converted from a factor (to a charac- checks them again to see if they are the same after ter vector), but it still did not end up being the same they have been transformed. as the model object. By default, compare() only succeeds if the two A number of other transformations are available objects are identical (using the identical() func- in addition to coercion. For example, differences in tion) or the two objects are numeric and they are length, like in the last case, can also be ignored. equal (according to all.equal()). If the objects are not the same, no transformations of the objects are > compare(obj1, obj2[1:3], considered. In other words, by default, compare() is + shorten=TRUE, coerce=TRUE) simply a convenience wrapper for identical() and TRUE all.equal(). As a simple example, the following coerced from to comparison takes account of the fact that the values shortened model being compared are numeric and uses all.equal() It is also possible to allow values to be sorted, or rather than identical(). rounded, or to convert all character values to upper > compare(0.3 - 0.2, 0.1) case (i.e., ignore the case of strings). Table 1 provides a complete list of the transfor- TRUE mations that are currently allowed (in version 0.2 of compare) and the arguments that are used to enable Transformations them. A further argument to the compare() function, The more interesting uses of compare() involve spec- allowAll, controls the default setting for most of ifying one or more of the arguments that allow trans- these transformations, so specifying allowAll=TRUE formations of the objects that are being compared. is a quick way of enabling all possible transforma- For example, the coerce argument specifies that the tions. Specific transformations can still be excluded by second argument may be coerced to the class of the explicitly setting the appropriate argument to FALSE. first argument. This allows for more flexible compar- The equal argument is a bit of a special case be- isons such as between a factor and a character vector. cause it is TRUE by default, whereas almost all oth- ers are FALSE. The equal argument is also especially > compare(obj1, obj2, coerce=TRUE) influential because objects are compared after every transformation and this argument controls what sort TRUE of comparison takes place. Objects are always com- coerced from to pared using identical() first, which will only suc- ceed if the objects have exactly the same representa- It is important to note that there is a definite or- tion in memory. If the test using identical() fails der to the objects; the model object is given first and equal=TRUE, then a more lenient comparison is and the comparison object is given second. Transfor- also performed. 
By default, this just means that nu- mations attempt to make the comparison object like meric values are compared using all.equal(), but the model object, though in a number of cases (e.g., various other arguments can extend this to allow when ignoring the case of strings) the model object things like differences in case for character values may also be transformed. In the example above, the (see the asterisked arguments in Table 1). comparison object has been coerced to be the same The round argument is also special because it al- class as the model object. The following code demon- ways defaults to FALSE, even if allowAll=TRUE. This strates the effect of reversing the order of the objects means that the round argument must be specified ex- in the comparison. Now the character vector is being plicitly in order to enable rounding. The default is coerced to a factor. set up this way because the value of the round argu- > compare(obj2, obj1, coerce=TRUE) ment is either FALSE or an integer value specifying the number of decimal places to round to. For this TRUE argument, the value TRUE corresponds to rounding coerced from to to zero decimal places.

R News ISSN 1609-3631 Vol. 8/2, October 2008 43

Table 1: Arguments to the compare() function that control which transformations are attempted when com- paring a model object to a comparison object. Argument Meaning equal Compare objects for “equality” as well as “identity” (e.g., use all.equal() if model object is numeric). coerce Allow coercion of comparison object to class of model object. shorten Allow either the model or the comparison to be shrunk so that the objects have the same “size”. ignoreOrder Ignore the original order of the comparison and model objects; allow both comparison object and model object to be sorted. ignoreNameCase Ignore the case of the names attribute for both compar- ison and model objects; the name attributes for both ob- jects are converted to upper case. ignoreNames Ignore any differences in the names attributes of the comparison and model objects; any names attributes are dropped. ignoreAttrs Ignore all attributes of both the comparison and model objects; all attributes are dropped. round∗ Allow numeric values to be rounded; either FALSE (the default), or an integer value giving the number of deci- mal places for rounding, or a function of one argument, e.g., floor. ignoreCase∗ Ignore the case of character vectors; both comparison and model are converted to upper case. trim∗ Ignore leading and trailing spaces in character vec- tors; leading and trailing spaces are trimmed from both comparison and model. ignoreLevelOrder∗ Ignore original order of levels of factor objects; the lev- els of the comparison object are sorted to the order of the levels of the model object. dropLevels∗ Ignore any unused levels in factors; unused levels are dropped from both comparison and model objects. ignoreDimOrder Ignore the order of dimensions in array, matrix, or table objects; the dimensions are reordered by name. ignoreColOrder Ignore the order of columns in data frame objects; the columns in the comparison object are reordered to match the model object. ignoreComponentOrder Ignore the order of components in a list object; the com- ponents are reordered by name. ∗These transformations only occur if equal=TRUE

R News ISSN 1609-3631 Vol. 8/2, October 2008 44

Finally, there is an additional argument colsOnly > model <- for comparing data frames. This argument con- + data.frame(x=1:26, trols whether transformations are only applied to + y=letters, columns (and not to rows). For example, by default, + z=factor(letters), a data frame will only allow columns to be dropped, + row.names=letters, + stringsAsFactors=FALSE) but not rows, if shorten=TRUE. Note, however, that ignoreOrder means ignore the order of rows for data The comparison object contains essentially the same frames and ignoreColOrder must be used to ignore information, except that there is an extra column, the the order of columns in comparisons involving data column names are uppercase rather than lowercase, frames. the columns are in a different order, the y variable is a factor rather than a character vector, and the z vari- able is a character variable rather than a factor. The y The compareName() function variable and the row names are also uppercase rather The compareName() function offers a slight variation than lowercase. on the compare() function. > comparison <- For this function, only the name of the compari- + data.frame(W=26:1, son object is specified, rather than an explicit object. + Z=letters, The advantage of this is that it allows for variations + Y=factor(LETTERS), + X=1:26, in case in the names of objects. For example, a stu- + row.names=LETTERS, indianMothers dent might create a variable called + stringsAsFactors=FALSE) rather than the desired IndianMothers. This case- insensitivity is enabled via the ignore.case argu- The compare() function can detect that these two ob- ment. jects are essentially the same as long as we reorder Another advantage of this function is that it is the columns (ignoring the case of the column names), possible to specify, via the compEnv argument, a par- coerce the y and z variables, drop the extra variable, ticular environment to search within for the compar- ignore the case of the y variable, and ignore the case ison object (rather than just the current workspace). of the row names. This becomes useful when checking the answers > compare(model, comparison, allowAll=TRUE) from several students because each student’s an- TRUE swers may be generated within a separate environ- renamed ment in order to avoid any interactions between code reordered columns from different students. [Y] coerced from to The following code shows a simple demonstra- [Z] coerced from to tion of this function, where a comparison object is shortened comparison created within a temporary environment and the [Y] ignored case name of the comparison object is upper case when renamed rows it should be lowercase. Notice that we have used allowAll=TRUE to allow compare() to attempt all possible transformations at > tempEnv <- new.env() its disposal. > with(tempEnv, X <- 1:10) > compareName(1:10, "x", compEnv=tempEnv) Comparing files of R code TRUE renamed object Returning now to the original motivation for the compare package, the compare() function provides Notice that, as with the transformations in an excellent basis for determining not only whether compare() compareName() , the function records a student’s answers are correct, but also how much whether it needed to ignore the case of the name incorrect answers differ from the model answer. of the comparison object. As described earlier, submissions by students in the STATS 220 course consist of files of R code. 
Mark- A pathological example ing these submissions consists of using source() to run the code, then comparing the resulting objects This section shows a manufactured example that with model answer objects. With approximately 100 demonstrates some of the flexibility of the compare() students in the STATS 220 course, with weekly labs, function. and with multiple questions per lab, each of which We will compare two data frames that have a may contain more than one R object, there is a rea- number of simple differences. The model object is sonable marking burden. Consequently, there is a a data frame with three columns: a numeric vector, a strong incentive to automate as much of the marking character vector, and a factor. process as possible.

R News ISSN 1609-3631 Vol. 8/2, October 2008 45

The compareFile() function We can provide extra arguments to allow trans- formations of the student’s answers, as in the follow- The compareFile() function can be used to run R ing code: code from a specific file and compare the results with a set of model answers. This function requires three pieces of information: the name of a file containing > compareFile(file.path("Examples", the “comparison code”, which is run within a local + "student1.R"), environment, using source(), to generate the com- + modelNames, parison values; a vector of “model names”, which + file.path("Examples", are the names of the objects that will be looked for in + "model.R"), the local environment after the comparison code has + allowAll=TRUE) been run; and the model answers, either as the name of a binary file to load(), or as the name of a file of $id R code to source(), or as a list object containing the TRUE ready-made model answer objects. Any argument to compare() may also be in- $age cluded in the call. TRUE Once the comparison code has been run, $edu compareName() is called for each of the model names TRUE and the result is a list of "comparison" objects. As a simple demonstration, consider the basic $class questions shown in Figure 1. The model names in TRUE this case are the following: reordered levels

> modelNames <- c("id", "age", $IndianMothers + "edu", "class", FALSE + "IndianMothers") object not found

One student’s submission for this exercise is in a file called student1.R, within a directory called This shows that, although the student’s answer for Examples. The model answer is in a file called the class object was not perfect, it was pretty close; model.R in the same directory. We can evaluate this it just had the levels of the factor in the wrong order. student’s submission and compare it to the model answer with the following code:

> compareFile(file.path("Examples", The compareFiles() function + "student1.R"), + modelNames, The compareFiles() function builds on + file.path("Examples", compareFile() by allowing a vector of compari- + "model.R")) son file names. This allows a whole set of student submissions to be tested at once. The result of this $id function is a list of lists of "comparison" objects and TRUE a special print method provides a simplified view of $age this result. TRUE Continuing the example from above, the Examples directory contains submissions from a fur- $edu ther four students. We can compare all of these TRUE submissions with the model answers and produce a summary of the results with a single call to $class compareFiles(). The appropriate code and output FALSE are shown in Figure 2. $IndianMothers The results show that most students got the first FALSE three problems correct. They had more trouble get- object not found ting the fourth problem right, with one getting the factor levels in the wrong order and two others pro- This provides a strict check and shows that the stu- ducing a character vector rather than a factor. Only dent got the first three problems correct, but the one student, student2, got the final problem exactly last two wrong. In fact, the student’s code com- right and only one other, student4, got essentially pletely failed to generate an object with the name the right answer, though this student spelt the name IndianMothers. of the object wrong.

R News ISSN 1609-3631 Vol. 8/2, October 2008 46

> files <- list.files("Examples", + pattern="^student[0-9]+[.]R$", + full.names=TRUE) > results <- compareFiles(files, + modelNames, + file.path("Examples", "model.R"), + allowAll=TRUE, + resultNames=gsub("Examples.|[.]R", "", files)) > results

id age edu class IndianMothers student1 TRUE TRUE TRUE TRUE reordered levels FALSE object not found student2 TRUE TRUE TRUE TRUE TRUE student3 TRUE TRUE TRUE TRUE coerced from to FALSE object not found student4 TRUE TRUE TRUE TRUE coerced from to TRUE renamed object student5 TRUE TRUE TRUE FALSE object not found FALSE object not found

Figure 2: Using the compareFiles() function to run R code from several files and compare the results to model objects. The result of this sort of comparison can easily get quite wide, so it is often useful to print the result with options(width) set to some large value and using a small font, as has been done here.

Assigning marks and comparison is FALSE. giving feedback > q2 <- + questionMarks("IndianMothers", The result returned by compareFiles() is a list of + maxMark=1, lists of comparison results, where each result is itself + rule("IndianMothers", 1)) a list of information including whether two objects are the same and a record of how the objects were The first question from Figure 1 provides a more transformed during the comparison. This represents complex example. In this case, there are four dif- a wealth of information with which to assess the per- ferent objects involved and the maximum mark is 2. formance of students on a set of R exercises, but it The rules below specify that any FALSE comparison can be a little unwieldly to deal with. drops a mark and that, for the comparison involving the object named "class", a mark should also be de- The compare package provides further functions ducted if coercion was necessary to get a TRUE result. that make it easier to deal with this information for the purpose of determining a final mark and for > q1 <- the purpose of providing comments for each student + questionMarks( submission. + c("id", "age", "edu", "class"), In order to determine a final mark, we use the + maxMark=2, questionMarks() function to specify which object + rule("id", 1), names are involved in a particular question, to pro- + rule("age", 1), vide a maximum mark for the question, and to spec- + rule("edu", 1), ify a set of rules that determine how many marks + rule("class", 1, + transformRule("coerced", 1))) should be deducted for various deviations from the correct answers. Having set up this marking scheme, marks are gener- The rule() function is used to define a marking ated using the markQuestions() function, as shown rule. It takes an object name, a number of marks by the following code. to deduct if the comparison for that object is FALSE, plus any number of transformation rules. The latter > markQuestions(results, q1, q2) are generated using the transformRule() function, which associates a regular expression with a num- id-age-edu-class IndianMothers ber of marks to deduct. If the regular expression is student1 2 0 student2 2 1 matched in the record of transformations for a com- student3 1 0 parison, then the appropriate number of marks are student4 1 1 deducted. student5 1 0 A simple example, based on the second ques- tion in Figure 1, is shown below. This specifies For the first question, the third and fourth students that the question only involves an object named lose a mark because of the coercion, and the fifth stu- IndianMothers, that there is a maximum mark of 1 dent loses a mark because he has not generated the for this question, and that 1 mark is deducted if the required object.

R News ISSN 1609-3631 Vol. 8/2, October 2008 47

A similar suite of functions are provided to asso- both to check whether they have the correct answer ciate comments, rather than mark deductions, with and to provide feedback about how their answer dif- particular transformations. The following code pro- fers from the model answer. More generally, the vides a simple demonstration. compare() function may have application wherever the identical() and all.equal() functions are cur- > q1comments <- + questionComments( rently in use. For example, it may be useful when + c("id", "age", "edu", "class"), debugging code and for performing regression tests + comments( as part of a quality control process. + "class", Obvious extensions of the compare package in- + transformComment( clude adding new transformations and providing + "coerced", comparison methods for other classes of objects. + "’class’ is a factor!"))) More details about how the package works and how > commentQuestions(results, q1comments) these extensions might be developed are discussed id-age-edu-class in the vignette, “Fundamentals of the Compare Pack- student1 "" age”, which is installed as part of the compare pack- student2 "" age. student3 "’class’ is a factor!" student4 "’class’ is a factor!" student5 "" Acknowledgements In this case, we have just generated feedback for the students who generated a character vector instead of Many thanks to the editors and anonymous review- the desired factor in Question 1 of the exercise. ers for their useful comments and suggestions, on both this article and the compare package itself. Summary, discussion, and future directions Bibliography

The compare package is based around the compare() K. Hornik. The R FAQ, 2008. URL http://CRAN. function, which compares two objects for equality R-project.org/doc/FAQ/R-FAQ.html. ISBN 3- and, if they are not equal, attempts to transform the 900051-08-9. objects to make them equal. It reports whether the comparison succeeded overall and provides a record R Development Core Team. R: A Language and Envi- of the transformations that were attempted during ronment for Statistical Computing. R Foundation for the comparison. Statistical Computing, Vienna, Austria, 2008. URL Further functions are provided on top of the http://www.R-project.org. ISBN 3-900051-07-0. compare() function to facilitate marking exercises where students in a class submit R code in a file to create a set of R objects. Paul Murrell This article has given some basic demonstrations Department of Statistics of the use of the compare() package for comparing University of Auckland objects and marking student submissions. The pack- New Zealand age could also be useful for the students themselves, [email protected]

R News ISSN 1609-3631 Vol. 8/2, October 2008 48 mvna: An R Package for the Nelson–Aalen Estimator in Multistate Models by A. Allignol, J. Beyersmann and M. Schumacher id from to time 1 0 2 4 2 1 2 10 Introduction 3 1 2 2 4 1 2 49 The multivariate Nelson–Aalen estimator of cumula- 5 1 0 36 tive transition hazards is the fundamental nonpara- 5 0 2 47 metric estimator in event history analysis. How- 749 1 0 11 ever, there has not been a multivariate Nelson–Aalen 749 0 cens 22 R–package available, and the same appears to be true for other statistical softwares, like SAS or Stata. id is the patient identification number, from is the As the estimator is algebraically simple, its uni- state from which a transition occurs, to is the state to variate entries are easily computed in R using the which a transition occurs and time is the transition survfit function of the package survival (Therneau time. Left–truncation, e.g., due to delayed entry into and Grambsch, 2001; Lumley, 2004), but other val- the study cohort, can be handled replacing time with ues returned by survfit may be meaningless in an entry and exit time within a state. One need to the multivariate setting and consequently mislead- specify the state names, the possible transitions, and ing (Kalbfleisch and Prentice, 2002, Chapter VIII.2.4). the censoring code. For example, patient 749 is right– In addition, the computations become cumbersome censored at time 22. for even a moderate number of transition types, like xyplot.mvna plots the cumulative hazard es- 4 or 5. timates in a lattice plot (Sarkar, 2002), along For instance, time–dependent covariates may with pointwise confidence intervals, with possible easily be incorporated into a proportional haz- log or arcsin transformations to improve the ap- ards analysis, but, in general, this approach will proximation to the asymptotic distribution. The not result in a model for event probabilities any predict.mvna function gives Nelson–Aalen esti- more (Kalbfleisch and Prentice, 2002, Chapter VI.3). mates at time points given by the user. The multivariate Nelson–Aalen estimate straightfor- Two random samples of intensive care unit data wardly illustrates the analysis, while popular strati- are also included within the mvna package (Beyers- fied Kaplan-Meier plots lack a probability interpre- mann et al., 2006a). One of these data sets focuses on tation (e.g., Feuer et al. (1992); Beyersmann et al. the effect of pneumonia on admission on the hazard (2006b)). of discharge and death, respectively. The other one The article is organised as follows: First we de- contains data to gauge the influence of ventilation (a scribe the main features of the mvna package. Next time–dependent covariate) on the hazard of end of we use the package to illustrate the analysis of a hospital stay (combined endpoint death/discharge). time–dependent covariate. We briefly discuss the rel- ative merits of the available variance estimators, and conclude with a summary. Illustration

Package description The aim of the present section is to anal- yse the influence of a time-dependent covari- The mvna contains the following functions: ate, namely ventilation, which is binary and re- versible. We use the model of Figure 1 to anal- • mvna yse the data, where ventilation is considered • xyplot.mvna as an intermediate event in an illness–death– model (Andersen et al., 1993, Chapter I.3.3). • print.mvna Ventilation • predict.mvna ¨ H ¨¨¨* 1 H ¨¨¨ H The main function, mvna, computes the Nelson– ¨ ¨ HH ¨¨ ¨ HHj Aalen estimates at each of the observed event times, ¨¨ and the two variance estimators described in eq. 0 - 2 (4.1.6) and (4.1.7) of Andersen et al. (1993). mvna takes as arguments a transition–oriented data.frame, No Ventilation End–of–Stay where each row represents a transition: Figure 1: Model to assess the influence of ventilation.

R News ISSN 1609-3631 Vol. 8/2, October 2008 49

The following call computes the estimated cumu- Variance estimation lative hazards along with the estimated variance mvna computes two variance estimators based on the > data(sir.cont) optional variation process (Andersen et al., 1993, eq. > na.cont <- mvna(sir.cont,c("0","1","2"), (4.1.6)) and the predictable variation process (Ander- + tra,"cens") sen et al., 1993, eq. (4.1.7)) ‘attached’ to the Nelson- Aalen estimator, respectively. For standard survival tra being a quadratic matrix of logical specifying data, Klein (1991) found that the (4.1.6) estimator the possible transitions: tended to overestimate the true variance, but that the bias was of no practical importance for risk sets ≥ 5. > tra He found that the (4.1.7) estimator tended to under- 0 1 2 estimate the true variance and had a smaller mean 0 FALSE TRUE TRUE squared error. The two estimators appear to coin- 1 TRUE FALSE TRUE cide from a practical point of view for risk sets ≥ 10. 2 FALSE FALSE FALSE Small risk sets are a concern in multistate modelling because of multiple transient states. In a preliminary To assess the influence of the time–dependent investigation, we found comparable results in a mul- covariate ventilation on the hazard of end–of–stay, tistate model simulation. As in Klein (1991), we rec- we are specifically interested in the transitions from ommend the use of the (4.1.6) estimator, which was state “No ventilation” to “End–of–stay” and from found to have a smaller absolute bias. state “Ventilation” to “End–of–stay”. We specify to plot only these cumulative hazard estimates with the tr.choice option of the xyplot.mvna function. Summary

> xyplot(na.cont,tr.choice=c("0 2","1 2"), The mvna package provides a way to easily estimate + aspect=1,strip=strip.custom(bg="white", and display the cumulative transition hazards from a + factor.levels= time–inhomogeneous Markov multistate model. The + c("No ventilation -- Discharge/Death", estimator may remain valid under even more gen- + "Ventilation -- Discharge/Death"), eral assumptions; we refer to Andersen et al. (1993, + par.strip.text=list(cex=0.9)), + scales=list(alternating=1),xlab="Days", Chapter III.1.3) and Glidden (2002) for mathemat- + ylab="Nelson-Aalen estimates") ically detailed accounts. We hope that the mvna package will help to promote a computationally sim- ple estimator that is extremely useful in illustrating

No ventilation −− Discharge/Death Ventilation −− Discharge/Death and understanding complex event history processes, but that is underused in applications. We have il- lustrated the usefulness of the package for visualis- 10 ing the impact of a time–dependent covariate on sur- vival. We also wish to mention that the package sim- ilarly illustrates standard Cox competing risks anal- 5

Nelson−Aalen estimates yses in a straightforward way, whereas the usual cu- mulative incidence function plots don’t; this issue is

0 pursued in Beyersmann and Schumacher (2008). In

0 50 100 150 0 50 100 150 closing, we also wish to mention the very useful R- Days package muhaz (Gentleman) for producing a smooth estimate of the survival hazard function, which also Figure 2: Nelson–Aalen estimates for the transitions allows to estimate the univariate Nelson–Aalen es- “No ventilation” to “End–of–Stay”, and “Ventila- timator subject to right-censoring, but not to left- tion” to “End–of–Stay”. truncation.

We conclude from Figure 2 that ventilation pro- longs the length of ICU stay since it reduces the haz- Bibliography ard of end–of–stay. Fig. 2 is well suited to illustrate a Cox analysis of P. K. Andersen, Ø. Borgan, R. D. Gill, and N. Kei- ventilation as a time–dependent covariate (Therneau ding. Statistical models based on counting processes. and Grambsch, 2001). Taking no ventilation as the Springer-Verlag, New-York, 1993. baseline, we find a hazard ratio of 0.159 (95% CI, 0.132–0.193), meaning that ventilation prolongs hos- J. Beyersmann and M. Schumacher. Time-dependent pital stay. Note that, in general, probability plots are covariates in the proportional subdistribution haz- not available to illustrate the analysis. ards model for competing risks. Biostatistics, 2008.

R News ISSN 1609-3631 Vol. 8/2, October 2008 50

J. Beyersmann, P. Gastmeier, H. Grundmann, S. Bär- Analysis of Failure Time Data. John Wiley & Sons, wolff, C. Geffers, M. Behnke, H. Rüden, and 2002. M. Schumacher. Use of Multistate Models to As- sess Prolongation of Intensive Care Unit Stay Due J. P. Klein. Small sample moments of some estimators to Nosocomial Infection. Infection Control and Hos- of the variance of the Kaplan–Meier and Nelson– pital Epidemiology, 27:493–499, 2006a. Aalen estimators. Scandinavian Journal of Statistics, 18:333–340, 1991. J. Beyersmann, T. Gerds, and M. Schumacher. Let- T. Lumley. The survival package. R News, 4(1):26– ter to the editor: comment on ‘Illustrating the im- 28, June 2004. URL http://CRAN.R-project.org/ pact of a time-varying covariate with an extended doc/Rnews/. Kaplan-Meier estimator’ by Steven Snapinn, Qi Jiang, and Boris Iglewicz in the November 2005 D. Sarkar. Lattice. R News, 2(2):19–23, June 2002. URL issue of The American Statistician. The American http://CRAN.R-project.org/doc/Rnews/. Statistician, 60(30):295–296, 2006b. T. Therneau and P. Grambsch. Modeling Survival E. J. Feuer, B. F. Hankey, J. J. Gaynor, M. N. Wesley, Data: Extending the Cox Model. Statistics for biol- S. G. Baker, and J. S. Meyer. Graphical represen- ogy and health. Springer, New York, 2001. tation of survival curves associated with a binary non-reversible time dependent covariate. Statistics in Medicine, 11:455–474, 1992. Arthur Allignol Jan Beyersmann R. Gentleman. muhaz: Hazard Function Estimation in Martin Schumacher Survival Analysis. R package version 1.2.3. Freiburg Center for Data Analysis and Modelling, Uni- D. Glidden. Robust inference for event probabili- versity of Freiburg, Germany. ties with non-Markov data. Biometrics, 58:361–368, Institute of Medical Biometry and Medical Informatics, 2002. University Medical Centre Freiburg, Germany.

J. D. Kalbfleisch and R. L. Prentice. The Statistical [email protected]

R News ISSN 1609-3631 Vol. 8/2, October 2008 51

Programmers’ Niche: The Y of R Vince Carey 10 computes ∑ k without iteration and without explicit k=1 self-reference. Introduction Note that csum is not recursive. It is a function that accepts a function f and returns a function of Recursion is a general method of function definition one argument. Also note that if the argument passed involving self-reference. In R, it is easy to create re- to csum is the “true” cumulative sum function, which cursive functions. The following R function com- we’ll denote K, then csum(K)(n) will be equal to putes the sum of numbers from 1 to n inclusive: K(n). In fact, since we “know” that the function s de- > s <- function(n) { fined above correctly implements cumulative sum- + if (n == 1) return(1) mation, we can supply it as an argument to csum: + return(s(n-1)+n) + } > csum(s)(100)

Illustration: [1] 5050 > s(5) and we can see that

[1] 15 > csum(s)(100) == s(100) [1] TRUE We can avoid the fragility of reliance on the func- tion name defined as a global variable, by invoking We say that a function f is a fixed point of a func- the function through Recall instead of using self- tional F if F( f )(x) = f (x) for all relevant x. Thus, reference. in the calculation above example, we exhibit an in- It is easy to accept the recursion/Recall facility stance of the fact that s is a fixed point of csum. This as a given feature of computer programming lan- is an illustration of the more general fact that K (the guages. In this article we sketch some of the ideas true cumulative summation function) is a fixed point that allow us to implement recursive computation of csum as defined in R. without named self-reference. The primary inspira- Now we come to the crux of the matter. Y com- tion for these notes is the article “The Why of Y” by putes a fixed point of csum. For example: Richard P. Gabriel (Gabriel, 1988). Notation. Throughout we interpret = and = as the > csum(Y(csum))(10) == Y(csum)(10) symbol for mathematical identity, and use <- to de- [1] TRUE note programmatic assignment, with the exception of notation for indices in algebraic summations (e.g., We will show a little later that Y is the applicative- n order fixed point operator for functionals F, mean- ∑ ), to which we give the usual interpretation. i=1 ing that Y(F)(x) = F(Y(F))(x) for all suitable argu- ments x. This, in conjunction with a uniqueness the- orem for “least defined fixed points for functionals”, The punch line allows us to argue that K (a fixed point of csum) and Y(csum) perform equivalent computations. Since we We can define a function known as the applicative- can’t really implement K (it is an abstract mathemat- order fixed point operator for functionals as ical object that maps, in some unspecified way, the > Y <- function(f) { n number n to ∑ k) it is very useful to know that we + g <- function(h) function(x) f(h(h))(x) k=1 + g(g) can (and did) implement an equivalent function in R. + } The following function can be used to compute cu- Peeking under the hood mulative sums in conjunction with Y: > csum <- function(f) function(n) { One of the nice things about R is that we can interac- + if (n < 2) return(1); tively explore the software we are using, typically by + return(n+f(n-1)) mentioning the functions we use to the interpreter. + } > s This call function(n) { > Y(csum)(10) if (n == 1) return(1) return(s(n-1)+n) [1] 55 }


We have argued that Y(csum) is equivalent to K, so it seems reasonable to define

> K <- Y(csum)
> K(100)
[1] 5050

but when we mention this to the interpreter, we get

> K
function (x)
f(h(h))(x)

which is not very illuminating. We can of course drill down:

> ls(environment(K))
[1] "h"
> H <- get("h", environment(K))
> H
function (h)
function(x) f(h(h))(x)
> get("f", environment(H))
function(f) function(n) {
    if (n < 2) return(1)
    return(n + f(n-1))
}
> get("g", environment(H))
function (h)
function(x) f(h(h))(x)

The lexical scoping of R (Gentleman and Ihaka, 2000) has given us a number of closures (functions accompanied by variable bindings in environments) that are used to carry out the concrete computation specified by K. csum is bound to f, and the inner function of Y is bound to both h and g.

Self-reference via self-application

Before we work through the definition of Y, we briefly restate the punch line: Y transforms a functional having K as fixed point into a function that implements a recursive function that is equivalent to K. We want to understand how this occurs. Define

> ss <- function(f) function(n) {
+     if (n == 1) return(1)
+     return(n + f(f)(n-1))
+ }
> s(100)   # s() defined above
[1] 5050
> ss(ss)(100)
[1] 5050

s is intuitively a recursive function with behavior equivalent to K, but relies on the interpretation of s in the global namespace at time of execution. ss(ss) computes the same function, but avoids use of global variables. We have obtained a form of self-reference through self-application. This could serve as a reasonable implementation of K, but it lacks the attractive property possessed by csum that csum(K)(x) = K(x). We want to be able to establish functional self-reference like that possessed by ss, but without requiring the seemingly artificial self-application that is the essence of ss.

Note that csum can be self-applied, but only under certain conditions:

> csum(csum(csum(csum(88))))(4)
[1] 10

If the outer argument exceeded the number of self-applications, such a call would fail. Curiously, the argument to the innermost call is ignored.
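The innermost argument is ignored because the recursion bottoms out in the n < 2 branch before f is ever consulted; a one-line check (added here, not in the original article):

> csum("anything")(1)
[1] 1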

Y

We can start to understand the function of Y by expanding the definition with argument csum. The first line of the body of Y becomes

g <- function(h) function(x) csum(h(h))(x)

The second line can be rewritten:

g(g) = function(x) csum(g(g))(x)

because by evaluating g on argument g we get to remove the first function instance and substitute g for h.

If we view the last “equation” syntactically, and pretend that g(g) is a name, we see that Y(csum) has created something that looks a lot like an instance of recursion via named self-reference. Let us continue with this impression with some R:

> cow <- function(x) csum(cow)(x)
> cow(100)
[1] 5050

Y has arranged bindings for constituents of g in its definition so that the desired recursion occurs without reference to global variables.

We can now make good on our promise to show that Y satisfies the fixed point condition Y(F)(x) = F(Y(F))(x). Here we use an R-like notation to express formal relationships between constructs described above. The functional F is of the form

F = function(f) function(x) m(f,x)

where m is any R function of two parameters, of which the first may be a function. For any R function r,

F(r) = function(x) m(r,x)

Now the expression Y(F) engenders

g = function(h) function(x) F(h(h))(x)

and returns the value g(g). This particular expression binds g to h, so that the value of Y(F) is in fact

function(x) F(g(g))(x)

Because g(g) is the value of Y(F) we can substitute in the above and find

Y(F) = function(x) Y(F)(x) = function(x) F(Y(F))(x)

as claimed.
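As a concrete check of this identity (an illustration added here, not in the original article), take m to be a factorial-style kernel:

> m <- function(f, x) if (x < 2) 1 else x * f(x-1)
> F <- function(f) function(x) m(f, x)
> Y(F)(5)
[1] 120
> Y(F)(5) == F(Y(F))(5)
[1] TRUE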

Exercises

1. There is nothing special about recursions having the simple form of csum discussed above. Interpret the following mystery function and show that it can be modified to work recursively with Y.

> myst <- function(x) {
+     if (length(x) == 0)
+         return(x)
+     if (x[1] %in% x[-1])
+         return(myst(x[-1]))
+     return(c(x[1], myst(x[-1])))
+ }

2. Extend Y to handle mutual recursions such as Hofstadter’s male-female recursion:

F(0) = 1, M(0) = 0,
F(n) = n − M(F(n − 1)),
M(n) = n − F(M(n − 1)).

3. Improve R’s debugging facility so that debugging K does not lead to an immediate error.
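Exercise 1 can be approached by abstracting the self-reference into a functional argument, exactly as with csum; the following solution sketch is added here and is not part of the original article. (myst removes earlier duplicates, keeping the last occurrence of each value.)

> mystF <- function(f) function(x) {
+     if (length(x) == 0) return(x)
+     if (x[1] %in% x[-1]) return(f(x[-1]))
+     return(c(x[1], f(x[-1])))
+ }
> Y(mystF)(c(1, 2, 1, 3, 2))
[1] 1 3 2
> unique(c(1, 2, 1, 3, 2), fromLast = TRUE)   # same behavior
[1] 1 3 2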

Bibliography

R. Gabriel. The why of Y. ACM SIGPLAN Lisp Pointers, 2:15–25, 1988.

R. Gentleman and R. Ihaka. Lexical scope and statistical computing. JCGS, 9:491–508, 2000.


Changes in R Version 2.8.0

by the R Core Team

Significant user-visible changes

• var(), cov(), cor(), sd() etc. now by default (when ’use’ is not specified) return NA in many cases where they signalled an error before. (A short transcript at the end of the New features list below illustrates this.)

New features

• abbreviate() gains an optional argument ’strict’ allowing cheap and fast strict abbreviation.

• The "lm" methods of add1(), anova() and drop1() warn if they are mis-used on an essentially exact fit.

• as.array() is now generic, and gains a ’...’ argument.

• New function as.hexmode() for converting integers in hex format. format.hexmode() and as.character.hexmode() gain an ’upper.case’ argument.

• bitmap() and dev2bitmap() gain support for anti-aliasing. The default type has been changed to ’png16m’, which supports anti-aliasing.

• Box.test() gains a ’fitdf’ argument to adjust the degrees of freedom if applied to residuals.

• browseURL() has a new argument ’encodeIfNeeded’ to use URLencode() in cases where it seems likely that would be helpful. (Unfortunately, those are hard to guess.)

• by() gains a ’simplify’ argument, passed to tapply().

• capabilities() gains a new argument "tiff" to report if tiff() is operational.

• chol2inv() now treats as a [1 x 1]-matrix.

• cov() and cor() have the option ’use = "everything"’ as default, and so does var() with its default ’na.rm = FALSE’. This returns NA instead of signalling an error for NA observations. Another new option is ’use = "na.or.complete"’, which is the default for var(*, na.rm = FALSE). var(double(0), na.rm = L) now returns NA instead of signalling an error, for both L = TRUE and FALSE, as one consequence of these changes.

• data.matrix() now tries harder to convert non-numeric columns, via as.numeric() or as(, "numeric").

• dev.interactive() is able to recognize the standard screen devices if getOption("device") is a function (as well as by name).

• dev.new() gains a ’...’ argument which can be used to pass named arguments which will be used if appropriate to the device selected.

• dimnames(x) <- value extends ’value’ if it is a list which is too short, and ’x’ is an array. This allows constructions such as dimnames(x)[[1]] <- 1:3 to work whether or not ’x’ already has dimnames.

• format(), formatC() and prettyNum() gain a new argument ’drop0trailing’ which can be used to suppress trailing "0"s.

• format() now works for environments; also print(env) and str(env) share the same code for environments.

• It is now possible to create and open a text-mode gzfile() connection by explicitly using e.g. open = "rt".

• New help.request() function for compiling an e-mail to R-help according to "the rules". It is built on the new utility create.post(), on which bug.report() is also based now; both thanks to a contribution by Heather Turner.

• help.search() now assumes that non-ASCII items are in latin1 if that makes sense (all known examples on CRAN are).

• HoltWinters() and decompose() use a (statistically) more efficient computation for seasonal fits (they used to waste one period).

• intToUtf8() and intToBits() now accept numeric vectors, truncating them to integers.

• is.unsorted() gains an argument ’strictly’. It now works for classed objects with a >= or > method (as incorrectly documented earlier).

• library() no longer warns about masking objects that are identical(.,.) to those they mask.

• lockBinding(), unlockBinding(), lockEnvironment() and makeActiveBinding() now all return invisibly (they always return NULL).

• mood.test() now behaves better in the presence of ties.


• na.action() now works on fits of classes "lm", "glm", ....

• optim(.., method = "SANN", .., trace = TRUE) is now customizable via the ’REPORT’ control argument, thanks to code proposals by Thomas Petzoldt.

• The ’factory-fresh’ defaults for options("device") have been changed to refer to the devices as functions in the grDevices namespace and not as names. This makes it more likely that the incorrect (since R 2.5.0) assumption in packages that get(getOption("device"))() will work will catch users of those packages.

• pch=16 now has no border (for consistency with 15, 17, 18) and hence is now different from pch=19.

• New functions rawConnection() and rawConnectionValue() allow raw vectors to be treated as connections.

• read.dcf() now consistently gives an error for malformed DCF.

• read.fwf() no longer passes its default for ’as.is’ to read.table(): this allows the latter’s default to be used.

• readBin(), writeBin(), readChar() and writeChar() now open a connection which was not already open in an appropriate binary mode rather than the default mode. readLines(), cat() and sink() now open a connection which was not already open in an appropriate text mode rather than the default mode.

• pdf() has new arguments ’useDingbats’ (set this to FALSE for use with broken viewers) and ’colormodel’. It now only references the ZapfDingbats font if it is used (for small opaque circles). The default PDF version is now 1.4, since viewers that do not accept that are now rare. Different viewers were rendering consecutive text() calls on a pdf() device in different ways where translucency was involved. The PDF generated has been changed to force each call to be rendered separately (which is the way xpdf or ghostscript was rendering, but Acrobat was forming a transparency group), which is consistent with other graphics devices supporting semi-transparency.

• readCitationFile() (and hence citation()) now reads a package’s CITATION file in the package’s declared encoding (if there is one).

• The behaviour of readLines() for incomplete final lines on binary-mode connections has been changed to be like blocking rather than non-blocking text-mode connections.

• A new reorder.character() method has been added. This allows use of ’reorder(x, ...)’ as a shorthand for ’reorder(factor(x), ...)’ when ’x’ is a character vector.

• round() now computes in long doubles where possible so the results are more likely to be correct to representation error.

• plot.dendrogram() has new arguments (xlim, ylim) which allow zooming into a hierarchical clustering dendrogram.

• plot.histogram() gains an ’ann’ argument. (Wish from Ben Bolker.)

• plot.lm() now warns when it omits points with leverage one from a plot.

• Plotmath now recognizes ’aleph’ and ’nabla’ (the Adobe Symbol ’gradient’ glyph) as symbol names.

• polyroot() no longer has a maximum degree.

• The alpha/alphamax argument of the ’nls’ profile() and ’mle’ methods is used to compute confidence limits for univariate t-statistics rather than a confidence region for all the parameters (and not just those being profiled).

• quantile.default() allows ’probs’ to stray just beyond [0, 1], to allow for computed values.

• rug() now uses axis()’s new arguments from 2.7.2, hence no longer draws an axis line.

• save() (optionally, but by default) checks for the existence of objects before opening the file/connections (wish of PR#12543).

• segments(), arrows() and rect() allow zero-length coordinates. (Wish of PR#11192.)

• set.seed(kind = NULL) now takes ’kind’ from a saved seed if the workspace has been restored or .Random.seed has been set in some other way. Previously it took the ’currently used’ value, which was "default" unless random numbers had been used in the current session. Similarly for the values reported by RNGkind(). (Related to PR#12567.) set.seed() gains a ’normal.kind’ argument.

• setEPS() and setPS() gain ’...’ to allow other arguments to be passed to ps.options(), including overriding ’width’ and ’height’.


• New setTimeLimit() function to set limits on the CPU and/or elapsed time for each top-level computation, and setSessionTimeLimit() to set limits for the rest of the session.

• splinefun() has a new method = "monoH.FC" for monotone Hermite spline interpolation.

• sprintf() optionally supports the %a/%A notation of C99 (if the platform does, including under Windows).

• str()’s default method gains a ’formatNum’ function argument which is used for formatting numeric vectors. Note that this is very slightly not backward compatible, and that its default may change before release.

• The summary() method for class "ecdf" now uses a print() method rather than printing directly.

• summary.manova() uses a stabler computation of the test statistics, and gains a ’tol’ argument to allow highly correlated responses to be explored (with probable loss of accuracy). Similar changes have been made to anova.mlm() and anova.mlmlist().

• Sweave() now writes concordance information inside a \Sconcordance LaTeX macro, which allows it to be inserted into PDF output.

• system.time() now uses lazy evaluation rather than eval/substitute, which results in more natural scoping. (PR#11169.)

• In table(), ’exclude = NULL’ now does something also for factor arguments. A new ’useNA’ argument allows you to control whether to add NA levels unconditionally or only when present in data. A new convenience function addNA() gives similar functionality by adding NA levels to individual factors. (See the transcript at the end of this list.)

• unlink() tries the literal pattern if it does not match with wildcards interpreted – this helps with e.g. unlink("a[b"), which previously needed to be unlink("a\\[b").

• update.packages() gains an argument ’oldPkgs’, where new.packages() and old.packages() get ’instPkgs’. These allow only subsets of packages to be considered instead of all installed ones.
• which(b) is somewhat faster now, notably for named vectors, thanks to a suggestion by Henrik Bengtsson.

• New generic function xtfrm() as an auxiliary helper for sort(), order() and rank(). This should return a numeric vector that sorts in the same way as its input. The default method supports any class with ==, > and is.na() methods, but specific methods can be much faster. As a side-effect, rank() will now work better on classed objects, although possibly rather slowly. (See the transcript below.)

• X11() and capabilities("X11") now catch some X11 I/O errors that previously terminated R. These were rare and have only been seen with a misconfigured X11 setup on some versions of X11.

• The handling of nuls in character strings has been changed – they are no longer allowed, and attempting to create such a string now gives a truncation warning (unless options("warnEscapes") is FALSE).

• The user environment and profile files can now be specified via environment variables ’R_ENVIRON_USER’ and ’R_PROFILE_USER’, respectively.

• ?pkg::topic and ?pkg:::topic now find help on ’topic’ from package ’pkg’ (and not help on :: or :::).

• ??topic now does help.search("topic"); variations such as ??pkg::topic or field??topic are also supported.

• There is support for using ICU (International Components for Unicode) for collation, enabled by configure option --with-ICU on a Unix-alike and by a setting in MkRules on Windows. Function icuSetCollate() allows the collation rules (including the locale) to be tuned. [Experimental.]

• If S4 method dispatch is on and S4 objects are found as attributes, show() rather than print() is used to print the S4 attributes.

• Starting package tcltk without access to Tk (e.g. no available display) is now a warning rather than an error, as Tcl will still be usable. (On most platforms it was possible to inhibit Tk by not having DISPLAY set, but not on Windows nor Mac OS X builds with --with-aqua.)

• Using $ on a non-subsettable object (such as a function) is now an error (rather than returning NULL).

• Hexadecimal numerical constants (such as 0xab.cdp+12) may now contain a decimal point.

• PCRE has been updated to version 7.8 (mainly bug fixes).

• plot.ecdf() now defaults to pch=19 so as to better convey the left-closed line segments.
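To make a few of these concrete, here is a short transcript added for illustration (not part of the original NEWS; output as expected under R 2.8.0):

> cor(c(1, 2, NA), c(1, 2, 3))      # NA rather than an error
[1] NA
> levels(addNA(factor(c("a", "b"))))  # NA added as a level
[1] "a" "b" NA
> xtfrm(factor(c("b", "a", "c")))   # numeric vector sorting like its input
[1] 2 1 3
> sprintf("%a", 0.5)                # C99 hex notation, if the platform supports it
[1] "0x1p-1"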


New features in package ’methods’

• S3 classes that are registered by a call to setOldClass() now have the S3 class as a special slot, and therefore so do any S4 classes that contain them. This mechanism is used to support S4 classes that extend S3 classes, to the extent possible. See ?Classes, ?setOldClass, and ?S3Class. The treatment of special pseudo-classes "matrix", "array", and "ts" as S4 classes has also been modified to be more consistent and, within limitations imposed by special treatment of these objects in the base code, to allow other classes to contain them. See class?ts. A general feature added to implement "ts" and also "data.frame" as S4 classes is that an S4 class definition can be supplied to setOldClass() when the S3 class has known attributes of known class. setOldClass() now saves all the S3 inheritance, allowing the calls to be built up in stages, rather than including all the S3 classes in each call. This also allows as(x, "S3") to generate valid S3 inheritance from the stored definition. See ?S3.

• S4 methods may now be defined corresponding to "...", by creating a generic function that has "..." as its signature. A method will be selected and called if all the arguments matching "..." are from this class or a subclass. See ?dotsMethods. (A minimal sketch follows this list.)

• New functions S3Part() and S3Class() provide access to the corresponding S3 object and class for S4 classes that extend either an S3 class or a basic R object type.

• show() now also shows the class name.
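A minimal sketch of "..." dispatch (an illustration added here, with a hypothetical generic name vsum; see ?dotsMethods for the authoritative examples):

> setGeneric("vsum", function(...) standardGeneric("vsum"))
[1] "vsum"
> setMethod("vsum", "numeric", function(...) sum(...))
> vsum(1, 2, 3.5)   # all arguments are numeric, so the method is selected
[1] 6.5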

Installation

• If sub-architectures are used, a copy of Rscript is installed in $R_HOME/bin/exec$R_ARCH (since that in $R_HOME/bin and /usr/bin might be overwritten in a subsequent installation).

Package installation

• ’LazyLoad: yes’ is now the default, so packages wanting to avoid lazy loading must set ’LazyLoad: no’ (or an equivalent value) in the DESCRIPTION file.

• R CMD INSTALL will now fail if it finds a non-executable ’configure’ script in the package – this usually indicates a file system with insufficient permissions. If a non-executable ’cleanup’ script is found and either --clean or --preclean is used, a warning is given.

Deprecated & defunct

• Use in packages of the graphics headers Rdevices.h and Rgraphics.h is defunct: they are no longer installed.

• options("par.ask.default") is defunct in favour of "device.ask.default".

• The ’device-independent’ family "symbol" is defunct: use font=5 (base) or fontface=5 (grid) instead.

• gammaCody() is defunct.

• par("gamma") is defunct.

• ’methods’ package functions getAccess(), getAllMethods(), getClassName(), getClassPackage(), getExtends(), getProperties(), getPrototype(), getSubclasses(), getVirtual(), mlistMetaName(), removeMethodsObject() and seemsS4Object() are defunct.

• Use of a non-integer .Random.seed is now an error. (R itself has never generated such values, but user code has, and R >= 2.6.0 has given a warning.)

• methods::allGenerics() is deprecated.

• In package installation, ’SaveImage: yes’ is now ignored, and any use of the field will give a warning.

• unserialize() no longer accepts character strings as input.

• The C macro ’allocString’ has been removed – use ’mkChar’ and variants.

• Use of allocVector(CHARSXP ...) is deprecated and gives a warning.

Utilities

• The default for ’stylepath’ in Sweave’s (default) RweaveLatex driver is now FALSE rather than TRUE if SWEAVE_STYLEPATH_DEFAULT is unset: see ?RweaveLatex. To support this, tools::texi2dvi adds the R ’texmf’ directory to the input search path.


• R CMD Rd2dvi now previews PDF output (as was documented) if R_PDFVIEWER is set (as it will normally be on a Unix-alike but not on Windows, where the file association is used by default).

• R CMD check checks for binary executable files (which should not appear in a source package), using a suitable ’file’ if available, else by name.

• R CMD check now also uses codetools’ checks on the body of S4 methods.

C-level facilities

• R_ReadConsole will now be called with a buffer size of 4096 bytes (rather than 1024): maintainers of alternative front-ends should check that they do not have a smaller limit.

• Graphics structure NewDevDesc has been renamed to DevDesc. For now there is a compatibility define in GraphicsDevice.h, but it will be removed in R 2.9.0.

• PROTECT and UNPROTECT macros now work even with R_NO_REMAP.

Bug fixes

• @ now gives an error (and not just a warning) if it is being applied to a non-S4 object.

• R CMD appends (not prepends) R’s texmf path to TEXINPUTS.

• Several bugs fixed in ‘?‘ with topics: it previously died trying to construct some error messages; for S4 methods, class "ANY" should be used for omitted arguments and default methods.

• Indexing of data frames with NA column names and a numeric or logical column index works again even if columns with NA names are selected.

• on.exit() has been fixed to use lexical scope in determining where to evaluate the exit action when the on.exit expression appears in a function argument.

• rank() now consistently returns a double result for ties.method = "average" and an integer result otherwise. Previously the storage mode depended on ’na.last’ and whether any NAs were present. (See the short check at the end of this list.)

• The "lm" methods of add1() and drop1() now also work on a model fit with na.action = na.exclude.

• median(c(x = NA_real_)) no longer has spurious names().

• isoreg(x, y) now returns the correct result also when x has ties, in all cases.

• What na.action() does is now correctly documented.

• source() with echo=TRUE now behaves like ordinary automatic printing, by using methods::show() for S4 objects.

• Objects generated by new() from S4 classes should now all satisfy isS4(object). Previously, prototypes not of object type S4 would not be S4 objects. new() applied to basic, non-S4 classes still will (and should) return non-S4 objects.

• Functions writing to connections, such as writeLines(), writeBin(), writeChar(), save(), dput() and dump(), now check more carefully that the connections are opened for writing, including connections that they open themselves. Similarly, functions which read, such as readLines(), scan(), read.dcf() and parse(), check connections for being open for reading.

• trace() should create missing traceable classes in the global environment, not in baseenv() where other classes will not be found.

• Class inheritance using explicit coerce= methods via setIs() failed to coerce the argument in method dispatch. With this fixed, a mechanism was needed to prohibit such inheritance when it would break the generic function (e.g., initialize). See ?setIs and ?setGeneric.

• Equality comparison of factors with <NA> levels now works correctly again.

• RSiteSearch() encodes its query (it seems this is occasionally needed on some platforms, but encoding other fields is harmful).

• ’incomparables’ in match() was looking up indices in the wrong table.

• write.dcf() did not escape "." according to Debian policy (PR#12816).

• Repainting of open X11 View() windows is now done whilst an X11 dataentry window is in use.

• col2rgb() sometimes opened a graphics device unnecessarily, and col2rgb(NA) did not return a transparent color, as documented.


• pdf(family = "Japan1") [and other CID fonts] no longer segfaults when writing "western" text strings.

• as.list() applied to an environment now forces promises and returns values.

• Promises capturing calls to sys.parent() and friends did not work properly when evaluated via method dispatch for internal S3 generics. (PR#13098.)

• cut.Date(x, "weeks") could fail if x has only one unique value which fell on a week boundary. (PR#13159.)

• The default R_LIBS_USER path in AQUA builds now matches the Mac-specific path used by the Mac GUI: /Library/R/x.y/library.

• splinefun() with natural splines incorrectly evaluated derivatives to the left of the first knot. (PR#13132, fix thanks to Berwin Turlach.)

• anova(glm(..., y = FALSE)) now works.

• The default pkgType option for non-CRAN builds of R on Mac OS X is now correctly "source", as documented.
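Two of the fixes in this list are easy to verify at the prompt (a transcript added for illustration; output as expected under 2.8.0):

> rank(c(2, 1, 2))                 # double result for ties.method = "average"
[1] 2.5 1.0 2.5
> names(median(c(x = NA_real_)))   # no spurious names
NULL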


Changes on CRAN

by Kurt Hornik

New contributed packages

AER Functions, data sets, examples, demos, and vignettes for the book “Applied Econometrics with R” by C. Kleiber and A. Zeileis (2008, Springer-Verlag, New York, ISBN 978-0-387-77316-2). By Achim Zeileis and Christian Kleiber.

ALS Multivariate curve resolution alternating least squares (MCR-ALS). Alternating least squares is often used to resolve components contributing to data with a bilinear structure; the basic technique may be extended to alternating constrained least squares. Commonly applied constraints include unimodality, non-negativity, and normalization of components. Several data matrices may be decomposed simultaneously by assuming that one of the two matrices in the bilinear decomposition is shared between datasets. By Katharine M. Mullen.

AdMit Performs fitting of an adaptive mixture of Student t distributions to a target density through its kernel function. The mixture approximation can then be used as the importance density in importance sampling or as the candidate density in the Metropolis-Hastings algorithm to obtain quantities of interest for the target density itself. By David Ardia, Lennart F. Hoogerheide and Herman K. van Dijk.

BCE Estimation of taxonomic compositions from biomarker data, using a Bayesian approach. By Karel Van den Meersche and Karline Soetaert.

BLCOP An implementation of the Black-Litterman Model and Atilio Meucci’s copula opinion pooling framework. By Francisco Gochez.

BaM Functions and datasets for “Bayesian Methods: A Social and Behavioral Sciences Approach” by J. Gill (Second Edition, 2007, CRC Press). By Jeff Gill.

BiplotGUI Provides a GUI with which users can construct and interact with biplots. By Anthony la Grange.

CADStat Provides a GUI to several statistical methods useful for causal assessment. Methods include scatterplots, boxplots, linear regression, generalized linear regression, quantile regression, conditional probability calculations, and regression trees. By Lester Yuan, Tom Stockton, Doug Bronson, Pasha Minallah and Mark Fitzgerald.
CHsharp Functions that cluster 3-dimensional data into their local modes. Based on a convergent form of Choi and Hall’s (1999) data sharpening method. By Douglas G. Woolford.

CTT Classical Test Theory Functions. By John T. Willse and Zhan Shu.

ConvergenceConcepts Provides a way to investigate various modes of convergence of random variables. By P. Lafaye de Micheaux and B. Liquet.

CvM2SL2Test Cramer-von Mises Two Sample Tests, featuring functionality to compute the exact p-value(s) for given Cramer-von Mises two-sample test score(s) under the assumption that the populations under comparison have the same probability distribution. By Yuanhui Xiao.

DiagnosisMed Analysis of data from diagnostic test accuracy evaluating health conditions. Designed for use by health professionals. Can estimate sample size for common situations in diagnostic test accuracy, estimate sensitivity and specificity from categorical and continuous test results including some evaluations of indeterminate results, or compare different analysis strategies into measures commonly used by health professionals. By Pedro Brasil.

EVER Estimation of Variance by Efficient Replication, providing delete-a-group jackknife replication. Gives estimates, standard errors and confidence intervals for totals, means, absolute and relative frequency distributions, contingency tables, ratios, quantiles and regression coefficients, as well as for user-defined estimators (even non-analytic). By Diego Zardetto.

FITSio Utilities to read and write files in the FITS (Flexible Image Transport System) format, a standard format in astronomy. By Andrew Harris.

FTICRMS Programs for Analyzing Fourier Transform-Ion Cyclotron Resonance Mass Spectrometry Data. By Don Barkauskas.

GExMap Analysis of genomic distribution of gene lists produced by transcriptomic studies. By N. Cagnard.

HAPim Provides a set of functions whose aim is to propose 4 methods of QTL detection: HAPimLD (an interval-mapping method designed for unrelated individuals with no family information that makes use of linkage disequilibrium), HAPimLDL (an interval-mapping method for design of half-sib families, combining linkage analysis and linkage disequilibrium), HaploMax (based on an analysis of variance with a dose haplotype effect), and HaploMaxHS (based on an analysis of variance with a sire effect and a dose haplotype effect in half-sib family design). By S. Dejean, N. Oumouhou, D. Estivals and B. Mangin.


LDtests Exact tests for Linkage Disequilibrium (LD) and Hardy-Weinberg Equilibrium (HWE). By Alex Lewin.

LIStest Computes the p-value for the Longest Increasing Subsequence Independence Test (for continuous random variables). By Jesus Garcia and Veronica Andrea Gonzalez Lopez.

MAclinical Class prediction based on microarray data and clinical parameters. Prediction is performed using a two-step method combining (pre-validated) PLS dimension reduction and random forests. By Anne-Laure Boulesteix.

MCAPS Weather and air pollution data, risk estimates, and other information from the Medicare Air Pollution Study (MCAPS) of 204 U.S. counties, 1999–2002. By Roger D. Peng.

MDD Calculates Minimum Detectable Difference (MDD) for several continuous and binary endpoints. Also contains programs to compare the MDD to the clinically significant difference for most of the same tests. By Don Barkauskas.

MIfuns Pharmacometric tools for common data preparation tasks, stratified bootstrap resampling of data sets, NONMEM control stream creation/editing, NONMEM model execution, creation of standard and user-defined diagnostic plots, execution and summary of bootstrap and predictive check results, implementation of simulations from posterior parameter distributions, reporting of output tables and creation of a detailed analysis log. By Bill Knebel.

MKmisc Miscellaneous functions from M. Kohl. By Matthias Kohl.

MSVAR Estimation of a 2-state Markov Switching VAR. By James Eustace.

Metabonomic Graphical user interface for metabonomic analysis (baseline, normalization, peak detection, PCA, PLS, nearest neighbor, neural network). By Jose L. Izquierdo.

NetIndices Estimates network indices, including trophic structure of foodwebs. Indices include ascendency network indices, direct and indirect dependencies, effective measures, environ network indices, general network indices, pathway analysis, network uncertainty indices and constraint efficiencies, and the trophic level and omnivory indices of food webs. By Karline Soetaert and Julius Kipkeygon Kones.
OPE Fit an outer-product emulator to the multivariate evaluations of a computer model. By Jonathan Rougier.

PBSddesolve Solver for Delay Differential Equations via interfacing numerical routines written by Simon N. Wood, with contributions by Benjamin J. Cairns. Replaces package ddesolve. By Alex Couture-Beil, Jon Schnute and Rowan Haigh.

PSM Estimation of linear and non-linear mixed-effects models using stochastic differential equations. Also provides functions for finding smoothed estimates of model states and for simulation. Allows for any multivariate non-linear time-variant model to be specified, and also handles multi-dimensional input, covariates, missing observations and specification of dosage regimen. By Stig Mortensen and Søren Klim.

PolynomF Univariate polynomial operations in R. By Bill Venables.

Pomic Calculations of Pattern Oriented Modeling Information Criterion (POMIC), a non-parsimonious based information criterion, to check the quality of simulation results of ABM/IBM or other non-linear rule-based models. The POMIC is based on the KL divergence and likelihood theory. By Cyril Piou.

PredictiveRegression Prediction intervals for three algorithms described in “On-line predictive linear regression” (Annals of Statistics, 2008). By Vladimir Vovk and Ilia Nouretdinov.

PtProcess Time dependent point process modeling, with an emphasis on earthquake modeling. By David Harte.

R2jags Running JAGS from R. By Yu-Sung Su and Masanao Yajima.

RItools Tools for randomization inference. By Jake Bowers, Mark Fredrickson and Ben Hansen.

RKEA An R interface to KEA (Version 5.0). By Ingo Feinerer.

RM2 Functions used in revenue management and pricing environments. By Tudor Bodea, Dev Koushik and Mark Ferguson.

RPostgreSQL Database interface and PostgreSQL driver for R. Complies with the database interface definition as implemented in package DBI. By Sameer Kumar Prayaga.


Ratings Functions to implement the methods described in “Improving the Presentation and Interpretation of Online Ratings Data with Model-based Figures” by Ho and Quinn (The American Statistician). By Kevin M. Quinn and Daniel E. Ho.

Read.isi Automated access to old World Fertility Survey data saved in fixed-width format based on ISI-formatted codebooks. By Rense Nieuwenhuis.

ResearchMethods Using GUIs to help teach statistics to non-statistics students. By Sam Stewart and Mohammed Abdolell.

RobAStBase Base S4 classes and functions for robust asymptotic statistics. By Matthias Kohl and Peter Ruckdeschel.

Rsge Functions for using R with the SGE cluster/grid queuing system. By Dan Bode.

SASPECT Significant AnalysiS of PEptide CounTs: a statistical method for significant analysis of comparative proteomics based on LC-MS/MS experiments. By Pei Wang and Yan Liu.

SDDA Stepwise Diagonal Discriminant Analysis — a fast algorithm for building multivariate classifiers. By CSIRO Bioinformatics, Glenn Stone.

SGP Calculate growth percentiles and projections for students using large scale, longitudinal assessment data. These norm-referenced growth values are presently used in some state testing and accountability systems. The functions use quantile regression techniques to estimate the conditional density associated with each student’s achievement history. Student Growth Projections (i.e., percentile growth trajectories) help to identify what it will take for students to reach future achievement targets. By Damian W. Betebenner, with contributions from Jonathan Weeks, Jinnie Choi, Xin Wei and Hi Shin Shim.

SMVar Structural Model for Variances in order to detect differentially expressed genes from gene expression data. By Guillemette Marot.

SNPMaP SNP Microarrays and Pooling in R. Pooling DNA on SNP microarrays is a cost-effective way to carry out genome-wide association studies for heritable disorders or traits. By Oliver SP Davis and Leo C Schalkwyk.

SNPMaP.cdm Annotation for SNP Microarrays and Pooling in R: provides cdm objects for the SNPMaP package. By Oliver SP Davis and Leo C Schalkwyk.

SQLiteMap Manage vector graphical maps using SQLite. By Norbert Solymosi.

STAR Spike Train Analysis with R: functions to analyze neuronal spike trains from a single neuron or from several neurons recorded simultaneously. By Christophe Pouzat.

SiZer Significant Zero Crossings. Calculates and plots the SiZer map for scatterplot data. A SiZer map is a way of examining when the p-th derivative of a scatterplot-smoother is significantly negative, possibly zero or significantly positive across a range of smoothing bandwidths. By Derek Sonderegger.

SimpleTable Methods to conduct Bayesian inference and sensitivity analysis for causal effects from 2 × 2 and 2 × 2 × K tables when unmeasured confounding is present or suspected. By Kevin M. Quinn.

SpatialExtremes Several approaches to spatial extremes modeling. By Mathieu Ribatet.

StatMatch Performs statistical matching between two data sources, and also imputation of missing values in data sets through hot deck methods. By Marcello D’Orazio.

StreamMetabolism Calculate single station metabolism from diurnal oxygen curves. By Stephen A. Sefick Jr.

TinnR Tinn-R GUI/editor resources for R. By Jose Claudio Faria, based on code by Philippe Grosjean.

TraMineR Sequences and trajectories mining for the field of social sciences, where sequences are sets of states or events describing life histories, for example family formation histories. Provides tools for translating sequences from one format to another, statistical functions for describing sequences, and methods for computing distances between sequences using several metrics like optimal matching and some other metrics proposed by C. Elzinga. By Alexis Gabadinho, Matthias Studer, Nicolas S. Müller and Gilbert Ritschard.

VhayuR Vhayu R Interface. By The Brookhaven Group.
apsrtable Formats LaTeX tables from one or more model objects side-by-side with standard errors below, not unlike tables found in such journals as the American Political Science Review. By Michael Malecki.

audio Interfaces to audio devices (mainly sample-based) from R to allow recording and playback of audio. Built-in devices include Windows MM, Mac OS X AudioUnits and PortAudio (the last one is very experimental). By Simon Urbanek.

bark Bayesian Additive Regression Kernels. Implementation of BARK as described in Zhi Ouyang’s 2008 Ph.D. thesis. By Zhi Ouyang, Merlise Clyde and Robert Wolpert.

bayesGARCH Bayesian estimation of the GARCH(1,1) model with Student’s t innovations. By David Ardia.

bear An average bioequivalence (ABE) and bioavailability data analysis tool including sample size estimation, noncompartmental analysis (NCA) and ANOVA (lm) for a standard RT/TR 2-treatment, 2-sequence, 2-period, balanced, cross-over design. By Hsin-ya Lee and Yung-jin Lee.

betaper Evaluates and quantifies distance decay of similarity among biological inventories in the face of taxonomic uncertainty. By Luis Cayuela and Marcelino de la Cruz.

bigmemory Use C++ to create, store, access, and manipulate massive matrices. Under UNIX, it also supports use of shared memory. By Michael J. Kane and John W. Emerson.

bise Some easy-to-use functions for spatial analyses of (plant-) phenological data sets and satellite observations of vegetation. By Daniel Doktor.

bit A class for vectors of 1-bit booleans. With bit vectors you can store true binary booleans at the expense of 1 bit only; on a 32-bit architecture this means factor 32 less RAM and factor 32 more speed on boolean operations. By Jens Oehlschlägel.

bmd Benchmark dose analysis for continuous and quantal dose-response data. By Christian Ritz.

bpca Biplot (2d and 3d) of multivariate data based on principal components analysis and diagnostic tools of the quality of the reduction. By Jose Claudio Faria and Clarice Garcia Borges Demetrio.

ccgarch Estimation and simulation of Conditional Correlation GARCH (CC-GARCH) models. By Tomoaki Nakatani.

cem Coarsened Exact Matching. Implements the CEM algorithm (and many extensions) described in “Matching for Causal Inference Without Balance Checking” by Stefano M. Iacus, Gary King, and Giuseppe Porro (http://gking.harvard.edu/files/abs/cem-abs.shtml). By Stefano Iacus, Gary King and Giuseppe Porro.
compoisson Routines for density and moments of the Conway-Maxwell-Poisson distribution as well as functions for fitting the COM-Poisson model for over/under-dispersed count data. By Jeffrey Dunn.

convexHaz Functions to compute the nonparametric maximum likelihood estimator (MLE) and the nonparametric least squares estimator (LSE) of a convex hazard function, assuming that the data is i.i.d. By Hanna Jankowski, Ivy Wang, Hugh McCague and Jon A. Wellner.

copas Statistical methods to model and adjust for bias in meta-analysis. By James Carpenter and Guido Schwarzer.

coxphw Weighted estimation for Cox regression. R code by Meinhard Ploner, FORTRAN by Georg Heinze.

curvetest Test if two curves defined by two data sets are equal, or if one curve is equal to zero. By Zhongfa Zhang and Jiayang Sun.

dataframes2xls Write data frames to ‘.xls’ files. Supports multiple sheets and basic formatting. By Guido van Steen.

ddst Data driven smooth test. By Przemyslaw Biecek (R code) and Teresa Ledwina (support, descriptions).

deSolve General solvers for ordinary differential equations (ODE) and for differential algebraic equations (DAE). The functions provide an interface to the FORTRAN functions lsoda, lsodar, lsode, lsodes, dvode and daspk. The package also contains routines designed for solving uni- and multicomponent 1-D and 2-D reactive transport models. By Karline Soetaert, Thomas Petzoldt and R. Woodrow Setzer.

denstrip Graphical methods for compactly illustrating probability distributions, including density strips, density regions, sectioned density plots and varying width strips. By Christopher Jackson.

dfcrm Dose-finding by the continual reassessment method. Provides functions to run the CRM and TITE-CRM in phase I trials and calibration tools for trial planning purposes. By Ken Cheung.

diffractometry Residual-based baseline identification and peak decomposition for x-ray diffractograms as introduced in the corresponding paper by Davies et al. (2008, Annals of Applied Statistics). By P. L. Davies, U. Gather, M. Meise, D. Mergel and T. Mildenberger. Additional code by T. Bernholt and T. Hofmeister.

dirichlet Dirichlet model of consumer buying behavior for marketing research. The Dirichlet (aka NBD-Dirichlet) model describes the purchase incidence and brand choice of consumer products. Provides model estimation and summaries of various theoretical quantities of interest to marketing researchers. Also provides functions for making tables that compare observed and theoretical statistics. By Feiming Chen.

distrMod Object orientated implementation of probability models based on packages distr and distrEx. By Matthias Kohl and Peter Ruckdeschel.

distrTeach Extensions of package distr for teaching stochastics/statistics in secondary school. By Peter Ruckdeschel, Matthias Kohl, Anja Hueller and Eleonara Feist.

divagis Tools for quality checks of georeferenced plant species accessions. By Reinhard Simon.

dti DTI Analysis. Diffusion Weighted Imaging is a Magnetic Resonance Imaging modality that measures diffusion of water in tissues like the human brain. The package contains functions to process diffusion-weighted data in the context of the diffusion tensor model (DTI), including the calculation of anisotropy measures and the implementation of the structural adaptive smoothing algorithm described in “Diffusion Tensor Imaging: Structural Adaptive Smoothing” by K. Tabelow, J. Polzehl, V. Spokoiny, and H.U. Voss (2008, Neuroimage 39(4), 1763–1773). By Karsten Tabelow and Joerg Polzehl.

dynamo Routines for estimation, simulation, regularization and prediction of univariate dynamic models including ARMA, ARMA-GARCH, ACD and MEM. By Christian T. Brownlees.

ecolMod Figures, data sets and examples from the book “A practical guide to ecological modelling — using R as a simulation platform” by Karline Soetaert and Peter MJ Herman (2008, Springer). By Karline Soetaert and Peter MJ Herman.

empiricalBayes A bundle providing a simple solution to the extreme multiple testing problem by estimating local false discovery rates. Contains two packages, localFDR and HighProbability. Given a vector of p-values, the former estimates local false discovery rates and the latter determines which p-values are low enough that their alternative hypotheses can be considered highly probable. By Zahra Montazeri and David R. Bickel.

entropy Implements various estimators of entropy, such as the shrinkage estimator by Hausser and Strimmer, the maximum likelihood and the Miller-Madow estimator, various Bayesian estimators, and the Chao-Shen estimator. It also offers an R interface to the NSB estimator. Furthermore, it provides functions for estimating mutual information. By Jean Hausser and Korbinian Strimmer.

eqtl Analysis of experimental crosses to identify genes (called quantitative trait loci, QTLs) contributing to variation in quantitative traits, complementary to Karl Broman’s qtl package for genome-wide analysis. By Ahmid A. Khalili and Olivier Loudet.

etm Empirical Transition Matrix: matrix of transition probabilities for any time-inhomogeneous multistate model with finite state space. By Arthur Allignol.

expert Modeling without data using expert opinion. Expert opinion (or judgment) is a body of techniques to estimate the distribution of a random variable when data is scarce or unavailable. Opinions on the quantiles of the distribution are sought from experts in the field and aggregated into a final estimate. Supports aggregation by means of the Cooke, Mendel-Sheridan and predefined weights models. By Mathieu Pigeon, Michel Jacques and Vincent Goulet.

fbati Family-based gene by environment interaction tests, and joint gene, gene-environment interaction test. By Thomas Hoffmann.

fgui Function GUI: rapidly create a GUI for a function by automatically creating widgets for arguments of the function. By Thomas Hoffmann.

fossil Palaeoecological and palaeogeographical analysis tools. Includes functions for estimating species richness (Chao 1 and 2, ACE, ICE, Jacknife), shared species/beta diversity, species area curves and geographic distances and areas. By Matthew Vavrek.
frontier Maximum Likelihood Estimation of stochastic frontier production and cost functions. Two specifications are available: the error components specification with time-varying efficiencies (Battese and Coelli, 1992) and a model specification in which the firm effects are directly influenced by a number of variables (Battese and Coelli, 1995). By Tim Coelli and Arne Henningsen.

fts R interface to tslib (a time series library in C++). By Whit Armstrong.

fuzzyOP Fuzzy numbers and the main mathematical operations for these. By Aklan Semagul, Altindas Emine, Macit Rabiye, Umar Senay and Unal Hatice.

gWidgetsWWW Toolkit implementation of gWidgets for www. By John Verzani.

gbs Utilities for analyzing censored and uncensored data from generalized Birnbaum-Saunders distributions. By Michelli Barros, Victor Leiva and Gilberto A. Paula.

gene2pathway Prediction of KEGG pathway membership for individual genes based on InterPro domain signatures. By Holger Froehlich, with contributions by Tim Beissbarth.

geonames Interface to the www.geonames.org web service. By Barry Rowlingson.

glmmBUGS Generalized Linear Mixed Models with WinBUGS. By Patrick Brown.

glmnet Extremely efficient procedures for fitting the entire lasso or elastic-net regularization path for linear regression, logistic and multinomial regression models, using cyclical coordinate descent in a pathwise fashion. By Jerome Friedman, Trevor Hastie and Rob Tibshirani.

gmm Estimation of parameters using the generalized method of moments (GMM) of Hansen (1982). By Pierre Chausse.

grade Binary Grading functions for R. By Leif Johnson.

gtm Generative topographic mapping. By Ondrej Such.

hlr Hidden Logistic Regression. Implements the methods described in Rousseeuw and Christmann (2003) to cope with separation issues and outliers in logistic regression. Original S-PLUS code by Peter J. Rousseeuw and Andreas Christmann, R port by Tobias Verbeke.

hwriter HTML Writer: easy-to-use and versatile functions to output R objects in HTML format. By Gregoire Pau.

ic.infer Inequality constrained inference in linear normal situations. Implements parameter estimation in normal (linear) models under linear equality and inequality constraints, as well as likelihood ratio tests involving inequality-constrained hypotheses. By Ulrike Groemping.
ear equality and inequality constraints as well as likelihood ratio tests involving inequality- mco Functions for multi criteria optimization using constrained hypotheses. By Ulrike Groemping. genetic algorithms and related test problems. By Heike Trautmann, Detlef Steuer and Olaf ic50 Standardized high-throughput evaluation of Mersmann. compound screens. Calculation of IC50 values, automatic drawing of dose-response curves mi Missing-data imputation and model checking. and validation of compound screens on 96- and By Andrew Gelman, Jennifer Hill, Masanao Ya- 384-well plates. By Peter Frommolt. jima, Yu-Sung Su and Maria Grazia Pittau.

mirf Multiple Imputation and Random Forests for unobservable phase, high-dimensional data. Applies a combination of missing haplotype imputation via the EM algorithm of Excoffier and Slatkin (1995) and modeling trait-haplotype associations via the Random Forest algorithm, as described in “Multiple imputation and random forests (MIRF) for unobservable high-dimensional data” by B.A.S. Nonyane and A.S. Foulkes (2007, The International Journal of Biostatistics 3(1): Article 12). By Yimin Wu, B. Aletta S. Nonyane and Andrea S. Foulkes.

mixer Estimation of Erdős-Rényi mixture for graphs. By Christophe Ambroise, Gilles Grasseau, Mark Hoebeke and Pierre Latouche.

mixlow Assessing drug synergism/antagonism. By John Boik.

mlogit Estimation of the multinomial logit model with alternative and/or individual specific variables. By Yves Croissant.

mpm Multivariate Projection Methods: exploratory graphical analysis of multivariate data, specifically gene expression data, with different projection methods: principal component analysis, correspondence analysis, spectral map analysis. By Luc Wouters.

mrdrc Model-Robust Concentration-Response Analysis: semi-parametric modeling of continuous and quantal concentration/dose-response data. By Christian Ritz and Mads Jeppe Tarp-Johansen.

msBreast A dataset of 96 protein mass spectra generated from a pooled sample of nipple aspirate fluid (NAF) from healthy breasts and breasts with cancer. By Lixin Gong, William Constantine and Yu Alex Chen.

msDilution A dataset of 280 MALDI-TOF mass spectra generated from a dilution experiment aimed at elucidating which features in MALDI-TOF mass spectrometry data are informative for quantifying peptide content. By Lixin Gong, William Constantine and Yu Alex Chen.
msProstate A dataset of protein mass spectra generated from patients with prostate cancer, benign prostatic hypertrophy, and normal controls. By Lixin Gong, William Constantine and Alex Chen.

nparcomp Computation of nonparametric simultaneous confidence intervals and simultaneous p-values for relative contrast effects in the unbalanced one-way layout. The simultaneous confidence intervals can be computed using the multivariate normal distribution, the multivariate t-distribution with a Satterthwaite approximation of the degrees of freedom, or using multivariate range-preserving transformations with logit or probit as transformation function. By Frank Konietschke.

onemap Software for constructing genetic maps in outcrossing species. Analysis of molecular marker data from non-model systems to simultaneously estimate linkage and linkage phases. By Gabriel Rodrigues Alves Margarido, with contributions from Antonio Augusto Franco Garcia.

opentick Provides an interface to opentick real time and historical market data. By Josh Ulrich.

orloca The Operations Research LOCational Analysis models. Deals with the min-sum or center location problems. By Fernando Fernandez-Palacin and Manuel Munoz-Marquez.

orloca.es Spanish version of package orloca. By Fernando Fernandez-Palacin and Manuel Munoz-Marquez.

orth Performs multivariate logistic regressions by way of orthogonalized residuals. As a special case, the methodology recasts alternating logistic regressions in a way that is consistent with standard estimating equation theory. Cluster diagnostics and observation-level diagnostics such as leverage and Cook’s distance are computed based on an approximation. By Kunthel By, Bahjat F. Qaqish and John S. Preisser.

pack Functions to easily convert data to binary formats other programs/machines can understand. By Josh Ulrich.

packClassic Illustration of the tutorial “S4: From Idea To Package”. By Christophe Genolini and some reader that sent useful comments.

paltran Functions for paleolimnology: WA regression (see also package analogue), WA-PLS and MW regression. By Sven Adler.

pec Prediction Error Curves for survival models: validation of predicted survival probabilities using inverse weighting and resampling. By Thomas A. Gerds.

peperr Parallelized Estimation of Prediction Error. Prediction error estimation through resampling techniques, possibly accelerated by parallel execution on a computer cluster. By Christine Porzelius and Harald Binder.

picante Phylocom integration, community analyses, null-models, traits and evolution in R. By Steve Kembel, David Ackerly, Simon Blomberg, Peter Cowan, Matthew Helmus and Cam Webb.

plyr Tools for splitting, applying and combining data. By Hadley Wickham.

qdg QTL Directed Graphs: infer QTL-directed dependency graphs for phenotype networks. By Elias Chaibub Neto and Brian S. Yandell.

qlspack Quasi Least Squares (QLS) package. QLS is a two-stage computational approach for estimation of the correlation parameters within the framework of GEE. It helps solving parameters in mean, scale, and correlation structures for longitudinal data. By Jichun Xie and Justine Shults.

rPorta An R interface to a modified version of PORTA (see http://www.zib.de/Optimization/Software/Porta/). By Robin Nunkesser, Silke Straatmann and Simone Wenzel.

randtoolbox Toolbox for pseudo and quasi random number generation and RNG tests. Provides general linear congruential generators (Park Miller) and multiple recursive generators (Knuth TAOCP), generalized feedback shift registers (SF-Mersenne Twister algorithm and WELL generator), the Torus algorithm, and some additional tests (gap test, serial test, poker test, ...). By Christophe Dutang, with the SFMT algorithm from M. Matsumoto and M. Saito, the WELL generator from P. L’Ecuyer, and the Knuth-TAOCP RNG from D. Knuth.

rbounds Perform Rosenbaum bounds sensitivity tests for matched data. Calculates bounds for binary data, Hodges-Lehmann point estimates, Wilcoxon signed-rank tests, and for data with multiple matched controls. Designed to work with package Matching. By Luke J. Keele.

rdetools Functions for Relevant Dimension Estimation (RDE) of a data set in feature spaces, applications to model selection, graphical illustrations and prediction. By Jan Saputra Mueller.
singlecase Various functions for the single-case re- repolr Repeated measures proportional odds logis- search in neuropsychology, mainly dealing tic regression via generalized estimating equa- with the comparison of a patient’s test score (or tions. By Nick Parsons. score difference) to a control or normative sam- rjags Bayesian graphical models via an interface to ple. These methods also provide a point esti- the JAGS MCMC library. By Martyn Plummer. mate of the percentage of the population that would obtain a more extreme score (or score rootSolve Nonlinear root finding, equilibrium and difference) and, for some problems, an accom- steady-state analysis of ordinary differential panying interval estimate (i.e., confidence lim- equations. Includes routines that: (1) gener- its) on this percentage. By Matthieu Dubois. ate gradient and Jacobian matrices (full and banded), (2) find roots of non-linear equations smacof provides the following approaches to mul- by the Newton-Raphson method, (3) estimate tidimensional scaling (MDS) based on stress steady-state conditions of a system of (differ- minimization by means of majorization (sma- ential) equations in full, banded or sparse form, cof): Simple smacof on symmetric dissimilar- using the Newton-Raphson method, or by dy- ity matrices, smacof for rectangular matrices namically running, (4) solve the steady-state (unfolding models), smacof with constraints


smacof Provides the following approaches to multidimensional scaling (MDS) based on stress minimization by means of majorization (smacof): simple smacof on symmetric dissimilarity matrices, smacof for rectangular matrices (unfolding models), smacof with constraints on the configuration, three-way smacof for individual differences (including constraints for idioscal, indscal, and identity), and spherical smacof (primal and dual algorithm). Each of these approaches is implemented in a metric and nonmetric manner, including primary, secondary, and tertiary approaches for tie handling. By Jan de Leeuw and Patrick Mair.

snowfall Top-level wrapper around snow for easier development of parallel R programs. All functions work in sequential mode, too, if no cluster is present or wished. By Jochen Knaus.

spssDDI Read SPSS System files and produce valid DDI version 3.0 documents. By Guido Gay.

ssize.fdr Sample size calculations for microarray experiments, featuring appropriate sample sizes for one-sample t-tests, two-sample t-tests, and F-tests for microarray experiments based on desired power while controlling for false discovery rates. For all tests, the standard deviations (variances) among genes can be assumed fixed or random. This is also true for effect sizes among genes in one-sample experiments and differences in mean treatment expressions for two-sample experiments. Functions also output a chart of power versus sample size, a table of power at different sample sizes, and a table of critical test values at different sample sizes. By Megan Orr and Peng Liu.

tframePlus Time Frame coding kernel extensions. By Paul Gilbert.

tileHMM Hidden Markov Models for ChIP-on-chip analysis. Provided parameter estimation methods include the Baum-Welch algorithm and Viterbi training, as well as a combination of both. By Peter Humburg.

timereg Programs for the book “Dynamic Regression Models for Survival Data” by T. Martinussen and T. Scheike (2006, Springer Verlag), plus more recent developments. By Thomas Scheike, with contributions from Torben Martinussen and Jeremy Silver.

tis Functions and S3 classes for time indexes and time indexed series, which are compatible with FAME frequencies. By Jeff Hallman.

topmodel An R implementation of TOPMODEL, based on the 1995 FORTRAN version by Keith Beven. Adapted from the C translation by Huidae Cho. By Wouter Buytaert.

tossm Testing Of Spatial Structure Methods. Provides a framework under which methods for using genetic data to detect population structure and define management units can be tested. By Karen Martien, Dave Gregovich and Mark Bravington.

treelet Treelet, a novel construction of multi-scale bases that extends wavelets to non-smooth signals. Returns a hierarchical tree and a multi-scale orthonormal basis which both reflect the internal structure of the data. By Di Liu.

tsModel Time Series Modeling for Air Pollution and Health. By Roger D. Peng, with contributions from Aidan McDermott.

uncompress Functionality for decompressing ‘.Z’ files. By Nicholas Vinen.

waveclock Time-frequency analysis of cycling cell luminescence data. Provides the function waveclock, designed to assess the period and amplitude of cycling cell luminescence data. The function reconstructs the modal frequencies from a continuous wavelet decomposition of the luminescence data using the “crazy climbers” algorithm described in the book “Practical Time-Frequency Analysis: Gabor and Wavelet Transforms with an Implementation in S” by René Carmona, Wen L. Hwang and Bruno Torresani (1998, Academic Press). By Tom Price.

wgaim Whole Genome Average Interval Mapping for QTL detection using mixed models. Integrates sophisticated mixed modeling methods with a whole genome approach to detecting significant QTLs in linkage maps. By Julian Taylor, Simon Diffey, Ari Verbyla and Brian Cullis.

Other changes

• Package PARccs was renamed to pARccs.

• Package BRugs was moved to the Archive (available via CRANextras now).

• Package HighProbability was moved to the Archive (now contained in bundle empiricalBayes).

• Package brlr was moved to the Archive.

• Package paleoTSalt was moved to the Archive (integrated in package paleoTS).

• Package torus was moved to the Archive (replaced by package randtoolbox).

• Packages data.table and elasticnet were resurrected from the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]
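Of the new arrivals listed above, plyr lends itself particularly well to a quick illustration of its split-apply-combine idiom. The sketch below is ours rather than the package author’s, uses only the ddply() function and the built-in mtcars data, and argument conventions may differ slightly across plyr versions:

    ## Split mtcars by number of cylinders, apply a summary
    ## function to each piece, and recombine the pieces into
    ## a single data frame.
    library(plyr)
    ddply(mtcars, "cyl", function(piece)
        data.frame(n = nrow(piece), mean.mpg = mean(piece$mpg)))

The same pattern (split by one or more variables, apply an arbitrary function, combine the results) replaces many hand-written combinations of split(), lapply() and rbind().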


News from the Bioconductor Project

Bioconductor Team
Program in Computational Biology
Fred Hutchinson Cancer Research Center

We are pleased to announce Bioconductor 2.3, released on October 22, 2008. Bioconductor 2.3 is compatible with R 2.8.0, and consists of 294 packages. There are 37 new packages, and enhancements to many others. Explore Bioconductor at http://bioconductor.org, and install standard or individual packages with

> source("http://bioconductor.org/biocLite.R")
> biocLite() # install standard packages...
> biocLite("rtracklayer") # or rtracklayer

New packages

New to this release are powerful packages for diverse areas of high-throughput analysis. Highlights include:

Assay technologies such as HELP and MeDIP DNA methylation (HELP, MEDME), micro-RNAs (microRNA, miRNApath), and array comparative genomic hybridization (ITALICS, CGHregions, KCsmart).

Pathways and integrative analysis packages such as SIM (gene set enrichment for copy number and expression data), domainsignatures (InterPro gene set enrichment), minet (mutual information network inference) and miRNApath (pathway enrichment for micro-RNA data).

Interoperability with the ArrayExpress data base (ArrayExpress), the Chemmine chemical compound data base (ChemmineR), the PSI-MI protein interaction data base (RpsiXML) and external visualization tools such as the X:Map exon viewer (xmapbridge).

Algorithms for machine learning (BicARE, dualKS, CMA) and iterative Bayesian model averaging (IterativeBMA, IterativeBMAsurv).

Refined expression array methods including preprocessing (e.g., multiscan), contaminant and outlier detection (affyContam, arrayMvout, parody), and small-sample and other assays for differential expression (e.g., logitT, DFP, LPEadj, PLPE).

Annotations

Bioconductor ‘Annotation’ packages contain biological information about microarray probes and the genes they are meant to interrogate, or contain ENTREZ gene-based annotations of whole genomes. This release marks the completion of our transition from environment-based to SQLite-based annotation packages. SQLite annotation packages allow for efficient memory use and facilitate more complicated data queries. The release now supports a menagerie of 15 different model organisms, from Arabidopsis to Zebrafish, nearly doubling the number of species compared to the previous Bioconductor release. We have also increased the amount of information in each annotation package; the inst/NEWS file in AnnotationDbi provides details.

High-throughput sequencing

An ensemble of new or expanded packages introduces tools for ‘next generation’ DNA sequence data. This data is very large, consisting of 10s to 100s of millions of ‘reads’ (each 10s to 100s of nucleotides long), coupled with whole genome sequences (e.g., 3 billion nucleotides in the human genome). BSgenome provides a foundation for representing whole-genome sequences; there are 13 model organisms represented by 18 distinct genome builds, and facilities for users to create their own genome packages. Functionality in the Biostrings package performs fast or flexible alignments between reads and genomes (see the short example at the end of this article). ShortRead provides tools for import and exploration of common data formats. IRanges offers an emerging infrastructure for representing very large data objects, and for range-based representations. The rtracklayer package interfaces with genome browsers and their track layers. HilbertVis and HilbertVisGUI provide an example of creative approaches to visualizing this data, using space-filling (Hilbert) curves that maintain, as much as possible, the spatial information implied by linear chromosomes.

Other activities

Bioconductor package maintainers and the Bioconductor team invest considerable effort in producing high-quality software. New packages are reviewed on both technical and scientific merit, before being added to the ‘development’ roster for the next release. Release packages are built daily on Linux, Windows, and Mac platforms, tracking the most recent released versions of R and the packages hosted on CRAN repositories. The active Bioconductor mailing lists (http://bioconductor.org/docs/mailList.html) connect users with each other, to domain experts, and to maintainers eager to ensure that their packages satisfy the needs of leading edge approaches. The Bioconductor community meets at our annual conference in Seattle; well over 100 individuals attended the July meeting for a combination of scientific talks and hands-on tutorials.

Looking forward

This will be a dynamic release cycle. New contributed packages are already under review, and our build machines have started tracking the latest development versions of R. It is hard to know what the future holds, but past experience points to surprise: at the creativity, curiosity, enthusiasm, and commitment of package developers, and at the speed of technological change and growth of data. In addition to development of high-quality algorithms to address microarray data analysis, we anticipate continued efforts to leverage diverse external data sources and to meet the challenges of presenting high volume data in rich graphical contexts.

High throughput sequencing technologies represent a particularly likely area for development. First-generation approaches with relatively short reads, restricted application domains, and small numbers of sample individuals are being supplanted by newer technologies producing longer and more numerous reads. New protocols and the intrinsic curiosity of biologists are expanding the range of questions being addressed, and creating a concomitant need for flexible software analysis tools. The increasing affordability of high throughput sequencing technologies means that multi-sample studies with non-trivial experimental designs are just around the corner. The statistical analytic abilities and flexibility of R and Bioconductor represent ideal tools for addressing these challenges.
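As a small taste of the sequence-handling layer described under “High-throughput sequencing” above, the following sketch is ours (not the Bioconductor team’s); it uses only two long-standing Biostrings operations, and the subject and read sequences are made up for illustration:

    ## A made-up subject sequence and a short 'read'.
    library(Biostrings)
    subject <- DNAString("ACGTACGTTAGCACGT")
    read <- DNAString("ACGT")

    ## Exact matches of the read against the subject; the result
    ## is a views object giving the start and end of every hit.
    matchPattern(read, subject)

    ## Reads from the opposite strand can be matched via their
    ## reverse complement.
    matchPattern(reverseComplement(read), subject)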

useR! 2008

The international R user conference ‘useR! 2008’ took place at the Technische Universität Dortmund, Dortmund, Germany, August 12-14, 2008.

This world-wide meeting of the R user community focussed on

• R as the ‘lingua franca’ of data analysis and statistical computing;

• providing a platform for R users to discuss and exchange ideas about how R can be used to do statistical computations, data analysis, visualization and exciting applications in various fields;

• giving an overview of the new features of the rapidly evolving R project.

The program comprised invited lectures, user-contributed sessions and pre-conference tutorials.

More than 400 participants from all over the world met in Dortmund and heard more than 170 talks. It was a pleasure for us to see that the lecture rooms were available after heavy weather and some serious flooding of the building two weeks before the conference. People were carrying on discussions the whole time: before sessions, during the coffee and lunch breaks, and during the social events such as

• the reception in the Dortmund city hall by the Mayor (who talked about the history, present, and future of Dortmund), and

• the conference dinner in the ‘Signal Iduna Park’ stadium, the famous home ground of the football (i.e. soccer) team ‘Borussia Dortmund’. Buses brought the participants to the stadium where a virtual game was going on: at least, noise was played, made by the more than 80000 supporters who regularly attend games in the stadium. After the dinner, Marc Schwartz got the birthday cake he really deserves for his work in the R community.

It was a great pleasure to meet so many people from the whole R community in Dortmund, people who formerly knew each other only from ‘virtual’ contact by email and had never met in person.

Invited lectures

Those who attended had the opportunity to listen to several talks. Invited speakers talked about several hot topics, including:

• Peter Bühlmann: Computationally Tractable Methods for High-Dimensional Data

• John Fox and Kurt Hornik: The Past, Present, and Future of the R Project, a double-feature presentation including Social Organization of the R Project (John Fox) and Development in the R Project (Kurt Hornik)

• Andrew Gelman: Bayesian Generalized Linear Models and an Appropriate Default Prior

• Gary King: The Dataverse Network

• Duncan Murdoch: Package Development in Windows

• Jean Thioulouse: Multivariate Data Analysis in Microbial Ecology – New Skin for the Old Ceremony

• Graham J. Williams: Deploying Data Mining in Government – Experiences With R/Rattle

User-contributed sessions

User-contributed sessions brought together R users, contributors, package maintainers and developers in the S spirit that ‘users are developers’. People from different fields showed us how they solve problems with R in fascinating applications. The scientific program was organized by members of the program committee, including Micah Altman, , Peter Dalgaard, Jan de Leeuw, Ramón Díaz-Uriarte, Spencer Graves, Leonhard Held, Torsten Hothorn, François Husson, Christian Kleiber, Friedrich Leisch, Andy Liaw, Martin Mächler, Kate Mullen, Ei-ji Nakama, Thomas Petzoldt, Martin Theus, and Heather Turner, and covered topics such as

• Applied Statistics & Biostatistics
• Bayesian Statistics
• Bioinformatics
• Chemometrics and Computational Physics
• Data Mining
• Econometrics & Finance
• Environmetrics & Ecological Modeling
• High Performance Computing
• Machine Learning
• Marketing & Business Analytics
• Psychometrics
• Robust Statistics
• Sensometrics
• Spatial Statistics
• Statistics in the Social and Political Sciences
• Teaching
• Visualization & Graphics
• and many more

Talks in Kaleidoscope sessions presented interesting topics to a broader audience, while talks in Focus sessions led to intensive discussions on the more focussed topics. As was the case at former useR! conferences, the participants listened to talks the whole time; it turned out that the organizers underestimated the interest of the participants in some sessions and chose lecture rooms that were too small to accommodate all who wanted to attend.


Pre-conference tutorials

Before the start of the official program, half-day tutorials were offered on Monday, August 11.

In the morning:

• Douglas Bates: Mixed Effects Models
• Julie Josse, François Husson, Sébastien Lê: Exploratory Data Analysis
• Martin Mächler, Elvezio Ronchetti: Introduction to Robust Statistics with R
• Jim Porzak: Using R for Customer Segmentation
• Stefan Rüping, Michael Mock, and Dennis Wegener: Distributed Data Analysis Using R
• Jing Hua Zhao: Analysis of Complex Traits Using R: Case studies

In the afternoon:

• Karim Chine: Distributed R and Bioconductor for the Web
• Dirk Eddelbuettel: An Introduction to High-Performance R
• Andrea S. Foulkes: Analysis of Complex Traits Using R: Statistical Applications
• Virgilio Gómez-Rubio: Small Area Estimation with R
• Frank E. Harrell, Jr.: Regression Modelling Strategies
• Sébastien Lê, Julie Josse, François Husson: Multi-way Data Analysis
• Bernhard Pfaff: Analysis of Integrated and Cointegrated Time Series

At the end of the day, Torsten Hothorn and Fritz Leisch opened the Welcome mixer by talking about the close relationship among R, the useR! organizers, and Dortmund.

More information

A web page offering more information on ‘useR! 2008’ is available at http://www.R-project.org/useR-2008/. That page now includes slides from many of the presentations as well as some photos of the event. Those who could not attend the conference should feel free to browse and see what they missed, as should those who attended but were not able to make it to all of the presentations that they wanted to see.

The organizing committee:
Uwe Ligges, Achim Zeileis, Claus Weihs, Gerd Kopp, Friedrich Leisch, and Torsten Hothorn
[email protected]


Forthcoming Events: useR! 2009

The international R user conference ‘useR! 2009’ will take place at the Agrocampus, Rennes, France, July 8-10, 2009.

This world meeting of the R user community will focus on

• R as the ‘lingua franca’ of data analysis and statistical computing,

• providing a platform for R users to discuss and exchange ideas about how R can be used to do statistical computations, data analysis, visualization and exciting applications in various fields,

• giving an overview of the new features of the rapidly evolving R project.

The program comprises invited lectures, user-contributed sessions and pre-conference tutorials.

Invited lectures

R has become the standard computing engine in more and more disciplines, both in academia and the business world. How R is used in different areas will be presented in hot topics, as evidenced by the list of invited speakers: Adele Cutler, Peter Dalgaard, Jerome H. Friedman, Michael Greenacre, Trevor Hastie, John Storey and Martin Theus.

User-contributed sessions

The sessions will be a platform to bring together R users, contributors, package maintainers and developers in the S spirit that ‘users are developers’. People from different fields will show us how they solve problems with R in fascinating applications. The scientific program will include Douglas Bates, Vincent Carey, Arthur Charpentier, Pierre-André Cornillon, Nicolas Hengartner, Torsten Hothorn, François Husson, Jan de Leeuw, Friedrich Leisch, Ching Fan Sheu and Achim Zeileis. The program will cover topics of current interest such as

• Applied Statistics & Biostatistics
• Bayesian Statistics
• Bioinformatics
• Chemometrics and Computational Physics
• Data Mining
• Econometrics & Finance
• Environmetrics & Ecological Modeling
• High Performance Computing
• Machine Learning
• Marketing & Business Analytics
• Psychometrics
• Robust Statistics
• Sensometrics
• Spatial Statistics
• Statistics in the Social and Political Sciences
• Teaching
• Visualization & Graphics
• and many more.

Pre-conference tutorials

Before the start of the official program, half-day tutorials will be offered on Tuesday, July 7.

Call for papers

We invite all R users to submit abstracts on topics presenting innovations or exciting applications of R. A web page offering more information on ‘useR! 2009’ is available at:

http://www.R-project.org/useR-2009/

Abstract submission and registration will start in November 2008.

The organizing committee, David Causeur, Julie Josse, François Husson, Maela Kloareg, Sébastien Lê, Eric Matzner-Løber and Jérôme Pagès, will be very happy to meet you in Rennes, France, the capital of Brittany, famous for its architecture (the Parlement de Bretagne, private mansions, ...), its night life (Festival des tombées de la nuit) and its cuisine (the Festival Gourmand, the Lices Market, ...).


Forthcoming Events: DSC 2009

Hot on the heels of ‘useR! 2009’, the sixth international workshop on Directions in Statistical Computing (DSC 2009) will take place at the Center for Health and Society, University of Copenhagen, Denmark, 13th-14th of July 2009.

The workshop will focus on, but is not limited to, open source statistical computing, and aims to provide a platform for exchanging ideas about developments in statistical computing (rather than ‘only’ the usage of statistical software for applications).

Traditionally, this is a workshop with a flatter structure than the useR! series, mainly designed for formal and informal discussions among developers of statistical software. Consequently there are no invited speakers, although the programme committee will select some talks for plenary presentations.

Call for papers

It is expected that a large proportion of participants will bring a presentation. We particularly welcome contributions advancing statistical computing as an independent topic. Abstracts can be submitted via the website once it becomes fully operational. A web page offering more information on ‘DSC 2009’ is available at:

http://www.R-project.org/dsc-2009/

Abstract submission and registration will start in November 2008.

The organizing committee, Peter Dalgaard, Claus Ekstrøm, Klaus K. Holst and Søren Højsgaard, will be very happy to meet you in Copenhagen at the height of summer. When the weather is good, it is extremely beautiful here, not too hot and with long summer evenings (it might also rain for days on end, though).

[email protected]

R Foundation News

by Kurt Hornik

Donations and new members

New benefactors

REvolution Computing, USA

New supporting members

Peter Ehlers, Canada
Chel Hee Lee, Canada

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


Editor-in-Chief:
John Fox
Department of Sociology
McMaster University
1280 Main Street West
Hamilton, Ontario
Canada L8S 4M4

Editorial Board:
Vincent Carey and Peter Dalgaard.

Editor Programmer’s Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor and all other submissions to the editor-in-chief or another member of the editorial board. More detailed submission instructions can be found on the R homepage.

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

ISSN 1609-3631
