The R Journal Vol. 1/1, May 2009

A peer-reviewed, open-access publication of the R Foundation for Statistical Computing

Contents

Editorial ...... 3

Special section: The Future of R
Facets of R ...... 5
Collaborative Software Development Using R-Forge ...... 9

Contributed Research Articles
Drawing Diagrams with R ...... 15
The hwriter Package ...... 22
AdMit: Adaptive Mixtures of Student-t Distributions ...... 25
expert: Modeling Without Data Using Expert Opinion ...... 31
mvtnorm: New Numerical Algorithm for Multivariate Normal Probabilities ...... 37
EMD: A Package for Empirical Mode Decomposition and Hilbert Spectrum ...... 40
Sample Size Estimation while Controlling False Discovery Rate for Microarray Experiments Using the ssize.fdr Package ...... 47
Easier Parallel Computing in R with snowfall and sfCluster ...... 54
PMML: An Open Standard for Sharing Models ...... 60

News and Notes
Forthcoming Events: Conference on Quantitative Social Science Research Using R ...... 66
Forthcoming Events: DSC 2009 ...... 66
Forthcoming Events: useR! 2009 ...... 67
Conference Review: The 1st Chinese R Conference ...... 69
Changes in R 2.9.0 ...... 71
Changes on CRAN ...... 77
News from the Bioconductor Project ...... 91
R Foundation News ...... 92

The R Journal is a peer-reviewed publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor and all other submissions to the editor-in-chief or another member of the editorial board. More detailed submission instructions can be found on the Journal's homepage.

Editor-in-Chief:
Vince Carey
Channing Laboratory
Brigham and Women's Hospital
75 Francis St.
Boston, MA 02115
USA

Editorial Board: John Fox, Heather Turner, and Peter Dalgaard.

Editor Programmer’s Niche: Bill Venables

Editor Help Desk: Uwe Ligges

R Journal Homepage: http://journal.r-project.org/

Email of editors and editorial board: [email protected]


Editorial

by Vince Carey

Early in the morning of January 7, 2009 — very early, from the perspective of readers in Europe and the Americas, as I was in Almaty, Kazakhstan — I noticed a somewhat surprising title on the front web page of the New York Times: "Data Analysts Captivated by R's Power". My prior probability that a favorite programming language would find its way to the Times front page was rather low, so I navigated to the article to check. There it was: an early version, minus the photographs. Later a few nice pictures were added, and a Bits Blog (http://bits.blogs.nytimes.com/2009/01/08/r-you-ready-for-r/) created. As of early May, the blog includes 43 entries received from January 8 to March 12 and includes comment from Trevor Hastie clarifying historical roots of R in S, from Alan Zaslavsky reminding readers of John Chambers's ACM award and its role in funding an American Statistical Association prize for students' software development, and from numerous others on aspects of various proprietary and non-proprietary solutions to data analytic computing problems. Somewhat more recently, the Times blog was back in the fold, describing a SAS-to-R interface (February 16). As a native New Yorker, it is natural for me to be aware, and glad, of the Times' coverage of R; consumers of other mainstream media products are invited to notify me of comparable coverage events elsewhere.

As R's impact continues to expand, R News has transitioned to The R Journal. You are now reading the first issue of this new periodical, which continues the tradition of the newsletter in various respects: articles are peer-reviewed, copyright of articles is retained by authors, and the journal is published electronically through an archive of freely downloadable PDF issues, easily found on the R project homepage and on CRAN. My role as Editor-In-Chief of this first issue of the new Journal is purely accidental — the transition has been in development for several years now, with critical guidance and support from the R Foundation, from a number of R Core members, and from previous editors. I am particularly thankful for John Fox's assistance with this transition, and we are all indebted to co-editors Peter Dalgaard and Heather Turner for technical work on the new Journal's computational infrastructure, undertaken on top of already substantial commitments of editorial effort.

While contemplating the impending step into a new publishing vehicle, John Fox conceived a series of special invited articles addressing views of "The Future of R". This issue includes two articles in this projected series. In the first, John Chambers analyzes R into facets that help us to understand the 'richness' and 'messiness' of R's programming model. In the second, Stefan Theußl and Achim Zeileis describe R-Forge, a new infrastructure for contributing and managing source code of packages destined for CRAN. A number of other future-oriented articles have been promised for this series; we expect to issue a special edition of the Journal collecting these when all have arrived.

The peer-reviewed contributions section includes material on programming for diagram construction, HTML generation, probability model elicitation, integration of the multivariate normal density, Hilbert spectrum analysis, microarray study design, parallel computing support, and the predictive modeling markup language. Production of this issue involved voluntary efforts of 22 referees on three continents, who will be acknowledged in the year-end issue.

Vince Carey
Channing Laboratory, Brigham and Women's Hospital
[email protected]



Facets of R

Special invited paper on "The Future of R"

by John M. Chambers

We are seeing today a widespread, and welcome, tendency for non-computer-specialists among statisticians and others to write collections of R functions that organize and communicate their work. Along with the flood of software sometimes comes an attitude that one need only learn, or teach, a sort of basic how-to-write-the-function level of R programming, beyond which most of the detail is unimportant or can be absorbed without much discussion. As delusions go, this one is not very objectionable if it encourages participation. Nevertheless, a delusion it is. In fact, functions are only one of a variety of important facets that R has acquired by intent or circumstance during the three-plus decades of the history of the software and of its predecessor S. To create valuable and trustworthy software using R often requires an understanding of some of these facets and their interrelations. This paper identifies six facets, discussing where they came from, how they support or conflict with each other, and what implications they have for the future of programming with R.

Facets

Any software system that has endured and retained a reasonably happy user community will likely have some distinguishing characteristics that form its style. The characteristics of different systems are usually not totally unrelated, at least among systems serving roughly similar goals—in computing as in other fields, the set of truly distinct concepts is not that large. But in the mix of different characteristics and in the details of how they work together lies much of the flavor of a particular system.

Understanding such characteristics, including a bit of the historical background, can be helpful in making better use of the software. It can also guide thinking about directions for future improvements of the system itself.

The R software and the S software that preceded it reflect a rather large range of characteristics, resulting in software that might be termed rich or messy according to one's taste. Since R has become a very widely used environment for applications and research in data analysis, understanding something of these characteristics may be helpful to the community. This paper considers six characteristics, which we will call facets. They characterize R as:

1. an interface to computational procedures of many kinds;
2. interactive, hands-on in real time;
3. functional in its model of programming;
4. object-oriented, "everything is an object";
5. modular, built from standardized pieces; and,
6. collaborative, a world-wide, open-source effort.

None of these facets is original to R, but while many systems emphasize one or two of them, R continues to reflect them all, resulting in a programming model that is indeed rich but sometimes messy.

The thousands of R packages available from CRAN, BioConductor, R-Forge, and other repositories, the uncounted other software contributions from groups and individuals, and the many citations in the scientific literature all testify to the useful computations created within the R model. Understanding how the various facets arose and how they can work together may help us improve and extend that software.

We introduce the facets more or less chronologically, in three pairs that entered into the software during successive intervals of roughly a decade. To provide some context, here is a brief chronology, with some references. The S project began in our statistics research group at Bell Labs in 1976, evolved into a generally licensed system through the 1980s and continues in the S+ software, currently owned by TIBCO Software Inc. The history of S is summarized in the Appendix to Chambers (2008); the standard books introducing still-relevant versions of S include Becker et al. (1988) (no longer in print), Chambers and Hastie (1992), and Chambers (1998). R was announced to the world in Ihaka and Gentleman (1996) and evolved fairly soon to be developed and managed by the R-core group and other contributors. Its history is harder to pin down, partly because the story is very much still happening and partly because, as a collaborative open-source project, individual responsibility is sometimes hard to attribute; in addition, not all the important new ideas have been described separately by their authors. The http://www.r-project.org site points to the on-line manuals, as well as a list of over 70 related books and other material. The manuals are invaluable as a technical resource, but short on background information. Articles in R News, the predecessor of this journal, give descriptions of some of the key contributions, such as namespaces (Tierney, 2003), and internationalization (Ripley, 2005).

An interactive interface

The goals for the first version of S in 1976 already combined two facets of computing usually kept apart, interactive computing and the development of procedural software for scientific applications.

One goal was a better interface to new computational procedures for scientific data analysis, usually implemented then as Fortran subroutines. Bell Labs statistics research had collected and written a variety of these while at the same time numerical libraries and published algorithm collections were establishing many of the essential computational techniques. These procedures were the essence of the actual data analysis; the initial goal was essentially a better interface to procedural computation. A graphic from the first plans for S (Chambers, 2008, Appendix, page 476) illustrated this.

But that interface was to be interactive, specifically via a language used by the data analyst communicating in real time with the software. At this time, a few systems had pointed the way for interactive scientific computing, but most of them were highly specialized with limited programming facilities; that is, a limited ability for the user to express novel computations. The most striking exception was APL, created by Kenneth Iverson (1962), an operator-based interactive language with heavy emphasis on general multiway arrays. APL introduced conventions that now seem obvious; for example, the result of evaluating a user's expression was either an assignment or a printed display of the computed object (rather than expecting users to type a print command). But APL at the time had no notion of an interface to procedures, which along with a limited, if exotic, syntax caused us to reject it, although S adopted a number of features from it, such as its approach to general multidimensional arrays.

From its first implementation, the S software combined these two facets: users carried out data analysis interactively by writing expressions in the S language, but much of the programming to extend the system was via new Fortran subroutines accompanied by interfaces to call them from S. The interfaces were written in a special language that was compiled into Fortran.

The result was already a mixed system that had much of the tension between human and machine efficiency that still stimulates debate among R users and programmers. At the same time, the flexibility from that same mixture helped the software to grow and adapt with a growing user community.

Later versions of the system replaced the interface language with individual functions providing interfaces to C or Fortran, but the combination of an interactive language and an interface to compiled procedures remains a key feature of R. Increasingly, interfaces to other languages and systems have been added as well, for example to database systems, spreadsheets and user interface software.

Functional and object-oriented programming

During the 1980s the topics of functional programming and object-oriented programming stimulated growing interest in the computer science literature. At the same time, I was exploring possibilities for the next S, perhaps "after S". These ideas were eventually blended with other work to produce the "new S", later called Version 3 of S, or S3 (Becker et al., 1988).

As before, user expressions were still interpreted, but were now parsed into objects representing the language elements, mainly function calls. The functions were also objects, the result of parsing expressions in the language that defined the function. As the motto expressed it, "Everything is an object". The procedural interface facet was not abandoned, however, but from the user's perspective it was now defined by functions that provided an interface to a C or Fortran subroutine.

The new programming model was functional in two senses. It largely followed the functional programming paradigm, which defined programming as the evaluation of function calls, with the value uniquely determined by the objects supplied as arguments and with no external side-effects. Not all functions followed this strictly, however.

The new version was functional also in the sense that nearly everything that happened in evaluation was a function call. Evaluation in R is even more functional in that language components such as assignment and loops are formally function calls internally, reflecting in part the influence of Lisp on the initial implementation of R.

Subsequently, the programming model was extended to include classes of objects and methods that specialized functions according to the classes of their arguments. That extension itself came in two stages. First, a thin layer of additional code implemented classes and methods by two simple mechanisms: objects could have an attribute defining their class as a character string or a vector of multiple strings; and an explicitly invoked method dispatch function would match the class of the function's first argument against methods, which were simply saved function objects with a class string appended to the object's name. In the second stage, the version of the system known as S4 provided formal class definitions, stored as a special metadata object, and similarly formal method definitions associated with generic functions, also defined via object classes (Chambers, 2008, Chapters 9 and 10, for the R version).
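The two simple S3-stage mechanisms described above survive essentially unchanged in today's R. The following small illustration is not from the paper; it is a minimal sketch of a class attribute plus dispatch on the class of a function's first argument.

# A class is just an attribute attached to an ordinary object ...
acc <- structure(list(owner = "Ada", balance = 100), class = "account")

# ... and a method is an ordinary function whose name is the generic's
# name with the class string appended; print() dispatches on the class
# of its first argument and so selects print.account() here.
print.account <- function(x, ...) {
  cat(x$owner, "has a balance of", x$balance, "\n")
  invisible(x)
}

acc   # auto-printing calls print(), which dispatches to print.account()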

The functional and object-oriented facets were combined in S and R, but they appear more frequently in separate, incompatible languages. Typical object-oriented languages such as C++ or Java are not functional; instead, their programming model is of methods associated with a class rather than with a function and invoked on an object. Because the object is treated as a reference, the methods can usually modify the object, again quite a different model from that leading to R. The functional use of methods is more complicated, but richer and indeed necessary to support the essential role of functions. A few other languages do combine functional structure with methods, such as Dylan (Shalit, 1996, chapters 5 and 6), and comparisons have proved useful for R, but came after the initial design was in place.

Modular design and collaborative support

Our third pair of facets arrives with R. The paper by Ihaka and Gentleman (1996) introduced a statistical software system, characterized later as "not unlike S", but distinguished by some ideas intended to improve on the earlier system. A facet of R related to some of those ideas and to important later developments is its approach to modularity. In a number of fields, including architecture and computer science, modules are units designed to be useful in themselves and to fit together easily to create a desired result, such as a building or a software system. R has two very important levels of modules: functions and packages.

Functions are an obvious modular unit, especially functions as objects. R extended the modularity of functions by providing them with an environment, usually the environment in which the function object was created. Objects assigned in this environment can be accessed by name in a call to the function, and even modified, by stepping outside the strict functional programming model. Functions sharing an environment can thus be used together as a unit. The programming technique resulting is usually called programming with closures in R; for a discussion and comparison with other techniques, see (Chambers, 2008, section 5.4). Because the evaluator searches for external objects in the function's environment, dependencies on external objects can be controlled through that environment. The namespace mechanism (Tierney, 2003), an important contribution to trustworthy programming with R, uses this technique to help ensure consistent behavior of functions in a package.
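As a minimal sketch of the closure idea just described (this example is illustrative, not taken from the paper), two functions created in the same call share the environment of that call and can use it as their private state:

makeCounter <- function() {
  count <- 0                     # lives in the shared enclosing environment
  list(
    increment = function() {
      count <<- count + 1        # modifies the shared environment,
      invisible(count)           # stepping outside strict functional style
    },
    value = function() count     # reads the same environment
  )
}

counter <- makeCounter()
counter$increment()
counter$increment()
counter$value()   # returns 2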

The larger module introduced by R is probably the most important for the system's growing use, the package. As it has evolved, and continues to evolve, an R package is a module combining in most cases R functions, their documentation, and possibly data objects and/or code in other languages.

While S always had the idea of collections of software for distribution, somewhat formalized in S4 as chapters, the R package greatly extends and improves the ability to share computing results as effective modules. The mechanism benefits from some essential tools provided in R to create, develop, test, and distribute packages. Among many benefits for users, these tools help programmers create software that can be installed and used on the three main operating systems for computing with data—Windows, Linux, and Mac OS X.

The sixth facet for our discussion is that R is a collaborative enterprise, which a large group of people share in and contribute to. Central to the collaborative facet is, of course, that R is a freely available, open-source system, both the core R software and most of the packages built upon it. As most readers of this journal will be aware, the combination of the collaborative efforts and the package mechanism have enormously increased the availability of techniques for data analysis, especially the results of new research in statistics and its applications. The package management tools, in fact, illustrate the essential role of collaboration. Another highly collaborative contribution has facilitated use of R internationally (Ripley, 2005), both by the use of locales in computations and by the contribution of translations for R documentation and messages.

While the collaborative facet of R is the least technical, it may eventually have the most profound implications. The growth of the collaborative community involved is unprecedented. For example, a plot in Fox (2008) shows that the number of packages in the CRAN repository exhibited literal exponential growth from 2001 to 2008. The current size of this and other repositories implies at a conservative estimate that several hundreds of contributors have mastered quite a bit of technical expertise and satisfied some considerable formal requirements to offer their efforts freely to the worldwide scientific community.

This facet does have some technical implications. One is that, being an open-source system, R is generally able to incorporate other open-source software and does indeed do so in many areas. In contrast, proprietary systems are more likely to build extensions internally, either to maintain competitive advantage or to avoid the free distribution requirements of some open-source licenses. Looking for quality open-source software to extend the system should continue to be a favored strategy for R development.

On the downside, a large collaborative enterprise with a general practice of making collective decisions has a natural tendency towards conservatism. Radical changes do threaten to break currently working features. The future benefits they might bring will often not be sufficiently persuasive. The very success of R, combined with its collaborative facet, poses a challenge to cultivate the "next big step" in software for data analysis.

Implications

A number of specific implications of the facets of R, and of their interaction, have been noted in previous sections. Further implications are suggested in considering the current state of R and its future possibilities.

Many reasons can be suggested for R's increasing popularity. Certainly part of the picture is a productive interaction among the interface facet, the modularity of packages, and the collaborative, open-source approach, all in an interactive mode. From the very beginning, S was not viewed as a set of procedures (the model for SAS, for example) but as an interactive system to support the implementation of new procedures. The growth in packages noted previously reflects the same view.

The emphasis on interfaces has never diminished and the range of possible procedures across the interface has expanded greatly. That R's growth has been especially strong in academic and other research environments is at least partly due to the ability to write interfaces to existing and new software of many forms. The evolution of the package as an effective module and of the repositories for contributed software from the community have resulted in many new interfaces.

The facets also have implications for the possible future of R. The hope is that R can continue to support the implementation and communication of important techniques for data analysis, contributed and used by a growing community. At the same time, those interested in new directions for statistical computing will hope that R or some future alternative engages a variety of challenges for computing with data. The collaborative community seems the only likely generator of the new features required but, as noted in the previous section, combining the new with maintenance of a popular current collaborative system will not be easy.

The encouragement for new packages generally and for interfaces to other software in particular are relevant for this question also. Changes that might be difficult if applied internally to the core R implementation can often be supplied, or at least approximated, by packages. Two examples of enhancements considered in discussions are an optional object model using mutable references and computations using multiple threads. In both cases, it can be argued that adding the facility to R could extend its applicability and performance. But a substantial programming effort and knowledge of the current implementation would be required to modify the internal object model or to make the core code thread-safe. Issues of back compatibility are likely to arise as well. Meanwhile, less ambitious approaches in the form of packages that approximate the desired behavior have been written. For the second example especially, these may interface to other software designed to provide this facility.

Opinions may reasonably differ on whether these "workarounds" are a good thing or not. One may feel that the total effort devoted to multiple approximate solutions would have been more productive if coordinated and applied to improving the core implementation. In practice, however, the nature of the collaborative effort supporting R makes it much easier to obtain a modest effort from one or a few individuals than to organize and execute a larger project. Perhaps the growth of R will encourage the community to push for such projects, with a new view of funding and organization. Meanwhile, we can hope that the growing communication facilities in the community will assess and improve on the existing efforts. Particularly helpful facilities in this context include the CRAN task views (http://cran.r-project.org/web/views/), the mailing lists and publications such as this journal.

Bibliography

R. A. Becker, J. M. Chambers, and A. R. Wilks. The New S Language. Chapman & Hall, London, 1988.

J. M. Chambers. Programming with Data. Springer, New York, 1998. URL http://cm.bell-labs.com/cm/ms/departments/sia/Sbook/. ISBN 0-387-98503-4.

J. M. Chambers. Software for Data Analysis: Programming with R. Springer, New York, 2008. ISBN 978-0-387-75935-7.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 9780412830402.

J. Fox. Editorial. R News, 8(2):1–2, October 2008. URL http://CRAN.R-project.org/doc/Rnews/.

R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, 1996.

K. E. Iverson. A Programming Language. Wiley, 1962.

B. D. Ripley. Internationalization features of R 2.1.0. R News, 5(1):2–7, May 2005. URL http://CRAN.R-project.org/doc/Rnews/.

A. Shalit. The Dylan Reference Manual. Addison-Wesley Developers Press, Reading, Mass., 1996. ISBN 0-201-44211-6.

L. Tierney. Name space management for R. R News, 3(1):2–6, June 2003. URL http://CRAN.R-project.org/doc/Rnews/.

John M. Chambers
Department of Statistics
Stanford University
USA
[email protected]

Collaborative Software Development Using R-Forge

Special invited paper on "The Future of R"

by Stefan Theußl and Achim Zeileis

Introduction

Open source software (OSS) is typically created in a decentralized self-organizing process by a community of developers having the same or similar interests (see the famous essay by Raymond, 1999). A key factor for the success of OSS over the last two decades is the Internet: Developers who rarely meet face-to-face can employ new means of communication, both for rapidly writing and deploying software (in the spirit of Linus Torvalds' "release early, release often" paradigm). Therefore, many tools emerged that assist a collaborative software development process, including in particular tools for source code management (SCM) and version control.

In the R world, SCM is not a new idea; in fact, the R Development Core Team has always been using SCM tools for the R sources, first by means of Concurrent Versions System (CVS, see Cederqvist et al., 2006), and then via Subversion (SVN, see Pilato et al., 2004). A central repository is hosted by ETH Zürich mainly for managing the development of the base R system. Mailing lists like R-help, R-devel and many others are currently the main communication channels in the R community.

Also beyond the base system, many R contributors employ SCM tools for managing their R packages, e.g., via web-based SVN repositories like SourceForge (http://SourceForge.net/) or Google Code (http://Code.Google.com/). However, there has been no central SCM repository providing services suited to the specific needs of R package developers. Since early 2007, the R-project offers such a central platform to the R community. R-Forge (http://R-Forge.R-project.org/) provides a set of tools for source code management and various web-based features. It aims to provide a platform for collaborative development of R packages, R-related software or further projects. R-Forge is closely related to the most famous of such platforms—the world's largest OSS development website—namely http://SourceForge.net/.

The remainder of this article is organized as follows. First, we present the core features that R-Forge offers to the R community. Second, we give a hands-on tutorial on how users and developers can get started with R-Forge. In particular, we illustrate how people can register, set up new projects, use R-Forge's SCM facilities, provide their packages on R-Forge, host a project-specific website, and how package maintainers submit a package to the Comprehensive R Archive Network (CRAN, http://CRAN.R-project.org/). Finally, we summarize recent developments and give a brief outlook to future work.

R-Forge

R-Forge offers a central platform for the development of R packages, R-related software and other projects. R-Forge is based on GForge (Copeland et al., 2006) which is an open source fork of the 2.61 SourceForge code maintained by Tim Perdue, one of the original SourceForge authors. GForge has been modified to provide additional features for the R community, namely a CRAN-style repository for hosting development releases of R packages as well as a quality management system similar to that of CRAN. Packages hosted on R-Forge are provided in source form as well as in binary form for Mac OS X and Windows. They can be downloaded from the website of the corresponding project on R-Forge or installed directly in R; for a package foo, say, install.packages("foo", repos = "http://R-Forge.R-project.org").
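As a concrete version of the installation route just described (here "foo" is simply the placeholder package name used in the text, not a real package):

# Install the development version of a package hosted on R-Forge and load it
install.packages("foo", repos = "http://R-Forge.R-project.org")
library("foo")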
On R-Forge, developers organize their work in so-called "Projects". Every project has various tools and web-based features for software development, communication and other services. All features mentioned in the following sections are accessible via so-called "Tabs": e.g., user accounts can be managed in the My Page tab or a list of available projects can be displayed using the Project Tree tab.

Since starting the platform in early 2007, more and more interested users registered their projects on R-Forge. Now, after more than two years of development and testing, around 350 projects and more than 900 users are registered on R-Forge. This and the steadily growing list of feature requests show that there is a high demand for centralized source code management tools and for releasing prototype code frequently among the R community.

In the next three sections, we summarize the core features of R-Forge and what R-Forge offers to the R community in terms of collaborative development of R-related software projects.

Source code management

When carrying out software projects, source files change over time, new files get created and old files deleted. Typically, several authors work on several computers on the same and/or different files and keeping track of every change can become a tedious task.

In the open source community, the general solution to this problem is to use version control, typically provided by the majority of SCM tools. For this reason R-Forge utilizes SVN to facilitate the developer's work when creating software.

A central repository ensures that the developer always has access to the current version of the project's source code. Any of the authorized collaborators can "check out" (i.e., download) or "update" the project file structure, make the necessary changes or additions, delete files from the current revision and finally "commit" changes or additions to the repository. More than that, SVN keeps track of the complete history of the project file structure. At any point in the development stage it is possible to go back to any previous stage in the history to inspect and restore old files. This is called version control, as every stage automatically is assigned a unique version number which increases over time.

On R-Forge such a version-controlled repository is automatically created for each project. To get started, the project members just have to install the client of their choice (e.g., Tortoise SVN on Windows or svnX on Mac OS X) and check out the repository. In addition to the inherent backup of every version within the repository a backup of the whole repository is generated daily.

A rights management system assures that, by default, anonymous users have read access and developers have write access to the data associated with a certain project on R-Forge. More precisely, registered users can be granted one of several roles: e.g., the "Administrator" has all rights including the right to add new users to the project or release packages directly to CRAN. He/she is usually the package maintainer, the project leader or has registered the project originally. Other members of a project typically have either the role "Senior Developer" or "Junior Developer" which both have permission to commit to the project SVN repository and examine the log files in the R Packages tab (the differences between the two roles are subtle, e.g., senior developers additionally have administrative privileges in several places in the project). When we speak of developers in subsequent sections we refer to project members having the rights at least of a junior developer.

Release and quality management

Development versions of a software project are typically prototypes and are subject to many changes. Thus, R-Forge offers two tools which assist the developers in improving the quality of their source code.

First, it offers a quality management system similar to that of CRAN. Packages on R-Forge are checked in a standardized way on different platforms based on R CMD check at least once daily. The resulting log files can be examined by the project developers so that they can improve the package to pass all tests on R-Forge and subsequently on CRAN.

Second, bug tracking systems allow users to notify package authors about problems they encounter. In the spirit of OSS—given enough eyeballs, all bugs are shallow (Raymond, 1999)—peer code review leads to an overall improvement of the quality of software projects.

Additional features

A number of further tools, of increased interest for larger projects, help developers to coordinate their work and to communicate with their user base. These tools include:

• Project websites: Developers may present their work on a subdomain of R-Forge, e.g., http://foo.R-Forge.R-project.org/, or via a link to an external website.

• Mailing lists: By default a list [email protected] is automatically created for each project. Additional mailing lists can be set up easily.

• Project categorization: Administrators may categorize their project in several so-called "Trove Categories" in the Admin tab of their project (under Trove Categorization). For each category three different items can be selected. This enables users to quickly find what they are looking for using the Project Tree tab.

• News: Announcements and other information related to a project can be put on the project summary page as well as on the home page of R-Forge. The latter needs approval by one of the R-Forge administrators. All items are available as RSS feeds.

• Forums: Discussion forums can be set up separately by the project administrators.

How to get started

This section is intended to be a hands-on tutorial for new users. Depending on familiarity with the systems/tools involved the instructions might be somewhat brief. In case of problems we recommend consulting the user's manual (R-Forge Administration and Development Team, 2008) which contains detailed step-by-step instructions.

When accessing the URL http://R-Forge.R-project.org/ the home page is presented (see Figure 1). Here one can

• login,

• register a user or a project,

• download the documentation,

• examine the latest news and changes,

• go to a specific project website either by searching for available projects (top middle of the page), by clicking on one of the projects listed on the right, or by going through the listing in the Project Tree tab.


Figure 1: Home page of R-Forge on 2009-03-10

Registered users can access their personal page via the tab named My Page. Figure 2 shows the personal page of the first author.

Registering as a new user

To use R-Forge as a developer, one has to register as a site user. A link on the main web site called New Account on the top right of the home page leads to the corresponding registration form.

After submitting the completed form, an e-mail is sent to the given address containing a link for activating the account. Subsequently, all R-Forge features, including joining existing or creating new projects, are available to logged-in users.

Registering a project

There are two possibilities to register a project: Clicking on Register Your Project on the home page or going to the My Page tab and clicking on Register Project (see Figure 2 below the main tabs). Both links lead to a form which has to be filled out in order to finish the registration process. In the text field "Project Public Description" the registrant is supposed to enter a concise description of the project. This text and the "Project Full Name" will be shown in several places on R-Forge (see e.g., Figure 1). The text entered in the field "Project Purpose And Summarization" is additional information for the R-Forge administrators, inspected for approval. "Project Unix Name" refers to the name which uniquely determines the project. In the case of a project that contains a single R package, the project Unix name typically corresponds to the package name (in its lower-case version). Restrictions according to the Unix file system convention force Unix names to be in lower case (and will be converted automatically if they are typed in upper case).

After the completed form is submitted, the project has to be approved by the R-Forge administrators and a confirmation e-mail is sent to the registrant upon approval. After that, the registrant automatically becomes the project administrator and the standardized web area of the project (http://R-Forge.R-project.org/projects/foo/) is immediately available on R-Forge.


Figure 2: The My Page tab of the first author

This web area includes a Summary page, an Admin tab visible only to project administrators, and various other tabs depending on the features enabled for this project in the Admin tab. To present the new project to a broader community the name of the project additionally is promoted on the home page under "Recently Registered Projects" (see Figure 1).

Furthermore, within an hour after approval a default mailing list and the project's SVN repository containing a 'README' file and two pre-defined directories called 'pkg' and 'www' are created. The content of these directories is used by the R-Forge server for creating R packages and a project website, respectively.

SCM and R packages

The first step after creation of a project is typically to start generation of content for one (or more) R package(s) in the 'pkg' directory. Developers can either start committing changes via SVN as usual or—if the package is already version-controlled somewhere else—the corresponding parts of the repository including the history can be migrated to R-Forge (see Section 6 of the user's manual).

The SCM tab of a project explains how the corresponding SVN repository located at svn://svn.R-Forge.R-project.org/svnroot/foo can be checked out. From this URL the sources are checked out either anonymously without write permission (enabled by default) or as developer using an encrypted authentication method, namely secure shell (ssh). Via this secure protocol (svn+ssh:// followed by the registered user name) developers have full access to the repository. Typically, the developer is authenticated via the registered password but it is also possible to upload a secure shell key (updated hourly) to make use of public/private key authentication. Section 3.2 of the user's manual explains this process in more detail.

To make use of the package builds and checks the package source code has to be put into the 'pkg' directory of the repository (i.e., 'pkg/DESCRIPTION', 'pkg/R', 'pkg/man', etc.) or, alternatively, a subdirectory of 'pkg'. The latter structure allows developers to have more than one package in a single project; e.g., if a project consists of the packages foo and bar, then the source code is located in 'pkg/foo' and 'pkg/bar', respectively.

R-Forge automatically examines the 'pkg' directory of every repository and builds the package sources as well as the package binaries on a daily basis for Mac OS X and Windows (if applicable). The package builds are provided in the R Packages tab (see Figure 3) for download or can be installed directly in R using install.packages("foo", repos="http://R-Forge.R-project.org"). Furthermore, in the R Packages tab developers can examine logs of the build and check process on different platforms.

To release a package to CRAN the project administrator clicks on Submit this package to CRAN in the R Packages tab. Upon confirmation, a message will be sent to [email protected] and the latest successfully built source package (the '.tar.gz' file) is automatically copied to ftp://CRAN.R-project.org/incoming/. Note that packages are built once daily, i.e., the latest source package does not include more recent code committed to the SVN repository.

Figure 3 shows the R Packages tab of the tm project (http://R-Forge.R-project.org/projects/tm/) as one of the project administrators would see it. Depending on whether you are a member of the project or not and your rights you will see only parts of the information provided for each package.

Further steps

A customized project website, accessible through http://foo.R-Forge.R-project.org/ where foo corresponds to the unix name of the project, is managed via the 'www' directory. The website gets updated every hour.

The changes made to the project can be examined by entering the corresponding standardized web area. On entry, the Summary page is shown. Here, one can

• examine the details of the project including a short description and a listing of the administrators and developers,

• follow a link leading to the project homepage,

• examine the latest news announcements (if available),

• go to other sections of the project like Forums, Tracker, Lists, R Packages, etc.,

• follow the download link leading directly to the available packages of the project (i.e., the R Packages tab).

Furthermore, meta-information about a project can be supplied in the Admin tab via so-called "Trove Categorization".

Recent and future developments

In this section, we briefly summarize the major changes and updates to the R-Forge system during the last few months and give an outlook to future developments.

Recently added features and major changes include:

• New defaults for freshly registered projects: Only the tabs Lists, SCM and R packages are enabled initially. Forums, Tracker and News can be activated separately (in order not to overwhelm new users with too many features) using Edit Public Info in the Admin tab of the project. Experienced users can decide which features they want and activate them.

• An enhanced structure in the SVN repository allowing multiple packages in a single project (see above).

• The R package RForgeTools (Theußl, 2009) contains platform-independent package building and quality management code used on the R-Forge servers.

• A modified News submit page offering two types of submissions: project-specific and global news. The latter needs approval by one of the R-Forge administrators (default: project-only submission).

• Circulation of SVN commit messages can be enabled in the SCM Admin tab by project administrators. The mailing list mentioned in the text field is used for delivering the SVN commit messages (default: off).

• Mailing list search facilities provided by the Swish-e engine which can be accessed via the List tab (private lists are not included in the search index).

Further features and improvements which are currently on the wishlist or under development include:

• a Wiki,

• task management facilities,

• a re-organized tracker more compatible with R package development, and

• an update of the underlying GForge system to its successor FusionForge (http://FusionForge.org/).

For suggestions, problems, feature requests, and other questions regarding R-Forge please contact [email protected].


Figure 3: R Packages tab of the tm project

Acknowledgments

Setting up this project would not have been possible without Douglas Bates and the University of Wisconsin as they provided us with a server for hosting this platform. Furthermore, the authors would like to thank the Computer Center of the Wirtschaftsuniversität Wien for their support and for providing us with additional hardware as well as a professional server infrastructure.

Bibliography

P. Cederqvist et al. Version Management with CVS. Network Theory Limited, Bristol, 2006. Full book available online at http://ximbiot.com/cvs/manual/.

T. Copeland, R. Mas, K. McCullagh, T. Perdue, G. Smet, and R. Spisser. GForge Manual, 2006. URL http://gforge.org/gf/download/docmanfileversion/8/1700/gforge_manual.pdf.

C. M. Pilato, B. Collins-Sussman, and B. W. Fitzpatrick. Version Control with Subversion. O'Reilly, 2004. Full book available online at http://svnbook.red-bean.com/.

R-Forge Administration and Development Team. R-Forge User's Manual, 2008. URL http://download.R-Forge.R-project.org/R-Forge.pdf.

E. S. Raymond. The Cathedral and the Bazaar. Knowledge, Technology, and Policy, 12(3):23–49, 1999.

S. Theußl. RForgeTools: R-Forge Build and Check Tools, 2009. URL http://R-Forge.R-project.org/projects/site/. R package version 0.3-3.

Stefan Theußl
Department of Statistics and Mathematics
WU Wirtschaftsuniversität Wien
Austria
[email protected]

Achim Zeileis
Department of Statistics and Mathematics
WU Wirtschaftsuniversität Wien
Austria
[email protected]


Drawing Diagrams with R

by Paul Murrell

R provides a number of well-known high-level facilities for producing sophisticated statistical plots, including the "traditional" plots in the graphics package (R Development Core Team, 2008), the Trellis-style plots provided by lattice (Sarkar, 2008), and the grammar-of-graphics-inspired approach of ggplot2 (Wickham, 2009).

However, R also provides a powerful set of low-level graphics facilities for drawing basic shapes and, more importantly, for arranging those shapes relative to each other, which can be used to draw a wide variety of graphical images. This article highlights some of R's low-level graphics facilities by demonstrating their use in the production of diagrams. In particular, the focus will be on some of the useful things that can be done with the low-level facilities provided by the grid graphics package (Murrell, 2002, 2005b,a).

Starting at the end

An example of the type of diagram that we are going to work towards is shown below. We have several "boxes" that describe table schema for a database, with lines and arrows between the boxes to show the relationships between tables.

[Figure: a node-and-edge diagram of five database tables—book_table (ISBN, title, pub), publisher_table (ID, name, country), book_author_table (ID, book, author), author_table (ID, name, gender), and gender_table (ID, gender)—connected by curved arrows showing the relationships between the tables.]

To forestall some possible misunderstandings, the sort of diagram that we are talking about is one that is designed by hand. This is not a diagram that has been automatically laid out.

The sort of diagram being addressed is one where the author of the diagram has a clear idea of what the end result will roughly look like—the sort of diagram that can be sketched with pen and paper. The task is to produce a pre-planned design, using a computer to get a nice crisp result.

That being said, a reasonable question is "why not draw it by hand?", for example, using a free-hand drawing program such as Dia (Larsson, 2008). The advantage of using R code to produce this sort of image is that code is easier to reproduce, reuse, maintain, and fine-tune with accuracy. The thought of creating this sort of diagram by pushing objects around the screen with a mouse fills me with dread. Maybe I'm just not a very GUI guy.

Before we look at drawing diagrams with the core R graphics facilities, it is important to acknowledge that several contributed R packages already provide facilities for drawing diagrams. The Rgraphviz (Gentry et al., 2008) and igraph (Csardi and Nepusz, 2006) packages provide automated layout of node-and-edge graphs, and the shape and diagram packages (Soetaert, 2008b,a) provide functions for drawing nodes of various shapes with lines and arrows between them, with manual control over layout.

In this article, we will only be concerned with drawing diagrams with a small number of elements, so we do not need the automated layout facilities of Rgraphviz or igraph. Furthermore, while the shape and diagram packages provide flexible tools for building node-and-edge diagrams, the point of this article is to demonstrate low-level grid functions. We will use a node-and-edge diagram as the motivation, but the underlying ideas can be applied to a much wider range of applications.

In each of the following sections, we will meet a basic low-level graphical tool and demonstrate how it can be used in the generation of the pieces of an overall diagram, or how the tool can be used to combine pieces together in convenient ways.

Graphical primitives

One of the core low-level facilities of R graphics is the ability to draw basic shapes. The typical graphical primitives such as text, circles, lines, and rectangles are all available.

In this case, the shape of each box in our diagram is not quite as simple as a rectangle because it has rounded corners. However, a rounded rectangle is also one of the graphical primitives that the grid package provides.¹

The code below draws a rounded rectangle with a text label in the middle.

> library(grid)
> grid.roundrect(width=.25)
> grid.text("ISBN")

[Output: a rounded rectangle containing the text "ISBN".]

¹ From R version 2.9.0; prior to that, a simpler rounded rectangle was available via the grid.roundRect() function in the RGraphics package.


Viewports

A feature of the boxes in the diagram at the beginning of this article is that the text is carefully positioned relative to the rounded rectangle; the text is left-aligned within the rectangle. This careful positioning requires knowing where the left edge of the rectangle is on the page. Calculating those positions is annoyingly tricky and only becomes more annoying if at some later point the position of the box is adjusted and the positions of the text labels have to be calculated all over again.

Using a grid viewport makes this sort of positioning very simple. The basic idea is that we can create a viewport where the box is going to be drawn and then do all of our drawing within that viewport. Positioning text at the left edge of a viewport is very straightforward, and if we need to shift the box, we simply shift the viewport and the text automatically tags along for the ride. All of this applies equally to positioning the text vertically within the box.

In the code below, we create a viewport for the overall box, we draw a rounded rectangle occupying the entire viewport, then we draw text 2 mm from the left hand edge of the viewport and 1.5 lines of text up from the bottom of the viewport. A second line of text is also added, 0.5 lines of text from the bottom.

> pushViewport(viewport(width=.25))
> grid.roundrect()
> grid.text("ISBN",
            x=unit(2, "mm"),
            y=unit(1.5, "lines"),
            just="left")
> grid.text("title",
            x=unit(2, "mm"),
            y=unit(0.5, "lines"),
            just="left")
> popViewport()

[Output: a rounded rectangle with the labels "ISBN" and "title" left-aligned inside it.]

Coordinate systems

The positioning of the labels within the viewport in the previous example demonstrates another useful feature of the grid graphics system: the fact that locations can be specified in a variety of coordinate systems or units. In that example, the text was positioned horizontally in terms of millimetres and vertically in terms of lines of text (which is based on the font size in use).

As another example of the use of these different units, we can size the overall viewport so that it is just the right size to fit the text labels. In the following code, the height of the viewport is based on the number of labels and the width of the viewport is based on the width of the largest label, plus a 2 mm gap either side. This code also simplifies the labelling by drawing both labels in a single grid.text() call.

> labels <- c("ISBN", "title")
> vp <-
    viewport(width=max(stringWidth(labels))+
                 unit(4, "mm"),
             height=unit(length(labels),
                         "lines"))
> pushViewport(vp)
> grid.roundrect()
> grid.text(labels,
            x=unit(2, "mm"),
            y=unit(2:1 - 0.5, "lines"),
            just="left")
> popViewport()

[Output: a snugly fitting rounded rectangle around the labels "ISBN" and "title".]

Clipping

Another feature of the boxes that we want to produce is that they have shaded backgrounds. Looking closely, there are some relatively complex shapes involved in this shading. For example, the grey background for the "heading" of each box has a curvy top, but a flat bottom. These are not simple rounded rectangles, but some unholy alliance of a rounded rectangle and a normal rectangle.

It is possible, in theory, to achieve any sort of shape with R because there is a general polygon graphical primitive. However, as with the positioning of the text labels, determining the exact boundary of this polygon is not trivial and there are easier ways to work. In this case, we can achieve the result we want using clipping, so that any drawing that we do is only visible on a restricted portion of the page. R does not provide clipping to arbitrary regions, but it is possible to set the clipping region to any rectangular region.

The basic idea is that we will draw the complete rounded rectangle, then set the clipping region for the box viewport so that no drawing can occur in the last line of text in the box and then draw the rounded rectangle again, this time with a different background. If we continue doing this, we end up with bands of different shading.

The following code creates an overall viewport for a box and draws a rounded rectangle with a grey fill. The code then sets the clipping region to start one line of text above the bottom of the viewport and draws another rounded rectangle with a white fill. The effect is to leave just the last line of the original grey rounded rectangle showing beneath the white rounded rectangle that has had its last line clipped.

> pushViewport(viewport(width=.25))
> grid.roundrect(gp=gpar(fill="grey"))
> grid.clip(y=unit(1, "lines"),
            just="bottom")
> grid.roundrect(gp=gpar(fill="white"))
> popViewport()

[Output: a rounded rectangle that is white apart from a grey band across its bottom line.]

Drawing curves

Another basic shape that is used in the overall diagram is a nice curve from one box to another.

In addition to the basic functions to draw straight lines in R, there are functions that draw curves. In particular, R provides a graphical primitive called an X-spline (Blanc and Schlick, 1995). The idea of an X-spline is that we define a set of control points and a curve is drawn either through or near to the control points. Each control point has a parameter that specifies whether to create a sharp corner at the control point, or draw a smooth curve through the control point, or draw a smooth curve that passes nearby.

The following code sets up sets of three control points and draws an X-spline relative to each set of control points. The first curve makes a sharp corner at the middle control point, the second curve makes a smooth corner through the middle control point, and the third curve makes a smooth corner near the middle control point. The control points are drawn as grey dots for reference (code not shown).

> x1 <- c(0.1, 0.2, 0.2)
> y1 <- c(0.2, 0.2, 0.8)
> grid.xspline(x1, y1)
> x2 <- c(0.4, 0.5, 0.5)
> y2 <- c(0.2, 0.2, 0.8)
> grid.xspline(x2, y2, shape=-1)
> x3 <- c(0.7, 0.8, 0.8)
> y3 <- c(0.2, 0.2, 0.8)
> grid.xspline(x3, y3, shape=1)

[Output: three X-splines drawn relative to their control points (shown as grey dots).]

Determining where to place the control points for a curve between two boxes is another one of those annoying calculations, so a more convenient option is provided by a curve graphical primitive in grid. The idea of this primitive is that we simply specify the start and end points of the curve and R figures out a set of reasonable control points to produce an appropriate X-spline. It is also straightforward to add an arrow to either end of any straight or curvy line that R draws.

The following code draws three curves between pairs of end points. The first curve draws the default "city-block" line between end points, with a smooth corner at the turning point, the second curve is similar, but with an extra corner added, and the third curve draws a single wide, smooth corner that is distorted towards the end point. The third curve also has an arrow at the end.

> x1a <- 0.1; x1b <- 0.2
> y1a <- 0.2; y1b <- 0.8
> grid.curve(x1a, y1a, x1b, y1b)
> x2a <- 0.4; x2b <- 0.5
> y2a <- 0.2; y2b <- 0.8
> grid.curve(x2a, y2a, x2b, y2b,
             inflect=TRUE)
> x3a <- 0.7; x3b <- 0.8
> y3a <- 0.2; y3b <- 0.8
> grid.curve(x3a, y3a, x3b, y3b,
             ncp=8, angle=135,
             square=FALSE,
             curvature=2,
             arrow=arrow(angle=15))

[Output: three curves between pairs of end points, the third with an arrowhead.]

Graphical functions

The use of graphical primitives, viewports, coordinate systems, and clipping, as described so far, can be used to produce a box of the style shown in the diagram at the start of the article. For example, the following code produces a box containing three labels, with background shading to assist in differentiating among the labels.

> labels <- c("ISBN", "title", "pub")
> vp <-
    viewport(width=max(stringWidth(labels))+
                 unit(4, "mm"),
             height=unit(length(labels),
                         "lines"))
> pushViewport(vp)
> grid.roundrect()
> grid.clip(y=unit(1, "lines"),
            just="bottom")
> grid.roundrect(gp=gpar(fill="grey"))
> grid.clip(y=unit(2, "lines"),
            just="bottom")
> grid.roundrect(gp=gpar(fill="white"))
> grid.clip()
> grid.text(labels,
            x=unit(rep(2, 3), "mm"),
            y=unit(3:1 - .5, "lines"),
            just="left")
> popViewport()


just="left") > box1 <- boxGrob(c("ISBN", "title", > popViewport() "pub"), x=0.3) > box2 <- boxGrob(c("ID", "name",

ISBN "country"), x=0.7) title pub The grid.draw() function can be used to draw However, in the sort of diagram that we want to any graphical object, but we need to supply the de- "box" produce, there will be several such boxes. Rather tails of how objects get drawn. This is the pur- than write separate code for each box, it makes sense pose of the second function in Figure2. This function drawDetails() to write a general function that will work for any set is a method for the function; it says "box" of labels. Such a function is shown in Figure1 and how to draw objects. In this case, the func- tableBox() the code below uses this function to draw two boxes tion is very simple because it can call the side by side. function that we defined in Figure1. The important detail is that the boxGrob() function specified a spe- > tableBox(c("ISBN", "title", "pub"), cial class, cl="box", for "box" objects, which meant x=0.3) that we could define a drawDetails() method specif- > tableBox(c("ID", "name", "country"), ically for this sort of object and control what gets x=0.7) drawn. With this drawDetails() method defined, we can draw the boxes that we created earlier by call- ISBN ID title name ing the grid.draw() function. This function will pub country draw any grid graphical object by calling the appro- priate method for the drawDetails() generic func- This function represents the simplest way to ef- tion (among other things). The following code calls ficiently reuse graphics code and to provide graph- grid.draw() to draw the two boxes. ics code for others to use. However, there are ben- efits to be gained from going beyond this procedu- > grid.draw(box1) ral programming style to a slightly more complicated > grid.draw(box2) object-oriented approach.

ISBN       ID
title      name
pub        country

Graphical objects

In order to achieve the complete diagram introduced At this point, we appear to have achieved only a at the start of this article, we need one more step: more complicated equivalent of the previous graph- we need to draw lines and arrows from one box to ics function. However, there are a number of another. We already know how to draw lines and other functions that can do useful things with grid curves between two points; the main difficulty that graphical objects. For example, the grobX() and remains is calculating the exact position of the start grobY() functions can be used to calculate locations and end points, because these locations depend on on the boundary of a graphical object. As with the locations and dimensions of the boxes. The cal- grid.draw(), which has to call drawDetails() to find culations could be done by hand for each individual out how to draw a particular class of graphical object, curve, but as we have seen before, there are easier these functions call generic functions to find out how ways to work. The crucial idea for this step is that to calculate locations on the boundary for a particu- we want to create not just a graphical function that lar class of object. The generic functions are called encapsulates how to draw a box, but define a graphi- xDetails() and yDetails() and methods for our cal object that encapsulates information about a box. special "box" class are defined in the last two func- The code in Figure2 defines such a graphical tions in Figure2. object, plus a few other things that we will get to These methods work by passing the buck. They shortly. The first thing to concentrate on is the both create a rounded rectangle at the correct loca- boxGrob() function. This function creates a "box" tion and the right size for the box, then call grobX() graphical object. In order to do this, all it has to do is (or grobY()) to determine a location on the boundary call the grob() function and supply all of the infor- of the rounded rectangle. In other words, they rely mation that we want to record about "box" objects. on code within the grid package that already exists In this case, we just record the labels to be drawn to calculate the boundary of rounded rectangles. within the box and the location where we want to With these methods defined, we are now in a po- draw the box. sition to draw a curved line between our boxes. The This function does not draw anything. For exam- key idea is that we can use grobX() and grobY() to ple, the following code creates two "box" objects, but specify a start and end point for the curve. For exam- produces no graphical output whatsoever. ple, we can start the curve at the right hand edge of


> tableBox <- function(labels, x=.5, y=.5) {
      nlabel <- length(labels)
      tablevp <-
          viewport(x=x, y=y,
                   width=max(stringWidth(labels)) +
                         unit(4, "mm"),
                   height=unit(nlabel, "lines"))
      pushViewport(tablevp)
      grid.roundrect()
      if (nlabel > 1) {
          for (i in 1:(nlabel - 1)) {
              fill <- c("white", "grey")[i %% 2 + 1]
              grid.clip(y=unit(i, "lines"),
                        just="bottom")
              grid.roundrect(gp=gpar(fill=fill))
          }
      }
      grid.clip()
      grid.text(labels,
                x=unit(2, "mm"),
                y=unit(nlabel:1 - .5, "lines"),
                just="left")
      popViewport()
  }

Figure 1: A function to draw a diagram box, for a given set of labels, centred at the specified (x, y) location.
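A minimal usage sketch (simply restating the calls shown in the text, and assuming the grid package is attached) places two such boxes side by side on a fresh page:

> grid.newpage()
> tableBox(c("ISBN", "title", "pub"), x=0.3)
> tableBox(c("ID", "name", "country"), x=0.7)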

> boxGrob <- function(labels, x=.5, y=.5) {
      grob(labels=labels, x=x, y=y, cl="box")
  }
> drawDetails.box <- function(x, ...) {
      tableBox(x$labels, x$x, x$y)
  }
> xDetails.box <- function(x, theta) {
      nlines <- length(x$labels)
      height <- unit(nlines, "lines")
      width <- unit(4, "mm") + max(stringWidth(x$labels))
      grobX(roundrectGrob(x=x$x, y=x$y,
                          width=width, height=height),
            theta)
  }
> yDetails.box <- function(x, theta) {
      nlines <- length(x$labels)
      height <- unit(nlines, "lines")
      width <- unit(4, "mm") + max(stringWidth(x$labels))
      grobY(roundrectGrob(x=x$x, y=x$y,
                          width=width, height=height),
            theta)
  }

Figure 2: Some functions that define a graphical object representing a diagram box. The boxGrob() function constructs a "box" object, the drawDetails() method describes how to draw a "box" object, and the xDetails() and yDetails() functions calculate locations on the boundary of a "box" object.

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 20 CONTRIBUTED RESEARCH ARTICLES box1 by specifying grobX(box1, "east"). The ver- are five interconnected boxes and the boxes have tical position is slightly trickier because we do not some additional features, such as a distinct “header” want the line starting at the top or bottom of the box, line at the top. This complete code was excluded but we can simply add or subtract the appropriate to save on space, but a simple R package is pro- number of lines of text to get the right spot. vided at http://www.stat.auckland.ac.nz/~paul/ The following code uses these ideas to draw a with code to draw that complete diagram and the curve from the pub label of box1 to the ID label of package also contains a more complete implementa- box2. The curve has two corners (inflect=TRUE) and tion of code to create and draw "box" graphical ob- it has a small arrow at the end. jects. This call to grid.curve() is relatively verbose, One final point is that using R graphics to draw but in a diagram containing many similar curves, diagrams like this is not fast. In keeping with the S this burden can be significantly reduced by writing tradition, the emphasis is on developing code quickly a simple function that hides away the common fea- and on having code that is not a complete nightmare tures, such as the specification of the arrow head. to maintain. In this case particularly, the speed of The major gain from this object-oriented ap- developing a diagram comes at the expense of the proach is that the start and end points of this curve time taken to draw the diagram. For small, one-off are described by simple expressions that will auto- diagrams this is not likely to be an issue, but the matically update if the locations of the boxes are approach described in this article would not be ap- modified. propriate, for example, for drawing a node-and-edge graph of the Internet. > grid.curve(grobX(box1, "east"), grobY(box1, "south") + unit(0.5, "lines"), Bibliography grobX(box2, "west"), grobY(box2, "north") - C. Blanc and C. Schlick. X-splines: a spline model de- unit(0.5, "lines"), signed for the end-user. In SIGGRAPH ’95: Proceed- inflect=TRUE, ings of the 22nd annual conference on Computer graph- arrow= ics and interactive techniques, pages 377–386, New arrow(type="closed", York, NY, USA, 1995. ACM. ISBN 0-89791-701-4. angle=15, doi: http://doi.acm.org/10.1145/218380.218488. length=unit(2, "mm")), gp=gpar(fill="black")) G. Csardi and T. Nepusz. The igraph software package for complex network research. InterJour- nal, Complex Systems:1695, 2006. URL http:// ISBN ID title name igraph.sf.net. pub country J. Gentry, L. Long, R. Gentleman, S. Falcon, F. Hahne, and D. Sarkar. Rgraphviz: Provides plotting capabil- ities for R graph objects, 2008. R package version Conclusion 1.18.1.

This article has demonstrated a number of useful A. Larsson. Dia, 2008. http://www.gnome.org/ low-level graphical facilities in R with an example of projects/dia/. how they can be combined to produce a diagram consisting of non-trivial nodes with smooth curves P. Murrell. The grid graphics package. R News, 2(2): between them. 14–19, June 2002. URL http://CRAN.R-project. org/doc/Rnews/ The code examples provided in this article have . ignored some details in order to keep things simple. P. Murrell. Recent changes in grid graphics. R For example, there are no checks that the arguments News, 5(1):12–20, May 2005a. URL http://CRAN. have sensible values in the functions tableBox() and R-project.org/doc/Rnews/. boxGrob(). However, for creating one-off diagrams, this level of detail is not necessary anyway. P. Murrell. R Graphics. Chapman & Hall/CRC, One detail that would be encountered quite 2005b. URL http://www.stat.auckland.ac.nz/ quickly in practice, in this particular sort of dia- ~paul/RGraphics/rgraphics.html. ISBN 1-584- gram, is that a curve from one box to another that 88486-X. needs to go across-and-down rather than across-and- up would require the addition of curvature=-1 to R Development Core Team. R: A Language and Envi- the grid.curve() call. Another thing that is miss- ronment for Statistical Computing. R Foundation for ing is complete code to produce the example dia- Statistical Computing, Vienna, Austria, 2008. URL gram from the beginning of this article, where there http://www.R-project.org. ISBN 3-900051-07-0.


D. Sarkar. Lattice: Multivariate Data Visualization with R. Springer, 2008. URL http://amazon.com/o/ASIN/0387759689/. ISBN 9780387759685.

K. Soetaert. diagram: Functions for visualising simple graphs (networks), plotting flow diagrams, 2008a. R package version 1.2.

K. Soetaert. shape: Functions for plotting graphical shapes, colors, 2008b. R package version 1.2.

H. Wickham. ggplot2. Springer, 2009. http://had.co.nz/ggplot2/.

Paul Murrell
Department of Statistics
The University of Auckland
New Zealand
[email protected]


The hwriter Package

Composing HTML documents with R objects

by Gregoire Pau and Wolfgang Huber

Introduction

HTML documents are structured documents made of diverse elements such as paragraphs, sections, columns, figures and tables organized in a hierarchical layout. Combination of HTML documents and hyperlinking is useful to report analysis results; for example, in the package arrayQualityMetrics (Kauffmann et al., 2009), estimating the quality of microarray data sets, and cellHTS2 (Boutros et al., 2006), performing the analysis of cell-based screens.

There are several tools for exporting data from R into HTML documents. The package R2HTML is able to render a large diversity of R objects in HTML but does not easily support combining them in a structured layout and has a complex syntax. On the other hand, the package xtable can render R matrices with simple commands but cannot combine HTML elements and has limited formatting options.

The package hwriter allows rendering R objects in HTML and combining resulting elements in a structured layout. It uses a simple syntax, supports extensive formatting options and takes full advantage of the ellipsis '...' argument and R vector recycling rules.

Comprehensive documentation and examples of hwriter are generated by running the command example(hwriter), which creates the package web page http://www.ebi.ac.uk/~gpau/hwriter.

The function hwrite

The generic function hwrite is the core function of the package hwriter. hwrite(x, page=NULL, ...) renders the object x in HTML using the formatting arguments specified in '...' and returns a character vector containing the HTML element code of x. If page is a filename, the HTML element is written in this file. If it is an R connection/file, the element is appended to this connection, allowing the sequential building of HTML documents. If NULL, the returned HTML element code can be concatenated with other elements with paste or nested inside other elements through subsequent hwrite calls. These operations are the tree structure equivalents of adding a sibling node and adding a child node in the document tree.

Formatting objects

The most basic call of hwrite is to output a character vector into an HTML document.

> hwrite('Hello world !', 'doc.html')

Hello world !

Character strings can be rendered with a variety of formatting options. The argument link adds a hyperlink to the HTML element pointing to an external document. The argument style specifies an inline Cascaded Style Sheet (CSS) style (color, alignment, margins, font, ...) to render the element. Other arguments allow a finer rendering control.

> hwrite('Hello world !', 'doc.html',
    link='http://cran.r-project.org/')

Hello world !

> hwrite('Hello world !', 'doc.html',
    style='color: red; font-size: 20pt;
    font-family: Gill Sans Ultra Bold')

Hello world !

Images can be included using hwriteImage which supports diverse formatting options.

> img=system.file('images', c('iris1.jpg',
    'iris3.jpg'), package='hwriter')
> hwriteImage(img[2], 'doc.html')

Arguments recycling

Objects written with formatting arguments containing more than one element are recycled by hwrite, producing a vector of HTML elements.

> hwrite('test', 'doc.html', style=paste(
    'color:', c('peru', 'purple', 'green')))

test test test
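As a hedged aside (not an example from the article), the page=NULL behaviour described in "The function hwrite" above means that an element can be built first and then concatenated or nested into a later call:

> fancy <- hwrite('world', style='color: blue')
> hwrite(paste('Hello', fancy, '!'), 'doc.html')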


Vectors and matrices > hcss=system.file('images', 'hwriter.css', package='hwriter') Vectors and matrices are rendered as HTML tables. > p=openPage('doc.html', link.css=hcss) > hwriteImage(img, p, br=TRUE) > hwrite(iris[1:2, 1:2], 'doc.html') > hwrite(iris[1:2, 1:2], p, br=TRUE, row.bgcolor='#d2c0ed') > hwrite('Iris flowers', p, class='intro') Sepal.Length Sepal.Width > closePage(p) 1 5.1 3.5 2 4.9 3

Formatting arguments are recycled and dis- tributed over table cells. Arguments link and style are valid. Cell background color is controlled by the argument bgcolor. > colors=rgb(colorRamp(c('red', 'yellow', Sepal.Length Sepal.Width 'white'))((0:7)/7), max=255) > hwrite(0:7, 'doc.html', bgcolor=colors, 1 5.1 3.5 style='padding:5px') 2 4.9 3

0 1 2 3 4 5 6 7 Iris flowers

Formatting arguments can also be distributed on Nesting HTML elements rows and columns using the syntax row.* and col.* where * can be any valid HTML attribute. The function hwrite always returns a character vec- Global table properties are specified using the syn- tor containing HTML code parts. Code parts can be tax table.* where * can be any valid HTML

reused in further hwrite calls to build tables contain- attribute. As an example, table.style controls the ing complex cells or inner tables, which are useful to global table CSS inline style and table.cellpadding compose HTML document layouts. controls the global cell padding size. > cap=hwrite(c('Plantae','Monocots','Iris'), > hwrite(iris[1:2, 1:2], 'doc.html', table.style='border-collapse:collapse', row.bgcolor='#ffdc98', table.style= table.cellpadding='5px') 'border-collapse:collapse', > print(cap) table.cellpadding='5px')
Sepal.Length Sepal.Width PlantaeMonocotsIris

1 5.1 3.5 > hwrite(matrix(c(hwriteImage(img[2]),cap)), 'doc.html', table.cellpadding='5px', 2 4.9 3 table.style='border-collapse:collapse')

Plantae  Monocots  Iris

Appending HTML elements to a page

A new HTML document is created by the function openPage which returns an R connection. Appending HTML elements to the page is done by passing the connection to hwrite instead of a filename. The document is closed by closePage which terminates the HTML document. The function openPage supports the argument link.css which links the document to external stylesheets containing predefined CSS styles and classes. Classes can be used with the argument class of the function hwrite.
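A minimal, hedged sketch of this connection-based workflow (file name and content are made up for illustration):

> p <- openPage('report.html')
> hwrite('A short report.', p, br=TRUE)
> hwrite(iris[1:3, 1:2], p)
> closePage(p)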


Bibliography

M. Boutros, L. Bras, and W. Huber. Analysis of cell-based RNAi screens. Genome Biology, 7(66), 2006.

A. Kauffmann, R. Gentleman, and W. Huber. arrayQualityMetrics - a Bioconductor package for quality assessment of microarray data. Bioinformatics, 25(3), 2009.

Gregoire Pau
EMBL-European Bioinformatics Institute
Cambridge, UK
[email protected]

Wolfgang Huber
EMBL-European Bioinformatics Institute
Cambridge, UK
[email protected]


AdMit: Adaptive Mixtures of Student-t Distributions

by David Ardia, Lennart F. Hoogerheide and Herman K. van Dijk

Introduction

This note presents the package AdMit (Ardia et al., 2008, 2009), an R implementation of the adaptive mixture of Student-t distributions (AdMit) procedure developed by Hoogerheide(2006); see also Hoogerheide et al.(2007); Hoogerheide and van Dijk (2008). The AdMit strategy consists of the construction of a mixture of Student-t distributions which approximates a target distribution of interest. The fitting procedure relies only on a kernel of the target density, so that the normalizing constant is not required. In a second step, this approximation is used as an importance function in importance sampling or as a candidate density in the independence chain Metropolis-Hastings (M-H) algorithm to estimate characteristics of the target density. The estimation procedure is fully automatic and thus avoids the difficult task, especially for non-experts, of tuning a sampling algorithm. Typically, the target is a posterior distribution in a Bayesian analysis, where we indeed often only know a kernel of the posterior density.

In a standard case of importance sampling or the independence chain M-H algorithm, the candidate density is unimodal. If the target distribution is multimodal then some draws may have huge weights in the importance sampling approach and a second mode may be completely missed in the M-H strategy. As a consequence, the convergence behavior of these Monte Carlo integration methods is rather uncertain. Thus, an important problem is the choice of the importance or candidate density, especially when little is known a priori about the shape of the target density. For both importance sampling and the independence chain M-H, it holds that the candidate density should be close to the target density, and it is especially important that the tails of the candidate should not be thinner than those of the target.

Hoogerheide(2006) and Hoogerheide et al.(2007) mention several reasons why mixtures of Student-t distributions are natural candidate densities. First, they can provide an accurate approximation to a wide variety of target densities, with substantial skewness and high kurtosis. Furthermore, they can deal with multi-modality and with non-elliptical shapes due to asymptotes. Second, this approximation can be constructed in a quick, iterative procedure and a mixture of Student-t distributions is easy to sample from. Third, the Student-t distribution has fatter tails than the Normal distribution; especially if one specifies Student-t distributions with few degrees of freedom, the risk is small that the tails of the candidate are thinner than those of the target distribution. Finally, Zeevi and Meir(1997) showed that under certain conditions any density function may be approximated to arbitrary accuracy by a convex combination of basis densities; the mixture of Student-t distributions falls within their framework.

The package AdMit consists of three main functions: AdMit, AdMitIS and AdMitMH. The first one allows the user to fit a mixture of Student-t distributions to a given density through its kernel function. The next two functions perform importance sampling and independence chain M-H sampling using the fitted mixture estimated by AdMit as the importance or candidate density, respectively.

To illustrate the use of the package, we apply the AdMit methodology to a bivariate bimodal distribution. We describe the use of the functions provided by the package and document the ability and relevance of the methodology to reproduce the shape of non-elliptical distributions.

Illustration

This section presents the functions provided by the package AdMit with an illustration of a bivariate bimodal distribution. This distribution belongs to the class of conditionally Normal distributions proposed by Gelman and Meng(1991) with the property that the joint density is not Normal. It is not a posterior distribution, but it is chosen because it is a simple distribution with non-elliptical shapes so that it allows for an easy illustration of the AdMit approach.

Let X1 and X2 be two random variables, for which X1 is Normally distributed given X2 and vice versa. Then, the joint distribution, after location and scale transformations in each variable, can be written as (see Gelman and Meng, 1991):

    p(x) ∝ k(x) = exp( −1/2 [ A x1² x2² + x1² + x2² − 2 B x1 x2 − 2 C1 x1 − 2 C2 x2 ] )    (1)

where p(x) denotes a density, k(x) a kernel and x = (x1, x2)' for notational purposes. A, B, C1 and C2 are constants; we consider an asymmetric case in which A = 5, B = 5, C1 = 3, C2 = 3.5 in what follows.

The adaptive mixture approach determines the number of mixture components H, the mixing probabilities, the modes and scale matrices of the components in such a way that the mixture density q(x) approximates the target density p(x) of which we only

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 26 CONTRIBUTED RESEARCH ARTICLES know a kernel function k(x) with x ∈ Rd. Typically, default, the algorithm uses the value one (i.e. the ap- k(x) will be a posterior density kernel for a vector of proximation density is a mixture of Cauchy) since i) model parameters x. it enables the method to deal with fat-tailed target The AdMit strategy consists of the following distributions and ii) it makes it easier for the iterative steps: procedure to detect modes that are far apart. There may be a trade-off between efficiency and robustness (0) Initialization: computation of the mode and at this point: setting the degrees of freedom to one re- scale matrix of the first component, and draw- flects that we find a higher robustness more valuable ing a sample from this Student-t distribution; than a slight possible gain in efficiency. (1) Iterate on the number of components: add a The core function provided by the package is the AdMit new component that covers a part of the space function . The main arguments of the function KERNEL of x where the previous mixture density was are: , a kernel function of the target density on relatively small, as compared to k(x); which the approximation is constructed. This func- tion must contain the logical argument log. When (2) Optimization of the mixing probabilities; log = TRUE, the function KERNEL returns the loga- rithm value of the kernel function; this is used for nu- (3) Drawing a sample {x[i]} from the new mixture merical stability. mu0 is the starting value of the first q(x); stage optimization; it is a vector whose length corre- sponds to the length of the first argument in KERNEL. (4) Evaluation of importance sampling weights Sigma0 is the (symmetric positive definite) scale ma- [i] [i] {k(x )/q(x )}. If the coefficient of variation, trix of the first component. If a matrix is provided by i.e. the standard deviation divided by the the user, it is used as the scale matrix of the first com- mean, of the weights has converged, then stop. ponent and then mu0 is used as the mode of the first Otherwise, go to step (1). By default, the adap- component (instead of a starting value for optimiza- tive procedure stops if the relative change in tion). control is a list of tuning parameters, contain- the coefficient of variation of the importance ing in particular: df (default: 1), the degrees of free- sampling weights caused by adding one new dom of the mixture components and CVtol (default: Student-t component to the candidate mixture 0.1), the tolerance of the relative change of the co- is less than 10%. efficient of variation. For further details, the reader is referred to the documentation manual (by typing There are two main reasons for using the coefficient ?AdMit) or to Ardia et al.(2009). Finally, the last ar- of variation of the importance sampling weights as gument of AdMit is ... which allows the user to pass a measure of the fitting quality. First, it is a nat- additional arguments to the function KERNEL. ural, intuitive measure of quality of the candidate Let us come back to our bivariate conditionally as an approximation to the target. If the candidate Normal distribution. First, we need to code the ker- and the target distributions coincide, all importance nel: sampling weights are equal, so that the coefficient of variation is zero. 
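As a small illustrative aside (not package code), the coefficient of variation used in this stopping rule is simply the standard deviation of the importance sampling weights divided by their mean:

> w <- c(0.2, 1.4, 0.9, 1.1, 0.4)  # hypothetical importance weights
> sd(w) / mean(w)
[1] 0.6187184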
For a poor candidate that not > GelmanMeng <- function(x, log = TRUE) even roughly approximates the target, some impor- + { tance sampling weights are huge while most are (al- + if (is.vector(x)) most) zero, so that the coefficient of variation is high. + x <- matrix(x, nrow = 1) + r <- -0.5 * ( 5 * x[,1]^2 * x[,2]^2 The better the candidate approximates the target, the + + x[,1]^2 + x[,2]^2 more evenly the weight is divided among the can- + - 10 * x[,1] * x[,2] didate draws, and the smaller the coefficient of vari- + - 6 * x[,1] - 7 * x[,2] ) ation of the importance sampling weights. Second, + if (!log) Geweke(1989) argues that a reasonable objective in + r <- exp(r) the choice of an importance density is the minimiza- + as.vector(r) tion of the variance, or equivalently the coefficient of + } variation, of the importance weights. We prefer to Note that the argument log is set to TRUE by default quote the coefficient of variation because it does not so that the function outputs the logarithm of the ker- depend on the number of draws or the integration nel. Moreover, the function is vectorized to speed up constant of the posterior density kernel, unlike the the computations. The contour plot of GelmanMeng is standard deviation of the scaled or unscaled impor- displayed in the left part of Figure1. tance weights, respectively. The coefficient of varia- We now use the function AdMit to find a suit- tion merely reflects the quality of the candidate as an able approximation for the density whose kernel is approximation to the target. GelmanMeng. We use the starting value mu0 = c(0, Note also that all Student-t components in the 0.1) for the first stage optimization and assign the mixture approximation q(x) have the same degrees result of the function to outAdMit. of freedom parameter. Moreover, this value is not determined by the procedure but set by the user; by > set.seed(1234)


Figure 1: Left: contour plot of the kernel GelmanMeng. Right: contour plot of the four-component Student-t mixture approximation estimated by the function AdMit.

> outAdMit <- AdMit(KERNEL = GelmanMeng, tance sampling weights at each step of the adaptive + mu0 = c(0, 0.1)) fitting procedure. The second component is mit, a list which consists of four components giving informa- $CV tion on the fitted mixture of Student-t distributions: [1] 8.0737 1.7480 0.9557 0.8807 p is a vector of length H of mixing probabilities, mu is a H × d matrix whose rows give the modes of the $mit Sigma 2 $mit$p mixture components, is a H × d matrix whose cmp1 cmp2 cmp3 cmp4 rows give the scale matrices (in vector form) of the 0.54707 0.14268 0.21053 0.09972 mixture components and df is the degrees of free- dom of the Student-t components. The third compo- $mit$mu nent of the list returned by AdMit is summary. This is k1 k2 a data frame containing information on the adaptive cmp1 0.3640 3.2002 fitting procedure. We refer the reader to the docu- cmp2 3.7088 0.3175 mentation manual for further details. cmp3 1.4304 1.1240 For the kernel GelmanMeng, the approximation cmp4 0.8253 2.0261 constructs a mixture of four components. The value $mit$Sigma of the coefficient of variation decreases from 8.0737 k1k1 k1k2 k2k1 k2k2 to 0.8807. A contour plot of the four-component ap- cmp1 0.03903 -0.15606 -0.15606 1.22561 proximation is displayed in the right-hand side of cmp2 0.86080 -0.08398 -0.08398 0.02252 Figure1. This graph is produced using the func- cmp3 0.13525 -0.06743 -0.06743 0.09520 tion dMit which returns the density of the mixture cmp4 0.03532 -0.04072 -0.04072 0.21735 given by the output outAdMit$mit. The contour plot illustrates that the four-component mixture provides $mit$df a good approximation of the density: the areas with [1] 1 substantial target density mass are covered by suffi- cient density mass of the mixture approximation. $summary H METHOD.mu TIME.mu METHOD.p TIME.p CV Once the adaptive mixture of Student-t distribu- 1 1 BFGS 0.21 NONE 0.00 8.0737 tions is fitted to the density using a kernel, the ap- 2 2 BFGS 0.08 NLMINB 0.08 1.7480 proximation provided by AdMit is used as the impor- 3 3 BFGS 0.12 NLMINB 0.08 0.9557 tance sampling density in importance sampling or as 4 4 BFGS 0.20 NLMINB 0.19 0.8807 the candidate density in the independence chain M- H algorithm. The output of the function AdMit is a list. The first The AdMit package contains the AdMitIS func- component is CV, a vector of length H which gives tion which performs importance sampling using the the value of the coefficient of variation of the impor- mixture approximation in outAdMit$mit as the im-
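A hedged sketch of how the right-hand panel of Figure 1 can be reproduced with dMit; the exact plotting code is not shown in the article, and the grid of evaluation points below is chosen only for illustration:

> x1 <- seq(-1, 6, length.out = 100)
> x2 <- seq(-1, 6, length.out = 100)
> z <- matrix(dMit(as.matrix(expand.grid(x1, x2)),
+             mit = outAdMit$mit), nrow = length(x1))
> contour(x1, x2, z, xlab = "X1", ylab = "X2")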

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 28 CONTRIBUTED RESEARCH ARTICLES portance density. The arguments of the function + mit = outAdMit$mit) AdMitIS are: (i) N, the number of draws used in im- portance sampling; (ii) KERNEL, a kernel function of $draws the target density; (iii) G, the function of which the k1 k2 expectation IE g(x) is estimated; By default, the 1 7.429e-01 3.021e-01 p 2 7.429e-01 3.021e-01 function G is the identity so that the function outputs 3 7.429e-01 3.021e-01 a vector containing the mean estimates for the com- 4 1.352e+00 1.316e+00 ponents of x. Alternative functions may be provided 5 1.011e+00 1.709e+00 by the user to obtain other quantities of interest for 6 1.011e+00 1.709e+00 p(x). The only requirement is that the function out- 7 1.005e+00 1.386e+00 puts a matrix; (iv) mit, a list providing information ... on the mixture approximation; (v) ... allows addi- tional parameters to be passed to the function KERNEL $accept and/or G. [1] 0.5119 Let us apply the function AdMitIS to the kernel The output of the function AdMitMH is a list of two GelmanMeng using the approximation outAdMit$mit: components. The first component is draws, a N × d > outAdMitIS <- AdMitIS(N = 1e5, matrix containing draws from the target density p(x) + KERNEL = GelmanMeng, in its rows. The second component is accept, the ac- + mit = outAdMit$mit) ceptance rate of the independence chain M-H algo- rithm. $ghat The package coda (Plummer et al., 2008) can [1] 0.9556 2.2465 be used to check the convergence of the MCMC chain and obtain quantities of interest for p(x). $NSE Here, for simplicity, we discard the first 1’000 [1] 0.003833 0.005639 draws as a burn-in sample and transform the output outAdMitMH$draws mcmc $RNE in a object using the func- [1] 0.6038 0.5536 tion as.mcmc provided by coda. A summary of the MCMC chain can be obtained using summary: The output of the function AdMitIS is a list. The first > library("coda") component is ghat, the importance sampling estima-   > draws <- as.mcmc(outAdMitMH$draws) tor of IEp g(x) . The second component is NSE, a > draws <- window(draws, start = 1001) vector containing the numerical standard errors (the > colnames(draws) <- c("X1", "X2") variation of the estimates that can be expected if the > summary(draws)$stat simulations were to be repeated) of the components of ghat. The third component is RNE, a vector con- Mean SD Naive SE Time-series SE taining the relative numerical efficiencies of the com- X1 0.9643 0.9476 0.002996 0.004327 ponents of ghat (the ratio between an estimate of the X2 2.2388 1.3284 0.004201 0.006232 variance of an estimator based on direct sampling We note that the mean estimates are close to the val- and the importance sampling estimator’s estimated ues obtained with the function AdMitIS. The relative variance with the same number of draws). RNE is an numerical efficiency (RNE) can be computed using indicator of the efficiency of the chosen importance the functions effectiveSize and niter: function; if target and importance densities coincide, > effectiveSize(draws) / niter(draws) RNE equals one, whereas a very poor importance den- sity will have a RNE close to zero. Both NSE and RNE X1 X2 are estimated by the method given in Geweke(1989). 
0.3082 0.3055 AdMitMH Further, the AdMit package contains the These relative numerical efficiencies reflect the good function which uses the mixture approximation in quality of the candidate density in the independence outAdMit$mit as the candidate density in the inde- chain M-H algorithm. pendence chain M-H algorithm. The arguments of The approach is compared to the standard Gibbs AdMitMH N the function are: (i) , the length of the sampler, which is extremely easy to implement here KERNEL MCMC sequence of draws; (ii) , a kernel func- since the full conditional densities are Normals. We mit tion of the target density; (iii) , a list providing in- generated N = 101000 draws from a Gibbs sampler ... formation on the mixture approximation; (iv) al- using the first 1’000 draws as a burn-in period. In lows additional parameters to be passed to the func- this case, the RNE were 0.06411 and 0.07132, respec- KERNEL tion . tively. In Figure2 we display the autocorrelogram AdMitMH Let us apply the function to the kernel for the MCMC output of the AdMitMH function (up- GelmanMeng outAdMit$mit using the approximation : per graphs) together with the autocorrelogram of the > outAdMitMH <- AdMitMH(N = 101000, Gibbs (lower part). We clearly notice that the Gibbs + KERNEL = GelmanMeng, sequence is mixing much more slowly, explaining the


Figure 2: Upper graphs: Autocorrelation function (ACF) for the AdMitMH output. Lower graphs: ACF for the Gibbs output.
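A hedged sketch (the article does not show this code) of how the autocorrelation functions for the AdMitMH output can be computed from the draws object defined above:

> par(mfrow = c(1, 2))
> acf(as.matrix(draws)[, "X1"], lag.max = 50,
+     main = "ACF X1 (AdMit)")
> acf(as.matrix(draws)[, "X2"], lag.max = 50,
+     main = "ACF X2 (AdMit)")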

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 30 CONTRIBUTED RESEARCH ARTICLES lower RNE values. This example illustrates the phe- flexible candidate distribution for efficient simu- nomenon that the autocorrelation in the Gibbs se- lation: The R package AdMit. Journal of Statis- quence may be high when assessing a non-elliptical tical Software, 29(3):1–32, 2009. URL http://www. distribution, and that the AdMit approach may over- jstatsoft.org/v29/i03/. come this. A. Gelman and X.-L. Meng. A note on bivariate dis- tributions that are conditionally normal. The Amer- Summary ican Statistician, 45(2):125–126, 1991.

This note presented the package AdMit which pro- J. F. Geweke. Bayesian inference in econometric mod- vides functions to approximate and sample from a els using Monte Carlo integration. Econometrica, 57 certain target distribution given only a kernel of the (6):1317–1339, 1989. target density function. The estimation procedure is fully automatic and thus avoids the time-consuming L. F. Hoogerheide. Essays on Neural Network Sampling and difficult task of tuning a sampling algorithm. Methods and Instrumental Variables. PhD thesis, Tin- The use of the package has been illustrated in a bi- bergen Institute, Erasmus University Rotterdam, variate bimodal distribution. 2006. Book nr. 379 of the Tinbergen Institute Re- Interested users are referred to Ardia et al.(2009) search Series. for a more complete discussion on the AdMit pack- L. F. Hoogerheide and H. K. van Dijk. Possi- age. The article provides a more extensive intro- bly ill-behaved posteriors in econometric mod- duction of the package usage as well as an illus- els: On the connection between model structures, tration of the relevance of the AdMit procedure non-elliptical credible sets and neural network with the Bayesian estimation of a mixture of ARCH simulation techniques. Technical Report 2008- model fitted to foreign exchange log-returns data. 036/4, Tinbergen Institute, Erasmus University The methodology is compared to standard cases of Rotterdam, 2008. URL http://www.tinbergen.nl/ importance sampling and the Metropolis-Hastings discussionpapers/08036.pdf. algorithm using a naive candidate and with the Griddy-Gibbs approach. It is shown that for inves- L. F. Hoogerheide, J. F. Kaashoek, and H. K. van tigating means and particularly tails of the joint pos- Dijk. On the shape of posterior densities and credi- terior distribution the AdMit approach is preferable. ble sets in instrumental variable regression models with reduced rank: An application of flexible sam- Acknowledgments pling methods using neural networks. Journal of Econometrics, 139(1):154–180, 2007. The authors acknowledge an anonymous reviewer M. Plummer, N. Best, K. Cowles, and K. Vines. coda: for helpful comments that have led to improve- Output Analysis and Diagnostics for MCMC in R, ments of this note. The first author is grateful to 2008. URL http://CRAN.R-project.org/package= the Swiss National Science Foundation (under grant coda. R package version 0.13-3. #FN PB FR1-121441) for financial support. The third author gratefully acknowledges the financial assis- A. J. Zeevi and R. Meir. Density estimation through tance from the Netherlands Organization of Research convex combinations of densities: Approximation (under grant #400-07-703). and estimation bounds. Neural Networks, 10(1):99– 109, 1997. Bibliography David Ardia D. Ardia, L. F. Hoogerheide, and H. K. van Dijk. University of Fribourg, Switzerland AdMit: Adaptive Mixture of Student-t Distributions [email protected] for Efficient Simulation in R, 2008. URL http:// CRAN.R-project.org/package=AdMit. R package Lennart F. Hoogerheide version 1.01-01. Erasmus University Rotterdam, The Netherlands D. Ardia, L. F. Hoogerheide, and H. K. van Dijk. Herman K. van Dijk Adaptive mixture of Student-t distributions as a Erasmus University Rotterdam, The Netherlands

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 31 expert: Modeling Without Data Using Expert Opinion by Vincent Goulet, Michel Jacques and Mathieu Pigeon a group of 12 to 15 experts is consulted. The experts express their opinion on the distribu- tion of the decision random variable by means of a Introduction few quantiles, usually the median and a symmetric interval around it. Let us recall that the 100rth quan- The expert package provides tools to create and ma- tile of a random variable is the value q such that nipulate empirical statistical models using expert r opinion (or judgment). Here, the latter expression lim F(qr − h) ≤ r ≤ F(qr), refers to a specific body of techniques to elicit the dis- h→0 tribution of a random variable when data is scarce or where F is the cumulative distribution function of unavailable. Opinions on the quantiles of the distri- the random variable. Hence, by giving the quantiles bution are sought from experts in the field and aggre- qu and qv, u < v, an expert states that, in his opinion, gated into a final estimate. The package supports ag- there is a probability of v − u that the true value of the gregation by means of the Cooke, Mendel–Sheridan variable lies between qu and qv. We generally avoid and predefined weights models. asking for quantiles far in the tails (say, below 5% and We do not mean to give a complete introduction above 95%) since humans have difficulty evaluating to the theory and practice of expert opinion elicita- them. tion in this paper. However, for the sake of complete- In addition to the distribution of the decision ness and to assist the casual reader, the next section variable, the experts are also queried for the distri- summarizes the main ideas and concepts. bution of a series of seed (or calibration) variables It should be noted that we are only interested, for which only the analyst knows the true values. By here, in the mathematical techniques of expert opin- comparing the distributions given by the experts to ion aggregation. Obtaining the opinion from the ex- the latter, the analyst will be able to assess the qual- perts is an entirely different task; see Kadane and ity of the information provided by each expert. This Wolfson(1998); Kadane and Winkler(1988); Tver- step is called calibration. sky and Kahneman(1974) for more information. In summary, expert opinion is given by means of Moreover, we do not discuss behavioral models (see quantiles for k seed variables and one decision vari- Ouchi, 2004, for an exhaustive review) nor the prob- able. The calibration phase determines the influence lems of expert selection, design and conducting of in- of each expert on the final aggregated distribution. terviews. We refer the interested reader to O’Hagan The three methods discussed in this paper and sup- et al.(2006) and Cooke(1991) for details. Although ported by the package are simply different ways to it is extremely important to carefully examine these aggregate the information provided by the experts. considerations if expert opinion is to be useful, we The aggregated distribution in the classical model assume that these questions have been solved previ- of Cooke(1991) is a convex combination of the quan- ously. The package takes the opinion of experts as an tiles given by the experts, with weights obtained input that we take here as available. from the calibration phase. 
Consequently, if we The other main section presents the features of asked for three quantiles from the experts (say the version 1.0-0 of package expert. 10th, 50th and 90th), the aggregated distribution will consist of three “average” quantiles corresponding to these same probabilities. This model has been used Expert opinion in many real life studies by the analysts of the Delft Institute of Applied Mathematics of the Delft Univer- Consider the task of modeling the distribution of a sity of Technology (Cooke and Goossens, 2008). quantity for which data is very scarce (e.g. the cost The predefined weights method is similar, except of a nuclear accident) or even unavailable (e.g. settle- that the weights are provided by the analyst. ments in liability insurance, usually reached out of On the other hand, the model of Mendel and court and kept secret). A natural option for the ana- Sheridan(1989) adjusts the probabilities associated lyst1 is then to ask experts their opinion on the distri- with each quantile by a (somewhat convoluted) bution of the so-called decision variable. The experts Bayesian procedure to reflect the results of the cal- are people who have a good knowledge of the field ibration phase. Returning to the context above, this under study and who are able to express their opin- means that the probability of 10% associated with the ion in a simple probabilistic fashion. They can be en- first quantile given by an expert may be changed to, gineers, physicists, lawyers, actuaries, etc. Typically, say, 80% if the calibration phase proved that this ex- 1Also called decision maker in the literature.

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 32 CONTRIBUTED RESEARCH ARTICLES pert systematically overestimates the distributions. > x <- list( Combining intersecting adjusted probabilities from + E1 <- list( all the experts usually results in an aggregated dis- + S1 <- c(0.14, 0.22, 0.28), tribution with far more quantiles than were asked for + S2 <- c(130000, 150000, 200000), in the beginning. + X <- c(350000, 400000, 525000)), + E2 <- list( It is well beyond the scope of this paper to dis- + S1 <- c(0.2, 0.3, 0.4), cuss the actual assumptions and computations in the + S2 <- c(165000, 205000, 250000), calibration and aggregation phases. The interested + X <- c(550000, 600000, 650000)), reader may consult the aforementioned references, + E3 <- list( Goulet et al.(2008) or Pigeon(2008, in French). + S1 <- c(0.2, 0.4, 0.52), + S2 <- c(200000, 400000, 500000), + X <- c(625000, 700000, 800000)))

The expert package 2. method, the model to be used for aggregation. Must be one of ‘cooke’ for Cooke’s classical To the best of our knowledge, the only readily avail- model, ‘ms’ for the Mendel–Sheridan model, or able software for expert opinion calculations is Ex- ‘weights’ for the predefined weights model; calibur, a closed source, Windows-only application available from the Risk and Environmental Mod- 3. probs, the vector of probabilities correspond- elling website of the Delft University of Technology ing to the quantiles given by the experts. In the (http://dutiosc.twi.tudelft.nl/~risk/). Excal- example, we have ibur does not support the Mendel–Sheridan model. > probs <- c(0.1, 0.5, 0.9) Data has to be entered manually in the application, making connection to other tools inconvenient. Fur- 4. true.seed, the vector of true values of the seed thermore, exporting the results for further analysis variables. Obviously, these must be listed in the requires an ad hoc procedure. same order as the seed variables are listed in ar- expert is a small R package providing a unified gument x. In the example, this is interface to the three expert opinion models exposed above along with a few utility functions and meth- > true.seed <- c(0.27, 210000) ods to manipulate the results. Naturally, the analyst alpha also benefits from R’s rich array of statistical, graph- 5. , parameter α in Cooke’s model. This ar- ical, import and export tools. gument is ignored in the other models. Param- eter α sets a threshold for the calibration com- ponent of an expert below which the expert au- Example dataset tomatically receives a weight of 0 in the aggre- gated distribution. If the argument is missing Before looking at the functions of the package, we in- or NULL, the value of α is determined by an op- troduce a dataset that will be used in forthcoming ex- timization procedure (see below); amples. We consider an analysis where three experts must express their opinion on the 10th, 50th and 90th 6. w, the vector of weights in the predefined quantiles of a decision variable and two seed vari- weights model. This argument is ignored in ables. The true values of the seed variables are 0.27 the other models. If the argument is missing and 210,000, respectively. See Table1 for the quan- or NULL, the weights are all equal to 1/n, where tiles given by the experts. n is the number of experts. If the calibration threshold α in Cooke’s model is Main function not fixed by the analyst, one is computed following a procedure proposed by Cooke(1991). We first fit The main function of the package is expert. It serves the model with α = 0. This gives a weight to each ex- as a unified interface to all three models supported pert. Using these weights, we then create a “virtual” by the package. The arguments of the function are expert by aggregating the quantiles for the seed vari- the following: ables. The optimal α is the value yielding the largest weight for the virtual expert. This is determined by 1. x, the quantiles of the experts for all seed vari- trying all values of α just below the calibration com- ables and the decision variable — in other ponents obtained in the first step. There is no need to words the content of Table1. This is given as try other values and, hence, to conduct a systematic a list of lists, one for each expert. The inte- numerical optimization. 
rior lists each contain vectors of quantiles: one When the experts do not provide the 0th and per seed variable and one for the decision vari- 100th quantiles, expert sets them automatically fol- able (in this order). For the example above, this lowing the ad hoc procedure exposed in Cooke gives (1991). In a nutshell, the quantiles are obtained by


                              Variable
Expert   Quantile   Seed 1    Seed 2     Decision
  1      10th        0.14     130,000     350,000
         50th        0.22     150,000     400,000
         90th        0.28     200,000     525,000
  2      10th        0.20     165,000     550,000
         50th        0.30     205,000     600,000
         90th        0.40     250,000     650,000
  3      10th        0.20     200,000     625,000
         50th        0.40     400,000     700,000
         90th        0.52     500,000     800,000

Table 1: Quantiles given by the experts for the seed and decision variables removing and adding 10% of the smallest interval > expert(x, "cooke", probs, true.seed) containing all quantiles given by the experts to the bounds of this interval. Therefore, the minimum and Aggregated Distribution Using Cooke Model maximum of the distribution are set the same for all experts. In our example, the 0th and 100th quantiles Interval Probability (305000, 550000] 0.1 of the first seed variable are set, respectively, to (550000, 600000] 0.4 q0 = 0.14 − 0.1(0.52 − 0.14) = 0.102 (600000, 650000] 0.4 (650000, 845000] 0.1 q1 = 0.52 + 0.1(0.52 − 0.14) = 0.558, those of the second seed variable to Alpha: 0.3448 q = 130,000 − 0.1(500,000 − 130,000) = 93,000 0 The Mendel–Sheridan model yields an entirely q1 = 500,000 + 0.1(500,000 − 130,000) = 537,000 different result that retains many of the quantiles and, finally, those of the decision variable to given by the experts. The result is a more detailed distribution. Here, we store the result in an object = − ( − ) = q0 350,000 0.1 800,000 350,000 305,000 fit for later use: q1 = 800,000 + 0.1(800,000 − 350,000) = 845,000. > fit <- expert(x, "ms", probs, true.seed) Note that only the Mendel–Sheridan model allows > fit non-finite lower and upper bounds. The function expert returns a list of class ‘expert’ Aggregated Distribution Using containing the vector of the knots (or bounds of the Mendel-Sheridan Model intervals) of the aggregated distribution, the vector of corresponding probability masses and some other Interval Probability characteristics of the model used. A print method (305000, 350000] 0.01726 displays the results neatly. (350000, 400000] 0.06864 (400000, 525000] 0.06864 Consider using Cooke’s model with the data of = (525000, 550000] 0.01726 the example and a threshold of α 0.3. We obtain: (550000, 600000] 0.06864 > expert(x, "cooke", probs, true.seed, (600000, 625000] 0.06864 + alpha = 0.03) (625000, 650000] 0.53636 Aggregated Distribution Using Cooke Model (650000, 700000] 0.06864 (700000, 800000] 0.06864 Interval Probability (800000, 845000] 0.01726 (305000, 512931] 0.1 (512931, 563423] 0.4 (563423, 628864] 0.4 Utility functions and methods (628864, 845000] 0.1 In addition to the main function expert, the package features a few functions and methods to ease manip- Alpha: 0.03 ulation and analysis of the aggregated distribution. As previously explained, if the value of α is not spec- Most are inspired by similar functions in package ac- ified in the call to expert, its value is obtained by op- tuar (Dutang et al., 2008). timization and the resulting aggregated distribution First, a summary method for objects of class is different: ‘expert’ provides more details on the fitted model

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 34 CONTRIBUTED RESEARCH ARTICLES than the print method, namely the number of ex- > F <- cdf(fit) perts, the number of seed variables and the requested > knots(F) and set quantiles: > summary(fit) [1] 305000 350000 400000 525000 550000 [6] 600000 625000 650000 700000 800000 Call: [11] 845000 expert(x = x, method = "ms", probs = probs, true.seed = true.seed) > F(knots(F))

Aggregated Distribution Using [1] 0.00000 0.01726 0.08590 0.15455 Mendel-Sheridan Model [5] 0.17181 0.24045 0.30909 0.84545 [9] 0.91410 0.98274 1.00000 Interval Probability (305000, 350000] 0.01726 (350000, 400000] 0.06864 > F(c(450000, 750000)) (400000, 525000] 0.06864 (525000, 550000] 0.01726 [1] 0.0859 0.9141 (550000, 600000] 0.06864 (600000, 625000] 0.06864 (625000, 650000] 0.53636 A plot method yields the graphic in Figure1: (650000, 700000] 0.06864 (700000, 800000] 0.06864 > plot(F) (800000, 845000] 0.01726

Number of experts: 3, Number of seed variables: 2 cdf(fit) Quantiles: 0*, 0.1, 0.5, 0.9, 1* (* were set) ● ● 1.0 ● ● Methods of mean and quantile give easy access 0.8 to the mean and to the quantiles of the aggregated

distribution: 0.6

> mean(fit) F(x) 0.4 ● [1] 607875 ● ●

0.2 ● > quantile(fit) ● ● ● 0.0 0% 25% 50% 75% 100% 305000 600000 625000 625000 845000 3e+05 5e+05 7e+05 9e+05

Of course, one may also specify one or many quan- x tiles in a second argument: > quantile(fit, c(0.1, 0.9)) Figure 1: Cumulative distribution function of the ag- 10% 90% gregated distribution 400000 650000 Function ogive is the analogue of cdf for the lin- quantile Moreover, the method can return linearly early interpolated cdf: interpolated quantiles by setting the smooth argu- ment to TRUE: > G <- ogive(fit) > quantile(fit, smooth = TRUE) > G(knots(G))

0% 25% 50% 75% 100% [1] 0.00000 0.01726 0.08590 0.15455 305000 603478 633898 645551 845000 [5] 0.17181 0.24045 0.30909 0.84545 [9] 0.91410 0.98274 1.00000 The inverse of the quantile and smoothed quan- tile functions are, respectively, the cumulative distri- bution function (cdf) and the ogive (Klugman et al., > G(c(450000, 750000)) 1998; Goulet and Pigeon, 2008). In a fashion similar to ecdf of the base R package stats, the cdf function [1] 0.1134 0.9484 returns a function object to compute the cdf of the aggregated distribution at any point: > plot(G)


See Figure 2 for the graphic created with the last expression.

Figure 2: Ogive of the aggregated distribution

That means one can further fit a continuous or discrete distribution to the aggregated "data". For example, consider the task of fitting a gamma distribution to the Mendel–Sheridan aggregated distribution obtained above. Among many other options (see, e.g., Klugman et al., 1998; Goulet and Pigeon, 2008), we choose our parameters to minimize the Cramér-von Mises distance

    d(α, λ) = ∑_{i=1}^{n} ( G(x_i) − F(x_i; α, λ) )²,

where G(·) is the ogive of the aggregated distribution, x1, ..., xn are its knots and F(·; α, λ) is the cdf of a gamma distribution with shape parameter α and rate parameter λ. The optimal parameters are simple to obtain with the function optim:

> f <- function(p, x) drop(crossprod(G(x) -
+     pgamma(x, p[1], p[2])))
> optim(c(100, 0.00016), f, x = knots(G))$par

[1] 4.233e+02 6.716e-04
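As a hedged follow-up (not part of the original article), the quality of this fit can be checked by comparing the fitted gamma cdf with the ogive at the knots, using the parameter values returned by optim above:

> p <- c(4.233e+02, 6.716e-04)
> round(cbind(ogive = G(knots(G)),
+             gamma = pgamma(knots(G), shape = p[1], rate = p[2])), 3)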

Conclusion

The paper introduced a package to elicit statistical distributions from expert opinion in situations where real data is scarce or absent. The main function, expert, is a unified interface to the two most important models in the field — the Cooke classical model and the Mendel-Sheridan Bayesian model — as well as the informal predefined weights model. The package also provides functions and methods to compute from the object returned by expert and plot important results. Most notably, cdf, ogive and quantile give access to the cumulative distribution function of the aggregated distribution, its ogive and the inverse of the cdf and ogive.

For modeling purposes, the package can be used as a direct replacement of Excalibur. However, coupled with the outstanding statistical and graphical capabilities of R, we feel expert goes far beyond Excalibur in terms of usability and interoperability.

Acknowledgments

This research benefited from financial support from the Natural Sciences and Engineering Research Council of Canada and from the Chaire d'actuariat (Actuarial Science Chair) of Université Laval. We also thank one anonymous referee and the R Journal editors for corrections and improvements to the paper.


Bibliography

R. Cooke. Experts in Uncertainty. Oxford University Press, 1991.

R. Cooke and L. Goossens. TU Delft expert judgment data base. Reliability Engineering and System Safety, 93(5):657–674, 2008.

C. Dutang, V. Goulet, and M. Pigeon. actuar: An R package for actuarial science. Journal of Statistical Software, 25(7), 2008. URL http://www.jstatsoft.org/v25/i07.

V. Goulet and M. Pigeon. Statistical modeling of loss distributions using actuar. R News, 8(1):34–40, May 2008. URL http://cran.r-project.org/doc/Rnews/Rnews_2008-1.pdf.

V. Goulet, M. Jacques, and M. Pigeon. Modeling of data using expert opinion: An actuarial perspective. 2008. In preparation.

J. B. Kadane and R. L. Winkler. Separating probability elicitation from utilities. Journal of the American Statistical Association, 83(402):357–363, 1988.

J. B. Kadane and L. J. Wolfson. Experiences in elicitation. The Statistician, 47(1):3–19, 1998.

S. A. Klugman, H. H. Panjer, and G. Willmot. Loss Models: From Data to Decisions. Wiley, New York, 1998. ISBN 0-4712388-4-8.

M. Mendel and T. Sheridan. Filtering information from human experts. IEEE Transactions on Systems, Man and Cybernetics, 36(1):6–16, 1989.

A. O'Hagan, C. Buck, C. Daneshkhah, J. Eiser, P. Garthwaite, D. Jenkinson, J. Oakley, and T. Rakow. Uncertain Judgements. Wiley, 2006.

F. Ouchi. A literature review on the use of expert opinion in probabilistic risk analysis. Policy Research Working Paper 3201, World Bank, 2004. URL http://econ.worldbank.org/.

M. Pigeon. Utilisation d'avis d'experts en actuariat. Master's thesis, Université Laval, 2008.

A. Tversky and D. Kahneman. Judgment under uncertainty: Heuristics and biases. Science, New Series, 185(4157):1124–1131, 1974.

Vincent Goulet
École d'actuariat, Université Laval
Québec, Canada
[email protected]

Michel Jacques
École d'actuariat, Université Laval
Québec, Canada
[email protected]

Mathieu Pigeon
Institut de Statistique, Université Catholique de Louvain
Louvain-la-Neuve, Belgium
[email protected]

mvtnorm: New Numerical Algorithm for Multivariate Normal Probabilities

by Xuefei Mi, Tetsuhisa Miwa and Torsten Hothorn

Miwa et al. (2003) proposed a numerical algorithm for evaluating multivariate normal probabilities. Starting with version 0.9-0 of the mvtnorm package (Hothorn et al., 2001; Genz et al., 2008), this algorithm is available to the R community. We give a brief introduction to Miwa's procedure and compare it, with respect to computing time and accuracy, to a quasi-randomized Monte-Carlo procedure proposed by Genz and Bretz (1999), which has been available through mvtnorm for some years now.

The new algorithm is applicable to problems with dimension smaller than 20, whereas the procedures by Genz and Bretz (1999) can be used to evaluate 1000-dimensional normal distributions. At the end of this article, a suggestion is given for choosing a suitable algorithm in different situations.

Introduction

An algorithm for calculating any non-centered orthant probability of a non-singular multivariate normal distribution is described by Miwa et al. (2003). The probability function in a one-sided problem is

    P_m(µ, R) = P{X_i ≥ 0; 1 ≤ i ≤ m}
              = ∫_0^∞ ... ∫_0^∞ φ_m(x; µ, R) dx_1 ... dx_m

where µ = (µ_1, ..., µ_m)' is the mean and R = (ρ_ij) the correlation matrix of m multivariate normal distributed random variables X_1, ..., X_m ∼ N_m(µ, R). The function φ_m denotes the density function of the m-dimensional normal distribution with mean µ and correlation matrix R.

The distribution function for c = (c_1, ..., c_m) can be expressed as:

    F_m(c) = P{X_i ≤ c_i; 1 ≤ i ≤ m}
           = P{−X_i ≥ −c_i; 1 ≤ i ≤ m}
           = P_m(−µ + c, R).

The m-dimensional non-centered orthant one-sided probability can be calculated from at most (m − 1)! non-centered probabilities with positive definite tri-diagonal correlation matrix. The algorithm for calculating such probabilities is a recursive linear integration procedure. The total order of a one-sided problem is G × m!, where G is the number of grid points for integration.

The two-sided probability

    F_m(d, c) = P{d_i ≤ X_i ≤ c_i; 1 ≤ i ≤ m}

can be calculated from 2^m m-dimensional one-sided probabilities which have the same mean and correlation matrix. The total order of this two-sided problem is G × m! × 2^m.

A new algorithm argument to pmvnorm() and qmvnorm() has been introduced in mvtnorm version 0.9-0 in order to switch between two algorithms: GenzBretz() is the default and triggers the use of the above mentioned quasi-randomized Monte-Carlo procedure by Genz and Bretz (1999). Alternatively, algorithm = Miwa() applies the procedure discussed here. Both functions can be used to specify hyper-parameters of the algorithm. For Miwa(), the argument steps defines the number of grid points G to be evaluated.

The following example shows how to calculate the probability

    p = F_m(d, c) = P{−1 < X_1 < 1, −4 < X_2 < 4, −2 < X_3 < 2}

with mean µ = (0, 0, 0)' and correlation matrix

    R = [ 1    1/4  1/5
          1/4  1    1/3
          1/5  1/3  1   ]

by using the following R code:

> library("mvtnorm")
> m <- 3
> S <- diag(m)
> S[2, 1] <- S[1, 2] <- 1 / 4
> S[3, 1] <- S[1, 3] <- 1 / 5
> S[3, 2] <- S[2, 3] <- 1 / 3
> pmvnorm(lower = -c(1, 4, 2),
+         upper = c(1, 4, 2),
+         mean = rep(0, m), sigma = S,
+         algorithm = Miwa())

[1] 0.6536804
attr(,"error")
[1] NA
attr(,"msg")
[1] "Normal Completion"

The upper limit and lower limit of the integral region are passed by the vectors upper and lower. The mean vector and correlation matrix are given by the vector mean and the matrix sigma. From the result we know that p = 0.6536804 with the given correlation matrix R.
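For comparison, the same probability can also be evaluated with the default quasi-randomized Monte-Carlo procedure; this short sketch is not part of the original example and the tolerance chosen for abseps is only an illustrative value:

> pmvnorm(lower = -c(1, 4, 2), upper = c(1, 4, 2),
+         mean = rep(0, m), sigma = S,
+         algorithm = GenzBretz(abseps = 1e-5))

Up to the Monte-Carlo error reported in the "error" attribute, the result should agree with the value obtained with Miwa().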


Accuracy and time consumption

In this section we compare the accuracy and time consumption of the R implementation of the algorithm of Miwa et al. (2003) with the default procedure for approximating multivariate normal probabilities in mvtnorm by Genz and Bretz (1999). The experiments were performed using an Intel Pentium 4 processor with 2.8 GHz.

Probabilities with tri-diagonal correlation matrix

The exact value of P_m(µ, R) is known if R has some special structure. E.g., when R is a m-dimensional tri-diagonal correlation matrix with correlation coefficients

    ρ_{i,j} = 2^{-1}  if j = i ± 1,    ρ_{i,j} = 0  if |i − j| > 1,    1 ≤ i ≤ m,

the value of P_m(0, R) is ((1 + m)!)^{-1} (Miwa et al., 2003). The accuracy of Miwa's algorithm (G = 128 grid points) and the Genz & Bretz algorithm (with absolute error tolerance ε = 10^{-4}, 10^{-5}, 10^{-6}) for probabilities with tri-diagonal correlation matrix are compared in Table 1. In each calculation we have results with small variance. The values which do not hold the tolerance error are marked with bold characters in the tables. When the dimension is larger than five, using Genz & Bretz' algorithm, it is hard to achieve an error tolerance smaller than 10^{-5}.

                                     m = 5                           m = 10
Algorithm                    ρ = 1/2       ρ = −1/2        ρ = 1/2       ρ = −1/2
Genz & Bretz (ε = 10^{-4})   0.08468833    0.001385620     0.008863600   2.376316 × 10^{-8}
Genz & Bretz (ε = 10^{-5})   0.08472561    0.001390769     0.008863877   2.319286 × 10^{-8}
Genz & Bretz (ε = 10^{-6})   0.08472682    0.001388424     0.008862195   2.671923 × 10^{-8}
Miwa (G = 128)               0.08472222    0.001388889     0.008863235   2.505205 × 10^{-8}
Exact                        0.08472222    0.001388889     0.008863236   2.505211 × 10^{-8}

Table 1: Value of probabilities with tri-diagonal correlation coefficients, ρ_{i,i±1} = 2^{-1}, 1 ≤ i ≤ m, and ρ_{i,j} = 0 for all |i − j| > 1.

Orthant probabilities

When R is the correlation matrix with ρ_{i,j} = 2^{-1}, i ≠ j, the value of P_m(0, R) is known to be (1 + m)^{-1} (Miwa et al., 2003). Accuracy and time consumption of Miwa's algorithm and Genz & Bretz' algorithm for this situation are compared in Table 2.

                               m = 5                    m = 9
Algorithm                  ρ = 1/2     sec.        ρ = 1/2      sec.
Genz & Bretz (ε = 10^{-4})  0.1666398   0.029       0.09998728    0.231
Genz & Bretz (ε = 10^{-5})  0.1666719   0.132       0.09998277    0.403
Genz & Bretz (ε = 10^{-6})  0.1666686   0.133       0.09999726    0.431
Miwa (G = 128)              0.1666667   0.021       0.09999995   89.921
Exact                       0.1666667               0.10000000

Table 2: Accuracy and time consumption of centered orthant probabilities with correlation coefficients ρ_{i,j} = 2^{-1}, i ≠ j.
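The exact value in Table 2 can be reproduced directly; the following sketch is not from the original article and simply checks P_m(0, R) = (1 + m)^{-1} for m = 5:

> m <- 5
> R <- matrix(1 / 2, m, m)
> diag(R) <- 1
> pmvnorm(lower = rep(0, m), upper = rep(Inf, m),
+         sigma = R, algorithm = Miwa(steps = 128))
> 1 / (m + 1)   # exact value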


Time consumption

Miwa's algorithm is a numerical method which has advantages in achieving higher accuracy with less time consumption compared to the stochastic method. However, Miwa's algorithm has some disadvantages. When the dimension m increases, the time consumption of Miwa's algorithm increases dramatically. Moreover, it can not be applied to singular problems, which are common in multiple testing problems, for example.

Dimension   Miwa (G = 128)            Genz & Bretz (ε = 10^{-4})
            Single      Double        Single      Double
m = 5       0.021       0.441         0.029       0.085
m = 6       0.089       8.731         0.089       0.149
m = 7       0.599       156.01        0.083       0.255
m = 8       9.956       4 hours       0.138       0.233
m = 9       89.921      -             0.231       0.392

Table 3: Time consumption of non-centered orthant probabilities (measured in seconds).
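Timings of this kind can be obtained with system.time(); this is only an illustrative sketch, not part of the original benchmark, and uses the equicorrelated orthant problem from above:

> m <- 5
> R <- matrix(1 / 2, m, m); diag(R) <- 1
> system.time(pmvnorm(lower = rep(0, m), upper = rep(Inf, m),
+                     sigma = R, algorithm = Miwa(steps = 128)))
> system.time(pmvnorm(lower = rep(0, m), upper = rep(Inf, m),
+                     sigma = R, algorithm = GenzBretz(abseps = 1e-4)))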

Conclusion

We have implemented an R interface to the procedure of Miwa et al. (2003) in the R package mvtnorm. For small dimensions, it is an alternative to the quasi-randomized Monte-Carlo procedures which are used by default. For larger dimensions and especially for singular problems, the method is not applicable.

Bibliography

A. Genz and F. Bretz. Numerical computation of multivariate t-probabilities with application to power calculation of multiple contrasts. Journal of Statistical Computation and Simulation, 63:361–378, 1999.

A. Genz, F. Bretz, T. Hothorn, T. Miwa, X. Mi, F. Leisch, and F. Scheipl. mvtnorm: Multivariate Normal and T Distribution, 2008. URL http://CRAN.R-project.org. R package version 0.9-0.

T. Hothorn, F. Bretz, and A. Genz. On multivariate t and Gauß probabilities in R. R News, 1(2):27–29, June 2001.

T. Miwa, A. J. Hayter, and S. Kuriki. The evaluation of general non-centred orthant probabilities. Journal of The Royal Statistical Society Series B–Statistical Methodology, 65:223–234, 2003.

Xuefei Mi
Institut für Biostatistik
Leibniz Universität Hannover, Germany
[email protected]

Tetsuhisa Miwa
National Institute for Agro-Environmental Sciences
Kannondai, Japan
[email protected]

Torsten Hothorn
Institut für Statistik
Ludwig-Maximilians-Universität München, Germany
[email protected]


EMD: A Package for Empirical Mode Decomposition and Hilbert Spectrum by Donghoh Kim and Hee-Seok Oh

Introduction

The concept of empirical mode decomposition (EMD) and the Hilbert spectrum (HS) has been developed rapidly in many disciplines of science and engineering since Huang et al. (1998) invented EMD. The key feature of EMD is to decompose a signal into so-called intrinsic mode functions (IMF). Furthermore, the Hilbert spectral analysis of intrinsic mode functions provides frequency information evolving with time and quantifies the amount of variation due to oscillation at different time scales and time locations. In this article, we introduce an R package called EMD (Kim and Oh, 2008) that performs one- and two-dimensional EMD and HS.

Intrinsic mode function

The essential step extracting an IMF is to identify an oscillation embedded in a signal from local time scale. Consider the following synthetic signal x(t), 0 < t < 9, of the form

    x(t) = 0.5 t + sin(πt) + sin(2πt) + sin(6πt).   (1)

The signal in Figure 1 consists of several components, which are generated through the process that a component is superimposed to each other.

Figure 1: A sinusoidal function having 4 components

An intrinsic oscillation or frequency of a component, for example, sin(πt), t ∈ (0, 9) in Figure 1, can be perceived through the red solid wave or the blue dotted wave in Figure 2. The blue dotted wave in Figure 2 illustrates one cycle of intrinsic oscillation, which starts at a local maximum and terminates at a consecutive local maximum by passing through two zeros and a local minimum which eventually appears between two consecutive maxima. A component for a given time scale can be regarded as the composition of repeated intrinsic oscillation which is symmetric to its local mean, zero.

Figure 2: A sinusoidal function

Thus the first step to define intrinsic oscillation is to detect local extrema or zero-crossings. The function extrema() identifies local extrema and zero-crossings of the signal in Figure 2.

> ### Identify extrema and zero-crossings
> ndata <- 3000
> tt <- seq(0, 9, length=ndata)
> xt <- sin(pi * tt)
>
> library(EMD)
> extrema(xt)
$minindex
     [,1] [,2]
[1,]  501  501
[2,] 1167 1167
[3,] 1834 1834
[4,] 2500 2500

$maxindex
     [,1] [,2]
[1,]  168  168
[2,]  834  834
[3,] 1500 1501
[4,] 2167 2167
[5,] 2833 2833

$nextreme
[1] 9

$cross
      [,1] [,2]
 [1,]    1    1
 [2,]  334  335
 [3,]  667  668
 [4,] 1000 1001
 [5,] 1333 1334
 [6,] 1667 1668
 [7,] 2000 2001
 [8,] 2333 2334
 [9,] 2666 2667

$ncross
[1] 9

The function extrema() returns a list of the following components.


• minindex : matrix of time indices at which local minima are attained. Each row specifies a starting and ending time index of a local minimum.

• maxindex : matrix of time indices at which local maxima are attained. Each row specifies a starting and ending time index of a local maximum.

• nextreme : the number of extrema.

• cross : matrix of time indices of zero-crossings. Each row specifies a starting and ending time index of a zero-crossing.

• ncross : the number of zero-crossings.

Once the local extrema are obtained, the intrinsic mode function is derived through the sifting procedure.

Sifting process

Huang et al. (1998) suggested a data-adapted algorithm extracting a sinusoidal wave, or equivalently a frequency, from a given signal x. First, identify the local extrema in Figure 3(a), and generate the two functions called the upper envelope and lower envelope by interpolating local maxima and local minima, respectively. See Figure 3(b). Second, take their average, which will produce a lower frequency component than the original signal as in Figure 3(c). Third, by subtracting the envelope mean from the signal x, the highly oscillated pattern h is separated as in Figure 3(d).

Huang et al. (1998) defined an oscillating wave as an intrinsic mode function if it satisfies two conditions: 1) the number of extrema and the number of zero-crossings differ only by one, and 2) the local average is zero. If the conditions of an IMF are not satisfied after one iteration of the aforementioned procedure, the same procedure is applied to the residue signal as in Figure 3(d), (e) and (f) until the properties of an IMF are satisfied.

Figure 3: Sifting procedure

This iterative process is called sifting. The following code produces Figure 3, and the function extractimf() implements the sifting algorithm by identifying the local extrema with extrema(). Note that when setting the option ‘check=TRUE’, one must click the plot to proceed to the next step.

> ### Generating a signal
> ndata <- 3000
> par(mfrow=c(1,1), mar=c(1,1,1,1))
> tt2 <- seq(0, 9, length=ndata)
> xt2 <- sin(pi * tt2) + sin(2* pi * tt2) +
+    sin(6 * pi * tt2) + 0.5 * tt2
> plot(tt2, xt2, xlab="", ylab="", type="l",
+    axes=FALSE); box()
>
> ### Extracting the first IMF by sifting process
> tryimf <- extractimf(xt2, tt2, check=TRUE)

The function extractimf() extracts IMF's from a given signal, and it is controlled by the following arguments.

• residue : observation or signal observed at time tt.

• tt : observation index or time index.

• tol : tolerance for stopping rule.

• max.sift : the maximum number of sifting.

• stoprule : stopping rule.

• boundary : specifies boundary condition.

• check : specifies whether the sifting process is displayed. If check=TRUE, click the plotting area to start the next step.

Stopping rule

The sifting process stops when the number of replications of the sifting procedure exceeds the predefined maximum number given by max.sift, or when the candidate satisfies the properties of an IMF according to the stopping rule. The stopping rule stoprule has two options – "type1" and "type2". The option stoprule = "type1" makes the sifting process stop when the absolute values of the candidate IMF h_i are smaller than the tolerance level, that is, |h_i(t)| < tol for all t. Alternatively, with the option stoprule = "type2", the sifting process stops when the variation of consecutive candidate IMF's is within the tolerance level,

    ∑_t ( (h_i(t) − h_{i−1}(t)) / h_{i−1}(t) )² < tol.
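As a small non-interactive illustration of these arguments (not part of the original text), the first IMF of the signal xt2 defined above could be extracted with an explicit stopping rule; the tolerance value here is an arbitrary choice:

> imf1 <- extractimf(xt2, tt2, stoprule="type2",
+    tol=0.1, check=FALSE)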


Boundary adjustment

To eliminate the boundary effect of a signal, it is necessary to adjust a signal at the boundary. Huang et al. (1998) extended the original signal by adding artificial waves repeatedly on both sides of the boundaries. The waves, called characteristic waves, are constructed by repeating the implicit mode formed from the extreme values nearest to the boundary. The argument boundary specifies the adjusting method of the boundary. The argument boundary = "wave" constructs a wave which is defined by two consecutive extrema at either boundary, and adds four waves at either end. Typical adjusting methods extend a signal assuming that a signal is symmetric or periodic. The option boundary = "symmetric" or boundary = "periodic" extends both boundaries symmetrically or periodically.

Zeng and He (2004) considered two extended signals by adding a signal in a symmetric way and a reflexive way, called even extension and odd extension, respectively. Even extension and odd extension produce the extended signals so that their average is zero. This boundary condition can be specified by boundary = "evenodd". For each extended signal, upper and lower envelopes are constructed and the envelope mean of the extended signals is defined by the average of the four envelopes. Then, the envelope mean outside the time scale of the original signal is close to zero, while the envelope mean within the time scale of the original signal is almost the same as the envelope mean of the original signal. On the other hand, the option boundary = "none" performs no boundary adjustments.

Empirical mode decomposition

Once the highest frequency is removed from a signal, the same procedure is applied on the residue signal to identify the next highest frequency. The residue is considered a new signal to decompose.

Suppose that we have a signal from model (1). The signal in Figure 1 is composed of 4 components, from sin(6πt) with the highest frequency to 0.5t with the lowest frequency. We may regard the linear component as a component having the lowest frequency. The left panel in Figure 4 illustrates the first IMF and the residue signal obtained by the function extractimf(). If the remaining signal is still compound of components with several frequencies as in the left panel in Figure 4, then the next IMF is obtained by taking the residue signal as a new signal in the right panel in Figure 4. The number of extrema will decrease as the procedure continues, so that the signal is sequentially decomposed into the highest frequency component imf_1 to the lowest frequency component imf_n, for some finite n and a residue signal r. Finally, we have n IMF's and a residue signal as

    x(t) = ∑_{i=1}^{n} imf_i(t) + r(t).

Figure 4: Two IMF's by the sifting algorithm

The above-mentioned decomposition process is implemented by the function emd() that utilizes the functions extractimf() and extrema(). The final decomposition result by the following code is illustrated in Figure 5.

> ### Empirical Mode Decomposition
> par(mfrow=c(3,1), mar=c(2,1,2,1))
> try <- emd(xt2, tt2, boundary="wave")
>
> ### Ploting the IMF's
> par(mfrow=c(try$nimf+1, 1), mar=c(2,1,2,1))
> rangeimf <- range(try$imf)
> for(i in 1:try$nimf)
+   plot(tt2, try$imf[,i], type="l", xlab="",
+     ylab="", ylim=rangeimf, main=
+     paste(i, "-th IMF", sep="")); abline(h=0)
> plot(tt2, try$residue, xlab="", ylab="",
+   main="residue", type="l")

Figure 5: Decomposition of a signal by model (1)

The arguments of emd() are similar to those of extractimf(). The additional arguments are

• max.imf : the maximum number of IMF's.

• plot.imf : specifies whether each IMF is displayed. If plot.imf=TRUE, click the plotting area to start the next step.
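Because the IMF's and the residue add back up to the input signal by construction, the decomposition can be checked numerically; this one-liner is not part of the original example and reuses xt2 and try from above:

> max(abs(xt2 - (rowSums(try$imf) + try$residue)))

The result should be zero up to floating-point error.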


Up to now we have focused on artificial signals without any measurement error. A typical signal in the real world is corrupted by noise, which is not the component of interest and contains no interpretable information. A remedy to smooth out the noise is to apply a smoothing technique, not interpolation, during the sifting process. Then the first IMF might capture the entire noise effectively. As an alternative, Kim and Oh (Kim and Oh, 2006) proposed an efficient smoothing method for IMF's by combining the conventional cross-validation and thresholding approach. By thresholding, a noisy signal can be denoised while the distinct localized features of a signal can be kept.

Intermittence

Huang et al. (1998, 2003) pointed out that intermittence raises mode mixing, which means that different modes of oscillations coexist in a single IMF. Since EMD traces the highest frequency embedded in a given signal locally, when intermittence occurs, the shape of the resulting IMF is abruptly changed and this effect distorts procedures thereafter.

Huang et al. (2003) attacked this phenomenon by restricting the size of frequency. To be specific, a distance limit of the successive maxima (minima) in an IMF is introduced. Thus, an IMF is composed of only sinusoidal waves whose lengths of successive maxima (minima) are shorter than their limit. Equivalently, we may employ the length of the zero-crossings to overcome the intermittence problem. Consider a signal x(t) combined by two sine curves (Deering and Kaiser, 2005),

    x(t) = sin(2π f_1 t) + sin(2π f_2 t),   1/30 ≤ t ≤ 2/30,
    x(t) = sin(2π f_1 t),                   otherwise.          (2)

Figure 6 illustrates the signal x(t) when f_1 = 1776 and f_2 = 1000 and the corresponding two IMF's. The first IMF absorbs the component that appeared in the second IMF between 1/30 and 2/30. Thus, the resulting IMF has a mode mixing pattern.

Figure 6: Signal x(t) by model (2) and first two IMF's

> ### Mode mixing
> tt <- seq(0, 0.1, length = 2001)[1:2000]
> f1 <- 1776; f2 <- 1000
> xt <- sin(2*pi*f1*tt) * (tt <= 0.033 |
+   tt >= 0.067) + sin(2*pi*f2*tt)
>
> ### EMD
> interm1 <- emd(xt, tt, boundary="wave",
+   max.imf=2, plot.imf=FALSE)
> par(mfrow=c(3, 1), mar=c(3,2,2,1))
> plot(tt, xt, main="Signal", type="l")
> rangeimf <- range(interm1$imf)
> plot(tt, interm1$imf[,1], type="l", xlab="",
+   ylab="", ylim=rangeimf, main="IMF 1")
> plot(tt, interm1$imf[,2], type="l", xlab="",
+   ylab="", ylim=rangeimf, main="IMF 2")

By following the approach of Huang et al. (1998, 2003), we can remove waves whose empirical period, represented by the distance between every other zero-crossing, is larger than 0.0007 in the first IMF. The period information obtained by the histogram in Figure 7 can be used to choose an appropriate distance. We eliminate the waves with lower frequency in the first IMF using the histogram of these distances.

> ### Histogram of empirical period
> par(mfrow=c(1,1), mar=c(2,4,1,1))
> tmpinterm <- extrema(interm1$imf[,1])
> zerocross <-
+   as.numeric(round(apply(tmpinterm$cross, 1, mean)))
> hist(diff(tt[zerocross[seq(1, length(zerocross),
+   by=2)]]), freq=FALSE, xlab="", main="")

Figure 7: Histogram of the empirical period

Figure 8 shows the resulting IMF's after treating intermittence properly. The argument interm of the function emd() specifies a vector of periods to be excluded from the IMF's.

> ### Treating intermittence
> interm2 <- emd(xt, tt, boundary="wave",
+   max.imf=2, plot.imf=FALSE, interm=0.0007)
>
> ### Plot of each imf
> par(mfrow=c(2,1), mar=c(2,2,3,1), oma=c(0,0,0,0))
> rangeimf <- range(interm2$imf)
> plot(tt, interm2$imf[,1], type="l",
+   main="IMF 1 after treating intermittence",
+   xlab="", ylab="", ylim=rangeimf)
> plot(tt, interm2$imf[,2], type="l",
+   main="IMF 2 after treating intermittence",
+   xlab="", ylab="", ylim=rangeimf)


Figure 8: Decomposition of signal x(t) by model (2) after treating the intermittence.

Hilbert spectrum

When a signal is subject to non-stationarity, so that the frequency and amplitude change over time, it is necessary to have a more flexible and extended notion of frequency. Huang et al. (1998) used the concept of instantaneous frequency through the Hilbert transform. For a comprehensive explanation of the Hilbert transform, refer to Cohen (1995). For a real signal x(t), the analytic signal z(t) is defined as z(t) = x(t) + i y(t), where y(t) is the Hilbert transform of x(t), that is,

    y(t) = (1/π) P ∫_{−∞}^{∞} x(s) / (t − s) ds,

where P is the Cauchy principal value. The polar coordinate form of the analytic signal z with amplitude and phase is z(t) = a(t) exp(iθ(t)), where the amplitude a(t) is ||z(t)|| = sqrt(x(t)² + y(t)²) and the phase θ(t) is arctan(y(t)/x(t)). The instantaneous frequency as time-varying phase is defined as dθ(t)/dt. After decomposing a signal into IMF's with EMD, thereby preserving any local property in the time domain, we can extract localized information in the frequency domain with the Hilbert transform and identify hidden local structures embedded in the original signal. The local information can be described by the Hilbert spectrum, which is an amplitude and instantaneous frequency representation with respect to time. Figure 9 describes the Hilbert spectrum for IMF 1 of the signal of model (2) before and after treating the intermittence. The X-Y axis represents time and instantaneous frequency, and the color intensity of the image depicts instantaneous amplitude.

Figure 9: The Hilbert spectrum for IMF 1 of the signal of model (2)

For multiple signals, the function hilbertspec() calculates the amplitudes and instantaneous frequencies using the Hilbert transform. The function has the following arguments,

• xt : matrix of multiple signals. Each column represents a signal.

• tt : observation index or time index.

The function hilbertspec() returns a matrix of amplitudes and instantaneous frequencies for multiple signals. The function spectrogram() produces an image of amplitude by time index and instantaneous frequency. The horizontal axis represents time, the vertical axis is instantaneous frequency, and the color of each point in the image represents the amplitude of a particular frequency at a particular time. It has arguments as

• amplitude : vector or matrix of amplitudes for multiple signals.

• freq : vector or matrix of instantaneous frequencies for multiple signals.

• tt : observation index or time index.

• multi : specifies whether spectrograms of multiple signals are separated or not.

• nlevel : the number of color levels used in the legend strip.

• size : vector of image size.

> ### Spectrogram : X - Time, Y - frequency,
> ### Z (Image) - Amplitude
> test1 <- hilbertspec(interm1$imf)
> spectrogram(test1$amplitude[,1],
+   test1$instantfreq[,1])
> test2 <- hilbertspec(interm2$imf, tt=tt)
> spectrogram(test2$amplitude[,1],
+   test2$instantfreq[,1])

Extension to two dimensional image

The extension of EMD to an image or two dimensional data is straightforward except for the identification of the local extrema. Once the local extrema are identified, the two dimensional smoothing spline technique is used for the sifting procedure. For the two-dimensional case, we provide four R functions.

(1) extrema2dC() for identifying the two dimensional extrema,

(2) extractimf2d() for extracting the IMF from a given image,

(3) emd2d() for decomposing an image to IMF's and the residue image, combining the two R functions above, and

(4) imageEMD() for displaying the decomposition results.

As in the one-dimensional case, extractimf2d() extracts two dimensional IMF's from a given image based on local extrema identified by extrema2dC(). Combining these functions, emd2d() performs the decomposition and its arguments are as follows.

• z : matrix of an image observed at (x, y).

• x, y : locations of the regular grid at which the values in z are measured.

• tol : tolerance for stopping rule of sifting.

• max.sift : the maximum number of sifting.

• boundary : specifies boundary condition ‘symmetric’, ‘reflexive’ or ‘none’.

• boundperc : expand an image by adding a specified percentage of the image at the boundary when the boundary condition is ‘symmetric’ or ‘reflexive’.

• max.imf : the maximum number of IMF.

• plot.imf : specifies whether each IMF is displayed. If plot.imf=TRUE, click the plotting area to start the next step.

The following R code performs two dimensional EMD of the Lena image. The size of the original image is reduced for computational simplicity.

> data(lena)
> z <- lena[seq(1, 512, by=4), seq(1, 512, by=4)]
> lenadecom <- emd2d(z, max.imf = 4)

The R function imageEMD() plots decomposition results and the argument extrema=TRUE illustrates the local maxima (minima) with white (black) color on a grey background. See Figure 10.

> imageEMD(z=z, emdz=lenadecom, extrema=TRUE,
+   col=gray(0:100/100))

Figure 10: Decomposition of the Lena image

Conclusions

IMF's through EMD provide a multi-resolution tool and spectral analysis gives local information with time-varying amplitude and phase according to the scales. We introduce EMD, an R package for the proper implementation of EMD, and the Hilbert spectral analysis for non-stationary signals. It is expected that the R package EMD makes EMD methodology practical for many statistical applications.


Bibliography

L. Cohen. Time-Frequency Analysis. Prentice-Hall, 1995.

R. Deering and J. F. Kaiser. The Use of a Masking Signal to Improve Empirical Mode Decomposition. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 4:485–488, 2005.

N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N.-C. Yen, C. C. Tung, and H. H. Liu. The Empirical Mode Decomposition and Hilbert Spectrum for Nonlinear and Nonstationary Time Series Analysis. Proceedings of the Royal Society London A., 454:903–995, 1998.

N. E. Huang, M. C. Wu, S. R. Long, S. Shen, W. Qu, P. Gloerson, and K. L. Fan. A Confidence Limit for the Empirical Mode Decomposition and Hilbert Spectral Analysis. Proceedings of the Royal Society London A., 31:417–457, 2003.

D. Kim and H-S. Oh. Hierarchical Smoothing Technique by Empirical Mode Decomposition. The Korean Journal of Applied Statistics, 19:319–330, 2006.

D. Kim and H-S. Oh. EMD: Empirical Mode Decomposition and Hilbert Spectral Analysis, 2008. URL http://cran.r-project.org/web/packages/EMD/index.html.

D. Nychka. fields: Tools for Spatial Data, 2007. URL http://cran.r-project.org/web/packages/fields/index.html.

K. Zeng and M-X. He. A Simple Boundary Process Technique for Empirical Mode Decomposition. Proceedings of IEEE International Geoscience and Remote Sensing Symposium, 6:4258–4261, 2004.

Donghoh Kim
Sejong University, Korea
E-mail: [email protected]

Hee-Seok Oh
Seoul National University, Korea
E-mail: [email protected]


Sample Size Estimation while Controlling False Discovery Rate for Microarray Experiments Using the ssize.fdr Package

by Megan Orr and Peng Liu

Introduction

Microarray experiments are becoming more and more popular and critical in many biological disciplines. As in any statistical experiment, appropriate experimental design is essential for reliable statistical inference, and sample size has a crucial role in experimental design. Because microarray experiments are rather costly, it is important to have an adequate sample size that will achieve a desired power without wasting resources.

For a given microarray data set, thousands of hypotheses, one for each gene, are simultaneously tested. Storey and Tibshirani (2003) argue that controlling the false discovery rate (FDR) is more reasonable and more powerful than controlling the family-wise error rate (FWER) in genomic data. However, the most common procedure used to calculate sample size involves controlling FWER, not FDR.

Liu and Hwang (2007) describe a method for a quick sample size calculation for microarray experiments while controlling FDR. In this paper, we introduce the R package ssize.fdr which implements the method proposed by Liu and Hwang (2007). This package can be applied for designing one-sample, two-sample, or multi-sample experiments. The practitioner defines the desired power, the level of FDR to control, the proportion of non-differentially expressed genes, as well as effect size and variance. More specifically, the effect size and variance can be set as fixed or random quantities coming from appropriate distributions. Based on user-defined parameters, this package creates plots of power vs. sample size. These plots allow for visualization of trade-offs between power and sample size, in addition to the changes between power and sample size when the user-defined quantities are modified.

For more in-depth details and evaluation of this sample size calculation method, please refer to Liu and Hwang (2007).

Method

For a given microarray experiment, for each gene, let H = 0 if the null hypothesis is true and H = 1 if the alternative hypothesis is true. In a microarray experiment, H = 1 represents differential expression for a gene, whereas H = 0 represents no differential expression. As in Storey and Tibshirani (2003), we assume each test is Bernoulli distributed with the probability Pr(H = 0) = π0, where π0 is the proportion of non-differentially expressed genes. Liu and Hwang (2007) derived that the following must hold to control FDR at the level of α:

    α / (1 − α) ≥ [(1 − π0) / π0] · [Pr(T ∈ Γ | H = 0) / Pr(T ∈ Γ | H = 1)]   (1)

where T is the test statistic and Γ is the rejection region of the test. For each proposed hypothesis (test) with given user-defined quantities, the sample size is calculated using the following steps:

1. Solve for Γ using (1) for each sample size.

2. Calculate the power corresponding to Γ with the appropriate formula for Pr(T ∈ Γ | H = 1).

3. Determine the sample size based on the desired power.

For specific experimental designs, the numerator and denominator of the right hand side of (1) are replaced by corresponding functions that calculate the type I error and power of the test, respectively.

Functions

The package ssize.fdr has six functions, which include three new functions that have recently been developed as well as three functions translated from the previously available Matlab code. The six functions and their descriptions are listed below.

ssize.oneSamp(delta, sigma, fdr = 0.05,
    power = 0.8, pi0 = 0.95, maxN = 35,
    side = "two-sided", cex.title=1.15,
    cex.legend=1)

ssize.oneSampVary(deltaMean, deltaSE, a, b,
    fdr = 0.05, power = 0.8, pi0 = 0.95,
    maxN = 35, side = "two-sided",
    cex.title=1.15, cex.legend=1)

ssize.twoSamp(delta, sigma, fdr = 0.05,
    power = 0.8, pi0 = 0.95, maxN = 35,
    side = "two-sided", cex.title=1.15,
    cex.legend=1)

ssize.twoSampVary(deltaMean, deltaSE, a, b,
    fdr = 0.05, power = 0.8, pi0 = 0.95,
    maxN = 35, side = "two-sided",
    cex.title=1.15, cex.legend=1)


ssize.F(X, beta, L = NULL, dn, sigma,
    fdr = 0.05, power = 0.8, pi0 = 0.95,
    maxN = 35, cex.title=1.15,
    cex.legend=1)

ssize.Fvary(X, beta, L = NULL, dn, a, b,
    fdr = 0.05, power = 0.8, pi0 = 0.95,
    maxN = 35, cex.title=1.15,
    cex.legend=1)

ssize.oneSamp and ssize.twoSamp compute appropriate sample sizes for one- and two-sample microarray experiments, respectively, for fixed effect size (delta) and standard deviation (sigma). For one-sample designs, the effect size is defined as the difference between the true mean expression level and its proposed value under the null hypothesis for each gene. For two-sample designs, the effect size is defined as the difference in mean expression levels between treatment groups for each gene. In the two-sample case, sigma is the pooled standard deviation of treatment expression levels.

ssize.oneSampVary and ssize.twoSampVary compute appropriate sample sizes for one- and two-sample microarray experiments, respectively, in which effect sizes and standard deviations vary among genes. Effect sizes among genes are assumed to follow a normal distribution with mean deltaMean and standard deviation deltaSE, while variances (the square of the standard deviations) among genes are assumed to follow an Inverse Gamma distribution with shape parameter a and scale parameter b.

ssize.F computes appropriate sample sizes for multi-sample microarray experiments. Additional inputs include the design matrix (X), the parameter vector (beta), the coefficient matrix for linear contrasts of interest (L), a function of n for the degrees of freedom of the experimental design (dn), and the pooled standard deviation of treatment expression levels (sigma), assumed to be identical for all genes.

ssize.Fvary computes appropriate sample sizes for multi-sample microarray experiments in which the parameter vector is fixed for all genes, but the variances are assumed to vary among genes and follow an Inverse Gamma distribution with shape parameter a and scale parameter b.

All functions contain a default value for π0 (pi0), among other quantities. The value of π0 can be obtained from a pilot study. If a pilot study is not available, a guess based on the biological system under study could be used. In this case, we recommend using a conservative guess (bigger values for π0) so that the desired power will still be achieved. The input π0 can be a vector, in which case separate calculations are performed for each element of the vector. This allows one to assess the changes of power due to the changes in the π0 estimate(s).

For functions that assume variances follow an Inverse Gamma distribution, it is important to note that if 1/σ² ∼ Gamma(α, β) with mean αβ, version 1.1 of the ssize.fdr package uses the parameterization that σ² ∼ Inverse Gamma(α, β⁻¹).

Each function outputs the following results: a plot of power vs. sample size, the smallest sample size that achieves the desired power, a table of calculated powers for each sample size, and a table of calculated critical values for each sample size.

Examples

Unless otherwise noted, all data sets analyzed in this section can be downloaded at http://bioinf.wehi.edu.au/limmaGUI/DataSets.html.

Using ssize.oneSamp and ssize.oneSampVary

To illustrate the use of the functions ssize.oneSamp and ssize.oneSampVary, a data example from Smyth (2004) will be used. In this experiment, zebrafish are used to study early development in vertebrates. Swirl is a point mutation in the BMP2 gene in zebrafish that affects the dorsal/ventral body axis. The goal of this experiment is to identify gene expression levels that differ between zebrafish with the swirl mutation and wild-type zebrafish. This data set (Swirl) includes 8448 gene expression levels from both types of the organism. The experimental design includes two sets of dye-swap pairs. For more details on the microarray technology used and the experimental design of this experiment, see Smyth (2004). The limma package performs both normalization and analysis of microarray data sets. See Section 11.1 of Smyth et al. (2008) for a complete analysis of the Swirl data set, along with the R code for this analysis.

After normalizing the data, the limma package is applied to test if genes are differentially expressed between groups, in this case genotypes (Smyth et al., 2008). This is done by determining if the log2 ratios of mean gene expressions between swirl mutant zebrafish and wild-type zebrafish are significantly different from zero or not. We then apply the qvalue function (in the qvalue package) to the vector of p-values obtained from the analysis, and π0 is estimated to be 0.63.
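A minimal sketch of this π0 estimation step, assuming fit is the MArrayLM object produced by lmFit() and eBayes() in the Swirl analysis of Smyth et al. (2008):

> library(qvalue)
> p <- as.vector(fit$p.value)   # moderated t-test p-values from eBayes
> qvalue(p)$pi0                 # estimated proportion of non-differentially expressed genes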


Now suppose we are interested in performing a similar experiment and want to find an appropriate sample size to use. We want to be able to detect a two-fold change in expression levels between the two genotypes, corresponding to a difference in log2 expressions of one. We find both up-regulated and down-regulated genes to be interesting, thus our tests will be two-sided. We also want to obtain 80% power while controlling FDR at 5%. We will use the estimated value of π0 (0.63) from the available data, in addition to other values for π0, to estimate sample size.

The ssize.oneSamp function: To illustrate the use of the ssize.oneSamp function, we need to assign a common value to represent the standard deviations of all genes, so we choose the 90th percentile of the sample standard deviations from the swirl experiment. This value is 0.44. It is important to note that power calculations will be overestimated for genes with higher standard deviation than this. Similarly, power will be conservative for genes with smaller standard deviation than 0.44. To calculate appropriate sample sizes using ssize.oneSamp, the following code is used.

> os <- ssize.oneSamp(delta=1, sigma=0.44,
+    fdr=0.05, power=0.8,
+    pi0=c(0.5, 0.63, 0.75, 0.9), maxN=10)

This code produces the graph shown in Figure 1.

Figure 1: Sample size vs. power for one-sample two-sided t-test with effect size of one, standard deviation of 0.44, and various proportions of non-differentially expressed genes with FDR controlled at 5%.

We see that appropriate sample sizes are 4, 5, 6, and 7 for the above experiment at π0 values of 0.50, 0.63, 0.75, and 0.90. Notice that although a sample size of five was calculated to be adequate for π0 = 0.63, we are performing a dye-swap experiment, so it is better to have an even number of slides. Because of this, we recommend six slides for a dye-swap experiment with two pairs. Sample size information, along with the calculated powers and critical values, can be obtained using the following code.

> os$ssize
> os$power
> os$crit.vals

The ssize.oneSampVary function: Now suppose we are interested in representing the standard deviations of the log ratios more precisely. This can be done using a Bayesian approach. Smyth (2004) models the variances of normalized expression levels as

    1/σ_g² ∼ (1 / (d0 s0²)) χ²_{d0}   (2)

where d0 and s0² are hyperparameters that can be estimated based on observed residual variances. Performing a simple transformation, we see that (2) is equivalent to

    σ_g² ∼ Inverse Gamma(d0/2, d0 s0²/2)   (3)

where the expected value of σ_g² is d0 s0²/(d0 − 2). Using the lmFit and eBayes functions in the limma package, d0 and s0² can be calculated from any microarray data set. For the Swirl data set, these values are calculated to be d0 = 4.17 and s0² = 0.051 (Smyth, 2004). From (3), we see that the variances of the log ratios follow an Inverse Gamma distribution with shape parameter 2.09 and scale parameter 0.106. From this information, we find appropriate sample sizes for a future experiment using the ssize.oneSampVary function as follows.

> osv <- ssize.oneSampVary(deltaMean=1,
+    deltaSE=0, a=2.09, b=0.106, fdr=0.05,
+    power=0.8, pi0=c(0.5, 0.63, 0.75, 0.9),
+    maxN=15)

Figure 2 shows the resulting plot of average power versus sample size for each proportion of non-differentially expressed genes that was input. Note that we are simply interested in finding genes that have a two-fold change in expression levels, so we want the effect size to be held constant at one. This is done by setting the mean of the effect sizes (deltaMean) at the desired value with standard deviation (deltaSE) equal to zero.

From Figure 2, we see that the appropriate sample sizes are 3, 4, 4, and 5, respectively, to achieve 80% average power with a FDR of 5%. These values are smaller than those in the previous example. In order to obtain appropriate sample size information more directly, along with the calculated powers and critical values, use the following code.

> osv$ssize
> osv$power
> osv$crit.vals
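The Inverse Gamma parameters used above follow directly from the limma hyperparameters via (3); a small sketch, assuming fit is the MArrayLM object returned by eBayes for the Swirl data:

> d0  <- fit$df.prior    # prior degrees of freedom, 4.17 for the Swirl data
> s02 <- fit$s2.prior    # prior variance, 0.051 for the Swirl data
> a <- d0 / 2            # shape parameter: 4.17 / 2 ~ 2.09
> b <- d0 * s02 / 2      # scale parameter: 4.17 * 0.051 / 2 ~ 0.106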


Figure 2: Sample size vs. power for one-sample two-sided t-test with effect size of one and various proportions of non-differentially expressed genes with FDR controlled at 5%. Variances are assumed to follow an Inverse Gamma(2.09, 0.106) distribution.

Using ssize.twoSamp and ssize.twoSampVary

In order to illustrate the uses of the ssize.twoSamp and ssize.twoSampVary functions, another data example from Smyth (2004) will be used. The ApoAI gene plays an important role in high density lipoprotein (HDL) metabolism, as mice with this gene knocked out have very low HDL levels. The goal of this study is to determine how the absence of the ApoAI gene affects other genes in the liver. In order to do this, gene expression levels of ApoAI knockout mice and control mice were compared using a common reference design. See Smyth (2004) for a more in-depth description of the study. This is an example of a two-sample microarray experiment. For a full analysis of this data set using the limma package, see Section 11.4 of Smyth et al. (2008).

Similar to the previous example, the limma package is applied to analyze the data. From the resulting p-values, we use the qvalue function to estimate the value of π0 as 0.70.

The ssize.twoSamp function: To illustrate the use of the ssize.twoSamp function, we can choose a common value to represent the standard deviation for all genes. Similar to the ssize.oneSamp example, we can use the 90th percentile of the gene residual standard deviations. In this case, this value is 0.42. Again assume we are interested in genes showing a two-fold change, but this time we are only concerned with up-regulated genes, so we will be performing a one-sided upper tail t-test for each gene. We also want to achieve 80% power while controlling FDR at 5%. The ssize.twoSamp function can be used to calculate an appropriate sample size for a similar experiment with the following code. This code also produces Figure 3.

> ts <- ssize.twoSamp(delta=1, sigma=0.42,
+    fdr=0.05, power=0.8,
+    pi0=c(0.6, 0.7, 0.8, 0.9),
+    side="upper", maxN=15)

Figure 3: Sample size vs. power for two-sample one-sided upper t-test with effect size of one, standard deviation of 0.42, and various proportions of non-differentially expressed genes with FDR controlled at 5%.

From Figure 3, we see that appropriate sample sizes are 4, 5, 6, and 7 for each genotype group for π0 values of 0.6, 0.7, 0.8, and 0.9, respectively. Calculated critical values and power, along with the sample size estimates, can be obtained with the following code.

> ts$ssize
> ts$power
> ts$crit.vals

The ssize.twoSampVary function: As in the ssize.oneSampVary example, the limma package calculates d0 to be 3.88 and s0² to be 0.05 as in (2) and (3) using the ApoAI data. From (3), it follows that the variances of the gene expression levels follow an Inverse Gamma distribution with shape parameter 1.94 and scale parameter 0.10. With these values, we can use the ssize.twoSampVary function to calculate appropriate sample sizes for experiments for detecting a fold change of one in up-regulated genes. We also want our results to have at least 80% average power with FDR controlled at 5%.


> tsv <- ssize.twoSampVary(deltaMean=1,
+    deltaSE=0, a=1.94, b=0.10,
+    fdr=0.05, power=0.8,
+    pi0=c(0.6, 0.7, 0.8, 0.9),
+    side="upper", maxN=15)

Figure 4: Sample size vs. power for two-sample one-sided upper t-test with effect size of one, gene variances following an Inverse Gamma(1.94, 0.10) distribution, and various proportions of non-differentially expressed genes with FDR controlled at 5%.

Figure 4 shows that appropriate sample sizes have decreased from the previous example (3, 3, 4, and 4 for π0 values of 0.6, 0.7, 0.8, 0.9). To get this information along with the power and critical value estimates, the following code can be used.

> tsv$ssize
> tsv$power
> tsv$crit.vals

Using ssize.F and ssize.Fvary

Lastly, we will use data from an experiment with a 2x2 factorial design to illustrate the uses of the ssize.F and ssize.Fvary functions. The data in this case are gene expression levels from MCF7 breast cancer cells obtained from Affymetrix HGV5av2 microarrays. The two treatment factors are estrogen (present/absent) and exposure length (10 hours/48 hours), and the goal of this experiment is to identify genes that respond to an estrogen treatment and classify these genes as early or late responders. Data for this experiment can be found in the estrogen package available at http://www.bioconductor.org. For a full analysis of this data set using the limma package and more details of the experiment, see Section 11.4 of Smyth et al. (2008).

Because there are two factors with two levels each, there are a total of four treatments, and we define their means of normalized log2 expression values for a given gene g in the table below. In this table, µ represents the mean of normalized gene expression levels for cells with no estrogen at 10 hours, τ represents the effect of time in the absence of estrogen, α represents the change for cells with estrogen at 10 hours, and γ represents the change for cells with estrogen at 48 hours (Smyth et al., 2008).

    Treatment                  Mean
    no estrogen, 10 hours      µ
    estrogen, 10 hours         µ + α
    no estrogen, 48 hours      µ + τ
    estrogen, 48 hours         µ + τ + γ

One design matrix that corresponds to this experiment and the means in the table above is

    X = [ 1 0 0 0
          1 0 1 0
          1 1 0 0
          1 1 0 1 ]

with parameter vector β = [µ, τ, α, γ]. Notice that for this design, there are four parameters to estimate and four slides for each sample, thus the degrees of freedom are 4n − 4. Because we are only interested in the effect of estrogen and how this changes over time, we only care about the parameters α and γ. Thus, we let L be the matrix of the linear contrasts of interest, or

    L = [ 0 0
          0 0
          1 0
          0 1 ]

so that L'β = [α, γ].

In order to calculate sample sizes for a similar experiment that will result in tests with 80% power and FDR controlled at 5%, we must have a reasonable estimate for π0. Using the lmFit and eBayes functions of the limma package in conjunction with qvalue, we obtain an estimated value for π0 of 0.944. Similar to the other examples, we will include this value along with other similar values of π0 to see how power varies.

The ssize.F function: As in all of the previous examples, we find the 90th percentile of the sample residual standard deviations to be 0.29 (or we can find any other percentile that is preferred). Also, let the true parameter vector be β = [12, 1, 1, 0.5]. Note that the values of the µ and τ parameters do not affect any sample size calculations because we are only interested in α and γ, as represented by L'β. Thus, the following code will calculate the sample size required to achieve 80% power while controlling the FDR at 5%.

> X <- matrix(c(1,0,0,0,1,0,1,0,1,1,0,0,1,1,0,1),
+    nrow=4, byrow=TRUE)
> L <- matrix(c(0,0,0,0,1,0,0,1),
+    nrow=4, byrow=TRUE)
> B <- c(12, 1, 1, 0.5)
> dn <- function(n){4*n - 4}
> fs <- ssize.F(X=X, beta=B, L=L, dn=dn,
+    sigma=0.29, pi0=c(0.9, 0.944, 0.975, 0.995),
+    maxN=12)
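As a quick check, not part of the original listing, one can verify that the contrast matrix L really extracts the two parameters of interest from β:

> t(L) %*% B    # alpha = 1, gamma = 0.5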

Figure 5: Sample size vs. power for F-test with parameter vector β = [12, 1, 1, 0.5], common gene expression standard deviation of 0.29, and various proportions of non-differentially expressed genes with FDR controlled at 5%.

From Figure 5, we see that appropriate sample sizes for this experiment are 4, 5, 5, and 6, for π0 values of 0.9, 0.944, 0.975, and 0.995, respectively. It is important to note that in this experiment, one sample includes a set of four gene chips (one for each treatment). So a sample size of 4 would require a total of 16 chips, with 4 for each of the four treatment groups. For other useful information, use the following code.

> fs$ssize
> fs$power
> fs$crit.vals

The ssize.Fvary function: Using the limma package, d0 and s0² are calculated as 4.48 and 0.022, where d0 and s0² are as in (2). As shown in (3), this corresponds to Inverse Gamma parameters for σ_g² of 2.24 and 0.049. Using the same parameter vector as in the previous example, sample sizes for future experiments can be calculated with the following commands, where we wish to control the FDR at 5% and achieve an average test power of 80%.

> fsv <- ssize.Fvary(X=X, beta=B, L=L, Bibliography dn=dn, a=2.24, b=0.049, pi0=c(0.9, 0.944, 0.975, 0.995), P. Liu and J. T. G. Hwang. Quick calculation for sam- maxN=12)} ple size while controlling false discovery rate with


Bibliography

P. Liu and J. T. G. Hwang. Quick calculation for sample size while controlling false discovery rate with application to microarray analysis. Bioinformatics, 23(6):739–746, 2007.

G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1):1–4, 2004.

G. K. Smyth, M. Ritchie, N. Thorne, and J. Wettenhall. limma: Linear models for microarray data user's guide, 2008. URL http://bioconductor.org/packages/2.3/bioc/vignettes/limma/inst/doc/usersguide.pdf.

J. D. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA, 100(16):9440–9445, 2003.

Megan Orr
Department of Statistics & Statistical Laboratory
Iowa State University, USA
[email protected]

Peng Liu
Department of Statistics & Statistical Laboratory
Iowa State University, USA
[email protected]


Easier Parallel Computing in R with snowfall and sfCluster

by Jochen Knaus, Christine Porzelius, Harald Binder and Guido Schwarzer

Many statistical analysis tasks in areas such as bioinformatics are computationally very intensive, while lots of them rely on embarrassingly parallel computations (Grama et al., 2003). Multiple computers or even multiple processor cores on standard desktop computers, which are widespread nowadays, can easily contribute to faster analyses.

R itself does not allow parallel execution. There are some existing solutions for R to distribute calculations over many computers — a cluster — for example Rmpi, rpvm, snow, nws or papply. However, these solutions require the user to set up and manage the cluster on his own, and therefore deeper knowledge about cluster computing itself is needed. From our experience this is a barrier for lots of R users, who basically use it as a tool for statistical computing.

Parallel computing has several pitfalls. A single program can, when misused, easily affect a complete computing infrastructure, for example by allocating too many CPUs or too much RAM, leaving no resources for other users and their processes, or by degrading the performance of one or more individual machines. Another problem is the difficulty of keeping track of what is going on in the cluster, which is sometimes hard if the program fails for some reason on a slave.

We developed the management tool sfCluster and the corresponding R package snowfall, which are designed to make parallel programming easier and more flexible. sfCluster completely hides the setup and handling of clusters from the user and monitors the execution of all parallel programs for problems affecting machines and the cluster. Together with snowfall it allows the use of parallel computing in R without further knowledge of cluster implementation and configuration.

snowfall

The R package snowfall is built as an extended abstraction layer above the well established snow package by L. Tierney, A. J. Rossini, N. Li and H. Sevcikova (Rossini et al., 2007). Note this is not a technical layer, but an enhancement in usability. snowfall can use all networking types implemented in snow, which are socket, MPI, PVM and NetWorkSpaces.

snowfall is also usable on its own (without sfCluster), which makes it handy on single multicore machines, for development, or for distribution inside packages.

The design goal was to make parallel computing accessible to R programmers without further general computing knowledge. The Application Programming Interface (API) of snowfall is similar to snow, and indeed snow functions can be called directly, so porting of existing snow programs is very simple.

The main snowfall features are as follows:

• All cluster functions and snow wrappers include extended error handling with stopping on error, which makes it easier to find improper behaviour for users without deeper knowledge of clusters.

• In addition to snow functions, there are several functions for common tasks in parallel computing. For example, functions for loading packages and sources in the cluster and exchanging variables between cluster nodes are present, as well as helpers for saving intermediate results during calculation.

• Cluster settings can be controlled with command line arguments. Users do not have to change their R scripts to switch between sequential or parallel execution, or to change the number of cores or type of clusters used.

• Connector to sfCluster: Used with sfCluster, configuration does not have to be done on initialisation as all values are taken from the user-given sfCluster settings.

• All functions work in sequential execution, too, i.e. without a cluster. This is useful for development and distribution of packages using snowfall. Switching between sequential and parallel execution does not require code changes inside the parallel program (and can be changed on initialisation).

Like snow, snowfall basically uses list functions for parallelisation: calculations are distributed on slaves for different list elements. This is directly applicable to any data-parallel problem; for example, bootstrapping or cross-validation can be represented with this approach (other types of parallelisation can also be used, but probably with more effort).

Generally, a cluster constitutes single machines, called nodes, which are chosen out of a set of all machines usable as nodes, called the universe. The calculation is started on a master node, which spawns worker R processes (sometimes also called slaves). A CPU is a single calculation unit, of which modern computers may have more than one.
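As a minimal illustration of this list-based approach (our sketch, not from the article), the following squares the numbers 1 to 10 on a two-worker socket cluster; each list element is processed on whichever worker it is assigned to.

library(snowfall)

sfInit(parallel = TRUE, cpus = 2, type = "SOCK")  # start two local workers

squares <- sfLapply(1:10, function(x) x^2)        # parallel counterpart of lapply

sfStop()                                          # shut the workers down again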


library(snowfall)
# 1. Initialisation of snowfall.
# (if used with sfCluster, just call sfInit())
sfInit(parallel=TRUE, cpus=4, type="SOCK")

# 2. Loading data.
require(mvna)
data(sir.adm)

# 3. Wrapper, which can be parallelised.
wrapper <- function(idx) {
  # Output progress in worker logfile
  cat( "Current index: ", idx, "\n" )
  index <- sample(1:nrow(sir.adm), replace=TRUE)
  temp <- sir.adm[index, ]
  fit <- crr(temp$time, temp$status, temp$pneu)
  return(fit$coef)
}

# 4. Exporting needed data and loading required
# packages on workers.
sfExport("sir.adm")
sfLibrary(cmprsk)

# 5. Start network random number generator
# (as "sample" is using random numbers).
sfClusterSetupRNG()

# 6. Distribute calculation
result <- sfLapply(1:1000, wrapper)

# Result is always in list form.
mean(unlist(result))

# 7. Stop snowfall
sfStop()

Figure 1: Example bootstrap using snowfall.

For most basic programs, the parallelisation follows the workflow of the example in Figure 1.

1. Initialisation. Call sfInit with parameters if not using sfCluster, or without parameters if used with sfCluster. Parameters are used to switch between parallel or sequential execution (argument parallel, default FALSE) and the number of CPUs wanted (argument cpus with numerical value, in sequential mode always ‘1’, in parallel mode the default is ‘2’). Used with sfCluster, these parameters are taken from the sfCluster settings.

2. Load the data and prepare the data needed in the parallel calculations (for example generating data for a simulation study).

3. Wrap parallel code into a wrapper function (function wrapper in the example), callable by an R list function.

4. Export objects needed in the parallel calculation (e.g., sir.adm) to cluster nodes. It is also necessary to load required packages on all workers. Exporting objects can reduce the total amount of data transmitted. If there are only a few objects needed in the parallel function, you can export them implicitly using additional arguments in the wrapper function and specifying them in the parallel call (e.g. sfLapply); a sketch of this variant is shown below, after the footnotes.

5. Optional step: start a network random number generator. Such random number generators ensure that nodes produce independent sequences of random numbers. The sequences, and hence results relying on random numbers, are reproducible provided that the same number of workers process the same sequence of tasks.

6. Distribute the calculation to the cluster by using a parallel list function (sfLapply in the example). This function distributes calls of wrapper to workers (which usually means index 1 to index n is called on CPU 1 to CPU n respectively; these calls are then executed in parallel. If the list is longer than the number of CPUs, index n + 1 is scheduled on CPU 1 again¹). That also means all used data inside the wrapper function must be exported first, to have them existing on any node (see point 2).

7. Stop the cluster via sfStop() (if used with sfCluster this does not stop the cluster itself, but allows reinitialisation with sfInit).

Probably the most irritating thing in the example is the export of the data frame. On the provided type of cluster computing, the source program runs only on the master first of all. Local variables and objects remain on the master, so workers do not automatically have access to them. All data, functions and packages needed for the parallel calculation have to be transferred to the workers’ processes first. Export means objects are instantiated on the slaves. Unlike snow’s export function, local variables can be exported, too. As a further addition, all variables can be exported or removed from the worker processes.²

Basic networking parameters (like execution mode and the number of CPUs on each machine) can be set on the command line, as seen in Figure 2. Arguments provided in sfInit() will override the command line arguments. For example, a script that always uses MPI clusters might include sfInit(type="MPI"). This mechanism can be used in R scripts using snowfall, as a connector to sfCluster, or as a binding to other workload and management tools.

¹ If calls to the wrapper function take different times, as with search problems, or you have computers with different speeds, most likely you will want to use a load balanced version, like sfClusterApplyLB, which dynamically re-schedules calls of wrapper to CPUs which have finished their previous job.

² There are more elaborate ways to integrate data transfer, e.g. NetWorkSpaces, but from our experience, snowfall’s exporting functions are enough for most common needs.
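As announced in step 4 above, here is a hedged sketch, not part of the original example, of the implicit-export variant together with the load-balanced scheduling from footnote 1. The bootstrap wrapper from Figure 1 receives the data frame as an argument instead of relying on sfExport(); sfInit() and sfLibrary(cmprsk) are assumed to have been called as in Figure 1.

# Variant of the Figure 1 wrapper: the data are passed as an argument,
# so no explicit sfExport("sir.adm") is needed (the data frame travels
# with each call instead).
wrapper2 <- function(idx, dat) {
  index <- sample(1:nrow(dat), replace = TRUE)
  temp  <- dat[index, ]
  fit   <- crr(temp$time, temp$status, temp$pneu)
  fit$coef
}

# Load-balanced apply: workers that finish early immediately receive
# the next index instead of waiting for the whole round to complete.
result <- sfClusterApplyLB(1:1000, wrapper2, dat = sir.adm)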


# Start a socket cluster on local machine using 3 processors
R CMD BATCH myParPrg.R --args --parallel --cpus=3

# Start a socket cluster with 5 cores (3 on localhost, 2 on machine "other")
R --args --parallel --hosts=localhost:3,other:2 < myParPrg.R

# Start using MPI with 5 cores on R interactive shell.
R --args --parallel --type=MPI --cpus=5

Figure 2: Examples for snowfall configuration using the command line.

In all current parallel computing solutions, intermediate results are lost if the cluster dies, perhaps due to a shutdown or crash of one of the used machines. snowfall offers a function which saves all available parts of results to disc and reloads them on a restored run. It does not save each finished part separately, but one block of results per round of CPUs (for example, working on a list with 100 segments on a 5-CPU cluster, 20 result blocks are saved). This function cannot prevent a loss of results, but can greatly save time on long running clusters. As a side effect this can also be used to realise a kind of “dynamic” resource allocation: just stop and restart, with results restored, on a differently sized cluster.

Note that the state of the RNG is not saved in the current version of snowfall. Users wishing to use random numbers need to implement customized save and restore functions, or use pre-calculated random numbers.
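The article does not name this save-and-restore function; in the snowfall versions we are aware of it is provided by sfClusterApplySR, so treat the exact name and arguments below as an assumption rather than as the authors' prescription. Reusing the wrapper from Figure 1:

# Assumed API (check ?sfClusterApplySR in your snowfall version):
# intermediate result blocks are written to disk during the run and are
# reloaded if the same call is executed again after a crash or shutdown.
result <- sfClusterApplySR(1:1000, wrapper)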
nect the machine pools from two institutes to one Basically, a cluster is defined by two resources: universe, where some scientists can use all machines, CPU and memory. The monitoring of memory usage and some only the machines from their institute. is very important, as a machine is practically unus- sfCluster features three main parallel execution able for high performance purposes if it is running modes. All of these setup and start a cluster for the out of physical memory and starts to swap memory program as described, run R and once the program on the hard disk. sfCluster is able to probe memory has finished, shutdown the cluster. The batch- and usage of a program automatically (by running it in monitoring mode shuts down the cluster on inter- sequential mode for a certain time) or to set the up- ruption from the user (using keys Ctrl-C on most per bound to a user-given value. Unix systems).


Figure 3: Workflow of sfCluster.

1. Parallel batchmode: Parallel counterpart to R CMD BATCH. Called using sfCluster -b or by the optionally installed R CMD par.

2. Interactive R shell (sfCluster -i or R CMD parint).

3. Monitoring mode (sfCluster -m or R CMD parmon): Behaves similarly to batchmode, but features a process monitor for workers, access to worker logfiles, system output, and debugging and runtime messages. The visualisation in the terminal is done using Ncurses. For an example output see Figure 4: a cluster with 9 worker processes (marked SL) and one master process (marked MA) is running on five machines. Each machine has a tab with logfiles marked by the name of the node, e.g. knecht4. For each process, its process identification number (PID), the node it is running on, its memory usage and its state (like running, sleeping, dead etc.) are shown at the top. The System tab contains system messages from sfCluster; the R output on the master is shown in tab R-Master.

Besides the parallel execution modes, the sequential mode can be chosen, which works in all cases, even without an installed or running cluster environment, i.e. also on Windows systems or as part of a package build on snowfall. On execution in sequential mode, snowfall is forced to run in non-parallel mode.

Besides choosing the number of required CPUs, users can specify several options on starting sfCluster. These include, for example, the R version, the nice level of the slaves, activating the sending of emails after finish or failure, and many more.

Administration

sfCluster includes features to get information about currently running clusters and free resources on the cluster universe (see Figure 5 for an example). This can help users to get an overview of what is currently happening on the cluster machines and of the resources available for starting their own programs. Also, detailed information about clusters, with all process PIDs, memory use and runtime of single processes, can be printed, which is useful for administrators to determine which R process belongs to which cluster.

All these administration options are directly usable by the (administrative) root account as well, so administrators can safely kill clusters without wiping out all the slave processes manually.

The configuration of sfCluster itself is done via common Unix-style configuration files. Configuration includes system variables (like the definition of triggers for stopping events on observation), the user groups, R versions, and of course the cluster universe with resource limits.

The installation is available as tar.gz or as a Debian package; both are instantly usable for a single (probably multicore) machine and only need to be configured for “real” clusters. In normal cases the default installation should work out of the box. It needs some minor tweaking if R is not installed in the default way, e.g. if multiple R installations are available on the machines. All needed CRAN and CPAN packages are also installable through the installer.

Future additions will be a port to OpenMPI, integration with common batch and resource management systems (e.g. the Sun Grid Engine or slurp) and basic reports about cluster behaviour.


Figure 4: Monitoring mode (9 slaves on five machines and master).

Summary

Although many well-working solutions for parallel computing in R are available, they have the downside of forcing the user to manage the underlying clusters manually. sfCluster/snowfall solves this problem in a very flexible and comfortable way, enabling even inexperienced computer users to benefit from parallel programming (without having to learn cluster management and usage, or falling into pitfalls affecting other people's processes). snowfall makes sure the resulting R programs are runnable everywhere, even without a cluster.

The combination snowfall/sfCluster has been used daily in our institute for several months and has evolved with users' demands and wishes. There are packages that have been built integrating snowfall with optionally usable parallelisation techniques (e.g. the package peperr).

The software and further information are available at http://www.imbi.uni-freiburg.de/parallel.

Acknowledgments

This research was supported by the Deutsche Forschungsgemeinschaft (German Research Foundation) with FOR 534. Thanks to Arthur Allignol for his little bootstrapping example.

Bibliography

A. Grama, G. Karypis, V. Kumar, and A. Gupta. Introduction to Parallel Computing. Pearson Education, second edition, 2003.

G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. Technical report, 1994. http://www.lam-mpi.org/download/files/lam-papers.tar.gz.

A. Rossini, L. Tierney, and N. Li. Simple parallel statistical computing in R. Journal of Computational and Graphical Statistics, 16(2):399–420, 2007.

Jochen Knaus, Christine Porzelius, Harald Binder and Guido Schwarzer
Department of Medical Biometry and Statistics
University of Freiburg
Germany
[email protected]


jo@biom9:~$ sfCluster -o
SESSION          | STATE |  M | MASTER      #N   RUNTIME R-FILE / R-OUT
-----------------+-------+----+--------------------------------------------------
LrtpdV7T_R-2.8.1 | run   | MO | biom9.imbi   9   2:46:51 coxBst081223.R / coxBst081223.Rout
baYwQ0GB_R-2.5.1 | run   | IN | biom9.imbi   2   0:00:18 -undef- / -undef-

jo@biom9:~$ sfCluster -o --all
SESSION          | STATE | USR |  M | MASTER      #N   RUNTIME R-FILE / R-OUT
-----------------+-------+-----+----+--------------------------------------------
LrtpdV7T_R-2.8.1 | run   | jo  | MO | biom9.imbi   9   3:16:09 coxBst081223.R / coxBst081223.Rout
jlXUhxtP_R-2.5.1 | run   | jo  | IN | biom9.imbi   2   0:00:22 -undef- / -undef-
bSpNLNhd_R-2.7.2 | run   | cp  | BA | biom9.imbi   8   0:32:57 getPoints11.R / getPoints11.Rout
NPS5QHkK_R-2.7.2 | run   | cp  | MO | biom9.imbi  10   3:50:42 box2.R / box2.Rout

jo@biom9:~$ sfCluster --universe --mem=1G
Assumed memuse: 1024M (use '--mem' to change).

Node                        | Max-Load | CPUs |  RAM  | Free-Load | Free-RAM | FREE-TOTAL
----------------------------+----------+------+-------+-----------+----------+-----------
biom8.imbi.uni-freiburg.de  |        8 |    8 | 15.9G |         0 |     9.3G |          0
biom9.imbi.uni-freiburg.de  |        8 |    8 | 15.9G |         0 |    12.6G |          0
biom10.imbi.uni-freiburg.de |        8 |    8 | 15.9G |         0 |    14.0G |          0
biom12.imbi.uni-freiburg.de |        2 |    4 |  7.9G |         0 |     5.8G |          0
knecht5.fdm.uni-freiburg.de |        8 |    8 | 15.7G |         1 |     1.2G |          1
knecht4.fdm.uni-freiburg.de |        8 |    8 | 15.7G |         1 |     4.3G |          1
knecht3.fdm.uni-freiburg.de |        5 |    8 | 15.7G |         3 |    11.1G |          3
biom6.imbi.uni-freiburg.de  | no-sched |    4 |  7.9G |         - |        - |          -
biom7.imbi.uni-freiburg.de  |        2 |    4 |  7.9G |         1 |     2.1G |          1

Potential usable CPUs: 6

Figure 5: Overview of running clusters from the user (first call) and from all users (second call). The session column states a unique identifier as well as the used R version. “STATE” describes whether the cluster is currently running or dead. “USR” contains the user who started the cluster. Column “M” indicates the sfCluster running mode: MO/BA/IN for monitoring, batch and interactive. Column “#N” contains the number of CPUs used in this cluster. The third call gives an overview of the current usage of the whole universe, with free calculation time (“Free-Load”) and usable memory (“Free-RAM”).


PMML: An Open Standard for Sharing Models

by Alex Guazzelli, Michael Zeller, Wen-Ching Lin and Graham Williams

Introduction

The PMML package exports a variety of predictive and descriptive models from R to the Predictive Model Markup Language (Data Mining Group, 2008). PMML is an XML-based language and has become the de-facto standard to represent not only predictive and descriptive models, but also data pre- and post-processing. In so doing, it allows for the interchange of models among different tools and environments, mostly avoiding proprietary issues and incompatibilities.

The PMML package itself (Williams et al., 2009) was conceived at first as part of Togaware's data mining toolkit Rattle, the R Analytical Tool To Learn Easily (Williams, 2009). Although it can easily be accessed through Rattle's GUI, it has been separated from Rattle so that it can also be accessed directly in R.

In the next section, we describe PMML and its overall structure. This is followed by a description of the functionality supported by the PMML package and how this can be used in R. We then discuss the importance of working with a valid PMML file and finish by highlighting some of the debate surrounding the adoption of PMML by the data mining community at large.

A PMML primer

Developed by the Data Mining Group, an independent, vendor-led committee (http://www.dmg.org), PMML provides an open standard for representing data mining models. In this way, models can easily be shared between different applications, avoiding proprietary issues and incompatibilities. Not only can PMML represent a wide range of statistical techniques, but it can also be used to represent their input data as well as the data transformations necessary to transform these into meaningful features.

PMML has established itself as the lingua franca for the sharing of predictive analytics solutions between applications. This enables data mining scientists to use different statistical packages, including R, to build, visualize, and execute their solutions.

PMML is an XML-based language. Its current 3.2 version was released in 2007. Version 4.0, the next version, which is to be released in 2009, will expand the list of available techniques covered by the standard. For example, it will offer support for time series, as well as different ways to explain models, including statistical and graphical measures. Version 4.0 also expands on existing PMML elements and their capabilities to further represent the diverse world of predictive analytics.

PMML Structure

PMML follows a very intuitive structure to describe a data mining model. As depicted in Figure 1, it is composed of many elements which encapsulate different functionality as it relates to the input data, model, and outputs.

Figure 1: PMML overall structure described sequentially from top to bottom.

Sequentially, PMML can be described by the following elements:

Header

The header element contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as name and version. It also contains an attribute for a timestamp which can be used to specify the date of model creation.

Data dictionary

The data dictionary element contains definitions for all the possible fields used by the model. It is in the data dictionary that a field is defined as continuous, categorical, or ordinal. Depending on this definition, the appropriate value ranges are then defined as well as the data type (such as string or double). The data dictionary is also used for describing the list of valid, invalid, and missing values relating to every single input data.

Data transformations

Transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of data transformation:

• Normalization: Maps values to numbers; the input can be continuous or discrete.

• Discretization: Maps continuous values to discrete values.

• Value mapping: Maps discrete values to discrete values.

• Functions: Derive a value by applying a function to one or more parameters.

• Aggregation: Summarizes or collects groups of values.

The ability to represent data transformations (as well as outlier and missing value treatment methods) in conjunction with the parameters that define the models themselves is a key concept of PMML.

Model

The model element contains the definition of the data mining model. Models usually have a model name, function name (classification or regression) and technique-specific attributes.

The model representation begins with a mining schema and then continues with the actual representation of the model:

• Mining Schema: The mining schema (which is …)

  – Outlier Treatment: Defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.

  – Missing Value Replacement Policy: If this attribute is specified then a missing value is automatically replaced by the given values.

  – Missing Value Treatment: Indicates how the missing value replacement was derived (e.g. as value, mean or median).

• Targets: The targets element allows for the scaling of predicted variables.

• Model Specifics: Once we have the data schema in place we can specify the details of the actual model.

  A multi-layered feed-forward neural network, for example, is represented in PMML by its activation function and number of layers, followed by the description of all the network components, such as connectivity and connection weights.

  A decision tree is represented in PMML, recursively, by the nodes in the tree. For each node, a test on one variable is recorded and the sub-nodes then correspond to the result of the test. Each sub-node contains another test, or the final decision.

  Besides neural networks and decision trees, PMML allows for the representation of many other data mining models, including linear regression, generalised linear models, random forests and other ensembles, association rules, cluster models, naïve Bayes models, support vector machines, and more.

PMML also offers many other elements such as built-in functions, statistics, model composition and verification, etc.

Exporting PMML from R

The PMML package in R provides a gen…
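As a minimal sketch of the kind of export this last section introduces (our example, not the authors'), a fitted R model can be converted to PMML and written to disk as follows; the model, data set and file name are arbitrary choices.

library(pmml)   # provides the pmml() generic
library(XML)    # provides saveXML() for writing the XML document

# Fit any model type supported by the package, e.g. a linear model.
fit <- lm(Ozone ~ Wind + Temp, data = airquality)

# Build the PMML representation and save it to a file.
doc <- pmml(fit)
saveXML(doc, file = "ozone-lm.xml")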