The R Journal Volume 3/1, June 2011

A peer-reviewed, open-access publication of the R Foundation for Statistical Computing

Contents

Editorial

Contributed Research Articles
testthat: Get Started with Testing
Content-Based Social Network Analysis of Mailing Lists
Rmetrics - timeDate Package
The digitize Package: Extracting Numerical Data from Scatterplots
Differential Evolution with DEoptim
rworldmap: A New R package for Mapping Global Data
Cryptographic Boolean Functions with R
Raster Images in R Graphics
Probabilistic Weather Forecasting in R
Analyzing an Electronic Limit Order Book

Help Desk
Giving a useR! Talk
Tips for Presenting Your Work

Book Reviews
Forest Analytics with R (Use R)

News and Notes
Conference Review: Third Meeting of Polish R Users, The Influenza Challenge
Conference Review: Kickoff Workshop for Project MOSAIC
Changes in R
Changes on CRAN
News from the Bioconductor Project
R Foundation News

The R Journal is a peer-reviewed publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors.

Prospective authors will find detailed and up-to-date submission instructions on the Journal’s homepage.

Editor-in-Chief:
Heather Turner
Pharmatherapeutics Statistics
Pfizer Global R & D (UK)
Ramsgate Road (ipc 292)
Sandwich, Kent CT13 9NJ
UK

Editorial Board: Peter Dalgaard, Martyn Plummer, and Hadley Wickham.

Editor Programmer’s Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Editor Book Reviews:
G. Jay Kerns
Department of Mathematics and Statistics
Youngstown State University
Youngstown, Ohio 44555-0002
USA
[email protected]

R Journal Homepage: http://journal.r-project.org/

Email of editors and editorial board: [email protected]

The R Journal is indexed/abstracted by EBSCO and DOAJ.


Editorial

by Heather Turner

This issue comes approximately one month after CRAN passed a new milestone: the number of packages in the repository had reached 3000. With such a large number of packages (and the number is still rising, of course) it becomes increasingly difficult for package authors to get their work known and for R users to find packages that will be useful to them. The R Journal plays an important role in this context: it provides a platform for developers to introduce their packages in a user-friendly form, and although the presented packages are not validated, publication in The R Journal gives the quality assurance that comes from thorough peer review.

Given that The R Journal only publishes around 20 contributed articles a year, however, there is more that needs to be done to communicate developments on CRAN. The editorial board had some discussion recently over whether to keep the traditional "Changes on CRAN" section. Although there are other ways to keep up-to-date with CRAN news, notably CRANberries (http://dirk.eddelbuettel.com/cranberries/) and crantastic (http://crantastic.org/), some people still appreciate an occasional round-up that they can browse through. So this section stays for now, but we are open to suggested improvements.

Another important way to keep abreast of developments in the R community is via conferences and workshops. As program chair for useR! 2011, I have been pulling together the program for this conference at the same time as preparing this issue. At this year's useR! there will be around 150 contributed talks and more than 30 contributed posters from across the R community. Looking forward to this conference, we invited Rob Hyndman and Di Cook to share their presentation tips in the Help Desk column of this issue. Of course, useRs present their work at many and varied conferences, and this is reflected in our News section, where we have a report of an experimental format used for the Polish R Users' Conference and a report of workshops run by Project MOSAIC, a statistical education initiative in which R features prominently.

Although Vince Carey officially stepped down from the editorial board at the end of last year, this issue features several articles that he saw through to publication, so we thank him once again for all his work on the Journal. We welcome Hadley Wickham on board, an enthusiastic contributor to the R community, as evidenced not only by his own article in this issue, but by the social network analysis of R mailing lists presented by Angela Bohn and colleagues. These two articles are just the start of a fine collection of contributed articles, and we also have a new book review to report.

I hope this issue goes some way to helping useRs navigate their way through the R world!

Heather Turner
PharmaTherapeutics Statistics, Pfizer Ltd, UK
[email protected]


testthat: Get Started with Testing

by Hadley Wickham

Abstract: Software testing is important, but many of us don't do it because it is frustrating and boring. testthat is a new testing framework for R that is easy to learn and use, and integrates with your existing workflow. This paper shows how, with illustrations from existing packages.

Introduction

Testing should be something that you do all the time, but it's normally painful and boring. testthat (Wickham, 2011) tries to make testing as painless as possible, so you do it as often as possible. To make that happen, testthat:

- Provides functions that make it easy to describe what you expect a function to do, including catching errors, warnings and messages.
- Easily integrates in your existing workflow, whether it's informal testing on the command line, building test suites, or using 'R CMD check'.
- Can re-run tests automatically as you change your code or tests.
- Displays test progress visually, showing a pass, fail or error for every expectation. If you're using the terminal, it'll even colour the output.

testthat draws inspiration from the xUnit family of testing packages, as well as from many of the innovative Ruby testing libraries like rspec (http://rspec.info/), testy (http://github.com/ahoward/testy), bacon (http://github.com/chneukirchen/bacon) and cucumber (http://wiki.github.com/aslakhellesoy/cucumber/). I have used what I think works for R, and abandoned what doesn't, creating a testing environment that is philosophically centred in R.

Why test?

I wrote testthat because I discovered I was spending too much time recreating bugs that I had previously fixed. While I was writing the original code or fixing the bug, I'd perform many interactive tests to make sure the code worked, but I never had a system for retaining these tests and running them, again and again. I think this is a common development practice of R programmers: it's not that we don't test our code, it's that we don't store our tests so they can be re-run automatically.

In part, this is because existing R testing packages, such as RUnit (Burger et al., 2009) and svUnit (Grosjean, 2009), require a lot of up-front work to get started. One of the motivations of testthat is to make the initial effort as small as possible, so you can start off slowly and gradually ramp up the formality and rigour of your tests.

It will always require a little more work to turn your casual interactive tests into reproducible scripts: you can no longer visually inspect the output, so instead you have to write code that does the inspection for you. However, this is an investment in the future of your code that will pay off in:

- Decreased frustration. Whenever I'm working to a strict deadline I always seem to discover a bug in old code. Having to stop what I'm doing to fix the bug is a real pain. This happens less when I do more testing, and I can easily see which parts of my code I can be confident in by looking at how well they are tested.

- Better code structure. Code that's easy to test is usually better designed. I have found writing tests makes me extract out the complicated parts of my code into separate functions that work in isolation. These functions are easier to test, have less duplication, are easier to understand and are easier to re-combine in new ways.

- Less struggle to pick up development after a break. If you always finish a session of coding by creating a failing test (e.g. for the feature you want to implement next), it's easy to pick up where you left off: your tests let you know what to do next.

- Increased confidence when making changes. If you know that all major functionality has a test associated with it, you can confidently make big changes without worrying about accidentally breaking something. For me, this is particularly useful when I think of a simpler way to accomplish a task - often my simpler solution is only simpler because I've forgotten an important use case!

Test structure

testthat has a hierarchical structure made up of expectations, tests and contexts.
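As a preview of these three levels, a minimal test file might look like the following sketch (all of the functions used here are introduced in the sections below):

context("Arithmetic")

test_that("addition works", {
  expect_that(1 + 1, equals(2))
})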


- An expectation describes what the result of a computation should be. Does it have the right value and right class? Does it produce error messages when you expect it to? There are 11 types of built-in expectations.

- A test groups together multiple expectations to test one function, or tightly related functionality across multiple functions. A test is created with the test_that function.

- A context groups together multiple tests that test related functionality.

These are described in detail below. Expectations give you the tools to convert your visual, interactive experiments into reproducible scripts; tests and contexts are just ways of organising your expectations so that when something goes wrong you can easily track down the source of the problem.

Expectations

An expectation is the finest level of testing; it makes a binary assertion about whether or not a value is as you expect. An expectation is easy to read, since it is nearly a sentence already: expect_that(a, equals(b)) reads as "I expect that a will equal b". If the expectation isn't true, testthat will raise an error.

There are 11 built-in expectations:

- equals() uses all.equal() to check for equality with numerical tolerance:

  # Passes
  expect_that(10, equals(10))
  # Also passes
  expect_that(10, equals(10 + 1e-7))
  # Fails
  expect_that(10, equals(10 + 1e-6))
  # Definitely fails!
  expect_that(10, equals(11))

- is_identical_to() uses identical() to check for exact equality:

  # Passes
  expect_that(10, is_identical_to(10))
  # Fails
  expect_that(10, is_identical_to(10 + 1e-10))

- is_equivalent_to() is a more relaxed version of equals() that ignores attributes:

  # Fails
  expect_that(c("one" = 1, "two" = 2), equals(1:2))
  # Passes
  expect_that(c("one" = 1, "two" = 2), is_equivalent_to(1:2))

- is_a() checks that an object inherit()s from a specified class:

  model <- lm(mpg ~ wt, data = mtcars)
  # Passes
  expect_that(model, is_a("lm"))
  # Fails
  expect_that(model, is_a("glm"))

- matches() matches a character vector against a regular expression. The optional all argument controls whether all elements or just one element needs to match. This code is powered by str_detect() from the stringr (Wickham, 2010) package:

  string <- "Testing is fun!"
  # Passes
  expect_that(string, matches("Testing"))
  # Fails, match is case-sensitive
  expect_that(string, matches("testing"))
  # Passes, match can be a regular expression
  expect_that(string, matches("t.+ting"))

- prints_text() matches the printed output from an expression against a regular expression:

  a <- list(1:10, letters)
  # Passes
  expect_that(str(a), prints_text("List of 2"))
  # Passes
  expect_that(str(a), prints_text(fixed("int [1:10]")))

- shows_message() checks that an expression shows a message:

  # Passes
  expect_that(library(mgcv), shows_message("This is mgcv"))

- gives_warning() expects that you get a warning:

  # Passes
  expect_that(log(-1), gives_warning())
  expect_that(log(-1), gives_warning("NaNs produced"))
  # Fails
  expect_that(log(0), gives_warning())

- throws_error() verifies that the expression throws an error. You can also supply a regular expression which is applied to the text of the error:

  # Fails
  expect_that(1 / 2, throws_error())
  # Passes
  expect_that(1 / "a", throws_error())
  # But better to be explicit
  expect_that(1 / "a", throws_error("non-numeric argument"))


Full                                   Short cut
expect_that(x, is_true())              expect_true(x)
expect_that(x, is_false())             expect_false(x)
expect_that(x, is_a(y))                expect_is(x, y)
expect_that(x, equals(y))              expect_equal(x, y)
expect_that(x, is_equivalent_to(y))    expect_equivalent(x, y)
expect_that(x, is_identical_to(y))     expect_identical(x, y)
expect_that(x, matches(y))             expect_matches(x, y)
expect_that(x, prints_text(y))         expect_output(x, y)
expect_that(x, shows_message(y))       expect_message(x, y)
expect_that(x, gives_warning(y))       expect_warning(x, y)
expect_that(x, throws_error(y))        expect_error(x, y)

Table 1: Expectation shortcuts

- is_true() is a useful catchall if none of the other expectations do what you want - it checks that an expression is true. is_false() is the complement of is_true().

If you don't like the readable, but verbose, expect_that style, you can use one of the shortcut functions described in Table 1.
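For example, the following two expectations from Table 1 are interchangeable; both pass, because equals() and expect_equal() compare with numerical tolerance:

expect_that(sqrt(2)^2, equals(2))
expect_equal(sqrt(2)^2, 2)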

You can also write your own expectations. An expectation should return a function that compares its input to the expected value and reports the result using expectation(). expectation() has two arguments: a boolean indicating the result of the test, and the message to display if the expectation fails. Your expectation function will be called by expect_that with a single argument: the actual value. The following code shows the simple is_true expectation. Most of the other expectations are equally simple, and if you want to write your own, I'd recommend reading the source code of testthat to see other examples.

is_true <- function() {
  function(x) {
    expectation(
      identical(x, TRUE),
      "isn't true"
    )
  }
}
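Following the same pattern, a custom expectation can be written in a few lines. The sketch below defines is_positive(), which is our own example and not part of testthat:

# A hypothetical custom expectation, built on expectation()
is_positive <- function() {
  function(x) {
    expectation(
      is.numeric(x) && all(x > 0),
      "isn't positive"
    )
  }
}

# Used like any built-in expectation:
expect_that(42, is_positive())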

Running a sequence of expectations is useful because it ensures that your code behaves as expected. You could even use an expectation within a function to check that the inputs are what you expect. However, they're not so useful when something goes wrong: all you know is that something is not as expected; you know nothing about where the problem is. Tests, described next, organise expectations into coherent blocks that describe the overall goal of that set of expectations.

Tests

Each test should test a single item of functionality and have an informative name. The idea is that when a test fails, you should know exactly where to look for the problem in your code. You create a new test with test_that, with parameters name and code block. The test name should complete the sentence "Test that ..." and the code block should be a collection of expectations. When there's a failure, it's the test name that will help you figure out what's gone wrong.

Figure 1 shows one test of the floor_date function from lubridate (Wickham and Grolemund, 2010). There are 7 expectations that check the results of rounding a date down to the nearest second, minute, hour, etc. Note how we've defined a couple of helper functions to make the test more concise so you can easily see what changes in each expectation.

Each test is run in its own environment so it is self-contained. The exceptions are actions which have effects outside the local environment. These include things that affect:

- The filesystem: creating and deleting files, changing the working directory, etc.
- The search path: package loading & detaching, attach.
- Global options, like options() and par().

When you use these actions in tests, you'll need to clean up after yourself. Many other testing packages have set-up and teardown methods that are run automatically before and after each test. These are not so important with testthat because you can create objects outside of the tests and rely on R's copy-on-modify semantics to keep them unchanged between test runs. To clean up other actions you can use regular R functions.
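For example, a test that touches the filesystem can tidy up with ordinary base R functions (a minimal sketch):

test_that("writing results to a file works", {
  path <- tempfile()
  writeLines("hello", path)
  expect_that(readLines(path), equals("hello"))
  unlink(path)  # clean up the file this test created
})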

You can run a set of tests just by source()ing a file, but as you write more and more tests, you'll probably want a little more infrastructure. The first part of that infrastructure is contexts, described below, which give a convenient way to label each file, helping to locate failures when you have many tests.

test_that("floor_date works for different units", {
  base <- as.POSIXct("2009-08-03 12:01:59.23", tz = "UTC")

  is_time <- function(x) equals(as.POSIXct(x, tz = "UTC"))
  floor_base <- function(unit) floor_date(base, unit)

  expect_that(floor_base("second"), is_time("2009-08-03 12:01:59"))
  expect_that(floor_base("minute"), is_time("2009-08-03 12:01:00"))
  expect_that(floor_base("hour"),   is_time("2009-08-03 12:00:00"))
  expect_that(floor_base("day"),    is_time("2009-08-03 00:00:00"))
  expect_that(floor_base("week"),   is_time("2009-08-02 00:00:00"))
  expect_that(floor_base("month"),  is_time("2009-08-01 00:00:00"))
  expect_that(floor_base("year"),   is_time("2009-01-01 00:00:00"))
})

Figure 1: A test case from the lubridate package.

Contexts

Contexts group tests together into blocks that test related functionality and are established with the code: context("My context"). Normally there is one context per file, but you can have more if you want, or you can use the same context in multiple files.

Figure 2 shows the context that tests the operation of the str_length function in stringr. The tests are very simple, but cover two situations where nchar() in base R gives surprising results.

Workflow

So far we've talked about running tests by source()ing in R files. This is useful to double-check that everything works, but it gives you little information about what went wrong. This section shows how to take your testing to the next level by setting up a more formal workflow. There are three basic techniques to use:

- Run all tests in a file or directory with test_file() or test_dir().
- Automatically run tests whenever something changes with autotest.
- Have R CMD check run your tests.

Testing files and directories

You can run all tests in a file with test_file(path). Figure 3 shows the difference between test_file and source for the tests in Figure 2, as well as those same tests for nchar. You can see the advantage of test_file over source: instead of seeing the first failure, you see the performance of all tests. Each expectation is displayed as either a green dot (indicating success) or a red number (indicating failure). That number indexes into a list of further details, printed after all tests have been run. What you can't see is that this display is dynamic: a new dot gets printed each time a test passes and it's rather satisfying to watch.

test_dir will run all of the test files in a directory, assuming that test files start with test (so it's possible to intermix regular code and tests in the same directory). This is handy if you're developing a small set of scripts rather than a complete package. The following shows the output from the stringr tests. You can see there are 12 contexts with between 2 and 25 expectations each. As you'd hope in a released package, all the tests pass.

> test_dir("inst/tests/")
String and pattern checks : ......
Detecting patterns : ......
Duplicating strings : ......
Extract patterns : ..
Joining strings : ......
String length : ......
Locations : ......
Matching groups : ......
Test padding : ....
Splitting strings : ......
Extracting substrings : ......
Trimming strings : ......

If you want a more minimal report, suitable for display on a dashboard, you can use a different reporter. testthat comes with three reporters: stop, minimal and summary. The stop reporter is the default and stop()s whenever a failure is encountered; the summary reporter is the default for test_file and test_dir. The minimal reporter prints '.' for success, 'E' for an error and 'F' for a failure. The following output shows (some of) the output from running the stringr test suite with the minimal reporter.

> test_dir("inst/tests/", "minimal")
......


context("String length") test_that("str_length is number of characters", { expect_that(str_length("a"), equals(1)) expect_that(str_length("ab"), equals(2)) expect_that(str_length("abc"), equals(3)) }) test_that("str_length of missing is missing", { expect_that(str_length(NA), equals(NA_integer_)) expect_that(str_length(c(NA, 1)), equals(c(NA, 1))) expect_that(str_length("NA"), equals(2)) }) test_that("str_length of factor is length of level", { expect_that(str_length(factor("a")), equals(1)) expect_that(str_length(factor("ab")), equals(2)) expect_that(str_length(factor("abc")), equals(3)) })

Figure 2: A complete context from the stringr package that tests the str_length function for computing string length.

> source("test-str_length.r") > test_file("test-str_length.r") ......

> source("test-nchar.r") Error: Test failure in 'nchar of missing is missing' * nchar(NA) not equal to NA_integer_ 'is.NA' value mismatch: 0 in current 1 in target * nchar(c(NA, 1)) not equal to c(NA, 1) 'is.NA' value mismatch: 0 in current 1 in target > test_file("test-nchar.r") ...12..34

1. Failure: nchar of missing is missing ---------------------------
nchar(NA) not equal to NA_integer_
'is.NA' value mismatch: 0 in current 1 in target

2. Failure: nchar of missing is missing ---------------------------
nchar(c(NA, 1)) not equal to c(NA, 1)
'is.NA' value mismatch: 0 in current 1 in target

3. Failure: nchar of factor is length of level --------------------
nchar(factor("ab")) not equal to 2
Mean relative difference: 0.5

4. Failure: nchar of factor is length of level --------------------
nchar(factor("abc")) not equal to 3
Mean relative difference: 0.6666667

Figure 3: Results from running the str_length context, as well as results from running a modified version that uses nchar. nchar gives the length of NA as 2, and converts factors to integers before calculating length. These tests ensure that str_length doesn't make the same mistakes.


Autotest

Tests are most useful when run frequently, and autotest takes that idea to the limit by re-running your tests whenever your code or tests change. autotest() has two arguments, code_path and test_path, which point to a directory of source code and tests respectively.

Once run, autotest() will continuously scan both directories for changes. If a test file is modified, it will test that file; if a code file is modified, it will reload that file and rerun all tests. To quit, you'll need to press Ctrl + Break on Windows, Escape in the Mac GUI, or Ctrl + C if running from the command line.

This promotes a workflow where the only way you test your code is through tests. Instead of modify-save-source-check you just modify and save, then watch the automated test output for problems.

R CMD check

If you are developing a package, you can have your tests automatically run by 'R CMD check'. I recommend storing your tests in inst/tests/ (so users also have access to them), then including one file in tests/ that runs all of the package tests. The test_package(package_name) function makes this easy (a minimal runner file is sketched below). It:

- Expects your tests to be in the inst/tests/ directory.
- Evaluates your tests in the package namespace (so you can test non-exported functions).
- Throws an error at the end if there are any test failures. This means you'll see the full report of test failures and 'R CMD check' won't pass unless all tests pass.

This setup has the additional advantage that users can make sure your package works correctly in their run-time environment.
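The runner file can be as small as the following sketch; the file name run-all.R and the package name "mypackage" are illustrative, and only the call to test_package() is essential:

# tests/run-all.R
library(testthat)
test_package("mypackage")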

Future work

There are two additional features I'd like to incorporate in future versions:

- Code coverage. It's very useful to be able to tell exactly what parts of your code have been tested. I'm not yet sure how to achieve this in R, but it might be possible with a combination of RProf and codetools (Tierney, 2009).

- Graphical display for auto_test. I find that the more visually appealing I make testing, the more fun it becomes. Coloured dots are pretty primitive, so I'd also like to provide a GUI widget that displays test output.

Bibliography

M. Burger, K. Juenemann, and T. Koenig. RUnit: R Unit test framework, 2009. URL http://CRAN.R-project.org/package=RUnit. R package version 0.4.22.

P. Grosjean. SciViews-R: A GUI API for R. UMH, Mons, Belgium, 2009. URL http://www.sciviews.org/SciViews-R.

L. Tierney. codetools: Code Analysis Tools for R, 2009. URL http://CRAN.R-project.org/package=codetools. R package version 0.2-2.

H. Wickham. stringr: Make it easier to work with strings, 2010. URL http://CRAN.R-project.org/package=stringr. R package version 0.4.

H. Wickham. testthat: Testthat code. Tools to make testing fun :), 2011. URL http://CRAN.R-project.org/package=testthat. R package version 0.5.

H. Wickham and G. Grolemund. lubridate: Make dealing with dates a little easier, 2010. URL http://www.jstatsoft.org/v40/i03/. R package version 0.1.

Hadley Wickham
Department of Statistics
Rice University
6100 Main St MS#138
Houston, TX 77005-1827
USA
[email protected]


Content-Based Social Network Analysis of Mailing Lists

by Angela Bohn, Ingo Feinerer, Kurt Hornik and Patrick Mair

Abstract: Social Network Analysis (SNA) provides tools to examine relationships between people. Text Mining (TM) allows capturing the text they produce in Web 2.0 applications, for example, but it neglects their social structure. This paper applies an approach that combines the two methods, named "content-based SNA". Using the R mailing lists, R-help and R-devel, we show how this combination can be used to describe people's interests and to find out if authors who have similar interests actually communicate. We find that the expected positive relationship between sharing interests and communicating gets stronger as the centrality scores of authors in the communication networks increase.

Introduction

Social Network Analysis (SNA) provides powerful methods to study the relationships between people expressed as binary or weighted adjacency matrices. It can be used to find influential or popular nodes, communities and informal hierarchies. However, it is limited in the sense that it cannot capture the context of their encounter. An actor might be a regarded adviser or trend setter concerning one topic and might not know anything about another. If all his or her relationships are expressed as purely numerical networks, a differentiation is not possible.

In open source software communities, a large part of the developers' communication and thus collaboration happens electronically via e-mails and forums. In the R mailing lists, R-help and R-devel, all kinds of application and development related questions are discussed. From experts to newcomers, everyone shares the same platform, so a view on the entire network does not offer a detailed picture, leaving open the question of how well the communication really works in the R community and which role the most central actors play in this context. As bad coordination can be reflected by redundant code or hardly compatible packages, it is important that developers and users working in the same field stay in contact.

In this paper, we use content-based SNA, an approach that combines SNA and Text Mining (TM), in order to find out to which extent sharing interests is associated with communicating in two R mailing lists. Content-based SNA consists of extracting overlapping topic-related subnetworks from the entire communication network. The paper shows how the authors' interests can be found based on their centralities in these topic-based subnetworks. By comparing the networks showing who shares interests with the communication networks, we find that the more central the authors are in the communication networks, the more strongly communicating is associated with sharing interests. We argue that this finding is due to the motives of authors to use the mailing lists.

Related work

There are several approaches to combine SNA and TM. One has to distinguish between combinations that enrich TM through SNA and those that enrich SNA through TM. One of the most prominent applications on the TM side is used by the search engines Google and Yahoo: they rank search results found by TM procedures according to centrality measures based on hyperlink graphs (Brin and Page, 1998; Kleinberg, 1999). Another very important TM plus SNA application is the summarization of texts by calculating centrality measures in word networks (networks where nodes represent words that are connected if they appear in the same text unit). The results are often visualized as tag clouds (Erkan and Radev, 2004; Vanderwende et al., 2004).

However, approaches to enrich an SNA with TM are not as numerous. McCallum et al. (2007) introduced the author-recipient-topic model, which consists of fitting a multinomial distribution over topics and authors/recipients of a message simultaneously. The results are combinations of concepts and pairs of authors and recipients that characterize the network. By studying the Enron e-mail corpus, they defined social roles based on such combinations. For example, the relationships of an Enron lawyer to a few other people were characterized by the concept "legal contracts". A similar approach was applied to authors of Wikipedia articles by Chang et al. (2009).

The tradition this paper stands in is called "content-based SNA". In content-based SNA, the network is partitioned into several overlapping subgraphs of people who discuss the same topic. Examples in the literature comprise Velardi et al. (2008), who analyzed the evolution of content-based communication networks in a company by measuring the agent-aggregation around topics, and Viermetz (2008), who calculated local actor centrality in content-based networks to find opinion leaders. To the best of our knowledge, content-based SNA has not been applied to mailing lists or to any kind of open source related data.


Data preparation

The data were taken from the mailing lists R-help and R-devel from January 2008 to May 2009. The e-mails can be downloaded as compressed text files from the R website (https://stat.ethz.ch/pipermail/r-help/ and https://stat.ethz.ch/pipermail/r-devel/). There is one text file for each month. The R code for the entire data preparation process can be downloaded from R-Forge (https://r-forge.r-project.org/projects/snatm/). First, the e-mails for each month were written into a single file using as.one.file() from the snatm package (example of the R-devel mailing list):

> as.one.file(files,
+             filename = "allthreads.txt",
+             list = "rdevel")

Then, the meta data (author, date, subject, thread-ID and e-mail-ID) and the e-mail content were extracted from these texts and transformed into a matrix.

> forest <- makeforest(month = "allthreads")
> head(forest)
     emailID threadID
[1,] "1"     "1"
[2,] "2"     "2"
[3,] "3"     "2"
[4,] "4"     "3"
[5,] "5"     "2"
[6,] "6"     "3"
     author
[1,] "ggrothendieck at gmail.com (Gabor..."
[2,] "huber at ebi.ac.uk (Wolfgang Hube..."
[3,] "ripley at stats.ox.ac.uk (Prof Br..."
[4,] "dfaden at gmail.com (David Faden)..."
[5,] "ripley at stats.ox.ac.uk (Prof Br..."
[6,] "ripley at stats.ox.ac.uk (Prof Br..."
     subjects
[1,] "[Rd] Wish List"
[2,] "[Rd] Error from wilcox.test"
[3,] "[Rd] Error from wilcox.test"
[4,] "[Rd] setting the seed in standalo..."
[5,] "[Rd] Error from wilcox.test"
[6,] "[Rd] setting the seed in standalo..."
     content
[1,] "In: trunk/src/library/base/ma..."
[2,] "On 1/3/2008 9:03 AM, Gabor Grothe..."
[3,] "What it is trying is % env R_DEF..."
[4,] "On 1/4/08, Prof Brian Ripley"

Note that this procedure cannot identify complete threads in all cases. Furthermore, citations and signatures were omitted from the e-mail content (removeCitation() and removeSignature() from the tm.plugin.mail package). The data contains 43760 R-help e-mails and 5008 R-devel e-mails.

Data preparation for TM

First, in order to reduce spelling ambiguities resulting from differences in American versus British English, the e-mail subjects and content were transformed into British English using the ae-to-be() function from the snatm package.

Second, the wn.replace() function from the snatm package was used to find groups of synonyms that were used in the texts, based on the WordNet database (Fellbaum, 1998; Wallace, 2007; Feinerer and Hornik, 2010). All terms in these groups of synonyms are replaced by one single representative of the group. However, in order to account for the R-specific vocabulary, not all of the synonym terms should be replaced. For example, the word "car" should not be replaced by "auto" where it typically refers to the car package. The wn.replace() function allows for manual deselection of certain terms if they should not be replaced by a synonym.

Third, terms are stemmed using the Snowball stemmer stemDocument() from the tm package.

Fourth, term frequencies of the resulting terms (termFreq() from the tm package (Feinerer et al., 2008)) were calculated (separately for subjects and content). The function tolower() from the base package was applied and stopwords were ignored. Words of length less than three were omitted, as the vast majority of them did not have a meaning in terms of abbreviations and the like, but were code chunks. Terms that were used less than 10 times (for subjects) or less than 20 times (for content) were ignored as well.

The four steps can be done all at once by the prepare.text() function from the snatm package.

Data preparation for SNA

To obtain a social network from the mailing list data, first, an alias detection procedure was performed on the Author column of forest.

Second, an edgelist was created with createedges() from snatm, in which each author is connected to everyone who wrote an earlier e-mail in the same thread, as we assume that the respondent is aware of all the previous e-mails. After this, the edgelist was transformed into a matrix representation of a network (adjacency() from snatm).

> network <- adjacency(createedges(forest))
> network[1:2, 1:2]
         Gabor G. Brian R.
Gabor G.      130       29
Brian R.       19      250

The resulting social network consists of nodes representing e-mail authors and directed valued lines indicating who answered whom and the number of mails sent in the corresponding direction. For example, Gabor Grothendieck answered 29 of Brian Ripley's e-mails. If the diagonal is not zero, the corresponding authors appeared several times in the same thread. (However, the diagonal will be ignored in the subsequent analysis.) We will call these networks "communication networks".

Combining network data and textual data

This section describes the crucial step where network data and textual data are combined. Three data preparation steps are needed.

First step

For all the subject terms that were used 10 times or more and all the content terms that appeared 20 times or more (using the prepared text corpora resulting from the prepare.text() function), a communication network was extracted that shows only the connections between authors who used this particular term. In a loop, each term was assigned to subjectfilter, such that only connections between authors who used a certain term were extracted from forest.

> network <- adjacency(createedges(
+              forest,
+              subjectfilter = "lattice"))

As an example, Figure 1 shows the communication network of all the R-help authors who used the term "lattice" in the e-mail subject. Deepayan Sarkar, who is the author of the lattice package (Sarkar, 2008), is clearly the most central person in this network. This indicates that he answered nearly everyone having a question about lattice.

Second step

This step is based on the idea that someone's interest in a certain subject can be measured by his or her centrality or activity in the corresponding communication network. For example, we would conclude from Figure 1 that Deepayan Sarkar is very interested in the word "lattice". In SNA, there are several measures to calculate centrality, for example degree, closeness, betweenness and others. We chose to use the degree (number of direct neighbors; degree from the sna package (Butts, 2008)) because it measures activity, while the others measure the connectedness to the entire network. The degree was calculated for all the nodes of networks that have more than 20 members. If a network has only 20 members or less, the results of centrality measures, and thus the assumption of somebody being interested in a certain topic, are considered not to be meaningful. Then, we created a network consisting of two sets of nodes (a 2-mode network), one representing e-mail authors and the other representing terms.

> twomode <- adjacency(centrality.edgelist(
+              terms, apply.to = "subjects"))

Each term and each author who used the term was connected by a line having the normalized degree centrality rank as line weight. (For instance, the person with the highest degree in a certain communication network is connected to the corresponding word with a line weight of 1.) Thus, the line weight can be interpreted as the extent to which somebody is interested in a particular term, on a [0,1] interval. Figure 2 shows the 0.45% strongest lines of R-help authors and subject-terms (largest connected subnetwork only). For example, Jim Lemon is connected to terms like "bar", "scatter", "axi(s)", "barplot" and "diagram", so he seems to be interested in plots.

Third step

In the third step, this 2-mode network was reduced to a 1-mode network by omitting the term nodes and connecting the author nodes that were adjacent to the term (shrink from the snatm package).

> shrink(twomode, keep = people,
+        values = "min")

For example, in the 1-mode network John Fox and Gabor Grothendieck are connected because they are both connected to "mean" in the 2-mode network. The networks have line weights representing the minimum of the line weights two people were formerly connected with through the term node. For example, in the 2-mode network John Fox was connected to the term "mean" with a line weight of 0.9957, and Gabor Grothendieck was also connected to "mean" but with a line weight of 1. Then the two authors are connected with a line weight of 0.9957 in the 1-mode network, meaning that the extent to which they share interests is 0.9957 of 1. We will call these networks "interest networks".


Figure 1: Communication network of all R-help authors who used "lattice" in the e-mail subject.


Figure 2: Author-term network showing the largest connected subnetwork of the 0.45% strongest connections between R-help authors and subject-terms.


Results: Comparison of communication networks and interest networks

At this point, the two kinds of networks, the communication networks and the interest networks, can be compared. The communication networks have line weights representing the number of e-mails exchanged. The line weights of the interest networks indicate the extent to which two people have similar interests, on a 0-1 interval. If we calculate the correlation between the line weights of both networks, we get the extent to which sharing interests is associated with communicating. We should expect that the more two people are interested in the same topic, the more e-mails they exchange.

There are two communication networks, one for R-help having 5972 nodes and one for R-devel having 983 nodes. Furthermore, there are four interest networks, because the extent of shared interests was measured once by using the subjects and once by using the content for each of the mailing lists. The interest networks have fewer nodes because only the most frequently used terms were included in the interest analysis. Thus, people who only used less frequent terms do not appear in the interest networks. To compare the two kinds of networks, the communication networks were reduced to those nodes that also appeared in the interest networks. Furthermore, the reduced communication network was permuted (permutation() from the snatm package) such that the sequence of author names assigned to rownames(network) is the same in both networks. However, the correlation between sharing interests and communicating is only approximately 0 for all possible combinations of communication networks and interest networks (Table 1). (The diagonals of the communication networks were set to 0.)

                    Communication network
Interest network    R-help    R-devel
subjects            0.030     0.070
content             0.009     0.044

Table 1: Correlations between all combinations of communication networks and interest networks.
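A minimal sketch of this comparison, assuming network and interest are the two matched and permuted weight matrices (interest is an illustrative name, not one used by snatm):

> diag(network) <- 0
> cor(as.vector(network), as.vector(interest))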

However, if the centrality of authors in the communication network is taken into account, we get a different picture. Figure 3 shows how the correlation between intensity of communication and extent of sharing interests changes (y-axis) as the normalized centrality (degree, betweenness and closeness) of nodes increases (x-axis). (Degree and closeness were calculated with the sna package (Butts, 2008), betweenness with the igraph package (Csardi and Nepusz, 2006).)

Figure 3: Correlation between intensity of communication and sharing interests (y-axis) and degree centrality of nodes (x-axis). [Four panels: R-help subjects (n = 4328), R-help content (n = 5235), R-devel subjects (n = 445) and R-devel content (n = 787), each with lines for degree, betweenness and closeness.]

More precisely, it shows the correlations between communication networks and interest networks consisting of only those authors who have a higher or equally high centrality than indicated on the x-axis. There is an almost linear increase in correlation in the R-help subjects network for the centrality measures degree and betweenness, whose distributions are highly right skewed. Closeness is approximately normally distributed. Thus, networks of people having an average closeness (around 0.5) are not correlated, just as networks of authors with an average normalized degree (around 0). In the R-help content network, a relation between the centrality of authors and the correlation between sharing interests and communicating begins only at a centrality of around 0.5 for the right-skewed measures and around 0.95 for closeness. This is a result of the discussion topics sometimes changing in the course of threads. In the R-devel networks, the relation is also positive but not as stable, which is due to the much smaller network sizes n of 445 (subjects) and 787 (content), compared to 4328 (subjects) and 5235 (content) in the R-help networks. The correlation in R-devel is generally higher than in R-help. In both the R-help (for high centralities) and the R-devel networks, the lines are smoother when the subjects were used to find people's interests. This might be due to some additional chatting in the content, while the subjects are more focused on the actual topics. However, the choice of subjects or content does not influence the general finding that sharing interests and communicating become more strongly associated as people's centrality scores increase. This might be explained by impartiality or indifference of highly active authors towards the personality of their e-mail partners: they rather concentrate on the topics. Figure 4 supports this interpretation.

Figure 4: Scatterplot showing the number of questions (x-axis) and answers (y-axis) as well as the degree centrality (dot size) of each author. [Two panels: R-help and R-devel; both axes on log scales.]

Each point represents one author and shows how many questions (x-axis) and answers (y-axis) he or she wrote (ans.quest() from the snatm package). The larger a point, the higher the author's degree centrality. The plots indicate that in R-help the 15 most active authors (red) write far more answers than questions. All but one author are developers of CRAN packages, which suggests that their motivation is to provide help concerning their packages or specific topics, no matter who asked the questions. On the contrary, less central R-help authors are either more choosy with the choice of e-mail partners or they are not active enough to answer everyone having similar interests. In R-devel, the proportion of answers to questions of central authors (red) is not different from less central authors. Still, Figure 3 suggests that there is a great correspondence between communicating and sharing interests among the most active authors. The low correlation between intensity of communication and extent of sharing interests for peripheral nodes in both mailing lists can be due to their very low activity, which hinders the definition of their interests.

Summary

The paper showed how content-based SNA can be used to find people's interests in mailing list networks. By comparing communication graphs and networks showing who has similar interests, a relationship between the correlation of these two and node centrality could be found. Accordingly, the expected relationship between sharing interests and communicating exists only for very active authors, while less active authors do not answer everyone who has similar interests. Thus, the communication efficiency can be regarded as high for very active mailing list authors, while it is moderate for mid-active authors. Additionally, the paper suggests using only the subjects to find the relationship between communicating and sharing interests, because the content contains more noise.

Code for figures 1 and 2

> # Figure 1
> gplot.snatm(network, vertex.col = "white",
+             vertex.border = "grey",
+             label = rownames(network),
+             boxed.labels = FALSE,
+             label.pos = 5,
+             label.cex =
+               ((sna::degree(network)+1)^0.45)*0.25,
+             gmode = "graph", edge.lwd = 0.1,
+             edge.col = "grey")

> # Figure 2
> # See how to get peoplelist in the snatm demo.
> people <- which(is.element(rownames(twomode),
+                 unique(peoplelist)))
> labelcol <- rep(rgb(0,0,1,0.75), dim(twomode)[1])
> labelcol[people] <- "red"
> gplot.snatm(twomode, gmode = "graph",
+             vertex.col = "white", vertex.cex = 1,
+             label = rownames(twomode),
+             label.col = labelcol,
+             label.cex =
+               (sna::degree(twomode)^0.25)*0.35,
+             label.pos = 5, edge.lwd = 0.1,
+             boxed.labels = FALSE,
+             vertex.border = "white", edge.col = "grey")


Bibliography

C. Bird, A. Gourley, P. Devanbu, M. Gertz, and A. Swaminathan. Mining email social networks. In Proceedings of the 2006 International Workshop on Mining Software Repositories, Shanghai, China, pages 137-143, 2006. URL http://macbeth.cs.ucdavis.edu/msr06.pdf.

S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998. URL http://wx.bsjh.tcc.edu.tw/~t2003013/wiki/images/8/8f/Anatomy.pdf.

C. T. Butts. Social network analysis with sna. Journal of Statistical Software, 24(6), 2008. URL http://www.jstatsoft.org/v24/i06/.

J. Chang, J. B. Graber, and D. M. Blei. Connections between the lines: Augmenting social networks with text. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169-178, New York, NY, USA, 2009. ACM. URL http://www.cs.princeton.edu/~jbg/documents/kdd2009.pdf.

G. Csardi and T. Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006. URL http://igraph.sf.net.

G. Erkan and D. R. Radev. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, 2004. URL http://tangra.si.umich.edu/~radev/lexrank/lexrank.pdf.

I. Feinerer and K. Hornik. wordnet: WordNet Interface, 2010. URL http://CRAN.R-project.org/package=wordnet. R package version 0.1-6.

I. Feinerer, K. Hornik, and D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05.

C. Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books, 1998.

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999. URL http://www.cs.cornell.edu/home/kleinber/auth.pdf.

V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8, Soviet Physics Doklady, 1966.

A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research, 30:249-272, 2007. URL http://www.cs.umass.edu/~mccallum/papers/art-ijcai05.pdf.

D. Sarkar. Lattice: Multivariate Data Visualization with R. Springer, 2008.

L. Vanderwende, M. Banko, and A. Menezes. Event-centric summary generation. In Working Notes of DUC 2004, 2004. URL http://www-nlpir.nist.gov/projects/duc/pubs/2004papers/microsoft.banko.pdf.

P. Velardi, R. Navigli, A. Cucchiarelli, and F. D'Antonio. A new content-based model for social network analysis. In Proceedings of the 2008 IEEE International Conference on Semantic Computing, pages 18-25, Washington, DC, USA, 2008. IEEE Computer Society. URL http://www.dsi.uniroma1.it/~velardi/ICSCpubblicato.pdf.

M. Viermetz. Partitioning Massive Graphs for Content Oriented Social Network Analysis. PhD thesis, Heinrich-Heine-Universität Düsseldorf, June 2008. URL http://docserv.uni-duesseldorf.de/servlets/DerivateServlet/Derivate-8825/diss_viermetz_pdfa1b.pdf.

M. Wallace. Jawbone Java WordNet API, 2007. URL http://mfwallace.googlepages.com/jawbone.html.

Angela Bohn, Kurt Hornik, Patrick Mair
Institute for Statistics and Mathematics
Department of Finance, Accounting and Statistics
Wirtschaftsuniversität Wien
Augasse 2-6
A-1090 Wien
Austria
[email protected]
[email protected]
[email protected]

Ingo Feinerer
Database and Artificial Intelligence Group
Institute of Information Systems
Technische Universität Wien
Favoritenstrasse 9
A-1040 Wien
Austria
[email protected]


Rmetrics - timeDate Package

by Yohan Chalabi, Martin Mächler, and Diethelm Würtz

Abstract The management of time and holidays can prove crucial in applications that rely on historical data. A typical example is the aggregation of a data set recorded in different time zones and under different daylight saving time rules. Besides the time zone conversion function, which is well supported by default classes in R, one might need functions to handle special days or holidays. In this respect, the package timeDate enhances default date-time classes in R and brings new functionalities to time zone management and the creation of holiday calendars.

Chronological data sets recorded in different time zones play an important role in industrial applications. For example, in financial applications, it is common to aggregate time series that have been recorded in financial centers with different daylight saving time (DST) rules and time zones.

R includes different classes to represent dates and time. The class that holds particular interest for us is the "POSIXct" one. It internally records timestamps as the number of seconds from "1970-01-01 UTC", where UTC stands for universal time coordinated. Moreover, it supports the DST and time zone functions by using the rules provided by the operating system (OS). However, at the time timeDate (Würtz et al. (2011)), formerly known as fCalendar, was first released, the implementation of the DST function was not consistent across OSs. Back then, the main purpose of the package was to have DST rules directly available to bring consistency over OSs. Today, DST support by OSs is not as problematic as it used to be. As we will show later, the "timeDate" class is based on the "POSIXct" one. Both classes hence share common functionalities. However, the timeDate package has some additional functionalities, which we will emphasize in this note. For related date-time classes in R, see Ripley & Hornik (2001) and Grothendieck & Petzoldt (2004).

Another problem commonly faced in managing time and dates is the midnight standard. Indeed, the problem can be handled in different ways depending on the standard in use. For example, the standard C library does not allow for the "midnight" standard in the format "24:00:00" (Bateman (2000)). However, the timeDate package supports this format.

Moreover, timeDate includes functions for calendar manipulations, business days, weekends, and public and ecclesiastical holidays. One can handle day count conventions and rolling business conventions. Such periods can pertain to, for example, the last working day of the month. The examples below illustrate this point.

In the remaining part of this note, we first present the structure of the "timeDate" class. Then, we explain the creation of "timeDate" objects. The use of financial centers with DST rules is described next. Thereafter, we explain the management of holidays. Finally, operations on "timeDate" objects such as mathematical operations, rounding, subsetting, and coercions are discussed. Throughout this note, we provide many examples to illustrate the functionalities of the package.

Figure 1: World map with major time zones.[1]

[1] The original data set of the world map with time zones is available at http://efele.net/maps/tz/world/. Full and reduced rda versions were kindly contributed.

Structure of the "timeDate" class

The "timeDate" S4 class is composed of a "POSIXct" object that is always in Greenwich Mean Time (GMT) and of a financial center that keeps track of DST. We use the term financial center to denote geographical locations. This terminology stems from the main application of the Rmetrics packages. However, this denomination should not stop programmers from using the package in other fields. By default, the local financial center is set to GMT. The default setting can be changed via setRmetricsOptions(myFinCenter = ....). The formal S4 class is defined as

> showClass("timeDate")
Class "timeDate" [package "timeDate"]

Slots:
Name:      Data    format FinCenter
Class:  POSIXct character character

where the slot Data contains the timestamps in the POSIXct class, format is the format typically applied to Data, and FinCenter is the financial center.

Note: we use the abbreviation GMT equivalently to UTC, the universal time coordinated.
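The slots can also be inspected directly, as the following small sketch shows (direct @ access is used only for illustration; the package's accessor functions are the cleaner interface):

> library(timeDate)
> td <- timeDate("2010-01-01 12:00:00", FinCenter = "Zurich")
> td@FinCenter    # the financial center, here "Zurich"
> class(td@Data)  # the underlying "POSIXct" timestamp, stored in GMT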


"timeDate" object creation > finCenter(td) <- "Zurich" > td There are different ways to generate a "timeDate" Zurich object. It can be generated using either timeDate(), [1] [2009-09-28 23:12:55] [2010-01-15 10:34:02] timeSequence(), or timeCalendar(). The function timeDate() creates a "timeDate" ob- If the format of charvec is not specified, timeDate ject from scratch. It requires a character vector of uses an automated date-time format identifier called timestamps and optional arguments to specify the whichFormat() that supports common date-time for- format of this vector and the financial center. The mats. financial center can be specified, as mentioned, via setRmetricsOptions() or (more cleanly) with the ar- > whichFormat(charvec) gument FinCenter. By default, it is set to GMT2. [1] "%Y-%m-%d %H:%M:%S" In the following, we illustrate the creation of "timeDate" objects as well as the method used to The function timeSequence() creates a convert timestamps from different time zones. "timeDate" object representing an equidistant se- We first create character vectors of two times- quence of points in time. You can specify the range tamps with the default financial center (GMT): of dates with the arguments from and to. If from is missing, length.out defines the length of the se- > Dates <- c("2009-09-28","2010-01-15") quence. In the case of a monthly sequence, you can > Times <- c( "23:12:55", "10:34:02") define specific rules. For example, you can generate > charvec <- paste(Dates, Times) the sequence with the last days of the month or with > getRmetricsOption("myFinCenter") the last or n-th Friday of every month. This can be of myFinCenter particular interest in financial applications. "GMT" Let us first reset the financial center to an interna- > timeDate(charvec) tional environment: GMT [1] [2009-09-28 23:12:55] [2010-01-15 10:34:02] > setRmetricsOptions(myFinCenter = "GMT") > # 'timeDate' is now in the financial center "GMT" As a second step, we set the local financial center > timeDate(charvec) to Zurich and create a "timeDate" object. GMT [1] [2009-09-28 23:12:55] [2010-01-15 10:34:02] > setRmetricsOptions(myFinCenter = "Zurich") > timeDate(charvec) A sequence of days or months can be created as fol- Zurich [1] [2009-09-28 23:12:55] [2010-01-15 10:34:02] lows: > # first three days in January 2010, The third example shows how the timestamps > timeSequence(from = "2010-01-01", can be conveniently converted into different time + to = "2010-01-03", by = "day") zones; charvec is recorded in Tokyo and subse- GMT quently converted to our local center, i.e., Zurich (see [1] [2010-01-01] [2010-01-02] [2010-01-03] above): > # first 3 months in 2010: > timeSequence(from = "2010-01-01", > timeDate(charvec, zone = "Tokyo") + to = "2010-03-31", by = "month") Zurich GMT [1] [2009-09-28 16:12:55] [2010-01-15 02:34:02] [1] [2010-01-01] [2010-02-01] [2010-03-01] or converted from Zurich to New York: The function timeCalendar() creates "timeDate" > timeDate(charvec, zone = "Zurich", objects from calendar atoms. You can specify values + FinCenter = "NewYork") or vectors of equal length denoting year, month, day, NewYork hour, minute, and seconds as integers. 
For example, [1] [2009-09-28 17:12:55] [2010-01-15 04:34:02] the monthly calendar of the current year or a specific calendar in a given time zone can be created as fol- It is also possible to use the function finCenter() lows: to view or modify the local center: > timeCalendar() > td <- timeDate(charvec, zone = "Zurich", + FinCenter = "NewYork") GMT > finCenter(td) [1] [2011-01-01] [2011-02-01] [2011-03-01] [1] "NewYork" [4] [2011-04-01] [2011-05-01] [2011-06-01] [7] [2011-07-01] [2011-08-01] [2011-09-01] [10] [2011-10-01] [2011-11-01] [2011-12-01]

[2] GMT can be considered as a "virtual" financial center.


The following represents the first four days of January recorded in Tokyo at local time "16:00" and converted to the financial center Zurich:

> timeCalendar(2010, m = 1, d = 1:4, h = 16,
+              zone = "Tokyo", FinCenter = "Zurich")
Zurich
[1] [2010-01-01 08:00:00] [2010-01-02 08:00:00]
[3] [2010-01-03 08:00:00] [2010-01-04 08:00:00]

Midnight standard

The "timeDate" printing format is designed in accordance with the ISO-8601 (1988) standard. It uses the 24-hour clock format. Dates are expressed in the "%Y-%m-%d" format while time-dates are stated in the "%Y-%m-%d %H:%M:%S" format. A special case in the 24-hour clock system is the representation of midnight. It can be equivalently represented by "00:00" and "24:00". The former is usually used to denote the beginning of the day whereas the latter denotes the end. timeDate supports the midnight standard as described by the ISO-8601 (1988) standard, as illustrated here:

> timeDate(ch <- "2010-01-31 24:00:00")
GMT
[1] [2010-02-01]

Note that, following the revision of this paper, the R sources were amended in May 2011 to also allow "24:00" time specifications, such that a midnight standard will be part of standard R.

Financial centers – via daylight saving time rules

As mentioned earlier, the global financial center can be set with the function setRmetricsOptions() and accessed with the function getRmetricsOption(). Its default value is set to "GMT":

> getRmetricsOption("myFinCenter")
myFinCenter
      "GMT"
> # change to Zurich:
> setRmetricsOptions(myFinCenter = "Zurich")

From now on, all dates and times are handled in accordance with the Central European time zone and the DST rule for Zurich. Note that setting the financial center to a continent/city that lies outside the time zone used by your OS does not change any of the time settings or environment variables of your computer.

There are almost 400 financial centers supported thanks to the Olson database. They can be accessed by the function listFinCenter(), and partial lists can be extracted through the use of regular expressions.

> # first few financial centers:
> head(listFinCenter())
[1] "Africa/Abidjan"     "Africa/Accra"
[3] "Africa/Addis_Ababa" "Africa/Algiers"
[5] "Africa/Asmara"      "Africa/Bamako"
> # European centers starting with A or B:
> listFinCenter("Europe/[AB].*") # -> nine
[1] "Europe/Amsterdam"  "Europe/Andorra"
[3] "Europe/Athens"     "Europe/Belgrade"
[5] "Europe/Berlin"     "Europe/Bratislava"
[7] "Europe/Brussels"   "Europe/Bucharest"
[9] "Europe/Budapest"

Each financial center has an associated function that returns its DST rules in the form of a "data.frame". These functions share the name of their financial center, e.g., Zurich().

> Zurich()[64:67, ]
                Zurich offSet isdst TimeZone    numeric
64 2010-03-28 01:00:00   7200     1     CEST 1269738000
65 2010-10-31 01:00:00   3600     0      CET 1288486800
66 2011-03-27 01:00:00   7200     1     CEST 1301187600
67 2011-10-30 01:00:00   3600     0      CET 1319936400

The returned "data.frame" shows when the clock was changed in Zurich, the offset in seconds with respect to GMT, a flag that tells us if DST is in effect or not, the time zone abbreviation, and the number of seconds since "1970-01-01" in GMT. The reader interested in the history of DST is referred to Bartky & Harrison (1979).

Note that new centers can be easily added, as long as their associated functions return a "data.frame" with the same structure as described above.
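For instance, a user-defined center could look as follows; the name MyCenter is invented, and its two rows are copied from the Zurich() output above purely for illustration:

# Hypothetical user-defined DST-rule function; the rows are borrowed
# from Zurich() only to show the required column layout.
MyCenter <- function()
  data.frame(MyCenter = c("2010-03-28 01:00:00", "2010-10-31 01:00:00"),
             offSet   = c(7200, 3600),       # offset to GMT in seconds
             isdst    = c(1, 0),             # DST flag
             TimeZone = c("CEST", "CET"),    # time zone abbreviation
             numeric  = c(1269738000, 1288486800),  # seconds since 1970-01-01 GMT
             stringsAsFactors = FALSE)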

The "timeDate" printing format is designed in ac- Each financial center has an associated func- cordance with the ISO-8601(1988) standard. It uses tion that returns its DST rules in the form of a the 24-hour clock format. Dates are expressed in "data.frame". These functions share the name of the “%Y-%m-%d” format while time-dates are stated their financial center, e.g., Zurich(). in the “%Y-%m-%d %H:%M:%S” format. A special case in the 24-hour clock system is the representa- > Zurich()[64:67, ] tion of midnight. It can be equivalently represented Zurich offSet isdst TimeZone by “00:00” and “24:00”. The former is usually used 64 2010-03-28 01:00:00 7200 1 CEST to denote the beginning of the day whereas the lat- 65 2010-10-31 01:00:00 3600 0 CET ter denotes the end. timeDate supports the midnight 66 2011-03-27 01:00:00 7200 1 CEST standard as described by the ISO-8601(1988) stan- 67 2011-10-30 01:00:00 3600 0 CET dard as illustrated here: numeric 64 1269738000 > timeDate(ch <- "2010-01-31 24:00:00") 65 1288486800 GMT 66 1301187600 [1] [2010-02-01] 67 1319936400 Note, following the revision of this paper, the R The returned "data.frame" shows when the source has been amended in May 2011 to also al- clock was changed in Zurich, the offset in seconds low “24:00” time specifications, such that a midnight with respect to GMT, a flag that tells us if DST is in ef- standard will be part of standard R. fect or not, the time zone abbreviation, and the num- ber of seconds since “1970-01-01” in GMT. The reader Financial centers – via daylight sav- interested in the history of DST is referred to Bartky & Harrison(1979). ing time rules Note new centers can be easily added as long as their associated functions return a "data.frame" As mentioned earlier, the global financial center can with the same structure as described above. be set with the function setRmetricsOptions() and accessed with the function getRmetricsOption(). Its default value is set to “GMT”: Holidays and calendars > getRmetricsOption("myFinCenter") myFinCenter Holidays are usually grouped by their origins. The "GMT" first category, as the etymology suggests, is based > # change to Zurich: on religious origin. For example, the ecclesiastical > setRmetricsOptions(myFinCenter = "Zurich") calendars of Christian churches are based on cycles of movable and immovable feasts. Christmas, De- From now on, all dates and times are handled cember 25, is the principal immovable feast, whereas in accordance with the Central European time zone Easter is the principal movable feast. Most of the and the DST rule for Zurich. Note that setting the other dates are movable feasts that are determined financial center to a continent/city that lies outside with respect to Easter, Montes(1996). A second cat- the time zone used by your OS does not change any egory of holidays is secular holidays, which denotes of the time settings or environment variables of your days that are celebrated internationally and in differ- computer. ent cultures, such as Labor Day. Another category There are almost 400 financial centers supported of holidays includes days that are relative to natural thanks to the Olson database. They can be accessed events. For example, the dates can be related to as- by the function listFinCenter(), and partial lists tronomical events such as cycles of the moon or the can be extracted through the use of regular expres- equinox. Moreover, there are also country-specific sions. national holidays.


The calculation of holidays might prove tedious in some circumstances. Indeed, the estimation of the Easter date is a complex procedure with different algorithms involved in its computation. The algorithm implemented in the package is the one of Oudin (1940) as quoted in Seidelmann (1992). This approach is valid for any Gregorian calendar year. Further details about holiday calculation can be found in Tøndering (2008).

The dates of Easter for the next five years can be calculated with

> thisYear <- getRmetricsOption("currentYear")
> Easter(thisYear:(thisYear+5))
Zurich
[1] [2011-04-24] [2012-04-08] [2013-03-31]
[4] [2014-04-20] [2015-04-05] [2016-03-27]

The timeDate package includes functions for bank holidays in Canada, France, Germany, Great Britain, Italy, Japan,[3] Switzerland, and the US. These holidays can be grouped in calendars. At the moment, the package provides functions for the New York stock exchange, holidayNYSE(); for the North American Reliability Council, holidayNERC();[4] for the Toronto stock exchange, holidayTSX(); and for Zurich, holidayZURICH(). Other calendars can be easily implemented given that the package already provides many holiday functions. A list of all holidays is provided in the appendix.

Logical test

It is possible to construct tests for weekdays, weekends, business days, and holidays with the functions isWeekday(), isWeekend(), isBizday() and isHoliday(), respectively.

Let us take a sequence of dates around Easter:

> Easter(2010)
Zurich
[1] [2010-04-04]
> (tS <- timeSequence(Easter(2010, -2),
+                     Easter(2010, +3)))
Zurich
[1] [2010-04-02] [2010-04-03] [2010-04-04]
[4] [2010-04-05] [2010-04-06] [2010-04-07]

We can now extract the weekdays or business days according to a holiday calendar.

> (tS1 <- tS[isWeekday(tS)])
Zurich
[1] [2010-04-02] [2010-04-05] [2010-04-06]
[4] [2010-04-07]
> (tS2 <- tS[isBizday(tS, holidayZURICH(2010))])
Zurich
[1] [2010-04-06] [2010-04-07]
> dayOfWeek(tS2)
2010-04-06 2010-04-07
     "Tue"      "Wed"

Thanks to the comments of one of the referees, we have added a new argument, wday, in the functions isWeekend(), isWeekday(), isBizday() and isHoliday() that can be used to specify which days should be considered as business days. This is important when using calendars in Islamic countries or in Israel. By default, wday specifies the weekdays as Monday to Friday.

Special dates

As mentioned earlier, holidays often refer to a specific date or event. It is therefore crucial to have functions to compute: the first day in a given month, the last day in a given month and year, a given day before or after a date, the n-th occurrence of a day in a specified year/month, or a given last day for a specified year/month. We have summarized these functions in a table in the appendix.

In the following, we demonstrate how to retrieve the last day in each quarter or the second Sunday of each month. Note that days are numbered from 0 to 6, where 0 corresponds to Sunday and 6 to Saturday.

> charvec <- c("2011-03-01", "2011-04-01")
> # Last day in quarter
> timeLastDayInQuarter(charvec)
Zurich
[1] [2011-03-31] [2011-06-30]
> # Second Sunday of each month:
> timeNthNdayInMonth(charvec, nday = 0, nth = 2)
Zurich
[1] [2011-03-13] [2011-04-10]
> # Closest Friday that occurred before:
> timeNdayOnOrBefore(charvec, nday = 5)
Zurich
[1] [2011-02-25] [2011-04-01]

Operations on "timeDate" objects

Just like the other date-time classes in R, the "timeDate" class supports common operations. It allows for mathematical operations such as addition, subtraction, and comparisons to be performed. Moreover, methods for the generic functions to concatenate, replicate, sort, re-sample, unify, revert, or lag are available as the well known calls c(), rep(), sort(), sample(), unique(), rev(), and diff(), respectively. We spare the reader superfluous examples of functions that are common to other date-time classes. In the rest of the section, we emphasize methods that are not available for other classes or are not strictly identical. The reader is referred to the ebook "Chronological Objects with Rmetrics" (Würtz, Chalabi & Ellis, 2010) for more examples.

[3] The Japanese holidays were contributed by Parlamis Franklin.
[4] holidayNERC() was contributed by Joe W. Byers.


Subsetting methods

The timeDate package has different functions to subset a "timeDate" object. The usual function "[" extracts or replaces subsets of "timeDate" objects as expected. However, the package provides some additional functions. For example, the function window() extracts a sequence from a starting and ending point. The functions start() and end() extract the first and last timestamps, respectively.

Coercion methods

The package provides both S3 and S4 methods to coerce to and from "timeDate" objects. Below, we list the S4 coercion methods available. The equivalent S3 as.* methods are also provided, although the mixing of S4 classes and S3 methods is discouraged. Note that information can be lost in the coercion process if the destination class does not support the same functionality.

> showMethods("coerce", class = "timeDate")
Function: coerce (package methods)
from="ANY", to="timeDate"
from="Date", to="timeDate"
from="POSIXt", to="timeDate"
from="timeDate", to="Date"
from="timeDate", to="POSIXct"
from="timeDate", to="POSIXlt"
from="timeDate", to="character"
from="timeDate", to="data.frame"
from="timeDate", to="list"
from="timeDate", to="numeric"

We would like to point out a particular difference between the as.numeric methods of the "timeDate" and "POSIXct" classes. Indeed, the as.numeric-timeDate method returns the time in minutes, in contrast to as.numeric.POSIXct, which returns the results in seconds. However, the as.numeric-timeDate method has an additional argument, unit, to select other units. These include seconds, hours, days, and weeks.

Concatenation method

The concatenation c() method for "timeDate" objects takes care of the different financial centers of the objects to be concatenated. In such cases, all timestamps are transformed to the financial center of the first "timeDate" object. This feature is now also supported by R's "POSIXct" class. However, it was not available in previous versions of the class.

> ZH <- timeDate("2010-01-01 16:00", zone = "GMT",
+                FinCenter = "Zurich")
> NY <- timeDate("2010-01-01 18:00", zone = "GMT",
+                FinCenter = "NewYork")
> c(ZH, NY)
Zurich
[1] [2010-01-01 17:00:00] [2010-01-01 19:00:00]
> c(NY, ZH)
NewYork
[1] [2010-01-01 13:00:00] [2010-01-01 11:00:00]

Rounding and truncation methods

The rounding and truncation methods have similar functionalities to those of their counterparts in other date-time classes. However, the default unit argument is not the same.

Summary

The timeDate package offers functions for calendar manipulations for business days, weekends, and public and ecclesiastical holidays that are of interest in financial applications as well as in other fields. Moreover, the financial center concept facilitates the mixing of data collected in different time zones and the manipulation of data recorded in the same time zone but with different DST rules.

Acknowledgments

We thank the anonymous referees and the editor for their valuable comments as well as all R users and developers who have helped us to improve the timeDate package.

Bibliography

R.I. Bartky and E. Harrison. Standard and Daylight Saving Time. Scientific American, 240:46–53, 1979.

R. Bateman. Time Functionality in the Standard C Library. Novell AppNotes, September Issue, 73–85, 2000.

N. Dershowitz and E.M. Reingold. Calendrical Calculations. Software – Practice and Experience, 20:899–928, 1990.

N. Dershowitz and E.M. Reingold. Calendrical Calculations. Cambridge University Press, 1997.

G. Grothendieck and T. Petzoldt. Date and Time Classes in R. R News, 4(1):29–32, 2004.

ISO-8601. Data Elements and Interchange Formats – Information Interchange, Representation of Dates and Time. International Organization for Standardization, Reference Number ISO 8601, 1988.

M.J. Montes. Butcher's Algorithm for Calculating the Date of Easter in the Gregorian Calendar, 1996.

B.D. Ripley and K. Hornik. Date-Time Classes. R News, 1(2):8–12, 2001.


P.K. Seidelmann (Editor). Explanatory Supplement to the Astronomical Almanac. University Science Books, Herndon, 1992.

C. Tøndering. Frequently Asked Questions about Calendars, 2008. URL http://www.tondering.dk/claus/calendar.html.

D. Würtz and Y. Chalabi, with contributions from M. Maechler, J.W. Byers, and others. timeDate: Rmetrics – Chronological and Calendarical Objects, 2011. URL http://cran.r-project.org/web/packages/timeDate/.

D. Würtz, Y. Chalabi and A. Ellis. Chronological Objects with Rmetrics, 2010. URL http://www.rmetrics.org/ebooks.

Yohan Chalabi
Finance Online GmbH, Zurich &
Institute of Theoretical Physics, ETH Zurich, Switzerland
[email protected]

Martin Mächler
Seminar für Statistik, ETH Zurich, Switzerland
[email protected]

Diethelm Würtz
Institute of Theoretical Physics, ETH Zurich, Switzerland
[email protected]

Appendix

Session Info

> toLatex(sessionInfo())
• R version 2.13.0 (2011-04-13), x86_64-apple-darwin10.7.0
• Locale: C ...
• Base packages: base, datasets, grDevices, graphics, methods, stats, utils
• Other packages: timeDate 2130.93

Tables

1. Special dates

timeFirstDayInMonth    First day in a given month
timeLastDayInMonth     Last day in a given month
timeFirstDayInQuarter  First day in a given quarter
timeLastDayInQuarter   Last day in a given quarter
timeNdayOnOrAfter      Day "on-or-after" n-days
timeNdayOnOrBefore     Day "on-or-before" n-days
timeNthNdayInMonth     N-th occurrence of a n-day in month
timeLastNdayInMonth    Last n-day in month

2. List of holidays

Advent1st, Advent2nd, Advent3rd, Advent4th, AllSaints, AllSouls, Annunciation, Ascension, AshWednesday, AssumptionOfMary, BirthOfVirginMary, BoxingDay, CACanadaDay, CACivicProvincialHoliday, CALabourDay, CAThanksgivingDay, CAVictoriaDay, CHAscension, CHBerchtoldsDay, CHConfederationDay, CHKnabenschiessen, CHSechselaeuten, CaRemembranceDay, CelebrationOfHolyCross, ChristTheKing, ChristmasDay, ChristmasEve, CorpusChristi, DEAscension, DEChristmasEve, DECorpusChristi, DEGermanUnity, DENewYearsEve, Easter, EasterMonday, EasterSunday, Epiphany, FRAllSaints, FRArmisticeDay, FRAscension, FRAssumptionVirginMary, FRBastilleDay, FRFetDeLaVictoire1945, GBBankHoliday, GBMayDay, GBMilleniumDay, GBSummerBankHoliday, GoodFriday, ITAllSaints, ITAssumptionOfVirginMary, ITEpiphany, ITImmaculateConception, ITLiberationDay, ITStAmrose, JPAutumnalEquinox, JPBankHolidayDec31, JPBankHolidayJan2, JPBankHolidayJan3, JPBunkaNoHi, JPChildrensDay, JPComingOfAgeDay, JPConstitutionDay, JPEmperorsBirthday, JPGantan, JPGreeneryDay, JPHealthandSportsDay, JPKeirouNOhi, JPKenkokuKinenNoHi, JPKenpouKinenBi, JPKinrouKanshaNoHi, JPKodomoNoHi, JPKokuminNoKyujitu, JPMarineDay, JPMidoriNoHi, JPNatFoundationDay, JPNationHoliday, JPNationalCultureDay, JPNewYearsDay, JPRespectForTheAgedDay, JPSeijinNoHi, JPShuubunNoHi, JPTaiikuNoHi, JPTennouTanjyouBi, JPThanksgivingDay, JPUmiNoHi, LaborDay, MassOfArchangels, NewYearsDay, PalmSunday, Pentecost, PentecostMonday, PresentationOfLord, Quinquagesima, RogationSunday, Septuagesima, SolemnityOfMary, TransfigurationOfLord, TrinitySunday, USCPulaskisBirthday, USChristmasDay, USColumbusDay, USDecorationMemorialDay, USElectionDay, USGoodFriday, USInaugurationDay, USIndependenceDay, USLaborDay, USLincolnsBirthday, USMLKingsBirthday, USMemorialDay, USNewYearsDay, USPresidentsDay, USThanksgivingDay, USVeteransDay, USWashingtonsBirthday


The digitize Package: Extracting Numerical Data from Scatterplots

by Timothée Poisot

Abstract I present the small R package digitize, designed to extract data from scatterplots with a simple method and suited to small datasets. I present an application of this method to the extraction of data from a graph whose source is not available.

The package digitize, that I present here, allows a user to load a graphical file of a scatterplot (with the help of the read.jpeg function of the ReadImages package) in the graphical window of R, and to use the locator function to calibrate and extract the data. Calibration is done by setting four reference points on the original graph axes, two for the x values and two for the y values. The use of four points for calibration is justified by the fact that it makes calibrations on the axes possible, as y data are not taken into account for calibration of the x axis, and vice versa. This is useful when working on data that are not available in digital form, e.g. when integrating old papers in meta-analyses. Several commercial or free software packages allow a user to extract data from a plot in image format, among which we can cite PlotDigitizer (http://plotdigitizer.sourceforge.net/) or the commercial package GraphClick (http://www.arizona-software.ch/graphclick/). While these programs are powerful and quite ergonomic, for some lightweight use, one may want to load the graph directly into R, and as a result get the data directly in R format. This paper presents a rapid digitization of a scatterplot and subsequent statistical analysis of the data. As an example, we will use the data presented by Jacques Monod in a seminal microbiology paper (Monod, 1949).

The original paper presents the growth rate (in terms of divisions per hour) of the bacterium Escherichia coli in media of increasing glucose concentration. Such a hyperbolic relationship is best represented by the equation

R = RK · C / (C1 + C),

where R is the growth rate at a given concentration of nutrients C, RK is the maximal growth rate, and C1 is the concentration of nutrients at which R = 0.5 RK. In R, this function is written as

MonodGrowth <- function(params, M) {
  with(params, rK * (M / (M1 + M)))
}

In order to characterize the growth parameters of a bacterial population, one can measure its growth rate in different concentrations of nutrients. Monod (1949) proposed that, in the measured population, RK = 1.35 and C1 = 22 × 10^-6. By using the digitize package to extract the data from this paper, we will be able to get our own estimates for these parameters.

Values of RK and C1 were estimated using a simple genetic algorithm, which minimizes the error function (sum of squared errors) defined by

MonodError <- function(params, M, y) {
  with(params,
       sum((MonodGrowth(params, M) - y)^2))
}
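As an aside, once the calibrated data set is available (the data object created later in this article), a general-purpose optimizer such as optim() can stand in for the genetic algorithm used here; the starting values below are arbitrary:

# Sketch only: optim() as a stand-in for the genetic algorithm.
# 'data' is the calibrated data.frame produced later in the article.
fit <- optim(par = c(rK = 1, M1 = 20),
             fn  = function(p) MonodError(as.list(p),
                                          M = data$x, y = data$y))
out <- list(set = as.list(fit$par))  # matches the 'out$set' used below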
Figure 1: Original figure, as obtained after pointing and clicking the data. Calibration was made using the points x1 = 1, x2 = 8, y1 = 0.5 and y2 = 1. All the points that have been clicked are marked by a red point. The digitization of each series is stopped by right-clicking or pressing the Esc key.

The first step when using the digitize package is to specify four points on the graph that will be used to calibrate the axes. They must be in the following order: leftmost x, rightmost x, lower y, upper y. For the first two of them, the y value is not important (and vice versa). For this example, it is assumed that we set the first two points at x1 = 1 and x2 = 8, and the two last points at y1 = 0.5 and y2 = 1, simply by clicking in the graphical window at these positions (preferentially on the axes). It should be noted that it is not necessary to calibrate using the extremities of the axes.

Loading the figure and printing it in the current device for calibration is done by

cal <- ReadAndCal('monod.jpg')

Once the graph appears in the window, the user must input (by clicking on them) the four calibration points, marked as blue crosses. The calibration values will be stored in the cal object, which is a list with x and y values.

The next step is to read the data, simply by pointing and clicking on the graph. This is done by calling the DigitData function, whose arguments are the type of lines/points drawn.

growth <- DigitData(col = 'red')

When all the points have been identified, the digitization is stopped in the same way that one stops the locator function, i.e. either by right-clicking or pressing the Esc key (see ?locator). The outcome of this step is shown in Figure 1. The next step for these data to be exploitable is to use the calibration information to obtain the correct coordinates of the points. We can do this by using the function Calibrate(data, calpoints, x1, x2, y1, y2), with the correct coordinates of the calibration points (x1, x2, y1 and y2 correspond, respectively, to the points x1, x2, y1 and y2).

data <- Calibrate(growth, cal, 1, 8, 0.5, 1)

The internals of the function Calibrate are quite simple. If we consider X = (X1, X2) a vector containing the x coordinates of the calibration points for the x axis on the graphic window, and x = (x1, x2) a vector with their true values, it is straightforward that x = aX + b. As such, performing a simple linear regression of x against X allows us to determine the coefficients to convert the data. The same procedure is repeated for the y axis. One advantage of this method of calibration is that you do not need to focus on the y value of the points used for x calibration, and reciprocally. It means that in order to accurately calibrate a graph, you only need to have two x coordinates and two y coordinates. Eventually, it is very simple to calibrate the graphic by setting the calibration points directly on the tick marks of their axes.
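The idea can be spelled out in two lines of R (an illustration with made-up window coordinates, not the package source):

# Hypothetical window coordinates X of the two clicked x-axis
# calibration points, and their true axis values x.
X <- c(120, 480)
x <- c(1, 8)
coef(lm(x ~ X))  # intercept b and slope a, so that x = a * X + b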
The object returned by Calibrate is of class "data.frame", with columns "x" and "y" representing the x and y coordinates, so that we can plot it directly. The following code produces Figure 2, assuming that out is a list containing the arguments of MonodGrowth obtained after optimization (out$set), and paper is the same list with the values of Monod (1949).

plot(data$x, data$y, pch = 20, col = 'grey',
     xlab = 'Nutrients concentration',
     ylab = 'Divisions per hour')
points(xcal, MonodGrowth(out$set, xcal),
       type = 'l', lty = 1, lwd = 2)
points(xcal, MonodGrowth(paper, xcal),
       type = 'l', lty = 2)
legend('bottomright',
       legend = c('data', 'best fit', 'paper value'),
       pch = c(20, NA, NA),
       lty = c(NA, 1, 2),
       lwd = c(NA, 2, 1),
       col = c('grey', 'black', 'black'),
       bty = 'n')

Figure 2: Result of the digitization. Original data are in grey, the proposed fit is in the dashed line, and the estimate realized from the raw data is in the bold line. Our estimates for the growth parameters are RK = 1.43 and C1 = 62 × 10^-6.

While using the MonodError function with the proposed parameters yields a sum of squares of 1.32, our estimated parameters minimize this value to 0.45, thus suggesting that the values of RK and C1 presented in the paper are not optimal given the data.

Conclusion

I presented an example of using the digitize package to reproduce data in an ancient paper that are not available in digital form. While the principle of this package is really simple, it allows for a quick extraction of data from a scatterplot (or any kind of planar display), which can be useful when data from old or proprietary figures are needed.

Acknowledgements

Thanks are due to two anonymous referees for their comments on an earlier version of this manuscript.

Bibliography

J. Monod. The growth of bacterial cultures. Annual Reviews in Microbiology, 3(1):371–394, 1949.

Timothée Poisot
Université Montpellier II
Institut des Sciences de l'Évolution, UMR 5554
Place Eugène Bataillon
34095 Montpellier CEDEX 05
France
[email protected]


Differential Evolution with DEoptim

An Application to Non-Convex Portfolio Optimization

by David Ardia, Kris Boudt, Peter Carl, Katharine M. Mullen and Brian G. Peterson

Abstract The R package DEoptim implements the Differential Evolution algorithm. This algorithm is an evolutionary technique similar to classic genetic algorithms that is useful for the solution of global optimization problems. In this note we provide an introduction to the package and demonstrate its utility for financial applications by solving a non-convex portfolio optimization problem.

Introduction

Differential Evolution (DE) is a search heuristic introduced by Storn and Price (1997). Its remarkable performance as a global optimization algorithm on continuous numerical minimization problems has been extensively explored (Price et al., 2006). DE has also become a powerful tool for solving optimization problems that arise in financial applications: for fitting sophisticated models (Gilli and Schumann, 2009), for performing model selection (Maringer and Meyer, 2008), and for optimizing portfolios under non-convex settings (Krink and Paterlini, 2011). DE is available in R with the package DEoptim.

In what follows, we briefly sketch the DE algorithm and discuss the content of the package DEoptim. The utility of the package for financial applications is then explored by solving a non-convex portfolio optimization problem.

Differential Evolution

DE belongs to the class of genetic algorithms (GAs), which use biology-inspired operations of crossover, mutation, and selection on a population in order to minimize an objective function over the course of successive generations (Holland, 1975). As with other evolutionary algorithms, DE solves optimization problems by evolving a population of candidate solutions using alteration and selection operators. DE uses floating-point instead of bit-string encoding of population members, and arithmetic operations instead of logical operations in mutation, in contrast to classic GAs.

Let NP denote the number of parameter vectors (members) x ∈ R^d in the population, where d denotes dimension. In order to create the initial generation, NP guesses for the optimal value of the parameter vector are made, either using random values between upper and lower bounds (defined by the user) or using values given by the user. Each generation involves creation of a new population from the current population members {x_i | i = 1,...,NP}, where i indexes the vectors that make up the population. This is accomplished using differential mutation of the population members. An initial mutant parameter vector v_i is created by choosing three members of the population, x_i1, x_i2 and x_i3, at random. Then v_i is generated as

v_i = x_i1 + F · (x_i2 − x_i3),

where F is a positive scale factor, effective values for which are typically less than one. After the first mutation operation, mutation is continued until d mutations have been made, with a crossover probability CR ∈ [0,1]. The crossover probability CR controls the fraction of the parameter values that are copied from the mutant. Mutation is applied in this way to each member of the population. If an element of the trial parameter vector is found to violate the bounds after mutation and crossover, it is reset in such a way that the bounds are respected (with the specific protocol depending on the implementation). Then, the objective function values associated with the children are determined. If a trial vector has an equal or lower objective function value than the previous vector, it replaces the previous vector in the population; otherwise the previous vector remains. Variations of this scheme have also been proposed; see Price et al. (2006).

Intuitively, the effect of the scheme is that the shape of the distribution of the population in the search space is converging with respect to size and direction towards areas with high fitness. The closer the population gets to the global optimum, the more the distribution will shrink and therefore reinforce the generation of smaller difference vectors. For more details on the DE strategy, we refer the reader to Price et al. (2006) and Storn and Price (1997).
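The following schematic R implementation of one DE generation may help fix ideas. It is our own sketch of the scheme described above; DEoptim's actual internals are written in C and differ in detail, for example in how mutation and crossover are interleaved:

# One generation of a basic DE scheme; 'pop' is an NP x d matrix,
# one candidate solution per row.
de_generation <- function(pop, fn, lower, upper, F = 0.8, CR = 0.9) {
  NP <- nrow(pop)
  d  <- ncol(pop)
  for (i in seq_len(NP)) {
    idx <- sample(setdiff(seq_len(NP), i), 3)  # three distinct other members
    v <- pop[idx[1], ] + F * (pop[idx[2], ] - pop[idx[3], ])  # mutation
    cross <- runif(d) < CR                     # crossover mask
    cross[sample(d, 1)] <- TRUE                # keep at least one mutant element
    trial <- ifelse(cross, v, pop[i, ])
    trial <- pmin(pmax(trial, lower), upper)   # one way to respect the bounds
    if (fn(trial) <= fn(pop[i, ])) pop[i, ] <- trial  # greedy selection
  }
  pop
}

Starting from a uniformly drawn population and calling de_generation() repeatedly with an objective such as the Rastrigin function used below mimics the convergence behaviour described in the text.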


The package DEoptim

DEoptim (Ardia et al., 2011) was first published on CRAN in 2005. Since becoming publicly available, it has been used by several authors to solve optimization problems arising in diverse domains. We refer the reader to Mullen et al. (2011) for a detailed description of the package.

DEoptim consists of the core function DEoptim whose arguments are:

• fn: the function to be optimized (minimized).

• lower, upper: two vectors specifying scalar real lower and upper bounds on each parameter to be optimized. The implementation searches between lower and upper for the global optimum of fn.

• control: a list of tuning parameters, among which: NP (default: 10 · d), the number of population members, and itermax (default: 200), the maximum number of iterations (i.e., population generations) allowed. For details on the other control parameters, the reader is referred to the documentation manual (by typing ?DEoptim). For convenience, the function DEoptim.control() returns a list with default elements of control.

• ...: allows the user to pass additional arguments to the function fn.

The output of the function DEoptim is a member of the S3 class DEoptim. Members of this class have a plot and a summary method that allow analysis of the optimizer's output.

Let us quickly illustrate the package's usage with the minimization of the Rastrigin function in R^2, which is a common test for global optimization:

> Rastrigin <- function(x) {
+   sum(x^2 - 10 * cos(2 * pi * x)) + 20
+ }

The global minimum is zero at point x = (0,0)′. A perspective plot of the function is shown in Figure 1.

The function DEoptim searches for a minimum of the objective function between lower and upper bounds. A call to DEoptim can be made as follows:

> set.seed(1234)
> DEoptim(fn = Rastrigin,
+         lower = c(-5, -5),
+         upper = c(5, 5),
+         control = list(storepopfrom = 1))

The above call specifies the objective function to minimize, Rastrigin, the lower and upper bounds on the parameters, and, via the control argument, that we want to store intermediate populations from the first generation onwards (storepopfrom = 1). Storing intermediate populations allows us to examine the progress of the optimization in detail. Upon initialization, the population is comprised of 50 random values drawn uniformly within the lower and upper bounds. The members of the population generated by the above call are plotted at the end of different generations in Figure 2. DEoptim consistently finds the minimum of the function (marked with an open white circle) within 200 generations using the default settings. We have observed that DEoptim solves the Rastrigin problem more efficiently than the simulated annealing method available in the R function optim (for all annealing schedules tried). Note that users interested in exact reproduction of results should set the seed of their random number generator before calling DEoptim (using set.seed). DE is a randomized algorithm, and the results may vary between runs.

Finally, note that DEoptim relies on repeated evaluation of the objective function in order to move the population toward a global minimum. Therefore, users interested in making DEoptim run as fast as possible should ensure that evaluation of the objective function is as efficient as possible. Using pure R code, this may often be accomplished using vectorization. Writing parts of the objective function in a lower-level language like C or Fortran may also increase speed.
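To make the vectorization point concrete, here is a deliberately loop-based version of the Rastrigin function for contrast (our own illustration, not from the package); the sum() form used above already operates on the whole vector at once and is the faster idiom in pure R:

# Equivalent but slower: an explicit element-by-element loop.
RastriginLoop <- function(x) {
  s <- 0
  for (xi in x) s <- s + xi^2 - 10 * cos(2 * pi * xi)
  s + 20
}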

Figure 1: Perspective plot of the Rastrigin function.
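A plot in the spirit of Figure 1 can be produced with standard graphics; the grid resolution and viewing angles below are our own choices, not necessarily those used by the authors:

# Evaluate Rastrigin on a grid and draw a perspective plot.
x1 <- x2 <- seq(-5, 5, length.out = 50)
z  <- outer(x1, x2, function(a, b) apply(cbind(a, b), 1, Rastrigin))
persp(x1, x2, z, theta = 30, phi = 30,
      xlab = "x_1", ylab = "x_2", zlab = "f(x_1, x_2)")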

Risk allocation portfolios

Mean-risk models were developed in the early fifties for the portfolio selection problem. Initially, variance was used as a risk measure. Since then, many alternative risk measures have been proposed. The question of which risk measure is most appropriate is still the subject of much debate. Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR) are the most popular measures of downside risk. VaR is the negative value of the portfolio return such that lower returns will only occur with at most a preset probability level, which typically is between one and five percent. CVaR is the negative value of the mean of all return realizations that are below the VaR. There is a voluminous literature on portfolio optimization problems with VaR and CVaR risk measures; see, e.g., Fábián and Veszprémi (2008) and the references therein.


[Figure 2 shows four scatter panels, Generation 1, Generation 20, Generation 40 and Generation 80, with axes x1 and x2 ranging over −4 to 4.]

Figure 2: The population (marked with black dots) associated with various generations of a call to DEoptim as it searches for the minimum of the Rastrigin function at point x = (0,0)′ (marked with an open white circle). The minimum is consistently determined within 200 generations using the default settings of DEoptim.


Modern portfolio selection considers additional criteria, such as maximizing upper-tail skewness and liquidity or minimizing the number of securities in the portfolio. In the recent portfolio literature, various authors have advocated incorporating risk contributions into the portfolio allocation problem. Qian's (2005) Risk Parity Portfolio allocates portfolio variance equally across the portfolio components. Maillard et al. (2010) call this the Equally-Weighted Risk Contribution (ERC) Portfolio. They derive the theoretical properties of the ERC portfolio and show that its volatility is located between those of the minimum variance and equal-weight portfolios. Zhu et al. (2010) study optimal mean-variance portfolio selection under a direct constraint on the contributions to portfolio variance. Because the resulting optimization model is a non-convex quadratically constrained quadratic programming problem, they develop a branch-and-bound algorithm to solve it.

Boudt et al. (2010a) propose to use the contributions to portfolio CVaR as an input in the portfolio optimization problem to create portfolios whose percentage CVaR contributions are aligned with the desired level of CVaR risk diversification. Under the assumption of normality, the percentage CVaR contribution of asset i is given by the following explicit function of the vector of weights w = (w1,...,wd)′, mean vector µ = (µ1,...,µd)′ and covariance matrix Σ:

[ w_i ( −µ_i + (Σw)_i/√(w′Σw) · φ(z_α)/α ) ] / [ −w′µ + √(w′Σw) · φ(z_α)/α ],

with z_α the α-quantile of the standard normal distribution and φ(·) the standard normal density function. Throughout the paper we set α = 5%. As we show here, the package DEoptim is well suited to solve these problems. Note that it is also the evolutionary optimization strategy used in the package PortfolioAnalytics (Boudt et al., 2010b).
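For readers who want to check the formula numerically, it can be transcribed directly into R; this is our own sketch, and the ES() function used below computes the same quantities in a more general way:

# Percentage CVaR contributions under normality, as in the formula above.
pctCVaR <- function(w, mu, sigma, alpha = 0.05) {
  z   <- qnorm(alpha)
  sdp <- sqrt(drop(t(w) %*% sigma %*% w))            # portfolio volatility
  num <- w * (-mu + drop(sigma %*% w) / sdp * dnorm(z) / alpha)
  num / (-sum(w * mu) + sdp * dnorm(z) / alpha)      # contributions sum to 1
}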
To illustrate this, consider as a stylized example a five-asset portfolio invested in the stocks with tickers GE, IBM, JPM, MSFT and WMT. We keep the dimension of the problem low in order to allow interested users to reproduce the results using a personal computer. More realistic portfolio applications can be obtained in a straightforward manner from the code below, by expanding the number of parameters optimized. Interested readers can also see the DEoptim package vignette for a 100-parameter portfolio optimization problem that is typical of those encountered in practice.

We first download ten years of monthly data using the function getSymbols of the package quantmod (Ryan, 2010). Then we compute the log-return series and the mean and covariance matrix estimators. For an overview of alternative estimators of the covariance matrix, such as outlier-robust or shrinkage estimators, and their implementation in portfolio allocation, we refer the reader to Würtz et al. (2009, Chapter 4). These estimators might yield better performance in the case of small samples, outliers or departures from normality.

> library("quantmod")
> tickers <- c("GE", "IBM", "JPM", "MSFT", "WMT")
> getSymbols(tickers,
+            from = "2000-12-01",
+            to = "2010-12-31")
> P <- NULL
> for(ticker in tickers) {
+   tmp <- Cl(to.monthly(eval(parse(text = ticker))))
+   P <- cbind(P, tmp)
+ }
> colnames(P) <- tickers
> R <- diff(log(P))
> R <- R[-1, ]
> mu <- colMeans(R)
> sigma <- cov(R)

We first compute the equal-weight portfolio. This is the portfolio with the highest weight diversification and is often used as a benchmark. But is the risk exposure of this portfolio effectively well diversified across the different assets? This question can be answered by computing the percentage CVaR contributions with the function ES in the package PerformanceAnalytics (Carl and Peterson, 2011). These percentage CVaR contributions indicate how much each asset contributes to the total portfolio CVaR.

> library("PerformanceAnalytics")
> pContribCVaR <- ES(weights = rep(0.2, 5),
+                    method = "gaussian",
+                    portfolio_method = "component",
+                    mu = mu,
+                    sigma = sigma)$pct_contrib_ES
> rbind(tickers, round(100 * pContribCVaR, 2))
        [,1]    [,2]   [,3]   [,4]    [,5]
tickers "GE"    "IBM"  "JPM"  "MSFT"  "WMT"
        "21.61" "18.6" "25.1" "25.39" "9.3"

We see that in the equal-weight portfolio, 25.39% of the portfolio CVaR risk is caused by the 20% investment in MSFT, while the 20% investment in WMT only causes 9.3% of total portfolio CVaR. The high risk contribution of MSFT is due to its high standard deviation and low average return (reported in percent):

> round(100 * mu, 2)
   GE   IBM   JPM  MSFT   WMT
-0.80  0.46 -0.06 -0.37  0.00
> round(100 * diag(sigma)^(1/2), 2)
   GE   IBM   JPM  MSFT   WMT
 8.90  7.95  9.65 10.47  5.33

We now use the function DEoptim of the package DEoptim to find the portfolio weights for which the portfolio has the lowest CVaR and each investment can contribute at most 22.5% to total portfolio CVaR risk.

For this, we first define our objective function to minimize. The current implementation of DEoptim allows for constraints on the domain space. To include the risk budget constraints, we add them to the objective function through a penalty function. As such, we allow the search algorithm to consider infeasible solutions. A portfolio which is unacceptable for the investor must be penalized enough to be rejected by the minimization process, and the larger the violation of the constraint, the larger the increase in the value of the objective function. A standard expression of these violations is α × |violation|^p. Crama and Schyns (2003) describe several ways to calibrate the scaling factors α and p. If these values are too small, then the penalties do not play their expected role and the final solution may be infeasible. On the other hand, if α and p are too large, then the CVaR term becomes negligible with respect to the penalty; thus, small variations of w can lead to large variations of the penalty term, which mask the effect on the portfolio CVaR. We set α = 10^3 and p = 1, but recognize that better choices may be possible and depend on the problem at hand.

> obj <- function(w) {
+   if (sum(w) == 0) { w <- w + 1e-2 }
+   w <- w / sum(w)
+   CVaR <- ES(weights = w,
+              method = "gaussian",
+              portfolio_method = "component",
+              mu = mu,
+              sigma = sigma)
+   tmp1 <- CVaR$ES
+   tmp2 <- max(CVaR$pct_contrib_ES - 0.225, 0)
+   out <- tmp1 + 1e3 * tmp2
+ }

The penalty introduced in the objective function is non-differentiable, and therefore standard gradient-based optimization routines cannot be used. In contrast, DEoptim is designed to consistently find a good approximation to the global minimum of the optimization problem:

> set.seed(1234)
> out <- DEoptim(fn = obj,
+                lower = rep(0, 5),
+                upper = rep(1, 5))
> out$optim$bestval
[1] 0.1143538
> wstar <- out$optim$bestmem
> wstar <- wstar / sum(wstar)
> rbind(tickers, round(100 * wstar, 2))
        par1    par2    par3    par4    par5
tickers "GE"    "IBM"   "JPM"   "MSFT"  "WMT"
        "18.53" "21.19" "11.61" "13.37" "35.3"
> 100 * (sum(wstar * mu) - mean(mu))
[1] 0.04827935

Note that the main differences with the equal-weight portfolio are the low weights given to JPM and MSFT and the high weight given to WMT. As can be seen from the last two lines, this minimum risk portfolio has a higher expected return than the equal-weight portfolio. The following code illustrates that DEoptim yields superior results to the gradient-based optimization routines available in R.

> out <- optim(par = rep(0.2, 5),
+              fn = obj,
+              method = "L-BFGS-B",
+              lower = rep(0, 5),
+              upper = rep(1, 5))
> out$value
[1] 0.1255093
> out <- nlminb(start = rep(0.2, 5),
+               objective = obj,
+               lower = rep(0, 5),
+               upper = rep(1, 5))
> out$objective
[1] 0.1158250

Even on this relatively simple stylized example, the optim and nlminb routines converged to local minima.

Suppose now the investor is interested in the most risk-diversified portfolio whose expected return is higher than that of the equal-weight portfolio. This amounts to minimizing the largest CVaR contribution subject to a return target, and can be implemented as follows:

> obj <- function(w) {
+   if(sum(w) == 0) { w <- w + 1e-2 }
+   w <- w / sum(w)
+   contribCVaR <- ES(weights = w,
+                     method = "gaussian",
+                     portfolio_method = "component",
+                     mu = mu,
+                     sigma = sigma)$contribution
+   tmp1 <- max(contribCVaR)
+   tmp2 <- max(mean(mu) - sum(w * mu), 0)
+   out <- tmp1 + 1e3 * tmp2
+ }
> set.seed(1234)
> out <- DEoptim(fn = obj,
+                lower = rep(0, 5),
+                upper = rep(1, 5))
> wstar <- out$optim$bestmem
> wstar <- wstar / sum(wstar)
> rbind(tickers, round(100 * wstar, 2))
        par1    par2    par3    par4    par5
tickers "GE"    "IBM"   "JPM"   "MSFT"  "WMT"
        "17.38" "19.61" "14.85" "15.19" "32.98"
> 100 * (sum(wstar * mu) - mean(mu))
[1] 0.04150506

This portfolio invests more in the JPM stock and less in GE (which has the lowest average return) compared to the portfolio with the upper 22.5% percentage CVaR constraint. We refer to Boudt et al. (2010a) for a more elaborate study on using CVaR allocations as an objective function or constraint in the portfolio optimization problem.

A classic risk/return (i.e., CVaR/mean) scatter chart showing the results for portfolios tested by DEoptim is displayed in Figure 3.


Gray elements depict the results for all tested portfolios (using hexagon binning, which is a form of bivariate histogram, see Carr et al. (2011); higher density regions are darker). The yellow-red line shows the path of the best member of the population over time, with the darkest solution at the end being the optimal portfolio. We can notice how DEoptim does not spend much time computing solutions in the scatter space that are suboptimal, but concentrates the bulk of the calculation time in the vicinity of the final best portfolio. Note that we are not looking for the tangency portfolio; simple mean/CVaR optimization can be achieved with standard optimizers. We are looking here for the best balance between return and risk concentration.

We have utilized the mean return/CVaR concentration examples here as more realistic, but still stylized, examples of non-convex objectives and constraints in portfolio optimization; other non-convex objectives, such as drawdown minimization, are also common in real portfolios, and are likewise suitable to application of Differential Evolution. One of the key issues in practice with real portfolios is that a portfolio manager rarely has only a single objective or only a few simple objectives combined. For many combinations of objectives, there is no unique global optimum, and the constraints and objectives formed lead to a non-convex search space. It may take several hours on very fast machines to get the best answers, and the best answers may not be a true global optimum; they are just as close as feasible given potentially competing and contradictory objectives.

When the constraints and objectives are relatively simple, and may be reduced to quadratic, linear, or conical forms, a simpler optimization solver will produce answers more quickly. When the objectives are more layered, complex, and potentially contradictory, as those in real portfolios tend to be, DEoptim or other global optimization algorithms such as those integrated into PortfolioAnalytics provide a portfolio manager with a feasible option for optimizing their portfolio under real-world non-convex constraints and objectives. The PortfolioAnalytics framework allows any arbitrary R function to be part of the objective set, and allows the user to set the relative weighting on any specific objective, and to use the appropriately tuned optimization solver algorithm to locate portfolios that most closely match those objectives.

Summary

In this note we have introduced DE and DEoptim. The package DEoptim provides a means of applying the DE algorithm in the R language and environment for statistical computing. DE and the package DEoptim have proven themselves to be powerful tools for the solution of global optimization problems in a wide variety of fields. We have referred interested users to Price et al. (2006) and Mullen et al. (2011) for a more extensive introduction, and further pointers to the literature on DE. The utility of using DEoptim was further demonstrated with a simple example of a stylized non-convex portfolio risk contribution allocation, with users referred to PortfolioAnalytics for portfolio optimization using DE with real portfolios under non-convex constraints and objectives.

The latest version of the package includes the adaptive Differential Evolution framework by Zhang and Sanderson (2009). Future work will also be directed at parallelization of the implementation. The DEoptim project is hosted on R-forge at https://r-forge.r-project.org/projects/deoptim/.

Acknowledgements

The authors would like to thank Juan Ospina and Enrico Schumann for useful comments and Rainer Storn for his advocacy of DE and making his code publicly available. The authors are also grateful to two anonymous reviewers for helpful comments that have led to improvements of this note. Kris Boudt gratefully acknowledges financial support from the National Bank of Belgium.

Disclaimer

The views expressed in this note are the sole responsibility of the authors and do not necessarily reflect those of aeris CAPITAL AG, Guidance Capital Management, Cheiron Trading, or any of their affiliates. Any remaining errors or shortcomings are the authors' responsibility.


[Figure 3 shows a hexagon-binned scatter of monthly CVaR (in %, roughly 14 to 24) against monthly mean return (in %, roughly −0.4 to 0.4).]

Figure 3: Risk/return scatter chart showing the results for portfolios tested by DEoptim.


Bibliography

D. Ardia, K. M. Mullen, B. G. Peterson and J. Ulrich. DEoptim: Differential Evolution Optimization in R, 2011. URL http://CRAN.R-project.org/package=DEoptim. R package version 2.1-0.

K. Boudt, P. Carl, and B. G. Peterson. Portfolio optimization with conditional value-at-risk budgets. Working paper, 2010a.

K. Boudt, P. Carl, and B. G. Peterson. PortfolioAnalytics: Portfolio Analysis, including Numeric Methods for Optimization of Portfolios, 2010b. URL http://r-forge.r-project.org/projects/returnanalytics/. R package version 0.6.

P. Carl and B. G. Peterson. PerformanceAnalytics: Econometric tools for performance and risk analysis, 2011. URL http://r-forge.r-project.org/projects/returnanalytics/. R package version 1.0.3.3.

D. Carr, N. Lewin-Koh and M. Maechler. hexbin: Hexagonal binning routines, 2011. URL http://CRAN.R-project.org/package=hexbin. R package version 1.26.0.

Y. Crama and M. Schyns. Simulated annealing for complex portfolio selection problems. European Journal of Operations Research, 150(3):546–571, 2003.

C. Fábián and A. Veszprémi. Algorithms for handling CVaR constraints in dynamic stochastic programming models with applications to finance. Journal of Risk, 10(3):111–131, 2008.

M. Gilli and E. Schumann. Heuristic optimisation in financial modelling. COMISEF wps-007 09/02/2009, 2009.

J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.

T. Krink and S. Paterlini. Multiobjective optimization using Differential Evolution for real-world portfolio optimization. Computational Management Science, 8:157–179, 2011.

S. Maillard, T. Roncalli, and J. Teiletche. On the properties of equally-weighted risk contributions portfolios. Journal of Portfolio Management, 36(4):60–70, 2010.

D. G. Maringer and M. Meyer. Smooth transition autoregressive models: New approaches to the model selection problem. Studies in Nonlinear Dynamics & Econometrics, 12(1):1–19, 2008.

K. M. Mullen, D. Ardia, D. L. Gil, D. Windover, and J. Cline. DEoptim: An R package for global optimization by Differential Evolution. Journal of Statistical Software, 40(6):1–26, 2011.

K. V. Price, R. M. Storn, and J. A. Lampinen. Differential Evolution: A Practical Approach to Global Optimization. Springer-Verlag, Berlin, Heidelberg, second edition, 2006. ISBN 3540209506.

E. Qian. Risk parity portfolios: Efficient portfolios through true diversification of risk. PanAgora Asset Management working paper, 2005.

J. A. Ryan. quantmod: Quantitative Financial Modelling Framework, 2010. URL http://CRAN.R-project.org/package=quantmod. R package version 0.3-16.

O. Scaillet. Nonparametric estimation and sensitivity analysis of expected shortfall. Mathematical Finance, 14(1):74–86, 2002.

R. Storn and K. Price. Differential Evolution – A simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11:341–359, 1997.

D. Würtz, Y. Chalabi, W. Chen and A. Ellis. Portfolio Optimization with R/Rmetrics. Rmetrics Association and Finance Online, 2009.

J. Zhang and A. C. Sanderson. JADE: Adaptive Differential Evolution with optional external archive. IEEE Transactions on Evolutionary Computation, 13(5):945–958, 2009.

S. Zhu, D. Li, and X. Sun. Portfolio selection with marginal risk control. Journal of Computational Finance, 14(1):1–26, 2010.

T. Krink and S. Paterlini. Multiobjective optimization S. Zhu, D. Li, and X. Sun. Portfolio selection with using Differential Evolution for real-world portfo- marginal risk control. Journal of Computational Fi- lio optimization. Computational Management Sci- nance, 14(1):1–26, 2010. ence, 8:157–179, 2011. S. Maillard, T. Roncalli, and J. Teiletche. On the prop- erties of equally-weighted risk contributions port- David Ardia folios. Journal of Portfolio Management, 36(4):60–70, aeris CAPITAL AG, Switzerland 2010. [email protected] D. G. Maringer and M. Meyer. Smooth transition Kris Boudt autoregressive models: New approaches to the Lessius and K.U.Leuven, Belgium model selection problem. Studies in Nonlinear Dy- [email protected] namics & Econometrics, 12(1):1–19, 2008. K. M. Mullen, D. Ardia, D. L. Gil, D. Windover, and Peter Carl J. Cline. DEoptim: An R package for global opti- Guidance Capital Management, Chicago, IL mization by Differential Evolution. Journal of Sta- [email protected] tistical Software, 40(6):1–26, 2011. Katharine M. Mullen K. V. Price, R. M. Storn, and J. A. Lampinen. Differ- National Institute of Standards and Technology ential Evolution: A Practical Approach to Global Opti- Gaithersburg, MD mization. Springer-Verlag, Berlin, Heidelberg, sec- [email protected] ond edition, 2006. ISBN 3540209506. E. Qian. Risk parity portfolios: Efficient portfolios Brian G. Peterson through true diversification of risk. PanAgora As- Cheiron Trading, Chicago, IL set Management working paper, 2005. [email protected]

rworldmap: A New R package for Mapping Global Data

by Andy South

Abstract rworldmap is a relatively new package available on CRAN for the mapping and visualisation of global data. The vision is to make the display of global data easier, to facilitate understanding and communication. The initial focus is on data referenced by country or grid due to the frequency of use of such data in global assessments. Tools to link data referenced by country (either name or code) to a map, and then to display the map, are provided, as are functions to map global gridded data. Country and gridded functions accept the same arguments to specify the nature of categories and colour and how legends are formatted. This package builds on the functionality of existing packages, particularly sp, maptools and fields. Example code is provided to produce maps, to link with the packages classInt, RColorBrewer and ncdf, and to plot examples of publicly available country and gridded data.

Introduction

Global datasets are becoming increasingly common and are frequently seen on the web, in journal papers and in our newspapers (for example a 'carbon atlas' of global emissions available at http://image.guardian.co.uk/sys-files/Guardian/documents/2007/12/17/CARBON_ATLAS.pdf). At the same time there is a greater interest in global issues such as climate change, the global economy, and poverty (as for example outlined in the Millenium Development Goals, http://www.un.org/millenniumgoals/bkgd.shtml). Thirdly, there is an increasing availability of software tools for visualising data in new and interesting ways. Gapminder (http://www.gapminder.org) has pioneered making UN statistics more available and intelligible using innovative visualisation tools, and Many Eyes (http://www-958.ibm.com/) provides a very impressive interface for sharing data and creating visualisations.

World maps have become so common that they have even attracted satire. The Onion's Atlas of the Planet Earth (The Onion, 2007) contains a 'Bono Awareness' world map representing 'the intensity with which artist Bono is aware of the plight and daily struggles of region', with a categorisation ranging from 'has heard of nation once' through 'moved enough by nations crisis to momentarily remove sunglasses' to 'cares about welfare of nation nearly as much as his own'.

There appears to be a gap in the market for free software tools that can be used across disciplinary boundaries to produce innovative, publication quality global visualisations. Within R there are great building blocks (particularly sp, maptools and fields) for spatial data, but users previously had to go through a number of steps if they wanted to produce world maps of their own data. Experience has shown that difficulties with linking data and creating classifications, colour schemes and legends currently constrain researchers' ability to view and display global data. We aim to reduce that constraint to allow researchers to spend more time on the more important issue of what they want to display. The vision for rworldmap is to produce a package to facilitate the visualisation and mapping of global data. Because the focus is on global data, the package can be more specialised than existing packages, making world mapping easier, partly because it doesn't have to deal with detailed local maps. Through rworldmap we aim to make it easy for R users to explore their global data and also to produce publication quality figures from their outputs.

rworldmap was partly inspired and largely funded by the UK Natural Environment Research Council (NERC) program Quantifying Uncertainty in Earth System Science (QUEST). This program brings together scientists from a wide range of disciplines including climate modellers, hydrologists and social scientists. It was apparent that while researchers have common tools for visualisation within disciplines, they tend to use different ones across disciplines and that this limits the sharing of data and methods necessary for truly interdisciplinary research. Within the project, climate and earth system modellers tended to use IDL, ecologists ArcGIS, hydrologists and social scientists Matlab, and fisheries scientists R. With the exception of R, these software products cost thousands of pounds, which acts as a considerable constraint on users being able to try out techniques used by collaborators. This high cost and learning curve of adopting new software tools hinders the sharing of data and methods between disciplines. To address this, part of the vision for rworldmap was to develop a tool that can be freely used and modified across a multi-disciplinary project, to facilitate the sharing of scripts, data and outputs. Such freely available software offers greater opportunity for collaboration with research institutes in developing countries that may not be able to afford expensive licenses.

rworldmap data inputs

rworldmap consists of tools to visualise global data and focuses on two types of data. Firstly, data that are referenced by country codes or names and secondly, data that are referenced on a grid.

Country data

There is a wealth of global country level data available on the internet including UN population data, and many global indices for, among others: Environmental Performance, Global Hunger and Multidimensional Poverty.

Data are commonly referenced by country names as these are the most easily recognised by users, but country names have the problem that vocabularies are not well conserved and many countries have a number of subtly different alternate names (e.g. Ivory Coast and Cote d'Ivoire, Laos and People's Democratic Republic of Lao). To address this problem there are ISO standard country codes of either 2 letters, 3 letters or numeric, and also 2 letter FIPS country codes, so there is still not one universally adopted standard. rworldmap supports all of these country codes and offers tools to help when data are referenced by names or other identifiers.

Gridded data

Global datasets are frequently spatially referenced on a grid, because such gridded or raster data formats offer advantages in efficiency of data storage and processing. Remotely sensed data and other values calculated from it are most frequently available in gridded formats. These can include terrestrial or marine data or both.

There are many gridded data formats; here I will concentrate on two: ESRI GridAscii and netCDF.

ESRI GridAscii files are an efficient way of storing and transferring gridded data. They are straightforward text files so can be opened by any text editor. They have a short header defining the structure of the file (e.g. number, size and position of rows and columns), followed by a single row specifying the value at each grid cell. Thus they use much less space than if the coordinates for each cell had to be specified.

Example start of a gridAscii file for a half degree global grid:

ncols 720
nrows 360
xllcorner -180
yllcorner -90
cellsize 0.5
NODATA_value -999
-999 1 0 1 1 ... [all 259200 cell values]

NetCDF is a data storage file format commonly used by climate scientists and oceanographers. NetCDF files can be multi-dimensional, e.g. holding (x,y) data for multiple attributes over multiple months, years, days etc. The package ncdf is good for reading data from netCDF files.

rworldmap functionality

rworldmap has three core functions outlined below and others that are described later.

1. joinCountryData2Map() joins user country data referenced by country names or codes to a map to enable plotting
2. mapCountryData() plots a map of country data
3. mapGriddedData() plots a map of gridded data

Joining country data to a map

To join the data to a map use joinCountryData2Map. You will need to specify the name of the column containing your country identifiers (nameJoinColumn) and the type of code used (joinCode), e.g. "ISO3" for ISO 3 letter codes or "UN" for numeric country codes.

data(countryExData)
sPDF <- joinCountryData2Map( countryExData
                           ,joinCode = "ISO3"
                           ,nameJoinColumn = "ISO3V10")

This code outputs, to the R console, a summary of how many countries are successfully joined. You can specify verbose=TRUE to get a full list of countries. The object returned (named sPDF in this case) is of type "SpatialPolygonsDataFrame" from the package sp. This object is required for the next step, displaying the map.

If you only have country names rather than codes in your data, use joinCode="NAME"; you can expect more mismatches due to the greater variation within country names mentioned previously. To address this you can use the identifyCountries() function described below, and change any country names in your data that do not exactly match those in the internal map.

Mapping country data

To plot anything other than the default map, mapCountryData requires an object of class "SpatialPolygonsDataFrame" and a specification of the name of the column containing the data to plot:

data(countryExData)
sPDF <- joinCountryData2Map( countryExData
                           ,joinCode = "ISO3"
                           ,nameJoinColumn = "ISO3V10")
mapDevice() #create world map shaped window
mapCountryData(sPDF
              ,nameColumnToPlot='BIODIVERSITY')

Figure 1: Output of mapCountryData()
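To make the join-and-plot steps concrete with data of your own, a minimal sketch along the following lines should work; the tiny dataset and its column names here are invented for illustration, and only functions described above are used.

library(rworldmap)
# a tiny made-up dataset referenced by ISO 3 letter codes
myDF <- data.frame(iso3 = c("BRA", "CHN", "IND", "NGA"),
                   score = c(2.5, 4.1, 3.3, 1.8))
sPDF <- joinCountryData2Map(myDF
                           ,joinCode = "ISO3"
                           ,nameJoinColumn = "iso3")
mapDevice() #create world map shaped window
mapCountryData(sPDF, nameColumnToPlot = "score")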


Mapping gridded data

The function mapGriddedData can accept either

1. an object of type "SpatialGridDataFrame", as defined in the package sp
2. the name of an ESRI GridAscii file as a character string
3. a 2D R matrix or array (rows by columns)

rworldmap contains a "SpatialGridDataFrame" example that can be accessed and mapped as shown in the code and figure below.

data(gridExData)
mapDevice() #create world map shaped window
mapGriddedData(gridExData)

Figure 2: Output of mapGriddedData()

Modifying map appearance

rworldmap plotting functions are set up to work with few parameters specified (usually just those identifying the data) and in such cases, default values will be used. However, there is considerable flexibility to modify the appearance of plots by specifying values for arguments. Details of argument options are provided in the help files but here are a selection of the main arguments:

* catMethod determines how the data values are put into categories (then colourPalette determines the colours for those categories). Options for catMethod are: "pretty", "fixedWidth", "diverging", "logFixedWidth", "quantiles", "categorical", or a numeric vector defining the breaks between categories. Works with the next argument (numCats), although numCats is not used for "categorical", where the data are not modified, or if the user specifies a numeric vector, where the number of categories will be a result.

* numCats specifies the favoured number of categories for the data. The number of categories used may be different if the number requested is not possible for the chosen catMethod (e.g. for "quantiles" if there are only 2 values in the data it is not possible to have more than 2 categories).

* colourPalette specifies a colour palette to use from: 1. "palette" for the current palette 2. a vector of valid colours, e.g. c("red", "white", "blue") or output from RColorBrewer 3. a string defining one of the internal rworldmap palettes from: "heat", "diverging", "white2Black", "black2White", "topo", "rainbow", "terrain", "negpos8", "negpos9".

* addLegend set to TRUE for a default legend; if set to FALSE the function addMapLegend or addMapLegendBoxes can be used to create a more flexible legend.

* mapRegion a region to zoom in on; can be set to a country name from getMap()$NAME or one of "eurasia", "africa", "latin america", "uk", "oceania" or "asia".

mapBubbles(), mapBars(), and mapPies()

Another option for displaying data is to use the mapBubbles function which allows flexible creation of bubble plots on global maps. You can specify data columns that will determine the sizing and colouring of the bubbles (using nameZSize and nameZColour). The function also accepts other spatialDataFrame objects or data frames containing columns specifying the x and y coordinates. If you wish to represent more attribute values per location there are also the newer mapBars() and mapPies() functions to produce bar and pie charts respectively (noting that pie charts may not be the best way of presenting data when there are more than a few categories).

mapDevice() #create world map shaped window
mapBubbles(dF=getMap()
          ,nameZSize="POP2005"
          ,nameZColour="REGION"
          ,colourPalette="rainbow"
          ,oceanCol="lightblue"
          ,landCol="wheat")

Figure 3: Output of mapBubbles()
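To see several of the appearance arguments described above working together, here is a short sketch; the particular argument values are illustrative only, but every argument name is documented in the list above.

data(countryExData)
sPDF <- joinCountryData2Map( countryExData
                           ,joinCode = "ISO3"
                           ,nameJoinColumn = "ISO3V10")
mapDevice() #create world map shaped window
mapCountryData(sPDF
              ,nameColumnToPlot = "BIODIVERSITY"
              ,catMethod = "quantiles"
              ,numCats = 5
              ,colourPalette = "terrain"
              ,mapRegion = "africa")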


Identifying countries

The interactive function identifyCountries() allows the user to click on the current map and, provided the cursor is sufficiently close to a country centroid, will add the country name to the map. Optionally, the name of an attribute variable can also be passed to the function to cause the value of the attribute to be added to the label, e.g. 'Cape Verde 506807' will be printed on the map if the code below is entered and the mouse clicked over those islands (if you know where they are!).

identifyCountries(getMap()
                 ,nameColumnToPlot="POP2005")

Aggregating data to produce different outputs

rworldmap offers options for aggregating half degree gridded data to countries, and in turn for aggregating country level data to regions. In both of these options a range of aggregation options are available including mean, minimum, maximum and variance.

mapHalfDegreeGridToCountries() takes a gridded input file, aggregates to a country level and plots the map. It accepts most of the same arguments as mapCountryData(); a usage sketch is given at the end of this section.

Country level data can be aggregated to global regions specified by regionType in country2Region(), which outputs as text, and mapByRegion(), which produces a map plot. The regional classifications available include SRES (The Special Report on Emissions Scenarios of the Intergovernmental Panel on Climate Change (IPCC)), GEO3 (Global Earth Observation), Stern and GBD (Global Burden of Disease).

data(countryExData)
country2Region(countryExData
              ,nameDataColumn="CLIMATE"
              ,joinCode="ISO3"
              ,nameJoinColumn="ISO3V10"
              ,regionType="Stern"
              ,FUN="mean")

Outputs this text:

meanCLIMATEbyStern
Australasia      56.92000
Caribbean        65.20000
Central America  76.11250
Central Asia     56.18000
East Asia        69.18462
Europe           73.87619
North Africa     71.00000
North America    62.70000
South America    77.01818
South Asia       77.22000
South+E Africa   75.79474
West Africa      78.68421
West Asia        49.62000

data(countryExData)
mapDevice() #create world map shaped window
mapByRegion(countryExData
           ,nameDataColumn="CLIMATE"
           ,joinCode="ISO3"
           ,nameJoinColumn="ISO3V10"
           ,regionType="Stern"
           ,FUN="mean")

Produces this map:

Figure 4: Output of mapByRegion() (map titled 'mean CLIMATE by Stern regions')

The identity of which countries are in which regions is stored in the data frame countryRegions. This also identifies which countries are currently classed by the UN as Least Developed Countries (LDC), Small Island Developing states (SID) and Landlocked Developing Countries (LLDC). To map just the Least Developed Countries the code below could be used:

data(countryRegions)
sPDF <- joinCountryData2Map( countryRegions
                           ,joinCode = "ISO3"
                           ,nameJoinColumn = "ISO3")
mapDevice() #create world map shaped window
mapCountryData(sPDF[which(sPDF$LDC=='LDC'),]
              ,nameColumnToPlot="POP2005")
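As promised above, here is a sketch of mapHalfDegreeGridToCountries() usage on the package's example grid. The aggregateOption argument name is an assumption based on the aggregation choices described in this section, so check the function's help file before relying on it.

data(gridExData)
mapDevice() #create world map shaped window
# aggregate the example half degree grid to country
# means and plot (argument name assumed;
# see ?mapHalfDegreeGridToCountries)
mapHalfDegreeGridToCountries(gridExData
                            ,aggregateOption="mean")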

Using rworldmap with other packages

classInt and RColorBrewer

While rworldmap sets many defaults internally, there are also options to use other packages to have greater flexibility. In the example below classInt is used to create the classification and RColorBrewer to specify the colours.

library(classInt)
library(RColorBrewer)

#getting small example data and joining to a map
data(countryExData)
sPDF <- joinCountryData2Map(countryExData
                           ,joinCode = "ISO3"
                           ,nameJoinColumn = "ISO3V10"
                           ,mapResolution = "coarse")

#getting class intervals
classInt <- classIntervals( sPDF[["EPI"]]
                          ,n=5, style = "jenks")
catMethod = classInt[["brks"]]

#getting colours
colourPalette <- brewer.pal(5,'RdPu')

#plot map
mapDevice() #create world map shaped window
mapParams <- mapCountryData(sPDF
                           ,nameColumnToPlot="EPI"
                           ,addLegend=FALSE
                           ,catMethod = catMethod
                           ,colourPalette=colourPalette )

#adding legend
do.call(addMapLegend
       ,c(mapParams
         ,legendLabels="all"
         ,legendWidth=0.5
         ,legendIntervals="data"
         ,legendMar = 2))

Figure 5: Output of mapCountryData() using inputs from classInt and RColorBrewer

Examples using data freely available on the web

Example country data from the web

The Happy Planet Index or HPI (http://www.happyplanetindex.org) combines country by country estimates of life expectancy, life satisfaction and ecological footprint to create an index of sustainable well-being that makes much more sense than GDP for assessing how countries are doing (Marks, 2010).

An Excel file containing the data ('hpi-2-0-results.xls') can be downloaded from the HPI website at: http://www.happyplanetindex.org/learn/download-report.html.

I copied the country data from the sheet 'All Data' and pasted it into a '.csv' file so that the header row was at the top of the file and the summary statistics at the bottom were left off. I then edited some of the column names to make them appropriate R variable names (e.g. changing 'Life Sat (0-10)' to 'LifeSat'). This data can then be read in easily using read.csv() and joinCountryData2Map().

inFile <- 'hpi2_0edited2.csv'
dF <- read.csv(inFile,header=TRUE,as.is=TRUE)
sPDF <- joinCountryData2Map(dF
                           , joinCode='NAME'
                           , nameJoinColumn='country'
                           , verbose='TRUE')

Unfortunately, but in common with many global country level datasets, the countries are specified by name alone rather than by ISO country codes. The verbose=TRUE option in joinCountryData2Map() can be used to show the names of the 10 countries that don't join due to their names being slightly different in the data than in rworldmap. The names of the countries that failed to join can be edited in the csv to be the same as those in getMap()[['NAME']], and then they will join. A selection is shown below.

name in HPI                name in rworldmap
Iran                       Iran (Islamic Republic of)
Korea                      Korea, Republic of
Laos                       Lao People's Democratic Republic
Moldova                    Republic of Moldova
Syria                      Syrian Arab Republic
Tanzania                   United Republic of Tanzania
Vietnam                    Viet Nam
United States of America   United States

The map of the HPI on the website is interesting in that the colours applied to countries are not determined by the value of the HPI for that country, but instead by the values of the three component indices for Life Expectancy, Life Satisfaction and Ecological Footprint. Therefore I needed to add some extra R code to be able to recreate the HPI map.

#categorise component indices
dF$LifeSatcolour <-
  ifelse(dF$LifeSat < 5.5,'red'
        ,ifelse(dF$LifeSat > 7.0,'green'
               ,'amber' ))

dF$LifeExpcolour <-
  ifelse(dF$LifeExp < 60,'red'
        ,ifelse(dF$LifeExp > 75,'green'
               ,'amber' ))

dF$HLYcolour <-
  ifelse(dF$HLY < 33,'red'
        ,ifelse(dF$HLY > 52.5,'green'
               ,'amber' ))


dF$Footprintcolour <-
  ifelse(dF$Footprint > 8.4,'blood red'
        ,ifelse(dF$Footprint > 4.2,'red'
               ,ifelse(dF$Footprint < 2.1,'green'
                      ,'amber' )))

#count reds, ambers and greens per country
numReds <-
  (as.numeric(dF$Footprintcolour=='red')
  +as.numeric(dF$LifeExpcolour=='red')
  +as.numeric(dF$LifeSatcolour=='red'))
numAmbers <-
  (as.numeric(dF$Footprintcolour=='amber')
  +as.numeric(dF$LifeExpcolour=='amber')
  +as.numeric(dF$LifeSatcolour=='amber'))
numGreens <-
  (as.numeric(dF$Footprintcolour=='green')
  +as.numeric(dF$LifeExpcolour=='green')
  +as.numeric(dF$LifeSatcolour=='green'))

#calculate HPI colour per country
dF$HPIcolour <-
  ifelse(dF$Footprintcolour=='blood red'
         | numReds>1,6
        ,ifelse(numReds==1,5
               ,ifelse(numAmbers==3,4
                      ,ifelse(numGreens==1 & numAmbers==2,3
                             ,ifelse(numGreens==2 & numAmbers==1,2
                                    ,ifelse(numGreens==3,1
                                           ,NA))))))

#join data to map
sPDF <- joinCountryData2Map(dF
                           ,joinCode="NAME"
                           ,nameJoinColumn="country")

#set colours
colourPalette <- c('palegreen'
                  ,'yellow'
                  ,'orange'
                  ,'orangered'
                  ,'darkred')

#plot map
mapDevice() #create world map shaped window
mapParams <- mapCountryData(sPDF
                           ,nameColumnToPlot='HPIcolour'
                           ,catMethod='categorical'
                           ,colourPalette=colourPalette
                           ,addLegend=FALSE
                           ,mapTitle='Happy Planet Index')

#changing legendText
mapParams$legendText <-
  c('2 good, 1 middle'
   ,'1 good, 2 middle'
   ,'3 middle'
   ,'1 poor'
   ,'2 poor or footprint v.poor')
#add legend
do.call( addMapLegendBoxes
       , c(mapParams
          ,x='bottom'
          ,title="HPI colour"))

Figure 6: Happy Planet Index 2.0, using rworldmap to replicate the map in the happy planet report

Example gridded data from the web

'Koeppen Geiger' is a published classification dividing the world into 30 climatic zones. The GIS files for a Koeppen Geiger gridded climatic regions map are freely available from http://koeppen-geiger.vu-wien.ac.at/. The code below shows how to read in and plot an ascii file downloaded from that site.

inFile1 <- 'Koeppen-Geiger-ASCII.txt'
#read in data which is as lon,lat,catID
dF<-read.table(inFile1,header=TRUE,as.is=TRUE)
#convert to sp SpatialPointsDataFrame
coordinates(dF) = c("Lon", "Lat")
# promote to SpatialPixelsDataFrame
gridded(dF) <- TRUE
# promote to SpatialGridDataFrame
sGDF = as(dF, "SpatialGridDataFrame")
#plotting map
mapDevice() #create world map shaped window
mapParams <- mapGriddedData(sGDF
                           ,catMethod='categorical'
                           ,addLegend=FALSE)
#adding formatted legend
do.call(addMapLegendBoxes
       ,c(mapParams
         ,cex=0.8
         ,ncol=10
         ,x='bottom'
         ,title='Koeppen-Geiger Climate Zones'))

This produces a map that looks a lot different from the published map because it is using a different colour palette. The default "heat" colour palette is not the best for this categorical data and one of the palettes from RColorBrewer more suitable for categorical data could be used. However it would be good to retain the palette created by the authors of the data.
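For instance, a qualitative RColorBrewer palette could be stretched across the zones. This is a sketch, not from the article; the zone count of 30 comes from the description above, and interpolation is needed because qualitative brewer palettes stop at 12 colours.

library(RColorBrewer)
# stretch a 12-colour qualitative palette across
# the 30 Koeppen-Geiger zones (count assumed above)
zoneColours <- colorRampPalette(brewer.pal(12,'Set3'))(30)
mapParams <- mapGriddedData(sGDF
                           ,catMethod='categorical'
                           ,addLegend=FALSE
                           ,colourPalette=zoneColours)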

The ascii file does not contain any colour information; however, the authors also provide ESRI and Google Earth compatible files that do have colour information. It appeared to be impossible to extract the palette from the ESRI files, but by opening the '.kmz' file in Google Earth, saving to '.kml' and some fiddly text editing in R, the colours could be extracted as a single column then saved to a '.csv' file, the first few lines of which are:

colour
#960000
#ff0000
#ff9999
#ffcccc
...

Plotting the map in the original colours can then be achieved relatively simply by:

#reading in colour palette
#as.is=T stops conversion to factors
#which otherwise messes up colours
tst <- read.csv('paletteSaved.csv',as.is=TRUE)

#plotting map
mapDevice() #create world map shaped window
#tst$x passes the palette as a vector
mapParams <- mapGriddedData(sGDF
                           ,catMethod='categorical'
                           ,addLegend=FALSE
                           ,colourPalette=tst$x)
#adding legend
do.call(addMapLegendBoxes
       ,c(mapParams
         ,cex=0.8
         ,ncol=3
         ,x='bottomleft'
         ,title='Koeppen-Geiger Climate Zones'))

Figure 7: 'Koeppen Geiger' climatic zones, map produced using mapGriddedData() and freely available data

Using rworldmap to map netCDF data in tandem with ncdf

netCDF is a common file format used in meteorology and oceanography; it is both multi-dimensional and self documented. The package ncdf allows netCDF files to be opened, queried and data extracted directly to R. Data extracted from netCDF files in this way can then be converted to sp objects and plotted using rworldmap. In the example below the netCDF file (~333 KB) is first downloaded from the IPCC data distribution centre at: http://www.ipcc-data.org/cgi-bin/downl/ar4_nc/tas/HADCM3_SRA1B_1_tas-change_2046-2065.cyto180.nc. This particular file was accessed from: http://www.ipcc-data.org/ar4/info/UKMO-HADCM3_SRA1B_tas.html. The 'lon180' option was chosen to get a file with longitude values from -180 to 180 as that requires less editing before plotting.

library(ncdf)
#the downloaded file
inFile <-
 'HADCM3_SRA1B_1_tas-change_2046-2065.cyto180.nc'
memory.limit(4000) #set memory limit to max
nc = open.ncdf(inFile, write=FALSE)

print(nc) prints to the console a description of the file contents, which includes the information shown below (edited slightly for brevity).

file ... has 4 dimensions:
time Size: 12
latitude Size: 73
longitude Size: 96
bounds Size: 2
------
file ... has 4 variables
float climatological_bounds[bounds,time]
float latitude_bounds[bounds,latitude]
float longitude_bounds[bounds,longitude]
float air_temperature_anomaly[longitude,latitude,time]

After the netCDF file has been opened in this way, selected variables can be read into R using get.var.ncdf(), converted to a SpatialGridDataFrame and plotted using the rworldmap function mapGriddedData. The code below first creates a grid from the parameters in the netCDF file, and is written to be generic so that it should work on other similar netCDF files. Then it creates a standard classification and colour scheme for all of the plots based on having looked at the values in each month's data first. Finally it loops through all months, reads in the data for that month, creates a spatialGridDataFrame for each, plots it and saves it as a '.png'.

ncArray = get.var.ncdf(nc,'air_temperature_anomaly')

# creating gridTopology from the netCDF metadata
offset = c(min(nc$dim$longitude$vals)
          ,min(nc$dim$latitude$vals))
cellsize = c( abs(diff(nc$dim$longitude$vals[1:2]))
            , abs(diff(nc$dim$latitude$vals[1:2])))
# add cellsize/2 to offset
# to convert from lower left referencing to centre
offset = offset + cellsize/2
cells.dim = c(nc$dim$longitude$len
             ,nc$dim$latitude$len )

gt <- GridTopology(cellcentre.offset = offset
                  , cellsize = cellsize
                  , cells.dim = cells.dim )
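At this point it can be worth checking that the array read from the file agrees with the dimension sizes printed earlier; a small optional check, not part of the original listing:

#optional check (not in the original listing):
#the data array should match the lon/lat/time sizes
stopifnot(all(dim(ncArray) == c(nc$dim$longitude$len
                               ,nc$dim$latitude$len
                               ,nc$dim$time$len)))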

mapDevice()
#creating a vector to classify the data
catMethod=seq(from=-5,to=19,by=2)
#creating a colourPalette for all plots
#-ve blue, 0 white, +ve yellow to red
colourPalette=c('blue','lightblue','white'
               ,brewer.pal(9,'YlOrRd'))

#looping for each month
for( zDim in 1 : nc$dim$time$len ){
  #reading the values for this month
  ncMatrix <- ncArray[,,zDim]
  #to get the image up the right way
  #this reverses the y values but not the x ones
  ncMatrix2 <- ncMatrix[ ,nc$dim$latitude$len:1 ]
  gridVals <- data.frame(att=as.vector(ncMatrix2))
  #creating a spatialGridDataFrame
  sGDF <- SpatialGridDataFrame(gt, data=gridVals)

  #plotting the map and getting params for legend
  mapParams <- mapGriddedData( sGDF
                             ,nameColumnToPlot='att'
                             ,catMethod=catMethod
                             ,colourPalette=colourPalette
                             ,addLegend=FALSE )
  #adding formatted legend
  do.call(addMapLegend
         ,c(mapParams
           ,legendLabels="all"
           ,legendWidth=0.5
           ,legendMar = 3))
  title(paste('month :',zDim)) #adding brief title

  outputPlotType = 'png'
  savePlot(paste("ipccAirAnomalyMonth",zDim,sep='')
          ,type=outputPlotType)
} #end of month loop

close.ncdf(nc) #closing the ncdf file

Summary

rworldmap is a new package to facilitate the display of global data, referenced by grid, country or region. It is available on CRAN at http://cran.r-project.org/web/packages/rworldmap. rworldmap aims to provide tools to improve the communication and understanding of world datasets. If you have any comments or suggestions or would like to contribute code please get in touch. The source code and development versions are available from http://code.google.com/p/rworld/. I plan to extend this work by making datasets, including those used in this paper, more easily accessible, perhaps through a package called rworldmapData. We are also working on visualisations to communicate more information than by maps alone. For more help on rworldmap there is a vignette available at http://cran.r-project.org/web/packages/rworldmap/vignettes/rworldmap.pdf and an FAQ at http://cran.r-project.org/web/packages/rworldmap/vignettes/rworldmapFAQ.pdf.

Acknowledgements

This work was funded by NERC through the QUEST Data Initiative (QESDI) and QUEST Global Systems Impacts (GSI) projects. Thanks to Pru Foster and Sarah Cornell at QUEST and Graham Pilling at Cefas without whose enthusiasm this would not have happened. Thanks to Martin Juckes for understanding project management, Barry Rowlingson and Roger Bivand for their contributions at a startup workshop and after, and to Nick Dulvy, Joe Scutt-Phillips and Elisabeth Simelton for inspiration on global vulnerability analyses and colour.

Bibliography

N. Marks. GDP RIP (1933-2010). Significance, 7(1):27-30, 2010. Published online: 23 Feb 2010. URL http://www3.interscience.wiley.com/cgi-bin/fulltext/123300797/PDFSTART.

The Onion. Our Dumb World: The Onion's Atlas of the Planet Earth. Little, Brown, and Company, 2007. ISBN: 978 0 7528 9120 0. URL http://www.theonion.com/.

Figure 8: Two examples of IPCC temperature anomaly data (for December and June respectively) plotted using mapGriddedData()

Andy South
Centre for Environment, Fisheries and Aquaculture Science (Cefas)
Pakefield Road, Lowestoft, NR33 OHT, UK
Current address: Graduate School of Education
University of Exeter
Exeter, EX1 2LU, UK
[email protected]


Appendix: Comparing using rworldmap to not using it

Here I show a brief comparison between using rworldmap and not. It is not my intention to set up a straw-man that demonstrates how superior my package is to others. Other packages have different objectives; by focusing on world maps I can afford to ignore certain details. It is only by building on the great work in other packages that I have been able to get this far. Those caveats aside, here is the comparison using some data on alcohol consumption per adult by country, downloaded as an Excel file from the gapminder website at http://www.gapminder.org/data/ and saved as a '.csv'. Reading in the data is common to both approaches:

inFile <- 'indicatoralcoholconsumption20100830.csv'
dF <- read.csv(inFile)

Using rworldmap

library(rworldmap)
sPDF <- joinCountryData2Map(dF
                           , joinCode = "NAME"
                           , nameJoinColumn = "X"
                           , nameCountryColumn = "X"
                           , verbose = TRUE)
mapCountryData(sPDF,nameColumnToPlot='X2005')

Not using rworldmap

library(maptools)
library(fields)
## get map
data(wrld_simpl) #from package maptools

## joining
#first identify failures
matchPosnsInLookup <- match(
  as.character(dF$X)
 ,as.character(wrld_simpl$NAME))
failedCodes <- dF$X[is.na(matchPosnsInLookup)]
numFailedCodes <- length(failedCodes)

#printing info to console
cat(numFailedCodes
   ,"countries failed to join to the map\n")
print(failedCodes)

#find match positions in the data
matchPosnsInData <- match(
  as.character(wrld_simpl$NAME)
 ,as.character(dF$X))
# join data to the map
wrld_simpl@data <- cbind(wrld_simpl@data
                        , dF[matchPosnsInData,])

#sizing window to a good shape for the world
dev.new(width=9,height=4.5)
#so that map extends to edge of window
oldpar <- par(mai=c(0,0,0.2,0))

#categorising the data
numCats <- 7
quantileProbs <- seq(0,1,1/numCats)
quantileBreaks <- quantile(wrld_simpl$X2005
                          ,na.rm=T
                          ,probs=quantileProbs)
wrld_simpl$toPlot <- cut( wrld_simpl$X2005
                        , breaks=quantileBreaks
                        , labels=F )

#plotting map
plot(wrld_simpl
    ,col=rev(heat.colors(numCats))[wrld_simpl$toPlot])

#adding legend using the fields package
zlim <- range(quantileBreaks,na.rm=TRUE)
image.plot(legend.only=TRUE
          ,zlim=zlim
          ,col=rev(heat.colors(numCats))
          ,breaks=quantileBreaks
          ,horizontal=TRUE)

par(oldpar) #reset graphics settings

Slight differences in country naming, and an absence of country codes, cause 9 countries not to join in both approaches. joinCountryData2Map() outputs the identities of these countries. This means that in the maps it incorrectly looks like alcohol consumption is zero in both the UK and USA. To correct this the country names need to be renamed prior to joining, either in the '.csv' or in the dataframe. In the R dataframe the countries could be renamed by:

n1<-'United Kingdom of Great Britain and Northern Ireland'
n2<-'United Kingdom'
levels(dF$X)[which(levels(dF$X)==n1)] <- n2

The objective for rworldmap is to make world mapping easier, based on difficulties experienced by the author and project collaborators. Of course, there is still a learning curve associated with being able to use rworldmap itself. All of the operations shown in this paper can be accomplished by accomplished R programmers without the need for rworldmap; however, in an inter-disciplinary project rworldmap made it easier for scientists who are new to R to start producing outputs, and encouraged them to extend their R learning further. There are repetitive data manipulation routines that are tricky to implement; rworldmap reduces the time spent on these so that more time can be spent on the important task of communicating what the data say.


Cryptographic Boolean Functions with R

by Frédéric Lafitte, Dirk Van Heule and Julien Van hamme

Abstract A new package called boolfun is available for R users. The package provides tools to handle Boolean functions, in particular for cryptographic purposes. This document guides the user through some (code) examples and gives a feel of what can be done with the package.

A Boolean function is a mapping {0,1}^n -> {0,1}. Those mappings are of much interest in the design of cryptographic algorithms such as secure pseudorandom number generators (for the design of stream ciphers among other applications), hash functions and block ciphers. The lack of open source software to assess cryptographic properties of Boolean functions and the increasing interest for statistical testing of properties related to random Boolean functions (Filiol, 2002; Saarinen, 2006; Englund et al., 2007; Aumasson et al., 2009) are the main motivations for the development of this package.

The number of Boolean functions with n variables is 2^(2^n), i.e. 2 possible output values for each of the 2^n input assignments. In this search space of size 2^(2^n), looking for a function which has specific properties is impractical. Already for n >= 6, an exhaustive search would require too many computational resources. Furthermore, most properties are computed in O(n2^n). This explains why the cryptographic community has been interested in evolutionary techniques (e.g. simulated annealing, genetic algorithms) to find an optimal function in this huge space. Another way to tackle the computational difficulties is to study algebraic constructions of Boolean functions. An example of such constructions is to start from two functions with m variables and to combine them to get a function with n > m variables that provably preserves or enhances some properties of the initial functions. In both cases, the boolfun package can be used to experiment on Boolean functions and test cryptographic properties such as nonlinearity, algebraic immunity and resilience. This short article gives an overview of the package. More details can be found in the package vignettes.

First steps

In R, type install.packages("boolfun") in order to install the package and library(boolfun) to load it.

A Boolean function is instantiated with its truth table, a character or integer vector representing the output values of the function. For example, f <- BooleanFunction("01101010") defines a Boolean function with n = 3 input variables. Also, g <- BooleanFunction(c(tt(f),tt(f))) defines a Boolean function with n = 4 by concatenating f's truth table.

In order to represent the truth table as a vector of return values without ambiguity, a total order needs to be defined over the input assignments. The position of the element (x1,...,xn) in this ordering is simply the integer encoded in base 2 by the digits xn...x1. For example, the first function defined above is explicitly represented by the following truth table.

x1  x2  x3    f(x1,x2,x3)
 0   0   0        0
 1   0   0        1
 0   1   0        1
 1   1   0        0
 0   0   1        1
 1   0   1        0
 0   1   1        1
 1   1   1        0

Methods of the BooleanFunction object can be called in two ways, using functional notation as in tt(f) or using object oriented notation as in f$tt(). This is a feature of any object that inherits from Object defined in the R.oo package (Bengtsson, 2003).

An overview of all public methods is given in Figure 1 with a short description.

method     returned value
n()        number of input variables n
tt()       truth table (vector of integers)
wh()       walsh spectrum (vector of integers)
anf()      algebraic normal form (vector of integers)
ANF()      algebraic normal form as Polynomial
deg()      algebraic degree
nl()       nonlinearity
ai()       algebraic immunity
ci()       correlation immunity
res()      resiliency
isBal()    TRUE if function is balanced
isCi(t)    TRUE if ci() returns t
isRes(t)   TRUE if res() returns t

Figure 1: Public methods of BooleanFunction.
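As a plain-R illustration of the input ordering just described (not part of the package), the value of f at (x1,...,xn) sits at position 1 + x1 + 2*x2 + ... + 2^(n-1)*xn of the truth-table vector; the helper evalTT below is hypothetical.

# truth table of f <- BooleanFunction("01101010")
ttf <- c(0, 1, 1, 0, 1, 0, 1, 0)
# look up f(x1,...,xn): digits xn...x1 encode the
# 0-based position in the vector
evalTT <- function(tt, x)
  tt[1 + sum(x * 2^(seq_along(x) - 1))]
evalTT(ttf, c(0, 1, 1)) # f(0,1,1) = 1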


Also some generic functions such as print() or equals() are overloaded so that they support instances of BooleanFunction. Any additional information can be found in the document displayed by the R command vignette("boolfun").

Note that an object Polynomial is also implemented in the boolfun package and is discussed in the following section. The other sections give some examples on how to use the package in order to visualize properties or to analyze random Boolean functions.

Multivariate polynomials over F2

Let F2 denote the finite field with two elements and let ⊕ denote the addition in F2 (exclusive or). Formally, the algebraic normal form of a Boolean function f is an element anf(f) of the quotient ring

    F2[x1,...,xn] / <x1^2 = x1, ..., xn^2 = xn>

defined as follows

    anf(f) = ⊕_{(a1,...,an) ∈ {0,1}^n}  h(a1,...,an) · x1^a1 ··· xn^an

where the function h is based on the Möbius inversion principle

    h(a1,...,an) = ⊕_{(x1,...,xn) ⪯ (a1,...,an)}  f(x1,...,xn)

with (x1,...,xn) ⪯ (a1,...,an) meaning xi <= ai for all i. Simply put, anf(f) is a multivariate polynomial in (Boolean) variables x1,...,xn where coefficients and exponents are in {0,1}. Those polynomials are commonly called Boolean polynomials and represent uniquely the corresponding Boolean function.

The recent package multipol (Hankin, 2008) allows the user to handle multivariate polynomials but is not well suited to handle operations over polynomial Boolean rings. In boolfun, an S3 object Polynomial is defined and implements basic functionality to manipulate the algebraic normal form of Boolean functions.

> f <- BooleanFunction("01011010")
> p <- f$ANF()
> data.class(p)
[1] "Polynomial"
> q <- Polynomial("0101")
> p
[1] "x1 + x3"
> q
[1] "x1*x2 + x1"
> p * q
[1] "x1*x2*x3 + x1*x2 + x1*x3 + x1"
> p * q + q
[1] "x1*x2*x3 + x1*x3"
> deg(p * q + q)
[1] 3

In this example, p holds the algebraic normal form of f which is represented in Polynomial by a vector of length 2^n, i.e. the truth table of the Boolean function h defined above. The operator [[ can be used to evaluate the polynomial for a particular assignment.

> r <- p * q + q
> x <- c(0, 1, 1) # x1=0, x2=1, x3=1
> if( length(x) != r$n() ) stop("this is an error")
> p[[x]]
[1] 1
> x[3] <- 0
> p[[x]]
[1] 0

The Polynomial object inherits from R.oo's Object and Figure 2 gives an overview of the implemented methods.

method     returned value
n()        number of input variables n
deg()      algebraic degree
anf()      vector of coefficients for 2^n monomials
len()      returns 2^n
string()   algebraic normal form as character string
*          multiplication returns a new Polynomial
+          addition returns a new Polynomial
[[         evaluates the polynomial

Figure 2: Public methods of Polynomial.

Addition and multiplication are executed using C code. The addition is straightforward given two vectors of coefficients as it consists in making a component-wise exclusive or of the input vectors. The multiplication is built upon the addition of two polynomials and the multiplication of two monomials. Hence addition is computed in O(2^n) and multiplication in O(n · 2^(2n)). Note that lower complexities can be achieved using graph based representations such as binary decision diagrams. For now, only vectors of length 2^n are used, which are sufficient to manipulate sets of Boolean functions having up to about n = 22 variables (and n = 13 variables if algebraic immunity needs to be assessed).
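The coefficient function h can be computed from a truth table with the standard fast Möbius (butterfly) transform. The following is a minimal plain-R sketch of that transform, independent of the package's C implementation; it reproduces the ANF of the example above, with coefficients ordered by the same truth-table convention.

# fast Möbius transform: truth table -> ANF coefficients
mobius <- function(tt) {
  n <- log2(length(tt))
  for (i in 0:(n - 1)) {
    step <- 2^i
    for (j in seq(1, length(tt), by = 2 * step)) {
      hi <- j + step
      # xor each upper half-block with its lower half-block
      tt[hi:(hi + step - 1)] <-
        1 * xor(tt[hi:(hi + step - 1)], tt[j:(j + step - 1)])
    }
  }
  tt
}
mobius(c(0, 1, 0, 1, 1, 0, 1, 0))
# [1] 0 1 0 0 1 0 0 0  i.e. monomials x1 and x3: "x1 + x3"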

Visualizing the tradeoff between cryptographic properties

Depending on its cryptographic application, a Boolean function needs to satisfy a set of properties in order to be used safely. Some of those properties cannot be satisfied simultaneously as there are tradeoffs between them. In this section we focus on two properties, resilience and algebraic immunity, and show how R can be used to illustrate the trade-off that exists between them.

The algebraic immunity of a Boolean function f is defined to be the lowest degree of the nonzero functions g such that f·g = 0 or (f ⊕ 1)g = 0. This property is used to assess the resistance to some algebraic attacks.

Resilience is a combination of two properties: balancedness and correlation immunity. A function is balanced if its truth table contains as many zeros as ones, and correlation immune of order t if the probability distribution of its output does not change when at most t input variables are fixed. Hence a function is resilient of order t if its output remains balanced even after fixing at most t input variables.

The following example illustrates the trade-off between algebraic immunity and resilience. The example code plots the resilience and the algebraic immunity of all functions with 3 variables; the result is shown in Figures 3 and 4.

n <- 3
N <- 2^2^n # number of functions with n var

allRes <- vector("integer", N)
allAIs <- vector("integer", N)
for( i in 1:N ) { # for each Boolean function
  f <- BooleanFunction( toBin(i-1,2^n) )
  allRes[i] <- res(f) # get its resiliency
  allAIs[i] <- ai(f)  # and algebraic immunity
}

xlabel <- "Truth tables as integers"
plot( x=1:N, y=allRes, type="b",
      xlab=xlabel, ylab="Resiliency" )
plot( x=1:N, y=allAIs, type="b",
      xlab=xlabel, ylab="Algebraic immunity" )

Figure 3: Resiliency of all functions with 3 variables.

Figure 4: Algebraic immunity of all functions with 3 variables.

Observe that the search space of size 2^(2^n) has a particular structure. The symmetry between the two halves of the search space is justified in that a function with constant term zero, i.e. with f(0,...,0) = 0, will have the same properties as its complement 1 ⊕ f (i.e. the same function with constant term 1). Note also that in those figures the functions achieving the highest resiliencies belong to the second quarter of the search space, which also seems more dense in algebraically immune functions. Very similar plots are observed when considering more variables.

Now let us consider the functions achieving a good tradeoff between algebraic immunity and resilience. The sum ai(f) + res(f) is computed for each function f with 3 variables according to the following code and plotted in Figure 5.

plot( 1:N, allRes+allAIs, type="b",
      xlab="f", ylab="ai(f)+res(f)" )

Figure 5: The sum of algebraic immunity with resiliency for all functions with 3 variables.

Note that for all functions f, the value res(f) + ai(f) never reaches the value max(allRes) + max(allAIs), hence the tradeoff. Also this figure suggests which part of the search space should be explored in order to find optimal tradeoffs.

Properties of random Boolean functions

Figures 3 and 4 suggest that some properties are more likely to be observed in a random Boolean function. For example, there are more functions having maximal algebraic immunity than functions having maximal resilience.

The following example gives a better idea of what is expected in random Boolean functions.

n <- 8
data <- data.frame( matrix(nrow=0,ncol=4) )
names(data) <- c( "deg", "ai", "nl", "res" )
for( i in 1:1000 ) { # for 1000 random functions
  randomTT <- round(runif(2^n, 0, 1))
  randomBF <- BooleanFunction(randomTT)
  data[i,] <- c( deg(randomBF), ai(randomBF),
                 nl(randomBF), res(randomBF))
}

After the code is executed, data holds the values of four properties (columns) for 1000 random Boolean functions with n = 8 variables. The mean and standard deviation of each property are given below.

> mean(data)
    deg      ai      nl     res
  7.479   3.997 103.376  -0.939
> sd(data)
      deg        ai        nl       res
0.5057814 0.0547174 3.0248593 0.2476698

It is also very easy to apply statistical tests using R. For example, in (Englund et al., 2007; Aumasson et al., 2009) cryptographic pseudorandom generators are tested for non-random behavior. Those tests consider the ith output bit of the generator as a (Boolean) function fi of n chosen input bits, the remaining input bits being fixed to some random value. Then the properties of the fis are compared to the expected properties of a random function. All those tests involve a χ² statistic in order to support a goodness of fit test. Such testing is easy with R using the function qchisq from the package stats, as suggested by the following code.

data <- getTheBooleanFunctions()
chistat <- computeChiStat(data)

outcome <- "random"
if(chistat > qchisq(0.99, df=n))
  outcome <- "cipher"

print(outcome)

Summary

A free open source package to manipulate Boolean functions is available at cran.r-project.org. The package also provides tools to handle Boolean polynomials (multivariate polynomials over F2). boolfun has been developed to evaluate some cryptographic properties of Boolean functions and carry statistical analysis on them. An effort has been made to optimize execution speed rather than memory usage using C code. It is easy to extend this package in order to add new functionality such as computing the autocorrelation spectrum, manipulating (rotation) symmetric Boolean functions, etc., as well as additional features for multivariate polynomials.

Bibliography

J.-P. Aumasson, I. Dinur, W. Meier, and A. Shamir. Cube testers and key recovery attacks on reduced-round MD6 and Trivium. In FSE, pages 1-22, 2009.

H. Bengtsson. The R.oo package - object-oriented programming with references using standard R code. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Vienna, Austria, 2003.

H. Englund, T. Johansson, and M. S. Turan. A framework for chosen IV statistical analysis of stream ciphers. In INDOCRYPT, pages 268-281, 2007.

E. Filiol. A new statistical testing for symmetric ciphers and hash functions. In ICICS, pages 342-353, 2002.

R. Hankin. Programmers' niche: Multivariate polynomials in R. R News, 8(1):41-45, 2008.

D. Knuth. A draft of section 7.1.1: Boolean basics. The Art of Computer Programming, volume 4 pre-fascicle 0B, 2006.

W. Meier and O. Staffelbach. Fast correlation attacks on certain stream ciphers. Journal of Cryptology, 1(3):159-176, 1989. ISSN 0933-2790.

W. Meier, E. Pasalic, and C. Carlet. Algebraic attacks and decomposition of Boolean functions. In EUROCRYPT, pages 474-491, 2004.

M. Saarinen. Chosen-IV statistical attacks on eSTREAM ciphers. In Proceedings of SECRYPT 2006. Citeseer, 2006.

Frédéric Lafitte
Department of Mathematics
Royal Military Academy
Belgium
[email protected]

Dirk Van Heule
Department of Mathematics
Royal Military Academy
Belgium
[email protected]

Julien Van hamme
Department of Mathematics
Royal Military Academy
Belgium
[email protected]


Raster Images in R Graphics

by Paul Murrell

Abstract The R graphics engine has new support for rendering raster images via the functions rasterImage() and grid.raster(). This leads to better scaling of raster images, faster rendering to screen, and smaller graphics files. Several examples of possible applications of these new features are described.

Prior to version 2.11.0, the core R graphics engine was entirely vector based. In other words, R was only capable of drawing mathematical shapes, such as lines, rectangles, and polygons (and text).

This works well for most examples of statistical graphics because plots are typically made up of data symbols (polygons) or bars (rectangles), with axes (lines and text) alongside (see Figure 1).

Figure 1: A typical statistical plot that is well-described by vector elements: polygons, rectangles, lines, and text.

However, some displays of data are inherently raster. In other words, what is drawn is simply an array of values, where each value is visualized as a square or rectangular region of colour (see Figure 2).

Figure 2: An example of an inherently raster graphical image: a heatmap used to represent microarray data. This image is based on Cock (2010); the data are from Chiaretti et al. (2004).

It is possible to draw such raster elements using vector primitives—a small rectangle can be drawn for each data value—but this approach leads to at least two problems: it can be very slow to draw lots of small rectangles when drawing to the screen; and it can lead to very large files if output is saved in a vector file format such as PDF (and viewing the resulting PDF file can be very slow).

Another minor problem is that some PDF viewers have trouble reconciling their anti-aliasing algorithms with a large number of rectangles drawn side by side and may end up producing an ugly thin line between the rectangles.

To avoid these problems, from R version 2.11.0 on, the R graphics engine supports rendering raster elements as part of a statistical plot.

Raster graphics functions

The low-level R language interface to the new raster facility is provided by two new functions: rasterImage() in the graphics package and grid.raster() in the grid package.

For both functions, the first argument provides the raster image that is to be drawn. This argument should be a "raster" object, but both functions will accept any object for which there is an as.raster() method. This means that it is possible to simply specify a vector, matrix, or array to describe the raster image. For example, the following code produces a simple greyscale image (see Figure 3).

> library(grid)
> grid.raster(1:10/11)


Figure 3: A simple greyscale raster image generated from a numeric vector.

As the previous example demonstrates, a numeric vector is interpreted as a greyscale image, with 0 corresponding to black and 1 corresponding to white. Other possibilities are logical vectors, which are interpreted as black-and-white images, and character vectors, which are assumed to contain either colour names or RGB strings of the form "#RRGGBB".

The previous example also demonstrates that a vector is treated as a matrix of pixels with a single column. More usefully, the image to draw can be specified by an explicit matrix (numeric, logical, or character). For example, the following code shows a simple way to visualize the first 100 named colours in R.

> grid.raster(matrix(colors()[1:100], ncol=10),
+             interpolate=FALSE)

Figure 4: A raster image generated from a character matrix (of the first 100 values in colors()).

It is also possible to specify the raster image as a numeric array: either three planes of red, green, and blue channels, or four planes where the fourth plane provides an "alpha" (transparency) channel.

Greater control over the conversion to a "raster" object is possible by directly calling the as.raster() function, as shown below.

> grid.raster(as.raster(1:10, max=11))

Interpolation

The simple image matrix example above demonstrates another important argument in both of the new raster functions—the interpolate argument. In most cases, a raster image is not going to be rendered at its "natural" size (using exactly one device pixel for each pixel in the image), which means that the image has to be resized. The interpolate argument is used to control how that resizing occurs.

By default, the interpolate argument is TRUE, which means that what is actually drawn by R is a linear interpolation of the pixels in the original image. Setting interpolate to FALSE means that what gets drawn is essentially a sample from the pixels in the original image. The former case was used in Figure 3 and it produces a smoother result, while the latter case was used in Figure 4 and the result is more "blocky." Figure 5 shows the images from Figures 3 and 4 with their interpolation settings reversed.

> grid.raster(1:10/11, interpolate=FALSE)
> grid.raster(matrix(colors()[1:100], ncol=10))

Figure 5: The raster images from Figures 3 and 4 with their interpolation settings reversed.

The ability to use linear interpolation provides another advantage over the old behaviour of drawing a rectangle per pixel. For example, Figure 6 shows a greyscale version of the R logo image drawn using both the old behaviour and with the new raster support. The latter version of the image is smoother thanks to linear interpolation.

> download.file("http://cran.r-project.org/Rlogo.jpg",
+               "Rlogo.jpg")
> library(ReadImages)
> logo <- read.jpeg("Rlogo.jpg")

> par(mar=rep(0, 4))
> plot(logo)

> grid.raster(logo)
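To compare the two settings directly on one page, here is a minimal sketch (toy data, not from the original article) that draws the same small matrix twice in side-by-side grid viewports, with and without interpolation:

> library(grid)
> z <- matrix(runif(16), nrow = 4)  # toy 4 x 4 image
> grid.newpage()
> # Left: default linear interpolation (smooth result)
> pushViewport(viewport(x = 0.25, width = 0.4))
> grid.raster(z)
> popViewport()
> # Right: no interpolation (blocky result)
> pushViewport(viewport(x = 0.75, width = 0.4))
> grid.raster(z, interpolate = FALSE)
> popViewport()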


Figure 6: The R logo drawn using the old behaviour of drawing a small rectangle for each pixel (left) and using the new raster support with linear interpolation (right), which produces a smoother result.

In situations where the raster image represents actual data (e.g., microarray data), it is important to preserve each individual "pixel" of data. If an image is viewed at a reduced size, so that there are fewer screen pixels used to display the image than there are pixels in the image, then some pixels in the original image will be lost (though this is true whether the image is rendered as pixels or as small rectangles).

What has been described so far applies equally to the rasterImage() function and the grid.raster() function. The next few sections look at each of these functions separately to describe their individual features.

The rasterImage() function

The rasterImage() function is analogous to other low-level graphics functions, such as lines() and rect(); it is designed to add a raster image to the current plot. The image is positioned by specifying the location of the bottom-left and top-right corners of the image in user coordinates (i.e., relative to the axis scales in the current plot).

To provide an example, the following code sets up a matrix of normalized values based on a mathematical function (taken from the first example on the image() help page).

> x <- y <- seq(-4*pi, 4*pi, len=27)
> r <- sqrt(outer(x^2, y^2, "+"))
> z <- cos(r^2)*exp(-r/6)
> image <- (z - min(z))/diff(range(z))

The following code draws a raster image from this matrix that occupies the entire plot region (see Figure 7). Notice that some work needs to be done to correctly align the raster cells with axis scales when the pixel coordinates in the image represent data values.

> step <- diff(x)[1]
> xrange <- range(x) + c(-step/2, step/2)
> yrange <- range(y) + c(-step/2, step/2)
> plot(x, y, ann=FALSE,
+      xlim=xrange, ylim=yrange,
+      xaxs="i", yaxs="i")
> rasterImage(image,
+             xrange[1], yrange[1],
+             xrange[2], yrange[2],
+             interpolate=FALSE)

Figure 7: An image drawn from a matrix of normalized values using rasterImage().

It is also possible to rotate the image (about the bottom-left corner) via the angle argument.

To avoid distorting an image, some calculations using functions like xinch(), yinch(), and dim() (to get the dimensions of the image) may be required. More sophisticated support for positioning and sizing the image is provided by grid.raster().
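A brief sketch of the angle argument mentioned above (toy data, not from the original article); both calls use only standard rasterImage() arguments:

> # Draw the same toy image unrotated and rotated by 30 degrees
> # (rotation is about the bottom-left corner of the image).
> m <- matrix(grey(seq(0, 1, length.out = 9)), nrow = 3)
> plot(0:10, 0:10, type = "n", xlab = "", ylab = "")
> rasterImage(m, 2, 2, 5, 5)
> rasterImage(m, 6, 2, 9, 5, angle = 30)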


The grid.raster() function

The grid.raster() function works like any other grid graphical primitive; it draws a raster image within the current grid viewport.

By default, the image is drawn as large as possible while still respecting its native aspect ratio (Figure 3 shows an example of this behaviour). Otherwise, the image is positioned according to the arguments x and y (justified by just, hjust, and vjust) and sized via width and height. If only one of width or height is given, then the aspect ratio of the image is preserved (and the image may extend beyond the current viewport).

Any of x, y, width, and height can be vectors, in which case multiple copies of the image are drawn. For example, the following code uses grid.raster() to draw the R logo within each bar of a lattice barchart.

> x <- c(0.00, 0.40, 0.86, 0.85, 0.69, 0.48,
+        0.54, 1.09, 1.11, 1.73, 2.05, 2.02)
> library(lattice)
> barchart(1:12 ~ x, origin=0, col="white",
+          panel=function(x, y, ...) {
+              panel.barchart(x, y, ...)
+              grid.raster(logo, x=0, width=x, y=y,
+                          default.units="native",
+                          just="left",
+                          height=unit(2/37, "npc"))
+          })

Figure 8: A lattice barchart of the change in the "level of interest in R" during 1996 (see demo(graphics)) with the R logo image used to annotate each bar (with apologies to Edward Tufte and all of his heirs and disciples).

In addition to grid.raster(), there is also a rasterGrob() function to create a raster image graphical object.

Support for raster image packages

A common source of raster images is likely to be external files, such as digital photos and medical scans. A number of R packages exist to read general raster formats, such as JPEG or TIFF files, plus there are many packages to support more domain-specific formats, such as NIFTI and ANALYZE.

Each of these packages creates its own sort of data structure to represent the image within R, so in order to render an image from an external file, the data structure must be converted to something that rasterImage() or grid.raster() can handle.

Ideally, the package will provide a method for the as.raster() function to convert the package-specific image data structure into a "raster" object. In the absence of that, the simplest path is to convert the data structure into a matrix or array, for which there already exist as.raster() methods.

In the example that produced Figure 6, the R logo was loaded into R using the ReadImages package, which created an "imagematrix" object called logo. This object could be passed directly to either rasterImage() or grid.raster() because an "imagematrix" is also an array, so the predefined conversion for arrays did the job.

Profiling

This section briefly demonstrates that the claimed improvements in terms of file size and speed of rendering are actually true. The following code generates a simple random test case image (see Figure 9). The only important feature of this image is that it is a reasonably large image (in terms of number of pixels).

> z <- matrix(runif(500*500), ncol=500)

Figure 9: A simple random test image.

The following code demonstrates that a PDF version of this image is almost ten times larger if drawn using a rectangle per pixel (via the image() function) compared to a file that is generated using grid.raster().

> pdf("image.pdf")
> image(z, col=grey(0:99/100))
> dev.off()

> pdf("gridraster.pdf")
> grid.raster(z, interp=FALSE)
> dev.off()

> file.info("image.pdf", "gridraster.pdf")["size"]
                   size
image.pdf      14893004
gridraster.pdf  1511027

The next piece of code can be used to demonstrate that rendering speed is also much slower when drawing an image to screen as many small rectangles.

The timings are from a run on a CentOS Linux system with the Cairo-based X11 device. There are likely to be significant differences to these results if this code is run on other systems.

> system.time({
+     for (i in 1:10) {
+         image(z, col=grey(0:99/100))
+     }
+ })

   user  system elapsed
 42.017   0.188  42.484

> system.time({
+     for (i in 1:10) {
+         grid.newpage()
+         grid.raster(z, interpolate=FALSE)
+     }
+ })

   user  system elapsed
  2.013   0.081   2.372

This is not a completely fair comparison because there are different amounts of input checking and housekeeping occurring inside the image() function and the grid.raster() function, but more detailed profiling (with Rprof()) was used to determine that most of the time was being spent by image() doing the actual rendering.

Examples

This section demonstrates some possible applications of the new raster support.

The most obvious application is simply to use the new functions wherever images are currently being drawn as many small rectangles using a function like image(). For example, Granovskaia et al. (2010) used the new raster graphics support in R in the production of gene expression profiles (see Figure 10).

Having raster images as a graphical primitive also makes it easier to think about performing some graphical tricks that were not necessarily obvious before. An example is gradient fills, which are not explicitly supported by the R graphics engine. The following code shows a simple example where the bars of a barchart are filled with a greyscale gradient (see Figure 11).

> barchart(1:12 ~ x, origin=0, col="white",
+          panel=function(x, y, ...) {
+              panel.barchart(x, y, ...)
+              grid.raster(t(1:10/11), x=0,
+                          width=x, y=y,
+                          default.units="native",
+                          just="left",
+                          height=unit(2/37, "npc"))
+          })

Figure 11: A lattice barchart of the change in the "level of interest in R" during 1996 (see demo(graphics)) with a greyscale gradient used to fill each bar (once again, begging the forgiveness of Edward Tufte).

Another example is non-rectangular clipping operations via raster "masks" (R's graphics engine only supports clipping to rectangular regions). The code below is used to produce a map of Spain that is filled with black (see Figure 12).

> library(maps)
> par(mar=rep(0, 4))
> map(region="Spain", col="black", fill=TRUE)

Having produced this image on screen, the function grid.cap() can be used to capture the current screen image as a raster object.

> mask <- grid.cap()

An alternative approach would be to produce a PNG file and read that in, but grid.cap() is more convenient for interactive use.

The following code reads in a raster image of the Spanish flag from an external file (using the png package), converting the image to a "raster" object.

> library(png)
> espana <- readPNG("1000px-Flag_of_Spain.png")
> espanaRaster <- as.raster(espana)

We now have two raster images in R. The following code trims the flag image on its right edge and trims the map of Spain on its bottom edge so that the two images are the same size (demonstrating that "raster" objects can be subsetted like matrices).

> espanaRaster <- espanaRaster[, 1:dim(mask)[2]]
> mask <- mask[1:dim(espanaRaster)[1], ]
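As an aside, the matrix-style subsetting used in the trimming step can be tried on any "raster" object; a toy sketch (constructed data, not the flag or map from the text):

> # "raster" objects carry dim() and can be indexed like matrices.
> r <- as.raster(matrix(runif(12), nrow = 3))
> dim(r)          # 3 rows, 4 columns
> r2 <- r[, 1:2]  # keep only the first two columns
> dim(r2)         # 3 rows, 2 columns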


Figure 10: An example of the use of raster images in an R graphic (reproduced from Figure 3 of Granovskaia et al., 2010, which was published as an article by BioMed Central). [The figure, showing expression profiles for five yeast chromosome regions in alpha factor and cdc28 panels, is not reproduced here.]

Now, we use the map as a "mask" to set all pixels in the flag image to transparent wherever the map image is not black (demonstrating that assigning to subsets also works for "raster" objects).

> espanaRaster[mask != "black"] <- "transparent"

The flag image can now be used to fill a map of Spain, as shown by the following code (see Figure 12).

> par(mar=rep(0, 4))
> map(region="Spain")
> grid.raster(espanaRaster, y=1, just="top")
> map(region="Spain", add=TRUE)

Known issues

The raster image support has been implemented for the primary screen graphics devices that are distributed with R—Cairo (for Linux), Quartz (for MacOS X), and Windows—plus the vector file format devices for PDF and PostScript. The screen device support also covers support for standard raster file formats (e.g., PNG) on each platform.

The X11 device has basic raster support, but rotated images can be problematic and there is no support for transparent pixels in the image. The Windows device does not support images with a different alpha level per pixel.¹

There is no support for raster images for the XFig or PICTEX devices.

A web page on the R developer web site, http://developer.r-project.org/Raster/raster-RFC.html, will be maintained to show the ongoing state of raster support.

Summary

The R graphics engine now has native support for rendering raster images. This will improve rendering speed and memory efficiency for plots that contain large raster images. It also broadens the set of graphical effects that are possible (or convenient) with R.

Acknowledgements

Thanks to the editors and reviewers for helpful comments and suggestions that have significantly improved this article.

¹However, Brian Ripley has added support in the development version of R.


Figure 12: A raster image of the Spanish flag being used as a fill pattern for a map of Spain. On the left is a map of Spain filled in black (produced by the map() function from the maps package). In the middle is a PNG image of the Spanish flag (a public domain image from Wikimedia Commons, http://en.wikipedia.org/wiki/File:Flag_of_Spain.svg), and on the right is the result of clipping the Spanish flag image using the map as a mask.

Bibliography

R. Bivand, F. Leisch, and M. Maechler. pixmap: Bitmap Images ("Pixel Maps"), 2009. URL http://CRAN.R-project.org/package=pixmap. R package version 0.4-10.

S. Chiaretti, X. Li, R. Gentleman, A. Vitale, M. Vignetti, F. Mandelli, J. Ritz, and R. Foa. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103(7):2771–2778, 2004.

P. Cock. Using R to draw a heatmap from microarray data, 2010. URL http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap.

M. V. Granovskaia, L. M. Jensen, M. E. Ritchie, J. Toedling, Y. Ning, P. Bork, W. Huber, and L. M. Steinmetz. High-resolution transcription atlas of the mitotic cell cycle in budding yeast. Genome Biology, 11(3):R24, 2010. URL http://genomebiology.com/2010/11/3/R24.

M. Loecher. ReadImages: Image Reading Module for R, 2009. URL http://CRAN.R-project.org/package=ReadImages. R package version 0.1.3.1.

Original S code by Richard A. Becker and A. R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P. Minka. maps: Draw Geographical Maps, 2009. URL http://CRAN.R-project.org/package=maps. R package version 2.1-0.

D. Sarkar. lattice: Lattice Graphics, 2010. URL http://CRAN.R-project.org/package=lattice. R package version 0.18-3.

Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019, Auckland
New Zealand
[email protected]


Probabilistic Weather Forecasting in R

by Chris Fraley, Adrian Raftery, Tilmann Gneiting, McLean Sloughter and Veronica Berrocal

Abstract This article describes two R packages for probabilistic weather forecasting, ensembleBMA, which offers ensemble postprocessing via Bayesian model averaging (BMA), and ProbForecastGOP, which implements the geostatistical output perturbation (GOP) method. BMA forecasting models use mixture distributions, in which each component corresponds to an ensemble member, and the form of the component distribution depends on the weather parameter (temperature, quantitative precipitation or wind speed). The model parameters are estimated from training data. The GOP technique uses geostatistical methods to produce probabilistic forecasts of entire weather fields for temperature or pressure, based on a single numerical forecast on a spatial grid. Both packages include functions for evaluating predictive performance, in addition to model fitting and forecasting.

Introduction

Over the past two decades, weather forecasting has experienced a paradigm shift towards probabilistic forecasts, which take the form of probability distributions over future weather quantities and events. Probabilistic forecasts allow for optimal decision making for many purposes, including air traffic control, ship routing, agriculture, electricity generation and weather-risk finance.

Up to the early 1990s, most weather forecasting was deterministic, meaning that only one "best" forecast was produced by a numerical model. The recent advent of ensemble prediction systems marks a radical change. An ensemble forecast consists of multiple numerical forecasts, each computed in a different way. Statistical postprocessing is then used to convert the ensemble into calibrated and sharp probabilistic forecasts (Gneiting and Raftery, 2005).

The ensembleBMA package

The ensembleBMA package (Fraley et al., 2010) offers statistical postprocessing of forecast ensembles via Bayesian model averaging (BMA). It provides functions for model fitting and forecasting with ensemble data that may include missing and/or exchangeable members. The modeling functions estimate BMA parameters from training data via the EM algorithm. Currently available options are normal mixture models (appropriate for temperature or pressure), mixtures of gamma distributions (appropriate for wind speed), and Bernoulli-gamma mixtures with a point mass at zero (appropriate for quantitative precipitation). Functions for verification that assess predictive performance are also available.

The BMA approach to the postprocessing of ensemble forecasts was introduced by Raftery et al. (2005) and has been developed in Berrocal et al. (2007), Sloughter et al. (2007), Wilson et al. (2007), Fraley et al. (2010) and Sloughter et al. (2010). Detail on verification procedures can be found in Gneiting and Raftery (2007) and Gneiting et al. (2007).

"ensembleData" objects

Ensemble forecasting data for weather typically includes some or all of the following information:

• ensemble member forecasts
• initial date
• valid date
• forecast hour (prediction horizon)
• location (latitude, longitude, elevation)
• station and network identification

The initial date is the day and time at which initial conditions are provided to the numerical weather prediction model, to run forward the partial differential equations that produce the members of the forecast ensemble. The forecast hour is the prediction horizon or time between initial and valid dates. The ensemble member forecasts then are valid for the hour and day that correspond to the forecast hour ahead of the initial date. In all the examples and illustrations in this article, the prediction horizon is 48 hours.

For use with the ensembleBMA package, data must be organized into an "ensembleData" object that minimally includes the ensemble member forecasts. For model fitting and verification, the corresponding weather observations are also needed. Several of the model fitting functions can produce forecasting models over a sequence of dates, provided that the "ensembleData" are for a single prediction horizon. Attributes such as station and network identification, and latitude and longitude, may be useful for plotting and/or analysis but are not currently used in any of the modeling functions. The "ensembleData" object facilitates preservation of the data as a unit for use in processing by the package functions.

Here we illustrate the creation of an "ensembleData" object called srftData that corresponds to the srft data set of surface temperature (Berrocal et al., 2007):

The R Journal Vol. 3/1, June 2011 ISSN 2073-4859 56 CONTRIBUTED RESEARCH ARTICLES data(srft) When no dates are specified, a model fit is produced members <- c("CMCG", "ETA", "GASP", "GFS", for each date for which there are sufficient training "JMA", "NGPS", "TCWB", "UKMO") data for the desired rolling training period. srftData <- The BMA predictive PDFs can be plotted as fol- ensembleData(forecasts = srft[,members], lows, with Figure1 showing an example: dates = srft$date, plot(srftFit, srftData, dates = "2004013100") observations = srft$obs, latitude = srft$lat, This steps through each location on the given dates, longitude = srft$lon, plotting the corresponding BMA PDFs. forecastHour = 48) Alternatively, the modeling process for a single date can be separated into two steps: first extracting The dates specification in an "ensembleData" object the training data, and then fitting the model directly refers to the valid dates for the forecasts. using the fitBMA function. See Fraley et al.(2007) for an example. A limitation of this approach is that date information is not automatically retained. Specifying exchangeable members. Forecast en- Forecasting is often done on grids that cover an sembles may contain members that can be consid- area of interest, rather than at station locations. The ered exchangeable (arising from the same generating dataset srftGrid provides ensemble forecasts of sur- mechanism, such as random perturbations of a given face temperature initialized on January 29, 2004 and set of initial conditions) and for which the BMA pa- valid for January 31, 2004 at grid locations in the rameters, such as weights and bias correction co- same region as that of the srft stations. efficients, should be the same. In ensembleBMA, Quantiles of the BMA predictive PDFs at the exchangeability is specified by supplying a vector grid locations can be obtained with the function that represents the grouping of the ensemble mem- quantileForecast: bers. The non-exchangeable groups consist of single- ton members, while exchangeable members belong srftGridForc <- quantileForecast(srftFit, to the same group. See Fraley et al.(2010) for a de- srftGridData, quantiles = c( .1, .5, .9)) tailed discussion. Here srftGridData is an "ensembleData" object cre- ated from srftGrid, and srftFit provides a fore- 1 Specifying dates. Functions that rely on the chron casting model for the corresponding date. The fore- package (James and Hornik, 2010) are provided for cast probability of temperatures below freezing at the converting to and from Julian dates. These functions grid locations can be computed with the cdf func- check for proper format (‘YYYYMMDD’ or ‘YYYYMMDDHH’). tion, which evaluates the BMA cumulative distribu- tion function: probFreeze <- cdf(srftFit, srftGridData, BMA forecasting date = "2004013100", value = 273.15) BMA generates full predictive probability density In the srft and srftGrid datasets, temperature is functions (PDFs) for future weather quantities. Ex- recorded in degrees Kelvin (K), so freezing occurs at amples of BMA predictive PDFs for temperature and 273.15 K. precipitation are shown in Figure1. These results can be displayed as image plots us- ing the plotProbcast function, as shown in Figure Surface temperature example. As an example, we 2, in which darker shades represent higher probabil- fit a BMA normal mixture model for forecasts of ities. 
The plots are made by binning values onto a plotProbcast surface temperature valid January 31, 2004, using plotting grid, which is the default in . the srft training data. The "ensembleData" object Loading the fields (Furrer et al., 2010) and maps srftData created in the previous section is used to (Brownrigg and Minka, 2011) packages enables dis- fit the predictive model, with a rolling training pe- play of the country and state outlines, as well as a riod of 25 days, excluding the two most recent days legend. because of the 48 hour prediction horizon. One of several options is to use the function Precipitation example. The prcpFit dataset con- ensembleBMA with the valid date(s) of interest as in- sists of the fitted BMA parameters for 48 hour ahead put to obtain the associated BMA fit(s): forecasts of daily precipitation accumulation (in hun- dredths of an inch) over the U.S. Pacific Northwest srftFit <- from December 12, 2002 through March 31, 2005, ensembleBMA(srftData, dates = "2004013100", as described by Sloughter et al.(2007). The fitted model = "normal", trainingDays = 25) models are Bernoulli-gamma mixtures with a point 1The package implements the original BMA method of Raftery et al.(2005) and Sloughter et al.(2007), in which there is a single, con- stant bias correction term over the whole domain. Model biases are likely to differ by location, and there are newer methods that account for this (Gel, 2007; Mass et al., 2008; Kleiber et al., in press).

Figure 1: BMA predictive distributions for temperature (in degrees Kelvin) valid January 31, 2004 (left) and for precipitation (in hundredths of an inch) valid January 15, 2003 (right), at Port Angeles, Washington at 4PM local time, based on the eight-member University of Washington Mesoscale Ensemble (Grimit and Mass, 2002; Eckel and Mass, 2005). The thick black curve is the BMA PDF, while the colored curves are the weighted PDFs of the constituent ensemble members. The thin vertical black line is the median of the BMA PDF (it occurs at or near the mode in the temperature plot), and the dashed vertical lines represent the 10th and 90th percentiles. The orange vertical line is at the verifying observation. In the precipitation plot (right), the thick vertical black line at zero shows the point mass probability of no precipitation (47%). The densities for positive precipitation amounts have been rescaled, so that the maximum of the thick black BMA PDF agrees with the probability of precipitation (53%).

Figure 2: Image plots of the BMA median forecast for surface temperature and BMA probability of freezing over the Pacific Northwest, valid January 31, 2004.

Precipitation example. The prcpFit dataset consists of the fitted BMA parameters for 48 hour ahead forecasts of daily precipitation accumulation (in hundredths of an inch) over the U.S. Pacific Northwest from December 12, 2002 through March 31, 2005, as described by Sloughter et al. (2007). The fitted models are Bernoulli-gamma mixtures with a point mass at zero that apply to the cube root transformation of the ensemble forecasts and observed data. A rolling training period of 30 days is used. The dataset used to obtain prcpFit is not included in the package on account of its size. However, the corresponding "ensembleData" object can be constructed in the same way as illustrated for the surface temperature data, and the modeling process also is analogous, except that the "gamma0" model for quantitative precipitation is used in lieu of the "normal" model.

The prcpGrid dataset contains gridded ensemble forecasts of daily precipitation accumulation in the same region as that of prcpFit, initialized January 13, 2003 and valid January 15, 2003. The BMA median and upper bound (90th percentile) forecasts can be obtained and plotted as follows:

data(prcpFit)

prcpGridForc <- quantileForecast(
    prcpFit, prcpGridData, date = "20030115",
    q = c(0.5, 0.9))

Here prcpGridData is an "ensembleData" object created from the prcpGrid dataset. The 90th percentile plot is shown in Figure 3. The probability of precipitation and the probability of precipitation above .25 inches can be obtained as follows:

probPrecip <- 1 - cdf(prcpFit, prcpGridData,
    date = "20030115", values = c(0, 25))

The plot for the BMA forecast probability of precipitation accumulation exceeding .25 inches is also shown in Figure 3.

Verification

The ensembleBMA functions for verification can be used whenever observed weather conditions are available. Included are functions to compute verification rank and probability integral transform histograms, the mean absolute error, continuous ranked probability score, and Brier score.

Mean absolute error, continuous ranked probability score, and Brier score. In the previous section, we obtained a gridded BMA forecast of surface temperature valid January 31, 2004 from the srft data set. To obtain forecasts at station locations, we apply the function quantileForecast to the model fit srftFit:

srftForc <- quantileForecast(srftFit,
    srftData, quantiles = c( .1, .5, .9))

The BMA quantile forecasts can be plotted together with the observed weather conditions using the function plotProbcast as shown in Figure 4. Here the R core function loess was used to interpolate from the station locations to a grid for surface plotting. It is also possible to request image or perspective plots, or contour plots.

The mean absolute error (MAE) and mean continuous ranked probability score (CRPS; e.g., Gneiting and Raftery, 2007) can be computed with the functions CRPS and MAE:

CRPS(srftFit, srftData)
#      ensemble      BMA
#      1.945544 1.490496

MAE(srftFit, srftData)
#      ensemble      BMA
#      2.164045 2.042603

The function MAE computes the mean absolute difference between the ensemble or BMA median forecast² and the observation. The BMA CRPS is obtained via Monte Carlo simulation and thus may vary slightly in replications. Here we compute these measures from forecasts valid on a single date; more typically, the CRPS and MAE would be computed from a range of dates and the corresponding predictive models.

²Raftery et al. (2005) employ the BMA predictive mean rather than the predictive median.

Brier scores (see e.g., Jolliffe and Stephenson, 2003; Gneiting and Raftery, 2007) for probability forecasts of the binary event of exceeding an arbitrary precipitation threshold can be computed with the function brierScore.

Assessing calibration. Calibration refers to the statistical consistency between the predictive distributions and the observations (Gneiting et al., 2007). The verification rank histogram is used to assess calibration for an ensemble forecast, while the probability integral transform (PIT) histogram assesses calibration for predictive PDFs, such as the BMA forecast distributions.

The verification rank histogram plots the rank of each observation relative to the combined set of the ensemble members and the observation. Thus, it equals one plus the number of ensemble members that are smaller than the observation. The histogram allows for the visual assessment of the calibration of the ensemble forecast (Hamill, 2001). If the observation and the ensemble members are exchangeable, all ranks are equally likely, and so deviations from uniformity suggest departures from calibration. We illustrate this with the srft dataset, starting at January 30, 2004:

use <- ensembleValidDates(srftData) >=
           "2004013000"

srftVerifRank <- verifRank(
    ensembleForecasts(srftData[use,]),
    ensembleVerifObs(srftData[use,]))
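To make the rank definition above concrete, here is a toy computation with invented numbers (plain R arithmetic; the package's verifRank function computes this for whole data sets):

ens <- c(271.2, 272.8, 273.5, 274.1)  # invented member forecasts
obs <- 273.0                          # invented verifying observation
1 + sum(ens < obs)                    # verification rank: 3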


Figure 3: Image plots of the BMA upper bound (90th percentile) forecast of precipitation accumulation (in hundredths of an inch), and the BMA probability of precipitation exceeding .25 inches over the Pacific Northwest, valid January 15, 2003.

Figure 4: Contour plots of the BMA median forecast of surface temperature and verifying observations at station locations in the Pacific Northwest, valid January 31, 2004 (srft dataset). The plots use loess fits to the forecasts and observations at the station locations, which are interpolated to a plotting grid. The dots represent the 715 observation sites.


Figure 5: Verification rank histogram for ensemble forecasts, and PIT histogram for BMA forecast PDFs for surface temperature over the Pacific Northwest in the srft dataset valid from January 30, 2004 to February 28, 2004. More uniform histograms correspond to better calibration.

k <- ensembleSize(srftData)

hist(srftVerifRank, breaks = 0:(k+1),
     prob = TRUE, xaxt = "n", xlab = "",
     main = "Verification Rank Histogram")
axis(1, at = seq(.5, to = k+.5, by = 1),
     labels = 1:(k+1))
abline(h = 1/(ensembleSize(srftData)+1), lty = 2)

The resulting rank histogram composites ranks spatially and is shown in Figure 5. The U shape indicates a lack of calibration, in that the ensemble forecast is underdispersed.

The PIT is the value that the predictive cumulative distribution function attains at the observation, and is a continuous analog of the verification rank. The function pit computes it. The PIT histogram allows for the visual assessment of calibration and is interpreted in the same way as the verification rank histogram. We illustrate this on BMA forecasts of surface temperature obtained for the entire srft data set using a 25 day training period (forecasts begin on January 30, 2004 and end on February 28, 2004):

srftFitALL <- ensembleBMA(srftData,
    trainingDays = 25)

srftPIT <- pit(srftFitALL, srftData)

hist(srftPIT, breaks = (0:(k+1))/(k+1),
     xlab = "", xaxt = "n", prob = TRUE,
     main = "Probability Integral Transform")
axis(1, at = seq(0, to = 1, by = .2),
     labels = (0:5)/5)
abline(h = 1, lty = 2)

The resulting PIT histogram is shown in Figure 5. It shows signs of negative bias, which is not surprising because it is based on only about a month of verifying data. We generally recommend computing the PIT histogram for longer periods, ideally at least a year, to avoid its being dominated by short-term and seasonal effects.

The ProbForecastGOP package

The ProbForecastGOP package (Berrocal et al., 2010) generates probabilistic forecasts of entire weather fields using the geostatistical output perturbation (GOP) method of Gel et al. (2004). The package contains functions for the GOP method, a wrapper function named ProbForecastGOP, and a plotting utility named plotfields. More detailed information can be found in the PDF document installed in the package directory at 'ProbForecastGOP/docs/vignette.pdf' or in the help files.

The GOP method uses geostatistical methods to produce probabilistic forecasts of entire weather fields for temperature and pressure, based on a single numerical weather forecast on a spatial grid. The method involves three steps:

• an initial step in which linear regression is used for bias correction, and the empirical variogram of the forecast errors is computed;

• an estimation step in which the weighted least squares method is used to fit a parametric model to the empirical variogram; and

• a forecasting step in which a statistical ensemble of weather field forecasts is generated,


by simulating realizations of the error field from the fitted geostatistical model, and adding them to the bias-corrected numerical forecast field.

Empirical variogram

In the first step of the GOP method, the empirical variogram of the forecast errors is found. Variograms are used in spatial statistics to characterize variability in spatial data. The empirical variogram plots one-half the mean squared difference between paired observations against the distance separating the corresponding sites. Distances are usually binned into intervals, whose midpoints are used to represent classes.

In ProbForecastGOP, four functions compute empirical variograms. Two of them, avg.variog and avg.variog.dir, composite forecast errors over temporal replications, while the other two, Emp.variog and EmpDir.variog, average forecast errors temporally, and only then compute variograms. Alternatively, one can use the wrapper function ProbForecastGOP with argument out = "VARIOG".

Parameter estimation

The second step in the GOP method consists of fitting a parametric model to the empirical variogram of the forecast errors. This is done by the Variog.fit function using the weighted least squares approach. Alternatively, the wrapper function ProbForecastGOP with entry out set equal to "FIT" can be employed. Figure 6 illustrates the result for forecasts of surface temperature over the Pacific Northwest in January through June 2000.

Figure 6: Empirical variogram of temperature forecast errors and fitted exponential variogram model.

The parametric models implemented in Variog.fit are the exponential, spherical, Gaussian, generalized Cauchy and Whittle-Matérn, with further detail available in the help files. The function linesmodel computes these parametric models.

Generating ensemble members

The final step in the GOP method involves generating multiple realizations of the forecast weather field. Each realization provides a member of a statistical ensemble of weather field forecasts, and is obtained by simulating a sample path of the fitted error field, and adding it to the bias-adjusted numerical forecast field.

This is achieved by the function Field.sim, or by the wrapper function ProbForecastGOP with the entry out set equal to "SIM". Both options depend on the function GaussRF in the RandomFields package that simulates Gaussian random fields (Schlather, 2011).

The output of the functions is both numerical and graphical, in that they return members of the forecast field ensemble, quantile forecasts at individual locations, and plots of the ensemble members and quantile forecasts. The plots are created using the image.plot, US and world functions in the fields package (Furrer et al., 2010). As an illustration, Figure 7 shows a member of a forecast field ensemble of surface temperature over the Pacific Northwest valid January 12, 2002.

Figure 7: A member of a 48-hour ahead forecast field ensemble of surface temperature (in degrees Celsius) over the Pacific Northwest valid January 12, 2002 at 4PM local time.
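As a concept check, the empirical variogram described above can be computed by hand for toy one-dimensional data (invented values; the package's Emp.variog and related functions handle real space-time forecast errors and binning properly):

set.seed(1)
x    <- seq(0, 10, by = 0.5)          # invented site locations
err  <- rnorm(length(x))              # invented forecast errors
d    <- as.vector(dist(x))            # pairwise distances
g    <- 0.5 * as.vector(dist(err))^2  # half squared differences
bins <- cut(d, breaks = 0:10)
tapply(g, bins, mean)                 # binned empirical variogram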


Summary

We have described two packages, ensembleBMA and ProbForecastGOP, for probabilistic weather forecasting. Both packages provide functionality to fit forecasting models to training data.

In ensembleBMA, parametric mixture models, in which the components correspond to the members of an ensemble, are fit to a training set of ensemble forecasts, in a form that depends on the weather parameter. These models are then used to postprocess ensemble forecasts at station or grid locations.

In ProbForecastGOP, a parametric model is fit to the empirical variogram of the forecast errors from a single member, and forecasts are obtained by simulating realizations of the fitted error field and adding them to the bias-adjusted numerical forecast field. The resulting probabilistic forecast is for an entire weather field, and shows physically realistic spatial features.

Supplementing model fitting and forecasting, both packages provide functionality for verification, allowing the quality of the forecasts produced to be assessed.

Acknowledgements

This research was supported by the DoD Multidisciplinary University Research Initiative (MURI) program administered by the Office of Naval Research under Grant N00014-01-10745, by the National Science Foundation under grants ATM-0724721 and DMS-0706745, and by the Joint Ensemble Forecasting System (JEFS) under subcontract S06-47225 from the University Corporation for Atmospheric Research (UCAR). We are indebted to Cliff Mass and Jeff Baars of the University of Washington Department of Atmospheric Sciences for sharing insights, expertise and data, and thank Michael Scheuerer for comments. This paper also benefitted from the critiques of two anonymous reviewers and the associate editor.

Bibliography

V. J. Berrocal, Y. Gel, A. E. Raftery, and T. Gneiting. ProbForecastGOP: Probabilistic weather field forecast using the GOP method, 2010. URL http://CRAN.R-project.org/package=ProbForecastGOP. R package version 1.3.2.

V. J. Berrocal, A. E. Raftery, and T. Gneiting. Combining spatial statistical and ensemble information in probabilistic weather forecasts. Monthly Weather Review, 135:1386–1402, 2007.

R. Brownrigg and T. P. Minka. maps: Draw geographical maps, 2011. URL http://CRAN.R-project.org/package=maps. R package version 2.1-6. S original by Richard A. Becker and Allan R. Wilks, R port by Ray Brownrigg, enhancements by Thomas P. Minka.

F. A. Eckel and C. F. Mass. Effective mesoscale, short-range ensemble forecasting. Weather and Forecasting, 20:328–350, 2005.

C. Fraley, A. E. Raftery, T. Gneiting, and J. M. Sloughter. ensembleBMA: Probabilistic forecasting using ensembles and Bayesian Model Averaging, 2010. URL http://CRAN.R-project.org/package=ensembleBMA. R package version 4.5.

C. Fraley, A. E. Raftery, T. Gneiting, and J. M. Sloughter. ensembleBMA: An R package for probabilistic forecasting using ensembles and Bayesian model averaging. Technical Report 516, University of Washington, Department of Statistics, 2007 (revised 2011).

C. Fraley, A. E. Raftery, and T. Gneiting. Calibrating multi-model forecasting ensembles with exchangeable and missing members using Bayesian model averaging. Monthly Weather Review, 138:190–202, 2010.

R. Furrer, D. Nychka, and S. Sain. fields: Tools for spatial data, 2010. URL http://CRAN.R-project.org/package=fields. R package version 6.3.

Y. Gel, A. E. Raftery, and T. Gneiting. Calibrated probabilistic mesoscale weather field forecasting: the Geostatistical Output Perturbation (GOP) method (with discussion). Journal of the American Statistical Association, 99:575–590, 2004.

Y. R. Gel. Comparative analysis of the local observation-based (LOB) method and the nonparametric regression-based method for gridded bias correction in mesoscale weather forecasting. Weather and Forecasting, 22:1243–1256, 2007.

T. Gneiting and A. E. Raftery. Weather forecasting with ensemble methods. Science, 310:248–249, 2005.

T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102:359–378, 2007.

T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration, and sharpness. Journal of the Royal Statistical Society, Series B, 69:243–268, 2007.

E. P. Grimit and C. F. Mass. Initial results for a mesoscale short-range ensemble forecasting system over the Pacific Northwest. Weather and Forecasting, 17:192–205, 2002.

T. M. Hamill. Interpretation of rank histograms for verifying ensemble forecasts. Monthly Weather Review, 129:550–560, 2001.


D. James and K. Hornik. chron: Chronological objects which can handle dates and times, 2010. URL http://CRAN.R-project.org/package=chron. R package version 2.3-39. S original by David James, R port by Kurt Hornik.

I. T. Jolliffe and D. B. Stephenson, editors. Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley, 2003.

W. Kleiber, A. E. Raftery, J. Baars, T. Gneiting, C. Mass, and E. P. Grimit. Locally calibrated probabilistic temperature forecasting using geostatistical model averaging and local Bayesian model averaging. Monthly Weather Review, 2011 (in press).

C. F. Mass, J. Baars, G. Wedam, E. P. Grimit, and R. Steed. Removal of systematic model bias on a model grid. Weather and Forecasting, 23:438–459, 2008.

A. E. Raftery, T. Gneiting, F. Balabdaoui, and M. Polakowski. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133:1155–1174, 2005.

M. Schlather. RandomFields: Simulation and analysis tools for random fields, 2011. URL http://CRAN.R-project.org/package=RandomFields. R package version 2.0.45.

J. M. Sloughter, A. E. Raftery, T. Gneiting, and C. Fraley. Probabilistic quantitative precipitation forecasting using Bayesian model averaging. Monthly Weather Review, 135:3209–3220, 2007.

J. M. Sloughter, T. Gneiting, and A. E. Raftery. Probabilistic wind speed forecasting using ensembles and Bayesian model averaging. Journal of the American Statistical Association, 105:25–35, 2010.

L. J. Wilson, S. Beauregard, A. E. Raftery, and R. Verret. Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian model averaging. Monthly Weather Review, 135:1364–1385, 2007.

Chris Fraley
Department of Statistics
University of Washington
Seattle, WA 98195-4322 USA
[email protected]

Adrian Raftery
Department of Statistics
University of Washington
Seattle, WA 98195-4322 USA
[email protected]

Tilmann Gneiting
Institute of Applied Mathematics
Heidelberg University
69120 Heidelberg, Germany
[email protected]

McLean Sloughter
Department of Mathematics
Seattle University
Seattle, WA 98122 USA
[email protected]

Veronica Berrocal
School of Public Health
University of Michigan
1415 Washington Heights
Ann Arbor, MI 48109-2029 USA
[email protected]


Analyzing an Electronic Limit Order Book

by David Kane, Andrew Liu, and Khanh Nguyen

Abstract The orderbook package provides facilities for exploring and visualizing the data associated with an order book: the electronic collection of the outstanding limit orders for a financial instrument. This article provides an overview of the orderbook package and examples of its use.

Introduction

The orderbook package provides facilities for exploring and visualizing the data associated with an order book: the electronic collection of the outstanding limit orders for a financial instrument, e.g. a stock. A limit order is an order to buy or sell a given quantity of stock at a specified limit price or better. The size is the number of shares to be bought or sold. An order remains in the order book until fully executed, i.e. until its size is zero as a result of trades. Partial executions occur as a result of trades for less than the entire size of the order.

Consider a simple order book containing five limit orders: sell 150 shares of IBM at $11.11, sell 100 shares of IBM at $11.08, buy 100 shares of IBM at $11.05, buy 200 shares of IBM at $11.05, and buy 200 shares of IBM at $11.01.

              Price     Ask Size
              $11.11    150
              $11.08    100
   300        $11.05
   200        $11.01
   Bid Size   Price

Orders on the bid (ask) side represent orders to buy (sell). The price levels are $11.11, $11.08, $11.05, and $11.01. The best bid at $11.05 (highest bid price) and the best ask at $11.08 (lowest ask price) make up the inside market. The spread ($0.03) is the difference between the best bid and best ask. The midpoint ($11.065) is the average of the best bid and best ask.

There are four types of messages that traders can submit to an order book: add, cancel, cancel/replace, and market order. A trader can add a limit order into the order book. She can also cancel an order and remove it from the order book. If a trader wants to reduce the size of her order, she can issue a cancel/replace, which cancels the order, then immediately replaces it with another order at the same price, but with a lower size. Every limit order is assigned a unique ID so that cancel and cancel/replace orders can identify the corresponding limit order. A market order is an order to immediately buy or sell a quantity of stock at the best available prices. A trade occurs when a market order "hits" a limit order on the other side of the inside market.

All orders have timestamps indicating the time at which they were accepted into the order book. The timestamp determines the time priority of an order. Earlier orders are executed before later orders. For example, suppose that the order to buy 100 shares at $11.05 was submitted before the order to buy 200 shares at $11.05. Now suppose a market order selling 200 shares is submitted to the order book. The limit order for 100 shares will be executed because it is at the front of the queue at the best bid. Then, 100 shares of the order with 200 total shares will be executed, since it was second in the queue. 100 shares of the 200 share order remain in the order book at $11.05.

A market order for more shares than the size at the inside market will execute at worse price levels until it is complete. For example, if a market order to buy 200 shares is submitted to the order book, the order at $11.08 will be fully executed. Since there are no more shares available at that price level, 100 shares at the $11.11 price level will be transacted to complete the market order. An order to sell 50 shares at $11.11 will remain in the order book. Executing these two market orders (a sell of 200 shares and a buy of 200 shares) on our hypothetical order book results in a new state for the order book.

              Price     Ask Size
              $11.11    50
   100        $11.05
   200        $11.01
   Bid Size   Price

Note that cancel/replace orders can lower the size of an order, but not increase it. Cancel/replace orders maintain the time priority of the original order, so if size increases were allowed, traders with orders at the highest time priority for a price level could perpetually increase the size of their order, preventing others from being able to transact stock using limit orders at that price level. See Johnson (2010) for more details on the order book.

Example

NVIDIA is a graphics processing unit and chipset developer with ticker symbol NVDA. Consider the order book for NVDA at a leading electronic exchange on June 8, 2010.

We create the orderbook object by specifying the location of our data file.

> library(orderbook)
> filename <- system.file("extdata",
+                         "sample.txt",
+                         package = "orderbook")
> ob <- orderbook(file = filename)
> ob <- read.orders(ob, 10000)
> ob

An object of class orderbook (default)
---------------------------------------
Current orderbook time:    09:35:02
Message Index:             10,000
Bid Orders:                631
Ask Orders:                1,856
Total Orders:              2,487

We read in the first 10,000 messages then show the object. The current time is 9:35:02 AM. This is the time of the last message read. The message index indicates which row in the data file the object has read through. The display also shows that there are 631 bids and 1,856 asks outstanding, for a total of 2,487 orders. This indicates that many earlier orders have been removed through either cancels or trades.

> summary(ob)

Current time is 09:35:02

Ask price levels:   540
Bid price levels:   179
Total price levels: 719
---------------------------------------
Ask orders:         1,856
Bid orders:         631
Total orders:       2,487
---------------------------------------
Spread:             0.02

Mid point:          11.37
---------------------------------------
Inside market

Best Bid:           11.36
Size:               2,700

Best Ask:           11.38
Size:               400

Using summary the total order information from show is repeated. Additionally, we see that there are 540 ask and 179 bid price levels, for a total of 719. This indicates that many orders have been submitted at the same price level. The spread is $0.02, and the midpoint is $11.37. The inside market is composed of 2,700 shares offered at the best bid of $11.36 and 400 shares offered at the best ask of $11.38.

> display(ob)

Current time is 09:35:02

              Price     Ask Size
              11.42     900
              11.41     1,400
              11.40     1,205
              11.39     1,600
              11.38     400
   2,700      11.36
   1,100      11.35
   1,100      11.34
   1,600      11.33
   700        11.32
   Bid Size   Price

display shows the inside market, along with the four next best bid and ask price levels and the size at each price level.

> plot(ob)

[Figure: "Order Book − 09:35:02", with BID and ASK sides, price levels on the y-axis and size in shares on the x-axis; not reproduced here.]

plot is a graphical representation of display. Price levels are on the y-axis, and size is on the x-axis. The maximum and minimum price levels displayed by default are 10% above and below the midpoint. Note the large number of shares at $11.01. It is helpful to know the number of orders which make up the large size at that price level. Using the "[" method we can view the order information at particular price levels.
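As a quick sanity check on the summary output, the spread and midpoint follow directly from the inside market by plain arithmetic (using the numbers shown above):

> best.bid <- 11.36
> best.ask <- 11.38
> best.ask - best.bid        # spread: 0.02
> (best.bid + best.ask) / 2  # midpoint: 11.37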


> ob["11.01"] > plot(ob, bounds = 0.01, type = "sd")

price size type time id Supply and Demand

1 11.01 109 BID 34220988 4403084 150 2 11.01 50000 BID 34220988 4403085 3 11.01 100 BID 34220988 4403086 100 There is an order for 50,000 shares at the $11.01 price level that accounts for almost all of the size. We 50 can view a plot of the number of orders rather than the number of shares at each price level by specifying 0 type = ’o’ when using plot. In the previous plot Price (%) the maximum and minimum price levels were 10% off from the midpoint, but for this plot we specify a −50 range of only 3.3%.

Note the large number of orders at $11.00. The −100 "[" method returns a data.frame, so we can use nrow to return the number of orders at $11.00. −150 0 20 40 60 80 100 120 > nrow(ob["11.00"]) Size (%) 09:35:02 [1] 56 orderbook has methods for creating new orderbook objects at specified clock times of interest. There are 56 orders at that price level, which con- read.time returns an orderbook object that has read firms what we see in the plot. all messages before the specified time. For example, > plot(ob, bounds = 0.033, type = 'o') this returns the orderbook object at 9:30:00. > ob <- read.time(ob, "9:30:00") Order Book − 09:35:02 BID ASK read.orders is used to move forwards or back-

● wards in the order book by a specified number of ● 11.70 ● orderbook ● messages. In this case, an object at 50 mes- ● ● ● ● sages before the current message is returned. ● ● 11.60 ● ● ● > ob <- read.orders(ob, n = -50) ● ● ● ● ● ● 11.50 > ob ● ● ● ● ● ● ● An object of class orderbook (default) ● ● ● ● 11.38 ● ------11.36 ●

Price ● ● ● ● Current orderbook time: 09:28:41 ● 11.30 ● ● ● ● Message Index: 292 ● ● ● ● ● Bid Orders: 72 11.20 ● ● ● Ask Orders: 81 ● ● ● ● ● ● Total Orders: 153 11.10 ● ● ● ● ● ● ● ● ● 11.00 Data 60 50 40 30 20 10 0 10 20 30 40 50 60 Number of Orders Data files should contain all messages for one stock on a single trading day. Most brokers and exchanges have their own format for transmitting raw message The type argument on plot allows for an “sd” data to customers, so it would be unfeasible for us option which shows supply and demand curves to write scripts to automatically process all data for- for the order book. The demand (supply) curve is mats. Consequently, raw data for an orderbook ob- downsloping (upsloping). This is because more peo- ject must be in the following form: ple want to buy (sell) a stock when the price de- creases (increases). The ask (bid) prices are normal- type,time,id,price,size,type,status ized by the absolute value of the difference between A,34226539,5920814,25.95,100,ASK,TRUE the highest (lowest) plotted ask (bid) price level and A,34226788,5933949,25.91,100,BID,FALSE the the midpoint. Following Cao et al.(2009), the R,34226900,5933949,50 sizes are normalized by the sum of the sizes across C,34226904,5920814 all plotted price levels for each side. T,34226904,755377,25.95,100,TRUE

where A, R, T, and C mean Add, Replace, Trade, and Cancel, respectively. The second column is the timestamp of the message in milliseconds after midnight, and the third column is the order ID. For a Replace the next column is the new size, while for Add and Trade a column for price comes before the size column. Add messages also have the type of order (BID/ASK) in the sixth column. The optional seventh (sixth) column is TRUE if the order (trade) belongs to the user, and FALSE otherwise. This allows the user to create plots that show the time priority of his own orders. If the column is omitted, the first line of the data file should be type, time, id, price, size, type and not include status.

In this example a user order to sell 100 shares at $25.95 is added to the order book, followed by an order to buy 100 shares at $25.91. The size of the order at $25.91 is then replaced to 50 shares. Finally, the order at $25.95 is cancelled, and a trade for 100 shares at $25.95 occurs.
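As a rough sketch of how such a file could be inspected with base R (an illustration assuming a file laid out exactly as above; this is not part of the orderbook package): the header repeats the name type, which read.csv() renames to type.1, and fill = TRUE pads the shorter replace and cancel records with NAs.

> msgs <- read.csv("sample.txt", fill = TRUE, stringsAsFactors = FALSE)
> table(msgs$type)                  # counts of A, C, R and T messages
> adds <- subset(msgs, type == "A") # only adds carry price, size and side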

Analyzing trades

A user can create plots that show the time priority of his own orders if a status column is present in the data file.

> filename <- system.file("extdata",
+                         "tradersample.txt",
+                         package = "orderbook")
> ob <- orderbook(file = filename)
> ob <- read.time(ob, "9:30:05")
> ob <- next.trade(ob)
> ob

An object of class orderbook (trader)
---------------------------
Current orderbook time:  09:30:05
Message Index:           6,062
Bid Orders:              164
Ask Orders:              252
Total Orders:            416

Note that this orderbook object is of type trader. The next.trade function sets the state of the order book to when the trade after the current time occurs. There is also a previous.trade function with the same functionality moving backwards.

> view.trade(ob, tradenum = 584)
        trade 584
row     6063
time    09:30:05
id      636783
price   25.94
size    1000
status  FALSE

> mid.point(ob)
 price
25.935

Since the trade price is higher than the midpoint price, we know that the trade occurred as a result of an ask order getting hit. Note that trade data is stored into the order book only after it has read through the corresponding trade message.

> midpoint.return(ob, tradenum = 584, time = 10)
          midpoint.return
10 second           0.065

The midpoint return is the difference in cents between the execution price and the midpoint price after a specified period of time. For example, the above calculates the ten second midpoint return for the first trade. Since it was a sell order, the midpoint return will be positive if the stock price decreases, and negative if the stock price increases.

> ob <- read.time(ob, "9:30:15")
> plot(ob, type = "t", bounds = 0.02)

[Figure: Order Book - 09:30:15, trader view. Price levels between 25.86 and 25.89 on the y-axis, size in shares on the x-axis.]

This plot shows two pennies above and below the best bid and best ask. We see that the midpoint has dropped to 25.875, confirming the midpoint return above. Orders at these price levels are shown in time priority, with the earliest submitted order being closest to the middle y-axis. Note the red order: this is an order marked TRUE by the user, indicating that it belonged to him.
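The reported midpoint return can also be checked by hand from the numbers above (a hedged illustration, not a package call):

> exec.price <- 25.94    # execution price of trade 584
> mid.later  <- 25.875   # midpoint ten seconds after the trade
> exec.price - mid.later
[1] 0.065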

Simulation

Simulating and modelling the intraday decisions of traders is a topic of active research in behavioral finance and economics. orderbook supports adding, replacing, and cancelling orders. Add orders require the price, size, and type (ASK/BID) of the limit order. Time and ID are optional, and will default to the maximum time + 1 and the maximum ID + 1. Replace messages require the new size and ID. Cancel orders only require ID. In addition, market orders can be issued to the order book. Market orders require size and side (BUY/SELL).

> ob <- add.order(ob, 11.20, 300, "ASK")
> ob <- remove.order(ob, 1231883)
> ob <- replace.order(ob, 1231883, 150)
> ob <- market.order(ob, 200, "BUY")

Using these tools, the user can write functions to simulate an order book. In the following example, we consulted Gilles (2006). We simulate 1,000 messages. The messages are chosen based on the following probabilities: 50% for a cancel message, 20% for a market order, and 30% for a limit order. In the event of a cancel message the order cancelled is randomly chosen. Market orders have a 50-50 chance for a buy or sell order. The size of the market order always corresponds to the size of the individual order at the best ask or bid with the highest time priority. Limit orders have a 50-50 chance to be an ask or bid. There is a 35% chance for the price of a limit order to be within the spread. If the price is outside of the spread, a price is chosen using a power law distribution. Finally, the size follows a log-normal distribution. A plot of this example simulation is shown below.

> ob <- simulate(ob)
> plot(ob)

[Figure: Order Book - 09:35:09, the simulated book. Bid and ask price levels between roughly 23.00 and 28.00, with the inside market near 25.70/25.74.]
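A minimal sketch of one step of the message-generating scheme just described, using only functions shown in this article (mid.point(), add.order(), market.order()). Cancels, the power-law price draw, and matching the market-order size to the best queue are omitted, and the 40% market-order rate, one-tick offset and log-normal parameters are illustrative assumptions; the package's own simulate() implements the full scheme.

> sim.step <- function(ob) {
+   if (runif(1) < 0.4) {             # market order, 50-50 buy or sell
+     ob <- market.order(ob, 100, sample(c("BUY", "SELL"), 1))
+   } else {                          # limit order, 50-50 ask or bid
+     side  <- sample(c("ASK", "BID"), 1)
+     price <- mid.point(ob) + ifelse(side == "ASK", 0.01, -0.01)
+     size  <- max(1, round(rlnorm(1, meanlog = 4)))
+     ob    <- add.order(ob, price, size, side)
+   }
+   ob
+ }
> for (i in 1:100) ob <- sim.step(ob)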

Gilles (2006) used simulations to test the impact of liquidity on price level stabilization. He concluded that most price changes are the result of uninformed traders (who supply liquidity), rather than informed traders (who demand liquidity).

Conclusion

The orderbook package is part of a collection of packages (see Campbell et al. (2007) and Kane and Enos (2006)) for working with financial market data. R provides all the necessary tools for managing institutional sized portfolios.

David Kane
Kane Capital Management
Cambridge, MA
USA
[email protected]

Andrew Liu
Williams College
Williamstown, MA
USA
[email protected]

Khanh Nguyen
University Of Massachusetts at Boston
Boston, MA
USA
[email protected]

Bibliography

K. Campbell, J. Enos, D. Gerlanc, and D. Kane. Backtests. R News, 7(1):36-41, April 2007.

C. Cao, O. Hansch, and X. Wang. The information content of an open limit-order book. The Journal of Futures Markets, 29(1):16-41, 2009.

D. Gilles. Asynchronous Simulations of a Limit Order Book. PhD thesis, University of Manchester, 2006.

B. Johnson. Algorithmic Trading & DMA: An introduction to direct access trading strategies. 4Myeloma Press, London, 2010.

D. Kane and J. Enos. Analysing equity portfolios in R. R News, 6(2):13-19, May 2006.


Giving a useR! Talk

by Rob J. Hyndman

Abstract Giving a useR! talk at the international R user conference is a balancing act in which you have to try to impart some new ideas, provide sufficient background and keep the audience interested, all in a very short period of time.

I've sat through more than my fair share of bad conference talks. Slides full of equations flashing past quickly, tables containing tiny figures that no-one can read, most of the audience lost on the third slide. Anyone who has attended even one conference will have seen these examples and more.

What is the aim of the talk?

The problems often stem from confusion about the purpose of the talk. Some speakers clearly think the aim of a talk is to impress the audience with their technical skills, even (or especially) if that means the audience does not understand what they are talking about. Other speakers appear to believe that talks are for those few members of the audience who are working in the same narrow research area — and so no attempt is made to provide an introduction to the topic. Still others see conference talks as an oral version of an academic paper, in all its painful detail.

Instead, I like to think of conference talks as advertisements for the associated paper or R package, or shared wisdom gained through valuable experience. The talk is not intended to cover everything you have done, or even to summarize what you have done. In giving a talk, I am hoping that (1) everyone in the audience will have a clear idea of what I have been working on; and (2) some of those in the audience will be motivated to read my paper, download the associated R package, or put into practice some of my advice.

These aims mean that I never bother with proofs or derivations — they are for the people who read the paper. Similarly, there is no point discussing the internals of R code or algorithm technicalities. Those who care will explore the details afterwards.

Instead, I tend to spend at least half the time going through a motivating example and reviewing the relevant background — most of the audience will need that context in order to understand what the talk is about. In fact, it is reasonable to assume that the audience knows about as much as you did at the start of your work in this area. That is, probably very little. So it is important to spend some time providing background information or you will lose the audience quickly. Do not assume the audience already knows what you have spent long hours learning on your own.

Consider the context

For a useR! talk, there is the additional challenge of getting the right balance of code, technical and application detail. The appropriate mix depends on the session.

If you are giving a kaleidoscope talk (for a wide audience), you need to make a particular effort to make your talk accessible, keeping examples simple and minimising technical detail. Speakers in focus sessions should consider what other talks are in their session in order to determine how specialised the audience is likely to be (e.g., people at a high performance computing session are probably comfortable with technical details, but those at a session on ecology are likely to be more interested in the application detail.)

Looking at some related talks from previous years can be useful. Many talks from previous useR! conferences can be found on the conference websites (see http://www.r-project.org/conferences.html for links).

A suggested structure

I recommend the following structure for a conference talk:

(a) Start with a motivating example demonstrating the problem you are trying to solve;

(b) Explain existing approaches to the problem and their weaknesses;

(c) Describe your main contributions;

(d) Show how your ideas solve the problem/example you started with.

This structure will not necessarily work for every talk, but it is a good place to start. In particular, beginning with a motivating example is much better than setting up the problem algebraically.

For a 15 minute conference presentation, I divide the time approximately into 3/4/6/2 minute sections.

Using this structure, you will have barely started on your own contributions when you are half way through your allocated time. Resist the temptation to trim the first two sections. The audience needs time to absorb the purpose of your work and the context in which it is set.


Keeping to time

Do not deliver a 30-minute talk in 15 minutes. Nothing irritates an audience more than a rushed presentation. It is like trying to get a drink out of a fire hydrant. Your objective is to engage the audience and have them understand your message. Do not flood them with more than they can absorb.

Present only as much material as can reasonably fit into the allocated time. Generally that means no more than one slide per minute. I tend to use an average of about 0.8 slides per minute of talking. It is helpful to use some slides as time-markers and make sure you are at the relevant slide at the right time.

Never go over time. Keep an eye out for signals from the session chair indicating when you need to conclude. If necessary, be prepared to cut your talk short and finish with a quick summary.

Rehearsing is invaluable. Practise. Out loud. Standing up. Using a data projector. Get colleagues to listen to you, including some who are not knowledgeable on the topic of your talk; they will be able to point out places where you may not come across clearly. If several people in your research group are attending the same conference, get together beforehand and practice your talks to each other. Make such rehearsals as realistic as possible and time them.

After the rehearsal, you may wish to delete some of your material to ensure the talk can be delivered within the allocated time. Balance the amount of material you present with a reasonable pace of presentation. If you feel rushed when you practise, then you have too much material. Budget your time to take a minute or two less than your maximum allotment.

Preparing slides

High quality slides are an important part of a good presentation. I recommend using the beamer package with LaTeX. MS-Powerpoint is also popular, but it makes it much harder to format mathematical symbols and equations nicely.

Avoid distracting slide transitions and dazzling slide themes. You want the audience to focus on your content, not wonder how you implemented some gimmick. Save animation for aiding interpretation.

Use at least a 20-point font so everyone in the room can read your material. (The default font size in beamer is generally too small — use a 14pt font in beamer which is equivalent to 23pt on the screen.) Similarly, view your slides at 50% on screen to check that code and figures are legible. Unreadable material is worse than useless — it inspires a negative attitude by the audience to your work and, ultimately, to you. Many R users are near-sighted; do not make it any harder for them.

Limit the material on each slide, keeping the number of words to a minimum. Do not include every detail of what you plan to say. Keep it simple. It is easy, but boring, to have bulleted lists summarizing your main points. Instead, use pictures and graphics as much as possible. Resort to text only when illustrations fail you.

If you must present tables, only show the essential information. No-one is going to read a slide full of tiny numbers, let alone absorb the information. Very often, a table can be replaced with a suitable graphic.

Give only the most necessary mathematical details. People not working in the research area can find equations difficult to follow as part of a rapidly delivered presentation. When you do need to use equations, define your notation.

Avoid showing too much R code. Trim it back to the bare minimum so the audience can focus on the essential details. Use different colours to clearly distinguish R code from output.

Slides are not there to remind you what to say — use a page of notes for that purpose. The slides are for the audience — make sure everything on your slides is there because it will help the audience understand what you are saying.

On your last slide, give your website or email address for people to contact you if they want to read the paper or download your R code or just ask a question.

It is useful to add slide numbers so the audience can refer back to specific slides in question time.

I spend a lot of time going over my slides looking for ways to improve them. After a presentation is essentially complete, I go through all the slides to see what I can remove — less text is better. I also look for places where I can simplify the presentation, where I can replace text with graphics, and where the titles can be improved. I often spend almost as much time refining the slides as in creating the first version.

Always preview your slides on the computer being used for the talk. You will look foolish if symbols and Greek letters that looked OK on your computer translate into something unreadable on the big screen. This is much more common if you use MS-Powerpoint as the fonts may not be embedded in the document and so equations lose important symbols.

Giving the presentation

By the time you give the talk, you will have spent enough time preparing your slides and practising your talk that you should feel confident of giving a great presentation.

At the conference, make sure you talk to the session chair beforehand so they are aware of who you are. Arrive at the meeting room 10 minutes before the session begins to take care of last-minute details.

Talk at a pace that everybody in the audience can understand. Speak slowly, clearly, and loudly, especially if your English is heavily accented. Speak loudly enough to be easily heard by those sitting in the back row.

Engage the audience — speak to them, not to the projector screen or to your notes. It helps to move around, look at your audience and smile.

Never apologize for your slides. Make apologies unnecessary by producing great slides in the first place. Do not say, "I know you can't see this, but . . . " If the audience cannot read your slide, there is no point displaying it.

Do not apologize for incomplete results either. Researchers understand that all research is incomplete. Just present the results and let the audience judge. It is okay to say, "work is on-going".

When finished, thank the audience for their attention. Stay for the entire session, for the courtesy and benefit of your audience and your co-speakers. Afterwards, be available for people to ask you questions.

Rob J Hyndman
Department of Econometrics & Business Statistics
Monash University
Australia
[email protected]


Tips for Presenting Your Work

by Dianne Cook

Abstract With the international R user conference, useR! 2011, approaching, many participants may be contemplating how to put their thoughts together for presentation. This paper provides some suggestions for giving presentations and making posters.

Some background

Just after completing a practice of the talk I planned to give about my PhD research in several upcoming academic job interviews, my advisor, Andreas Buja, sat me down, and completely re-drafted my talk! I had produced what I had seen many times presented by numerous speakers in weekly seminars at Rutgers University, slide after slide of details. Andreas explained that while this might be appropriate for a paper it is not the best approach for a talk. We laid out the key problem that my work addressed, and then put in a slide that simply said "Stay tuned. . . ." The next few slides addressed the methodology, and the key answers to the problem came near the end of the talk.

The "Stay tuned" sticks with me. Although, today this phrase might be something of a cliché, overused by the electronic media to keep your attention through another barrage of advertising, the effect is useful to re-create. What did this phrase do for the talk at the time? It allowed it to build interest in the topic early, and let the audience know they would be rewarded later. The middle of the talk also showed a VHS video of high-dimensional data being explored using various tours — computers that could display videos were not so portable in 1993! (That video is now available in Flash format at http://stat-graphics.org/movies/grand-tour.html.)

Key pointers

The web sites Schoeberl & Toon (2011) and van Marle (2007) have some useful tips on giving scientific presentations. Here are a few collected from my experiences.

• If your audience remembers and takes away one thing your talk is a success. What is this going to be for your talk?

• Make a plan, map out the start, with motivation, middle and finish. Sketch out the slides, and what you want to communicate with each slide.

• Avoid slide after slide of bullet points (the Powerpoint syndrome), or slide after slide of equations.

• Establish your credentials. For a traditional audience this might be a few equations, and the key parts to a proof. For the statistical computing and computational statistics audience show some code fragments — ones that may be particularly elegant, or illustrate the critical pieces. (Some colleagues and I have occasionally jested that watching a programmer program might be more interesting than listening to some talks. This is only partially in jest because watching seasoned programmers tackling a computational problem can be very insightful.)

• Draw the audience in with convention, a usual format, style of slides, font, . . . . Familiarity is in-groupness, shows the audience that they are like you, that you are bringing them with you, and that you can be trusted. Keep the audience's attention, and curiosity, by breaking the rules. Do something unexpected, too.

• Use pictures, that is, plots (of data). BUT, BUT, please don't say "You can see from the picture that . . . ." without helping interpret with something like "because the points follow a curved pattern". That is, explain the pattern in the graphical elements of the plot that allows one to make the analytical leap.

• Use pictures, or, cartoons, to help explain concepts.

• Use pictures, for visual stimulation. Not because they "are devices for showing the obvious to the ignorant" or to "prevent the dullards in the audience from falling asleep" (Tufte, 1990) but because graphics are beautiful. Plots evoke creativity.

• Tell a story.

Practicalities

Many statisticians use LaTeX, with Beamer style, to make the slides for their talk. This is convenient for equations, and makes elegant, beautifully typeset slides. It has a lot of flexibility to make colored text, different fonts, navigation strips, including figures and even animations or movies. My preference, however, is to use keynote on the Mac. It provides a lot more flexibility in the formatting, the ability to use wild fonts, such as my own handwriting, and seamless incorporation of movies and animations. To include equations I have a generic 'tmp.tex' file on my computer with all the equations that I have ever used, and I cut and paste these out of the pdf from Preview. Keynote maintains the quality of the equations, and images, through resizing, unlike Powerpoint. It also, like TeX, keeps figures in separate files; the 'slides.key' might look like a file but is really a directory.

Just out of graduate school I would meticulously write down every word that I wanted to say in association with the talk. I would practice, and practice, and practice, then throw the notes away to actually give the talk. I still occasionally do this. Why? With a limited time frame, and under the glare of my colleagues, making language precise helps get the message across, which is respectful of the audience. Making the notes makes the language precise, but giving the talk without notes lends spontaneity.

Check the equipment. How does your font and graphics appear from the back of the room? Does your computer work with the projector? Or does the provided computer have the software that you need for your presentation?

How do you choose colors? The colorspace (Ihaka, Murrell, Hornik & Zeileis, 2011) package in R provides a reasonable selection of color palettes for plots, more specifically applied to statistical graphics than Cynthia Brewer's map work in the RColorBrewer package (Neuwirth, 2011). Several web sites (Etre, 2011; Dougherty & Wade, 2011) provide tools to help color blind proof your work. Robbins (2006) is a basic guide to good graphics. The ggplot2 package (Wickham, 2009) has elegant and cognitively perceptive plot defaults.

Not a talk, a poster!

Slides from a 2007 Joint Statistical Meetings Introductory Overview Lecture (Cook, 2007) give guidelines for constructing a poster. Posters allow the presenter to engage the audience in small group individualized discussion. But in order to engage a small audience you need to attract the attention of passers-by. Designing your poster with a visual focal point that can be seen from several feet away will draw people to your work. Some of the key recommendations in this lecture are:

• Plan the layout and flow of the poster.

• As with a talk, decide on the main message, and determine who is your audience.

• Choose your color scheme, keeping in mind color blindness limitations, readability, and avoid color combinations that have subliminal meanings, e.g. red, yellow and black of the German flag.

• Choose your text size and font. Titles should be about 100pt font, headings 50pt and text in the body at least 25pt. Avoid all capitals.

• Data plots make good focal points. A contextual image can help the visitors grasp the context of the data quickly, and provide people with something familiar to draw their attention. Good quality graphics are important, generated by R software for example.

• A movie, or audio recording, can help draw the attention of passers-by. These should not be substitutes for engaging the audience in discussion.

Remember that there are lots of bad examples of posters at Statistics meetings. The excuse of "this is how everyone else does their poster" is not a reasonable justification for perpetuating poor scholarship. Each generation is held to higher and higher standards as we develop our understanding about good practices. Excellent advice on producing posters can be found at the web site by Cape Higher Education Consortium (2011). Also the web site by Purrington (2011) has some useful discussion about designing scientific posters. The Data Expo competitions (ASA, 2011) run in conjunction with the Joint Statistical Meetings often have examples of good posters, and examples of previous useR! posters can be found via http://www.r-project.org/conferences.html.

Responsible audience?

Occasionally, well maybe, more than occasionally, I hear some members of our profession extolling the virtues of a talk — but it is clear they didn't have a clue what the talk was about. There is a responsibility of the audience to not be impressed because they are snowballed by a speaker. The audience has a right to expect the speaker to make the work clear, and easier to understand, and to do some of the work of deciphering the material for them.

Last words

Be mindful, that giving a talk in front of peers is a privilege — not many people in this world have the opportunity to speak their mind and be listened to, particularly in a prominent setting such as useR!. Many people enthuse about a TED talk (Rosling, 2006). I've recently been pointed to another by Chris Wild (2009) which is a marvellous statistical presentation.


Bibliography

American Statistical Association. (2011) Data Expo Posters. URL http://stat-computing.org/dataexpo/

Beamer Developers. (2011) Beamer — A LaTeX class for producing presentations. URL https://bitbucket.org/rivanvx/beamer/wiki/Home

Cape Higher Education Consortium. (2011) Information Literacy. URL http://www.lib.uct.ac.za/infolit/poster.htm

D. Cook. (2007) Improving Statistical Posters. URL http://www.amstat.org/meetings/jsm/2008/pdfs/ImprovingStatisticalPosters.pdf

B. Dougherty and A. Wade. (2011) URL http://www.vischeck.com/vischeck/

Etre Limited. (2011) URL http://www.etre.com/tools/colourblindsimulator/

R. Ihaka, P. Murrell, K. Hornik, and A. Zeileis. (2011) colorspace: Color Space Manipulation. URL http://cran.r-project.org

E. Neuwirth. (2011) RColorBrewer: ColorBrewer palettes. URL http://cran.r-project.org

C. Purrington. (2011) Advice on Designing Scientific Posters. URL http://www.swarthmore.edu/NatSci/cpurrin1/posteradvice.htm

N. Robbins. (2006) Creating More Effective Graphs. URL http://www.wiley.com

H. Rosling. (2006) GapMinder. URL http://www.ted.com/talks/lang/eng/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html

M. Schoeberl and B. Toon. (2011) Ten Secrets to Giving a Good Scientific Talk. URL http://www.cgd.ucar.edu/cms/agu/scientific_talk.html

E. Tufte. (1990) The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.

A. J. van Marle. (2007) The Art of Scientific Presentations. URL https://www.cfa.harvard.edu/~scranmer/vanmarle_talks.html#Technical_preparation

H. Wickham. (2009) ggplot2: Elegant graphics for data analysis. useR. Springer.

C. Wild. (2009) Early Statistical Inferences: The Eyes Have It. URL http://www.stat.auckland.ac.nz/~wild/09.wild.USCOTS.html

Dianne Cook
Department of Statistics
Iowa State University
USA
[email protected]


Book Reviews

Forest Analytics with R (Use R)

Andrew R. ROBINSON and Jeff D. HAMANN. Berlin: Springer, 2011. ISBN 978-1-4419-7761-8. xv + 339 pp. €57.99 (paperback).

Forestry is a broad field touching economical management as well as landscape planning, survey design/analysis, spatial statistics, growth simulations and much more. Accordingly, also the topics related to statistical computing (and hence R) cover a lot of ground. The present book has luckily refrained from trying to cover all possible aspects, while at the same time still being surprisingly comprehensive. It aims at forest scientists, managers, researchers and students, who have little experience with R.

The book is organised in four parts. Part I, Introduction and Data Management, introduces R and typical forest data sets, of which several are provided in the companion R-package FAwR. Part II, Sampling and Mapping, illustrates the use of the survey package to estimate mean and variance of typical forest survey designs. It then continues to briefly sketch imputation and spatial interpolation techniques. Part III, Allometry and Fitting Models, covers regression, non-linear regression and mixed effect models. Part IV, Simulation and Optimization, introduces the interaction of C-code with R through two forest growth models and uses a question from forest planning to illustrate the use of linear programming for optimisation.

The four parts differ greatly in their style, depth and quality. Part I can easily be skipped by the more experienced R-user, but offers a useful and gentle introduction to general R functionality with respect to the following three parts. To appreciate the sampling analyses of part II (including, for example, simple random and systematic sampling, cluster and two-stage sampling), a more detailed knowledge of the survey package (Lumley, 2010) and of sampling designs in general (e.g., Lohr, 2009) is required. I found the notation for variance estimation somewhat unsavoury, because it deviated from both these books, as well as the book dedicated to forest sampling (Gregoire and Valentine, 2008). Imputation and interpolation techniques receive only a superficial brush, e.g. focussing on (spatial!) nearest-neighbour imputation without mentioning regression-based imputation (Harrell, 2001) at all.

In stark contrast, regression, non-linear and mixed models are dealt with much more carefully and very competently. While the purpose of the book is not to explain so much why, but how, to carry out certain computations, this section is a showcase of how to teach model interpretation. Even after a decade of stats teaching I found this part inspiring. The key ingredient, I think, is that the authors carry a single example through different analyses. They show improvements (or lack thereof) compared to previous models, explain very accurately the model output and they give intuitive and rule-of-thumb guidance on which "knobs to turn" during analysis.

The final part is again rather different in style. It provides walk-through examples without much explanation of why one would want to use this specific forest growth model (provided through the rconifers package: Hamann et al., 2010) or any implementational details. The optimisation section using linear programming is virtually incomprehensible without parallel reading on the subject. This fourth part feels like an incomplete draft shoved in for the sake of topic coverage.

Overall, Forest Analytics with R offers an entry to several statistical computation topics encountered by forestry students and practitioners. The sampling section is sufficient for many standard designs. Only the regression section is both in-depth and generic, while the simulation studies are too specific to offer themselves to an easy transfer to other problems.

Carsten F. Dormann
Helmholtz Centre for Environmental Research-UFZ, Leipzig, Germany

Bibliography

T. G. Gregoire and H. T. Valentine. Sampling Strategies for Natural Resources and the Environment. Chapman and Hall/CRC, New York, 2008. ISBN 9781584883708.

J. Hamann, M. Ritchie, and the R Core team. rconifers: Alternative Interface to the CONIFERS Forest Growth Model, 2010. R package version 1.0.5.

F. E. Harrell. Regression Modeling Strategies - with Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer, New York, 2001. ISBN 0-387-95232-2.

S. L. Lohr. Sampling: Design and Analysis. Duxbury Press, North Scituate, Mass., 2nd edition, 2009. ISBN 9780495105275.

T. Lumley. Complex Surveys: A Guide to Analysis Using R. Wiley, Hoboken, NJ, 2010. ISBN 9780470284308.


Conference Review: Third Meeting of Polish R Users, The Influenza Challenge

by Przemysław Biecek

The third Polish R Users' Conference took place at the Mathematical Institute of Polish Academy of Sciences in Wrocław on 24-25 of September 2010. The meeting was traditionally named WZUR, an abbreviation which has the same meaning as "equation" in spoken Polish. This year the conference adopted a fresh format. I wish to share with you both the reasons behind changing the format and the afterthoughts of the conference participants.

This time the leading subject for the conference was the epidemiology of influenza. Almost 50 people from various parts of Poland and representing numerous professions (statisticians, mathematicians, medical doctors, students) gathered to find out more about influenza and methods for analyzing data related to this disease. A few months in advance of the conference, we had gathered weekly reports covering the last 11 years of influenza occurrences in Poland. Both the absolute numbers and the rates per 100,000 individuals of infections and mortalities were available for four age groups and for 16 different regions. All participants had access to the same data set and all of them had time to experiment on the dataset using an assortment of statistical methods. During the two-day conference various applications were presented of existing statistical procedures available in R as well as newly developed methods.

The dataset was a common basis for all talks. We requested all presenters to prepare talks composed of three parts. In the first part the statistical methods, the mathematical theory behind them along with some insights were presented. In the second part a set of R functions covering these methods was introduced. And finally in the third part a sample application of presented methods to the dataset described above was demonstrated. All participants were familiarized with the dataset, and this created a common platform which facilitated better communication among statisticians and physicians. The epidemiology of influenza was tackled from different angles. Results from a variety of statistical procedures created a framework for discussion between people who are developing new statistical methods and people who are applying them.

The conference was conducted in Polish, and most of the conference materials are in Polish but the equations and figures are international, and presentations are available on the web site http://www.biecek.pl/WZUR. Almost all talks were recorded and the video files are freely available from the above address.

List of presented talks:

1. Short-term changes in mortality during influenza epidemics, Daniel Rabczenko (PZH),

2. Calibration estimators in the study of influenza, Tomek Józefowski, Marcin Szymkowiak (UE Poznań),

3. Time series comparison using Dynamic Time Warping, Paweł Teisseyre (IPI PAN),

4. Selected generalized linear models, Alicja Szabelska (UE Poznań),

5. CUMSUM method in prediction of influenza epidemics, Konrad Furmańczyk, Marta Zalewska, Przemysław Biecek, Stanisław Jaworski (WUM+UW),

6. The ECAP study, the correspondence analysis in the asthma study, Konrad Furmańczyk, Stanisław Jaworski, Bolesław Samoliński, Marta Zalewska (WUM),

7. Seasonality of influenza, Marta Musiał, Łukasz Wawrowski, Maciej Beręsewicz, Anita Mackowiak (UE Poznań),

8. Logistic regression in the study of influenza, Piotr Śliwka (UKW),

9. The simecol package in influenza study, Marta Markiewicz, Anna Sikora, Katarzyna Zajączkowska, Michał Balcerek, Piotr Kupczyk (PWr),

10. The artificial neural networks in forecasting the influenza epidemics, Anna Noga, Roksana Kowalska, Maciej Kawecki, Paweł Szczypior (PWr),

11. Intensity estimation with multiplicative intensity models, Anna Dudek, Piotr Majerski (AGH),

12. Influenza epidemiology, Grażyna Dulny (WUM).

Additionally, during the second day of the conference the survey "Epidemiology of Allergic Disease in Poland" was presented (http://ecap.pl/eng_www/material.html). The project was conducted with the Department of Prevention of Environmental Hazards and Allergology at the Medical University of Warsaw by the initiative of the Minister of Health.

The organizing committee: Przemysław Biecek, Marta Zalewska, Konrad Furmańczyk, Stanisław Jaworski would like to thank:


The Scientific Committee: Prof. Leon Bobrowski, Prof. Antoni Dawidowicz, Prof. Jacek Koronacki, Dr Witold Kupść, Prof. Wojciech Niemiro, Dr Daniel Rabczenko, Prof. Bolesław Samoliński, Prof. Łukasz Stettner, and Prof. Wojciech Zieliński.

Our host: The Polish Mathematical Society in Wrocław.

Our generous sponsors and supporters: Netezza, Warsaw University, The Medical University of Warsaw, GlaxoSmithKline, Sanofi Pasteur, and The Independent Public Central Clinical Hospital.

We are looking forward to the next WZUR. This time we will focus on the data set "Social Diagnosis 2000-2011" (see: http://diagnoza.com/index-en.html). The challenge is to understand changes in objective and subjective quality of life in Poland in last 11 years. As always we warmly welcome new participants and sponsors!

On behalf of The Organizing Committee,

Przemyslaw Biecek
Institute of Applied Mathematics
Faculty of Mathematics, Informatics and Mechanics
University of Warsaw
Banacha 2, 02-097 Warszawa, Poland
[email protected]


Conference Review: Kickoff Workshop for Project MOSAIC

by Nicholas J. Horton, Randall Pruim and Danny Kaplan

Project MOSAIC (http://www.mosaic-web.org) is a community of educators working to develop new ways of introducing modeling, statistics, computation and calculus to students in colleges and universities. The Institute for Mathematics and its Applications at the University of Minnesota hosted the kickoff workshop for Project MOSAIC from June 30-July 2, 2010.

The MOSAIC community helps educators share ideas and resources that will improve undergraduate teaching, and to develop a curricular and assessment infrastructure to support the dissemination and evaluation of these ideas and resources. Other goals of this NSF-funded project (0920350) are to develop important skills in students of science, technology, engineering, and mathematics that will be needed for their professional careers, with a particular focus on the integration of capacities in four main areas:

Modeling The ability to create useful and informative representations of real-world situations.

Statistics The analysis of variability that draws on our ability to quantify uncertainty and to draw logical inferences from observations and experiments.

Computation The capacity to think algorithmically, to manage data on large scales, to visualize and interact with models, and to automate tasks for efficiency, accuracy, and reproducibility.

Calculus The traditional mathematical entry point for college and university students and a subject that still has the potential to provide important insights to today's students.

The workshop, which spanned three days, provided an opportunity for educators to present curricular innovations that have already been developed, to discuss the overall organization of curricula that will effectively unify the four areas outlined above, and to establish strategic plans for Project MOSAIC. More than 30 people from nearly as many institutions attended.

R was featured as a computational environment for many of the presentations. These included examples of the use of R for teaching calculus (Danny Kaplan, Macalester College), resampling in the introductory statistics course (Nathan Tintle, Hope College), computer languages to support MOSAIC instruction (Randall Pruim, Calvin College), stock market simulation to better understand variability (Nicholas Horton, Smith College), statistical modeling for poets (Vittorio Addona, Macalester College), reproducible analysis (Nicholas Horton, Smith College) and the cost of information (Danny Kaplan, Macalester College).

Since the workshop, the project organizers have been scheduling M-Casts. These 20-minute webinars broadcast over the Internet on the 2nd, 4th, and 5th Friday of each month are designed to provide a quick and easy way for educators to share ideas, to get reactions from others, and to form collaborations. Recent R-related M-Casts include discussion of new differentiation and anti-differentiation operators in R for calculus, and using simulation in R to teach the logic of hypothesis tests. Future M-Casts are planned to encourage additional use of R as an environment for computation within mathematics, science, statistics and modeling courses.

To further promote and facilitate the use of R in these efforts, Project MOSAIC has developed the mosaic package, which is now available on CRAN. It includes a number of datasets and functions which work to clarify and simplify integration of computing and modeling in introductory and intermediate statistics courses. This package was used extensively as part of the May 2011 Project MOSAIC USCOTS (United States Conference on Teaching Statistics) pre-conference workshop on using R to teach statistics. A low-volume discussion list ([email protected]) has been created. For more information or to subscribe, see the URL at http://mailman-mail5.webfaction.com/listinfo/r-users.

Nicholas J. Horton
Department of Mathematics and Statistics, Smith College
Clark Science Center, 44 College Lane
Northampton, MA 01063-0001 USA
[email protected]

Randall Pruim
Department of Mathematics and Statistics, Calvin College
Grand Rapids, MI 49546 USA
[email protected]

Danny Kaplan
Department of Mathematics and Computer Science, Macalester College
1600 Grand Avenue, St. Paul, MN 55105 USA
[email protected]


Changes in R

From version 2.12.2 to version 2.13.0

by the R Core Team

CHANGES IN R VERSION 2.13.0

SIGNIFICANT USER-VISIBLE CHANGES

• replicate() (by default) and vapply() (always) now return a higher-dimensional array instead of a matrix in the case where the inner function value is an array of dimension >= 2 (see the short example after this list).

• Printing and formatting of floating point numbers is now using the correct number of digits, where it previously rarely differed by a few digits. (See "scientific" entry below.) This affects many '*.Rout.save' checks in packages.
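A brief illustration of the replicate() change (an example added here for concreteness, not part of the NEWS file): the inner function returns a 2 x 2 matrix, so the result is now a 2 x 2 x 3 array rather than a flattened matrix.

> r <- replicate(3, diag(2))
> dim(r)
[1] 2 2 3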

NEW FEATURES

• normalizePath() has been moved to the base package (from utils): this is so it can be used by library() and friends.
  It now does tilde expansion.
  It gains new arguments winslash (to select the separator on Windows) and mustWork to control the action if a canonical path cannot be found.

• The previously barely documented limit of 256 bytes on a symbol name has been raised to 10,000 bytes (a sanity check). Long symbol names can sometimes occur when deparsing expressions (for example, in model.frame).

• reformulate() gains an intercept argument.

• cmdscale(add = FALSE) now uses the more common definition that there is a representation in n-1 or less dimensions, and only dimensions corresponding to positive eigenvalues are used. (Avoids confusion such as PR#14397.)

• Names used by c(), unlist(), cbind() and rbind() are marked with an encoding when this can be ascertained.

• R colours are now defined to refer to the sRGB color space. The PDF, PostScript, and Quartz graphics devices record this fact. X11 (and Cairo) and Windows just assume that your screen conforms.

• system.file() gains a mustWork argument (suggestion of Bill Dunlap).

• new.env(hash = TRUE) is now the default.

• list2env(envir = NULL) defaults to hashing (with a suitably sized environment) for lists of more than 100 elements.

• text() gains a formula method.

• IQR() now has a type argument which is passed to quantile().

• as.vector(), as.double(), etc., duplicate less when they leave the mode unchanged but remove attributes.
  as.vector(mode = "any") no longer duplicates when it does not remove attributes. This helps memory usage in matrix() and array().
  matrix() duplicates less if data is an atomic vector with attributes such as names (but no class).
  dim(x) <- NULL duplicates less if x has neither dimensions nor names (since this operation removes names and dimnames).

• setRepositories() gains an addURLs argument.

• chisq.test() now also returns a stdres component, for standardized residuals (which have unit variance, unlike the Pearson residuals).

• write.table() and friends gain a fileEncoding argument, to simplify writing files for use on other OSes (e.g. a spreadsheet intended for Windows or Mac OS X Excel).

• Assignment expressions of the form foo::bar(x) <- y and foo:::bar(x) <- y now work; the replacement functions used are foo::`bar<-` and foo:::`bar<-`.

• Sys.getenv() gains a names argument so Sys.getenv(x, names = FALSE) can replace the common idiom of as.vector(Sys.getenv()). The default has been changed to not name a length-one result (see the short example below).
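A hedged illustration of the new names argument (not part of the NEWS file):

> Sys.getenv(c("R_HOME", "HOME"))                # named character vector
> Sys.getenv(c("R_HOME", "HOME"), names = FALSE) # plain, unnamed vector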


• Lazy loading of environments now preserves attributes and locked status. (The locked status of bindings and active bindings are still not preserved; this may be addressed in the future.)

• options("install.lock") may be set to FALSE so that install.packages() defaults to '--no-lock' installs, or (on Windows) to TRUE so that binary installs implement locking.

• sort(partial = p) for large p now tries Shellsort if quicksort is not appropriate and so works for non-numeric atomic vectors.

• sapply() gets a new option simplify = "array" which returns a "higher rank" array instead of just a matrix when FUN() returns a dim() length of two or more.
  replicate() has this option set by default, and vapply() now behaves that way internally.
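For example (an illustration added here, not part of the NEWS file): FUN() returns a 2 x 2 matrix for each input, so the result is a 2 x 2 x 3 array instead of a 4 x 3 matrix.

> s <- sapply(1:3, function(i) matrix(i, 2, 2), simplify = "array")
> dim(s)
[1] 2 2 3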
• aperm() becomes S3 generic and gets a table method which preserves the class.

• merge() and as.hclust() methods for objects of class "dendrogram" are now provided.

• as.POSIXlt.factor() now passes ... to the character method (suggestion of Joshua Ulrich).

• The character method of as.POSIXlt() now tries to find a format that works for all non-NA inputs, not just the first one.

• str() now has a method for class "Date" analogous to that for class "POSIXt".

• New function file.link() to create hard links on those file systems (POSIX, NTFS but not FAT) that support them.

• New Summary() group method for class "ordered" implements min(), max() and range() for ordered factors.

• mostattributes<-() now consults the "dim" attribute and not the dim() function, making it more useful for objects (such as data frames) from classes with methods for dim(). It also uses attr<-() in preference to the generics name<-(), dim<-() and dimnames<-(). (Related to PR#14469.)

• There is a new option "browserNLdisabled" to disable the use of an empty line (e.g. via the 'Return' key) as a synonym for c in browser() or n under debug(). (Wish of PR#14472.)

• example() gains optional new arguments character.only and give.lines enabling programmatic exploration.

• serialize() and unserialize() are no longer described as 'experimental'. The interface is now regarded as stable, although the serialization format may well change in future releases. (serialize() has a new argument version which would allow the current format to be written if that happens.)
  New functions saveRDS() and readRDS() are public versions of the 'internal' functions .saveRDS() and .readRDS() made available for general use. The dot-name versions remain available as several package authors have made use of them, despite the documentation.
  saveRDS() supports compress = "xz".
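A short hedged example of the newly public serialization functions (the temporary path is illustrative):

> p <- file.path(tempdir(), "mtcars.rds")
> saveRDS(mtcars, p, compress = "xz")
> identical(readRDS(p), mtcars)
[1] TRUE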


• Many functions when called with a not-open connection will now ensure that the connection is left not-open in the event of error. These include read.dcf(), dput(), dump(), load(), parse(), readBin(), readChar(), readLines(), save(), writeBin(), writeChar(), writeLines(), .readRDS(), .saveRDS() and tools::parse_Rd(), as well as functions calling these.

• Public functions find.package() and path.package() replace the internal dot-name versions.

• The default method for terms() now looks for a "terms" attribute if it does not find a "terms" component, and so works for model frames.

• httpd() handlers receive an additional argument containing the full request headers as a raw vector (this can be used to parse cookies, multi-part forms etc.). The recommended full signature for handlers is therefore function(url, query, body, headers, ...).

• file.edit() gains a fileEncoding argument to specify the encoding of the file(s).

• The format of the HTML package listings has changed. If there is more than one library tree, a table of links to libraries is provided at the top and bottom of the page. Where a library contains more than 100 packages, an alphabetic index is given at the top of the section for that library. (As a consequence, package names are now sorted case-insensitively whatever the locale.)

• isSeekable() now returns FALSE on connections which have non-default encoding. Although documented to record if 'in principle' the connection supports seeking, it seems safer to report FALSE when it may not work.

• R CMD REMOVE and remove.packages() now remove file R.css when removing all remaining packages in a library tree. (Related to the wish of PR#14475: note that this file is no longer installed.)

• unzip() now has a unzip argument like zip.file.extract(). This allows an external unzip program to be used, which can be useful to access features supported by Info-ZIP's unzip version 6 which is now becoming more widely available.

• There is a simple zip() function, as wrapper for an external zip command.

• bzfile() connections can now read from concatenated bzip2 files (including files written with bzfile(open = "a")) and files created by some other compressors (such as the example of PR#14479).

• The primitive function c() is now of type BUILTIN.

• plot(<dendrogram>, ..., nodePar = *) now obeys an optional xpd specification (allowing clipping to be turned off completely).

• nls(algorithm = "port") now shares more code with nlminb(), and is more consistent with the other nls() algorithms in its return value.

• xz has been updated to 5.0.1 (very minor bugfix release).

• image() has gained a logical useRaster argument allowing it to use a bitmap raster for plotting a regular grid instead of polygons. This can be more efficient, but may not be supported by all devices. The default is FALSE.

• list.files()/dir() gains a new argument include.dirs to include directories in the listing when recursive = TRUE. New function list.dirs() lists all directories (even empty ones); see the short example below.
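A small hedged example of the directory-listing additions (the paths and directory name are illustrative):

> d <- file.path(tempdir(), "demo")
> dir.create(d)
> dir(tempdir(), recursive = TRUE, include.dirs = TRUE)  # "demo" now appears
> list.dirs(tempdir())                                   # all directories, even empty ones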
• dotchart() accepts more general objects x such as 1D tables which can be coerced by • Using R -no-environ CMD, R -no-site-file as.numeric() to a numeric vector, with a warn- CMD or R -no-init-file CMD sets envi- ing since that might not be appropriate. ronment variables so these settings are passed on to child R processes, notably • The previously internal function those run by INSTALL, check and build. R create.post() is now exported from utils, -vanilla CMD sets these three options (but not and the documentation for bug.report() ‘--no-restore’).
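Two of the additions above can be sketched briefly (illustrative values; raster plotting requires a device that supports it):

    ## image() with the new useRaster argument
    z <- outer(1:100, 1:100, function(i, j) sin(i / 10) * cos(j / 10))
    image(z, useRaster = TRUE, col = grey.colors(64))

    ## directory listings: include.dirs and the new list.dirs()
    d <- tempfile()
    dir.create(file.path(d, "empty"), recursive = TRUE)
    list.files(d, recursive = TRUE, include.dirs = TRUE)
    list.dirs(d)   # lists directories, even empty ones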


• smooth.spline() is somewhat faster. With cv=NA it allows some leverage computations to be skipped.

• The internal (C) function ‘scientific()’, at the heart of R’s format.info(x), format(x), print(x), etc, for numeric x, has been re-written in order to provide slightly more correct results, fixing PR#14491, notably in border cases including when digits >= 16, thanks to substantial contributions (code and experiments) from Petr Savicky. This affects a noticeable amount of numeric output from R.

• A new function grepRaw() has been introduced for finding subsets of raw vectors. It supports both literal searches and regular expressions.

• Package compiler is now provided as a standard package. See ?compiler::compile for information on how to use the compiler. This package implements a byte code compiler for R: by default the compiler is not used in this release. See the ‘R Installation and Administration Manual’ for how to compile the base and recommended packages.

• Providing an exportPattern directive in a NAMESPACE file now causes classes to be exported according to the same pattern, for example the default from package.skeleton() to specify all names starting with a letter. An explicit directive to exportClassPattern will still override.

• There is an additional marked encoding "bytes" for character strings. This is intended to be used for non-ASCII strings which should be treated as a set of bytes, and never re-encoded as if they were in the encoding of the current locale: useBytes = TRUE is automatically selected in functions such as writeBin(), writeLines(), grep() and strsplit().
Only a few character operations are supported (such as substr()).
Printing, format() and cat() will represent non-ASCII bytes in such strings by a ‘\xab’ escape.

• The new function removeSource() removes the internally stored source from a function.

• "srcref" attributes now include two additional line number values, recording the line numbers in the order they were parsed.

• New functions have been added for source reference access: getSrcFilename(), getSrcDirectory(), getSrcLocation() and getSrcref().

• Sys.chmod() has an extra argument use_umask which defaults to true and restricts the file mode by the current setting of umask. This means that all the R functions which manipulate file/directory permissions by default respect umask, notably R CMD INSTALL.

• tempfile() has an extra argument fileext to create a temporary filename with a specified extension. (Suggestion and initial implementation by Dirk Eddelbuettel.)

• There are improvements in the way Sweave() and Stangle() handle non-ASCII vignette sources, especially in a UTF-8 locale: see ‘Writing R Extensions’ which now has a subsection on this topic.

• factanal() now returns the rotation matrix if a rotation such as "promax" is used, and hence factor correlations are displayed. (Wish of PR#12754.)

• The gctorture2() function provides a more refined interface to the GC torture process. Environment variables R_GCTORTURE, R_GCTORTURE_WAIT, and R_GCTORTURE_INHIBIT_RELEASE can also be used to control the GC torture process.

• file.copy(from, to) no longer regards it as an error to supply a zero-length from: it now simply does nothing.

• rstandard.glm gains a type argument which can be used to request standardized Pearson residuals.

• A start on a Turkish translation, thanks to Murat Alkan.

• .libPaths() calls normalizePath(winslash = "/") on the paths: this helps (usually) present them in a user-friendly form and should detect duplicate paths accessed via different symbolic links.

SWEAVE CHANGES

• Sweave() has options to produce PNG and JPEG figures, and to use a custom function to open a graphics device (see ?RweaveLatex). (Based in part on the contribution of PR#14418.)

• The default for Sweave() is to produce only PDF figures (rather than both EPS and PDF).

• Environment variable SWEAVE_OPTIONS can be used to supply defaults for existing or new options to be applied after the Sweave driver setup has been run.


• The Sweave manual is now included as a vignette in the utils package.

• Sweave() handles keep.source=TRUE much better: it could duplicate some lines and omit comments. (Reported by John Maindonald and others.)

C-LEVEL FACILITIES

• Because they use a C99 interface which a C++ compiler is not required to support, Rvprintf and REvprintf are only defined by ‘R_ext/Print.h’ in C++ code if the macro R_USE_C99_IN_CXX is defined when it is included.

• pythag duplicated the C99 function hypot. It is no longer provided, but is used as a substitute for hypot in the very unlikely event that the latter is not available.

• R_inspect(obj) and R_inspect3(obj, deep, pvec) are (hidden) C-level entry points to the internal inspect function and can be used for C-level debugging (e.g., in conjunction with the p command in gdb).

• Compiling R with ‘--enable-strict-barrier’ now also enables additional checking for use of unprotected objects. In combination with gctorture() or gctorture2() and a C-level debugger this can be useful for tracking down memory protection issues.

UTILITIES

• R CMD Rdiff is now implemented in R on Unix-alikes (as it has been on Windows since R 2.12.0).

• R CMD build no longer does any cleaning in the supplied package directory: all the cleaning is done in the copy.
It has a new option ‘--install-args’ to pass arguments to R CMD INSTALL for ‘--build’ (but not when installing to rebuild vignettes).
There is a new option, ‘--resave-data’, to call tools::resaveRdaFiles() on the ‘data’ directory, to compress tabular files (‘.tab’, ‘.csv’ etc) and to convert ‘.R’ files to ‘.rda’ files. The default, ‘--resave-data=gzip’, is to do so in a way compatible even with years-old versions of R, but better compression is given by ‘--resave-data=best’, requiring R >= 2.10.0.
It now adds a ‘datalist’ file for ‘data’ directories of more than 1Mb.
Patterns in ‘.Rbuildignore’ are now also matched against all directory names (including those of empty directories).
There is a new option, ‘--compact-vignettes’, to try reducing the size of PDF files in the ‘inst/doc’ directory. Currently this tries qpdf: other options may be used in future.
When re-building vignettes and a ‘inst/doc/Makefile’ file is found, make clean is run if the makefile has a clean: target.
After re-building vignettes the default clean-up operation will remove any directories (and not just files) created during the process: e.g. one package created a ‘.R_cache’ directory.
Empty directories are now removed unless the option ‘--keep-empty-dirs’ is given (and a few packages do deliberately include empty directories).
If there is a field BuildVignettes in the package ‘DESCRIPTION’ file with a false value, re-building the vignettes is skipped.

• R CMD check now also checks for filenames that are case-insensitive matches to Windows’ reserved file names with extensions, such as ‘nul.Rd’, as these have caused problems on some Windows systems.
It checks for inefficiently saved ‘data/*.rda’ and ‘data/*.RData’ files, and reports on those larger than 100Kb. A more complete check (including of the type of compression, but potentially much slower) can be switched on by setting environment variable _R_CHECK_COMPACT_DATA2_ to ‘TRUE’.
The types of files in the data directory are now checked, as packages are still misusing it for non-R data files.
It now extracts and runs the R code for each vignette in a separate directory and R process: this is done in the package’s declared encoding. Rather than call tools::checkVignettes(), it calls tools::buildVignettes() to see if the vignettes can be re-built as they would be by R CMD build. Option ‘--use-valgrind’ now applies only to these runs, and not when running code to rebuild the vignettes. This version does a much better job of suppressing output from successful vignette tests.
The ‘00check.log’ file is a more complete record of what is output to ‘stdout’: in particular it contains more details of the tests.
It now checks all syntactically valid Rd usage entries, and warns about assignments (unless these give the usage of replacement functions).
‘.tar.xz’ compressed tarballs are now allowed, if tar supports them (and setting environment variable TAR to ‘internal’ ensures so on all platforms).
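The data-resaving step behind ‘--resave-data’ can also be run directly from R; a hedged sketch (the package path is hypothetical, and the compression choice is one of the documented values):

    ## recompress a package's data directory in place
    tools::resaveRdaFiles("mypkg/data", compress = "xz")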


• R CMD check now warns if it finds ‘inst/doc/makefile’, and R CMD build renames such a file to ‘inst/doc/Makefile’.

INSTALLATION

• Installing R no longer tries to find perl, and R CMD no longer tries to substitute a full path for awk nor perl – this was a legacy from the days when they were used by R itself. Because a couple of packages do use awk, it is set as the make (rather than environment) variable AWK.

• make check will now fail if there are differences from the reference output when testing package examples and if environment variable R_STRICT_PACKAGE_CHECK is set to a true value.

• The C99 double complex type is now required.
The C99 complex trigonometric functions (such as ‘csin’) are not currently required (FreeBSD lacks most of them): substitutes are used if they are missing.

• The C99 system call ‘va_copy’ is now required.

• If environment variable R_LD_LIBRARY_PATH is set during configuration (for example in ‘config.site’) it is used unchanged in file ‘etc/ldpaths’ rather than being appended to.

• configure looks for support for OpenMP and if found compiles R with appropriate flags and also makes them available for use in packages: see ‘Writing R Extensions’.
This is currently experimental, and is only used in R with a single thread for colSums() and colMeans(). Expect it to be more widely used in later versions of R.
This can be disabled by the ‘--disable-openmp’ flag.

PACKAGE INSTALLATION

• R CMD INSTALL --clean now removes copies of a ‘src’ directory which are created when multiple sub-architectures are in use. (Following a comment from Berwin Turlach.)

• File ‘R.css’ is now installed on a per-package basis (in the package’s ‘html’ directory) rather than in each library tree, and this is used for all the HTML pages in the package. This helps when installing packages with static HTML pages for use on a webserver. It will also allow future versions of R to use different stylesheets for the packages they install.

• A top-level file ‘.Rinstignore’ in the package sources can list (in the same way as ‘.Rbuildignore’) files under inst that should not be installed. (Why should there be any such files? Because all the files needed to re-build vignettes need to be under inst/doc, but they may not need to be installed.)

• R CMD INSTALL has a new option ‘--compact-docs’ to compact any PDFs under the ‘inst/doc’ directory. Currently this uses qpdf, which must be installed (see ‘Writing R Extensions’).

• There is a new option ‘--lock’ which can be used to cancel the effect of ‘--no-lock’ or ‘--pkglock’ earlier on the command line.

• Option ‘--pkglock’ can now be used with more than one package, and is now the default if only one package is specified.

• Argument lock of install.packages() can now be used for Mac binary installs as well as for Windows ones. The value "pkglock" is now accepted, as well as TRUE and FALSE (the default).

• There is a new option ‘--no-clean-on-error’ for R CMD INSTALL to retain a partially installed package for forensic analysis.

• Packages with names ending in ‘.’ are not portable since Windows does not work correctly with such directory names. This is now warned about in R CMD check, and will not be allowed in R 2.14.x.

• The vignette indices are more comprehensive (in the style of browseVignettes()).

DEPRECATED & DEFUNCT

• require(save = TRUE) is defunct, and use of the save argument is deprecated.

• R CMD check --no-latex is defunct: use ‘--no-manual’ instead.

• R CMD Sd2Rd is defunct.

• The gamma argument to hsv(), rainbow(), and rgb2hsv() is deprecated and no longer has any effect.

• The previous options for R CMD build --binary (‘--auto-zip’, ‘--use-zip-data’ and ‘--no-docs’) are deprecated (or defunct): use the new option ‘--install-args’ instead.

• When a character value is used for the EXPR argument in switch(), only a single unnamed alternative value is now allowed (see the sketch below).

• The wrapper utils::link.html.help() is no longer available.
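The switch() change can be seen in a short example (values are arbitrary):

    switch("b", a = "apple", "fallback")   # one unnamed default: OK
    ## switch("b", a = "apple", "x", "y")  # several unnamed alternatives:
    ##                                     # now disallowed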


• Zip-ing data sets in packages (and hence R CMD INSTALL options ‘--use-zip-data’ and ‘--auto-zip’, as well as the ‘ZipData: yes’ field in a DESCRIPTION file) is defunct.
Installed packages with zip-ed data sets can still be used, but a warning that they should be re-installed will be given.

• The ‘experimental’ alternative specification of a name space via .Export() etc is now defunct.

• The option ‘--unsafe’ to R CMD INSTALL is deprecated: use the identical option ‘--no-lock’ instead.

• The entry point pythag in ‘Rmath.h’ is deprecated in favour of the C99 function hypot. A wrapper for hypot is provided for R 2.13.x only.

• Direct access to the "source" attribute of functions is deprecated; use deparse(fn, control="useSource") to access it, and removeSource(fn) to remove it. (A short sketch follows the bug fixes below.)

• R CMD build --binary is now formally deprecated: R CMD INSTALL --build has long been the preferred alternative.

• Single-character package names are deprecated (and R is already disallowed to avoid confusion in ‘Depends:’ fields).

BUG FIXES

• drop.terms and the [ method for class "terms" no longer add back an intercept. (Reported by Niels Hansen.)

• aggregate preserves the class of a column (e.g. a date) under some circumstances where it discarded the class previously.

• p.adjust() now always returns a vector result, as documented. In previous versions it copied attributes (such as dimensions) from the p argument: now it only copies names.

• On PDF and PostScript devices, a line width of zero was recorded verbatim and this caused problems for some viewers (a very thin line combined with a non-solid line dash pattern could also cause a problem). On these devices, the line width is now limited at 0.01 and for very thin lines with complex dash patterns the device may force the line dash pattern to be solid. (Reported by Jari Oksanen.)

• The str() method for class "POSIXt" now gives sensible output for 0-length input.

• The one- and two-argument complex maths functions failed to warn if NAs were generated (as their numeric analogues do).

• Map() did not look up a character argument f in the correct frame, thanks to lazy evaluation. (PR#14495)

• file.copy() did not tilde-expand from and to when to was a directory. (PR#14507)

• It was possible (but very rare) for the loading test in R CMD INSTALL to crash a child R process and so leave around a lock directory and a partially installed package. That test is now done in a separate process.

• plot(<formula>, data=<data.frame>, ..) now works in more cases; similarly for points(), lines() and text().

• edit.default() contained a manual dispatch for matrices (the "matrix" class didn’t really exist when it was written). This caused an infinite recursion in the no-GUI case and has now been removed.

• data.frame(check.rows = TRUE) sometimes worked when it should have detected an error. (PR#14530)

• scan(sep= , strip.white=TRUE) sometimes stripped trailing spaces from within quoted strings. (The real bug in PR#14522.)

• The rank-correlation methods for cor() and cov() with use = "complete.obs" computed the ranks before removing missing values, whereas the documentation implied incomplete cases were removed first. (PR#14488)
They also failed for 1-row matrices.

• The perpendicular adjustment used in placing text and expressions in the margins of plots was not scaled by par("mex"). (Part of PR#14532.)

• Quartz Cocoa device now catches any Cocoa exceptions that occur during the creation of the device window to prevent crashes. It also imposes a limit of 144 ft^2 on the area used by a window to catch user errors (unit misinterpretation) early.

• The browser (invoked by debug(), browser() or otherwise) would display attributes such as "wholeSrcref" that were intended for internal use only.
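Relating to the "source"-attribute deprecation above, a minimal sketch (assuming the session keeps source, the interactive default):

    f <- function(x) {
        ## this comment lives in the stored source
        x + 1
    }
    cat(deparse(f, control = "useSource"), sep = "\n")
    g <- removeSource(f)   # g prints without the comment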


• R’s internal filename completion now properly handles filenames with spaces in them even when the readline library is used. This resolves PR#14452 provided the internal filename completion is used (e.g., by setting rc.settings(files = TRUE)).

• Inside uniroot(f, ...), -Inf function values are now replaced by a maximally negative value.

• rowsum() could silently over/underflow on integer inputs (reported by Bill Dunlap).

• as.matrix() did not handle "dist" objects with zero rows.

CHANGES IN R VERSION 2.12.2 patched

NEW FEATURES

• max() and min() work harder to ensure that NA has precedence over NaN, so e.g. min(NaN, NA) is NA. (This was not previously documented except for within a single numeric vector, where compiler optimizations often defeated the code.)
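For example (a one-line check):

    min(NaN, NA)        # NA: NA now takes precedence over NaN
    max(c(1, NaN, NA))  # NA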

BUG FIXES

• A change to the C function ‘R_tryEval’ had broken error messages in S4 method selection; the error message is now printed.

• PDF output with a non-RGB color model used RGB for the line stroke color. (PR#14511)

• stats4::BIC() assumed without checking that an object of class "logLik" has an "nobs" attribute: glm() fits did not and so BIC() failed for them.

• In some circumstances a one-sided mantelhaen.test() reported the p-value for the wrong tail. (PR#14514)

• Passing the invalid value lty = NULL to axis() sent an invalid value to the graphics device, and might cause the device to segfault.

• Sweave() with concordance=TRUE could lead to invalid PDF files; ‘Sweave.sty’ has been updated to avoid this.

• Non-ASCII characters in the titles of help pages were not rendered properly in some locales, and could cause errors or warnings.

• checkRd() gave a spurious error if the \href macro was used.

CHANGES IN R VERSION 2.12.2

SIGNIFICANT USER-VISIBLE CHANGES

• Complex arithmetic (notably z^n for complex z and integer n) gave incorrect results since R 2.10.0 on platforms without C99 complex support. This and some lesser issues in trigonometric functions have been corrected.
Such platforms were rare (we know of Cygwin and FreeBSD). However, because of new compiler optimizations in the way complex arguments are handled, the same code was selected on x86_64 Linux with gcc 4.5.x at the default -O2 optimization (but not at -O).

• There is a workaround for crashes seen with several packages on systems using ‘zlib 1.2.5’: see the INSTALLATION section.

NEW FEATURES

• PCRE has been updated to 8.12 (two bug-fix releases since 8.10).

• rep(), seq(), seq.int() and seq_len() report more often when the first element is taken of an argument of incorrect length.

• The Cocoa back-end for the quartz() graphics device on Mac OS X provides a way to disable event loop processing temporarily (useful, e.g., for forked instances of R).

• kernel()’s default for m was not appropriate if coef was a set of coefficients. (Reported by Pierre Chausse.)

• bug.report() has been updated for the current R bug tracker, which does not accept emailed submissions.

• R CMD check now checks for the correct use of ‘$(LAPACK_LIBS)’ (as well as ‘$(BLAS_LIBS)’), since several recent CRAN submissions have ignored ‘Writing R Extensions’.

INSTALLATION

• The ‘zlib’ sources in the distribution are now built with all symbols remapped: this is intended to avoid problems seen with packages such as XML and rggobi which link to ‘zlib.so.1’ on systems using ‘zlib 1.2.5’.

• The default for FFLAGS and FCFLAGS with gfortran on x86_64 Linux has been changed back to ‘-g -O2’: however, setting ‘-g -O’ may still be needed for gfortran 4.3.x.


PACKAGE INSTALLATION

• A ‘LazyDataCompression’ field in the ‘DESCRIPTION’ file will be used to set the value for the ‘--data-compress’ option of R CMD INSTALL.

• Files ‘R/sysdata.rda’ of more than 1Mb are now stored in the lazyload database using xz compression: this for example halves the installed size of package Imap.

• R CMD INSTALL now ensures that directories installed from ‘inst’ have search permission for everyone.
It no longer installs files ‘inst/doc/Rplots.ps’ and ‘inst/doc/Rplots.pdf’. These are almost certainly left-overs from Sweave runs, and are often large.

DEPRECATED & DEFUNCT

• The ‘experimental’ alternative specification of a name space via .Export() etc is now deprecated.

• zip.file.extract() is now deprecated.

• Zip-ing data sets in packages (and hence R CMD INSTALL --use-zip-data and the ‘ZipData: yes’ field in a DESCRIPTION file) is deprecated: using efficiently compressed ‘.rda’ images and lazy-loading of data has superseded it.

BUG FIXES

• identical() could in rare cases generate a warning about non-pairlist attributes on CHARSXPs. As these are used for internal purposes, the attribute check should be skipped. (Reported by Niels Richard Hansen.)

• ccf(na.action = na.pass) was not implemented.

• The parser accepted some incorrect numeric constants, e.g. 20x2. (Reported by Olaf Mersmann.)

• format.data.frame() now keeps zero character column names.

• pretty(x) no longer raises an error when x contains solely non-finite values. (PR#14468)

• The plot.TukeyHSD() function now uses a line width of 0.5 for its reference lines rather than ‘lwd = 0’ (which caused problems for some PDF and PostScript viewers).

• The big.mark argument to prettyNum(), format(), etc. was inserted reversed if it was more than one character long.

• R CMD check failed to check the filenames under ‘man’ for Windows’ reserved names.

• The "Date" and "POSIXt" methods for seq() could overshoot when to was supplied and by was specified in months or years.

• The internal method of untar() now restores hard links as file copies rather than symbolic links (which did not work for cross-directory links).

• unzip() did not handle zip files which contained filepaths with two or more leading directories which were not in the zipfile and did not already exist. (It is unclear if such zipfiles are valid and the third-party C code used did not support them, but PR#14462 created one.)

• combn(n, m) now behaves more regularly for the border case m = 0. (PR#14473)

• The rendering of numbers in plotmath expressions (e.g. expression(10^2)) used the current settings for conversion to strings rather than setting the defaults, and so could be affected by what has been done before. (PR#14477)

• The methods of napredict() and naresid() for na.action = na.exclude fits did not work correctly in the very rare event that every case had been omitted in the fit. (Reported by Simon Wood.)

• weighted.residuals(drop0=TRUE) returned a vector when the residuals were a matrix (e.g. those of class "mlm"). (Reported by Bill Dunlap.)

• Package HTML index files ‘<pkg>/html/00Index.html’ were generated with a stylesheet reference that was not correct for static browsing in libraries.

• If the filename extension (usually ‘.Rnw’) was not included in a call to Sweave(), source references would not work properly and the keep.source option failed. (PR#14459)

• format(*, zero.print) did not always replace the full zero parts.

• Fixes for subsetting or subassignment of "raster" objects when not both i and j are specified.

• R CMD INSTALL was not always respecting the ‘ZipData: yes’ field of a ‘DESCRIPTION’ file (although this is frequently incorrectly specified for packages with no data or which specify lazy-loading of data).
R CMD INSTALL --use-zip-data was incorrectly implemented as ‘--use-zipdata’ since R 2.9.0.
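The seq() fix noted above can be checked with a small sketch (the dates are arbitrary):

    ## the sequence now stops at 'to' instead of overshooting it
    seq(as.Date("2011-01-01"), to = as.Date("2011-06-01"), by = "month")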


• source(file, echo=TRUE) could fail if the file contained ‘#line’ directives. It now recovers more gracefully, but may still display the wrong line if the directive gives incorrect information.

• atan(1i) returned NaN+Infi (rather than 0+Infi) on platforms without C99 complex support.

• library() failed to cache S4 metadata (unlike loadNamespace()) causing failures in S4-using packages without a namespace (e.g. those using reference classes).

• The function qlogis(lp, log.p=TRUE) no longer prematurely overflows to Inf when exp(lp) is close to 1.

• Updating S4 methods for a group generic function requires resetting the methods tables for the members of the group (patch contributed by Martin Morgan).

• In some circumstances (including for package XML), R CMD INSTALL installed version-control directories from source packages.

• Added PROTECT calls to some constructed expressions used in C level eval calls.

• utils:::create.post() (used by bug.report() and help.request()) failed to quote arguments to the mailer, and so often failed.

• bug.report() was naive about how to extract maintainer email addresses from package descriptions, so would often try mailing to incorrect addresses.

• debugger() could fail to read the environment of a call to a function with a ... argument. (Reported by Charlie Roosen.)

• prettyNum(c(1i, NA), drop0trailing=TRUE) or str(NA_complex_) now work correctly.
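The qlogis() fix is easy to verify (a one-line check):

    qlogis(-1e-12, log.p = TRUE)  # now a large finite value (about 27.6), not Inf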


Changes on CRAN

2010-12-13 to 2011-05-31

by Kurt Hornik and Achim Zeileis

New packages in CRAN task views

Bayesian BLR, BMS, BaBooN, BayesDA, BayesX, Bmix, LaplacesDemon, SampleSizeMeans, SampleSizeProportions, abc, bayespack, bbemkr, bclust, bfp, bisoreg, bnlearn, catnet, cudaBayesreg, dclone, ebdbNet, factorQR, gausspred, mlogitBMA, mombf, predbayescor, predmixcor, rWMBAT, sbgcop, spikeSlabGAM, varSelectIP.

ChemPhys ChemoSpec, ChemometricsWithR, ChemometricsWithRData.

Cluster HDclassif, MOCCA, MetabolAnalyze, SDisc, clusterCons, dpmixsim, edci, fastcluster, isa2, isopam, kernlab, kml, kml3d, lcmm, mixsmsn, movMF, optpart, pdfCluster, profdpm, sigclust, sparcl.

Distributions CompQuadForm, FAdist, GB2, GeneralizedHyperbolic, HI, PearsonDS*, RMTstat, SkewHyperbolic, distrEllipse, emg, gaussDiff, hyperdirichlet, loglognorm, mgpd, movMF, mvtBinaryEP, poibin, poilog, poistweedie, reliaR*, skellam, spd, stabledist, trapezoid.

Econometrics apt, censReg, erer, ordinal.

Environmetrics diveMove, wq.

Finance RTAQ.

HighPerformanceComputing RInside, doSMP.

MachineLearning CORElearn, Cubist, ahaz, ncvreg, rminer.

Optimization RcppDE, alabama, cmaes, dclone, neldermead, pso, rneos, soma.

Phylogenetics PHYLOGR, RBrownie, TreePar, diversitree.

Psychometrics irtoys, mpt, qgraph, semPLS.

Spatial Geneland, adehabitatHR, adehabitatHS, adehabitatLT, adehabitatMA, gdistance, rgeos.

Survival Biograph, NADA, PermAlgo, bujar, compeir, concreg, crrSC, frailtyHL, globalboosttest, msSurv, plsRcox, powerSurvEpi, risksetROC, survAUC, survJamda, survivalBIV.

TimeSeries FitARMA, Rssa, egarch, fastVAR, lubridate, season.

(* = core package)

New contributed packages

ADM3 An interpretation of the ADM method — automated detection algorithm. Author: Tomas William Fitzgerald.

Agreement Statistical Tools for Measuring Agreement. Authors: Yue Yu and Lawrence Lin.

AnnotLists Data extraction tool from annotations files. Author: Nicolas Cagnard.

BCA Business and Customer Analytics. Author: Dan Putler.

BaBooN Bayesian Bootstrap Predictive Mean Matching — Multiple and single imputation for discrete data. Author: Florian Meinfelder. In view: Bayesian.

Bessel Bessel Functions Computations and Approximations. Author: Martin Maechler.

BiGGR Creates an interface to BiGG database, provides a framework for simulation and produces flux graphs. Author: Anand K. Gavai.

BioMark Find biomarkers in two-class discrimination problems. Authors: Ron Wehrens and Pietro Franceschi.

Biograph Explore life histories. Author: Frans Willekens. In view: Survival.

CCM Correlation classification method (CCM). Authors: Garrett M. Dancik and Yuanbin Ru.

CDVine Statistical inference of C- and D-vine copulas. Authors: Ulf Schepsmeier and Eike Christian Brechmann.

CGene Causal Genetic Analysis. Author: Peter Lipman.

CITAN CITation ANalysis toolpack. Author: Marek Gagolewski.

CRTSize Sample Size Estimation Functions for Cluster Randomized Trials. Author: Michael A Rotondi.

ChemoSpec Exploratory Chemometrics for Spectroscopy. Author: Bryan A. Hanson, DePauw University, Greencastle Indiana USA. In view: ChemPhys.


ChemometricsWithR Chemometrics with R — Multivariate Data Analysis in the Natural Sciences and Life Sciences. Author: Ron Wehrens. In view: ChemPhys.

ChemometricsWithRData Data for package ChemometricsWithR. Author: Ron Wehrens. In view: ChemPhys.

CommonJavaJars Useful libraries for building a Java based GUI under R. Author: Kornelius Rohmeyer.

CompareTests Estimate diagnostic accuracy (sensitivity, specificity, etc) and agreement statistics when one test is conducted on only a subsample of specimens. Authors: Hormuzd A. Katki and David W. Edelstein.

Covpath Implements a pathwise algorithm for covariance selection. Author: Vijay Krishnamurthy.

Cubist Rule- and Instance-Based Regression Modeling. Authors: Max Kuhn, Steve Weston, and Chris Keefer. C code for Cubist by R. Quinlan. In view: MachineLearning.

DAGGER Consensus genetic maps. Author: Jeffrey Endelman.

DALY DALY Calculator — A GUI for stochastic DALY calculation in R. Authors: Brecht Devleesschauwer, Arie Havelaar, Juanita Haagsma, Nicolas Praet, and Niko Speybroeck.

DART Denoising Algorithm based on Relevance network Topology. Authors: Yan Jiao and Andrew E. Teschendorff.

DARTData Example data set for DART package. Author: Yan Jiao.

DBGSA Methods of distance-based gene set functional enrichment analysis. Authors: Li Jin and Huang Meilin.

DIME Differential Identification using Mixture Ensemble. Author: Cenny Taslim, with contributions from Dustin Potter, Abbasali Khalili, and Shili Lin.

DOBAD Analysis of Discretely Observed linear Birth-And-Death(-and-immigration) Markov Chains. Author: Charles Doss.

DeducerMMR A Deducer plugin for moderated multiple regressions and simple slopes analysis. Authors: Alberto Mirisola, Luciano Seta, Manuel Gentile, and Dario La Guardia.

ENmisc Neuwirth miscellaneous. Author: Erich Neuwirth.

ESPRESSO Power Analysis and Sample Size Calculation. Authors: Amadou Gaye and Paul Burton.

EuclideanMaps Displays Euclidean Maps. Author: Elisa Frutos Bernal.

FAdist Distributions that are sometimes used in hydrology. Author: François Aucoin. In view: Distributions.

FAmle Maximum Likelihood and Bayesian Estimation of Univariate Probability Distributions. Author: François Aucoin.

FRBData Download financial data from FRB’s website. Author: Shinichi Takayanagi.

FisherEM Model-Based Clustering in the Fisher Discriminative Subspace. Authors: J. Loisel, T. Souvannarath, M. Tchebanenko, C. Bouveyron, and C. Brunet.

FuncMap Hive Plots of R Package Function Calls. Author: Bryan A. Hanson.

GA4Stratification A genetic algorithm approach to determine stratum boundaries and sample sizes of each stratum in stratified sampling. Authors: Sebnem Er, Timur Keskinturk, and Charlie Daly.

GANPA Gene Association Network-based Pathway Analysis. Authors: Zhaoyuan Fang, Weidong Tian, and Hongbin Ji.

GANPAdata The GANPA Datasets Package. Authors: Zhaoyuan Fang, Weidong Tian, and Hongbin Ji.

GB2 Generalized Beta Distribution of the Second Kind: properties, likelihood, estimation. Authors: Monique Graf and Desislava Nedyalkova. In view: Distributions.

GenSA R functions for Generalized Simulated Annealing. Authors: Sylvain Gubian, Yang Xiang, Brian Suomela, and Julia Hoeng.

GrapheR A multiplatform GUI for drawing customizable graphs in R. Author: Maxime Hervé.

GriegSmith Uses Grieg-Smith method on 2 dimensional spatial data. Author: Brian McGuire.

HLMdiag Diagnostic tools for two-level normal hierarchical linear models. Author: Adam Loy.

HPbayes Heligman Pollard mortality model parameter estimation using Bayesian Melding with Incremental Mixture Importance Sampling. Author: David J Sharrow.

HSROC Joint meta-analysis of diagnostic test sensitivity and specificity with or without a gold standard reference test. Authors: Ian Schiller and Nandini Dendukuri.


HiDimDA High Dimensional Discriminant Analysis. Author: António Pedro Duarte Silva.

ICC Functions facilitating the estimation of the Intraclass Correlation Coefficient. Author: Matthew Wolak.

IQMNMR Identification and Quantification of Metabolites by Using 1D 1H NMR FID. Author: XuSong.

ImageMetrics Facilitates image analysis for social scientists. Author: Solomon Messing.

IndependenceTests Nonparametric tests of independence between random vectors. Authors: P Lafaye de Micheaux and M Bilodeau.

InfDim Infinite-dimensional model (IDM) to analyse phenotypic variation in growth trajectories. Authors: Anna Kuparinen and Mats Björklund.

Interpol Interpolation of amino acid sequences. Author: Dominik Heider.

IsotopeR Estimates diet contributions from isotopic sources using JAGS. Authors: Jack Hopkins and Jake Ferguson.

Johnson Johnson Transformation. Author: Edgar Santos Fernandez.

Julia Fractal Image Data Generator. Author: Mehmet Suzen.

KrigInv Kriging-based Inversion for Deterministic and Noisy Computer Experiments. Authors: Victor Picheny and David Ginsbourger.

KsPlot Check the power of a statistical model. Author: Issei Kurahashi.

LCAextend Latent Class Analysis (LCA) with familial dependence in extended pedigrees. Authors: Arafat Tayeb, Alexandre Bureau, and Aurelie Labbe.

LaplacesDemon Software for Bayesian Inference. Author: Byron Hall. In view: Bayesian.

LeLogicielR Functions and datasets to accompany the book “Le logiciel R: Maîtriser le langage, Effectuer des analyses statistiques” (in French). Authors: Lafaye de Micheaux Pierre, Drouilhet Rémy, and Liquet Benoit.

MALDIquant Quantitative analysis of MALDI-TOF MS data. Author: Sebastian Gibb.

MAPLES Smoothed age profile estimation. Author: Roberto Impicciatore.

MAT Multidimensional Adaptive Testing (MAT). Author: Seung W. Choi.

MDR Detect gene-gene interactions using multifactor dimensionality reduction. Author: Stacey Winham.

MEWMA Multivariate Exponentially Weighted Moving Average (MEWMA) Control Chart. Author: Edgar Santos Fernandez.

MHadaptive General Markov Chain Monte Carlo for Bayesian Inference using adaptive Metropolis-Hastings sampling. Author: Corey Chivers.

MISA Bayesian Model Search and Multilevel Inference for SNP Association Studies. Author: Melanie Wilson.

MLPAstats MLPA analysis to detect gains and loses in genes. Authors: Alejandro Cáceres and Juan R González.

MSG Data and functions for the book “Modern Statistical Graphics”. Author: Yihui Xie.

MUCflights Munich Franz-Josef-Strauss Airport Pattern Analysis. Authors: The students of the “Advanced R Programming Course”: Basil Abou El-Komboz, Andreas Bender, Abdelilah El Hadad, Laura Göres, Roman Hornung, Max Hughes-Brandl, Christian Lindenlaub, Christina Riedel, Ariane Straub, Florian Wickler, under the supervision of Manuel Eugster and Torsten Hothorn.

MVA An Introduction to Applied Multivariate Analysis with R. Authors: Brian S. Everitt and Torsten Hothorn.

MetABEL Meta-analysis of genome-wide SNP association results. Authors: Maksim Struchalin and Yurii Aulchenko.

MetaPCA Meta-analysis in the Dimension Reduction of Genomic data. Author: Don Kang.

MetaQC Quantitative Quality Assessment for Inclusion/Exclusion Criteria of Genomic Meta-Analysis. Author: Don Kang.

Meth27QC Sample quality analysis and sample control analysis. Authors: Ling Teng, Chao Chen, and Chunyu Liu.

MethComp Functions for analysis of method comparison studies. Authors: Bendix Carstensen and Lyle Gurrin.

Miney Implementation of the Well-Known Game to Clear Bombs from a Given Field (Matrix). Author: Roland Rau.

NBPSeq The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq. Authors: Yanming Di and Daniel W. Schafer, with contributions from Jason S. Cumbie and Jeff H. Chang.


NightDay Night and Day Boundary Plot Function. Author: Max Hughes-Brandl.

Nippon Japanese utility functions and data. Author: Susumu Tanimura.

PBImisc A set of datasets used in my classes or in the book “Modele liniowe i mieszane w R, wraz z przykładami w analizie danych”. Author: Przemysław Biecek.

PL.popN Population Size Estimation. Author: Jakub Stoklosa.

PSCN Parent Specific DNA Copy Number Estimation. Authors: Hao Chen, Haipeng Xing, and Nancy R. Zhang.

Peak2Trough Estimation of the peak-to-trough ratio of a seasonal variation component. Author: Anette Luther Christensen.

PoissonSeq Significance analysis of sequencing data based on a Poisson log linear model. Author: Jun Li.

PottsUtils Utility Functions of the Potts Models. Author: Dai Feng.

PredictABEL Assessment of risk prediction models. Authors: Suman Kundu, Yurii S. Aulchenko, and A. Cecile J.W. Janssens.

ProfileLikelihood Profile Likelihood for a Parameter in Commonly Used Statistical Models. Author: Leena Choi.

QTLRel Tools for mapping of quantitative traits of genetically related individuals and calculating identity coefficients from a pedigree. Author: Riyan Cheng.

R2SWF Convert R Graphics to Flash Animations. Authors: Yixuan Qiu and Yihui Xie.

RAD Fit RAD models to biological data. Authors: Piers K Dunstan and Scott D Foster.

RAtmosphere Standard Atmospheric profiles. Author: Gionata Biavati.

RISmed Search Pubmed, Make bibliographic content analyzable. Author: S. A. Kovalchik.

RMark R Code for MARK Analysis. Author: Jeff Laake.

RMediation Mediation Analysis. Authors: Davood Tofighi and David P. MacKinnon.

RNCEP Obtain, organize, and visualize NCEP weather data. Authors: Michael U. Kemp, with contributions from E. Emiel van Loon, Judy Shamoun-Baranes, and Willem Bouten.

ROAuth R interface to liboauth. Author: Jeff Gentry.

ROCwoGS Non-parametric estimation of ROC curves without Gold Standard Test. Author: Chong Wang.

RStackExchange R based StackExchange client. Author: Jeff Gentry.

RTAQ Tools for the analysis of trades and quotes in R. Authors: Kris Boudt and Jonathan Cornelissen. In view: Finance.

RVAideMemoire Miscellaneous functions accompanying the book “Aide-mémoire de statistique appliquée à la biologie” (in French). Author: Maxime Hervé.

RXMCDA Authors: Patrick Meyer and Sébastien Bigaret.

RcmdrPlugin.BCA Rcmdr Plug-In for Business and Customer Analytics. Author: Dan Putler.

RcmdrPlugin.depthTools Depth Tools Plug-In. Authors: Sara López-Pintado and Aurora Torrente.

RcppBDT Rcpp bindings for the Boost Date Time library. Authors: Dirk Eddelbuettel and Romain François.

RcppClassic Deprecated ‘classic’ Rcpp API. Authors: Dirk Eddelbuettel and Romain François, with contributions by David Reiss, and based on code written during 2005 and 2006 by Dominick Samperi.

RcppDE Global optimization by differential evolution in C++. Authors: Dirk Eddelbuettel, extending DEoptim (by David Ardia, Katharine Mullen, Brian Peterson, Joshua Ulrich) which itself is based on DE-Engine (by Rainer Storn). In view: Optimization.

RepeatedHighDim Detection of global group effect in microarrays data. Author: Klaus Jung.

RiDMC R interface to the idmclib library. Author: Antonio Fabio Di Narzo.

RnavGraph Using graphs as a navigational infrastructure. Authors: Adrian R. Waddell and R. Wayne Oldford.

RnavGraphImageData Some image data used in the RnavGraph package demos. Authors: Adrian R. Waddell and R. Wayne Oldford.

RobustRankAggreg Methods for robust rank aggregation. Authors: Raivo Kolde and Sven Laur.

Rook A web server interface for R. Author: Jeffrey Horner.


Rssa A collection of methods for singular spectrum analysis. Author: Anton Korobeynikov. In view: TimeSeries.

Rz GUI Tool for Data Management like SPSS or Stata. Author: Masahiro Hayashi.

SPARQL SPARQL client. Author: Willem Robert van Hage.

SPECIES Statistical package for species richness estimation. Author: Ji-Ping Wang.

SSsimple State space models. Author: Dave Zes.

SVMMaj SVMMaj algorithm. Authors: Hoksan Yip, Patrick J.F. Groenen, and Georgi Nalbantov.

SYNCSA Analysis of functional and phylogenetic patterns in metacommunities. Author: Vanderlei Júlio Debastiani.

SamplingStrata Optimal stratification of sampling frames for multipurpose sampling surveys. Author: Giulio Barcaroli.

SemiParBIVProbit Semiparametric Bivariate Probit Modelling. Authors: Giampiero Marra and Rosalba Radice.

Sim.DiffProc Simulation of Diffusion Processes. Authors: Boukhetala Kamal and Guidoum Arsalane.

Sim.DiffProcGUI Graphical User Interface for Simulation of Diffusion Processes. Authors: Boukhetala Kamal and Guidoum Arsalane.

SimultAnR Correspondence and Simultaneous Analysis. Authors: Amaya Zarraga and Beatriz Goitisolo.

SixSigma Six Sigma Tools for Quality and Process Improvement. Authors: Emilio López, Andrés Redchuk, and Javier M. Moguerza.

SpeciesMix Fit Mixtures of Archetype Species. Authors: Piers K Dunstan, Scott D Foster and Ross Darnell.

TERAplusB Test for A+B Traditional Escalation Rule. Author: Eun-Kyung Lee.

TSgetSymbol Time Series Database Interface extension to connect with getSymbols. Author: Paul Gilbert.

TSxls Time Series Database Interface extension to connect to spreadsheets. Author: Paul Gilbert.

TSzip Time Series Database Interface extension to connect to zip files. Author: Paul Gilbert.

TauP.R Earthquake Traveltime Calculations for 1-D Earth Models. Authors: Jake Anderson and Jonathan Lees.

TwoPhaseInd Two-Phase Independence. Author: James Dai.

TwoStepCLogit Conditional logistic regression: A two-step estimation method. Authors: Radu V. Craiu, Thierry Duchesne, Daniel Fortin and Sophie Baillargeon.

VBmix Variational Bayesian mixture models. Author: Pierrick Bruneau.

VennDiagram Generate high-resolution Venn and Euler plots. Author: Hanbo Chen.

XLConnect Excel Connector for R. Author: Mirai Solutions GmbH.

XLConnectJars JAR dependencies for the XLConnect package. Author: Mirai Solutions GmbH.

YjdnJlp Text Analysis by Yahoo! Japan Developer Network. Author: Yohei Sato.

abd The Analysis of Biological Data. Authors: Kevin M. Middleton and Randall Pruim.

adehabitatHR Home Range Estimation. Author: Clément Calenge, contributions from Scott Fortmann-Roe. In view: Spatial.

adehabitatHS Analysis of habitat selection by animals. Author: Clément Calenge, contributions from Mathieu Basille. In view: Spatial.

adehabitatLT Analysis of animal movements. Author: Clément Calenge, contributions from Stéphane Dray and Manuela Royer. In view: Spatial.

adehabitatMA Tools to Deal with Raster Maps. Author: Clément Calenge, contributions from Mathieu Basille. In view: Spatial.

afmtools Estimation, Diagnostic and Forecasting Functions for ARFIMA models. Authors: Javier Contreras-Reyes, Georg M. Goerg, and Wilfredo Palma.

agridat Agricultural datasets. Author: Kevin Wright.

ahaz Regularization for semiparametric additive hazards regression. Author: Anders Gorst-Rasmussen. In view: MachineLearning.

allanvar Allan Variance Analysis. Author: Javier Hidalgo Carrió.

apt Asymmetric Price Transmission. Author: Changyou Sun. In view: Econometrics.

arulesViz Visualizing Association Rules and Frequent Itemsets. Authors: Michael Hahsler and Sudheer Chelluboina.

audit Bounds for Accounting Populations. Author: Glen Meeden.

baseline Baseline Correction of Spectra. Authors: Kristian Hovde Liland and Bjørn-Helge Mevik.

basicspace Recover a Basic Space from Issue Scales. Authors: Keith Poole, Howard Rosenthal, Jeffrey Lewis, James Lo, and Royce Carroll.

bayesLife Bayesian Projection of Life Expectancy. Authors: Hana Ševčíková, Adrian Raftery; original WinBugs code written by Jennifer Chunn.

bayesQR Bayesian quantile regression. Authors: Dries F. Benoit, Rahim Al-Hamzawi, Keming Yu, and Dirk Van den Poel.

bayespack Numerical Integration for Bayesian Inference. Authors: Alan Genz (BAYESPACK Fortran source code) and Björn Bornkamp (R interface and minor adaptions in Fortran source code). In view: Bayesian.

bayespref Hierarchical Bayesian analysis of ecological count data. Authors: Zachariah Gompert and James A. Fordyce.

bgmm Gaussian Mixture Modeling algorithms, including the belief-based mixture modeling. Authors: Przemysław Biecek and Ewa Szczurek.

biGraph Analysis Toolbox For Bipartite Graphs. Author: Ingo Vogt.

bisoreg Bayesian Isotonic Regression with Bernstein Polynomials. Author: S. McKay Curtis. In view: Bayesian.

blender Analyze biotic homogenization of landscapes. Author: David J. Harris.

bmem Mediation analysis with missing data using bootstrap. Authors: Zhiyong Zhang and Lijuan Wang.

bsml Basis Selection from Multiple Libraries. Authors: Junqing Wu, Jeff Sklar, Yuedong Wang and Wendy Meiring.

bst Gradient Boosting. Author: Zhu Wang.

bujar Buckley-James Regression for Survival Data with High-Dimensional Covariates. Author: Zhu Wang. In view: Survival.

c3net Inferring large-scale gene networks with C3NET. Authors: Gokmen Altay and Frank Emmert-Streib.

caschrono Séries temporelles avec R — Méthodes et cas. Author: Yves Aragon.

cems Conditional expectation manifolds. Author: Samuel Gerber.

cg Comparison of groups. Authors: Bill Pikounis and John Oleynick.

changepoint Contains functions that run various single and multiple changepoint methods. Authors: Rebecca Killick and Idris A. Eckley.

citbcmst Assigning tumor samples to CIT Breast Cancer Molecular Subtypes from gene expression data. Authors: Laetitia Marisa, Aurélien de Reyniès, Mickael Guedj.

clime Constrained L1-minimization for Inverse (covariance) Matrix Estimation. Authors: T. Tony Cai, Weidong Liu and Xi (Rossi) Luo.

cncaGUI Canonical Non-symmetrical Correspondence Analysis in R. Authors: Ana Belén Nieto Librero, Priscila Willems, Purificación Galindo Villardón.

compeir Event-specific incidence rates for competing risks data. Authors: Nadine Grambauer and Andreas Neudecker. In view: Survival.

concreg Concordance regression. Authors: R by Meinhard Ploner, Daniela Dunkler and Georg Heinze; Fortran by Georg Heinze. In view: Survival.

copBasic Basic Copula Functions. Author: William H. Asquith.

copulaedas Estimation of Distribution Algorithms Based on Copula Theory. Authors: Yasser González-Fernández and Marta Soto.

cosso COSSO, adaptive COSSO, variable selection for nonparametric smoothing spline models. Author: Hao Helen Zhang.

crrSC Competing risks regression for Stratified and Clustered data. Authors: Bingqing Zhou and Aurélien Latouche. In view: Survival.

darts Statistical Tools to Analyze Your Darts Game. Author: Ryan Tibshirani.

dataview Human readable data presentation. Author: Christofer Backlin.

dbConnect Provides a graphical user interface to connect with databases that use MySQL. Authors: Dason Kurkiewicz, Heike Hofmann, Ulrike Genschel.

dbEmpLikeGOF Goodness-of-fit and two sample comparison tests using sample entropy. Authors: Jeffrey C. Miecznikowski, Lori A. Shepherd, and Albert Vexler.

deducorrect Deductive correction of simple rounding, typing and sign errors. Authors: Mark van der Loo, Edwin de Jonge, and Sander Scholtus.

depthTools Depth Tools. Authors: Sara López-Pintado and Aurora Torrente.

detrendeR A Graphical User Interface (GUI) to process and visualize tree-ring data using R. Author: Filipe Campelo.

devEMF EMF graphics output device. Authors: Philip Johnson, with code from R-core.

dfoptim Derivative-free Optimization. Author: Ravi Varadhan.

diff Statistics: What’s the Difference?. Authors: Andrew Gelman, Vincent Dorie, Val Chan, and Daniel Lee.

diversitree Comparative phylogenetic tests of diversification. Authors: Richard G. FitzJohn, with GeoSSE by Emma E. Goldberg. In view: Phylogenetics.

dlmodeler Generalized Dynamic Linear Modeler. Author: Cyrille Szymanski.

dmt Dependency Modeling Toolkit. Authors: Leo Lahti and Olli-Pekka Huovilainen.

doSMP Foreach parallel adaptor for the revoIPC package. Author: Revolution Analytics. In view: HighPerformanceComputing.

eaf Plots of the Empirical Attainment Function. Authors: Carlos Fonseca, Luís Paquete, Thomas Stützle, Manuel López-Ibáñez and Marco Chiarandini.

editrules Parsing and manipulating edit rules. Authors: Edwin de Jonge and Mark van der Loo.

english Translate integers into English. Authors: John Fox and Bill Venables.

enrichvs Enrichment assessment of virtual screening approaches. Author: Hiroaki Yabuuchi.

epoc EPoC (Endogenous Perturbation analysis of Cancer). Authors: Rebecka Jornsten, Tobias Abenius, and Sven Nelander.

erer Empirical Research in Economics with R. Author: Changyou Sun. In view: Econometrics.

ergm.userterms User-specified terms for the statnet suite of packages. Authors: Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, Martina Morris and Pavel N. Krivitsky.

esotericR esotericR articles from lemnica.com. Author: Jeffrey A. Ryan.

extfunnel Additional Funnel Plot Augmentations. Authors: Dean Langan, Alexander Sutton, Julian PT Higgins, and Walter Gregory.

eyetracking Eyetracking Helper Functions. Author: Ryan M. Hope.

factualR Thin wrapper for the Factual.com server API. Author: Ethan McCallum.

fastVAR Fast implementations to estimate Vector Autoregressive models and Vector Autoregressive models with Exogenous Inputs. Author: Jeffrey Wong. In view: TimeSeries.

fastcluster Fast hierarchical clustering routines for R and Python. Author: Daniel Müllner. In view: Cluster.

fastmatch Fast match() function. Author: Simon Urbanek.

fda.usc Functional Data Analysis and Utilities for Statistical Computing. Authors: Febrero-Bande, M. and Oviedo de la Fuente, M.

ffbase Basic statistical functions for package ff. Author: Edwin de Jonge.

flux Flux rate calculation from dynamic closed chamber measurements. Authors: Gerald Jurasinski and Franziska Koebsch.

fractaldim Estimation of fractal dimensions. Authors: Hana Ševčíková, Tilmann Gneiting, and Don Percival.

frailtyHL Frailty Models via H-likelihood. Authors: Il Do HA, Maengseok Noh, and Youngjo Lee. In view: Survival.

functional Curry, Compose, and other higher-order functions. Author: Peter Danenberg.

gaussquad Collection of functions for Gaussian quadrature. Author: Frederick Novomestky.

gdistance Distances and routes on geographical grids. Author: Jacob van Etten. In view: Spatial.

gdsfmt CoreArray Generic Data Structures (GDS) R Interface. Author: Xiuwen Zheng.

geoPlot Performs Address Comparison. Author: Randall Shane.

ggcolpairs Combination of GGplot2 plots into a matrix based on data columns. Authors: Fernando Fuentes and George Ostrouchov.

glmnetcr Fit a penalized constrained continuation ratio model for predicting an ordinal response. Author: Kellie J. Archer.

globalboosttest Testing the additional predictive value of high-dimensional data. Authors: Anne-Laure Boulesteix and Torsten Hothorn. In view: Survival.

gooJSON Google JSON Data Interpreter for R. Author: Christopher Steven Marcum.

graphite GRAPH Interaction from pathway Topological Environment. Authors: Gabriele Sales, Enrica Calura, and Chiara Romualdi.

gtcorr Calculate efficiencies of group testing algorithms with correlated responses. Author: Sam Lendle.

gwerAM Controlling the genome-wide type I error rate in association mapping experiments. Authors: Benjamin Stich, Bettina Müller, and Hans-Peter Piepho.

hive Hadoop InteractiVE. Authors: Ingo Feinerer and Stefan Theussl.

homeR Functions useful for building physics. Author: Neurobat AG.

hypred Simulation of genomic data in applied genetics. Author: Frank Technow.

iSubpathwayMiner Graph-based reconstruction, analyses, and visualization of the KEGG pathways. Author: Chunquan Li.

iWebPlots Interactive web-based scatter plots. Authors: Eleni Chatzimichalia and Conrad Bessant.

ieeeround Functions to set and get the IEEE rounding mode. Author: Gianluca Amato.

imputation Fill missing values in a data matrix using SVD, SVT, or kNN imputation. Author: Jeffrey Wong.

infoDecompuTE Information Decomposition of Two-phase Experiments. Authors: Kevin Chang and Katya Ruggiero.

insideRODE Functions with deSolve solver and C/FORTRAN interfaces to nlme, together with compiled codes. Authors: Yuzhuo Pan and Xiaoyu Yan.

irlba Fast partial SVD by implicitly-restarted Lanczos bidiagonalization. Authors: Jim Baglama and Lothar Reichel.

isocir Isotonic Inference for Circular Data. Author: The implementation in R was done by Sandra Barragan based on the SAS routines written by Miguel A. Fernández.

iterLap Approximate probability densities by iterated Laplace Approximations. Author: Björn Bornkamp.

kerdiest Nonparametric kernel estimation of the distribution function. Bandwidth selection and estimation of related functions. Authors: Alejandro Quintela del Río and Graciela Estévez Pérez.

knn knn-gcv. Authors: Ernest Liu, Daisy Sun and Aaron Swoboda.

kulife Data sets and functions from the Faculty of Life Sciences, University of Copenhagen. Authors: Claus Ekstrøm, Ib M. Skovgaard, and Torben Martinussen.

lassoshooting L1 regularized regression (Lasso) solver using the Cyclic Coordinate Descent algorithm aka Lasso Shooting. Author: Tobias Abenius.

lcmr Bayesian estimation of latent class models with random effects. Authors: Liangliang Wang and Nandini Dendukuri.

ldlasso LD LASSO Regression for SNP Association Study. Author: Samuel G. Younkin.

lfe Linear Group Fixed Effects. Authors: Simen Gaure, The Ragnar Frisch Centre for Economic Research.

linkcomm Tools for Generating, Visualizing, and Analysing Link Communities in Networks. Author: Alex T. Kalinka.

lmSupport Support for Linear Models. Author: John Curtin.

lmmlasso Linear mixed-effects models with Lasso. Author: Jürg Schelldorfer.

mRm Conditional maximum likelihood estimation in mixed Rasch models. Author: David Preinerstorfer.

margLikArrogance Marginal Likelihood Computation via Arrogance Sampling. Author: Benedict Escoto.

maxLinear Conditional Samplings for Max-Linear Models. Author: Yizao Wang.

mcga Machine Coded Genetic Algorithms for the Real Value Optimization Problems. Author: Mehmet Hakan Satman.

mederrRank Bayesian Methods for Identifying the Most Harmful Medication Errors. Authors: Sergio Venturini and Jessica Myers.

mefa4 Multivariate Data Handling with S4 Classes and Sparse Matrices. Author: Peter Solymos.

mgpd Functions for multivariate generalized Pareto distribution (MGPD of Type II). Author: Pal Rakonczai. In view: Distributions.

mht Multiple Hypotheses Testing For Variable Selection. Author: F. Rohart.

miP Multiple Imputation Plots. Author: Paul Brix.

miRtest Combined miRNA- and mRNA-testing. Authors: Stephan Artmann, Klaus Jung, and Tim Beissbarth.

microbenchmark Sub microsecond accurate timing functions. Author: Olaf Mersmann.

minPtest Gene region-level testing procedure for SNP data, using the min P test resampling approach. Author: Stefanie Hieke.

missForest Nonparametric Missing Value Imputation using Random Forest. Author: Daniel J. Stekhoven.

mnspc Multivariate Nonparametric Statistical Process Control. Authors: Martin Bezener and Peihua Qiu.

mosaic Project MOSAIC (mosaic-web.org) math and stats teaching utilities. Authors: Randall Pruim, Daniel Kaplan, and Nicholas Horton.

moult Models for Analysing Moult Data. Author: Birgit Erni. Based on models developed by Underhill and Zucchini (1988, 1990).

movMF Mixtures of von Mises Fisher Distributions. Authors: Kurt Hornik and Bettina Grün. In views: Cluster, Distributions.

mpa CoWords Method. Authors: Daniel Hernando Rodríguez and Campo Elías Pardo.

msSurv Nonparametric Estimation for Multistate Models. Authors: Nicole Ferguson, Guy Brock, and Somnath Datta. In view: Survival.

msir Model-based Sliced Inverse Regression. Author: Luca Scrucca.

msr Morse-Smale approximation, regression and visualization. Authors: Samuel Gerber, Kristi Potter, and Oliver Ruebel.

mtcreator Creating MAGE-TAB files using mtcreator. Author: Fabian Grandke.

multiPIM Variable Importance Analysis with Population Intervention Models. Authors: Stephan Ritter, Alan Hubbard, and Nicholas Jewell.

multibiplotGUI Multibiplot Analysis in R. Authors: Ana Belén Nieto Librero, Nora Baccala, Purificación Vicente Galindo, and Purificación Galindo Villardon.

multxpert Common Multiple Testing Procedures and Gatekeeping Procedures. Authors: Alex Dmitrienko, Eric Nantz, and Gautier Paux, with contributions by Thomas Brechenmacher.

mvmeta Multivariate meta-analysis and meta-regression. Author: Antonio Gasparrini.

mxkssd Efficient mixed-level k-circulant supersaturated designs. Author: B N Mandal.

neariso Near-Isotonic Regression. Authors: Holger Hoefling, Ryan Tibshirani, and Robert Tibshirani.

nullabor Tools for graphical inference. Author: Hadley Wickham.

numConversion Test of accuracy of as.character() for a numeric input using the standard arithmetic. Author: Petr Savicky.

nutshellDE Example data for “R in a Nutshell” (German edition). Authors: Joseph Adler (author of the nutshell package) and Jörg Beyer (German translation and adaptation; supplementary code).

nvis Combination of visualization functions for nuclear data using ggplot2 and ggcolpairs. Authors: Fernando Fuentes and George Ostrouchov.

oncomodel Maximum likelihood tree models for oncogenesis. Authors: Anja von Heydebreck, contributions from Christiane Heiss.

orclus ORCLUS subspace clustering. Author: Gero Szepannek.

ordPens Selection and/or Smoothing of Ordinal Predictors. Author: Jan Gertheiss.

orddom Ordinal Dominance Statistics. Author: Jens J. Rogmann.

palaeoSig Significance tests for palaeoenvironmental reconstructions. Author: Richard Telford.

paramlink Parametric linkage analysis. Author: Magnus Dehli Vigeland.

parmigene Parallel Mutual Information estimation for Gene Network reconstruction. Authors: Gabriele Sales and Chiara Romualdi.

pbivnorm Vectorized Bivariate Normal CDF. Authors: Fortran code by Alan Genz. R code by Brenton Kenkel, based on Adelchi Azzalini’s mnormt package.

pdfCluster Cluster analysis via nonparametric density estimation. Authors: Adelchi Azzalini, Giovanna Menardi and Tiziana Rosolin. In view: Cluster.

penalizedLDA Penalized classification using Fisher’s linear discriminant. Author: Daniela Witten.

pencopula Flexible Copula Density Estimation with Penalized Hierarchical B-Splines. Author: Christian Schellhase.
pensim Simulation of high-dimensional data and parallelized repeated penalized regression. Authors: L. Waldron, M. Pintilie, C. Huttenhower* and I. Jurisica* (*equal contribution).
phyext An extension of some of the classes in phylobase. Tree objects now support subnodes on branches. Author: J. Conrad Stack.
pkDACLASS A complete statistical analysis of MALDI-TOF mass spectrometry proteomics data. Authors: Juliet Ndukum, Mourad Atlas, and Susmita Datta.
plotmo Plot a model's response while varying the values of the predictors. Author: Stephen Milborrow.
plsRcox Partial least squares Regression for Cox models and related techniques. Authors: Frederic Bertrand, Myriam Maumy-Bertrand, and Nicolas Meyer. In view: Survival.
poibin The Poisson Binomial Distribution. Author: Yili Hong. In view: Distributions.
pracma Practical Numerical Math Functions. Author: Hans W Borchers.
press Protein REsidue-level Structural Statistics. Authors: Yuanyuan Huang, Stephen Bonett, and Zhijun Wu.
pressData Data sets for package press. Authors: Yuanyuan Huang, Stephen Bonett, and Zhijun Wu.
pycno Pycnophylactic Interpolation. Author: Chris Brunsdon.
qat Quality Assurance Toolkit. Author: Andre Duesterhus.
qgraph Visualize data as networks. Authors: Sacha Epskamp, Angelique O. J. Cramer, Lourens J. Waldorp, Verena D. Schmittmann, and Denny Borsboom. In view: Psychometrics.
queueing Basic Markovian queueing models. Author: Pedro Canadilla.
rCUR CUR decomposition package. Authors: Andras Bodor and Norbert Solymosi.
rChoiceDialogs rChoiceDialogs collection. Authors: Alex Lisovich and Roger Day.
rdyncall Improved Foreign Function Interface (FFI) and Dynamic Bindings to C Libraries (e.g. OpenGL). Author: Daniel Adler.
readBrukerFlexData Reads mass spectrometry data in Bruker *flex format. Author: Sebastian Gibb.
readMzXmlData Reads mass spectrometry data in mzXML format. Authors: Jarek Tuszynski and Sebastian Gibb.
rebmix Random univariate and multivariate finite mixture generation. Author: Marko Nagode.
reliaR Package for some probability distributions. Authors: Vijay Kumar and Uwe Ligges. In view: Distributions.
reshapeGUI A GUI for the reshape2 and plyr packages. Author: Jason Crowley.
review Manage Review Logs. Author: Tim Bergsma.
revoIPC Shared memory parallel framework for multicore machines. Author: Revolution Analytics.
rgam Robust Generalized Additive Model. Authors: Matías Salibián-Barrera, Davor Cubranic, and Azadeh Alimadad.
rgeos Interface to Geometry Engine - Open Source (GEOS). Authors: Roger Bivand and Colin Rundel. In view: Spatial.
rich Species richness estimation and comparison. Author: Jean-Pierre Rossi.
rminer Simpler use of data mining methods (e.g. NN and SVM) in classification and regression. Author: Paulo Cortez. In view: MachineLearning.
rneos XML-RPC Interface to NEOS. Author: Bernhard Pfaff. In view: Optimization.
rocplus ROC, Precision-Recall, Convex Hull and other plots. Author: Bob Wheeler.
rpart.plot Plot rpart models. An enhanced version of plot.rpart(). Author: Stephen Milborrow.
rrBLUP Genomic selection and association analysis. Author: Jeffrey Endelman.
rrdf Support for the Resource Description Framework. Author: Egon Willighagen.
rrdflibs Jena libraries for use with rrdf. Author: Egon Willighagen.
rsdepth Ray Shooting Depth (i.e., RS Depth) functions for bivariate analysis. Authors: Nabil Mustafa, Saurabh Ray, and Mudassir Shabbir.
rtape Manage and manipulate large collections of R objects stored as tape-like files. Author: Miron B. Kursa.
rvmbinary RVM Binary Classification. Author: Robert Lowe.

rysgran Grain size analysis, textural classifications and distribution of unconsolidated sediments. Authors: Maurício Garcia de Camargo, Eliandro Ronael Gilbert, and Leonardo Sandrini-Neto.
s3x Enhanced S3-Based Programming. Author: Charlotte Maia.
saves Fast load variables. Author: Gergely Daróczi.
scaleCoef Scale regression coefficients. Author: Sandrah P. Eckel.
scriptests Transcript-based unit tests that are easy to create and maintain. Author: Tony Plate.
seg A set of tools for residential segregation research. Authors: Seong-Yun Hong and David O'Sullivan.
sfa Stochastic Frontier Analysis. Author: Ariane Straub, under the supervision of Torsten Hothorn.
sideChannelAttack Side Channel Attack. Authors: Liran Lerman, Gianluca Bontempi, and Olivier Markowitch.
simboot Simultaneous Confidence Intervals and Adjusted p-values for Diversity Indices. Author: Ralph Scherer.
skills Skills in Knowledge Space Theory. Author: Angela Pilhoefer.
smco A simple Monte Carlo optimizer using adaptive coordinate sampling. Author: Juan David Velásquez.
snort Social Network-Analysis On Relational Tables. Authors: Eugene Dubossarsky and Mark Norrie.
soma General-purpose optimisation with the Self-Organising Migrating Algorithm. Author: Jon Clayden; based on the work of Ivan Zelinka. In view: Optimization.
somplot Visualisation of hexagonal Kohonen maps. Authors: Benjamin Schulz and Andreas Dominik.
sos4R An R client for the OGC Sensor Observation Service. Author: Daniel Nüst.
spa Implements The Sequential Predictions Algorithm. Author: Mark Culp.
spacodiR Spatial and Phylogenetic Analysis of Community Diversity. Authors: Jonathan Eastman, Timothy Paine, and Olivier Hardy.
spd Semi Parametric Distribution. Author: Alexios Ghalanos. In view: Distributions.
speedR A GUI-based importing and filtering tool. Author: Ilhami Visne.
speedRlibTF speedR's table filter library. Author: Ilhami Visne.
speedRlibs Libraries (jars) for speedR. Author: Ilhami Visne.
spi Compute SPI index. Author: Josemir Neves.
spikeSlabGAM Bayesian variable selection and model choice for generalized additive mixed models. Author: Fabian Scheipl. In view: Bayesian.
sporm Semiparametric proportional odds rate model. Authors: Zhong Guan and Cheng Peng.
sra Selection Response Analysis. Author: Arnaud Le Rouzic.
stabledist Stable Distribution Functions. Authors: Diethelm Wuertz, Martin Maechler, and Rmetrics core team members. In view: Distributions.
stepp Subpopulation Treatment Effect Pattern Plot (STEPP). Author: Wai-ki Yip, with contributions from Ann Lazar, David Zahrieh, Chip Cole, Marco Bonetti, and Richard Gelber.
superMDS Implements the supervised multidimensional scaling (superMDS) proposal of Witten and Tibshirani (2011). Author: Daniela M. Witten.
survAUC Estimators of prediction accuracy for time-to-event data. Authors: Sergej Potapov, Werner Adler, and Matthias Schmid. In view: Survival.
survivalBIV Estimation of the Bivariate Distribution Function. Authors: Ana Moreira and Luis Meira-Machado. In view: Survival.
svd Interfaces to various state-of-the-art SVD and eigensolvers. Author: Anton Korobeynikov.
swamp Analysis and visualization of high-dimensional data with respect to sample annotations. Author: Martin Lauss.
tabplot Tableplot, a visualization of large statistical datasets. Authors: Martijn Tennekes and Edwin de Jonge.
texmex Threshold exceedances and multivariate extremes. Authors: Harry Southworth and Janet E. Heffernan.
textir Inverse Regression for Text Analysis. Author: Matt Taddy.
timeordered Time-ordered and time-aggregated network analyses. Author: Benjamin Blonder.

tm.plugin.dc Text Mining Distributed Corpus Plug-In. Authors: Ingo Feinerer and Stefan Theussl.
tmle Targeted Maximum Likelihood Estimation of additive treatment effect. Author: Susan Gruber, in collaboration with Mark van der Laan.
tonymisc Functions for Econometrics Output. Author: J. Anthony Cookson.
track Track Objects. Author: Tony Plate.
trapezoid The Trapezoidal Distribution. Author: Jeremy Thoms Hetzel. In view: Distributions.
treecm Estimating and plotting the centre of mass of a tree. Author: Marco Bascietto.
triggr Fast, simple RPC controller for R. Author: Miron B. Kursa.
two.stage.boot Two-stage cluster sample bootstrap algorithm. Author: Patrick Zimmerman.
udunits2 udunits-2 bindings for R. Author: James Hiebert.
unknownR You didn't know you didn't know? Author: Matthew Dowle.
varSelectIP Objective Bayes Model Selection. Authors: Gopal, V., Novelo, L. L., and Casella, G. In view: Bayesian.
varcompci Computation of Confidence Intervals for Variance Components of Mixed Models in R. Authors: Civit, S., Vilardell, M., Hess, A., Matthew, Z., Ge, Y., and Caballe, A.
vines Multivariate Dependence Modeling with Vines. Authors: Yasser González-Fernández and Marta Soto.
visualizationTools A few functions to visualize statistical circumstances. Authors: Thomas Roth and Etienne Stockhausen.
voronoi Methods and applications related to Voronoi tessellations. Authors: Christopher D. Barr, Travis A. Gerke, and David M. Diez.
vwr Useful functions for visual word recognition research. Author: Emmanuel Keuleers.
wavemulcor Wavelet routine for multiple correlation. Author: Javier Fernandez-Macho (UPV/EHU).
websockets HTML 5 Websocket Interface for R. Author: B. W. Lewis.
weights Weighting and Weighted Statistics. Author: Josh Pasek.
xtermStyle Basic text formatting using xterm escape sequences. Author: Christofer Backlin.
zipcode U.S. ZIP Code database. Author: Jeffrey Breen.

Other changes

• The following packages, which were previously double-hosted on CRAN and Bioconductor, were moved to the Archive and are now solely available from Bioconductor: LMGene, gene2pathway, genefu, graph, minet, multtest, survcomp, vbmp.

• The following packages were moved to the Archive: EMC, EMCC, EffectiveDose, FunctSNP, G1DBN, ISA, PermuteNGS, RSeqMeth, SMC, cxxPack, distributions, elec, formula.tools, hierfstat, muscor, pARccs, ripa, ris, sigma2tools, tossm, tradeCosts.

• The following packages were resurrected from the Archive: NADA, glmc, ifa, ipptoolbox, locfdr, nparcomp, ramps, risksetROC, rv.

Kurt Hornik
WU Wirtschaftsuniversität Wien, Austria
[email protected]

Achim Zeileis
Universität Innsbruck, Austria
[email protected]


News from the Bioconductor Project

Bioconductor Team
Program in Computational Biology
Fred Hutchinson Cancer Research Center

We are pleased to announce Bioconductor 2.8, released on 14 April 2011. Bioconductor 2.8 is compatible with R 2.13.0, and consists of 467 software packages and more than 500 up-to-date annotation packages. There are 49 new software packages, and enhancements to many others. Explore Bioconductor at http://bioconductor.org, and install packages with

> source("http://bioconductor.org/biocLite.R")
> biocLite()            # install standard packages...
> biocLite("IRanges")   # ...or only IRanges

A Bioconductor machine instance for use with Amazon's cloud service is available; see http://bioconductor.org/help/bioconductor-cloud-ami.

New and revised packages

This release includes new packages for diverse areas of high-throughput analysis. Highlights include:

Next-generation sequencing: ArrayExpressHTS (RNA-seq), mosaics (ChIP-seq), Rsubread (alignment), qrqc, seqbias, TEQC (quality assessment), clst, clstutils (taxonomic placement).

SNPs and small variants: inveRsion, chopsticks, snpStats.

Flow cytometry and proteomics: flowPhyto, flowPlots, IPPD, MSnbase, procoil.

Microarray: the a4 collection of packages, ExiMiR, pvac, snm, TurboNorm.

Copy number variation: cn.farms, genoset, Vega.

Advanced statistical implementations: anota, Clonality, clusterProfiler, gaia, genefu, GSVA, ibh, joda, lol, mgsa, MLP, phenoDist, phenoTest, RNAinteract, survcomp, TDARACNE.

Annotation and visualization: AnnotationFuncs, NCIgraph, ENVISIONQuery, mcaGUI.

Our large collection of microarray- and organism-specific annotation packages has, as usual, been updated to include current information. Developments in the GenomicFeatures package allow users to retrieve and easily save annotation tracks found at the UCSC genome browser. This provides easy access to diverse information about, e.g., binding factor locations or nucleosome positions, and complements the known gene annotations already available as TranscriptDb objects.
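As a minimal sketch of this annotation work flow (not from the release announcement; it assumes the GenomicFeatures interface of this era, with makeTranscriptDbFromUCSC(), saveFeatures(), and loadFeatures(), and the genome and table names are illustrative):

> library(GenomicFeatures)
> # build a TranscriptDb from the UCSC knownGene track (downloads data)
> txdb <- makeTranscriptDbFromUCSC(genome = "hg19",
+                                  tablename = "knownGene")
> # save the annotation locally, then reload it in a later session
> saveFeatures(txdb, file = "hg19_knownGene.sqlite")
> txdb <- loadFeatures("hg19_knownGene.sqlite")
> transcripts(txdb)   # GRanges of transcript locations

Persisting the TranscriptDb as an SQLite file lets subsequent analyses reuse exactly the same annotation without re-querying UCSC, which supports the reproducibility aims described below.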
main include efforts to enhance reproducibility by Our large collection of microarray- and organism- providing transcript annotation packages, and the specific annotation packages have, as usual, been up- ability to easily generate packages for non-model or- dated to include current information. Developments ganisms. New efforts are also under way to bet- in the GenomicFeatures package allow users to re- ter use existing and new annotation resources to trieve and easily save annotation tracks found at the help users with common tasks in next-generation se- UCSC genome browser. This provides easy access quence work flows, for instance immediately identi- to diverse information about, e.g., binding factor lo- fying whether single nucleotide and other small vari- cations or nucleosome positions, and complements ants disrupt protein production.


R Foundation News

by Kurt Hornik

Donations and new members

Donations

Joseph J. Allaire, RStudio Inc., USA
J. Brian Loria, USA
Jeffrey R. Stevens, Germany

New benefactors

Joseph J. Allaire, RStudio Inc., USA

New supporting members

Masoud Charkhabi, Canada
James Davis, USA
Patrick Hausmann, Germany
Sebastian Köhler, Germany
Dieter Menne, Germany
Fausto Molinari, Sweden
Stefan Wyder, Switzerland

Kurt Hornik
WU Wirtschaftsuniversität Wien, Austria
[email protected]
