Foreach/Iterators User's Guide

April 20, 2015

Copyright ©2015 Revolution Analytics. All rights reserved. Revolution R, Revolution R Enterprise, RPE, RevoScaleR, DeployR, NetWorkSpaces, NWS, ParallelR, and Revolution Analytics are trademarks of Revolution Analytics. All other trademarks are the property of their respective owners.

Contents

1 Parallelizing Loops
  1.1 Using foreach
  1.2 Parallel Backends
    1.2.1 Using the doParallel parallel backend
    1.2.2 Getting information about the parallel backend
  1.3 Nesting Calls to foreach
  1.4 Using Iterators
    1.4.1 Some Special Iterators
    1.4.2 Writing Iterators

A Function and Class Reference
  A.1 iterators package: iapply, icount, idiv, iread.table, ireadLines, irnorm, isplit, iter, iterators-package, nextElem
  A.2 foreach package: foreach, foreach-ext, foreach-package, getDoParWorkers, setDoPar, registerDoSEQ


  A.3 doParallel package: doParallel-package, registerDoParallel
  A.4 doMC package: doMC-package, registerDoMC
  A.5 multicore package: children, fork, mclapply, multicore, parallel, process, sendMaster, signals

Chapter 1

Parallelizing Loops

One common approach to parallelization is to see if the iterations within a loop can be performed independently, and if so, then try to run the iterations concurrently rather than sequentially. The foreach and iterators packages can help you do this loop parallelization quickly and easily.

1.1 Using foreach

The foreach package is a set of tools that allows you to run virtually anything that can be expressed as a for loop as a set of parallel tasks. One application of this is to allow multiple simulations to run in parallel. As a simple example, consider the case of simulating 10000 coin flips, which can be done by sampling with replacement from the vector c("H", "T"). To run this simulation 10 times sequentially, use foreach with the %do% operator:

> library(foreach)
> foreach(i=1:10) %do% sample(c("H", "T"), 10000, replace=TRUE)

Comparing the foreach output with that of a similar for loop shows one obvious difference: foreach returns a list containing the value returned by each computation. A for loop, by contrast, returns only the value of its last computation, and relies on user-defined side effects to do its work. We can parallelize the operation immediately by replacing %do% with %dopar%:

> foreach(i=1:10) %dopar% sample(c("H", "T"), 10000, replace=TRUE)

However, if we run this example, we see the following warning:

Warning message:
executing %dopar% sequentially: no parallel backend registered

To actually run in parallel, we need to have a "parallel backend" for foreach. Parallel backends are discussed in the next section.
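For comparison, here is a minimal for-loop version of the same simulation (a sketch; the result list r is a name chosen here for illustration). Note how it must allocate a container and fill it as a side effect:

> # for-loop equivalent: results are collected via side effects
> r <- vector("list", 10)
> for (i in 1:10) r[[i]] <- sample(c("H", "T"), 10000, replace=TRUE)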

1.2 Parallel Backends

In order for loops coded with foreach to run in parallel, you must register a parallel backend to manage the execution of the loop. Any type of mechanism for running code in parallel could potentially have a parallel backend written for it. Currently, Revolution R Enterprise includes the doParallel backend; this uses the parallel package of R 2.14.0 or later to run jobs in parallel, using either of the component parallelization methods incorporated into the parallel package: SNOW-like functionality using socket connections or multicore-like functionality using forking (on Linux only). The doParallel package is a parallel backend for foreach that is intended for parallel processing on a single computer with multiple cores or processors. Additional parallel backends are available from CRAN:

• doMPI for use with the Rmpi package

• doRedis for use with the rredis package

• doMC provides access to the multicore functionality of the parallel package

• doSNOW for use with the now superseded SNOW package.

To use a parallel backend, you must first register it. Once a parallel backend is registered, calls to %dopar% run in parallel using the mechanisms provided by the parallel backend. However, the details of registering the parallel backends differ, so we consider them separately.

1.2.1 Using the doParallel parallel backend

The parallel package of R 2.14.0 and later combines elements of snow and multicore; doParallel similarly combines elements of both doSNOW and doMC. You can register doParallel with a cluster, as with doSNOW, or with a number of cores, as with doMC. For example, here we create a cluster and register it:

> library(doParallel)
> cl <- makeCluster(4)
> registerDoParallel(cl)

Once you’ve registered the parallel backend, you’re ready to run foreach code in parallel. For example, to see how long it takes to run 10,000 bootstrap iterations in parallel on all available cores, you can run the following code:

> x <- iris[which(iris[,5] != "setosa"), c(1,5)]
> trials <- 10000
> ptime <- system.time({
+   r <- foreach(icount(trials), .combine = cbind) %dopar% {
+     ind <- sample(100, 100, replace = TRUE)
+     result1 <- glm(x[ind, 2] ~ x[ind, 1], family=binomial(logit))
+     coefficients(result1)
+   }
+ })[3]
> ptime
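For a sequential baseline to compare against, you can rerun the same loop with %do% in place of %dopar%. When you are finished with the cluster created above, it should be shut down (a usage note; stopCluster is from the parallel package, which doParallel loads):

> stopCluster(cl)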

1.2.2 Getting information about the parallel backend

To find out how many workers foreach is going to use, you can use the getDoParWorkers function:

> getDoParWorkers()

This is a useful sanity check that you're actually running in parallel. If you haven't registered a parallel backend, or if your machine has only one core, getDoParWorkers will return 1. In either case, don't expect a speed improvement. The getDoParWorkers function is also useful when you want the number of tasks to be equal to the number of workers. You may want to pass this value to an iterator constructor, for example. You can also get the name and version of the currently registered backend:

> getDoParName()
> getDoParVersion()

1.3 Nesting Calls to foreach

An important feature of foreach is the nesting operator %:%. Like the %do% and %dopar% operators, it is a binary operator, but it operates on two foreach objects. It also returns a foreach object, which is essentially a special merger of its operands.

Let's say that we want to perform a Monte Carlo simulation using a function called sim. The sim function takes two arguments, and we want to call it with all combinations of the values that are stored in the vectors avec and bvec. The following doubly-nested for loop does that. For testing purposes, the sim function is defined to return 10a + b (although an operation this trivial is not worth executing in parallel):

sim <- function(a, b) 10 * a + b
avec <- 1:2
bvec <- 1:4

x <- matrix(0, length(avec), length(bvec))
for (j in 1:length(bvec)) {
  for (i in 1:length(avec)) {
    x[i,j] <- sim(avec[i], bvec[j])
  }
}
x

In this case, it makes sense to store the results in a matrix, so we create one of the proper size called x, and assign the return value of sim to the appropriate element of x each time through the inner loop.

When using foreach, we don't create a matrix and assign values into it. Instead, the inner loop returns the columns of the result matrix as vectors, which are combined in the outer loop into a matrix. Here's how to do that using the %:% operator:

x <- foreach(b=bvec, .combine='cbind') %:%
  foreach(a=avec, .combine='c') %do% {
    sim(a, b)
  }
x

This is structured very much like the nested for loop. The outer foreach iterates over the values in bvec, passing them to the inner foreach, which iterates over the values in avec for each value of bvec. Thus, the sim function is called in the same way in both cases. The code is slightly cleaner in this version, and has the advantage of being easily parallelized.

When parallelizing nested for loops, there is always a question of which loop to parallelize. The standard advice is to parallelize the outer loop. This results in larger individual tasks, and larger tasks can often be performed more efficiently than smaller tasks. However, if the outer loop doesn't have many iterations and the tasks are already large, parallelizing the outer loop results in a small number of huge tasks, which may not allow you to use all of your processors, and can also result in load balancing problems. You could parallelize an inner loop instead, but that could be inefficient because you're repeatedly waiting for all the results to be returned every time through the outer loop. And if the tasks and number of iterations vary in size, then it's really hard to know which loop to parallelize.

But in our Monte Carlo example, all of the tasks are completely independent of each other, and so they can all be executed in parallel. You really want to think of the loops as specifying a single stream of tasks. You just need to be careful to process all of the results correctly, depending on which iteration of the inner loop they came from.

That is exactly what the %:% operator does: it turns multiple foreach loops into a single loop. That is why there is only one %do% operator in the example above. And when we parallelize that nested foreach loop by changing the %do% into a %dopar%, we are creating a single stream of tasks that can all be executed in parallel:

x <- foreach(b=bvec, .combine='cbind') %:%
  foreach(a=avec, .combine='c') %dopar% {
    sim(a, b)
  }
x

Of course, we’ll actually only run as many tasks in parallel as we have processors, but the parallel backend takes care of all that. The point is that the %:% operator makes it easy to specify the stream of tasks to be executed, and the .combine argument to foreach allows us to specify how the results should be processed. The backend handles executing the tasks in parallel. For more on nested foreach calls, see the vignette “Nesting foreach Loops” in the foreach package.
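The nesting operator can also be combined with the when function to filter the tasks that get executed. A minimal sketch (the filtering rule here, keeping only even values of bvec, is invented for illustration):

# only even values of b generate tasks; odd values are skipped
evens <- foreach(b=bvec, .combine='c') %:% when(b %% 2 == 0) %do% b
evens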

1.4 Using Iterators

An iterator is a special type of object that generalizes the notion of a looping variable. When passed as an argument to a function that knows what to do with it, the iterator supplies a sequence of values. The iterator also maintains information about its state, in particular its current index. The iterators package includes a number of functions for creating iterators, the simplest of which is iter, which takes virtually any R object and turns it into an iterator object. The simplest function that operates on iterators is the nextElem function, which when given an iterator, returns the next value of the iterator. For example, here we create an iterator object from the sequence 1 to 10, and then use nextElem to iterate through the values:

> i1 <- iter(1:10)
> nextElem(i1)
[1] 1
> nextElem(i1)
[1] 2

You can create iterators from matrices and data frames, using the by argument to specify whether to iterate by row or column:

> istate <- iter(state.x77, by='row')
> nextElem(istate)
        Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708

> nextElem(istate)
       Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432

Iterators can also be created from functions, in which case the iterator can be an endless source of values:

> ifun <- iter(function() sample(0:9, 4, replace=TRUE))
> nextElem(ifun)
[1] 9 5 2 8
> nextElem(ifun)
[1] 3 4 2 2

For practical applications, iterators can be paired with foreach to obtain parallel results quite easily:

> x <- matrix(rnorm(1000000), ncol=1000)
> itx <- iter(x, by='row')
> foreach(i=itx, .combine=c) %dopar% mean(i)

1.4.1 Some Special Iterators

The notion of an iterator is new to R, but should be familiar to users of languages such as Python. The iterators package includes a number of special functions that generate iterators for some common scenarios. For example, the irnorm function creates an iterator for which each value is drawn from a specified random normal distribution:

> library(iterators)
> itrn <- irnorm(1, count=10)
> nextElem(itrn)
[1] 0.6300053
> nextElem(itrn)
[1] 1.242886

Similarly, the irunif, irbinom, and irpois functions create iterators which draw their values from uniform, binomial, and Poisson distributions, respectively. (These functions use the standard R distribution functions to generate random numbers, and these are not necessarily useful in a distributed or parallel environment. When using random numbers with foreach, we recommend using the doRNG package to ensure independent random number streams on each worker.) We can then use these functions just as we used irnorm:

> itru <- irunif(1, count=10)
> nextElem(itru)
[1] 0.4960539
> nextElem(itru)
[1] 0.4071111
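As a sketch of the doRNG recommendation above (assuming the doRNG package is installed and a parallel backend is registered; registerDoRNG is part of that package, and it makes subsequent %dopar% loops use reproducible, independent random number streams):

> library(doRNG)
> registerDoRNG(1234)
> foreach(i=1:3, .combine=c) %dopar% runif(1)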

The icount function returns an iterator that counts starting from one:

> it <- icount(3)
> nextElem(it)
[1] 1
> nextElem(it)
[1] 2
> nextElem(it)
[1] 3

1.4.2 Writing Iterators

There will be times when you need an iterator that isn't provided by the iterators package. That is when you need to write your own custom iterator. Basically, an iterator is an S3 object whose base class is "iter", and which has iter and nextElem methods.

The purpose of the iter method is to return an iterator for the specified object. For iterators, that usually just means returning itself, which seems odd at first. But the iter method can be defined for other objects that don't define a nextElem method. We call those objects iterables, meaning that you can iterate over them. The iterators package defines iter methods for vectors, lists, matrices, and data frames, making those objects iterables. By defining an iter method for iterators, they can be used in the same context as an iterable, which can be convenient.

For example, the foreach function takes iterables as arguments. It calls the iter method on those arguments in order to create iterators for them. By defining the iter method for all iterators, we can pass iterators to foreach that we created using any method we choose. Thus, we can pass vectors, lists, or iterators to foreach, and they are all processed by foreach in exactly the same way.

The iterators package comes with an iter method defined for the "iter" class that simply returns itself. That is usually all that is needed for an iterator. However, if you want to create an iterator for some existing class, you can do that by writing an iter method that returns an appropriate iterator. That will allow you to pass an instance of your class to foreach, which will automatically convert it into an iterator. The alternative is to write your own function that takes arbitrary arguments, and returns an iterator. You can choose whichever method is most natural.

The most important method required for iterators is nextElem. This simply returns the next value, or throws an error. Calling the stop function with the string "StopIteration" indicates that there are no more values available in the iterator.

In most cases, you don't actually need to write the iter and nextElem methods; you can inherit them. By inheriting from the class abstractiter, you can use the following methods as the basis of your own iterators:

> iterators:::iter.iter
function (obj, ...)
{
    obj
}
> iterators:::nextElem.abstractiter
function (obj, ...)
{
    obj$nextElem()
}

The following function creates a simple iterator that uses these two methods:

iforever <- function(x) {
  nextEl <- function() x
  obj <- list(nextElem=nextEl)
  class(obj) <- c('iforever', 'abstractiter', 'iter')
  obj
}

Note that we called the internal function nextEl rather than nextElem to avoid masking the standard nextElem generic function. Masking the generic would cause problems when you want your iterator to call the nextElem method of another iterator, which can be quite useful. We create an instance of this iterator by calling the iforever function, and then use it by calling the nextElem method on the resulting object:

it <- iforever(42)
nextElem(it)
nextElem(it)
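Because foreach stops as soon as its shortest iterator is exhausted, an endless iterator such as this one is safe to use when paired with a finite one. A small sketch:

# the finite vector 1:3 bounds the loop; x yields 42 on every iteration
foreach(i=1:3, x=iforever(42), .combine='c') %do% (i + x)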

Notice that it doesn't make sense to implement this iterator by defining a new iter method, since there is no natural iterable on which to dispatch. The only argument that we need is the object for the iterator to return, which can be of any type. Instead, we implement this iterator by defining a normal function that returns the iterator.

This iterator is quite simple to implement, and possibly even useful. Be careful, however, how you use this iterator. If you pass it to foreach, it will result in an infinite loop unless you pair it with a finite iterator. And never pass this iterator to as.list without the n argument.

The iterator returned by iforever is a list that has a single element named nextElem, whose value is a function that returns the value of x. Because we are subclassing abstractiter, we inherit a nextElem method that will call this function, and because we are subclassing iter, we inherit an iter method that will return itself.

Of course, the reason this iterator is so simple is that it doesn't contain any state. Most iterators need to contain some state, or it would be difficult to make them return different values and eventually stop. Managing the state is usually the real trick to writing iterators.

As an example of writing a stateful iterator, let's modify the previous iterator to put a limit on the number of values that it returns. I'll call the new function irep, and give it another argument called times:

irep <- function(x, times) {
  nextEl <- function() {
    if (times > 0) {
      times <<- times - 1
    } else {
      stop('StopIteration')
    }
    x
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c('irep', 'abstractiter', 'iter')
  obj
}

Now let's try it out:

it <- irep(7, 6)
unlist(as.list(it))

The real difference between iforever and irep is in the function that gets called by the nextElem method. This function not only accesses the values of the variables x and times, but it also modifies the value of times. This is accomplished by means of the <<- operator, and the magic of lexical scoping. Technically, this kind of function is called a closure, and is a somewhat advanced feature of R. The important thing to remember is that nextEl is able to get the value of variables that were passed as arguments to irep, and it can modify those values using the <<- operator. These are not global variables: they are defined in the enclosing environment of the nextEl function. You can create as many iterators as you want using the irep function, and they will all work as expected without conflicts.

Note that this iterator only uses the arguments to irep to store its state. If any other state variables are needed, they can be defined anywhere inside the irep function.

More examples of writing iterators can be found in the vignette "Writing Custom Iterators" in the iterators package.
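To illustrate the claim that separate iterators don't conflict, here is a small sketch with two independent irep instances, each holding its own times counter in its own enclosing environment:

it1 <- irep('a', 2)
it2 <- irep('b', 3)
unlist(as.list(it1))  # "a" "a"
unlist(as.list(it2))  # "b" "b" "b"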

Appendix A

Function and Class Reference

A.1 iterators package

iapply Array/Apply Iterator

Description

Returns an iterator over an array, which iterates over the array in much the same manner as the apply function.

Usage

iapply(X, MARGIN)

Arguments

X       the array to iterate over.
MARGIN  a vector of subscripts. 1 indicates the first dimension (rows), 2 indicates the second dimension (columns), etc.

Value

The apply iterator.

See Also

apply

Examples

a <- array(1:8, c(2, 2, 2))

# iterate over all the matrices
it <- iapply(a, 3)
as.list(it)

# iterate over all the columns of all the matrices
it <- iapply(a, c(2, 3))
as.list(it)

# iterate over all the rows of all the matrices
it <- iapply(a, c(1, 3))
as.list(it)

icount Counting Iterators

Description

Returns an iterator that counts starting from one.

Usage

icount(count)
icountn(vn)

Arguments

count  number of times that the iterator will fire. If not specified, it will count forever.
vn     vector of counts.

Value

The counting iterator.

Examples

# create an iterator that counts from 1 to 3.
it <- icount(3)
nextElem(it)
nextElem(it)
nextElem(it)
try(nextElem(it))  # expect a StopIteration exception
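The icountn form iterates over all combinations of the counts given in vn. A minimal sketch (the interpretation of the output as one vector per combination is the author's understanding of the function):

it <- icountn(c(2, 2))
as.list(it)  # four vectors, covering every combination of 1:2 and 1:2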

idiv Dividing Iterator

Description

Returns an iterator that returns pieces of a numeric value.

Usage

idiv(n, ..., chunks, chunkSize)

Arguments

n          the numeric value to divide into pieces.
...        unused.
chunks     the number of pieces that n should be divided into. This is useful when you know the number of pieces that you want. If specified, then chunkSize should not be.
chunkSize  the maximum size of the pieces that n should be divided into. This is useful when you know the size of the pieces that you want. If specified, then chunks should not be.

Value

The dividing iterator.

Examples

# divide the value 10 into 3 pieces
it <- idiv(10, chunks=3)
nextElem(it)
nextElem(it)
nextElem(it)
try(nextElem(it))  # expect a StopIteration exception

# divide the value 10 into pieces no larger than 3
it <- idiv(10, chunkSize=3)
nextElem(it)
nextElem(it)
nextElem(it)
nextElem(it)
try(nextElem(it))  # expect a StopIteration exception
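A common use of idiv is to split a large job into one chunk of work per worker. A hedged sketch, assuming a parallel backend has been registered:

# generate 1000 random numbers, one chunk per worker
foreach(n=idiv(1000, chunks=getDoParWorkers()), .combine=c) %dopar% rnorm(n)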

iread.table Iterator over Rows of a Data Frame Stored in a File

Description

Returns an iterator over the rows of a data frame stored in a file in table format. It is a wrapper around the standard read.table function.

Usage

iread.table(file, ..., verbose=FALSE)

Arguments

file     the name of the file to read the data from.
...      all additional arguments are passed on to the read.table function. See the documentation for read.table for more information.
verbose  logical value indicating whether or not to print the calls to read.table.

Value

The file reading iterator.

Note

In this version of iread.table, both of the read.table arguments header and row.names must be specified. This is because the default values of these arguments depend on the contents of the beginning of the file. In order to make the subsequent calls to read.table work consistently, the user must specify those arguments explicitly. A future version of iread.table may remove this requirement.

See Also

read.table
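The help page above gives no example, so here is a minimal sketch (the temporary file and the data frame are invented for illustration; note that header and row.names must be supplied, per the Note):

# write a small data frame to a file, then iterate over its rows
df <- data.frame(x=1:3, y=c('a', 'b', 'c'))
tf <- tempfile()
write.table(df, tf)
it <- iread.table(tf, header=TRUE, row.names=1)
nextElem(it)
nextElem(it)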

ireadLines Iterator over Lines of Text from a Connection

Description

Returns an iterator over the lines of text from a connection. It is a wrapper around the standard readLines function.

Usage

ireadLines(con, n=1, ...)

Arguments

con  a connection object or a character string.
n    integer. The maximum number of lines to read. Negative values indicate that one should read up to the end of the connection. The default value is 1.
...  passed on to the readLines function.

Value

The line reading iterator.

See Also

readLines

Examples

# create an iterator over the lines of COPYING
it <- ireadLines(file.path(R.home(), 'COPYING'))
nextElem(it)
nextElem(it)
nextElem(it)

irnorm Random Number Iterators

Description

These functions return iterators that return random numbers from various distributions. Each one is a wrapper around a standard R function.

Usage

irnorm(..., count)
irunif(..., count)
irbinom(..., count)
irnbinom(..., count)
irpois(..., count)

Arguments

count  number of times that the iterator will fire. If not specified, it will fire values forever.
...    arguments to pass to the underlying random number function (rnorm, runif, etc.).

Value

An iterator that is a wrapper around the corresponding random number generator function.

Examples

# create an iterator that returns three random numbers
it <- irnorm(1, count=3)
nextElem(it)
nextElem(it)
nextElem(it)
try(nextElem(it))  # expect a StopIteration exception

isplit Split Iterator

Description

Returns an iterator that divides the data in the vector x into the groups defined by f.

Usage

isplit(x, f, drop=FALSE, ...)

Arguments

x     vector or data frame of values to be split into groups.
f     a factor or list of factors used to categorize x.
drop  logical indicating if levels that do not occur should be dropped.
...   currently ignored.

Value

The split iterator.

See Also

split

Examples

x <- rnorm(200)
f <- factor(sample(1:10, length(x), replace=TRUE))

it <- isplit(x, f)
expected <- split(x, f)

for (i in expected) {
  actual <- nextElem(it)
  stopifnot(actual$value == i)
}

iter Iterator Factory Functions

Description

iter is a generic function used to create iterator objects.

Usage

iter(obj, ...)

## Default S3 method:
iter(obj, checkFunc=function(...) TRUE, recycle=FALSE, ...)
## S3 method for class 'iter'
iter(obj, ...)
## S3 method for class 'matrix'
iter(obj, by=c('column', 'cell', 'row'), chunksize=1L,
     checkFunc=function(...) TRUE, recycle=FALSE, ...)
## S3 method for class 'data.frame'
iter(obj, by=c('column', 'row'), checkFunc=function(...) TRUE,
     recycle=FALSE, ...)
## S3 method for class 'function'
iter(obj, checkFunc=function(...) TRUE, recycle=FALSE, ...)

Arguments

obj        an object from which to generate an iterator.
by         how to iterate.
chunksize  the number of elements of by to return with each call to nextElem.
checkFunc  a function which, when passed an iterator value, returns TRUE or FALSE. If FALSE, the value is skipped in the iteration.
recycle    a boolean describing whether the iterator should reset after running through all its values.
...        additional arguments affecting the iterator.

Value

The iterator.

Examples

# a vector iterator
i1 <- iter(1:3)
nextElem(i1)
nextElem(i1)
nextElem(i1)

# a vector iterator with a checkFunc
i1 <- iter(1:3, checkFunc=function(i) i %% 2 == 0)
nextElem(i1)

# a data frame iterator by column
i2 <- iter(data.frame(x=1:3, y=10, z=c('a', 'b', 'c')))
nextElem(i2)
nextElem(i2)
nextElem(i2)

# a data frame iterator by row
i3 <- iter(data.frame(x=1:3, y=10), by='row')
nextElem(i3)
nextElem(i3)
nextElem(i3)

# a function iterator
i4 <- iter(function() rnorm(1))
nextElem(i4)
nextElem(i4)
nextElem(i4)

iterators-package The Iterators Package

Description

The iterators package provides tools for iterating over various R data structures. Iterators are available for vectors, lists, matrices, data frames, and files. By following very simple conventions, new iterators can be written to support any type of data source, such as database queries or dynamically generated data.

Details

Further information is available in the following help topics:

iter        Generic function used to create iterator objects.
nextElem    Generic function used to get the next element of an iterator.
icount      A function used to create a counting iterator.
idiv        A function used to create a number dividing iterator.
ireadLines  A function used to create a file reading iterator.

For a complete list of functions with individual help pages, use library(help="iterators").

nextElem Get Next Element of Iterator

Description

nextElem is a generic function used to produce values. If a checkFunc was specified to the constructor, the potential iterated values will be passed to the checkFunc until the checkFunc returns TRUE. When the iterator has no more values, it calls stop with the message 'StopIteration'.

Usage

nextElem(obj, ...)

## S3 method for class 'containeriter'
nextElem(obj, ...)
## S3 method for class 'funiter'
nextElem(obj, ...)

Arguments

obj  an iterator object.
...  additional arguments that are ignored.

Value

The value.

Examples

it <- iter(c('a', 'b', 'c'))
print(nextElem(it))
print(nextElem(it))
print(nextElem(it))

A.2 foreach package

foreach foreach

Description

%do% and %dopar% are binary operators that operate on a foreach object and an R expression. The expression, ex, is evaluated multiple times in an environment that is created by the foreach object, and that environment is modified for each evaluation as specified by the foreach object. %do% evaluates the expression sequentially, while %dopar% evaluates it in parallel. The results of evaluating ex are returned as a list by default, but this can be modified by means of the .combine argument.

Usage

foreach(..., .combine, .init, .final=NULL, .inorder=TRUE,
        .multicombine=FALSE,
        .maxcombine=if (.multicombine) 100 else 2,
        .errorhandling=c('stop', 'remove', 'pass'),
        .packages=NULL, .export=NULL, .noexport=NULL,
        .verbose=FALSE)
when(cond)
e1 %:% e2
obj %do% ex
obj %dopar% ex
times(n)

Arguments

...             one or more arguments that control how ex is evaluated. Named arguments specify the names and values of variables to be defined in the evaluation environment. An unnamed argument can be used to specify the number of times that ex should be evaluated. At least one argument must be specified in order to define the number of times ex should be executed.
.combine        function that is used to process the task results as they are generated. This can be specified as either a function or a non-empty character string naming the function. Specifying 'c' is useful for concatenating the results into a vector, for example. The values 'cbind' and 'rbind' can combine vectors into a matrix. The values '+' and '*' can be used to process numeric data. By default, the results are returned in a list.
.init           initial value to pass as the first argument of the .combine function. This should not be specified unless .combine is also specified.
.final          function of one argument that is called to return the final result.
.inorder        logical flag indicating whether the .combine function requires the task results to be combined in the same order that they were submitted. If the order is not important, then setting .inorder to FALSE can give improved performance. The default value is TRUE.
.multicombine   logical flag indicating whether the .combine function can accept more than two arguments. If an arbitrary .combine function is specified, by default, that function will always be called with two arguments. If it can take more than two arguments, then setting .multicombine to TRUE could improve the performance. The default value is FALSE unless the .combine function is cbind, rbind, or c, which are known to take more than two arguments.
.maxcombine     maximum number of arguments to pass to the combine function. This is only relevant if .multicombine is TRUE.
.errorhandling  specifies how a task evaluation error should be handled. If the value is "stop", then execution will be stopped via the stop function if an error occurs. If the value is "remove", the result for that task will not be returned, or passed to the .combine function. If it is "pass", then the error object generated by task evaluation will be included with the rest of the results. It is assumed that the combine function (if specified) will be able to deal with the error object. The default value is "stop".
.packages       character vector of packages that the tasks depend on. If ex requires an R package to be loaded, this option can be used to load that package on each of the workers. Ignored when used with %do%.
.export         character vector of variables to export. This can be useful when accessing a variable that isn't defined in the current environment. The default value is NULL.
.noexport       character vector of variables to exclude from exporting. This can be useful to prevent variables from being exported that aren't actually needed, perhaps because the symbol is used in a model formula. The default value is NULL.
.verbose        logical flag enabling verbose messages. This can be very useful for troubleshooting.
obj             foreach object used to control the evaluation of ex.
e1              foreach object to merge.
e2              foreach object to merge.
ex              the R expression to evaluate.
cond            condition to evaluate.
n               number of times to evaluate the R expression.

Details

The foreach and %do%/%dopar% operators provide a looping construct that can be viewed as a hybrid of the standard for loop and lapply function. It looks similar to the for loop, and it evaluates an expression, rather than a function (as in lapply), but its purpose is to return a value (a list, by default), rather than to cause side effects. This facilitates parallelization, but looks more natural to people who prefer for loops to lapply.

The %:% operator is the nesting operator, used for creating nested foreach loops. Type vignette("nested") at the R prompt for more details.

Parallel computation depends upon a parallel backend that must be registered before performing the computation. The parallel backends available will be system-specific, but include doParallel, which uses R's built-in parallel package, doMC, which uses the multicore package, and doSNOW. Each parallel backend has a specific registration function, such as registerDoParallel or registerDoSNOW.

The times function is a simple convenience function that calls foreach. It is useful for evaluating an R expression multiple times when there are no varying arguments. This can be convenient for resampling, for example.

See Also

iter

Examples

# equivalent to rnorm(3)
times(3) %do% rnorm(1)

# equivalent to lapply(1:3, sqrt)
foreach(i=1:3) %do% sqrt(i)

# equivalent to colMeans(m)
m <- matrix(rnorm(9), 3, 3)
foreach(i=1:ncol(m), .combine=c) %do% mean(m[,i])

# normalize the rows of a matrix in parallel, with parentheses used to
# force proper operator precedence
# Need to register a parallel backend before this example will run
# in parallel
foreach(i=1:nrow(m), .combine=rbind) %dopar% (m[i,] / mean(m[i,]))

# simple (and inefficient) parallel matrix multiply
library(iterators)
a <- matrix(1:16, 4, 4)
b <- t(a)
foreach(b=iter(b, by='col'), .combine=cbind) %dopar% (a %*% b)

# split a data frame by row, and put them back together again without
# changing anything
d <- data.frame(x=1:10, y=rnorm(10))
s <- foreach(d=iter(d, by='row'), .combine=rbind) %dopar% d
identical(s, d)

# a quick sort function
qsort <- function(x) {
  n <- length(x)
  if (n == 0) {
    x
  } else {
    p <- sample(n, 1)
    smaller <- foreach(y=x[-p], .combine=c) %:% when(y <= x[p]) %do% y
    larger  <- foreach(y=x[-p], .combine=c) %:% when(y >  x[p]) %do% y
    c(qsort(smaller), x[p], qsort(larger))
  }
}
qsort(runif(12))

foreach-ext Foreach Extension Functions

Description

These functions are used to write parallel backends for the foreach package. They should not be used from normal scripts or packages that use the foreach package.

Usage

makeAccum(it)
accumulate(obj, result, tag, ...)
getexports(ex, e, env, good=character(0), bad=character(0))
getResult(obj, ...)
getErrorValue(obj, ...)
getErrorIndex(obj, ...)

Arguments

it      foreach iterator.
ex      call object to analyze.
e       local environment of the call object.
env     exported environment in which the call object will be evaluated.
good    names of symbols that are being exported.
bad     names of symbols that are not being exported.
obj     foreach iterator object.
result  task result to accumulate.
tag     tag of task result to accumulate.
...     unused.

Note

These functions are likely to change in future versions of the foreach package. When they become more stable, they will be documented.

foreach-package The Foreach Package

Description

The foreach package provides a new looping construct for executing R code repeatedly. The main reason for using the foreach package is that it supports parallel execution. The foreach package can be used with a variety of different parallel computing systems, including NetWorkSpaces and snow. In addition, foreach can be used with iterators, which allows the data to be specified in a very flexible way.

Details

Further information is available in the following help topics:

foreach  Specify the variables to iterate over
%do%     Execute the R expression sequentially
%dopar%  Execute the R expression using the currently registered backend

To see a tutorial introduction to the foreach package, use vignette("foreach"). To see a demo of foreach computing the sinc function, use demo(sincSEQ).

Some examples (in addition to those in the help pages) are included in the "examples" directory of the foreach package. To list the files in the examples directory, use list.files(system.file("examples", package="foreach")). To run the bootstrap example, use source(system.file("examples", "bootseq.R", package="foreach")).

For a complete list of functions with individual help pages, use library(help="foreach").

getDoParWorkers Functions Providing Information on the doPar Backend

Description

The getDoParWorkers function returns the number of execution workers there are in the currently registered doPar backend. It can be useful when determining how to split up the work to be executed in parallel. A 1 is returned by default.

The getDoParRegistered function returns TRUE if a doPar backend has been registered, otherwise FALSE.

The getDoParName function returns the name of the currently registered doPar backend. A NULL is returned if no backend is registered.

The getDoParVersion function returns the version of the currently registered doPar backend. A NULL is returned if no backend is registered.

Usage

getDoParWorkers()
getDoParRegistered()
getDoParName()
getDoParVersion()

Examples

cat(sprintf('%s backend is registered\n',
            if (getDoParRegistered()) 'A' else 'No'))
cat(sprintf('Running with %d worker(s)\n', getDoParWorkers()))
(name <- getDoParName())
(ver <- getDoParVersion())
if (getDoParRegistered())
  cat(sprintf('Currently using %s [%s]\n', name, ver))

setDoPar setDoPar

Description

The setDoPar function is used to register a parallel backend with the foreach package. This isn't normally executed by the user. Instead, packages that provide a parallel backend provide a function named registerDoPar that calls setDoPar using the appropriate arguments.

Usage

setDoPar(fun, data=NULL, info=function(data, item) NULL)

Arguments

fun   A function that implements the functionality of %dopar%.
data  Data to be passed to the registered function.
info  Function that retrieves information about the backend.

See Also

%dopar%

registerDoSEQ registerDoSEQ

Description

The registerDoSEQ function is used to explicitly register a sequential parallel backend with the foreach package. This will prevent a warning message from being issued if the %dopar% function is called and no parallel backend has been registered.

Usage

registerDoSEQ()

See Also

registerDoSNOW

Examples

# specify that %dopar% should run sequentially
registerDoSEQ()

A.3 doParallel package

doParallel-package The doParallel Package

Description

The doParallel package provides a parallel backend for the foreach/%dopar% function using the parallel package of R 2.14.0 and later.

Details

Further information is available in the following help topics:

registerDoParallel  register doParallel to be used by foreach/%dopar%

To see a tutorial introduction to the doParallel package, use vignette("gettingstartedParallel"). To see a tutorial introduction to the foreach package, use vignette("foreach"). To see a demo of doParallel computing the sinc function, use demo(sincParallel).

Some examples (in addition to those in the help pages) are included in the "examples" directory of the doParallel package. To list the files in the examples directory, use list.files(system.file("examples", package="doParallel")). To run the bootstrap example, use source(system.file("examples", "bootParallel.R", package="doParallel")). This is a simple benchmark, executing both sequentially and in parallel. There are many more examples that come with the foreach package, which will work with the doParallel package if it is registered as the parallel backend.

For a complete list of functions with individual help pages, use library(help="doParallel").

registerDoParallel registerDoParallel

Description

The registerDoParallel function is used to register the parallel backend with the foreach package.

Usage

registerDoParallel(cl, cores=NULL, ...)

Arguments

cl     A cluster object as returned by makeCluster, or the number of nodes to be created in the cluster. If not specified, on Windows a three worker cluster is created and used.
cores  The number of cores to use for parallel execution. If not specified, the number of cores is set to the value of options("cores"), if specified, or to one-half the number of cores detected by the parallel package.
...    Package options. Currently, only the nocompile option is supported. If nocompile is set to TRUE, compiler support is disabled.

Details

The parallel package from R 2.14.0 and later provides functions for parallel execution of R code on machines with multiple cores or processors or multiple computers. It is essentially a blend of the snow and multicore packages. By default, the doParallel package uses snow-like functionality on Windows systems and multicore-like functionality on Unix-like systems. The snow-like functionality should work fine on Unix-like systems, but the multicore-like functionality is limited to a single sequential worker on Windows systems. On workstations with multiple cores running Unix-like operating systems, the system fork call is used to spawn copies of the current process.
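A minimal usage sketch (assuming a recent doParallel; stopImplicitCluster cleans up the cluster that registerDoParallel creates internally when given a core count rather than a cluster object):

> library(doParallel)
> registerDoParallel(cores=2)
> foreach(i=1:3, .combine=c) %dopar% sqrt(i)
> stopImplicitCluster()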

A.4 doMC package

doMC-package The doMC Package

Description

The doMC package provides a parallel backend for the foreach/%dopar% function using Simon Urbanek's multicore package.

Details

Further information is available in the following help topics:

registerDoMC  register doMC to be used by foreach/%dopar%

To see a tutorial introduction to the doMC package, use vignette("gettingstartedMC"). To see a tutorial introduction to the foreach package, use vignette("foreach"). To see a demo of doMC computing the sinc function, use demo(sincMC).

Some examples (in addition to those in the help pages) are included in the "examples" directory of the doMC package. To list the files in the examples directory, use list.files(system.file("examples", package="doMC")). To run the bootstrap example, use source(system.file("examples", "bootMC.R", package="doMC")). This is a simple benchmark, executing both sequentially and in parallel. There are many more examples that come with the foreach package, which will work with the doMC package if it is registered as the parallel backend.

For a complete list of functions with individual help pages, use library(help="doMC").

registerDoMC registerDoMC

Description

The registerDoMC function is used to register the multicore parallel backend with the foreach package.

Usage

registerDoMC(cores=NULL, ...)

Arguments

cores  The number of cores to use for parallel execution. If not specified, the number of cores is set to the value of options("cores"), if specified, or to approximately half the number of cores detected by the parallel or multicore package.
...    Package options. Currently, only the nocompile option is supported. If nocompile is set to TRUE, compiler support is disabled.

Details

The multicore package by Simon Urbanek provides functions for parallel execution of R code on machines with multiple cores or processors, using the system fork call to spawn copies of the current process. The multicore package, and therefore registerDoMC, should not be used in a GUI environment, because multiple processes then share the same GUI.
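A minimal usage sketch (Unix-alike systems and console R only, per the warning above):

> library(doMC)
> registerDoMC(cores=2)
> foreach(i=1:3, .combine=c) %dopar% sqrt(i)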

A.5 multicore package

children Functions for management of parallel children processes

Description

children returns all currently active children.
readChild reads data from a given child process.
selectChildren checks children for available data.
readChildren checks children for available data and reads from the first child that has available data.
sendChildStdin sends a string (or data) to the child's standard input.
kill sends a signal to a child process.

Usage

children()
readChild(child)
readChildren(timeout = 0)
selectChildren(children = NULL, timeout = 0)
sendChildStdin(child, what)
kill(process, signal = SIGINT)

Arguments

child     child process (object of the class childProcess) or a process ID (pid).
timeout   timeout (in seconds, fractions supported) to wait before giving up. Negative numbers mean wait indefinitely (strongly discouraged as it blocks R and may be removed in the future).
children  list of child processes, or a single child process object, or a vector of process IDs, or NULL. If NULL, behaves as if all currently known children were supplied.
what      character or raw vector. In the former case, elements are collapsed using the newline character. (But no trailing newline is added at the end!)
process   process (object of the class process) or a process ID (pid).
signal    signal to send (one of the SIG... constants (see signals) or a valid integer signal number).

Value

children returns a list of child processes (or an empty list).

readChild and readChildren return a raw vector with a "pid" attribute if data were available, an integer vector of length one with the process ID if a child terminated, or NULL if the child no longer exists (no children at all for readChildren).

selectChildren returns TRUE if the timeout was reached, FALSE if an error occurred (e.g. if the master process was interrupted), or an integer vector of process IDs with children that have data available.

sendChildStdin sends the given content to the standard input (stdin) of the child process. Note that if the master session was interactive, it will also be echoed on the standard output of the master process (unless disabled). The function is vector-compatible, so you can specify more than one child as a list or a vector of process IDs.

kill returns TRUE.

Warning

This is a very low-level API for expert use only. If you are interested in user-level parallel execution use mclapply, parallel and friends instead.

Author(s)

Simon Urbanek

See Also

fork, sendMaster, parallel, mclapply

fork Fork a copy of the current R process

Description

fork creates a new child process as a copy of the current R process.
exit closes the current child process, informing the master process as necessary.

Usage

fork()
exit(exit.code = 0L, send = NULL)

Arguments

exit.code  process exit code. Currently it is not used by multicore, but other applications might. By convention, 0 signifies a clean exit and 1 an error.
send       if not NULL, send this data before exiting (equivalent to using sendMaster).

Details

The fork function provides an interface to the fork system call. In addition it sets up a pipe between the master and child process that can be used to send data from the child process to the master (see sendMaster), and the child's stdin is re-mapped to another pipe held by the master process (see sendChildStdin).

If you are not familiar with the fork system call, do not use this function, since it leads to very complex inter-process interactions among the R processes involved.

In a nutshell, fork spawns a copy (child) of the current process that can work in parallel to the master (parent) process. At the point of forking, both processes share exactly the same state, including the workspace, global options, loaded packages, etc. Forking is relatively cheap in modern operating systems and no real copy of the used memory is created; instead, both processes share the same memory and only modified parts are copied. This makes fork an ideal tool for parallel processing, since there is no need to set up the parallel working environment: data and code are shared automatically from the start.

It is strongly discouraged to use fork in GUI or embedded environments, because it leads to several processes sharing the same GUI, which will likely cause chaos (and possibly crashes). Child processes should never use on-screen graphics devices.

Value

fork returns an object of the class childProcess (to the master) and masterProcess (to the child).

exit never returns.

Warning

This is a very low-level API for expert use only. If you are interested in user-level parallel execution use mclapply, parallel and friends instead.

Note

The Windows operating system lacks the fork system call, so it cannot be used with multicore.

Author(s)

Simon Urbanek

See Also

parallel, sendMaster

Examples

p <- fork()
if (inherits(p, "masterProcess")) {
  cat("I'm a child! ", Sys.getpid(), "\n")
  exit(, "I was a child")
}
cat("I'm the master\n")
unserialize(readChildren(1.5))

mclapply Parallel version of lapply

Description

mclapply is a parallelized version of lapply. It returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

Usage

mclapply(X, FUN, ..., mc.preschedule = TRUE, mc.set.seed = TRUE,
         mc.silent = FALSE, mc.cores = getOption("cores"))

Arguments

X               a vector (atomic or list) or an expressions vector. Other objects (including classed objects) will be coerced by as.list.
FUN             the function to be applied to each element of X.
...             optional arguments to FUN.
mc.preschedule  if set to TRUE then the computation is first divided into (at most) as many jobs as there are cores and then the jobs are started, each job possibly covering more than one value. If set to FALSE then one job is spawned for each value of X sequentially (if used with mc.set.seed=FALSE then random number sequences will be identical for all values). The former is better for short computations or a large number of values in X; the latter is better for jobs that have high variance of completion time and not too many values of X.
mc.set.seed     if set to TRUE then each parallel process first sets its seed to something different from other processes. Otherwise all processes start with the same (namely current) seed.
mc.silent       if set to TRUE then all output on stdout will be suppressed for all parallel processes spawned (stderr is not affected).
mc.cores        the number of cores to use, i.e. how many processes will be spawned (at most).

Details

mclapply is a parallelized version of lapply. By default (mc.preschedule=TRUE) the input vector/list X is split into as many parts as there are cores (currently the values are spread across the cores sequentially, i.e. first value to core 1, second to core 2, ..., (core + 1)-th value to core 1, etc.) and then one process is spawned to each core and the results are collected.

Due to the parallel nature of the execution, random numbers are not sequential (in the random number sequence) as they would be in lapply. They are sequential for each spawned process, but not for all jobs as a whole.

In addition, each process runs the job inside try(..., silent=TRUE), so if errors occur they will be stored as try-error objects in the list.

Note: the number of file descriptors is usually limited by the operating system, so you may have trouble using more than 100 cores or so (see ulimit -n or similar in your OS documentation) unless you raise the limit of permissible open file descriptors (fork will fail with "unable to create a pipe").

Value

A list.

Author(s)

Simon Urbanek

See Also

parallel, collect

Examples

mclapply(1:30, rnorm)

# use the same random numbers for all values
set.seed(1)
mclapply(1:30, rnorm, mc.preschedule=FALSE, mc.set.seed=FALSE)

multicore multicore R package for parallel processing of R code

Description

multicore is an R package that provides functions for parallel execution of R code on machines with multiple cores or CPUs. Unlike other parallel processing methods, all jobs share the full state of R when spawned, so no data or code needs to be initialized. The actual spawning is very fast as well, since no new R instance needs to be started.

Pivotal functions

mclapply - parallelized version of lapply
parallel and collect - functions to evaluate R expressions in parallel and collect the results.

Low-level functions

These functions should be used only by experienced users who understand the interaction of the master (parent) process and the child processes (jobs), as well as the system-level mechanics involved.

See the fork help page for the principles of forking parallel processes and system-level functions, and the children and sendMaster help pages for management of, and communication between, the parent and child processes.

Classes

multicore defines a few informal (S3) classes:

process is a list with a named entry pid containing the process ID.

childProcess is a subclass of process representing a child process of the current R process. A child process is a special process that can send messages to the parent process. The list may contain additional entries for IPC (more precisely, file descriptors); however, those are considered internal.

masterProcess is a subclass of process representing a handle that is passed to a child process by fork.

parallelJob is a subclass of childProcess representing a child process created using the parallel function. It may (optionally) contain a name entry, a character vector of length one giving the name of the job.

Options

By default, functions that spawn jobs across cores use the "cores" option (see options) to determine how many cores (or CPUs) will be used (unless specified directly). If this option is not set, multicore by default uses as many cores as are available. The number of available cores is determined on startup using the (non-exported) detectCores() function. It should work on most commonly used unix systems (Mac OS X, Linux, Solaris and IRIX), but there is no standard way of determining the number of cores, so please contact me (with sessionInfo() output and the test) if you have tests for other platforms. If in doubt, use multicore:::detectCores(all.tests=TRUE) to see whether your platform is covered by one of the already existing tests. If multicore cannot determine the number of cores (the above returns NA), it will default to 8 (which should be fine for most modern desktop systems).

Warning

multicore uses the fork system call to spawn a copy of the current process which performs the computations in parallel. Modern operating systems use a copy-on-write approach, which makes this so appealing for parallel computation, since only objects modified during the computation will actually be copied; all other memory is directly shared.

However, the copy shares everything, including any user interface elements. This can cause havoc, since, say, one window may suddenly belong to two processes. Therefore multicore should preferably be used in console R, and code executed in parallel may never use GUIs or on-screen devices. An (experimental) way to avoid some such problems in some GUI environments (those using pipes or sockets) is to use multicore:::closeAll() in each child process immediately after it is spawned.

Author(s)

Simon Urbanek

See Also

parallel, mclapply, fork, sendMaster, children and signals

parallel Evaluate an expression asynchronously in a separate process

Description

parallel starts a parallel process which evaluates the given expression.

mcparallel is a synonym for parallel that can be used at top level if parallel is masked by other packages. It should not be used in other packages, since it's just a shortcut for importing multicore::parallel.

collect collects results from parallel processes.

Usage

parallel(expr, name, mc.set.seed = FALSE, silent = FALSE)
mcparallel(expr, name, mc.set.seed = FALSE, silent = FALSE)
collect(jobs, wait = TRUE, timeout = 0, intermediate = FALSE)

Arguments

expr          expression to evaluate (do not use any on-screen devices or GUI elements in this code).
name          an optional name (character vector of length one) that can be associated with the job.
mc.set.seed   if set to TRUE then the random number generator is seeded such that it is different from any other process. Otherwise it will be the same as in the current R session.
silent        if set to TRUE then all output on stdout will be suppressed (stderr is not affected).
jobs          list of jobs (or a single job) to collect results for. Alternatively, jobs can also be an integer vector of process IDs. If omitted, collect will wait for all currently existing children.
wait          if set to FALSE, it checks for any results that are available within timeout seconds from now; otherwise it waits for all specified jobs to finish.
timeout       timeout (in seconds) to check for job results; applies only if wait is FALSE.
intermediate  FALSE or a function which will be called while collect waits for results. The function will be called with one parameter, which is the list of results received so far.

Details

parallel evaluates the expr expression in parallel to the current R process. Everything is shared read-only (or in fact copy-on-write) between the parallel process and the current process, i.e. no side effects of the expression affect the main process. The result of the parallel execution can be collected using the collect function.

collect collects any available results from parallel jobs (or in fact any child process). If wait is TRUE then collect waits for all specified jobs to finish before returning a list containing the last reported result for each job. If wait is FALSE then collect merely checks for any results available at the moment and will not wait for jobs to finish. If jobs is specified, jobs not listed there will not be affected or acted upon.

Note: If expr uses low-level multicore functions such as sendMaster, a single job can deliver results multiple times and it is the responsibility of the user to interpret them correctly. collect will return NULL for a terminating job that has sent its results already, after which the job is no longer available.

Value

parallel returns an object of the class parallelJob, which is in turn a childProcess.

collect returns any results that are available in a list. The results will have the same order as the specified jobs. If there are multiple jobs and a job has a name, it will be used to name the result; otherwise its process ID will be used.

Author(s)

Simon Urbanek

See Also

mclapply, sendMaster

Examples

p <- parallel(1:10)
q <- parallel(1:20)
collect(list(p, q))  # wait for jobs to finish and collect all results

p <- parallel(1:10)
collect(p, wait=FALSE, 10)  # will retrieve the result (since it's fast)
collect(p, wait=FALSE)      # will signal the job as terminating
collect(p, wait=FALSE)      # there is no such job

# a naive parallelized lapply can be created using parallel alone:
jobs <- lapply(1:10, function(x) parallel(rnorm(x), name=x))
collect(jobs)

process Function to query objects of the class process

Description

processID returns the process IDs for the given processes. It raises an error if process is not an object of the class process or a list of such objects.

The print method shows the process ID and its class name.

Usage

processID(process)
## S3 method for class 'process'
print(x, ...)

Arguments

process  process (object of the class process) or a list of such objects.
x        process to print.
...      ignored.

Value

processID returns an integer vector containing the process IDs.

print returns NULL invisibly.

Author(s)

Simon Urbanek

See Also

fork

sendMaster Sends data from the child to the master process

Description

sendMaster sends data from the child to the master process.

Usage

sendMaster(what)

Arguments

what  data to send to the master process. If what is not a raw vector, it will be serialized into a raw vector. Do NOT send an empty raw vector; it is reserved for internal use.

Details

Any child process (created by fork directly or by parallel indirectly) can send data to the parent (master) process. Usually this is used to deliver results from the parallel child processes to the master process.

Value

Returns TRUE.

Author(s)

Simon Urbanek

See Also

parallel, fork

signals Signal constants (subset)

Description

SIGALRM  alarm clock
SIGCHLD  to parent on child stop or exit
SIGHUP   hangup
SIGINFO  information request
SIGINT   interrupt
SIGKILL  kill (cannot be caught or ignored)
SIGQUIT  quit
SIGSTOP  sendable stop signal not from tty
SIGTERM  software termination signal from kill
SIGUSR1  user-defined signal 1
SIGUSR2  user-defined signal 2

Details

See man signal in your OS for details. The above codes can be used in conjunction with the kill function to send signals to processes.

Author(s)

Simon Urbanek

See Also

kill