Foreach/Iterators User's Guide

April 20, 2015

Copyright ©2015 Revolution Analytics. All rights reserved. Revolution R, Revolution R Enterprise, RPE, RevoScaleR, DeployR, NetWorkSpaces, NWS, ParallelR, and Revolution Analytics are trademarks of Revolution Analytics. All other trademarks are the property of their respective owners.

Contents

1 Parallelizing Loops
  1.1 Using foreach
  1.2 Parallel Backends
    1.2.1 Using the doParallel parallel backend
    1.2.2 Getting information about the parallel backend
  1.3 Nesting Calls to foreach
  1.4 Using Iterators
    1.4.1 Some Special Iterators
    1.4.2 Writing Iterators

A Function and Class Reference
  A.1 iterators package: iapply, icount, idiv, iread.table, ireadLines, irnorm, isplit, iter, iterators-package, nextElem
  A.2 foreach package: foreach, foreach-ext, foreach-package, getDoParWorkers, setDoPar, registerDoSEQ


  A.3 doParallel package: doParallel-package, registerDoParallel
  A.4 doMC package: doMC-package, registerDoMC
  A.5 multicore package: children, fork, mclapply, multicore, parallel, process, sendMaster, signals

Chapter 1

Parallelizing Loops

One common approach to parallelization is to see if the iterations within a loop can be performed independently, and if so, then try to run the iterations concurrently rather than sequentially. The foreach and iterators packages can help you do this loop parallelization quickly and easily.

1.1 Using foreach

The foreach package is a set of tools that allows you to run virtually anything that can be expressed as a for loop as a set of parallel tasks. One application of this is to allow multiple simulations to run in parallel. As a simple example, consider the case of simulating 10000 coin flips, which can be done by sampling with replacement from the vector c("H", "T"). To run this simulation 10 times sequentially, use foreach with the %do% operator:

> library(foreach)
> foreach(i=1:10) %do% sample(c("H", "T"), 10000, replace=TRUE)

Comparing the foreach output with that of a similar for loop shows one obvious difference: foreach returns a list containing the value returned by each computation. A for loop, by contrast, returns only the value of its last computation, and relies on user-defined side effects to do its work. We can parallelize the operation immediately by replacing %do% with %dopar%:

> foreach(i=1:10) %dopar% sample(c("H", "T"), 10000, replace=TRUE)

However, if we run this example, we see the following warning:

Warning message:
executing %dopar% sequentially: no parallel backend registered

To actually run in parallel, we need to have a "parallel backend" for foreach. Parallel backends are discussed in the next section.
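For comparison, here is a minimal for-loop version of the same simulation (a sketch; the result list r is a name chosen here for illustration). Note how it must allocate a container and fill it as a side effect:

> # for-loop equivalent: results are collected via side effects
> r <- vector("list", 10)
> for (i in 1:10) r[[i]] <- sample(c("H", "T"), 10000, replace=TRUE)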

1.2 Parallel Backends

In order for loops coded with foreach to run in parallel, you must register a parallel backend to manage the execution of the loop. Any type of mechanism for running code in parallel could potentially have a parallel backend written for it. Currently, Revolution R Enterprise includes the doParallel backend; this uses the parallel package of R 2.14.0 or later to run jobs in parallel, using either of the component parallelization methods incorporated into the parallel package: SNOW-like functionality using socket connections or multicore-like functionality using forking (on Linux only). The doParallel package is a parallel backend for foreach that is intended for parallel processing on a single computer with multiple cores or processors. Additional parallel backends are available from CRAN:

• doMPI for use with the Rmpi package

• doRedis for use with the rredis package

• doMC provides access to the multicore functionality of the parallel package

• doSNOW for use with the now superseded SNOW package.

To use a parallel backend, you must first register it. Once a parallel backend is registered, calls to %dopar% run in parallel using the mechanisms provided by the parallel backend. However, the details of registering the parallel backends differ, so we consider them separately.

1.2.1 Using the doParallel parallel backend

The parallel package of R 2.14.0 and later combines elements of snow and multicore; doParallel similarly combines elements of both doSNOW and doMC. You can register doParallel with a cluster, as with doSNOW, or with a number of cores, as with doMC. For example, here we create a cluster and register it:

> library(doParallel)
> cl <- makeCluster(4)
> registerDoParallel(cl)

Once you’ve registered the parallel backend, you’re ready to run foreach code in parallel. For example, to see how long it takes to run 10,000 bootstrap iterations in parallel on all available cores, you can run the following code:

> x <- iris[which(iris[,5] != "setosa"), c(1,5)]
> trials <- 10000
> ptime <- system.time({
+   r <- foreach(icount(trials), .combine = cbind) %dopar% {
+     ind <- sample(100, 100, replace = TRUE)
+     result1 <- glm(x[ind, 2] ~ x[ind, 1], family=binomial(logit))
+     coefficients(result1)
+   }
+ })[3]
> ptime
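For a sequential baseline to compare against, you can rerun the same loop with %do% in place of %dopar%. When you are finished with the cluster created above, it should be shut down (a usage note; stopCluster is from the parallel package, which doParallel loads):

> stopCluster(cl)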

1.2.2 Getting information about the parallel backend

To find out how many workers foreach is going to use, you can use the getDoParWorkers function:

> getDoParWorkers()

This is a useful sanity check that you're actually running in parallel. If you haven't registered a parallel backend, or if your machine has only one core, getDoParWorkers will return 1. In either case, don't expect a speed improvement. The getDoParWorkers function is also useful when you want the number of tasks to be equal to the number of workers. You may want to pass this value to an iterator constructor, for example. You can also get the name and version of the currently registered backend:

> getDoParName()
> getDoParVersion()

1.3 Nesting Calls to foreach

An important feature of foreach is the nesting operator %:%. Like the %do% and %dopar% operators, it is a binary operator, but it operates on two foreach objects. It also returns a foreach object, which is essentially a special merger of its operands.

Let's say that we want to perform a Monte Carlo simulation using a function called sim. The sim function takes two arguments, and we want to call it with all combinations of the values that are stored in the vectors avec and bvec. The following doubly-nested for loop does that. For testing purposes, the sim function is defined to return 10a + b (although an operation this trivial is not worth executing in parallel):

sim <- function(a, b) 10 * a + b
avec <- 1:2
bvec <- 1:4

x <- matrix(0, length(avec), length(bvec))
for (j in 1:length(bvec)) {
  for (i in 1:length(avec)) {
    x[i,j] <- sim(avec[i], bvec[j])
  }
}
x

In this case, it makes sense to store the results in a matrix, so we create one of the proper size called x, and assign the return value of sim to the appropriate element of x each time through the inner loop.

When using foreach, we don't create a matrix and assign values into it. Instead, the inner loop returns the columns of the result matrix as vectors, which are combined in the outer loop into a matrix. Here's how to do that using the %:% operator:

x <- foreach(b=bvec, .combine='cbind') %:%
  foreach(a=avec, .combine='c') %do% {
    sim(a, b)
  }
x

This is structured very much like the nested for loop. The outer foreach iterates over the values in bvec, passing them to the inner foreach, which iterates over the values in avec for each value of bvec. Thus, the sim function is called in the same way in both cases. The code is slightly cleaner in this version, and has the advantage of being easily parallelized.

When parallelizing nested for loops, there is always a question of which loop to parallelize. The standard advice is to parallelize the outer loop. This results in larger individual tasks, and larger tasks can often be performed more efficiently than smaller tasks. However, if the outer loop doesn't have many iterations and the tasks are already large, parallelizing the outer loop results in a small number of huge tasks, which may not allow you to use all of your processors, and can also result in load balancing problems. You could parallelize an inner loop instead, but that could be inefficient because you're repeatedly waiting for all the results to be returned every time through the outer loop. And if the tasks and number of iterations vary in size, then it's really hard to know which loop to parallelize.

But in our Monte Carlo example, all of the tasks are completely independent of each other, and so they can all be executed in parallel. You really want to think of the loops as specifying a single stream of tasks. You just need to be careful to process all of the results correctly, depending on which iteration of the inner loop they came from.

That is exactly what the %:% operator does: it turns multiple foreach loops into a single loop. That is why there is only one %do% operator in the example above. And when we parallelize that nested foreach loop by changing the %do% into a %dopar%, we are creating a single stream of tasks that can all be executed in parallel:

x <- foreach(b=bvec, .combine='cbind') %:%
  foreach(a=avec, .combine='c') %dopar% {
    sim(a, b)
  }
x

Of course, we’ll actually only run as many tasks in parallel as we have processors, but the parallel backend takes care of all that. The point is that the %:% operator makes it easy to specify the stream of tasks to be executed, and the .combine argument to foreach allows us to specify how the results should be processed. The backend handles executing the tasks in parallel. For more on nested foreach calls, see the vignette “Nesting foreach Loops” in the foreach package.
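The nesting operator can also be combined with the when function to filter the tasks that get executed. A minimal sketch (the filtering rule here, keeping only even values of bvec, is invented for illustration):

# only even values of b generate tasks; odd values are skipped
evens <- foreach(b=bvec, .combine='c') %:% when(b %% 2 == 0) %do% b
evens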

1.4 Using Iterators

An iterator is a special type of object that generalizes the notion of a looping variable. When passed as an argument to a function that knows what to do with it, the iterator supplies a sequence of values. The iterator also maintains information about its state, in particular its current index. The iterators package includes a number of functions for creating iterators, the simplest of which is iter, which takes virtually any R object and turns it into an iterator object. The simplest function that operates on iterators is the nextElem function, which when given an iterator, returns the next value of the iterator. For example, here we create an iterator object from the sequence 1 to 10, and then use nextElem to iterate through the values:

> i1 <- iter(1:10)
> nextElem(i1)
[1] 1
> nextElem(i1)
[1] 2

You can create iterators from matrices and data frames, using the by argument to specify whether to iterate by row or column:

> istate <- iter(state.x77, by='row')
> nextElem(istate)
        Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708

> nextElem(istate)
       Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432

Iterators can also be created from functions, in which case the iterator can be an endless source of values:

> ifun <- iter(function() sample(0:9, 4, replace=TRUE))
> nextElem(ifun)
[1] 9 5 2 8
> nextElem(ifun)
[1] 3 4 2 2

For practical applications, iterators can be paired with foreach to obtain parallel results quite easily:

> x <- matrix(rnorm(1000000), ncol=1000)
> itx <- iter(x, by='row')
> foreach(i=itx, .combine=c) %dopar% mean(i)

1.4.1 Some Special Iterators

The notion of an iterator is new to R, but should be familiar to users of languages such as Python. The iterators package includes a number of special functions that generate iterators for some common scenarios. For example, the irnorm function creates an iterator for which each value is drawn from a specified random normal distribution:

> library(iterators)
> itrn <- irnorm(1, count=10)
> nextElem(itrn)
[1] 0.6300053
> nextElem(itrn)
[1] 1.242886

Similarly, the irunif, irbinom, and irpois functions create iterators which draw their values from uniform, binomial, and Poisson distributions, respectively. (These functions use the standard R distribution functions to generate random numbers, and these are not necessarily useful in a distributed or parallel environment. When using random numbers with foreach, we recommend using the doRNG package to ensure independent random number streams on each worker.) We can then use these functions just as we used irnorm:

> itru <- irunif(1, count=10)
> nextElem(itru)
[1] 0.4960539
> nextElem(itru)
[1] 0.4071111
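As a sketch of the doRNG recommendation above (assuming the doRNG package is installed and a parallel backend is registered; registerDoRNG is part of that package, and it makes subsequent %dopar% loops use reproducible, independent random number streams):

> library(doRNG)
> registerDoRNG(1234)
> foreach(i=1:3, .combine=c) %dopar% runif(1)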

The icount function returns an iterator that counts starting from one:

> it <- icount(3)
> nextElem(it)
[1] 1
> nextElem(it)
[1] 2
> nextElem(it)
[1] 3

1.4.2 Writing Iterators

There will be times when you need an iterator that isn't provided by the iterators package. That is when you need to write your own custom iterator. Basically, an iterator is an S3 object whose base class is "iter", and which has iter and nextElem methods.

The purpose of the iter method is to return an iterator for the specified object. For iterators, that usually just means returning itself, which seems odd at first. But the iter method can be defined for other objects that don't define a nextElem method. We call those objects iterables, meaning that you can iterate over them. The iterators package defines iter methods for vectors, lists, matrices, and data frames, making those objects iterables. By defining an iter method for iterators, they can be used in the same context as an iterable, which can be convenient.

For example, the foreach function takes iterables as arguments. It calls the iter method on those arguments in order to create iterators for them. By defining the iter method for all iterators, we can pass iterators to foreach that we created using any method we choose. Thus, we can pass vectors, lists, or iterators to foreach, and they are all processed by foreach in exactly the same way.

The iterators package comes with an iter method defined for the "iter" class that simply returns itself. That is usually all that is needed for an iterator. However, if you want to create an iterator for some existing class, you can do that by writing an iter method that returns an appropriate iterator. That will allow you to pass an instance of your class to foreach, which will automatically convert it into an iterator. The alternative is to write your own function that takes arbitrary arguments, and returns an iterator. You can choose whichever method is most natural.

The most important method required for iterators is nextElem. This simply returns the next value, or throws an error. Calling the stop function with the string "StopIteration" indicates that there are no more values available in the iterator.

In most cases, you don't actually need to write the iter and nextElem methods; you can inherit them. By inheriting from the class abstractiter, you can use the following methods as the basis of your own iterators:

> iterators:::iter.iter
function (obj, ...)
{
    obj
}
> iterators:::nextElem.abstractiter
function (obj, ...)
{
    obj$nextElem()
}

The following function creates a simple iterator that uses these two methods:

iforever <- function(x) {
  nextEl <- function() x
  obj <- list(nextElem=nextEl)
  class(obj) <- c('iforever', 'abstractiter', 'iter')
  obj
}

Note that we called the internal function nextEl rather than nextElem to avoid masking the standard nextElem generic function. Masking the generic would cause problems when you want your iterator to call the nextElem method of another iterator, which can be quite useful. We create an instance of this iterator by calling the iforever function, and then use it by calling the nextElem method on the resulting object:

it <- iforever(42)
nextElem(it)
nextElem(it)
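Because foreach stops as soon as its shortest iterator is exhausted, an endless iterator such as this one is safe to use when paired with a finite one. A small sketch:

# the finite vector 1:3 bounds the loop; x yields 42 on every iteration
foreach(i=1:3, x=iforever(42), .combine='c') %do% (i + x)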

Notice that it doesn't make sense to implement this iterator by defining a new iter method, since there is no natural iterable on which to dispatch. The only argument that we need is the object for the iterator to return, which can be of any type. Instead, we implement this iterator by defining a normal function that returns the iterator.

This iterator is quite simple to implement, and possibly even useful. Be careful, however, how you use this iterator. If you pass it to foreach, it will result in an infinite loop unless you pair it with a finite iterator. And never pass this iterator to as.list without the n argument.

The iterator returned by iforever is a list that has a single element named nextElem, whose value is a function that returns the value of x. Because we are subclassing abstractiter, we inherit a nextElem method that will call this function, and because we are subclassing iter, we inherit an iter method that will return itself.

Of course, the reason this iterator is so simple is that it doesn't contain any state. Most iterators need to contain some state, or it would be difficult to make them return different values and eventually stop. Managing the state is usually the real trick to writing iterators.

As an example of writing a stateful iterator, let's modify the previous iterator to put a limit on the number of values that it returns. I'll call the new function irep, and give it another argument called times:

irep <- function(x, times) {
  nextEl <- function() {
    if (times > 0) {
      times <<- times - 1
    } else {
      stop('StopIteration')
    }
    x
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c('irep', 'abstractiter', 'iter')
  obj
}

Now let's try it out:

it <- irep(7, 6)
unlist(as.list(it))

The real difference between iforever and irep is in the function that gets called by the nextElem method. This function not only accesses the values of the variables x and times, but it also modifies the value of times. This is accomplished by means of the <<- operator, and the magic of lexical scoping. Technically, this kind of function is called a closure, and is a somewhat advanced feature of R. The important thing to remember is that nextEl is able to get the value of variables that were passed as arguments to irep, and it can modify those values using the <<- operator. These are not global variables: they are defined in the enclosing environment of the nextEl function. You can create as many iterators as you want using the irep function, and they will all work as expected without conflicts.

Note that this iterator only uses the arguments to irep to store its state. If any other state variables are needed, they can be defined anywhere inside the irep function.

More examples of writing iterators can be found in the vignette "Writing Custom Iterators" in the iterators package.
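To illustrate the claim that separate iterators don't conflict, here is a small sketch with two independent irep instances, each holding its own times counter in its own enclosing environment:

it1 <- irep('a', 2)
it2 <- irep('b', 3)
unlist(as.list(it1))  # "a" "a"
unlist(as.list(it2))  # "b" "b" "b"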

Appendix A

Function and Class Reference

A.1 iterators package

iapply Array/Apply Iterator

Description

Returns an iterator over an array, which iterates over the array in much the same manner as the apply function.

Usage

iapply(X, MARGIN)

Arguments

X       the array to iterate over.
MARGIN  a vector of subscripts. 1 indicates the first dimension (rows), 2 indicates the second dimension (columns), etc.

Value

The apply iterator.

See Also

apply

Examples

a <- array(1:8, c(2, 2, 2))

# iterate over all the matrices
it <- iapply(a, 3)
as.list(it)

# iterate over all the columns of all the matrices
it <- iapply(a, c(2, 3))
as.list(it)

# iterate over all the rows of all the matrices
it <- iapply(a, c(1, 3))
as.list(it)

icount Counting Iterators

Description

Returns an iterator that counts starting from one.

Usage

icount(count)
icountn(vn)

Arguments

count  number of times that the iterator will fire. If not specified, it will count forever.
vn     vector of counts.

Value

The counting iterator.

Examples

# create an iterator that counts from 1 to 3.
it <- icount(3)
nextElem(it)
nextElem(it)
nextElem(it)
try(nextElem(it))  # expect a StopIteration exception
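The icountn form iterates over all combinations of the counts given in vn. A minimal sketch (the interpretation of the output as one vector per combination is the author's understanding of the function):

it <- icountn(c(2, 2))
as.list(it)  # four vectors, covering every combination of 1:2 and 1:2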

idiv Dividing Iterator

Description

Returns an iterator that returns pieces of a numeric value.

Usage

idiv(n, ..., chunks, chunkSize)

Arguments

n          the numeric value to divide into pieces.
...        unused.
chunks     the number of pieces that n should be divided into. This is useful when you know the number of pieces that you want. If specified, then chunkSize should not be.
chunkSize  the maximum size of the pieces that n should be divided into. This is useful when you know the size of the pieces that you want. If specified, then chunks should not be.

Value

The dividing iterator.

Examples

# divide the value 10 into 3 pieces
it <- idiv(10, chunks=3)
nextElem(it)
nextElem(it)
nextElem(it)
try(nextElem(it))  # expect a StopIteration exception

# divide the value 10 into pieces no larger than 3
it <- idiv(10, chunkSize=3)
nextElem(it)
nextElem(it)
nextElem(it)
nextElem(it)
try(nextElem(it))  # expect a StopIteration exception
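A common use of idiv is to split a large job into one chunk of work per worker. A hedged sketch, assuming a parallel backend has been registered:

# generate 1000 random numbers, one chunk per worker
foreach(n=idiv(1000, chunks=getDoParWorkers()), .combine=c) %dopar% rnorm(n)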

iread.table Iterator over Rows of a Data Frame Stored in a File

Description

Returns an iterator over the rows of a data frame stored in a file in table format. It is a wrapper around the standard read.table function.

Usage

iread.table(file, ..., verbose=FALSE)

Arguments

file     the name of the file to read the data from.
...      all additional arguments are passed on to the read.table function. See the documentation for read.table for more information.
verbose  logical value indicating whether or not to print the calls to read.table.

Value

The file reading iterator.

Note

In this version of iread.table, both of the read.table arguments header and row.names must be specified. This is because the default values of these arguments depend on the contents of the beginning of the file. In order to make the subsequent calls to read.table work consistently, the user must specify those arguments explicitly. A future version of iread.table may remove this requirement.

See Also

read.table
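The help page above gives no example, so here is a minimal sketch (the temporary file and the data frame are invented for illustration; note that header and row.names must be supplied, per the Note):

# write a small data frame to a file, then iterate over its rows
df <- data.frame(x=1:3, y=c('a', 'b', 'c'))
tf <- tempfile()
write.table(df, tf)
it <- iread.table(tf, header=TRUE, row.names=1)
nextElem(it)
nextElem(it)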

ireadLines Iterator over Lines of Text from a Connection

Description

Returns an iterator over the lines of text from a connection. It is a wrapper around the standard readLines function.

Usage

ireadLines(con, n=1, ...)

Arguments

con  a connection object or a character string.
n    integer. The maximum number of lines to read. Negative values indicate that one should read up to the end of the connection. The default value is 1.
...  passed on to the readLines function.

Value

The line reading iterator.

See Also

readLines

Examples

# create an iterator over the lines of COPYING
it <- ireadLines(file.path(R.home(), 'COPYING'))
nextElem(it)
nextElem(it)
nextElem(it)

irnorm Random Number Iterators

Description

These functions return iterators that return random numbers from various distributions. Each one is a wrapper around a standard R function.

Usage

irnorm(..., count)
irunif(..., count)
irbinom(..., count)
irnbinom(..., count)
irpois(..., count)

Arguments

count  number of times that the iterator will fire. If not specified, it will fire values forever.
...    arguments to pass to the underlying random number function (rnorm, runif, etc.).

Value

An iterator that is a wrapper around the corresponding random number generator function.

Examples

# create an iterator that returns three random numbers
it <- irnorm(1, count=3)
nextElem(it)
nextElem(it)
nextElem(it)
try(nextElem(it))  # expect a StopIteration exception

isplit Split Iterator

Description

Returns an iterator that divides the data in the vector x into the groups defined by f.

Usage

isplit(x, f, drop=FALSE, ...)

Arguments

x     vector or data frame of values to be split into groups.
f     a factor or list of factors used to categorize x.
drop  logical indicating if levels that do not occur should be dropped.
...   currently ignored.

Value

The split iterator.

See Also

split

Examples

x <- rnorm(200)
f <- factor(sample(1:10, length(x), replace=TRUE))

it <- isplit(x, f)
expected <- split(x, f)

for (i in expected) {
  actual <- nextElem(it)
  stopifnot(actual$value == i)
}

iter Iterator Factory Functions

Description

iter is a generic function used to create iterator objects.

Usage

iter(obj, ...)

## Default S3 method:
iter(obj, checkFunc=function(...) TRUE, recycle=FALSE, ...)
## S3 method for class 'iter'
iter(obj, ...)
## S3 method for class 'matrix'
iter(obj, by=c('column', 'cell', 'row'), chunksize=1L,
     checkFunc=function(...) TRUE, recycle=FALSE, ...)
## S3 method for class 'data.frame'
iter(obj, by=c('column', 'row'), checkFunc=function(...) TRUE,
     recycle=FALSE, ...)
## S3 method for class 'function'
iter(obj, checkFunc=function(...) TRUE, recycle=FALSE, ...)

Arguments

obj        an object from which to generate an iterator.
by         how to iterate.
chunksize  the number of elements of by to return with each call to nextElem.
checkFunc  a function which, when passed an iterator value, returns TRUE or FALSE. If FALSE, the value is skipped in the iteration.
recycle    a boolean describing whether the iterator should reset after running through all its values.
...        additional arguments affecting the iterator.

Value

The iterator.

Examples

# a vector iterator
i1 <- iter(1:3)
nextElem(i1)
nextElem(i1)
nextElem(i1)

# a vector iterator with a checkFunc
i1 <- iter(1:3, checkFunc=function(i) i %% 2 == 0)
nextElem(i1)

# a data frame iterator by column
i2 <- iter(data.frame(x=1:3, y=10, z=c('a', 'b', 'c')))
nextElem(i2)
nextElem(i2)
nextElem(i2)

# a data frame iterator by row
i3 <- iter(data.frame(x=1:3, y=10), by='row')
nextElem(i3)
nextElem(i3)
nextElem(i3)

# a function iterator
i4 <- iter(function() rnorm(1))
nextElem(i4)
nextElem(i4)
nextElem(i4)

iterators-package The Iterators Package

Description

The iterators package provides tools for iterating over various R data structures. Iterators are available for vectors, lists, matrices, data frames, and files. By following very simple conventions, new iterators can be written to support any type of data source, such as database queries or dynamically generated data.

Details

Further information is available in the following help topics:

iter        Generic function used to create iterator objects.
nextElem    Generic function used to get the next element of an iterator.
icount      A function used to create a counting iterator.
idiv        A function used to create a number dividing iterator.
ireadLines  A function used to create a file reading iterator.

For a complete list of functions with individual help pages, use library(help="iterators").

nextElem Get Next Element of Iterator

Description

nextElem is a generic function used to produce values. If a checkFunc was specified to the constructor, the potential iterated values will be passed to the checkFunc until the checkFunc returns TRUE. When the iterator has no more values, it calls stop with the message 'StopIteration'.

Usage

nextElem(obj, ...)

## S3 method for class 'containeriter'
nextElem(obj, ...)
## S3 method for class 'funiter'
nextElem(obj, ...)

Arguments

obj  an iterator object.
...  additional arguments that are ignored.

Value

The value.

Examples

it <- iter(c('a', 'b', 'c'))
print(nextElem(it))
print(nextElem(it))
print(nextElem(it))

A.2 foreach package

foreach foreach

Description

%do% and %dopar% are binary operators that operate on a foreach object and an R expression. The expression, ex, is evaluated multiple times in an environment that is created by the foreach object, and that environment is modified for each evaluation as specified by the foreach object. %do% evaluates the expression sequentially, while %dopar% evaluates it in parallel. The results of evaluating ex are returned as a list by default, but this can be modified by means of the .combine argument.

Usage

foreach(..., .combine, .init, .final=NULL, .inorder=TRUE,
        .multicombine=FALSE,
        .maxcombine=if (.multicombine) 100 else 2,
        .errorhandling=c('stop', 'remove', 'pass'),
        .packages=NULL, .export=NULL, .noexport=NULL,
        .verbose=FALSE)
when(cond)
e1 %:% e2
obj %do% ex
obj %dopar% ex
times(n)

Arguments

...             one or more arguments that control how ex is evaluated. Named arguments specify the names and values of variables to be defined in the evaluation environment. An unnamed argument can be used to specify the number of times that ex should be evaluated. At least one argument must be specified in order to define the number of times ex should be executed.
.combine        function that is used to process the task results as they are generated. This can be specified as either a function or a non-empty character string naming the function. Specifying 'c' is useful for concatenating the results into a vector, for example. The values 'cbind' and 'rbind' can combine vectors into a matrix. The values '+' and '*' can be used to process numeric data. By default, the results are returned in a list.
.init           initial value to pass as the first argument of the .combine function. This should not be specified unless .combine is also specified.
.final          function of one argument that is called to return the final result.
.inorder        logical flag indicating whether the .combine function requires the task results to be combined in the same order that they were submitted. If the order is not important, then setting .inorder to FALSE can give improved performance. The default value is TRUE.
.multicombine   logical flag indicating whether the .combine function can accept more than two arguments. If an arbitrary .combine function is specified, by default, that function will always be called with two arguments. If it can take more than two arguments, then setting .multicombine to TRUE could improve the performance. The default value is FALSE unless the .combine function is cbind, rbind, or c, which are known to take more than two arguments.
.maxcombine     maximum number of arguments to pass to the combine function. This is only relevant if .multicombine is TRUE.
.errorhandling  specifies how a task evaluation error should be handled. If the value is "stop", then execution will be stopped via the stop function if an error occurs. If the value is "remove", the result for that task will not be returned, or passed to the .combine function. If it is "pass", then the error object generated by task evaluation will be included with the rest of the results. It is assumed that the combine function (if specified) will be able to deal with the error object. The default value is "stop".
.packages       character vector of packages that the tasks depend on. If ex requires an R package to be loaded, this option can be used to load that package on each of the workers. Ignored when used with %do%.
.export         character vector of variables to export. This can be useful when accessing a variable that isn't defined in the current environment. The default value is NULL.
.noexport       character vector of variables to exclude from exporting. This can be useful to prevent variables from being exported that aren't actually needed, perhaps because the symbol is used in a model formula. The default value is NULL.
.verbose        logical flag enabling verbose messages. This can be very useful for troubleshooting.
obj             foreach object used to control the evaluation of ex.
e1              foreach object to merge.
e2              foreach object to merge.
ex              the R expression to evaluate.
cond            condition to evaluate.
n               number of times to evaluate the R expression.

Details

The foreach and %do%/%dopar% operators provide a looping construct that can be viewed as a hybrid of the standard for loop and lapply function. It looks similar to the for loop, and it evaluates an expression, rather than a function (as in lapply), but its purpose is to return a value (a list, by default), rather than to cause side effects. This facilitates parallelization, but looks more natural to people who prefer for loops to lapply.

The %:% operator is the nesting operator, used for creating nested foreach loops. Type vignette("nested") at the R prompt for more details.

Parallel computation depends upon a parallel backend that must be registered before performing the computation. The parallel backends available will be system-specific, but include doParallel, which uses R's built-in parallel package, doMC, which uses the multicore package, and doSNOW. Each parallel backend has a specific registration function, such as registerDoParallel or registerDoSNOW.

The times function is a simple convenience function that calls foreach. It is useful for evaluating an R expression multiple times when there are no varying arguments. This can be convenient for resampling, for example.

See Also

iter

Examples

# equivalent to rnorm(3)
times(3) %do% rnorm(1)

# equivalent to lapply(1:3, sqrt)
foreach(i=1:3) %do% sqrt(i)

# equivalent to colMeans(m)
m <- matrix(rnorm(9), 3, 3)
foreach(i=1:ncol(m), .combine=c) %do% mean(m[,i])

# normalize the rows of a matrix in parallel, with parentheses used to
# force proper operator precedence
# Need to register a parallel backend before this example will run
# in parallel
foreach(i=1:nrow(m), .combine=rbind) %dopar% (m[i,] / mean(m[i,]))

# simple (and inefficient) parallel matrix multiply
library(iterators)
a <- matrix(1:16, 4, 4)
b <- t(a)
foreach(b=iter(b, by='col'), .combine=cbind) %dopar% (a %*% b)

# split a data frame by row, and put them back together again without
# changing anything
d <- data.frame(x=1:10, y=rnorm(10))
s <- foreach(d=iter(d, by='row'), .combine=rbind) %dopar% d
identical(s, d)

# a quick sort function
qsort <- function(x) {
  n <- length(x)
  if (n == 0) {
    x
  } else {
    p <- sample(n, 1)
    smaller <- foreach(y=x[-p], .combine=c) %:% when(y <= x[p]) %do% y
    larger  <- foreach(y=x[-p], .combine=c) %:% when(y >  x[p]) %do% y
    c(qsort(smaller), x[p], qsort(larger))
  }
}
qsort(runif(12))

foreach-ext Foreach Extension Functions

Description

These functions are used to write parallel backends for the foreach package. They should not be used from normal scripts or packages that use the foreach package.

Usage

makeAccum(it)
accumulate(obj, result, tag, ...)
getexports(ex, e, env, good=character(0), bad=character(0))
getResult(obj, ...)
getErrorValue(obj, ...)
getErrorIndex(obj, ...)

Arguments

it      foreach iterator.
ex      call object to analyze.
e       local environment of the call object.
env     exported environment in which the call object will be evaluated.
good    names of symbols that are being exported.
bad     names of symbols that are not being exported.
obj     foreach iterator object.
result  task result to accumulate.
tag     tag of task result to accumulate.
...     unused.

Note

These functions are likely to change in future versions of the foreach package. When they become more stable, they will be documented.

foreach-package The Foreach Package

Description

The foreach package provides a new looping construct for executing R code repeatedly. The main reason for using the foreach package is that it supports parallel execution. The foreach package can be used with a variety of different parallel computing systems, including NetWorkSpaces and snow. In addition, foreach can be used with iterators, which allows the data to be specified in a very flexible way.

Details

Further information is available in the following help topics:

foreach  Specify the variables to iterate over
%do%     Execute the R expression sequentially
%dopar%  Execute the R expression using the currently registered backend

To see a tutorial introduction to the foreach package, use vignette("foreach"). To see a demo of foreach computing the sinc function, use demo(sincSEQ).

Some examples (in addition to those in the help pages) are included in the "examples" directory of the foreach package. To list the files in the examples directory, use list.files(system.file("examples", package="foreach")). To run the bootstrap example, use source(system.file("examples", "bootseq.R", package="foreach")).

For a complete list of functions with individual help pages, use library(help="foreach").

getDoParWorkers Functions Providing Information on the doPar Backend

Description

The getDoParWorkers function returns the number of execution workers there are in the currently registered doPar backend. It can be useful when determining how to split up the work to be executed in parallel. A 1 is returned by default.

The getDoParRegistered function returns TRUE if a doPar backend has been registered, otherwise FALSE.

The getDoParName function returns the name of the currently registered doPar backend. A NULL is returned if no backend is registered.

The getDoParVersion function returns the version of the currently registered doPar backend. A NULL is returned if no backend is registered.

Usage

getDoParWorkers()
getDoParRegistered()
getDoParName()
getDoParVersion()

Examples

cat(sprintf('%s backend is registered\n',
            if (getDoParRegistered()) 'A' else 'No'))
cat(sprintf('Running with %d worker(s)\n', getDoParWorkers()))
(name <- getDoParName())
(ver <- getDoParVersion())
if (getDoParRegistered())
  cat(sprintf('Currently using %s [%s]\n', name, ver))

setDoPar setDoPar

Description

The setDoPar function is used to register a parallel backend with the foreach package. This isn't normally executed by the user. Instead, packages that provide a parallel backend provide a function named registerDoPar that calls setDoPar using the appropriate arguments.

Usage

setDoPar(fun, data=NULL, info=function(data, item) NULL)

Arguments

fun   A function that implements the functionality of %dopar%.
data  Data to be passed to the registered function.
info  Function that retrieves information about the backend.

See Also

%dopar%

registerDoSEQ registerDoSEQ

Description

The registerDoSEQ function is used to explicitly register a sequential parallel backend with the foreach package. This will prevent a warning message from being issued if the %dopar% function is called and no parallel backend has been registered.

Usage

registerDoSEQ()

See Also

registerDoSNOW

Examples

# specify that %dopar% should run sequentially
registerDoSEQ()

A.3 doParallel package

doParallel-package The doParallel Package

Description

The doParallel package provides a parallel backend for the foreach/%dopar% function using the parallel package of R 2.14.0 and later.

Details

Further information is available in the following help topics:

registerDoParallel  register doParallel to be used by foreach/%dopar%

To see a tutorial introduction to the doParallel package, use vignette("gettingstartedParallel"). To see a tutorial introduction to the foreach package, use vignette("foreach"). To see a demo of doParallel computing the sinc function, use demo(sincParallel).

Some examples (in addition to those in the help pages) are included in the "examples" directory of the doParallel package. To list the files in the examples directory, use list.files(system.file("examples", package="doParallel")). To run the bootstrap example, use source(system.file("examples", "bootParallel.R", package="doParallel")). This is a simple benchmark, executing both sequentially and in parallel. There are many more examples that come with the foreach package, which will work with the doParallel package if it is registered as the parallel backend.

For a complete list of functions with individual help pages, use library(help="doParallel").

registerDoParallel registerDoParallel

Description

The registerDoParallel function is used to register the parallel backend with the foreach package.

Usage

registerDoParallel(cl, cores=NULL, ...)

Arguments

cl     A cluster object as returned by makeCluster, or the number of nodes to be created in the cluster. If not specified, on Windows a three worker cluster is created and used.
cores  The number of cores to use for parallel execution. If not specified, the number of cores is set to the value of options("cores"), if specified, or to one-half the number of cores detected by the parallel package.
...    Package options. Currently, only the nocompile option is supported. If nocompile is set to TRUE, compiler support is disabled.

Details

The parallel package from R 2.14.0 and later provides functions for parallel execution of R code on machines with multiple cores or processors or multiple computers. It is essentially a blend of the snow and multicore packages. By default, the doParallel package uses snow-like functionality on Windows systems and multicore-like functionality on Unix-like systems. The snow-like functionality should work fine on Unix-like systems, but the multicore-like functionality is limited to a single sequential worker on Windows systems. On workstations with multiple cores running Unix-like operating systems, the system fork call is used to spawn copies of the current process.
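A minimal usage sketch (assuming a recent doParallel; stopImplicitCluster cleans up the cluster that registerDoParallel creates internally when given a core count rather than a cluster object):

> library(doParallel)
> registerDoParallel(cores=2)
> foreach(i=1:3, .combine=c) %dopar% sqrt(i)
> stopImplicitCluster()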

A.4 doMC package

doMC-package The doMC Package

Description

The doMC package provides a parallel backend for the foreach/%dopar% function using Simon Urbanek's multicore package.

Details

Further information is available in the following help topics:

registerDoMC  register doMC to be used by foreach/%dopar%

To see a tutorial introduction to the doMC package, use vignette("gettingstartedMC"). To see a tutorial introduction to the foreach package, use vignette("foreach"). To see a demo of doMC computing the sinc function, use demo(sincMC).

Some examples (in addition to those in the help pages) are included in the "examples" directory of the doMC package. To list the files in the examples directory, use list.files(system.file("examples", package="doMC")). To run the bootstrap example, use source(system.file("examples", "bootMC.R", package="doMC")). This is a simple benchmark, executing both sequentially and in parallel. There are many more examples that come with the foreach package, which will work with the doMC package if it is registered as the parallel backend.

For a complete list of functions with individual help pages, use library(help="doMC").

registerDoMC registerDoMC

Description

The registerDoMC function is used to register the multicore parallel backend with the foreach package.

Usage

registerDoMC(cores=NULL, ...)

Arguments

cores  The number of cores to use for parallel execution. If not specified, the number of cores is set to the value of options("cores"), if specified, or to approximately half the number of cores detected by the parallel or multicore package.
...    Package options. Currently, only the nocompile option is supported. If nocompile is set to TRUE, compiler support is disabled.

Details

The multicore package by Simon Urbanek provides functions for parallel execution of R code on machines with multiple cores or processors, using the system fork call to spawn copies of the current process. The multicore package, and therefore registerDoMC, should not be used in a GUI environment, because multiple processes then share the same GUI.
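A minimal usage sketch (Unix-alike systems and console R only, per the warning above):

> library(doMC)
> registerDoMC(cores=2)
> foreach(i=1:3, .combine=c) %dopar% sqrt(i)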

A.5 multicore package

children Functions for management of parallel children processes

Description

children returns all currently active children.
readChild reads data from a given child process.
selectChildren checks children for available data.
readChildren checks children for available data and reads from the first child that has available data.
sendChildStdin sends a string (or data) to the child's standard input.
kill sends a signal to a child process.

Usage

children()
readChild(child)
readChildren(timeout = 0)
selectChildren(children = NULL, timeout = 0)
sendChildStdin(child, what)
kill(process, signal = SIGINT)

Arguments

child     child process (object of the class childProcess) or a process ID (pid).
timeout   timeout (in seconds, fractions supported) to wait before giving up. Negative numbers mean wait indefinitely (strongly discouraged as it blocks R and may be removed in the future).
children  list of child processes, or a single child process object, or a vector of process IDs, or NULL. If NULL, behaves as if all currently known children were supplied.
what      character or raw vector. In the former case, elements are collapsed using the newline character. (But no trailing newline is added at the end!)
process   process (object of the class process) or a process ID (pid).
signal    signal to send (one of the SIG... constants (see signals) or a valid integer signal number).

Value

children returns a list of child processes (or an empty list).

readChild and readChildren return a raw vector with a "pid" attribute if data were available, an integer vector of length one with the process ID if a child terminated, or NULL if the child no longer exists (no children at all for readChildren).

selectChildren returns TRUE if the timeout was reached, FALSE if an error occurred (e.g. if the master process was interrupted), or an integer vector of process IDs with children that have data available.

sendChildStdin sends the given content to the standard input (stdin) of the child process. Note that if the master session was interactive, it will also be echoed on the standard output of the master process (unless disabled). The function is vector-compatible, so you can specify more than one child as a list or a vector of process IDs.

kill returns TRUE.

Warning

This is a very low-level API for expert use only. If you are interested in user-level parallel execution use mclapply, parallel and friends instead.

Author(s)

Simon Urbanek

See Also

fork, sendMaster, parallel, mclapply

fork Fork a copy of the current R process

Description

fork creates a new child process as a copy of the current R process.
exit closes the current child process, informing the master process as necessary.

Usage

fork()
exit(exit.code = 0L, send = NULL)

Arguments

exit.code  process exit code. Currently it is not used by multicore, but other applications might. By convention, 0 signifies a clean exit and 1 an error.
send       if not NULL, send this data before exiting (equivalent to using sendMaster).

Details

The fork function provides an interface to the fork system call. In addition it sets up a pipe between the master and child process that can be used to send data from the child process to the master (see sendMaster), and the child's stdin is re-mapped to another pipe held by the master process (see sendChildStdin).

If you are not familiar with the fork system call, do not use this function, since it leads to very complex inter-process interactions among the R processes involved.

In a nutshell, fork spawns a copy (child) of the current process that can work in parallel to the master (parent) process. At the point of forking, both processes share exactly the same state, including the workspace, global options, loaded packages, etc. Forking is relatively cheap in modern operating systems and no real copy of the used memory is created; instead, both processes share the same memory and only modified parts are copied. This makes fork an ideal tool for parallel processing, since there is no need to set up the parallel working environment: data and code are shared automatically from the start.

It is strongly discouraged to use fork in GUI or embedded environments, because it leads to several processes sharing the same GUI, which will likely cause chaos (and possibly crashes). Child processes should never use on-screen graphics devices.

Value

fork returns an object of the class childProcess (to the master) and masterProcess (to the child).

exit never returns.

Warning

This is a very low-level API for expert use only. If you are interested in user-level parallel execution use mclapply, parallel and friends instead.

Note

The Windows operating system lacks the fork system call, so it cannot be used with multicore.

Author(s)

Simon Urbanek

See Also

parallel, sendMaster

Examples

p <- fork()
if (inherits(p, "masterProcess")) {
  cat("I'm a child! ", Sys.getpid(), "\n")
  exit(, "I was a child")
}
cat("I'm the master\n")
unserialize(readChildren(1.5))

mclapply Parallel version of lapply

Description

mclapply is a parallelized version of lapply. It returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

Usage

mclapply(X, FUN, ..., mc.preschedule = TRUE, mc.set.seed = TRUE,
         mc.silent = FALSE, mc.cores = getOption("cores"))

Arguments

X               a vector (atomic or list) or an expressions vector. Other objects (including classed objects) will be coerced by as.list.
FUN             the function to be applied to each element of X.
...             optional arguments to FUN.
mc.preschedule  if set to TRUE then the computation is first divided into (at most) as many jobs as there are cores and then the jobs are started, each job possibly covering more than one value. If set to FALSE then one job is spawned for each value of X sequentially (if used with mc.set.seed=FALSE then random number sequences will be identical for all values). The former is better for short computations or a large number of values in X; the latter is better for jobs that have high variance of completion time and not too many values of X.
mc.set.seed     if set to TRUE then each parallel process first sets its seed to something different from other processes. Otherwise all processes start with the same (namely current) seed.
mc.silent       if set to TRUE then all output on stdout will be suppressed for all parallel processes spawned (stderr is not affected).
mc.cores        the number of cores to use, i.e. how many processes will be spawned (at most).

Details

mclapply is a parallelized version of lapply. By default (mc.preschedule=TRUE) the input vector/list X is split into as many parts as there are cores (currently the values are spread across the cores sequentially, i.e. first value to core 1, second to core 2, ..., (core + 1)-th value to core 1, etc.) and then one process is spawned to each core and the results are collected.

Due to the parallel nature of the execution, random numbers are not sequential (in the random number sequence) as they would be in lapply. They are sequential for each spawned process, but not for all jobs as a whole.

In addition, each process runs the job inside try(..., silent=TRUE), so if errors occur they will be stored as try-error objects in the list.

Note: the number of file descriptors is usually limited by the operating system, so you may have trouble using more than 100 cores or so (see ulimit -n or similar in your OS documentation) unless you raise the limit of permissible open file descriptors (fork will fail with "unable to create a pipe").

Value

A list.

Author(s)

Simon Urbanek

See Also

parallel, collect

Examples

mclapply(1:30, rnorm)

# use the same random numbers for all values
set.seed(1)
mclapply(1:30, rnorm, mc.preschedule=FALSE, mc.set.seed=FALSE)

multicore multicore R package for parallel processing of R code

Description

multicore is an R package that provides functions for parallel execution of R code on machines with multiple cores or CPUs. Unlike other parallel processing methods, all jobs share the full state of R when spawned, so no data or code needs to be initialized. The actual spawning is very fast as well, since no new R instance needs to be started.

Pivotal functions

mclapply - parallelized version of lapply
parallel and collect - functions to evaluate R expressions in parallel and collect the results.

Low-level functions

These functions should be used only by experienced users who understand the interaction of the master (parent) process and the child processes (jobs), as well as the system-level mechanics involved.

See the fork help page for the principles of forking parallel processes and system-level functions, and the children and sendMaster help pages for management of, and communication between, the parent and child processes.

Classes

multicore defines a few informal (S3) classes:

process is a list with a named entry pid containing the process ID.

childProcess is a subclass of process representing a child process of the current R process. A child process is a special process that can send messages to the parent process. The list may contain additional entries for IPC (more precisely, file descriptors); however, those are considered internal.

masterProcess is a subclass of process representing a handle that is passed to a child process by fork.

parallelJob is a subclass of childProcess representing a child process created using the parallel function. It may (optionally) contain a name entry, a character vector of length one giving the name of the job.

Options

By default, functions that spawn jobs across cores use the "cores" option (see options) to determine how many cores (or CPUs) will be used (unless specified directly). If this option is not set, multicore by default uses as many cores as are available. The number of available cores is determined on startup using the (non-exported) detectCores() function. It should work on most commonly used unix systems (Mac OS X, Linux, Solaris and IRIX), but there is no standard way of determining the number of cores, so please contact me (with sessionInfo() output and the test) if you have tests for other platforms. If in doubt, use multicore:::detectCores(all.tests=TRUE) to see whether your platform is covered by one of the already existing tests. If multicore cannot determine the number of cores (the above returns NA), it will default to 8 (which should be fine for most modern desktop systems).

Warning

multicore uses the fork system call to spawn a copy of the current process which performs the computations in parallel. Modern operating systems use a copy-on-write approach, which makes this so appealing for parallel computation, since only objects modified during the computation will actually be copied; all other memory is directly shared.

However, the copy shares everything, including any user interface elements. This can cause havoc, since, say, one window may suddenly belong to two processes. Therefore multicore should preferably be used in console R, and code executed in parallel may never use GUIs or on-screen devices. An (experimental) way to avoid some such problems in some GUI environments (those using pipes or sockets) is to use multicore:::closeAll() in each child process immediately after it is spawned.

Author(s)

Simon Urbanek

See Also

parallel, mclapply, fork, sendMaster, children and signals

parallel Evaluate an expression asynchronously in a separate process

Description

parallel starts a parallel process which evaluates the given expression.

mcparallel is a synonym for parallel that can be used at top level if parallel is masked by other packages. It should not be used in other packages, since it's just a shortcut for importing multicore::parallel.

collect collects results from parallel processes.

Usage

parallel(expr, name, mc.set.seed = FALSE, silent = FALSE)
mcparallel(expr, name, mc.set.seed = FALSE, silent = FALSE)
collect(jobs, wait = TRUE, timeout = 0, intermediate = FALSE)

Arguments

expr          expression to evaluate (do not use any on-screen devices or GUI elements in this code).
name          an optional name (character vector of length one) that can be associated with the job.
mc.set.seed   if set to TRUE then the random number generator is seeded such that it is different from any other process. Otherwise it will be the same as in the current R session.
silent        if set to TRUE then all output on stdout will be suppressed (stderr is not affected).
jobs          list of jobs (or a single job) to collect results for. Alternatively, jobs can also be an integer vector of process IDs. If omitted, collect will wait for all currently existing children.
wait          if set to FALSE, it checks for any results that are available within timeout seconds from now; otherwise it waits for all specified jobs to finish.
timeout       timeout (in seconds) to check for job results; applies only if wait is FALSE.
intermediate  FALSE or a function which will be called while collect waits for results. The function will be called with one parameter, which is the list of results received so far.

Details

parallel evaluates the expr expression in parallel to the current R process. Everything is shared read-only (or in fact copy-on-write) between the parallel process and the current process, i.e. no side effects of the expression affect the main process. The result of the parallel execution can be collected using the collect function.

collect collects any available results from parallel jobs (or in fact any child process). If wait is TRUE then collect waits for all specified jobs to finish before returning a list containing the last reported result for each job. If wait is FALSE then collect merely checks for any results available at the moment and will not wait for jobs to finish. If jobs is specified, jobs not listed there will not be affected or acted upon.

Note: If expr uses low-level multicore functions such as sendMaster, a single job can deliver results multiple times and it is the responsibility of the user to interpret them correctly. collect will return NULL for a terminating job that has sent its results already, after which the job is no longer available.

Value

parallel returns an object of the class parallelJob, which is in turn a childProcess.

collect returns any results that are available in a list. The results will have the same order as the specified jobs. If there are multiple jobs and a job has a name, it will be used to name the result; otherwise its process ID will be used.

Author(s)

Simon Urbanek

See Also

mclapply, sendMaster

Examples

p <- parallel(1:10)
q <- parallel(1:20)
collect(list(p, q))  # wait for jobs to finish and collect all results

p <- parallel(1:10)
collect(p, wait=FALSE, 10)  # will retrieve the result (since it's fast)
collect(p, wait=FALSE)      # will signal the job as terminating
collect(p, wait=FALSE)      # there is no such job

# a naive parallelized lapply can be created using parallel alone:
jobs <- lapply(1:10, function(x) parallel(rnorm(x), name=x))
collect(jobs)

process Function to query objects of the class process

Description

processID returns the process IDs for the given processes. It raises an error if process is not an object of the class process or a list of such objects.

The print method shows the process ID and its class name.

Usage

processID(process)
## S3 method for class 'process'
print(x, ...)

Arguments

process  process (object of the class process) or a list of such objects.
x        process to print.
...      ignored.

Value

processID returns an integer vector containing the process IDs.

print returns NULL invisibly.

Author(s)

Simon Urbanek

See Also

fork

sendMaster Sends data from the child to the master process

Description

sendMaster sends data from the child to the master process.

Usage

sendMaster(what)

Arguments

what  data to send to the master process. If what is not a raw vector, it will be serialized into a raw vector. Do NOT send an empty raw vector; it is reserved for internal use.

Details

Any child process (created by fork directly or by parallel indirectly) can send data to the parent (master) process. Usually this is used to deliver results from the parallel child processes to the master process.

Value

Returns TRUE.

Author(s)

Simon Urbanek

See Also

parallel, fork

signals Signal constants (subset)

Description

SIGALRM  alarm clock
SIGCHLD  to parent on child stop or exit
SIGHUP   hangup
SIGINFO  information request
SIGINT   interrupt
SIGKILL  kill (cannot be caught or ignored)
SIGQUIT  quit
SIGSTOP  sendable stop signal not from tty
SIGTERM  software termination signal from kill
SIGUSR1  user-defined signal 1
SIGUSR2  user-defined signal 2

Details

See man signal in your OS for details. The above codes can be used in conjunction with the kill function to send signals to processes.

Author(s)

Simon Urbanek

See Also

kill