Programming with Big Data in

George Ostrouchov Oak Ridge National Laboratory and University of Tennessee

Future Trends in Nuclear Physics Computing March 16-18, 2016 Thomas Jefferson National Accelerator Facility Newport News, VA

ppppbbbbddddRRRR Core Team Programming with Big Data in R Introduction to R and HPC Why R?: Programming with Data

Chambers. Becker, Chambers, Chambers and Chambers. Chambers. Computational and Wilks. The Hastie. Statistical Programming Software for Data Methods for New S Language. Models in S. with Data. Analysis: Data Analysis. Chapman & Hall, Chapman & Hall, Springer, 1998. Programming Wiley, 1977. 1988. 1992. with R. Springer, 2008.

Thanks to Dirk Eddelbuettel for this slide idea and to John Chambers for providing the high-resolution scans of the covers of his books.

ppppbbbbddddRRRR Core Team Programming with Big Data in R Introduction to R and HPC Popularity?

IEEE Spectrum’s 2014 Ranking of Programming Languages

See: http://spectrum.ieee.org/static/interactive-the-top-programming-languages#index

ppppbbbbddddRRRR Core Team Programming with Big Data in R An Example: document produced with RStudio IDE Data: 1,653 start and end timestamps for GPU offload periods. Want 1-ahead prediction of start and end to run other codes. ts <- read.table("sorted1node.txt", header=TRUE) library(dplyr) ts <- mutate(ts, idle = end - start, busy = start - lag(end)) head(ts)

## start end idle busy ## 1 56114210457 56114211289 832 NA ## 2 56114300920 56114311544 10624 89631 ## 3 56114373143 56114373943 800 61599 ## 4 56117433146 56117436858 3712 3059203 ## 5 56117469818 56117470650 832 32960 ## 6 56117507098 56117517081 9983 36448 dim(ts)

## [1] 1653 4 library() ggplot(ts, aes(idle)) + geom_histogram()

1000

750

500 count

250

0

0 50000 100000 150000 idle ggplot(ts, aes(idle)) + geom_histogram() + scale_x_log10()

400

300

200 count

100

0

1e+03 1e+04 1e+05 idle ggplot(ts, aes(busy)) + geom_histogram() + theme_bw()

1500

1000 count

500

0

0.0e+00 5.0e+07 1.0e+08 1.5e+08 busy ggplot(ts, aes(busy)) + geom_histogram() + scale_x_log10() + theme_bw()

400

300

count 200

100

0

1e+04 1e+06 1e+08 busy ggplot(ts, aes(busy, idle, color=start)) + geom_point() + scale_x_log10() + scale_y_log10()

1e+05

start 5.85e+10 5.80e+10 5.75e+10 idle 1e+04 5.70e+10 5.65e+10

1e+03

1e+04 1e+06 1e+08 busy ggplot(ts, aes(busy, idle, color=start)) + geom_path() + geom_point() + scale_x_log10() + scale_y_log10()

1e+05

start 5.85e+10 5.80e+10 5.75e+10 idle 1e+04 5.70e+10 5.65e+10

1e+03

1e+04 1e+06 1e+08 busy Successive busy idle periods cluster around several values. Markov chain . . . probably sparse . . . Write my own?

Turns out that most of this already exists in the package rEMM (One of 7,000+ CRAN packages!) ts_log <- transmute(ts, lidle = log10(idle), lbusy = log10(busy)) library(rEMM) emm <- build(EMM(threshold = 0.2, measure="euclidean"), ts_log) plot(emm)

36 14 34

45 37 33 43 35 29 13 12 8 9 26 5 28 2 1 21 38 16 11 7 30 4 6 3 10 15 18 25 22 24 17 27 19 32 41 23 31 39 44 20 40

42

Markov transition graph after seeing all the data. Add Markov node locations on top of the log scale busy-idle space: ggplot(ts, aes(busy, idle, color=start)) + geom_path() + geom_point() + scale_x_log10() + scale_y_log10() + geom_point(aes(10^lbusy, 10^lidle), data.frame(cluster_centers(emm)), col="red")

1e+05

start 5.85e+10 5.80e+10 5.75e+10 idle 1e+04 5.70e+10 5.65e+10

1e+03

1e+04 1e+06 1e+08 busy Looks promising! pr_idle <- function(ts, threshold, lambda, measure = "euclidean") { emm <- build(EMM(threshold, measure, lambda = lambda), ts_log[1:5, ]) ts$idle_p <- ts$busy_p <-NA for(i in6: nrow(ts_log)) { pred_data <- cluster_centers(emm)[predict(emm, n=1), ] ts$idle_p[i] <- 10^(pred_data["lidle"]) ts$busy_p[i] <- 10^(pred_data["lbusy"]) emm <- build(emm, ts_log[i, ]) } ts } ts <- pr_idle(ts, threshold=0.1, lambda=0.01)

Prediction proceeds one pair of busy, idle at a time, predicting the next pair, then updating the Markov states and probabilities for actual values observed. library(reshape2) ts_melt <- melt(ts, value.name="observed", measure.vars=c("idle", "busy")) ts_melt <- mutate(ts_melt, predicted=ifelse(variable == "idle", idle_p, busy_p)) ggplot(ts_melt, aes(start, predicted/observed)) + geom_point() + scale_y_log10() + stat_quantile(quantiles=c(.75,.25), col="blue") + stat_quantile(quantiles=c(.95,.05), col="red") + facet_grid(~variable)

idle busy

1e+03

1e+00 predicted/observed

1e−03

5.6e+10 5.7e+10 5.8e+10 5.6e+10 5.7e+10 5.8e+10 start This is a lightweight algorithm in R (~2 seconds total for all predictions). It can be made more lightweight (100x ?) by implementing in C or C++ and by updating less often. Introduction to R and HPC

Resources for Learning R RStudio IDE http://www.rstudio.com/products/rstudio-desktop/ Task Views: http://cran.at.r-project.org/web/views Book: The Art of R Programming by Norm Matloff: http://nostarch.com/artofr.htm Advanced R: http://adv-r.had.co.nz/ and ggplot2 http://docs.ggplot2.org/current/ by R programming for those coming from other languages: http: //www.johndcook.com/R_language_for_programmers.html aRrgh: a newcomer’s (angry) guide to R, by Tim Smith and Kevin Ushey: http://tim-smith.us/arrgh/ Mailing list archives: http://tolstoy.newcastle.edu.au/R/ The [R] stackoverflow tag.

ppppbbbbddddRRRR Core Team Programming with Big Data in R Introduction to R and HPC Why R?: Programming with Big Data

Wei-Chen Chen1 George Ostrouchov2,3,4 Pragneshkumar Patel3 Drew Schmidt4

1FDA 3Joint Institute for Computational Sciences Washington, DC, USA University of Tennessee, Knoxville TN, USA 2Computer Science and Mathematics Division 4Business Analytics and Statistics Oak Ridge National Laboratory, Oak Ridge TN, USA University of Tennessee, Knoxville TN, USA

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR Cluster Computer Architectures HPC Cluster with NVRAM and Parallel File System

Multicore Parallel Today’s HPC Cluster File System

Solid State Disk Big Data Big

Compute Nodes I/O Nodes Storage Servers Disk

Login Nodes Your Laptop “Little Data”

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR The pbdR Project pppbbbdddRRR Interfaces to Libraries: Sustainable Path

Profiling Distributed Memory pbdMPI Tau Interconnection Network ZeroMQ pbdCS MPI ppbbddPROFPROF pbdZMQ mpiP

PROCScaLAPACK PROC PROC PROC + cache PBLAS+ cache + cache + cachePETSc BLACS fpmpi Trilinos pbdPAPI Mem Mem Mem Mem pbdDMAT Co-Processor pbdDMATpbpdbdDMATBASE PAPI pbdSLAP CombBLAS

I/O LibSci (Cray) DPLASMA MKL (Intel) pbdNCDF4 NetCDF4 Shared Memory Local Memory PLASMA COREACML (AMD)CORE CORE CORE GPU: Graphical Processing Unit ADIOS + cache + cache + cache + cache MIC: Many Integrated Core pbdADIOS MAGMA Network cuBLAS (NVIDIA) Learning Memory

cuSPARSE (NVIDIA) pbdDEMO

Released Under Development Why use HPC libraries? Many science communities are invested in their API. Data analysis uses much of the same basic math as simulation science The libraries represent 30+ years of parallel algorithm research They’re tested. They’re fast. They’re scalable.

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI pbdMPI: a High Level Interface to MPI API is simplified: defaults in control objects. S4 methods: extensible to complex R objects. Additional error checking Array and matrix methods without serialization: faster than Rmpi.

pbdMPI (S4) Rmpi allreduce mpi.allreduce allgather mpi.allgather, mpi.allgatherv, mpi.allgather.Robj bcast mpi.bcast, mpi.bcast.Robj gather mpi.gather, mpi.gatherv, mpi.gather.Robj recv mpi.recv, mpi.recv.Robj reduce mpi.reduce scatter mpi.scatter, mpi.scatterv, mpi.scatter.Robj send mpi.send, mpi.send.Robj

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI SPMD: Copies of One Code Run Asynchronously

A simple SPMD allreduce allreduce.r

1 library(pbdMPI, quiet = TRUE) 2 3 ## Your local computation 4 n <- comm.rank() + 1 5 6 ## Now"Reduce" and give the result to all 7 all_sum <- allreduce(n)# Sum is default 8 9 text <- paste("Hello:n is", n,"sum is", all_sum) 10 comm .print(text, all.rank=TRUE) 11 12 finalize () Execute this batch script via: Output:

1 mpirun -np 2 Rscript allreduce.r 1 COMM.RANK = 0 2 [1]"Hello:n is1 sum is3" 3 COMM.RANK = 1 4 [1]"Hello:n is2 sum is3"

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI: Machine Learning: Random Forest Machine Learning Example: Random Forest

Example: Letter Recognition data from package mlbench (20,000 17) ×

1 [,1] lettr capital letter 2 [ ,2] x.box horizontal position of box 3 [ ,3] y.box vertical position of box 4 [,4] width width of box 5 [,5] high height of box 6 [,6] onpix total number of on pixels 7 [ ,7] x. bar mean x of on pixels in box 8 [ ,8] y. bar mean y of on pixels in box 9 [ ,9] x2bar mean x variance 10 [ ,10] y2bar mean y variance 11 [ ,11] xybar mean x y correlation 12 [ ,12] x2ybr mean of x^2 y 13 [ ,13] xy2br mean of x y^2 14 [ ,14] x. ege mean edge count left to right 15 [,15] xegvy correlation of x.ege with y 16 [ ,16] y. ege mean edge count bottom to top 17 [,17] yegvx correlation of y.ege with x

P. W. Frey and D. J. Slate (Machine Learning Vol 6/2 March 91): ”Letter Recognition Using Holland-style Adaptive Classifiers”.

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI: Machine Learning: Random Forest

Example: Random Forest Code (build many simple models, use model averaging to predict)

Serial Code 4 rf s.r

1 library(randomForest) 2 library(mlbench) 3 data(LetterRecognition)# 26 Capital Letters Data 20,000x 17 4 set.seed(seed=123) 5 n <- nrow(LetterRecognition) 6 n_test <- floor(0.2*n) 7 i_test <- sample.int(n, n_test)# Use1/5 of the data to test 8 train <- LetterRecognition[-i_test, ] 9 test <- LetterRecognition[i_test, ] 10 11 ## train random forest 12 rf.all <- randomForest(lettr~ ., train, ntree=500, norm.votes=FALSE) 13 14 ## predict test data 15 pred <- predict(rf.all, test) 16 correct <- sum(pred == test$lettr) 17 cat("Proportion Correct:", correct/(n_test),"\n")

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI: Machine Learning: Random Forest

Example: Random Forest Code (Split learning by blocks of trees. Split prediction by blocks of rows.)

Parallel Code 4 rf p.r

1 library(randomForest) 2 library(mlbench) 3 data(LetterRecognition) 4 comm.set.seed(seed=123, diff=FALSE)# same training data 5 n <- nrow(LetterRecognition) 6 n_test <- floor(0.2*n) 7 i_test <- sample.int(n, n_test)# Use1/5 of the data to test 8 train <- LetterRecognition[-i_test, ] 9 test <- LetterRecognition[i_test, ][get.jid(n test), ] 10 11 comm.set.seed(seed=1e6*runif(1), diff=TRUE) 12 my.rf <- randomForest(lettr~ ., train, ntree=500%/%comm.size(), norm.votes=FALSE) 13 rf.all <- do.call(combine, allgather(my.rf)) 14 15 pred <- predict(rf.all, test) 16 correct <- allreduce(sum(pred == test$lettr)) 17 comm.cat("Proportion Correct:", correct/(n_test),"\n")

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdDMAT Dense Matrix and Vector Operations

A matrix is mapped to a processor grid shape 0 0 1 2 3 4 5 1  2    0 1  3    0 1 2 2 3  4    3 4 5  4 5   5      (a)1 6 (b)2 3 (c)3 2 (d) 6 1 × × × × Table: Processor Grid Shapes with 6 Processors

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdDMAT pppbbbdddRRR No change in syntax. Data redistribution functions.

1 x <- x[-1, 2:5] 2 x <- log(abs(x) + 1) 3 x. pca <- prcomp(x) 4 xtx <-t(x) %*%x 5 ans <- svd(solve(xtx))

The above (and over 100 other functions) runs on 1 core with R or 10,000 cores with pppbbbdddRRR ddmatrix class

1 > showClass("ddmatrix") 2 Class"ddmatrix"[package"pbdDMAT"] 3 Slots : 4 Name : Data dim ldim bldim ICTXT 5 Class : matrix numeric numeric numeric numeric

1 > x <- as.rowblock(x) 2 > x <- as.colblock(x) 3 > x <- redistribute(x, bldim=c(8, 8), ICTXT = 0)

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR RandSVD

Truncated SVD from random projections1

PROBABILISTIC ALGORITHMS FOR MATRIX APPROXIMATION 227

Serial R Prototype for Randomized SVD Given an m n matrix A, a target number k of singular vectors, and an exponent q (say,× q =1or q =2), this procedure computes an approximate 1 randSVD <− f u n c t i o n(A, k,q=3)

rank-2k factorization UΣV ∗,whereU and V are orthonormal, and Σ is 2 {

nonnegative and diagonal. 3## StageA Stage A: 4 Omega <− matrix(rnorm(n*2*k), 244 1 Generate an n N.2 HALKO,k Gaussian P. G. MARTINSSON, test matrix ANDΩ. J. A. TROPP × q 2 Form Y =(AA∗) AΩ by multiplying alternately with A and A∗. 5 nrow=n, ncol=2*k) 3 ConstructAlgorithm a matrix Q 4.3:whose Randomized columns form Power an orthonormal Iteration basis for 6Y <− A%∗% Omega Giventhe an rangem ofnYmatrix. A and integers  and q, this algorithm computes an m  orthonormal× matrix Q whose range approximates the range of A. 7Q <− qr.Q(qr(Y)) Stage× B: 1 Draw an n  Gaussian random matrix Ω. 4 Form B = Q∗A. 8 At <− t(A) × q 2 Form the m  matrix Y =(AA∗) AΩ via alternating application 5 Compute an× SVD of the small matrix: B = U ΣV ∗. 9 for(i in 1:q) of A and A∗. 6 Set U = QU. 10 { 3 Construct an m  matrix Q whose columns form an orthonormal Note: The computation of Y in step 2 is vulnerable to round-off errors. basis for the range× of Y , e.g., via the QR factorization Y = QR. 11Y <− At %∗%Q When high accuracy is required, we must incorporate an orthonormalization Note: This procedure is vulnerable to round-off errors; see Remark 4.3. The 12Q <− qr.Q(qr(Y)) step between each application of A and A∗; see Algorithm 4.4. recommended implementation appears as Algorithm 4.4. 13Y <− A%∗%Q

14Q <− qr.Q(qr(Y)) The theory developed in this paper provides much more detailed information Algorithm 4.4: Randomized Subspace Iteration 15 } aboutGiven the performance an m n matrix of theA proto-algorithm.and integers  and q, this algorithm computes an × m When orthonormal the singular matrix valuesQ whose of A decay range slightly, approximates the error the rangeA ofQQA∗.A does 16 • ×  −  1 notDraw depend an n on thestandard dimensions Gaussian of the matrix matrixΩ. (sections 10.2–10.3). 17## StageB × 2 WeForm canY reduce0 = AΩ theand size compute of the bracketits QR factorization in the error boundY0 = Q (1.8)0R0. by combining • the proto-algorithm with a power iteration (section 10.4). For an example, 18B <− t(Q)% ∗%A 3 for j =1, 2, ..., q see section 1.6 below. 19U <− La .svd(B)$u 4 Form Yj = A∗Qj 1 and compute its QR factorization Yj = Qj Rj . For the structured random− matrices we mentioned in section 1.4.1, related 5• Form Yj = AQj and compute its QR factorization Yj = Qj Rj . 20U <− Q% ∗%U error bounds are in force (section 11). 6 end 21 U[,1:k] We can obtain inexpensive a posteriori error estimates to verify the quality 7 Q = Qq. • of the approximation (section 4.3). 22 }

1.6. Example: Randomized SVD. We conclude this introduction with a short discussion of how these ideas allow us to perform an approximate SVD of a large data 1Halko,Algorithm Martinsson, 4.3 targets and theTropp. fixed-rank 2011. problem. Finding To address structure the fixed-precision with randomness: probabilistic algorithms for constructing matrix,problem, which we can is a incorporate compelling theapplication error estimators of randomized described matrix in section approximation 4.3 to obtain [113]. approximatean adaptiveThe two-stage matrix scheme randomized analogous decompositions with method Algorithm offersSIAM 4.2. a natural In Review situations approach53 where217–288 to it SVD is critical compu- to tations.achieve near-optimal Unfortunately, approximation the simplest errors, version one of can this increase scheme the is oversampling inadequate in beyond many applicationsour standard because recommendation the singular spectrum= k +5allthewayto of the input matrix =2 mayk without decay slowly. changing To address this difficulty, we incorporate q steps of a power iteration, where q =1or the scaling of the asymptotic computational cost. A supporting analysis appears in qCorollary= 2 usually 10.10. suffices in practice. The complete scheme appears in the box labeled PrototypeRemark for4.3. RandomizedUnfortunately, SVD. For when most Algorithm applications, 4.3 pppp itisbbbb isd executeddddR importantRRR Core into Teamfloating-point incorporateProgramming with Big Data in R additionalarithmetic, refinements, rounding errors as we will discuss extinguish in sections all 4 information and 5. pertaining to singular modesThe associated Randomized with SVD singular procedure values requiresthat are onlysmall 2(compared q + 1) passes with overA . the (Roughly, matrix, soif machine it is efficient precision even is forµ, matrices then allstored information out-of-core. associated The with flop singularcount satisfies values smaller 1/(2q+1) than µ A is lost.) This problem can easily be 2 remedied by orthonormalizing   TrandSVD =(2q +2)kTmult +O(k (m + n)), the columns of the sample matrix between each application of A and A∗. The result- whereing scheme,T summarizedis the flop count as Algorithm of a matrix–vector 4.4, is algebraically multiply with equivalentA or A to Algorithm.Wehavethe 4.3 mult ∗ followingwhen executed theorem in exacton the arithmetic performance [93, of 125]. this Wemethod recommend in exact Algorithm arithmetic, 4.4 which because is a its computational costs are similar to those of Algorithm 4.3, even though the former consequence of Corollary 10.10. is substantiallyTheorem 1.2. moreSuppose accurate that in floating-pointA is a real m arithmetic.n matrix.Select an exponent q and a target number k of singular vectors, where 2× k 0.5min m, n .Execute the 4.6. An Accelerated Technique for General Dense≤ ≤ Matrices.{ This} section de- scribes a set of techniques that allow us to compute an approximate rank- factor- ization of a general dense m n matrix in roughly O( mn log()) flops, in contrast to the asymptotic cost O(mn)× required by earlier methods. We can tailor this scheme for the real or complex case, but we focus on the conceptually simpler complex case.

These algorithms were introduced in [138]; similar techniques were proposed in [119]. The first step toward this accelerated technique is to observe that the bottleneck Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. in Algorithm 4.1 is the computation of the matrix product AΩ. When the test matrix

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. pbdR RandSVD

Truncated SVD from random projections Serial R Parallel pbdR

1 randSVD <− f u n c t i o n(A, k,q=3) 1 randSVD <− f u n c t i o n(A, k,q=3) 2 { 2 { 3## StageA 3## StageA 4 Omega <− m a t r i x(rnorm(n ∗2∗k ) , 4 Omega <− ddmatrix(”rnorm”, nrow=n , ncol=2∗k ) 5 nrow=n, ncol=2*k) 5Y <− A%∗% Omega 6Y <− A%∗% Omega 6Q <− qr.Q(qr(Y)) 7Q <− qr.Q(qr(Y)) 7 At <− t(A) 8 At <− t(A) 8 for(i in 1:q) 9 for(i in 1:q) 9 { 10 { 10Y <− At %∗%Q 11Y <− At %∗%Q 11Q <− qr.Q(qr(Y)) 12Q <− qr.Q(qr(Y)) 12Y <− A%∗%Q 13Y <− A%∗%Q 13Q <− qr.Q(qr(Y)) 14Q <− qr.Q(qr(Y)) 14 } 15 } 15 16 16## StageB 17## StageB 17B <− t(Q)% ∗%A 18B <− t(Q)% ∗%A 18U <− La .svd(B)$u 19U <− La .svd(B)$u 19U <− Q% ∗%U 20U <− Q% ∗%U 20 U[,1:k] 21 U[,1:k] 21 } 22 }

ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR RandSVD

From journal to scalable code and scaling data in one day.

30 Singular Vectors from a 100,000 by 1,000 Matrix 30 Singular Vectors from a 100,000 by 1,000 Matrix Speedup of Randomized vs. Full SVD

● ● ● Algorithm full randomized ● ●

128 15 64 ● ● 32

16 ● 10 ● ● 8

● Speedup ●

Speedup 4 ● ● ●

● 2 ● 5 ● 1 ● ●

● ●

1 2 4 8 16 32 64 128 1 2 4 8 16 32 64 128 Cores Cores Speedup relative to 1 core RandSVD speedup relative to full SVD

ppppbbbbddddRRRR Core Team Programming with Big Data in R Future Work

Future Work NSF/DMS: second year of a 3 year grant to Bring back interactivity via client/server (pbdCS/pbdZMQ) Simplify parallel data input Begin DPLASMA integration Outreach to the statistics community DOE/SC: In-situ or staging use with simulations pbdADIOS - HPC I/O Pending: NSF BIGDATA, Tensor Regression Pending: Exascale Computing Project, analytics for ParaView/VisIt

ppppbbbbddddRRRR Core Team Programming with Big Data in R Future Work

Where to learn more? http://r-pbd.org/ pbdDEMO vignette Googlegroup:RBigDataProgramming pppbbbdddRRR Installations: OLCF, NERSC, SDSC, TACC, IU, BSC Spain, CSCS Switzerland, IT4I Czech, ISM Japan, and many more Need access to a cluster computer? From NSF: XSEDE trial or startup allocation https://www.xsede.org/web/xup/allocations-overview. Most resources have pppbbbdddRRR installed

Support This material is based upon work supported by the National Science Foundation Division of Mathematical Sciences under Grant No. 1418195. This work used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This work also used resources of National Institute for Computational Sciences at the University of Tennessee, Knoxville, which is supported by the U.S. National Science Foundation.

ppppbbbbddddRRRR Core Team Programming with Big Data in R