Programming with Big Data in R

Programming with Big Data in R George Ostrouchov Oak Ridge National Laboratory and University of Tennessee Future Trends in Nuclear Physics Computing March 16-18, 2016 Thomas Jefferson National Accelerator Facility Newport News, VA ppppbbbbddddRRRR Core Team Programming with Big Data in R Introduction to R and HPC Why R?: Programming with Data Chambers. Becker, Chambers, Chambers and Chambers. Chambers. Computational and Wilks. The Hastie. Statistical Programming Software for Data Methods for New S Language. Models in S. with Data. Analysis: Data Analysis. Chapman & Hall, Chapman & Hall, Springer, 1998. Programming Wiley, 1977. 1988. 1992. with R. Springer, 2008. Thanks to Dirk Eddelbuettel for this slide idea and to John Chambers for providing the high-resolution scans of the covers of his books. ppppbbbbddddRRRR Core Team Programming with Big Data in R Introduction to R and HPC Popularity? IEEE Spectrum's 2014 Ranking of Programming Languages See: http://spectrum.ieee.org/static/interactive-the-top-programming-languages#index ppppbbbbddddRRRR Core Team Programming with Big Data in R An Example: knitr document produced with RStudio IDE Data: 1,653 start and end timestamps for GPU offload periods. Want 1-ahead prediction of start and end to run other codes. ts <- read.table("sorted1node.txt", header=TRUE) library(dplyr) ts <- mutate(ts, idle = end - start, busy = start - lag(end)) head(ts) ## start end idle busy ## 1 56114210457 56114211289 832 NA ## 2 56114300920 56114311544 10624 89631 ## 3 56114373143 56114373943 800 61599 ## 4 56117433146 56117436858 3712 3059203 ## 5 56117469818 56117470650 832 32960 ## 6 56117507098 56117517081 9983 36448 dim(ts) ## [1] 1653 4 library(ggplot2) ggplot(ts, aes(idle)) + geom_histogram() 1000 750 500 count 250 0 0 50000 100000 150000 idle ggplot(ts, aes(idle)) + geom_histogram() + scale_x_log10() 400 300 200 count 100 0 1e+03 1e+04 1e+05 idle ggplot(ts, aes(busy)) + geom_histogram() + theme_bw() 1500 1000 count 500 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 busy ggplot(ts, aes(busy)) + geom_histogram() + scale_x_log10() + theme_bw() 400 300 count 200 100 0 1e+04 1e+06 1e+08 busy ggplot(ts, aes(busy, idle, color=start)) + geom_point() + scale_x_log10() + scale_y_log10() 1e+05 start 5.85e+10 5.80e+10 5.75e+10 idle 1e+04 5.70e+10 5.65e+10 1e+03 1e+04 1e+06 1e+08 busy ggplot(ts, aes(busy, idle, color=start)) + geom_path() + geom_point() + scale_x_log10() + scale_y_log10() 1e+05 start 5.85e+10 5.80e+10 5.75e+10 idle 1e+04 5.70e+10 5.65e+10 1e+03 1e+04 1e+06 1e+08 busy Successive busy idle periods cluster around several values. Markov chain . probably sparse . Write my own? Turns out that most of this already exists in the package rEMM (One of 7,000+ CRAN packages!) ts_log <- transmute(ts, lidle = log10(idle), lbusy = log10(busy)) library(rEMM) emm <- build(EMM(threshold = 0.2, measure="euclidean"), ts_log) plot(emm) 36 14 34 45 37 33 43 35 29 13 12 8 9 26 5 28 2 1 21 38 16 11 7 30 4 6 3 10 15 18 25 22 24 17 27 19 32 41 23 31 39 44 20 40 42 Markov transition graph after seeing all the data. Add Markov node locations on top of the log scale busy-idle space: ggplot(ts, aes(busy, idle, color=start)) + geom_path() + geom_point() + scale_x_log10() + scale_y_log10() + geom_point(aes(10^lbusy, 10^lidle), data.frame(cluster_centers(emm)), col="red") 1e+05 start 5.85e+10 5.80e+10 5.75e+10 idle 1e+04 5.70e+10 5.65e+10 1e+03 1e+04 1e+06 1e+08 busy Looks promising! pr_idle <- function(ts, threshold, lambda, measure = "euclidean") { emm <- build(EMM(threshold, measure, lambda = lambda), ts_log[1:5, ]) ts$idle_p <- ts$busy_p <-NA for(i in6: nrow(ts_log)) { pred_data <- cluster_centers(emm)[predict(emm, n=1), ] ts$idle_p[i] <- 10^(pred_data["lidle"]) ts$busy_p[i] <- 10^(pred_data["lbusy"]) emm <- build(emm, ts_log[i, ]) } ts } ts <- pr_idle(ts, threshold=0.1, lambda=0.01) Prediction proceeds one pair of busy, idle at a time, predicting the next pair, then updating the Markov states and probabilities for actual values observed. library(reshape2) ts_melt <- melt(ts, value.name="observed", measure.vars=c("idle", "busy")) ts_melt <- mutate(ts_melt, predicted=ifelse(variable == "idle", idle_p, busy_p)) ggplot(ts_melt, aes(start, predicted/observed)) + geom_point() + scale_y_log10() + stat_quantile(quantiles=c(.75,.25), col="blue") + stat_quantile(quantiles=c(.95,.05), col="red") + facet_grid(~variable) idle busy 1e+03 1e+00 predicted/observed 1e−03 5.6e+10 5.7e+10 5.8e+10 5.6e+10 5.7e+10 5.8e+10 start This is a lightweight algorithm in R (~2 seconds total for all predictions). It can be made more lightweight (100x ?) by implementing in C or C++ and by updating less often. Introduction to R and HPC Resources for Learning R RStudio IDE http://www.rstudio.com/products/rstudio-desktop/ Task Views: http://cran.at.r-project.org/web/views Book: The Art of R Programming by Norm Matloff: http://nostarch.com/artofr.htm Advanced R: http://adv-r.had.co.nz/ and ggplot2 http://docs.ggplot2.org/current/ by Hadley Wickham R programming for those coming from other languages: http: //www.johndcook.com/R_language_for_programmers.html aRrgh: a newcomer's (angry) guide to R, by Tim Smith and Kevin Ushey: http://tim-smith.us/arrgh/ Mailing list archives: http://tolstoy.newcastle.edu.au/R/ The [R] stackoverflow tag. ppppbbbbddddRRRR Core Team Programming with Big Data in R Introduction to R and HPC Why R?: Programming with Big Data Wei-Chen Chen1 George Ostrouchov2;3;4 Pragneshkumar Patel3 Drew Schmidt4 1FDA 3Joint Institute for Computational Sciences Washington, DC, USA University of Tennessee, Knoxville TN, USA 2Computer Science and Mathematics Division 4Business Analytics and Statistics Oak Ridge National Laboratory, Oak Ridge TN, USA University of Tennessee, Knoxville TN, USA ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR Cluster Computer Architectures HPC Cluster with NVRAM and Parallel File System Multicore Parallel Today’s HPC Cluster File System Solid State Disk BigData Compute Nodes I/O Nodes Storage Servers Disk Login Nodes Your Laptop “Little Data” ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR The pbdR Project pppbbbdddRRR Interfaces to Libraries: Sustainable Path Profiling Distributed Memory pbdMPI Tau Interconnection Network ZeroMQ pbdCS MPI ppbbddPROFPROF pbdZMQ mpiP PROCScaLAPACK PROC PROC PROC + cache PBLAS+ cache + cache + cachePETSc BLACS fpmpi Trilinos pbdPAPI Mem Mem Mem Mem pbdDMAT Co-Processor pbdDMATpbpdbdDMATBASE PAPI pbdSLAP CombBLAS I/O LibSci (Cray) DPLASMA MKL (Intel) pbdNCDF4 NetCDF4 Shared Memory Local Memory PLASMA COREACML (AMD)CORE CORE CORE GPU: Graphical Processing Unit ADIOS + cache + cache + cache + cache MIC: Many Integrated Core pbdADIOS MAGMA Network cuBLAS (NVIDIA) Learning Memory cuSPARSE (NVIDIA) pbdDEMO Released Under Development Why use HPC libraries? Many science communities are invested in their API. Data analysis uses much of the same basic math as simulation science The libraries represent 30+ years of parallel algorithm research They're tested. They're fast. They're scalable. ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI pbdMPI: a High Level Interface to MPI API is simplified: defaults in control objects. S4 methods: extensible to complex R objects. Additional error checking Array and matrix methods without serialization: faster than Rmpi. pbdMPI (S4) Rmpi allreduce mpi.allreduce allgather mpi.allgather, mpi.allgatherv, mpi.allgather.Robj bcast mpi.bcast, mpi.bcast.Robj gather mpi.gather, mpi.gatherv, mpi.gather.Robj recv mpi.recv, mpi.recv.Robj reduce mpi.reduce scatter mpi.scatter, mpi.scatterv, mpi.scatter.Robj send mpi.send, mpi.send.Robj ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI SPMD: Copies of One Code Run Asynchronously A simple SPMD allreduce allreduce.r 1 library(pbdMPI, quiet = TRUE) 2 3 ## Your local computation 4 n <- comm.rank() + 1 5 6 ## Now"Reduce" and give the result to all 7 all_sum <- allreduce(n)# Sum is default 8 9 text <- paste("Hello:n is", n,"sum is", all_sum) 10 comm .print(text, all.rank=TRUE) 11 12 finalize () Execute this batch script via: Output: 1 mpirun -np 2 Rscript allreduce.r 1 COMM.RANK = 0 2 [1]"Hello:n is1 sum is3" 3 COMM.RANK = 1 4 [1]"Hello:n is2 sum is3" ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI: Machine Learning: Random Forest Machine Learning Example: Random Forest Example: Letter Recognition data from package mlbench (20,000 17) × 1 [,1] lettr capital letter 2 [ ,2] x.box horizontal position of box 3 [ ,3] y.box vertical position of box 4 [,4] width width of box 5 [,5] high height of box 6 [,6] onpix total number of on pixels 7 [ ,7] x. bar mean x of on pixels in box 8 [ ,8] y. bar mean y of on pixels in box 9 [ ,9] x2bar mean x variance 10 [ ,10] y2bar mean y variance 11 [ ,11] xybar mean x y correlation 12 [ ,12] x2ybr mean of x^2 y 13 [ ,13] xy2br mean of x y^2 14 [ ,14] x. ege mean edge count left to right 15 [,15] xegvy correlation of x.ege with y 16 [ ,16] y. ege mean edge count bottom to top 17 [,17] yegvx correlation of y.ege with x P. W. Frey and D. J. Slate (Machine Learning Vol 6/2 March 91): "Letter Recognition Using Holland-style Adaptive Classifiers”. ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI: Machine Learning: Random Forest Example: Random Forest Code (build many simple models, use model averaging to predict) Serial Code 4 rf s.r 1 library(randomForest) 2 library(mlbench) 3 data(LetterRecognition)# 26 Capital Letters Data 20,000x 17 4 set.seed(seed=123) 5 n <- nrow(LetterRecognition) 6 n_test <- floor(0.2*n) 7 i_test <- sample.int(n, n_test)# Use1/5 of the data to test 8 train <- LetterRecognition[-i_test, ] 9 test <- LetterRecognition[i_test, ] 10 11 ## train random forest 12 rf.all <- randomForest(lettr~ ., train, ntree=500, norm.votes=FALSE) 13 14 ## predict test data 15 pred <- predict(rf.all, test) 16 correct <- sum(pred == test$lettr) 17 cat("Proportion Correct:", correct/(n_test),"\n") ppppbbbbddddRRRR Core Team Programming with Big Data in R pbdR pbdMPI: Machine Learning: Random Forest Example: Random Forest Code (Split learning by blocks of trees.

Load more