Compute Grid: Parallel Processing

RCS Lunch & Learn Training Series

Bob Freeman, PhD
Director, Research Technology Operations, HBS

8 November 2017

Overview
• Q&A
• Introduction
• Serial vs parallel
• Approaches to parallelization
• Submitting parallel jobs on the compute grid
• Parallel tasks
• Parallel code

Serial vs Parallel Work

Serial vs Multicore Approaches

Traditionally, software has been written for serial computers:
• To be run on a single computer having a single Central Processing Unit (CPU)
• The problem is broken into a discrete set of instructions
• Instructions are executed one after the other
• Only one instruction can be executed at any moment in time

Serial vs Multicore Approaches

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken into discrete parts (either by you or by the application itself) that can be solved concurrently
• Each part is further broken down into a series of instructions
• Instructions from each part execute simultaneously on different CPUs or different machines

Serial vs Multicore Approaches

There are many different parallelization approaches, which we won't discuss:
• Shared memory
• Distributed memory
• Hybrid distributed-shared memory

Parallel Processing…

So, we are going to briefly touch on two approaches:

• Parallel tasks
  • Tasks in the background
  • gnu_parallel
  • Pleasant parallelization

• Parallel code
  • Considerations for parallelizing
  • Parallel frameworks & examples

We will not discuss parallelized frameworks such as Hadoop, Apache Spark, MongoDB, ElasticSearch, etc.

Parallel Jobs on the Compute Grid…

Nota Bene!!
• In order to run in parallel, programs (code) must be explicitly programmed to do so.
• And you must ask the scheduler to reserve those cores for your program/work to use.

Thus, requesting cores from the scheduler does not automagically parallelize your code!

# SAMPLE JOB FILE
#!/bin/bash
#BSUB -q normal        # Queue to submit to (comma separated)
#BSUB -n 8             # Number of cores
...
blastn -query seqs.fasta -db nt -out seqs.nt.blastn                                       # WRONG!!
blastn -query seqs.fasta -db nt -out seqs.nt.blastn -num_threads $LSB_MAX_NUM_PROCESSORS  # YES!!

# SAMPLE PARALLELIZED CODE
bsub -q normal -n 4 -W 24:00 -R "rusage[mem=4000]" stata-mp4 -d myfile.do

# SAMPLE PARALLEL TASKS
bsub -q normal -n 4 -W 24:00 -R "rusage[mem=4000]" \
    parallel --joblog .log --outputasfiles -j \$LSB_MAX_NUM_PROCESSORS :::: tasklist.txt

# SAMPLE PLEASANT PARALLELIZATION
for file in folder/*.txt; do
    echo $file
    bsub -q normal -W 24:00 -R "rusage[mem=1000]" python process_input_data.py $file
done

Parallel Tasks

Background tasks

Shells, by default, have the ability to multitask: doing more than one thing at a time. In bash, this can be accomplished by sending a command to the background:
• Explicitly, with &
• After the fact, with ^Z and bg

When you put a task in the background:
• The task keeps running while you continue to work at the shell in the foreground
• If any output is produced, it appears on your screen immediately
• If input is required, the process prints a message and stops
• When it is done, a message will be printed
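A minimal shell sketch of these mechanics (the sleep commands simply stand in for long-running work):

# Start a long-running command in the background explicitly with &
sleep 300 &                 # the shell prints a job number and PID, and returns immediately
jobs                        # list background jobs and their status

# Or background a command after the fact:
sleep 600                   # started in the foreground...
# press Ctrl-Z to suspend it, then:
bg                          # ...resume it in the background
fg                          # bring it back to the foreground when needed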

From Processes & Job Control: http://slideplayer.com/slide/4592906/

GNU Parallel Approach

GNU parallel is a shell tool for executing jobs in parallel using one or more computers.
• A job is a single command or small script that has to be run for each of the lines in the input.
• Typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables.
• Many options for controlling execution and the output of results
• Can specify the degree of parallelization
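For example, run locally (outside the scheduler), GNU parallel can fan a single command out over a list of inputs; the gzip command and file names here are just placeholders:

# Compress every .txt file in the current directory, 4 at a time
parallel -j 4 gzip ::: *.txt

# Equivalent, but reading the argument list from a file (one entry per line)
ls *.txt > filelist.txt
parallel -j 4 gzip :::: filelist.txt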

# Create a list of tasks (files to unzip)
for index in `seq 1 100`; do echo "unzip myfile$index.zip" >> tasklist.txt; done

# Ask the compute cluster to do this for me in parallel, using 4 CPUs/cores
bsub -q normal -n 4 -W 2:00 -R "rusage[mem=4000]" \
    parallel --joblog .log --outputasfiles -j \$LSB_MAX_NUM_PROCESSORS :::: tasklist.txt

Concept of Pleasant Parallelization

Problem: How do I BLAST 200,000 transcripts against NR?
Solution: Fake a parallel BLAST. But how?
• Divide your input file into n separate files
• BLAST each smaller input file on a separate core
• Running on n cores will be almost exactly n times faster! Why? Each core doesn't need to talk to the others
• You could submit the n jobs individually, but that's not recommended; use more sophisticated techniques: job arrays, gnu_parallel, GridRunner
• Don't confuse this with the truly parallel mpiBLAST

The efficiency of your work depends on how you parallelize your task: you want to ensure that your jobs spend most of their time computing, and not sitting in the queue or doing compute prep.

[Diagram: the overhead steps (schedule, module load BLAST, job finish) are paid once for a single large job, versus 100 times for 100 tiny jobs. What would you choose?]

Manual (Script) Approach

• Split the input file into N files that each run 1 to 6 hrs
• Can be done with a small script (e.g., Python), unix split, etc.
• The user script parses the datafile whose name is passed as the command-line parameter

for file in my*.dat
do
    echo $file
    bsub -q normal -W 6:00 -R "rusage[mem=1000]" \
        python process_data_file.py $file
    sleep 1
done

For advanced users, this can be submitted as a single job using a job array, a feature available on most schedulers.

# Create a script for the job array (process_data_file_array.py),
# and now submit it as a job array
num_files=$(ls -1 my*.dat | wc -l)
bsub -J myarray[1-$num_files] -q normal -W 6:00 -R "rusage[mem=1000]" \
    python process_data_file_array.py
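Within a job array, LSF sets $LSB_JOBINDEX to the element's index, which each element can map back onto its own input file. The array script itself is not shown in the slides, so this is one possible (hypothetical) way to structure that mapping, reusing the my*.dat naming pattern from above:

#!/bin/bash
# pick_array_file.sh -- hypothetical helper: select this array element's input file.
# LSF exports LSB_JOBINDEX (1..N) for each element of the job array.
files=( my*.dat )                          # the same serially numbered files as above
this_file=${files[$((LSB_JOBINDEX - 1))]}  # shell arrays are 0-based, job indices are 1-based
echo "Array element $LSB_JOBINDEX processing $this_file"
python process_data_file.py "$this_file"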

This process is ideal for serially numbered files, parameter sweeps, & optimization routines!!

Parallel Code

Can my code be parallelized?

Does it have large loops that repeat the same operations?

Does your code do multiple tasks that are not dependent on one another? If there is a dependency, is it weak?

Can any dependencies or information sharing be overlapped with computation? If not, is the amount of communication small?

Do multiple tasks depend on the same data?

Does the order of operations matter? If so, how strict does it have to be?

Basic guidance for efficient parallelization:

Is it even worth parallelizing my code?

Does your code take an intractably long amount of time to complete?

Do you run a single large model or do statistics on multiple small runs?

Would the amount of time it take to parallelize your code be worth the gain in speed?

Parallelizing established code vs. starting from scratch

Established code: Maybe easier / faster to parallelize, but may not give good performance or scaling

Start from scratch: Takes longer, but can give better performance and accuracy, and offers the opportunity to turn a “black box” into code you understand

Basic guidance for efficient parallelization:

Increase the fraction of your program that can be parallelized. Identify the most time-consuming parts of your program and parallelize them. This could require modifying your algorithm and your code's organization.
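Why the parallel fraction matters so much can be seen from Amdahl's law, speedup = 1 / ((1 - p) + p/n) for parallel fraction p on n cores. A quick back-of-the-envelope check in the shell (the fractions here are just illustrative):

# p = 0.90 of the runtime parallelized, run on 8 cores:
echo 'scale=6; 1 / ((1 - 0.90) + 0.90/8)' | bc    # ~4.7x speedup, not 8x
# p = 0.99 of the runtime parallelized, run on 8 cores:
echo 'scale=6; 1 / ((1 - 0.99) + 0.99/8)' | bc    # ~7.5x speedup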

Balance parallel workload

Minimize time spent in communication

Use simple arrays instead of user defined derived types

Partition data. Distribute arrays and matrices – allocate specific memory for each MPI process

Designing parallel programs - partitioning:

One of the first steps in designing a parallel program is to break the problem into discrete “chunks” that can be distributed to multiple parallel tasks.

Domain Decomposition: Data associated with the problem is partitioned; each parallel task works on a portion of the data.

There are different ways to partition the data
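As a minimal, file-level sketch of partitioning data for independent tasks (the file name is hypothetical, and the -n l/8 option assumes GNU split):

# Partition a large line-oriented dataset into 8 pieces without breaking lines,
# so that 8 parallel tasks can each work on their own chunk.
split -n l/8 -d big_dataset.txt chunk_
ls chunk_0*     # chunk_00 ... chunk_07, one per task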

Designing parallel programs - partitioning:


Functional Decomposition: Problem is decomposed according to the work that must be done. Each parallel task performs a fraction of the total computation.

Designing parallel programs - communication:

Most parallel applications require tasks to share data with each other.

Cost of communication: Computational resources are used to package and transmit data. It frequently requires synchronization, so some tasks will wait instead of doing work. It could also saturate network bandwidth.

Latency vs. Bandwidth: Latency is the time it takes to send a minimal message between two tasks. Bandwidth is the amount of data that can be communicated per unit of time. Sending many small messages can cause latency to dominate communication overhead.
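A quick illustration of why message size matters (the latency and bandwidth figures are hypothetical: 10 microseconds per message and 1 GB/s):

# 1000 messages of 1 KB each: pay the 10 us latency 1000 times, plus ~1 us transfer per message
echo '1000 * (10 + 1)' | bc      # ~11,000 us total
# The same 1 MB sent as a single message: one 10 us latency, plus ~1000 us transfer
echo '10 + 1000' | bc            # ~1,010 us total -- roughly 10x faster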

Synchronous vs. Asynchronous communication: Synchronous communication is referred to as blocking communication – other work stops until the communication is completed. Asynchronous communication is referred to as non-blocking since other work can be done while communication is taking place.

Scope of communication: Point-to-point communication – data transmission between tasks. Collective communication – involves all tasks (in a communication group)

This is only a partial list of things to consider!

Designing parallel programs – load balancing:

Load balancing is the practice of distributing approximately equal amounts of work across tasks so that all tasks are kept busy all the time.

How to Achieve Load Balance?

Equally partition the work given to each task: For array/matrix operations equally distribute the data set among parallel tasks. For loop iterations where the work done for each iteration is equal, evenly distribute iterations among tasks.

Use dynamic work assignment: Certain classes of problems result in load imbalance even if data is distributed evenly among tasks (sparse matrices, adaptive grid methods, many-body simulations, etc.). Use a scheduler/task-pool approach: as each task finishes, it queues to get a new piece of work. Modify your algorithm to handle imbalances dynamically. (See the sketch below.)
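GNU parallel (introduced above) already implements this task-pool pattern: it keeps a fixed number of worker slots busy and hands each slot the next piece of work as soon as it finishes. A minimal sketch, assuming a tasklist.txt like the one built earlier:

# 4 worker slots; tasks of very different lengths are handed out dynamically,
# so no slot sits idle while another is still churning through a long task.
parallel -j 4 --joblog balance.log :::: tasklist.txt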

Designing parallel programs – I/O:

The Bad News:
• I/O operations are inhibitors of parallelism
• I/O operations are orders of magnitude slower than memory operations
• Parallel file systems may be immature or not available on all systems
• I/O that must be conducted over a network can cause severe bottlenecks

The Good News:
• Parallel file systems are available (e.g., Lustre)
• The MPI parallel I/O interface has been available since 1996 as part of MPI-2

I/O Tips:
• Reduce overall I/O as much as possible
• If you have access to a parallel file system, use it
• Writing large chunks of data rather than small ones is significantly more efficient
• Fewer, larger files perform much better than many small files
• Have a subset of parallel tasks perform the I/O instead of using all tasks, or
• Confine I/O to a single task and then broadcast (gather) data to (from) the other tasks

Languages that Use Parallel Computing

• C/C++
• Fortran
• MATLAB
• Python
• R
• Perl
• Julia
• Scala
• …

Parallel Options in R, Python, & MATLAB

• By default, R, Python, Perl, and MATLAB* are not multithreaded … so do not ask for or try to use more than 1 core/CPU!!
• For all of these programs, you cannot use the drop-down GUI menus, and you must set the # of CPUs/cores dynamically. DO NOT USE STATIC VALUES!
• For R, use the appropriate routines from the parallel package
  • Now part of base R
  • Related packages include foreach, doMC, and snow
• For Python, you can use the multiprocessing library (or many others)
• For Perl, there's threads or Parallel::ForkManager
• MATLAB has parpool; do not set the worker thread count in the GUI settings

# R example (parallel.R)
library(parallel)
ncores <- as.numeric(Sys.getenv('LSB_MAX_NUM_PROCESSORS'))
mclapply(seq_len(10), run2, mc.cores = ncores)    # run2() defined elsewhere; 10 tasks as an example

bsub -q normal -n 4 -app R-5g R CMD BATCH parallel.R   # custom submission command

# MATLAB example (parallel.m)
hPar = parpool( 'local' , str2num( getenv('LSB_MAX_NUM_PROCESSORS') ) );
…
matlab-5g -n4 parallel.m   # uses command-line wrapper

See more info on our website at http://grid.rcs.hbs.org/parallel-processing

Example: Stata Parallelization

Stata/MP Performance Report Summary

Stata offers a 293-page report on its parallelization efforts. The results are pretty impressive. However, this parallelization benefit is mostly realized in batch mode: most of interactive Stata is spent waiting for user input (or left idle), with CPU efficiency typically < 5%-10%.

From the report's summary: Stata/MP is the version of Stata that is programmed to take full advantage of multicore and multiprocessor computers. It is exactly like Stata/SE in all ways except that it distributes many of Stata's most computationally demanding tasks across all the cores in your computer and thereby runs faster -- much faster. In a perfect world, software would run 2 times faster on 2 cores, 3 times faster on 3 cores, and so on. Stata/MP achieves about 75% efficiency: it runs 1.7 times faster on 2 cores, 2.4 times faster on 4 cores, and 3.2 times faster on 8 cores. Half the commands run faster than that; the other half run slower than the median speedup, and some commands are not sped up at all, either because they are inherently sequential (most time-series commands) or because they have not been parallelized (graphics, mixed). Estimation commands, which matter most for evaluating average performance because they take longer to run, achieve an even greater efficiency of approximately 85%: taken at the median, estimation commands run 1.9 times faster on 2 cores, 3.1 times faster on 4 cores, and 4.1 times faster on 8 cores. Stata/MP supports up to 64 cores.

Perfect scalability cannot be expected for three reasons: 1) some calculations have parts that cannot be partitioned into parallel processes; 2) even when there are parts that can be partitioned, determining how to partition them takes computer time; and 3) multiprocessor systems only duplicate processors and cores, not all the other system resources. Speed also matters more for problems that are large, whether in the size of the dataset or in some other aspect such as the number of covariates: on large problems, Stata/MP with 2 cores runs half of Stata's commands at least 1.7 times faster than on a single core; with 4 cores, the same commands run at least 2.4 times faster than on a single core.

[Figure 1: Performance of Stata/MP -- speed on multicore/multiprocessor systems relative to speed on a single core, showing the theoretical upper bound, the possible performance region, median performance for estimation commands and for all commands, and the lower bound (no improvement).]

Parallel Processing in R

• By default, R, Python, Perl, and MATLAB* are not multithreaded … so do not ask for or try to use more than 1 core/CPU!!
• For R, use the appropriate routines from the parallel package
  • Now part of base R
  • Related packages include foreach, doMC, and snow
• The multicore functions in parallel enable parallelization through the apply() family of functions, but will not work on Windows systems due to how the parallelization is achieved (no fork())

# R example (parallel.R)
library(parallel)
ncores <- as.numeric(Sys.getenv('LSB_MAX_NUM_PROCESSORS'))
mclapply(seq_len(10), run2, mc.cores = ncores)    # run2() defined elsewhere; 10 tasks as an example

bsub -q normal -n 4 -app R-5g R CMD BATCH parallel.R   # custom submission command

See more info on our website at http://grid.rcs.hbs.org/parallel-r

Parallel Processing in R

# library(parallel): snow-style single-node parallel cluster
library(parallel)
# makeCluster() wraps makeSOCKcluster() and launches the specified number
# of R worker processes on the local machine
cluster <- makeCluster(as.numeric(Sys.getenv('LSB_MAX_NUM_PROCESSORS')))
# one must explicitly make vars/functions available in the sub-processes
clusterExport(cluster, c('myProc'))
# now run it
result <- clusterApply(cluster, 1:10, function(i) myProc())
result
stopCluster(cluster)

bsub -q normal -n 4 -app R-5g R CMD BATCH parallel_snow.R   # custom submission command

# library(parallel): foreach + multicore backend
library(foreach)
library(doMC)

registerDoMC(cores = as.numeric(Sys.getenv('LSB_MAX_NUM_PROCESSORS')))
result <- foreach(i = 1:10, .combine = c) %dopar% {
    myProc()
}
result

bsub -q normal -n 4 -app R-5g R CMD BATCH parallel_foreach.R   # custom submission command

See more info on our website at http://grid.rcs.hbs.org/parallel-r

‘Multiprocessing’ in Python

By default, R, Python, Perl, and MATLAB* are not multithreaded … so do not ask for or try to use more than 1 core/CPU!!
• Python has the ‘multiprocessing’ module
  • Evolved from the threading module
  • Uses subprocesses, instead of threads, to bypass Python's Global Interpreter Lock
  • Offers rich subclasses; for example, Pool provides a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism)

• Runs on both Unix & Windows systems

‘Multiprocessing’ in Python

import multiprocessing, os

def worker(num):
    """thread worker function"""
    print('Worker:', num)
    return

if __name__ == '__main__':
    jobs = []
    cores = int(os.environ['LSB_MAX_NUM_PROCESSORS'])
    for i in range(cores):
        p = multiprocessing.Process(target=worker, args=(i,))
        jobs.append(p)
        p.start()

$ python multiprocessing_simpleargs.py

Worker: 0
Worker: 1
Worker: 2
Worker: 3
Worker: 4

bsub -q normal -n 5 -W 1:00 -R "rusage[mem=1000]" python parallel_workers.py

See more info on our website at http://grid.rcs.hbs.org/parallel-processing

Multicore Options: MATLAB

By default, R, Python, Perl, and MATLAB* are not multithreaded … so do not ask for or try to use more than 1 core/CPU!!
• MATLAB has parpool, part of the Parallel Computing Toolbox (PCT), standard on all installations
  • This operates on a single machine!
• On some systems (e.g., FASRC's Odyssey), workers can be spawned across multiple machines for large-scale work
  • Requires the MATLAB Distributed Computing Server (DCS)

# MATLAB example (parallel.m)
hPar = parpool( 'local' , str2num( getenv('LSB_MAX_NUM_PROCESSORS') ) );

R = 1; darts = 1e7; count = 0;   % Prepare settings
parfor i = 1:darts
    x = R * rand(1);
    y = R * rand(1);
    if x^2 + y^2 <= R^2
        count = count + 1;
    end
end
myPI = 4 * count / darts;

% Log results & close down the parallel pool
% (hLog is assumed to be a previously opened log-file handle)
fprintf( hLog , 'The computed value of pi is %2.7f\n' , myPI );
delete(gcp);

matlab-5g -n4 par_compute_pi.m   # command-line wrapper
bsub -n 4 -q normal -W 2:00 -R "rusage[mem=1000]" matlab par_compute_pi.m   # custom submission

See more info on our website at http://grid.rcs.hbs.org/parallel-processing

Other Important Points & Troubleshooting

Mixed Multicore and Serial Workflows

Choosing core count can be difficult, especially if there's a mix of serial and parallel steps….

• Think about how long your code will be in either mode
• Determine the fraction of resource use across the whole job
• If < 20% of the time is in multicore use, split the work into two separate jobs; otherwise it's OK to run it as one long job
• Can use job dependencies to make submission easier (see the sketch below)
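A minimal LSF sketch of chaining the two pieces with a job dependency (the job names, scripts, and the --threads flag are hypothetical):

# Serial preprocessing step, 1 core
bsub -J prep -q normal -n 1 -W 4:00 -R "rusage[mem=2000]" python preprocess.py

# Multicore analysis step, held until the 'prep' job finishes successfully
bsub -J analyze -w "done(prep)" -q normal -n 8 -W 12:00 -R "rusage[mem=4000]" \
    python analyze.py --threads \$LSB_MAX_NUM_PROCESSORS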


Scaling Tests Ensure Efficiency

Not all programs can be scaled well. This is due to:
• Overhead of program start
• Overhead of communication between processes (threads) within the program
• (worse:) Waiting to write to the network or disk (I/O)
• Other, serial parts of the program (parts that cannot be parallelized)

Scaling tests are important to help you determine the optimal # of cores to use!!

Your Own Scaling Tests!

# Create a SLURM script for an analysis that can be used for multiple CPU (core) counts.
# The input seqs.fa file has 350 FASTA sequences, so we can get good parallelization values:

--- file: blast_scale_test.slurm ---
#!/bin/bash
#
#SBATCH -p serial_requeue        # Partition to submit to (comma separated)
#SBATCH -J blastx                # Job name
#SBATCH -N 1                     # Ensure that all cores are on one machine
#SBATCH -t 0-4:00                # Runtime in D-HH:MM (or use minutes)
#SBATCH --mem 10000              # Memory pool in MB for all cores
#SBATCH --mail-type=END,FAIL     # Type of email notification: BEGIN,END,FAIL,ALL

source new-modules.sh; module load ncbi-blast/2.2.31+-fasrc01
export BLASTDB=/n/regal/informatics_public/

blastx -query seqs.fa -db $BLASTDB/custom/other/model_chordate_proteins \
    -out sk_shuffle_seqs.n${1}.modelchordate.blastx -num_threads $1
---

# And now submit the file multiple times with different core counts
for i in 1 2 4 8 16; do
    echo $i
    # sbatch flags here will override those in the SLURM submission script
    sbatch -n $i -J blastx$i -o blastx_n$i.out -e blastx_n$i.err blast_scale_test.slurm $i
    sleep 1
done

Your Own Scaling Test Results!

[bfreeman@rclogin04 ~]$ sacct -u bfreeman -S 4/6/16 --format=jobid,\
elapsed,alloccpus,cputime,totalcpu,state

       JobID    Elapsed  AllocCPUS     CPUTime    TotalCPU      State
------------ ---------- ---------- ----------- ----------- ----------
    59817008   16:12:26          1    16:12:26    16:03:34  COMPLETED
59817008.ba+   16:12:26          1    16:12:26    16:03:34  COMPLETED
    59817024   10:49:16          2    21:38:32    17:53:07  COMPLETED
59817024.ba+   10:49:16          2    21:38:32    17:53:07  COMPLETED
    59817026   06:03:38          4  1-00:14:32    15:56:55  COMPLETED
59817026.ba+   06:03:38          4  1-00:14:32    15:56:55  COMPLETED
    59817028   04:55:44          8  1-15:25:52    21:27:30  COMPLETED
59817028.ba+   04:55:44          8  1-15:25:52    21:27:30  COMPLETED
    59817043   03:01:51         16  2-00:29:36  1-01:33:03  COMPLETED
59817043.ba+   03:01:51         16  2-00:29:36  1-01:33:03  COMPLETED
    59847485   02:04:58         32  2-18:38:56  1-11:42:36  COMPLETED
59847485.ba+   02:04:58         32  2-18:38:56  1-11:42:36  COMPLETED

Cores          1          2          4           8           16          32
Elapsed        16:12:26   10:49:16   6:03:38     4:55:44     3:01:51     2:04:58
  Ideal        16:12:26   8:06:13    4:03:07     2:01:33     1:00:47     0:30:23
CPUTime        16:12:26   21:38:32   24:14:32    39:25:52    48:29:36    66:38:56
  Ideal        16:12:26   16:12:26   16:12:26    16:12:26    16:12:26    16:12:26
  NoGain       16:12:26   32:24:52   64:49:44    129:39:28   259:18:56   518:37:52
TotalCPU       16:03:34   17:53:07   15:56:55    21:27:30    25:33:03    35:42:36
  Ideal        16:03:34   16:03:34   16:03:34    16:03:34    16:03:34    16:03:34
  NoGain       16:03:34   32:07:08   64:14:16    128:28:32   256:57:04   513:54:08

Your Own Scaling Test Results!

[Charts: wall-clock time (Elapsed) vs. the ideal linear-scaling curve, and CPU time (CPUTime, TotalCPU) vs. the Ideal and NoGain curves, plotted for 1 to 32 cores using the table above.]

Getting Help

RCS Website & Documentation (the only authoritative source):
https://grid.rcs.hbs.org/

Submit a help request: [email protected]

Best way to help us help you? Give us:
• A description of the problem
• Additional info (login/batch? queue? JobIDs?)
• Steps to reproduce (1., 2., 3., ...)
• Actual results
• Expected results

Research Computing Services

• Please talk to your peers, and …
• We wish you success in your research!

• http://intranet.hbs.edu/dept/research/
• https://grid.rcs.hbs.org/
• https://training.rcs.hbs.org/

• @hbs_rcs