Compute Grid: Parallel Processing
RCS Lunch & Learn Training Series
Bob Freeman, PhD
Director, Research Technology Operations, HBS
8 November, 2017

Overview
• Q&A
• Introduction
• Serial vs parallel
• Approaches to parallelization
• Submitting parallel jobs on the compute grid
• Parallel tasks
• Parallel code

Serial vs Parallel Work

Serial vs Multicore Approaches
Traditionally, software has been written for serial computers:
• To be run on a single computer having a single Central Processing Unit (CPU)
• The problem is broken into a discrete set of instructions
• Instructions are executed one after the other
• Only one instruction can be executed at any moment in time

Serial vs Multicore Approaches
In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• To be run using multiple CPUs
• The problem is broken into discrete parts (either by you or by the application itself) that can be solved concurrently
• Each part is further broken down into a series of instructions
• Instructions from each part execute simultaneously on different CPUs or different machines

Serial vs Multicore Approaches
There are many different parallelization approaches, which we won't discuss in detail:
• Shared memory
• Distributed memory
• Hybrid distributed-shared memory

Parallel Processing…
So, we are going to briefly touch on two approaches:
• Parallel tasks
  • Tasks in the background
  • gnu_parallel
  • Pleasantly parallelizing
• Parallel code
  • Considerations for parallelizing
  • Parallel frameworks & examples
We will not discuss parallelized frameworks such as Hadoop, Apache Spark, MongoDB, ElasticSearch, etc.

Parallel Jobs on the Compute Grid…
Nota bene!!
• In order to run in parallel, programs (code) must be explicitly programmed to do so.
• And you must ask the scheduler to reserve those cores for your program/work to use.
Thus, requesting cores from the scheduler does not automagically parallelize your code!

# SAMPLE JOB FILE
#!/bin/bash
#BSUB -q normal    # Queue to submit to (comma separated)
#BSUB -n 8         # Number of cores
...
blastn -query seqs.fasta -db nt -out seqs.nt.blastn                                       # WRONG!!
blastn -query seqs.fasta -db nt -out seqs.nt.blastn -num_threads $LSB_MAX_NUM_PROCESSORS  # YES!!

# SAMPLE PARALLELIZED CODE
bsub -q normal -n 4 -W 24:00 -R "rusage[mem=4000]" stata-mp4 -d myfile.do

# SAMPLE PARALLEL TASKS
bsub -q normal -n 4 -W 24:00 -R "rusage[mem=4000]" \
  parallel --joblog .log --outputasfiles -j \$LSB_MAX_NUM_PROCESSORS :::: tasklist.txt

# SAMPLE PLEASANT PARALLELIZATION
for file in folder/*.txt; do
  echo $file
  bsub -q normal -W 24:00 -R "rusage[mem=1000]" python process_input_data.py $file
done

Parallel Tasks
Background tasks
Shells, by default, have the ability to multitask: to do more than one thing at a time.
In bash, this can be accomplished by sending a command to the background:
• Explicitly, with &
• After the fact, with ^Z and bg
When you put a task in the background:
• The task keeps running while you continue to work at the shell in the foreground
• If any output is produced, it appears on your screen immediately
• If input is required, the process prints a message and stops
• When it is done, a message is printed
From Processes & Job Control: http://slideplayer.com/slide/4592906/
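As a quick illustration (not from the original slides), here is a minimal bash sketch of backgrounding a task; the gzip command and file name are placeholders:

# Start a long-running command in the background with &
gzip my_big_file.csv &   # placeholder command; bash prints a job number and PID
jobs                     # list background jobs and their status
fg %1                    # bring job 1 back to the foreground
# press ^Z to suspend the foreground job, then resume it in the background:
bg %1
wait                     # block until all background jobs have finished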
GnuParallel Approach
GNU parallel is a shell tool for executing jobs in parallel using one or more computers:
• A single command or small script is run for each of the lines in the input
• Typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables
• Many options for controlling the work and the output of results
• Can specify the degree of parallelization

# create a list of unzip tasks, one per file
for index in `seq 1 100`; do echo "unzip myfile$index.zip" >> tasklist.txt; done
# Ask the compute cluster to do this for me in parallel, using 4 CPUs/cores
bsub -q normal -n 4 -W 2:00 -R "rusage[mem=4000]" \
  parallel --joblog .log --outputasfiles -j \$LSB_MAX_NUM_PROCESSORS :::: tasklist.txt

Concept of Pleasant Parallelization
Problem: How do I BLAST 200,000 transcripts against NR?
Solution: Fake a parallel BLAST. But how?
• Divide your input file into n separate files
• BLAST each smaller input file on a separate core
• Running on n cores will be almost exactly n times faster!
Why?
• The cores don't need to talk to one another
• You could submit the n jobs individually, but this is not recommended
• Use more sophisticated techniques: job arrays, gnu_parallel, GridRunner
• Don't confuse this with truly parallel mpiBLAST
The efficiency of your work depends on how well you parallelize your task:
• You want to ensure that your jobs spend most of their time computing, not sitting in the queue or doing compute prep
[Diagram: timelines of schedule → module load → BLAST → job finish, for 1 large job versus 100 smaller jobs. Which would you choose?]

Manual (Script) Approach
• Split the input file into N files that each run 1 to 6 hrs
• Can be done with a perl or python script, unix split, etc.
• The user script parses the datafile whose name is passed as the command parameter

for file in my*.dat
do
  echo $file
  bsub -q normal -W 6:00 -R "rusage[mem=1000]" \
    python process_data_file.py $file
  sleep 1
done

For advanced users, this can be submitted as a single job array, a feature of most schedulers:

# create a script for the job array (process_data_file_array.py)
# and now submit the files as a job array
num_files=$(ls -1 my*.dat | wc -l)
bsub -J "myarray[1-$num_files]" -q normal -W 6:00 -R "rusage[mem=1000]" \
  python process_data_file_array.py

This process is ideal for serially numbered files, parameter sweeps, & optimization routines!!
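The slides reference process_data_file_array.py but do not show how each array element finds its input. One common pattern is sketched below, assuming LSF's LSB_JOBINDEX environment variable and the my*.dat naming used above; the wrapper file name is hypothetical:

# job_array_wrapper.sh (hypothetical name): run one array element's worth of work.
# LSF sets LSB_JOBINDEX to this element's index within the array.
file=$(ls -1 my*.dat | sed -n "${LSB_JOBINDEX}p")   # pick the Nth data file
python process_data_file.py "$file"

To use it, submit the bsub -J "myarray[1-$num_files]" command above with "sh job_array_wrapper.sh" in place of the python command.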
Parallel Code
Can my code be parallelized?
• Does it have large loops that repeat the same operations?
• Does your code do multiple tasks that are not dependent on one another? If so, is the dependency weak?
• Can any dependencies or information sharing be overlapped with computation? If not, is the amount of communication small?
• Do multiple tasks depend on the same data?
• Does the order of operations matter? If so, how strict does it have to be?

Basic guidance for efficient parallelization:
• Is it even worth parallelizing my code?
  • Does your code take an intractably long amount of time to complete?
  • Do you run a single large model, or do statistics on multiple small runs?
  • Would the amount of time it takes to parallelize your code be worth the gain in speed?
• Parallelizing established code vs. starting from scratch
  • Established code: may be easier and faster to parallelize, but may not give good performance or scaling
  • Starting from scratch: takes longer, but gives better performance and accuracy, and the opportunity to turn a "black box" into code you understand

Basic guidance for efficient parallelization:
• Increase the fraction of your program that can be parallelized: identify the most time-consuming parts of your program and parallelize them. This could require modifying your underlying algorithm and your code's organization.
• Balance the parallel workload
• Minimize time spent in communication
• Use simple arrays instead of user-defined derived types
• Partition data: distribute arrays and matrices, allocating specific memory for each MPI process

Designing parallel programs - partitioning:
One of the first steps in designing a parallel program is to break the problem into discrete "chunks" that can be distributed to multiple parallel tasks.
• Domain decomposition: the data associated with the problem is partitioned, and each parallel task works on a portion of the data. There are different ways to partition the data.
• Functional decomposition: the problem is decomposed according to the work that must be done. Each parallel task performs a fraction of the total computation.

Designing parallel programs - communication:
Most parallel applications require tasks to share data with each other.
• Cost of communication: computational resources are used to package and transmit data. Communication frequently requires synchronization, so some tasks wait instead of doing work, and it can saturate network bandwidth.
• Latency vs. bandwidth: latency is the time it takes to send a minimal message between two tasks; bandwidth is the amount of data that can be communicated per unit of time. Sending many small messages can cause latency to dominate communication overhead.
• Synchronous vs. asynchronous communication: synchronous communication is referred to as blocking, since other work stops until the communication is completed. Asynchronous communication is referred to as non-blocking, since other work can be done while the communication is taking place.
• Scope of communication: point-to-point communication is data transmission between two tasks; collective communication involves all tasks (in a communication group).
This is only a partial list of things to consider!

Designing parallel programs - load balancing:
Load balancing is the practice of distributing an approximately equal amount of work to each task, so that all tasks are kept busy all the time.
How to achieve load balance?
• Equally partition the work given to each task: for array/matrix operations, distribute the data set equally among parallel tasks; for loop iterations where the work done in each iteration is equal, distribute the iterations evenly among tasks.
• Use dynamic work assignment: certain classes of problems result in load imbalance even if the data is distributed evenly among tasks (sparse matrices, adaptive …)
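As an illustration of dynamic work assignment using the GNU parallel tool shown earlier (a sketch only; the task list and core count are placeholders), note that GNU parallel keeps a fixed pool of workers and hands the next pending task to whichever worker finishes first, so long and short tasks balance themselves across cores:

# tasklist.txt holds one command per line, with widely varying runtimes
# -j 4 keeps 4 tasks running at once; as soon as one finishes, the next line is started
parallel -j 4 --joblog balance.log :::: tasklist.txt

By contrast, statically splitting tasklist.txt into 4 fixed chunks up front can leave some cores idle if one chunk happens to contain most of the slow tasks.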