Using Parallel Execution Perl Program on Multiple BioHPC Lab Machines
This Perl program is intended to run on multiple machines, and it can also run multiple processes on each machine. It takes 2 arguments:

    /programs/bin/perlscripts/perl_fork_univ_mn.pl JobListFile MachineFile

JobListFile is a file containing all the commands to execute, one per line. MachineFile is a file listing the machines to be used, one per line. Each machine name is followed by the number of processes to execute in parallel on that machine:

    cbsumm14 24
    cbsumm15 24
    cbsumm16 24

The parallel Perl driver is typically used when the number of tasks exceeds the number of cores. For example, when the number of libraries in an RNA-seq project is large (say 500), you can prepare a file with all the tophat tasks needed (one line per library, no '&' at the end of lines, each task using 7 cores) and then run 9 of them at a time on each of two 64-core machines (9 instances using 7 cores each, for a total of 63 cores on each machine).

Using multiple machines is more complicated than running the parallel Perl driver on a single machine. The local directory /workdir is not visible across machines, so the processes need to use the home directory for file storage and communication. You also need to enable program execution between BioHPC Lab machines without entering your password. This can be done with the following commands:

    cd ~
    ssh-keygen -t rsa
    cat .ssh/id_rsa.pub >> .ssh/authorized_keys
    chmod 640 .ssh/authorized_keys
    chmod 700 .ssh

When ssh-keygen prompts for a passphrase, leave it empty.

A good example is a PAML simulation on 110 genes. The example input data for multiple machines is in /programs/paml_mn.example.tar. If you would like to try it, you need to unpack the data into a subdirectory of your home directory. In this example it will be /home/jarekp/tmp, or ~/tmp (~/ denotes the home directory).
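The core arithmetic in the tophat example above can be sketched in a few lines of shell. This is only an illustration of how the per-machine process count might be derived: the helper names slots_per_machine and machine_file_lines, and the machine names machineA and machineB, are hypothetical and not part of the BioHPC software.

```shell
#!/bin/bash
# Hypothetical helper: how many tasks fit on one machine at a time,
# given the core count and the cores used by each task.
slots_per_machine() {
    local cores=$1 cores_per_task=$2
    echo $(( cores / cores_per_task ))   # integer division
}

# Hypothetical helper: print one "name slots" line per machine,
# in the format described for MachineFile.
machine_file_lines() {
    local slots=$1 name
    shift
    for name in "$@"; do
        printf '%s %s\n' "$name" "$slots"
    done
}

# 64 cores, 7 cores per tophat task: 9 tasks at a time (63 cores in use).
slots=$(slots_per_machine 64 7)
machine_file_lines "$slots" machineA machineB   # machine names are placeholders
```

Redirecting the output to a file would give something usable as MachineFile once the placeholder names are replaced with real machine names.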
Create the directory and unpack the archive:

    cd ~
    mkdir tmp
    cd tmp
    tar -xf /programs/paml_mn.example.tar

The task list is stored in the file 'tasklist', which has 110 lines in total and begins as follows. Note that there is no '&' at the line ends!

    ~/tmp/runpaml.sh 961
    ~/tmp/runpaml.sh 914
    ~/tmp/runpaml.sh 971
    ~/tmp/runpaml.sh 974
    ~/tmp/runpaml.sh 970
    ~/tmp/runpaml.sh 948

Each PAML execution is done via the runpaml.sh script. It copies files from the home directory to a subdirectory of /workdir, executes PAML there, and then copies the results back to the home directory (removing leftover files from /workdir):

    #!/bin/bash
    # Work in a per-user subdirectory of the local /workdir.
    cd /workdir
    if [ ! -e $LOGNAME ]
    then
        mkdir $LOGNAME
    fi
    cd $LOGNAME
    # Copy the input for task $1 from the home directory and run PAML on it.
    cp -ar ~/tmp/results/$1 .
    cd $1
    ~/tmp/paml4.7/bin/codeml my.control >& log
    cd ..
    # Copy the results back and remove the local copy.
    cp -arf $1 ~/tmp/results/
    rm -rf $1

$LOGNAME is an environment variable that is always set to the user's login name. $1 is the first argument on the script's command line (and the only one in this example).

To run the PAML simulations in parallel, execute the Perl parallel driver:

    /programs/bin/perlscripts/perl_fork_univ_mn.pl tasklist machines

I used 3 medium-memory machines in this example, with 24 cores (tasks) on each. Before running the driver, make sure you can connect to each machine using ssh from the "master" machine (the one on which you will be running perl_fork_univ_mn.pl). The first time you ssh into a machine from the master, you will need to allow it to be added to "known hosts":

    [jarekp@cbsumm14 tmp]$ ssh cbsumm16
    The authenticity of host 'cbsumm16 (128.84.3.236)' can't be established.
    RSA key fingerprint is 4d:5d:d1:a0:b1:c4:cb:e0:5d:b5:03:e0:8b:e8:b4:5f.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'cbsumm16,128.84.3.236' (RSA) to the list of known hosts.
    [jarekp@cbsumm16 ~]$
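The connectivity check above can also be scripted, so that every machine in the machine file is tested before the driver starts. The following is a minimal sketch, not part of the BioHPC software: the function names machine_names and check_machines are hypothetical, and it assumes the machine file has the "name count" layout described earlier. It relies on the standard OpenSSH options BatchMode and ConnectTimeout. With BatchMode=yes, ssh fails instead of prompting, so a machine that still needs a password or a host-key confirmation should show up as FAILED.

```shell
#!/bin/bash
# Sketch: verify key-based ssh logins to every machine in a machine file.

# Print the machine names (first column), skipping blank lines.
machine_names() {
    awk 'NF { print $1 }' "$1"
}

# Attempt a non-interactive login on each machine and report the result.
# Returns non-zero if any machine could not be reached without a prompt.
check_machines() {
    local machine_file=$1 host failed=0
    for host in $(machine_names "$machine_file"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true; then
            echo "OK      $host"
        else
            echo "FAILED  $host"
            failed=1
        fi
    done
    return "$failed"
}

# Run the check only when a machine file is given on the command line.
if [ "$#" -ge 1 ]; then
    check_machines "$1"
fi
```

Saved under a name of your choice and run with the machine file as its argument, it prints one line per machine. Any machine reported as FAILED can then be visited once by hand with ssh, as shown above, to accept its host key.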