GNU Parallel

Ramses van Zon

SciNet HPC

October 22, 2014

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 1 / 26

Data Deluge

- Lots of data to process?
- Many combinations of your data?
- Have a program (script, command, function) for each case already?
- You want it to use more resources?

Use Case 1: US Air Traffic Information

Air traffic in the US from 1987-2012, with flight data given in monthly files, in ‘comma separated value’ format:

/scinet/course/ScalableDataAnalysis/usecases/airline

Questions we may want to answer using this data set:

- What were the busiest airports in terms of number of flights, by month through the time period?
- For each month, which airports had the longest departure delays?
- For each calendar year, which airport had the most days where departure delays were no more than 1 hour?

GNU Parallel

- Basically a script.
- But surprisingly versatile, esp. for text input.
- Gets your many cases assigned to different cores and on different nodes without much hassle.

1. O. Tange, “GNU Parallel - The Command-Line Power Tool”, ;login: 36 (1), 42-47 (2011)
2. http://www.gnu.org/software/parallel/parallel_tutorial.html

GNU Parallel Example

find . -name \*.csv | parallel grep -l 'JFK'

1. The Unix command find lists files with the extension csv.
2. parallel, the GNU Parallel command, divides up the list of filenames.
3. For each filename, it executes the Unix grep command to find ‘JFK’.
4. It runs N commands at the same time, where N = number of cores.

The serial version might have been:

find . -name \*.csv -exec grep -l 'JFK' {} \;
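To make the division of labour concrete, here is a minimal Python sketch of the same pattern: one independent search per file, with at most N running at once. It does not describe how GNU Parallel is implemented. The names `contains` and `files_containing` and their parameters are made up for this illustration, and a thread pool stands in for separate grep processes.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def contains(path, needle):
    """Return True if the text file at `path` contains `needle` (like grep -l)."""
    with open(path, "r", errors="replace") as handle:
        return any(needle in line for line in handle)


def files_containing(root, needle, suffix=".csv", max_workers=None):
    """Search every file under `root` ending in `suffix`, several at a time.

    max_workers plays the role of N; by default it is the number of cores.
    """
    paths = sorted(p for p in Path(root).rglob("*" + suffix) if p.is_file())
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, whatever order jobs finish in.
        hits = list(pool.map(lambda p: contains(p, needle), paths))
    return [str(p) for p, hit in zip(paths, hits) if hit]
```

Calling `files_containing(".", "JFK")` would list the csv files below the current directory that mention JFK.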

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 5 / 26 Why not roll your own?

#!/bin/ function dobyfour() What’s wrong with that? { while [ -n"$1"] Reinventing the wheel do More code to maintain/debug grep -l ’JFK’ $1& No load balancing grep -l ’JFK’ $2& grep -l ’JFK’ $3& No job control grep -l ’JFK’ $4& No error checking wait shift4 No fault tolerance done No multinode jobs } dobyfour $(find -name \*.csv)

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 6 / 26 What is GNU Parallel good at?

We’ll see a number of common data-processing patterns in this workshop, but the following is one that suits GNU parallel well:

1. Many tasks (“jobs”) GNU parallel allows various input sources to specify the tasks 2. Distribution Different ways to divide the jobs, and sending input to tasks 3. Parallel Execution Run multiple scripts/commands/programs at the same time Load balancing. No communication 4. Output Collect in files, to screen, ordered, non-ordered, etc. No reduction

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 7 / 26 Let’s get practical

We’ll largely follow the topics in the GNU Parallel tutorial. (Many of the examples come from that tutorial too.) http://www.gnu.org/software/parallel/parallel tutorial.html Hands-on parts will be done on SciNet, so get logged in and get a node to yourself:

$ qsub -l node=1:ppn=8,walltime=8:00:00 -I -X

Copy the example data and code

$ cd $SCRATCH $ cp - /scinet/course/ScalableDataAnalysis . $ cd ScalableDataAnalysis/GnuParallel $ source setup

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 8 / 26 Input: A Single Input Source

.. . Input can be read from the The input source can be a file: command line: $ parallel :::: abc-file echo $ parallel echo ::: A B C Output: Same as above. Output (order may differ as the . jobs are run in parallel): . STDIN can be the input source: A $ cat abc-file | parallel echo B C Output: Same as above. . .

(abc-file is a files with three lines, containing “A”, “B”, and “C”, respectively)

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 9 / 26 Multiple Input Sources

GNU Parallel can take multiple input sources on the command line. It then generates all combinations of the input sources:

$ parallel echo ::: A B C ::: D E F AD AE AF BD BE BF CD CE CF
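"All combinations" here means the Cartesian product of the input sources. A minimal Python sketch of that idea follows; `combinations_of` is a name chosen for this illustration, not part of GNU Parallel.

```python
from itertools import product


def combinations_of(*sources):
    """Return one argument tuple per job: every combination of the sources."""
    return list(product(*sources))


# Three values times three values gives nine jobs.
for args in combinations_of(["A", "B", "C"], ["D", "E", "F"]):
    print(" ".join(args))
```

The number of jobs is the product of the source lengths, so adding input sources grows the job list quickly.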

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 10 / 26 Match Arguments From Input Sources

With --xapply you can get one argument from each input source.

$ parallel --xapply echo ::: A B C ::: D E F AD BE CF

Shorter inputs will wrap their values.

$ parallel --xapply echo ::: A B C D E ::: F G AF BG CF DG EF
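The pairing and wrapping behaviour can be mimicked in a few lines of Python. This is only a sketch of the semantics shown in the examples above; `linked_arguments` is a hypothetical helper name.

```python
from itertools import cycle, islice


def linked_arguments(*sources):
    """Pair the i-th value of each source; shorter sources wrap around."""
    longest = max(len(source) for source in sources)
    # Repeat each source cyclically until it is as long as the longest one.
    wrapped = [list(islice(cycle(source), longest)) for source in sources]
    return list(zip(*wrapped))


for args in linked_arguments(["A", "B", "C", "D", "E"], ["F", "G"]):
    print(" ".join(args))
```

The number of jobs equals the length of the longest source, in contrast to the product of lengths obtained without --xapply.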

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 11 / 26 Commands

No command means arguments are commands

$ parallel ::: ls ’echo foo’ pwd [list of files in current dir] foo [/path/to/current/working/dir]

The command can be a script, a binary or a bash function if the . function is exported using export -f: func() { echo "in func $1" $ bash script.sh } in func 1 export -f func in func 2 parallel func ::: 1 2 3 in func 3

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 12 / 26 Replacement Strings

Default replacement string is {}.

$ parallel echo {} ::: A/B.C A/B.C

{.} removes the extension.

$ parallel echo {.} ::: A/B.C A/B

{/} removes the path.

$ parallel echo {/} ::: A/B.C B.C

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 13 / 26 Replacement Strings (continued)

{//} keeps only the path.

$ parallel echo {//} ::: A/B.C A

{/.} removes path and extension.

$ parallel echo {/.} ::: A/B.C B

{#} gives the job number:

$ parallel echo {#} ::: A B C 1 2 3
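For readers who think in terms of path functions, the replacement strings shown so far correspond roughly to dirname, basename and extension stripping. The sketch below expresses that correspondence in Python; `expand` is a hypothetical helper written for this illustration and only covers the six strings listed in its table.

```python
import os.path
import re


def expand(template, arg, job_number):
    """Expand a few GNU Parallel-style replacement strings in `template`."""
    directory, filename = os.path.split(arg)
    replacements = {
        "{}": arg,                                  # the argument itself
        "{.}": os.path.splitext(arg)[0],            # without extension
        "{/}": filename,                            # without path
        "{//}": directory or ".",                   # path only
        "{/.}": os.path.splitext(filename)[0],      # no path, no extension
        "{#}": str(job_number),                     # job sequence number
    }
    pattern = re.compile("|".join(re.escape(token) for token in replacements))
    # Substitute in a single pass so inserted values are never re-expanded.
    return pattern.sub(lambda match: replacements[match.group(0)], template)


for token in ("{}", "{.}", "{/}", "{//}", "{/.}", "{#}"):
    print(token, "->", expand(token, "A/B.C", 1))
```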

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 14 / 26 Replacement Strings (continued)

{%} gives the job slot number (between 1 and number of jobs to run in parallel)

$ parallel -j 2 echo {%} ::: A B C 1 2 1

With multiple input sources the argument from the individual input sources can be access with {number}

$ parallel echo {1} and {2} ::: A B ::: C D A and C A and D B and C B and D

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 15 / 26 Input Data Manipulation

Using --no-run-if-empty, empty lines will be skipped.

$ (echo 1; echo; echo2) | parallel --no-run-if-empty echo 1 2

Space can be trimmed on the arguments using --trim:

$ parallel --trim lr echo pre-{}-post ::: ’ A ’ pre-A-post

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 16 / 26 Column Input

Columns can be bound to positional replacement strings with --colsep

$ parallel --colsep ’,’ echo 1={1} 2={2} :::: csvfile.csv 1=a 2=b 1=A 2=B 1=C 2=D

Here, csvfile.csv is a file with the three lines “a,b”, “A,B”, and “C,D”. With --header, GNU Parallel will use the first value of the input source as the name of the replacement string.

$ parallel --header: echo a= {a} b={b} ::: a A B ::: b C D a=A b=C a=A b=D a=B b=C a=B b=D
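The column binding done by --colsep can be pictured as splitting every input line and numbering the pieces from 1. A small Python sketch under that reading follows; `split_columns` and `fill` are names invented here, and the separator is treated as a plain string for simplicity.

```python
def split_columns(lines, colsep=","):
    """Split each input line into columns; column k becomes argument {k}."""
    return [line.rstrip("\n").split(colsep) for line in lines]


def fill(template, columns):
    """Replace {1}, {2}, ... in `template` with the matching column."""
    for position, value in enumerate(columns, start=1):
        template = template.replace("{%d}" % position, value)
    return template


# Same three lines as in csvfile.csv above.
for row in split_columns(["a,b\n", "A,B\n", "C,D\n"]):
    print(fill("1={1} 2={2}", row))
```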

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 17 / 26 Parallel execution

$ time parallel -j1 sleep ::: 1 2 4 3 Specify number of real 0m10.980s simultaneous jobs with -j. user 0m0.091s *Why isn’t the latter case sys 0m0.038s 4x faster than the first? $ time parallel -j2 sleep ::: 1 2 4 3 real 0m5.323s Each core gets a job. user 0m0.089s When job is done, give sys 0m0.042s that core a new one. $ time parallel -j4 sleep ::: 1 2 4 3 real 0m4.284s Remote execution, i.e., user 0m0.093s running on several nodes sys 0m0.038s at once, also possible.
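One way to reason about these timings is to simulate the rule "when a slot is free, it takes the next job". The Python sketch below does that for the sleep durations used above. It is a simplified model that ignores start-up overhead, and `makespan` is a name chosen for this illustration.

```python
import heapq


def makespan(durations, slots):
    """Simulate handing each job, in order, to the first slot that becomes free.

    Returns the time at which the last job finishes.
    """
    free_at = [0.0] * slots              # time at which each slot is next idle
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


for slots in (1, 2, 4):
    print(slots, makespan([1, 2, 4, 3], slots))
```

The model gives 10, 5 and 4 seconds for 1, 2 and 4 slots, close to the measured wall-clock times. With four slots the run cannot finish before the longest job (4 seconds) does, which is why the speed-up is 2.5x rather than 4x.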

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 18 / 26 Load Balancing

Without GNU Parallel: With GNU Parallel:

10 hours 72% utilization

17 hours 42% utilization .
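The difference between fixed, up-front assignment and handing out jobs as cores become free can be illustrated with a toy model. The job durations below are invented for the example and are not the data behind the figure; the function names are likewise hypothetical.

```python
import heapq


def static_makespan(durations, workers):
    """Deal jobs out round-robin in advance; each worker runs its own list."""
    totals = [0.0] * workers
    for index, duration in enumerate(durations):
        totals[index % workers] += duration
    return max(totals)


def dynamic_makespan(durations, workers):
    """Give the next job to whichever worker becomes free first."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for duration in durations:
        heapq.heappush(free_at, heapq.heappop(free_at) + duration)
    return max(free_at)


def utilization(durations, workers, makespan):
    """Fraction of the reserved core-time that is spent doing work."""
    return sum(durations) / (workers * makespan)


jobs = [8, 1, 7, 1, 6, 1, 5, 1]   # hypothetical job lengths in hours
for name, func in (("static", static_makespan), ("dynamic", dynamic_makespan)):
    total = func(jobs, 2)
    print(name, total, round(utilization(jobs, 2, total), 2))
```

In this contrived case the static split leaves one worker idle most of the time, while dynamic assignment keeps both busy until the end.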

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 19 / 26 Distributing Input Arguments By default, one argument is given to one command -- will fit as many arguments as possible on a single line: $ cat num30000 | parallel --xargs echo | wc -l 2

...... Here, num30000 is a file with 30,000 lines. The maximal length of a single line can be set with -s: $ cat num30000 | parallel --xargs -s 10000 echo | wc -l 17
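The packing idea behind --xargs with -s can be sketched as a greedy fill: keep appending arguments to the current command line until the next one would exceed the limit. The Python below is a simplified model, not GNU Parallel's actual length accounting; `pack_arguments` and its parameters are made up for this illustration.

```python
def pack_arguments(args, command="echo", max_chars=60):
    """Greedily fill command lines with arguments without exceeding max_chars.

    Counts only the command, the arguments and single spaces between them.
    A single argument that is too long still gets a line of its own.
    """
    lines, current = [], command
    for arg in args:
        candidate = current + " " + arg
        if len(candidate) > max_chars and current != command:
            lines.append(current)            # line is full: start a new one
            candidate = command + " " + arg
        current = candidate
    if current != command:
        lines.append(current)
    return lines


numbers = [str(n) for n in range(1, 31)]
for line in pack_arguments(numbers, max_chars=30):
    print(line)
```

A smaller limit produces more, shorter command lines, which is the effect seen when -s 10000 raises the line count above.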

Distribute arguments evenly over cores with -m instead of --xargs.

To limit the number of arguments in each command, use -N:

$ parallel -N3 echo ::: A B C D E F G H
A B C
D E F
G H

Output Control

Output order depends on when a job finishes. You can restore the original order with -k.

GNU Parallel postpones output for each job until it completes, but even this can be disabled with -u.

$ export c='printf "%s-start\n%s" {} {};sleep {}; printf "%s\n" -middle;echo {}-end'

$ parallel -j2 $c ::: 4 2 1
2-start
2-middle
2-end
1-start
1-middle
1-end
4-start
4-middle
4-end

$ parallel -j2 -u $c ::: 4 2 1
4-start
42-start
2-middle
2-end
1-start
1-middle
1-end
-middle
4-end

Output Control (continued)

The output can be prefixed with the argument.

$ parallel --tag echo foo-{} ::: A B C
A foo-A
B foo-B
C foo-C

To print the commands before running them, use --verbose.

$ parallel --verbose echo {} ::: A B C
echo A
echo B
A
echo C
B
C
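The difference between completion order and input order (the -k option under Output Control) has a direct analogue in Python's concurrent.futures, which may help when reading parallel's output. This is an analogy rather than a description of GNU Parallel internals; `job` and `run` are names invented for the sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def job(seconds):
    """Stand-in for a real task: sleep, then report which job this was."""
    time.sleep(seconds)
    return seconds


def run(durations, keep_order):
    """Run all jobs at once; return results in input or completion order."""
    with ThreadPoolExecutor(max_workers=max(1, len(durations))) as pool:
        if keep_order:
            # Results come back in the order the inputs were given (like -k).
            return list(pool.map(job, durations))
        futures = [pool.submit(job, d) for d in durations]
        # Results come back as each job finishes.
        return [f.result() for f in as_completed(futures)]


print(run([0.3, 0.2, 0.1], keep_order=True))
print(run([0.3, 0.2, 0.1], keep_order=False))
```

Keeping the input order costs nothing in run time here; it only delays when a finished job's result is reported.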

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 22 / 26 Output to files --files makes output go to files (names written to STDOUT) $ parallel --tmpdir outdir --files echo ::: A B outdir/pAh6uWuQCg.par outdir/opjhZCzAX4.par

Impose structure on output files with --results $parallel --results outdir --files echo ::: A B outdir/1/A/stdout outdir/1/B/stdout

The structure is more useful if you are running multiple variables: $ parallel --files --results outdir echo ::: A B ::: C D outdir/1/A/2/C/stdout outdir/1/A/2/D/stdout outdir/1/B/2/C/stdout outdir/1/B/2/D/stdout
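When post-processing such a results tree it helps to be able to predict the file names. The Python sketch below generates the paths in the layout shown in the listings above (input source number, then value, repeated, then stdout). It assumes that layout and nothing more; `result_paths` is a hypothetical helper.

```python
from itertools import product
from pathlib import PurePosixPath


def result_paths(outdir, *sources):
    """List the stdout path for every combination of the input sources."""
    paths = []
    for combo in product(*sources):
        path = PurePosixPath(outdir)
        for number, value in enumerate(combo, start=1):
            path = path / str(number) / value   # e.g. outdir/1/A/2/C
        paths.append(str(path / "stdout"))
    return paths


for path in result_paths("outdir", ["A", "B"], ["C", "D"]):
    print(path)
```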

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 23 / 26 Monitoring --progress gives progress information

$ parallel --progress sleep ::: 1 3 2 2 1 3 3 2 1 Computers / CPU cores / Max jobs to run 1:local / 2 / 2

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete local:0/9/100%/1.1s

--joblog creates a logfile of the jobs completed so far.

$ parallel --joblog /tmp/log exit ::: 1 2 0 $ cat /tmp/log Seq Host Starttime Runtime Send Receive Exitval Signal Command 1: 1376577364.974 0.008 0 0 1 0 exit1 2: 1376577364.982 0.013 0 0 2 0 exit2 3: 1376577365.003 0.003 0 0 0 0 exit0

Can then stop and later pickup using --resume.
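A job log is convenient for finding out which jobs failed. The Python sketch below parses the sample log shown above and lists the jobs with a nonzero exit value. It assumes the log is tab-separated with a header line and relies only on the Seq, Exitval and Command columns; `failed_jobs` and `SAMPLE_LOG` are names made up for this illustration.

```python
import csv
import io

# The log from the example above, assumed to be tab-separated.
SAMPLE_LOG = (
    "Seq\tHost\tStarttime\tRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
    "1\t:\t1376577364.974\t0.008\t0\t0\t1\t0\texit 1\n"
    "2\t:\t1376577364.982\t0.013\t0\t0\t2\t0\texit 2\n"
    "3\t:\t1376577365.003\t0.003\t0\t0\t0\t0\texit 0\n"
)


def failed_jobs(log_text):
    """Return (sequence number, command) for every job with nonzero Exitval."""
    rows = csv.DictReader(io.StringIO(log_text), delimiter="\t",
                          quoting=csv.QUOTE_NONE)
    return [(int(row["Seq"]), row["Command"])
            for row in rows if int(row["Exitval"]) != 0]


print(failed_jobs(SAMPLE_LOG))
```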

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 24 / 26 Remote execution Using ssh, one can run jobs on remote servers.

$ parallel -S gpc01 echo running on ::: gpc01 running on gpc01

Can also read node names from a file.

$ parallel --slf $PBS_NODEFILE echo ::: run remotely now run remotely now

Set the working directory on the remote machines with --workdir. Running the same commands on all hosts with --onall. Run a single command on all hosts with --nonall. Transfer environment variables with --env.

Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 25 / 26 Hands-on For each month in 1988-2012, list airports with longest dep delay: 1988 01 1389.0 [’LGA’] 1988 02 1320.0 [’TUS’, ’DCA’] ... from csv import reader def maxdelay(file): Use this python script rd = reader(open(file,’rb’)) (maxdelay.py). hd = rd.next() Also do a scaling test, i.e., dl = hd.index(’DEP_DELAY’) on = hd.index(’ORIGIN’) see how long this takes use ls = [[float(r[dl]),r[on]] 2,4,6 and 8 cores. for r in rd if r[dl]!=’’] . dt = max(ls)[0] ps = [r[1] for r in ls if r[0]==dt]. return dt,list(set(ps)) . if __name__ == ’__main__’: . from sys import argv . if len(argv)>1: . dt, ls = maxdelay(argv[1]). print dt, ls Ramses van Zon (SciNet HPC) GNU Parallel October 22, 2014 26 / 26