GNU Parallel

Ramses van Zon
SciNet HPC
October 22, 2014

Data Deluge

Lots of data to process?
Many combinations of your data?
Have a program (script, command, function) for each case already?
You want it to use more resources?

Use Case 1: US Air Traffic Information

Air traffic in the US from 1987-2012, with flight data given in monthly files, in ‘comma separated value’ format.

    /scinet/course/ScalableDataAnalysis/usecases/airline

Questions we may want to answer using this data set:

What were the busiest airports in terms of number of flights, by month through the time period?
For each month, which airports had the longest departure delays?
For each calendar year, which airport had the most days where departure delays were no more than 1 hour?

GNU Parallel

Basically a Perl script.
But surprisingly versatile, esp. for text input.
Gets your many cases assigned to different cores and on different nodes without much hassle.

1. O. Tange, “GNU Parallel - The Command-Line Power Tool”, ;login: 36 (1), 42-47 (2011)
2. http://www.gnu.org/software/parallel/parallel_tutorial.html

GNU Parallel Example

    find . -name \*.csv | parallel grep -l 'JFK'

1. The Unix command find lists the files with the extension csv.
2. parallel, the GNU Parallel command, divides up the list of filenames.
3. For each filename, it executes the Unix grep command to find ‘JFK’.
4. It runs N commands at the same time, where N = number of cores.

The serial version might have been:

    find . -name \*.csv -exec grep -l 'JFK' {} \;

Why not roll your own?

    #!/bin/bash
    function dobyfour()
    {
      while [ -n "$1" ]
      do
        grep -l 'JFK' $1 &
        grep -l 'JFK' $2 &
        grep -l 'JFK' $3 &
        grep -l 'JFK' $4 &
        wait
        shift 4
      done
    }
    dobyfour $(find -name \*.csv)

What’s wrong with that?

Reinventing the wheel
More code to maintain/debug
No load balancing
No job control
No error checking
No fault tolerance
No multinode jobs

What is GNU Parallel good at?

We’ll see a number of common data-processing patterns in this workshop, but the following is one that suits GNU Parallel well:

1. Many tasks (“jobs”)
   GNU Parallel allows various input sources to specify the tasks.
2. Distribution
   Different ways to divide the jobs and to send input to the tasks.
3. Parallel execution
   Run multiple scripts/commands/programs at the same time. Load balancing. No communication.
4. Output
   Collect in files, to screen, ordered, non-ordered, etc. No reduction.

Let’s get practical

We’ll largely follow the topics in the GNU Parallel tutorial. (Many of the examples come from that tutorial too.)

    http://www.gnu.org/software/parallel/parallel_tutorial.html

Hands-on parts will be done on SciNet, so get logged in and get a node to yourself:

    $ qsub -l nodes=1:ppn=8,walltime=8:00:00 -I -X

Copy the example data and code:

    $ cd $SCRATCH
    $ cp -r /scinet/course/ScalableDataAnalysis .
    $ cd ScalableDataAnalysis/GnuParallel
    $ source setup

Input: A Single Input Source

Input can be read from the command line:

    $ parallel echo ::: A B C

Output (order may differ as the jobs are run in parallel):

    A
    B
    C

The input source can be a file:

    $ parallel echo :::: abc-file

Output: Same as above.

STDIN can be the input source:

    $ cat abc-file | parallel echo

Output: Same as above.
(abc-file is a file with three lines, containing “A”, “B”, and “C”, respectively.)

Multiple Input Sources

GNU Parallel can take multiple input sources on the command line. It then generates all combinations of the input sources:

    $ parallel echo ::: A B C ::: D E F
    A D
    A E
    A F
    B D
    B E
    B F
    C D
    C E
    C F

Match Arguments From Input Sources

With --xapply you can get one argument from each input source.

    $ parallel --xapply echo ::: A B C ::: D E F
    A D
    B E
    C F

Shorter inputs will wrap their values.

    $ parallel --xapply echo ::: A B C D E ::: F G
    A F
    B G
    C F
    D G
    E F

Commands

No command means the arguments are the commands:

    $ parallel ::: ls 'echo foo' pwd
    [list of files in current dir]
    foo
    [/path/to/current/working/dir]

The command can be a script, a binary, or a bash function if the function is exported using export -f. With script.sh containing

    func() {
      echo "in func $1"
    }
    export -f func
    parallel func ::: 1 2 3

running it gives:

    $ bash script.sh
    in func 1
    in func 2
    in func 3

Replacement Strings

The default replacement string is {}.

    $ parallel echo {} ::: A/B.C
    A/B.C

{.} removes the extension.

    $ parallel echo {.} ::: A/B.C
    A/B

{/} removes the path.

    $ parallel echo {/} ::: A/B.C
    B.C

Replacement Strings (continued)

{//} keeps only the path.

    $ parallel echo {//} ::: A/B.C
    A

{/.} removes the path and the extension.

    $ parallel echo {/.} ::: A/B.C
    B

{#} gives the job number:

    $ parallel echo {#} ::: A B C
    1
    2
    3

Replacement Strings (continued)

{%} gives the job slot number (between 1 and the number of jobs to run in parallel):

    $ parallel -j 2 echo {%} ::: A B C
    1
    2
    1

With multiple input sources, the argument from each individual input source can be accessed with {number}:

    $ parallel echo {1} and {2} ::: A B ::: C D
    A and C
    A and D
    B and C
    B and D

Input Data Manipulation

Using --no-run-if-empty, empty lines will be skipped.

    $ (echo 1; echo; echo 2) | parallel --no-run-if-empty echo
    1
    2

Whitespace can be trimmed from the arguments using --trim:

    $ parallel --trim lr echo pre-{}-post ::: ' A '
    pre-A-post

Column Input

Columns can be bound to positional replacement strings with --colsep:

    $ parallel --colsep ',' echo 1={1} 2={2} :::: csvfile.csv
    1=a 2=b
    1=A 2=B
    1=C 2=D

Here, csvfile.csv is a file with the three lines “a,b”, “A,B”, and “C,D”.

With --header, GNU Parallel will use the first value of the input source as the name of the replacement string.

    $ parallel --header : echo a={a} b={b} ::: a A B ::: b C D
    a=A b=C
    a=A b=D
    a=B b=C
    a=B b=D

Parallel execution

Specify the number of simultaneous jobs with -j.

    $ time parallel -j1 sleep ::: 1 2 4 3
    real 0m10.980s
    user 0m0.091s
    sys  0m0.038s

    $ time parallel -j2 sleep ::: 1 2 4 3
    real 0m5.323s
    user 0m0.089s
    sys  0m0.042s

    $ time parallel -j4 sleep ::: 1 2 4 3
    real 0m4.284s
    user 0m0.093s
    sys  0m0.038s

Why isn’t the last case (-j4) 4x faster than the first (-j1)?

Each core gets a job. When a job is done, that core is given a new one.

Remote execution, i.e., running on several nodes at once, is also possible.
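The timings on the "Parallel execution" slide can be reasoned about with a small scheduling model. The sketch below is plain bash and does not call GNU Parallel; the function name makespan is hypothetical, and the model assumes each job, taken in input order, goes to whichever slot becomes free first, ignoring start-up overhead. It illustrates why the total time is bounded below by the longest single job.

```shell
#!/bin/bash
# makespan SLOTS DURATION...
# Hypothetical helper: simulate greedy scheduling of jobs onto SLOTS job slots
# and print the time at which the last job finishes.
makespan() {
    local slots=$1; shift
    local -a free_at=()      # free_at[s] = time at which slot s becomes free
    local s d best end=0
    for ((s = 0; s < slots; s++)); do free_at[s]=0; done
    for d in "$@"; do
        # pick the slot that becomes free first (lowest index wins ties)
        best=0
        for ((s = 1; s < slots; s++)); do
            (( free_at[s] < free_at[best] )) && best=$s
        done
        free_at[best]=$(( free_at[best] + d ))
        (( free_at[best] > end )) && end=${free_at[best]}
    done
    echo "$end"
}

# The sleep durations used on the slide: 1 2 4 3
for j in 1 2 4; do
    echo "-j$j: $(makespan "$j" 1 2 4 3) s"
done
```

Under these assumptions the model gives 10 s, 5 s, and 4 s for -j1, -j2, and -j4, close to the measured 10.98 s, 5.32 s, and 4.28 s. With four slots every job starts immediately, so the run cannot finish before the 4-second job does.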
Load Balancing

[Figure: core utilization over time for the same workload, without and with GNU Parallel.]

Without GNU Parallel: 17 hours, 42% utilization
With GNU Parallel: 10 hours, 72% utilization

Distributing Input Arguments

By default, one argument is given to one command.

--xargs will fit as many arguments as possible on a single line:

    $ cat num30000 | parallel --xargs echo | wc -l
    2

Here, num30000 is a file with 30,000 lines.

The maximal length of a single line can be set with -s:

    $ cat num30000 | parallel --xargs -s 10000 echo | wc -l
    17

Distribute arguments evenly over cores with -m instead of --xargs.

To limit the number of arguments in each command, use -N:

    $ parallel -N3 echo ::: A B C D E F G H
    A B C
    D E F
    G H

Output Control

Output order depends on when a job finishes. You can restore the original order with -k.

GNU Parallel postpones output for each job until it completes, but even this can be disabled with -u.

    $ export c='printf "%s-start\n%s" {} {};sleep {}; printf "%s\n" -middle;echo {}-end'

    $ parallel -j2 $c ::: 4 2 1
    2-start
    2-middle
    2-end
    1-start
    1-middle
    1-end
    4-start
    4-middle
    4-end

    $ parallel -j2 -u $c ::: 4 2 1
    4-start
    42-start
    2-middle
    2-end
    1-start
    1-middle
    1-end
    -middle
    4-end

Output Control (continued)

The output can be prefixed with the argument.
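As a conceptual illustration of the buffering and ordering described under "Output Control", the following plain-bash sketch holds each job's output in its own temporary file and then prints the files in input order. It is not how GNU Parallel is implemented and does not call it; the function job and the durations 2 and 1 are made up for the example.

```shell
#!/bin/bash
# Conceptual sketch only: emulate "postpone each job's output until it
# completes" and "print results in input order" with temporary files.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# Hypothetical job: report start, sleep for the given number of seconds, report end.
job() {
    printf '%s-start\n' "$1"
    sleep "$1"
    printf '%s-end\n' "$1"
}

args=(2 1)
i=0
for a in "${args[@]}"; do
    i=$((i + 1))
    job "$a" > "$tmpdir/$i.out" &   # each background job writes to its own buffer file
done
wait                                 # block until every job has finished

# Emit the buffers in input order, regardless of which job finished first.
ordered=$(for n in $(seq 1 "$i"); do cat "$tmpdir/$n.out"; done)
printf '%s\n' "$ordered"
```

The 1-second job finishes first, yet its lines are printed after those of the 2-second job, and lines from different jobs are never interleaved. Writing directly to the terminal instead of to per-job files would correspond to the unbuffered behaviour.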
