GNU Parallel for Bioinformatics

Pierre Lindenbaum
@yokofakun
http://plindenbaum.blogspot.com
October 15, 2013

Abstract

This document follows Ole Tange's parallel tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html. The sources of this document are available at https://github.com/lindenb/courses/tree/master/about.parallel

Contents

1 Input Source
  1.1 A single input source
    1.1.1 Input can be read from the command line
    1.1.2 The input source can be a file
    1.1.3 STDIN (standard input) can be the input source
  1.2 Multiple input sources
    1.2.1 If one of the input sources is too short, its values will wrap
    1.2.2 The input sources can be files
    1.2.3 STDIN can be one of the input sources using '-'
    1.2.4 Instead of -a files can be given after '::::'
    1.2.5 ':::' and '::::' can be mixed
  1.3 Matching arguments from all input sources
    1.3.1 With --xapply you can get one argument from each input source
    1.3.2 If one of the input sources is too short, its values will wrap
  1.4 Changing the argument separator
    1.4.1 Changing the argument file separator
    1.4.2 Changing the argument delimiter
    1.4.3 NULL can be given as '\0'
    1.4.4 A shorthand for '-d \0' is '-0'
2 Building the command line
    2.0.5 No command means arguments are commands
3 Replacement strings
  3.1 The 5 replacement strings
  3.2 Positional replacement strings
  3.3 Input from columns
  3.4 Header defined replacement strings
  3.5 More than one argument
  3.6 Quoting
  3.7 Trimming space
4 Controlling the output
  4.1 To see what commands will be run without running them
  4.2 To print the command before running them use --verbose
  4.3 GNU parallel will postpone the output until the command completes
  4.4 To get the output immediately use '--ungroup'
    4.4.1 Saving output into files
5 Control the execution
  5.1 Number of simultaneous jobs
6 Interactiveness
  6.1 Timing
  6.2 Progress
  6.3 Termination
  6.4 Limiting the resources
7 Remote execution
  7.1 Sshlogin
  7.2 Transferring files
  7.3 Working dir
  7.4 Avoid overloading sshd
  7.5 Ignore hosts that are down
  7.6 Transfer environment variables and functions
8 --pipe
  8.1 Chunk size
  8.2 Records
  8.3 Header
  8.4 Shebang
9 References

1 Input Source

1.1 A single input source

1.1.1 Input can be read from the command line

Example: determine the file type of a list of BAM files.

    $ parallel file ::: samtools-0.1.18/examples/*.bam

output:

    samtools-0.1.18/examples/ex1a.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1b.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmduppe.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmdupse.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1_sorted.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/toy.bam: gzip compressed data, extra field

1.1.2 The input source can be a file

Example: determine the file type of a list of BAM files.

    $ find samtools-0.1.18/examples/ -name "*.bam" -type f > listbams.txt
    $ parallel -a listbams.txt file

output:

    samtools-0.1.18/examples/ex1a.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1b.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmduppe.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmdupse.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1_sorted.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/toy.bam: gzip compressed data, extra field

1.1.3 STDIN (standard input) can be the input source

Example: determine the file type of a list of BAM files.

    $ find samtools-0.1.18/examples/ -name "*.bam" -type f | parallel file

output:

    samtools-0.1.18/examples/ex1a.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1b.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmduppe.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmdupse.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1_sorted.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/toy.bam: gzip compressed data, extra field

Example: indexing sorted BAM files with samtools.

    $ find dir1 -name "*.bam" | grep sorted | \
        parallel -a - 'samtools index'

or, without '-a -':

    $ find dir1 -name "*.bam" | grep sorted | \
        parallel 'samtools index'

Ole Tange: "The '-a -' construct is unnatural to me. It makes sense when you have multiple '-a' but if it is the only one, leave it out. That will also make it easier for people who are used to xargs."

1.2 Multiple input sources

Example: Print the combinations of two lists of nucleotides.

    $ parallel echo ::: A T G C ::: a t g c

output:

    A a
    A t
    A g
    A c
    T a
    T t
    T g
    T c
    G a
    G t
    G g
    G c
    C a
    C t
    C g
    C c

1.2.1 If one of the input sources is too short, its values will wrap

Example: Print the combinations of two lists of nucleotides. The second list is shorter.

    $ parallel echo ::: A T G C N ::: a t

output:

    A a
    A t
    T a
    T t
    G a
    G t
    C a
    C t
    N a
    N t

Each value of the first list is paired with every value of the second, so the shorter list is reused for each value of the longer one.

1.2.2 The input sources can be files

Example: Print the combinations of two files of nucleotides.

    $ echo -e "A\nT\nG\nC" > ATGC.txt
    $ echo -e "a\nt\ng\nc" > atgc.txt
    $ parallel -a ATGC.txt -a atgc.txt echo

output:

    A a
    A t
    A g
    A c
    T a
    T t
    T g
    T c
    G a
    G t
    G g
    G c
    C a
    C t
    C g
    C c

1.2.3 STDIN can be one of the input sources using '-'

Example: Print the combinations of one file of nucleotides and stdin.

    $ echo -e "a\nt\ng\nc" > atgc.txt
    $ echo -e "A\nT\nG\nC" | \
        parallel -a - -a atgc.txt echo

output:

    A a
    A t
    A g
    A c
    T a
    T t
    T g
    T c
    G a
    G t
    G g
    G c
    C a
    C t
    C g
    C c

1.2.4 Instead of -a files can be given after '::::'

Example: Print the combinations of two files of nucleotides.

    $ echo -e "A\nT\nG\nC" > ATGC.
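
As a rough serial cross-check of the multiple-input-source examples in section 1.2, the nested loops below print every pairing of two lists, which is the same set of lines that `parallel echo ::: A T G C ::: a t g c` produces. This is only a sketch of the semantics in plain shell: `combine_pairs` is a hypothetical helper name, not part of GNU parallel, and GNU parallel runs the jobs concurrently, so its line order may differ unless -k (--keep-order) is given.

```shell
#!/bin/sh
# Serial sketch of the pairings enumerated by two ':::' input sources.
# combine_pairs is a hypothetical helper, not a GNU parallel feature.
# $1 and $2 are space-separated lists of values.
combine_pairs() {
    # Outer loop: values of the first input source.
    for first in $1; do
        # Inner loop: every value of the second input source.
        for second in $2; do
            echo "$first $second"
        done
    done
}

# Same lists as the first example of section 1.2.
combine_pairs "A T G C" "a t g c"
```

Calling `combine_pairs "A T G C N" "a t"` gives the ten pairings of section 1.2.1, which shows that the shorter list is simply repeated for each value of the longer one.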
