
GNU Parallel for Bioinformatics.

Pierre Lindenbaum
@yokofakun
http://plindenbaum.blogspot.com
October 15, 2013

Abstract

This document follows Ole Tange's parallel tutorial http://www.gnu.org/software/parallel/parallel_tutorial.html. The sources of this document are available at https://github.com/lindenb/courses/tree/master/about.parallel

Contents

1 Input Source
  1.1 A single input source
    1.1.1 Input can be read from the command line.
    1.1.2 The input source can be a file
    1.1.3 STDIN (standard input) can be the input source
  1.2 Multiple input sources
    1.2.1 If one of the input sources is too short, its values will wrap
    1.2.2 The input sources can be files
    1.2.3 STDIN can be one of the input sources using '-'
    1.2.4 Instead of -a files can be given after '::::'
    1.2.5 ':::' and '::::' can be mixed:
  1.3 Matching arguments from all input sources
    1.3.1 With --xapply you can get one argument from each input source
    1.3.2 If one of the input sources is too short, its values will wrap
  1.4 Changing the argument separator
    1.4.1 Changing the argument file separator
    1.4.2 Changing the argument delimiter
    1.4.3 NULL can be given as '\0'
    1.4.4 A shorthand for '-d \0' is '-0'
2 Building the command line
    2.0.5 No command means arguments are commands
3 Replacement strings
  3.1 The 5 replacement strings
  3.2 Positional replacement strings
  3.3 Input from columns
  3.4 Header defined replacement strings
  3.5 More than one argument
  3.6 Quoting
  3.7 Trimming space
4 Controlling the output
  4.1 To see what commands will be run without running them
  4.2 To print the command before running them use --verbose
  4.3 GNU parallel will postpone the output until the command completes
  4.4 To get the output immediately use '--ungroup'
    4.4.1 Saving output into files
5 Control the execution.
  5.1 Number of simultaneous jobs.
6 Interactiveness.
  6.1 Timing
  6.2 Progress
  6.3 Termination
  6.4 Limiting the resources
7 Remote execution
  7.1 Sshlogin
  7.2 Transferring files
  7.3 Working dir
  7.4 Avoid overloading sshd
  7.5 Ignore hosts that are down
  7.6 Transfer environment variables and functions
8 --pipe
  8.1 Chunk size
  8.2 Records
  8.3 Header
  8.4 Shebang
9 References

1 Input Source

1.1 A single input source

1.1.1 Input can be read from the command line.

Example: determine the file type of a list of BAM files.

    $ parallel file ::: samtools-0.1.18/examples/*.bam

output:

    samtools-0.1.18/examples/ex1a.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1b.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmduppe.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmdupse.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1_sorted.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/toy.bam: gzip compressed data, extra field

1.1.2 The input source can be a file

Example: determine the file type of a list of BAM files.

    $ find samtools-0.1.18/examples/ -name "*.bam" -type f > listbams.txt
    $ parallel -a listbams.txt file

output:

    samtools-0.1.18/examples/ex1a.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1b.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmduppe.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmdupse.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1_sorted.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/toy.bam: gzip compressed data, extra field

1.1.3 STDIN (standard input) can be the input source

Example: determine the file type of a list of BAM files.

    $ find samtools-0.1.18/examples/ -name "*.bam" -type f | parallel file

output:

    samtools-0.1.18/examples/ex1a.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1b.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmduppe.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1f-rmdupse.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/ex1_sorted.bam: gzip compressed data, extra field
    samtools-0.1.18/examples/toy.bam: gzip compressed data, extra field

Example: indexing sorted BAM files with samtools:

    $ find dir1 -name "*.bam" | grep sorted |\
        parallel -a - 'samtools index'

or, without '-a -':

    $ find dir1 -name "*.bam" | grep sorted |\
        parallel 'samtools index'

Ole Tange: "The '-a -' construct is unnatural to me. It makes sense when you have multiple '-a' but if it is the only one, leave it out. That will also make it easier for people who are used to xargs."

1.2 Multiple input sources

Example: Print the combinations of two lists of nucleotides.

    $ parallel echo ::: A T G C ::: a t g c

output:

    A a
    A t
    A g
    A c
    T a
    T t
    T g
    T c
    G a
    G t
    G g
    G c
    C a
    C t
    C g
    C c

1.2.1 If one of the input sources is too short, its values will wrap

Example: Print the combinations of two lists of nucleotides. The second list is shorter. Note that without --xapply every value of the first source is still combined with every value of the second; values wrap only with --xapply (section 1.3.2).

    $ parallel echo ::: A T G C N ::: a t

output:

    A a
    A t
    T a
    T t
    G a
    G t
    C a
    C t
    N a
    N t

1.2.2 The input sources can be files

Example: Print the combinations of two files of nucleotides.

    $ echo -e "A\nT\nG\nC" > ATGC.txt
    $ echo -e "a\nt\ng\nc" > atgc.txt
    $ parallel -a ATGC.txt -a atgc.txt echo

output:

    A a
    A t
    A g
    A c
    T a
    T t
    T g
    T c
    G a
    G t
    G g
    G c
    C a
    C t
    C g
    C c

1.2.3 STDIN can be one of the input sources using '-'

Example: Print the combinations of one file of nucleotides and stdin.

    $ echo -e "a\nt\ng\nc" > atgc.txt
    $ echo -e "A\nT\nG\nC" |\
        parallel -a - -a atgc.txt echo

output:

    A a
    A t
    A g
    A c
    T a
    T t
    T g
    T c
    G a
    G t
    G g
    G c
    C a
    C t
    C g
    C c

1.2.4 Instead of -a files can be given after '::::'

Example: Print the combinations of two files of nucleotides.

    $ echo -e "A\nT\nG\nC" > ATGC.
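
The '::::' form named in section 1.2.4 can be sketched as follows. This is only a minimal illustration, not the author's listing: the function name nucleotide_pairs and the file names upper.txt and lower.txt are hypothetical, the files are written to a temporary directory, and -k (--keep-order) is added so that the output order is deterministic. The guard at the end assumes that GNU parallel (not the moreutils tool of the same name) may or may not be installed.

```shell
# Sketch only: nucleotide_pairs, upper.txt and lower.txt are hypothetical names.
# The function body runs in a subshell, so 'tmp' does not leak into the caller.
nucleotide_pairs() (
    tmp=$(mktemp -d) || exit 1
    printf '%s\n' A T G C > "$tmp/upper.txt"
    printf '%s\n' a t g c > "$tmp/lower.txt"
    # '::::' is followed by file names; each file is one input source.
    # -k (--keep-order) prints the results in input order.
    parallel -k echo :::: "$tmp/upper.txt" "$tmp/lower.txt"
    rm -r "$tmp"
)

# Run the sketch only when GNU parallel is actually available.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    nucleotide_pairs
else
    echo "GNU parallel not found; skipping" >&2
fi
```

With GNU parallel installed, this should print the same 16 pairs as the '-a ATGC.txt -a atgc.txt' form of section 1.2.2, from 'A a' to 'C c'. The only difference between the two forms is where the file names go: after '-a' options before the command, or after '::::' at the end of the command line.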