
Nagarajan Kathiresan, Ph.D., Computational Scientist, KAUST Supercomputing Lab, [email protected]

Agenda
• UNIX tools for Bioinformatics
• A simple job script on the command line!
• What is a workflow? How can I build it?
• How to address job dependencies?
• When to use job arrays?
• Optimization in workflow design.

Note: Some bioinformatics tools, such as bwa (Burrows-Wheeler Aligner), Samtools, and Picard/GATK, are used for the explanations.

UNIX tools for Bioinformatics
Data transfer and searching for patterns

Move data between two systems
The rsync utility is very useful for synchronizing files and directories between two different servers.
• Copying from the local machine to a remote machine:
  rsync local_directory remote_server_name:remote_directory
• Copying from a remote machine to the local machine:
  rsync remote_server_name:remote_directory local_directory

Useful rsync options:
-a  archive mode
-r  recurse into subdirectories
-v  verbose
-x  don't cross filesystem boundaries
-H  preserve hard links
-P  show progress and keep partially transferred files
-n  no-op (dry run)

Example:
$ rsync -arvxHP my_data [email protected]:/

Search for a pattern
• grep, egrep, fgrep
• wc
• | (pipe character)
• cut
• awk
• sort
• uniq
• …

Working with genome files
• Fasta
• Indexed Fasta
• Compressed Fastq
• Compressed VCF
• BAM
• Sorted BAM
• SAM
• GTF

Working with a fasta file
$ more Aegilops_tauschii.Aet_v4.0.ncrna.fa

Extract the headers from the FASTA file
grep, egrep, fgrep: print lines matching a pattern
-i, --ignore-case   : ignore case
-v, --invert-match  : "invert", get the lines NOT matching the pattern
-w, --word-regexp   : get the lines matching the whole pattern as a word
-o, --only-matching : get only the matching part
egrep = grep -E (--extended-regexp)
fgrep = grep -F (--fixed-strings)

Word count
wc: count the number of lines, words and characters in a given file
$ wc Aegilops_tauschii.Aet_v4.0.ncrna.fa
13525 48871 1247270 Aegilops_tauschii.Aet_v4.0.ncrna.fa

$ wc -l Aegilops_tauschii.Aet_v4.0.ncrna.fa
13525 Aegilops_tauschii.Aet_v4.0.ncrna.fa

$ wc -w Aegilops_tauschii.Aet_v4.0.ncrna.fa
48871 Aegilops_tauschii.Aet_v4.0.ncrna.fa

$ wc -c Aegilops_tauschii.Aet_v4.0.ncrna.fa
1247270 Aegilops_tauschii.Aet_v4.0.ncrna.fa

Question: How do I count the number of sequences in the above fasta file? Answer:

$ grep -c ">" Aegilops_tauschii.Aet_v4.0.ncrna.fa
3732

Why?
• Counting the headers (">") is the appropriate way: a single sequence identification line can be followed by many sequence lines.

Example record (sequence identification line, then sequence lines):
>ENSRNA050031380-T1 ncrna chromosome:Aet_v4.0:2D:126982204:126982306:-1 gene:ENSRNA050031380 gene_biotype:snRNA transcript_biotype:snRNA gene_symbol:U6 description:U6 spliceosomal RNA
ACTATATAAAAAACTTCCAATTTTAGTGGAACTATACAGAGAAGATTAGCATGGCCCCGA
CGCAAGGATGACACACACGAATTGAGAAATGATCCAAATTTTT

Combining the commands
| : pipe character

Example:
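The example figure from this slide did not survive extraction; the one-liner below is an illustrative reconstruction (the pattern is an assumption based on the header shown above). It pipes every gene_biotype annotation through sort and uniq -c to count how many sequences carry each biotype:

$ grep -io "gene_biotype:[a-z]*" Aegilops_tauschii.Aet_v4.0.ncrna.fa | sort | uniq -c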

Grep options used above: -i (ignore-case) and -o (only-matching).

Useful data processing tools!
cut: extracts columns from a file
Usage: cut -f <fields> <file name>

A useful table viewer (tableview):
https://github.com/informationsea/tableview/releases/download/v0.4.6/tableview_linux_amd64

$ cat sample.vcf | grep -v "#"
Grep option -v: invert-match (drop the "#" header lines)
(VCF columns: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003)

$ cat sample.vcf | grep -v "#" | tableview_linux_amd64

Cont. … cut command! Uniq and sort

AWK
AWK scans each line and performs some actions:
awk '/pattern/ {action}' file
Combine commands: awk + pipe + uniq + sort …

A simple job script
Caution note (default SLURM allocation):
• memory = 2 GB
• CPU = 1 core
• Node = 1 node

Minimum 3 parameters:
1. sbatch: submit a batch script to Slurm
2. time: set a limit on the total run time of the job
   --time=days-hours:minutes:seconds
   -t days-hours:minutes:seconds

3. wrap: wrap the specified command string in a simple "sh" shell script and submit it to the Slurm controller.

Example (command line):
$ sbatch --time=00:10 --wrap="hostname"
Output: slurm-9024853.out
$ cat slurm-9024853.out
cn603-28-l

Job script (batch job):
$ cat my_job.sh
#!/bin/bash
#SBATCH --time=00:10
hostname

$ sbatch ./my_job.sh
Submitted batch job 7438
$ cat slurm-7438.out
cn512-05-r
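Once a job has finished, it can be worth checking what it actually consumed before tuning resource requests. A minimal sketch (my addition, assuming Slurm accounting is enabled on the cluster; the job ID is the one printed by sbatch):

$ sacct -j 7438 --format=JobID,JobName,Elapsed,MaxRSS,State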

Workflow - example

Step #1  BWA: genome mapping/alignment
Step #2  Samtools: compress the Sequence Alignment Map file (SAM to BAM)
Step #3  Samtools: sort the BAM file
Step #4  Samtools: index the BAM files
Step #5  Samtools: extract the chromosome interval of research interest

Step #6  GATK/Picard: mark or remove the duplicates

Step #1: Genome alignment
Burrows-Wheeler Aligner (BWA)

Prerequisites:
1. Genome reference file (GRCh37, HG19, …)
2. Genome reference index files
3. Sample data (single- or paired-end)

Commands:
• bwa index ref.fa
• bwa mem ref.fa reads.fq > aln-se.sam
• bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
• bwa aln ref.fa short_read.fq > aln_sa.sai
• bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
• bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
• bwa bwasw ref.fa long_read.fq > aln.sam

Caution note:
• By default, the BWA tool runs sequentially (1 core).
• Multi-threaded parallelization is supported via the option -t.
• The option -T (minimum alignment score) is different.

References available:
http://bio-bwa.sourceforge.net/bwa.shtml
/ibex/reference/KSL/

Example: BWA MEM
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

Step #1: Check the availability of the software
$ module av bwa
------ /sw/csi/modulefiles/applications ------
bwa/0.7.17/gnu-6.4.0    bwakit/0.7.15/binary-0.7.15

Step #2: Load the module
$ module load bwa/0.7.17/gnu-6.4.0

Step #3: Prepare a job submission script
Caution note on resource allocation (defaults): 2 GB memory, 1 core.

Command:
$ bwa mem /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam

SLURM script:
$ sbatch --time=00:10 --wrap="bwa mem /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam"

Cont. … (Optimized script)
SLURM script:
$ sbatch \
  --time=2:00:00 \
  --mem=100GB \
  --cpus-per-task=16 \
  --wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam"

Submitted batch job 9028055
$ cat slurm-9028055.out
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1600000 sequences (160000000 bp)...
[M::process] read 1600000 sequences (160000000 bp)...

Batch job script
$ cat BWA_MEM_batch.sh
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16

## Software
module load bwa/0.7.17/gnu-6.4.0

## Command
bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam

Job submitted using sbatch:
$ sbatch ./BWA_MEM_batch.sh
Submitted batch job 7439

Standard output/error will be in a file named slurm-<jobid>.out:
$ cat slurm-7439.out
Loading module for BWA
BWA 0.7.17 is now loaded
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1600000 sequences (160000000 bp)...
[M::process] read 1600000 sequences (160000000 bp)...

How can I run 100+ genome samples?
$ ls -lrta *_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 2125471805 Dec 17 2013 NIST7086_CGTACTAG_L002_R2_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 2083510543 Dec 17 2013 NIST7086_CGTACTAG_L002_R1_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 2081364133 Dec 17 2013 NIST7086_CGTACTAG_L001_R2_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 2037956271 Dec 17 2013 NIST7086_CGTACTAG_L001_R1_001.fastq.gz

-rw-r--r-- 1 kathirn g-kathirn 2001172486 Dec 17 2013 NIST7035_TAAGGCGA_L002_R2_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 1962477139 Dec 17 2013 NIST7035_TAAGGCGA_L002_R1_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 1954935121 Dec 17 2013 NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 1914722761 Dec 17 2013 NIST7035_TAAGGCGA_L001_R1_001.fastq.gz

------ DATA PROCESSING ------
Steps for processing more samples

[Flowchart: process samples S1, S2, S3 … Sn one by one; for each SAMPLE_NAME, submit the job below until all samples are done.]

$ sbatch \
  --time=2:00:00 \
  --mem=100GB \
  --cpus-per-task=16 \
  --wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SAMPLE_NAME_1.fastq SAMPLE_NAME_2.fastq > SAMPLE_NAME.sam"

In a UNIX script
1. Get the unique list of samples
$ ls *_R1_001.fastq.gz
NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
NIST7035_TAAGGCGA_L002_R1_001.fastq.gz
NIST7086_CGTACTAG_L001_R1_001.fastq.gz
NIST7086_CGTACTAG_L002_R1_001.fastq.gz

2. Parse sample by sample
$ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`; do echo $SAMPLE_NAME; done

3. Get the UNIQUE sample name
$ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`; do echo `basename $SAMPLE_NAME _R1_001.fastq.gz`; done
Output:
NIST7035_TAAGGCGA_L001
NIST7035_TAAGGCGA_L002
NIST7086_CGTACTAG_L001
NIST7086_CGTACTAG_L002

4. Multiple job submission using a FOR loop
$ module load bwa/0.7.17/gnu-6.4.0
$ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`; do
    PREFIX=`basename $SAMPLE_NAME _R1_001.fastq.gz`;
    sbatch \
      --time=2:00:00 \
      --mem=100GB \
      --cpus-per-task=16 \
      --wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz > ${PREFIX}.sam"
  done

In a batch script (as a job array)
Prerequisite: array size = number of samples

#!/bin/bash
#SBATCH --job-name=BWA_MEM
#SBATCH --output=BWA_MEM.%A_%a.out
#SBATCH --error=BWA_MEM.%A_%a.err

#SBATCH --time=2:00:00
#SBATCH --nodes=1

#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
#SBATCH --array=1-4

## Software
module load bwa/0.7.17/gnu-6.4.0

## My variables
SAMPLE=`ls *_R1_001.fastq.gz | head -n $SLURM_ARRAY_TASK_ID | tail -n 1` ;
PREFIX=`basename $SAMPLE _R1_001.fastq.gz` ;

## Job command
bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz > ${PREFIX}.sam
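A usage note (not on the original slide): for large sample sets, the % suffix on --array lets Slurm throttle how many array tasks run concurrently, for example:

#SBATCH --array=1-100%10    ## run at most 10 of the 100 tasks at a time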

Cont. …

$ sbatch ./bwa_mem_array.sh
Submitted batch job 7440

$ squeue -u $USER
JOBID   PARTITION NAME    USER    ST TIME NODES NODELIST(REASON)
7440_1  batch     BWA_MEM kathirn R  0:53 1     cn509-23-l
7440_2  batch     BWA_MEM kathirn R  0:53 1     cn509-23-l
7440_3  batch     BWA_MEM kathirn R  0:53 1     cn512-05-r
7440_4  batch     BWA_MEM kathirn R  0:53 1     cn512-05-r

$ ls -lrta *.sam
-rw-r--r-- 1 kathirn g-kathirn 2790260736 Feb 9 16:41 NIST7086_CGTACTAG_L002.sam
-rw-r--r-- 1 kathirn g-kathirn 3783000064 Feb 9 16:41 NIST7086_CGTACTAG_L001.sam
-rw-r--r-- 1 kathirn g-kathirn 3024093184 Feb 9 16:41 NIST7035_TAAGGCGA_L002.sam
-rw-r--r-- 1 kathirn g-kathirn 3978428416 Feb 9 16:41 NIST7035_TAAGGCGA_L001.sam

Step 2: SAM to BAM files
Samtools to convert SAM files to BAM

#!/bin/bash
module load samtools/1.8
for SAMPLE_NAME in `ls *.sam`; do
  PREFIX=`basename $SAMPLE_NAME .sam`;
  sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap="samtools view --threads 16 -b -S -h -q 30 ${SAMPLE_NAME} > ${PREFIX}.bam"
done

• SAM files are very large; a BAM file is a compressed version of a SAM file:
  1.8G NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
  1.9G NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
  13G  NIST7035_TAAGGCGA_L001.sam
  3.4G NIST7035_TAAGGCGA_L001.bam
• It is good to use BAM files, and safe to delete the SAM files once the BAM files are available.
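Before deleting the SAM files, a quick sanity check of each BAM is prudent; a minimal sketch (my addition) using samtools flagstat, which reports read counts and mapping statistics:

$ module load samtools/1.8
$ samtools flagstat NIST7035_TAAGGCGA_L001.bam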

Step 3: Convert BAM to sorted BAM
Sort the BAM files using samtools

#!/bin/bash
module load samtools/1.8
for SAMPLE_NAME in `ls *.bam`; do
  PREFIX=`basename $SAMPLE_NAME .bam`;
  sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap="samtools sort --threads 16 -T ${PREFIX} ${SAMPLE_NAME} -o ${PREFIX}.sorted.bam"
done

End-of-Step 3!
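An optional verification step (not on the original slide): a coordinate-sorted BAM carries SO:coordinate in the @HD line of its header:

$ samtools view -H NIST7035_TAAGGCGA_L001.sorted.bam | head -n 1
@HD VN:1.5 SO:coordinate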

What are the files generated?
• *.fastq.gz (input)
• *.sam: generated from the genome alignment
• *.bam: unsorted BAM files, generated from samtools as part of data compression
• *.sorted.bam: sorted BAM files, generated from samtools

Do we need all these intermediate files? IF NOT?!
$ bwa mem -t 16 $REF ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz | samtools view --threads 16 -b -S -h -q 30 - | samtools sort --threads 16 -T $PREFIX - > $PREFIX.sorted.bam
3-in-1!?
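One caveat with the 3-in-1 pipeline (my addition, not from the slides): by default bash reports only the exit status of the last command in a pipe, so a bwa failure upstream can go unnoticed. Adding pipefail near the top of the batch script makes the job fail if any stage fails:

#!/bin/bash
set -euo pipefail    ## abort the script if any pipeline stage exits non-zero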

#!/bin/bash
#SBATCH --job-name=BWA_MEM
#SBATCH --output=BWA_MEM.%A_%a.out
#SBATCH --error=BWA_MEM.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
#SBATCH --array=1-4

# Software
module load bwa/0.7.17/gnu-6.4.0
module load samtools/1.8

# My variables
SAMPLE=`ls *_R1_001.fastq.gz | head -n $SLURM_ARRAY_TASK_ID | tail -n 1` ;
PREFIX=`basename $SAMPLE _R1_001.fastq.gz` ;

# Job command
bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz | samtools view --threads 16 -b -S -h -q 30 - | samtools sort --threads 16 - > ${PREFIX}.sorted.bam
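A tuning note (my addition): the command above gives 16 threads to each of the three stages while the job reserves only 16 cores in total, so the concurrently running stages can oversubscribe the allocation. A hedged variant that splits the allocation, assuming bwa dominates the run time:

bwa mem -t 12 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz | samtools view --threads 2 -b -S -h -q 30 - | samtools sort --threads 2 - > ${PREFIX}.sorted.bam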

Step 4: Index the BAM files
Index the BAM files using samtools

#!/bin/bash
module load samtools/1.8
for SAMPLE_NAME in `ls *.sorted.bam`; do
  PREFIX=`basename $SAMPLE_NAME .sorted.bam`;
  sbatch --time=30:00 --mem=100GB --cpus-per-task=1 --wrap="samtools index ${SAMPLE_NAME}"
done

Summary: list of files generated
Sorted BAM files:
• -rw-r--r-- 1 kathirn g-kathirn 2.6G Feb 4 12:14 NIST7035_TAAGGCGA_L002.sorted.bam
• -rw-r--r-- 1 kathirn g-kathirn 2.5G Feb 4 12:14 NIST7035_TAAGGCGA_L001.sorted.bam
• -rw-r--r-- 1 kathirn g-kathirn 2.7G Feb 4 12:14 NIST7086_CGTACTAG_L001.sorted.bam
• -rw-r--r-- 1 kathirn g-kathirn 2.7G Feb 4 12:14 NIST7086_CGTACTAG_L002.sorted.bam

Index of sorted BAM files:
• -rw-r--r-- 1 kathirn g-kathirn 3.4M Feb 4 12:25 NIST7086_CGTACTAG_L001.sorted.bam.bai
• -rw-r--r-- 1 kathirn g-kathirn 3.5M Feb 4 12:26 NIST7035_TAAGGCGA_L001.sorted.bam.bai
• -rw-r--r-- 1 kathirn g-kathirn 3.5M Feb 4 12:26 NIST7035_TAAGGCGA_L002.sorted.bam.bai
• -rw-r--r-- 1 kathirn g-kathirn 3.4M Feb 4 12:26 NIST7086_CGTACTAG_L002.sorted.bam.bai

List of chromosomes in each BAM file (from the BAM header):
@HD VN:1.5 SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
@SQ SN:4 LN:191154276
@SQ SN:5 LN:180915260
@SQ SN:6 LN:171115067
@SQ SN:7 LN:159138663
@SQ SN:8 LN:146364022
@SQ SN:9 LN:141213431
@SQ SN:10 LN:135534747
@SQ SN:11 LN:135006516
@SQ SN:12 LN:133851895
@SQ SN:13 LN:115169878
@SQ SN:14 LN:107349540
@SQ SN:15 LN:102531392
@SQ SN:16 LN:90354753
@SQ SN:17 LN:81195210
@SQ SN:18 LN:78077248
@SQ SN:19 LN:59128983
@SQ SN:20 LN:63025520
@SQ SN:21 LN:48129895
@SQ SN:22 LN:51304566
@SQ SN:X LN:155270560
@SQ SN:Y LN:59373566
@SQ SN:MT LN:16569
@SQ SN:GL000207.1 LN:4262
@SQ SN:GL000226.1 LN:15008
@SQ SN:GL000229.1 LN:19913
@SQ SN:GL000231.1 LN:27386
@SQ SN:GL000210.1 LN:27682
@SQ SN:GL000239.1 LN:33824
….
@SQ SN:NC_007605 LN:171823
@SQ SN:hs37d5 LN:35477943

@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta NIST7035_TAAGGCGA_L002_R1_001.fastq.gz NIST7035_TAAGGCGA_L002_R2_001.fastq.gz

Step 5: Chromosome interval for research interest
Objective: Generate a chunk of the BAM file covering the interval 10,000-15,000 from Chromosome 1, Chromosome 2, etc.

Solution:
$ samtools view NIST7035_TAAGGCGA_L002.sorted.bam 1:10000-15000 | more
HWI-D00119:50:H7AP8ADXX:2:1214:6356:27283 163 1 10354 60 89M12S = 10354 96 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAACCCTAAGCCCCGGCA 8??DBDBAFF>?FGAFFIIFF9ED8;CCDFDED3?9?@?0?B@?DFF(DHECCC@@HGHHGIIIEECC==BCDFFECECECCCCCCDCDCECC NM:i:0 MD:Z:101 MC:Z:101M AS:i:101 XS:i:71
….

Cont. Any optimal or better solution!?

Caution notes!
• Job array indices must be numeric (no fractions, no characters, no special symbols, no alphanumeric values, etc.).
• When the chromosome is named "Chr1", the data distribution has to be handled explicitly, e.g.:
  if [ ${SLURM_ARRAY_TASK_ID} -eq 1 ]; then
    # …do something…
  else
    # …do something…
  fi
• Batch processing is required to get the value of ${SLURM_ARRAY_TASK_ID}.

#!/bin/bash
#SBATCH --job-name=Region_of_Interest
#SBATCH --output=Region_of_Interest.%A.out
#SBATCH --error=Region_of_Interest.%A.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
#SBATCH --array=1-2

## My variables
SAMPLE=NIST7035_TAAGGCGA_L002.sorted.bam
PREFIX=NIST7035_TAAGGCGA_L002
REGION="10000-15000"

## Software
module load samtools/1.8

## Job command to get the region of interest from Chromosomes 1 & 2
samtools view ${SAMPLE} ${SLURM_ARRAY_TASK_ID}:$REGION --threads 16 -b -o ${PREFIX}.${SLURM_ARRAY_TASK_ID}.$REGION.sorted.bam

To view the chromosome, a genome browser (e.g. IGV) can be used.
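The command above works because GRCh37 autosomes are named plainly "1", "2", and so on, matching ${SLURM_ARRAY_TASK_ID} directly. If the reference uses "chr1"-style names, the numeric task ID has to be mapped to a chromosome name first; a minimal sketch (my addition; the names in CHROMS are illustrative):

## Map the numeric array task ID to a chromosome name
CHROMS=(chr1 chr2 chrX)
CHR=${CHROMS[$((SLURM_ARRAY_TASK_ID - 1))]}
samtools view ${SAMPLE} ${CHR}:${REGION} --threads 16 -b -o ${PREFIX}.${CHR}.${REGION}.sorted.bam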

Step 6: Mark duplicate(s)
Any optimal or better solution!?
Prerequisite: array size = number of samples

#!/bin/bash
#SBATCH --job-name=MarkDupe
#SBATCH --output=MarkDupe.%A_%a.out
#SBATCH --error=MarkDupe.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --array=1-4

## My variables
SAMPLE=`ls *.sorted.bam | head -n $SLURM_ARRAY_TASK_ID | tail -n 1` ;
PREFIX=`basename $SAMPLE .sorted.bam` ;

## Software
module load gatk/4.0.1.1

## Job command
gatk MarkDuplicates --INPUT $SAMPLE --OUTPUT $PREFIX.duped.sorted.bam --METRICS_FILE $PREFIX.txt --REMOVE_DUPLICATES true

End-of-Step 6

Problems so far:
- Many job scripts
- Multiple files
- Manual steps
- etc.

Job dependency

Single job script
Job dependency:
sbatch --dependency= ...

after:jobid[:jobid...]       job can begin after the specified jobs have started
afterany:jobid[:jobid...]    job can begin after the specified jobs have terminated
afternotok:jobid[:jobid...]  job can begin after the specified jobs have failed
afterok:jobid[:jobid...]     job can begin after the specified jobs have run to completion with an exit code of zero
singleton                    jobs can begin execution after all previously launched jobs with the same name and user have ended. This is useful to collate results of a swarm or to send a notification at the end of a swarm.
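For example, capturing the first job's ID with --parsable and making the conversion step wait for a successful alignment could look like this (a sketch; the script names are illustrative):

$ JOB1=$(sbatch --parsable ./bwa_mem_array.sh)
$ sbatch --dependency=afterok:${JOB1} ./sam_to_bam.sh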

Source: https://hpc.nih.gov/docs/job_dependencies.html

Job dependency - Example

$ cat dependent.sh
#!/bin/bash
## Any bugs/issues, please e-mail: [email protected]
echo "Submitting 5 jobs with 4 job dependency conditions";

## Submit the first job
First_CMD="sleep 40";
First_Job="sbatch --partition=batch --job-name=First_Step --time=30:00 --output=First-%J.out --error=First-%J.err --nodes=1";
First_ID=$(${First_Job} --parsable --wrap="${First_CMD}");
echo "First Job submitted (\"${First_CMD}\" is executing) and this job id is ${First_ID}";

## Execute the second job only when the first job is successful
Second_CMD="hostname";
Second_Job="sbatch --partition=batch --job-name=Second_Step --time=30:00 --output=Second-%J.out --error=Second-%J.err --nodes=1";
Second_ID=$(${Second_Job} --parsable --dependency=afterok:${First_ID} --wrap="${Second_CMD}");
echo "Second Job (\"${Second_CMD}\") was submitted (Job_ID=${Second_ID}) and it will execute when the First Job_ID=${First_ID} is successful"

echo " The status of running jobs are ..." echo "------" squeue -u $USER -l echo "------" Workflow

Source: Computational and Bioinformatics Frameworks for Next-Generation Whole Exome and Genome Sequencing

Source: Best Practices for Variant Discovery in DNAseq

Is this simple and/or optimized?

Workflow optimization
Optimal workflow for multiple samples (sample 1, sample 2, … sample N), with heterogeneous resource allocation and different software/tools for every job step:

Step 1  Read trimming            TRIMMOMATIC_JAR               Cores = 4
Step 2  Read mapping             bwa mem                       Cores = 16
Step 3  Mark duplicates          GATK 4.x MarkDuplicates       Cores = 1
Step 4  Add/Replace read groups  gatk AddOrReplaceReadGroups   Cores = 1
Step 5  Indexing                 samtools index                Cores = 1
Step 6  Haplotype caller         GATK 3.x HaplotypeCaller      Cores = 16
Step 7  Compress the gVCF        bgzip                         Cores = 1
Step 8  gVCF index               tabix                         Cores = 1

Outcome:
• Automated workflow: 8 scripts = single script
• Heterogeneous resource allocation used: 64 cores = optimal # of cores
• Turnaround time minimized: unpredicted = predicted
• Optimized resource allocations: max. cores = optimal
• Job monitoring and job control: complex = simplified
  - Job restart
  - Job statistics/report

Acknowledgements: Elodie Rey (Prof. Mark Tester), Michael D. Abrouk (Prof. Simon Krattinger)

THANKS! Time for questions and your feedback!