Sequence Alignment and Variant Calling Pipeline

Overview

The brief synopsis of the sequence alignment and variant calling pipeline is that the Genomic Analysis Facility sequences an individual and releases the FASTQ files. These data are aligned to the reference genome using BWA; then, using SAMtools and Picard, the aligned data are sorted, merged, and processed to call variants, and the results are loaded into SVA for analysis. Detailed explanations for each step in the pipeline are given in later sections. The specific steps are:

1. The Genomic Analysis Facility releases FASTQ files to us.
2. Create scripts that generate necessary files and directories to begin alignment.
3. Run scripts which do the following:
   a. Find plausible alignment coordinates for each individual read using the BWA aln command.
   b. Pair aligned reads using the BWA sampe/samse command based on the 'aln' output.
   c. Change the alignment format from text to binary (SAM -> BAM).
   d. Sort reads by coordinate using the SAMtools sort command.
   e. Merge aligned files using the SAMtools merge command.
   f. Remove PCR duplicates using the Picard MarkDuplicates command.
   g. Check how many reads aligned and compute other alignment stats with Picard's CollectAlignmentSummaryMetrics and CollectInsertSizeMetrics.
   h. Create the pileup file using the SAMtools pileup command.
   i. Use the pileup file to create the SVA input files: the bco directory and files (runs in parallel). Once this script is finished, it submits the ERDS program to run. ERDS identifies copy number variants.
   j. Calculate coverage from the pileup file (this step runs in parallel).
   k. Call SNPs and indels using the SAMtools.pl varFilter command.
   l. Separate SNPs and indels into two different files using snp_filter.pl and indel_filter.pl.
   m. Index the alignment file so that it can be viewed by users with the SAMtools alignment viewer.
4. Copy output to CHGV storage servers.
5. Create a file which lists the sizes of the alignment and variant files in the DSCR.
6. Check for errors.
7. Run the concordance check.
8. Notify the Genomic Analysis Facility of any irregularities seen during alignment.
9. Delete files from seqanalysis.
10. Request that the Genomic Analysis Facility delete the FASTQ files from seqanalysis.
11. Release the sample to the lab.
12. Create the .gsap file in preparation for SVA annotation.
13. Annotate using SVA.

Details of the sequence alignment and variant calling pipeline

The Genomic Analysis Facility releases FASTQ files

When the Genomic Analysis Facility finishes a sample, they will copy the FASTQ files to a directory on the Duke-wide cluster (DSCR) and update a text file, Run_Summary.txt, which contains per-sample sequencing information. The Run_Summary.txt file can contain several columns:

1. Run number.
2. Lanes sequenced.
3. Location of FASTQ files.
4. Type of fragment.
5. Paired/single end.
6. Length of the first read.
7. Length of the second read.
8. Median insert size.
9. Standard deviation of the insert size for reads shorter than the median.
10. Standard deviation of the insert size for reads longer than the median.
11. GWAS_ID.
12. Top_Strand file name.
13. For exomes, the kit used to generate the data.
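
For scripts that consume Run_Summary.txt, the columns above can be read into a per-sample record. A minimal Python sketch, assuming a tab-delimited layout; the field names below are illustrative, not the file's actual header:

```python
import csv
from io import StringIO

# Illustrative field names for the 13 columns listed above; the real
# Run_Summary.txt may name or order them differently.
FIELDS = ["run", "lanes", "fastq_dir", "fragment_type", "end_type",
          "read1_len", "read2_len", "median_insert", "sd_below_median",
          "sd_above_median", "gwas_id", "top_strand", "exome_kit"]

def parse_run_summary(text):
    """Parse tab-delimited Run_Summary-style rows into one dict per run."""
    reader = csv.reader(StringIO(text), delimiter="\t")
    return [dict(zip(FIELDS, row)) for row in reader if row]

# One made-up paired-end run (13 tab-separated values).
example = ("37\t1,2\t/nfs/seq/fastq/run37\tgenomic\tpaired\t100\t100"
           "\t300\t25\t30\tGW123\tGW123_top.txt\tnone")
records = parse_run_summary(example)
```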

Create necessary files and directories to begin alignment

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

The following files and directories are created based on information in the Run_Summary.txt file using sample_info.pl (see documentation for usage) and are necessary before alignment can begin:

- The sample information file – This file summarizes information for a single sample from the Run_Summary.txt file, which is released by the Genomic Analysis Facility, and is used as input for the program which generates all alignment scripts for a given sample.
- The sample directory – The sample directory is created in 'seqanalysis/samples' and is given the name of the sample.
- 'combined' directory – This directory is located within the sample directory. All final files from the sequence alignment and variant calling pipeline for a sample will be output to this directory.
- 'Logs' directory – This directory is located within the sample directory. The output logs (*.out) and the error logs (*.err) are written to this directory.
- 'Scripts' directory – This directory is located within the sample directory. All of the alignment scripts for a sample will be in this directory.
- Directories for each run – These directories are located within the sample directory. One directory will be created for each of the N runs, named Run1 – RunN. These directories store temporary data that is created for each run.

Create scripts

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

The following directories and scripts can be created for alignment:

- 'combined' directory
- 'Logs' directory
- 'Scripts' directory
- 'Run1 … RunN' directories
- 'ALN' directory – This contains all of the scripts which run BWA's aln command. Each single-end run has one script and each paired-end run has two scripts.
- convert.py – Converts base qualities from Illumina format to Sanger format.
- coverage.sh – Calculates coverage information for the sample.
- erds.sh – Runs ERDS to call CNVs.
- merge_final.sh – Merges all runs, removes PCR duplicates, creates the pileup file, calls pileup2bco.q, calls coverage.sh, and calls SNPs and indels.
- pileup2bco.q – Creates the bco files.
- qsub.pl – Can be used to submit all alignment scripts, all sampe scripts, all merge scripts, or all scripts that need to be run.
- 'SAMPE' directory – Contains all scripts that run the BWA sampe or samse command on aligned reads.

Run the scripts

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

Once the scripts are generated, the sequence can be aligned and the variants called with a single command by running qsub.pl with the 'all' option, or the scripts can be run individually (see the documentation for usage). The details of each step are outlined below.

Find plausible alignment coordinates for each individual read using BWA aln command

The BWA aln command is primarily responsible for aligning the reads to the reference and is run using the following settings (these are all defaults):

- -n [0.04] – Allows a maximum of 4 percent missing alignments given a 2 percent uniform base error rate.
- -o [1] – Maximum number of gap opens.
- -e [-1] – Specifies that gap extensions should be run in k-difference mode, which disallows long gaps.
- -d [16] – Deletions are not allowed within 16 bp from the 3' end.
- -i [5] – Indels are not allowed within 5 bp of either end.
- -l [infinity] – Seeding is disabled.
- -t [1] – Multi-threading mode is disabled.
- -M [3] – Mismatch penalty is 3.
- -O [11] – Gap open penalty is 11.
- -E [4] – Gap extension penalty is 4.
- -R [30] – Proceed with suboptimal alignments when there are no more than 30 equally best hits.
- -m [2000000] – No more than 2,000,000 entries in the queue.
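
The flag list above can be turned into an argument vector mechanically. A hedged Python sketch (file names are placeholders; -l is omitted because "infinity" is not a literal value the flag accepts in a real invocation):

```python
# Settings from the list above; -l (seeding disabled) is left out because
# it takes no literal "infinity" value on a real command line.
ALN_OPTS = [("-n", "0.04"), ("-o", "1"), ("-e", "-1"), ("-d", "16"),
            ("-i", "5"), ("-t", "1"), ("-M", "3"), ("-O", "11"),
            ("-E", "4"), ("-R", "30"), ("-m", "2000000")]

def build_aln_command(reference, fastq, out_sai, opts=ALN_OPTS):
    """Return the 'bwa aln' argument list; paths here are placeholders."""
    cmd = ["bwa", "aln"]
    for flag, value in opts:
        cmd += [flag, value]
    return cmd + ["-f", out_sai, reference, fastq]

cmd = build_aln_command("ref.fa", "run1_lane1.fastq", "run1_lane1.sai")
```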

Pair aligned reads using BWA sampe/samse command based on ‘aln’ output.

For paired-end reads the BWA 'sampe' command is run with -a set based on the insert size specified in the sample information file.

For single end reads the BWA ‘samse’ command is used.
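
How -a is derived from the insert-size fields in the sample information file is not spelled out above; one plausible rule, shown purely as an illustration, is the median insert size plus a few upper-tail standard deviations:

```python
def sampe_max_insert(median_insert, sd_above_median, n_sd=4):
    """Illustrative only: cap 'sampe -a' at the median insert size plus
    n_sd upper-tail standard deviations. The pipeline's real formula may
    differ."""
    return int(median_insert + n_sd * sd_above_median)

# With a 300 bp median and a 30 bp upper-tail SD this gives 420.
a_value = sampe_max_insert(300, 30)
```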

Change the alignment format from text to binary (SAM‐>BAM)

This is done using the SAMtools ‘import’ command.

Sort reads using the SAMtools sort command

The SAMtools 'sort' command sorts reads according to the leftmost coordinate.

Merge aligned files using the SAMtools merge command

All lanes and all runs are merged in a single step.

Remove PCR duplicates using the Picard MarkDuplicates command

The Picard ‘MarkDuplicates’ command can be used to mark or to remove PCR duplicates. It is run using the following options:

- REMOVE_DUPLICATES=true – Duplicates are removed rather than simply marked.
- MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=5000000 – This is set at an arbitrary value higher than the default (50000) so that Picard does not exit. A related suggestion from Picard's mailing list: "if it [Picard] is going to write to temp files, it needs a file handle for each sequence. There are two work-arounds at this point: either set the MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP value low, and give it lots of RAM, or else ask your system administrator to increase the limit on # of file handles to something like 15,000".
- ASSUME_SORTED=true – Sorting is performed in an earlier step, so Picard is told to assume the input is already sorted.
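
Put together, the options above yield an invocation like the following sketch; the jar path, heap size, and file names are placeholders, not the pipeline's real values:

```python
def build_markdup_command(in_bam, out_bam, metrics_file):
    """Assemble a Picard MarkDuplicates invocation with the options listed
    above. The jar location and -Xmx heap size are placeholders."""
    return ["java", "-Xmx4g", "-jar", "MarkDuplicates.jar",
            "INPUT=" + in_bam,
            "OUTPUT=" + out_bam,
            "METRICS_FILE=" + metrics_file,
            "REMOVE_DUPLICATES=true",
            "MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=5000000",
            "ASSUME_SORTED=true"]

cmd = build_markdup_command("combined.bam", "combined_rmdup.bam",
                            "metrics_duplicates")
```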

Check how many reads aligned and compute other alignment stats with Picard’s CollectAlignmentSummaryMetrics and CollectInsertSizeMetrics

Create pileup file using SAMtools pileup command

The pileup file shows the alignment at each genomic position and allows us to view how the reads “pile up” across the genome. It is a useful way to view coverage at specific sites and to hand check variant calls.

The SAMtools ‘pileup’ command is run using the following settings (these are all default):

- -M [60] – Mapping quality is capped at 60.
- -N [2] – Number of haplotypes in the sample is 2.
- -r [0.001] – Expected difference between two haplotypes is 0.1 percent.
- -G [0.00015] – Expected rate of indels between two haplotypes is 0.015 percent.
- -I [40] – Phred probability of an indel in sequencing or preparation is 40.
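
Each pileup line carries chromosome, 1-based position, reference base, and read depth, followed by the read-base and base-quality strings. A small Python parsing sketch under that assumed column layout:

```python
def parse_pileup_line(line):
    """Split one samtools pileup line into its fixed leading columns:
    chromosome, 1-based position, reference base, and read depth, followed
    by the read-base and base-quality strings."""
    fields = line.rstrip("\n").split("\t")
    return {"chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[2],
            "depth": int(fields[3]),
            "bases": fields[4] if len(fields) > 4 else "",
            "quals": fields[5] if len(fields) > 5 else ""}

rec = parse_pileup_line("chr1\t10583\tA\t12\t.,.,..,,..G,\tIIIIHHIIIIGI")
```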

Use pileup file to create bco directories and files

The bco files are needed to annotate a sample in SVA and are created from the pileup files. The script that creates the bco files (pileup2bco.q) is submitted as part of merge_final.sh and runs in parallel.

Run ERDS to call CNVs

The script that runs ERDS (erds.sh) is submitted at the end of pileup2bco.q.

Calculate coverage.

Coverage is calculated from the total number of alignable bases (the number of bases in the reference that are not 'N') and the total number of bases that aligned to the reference. This script is submitted in merge_final.sh and runs in parallel.
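
The calculation described above reduces to a ratio; a minimal sketch that sums per-site depths from pileup lines and divides by the count of non-'N' reference bases:

```python
def mean_coverage(pileup_lines, alignable_bases):
    """Average coverage: total aligned bases (the sum of the depth column
    across pileup lines) divided by the number of non-'N' reference bases."""
    total_aligned = sum(int(line.split("\t")[3])
                        for line in pileup_lines if line.strip())
    return total_aligned / float(alignable_bases)

# Two covered sites out of three alignable bases: (10 + 20) / 3 = 10x.
lines = ["chr1\t1\tA\t10\t..\tII", "chr1\t2\tC\t20\t..\tII"]
coverage = mean_coverage(lines, 3)
```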

Call SNPs and indels using SAMtools.pl varFilter command.

SNPs and indels are called using the SAMtools.pl ‘varFilter’ command with the following settings (these are all default except ‐D):

- -Q [25] – Minimum RMS mapping quality for SNPs is 25.
- -q [10] – Minimum RMS mapping quality for indels is 10.
- -d [3] – Minimum read depth is 3.
- -D [500] – Maximum read depth is 500.
- -G [25] – Minimum indel score for nearby SNP filtering is 25.
- -w [10] – SNPs within 10 bp around a gap are filtered.
- -W [10] – Window size for filtering dense SNPs is 10.
- -N [2] – Maximum number of SNPs in a window is 2.
- -l [30] – A window size of 30 is used for filtering adjacent gaps.
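
The per-SNP thresholds above can be expressed as a simple predicate. A sketch covering only the -Q, -d, and -D cutoffs; the real varFilter also applies the gap- and density-based filters, which are not reproduced here:

```python
def keep_snp(rms_mapq, depth, min_mapq=25, min_depth=3, max_depth=500):
    """Mirror varFilter's -Q / -d / -D cutoffs from the list above; the
    window-based filters (-G, -w, -W, -N, -l) are omitted."""
    return rms_mapq >= min_mapq and min_depth <= depth <= max_depth
```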

Separate SNPs and Indels into two different files using snp_filter.pl and indel_filter.pl

This step is required for variant files in the SAMtools pileup format. For other variant file formats such as VCF, all the variants can be in one file.

These two programs simply filter the single file created by SAMtools' varFilter into a file for SNPs and a file for indels that can be loaded into SVA. These two scripts do not write the individual base qualities, which are present in the varFilter output, to the separate SNP and indel files.
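
In the SAMtools pileup consensus format, indel lines carry '*' in the reference-base column, which is enough to route each varFilter line to the right file. A sketch relying on that convention (it skips the base-quality trimming the real scripts perform):

```python
def split_variants(lines):
    """Partition varFilter output lines into (snps, indels) by checking the
    third column: '*' marks an indel line in the pileup consensus format."""
    snps, indels = [], []
    for line in lines:
        if not line.strip():
            continue
        ref = line.split("\t")[2]
        (indels if ref == "*" else snps).append(line)
    return snps, indels

# Made-up example lines: one SNP, one indel.
variants = ["chr1\t100\tA\tG\t40\t40\t60\t15",
            "chr1\t200\t*\t+AG/+AG\t30\t30\t60\t8"]
snps, indels = split_variants(variants)
```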

Index the alignment file so that it can be viewed by users with the SAMtools alignment viewer

The ‘index’ command in SAMtools makes it possible to access the bam file rapidly.

Copy output to storage servers using the Solexa cluster.

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

For whole genome sequencing, run bwa_copy.pl on the Solexa cluster, which is accessed from sva.igsp.duke.edu. This creates the directory where the files will be stored, with a copy.sh script inside it. Submitting copy.sh to the Solexa cluster copies the files that are kept onto the seqsata set of drives.

The files that are kept are:

- The entire 'Logs' directory
- The entire 'Scripts' directory
- From the 'combined' directory:
  o combined_rmdup.bam
  o combined_rmdup.bam.bai
  o snps.samtools
  o indels.samtoolsindels
  o snps_indels
  o coverage.txt
  o erds/sample.events (this is for whole genome only)
  o metrics_alignment
  o metrics_duplicates

Take a 'snapshot' of alignment and variant files

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

The 'snapshot' is a text file that lists the files created during the sequence alignment and variant calling pipeline, their file sizes, and other information that may be used to verify that the copying process ran successfully. It is created automatically by copy.sh.

Run concordance check

Concordance checking is run as a command line feature and will be called as part of a script, but it has not been fully integrated into the pipeline. This section will be updated when it is integrated.

Check for errors

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

Notify the Genomic Analysis Facility of any irregularities seen during alignment

There are several irregularities that may become apparent during the alignment and variant calling pipeline which should be reported to the Genomic Analysis Facility. These may include:

- Low coverage:
  o Whole genome: average coverage should be > 25x.
  o Whole exome: the percentage of captured bases with coverage >= 5 should be greater than 0.90.
- Low proportion of reads aligned (should be > 80%).
- Low concordance.
- A high percentage of reads removed by Picard's PCR duplicate removal.
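
The quantified thresholds above can be checked mechanically before notifying the facility. A sketch covering only the coverage and alignment-rate rules, since the concordance and duplicate-rate cutoffs are not quantified above:

```python
def qc_flags(sample_type, avg_coverage=None, pct_captured_ge5=None,
             pct_reads_aligned=None):
    """Return a list of irregularities per the thresholds above.
    sample_type is 'genome' or 'exome'; None means 'not measured'."""
    flags = []
    if (sample_type == "genome" and avg_coverage is not None
            and avg_coverage <= 25):
        flags.append("low coverage")
    if (sample_type == "exome" and pct_captured_ge5 is not None
            and pct_captured_ge5 <= 0.90):
        flags.append("low coverage")
    if pct_reads_aligned is not None and pct_reads_aligned <= 0.80:
        flags.append("low alignment rate")
    return flags
```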

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

Delete files from DSCR (Duke Shared Cluster Resource).

Once the files have been copied to the CHGV storage directories and it has been verified that there were no errors or irregularities during the process, all data pertaining to a sample is deleted from the DSCR.

Request that the Genomic Analysis Facility delete the FASTQ files from seqanalysis

This is done by email.

Release the sample to the lab

An email is sent to [email protected] stating that the alignment is complete and giving the location of the necessary files. The alignment statistics tables are also updated at this time:

/nfs/sva/SequenceAnalyses/Alignment_stats/exome_alignment_stats.txt
/nfs/sva/SequenceAnalyses/Alignment_stats/genome_alignment_stats.txt

Annotate Variants with SVA

The CHGV annotates variants using SVA. Once a week, samples released during the previous week are incorporated into one or two master projects located here:

/nfs/svaprojects/Master.SVA (genomes only)
/nfs/svaprojects/MasterExome.SVA (exomes and genome samples)

CHGV members who want to use variants from newly released samples can extract the information they need from these master projects.

Running the pipeline

A step-by-step guide to running the pipeline is available at /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf. This gives detailed instructions on how to create the necessary files and directories to begin alignment, generate and run the scripts that perform alignment and variant calling, check for errors, copy the output to the appropriate locations, delete files, and release the sample to the lab.

Review and implementation of new software or additional steps

Periodically, software will be released that could be incorporated into the pipeline to enhance or replace software that is currently implemented. Before it can be incorporated, it must be shown to add demonstrable improvement to the pipeline. The type of improvement will determine the amount of effort we can expend to implement the software. Altering the pipeline may have severe consequences, so the following considerations must be taken into account:

- Will the addition shorten or lengthen the amount of time needed for the pipeline?
- Will previous samples need to be recalled?
- Has the software been tested outside the lab?
- What resources does it require in terms of processors, RAM, and disk space?
- What other software is required in order for the new software or step to run properly?
- Will it be performed the same on every sample?
- What tests can be performed, and what should be measured, in order to confirm any claimed improvements?

No software changes will be made without approval by CHGV management.

Software

There are a number of different software programs currently being used in the pipeline. These programs are subject to change if new software becomes available that will add demonstrable improvement to the pipeline. The programs are:

- BWA (documentation: http://bio-bwa.sourceforge.net/bwa.shtml)
- SAMtools (documentation: http://samtools.sourceforge.net/samtools.shtml)
- Picard (documentation: http://picard.sourceforge.net/command-line-overview.shtml)

Scripting

Our alignment/variant calling programs are run on the Duke Shared Cluster Resource (DSCR), which operates under the SGE environment. In general, scripts are written using Perl, tcsh, or bash. A typical tcsh script begins with a header similar to the following:

#$ -S /bin/tcsh -cwd
#$ -o /nfs/igsp/seqanalysis/ALIGNMENT/samples/hemo0022/Logs/aln1_1.out
#$ -e /nfs/igsp/seqanalysis/ALIGNMENT/samples/hemo0022/Logs/aln1_1.err

There are a number of options that can be set in this script. For more information see https://wiki.duke.edu/display/SCSC/SGE+Queueing+System.

Data freezes

When new software, processes, or parameters are added to the pipeline, the output may change in significant ways, potentially altering our ability to replicate published data and analyses. To preserve that ability, output that may be altered by introduced changes must be stored as a data freeze. Depending on what is changed, the following files may need to be saved:

- Variant files (snps.samtools and indels.samtoolsindels)
- bco files
- Alignment (.bam) files

Because the original files produced by the Genomic Analysis Facility are stored permanently, all data can be reproduced from the beginning. Data freezes simply make it much easier to replicate data and analyses. The current location for data freezes is

/nfs/sva2/SequenceAnalyses/svaData/dataFreeze.