Sequence Alignment and Variant Calling Pipeline

Overview

The brief synopsis of the sequence alignment and variant calling pipeline is that the Genomic Analysis Facility sequences an individual and releases the FASTQ files. These data are aligned to the reference genome using BWA; then, using SAMtools and Picard, the aligned data are sorted, merged, and processed to call variants, and the results are loaded into SVA for analysis. Detailed explanations for each step in the pipeline are given in later sections. The specific steps are:

1. The Genomic Analysis Facility releases FASTQ files to us.
2. Create scripts that generate necessary files and directories to begin alignment.
3. Run scripts which do the following:
   a. Find plausible alignment coordinates for each individual read using the BWA aln command.
   b. Pair aligned reads using the BWA sampe/samse command based on the 'aln' output.
   c. Change the alignment format from text to binary (SAM -> BAM).
   d. Sort reads by coordinate using the SAMtools sort command.
   e. Merge aligned files using the SAMtools merge command.
   f. Remove PCR duplicates using the Picard MarkDuplicates command.
   g. Check how many reads aligned and compute other alignment stats with Picard's CollectAlignmentSummaryMetrics and CollectInsertSizeMetrics.
   h. Create the pileup file using the SAMtools pileup command.
   i. Use the pileup file to create the SVA input files: the bco directory and files (runs in parallel). Once this script is finished, it submits the ERDS program to run. ERDS identifies copy number variants.
   j. Calculate coverage from the pileup file (this step runs in parallel).
   k. Call SNPs and indels using the SAMtools.pl varFilter command.
   l. Separate SNPs and indels into two different files using snp_filter.pl and indel_filter.pl.
   m. Index the alignment file so that it can be viewed by users with the SAMtools alignment viewer.
4. Copy output to CHGV storage servers.
5. Create a file which lists the sizes of the alignment and variant files in the DSCR.
6. Check for errors.
7. Run the concordance check.
8. Notify the Genomic Analysis Facility of any irregularities seen during alignment.
9. Delete files from seqanalysis.
10. Request that the Genomic Analysis Facility delete the FASTQ files from seqanalysis.
11. Release the sample to the lab.
12. Create the .gsap file in preparation for SVA annotation.
13. Annotate using SVA.

Details of the sequence alignment and variant calling pipeline

The Genomic Analysis Facility releases FASTQ files

When the Genomic Analysis Facility finishes a sample, they will copy the FASTQ files to a directory on the Duke-wide cluster (DSCR) and update a text file, Run_Summary.txt, which contains per-sample sequencing information. The Run_Summary.txt file can contain several columns:

1. Run number.
2. Lanes sequenced.
3. Location of FASTQ files.
4. Type of fragment.
5. Paired/single end.
6. Length of the first read.
7. Length of the second read.
8. Median insert size.
9. Standard deviation of the insert size for reads shorter than the median.
10. Standard deviation of the insert size for reads longer than the median.
11. GWAS_ID.
12. Top_Strand file name.
13. For exomes, the kit used to generate the data.
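
For scripts that consume Run_Summary.txt, the columns above can be read into a per-sample record. A minimal Python sketch, assuming a tab-delimited layout; the field names below are illustrative, not the file's actual header:

```python
import csv
from io import StringIO

# Illustrative field names for the 13 columns listed above; the real
# Run_Summary.txt may name or order them differently.
FIELDS = ["run", "lanes", "fastq_dir", "fragment_type", "end_type",
          "read1_len", "read2_len", "median_insert", "sd_below_median",
          "sd_above_median", "gwas_id", "top_strand", "exome_kit"]

def parse_run_summary(text):
    """Parse tab-delimited Run_Summary-style rows into one dict per run."""
    reader = csv.reader(StringIO(text), delimiter="\t")
    return [dict(zip(FIELDS, row)) for row in reader if row]

# One made-up paired-end run (13 tab-separated values).
example = ("37\t1,2\t/nfs/seq/fastq/run37\tgenomic\tpaired\t100\t100"
           "\t300\t25\t30\tGW123\tGW123_top.txt\tnone")
records = parse_run_summary(example)
```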

Create necessary files and directories to begin alignment

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

The following files and directories are created based on information in the Run_Summary.txt file using sample_info.pl (see documentation for usage) and are necessary before alignment can begin:

- The sample information file – This file summarizes information for a single sample from the Run_Summary.txt file, which is released by the Genomic Analysis Facility, and is used as input for the program which generates all alignment scripts for a given sample.
- The sample directory – The sample directory is created in 'seqanalysis/samples' and is given the name of the sample.
- 'combined' directory – This directory is located within the sample directory. All final files from the sequence alignment and variant calling pipeline for a sample will be output to this directory.
- 'Logs' directory – This directory is located within the sample directory. The output logs (*.out) and the error logs (*.err) are written to this directory.
- 'Scripts' directory – This directory is located within the sample directory. All of the alignment scripts for a sample will be in this directory.
- Directories for each run – These directories are located within the sample directory. One directory will be created for each of the N runs, named Run1 – RunN. These directories store temporary data that is created for each run.

Create scripts

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

The following directories and scripts can be created for alignment:

- 'combined' directory
- 'Logs' directory
- 'Scripts' directory
- 'Run1 … RunN' directories
- 'ALN' directory – This contains all of the scripts which run BWA's aln command. Each single-end run has one script and each paired-end run has two scripts.
- convert.py – Converts base qualities from Illumina format to Sanger format.
- coverage.sh – Calculates coverage information for the sample.
- erds.sh – Runs ERDS to call CNVs.
- merge_final.sh – Merges all runs, removes PCR duplicates, creates the pileup file, calls pileup2bco.q, calls coverage.sh, and calls SNPs and indels.
- pileup2bco.q – Creates the bco files.
- qsub.pl – Can be used to submit all alignment scripts, all sampe scripts, all merge scripts, or all scripts that need to be run.
- 'SAMPE' directory – Contains all scripts that run the BWA sampe or samse command on aligned reads.

Run the scripts

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

Once the scripts are generated, the sequence can be aligned and the variants called with a single command by running qsub.pl with the 'all' option, or the scripts can be run individually (see the documentation for usage). The details of each step are outlined below.

Find plausible alignment coordinates for each individual read using BWA aln command

The BWA aln command is primarily responsible for aligning the reads to the reference and is run using the following settings (these are all defaults):

- -n [0.04] – Allows a maximum of 4 percent missing alignments given a 2 percent uniform base error rate.
- -o [1] – Maximum number of gap opens.
- -e [-1] – Specifies that gap extensions should be run in k-difference mode, which disallows long gaps.
- -d [16] – Deletions are not allowed within 16 bp from the 3' end.
- -i [5] – Indels are not allowed within 5 bp of either end.
- -l [infinity] – Seeding is disabled.
- -t [1] – Multi-threading mode is disabled.
- -M [3] – Mismatch penalty is 3.
- -O [11] – Gap open penalty is 11.
- -E [4] – Gap extension penalty is 4.
- -R [30] – Proceed with suboptimal alignments when there are no more than 30 equally best hits.
- -m [2000000] – No more than 2,000,000 entries in the queue.
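
The flag list above can be turned into an argument vector mechanically. A hedged Python sketch (file names are placeholders; -l is omitted because "infinity" is not a literal value the flag accepts in a real invocation):

```python
# Settings from the list above; -l (seeding disabled) is left out because
# it takes no literal "infinity" value on a real command line.
ALN_OPTS = [("-n", "0.04"), ("-o", "1"), ("-e", "-1"), ("-d", "16"),
            ("-i", "5"), ("-t", "1"), ("-M", "3"), ("-O", "11"),
            ("-E", "4"), ("-R", "30"), ("-m", "2000000")]

def build_aln_command(reference, fastq, out_sai, opts=ALN_OPTS):
    """Return the 'bwa aln' argument list; paths here are placeholders."""
    cmd = ["bwa", "aln"]
    for flag, value in opts:
        cmd += [flag, value]
    return cmd + ["-f", out_sai, reference, fastq]

cmd = build_aln_command("ref.fa", "run1_lane1.fastq", "run1_lane1.sai")
```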

Pair aligned reads using BWA sampe/samse command based on ‘aln’ output.

For paired-end reads the BWA 'sampe' command is run with -a set based on the insert size specified in the sample information file.

For single end reads the BWA ‘samse’ command is used.
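
How -a is derived from the insert-size fields in the sample information file is not spelled out above; one plausible rule, shown purely as an illustration, is the median insert size plus a few upper-tail standard deviations:

```python
def sampe_max_insert(median_insert, sd_above_median, n_sd=4):
    """Illustrative only: cap 'sampe -a' at the median insert size plus
    n_sd upper-tail standard deviations. The pipeline's real formula may
    differ."""
    return int(median_insert + n_sd * sd_above_median)

# With a 300 bp median and a 30 bp upper-tail SD this gives 420.
a_value = sampe_max_insert(300, 30)
```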

Change the alignment format from text to binary (SAM‐>BAM)

This is done using the SAMtools ‘import’ command.

Sort reads using the SAMtools sort command

The SAMtools 'sort' command sorts reads according to the leftmost coordinate.

Merge aligned files using the SAMtools merge command

All lanes and all runs are merged in a single step.

Remove PCR duplicates using the Picard MarkDuplicates command

The Picard ‘MarkDuplicates’ command can be used to mark or to remove PCR duplicates. It is run using the following options:

- REMOVE_DUPLICATES=true – Duplicates are removed rather than simply marked.
- MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=5000000 – This is set at an arbitrary value higher than the default (50000) so that Picard does not exit. A related suggestion from Picard's mailing list: "if it [Picard] is going to write to temp files, it needs a file handle for each sequence. There are two work-arounds at this point: either set the MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP value low, and give it lots of RAM, or else ask your system administrator to increase the limit on # of file handles to something like 15,000".
- ASSUME_SORTED=true – Sorting is performed in an earlier step, so Picard is told to assume the input is already sorted.
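
Put together, the options above yield an invocation like the following sketch; the jar path, heap size, and file names are placeholders, not the pipeline's real values:

```python
def build_markdup_command(in_bam, out_bam, metrics_file):
    """Assemble a Picard MarkDuplicates invocation with the options listed
    above. The jar location and -Xmx heap size are placeholders."""
    return ["java", "-Xmx4g", "-jar", "MarkDuplicates.jar",
            "INPUT=" + in_bam,
            "OUTPUT=" + out_bam,
            "METRICS_FILE=" + metrics_file,
            "REMOVE_DUPLICATES=true",
            "MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=5000000",
            "ASSUME_SORTED=true"]

cmd = build_markdup_command("combined.bam", "combined_rmdup.bam",
                            "metrics_duplicates")
```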

Check how many reads aligned and compute other alignment stats with Picard’s CollectAlignmentSummaryMetrics and CollectInsertSizeMetrics

Create pileup file using SAMtools pileup command

The pileup file shows the alignment at each genomic position and allows us to view how the reads “pile up” across the genome. It is a useful way to view coverage at specific sites and to hand check variant calls.

The SAMtools ‘pileup’ command is run using the following settings (these are all default):

- -M [60] – Mapping quality is capped at 60.
- -N [2] – Number of haplotypes in the sample is 2.
- -r [0.001] – Expected difference between two haplotypes is 0.1 percent.
- -G [0.00015] – Expected rate of indels between two haplotypes is 0.015 percent.
- -I [40] – Phred probability of an indel in sequencing or preparation is 40.
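
Each pileup line carries chromosome, 1-based position, reference base, and read depth, followed by the read-base and base-quality strings. A small Python parsing sketch under that assumed column layout:

```python
def parse_pileup_line(line):
    """Split one samtools pileup line into its fixed leading columns:
    chromosome, 1-based position, reference base, and read depth, followed
    by the read-base and base-quality strings."""
    fields = line.rstrip("\n").split("\t")
    return {"chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[2],
            "depth": int(fields[3]),
            "bases": fields[4] if len(fields) > 4 else "",
            "quals": fields[5] if len(fields) > 5 else ""}

rec = parse_pileup_line("chr1\t10583\tA\t12\t.,.,..,,..G,\tIIIIHHIIIIGI")
```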

Use pileup file to create bco directories and files

The bco files are needed to annotate a sample in SVA and are created from the pileup files. The script that creates the bco files (pileup2bco.q) is submitted as part of merge_final.sh and runs in parallel.

Run ERDS to call CNVs

The script that runs ERDS (erds.sh) is submitted at the end of pileup2bco.q.

Calculate coverage.

Coverage is calculated from the total number of alignable bases (the number of bases in the reference that are not 'N') and the total number of bases that aligned to the reference. This script is submitted in merge_final.sh and runs in parallel.
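
The calculation described above reduces to a ratio; a minimal sketch that sums per-site depths from pileup lines and divides by the count of non-'N' reference bases:

```python
def mean_coverage(pileup_lines, alignable_bases):
    """Average coverage: total aligned bases (the sum of the depth column
    across pileup lines) divided by the number of non-'N' reference bases."""
    total_aligned = sum(int(line.split("\t")[3])
                        for line in pileup_lines if line.strip())
    return total_aligned / float(alignable_bases)

# Two covered sites out of three alignable bases: (10 + 20) / 3 = 10x.
lines = ["chr1\t1\tA\t10\t..\tII", "chr1\t2\tC\t20\t..\tII"]
coverage = mean_coverage(lines, 3)
```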

Call SNPs and indels using SAMtools.pl varFilter command.

SNPs and indels are called using the SAMtools.pl ‘varFilter’ command with the following settings (these are all default except ‐D):

- -Q [25] – Minimum RMS mapping quality for SNPs is 25.
- -q [10] – Minimum RMS mapping quality for indels is 10.
- -d [3] – Minimum read depth is 3.
- -D [500] – Maximum read depth is 500.
- -G [25] – Minimum indel score for nearby SNP filtering is 25.
- -w [10] – SNPs within 10 bp around a gap are filtered.
- -W [10] – Window size for filtering dense SNPs is 10.
- -N [2] – Maximum number of SNPs in a window is 2.
- -l [30] – A window size of 30 is used for filtering adjacent gaps.
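
The per-SNP thresholds above can be expressed as a simple predicate. A sketch covering only the -Q, -d, and -D cutoffs; the real varFilter also applies the gap- and density-based filters, which are not reproduced here:

```python
def keep_snp(rms_mapq, depth, min_mapq=25, min_depth=3, max_depth=500):
    """Mirror varFilter's -Q / -d / -D cutoffs from the list above; the
    window-based filters (-G, -w, -W, -N, -l) are omitted."""
    return rms_mapq >= min_mapq and min_depth <= depth <= max_depth
```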

Separate SNPs and Indels into two different files using snp_filter.pl and indel_filter.pl

This step is required for variant files in the SAMtools pileup format. For other variant file formats such as VCF, all the variants can be in one file.

These two programs simply filter the single file created by SAMtools' varFilter into a file for SNPs and a file for indels that can be loaded into SVA. These two scripts do not write the individual base qualities, which are present in the varFilter output, to the separate SNP and indel files.
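
In the SAMtools pileup consensus format, indel lines carry '*' in the reference-base column, which is enough to route each varFilter line to the right file. A sketch relying on that convention (it skips the base-quality trimming the real scripts perform):

```python
def split_variants(lines):
    """Partition varFilter output lines into (snps, indels) by checking the
    third column: '*' marks an indel line in the pileup consensus format."""
    snps, indels = [], []
    for line in lines:
        if not line.strip():
            continue
        ref = line.split("\t")[2]
        (indels if ref == "*" else snps).append(line)
    return snps, indels

# Made-up example lines: one SNP, one indel.
variants = ["chr1\t100\tA\tG\t40\t40\t60\t15",
            "chr1\t200\t*\t+AG/+AG\t30\t30\t60\t8"]
snps, indels = split_variants(variants)
```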

Index the alignment file so that it can be viewed by users with the SAMtools alignment viewer

The ‘index’ command in SAMtools makes it possible to access the bam file rapidly.

Copy output to storage servers using the Solexa cluster.

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

For whole genome sequencing, run bwa_copy.pl on the Solexa cluster, which is accessed from sva.igsp.duke.edu. This creates the directory where the files will be stored, with a copy.sh script inside it. Submitting copy.sh to the Solexa cluster copies the files that are kept onto the seqsata set of drives.

The files that are kept are:

- The entire 'Logs' directory
- The entire 'Scripts' directory
- From the 'combined' directory:
  o combined_rmdup.bam
  o combined_rmdup.bam.bai
  o snps.samtools
  o indels.samtoolsindels
  o snps_indels
  o coverage.txt
  o erds/sample.events (this is for whole genome only)
  o metrics_alignment
  o metrics_duplicates

Take a 'snapshot' of alignment and variant files

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

The 'snapshot' is a text file that lists the files created during the sequence alignment and variant calling pipeline, their file sizes, and other information that may be used to verify that the copying process ran successfully. It is created automatically by copy.sh.

Run concordance check

Concordance checking is run as a command line feature and will be called as part of a script, but it has not been fully integrated into the pipeline. This section will be updated when it is integrated.

Check for errors

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

Notify the Genomic Analysis Facility of any irregularities seen during alignment

There are several irregularities that may become apparent during the alignment and variant calling pipeline which should be reported to the Genomic Analysis Facility. These may include:

- Low coverage:
  o Whole genome: average coverage should be > 25x.
  o Whole exome: the percentage of captured bases with coverage >= 5 should be greater than 0.90.
- Low proportion of reads aligned (should be > 80%).
- Low concordance.
- A high percentage of reads removed by Picard's PCR duplicate removal.
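
The quantified thresholds above can be checked mechanically before notifying the facility. A sketch covering only the coverage and alignment-rate rules, since the concordance and duplicate-rate cutoffs are not quantified above:

```python
def qc_flags(sample_type, avg_coverage=None, pct_captured_ge5=None,
             pct_reads_aligned=None):
    """Return a list of irregularities per the thresholds above.
    sample_type is 'genome' or 'exome'; None means 'not measured'."""
    flags = []
    if (sample_type == "genome" and avg_coverage is not None
            and avg_coverage <= 25):
        flags.append("low coverage")
    if (sample_type == "exome" and pct_captured_ge5 is not None
            and pct_captured_ge5 <= 0.90):
        flags.append("low coverage")
    if pct_reads_aligned is not None and pct_reads_aligned <= 0.80:
        flags.append("low alignment rate")
    return flags
```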

Detailed information for this section can be found in /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf.

Delete files from DSCR (Duke Shared Cluster Resource).

Once the files have been copied to the CHGV storage directories and it has been verified that there were no errors or irregularities during the process, all data pertaining to a sample is deleted from the DSCR.

Request that the Genomic Analysis Facility delete the FASTQ files from seqanalysis

This is done by email.

Release the sample to the lab

An email is sent to [email protected] stating that the alignment is complete and giving the location of the necessary files. The alignment statistics tables are also updated at this time:

/nfs/sva/SequenceAnalyses/Alignment_stats/exome_alignment_stats.txt
/nfs/sva/SequenceAnalyses/Alignment_stats/genome_alignment_stats.txt

Annotate Variants with SVA

The CHGV annotates variants using SVA. Once a week, samples released during the previous week are incorporated into one or two master projects located here:

/nfs/svaprojects/Master.SVA (genomes only)
/nfs/svaprojects/MasterExome.SVA (exomes and genome samples)

CHGV members who want to use variants from newly released samples can extract the information they need from these master projects.

Running the pipeline

A step-by-step guide to running the pipeline is available at /nfs/goldstein/goldsteinlab/Bioinformatics/BWA_Alignment_Documentation_basics.pdf. This gives detailed instructions on how to create the necessary files and directories to begin alignment, generate and run the scripts that perform alignment and variant calling, check for errors, copy the output to the appropriate locations, delete files, and release the sample to the lab.

Review and implementation of new software or additional steps

Periodically, software will be released that could be incorporated into the pipeline to enhance or replace software that is currently implemented. Before it can be incorporated, it must be shown to add demonstrable improvement to the pipeline. The type of improvement will determine the amount of effort we can expend to implement the software. Altering the pipeline may have severe consequences, so the following considerations must be taken into account:

- Will the addition shorten or lengthen the amount of time needed for the pipeline?
- Will previous samples need to be recalled?
- Has the software been tested outside the lab?
- What resources does it require in terms of processors, RAM, and disk space?
- What other software is required in order for the new software or step to run properly?
- Will it be performed the same on every sample?
- What tests can be performed, and what should be measured, in order to confirm any claimed improvements?

No software changes will be made without approval by CHGV management.

Software

There are a number of different software programs currently being used in the pipeline. These programs are subject to change if new software becomes available that will add demonstrable improvement to the pipeline. The programs are:

- BWA (documentation: http://bio-bwa.sourceforge.net/bwa.shtml)
- SAMtools (documentation: http://samtools.sourceforge.net/samtools.shtml)
- Picard (documentation: http://picard.sourceforge.net/command-line-overview.shtml)

Scripting

Our alignment/variant calling programs are run on the Duke Shared Cluster Resource (DSCR), which operates under the SGE environment. In general, scripts are written using Perl, tcsh, or bash. A typical tcsh script begins with a header similar to the following:

#$ -S /bin/tcsh -cwd
#$ -o /nfs/igsp/seqanalysis/ALIGNMENT/samples/hemo0022/Logs/aln1_1.out
#$ -e /nfs/igsp/seqanalysis/ALIGNMENT/samples/hemo0022/Logs/aln1_1.err

There are a number of options that can be set in this script. For more information see https://wiki.duke.edu/display/SCSC/SGE+Queueing+System.

Data freezes

When new software, processes, or parameters are added to the pipeline, the output may change in significant ways, potentially altering our ability to replicate published data and analyses. To preserve that ability, output that may be altered by introduced changes must be stored as a data freeze. Depending on what is changed, the following files may need to be saved:

- Variant files (snps.samtools and indels.samtoolsindels)
- bco files
- Alignment (.bam) files

Because the original files produced by the Genomic Analysis Facility are stored permanently, all data can be reproduced from the beginning. Data freezes simply make it much easier to replicate data and analyses. The current location for data freezes is

/nfs/sva2/SequenceAnalyses/svaData/dataFreeze.