NGS Reads Aligning and SNP Calling

Christophe Klopp - 2012

Genetic variation
http://en.wikipedia.org/wiki/Genetic_variation
Genetic variation, variations in alleles of genes, occurs both within and among populations. Genetic variation is important because it provides the “raw material” for natural selection.
http://studentreader.com/genotypes-phenotypes/

Types of variations
● SNP: single nucleotide polymorphism
● CNV: copy number variation
● Chromosomal rearrangement
● Chromosomal duplication
http://en.wikipedia.org/wiki/Copy-number_variation
http://en.wikipedia.org/wiki/Human_genetic_variation

The variation transmission
● Mutation: in molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).
● Mutations are transmitted if they are not lethal.
● Mutations can impact the phenotype.

Genetic markers and genotyping
● A set of SNPs is selected along the genome.
● The phenotypes are collected for individuals.
● The SNPs are genotyped (measured) for the same individuals.
● This makes it possible to find locations where the genotype is linked to the phenotype:
  – Major genes
  – QTL (Quantitative Trait Loci)

The 1000 genomes project
● Joint project NCBI / EBI
● Common data formats:
  – fastq
  – SAM (Sequence Alignment/Map)

Sequence Alignment/Map (SAM) format
➢ Data sharing was a major issue with the 1000 genomes project
➢ Capture all of the critical information about NGS data in a single indexed and compressed file
➢ Sharing of data across tools
➢ Generic alignment format
➢ Supports short and long reads (454 – Solexa – SOLiD)
➢ Flexible in style, compact in size, efficient in random access
Website: http://samtools.sourceforge.net
Paper: Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R.
and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

SAM format - Header section
➢ Header lines start with @ followed by a two-letter TAG
➢ Header fields are TYPE:VALUE pairs

SAM format - Alignment section
➢ 11 mandatory fields
➢ Variable number of optional fields
➢ Fields are tab delimited

SAM format - Full example
A SAM file is a header followed by alignment lines of the form:
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]

Optional field tags (TAG):
  X? : Reserved for end users
  NM : Number of nucleotide differences
  MD : String for mismatching positions
  RG : Read group
  [...]

Optional field value types (VTYPE):
  A : Printable character
  i : Signed 32-bit integer
  f : Single-precision float number
  Z : Printable string
  H : Hex string (high nybble first)

SAM format - Flag field
http://picard.sourceforge.net/explain-flags.html

SAM format - Extended CIGAR format
Ref:  GCATTCAGATGCAGTACGC
Read:   ccTCAG--GCATTAgtg
POS  CIGAR
5    2S4M2D6M3S

BAM format
➢ Binary representation of SAM
➢ Compressed by the BGZF library
➢ Greatly reduces storage space requirements, to about 27% of the original SAM

SAMtools
➢ Library and software package
➢ Creating sorted and indexed BAM files from SAM files
➢ Removing PCR duplicates
➢ Merging alignments
➢ Visualization of alignments from BAM files
➢ SNP calling
➢ Short indel detection
http://samtools.sourceforge.net/samtools.shtml

SAMtools - Example usage
➢ Create BAM from SAM
  samtools view -bS -o aln.bam aln.sam
➢ Sort BAM file (samtools 0.1.x syntax; samtools 1.x expects: samtools sort -o sortedExample.bam example.bam)
  samtools sort example.bam sortedExample
➢ Merge sorted BAM files
  samtools merge sortedMerge.bam sorted1.bam sorted2.bam
➢ Index BAM file
  samtools index sortedExample.bam
➢ Visualize BAM file
  samtools tview sortedExample.bam reference.fa

Exercise 1
● Downloading the SAM file: –
● Visualizing the file content: find the number of exactly matching reads
● Installing SAMtools: –
http://samtools.sourceforge.net/samtools.shtml
● Producing a sorted and indexed BAM file from the SAM file

Visualizing the alignment: IGV
➢ IGV: Integrative Genomics Viewer
➢ Website: http://www.broadinstitute.org/igv
➢ High-performance visualization tool
➢ Interactive exploration of large, integrated datasets
➢ Supports a wide variety of data types
➢ Documentation
➢ Developed at the Broad Institute of MIT and Harvard

Visualizing the alignment: IGV - Loading the reference
Visualizing the alignment: IGV - Loading the BAM file
Visualizing the alignment: IGV - Zoom
Visualizing the alignment: IGV - Loading a GFF file

Visualizing the alignment: IGV - Coverage
➢ Generate the coverage information to be displayed in IGV:
  java -jar igvtools.jar count aln.bam aln.depth.tdf ref.genome
➢ Remark: ref.genome was generated when we imported the genome sequence
➢ This step is optional, but it is essential if you want to see the read depth at large scale.

Exercise 2
● Open IGV with Java Web Start: http://www.broadinstitute.org/igv
● Create the genome using the FASTA file
● Load the sorted BAM file

The pileup format
Columns: Chr. - Coord. - Base ('*' for indel) - Number of reads covering the site - Read bases - Base qualities
Read bases:
➢ '.'
and ',': match to the reference base on the forward/reverse strand
➢ 'ACGTN' and 'acgtn': mismatch on the forward/reverse strand
➢ '^' and '$': start/end of a read segment
➢ '+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+': insertion/deletion
http://samtools.sourceforge.net/pileup.shtml

Using mpileup
➢ Get the raw pileup:
  samtools mpileup -f ref.fa aln.bam > raw.txt
  ref.fa   Reference genome, in FASTA format
  aln.bam  Sorted alignments, in BAM format
  raw.txt  Output, in pileup format
  -f       Reference sequence, ref.fa (in FASTA format)

Mpileup output

Variant Calling
  samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
  bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
  var.raw.bcf  Binary compressed variants
  var.flt.vcf  Text filtered variants
  -b     output BCF instead of VCF
  -v     output potential variant sites only (force -c)
  -c     SNP calling (force -e)
  -g     call genotypes at variant sites (force -c)
  -D100  maximum read depth [10000000]

VCF format

Exercise 3
● Visualize the BAM and BAI files in IGV
● Produce a TDF file for the coverage
● Find SNPs from the mpileup file
● Transform it into GFF
● Load the GFF in IGV
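
The alignment line layout from the "SAM format" slides (11 tab-delimited mandatory fields followed by optional TAG:VTYPE:VALUE fields) can be illustrated with a minimal Python sketch. The names SamRecord and parse_sam_line, and the example read, are hypothetical and only serve the illustration; real projects would normally rely on SAMtools or an existing library.

```python
from typing import Dict, NamedTuple, Tuple


class SamRecord(NamedTuple):
    """One alignment line, using the field names shown on the slide."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    mrnm: str
    mpos: int
    isize: int
    seq: str
    qual: str
    tags: Dict[str, Tuple[str, str]]  # TAG -> (VTYPE, VALUE)


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-delimited alignment line into its fields."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment line")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs 11 mandatory fields")
    tags = {}
    for optional in fields[11:]:
        # Optional fields look like NM:i:3; the value may itself contain ':'.
        tag, vtype, value = optional.split(":", 2)
        tags[tag] = (vtype, value)
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Hypothetical read built from the CIGAR slide (POS 5, CIGAR 2S4M2D6M3S).
EXAMPLE_LINE = "\t".join([
    "read1", "0", "chr1", "5", "30", "2S4M2D6M3S", "*", "0", "0",
    "CCTCAGGCATTAGTG", "IIIIIIIIIIIIIII", "NM:i:3",
])

record = parse_sam_line(EXAMPLE_LINE)
print(record.rname, record.pos, record.cigar, record.tags)
```

Values are kept as strings in the tags dictionary; converting them according to VTYPE (i, f, Z, ...) is left out to keep the sketch short.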
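
The FLAG field is a bitwise combination of properties, which is what the Picard "explain flags" page linked on the flag slide decodes. A minimal sketch of the same decoding follows; the bit values are those of the SAM specification, while the short labels and the function name decode_flag are assumptions made for this example.

```python
# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary_alignment"),
    (0x200, "fails_quality_checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary_alignment"),
]


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


for value in (0, 16, 99):
    print(value, decode_flag(value))
```

For example, 99 = 1 + 2 + 32 + 64, so it decodes to a paired read, mapped in a proper pair, first in the pair, with its mate on the reverse strand.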
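
The extended CIGAR example (POS 5, CIGAR 2S4M2D6M3S) can be checked with a short sketch that splits the string into operations and derives how many reference and read bases it covers. The function names are hypothetical; the operation letters are those of the SAM specification.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":  # '*' means the CIGAR is unavailable
        return []
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def reference_span(ops: list) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(length for length, op in ops if op in "MDN=X")


def query_length(ops: list) -> int:
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(length for length, op in ops if op in "MIS=X")


ops = parse_cigar("2S4M2D6M3S")
pos = 5
print(ops)
print("reference span:", reference_span(ops))
print("last aligned reference position:", pos + reference_span(ops) - 1)
print("read length:", query_length(ops))
```

With the slide's values, the alignment covers 12 reference bases (positions 5 to 16) and the read is 15 bases long, 5 of which are soft clipped.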
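
The read bases column of the pileup format can be walked character by character, following the symbols listed on the pileup slide. The sketch below counts matches, mismatches, insertions and deletions for one site; the function name and the example string are made up for illustration, and the sketch assumes that '^' is followed by one mapping-quality character, as described in the SAMtools pileup documentation.

```python
def summarize_read_bases(bases: str) -> dict:
    """Count the event types encoded in one pileup read-bases string."""
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    i = 0
    while i < len(bases):
        char = bases[i]
        if char == "^":
            # Start of a read segment: skip '^' and the mapping-quality character.
            i += 2
        elif char in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            counts["insertion" if char == "+" else "deletion"] += 1
            i = j + length
        elif char in ".,":
            counts["match"] += 1
            i += 1
        elif char in "ACGTNacgtn":
            counts["mismatch"] += 1
            i += 1
        else:
            # '$' (end of a read segment), '*' (deleted base) and other symbols.
            i += 1
    return counts


# Hypothetical read-bases string for a single site.
print(summarize_read_bases(".$,^]A.+2AG,-1t*"))
```

A site with a high proportion of mismatches among its covering reads is a SNP candidate, which is the kind of evidence the variant calling step evaluates statistically.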
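
The -D100 option on the variant calling slide sets a maximum read depth. As a rough illustration of that single criterion only, the sketch below drops VCF records whose INFO field reports a depth (DP) above a threshold. This is a simplified stand-in and not the vcfutils.pl implementation, which applies further filters; the function names and the example records are hypothetical.

```python
from typing import List, Optional


def info_depth(vcf_line: str) -> Optional[int]:
    """Return the DP value from the INFO column (8th column), if present."""
    info = vcf_line.rstrip("\n").split("\t")[7]
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP":
            return int(value)
    return None


def filter_max_depth(lines: List[str], max_depth: int = 100) -> List[str]:
    """Keep header lines and records whose depth does not exceed max_depth."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header lines are passed through unchanged
            continue
        depth = info_depth(line)
        if depth is None or depth <= max_depth:
            kept.append(line)
    return kept


# Hypothetical records: CHROM POS ID REF ALT QUAL FILTER INFO
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t14\t.\tG\tT\t50\t.\tDP=30;MQ=40",
    "chr1\t250\t.\tA\tC\t60\t.\tDP=450;MQ=20",
]
for line in filter_max_depth(vcf_lines, max_depth=100):
    print(line)
```

Sites with unusually high depth are often filtered because they tend to fall in repeated regions where reads from several loci pile up.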
