NGS read alignment and SNP calling
Christophe Klopp - 2012
Genetic variation
http://en.wikipedia.org/wiki/Genetic_variation
Genetic variation, variations in alleles of genes, occurs both within and among populations. Genetic variation is important because it provides the “raw material” for natural selection.
http://studentreader.com/genotypes-phenotypes/
Types of variations
● SNP : Single nucleotide polymorphism
● CNV : copy number variation
● Chromosomal rearrangement
● Chromosomal duplication
http://en.wikipedia.org/wiki/Copy-number_variation
http://en.wikipedia.org/wiki/Human_genetic_variation
The variation transmission
● Mutation : In molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).
● Mutations are transmitted if they are not lethal.
● Mutations can impact the phenotype.
Genetic markers and genotyping
● A set of SNPs is selected along the genome.
● The phenotypes are collected for individuals.
● The SNPs are genotyped (measured) for the same individuals.
● This makes it possible to find locations where the genotype is linked to the phenotype :
– Major genes
– QTL (Quantitative Trait Loci)
The 1000 genomes project
● Joint project NCBI / EBI
● Common data formats :
– fastq
– SAM (Sequence Alignment/Map)
Sequence Alignment/Map (SAM) format
➢ Data sharing was a major issue with the 1000 genomes
➢ Capture all of the critical information about NGS data in a single indexed and compressed file
➢ Sharing data across tools
➢ Generic alignment format
➢ Supports short and long reads (454 – Solexa – Solid)
➢ Flexible in style, compact in size, efficient in random access
Website : http://samtools.sourceforge.net
Paper : Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
SAM format - Header section
➢ Header lines start with @ followed by a two-letter record type code (e.g. @HD, @SQ, @RG, @PG)
➢ Header fields are TAG:VALUE pairs
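As a minimal sketch of the header layout described above, the function below splits one header line into its two-letter code and its colon-separated pairs. The function name `parse_header_line` and the example line are illustrative assumptions, not part of SAMtools.

```python
from typing import Dict, Tuple


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split one SAM header line into (two-letter code, {key: value})."""
    if not line.startswith("@"):
        raise ValueError("a SAM header line must start with '@'")
    fields = line.rstrip("\n").split("\t")
    code = fields[0][1:]  # drop the leading '@'
    if code == "CO":
        # Comment lines carry free text instead of key:value pairs.
        return code, {"comment": "\t".join(fields[1:])}
    # Split each field on the first ':' only, values may contain ':'.
    pairs = dict(field.split(":", 1) for field in fields[1:])
    return code, pairs


# Hypothetical header line describing one reference sequence.
code, pairs = parse_header_line("@SQ\tSN:chr1\tLN:1000")
print(code, pairs)
```

Values are kept as strings here; converting `LN` to an integer, for instance, is left to the caller.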
SAM format - Alignment section
➢ 11 mandatory fields
➢ Variable number of optional fields
➢ Fields are tab delimited
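The three points above can be sketched as a small parser: split the line on tabs, read the 11 mandatory fields, then treat everything after them as optional TAG:TYPE:VALUE fields. The field names follow the SAM specification; the `SamRecord` class, the function name, and the example read are assumptions made for illustration.

```python
from typing import Dict, NamedTuple, Union

TagValue = Union[int, float, str]


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # extended CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities
    tags: Dict[str, TagValue]  # optional fields


def parse_alignment_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-delimited fields")
    tags: Dict[str, TagValue] = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        if value_type == "i":
            tags[tag] = int(value)
        elif value_type == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value  # A, Z, H and others are kept as text
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Hypothetical alignment line with two optional fields.
example = ("read1\t0\tchr1\t5\t30\t2S4M2D6M3S\t*\t0\t0\t"
           "CCTCAGGCATTAGTG\t*\tNM:i:3\tRG:Z:sample1")
record = parse_alignment_line(example)
print(record.rname, record.pos, record.cigar, record.tags)
```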
SAM format - Full example
[Figure: full SAM example, with the Header and Alignment sections labelled]
Optional field tags :
X? : Reserved for end users
NM : Number of nucleotide differences
MD : String for mismatching positions
RG : Read group
[...]
Optional field value types :
A : Printable character
i : Signed 32-bit integer
f : Single-precision float number
Z : Printable string
H : Hex string (high nybble first)
SAM format - Flag field
http://picard.sourceforge.net/explain-flags.html
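The flag is a sum of bits, each of which answers one yes/no question about the read. A minimal sketch of the decoding done by the page above follows; the bit values come from the SAM specification, while the short labels and the function name `explain_flag` are illustrative choices.

```python
from typing import List

# (bit value, short label) pairs, as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary_alignment"),
    (0x200, "fails_quality_checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary_alignment"),
]


def explain_flag(flag: int) -> List[str]:
    """Return the labels of all bits that are set in a SAM flag."""
    return [label for bit, label in FLAG_BITS if flag & bit]


# 99 = 64 + 32 + 2 + 1
print(explain_flag(99))
```

A flag of 0 decodes to an empty list: an unpaired read mapped to the forward strand.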
SAM format - Extended CIGAR format
Ref:  GCATTCAGATGCAGTACGC
Read: ccTCAGGCATTAgtg
POS : 5
CIGAR : 2S4M2D6M3S
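To make the example above concrete, the sketch below parses a CIGAR string and works out how many read bases and how many reference bases it covers. Which operations consume the read or the reference follows the SAM specification; the function names are assumptions made for illustration.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '2S4M2D6M3S' into [(2, 'S'), (4, 'M'), ...]."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches any character the pattern skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def read_length(cigar: str) -> int:
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


def reference_span(cigar: str) -> int:
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


cigar, pos = "2S4M2D6M3S", 5
end = pos + reference_span(cigar) - 1  # last reference position covered
print(read_length(cigar), reference_span(cigar), end)
```

For the slide's example this gives a read length of 15 (matching the 15 bases of the read), a reference span of 12, and an alignment ending at reference position 16.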
BAM format
➢ Binary representation of SAM
➢ Compressed by BGZF library
➢ Greatly reduces storage space requirements to about 27% of original SAM
SAMtools
➢ Library and software package
➢ Creating sorted and indexed BAM files from SAM files
➢ Removing PCR duplicates
➢ Merging alignments
➢ Visualization of alignments from BAM files
➢ SNP calling
➢ Short indel detection
http://samtools.sourceforge.net/samtools.shtml
SAMtools - Example usage
➢ Create BAM from SAM
    samtools view -bS -o aln.bam aln.sam
➢ Sort BAM file
    samtools sort example.bam sortedExample
➢ Merge sorted BAM files
    samtools merge sortedMerge.bam sorted1.bam sorted2.bam
➢ Index BAM file
    samtools index sortedExample.bam
➢ Visualize BAM file
    samtools tview sortedExample.bam reference.fa
Exercise 1
● Downloading the sam file : –
● Visualizing the file content : find the number of exactly matching reads
● Installing the samtools :
– http://samtools.sourceforge.net/samtools.shtml
● Producing a sorted and indexed bam file from the sam file
Visualizing the alignment with IGV
➢ IGV : Integrative Genomics Viewer
➢ Website : http://www.broadinstitute.org/igv
➢ High-performance visualization tool
➢ Interactive exploration of large, integrated datasets
➢ Supports a wide variety of data types
➢ Documentation
➢ Developed at the Broad Institute of MIT and Harvard
Visualizing the alignment with IGV - Loading the reference
Visualizing the alignment with IGV - Loading the bam file
Visualizing the alignment with IGV - Zoom
Visualizing the alignment with IGV - Loading a gff file
Visualizing the alignment with IGV - Coverage
➢ Generate the coverage information to be displayed in IGV.
    java -jar igvtools.jar count aln.bam aln.depth.tdf ref.genome
➢ Remark : ref.genome was generated when we imported the genome sequence
➢ This step is optional, but it is essential if you want to see the read depth information at large scale.
Exercise 2
● Open IGV with Java Web Start : http://www.broadinstitute.org/igv
● Create the genome using the fasta file
● Load the sorted bam file
The pileup format
Columns : chromosome - coordinate - reference base ('*' for an indel line) - number of reads covering the site - read bases* - base qualities
* Read bases :
➢ '.' and ',' : match to the reference base on the forward/reverse strand
➢ 'ACTGN' and 'actgn' : for a mismatch on the forward/reverse strand
➢ '^' and '$' : start/end of a read segment
➢ '+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletion
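The read-bases column can be decoded one symbol at a time. The sketch below counts each kind of event in one such string. It assumes, following the samtools pileup documentation rather than this slide, that '^' is always followed by one character encoding the mapping quality, and that '*' marks a base deleted in the read; the function name and the example string are hypothetical.

```python
import re
from collections import Counter


def summarize_read_bases(read_bases: str) -> Counter:
    """Count matches, mismatches, indels and read starts/ends."""
    counts: Counter = Counter()
    i = 0
    while i < len(read_bases):
        symbol = read_bases[i]
        if symbol == "^":
            counts["read_start"] += 1
            i += 2  # skip '^' and the mapping-quality character after it
        elif symbol == "$":
            counts["read_end"] += 1
            i += 1
        elif symbol in "+-":
            number = re.match(r"\d+", read_bases[i + 1:])
            if number is None:
                raise ValueError(f"indel without a length at offset {i}")
            counts["insertion" if symbol == "+" else "deletion"] += 1
            # skip the sign, the digits, and the inserted/deleted bases
            i += 1 + len(number.group()) + int(number.group())
        elif symbol in ".,":
            counts["match"] += 1
            i += 1
        elif symbol in "ACGTNacgtn":
            counts["mismatch"] += 1
            i += 1
        elif symbol == "*":
            counts["deleted_base"] += 1
            i += 1
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at offset {i}")
    return counts


# Hypothetical read-bases string covering every case in the list above.
summary = summarize_read_bases(".$,,^F.A+2AG,-1c.")
print(dict(summary))
```

A site where the mismatch count is a large share of the covering reads is a candidate SNP; the thresholds to apply are a separate decision.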
http://samtools.sourceforge.net/pileup.shtml
Using mpileup
➢ Get the raw pileup:
    samtools mpileup -f ref.fa aln.bam > raw.txt
ref.fa : Fasta formatted file of the reference genome
aln.bam : Sorted BAM formatted file, from the alignments
raw.txt : Output in pileup format
-f : Reference sequence, ref.fa (in FastA format)
Mpileup output
Variant Calling
    samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
    bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
var.raw.bcf : binary compressed variants
var.flt.vcf : text filtered variants
-b : output BCF instead of VCF
-v : output potential variant sites only (force -c)
-c : SNP calling (force -e)
-g : call genotypes at variant sites (force -c)
-D100 : maximum read depth [10000000]
VCF format
Exercise 3
● Visualize the bam and bai files in IGV
● Produce a tdf file for the coverage
● Find SNPs from the mpileup file
● Transform it into gff
● Load the gff in IGV
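One possible way to approach the "transform it into gff" step is sketched below, assuming the variants are available as VCF text (for example var.flt.vcf). It uses only the first six VCF columns (CHROM, POS, ID, REF, ALT, QUAL). The feature type labels, the `source` value, and the attribute names are illustrative choices, not something prescribed by the exercise.

```python
from typing import Iterable, Iterator


def vcf_line_to_gff(line: str, source: str = "samtools") -> str:
    """Convert one VCF data line into one GFF3 feature line."""
    chrom, pos, _vid, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
    start = int(pos)
    end = start + len(ref) - 1  # the feature spans the reference allele
    single_base = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
    feature_type = "SNV" if single_base else "indel"
    attributes = f"ID={chrom}_{pos};Reference_seq={ref};Variant_seq={alt}"
    return "\t".join([chrom, source, feature_type, str(start), str(end),
                      qual, ".", ".", attributes])


def vcf_to_gff(lines: Iterable[str]) -> Iterator[str]:
    """Yield a GFF3 header followed by one feature per VCF record."""
    yield "##gff-version 3"
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip VCF header lines and blank lines
        yield vcf_line_to_gff(line)


# Hypothetical VCF content: one header line, one SNP, one deletion.
vcf_text = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=20",
    "chr1\t200\t.\tCAT\tC\t30\t.\tINDEL;DP=15",
]
gff_lines = list(vcf_to_gff(vcf_text))
print("\n".join(gff_lines))
```

Writing `gff_lines` to a file with a .gff extension gives a track that can then be loaded next to the bam file.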