Genomic alignment (mapping) and SNP / polymorphism calling
Jérôme Mariette & Christophe Klopp
http://bioinfo.genotoul.fr/
Bioinfo Genotoul platform
– Since 2008
● 1 Roche 454
● 1 MiSeq
● 2 HiSeq
– Providing
● Data processing for quality control
● Secure data access to end users
http://bioinfo.genotoul.fr/
http://ng6.toulouse.inra.fr/
2 Bioinfo Genotoul : Services
– High speed computing facility access
– Application and web-server hosting
– Training
– Support
– Project partnership
3 Genetic variation
Genetic variation, i.e. variation in the alleles of genes, occurs both within and among populations. Genetic variation is important because it provides the “raw material” for natural selection.
http://en.wikipedia.org/wiki/Genetic_variation
http://studentreader.com/genotypes-phenotypes/
4 Types of variations
● SNP : Single nucleotide polymorphism
● CNV : copy number variation
● Chromosomal rearrangement
● Chromosomal duplication
http://en.wikipedia.org/wiki/Copy-number_variation
http://en.wikipedia.org/wiki/Human_genetic_variation
5 The variation transmission
● Mutation : In molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).
● Mutations are transmitted if they are not lethal.
● Mutations can impact the phenotype.
6 Genetic markers and genotyping
● A set of SNPs is selected along the genome.
● The phenotypes are collected for individuals.
● The SNPs are genotyped (measured) for the same individuals.
● This makes it possible to find locations where the genotype is linked to the phenotype :
– Major genes
– QTL (Quantitative Trait Loci)
http://snp.toulouse.inra.fr/~sigenae/50K_goat_snp_chip/index.html
7 Where are we?
[Diagram: map of sequencing analyses. Recoverable labels: Sequencing, De novo assembly, Alignment, SNP calling, Genome, Transcriptome, ChIP-Seq, Methylation]
8 Overview
[Diagram: workflow overview. Recoverable labels: SRA / ENA, fastq, FastQC, BWA (aln / bwasw), SAM, samtools (view/merge/sort/...), BAM, pileup, samtools.pl varfilter, awk, GFF, IGV]
9 The pieces of software
● Quality : fastqc
● BWA : alignment
● Samtools : formatting, SNP discovery
● IGV : visualisation
10 The 1000 genomes project
● Joint project NCBI / EBI
● Common data formats :
– fastq
– SAM (Sequence Alignment/Map)
11 NGS platforms
● Two platforms :
– Illumina Solexa
– Roche 454
12 Sequencing bias bibliography
13 Sequencing bias
● Platform related
● Roche 454 (data from Jean-Marc Aury CNS)
– 99.9% mapped reads
– Mean error rate : 0.55%
– 37% deletions, 53% insertions, 10% substitutions
– Homopolymer errors
– emPCR duplications
● Solexa (data from Jean-Marc Aury CNS)
– 98.5% mapped reads
– Mean error rate : 0.38%
– 3% deletions, 2% insertions, 95% substitutions
– Low coverage of A/T-rich regions
14 What data will we use?
● The needed data :
– A reference sequence :
   ● Genome
   ● Parts of the genome
   ● Transcriptome
– Short reads
15 Where to get a reference genome?
● Assemble your own
● Use a public assembly :
– NCBI : GenBank
– EMBL
16 Where to get short reads?
● Produce your own sequences :
– CNS
– Local platform
– Private company
● Use public data :
– SRA : NCBI Sequence Read Archive
– ENA : EMBL/EBI European Nucleotide Archive
17 NCBI SRA?
18 EBI ENA
19 Meta data
● Meta data structure :
– Experiment
– Sample
– Study
– Run
– Data file
20 Which reads should I keep?
● All
● Some : which criteria and thresholds should I use ?
– Composition (number of Ns, complexity, ...)
– Quality
– Alignment-based criteria
● Should I trim the reads using :
– Composition
– Quality
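As an illustration of quality-based trimming and composition-based filtering, here is a minimal Python sketch. The function names, the thresholds (minimum quality 20, minimum length 30, at most 10% Ns) and the Phred+33 quality encoding are assumptions made for this example; they are not recommendations from the slides.

```python
def phred33(qual):
    """Decode a FASTQ quality string, assuming the Phred+33 encoding."""
    return [ord(c) - 33 for c in qual]


def trim_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose quality is below min_q (hypothetical threshold)."""
    scores = phred33(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, min_len=30, max_n_fraction=0.1):
    """Composition filter: reject reads that are too short or contain too many Ns."""
    if not seq or len(seq) < min_len:
        return False
    return seq.upper().count("N") / len(seq) <= max_n_fraction


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
print(trim_3prime("ACGTAC", "IIII##"))   # ('ACGT', 'IIII')
print(keep_read("ANNT", min_len=4))      # False: half of the bases are N
```

Dedicated trimming tools apply more elaborate rules (sliding windows, adapter removal); this sketch only shows the principle.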
21 Basic reads statistics
● Number of reads
● Length histogram
● Number of Ns in the reads
● Reads quality
● Reads redundancy
● Reads complexity
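Several of these statistics can be computed in a single pass over a FASTQ file. The sketch below assumes the simple 4-lines-per-record FASTQ layout; the function names and the in-memory example reads are made up for illustration.

```python
import io
from collections import Counter


def fastq_records(handle):
    """Yield (name, sequence, quality) from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:].strip(), seq, qual


def basic_stats(handle):
    """Count reads, read lengths, Ns per read and exact duplicate sequences."""
    n_reads = 0
    lengths = Counter()
    n_per_read = Counter()
    sequences = Counter()
    for _, seq, _ in fastq_records(handle):
        n_reads += 1
        lengths[len(seq)] += 1
        n_per_read[seq.upper().count("N")] += 1
        sequences[seq] += 1
    return {
        "reads": n_reads,
        "length_histogram": dict(lengths),
        "n_histogram": dict(n_per_read),
        "duplicate_reads": sum(c - 1 for c in sequences.values()),
    }


# Three made-up reads; r1 and r3 are identical.
example = "@r1\nACGTN\n+\nIIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGTN\n+\nIIIII\n"
stats = basic_stats(io.StringIO(example))
print(stats)
```

Keeping every sequence in memory to detect redundancy is only practical for small files; it is done here to keep the example short.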
22 QC : per base quality
● FastQC :
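The per-base quality plot summarizes the quality values observed at each read position. A rough Python sketch of the underlying computation (mean only, whereas the plot also shows quartiles) could look like this, assuming Phred+33 encoded quality strings; the function name is hypothetical.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred quality at each read position (offset=33 is an assumption)."""
    totals, counts = [], []
    for qual in quality_strings:
        for pos, ch in enumerate(qual):
            if pos == len(totals):      # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# 'I' = Q40, '5' = Q20, '#' = Q2 in Phred+33.
print(mean_quality_per_position(["II#", "I5"]))  # [40.0, 30.0, 2.0]
```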
23 QC : sequence content
● FastQC :
24 QC : undetermined bases
N content across all bases
25 Read alignment
● The different software generations :
– Needleman-Wunsch (1970) / Smith-Waterman (1981)
– BLAST (1990)
– MAQ (2008)
– BWA (2009)
26 BWA
● Fast and moderate memory footprint (<4GB)
● SAM output by default
● Gapped alignment for both SE and PE reads
● Effective pairing to achieve high alignment accuracy; suboptimal hits considered in pairing.
● A non-unique read is placed randomly, with a mapping quality of 0
● Limited number of errors (2 for 32bp, 4 for 100 bp, ...)
● The default configuration works for most typical input.
– Automatically adjusts parameters based on read lengths and error rates.
– Estimates the insert size distribution on the fly.
http://bio-bwa.sourceforge.net/
27 BWA prefix trie
● Word 'googol'
● ^ = start character
● --- Search of 'lol' with one error
The prefix trie is compressed so that it fits in memory in most cases (1 GB for the human genome).
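The 'googol' / 'lol' example can be reproduced with a small, uncompressed prefix trie. This is only a sketch of the idea: BWA works on a compressed index rather than on nested dictionaries, and the search below allows substitutions only, not gaps. The function names are made up for this example.

```python
def build_prefix_trie(text):
    """Insert every prefix of text, read backwards, into a trie.

    A node reached by k edges from the root spells a substring of length k
    (read from the node back to the root); 'starts' lists where it begins.
    """
    root = {"children": {}, "starts": []}
    for end in range(1, len(text) + 1):
        node = root
        for depth, ch in enumerate(reversed(text[:end]), start=1):
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(end - depth)
    return root


def find_with_mismatches(root, query, max_mismatches):
    """Return sorted (start, mismatches) hits, walking the query from its last base."""
    hits = set()

    def walk(node, i, budget):
        if i < 0:  # whole query consumed: every occurrence of this node is a hit
            hits.update((s, max_mismatches - budget) for s in node["starts"])
            return
        for ch, child in node["children"].items():
            cost = 0 if ch == query[i] else 1
            if cost <= budget:
                walk(child, i - 1, budget - cost)

    walk(root, len(query) - 1, max_mismatches)
    return sorted(hits)


trie = build_prefix_trie("googol")
print(find_with_mismatches(trie, "lol", 1))  # [(3, 1)] : 'gol' at offset 3, one mismatch
print(find_with_mismatches(trie, "go", 0))   # [(0, 0), (3, 0)]
```

The mismatch budget prunes the search early, which is what keeps the inexact search tractable on a real genome.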
28 http://bio-bwa.sourceforge.net/bwa.shtml
29 Commands
Reference sequence indexing :
  bwa index -a bwtsw db.fasta
Read alignment :
  bwa aln db.fasta short_read.fastq > aln_sa.sai
  bwa bwasw database.fasta long_read.fastq > aln.sam
Formatting unpaired reads :
  bwa samse db.fasta aln_sa.sai short_read.fastq > aln.sam
Formatting paired-end reads :
  bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam
  bwa bwasw database.fasta long_read1.fastq long_read2.fastq > aln.sam
30 Index
31 aln
32 samse & sampe
33 bwasw
34 Sequence Alignment/Map (SAM) format
➢ Data sharing was a major issue with the 1000 genomes
➢ Capture all of the critical information about NGS data in a single indexed and compressed file
➢ Sharing data across tools
➢ Generic alignment format
➢ Supports short and long reads (454, Solexa, SOLiD)
➢ Flexible in style, compact in size, efficient in random access
Website : http://samtools.sourceforge.net
Paper : Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
35 Sequence Alignment/Map (SAM) format
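A SAM alignment line is tab-separated, with 11 mandatory fields followed by optional tags. The Python sketch below splits one line and decodes the bitwise FLAG field; the example record is invented, and the dictionary keys are lower-cased names chosen for this example.

```python
# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair", 0x100: "secondary",
    0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")


def parse_sam_line(line):
    """Split one alignment line into its mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = cols[11:]
    return record


def decode_flag(flag):
    """List the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Made-up record for illustration only.
line = "read1\t99\tchr1\t100\t60\t5M\t=\t300\t205\tACGTA\tIIIII\tNM:i:0"
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["cigar"])
print(decode_flag(record["flag"]))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
```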
36 BAM format
➢ Binary representation of SAM
➢ Compressed by BGZF library
➢ Greatly reduces storage space requirements to about 27% of original SAM
37 SAMtools
➢ Library and software package
➢ Creating sorted and indexed BAM files from SAM files
➢ Removing PCR duplicates
➢ Merging alignments
➢ Visualization of alignments from BAM files
➢ SNP calling
➢ Short indel detection
http://samtools.sourceforge.net/samtools.shtml
38 SAMtools Example usage
39 SAMtools Example usage
➢ Create BAM from SAM
  samtools view -bS aln.sam -o aln.bam
➢ Sort BAM file
  samtools sort example.bam sortedExample
➢ Merge sorted BAM files
  samtools merge sortedMerge.bam sorted1.bam sorted2.bam
➢ Index BAM file
  samtools index sortedExample.bam
➢ Visualize BAM file
  samtools tview sortedExample.bam reference.fa
40 Picard
➢ A SAMtools complementary package
➢ More format conversion than SAMtools
➢ Visualization of alignments not available
➢ SNP calling & short indel detection not available
http://picard.sourceforge.net/
41 GATK
➢ It's a cross-platform application programming interface (API), written in Java, specifically designed for working with gargantuan (up to hundreds of terabytes) next-generation sequencing (NGS) datasets.
➢ It's a set of tools built upon that API for performing certain processing and analysis tasks on NGS data.
http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page
42 GATK: realignment / recalibration
● Realignment : in particular areas, GATK splits reads to obtain a better alignment.
● Recalibration : recalculates the base qualities after the alignment.
● High variability between the SNP counts of the calling tools :
[Bar chart: SNP counts (axis 0 to 80000) for Mpileup, Mpileup -B, Mpileup -E, GATK and Popoolation2 on raw, realigned, recalibrated, and realigned & recalibrated data. Δ labels, in plot order : 777%, 714%, 42%, 45%]
43 GATK: realignment / recalibration
● The GATK realignment INDEL zone
44 GATK: realignment / recalibration
● The GATK realignment
45 GATK: realignment / recalibration
● The GATK realignment
Raw BAM : 101M, 2 mismatches
46 GATK: realignment / recalibration
● The GATK realignment
47 GATK: realignment / recalibration
● The GATK realignment
Realigned BAM : 98M25D3M, 1 mismatch
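The two CIGAR strings of this example (101M before realignment, 98M25D3M after) describe the same 101-base read; only the span on the reference changes. A small Python sketch, with hypothetical function names, makes this explicit:

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")   # operations that consume read bases
REF_OPS = set("MDN=X")     # operations that consume reference bases


def parse_cigar(cigar):
    """Turn '98M25D3M' into [(98, 'M'), (25, 'D'), (3, 'M')]."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (read length, reference span) implied by a CIGAR string."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in QUERY_OPS)
    ref_span = sum(n for n, op in ops if op in REF_OPS)
    return read_len, ref_span


print(cigar_lengths("101M"))      # (101, 101)
print(cigar_lengths("98M25D3M"))  # (101, 126)
```

The realigned read still has 101 bases, but it now spans 126 reference positions because of the 25-base deletion.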
48 GATK: realignment / recalibration
● The GATK recalibration
« The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context. »
DePristo et al. (2011) – Ewing and Green (1998) – Li et al. (2004 ; 2009)
Mean BQ = 32.8 - Median = 36.7
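Base qualities are expressed on the Phred scale, Q = -10 log10(P), where P is the probability that the base call is wrong. The sketch below converts in both directions; the function names are chosen for this example.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)


for q in (20, 30, 32.8):
    print(q, phred_to_error_probability(q))
```

A quality of 20 corresponds to 1 error in 100 calls and a quality of 30 to 1 in 1000, so a mean quality of 32.8 corresponds to roughly 5 errors per 10000 calls.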
49 GATK: realignment / recalibration
● The GATK recalibration
Raw data : mean BQ = 32.8, median = 36.7
Recalibrated data : mean BQ = 28.8, median = 28.7
● Reduces mean base quality
● Lower variability
50 realignment / recalibration impact
● More homogeneous SNP counts
● Higher impact of recalibration on SNP counts
[Bar chart: SNP counts (axis 0 to 80000) for Mpileup, Mpileup -B, Mpileup -E, GATK and Popoolation2 on raw, realigned, recalibrated, and realigned & recalibrated data. Δ labels, in plot order : 777%, 714%, 42%, 45%]
51 Visualizing the alignment IGV
➢ IGV : Integrative Genomics Viewer
➢ Website : http://www.broadinstitute.org/igv
52 Visualizing the alignment IGV
➢ High-performance visualization tool
➢ Interactive exploration of large, integrated datasets
➢ Supports a wide variety of data types
➢ Documentation
➢ Developed at the Broad Institute of MIT and Harvard
53 Visualizing the alignment IGV
54 Visualizing the alignment IGV - Loading the reference
55 Visualizing the alignment IGV - Loading the reference
56 Visualizing the alignment IGV - Loading the bam file
57 Visualizing the alignment IGV - Loading the bam file
58 Visualizing the alignment IGV - Zoom
59 Visualizing the alignment IGV - Zoom
60 Visualizing the alignment IGV - Loading a gff file
61 Visualizing the alignment IGV - Loading a gff file
62 Visualizing the alignment IGV - Coverage
63 Variant calling
➢ Two new file formats :
  ✔ pileup
  ✔ vcf/bcf
➢ Variant calling with samtools
➢ IGV visualisation
64 The pileup format
Columns : chromosome, coordinate, reference base ('*' for an indel), number of reads covering the site, read bases, base qualities
Read bases :
➢ '.' and ',' : match to the reference base on the forward/reverse strand
➢ 'ACGTN' and 'acgtn' : mismatch on the forward/reverse strand
➢ '^' and '$' : start/end of a read segment
➢ '+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletion
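The read bases column can be summarized with a short scanner that follows the symbols listed above. This is a sketch with a hypothetical function name and a made-up example string; it relies on the fact that '^' is followed by one character encoding the mapping quality of the read.

```python
def summarize_read_bases(bases):
    """Count matches, mismatches, insertions and deletions in a pileup base string."""
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":            # read start, followed by one mapping-quality character
            i += 2
        elif ch == "$":          # read end
            i += 1
        elif ch in "+-":         # indel: sign, length, then that many bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            counts["insertion" if ch == "+" else "deletion"] += 1
            i = j + length
        elif ch in ".,":
            counts["match"] += 1
            i += 1
        elif ch.upper() in "ACGTN":
            counts["mismatch"] += 1
            i += 1
        else:                    # any other symbol (e.g. '*') is skipped
            i += 1
    return counts


print(summarize_read_bases(".,^F.A$-2ac,+1G"))
# {'match': 4, 'mismatch': 1, 'insertion': 1, 'deletion': 1}
```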
http://samtools.sourceforge.net/pileup.shtml
65 The vcf/bcf format
http://www.broadinstitute.org/ - vcf description
66 The vcf/bcf format
Header section : Define INFO and FORMAT fields
67 The vcf/bcf format
CHROM and POS : give the contig and the position at which the variant occurs. For indels, POS is actually the base preceding the event, due to the representation of indels in a VCF.
68 The vcf/bcf format
ID : the dbSNP rs identifier of the SNP (e.g. rs3828047), assigned from the contig and position of the call when a record exists at this site in dbSNP.
69 The vcf/bcf format
REF and ALT : The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand.
70 The vcf/bcf format
QUAL : the Phred-scaled confidence that a REF/ALT polymorphism exists at this site given the sequencing data (-10 log10 of the probability that the call is wrong).
71 The vcf/bcf format
FILTER : PASS = ok, '.' = no filtering, anything else (LowQual, ...) = fail
72 The vcf/bcf format
DP : raw read depth (all reads, including filtered reads)
73 The vcf/bcf format
GT : genotype (0/1 = heterozygous ; 1/1 = homozygous for the alternative allele)
PL : Phred-scaled genotype likelihoods
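Putting the fields together, one VCF data line can be split into its fixed columns, its INFO key/value pairs and its per-sample FORMAT values. The Python sketch below uses an invented record with two hypothetical samples (s1, s2); the function names are also made up for this example.

```python
VCF_COLUMNS = ("chrom", "pos", "id", "ref", "alt", "qual", "filter", "info", "format")


def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed fields, INFO pairs and per-sample values."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(VCF_COLUMNS, cols[:9]))
    rec["pos"] = int(rec["pos"])
    rec["qual"] = None if rec["qual"] == "." else float(rec["qual"])
    rec["info"] = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in rec["info"].split(";")
    )
    keys = rec["format"].split(":")
    rec["samples"] = {
        name: dict(zip(keys, col.split(":")))
        for name, col in zip(sample_names, cols[9:])
    }
    return rec


def zygosity(gt):
    """Interpret a diploid GT value such as '0/1' or '1/1'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous variant"


# Made-up record for illustration only.
line = "chr1\t1234\t.\tA\tG\t57.3\t.\tDP=18\tGT:PL\t0/1:80,0,95\t1/1:120,30,0"
rec = parse_vcf_record(line, ["s1", "s2"])
for name, values in rec["samples"].items():
    print(name, values["GT"], zygosity(values["GT"]))
```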
74 Variant Calling with SAMtools
➢ Get the raw variants :
✗ Output format : pileup
samtools mpileup -f ref.fa aln.bam > aln.pileup
samtools view -u aln.bam X | samtools mpileup -f ref.fa - > aln-X.pileup
✗ Output format : vcf/bcf
samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
75 Variant Calling with SAMtools
76 Filter - vcfutils.pl
➢ Filter the raw variant calls :
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
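The -D100 option sets a maximum read depth of 100: sites covered by more reads are discarded, since abnormally deep regions are often repeats or collapsed duplications. The Python sketch below illustrates this single criterion on VCF text lines, using the DP value of the INFO column. It is not a reimplementation of vcfutils.pl, which applies several other filters; the function names and the example lines are made up.

```python
def info_depth(info):
    """Return the DP value of a VCF INFO string, or None if it is absent."""
    for item in info.split(";"):
        if item.startswith("DP="):
            return int(item[3:])
    return None


def filter_max_depth(vcf_lines, max_depth=100):
    """Yield header lines and the records whose INFO DP is at most max_depth.

    Records without a DP value are dropped, which is a choice of this sketch.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        depth = info_depth(line.split("\t")[7])
        if depth is not None and depth <= max_depth:
            yield line


# Made-up lines for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
shallow = "chr1\t100\t.\tA\tG\t50\t.\tDP=35"
deep = "chr1\t200\t.\tC\tT\t60\t.\tDP=450"
kept = list(filter_max_depth([header, shallow, deep], max_depth=100))
print(kept)
```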
77 What's next
● Variation comparison
– Between samples
– With dbSNP
● Functional annotation
– variant_effect_predictor.pl
– Other tools
78 dbSNP
http://www.ncbi.nlm.nih.gov/projects/SNP/
79 dbSNP : variation genotype
80 Conclusion
● Be aware of the quality of your data.
● Do not trust manufacturers' quality values.
● Use multiple data sources for variant calling if possible