Genomic Alignment (Mapping) and SNP / Polymorphism Calling
Jérôme Mariette & Christophe Klopp
http://bioinfo.genotoul.fr/

Bioinfo Genotoul platform
– Since 2008
  ● 1 Roche 454
  ● 1 MiSeq
  ● 2 HiSeq
– Providing
  ● Data processing for quality control
  ● Secure data access to end users
http://bioinfo.genotoul.fr/
http://ng6.toulouse.inra.fr/

Bioinfo Genotoul : Services
– High speed computing facility access
– Application and web-server hosting
– Training
– Support
– Project partnership

Genetic variation
http://en.wikipedia.org/wiki/Genetic_variation
Genetic variation, variations in alleles of genes, occurs both within and among populations. Genetic variation is important because it provides the “raw material” for natural selection.
http://studentreader.com/genotypes-phenotypes/

Types of variations
● SNP : single nucleotide polymorphism
● CNV : copy number variation
● Chromosomal rearrangement
● Chromosomal duplication
http://en.wikipedia.org/wiki/Copy-number_variation
http://en.wikipedia.org/wiki/Human_genetic_variation

The variation transmission
● Mutation : in molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).
● Mutations are transmitted if they are not lethal.
● Mutations can impact the phenotype.

Genetic markers and genotyping
● A set of SNPs is selected along the genome.
● The phenotypes are collected for individuals.
● The SNPs are genotyped (measured) for the same individuals.
● This makes it possible to find locations where the genotype is linked to the phenotype :
  – Major genes
  – QTL (Quantitative Trait Loci)
http://snp.toulouse.inra.fr/~sigenae/50K_goat_snp_chip/index.html

Where are we?
[Diagram: sequencing applications. De novo assembly (genome, transcriptome), alignment (genome, transcriptome, ChIP-Seq, methylation), SNP calling (genome, transcriptome), transcriptomics.]

Overview
[Workflow diagram: fastq files (SRA, ENA, ...) ; FastQC ; BWA (aln / bwasw) ; SAM ; samtools (view / merge / sort / ...) ; BAM ; samtools pileup ; samtools.pl varfilter and awk filtering of the pileup ; GFF ; IGV.]

The pieces of software
● Quality : fastqc
● BWA : alignment
● Samtools : formatting, SNP discovery
● IGV : visualisation

The 1000 genomes project
● Joint project NCBI / EBI
● Common data formats :
  – fastq
  – SAM (Sequence Alignment/Map)

NGS platforms
● Two platforms :
  – Illumina Solexa
  – Roche 454

Sequencing bias bibliography

Sequencing bias
● Platform related
● Roche 454 (data from Jean-Marc Aury, CNS)
  – 99.9% mapped reads
  – Mean error rate : 0.55%
  – 37% deletions, 53% insertions, 10% substitutions
  – Homopolymer errors
  – emPCR duplications
● Solexa (data from Jean-Marc Aury, CNS)
  – 98.5% mapped reads
  – Mean error rate : 0.38%
  – 3% deletions, 2% insertions, 95% substitutions
  – Low coverage of A/T-rich regions

What data will we use?
● The needed data :
  – A reference sequence :
    ● Genome
    ● Parts of the genome
    ● Transcriptome
  – Short reads

Where to get a reference genome?
● Assemble your own
● Use a public assembly :
  – NCBI : Genbank
  – EMBL

Where to get short reads?
● Produce your own sequences :
  – CNS
  – Local platform
  – Private company
● Use public data :
  – SRA : NCBI Sequence Read Archive
  – ENA : EMBL/EBI European Nucleotide Archive

NCBI SRA

EBI ENA

Meta data
● Meta data structure :
  – Experiment
  – Sample
  – Study
  – Run
  – Data file

Which reads should I keep?
● All
● Some : which criteria and thresholds should I use?
  – Composition (number of Ns, complexity, ...)
  – Quality
  – Alignment-based criteria
● Should I trim the reads using :
  – Composition
  – Quality

Basic reads statistics
● Number of reads
● Length histogram
● Number of Ns in the reads
● Reads quality
● Reads redundancy
● Reads complexity

QC : per base quality
● FastQC

QC : sequence content
● FastQC

QC : undetermined bases
● FastQC : N content across all bases

Read alignment
● The different software generations :
  – Smith-Waterman / Needleman-Wunsch (1970)
  – BLAST (1990)
  – MAQ (2008)
  – BWA (2009)

BWA
● Fast, with a moderate memory footprint (<4GB)
● SAM output by default
● Gapped alignment for both SE and PE reads
● Effective pairing to achieve high alignment accuracy; suboptimal hits are considered in pairing.
● A non-unique read is placed randomly with a mapping quality of 0
● Limited number of errors (2 for 32 bp, 4 for 100 bp, ...)
● The default configuration works for most typical input.
  – Automatically adjusts parameters based on read lengths and error rates.
  – Estimates the insert size distribution on the fly
http://bio-bwa.sourceforge.net/

BWA prefix trie
● Word 'googol'
● ^ = start character
● --- : search for 'lol' with one error
The prefix trie is compressed to fit in memory in most cases (1 GB for the human genome).
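The 'googol' / 'lol' search on the prefix trie slide can be reproduced with a small sketch. This is not BWA's implementation: BWA walks a compressed BWT index, whereas the code below builds a plain, uncompressed suffix trie and only allows substitutions (no gaps). The function names `build_suffix_trie` and `find_with_mismatches` are hypothetical and chosen for illustration.

```python
def build_suffix_trie(text):
    """Build an uncompressed trie containing every suffix of `text`.

    Each node records the start positions of the suffixes that pass
    through it, so a path of length k gives all occurrences of a k-mer.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for char in text[start:]:
            node = node["children"].setdefault(
                char, {"children": {}, "starts": []}
            )
            node["starts"].append(start)
    return root


def find_with_mismatches(trie, pattern, max_mismatches):
    """Return sorted (start, mismatches) pairs for every placement of
    `pattern` in the indexed text with at most `max_mismatches`
    substitutions (illustrative backtracking, substitutions only)."""
    hits = []

    def walk(node, depth, mismatches):
        if depth == len(pattern):
            hits.extend((start, mismatches) for start in node["starts"])
            return
        for char, child in node["children"].items():
            cost = mismatches + (char != pattern[depth])
            # Abandon a branch as soon as the error budget is exceeded.
            if cost <= max_mismatches:
                walk(child, depth + 1, cost)

    walk(trie, 0, 0)
    return sorted(hits)


if __name__ == "__main__":
    trie = build_suffix_trie("googol")
    print(find_with_mismatches(trie, "lol", 1))
```

With one error allowed, the only placement found is 'gol' at 0-based position 3 (one substitution); with zero errors 'lol' is not found. The early exit on the error budget is what keeps the search tractable, which is also why the number of allowed errors is kept small.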
http://bio-bwa.sourceforge.net/bwa.shtml

Commands
Reference sequence indexing :
    bwa index -a bwtsw db.fasta
Read alignment :
    bwa aln db.fasta short_read.fastq > aln_sa.sai
    bwa bwasw database.fasta long_read.fastq > aln.sam
Formatting unpaired reads :
    bwa samse db.fasta aln_sa.sai short_read.fastq > aln.sam
Formatting paired-end reads :
    bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam
    bwa bwasw database.fasta long_read1.fastq long_read2.fastq > aln.sam

Index

aln

samse & sampe

bwasw

Sequence Alignment/Map (SAM) format
➢ Data sharing was a major issue with the 1000 genomes project
➢ Capture all of the critical information about NGS data in a single indexed and compressed file
➢ Sharing data across tools
➢ Generic alignment format
➢ Supports short and long reads (454, Solexa, SOLiD)
➢ Flexible in style, compact in size, efficient in random access
Website : http://samtools.sourceforge.net
Paper : Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9.
[PMID: 19505943]

BAM format
➢ Binary representation of SAM
➢ Compressed by the BGZF library
➢ Greatly reduces storage space requirements, to about 27% of the original SAM

SAMtools
➢ Library and software package
➢ Creating sorted and indexed BAM files from SAM files
➢ Removing PCR duplicates
➢ Merging alignments
➢ Visualization of alignments from BAM files
➢ SNP calling
➢ Short indel detection
http://samtools.sourceforge.net/samtools.shtml

SAMtools : example usage
➢ Create BAM from SAM
    samtools view -bS aln.sam -o aln.bam
➢ Sort BAM file
    samtools sort example.bam sortedExample
➢ Merge sorted BAM files
    samtools merge sortedMerge.bam sorted1.bam sorted2.bam
➢ Index BAM file
    samtools index sortedExample.bam
➢ Visualize BAM file
    samtools tview sortedExample.bam reference.fa

Picard
➢ A package complementary to SAMtools
➢ More format conversions than SAMtools
➢ Visualization of alignments not available
➢ SNP calling & short indel detection not available
http://picard.sourceforge.net/

GATK
➢ It's a cross-platform application programming interface (API), written in Java, specifically designed for working with gargantuan (up to hundreds of terabytes) next-generation sequencing (NGS) datasets.
➢ It's a set of tools built upon that API for performing certain processing and analysis tasks on NGS data.
http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page

GATK: realignment / recalibration
● Realignment : in particular areas, GATK splits reads to obtain a better alignment
● Recalibration : recalculates the base qualities after the alignment.
● High variability between the SNP counts of the calling tools
[Bar chart: SNP counts (0 to 80000) for Mpileup, Mpileup -B, Mpileup -E, GATK and Popoolation2 on raw data, realigned data, recalibrated data, and realigned & recalibrated data; labels Δ = 777%, Δ = 714%, Δ = 42%, Δ = 45%; part of the chart is masked as "To be revealed".]

GATK: realignment / recalibration
● The GATK realignment
  – INDEL zone
  – Raw BAM : 101M, 2 mismatches
  – Realigned BAM : 98M25D3M, 1 mismatch

GATK: realignment / recalibration
● The GATK recalibration
« The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context. »
DePristo et al. (2011) ; Ewing and Green (1998) ; Li et al. (2004 ; 2009)
  – Raw data : mean BQ = 32.8, median = 36.7
  – Recalibrated data : mean BQ = 28.8, median = 28.7
● Reduces the mean base quality
● Lower variability

Realignment / recalibration impact
● More homogeneous SNP counts
● Higher impact of recalibration on SNP counts
[Bar chart: SNP counts (0 to 80000) for Mpileup, Mpileup -B, Mpileup -E, GATK and Popoolation2 on raw data, realigned data, recalibrated data, and realigned & recalibrated data; labels Δ = 777%, Δ = 714%, Δ = 42%, Δ = 45%.]

Visualizing the alignment : IGV
➢ IGV : Integrative Genomics Viewer
➢ Website : http://www.broadinstitute.org/igv
➢ High-performance visualization tool
➢ Interactive exploration of large, integrated datasets
➢ Supports a wide variety of data types
➢ Documentation
➢ Developed at the Broad Institute of MIT and Harvard

Visualizing the alignment : IGV - Loading the reference
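The two CIGAR strings quoted on the GATK realignment slides (101M for the raw read, 98M25D3M after realignment) can be checked with a short sketch. It only applies the SAM convention for which operations consume read bases and which consume reference bases; the names `parse_cigar` and `cigar_lengths` are hypothetical helpers, not part of SAMtools or GATK.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Per the SAM convention: operations that advance along the read
# and operations that advance along the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything other than well-formed operations.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (read_length, reference_span) implied by a CIGAR string."""
    ops = parse_cigar(cigar)
    read_length = sum(n for n, op in ops if op in CONSUMES_QUERY)
    reference_span = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return read_length, reference_span


if __name__ == "__main__":
    for cigar in ("101M", "98M25D3M"):
        print(cigar, cigar_lengths(cigar))
```

Both strings describe the same 101-base read. The realigned version spans 126 reference bases because the 25-base deletion is now modelled explicitly, which is consistent with the drop from 2 mismatches to 1 shown on the slides.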