Genomic Alignment (Mapping) and SNP / Polymorphism Calling

Jérôme Mariette & Christophe Klopp
http://bioinfo.genotoul.fr/

Bioinfo Genotoul platform
– Since 2008
  ● 1 Roche 454
  ● 1 MiSeq
  ● 2 HiSeq
– Providing
  ● Data processing for quality control
  ● Secure data access to end users
http://bioinfo.genotoul.fr/
http://ng6.toulouse.inra.fr/

Bioinfo Genotoul : Services
– High speed computing facility access
– Application and web-server hosting
– Training
– Support
– Project partnership

Genetic variation
http://en.wikipedia.org/wiki/Genetic_variation
Genetic variation, variations in alleles of genes, occurs both within and among populations. Genetic variation is important because it provides the “raw material” for natural selection.
http://studentreader.com/genotypes-phenotypes/

Types of variations
● SNP : single nucleotide polymorphism
● CNV : copy number variation
● Chromosomal rearrangement
● Chromosomal duplication
http://en.wikipedia.org/wiki/Copy-number_variation
http://en.wikipedia.org/wiki/Human_genetic_variation

The variation transmission
● Mutation : in molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).
● Mutations are transmitted if they are not lethal.
● Mutations can impact the phenotype.

Genetic markers and genotyping
● A set of SNPs is selected along the genome.
● The phenotypes are collected for individuals.
● The SNPs are genotyped (measured) for the same individuals.
● This makes it possible to find locations where the genotype is linked to the phenotype :
  – Major genes
  – QTL (Quantitative Trait Loci)
http://snp.toulouse.inra.fr/~sigenae/50K_goat_snp_chip/index.html

Where are we?
[Figure: map of sequencing analyses (de novo assembly, alignment, SNP calling, ChIP-Seq, transcriptomics, methylation) for genome and transcriptome data]

Overview
[Figure: workflow diagram. fastq files from SRA / ENA are checked with FastQC and aligned with BWA (aln / bwasw) to SAM; samtools (view / merge / sort / ...) produces BAM and pileup files; pileups are filtered with samtools.pl varfilter and awk; BAM and GFF files are viewed in IGV]

The pieces of software
● FastQC : quality control
● BWA : alignment
● Samtools : formatting, SNP discovery
● IGV : visualisation

The 1000 genomes project
● Joint project NCBI / EBI
● Common data formats :
  – fastq
  – SAM (Sequence Alignment/Map)

NGS platforms
● Two platforms :
  – Illumina Solexa
  – Roche 454

Sequencing bias bibliography

Sequencing bias
● Platform related
● Roche 454 (data from Jean-Marc Aury, CNS)
  – 99.9% mapped reads
  – Mean error rate : 0.55%
  – 37% deletions, 53% insertions, 10% substitutions
  – Homopolymer errors
  – emPCR duplications
● Solexa (data from Jean-Marc Aury, CNS)
  – 98.5% mapped reads
  – Mean error rate : 0.38%
  – 3% deletions, 2% insertions, 95% substitutions
  – Low coverage of A/T-rich regions

What data will we use?
● The needed data :
  – A reference sequence :
    ● Genome
    ● Parts of the genome
    ● Transcriptome
  – Short reads

Where to get a reference genome?
● Assemble your own
● Use a public assembly :
  – NCBI : Genbank
  – EMBL

Where to get short reads?
● Produce your own sequences :
  – CNS
  – Local platform
  – Private company
● Use public data :
  – SRA : NCBI Sequence Read Archive
  – ENA : EMBL/EBI European Nucleotide Archive

NCBI SRA?

EBI ENA

Meta data
● Meta data structure :
  – Experiment
  – Sample
  – Study
  – Run
  – Data file

Which reads should I keep?
● All
● Some : which criteria and thresholds should I use?
  – Composition (number of Ns, complexity, ...)
  – Quality
  – Alignment based criteria
● Should I trim the reads using :
  – Composition
  – Quality

Basic reads statistics
● Number of reads
● Length histogram
● Number of Ns in the reads
● Reads quality
● Reads redundancy
● Reads complexity

QC : per base quality
● FastQC :

QC : sequence content
● FastQC :

QC : undetermined bases
[Figure: FastQC plot "N content across all bases"]

Read alignment
● The different software generations :
  – Needleman-Wunsch (1970) / Smith-Waterman (1981)
  – BLAST (1990)
  – MAQ (2008)
  – BWA (2009)

BWA
● Fast, with a moderate memory footprint (<4GB)
● SAM output by default
● Gapped alignment for both SE and PE reads
● Effective pairing to achieve high alignment accuracy; suboptimal hits are considered in pairing.
● A non-unique read is placed randomly with a mapping quality of 0
● Limited number of errors (2 for 32 bp, 4 for 100 bp, ...)
● The default configuration works for most typical input.
  – Automatically adjusts parameters based on read lengths and error rates.
  – Estimates the insert size distribution on the fly
http://bio-bwa.sourceforge.net/

BWA prefix trie
● Word 'googol'
● ^ = start character
● --- = search for 'lol' with one error
The prefix trie is compressed to fit in memory in most cases (1 GB for the human genome).
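The 'googol' / 'lol' example on the prefix trie slide can be reproduced with a small sketch. This is only an illustration of the idea, assuming a plain, uncompressed trie: BWA itself works on a compressed index, which is not shown here. The names `Node`, `build_prefix_trie` and `search` are hypothetical, and only mismatches are counted as errors (no gaps).

```python
class Node:
    """One trie node; the path from this node up to the root spells a substring."""

    def __init__(self):
        self.children = {}
        self.starts = []  # 0-based start positions of that substring in the text


def build_prefix_trie(text):
    """Insert every prefix of `text`, read backwards, into an uncompressed trie."""
    root = Node()
    for end in range(1, len(text) + 1):
        node = root
        for start in range(end - 1, -1, -1):
            node = node.children.setdefault(text[start], Node())
            node.starts.append(start)
    return root


def search(root, query, max_mismatches):
    """Return sorted (start, matched_substring) pairs within the mismatch budget."""
    hits = []

    def walk(node, i, budget, matched):
        if i < 0:  # whole query consumed, from its last character to its first
            hits.extend((start, matched) for start in node.starts)
            return
        for ch, child in node.children.items():
            cost = 0 if ch == query[i] else 1
            if cost <= budget:
                walk(child, i - 1, budget - cost, ch + matched)

    walk(root, len(query) - 1, max_mismatches, "")
    return sorted(hits)


trie = build_prefix_trie("googol")
print(search(trie, "lol", 1))  # [(3, 'gol')]
print(search(trie, "go", 0))   # [(0, 'go'), (3, 'go')]
```

The query is consumed from its last character to its first, so each step down the trie extends the candidate match to the left. The uncompressed trie grows quadratically with the text length, which is why a compressed representation is needed for a genome.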
http://bio-bwa.sourceforge.net/bwa.shtml

Commands
Reference sequence indexing :
    bwa index -a bwtsw db.fasta
Read alignment :
    bwa aln db.fasta short_read.fastq > aln_sa.sai
    bwa bwasw database.fasta long_read.fastq > aln.sam
Formatting unpaired reads :
    bwa samse db.fasta aln_sa.sai short_read.fastq > aln.sam
Formatting paired-end reads :
    bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam
    bwa bwasw database.fasta long_read1.fastq long_read2.fastq > aln.sam

Index

aln

samse & sampe

bwasw

Sequence Alignment/Map (SAM) format
SAM format
➢ Data sharing was a major issue with the 1000 genomes
➢ Capture all of the critical information about NGS data in a single indexed and compressed file
➢ Sharing : data across and tools
➢ Generic alignment format
➢ Supports short and long reads (454 – Solexa – Solid)
➢ Flexible in style, compact in size, efficient in random access
Website : http://samtools.sourceforge.net
Paper : Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

BAM format
➢ Binary representation of SAM
➢ Compressed by the BGZF library
➢ Greatly reduces storage space requirements, to about 27% of the original SAM

SAMtools
➢ Library and software package
➢ Creating sorted and indexed BAM files from SAM files
➢ Removing PCR duplicates
➢ Merging alignments
➢ Visualization of alignments from BAM files
➢ SNP calling
➢ Short indel detection
http://samtools.sourceforge.net/samtools.shtml

SAMtools Example usage
➢ Create BAM from SAM
    samtools view -bS aln.sam -o aln.bam
➢ Sort BAM file
    samtools sort example.bam sortedExample
➢ Merge sorted BAM files
    samtools merge sortedMerge.bam sorted1.bam sorted2.bam
➢ Index BAM file
    samtools index sortedExample.bam
➢ Visualize BAM file
    samtools tview sortedExample.bam reference.fa

Picard
➢ A package complementary to SAMtools
➢ More format conversions than SAMtools
➢ Visualization of alignments not available
➢ SNP calling & short indel detection not available
http://picard.sourceforge.net/

GATK
➢ It's a cross-platform application programming interface (API), written in Java, specifically designed for working with gargantuan (up to hundreds of terabytes) next-generation sequencing (NGS) datasets.
➢ It's a set of tools built upon that API for performing certain processing and analysis tasks on NGS data.
http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page

GATK: realignment / recalibration
● Realignment : in particular regions, GATK splits reads to produce a better alignment
● Recalibration : recalculates the base qualities after the alignment
● High variability between the SNP counts of different calling tools
[Figure: bar chart of SNP counts (axis from 0 to 80000) for Mpileup, Mpileup -B, Mpileup -E, GATK and Popoolation2 on raw data, realigned data, recalibrated data, and realigned & recalibrated data; Δ values in slide order : 777%, 714%, 42%, 45%]

GATK: realignment / recalibration
● The GATK realignment
  – INDEL zone
  – Raw BAM : CIGAR 101M, 2 mismatches
  – Realigned BAM : CIGAR 98M25D3M, 1 mismatch

GATK: realignment / recalibration
● The GATK recalibration
« The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context. »
DePristo et al. (2011) – Ewing and Green (1998) – Li et al. (2004 ; 2009)
  – Raw data : mean BQ = 32.8, median = 36.7
  – Recalibrated data : mean BQ = 28.8, median = 28.7
● Reduces the mean base quality
● Lower variability

Realignment / recalibration impact
● More homogeneous SNP counts
● Higher impact of recalibration on SNP counts
[Figure: same SNP-count bar chart as above, with Δ = 777%, 714%, 42%, 45%]

Visualizing the alignment IGV
➢ IGV : Integrative Genomics Viewer
➢ Website : http://www.broadinstitute.org/igv

Visualizing the alignment IGV
➢ High-performance visualization tool
➢ Interactive exploration of large, integrated datasets
➢ Supports a wide variety of data types
➢ Documentation
➢ Developed at the Broad Institute of MIT and Harvard

Visualizing the alignment IGV - Loading the reference
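The two alignments on the GATK realignment slides (101M in the raw BAM, 98M25D3M after realignment) are CIGAR strings from the SAM format. The sketch below shows how such a string might be decoded into the number of read bases and reference bases it covers. It follows the operator table of the SAM specification; the function names `parse_cigar` and `cigar_lengths` are hypothetical and not part of SAMtools or GATK.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operators that consume read bases
REF_OPS = set("MDN=X")    # operators that consume reference bases


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (read_bases, reference_bases) covered by the alignment."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in QUERY_OPS)
    ref_len = sum(n for n, op in ops if op in REF_OPS)
    return read_len, ref_len


print(cigar_lengths("101M"))      # (101, 101)
print(cigar_lengths("98M25D3M"))  # (101, 126)
```

Both alignments use the same 101 read bases, but the realigned one spans 126 reference bases because it contains a 25-base deletion.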
