
Genomic alignment (mapping) and SNP / polymorphism calling

Jérôme Mariette & Christophe Klopp
http://bioinfo.genotoul.fr/

Bioinfo Genotoul platform

– Since 2008

● 1 Roche 454

● 1 MiSeq

● 2 HiSeq

– Providing

● Data processing for quality control

● Secure data access to end users

http://bioinfo.genotoul.fr/

http://ng6.toulouse.inra.fr/

2 Bioinfo Genotoul : Services

– High speed computing facility access

– Application and web-server hosting

– Training

– Support

– Project partnership

3 Genetic variation

Genetic variation, variations in alleles of genes, occurs both within and among populations. Genetic variation is important because it provides the “raw material” for natural selection.

http://en.wikipedia.org/wiki/Genetic_variation

http://studentreader.com/genotypes-phenotypes/

4 Types of variations

● SNP : Single nucleotide polymorphism

● CNV : copy number variation

● Chromosomal rearrangement

● Chromosomal duplication

http://en.wikipedia.org/wiki/Copy-number_variation
http://en.wikipedia.org/wiki/Human_genetic_variation

5 The variation transmission

● Mutation : In molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).

● Mutations are transmitted if they are not lethal.

● Mutations can impact the phenotype.

6 Genetic markers and genotyping

● A set of SNPs is selected along the genome.

● The phenotypes are collected for individuals.

● The SNPs are genotyped (measured) for the same individuals.

● This makes it possible to find loci where the genotype is linked to the phenotype :
– Major genes
– QTL (Quantitative Trait Loci)

http://snp.toulouse.inra.fr/~sigenae/50K_goat_snp_chip/index.html

7 Where are we?

[Diagram. Elements shown: Sequencing; Alignment; De novo assembly; SNP calling; ChIP-Seq; Transcriptomics; Methylation; genome and transcriptome references]

8 Overview

[Workflow diagram. Elements shown: SRA and ENA (fastq), FastQC, BWA (aln / bwasw), SAM, samtools (view / merge / sort / ...), BAM, pileup, samtools.pl varfilter, awk, GFF, IGV]

9 The pieces of software

● Quality : fastqc

● BWA : alignment

● Samtools : formatting, SNP discovery

● IGV : visualisation

10 The 1000 genomes project

● Joint project NCBI / EBI

● Common data formats :
– fastq
– SAM (Sequence Alignment/Map)

11 NGS platforms

● Two platforms :
– Illumina Solexa
– Roche 454

12 bias bibliography

13 Sequencing bias

● Platform related

● Roche 454 (data from Jean-Marc Aury CNS)

– 99.9% mapped reads
– Mean error rate : 0.55%
– 37% deletions, 53% insertions, 10% substitutions
– Homopolymer errors
– emPCR duplications

● Solexa (data from Jean-Marc Aury CNS)

– 98.5% mapped reads
– Mean error rate : 0.38%
– 3% deletions, 2% insertions, 95% substitutions
– Low coverage of A/T-rich regions

14 What data will we use?

● The needed data :
– A reference sequence : genome, parts of the genome, or transcriptome
– Short reads

15 Where to get a reference genome?

● Assemble your own

● Use a public assembly :
– NCBI : Genbank
– EMBL

16 Where to get short reads?

● Produce your own sequences :
– CNS
– Local platform
– Private company

● Use public data :
– SRA : NCBI Sequence Read Archive
– ENA : EMBL/EBI European Nucleotide Archive

17 NCBI SRA?

18 EBI ENA

19 Meta data

● Meta data structure :
– Experiment
– Sample
– Study
– Run
– Data file

20 Which reads should I keep?

● All

● Some : which criteria and thresholds should I use ?
– Composition (number of Ns, complexity, ...)
– Quality
– Alignment-based criteria

● Should I trim the reads using :
– Composition
– Quality

21 Basic reads statistics

● Number of reads

● Length histogram

● Number of Ns in the reads

● Reads quality

● Reads redundancy

● Reads complexity
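As a rough illustration of the first statistics in the list above, the following shell sketch counts the reads, the mean read length and the number of Ns in a FASTQ file. It assumes four lines per record (no wrapped sequences). The function name `fastq_stats` is made up for this example and does not belong to FastQC or to any other tool.

```shell
# fastq_stats FILE : print read count, mean read length and number of Ns.
# Hypothetical helper; assumes 4-line FASTQ records (sequence on line 2).
fastq_stats() {
    awk 'NR % 4 == 2 {
             reads++
             bases += length($0)
             ns += gsub(/[Nn]/, "")   # gsub returns the number of replacements
         }
         END {
             mean = (reads > 0) ? bases / reads : 0
             printf "reads=%d mean_length=%.1f N_count=%d\n", reads, mean, ns
         }' "$1"
}
```

Usage : `fastq_stats short_read.fastq`. The length histogram, redundancy and complexity are not covered by this sketch.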

22 QC : per base quality

● FastQC :
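The per base quality view can be approximated on the command line. The sketch below prints the mean Phred quality at each read position; it assumes Phred+33 encoded qualities and four lines per FASTQ record, and it reports only the mean, where FastQC draws a full distribution. The name `per_base_quality` is hypothetical.

```shell
# per_base_quality FILE : mean Phred quality at each read position.
# Hypothetical helper; assumes Phred+33 encoding and 4-line FASTQ records.
per_base_quality() {
    awk 'BEGIN {
             # POSIX awk has no ord(); build a character -> ASCII code table
             for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
         }
         NR % 4 == 0 {
             n = length($0)
             if (n > maxlen) maxlen = n
             for (p = 1; p <= n; p++) {
                 sum[p] += ord[substr($0, p, 1)] - 33
                 cnt[p]++
             }
         }
         END {
             for (p = 1; p <= maxlen; p++)
                 printf "%d\t%.1f\n", p, sum[p] / cnt[p]
         }' "$1"
}
```

The output has one line per position : the position, a tab, and the mean quality of the reads that reach that position.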

23 QC : sequence content

● FastQC :

24 QC : undetermined bases

N content across all bases

25 Read alignment

● The different software generations :
– Needleman-Wunsch (1970) / Smith-Waterman (1981)
– BLAST (1990)
– MAQ (2008)
– BWA (2009)

26 BWA

● Fast and moderate memory footprint (<4GB)

● SAM output by default

● Gapped alignment for both SE and PE reads

● Effective pairing to achieve high alignment accuracy; suboptimal hits considered in pairing.

● A non-unique read is placed randomly with a mapping quality of 0

● Limited number of errors (2 for 32bp, 4 for 100 bp, ...)

● The default configuration works for most typical input.
– Automatically adjusts parameters based on read lengths and error rates.
– Estimates the insert size distribution on the fly.

http://bio-bwa.sourceforge.net/

27 BWA prefix trie

● Word 'googol'

● ^ = start character

● --- Search of 'lol' with one error

The prefix trie is compressed to fit in memory in most cases (1 GB for the human genome).

28 http://bio-bwa.sourceforge.net/bwa.shtml

29 Commands

Reference sequence indexing :
bwa index -a bwtsw db.fasta

Read alignment :
bwa aln db.fasta short_read.fastq > aln_sa.sai
bwa bwasw database.fasta long_read.fastq > aln.

Formatting unpaired reads :
bwa samse db.fasta aln_sa.sai short_read.fastq > aln.sam

Formatting paired-end reads :
bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam
bwa bwasw database.fasta long_read1.fastq long_read2.fastq > aln.sam

30 Index

31 aln

32 samse & sampe

33 bwasw

34 Sequence Alignment/Map (SAM) format

➢ Data sharing was a major issue with the 1000 Genomes project

➢ Capture all of the critical information about NGS data in a single indexed and compressed file

➢ Sharing data across tools

➢ Generic alignment format

➢ Supports short and long reads (454 – Solexa – Solid)

➢ Flexible in style, compact in size, efficient in random access

Website : http://samtools.sourceforge.net

Paper : Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

35 Sequence Alignment/Map (SAM) format
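Each SAM alignment line has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The FLAG field is a bitwise combination of properties, which is often the least readable part of a record. The sketch below decodes the bits up to 0x400. The function name and the printed labels are chosen for this example; they are not samtools output.

```shell
# sam_flag FLAG : list the properties encoded in a SAM FLAG value.
# Hypothetical helper; labels are illustrative, bit values follow the SAM spec.
sam_flag() {
    flag=$1
    out=""
    [ $(( $flag & 1 ))    -ne 0 ] && out="$out paired"
    [ $(( $flag & 2 ))    -ne 0 ] && out="$out proper_pair"
    [ $(( $flag & 4 ))    -ne 0 ] && out="$out unmapped"
    [ $(( $flag & 8 ))    -ne 0 ] && out="$out mate_unmapped"
    [ $(( $flag & 16 ))   -ne 0 ] && out="$out reverse_strand"
    [ $(( $flag & 32 ))   -ne 0 ] && out="$out mate_reverse_strand"
    [ $(( $flag & 64 ))   -ne 0 ] && out="$out first_in_pair"
    [ $(( $flag & 128 ))  -ne 0 ] && out="$out second_in_pair"
    [ $(( $flag & 256 ))  -ne 0 ] && out="$out secondary"
    [ $(( $flag & 512 ))  -ne 0 ] && out="$out qc_fail"
    [ $(( $flag & 1024 )) -ne 0 ] && out="$out duplicate"
    echo "${out# }"      # drop the leading space
}
```

For example, `sam_flag 99` describes the first read of a properly paired fragment whose mate maps to the reverse strand.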

36 BAM format

➢ Binary representation of SAM

➢ Compressed by BGZF library

➢ Greatly reduces storage space requirements to about 27% of original SAM

37 SAMtools

➢ Library and software package

➢ Creating sorted and indexed BAM files from SAM files

➢ Removing PCR duplicates

➢ Merging alignments

➢ Visualization of alignments from BAM files

➢ SNP calling

➢ Short indel detection

http://samtools.sourceforge.net/samtools.shtml

38 SAMtools Example usage

➢ Create BAM from SAM
samtools view -bS aln.sam -o aln.bam

➢ Sort BAM file
samtools sort example.bam sortedExample

➢ Merge sorted BAM files
samtools merge sortedMerge.bam sorted1.bam sorted2.bam

➢ Index BAM file
samtools index sortedExample.bam

➢ Visualize BAM file
samtools tview sortedExample.bam reference.fa

40 Picard

➢ A SAMtools complementary package

➢ More format conversion than SAMtools

➢ Visualization of alignments not available

➢ SNP calling & short indel detection not available

http://picard.sourceforge.net/

41 GATK

➢ It's a cross-platform application programming interface (API), written in Java, specifically designed for working with gargantuan (up to hundreds of terabytes) next-generation sequencing (NGS) datasets.

➢ It's a set of tools built upon that API for performing certain processing and analysis tasks on NGS data.

http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page

42 GATK: realignment / recalibration

● Realignment : in particular regions, GATK splits reads to obtain a better alignment

● Recalibration : recalculates the base qualities after the alignment

● High variability between SNP calling tools counts :

[Bar chart: SNP counts (axis 0 to 80000) on raw, realigned, recalibrated, and realigned & recalibrated data for Mpileup, Mpileup -B, Mpileup -E, GATK and Popoolation2, partly covered by a “To be revealed” overlay; Δ labels : 777%, 714%, 42%, 45%]

43 GATK: realignment / recalibration

● The GATK realignment : INDEL zone

– Raw BAM : 101M, 2 mismatches
– Realigned BAM : 98M25D3M, 1 mismatch
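The two CIGAR strings above describe the same 101-base read : 101M aligns every base to the reference, while 98M25D3M aligns 98 bases, skips a 25-base deletion in the reference, then aligns the last 3 bases. A small sketch, assuming only the standard CIGAR operators, shows how the read length and the reference span follow from the string. The name `cigar_span` is hypothetical.

```shell
# cigar_span CIGAR : print "read_length reference_span" for a CIGAR string.
# Hypothetical helper. M, = and X consume read and reference; I and S consume
# the read only; D and N consume the reference only; H and P consume neither.
cigar_span() {
    echo "$1" | awk '{
        cigar = $0; read_len = 0; ref_span = 0
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            n  = substr(cigar, 1, RLENGTH - 1) + 0    # operation length
            op = substr(cigar, RLENGTH, 1)            # operation letter
            if (op ~ /[MIS=X]/) read_len += n
            if (op ~ /[MDN=X]/) ref_span += n
            cigar = substr(cigar, RLENGTH + 1)
        }
        print read_len, ref_span
    }'
}
```

With the values from the slide, `cigar_span 101M` prints `101 101` and `cigar_span 98M25D3M` prints `101 126` : the read is unchanged, but after realignment it spans 25 more reference bases.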

48 GATK: realignment / recalibration

● The GATK recalibration

« The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context. »

DePristo et al. (2011) – Ewing and Green (1998) – Li et al. (2004 ; 2009)
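The quality scores discussed in the quote are Phred-scaled : Q = -10 log10(P), where P is the probability that the called base is wrong. The sketch below converts in both directions with awk. The function names are made up for this example.

```shell
# phred_to_prob Q : error probability for Phred quality Q, P = 10^(-Q/10).
# Hypothetical helper.
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

# prob_to_phred P : Phred quality for error probability P, Q = -10 * log10(P).
# Hypothetical helper; awk log() is the natural logarithm, hence the division.
prob_to_phred() {
    awk -v p="$1" 'BEGIN { printf "%.1f\n", -10 * log(p) / log(10) }'
}
```

For example, `phred_to_prob 30` prints `0.001000`, that is one expected error in 1000 bases, and `phred_to_prob 20` prints `0.010000`.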

Raw data : mean BQ = 32.8, median = 36.7

Recalibrated data : mean BQ = 28.8, median = 28.7

● Reduces Mean Base Quality

● Lower variability

50 realignment / recalibration impact

● More homogeneous SNP counts

● Higher impact of recalibration on SNP counts

[Bar chart: SNP counts (axis 0 to 80000) on raw, realigned, recalibrated, and realigned & recalibrated data for Mpileup, Mpileup -B, Mpileup -E, GATK and Popoolation2; Δ labels : 777%, 714%, 42%, 45%]

51 Visualizing the alignment IGV

➢ IGV : Integrative Genomics Viewer

➢ Website : http://www.broadinstitute.org/igv

52 Visualizing the alignment IGV

➢ High-performance visualization tool

➢ Interactive exploration of large, integrated datasets

➢ Supports a wide variety of data types

➢ Documentation

➢ Developed at the Broad Institute of MIT and Harvard

53 Visualizing the alignment IGV

[Screenshots: loading the reference, loading the bam file, zoom, loading a gff file, coverage]

63 Variant calling

➢ Two new file formats :
✔ pileup
✔ vcf/bcf

➢ Variant calling with samtools

➢ IGV visualisation

64 The pileup format

Columns : Chr. - Coord. - Base ('*' for indel) - Number of reads covering the site - Read bases - Base qualities

Read bases :
➢ '.' and ',' : match to the reference base on the forward/reverse strand
➢ 'ACGTN' and 'acgtn' : mismatch on the forward/reverse strand
➢ '^' and '$' : start/end of a read segment
➢ '+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletion
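The read bases column is compact but awkward to read by eye. As a sketch of how the symbols above can be decoded, the following awk-based function counts matches, mismatches, insertions and deletions at each site. It assumes six tab-separated columns in the order listed above and that the character following '^' is a mapping quality to be skipped. The name `pileup_counts` is hypothetical.

```shell
# pileup_counts : read pileup lines on stdin and print, per site,
# chromosome, position, depth, matches, mismatches, insertions, deletions.
# Hypothetical helper; assumes the read bases are in column 5.
pileup_counts() {
    awk -F '\t' '{
        s = $5; match_n = 0; mism = 0; ins = 0; del = 0
        i = 1
        while (i <= length(s)) {
            c = substr(s, i, 1)
            if (c == "^") { i += 2; continue }    # read start + mapping quality char
            if (c == "$") { i += 1; continue }    # read end
            if (c == "+" || c == "-") {           # indel : sign, length, bases
                rest = substr(s, i + 1)
                match(rest, /^[0-9]+/)
                len = substr(rest, 1, RLENGTH) + 0
                if (c == "+") { ins++ } else { del++ }
                i += 1 + RLENGTH + len            # skip sign, digits and indel bases
                continue
            }
            if (c == "." || c == ",") match_n++
            else if (c ~ /[ACGTNacgtn]/) mism++
            i++
        }
        printf "%s\t%s\t%s\t%d\t%d\t%d\t%d\n", $1, $2, $4, match_n, mism, ins, del
    }'
}
```

Usage : `pileup_counts < aln.pileup`. Any other symbol in the column is ignored by this sketch.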

http://samtools.sourceforge.net/pileup.shtml

65 The vcf/bcf format

Header section : defines the INFO and FORMAT fields.

CHROM and POS : give the contig and the position at which the variant occurs. For indels this is actually the base preceding the event, due to the representation of indels in a VCF.

ID : the dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP (ex : rs3828047).

REF and ALT : the reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand.

QUAL : the Phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data.

FILTER : PASS = ok, '.' = no filtering, anything else (LowQual, ...) = fail.

DP : real depth (all reads, even filtered reads).

GT : genotype (0/1 = heterozygous ; 1/1 = homozygous for the alternative allele).

PL : Phred-scaled genotype likelihoods.
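To make the per-sample fields concrete, the sketch below reads a VCF file and classifies the genotype of the first sample (column 10) from its GT value. It assumes diploid calls and looks up the position of GT in the FORMAT column instead of assuming it comes first. The function name and the class labels are made up for this example.

```shell
# vcf_genotypes : read VCF on stdin; print chrom, pos, ref, alt and a
# genotype class for the first sample (column 10).
# Hypothetical helper; assumes diploid genotypes such as 0/1 or 1|1.
vcf_genotypes() {
    awk -F '\t' '
        /^#/ { next }                        # skip header lines
        {
            nfmt = split($9, fmt, ":")       # FORMAT keys, e.g. GT:PL
            split($10, val, ":")             # matching values for sample 1
            gt = "."
            for (k = 1; k <= nfmt; k++) if (fmt[k] == "GT") gt = val[k]
            split(gt, a, "[/|]")             # alleles, unphased "/" or phased "|"
            if (gt == "." || a[1] == "." || a[2] == ".") cls = "missing"
            else if (a[1] != a[2])            cls = "heterozygous"
            else if (a[1] == "0")             cls = "homozygous_ref"
            else                              cls = "homozygous_alt"
            printf "%s\t%s\t%s\t%s\t%s\n", $1, $2, $4, $5, cls
        }'
}
```

Usage : `vcf_genotypes < var.flt.vcf`. In a sample field such as `0/1:60,0,80`, the three PL values refer to genotypes 0/0, 0/1 and 1/1, and the most likely genotype has the value 0.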

http://www.broadinstitute.org/ - vcf description

74 Variant Calling with SAMtools

➢ Get the raw variants :

✗ Output format : pileup

samtools mpileup -f ref.fa aln.bam > aln.pileup

samtools view -u aln.bam X | samtools mpileup -f ref.fa - > aln-X.pileup

✗ Output format : vcf/bcf

samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf

76 Filter - vcfutils.pl

➢ Filter the raw variant calls :

bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
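In the command above, -D100 sets the maximum read depth to 100. To show what this kind of filter does to the records, here is a simplified awk sketch that keeps variants by minimum QUAL and maximum INFO DP. It is not a reimplementation of vcfutils.pl varFilter, which applies more criteria; the function name and its two parameters are hypothetical.

```shell
# vcf_filter MIN_QUAL MAX_DP : keep header lines and records with
# QUAL >= MIN_QUAL and INFO DP <= MAX_DP (records without DP are kept).
# Hypothetical helper, much simpler than vcfutils.pl varFilter.
vcf_filter() {
    awk -F '\t' -v minq="$1" -v maxdp="$2" '
        /^#/ { print; next }
        {
            dp = -1                                   # -1 means "no DP tag found"
            n = split($8, info, ";")
            for (k = 1; k <= n; k++)
                if (info[k] ~ /^DP=/) dp = substr(info[k], 4) + 0
            if ($6 + 0 >= minq + 0 && dp <= maxdp + 0) print
        }'
}
```

Usage : `bcftools view var.raw.bcf | vcf_filter 20 100 > var.simple.vcf`, where the output file name is only an example.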

77 What's next

● Variation comparison
– Between samples
– With dbSNP

● Functional annotation
– variant_effect_predictor.pl
– Other tools
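A first comparison between two samples can be done with standard Unix tools. The sketch below reduces each VCF file to chrom:pos:ref:alt keys and counts the variants private to each file and shared by both. It assumes that identical variants are written identically in both files (no normalisation of indels is attempted). Both function names are hypothetical.

```shell
# vcf_keys FILE : one sorted "chrom:pos:ref:alt" key per variant record.
# Hypothetical helper.
vcf_keys() {
    awk -F '\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | sort -u
}

# vcf_compare A B : count variants private to A, private to B, and shared.
# Hypothetical helper; relies on comm, which needs sorted input.
vcf_compare() {
    ka=$(mktemp); kb=$(mktemp)
    vcf_keys "$1" > "$ka"
    vcf_keys "$2" > "$kb"
    only_a=$(comm -23 "$ka" "$kb" | wc -l)
    only_b=$(comm -13 "$ka" "$kb" | wc -l)
    shared=$(comm -12 "$ka" "$kb" | wc -l)
    rm -f "$ka" "$kb"
    # arithmetic expansion strips the padding some wc implementations add
    echo "only_first=$((only_a)) only_second=$((only_b)) shared=$((shared))"
}
```

Usage : `vcf_compare sample1.vcf sample2.vcf`, with file names given only as an example. The same approach works against a dbSNP VCF file.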

78 dbSNP

http://www.ncbi.nlm.nih.gov/projects/SNP/

79 dbSNP : variation genotype

80 Conclusion

● Be aware of the quality of your data.

● Do not trust manufacturers' quality values

● Use multiple data sources for variant calling if possible
