
NGS reads aligning and SNP calling

Christophe Klopp - 2012

Genetic variation

Genetic variation, variations in alleles of genes, occurs both within and among populations. Genetic variation is important because it provides the “raw material” for natural selection.

http://en.wikipedia.org/wiki/Genetic_variation

http://studentreader.com/genotypes-phenotypes/

Types of variations

● SNP : Single nucleotide polymorphism

● CNV : copy number variation

● Chromosomal rearrangement

● Chromosomal duplication

http://en.wikipedia.org/wiki/Copy-number_variation

http://en.wikipedia.org/wiki/Human_genetic_variation

The variation transmission

● Mutation : In molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).

● Mutations are transmitted if they are not lethal.

● Mutations can impact the phenotype.

Genetic markers and genotyping

● A set of SNPs is selected along the genome.

● The phenotypes are collected for individuals.

● The SNPs are genotyped (measured) for the same individuals.

● This makes it possible to find locations where the genotype is linked to the phenotype :

– Major genes

– QTL (Quantitative Trait Loci)

The 1000 Genomes Project

● Joint project NCBI / EBI

● Common data formats :

– FASTQ

– SAM (Sequence Alignment/Map)

Sequence Alignment/Map (SAM) format

➢ Data sharing was a major issue with the 1000 Genomes Project

➢ Capture all of the critical information about NGS data in a single indexed and compressed file

➢ Sharing : data across and tools

➢ Generic alignment format

➢ Supports short and long reads (454 – Solexa – SOLiD)

➢ Flexible in style, compact in size, efficient in random access

Website : http://samtools.sourceforge.net

Paper : Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

SAM format - Header section

➢ Header lines start with @ followed by a two-letter TAG

➢ Header fields are TYPE:VALUE pairs
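The two rules above can be sketched in a few lines of Python. This is only an illustration, not part of SAMtools: the function name `parse_header_line` and the example line are hypothetical, and free-text comment lines (`@CO`) are not handled.

```python
def parse_header_line(line):
    """Split one SAM header line into its two-letter tag and its TYPE:VALUE fields."""
    if not line.startswith("@"):
        raise ValueError("SAM header lines must start with '@'")
    parts = line.rstrip("\n").split("\t")
    record_tag = parts[0][1:]  # two-letter tag after '@', e.g. HD, SQ, RG
    fields = {}
    for part in parts[1:]:
        # Split on the first ':' only, since values may themselves contain ':'
        key, _, value = part.partition(":")
        fields[key] = value
    return record_tag, fields


# Hypothetical header line describing one reference sequence
tag, fields = parse_header_line("@SQ\tSN:chr1\tLN:1000")
print(tag, fields)
```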

SAM format - Alignment section

➢ 11 mandatory fields

➢ Variable number of optional fields

➢ Fields are tab delimited
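A minimal sketch of how one alignment line could be split into its 11 mandatory fields plus the optional ones. The field names follow the current SAM specification (they are not given on the slide), and `parse_alignment_line` and the example read are hypothetical.

```python
from collections import namedtuple

# Names of the 11 mandatory fields, in column order (from the SAM specification)
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")

SamRecord = namedtuple("SamRecord", MANDATORY + ["OPTIONAL"])


def parse_alignment_line(line):
    """Parse one tab-delimited SAM alignment line into a SamRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(MANDATORY):
        raise ValueError("expected at least 11 tab-delimited fields")
    values = cols[:11]
    for name in INTEGER_FIELDS:
        index = MANDATORY.index(name)
        values[index] = int(values[index])
    # Everything after column 11 is kept as raw optional-field strings
    return SamRecord(*values, OPTIONAL=cols[11:])


# Hypothetical alignment line
example = "read1\t0\tchr1\t5\t30\t2S4M2D6M3S\t*\t0\t0\tCCTCAGGCATTAGTG\t*\tNM:i:3"
record = parse_alignment_line(example)
print(record.QNAME, record.POS, record.CIGAR, record.OPTIONAL)
```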

SAM format - Full example

[Figure: full SAM example showing the Header and Alignment sections]

Optional fields : [TAG:VTYPE:VALUE [...]]

Tags :

➢ X? : Reserved for end users

➢ NM : Number of nucleotide differences

➢ MD : String for mismatching positions

➢ RG : Read group

➢ [...]

Value types :

➢ A : Printable character

➢ i : Signed 32-bit integer

➢ f : Single-precision float number

➢ Z : Printable string

➢ H : Hex string (high nybble first)

SAM format - Flag field

http://picard.sourceforge.net/explain-flags.html
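The flag is a sum of bits, each with its own meaning. The sketch below decodes it in the same spirit as the page linked above. The bit values come from the SAM specification; the wording of the labels and the function name `explain_flag` are assumptions made for this illustration.

```python
# (bit, meaning) pairs from the SAM specification; labels are paraphrased
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),  # added in later versions of the specification
]


def explain_flag(flag):
    """Return the list of meanings of the bits set in a SAM FLAG value."""
    return [meaning for bit, meaning in SAM_FLAGS if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(explain_flag(99))
```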

SAM format - Extended CIGAR format

Ref:   GCATTCAGATGCAGTACGC
Read:    ccTCAG--GCATTAgtg

POS: 5   CIGAR: 2S4M2D6M3S
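To make the example concrete, here is a small sketch that splits a CIGAR string into operations and computes how many reference and read bases it covers. Which operations consume the reference or the read is taken from the SAM specification; the function names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_READ = set("MIS=X")       # operations that advance along the read


def parse_cigar(cigar):
    """Return the CIGAR string as a list of (length, operation) tuples."""
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]
    # Reject strings containing anything other than valid length/operation pairs
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (reference bases covered, read bases covered) for a CIGAR string."""
    ops = parse_cigar(cigar)
    reference_length = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    read_length = sum(n for n, op in ops if op in CONSUMES_READ)
    return reference_length, read_length


pos = 5
reference_length, read_length = cigar_lengths("2S4M2D6M3S")
end = pos + reference_length - 1  # last reference position covered (1-based)
print(reference_length, read_length, end)
```

For the slide's example, the read is 15 bases long (2 + 4 + 6 + 3) and covers 12 reference bases (4 + 2 + 6), from position 5 to position 16.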

BAM format

➢ Binary representation of SAM

➢ Compressed by BGZF library

➢ Greatly reduces storage space requirements to about 27% of original SAM

SAMtools

➢ Library and software package

➢ Creating sorted and indexed BAM files from SAM files

➢ Removing PCR duplicates

➢ Merging alignments

➢ Visualization of alignments from BAM files

➢ SNP calling

➢ Short indel detection

http://samtools.sourceforge.net/samtools.shtml

SAMtools - Example usage

➢ Create BAM from SAM

samtools view -bS aln. -o aln.bam

➢ Sort BAM file

samtools sort example.bam sortedExample

➢ Merge sorted BAM files

samtools merge sortedMerge.bam sorted1.bam sorted2.bam

➢ Index BAM file

samtools index sortedExample.bam

➢ Visualize BAM file

samtools tview sortedExample.bam reference.fa

Exercise 1

● Downloading the sam file : –

● Visualizing the file content : find the number of exactly matching reads

● Installing the samtools : – http://samtools.sourceforge.net/samtools.shtml

● Producing a sorted and indexed bam file from the same file

Visualizing the alignment IGV

➢ IGV : Integrative Genomics Viewer

➢ Website : http://www.broadinstitute.org/igv


➢ High-performance visualization tool

➢ Interactive exploration of large, integrated datasets

➢ Supports a wide variety of data types

➢ Documentation

➢ Developed at the Broad Institute of MIT and Harvard

Visualizing the alignment IGV - Loading the reference

Visualizing the alignment IGV - Loading the bam file

Visualizing the alignment IGV - Zoom

Visualizing the alignment IGV - Loading a gff file

Visualizing the alignment IGV - Coverage

➢ Generate the coverage information to be displayed in IGV.

java -jar igvtools.jar count aln.bam aln.depth.tdf ref.genome

➢ Remark : ref.genome was generated when we imported the genome sequence

➢ This step is optional, but it is essential if you want to see the read depth information at large scale.


Exercise 2

● Open IGV with Java Web Start : http://www.broadinstitute.org/igv

● Create the genome using the fasta file

● Load the sorted bam file

The pileup format

Chr. - Coord. - Base('*' for indel) - Number of reads covering the site - Read bases* - Base qualities

Read bases :

➢ '.' and ',' : match to the reference base on the forward/reverse strand

➢ 'ACTGN' and 'actgn' : for a mismatch on the forward/reverse strand

➢ '^' and '$' : start/end of a read segment

➢ '+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletion
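The read-bases encoding above can be walked through with a short sketch that counts matches, mismatches and indels at one site. It assumes, following the samtools pileup documentation, that '^' is followed by one character encoding the mapping quality and that '*' marks a deleted base; the function name and the example string are hypothetical.

```python
def summarize_read_bases(bases):
    """Count matches, mismatches, insertions and deletions in a pileup read-bases string."""
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    i = 0
    while i < len(bases):
        symbol = bases[i]
        if symbol == "^":
            # Start of a read segment: skip '^' and the mapping-quality character after it
            i += 2
            continue
        if symbol in "+-":
            # Indel: sign, then its length in digits, then that many bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            indel_length = int(bases[i + 1:j])
            counts["insertion" if symbol == "+" else "deletion"] += 1
            i = j + indel_length
            continue
        if symbol in ".,":
            counts["match"] += 1
        elif symbol in "ACGTNacgtn":
            counts["mismatch"] += 1
        # '$' (end of a read segment) and '*' (deleted base) are skipped
        i += 1
    return counts


# Hypothetical read-bases column for one site
print(summarize_read_bases("..,^F.A$,-2at,+1G."))
```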

http://samtools.sourceforge.net/pileup.shtml

Using mpileup

➢ Get the raw pileup:

samtools mpileup -f ref.fa aln.bam > raw.txt

➢ ref.fa : Fasta formatted file of the reference genome

➢ aln.bam : Sorted BAM formatted file, from the alignments

➢ raw.txt : Output pileup formatted, with consensus calls

➢ -f : Reference sequence, ref.fa (in FastA format)

Mpileup output

Variant Calling

samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf

bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

➢ var.raw.bcf : binary compressed variants

➢ var.flt.vcf : text filtered variants

➢ -b : output BCF instead of VCF

➢ -v : output potential variant sites only (force -c)

➢ -c : SNP calling (force -e)

➢ -g : call genotypes at variant sites (force -c)

➢ -D100 : maximum read depth [10000000]
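To illustrate what a maximum read depth filter does, here is a simplified stand-in written in Python. It is not the vcfutils.pl implementation, which applies several other filters: it only drops VCF records whose INFO column reports a depth (DP) above the threshold. The function names and the example records are hypothetical.

```python
def info_depth(info):
    """Return the DP (read depth) value from a VCF INFO column, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP":
            return int(value)
    return None


def filter_max_depth(vcf_lines, max_depth=100):
    """Keep header lines and the records whose depth does not exceed max_depth."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines are passed through unchanged
            continue
        columns = line.rstrip("\n").split("\t")
        depth = info_depth(columns[7])  # INFO is the 8th VCF column
        if depth is None or depth <= max_depth:
            kept.append(line)
    return kept


# Hypothetical VCF content: one header line and two variant records
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t120\t.\tA\tG\t50\t.\tDP=35;AF1=1",
    "chr1\t480\t.\tC\tT\t60\t.\tDP=250;AF1=0.5",
]
for line in filter_max_depth(vcf, max_depth=100):
    print(line)
```

Sites with an unusually high depth are often filtered because they tend to fall in repeated regions, where reads from several copies pile up on one location.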

VCF format

Exercise 3

● Visualize the bam and bai files in IGV

● Produce a tdf file for the coverage

● Find SNPs from the mpileup file

● Transform it into gff

● Load the gff in IGV
