NGS Reads Aligning and SNP Calling

Christophe Klopp - 2012

Genetic variation
Genetic variation, variations in alleles of genes, occurs both within and among populations. Genetic variation is important because it provides the “raw material” for natural selection.
Sources : http://en.wikipedia.org/wiki/Genetic_variation, http://studentreader.com/genotypes-phenotypes/

Types of variations
● SNP : single nucleotide polymorphism
● CNV : copy number variation
● Chromosomal rearrangement
● Chromosomal duplication
Sources : http://en.wikipedia.org/wiki/Copy-number_variation, http://en.wikipedia.org/wiki/Human_genetic_variation

The variation transmission
● Mutation : in molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).
● Mutations are transmitted if they are not lethal.
● Mutations can impact the phenotype.

Genetic markers and genotyping
● A set of SNPs is selected along the genome.
● The phenotypes are collected for individuals.
● The SNPs are genotyped (measured) for the same individuals.
● This makes it possible to find locations where the genotype is linked to the phenotype :
  – Major genes
  – QTL (Quantitative Trait Loci)

The 1000 genomes project
● Joint project NCBI / EBI
● Common data formats :
  – fastq
  – SAM (Sequence Alignment/Map)

Sequence Alignment/Map (SAM) format
➢ Data sharing was a major issue for the 1000 genomes project
➢ Captures all of the critical information about NGS data in a single indexed and compressed file
➢ Sharing of data across tools
➢ Generic alignment format
➢ Supports short and long reads (454, Solexa, SOLiD)
➢ Flexible in style, compact in size, efficient in random access
Website : http://samtools.sourceforge.net
Paper : Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

SAM format - Header section
➢ Header lines start with @ followed by a two-letter TAG
➢ Header fields are TYPE:VALUE pairs

SAM format - Alignment section
➢ 11 mandatory fields
➢ Variable number of optional fields
➢ Fields are tab delimited

SAM format - Full example
An example file shows the header section followed by the alignment section. Each alignment line has the layout :
    <QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]
Optional field tags (TAG) :
  – X? : reserved for end users
  – NM : number of nucleotide differences
  – MD : string for mismatching positions
  – RG : read group
  – [...]
Value types (VTYPE) :
  – A : printable character
  – i : signed 32-bit integer
  – f : single-precision float number
  – Z : printable string
  – H : hex string (high nybble first)

SAM format - Flag field
http://picard.sourceforge.net/explain-flags.html

SAM format - Extended CIGAR format
    Ref:  GCATTCAGATGCAGTACGC
    Read:   ccTCAG--GCATTAgtg
    POS   CIGAR
    5     2S4M2D6M3S
The two leading and three trailing lowercase bases are soft clipped (2S, 3S), the alignment starts at reference position 5, and the two dashes mark a 2-base deletion (2D) between the two matched blocks (4M, 6M).

BAM format
➢ Binary representation of SAM
➢ Compressed by the BGZF library
➢ Greatly reduces storage space requirements, to about 27% of the original SAM

SAMtools
➢ Library and software package
➢ Creating sorted and indexed BAM files from SAM files
➢ Removing PCR duplicates
➢ Merging alignments
➢ Visualization of alignments from BAM files
➢ SNP calling
➢ Short indel detection
http://samtools.sourceforge.net/samtools.shtml

SAMtools - Example usage
➢ Create BAM from SAM
    samtools view -bS aln.sam -o aln.bam
➢ Sort BAM file
    samtools sort example.bam sortedExample
➢ Merge sorted BAM files
    samtools merge sortedMerge.bam sorted1.bam sorted2.bam
➢ Index BAM file
    samtools index sortedExample.bam
➢ Visualize BAM file
    samtools tview sortedExample.bam reference.fa
Remark : the sort command uses the samtools 0.1.x syntax, where the second argument is an output prefix (the result is sortedExample.bam). From samtools 1.0 onward the equivalent is samtools sort -o sortedExample.bam example.bam.

Exercise 1
● Download the SAM file : –
● Inspect the file content : find the number of exactly matching reads
● Install samtools :
  – http://samtools.sourceforge.net/samtools.shtml
● Produce a sorted and indexed BAM file from the same file

Visualizing the alignment - IGV
➢ IGV : Integrative Genomics Viewer
➢ Website : http://www.broadinstitute.org/igv
➢ High-performance visualization tool
➢ Interactive exploration of large, integrated datasets
➢ Supports a wide variety of data types
➢ Documentation
➢ Developed at the Broad Institute of MIT and Harvard
The screenshots in the slides walk through the following steps :
  – IGV - Loading the reference
  – IGV - Loading the bam file
  – IGV - Zoom
  – IGV - Loading a gff file

Visualizing the alignment - IGV - Coverage
➢ Generate the coverage information to be displayed in IGV.
    java -jar igvtools.jar count aln.bam aln.depth.tdf ref.genome
➢ Remark : ref.genome was generated when we imported the genome sequence
➢ This step is optional, but it is essential if you want to see the read depth at large scale.

Exercise 2
● Launch IGV with Java Web Start : http://www.broadinstitute.org/igv
● Create the genome using the fasta file
● Load the sorted bam file

The pileup format
Columns :
  – Chromosome
  – Coordinate
  – Reference base ('*' for indel)
  – Number of reads covering the site
  – Read bases
  – Base qualities
Read bases :
➢ '.' and ',' : match to the reference base on the forward/reverse strand
➢ 'ACGTN' and 'acgtn' : mismatch on the forward/reverse strand
➢ '^' and '$' : start/end of a read segment
➢ '+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletion
http://samtools.sourceforge.net/pileup.shtml

Using mpileup
➢ Get the raw pileup :
    samtools mpileup -f ref.fa aln.bam > raw.txt
  – ref.fa : reference genome in FastA format
  – aln.bam : sorted BAM file containing the alignments
  – raw.txt : output in pileup format
  – -f : reference sequence file (ref.fa, in FastA format)

Mpileup output
The slide shows an example of the resulting pileup lines.

Variant Calling
    samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
    bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
  – var.raw.bcf : binary compressed variants
  – var.flt.vcf : filtered variants in text format
  – -b : output BCF instead of VCF
  – -v : output potential variant sites only (force -c)
  – -c : SNP calling (force -e)
  – -g : call genotypes at variant sites (force -c)
  – -D100 : maximum read depth [10000000]
Remark : bcftools view -bvcg is the bcftools 0.1.x syntax. From bcftools 1.0 onward variant calling is done with bcftools call.

VCF format
The slide shows an example VCF file.

Exercise 3
● Visualize the bam and bai files in IGV
● Produce a tdf file for the coverage
● Find SNPs from the mpileup file
● Transform it into gff
● Load the gff in IGV
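
To make the read-bases column of the pileup format more concrete, here is a minimal Python sketch that decodes that column for one pileup line and counts the observed alleles, which is the first step towards spotting candidate SNPs by hand in Exercise 3. It is not part of the original slides: the function names `count_pileup_bases` and `summarize_pileup_line` and the example line are hypothetical, and the sketch relies on the fact that in samtools pileup output '^' is followed by one mapping-quality character. It ignores base qualities and is no substitute for the bcftools caller.

```python
from collections import Counter


def count_pileup_bases(ref_base, read_bases):
    """Count alleles in the read-bases column of one pileup line.

    Hypothetical helper written for illustration only.
    Matches ('.' and ',') are counted under the reference base,
    mismatches under the observed base, and indels under keys
    such as '+AG' or '-C'. Strand information is discarded.
    """
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":
            # Start of a read segment: skip '^' and the mapping-quality character.
            i += 2
        elif c == "$":
            # End of a read segment carries no base information.
            i += 1
        elif c in "+-":
            # Indel: sign, decimal length, then that many bases (e.g. '+2AG').
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            length = int(read_bases[i + 1:j])
            counts[c + read_bases[j:j + length].upper()] += 1
            i = j + length
        elif c in ".,":
            counts[ref_base.upper()] += 1
            i += 1
        elif c.upper() in "ACGTN":
            counts[c.upper()] += 1
            i += 1
        else:
            # '*' (deleted base) and any other symbol are skipped in this sketch.
            i += 1
    return counts


def summarize_pileup_line(line):
    """Split one tab-delimited pileup line and count its alleles."""
    chrom, pos, ref, depth, bases = line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), ref, int(depth), count_pileup_bases(ref, bases)


# Made-up example line: 6 reads cover the site, 2 of them show a T instead of A.
example = "chr1\t100\tA\t6\t..,^F.Tt\tIIIIII"
print(summarize_pileup_line(example))
```

On the made-up line, four reads support the reference base A and two support T. A site where a sizeable fraction of reads carries the same non-reference base is a candidate SNP; the threshold is a choice left to the reader.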
