Next-‐Generafon Sequencing

Next-generaon sequencing Lecture 8 idenfy sequence variaons ChIP-seq Applicaons DNA-seq Idenfy Pathogens RNA-seq Kahvejian et al, 2008 Applicaons of RNA-seq Study of mRNA gene expression profile and discovery of differen*ally expressed genes. Study of microRNA profiling. Discovery of new transcripts, such as long non- coding RNA and circular RNA. Short Read Alignment short reads have to be aligned (mapped) to a reference sequence. To genomes To transcriptomes Short Read Applicaons • To genomes (DNA-seq) • To transcriptomes (RNA-seq, ChIP-seq, Methyl-seq) Measure significant peaks Alignment • GCACTTCACAAATTAATGACCATGAGCTCGTTTTTGATAAACTCCAACTACATCGAGCCC • |||||||||||| |||||||||| • ACCATGAGCTCGATTTTGATAAA GOAL: to efficiently find the true locaon of each read from a poten*ally large quan*ty of reference data while dis*nguishing between technical sequencing errors and true gene*c variaon within the sample. 1. Efficient 2. True locaon 3. Dis*nguishing between technical sequencing errors and true gene*c variaon traditional methods would not work well • BLAST: a sequence similarity search algorithm (fast for searching a single sequence) – Not specifically designed for mapping millions of query sequences – Take very long time • Can take 2 days to map half million sequences to 70MB reference genome • Very memory expensive (indexing the whole genome) Performance Requirements • Aligning tens of millions of sequences requires: – Ultra fast search algorithms (100-1000x faster than BLAST) – Small memory usage with economic data structures and containers for programming • Alignment requirements – The requirements for short-read mapping applications are very different from traditional sequence database search approaches for ortholog identification. – Most short-read alignment algorithms will not work for longer sequences! However, most of them are more sensitive for short- reads than BLAST, because they lack its word size limitation. – Only best hits with almost perfect alignments are required. Lower scoring alternative hits (more mismatches) are less interesting. – Often only perfect matching required, but with the possibility to allow 1-2 mismatches and only sometimes very short gaps. Alignment • “Mapping the vast quan**es of short sequence fragments produced by next- generaon sequencing plaorms is a challenge. What programs are available and how do they work?” (Trapnell et al. Nature Biotechnology, 2009.) 9 9 Mapping Algorithms • Hash Table (Lookup table) – FAST, but requires perfect matches. [O(m, n + N)] • Array Scanning – Can handle mismatches, but not gaps. [O(m, N)] • Dynamic Programming (Smith-Waterman) – Indels – Mathemacally op*mal solu*on – Slow (most programs use Hash Mapping as a prefilter) [O(mnN)] • Merge Sor*ng - i.e. Slider [Malhis et al 2009] • Indexing method, i.e. Burrows-Wheeler Transform (BW Transform) – FAST. [O(m + N)] (without mismatch/gap) – Memory efficient. – But for gaps/mismatches, it lacks sensi*vity Dynamic programming • A dynamic programming can be used to find the local alignments between a text T and a paern P in O(|T|,|P|) *me, but need memory about O(|T|x|P|) • It is used for two sequence comparison developed by Smith and Waterman • The genome is too big for this approach Indexing • Lengths of genome sequences and numbers of reads are too large for direct approaches • Indexing is required Hash tables, such as Suffix array or Burrows- spaced seed wheeler Transform Short-Read Alignment Tools with indexing • Indexing Reads with Hash Tables – ZOOM: uses spaced seeds algorithm [Lin et al 2008] – RMAP: simpler spaced seeds algorithm [Smith et al 2008] – SHRiMP: employs a combination of spaced seeds and the Smith-Waterman – MAQ [Li et al 2008b] – Eland (commercial Solexa Pipeline) • Indexing Reference with Hash Tables – SOAPv1 [Li et al 2008] • Indexing Reference with Sux Array/Burrows-Wheeler – Bowtie [Langmead et al 2009] – BWA – SOAPv2 MAQ • Algorithm [Li et al 2008b] – Uses a hashing technique that guarantees to found alignments with up to two mismatches in the first 28 bp of the reads. – It indexes the read sequences in six hash tables and scans the reference genome sequence for seed hits that are subsequently extended and scored. – The commercial Eland alignment program uses a very similar approach. • Performance – Slower than Bowtie and SOAP. • Features – Versatile pipeline for SNP detection. – Can report all hits for queries with multiple ones. – Performs on single reads only ungapped alignments. Gaps only possible for paired end reads by applying Smith- Waterman algorithm on small candidate set. Bowe, Burrows-Wheeler Transformaon (BWT) • Store en*re reference genome. • Align tag base by base from the end. • When tag is traversed, all ac*ve locaons are reported. • If no match is found, then back up and try a subs*tu*on. Burrows-Wheeler Transform (BWT) § BWT is the sequence of the last characters of sorted list of rotaons of the string (no*ce that they are sorted in the same order as the prefixes). Burrows-Wheeler Transform (BWT) BWT $acaacg $acaacg g$acaac aacg$ac cg$acaa acaacg$ acaacg$ acg$aca acg$aca gc$aaac aacg$ac caacg$a caacg$a cg$acaa acaacg$ g$acaac rotaon sorng Burrows-Wheeler Matrix $acaacg 3 aacg$ac 1 acaacg$ suffix array 4 acg$aca 2 caacg$a 5 cg$acaa 6 g$acaac Key observaon 1 1 a1c1a2a3c2g1$1 $acaacg 2 1 aacg$ac “last first (LF) mapping” 1 1 acaacg$ The i-th occurrence of character X in the 3 2 last column corresponds to acg$aca the same text character as the i-th 1 1 occurrence of X in the first column. caacg$a 2 3 cg$acaa 1g$acaac2 Recover text a1c1a2a3c2g1$1 1 1 $acaacg 4 2 1 aacg$ac 6 5 5 6 1 1 6 acaacg$ 3 2 acg$aca 3 3 3 1 1 1 caacg$a 4 4 4 2 2 2 3 5 5 5 cg$acaa 6 6 6 1g$acaac2 Main advantage of BWT against suffix array § BWT very compact: – Approximately 1/4 byte per base (2 bits) – As large as the original text, plus a few “extras” – Can fit onto a standard computer with 2GB of memory – BWT needs less memory than suffix array • For example, human genome, m = 3 x 109 : – Suffix array: mlog2(m) bits = 4m bytes = 12GB – BWT: m/4 bytes plus extras = 1 - 2 GB • Linear-*me search algorithm – propor*onal to length of query for exact matches Comparison Indexing Reads with Hash Burrows-Wheeler Tables • Requires <2Gb of • Requires ~50Gb of memory. memory. • Runs 30-fold faster. • Runs 30-fold slower. • Is much more • Is much simpler to program. complicated to program. MAQ Bow*e, BWA etc. Short-read mapping sooware Software Technique Developer License SOAP Hashing refs BGI Academic Maq Hashing reads Sanger (Li, Heng) GNUPL Bowtie BWT Salzberg/UMD GNUPL BWA BWT Sanger (Li, Heng) GNUPL SOAP2 BWT & hashing BGI Academic hdp://www.oxfordjournals.org/our_journals/bioinformacs/nextgeneraonsequencing.html DNA mappers: blue RNA mappers: red miRNA mappers: green Bisulphite mappers: purple Fonseca, Bioinforma%cs, 2012 3170 N.A.Fonseca et al. Table 1. List of mappers Citations Mapper Data Seq.Plat. Input Output Avail. Version Cit. Years Reference BFAST DNA I,So,4, Hel (C)FAST(A/Q) SAM TSV OS 0.7.0 94 37.11 Homer et al. (2009) Bismark Bisulphite I FASTA/Q SAM OS 0.7.3 7 6.21 Krueger and Andrews (2011) BLAT DNA N FASTA TSV BLAST OS 34 2844 275.67 Kent (2002) Bowtie DNA I,So,4,Sa,P (C)FAST(A/Q) SAM TSV OS 0.12.7 1168 363.42 Langmead et al. (2009) Bowtie2 DNA I,4,Ion FASTA/Q SAM TSV OS 2.0beta5 0.00 Langmead and Salzberg (2012) BS Seeker Bisulphite I FASTA/Q SAM OS 19 9.26 Chen et al. (2010) BSMAP Bisulphite I FASTA/Q SAM TSV OS 2.43 31 11.06 Xi and Li (2009) BWA DNA I,So,4,Sa,P FASTA/Q SAM OS 0.6.2 738 224.20 Li and Durbin (2009) BWA-SW DNA I,So,4,Sa,P FASTA/Q SAM OS 0.6.2 160 67.69 Li and Durbin (2010) BWT-SW DNA N FASTA TSV OS 20070916 45 10.42 Lam et al. (2008) CloudBurst DNA N FASTA TSV OS 1.1 146 46.97 Schatz (2009) DynMap DNA N FASTA TSV OS 0.0.20 0.00 Flouri et al. (2011) ELAND DNA I FASTA TSV Com 2 71.09Unpublisheda Exonerate DNA N FASTA TSV OS 2.2 255 34.69 Slater and Birney (2005) GEM DNA I, So FASTA/Q SAM, counts Bin 1.x 41.35Unpublishedb GenomeMapper DNA I FASTA/Q BED TSV OS 0.4.3 31 11.66 Schneeberger et al. (2009) GMAP DNA I,4,Sa,Hel,Ion,P FASTA/Q SAM, GFF OS 2012-04-27 217 29.52 Wu and Watanabe (2005) GNUMAP DNA I FASTA/Q Illumina SAM TSV OS 3.0.2 15 5.73 Clement et al. (2010) GSNAP DNA I,4,Sa,Hel,Ion,P FASTA/Q SAM OS 2012-04-27 72 31.61 Wu and Nacu (2010) MapReads DNA So FASTA/Q TSV OS 2.4.1 0.00 Unpublishedc MapSplice RNA I FASTA/Q SAM BED OS 1.15.2 50 28.17 Wang et al. (2010) MAQ DNA I,So (C)FAST(A/Q) TSV OS 0.7.1 957 251.66 Li et al. (2008a) MicroRazerS miRNA N FASTA SAM TSV OS 0.1 72.75Emdeet al. (2010) MOM DNA I,4 FASTA TSV Bin 0.6 18 5.55 Eaves and Gao (2009) MOSAIK DNA I,So,4,Sa,Hel,Ion,P (C)FAST(A/Q) BAM OS 2.1 41.18Unpublishedd mrFAST miRNA I FASTA/Q SAM OS 2.1.0.4 158 58.34 Alkan et al. (2009) mrsFAST miRNA I,So FASTA/Q SAM OS 2.3.0 32 18.03 Hach et al.

Next-‐Generafon Sequencing

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support