SMALT - A New Mapper for DNA Sequencing Reads

F1000 Posters. Copyright protected.

Hannes Ponstingl and Zemin Ning
Sequencing Informatics Division, The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

Outline

SMALT is a pairwise sequence alignment program designed for the efficient and accurate mapping of DNA sequencing reads onto genomic reference sequences. Reads from a range of sequencing platforms, for example Illumina-Solexa, Roche-454 or ABI-Sanger, can be processed, including paired-end reads.

The software employs a perfect hash index of short words, less than 15 nucleotides long, sampled at equidistant steps along the genomic reference sequences [1]. For each read, potentially matching segments in the reference are identified from seed matches in the index and subsequently aligned with the read using a banded Smith-Waterman algorithm.

The best gapped alignment of each read is reported, including a score for the reliability of the best mapping. The user can adjust the trade-off between sensitivity and speed by tuning the length and spacing of the hashed words.

A mode for the detection of split (chimeric) reads is provided. Multi-threaded program execution is supported.

Performance

Paired-end 2 × 100 bp reads produced by Illumina sequencing platforms for the human genome can be mapped at a rate of 1.2 × 10^6 pairs per hour (Table 2). Simulations (Table 1) suggest error rates below 0.03% when 96.7% of the reads are mapped.
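The seed lookup described in the Outline can be illustrated with a small sketch. This is not SMALT's implementation: it uses an ordinary Python dictionary in place of a perfect hash, and the names build_index, candidate_segments and min_hits are hypothetical. It only shows how words sampled every `step` bases of the reference can be matched against the words of a read, and how hits are grouped by diagonal (reference position minus read offset) to propose segments for a later banded Smith-Waterman alignment, which is omitted here.

```python
from collections import defaultdict


def build_index(reference, k, step):
    """Hash the k-mers that start at every `step`-th position of the reference."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, step):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_segments(read, index, k, min_hits=2):
    """Look up every k-mer of the read and count seed hits per diagonal.

    A hit at reference position r for read offset q lies on diagonal r - q,
    i.e. the reference position where the read would start without indels.
    Returns (diagonal, hit count) pairs, best supported first.
    """
    hits = defaultdict(int)
    for q in range(len(read) - k + 1):
        for r in index.get(read[q:q + k], ()):
            hits[r - q] += 1
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [(diag, n) for diag, n in ranked if n >= min_hits]


# Toy example: a 34-base reference and a read copied from position 10.
reference = "ACGTTGCAAGGCTTACCGATAGGCTAACGTTAGC"
read = reference[10:26]
index = build_index(reference, k=4, step=2)
print(candidate_segments(read, index, k=4))
```

A larger `step` shrinks the index and speeds up the lookup, but fewer words of each read find a seed, which is the sensitivity versus speed trade-off mentioned above.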
Figure 1: The SMALT concept of short word hashing and dynamic programming. k-mer word seeds of the sequencing read (query) are looked up in a hash table constructed for the genomic sequences. Adjacent k-mer word hits are joined to potentially matching segments [1], which are then aligned using a banded Smith-Waterman algorithm.

Higher variation rates, including base errors, can be tolerated. For example, 95.8% of 100 bp read pairs with 5% variation can be mapped at an error rate of 0.2%. This suggests the software will perform very well over a range of current sequencing platforms and for a large variety of mapping tasks, including plant species.

Table 1: Performance assessment on simulated reads. The performance of SMALT was compared to two widely used mappers, BWA [2] and BOWTIE [3], which both employ a Burrows-Wheeler index of the genomic reference. A total of 4 × 10^6 read pairs, each mate 100 or 150 bp long, were generated from the human genome at uniformly distributed positions. Single-base variations, insertions and deletions (indels) were introduced at uniform rates of 0.5%, 1%, 2% and 5%. Every 5th variation was an indel with a length drawn from a geometric distribution (p=0.7). Execution times were measured on a single core of an Intel Xeon E5450 3.0 GHz processor. Error rates refer to the fraction of mapped reads assigned to a location more than a read length away from the correct location. SMALT and BWA were run with default settings, BOWTIE with the options '-e 160 -X 800'.

                                100 bp                          150 bp
program  measured entity        0.5%    1%      2%      5%      0.5%    1%      2%      5%
SMALT    speed [10^6 pairs/h]   1.3     1.3     1.3     1.3     1.1     1.0     1.0     1.0
SMALT    fraction mapped [%]    96.7    96.7    96.6    95.8    97.4    95.6    94.6    95.1
SMALT    error rate [%]         0.01    0.03    0.06    0.22    0.004   0.0004  0.0008  0.006
BWA      speed [10^6 pairs/h]   2.8     1.9     1.2     –       2.0     1.2     0.6     –
BWA      fraction mapped [%]    97.5    95.9    89.5    –       97.5    94.9    84.5    –
BWA      error rate [%]         0.05    0.1     0.2     –       0.03    0.06    0.11    –
BOWTIE   speed [10^6 pairs/h]   6.3     5.3     –       –       4.8     4.1     –       –
BOWTIE   fraction mapped [%]    80.0    67.6    –       –       72.2    55.6    –       –
BOWTIE   error rate [%]         2.17    2.67    –       –       1.86    1.75    –       –

Table 2: Performance assessment on real data. Paired-end reads of 100 bp per mate and an insert size of 300 bp were produced by an Illumina-GA2 sequencer for a whole genome shotgun sequencing run of a human individual. The reference genome was NCBI build 36. Execution times were measured on a single core of an Intel Xeon E5450 3.0 GHz processor.

program  speed                 memory  mapped
SMALT    1.2 × 10^6 pairs/h    3.3 GB  92.7%
BWA      1.8 × 10^6 pairs/h    3.2 GB  86.9%
BOWTIE   6.7 × 10^6 pairs/h    3.0 GB  81.3%

Figure 2: Mapping error of reported mappings as a function of the fraction of reads mapped. SMALT and BWA [2] are compared for the set of simulated 2 × 100 bp paired-end reads of Table 1 with 1% variation rate. [Plot: error [%] against mapped fraction [%] for SMALT and BWA.]

Comparison to other software

The performance of SMALT was compared to BWA [2] and BOWTIE [3], two of the currently fastest mappers that are widely used for the human genome. The simulations suggest SMALT is significantly more accurate than BWA and BOWTIE. With the exception of low variation rates of 0.5%, SMALT also maps a greater fraction of reads.

SMALT is 30% slower than BWA and 5x slower than BOWTIE on sequencing reads obtained with current Illumina sequencing platforms for the human genome. However, SMALT matches the speed of BWA for high variation rates of 2%.

Command Line Examples

Mapping with SMALT involves two steps: First, a hash index has to be generated for the genomic reference sequences. Then the sequencing reads are mapped onto the reference using the index. All sequence input files have to be in FASTA or FASTQ format.

    smalt index -k 13 -s 6 hs37k13s6 NCBI37.fasta

builds a hash table for the human genome. Two files, hs37k13s6.smi and hs37k13s6.sma, are written to disk. -k 13 specifies the length, -s 6 the spacing of the hashed words. This setting is suitable for mapping reads of the Illumina-Solexa platform with read length > 70 nucleotides.

    smalt map -i 800 hs37k13s6 mate_1.fastq mate_2.fastq

loads the hash table created by the previous step into memory and maps paired-end reads with an expected range of insert sizes of up to 800 bp.

Availability

Pre-compiled binaries are available via FTP from
ftp://ftp.sanger.ac.uk/pub/hp3/smalt.tgz
The software will be made available shortly as open source at
http://www.sanger.ac.uk/software

References

[1] Z. Ning et al. (2001) Genome Res. 11, 1725–29.
[2] H. Li & R. Durbin (2009) Bioinformatics 25, 1754–60.
[3] B. Langmead et al. (2009) Genome Biol. 10, R25.

Acknowledgements

Funding by the Wellcome Trust is gratefully acknowledged.