Systematic Evaluation of Spliced Alignment Programs for RNA-Seq Data
Total Page:16
File Type:pdf, Size:1020Kb
ANALYSIS OPEN Systematic evaluation of spliced alignment programs for RNA-seq data Pär G Engström1,13, Tamara Steijger1, Botond Sipos1, Gregory R Grant2,3, André Kahles4,5, The RGASP Consortium6, Gunnar Rätsch4,5, Nick Goldman1, Tim J Hubbard7, Jennifer Harrow7, Roderic Guigó8,9 & Paul Bertone1,10–12 High-throughput RNA sequencing is an increasingly accessible examines the density of independent reads at those loci. Many method for studying gene structure and activity on a genome- algorithms also consider base-call quality scores and use sophis- wide scale. A critical step in RNA-seq data analysis is the ticated indexing schemes to decrease runtime. alignment of partial transcript reads to a reference genome Here we assess the performance of 26 RNA-seq alignment sequence. To assess the performance of current mapping protocols on real and simulated human and mouse transcrip- software, we invited developers of RNA-seq aligners to process tomes. We adopted a competitive evaluation model applied four large human and mouse RNA-seq data sets. In total, in other areas of bioinformatics11–14. Developers were invited to we compared 26 mapping protocols based on 11 programs run their software and submit results for evaluation as part of and pipelines and found major performance differences the RNA-seq Genome Annotation Assessment Project (RGASP). between methods on numerous benchmarks, including Programs included six spliced aligners GSNAP7, MapSplice4, alignment yield, basewise accuracy, mismatch and gap PALMapper8, ReadsMap, STAR9 and TopHat5,6) and four placement, exon junction discovery and suitability of alignment pipelines (GEM3, PASS15, GSTRUCT and BAGET). alignments for transcript reconstruction. We observed GSTRUCT is based on GSNAP, whereas BAGET uses a contigu- concordant results on real and simulated RNA-seq data, ous DNA aligner to map reads to the genome as well as to exon confirming the relevance of the metrics employed. Future junction sequences derived from reference gene annotation. developments in RNA-seq alignment methods would benefit For comparison, the contiguous aligner SMALT was also tested. from improved placement of multimapped reads, balanced SMALT can map reads in a split manner, but it lacks several fea- utilization of existing gene annotation and a reduced false tures of dedicated spliced aligners, such as precise determination discovery rate for splice junctions. of exon-intron boundaries. We demonstrate that choice of align- © 2013 Nature America, Inc. All rights reserved. America, Inc. © 2013 Nature ment software is critical for accurate interpretation of RNA-seq Programs for aligning transcript reads to a reference genome data, and we identify aspects of the spliced-alignment problem address the challenging task of placing spliced reads across introns in need of further attention. npg and correctly determining exon-intron boundaries. The advent of RNA-seq prompted the development of a new generation of RESULTS spliced-alignment software, with several advances over earlier Alignment protocols were evaluated on Illumina 76-nucleotide (nt) programs such as the BLAST-like alignment tool (BLAT)1,2. The paired-end RNA-seq data from the human leukemia cell line K562 tools GEM3, GSTRUCT, MapSplice4 and TopHat5,6 implement (1.3 × 109 reads), mouse brain (1.1 × 108 reads) and two simulated a two-step approach in which initial read alignments are ana- human transcriptomes (8.0 × 107 reads each; Supplementary lyzed to discover exon junctions; these junctions are then used Table 1). Nine development teams contributed alignments for to guide final alignment. Several programs can also use existing evaluation. We additionally included two versions of the widely gene annotation to inform spliced-read placement5–9. Most RNA- used RNA-seq aligner TopHat5,6. Most development teams pro- seq aligners can further increase accuracy by prioritizing align- vided results from several alignment protocols, corresponding to ments in which read pairs map in a consistent fashion3,5–7,9,10. To different parameter choices and pipeline configurations (Fig. 1 place reads that match multiple genomic sequences, GSTRUCT and Supplementary Note). 1European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK. 2Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, USA. 3Penn Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA. 4Computational Biology Center, Sloan-Kettering Institute, New York, New York, USA. 5Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany. 6Full lists of members and affiliations appear at the end of the paper. 7Wellcome Trust Sanger Institute, Cambridge, UK. 8Centre for Genomic Regulation, Barcelona, Spain. 9Universitat Pompeu Fabra, Barcelona, Spain. 10Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany. 11Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany. 12Wellcome Trust–Medical Research Council Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK. 13Present address: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden. Correspondence should be addressed to P.B. ([email protected]). RECEIVED 31 MARCH; ACCEPTED 10 SEPTEMBER; PUBLISHED ONLINE 3 NOVEMBER 2013; DOI:10.1038/NMETH.2722 NATURE METHODS | VOL.10 NO.12 | DECEMBER 2013 | 1185 ANALYSIS Both uniquely mapped Both multimapped One unique and one multi One unique and one unmapped One multi and one unmapped K562 Mouse brain Simulation 1 Simulation 2 BAGET ann GEM ann GEM cons GEM cons ann GSNAP GSNAP ann GSTRUCT GSTRUCT ann MapSplice MapSplice ann PALMapper PALMapper ann PALMapper cons PALMapper cons ann PASS PASS cons ReadsMap SMALT STAR 1-pass STAR 1-pass ann STAR 2-pass STAR 2-pass ann TopHat1 TopHat1 ann TopHat2 TopHat2 ann 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Mapped fragments (%) Figure 1 | Alignment yield. Shown is the percentage of sequenced or simulated read pairs (fragments) mapped by each protocol. Protocols are grouped by the underlying alignment program (gray shading). Protocol names contain the suffix “ann” if annotation was used. The suffix “cons” distinguishes more conservative protocols from others based on the same aligner. The K562 data set comprises six samples, and the metrics presented here were averaged over them. Alignment yield aligners have options to increase mismatch tolerance beyond the There were major differences among protocols in the alignment settings used here, but this approach may negatively affect other yield (68.4–95.1% of K562 read pairs; mean = 91.5%, s.d. = 5.4), performance aspects. extent to which both reads from a pair were mapped, and fre- Polymorphisms and accumulated mutations distinguish the quency of ambiguous mappings (reads with several reported cancer cell line K562 from the human reference assembly, which alignments) (Fig. 1 and Supplementary Tables 2 and 3). These itself is a consensus based on several individuals16. Conversely, trends were similar across data sets (Fig. 1). The fraction of pairs mouse RNA samples were obtained from strain C57BL/6NJ, the with only one read aligned was typically highest for TopHat, genome of which is nearly identical to the mouse reference assem- ReadsMap and PASS, whereas PALMapper output exhibited more bly17. Accordingly, high-quality reads from mouse were mapped complex discrepancies within read pairs. GEM results consistently at a greater rate and with fewer mismatches than those from K562 included many ambiguous mappings (37% of sequenced reads (Supplementary Fig. 3). Even so, differences among aligners in © 2013 Nature America, Inc. All rights reserved. America, Inc. © 2013 Nature per data set on average). Mapping ambiguities were also common mismatch and truncation frequencies were consistent across data with PALMapper, although these were reduced with the more sets (Fig. 2 and Supplementary Fig. 4). Mapping properties are conservative protocols that involve stringent filtering of align- thus largely dependent on software algorithms even when the ments (Fig. 1 and Supplementary Fig. 1). To avoid introducing genome and transcriptome are virtually identical. npg bias at later evaluation stages due to differences in the number Consistent with real RNA-seq data, GSNAP, GSTRUCT, of alignments per read, we instructed developer teams to assign MapSplice and STAR outperformed other methods for base- a preferred (primary) alignment for each read mapped in their wise accuracy on simulated data (Supplementary Table 2). program output. The following results are based on these primary As expected, error rates were substantially lower for uniquely alignments unless otherwise noted. mapped reads than for primary alignments of multimapped reads (Supplementary Table 4). Notably, despite the many ambiguous Mismatches and basewise accuracy mappings reported by GEM and PALMapper, the primary align- Compared to the other aligners, GSNAP, GSTRUCT, MapSplice, ments were usually correct (Supplementary Table 4). PASS, SMALT and STAR reported more primary alignments Differences among methods were most apparent for spliced devoid of mismatches (Fig. 2a), partly because these methods reads (Supplementary Tables 5–7). On the first simulated data can truncate read ends and thus output an incomplete align- set, GSNAP, GSTRUCT, MapSplice and STAR mapped 96.3–98.4% ment when they are unable to map an entire sequence (Fig.