Efficient and Accurate Error Correction for Illumina RNA-Seq Reads Li Song1 and Liliana Florea1,2*

Song and Florea GigaScience (2015) 4:48 DOI 10.1186/s13742-015-0089-y TECHNICAL NOTE Open Access Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads Li Song1 and Liliana Florea1,2* Abstract Background: Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. Findings: We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. Conclusions: Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/. Keywords: Next-generation sequencing, RNA-seq, Error correction, k-mers Introduction based methods, which are the most popular of the Next-generation sequencing of cellular RNA (RNA-seq) three, classify a k-mer as trusted or untrusted depending has become the foundation of virtually every transcrip- on whether the number of occurrences in the input tomic analysis. The large number of reads generated from reads exceeds a given threshold. Then, for each read, a single sample allow researchers to study the genes being low-frequency (untrusted) k-mers are converted into expressed and estimate their expression levels, and to dis- high-frequency (trusted) ones. Candidate k-mers are cover alternative splicing and other sequence variations. stored in a data structure such as a Hamming graph, However, biases and errors introduced at various stages which connects k-mers within a fixed distance, or a during the experiment, in particular sequencing errors, Bloom filter. Methods in this category include Quake [5], can have a significant impact on bioinformatics analyses. Hammer [6], Musket [7], Bless [1], BFC [2], and Lighter Systematic error correction of whole-genome sequenc- [3]. Suffix tree and suffix array based methods build a data ing (WGS) reads was proven to increase the quality structure from the input reads, and replace a substring of alignment and assembly [1–3], two critical steps in in a read if its number of occurrences falls below that analyzing next-generation sequencing data. There are expected given a probabilistic model. These methods, currently several error correction methods for WGS which include Shrec [8], Hybrid-Shrec [9] and HiTEC reads, classified into three categories [4]. K-spectrum [10], can handle multiple k-mer sizes. Lastly, multiple sequence alignment (MSA) based methods such as Coral *Correspondence: [email protected] [11] and SEECER [12] cluster reads that share k-mers to 1Department of Computer Science, Johns Hopkins University, 21218 create a local vicinity and a multiple alignment, and use Baltimore, USA the consensus sequence as a guide to correct the reads. 2McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, 21205 Baltimore, USA © 2015 Song and Florea. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Song and Florea GigaScience (2015) 4:48 Page 2 of 8 RNA-seq sequence data differ from WGS data in sev- Read error correction: the path search algorithm eral critical ways. First, while read coverage in WGS data is As with any k-spectrum method, Rcorrector distinguishes largely uniform across the genome, genes and transcripts among solid and non-solid k-mersasthebasisforits in an RNA-seq experiment have different expression lev- correction algorithm. A solid k-mer is one that passes a els. Consequently, even low-frequency k-mers may be given count threshold and therefore can be trusted to be correct, belonging to a homolog or a splice isoform. correct. Rcorrector uses a flexible threshold for solid k- Second, alternative splicing events can create multiple mers, which is calculated for each k-mer within each read correct k-mers at the event boundaries, a phenomenon sequence. At run time, Rcorrector scans the read sequence that occurs only at repeat regions for WGS reads. In both and, at each position, decides whether the next k-mer and of these cases, the reads would be erroneously converted each of its alternatives are solid and therefore represent by a WGS correction method. Hence, error correctors valid continuations of the path. The path with the smallest for WGS reads are generally not well suited for RNA-seq number of differences from the read sequence, represent- sequences [13]. ing the likely transcript of origin, is then used to correct There is so far only one other tool designed specifically k-mers in the original read. for RNA-seq error correction, called SEECER [12], based More formally, let u be a k-mer in read r and S(u, c) on the MSA approach. Given a read, SEECER attempts to denote the successor k-mer for u when appending determine its context (overlapping reads from the same nucleotide c,withc ∈{A, C, G, T}. For example, in Fig. 1, transcript), characterized by a hidden Markov model, and S(AAGT, C) = AGTC, k = 4. Let M(u) denote the multi- to use this to identify and correct errors. One signifi- plicity of k-mer u. To find a start node in the graph from cant drawback, however, is the large amount of memory which to search for a valid path, Rcorrector scans the read needed to index the reads. Herein we propose a novel to identify a stretch of two or more consecutive solid k- k-spectrumbasedmethod,Rcorrector(RNA-seqerror mers, and marks these bases as solid. Starting from the CORRECTOR), for RNA-seq data. Rcorrector uses a flex- longest stretch of solid bases, it proceeds in both direc- ible k-mer count threshold, computing a different thresh- tions, one base at a time as described below. By symmetry, old for a k-mer within each read, to account for different we only illustrate the search in the 5 → 3 direction. transcript and gene expression levels. It also allows for Suppose u = riri+1 ...ri+k−1 is the k-mer starting multiple k-mer choices at any position in the read. Rcor- at position i in read r. Rcorrector considers all possible rector only stores k-mers that appear more than once in successors S(u, c), c ∈{A,C,G,T}, and their multiplicities the read set, which makes it scalable with large datasets. M(S(u, c)) and determines which ones are solid based on Accurate and efficient, Rcorrector is uniquely suited to a locally defined threshold (see below). Rcorrector tests datasets from species with large and complex genomes all the possible nucleotides for position i + k and retains and transcriptomes, such as human, without requiring those that lead to solid k-mers, and then follows the paths significant hardware resources. Rcorrector can also be intheDeBruijngraphfromthesek-mers. Multiple k-mer applied to other types of data with non-uniform cover- choices are considered in order to allow for splice variants. age such as single-cell sequencing, as we will show later. If the nucleotide in the current path is different from ri+k, In the following sections we present the algorithm, first, then it is marked as a correction. When the number of followed by an evaluation of this and other methods on corrections in the path exceeds an a priori defined thresh- both simulated and real data. In particular, we illustrate old, Rcorrector terminates the current search path and and compare the impact of several error correctors for two starts a new one. In the end, Rcorrector selects the path popular bioinformatics applications, namely, alignment with the minimum number of changes and uses the path’s and assembly of reads. sequencetocorrecttheread.Toimprovespeed,Rcor- rector does not attempt to correct solid positions, and Algorithm gradually decreases the allowable number of corrections if De Bruijn graph thenumberofsearchedpathsbecomeslarge. In a first preprocessing stage, Rcorrector builds a De Bruijn graph of all k-mers that appear more than once A flexible local threshold for solid k-mers in the input reads, together with their counts. To do so, Let u be the k-mer starting at position i in the read, as Rcorrector uses Jellyfish2 [14] to build a Bloom counter before. Unlike with WGS reads, even if the multiplicity that detects k-mers occurring multiple times, and then M(S(u, ri+k)) of its successor k-mer is very low, the base stores these in a hash table. Intuitively, the graph encodes ri+k may still be correct, for instance sampled from a low- all transcripts (full or partial) that can be assembled from expression transcript.

Load more