Pnas.Org/Lookup/Suppl/Doi:10
Total Page:16
File Type:pdf, Size:1020Kb
Assembler for de novo assembly of large genomes PNAS PLUS Te-Chin Chua,b, Chen-Hua Lua, Tsunglin Liuc, Greg C. Leeb, Wen-Hsiung Lid,e,1, and Arthur Chun-Chieh Shiha,f,1 aInstitute of Information Science, dBiodiversity Research Center, and fResearch Center for Information Technology Innovation, Academia Sinica, Taipei 115, Taiwan; bDepartment of Computer Science and Information Engineering, National Taiwan Normal University, Taipei 106, Taiwan; cInstitute of Bioinformatics and Biosignal Transduction, National Cheng Kung University, Tainan 701, Taiwan; and eDepartment of Ecology and Evolution, The University of Chicago, Chicago, IL 60637 Contributed by Wen-Hsiung Li, July 25, 2013 (sent for review March 15, 2013) Assembling a large genome using next generation sequencing Assembly By Short Sequences (ABySS) (6), ALLPATHS-LG reads requires large computer memory and a long execution time. (7), EULER-USR (8), SOAPdenovo (9, 10), and Velvet (11). To reduce these requirements, we propose an extension-based The de Bruijn graph approach requires more memory than assembler, called JR-Assembler, where J and R stand for “jumping” both the OLC and the extension approach because it needs to extension and read “remapping.” First, it uses the read count to save the entire graph in memory for assembly. To handle a large select good quality reads as seeds. Second, it extends each seed by genome such as a 3 Gb genome, ALLPATHS-LG and SOAP- a whole-read extension process, which expedites the extension denovo (7, 9, 10) have been carefully engineered, but most other process and can jump over short repeats. Third, it uses a dynamic de Bruijn graph assemblers cannot handle large genomes when back trimming process to avoid extension termination due to se- the memory is limited. Note that the memory requirement quencing errors. Fourth, it remaps reads to each assembled se- increases dramatically as the read length increases, which will quence, and if an assembly error occurs by the presence of a occur as the sequencing technology advances. Although cor- repeat, it breaks the contig at the repeat boundaries. Fifth, it ap- recting errors in sequencing reads can reduce the graph size and plies a less stringent extension criterion to connect low-coverage increase the assembly quality, the execution time still increases regions. Finally, it merges contigs by unused reads. An extensive with the total read count and the read length. Therefore, bal- comparison of JR-Assembler with current assemblers using data- ancing the assembly quality, memory requirement, and execution sets from small, medium, and large genomes shows that JR-Assem- time is a major challenge for de novo assembly. bler achieves a better or comparable overall assembly quality and In this paper, we propose a unique extension-based assembler requires lower memory use and less central processing unit time, for SRS data, called JR-Assembler (J stands for jumping ex- especially for large genomes. Finally, a simulation study shows that tension and R for read remapping). We revive the extension- JR-Assembler achieves a superior performance on memory use and based approach because it is straightforward and economical in central processing unit time than most current assemblers when the run time and memory use. A key concept in our approach is that read length is 150 bp or longer, indicating that the advantages of JR- instead of a base-by-base extension, JR-Assembler extends a Assembler over current assemblers will increase as the read length read by other whole reads—that is, it makes a jump. This ap- increases with advances in next generation sequencing technology. proach speeds up extension and can readily jump over small repeats. Second, it includes a dynamic back trimming process, hort read sequencing (SRS) platforms, such as HiSeq 2500, so that the extension can effectively avoid termination by se- Shave much higher throughputs and much lower costs than quencing errors. Finally, to avoid partial repeat resolutions, JR- traditional Sanger sequencing. However, genome assembly is Assembler remaps reads to each assembled sequence, and if an more difficult using SRS data than using Sanger data because the error is found due to the existence of repeats, it breaks the repeat amount of SRS data are huge [≥100 giga base pairs (Gb) per run] sequences at their boundaries and reconnects the broken contigs and SRS reads (100∼250 bp) are much shorter than Sanger reads at the stage of contig merging. We conducted performance ∼ ( 1,000 bp). GENETICS Most assemblers for Sanger data apply the overlap–layout– Significance consensus (OLC) approach (1). This approach is not suitable for SRS data because the run time quadratically increases with the Assembling a large genome faces three challenges: assembly number of reads. The long run time problem can be alleviated by quality, computer memory requirement, and execution time. the indexing technique, as in Extract DE Novo Assembler (Edena) Our developed assembler, JR-Assembler, uses (a) a strategy (2). However, Edena and string graph assembler (SGA) (3) are that selects good seeds for contig construction, (b) an exten- perhaps the only two OLC assemblers for SRS data. sion strategy that uses whole sequencing reads to increase the Currently, the two major types of SRS assemblers are exten- chance to jump over repeats and to expedite extension, and (c) sion-based and de Bruijn (or Eulerian) graph-based. Extension- detecting misassemblies by remapping reads to assembled based assemblers [e.g., Short Sequence Assembly by K-mer sequences. Compared with current assemblers, JR-Assembler search and 3′ read Extension (SSAKE)] (4, 5) start from a seed achieves a better overall assembly quality, requires less exe- read and extend it by other reads that overlap with the seed. cution time and requires, with one exception, less memory. The Without constructing a layout graph, this approach saves both advantages of JR-Assembler in memory usage and execution computational time and memory. However, it is sensitive to se- time will increase slowly as the read length increases. Thus, quencing errors and repeats and usually generates a fragmented contrary to the prevailing view, the extension approach seems assembly. The de Bruijn graph approach is fast at the step of superior to the de Bruijn graph approach. finding overlaps between reads. It decomposes a read into con- secutive k-mers, which are represented by nodes connected by Author contributions: T.-C.C., C.-H.L., W.-H.L., and A.C.-C.S. designed research; T.-C.C., edges in a graph. By scanning all reads once, it rapidly finds the C.-H.L., G.C.L., W.-H.L., and A.C.-C.S. performed research; T.-C.C., C.-H.L., and A.C.-C.S. contributed new reagents/analytic tools; T.-C.C., T.L., G.C.L., and A.C.-C.S. analyzed data; overlaps between all reads because overlapping reads have the and T.-C.C., T.L., W.-H.L., and A.C.-C.S. wrote the paper. same k-mer. The resulting graph contains all characteristics of The authors declare no conflict of interest. a genomic sequence. The constructed graph, however, has to be 1To whom correspondence may be addressed. E-mail: [email protected] or whli@ reduced using various algorithms and heuristic rules to eliminate uchicago.edu. sequencing errors and to resolve repeats. This approach is This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. adopted by several leading assemblers for SRS data, including 1073/pnas.1314090110/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1314090110 PNAS | Published online August 21, 2013 | E3417–E3424 comparisons of many assemblers using many datasets from for a fair comparison (Table S1). We compared JR-Assembler genomes of various sizes and using simulation. For large ge- with ABySS (6), Edena (2), SOAPdenovo (9), Taipan (13), and nomes, JR-Assembler was more efficient in memory use and run Velvet (11). For each assembler, we aligned the contigs to the time than current SRS assemblers, while achieving at least reference genome (retrieved from GenBank with accession a comparable genome assembly quality. For small- and medium- no. NC_000913) using BLAST-like alignment tool (BLAT) size genomes, JR-Assembler performed better than most current (14) and checked misassemblies (see Table 1 for methods). For SRS assemblers. JR-Assembler is freely available at http://jr- all methods compared, the misassembled contigs were broken at the assembler.iis.sinica.edu.tw/. junctions of misassemblies. The resulting pieces were also included in the calculation of genome coverage. Results Table 1 shows the assembly statistics by the six assemblers. JR- Overview of JR-Assembler. JR-Assembler runs in five steps: raw Assembler was the best in most assembly metrics, including the read processing, seed selection, seed extension, repeat detection, number of contigs, the maximal contig length, the N50 length, and contig merging (Fig. 1). First, all reads containing any base the number of misassemblies, and the memory use. JR-Assem- “N” or any low-complexity region are filtered out. Second, it bler used only ∼5.1 gigabytes (GB) of memory, whereas the selects “good” reads as seeds using the read count—that is, the others required 11–22 GB. The run times of JR-Assembler, number of identical reads in the data. Third, JR-Assembler uses Taipan, and Velvet were all under 30 min, whereas SOAPdenovo a jumping extension, including many whole reads at a time (Fig. only required 7 min because it ran in multiple threads; the other 2). Moreover, to deal with sequencing errors at read tails, JR- assemblers all ran in a single thread (Velvet had not then pro- Assembler uses back trimming to remove low-quality nucleotides vided the multithread function yet when the experiment was at the 3′-end of a read to facilitate extension (Fig.