Sequence Assembly Demystified

REVIEWS COMPUTATIONAL TOOLS Sequence assembly demystified Niranjan Nagarajan1 and Mihai Pop2 Abstract | Advances in sequencing technologies and increased access to sequencing services have led to renewed interest in sequence and genome assembly. Concurrently, new applications for sequencing have emerged, including gene expression analysis, discovery of genomic variants and metagenomics, and each of these has different needs and challenges in terms of assembly. We survey the theoretical foundations that underlie modern assembly and highlight the options and practical trade-offs that need to be considered, focusing on how individual features address the needs of specific applications. We also review key software and the interplay between experimental design and efficacy of assembly. Paired-end data Previously the province of large genome centres, DNA between DNA sequences is then used to ‘stitch’ together Data from a pair of reads sequencing is now a key component of research carried the individual fragments into larger contiguous sequenced from ends of out in many laboratories. A number of new applica- sequences (contigs), thereby recovering the informa- the same DNA fragment. The tions of sequencing technologies have been spurred tion lost during the sequencing process. The assembly genomic distance between the reads is approximately by rapidly decreasing costs, including the study of process is complicated by the fact that, in many cases, known and is used to constrain microbial communities (metagenomics), the discovery this underlying assumption is incorrect. For example, assembly solutions. See also of structural variants in genomes and the analysis of genomic repeats — segments of DNA repeated in an ‘mate-pair read’. gene structure and expression. The length of the almost identical form throughout a genome — yield sequences generated by modern sequencing instru- fragments with highly similar sequences that originate Mate-pair data Data from a pair of reads ments is considerably shorter (hundreds to thousands of from different places in the genome. Similarly, in tran- sequenced from the same base pairs) than that of the genomes or genomic features scriptome or metagenomic samples, nearly identical circularized DNA fragment. being studied (which commonly span tens of thousands sequences may originate from different transcripts or The circularization step allows to billions of base pairs). Thus, many analyses start with genomes within the sample. How much an assembler for larger fragments sizes to be used. They provide the the computational process of sequence assembly that is confused by such artefacts primarily depends on the same information as paired- joins together the many sequence fragments generated length of the sequences that are read by the sequenc- end reads to the assembler. by the instrument. Biologists who need to assemble ing instrument, as repetitive regions shorter than a the sequencing data generated in their experiments are sequencing read can be automatically resolved14. The Contiguous sequence faced with the challenge of choosing, from a myriad of accuracy of the sequence data also has an important (Contig). A sequence reconstructed by assembling options, the assembly strategy and software best suited role: the more errors an assembler is willing to tolerate together multiple reads. for their experiment. This choice is made harder by within the data, the more similar distinct regions of a the rapid development of new assembly tools, which genome will appear to the assembler. Broadly speaking, is driven by advances in sequencing technologies and segments of the genome that diverge by less than the 1Computational and Systems Biology, Genome Institute the broader scope of applications for these technolo- error rate of the sequencing instrument cannot easily of Singapore, 138672 gies. Among recent innovations in sequence assem- be distinguished by an assembler. Singapore. bly are the use of memory-efficient data structures1,2, In this Review, we begin by providing an overview 2Department of Computer the development of new de novo assembly strategies of the principles underlying modern assembly tools as Science, University of 3–5 6,7 Maryland, College Park, for data derived from metagenomic , single-cell or well as the engineering trade-offs involved in designing 8,9 Maryland 20742, USA. transcriptome experiments and the effective use of them. We discuss the value of experimental design for Correspondence to M.P. complementary information derived from multiple successful assembly as well as the ongoing work in the e-mail: [email protected]. sequencing technologies10,11 and/or paired-end data or important field of assembly evaluation. We then discuss edu mate-pair data12,13. trade-offs in the context of the most common applica- doi:10.1038/nrg3367 Published online 29 January All assembly approaches rely on the simple assump- tions of sequence assembly. Our goal is to provide non- 2013; corrected online tion that highly similar DNA fragments originate from specialist readers with sufficient background to make 5 February 2013 the same position within a genome. The similarity informed choices in the use of assembly techniques. NATURE REVIEWS | GENETICS VOLUME 14 | MARCH 2013 | 157 © 2013 Macmillan Publishers Limited. All rights reserved REVIEWS Modern sequence assemblers assemblers have successfully been used with longer reads Mathematical analyses of sequence assembly, which by using a pre-processing stage to correct sequencing dates back to the pioneering work of Esko Ukkonen errors17,18, and efficient overlap-based assemblers for in the 1980s15,16, revealed the fundamental difficulty of short reads have also been developed19. The choice of reconstructing a genome from sequenced fragments. assembly paradigm on its own plays a marginal part in Depending on the relationship between the length of defining the performance and efficiency of an assembler. reads and the length of repeats in the DNA being assem- As sequencing technologies evolve, assembly tools bled (BOX 1), genome assembly can range from trivial are adapting to cope with the features and scale of the (when all repeats are shorter than the read length) to data (TABLE 1). Whereas early sequence assemblers had to computationally intractable (that is, finding the cor- make do with sparse coverage, modern assemblers have rect answer requires trying an exponential number of to deal with the problems of plenty. The success of modern arrangements of reads, a task that cannot be achieved sequence assemblers has thus primarily been determined even on the best supercomputers) to impossible14 (that by their ability to face the twin challenges of engineering is, the information contained within the reads is insuf- (that is, dealing with the scale of data) and analysis (that ficient to identify the correct sequence reconstruction is, adapting to and exploiting specific features of the data). from an exponential number of equally good alterna- tives). Given the characteristics of currently available Engineering challenges. Modern assemblers must be able sequencing technologies (BOX 1), few projects fall into to analyse large data sets efficiently, to handle sequencing the category of ‘trivial’ (primarily small viral genomes or errors and to capture the repeat structure of the genome transcriptomes). In most cases, full genome sequences correctly. Meeting these challenges has been key in defin- cannot be efficiently and reliably reconstructed from the ing assembly tools that have had an important impact data produced by the sequencing experiment; rather, on the field. For example, one of the first widely used assemblers produce a fragmented and often error-prone short-read assemblers, Velvet20, made a mark by show- picture of the sequence being assembled. ing that high-quality assemblies could be obtained from ultra-short reads (~30 bp) and high-coverage data sets. Assembly paradigms. Assemblers are based on one of This approach was extended to the assembly of large several different paradigms, such as greedy, overlap– genomes in the program ABySS21 and for the first de novo layout–consensus (OLC), de Bruijn graph and string assembly of a mammalian genome entirely using short graph (introduced in BOX 2). The choice of approach reads with the program SOAPdenovo22. SOAPdenovo is depends on the characteristics of the data being assem- a memory-efficient assembler that also includes robust bled. For example, de Bruijn-graph-based approaches error correction (to reduce sequencing errors) and scaf- have been successful in assembling highly accurate short folding modules (to leverage mate-pair data; see ‘Analysis reads (<~100 bp, such as those generated by the Illumina challenges’ below). Solexa technology; BOX 1), whereas overlap-based The importance of error correction for de Bruijn- approaches (such as OLC or string graph) are mostly graph-based assembly, in particular, has led to the used for longer, more inaccurate data (>200 bp, such as recent development of tools for this task18,23,24 that serve Roche 454 and Sanger sequencing data; BOX 1). However, as useful pre-processors in genome assembly applica- this is not an exclusive arrangement: de Bruijn graph tions. Whereas early short-read assemblers focused on Box 1 | Sequencing and mapping technologies A full survey of sequencing technologies is beyond the scope of this article. The table below compares the currently available technologies in terms of several characteristics of importance for genome assembly: read

Sequence Assembly Demystified

DNA Sequencing

Metagenomic Assembly Using Coverage and Evolutionary Distance (MACE)

De Novo Genome Assembly Versus Mapping to a Reference Genome

Sequencing and Comparative Analysis of <I>De Novo</I> Genome

Improving Metagenomic Assemblies Through Data Partitioning: a GC Content Approach

Extreme Scale De Novo Metagenome Assembly

DNA Sequence Assembly and Genetic Algorithms New Results and Puzzling Insights

Efficient and High Quality Hybrid De Novo Assembly of Human Genomes

Cancer Whole-Genome Sequencing: Present and Future

Basics on Molecular Biology Ŷ

Information Theory of DNA Shotgun Sequencing Abolfazl S

Assembly Algorithms for Next-Generation Sequence Data