Methods for De-Novo Genome Assembly
Total Page:16
File Type:pdf, Size:1020Kb
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 28 June 2020 doi:10.20944/preprints202006.0324.v1 Methods for De-novo Genome Assembly Arash Bayat∗1,3, Hasindu Gamaarachchi1, Nandan P Deshpande2, Marc R Wilkins2, and Sri Parameswaran1 1School of Computer Science and Engineering, UNSW, Australia 2Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, UNSW, Australia 3Health and Biosecurity, CSIRO, Australia June 25, 2020 Abstract Despite advances in algorithms and computational platforms, de-novo genome assembly remains a chal- lenging process. Due to the constant innovation in sequencing technologies (Sanger, SOLiD, Illumina, 454 , PacBio and Oxford Nanopore), genome assembly has evolved to respond to the changes in input data type. This paper includes a broad and comparative review of the most recent short-read, long-read and hybrid assembly techniques. In this review, we provide (1) an algorithmic description of the important processes in the workflow that introduces fundamental concepts and improvements; (2) a review of existing software that explains possible options for genome assembly; and (3) a comparison of the accuracy and the performance of existing methods executed on the same computer using the same processing capabilities and using the same set of real and synthetic datasets. Such evaluation allows a fair and precise comparison of accuracy in all aspects. As a result, this paper identifies both the strengths and weaknesses of each method. This com- parative review is unique in providing a detailed comparison of a broad spectrum of cutting-edge algorithms and methods. Availability: https://arashbayat.github.io/asm ∗To whom correspondence should be addressed. Email: [email protected] 1 © 2020 by the author(s). Distributed under a Creative Commons CC BY license. Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 28 June 2020 doi:10.20944/preprints202006.0324.v1 1 Introduction is usually obtained by applying approximations. This leads to a less accurate assembled genome, which might, DNA is a giant molecular strand that is a chain of four for example, consider only exact-match overlaps in- small molecules. Sequencing is the process of read- stead of all overlaps between reads [43]. Scalability ing DNA molecular chain into strings of A, C, Tand G is another expectation from an assembly pipeline [24]. where each letter (called a base) represents one of the Genome assembly requires a massive amount of pro- small molecules. Due to the available technology to- cessing, and a pipeline should be able to utilize all avail- day, the DNA strand is broken into small pieces and able hardware resources such as processors and mem- each fragment is then sequenced separately. However, ory with the highest efficiency. The program should the order of DNA fragments (reads) cannot be pre- also be able to deal with limitations. For instance, us- served. Therefore, a process called genome assembly is ing a large amount of memory is common in de-novo used. This is the process by which the entire genome assembly programs. A scalable program should be able of an individual or species is obtained. The word as- to deal with the limited amount of available memory, sembly indicates that the entire genome cannot be ob- with minimal impact on execution time [28]. tained at once; rather, it must be assembled from small In addition to the above expectations, the choice of DNA fragments. The process of de-novo assembly has an assembly pipeline should be based on the following to do with assembling a species' genome for the first 'data-specific’ parameters. time. The resulting genome can then be used to facili- tate genome assembly of other individuals of the same • The types of data that are available for assembly: species, in a process called the reference-guided assem- It is important to compare short-read and long- bly. (To understand this, one might think of solving read assembly approaches to each other and to a jigsaw puzzle by using its cover photo for reference.) hybrid assembly techniques. In contrast, de-novo assembly refers to assembling the genome without having a draft genome; this is a com- { Ultra-short read sequencing such as early plex and difficult problem to solve. SOLiD [38] sequencing is suitable for de- Based on the purpose of the assembly, there are sev- Bruijn graph assembly but not an Overlap- eral expectations about the accuracy of the assembled Layout-Consensus (OLC ) assembly approach. genome and the time it takes to be assembled. Fur- { Next-Generation Sequencing (NGS) [31] ma- thermore, the type and volume of sequenced data play chines such as Illumina [4] produce reads up important roles in the assembly process. It is critical to to hundreds of bases (short-reads) with high choose a pipeline that best fits the available data and accuracy. The error rate is less than 2%; fulfils expectations. Only a fair comparison of assem- most errors are of the substitution type and bly pipelines can be informative as it would truly reveal are less frequently short Indels. the strengths and weaknesses of various methods. { paired-end [23] and mate-pair [50] reads are The assembled genome is expected to be accurate; sequenced from two ends of a long DNA various metrics can be used to measure the accuracy of fragment where the distance between them an assembled genome [16]. The most important met- is approximately known. Such information rics include genome coverage and contiguity as well can be used to improve the quality of the as base calling error rate. It is not possible to find a assembled genome. pipeline that maximizes the accuracy of all aspects [9]. The pipeline should be chosen based on the application { PacBio [14] and Oxford Nanopore [20] are for which the genome is assembled. Single-Molecule Real-Time (SMRT ) technolo- If the assembled genome is going to be used as a gies that can sequence reads up to many reference-genome in a reference-guided assembly, high thousands of bases long. However, they suf- genome coverage is prioritized. If part of the genome fer from a high base calling error rate (up to is missed in the reference-genome, all subsequent as- 30%) and can include long Indels. sembled genomes will lack the information on that re- • Sequencing remains an expensive process: It is gion. However, if a slight base calling error exists in the important to understand how different software reference-genome, a variant-calling process shows the programs respond to low and high read coverage same variation in all individuals; this variation pro- depth when assembling a genome. vides evidence of the error in the reference-genome. This can be fixed by updating the reference-genome { If only short-read or long-read are provided, for further analysis. On the other hand, if the genome the coverage depth should be considered in is assembled for functional genomic studies, high base configuring the assembly pipeline. For high calling accuracy in gene areas is critical since an in- coverage data, more filtering can be applied correct base calling could change the predicted protein to collect high-confidence data for more ac- structure. High genome coverage and contiguity in all curate assembly. On the other hand, when regions may not be a benefit in this scenario. the coverage is low, fewer heuristics should There are expectations from the assembly pipeline be applied to utilise all the information in as well. The program should trade-off between speed the data. For example, one should consider and accuracy. Despite algorithmic enhancements, speed a more exhaustive search to find all overlaps 2 Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 28 June 2020 doi:10.20944/preprints202006.0324.v1 between reads, not just those that are easy 2 MATERIALS AND METHODS to capture. { For hybrid assembly where short and long 2.1 Algorithms and Methods reads are used together in an assembly, cov- The input to the assembly process is a set of reads that erage depth has a stronger effect on the choice are passed through all data preprocessing steps. There of assembly pipeline. For example, if high is a graph data structure in the heart of each assembly coverage long-reads are provided, it is possi- pipeline called the assembly graph. The goal of the as- ble to do a long-read-only assembly [24] and sembly graph is to link small DNA fragments to one an- use short-read for genome polishing [51]. For other to build the genome. The output of the assembly lower long-read coverage, long-read should graph is a set of sequences called contigs which are usu- be corrected with short-read [17] prior to ally large pieces of the genome without any information the long-read assembly. If long-read cover- about their order in the genome. In the best scenario, age is not sufficient for assembly, it is pos- there would be one contig per chromosome covering the sible to use long-read to improve contiguity entire chromosome. However, this is an idealistic goal of a short-read assembly [2]. for current assembly pipelines, especially when dealing { For all assemblies, it should be noted that with large genomes. Finally, the post-processing steps sequencing coverage depth is not fixed across would improve the assembly by ordering these contigs, the genome. Variation in coverage results in filling the gaps and fixing errors. Scaffolding is the some regions having low (or no) read cover- process in which the order of the contig in the genome age that leads to a discontiguity in the re- and the distance between them is predicted. Gap fill- sulting assembly. In this case, one should ing is the process of identifying the sequence between consider a pipeline that includes a scaffold- the scaffolds.