Computational Methods for Analysis of Single Molecule Sequencing Data
by Ehsan Haghshenas
M.Sc., University of Western Ontario, 2014
B.Sc., Isfahan University of Technology, 2012
Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
in the School of Computing Science, Faculty of Applied Sciences
© Ehsan Haghshenas 2020
SIMON FRASER UNIVERSITY
Spring 2020
Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.

Approval
Name: Ehsan Haghshenas
Degree: Doctor of Philosophy (Computing Science)
Title: Computational Methods for Analysis of Single Molecule Sequencing Data
Examining Committee:
Chair: Diana Cukierman, University Lecturer
Binay Bhattacharya, Senior Supervisor, Professor
S. Cenk Sahinalp, Co-Supervisor, Senior Investigator, Center for Cancer Research, National Cancer Institute
Cedric Chauve, Co-Supervisor, Professor
Faraz Hach, Co-Supervisor, Assistant Professor, Department of Urologic Sciences, The University of British Columbia; Senior Research Scientist, Vancouver Prostate Centre
Martin Ester, Internal Examiner, Professor
Mihai Pop, External Examiner, Professor, Department of Computer Science, University of Maryland, College Park
Date Defended: March 26, 2020
Abstract
Next-generation sequencing (NGS) technologies paved the way for a significant increase in the number of sequenced genomes, both prokaryotic and eukaryotic. This increase provided an opportunity for considerable advancement in genomics and precision medicine. Although NGS technologies have proven their power in many applications such as de novo genome assembly and variation discovery, computational analysis of the data they generate is still far from being perfect. The main limitation of NGS technologies is their short read length relative to the lengths of (common) genomic repeats. Today, newer sequencing technologies (known as single-molecule sequencing or SMS) such as Pacific Biosciences and Oxford Nanopore are producing significantly longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. For instance, for the first time, a complete human chromosome was fully assembled using ultra-long reads generated by Oxford Nanopore. Unfortunately, long reads generated by SMS technologies are characterized by a high error rate, which prevents their direct utilization in many of the standard downstream analysis pipelines and poses new computational challenges. This motivates the development of new computational tools specifically designed for SMS long reads. In this thesis, we present three computational methods that are tailored for SMS long reads. First, we present lordFAST, a fast and sensitive tool for mapping noisy long reads to a reference genome. Mapping sequenced reads to their potential genomic origin is the first fundamental step for many computational biology tasks. As an example, in this thesis, we show that lordFAST can be successfully employed in structural variation discovery. Next, we present the second tool, CoLoRMap, which tackles the high level of base-level errors in SMS long reads by providing a means to correct them using a complementary set of NGS short reads.
This integrative use of SMS and NGS data is known as a hybrid technique. Finally, we introduce HASLR, an ultra-fast hybrid assembler that uses reads generated by both technologies to efficiently generate accurate genome assemblies. We demonstrate that, compared to the other tested assemblers, HASLR is not only the fastest but also the one with the lowest number of misassemblies on all samples. Furthermore, in terms of contiguity and accuracy, the generated assemblies are on par with those of the other tools on most samples.
Keywords: Computational biology; Single-molecule sequencing; PacBio; Oxford Nanopore; Long read mapping; Hybrid error correction; Hybrid assembly
Dedication
To my family, with love
Acknowledgements
First and foremost, I would like to express my sincerest gratitude to my supervisors, Dr. Cenk Sahinalp, Dr. Cedric Chauve, Dr. Faraz Hach, and Dr. Binay Bhattacharya, for their constant support, guidance, and patience throughout my PhD studies. I was honored to have the opportunity to work with such brilliant scholars, from whom I learned critical thinking and the proper way of doing research. In addition, I would like to give my regards and appreciation to Dr. Jens Stoye, my host and supervisor during my visit at Bielefeld University. This visit greatly influenced the direction of my work on hybrid assembly. I would also like to thank Dr. Mihai Pop and Dr. Martin Ester, my external and internal examiners, for their careful review of my thesis. I appreciate their invaluable discussions, comments, and suggestions, which helped me improve the thesis. I want to give special thanks to Dr. Diana Cukierman, who graciously accepted to be the chair of my examining committee. During my PhD, I was also involved in a few research projects that are not included in this thesis. I had wonderful and valuable experiences in these collaborative projects. Regarding these collaborations, I would like to thank Dr. Salem Malikic, Michael Ford, Hossein Asghari, Sean La, and Farid Rashidi Mehrabadi. I also offer enduring gratitude to all past and present lab members in the Lab for Computational Biology and Bioinformatics at Simon Fraser University, as well as the Hach Lab at the Vancouver Prostate Centre. In particular, I thank Dr. Yen-Yi Lin, Iman Sarrafi, Dr. Ibrahim Numanagic, Ermin Hodzic, Can Kockan, Dr. Raunak Shrestha, Baraa Orabi, Tunc Morova, and Fatih Karaoglanoglu, who all made the work environment a more pleasant one. My special thanks go to Baraa Orabi and Elie Ritch for their help with proofreading the thesis.
In addition, I would like to thank all members of the Genome Informatics research group at Bielefeld University in Germany, including Omar Castillo, Konstantinos Tzanakis, Eyla Willing, Georges Hattab, Tizian Schulz, Guillaume Holley, Nina Luhmann, Liren Huang, Lu Zhu, Markus Lux, Linda Sundermann, and Tina Zekic, who made my visit to Bielefeld University such a great experience. I am grateful to many other friends I made in Vancouver, including Abdollah Safari, Sina Bahrasemani, Sajjad Gholami, Mehran Khodabandeh, Hedayat Zarkoob, Mohsen Jamali, Soheil Horr, Shahram Zaheri, Amirmasoud Ghasemi, Abraham Hashemian, Hashem Jeihooni, Shahram Pourazadi, Mohammad Mahdavian, Majid Talebi, Saeed
Mirazimi, Hassan Shavarani, Mahdi Nemati Mehr, Sima Jamali, Nazanin Mehrasa, Ramtin Mehdizadeh, Sina Salari, Ali Afsah, Chakaveh Ahmadizadeh, Mahsa Gharibi, Mohammad Akbari, Akbar Rafiey, Saeed Izadi, Saeid Asgari, Hossein Sharifi-Noghabi, Sepehr MohaimenianPour, Sara Daneshvar, Hooman Zabeti, Sara Jalili, Mohammad Mazraeh, Marjan Moodi, and many more. All these amazing people made Vancouver a true home. Last but not least, I would like to thank my loving family for all their support during these years. An exceptional thanks goes to my wonderful wife, Rana, who definitely made a significant contribution to this thesis with her continuous support and patience.
Table of Contents
Approval ii
Abstract iii
Dedication v
Acknowledgements vi
Table of Contents viii
List of Tables xi
List of Figures xiv
1 Introduction 1
1.1 Contributions ...... 5
1.2 Organization of the thesis ...... 7
2 Background and Related Work 8
2.1 Single-molecule sequencing technologies ...... 8
2.1.1 Pacific Biosciences ...... 8
2.1.2 Oxford Nanopore Technology ...... 9
2.1.3 Synthetic long reads ...... 11
2.2 Definitions and Notations ...... 12
2.3 Long Read Mapping ...... 13
2.4 Error correction of long noisy reads ...... 16
2.4.1 Hybrid correction ...... 16
2.4.2 Self-correction ...... 18
2.5 de novo genome assembly ...... 20
2.5.1 Hybrid assembly ...... 21
2.5.2 Non-hybrid assembly ...... 23
2.5.3 wtdbg2 ...... 26
3 Long read mapping 27
3.1 Methods ...... 29
3.1.1 Overview ...... 29
3.1.2 Stage One: Reference Genome Indexing ...... 29
3.1.3 Stage Two: Read Mapping ...... 30
3.2 Results ...... 33
3.2.1 Experiment on a simulated dataset without structural variations ...... 33
3.2.2 Simulation in presence of structural variations ...... 36
3.2.3 Experiment on a real dataset ...... 39
3.3 Summary ...... 41
4 Hybrid error correction of long reads 43
4.1 Methods ...... 44
4.1.1 Overview ...... 44
4.1.2 Initial correction of long reads: the SP algorithm ...... 45
4.1.3 Correcting gaps using One-End Anchors ...... 47
4.2 Results ...... 50
4.2.1 Data and computational setting ...... 50
4.2.2 Measures of evaluation ...... 51
4.2.3 Comparison based on alignment ...... 52
4.2.4 Comparison based on assembly ...... 52
4.3 Comparison with more recent hybrid correction tools ...... 60
4.4 Summary ...... 60
5 Hybrid assembly of long reads 68
5.1 Methods ...... 70
5.1.1 Obtaining unique short read contigs ...... 70
5.1.2 Construction of backbone graph ...... 71
5.1.3 Graph cleaning and simplification ...... 74
5.1.4 Generating the assembly ...... 76
5.1.5 Methodological remarks ...... 77
5.2 Results ...... 79
5.2.1 Experiment on simulated dataset ...... 79
5.2.2 Experiment on real dataset ...... 82
5.3 Summary ...... 85
6 Conclusion 86
6.1 Future directions ...... 87
6.2 Recommended guidelines ...... 88
Bibliography 90
Appendix A lordFAST Material 105
A.1 Data ...... 105
A.1.1 Real data ...... 105
A.1.2 Synthetic data ...... 105
A.2 Software ...... 107
A.3 Command details ...... 108
Appendix B CoLoRMap Material 111
B.1 Data ...... 111
Appendix C HASLR Material 112
C.1 Data ...... 112
C.1.1 Simulated data ...... 112
C.1.2 Real data ...... 113
C.2 Software ...... 114
C.3 Command details ...... 115
C.4 Visual examples of regions assembled only by HASLR without any misassembly or fragmentation ...... 117
List of Tables
Table 3.1 Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface. ...... 35
Table 3.2 Runtime and memory usage of the tools compared in Table 3.1. ...... 36
Table 3.3 Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface. ...... 36
Table 3.4 Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface. ...... 37
Table 3.5 Structural variations called by Sniffles based on mappings from different tools. ...... 39
Table 3.6 Evaluation of the performance of various long read mappers on a real human dataset. This dataset includes 23,155 reads and 178.45 million bases. ...... 40
Table 3.7 Agreement of different methods in reporting alignments. ...... 41
Table 3.8 The performance of different methods on reads for which their alignments do not agree. ...... 41
Table 4.1 Runtime of different correction methods for E. coli dataset. ...... 50
Table 4.2 Quality of corrected long reads for E. coli, yeast, and fruit fly datasets obtained with different methods. Assessment is based on alignments of long reads to the reference genome obtained with BLASR. ...... 53
Table 4.3 Quality of corrected long reads for E. coli and yeast datasets obtained with different methods. Assessment is based on alignments of long reads to the reference genome obtained with BWA-MEM. ...... 54
Table 4.4 Statistics of corrected and un-corrected regions after correction with different methods. ...... 55
Table 4.5 Quality of Canu assemblies for E. coli data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted. ...... 56
Table 4.6 Quality of Canu assemblies for yeast data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted. ...... 57
Table 4.7 Quality of Canu assemblies for D. melanogaster data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted. ...... 58
Table 4.8 The effect of chunking on correction quality for CoLoRMap. CoLoRMap-w represents running our software on the whole long read set without chunking. ...... 59
Table 4.9 Comparison between hybrid error correction tools on E. coli PacBio dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 62
Table 4.10 Comparison between hybrid error correction tools on E. coli Oxford Nanopore dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 63
Table 4.11 Comparison between hybrid error correction tools on yeast PacBio dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 64
Table 4.12 Comparison between hybrid error correction tools on yeast Oxford Nanopore dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 65
Table 4.13 Comparison between hybrid error correction tools on fruit fly PacBio dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 66
Table 4.14 Comparison between hybrid error correction tools on fruit fly Oxford Nanopore dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 67
Table 5.1 Comparison between draft assemblies obtained by different tools on simulated data. ...... 81
Table 5.2 Statistics of real long read datasets. ...... 82
Table 5.3 Comparison between assemblies obtained by different tools on real data. ...... 83
Table 5.4 Gene completeness analysis. ...... 84
Table 5.5 Effect of polishing assemblies on the small assembly errors of real datasets. ...... 84
Table A.1 Version, reference, and repository of utilized software. ...... 107
Table B.1 Availability and statistics of real datasets. ...... 111
Table C.1 Availability of real long read datasets. ...... 113
Table C.2 Version, reference, and repository of utilized software. ...... 114
List of Figures
Figure 2.1 PacBio sequencing. Reprinted by permission from Springer Nature: Nature Reviews Genetics (Coming of age: ten years of next-generation sequencing technologies, Sara Goodwin et al.), copyright (2016) [54]. ...... 9
Figure 2.2 Oxford Nanopore Technology sequencing. Reprinted by permission from Springer Nature: Nature Reviews Genetics (Coming of age: ten years of next-generation sequencing technologies, Sara Goodwin et al.), copyright (2016) [54]. ...... 10
Figure 3.1 The speed-up of lordFAST's combined index for searching exact matches in a real human dataset compared to the original FM-index. This amounts to a 29% speed-up for finding all anchors in the first step. Note that this combined index uses only 0.25 GB more memory. ...... 30
Figure 3.2 (a) The implicit windows considered on the reference genome for the candidate selection step. If the read length is ℓ, then the windows are of size 2ℓ and overlap by ℓ bases. (b) An example of the candidate selection step. Each dot represents an anchor, and its size represents the weight of the anchor. In this example, f = 2, and since the maximum window score is 11, every window with a score ≤ 5.5 will be ignored. In addition, the window with score 6 is not kept since it overlaps a window with score 7. Also, only one of the windows with score 11 will be in the final list of candidates since the other window overlaps it. ...... 31
Figure 3.3 The read length distribution of 72,708 real PacBio reads from a human genome (CHM1) dataset. The vertical axis shows the number of bases in each bin rather than the number of reads. At least 99% of the bases are in reads longer than 1000 bases. ...... 34
Figure 3.4 Read mappings are sorted by mapping quality in descending order. Then, for each mapping quality threshold, the fraction of mapped reads with mapping quality above the threshold (out of the total number of reads) and the fraction of incorrectly mapped reads (out of the number of mapped reads) are plotted along the curve. ...... 37
Figure 3.5 Examples of covering and non-covering alignments. Suppose x, y, z1, z2, z3, and z4 are different alignments of the same read. In this figure, alignments x and y cover each other as they span subsequences on the reference genome that have at least 90% overlap. The alignments x and y cover alignments z1 and z2 but not alignments z3 and z4. On the other hand, alignments z1, z2, z3, and z4 do not cover either alignment x or y. ...... 40
Figure 3.6 Run-time comparison of different methods for mapping 23,155 real human reads using different numbers of threads. Note that the y-axis is in logarithmic scale. ...... 42
Figure 3.7 Memory comparison of different methods for mapping 23,155 real human reads using different numbers of threads. Note that the y-axis is in logarithmic scale. ...... 42
Figure 4.1 (a) The notion of overlap for mappings. For two overlapping mappings mi and mj, the weight of the corresponding edge is set to the edit distance between the suffix of mj.seq and its aligned region in L (marked in red in this figure). (b) Reconstruction of the corrected sequence spelled from the shortest path. The spelled string can be easily obtained by concatenating mapping suffixes along the shortest path. ...... 46
Figure 4.2 An example of a gap (a region uncovered by short reads) on a long read, exported from the IGV software. There are so many sequencing errors that mapping short reads in this region is very challenging. In the region shown here, the maximum exact match between the long read and the reference genome is 4 bp long, in a region of size ≈ 150 bp. ...... 49
Figure 4.3 Detecting One-End Anchors (OEAs) for a gap (an un-corrected region). OEAs, shown in red, are unmapped or partially mapped reads whose mates, shown in blue, are mapped to corrected regions concordantly (with proper orientation and distance). The assembled contigs, shown in light green, are used to improve the quality of the gap region. ...... 50
Figure 5.1 Precision and recall results in the identification of unique short read contigs on 6 different reference genomes. Precision is shown with blue dots and recall with orange dots. Precision is consistently high across the different experiments, and in all experiments a large jump in recall occurs at a length threshold of 300. ...... 72
Figure 5.2 Possible orientations of aligning two unique contigs to a long read. The direction of the contigs aligned to a long read indicates the strand of their corresponding sequence. These directions guide us in choosing the proper edge type. The set of long reads supporting each edge is shown as its label. ...... 73
Figure 5.3 Examples of a tip and bubbles in the backbone graph. Here the backbone graph is visualized using Bandage [171]. ...... 74
Figure 5.4 Example of an edge in the backbone graph and its corresponding long read alignments. Partial Order Alignment (POA) is used in constructing the consensus sequence (see subsection 5.1.5). ...... 77
Figure 5.5 Two backbone graphs built from a real PacBio dataset sequenced from a yeast genome. Each graph is visualized with Bandage [171] and colored using its rainbow coloring feature. Each chromosome is colored with a full rainbow spectrum. (Left) Tangled graph built from all short read contigs. (Right) Untangled graph built from unique short read contigs. ...... 78
Figure C.1 An example showing a region of chromosome 4 of C. elegans. ...... 118
Figure C.2 An example showing a region of chromosome X of C. elegans. ...... 119
Figure C.3 An example showing a region of chromosome X of hg38. ...... 120
Figure C.4 An example showing a region of chromosome 18 of hg38. ...... 121
Figure C.5 An example showing a region of chromosome 16 of hg38. ...... 122
Figure C.6 An example showing a region of chromosome 15 of hg38. ...... 123
Figure C.7 An example showing a region of chromosome 14 of hg38. ...... 124
Figure C.8 An example showing a region of chromosome 13 of hg38. ...... 125
Figure C.9 An example showing a region of chromosome 11 of hg38. ...... 126
Figure C.10 An example showing a region of chromosome 9 of hg38. ...... 127
Chapter 1
Introduction
The first draft of the human genome sequence was published in 2001 [88, 160], and an updated version was announced in 2003, concluding the 13-year effort known as the Human Genome Project. Completion of this high-quality version of the human genome sequence cost $2.7 billion for the International Human Genome Sequencing Consortium (IHGSC) [132] and approximately $300 million for the Celera Corporation. The high price tag was mainly a result of using Sanger sequencing [143] throughout the project, which has a high cost and low throughput [78]. Since then, remarkable progress has been made in the genome sequencing industry. The introduction of next-generation sequencing (NGS) technologies – namely 454 [114], Illumina [10], and SOLiD [115] – changed the landscape of genomics research. These technologies provide significantly higher throughput while requiring less DNA input material [137, 38], enabling researchers to generate orders of magnitude more data at a fraction of the cost. With the continuous evolution of NGS sequencing machines, the cost of sequencing a human genome at 30× coverage has now dropped to $1000 [67], with some companies pushing the price tag even lower [31]. The massive drop in sequencing cost has made NGS machines a ubiquitous staple of both large research centers and individual labs [137]. A natural outcome was the sequencing and de novo assembly of more organisms with small [26] or large genomes [144, 50] using new assemblers specifically designed for NGS data [176, 150, 109, 149, 120]. Another application was the resequencing of organisms to study their genomic diversity. In particular, sequencing more individual human genomes made the outlook for the application of precision medicine more promising.
The idea of precision medicine – also known as personalized medicine – is to tailor therapy with the best possible response for each patient according to their own molecular characteristics [22, 28]. The first step towards this goal is characterizing the genetic variations in the genome of each individual compared to a normal genome, such as the reference genome. The importance of this problem has led to establishing multiple projects such as the 1000 Genomes Project [29, 30] and The Cancer Genome Atlas (TCGA) project [167]. Genetic variants appear in different sizes and
types: single nucleotide variations (SNVs), short insertions/deletions (indels), and structural variations (SVs), which are extensive structural rearrangements larger than 50 bp [2]. SVs are known to affect the human genome the most among all types of genetic variations [168] and to be more closely associated with susceptibility to many common and rare genetic diseases such as autism, schizophrenia, diabetes, and different types of cancer [168, 110, 155], as well as resistance to related therapies. Over the years, many tools have been developed for detecting SVs using NGS data. These tools can be classified into two categories: mapping-based and assembly-based approaches. In the first approach, NGS reads are mapped to a reference genome, and abnormal signals are used to identify SVs. These signals may include split-read alignments [175, 75], unexpected read depths [1, 165], or unexpected read pairing [21, 136]. The other approach is to perform either whole-genome assembly [68, 20] or local assembly around the SV breakpoints [102, 76], followed by comparing the assembled contigs against the reference genome to identify SVs. For a comprehensive survey of SV discovery methods using NGS reads, see the reviews by Alkan et al. [2] and Guan et al. [55]. Aside from de novo assembly and variation discovery, NGS technologies have proven their ability in many other applications such as whole-exome sequencing [27], transcriptome characterization [125, 36], finding disease-causing mutations [16, 173, 130], cancer analysis [65, 90], and the study of cancer evolution [49]. However, computational analysis of the data generated by NGS technologies is still far from perfect. This is mainly due to their short read length relative to the lengths of common repeat sequences [2, 63], which limits their use in resolving and analyzing repetitive regions.
In fact, the low cost offered by NGS technologies comes at the expense of much shorter reads (often ~150 bp for NGS compared to ~700 bp for Sanger), posing challenges to many downstream analysis algorithms. In particular, genomes assembled solely from NGS short reads are considered draft assemblies, since it is not possible to bridge many of the contigs unambiguously, resulting in fragmented genome assemblies. Accordingly, such assemblies may contain many gaps that can cause missing genes; in other words, they are not finished [144, 100, 83]. Although these fragmented assemblies could be bridged using paired-end information via scaffolding techniques, such aggressive strategies often result in misassemblies that can complicate later analysis. On the other hand, projects aiming to analyze the diversity of known organisms have been fundamentally limited, as SV discovery approaches using NGS short reads miss many SVs, especially larger ones [19, 83]. Within complex regions, even the detection of short events is often challenging, since short reads cannot be mapped with certainty. The advent of long read sequencing technologies has come with the promise of overcoming these limitations of NGS technologies. The most popular types of long reads are those generated by single-molecule sequencing (SMS) technologies. SMS technologies
are capable of generating reads that are orders of magnitude longer than NGS short reads. The greater length of these reads enables them to span many of the repetitive regions of the genome, which dramatically improves downstream genomic analysis, especially when focusing on large-scale structures [139]. However, the exceptional length of reads generated by these technologies comes at the expense of much higher error rates compared to Sanger or Illumina sequencing. Nevertheless, the development of new tools specifically designed to handle this high error rate has enabled their use in many different projects to date. Among all long read sequencing technologies, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies are the most prominent and widely used. PacBio, first introduced in 2009 by Eid et al. [41], is the pioneer in delivering SMS long reads. The first PacBio sequencer, PacBio RS, was commercially released in 2010, and its second version was introduced in 2013. The length and quality of reads generated by PacBio sequencers have improved over the years with updates to their chemistry. Currently, PacBio can generate reads longer than 80 kbp with an average length of 10-15 kbp [158]. Although the error rate of these reads is 10-15%, it has been shown that the base-level accuracy of the consensus sequence generated from them can exceed 99.99%, given sufficient coverage [93]. The study by Berlin et al. [11] suggested that approximately 50× coverage of PacBio reads is needed to error correct and assemble them without the help of other technologies. Furthermore, the unique design of the PacBio template (using two single-stranded hairpin loops) enables sequencing the same DNA template multiple times. Having the same molecule sequenced multiple times allows for computationally calculating the consensus sequence of that single molecule and generating long reads with up to ~99% accuracy.
These long reads are called circular consensus sequence (CCS) PacBio reads. Oxford Nanopore Technologies is the biggest competitor of PacBio in the long read technology market. The idea of using nanopores for DNA sequencing is not new and has been discussed or demonstrated since 1996 [94]. However, Oxford Nanopore was the first company to commercialize this idea by releasing its first sequencer, MinION, in 2014. While early Nanopore reads suffered from an error rate of about 30% [77], thanks to the rapid pace of developments both in chemistry and in base-calling software, their current error rate is 10-15% [170]. Similar to PacBio reads, a high consensus quality has been reported for Nanopore reads, with base-level accuracy of >99.95% [108]. In terms of length, Oxford Nanopore has undoubtedly broken the barrier, with essentially no theoretical upper limit on read length [158]. In practice, researchers have successfully generated ultra-long reads with N50 >100 kbp, with some of the longest reads reaching about 1 Mbp [70, 133]. Using a graphical viewer for visualizing the raw signal trace of Nanopore reads together with their mappings to a reference genome, Payne et al. [133] identified a molecule of length >2 Mbp being sequenced, which was broken into 11 consecutive subreads due to weak signal and/or software limitations. Aside from the characteristics of Nanopore long reads, the sequencer
itself has exceptional, game-changing features. MinION is the cheapest sequencing machine, with an instrument price of ~$1000. Additionally, it is a handheld portable device that can operate by connecting to a laptop via a USB cable. This unique characteristic has enabled MinION to be used in remote locations, including Antarctica [40] and the International Space Station [17, 37]. The power of long SMS reads in resolving repeat regions immediately enabled their success in de novo assembly and SV discovery, applications whose progress with NGS data had been limited. However, SMS long reads could not be directly passed to the available de novo assemblers (e.g., Celera [119]), which expect much higher base-level accuracy. As a result, error correction methods for increasing the base-level accuracy of long reads started to draw the attention of researchers. Due to the very high error rate of early long reads, they often had to be coupled with NGS data [84, 4]. Such integrative uses of SMS and NGS reads are referred to as hybrid techniques. Later, with improvements in read accuracy and the development of overlapping algorithms specifically tailored for correcting erroneous long reads [124, 11], non-hybrid correction and assembly became possible [24]. To date, SMS long reads have been utilized for the de novo assembly of many genomes, including microbial [82, 108], mammalian [11, 134, 147], and plant genomes [181, 145]. With regard to SV detection using long reads, similar to NGS-based methods, SMS-based methods can be classified into mapping-based and assembly-based approaches. The first step in both approaches is mapping long reads to the reference genome. However, due to the lengthy and error-prone nature of SMS reads, the mapping algorithms designed for NGS data either fail or are too slow. To address this issue, a few mapping algorithms were proposed [124, 146, 98] that are specifically designed for noisy long reads.
Once long reads are aligned to the reference genome, SVs can be identified either directly from the mapping signal (for example, using Sniffles [146] or Picky [51]) or via local assembly of long reads showing an SV signal (similar to the approaches used by Chaisson et al. [19] and Fan et al. [44]). SMS technologies have also been used in many other applications, such as gap filling of assembled genomes [42] including the human GRCh37 reference genome [19], genome finishing [9, 14], haplotype phasing [39], detection of DNA base modifications [47, 151], sequencing and genotyping medically relevant regions of the genome such as the human leukocyte antigen (HLA) and immunoglobulin heavy (IGH) genes [64, 48], novel isoform identification [103, 34], and gene fusion detection [126, 72]. We are experiencing an exciting era in which DNA sequencing is on the cusp of a new revolution. The significant increase in the throughput of SMS technologies, especially with the introduction of the Sequel by PacBio and the PromethION by Oxford Nanopore, has promised a reduction in their cost. On the other hand, the desire to generate longer reads is a persistent goal. Ultra-long reads generated by Oxford Nanopore have enabled the first full telomere-to-telomere assembly of a human chromosome [117]. PacBio
sequencers can now generate much longer CCS reads than before, which can improve the quality of assembled genomes [161]. PacBio Iso-Seq is capable of sequencing full-length transcripts with high accuracy. The small size of the MinION sequencer, together with its easy library preparation, has enabled its success in remote locations, for example, in the surveillance of the Ebola outbreak in West Africa [135]. These are only a few examples of the tremendous potential that SMS technologies offer.
1.1 Contributions
Many of the achievements of SMS technologies mentioned above would not have been possible without the development of new computational tools specifically tailored for error-prone long reads. This thesis is focused on computational algorithms for the three previously mentioned long read analysis problems: (i) long read mapping, (ii) hybrid error correction, and (iii) hybrid assembly of long reads. Tools that address these problems can be used as the first step in many downstream analysis tasks involving long sequencing reads. An example of a task that can take advantage of these tools is SV discovery using long reads. As a first approach, one could map all long reads to the reference genome and employ a mapping-based SV detection tool (e.g., Sniffles [146]). As another approach, long reads showing the signals of SVs could be error corrected using short reads and then utilized to identify SVs more accurately with base-level resolution. Alternatively, such long reads can be used to locally assemble the region containing one or more SVs to generate contigs that can be compared against the reference genome. A completely different approach is to perform whole genome assembly of long reads and compare the assembled contigs against the reference genome to identify SVs. In particular, the following contributions regarding the analysis of SMS long reads are presented in this thesis:
• We introduce lordFAST [62], an efficient mapping tool that is specifically designed for noisy SMS long reads. lordFAST is a sensitive aligner that can tolerate the high sequencing error rates observed in SMS long reads through its use of variable-length short exact matches. In addition, it is among the fastest long read mappers due to its sparse anchor extraction strategy, which significantly speeds up the chaining of exact matches. Our experiments on simulated data demonstrate the superiority of lordFAST in terms of sensitivity and precision compared to previously published alternatives. lordFAST also provides both clipped and split alignments of the long reads, which makes it appropriate for aligning reads originating from regions with large SVs. This enables simpler downstream analysis of its alignments for the task of SV discovery. lordFAST is an open source tool available at https://github.com/vpc-ccg/lordfast as well as on Bioconda.
• We present CoLoRMap [61], a hybrid method for correcting noisy long reads using high-quality Illumina paired-end short reads mapped onto the long reads. CoLoRMap achieves this using two novel ideas: (i) using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and (ii) extending corrected regions via local assembly of unmapped mates of mapped short reads (also known as one-end anchors). We applied CoLoRMap to three real datasets and compared its performance with previously published hybrid error correction tools. We observed that the accuracy of reads corrected by CoLoRMap is on par with that of other tools, while more of its corrected long reads align to the reference genome, both in terms of the number of corrected reads that align and the total size of the aligned regions. We also demonstrated that the assemblies generated by the Canu assembler [85] from long reads corrected by CoLoRMap are of slightly better quality. The source code of CoLoRMap is available at https://github.com/sfu-compbio/colormap.
• We introduce HASLR [60], an ultra-fast hybrid assembler for SMS long reads. It requires both NGS and SMS reads from the same sample and is capable of assembling both small and large genomes. HASLR first generates short contigs by assembling the NGS short reads using Minia [23] and then builds a graph structure, called the backbone graph, that is used to connect these short contigs. Finally, it generates long contigs by filling the gaps between the connected short contigs using consensus sequences obtained from multiple sequence alignment based on the partial order alignment technique. Our experiments show that HASLR is not only the fastest among all tested assemblers, but also the one with the lowest number of misassemblies on all tested samples. Furthermore, its assemblies are on par with those of the other tools in terms of contiguity and accuracy on most of the samples. HASLR is an open source assembler available at https://github.com/vpc-ccg/haslr as well as on Bioconda.
Addressing the above-mentioned problems can have a great impact as long reads become more widely used due to continuous improvements in the throughput of SMS technologies, which in turn reduce their cost. Considering the current competition between PacBio and Oxford Nanopore, this trend is expected to continue at least for a while. In addition to the primary contributions listed above, our collaboration in a number of projects has led to other contributions to the field of computational biology, which can be found in [86], [120], [111], [48], and [5].
1.2 Organization of the thesis
The rest of the thesis is organized as follows. Chapter 2 provides the relevant background, including details of single-molecule sequencing (SMS) technologies, as well as the problem descriptions and the related state-of-the-art work for each of these problems. In Chapter 3, we introduce lordFAST, our fast and sensitive long read mapper, alongside its experimental results. Chapter 4 presents CoLoRMap, our hybrid error correction tool. Note that CoLoRMap was published in 2016, after which a number of other hybrid correction tools have been developed. Therefore, we provide results taken from a comparative study by Zhang et al. [177] to show where it currently stands in the field of hybrid error correction. In Chapter 5, we introduce HASLR, our ultra-fast hybrid assembler, and show how it compares against currently available tools. We also demonstrate how its speed scales to genomes as large as the human genome. Finally, we conclude the thesis in Chapter 6 by providing a summary of our contributions to the field, as well as a discussion of possible directions for future work.
Chapter 2
Background and Related Work
2.1 Single-molecule sequencing technologies
The major difference between single-molecule sequencing (SMS) and next-generation sequencing (NGS) technologies is that, unlike NGS technologies, which rely on clonal amplification of DNA fragments on a solid surface, SMS technologies sequence native DNA from a single DNA molecule [54]. This prevents amplification biases in SMS reads. In this section, we briefly explain how these technologies sequence a DNA fragment.
2.1.1 Pacific Biosciences
The PacBio RS II sequencer, PacBio's first commercially available sequencing instrument, is fed with a so-called SMRTbell template [157]. This template consists of a double-stranded DNA fragment (the insert of interest) ligated to a single-stranded hairpin loop at either end (see Figure 2.1). Each hairpin loop is the complementary sequence of a sequencing primer, which provides a site for primer binding. The instrument utilizes specialized flowcells containing many thousands of picoliter-sized wells with a transparent bottom, called zero-mode waveguides (ZMWs). The sequencing is carried out in the ZMW using a DNA polymerase fixed to the bottom of the well (see Figure 2.1). This allows the DNA strand from the SMRTbell template to progress through the polymerase. The polymerase replicates the DNA template by incorporating phosphate-labeled nucleotides, each of which emits colored light. A camera records the color of the emitted light over time as the sequencing signal (as depicted at the bottom of Figure 2.1). The unique circular design of the SMRTbell template enables it to pass through the polymerase continuously while the polymerase remains active. Therefore, each template can be sequenced multiple times, depending on its length. In general, shorter DNA templates have a higher chance of being sequenced repeatedly, while it is difficult for templates longer than ∼3 Kbp to get sequenced multiple times. The beauty of this technique is that sequencing the same template multiple times allows for obtaining a consensus sequence of higher quality, since the errors are distributed randomly. If at least two full passes of the template are available, the obtained consensus is called a circular consensus sequence (CCS). Otherwise, each incomplete pass of the template generates a sequence called a subread. Similar to CCS, all subreads of the same template can be used to generate a consensus called a read of insert. Although a high error rate (∼15%) is evident for reads obtained from single-pass sequencing, the consensus reads can have accuracy >99% depending on the number of passes through the template.
Figure 2.1: PacBio sequencing. Reprinted by permission from Springer Nature: Nature Reviews Genetics (Coming of age: ten years of next-generation sequencing technologies, Sara Goodwin et al.), copyright (2016) [54].
2.1.2 Oxford Nanopore Technology
In principle, nanopore-based sequencing relies on the passage of DNA through a tiny channel called a nanopore. Oxford Nanopore Technology is leading the development and commercialization of this approach. The Oxford Nanopore MinION instrument uses flow cells with hundreds of embedded nanometer-sized protein pores (the nanopores).
Figure 2.2: Oxford Nanopore Technology sequencing. Reprinted by permission from Springer Nature: Nature Reviews Genetics (Coming of age: ten years of next-generation sequencing technologies, Sara Goodwin et al.), copyright (2016) [54].
The DNA fragment is ligated with two adapters (see Figure 2.2). At one end, the adapter is a hairpin sequence that allows the two strands of the DNA fragment to pass through consecutively. The other adapter is bound to a protein enzyme that interacts with the nanopore to direct single-stranded DNA through it. The DNA strand starts passing through the nanopore when a voltage is applied across it, causing a flow of electric current. While traversing the nanopore, the DNA strand partially blocks the electric current passing through it. In principle, the flow of current can be considered a function of the specific k-mer present in the nanopore. In other words, rather than four possible signals, there are more than 1,000 possible signals, one for each specific k-mer. The changes in the flow of current are recorded over time, and this is the observable output of the system. The output of the sequencer thus needs post-processing to decode the electrical signal data into a DNA sequence.
This post-processing step is called base-calling. To do this, the sequence of electric current measurements is segmented into “events” based on the changes in the current. Ideally, two consecutive events correspond to a shift of the DNA context by a single base. However, in practice, base-calling is not trivial for two reasons: (i) the rate of sampling the electric current is constant while the process of traversing the single-stranded DNA through the nanopore is stochastic, and (ii) the segmentation process is noisy. The consequence is the presence of “stays”, when two consecutive events correspond to the same DNA context, and “skips”, when two consecutive events correspond to DNA contexts that differ by more than one base. The official base-calling software provided by Oxford Nanopore is Metrichor. Metrichor was first available only via a cloud computing platform, meaning that private analysis of Oxford Nanopore data was not possible. This motivated the development of free open-source tools for base-calling, namely Nanocall [32] and DeepNano [13], which achieved comparable results. Still, Metrichor remains the most reliable base-calling tool for Oxford Nanopore data. Later, Oxford Nanopore released the code of Metrichor, which enables base-calling on a standalone machine. In this work, we do not cover methods for base-calling, as it seems a less pressing problem, especially after the release of Oxford Nanopore's official base-caller. Similar to PacBio sequencing, the use of a hairpin adapter allows the sequencing of both strands of the DNA fragment. These two “1D” reads can be aligned together to generate a consensus read of higher quality known as a “2D” read. This reduces the error rate from ∼30% in 1D reads to ∼15% in 2D reads.
2.1.3 Synthetic long reads
Synthetic long read technologies do not generate actual long reads of the native DNA fragment but instead utilize computational tools to reconstruct long-range information. These approaches rely on unique barcoding techniques during library preparation that allow later grouping of the reads from the same fragment and computational reconstruction of long-range information [54]. There are a few synthetic long read technologies, among which Illumina's TruSeq synthetic long read technology and 10x Genomics are popular.
TruSeq synthetic long read sequencing technology, formerly known as Illumina Moleculo, breaks genomic DNA into about 10 Kbp templates and partitions them into many wells such that about 3,000 fragments are in a single well. The templates in each well are amplified and sheared into ∼ 350 bp long fragments. DNA fragments from each well are barcoded and used for standard library preparation and sequencing on an Illumina HiSeq instrument. The barcodes in the sequenced reads are used to identify reads from the same well, which are then computationally assembled to generate synthetic long reads [162].
10x Genomics uses the same general idea of partitioning DNA fragments (up to 100 Kbp long), barcoding, amplification, fragmentation, and performing standard Illumina short-read sequencing [179]. However, there are two main differences compared to the TruSeq synthetic technology. First, DNA fragments are distributed into about 10^6 droplet partitions, which significantly reduces the number of DNA molecules in each partition. Increasing the number of partitions reduces the chance of having two fragments that cover the same locus in a single partition. This results in the potential for capturing more accurate phase information [79]. Second, unlike the TruSeq synthetic technology, 10x Genomics does not try to obtain full coverage of the long DNA fragment due to random priming and “nonprocessive” polymerase amplification [179]. This means assembling the long fragments is not possible. Instead, the reads are sorted according to their genomic position and grouped by their barcode as linked-reads. As mentioned before, synthetic long reads are not generated by sequencing the long native DNA. Although they are helpful for some applications such as haplotype phasing (in which the goal is to reconstruct the two haplotypes of a diploid genome), we do not explore their use in this thesis. The reason is that these technologies rely on computational methods for assembling/linking reads; in fact, they still suffer from the limitations of short read sequencing technologies in highly repetitive regions of complex genomes.
2.2 Definitions and Notations
Suppose Σ = {A, C, G, T} is the alphabet of nucleotides. A DNA sequence is defined as a string S = s_1 s_2 ... s_n over Σ. The length of S is denoted by |S| = n. We denote by S[i, j] = s_i s_{i+1} ... s_j a subsequence of S, by S[−, i] a prefix of S, and by S[i, −] a suffix of S. A k-mer of a sequence S is a subsequence of S of length k. The edit distance between two sequences S_1 and S_2 is the minimum number of edit operations (insertion, deletion, substitution) that transforms S_1 into S_2. A reference genome G of an organism is a long DNA sequence obtained from one or multiple individual genomes and represents the normal genome of that organism. A sequenced read R is a DNA sequence and a short subsequence of a donor genome D, which can be different from G.
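As a concrete illustration of the edit distance defined above, the following is a standard dynamic-programming sketch (not part of the thesis; the function name is ours):

```python
# Minimal sketch: classic dynamic-programming edit distance between two
# DNA sequences, using a rolling row to keep memory at O(|S2|).
def edit_distance(s1: str, s2: str) -> int:
    n, m = len(s1), len(s2)
    prev = list(range(m + 1))          # distances for the empty prefix of s1
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1  # substitution cost
            curr[j] = min(prev[j] + 1,        # deletion from s1
                          curr[j - 1] + 1,    # insertion into s1
                          prev[j - 1] + cost)  # match / substitution
        prev = curr
    return prev[m]

print(edit_distance("ACGT", "AGT"))  # 1 (one deletion)
```

For example, transforming ACGT into AGT requires a single deletion, so the distance is 1.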
Let R = {R_1, R_2, ..., R_|R|} be a set of reads sequenced from a donor genome D. The de Bruijn graph (DBG) of order k built on R is a directed graph DG_k. The nodes of DG_k are a (sub-)set of all possible k-mers, and there is a directed edge from node (k-mer) v to node (k-mer) u if (i) the suffix of length k − 1 of v is identical to the prefix of length k − 1 of u, i.e., v[2, −] = u[−, k − 1], and (ii) v and u are consecutive in some read of R. A string graph, on the other hand, is an overlap graph whose nodes are not fixed-length sequences but sequences of arbitrary length, with an edge between two nodes if their corresponding sequences overlap. DBGs and string graphs are two frameworks for assembling genomes. They generate DNA sequences called contigs that are ideally subsequences of the donor genome D. A hash table index, in its general form, keeps all the locations on the reference genome that exactly match every possible k-mer. This data structure provides lookup of fixed-size exact matches in constant time. A suffix array [112] of a sequence S is defined as the sorted array of all suffixes {S[i, −] | 1 ≤ i ≤ n} of S. As a result, the positions of all occurrences of a substring of S appear in a contiguous interval (consecutive rows) of the suffix array. This enables finding exact matches of arbitrary length using a suffix array index. A BWT-FM index [45], a compressed representation of the text, is another index capable of finding variable-length exact matches. Usually, the BWT-FM index is built from the suffix array of the text. Currently available mapping software usually uses either a hash table index or a suffix array/BWT-FM index for finding exact matches.
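The suffix-array interval property can be illustrated with a small sketch (naive O(n² log n) construction for clarity; production mappers use linear-time construction and compressed indexes, and the function names here are ours):

```python
# Sketch: a suffix array and interval lookup. All occurrences of a pattern
# form one contiguous interval of the suffix array, found by binary search.
from bisect import bisect_left, bisect_right

def build_suffix_array(s: str):
    # sorted starting positions of all suffixes of s
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s: str, sa, pattern: str):
    # compare each suffix only on its first len(pattern) characters
    prefixes = [s[i:i + len(pattern)] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

genome = "ACGTACGT"
sa = build_suffix_array(genome)
print(find_occurrences(genome, sa, "ACG"))  # [0, 4]
```

Here the pattern ACG occurs at positions 0 and 4, and the two binary searches delimit its interval in the sorted suffix order.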
2.3 Long Read Mapping
The very first step of most downstream analysis pipelines involves mapping the reads to a reference genome. Here we explore the mapping problem for long noisy reads. Some methods find and report only the approximate mapping location of one long read to another. This is of interest mainly for the de novo assembly task and, more specifically, for overlap detection between long reads. Examples of such methods are minimap [97] (which will be explained in Subsection 2.5.2) and mashmap [69]. The focus of this section, however, is on methods that generate detailed alignments of long reads to the reference genome. Among the existing published methods, BLASR, rHAT, and GraphMap use the same general strategy of first finding approximate candidate mapping regions on the reference genome, followed by a more detailed alignment of the read to the candidate regions. BWA-MEM, however, works by finding strong local alignments of exact matches and chaining these local alignments into a full alignment of the whole read. In this section, we briefly explain how each of these tools maps long noisy reads to the reference genome. We describe the process of aligning a single long read; in practice, the same process is applied to multiple reads in parallel using multiple CPU cores.
BLASR
BLASR [18] is the first tool specifically designed for PacBio reads. It starts by finding the longest common prefix (of minimum length 12 bp) between each suffix of the long read and the reference genome using a suffix array index, storing these matches as a set of anchors. The set of anchors is sorted by the reference coordinates of the anchors. These anchors are clustered into intervals roughly the length of the read. A global chaining of the anchors in each interval is computed, which is a maximal subset of co-linear anchors increasing in both
read and reference coordinates [127]. The top 10 chains with the highest scores are kept, each of which defines a candidate mapping interval. Next, an approximate alignment of the long read to the candidate interval is obtained. To do this, a set of short exact matches (of length 11 bp) is found between the read and the candidate interval. This set of matches is used for sparse dynamic programming (SDP) [43], which is in principle the same as chaining short exact matches. Finally, since the SDP alignment does not align all the bases, a detailed banded alignment is performed to align the remaining bases, with the SDP alignment serving as a guide.
rHAT
rHAT [107] is a hash-table based mapper designed for PacBio reads that uses a heuristic to estimate the approximate mapping location of each read. This is done by finding potential mapping regions for the middle 1 Kbp segment of the long read on the reference genome through a quick seed counting scheme. To do this, the reference genome is partitioned into windows of length 2 Kbp, overlapping neighboring windows by 1 Kbp. The overlap between neighboring windows is essential to ensure that the middle segment is fully contained in at least one window. A collision-free hash table is built to map each reference k-mer (by default, 13-mers) to the windows containing it. Then, for mapping the middle 1 Kbp segment of a read, the number of occurrences of its k-mers in each window is calculated, and the five windows with the highest k-mer counts are kept as candidate mapping windows for the segment. The candidate mapping regions for the whole read are therefore intervals of size ∼|R| flanking the candidate windows. Then, for each candidate mapping region, a more detailed mapping is obtained as follows. A lookup table is built indexing all the l-mers (by default, 11-mers), and seeds for the whole read are extracted from this lookup table.
rHAT then finds a local chain of these seeds via sparse dynamic programming, using a directed acyclic graph (DAG) as the skeleton of the alignment. Finally, the gaps between the seeds in the selected chain are filled using a banded Smith-Waterman algorithm.
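The co-linear chaining step shared by BLASR and rHAT can be sketched as a quadratic-time dynamic program over anchors (a simplified illustration under our own conventions, not either tool's actual implementation; real sparse-DP chaining is considerably faster):

```python
# Sketch of co-linear chaining: anchors are (read_pos, ref_pos, length)
# triples, and we seek a highest-scoring subset that increases in both the
# read and the reference coordinates (score = total anchored bases).
def chain_anchors(anchors):
    anchors = sorted(anchors)                 # by read_pos, then ref_pos
    n = len(anchors)
    score = [a[2] for a in anchors]           # best chain ending at anchor i
    back = [-1] * n
    for i in range(n):
        for j in range(i):
            # j may precede i only if both coordinates strictly advance
            if (anchors[j][0] + anchors[j][2] <= anchors[i][0] and
                    anchors[j][1] + anchors[j][2] <= anchors[i][1]):
                if score[j] + anchors[i][2] > score[i]:
                    score[i] = score[j] + anchors[i][2]
                    back[i] = j
    # backtrack from the best-scoring anchor
    i = max(range(n), key=lambda k: score[k])
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = back[i]
    return chain[::-1]
```

For instance, an off-diagonal anchor such as (5, 50, 5) is excluded from a chain built from (0, 0, 5), (10, 10, 5), and (20, 20, 5), since it does not advance co-linearly with them.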
GraphMap
GraphMap [154] is specifically designed for Oxford Nanopore (both 1D and 2D) reads. However, the authors claim that it can also successfully map PacBio reads with default parameters. In order to increase sensitivity, GraphMap uses gapped spaced seeds to account for the different types of sequencing errors (i.e., substitutions, insertions, and deletions) in the reads. Using gapped seeds is a strategy for increasing the sensitivity of inexact match lookup while remaining fast.
To find approximate candidate mapping regions, GraphMap partitions the reference genome into bins of length |R|/3. This choice of bin size guarantees that at least one bin is fully covered by the read. The number of seed hits is counted for each bin, and bins with counts > 75% of the maximum count are selected. A candidate region is then defined as an interval of length ∼|R| on the reference that extends the corresponding bin's start and end locations. Next, GraphMap builds an “alignment graph” from short k-mers of the target sequence. A set of anchors is obtained by finding an exact walk in the alignment graph following the sequence of the query. A chain of anchors is computed by solving an instance of the longest common subsequence (LCS) problem, and this chain is further refined to generate the final alignment.
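As a toy illustration of this bin-selection heuristic (function and parameter names are ours; GraphMap's actual implementation differs in detail):

```python
# Sketch of GraphMap-style region selection: count seed hits per reference
# bin of length |R|/3 and keep the bins whose count is at least 75% of the
# maximum count.
def candidate_bins(seed_hits, ref_len, read_len):
    bin_size = max(1, read_len // 3)
    counts = [0] * (ref_len // bin_size + 1)
    for ref_pos in seed_hits:            # each hit: a position on the reference
        counts[ref_pos // bin_size] += 1
    threshold = 0.75 * max(counts)
    return [b for b, c in enumerate(counts) if c > 0 and c >= threshold]

# two clusters of hits -> two candidate bins survive the 75% filter
hits = [0, 1, 2, 100, 101, 102, 103, 200]
print(candidate_bins(hits, ref_len=300, read_len=30))  # [0, 10]
```

The isolated hit at position 200 falls below the threshold and its bin is discarded, while the two dense clusters are kept as candidate regions.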
BWA-MEM
BWA-MEM [96] was originally designed to align short sequence reads as well as assembled contigs to a reference genome. Later, it was extended to map long SMS reads by tuning its alignment parameters (via the options -x pacbio for PacBio reads and -x ont2d for Oxford Nanopore 2D reads). BWA-MEM starts by finding the longest exact match covering each query position as a possible initial anchor. In BWA's terminology, these matches are called super-maximal exact matches (SMEMs). Long SMEMs may cause mismapping due to missing the correct anchors. To address this issue, each SMEM longer than 28 bp is considered for the extraction of additional shorter anchors. More specifically, the longest exact matches that cover the middle base of the SMEM and occur more frequently in the reference genome than the SMEM are added to the list of anchors. The obtained set of anchors is then grouped into local chains, and “contained” chains are filtered out. The anchors are ranked first by the length of the chain they belong to and then by anchor length. The anchors are considered for extension from the highest rank to the lowest, unless they are already contained in an alignment. The extension of anchors is done with banded affine-gap-penalty dynamic programming. BWA-MEM stops an extension if the best extension score deviates by more than a cutoff parameter from the score before the extension. This cutoff parameter automatically determines whether local alignments are merged into a longer alignment.
NGMLR
NGMLR [146] follows the seed-and-extend paradigm. For each long read, it first finds subsegments of the read that can be aligned to the reference genome with a single linear alignment of high similarity. It then computes a pairwise sequence alignment for each of these linearly mapped segments using the Smith-Waterman algorithm. Finally, it selects the set of linear alignments with the highest joint score.
The main advantage of NGMLR compared to other tools is its use of a convex gap-cost model during Smith-Waterman dynamic programming alignment, which accounts for sequencing errors and SVs at the same time. This gap-cost model enables NGMLR to better localize SV events and to align the long read around SV breakpoints more precisely.
minimap2
Minimap2 [98] is the successor of minimap [97], extending it to support mapping to a reference genome with base-level alignment. It starts by extracting minimizers from the reference and indexing them in a hash table. It then identifies minimizers on each long read and locates their occurrences in the index as seeds. Next, the seeds are sorted and chaining is performed to find all chains. By analyzing overlapping chains, it selects the single best chain (known as the primary chain) for each query segment. In the end, it performs base-level alignment for each chain using dynamic programming. Minimap2 not only maps SMS long reads onto a reference genome but is also capable of performing self-overlapping, splice-aware mapping, and short read mapping.
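The minimizer idea can be illustrated with a simplified sketch (using plain lexicographic order instead of the hash functions and strand canonicalization that minimap and minimap2 actually use; the function name is ours):

```python
# Sketch of minimizer extraction: the minimizer of a window of w consecutive
# k-mers is its smallest k-mer. Consecutive windows often share a minimizer,
# so far fewer seeds are kept than the total number of k-mers.
def minimizers(seq: str, k: int, w: int):
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: window[j])  # leftmost smallest k-mer
        selected.add((start + best, window[best]))
    return sorted(selected)

print(minimizers("ACGTACGT", k=3, w=2))
# [(0, 'ACG'), (1, 'CGT'), (2, 'GTA'), (4, 'ACG')]
```

Out of six 3-mers, only four (position, k-mer) seeds are retained; on realistic window sizes the reduction is much larger, which is what makes the index sparse and fast.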
2.4 Error correction of long noisy reads
As mentioned in Section 2.1, the observed error rate of long read sequencing technologies is high. This means that the reads they generate cannot be used directly in some applications, notably de novo assembly and structural variation discovery. Therefore, correcting these long reads is an active field of research. There are two different approaches for error correction of long noisy reads: (i) self-correction (non-hybrid) approaches, which correct long reads using only the long reads themselves, and (ii) hybrid correction approaches, which use complementary NGS short reads to correct long reads. In the following, we briefly describe the general ideas of existing methods for each approach.
2.4.1 Hybrid correction
Hybrid correction methods correct long reads by taking advantage of a set of high-quality short reads sequenced from the same sample using NGS technologies. This is done by aligning one set onto the other and correcting every long read using these alignments. In general, hybrid correction methods fall into two categories: (i) methods that map short reads to the long reads, and (ii) methods that align every long read to an assembly of the short reads. The latter category benefits from faster alignment, since the assembled data set is much smaller than the original set of short reads. In the following, these methods are explained in more detail.
PacBioToCA
PacBioToCA, used in the PBcR assembly pipeline [84], is a hybrid correction module that was implemented as part of the Celera Assembler [119]. In this pipeline, first, all short reads are mapped to all long reads. The requirement for computing an alignment is the observation of shared seed sequences of length 14 bp. Moreover, only alignments that span the full length of a short read are considered. Mappings to multiple locations (repeats) are handled by choosing the highest-identity mapping. After obtaining all alignments of short reads to long reads, each long read is corrected by generating a consensus sequence. If there are regions with zero or insufficient coverage of short read alignments, the consensus sequence is split to generate shorter high-quality sequences.
LSC
In general, LSC [6] follows the same strategy as PacBioToCA. The main difference is that both short and long reads are transformed into a compressed form using homopolymer compression, in which each run of identical nucleotides is replaced by a single nucleotide of that type. The authors claim that this transformation improves alignment sensitivity without sacrificing accuracy.
proovread
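Homopolymer compression itself is a one-line transformation; the following sketch illustrates it (not LSC's actual code, and the function name is ours):

```python
# Sketch of homopolymer compression: collapse each run of identical bases
# to a single base before alignment.
from itertools import groupby

def hp_compress(seq: str) -> str:
    return "".join(base for base, _ in groupby(seq))

print(hp_compress("AAACGGGTT"))  # "ACGT"
```

The transformation is lossy (run lengths are discarded), which is precisely why it hides the homopolymer-length errors that dominate some long-read error profiles.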
Similar to PacBioToCA and LSC, proovread [59] works by consensus calling from short read alignments. However, the correction process is performed iteratively using three subsets of short reads (sampled in such a way as to complement each other) with successively increasing sensitivity. The intuition behind this strategy is to reduce the search space in each cycle. More specifically, the first cycle maps only 20% of the short reads with low-sensitivity parameters (high speed). In the second cycle, the same process is repeated for another 30% of the short reads with higher sensitivity. The third cycle of correction maps the last 50% of the short reads while focusing only on the remaining error-enriched regions.
LoRDEC
LoRDEC [141] starts by building a de Bruijn graph (DBG) from the set of short reads (with a k-mer size of ∼19 bp). The general idea for correcting a long read is to thread it through the DBG and find a path in the DBG that minimizes the edit distance to the long read. In order to do this, LoRDEC scans the long read and identifies all so-called solid k-mers, i.e., those k-mers that also appear in the DBG. Assuming solid k-mers are error-free, the problem is to correct the regions of the long read between solid k-mers. To correct these inner regions, each solid k-mer is considered as a source and the five upstream solid k-mers (not from the same run of solid k-mers) as targets. For each pair of source and target, an exhaustive depth-first search is performed to find an optimal path (one
with minimal edit distance) between the source and target in the DBG. Such an optimal path spells a sequence that can be considered as the corrected sequence of the inner region. Once this is done for all source/target pairs, a path graph is built that records all the optimal paths found between solid k-mers. The path graph is a directed graph in which solid k-mers are the nodes and the optimal paths between solid k-mers are the edges. Finally, to obtain a sequence that minimizes the edit distance to the whole long read, Dijkstra's algorithm is employed to find the shortest path in the path graph between the leftmost and rightmost solid k-mers.
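The first step, identifying the solid k-mers of a long read, can be sketched as follows (a simplified illustration; here `solid` stands in for a membership query against the short-read DBG, and the function name is ours):

```python
# Sketch of LoRDEC's solid k-mer scan: report positions of the long read's
# k-mers that also appear in the short-read de Bruijn graph. The regions
# between consecutive solid k-mers are the ones submitted to the DBG path
# search for correction.
def solid_kmer_positions(long_read: str, solid: set, k: int):
    return [i for i in range(len(long_read) - k + 1)
            if long_read[i:i + k] in solid]

read = "ACGTTGCA"
solid = {"ACG", "GCA"}
print(solid_kmer_positions(read, solid, 3))  # [0, 5]
```

In this toy example, the region between positions 0 and 5 is the "inner region" whose correction would be delegated to the source-to-target path search in the DBG.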
Jabba
Jabba [116] further builds on LoRDEC's idea of correcting long reads based on alignments to a de Bruijn graph (DBG). The main difference compared to LoRDEC is that Jabba builds the DBG on longer k-mers (∼75 bp). The use of longer k-mers has two advantages: (i) the DBG is less complex because many short repeats are resolved, and (ii) long k-mers enable Jabba to use maximal exact matches (MEMs) for aligning long reads to the DBG. As a result, the alignment step in Jabba is expected to be faster than that of LoRDEC. To avoid using erroneous long k-mers, Jabba first corrects the set of short reads using Karect (with a small k-mer size of ∼13 bp) before building the DBG. The DBG is then built, and an index is constructed with essaMEM on the node sequences to support the extraction of MEMs during correction. In order to correct a long read, essaMEM is used to detect MEMs between the nodes of the DBG and the long read. These MEMs are grouped into local alignments, and the obtained local alignments are chained to find the final alignment.
Nanocorr
Nanocorr [53] is a hybrid error correction tool for Oxford Nanopore long reads. It follows a strategy similar to PacBioToCA [84]. First, Illumina short reads are mapped to the nanopore long reads using BLAST. Since there are potentially many spurious alignments, it is essential to select a correct subset of the short read alignments. For each long read, Nanocorr uses a dynamic programming algorithm based on the longest increasing subsequence (LIS) problem to select such a subset. These short read alignments are then used to correct the long read via the pbdagcon consensus algorithm (described in the next subsection).
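The LIS-style selection step can be sketched as follows (a hypothetical Python illustration, not Nanocorr's implementation): given candidate short read alignments as intervals on the long read, a quadratic dynamic program picks a non-overlapping chain of maximal total score, discarding spurious alignments that conflict with it.

```python
def select_alignments(alns):
    """Select a consistent chain of short read alignments along a long read.
    alns: (start, end, score) intervals on the long read.  An O(n^2)
    weighted-LIS dynamic program: chained alignments must appear in
    increasing order without overlapping; the chain of maximal total
    score is returned and conflicting (spurious) alignments are dropped."""
    if not alns:
        return []
    alns = sorted(alns)
    n = len(alns)
    best = [a[2] for a in alns]              # best chain score ending at i
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            if alns[j][1] <= alns[i][0] and best[j] + alns[i][2] > best[i]:
                best[i] = best[j] + alns[i][2]
                prev[i] = j
    i = max(range(n), key=best.__getitem__)  # best endpoint, then backtrack
    chain = []
    while i != -1:
        chain.append(alns[i])
        i = prev[i]
    return chain[::-1]
```

Here the second alignment (40, 90, 50) overlaps the first and is dropped in favor of a longer consistent chain.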
2.4.2 Self-correction
In principle, the idea of the self-correction approach is to exploit error-free segments across the whole set of long reads to correct each individual long read. The rationale behind this idea is that sequencing errors are distributed randomly in long reads generated by SMS technologies. As a result, theoretically, the consensus sequence of a set of reads derived
from the same genomic region can achieve any desired accuracy level by increasing the coverage [124]. To do this, self-correction methods require aligning each long read against the other long reads. There are two major concerns regarding the self-correction approach. First, aligning long reads against long reads is both challenging and time-consuming, especially at the high error rate of SMS reads. Second, as mentioned, high accuracy can be achieved only with a high-coverage data set of long reads. For a genome as large as the human genome, this can be a major issue, as obtaining high coverage is not currently affordable due to the high cost of SMS technologies. Nevertheless, developing methods that solely use SMS reads is an active field of research. Here, we summarize currently available methods for non-hybrid correction.
pbdagcon
pbdagcon is an error correction algorithm, first used in the HGAP assembler [24], which generates a consensus sequence from a set of alignments to a template sequence. Consider a single long read subject to error correction, called a seed read, as well as the alignments of some long reads to this seed read. pbdagcon exploits the idea of representing a multiple sequence alignment using a directed acyclic graph (DAG). In this DAG, nodes are labeled by the DNA bases A, C, G, or T, and edges connect nodes whose corresponding bases are consecutive in a read. An initial DAG is constructed, which is a linear directed graph corresponding to the seed read. The information from the alignments is added to this DAG iteratively, resulting in a multi-graph. For each alignment, an edge is inserted into the graph between nodes corresponding to consecutive bases (new nodes are added if needed). A score is associated with each node, which is positive if the number of out-edges is more than half of the number of all possible out-edges, and negative otherwise. The consensus is built by finding a path with the maximum score sum among all possible paths.
In effect, pbdagcon generates the consensus sequence by stitching high-quality local alignments based on the information in the DAG.
falcon_sense
falcon_sense is another correction algorithm, first used in [11], which is faster than pbdagcon at the cost of accuracy. Consider a single seed read as well as a set of long reads that are supposed to overlap this seed read. falcon_sense starts by aligning all long reads to the seed read using a fast aligner [122]. Each mismatch is converted to an insertion followed by a deletion. For each seed read position, falcon_sense generates a tag denoting a match, insertion, or deletion at that position, and the number of tags of each type is counted. In the end, a consensus is generated based on a majority vote of tags.
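A toy version of this tag-counting consensus (our simplification; real falcon_sense also tracks insertion tags between positions) can be written as:

```python
from collections import Counter

def consensus_by_tags(seed_len, tag_lists):
    """Each aligned read contributes one tag per seed-read position: a base
    (match or substitution) or '-' (deletion).  The majority tag wins at
    every position; positions where '-' wins are dropped from the output."""
    out = []
    for pos in range(seed_len):
        tag, _ = Counter(tags[pos] for tags in tag_lists).most_common(1)[0]
        if tag != '-':
            out.append(tag)
    return ''.join(out)
```

Because only per-position counts are kept, the whole pass is linear in the total length of the alignments, which is why this scheme is faster than building pbdagcon's DAG.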
LoRMA
LoRMA [142] extends the idea of LoRDEC for self-correction of long reads. In the first phase of correction, LoRMA exploits the idea of using a DBG for error correction (as explained for LoRDEC). However, in the absence of high-quality short reads, the DBG is built from the set of long reads. LoRMA iteratively corrects the long reads using the DBG in three rounds while increasing the k-mer size. Since the error rate in raw long reads is high, the first iteration has to use a relatively small k-mer size. In addition, only k-mers that appear at least four times in the whole long read set are used. The rationale for such an iterative approach is that after each round of correction, the error-free regions in the long reads become a bit longer. In addition, increasing the k-mer size for the next round decreases the complexity of the DBG significantly. By default, LoRMA uses three iterations of LoRDEC correction with k-mer sizes 19, 40, and 61. In the second phase of correction, LoRMA generates a consensus using a multiple sequence alignment of the reads to each other. To do this, LoRMA builds a DBG (this time with all k-mers) from the set of long reads corrected in the first phase. Then, for each long read, a unique path in the DBG is found that spells it out, and the read id is stored on the edges of this path. After processing all long reads, to correct a single long read, its unique path in the DBG is traversed again, and all the long reads sharing enough k-mers are collected. The set of collected long reads is used to generate a consensus based on an iterative computation of a multiple sequence alignment of the long reads. This iterative procedure makes LoRMA extremely slow; thus, its usage is limited to bacterial-size genomes.
2.5 de novo genome assembly
In general, the de novo genome assembly problem, in which no information from an already assembled related genome is used, has always been challenging. In practice, the existence of repeat sequences (longer than the length of the reads) makes it impossible to obtain a perfect assembly. de novo assembly is usually done using either the de Bruijn graph (DBG) or the overlap-layout-consensus (OLC) framework. In both cases, a non-branching path of edges corresponds to a unitig. A unitig is a special kind of contig that is (ideally) entirely consistent with all the data and contains no misassembly. Often, extending unitigs into contigs is challenging due to ambiguous branches of the graph. Using SMS long reads, it is possible to obtain a less complicated graph or to resolve ambiguous branches in an existing graph. As a result, SMS long reads are helpful in achieving highly contiguous assemblies. For instance, for a human genome, an SMS-based assembly gives an NG50 of 4,320 Kbp, which is far longer than the 129 Kbp of an NGS-based assembly of the same genome [11]. Another example is the assembly of the highly repetitive plant genome Aegilops tauschii, for which the NGS-based assembler SOAPdenovo2 [109] yields
an assembly which covers only ∼ 65% of the estimated genome size, while a hybrid assembly of the same genome using both SMS and NGS reads generates almost the whole expected length.
2.5.1 Hybrid assembly
The use of SMS reads for de novo assembly was first explored for the scaffolding step of NGS-based assemblers. AHA [9], PBJelly [42], and SSPACE-LongRead [12] are among these methods. Since these methods still generate contigs from NGS data, they do not use the full power of SMS reads. In the following, we introduce some hybrid de novo assembly methods that take advantage of SMS long reads to generate considerably longer contigs.
PBcR
The PBcR assembly pipeline [84] was the first to demonstrate the potential of SMS long reads for the assembly task. The PacBioToCA correction method (explained in Subsection 2.4.1) was used to correct sequencing errors in the raw long reads. Corrected long reads were then assembled using an OLC assembly technique. To do this, the Celera assembler [119] was modified to handle longer input reads.
hybridSPAdes
hybridSPAdes [4] uses long reads to resolve ambiguities of an assembly graph. It starts by constructing a de Bruijn graph (DBG) and performing various graph simplifications to transform it into an assembly graph, AG, in which edges represent unitigs. Then each long read is mapped to AG by finding a path in AG that spells out the error-free version of the long read. In order to do this, a set of k-mer matches between the long read and the edges of AG is obtained. These k-mer matches can guide the alignment of the long read. More specifically, a chain of k-mer matches in each edge is obtained, and these chains are converted to alignments by (i) merging chains in trivial cases, when their corresponding edges are consecutive in AG, and (ii) performing an exhaustive search in more complicated cases, when their corresponding edges are not consecutive. In the latter case, a path with minimum edit distance to the long read is chosen. After aligning all long reads to AG, hybridSPAdes uses the alignments for two purposes. First, they are used for closing gaps that are present in AG due to a lack of coverage in some genomic regions. These gaps can be identified as two dead-end edges (one ending in a vertex without outgoing edges and the other starting in a vertex without incoming edges). If multiple long reads are aligned to both of these dead-end edges, the consensus sequence of the long reads is generated and used for gap closure. Second, ambiguous branches of AG can be resolved using long read alignments. Note that each long read alignment corresponds to a path in AG.
Therefore, if the extension of an edge is supported by the majority of the paths, such an extension is reliable and can be applied.
The drawback of hybridSPAdes is its substantial memory usage, as it requires storing the path information of each long read. In practice, the usage of hybridSPAdes is limited to small genomes only.
DBG2OLC
DBG2OLC [174] is another hybrid pipeline. Rather than mapping short reads to long reads for error correction, DBG2OLC uses contigs obtained from short reads to find alignments between long reads. A sketch of this approach is as follows. First, a DBG-based method is used to generate a set of contigs from the short read dataset. Contigs are mapped to the long reads based on the number of shared k-mers (only those k-mers that are unique in the contig set). These mapped contigs serve as anchors for the alignment of long reads to each other. After finding the overlaps between long reads, contained long reads are removed. A greedy approach is followed to stitch each long read to its best overlapping long reads in both directions. The result of this greedy algorithm is a set of low-quality contigs (since the long reads are not error corrected). Each draft contig is corrected by first mapping all related long reads to it using BLASR, followed by consensus calling. In DBG2OLC, rather than correcting long reads before assembly, high-quality bases are achieved only in the last step (consensus calling). In addition, after mapping the contigs, the long reads are transformed into a compressed form in which each read is represented only by the ordered list of contig ids mapped onto it. DBG2OLC uses these compressed long reads for alignment and overlap detection instead of aligning the raw long reads. Although such lossy compression speeds up the overlap detection step, it reduces the sensitivity of the alignments. This compression strategy, together with the low-sensitivity mapping of contigs and the greedy assembly of long reads, explains the higher number of reported misassemblies compared to the PBcR pipeline.
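The compressed-read overlap detection can be sketched as follows (a hypothetical Python illustration; DBG2OLC's actual scoring also uses the order of the contig ids):

```python
from collections import defaultdict

def candidate_overlaps(compressed_reads, min_shared=2):
    """compressed_reads: read id -> ordered list of anchor contig ids.
    Uses an inverted index (contig id -> reads containing it) to report
    read pairs that share at least min_shared anchor contigs."""
    index = defaultdict(set)
    for rid, contigs in compressed_reads.items():
        for c in set(contigs):
            index[c].add(rid)
    shared = defaultdict(int)
    for rids in index.values():
        ordered = sorted(rids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                shared[(ordered[i], ordered[j])] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}
```

Comparing short lists of contig ids instead of raw bases is what makes this step so much cheaper than base-level all-vs-all alignment, at the price of sensitivity.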
MaSuRCA
The MaSuRCA assembler [180] was first designed for assembling NGS short reads. Recently, it was updated to support hybrid assembly with noisy long reads [181]. MaSuRCA uses the notion of super-reads generated from the short read dataset. The aim is to generate a set of long sequences that contain all the information of the original dataset while reducing the coverage significantly. MaSuRCA builds a k-mer index database and generates k-unitigs by extending every short read in both directions as long as the extension is unambiguous. Paired-end information is used to link k-unitigs into super-reads. The set of super-reads is then mapped to the long reads. To do this, a database of 15-mers is built on the super-read dataset. Super-reads are approximately aligned to the long reads by chaining matching 15-mers. This approximate alignment gives the approximate start and end positions of the mapping, which helps to detect overlapping super-reads. An ordered
sequence of overlapping super-reads is computed for each long read to generate a pre-mega-read. Often, super-reads do not fully cover a long read, due either to differences in genomic coverage between the NGS and SMS technologies or to low-quality regions in the long read. As a result, there might exist multiple such pre-mega-reads for each long read. For each gap between two pre-mega-reads in a long read, if at least 3 other long reads overlap this long read and the flanking pre-mega-reads are identical, the aligned subsequences of the long reads can be used to generate a consensus sequence to fill the gap. The long sequences obtained after performing all gap closures are called mega-reads. If a gap cannot be filled in the previous step, two 500-bp sequences are extracted from the flanking pre-mega-reads and linked as mates. This information can be used by the assembler during scaffolding. The Celera assembler [119] is fed the set of mega-reads and the set of generated mates to generate the final assembly. MaSuRCA can handle large and repetitive genomes of length up to 22 Gbp. It requires 100x paired-end Illumina short reads and at least 10x PacBio long reads.
2.5.2 Non-hybrid assembly
Although obtaining a high-quality assembly using only SMS reads requires high coverage (based on the discussion in Subsection 2.4.2) and is affordable only for large groups, the development of tools for self-assembly of SMS reads is an active field of research. Even though de Bruijn graphs have been shown to be capable of non-hybrid assembly of long reads [105], their usage for larger genomes remains challenging. On the other hand, OLC-based methods have proven to work for large genomes and will be our focus in this section. Among all available tools, only FALCON and Canu seem to be able to assemble human-size genomes effectively. In addition, note that it is always possible to increase the base-level accuracy by polishing the draft genome assembly. This is usually done using Quiver [24] for PacBio reads or Nanopolish [108] for Oxford Nanopore reads.
HGAP
The hierarchical genome-assembly process (HGAP) [24] was the first non-hybrid assembly pipeline. Its hierarchical strategy relies on first generating some mini-assemblies and then assembling these mini-assemblies into the draft assembly. HGAP considers the longest 20x of the long reads as seed reads. All other long reads are aligned to the seed reads using BLASR. It then uses the pbdagcon correction module described in Subsection 2.4.2 to correct the seed reads. The error-corrected (pre-assembled) high-quality seed reads obtained from pbdagcon are passed to an OLC-based assembler, namely Celera [119]. In HGAP, this assembly step generates a draft assembly (with high genome contiguity) rather than the final assembly. In order to obtain the final assembly, a polishing algorithm called Quiver is used. Quiver takes advantage of raw information generated during SMRT sequencing. Raw long reads are
aligned to the draft genome using BLASR. The assembled draft genome is then disregarded, and an initial approximate consensus is generated from all alignments using a fast heuristic. This is done to make the polished assembly independent from local assembly biases and fine-scale errors in the draft assembly. All single-base substitution, insertion, and deletion edits are tested against the draft assembly, and only those that improve the likelihood are applied. This process is repeated until no further improvement in the likelihood is observed.
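This edit-test-and-accept loop can be caricatured as greedy hill climbing. In the sketch below (our illustration; plain edit distance to the aligned reads stands in for Quiver's likelihood model, which in reality uses raw SMRT signal information), every single-base edit is tried and kept only if it improves the score:

```python
def lev(a, b):
    """Plain Levenshtein edit distance (rolling one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def polish(draft, reads, score):
    """Greedy refinement in the spirit of Quiver: test every single-base
    substitution, insertion, and deletion on the draft, keep an edit only
    if it improves the score, and repeat until convergence."""
    improved = True
    while improved:
        improved = False
        cur = score(draft, reads)
        for i in range(len(draft) + 1):
            cands = [draft[:i] + b + draft[i:] for b in "ACGT"]          # insertions
            if i < len(draft):
                cands += [draft[:i] + b + draft[i + 1:] for b in "ACGT"]  # substitutions
                cands += [draft[:i] + draft[i + 1:]]                      # deletion
            for cand in cands:
                s = score(cand, reads)
                if s > cur:
                    draft, cur, improved = cand, s, True
    return draft

# stand-in likelihood: the negative sum of edit distances to the aligned reads
score = lambda d, reads: -sum(lev(d, r) for r in reads)
```

With this stand-in score, a draft missing a base relative to the aligned reads is repaired by the insertion edit that lowers the total edit distance the most, mirroring how Quiver's likelihood-guided edits converge on a polished sequence.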
MHAP
MHAP [11] is a probabilistic algorithm for the rapid identification of overlaps between long noisy reads. It uses a dimensionality reduction technique called MinHash, which was first used to quickly determine the similarity between web pages. The general idea is similar to minimizers [138], but instead of using a single hash function to generate the list of representatives, the list of integer representatives is computed using multiple randomized hash functions. When two long reads share a sufficiently long overlap, they are likely to share entries in their lists of representatives. The authors considered the longest 40x of long reads as seed reads. All long reads are then mapped to the seed reads using the MHAP technique. Each seed read is corrected using the falcon_sense consensus algorithm (see Subsection 2.4.2). falcon_sense needs detailed alignments, while MHAP reports only approximate alignments; thus, a fast alignment algorithm [122] is used to align all overlapping long reads to the seed read before running falcon_sense. Finally, the Celera assembler [119] is used to generate the final assembly from the set of corrected seed reads.
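A minimal MinHash sketch comparison might look like this (a toy Python illustration using XOR masks as the randomized hash functions; MHAP's implementation differs in its hashing and k-mer filtering):

```python
import random

def minhash_sketch(seq, k=16, num_hashes=64, seed=0):
    """MinHash sketch of a read: hash every k-mer, then for each of
    num_hashes randomized hash functions (simulated here by XOR masks)
    keep the minimum value."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    kmers = {hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return [min(h ^ m for h in kmers) for m in masks]

def sketch_similarity(s1, s2):
    """Fraction of agreeing sketch entries: an estimate of the Jaccard
    similarity between the two reads' k-mer sets."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)
```

Two reads with a long shared overlap have many common k-mers, so the same k-mer often attains the minimum under a given hash function in both reads, and the fraction of agreeing entries estimates the Jaccard similarity of their k-mer sets.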
FALCON
FALCON [25] is an assembler designed by Pacific Biosciences specifically for PacBio long reads. It selects a set of seed reads based on a pre-defined seed length. DALIGNER [124] is used to find all long reads overlapping the seed reads. For each seed read, the supporting reads are aligned to it using a fast alignment algorithm [122]. A modified version of the falcon_sense algorithm (described in Subsection 2.4.2) is used to correct each seed read. This modified falcon_sense exploits the idea of using a DAG similar to pbdagcon, but the nodes of the DAG contain more detailed alignment tags. The highest-weight path is followed to generate a corrected sequence. After the error correction step, FALCON identifies the overlaps between all pairs of corrected seed reads using DALIGNER. The overlapping sequences are used to build a directed string graph that keeps the diploidy information. Using this string graph, the first draft of contigs is generated. In order to do phasing, raw reads are associated with the contigs by tracing the read overlap information used for error correction. For each draft contig, all the associated raw reads are collected and aligned to the contig using BLASR. These alignments are used
to call heterozygous SNPs. Those raw reads that contain a sufficient number of heterozygous SNPs are used by the FALCON-Unzip algorithm to generate haplotype-specific contigs.
Canu
Canu [85] is another assembler specifically designed for assembling both PacBio and Oxford Nanopore reads. It uses an optimized version of MHAP to find all-vs-all overlaps of raw long reads. Based on the overlap information, Canu estimates the length of corrected reads (reads with no coverage will have a corrected length of zero). Canu then considers the longest 40x of corrected read length as seed reads. For each seed read, all supporting raw reads are quickly aligned to the seed read. The DAG version of the falcon_sense consensus algorithm is used to correct the seed reads. After the correction step, the set of corrected seed reads is used to build an overlap graph, from which the draft contigs are generated. In the end, Canu improves the quality of the contigs by aligning all the supporting reads to them. A consensus sequence is then generated for each contig using the pbdagcon algorithm.
miniasm + Racon
Unlike other SMS assemblers, miniasm [97] skips the error correction step (which usually takes the majority of the running time) and generates a genome assembly directly from uncorrected long reads. miniasm takes advantage of a fast approximate overlap detection module called minimap. The first step is to find all-vs-all alignments of the long reads using minimap. minimap collects minimizers [138] of the long reads and stores them in a hash table. Then, for each query long read, all minimizer matches are obtained from the hash table. The minimizer matches are clustered, and for each cluster, minimap finds the longest chain of co-linear minimizer matches by solving the longest increasing subsequence problem. This gives an approximate alignment of the query long read to other long reads. Next, before the assembly step, low-quality regions of the long reads are trimmed. For each long read, miniasm detects the longest region covered by at least three other reads and removes all the bases outside this region. miniasm generates a string graph from the trimmed reads and performs the usual simplifications, namely transitive edge reduction, tip pruning, and bubble removal. In the end, unambiguous overlaps are merged to generate the set of contigs. Although miniasm is a fast assembler compared to FALCON and Canu, because it avoids error correction, the quality of the generated contigs is essentially close to that of the original raw dataset. Therefore, its contigs are not directly useful, and increasing their quality is left to other tools. A recently published consensus module, Racon [159], was shown to be effective for this purpose. Racon first maps the long reads to the draft contigs using minimap to find approximate alignments, followed by a fast edit-distance-based alignment to get a more detailed assignment of bases. The draft contigs, as well as the alignments,
are then split into smaller windows. In the end, a consensus sequence is built using the partial order alignment (POA) graph approach [92, 91]. A major drawback of miniasm is that it builds the whole string graph in memory, which requires a substantial amount of internal memory. Thus, even though, together with Racon, miniasm can generate acceptable assemblies, it cannot handle the human genome and its usage is limited to small genomes.
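The (w,k)-minimizer scheme at the heart of minimap can be sketched as follows (a toy Python illustration; minimap additionally hashes both strands and uses an invertible integer hash):

```python
def minimizers(seq, k=15, w=10):
    """(w,k)-minimizers: from every window of w consecutive k-mers, keep the
    k-mer with the smallest hash value.  Nearby windows usually share their
    minimum, so far fewer than len(seq) seeds are kept.  Returns sorted
    (position, k-mer) pairs."""
    hashed = [(hash(seq[i:i + k]), i) for i in range(len(seq) - k + 1)]
    picked = {min(hashed[s:s + w]) for s in range(len(hashed) - w + 1)}
    return sorted((pos, seq[pos:pos + k]) for _, pos in picked)
```

The key property exploited for overlap detection is that two reads sharing a long substring are guaranteed to share the minimizers of the windows that lie entirely inside that substring.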
2.5.3 wtdbg2
wtdbg2 [140], also known as Redbean, introduces a new graph framework called the fuzzy-Bruijn graph for self-assembly of long reads. This graph framework combines the idea of de Bruijn graphs with the overlap-layout-consensus approach. To do this, wtdbg2 first builds a binned representation of each long read in which every tiling 256 bp subsequence is considered one bin. It then performs all-vs-all alignment on these binned long reads, which is much faster (the dynamic programming matrix is ∼ 65,536 times smaller). Next, it identifies all K-bins (K consecutive bins) aligned together and creates a node for them in the fuzzy-Bruijn graph. An edge is added between two nodes if they are both present on a read. In the end, it simplifies the graph and finds the consensus sequence corresponding to each simple path in the simplified graph. Although wtdbg2 is the fastest available self-assembler for long reads, the final assemblies it generates do not have high quality and require further polishing.
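The node and edge construction of the fuzzy-Bruijn graph can be illustrated with a toy sketch (ours, not wtdbg2's code; we assume bins that align together have already been assigned a shared identifier):

```python
from collections import defaultdict

def fuzzy_bruijn_edges(binned_reads, K=2):
    """Toy fuzzy-Bruijn construction: each node is a tuple of K consecutive
    bin identifiers; an edge links nodes that are adjacent on the same read,
    weighted by how many reads support it."""
    edges = defaultdict(int)
    for bins in binned_reads:
        nodes = [tuple(bins[i:i + K]) for i in range(len(bins) - K + 1)]
        for a, b in zip(nodes, nodes[1:]):
            edges[(a, b)] += 1
    return edges
```

Just as in an ordinary de Bruijn graph over k-mers, overlapping reads reinforce the same edges, and simple paths in the (simplified) graph become consensus contigs.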
Chapter 3
Long read mapping
The very first step of most downstream analysis pipelines involves mapping the reads to a reference genome. For an Illumina-like short read with a low error rate, it is usually possible to find a “long” substring that exactly matches its mapping locus on the reference genome. All existing tools for mapping short reads are based on this fundamental observation. They aim to find such exact matches by using either (i) Burrows-Wheeler Transform/FM Index [15, 45] based methods [99, 89, 101], (ii) substring hashing [3, 172, 57, 58, 166, 33, 104, 52], or (iii) hybrid methods that combine the FM index with hashing [152, 113]. Unfortunately, because of high error rates (up to 20% reported for PacBio [157, 156] and up to 40% reported for Oxford Nanopore1 [53]), this key observation does not hold for SMS technologies. Furthermore, even when the mapping locus for a read can be correctly found, it is quite challenging to differentiate sequencing errors from actual genomic variants. There are a number of available methods for mapping long reads to the reference genome. BLASR [18] is the first tool specifically designed for PacBio reads. It finds all sufficiently long exact matches between a long read and a reference genome using a suffix array index. Then it groups the matches into clusters and ranks them by a frequency-weighted score. The top-scoring clusters, which correspond to candidate genomic locations, are used for performing sparse dynamic programming (SDP), followed by a banded alignment. BWA-MEM [96] is another mapper that was originally designed to align short sequence reads as well as assembled contigs to a reference genome. It has also been extended to map long SMS reads by tuning its alignment parameters (via option -x pacbio or -x ont2d).
BWA-MEM achieves this by finding the longest exact match covering each query position as a possible initial match, chaining these matches (and filtering out those chains “contained” by others), ranking the initial matches by the length of the chains containing them, and finally extending the initial matches based on a specific score cutoff to get a complete alignment. Another tool,
1Note that with the recent advances in Oxford Nanopore chemistry and base-calling, its current error rate is closer to 15% (see https://github.com/rrwick/Basecalling-comparison)
rHAT [107], is a hash-table-based mapper that uses a heuristic to estimate the approximate mapping location for each read. This is done by finding potential mapping regions for the middle 1000 bp segment of the long read on the reference genome through an approximate k-mer counting scheme. Then, for each potential mapping region, a lookup table is built to find short seeds, and a chain of these seeds is computed using an SDP-based heuristic. The final alignment is formed from the selected chains. A fourth tool, GraphMap [154], uses gapped spaced seeds and performs an approximate alignment by clustering these seeds. It then constructs alignment anchors by finding an exact walk in their “alignment graph” built from short k-mers of the target, chains these anchors, and finally refines the chain to generate the final alignment. Another tool, LAMSA [106], splits the long read into several “seeding fragments” and finds all their approximate matches on the reference genome using the GEM mapper [113]. It then finds the “skeleton” of the alignment using a directed acyclic graph (DAG) based on SDP. Lastly, LAMSA prioritizes the candidate skeletons and fills the gaps within the skeletons while accounting for different possible structural differences (e.g., large deletions). Recently, two new mappers, NGMLR [146] and Minimap2 [98], have been published. Similar to LAMSA, NGMLR starts by finding alignments of subsegments of a read that can each be aligned by a single linear alignment. For each pair of such subsegment alignments, it then performs a pairwise sequence alignment using a convex gap-cost model. It finally scans inside the alignments to identify regions with low sequence identity that exist due to small SVs. Minimap2 uses the notion of minimizers [138] for indexing the reference and finding seeds. It then performs chaining of the seeds and identifies the primary chains.
In the end, it performs alignment between adjacent anchors of chains using its fast SSE-based implementation. For a more detailed overview of each tool, we refer the reader to Section 2.3. Among the above tools, BLASR and BWA-MEM are sensitive but too slow for mapping large data sets. Speed is becoming a major issue since the delivery of the PacBio Sequel by Pacific Biosciences and the introduction of the PromethION device by Oxford Nanopore, which promise higher throughput for long read data at a lower cost. On the other hand, tools like rHAT and LAMSA are not sensitive enough to find the correct mapping locations for many reads. For instance, the candidate selection step of rHAT uses seeds only from the middle 1000 bp segment of the long read, which can be problematic, especially if that segment comes from a repetitive region. In this chapter, we introduce lordFAST, a novel long-read mapper that is specially designed for PacBio's continuous long reads (CLR). lordFAST is a highly efficient and sensitive aligner that can tolerate the high sequencing error rates observed in CLR reads through its use of multiple short exact matches. lordFAST not only maps more reads in a PacBio dataset but also maps them more accurately than available alternatives such as BLASR and BWA-MEM. It is worth mentioning that lordFAST is also capable of aligning reads generated by Oxford Nanopore Technology, since the error models are
somewhat similar. Our experimental results show that Minimap2 is the fastest tool among the above mappers. lordFAST is second in speed while achieving the highest sensitivity and precision on simulated data. This is primarily due to the fact that it maps the highest number of bases correctly among all mappers we tested.
3.1 Methods
3.1.1 Overview
lordFAST is a heuristic anchor-based aligner for long reads generated by third-generation sequencing technologies. lordFAST aims to find a set of candidate locations (ideally, only one) per read before the costly step of base-to-base alignment to the reference genome. lordFAST works in two main stages. In stage one, it builds an index from the reference genome, which is used to find short exact matches. The index is a combination of a lookup table and an (uncompressed) FM index. In stage two, it maps the long reads to the reference genome in four steps: (i) on each read, it identifies a fixed number of evenly spaced k-mers (k = 14 in the default settings), which are matched to the reference genome through the use of the index. For each such match, it obtains the longest exact matching (prefix) extension. Among these extended matches of each k-mer identified in each read, it finally chooses the longest (there could be more than one), which act as anchor matches; (ii) for each read, it then splits the reference genome into overlapping windows (of length twice that of the read) and identifies each such window as a candidate region if the number of anchor matches in that window is above a threshold value; (iii) for each candidate region, it identifies the longest chain of “concordant” anchor matches (i.e., a chain of anchor matches which have equal respective spacing in the read and the reference genome); (iv) it obtains the base-to-base alignment by performing dynamic programming between consecutive anchor matches in the selected chain. We provide a more detailed description of each step below.
3.1.2 Stage One: Reference Genome Indexing
In order to build a (substring) index for the reference genome, we use a combination of a simple lookup table for initial short matches and an (uncompressed) FM index for extending such initial matches. This combined index benefits from the speed of the lookup table and the compactness of the BWT representation of the reference genome. The lookup table (with 4^h entries for all possible h-mers) provides a constant-time search capability for each h-mer's position in the uncompressed FM index [45] (in the default setting h = 12, but the user is given the option to pick any value). As is well known, the FM index provides a compact representation of a suffix array [112], which we use to find (exact matching) extensions of initial h-mer matches. Note that in order to be able to perform an efficient search on both strands of the reference genome, we use an extension to the FM index implemented in fermi [95].
Figure 3.1: The speed-up of lordFAST's combined index for searching exact matches in a real human dataset compared to the original FM index: a 29% speed-up for finding all anchors in the first step. Note that this combined index uses only 0.25 GB more memory.
As depicted in Figure 3.1, our combined index provides a 29% speed-up over the standard uncompressed FM index for retrieving exact matches in a real human dataset, with a negligible increase in memory usage.
3.1.3 Stage Two: Read Mapping
Given a set of long reads, lordFAST aligns one read at a time as follows:
Step 1: Sparse Extraction of Anchor Matches. For a given read of length ℓ, lordFAST identifies C (user defined, default 1000) evenly spaced anchoring positions on the read. For each anchoring position, it finds the longest prefix match(es) (of length at least k = 14) to the genome as follows. First, it extracts the h-mer starting at the anchoring position and uses the lookup table of the genome index to obtain the interval that represents the initial set of matching locations on the FM index. It then uses the LF-mapping operation of the FM index to extend the initial set of matches and identify the longest match(es). Note that using the longest matches reduces the total number of anchor matches significantly. The longest matches are then added to the set of anchors, M, as triplets (r, g, s), where r is the anchoring position on the read, g is the starting location of the longest match on the genome, and s is the length of the match. At the end of this step, M is partitioned into M+ and M− based on the strand of the matching location on the genome. (Note that for reads that are "too short", i.e., ℓ < C + k − 1, we use ℓ − k + 1 anchoring positions instead of C anchoring positions.)
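Step 1 can be sketched as follows (illustrative only; the naive scan below stands in for lordFAST's FM-index search, and the toy values of C and k are far smaller than the defaults of 1000 and 14):

```python
# Illustrative sketch of Step 1 (not lordFAST's code): pick evenly spaced
# anchoring positions on the read and, for each, keep only the longest exact
# prefix match(es) to the genome as anchor triplets (r, g, s).

def longest_prefix_matches(genome, query, min_len):
    """Genome positions of the longest exact prefix match of `query` (>= min_len)."""
    best_len, best = 0, []
    for g in range(len(genome)):
        l = 0
        while g + l < len(genome) and l < len(query) and genome[g + l] == query[l]:
            l += 1
        if l >= min_len:
            if l > best_len:
                best_len, best = l, [g]
            elif l == best_len:
                best.append(g)
    return best_len, best

def extract_anchors(genome, read, C=4, k=3):
    n = len(read)
    num = n - k + 1 if n < C + k - 1 else C    # fewer positions for short reads
    step = max(1, (n - k) // max(1, num - 1))  # evenly spaced anchoring positions
    anchors = []
    for i in range(num):
        r = min(i * step, n - k)
        s, positions = longest_prefix_matches(genome, read[r:], k)
        anchors.extend((r, g, s) for g in positions)  # triplets (r, g, s)
    return anchors
```

Keeping only the longest match(es) per anchoring position is what keeps the anchor set small, which in turn keeps the later chaining step fast.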
Figure 3.2: (a) The implicit windows considered on the reference genome during the candidate selection step. If the read length is ℓ, the windows are of size 2ℓ and overlap by ℓ bases. (b) An example of the candidate selection step. Each dot represents an anchor, and its size represents the weight of the anchor. In this example, f = 2; since the maximum window score is 11, every window with a score ≤ 5.5 is ignored. In addition, the window with score 6 is not kept since it overlaps a window with score 7, and only one of the two windows with score 11 appears in the final list of candidates since they overlap each other.
Step 2: Candidate Region Selection. In order to select the candidate regions for alignment, lordFAST splits the reference genome into overlapping windows of size 2ℓ (as illustrated in Figure 3.2(a)). For each window, it calculates two scores, one for the forward strand and one for the reverse strand, from the anchor matches of the respective strands (M+ and M−). For each anchor match falling in a window, it adds s − k + 1 to the score of that window. lordFAST keeps all the windows with score > score_max/f, where f is the factor defining the significance of the window score (default 4) and score_max is the maximum window score. In other words, lordFAST keeps those windows whose score is not significantly worse than the maximum window score. In cases where two overlapping windows both meet the minimum window score requirement, lordFAST keeps the one with the higher window score in the final list (ties are broken by choosing the window with the smaller reference coordinate). Figure 3.2(b) depicts an example of the selection process. Assuming |G| is the size of the reference genome, this step uses O(|G|/ℓ) space and has a worst-case time complexity of O(|M| log N). The reason is that for each exact match in M, we need to find its matching window, which can be done in O(1); but since we need to keep the N top-scoring windows, we need a priority queue in which each insertion/replacement takes O(log N).
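A toy version of this window-scoring step, with hypothetical parameter values, might look like:

```python
# Illustrative sketch of Step 2 (not lordFAST's code): score overlapping
# windows of size 2*l from the anchors and keep the significant,
# non-overlapping ones. lordFAST keeps the N top-scoring windows with a
# size-N priority queue, giving the O(|M| log N) bound discussed above.

def select_candidates(anchors, genome_len, l, k, f=4, N=10):
    nwin = max(1, -(-genome_len // l) - 1)        # windows start every l bases
    score = [0] * nwin
    for _r, g, s in anchors:
        w = g // l                                # position g lies in windows w and w-1
        if w < nwin:
            score[w] += s - k + 1
        if w >= 1:
            score[w - 1] += s - k + 1
    smax = max(score)
    kept = []                                     # filled in decreasing score order
    for w in sorted(range(nwin), key=lambda w: (-score[w], w)):
        if score[w] * f <= smax:                  # keep only score > smax / f
            break
        if all(abs(w - u) > 1 for u in kept):     # adjacent windows overlap
            kept.append(w)
    return sorted(kept[:N])
```

Sorting by (−score, coordinate) realizes the tie-breaking rule above: between equal-scoring overlapping windows, the one with the smaller reference coordinate survives.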
Step 3: Chaining and Anchor Selection. Among all the anchor matches in a candidate region, lordFAST chooses a set of "concordant" anchors using local chaining. The best local chain is a set of co-linear, non-overlapping anchors on the reference genome that has the highest score among all such sets [127]. To calculate the best local chain, lordFAST assigns each anchor match a weight equal to the length of the match. lordFAST supports two chaining algorithms. By default, it obtains the best chain using the dynamic programming based chaining algorithm [127]. Note that the time complexity of this chaining algorithm is quadratic, but in practice it is fast due to the small number of anchor matches per read. The user may also select an alternative chaining algorithm based on clasp [131]. The anchor matches in the best local chain form the basis of the alignment in that region.
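The quadratic DP chaining can be sketched as follows (an illustrative re-implementation of the standard algorithm, not lordFAST's code):

```python
# Illustrative quadratic DP chaining: find the best-scoring chain of
# co-linear, non-overlapping anchors, weighting each anchor by its length.

def best_chain(anchors):
    """anchors: list of (r, g, s) triplets; returns (score, chain)."""
    anchors = sorted(anchors, key=lambda a: (a[1], a[0]))  # by genome position
    n = len(anchors)
    dp = [a[2] for a in anchors]        # a chain ending at i scores at least s_i
    back = [-1] * n
    for i in range(n):
        ri, gi, si = anchors[i]
        for j in range(i):
            rj, gj, sj = anchors[j]
            # co-linear and non-overlapping: j ends before i starts, on both
            # the read and the reference
            if rj + sj <= ri and gj + sj <= gi and dp[j] + si > dp[i]:
                dp[i], back[i] = dp[j] + si, j
    i = max(range(n), key=lambda i: dp[i])
    chain, score = [], dp[i]
    while i != -1:
        chain.append(anchors[i])
        i = back[i]
    return score, chain[::-1]
```

The double loop is the source of the quadratic complexity noted above; with only a handful of anchors per read, it is negligible in practice.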
Step 4: Alignment. lordFAST prioritizes the candidate regions based on their best chaining score and performs the final alignment for the top N regions (the default value for N is 10). In order to generate the base-to-base alignment of a region, it uses anchor matches from the top-scoring chain and performs banded global alignment for the gaps between pairs of consecutive anchor matches. Furthermore, the alignment between the prefix of the read and the reference prior to the first anchor is performed by the use of an anchored global-to-local alignment, and the alignment between the suffix of the read and the reference following the last anchor is computed in an identical fashion. This strategy is a widely used technique to avoid computing the full alignment between long sequences, which would require huge amounts of memory and computation time. lordFAST uses Edlib [153] for computing the global alignments and the ksw library² for computing the global-to-local alignments. Edlib is a library implementing a fast bit-vector algorithm devised by Myers (1999) [123]. ksw, on the other hand, provides alignment extension based on an affine gap cost model. While the actual time complexity of this step depends on the number of selected exact matches inside the chain, it is not more than O(b·ℓ), where b is the bandwidth used for the banded dynamic programming alignments. It is worth mentioning that lordFAST supports clipping as follows: if the prefix of the read before the first anchor (or, respectively, the suffix of the read after the last anchor) has an alignment score/similarity lower than a threshold (th_clip), lordFAST clips that prefix (or, respectively, suffix). This is done by using the ksw library to extend the alignment as long as no significant drop in the alignment score/similarity is observed; ksw performs this using an algorithm similar to BLAST's X-drop heuristic [178].
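As an illustration of the banded dynamic programming used between consecutive anchors, here is a minimal banded edit-distance sketch (unit costs only; the actual tool uses Edlib's bit-vector algorithm and ksw's affine gap model):

```python
# Illustrative banded global alignment (plain edit distance): cells farther
# than `band` from the main diagonal are never filled, giving the O(b * l)
# bound discussed above.

def banded_edit_distance(a, b_seq, band):
    INF = float("inf")
    n, m = len(a), len(b_seq)
    if abs(n - m) > band:
        return INF  # a global path cannot stay inside the band
    prev = {j: j for j in range(0, min(m, band) + 1)}  # DP row 0
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            sub = prev.get(j - 1, INF) + (a[i - 1] != b_seq[j - 1])
            ins = cur.get(j - 1, INF) + 1
            dele = prev.get(j, INF) + 1
            cur[j] = min(sub, ins, dele)
        prev = cur
    return prev.get(m, INF)
```

Because anchors pin the two sequences together, the gap sequences between consecutive anchors are short and similar in length, so a narrow band suffices.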
In addition, lordFAST supports split alignment as follows. Let S_{i,j} denote the substring of S that starts at position i and ends at position j. Suppose we are mapping a long read R to the reference genome G. Consider two consecutive anchors A = (r_A, g_A, s_A) and B = (r_B, g_B, s_B), as per the definition above, in the best chain chosen for a candidate window. If the alignment between R_{r_A,r_B} and G_{g_A,g_B} has a score lower than a threshold (th_split), we split the alignment and report one alignment as primary and the other as supplementary (as defined in the SAM format specification). One alignment corresponds to the substring before anchor A, and the other corresponds to the substring after anchor B. Furthermore, since the drop in alignment score/similarity could be due to the presence of an inversion, we check whether the alignment between the reverse complement of R_{r_A,r_B} and G_{g_A,g_B} has a score higher than th_split; in that case, such an alignment is also reported as another supplementary alignment.

² https://github.com/attractivechaos/klib
3.2 Results
We evaluated the performance of lordFAST-v0.0.9 against BLASR [18], BWA-MEM [96], GraphMap [154], LAMSA [106], rHAT [107], NGMLR [146], Minimap2 [98], and another recently available software, minialign³. Note that although GraphMap is specifically designed for Oxford Nanopore reads, we included it in our experiment as it is capable of mapping PacBio long reads with default parameters [154]. We compared the methods on both simulated and real datasets. We used the results on the simulated dataset for calculating the methods' precision and recall. All experiments were performed on a server running CentOS 6.9 equipped with four twelve-core (2 threads per core) Intel(R) Xeon(R) CPU processors (E7-4860 v2 @ 2.60GHz) and 1000 GB RAM. The details about the versions of the tools as well as the commands and parameters used to run each tool are provided in Appendices A.2 and A.3. Note that on real PacBio datasets, we observed that more than 99% of the sequence data are provided in reads of length 1000 bp or longer (see Figure 3.3 for details). Thus, we only focused on aligning reads that are 1000 bp or longer.
3.2.1 Experiment on a simulated dataset without structural variations
To evaluate the precision and recall of lordFAST against the above-mentioned tools, we simulated 25,000 long reads from hg38 using PBSIM [128], which infers the read length and error model from a real human read dataset. Appendix A.1.2 provides the instructions and commands for reproducing this simulated dataset. Note that we did not introduce any SNPs, indels, or structural variants in this experiment; i.e., the correct alignment between a read and the reference genome has mismatches and gaps only due to (simulated) read errors. For each read, PBSIM provides both the originating location on the reference genome and the "true" base-to-base alignment of the read to the reference genome in that location. Since, for any base on any read, its "true" base pairing on the reference genome is known, we have been able to calculate the number of correctly mapped reads/bases. We consider a read to be correctly mapped if (i) it gets mapped to the correct chromosome and strand; and (ii) the subsequence on the reference genome the read maps to overlaps with the "true" mapping subsequence by at least p bases. In order to compare the methods with respect to the number of correctly mapped reads, we used two values of p: a fixed value of 1 bp and a variable value set to 90% of the length of the originating "true" mapping subsequence. Note that, for most of the methods, there is not a big difference between the results based on the two settings for p; however, some methods cannot identify the "correct" mapping subsequence in its entirety and report only a partial alignment; accordingly, those methods perform poorly for the variable setting of p. We consider a base in a read to be correctly mapped if (i) the read is correctly mapped (as per the definition above) and (ii) the mapped location of the base is within 25 bp of the true alignment locus of the base. (A smaller value for the second condition makes the definition of a correctly mapped base more stringent; Tables 3.3 and 3.4 show the results when this threshold is increased to 50 bp and decreased to 5 bp, respectively.) Sensitivity is thus defined as the fraction of correctly mapped bases (according to this notion of a correct mapping) out of the total number of bases in the reads. Similarly, precision is defined as the fraction of correctly mapped bases out of the total number of mapped bases in the reads.

³ https://github.com/ocxtal/minialign

Figure 3.3: The read length distribution of 72,708 real PacBio reads from a human genome (CHM1) dataset (min: 36; median: 4,975; mean: 6,675; max: 35,489; N99: 1,068). The vertical axis shows the number of bases in each bin rather than the number of reads. At least 99% of the bases are in reads longer than 1000 bases.
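The evaluation criteria above can be sketched compactly (hypothetical data layout, not the thesis evaluation scripts):

```python
# Sketch of the evaluation criteria: a read is correctly mapped if chromosome
# and strand match and the mapped interval overlaps the true interval by at
# least p bases; sensitivity and precision are then computed over bases.

def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def read_correct(mapping, truth, p):
    return (mapping["chrom"] == truth["chrom"]
            and mapping["strand"] == truth["strand"]
            and overlap(mapping["ival"], truth["ival"]) >= p)

def base_sensitivity_precision(per_base, total_bases):
    """per_base: (mapped, correct) flags per read base; returns (sens, prec)."""
    mapped = sum(m for m, _ in per_base)
    correct = sum(c for m, c in per_base if m)
    return correct / total_bases, correct / mapped

# Example with hypothetical intervals: p = 1 accepts a 10 bp overlap,
# while the variable setting (90% of the true length) rejects it.
truth = {"chrom": "chr1", "strand": "+", "ival": (100, 200)}
partial = {"chrom": "chr1", "strand": "+", "ival": (190, 290)}
```

Note that precision divides by mapped bases only, which is why a tool that leaves many bases unmapped (such as GraphMap in Table 3.1) can show a precision much higher than its sensitivity.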
Table 3.1: Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface.

Minimum overlap (p) | Mapper | Correctly mapped reads | Correct bases (Mb) | Incorrect bases (Mb) | Unmapped bases (Kb) | Sensitivity^a (%) | Precision^b (%)
1 bp | BLASR | 24,642 | 164.52 | 18.39 | 698.22 | 89.61 | 89.95
1 bp | BWA-MEM | 24,603 | 170.63 | 12.50 | 525.11 | 92.91 | 93.17
1 bp | GraphMap | 24,161 | 177.26 | 4.05 | 2,297.27 | 96.55 | 97.77
1 bp | LAMSA | 24,458 | 176.00 | 6.40 | 282.15 | 96.36 | 96.51
1 bp | rHAT | 24,409 | 177.59 | 5.63 | 391.52 | 96.72 | 96.93
1 bp | NGMLR | 24,194 | 170.50 | 8.86 | 4,246.51 | 92.86 | 95.06
1 bp | Minimap2 | 24,745 | 180.06 | 3.34 | 223.46 | 98.06 | 98.18
1 bp | minialign | 24,567 | 178.25 | 4.73 | 621.60 | 97.08 | 97.41
1 bp | lordFAST | 24,751 | 181.68 | 1.89 | 29.35 | 98.95 | 98.97
90% | BLASR | 24,563 | 164.46 | 18.47 | 675.95 | 89.57 | 89.90
90% | BWA-MEM | 24,485 | 170.23 | 12.98 | 417.84 | 92.70 | 92.91
90% | GraphMap | 24,161 | 177.26 | 4.05 | 2,297.27 | 96.55 | 97.77
90% | LAMSA | 24,371 | 176.87 | 6.59 | 208.22 | 96.30 | 96.41
90% | rHAT | 24,372 | 177.55 | 5.98 | 80.98 | 96.70 | 96.74
90% | NGMLR | 23,769 | 169.66 | 10.44 | 3,508.56 | 92.40 | 94.20
90% | Minimap2 | 24,740 | 180.04 | 3.35 | 223.20 | 98.05 | 98.17
90% | minialign | 24,469 | 177.84 | 5.53 | 233.74 | 96.86 | 96.98
90% | lordFAST | 24,747 | 181.68 | 1.90 | 29.10 | 98.95 | 98.97
Note: A read is considered to be mapped correctly if its aligned subsequence in the reference overlaps with the "correct" mapping subsequence by at least p bases. A base in a read is considered to be correctly mapped if the read is correctly mapped and the mapping location of the base is within a 25 bp vicinity of the correct alignment locus of the base.
a The sensitivity is defined as the number of correctly mapped bases / the total number of bases.
b The precision is defined as the number of correctly mapped bases / the number of mapped bases.
Using these definitions, we compared all of the above-mentioned methods; a summary of the results is presented in Table 3.1. As can be seen, lordFAST not only maps more reads correctly than any other mapper but also aligns about 98.9% of the total number of bases correctly, which is 0.9%–9.4% more than its competitors. In addition, lordFAST achieves the highest base sensitivity and precision. It is important to note that for GraphMap, the precision value is much higher than the sensitivity because it leaves many of the bases unmapped. In that sense, we believe that sensitivity provides a much better measure for comparing the tools, even though lordFAST is the best with respect to both measures. Table 3.2 provides details about the running time and memory usage of each tool on this dataset. Here, Minimap2 is the fastest tool, followed by minialign and lordFAST. BWA-MEM, lordFAST, and LAMSA show the lowest memory footprints. We also evaluated the ability of different tools to distinguish between unique and repetitive hits in terms of the assigned mapping quality (MAPQ), following [98]. For this evaluation, a read is considered correctly mapped if its best mapping aligns to a region of the reference that overlaps with (i) 10% of the "true" mapping region (Figure 3.4(a)), or (ii) 90% of the "true" mapping region (Figure 3.4(b)). In general, Minimap2 and lordFAST map a higher portion of reads with high mapping quality to the correct location compared to other tools, especially with the more stringent definition of correct mapping (see Figure 3.4(b)).
Table 3.2: Runtime and memory usage of the tools compared in Table 3.1.

Mapper | Time (sec) | Memory (GB)
BLASR | 9,233 | 14.67
BWA-MEM | 6,842 | 5.22
GraphMap | 17,546 | 42.56
LAMSA | 1,277 | 5.85
rHAT | 1,044 | 13.95
NGMLR | 2,970 | 5.45
Minimap2 | 154 | 6.50
minialign | 201 | 12.70
lordFAST | 696 | 5.43
Note: The running time and peak memory usage are measured using the GNU time command (/usr/bin/time -v).
Table 3.3: Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface.
Minimum overlap (p) | Mapper | Correctly mapped reads | Correct bases (Mb) | Incorrect bases (Mb) | Unmapped bases (Kb) | Sensitivity^a (%) | Precision^b (%)
1 bp | BLASR | 24,642 | 171.74 | 11.17 | 698.22 | 93.53 | 93.89
1 bp | BWA-MEM | 24,603 | 171.76 | 11.36 | 525.11 | 93.53 | 93.80
1 bp | GraphMap | 24,161 | 177.33 | 3.98 | 2,297.27 | 96.58 | 97.81
1 bp | LAMSA | 24,458 | 177.65 | 5.75 | 282.15 | 96.72 | 96.87
1 bp | rHAT | 24,409 | 177.87 | 5.35 | 391.52 | 96.87 | 97.08
1 bp | NGMLR | 24,194 | 172.83 | 6.53 | 4,246.51 | 94.13 | 96.36
1 bp | Minimap2 | 24,745 | 181.56 | 1.84 | 223.46 | 98.88 | 99.00
1 bp | minialign | 24,567 | 179.68 | 3.31 | 621.60 | 97.86 | 98.19
1 bp | lordFAST | 24,751 | 181.74 | 1.84 | 29.35 | 98.98 | 99.00
90% | BLASR | 24,563 | 171.66 | 11.27 | 675.95 | 93.50 | 93.84
90% | BWA-MEM | 24,485 | 171.37 | 11.85 | 417.84 | 93.32 | 93.53
90% | GraphMap | 24,161 | 177.33 | 3.98 | 2,297.27 | 96.58 | 97.81
90% | LAMSA | 24,371 | 177.52 | 5.94 | 208.22 | 96.65 | 96.76
90% | rHAT | 24,372 | 177.82 | 5.71 | 80.98 | 96.85 | 96.89
90% | NGMLR | 23,769 | 171.99 | 8.11 | 3,508.56 | 93.67 | 95.50
90% | Minimap2 | 24,740 | 181.53 | 1.85 | 223.20 | 98.87 | 98.99
90% | minialign | 24,469 | 179.27 | 4.11 | 233.74 | 97.64 | 97.76
90% | lordFAST | 24,747 | 181.73 | 1.85 | 29.10 | 98.98 | 98.99
Note: A read is considered to be mapped correctly if its aligned subsequence in the reference overlaps with the "correct" mapping subsequence by at least p bases. A base in a read is considered to be correctly mapped if the read is correctly mapped and the mapping location of the base is within a 50 bp vicinity of the correct alignment locus of the base.
a The sensitivity is defined as the number of correctly mapped bases / the total number of bases.
b The precision is defined as the number of correctly mapped bases / the number of mapped bases.
3.2.2 Simulation in presence of structural variations
In order to evaluate the capability of lordFAST for mapping reads that span structural variations, we performed another experiment to detect simulated SVs using Sniffles [146]. Sniffles requires a minimum of 15× coverage to have good accuracy. Therefore, for this experiment, we only focused on Chr1 and generated a simulated dataset by inserting 21 SVs from DGV (9 insertions, 9 deletions, and 3 inversions) of different sizes. More specifically, we performed simulation and SV calling as follows:
Table 3.4: Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface.

Minimum overlap (p) | Mapper | Correctly mapped reads | Correct bases (Mb) | Incorrect bases (Mb) | Unmapped bases (Kb) | Sensitivity^a (%) | Precision^b (%)
1 bp | BLASR | 24,642 | 136.73 | 46.18 | 698.22 | 74.47 | 74.75
1 bp | BWA-MEM | 24,603 | 164.35 | 18.78 | 525.11 | 89.49 | 89.75
1 bp | GraphMap | 24,161 | 175.48 | 5.83 | 2,297.27 | 95.57 | 96.79
1 bp | LAMSA | 24,458 | 163.97 | 19.42 | 282.15 | 89.27 | 89.41
1 bp | rHAT | 24,409 | 172.63 | 10.59 | 391.52 | 94.02 | 94.22
1 bp | NGMLR | 24,194 | 151.75 | 27.61 | 4,246.51 | 82.65 | 84.60
1 bp | Minimap2 | 24,745 | 160.90 | 22.50 | 223.46 | 87.62 | 87.73
1 bp | minialign | 24,567 | 159.37 | 23.62 | 621.60 | 86.80 | 87.09
1 bp | lordFAST | 24,751 | 180.91 | 2.67 | 29.35 | 98.53 | 98.54
90% | BLASR | 24,563 | 136.67 | 46.26 | 675.95 | 74.43 | 74.71
90% | BWA-MEM | 24,485 | 163.95 | 19.26 | 417.84 | 89.28 | 89.49
90% | GraphMap | 24,161 | 175.48 | 5.83 | 2,297.27 | 95.57 | 96.79
90% | LAMSA | 24,371 | 163.84 | 19.62 | 208.22 | 89.21 | 89.31
90% | rHAT | 24,372 | 172.60 | 10.93 | 80.98 | 94.00 | 94.04
90% | NGMLR | 23,769 | 150.91 | 29.19 | 3,508.56 | 82.19 | 83.79
90% | Minimap2 | 24,740 | 160.87 | 22.52 | 223.20 | 87.61 | 87.72
90% | minialign | 24,469 | 158.95 | 24.42 | 233.74 | 86.57 | 86.68
90% | lordFAST | 24,747 | 180.91 | 2.67 | 29.10 | 98.53 | 98.54
Note: A read is considered to be mapped correctly if its aligned subsequence in the reference overlaps with the "correct" mapping subsequence by at least p bases. A base in a read is considered to be correctly mapped if the read is correctly mapped and the mapping location of the base is within a 5 bp vicinity of the correct alignment locus of the base.
a The sensitivity is defined as the number of correctly mapped bases / the total number of bases.
b The precision is defined as the number of correctly mapped bases / the number of mapped bases.
Figure 3.4: Read mappings are sorted by their mapping quality in descending order. For each mapping-quality threshold, the fraction of mapped reads with mapping quality above the threshold (out of the total number of reads) is plotted against the fraction of incorrectly mapped reads (out of the number of mapped reads): (a) with 10% overlap required for a correct mapping; (b) with 90% overlap required.
(i) We divided the SVs reported in DGV on Chr1 of the NA12878 individual into 3 groups based on their size (shorter than 500 bp, between 500 and 5000 bp, and longer than 5000 bp) and randomly selected 3 insertions, 3 deletions, and 1 inversion from each group.
(ii) Selected SVs were inserted into the reference Chr1 to get a simulated donor chromosome.
(iii) A set of long reads with 15× coverage was simulated from the donor chromosome using PBSIM [128]. PBSIM was fed a FASTQ file from a real human dataset to use its sample-based model (via the option --sample-fastq).
(iv) Long reads were mapped to the reference Chr1 using the different mappers. Sniffles requires the MD tag in order to operate. Among the different mappers, BLASR, LAMSA, and rHAT do not generate the MD tag in the output SAM file; therefore, for these mappers, we used the samtools calmd command to calculate and add the MD tag. Minimap2 and minialign add MD tags with optional arguments. The other tools (including lordFAST) generate MD tags by default.
(v) For each mapper, a sorted bam file was generated from the SAM file using samtools sort.
(vi) Sniffles (version v1.0.8) was run with parameter -s 4.
For this experiment, some tools required a more specialized command. In particular, we used the following commands for each tool:
• BLASR was run with parameters --sam --bestn 1 --clipping subread --affineAlign --noSplitSubreads --nCandidates 20 --minPctSimilarity 75 --sdpTupleSize 6.
• BWA-MEM was run with parameters -x pacbio -MY as mentioned in [146].
• LAMSA was run with parameters -T pacbio -i 25 -l 50 -S.
• minimap2 was run with parameters -aY -x map-pb --MD.
• minialign was run with parameters -x pacbio -T AS,XS,NM,NH,IH,SA,MD -P.
• rHAT, GraphMap, NGMLR, and lordFAST were run with default parameters.
Here, we provide the results of SV calling using Sniffles based on the mappings from different tools. We define a call as "exact" if (i) its start and end coordinates are at most 25 bp away from the actual simulated breakpoints; and (ii) it overlaps with one simulated SV of the same type. If only the first condition is not satisfied, the call is considered "inexact"; if only the second condition is not satisfied, the call is considered "mis-classified"; and if neither condition is satisfied, the call is considered "wrong". Among all mappers, Sniffles generated SV calls only for NGMLR, BWA-MEM, rHAT, and lordFAST. As can be seen in Table 3.5, all calls based on rHAT mappings are wrong. Also, Sniffles finds more "exact" calls with lordFAST and NGMLR mappings in comparison to mappings provided by BWA-MEM. This suggests that lordFAST does not generate misalignments around SV breakpoints and is capable of properly mapping reads that span/overlap SVs.

Table 3.5: Structural variations called by Sniffles based on mappings from different tools.

Mapper | # calls | # exact | # inexact | # mis-classified | # wrong
NGMLR | 19 | 17 | 1 | 0 | 1
BWA-MEM | 18 | 12 | 5 | 0 | 1
rHAT | 35 | 0 | 0 | 0 | 35
lordFAST | 17 | 16 | 1 | 0 | 0
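The four-way call classification can be sketched as follows (an illustrative helper, not the evaluation code used in the thesis; `tol` is the 25 bp breakpoint tolerance):

```python
# Sketch of the call classification: "exact" needs both breakpoints within
# `tol` bp of a simulated SV and a type-matching overlap; failing only the
# breakpoint test gives "inexact", only the type test "mis-classified",
# and both "wrong".

def classify_call(call, truths, tol=25):
    """call and each truth: (start, end, svtype) tuples on the same chromosome."""
    s, e, t = call
    near = any(abs(s - ts) <= tol and abs(e - te) <= tol for ts, te, _ in truths)
    typed = any(tt == t and min(e, te) > max(s, ts) for ts, te, tt in truths)
    if near and typed:
        return "exact"
    if typed:
        return "inexact"
    if near:
        return "mis-classified"
    return "wrong"
```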
3.2.3 Experiment on a real dataset
We evaluated the above methods on a real dataset containing 23,155 reads sequenced from a human genome (CHM1 cell line; Appendix A.1.1 contains details about this dataset). Since the true mapping locations of the reads are not known a priori, we compared the methods based on the quality of their reported alignments. For each mapping of a read, we count the number of its bases that are aligned to identical bases in the reference (matched bases). In addition, we calculated the alignment score by adding +1 for every matching base and −1 for every mismatching, inserted, deleted, or unmapped/clipped base. For each tool, we report the sum of the alignment scores of all the reads in the dataset. Although the number of matched bases per se may not be the best comparison measure (since one could match all the bases in the read without paying attention to the gaps created in the reference), it is complementary to the alignment score: a program that greedily tries to maximize the number of matched bases will very likely produce a low alignment score. Table 3.6 shows the result of this experiment. lordFAST has the highest total alignment score; more precisely, lordFAST's total alignment score is 2.79 million higher, and its number of matched bases 1.74 million higher, than those of its closest competitor. We also measured the agreement between the various methods based on their alignments of the reads. For a given read, an alignment x covers another alignment y if and only if the subsequence on the reference genome covered by x overlaps with at least 90% of the subsequence on the reference genome covered by y. Figure 3.5 shows examples of covering and non-covering alignments. Table 3.7 shows how the best alignments from different methods cover each other; more specifically, each row contains the percentage of mappings reported by the corresponding tool that cover the mappings of the other tools.
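The scoring scheme above reduces to simple arithmetic over per-read event counts (a sketch; the thesis derives these counts from the reported SAM alignments):

```python
# Sketch of the per-read alignment score: +1 for each matched base, -1 for
# each mismatched, inserted, deleted, or unmapped/clipped base.

def alignment_score(matches, mismatches, insertions, deletions, clipped):
    return matches - (mismatches + insertions + deletions + clipped)

def dataset_score(reads):
    """reads: iterable of (matches, mismatches, ins, dels, clipped) tuples."""
    return sum(alignment_score(*r) for r in reads)
```

This is why greedily maximizing matched bases backfires: every extra gap opened in the reference to manufacture matches also subtracts from the score.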
For instance, among all reads for which both lordFAST and BLASR report an alignment, 90.84% of the alignments reported by BLASR are covered by lordFAST, while only 88.28% of the alignments reported by lordFAST are covered by BLASR. As can be observed, lordFAST alignments provide high coverage of the alignments obtained by the alternative tools. In addition, in Table 3.8, we compared the performance of the tools on reads for which their alignments do not agree. To give an example, there are 2,930 reads for which BLASR does not cover the alignments of lordFAST; for those reads, BLASR reports alignments with an average of 28.84% lower identity. In contrast, there are 2,094 reads for which lordFAST does not cover BLASR's alignments; for those reads, on average, lordFAST's alignments have only 7.40% lower identity than BLASR's. In the absence of true mappings for the real dataset, the information in Tables 3.7 and 3.8 provides additional support for the reliability of lordFAST's alignments. Finally, we benchmarked the performance of each tool using multiple threads. Figures 3.6 and 3.7 depict the runtime and memory requirements of all the tools we tested on this dataset when using multiple threads.

Table 3.6: Evaluation of the performance of various long read mappers on a real human dataset. This dataset includes 23,155 reads and 178.45 million bases.

Mapper | Mapped reads | Mapped bases (Mb) | Matched bases (Mb) | Alignment score | Time^b (sec) | Memory^b (GB)
BLASR | 22,866 | 163.11 | 148.58 | 108,002,225 | 12,243 | 14.96
BWA-MEM | 22,913 | 170.76 | 154.15 | 119,117,389 | 8,810 | 5.25
GraphMap | 22,159 | 169.57 | 151.93 | 113,717,041 | 17,745 | 42.56
LAMSA | 23,154 | 173.90 | 155.68 | 122,035,697 | 2,040 | 6.29
rHAT | 23,136 | 159.99 | 142.40 | 92,824,214 | 1,769 | 13.95
NGMLR | 21,295 | 155.83 | 143.06 | 97,830,317 | 4,629 | 5.43
Minimap2 | 22,818 | 170.97 | 154.78 | 119,673,199 | 262 | 6.57
minialign | 23,006 | 152.61 | 139.22 | 89,538,289 | 207 | 12.70
lordFAST | 22,961 | 176.18 | 157.42 | 124,826,081 | 765 | 5.43
Note: Given a single mapped read, suppose nMatch is the number of matched bases, qLen is the length of the read, qStart and qEnd denote the start and end coordinates of the mapping on the read, and tStart and tEnd denote the start and end coordinates of the mapping on the reference.
a For each read, the ground truth region is the region of the reference that is shared by mappings from at least 4 mappers. A read mapping reported by a mapper is considered to be "correctly" mapped if it overlaps at least 90% of the bases of the ground truth region. The number in parentheses shows the percentage of the total number of reads that are "correctly" mapped.
b The running time and peak memory usage are measured using the /usr/bin/time -v Unix command.

Figure 3.5: Examples of covering and non-covering alignments. Suppose x, y, z1, z2, z3, and z4 are different alignments of the same read. In this figure, alignments x and y cover each other, as they span subsequences on the reference genome that overlap by at least 90%. Alignments x and y cover alignments z1 and z2, but not alignments z3 and z4. On the other hand, alignments z1, z2, z3, and z4 do not cover either alignment x or y.
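The (asymmetric) "covers" relation can be sketched directly from its definition:

```python
# Sketch of the "covers" relation: alignment x covers alignment y iff the
# reference interval of x overlaps at least 90% of the reference interval of y.
# Note the asymmetry: a long alignment can cover a short one without the
# converse holding.

def covers(x, y, frac=0.9):
    """x, y: (chrom, start, end) reference intervals of two alignments."""
    if x[0] != y[0]:
        return False
    ov = max(0, min(x[2], y[2]) - max(x[1], y[1]))
    return ov >= frac * (y[2] - y[1])
```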
Table 3.7: Agreement of different methods in reporting alignments.

Row \ Column | BLASR | BWA | GraphMap | LAMSA | rHAT | NGMLR | Minimap2 | minialign | lordFAST
BLASR | N/A | 92.38 | 90.13 | 89.40 | 89.57 | 97.10 | 93.04 | 91.82 | 88.28
BWA | 90.10 | N/A | 87.45 | 87.25 | 86.99 | 95.11 | 90.98 | 91.09 | 86.34
GraphMap | 92.47 | 92.55 | N/A | 89.06 | 90.76 | 96.69 | 92.54 | 91.71 | 91.59
LAMSA | 85.74 | 87.06 | 83.91 | N/A | 84.12 | 91.02 | 86.13 | 88.11 | 83.89
rHAT | 90.51 | 89.87 | 90.62 | 87.85 | N/A | 93.84 | 89.96 | 89.41 | 88.21
NGMLR | 86.33 | 87.02 | 84.88 | 83.72 | 83.67 | N/A | 86.93 | 86.99 | 82.47
Minimap2 | 92.89 | 93.45 | 89.83 | 89.17 | 89.35 | 97.48 | N/A | 93.01 | 88.25
minialign | 79.97 | 81.54 | 77.40 | 78.49 | 77.57 | 84.92 | 80.88 | N/A | 77.11
lordFAST | 90.84 | 91.76 | 91.96 | 89.20 | 88.77 | 94.44 | 91.01 | 91.79 | N/A
Note: Each row shows the percentage of best alignments from the corresponding mapper that cover alignments from the other mappers. Note that this table is not symmetric.
Table 3.8: The performance of different methods on reads for which their alignments do not agree.
Each cell shows the average identity difference, with the number of reads in parentheses.

Row \ Column | BLASR | BWA | GraphMap | LAMSA | rHAT | NGMLR | Minimap2 | minialign | lordFAST
BLASR | N/A | -22.85 (1747) | -29.05 (2187) | 6.47 (2454) | -8.57 (2414) | 10.61 (617) | -20.92 (1585) | -7.92 (1882) | -28.84 (2930)
BWA | -4.78 (2264) | N/A | -25.10 (2780) | 14.25 (2951) | -2.58 (3010) | -2.16 (1041) | -11.00 (2059) | 5.89 (2049) | -20.97 (3074)
GraphMap | -16.66 (1721) | -34.19 (1708) | N/A | 5.78 (2533) | -10.97 (2137) | 3.05 (704) | -30.42 (1702) | -16.13 (1907) | -36.39 (2033)
LAMSA | -25.88 (3261) | -30.42 (2964) | -38.40 (3566) | N/A | -22.07 (3673) | -29.02 (1913) | -31.68 (3166) | -18.04 (2735) | -37.96 (4047)
rHAT | -17.68 (2171) | -25.53 (2320) | -41.22 (2079) | 0.17 (2814) | N/A | -22.59 (1312) | -25.16 (2291) | -12.33 (2436) | -37.63 (2723)
NGMLR | -37.90 (3126) | -46.62 (2973) | -43.60 (3351) | -21.69 (3770) | -34.93 (3778) | N/A | -46.60 (2983) | -36.96 (2992) | -44.03 (4024)
Minimap2 | -0.11 (1626) | -13.12 (1501) | -25.17 (2253) | 15.10 (2508) | -1.46 (2464) | 9.39 (537) | N/A | 3.23 (1608) | -16.81 (2697)
minialign | -26.03 (4579) | -28.14 (4229) | -34.78 (5007) | -9.67 (4981) | -22.32 (5190) | -29.23 (3211) | -30.00 (4366) | N/A | -33.90 (4530)
lordFAST | -7.40 (2094) | -17.31 (1887) | -29.20 (1781) | 17.64 (2500) | -2.82 (2598) | -14.90 (1183) | -17.58 (2051) | 1.84 (1889) | N/A
Note: Each row shows the performance superiority of the corresponding method over other methods for the inconsistent alignments, in terms of the average identity difference. The numbers in parentheses (for each row) show the number of reads for which the corresponding method reports alignments that do not cover alignments of other methods. Note that this table is not symmetric.
3.3 Summary
In this chapter, we presented lordFAST, a fast and highly sensitive mapping tool for long noisy reads. Its sparse anchor extraction strategy has an important impact on the speed of its chaining step. Our experiment on simulated data showed that despite using a small number of anchors, lordFAST not only maps more reads to their true originating regions than its competitors but is also highly accurate in base-level alignment (see Table 3.1). In addition, lordFAST provides both clipped and split alignments of the reads. This makes lordFAST appropriate for aligning reads originating from regions with long structural variations (SVs), simplifying the downstream analysis of its alignments for the task of variation discovery.
Figure 3.6: Run-time comparison of different methods for mapping 23,155 real human reads using different threads. Note that the y-axis is in logarithmic scale.
Figure 3.7: Memory comparison of different methods for mapping 23,155 real human reads using different threads. Note that the y-axis is in logarithmic scale.
Chapter 4
Hybrid error correction of long reads
In order to improve the quality of noisy long reads, such as those generated by the PacBio or Oxford Nanopore technologies, several tools have been developed (see [87] for a review of error correction tools). These tools can be classified into two categories: (i) self-correcting methods and (ii) hybrid methods. In the "self-correcting" approach, the idea is to correct the long reads using only the long reads themselves. In this approach, a multiple sequence alignment between the reads is built from the pairwise alignment of every two long reads (all-versus-all alignment). Based on this alignment, a consensus sequence with higher base-level quality is built. This approach has been implemented in HGAP [24], a non-hybrid assembler that can handle bacterial genome data. The recently introduced assembler Canu [85] relies on the idea of local hashing to detect overlaps between long reads and assembles them using an overlap graph. On the other hand, hybrid methods (e.g., PacBioToCA [84], LSC [6], proovread [59], LoRDEC [141]) jointly utilize the high-quality short reads and the noisy long reads to correct the long reads. PacBioToCA and LSC map the short reads (e.g., Illumina reads) onto the long reads and correct the long reads by calling a consensus of these short read mappings; proovread uses a similar idea but performs an iterative procedure of mapping and correcting with successively increasing sensitivity. A different approach, akin to local assembly, is followed by Nanocorr [53] (developed for correcting Oxford Nanopore long reads) and LoRDEC [141]. Nanocorr relies on computing a Longest Increasing Subsequence (LIS) of overlapping reads. In contrast, LoRDEC builds a De Bruijn graph from the short reads and then aligns each long read to this De Bruijn graph by finding a path between solid regions of the long read that aims at minimizing the edit distance to the region sequence.
One of the main drawbacks of the self-correcting approach is that it requires substantial computational power to perform the all-versus-all alignment of the long reads needed to find overlaps between them, although recent advances require fewer resources [11]. More importantly, self-correcting methods require at least 50x coverage of long reads [83] in order to find all-versus-all overlaps that can be used for error correction. Considering the low throughput of single-molecule sequencing technologies, obtaining 50x coverage is costly. The advantage of the hybrid approach comes from the fact that high-throughput short reads can be generated at a much lower cost, complementing low-coverage long reads from the same donor. Here, we introduce CoLoRMap, a hybrid method that takes advantage of high-quality short reads to correct noisy long reads. Similar to LSC and PacBioToCA, CoLoRMap maps the short reads onto the long reads as a first step; unlike those tools, however, it does not look for a consensus base call at each position. Instead, it formulates the correction of a long read region as a local assembly problem that aims at finding an optimal path of overlapping mapped short reads minimizing the edit score to the long read region, a problem that can be solved exactly using a classical Shortest Path (SP) algorithm. Thus our criterion differs from the one defined in Nanocorr, which is based on the Longest Increasing Subsequence approach, although the general principle is similar 1. In a second step, CoLoRMap addresses the correction of long read regions where, due for example to a higher error rate, no short read maps (called gaps), using the idea of de novo assembly of One-End Anchors (OEAs), which are unmapped reads whose mates map to a flanking corrected region.
4.1 Methods
4.1.1 Overview
Similar to most hybrid methods for error correction, CoLoRMap takes as input two sets of reads, namely short reads and long reads from the same donor. CoLoRMap starts by mapping the short reads to the long reads using BWA-MEM [96]. It then uses the set of mappings obtained from BWA-MEM to build a graph structure akin to an overlap graph. Using a polynomial-time Shortest Path (SP) algorithm, CoLoRMap can then reconstruct a sequence of overlapping mapped short reads that minimizes the edit score to the covered long read region and can be used as the corrected sequence for this region. As both short and long reads are sequenced from the same donor, mapped short reads usually cover a large portion of the long reads (see Table 4.4 for supporting results). However, since they are mapped to noisy long reads, there are regions on the long reads that are not covered by any short read; we call these regions gaps, and they are located either at the extremities of the long reads or between two corrected regions. In a second step, CoLoRMap attempts to expand the corrected regions using One-End Anchors (OEAs), which are reads that are not mapped to the long reads but whose corresponding mates are mapped to
1Note, however, that at the time of publishing CoLoRMap, the precise definition of the objective function used in Nanocorr was not available; it is only stated that it "penalizes overlaps while maximizing alignment lengths and accuracy".
a corrected region on the long reads. For each gap, CoLoRMap then employs Minia [23] to perform a local assembly of the set of OEAs associated with the gap and uses the obtained contigs to correct the gap.
4.1.2 Initial correction of long reads: the SP algorithm
For the sake of simplicity, here we explain the process of correcting a single long read, L, as this process is independent of the correction of the other long reads.
Preliminaries. For a string S = s_1 s_2 ... s_k, |S| = k denotes the length of S. The i-th character of S is denoted by S[i]. A substring of S is denoted by S_{i,j} = s_i s_{i+1} ... s_j, where i, j ∈ {1, ..., k} and i ≤ j. An alignment between two strings over the alphabet {A, C, G, T} is a sequence of pairs of elements from {A, C, G, T, −}^2 \ {(−, −)}.
Let M = {m_1, m_2, ..., m_n} be the set of mappings of the short reads onto L, where each m_i is represented by three pieces of information: m_i.bp denotes the beginning position (leftmost position) of the mapping on L, m_i.ep denotes the end position (rightmost position) of the alignment on L, and m_i.seq denotes the actual sequence of the short read aligned to L. Note that some mapping tools may clip the beginning or end of the query and align only a substring of the query to the target long read; this does not impact our method. Nevertheless, for the sake of exposition, even if a short read has been clipped during the mapping process, we keep calling it a read.
Weighted alignment graph construction. CoLoRMap builds a weighted graph O_L from M. Each node in O_L corresponds to a mapping m_i from M, and there is an edge between two nodes if their corresponding mappings overlap on the long read L, as defined below:
Definition 4.1. Two mappings m_i and m_j overlap iff
(i) m_i.bp ≤ m_j.bp and m_i.ep < m_j.ep;
(ii) m_j.bp ≤ m_i.ep − minOverlap + 1;
(iii) the respective substrings of both short reads that belong to the overlap are identical;
where minOverlap is the minimum required overlap length.
Based on this definition, we insert edges into O_L only for exactly matching overlaps. However, if there is a single mismatch between the overlapping parts of two reads, we replace the lower-quality base at the mismatching position with the higher-quality base so that the overlap becomes an exact matching overlap, and we can then add the corresponding edge to the graph. This change does not modify the original read sequence and is limited to the content of the inserted edge.
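The overlap test of Definition 4.1 can be sketched as follows. This is a minimal illustration, not CoLoRMap's implementation: the `Mapping` class and `overlap_ok` name are ours, and condition (iii) is simplified by assuming each read is laid out gaplessly over its mapped interval on L.

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    bp: int   # leftmost position of the alignment on the long read L
    ep: int   # rightmost position of the alignment on L
    seq: str  # sequence of the short read as aligned to L

def overlap_ok(mi, mj, min_overlap):
    """Check conditions (i)-(iii) of Definition 4.1 for mappings mi, mj."""
    # (i) mj starts no earlier than mi and ends strictly later
    if not (mi.bp <= mj.bp and mi.ep < mj.ep):
        return False
    # (ii) the shared interval on L is at least min_overlap long
    if not (mj.bp <= mi.ep - min_overlap + 1):
        return False
    # (iii) the overlapping substrings of the two reads are identical;
    # we index by positions on L, assuming a gapless layout (simplification)
    start, end = mj.bp, mi.ep
    return mi.seq[start - mi.bp:end - mi.bp + 1] == mj.seq[:end - mj.bp + 1]
```

In the real setting the overlap substrings are determined by the alignments reported by the mapper, which may contain indels; the gapless indexing above only serves to make the three conditions concrete.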
Figure 4.1: (a) The notion of overlap for mappings. For two overlapping mappings mi and mj, the weight of the corresponding edge is set to the edit distance between the suffix of mj.seq and its aligned region in L (marked by red in this figure). (b) Reconstruction of the corrected sequence spelled from the shortest path. The spelled string can be easily obtained by concatenation of mapping suffixes from the shortest path.
The weight of the edge associated with an overlap between m_i and m_j, denoted w_ij, is defined as the edit distance between m_j.seq_{x,y} and the corresponding region of L, where x is the position on m_j.seq that is aligned with L[m_i.ep] and y = |m_j.seq|. In other words, the edit distance is calculated from the suffix of m_j.seq that does not belong to the overlap (as shown in Figure 4.1.a). The motivation behind choosing such a weight function is the following observation:
Property 4.1. In a connected component of the weighted alignment graph O_L, consider the leftmost mapping as the source node and the rightmost mapping as the target node. For each path in O_L, we define its edit score as the sum of the edit distances of the overlap suffixes (shown in red in Figure 4.1.a) along that path. If the overlaps in O_L are exact matching overlaps, the shortest path from source to target defines a sequence of overlapping mapped short reads that minimizes the edit score to the covered region of L among all such sequences.
The observation above does not imply that we always obtain a sequence of overlapping short reads that minimizes the edit score to a region of L among all possible such sequences (the general principle underlying the method LoRDEC, for example), as the sequence of overlapping reads is constrained by the set of initial mappings M.
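The edge weight w_ij (edit distance of the non-overlap suffix of m_j.seq against the region of L it spans) can be sketched as below. The function names and the 0-based, gapless indexing conventions are ours, chosen for illustration; CoLoRMap's actual coordinates come from the reported alignments.

```python
def edit_distance(a, b):
    """Classical O(|a||b|) dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match / mismatch
        prev = cur
    return prev[-1]

def edge_weight(mi, mj, L):
    """w_ij: edit distance between the suffix m_j.seq_{x,y}, where x aligns
    with L[m_i.ep], and the region L[m_i.ep .. m_j.ep] (a sketch)."""
    x = mi.ep - mj.bp                     # offset in mj.seq aligned with L[mi.ep]
    return edit_distance(mj.seq[x:], L[mi.ep:mj.ep + 1])
```

The quadratic DP suffices here because the suffixes compared are short (at most one short read length).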
Thus, for each connected component of O_L, we define a source node, which is the leftmost mapping of the component, and a target node, which is the rightmost mapping. CoLoRMap then uses Dijkstra's shortest path algorithm to find the shortest path, p, from the source node to the target node. A string can be spelled from p using the sequences of the mappings along p (see Figure 4.1.b for a toy example), and this string is used as the corrected string of the region of L spanned by the mappings of the connected component. For each connected component, CoLoRMap replaces the uncorrected string on L (starting at the source mapping and ending at the target mapping) with the spelled string. CoLoRMap can perform several rounds of correction using the SP algorithm explained above: mapping short reads onto long reads in a second pass yields higher coverage and more consistent mappings. Preliminary experiments showed that this helps obtain higher quality corrections, although at the cost of higher computation time.
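The shortest-path search and the spelling of the corrected string can be sketched as follows; `shortest_path`, `spell`, and the dict-based mapping representation are illustrative choices of ours, not CoLoRMap's data structures.

```python
import heapq

def shortest_path(edges, source, target):
    """Dijkstra's algorithm on the weighted alignment graph O_L.
    edges: dict mapping a node to a list of (neighbor, weight) pairs."""
    dist, parent = {source: 0}, {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue                       # stale heap entry
        for v, w in edges.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path, u = [], target
    while u is not None:                   # walk parents back to the source
        path.append(u)
        u = parent[u]
    return path[::-1], dist[target]

def spell(path, maps):
    """Spell the corrected string: the first mapping's sequence followed by
    the non-overlap suffix of each subsequent mapping on the path."""
    out = maps[path[0]]["seq"]
    for p, c in zip(path, path[1:]):
        out += maps[c]["seq"][maps[p]["ep"] - maps[c]["bp"] + 1:]
    return out
```

For example, two mappings covering L[0..3] and L[2..5] with an exact 2-bp overlap spell a single 6-bp corrected string.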
Complexity analysis. Let n be the total number of alignments on a single long read. Building the weighted alignment graph takes O(n^2) time in the worst case. The shortest path can be computed in O(e log n) time, where e is the number of edges of the graph.
Mapping parameters. For mapping short reads to the long reads, CoLoRMap runs BWA-MEM with options -aY -A 5 -B 11 -O 2,1 -E 4,3 -k 8 -W 16 -w 40 -r 1 -D 0 -y 20 -L 30,30 -T 2.5, which are similar to the parameters used by proovread [59] except for shorter seeds, chosen for higher sensitivity. It is important to note that since BWA-MEM does not guarantee to report all mappings of each short read, breaking the set of long reads into smaller chunks yields a higher coverage of short read mappings on the long reads. Table 4.8 shows the result of our experiment on how chunking enhances the quality of correction with CoLoRMap. However, mapping to chunks takes longer than mapping to the whole set of long reads, so the choice of chunk size depends on the desired trade-off between accuracy and speed. CoLoRMap splits the long read set into chunks of about 50 Mbp and performs correction on each chunk separately.
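The chunking step can be sketched as a simple greedy split over the long reads; the function name and the (name, sequence) pair representation are ours, and FASTA parsing is omitted for brevity.

```python
def chunk_long_reads(reads, chunk_bp=50_000_000):
    """Greedily split (name, sequence) long reads into chunks of roughly
    chunk_bp total bases; each chunk is then corrected independently."""
    chunks, cur, cur_bp = [], [], 0
    for name, seq in reads:
        cur.append((name, seq))
        cur_bp += len(seq)
        if cur_bp >= chunk_bp:        # close the current chunk
            chunks.append(cur)
            cur, cur_bp = [], 0
    if cur:                           # flush the final partial chunk
        chunks.append(cur)
    return chunks
```

Each chunk is then used as a separate mapping target for the short reads, which is what recovers the mappings that BWA-MEM would otherwise drop.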
4.1.3 Correcting gaps using One-End Anchors
Although the previous correction step can correct a large part of many long reads, there are usually some regions of the long reads containing so many sequencing errors that no short read can align there (the so-called gaps). For example, we observed regions on some long reads where the maximum exact match with the reference genome is only four base pairs long (Figure 4.2 depicts an example of such a region). Therefore, these uncovered regions cannot be corrected through a mapping-based approach. More generally, it is natural to ask
if correcting such regions by optimizing some notion of distance between the long read region and the short read mappings is even relevant. Nevertheless, correcting these regions is essential in order to obtain higher quality long reads. To address this issue, CoLoRMap uses One-End Anchors (OEAs) to correct these regions. Again, for a long read L, a One-End Anchor is a short read that did not map to L but whose mate read did map to L. It is important to note that since both short reads and long reads come from the same donor genome, it is possible to properly identify a set of OEAs for each such gap (uncorrected region) by looking at the mappings in its flanking corrected regions, which correspond to connected components of the weighted alignment graph O_L. Suppose R = {r_1, r′_1, r_2, r′_2, ..., r_n, r′_n} is the set of input paired-end short reads with mean library insert size δ and standard deviation σ, where r_i and r′_i are mates. We map this set of short reads to the input set of corrected long reads using BWA [96], a mapping tool optimized for Illumina short reads. As a result, short reads are easily mapped to the corrected regions. Consider the case of an uncorrected region on a long read L surrounded by two regions corrected during the initial correction step.
Definition 4.2. Let L_{p,q} be a gap of L flanked by two corrected regions L_{i,p−1} and L_{q+1,j}. A read r is an OEA for the gap if
(i) its mate r′, or the reverse complement of r′, is mapped to a flanking region of the gap with the proper orientation, indicating that its mate could belong to the gap;
(ii) r is not mapped to L_{i,j}, or is mapped only partially over one of the boundaries of the gap;
(iii) the distance from the mapping position of r′ to the gap is at most δ + 3σ.
So, after obtaining all the mappings of the short reads to L with BWA, CoLoRMap records, for each gap, the set of corresponding OEAs. Figure 4.3 depicts an instance of a gap and how OEAs are extracted. The sequences of the recorded OEAs are then fed to the assembly tool Minia [23] to obtain contigs; Minia v2.0.3 was run with parameters -kmer-size 43 -abundance-min 1. Minia was chosen for its ease of use and low computational resource requirements.
Figure 4.2: An example of a gap (region uncovered by short reads) on a long read, exported from the IGV software. There are so many sequencing errors that mapping short reads in this region is very challenging. In the region shown here, the maximum exact match between the long read and the reference genome is 4 bp long, in a region of size ≈ 150 bp.
Figure 4.3: Detecting One-End Anchors (OEAs) for a gap (uncorrected region). OEAs, shown in red, are unmapped or partially mapped reads whose mates, shown in blue, are mapped to corrected regions concordantly (with proper orientation and distance). The assembled contigs, shown in light green, are used to improve the quality of the gap region.
4.2 Results
4.2.1 Data and computational setting
We performed experiments on three data sets: a bacterial genome data set from Escherichia coli, and two eukaryotic ones from Saccharomyces cerevisiae (yeast) and Drosophila melanogaster (fruit fly). For each genome, we obtained a set of PacBio (noisy) long reads, comprising 98 Mbp, 1.4 Gbp, and 1.35 Gbp, respectively. We also obtained a set of high-quality Illumina paired-end short reads for each genome, containing 234 Mbp, 455 Mbp, and 7 Gbp, respectively (more details are available in Appendix B.1). We compared CoLoRMap with PacBioToCA, LSC, proovread, and LoRDEC. PacBioToCA, LSC, and proovread were run with default parameters except for the number of threads. For LoRDEC v0.6, we used options -k 19 -s 3 -e 0.4 -b 200 -t 5 as explained in [141]. For the E. coli data set, experiments were performed on a local workstation equipped with a Xeon E3-1270 v3 processor (CPU clock speed: 3.5 GHz, 8 cores), 32 GB of main memory, and 2 TB of locally attached hard disk. Table 4.1 provides a comparison of these tools in terms of running time for the E. coli data set. For the larger yeast and D. melanogaster data sets, the experiments were performed on multiple computers, so we do not provide a running time comparison.
Table 4.1: Runtime of different correction methods for E. coli dataset.
Data     #Threads  Method          Elapsed time (minutes)
E. coli  8         PacBioToCA      97m
                   LSC             387m
                   proovread       105m
                   LoRDEC          7m
                   CoLoRMap        38m
                   CoLoRMap+OEA    38m + 81m
Notes: The Linux/Unix "time" command was used for reporting the runtime.
In the following, we describe our evaluation approach for comparing the corrected long reads obtained by the different considered methods.
4.2.2 Measures of evaluation
To assess the performance of the correction methods, we followed Salmela et al. [141] and investigated how well corrected long reads align to the reference genome, followed by checking how well corrected long reads can be used for de novo assembly. To map long reads to the reference genome, we used BLASR [18] and BWA-MEM [96]. The rationale behind using both tools for evaluation is the observation that there are usually some reads for which one tool finds mappings while the other reports none. BLASR is specifically designed for aligning PacBio long reads to a reference sequence; running BLASR with options -noSplitSubreads -bestn 1 gives a single best alignment for each long read. BWA-MEM is a fast alignment tool that supports mapping of long reads to a reference sequence and can handle noisy PacBio long reads via option -x pacbio. It is important to note that BWA-MEM often reports a multi-piece mapping for a long read rather than one contiguous alignment. In our evaluations, we still consider all such fragmented alignments of a long read, provided the distance between the mapping positions of the fragments on the reference is not larger than the length of the long read. The first evaluation measure we considered is the number of long reads that align to the reference genome. We also recorded the number of aligned bases in corrected long reads and the number of bases that match the reference in the alignment. Finally, we computed a notion of identity following Salmela et al. [141], defined as the number of base matches over the length of the aligned region in the reference genome.
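The identity measure can be illustrated on an explicit pairwise alignment. The function below is our sketch: it takes the alignment as two equal-length gapped strings rather than computing it, and `alignment_identity` is an illustrative name.

```python
def alignment_identity(query_aln, ref_aln):
    """Identity of a pairwise alignment given as two equal-length gapped
    strings: base matches divided by the number of reference bases aligned."""
    assert len(query_aln) == len(ref_aln)
    matches = sum(q == r and q != "-" for q, r in zip(query_aln, ref_aln))
    ref_len = sum(r != "-" for r in ref_aln)   # aligned region length on the reference
    return matches / ref_len
```

For example, the alignment of "ACGT" against "ACGAT" with one reference insertion has 4 matches over 5 reference bases, i.e., 80% identity.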
Trimming and splitting corrected reads. Among the compared correction tools, CoLoRMap and LoRDEC report full long reads with corrected high-quality regions indicated in upper case and uncorrected regions in lower case. proovread outputs both full corrected long reads (without marking the corrected regions, though) and corrected regions as separate sequences. PacBioToCA, however, outputs only the corrected regions of long reads as separate sequences. We evaluate the full long reads obtained from CoLoRMap, LSC, LoRDEC, and proovread, as well as the trimmed long reads, obtained by removing all uncorrected bases from both extremities of a long read while keeping gaps (uncorrected regions flanked by corrected regions). In order to compare with PacBioToCA and proovread, we also evaluated the split long reads from CoLoRMap and LoRDEC, obtained by extracting only the corrected regions from the corrected long reads, each such region being considered a separate sequence.
4.2.3 Comparison based on alignment
The results of our experiments are summarized in Tables 4.2–4.3, based on alignments from BLASR (Table 4.2) or BWA-MEM (Table 4.3). We can observe that CoLoRMap performs best in terms of the number of corrected reads that align back onto the reference genome, while maintaining a high average identity, although slightly lower than PacBioToCA, LoRDEC, and proovread. It is also interesting to observe that the OEA step results in a non-negligible increase in the size of the corrected regions (see Table 4.4), while also increasing the average identity of the trimmed reads. In terms of corrected regions, proovread computes the longest ones, and it might be interesting to see whether the hierarchical approach of proovread can be combined with our algorithm.
4.2.4 Comparison based on assembly
In addition to comparing the quality of corrected long reads, we also investigated how well corrected long reads from different tools can be utilized in a downstream analysis task. We chose de novo assembly as the task, since a specialized assembler for noisy long reads, Canu [85], is available. To assess the quality of the assembled contigs, we used QUAST [56]. Tables 4.5–4.7 show the output of QUAST for assemblies obtained by running Canu on the sets of long reads corrected by the different correction tools. For the E. coli and yeast data sets, the set of contigs assembled from our corrected long reads has the highest NGA50, a lower number of mismatches and indels, and covers the reference genome better. The assemblies of the D. melanogaster data set, however, do not seem reliable, as they cover only a tiny fraction of the reference genome. This might be due to the low coverage of the long reads (9.7x, whereas Canu suggests a coverage of at least about 50x).
Table 4.2: Quality of corrected long reads for E. coli, yeast, and fruit fly datasets obtained with different methods. Assessment is based on alignments of long reads to the reference genome obtained with BLASR.
Dataset  Method  #Reads^a  #Reads aligned^b  #Bases aligned^c (Mb)  Size^d (%)  Matched^e (%)  Identity^f (%)  Coverage^g (%)
E. coli (Full):
Original 33360 31071 86.64 88.40 76.95 94.84 100.00
LSC 25426 25098 77.51 92.63 86.00 97.55 100.00
proovread 24722 23453 71.32 89.36 87.90 99.70 100.00
LoRDEC 33360 30837 79.37 86.91 85.24 99.48 100.00
CoLoRMap 33360 31271 83.34 89.92 87.53 99.27 100.00
CoLoRMap+OEA 33360 31215 82.92 89.66 87.58 99.38 100.00
E. coli (Trim):
LSC 25426 25226 72.52 95.37 89.55 97.92 100.00
LoRDEC 31733 30969 79.25 93.27 92.01 99.68 100.00
CoLoRMap 30396 30190 76.67 96.26 94.24 99.46 100.00
CoLoRMap+OEA 30396 30183 76.43 96.21 94.56 99.58 100.00
E. coli (Split):
PacBioToCA 100100 99668 68.21 98.51 98.48 99.94 99.71
proovread 30479 30456 71.40 99.34 99.22 99.97 99.66
LoRDEC 49018 41437 79.77 99.02 98.96 99.96 99.82
CoLoRMap 48987 48840 73.73 99.11 98.99 99.90 99.91
CoLoRMap+OEA 40256 40101 74.57 98.99 98.84 99.89 99.91
Yeast (Full):
Original 231594 224694 1229.72 87.68 78.84 93.87 99.77
proovread 229702 222976 1205.71 87.99 83.13 96.38 99.82
LoRDEC 231594 221692 1171.49 86.11 83.48 98.38 99.82
CoLoRMap 231594 223641 1207.73 88.60 85.62 98.30 99.83
CoLoRMap+OEA 231594 223497 1205.65 88.55 85.72 98.40 99.83
Yeast (Trim):
LoRDEC 228893 221902 1175.30 89.12 86.60 98.51 99.81
CoLoRMap 211324 208188 1017.55 92.84 90.46 98.79 99.82
CoLoRMap+OEA 211324 208310 1017.39 92.95 90.76 98.92 99.82
Yeast (Split):
proovread 225878 225497 244.48 99.53 99.39 99.84 60.49
LoRDEC 1460179 919020 1120.63 96.78 96.30 99.50 99.77
CoLoRMap 435140 432750 943.50 97.56 97.29 99.69 99.79
CoLoRMap+OEA 349998 347516 953.00 97.26 96.95 99.66 99.79
Fruit fly (Full):
Original 901564 313983 502.90 37.05 33.20 94.60 93.68
LoRDEC 901564 342784 499.02 37.34 35.27 97.16 93.91
CoLoRMap 901564 348810 535.90 40.23 38.39 97.96 94.65
Fruit fly (Trim):
LoRDEC 665298 348924 493.09 45.13 42.73 97.27 93.73
CoLoRMap 286679 256775 324.98 68.98 66.34 98.46 85.53
Fruit fly (Split):
LoRDEC 4303563 1366425 558.80 77.65 76.80 98.82 92.12
CoLoRMap 453006 415526 337.99 89.04 88.45 99.29 85.63
Notes: ^a number of DNA sequences available after running the correction tool (may contain uncorrected sequences); for the original data set, the total number of long reads. ^b number of aligned sequences. ^c number of bases aligned to the reference genome. ^d percentage of aligned bases, i.e., column c / summed length of sequences in column a. ^e percentage of matched bases, i.e., total number of matched bases / summed length of sequences in column a. ^f average identity, i.e., total number of matched bases / summed length of aligned regions in the reference genome. ^g percentage of the reference genome covered by the aligned sequences.
Table 4.3: Quality of corrected long reads for E. coli and yeast datasets obtained with different methods. Assessment is based on alignments of long reads to the reference genome obtained with BWA-MEM.
Dataset  Method  #Reads^a  #Reads aligned^b  #Bases aligned^c (Mb)  Size^d (%)  Matched^e (%)  Identity^f (%)  Coverage^g (%)
E. coli (Full):
Original 33360 30830 86.69 88.45 76.66 94.07 100.00
LSC 25426 25403 77.87 93.06 86.46 97.20 100.00
proovread 24722 24046 73.29 91.83 90.89 99.69 100.00
LoRDEC 33360 31371 82.33 90.16 88.74 99.44 100.00
CoLoRMap 33360 31693 84.69 91.37 89.34 99.20 100.00
CoLoRMap+OEA 33360 31693 84.51 91.39 89.67 99.33 100.00
E. coli (Trim):
LSC 25426 25402 72.26 95.02 89.47 97.68 100.00
LoRDEC 31733 31320 80.14 94.32 93.49 99.69 100.00
CoLoRMap 30396 30392 76.69 96.28 94.77 99.45 100.00
CoLoRMap+OEA 30396 30392 76.50 96.29 95.17 99.59 100.00
E. coli (Split):
PacBioToCA 100100 100006 69.10 99.80 99.77 99.95 99.81
proovread 30479 30477 71.52 99.50 99.40 99.97 99.67
LoRDEC 49018 41679 80.04 99.33 99.28 99.96 99.83
CoLoRMap 48987 48965 74.26 99.82 99.70 99.91 99.91
CoLoRMap+OEA 40256 40235 75.17 99.79 99.65 99.90 99.91
Yeast (Full):
Original 231594 136943 742.47 52.94 47.05 92.79 99.69
proovread 229702 223719 1216.55 88.78 83.79 95.86 99.75
LoRDEC 231594 226827 1223.76 89.96 87.50 98.21 99.71
CoLoRMap 231594 228484 1240.42 91.00 88.42 98.14 99.71
CoLoRMap+OEA 231594 228484 1239.54 91.03 88.66 98.27 99.70
Yeast (Trim):
LoRDEC 228893 226632 1206.11 91.46 89.25 98.40 99.71
CoLoRMap 211324 211206 1029.58 93.94 92.17 98.77 99.71
CoLoRMap+OEA 211324 211206 1028.61 93.98 92.47 98.93 99.70
Yeast (Split):
proovread 225878 225670 245.18 99.82 99.66 99.82 60.64
LoRDEC 1460179 925878 1133.58 97.90 97.41 99.52 99.72
CoLoRMap 435140 434418 961.26 99.40 99.14 99.74 99.71
CoLoRMap+OEA 349998 349421 973.63 99.37 99.07 99.72 99.70
Notes: ^a number of DNA sequences available after running the correction tool (may contain uncorrected sequences); for the original data set, the total number of long reads. ^b number of aligned sequences. ^c number of bases aligned to the reference genome. ^d percentage of aligned bases, i.e., column c / summed length of sequences in column a. ^e percentage of matched bases, i.e., total number of matched bases / summed length of sequences in column a. ^f average identity, i.e., total number of matched bases / summed length of aligned regions in the reference genome. ^g percentage of the reference genome covered by the aligned sequences.
Table 4.4: Statistics of corrected and uncorrected regions after correction with different methods.
                            Corrected regions              Uncorrected regions (gaps)
Data set   Method           # regions  avg size  total     # regions  avg size  total
E. coli    Original         NA         NA        NA        33360      2938      98.01
           PacBioToCA       100100     691       69.24     NA         NA        NA
           proovread        30479      2358      71.87     NA         NA        NA
           LoRDEC           49018      1643      80.58     52696      203       10.74
           CoLoRMap         48987      1518      74.39     40999      446       18.29
           CoLoRMap+OEA     40256      1871      75.33     32268      531       17.15
Yeast      Original         NA         NA        NA        231594     6055      1402.46
           proovread        229702     5965      1370.27   NA         NA        NA
           LoRDEC           1460179    793       1157.93   1564253    129       202.47
           CoLoRMap         435140     2222      967.08    456717     867       396.02
           CoLoRMap+OEA     349998     2799      979.85    371575     1027      381.76
Fruit fly  Original         NA         NA        NA        901564     1505      1357.18
           LoRDEC           4303563    167       719.67    5006145    123       616.81
           CoLoRMap         453006     837       379.60    1191316    799       952.52
Table 4.5: Quality of Canu assemblies for the E. coli data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted.
Assembly                     Original    LoRDEC      proovread   CoLoRMap    CoLoRMap+OEA
# contigs (≥ 0 bp)           182         24          26          19          19
# contigs (≥ 1000 bp)        182         24          26          19          19
# contigs (≥ 5000 bp)        178         24          26          19          19
# contigs (≥ 10000 bp)       141         24          26          19          19
# contigs (≥ 50000 bp)       4           21          22          19          19
Total length (≥ 0 bp)        3508197     4623137     4629719     4624793     4627249
Total length (≥ 1000 bp)     3508197     4623137     4629719     4624793     4627249
Total length (≥ 5000 bp)     3492249     4623137     4629719     4624793     4627249
Total length (≥ 10000 bp)    3209268     4623137     4629719     4624793     4627249
Total length (≥ 25000 bp)    1710292     4623137     4616507     4624793     4627249
Total length (≥ 50000 bp)    228498      4495150     4492555     4624793     4627249
Largest contig               69266       920903      605792      1089140     1089205
Reference length             4641652     4641652     4641652     4641652     4641652
GC (%)                       51.05       50.81       50.81       50.81       50.81
Reference GC (%)             50.79       50.79       50.79       50.79       50.79
N50                          24663       226456      231774      239066      239066
NG50                         17847       226456      231774      239066      239066
L50                          48          6           7           5           5
LG50                         76          6           7           5           5
# unaligned contigs          0 + 0 part  0 + 0 part  0 + 0 part  0 + 0 part  0 + 0 part
Unaligned length             0           0           0           0           0
Genome fraction (%)          75.455      99.120      99.092      99.244      99.231
Duplication ratio            1.002       1.005       1.007       1.004       1.005
Largest alignment            69266       538466      398061      698643      698643
NA50                         24663       202095      198530      239066      239066
NGA50                        17847       202095      198530      239066      239066
LA50                         48          8           9           6           6
LGA50                        76          8           9           6           6
# misassemblies              0           6           7           5           6
# relocations                0           6           7           5           6
# translocations             0           0           0           0           0
# inversions                 0           0           0           0           0
# misassembled contigs       0           4           3           3           4
Misassembled contigs length  0           1328532     1076559     1277904     1446651
# local misassemblies        1           2           3           1           1
# N’s per 100 kbp            0.00        0.00        0.00        0.00        0.00
# mismatches per 100 kbp     8.17        15.63       18.00       6.64        7.36
# indels per 100 kbp         191.04      3.43        2.02        1.80        1.74
Indels length                7249        222         126         99          98
Table 4.6: Quality of Canu assemblies for the yeast data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted.
Assembly                     Original    LoRDEC      proovread   CoLoRMap    CoLoRMap+OEA
# contigs (≥ 0 bp)           26          28          32          24          29
# contigs (≥ 1000 bp)        26          28          32          24          29
# contigs (≥ 5000 bp)        26          28          31          22          28
# contigs (≥ 10000 bp)       26          27          30          21          28
# contigs (≥ 50000 bp)       22          19          24          19          20
Total length (≥ 0 bp)        12341981    12497078    12485995    12315869    12450479
Total length (≥ 1000 bp)     12341981    12497078    12485995    12315869    12450479
Total length (≥ 5000 bp)     12341981    12497078    12484209    12308283    12445656
Total length (≥ 10000 bp)    12341981    12490996    12474494    12302229    12445656
Total length (≥ 25000 bp)    12341981    12444116    12456794    12302229    12385648
Total length (≥ 50000 bp)    12218401    12257688    12279045    12239085    12217774
Largest contig               1543990     1552711     1537979     1555857     1538508
Reference length             12157105    12157105    12157105    12157105    12157105
GC (%)                       38.18       38.21       38.22       38.17       38.20
Reference GC (%)             38.15       38.15       38.15       38.15       38.15
N50                          777602      818962      777713      815158      932935
NG50                         777602      818962      777713      815158      932935
L50                          6           6           6           6           6
LG50                         6           6           6           6           6
# unaligned contigs          1 + 1 part  1 + 0 part  1 + 0 part  1 + 0 part  1 + 0 part
Unaligned length             27953       27982       42350       34077       29118
Genome fraction (%)          98.638      98.791      98.687      98.716      98.881
Duplication ratio            1.027       1.038       1.037       1.023       1.033
Largest alignment            1084893     1073237     1090741     1073302     1085688
NA50                         354598      377095      350112      377108      377106
NGA50                        354598      377095      350112      377108      377106
LA50                         11          11          11          11          11
LGA50                        11          11          11          11          11
# misassemblies              107         124         108         102         112
# relocations                26          42          29          30          31
# translocations             79          82          79          72          80
# inversions                 2           0           0           0           1
# misassembled contigs       21          25          24          19          24
Misassembled contigs length  10513374    12191557    10639582    11996690    10856637
# local misassemblies        31          11          14          11          12
# N’s per 100 kbp            0.00        0.00        0.00        0.14        0.00
# mismatches per 100 kbp     75.76       89.07       96.59       87.75       84.37
# indels per 100 kbp         25.83       19.92       21.04       13.64       13.66
Indels length                6573        5899        6112        4901        4627
Table 4.7: Quality of Canu assemblies for the D. melanogaster data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted.
Assembly                     Original     LoRDEC        CoLoRMap
# contigs (≥ 0 bp)           217          224           260
# contigs (≥ 1000 bp)        217          224           260
# contigs (≥ 5000 bp)        159          144           161
# contigs (≥ 10000 bp)       47           33            42
# contigs (≥ 50000 bp)       0            0             2
Total length (≥ 0 bp)        1768221      1730606       2106055
Total length (≥ 1000 bp)     1768221      1730606       2106055
Total length (≥ 5000 bp)     1543023      1410633       1726065
Total length (≥ 10000 bp)    735933       653134        910341
Total length (≥ 25000 bp)    58943        286003        488439
Total length (≥ 50000 bp)    0            0             142690
Largest contig               30023        42661         75766
Reference length             137567484    137567484     137567484
GC (%)                       38.17        37.92         38.22
Reference GC (%)             42.08        42.08         42.08
N50                          8620         7664          8485
L50                          64           58            58
# unaligned contigs          69 + 8 part  67 + 14 part  61 + 17 part
Unaligned length             770395       861325        986102
Genome fraction (%)          0.649        0.573         0.764
Duplication ratio            1.117        1.104         1.066
Largest alignment            16190        13571         17993
NA50                         1442         -             955
NGA50                        -            -             -
LA50                         177          -             238
# misassemblies              175          122           138
# relocations                117          73            83
# translocations             58           49            54
# inversions                 0            0             1
# misassembled contigs       67           54            73
Misassembled contigs length  562340       358099        478259
# local misassemblies        55           32            21
# N’s per 100 kbp            0.00         0.00          0.00
# mismatches per 100 kbp     679.35       704.35        583.04
# indels per 100 kbp         401.99       273.08        191.33
Indels length                9235         6132          7931
Table 4.8: The effect of chunking on correction quality for CoLoRMap. CoLoRMap-w represents running our software on the whole long read set without chunking.
Dataset  Method  #Reads^a  #Reads aligned^b  #Bases aligned^c (Mb)  Size^d (%)  Matched^e (%)  Identity^f (%)  Coverage^g (%)
E. coli (Full):
CoLoRMap-w 33360 31247 83.21 89.49 86.59 99.02 100.00
CoLoRMap 33360 31271 83.34 89.92 87.53 99.27 100.00
CoLoRMap-w+OEA 33360 31165 82.56 89.07 86.63 99.20 100.00
CoLoRMap+OEA 33360 31215 82.92 89.66 87.58 99.38 100.00
E. coli (Trim):
CoLoRMap-w 30501 30302 76.71 95.98 93.44 99.23 100.00
CoLoRMap 30396 30190 76.67 96.26 94.24 99.46 100.00
CoLoRMap-w+OEA 30501 30285 76.34 95.88 93.87 99.43 100.00
CoLoRMap+OEA 30396 30183 76.43 96.21 94.56 99.58 100.00
E. coli (Split):
CoLoRMap-w 57458 57281 71.45 98.90 98.76 99.88 99.91
CoLoRMap 48987 48840 73.73 99.11 98.99 99.90 99.91
CoLoRMap-w+OEA 44037 43847 73.06 98.77 98.59 99.86 99.91
CoLoRMap+OEA 40256 40101 74.57 98.99 98.84 99.89 99.91
Yeast (Full) | CoLoRMap-w | 231594 | 223919 | 1211.63 | 88.07 | 83.12 | 96.69 | 99.85
Yeast (Full) | CoLoRMap | 231594 | 223641 | 1207.73 | 88.60 | 85.62 | 98.30 | 99.83
Yeast (Full) | CoLoRMap-w+OEA | 231594 | 223693 | 1207.65 | 88.02 | 83.61 | 97.10 | 99.85
Yeast (Full) | CoLoRMap+OEA | 231594 | 223497 | 1205.65 | 88.55 | 85.72 | 98.40 | 99.83
Yeast (Trim) | CoLoRMap-w | 214765 | 211702 | 1004.25 | 93.35 | 88.61 | 96.94 | 99.85
Yeast (Trim) | CoLoRMap | 211324 | 208188 | 1017.55 | 92.84 | 90.46 | 98.79 | 99.82
Yeast (Trim) | CoLoRMap-w+OEA | 214765 | 211710 | 1001.17 | 93.38 | 89.33 | 97.44 | 99.81
Yeast (Trim) | CoLoRMap+OEA | 211324 | 208310 | 1017.39 | 92.95 | 90.76 | 98.92 | 99.82
Yeast (Split) | CoLoRMap-w | 1043237 | 1038397 | 631.79 | 96.65 | 96.14 | 99.44 | 99.68
Yeast (Split) | CoLoRMap | 435140 | 432750 | 943.50 | 97.56 | 97.29 | 99.69 | 99.79
Yeast (Split) | CoLoRMap-w+OEA | 676091 | 672731 | 707.32 | 97.36 | 96.60 | 99.28 | 99.77
Yeast (Split) | CoLoRMap+OEA | 349998 | 347516 | 952.99 | 97.26 | 96.95 | 99.66 | 99.79
Notes: (a) the number of DNA sequences available after running the correction tool (may contain uncorrected sequences); for the original dataset, the total number of long reads. (b) the number of aligned sequences. (c) the number of bases aligned to the reference genome. (d) the percentage of aligned bases, i.e., column c divided by the summed length of sequences in column a. (e) the percentage of matched bases, i.e., the total number of matched bases divided by the summed length of sequences in column a. (f) average identity, i.e., the total number of matched bases divided by the summed length of aligned regions on the reference genome. (g) the percentage of the reference genome covered by the aligned sequences.
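As a concrete reading of the footnotes above, the reported percentages can be computed mechanically from per-read alignment summaries. The following sketch (with invented field names and numbers; this is not CoLoRMap's evaluation code) illustrates how columns b through f relate to one another:

```python
# Sketch (not CoLoRMap code): computing the table's evaluation metrics from
# per-read alignment summaries. Field names and values are illustrative.

def evaluate(reads):
    """reads: list of dicts with the total read length and, if aligned,
    aligned_len (bases aligned to the reference), matched (matched bases),
    and ref_span (length of the aligned region on the reference)."""
    total_len = sum(r["length"] for r in reads)                  # basis of column a
    aligned = [r for r in reads if r.get("aligned_len", 0) > 0]  # column b
    aligned_bases = sum(r["aligned_len"] for r in aligned)       # column c
    matched_bases = sum(r["matched"] for r in aligned)
    ref_aligned = sum(r["ref_span"] for r in aligned)            # aligned reference span
    return {
        "n_aligned": len(aligned),
        "pct_aligned": 100.0 * aligned_bases / total_len,        # column d
        "pct_matched": 100.0 * matched_bases / total_len,        # column e
        "identity": 100.0 * matched_bases / ref_aligned,         # column f
    }

reads = [
    {"length": 1000, "aligned_len": 950, "matched": 940, "ref_span": 960},
    {"length": 500},  # an unaligned read still counts toward total length
]
print(evaluate(reads))
```

Note that columns d and e are normalized by the total sequence length, while identity (column f) is normalized by the aligned reference span only, which is why identity can remain high even when many bases are unaligned.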
4.3 Comparison with more recent hybrid correction tools
Since the publication of CoLoRMap in 2016, many other hybrid correction tools have been developed that were not benchmarked in the original CoLoRMap paper; among them are Jabba [116], HALC [8], Hercules [46], FMLRC [164], and HG-CoLoR [121]. Although we have not improved CoLoRMap since its first release, it is interesting to see how it performs against these state-of-the-art hybrid error correction tools. Zhang et al. [177] benchmarked these tools together with previously published ones, including CoLoRMap, ECTools, and Nanocorr. The evaluation was done on PacBio and Nanopore long reads from three datasets: E. coli, yeast, and fruit fly. We refer the reader to Table 1 in [177] for more details about these datasets. Tables 4.9-4.14 show the results of this evaluation. As can be seen, no tool is the best on all metrics. However, we can make the following observations. In terms of the number of aligned reads, HALC is the best performing tool, especially on the larger datasets; CoLoRMap performs well on this metric only for the E. coli datasets. In terms of the number of aligned bases, FMLRC is the best tool, with CoLoRMap second on most datasets (an exception is the E. coli PacBio dataset, where CoLoRMap is the best). In terms of N50, there is no clear winner, but CoLoRMap is most often the better performing tool. With regard to genome fraction, CoLoRMap is always on par with the best performing tools. On the other hand, the alignment identity reported for CoLoRMap is very low (although Hercules is the lowest). This is likely because [177] decided to skip the second step of CoLoRMap, which uses OEA reads to fill the gaps in the corrected reads. One clear limitation of CoLoRMap compared to some of the other tools is its long running time, especially relative to FMLRC and Jabba on the larger datasets.
Indeed, improving the running time of CoLoRMap is one of the future research directions of this thesis; we provide some suggestions toward this goal in Chapter 6.
4.4 Summary
We described CoLoRMap, a new method for correcting noisy long reads whose main features are (1) relying on a shortest path algorithm applied to a weighted alignment graph in order to find a corrected sequence that minimizes the edit score to the long read, and (2) extending the initial correction using unmapped mates of mapped short reads (so-called OEAs). Our experimental results suggest that CoLoRMap compares well with recent existing methods: in particular, long reads corrected by CoLoRMap can be mapped to the reference and used for downstream analysis better than long reads corrected by existing methods, while maintaining high accuracy.
The rationale behind the CoLoRMap algorithm is to combine the strengths of consensus methods such as proovread and optimization-based methods such as LoRDEC and Nanocorr. Like consensus methods, we rely on mapped reads, i.e., we correct regions using either a mapped read (the SP algorithm) or the mate of a mapped read (the OEA algorithm). However, as in LoRDEC, we also account for the global context of the short reads selected for correction through the optimization criterion of the SP algorithm.
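To illustrate feature (1) of the summary above, the sketch below runs a generic shortest-path (Dijkstra) search over a toy weighted alignment graph. The graph, node names, and edge weights are all invented for illustration; CoLoRMap's actual graph construction and edit-score weighting are described earlier in this chapter and differ in detail:

```python
# Toy illustration of the SP step's core idea: short-read alignments to a long
# read form a weighted graph whose edges carry edit scores, so a shortest path
# from a source anchor to a sink anchor spells out a minimum-edit correction.
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over {node: [(neighbor, weight), ...]}; returns (cost, path)."""
    pq, seen = [(0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(pq, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

# Invented graph: S/T are source/sink anchors, r1..r3 stand for short reads;
# weights stand for edit scores of the sequence each edge contributes.
graph = {"S": [("r1", 2), ("r2", 5)], "r1": [("r3", 1)], "r2": [("T", 1)], "r3": [("T", 2)]}
print(shortest_path(graph, "S", "T"))  # -> (5, ['S', 'r1', 'r3', 'T'])
```

Here the path through r1 and r3 (total edit score 5) beats the path through r2 (total 6), mirroring how the SP algorithm prefers the sequence of short reads with the globally minimal edit score rather than greedily picking locally best alignments.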
bioRxiv preprint doi: https://doi.org/10.1101/519330. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Zhang et al. Page 9 of 18
Table 4.9: Comparison between hybrid error correction tools on the E. coli PacBio dataset (D1-P). The experiments were performed by Zhang et al. [177]; the table is reproduced from Table 2 of [177].
Method | #Reads | #Bases (Mbp) | #Aligned reads | #Aligned bases (Mbp) | Maximum length (bp) | N50 (bp) | Genome fraction (%) | Alignment identity (%) | CPU time | Wall time | Memory usage (GB)
Original | 85460 | 748.0 | 82886 | 688.0 | 44113 | 13990 | 100.000 | 86.8763 | - | - | -
Non-hybrid methods:
FLAS | 69327 | 632.3 | 68786 | 621.2 | 40117 | 13212 | 100.000 | 99.5959 | 09:47:50 | 00:56:45 | 4.9
LoRMA | 330811 | 623.3 | 330715 | 623.0 | 22499 | 2441 | 100.000 | 99.6814 | 45:24:49 | 02:10:36 | 67.2
Canu | 9283 | 168.1 | 9193 | 166.7 | 39693 | 20391 | 100.000 | 99.6970 | 07:47:33 | 00:27:14 | 6.0
Short-read-assembly-based methods:
HG-CoLoR | - | - | - | - | - | - | - | - | - | - | -
FMLRC | 85260 | 706.5 | 83320 | 669.9 | 44084 | 13364 | 100.000 | 99.6983 | 03:05:06 | 00:30:07 | 9.8
HALC | 85256 | 711.1 | 84030 | 661.7 | 44117 | 13399 | 100.000 | 99.4374 | 60:41:59 | 16:02:32 | 30.2
Jabba | 77508 | 620.2 | 77508 | 619.7 | 41342 | 12557 | 99.258 | 99.9624 | 02:05:09 | 00:12:01 | 37.0
LoRDEC | 85324 | 716.9 | 83507 | 665.9 | 44311 | 13491 | 100.000 | 98.4149 | 15:03:42 | 00:40:05 | 2.0
ECTools | 55687 | 577.4 | 55687 | 575.7 | 39772 | 13583 | 100.000 | 99.8592 | 11:25:22 | 00:29:49 | 8.2
Short-read-alignment-based methods:
Hercules | - | - | - | - | - | - | - | - | - | >72:00:00 | -
CoLoRMap | 85674 | 730.7 | 83765 | 678.6 | 44113 | 13641 | 100.000 | 95.2930 | 31:35:16 | 02:53:33 | 34.9
Nanocorr | 73368 | 504.9 | 73316 | 493.1 | 41079 | 10796 | 100.000 | 98.3257 | 1862:59:19 | 70:57:19 | 15.1
proovread | 85367 | 720.2 | 83142 | 665.7 | 44113 | 13524 | 100.000 | 96.7250 | 71:17:14 | 12:21:53 | 53.9
LSC | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Note: HG-CoLoR reported an error when correcting this dataset.
Table 4.10: Comparison between hybrid error correction tools on the E. coli Oxford Nanopore dataset (D1-O). The experiments were performed by Zhang et al. [177]; the table is reproduced from Table 3 of [177].
Method | #Reads | #Bases (Mbp) | #Aligned reads | #Aligned bases (Mbp) | Maximum length (bp) | N50 (bp) | Genome fraction (%) | Alignment identity (%) | CPU time | Wall time | Memory usage (GB)
Original | 163747 | 1481.5 | 163386 | 1454.4 | 131969 | 14895 | 100.000 | 81.3559 | - | - | -
Non-hybrid methods:
FLAS | 138472 | 1401.3 | 138458 | 1392.9 | 130497 | 14748 | 99.997 | 93.0176 | 20:27:50 | 01:56:52 | 8.0
LoRMA | 595072 | 1433.5 | 595051 | 1432.5 | 31743 | 3333 | 99.924 | 96.6525 | 182:14:17 | 07:30:30 | 77.8
Canu | 19335 | 226.2 | 19326 | 225.0 | 133168 | 38034 | 99.953 | 94.5969 | 17:14:11 | 00:50:04 | 6.7
Short-read-assembly-based methods:
HG-CoLoR | 159856 | 1540.7 | 159854 | 1518.1 | 138002 | 15744 | 100.000 | 98.1308 | 231:20:30 | 44:41:19 | 13.8
FMLRC | 163749 | 1555.4 | 163593 | 1546.3 | 137960 | 15687 | 100.000 | 99.6423 | 05:50:54 | 00:32:27 | 3.3
HALC | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Jabba | 162970 | 1287.0 | 162970 | 1286.1 | 93923 | 12795 | 99.515 | 99.9557 | 02:51:05 | 00:10:33 | 37.1
LoRDEC | 163838 | 1555.5 | 163722 | 1530.1 | 137887 | 15664 | 100.000 | 98.9920 | 32:35:27 | 01:12:37 | 2.2
ECTools | 116868 | 1431.7 | 116868 | 1428.2 | 137863 | 16354 | 100.000 | 99.8116 | 19:44:40 | 00:46:51 | 8.1
Short-read-alignment-based methods:
Hercules | - | - | - | - | - | - | - | - | - | >72:00:00 | -
CoLoRMap | 164072 | 1518.3 | 163782 | 1495.7 | 134302 | 15180 | 100.000 | 89.2049 | 32:55:26 | 04:01:18 | 35.5
Nanocorr | - | - | - | - | - | - | - | - | - | >72:00:00 | -
proovread | 163815 | 1514.0 | 163481 | 1489.1 | 135798 | 15222 | 100.000 | 89.2071 | 104:33:09 | 18:35:46 | 47.8
LSC | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Table 4.11: Comparison between hybrid error correction tools on the yeast PacBio dataset (D2-P). The experiments were performed by Zhang et al. [177]; the table is reproduced from Table 4 of [177].
Method | #Reads | #Bases (Mbp) | #Aligned reads | #Aligned bases (Mbp) | Maximum length (bp) | N50 (bp) | Genome fraction (%) | Alignment identity (%) | CPU time | Wall time | Memory usage (GB)
Original | 239408 | 1462.7 | 235620 | 1332.6 | 35196 | 8656 | 99.976 | 87.2637 | - | - | -
Non-hybrid methods:
FLAS | 173187 | 1093.2 | 173046 | 1078.8 | 30046 | 8132 | 99.976 | 99.5777 | 11:46:31 | 01:15:40 | 7.9
LoRMA | 650467 | 1142.0 | 650333 | 1141.4 | 18127 | 2323 | 99.951 | 99.7583 | 172:24:38 | 07:03:03 | 72.9
Canu | 38228 | 453.2 | 38172 | 446.7 | 28748 | 12021 | 99.975 | 99.5864 | 15:18:34 | 00:50:12 | 6.5
Short-read-assembly-based methods:
HG-CoLoR | - | - | - | - | - | - | - | - | - | - | -
FMLRC | 238706 | 1380.8 | 236883 | 1311.0 | 33658 | 8185 | 99.977 | 99.3889 | 07:52:17 | 00:28:55 | 5.5
HALC | 238787 | 1395.4 | 238097 | 1287.6 | 34785 | 8270 | 99.976 | 99.0796 | 52:12:11 | 09:45:10 | 29.0
Jabba | 202980 | 1087.2 | 202879 | 1086.6 | 30141 | 7847 | 95.627 | 99.9832 | 00:38:30 | 00:04:57 | 21.4
LoRDEC | 238847 | 1405.0 | 237278 | 1297.1 | 34896 | 8326 | 99.978 | 97.9568 | 01:10:03 | 00:57:17 | 1.9
ECTools | 130863 | 946.9 | 130832 | 943.1 | 28749 | 8412 | 99.810 | 99.7712 | 938:25:28 | 58:25:00 | 4.3
Short-read-alignment-based methods:
Hercules | 239389 | 1460.3 | 235630 | 1330.4 | 35196 | 8644 | 99.976 | 87.6711 | 87:53:55 | 03:18:41 | 247.8
CoLoRMap | 239309 | 1429.6 | 237135 | 1321.3 | 34850 | 8409 | 99.976 | 96.3912 | 18:44:48 | 03:07:34 | 37.3
Nanocorr | - | - | - | - | - | - | - | - | - | >72:00:00 | -
proovread | 238992 | 1412.4 | 236519 | 1298.0 | 35122 | 8369 | 99.978 | 97.9568 | 184:02:07 | 23:45:37 | 47.9
LSC | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Note: HG-CoLoR reported an error when correcting this dataset.
Table 4.12: Comparison between hybrid error correction tools on the yeast Oxford Nanopore dataset (D2-O). The experiments were performed by Zhang et al. [177]; the table is reproduced from Table 5 of [177].
Method | #Reads | #Bases (Mbp) | #Aligned reads | #Aligned bases (Mbp) | Maximum length (bp) | N50 (bp) | Genome fraction (%) | Alignment identity (%) | CPU time | Wall time | Memory usage (GB)
Original | 118723 | 715.7 | 108463 | 638.1 | 55374 | 7003 | 99.976 | 86.1986 | - | - | -
Non-hybrid methods:
FLAS | 95606 | 585.6 | 95290 | 581.5 | 26592 | 6893 | 99.940 | 97.1699 | 07:42:10 | 07:42:10 | 4.4
LoRMA | 398863 | 497.0 | 398350 | 495.2 | 16027 | 1439 | 99.485 | 98.4024 | 68:02:36 | 02:55:05 | 68.8
Canu | 64829 | 475.1 | 64649 | 475.1 | 26895 | 7518 | 99.914 | 97.7710 | 12:31:04 | 00:37:53 | 9.0
Short-read-assembly-based methods:
HG-CoLoR | - | - | - | - | - | - | - | - | - | - | -
FMLRC | 118701 | 713.7 | 111869 | 666.4 | 55374 | 6990 | 99.975 | 99.2529 | 03:35:44 | 00:17:21 | 2.2
HALC | 118707 | 718.2 | 114071 | 647.9 | 55379 | 7025 | 99.976 | 98.8884 | 50:11:58 | 04:03:18 | 3.6
Jabba | 99044 | 536.9 | 98631 | 535.9 | 28194 | 6730 | 95.400 | 99.9809 | 00:55:32 | 00:04:20 | 21.5
LoRDEC | 118727 | 720.8 | 110606 | 647.8 | 55375 | 7049 | 99.976 | 96.9369 | 11:22:09 | 00:26:13 | 2.1
ECTools | 81105 | 531.9 | 80843 | 529.3 | 26810 | 7071 | 99.314 | 99.7697 | 09:31:32 | 20:17:33 | 5.6
Short-read-alignment-based methods:
Hercules | 118721 | 716.3 | 108467 | 638.9 | 55374 | 7008 | 99.976 | 87.2912 | 125:22:19 | 04:37:01 | 246.6
CoLoRMap | 118774 | 722.0 | 108969 | 649.4 | 55374 | 7049 | 99.976 | 95.5851 | 11:01:38 | 01:34:52 | 27.8
Nanocorr | - | - | - | - | - | - | - | - | - | >72:00:00 | -
proovread | 118729 | 716.7 | 109057 | 643.4 | 55374 | 7007 | 99.976 | 96.3689 | 66:14:09 | 07:20:18 | 28.1
LSC | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Note: HG-CoLoR reported an error when correcting this dataset.
Table 4.13: Comparison between hybrid error correction tools on the fruit fly PacBio dataset (D3-P). The experiments were performed by Zhang et al. [177]; the table is reproduced from Table 6 of [177].
Method | #Reads | #Bases (Mbp) | #Aligned reads | #Aligned bases (Mbp) | Maximum length (bp) | N50 (bp) | Genome fraction (%) | Alignment identity (%) | CPU time | Wall time | Memory usage (GB)
Original | 5366088 | 28797.8 | 1839681 | 16543.5 | 74735 | 15374 | 99.191 | 85.2734 | - | - | -
Non-hybrid methods:
FLAS | 1435682 | 14585.2 | 1428018 | 13574.1 | 43556 | 13550 | 98.915 | 98.8363 | 271:44:27 | 36:30:42 | 53.1
LoRMA | - | - | - | - | - | - | - | - | - | - | -
Canu | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Short-read-assembly-based methods:
HG-CoLoR | - | - | - | - | - | - | - | - | - | - | -
FMLRC | 5246485 | 27354.6 | 2477890 | 16543.5 | 74735 | 14554 | 99.191 | 96.5284 | 327:37:22 | 13:49:04 | 31.2
HALC | 4451474 | 21997.5 | 3434779 | 12793.3 | 74735 | 14349 | 99.178 | 96.8863 | 770:35:46 | 55:58:24 | 73.0
Jabba | 35549 | 239.8 | 35505 | 239.1 | 37729 | 10461 | 65.616 | 99.9615 | 656:05:15 | 24:33:41 | 175.8
LoRDEC | 5363998 | 28354.1 | 2056812 | 15636.9 | 74719 | 15078 | 99.200 | 92.2954 | 1011:52:27 | 36:19:18 | 5.9
ECTools | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Short-read-alignment-based methods:
Hercules | - | - | - | - | - | - | - | - | - | - | -
CoLoRMap | 5366107 | 28891.6 | 1841822 | 14976.8 | 74735 | 15442 | 99.189 | 83.2580 | 495:11:17 | 64:52:25 | 189.4
Nanocorr | - | - | - | - | - | - | - | - | - | >72:00:00 | -
proovread | - | - | - | - | - | - | - | - | - | >72:00:00 | -
LSC | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Note: LoRMA, HG-CoLoR and Hercules reported errors when correcting this dataset.
Table 4.14: Comparison between hybrid error correction tools on the fruit fly Oxford Nanopore dataset (D3-O). The experiments were performed by Zhang et al. [177]; the table is reproduced from Table 7 of [177].
Method | #Reads | #Bases (Mbp) | #Aligned reads | #Aligned bases (Mbp) | Maximum length (bp) | N50 (bp) | Genome fraction (%) | Alignment identity (%) | CPU time | Wall time | Memory usage (GB)
Original | 642255 | 4609.5 | 554083 | 3857.9 | 446050 | 11956 | 98.719 | 83.5921 | - | - | -
Non-hybrid methods:
FLAS | 423097 | 3507.6 | 422206 | 3402.6 | 64365 | 11517 | 97.588 | 95.3301 | 23:04:50 | 03:12:50 | 10.8
LoRMA | 703097 | 615.5 | 682288 | 592.3 | 32644 | 865 | 30.338 | 98.1230 | 666:37:35 | 25:52:14 | 92.8
Canu | 430082 | 3415.6 | 421475 | 3220.2 | 254967 | 12090 | 97.592 | 96.3739 | 88:51:10 | 04:36:20 | 20.2
Short-read-assembly-based methods:
HG-CoLoR | - | - | - | - | - | - | - | - | - | - | -
FMLRC | 641945 | 4647.2 | 578290 | 3978.2 | 444605 | 12088 | 98.592 | 97.6010 | 47:45:17 | 03:06:05 | 31.2
HALC | 643002 | 4668.5 | 611191 | 3955.7 | 451284 | 12115 | 98.616 | 97.6634 | 126:30:01 | 05:43:37 | 42.4
Jabba | 494546 | 2878.2 | 494430 | 2876.3 | 72501 | 9305 | 83.166 | 99.9745 | 175:19:34 | 06:56:29 | 136.8
LoRDEC | 642882 | 4655.9 | 567878 | 3921.1 | 447726 | 12079 | 98.691 | 94.0382 | 152:05:32 | 05:38:05 | 5.7
ECTools | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Short-read-alignment-based methods:
Hercules | 642287 | 4612.8 | 554630 | 3859.4 | 449799 | 11966 | 98.713 | 83.9340 | 398:10:17 | 17:32:36 | 247.7
CoLoRMap | 649041 | 4692.1 | 565881 | 3963.8 | 442948 | 12050 | 98.715 | 94.3361 | 160:00:22 | 16:07:18 | 57.3
Nanocorr | - | - | - | - | - | - | - | - | - | >72:00:00 | -
proovread | - | - | - | - | - | - | - | - | - | >72:00:00 | -
LSC | - | - | - | - | - | - | - | - | - | >72:00:00 | -
Note: HG-CoLoR reported an error when correcting this dataset.
Chapter 5
Hybrid assembly of long reads
Long reads generated by single-molecule sequencing (SMS) technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies have revolutionized the landscape of de novo genome assembly. While SMS long reads have a higher error rate than the short reads generated by next-generation sequencing (NGS) technologies such as Illumina, they have been shown to result in accurate assemblies given sufficient coverage. Indeed, the length of SMS long reads enables the resolution of many short and mid-range repeats that are problematic when assembling genomes from short reads. Recent advances in sequencing ultra-long Oxford Nanopore reads have brought us closer than ever to the complete reconstruction of entire genomes, including difficult-to-assemble regions such as centromeres and telomeres [117]. Similarly, High-Fidelity (HiFi) PacBio reads have been shown to be capable of improving contiguity and accuracy in complex regions of the human genome [161]. These advances toward more accurate and complete genome assembly could not have been achieved without the recent development of assemblers specifically tailored for long reads. These tools assemble long reads either after an error correction step [85, 25] or directly without any prior error correction [97, 140, 81].
Although long reads are becoming more widely used for de novo genome assembly, hybrid approaches (which utilize a complementary short read dataset) remain popular for several reasons: (i) short reads have higher accuracy and can be generated by Illumina sequencers at high throughput for a lower cost; (ii) plenty of short read datasets are already publicly available for many genomes; (iii) for some basic tasks such as variant calling (SNV and short indel detection), short reads still provide better resolution due to their high accuracy, which often motivates researchers to generate short reads even when long reads are at hand; and (iv) unlike PacBio assemblies, whose accuracy increases with the depth of coverage thanks to their unbiased random error model [124], constructing reference-quality genomes solely from Oxford Nanopore reads remains challenging due to biases in base calling, even at high coverage [85, 4]. As a result, hybrid assembly approaches are still useful [71, 73, 74].
Hybrid approaches for de novo genome assembly can be classified into three groups: (i) methods that first correct raw long reads using short reads and then build contigs using corrected long reads only (e.g. PBcR [84] and MaSuRCA [181]); (ii) methods that first assemble raw long reads and then correct/polish the resulting draft assembly with short reads using polishing tools such as Pilon [163] and Racon [159]; and (iii) methods that first assemble short reads and then utilize long reads to generate longer contigs (e.g. hybridSPAdes [4], Unicycler [169], DBG2OLC [174], and Wengan [35]). PBcR and MaSuRCA correct long reads using their internal correction algorithms and then employ CABOG [119] (Celera Assembler with the Best Overlap Graph) to assemble the corrected long reads. hybridSPAdes and Unicycler are similar in design. Both tools first use SPAdes [7], which takes short reads as input and generates an assembly graph, a data structure in which multiple copies of a genome segment are collapsed into a single contig (see [176] for more details). This data structure also records connections between subsequent contigs such that every region of the genome corresponds to a path in the graph. hybridSPAdes and Unicycler then align long reads to this assembly graph in order to resolve ambiguities and generate longer contigs. DBG2OLC, on the other hand, first assembles contigs from short reads and maps them onto raw long reads to obtain a compressed representation of each long read as a sequence of short read contig identifiers; it then applies an overlap-layout-consensus (OLC) approach to these compressed long reads to assemble the genome. Since compressed long reads are much shorter than raw long reads, pairwise alignment is faster and building an overlap graph from them is quicker than building it from raw long reads.
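The compressed-representation idea behind DBG2OLC can be made concrete with a small sketch. This is not DBG2OLC's actual code; the contig identifiers and mapping positions are invented, and real tools must also handle strand, chimeric hits, and noisy mappings:

```python
# Sketch: replace each long read by the ordered list of short-read contig
# identifiers mapping onto it, so overlap computation works on short symbol
# sequences instead of raw (error-prone) bases.

def compress(read_hits):
    """read_hits: list of (contig_id, start_on_read) mappings for one long read.
    Returns contig ids ordered by their position along the read."""
    return [cid for cid, _ in sorted(read_hits, key=lambda h: h[1])]

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b, in contigs."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

r1 = compress([("c5", 12000), ("c2", 300), ("c9", 7100)])  # -> ['c2', 'c9', 'c5']
r2 = compress([("c9", 150), ("c5", 5200), ("c7", 11000)])  # -> ['c9', 'c5', 'c7']
print(overlap(r1, r2))  # the two reads overlap by 2 contigs (c9, c5)
```

Comparing two compressed reads touches a handful of symbols rather than tens of kilobases, which is why the OLC stage on compressed reads is so much cheaper.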
Finally, the more recent tool, Wengan, assembles short reads and then builds multiple synthetic paired-read libraries of different insert sizes from long read sequences. These synthetic paired reads are then aligned to short read contigs, and a scaffolding graph is built from the resulting alignments. In the end, the final assembly is generated by traversing proper paths of the scaffolding graph. A more detailed overview of each tool, together with some other non-hybrid assemblers, is provided in Section 2.5. Among the above tools, hybridSPAdes and Unicycler have been designed specifically for bacterial and small eukaryotic genomes and do not scale to the assembly of large genomes. PBcR, MaSuRCA, DBG2OLC, and Wengan are the only hybrid assemblers capable of assembling large genomes, such as the human genome. However, for mammalian genomes, PBcR and MaSuRCA require substantial computational time and cannot be used without a computing cluster. DBG2OLC is faster due to its use of compressed long reads. Wengan is also a fast assembler and can be used for assembling large genomes in a reasonable time. In this chapter, we introduce HASLR, a fast hybrid assembler that is capable of assembling large genomes. Similar to hybridSPAdes, Unicycler, and Wengan, HASLR builds short read contigs using a fast short read assembler (i.e., Minia). It then builds a novel data structure called the backbone graph to place short read contigs in the order in which they are expected to appear in the genome and to fill the gaps between them using the consensus of long
reads. Based on our results, HASLR is the fastest among all the assemblers we tested, while generating the lowest number of misassemblies. Furthermore, it generates assemblies that are comparable to those of the best performing tools in terms of contiguity and accuracy. HASLR is also capable of assembling large genomes using less time and memory than other tools.
5.1 Methods
The input to HASLR is a set of long reads and a set of short reads from the same sample, together with an estimation of the genome size. HASLR performs the assembly using a novel approach that rapidly assembles the genome without performing all-vs-all long read alignments. The core of HASLR is to first assemble contigs from short reads using an efficient short read assembler and then to use long reads to find sequences of such contigs that represent the backbone of the sequenced genome.
5.1.1 Obtaining unique short read contigs
HASLR starts by assembling short reads into a set of short read contigs, denoted by C. Assembly of short reads is a well-studied topic, and many efficient tools have been specifically designed for this purpose. These tools use either a de Bruijn graph [150, 23] or an OLC strategy (based on an overlap graph or a string graph) [149, 120] to assemble the genome by finding "proper" paths in these graphs. Next, HASLR identifies a set U of unique contigs, those short read contigs that are likely to appear in the genome only once. To do this, for every short read contig c_i, the mean k-mer frequency f(c_i) is computed as the average k-mer count over all k-mers present in c_i. Note that the value of f(c_i) is proportional to the depth of coverage of c_i. Assuming longer contigs are more likely to come from unique regions, their mean k-mer frequency can be a good indicator for identifying unique contigs. Let LC_q ⊆ C be the set of the q longest short read contigs in C, and let f_avg and f_std be the average and standard deviation of {f(c) | c ∈ LC_q}.
Then, the set of unique contigs is defined as U = {u | u ∈ C and f(u) ≤ f_avg + 3·f_std}. Our empirical results show that this approach can identify unique contigs with high precision and recall. In order to measure the efficacy of this approach for identifying unique contigs, we conducted a set of experiments as follows. First, we simulated a short read dataset based on six different reference genomes: Escherichia coli, Saccharomyces cerevisiae (yeast), Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster, and human GRCh38. For each genome, we used ART [66] to simulate 50× coverage short Illumina reads (2×100 bp long, 500 bp mean insert size, and 50 bp insert size standard deviation) using the Illumina HiSeq 2000 error model. Next, we used Minia to assemble the simulated short reads using k-mer size 49. Finally, to form the ground truth for the copy count of each
short read contig, we mapped the assembled short read contigs to the reference genome using minimap2 [98]. Here, we report the precision and recall of the above-mentioned approach in identifying unique contigs. For each dataset, we evaluate the performance of our approach in identifying unique contigs longer than a given threshold; the length threshold used to discard small contigs varies from 100 to 1000 with a step size of 100. As can be seen in Figure 5.1, the precision of the identified unique contigs is always high regardless of the length threshold. In addition, in all the experiments, a large jump in recall is observed at the length threshold of 300. The results of this experiment show that the proposed approach for identifying unique contigs performs well, with high precision and recall.
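The unique-contig filter described above can be sketched in a few lines. This is illustrative only: the contig lengths and k-mer frequencies are invented, and HASLR's actual implementation may compute the statistics differently:

```python
# Sketch of the unique-contig filter: estimate the expected single-copy
# coverage from the q longest contigs, then keep contigs whose mean k-mer
# frequency f(c) does not exceed f_avg + 3 * f_std.
from statistics import mean, pstdev

def unique_contigs(contigs, kmer_freq, q=3):
    """contigs: {name: sequence}; kmer_freq: {name: mean k-mer frequency f(c)}."""
    longest = sorted(contigs, key=lambda c: len(contigs[c]), reverse=True)[:q]
    f_avg = mean(kmer_freq[c] for c in longest)
    f_std = pstdev(kmer_freq[c] for c in longest)
    return {c for c in contigs if kmer_freq[c] <= f_avg + 3 * f_std}

# Invented data: c1..c3 are long contigs at single-copy coverage, while the
# short contig c4 has ~3x the k-mer frequency, suggesting a repeat.
contigs = {"c1": "A" * 9000, "c2": "C" * 8000, "c3": "G" * 7000, "c4": "T" * 500}
freq = {"c1": 50.1, "c2": 49.2, "c3": 51.0, "c4": 148.7}
print(sorted(unique_contigs(contigs, freq)))  # -> ['c1', 'c2', 'c3']
```

Estimating f_avg and f_std from the longest contigs only is the key design choice: long contigs are unlikely to be collapsed repeats, so they anchor the single-copy coverage estimate, and repetitive contigs such as c4 then stand out by their inflated frequency.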
5.1.2 Construction of backbone graph
The backbone graph encodes potential adjacencies between unique contigs and thus presents a large-scale map of the genome, albeit with some level of ambiguity. Using the backbone graph, HASLR finds paths of unique contigs representing their relative order and orientation in the sequenced genome. These paths are later transformed into the assembly.
Formally, given a set of unique contigs, U = {u1, u2, . . . , u|U|}, and a set of long reads,
L = {l1, l2, . . . , l|L|}, HASLR builds the backbone graph BBG as follows. First, unique contigs are aligned against long reads. Each alignment can be encoded by a 7-tuple ⟨rbeg, rend, uid, ustrand, ubeg, uend, nmatch⟩ whose elements respectively denote the start and end positions of the alignment on the long read, the index of the unique contig in U, the strand of the alignment (+ or −), the start and end positions of the alignment on the unique contig, and the number of matched bases in the alignment. Let Ai = (a^i_1, a^i_2, . . . , a^i_|Ai|) be the list of alignments of unique contigs to li, sorted by rend.
Note that alignments in Ai may overlap due to the relaxed alignment parameters used to account for the high sequencing error rate of long reads. Thus, in the next step, we aim to select a subset of non-overlapping alignments whose total identity score, defined as the sum of the numbers of matched bases, is maximal. Let Si(j) be the best score achievable using the first j alignments, i.e., the maximal total identity score of a non-overlapping subset of these j alignments. Si(j) can be calculated using the following dynamic programming formulation:

    Si(j) = 0                                                      if j = 0
    Si(j) = max{ Si(j − 1), Si(prev(j)) + a^i_j[nmatch] }          otherwise        (5.1)

where prev(j) is the largest index z < j such that a^i_j and a^i_z are non-overlapping alignments. By calculating Si(|Ai|) and backtracking, we obtain a sorted sub-list Ri = (r^i_1, r^i_2, . . . , r^i_|Ri|) of non-overlapping alignments with maximal total identity score,
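The recurrence above is an instance of weighted interval scheduling. A minimal sketch follows, with hypothetical names and each alignment reduced to a (rbeg, rend, nmatch) triple; prev(j) is found by binary search over the sorted end positions, giving the O(|Ai| log |Ai|) running time noted below.

```python
import bisect

def best_nonoverlapping(alignments):
    """Select a maximum-score subset of non-overlapping alignments.

    alignments: list of (rbeg, rend, nmatch) triples on one long read,
    sorted by rend. Returns (best total score, chosen alignments in order).
    """
    ends = [a[1] for a in alignments]
    n = len(alignments)
    S = [0] * (n + 1)          # S[j] = best score over the first j alignments
    take = [False] * (n + 1)   # whether alignment j is used in S[j]
    for j in range(1, n + 1):
        rbeg, rend, nmatch = alignments[j - 1]
        # prev(j): largest z < j whose alignment ends at or before rbeg
        z = bisect.bisect_right(ends, rbeg, 0, j - 1)
        skip = S[j - 1]            # Si(j - 1)
        keep = S[z] + nmatch       # Si(prev(j)) + a_j[nmatch]
        if keep > skip:
            S[j], take[j] = keep, True
        else:
            S[j] = skip
    # Backtrack to recover the selected sub-list (the compact representation).
    chosen, j = [], n
    while j > 0:
        if take[j]:
            chosen.append(alignments[j - 1])
            j = bisect.bisect_right(ends, alignments[j - 1][0], 0, j - 1)
        else:
            j -= 1
    return S[n], chosen[::-1]
```

For example, with alignments (0, 100, 90), (50, 150, 80), and (120, 200, 70), the second overlaps both neighbors, so the optimal selection keeps the first and third for a total score of 160.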
[Six-panel plot: precision/recall (y-axis) versus length threshold (x-axis) for ecoli, yeast, celegans, athaliana, dmelanogaster, and hg38.]
Figure 5.1: Precision and recall results for the identification of unique short read contigs on 6 different reference genomes. Precision is shown with blue dots and recall with orange dots. Precision remains high across the different experiments, and in all experiments a large jump in recall occurs at a length threshold of 300.
which we call the compact representation of read li. Note that since the input list is sorted, prev(·) can be calculated in logarithmic time, which makes the time complexity of this dynamic programming O(|Ai| log |Ai|). The backbone graph is a directed graph BBG = (V, E). The set of nodes is defined as V = {u^+_j, u^−_j | 1 ≤ j ≤ |U|}, where u^+_j and u^−_j represent the forward and reverse strand of contig uj, respectively.