Computational Methods for Analysis of Single Molecule Sequencing Data

by Ehsan Haghshenas

M.Sc., University of Western Ontario, 2014
B.Sc., Isfahan University of Technology, 2012

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

in the School of Computing Science
Faculty of Applied Sciences

© Ehsan Haghshenas 2020
SIMON FRASER UNIVERSITY
Spring 2020

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.

Approval

Name: Ehsan Haghshenas
Degree: Doctor of Philosophy (Computing Science)
Title: Computational Methods for Analysis of Single Molecule Sequencing Data
Examining Committee:
Chair: Diana Cukierman, University Lecturer
Binay Bhattacharya, Senior Supervisor, Professor
S. Cenk Sahinalp, Co-Supervisor, Senior Investigator, Center for Cancer Research, National Cancer Institute
Cedric Chauve, Co-Supervisor, Professor
Faraz Hach, Co-Supervisor, Assistant Professor, Department of Urologic Sciences, The University of British Columbia; Senior Research Scientist, Vancouver Prostate Centre
Martin Ester, Internal Examiner, Professor
Mihai Pop, External Examiner, Professor, Department of Computer Science, University of Maryland, College Park

Date Defended: March 26, 2020

Abstract

Next-generation sequencing (NGS) technologies paved the way to a significant increase in the number of sequenced genomes, both prokaryotic and eukaryotic. This increase provided an opportunity for considerable advancement in genomics and precision medicine. Although NGS technologies have proven their power in many applications such as de novo assembly and variation discovery, computational analysis of the data they generate is still far from being perfect. The main limitation of NGS technologies is their short read length relative to the lengths of (common) genomic repeats. Today, newer sequencing technologies (known as single-molecule sequencing or SMS) such as Pacific Biosciences and Oxford Nanopore are producing significantly longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. For instance, for the first time, a complete human chromosome was fully assembled using ultra-long reads generated by Oxford Nanopore. Unfortunately, long reads generated by SMS technologies are characterized by a high error rate, which prevents their direct utilization in many of the standard downstream analysis pipelines and poses new computational challenges. This motivates the development of new computational tools specifically designed for SMS long reads. In this thesis, we present three computational methods that are tailored for SMS long reads. First, we present lordFAST, a fast and sensitive tool for mapping noisy long reads to a reference genome. Mapping sequenced reads to their potential genomic origin is the first fundamental step for many computational biology tasks. As an example, in this thesis, we show that lordFAST can be successfully employed in structural variation discovery. Next, we present the second tool, CoLoRMap, which tackles the high level of base-level errors in SMS long reads by providing a means to correct them using a complementary set of NGS short reads.
This integrative use of SMS and NGS data is known as a hybrid technique. Finally, we introduce HASLR, an ultra-fast hybrid assembler that uses reads generated by both technologies to efficiently generate accurate genome assemblies. We demonstrate that HASLR is not only the fastest of the tested assemblers but also the one with the lowest number of misassemblies on all the samples. Furthermore, the assemblies it generates are on par with those of the other tools in terms of contiguity and accuracy on most of the samples.

Keywords: Computational biology; Single-molecule sequencing; PacBio; Oxford Nanopore; Long read mapping; Hybrid error correction; Hybrid assembly

Dedication

To my family, with love

Acknowledgements

First and foremost, I would like to express my sincerest gratitude to my supervisors, Dr. Cenk Sahinalp, Dr. Cedric Chauve, Dr. Faraz Hach, and Dr. Binay Bhattacharya, for their constant support, guidance, and patience throughout my PhD studies. I was honored to have the opportunity to work with such brilliant scholars, from whom I learned critical thinking and the proper way of doing research. In addition, I would like to give my regards and appreciation to Dr. Jens Stoye, my host and supervisor during my visit at Bielefeld University. This visit greatly influenced the direction of my work on hybrid assembly. I would also like to thank Dr. Mihai Pop and Dr. Martin Ester, my external and internal examiners, for their careful review of my thesis. I appreciate their invaluable discussions, comments, and suggestions, which helped me improve the thesis. I want to give special thanks to Dr. Diana Cukierman, who graciously accepted to be the chair of my examining committee. During my PhD, I was also involved in a few research projects that are not included in this thesis. I had wonderful, valuable experiences in these collaborative projects. Regarding these collaborations, I would like to thank Dr. Salem Malikic, Michael Ford, Hossein Asghari, Sean La, and Farid Rashidi Mehrabadi. I also offer enduring gratitude to all past and present members of the Lab for Computational Biology at Simon Fraser University, as well as the Hach Lab at the Vancouver Prostate Centre. In particular, I thank Dr. Yen-Yi Lin, Iman Sarrafi, Dr. Ibrahim Numanagic, Ermin Hodzic, Can Kockan, Dr. Raunak Shrestha, Baraa Orabi, Tunc Morova, and Fatih Karaoglanoglu, who all made the work environment a more pleasant one. My special thanks go to Baraa Orabi and Elie Ritch for their help with proofreading the thesis.
In addition, I would like to thank all members of the Genome Informatics research group at Bielefeld University in Germany, including Omar Castillo, Konstantinos Tzanakis, Eyla Willing, Georges Hattab, Tizian Schulz, Guillaume Holley, Nina Luhmann, Liren Huang, Lu Zhu, Markus Lux, Linda Sundermann, and Tina Zekic, who made my visit to Bielefeld University such a great experience. I am grateful to many other friends I made in Vancouver, including Abdollah Safari, Sina Bahrasemani, Sajjad Gholami, Mehran Khodabandeh, Hedayat Zarkoob, Mohsen Jamali, Soheil Horr, Shahram Zaheri, Amirmasoud Ghasemi, Abraham Hashemian, Hashem Jeihooni, Shahram Pourazadi, Mohammad Mahdavian, Majid Talebi, Saeed

Mirazimi, Hassan Shavarani, Mahdi Nemati Mehr, Sima Jamali, Nazanin Mehrasa, Ramtin Mehdizadeh, Sina Salari, Ali Afsah, Chakaveh Ahmadizadeh, Mahsa Gharibi, Mohammad Akbari, Akbar Rafiey, Saeed Izadi, Saeid Asgari, Hossein Sharifi-Noghabi, Sepehr MohaimenianPour, Sara Daneshvar, Hooman Zabeti, Sara Jalili, Mohammad Mazraeh, Marjan Moodi, and many more. All these amazing people made Vancouver a true home. Last but not least, I would like to thank my loving family for all their support during these years. An exceptional thanks goes to my wonderful wife, Rana, who definitely made a significant contribution to this thesis with her continuous support and patience.

Table of Contents

Approval ii

Abstract iii

Dedication v

Acknowledgements vi

Table of Contents viii

List of Tables xi

List of Figures xiv

1 Introduction 1
1.1 Contributions ...... 5
1.2 Organization of the thesis ...... 7

2 Background and Related Work 8
2.1 Single-molecule sequencing technologies ...... 8
2.1.1 Pacific Biosciences ...... 8
2.1.2 Oxford Nanopore Technology ...... 9
2.1.3 Synthetic long reads ...... 11
2.2 Definitions and Notations ...... 12
2.3 Long Read Mapping ...... 13
2.4 Error correction of long noisy reads ...... 16
2.4.1 Hybrid correction ...... 16
2.4.2 Self-correction ...... 18
2.5 de novo genome assembly ...... 20
2.5.1 Hybrid assembly ...... 21
2.5.2 Non-hybrid assembly ...... 23
2.5.3 wtdbg2 ...... 26

3 Long read mapping 27

3.1 Methods ...... 29
3.1.1 Overview ...... 29
3.1.2 Stage One: Reference Genome Indexing ...... 29
3.1.3 Stage Two: Read Mapping ...... 30
3.2 Results ...... 33
3.2.1 Experiment on a simulated dataset without structural variations ...... 33
3.2.2 Simulation in presence of structural variations ...... 36
3.2.3 Experiment on a real dataset ...... 39
3.3 Summary ...... 41

4 Hybrid error correction of long reads 43
4.1 Methods ...... 44
4.1.1 Overview ...... 44
4.1.2 Initial correction of long reads: the SP algorithm ...... 45
4.1.3 Correcting gaps using One-End Anchors ...... 47
4.2 Results ...... 50
4.2.1 Data and computational setting ...... 50
4.2.2 Measures of evaluation ...... 51
4.2.3 Comparison based on alignment ...... 52
4.2.4 Comparison based on assembly ...... 52
4.3 Comparison with more recent hybrid correction tools ...... 60
4.4 Summary ...... 60

5 Hybrid assembly of long reads 68
5.1 Methods ...... 70
5.1.1 Obtaining unique short read contigs ...... 70
5.1.2 Construction of backbone graph ...... 71
5.1.3 Graph cleaning and simplification ...... 74
5.1.4 Generating the assembly ...... 76
5.1.5 Methodological remarks ...... 77
5.2 Results ...... 79
5.2.1 Experiment on simulated dataset ...... 79
5.2.2 Experiment on real dataset ...... 82
5.3 Summary ...... 85

6 Conclusion 86
6.1 Future directions ...... 87
6.2 Recommended guidelines ...... 88

Bibliography 90

Appendix A lordFAST Material 105
A.1 Data ...... 105
A.1.1 Real data ...... 105
A.1.2 Synthetic data ...... 105
A.2 Software ...... 107
A.3 Command details ...... 108

Appendix B CoLoRMap Material 111
B.1 Data ...... 111

Appendix C HASLR Material 112
C.1 Data ...... 112
C.1.1 Simulated data ...... 112
C.1.2 Real data ...... 113
C.2 Software ...... 114
C.3 Command details ...... 115
C.4 Visual examples of regions assembled only by HASLR without any misassembly or fragmentation ...... 117

List of Tables

Table 3.1 Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface. ...... 35
Table 3.2 Runtime and memory usage for the same experiment. ...... 36
Table 3.3 Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface. ...... 36
Table 3.4 Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface. ...... 37
Table 3.5 Structural variations called by Sniffles based on mappings from different tools. ...... 39
Table 3.6 Evaluation of the performance of various long read mappers on a real human dataset. This dataset includes 23,155 reads and 178.45 million bases. ...... 40
Table 3.7 Agreement of different methods in reporting alignments. ...... 41
Table 3.8 The performance of different methods on reads for which their alignments do not agree. ...... 41

Table 4.1 Runtime of different correction methods for E. coli dataset. ...... 50
Table 4.2 Quality of corrected long reads for E. coli, yeast, and fruit fly datasets obtained with different methods. Assessment is based on alignments of long reads to the reference genome obtained with BLASR. ...... 53
Table 4.3 Quality of corrected long reads for E. coli and yeast datasets obtained with different methods. Assessment is based on alignments of long reads to the reference genome obtained with BWA-MEM. ...... 54
Table 4.4 Statistics of corrected and un-corrected regions after correction with different methods. ...... 55

Table 4.5 Quality of Canu assemblies for E. coli data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted. ...... 56
Table 4.6 Quality of Canu assemblies for yeast data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted. ...... 57
Table 4.7 Quality of Canu assemblies for D. melanogaster data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted. ...... 58
Table 4.8 The effect of chunking on correction quality for CoLoRMap. CoLoRMap-w represents running of our software on the whole long read set without chunking. ...... 59
Table 4.9 Comparison between hybrid error correction tools on E. coli PacBio dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 62
Table 4.10 Comparison between hybrid error correction tools on E. coli Oxford Nanopore dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 63
Table 4.11 Comparison between hybrid error correction tools on yeast PacBio dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 64
Table 4.12 Comparison between hybrid error correction tools on yeast Oxford Nanopore dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 65
Table 4.13 Comparison between hybrid error correction tools on fruit fly PacBio dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 66
Table 4.14 Comparison between hybrid error correction tools on fruit fly Oxford Nanopore dataset. The experiment is done by Zhang et al. [177]. Table is taken from [177]. ...... 67

Table 5.1 Comparison between draft assemblies obtained by different tools on simulated data. ...... 81
Table 5.2 Statistics of real long read datasets. ...... 82
Table 5.3 Comparison between assemblies obtained by different tools on real data. ...... 83
Table 5.4 Gene completeness analysis. ...... 84
Table 5.5 Effect of polishing assemblies on the small assembly errors of real datasets. ...... 84

Table A.1 Version, reference, and repository of utilized software. ...... 107

Table B.1 Availability and statistics of real datasets. ...... 111

Table C.1 Availability of real long read datasets. ...... 113
Table C.2 Version, reference, and repository of utilized software. ...... 114

List of Figures

Figure 2.1 PacBio sequencing. Reprinted by permission from Springer Nature: Nature Reviews Genetics (Coming of age: ten years of next-generation sequencing technologies, Sara Goodwin et al.), copyright (2016) [54]. ...... 9
Figure 2.2 Oxford Nanopore Technology sequencing. Reprinted by permission from Springer Nature: Nature Reviews Genetics (Coming of age: ten years of next-generation sequencing technologies, Sara Goodwin et al.), copyright (2016) [54]. ...... 10

Figure 3.1 The speed-up of lordFAST's combined index for searching exact matches in a real human dataset compared to the original FM-index, amounting to a 29% speed-up for finding all anchors in the first step. Note that this combined index uses only 0.25 GB more memory. ...... 30
Figure 3.2 (a) The implicit windows considered on the reference genome for the candidate selection step. If the read length is ℓ, then the windows are of size 2ℓ and overlap by ℓ bases. (b) An example of the candidate selection step. Each dot represents an anchor, and its size represents the weight of the anchor. In this example, f = 2, and since the maximum window score is 11, every window with a score ≤ 5.5 will be ignored. In addition, the window with score 6 is not kept since it is overlapping with a window with score 7. Also, only one of the windows with score 11 will be in the final list of candidates since the other window is overlapping it. ...... 31
Figure 3.3 The read length distribution of 72,708 real PacBio reads from a human (CHM1) dataset. The vertical axis shows the number of bases in each bin rather than the number of reads. At least 99% of the bases are in reads longer than 1000 bases. ...... 34

Figure 3.4 Read mappings are sorted based on their mapping quality in descending order. Then, for each mapping quality threshold, the fraction of mapped reads with mapping quality above the threshold (out of the total number of reads) and the fraction of incorrectly mapped reads (out of the number of mapped reads) are plotted along the curve. ...... 37

Figure 3.5 Examples of covering and non-covering alignments. Suppose x, y, z1, z2, z3, and z4 are different alignments of the same read. In this figure, alignments x and y cover each other as they span the subsequences on the reference genome that have at least 90% overlap. The alignments x and y cover alignments z1 and z2 but not the alignments z3 and z4. On the other hand, the alignments z1, z2, z3, and z4 do not cover either alignment x or y. ...... 40
Figure 3.6 Run-time comparison of different methods for mapping 23,155 real human reads using different threads. Note that the y-axis is in logarithmic scale. ...... 42
Figure 3.7 Memory comparison of different methods for mapping 23,155 real human reads using different threads. Note that the y-axis is in logarithmic scale. ...... 42

Figure 4.1 (a) The notion of overlap for mappings. For two overlapping mappings mi and mj, the weight of the corresponding edge is set to the edit distance between the suffix of mj.seq and its aligned region in L (marked by red in this figure). (b) Reconstruction of the corrected sequence spelled from the shortest path. The spelled string can be easily obtained by concatenation of mapping suffixes from the shortest path. ...... 46
Figure 4.2 An example of a gap (region uncovered by short reads) on a long read, exported from the IGV software. There are so many sequencing errors that mapping short reads in this region is very challenging. In the region shown here, the maximum exact match between the long read and the reference genome is 4 bp long, in a region of size ≈ 150 bp. ...... 49
Figure 4.3 Detecting One-End Anchors (OEAs) for a gap (un-corrected region). OEAs, shown in red, are unmapped or partially mapped reads whose mates, shown in blue, are mapped to corrected regions concordantly (with proper orientation and distance). The assembled contigs, shown in light green, are used to improve the quality of the gap region. ...... 50

Figure 5.1 Precision and recall results in identification of unique short read contigs on 6 different reference genomes. Precision is shown with blue dots and recall is shown with orange dots. Precision is always high across the different experiments, and in all the experiments a large jump in recall happens at a length threshold of 300. ...... 72
Figure 5.2 Possible orientations of aligning two unique contigs to a long read. The direction of contigs aligned to long reads shows the strand of their corresponding sequence. These directions guide us to find the proper edge type. The set of long reads supporting each edge is shown as its label. ...... 73
Figure 5.3 Examples of tips and bubbles in the backbone graph. Here the backbone graph is visualized using Bandage [171]. ...... 74
Figure 5.4 Example of an edge in the backbone graph and its corresponding long read alignments. Partial Order Alignment (POA) is used in constructing the consensus sequence (see subsection 5.1.5). ...... 77
Figure 5.5 Two backbone graphs built from a real PacBio dataset sequenced from a yeast genome. Each graph is visualized with Bandage [171] and colored using its rainbow coloring feature. Each chromosome is colored with a full rainbow spectrum. (Left) Tangled graph built from all short read contigs. (Right) Untangled graph built from unique short read contigs. ...... 78

Figure C.1 An example showing a region of chromosome 4 of C. elegans. ...... 118
Figure C.2 An example showing a region of chromosome X of C. elegans. ...... 119
Figure C.3 An example showing a region of chromosome X of hg38. ...... 120
Figure C.4 An example showing a region of chromosome 18 of hg38. ...... 121
Figure C.5 An example showing a region of chromosome 16 of hg38. ...... 122
Figure C.6 An example showing a region of chromosome 15 of hg38. ...... 123
Figure C.7 An example showing a region of chromosome 14 of hg38. ...... 124
Figure C.8 An example showing a region of chromosome 13 of hg38. ...... 125
Figure C.9 An example showing a region of chromosome 11 of hg38. ...... 126
Figure C.10 An example showing a region of chromosome 9 of hg38. ...... 127

Chapter 1

Introduction

The first draft of the human genome sequence was published in 2001 [88, 160], and an updated version of it was announced in 2003 after a 13-year-long effort called the Human Genome Project. Completion of this high-quality version of the human genome sequence cost $2.7 billion for the International Human Genome Sequencing Consortium (IHGSC) [132] and approximately $300 million for the Celera corporation. The high price tag was mainly a result of using Sanger sequencing [143] throughout the project, which has a high cost and low throughput [78].

Since then, remarkable progress has been made in the genome sequencing industry. The introduction of next-generation sequencing (NGS) technologies – namely 454 [114], Illumina [10], and SOLiD [115] – changed the landscape of genomics research. These technologies provide significantly higher throughput while requiring less DNA input material [137, 38]. This enabled researchers to generate orders of magnitude more data for a fraction of the cost. With the continuous evolution of NGS sequencing machines, the cost of sequencing the human genome at 30× coverage has now dropped to $1000 [67], with some companies pushing the price tag even lower than that [31]. The massive drop in sequencing cost has made NGS sequencing machines a ubiquitous staple of both large research centers and individual labs [137]. A natural outcome of this situation was the sequencing and de novo assembly of more organisms with small [26] or large genomes [144, 50] using new assemblers specifically designed for NGS data [176, 150, 109, 149, 120]. Another application was the resequencing of organisms to study their genomic diversity. In particular, sequencing more individual human genomes made the outlook for the application of precision medicine more promising. The idea of precision medicine – also known as personalized medicine – is to tailor therapy with the best possible response for each patient according to their own molecular characteristics [22, 28].
The first step towards this goal is characterizing the genetic variations in the genome of each individual compared to a normal genome, such as the reference genome. The importance of this problem has led to establishing multiple projects such as the 1000 Genomes Project [29, 30] and The Cancer Genome Atlas (TCGA) project [167]. Genetic variants appear in different sizes and

types: single nucleotide variations (SNVs), short insertions/deletions (indels), and structural variations (SVs), which are extensive structural rearrangements larger than 50 bp [2]. SVs are known to affect the human genome the most among all types of genetic variations [168] and to be more closely associated with susceptibility to many common and rare genetic diseases such as autism, schizophrenia, diabetes, and different types of cancer [168, 110, 155], as well as resistance to the related therapies. Over the years, many tools have been developed for detecting SVs using NGS data. One could classify these tools into two categories: mapping-based or assembly-based approaches. In the first approach, NGS reads are mapped to a reference genome, and abnormal signals are used to identify SVs. These signals may include split-read alignments [175, 75], unexpected read depths [1, 165], or unexpected read pairing [21, 136]. The other approach is to perform either whole-genome assembly [68, 20] or local assembly around the SV breakpoints [102, 76], followed by comparing the assembled contigs against the reference genome to identify SVs. For a comprehensive survey of SV discovery methods using NGS reads, see the reviews by Alkan et al. [2] and Guan et al. [55]. Aside from de novo assembly and variation discovery, NGS technologies have proven their ability in many other applications such as whole-exome sequencing [27], transcriptome characterization [125, 36], finding disease-causing mutations [16, 173, 130], cancer analysis [65, 90], and the study of cancer evolution [49]. However, computational analysis of the data generated by NGS technologies is still far from being perfect. This is mainly due to their short read length relative to the lengths of common repeat sequences [2, 63], which limits their use in resolving and analyzing repetitive regions.
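As a toy illustration of the read-depth signal mentioned above (this is purely illustrative and not the algorithm of any published SV caller; the function name, thresholds, and data are made-up assumptions), a homozygous deletion appears as a stretch of positions whose coverage falls far below the genome-wide average:

```python
# Toy read-depth SV signal: flag intervals whose coverage drops far below
# the mean. Real callers model mappability, GC bias, and diploid copy
# number; here a simple threshold on per-position coverage suffices.

def candidate_deletions(coverage, min_len=3, factor=0.25):
    """Return (start, end) intervals where coverage < factor * mean coverage."""
    mean_cov = sum(coverage) / len(coverage)
    threshold = factor * mean_cov
    intervals, start = [], None
    for i, c in enumerate(coverage):
        if c < threshold:
            if start is None:
                start = i          # open a low-coverage run
        elif start is not None:
            if i - start >= min_len:
                intervals.append((start, i))
            start = None           # close the run
    if start is not None and len(coverage) - start >= min_len:
        intervals.append((start, len(coverage)))
    return intervals

# ~30x coverage with a coverage "hole" at positions 4-8
cov = [30, 29, 31, 30, 2, 1, 0, 1, 2, 30, 28, 31]
print(candidate_deletions(cov))  # → [(4, 9)]
```

In practice the same idea is applied to binned coverage over megabases, and the threshold is derived from a statistical model rather than a fixed fraction of the mean.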
In fact, the low cost offered by NGS technologies comes at the expense of much shorter reads (often ~ 150 bp for NGS compared to ~ 700 bp for Sanger), posing challenges to many downstream analysis algorithms. In particular, genomes assembled solely using NGS short reads are considered draft assemblies, since it is not possible to bridge many of the contigs unambiguously, resulting in fragmented genome assemblies. Accordingly, such assemblies may contain many gaps that can cause genes to be missed; in other words, they are not finished [144, 100, 83]. Although these fragmented assemblies could be bridged using paired-end information via scaffolding techniques, such aggressive strategies often result in misassemblies that would potentially complicate later analysis. On the other hand, projects aiming to analyze the diversity of known organisms have been fundamentally limited, as SV discovery approaches using NGS short reads essentially miss many SVs, especially larger ones [19, 83]. Within complex regions, even the detection of short events is often challenging, since the mapping of short reads cannot be done with certainty. The advent of long read sequencing technologies has come with the promise of overcoming these limitations of NGS technologies. The most popular types of long reads are those generated by single-molecule sequencing (SMS) technologies. SMS technologies

are capable of generating reads that are orders of magnitude longer than NGS short reads. The greater length of these reads enables them to span many of the repetitive regions of the genome, which dramatically improves downstream genomic analysis, especially when focusing on large-scale structures [139]. However, the exceptional length of reads generated by these technologies comes at the expense of much higher error rates compared to Sanger or Illumina sequencing methods. Nevertheless, the development of new tools specifically designed for handling this high error rate has enabled their use in many different projects to date. Among all long read sequencing technologies, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies are the most prominent and widely used. PacBio, which was first introduced in 2009 by Eid et al. [41], is the pioneer in delivering SMS long reads. The first PacBio sequencer, PacBio RS, was commercially released in 2010, and its second version was introduced in 2013. The length and quality of reads generated by PacBio sequencers have improved over the years with updates to their chemistry. Currently, PacBio can generate reads of length >80 kbp with an average length of 10-15 kbp [158]. Although the error rate of these reads is 10-15%, it has been shown that the base-level accuracy of the consensus sequence generated from them can go above 99.99%, given sufficient coverage [93]. The study by Berlin et al. [11] suggested that approximately 50× coverage of PacBio reads is needed for error correcting and assembling them without the help of other technologies. Furthermore, the unique design of the PacBio template (using two single-stranded hairpin loops) enables the sequencing of the same DNA template multiple times. Having the same molecule sequenced multiple times allows for computationally calculating the consensus sequence of that single molecule and generating long reads with up to ~ 99% accuracy.
These long reads are called circular consensus sequence or CCS PacBio reads. Oxford Nanopore Technologies is the biggest competitor of PacBio in the long read technology market. The idea of using nanopores for DNA sequencing is not new and has been discussed or demonstrated since 1996 [94]. However, Oxford Nanopore was the first company to commercialize this idea by releasing its first sequencer, MinION, in 2014. While early Nanopore reads suffered from an error rate of about 30% [77], thanks to the rapid pace of developments both in chemistry and base-calling software, their current error rate is 10-15% [170]. Similar to PacBio reads, a high consensus quality has been reported for Nanopore reads, with base-level accuracy of >99.95% [108]. In terms of length, Oxford Nanopore has undoubtedly broken the barrier by having essentially no theoretical upper limit to the read length [158]. In practice, researchers have been able to successfully generate ultra-long reads with N50 >100 kbp, with some of the longest reads reaching about 1 Mbp [70, 133]. Using a graphical viewer for visualizing the raw signal trace of Nanopore reads together with their mappings to a reference genome, Payne et al. [133] identified a molecule of length >2 Mbp being sequenced, which was broken into 11 consecutive subreads due to weak signal and/or software limitations. Aside from the characteristics of Nanopore long reads, the sequencer

itself has exceptional game-changing features. MinION is the cheapest sequencing machine, with an instrument price of ~ $1000. Additionally, it is a handheld portable device that can operate by connecting to a laptop via a USB cable. This unique characteristic has enabled MinION to be used in remote locations, including Antarctica [40] and the International Space Station [17, 37]. The power of long SMS reads in resolving repeat regions immediately enabled their success in de novo assembly and SV discovery applications, whose progress with NGS data was limited. However, SMS long reads could not be directly passed to the available de novo assemblers (e.g., Celera [119]), which expect much higher base-level accuracy. As a result, error correction methods for increasing the base-level accuracy of long reads started to draw the attention of researchers. Due to the very high error rate of early long reads, they often had to be coupled with NGS data [84, 4]. Such integrative uses of SMS and NGS reads are referred to as hybrid techniques. After a while, with improvements in read accuracy and the development of overlapping algorithms specifically tailored for correcting erroneous long reads [124, 11], non-hybrid correction and assembly became possible [24]. To date, SMS long reads have been utilized for de novo assembly of many genomes, including microbial [82, 108], mammalian [11, 134, 147], and plant genomes [181, 145]. With regard to SV detection using long reads, similar to NGS-based methods, SMS-based methods can be classified into mapping-based and assembly-based approaches. The first step for both of these approaches is mapping long reads to the reference genome. However, due to the lengthy and error-prone nature of SMS reads, the mapping algorithms designed for NGS data either fail or are prohibitively slow. To address this issue, a few mapping algorithms were proposed [124, 146, 98] that are specifically designed for noisy long reads.
Having long reads aligned to the reference genome, SVs can be identified either directly from the mapping signal (for example, using Sniffles [146] or Picky [51]) or through local assembly of long reads showing an SV signal (similar to the approaches used by Chaisson et al. [19] and Fan et al. [44]). SMS technologies have also been used in many other applications such as gap filling of assembled genomes [42], including the human GRCh37 reference genome [19], genome finishing [9, 14], haplotype phasing [39], detection of DNA base modifications [47, 151], sequencing and genotyping medically relevant regions of the genome such as the human leukocyte antigen (HLA) and immunoglobulin heavy chain (IGH) genes [64, 48], novel isoform identification [103, 34], and gene fusion detection [126, 72]. We are experiencing an exciting era in which DNA sequencing is on the cusp of a new revolution. The significant increase in the throughput of SMS technologies, especially with the introduction of the Sequel by PacBio and the PromethION by Oxford Nanopore, has promised a reduction in their cost. On the other hand, generating ever longer reads remains a persistent goal. Ultra-long reads generated by Oxford Nanopore have enabled the full telomere-to-telomere assembly of a human chromosome for the first time [117]. PacBio

sequencers can now generate much longer CCS reads than before, which can improve the quality of assembled genomes [161]. PacBio Iso-Seq is capable of sequencing full-length transcripts with high accuracy. The small size of the MinION sequencer, together with its easy library preparation, has enabled its success in remote locations, for example, in surveillance of the Ebola outbreak in West Africa [135]. These are only some examples of the tremendous potential that SMS technologies offer.

1.1 Contributions

Many of the achievements of SMS technologies mentioned above would not have been possible without the development of new computational tools specifically tailored for error-prone long reads. This thesis focuses on computational algorithms for the three previously mentioned long read analysis problems: (i) long read mapping, (ii) hybrid error correction, and (iii) hybrid assembly of long reads. Tools that address these problems can serve as the first step of many downstream analysis tasks involving long sequencing reads. An example of a task that can take advantage of such tools is SV discovery using long reads. As a first approach, one could map all long reads to the reference genome and employ a mapping-based SV detection tool (e.g., Sniffles [146]). As another approach, long reads showing the signals of SVs could be error corrected using short reads and then utilized to identify SVs more accurately, with base-level resolution. Alternatively, such long reads can be used to locally assemble the region containing one or more SVs, generating contigs that can be compared against the reference genome. A completely different approach could be performing whole genome assembly of the long reads and comparing the assembled contigs against the reference genome to identify SVs. In particular, the following contributions regarding the analysis of SMS long reads are presented in this thesis:

• We introduce lordFAST [62], an efficient mapping tool specifically designed for noisy SMS long reads. lordFAST is a sensitive aligner that can tolerate the high sequencing error rates observed in SMS long reads through its use of variable-length short exact matches. In addition, it is among the fastest long read mappers due to its sparse anchor extraction strategy, which significantly speeds up the chaining of exact matches. Our experiments on simulated data demonstrate the superiority of lordFAST in terms of sensitivity and precision compared to previously published alternatives. lordFAST also provides both clipped and split alignments of long reads, which makes it appropriate for aligning reads originating from regions with large SVs. This enables simpler downstream analysis of its alignments for the task of SV discovery. lordFAST is an open source tool available at https://github.com/vpc-ccg/lordfast as well as on Bioconda.

• We present CoLoRMap [61], a hybrid method for correcting noisy long reads using high-quality Illumina paired-end short reads mapped onto the long reads. CoLoRMap achieves this using two novel ideas: (i) using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and (ii) extending corrected regions via local assembly of unmapped mates of mapped short reads (also known as one-end anchors). We applied CoLoRMap to three real datasets and compared its performance with previously published hybrid error correction tools. We observed that the accuracy of reads corrected by CoLoRMap is on par with that of reads corrected by other tools, while more of its corrected long reads align to the reference genome, both in terms of the number of corrected reads that align and the total size of the aligned regions. We also demonstrated that assemblies generated by the Canu assembler [85] from long reads corrected by CoLoRMap are of slightly better quality. The source code of CoLoRMap is available at https://github.com/sfu-compbio/colormap.

• We introduce HASLR [60], an ultra-fast hybrid assembler for SMS long reads. It requires both NGS and SMS reads from the same sample and is capable of assembling both small and large genomes. HASLR first generates short contigs by assembling the NGS short reads using Minia [23] and then builds a graph structure, called the backbone graph, that is used to connect these short contigs. It finally generates long contigs by filling the gaps between the connected short contigs using consensus sequences obtained from multiple sequence alignment based on the partial order alignment technique. Our experiments show that HASLR is not only the fastest among all tested assemblers, but also the one with the lowest number of misassemblies on all tested samples. Furthermore, its assemblies are on par with those of the other tools in terms of contiguity and accuracy on most of the samples. HASLR is an open source assembler available at https://github.com/vpc-ccg/haslr as well as on Bioconda.

Addressing the above-mentioned problems can have a great impact, as long reads are becoming more widely used due to continuous improvements in the throughput of SMS technologies, which in turn reduce their cost. Considering the current competition between PacBio and Oxford Nanopore, this trend is expected to continue at least for a while. In addition to our primary contributions listed above, our collaboration in a number of projects has led to other contributions to the field of computational biology, which can be found in [86], [120], [111], [48], and [5].

1.2 Organization of the thesis

The rest of the thesis is organized as follows. Chapter 2 provides the related background, including details of single-molecule sequencing (SMS) technologies, as well as descriptions of the problems we address and the state-of-the-art work on each of them. In Chapter 3, we introduce lordFAST, our fast and sensitive long read mapper, alongside its experimental results. Chapter 4 presents CoLoRMap, our hybrid error correction tool. Note that CoLoRMap was published in 2016, after which a number of other hybrid error correction tools have been developed. Therefore, we provide results taken from a comparative study by Zhang et al. [177] to show where it currently stands in the field of hybrid error correction. In Chapter 5, we introduce HASLR, our ultra-fast hybrid assembler, and show how it compares against currently available tools. We also demonstrate how its speed scales to genomes as large as the human genome. Finally, we conclude the thesis in Chapter 6 with a summary of our contributions to the field, as well as a discussion of possible directions for future work.

Chapter 2

Background and Related Work

2.1 Single-molecule sequencing technologies

The major difference between single-molecule sequencing (SMS) and next-generation sequencing (NGS) technologies is that, unlike NGS technologies, which rely on clonal amplification of DNA fragments on a solid surface, SMS technologies sequence native DNA from a single DNA molecule [54]. This prevents amplification biases in SMS reads. In this section, we briefly explain how these technologies sequence a DNA fragment.

2.1.1 Pacific Biosciences

The PacBio RS II sequencer, PacBio's first commercially available sequencing instrument, is fed with the so-called SMRTbell template [157]. This template consists of a double-stranded DNA fragment (the insert of interest) ligated to a single-stranded hairpin loop at either end (see Figure 2.1). Each hairpin loop is the complementary sequence of a sequencing primer, which provides a site for primer binding. The instrument utilizes specialized flow cells containing many thousands of picoliter-sized wells with transparent bottoms, called zero-mode waveguides (ZMWs). The sequencing is carried out in the ZMW by a DNA polymerase fixed to the bottom of the well (see Figure 2.1). This allows the DNA strand from the SMRTbell template to progress through the polymerase. The polymerase replicates the DNA template by incorporating phosphate-labeled nucleotides, each incorporation emitting colored light. A camera records the color of the emitted light over time as the sequencing signal (as depicted at the bottom of Figure 2.1). The unique circular design of the SMRTbell template enables it to pass through the polymerase continuously while the polymerase remains active. Therefore, each template can be sequenced multiple times, depending on its length. In general, shorter DNA templates have a higher chance of being sequenced repeatedly, while it is difficult for templates longer than ∼ 3 Kbp to be sequenced multiple times. The beauty of this technique is that sequencing the same template multiple times allows a consensus sequence of higher quality to be obtained, since the errors are distributed

randomly.

Figure 2.1: PacBio sequencing. Reprinted by permission from Springer Nature: Nature Reviews Genetics (Coming of age: ten years of next-generation sequencing technologies, Sara Goodwin et al.), copyright (2016) [54].

If at least two full passes of the template are available, the obtained consensus is called a circular consensus sequence (CCS). Otherwise, each incomplete pass of the template generates a sequence called a subread. Similar to CCS, all subreads of the same template can be used to generate a consensus called a read of insert. Although a high error rate (∼ 15%) is evident for reads obtained from single-pass sequencing, the consensus reads can reach accuracy > 99%, depending on the number of passes through the template.

2.1.2 Oxford Nanopore Technology

In principle, nanopore-based sequencing relies on the passage of DNA through a tiny channel called a nanopore. Oxford Nanopore Technology is leading the development and commercialization of this approach. The Oxford Nanopore MinION instrument uses flow cells with hundreds of embedded nanometer-sized protein pores (the nanopores).

Figure 2.2: Oxford Nanopore Technology sequencing. Reprinted by permission from Springer Nature: Nature Reviews Genetics (Coming of age: ten years of next-generation sequencing technologies, Sara Goodwin et al.), copyright (2016) [54].

The DNA fragment is ligated with two adapters (see Figure 2.2). At one end, the adapter is a hairpin sequence that allows the two strands of the DNA fragment to pass consecutively. The other adapter is bound to a protein enzyme that interacts with the nanopore to direct single-stranded DNA through it. The DNA strand starts passing through the nanopore as a voltage is applied across it, causing a flow of electric current. While traversing the nanopore, the DNA strand partially blocks the electric current passing through it. In principle, the flow of current is a function of the specific k-mer present in the nanopore; in other words, rather than four possible signals, there are more than 1000 possible signals, one for each specific k-mer. The changes in the flow of current are recorded over time, and this is the observable output of the system. The output of the sequencer thus needs post-processing so that the electrical signal data is decoded into a DNA sequence.

This post-processing step is called base-calling. To perform it, the sequence of electric current measurements is segmented into "events" based on changes in the current. Ideally, two consecutive events correspond to a shift of the DNA context by a single base. However, in practice, base-calling is not trivial for two reasons: (i) the rate of sampling the electric current is constant while the traversal of the single-stranded DNA through the nanopore is stochastic, and (ii) the segmentation process is noisy. The consequence is the presence of "stays", where two consecutive events correspond to the same DNA context, and "skips", where two consecutive events correspond to DNA contexts differing by more than one base, in the sequence of events. The official base-calling software provided by Oxford Nanopore is Metrichor. Metrichor was first available only via a cloud computing platform, meaning that private analysis of Oxford Nanopore data was not possible. This motivated the development of free open-source tools for base-calling, namely Nanocall [32] and DeepNano [13], which achieved comparable results. Still, Metrichor remains the most reliable base-calling tool for Oxford Nanopore data. Later, Oxford Nanopore released the code of Metrichor, enabling base-calling on a standalone machine. In this work, we do not cover methods for base-calling, as it seems a less pressing problem, especially after the release of Oxford Nanopore's official base-caller. Similar to PacBio sequencing, the use of a hairpin adapter allows the sequencing of both strands of the DNA fragment. These two "1D" reads can be aligned together to generate a consensus read of higher quality known as a "2D" read. This reduces the error rate from ∼ 30% in 1D reads to ∼ 15% in 2D reads.

2.1.3 Synthetic long reads

Synthetic long read technologies do not generate actual long reads of the native DNA fragment but utilize computational tools to reconstruct long-range information. These approaches rely on unique barcoding techniques during library preparation that allow later grouping of the reads from the same fragment and computational reconstruction of long-range information [54]. There are a few synthetic long read technologies, among which Illumina's TruSeq synthetic long read technology and 10x Genomics are popular.

TruSeq synthetic long read sequencing technology, formerly known as Illumina Moleculo, breaks genomic DNA into templates of about 10 Kbp and partitions them into many wells, with about 3,000 fragments per well. The templates in each well are amplified and sheared into ∼ 350 bp long fragments. DNA fragments from each well are barcoded and used for standard library preparation and sequencing on an Illumina HiSeq instrument. The barcodes in the sequenced reads are used to identify reads from the same well, which are then computationally assembled to generate synthetic long reads [162].

10x Genomics uses the same general idea of partitioning DNA fragments (up to 100 Kbp long), barcoding, amplification, fragmentation, and standard Illumina short-read sequencing [179]. However, there are two main differences compared to the TruSeq synthetic technology. First, DNA fragments are distributed to about one million droplet partitions, which significantly reduces the number of DNA molecules in each partition. Increasing the number of partitions reduces the chance of having two fragments that cover the same loci in a single partition. This results in the potential for capturing more accurate phase information [79]. Second, unlike the TruSeq synthetic technology, 10x Genomics does not try to obtain full coverage of the long DNA fragment, due to random priming and "nonprocessive" polymerase amplification [179]. This means that assembling the long fragments is not possible. Instead, the reads are sorted according to their genomic position and grouped together based on their barcode as linked-reads. As mentioned before, synthetic long reads are not generated by sequencing the long native DNA. Although they are helpful for some applications such as haplotype phasing (in which the goal is to reconstruct the two haplotypes of a diploid genome), we do not explore their use in this thesis. The reason is that these technologies rely on computational methods for assembling/linking reads; in fact, they still suffer from the limitations of short read sequencing technologies in highly repetitive regions of complex genomes.

2.2 Definitions and Notations

Suppose Σ = {A, C, G, T} is the alphabet of nucleotides. A DNA sequence is defined as

S = s1s2s3...sn, which is a string over Σ. The length of S is denoted by |S| = n. We denote by S[i, j] = sisi+1...sj a subsequence of S, by S[−, i] a prefix of S, and by S[i, −] a suffix of S. A k-mer of a sequence S is a subsequence of S of length k. The edit distance between two sequences S1 and S2 is the minimum number of edit operations (insertions, deletions, substitutions) that transform S1 into S2. A reference genome G of an organism is a long DNA sequence obtained from one or multiple individual genomes that represents the normal genome of that organism. A sequenced read R is a DNA sequence and a short subsequence of a donor genome D, which can be different from G.
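The edit distance defined above can be computed with the classic dynamic-programming recurrence over prefixes of the two sequences. The following is a minimal sketch (the function name is ours), kept to two rolling rows so memory is linear in |S2|:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    transforming s1 into s2 (Levenshtein distance)."""
    n, m = len(s1), len(s2)
    prev = list(range(m + 1))  # distances for the empty prefix of s1
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            sub = prev[j - 1] + (s1[i - 1] != s2[j - 1])
            curr[j] = min(prev[j] + 1,      # deletion from s1
                          curr[j - 1] + 1,  # insertion into s1
                          sub)              # match or substitution
        prev = curr
    return prev[m]
```

For example, edit_distance("ACGT", "AGT") is 1, corresponding to a single deletion of C.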

Let R = R1, R2, ..., R|R| be a set of reads sequenced from a donor genome D. The de Bruijn graph (DBG) of order k built on R is a directed graph DGk. The nodes of DGk are a (sub-)set of all possible k-mers, and there is a directed edge from node (k-mer) v to node (k-mer) u if (i) the suffix of length k − 1 of v is identical to the prefix of length k − 1 of u, i.e., v[2, −] = u[−, k − 1], and (ii) v and u are consecutive in a read of R. A string graph, on the other hand, is an overlap graph whose nodes are not fixed-length sequences but sequences of arbitrary length, and there is an edge between two nodes if their corresponding sequences

overlap. The DBG and the string graph are two frameworks for assembling genomes. They generate DNA sequences called contigs that are ideally subsequences of the donor genome D. A hash table index, in its general form, keeps all the locations on the reference genome that exactly match every possible k-mer. This data structure provides a quick lookup of fixed-size exact matches in constant time. A suffix array [112] of a sequence S is defined as the sorted array of all suffixes {S[i, −] | 1 ≤ i ≤ n} of S. As a result, all occurrences of a substring of S appear in an interval (consecutive rows) of the suffix array. This enables finding exact matches of arbitrary length using a suffix array index. A BWT-FM index [45] is another index capable of finding variable-length exact matches; it is a compressed representation of the text and is usually built from the suffix array of the text. Currently available mapping tools usually use either a hash table index or a suffix array/BWT-FM index for finding exact matches.
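To make the de Bruijn graph definition above concrete, a toy construction (names are ours) that collects k-mer nodes and the edges between consecutive k-mers of each read might look like:

```python
from collections import defaultdict

def build_dbg(reads, k):
    """Toy de Bruijn graph of order k: nodes are the k-mers observed in
    the reads; an edge v -> u is added when v and u are consecutive
    k-mers of some read, so the (k-1)-suffix of v equals the
    (k-1)-prefix of u."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            v, u = read[i:i + k], read[i + 1:i + k + 1]
            assert v[1:] == u[:-1]  # overlap property holds by construction
            edges[v].add(u)
    return edges

# k-mers of ACGTAC for k = 3: ACG, CGT, GTA, TAC
dbg = build_dbg(["ACGTAC"], 3)
```

Real assemblers store k-mers in compact hashed form and track multiplicities, but the node/edge structure is the same.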

2.3 Long Read Mapping

The very first step of most downstream analysis pipelines is mapping the reads to a reference genome. Here, we explore the mapping problem for long noisy reads. Some methods find and report only the approximate mapping location of one long read on another; this is mainly of interest for the de novo assembly task and, more specifically, for overlap detection between long reads. Examples of such methods are minimap [97] (which will be explained in Subsection 2.5.2) and mashmap [69]. The focus of this section, however, is on methods that generate a detailed base-level alignment of long reads to the reference genome. Among the existing published methods, BLASR, rHAT, and GraphMap use the same general strategy of first finding approximate candidate mapping regions on the reference genome, followed by a more detailed alignment of the read to the candidate regions. BWA-MEM, however, works by finding strong local alignments from exact matches and chaining these local alignments into a full alignment of the whole read. In this section, we briefly explain how each of these tools maps long noisy reads to the reference genome. We describe the process of aligning a single long read; the same process is applied to multiple reads in parallel using multiple CPU cores.

BLASR

BLASR [18] is the first tool specifically designed for PacBio reads. It starts by finding the longest common prefix (of minimum length 12 bp) between each suffix of the long read and the reference genome using a suffix array index, storing these matches as the set of anchors. The set of anchors is sorted by the reference coordinates of the anchors. These anchors are then clustered into intervals roughly the length of the read. A global chaining of the anchors in each interval is calculated, which is a maximal subset of co-linear anchors increasing in both

read and reference coordinates [127]. The top 10 chains with the highest scores are kept, each of which defines a candidate mapping interval. Next, an approximate alignment of the long read to the candidate interval is obtained. To do this, a set of short exact matches (of length 11 bp) is found between the read and the candidate interval. This set of matches is used for sparse dynamic programming (SDP) [43], which is in principle the same as chaining short exact matches. Finally, since the SDP alignment does not align all the bases, a detailed banded alignment is performed to align the remaining bases, with the SDP alignment serving as a guide.

rHAT

rHAT [107] is a hash-table based mapper designed for PacBio reads that uses a heuristic to estimate the approximate mapping location of each read. This is done by finding potential mapping regions for the middle 1 Kbp segment of the long read on the reference genome through a quick seed counting scheme. To do this, the reference genome is partitioned into windows of length 2 Kbp, each overlapping its neighboring windows by 1 Kbp. The overlap between neighboring windows is essential to ensure that the middle segment is fully contained in at least one window. A collision-free hash table is built that maps each reference k-mer (by default 13-mers) to the windows containing it. To map the middle 1 Kbp segment of a read, the number of occurrences of its k-mers in each window is calculated, and the five windows with the highest k-mer counts are kept as candidate mapping windows for the middle segment. The candidate mapping regions for the whole read are then intervals of size ∼ |R| flanking the candidate windows. For each candidate mapping region, a more detailed mapping is obtained as follows. A lookup table is built indexing all the l-mers (by default 11-mers), and seeds are extracted for the whole read from this lookup table.
rHAT then finds a local chain of these seeds via sparse dynamic programming, using a directed acyclic graph (DAG) as the skeleton of the alignment. Finally, the gaps between the seeds in the selected chain are filled using a banded Smith-Waterman algorithm.
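The chaining step used by both BLASR and rHAT, selecting a maximal-score subset of anchors that increase in both read and reference coordinates, can be sketched as a quadratic dynamic program. Real implementations use faster sparse formulations, and the anchor representation below is our own simplification:

```python
def chain_anchors(anchors):
    """anchors: list of (read_pos, ref_pos, length) exact matches.
    Returns a maximal-score chain of co-linear anchors, i.e. a subset
    increasing in both read and reference coordinates. O(n^2) DP sketch."""
    if not anchors:
        return []
    anchors = sorted(anchors)                 # by read_pos, then ref_pos
    n = len(anchors)
    score = [a[2] for a in anchors]           # best chain score ending at i
    back = [-1] * n
    for i in range(n):
        qi, ri, li = anchors[i]
        for j in range(i):
            qj, rj, lj = anchors[j]
            # j can precede i only if i starts after j ends on both sequences
            if qj + lj <= qi and rj + lj <= ri and score[j] + li > score[i]:
                score[i], back[i] = score[j] + li, j
    # backtrack from the best-scoring end anchor
    i = max(range(n), key=lambda t: score[t])
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = back[i]
    return chain[::-1]
```

Note how an off-diagonal anchor such as (3, 100, 4) below is rejected because it is not co-linear with the rest of the chain.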

GraphMap

GraphMap [154] is specifically designed for Oxford Nanopore (both 1D and 2D) reads. However, the authors claim that it can also successfully map PacBio reads with default parameters. To increase sensitivity, GraphMap uses gapped spaced seeds to account for the different types of sequencing errors (i.e., substitutions, insertions, and deletions) in the reads. Using gapped seeds is a strategy for increasing the sensitivity of inexact match lookup while remaining fast.

To find approximate candidate mapping regions, GraphMap partitions the reference genome into bins of length |R|/3. This choice of bin size guarantees that at least one bin is fully covered by the read. The number of seed hits is counted for each bin, and bins with a count above 75% of the maximum count are selected. A candidate region is then defined as an interval of length ∼ |R| on the reference that expands around the corresponding bin's start and end locations. Next, GraphMap builds an "alignment graph" from short k-mers of the target sequence. A set of anchors is obtained by finding an exact walk in the alignment graph following the sequence of the query. A chain of anchors is calculated by solving an instance of the longest common subsequence (LCS) problem, and this chain is further refined to generate the final alignment.
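The bin-counting heuristic described above (bins of length |R|/3, keeping bins whose seed-hit count exceeds 75% of the maximum) might be sketched as follows; the function name and input format are ours:

```python
def candidate_bins(seed_hits, read_len, frac=0.75):
    """seed_hits: reference positions of seed matches for one read.
    Counts hits per bin of length read_len // 3 and keeps bins whose
    count exceeds frac * max_count, following GraphMap's heuristic."""
    bin_len = max(1, read_len // 3)
    counts = {}
    for pos in seed_hits:
        counts[pos // bin_len] = counts.get(pos // bin_len, 0) + 1
    if not counts:
        return []
    cutoff = frac * max(counts.values())
    # report reference start coordinates of bins passing the threshold;
    # each would then be expanded to a candidate region of length ~read_len
    return sorted(b * bin_len for b, c in counts.items() if c > cutoff)
```

With seed hits clustered at one locus plus a stray hit elsewhere, only the clustered bin survives the threshold.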

BWA-MEM

BWA-MEM [96] was originally designed to align short sequencing reads as well as assembled contigs to a reference genome. Later, it was extended to map long SMS reads by tuning its alignment parameters (via options -x pacbio for PacBio reads and -x ont2d for Oxford Nanopore 2D reads). BWA-MEM starts by finding the longest exact match covering each query position as a possible initial anchor. In BWA's terminology, these matches are called super-maximal exact matches (SMEMs). Long SMEMs may cause mismapping because correct anchors can be missed. To address this issue, each SMEM longer than 28 bp is considered for the extraction of additional shorter anchors. More specifically, the longest exact matches that cover the middle base of the SMEM and occur more frequently in the reference genome than the SMEM are added to the list of anchors. The obtained set of anchors is then grouped into local chains, and "contained" chains are filtered out. The anchors are ranked first by the length of the chain they belong to and then by anchor length. The anchors are considered for extension from the highest rank to the lowest, unless they are already contained in an alignment. The extension of anchors is done with banded affine-gap-penalty dynamic programming. BWA-MEM stops the extension if the best extension score deviates by more than a cutoff parameter from the score before the extension. This cutoff parameter automatically determines whether local alignments are merged into a longer alignment.

NGMLR

NGMLR [146] follows the seed-and-extend paradigm. For each long read, it first finds subsegments of the read that can be aligned to the reference genome with a single linear alignment of high similarity. It then performs alignment with the Smith-Waterman algorithm for each of these linearly mapped segments to compute a pairwise sequence alignment. Finally, it selects the set of linear alignments with the highest joint score.

The main advantage of NGMLR compared to other tools is its use of a convex gap-cost model during Smith-Waterman dynamic programming alignment, which accounts for sequencing errors and SVs at the same time. This gap-cost model enables NGMLR to localize SV events better and to align the long read around SV breakpoints more precisely.

minimap2

Minimap2 [98] is the successor of minimap [97], extending it to map reads to a reference genome with base-level alignment. It starts by extracting minimizers from the reference and indexing them in a hash table. It then identifies minimizers on each long read and locates their occurrences in the index as seeds. Next, the seeds are sorted, and chaining is performed to find all chains. By analyzing the overlapping chains, it selects one best chain (known as the primary chain) for each query segment. In the end, it performs base-level alignment for each chain using dynamic programming. Minimap2 not only maps SMS long reads onto the reference genome, but is also capable of performing self-overlapping, splice-aware mapping, and short read mapping.
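Minimizer extraction, the sampling scheme at the core of minimap2's index, keeps one representative k-mer per window of w consecutive k-mers. A simplified sketch, using plain lexicographic order instead of the random hashing and strand canonicalization used in practice:

```python
def minimizers(seq, k=5, w=4):
    """Return the set of (kmer, position) minimizers of seq: for every
    window of w consecutive k-mers, keep the smallest k-mer (here by
    plain lexicographic order rather than a random hash)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: window[j])
        selected.add((window[best], start + best))
    return selected
```

Because adjacent windows usually share their minimum, far fewer than one seed per position is stored, which is what keeps the index small.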

2.4 Error correction of long noisy reads

As mentioned in Section 2.1, the observed error rate of long read sequencing technologies is high. This means that the reads they generate cannot be used directly in some applications, notably de novo assembly and structural variation discovery. Therefore, correcting these long reads is an active field of research. There are two different approaches for error correction of long noisy reads: (i) self-correction (non-hybrid) approaches, which correct long reads using only the long reads themselves, and (ii) hybrid correction approaches, which use complementary NGS short reads to correct long reads. In the following, we briefly describe the general ideas of existing methods of each approach.

2.4.1 Hybrid correction

Hybrid correction methods correct long reads by taking advantage of a set of high-quality short reads sequenced from the same sample using NGS technologies. This is done by aligning one set onto the other and correcting every long read using these alignments. In general, hybrid correction methods fall into two categories: (i) methods that map short reads to the long reads, and (ii) methods that align every long read to an assembly of the short reads. The latter category benefits from faster alignment compared to the former, since the assembled data set is much smaller than the original set of short reads. In the following, these methods are explained in more detail.

PacBioToCA

PacBioToCA, used in the PBcR assembly pipeline [84], is a hybrid correction module that was implemented as part of the Celera Assembler [119]. In this pipeline, first, all short reads are mapped to all long reads. The requirement for computing an alignment is the observation of shared seed sequences of length 14 bp. In addition, only alignments that span the full length of a short read are considered. Mappings to multiple locations (repeats) are handled by choosing the highest-identity mapping. After obtaining all alignments of short reads to long reads, each long read is corrected by generating a consensus sequence. If there are regions with zero or insufficient coverage of short read alignments, the consensus sequence is split to generate shorter high-quality sequences.

LSC

In general, LSC [6] follows the same strategy as PacBioToCA. The main difference is that the short and long reads are transformed into a compressed form using homopolymer compression, in which homopolymer runs are replaced by a single nucleotide of the same type. The authors claim that this transformation improves alignment sensitivity without sacrificing accuracy.

proovread

Similar to PacBioToCA and LSC, proovread [59] works based on consensus calling from short read alignments. However, the correction process is done iteratively using three subsets of short reads (sampled so as to complement each other) with successively increasing sensitivity. The intuition behind this strategy is to reduce the search space in each cycle. More specifically, the first cycle maps only 20% of the short reads with low-sensitivity (high-speed) parameters. In the second cycle, the same process is repeated for another 30% of the short reads with higher sensitivity. The third cycle of correction maps the remaining 50% of the short reads while focusing only on the remaining error-enriched regions.

LoRDEC

LoRDEC [141] starts by building a de Bruijn graph (DBG) from the set of short reads (with a k-mer size of ∼ 19 bp). The general idea for correcting a long read is to thread it through the DBG and find a path in the DBG that minimizes the edit distance to the long read. To do this, LoRDEC scans the long read and identifies all so-called solid k-mers, i.e., those k-mers that also appear in the DBG. Assuming solid k-mers are error-free, the problem reduces to correcting the regions of the long read between solid k-mers. To correct these inner regions, each solid k-mer is considered as a source and the five upstream solid k-mers (not from the same run of solid k-mers) as targets. For each pair of source and target, an exhaustive depth-first search is performed to find an optimal path (one

with minimal edit distance) between the source and the target in the DBG. Such an optimal path spells a sequence that can be taken as the corrected sequence of the inner region. Once this is done for all source/target pairs, a path graph is built that records all the optimal paths found between solid k-mers. The path graph is a directed graph in which the solid k-mers are the nodes and the optimal paths between solid k-mers correspond to edges. Finally, to obtain a sequence that minimizes the edit distance to the whole long read, Dijkstra's algorithm is employed to find the shortest path in the path graph between the leftmost and rightmost solid k-mers.
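The last step of LoRDEC, finding the shortest path in the path graph, is a direct application of Dijkstra's algorithm once the optimal inner-region paths and their edit costs are known. A minimal sketch of that step (the path graph is assumed given as adjacency lists with precomputed edit costs; names are ours):

```python
import heapq

def best_correction_path(path_graph, source, target):
    """path_graph: {solid_kmer: [(next_solid_kmer, edit_cost), ...]}.
    Dijkstra from the leftmost to the rightmost solid k-mer, returning
    (total edit cost, list of solid k-mers on the chosen path)."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == target:
            break
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, w in path_graph.get(v, []):
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u], prev[u] = nd, v
                heapq.heappush(heap, (nd, u))
    path, v = [target], target
    while v != source:
        v = prev[v]
        path.append(v)
    return dist[target], path[::-1]
```

The spelled-out corrected read is then obtained by concatenating the inner-region sequences along the returned path.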

Jabba

Jabba [116] builds further on LoRDEC's idea of correcting long reads based on alignments to a de Bruijn graph (DBG). The main difference compared to LoRDEC is that Jabba builds the DBG on longer k-mers (∼ 75 bp). Using longer k-mers has two advantages: (i) the DBG is less complex because many short repeats are resolved, and (ii) long k-mers enable Jabba to use maximal exact matches (MEMs) for aligning long reads to the DBG. As a result, the alignment step of Jabba is expected to be faster than that of LoRDEC. To avoid using erroneous long k-mers, Jabba first corrects the set of short reads using Karect (with a small k-mer size of ∼ 13 bp) before building the DBG. The DBG is then built, and an index over the node sequences is constructed using essaMEM to support the extraction of MEMs during correction. To correct a long read, essaMEM is used to detect MEMs between the nodes of the DBG and the long read. These MEMs are grouped into local alignments, and the obtained local alignments are chained to find the final alignment.

Nanocorr

Nanocorr [53] is a hybrid error correction tool for Oxford Nanopore long reads. It follows a strategy similar to that of PacBioToCA [84]. First, Illumina short reads are mapped to the long reads using BLAST. Since there are potentially many spurious alignments, it is essential to select a correct subset of the short read alignments. For each long read, Nanocorr uses a dynamic programming algorithm based on the longest increasing subsequence (LIS) problem to select such a subset. The selected short read alignments are then used to correct the long read with the pbdagcon consensus algorithm (described in the next subsection).
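The LIS-style selection step can be illustrated with a small dynamic program; this is our own simplification (alignments as scored intervals on the long read, chained in increasing order), not Nanocorr's actual implementation:

```python
def select_alignments(alns):
    """Pick a consistent chain of short-read alignments along a long read.

    `alns` are (start, end, score) intervals on the long read; a DP in the
    spirit of the longest increasing subsequence keeps the highest-scoring
    subset whose intervals appear in increasing, non-overlapping order.
    """
    alns = sorted(alns)
    n = len(alns)
    best = [a[2] for a in alns]   # best chain score ending at alignment i
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            if alns[j][1] <= alns[i][0] and best[j] + alns[i][2] > best[i]:
                best[i] = best[j] + alns[i][2]
                prev[i] = j
    i = max(range(n), key=lambda t: best[t])   # backtrack from the best end
    chain = []
    while i != -1:
        chain.append(alns[i])
        i = prev[i]
    return chain[::-1]
```

Spurious alignments that conflict with the dominant ordering are simply left out of the optimal chain.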

2.4.2 Self-correction

In principle, the idea behind the self-correction approach is to exploit error-free segments across the whole set of long reads to correct every individual long read. The rationale is that sequencing errors are distributed randomly in long reads generated by SMS technologies. As a result, theoretically, the consensus sequence of a set of reads derived from the same genomic region can achieve any desired accuracy level by increasing the coverage [124]. To do this, self-correction methods require aligning each long read against the other long reads. There are two major concerns regarding the self-correction approach. First, aligning long reads against long reads is both challenging and time-consuming, especially at the high error rate of SMS reads. Second, as mentioned, high accuracy can be achieved only with a high-coverage set of long reads. For a genome as large as the human genome, this can be a major issue, as obtaining high coverage is currently expensive with SMS technologies. Nevertheless, developing methods that solely use SMS reads is an active field of research. Here, we summarize currently available methods for non-hybrid correction.

pbdagcon

pbdagcon is an error correction algorithm, first used in the HGAP assembler [24], which generates a consensus sequence from a set of alignments to a template sequence. Consider a single long read subject to error correction, called a seed read, as well as the alignments of some long reads to this seed read. pbdagcon exploits the idea of representing a multiple sequence alignment using a directed acyclic graph (DAG). In this DAG, nodes are labeled by DNA bases A, C, G, or T, and edges connect nodes whose corresponding bases are consecutive in a read. An initial DAG, a linear directed graph corresponding to the seed read, is constructed first. The information from the alignments is added to this DAG iteratively, resulting in a multi-graph: for each alignment, an edge is inserted into the graph between nodes corresponding to consecutive bases (new nodes are added if needed). A score is associated with each node; it is positive if the number of out-edges is more than half of the number of all possible out-edges, and negative otherwise. The consensus is built by finding a path with maximum score sum among all possible paths. In effect, pbdagcon generates the consensus sequence by stitching high-quality local alignments based on the information in the DAG.

falcon_sense

falcon_sense is another correction algorithm, first used in [11], which is faster than pbdagcon at the cost of some accuracy. Consider a single seed read as well as a set of long reads that are supposed to overlap this seed read. falcon_sense starts by aligning all long reads to the seed read using a fast aligner [122]. Each mismatch is converted to an insertion followed by a deletion. For each seed read position, falcon_sense generates a tag denoting the match, insertion, or deletion at that position, and the number of tags of each type is counted. In the end, a consensus is generated by a majority vote over the tags.
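The per-position majority vote of falcon_sense can be sketched as follows (the tag encoding and names are our own simplification):

```python
from collections import Counter

def consensus_by_voting(tags_per_position):
    """Majority vote over per-position alignment tags, in the spirit of
    falcon_sense.

    Each position holds tags such as ('M', 'A') for a match emitting base A
    or ('D', '-') for a deletion; the winning tag's base is emitted, and
    '-' (deletion) contributes nothing to the consensus.
    """
    out = []
    for tags in tags_per_position:
        tag, base = Counter(tags).most_common(1)[0][0]
        if base != '-':
            out.append(base)
    return "".join(out)
```

A column where most supporting reads vote for a deletion is dropped from the consensus, which is how spurious inserted bases in the seed read get removed.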

LoRMA

LoRMA [142] extends the idea of LoRDEC to self-correction of long reads. In the first phase of correction, LoRMA exploits the idea of using a DBG for error correction (as explained for LoRDEC); however, in the absence of high-quality short reads, the DBG is built from the set of long reads. LoRMA iteratively corrects the long reads in three rounds while increasing the k-mer size. Since the error rate in raw long reads is high, the first iteration has to use a relatively small k-mer size. In addition, only k-mers that appear at least four times in the whole long read set are used. The rationale for this iterative approach is that after each round of correction, the error-free regions in the long reads become somewhat longer, and increasing the k-mer size for the next round decreases the complexity of the DBG significantly. By default, LoRMA uses three iterations of LoRDEC correction with k-mer sizes 19, 40, and 61. In the second phase of correction, LoRMA generates a consensus using a multiple sequence alignment of the reads to each other. To do this, LoRMA builds a DBG (this time with all k-mers) from the set of long reads corrected in the first phase. Then, for each long read, a unique path in the DBG is found that spells it out, and the read id is stored on the edges of that path. After processing all long reads, to correct a single long read, its unique path in the DBG is traversed again and all the long reads sharing enough k-mers are collected. The set of collected long reads is used to generate a consensus based on an iterative computation of a multiple sequence alignment of the long reads. This iterative procedure makes LoRMA extremely slow; thus, its usage is limited to bacterial-sized genomes.
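The abundance filter used in LoRMA's first round (keeping only k-mers seen at least four times across the long-read set) can be sketched as:

```python
from collections import Counter

def trusted_kmers(reads, k, min_count=4):
    """k-mers kept for the first DBG round: only those occurring at least
    min_count times across the read set, so that most sequencing errors
    (which yield rare k-mers) are excluded from the graph."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {km for km, c in counts.items() if c >= min_count}
```

Because errors are random, an erroneous k-mer rarely recurs four times, while true genomic k-mers do at sufficient coverage; this is the same abundance argument that underlies the consensus accuracy claim above.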

2.5 de novo genome assembly

In general, the de novo genome assembly problem, in which no information from an already assembled related genome is used, has always been challenging. In practice, the existence of repeat sequences (longer than the length of the reads) makes it impossible to obtain a perfect assembly. de novo assembly is usually done using either the de Bruijn graph (DBG) or the overlap-layout-consensus (OLC) framework. In both cases, a non-branching path of edges corresponds to a unitig, a special kind of contig that is (ideally) entirely consistent with all the data and contains no misassembly. Extending unitigs to contigs is often challenging due to ambiguous branches of the graph. Using SMS long reads, it is possible to obtain a less complicated graph or to resolve ambiguous branches in an existing graph. As a result, SMS long reads are helpful in achieving highly contiguous assemblies. For instance, for a human genome, SMS based assembly gives an NG50 of 4,320 Kbp, which is far longer than the 129 Kbp of an NGS based assembly of the same genome [11]. Another example is the assembly of the highly repetitive plant genome Aegilops tauschii, for which the NGS based assembler SOAPdenovo2 [109] yields an assembly covering only ∼ 65% of the estimated genome size, while a hybrid assembly of the same genome using both SMS and NGS reads covers almost the whole expected length.

2.5.1 Hybrid assembly

The use of SMS reads for de novo assembly was first explored for the scaffolding step of NGS based assemblers. AHA [9], PBJelly [42], and SSPACE-LongRead [12] are among these methods. Since these methods still generate contigs from NGS data, they do not use the full power of SMS reads. In the following, we introduce some hybrid de novo assembly methods that take advantage of SMS long reads to generate considerably longer contigs.

PBcR

The PBcR assembly pipeline [84] was the first to prove the potential of SMS long reads for the assembly task. The PacBioToCA correction method (explained in Subsection 2.4.1) was used to correct sequencing errors in the raw long reads. Corrected long reads were then assembled using an OLC assembly technique; to this end, the Celera assembler [119] was modified to handle longer input reads.

hybridSPAdes

hybridSPAdes [4] uses long reads to resolve ambiguities in an assembly graph. It starts by constructing a de Bruijn graph (DBG) and performing various graph simplifications to transform it into an assembly graph, AG, in which edges represent unitigs. Each long read is then mapped to AG by finding a path in AG that spells out the error-free version of the long read. In order to do this, a set of k-mer matches between the long read and the edges of AG is obtained. These k-mer matches guide the alignment of the long read. More specifically, a chain of k-mer matches on each edge is obtained, and these chains are converted to alignments by (i) merging chains in trivial cases where their corresponding edges are consecutive in AG, and (ii) performing an exhaustive search in more complicated cases where their corresponding edges are not consecutive. In the latter case, a path with minimum edit distance to the long read is chosen. After aligning all long reads to AG, hybridSPAdes uses the alignments for two purposes. First, they are used for closing gaps that are present in AG due to lack of coverage in some genomic regions. These gaps can be identified as two dead-end edges (one ending in a vertex without outgoing edges and the other starting in a vertex without incoming edges). If multiple long reads align to both of these dead-end edges, the consensus sequence of those long reads is generated and used for gap closure. Second, ambiguous branches of AG can be resolved using long read alignments. Note that each long read alignment corresponds to a path in AG. Therefore, if the extension of an edge is supported by the majority of the paths, such an extension is reliable and can be applied.

The drawback of hybridSPAdes is its substantial memory usage, as it requires storing the path information of every long read. In practice, its usage is limited to small genomes.

DBG2OLC

DBG2OLC [174] is another hybrid pipeline. Rather than mapping short reads to long reads for error correction, DBG2OLC uses contigs obtained from the short reads to find alignments between long reads. A sketch of this approach is as follows. First, a DBG based method is used to generate a set of contigs from the short read dataset. The contigs are mapped to the long reads based on the number of shared k-mers (only those k-mers that are unique in the contig set). These mapped contigs serve as anchors for aligning the long reads to each other. After finding the overlaps between long reads, contained long reads are removed, and a greedy approach stitches each long read to its best overlapping long reads in both directions. The result of this greedy algorithm is a set of low-quality contigs (since the long reads are not error corrected). Each draft contig is then corrected by mapping all related long reads to it using BLASR, followed by consensus calling. Hence, rather than correcting long reads before assembly, DBG2OLC achieves high-quality bases only in the last step (consensus calling). In addition, after mapping the contigs, each long read is transformed into a compressed form represented only by the ordered list of contig ids mapped onto it. DBG2OLC uses these compressed long reads for alignment and overlap detection instead of aligning the raw long reads. Although such lossy compression speeds up the overlap detection step, it reduces the sensitivity of the alignments. This compression strategy, together with the low sensitivity mapping of contigs and the greedy assembly of long reads, partly explains the higher number of reported misassemblies compared to the PBcR pipeline.
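The lossy compression idea can be illustrated as follows (a simplification with hypothetical names; DBG2OLC's actual overlap scoring is more involved):

```python
def compress_read(contig_hits):
    """DBG2OLC-style lossy compression: a long read is reduced to the
    ordered list of contig ids anchored on it.

    contig_hits: iterable of (position_on_read, contig_id) pairs.
    """
    return [cid for _, cid in sorted(contig_hits)]

def share_overlap(a, b, min_shared=2):
    """Two compressed reads are candidate overlaps if they share enough
    contig ids (a simplification of DBG2OLC's overlap detection)."""
    return len(set(a) & set(b)) >= min_shared
```

Comparing two short id lists is orders of magnitude cheaper than base-level alignment of two raw long reads, which is the source of the speed-up, while reads whose shared contigs fall below the threshold are missed, which is the source of the sensitivity loss.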

MaSuRCA

The MaSuRCA assembler [180] was first designed for assembling NGS short reads and was recently updated to support hybrid assembly with noisy long reads [181]. MaSuRCA uses the notion of super-reads generated from the short read dataset; the aim is to generate a set of long sequences that contain all the information of the original dataset while reducing the coverage significantly. MaSuRCA builds a k-mer index database and generates k-unitigs by extending every short read in both directions as long as the extensions are unambiguous. Paired-end information is used to link k-unitigs into super-reads. The set of super-reads is then mapped to the long reads. To do this, a database of 15-mers is built on the super-read dataset, and super-reads are approximately aligned to long reads by chaining matching 15-mers. This approximate alignment gives the approximate start and end positions of each mapping, which helps to detect overlapping super-reads. An ordered sequence of overlapping super-reads is computed for each long read to generate a pre-mega-read. Often, super-reads do not fully cover a long read, due either to differences in the genomic coverage of NGS and SMS technologies or to low-quality regions in the long read. As a result, there might exist multiple such pre-mega-reads for a single long read. For each gap between two pre-mega-reads in a long read, if at least 3 other long reads overlap this long read and their flanking pre-mega-reads are identical, the aligned subsequences of those long reads are used to generate a consensus sequence to fill the gap. The long sequences obtained after performing all gap closures are called mega-reads. If a gap cannot be filled in the previous step, two 500-bp sequences are extracted from the flanking pre-mega-reads and linked as mates; this information can be used by the assembler during scaffolding. The Celera assembler [119] is fed with the set of mega-reads and the set of generated mates to produce the final assembly. MaSuRCA can handle large and repetitive genomes up to 22 Gbp in length. It requires 100x paired-end Illumina short reads and at least 10x PacBio long reads.

2.5.2 Non-hybrid assembly

Although obtaining a high-quality assembly using only SMS reads requires high coverage (based on the discussion in Subsection 2.4.2) and is only affordable for large groups, the development of tools for self-assembly of SMS reads is an active field of research. Even though de Bruijn graphs have been shown to be capable of non-hybrid assembly of long reads [105], their usage for larger genomes remains challenging. On the other hand, OLC based methods are proven to work for large genomes and are our focus in this section. Among all available tools, only FALCON and Canu seem able to assemble human-size genomes effectively. In addition, note that it is always possible to increase the base-level accuracy by polishing the draft genome assembly; this is usually done using Quiver [24] for PacBio reads or Nanopolish [108] for Oxford Nanopore reads.

HGAP

The hierarchical genome-assembly process (HGAP) [24] was the first non-hybrid assembly pipeline. Its hierarchical strategy relies on first generating some mini-assemblies and then assembling these mini-assemblies into the draft assembly. HGAP considers the longest 20x of the long reads as seed reads. All other long reads are aligned to the seed reads using BLASR. It then uses the pbdagcon correction module described in Subsection 2.4.2 to correct the seed reads. The error-corrected (pre-assembled) high-quality seed reads obtained from pbdagcon are passed to an OLC based assembler, namely Celera [119]. In HGAP, this assembly step generates a draft assembly (with high genome contiguity) rather than the final assembly. To obtain the final assembly, a polishing algorithm called Quiver is used. Quiver takes advantage of the raw information generated during SMRT sequencing. Raw long reads are aligned to the draft genome using BLASR. The assembled draft genome is then disregarded, and an initial approximate consensus is generated from all alignments using a fast heuristic; this is done to make the polished assembly independent of local assembly biases and fine-scale errors in the draft assembly. All single-base substitution, insertion, and deletion edits are tested against the draft assembly, and only those that improve the likelihood are applied. This process is repeated until no further improvement in the likelihood is observed.

MHAP

MHAP [11] is a probabilistic algorithm for the rapid identification of overlaps between long noisy reads. It uses a dimensionality reduction technique called MinHash, which was first used to quickly determine the similarity between webpages. The general idea is similar to minimizers [138], but instead of using a single hash function to generate the list of representatives, the list of integer representatives is computed using multiple randomized hash functions. When two long reads have a long enough overlap, shared representatives are likely to appear in the representative lists of the two reads. The authors consider the longest 40x of the long reads as seed reads. All long reads are then mapped to the seed reads using the MHAP technique, and each seed read is corrected using the falcon_sense consensus algorithm (see Subsection 2.4.2). falcon_sense needs detailed alignments while MHAP reports approximate alignments only; thus, a fast alignment algorithm [122] is used to align all overlapping long reads to the seed read before running falcon_sense. Finally, the Celera assembler [119] is used to generate the final assembly from the set of corrected seed reads.
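A minimal MinHash sketch comparison, in the spirit of MHAP, might look like this (each randomized hash function is simulated by salting a fixed digest; the parameter values are illustrative, not MHAP's defaults):

```python
import hashlib

def kmers(seq, k):
    """The set of k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(seq, k=8, num_hashes=64):
    """One minimum per randomized hash function: hash i is simulated by
    salting a stable digest with i, and the smallest hashed k-mer value
    is kept as representative i."""
    sketch = []
    for i in range(num_hashes):
        sketch.append(min(
            int(hashlib.sha1(f"{i}:{km}".encode()).hexdigest(), 16)
            for km in kmers(seq, k)))
    return sketch

def sketch_similarity(s1, s2):
    """The fraction of agreeing minima estimates the Jaccard similarity
    of the two k-mer sets, hence the likelihood of a true overlap."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)
```

The key property is that two reads sharing many k-mers are likely to agree on each per-hash minimum, so comparing two fixed-size sketches replaces comparing two full k-mer sets.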

FALCON

FALCON [25] is an assembler designed by Pacific Biosciences specifically for PacBio long reads. It selects a set of seed reads based on a pre-defined seed length. DALIGNER [124] is used to find all long reads overlapping the seed reads. For each seed read, the supporting reads are aligned to it using a fast alignment algorithm [122]. A modified version of the falcon_sense algorithm (described in Subsection 2.4.2) is used to correct each seed read. This modified falcon_sense exploits the idea of using a DAG similar to pbdagcon, but the nodes of the DAG contain more detailed alignment tags; the highest-weight path is followed to generate a corrected sequence. After the error correction step, FALCON identifies the overlaps between all pairs of corrected seed reads using DALIGNER. The overlapping sequences are used to build a directed string graph that keeps the diploidy information, and from this string graph the first draft of the contigs is generated. For phasing, raw reads are associated with the contigs by tracing the read overlap information used during error correction. For each draft contig, all the associated raw reads are collected and aligned to the contig using BLASR. These alignments are used to call heterozygous SNPs. Those raw reads that contain a sufficient number of heterozygous SNPs are used by the FALCON-Unzip algorithm to generate haplotype-specific contigs.

Canu

Canu [85] is another assembler specifically designed for assembling both PacBio and Oxford Nanopore reads. It uses an optimized version of MHAP to find all-vs-all overlaps of the raw long reads. Based on the overlap information, Canu estimates the corrected length of each read (reads with no coverage will have a corrected length of zero) and considers the longest 40x of reads by corrected length as seed reads. For each seed read, all supporting raw reads are quickly aligned to the seed read, and the DAG version of the falcon_sense consensus algorithm is used to correct it. After the correction step, the set of corrected seed reads is used to build an overlap graph from which the draft contigs are generated. In the end, Canu improves the quality of the contigs by aligning all supporting reads to them; a consensus sequence is then generated for each contig using the pbdagcon algorithm.

miniasm + Racon

Unlike other SMS assemblers, miniasm [97] skips the error correction step (which usually takes the majority of the running time) and generates a genome assembly directly from uncorrected long reads. miniasm takes advantage of a quick approximate overlap detection module called minimap. The first step is to find all-vs-all alignments of the long reads using minimap. minimap collects minimizers [138] of the long reads and stores them in a hash table. Then, for each query long read, all minimizer matches are obtained from the hash table. The minimizer matches are clustered, and for each cluster, minimap finds the longest chain of co-linear minimizer matches by solving the longest increasing subsequence problem. This gives an approximate alignment of the query long read to other long reads. Next, before the assembly step, low-quality regions of the long reads are trimmed: for each long read, miniasm detects the longest region covered by at least three other reads and removes all bases outside this region. miniasm then generates a string graph from the trimmed reads and performs the usual simplifications, namely transitive edge reduction, tip pruning, and bubble removal. In the end, unambiguous overlaps are merged to generate the set of contigs. Although miniasm is fast compared to FALCON and Canu, because it avoids error correction, the quality of the generated contigs is essentially close to that of the original raw dataset. Therefore, its contigs are not directly useful, and increasing their quality is left to other tools. A recently published consensus module, Racon [159], was shown to be effective for this purpose. Racon first maps the long reads to the draft contigs using minimap to find approximate alignments, followed by a fast edit distance based alignment to get a more detailed assignment of bases. The draft contigs, as well as the alignments, are then split into smaller windows, and a consensus sequence is built using the partial order alignment (POA) graph approach [92, 91]. A major drawback of miniasm is that it builds the whole string graph in memory, which requires a substantial amount of internal memory. Thus, even though miniasm together with Racon can generate acceptable assemblies, it cannot handle the human genome, and its usage is limited to small genomes.
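The minimizer selection that minimap builds on can be sketched as follows (lexicographic order stands in for the hashed ordering used in practice, and names are our own):

```python
def minimizers(seq, k=5, w=4):
    """(w,k)-minimizers: in every window of w consecutive k-mers, keep the
    smallest one as (kmer, position); the same selected occurrence is
    reported once even if it wins several windows."""
    picked = set()
    n = len(seq) - k + 1                      # number of k-mer positions
    for start in range(n - w + 1):
        window = [(seq[i:i + k], i) for i in range(start, start + w)]
        picked.add(min(window))               # smallest k-mer wins the window
    return sorted(picked, key=lambda kp: kp[1])
```

Because any window of w consecutive k-mers contributes at least one representative, two overlapping reads are guaranteed to share minimizers over a sufficiently long exact overlap, which is what makes the hash-table lookup effective.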

2.5.3 wtdbg2

wtdbg2 [140], also known as Redbean, introduces a new graph framework called the fuzzy-Bruijn graph for self-assembly of long reads. This framework combines the idea of de Bruijn graphs with the overlap-layout-consensus approach. wtdbg2 first builds a binned representation of each long read in which every tiling 256 bp subsequence is considered as one bin. It then performs all-vs-all alignment on these binned long reads, which is much faster (the dynamic programming matrix is ∼ 65,536 times smaller). Next, it identifies groups of K consecutive bins that align together and creates a node for each such group in the fuzzy-Bruijn graph. An edge is added between two nodes if they are both present on a read. In the end, it simplifies the graph and finds the consensus sequence corresponding to each simple path in the simplified graph. Although wtdbg2 is the fastest available self-assembler for long reads, its final assemblies do not have high quality and require further polishing.

Chapter 3

Long read mapping

The very first step of most downstream analysis pipelines is mapping the reads to a reference genome. For an Illumina-like short read with a low error rate, it is usually possible to find a "long" substring that exactly matches its mapping locus on the reference genome. All existing tools for mapping short reads are based on this fundamental observation. They aim to find such exact matches by using either (i) Burrows-Wheeler Transform/FM index [15, 45] based methods [99, 89, 101], (ii) substring hashing [3, 172, 57, 58, 166, 33, 104, 52], or (iii) hybrid methods that combine the FM index with hashing [152, 113]. Unfortunately, because of high error rates (up to 20% reported for PacBio [157, 156] and up to 40% reported for Oxford Nanopore [53]), this key observation is not valid for SMS technologies. Furthermore, even when the mapping locus for a read can be correctly found, it is quite challenging to differentiate sequencing errors from actual genomic variants. There are a number of available methods for mapping long reads to a reference genome. BLASR [18] is the first tool specifically designed for PacBio reads. It finds all sufficiently long exact matches between a long read and the reference genome using a suffix array index. Then it groups the matches into clusters and ranks them by a frequency-weighted score. The top-scoring clusters, corresponding to candidate genomic locations, are used for sparse dynamic programming (SDP), followed by a banded alignment. BWA-MEM [96] is another mapper that was originally designed to align short sequence reads as well as assembled contigs to a reference genome. It has also been extended to map long SMS reads by tuning its alignment parameters (via option -x pacbio or -x ont2d).
BWA-MEM achieves this by finding the longest exact match covering each query position as a possible initial match, chaining these matches (and filtering out those chains “contained” by others), ranking the initial matches by the length of the chains containing them, and finally extending the initial matches based on a specific score cutoff to get a complete alignment. Another tool,

Note that with recent advances in Oxford Nanopore chemistry and base-calling, its current error rate is closer to 15% (see https://github.com/rrwick/Basecalling-comparison).

rHAT [107], is a hash table based mapper that uses a heuristic to estimate the approximate mapping location of each read. This is done by finding potential mapping regions for the middle 1000 bp segment of the long read on the reference genome through an approximate k-mer counting scheme. Then, for each potential mapping region, a lookup table is built to find short seeds, and a chain of these seeds is computed using an SDP-based heuristic. The final alignment is formed from the selected chains. A fourth tool, GraphMap [154], uses gapped spaced seeds and performs an approximate alignment by clustering these seeds. It then constructs alignment anchors by finding an exact walk in its "alignment graph" built from short k-mers of the target, chains these anchors, and finally refines the chain to generate the final alignment. Another tool, LAMSA [106], splits the long read into "seeding fragments" and finds all their approximate matches on the reference genome using the GEM mapper [113]. It then finds the "skeleton" of the alignment using a directed acyclic graph (DAG) based SDP. Lastly, LAMSA prioritizes the candidate skeletons and fills the gaps within the skeletons while accounting for different possible structural differences (e.g., large deletions). Recently, two new mappers, NGMLR [146] and Minimap2 [98], have been published. Similar to LAMSA, NGMLR starts by finding alignments of subsegments of a read, each aligned by a single linear alignment. For each pair of such subsegment alignments, it then performs a pairwise sequence alignment using a convex gap-cost model. It finally scans inside the alignments to identify regions with low sequence identity that exist due to small SVs. Minimap2 uses the notion of minimizers [138] for indexing the reference and finding seeds. It then performs chaining of the seeds and identifies the primary chains.
In the end, it performs alignment between adjacent anchors of the chains using its fast SSE-based implementation. For a more detailed overview of each tool, we refer the reader to Section 2.3. Among the above tools, BLASR and BWA-MEM are sensitive but too slow for mapping large datasets. Speed is becoming a major issue since the delivery of the PacBio Sequel by Pacific Biosciences and the introduction of the PromethION device by Oxford Nanopore, which promise higher throughput for long read data at a lower cost. On the other hand, tools like rHAT and LAMSA are not sensitive enough to find the correct mapping locations for many reads. For instance, the candidate selection step of rHAT uses seeds only from the middle 1000 bp segment of the long read, which can be problematic, especially if that segment comes from a repetitive region. In this chapter, we introduce lordFAST, a novel long-read mapper specially designed for PacBio's continuous long reads (CLR). lordFAST is a highly efficient and sensitive aligner that tolerates the high sequencing error rates observed in CLR reads through its use of multiple short exact matches. lordFAST not only maps more reads in a PacBio dataset but also maps them more accurately than available alternatives such as BLASR and BWA-MEM. It is worth mentioning that lordFAST is also capable of aligning reads generated by Oxford Nanopore technology, since the error models are somewhat similar. Our experimental results show that Minimap2 is the fastest tool among the above mappers. lordFAST is second in speed while achieving the highest sensitivity and precision on simulated data. This is primarily because it maps the highest number of bases correctly among all the mappers we tested.

3.1 Methods

3.1.1 Overview

lordFAST is a heuristic anchor-based aligner for long reads generated by third-generation sequencing technologies. lordFAST aims to find a set of candidate locations (ideally, only one) per read before the costly step of base-to-base alignment to the reference genome. lordFAST works in two main stages. In stage one, it builds an index from the reference genome, which is used to find short exact matches. The index is a combination of a lookup table and an (uncompressed) FM index. In stage two, it maps the long reads to the reference genome in four steps: (i) on each read, it identifies a fixed number of evenly spaced k-mers (k = 14 in the default settings), which are matched to the reference genome through the use of the index; for each such match, it obtains the longest exact matching (prefix) extension, and among these extended matches of each k-mer it finally chooses the longest (there can be more than one), which act as anchor matches; (ii) for each read, it then splits the reference genome into overlapping windows (of length twice that of the read) and marks a window as a candidate region if the number of anchor matches in that window is above a threshold value; (iii) for each candidate region, it identifies the longest chain of "concordant" anchor matches (i.e., a chain of anchor matches with equal respective spacing in the read and the reference genome); (iv) it obtains the base-to-base alignment by performing dynamic programming between consecutive anchor matches in the selected chain. We provide a more detailed description of each step below.

3.1.2 Stage One: Reference Genome Indexing

In order to build a (substring) index for the reference genome, we use a combination of a simple lookup table for initial short matches and an (uncompressed) FM index for extending such initial matches. This combined index benefits from the speed of the lookup table and the compactness of the BWT representation of the reference genome. The lookup table (with 4^h entries for all possible h-mers) provides constant-time search for each h-mer's position in the uncompressed FM index [45] (in the default setting h = 12, but the user is given the option to pick any value). As is well known, the FM index provides a compact representation of a suffix array [112], which we use to find (exact matching) extensions of the initial h-mer matches. Note that in order to perform an efficient search on both strands of the reference genome, we use an extension of the FM index implemented in fermi [95].

Figure 3.1: The speed-up of lordFAST's combined index for searching exact matches in a real human dataset compared to the original FM index: a 29% speed-up for finding all anchors in the first step. Note that the combined index uses only 0.25 GB more memory.

As depicted in Figure 3.1, our combined index provides a 29% speed-up over the standard uncompressed FM index for retrieving exact matches in a real human dataset, with a negligible increase in memory usage.
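Conceptually, the combined index can be illustrated with a sorted suffix list in place of the FM index: because suffixes sharing an h-mer prefix are contiguous in suffix order, the lookup table can store one interval per h-mer (a toy sketch, not the actual implementation):

```python
def build_index(genome, h=3):
    """Toy version of the combined index: a sorted suffix list stands in
    for the FM index, and a lookup table maps each h-mer directly to its
    interval in it, skipping the first h extension steps."""
    sa = sorted(range(len(genome)), key=lambda i: genome[i:])
    table = {}
    for rank, pos in enumerate(sa):
        hmer = genome[pos:pos + h]
        if len(hmer) == h:
            lo, hi = table.get(hmer, (rank, rank))
            table[hmer] = (min(lo, rank), max(hi, rank))
    return sa, table

def lookup(sa, table, hmer):
    """Genome positions whose suffix starts with `hmer`, found in O(1)
    table time; a real implementation would extend the match from this
    interval via FM-index backward search."""
    if hmer not in table:
        return []
    lo, hi = table[hmer]
    return sorted(sa[lo:hi + 1])
```

The table replaces the first h LF-mapping steps with a single array access, which is exactly where the measured speed-up of the combined index comes from.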

3.1.3 Stage Two: Read Mapping

Given a set of long reads, lordFAST aligns one read at a time as follows:

Step 1: Sparse Extraction of Anchor Matches. For a given read of length ℓ, lordFAST identifies C (user defined, default 1000) evenly spaced anchoring positions on the read. For each anchoring position, it finds the longest prefix match(es) (of length at least k = 14) to the genome as follows. First, it extracts the h-mer starting at the anchoring position and uses the lookup table of the genome index to obtain the interval that represents the initial set of matching locations in the FM index. It then uses the LF-mapping operation of the FM index to extend the initial set of matches and identify the longest match(es). Note that using the longest matches reduces the total number of anchor matches significantly. The longest matches are then added to the set of anchors, M, as triplets (r, g, s), where r is the anchoring position on the read, g is the starting location of the longest match on the genome, and s is the length of the match. At the end of this step, M is partitioned into M+ and M− based on the strand of the matching location on the genome. (Note that for reads that are “too short”, i.e., ℓ < C + k − 1, we use ℓ − k + 1 anchoring positions instead of C anchoring positions.)
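The anchor extraction above can be sketched as follows. This is an illustrative stand-in: it scans the genome naively where lordFAST uses the lookup table and LF-mapping, and it uses toy parameters (C = 5, k = 4) instead of the defaults (C = 1000, k = 14).

```python
def anchor_positions(read_len, C=5, k=4):
    # Evenly spaced anchoring positions; reads that are "too short"
    # (read_len < C + k - 1) get read_len - k + 1 positions instead.
    if read_len < C + k - 1:
        return list(range(read_len - k + 1))
    step = (read_len - k) / (C - 1)
    return [round(i * step) for i in range(C)]

def longest_prefix_matches(read, r, genome, k=4):
    # All longest exact prefix extensions of read[r:] in the genome, reported
    # as anchor triplets (r, g, s); matches shorter than k are dropped.
    best, anchors = k - 1, []
    for g in range(len(genome)):
        s = 0
        while (r + s < len(read) and g + s < len(genome)
               and read[r + s] == genome[g + s]):
            s += 1
        if s > best:
            best, anchors = s, [(r, g, s)]
        elif s == best and s >= k:
            anchors.append((r, g, s))
    return anchors

print(longest_prefix_matches("GTAC", 0, "ACGTACGTTACG"))  # [(0, 2, 4)]
```

In lordFAST itself, the triplets collected over all anchoring positions form the set M, which is then split by strand into M+ and M−.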

Figure 3.2: (a) The implicit windows considered on the reference genome for the candidate selection step. If the read length is ℓ, then the windows are of size 2ℓ and overlap by ℓ bases. (b) An example of the candidate selection step. Each dot represents an anchor, and its size represents the weight of the anchor. In this example, f = 2, and since the maximum window score is 11, every window with a score ≤ 5.5 will be ignored. In addition, the window with score 6 is not kept since it is overlapping with a window with score 7. Also, only one of the windows with score 11 will be in the final list of candidates since the other window is overlapping it.

Step 2: Candidate Region Selection. In order to select the candidate regions for alignment, lordFAST splits the reference genome into overlapping windows of size 2ℓ (as illustrated in Figure 3.2(a)). For each window, it calculates two scores, for the forward and reverse strands, from the anchor matches of the respective strands (M+ and M−). For each anchor match falling in a window, it adds s − k + 1 to the score of that window. lordFAST keeps all the windows with score > score_max/f, where f is the factor defining the significance of the window score (default 4) and score_max is the maximum window score. In other words, lordFAST keeps those windows whose score is not significantly worse than the maximum window score. In cases where two overlapping windows both meet the minimum window score requirement, lordFAST keeps the one with the higher window score in the final list (ties are broken by choosing the window with the smaller reference coordinate). Figure 3.2(b) depicts an example of the selection process. Assuming |G| is the size of the reference genome, using O(|G|/ℓ) space, this step has a worst-case time complexity of O(|M| log(N)). The reason is that for each exact match in M, we need to find its matching window, which can be done in O(1). But since we need to keep the N top-scoring windows, we need a priority queue in which each insertion/replacement takes O(log(N)).
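A minimal sketch of this window-scoring scheme follows (illustrative only: it scores a single strand, uses toy parameters, and resolves overlaps greedily rather than with the bounded priority queue described above):

```python
from collections import defaultdict

def candidate_windows(anchors, genome_len, read_len, k=4, f=4):
    # Windows of length 2*read_len start every read_len bases, so a genome
    # position g lies in windows g//read_len - 1 and g//read_len.
    score = defaultdict(int)
    for r, g, s in anchors:
        w = g // read_len
        for cand in (w - 1, w):
            if 0 <= cand <= genome_len // read_len:
                score[cand] += s - k + 1  # each anchor adds s - k + 1
    if not score:
        return []
    smax = max(score.values())
    kept = sorted((w for w, sc in score.items() if sc > smax / f),
                  key=lambda w: (-score[w], w))
    final = []  # drop windows overlapping an already-kept, higher-scoring one
    for w in kept:
        if all(abs(w - u) > 1 for u in final):
            final.append(w)
    return final
```

For example, two nearby anchors of length 4 with k = 4 contribute a score of 1 each to the two windows containing them, and only the lower-coordinate window survives the overlap rule.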

Step 3: Chaining and Anchor Selection. Among all the anchor matches in a candidate region, lordFAST chooses a set of “concordant” anchors using local chaining. The best local chain is a set of co-linear, non-overlapping anchors on the reference genome that has the highest score among all such sets [127]. To calculate the best local chain, lordFAST assigns a weight to each anchor match equal to the length of the match. lordFAST supports two chaining algorithms. By default, it obtains the best chain using the dynamic programming based chaining algorithm [127]. Note that the time complexity of this chaining algorithm is quadratic, but in practice, it is fast due to the small number of anchor matches per read. It is also possible for the user to select an alternative chaining algorithm based on clasp [131]. The anchor matches in the best local chain form the basis of the alignment in that region.
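The default quadratic chaining can be sketched with a simple DP. This illustrative version scores a chain as the sum of anchor weights and, unlike the full algorithm of [127], ignores gap penalties between anchors:

```python
def best_local_chain(anchors):
    # anchors: triplets (r, g, s); a chain must be co-linear and
    # non-overlapping in both read and genome coordinates.
    if not anchors:
        return []
    anchors = sorted(anchors, key=lambda a: (a[0], a[1]))
    n = len(anchors)
    score = [a[2] for a in anchors]  # best chain score ending at each anchor
    back = [-1] * n
    for j in range(n):
        rj, gj, sj = anchors[j]
        for i in range(j):
            ri, gi, si = anchors[i]
            if ri + si <= rj and gi + si <= gj and score[i] + sj > score[j]:
                score[j], back[j] = score[i] + sj, i
    j = max(range(n), key=lambda t: score[t])
    chain = []
    while j != -1:
        chain.append(anchors[j])
        j = back[j]
    return chain[::-1]

# The off-diagonal anchor (4, 100, 3) cannot join the co-linear chain:
print(best_local_chain([(0, 0, 3), (5, 5, 4), (4, 100, 3)]))
```

The two nested loops make the worst case O(n²) in the number of anchors, which is why the sparse anchor extraction of Step 1 matters for speed.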

Step 4: Alignment. lordFAST prioritizes the candidate regions based on their best chaining score and performs the final alignment for the top N regions (default value for N is 10). In order to generate the base-to-base alignment of a region, it uses anchor matches from the top-scoring chain and performs banded global alignment for the gaps between pairs of consecutive anchor matches. Furthermore, the alignment between the prefix of the read and the reference prior to the first anchor can be performed by the use of an anchored global-to-local alignment, and the alignment between the suffix of the read and the reference following the last anchor can be computed in an identical fashion. This strategy is a widely used technique to avoid computing the full alignment between long sequences, as that requires substantial memory and computation time. lordFAST uses Edlib [153] for computing the global alignments and the ksw library2 for computing the global-to-local alignments. Edlib is a library implementing the fast bit-vector algorithm devised by Myers (1999) [123]. ksw, on the other hand, provides alignment extension based on an affine gap cost model. While the actual time complexity of this step depends on the number of selected exact matches inside the chain, it is not more than O(bℓ), where b is the bandwidth used for the banded dynamic programming alignments. It is worth mentioning that lordFAST supports clipping as follows: if the prefix of the read before the first anchor (or, respectively, the suffix of the read after the last anchor) has an alignment score/similarity lower than a threshold (th_clip), lordFAST clips that prefix (or, respectively, suffix). This is done by using the ksw library to extend the alignment as long as a significant drop in the alignment score/similarity is not observed; ksw performs this using an algorithm similar to BLAST’s X-drop heuristic [178].
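The banded alignment between consecutive anchors can be illustrated with a unit-cost banded edit-distance DP (a sketch only; lordFAST itself delegates this computation to Edlib and uses ksw with affine gap costs for the extension alignments):

```python
def banded_edit_distance(a, b, band):
    # Global edit distance restricted to DP cells with |i - j| <= band;
    # it equals the exact distance whenever the optimal path stays in the band.
    INF = float("inf")
    n, m = len(a), len(b)
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= band:
            cur[0] = i
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match/mismatch
                         prev[j] + 1,         # deletion
                         cur[j - 1] + 1)      # insertion
        prev = cur
    return prev[m]

print(banded_edit_distance("kitten", "sitting", band=3))  # 3
```

Only O(b) cells are filled per row, which is what bounds the whole step by O(bℓ).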

In addition, lordFAST supports split alignment as follows. Let S_{i,j} denote the substring of S that starts at position i and ends at position j. Suppose we are mapping a long read R to the reference genome G, and consider two consecutive anchors A = (rA, gA, sA) and B = (rB, gB, sB), as per the definition above, in the best chain chosen for a candidate window. If the alignment between R_{rA,rB} and G_{gA,gB} has a score lower than a threshold (th_split), we split the alignment and report one alignment as primary and another as supplementary (as defined in the SAM format specification). One alignment corresponds to the substring before anchor A, and the other corresponds to the substring after anchor B. Furthermore, since the drop in alignment score/similarity could be due to the presence of an inversion, we check if the alignment between the reverse complement of R_{rA,rB} and G_{gA,gB} has a score higher than th_split; in that case, such an alignment is also reported as another supplementary alignment.

2 https://github.com/attractivechaos/klib

3.2 Results

We evaluated the performance of lordFAST-v0.0.9 against BLASR [18], BWA-MEM [96], GraphMap [154], LAMSA [106], rHAT [107], NGMLR [146], Minimap2 [98], and another recently released tool, minialign3. Note that although GraphMap is specifically designed for Oxford Nanopore reads, we included it in our experiments as it is capable of mapping PacBio long reads with default parameters [154]. We compared the methods on both simulated and real datasets. We used the results on the simulated dataset to calculate the methods’ precision and recall. All experiments were performed on a server running CentOS 6.9, equipped with four twelve-core (2 threads per core) Intel(R) Xeon(R) E7-4860 v2 @ 2.60GHz processors and 1000 GB RAM. The details about the versions of the tools as well as the commands and parameters used to run each tool are provided in Appendices A.2 and A.3. Note that on real PacBio datasets, we observed that more than 99% of the sequence data is contained in reads of length 1000 bp or longer (see Figure 3.3 for details). Thus, we only focused on aligning reads that are 1000 bp or longer.

3.2.1 Experiment on a simulated dataset without structural variations

To evaluate the precision and recall of lordFAST against the above-mentioned tools, we simulated 25,000 long reads from hg38 using PBSIM [128], which infers the read length and error model from a real human read dataset. Appendix A.1.2 provides the instructions and commands for reproducing this simulated dataset. Note that we did not introduce any SNPs, indels, or structural variants in this experiment, i.e., the correct alignment between a read and the reference genome has mismatches and gaps only due to (simulated) read errors. For each read, PBSIM provides both the originating location on the reference genome and the “true” base-to-base alignment of the read to the reference genome at that location. Since for any base on any read, its “true” base pairing on the reference genome is known, we have been able to calculate the number of correctly mapped reads/bases. We consider a read to be correctly mapped if (i) it gets mapped to the correct chromosome and strand; and (ii) the subsequence of the reference genome the read maps to overlaps with the “true” mapping subsequence by at least p bases. In order to compare the methods with respect to the number of correctly mapped reads, we used two values of p: a fixed value of 1 bp and a variable value set to 90% of the length of the originating “true” mapping subsequence. Note that for most of the methods, there is not a big difference between the results based on the two settings for p; however, some methods cannot identify the “correct” mapping subsequence in its entirety and report only a partial alignment; accordingly, those methods perform poorly for the variable setting of p. We consider a base in a read to be correctly mapped if (i) the read is correctly mapped (as per the definition above) and (ii) the mapped location of the base is within 25 bp of the true alignment locus of the base (a smaller value for this threshold makes the definition of a correctly mapped base more stringent; Tables 3.3 and 3.4 show the results when this threshold is increased to 50 bp and decreased to 5 bp, respectively). Sensitivity is thus defined as the fraction of correctly mapped bases (according to this notion of a correct mapping) out of the total number of bases in the reads. Similarly, precision is defined as the fraction of correctly mapped bases out of the total number of mapped bases in the reads.

3 https://github.com/ocxtal/minialign

Figure 3.3: The read length distribution of 72,708 real PacBio reads from a human genome (CHM1) dataset (min: 36; median: 4,975; mean: 6,675; max: 35,489; N99: 1,068). The vertical axis shows the number of bases in each bin rather than the number of reads. At least 99% of the bases are in reads longer than 1000 bases.

Table 3.1: Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked in bold typeface.

Minimum overlap (p) = 1 bp:

Mapper      Correctly  Correct     Incorrect   Unmapped    Sensitivity^a  Precision^b
            mapped     bases (Mb)  bases (Mb)  bases (Kb)  (%)            (%)
BLASR       24,642     164.52      18.39         698.22    89.61          89.95
BWA-MEM     24,603     170.63      12.50         525.11    92.91          93.17
GraphMap    24,161     177.26       4.05       2,297.27    96.55          97.77
LAMSA       24,458     176.00       6.40         282.15    96.36          96.51
rHAT        24,409     177.59       5.63         391.52    96.72          96.93
NGMLR       24,194     170.50       8.86       4,246.51    92.86          95.06
Minimap2    24,745     180.06       3.34         223.46    98.06          98.18
minialign   24,567     178.25       4.73         621.60    97.08          97.41
lordFAST    24,751     181.68       1.89          29.35    98.95          98.97

Minimum overlap (p) = 90%:

Mapper      Correctly  Correct     Incorrect   Unmapped    Sensitivity^a  Precision^b
            mapped     bases (Mb)  bases (Mb)  bases (Kb)  (%)            (%)
BLASR       24,563     164.46      18.47         675.95    89.57          89.90
BWA-MEM     24,485     170.23      12.98         417.84    92.70          92.91
GraphMap    24,161     177.26       4.05       2,297.27    96.55          97.77
LAMSA       24,371     176.87       6.59         208.22    96.30          96.41
rHAT        24,372     177.55       5.98          80.98    96.70          96.74
NGMLR       23,769     169.66      10.44       3,508.56    92.40          94.20
Minimap2    24,740     180.04       3.35         223.20    98.05          98.17
minialign   24,469     177.84       5.53         233.74    96.86          96.98
lordFAST    24,747     181.68       1.90          29.10    98.95          98.97

Note: A read is considered to be mapped correctly if its aligned subsequence in the reference overlaps with the "correct" mapping subsequence by at least p bases. A base in a read is considered to be correctly mapped if the read is correctly mapped and the mapping location of the base is within a 25 bp vicinity of the correct alignment locus of the base.
a The sensitivity is defined as the number of correctly mapped bases / the total number of bases.
b The precision is defined as the number of correctly mapped bases / the number of mapped bases.

Using these definitions, we compared all of the above-mentioned methods; a summary of the results is presented in Table 3.1. As can be seen, lordFAST not only maps more reads correctly than any other mapper but also aligns about 98.9% of the total number of bases correctly, which is 0.9%–9.4% more than its competitors. In addition, lordFAST achieves the highest base sensitivity and precision. It is important to note that for GraphMap, the precision value is much higher than the sensitivity because it leaves many of the bases unmapped. In that sense, we believe that sensitivity provides a much better measure to compare the tools, even though lordFAST is the best with respect to both measures. Table 3.2 provides details about the running time and memory usage of each tool on this dataset. Here, Minimap2 is the fastest tool, followed by minialign and lordFAST. BWA-MEM, lordFAST, and LAMSA show the lowest memory footprints. We also evaluated the ability of different tools to distinguish between unique and repetitive hits in terms of the assigned mapping quality (MAPQ), following [98]. For this evaluation, a read is considered correctly mapped if its best mapping aligns to a region of the reference that overlaps with (i) at least 10% of the “true” mapping region (Figure 3.4(a)), or (ii) at least 90% of the “true” mapping region (Figure 3.4(b)). In general, Minimap2 and lordFAST map a higher portion of reads with high mapping quality to the correct location compared to other tools, especially under the more stringent definition of correct mapping (see Figure 3.4(b)).

Table 3.2: Runtime and memory usage of the tools in Table 3.1 on the same dataset.

Mapper      Time (sec)  Memory (GB)
BLASR            9,233        14.67
BWA-MEM          6,842         5.22
GraphMap        17,546        42.56
LAMSA            1,277         5.85
rHAT             1,044        13.95
NGMLR            2,970         5.45
Minimap2           154         6.50
minialign          201        12.70
lordFAST           696         5.43

Note: The running time and peak memory usage are measured using the GNU time command (/usr/bin/time -v).

Table 3.3: Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface.

Minimum overlap (p) = 1 bp:

Mapper      Correctly  Correct     Incorrect   Unmapped    Sensitivity^a  Precision^b
            mapped     bases (Mb)  bases (Mb)  bases (Kb)  (%)            (%)
BLASR       24,642     171.74      11.17         698.22    93.53          93.89
BWA-MEM     24,603     171.76      11.36         525.11    93.53          93.80
GraphMap    24,161     177.33       3.98       2,297.27    96.58          97.81
LAMSA       24,458     177.65       5.75         282.15    96.72          96.87
rHAT        24,409     177.87       5.35         391.52    96.87          97.08
NGMLR       24,194     172.83       6.53       4,246.51    94.13          96.36
Minimap2    24,745     181.56       1.84         223.46    98.88          99.00
minialign   24,567     179.68       3.31         621.60    97.86          98.19
lordFAST    24,751     181.74       1.84          29.35    98.98          99.00

Minimum overlap (p) = 90%:

Mapper      Correctly  Correct     Incorrect   Unmapped    Sensitivity^a  Precision^b
            mapped     bases (Mb)  bases (Mb)  bases (Kb)  (%)            (%)
BLASR       24,563     171.66      11.27         675.95    93.50          93.84
BWA-MEM     24,485     171.37      11.85         417.84    93.32          93.53
GraphMap    24,161     177.33       3.98       2,297.27    96.58          97.81
LAMSA       24,371     177.52       5.94         208.22    96.65          96.76
rHAT        24,372     177.82       5.71          80.98    96.85          96.89
NGMLR       23,769     171.99       8.11       3,508.56    93.67          95.50
Minimap2    24,740     181.53       1.85         223.20    98.87          98.99
minialign   24,469     179.27       4.11         233.74    97.64          97.76
lordFAST    24,747     181.73       1.85          29.10    98.98          98.99

Note: A read is considered to be mapped correctly if its aligned subsequence in the reference overlaps with the "correct" mapping subsequence by at least p bases. A base in a read is considered to be correctly mapped if the read is correctly mapped and the mapping location of the base is within a 50 bp vicinity of the correct alignment locus of the base.
a The sensitivity is defined as the number of correctly mapped bases / the total number of bases.
b The precision is defined as the number of correctly mapped bases / the number of mapped bases.

3.2.2 Simulation in presence of structural variations

In order to evaluate the capability of lordFAST for mapping reads that span structural variations, we performed another experiment to detect simulated SVs using Sniffles [146]. Sniffles requires a minimum of 15× coverage to achieve good accuracy. Therefore, for this experiment, we only focused on Chr1 and generated a simulated dataset by introducing 21 SVs from DGV (9 insertions, 9 deletions, and 3 inversions) of different sizes. More specifically, we performed simulation and SV calling as follows:

Table 3.4: Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset. This dataset contains 25,000 reads and 183.61 million bases. Best results are marked in bold typeface.

Minimum overlap (p) = 1 bp:

Mapper      Correctly  Correct     Incorrect   Unmapped    Sensitivity^a  Precision^b
            mapped     bases (Mb)  bases (Mb)  bases (Kb)  (%)            (%)
BLASR       24,642     136.73      46.18         698.22    74.47          74.75
BWA-MEM     24,603     164.35      18.78         525.11    89.49          89.75
GraphMap    24,161     175.48       5.83       2,297.27    95.57          96.79
LAMSA       24,458     163.97      19.42         282.15    89.27          89.41
rHAT        24,409     172.63      10.59         391.52    94.02          94.22
NGMLR       24,194     151.75      27.61       4,246.51    82.65          84.60
Minimap2    24,745     160.90      22.50         223.46    87.62          87.73
minialign   24,567     159.37      23.62         621.60    86.80          87.09
lordFAST    24,751     180.91       2.67          29.35    98.53          98.54

Minimum overlap (p) = 90%:

Mapper      Correctly  Correct     Incorrect   Unmapped    Sensitivity^a  Precision^b
            mapped     bases (Mb)  bases (Mb)  bases (Kb)  (%)            (%)
BLASR       24,563     136.67      46.26         675.95    74.43          74.71
BWA-MEM     24,485     163.95      19.26         417.84    89.28          89.49
GraphMap    24,161     175.48       5.83       2,297.27    95.57          96.79
LAMSA       24,371     163.84      19.62         208.22    89.21          89.31
rHAT        24,372     172.60      10.93          80.98    94.00          94.04
NGMLR       23,769     150.91      29.19       3,508.56    82.19          83.79
Minimap2    24,740     160.87      22.52         223.20    87.61          87.72
minialign   24,469     158.95      24.42         233.74    86.57          86.68
lordFAST    24,747     180.91       2.67          29.10    98.53          98.54

Note: A read is considered to be mapped correctly if its aligned subsequence in the reference overlaps with the "correct" mapping subsequence by at least p bases. A base in a read is considered to be correctly mapped if the read is correctly mapped and the mapping location of the base is within a 5 bp vicinity of the correct alignment locus of the base.
a The sensitivity is defined as the number of correctly mapped bases / the total number of bases.
b The precision is defined as the number of correctly mapped bases / the number of mapped bases.

Figure 3.4: Read mappings are sorted by their mapping quality in descending order. Then, for each mapping quality threshold, the fraction of mapped reads with mapping quality above the threshold (out of the total number of reads) is plotted against the fraction of incorrectly mapped reads (out of the number of mapped reads): (a) correctness based on 10% overlap with the “true” mapping region; (b) correctness based on 90% overlap.

(i) We assigned the SVs reported in DGV on Chr1 of the NA12878 individual into three groups based on their size (shorter than 500 bp, between 500 and 5000 bp, and longer than 5000 bp) and randomly selected 3 insertions, 3 deletions, and 1 inversion from each group.

(ii) The selected SVs were introduced into the reference Chr1 to obtain a simulated donor chromosome.

(iii) A set of long reads with 15× coverage was simulated from the donor chromosome using PBSIM [128]. PBSIM was fed with a FASTQ file from a real human dataset to use its sample-based model (via option --sample-fastq).

(iv) Long reads were mapped to the reference Chr1 using the different mappers. Sniffles requires the MD tag in order to operate. Among the different mappers, BLASR, LAMSA, and rHAT do not generate the MD tag in the output SAM file; for these mappers, we used the samtools calmd command to calculate and add the MD tag. Minimap2 and minialign add MD tags via optional arguments. The other tools (including lordFAST) generate MD tags by default.

(v) For each mapper, a sorted BAM file was generated from the SAM file using samtools sort.

(vi) Sniffles (version v1.0.8) was run with parameter -s 4.

For this experiment, some tools required a more specialized command. In particular, we used the following commands for each tool:

• BLASR was run with parameters --bestn 1 --clipping subread --affineAlign --noSplitSubreads --nCandidates 20 --minPctSimilarity 75 --sdpTupleSize 6.

• BWA-MEM was run with parameters -x pacbio -MY as mentioned in [146].

• LAMSA was run with parameters -T pacbio -i 25 -l 50 -S.

• minimap2 was run with parameters -aY -x map-pb --MD.

• minialign was run with parameters -x pacbio -T AS,XS,NM,NH,IH,SA,MD -P.

• rHAT, GraphMap, NGMLR, and lordFAST were run with default parameters.

Here, we provide the results of SV calling using Sniffles based on the mappings from different tools. We define a call as “exact” if (i) its start and end coordinates are at most 25 bp away from the actual simulated breakpoints, and (ii) it overlaps with one simulated SV of the same type. If only the first condition is not satisfied, the call is considered “inexact”; if only the second condition is not satisfied, the call is considered “mis-classified”; and if neither condition is satisfied, the call is considered “wrong”. Among all mappers, Sniffles generated SV calls only for NGMLR, BWA-MEM, rHAT, and lordFAST. As can be seen in Table 3.5, all calls based on rHAT mappings are wrong. Also, Sniffles finds more “exact” calls with lordFAST and NGMLR mappings in comparison to the mappings provided by BWA-MEM. This suggests that lordFAST does not generate misalignments around SV breakpoints and is capable of properly mapping reads that span/overlap SVs.

Table 3.5: Structural variations called by Sniffles based on mappings from different tools.

Mapper     # calls  # exact  # inexact  # mis-classified  # wrong
NGMLR         19       17        1             0              1
BWA-MEM       18       12        5             0              1
rHAT          35        0        0             0             35
lordFAST      17       16        1             0              0
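The call classification above can be sketched as follows (an illustrative function; calls and simulated SVs are modeled as (type, start, end) tuples, with the 25 bp breakpoint tolerance from the text):

```python
def classify_call(call, truths, tol=25):
    # "exact": breakpoints within tol of an overlapping simulated SV of the
    # same type; "inexact": same type but breakpoints off; "mis-classified":
    # breakpoints match an SV of another type; "wrong": neither holds.
    c_type, c_start, c_end = call
    for t_type, t_start, t_end in truths:
        if min(c_end, t_end) <= max(c_start, t_start):
            continue  # no overlap with this simulated SV
        near = abs(c_start - t_start) <= tol and abs(c_end - t_end) <= tol
        if near and c_type == t_type:
            return "exact"
        if c_type == t_type:
            return "inexact"
        if near:
            return "mis-classified"
    return "wrong"
```

For example, against a simulated deletion at 1000–2000, a DEL call at 1010–1990 is "exact", a DEL call at 1200–1800 is "inexact", and an INV call at 1010–1990 is "mis-classified".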

3.2.3 Experiment on a real dataset

We evaluated the above methods on a real dataset containing 23,155 reads sequenced from a human genome (CHM1 cell line; Appendix A.1.1 contains details related to this dataset). Since the true mapping locations of the reads are not known a priori, we compared methods based on the quality of their reported alignments. For each mapping of a read, we counted the number of its bases that are aligned to identical bases in the reference (matched bases). In addition, we calculated the alignment score by adding +1 for every matching base and −1 for every mismatching, inserted, deleted, or unmapped/clipped base. For each tool, we report the sum of the alignment scores of all the reads in the dataset. Although the number of matched bases per se may not be the best comparison measure (since one could match all the bases in the read without paying attention to the gaps created in the reference), it is complementary to the alignment score: if a program greedily maximizes the number of matched bases, it will very likely produce a low alignment score. Table 3.6 shows the results of this experiment. lordFAST has the highest total alignment score. More precisely, lordFAST achieves a total alignment score 2.79 million higher, and 1.74 million more matched bases, than its closest competitor. We also measured the agreement between the various methods based on their alignments of the reads. For a given read, an alignment x covers another alignment y if and only if the subsequence of the reference genome covered by x overlaps with at least 90% of the subsequence of the reference genome covered by y. Figure 3.5 shows examples of covering and non-covering alignments. Table 3.7 shows how the best alignments from different methods cover each other. More specifically, each row contains the percentage of mappings reported by the corresponding tool that cover the mappings of other tools.
For instance, among all reads for which both lordFAST and BLASR report an alignment, 90.84% of the alignments reported by BLASR are covered by lordFAST, while only 88.28% of the alignments reported by lordFAST are covered by BLASR. As can be observed, lordFAST alignments provide high coverage of the alignments obtained by the alternative tools. In addition, in Table 3.8, we compared the performance of the tools on reads for which their alignments do not agree. To give an example, there are 2,930 reads for which BLASR does not cover the alignments of lordFAST; for those reads, BLASR reports alignments with, on average, 28.84% lower identity. In contrast, there are 2,094 reads for which lordFAST does not cover BLASR’s alignments; for those reads, on average, lordFAST’s alignments have only 7.40% lower identity than BLASR’s. In the absence of true mappings for this real dataset, the information in Tables 3.7 and 3.8 provides extra support for the reliability of lordFAST’s alignments. Finally, we benchmarked the performance of each tool using multiple threads. Figures 3.6 and 3.7 depict the runtime and memory requirements of all tools we tested on this dataset when using multiple threads.

Table 3.6: Evaluation of the performance of various long read mappers on a real human dataset. This dataset includes 23,155 reads and 178.45 million bases.

Mapper      Mapped  Mapped      Matched     Alignment    Time^b  Memory^b
            reads   bases (Mb)  bases (Mb)  score        (sec)   (GB)
BLASR       22,866  163.11      148.58      108,002,225  12,243  14.96
BWA-MEM     22,913  170.76      154.15      119,117,389   8,810   5.25
GraphMap    22,159  169.57      151.93      113,717,041  17,745  42.56
LAMSA       23,154  173.90      155.68      122,035,697   2,040   6.29
rHAT        23,136  159.99      142.40       92,824,214   1,769  13.95
NGMLR       21,295  155.83      143.06       97,830,317   4,629   5.43
Minimap2    22,818  170.97      154.78      119,673,199     262   6.57
minialign   23,006  152.61      139.22       89,538,289     207  12.70
lordFAST    22,961  176.18      157.42      124,826,081     765   5.43

Note: Given a single mapped read, suppose nMatch is the number of matched bases, qLen is the length of the read, qStart and qEnd denote the start and end coordinates of the mapping on the read, and tStart and tEnd denote the start and end coordinates of the mapping on the reference.
a For each read, the ground truth region is the region of the reference that is shared by mappings from at least 4 mappers. A read mapping reported by a mapper is considered to be “correctly” mapped if it overlaps at least 90% of the bases of the ground truth region. The number in parentheses shows the percentage of the total number of reads that are “correctly” mapped.
b The running time and peak memory usage are measured using the /usr/bin/time -v Unix command.

Figure 3.5: Examples of covering and non-covering alignments. Suppose x, y, z1, z2, z3, and z4 are different alignments of the same read. Alignments x and y cover each other, as they span subsequences of the reference genome that have at least 90% overlap. Alignments x and y cover alignments z1 and z2 but not alignments z3 and z4. On the other hand, alignments z1, z2, z3, and z4 do not cover either alignment x or y.
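The (asymmetric) covering relation used in Table 3.7 and Figure 3.5 can be sketched as:

```python
def covers(x, y, frac=0.9):
    # x, y: (start, end) reference intervals of two alignments of the same
    # read; x covers y if their overlap spans at least frac of y's interval.
    ov = max(0, min(x[1], y[1]) - max(x[0], y[0]))
    return ov >= frac * (y[1] - y[0])

# A long alignment covers a short contained one, but not vice versa:
print(covers((0, 1000), (0, 100)), covers((0, 100), (0, 1000)))  # True False
```

The asymmetry is why Tables 3.7 and 3.8 are not symmetric: a tool that reports long, complete alignments tends to cover the partial alignments of other tools without being covered by them.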

Table 3.7: Agreement of different methods in reporting alignments.

           BLASR  BWA    GraphMap  LAMSA  rHAT   NGMLR  Minimap2  minialign  lordFAST
BLASR      N/A    92.38  90.13     89.40  89.57  97.10  93.04     91.82      88.28
BWA        90.10  N/A    87.45     87.25  86.99  95.11  90.98     91.09      86.34
GraphMap   92.47  92.55  N/A       89.06  90.76  96.69  92.54     91.71      91.59
LAMSA      85.74  87.06  83.91     N/A    84.12  91.02  86.13     88.11      83.89
rHAT       90.51  89.87  90.62     87.85  N/A    93.84  89.96     89.41      88.21
NGMLR      86.33  87.02  84.88     83.72  83.67  N/A    86.93     86.99      82.47
Minimap2   92.89  93.45  89.83     89.17  89.35  97.48  N/A       93.01      88.25
minialign  79.97  81.54  77.40     78.49  77.57  84.92  80.88     N/A        77.11
lordFAST   90.84  91.76  91.96     89.20  88.77  94.44  91.01     91.79      N/A

Note: Each row shows the percentage of best alignments from the corresponding mapper that cover alignments from other mappers. Note that this table is not symmetric.

Table 3.8: The performance of different methods on reads for which their alignments do not agree.

           BLASR          BWA            GraphMap       LAMSA          rHAT           NGMLR          Minimap2       minialign      lordFAST
BLASR      N/A            -22.85 (1747)  -29.05 (2187)    6.47 (2454)   -8.57 (2414)   10.61 (617)   -20.92 (1585)   -7.92 (1882)  -28.84 (2930)
BWA         -4.78 (2264)  N/A            -25.10 (2780)   14.25 (2951)   -2.58 (3010)   -2.16 (1041)  -11.00 (2059)    5.89 (2049)  -20.97 (3074)
GraphMap   -16.66 (1721)  -34.19 (1708)  N/A              5.78 (2533)  -10.97 (2137)    3.05 (704)   -30.42 (1702)  -16.13 (1907)  -36.39 (2033)
LAMSA      -25.88 (3261)  -30.42 (2964)  -38.40 (3566)  N/A            -22.07 (3673)  -29.02 (1913)  -31.68 (3166)  -18.04 (2735)  -37.96 (4047)
rHAT       -17.68 (2171)  -25.53 (2320)  -41.22 (2079)    0.17 (2814)  N/A            -22.59 (1312)  -25.16 (2291)  -12.33 (2436)  -37.63 (2723)
NGMLR      -37.90 (3126)  -46.62 (2973)  -43.60 (3351)  -21.69 (3770)  -34.93 (3778)  N/A            -46.60 (2983)  -36.96 (2992)  -44.03 (4024)
Minimap2    -0.11 (1626)  -13.12 (1501)  -25.17 (2253)   15.10 (2508)   -1.46 (2464)    9.39 (537)   N/A              3.23 (1608)  -16.81 (2697)
minialign  -26.03 (4579)  -28.14 (4229)  -34.78 (5007)   -9.67 (4981)  -22.32 (5190)  -29.23 (3211)  -30.00 (4366)  N/A            -33.90 (4530)
lordFAST    -7.40 (2094)  -17.31 (1887)  -29.20 (1781)   17.64 (2500)   -2.82 (2598)  -14.90 (1183)  -17.58 (2051)    1.84 (1889)  N/A

Note: Each row shows the performance superiority of the corresponding method over the other methods for the inconsistent alignments, in terms of the average identity difference. The numbers in parentheses show the number of reads for which the corresponding method reports alignments that do not cover alignments of the other method. Note that this table is not symmetric.

3.3 Summary

In this chapter, we presented lordFAST, a fast and highly sensitive mapping tool for long noisy reads. Its sparse anchor extraction strategy has an important impact on the speed of its chaining step. Our experiments on simulated data showed that, despite using a small number of anchors, lordFAST not only maps more reads to their true originating regions than its competitors but is also highly accurate in base-level alignment (see Table 3.1). In addition, lordFAST provides both clipped and split alignments of the reads. This makes lordFAST appropriate for aligning reads originating from regions with long structural variations (SVs), so that downstream analysis of its alignments is simpler for the task of variation discovery.


Figure 3.6: Run-time comparison of different methods for mapping 23,155 real human reads using different threads. Note that the y-axis is in logarithmic scale.


Figure 3.7: Memory comparison of different methods for mapping 23,155 real human reads using different threads. Note that the y-axis is in logarithmic scale.

Chapter 4

Hybrid error correction of long reads

In order to improve the quality of noisy long reads, such as those generated by PacBio or Oxford Nanopore technologies, several tools have been developed (see [87] for a review of error correction tools). These tools can be classified into two categories: (i) self-correcting methods and (ii) hybrid methods. In the “self-correcting” approach, the idea is to correct the long reads using only the long reads themselves. In this approach, a multiple sequence alignment between the reads is built from the pairwise alignment of every two long reads (all-versus-all alignment). Based on this alignment, a consensus sequence is built that has a higher base-level quality. This approach has been implemented in HGAP [24], a non-hybrid assembler that can handle bacterial genome data. The recently introduced assembler Canu [85] relies on the idea of local hashing to detect overlaps between long reads and assembles them using an overlap graph. On the other hand, hybrid methods (e.g., PacBioToCA [84], LSC [6], proovread [59], LoRDEC [141]) jointly utilize high-quality short reads and noisy long reads to correct the long reads. PacBioToCA and LSC map the short reads (e.g., Illumina reads) onto the long reads and correct the long reads by calling a consensus of these short read mappings; proovread uses a similar idea except that it performs an iterative procedure of mapping and correcting with successively increasing sensitivity. A different approach, akin to local assembly, is followed by Nanocorr [53] (developed for correcting Oxford Nanopore long reads) and LoRDEC [141]. Nanocorr relies on computing a Longest Increasing Subsequence (LIS) of overlapping reads. In contrast, LoRDEC builds a De Bruijn graph from the short reads and then aligns each long read to this De Bruijn graph by finding a path between solid regions of the long read that aims at minimizing the edit distance to the region sequence.
One of the main drawbacks of the self-correcting approach is that it requires substantial computational power to perform the all-versus-all alignment of the long reads needed to find overlaps between them, although recent advances require fewer resources [11]. More importantly, self-correcting methods require at least 50x coverage of long reads [83] in order to find all-versus-all overlaps that can be used for error correction. Considering the low throughput of single-molecule sequencing technologies, obtaining 50x coverage is costly. The advantage of the hybrid approach comes from the fact that high-throughput short reads can be generated at a much lower cost, complementing low-coverage long reads from the same donor. Here, we introduce CoLoRMap, a hybrid method that takes advantage of high-quality short reads to correct noisy long reads. Similar to LSC and PacBioToCA, CoLoRMap maps the short reads onto the long reads as a first step. Unlike those tools, however, it does not look for a consensus base call at each position; instead, it formulates the correction of a long read region as a local assembly problem that aims to find an optimal path of overlapping mapped short reads minimizing the edit score to the long read region, a problem that can be solved exactly using a classical Shortest Path (SP) algorithm. Thus our criterion differs from the one defined in Nanocorr, which is based on the Longest Increasing Subsequence approach, although the general principle is similar¹. In a second step, CoLoRMap addresses the problem of correcting long read regions where, due for example to a higher error rate, no short read maps (called gaps), using the idea of de novo assembly of One-End Anchors (OEAs), which are unmapped reads whose mates map to a flanking corrected region.

4.1 Methods

4.1.1 Overview

Similar to most hybrid methods for error correction, CoLoRMap takes as input two sets of reads, namely short reads and long reads from the same donor. CoLoRMap starts by mapping the short reads to the long reads using BWA-MEM [96]. It then uses the set of mappings obtained from BWA-MEM to build a graph structure akin to an overlap graph. Using a polynomial-time Shortest Path (SP) algorithm, CoLoRMap then reconstructs a sequence of overlapping mapped short reads that minimizes the edit score to the covered long read region and can be used as the corrected sequence for this region. As both short and long reads are sequenced from the same donor, mapped short reads usually cover a large portion of the long reads (see Table 4.4 for supporting results). However, since they are mapped to noisy long reads, there are regions on the long reads that are not covered by any short read, which we call gaps; they are located either at the extremities of the long reads or between two corrected regions. In a second step, CoLoRMap attempts to expand the corrected regions using One-End Anchors (OEAs), which are reads that are not mapped to the long reads but whose corresponding mates are mapped to

¹Note, however, that at the time of publishing CoLoRMap, the precise definition of the objective function used in Nanocorr was not available; it is only stated that it "penalizes overlaps while maximizing alignment lengths and accuracy".

a corrected region on the long reads. For each gap, CoLoRMap then employs Minia [23] to perform a local assembly of the set of OEAs associated with the gap and uses the obtained contigs to correct the gap.

4.1.2 Initial correction of long reads: the SP algorithm

For the sake of simplicity, here we explain the process of correcting a single long read, L, as this process is independent of the correction of the other long reads.

Preliminaries. For a string S = s1s2...sk, |S| = k denotes the length of S. The i-th character of S is denoted by S[i]. A substring of S is denoted by Si,j = sisi+1...sj, where i, j ∈ {1, ..., k} and i ≤ j. An alignment between two strings over the alphabet {A, C, G, T} is a sequence of pairs of elements from ({A, C, G, T, −} × {A, C, G, T, −}) \ {(−, −)}.

Let M = {m1, m2, ..., mn} be the set of mappings of the short reads onto L, where each mi is represented by three pieces of information: mi.bp denotes the beginning (leftmost) position of the mapping on L, mi.ep denotes the end (rightmost) position of the alignment on L, and mi.seq denotes the actual sequence of the short read aligned to L. Note that some mapping tools may clip the beginning or end of the query and align only a substring of the query to the target long read; this does not impact our method. Nevertheless, for the sake of exposition, even if a short read has been clipped during the mapping process, we keep calling it a read.

Weighted alignment graph construction. CoLoRMap builds a weighted graph OL from M. Each node in OL corresponds to a mapping mi from M and there is an edge between two nodes if their corresponding mappings overlap on the long read L, as defined below:

Definition 4.1. Two mappings mi and mj overlap iff

(i) mi.bp ≤ mj.bp and mi.ep < mj.ep;

(ii) mj.bp ≤ mi.ep − minOverlap + 1;

(iii) the respective substrings of both short reads that belong to the overlap are identical,

where minOverlap is the minimum required overlap length.

Based on this definition, we insert edges into OL only for exactly matching overlaps. However, if there is a single mismatch between the overlapping parts of two reads, we replace the lower quality base at the mismatching position with the higher quality base so that the overlap becomes an exact matching overlap, and then we can add its corresponding edge to the graph. This change does not modify the original read sequence and is limited to the content of the inserted edge.
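As an illustration, the overlap test of Definition 4.1 can be sketched in a few lines of Python. This is a minimal sketch, not CoLoRMap's actual code: the `Mapping` record mirrors the notation mi.bp, mi.ep, mi.seq, `min_overlap` plays the role of minOverlap, the alignment of each read to L is assumed gap-free, and the quality-based mismatch repair described above is omitted.

```python
# Sketch of Definition 4.1; a gap-free alignment of each read to L is assumed.
from dataclasses import dataclass

@dataclass
class Mapping:
    bp: int   # leftmost position of the alignment on the long read L
    ep: int   # rightmost position of the alignment on L
    seq: str  # sequence of the short read as aligned to L

def overlap_suffix_start(mi: Mapping, mj: Mapping) -> int:
    # index into mj.seq of the first base aligned past mi.ep
    # (valid under the gap-free assumption of this sketch)
    return mi.ep - mj.bp + 1

def overlaps(mi: Mapping, mj: Mapping, min_overlap: int = 10) -> bool:
    # (i) mj must extend mi to the right
    if not (mi.bp <= mj.bp and mi.ep < mj.ep):
        return False
    # (ii) the overlap on L must span at least min_overlap positions
    if not (mj.bp <= mi.ep - min_overlap + 1):
        return False
    # (iii) the overlapping substrings must match exactly
    olap_len = mi.ep - mj.bp + 1
    return mi.seq[-olap_len:] == mj.seq[:olap_len]
```

With two toy mappings whose overlapping substrings agree, `overlaps` accepts or rejects the pair depending on whether the overlap length reaches `min_overlap`.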


Figure 4.1: (a) The notion of overlap for mappings. For two overlapping mappings mi and mj, the weight of the corresponding edge is set to the edit distance between the suffix of mj.seq and its aligned region in L (marked in red in the figure). (b) Reconstruction of the corrected sequence spelled from the shortest path. The spelled string is obtained by concatenating the mapping suffixes along the shortest path.

The weight of the edge associated with an overlap between mi and mj, denoted wij, is defined as the edit distance between mj.seqx,y and the substring Lmi.ep,mj.ep of the long read, where x is the position on mj.seq that is aligned with L[mi.ep] and y = |mj.seq|. In other words, the edit distance is calculated on the suffix of mj.seq that does not belong to the overlap (as shown in Figure 4.1.a). The motivation behind choosing such a weight function is the following observation:

Property 4.1. In a connected component of the weighted alignment graph OL, consider the leftmost mapping as the source node and the rightmost mapping as the target node. For each path in OL, we define its edit score as the sum of the edit distances of the overlap suffixes (shown in red in Figure 4.1.a) along that path. If the overlaps in OL are exact matching overlaps, the shortest path from source to target defines a sequence of overlapping mapped short reads that minimizes the edit score to the covered region of L among all such sequences.

The observation above does not imply that we always obtain a sequence of overlapping short reads that minimizes the edit score to a region of L (which is, for example, the general principle underlying LoRDEC), as the sequence of overlapping reads is constrained by the set of initial mappings M.

Thus, for each connected component of OL, we define a source node, which is the leftmost mapping of the component, and a target node, which is the rightmost mapping. CoLoRMap then uses Dijkstra's shortest path algorithm to find the shortest path p from the source node to the target node. A string can be spelled from p using the sequences of the mappings corresponding to p (see Figure 4.1.b for a toy example), and this string is used as the corrected string of the region of L spanned by the mappings of the connected component. For each connected component, CoLoRMap replaces the uncorrected string on L (starting at the source mapping and ending at the target mapping) with the spelled string. CoLoRMap can perform several rounds of correction using the SP algorithm explained above: mapping short reads onto the partially corrected long reads in a second pass yields higher coverage and more consistent mappings. Preliminary experiments showed that this helps to obtain higher quality corrections, although at the cost of a higher computation time.
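The correction of one connected component can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the CoLoRMap implementation: edge weights (the edit distances of the overlap suffixes) are assumed precomputed, and `suffix_starts` is a hypothetical table recording where each successor's non-overlapping suffix begins.

```python
# Sketch of the SP step: Dijkstra over the weighted alignment graph of one
# connected component, then spelling the corrected region by concatenating
# the source read with each successor's non-overlapping suffix (cf. Fig. 4.1.b).
import heapq

def shortest_path(edges, source, target):
    # edges: dict mapping a node to a list of (neighbor, weight) pairs
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in edges.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # reconstruct the path by walking predecessors back from the target
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def spell(path, seqs, suffix_starts):
    # seqs[i]: aligned sequence of mapping i;
    # suffix_starts[(u, v)]: index in seqs[v] where the part of v
    # that does not overlap u begins
    out = [seqs[path[0]]]
    for u, v in zip(path, path[1:]):
        out.append(seqs[v][suffix_starts[(u, v)]:])
    return "".join(out)
```

On a toy three-node component, the path through the intermediate node wins when its summed weight is smaller, and the spelled string is the source sequence extended by each successor's suffix.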

Complexity analysis. Let n be the total number of alignments on a single long read. The time complexity of building the weighted alignment graph is bounded by O(n²). The shortest path can be computed in O(e log n) time, where e is the number of edges of the graph.

Mapping parameters. For mapping short reads to the long reads, CoLoRMap runs BWA-MEM with options -aY -A 5 -B 11 -O 2,1 -E 4,3 -k 8 -W 16 -w 40 -r 1 -D 0 -y 20 -L 30,30 -T 2.5, which are similar to the parameters used by proovread [59] except for shorter seeds, which increase sensitivity. It is important to note that since BWA-MEM does not guarantee to report all mappings of each short read, breaking the set of long reads into smaller chunks yields a higher coverage of short read mappings on the long reads. Table 4.8 shows the result of our experiment on how chunking enhances the quality of correction with CoLoRMap. However, mapping to chunks takes longer than mapping to the whole set of long reads, so the choice of chunk size depends on the desired trade-off between accuracy and speed. CoLoRMap splits the long read set into chunks of about 50 Mbp and performs correction on each chunk separately.
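The chunking strategy amounts to greedily packing long reads into chunks of roughly a target total size. A minimal sketch, assuming reads are plain strings (the function name and interface are illustrative, not CoLoRMap's actual code):

```python
# Illustrative sketch: greedily pack long reads into chunks of roughly
# chunk_size total bases (the text uses ~50 Mbp), so that short reads
# can be mapped to each chunk separately.
def chunk_reads(reads, chunk_size=50_000_000):
    chunks, current, current_len = [], [], 0
    for read in reads:
        current.append(read)
        current_len += len(read)
        if current_len >= chunk_size:
            chunks.append(current)
            current, current_len = [], 0
    if current:
        chunks.append(current)  # last, possibly smaller, chunk
    return chunks
```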

4.1.3 Correcting gaps using One-End Anchors

Although the previous correction step can correct large parts of many long reads, there are usually some regions of the long reads containing so many sequencing errors that no short read can align there (the so-called gaps). For example, we observed regions on some long reads where the maximum exact match with the reference genome is only four base pairs long (Figure 4.2 depicts an example of such a region). Therefore, these uncovered regions cannot be corrected through a mapping-based approach. More generally, it is natural to ask

if looking to correct such regions by optimizing some notion of distance between the long read region and the short read mappings is relevant. Nevertheless, correcting these regions is essential in order to obtain higher quality long reads. To address this issue, CoLoRMap uses One-End Anchors (OEAs) to correct these regions. Again, for a long read L, a One-End Anchor is a short read that did not map to L but whose corresponding mate read is mapped to L. It is important to note that since both short reads and long reads come from the same donor genome, it is possible to properly identify a set of OEAs for each such gap (uncorrected region) by looking at mappings in its flanking corrected regions, which correspond to connected components of the weighted alignment graph OL. Suppose R = {r1, r1′, r2, r2′, ..., rn, rn′} is the set of input paired-end short reads with mean library insert size δ and standard deviation σ, where ri and ri′ are mates. We map this set of short reads to the input set of corrected long reads using BWA [96], a mapping tool optimized for Illumina short reads. As a result, short reads will be easily mapped to the corrected regions. Consider the case of an uncorrected region on long read L surrounded by two regions corrected during the initial correction step.

Definition 4.2. Let Lp,q be a gap of L flanked by two corrected regions Li,p−1 and Lq+1,j. A read r is an OEA for the gap if

(i) its mate r′ (or its reverse complement) is mapped to a flanking region of the gap with an orientation indicating that r could belong to the gap;

(ii) r is not mapped to Li,j, or is mapped only partially over one of the boundaries of the gap;

(iii) the distance from the mapping position of r′ to the gap is at most δ + 3σ.

So, after obtaining all the mappings of the short reads to L with BWA, CoLoRMap records the set of corresponding OEAs for each gap. Figure 4.3 depicts an instance of a gap and how OEAs are extracted. The sequences of the recorded OEAs are then fed into the assembly tool Minia [23] to obtain contigs. Minia v2.0.3 was run with parameters -kmer-size 43 -abundance-min 1. Minia was chosen for its ease of use and low computational resource requirements.
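Putting the conditions of Definition 4.2 together, the OEA test can be sketched as follows. This is an illustrative sketch only: the argument names are hypothetical, the orientation check of condition (i) is reduced to a single boolean, and `delta` and `sigma` are the insert-size mean and standard deviation from the text.

```python
# Hedged sketch of the OEA test (Definition 4.2); names are illustrative.
def is_oea(read_mapped, mate_mapped, mate_pos,
           mate_oriented_into_gap, gap_start, gap_end, delta, sigma):
    # (ii) r itself must not be (fully) mapped to the long read
    if read_mapped:
        return False
    # (i) its mate r' must be mapped in a flanking corrected region,
    #     oriented so that r could fall inside the gap
    if not (mate_mapped and mate_oriented_into_gap):
        return False
    # (iii) the mate's mapping position must lie within delta + 3*sigma
    #       of the gap (distance 0 if it maps between the boundaries)
    dist = max(gap_start - mate_pos, mate_pos - gap_end, 0)
    return dist <= delta + 3 * sigma
```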


Figure 4.2: An example of a gap (a region uncovered by short reads) on a long read, exported from the IGV software. There are so many sequencing errors that mapping short reads in this region is very challenging. In the region shown here, the maximum exact match between the long read and the reference genome is 4 bp long, in a region of size ≈ 150 bp.

Figure 4.3: Detecting One-End Anchors (OEAs) for a gap (uncorrected region). OEAs, shown in red, are unmapped or partially mapped reads whose mates, shown in blue, are mapped to corrected regions concordantly (with proper orientation and distance). The assembled contigs, shown in light green, are used to improve the quality of the gap region.

4.2 Results

4.2.1 Data and computational setting

We performed experiments on three data sets: a bacterial data set from Escherichia coli, and two eukaryotic ones from Saccharomyces cerevisiae (yeast) and Drosophila melanogaster (fruit fly). For each genome, we obtained a set of PacBio (noisy) long reads, comprising 98 Mbp, 1.4 Gbp, and 1.35 Gbp, respectively. We also obtained a set of high-quality Illumina paired-end short reads for each genome, containing 234 Mbp, 455 Mbp, and 7 Gbp, respectively (more details are available in Appendix B.1). We compared CoLoRMap with PacBioToCA, LSC, proovread, and LoRDEC. PacBioToCA, LSC, and proovread were run with default parameters except for the number of threads. For LoRDEC v0.6, we used options -k 19 -s 3 -e 0.4 -b 200 -t 5 as explained in [141]. For the E. coli data set, experiments were performed on a local workstation equipped with a Xeon E3-1270 v3 processor (CPU clock speed: 3.5 GHz, 8 cores), 32 GB of main memory, and 2 TB of locally attached hard disk. Table 4.1 provides a comparison of these tools in terms of running time for the E. coli data set. For the larger yeast and D. melanogaster data sets, the experiments were performed on multiple computers, so we do not provide a running time comparison.

Table 4.1: Runtime of different correction methods for E. coli dataset.

Data: E. coli, #Threads: 8

Method          Elapsed time (minutes)
PacBioToCA      97m
LSC             387m
proovread       105m
LoRDEC          7m
CoLoRMap        38m
CoLoRMap+OEA    38m + 81m

Notes: The Linux/Unix "time" command was used for reporting the runtime.

In the following, we describe our evaluation approach for comparing the corrected long reads obtained by the different methods considered.

4.2.2 Measures of evaluation

In order to assess the performance of the correction methods, we followed Salmela et al. [141] and investigated how well corrected long reads align to the reference genome, followed by checking how well corrected long reads can be used for de novo assembly. In order to map long reads to the reference genome, we used BLASR [18] and BWA-MEM [96]. The rationale behind using both tools for evaluation is the observation that there are usually some reads for which one tool finds mappings while the other reports none. BLASR is specifically designed for aligning PacBio long reads to a reference sequence. Running BLASR with options -noSplitSubreads -bestn 1 gives a single best alignment for each long read. BWA-MEM is a fast alignment tool that supports mapping of long reads to a reference sequence and can handle noisy PacBio long reads via option -x pacbio. It is important to note that BWA-MEM often reports a multi-piece mapping for a long read rather than one contiguous alignment. In our evaluations, we still consider all such fragmented alignments of a long read provided the distance between the mapping positions of the fragments on the reference is not larger than the length of the long read. The first evaluation measure we considered is the number of long reads that align to the reference genome. We also recorded the number of aligned bases in corrected long reads and the number of bases that match the reference in the alignment. Finally, we computed a notion of identity as defined by Salmela et al. [141]: the number of base matches over the length of the aligned region in the reference genome.
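For concreteness, the two ratio-based measures can be written as simple formulas (illustrative only; in practice the inputs would be parsed from the BLASR or BWA-MEM alignments):

```python
# Illustrative formulas for the evaluation measures described above.
def identity(n_matched_bases, ref_aligned_len):
    # identity as in Salmela et al.: matched bases over the length
    # of the aligned region in the reference genome, in percent
    return 100.0 * n_matched_bases / ref_aligned_len

def aligned_fraction(n_aligned_bases, total_read_len):
    # fraction of read bases that align to the reference, in percent
    return 100.0 * n_aligned_bases / total_read_len
```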

Trimming and splitting corrected reads. Among the compared correction tools, CoLoRMap and LoRDEC report full long reads with corrected high-quality regions indicated in upper case and uncorrected regions in lower case. proovread outputs both full corrected long reads (without marking the corrected regions, though) and corrected regions as separate sequences. PacBioToCA, however, outputs only the corrected regions of long reads as separate sequences. We evaluate the full long reads obtained from CoLoRMap, LSC, LoRDEC, and proovread as well as the trimmed long reads, obtained by removing all uncorrected bases from both extremities of a long read while keeping gaps (uncorrected regions flanked by corrected regions). In order to compare with PacBioToCA and proovread, we also evaluated the split long reads from CoLoRMap and LoRDEC, obtained by extracting only the corrected regions from the corrected long reads, each such region being considered as a separate sequence.
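Given the case convention above (corrected bases in upper case, uncorrected bases in lower case), the trimmed and split variants can be obtained in a few lines. A minimal sketch, assuming an ACGTN alphabet:

```python
# Sketch of the trimming and splitting conventions for CoLoRMap/LoRDEC
# output, where corrected regions are upper case and uncorrected ones
# lower case; assumes an ACGTN alphabet.
import re

def trim(read):
    # remove uncorrected bases from both extremities, keep internal gaps
    return read.strip("acgtn")

def split(read):
    # extract each corrected (upper-case) region as a separate sequence
    return re.findall(r"[ACGTN]+", read)
```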

4.2.3 Comparison based on alignment

The results of our experiments are summarized in Tables 4.2–4.3. These results are based on alignments from BLASR (Table 4.2) or BWA-MEM (Table 4.3). We can observe that CoLoRMap performs best in terms of the number of corrected reads that align back onto the reference genome, while maintaining a high average identity, although slightly lower than PacBioToCA, LoRDEC, and proovread. It is also interesting to observe that the OEA step results in a non-negligible improvement in the size of corrected regions (see Table 4.4), while also increasing the average identity of the trimmed reads. In terms of corrected regions, proovread computes the longest ones, and it might be interesting to see whether it is possible to combine the hierarchical approach of proovread with our algorithm.

4.2.4 Comparison based on assembly

In addition to comparing the quality of corrected long reads, we also investigated how well corrected long reads from different tools can be utilized in a downstream analysis task. We chose de novo assembly, as there exists a specialized assembler for long noisy reads, Canu [85]. In order to assess the quality of the assembled contigs, we used QUAST [56]. Tables 4.5–4.7 show the output of QUAST for assemblies obtained by running Canu on the sets of long reads corrected by the different correction tools. For the E. coli and yeast data sets, the set of contigs assembled from our corrected long reads has the highest NGA50, lower numbers of mismatches and indels, and covers the reference genome better. The assemblies of the D. melanogaster data set, however, do not seem reliable, as they cover only a tiny fraction of the reference genome. This might be due to the low coverage of the long reads (9.7x, while Canu suggests a coverage of at least about 50x).

Table 4.2: Quality of corrected long reads for E. coli, yeast, and fruit fly datasets obtained with different methods. Assessment is based on alignments of long reads to the reference genome obtained with BLASR.

Dataset  Method  #Reads(a)  Aligned #Reads(b)  Aligned #Bases(c) (Mb)  Size(d) (%)  Matched(e) (%)  Identity(f) (%)  Coverage(g) (%)

E. coli (Full):
Original       33360   31071   86.64   88.40   76.95   94.84   100.00
LSC            25426   25098   77.51   92.63   86.00   97.55   100.00
proovread      24722   23453   71.32   89.36   87.90   99.70   100.00
LoRDEC         33360   30837   79.37   86.91   85.24   99.48   100.00
CoLoRMap       33360   31271   83.34   89.92   87.53   99.27   100.00
CoLoRMap+OEA   33360   31215   82.92   89.66   87.58   99.38   100.00
(Trim):
LSC            25426   25226   72.52   95.37   89.55   97.92   100.00
LoRDEC         31733   30969   79.25   93.27   92.01   99.68   100.00
CoLoRMap       30396   30190   76.67   96.26   94.24   99.46   100.00
CoLoRMap+OEA   30396   30183   76.43   96.21   94.56   99.58   100.00
(Split):
PacBioToCA     100100  99668   68.21   98.51   98.48   99.94   99.71
proovread      30479   30456   71.40   99.34   99.22   99.97   99.66
LoRDEC         49018   41437   79.77   99.02   98.96   99.96   99.82
CoLoRMap       48987   48840   73.73   99.11   98.99   99.90   99.91
CoLoRMap+OEA   40256   40101   74.57   98.99   98.84   99.89   99.91

Yeast (Full):
Original       231594   224694   1229.72   87.68   78.84   93.87   99.77
proovread      229702   222976   1205.71   87.99   83.13   96.38   99.82
LoRDEC         231594   221692   1171.49   86.11   83.48   98.38   99.82
CoLoRMap       231594   223641   1207.73   88.60   85.62   98.30   99.83
CoLoRMap+OEA   231594   223497   1205.65   88.55   85.72   98.40   99.83
(Trim):
LoRDEC         228893   221902   1175.30   89.12   86.60   98.51   99.81
CoLoRMap       211324   208188   1017.55   92.84   90.46   98.79   99.82
CoLoRMap+OEA   211324   208310   1017.39   92.95   90.76   98.92   99.82
(Split):
proovread      225878   225497   244.48    99.53   99.39   99.84   60.49
LoRDEC         1460179  919020   1120.63   96.78   96.30   99.50   99.77
CoLoRMap       435140   432750   943.50    97.56   97.29   99.69   99.79
CoLoRMap+OEA   349998   347516   953.00    97.26   96.95   99.66   99.79

Fruit fly (Full):
Original       901564   313983    502.90   37.05   33.20   94.60   93.68
LoRDEC         901564   342784    499.02   37.34   35.27   97.16   93.91
CoLoRMap       901564   348810    535.90   40.23   38.39   97.96   94.65
(Trim):
LoRDEC         665298   348924    493.09   45.13   42.73   97.27   93.73
CoLoRMap       286679   256775    324.98   68.98   66.34   98.46   85.53
(Split):
LoRDEC         4303563  1366425   558.80   77.65   76.80   98.82   92.12
CoLoRMap       453006   415526    337.99   89.04   88.45   99.29   85.63

Notes: (a) the number of DNA sequences available after running the correction tool (may contain uncorrected sequences); for the original data set, the total number of long reads. (b) the number of aligned sequences. (c) the number of bases aligned to the reference genome. (d) the percentage of aligned bases, i.e., column (c) / summed length of sequences in column (a). (e) the percentage of matched bases, i.e., total number of matched bases / summed length of sequences in column (a). (f) average identity, i.e., total number of matched bases / summed length of aligned regions in the reference genome. (g) percentage of the reference genome covered by the aligned sequences.

Table 4.3: Quality of corrected long reads for E. coli and yeast datasets obtained with different methods. Assessment is based on alignments of long reads to the reference genome obtained with BWA-MEM.

Dataset  Method  #Reads(a)  Aligned #Reads(b)  Aligned #Bases(c) (Mb)  Size(d) (%)  Matched(e) (%)  Identity(f) (%)  Coverage(g) (%)

E. coli (Full):
Original       33360   30830   86.69   88.45   76.66   94.07   100.00
LSC            25426   25403   77.87   93.06   86.46   97.20   100.00
proovread      24722   24046   73.29   91.83   90.89   99.69   100.00
LoRDEC         33360   31371   82.33   90.16   88.74   99.44   100.00
CoLoRMap       33360   31693   84.69   91.37   89.34   99.20   100.00
CoLoRMap+OEA   33360   31693   84.51   91.39   89.67   99.33   100.00
(Trim):
LSC            25426   25402   72.26   95.02   89.47   97.68   100.00
LoRDEC         31733   31320   80.14   94.32   93.49   99.69   100.00
CoLoRMap       30396   30392   76.69   96.28   94.77   99.45   100.00
CoLoRMap+OEA   30396   30392   76.50   96.29   95.17   99.59   100.00
(Split):
PacBioToCA     100100  100006  69.10   99.80   99.77   99.95   99.81
proovread      30479   30477   71.52   99.50   99.40   99.97   99.67
LoRDEC         49018   41679   80.04   99.33   99.28   99.96   99.83
CoLoRMap       48987   48965   74.26   99.82   99.70   99.91   99.91
CoLoRMap+OEA   40256   40235   75.17   99.79   99.65   99.90   99.91

Yeast (Full):
Original       231594   136943   742.47    52.94   47.05   92.79   99.69
proovread      229702   223719   1216.55   88.78   83.79   95.86   99.75
LoRDEC         231594   226827   1223.76   89.96   87.50   98.21   99.71
CoLoRMap       231594   228484   1240.42   91.00   88.42   98.14   99.71
CoLoRMap+OEA   231594   228484   1239.54   91.03   88.66   98.27   99.70
(Trim):
LoRDEC         228893   226632   1206.11   91.46   89.25   98.40   99.71
CoLoRMap       211324   211206   1029.58   93.94   92.17   98.77   99.71
CoLoRMap+OEA   211324   211206   1028.61   93.98   92.47   98.93   99.70
(Split):
proovread      225878   225670   245.18    99.82   99.66   99.82   60.64
LoRDEC         1460179  925878   1133.58   97.90   97.41   99.52   99.72
CoLoRMap       435140   434418   961.26    99.40   99.14   99.74   99.71
CoLoRMap+OEA   349998   349421   973.63    99.37   99.07   99.72   99.70

Notes: (a) the number of DNA sequences available after running the correction tool (may contain uncorrected sequences); for the original data set, the total number of long reads. (b) the number of aligned sequences. (c) the number of bases aligned to the reference genome. (d) the percentage of aligned bases, i.e., column (c) / summed length of sequences in column (a). (e) the percentage of matched bases, i.e., total number of matched bases / summed length of sequences in column (a). (f) average identity, i.e., total number of matched bases / summed length of aligned regions in the reference genome. (g) percentage of the reference genome covered by the aligned sequences.

Table 4.4: Statistics of corrected and un-corrected regions after correction with different methods.

                             Corrected regions                 Un-corrected regions (gaps)
Data set    Method           # regions  avg size  total size   # regions  avg size  total size
E. coli     Original         NA         NA        NA           33360      2938      98.01
            PacBioToCA       100100     691       69.24        NA         NA        NA
            proovread        30479      2358      71.87        NA         NA        NA
            LoRDEC           49018      1643      80.58        52696      203       10.74
            CoLoRMap         48987      1518      74.39        40999      446       18.29
            CoLoRMap+OEA     40256      1871      75.33        32268      531       17.15
Yeast       Original         NA         NA        NA           231594     6055      1402.46
            proovread        229702     5965      1370.27      NA         NA        NA
            LoRDEC           1460179    793       1157.93      1564253    129       202.47
            CoLoRMap         435140     2222      967.08       456717     867       396.02
            CoLoRMap+OEA     349998     2799      979.85       371575     1027      381.76
Fruit fly   Original         NA         NA        NA           901564     1505      1357.18
            LoRDEC           4303563    167       719.67       5006145    123       616.81
            CoLoRMap         453006     837       379.60       1191316    799       952.52

Table 4.5: Quality of Canu assemblies for E. coli data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted.

Assembly                      Original   LoRDEC    proovread  CoLoRMap   CoLoRMap+OEA
# contigs (≥ 0 bp)            182        24        26         19         19
# contigs (≥ 1000 bp)         182        24        26         19         19
# contigs (≥ 5000 bp)         178        24        26         19         19
# contigs (≥ 10000 bp)        141        24        26         19         19
# contigs (≥ 50000 bp)        4          21        22         19         19
Total length (≥ 0 bp)         3508197    4623137   4629719    4624793    4627249
Total length (≥ 1000 bp)      3508197    4623137   4629719    4624793    4627249
Total length (≥ 5000 bp)      3492249    4623137   4629719    4624793    4627249
Total length (≥ 10000 bp)     3209268    4623137   4629719    4624793    4627249
Total length (≥ 25000 bp)     1710292    4623137   4616507    4624793    4627249
Total length (≥ 50000 bp)     228498     4495150   4492555    4624793    4627249
Largest contig                69266      920903    605792     1089140    1089205
Reference length              4641652    4641652   4641652    4641652    4641652
GC (%)                        51.05      50.81     50.81      50.81      50.81
Reference GC (%)              50.79      50.79     50.79      50.79      50.79
N50                           24663      226456    231774     239066     239066
NG50                          17847      226456    231774     239066     239066
L50                           48         6         7          5          5
LG50                          76         6         7          5          5
# unaligned contigs           0 + 0 part 0 + 0 part 0 + 0 part 0 + 0 part 0 + 0 part
Unaligned length              0          0         0          0          0
Genome fraction (%)           75.455     99.120    99.092     99.244     99.231
Duplication ratio             1.002      1.005     1.007      1.004      1.005
Largest alignment             69266      538466    398061     698643     698643
NA50                          24663      202095    198530     239066     239066
NGA50                         17847      202095    198530     239066     239066
LA50                          48         8         9          6          6
LGA50                         76         8         9          6          6
# misassemblies               0          6         7          5          6
# relocations                 0          6         7          5          6
# translocations              0          0         0          0          0
# inversions                  0          0         0          0          0
# misassembled contigs        0          4         3          3          4
Misassembled contigs length   0          1328532   1076559    1277904    1446651
# local misassemblies         1          2         3          1          1
# N's per 100 kbp             0.00       0.00      0.00       0.00       0.00
# mismatches per 100 kbp      8.17       15.63     18.00      6.64       7.36
# indels per 100 kbp          191.04     3.43      2.02       1.80       1.74
Indels length                 7249       222       126        99         98

Table 4.6: Quality of Canu assemblies for yeast data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted.

Assembly                      Original   LoRDEC    proovread  CoLoRMap   CoLoRMap+OEA
# contigs (≥ 0 bp)            26         28        32         24         29
# contigs (≥ 1000 bp)         26         28        32         24         29
# contigs (≥ 5000 bp)         26         28        31         22         28
# contigs (≥ 10000 bp)        26         27        30         21         28
# contigs (≥ 50000 bp)        22         19        24         19         20
Total length (≥ 0 bp)         12341981   12497078  12485995   12315869   12450479
Total length (≥ 1000 bp)      12341981   12497078  12485995   12315869   12450479
Total length (≥ 5000 bp)      12341981   12497078  12484209   12308283   12445656
Total length (≥ 10000 bp)     12341981   12490996  12474494   12302229   12445656
Total length (≥ 25000 bp)     12341981   12444116  12456794   12302229   12385648
Total length (≥ 50000 bp)     12218401   12257688  12279045   12239085   12217774
Largest contig                1543990    1552711   1537979    1555857    1538508
Reference length              12157105   12157105  12157105   12157105   12157105
GC (%)                        38.18      38.21     38.22      38.17      38.20
Reference GC (%)              38.15      38.15     38.15      38.15      38.15
N50                           777602     818962    777713     815158     932935
NG50                          777602     818962    777713     815158     932935
L50                           6          6         6          6          6
LG50                          6          6         6          6          6
# unaligned contigs           1 + 1 part 1 + 0 part 1 + 0 part 1 + 0 part 1 + 0 part
Unaligned length              27953      27982     42350      34077      29118
Genome fraction (%)           98.638     98.791    98.687     98.716     98.881
Duplication ratio             1.027      1.038     1.037      1.023      1.033
Largest alignment             1084893    1073237   1090741    1073302    1085688
NA50                          354598     377095    350112     377108     377106
NGA50                         354598     377095    350112     377108     377106
LA50                          11         11        11         11         11
LGA50                         11         11        11         11         11
# misassemblies               107        124       108        102        112
# relocations                 26         42        29         30         31
# translocations              79         82        79         72         80
# inversions                  2          0         0          0          1
# misassembled contigs        21         25        24         19         24
Misassembled contigs length   10513374   12191557  10639582   11996690   10856637
# local misassemblies         31         11        14         11         12
# N's per 100 kbp             0.00       0.00      0.00       0.14       0.00
# mismatches per 100 kbp      75.76      89.07     96.59      87.75      84.37
# indels per 100 kbp          25.83      19.92     21.04      13.64      13.66
Indels length                 6573       5899      6112       4901       4627

Table 4.7: Quality of Canu assemblies for D. melanogaster data set corrected by different methods. The assessment is done using QUAST. All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted.

Assembly                      Original     LoRDEC       CoLoRMap
# contigs (≥ 0 bp)            217          224          260
# contigs (≥ 1000 bp)         217          224          260
# contigs (≥ 5000 bp)         159          144          161
# contigs (≥ 10000 bp)        47           33           42
# contigs (≥ 50000 bp)        0            0            2
Total length (≥ 0 bp)         1768221      1730606      2106055
Total length (≥ 1000 bp)      1768221      1730606      2106055
Total length (≥ 5000 bp)      1543023      1410633      1726065
Total length (≥ 10000 bp)     735933       653134       910341
Total length (≥ 25000 bp)     58943        286003       488439
Total length (≥ 50000 bp)     0            0            142690
Largest contig                30023        42661        75766
Reference length              137567484    137567484    137567484
GC (%)                        38.17        37.92        38.22
Reference GC (%)              42.08        42.08        42.08
N50                           8620         7664         8485
L50                           64           58           58
# unaligned contigs           69 + 8 part  67 + 14 part 61 + 17 part
Unaligned length              770395       861325       986102
Genome fraction (%)           0.649        0.573        0.764
Duplication ratio             1.117        1.104        1.066
Largest alignment             16190        13571        17993
NA50                          1442         -            955
NGA50                         -            -            -
LA50                          177          -            238
# misassemblies               175          122          138
# relocations                 117          73           83
# translocations              58           49           54
# inversions                  0            0            1
# misassembled contigs        67           54           73
Misassembled contigs length   562340       358099       478259
# local misassemblies         55           32           21
# N's per 100 kbp             0.00         0.00         0.00
# mismatches per 100 kbp      679.35       704.35       583.04
# indels per 100 kbp          401.99       273.08       191.33
Indels length                 9235         6132         7931

Table 4.8: The effect of chunking on correction quality for CoLoRMap. CoLoRMap-w represents running of our software on the whole long read set without chunking.

Dataset  Method  #Reads(a)  Aligned #Reads(b)  Aligned #Bases(c) (Mb)  Size(d) (%)  Matched(e) (%)  Identity(f) (%)  Coverage(g) (%)

E. coli (Full):
CoLoRMap-w       33360   31247   83.21   89.49   86.59   99.02   100.00
CoLoRMap         33360   31271   83.34   89.92   87.53   99.27   100.00
CoLoRMap-w+OEA   33360   31165   82.56   89.07   86.63   99.20   100.00
CoLoRMap+OEA     33360   31215   82.92   89.66   87.58   99.38   100.00
E. coli (Trim):
CoLoRMap-w       30501   30302   76.71   95.98   93.44   99.23   100.00
CoLoRMap         30396   30190   76.67   96.26   94.24   99.46   100.00
CoLoRMap-w+OEA   30501   30285   76.34   95.88   93.87   99.43   100.00
CoLoRMap+OEA     30396   30183   76.43   96.21   94.56   99.58   100.00
E. coli (Split):
CoLoRMap-w       57458   57281   71.45   98.90   98.76   99.88   99.91
CoLoRMap         48987   48840   73.73   99.11   98.99   99.90   99.91
CoLoRMap-w+OEA   44037   43847   73.06   98.77   98.59   99.86   99.91
CoLoRMap+OEA     40256   40101   74.57   98.99   98.84   99.89   99.91

Yeast CoLoRMap-w 231594 223919 1211.63 88.07 83.12 96.69 99.85 (Full) CoLoRMap 231594 223641 1207.73 88.60 85.62 98.30 99.83 CoLoRMap-w+OEA 231594 223693 1207.65 88.02 83.61 97.10 99.85 CoLoRMap +OEA 231594 223497 1205.65 88.55 85.72 98.40 99.83 Yeast CoLoRMap-w 214765 211702 1004.25 93.35 88.61 96.94 99.85 (Trim) CoLoRMap 211324 208188 1017.55 92.84 90.46 98.79 99.82 CoLoRMap-w+OEA 214765 211710 1001.17 93.38 89.33 97.44 99.81 CoLoRMap +OEA 211324 208310 1017.39 92.95 90.76 98.92 99.82 Yeast CoLoRMap-w 1043237 1038397 631.79 96.65 96.14 99.44 99.68 (Split) CoLoRMap 435140 432750 943.50 97.56 97.29 99.69 99.79 CoLoRMap-w+OEA 676091 672731 707.32 97.36 96.60 99.28 99.77 CoLoRMap +OEA 349998 347516 952.99 97.26 96.95 99.66 99.79 Notes: athe number of DNA sequences available after running the correction tool (may contain uncorrected sequences); in case of original data set, shows the total number of long reads. bthe number of aligned sequences. cthe number of bases aligned to the reference genome. dthe percentage of aligned bases; that is column c / summed length of sequences in column a. ethe percentage of matched bases; that is total number of matched bases / summed length of sequences in column a. f average identity; that is total number of matched bases / summed length of aligned regions in the reference genome. gpercentage of the reference genome covered by the aligned sequences.

4.3 Comparison with more recent hybrid correction tools

Since the publication of CoLoRMap in 2016, many other hybrid correction tools have been developed that were not benchmarked in the original CoLoRMap paper. Notable examples include Jabba [116], HALC [8], Hercules [46], FMLRC [164], and HG-CoLoR [121]. Although we have not updated CoLoRMap since its first release, it is interesting to see how it performs against these state-of-the-art hybrid error correction tools. Zhang et al. [177] benchmarked these tools together with previously published ones, including CoLoRMap, ECTools, and Nanocorr. The evaluation was performed on PacBio and Nanopore long reads from three datasets: E. coli, yeast, and fruit fly. We refer the reader to Table 1 in [177] for more details about these datasets. Tables 4.9-4.14 show the results of this evaluation. As can be seen, no single tool is best across all metrics. However, we can make the following observations. In terms of the number of aligned reads, HALC is the best performing tool, especially on the larger datasets; CoLoRMap performs well on this metric only for the E. coli datasets. In terms of the number of aligned bases, FMLRC is the best tool, with CoLoRMap trailing it on most of the datasets (the exception is the E. coli PacBio dataset, on which CoLoRMap is the best). In terms of N50, there is no clear winner, but CoLoRMap is more often the better performing tool. With regard to genome fraction, CoLoRMap is always on par with the best performing tools. On the other hand, in terms of alignment identity, the reported numbers for CoLoRMap are low (although Hercules's are the lowest). This is likely because Zhang et al. [177] skipped the second step of CoLoRMap, which uses OEA reads to fill the gaps in the corrected reads. One clear limitation of CoLoRMap compared to some of the other tools is its long run time, especially relative to FMLRC and Jabba on larger datasets.
Indeed, improving the run time of CoLoRMap is one of the future research directions of this thesis; we provide some suggestions towards this goal in Chapter 6.

4.4 Summary

We described CoLoRMap, a new noisy long read correction method whose main features are (1) relying on a shortest path algorithm applied to a weighted alignment graph in order to find a corrected sequence that minimizes the edit score to the long read, and (2) extending the initial correction using unmapped mates of mapped short reads (so-called OEAs). Our experimental results suggest that CoLoRMap compares well with existing methods; in particular, long reads corrected by CoLoRMap can be mapped to the reference and used for downstream analysis better than long reads corrected by existing methods, while maintaining high accuracy.

The rationale behind the CoLoRMap algorithm is to combine the strengths of both consensus methods, such as proovread, and optimization-based methods, such as LoRDEC and Nanocorr. Like consensus methods, we rely on mapped reads, i.e., we correct regions using either a mapped read (the SP algorithm) or the mate of a mapped read (the OEA algorithm). However, as in LoRDEC, we also account for the global context of the short reads selected for correction through the optimization criterion of the SP algorithm.

bioRxiv preprint doi: https://doi.org/10.1101/519330. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.

Zhang et al. Page 9 of 18

Table 4.9: Comparison between hybrid error correction tools on the E. coli PacBio dataset (D1-P). The experiment was performed by Zhang et al. [177]; the table is reproduced from Table 2 of [177].

Method       #Reads  #Bases  #Aligned  #Aligned   Max length     N50  Genome     Alignment    CPU time   Wall time  Memory
                      (Mbp)    reads   bases(Mbp)       (bp)    (bp)  frac.(%)   id.(%)     (hh:mm:ss) (hh:mm:ss)   (GB)
Original      85460   748.0    82886     688.0        44113   13990   100.000   86.8763            -           -      -
Non-hybrid methods
FLAS          69327   632.3    68786     621.2        40117   13212   100.000   99.5959     09:47:50    00:56:45    4.9
LoRMA        330811   623.3   330715     623.0        22499    2441   100.000   99.6814     45:24:49    02:10:36   67.2
Canu           9283   168.1     9193     166.7        39693   20391   100.000   99.6970     07:47:33    00:27:14    6.0
Short-read-assembly-based methods
HG-CoLoR          -       -        -         -            -       -         -         -            -           -      -
FMLRC         85260   706.5    83320     669.9        44084   13364   100.000   99.6983     03:05:06    00:30:07    9.8
HALC          85256   711.1    84030     661.7        44117   13399   100.000   99.4374     60:41:59    16:02:32   30.2
Jabba         77508   620.2    77508     619.7        41342   12557    99.258   99.9624     02:05:09    00:12:01   37.0
LoRDEC        85324   716.9    83507     665.9        44311   13491   100.000   98.4149     15:03:42    00:40:05    2.0
ECTools       55687   577.4    55687     575.7        39772   13583   100.000   99.8592     11:25:22    00:29:49    8.2
Short-read-alignment-based methods
Hercules          -       -        -         -            -       -         -         -    >72:00:00           -      -
CoLoRMap      85674   730.7    83765     678.6        44113   13641   100.000   95.2930     31:35:16    02:53:33   34.9
Nanocorr      73368   504.9    73316     493.1        41079   10796   100.000   98.3257   1862:59:19    70:57:19   15.1
proovread     85367   720.2    83142     665.7        44113   13524   100.000   96.7250     71:17:14    12:21:53   53.9
LSC               -       -        -         -            -       -         -         -    >72:00:00           -      -
Note: HG-CoLoR reported an error when correcting this dataset.



Table 4.10: Comparison between hybrid error correction tools on the E. coli Oxford Nanopore dataset (D1-O). The experiment was performed by Zhang et al. [177]; the table is reproduced from Table 3 of [177].

Method       #Reads  #Bases  #Aligned  #Aligned   Max length     N50  Genome     Alignment    CPU time   Wall time  Memory
                      (Mbp)    reads   bases(Mbp)       (bp)    (bp)  frac.(%)   id.(%)     (hh:mm:ss) (hh:mm:ss)   (GB)
Original     163747  1481.5   163386    1454.4       131969   14895   100.000   81.3559            -           -      -
Non-hybrid methods
FLAS         138472  1401.3   138458    1392.9       130497   14748    99.997   93.0176     20:27:50    01:56:52    8.0
LoRMA        595072  1433.5   595051    1432.5        31743    3333    99.924   96.6525    182:14:17    07:30:30   77.8
Canu          19335   226.2    19326     225.0       133168   38034    99.953   94.5969     17:14:11    00:50:04    6.7
Short-read-assembly-based methods
HG-CoLoR     159856  1540.7   159854    1518.1       138002   15744   100.000   98.1308    231:20:30    44:41:19   13.8
FMLRC        163749  1555.4   163593    1546.3       137960   15687   100.000   99.6423     05:50:54    00:32:27    3.3
HALC              -       -        -         -            -       -         -         -    >72:00:00           -      -
Jabba        162970  1287.0   162970    1286.1        93923   12795    99.515   99.9557     02:51:05    00:10:33   37.1
LoRDEC       163838  1555.5   163722    1530.1       137887   15664   100.000   98.9920     32:35:27    01:12:37    2.2
ECTools      116868  1431.7   116868    1428.2       137863   16354   100.000   99.8116     19:44:40    00:46:51    8.1
Short-read-alignment-based methods
Hercules          -       -        -         -            -       -         -         -    >72:00:00           -      -
CoLoRMap     164072  1518.3   163782    1495.7       134302   15180   100.000   89.2049     32:55:26    04:01:18   35.5
Nanocorr          -       -        -         -            -       -         -         -    >72:00:00           -      -
proovread    163815  1514.0   163481    1489.1       135798   15222   100.000   89.2071    104:33:09    18:35:46   47.8
LSC               -       -        -         -            -       -         -         -    >72:00:00           -      -

Zhang et al. Page 10 of 18

Table 4.11: Comparison between hybrid error correction tools on the yeast PacBio dataset (D2-P). The experiment was performed by Zhang et al. [177]; the table is reproduced from Table 4 of [177].

Method       #Reads  #Bases  #Aligned  #Aligned   Max length     N50  Genome     Alignment    CPU time   Wall time  Memory
                      (Mbp)    reads   bases(Mbp)       (bp)    (bp)  frac.(%)   id.(%)     (hh:mm:ss) (hh:mm:ss)   (GB)
Original     239408  1462.7   235620    1332.6        35196    8656    99.976   87.2637            -           -      -
Non-hybrid methods
FLAS         173187  1093.2   173046    1078.8        30046    8132    99.976   99.5777     11:46:31    01:15:40    7.9
LoRMA        650467  1142.0   650333    1141.4        18127    2323    99.951   99.7583    172:24:38    07:03:03   72.9
Canu          38228   453.2    38172     446.7        28748   12021    99.975   99.5864     15:18:34    00:50:12    6.5
Short-read-assembly-based methods
HG-CoLoR          -       -        -         -            -       -         -         -            -           -      -
FMLRC        238706  1380.8   236883    1311.0        33658    8185    99.977   99.3889     07:52:17    00:28:55    5.5
HALC         238787  1395.4   238097    1287.6        34785    8270    99.976   99.0796     52:12:11    09:45:10   29.0
Jabba        202980  1087.2   202879    1086.6        30141    7847    95.627   99.9832     00:38:30    00:04:57   21.4
LoRDEC       238847  1405.0   237278    1297.1        34896    8326    99.978   97.9568     01:10:03    00:57:17    1.9
ECTools      130863   946.9   130832     943.1        28749    8412    99.810   99.7712    938:25:28    58:25:00    4.3
Short-read-alignment-based methods
Hercules     239389  1460.3   235630    1330.4        35196    8644    99.976   87.6711     87:53:55    03:18:41  247.8
CoLoRMap     239309  1429.6   237135    1321.3        34850    8409    99.976   96.3912     18:44:48    03:07:34   37.3
Nanocorr          -       -        -         -            -       -         -         -    >72:00:00           -      -
proovread    238992  1412.4   236519    1298.0        35122    8369    99.978   97.9568    184:02:07    23:45:37   47.9
LSC               -       -        -         -            -       -         -         -    >72:00:00           -      -
Note: HG-CoLoR reported an error when correcting this dataset.



Table 4.12: Comparison between hybrid error correction tools on the yeast Oxford Nanopore dataset (D2-O). The experiment was performed by Zhang et al. [177]; the table is reproduced from Table 5 of [177].

Method       #Reads  #Bases  #Aligned  #Aligned   Max length     N50  Genome     Alignment    CPU time   Wall time  Memory
                      (Mbp)    reads   bases(Mbp)       (bp)    (bp)  frac.(%)   id.(%)     (hh:mm:ss) (hh:mm:ss)   (GB)
Original     118723   715.7   108463     638.1        55374    7003    99.976   86.1986            -           -      -
Non-hybrid methods
FLAS          95606   585.6    95290     581.5        26592    6893    99.940   97.1699     07:42:10    07:42:10    4.4
LoRMA        398863   497.0   398350     495.2        16027    1439    99.485   98.4024     68:02:36    02:55:05   68.8
Canu          64829   475.1    64649     475.1        26895    7518    99.914   97.7710     12:31:04    00:37:53    9.0
Short-read-assembly-based methods
HG-CoLoR          -       -        -         -            -       -         -         -            -           -      -
FMLRC        118701   713.7   111869     666.4        55374    6990    99.975   99.2529     03:35:44    00:17:21    2.2
HALC         118707   718.2   114071     647.9        55379    7025    99.976   98.8884     50:11:58    04:03:18    3.6
Jabba         99044   536.9    98631     535.9        28194    6730    95.400   99.9809     00:55:32    00:04:20   21.5
LoRDEC       118727   720.8   110606     647.8        55375    7049    99.976   96.9369     11:22:09    00:26:13    2.1
ECTools       81105   531.9    80843     529.3        26810    7071    99.314   99.7697     09:31:32    20:17:33    5.6
Short-read-alignment-based methods
Hercules     118721   716.3   108467     638.9        55374    7008    99.976   87.2912    125:22:19    04:37:01  246.6
CoLoRMap     118774   722.0   108969     649.4        55374    7049    99.976   95.5851     11:01:38    01:34:52   27.8
Nanocorr          -       -        -         -            -       -         -         -    >72:00:00           -      -
proovread    118729   716.7   109057     643.4        55374    7007    99.976   96.3689     66:14:09    07:20:18   28.1
LSC               -       -        -         -            -       -         -         -    >72:00:00           -      -
Note: HG-CoLoR reported an error when correcting this dataset.

Zhang et al. Page 11 of 18

Table 4.13: Comparison between hybrid error correction tools on the fruit fly PacBio dataset (D3-P). The experiment was performed by Zhang et al. [177]; the table is reproduced from Table 6 of [177].

Method       #Reads   #Bases  #Aligned  #Aligned   Max length     N50  Genome     Alignment    CPU time   Wall time  Memory
                       (Mbp)    reads   bases(Mbp)       (bp)    (bp)  frac.(%)   id.(%)     (hh:mm:ss) (hh:mm:ss)   (GB)
Original    5366088  28797.8  1839681   16543.5        74735   15374    99.191   85.2734            -           -      -
Non-hybrid methods
FLAS        1435682  14585.2  1428018   13574.1        43556   13550    98.915   98.8363    271:44:27    36:30:42   53.1
LoRMA             -        -        -         -            -       -         -         -            -           -      -
Canu              -        -        -         -            -       -         -         -    >72:00:00           -      -
Short-read-assembly-based methods
HG-CoLoR          -        -        -         -            -       -         -         -            -           -      -
FMLRC       5246485  27354.6  2477890   16543.5        74735   14554    99.191   96.5284    327:37:22    13:49:04   31.2
HALC        4451474  21997.5  3434779   12793.3        74735   14349    99.178   96.8863    770:35:46    55:58:24   73.0
Jabba         35549    239.8    35505     239.1        37729   10461    65.616   99.9615    656:05:15    24:33:41  175.8
LoRDEC      5363998  28354.1  2056812   15636.9        74719   15078    99.200   92.2954   1011:52:27    36:19:18    5.9
ECTools           -        -        -         -            -       -         -         -    >72:00:00           -      -
Short-read-alignment-based methods
Hercules          -        -        -         -            -       -         -         -            -           -      -
CoLoRMap    5366107  28891.6  1841822   14976.8        74735   15442    99.189   83.2580    495:11:17    64:52:25  189.4
Nanocorr          -        -        -         -            -       -         -         -    >72:00:00           -      -
proovread         -        -        -         -            -       -         -         -    >72:00:00           -      -
LSC               -        -        -         -            -       -         -         -    >72:00:00           -      -
Note: LoRMA, HG-CoLoR, and Hercules reported errors when correcting this dataset.



Table 4.14: Comparison between hybrid error correction tools on the fruit fly Oxford Nanopore dataset (D3-O). The experiment was performed by Zhang et al. [177]; the table is reproduced from Table 7 of [177].

Method       #Reads  #Bases  #Aligned  #Aligned   Max length     N50  Genome     Alignment    CPU time   Wall time  Memory
                      (Mbp)    reads   bases(Mbp)       (bp)    (bp)  frac.(%)   id.(%)     (hh:mm:ss) (hh:mm:ss)   (GB)
Original     642255  4609.5   554083    3857.9       446050   11956    98.719   83.5921            -           -      -
Non-hybrid methods
FLAS         423097  3507.6   422206    3402.6        64365   11517    97.588   95.3301     23:04:50    03:12:50   10.8
LoRMA        703097   615.5   682288     592.3        32644     865    30.338   98.1230    666:37:35    25:52:14   92.8
Canu         430082  3415.6   421475    3220.2       254967   12090    97.592   96.3739     88:51:10    04:36:20   20.2
Short-read-assembly-based methods
HG-CoLoR          -       -        -         -            -       -         -         -            -           -      -
FMLRC        641945  4647.2   578290    3978.2       444605   12088    98.592   97.6010     47:45:17    03:06:05   31.2
HALC         643002  4668.5   611191    3955.7       451284   12115    98.616   97.6634    126:30:01    05:43:37   42.4
Jabba        494546  2878.2   494430    2876.3        72501    9305    83.166   99.9745    175:19:34    06:56:29  136.8
LoRDEC       642882  4655.9   567878    3921.1       447726   12079    98.691   94.0382    152:05:32    05:38:05    5.7
ECTools           -       -        -         -            -       -         -         -    >72:00:00           -      -
Short-read-alignment-based methods
Hercules     642287  4612.8   554630    3859.4       449799   11966    98.713   83.9340    398:10:17    17:32:36  247.7
CoLoRMap     649041  4692.1   565881    3963.8       442948   12050    98.715   94.3361    160:00:22    16:07:18   57.3
Nanocorr          -       -        -         -            -       -         -         -    >72:00:00           -      -
proovread         -       -        -         -            -       -         -         -    >72:00:00           -      -
LSC               -       -        -         -            -       -         -         -    >72:00:00           -      -
Note: HG-CoLoR reported an error when correcting this dataset.

Chapter 5

Hybrid assembly of long reads

Long reads generated by single-molecule sequencing (SMS) technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies have revolutionized the landscape of de novo genome assembly. While SMS long reads have a higher error rate than short reads generated by next-generation sequencing (NGS) technologies such as Illumina, they have been shown to yield accurate assemblies given sufficient coverage. Indeed, the length of SMS long reads enables the resolution of many short- and mid-range repeats that are problematic when assembling genomes from short reads. Recent advances in sequencing ultra-long Oxford Nanopore reads have brought us closer than ever to the complete reconstruction of entire genomes, including difficult-to-assemble regions such as centromeres and telomeres [117]. Similarly, High-Fidelity (HiFi) PacBio reads have been shown to improve the contiguity and accuracy of assemblies in complex regions of the human genome [161]. These advances toward more accurate and complete genome assembly could not have been achieved without the recent development of assemblers specifically tailored for long reads. These tools assemble long reads either after an error correction step [85, 25] or directly, without any prior error correction [97, 140, 81].
Although long reads are becoming more widely used for de novo genome assembly, hybrid approaches (which utilize a complementary short read dataset) remain popular for several reasons: (i) short reads have higher accuracy and can be generated by Illumina sequencers at high throughput for a lower cost; (ii) many short read datasets are already publicly available for a large number of genomes; (iii) for some basic tasks such as variant calling (SNV and short indel detection), short reads still provide better resolution due to their high accuracy, which often motivates researchers to generate short reads even when long reads are at hand; and (iv) unlike PacBio assemblies, whose accuracy increases with the depth of coverage thanks to their unbiased random error model [124], constructing reference-quality genomes solely from Oxford Nanopore reads remains challenging due to biases in base calling, even at high coverage [85, 4]. As a result, hybrid assembly approaches are still useful [71, 73, 74].

Hybrid approaches for de novo genome assembly can be classified into three groups: (i) methods that first correct raw long reads using short reads and then build contigs using corrected long reads only (e.g. PBcR [84] and MaSuRCA [181]); (ii) methods that first assemble raw long reads and then correct/polish the resulting draft assembly with short reads using polishing tools such as Pilon [163] and Racon [159]; and (iii) methods that first assemble short reads and then utilize long reads to generate longer contigs (e.g. hybridSPAdes [4], Unicycler [169], DBG2OLC [174], and Wengan [35]). PBcR and MaSuRCA correct long reads using their internal correction algorithm and then employ CABOG [119] (Celera Assembler with the Best Overlap Graph) for assembling corrected long reads. hybridSPAdes and Unicycler are similar in design. Both of these tools first use SPAdes [7], which takes short reads as input and generates an assembly graph, a data structure in which multiple copies of a genome segment are collapsed into a single contig (see [176] for more details). This data structure also records connections between subsequent contigs such that every region of the genome corresponds to a path in the graph. hybridSPAdes and Unicycler then align long reads to this assembly graph in order to resolve ambiguities and generate longer contigs. On the other hand, DBG2OLC first assembles contigs from short reads and maps them onto raw long reads to get a compressed representation of long reads based on short read contig identifiers, and then applies an overlap-layout-consensus (OLC) approach on these compressed long reads to assemble the genome. Since compressed long reads are much shorter compared to raw long reads, building an overlap graph from them is quicker than building it from raw long reads, due to the faster pairwise alignment.
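The compression idea used by DBG2OLC can be illustrated with a toy sketch (this illustrates the idea only, not DBG2OLC's actual code; the (contig_id, read_start, read_end) alignment tuples are hypothetical):

```python
# Illustrative sketch: compress a long read into the ordered list of
# short-read contig IDs that map onto it. Downstream overlap computation
# then works on these short ID sequences instead of raw bases.

def compress_long_read(alignments):
    """Return contig IDs ordered by their start position on the long read."""
    return [cid for cid, start, end in sorted(alignments, key=lambda a: a[1])]

# A toy long read covered by three short-read contigs:
alns = [("c7", 4100, 6000), ("c2", 0, 1500), ("c5", 1400, 4200)]
print(compress_long_read(alns))  # ['c2', 'c5', 'c7']
```

Two long reads sharing the ID subsequence ['c2', 'c5'] would then be considered overlapping, which is far cheaper to detect than base-level pairwise alignment.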
Finally, the more recent tool, Wengan, assembles short reads and then builds multiple synthetic paired-read libraries of different insert sizes from long read sequences. These synthetic paired reads are then aligned to short read contigs, and a scaffolding graph is built from the resulting alignments. In the end, the final assembly is generated by traversing proper paths of the scaffolding graph. A more detailed overview of each tool, together with some other non-hybrid assemblers, is provided in Section 2.5. Among the above tools, hybridSPAdes and Unicycler have been designed specifically for bacterial and small eukaryotic genomes and do not scale to the assembly of large genomes. PBcR, MaSuRCA, DBG2OLC, and Wengan are the only hybrid assemblers capable of assembling large genomes, such as the human genome. However, for mammalian genomes, PBcR and MaSuRCA require substantial computational time and cannot be used without a computing cluster. DBG2OLC is faster due to its use of compressed long reads. Wengan is also a fast assembler and can assemble large genomes in a reasonable time. In this chapter, we introduce HASLR, a fast hybrid assembler that is capable of assembling large genomes. Similar to hybridSPAdes, Unicycler, and Wengan, HASLR builds short read contigs using a fast short read assembler (namely Minia). Then it builds a novel data structure called the backbone graph to put short read contigs in the order in which they are expected to appear in the genome and to fill the gaps between them using the consensus of long reads.

Based on our results, HASLR is the fastest among all the assemblers we tested, while generating the lowest number of misassemblies. Furthermore, it generates assemblies that are comparable to those of the best performing tools in terms of contiguity and accuracy. HASLR is also capable of assembling large genomes using less time and memory than other tools.

5.1 Methods

The input to HASLR is a set of long reads and a set of short reads from the same sample, together with an estimation of the genome size. HASLR performs the assembly using a novel approach that rapidly assembles the genome without performing all-vs-all long read alignments. The core of HASLR is to first assemble contigs from short reads using an efficient short read assembler and then to use long reads to find sequences of such contigs that represent the backbone of the sequenced genome.

5.1.1 Obtaining unique short read contigs

HASLR starts by assembling short reads into a set of short read contigs, denoted by C. Assembly of short reads is a well-studied topic, and many efficient tools have been specifically designed for that purpose. These tools use either a de Bruijn graph [150, 23] or an OLC strategy (based on an overlap graph or a string graph) [149, 120] to assemble the genome by finding “proper” paths in these graphs. Next, HASLR identifies a set U of unique contigs, those short read contigs that are likely to appear in the genome only once. In order to do this, for every short read contig, ci, the mean k-mer frequency, f(ci), is computed as the average k-mer count of all k-mers present in ci. Note that the value of f(ci) is proportional to the depth of coverage of ci. Assuming longer contigs are more likely to come from unique regions, their mean k-mer frequency can be a good indicator for identifying unique contigs. Let LCq ⊆ C be the set of q longest short read contigs in C, and favg, fstd be the average and standard deviation of {f(c) | c ∈ LCq}.

Then, the set of unique contigs is defined as U = {u | u ∈ C and f(u) ≤ favg + 3fstd}. Our empirical results show that this approach can identify unique contigs with high precision and recall. In order to measure the efficacy of this approach for identifying unique contigs, we conducted a set of experiments as follows. First, we simulated a short read dataset based on six different reference genomes: Escherichia coli, Saccharomyces cerevisiae (yeast), Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster, and human GRCh38. For each genome, we used ART [66] to simulate 50× coverage short Illumina reads (2×100 bp long, 500 bp insert size mean, and 50 bp insert size deviation) using the Illumina HiSeq 2000 error model. Next, we used Minia to assemble the simulated short reads using k-mer size 49. Finally, to form the ground truth for the copy count of each

short read contig, we mapped the assembled short read contigs to the reference genome using minimap2 [98]. Here, we report the precision and recall of the above-mentioned approach in identifying unique contigs. For each dataset, we evaluate the performance of our approach in identifying unique contigs longer than a given threshold. The length threshold used to discard small contigs varies from 100 to 1000 bp in steps of 100 bp. As can be seen in Figure 5.1, the precision of the identified unique contigs is always high regardless of the length threshold. In addition, in all the experiments, a large jump in recall is observed at a length threshold of 300 bp. The results of this experiment show that the proposed approach identifies unique contigs with high precision and recall.
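The selection rule above can be sketched as follows (a simplified illustration, not HASLR's actual implementation; contig lengths and mean k-mer frequencies are assumed precomputed, and the toy values below are made up):

```python
import statistics

def mean_kmer_freq(contig, kmer_counts, k):
    """f(c): average count, over all k-mers of the contig, taken from a
    precomputed k-mer count table (here a plain dict)."""
    kmers = [contig[i:i + k] for i in range(len(contig) - k + 1)]
    return sum(kmer_counts[km] for km in kmers) / len(kmers)

def unique_contigs(lengths, freqs, q):
    """U = {u in C : f(u) <= f_avg + 3 * f_std}, where f_avg and f_std are
    the mean and standard deviation of f over the q longest contigs.
    `lengths` and `freqs` are dicts mapping contig id -> length / f(c)."""
    longest = sorted(lengths, key=lengths.get, reverse=True)[:q]
    vals = [freqs[c] for c in longest]
    favg, fstd = statistics.mean(vals), statistics.pstdev(vals)
    return {c for c in freqs if freqs[c] <= favg + 3 * fstd}

# Toy example: c4 has ~3x the coverage of the long contigs, i.e. repetitive.
lengths = {"c1": 90_000, "c2": 80_000, "c3": 70_000, "c4": 2_000}
freqs = {"c1": 48.0, "c2": 51.0, "c3": 50.0, "c4": 160.0}
print(unique_contigs(lengths, freqs, q=3))  # {'c1', 'c2', 'c3'}
```

Restricting f_avg and f_std to the q longest contigs is what makes the threshold robust: long contigs are unlikely to be repetitive, so their mean k-mer frequencies estimate the single-copy coverage.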

5.1.2 Construction of backbone graph

The backbone graph encodes potential adjacencies between unique contigs and thus presents a large-scale map of the genome, albeit, with some level of ambiguity. Using the backbone graph, HASLR finds paths of unique contigs representing their relative order and orientation in the sequenced genome. These paths are later transformed into the assembly.

Formally, given a set of unique contigs, U = {u1, u2, . . . , u|U|}, and a set of long reads,

L = {l1, l2, . . . , l|L|}, HASLR builds the backbone graph BBG as follows. First, unique contigs are aligned against long reads. Each alignment can be encoded by a 7-tuple ⟨rbeg, rend, uid, ustrand, ubeg, uend, nmatch⟩ whose elements respectively denote the start and end positions of the alignment on the long read, the index of the unique contig in U, the strand of the alignment (+ or −), the start and end positions of the alignment on the unique contig, and the number of matched bases in the alignment. Let Ai = (a_1^i, a_2^i, . . . , a_|Ai|^i) be the list of alignments of unique contigs to li, sorted by rend.

Note that alignments in Ai may overlap due to relaxed alignment parameters in order to account for the high sequencing error rate of long reads. Thus, in the next step, we aim to select a subset of non-overlapping alignments whose total identity score – defined as the sum of the number of matched bases – is maximal. Let Si(j) be the maximal total identity score achievable using a non-overlapping subset of the first j alignments. Si(j) can be calculated using the following dynamic programming formulation:

    Si(j) = 0                                                      if j = 0
    Si(j) = max{ Si(j − 1), Si(prev(j)) + a_j^i[nmatch] }          otherwise        (5.1)

where prev(j) is the largest index z < j such that a_j^i and a_z^i are non-overlapping alignments. By calculating Si(|Ai|) and backtracking, we obtain a sorted sub-list Ri = (r_1^i, r_2^i, . . . , r_|Ri|^i) of non-overlapping alignments with maximal total identity score,


Figure 5.1: Precision and recall results in identification of unique short read contigs on 6 different reference genomes. Precision is shown with blue dots and recall is shown with orange dots. Precision is always high across the different experiments and in all the experiments a big jump in recall happens at length threshold of 300.

which we call the compact representation of read li. Note that since the input list is sorted, prev(.) can be calculated in logarithmic time, which makes the time complexity of this dynamic programming O(|Ai| log |Ai|). The backbone graph is a directed graph BBG = (V, E). The set of nodes is defined as V = {u_j^+, u_j^− | 1 ≤ j ≤ |U|} where u_j^+ and u_j^− represent the forward and reverse strand of

72 {1} + + 1 2 1 2 1 − − 1 2 {1}

{ } {2} + 2 + 3 4 3 4 2 − − 3 4

+ + 5 6 5 6 3 − − 5 6 {3} {3}

{4} + + 7 8 7 8 4 − − 7 8 {4}

Figure 5.2: Possible orientations of aligning two unique contigs to a long read. The direction of contigs aligned to long reads shows the strand of their corresponding sequence. These directions guide us to find the proper edge type. The set of long reads supporting each edge is shown as its label.

the unique contig uj, respectively. The set of edges is defined as the oriented adjacencies between unique contigs implied by the compact representations of long reads. Formally, each edge is represented by a triplet (uh, ut, supp) where uh, ut ∈ V and supp is the set of indices of long reads supporting the adjacency between uh and ut; these triplets are obtained as follows:

    E = ∪_{1 ≤ i ≤ |L|, 1 ≤ j < |Ri|} { (u_h^hs, u_t^ts, {i}), (u_t^REV(ts), u_h^REV(hs), {i}) }

where h = r_j^i[uid], hs = r_j^i[ustrand], t = r_{j+1}^i[uid], ts = r_{j+1}^i[ustrand], REV(+) = −, and REV(−) = +. Figure 5.2 illustrates the construction of the backbone graph edges for several combinations of unique contig alignments on long reads. At the end of this stage, the resulting backbone graph is a multi-graph as there can be multiple edges between two nodes with different supp. In order to make it easier to process the backbone graph, we convert it into a simple graph by merging supp of all edges between every pair of nodes into a set of supporting long reads. In practice, the backbone graph can be built with a single pass over all the unique short read contigs aligned to every long read. Therefore, the time complexity of backbone graph construction is O(n), where n = Σ_{i=1}^{|L|} |Ai|.

Figure 5.3: Examples of tips and bubbles in the backbone graph: (a) a tip; (b) a bubble. Here the backbone graph is visualized using Bandage [171].

5.1.3 Graph cleaning and simplification

Ideally, with accurate identification of unique contigs and correct alignment of unique contigs onto long reads, the backbone graph for a haploid genome will consist of a set of connected components, each of which is a simple path of nodes. In practice, this ideal case does not happen – mainly due to sequencing errors, incorrect unique contig-to-long read alignments, and chimeric reads. As a result, some artifactual branches might exist in the backbone graph, forming structures known as tips and bubbles. Tips are dead-end simple paths whose length is small compared to their parallel paths. Bubbles are formed when two disjoint simple paths of similar length occur between the same two nodes. Figure 5.3 shows examples of tips and bubbles in our backbone graph. We clean the backbone graph BBG in two stages. First, in order to reduce the effect of incorrect unique contig-to-long read alignments, we remove all edges e such that |e[supp]| < minSupp, for a given parameter minSupp. In our implementation, this can be done in O(|E|) time, where |E| is the number of edges of the backbone graph. Second, the graph is simplified using tip and bubble removal algorithms. There exist well-known algorithms for these tasks that are commonly used in assemblers [176, 7, 120]. Note that our tip and bubble removal procedures require an estimate of the length of simple paths. Such an estimate can be obtained from the lengths of the unique contigs corresponding to the nodes contained in a simple path, as well as the average length of all long read subsequences supporting edges between consecutive nodes.

74 Estimation of length and coverage for simple paths

In order to perform tip and bubble removal, HASLR requires an estimate of the length and coverage of each simple path. Here, we explain how this estimate is calculated. For each unique contig in a simple path, we can calculate the coordinates of the region that is aligned to all long reads (we refer to this region as the shared region). Since the lengths of the shared regions corresponding to all unique contigs are known, we only need to estimate the middle regions (between two consecutive shared regions). To do this, for each long read supporting the edge connecting two unique contigs, we calculate the length of the LR subsequence that falls between the shared regions, using the alignment's CIGAR string (see Figure 5.4 for a toy example of an edge and its supporting long reads). We use the average length of all these subsequences as the estimate for the region between shared regions. Finally, the length of the simple path can be estimated as the sum of the lengths of all shared regions plus the estimated lengths of all middle regions. In addition, the coverage of each simple path can be calculated based on the number of long reads supporting each edge, as well as the estimated length of the middle regions between two consecutive shared regions. In terms of time complexity, in the worst case, this step requires going over the CIGAR strings of all alignments of unique contigs onto long reads. Thus, the time complexity is bounded by O(g · c), where g is the estimated size of the genome and c is the coverage of long reads used by HASLR.
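The length estimation described above can be sketched as follows, assuming the shared-region lengths and the per-edge lists of middle subsequence lengths (already extracted from the CIGAR strings) are given as inputs; this input layout is illustrative only.

```python
def estimate_path_length(shared_lengths, middle_subseq_lengths):
    """Estimate a simple path's length: the sum of the shared regions
    (contig segments aligned to the long reads) plus, for each edge,
    the average length of the long read subsequences spanning the gap.

    shared_lengths: one length per unique contig on the path.
    middle_subseq_lengths: one list per edge, holding the lengths of
    the supporting long reads' gap subsequences (hypothetical layout).
    """
    # average the supporting subsequence lengths for each middle region
    middles = [sum(ls) / len(ls) for ls in middle_subseq_lengths]
    return sum(shared_lengths) + sum(middles)
```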

Bubble removal

On a haploid genome, if our identification of unique short read contigs is accurate, bubbles are caused only by incorrect alignment of unique contigs in the middle of LRs. In this case, the bubble is usually formed by two simple paths with the same length, one of which has significantly lower coverage. In contrast, in diploid genomes, it is possible to have natural bubbles corresponding to heterozygous regions of the genome. The main characteristic of such bubbles is having similar coverage on the two paths forming the bubble. If the region contains a heterozygous insertion or deletion, the lengths of the two simple paths forming the bubble are different. On the other hand, if the region contains an inversion, the two paths have the same length. Therefore, the lengths of the two paths forming a bubble are not a good criterion for the identification of artifactual bubbles. This means decision making should be based solely on the coverage of the two paths. Currently, HASLR uses the bubble detection algorithm proposed by Onodera et al. [129], which is based on topological sorting. This algorithm has a worst-case time complexity of O(|U|(|U| + e)), but its average-case complexity is O(|U| + e), where e is the

75 number of edges. We can replace each bubble with the best path of the bubble, which can be detected via backtracking.

Tip removal

Tips are mainly caused by the incorrect alignment of unique contigs at the extremities of LRs. As a result, the simple path causing the tip is expected to have a small length. In addition, the coverage of such a simple path is usually much lower than that of other simple paths. In our implementation, a simple path is considered a tip if (i) it is a dead-end (only one end is connected to other nodes) and (ii) it contains fewer than 3 unique contigs. Based on our observations, most tips are dead-end simple paths that contain only a single unique contig. HASLR removes tips of the backbone graph in a single pass over all the nodes of the graph. Therefore, the time complexity of this step is O(|U|). In the remainder of this chapter, the cleaned and simplified backbone graph is denoted by G.
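The tip criterion above can be sketched as follows. The path/adjacency representation here is hypothetical (HASLR operates directly on the backbone graph); a path is treated as a dead-end when exactly one of its two ends has neighbours.

```python
def remove_tips(paths, adjacency, max_contigs=3):
    """Drop simple paths that look like tips: dead-ends containing
    fewer than max_contigs unique contigs.

    paths: path id -> list of unique contig nodes on that path.
    adjacency: path id -> (neighbours at left end, neighbours at right
    end); both are sets of path ids (hypothetical representation).
    """
    kept = {}
    for pid, nodes in paths.items():
        left, right = adjacency.get(pid, (set(), set()))
        # dead-end: exactly one end is connected to other nodes;
        # isolated paths (whole components) are never treated as tips
        dead_end = bool(left) != bool(right)
        if dead_end and len(nodes) < max_contigs:
            continue  # tip: discard
        kept[pid] = nodes
    return kept
```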

5.1.4 Generating the assembly

The principle behind the construction of the assembly is that each simple path in the cleaned backbone graph G is used to define a contig of this assembly. Suppose

P = (v1, e12, v2, e23, v3, . . . , vn) is a simple path of G. Although we already have the DNA sequence for each unique contig corresponding to each node vi, the DNA sequence of the resulting contig cannot be obtained immediately. This is due to the fact that at this stage the subsequence between vi and vi+1 is unknown for each 1 ≤ i < n. Here, we explain how these missing subsequences are reconstructed.

For simplicity, suppose we would like to obtain the subsequence between the pair v1 and v2 in P. Note that by construction, e12[supp] contains all long reads supporting e12. We can extract a compact representation of all those long reads and align them to P using the Longest Common Subsequence (LCS) dynamic programming algorithm forbidding mismatches (only gaps are allowed). We implemented this LCS algorithm in a way that takes into account the strand of unique contigs in P (recall that u_j^+ and u_j^− correspond to the forward and reverse strand of uj, respectively). At this point, we can extract the subsequence between v1 and v2 from each long read in e12[supp]. To do this, we find the regions of the unique contigs corresponding to v1 and v2 that are aligned to all long reads in e12[supp]. Using the alignment transcript (i.e., the CIGAR string), the unaligned coordinates of each long read are calculated (see Figure 5.4 for a toy example). By computing the consensus sequence of the extracted subsequences, we obtain cns12. Therefore, the DNA sequence corresponding to P can be obtained via CONCAT(u1, cns12, u2, cns23, u3, . . . , un), where CONCAT(.) returns the concatenated DNA sequence of all its arguments. In order to generate the assembly, HASLR extracts all the simple paths in the cleaned backbone graph G and constructs the corresponding contig for each of them as explained


above. It is important to note that each simple path P has a twin path P′, which corresponds to the reverse complement of the contig generated from P. Therefore, during our simple path extraction procedure, we ensure not to use twin paths to avoid redundancy.

Figure 5.4: Example of an edge in the backbone graph and its corresponding long read alignments. Partial Order Alignment (POA) is used in constructing the consensus sequence (see subsection 5.1.5).
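The CONCAT step can be sketched as follows. Computing each gap consensus with POA is delegated to SPOA in the actual tool, so here the consensus sequences are assumed to be given as inputs.

```python
def generate_contig(path_contigs, gap_consensuses):
    """Build a contig from a simple path by interleaving the unique
    contig sequences with the consensus sequence reconstructed for each
    gap (the CONCAT operation in the text).

    path_contigs: DNA sequences of the unique contigs along the path,
    already oriented according to their strand in the path.
    gap_consensuses: gap_consensuses[i] is the consensus (e.g. from
    POA) of the long read subsequences between path_contigs[i] and
    path_contigs[i+1].
    """
    assert len(gap_consensuses) == len(path_contigs) - 1
    pieces = []
    for i, seq in enumerate(path_contigs):
        pieces.append(seq)
        if i < len(gap_consensuses):
            pieces.append(gap_consensuses[i])
    return "".join(pieces)
```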

5.1.5 Methodological remarks

Rationale for using unique short read contigs

Here, we clarify the motivation for choosing only unique short read contigs as the nodes of the backbone graph. Repetitive genomic regions cause complexities in assembly graphs, and the same complexity is reflected in our backbone graph. Repetitive short read contigs would cause branching in the backbone graph, and in fact, building the backbone graph using all short read contigs could result in a very tangled graph. Figure 5.5 illustrates the difference between a backbone graph built from all short read contigs and one built only from unique short read contigs, on a yeast genome. As can be seen, using only unique short read contigs for building the backbone graph resolves many of the complexities and ambiguities in the graph. However, it is important to note that excluding non-unique short read contigs could potentially result in a more fragmented graph (some chromosomes are split into multiple paths rather than a single one) and assembly.

Backbone graph vs. assembly graph

It is important to note that the backbone graph is not an assembly graph per se, for two reasons. First, the regions between each pair of connected unique short read contigs are not present in the graph. These missing regions are obtained by calculating the consensus of long read subsequences between each pair of unique short read contigs. Second, unlike assembly

graphs, some segments of the genome cannot be translated to a path in the backbone graph. This is due to the potential fragmentation that was mentioned earlier.

Figure 5.5: Two backbone graphs built from a real PacBio dataset sequenced from a yeast genome. Each graph is visualized with Bandage [171] and colored using its rainbow coloring feature. Each chromosome is colored with a full rainbow spectrum. (Left) Tangled graph built from all short read contigs. (Right) Untangled graph built from unique short read contigs.

Implementation details

(i) HASLR utilizes a short read assembler to build its initial short read contigs; a higher quality assembly with fewer misassemblies is preferred. For this purpose, HASLR utilizes Minia [23] to assemble short reads into short read contigs. Based on our experiments, Minia can quickly generate a high-quality assembly with a small memory footprint. (ii) For finding unique contigs, HASLR calculates mean k-mer frequencies with a small value of k (default k = 49). This information can be easily obtained by performing k-mer counting on the short read dataset (for example, using KMC [80]) and calculating the average k-mer count of all k-mers present in each short read contig. Nevertheless, assemblers usually provide such information automatically (e.g., Minia and SPAdes); HASLR takes the k-mer frequencies reported by Minia for this task. (iii) HASLR uses only the longest 25× coverage of long reads for building the backbone graph, which is extracted based on the given expected genome size. (iv) In order to align unique contigs to long reads, HASLR employs minimap2 [98]. (v) Graph cleaning is done with minSupp = 3, meaning that any edge supported by fewer than 3 long reads is discarded. (vi) Finally, consensus sequences are obtained using the Partial Order Alignment (POA) algorithm [92, 91] implemented in the

SPOA package [159]. We have provided the versions of the tools and the parameters that are used to execute them in Appendix C.2 and C.3, respectively.

5.2 Results

We evaluated the performance of HASLR on both simulated and real datasets. We selected five hybrid assemblers, hybridSPAdes [4], Unicycler [169], DBG2OLC [174], MaSuRCA [181] and Wengan [35], as well as two non-hybrid methods, Canu [85] and wtdbg2 [140]. All experiments were performed on isolated nodes of a cluster (i.e., no other simultaneous jobs were allowed on each node). Each node runs CentOS 7 and is equipped with 32-core Intel(R) Xeon(R) Gold 6130 @ 2.10GHz processors (2 threads per core; a total of 64 logical CPUs) and 720 GB of memory. Each tool was run with its recommended settings. See Appendices C.2 and C.3 for more details about the versions of the tools and the employed commands. Note that for wtdbg2, we used the provided wtdbg2.pl wrapper, which automatically performs a polishing step using the embedded polishing module. For each experiment, assemblies were evaluated by comparing them against their corresponding reference genome using QUAST [118]. QUAST reports a wide range of assembly statistics, but we are mostly interested in misassemblies, NGA50, and the rate of small errors (mismatches and indels). QUAST detects and reports a misassembly when a contig cannot align to the reference genome as a single continuous piece; misassemblies indicate structural assembly errors. For computing NGA50, unlike N50 and NG50, only the segments of assembled contigs that align to the reference genome are considered. In addition, QUAST breaks contigs with extensive misassemblies before the calculation of NGA50. Therefore, NGA50 is a good indicator of the contiguity of the assembly while taking misassemblies into consideration.
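For reference, NGA50 can be computed from the lengths of the aligned, misassembly-broken contig blocks. A minimal sketch of the definition, not QUAST's actual implementation:

```python
def nga50(aligned_block_lengths, genome_size):
    """NGA50: the length L such that aligned blocks of length >= L
    cover at least half of the *genome* (not the assembly). The blocks
    are assumed to be contig pieces already broken at misassemblies
    and restricted to their reference-aligned segments."""
    total = 0
    for length in sorted(aligned_block_lengths, reverse=True):
        total += length
        if total >= genome_size / 2:
            return length
    return 0  # the aligned assembly covers less than half of the genome
```

Note how the same set of blocks yields a smaller NGA50 for a larger genome, since more of the genome must be covered before the threshold is reached.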

5.2.1 Experiment on simulated dataset

We evaluated all the selected methods on 4 simulated datasets, namely E. coli, yeast, C. elegans, and human, to provide a wide range of genome sizes and complexities. For each genome, we used ART [66] to simulate 50× coverage short Illumina reads (2×150 bp long, 500 bp insert size mean, and 50 bp insert size deviation) using the Illumina HiSeq 2000 error model. We also simulated 50× coverage long PacBio reads using PBSIM [128]. In order to capture the characteristics of real datasets, a set of PacBio reads generated from a human genome (see Appendix C.1.1 for details) with P6-C4 chemistry was passed to PBSIM via option --sample-fastq. This enables PBSIM to sample the read length and error model from the real long reads. Table 5.1 shows the QUAST metrics calculated for assemblies generated by different tools. As can be seen, HASLR generates assemblies with the lowest number of misassemblies in all datasets. It is important to note that since reads are simulated from

the same reference used for this assessment, any misassembly reported by QUAST is indeed a structural assembly mistake. In terms of contiguity, HASLR achieves an NGA50 on par with the other tools for all datasets except C. elegans, where Canu shows an NGA50 twice as large as the other tools. On the human dataset, HASLR generates the most contiguous assembly with an NGA50 of 17.03 Mb and only two extensive misassemblies, although at the price of a lower genome fraction (see Discussion). In addition, HASLR is the fastest assembler across the board. wtdbg2 has a comparable speed but generates lower quality assemblies, both in terms of misassemblies and mismatch/indel rate. It is particularly interesting to compare HASLR with hybridSPAdes, Unicycler, and Wengan, since they share a similar design in that they connect short read contigs rather than explicitly assembling long reads. In addition, Wengan uses short read contigs generated by Minia, similar to HASLR. hybridSPAdes and Unicycler do not scale to large genomes as they have been designed for small, bacterial genomes. On the C. elegans dataset, HASLR gives a significantly more contiguous assembly than hybridSPAdes and Wengan without any structural assembly error. For the human dataset, HASLR has a higher NGA50 while generating significantly fewer misassemblies. Note that HASLR does not employ any polishing step, either internally or externally. Thus, the indel rate of the draft assemblies generated by HASLR is less than desirable. However, these types of local assembly errors can easily be addressed through a polishing step, as shown in Table 5.5. With a single round of polishing, both indel and mismatch rates match those of the other tools in two datasets.

Table 5.1: Comparison between draft assemblies obtained by different tools on simulated data.

Genome | Assembler | Contigs | Genome fraction | NGA50 | Misassemblies (extensive+local) | Mismatch rate | Indel rate | Time | Memory (GB)
E. coli | Canu | 1 | 99.648 | 4,625,313 | 0+0 | 0.86 | 15.85 | 30:18 | 4.16
 | wtdbg2 | 135 | 96.158 | 107,864 | 4+79 | 216.99 | 492.12 | 0:46 | 19.36
 | hybridSPAdes | 1 | 100.000 | 4,641,652 | 0+0 | 6.18 | 0.32 | 8:05 | 113.92
 | Unicycler | 1 | 99.997 | 4,641,530 | 0+0 | 3.12 | 0.45 | 18:43 | 21.56
 | DBG2OLC | 2 | 92.497 | 2,647,379 | 0+0 | 0.28 | 30.05 | 4:37 | 1.35
 | MaSuRCA | 1 | 99.874 | 4,636,209 | 0+4 | 0.56 | 0.19 | 5:21 | 32.52
 | Wengan | 1 | 100.00 | 4,641,731 | 0+0 | 2.54 | 5.36 | 2:21 | 3.19
 | HASLR | 1 | 99.999 | 4,643,699 | 0+0 | 2.00 | 42.89 | 0:41 | 3.04
Yeast | Canu | 21 | 98.831 | 910,628 | 0+0 | 3.18 | 25.44 | 44:10 | 5.51
 | wtdbg2 | 490 | 92.871 | 77,726 | 24+191 | 259.00 | 577.63 | 1:58 | 28.35
 | hybridSPAdes | 38 | 97.840 | 797,316 | 2+12 | 41.54 | 2.12 | 19:41 | 113.93
 | Unicycler | 52 | 97.893 | 799,601 | 0+1 | 8.81 | 0.44 | 57:47 | 22.99
 | DBG2OLC | 18 | 98.492 | 771,063 | 1+0 | 5.9 | 85.95 | 13:29 | 1.21
 | MaSuRCA | 17 | 99.476 | 919,651 | 0+3 | 5.97 | 0.56 | 15:10 | 32.66
 | Wengan | 22 | 97.065 | 796,244 | 0+0 | 6.14 | 24.48 | 4:14 | 5.55
 | HASLR | 18 | 96.597 | 796,649 | 0+0 | 5.39 | 76.63 | 1:52 | 10.48
C. elegans | Canu | 10 | 99.847 | 13,775,238 | 3+1 | 5.88 | 67.73 | 5:15:05 | 13.76
 | wtdbg2 | 4,487 | 95.468 | 81,074 | 194+506 | 246.33 | 657.89 | 15:57 | 29.45
 | hybridSPAdes | 340 | 98.643 | 924,797 | 67+197 | 73.26 | 9.14 | 3:11:50 | 114.79
 | Unicycler | NA
 | DBG2OLC | 16 | 99.692 | 6,732,354 | 10+7 | 8.55 | 174.21 | 2:04:23 | 7.99
 | MaSuRCA | 18 | 99.609 | 4,614,507 | 34+123 | 14.89 | 4.56 | 2:07:41 | 33.76
 | Wengan | 46 | 98.917 | 2,042,350 | 53+20 | 7.26 | 59.81 | 28:21 | 11.18
 | HASLR | 25 | 99.182 | 6,455,832 | 0+0 | 14.74 | 230.58 | 10:45 | 22.42
Human | Canu | 1,461 | 97.279 | 15,045,226 | 854+99 | 37.7 | 196.78 | NA | NA
 | wtdbg2 | 122,438 | 92.735 | 87,595 | 3,436+13,041 | 224.02 | 598.87 | 10:25:19 | 190.07
 | hybridSPAdes | NA
 | Unicycler | NA
 | DBG2OLC | 1,906 | 91.013 | 14,385,033 | 221+246 | 8.43 | 201.56 | 81:18:15 | 69.53
 | MaSuRCA | NA
 | Wengan | 1,776 | 94.617 | 11,216,374 | 185+70 | 3.84 | 33.5 | 20:12:12 | 38.08
 | HASLR | 897 | 91.213 | 17,025,446 | 2+5 | 11.32 | 207.88 | 6:06:43 | 58.55

Note: Mismatch and indel rates are reported per 100 kbp. Unicycler crashed on the C. elegans dataset due to the maximum recursion limit. For the human dataset, hybridSPAdes and Unicycler failed due to the memory limit, and MaSuRCA failed due to a segmentation fault. For the human dataset, Canu was run with option useGrid=true, which makes it run on multiple nodes of a cluster; therefore, its time and memory usage are not available.

Table 5.2: Statistics of real long read datasets

Dataset | Technology | N50 length | Avg. coverage | Total size (Gb) | Aligned size (Gb) | Estimated alignment identity (%)
E. coli (K-12 MG1655) | Nanopore R9.4 | 63,747 | 1,080 | 5.01 | 4.31 | 85.03
 | Illumina | 2×151 | 372 | 1.73 | - | -
Yeast (S288C) | PacBio | 8,561 | 132 | 1.61 | 1.42 | 86.90
 | Illumina | 2×150 | 82 | 1.00 | - | -
C. elegans (Bristol) | PacBio | 16,675 | 47 | 4.73 | 4.32 | 87.43
 | Illumina | 2×100 | 67 | 6.76 | - | -
Human (CHM1) | PacBio | 19,960 | 59 | 182.51 | 163.51 | 85.85
 | Illumina | 2×151 | 41 | 127.76 | - | -

Note: Alignment statistics were obtained by aligning long reads against their reference genome using lordFAST [62].

5.2.2 Experiment on real dataset

To compare the performance of HASLR with other tools on real datasets, we tested them on 4 publicly available datasets: E. coli, yeast, C. elegans, and human. Table 5.2 contains details about these real datasets (see Appendix C.1.2 for the availability of each dataset). Similar to the simulated datasets, on real datasets HASLR generates fewer misassemblies compared to the other assemblers while remaining the fastest. Compared to other hybrid assemblers, HASLR performs similarly or better in terms of contiguity, while it stands behind self-assembly tools with a lower NGA50. For real datasets, we further evaluated the accuracy of assemblies by performing gene completeness analysis using BUSCO [148], which quantifies gene completeness using single-copy orthologs. Table 5.4 shows the results of BUSCO on E. coli, yeast, and C. elegans. We were unable to obtain BUSCO results for the human genome due to its high running time requirement. Another observation is that for some experiments, HASLR does not perform as well as others in terms of genome fraction (see Discussion for more details). However, our gene completeness analysis shows that HASLR is on par with other tools based on the BUSCO gene completeness measure (see Table 5.4). Note that the very low gene completeness of Canu, wtdbg2, and DBG2OLC on the E. coli dataset could be due to the high indel rates of their assemblies.

Table 5.3: Comparison between assemblies obtained by different tools on real data

Dataset | Assembler | Contigs | Genome fraction | NGA50 | Misassemblies (extensive+local) | Mismatch rate | Indel rate | Time | Memory (GB)
E. coli (Nanopore) | Canu | 1 | 99.976 | 3,647,271 | 2+6 | 108.85 | 1254.40 | 702:57:07 | 32.39
 | wtdbg2 | 9 | 79.114 | 141,474 | 38+72 | 245.82 | 1501.74 | 4:57 | 28.05
 | hybridSPAdes | 15 | 99.964 | 3,863,268 | 2+7 | 7.16 | 0.50 | 3:38:13 | 114.29
 | Unicycler | NA
 | DBG2OLC | 1 | 99.950 | 3,539,045 | 3+4 | 46.86 | 335.82 | 8:25 | 8.74
 | MaSuRCA | 1 | 99.988 | 3,892,134 | 3+7 | 2.82 | 0.50 | 30:28 | 32.66
 | Wengan | 3 | 99.998 | 3,346,596 | 3+2 | 4.74 | 9.24 | 20:02 | 14.37
 | HASLR | 2 | 99.992 | 3,970,011 | 2+2 | 22.62 | 79.85 | 3:18 | 5.78
Yeast (PacBio) | Canu | 23 | 99.724 | 739,932 | 29+2 | 8.85 | 7.99 | 1:00:19 | 5.97
 | wtdbg2 | 28 | 97.668 | 640,895 | 20+3 | 10.65 | 27.17 | 3:04 | 16.26
 | hybridSPAdes | 61 | 97.207 | 436,584 | 28+20 | 44.77 | 3.71 | 20:58 | 114.09
 | Unicycler | 51 | 97.555 | 531,185 | 15+5 | 15.13 | 4.22 | 2:09:27 | 36.90
 | DBG2OLC | 24 | 63.275 | 229,397 | 25+10 | 28.37 | 58.43 | 9:51 | 0.99
 | MaSuRCA | 24 | 99.262 | 538,374 | 30+8 | 11.83 | 5.85 | 23:15 | 32.69
 | Wengan | 29 | 96.258 | 528,763 | 14+10 | 11.86 | 34.29 | 6:38 | 8.64
 | HASLR | 28 | 95.735 | 530,856 | 11+5 | 8.13 | 100.64 | 2:25 | 11.30
C. elegans (PacBio) | Canu | 172 | 99.665 | 561,201 | 723+596 | 65.28 | 58.82 | 4:15:23 | 11.62
 | wtdbg2 | 288 | 98.994 | 561,292 | 329+596 | 26.82 | 79.72 | 14:13 | 21.19
 | hybridSPAdes | 2,336 | 96.720 | 84,003 | 633+638 | 108.04 | 15.96 | 2:47:32 | 74.11
 | Unicycler | 858 | 97.102 | 139,992 | 940+692 | 58.36 | 45.47 | 23:49:29 | 105.06
 | DBG2OLC | 206 | 99.100 | 421,196 | 546+383 | 44.75 | 80.61 | 2:34:44 | 11.36
 | MaSuRCA | 216 | 97.013 | 471,366 | 368+504 | 49.20 | 23.50 | 1:57:49 | 33.48
 | Wengan | 270 | 93.341 | 341,861 | 308+336 | 35.75 | 121.11 | 45:45 | 8.02
 | HASLR | 261 | 97.431 | 453,631 | 259+331 | 26.08 | 140.40 | 15:35 | 17.93
CHM1 (PacBio) | Canu | 2,110 | 96.084 | 2,329,909 | 6,715+7,048 | 145.81 | 120.69 | 689:26:01 | 70.44
 | wtdbg2 | 3,723 | 92.896 | 2,081,842 | 3,535+6,286 | 118.45 | 72.54 | 11:35:22 | 202.41
 | hybridSPAdes | NA
 | Unicycler | NA
 | DBG2OLC | 2,118 | 95.547 | 1,599,466 | 3,718+8,690 | 116.81 | 116.89 | 78:21:08 | 64.94
 | MaSuRCA | 3,781 | 93.782 | 1,761,291 | 4,984+7,491 | 180.83 | 57.53 | 350:35:59 | 225.63
 | Wengan | 4,474 | 88.948 | 875,489 | 2,771+7,577 | 115.65 | 160.71 | 18:19:47 | 112.73
 | HASLR | 1,469 | 92.664 | 1,699,092 | 2,097+7,661 | 113.06 | 281.74 | 6:32:33 | 60.75

Notes: Mismatch and indel rates are reported per 100 kbp. hybridSPAdes and Unicycler failed on the human genome dataset due to the memory limit. Unicycler did not finish on the E. coli dataset within two weeks.

Table 5.4: Gene completeness analysis

Dataset | Assembler | Complete (%) | Complete single copy (%) | Complete duplicate (%) | Fragmented (%) | Missing (%) | Total BUSCO groups
E. coli (Nanopore) | Canu | 4.1 | 4.1 | 0.0 | 16.8 | 79.1 | 440
 | wtdbg2 | 1.8 | 1.8 | 0.0 | 9.1 | 89.1 | 440
 | hybridSPAdes | 100.0 | 99.5 | 0.5 | 0.0 | 0.0 | 440
 | Unicycler | NA
 | DBG2OLC | 35.9 | 35.7 | 0.2 | 33.0 | 31.1 | 440
 | MaSuRCA | 99.7 | 98.6 | 1.1 | 0.0 | 0.3 | 440
 | Wengan | 100.0 | 99.5 | 0.5 | 0.0 | 0.0 | 440
 | HASLR | 97.8 | 97.3 | 0.5 | 1.6 | 0.6 | 440
Yeast (PacBio) | Canu | 96.6 | 94.8 | 1.8 | 0.2 | 3.2 | 2137
 | wtdbg2 | 88.4 | 86.8 | 1.6 | 0.8 | 10.8 | 2137
 | hybridSPAdes | 96.6 | 94.8 | 1.8 | 0.1 | 3.3 | 2137
 | Unicycler | 96.4 | 94.7 | 1.7 | 0.1 | 3.5 | 2137
 | DBG2OLC | 57.1 | 56.5 | 0.6 | 0.5 | 42.4 | 2137
 | MaSuRCA | 96.3 | 94.1 | 2.2 | 0.1 | 3.6 | 2137
 | Wengan | 96.5 | 94.9 | 1.6 | 0.0 | 3.5 | 2137
 | HASLR | 95.8 | 94.4 | 1.4 | 0.1 | 4.1 | 2137
C. elegans (PacBio) | Canu | 97.4 | 96.8 | 0.6 | 1.1 | 1.5 | 3131
 | wtdbg2 | 97.1 | 96.5 | 0.6 | 1.3 | 1.6 | 3131
 | hybridSPAdes | 96.4 | 95.8 | 0.6 | 1.3 | 2.3 | 3131
 | Unicycler | 97.7 | 97.1 | 0.6 | 0.7 | 1.6 | 3131
 | DBG2OLC | 97.5 | 95.8 | 1.7 | 0.6 | 1.9 | 3131
 | MaSuRCA | 95.5 | 94.1 | 1.4 | 0.4 | 4.1 | 3131
 | Wengan | 91.6 | 91.1 | 0.5 | 0.9 | 7.5 | 3131
 | HASLR | 97.1 | 96.7 | 0.4 | 0.8 | 2.1 | 3131

Note: We used enterobacterales_odb10, saccharomycetes_odb10, and nematoda_odb10 gene sets for assessing gene completeness of E. coli, yeast, and C. elegans assemblies, respectively. We were not able to obtain the gene completeness results for the human dataset due to time restrictions.

Table 5.5: Effect of polishing assemblies on the small assembly errors of real datasets

Dataset | Assembler | Mismatch rate (draft) | Mismatch rate (polished) | Indel rate (draft) | Indel rate (polished)
Yeast (PacBio) | Canu | 8.85 | 7.56 | 7.99 | 7.99
 | wtdbg2 | 10.65 | 7.19 | 27.17 | 2.61
 | hybridSPAdes | 44.77 | 9.88 | 3.71 | 3.93
 | Unicycler | 15.13 | 6.84 | 4.22 | 2.44
 | DBG2OLC | 28.37 | 14.42 | 58.43 | 5.51
 | MaSuRCA | 11.83 | 8.49 | 5.85 | 9.69
 | Wengan | 11.86 | 7.36 | 34.29 | 2.08
 | HASLR | 8.13 | 4.33 | 100.64 | 2.05
C. elegans (PacBio) | Canu | 65.28 | 65.88 | 58.82 | 29.71
 | wtdbg2 | 26.82 | 25.9 | 79.72 | 27.11
 | hybridSPAdes | 108.04 | 27.88 | 15.96 | 45.43
 | Unicycler | 58.36 | 36.97 | 45.47 | 32.08
 | DBG2OLC | 44.75 | 46.50 | 80.61 | 43.52
 | MaSuRCA | 49.20 | 30.9 | 23.50 | 31.97
 | Wengan | 35.75 | 21.13 | 121.11 | 22.82
 | HASLR | 26.08 | 19.61 | 140.40 | 22.92

Note: Here polished genomes are obtained after a single round of polishing using Arrow (https://github.com/PacificBiosciences/GenomicConsensus)

5.3 Summary

In this chapter, we presented HASLR, which introduces the notion of the backbone graph for hybrid genome assembly. HASLR generates accurate assemblies at high speed, which enables it to keep up with the increasing throughput of LR sequencing technologies while remaining time and memory efficient. The high speed of HASLR is due to three reasons: (i) in a sense, it skips all-vs-all alignment of long reads through its use of the backbone graph, (ii) HASLR uses the fast SPOA consensus module rather than a standard POA implementation, and (iii) HASLR uses only the longest 25× coverage of LRs for assembly. Assemblies generated by HASLR are similar to those generated by the best-performing tools in terms of contiguity while having the lowest number of misassemblies. In other words, we prefer to remain conservative in resolving ambiguous regions without a strong signal rather than aggressively resolving them to generate longer contigs and possibly introducing misassemblies. However, the conservative nature of HASLR does not imply that it compromises on assembling complex regions. Every complex region that is covered by a sufficient number of LRs, together with its flanking unique SR contigs, will be resolved. In fact, based on our manual inspections, there are regions that HASLR assembles properly, but all other tools either misassemble or generate fragmented assemblies (see Appendix C.4 for visual examples of these cases).

Chapter 6

Conclusion

High throughput sequencing (HTS) technologies have been continuously evolving since their inception. Next-generation sequencing (NGS) technologies revolutionized genomics research by generating orders of magnitude more data than the previous sequencing technologies. Today, single-molecule sequencing (SMS) technologies such as PacBio and Oxford Nanopore are making another revolution by generating reads orders of magnitude longer than NGS reads. Since the length of SMS reads exceeds the length of many repetitive regions observed in most genomes, they have been able to overcome the limitations of short NGS reads for many applications, especially de novo assembly and structural variation (SV) detection. However, unlike NGS reads, SMS long reads are subject to a much higher base-level error rate. Therefore, analyzing them using algorithms and methods developed for NGS reads is not practical. This motivated computational biologists to develop new algorithms and techniques specifically tailored to handle such a high error rate. In addition, the increase in the throughput of SMS technologies recently offered by PacBio and Oxford Nanopore further necessitates the development of efficient tools to analyze the sheer volume of data they generate. In this thesis, we presented three computational methods useful for the analysis of long reads generated by SMS technologies. First, we introduced lordFAST, our fast and sensitive tool for mapping error-prone long reads to a reference genome. lordFAST achieves a high speed through its use of a sparse anchor extraction strategy. Using a set of simulated long reads, we showed that lordFAST not only maps more reads to their true originating region compared to its competitors but is also highly accurate in base-level alignment. We also demonstrated its success in aligning long reads originating from regions affected by SVs. Second, we presented CoLoRMap, our hybrid long read error correction tool.
CoLoRMap takes advantage of high-quality short reads to improve the accuracy of input long reads. This is done in two steps: (i) aligning short reads onto long reads and finding a sequence of overlapping short reads that best represents the content of the sequenced molecule, and (ii) further correcting regions with no short read mapping via local assembly of the mates of short reads mapped to the previously corrected regions. Among hybrid error correction methods, CoLoRMap and LoRDEC differ significantly from consensus-based methods in that they aim to minimize the edit distance between the corrected sequence and the long read (although using different approaches). The results obtained from these alignment-based optimization methods compare favorably with previously published consensus-based methods (proovread and PacBioToCA). Moreover, our experiments show that CoLoRMap improves the quality of assemblies more than previous hybrid correction tools.

Finally, we introduced HASLR, which rapidly assembles long reads and short reads in a hybrid fashion. HASLR performs de novo assembly in three steps: (i) assembling short reads using a well-established NGS-based assembler, Minia [23]; (ii) building the backbone of the genome by connecting the short read contigs together, using a graph structure built from the alignment of short contigs onto long reads; and (iii) filling the gaps in the backbone using the consensus of the long reads that support each gap. Based on our experiments, HASLR is among the fastest available assemblers. In addition, our experiments on simulated datasets demonstrated the power of HASLR in building assemblies that contain fewer structural errors, known as misassemblies. Our experiments also show that HASLR successfully scales to assembling large complex genomes, such as the human genome.
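Of the steps summarized above, the first step of CoLoRMap is the most compact to illustrate: choosing, among overlapping short-read alignments on a long read, a chain with minimum total edit distance can be framed as a shortest-path problem. The sketch below is a simplified toy model of that idea, not CoLoRMap's actual implementation; the interval representation, function name, and tie-breaking are illustrative assumptions.

```python
import heapq

def best_correction_path(alignments, lr_len):
    """Toy model of step (i): each alignment is a (start, end, edits) interval
    in long-read coordinates; Dijkstra's algorithm finds a chain of overlapping
    intervals that reaches the end of the read at minimum total edit distance."""
    alns = sorted(alignments)
    # Seed the queue with alignments that start at the beginning of the read.
    pq = [(a[2], i, [i]) for i, a in enumerate(alns) if a[0] == 0]
    heapq.heapify(pq)
    settled = {}
    while pq:
        cost, i, path = heapq.heappop(pq)
        start, end, _ = alns[i]
        if end >= lr_len:
            # The first chain popped that reaches the read end is optimal,
            # since costs are popped in nondecreasing order.
            return cost, [alns[j] for j in path]
        if i in settled and settled[i] <= cost:
            continue
        settled[i] = cost
        for j, (s2, e2, ed2) in enumerate(alns):
            # A successor must overlap the current interval and extend it rightwards.
            if start < s2 <= end and e2 > end:
                heapq.heappush(pq, (cost + ed2, j, path + [j]))
    return None  # no chain of overlapping alignments spans the read
```

In this toy setting, `best_correction_path([(0, 5, 1), (0, 6, 2), (4, 10, 1), (5, 9, 3), (8, 12, 0)], 12)` selects the chain `(0, 5, 1) → (4, 10, 1) → (8, 12, 0)` with total cost 2.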

6.1 Future directions

There are a number of possibilities for extending each of the presented tools that can be considered as future directions. Although lordFAST's hybrid index is faster than the standard BWT-FM index in querying all exact matches of the seeds on the reference genome, it would be possible to speed up this step even further. Based on our experience, the LF-mapping operation of the FM-index is the bottleneck in locating exact matches. Hence, as future work, it is worth investigating the performance of a minimizer-based index (similar to the one used by minimap2 [98]) for lordFAST. In addition, using a faster alignment algorithm similar to the one devised by minialign (https://github.com/ocxtal/minialign) could increase the speed of lordFAST.

As mentioned earlier, CoLoRMap is much faster than other mapping-based hybrid error correction tools. However, since it still requires alignment of the whole short read dataset onto the long read dataset, it does not scale to large genomes (such as the human genome). One way to address this limitation is to reduce the size of the sequence content that needs to be aligned to the long reads. Towards this goal, one could use the idea of super-reads introduced by MaSuRCA [180]. In particular, this can be useful because the total size of the super-reads is much smaller than that of the original short read dataset; as observed by Zimin et al. [180], the size of the maximal super-reads is only ~2-3× coverage. This set of super-reads could be further coupled with unitigs generated by an accurate short read assembler (such as Minia [23]) and passed as the input to CoLoRMap. On the other hand, the second step of CoLoRMap relies on data that are generally not considered in mapping-based approaches, namely unmapped reads. Our experiments demonstrated that the inclusion of OEA (one-end anchored) reads improves the size of the corrected regions and even the average identity. This shows the potential of this targeted read recruitment approach, whose principle has also been used in other problems such as gap filling. It would be interesting to see whether applying the principle of LoRDEC only to these reads (i.e., minimizing the edit distance of the de Bruijn graph-based assembly of the OEA reads) would improve the quality of the correction, despite the initially high error rate in the long-read gap that prevented any short read from aligning there.

Finally, there are a number of improvements planned for future releases of HASLR. First, compared to other tools, HASLR usually has a higher indel rate. Note that most of the small local assembly mistakes (including mismatch and indel errors) can be fixed by further polishing. Nevertheless, since a large portion of the assembled genome is built from short read contigs, a polishing module could be specifically designed for HASLR that polishes only the regions between unique short read contigs, which are generated using partial order alignment (POA). This would enable a faster polishing phase. Second, HASLR sometimes generates assemblies with a relatively lower genome fraction than other tools. This is more apparent when we compare it against Canu, especially on a large and complex genome like the human genome. The main reason is the lack of unique short read contigs across large regions. This limitation could be mitigated by extracting unused long reads (LRs) and assembling them in an OLC fashion (e.g., using miniasm [97]).
Note that the unused LRs constitute only a small portion of the original input dataset. As a result, using an OLC approach for such a small set of LRs should not significantly affect the total running time. Another important piece of future work is supporting ultra-long ONT reads (whose length can exceed 1 Mbp). An important factor in the contiguity of assemblies generated by HASLR is the length of the reads: longer reads yield a more connected and better resolved backbone graph. In other words, HASLR should be able to generate much more contiguous assemblies from ultra-long Nanopore reads. Last but not least, heterozygosity-aware consensus calling of subreads falling between two unique short read contigs (SRCs) is one of our main future directions. This would be possible by clustering the subreads that fall between consecutive unique SRCs into two groups and performing consensus calling for each group separately. This would enable HASLR to perform phased assembly of diploid genomes.
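To make the minimizer-based indexing idea mentioned above concrete, here is a minimal sketch of (w,k)-minimizer extraction in the spirit of minimap2 [98]. It is a deliberately simplified illustration: real implementations hash k-mers, consider the reverse-complement strand, and use a rolling data structure rather than rescanning each window.

```python
def minimizers(seq, k, w):
    """Return the set of (w,k)-minimizers of seq as (kmer, position) pairs:
    for every window of w consecutive k-mers, keep the lexicographically
    smallest one (leftmost on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for i in range(len(kmers) - w + 1):
        window = kmers[i:i + w]
        # Index of the smallest k-mer in this window (min() keeps the leftmost tie).
        j = min(range(w), key=lambda x: window[x])
        picked.add((window[j], i + j))
    return picked
```

Because adjacent windows usually share their minimum, far fewer anchors are kept than there are k-mers; for example, `minimizers("ACGTACGTTACG", k=3, w=2)` retains 8 of the 10 possible 3-mers, and larger w values sparsify further.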

6.2 Recommended guidelines

Many different factors combine to affect the performance of error correction or assembly tools on a specific dataset. These factors include, but are not limited to, the genome size, the depth of coverage, the amount of sequencing error, the length of the reads, the amount and types of sequencing artifacts (e.g., chimeric reads), and the sample's zygosity. Therefore, it is not possible to give rule-of-thumb guidelines for deciding which tool is best for each application. Still, based on our experience, we can offer some general suggestions.

In the context of this thesis, probably the most critical question is the choice between hybrid and non-hybrid methods. A general observation is that hybrid approaches work better for low-coverage long read datasets. However, with sufficient coverage of long reads, non-hybrid approaches can achieve similar or better performance in both structural and base-level accuracy. Non-hybrid methods seem to require at least 30× coverage in order to perform well. Genome size is another important factor. For instance, while hybridSPAdes and Unicycler perform very well on bacterial and other small genomes, they do not generate satisfactory assemblies on large genomes and often fail due to their high memory requirements. Therefore, for low-coverage long read datasets obtained from large genomes, DBG2OLC, MaSuRCA, and HASLR seem to be good options. On the other hand, with enough long read coverage in hand, Canu stays robust and consistently performs very well on both small and large genomes. In terms of user experience, based on this author's observations, DBG2OLC and MaSuRCA are not as user-friendly to run as the others; however, a more straightforward pipeline wrapper could significantly improve them in that regard.

Another important aspect is genome polishing. Although it is possible to use a fast tool like Racon for polishing draft assemblies, Pilon, Quiver, and Nanopolish seem to perform much better at improving base-level accuracy. Nevertheless, these tools are much slower than Racon.
As with error correction and assembly, if there is enough long read coverage, it is recommended to use non-hybrid polishing methods (i.e., Quiver or Nanopolish) rather than hybrid ones (i.e., Pilon). This is because long read mappers perform better than short read mappers at finding the correct mapping location, simply by taking advantage of long-range information.
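The assembler suggestions above can be loosely codified as a decision rule. The function below is only a rough, illustrative encoding of this author's observations; the function name is hypothetical, and the 30× coverage and 100 Mbp genome-size cutoffs are approximate readings of the text, not hard rules.

```python
def suggest_assembler(lr_coverage_x, genome_size_bp):
    """Rough heuristic distilled from the guidelines in this section
    (illustrative only; thresholds are approximate, not hard rules)."""
    large_genome = genome_size_bp > 100_000_000  # illustrative cutoff
    if lr_coverage_x >= 30:
        # With enough long-read coverage, non-hybrid assembly stays robust
        # on both small and large genomes.
        return "Canu"
    if large_genome:
        # Low coverage on a large genome: hybrid tools that scale.
        return "DBG2OLC / MaSuRCA / HASLR"
    # Low coverage on a bacterial or other small genome.
    return "Unicycler / hybridSPAdes"
```

For example, a 10× PacBio dataset from a mammalian-sized genome would route to the scalable hybrid assemblers, while the same coverage on a bacterial genome would route to Unicycler or hybridSPAdes.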

Bibliography

[1] Alexej Abyzov, Alexander E Urban, Michael Snyder, and Mark Gerstein. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome research, 21(6):974–984, 2011.

[2] Can Alkan, Bradley P Coe, and Evan E Eichler. Genome structural variation discovery and genotyping. Nature Reviews Genetics, 12(5):363–376, 2011.

[3] Can Alkan, Jeffrey M Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O Kitzman, Carl Baker, Maika Malig, Onur Mutlu, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature genetics, 41(10):1061, 2009.

[4] Dmitry Antipov, Anton Korobeynikov, Jeffrey S McLean, and Pavel A Pevzner. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics, 32(7):1009–1015, 2015.

[5] Hossein Asghari, Yen-Yi Lin, Yang Xu, Ehsan Haghshenas, Colin C Collins, and Faraz Hach. CircMiner: accurate and rapid detection of circular RNA through splice-aware pseudo-alignment scheme. Bioinformatics, 2020.

[6] Kin Fai Au, Jason G Underwood, Lawrence Lee, and Wing Hung Wong. Improving PacBio long read accuracy by short read alignment. PloS one, 7(10):e46679, 2012.

[7] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5):455–477, 2012.

[8] Ergude Bao and Lingxiao Lan. HALC: High throughput algorithm for long read error correction. BMC bioinformatics, 18(1):204, 2017.

[9] Ali Bashir, Aaron A Klammer, William P Robins, Chen-Shan Chin, Dale Webster, Ellen Paxinos, David Hsu, Meredith Ashby, Susana Wang, Paul Peluso, et al. A hybrid approach for the automated finishing of bacterial genomes. Nature biotechnology, 30(7):701–707, 2012.

[10] David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith, John Milton, Clive G Brown, Kevin P Hall, Dirk J Evers, Colin L Barnes, Helen R Bignell, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53–59, 2008.

[11] Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James P Drake, Jane M Landolin, and Adam M Phillippy. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature biotechnology, 33(6):623–630, 2015.

[12] Marten Boetzer and Walter Pirovano. SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC bioinformatics, 15(1):211, 2014.

[13] Vladimír Boža, Broňa Brejová, and Tomáš Vinař. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PloS one, 12(6):e0178751, 2017.

[14] Steven D Brown, Shilpa Nagaraju, Sagar Utturkar, Sashini De Tissera, Simón Segovia, Wayne Mitchell, Miriam L Land, Asela Dassanayake, and Michael Köpke. Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia. Biotechnology for biofuels, 7(1):40, 2014.

[15] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, 1994.

[16] Joseph D Buxbaum, Mark J Daly, Bernie Devlin, Thomas Lehner, Kathryn Roeder, Matthew W State, Autism Sequencing Consortium, et al. The autism sequencing consortium: large-scale, high-throughput sequencing in autism spectrum disorders. Neuron, 76(6):1052–1056, 2012.

[17] Sarah L Castro-Wallace, Charles Y Chiu, Kristen K John, Sarah E Stahl, Kathleen H Rubins, Alexa BR McIntyre, Jason P Dworkin, Mark L Lupisella, David J Smith, Douglas J Botkin, et al. Nanopore DNA sequencing and genome assembly on the International Space Station. Scientific reports, 7(1):1–12, 2017.

[18] Mark J Chaisson and Glenn Tesler. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC bioinformatics, 13(1):238, 2012.

[19] Mark JP Chaisson, John Huddleston, Megan Y Dennis, Peter H Sudmant, Maika Malig, Fereydoun Hormozdiari, Francesca Antonacci, Urvashi Surti, Richard Sandstrom, Matthew Boitano, et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature, 517(7536):608–611, 2015.

[20] Ken Chen, Lei Chen, Xian Fan, John Wallis, Li Ding, and George Weinstock. TIGRA: a targeted iterative graph routing assembler for breakpoint assembly. Genome research, 24(2):310–317, 2014.

[21] Ken Chen, John W Wallis, Michael D McLellan, David E Larson, Joelle M Kalicki, Craig S Pohl, Sean D McGrath, Michael C Wendl, Qunyuan Zhang, Devin P Locke, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature methods, 6(9):677, 2009.

[22] Rui Chen and Michael Snyder. Promise of personalized omics to precision medicine. Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 5(1):73–82, 2013.

[23] Rayan Chikhi and Guillaume Rizk. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms for Molecular Biology, 8(1):22, 2013.

[24] Chen-Shan Chin, David H Alexander, Patrick Marks, Aaron A Klammer, James Drake, Cheryl Heiner, Alicia Clum, Alex Copeland, John Huddleston, Evan E Eichler, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature methods, 10(6):563–569, 2013.

[25] Chen-Shan Chin, Paul Peluso, Fritz J Sedlazeck, Maria Nattestad, Gregory T Concepcion, Alicia Clum, Christopher Dunn, Ronan O’Malley, Rosa Figueroa-Balderas, Abraham Morales-Cruz, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nature methods, 13(12):1050–1054, 2016.

[26] Hamidreza Chitsaz, Joyclyn L Yee-Greenbaum, Glenn Tesler, Mary-Jane Lombardo, Christopher L Dupont, Jonathan H Badger, Mark Novotny, Douglas B Rusch, Louise J Fraser, Niall A Gormley, et al. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nature biotechnology, 29(10):915, 2011.

[27] Murim Choi, Ute I Scholl, Weizhen Ji, Tiewen Liu, Irina R Tikhonova, Paul Zumbo, Ahmet Nayir, Aysin Bakkaloğlu, Seza Özen, Sami Sanjad, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proceedings of the National Academy of Sciences, 106(45):19096–19101, 2009.

[28] Francis S Collins and Harold Varmus. A new initiative on precision medicine. New England journal of medicine, 372(9):793–795, 2015.

[29] 1000 Genomes Project Consortium et al. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061, 2010.

[30] 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56, 2012.

[31] Diana Crow. A new wave of genomics for all. Cell, 177(1):5–7, 2019.

[32] Matei David, Lewis Jonathan Dursi, Delia Yao, Paul C Boutros, and Jared T Simpson. Nanocall: an open source basecaller for Oxford Nanopore sequencing data. Bioinformatics, 33(1):49–55, 2016.

[33] Matei David, Misko Dzamba, Dan Lister, Lucian Ilie, and Michael Brudno. SHRiMP2: sensitive yet practical short read mapping. Bioinformatics, 27(7):1011–1012, 2011.

[34] Lucy C de Jong, Simone Cree, Vanessa Lattimore, George AR Wiggins, Amanda B Spurdle, Allison Miller, Martin A Kennedy, and Logan C Walker. Nanopore sequencing of full-length BRCA1 mRNA transcripts reveals co-occurrence of known exon skipping events. Breast Cancer Research, 19(1):127, 2017.

[35] Alex Di Genova, Elena Buena-Atienza, Stephan Ossowski, and Marie-France Sagot. Wengan: Efficient and high quality hybrid de novo assembly of human genomes. bioRxiv, page 840447, 2019.

[36] Sarah Djebali, Carrie A Davis, Angelika Merkel, Alex Dobin, Timo Lassmann, Ali Mortazavi, Andrea Tanzer, Julien Lagarde, Wei Lin, Felix Schlesinger, et al. Landscape of transcription in human cells. Nature, 489(7414):101–108, 2012.

[37] DNA.SPACE blog. https://dna.space. Accessed: 2020-02-09.

[38] Ling Dong, Wanheng Wang, Alvin Li, Rina Kansal, Yuhan Chen, Hong Chen, and Xinmin Li. Clinical next generation sequencing for precision medicine in cancer. Current genomics, 16(4):253–263, 2015.

[39] Peter Edge, Vineet Bafna, and Vikas Bansal. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome research, 27(5):801–812, 2017.

[40] Arwyn Edwards, Aliyah R Debbonaire, Samuel M Nicholls, Sara ME Rassner, Birgit Sattler, Joseph M Cook, Tom Davy, André Soares, Luis AJ Mur, and Andrew J Hodson. In-field metagenome and 16S rRNA gene amplicon nanopore sequencing robustly characterize glacier microbiota. bioRxiv, page 073965, 2019.

[41] John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul Peluso, David Rank, Primo Baybayan, Brad Bettman, et al. Real-time DNA sequencing from single polymerase molecules. Science, 323(5910):133–138, 2009.

[42] Adam C English, Stephen Richards, Yi Han, Min Wang, Vanesa Vee, Jiaxin Qu, Xiang Qin, Donna M Muzny, Jeffrey G Reid, Kim C Worley, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PloS one, 7(11):e47768, 2012.

[43] David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F Italiano. Sparse dynamic programming I: linear cost functions. Journal of the ACM (JACM), 39(3):519–545, 1992.

[44] Xian Fan, Mark Chaisson, Luay Nakhleh, and Ken Chen. HySA: a Hybrid Structural variant Assembly approach using next-generation and single-molecule sequencing technologies. Genome research, 27(5):793–800, 2017.

[45] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 390–398. IEEE, 2000.

[46] Can Firtina, Ziv Bar-Joseph, Can Alkan, and A Ercument Cicek. Hercules: a profile HMM-based hybrid error correction algorithm for long reads. Nucleic acids research, 46(21):e125–e125, 2018.

[47] Benjamin A Flusberg, Dale R Webster, Jessica H Lee, Kevin J Travers, Eric C Olivares, Tyson A Clark, Jonas Korlach, and Stephen W Turner. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nature methods, 7(6):461, 2010.

[48] Michael Ford, Ehsan Haghshenas, Corey T Watson, and S Cenk Sahinalp. Genotyping and Copy Number Analysis of Immunoglobin Heavy Chain Variable Genes using Long Reads. iScience, page 100883, 2020.

[49] Moritz Gerstung, Clemency Jolly, Ignaty Leshchiner, Stefan C Dentro, Santiago Gonzalez, Daniel Rosebrock, Thomas J Mitchell, Yulia Rubanova, Pavana Anur, Kaixian Yu, et al. The evolutionary history of 2,658 cancers. Nature, 578(7793):122–128, 2020.

[50] Sante Gnerre, Iain MacCallum, Dariusz Przybylski, Filipe J Ribeiro, Joshua N Burton, Bruce J Walker, Ted Sharpe, Giles Hall, Terrance P Shea, Sean Sykes, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108(4):1513–1518, 2011.

[51] Liang Gong, Chee-Hong Wong, Wei-Chung Cheng, Harianto Tjong, Francesca Menghi, Chew Yee Ngan, Edison T Liu, and Chia-Lin Wei. Picky comprehensively detects high-resolution structural variants in nanopore long reads. Nature methods, 15(6):455–460, 2018.

[52] Paul M Gontarz, Jennifer Berger, and Chung F Wong. SRmapper: a fast and sensitive genome-hashing alignment tool. Bioinformatics, 29(3):316–321, 2013.

[53] Sara Goodwin, James Gurtowski, Scott Ethe-Sayers, Panchajanya Deshpande, Michael C Schatz, and W Richard McCombie. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome research, 25(11):1750–1756, 2015.

[54] Sara Goodwin, John D McPherson, and W Richard McCombie. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6):333–351, 2016.

[55] Peiyong Guan and Wing-Kin Sung. Structural variation detection using next-generation sequencing data: a comparative technical review. Methods, 102:36–49, 2016.

[56] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics, page btt086, 2013.

[57] Faraz Hach, Fereydoun Hormozdiari, Can Alkan, Farhad Hormozdiari, Inanc Birol, Evan E Eichler, and S Cenk Sahinalp. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature methods, 7(8):576–577, 2010.

[58] Faraz Hach, Iman Sarrafi, Farhad Hormozdiari, Can Alkan, Evan E Eichler, and S Cenk Sahinalp. mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic acids research, 42(W1):W494–W500, 2014.

[59] Thomas Hackl, Rainer Hedrich, Jörg Schultz, and Frank Förster. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics, 30(21):3004–3011, 2014.

[60] Ehsan Haghshenas, Hossein Asghari, Jens Stoye, Cedric Chauve, and Faraz Hach. HASLR: Fast Hybrid Assembly of Long Reads. bioRxiv, 2020.

[61] Ehsan Haghshenas, Faraz Hach, S Cenk Sahinalp, and Cedric Chauve. CoLoRMap: Correcting Long Reads by Mapping short reads. Bioinformatics, 32(17):i545–i551, 2016.

[62] Ehsan Haghshenas, S Cenk Sahinalp, and Faraz Hach. lordFAST: sensitive and fast alignment search tool for long noisy read sequencing data. Bioinformatics, 35(1):20–27, 2019.

[63] Fereydoun Hormozdiari, Can Alkan, Evan E Eichler, and S Cenk Sahinalp. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome research, 19(7):1270–1278, 2009.

[64] Kazuyoshi Hosomichi, Takashi Shiina, Atsushi Tajima, and Ituro Inoue. The impact of next-generation sequencing technologies on HLA research. Journal of human genetics, 60(11):665–673, 2015.

[65] Franklin W Huang, Eran Hodis, Mary Jue Xu, Gregory V Kryukov, Lynda Chin, and Levi A Garraway. Highly recurrent TERT promoter mutations in human melanoma. Science, 339(6122):957–959, 2013.

[66] Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. ART: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, 2011.

[67] Illumina Inc. HiSeq X Series of Sequencing Systems. https://www.illumina.com/documents/products/datasheets/datasheet-hiseq-x-ten.pdf. Accessed: 2020-02-04.

[68] Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics, 44(2):226, 2012.

[69] Chirag Jain, Alexander Dilthey, Sergey Koren, Srinivas Aluru, and Adam M Phillippy. A fast approximate algorithm for mapping long reads to large reference databases. In International Conference on Research in Computational Molecular Biology, pages 66–81. Springer, Cham, 2017.

[70] Miten Jain, Sergey Koren, Karen H Miga, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew D Beggs, Alexander T Dilthey, Ian T Fiddes, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature biotechnology, 36(4):338, 2018.

[71] Coline C Jaworski, Carson W Allan, and Luciano M Matzkin. Chromosome-level hybrid de novo genome assemblies as an attainable option for non-model organisms. bioRxiv, page 748228, 2019.

[72] William R Jeck, Jesse Lee, Hayley Robinson, Long P Le, A John Iafrate, and Valentina Nardi. A nanopore sequencing–based assay for rapid detection of gene fusions. The Journal of Molecular Diagnostics, 21(1):58–69, 2019.

[73] Justin B Jiang, Andrea M Quattrini, Warren R Francis, Joseph F Ryan, Estefanía Rodríguez, and Catherine S McFadden. A hybrid de novo assembly of the sea pansy (Renilla muelleri) genome. GigaScience, 8(4):giz026, 2019.

[74] Mykola Kadobianskyi, Lisanne Schulze, Markus Schuelke, and Benjamin Judkewitz. Hybrid genome assembly and annotation of Danionella translucida. bioRxiv, page 539692, 2019.

[75] Emre Karakoc, Can Alkan, Brian J O’roak, Megan Y Dennis, Laura Vives, Kenneth Mark, Mark J Rieder, Debbie A Nickerson, and Evan E Eichler. Detection of structural variants and indels within exome data. Nature methods, 9(2):176, 2012.

[76] Pınar Kavak, Yen-Yi Lin, Ibrahim Numanagić, Hossein Asghari, Tunga Güngör, Can Alkan, and Faraz Hach. Discovery and genotyping of novel sequence insertions in many sequenced individuals. Bioinformatics, 33(14):i161–i169, 2017.

[77] Andy Kilianski, Jamie L Haas, Elizabeth J Corriveau, Alvin T Liem, Kristen L Willis, Dana R Kadavy, C Nicole Rosenzweig, and Samuel S Minot. Bacterial and viral identification and differentiation by amplicon sequencing on the MinION nanopore sequencer. Gigascience, 4(1):s13742–015, 2015.

[78] Martin Kircher and Janet Kelso. High-throughput DNA sequencing–concepts and limitations. Bioessays, 32(6):524–536, 2010.

[79] Jacob O Kitzman. Haplotypes drop by drop: short-read sequencing provides haplotype information when long DNA fragments are barcoded in microfluidic droplets. Nature biotechnology, 34(3):296–299, 2016.

[80] Marek Kokot, Maciej Długosz, and Sebastian Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759–2761, 2017.

[81] Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, and Pavel A Pevzner. Assembly of long, error-prone reads using repeat graphs. Nature biotechnology, 37(5):540, 2019.

[82] Sergey Koren, Gregory P Harhay, Timothy PL Smith, James L Bono, Dayna M Harhay, Scott D Mcvey, Diana Radune, Nicholas H Bergman, and Adam M Phillippy. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome biology, 14(9):R101, 2013.

[83] Sergey Koren and Adam M Phillippy. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Current opinion in microbiology, 23:110–120, 2015.

[84] Sergey Koren, Michael C Schatz, Brian P Walenz, Jeffrey Martin, Jason T Howard, Ganeshkumar Ganapathy, Zhong Wang, David A Rasko, W Richard McCombie, Erich D Jarvis, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature biotechnology, 30(7):693–700, 2012.

[85] Sergey Koren, Brian P Walenz, Konstantin Berlin, Jason R Miller, Nicholas H Bergman, and Adam M Phillippy. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research, 27(5):722–736, 2017.

[86] Sean La, Ehsan Haghshenas, and Cedric Chauve. LRCstats, a tool for evaluating long reads correction methods. Bioinformatics, 33(22):3652–3654, 2017.

[87] David Laehnemann, Arndt Borkhardt, and Alice Carolyn McHardy. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Briefings in bioinformatics, 17(1):154–179, 2016.

[88] Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C Zody, Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature, 2001.

[89] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357, 2012.

[90] Michael S Lawrence, Petar Stojanov, Craig H Mermel, James T Robinson, Levi A Garraway, Todd R Golub, Matthew Meyerson, Stacey B Gabriel, Eric S Lander, and Gad Getz. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature, 505(7484):495–501, 2014.

[91] Christopher Lee. Generating consensus sequences from partial order multiple sequence alignment graphs. Bioinformatics, 19(8):999–1008, 2003.

[92] Christopher Lee, Catherine Grasso, and Mark F Sharlow. Multiple sequence alignment using partial order graphs. Bioinformatics, 18(3):452–464, 2002.

[93] Hayan Lee, James Gurtowski, Shinjae Yoo, Maria Nattestad, Shoshana Marcus, Sara Goodwin, W Richard McCombie, and Michael Schatz. Third-generation sequencing and the future of genomics. bioRxiv, page 048603, 2016.

[94] Shawn E Levy and Richard M Myers. Advancements in next-generation sequencing. Annual review of genomics and human genetics, 17:95–115, 2016.

[95] Heng Li. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28(14):1838–1844, 2012.

[96] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, 2013.

[97] Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, 2016.

[98] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, 2018.

[99] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.

[100] Ruiqiang Li, Wei Fan, Geng Tian, Hongmei Zhu, Lin He, Jing Cai, Quanfei Huang, Qingle Cai, Bo Li, Yinqi Bai, et al. The sequence and de novo assembly of the giant panda genome. Nature, 463(7279):311–317, 2010.

[101] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, 2009.

[102] Shengting Li, Ruiqiang Li, Heng Li, Jianliang Lu, Yingrui Li, Lars Bolund, Mikkel H Schierup, and Jun Wang. SOAPindel: efficient identification of indels from short paired reads. Genome research, 23(1):195–200, 2013.

[103] Bi Lian, Xin Hu, and Zhi-ming Shao. Unveiling novel targets of paclitaxel resistance by single molecule long-read RNA sequencing in breast cancer. Scientific reports, 9(1):1–10, 2019.

[104] Hao Lin, Zefeng Zhang, Michael Q Zhang, Bin Ma, and Ming Li. ZOOM! Zillions of oligos mapped. Bioinformatics, 24(21):2431–2437, 2008.

[105] Yu Lin, Jeffrey Yuan, Mikhail Kolmogorov, Max W Shen, Mark Chaisson, and Pavel A Pevzner. Assembly of long error-prone reads using de Bruijn graphs. Proceedings of the National Academy of Sciences, 113(52):E8396–E8405, 2016.

[106] Bo Liu, Yan Gao, and Yadong Wang. LAMSA: fast split read alignment with long approximate matches. Bioinformatics, 33(2):192–201, 2017.

[107] Bo Liu, Dengfeng Guan, Mingxiang Teng, and Yadong Wang. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics, 32(11):1625–1631, 2016.

[108] Nicholas J Loman, Joshua Quick, and Jared T Simpson. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature methods, 12(8):733–735, 2015.

[109] Ruibang Luo, Binghang Liu, Yinlong Xie, Zhenyu Li, Weihua Huang, Jianying Yuan, Guangzhu He, Yanxiang Chen, Qi Pan, Yunjie Liu, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience, 1(1):2047–217X, 2012.

[110] Dheeraj Malhotra and Jonathan Sebat. CNVs: harbingers of a rare variant revolution in psychiatric genetics. Cell, 148(6):1223–1241, 2012.

[111] Salem Malikic, Farid Rashidi Mehrabadi, Simone Ciccolella, Md Khaledur Rahman, Camir Ricketts, Ehsan Haghshenas, Daniel Seidman, Faraz Hach, Iman Hajirasouliha, and S Cenk Sahinalp. PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. Genome research, 29(11):1860–1877, 2019.

[112] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.

[113] Santiago Marco-Sola, Michael Sammeth, Roderic Guigó, and Paolo Ribeca. The GEM mapper: fast, accurate and versatile alignment by filtration. Nature methods, 9(12):1185, 2012.

[114] Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S Bader, Lisa A Bemben, Jan Berka, Michael S Braverman, Yi-Ju Chen, Zhoutao Chen, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380, 2005.

[115] Kevin Judd McKernan, Heather E Peckham, Gina L Costa, Stephen F McLaughlin, Yutao Fu, Eric F Tsung, Christopher R Clouser, Cisyla Duncan, Jeffrey K Ichikawa, Clarence C Lee, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome research, 19(9):1527–1541, 2009.

[116] Giles Miclotte, Mahdi Heydari, Piet Demeester, Stephane Rombauts, Yves Van de Peer, Pieter Audenaert, and Jan Fostier. Jabba: hybrid error correction for long sequencing reads. Algorithms for Molecular Biology, 11(1):10, 2016.

[117] Karen H Miga, Sergey Koren, Arang Rhie, Mitchell R Vollger, Ariel Gershman, Andrey Bzikadze, Shelise Brooks, Edmund Howe, David Porubsky, Glennis A Logsdon, et al. Telomere-to-telomere assembly of a complete human X chromosome. BioRxiv, page 735928, 2019.

[118] Alla Mikheenko, Andrey Prjibelski, Vladislav Saveliev, Dmitry Antipov, and Alexey Gurevich. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics, 34(13):i142–i150, 2018.

[119] Jason R Miller, Arthur L Delcher, Sergey Koren, Eli Venter, Brian P Walenz, Anushka Brownley, Justin Johnson, Kelvin Li, Clark Mobarry, and Granger Sutton. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24(24):2818–2824, 2008.

[120] Michael Molnar, Ehsan Haghshenas, and Lucian Ilie. SAGE2: parallel human genome assembly. Bioinformatics, 34(4):678–680, 2018.

[121] Pierre Morisse, Thierry Lecroq, and Arnaud Lefebvre. Hybrid correction of highly noisy long reads using a variable-order de Bruijn graph. Bioinformatics, 34(24):4213–4222, 2018.

[122] Eugene W Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(1):251–266, 1986.

[123] Gene Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM (JACM), 46(3):395–415, 1999.

[124] Gene Myers. Efficient local alignment discovery amongst noisy long reads. In International Workshop on Algorithms in Bioinformatics, pages 52–67. Springer, 2014.

[125] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish Raha, Mark Gerstein, and Michael Snyder. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 320(5881):1344–1349, 2008.

[126] Maria Nattestad, Sara Goodwin, Karen Ng, Timour Baslan, Fritz J Sedlazeck, Philipp Rescheneder, Tyler Garvin, Han Fang, James Gurtowski, Elizabeth Hutton, et al. Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome research, 28(8):1126–1135, 2018.

[127] Enno Ohlebusch and Mohamed I Abouelhoda. Chaining algorithms and applications in comparative genomics. Handbook of Computational Molecular Biology, 2006.

[128] Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. PBSIM: PacBio reads simulator—toward accurate genome assembly. Bioinformatics, 29(1):119–121, 2013.

[129] Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Detecting superbubbles in assembly graphs. In International Workshop on Algorithms in Bioinformatics, pages 338–348. Springer, 2013.

[130] Brian J O’Roak, Pelagia Deriziotis, Choli Lee, Laura Vives, Jerrod J Schwartz, Santhosh Girirajan, Emre Karakoc, Alexandra P MacKenzie, Sarah B Ng, Carl Baker, et al. Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nature genetics, 43(6):585, 2011.

[131] Christian Otto, Steve Hoffmann, Jan Gorodkin, and Peter F Stadler. Fast local fragment chaining using sum-of-pair gap costs. Algorithms for Molecular Biology, 6(1):4, 2011.

[132] Karen Patterson. 1000 genomes: a world of variation, 2011.

[133] Alexander Payne, Nadine Holmes, Vardhman Rakyan, and Matthew Loose. BulkVis: a graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics, 35(13):2193–2198, 2019.

[134] Matthew Pendleton, Robert Sebra, Andy Wing Chun Pang, Ajay Ummat, Oscar Franzen, Tobias Rausch, Adrian M Stütz, William Stedman, Thomas Anantharaman, Alex Hastie, et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nature methods, 12(8):780, 2015.

[135] Joshua Quick, Nicholas J Loman, Sophie Duraffour, Jared T Simpson, Ettore Severi, Lauren Cowley, Joseph Akoi Bore, Raymond Koundouno, Gytis Dudas, Amy Mikhail, et al. Real-time, portable genome sequencing for Ebola surveillance. Nature, 530(7589):228–232, 2016.

[136] Tobias Rausch, Thomas Zichner, Andreas Schlattl, Adrian M Stütz, Vladimir Benes, and Jan O Korbel. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28(18):i333–i339, 2012.

[137] Jason A Reuter, Damek V Spacek, and Michael P Snyder. High-throughput sequencing technologies. Molecular cell, 58(4):586–597, 2015.

[138] Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363–3369, 2004.

[139] Richard J Roberts, Mauricio O Carneiro, and Michael C Schatz. The advantages of SMRT sequencing. Genome biology, 14(7):405, 2013.

[140] Jue Ruan and Heng Li. Fast and accurate long-read assembly with wtdbg2. Nature Methods, pages 1–4, 2019.

[141] Leena Salmela and Eric Rivals. LoRDEC: accurate and efficient long read error correction. Bioinformatics, 30(24):3506–3514, 2014.

[142] Leena Salmela, Riku Walve, Eric Rivals, and Esko Ukkonen. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics, 33(6):799–806, 2016.

[143] Frederick Sanger, Steven Nicklen, and Alan R Coulson. DNA sequencing with chain-terminating inhibitors. Proceedings of the national academy of sciences, 74(12):5463–5467, 1977.

[144] Michael C Schatz, Arthur L Delcher, and Steven L Salzberg. Assembly of large genomes using second-generation sequencing. Genome research, 20(9):1165–1173, 2010.

[145] Maximilian H-W Schmidt, Alexander Vogel, Alisandra K Denton, Benjamin Istace, Alexandra Wormit, Henri van de Geest, Marie E Bolger, Saleh Alseekh, Janina Maß, Christian Pfaff, et al. De novo assembly of a new Solanum pennellii accession using nanopore sequencing. The Plant Cell, 29(10):2336–2348, 2017.

[146] Fritz J Sedlazeck, Philipp Rescheneder, Moritz Smolka, Han Fang, Maria Nattestad, Arndt von Haeseler, and Michael C Schatz. Accurate detection of complex structural variations using single-molecule sequencing. Nature methods, 15(6):461–468, 2018.

[147] Kishwar Shafin, Trevor Pesout, Ryan Lorig-Roach, Marina Haukness, Hugh E Olsen, Colleen Bosworth, Joel Armstrong, Kristof Tigyi, Nicholas Maurer, Sergey Koren, et al. Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit. BioRxiv, page 715722, 2019.

[148] Felipe A Simão, Robert M Waterhouse, Panagiotis Ioannidis, Evgenia V Kriventseva, and Evgeny M Zdobnov. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31(19):3210–3212, 2015.

[149] Jared T Simpson and Richard Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome research, 22(3):549–556, 2012.

[150] Jared T Simpson, Kim Wong, Shaun D Jackman, Jacqueline E Schein, Steven JM Jones, and Inanç Birol. ABySS: a parallel assembler for short read sequence data. Genome research, 19(6):1117–1123, 2009.

[151] Jared T Simpson, Rachael E Workman, PC Zuzarte, Matei David, LJ Dursi, and Winston Timp. Detecting DNA cytosine methylation using nanopore sequencing. Nature methods, 14(4):407, 2017.

[152] Enrico Siragusa, David Weese, and Knut Reinert. Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic acids research, 41(7):e78–e78, 2013.

[153] Martin Šošić and Mile Šikić. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics, 33(9):1394–1395, 2017.

[154] Ivan Sović, Mile Šikić, Andreas Wilm, Shannon Nicole Fenlon, Swaine Chen, and Niranjan Nagarajan. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nature communications, 7, 2016.

[155] Paweł Stankiewicz and James R Lupski. Structural variation in the human genome and its role in disease. Annual review of medicine, 61:437–455, 2010.

[156] John F Thompson and Patrice M Milos. The properties and applications of single-molecule DNA sequencing. Genome biology, 12(2):217, 2011.

[157] Kevin J Travers, Chen-Shan Chin, David R Rank, John S Eid, and Stephen W Turner. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic acids research, 38(15):e159–e159, 2010.

[158] Erwin L van Dijk, Yan Jaszczyszyn, Delphine Naquin, and Claude Thermes. The third revolution in sequencing technology. Trends in Genetics, 34(9):666–681, 2018.

[159] Robert Vaser, Ivan Sović, Niranjan Nagarajan, and Mile Šikić. Fast and accurate de novo genome assembly from long uncorrected reads. Genome research, 27(5):737–746, 2017.

[160] J Craig Venter, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural, Granger G Sutton, Hamilton O Smith, Mark Yandell, Cheryl A Evans, Robert A Holt, et al. The sequence of the human genome. Science, 291(5507):1304–1351, 2001.

[161] Mitchell R Vollger, Glennis A Logsdon, Peter A Audano, Arvis Sulovari, David Porubsky, Paul Peluso, Aaron M Wenger, Gregory T Concepcion, Zev N Kronenberg, Katherine M Munson, et al. Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads. Annals of human genetics, 2019.

[162] Ayelet Voskoboynik, Norma F Neff, Debashis Sahoo, Aaron M Newman, Dmitry Pushkarev, Winston Koh, Benedetto Passarelli, H Christina Fan, Gary L Mantalas, Karla J Palmeri, et al. The genome sequence of the colonial chordate, Botryllus schlosseri. Elife, 2:e00569, 2013.

[163] Bruce J Walker, Thomas Abeel, Terrance Shea, Margaret Priest, Amr Abouelliel, Sharadha Sakthikumar, Christina A Cuomo, Qiandong Zeng, Jennifer Wortman, Sarah K Young, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS one, 9(11):e112963, 2014.

[164] Jeremy R Wang, James Holt, Leonard McMillan, and Corbin D Jones. FMLRC: Hybrid long read error correction using an FM-index. BMC bioinformatics, 19(1):50, 2018.

[165] Zhanyong Wang, Farhad Hormozdiari, Wen-Yun Yang, Eran Halperin, and Eleazar Eskin. CNVeM: copy number variation detection using uncertainty of read mapping. Journal of Computational Biology, 20(3):224–236, 2013.

[166] David Weese, Manuel Holtgrewe, and Knut Reinert. RazerS 3: faster, fully sensitive read mapping. Bioinformatics, 28(20):2592–2599, 2012.

[167] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, Joshua M Stuart, Cancer Genome Atlas Research Network, et al. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113, 2013.

[168] Joachim Weischenfeldt, Orsolya Symmons, François Spitz, and Jan O Korbel. Phenotypic impact of genomic structural variation: insights from and for human disease. Nature Reviews Genetics, 14(2):125–138, 2013.

[169] Ryan R Wick, Louise M Judd, Claire L Gorrie, and Kathryn E Holt. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS computational biology, 13(6):e1005595, 2017.

[170] Ryan R Wick, Louise M Judd, and Kathryn E Holt. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome biology, 20(1):129, 2019.

[171] Ryan R Wick, Mark B Schultz, Justin Zobel, and Kathryn E Holt. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31(20):3350–3352, 2015.

[172] Hongyi Xin, Donghyuk Lee, Farhad Hormozdiari, Samihan Yedkar, Onur Mutlu, and Can Alkan. Accelerating read mapping with FastHASH. BMC genomics, 14(Suppl 1):S13, 2013.

[173] Yaping Yang, Donna M Muzny, Jeffrey G Reid, Matthew N Bainbridge, Alecia Willis, Patricia A Ward, Alicia Braxton, Joke Beuten, Fan Xia, Zhiyv Niu, et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. New England Journal of Medicine, 369(16):1502–1511, 2013.

[174] Chengxi Ye, Christopher M Hill, Shigang Wu, Jue Ruan, and Zhanshan Sam Ma. DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Scientific reports, 6:31900, 2016.

[175] Kai Ye, Marcel H Schulz, Quan Long, Rolf Apweiler, and Zemin Ning. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25(21):2865–2871, 2009.

[176] Daniel R Zerbino and Ewan Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research, 18(5):821–829, 2008.

[177] Haowen Zhang, Chirag Jain, and Srinivas Aluru. A comprehensive evaluation of long read error correction methods. BioRxiv, page 519330, 2019.

[178] Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller. A greedy algorithm for aligning DNA sequences. Journal of Computational biology, 7(1-2):203–214, 2000.

[179] Grace XY Zheng, Billy T Lau, Michael Schnall-Levin, Mirna Jarosz, John M Bell, Christopher M Hindson, Sofia Kyriazopoulou-Panagiotopoulou, Donald A Masquelier, Landon Merrill, Jessica M Terry, et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nature biotechnology, 34(3):303–311, 2016.

[180] Aleksey V Zimin, Guillaume Marçais, Daniela Puiu, Michael Roberts, Steven L Salzberg, and James A Yorke. The MaSuRCA genome assembler. Bioinformatics, 29(21):2669–2677, 2013.

[181] Aleksey V Zimin, Daniela Puiu, Ming-Cheng Luo, Tingting Zhu, Sergey Koren, Guillaume Marçais, James A Yorke, Jan Dvořák, and Steven L Salzberg. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Research, 27(5):787–792, 2017.

Appendix A

lordFAST Material

A.1 Data

A.1.1 Real data

Long reads for a human genome (CHM1), sequenced on a PacBio RS II instrument using P5-C3 chemistry, are available at Pacific Biosciences' DevNet repository: https://github.com/PacificBiosciences/DevNet/wiki/H_sapiens_54x_release

We used long reads stored in a single FASTA file (corresponding to one of the three files generated for an SMRT cell) for the real study, which can be downloaded from: http://datasets.pacb.com/2013/Human10x/READS/2530572/0001/Analysis_Results/ m130929_024849_42213_c100518541910000001823079209281311_s1_p0.1.subreads.fasta

After obtaining the FASTA file, we filtered out reads shorter than 1,000 bp in order to focus on the task of mapping longer reads, which is the primary use case for long read mappers. This is motivated by the fact that more than 99% of the data is contained in reads longer than 1,000 bp (see Figure 3.3).
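This length filtering is simple to script. The following is a minimal illustrative sketch (not the code actually used in the thesis); `filter_fasta` and the toy records are hypothetical names:

```python
# Illustrative sketch: drop FASTA records shorter than a minimum length.
# Handles multi-line (wrapped) sequences.
def filter_fasta(lines, min_len=1000):
    """Yield (header, sequence) pairs whose sequence is >= min_len bases."""
    header, seq = None, []
    for line in list(lines) + [">"]:          # sentinel flushes the last record
        line = line.strip()
        if line.startswith(">"):
            if header is not None and len("".join(seq)) >= min_len:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)

records = [">r1", "ACGT" * 300, ">r2", "ACGT" * 10]   # 1200 bp and 40 bp reads
kept = list(filter_fasta(records, min_len=1000))
print(kept[0][0], len(kept))   # → >r1 1
```

In practice the same one-pass scan works on a file handle, since `filter_fasta` only needs an iterable of lines.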

A.1.2 Synthetic data

We used PBSIM to generate synthetic data for the simulation study. PBSIM is a PacBio read simulator that can simulate long reads based on a set of real reads, which it uses to infer the read length and error distributions. In addition to the generated long reads, PBSIM reports the true alignment between each generated read and the reference genome in MAF format. This enables us to evaluate tools based on their base-pair sensitivity and precision. Here we give the detailed instructions and commands for generating the synthetic data. First, the real FASTQ files used for generating simulated reads are obtained as follows:

mkdir simulated
cd simulated
wget http://datasets.pacb.com/2013/Human10x/READS/2530572/0001/Analysis_Results/m130929_024849_42213_c100518541910000001823079209281311_s1_p0.1.subreads.fastq
wget http://datasets.pacb.com/2013/Human10x/READS/2530572/0001/Analysis_Results/m130929_024849_42213_c100518541910000001823079209281311_s1_p0.2.subreads.fastq
wget http://datasets.pacb.com/2013/Human10x/READS/2530572/0001/Analysis_Results/m130929_024849_42213_c100518541910000001823079209281311_s1_p0.3.subreads.fastq
cat m130929_024849_42213_c100518541910000001823079209281311_s1_p0.1.subreads.fastq m130929_024849_42213_c100518541910000001823079209281311_s1_p0.2.subreads.fastq m130929_024849_42213_c100518541910000001823079209281311_s1_p0.3.subreads.fastq > real.fastq

Then we run PBSIM to generate the simulated reads:

pbsim --data-type CLR --depth 1 --length-min 1 --length-max 100000 --seed 0 --sample-fastq real.fastq hg38.fa

Then, 25,000 reads with a minimum length of 1,000 bp are sampled from the simulated reads as the synthetic dataset. The code for sampling reads from the simulated reads is available at https://github.com/vpc-ccg/lordfast-extra.
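Since PBSIM's MAF output gives the true genomic interval of every simulated read, base-pair sensitivity and precision reduce to computing the overlap between the true and reported alignment intervals. Below is a minimal sketch of this computation, assuming half-open intervals on the same reference sequence; the actual evaluation scripts live in the lordfast-extra repository, and the function names here are illustrative:

```python
def overlap(a, b):
    """Number of reference bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def base_pair_stats(true_iv, mapped_iv):
    """Sensitivity: fraction of true bases recovered by the mapper.
       Precision: fraction of reported bases that are actually correct."""
    shared = overlap(true_iv, mapped_iv)
    sensitivity = shared / (true_iv[1] - true_iv[0])
    span = mapped_iv[1] - mapped_iv[0]
    precision = shared / span if span > 0 else 0.0
    return sensitivity, precision

# A read truly spanning [1000, 2000) reported as mapping to [1500, 2500):
sens, prec = base_pair_stats((1000, 2000), (1500, 2500))
print(sens, prec)   # → 0.5 0.5
```

Aggregating these per-read numbers over all simulated reads yields the dataset-level sensitivity and precision reported in the comparisons.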

A.2 Software

Table A.1: Version, reference, and repository of utilized software.

Tool        Version        Reference   Repository
BLASR       5.3.4323a52    [18]        github.com/PacificBiosciences/blasr
BWA-MEM     0.7.15-r1140   [96]        github.com/lh3/bwa
GraphMap    0.5.1          [154]       github.com/isovic/graphmap
LAMSA       1.0.0          [106]       github.com/hitbc/LAMSA
rHAT        0.1.1          [107]       github.com/dfguan/rHAT
NGMLR       0.2.6          [146]       github.com/philres/ngmlr
Minimap2    2.10-r761      [98]        github.com/lh3/minimap2
minialign   0.5.3          -           github.com/ocxtal/minialign

A.3 Command details

BLASR

Indexing: sawriter hg38.fa.sa hg38.fa

Mapping: blasr reads.fasta hg38.fa --sa hg38.fa.sa -m 5 --out map_blasr.m5 --nproc 1 --noSplitSubreads

BWA-MEM

Indexing: bwa index hg38.fa

Mapping: bwa mem -x pacbio -Y -t 1 hg38.fa reads.fasta > map_bwa.sam

GraphMap

Indexing: graphmap align -I -r hg38.fa

Mapping: graphmap align -r hg38.fa -d reads.fasta -o map_graphmap.sam -t 1

LAMSA

Indexing: lamsa index hg38.fa

Mapping: lamsa aln -t 1 -S -T pacbio -i 25 -l 50 hg38.fa reads.fasta > map_lamsa.sam

rHAT

Indexing: rHAT-indexer . hg38.fa

Mapping: rHAT-aligner . reads.fasta hg38.fa -t 1 > map_rhat.sam

NGMLR

Indexing: When invoked for the first time, NGMLR generates the index and writes it to disk; subsequent runs reuse the saved index. We therefore ran it once to generate the index without including this run in the comparisons.

Mapping: ngmlr -r hg38.fa -q reads.fasta -t 1 -o map_ngmlr.sam

Minimap2

Indexing: minimap2 -d hg38.fa.mmi hg38.fa

Mapping: minimap2 -a -Y -x map-pb -t 1 hg38.fa.mmi reads.fasta > map_minimap2.sam

minialign

Indexing: minialign -d hg38.fa.mai hg38.fa

Mapping: minialign -x pacbio -t 1 -l hg38.fa.mai reads.fasta > map_minialign.sam

lordFAST

Indexing: lordfast --index hg38.fa

Mapping: lordfast --search hg38.fa --seq reads.fasta --thread 1 > map_lordfast.sam

Appendix B

CoLoRMap Material

B.1 Data

Table B.1: Availability and statistics of real datasets

                        Bacteria               Yeast                  Fruit fly
Reference organism
  Name                  E. coli                S. cerevisiae          D. melanogaster
  Strain                K-12 substr. MG1655    S288C                  iso-1
  Reference sequence    NC_000913              NC_0011{33-48},        NT_0337{77-79},
                                               NC_001224              NC_0043{53-54},
                                                                      NC_0245{11-12},
                                                                      NT_037436
  Genome size           4.6 Mbp                12.2 Mbp               140 Mbp
Pacbio data
  Accession ID          DevNet (1)             DevNet (2)             Bergman Lab (3)
  Number of reads       33,360                 231,604                901,564
  Avg read length       2,938                  6,055                  1,505
  Max read length       14,494                 30,164                 13,885
  Number of bases       98 Mbp                 1,402 Mbp              1,358 Mbp
  Coverage              21x                    114x                   9.7x
Illumina data
  Accession ID          ERR022075 (4)          SRR567755              ERX645969 (4)
  Number of reads       2,316,614              4,503,422              70,000,000
  Read length           100 & 102              101                    101
  Coverage              50x                    37x                    50x
  Insert size           504 ± 27               190 ± 80               240 ± 54

(1) Obtained from https://github.com/PacificBiosciences/DevNet/wiki/EcoliK12MG1655HybridAssembly. Reads shorter than 100bp were filtered out.
(2) https://github.com/PacificBiosciences/DevNet/wiki/Saccharomyces-cerevisiae-W303-Assembly-Contigs.
(3) bergmanlab.ls.manchester.ac.uk/data/genomes/2057_PacBio.tgz.
(4) Only a subset of the data was used; the read file was truncated to 50x coverage.

Appendix C

HASLR Material

C.1 Data

C.1.1 Simulated data

PBSIM has an option to infer the mean and standard deviation of read length, as well as the error rate, from a real dataset. So we first prepare this real dataset; we use the first 10 runs of the CHM1 (P6C4) dataset:

$ for acc in SRR2183739 SRR2183740 SRR2183741 SRR2183742 SRR2183743 SRR2183744 SRR2183745 SRR2183746 SRR2183747 SRR2183748; do wget http://sra-download.ncbi.nlm.nih.gov/srapub_files/${acc}_${acc}_hdf5.tgz; done

$ for acc in SRR2183739 SRR2183740 SRR2183741 SRR2183742 SRR2183743 SRR2183744 SRR2183745 SRR2183746 SRR2183747 SRR2183748; do tar -zxvf ${acc}_${acc}_hdf5.tgz; done

$ for bax in m15051*.bax.h5; do bash5tools.py ${bax} --outFilePrefix ${bax} --outType fastq --readType subreads --minLength 50 --minReadScore 0.75; done

$ for seq in m15051*.fastq; do cat ${seq}; done > chm1_p6c4_first_10.fastq

For simulation of the long reads:

$ pbsim --seed 0 --data-type CLR --depth 50 --length-min 1 --length-max 500000 --sample-fastq chm1_p6c4_first_10.fastq --prefix long <reference_fasta>

For simulation of the short reads:

$ art_illumina --paired --in <reference_fasta> --len 150 --mflen 500 --sdev 50 --fcov 50 --rndSeed 0 --noALN --out short

C.1.2 Real data

Table C.1: Availability of real long read datasets

Dataset     Technology   Accession
E. coli     ONT R9.4     http://lab.loman.net/2017/03/09/ultrareads-for-nanopore
            Illumina     ftp://webdata:[email protected]/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF_R1.fastq.gz
                         ftp://webdata:[email protected]/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF_R2.fastq.gz
Yeast       PacBio       ERX1725434, ERX1725435, ERX1725441
            Illumina     ERX1943903
C. elegans  PacBio       https://github.com/PacificBiosciences/DevNet/wiki/C.-elegans-data-set
            Illumina     SRR065390
Human       PacBio       https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP044331
            Illumina     SRX652547

C.2 Software

Table C.2: Version, reference, and repository of utilized software.

Tool          Version   Reference   Repository
Minia         3.2.1     [23]        github.com/GATB/minia
minimap2      2.17      [98]        github.com/lh3/minimap2
SPOA          1.1.3     [159]       github.com/rvaser/spoa
GNU Time      1.9       –           ftp.gnu.org/gnu/time/
ART           2.5.8     [66]        niehs.nih.gov/research/resources/software/biostatistics/art/
PBSIM         7fdcefd   [128]       github.com/yukiteruono/pbsim
Canu          1.8       [85]        github.com/marbl/canu
wtdbg2        2.5       [140]       github.com/ruanjue/wtdbg2
hybridSPAdes  3.13.1    [4]         github.com/ablab/spades
Unicycler     0.4.8     [169]       github.com/rrwick/unicycler
DBG2OLC       0246e46   [174]       github.com/yechengxi/dbg2olc
Masurca       3.3.1     [181]       github.com/alekseyzimin/masurca
Wengan        v0.1      [35]        github.com/adigenova/wengan
QUAST         5.0.2     [118]       github.com/ablab/quast
BUSCO         4.0.1     [148]       busco.ezlab.org

C.3 Command details

• Running HASLR
$ python3 haslr.py --threads <num_threads> --type <long_read_type> --cov-lr 25 --minia-kmer 55 --minia-solid 3 --aln-block 500 --out <output_directory> --genome <genome_size> --long <long_reads> --short <short_read_1> <short_read_2>

• Running Canu
$ canu -p <assembly_prefix> -d <output_directory> genomeSize=<genome_size> -pacbio-raw <long_reads> useGrid=false

• Running wtdbg2
$ perl wtdbg2.pl -t <num_threads> -x <preset> -g <genome_size> -o <assembly_prefix> <long_reads>

• Running hybridSPAdes
$ spades.py -t <num_threads> -m <memory_limit> -1 <short_read_1> -2 <short_read_2> --pacbio <long_reads> -o <output_directory>

• Running Unicycler
$ unicycler -t <num_threads> --no_rotate --no_miniasm --no_pilon -o <assembly_prefix> -1 <short_read_1> -2 <short_read_2> -l <long_reads>

• Running DBG2OLC (based on suggestions on the GitHub repository)
$ fastutils interleave -q -1 <short_read_1> -2 <short_read_2> | fastutils subsample -q -d 50 -g <genome_size> > short.50x.fastq

$ fastutils subsample -l -d 30 -g <genome_size> -i <long_reads> > long.30x.fasta

$ SparseAssembler LD 0 k 51 g 15 NodeCovTh 1 EdgeCovTh 0 GS <genome_size> f short.50x.fastq

$ DBG2OLC k 17 AdaptiveTh 0.01 KmerCovTh 2 MinOverlap 20 RemoveChimera 1 Contigs Contigs.txt f long.30x.fasta

$ cat Contigs.txt long.30x.fasta > ctg_pb.fasta

$ ulimit -n 4000

$ split_and_run_sparc.sh backbone_raw.fasta DBG2OLC_Consensus_info.txt ctg_pb.fasta ./consensus_dir
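The fastutils subsample steps above downsample a read set to a target depth. Conceptually, this amounts to keeping randomly chosen reads until their total length reaches depth × genome size. A minimal illustrative sketch of that idea (fastutils itself is the tool actually used; `subsample_to_depth` is a hypothetical name):

```python
import random

def subsample_to_depth(reads, genome_size, depth, seed=0):
    """Randomly keep reads until the total bases reach depth * genome_size.
       Conceptual stand-in for `fastutils subsample -d <depth> -g <genome_size>`."""
    target = depth * genome_size
    order = list(range(len(reads)))
    random.Random(seed).shuffle(order)    # fixed seed for reproducibility
    kept, total = [], 0
    for i in order:
        if total >= target:
            break
        kept.append(reads[i])
        total += len(reads[i])
    return kept

reads = ["A" * 1000] * 100                # 100 kb of reads in total
sub = subsample_to_depth(reads, genome_size=1000, depth=30)
print(sum(len(r) for r in sub))           # → 30000, i.e. 30x of a 1 kb "genome"
```

Downsampling the long reads to 30x before DBG2OLC keeps its overlap stage tractable while preserving enough coverage for backbone construction.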

• Running Masurca

Content of config.txt:

DATA
PE= pe <insert_mean> <insert_stdev> <short_read_1> <short_read_2>
PACBIO=<long_reads>
# NANOPORE=
END

PARAMETERS
GRAPH_KMER_SIZE = auto
LHE_COVERAGE=25
CA_PARAMETERS = cgwErrorRate=0.15
KMER_COUNT_THRESHOLD = 1
CLOSE_GAPS=0
NUM_THREADS = <num_threads>
JF_SIZE = 200000000
END

Command:

bash assemble.sh

• Running Wengan
$ perl wengan.pl -t <num_threads> -a M -p <assembly_prefix> -x <preset> -g <genome_size> -s <short_read_1>,<short_read_2> -l <long_reads>

C.4 Visual examples of regions assembled only by HASLR without any misassembly or fragmentation

We manually inspected the assemblies of the simulated datasets (C. elegans and human) using QUAST's contig browser, Icarus. Here, we include a few examples of regions that are properly assembled by HASLR, while the assemblies produced by other tools are either fragmented or contain misassemblies.

Figure C.1: An example showing a region of chromosome 4 of C. elegans.

Figure C.2: An example showing a region of chromosome X of C. elegans.

Figure C.3: An example showing a region of chromosome X of hg38.

Figure C.4: An example showing a region of chromosome 18 of hg38.

Figure C.5: An example showing a region of chromosome 16 of hg38.

Figure C.6: An example showing a region of chromosome 15 of hg38.

Figure C.7: An example showing a region of chromosome 14 of hg38.

Figure C.8: An example showing a region of chromosome 13 of hg38.

Figure C.9: An example showing a region of chromosome 11 of hg38.

Figure C.10: An example showing a region of chromosome 9 of hg38.