Computational Methods for Analysis of Single Molecule Sequencing Data
Computational Methods for Analysis of Single Molecule Sequencing Data

by
Ehsan Haghshenas
M.Sc., University of Western Ontario, 2014
B.Sc., Isfahan University of Technology, 2012

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
in the School of Computing Science
Faculty of Applied Sciences

© Ehsan Haghshenas 2020
SIMON FRASER UNIVERSITY
Spring 2020

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.

Approval

Name: Ehsan Haghshenas
Degree: Doctor of Philosophy (Computing Science)
Title: Computational Methods for Analysis of Single Molecule Sequencing Data
Examining Committee:
Chair: Diana Cukierman, University Lecturer
Binay Bhattacharya, Senior Supervisor, Professor
S. Cenk Sahinalp, Co-Supervisor, Senior Investigator, Center for Cancer Research, National Cancer Institute
Cedric Chauve, Co-Supervisor, Professor
Faraz Hach, Co-Supervisor, Assistant Professor, Department of Urologic Sciences, The University of British Columbia; Senior Research Scientist, Vancouver Prostate Centre
Martin Ester, Internal Examiner, Professor
Mihai Pop, External Examiner, Professor, Department of Computer Science, University of Maryland, College Park
Date Defended: March 26, 2020

Abstract

Next-generation sequencing (NGS) technologies paved the way to a significant increase in the number of sequenced genomes, both prokaryotic and eukaryotic. This increase provided an opportunity for considerable advancement in genomics and precision medicine. Although NGS technologies have proven their power in many applications such as de novo genome assembly and variation discovery, computational analysis of the data they generate is still far from being perfect. The main limitation of NGS technologies is their short read length relative to the lengths of (common) genomic repeats.
Today, newer sequencing technologies (known as single-molecule sequencing or SMS) such as Pacific Biosciences and Oxford Nanopore are producing significantly longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. For instance, for the first time, a complete human chromosome was fully assembled using ultra-long reads generated by Oxford Nanopore. Unfortunately, long reads generated by SMS technologies are characterized by a high error rate, which prevents their direct use in many of the standard downstream analysis pipelines and poses new computational challenges. This motivates the development of new computational tools specifically designed for SMS long reads. In this thesis, we present three computational methods that are tailored for SMS long reads. First, we present lordFAST, a fast and sensitive tool for mapping noisy long reads to a reference genome. Mapping sequenced reads to their potential genomic origin is the first fundamental step for many computational biology tasks. As an example, in this thesis, we show that lordFAST can be successfully employed in structural variation discovery. Next, we present the second tool, CoLoRMap, which tackles the high level of base-level errors in SMS long reads by providing a means to correct them using a complementary set of NGS short reads. This integrative use of SMS and NGS data is known as a hybrid technique. Finally, we introduce HASLR, an ultra-fast hybrid assembler that uses reads generated by both technologies to efficiently generate accurate genome assemblies. We demonstrate that, compared to the other tested assemblers, HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on all the samples. Furthermore, in terms of contiguity and accuracy, the generated assemblies are on par with those of the other tools on most of the samples.
Keywords: Computational biology; Single-molecule sequencing; PacBio; Oxford Nanopore; Long read mapping; Hybrid error correction; Hybrid assembly

Dedication

To my family, with love

Acknowledgements

First and foremost, I would like to express my sincerest gratitude to my supervisors, Dr. Cenk Sahinalp, Dr. Cedric Chauve, Dr. Faraz Hach, and Dr. Binay Bhattacharya, for their constant support, guidance, and patience throughout my PhD studies. I was honored to have the opportunity to work with such brilliant scholars, from whom I learned critical thinking and the proper way of doing research. In addition, I would like to give my regards and appreciation to Dr. Jens Stoye, my host and supervisor during my visit at Bielefeld University. This visit greatly influenced the direction of my work on hybrid assembly. I would also like to thank Dr. Mihai Pop and Dr. Martin Ester, my external and internal examiners, for their careful review of my thesis. I appreciate their invaluable discussions, comments, and suggestions, which helped me improve the thesis. I want to give special thanks to Dr. Diana Cukierman, who graciously agreed to chair my examining committee.

During my PhD, I was also involved in a few research projects that are not included in this thesis. I had wonderful and valuable experiences in these collaborative projects. Regarding these collaborations, I would like to thank Dr. Salem Malikic, Michael Ford, Hossein Asghari, Sean La, and Farid Rashidi Mehrabadi. I also offer enduring gratitude to all past and present lab members in the Lab for Computational Biology and Bioinformatics at Simon Fraser University, as well as the Hach Lab at the Vancouver Prostate Centre. In particular, I thank Dr. Yen-Yi Lin, Iman Sarrafi, Dr. Ibrahim Numanagic, Ermin Hodzic, Can Kockan, Dr. Raunak Shrestha, Baraa Orabi, Tunc Morova, and Fatih Karaoglanoglu, who all made the work environment a more pleasant one.
My special thanks go to Baraa Orabi and Elie Ritch for their help with proofreading the thesis. In addition, I would like to thank all members of the Genome Informatics research group at Bielefeld University in Germany, including Omar Castillo, Konstantinos Tzanakis, Eyla Willing, Georges Hattab, Tizian Schulz, Guillaume Holley, Nina Luhmann, Liren Huang, Lu Zhu, Markus Lux, Linda Sundermann, and Tina Zekic, who made my visit to Bielefeld University such a great experience.

I am grateful to many other friends I made in Vancouver including Abdollah Safari, Sina Bahrasemani, Sajjad Gholami, Mehran Khodabandeh, Hedayat Zarkoob, Mohsen Jamali, Soheil Horr, Shahram Zaheri, Amirmasoud Ghasemi, Abraham Hashemian, Hashem Jeihooni, Shahram Pourazadi, Mohammad Mahdavian, Majid Talebi, Saeed Mirazimi, Hassan Shavarani, Mahdi Nemati Mehr, Sima Jamali, Nazanin Mehrasa, Ramtin Mehdizadeh, Sina Salari, Ali Afsah, Chakaveh Ahmadizadeh, Mahsa Gharibi, Mohammad Akbari, Akbar Rafiey, Saeed Izadi, Saeid Asgari, Hossein Sharifi-Noghabi, Sepehr MohaimenianPour, Sara Daneshvar, Hooman Zabeti, Sara Jalili, Mohammad Mazraeh, Marjan Moodi, and many more. All these amazing people made Vancouver a true home.

Last but not least, I would like to thank my loving family for all their support during these years. Exceptional thanks go to my wonderful wife, Rana, who definitely made a significant contribution to this thesis with her continuous support and patience.

Table of Contents

Approval
Abstract
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Contributions
1.2 Organization of the thesis
2 Background and Related Work
2.1 Single-molecule sequencing technologies
2.1.1 Pacific Biosciences
2.1.2 Oxford Nanopore Technology
2.1.3 Synthetic long reads
2.2 Definitions and Notations
2.3 Long Read Mapping
2.4 Error correction of long noisy reads
2.4.1 Hybrid correction
2.4.2 Self-correction
2.5 de novo genome assembly
2.5.1 Hybrid assembly
2.5.2 Non-hybrid assembly
2.5.3 wtdbg2
3 Long read mapping
3.1 Methods
3.1.1 Overview
3.1.2 Stage One: Reference Genome Indexing
3.1.3 Stage Two: Read Mapping
3.2 Results
3.2.1 Experiment on a simulated dataset without structural variations
3.2.2 Simulation in presence of structural variations
3.2.3 Experiment on a real dataset
3.3 Summary
4 Hybrid error correction of long reads
4.1 Methods
4.1.1 Overview
4.1.2 Initial correction of long reads: the SP algorithm
4.1.3 Correcting gaps using One-End Anchors
4.2 Results
4.2.1 Data and computational setting
4.2.2 Measures of evaluation
4.2.3 Comparison based on alignment
4.2.4 Comparison based on assembly
4.3 Comparison with more recent hybrid correction tools
4.4 Summary
5 Hybrid assembly of long reads
5.1 Methods
5.1.1 Obtaining unique short read contigs
5.1.2 Construction of backbone graph
5.1.3 Graph cleaning and simplification
5.1.4 Generating the assembly
5.1.5 Methodological remarks
5.2 Results
5.2.1 Experiment on simulated dataset
5.2.2 Experiment on real dataset
5.3 Summary
6 Conclusion
6.1 Future directions
6.2 Recommended guidelines
Bibliography
Appendix A lordFAST Material
A.1 Data
A.1.1 Real data
A.1.2 Synthetic data
A.2 Software
A.3 Command details
Appendix B CoLoRMap Material
B.1 Data
Appendix C HASLR Material
C.1 Data
C.1.1 Simulated data
C.1.2 Real data
C.2 Software
C.3 Command details
C.4 Visual examples of regions assembled only by HASLR without any misassembly or fragmentation

List of Tables

Table 3.1 Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset.
This dataset contains 25,000 reads and 183.61 million bases. Best results are marked with bold typeface.
Table 3.2 Runtime and memory usage of same table.
Table 3.3 Comparison between different tools capable of mapping PacBio long reads on the simulated human dataset.