Downloads/Bacteria/Nctc

UC San Diego UC San Diego Electronic Theses and Dissertations Title Genome Assembly of Long Error-Prone Reads Using De Bruijn Graphs and Repeat Graphs Permalink https://escholarship.org/uc/item/62v8d30p Author Yuan, Jeffrey Publication Date 2019 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California UNIVERSITY OF CALIFORNIA SAN DIEGO Genome Assembly of Long Error-Prone Reads Using De Bruijn Graphs and Repeat Graphs A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Bioinformatics and Systems Biology by Jeffrey Yuan Committee in charge: Professor Pavel Pevzner, Chair Professor Bing Ren, Co-Chair Professor Vineet Bafna Professor Theresa Gaasterland Professor Siavash Mirarab 2019 Copyright Jeffrey Yuan, 2019 All rights reserved. The Dissertation of Jeffrey Yuan is approved, and it is acceptable in quality and form for publication on microfilm and electronically: Co-Chair Chair University of California San Diego 2019 iii EPIGRAPH In theory, there is no difference between theory and practice. In practice, there is. Benjamin Brewster iv TABLE OF CONTENTS SIGNATURE PAGE .................................................................................................................... iii EPIGRAPH ................................................................................................................................... iv TABLE OF CONTENTS ............................................................................................................... v LIST OF FIGURES ..................................................................................................................... vii LIST OF TABLES ........................................................................................................................ ix ACKNOWLEDGEMENTS ............................................................................................................ x VITA ............................................................................................................................................ xii ABSTRACT OF THE DISSERTATION ................................................................................... xiii INTRODUCTION .......................................................................................................................... 1 References ................................................................................................................................. 6 CHAPTER 1: Assembly of Long Error-Prone Reads Using De Bruijn Graphs ............................. 9 1.1 Abstract ............................................................................................................................... 9 1.2 Significance Statement ........................................................................................................ 9 1.3 Introduction ...................................................................................................................... 10 1.4 Methods............................................................................................................................ 12 1.4.1 The Key Idea of the ABruijn Algorithm ................................................................. 12 1.4.2 Assembling Long Error-Prone Reads ..................................................................... 17 1.4.3 Correcting Errors in the Draft Genome .................................................................. 36 1.5 Results .............................................................................................................................. 51 1.6 Discussion ........................................................................................................................ 62 1.7 Additional Information .................................................................................................... 63 1.8 Acknowledgements .......................................................................................................... 64 v 1.9 References ........................................................................................................................ 65 CHAPTER 2: Assembly of Long Error-Prone Reads Using Repeat Graphs .............................. 70 2.1 Abstract ............................................................................................................................ 70 2.2 Introduction ...................................................................................................................... 71 2.3 Results .............................................................................................................................. 73 2.4 Discussion ...................................................................................................................... 102 2.5 Methods.......................................................................................................................... 104 2.6 Additional Information .................................................................................................. 144 2.7 Acknowledgements ........................................................................................................ 146 2.8 References ...................................................................................................................... 147 CHAPTER 3: DiploidFlye: Haplotype Phasing of Long Read Assemblies Using Repeat Graphs... .................................................................................................................................................... 151 3.1 Abstract .......................................................................................................................... 151 3.2 Introduction .................................................................................................................... 152 3.3 Methods.......................................................................................................................... 154 3.4 Results ............................................................................................................................ 159 3.5 Discussion ...................................................................................................................... 168 3.6 Acknowledgements ........................................................................................................ 170 3.7 References ...................................................................................................................... 171 CONCLUSION .......................................................................................................................... 173 References ............................................................................................................................ 176 vi LIST OF FIGURES Figure 1.1: Constructing the de Bruijn graph (Left) and the A-Bruijn graph (Right) for a circular �� = ��. ................................................................................................... 14 Figure 1.2: A histogram of the number of 15-mers vs frequency for the ECOLI dataset. ......... 18 Figure 1.3: The pseudocode for hybridSPAdes. ......................................................................... 20 Figure 1.4: The starting three lines of the pseudocode for longSPAdes. .................................... 21 Figure 1.5: A bubble in the A-Bruijn graph of (15, 7)-mers for the ECOLI dataset. ................. 22 Figure 1.6: The pseudocode for ABruijn. ................................................................................... 23 Figure 1.7: An example of a common jump-subpath from the ECOLI dataset. ......................... 25 Figure 1.8: Path support and most-consistent paths. ................................................................... 32 Figure 1.9: Support graph examples revealing the absence and presence of repeats. ................ 35 Figure 1.10: Decomposing a multiple alignment into necklaces. ............................................... 38 Figure 1.11: A histogram of necklace lengths. ........................................................................... 40 Figure 1.12: Match and insertion rate distribution for a simulated corrupted genome. .............. 41 Figure 1.13: Examples of read well-aligned to homonucleotide regions. .................................. 45 Figure 1.14: ORF-length histograms for correct and incorrect positions. .................................. 50 Figure 1.15: A comparison between ABruijn and CANU assemblies for B. neritina. ............... 58 Figure 2.1: An outline of the Flye assembler workflow. ............................................................ 73 Figure 2.2: Constructing the approximate repeat graph from local self-alignments. ................. 75 Figure 2.3: Resolving unbridged repeats. ................................................................................... 77 Figure 2.4: A comparison of Flye and HINGE assembly graphs on bacterial genomes from the BACTERIA dataset. ................................................................................................................... 80 Figure 2.5: The assembly graph of the YEAST-ONT dataset. ................................................... 88 Figure 2.6: The assembly graph of the WORM dataset. ............................................................. 90 vii Figure 2.7: Dot-plots showing the alignment of reads against the Flye assembly, the Miniasm assembly

Load more