
UNIVERSITY OF CALIFORNIA, SAN DIEGO

Genome Assembly and Comparison

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science

by

Kim Son Pham

Committee in charge:
Professor Pavel Pevzner, Chair
Professor Vineet Bafna
Professor Ronald Graham
Professor Ramamohan Paturi
Professor Glenn Tesler

2013

Copyright
Kim Son Pham, 2013
All rights reserved.

The Dissertation of Kim Son Pham is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California, San Diego
2013

DEDICATION

To my family

EPIGRAPH

I never regretted it, although for some time I had to supplement my income at NII-GENETIKA by gathering empty bottles at Moscow railway stations, one of the very few legal ways to make extra money in pre-perestroika Moscow.
Pavel A. Pevzner

Happiness is here and now.
Thich Nhat Hanh

TABLE OF CONTENTS

Signature Page
Dedication
Epigraph
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation
Chapter 1 Introduction
Chapter 2 Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers
  2.1 Introduction
  2.2 From de Bruijn Graphs to Paired de Bruijn Graphs
    2.2.1 Preliminaries
    2.2.2 De Bruijn Graphs (Modelling Unpaired Reads)
    2.2.3 Graph Complexity
    2.2.4 Paired de Bruijn Graphs (Modelling Paired Reads with Exact Distance)
    2.2.5 Approximate Paired de Bruijn Graphs (Modelling Inexact Distance)
  2.3 Results
  2.4 Towards a Practical Paired de Bruijn Graph Assembler
  2.5 Conclusion
Chapter 3 From de Bruijn Graphs to Rectangle Graphs for Genome Assembly
  3.1 Introduction
  3.2 Rectangle Puzzles
  3.3 Rectangle Puzzles and Genome Assembly
  3.4 Results
  3.5 Conclusion
Chapter 4 Pathset Graphs: A New Approach for Comprehensive Utilization of Mate-Pairs in Genome Assembly
  4.1 Introduction
  4.2 Methods
    4.2.1 Assumptions
    4.2.2 De Bruijn graphs
    4.2.3 From k-mer pairs to edge-pair histograms
    4.2.4 From edge-pair intervals to pathsets
    4.2.5 Removing phantom paths from pathsets
    4.2.6 Splitting pathsets
    4.2.7 Identifying essential edges
    4.2.8 Pathset graph
    4.2.9 Adaptations for imperfect coverage and read errors
  4.3 Results
    4.3.1 From mate-pairs to edge-pair intervals
    4.3.2 From edge-pair intervals to pathsets
    4.3.3 Comparing Pathset with other genome assemblers
  4.4 Discussion
  4.5 Appendix: Compact representation of pathsets
    4.5.1 Gapped pathsets
    4.5.2 Counting paths
    4.5.3 Identifying bridges
    4.5.4 Prefix testing
    4.5.5 Finding essential edges
    4.5.6 Splitting pathsets
Chapter 5 DRIMM-Synteny: decomposing genomes into evolutionary conserved segments
  5.1 Introduction
  5.2 Methods
  5.3 Results
  5.4 Discussion
  5.5 Acknowledgements
Bibliography

LIST OF FIGURES

Figure 2.1: Mate pairs and the de Bruijn graph: (a) A mate pair is a pair of reads with a distance of d between their start positions. (b) A circular genome S and two mate pairs, with d = 4 and d = 5. (c–d) The de Bruijn graph construction for k = 2. In (c), the outside circle shows a separate black edge for each 3-mer (equivalently, each element of the 3-spectrum). The dotted red lines indicate vertices that will be glued. The inner circle shows the result of applying some of the glues. Note that this is an intermediate step of the construction in which we only show the gluings of vertices arising from the same position of S. (d) The final de Bruijn graph, resulting from all the glues.

Figure 2.2: The effect of increasing k and d: (a) The number of repeated k-mers in the E. coli genome, for various values of k. (b) The number of repeated (k,d)-mers, for various values of d with k = 50.

Figure 2.3: The (approximate) paired de Bruijn graph: (a–b) The paired de Bruijn construction for k = 2, d = 4 from the same string S as in Fig. 2.1. In (a), the outer circle has an edge from every element of the (3,4)-spectrum. (b) The paired de Bruijn graph after all the gluings; notice that it has only one branching vertex, versus four in the de Bruijn graph (Fig. 2.1(d)). (c–e) The construction of the approximate paired de Bruijn graph for k = 2, d = 5, ∆ = 1. In (c), one possible covering spectrum is shown in the outside circle, with black edges for elements with mate pair distance 6 and blue edges for distance 5. Since ∆ = 1, we glue vertices if they have equal left labels and their right labels are a distance at most 2 apart from each other in the de Bruijn graph (Fig. 2.1(d)). The final multigraph after all vertex gluings is shown in (d), and the resulting simple graph, used to spell the contigs, is shown in (e). Notice that this graph now has three branching vertices.

Figure 2.4: Example of the standard and paired de Bruijn graphs: The reads are the (5,12)-spectrum generated from the cyclic sequence ATCGGGATGACTATGTCGCTCCTAATCGGGAAGACTATGCCGCTCCTT. (a) The de Bruijn graph with edges constructed from the set of 5-mers in the (5,12)-spectrum. Each node is a rectangle labeled by a 4-mer with the node ID shown as a large red number on the left of the node. The mate pair information is also presented in the graph: for each node, the node IDs of its corresponding right 4-mers are shown as small numbers on the right of the rectangle. For instance, the right 4-mers (blue dotted lines) of CGGG (node 3) are GTCG (node 21) and GCCG (node 22) and we write 21 and 22 on the right side of node 3. Note that there is not a single mate pair with a unique path between the mates, making mate pair transformations impossible. (b) The paired de Bruijn graph from the (5,12)-spectrum is a cycle, representing a single contig. In this example, the paired approach allows for longer contigs than would mate pair transformations (though there are also examples when the opposite is true).

Figure 2.5: Contig Lengths: Cumulative contig lengths (for standard and paired de Bruijn graphs) on simulated data with perfect coverage. Contigs are sorted in order from largest to smallest. Point (x,y) means the largest x contigs have cumulative length y. (a) To analyze the effect of the insert size (IS) on the assembly, we kept the read length fixed at 50, but varied the insert size. We also generated non-paired reads of length 50. For E. coli, the curve for
